feat(scripts): add Italian CEFR data pipeline

- Add extractors for Italian sources: it_m3.xls and italian.json
- Add comparison script (compare-italian.py) to report source overlaps and conflicts
- Add merge script (merge-italian-json.py) with priority order ['italian', 'it_m3']
- Output authoritative dataset to datafiles/italian-merged.json
- Update README to document both English and Italian pipelines
lila 2026-04-08 18:32:03 +02:00
parent 59152950d6
commit 3374bd8b20
9 changed files with 208535 additions and 26 deletions


@@ -1,11 +1,16 @@
 # CEFR Data Pipeline
 
-This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final output (`english-merged.json`) is consumed by the database seeding process in `packages/db`.
+This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final outputs (`english-merged.json`, `italian-merged.json`) are consumed by the database seeding process in `packages/db`.
 
 ## Overview
 
 The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.
 
+## Supported Languages
+
+- ✅ English (`en`)
+- ✅ Italian (`it`)
+
 ## Pipeline Stages
 
 ### Stage 1: Extraction
@@ -22,12 +27,16 @@ Each source file is processed by a dedicated extractor script. The extractor rea
 - CEFR levels are validated against A1-C2
 - Each record includes the source identifier for traceability
 
-**Location:** `extraction-scripts/english/`
-**Scripts:**
-
-- `extract-cefrj-csv.py`
-- `extract-en_m3.py`
-- `extract-octanove.py`
-- `extract-random-json.py`
+**Extractor Scripts:**
+
+| Language | Source         | Script                                               |
+|----------|----------------|------------------------------------------------------|
+| English  | `cefrj.csv`    | `extraction-scripts/english/extract-cefrj-csv.py`    |
+| English  | `en_m3.xls`    | `extraction-scripts/english/extract-en_m3.py`        |
+| English  | `octanove.csv` | `extraction-scripts/english/extract-octanove.py`     |
+| English  | `random.json`  | `extraction-scripts/english/extract-random-json.py`  |
+| Italian  | `it_m3.xls`    | `extraction-scripts/italian/extract-it_m3.py`        |
+| Italian  | `italian.json` | `extraction-scripts/italian/extract-italian-json.py` |
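Every extractor emits records in the same normalized shape, which is what makes the later comparison and merge stages source-agnostic. A minimal sketch of that shape and its validation rules (`validate_record` is a hypothetical helper for illustration, not part of the pipeline):

```python
# Stage 1 record shape: each extractor emits dicts of this form.
# validate_record is a hypothetical illustration helper.
SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}

def validate_record(record: dict) -> bool:
    """Check that an extracted record matches the standard shape."""
    return (
        bool(record.get("word", "").strip())       # non-empty, normalized word
        and record.get("pos") in SUPPORTED_POS     # only noun/verb are kept
        and record.get("cefr") in CEFR_LEVELS      # validated against A1-C2
        and bool(record.get("source"))             # source id for traceability
    )

record = {"word": "casa", "pos": "noun", "cefr": "A1", "source": "it_m3"}
print(validate_record(record))  # True
```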
### Stage 2: Comparison
@@ -39,17 +48,18 @@ Before merging, sources are compared to identify agreements and conflicts. This
 - Overlap between sources (words appearing in multiple sources)
 - Agreement rate (sources assigning the same CEFR level)
 - Conflicts (same word/POS with different CEFR levels)
+- Database coverage (how many extracted words exist in the database)
 
-**Location:** `comparison-scripts/compare-english.py`
-**Usage:**
+**Comparison Scripts:**
+
+| Language | Script                                  |
+|----------|-----------------------------------------|
+| English  | `comparison-scripts/compare-english.py` |
+| Italian  | `comparison-scripts/compare-italian.py` |
+
+Run from the `scripts/` directory:
 
 ```bash
+cd scripts/
 python comparison-scripts/compare-english.py
+python comparison-scripts/compare-italian.py
 ```
+
+Conflicts are resolved in the next stage using source priority rules.
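The agreement/conflict split described above reduces to grouping CEFR assignments by `(word, pos)` key. A standalone sketch with two toy source lists (not the actual script):

```python
from collections import defaultdict

# Toy extracted data from two sources (same shape as *-extracted.json).
sources = {
    "italian": [{"word": "casa", "pos": "noun", "cefr": "A1"},
                {"word": "andare", "pos": "verb", "cefr": "A2"}],
    "it_m3":   [{"word": "casa", "pos": "noun", "cefr": "A1"},
                {"word": "andare", "pos": "verb", "cefr": "B1"}],
}

# Group CEFR assignments by (word, pos) key.
assignments = defaultdict(dict)
for src, entries in sources.items():
    for e in entries:
        assignments[(e["word"], e["pos"])][src] = e["cefr"]

# Multi-source keys agree if every source assigned the same level.
agreements = [k for k, v in assignments.items()
              if len(v) > 1 and len(set(v.values())) == 1]
conflicts = [k for k, v in assignments.items()
             if len(v) > 1 and len(set(v.values())) > 1]
print(agreements)  # [('casa', 'noun')]
print(conflicts)   # [('andare', 'verb')]
```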
### Stage 3: Merge
@@ -71,13 +81,17 @@ Difficulty is not extracted from sources. It is derived from the final CEFR leve
 The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.
 
-**Location**: merge-scripts/merge-english-json.py
-**Usage:**
+**Merge Scripts & Priorities:**
+
+| Language | Script                                | Priority (lowest → highest)            |
+|----------|---------------------------------------|----------------------------------------|
+| English  | `merge-scripts/merge-english-json.py` | `random`, `octanove`, `cefrj`, `en_m3` |
+| Italian  | `merge-scripts/merge-italian-json.py` | `italian`, `it_m3`                     |
+
+Run from the `scripts/` directory:
 
 ```bash
+cd scripts/
 python merge-scripts/merge-english-json.py
+python merge-scripts/merge-italian-json.py
 ```
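The priority rule and the CEFR → difficulty derivation can be sketched in a few lines (a toy example mirroring the merge scripts' behavior; `resolve` is an illustration helper, not pipeline code):

```python
# Priority order for Italian: later entries win conflicts.
PRIORITY_ORDER = ["italian", "it_m3"]
DIFFICULTY_MAP = {"A1": "easy", "A2": "easy", "B1": "intermediate",
                  "B2": "intermediate", "C1": "hard", "C2": "hard"}

def resolve(assignments: dict) -> tuple:
    """Pick the CEFR level assigned by the highest-priority source,
    then derive difficulty from it. Unknown sources rank lowest."""
    winner = max(assignments,
                 key=lambda s: PRIORITY_ORDER.index(s) if s in PRIORITY_ORDER else -1)
    cefr = assignments[winner]
    return cefr, DIFFICULTY_MAP.get(cefr, "unknown")

# 'andare' is A2 in italian but B1 in it_m3; it_m3 has higher priority.
print(resolve({"italian": "A2", "it_m3": "B1"}))  # ('B1', 'intermediate')
```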
### Stage 4: Enrichment
@@ -88,9 +102,11 @@ The authoritative merged file is consumed by the database package (packages/db)
 ```
 scripts/
 ├── comparison-scripts/
-│   └── compare-english.py         # Stage 2: compare extracted data
+│   ├── compare-english.py
+│   └── compare-italian.py         # Stage 2: compare extracted data
 ├── datafiles/
-│   ├── english-merged.json        # Stage 3 output (authoritative dataset)
+│   ├── english-merged.json        # Stage 3 output (authoritative)
+│   ├── italian-merged.json        # Stage 3 output (authoritative)
 │   ├── omw-noun.json
 │   └── omw-verb.json
 ├── data-sources/
@@ -105,7 +121,11 @@ scripts/
 │   │   └── random-extracted.json
 │   ├── french/                    # (future)
 │   ├── german/                    # (future)
-│   ├── italian/                   # (future)
+│   ├── italian/
+│   │   ├── it_m3.xls
+│   │   ├── it_m3-extracted.json
+│   │   ├── italian.json
+│   │   └── italian-extracted.json
 │   └── spanish/                   # (future)
 ├── extraction-scripts/
 │   └── english/
@@ -113,6 +133,9 @@ scripts/
 │       ├── extract-en_m3.py
 │       ├── extract-octanove.py
 │       └── extract-random-json.py
+│   └── italian/
+│       ├── extract-it_m3.py
+│       └── extract-italian-json.py
 ├── merge-scripts/
 │   └── merge-english-json.py      # Stage 3: merge into authority
 ├── extract-own-save-to-json.py    # script to extract words from wordnet


@ -0,0 +1,166 @@
#!/usr/bin/env python3
"""
CEFR Data Pipeline - Stage 2: Italian Comparison
Compares extracted JSON files for Italian and reports agreements and conflicts.
"""
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

# Supported CEFR levels
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}


def load_extracted_files(data_dir: Path) -> Dict[str, List[dict]]:
    """Load all *-extracted.json files from the Italian data directory."""
    sources = {}
    for file_path in data_dir.glob("*-extracted.json"):
        source_name = file_path.stem.replace("-extracted", "")
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, list):
            sources[source_name] = data
        else:
            print(f"Warning: {file_path} does not contain a list, skipping.")
    return sources


def normalize_entry(entry: dict) -> Tuple[str, str]:
    """Return (word, pos) key for comparison."""
    return entry["word"].lower().strip(), entry["pos"].lower().strip()


def compute_statistics(sources: Dict[str, List[dict]]) -> dict:
    """Compute overlap, agreement, and conflict statistics."""
    # Per-source counts by CEFR level
    source_counts = {}
    for src, entries in sources.items():
        cefr_counts = defaultdict(int)
        for e in entries:
            cefr = e.get("cefr", "UNKNOWN")
            cefr_counts[cefr] += 1
        source_counts[src] = dict(cefr_counts)

    # Build (word, pos) -> {source: CEFR} assignments
    word_map = defaultdict(dict)
    for src, entries in sources.items():
        for e in entries:
            key = normalize_entry(e)
            word_map[key][src] = e["cefr"]

    # Compute overlaps, agreements, conflicts
    total_entries = sum(len(e) for e in sources.values())
    unique_words = len(word_map)
    overlap_stats = defaultdict(int)
    agreement_count = 0
    conflict_count = 0
    conflict_details = []
    for key, src_cefr_map in word_map.items():
        num_sources = len(src_cefr_map)
        overlap_stats[num_sources] += 1
        if num_sources > 1:
            cefr_values = set(src_cefr_map.values())
            if len(cefr_values) == 1:
                agreement_count += 1
            else:
                conflict_count += 1
                conflict_details.append(
                    {"word": key[0], "pos": key[1], "assignments": dict(src_cefr_map)}
                )

    return {
        "source_counts": source_counts,
        "total_entries": total_entries,
        "unique_words": unique_words,
        "overlap_distribution": dict(overlap_stats),
        "agreements": agreement_count,
        "conflicts": conflict_count,
        "conflict_details": conflict_details,
    }


def print_report(stats: dict, sources: Dict[str, List[dict]]):
    """Print formatted comparison report."""
    print(f"\n{'=' * 60}")
    print("CEFR COMPARISON REPORT - ITALIAN")
    print(f"{'=' * 60}")

    # Source entry counts
    print("\n📊 ENTRIES PER SOURCE AND CEFR LEVEL")
    print("-" * 50)
    for src, counts in stats["source_counts"].items():
        total = sum(counts.values())
        print(f"\n{src}: {total} total entries")
        for level in sorted(CEFR_LEVELS):
            cnt = counts.get(level, 0)
            if cnt > 0:
                print(f"  {level}: {cnt}")
        # Show non-standard levels
        for level, cnt in counts.items():
            if level not in CEFR_LEVELS and level != "UNKNOWN":
                print(f"  {level}: {cnt} (non-standard)")

    # Overlap statistics
    print("\n🔄 OVERLAP BETWEEN SOURCES")
    print("-" * 50)
    print(f"Total unique (word, POS) combinations: {stats['unique_words']}")
    print(f"Total entries across all sources: {stats['total_entries']}")
    overlap = stats["overlap_distribution"]
    for n_sources in sorted(overlap.keys()):
        count = overlap[n_sources]
        pct = (count / stats["unique_words"]) * 100
        print(f"Words appearing in {n_sources} source(s): {count} ({pct:.1f}%)")

    # Agreement and conflicts
    print("\n⚖️ AGREEMENT / CONFLICT SUMMARY")
    print("-" * 50)
    print(f"Words with >1 source: {stats['agreements'] + stats['conflicts']}")
    print(f"  ✅ Agreements (same CEFR): {stats['agreements']}")
    print(f"  ❌ Conflicts (different CEFR): {stats['conflicts']}")
    if stats["conflicts"] > 0:
        agreement_rate = (
            stats["agreements"] / (stats["agreements"] + stats["conflicts"])
        ) * 100
        print(f"  Agreement rate: {agreement_rate:.1f}%")
        print("\n📋 CONFLICT DETAILS (first 10 shown):")
        for i, conflict in enumerate(stats["conflict_details"][:10]):
            print(f"  {i + 1}. {conflict['word']} ({conflict['pos']})")
            for src, cefr in conflict["assignments"].items():
                print(f"       {src}: {cefr}")
        if len(stats["conflict_details"]) > 10:
            print(f"  ... and {len(stats['conflict_details']) - 10} more conflicts.")

    print(f"\n{'=' * 60}\n")


def main():
    # Determine paths relative to this script
    script_dir = Path(__file__).parent
    data_dir = script_dir.parent / "data-sources" / "italian"
    if not data_dir.exists():
        print(f"Error: Italian data directory not found: {data_dir}")
        return

    print(f"Loading extracted files from {data_dir}...")
    sources = load_extracted_files(data_dir)
    if not sources:
        print("No extracted files found.")
        return

    print(f"Found sources: {', '.join(sources.keys())}")
    stats = compute_statistics(sources)
    print_report(stats, sources)


if __name__ == "__main__":
    main()

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -91,12 +91,12 @@ def extract() -> None:
     print(f"Extracted: {len(records)} records")
     print(f"  - Nouns: {noun_count}")
     print(f"  - Verbs: {verb_count}")
-    print(f"\nCEFR distribution:")
+    print("\nCEFR distribution:")
     for level in CEFR_LEVELS:
         if level in cefr_distribution:
             print(f"  - {level}: {cefr_distribution[level]}")
-    print(f"\nSkipped:")
+    print("\nSkipped:")
     print(f"  - Unsupported POS: {skipped_pos}")
     print(f"  - Invalid CEFR: {skipped_invalid_cefr}")
     print(f"  - Empty word: {skipped_empty_word}")


@ -0,0 +1,114 @@
#!/usr/bin/env python3
"""
scripts/extraction-scripts/italian/extract-it_m3.py

Extracts CEFR data from it_m3.xls (Italian M3 wordlist).
"""
import json
from pathlib import Path

import xlrd

# Constants matching @glossa/shared
SUPPORTED_POS = ["noun", "verb"]
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# POS mapping (case-insensitive) based on observed abbreviations
POS_MAP = {
    "n": "noun",  # nome
    "v": "verb",  # verbo
}

# Column indices (0-based) verified from sample
WORD_COL = 0  # Lemma
POS_COL = 1   # Pos
CEFR_COL = 2  # Points (CEFR level)

# Paths (relative to project root)
INPUT_FILE = Path("scripts/data-sources/italian/it_m3.xls")
OUTPUT_FILE = Path("scripts/data-sources/italian/it_m3-extracted.json")


def extract() -> None:
    print(f"Reading: {INPUT_FILE}")
    records = []
    skipped_pos = 0
    skipped_invalid_cefr = 0
    skipped_empty_word = 0
    total_rows = 0

    wb = xlrd.open_workbook(INPUT_FILE)
    ws = wb.sheet_by_index(0)

    # Skip header row, start from row 1
    for row_idx in range(1, ws.nrows):
        total_rows += 1
        word_raw = ws.cell_value(row_idx, WORD_COL)
        pos_raw = ws.cell_value(row_idx, POS_COL)
        cefr_raw = ws.cell_value(row_idx, CEFR_COL)

        # Normalize POS (case-insensitive)
        pos = str(pos_raw).lower().strip() if pos_raw else ""
        if pos not in POS_MAP:
            skipped_pos += 1
            continue
        pos = POS_MAP[pos]

        # Normalize CEFR; some cells wrap the level in Unicode smart quotes
        cefr_str = str(cefr_raw).strip() if cefr_raw else ""
        cefr_str = cefr_str.strip("\u201c\u201d")
        cefr = cefr_str.upper()
        if cefr not in CEFR_LEVELS:
            skipped_invalid_cefr += 1
            continue

        # Normalize word. Some lemmas list multiple forms (e.g. "il, lo, la");
        # splitting on the comma would lose variants, so keep the full string
        # and just lowercase it.
        word_raw_str = str(word_raw).strip() if word_raw else ""
        word = word_raw_str.lower()
        if not word:
            skipped_empty_word += 1
            continue

        records.append({"word": word, "pos": pos, "cefr": cefr, "source": "it_m3"})

    # Write output
    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    # Stats
    noun_count = sum(1 for r in records if r["pos"] == "noun")
    verb_count = sum(1 for r in records if r["pos"] == "verb")
    cefr_distribution = {}
    for level in CEFR_LEVELS:
        count = sum(1 for r in records if r["cefr"] == level)
        if count > 0:
            cefr_distribution[level] = count

    print(f"\nTotal rows in XLS: {total_rows}")
    print(f"Extracted: {len(records)} records")
    print(f"  - Nouns: {noun_count}")
    print(f"  - Verbs: {verb_count}")
    print("\nCEFR distribution:")
    for level in CEFR_LEVELS:
        if level in cefr_distribution:
            print(f"  - {level}: {cefr_distribution[level]}")
    print("\nSkipped:")
    print(f"  - Unsupported POS: {skipped_pos}")
    print(f"  - Invalid CEFR: {skipped_invalid_cefr}")
    print(f"  - Empty word: {skipped_empty_word}")
    print(f"\nOutput: {OUTPUT_FILE}")


if __name__ == "__main__":
    extract()


@ -0,0 +1,91 @@
#!/usr/bin/env python3
"""
scripts/extraction-scripts/italian/extract-italian-json.py

Extracts CEFR data from italian.json (Italian flashcard source).
Filters for useful_for_flashcard=true and supported POS (noun, verb).
"""
import json
from pathlib import Path

# Constants matching @glossa/shared
SUPPORTED_POS = ["noun", "verb"]
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Paths (relative to project root)
INPUT_FILE = Path("scripts/data-sources/italian/italian.json")
OUTPUT_FILE = Path("scripts/data-sources/italian/italian-extracted.json")


def extract() -> None:
    print(f"Reading: {INPUT_FILE}")
    with open(INPUT_FILE, "r", encoding="utf-8") as f:
        data = json.load(f)

    records = []
    skipped_pos = 0
    skipped_not_useful = 0
    skipped_invalid_cefr = 0
    skipped_empty_word = 0

    for entry in data:
        # Filter: must be useful for flashcard
        if not entry.get("useful_for_flashcard", False):
            skipped_not_useful += 1
            continue

        # Filter: must have supported POS
        pos = entry.get("pos", "").lower().strip()
        if pos not in SUPPORTED_POS:
            skipped_pos += 1
            continue

        # Filter: must have valid CEFR level
        cefr = entry.get("cefr_level", "").upper().strip()
        if cefr not in CEFR_LEVELS:
            skipped_invalid_cefr += 1
            continue

        # Normalize word
        word = entry.get("word", "").lower().strip()
        if not word:
            skipped_empty_word += 1
            continue

        records.append({"word": word, "pos": pos, "cefr": cefr, "source": "italian"})

    # Write output
    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    # Stats
    noun_count = sum(1 for r in records if r["pos"] == "noun")
    verb_count = sum(1 for r in records if r["pos"] == "verb")
    cefr_distribution = {}
    for level in CEFR_LEVELS:
        count = sum(1 for r in records if r["cefr"] == level)
        if count > 0:
            cefr_distribution[level] = count

    print(f"\nExtracted: {len(records)} records")
    print(f"  - Nouns: {noun_count}")
    print(f"  - Verbs: {verb_count}")
    print("\nCEFR distribution:")
    for level in CEFR_LEVELS:
        if level in cefr_distribution:
            print(f"  - {level}: {cefr_distribution[level]}")
    print("\nSkipped:")
    print(f"  - Not useful for flashcard: {skipped_not_useful}")
    print(f"  - Unsupported POS: {skipped_pos}")
    print(f"  - Invalid CEFR: {skipped_invalid_cefr}")
    print(f"  - Empty word: {skipped_empty_word}")
    print(f"\nOutput: {OUTPUT_FILE}")


if __name__ == "__main__":
    extract()


@ -0,0 +1,159 @@
#!/usr/bin/env python3
"""
CEFR Data Pipeline - Stage 3: Italian Merge
Merges extracted JSON files for Italian into an authoritative dataset.
"""
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

# Supported CEFR levels and difficulty mapping
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}
DIFFICULTY_MAP = {
    "A1": "easy",
    "A2": "easy",
    "B1": "intermediate",
    "B2": "intermediate",
    "C1": "hard",
    "C2": "hard",
}

# Source priority order (from lowest to highest priority).
# Higher index = higher authority when conflicts occur.
PRIORITY_ORDER = ["italian", "it_m3"]


def load_extracted_files(data_dir: Path) -> Dict[str, List[dict]]:
    """Load all *-extracted.json files from the Italian data directory."""
    sources = {}
    for file_path in data_dir.glob("*-extracted.json"):
        source_name = file_path.stem.replace("-extracted", "")
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, list):
            sources[source_name] = data
        else:
            print(f"Warning: {file_path} does not contain a list, skipping.")
    return sources


def normalize_entry(entry: dict) -> Tuple[str, str]:
    """Return (word, pos) key for merging."""
    return entry["word"].lower().strip(), entry["pos"].lower().strip()


def get_source_priority(source_name: str) -> int:
    """Return priority index for a source (higher = more authoritative)."""
    try:
        return PRIORITY_ORDER.index(source_name)
    except ValueError:
        # If source not in list, assign lowest priority
        return -1


def merge_entries(sources: Dict[str, List[dict]]) -> List[dict]:
    """Merge entries from multiple sources, resolving conflicts by priority."""
    grouped = defaultdict(list)
    for src_name, entries in sources.items():
        for entry in entries:
            key = normalize_entry(entry)
            grouped[key].append((src_name, entry["cefr"], entry))

    merged = []
    conflicts_resolved = 0
    total_multi_source = 0
    for (word, pos), src_entries in grouped.items():
        if len(src_entries) == 1:
            src_name, cefr, _ = src_entries[0]
            final_cefr = cefr
            contributing_sources = [src_name]
        else:
            total_multi_source += 1
            sorted_entries = sorted(
                src_entries, key=lambda x: get_source_priority(x[0]), reverse=True
            )
            _, highest_cefr, _ = sorted_entries[0]
            all_cefrs = {e[1] for e in src_entries}
            if len(all_cefrs) > 1:
                conflicts_resolved += 1
            final_cefr = highest_cefr
            contributing_sources = [e[0] for e in src_entries]

        merged.append(
            {
                "word": word,
                "pos": pos,
                "cefr": final_cefr,
                "difficulty": DIFFICULTY_MAP.get(final_cefr, "unknown"),
                "sources": sorted(contributing_sources),
            }
        )

    print("Merge statistics:")
    print(f"  Total unique entries: {len(merged)}")
    print(f"  Entries with multiple sources: {total_multi_source}")
    print(f"  Conflicts resolved by priority: {conflicts_resolved}")
    return merged


def print_summary(merged: List[dict]):
    """Print distribution of CEFR levels and difficulty in the final dataset."""
    cefr_counts = defaultdict(int)
    diff_counts = defaultdict(int)
    for entry in merged:
        cefr_counts[entry["cefr"]] += 1
        diff_counts[entry["difficulty"]] += 1

    print("\n📊 Final CEFR distribution:")
    for level in sorted(CEFR_LEVELS):
        count = cefr_counts.get(level, 0)
        if count:
            print(f"  {level}: {count}")

    print("\n📊 Final difficulty distribution:")
    for diff in ["easy", "intermediate", "hard"]:
        print(f"  {diff}: {diff_counts.get(diff, 0)}")


def main():
    script_dir = Path(__file__).parent
    data_dir = script_dir.parent / "data-sources" / "italian"
    output_dir = script_dir.parent / "datafiles"
    output_file = output_dir / "italian-merged.json"

    if not data_dir.exists():
        print(f"Error: Italian data directory not found: {data_dir}")
        return
    output_dir.mkdir(parents=True, exist_ok=True)

    print(f"Loading extracted files from {data_dir}...")
    sources = load_extracted_files(data_dir)
    if not sources:
        print("No extracted files found.")
        return

    print(f"Found sources: {', '.join(sources.keys())}")
    print(f"Priority order (lowest to highest): {PRIORITY_ORDER}")
    merged = merge_entries(sources)

    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2, ensure_ascii=False)
    print(f"\n✅ Merged dataset written to: {output_file}")
    print_summary(merged)


if __name__ == "__main__":
    main()