feat(scripts): add Italian CEFR data pipeline
- Add extractors for Italian sources: it_m3.xls and italian.json
- Add comparison script (compare-italian.py) to report source overlaps and conflicts
- Add merge script (merge-italian-json.py) with priority order ['italian', 'it_m3']
- Output authoritative dataset to datafiles/italian-merged.json
- Update README to document both English and Italian pipelines
Parent: 59152950d6
Commit: 3374bd8b20

9 changed files with 208535 additions and 26 deletions
README (modified):

@@ -1,11 +1,16 @@
 # CEFR Data Pipeline
 
-This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final output (`english-merged.json`) is consumed by the database seeding process in `packages/db`.
+This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final outputs (`english-merged.json`, `italian-merged.json`) are consumed by the database seeding process in `packages/db`.
 
 ## Overview
 
 The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.
 
+## Supported Languages
+
+- ✅ English (`en`)
+- ✅ Italian (`it`)
+
 ## Pipeline Stages
 
 ### Stage 1: Extraction
@@ -22,12 +27,16 @@ Each source file is processed by a dedicated extractor script. The extractor rea
 - CEFR levels are validated against A1-C2
 - Each record includes the source identifier for traceability
 
-**Location:** `extraction-scripts/english/`
-
-**Scripts:**
-
-- `extract-cefrj-csv.py`
-- `extract-en_m3.py`
-- `extract-octanove.py`
-- `extract-random-json.py`
+**Extractor Scripts:**
+
+| Language | Source         | Script                                                |
+|----------|----------------|-------------------------------------------------------|
+| English  | `cefrj.csv`    | `extraction-scripts/english/extract-cefrj-csv.py`     |
+| English  | `en_m3.xls`    | `extraction-scripts/english/extract-en_m3.py`         |
+| English  | `octanove.csv` | `extraction-scripts/english/extract-octanove.py`      |
+| English  | `random.json`  | `extraction-scripts/english/extract-random-json.py`   |
+| Italian  | `it_m3.xls`    | `extraction-scripts/italian/extract-it_m3.py`         |
+| Italian  | `italian.json` | `extraction-scripts/italian/extract-italian-json.py`  |
 
 ### Stage 2: Comparison
@@ -39,17 +48,18 @@ Before merging, sources are compared to identify agreements and conflicts. This
 - Overlap between sources (words appearing in multiple sources)
 - Agreement rate (sources assigning the same CEFR level)
 - Conflicts (same word/POS with different CEFR levels)
-- Database coverage (how many extracted words exist in the database)
 
-**Location:** `comparison-scripts/compare-english.py`
-
-**Usage:**
+**Comparison Scripts:**
+
+| Language | Script                                   |
+|----------|------------------------------------------|
+| English  | `comparison-scripts/compare-english.py`  |
+| Italian  | `comparison-scripts/compare-italian.py`  |
+
+Run from the `scripts/` directory:
 
 ```bash
 cd scripts/
 python comparison-scripts/compare-english.py
+python comparison-scripts/compare-italian.py
 ```
 
 Conflicts are resolved in the next stage using source priority rules.
 
 ### Stage 3: Merge
@@ -71,13 +81,17 @@ Difficulty is not extracted from sources. It is derived from the final CEFR leve
 The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.
 
-**Location**: `merge-scripts/merge-english-json.py`
-
-**Usage:**
+**Merge Scripts & Priorities:**
+
+| Language | Script                                  | Priority (lowest → highest)              |
+|----------|-----------------------------------------|------------------------------------------|
+| English  | `merge-scripts/merge-english-json.py`   | `random`, `octanove`, `cefrj`, `en_m3`   |
+| Italian  | `merge-scripts/merge-italian-json.py`   | `italian`, `it_m3`                       |
+
+Run from the `scripts/` directory:
 
 ```bash
 cd scripts/
 python merge-scripts/merge-english-json.py
+python merge-scripts/merge-italian-json.py
 ```
 
 ### Stage 4: Enrichment
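The priority rule and difficulty derivation described above can be sketched as a toy resolver. `resolve` is a hypothetical helper written for this illustration; `PRIORITY_ORDER` and `DIFFICULTY_MAP` mirror the constants in `merge-scripts/merge-italian-json.py`:

```python
# Source priority for Italian, lowest to highest: when two sources disagree
# on a (word, POS) key, the source with the higher index wins.
PRIORITY_ORDER = ["italian", "it_m3"]

# Difficulty is derived from the final CEFR level, not extracted from sources.
DIFFICULTY_MAP = {
    "A1": "easy", "A2": "easy",
    "B1": "intermediate", "B2": "intermediate",
    "C1": "hard", "C2": "hard",
}


def resolve(assignments: dict) -> dict:
    """assignments: {source_name: cefr_level} for one (word, POS) key."""
    winner = max(assignments, key=PRIORITY_ORDER.index)
    cefr = assignments[winner]
    return {
        "cefr": cefr,
        "difficulty": DIFFICULTY_MAP[cefr],
        "sources": sorted(assignments),
    }


# it_m3 outranks italian, so its A2 wins the conflict:
print(resolve({"italian": "B1", "it_m3": "A2"}))
# → {'cefr': 'A2', 'difficulty': 'easy', 'sources': ['it_m3', 'italian']}
```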
@@ -88,9 +102,11 @@ The authoritative merged file is consumed by the database package (packages/db)
 ```
 scripts/
 ├── comparison-scripts/
-│   └── compare-english.py           # Stage 2: compare extracted data
+│   ├── compare-english.py
+│   └── compare-italian.py           # Stage 2: compare extracted data
 ├── datafiles/
-│   ├── english-merged.json          # Stage 3 output (authoritative dataset)
+│   ├── english-merged.json          # Stage 3 output (authoritative)
+│   ├── italian-merged.json          # Stage 3 output (authoritative)
 │   ├── omw-noun.json
 │   └── omw-verb.json
 ├── data-sources/
@@ -105,7 +121,11 @@ scripts/
 │   │   └── random-extracted.json
 │   ├── french/                      # (future)
 │   ├── german/                      # (future)
-│   ├── italian/                     # (future)
+│   ├── italian/
+│   │   ├── it_m3.xls
+│   │   ├── it_m3-extracted.json
+│   │   ├── italian.json
+│   │   └── italian-extracted.json
 │   └── spanish/                     # (future)
 ├── extraction-scripts/
 │   └── english/
@ -113,6 +133,9 @@ scripts/
|
||||||
│ ├── extract-en_m3.py
|
│ ├── extract-en_m3.py
|
||||||
│ ├── extract-octanove.py
|
│ ├── extract-octanove.py
|
||||||
│ └── extract-random-json.py
|
│ └── extract-random-json.py
|
||||||
|
│ └── italian/
|
||||||
|
│ ├── extract-it_m3.py
|
||||||
|
│ └── extract-italian-json.py
|
||||||
├── merge-scripts/
|
├── merge-scripts/
|
||||||
│ └── merge-english-json.py # Stage 3: merge into authority
|
│ └── merge-english-json.py # Stage 3: merge into authority
|
||||||
├── extract-own-save-to-json.py # script to extract words from wordnet
|
├── extract-own-save-to-json.py # script to extract words from wordnet
|
||||||
|
|
|
||||||
166
scripts/comparison-scripts/compare-italian.py
Normal file
166
scripts/comparison-scripts/compare-italian.py
Normal file
|
|
@ -0,0 +1,166 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
CEFR Data Pipeline - Stage 2: Italian Comparison
|
||||||
|
Compares extracted JSON files for Italian and reports agreements and conflicts.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
from collections import defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Dict, List, Tuple
|
||||||
|
|
||||||
|
# Supported CEFR levels
|
||||||
|
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}
|
||||||
|
|
||||||
|
|
||||||
|
def load_extracted_files(data_dir: Path) -> Dict[str, List[dict]]:
|
||||||
|
"""Load all *-extracted.json files from the Italian data directory."""
|
||||||
|
sources = {}
|
||||||
|
for file_path in data_dir.glob("*-extracted.json"):
|
||||||
|
source_name = file_path.stem.replace("-extracted", "")
|
||||||
|
with open(file_path, "r", encoding="utf-8") as f:
|
||||||
|
data = json.load(f)
|
||||||
|
if isinstance(data, list):
|
||||||
|
sources[source_name] = data
|
||||||
|
else:
|
||||||
|
print(f"Warning: {file_path} does not contain a list, skipping.")
|
||||||
|
return sources
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_entry(entry: dict) -> Tuple[str, str]:
|
||||||
|
"""Return (word, pos) key for comparison."""
|
||||||
|
return entry["word"].lower().strip(), entry["pos"].lower().strip()
|
||||||
|
|
||||||
|
|
||||||
|
def compute_statistics(sources: Dict[str, List[dict]]) -> dict:
|
||||||
|
"""Compute overlap, agreement, and conflict statistics."""
|
||||||
|
# Per-source counts by CEFR level
|
||||||
|
source_counts = {}
|
||||||
|
for src, entries in sources.items():
|
||||||
|
cefr_counts = defaultdict(int)
|
||||||
|
for e in entries:
|
||||||
|
cefr = e.get("cefr", "UNKNOWN")
|
||||||
|
cefr_counts[cefr] += 1
|
||||||
|
source_counts[src] = dict(cefr_counts)
|
||||||
|
|
||||||
|
# Build word->pos->sources and CEFR assignments
|
||||||
|
word_map = defaultdict(lambda: defaultdict(dict))
|
||||||
|
for src, entries in sources.items():
|
||||||
|
for e in entries:
|
||||||
|
key = normalize_entry(e)
|
||||||
|
word_map[key][src] = e["cefr"]
|
||||||
|
|
||||||
|
# Compute overlaps, agreements, conflicts
|
||||||
|
total_entries = sum(len(e) for e in sources.values())
|
||||||
|
unique_words = len(word_map)
|
||||||
|
|
||||||
|
overlap_stats = defaultdict(int)
|
||||||
|
agreement_count = 0
|
||||||
|
conflict_count = 0
|
||||||
|
conflict_details = []
|
||||||
|
|
||||||
|
for key, src_cefr_map in word_map.items():
|
||||||
|
num_sources = len(src_cefr_map)
|
||||||
|
overlap_stats[num_sources] += 1
|
||||||
|
if num_sources > 1:
|
||||||
|
cefr_values = set(src_cefr_map.values())
|
||||||
|
if len(cefr_values) == 1:
|
||||||
|
agreement_count += 1
|
||||||
|
else:
|
||||||
|
conflict_count += 1
|
||||||
|
conflict_details.append(
|
||||||
|
{"word": key[0], "pos": key[1], "assignments": dict(src_cefr_map)}
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"source_counts": source_counts,
|
||||||
|
"total_entries": total_entries,
|
||||||
|
"unique_words": unique_words,
|
||||||
|
"overlap_distribution": dict(overlap_stats),
|
||||||
|
"agreements": agreement_count,
|
||||||
|
"conflicts": conflict_count,
|
||||||
|
"conflict_details": conflict_details,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def print_report(stats: dict, sources: Dict[str, List[dict]]):
|
||||||
|
"""Print formatted comparison report."""
|
||||||
|
print(f"\n{'=' * 60}")
|
||||||
|
print("CEFR COMPARISON REPORT - ITALIAN")
|
||||||
|
print(f"{'=' * 60}")
|
||||||
|
|
||||||
|
# Source entry counts
|
||||||
|
print("\n📊 ENTRIES PER SOURCE AND CEFR LEVEL")
|
||||||
|
print("-" * 50)
|
||||||
|
for src, counts in stats["source_counts"].items():
|
||||||
|
total = sum(counts.values())
|
||||||
|
print(f"\n{src}: {total} total entries")
|
||||||
|
for level in CEFR_LEVELS:
|
||||||
|
cnt = counts.get(level, 0)
|
||||||
|
if cnt > 0:
|
||||||
|
print(f" {level}: {cnt}")
|
||||||
|
# Show non-standard levels
|
||||||
|
for level, cnt in counts.items():
|
||||||
|
if level not in CEFR_LEVELS and level != "UNKNOWN":
|
||||||
|
print(f" {level}: {cnt} (non-standard)")
|
||||||
|
|
||||||
|
# Overlap statistics
|
||||||
|
print("\n🔄 OVERLAP BETWEEN SOURCES")
|
||||||
|
print("-" * 50)
|
||||||
|
print(f"Total unique (word, POS) combinations: {stats['unique_words']}")
|
||||||
|
print(f"Total entries across all sources: {stats['total_entries']}")
|
||||||
|
|
||||||
|
overlap = stats["overlap_distribution"]
|
||||||
|
for n_sources in sorted(overlap.keys()):
|
||||||
|
count = overlap[n_sources]
|
||||||
|
pct = (count / stats["unique_words"]) * 100
|
||||||
|
print(f"Words appearing in {n_sources} source(s): {count} ({pct:.1f}%)")
|
||||||
|
|
||||||
|
# Agreement and conflicts
|
||||||
|
print("\n⚖️ AGREEMENT / CONFLICT SUMMARY")
|
||||||
|
print("-" * 50)
|
||||||
|
print(f"Words with >1 source: {stats['agreements'] + stats['conflicts']}")
|
||||||
|
print(f" ✅ Agreements (same CEFR): {stats['agreements']}")
|
||||||
|
print(f" ❌ Conflicts (different CEFR): {stats['conflicts']}")
|
||||||
|
|
||||||
|
if stats["conflicts"] > 0:
|
||||||
|
agreement_rate = (
|
||||||
|
stats["agreements"] / (stats["agreements"] + stats["conflicts"])
|
||||||
|
) * 100
|
||||||
|
print(f" Agreement rate: {agreement_rate:.1f}%")
|
||||||
|
|
||||||
|
print("\n📋 CONFLICT DETAILS (first 10 shown):")
|
||||||
|
for i, conflict in enumerate(stats["conflict_details"][:10]):
|
||||||
|
print(f" {i + 1}. {conflict['word']} ({conflict['pos']})")
|
||||||
|
for src, cefr in conflict["assignments"].items():
|
||||||
|
print(f" {src}: {cefr}")
|
||||||
|
if len(stats["conflict_details"]) > 10:
|
||||||
|
print(f" ... and {len(stats['conflict_details']) - 10} more conflicts.")
|
||||||
|
|
||||||
|
print(f"\n{'=' * 60}\n")
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
# Determine paths
|
||||||
|
script_dir = Path(__file__).parent
|
||||||
|
data_dir = script_dir.parent / "data-sources" / "italian"
|
||||||
|
|
||||||
|
if not data_dir.exists():
|
||||||
|
print(f"Error: Italian data directory not found: {data_dir}")
|
||||||
|
return
|
||||||
|
|
||||||
|
print(f"Loading extracted files from {data_dir}...")
|
||||||
|
sources = load_extracted_files(data_dir)
|
||||||
|
|
||||||
|
if not sources:
|
||||||
|
print("No extracted files found.")
|
||||||
|
return
|
||||||
|
|
||||||
|
print(f"Found sources: {', '.join(sources.keys())}")
|
||||||
|
|
||||||
|
stats = compute_statistics(sources)
|
||||||
|
print_report(stats, sources)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
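For intuition, the agreement/conflict tally that the comparison script performs can be reproduced on toy data. This is a self-contained sketch of the same counting logic, not an import of the script:

```python
from collections import defaultdict

# Two toy sources: "cane" agrees across sources, "andare" conflicts.
sources = {
    "italian": [{"word": "cane", "pos": "noun", "cefr": "A1"},
                {"word": "andare", "pos": "verb", "cefr": "A2"}],
    "it_m3":   [{"word": "cane", "pos": "noun", "cefr": "A1"},
                {"word": "andare", "pos": "verb", "cefr": "B1"}],
}

# Map each (word, pos) key to its per-source CEFR assignments.
word_map = defaultdict(dict)
for src, entries in sources.items():
    for e in entries:
        word_map[(e["word"], e["pos"])][src] = e["cefr"]

# A multi-source key agrees if all sources assign the same level.
agreements = sum(1 for m in word_map.values()
                 if len(m) > 1 and len(set(m.values())) == 1)
conflicts = sum(1 for m in word_map.values()
                if len(m) > 1 and len(set(m.values())) > 1)

print(agreements, conflicts)  # → 1 1
```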
scripts/data-sources/italian/it_m3-extracted.json (new file, 22076 lines): diff suppressed because it is too large

scripts/data-sources/italian/italian-extracted.json (new file, 72212 lines): diff suppressed because it is too large

scripts/datafiles/italian-merged.json (new file, 113668 lines): diff suppressed because it is too large
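Although the merged file's diff is suppressed, the merge script's output fields imply that each entry in `datafiles/italian-merged.json` carries a word, POS, final CEFR level, derived difficulty, and the list of contributing sources. A sketch of one such entry (values are illustrative, not taken from the real file):

```python
import json

# Illustrative entry matching the shape written by merge-italian-json.py.
entry = {
    "word": "cane",
    "pos": "noun",
    "cefr": "A1",
    "difficulty": "easy",          # derived from the CEFR level
    "sources": ["it_m3", "italian"],  # sources that contributed
}

print(json.dumps(entry, ensure_ascii=False))
```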
@@ -91,12 +91,12 @@ def extract() -> None:
     print(f"Extracted: {len(records)} records")
     print(f"  - Nouns: {noun_count}")
     print(f"  - Verbs: {verb_count}")
-    print(f"\nCEFR distribution:")
+    print("\nCEFR distribution:")
     for level in CEFR_LEVELS:
         if level in cefr_distribution:
             print(f"  - {level}: {cefr_distribution[level]}")
 
-    print(f"\nSkipped:")
+    print("\nSkipped:")
     print(f"  - Unsupported POS: {skipped_pos}")
     print(f"  - Invalid CEFR: {skipped_invalid_cefr}")
     print(f"  - Empty word: {skipped_empty_word}")
scripts/extraction-scripts/italian/extract-it_m3.py (new file, 114 lines):

```python
#!/usr/bin/env python3
"""
scripts/extraction-scripts/italian/extract-it_m3.py

Extracts CEFR data from it_m3.xls (Italian M3 wordlist).
"""

import json
from pathlib import Path

import xlrd

# Constants matching @glossa/shared
SUPPORTED_POS = ["noun", "verb"]
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# POS mapping (case-insensitive), based on observed abbreviations
POS_MAP = {
    "n": "noun",  # nome
    "v": "verb",  # verbo
}

# Column indices (0-based), verified from sample
WORD_COL = 0  # Lemma
POS_COL = 1  # Pos
CEFR_COL = 2  # Points (CEFR level)

# Paths (relative to project root)
INPUT_FILE = Path("scripts/data-sources/italian/it_m3.xls")
OUTPUT_FILE = Path("scripts/data-sources/italian/it_m3-extracted.json")


def extract() -> None:
    print(f"Reading: {INPUT_FILE}")

    records = []
    skipped_pos = 0
    skipped_invalid_cefr = 0
    skipped_empty_word = 0
    total_rows = 0

    wb = xlrd.open_workbook(INPUT_FILE)
    ws = wb.sheet_by_index(0)

    # Skip header row, start from row 1
    for row_idx in range(1, ws.nrows):
        total_rows += 1

        word_raw = ws.cell_value(row_idx, WORD_COL)
        pos_raw = ws.cell_value(row_idx, POS_COL)
        cefr_raw = ws.cell_value(row_idx, CEFR_COL)

        # Normalize POS (case-insensitive)
        pos = str(pos_raw).lower().strip() if pos_raw else ""
        if pos not in POS_MAP:
            skipped_pos += 1
            continue
        pos = POS_MAP[pos]

        # Normalize CEFR: strip Unicode smart quotes, then uppercase
        cefr_str = str(cefr_raw).strip() if cefr_raw else ""
        cefr_str = cefr_str.strip("\u201c\u201d")
        cefr = cefr_str.upper()
        if cefr not in CEFR_LEVELS:
            skipped_invalid_cefr += 1
            continue

        # Normalize word. Some lemmas list multiple forms (e.g. "il, lo, la");
        # splitting on the comma would lose variants, so the full string is
        # kept as-is and lowercased.
        word_raw_str = str(word_raw).strip() if word_raw else ""
        word = word_raw_str.lower()
        if not word:
            skipped_empty_word += 1
            continue

        record = {"word": word, "pos": pos, "cefr": cefr, "source": "it_m3"}
        records.append(record)

    # Write output
    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    # Stats
    noun_count = sum(1 for r in records if r["pos"] == "noun")
    verb_count = sum(1 for r in records if r["pos"] == "verb")

    cefr_distribution = {}
    for level in CEFR_LEVELS:
        count = sum(1 for r in records if r["cefr"] == level)
        if count > 0:
            cefr_distribution[level] = count

    print(f"\nTotal rows in XLS: {total_rows}")
    print(f"Extracted: {len(records)} records")
    print(f"  - Nouns: {noun_count}")
    print(f"  - Verbs: {verb_count}")
    print("\nCEFR distribution:")
    for level in CEFR_LEVELS:
        if level in cefr_distribution:
            print(f"  - {level}: {cefr_distribution[level]}")

    print("\nSkipped:")
    print(f"  - Unsupported POS: {skipped_pos}")
    print(f"  - Invalid CEFR: {skipped_invalid_cefr}")
    print(f"  - Empty word: {skipped_empty_word}")
    print(f"\nOutput: {OUTPUT_FILE}")


if __name__ == "__main__":
    extract()
```
scripts/extraction-scripts/italian/extract-italian-json.py (new file, 91 lines):

```python
#!/usr/bin/env python3
"""
scripts/extraction-scripts/italian/extract-italian-json.py

Extracts CEFR data from italian.json (Italian flashcard source).
Filters for useful_for_flashcard=true and supported POS (noun, verb).
"""

import json
from pathlib import Path

# Constants matching @glossa/shared
SUPPORTED_POS = ["noun", "verb"]
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Paths (relative to project root)
INPUT_FILE = Path("scripts/data-sources/italian/italian.json")
OUTPUT_FILE = Path("scripts/data-sources/italian/italian-extracted.json")


def extract() -> None:
    print(f"Reading: {INPUT_FILE}")

    with open(INPUT_FILE, "r", encoding="utf-8") as f:
        data = json.load(f)

    records = []
    skipped_pos = 0
    skipped_not_useful = 0
    skipped_invalid_cefr = 0
    skipped_empty_word = 0

    for entry in data:
        # Filter: must be useful for flashcard
        if not entry.get("useful_for_flashcard", False):
            skipped_not_useful += 1
            continue

        # Filter: must have supported POS
        pos = entry.get("pos", "").lower().strip()
        if pos not in SUPPORTED_POS:
            skipped_pos += 1
            continue

        # Filter: must have valid CEFR level
        cefr = entry.get("cefr_level", "").upper().strip()
        if cefr not in CEFR_LEVELS:
            skipped_invalid_cefr += 1
            continue

        # Normalize word
        word = entry.get("word", "").lower().strip()
        if not word:
            skipped_empty_word += 1
            continue

        record = {"word": word, "pos": pos, "cefr": cefr, "source": "italian"}
        records.append(record)

    # Write output
    with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    # Stats
    noun_count = sum(1 for r in records if r["pos"] == "noun")
    verb_count = sum(1 for r in records if r["pos"] == "verb")

    cefr_distribution = {}
    for level in CEFR_LEVELS:
        count = sum(1 for r in records if r["cefr"] == level)
        if count > 0:
            cefr_distribution[level] = count

    print(f"\nExtracted: {len(records)} records")
    print(f"  - Nouns: {noun_count}")
    print(f"  - Verbs: {verb_count}")
    print("\nCEFR distribution:")
    for level in CEFR_LEVELS:
        if level in cefr_distribution:
            print(f"  - {level}: {cefr_distribution[level]}")

    print("\nSkipped:")
    print(f"  - Not useful for flashcard: {skipped_not_useful}")
    print(f"  - Unsupported POS: {skipped_pos}")
    print(f"  - Invalid CEFR: {skipped_invalid_cefr}")
    print(f"  - Empty word: {skipped_empty_word}")
    print(f"\nOutput: {OUTPUT_FILE}")


if __name__ == "__main__":
    extract()
```
scripts/merge-scripts/merge-italian-json.py (new file, 159 lines):

```python
#!/usr/bin/env python3
"""
CEFR Data Pipeline - Stage 3: Italian Merge
Merges extracted JSON files for Italian into an authoritative dataset.
"""

import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

# Supported CEFR levels and difficulty mapping
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}
DIFFICULTY_MAP = {
    "A1": "easy",
    "A2": "easy",
    "B1": "intermediate",
    "B2": "intermediate",
    "C1": "hard",
    "C2": "hard",
}

# Source priority order (from lowest to highest priority).
# Higher index = higher authority when conflicts occur.
PRIORITY_ORDER = ["italian", "it_m3"]


def load_extracted_files(data_dir: Path) -> Dict[str, List[dict]]:
    """Load all *-extracted.json files from the Italian data directory."""
    sources = {}
    for file_path in data_dir.glob("*-extracted.json"):
        source_name = file_path.stem.replace("-extracted", "")
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        if isinstance(data, list):
            sources[source_name] = data
        else:
            print(f"Warning: {file_path} does not contain a list, skipping.")
    return sources


def normalize_entry(entry: dict) -> Tuple[str, str]:
    """Return (word, pos) key for merging."""
    return entry["word"].lower().strip(), entry["pos"].lower().strip()


def get_source_priority(source_name: str) -> int:
    """Return priority index for a source (higher = more authoritative)."""
    try:
        return PRIORITY_ORDER.index(source_name)
    except ValueError:
        # If source not in list, assign lowest priority
        return -1


def merge_entries(sources: Dict[str, List[dict]]) -> List[dict]:
    """Merge entries from multiple sources, resolving conflicts by priority."""
    grouped = defaultdict(list)
    for src_name, entries in sources.items():
        for entry in entries:
            key = normalize_entry(entry)
            grouped[key].append((src_name, entry["cefr"], entry))

    merged = []
    conflicts_resolved = 0
    total_multi_source = 0

    for (word, pos), src_entries in grouped.items():
        if len(src_entries) == 1:
            src_name, cefr, original = src_entries[0]
            final_cefr = cefr
            contributing_sources = [src_name]
        else:
            total_multi_source += 1
            sorted_entries = sorted(
                src_entries, key=lambda x: get_source_priority(x[0]), reverse=True
            )
            highest_src, highest_cefr, _ = sorted_entries[0]
            all_cefrs = {e[1] for e in src_entries}
            if len(all_cefrs) > 1:
                conflicts_resolved += 1

            final_cefr = highest_cefr
            contributing_sources = [e[0] for e in src_entries]

        difficulty = DIFFICULTY_MAP.get(final_cefr, "unknown")

        merged.append(
            {
                "word": word,
                "pos": pos,
                "cefr": final_cefr,
                "difficulty": difficulty,
                "sources": sorted(contributing_sources),
            }
        )

    print("Merge statistics:")
    print(f"  Total unique entries: {len(merged)}")
    print(f"  Entries with multiple sources: {total_multi_source}")
    print(f"  Conflicts resolved by priority: {conflicts_resolved}")

    return merged


def print_summary(merged: List[dict]):
    """Print distribution of CEFR levels and difficulty in final dataset."""
    cefr_counts = defaultdict(int)
    diff_counts = defaultdict(int)

    for entry in merged:
        cefr_counts[entry["cefr"]] += 1
        diff_counts[entry["difficulty"]] += 1

    print("\n📊 Final CEFR distribution:")
    for level in sorted(CEFR_LEVELS):
        count = cefr_counts.get(level, 0)
        if count:
            print(f"  {level}: {count}")

    print("\n📊 Final difficulty distribution:")
    for diff in ["easy", "intermediate", "hard"]:
        count = diff_counts.get(diff, 0)
        print(f"  {diff}: {count}")


def main():
    script_dir = Path(__file__).parent
    data_dir = script_dir.parent / "data-sources" / "italian"
    output_dir = script_dir.parent / "datafiles"
    output_file = output_dir / "italian-merged.json"

    if not data_dir.exists():
        print(f"Error: Italian data directory not found: {data_dir}")
        return

    output_dir.mkdir(parents=True, exist_ok=True)

    print(f"Loading extracted files from {data_dir}...")
    sources = load_extracted_files(data_dir)

    if not sources:
        print("No extracted files found.")
        return

    print(f"Found sources: {', '.join(sources.keys())}")
    print(f"Priority order (lowest to highest): {PRIORITY_ORDER}")

    merged = merge_entries(sources)

    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2, ensure_ascii=False)

    print(f"\n✅ Merged dataset written to: {output_file}")
    print_summary(merged)


if __name__ == "__main__":
    main()
```