feat(scripts): add Italian CEFR data pipeline

- Add extractors for Italian sources: it_m3.xls and italian.json
- Add comparison script (compare-italian.py) to report source overlaps and conflicts
- Add merge script (merge-italian-json.py) with priority order ['italian', 'it_m3']
- Output authoritative dataset to datafiles/italian-merged.json
- Update README to document both English and Italian pipelines
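
For reviewers, the priority rule behind the merge step can be sketched roughly as follows. This is an illustration only, not the contents of merge-italian-json.py: the record keys (`word`, `pos`, `cefr`, `source`) and the function name `merge_by_priority` are assumptions made for the example, and the order is read as lowest to highest, as the README table in this commit states.

```python
from typing import Dict, List, Tuple

# Priority order from this commit, read as lowest -> highest priority.
PRIORITY = ["italian", "it_m3"]

# Hypothetical record shape: {"word", "pos", "cefr", "source"}.
Record = Dict[str, str]


def merge_by_priority(
    records: List[Record], priority: List[str] = PRIORITY
) -> Dict[Tuple[str, str], dict]:
    """Keep one CEFR level per (word, pos); the higher-priority source wins."""
    rank = {name: i for i, name in enumerate(priority)}
    merged: Dict[Tuple[str, str], dict] = {}
    for rec in records:
        key = (rec["word"], rec["pos"])
        entry = merged.get(key)
        if entry is None:
            # First time this word/POS pair is seen.
            merged[key] = {
                "cefr": rec["cefr"],
                "winner": rec["source"],
                "sources": [rec["source"]],
            }
            continue
        # Record every contributing source for traceability.
        entry["sources"].append(rec["source"])
        # Overwrite the level only if this source strictly outranks the current winner.
        if rank[rec["source"]] > rank[entry["winner"]]:
            entry["cefr"] = rec["cefr"]
            entry["winner"] = rec["source"]
    return merged


# Made-up sample data, not taken from the real source files.
sample = [
    {"word": "casa", "pos": "noun", "cefr": "A1", "source": "italian"},
    {"word": "casa", "pos": "noun", "cefr": "A2", "source": "it_m3"},
    {"word": "andare", "pos": "verb", "cefr": "A1", "source": "italian"},
]
result = merge_by_priority(sample)
```

In this sketch the conflicting entry for `casa` resolves to the `it_m3` level because `it_m3` comes last in the priority list, while both sources stay listed as contributors.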
2026-04-08 18:32:03 +02:00
commit 3374bd8b20
parent 59152950d6
9 changed files with 208535 additions and 26 deletions
--- a/scripts/README.md
+++ b/scripts/README.md
@@ -1,11 +1,16 @@
 # CEFR Data Pipeline

-This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final output (`english-merged.json`) is consumed by the database seeding process in `packages/db`.
+This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final outputs (`english-merged.json`, `italian-merged.json`) are consumed by the database seeding process in `packages/db`.

 ## Overview

 The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.

+## Supported Languages
+
+- ✅ English (`en`)
+- ✅ Italian (`it`)
+
 ## Pipeline Stages

 ### Stage 1: Extraction
@@ -22,12 +27,16 @@ Each source file is processed by a dedicated extractor script. The extractor rea
 - CEFR levels are validated against A1-C2
 - Each record includes the source identifier for traceability

-**Location:** `extraction-scripts/english/`  
-**Scripts:**
- `extract-cefrj-csv.py`
- `extract-en_m3.py`
- `extract-octanove.py`
- `extract-random-json.py`
+**Extractor Scripts:**
+
+| Language | Source                 | Script                                                  |
+|----------|------------------------|---------------------------------------------------------|
+| English  | `cefrj.csv`            | `extraction-scripts/english/extract-cefrj-csv.py`       |
+| English  | `en_m3.xls`            | `extraction-scripts/english/extract-en_m3.py`           |
+| English  | `octanove.csv`         | `extraction-scripts/english/extract-octanove.py`        |
+| English  | `random.json`          | `extraction-scripts/english/extract-random-json.py`     |
+| Italian  | `it_m3.xls`            | `extraction-scripts/italian/extract-it_m3.py`           |
+| Italian  | `italian.json`         | `extraction-scripts/italian/extract-italian-json.py`    |

 ### Stage 2: Comparison

@@ -39,17 +48,18 @@ Before merging, sources are compared to identify agreements and conflicts. This
 - Overlap between sources (words appearing in multiple sources)
 - Agreement rate (sources assigning the same CEFR level)
 - Conflicts (same word/POS with different CEFR levels)
- Database coverage (how many extracted words exist in the database)

-**Location:** `comparison-scripts/compare-english.py`  
-**Usage:**
+**Comparison Scripts:**

-```bash
-cd scripts/
-python comparison-scripts/compare-english.py
-```
+| Language | Script                                        |
+|----------|-----------------------------------------------|
+| English  | `comparison-scripts/compare-english.py`       |
+| Italian  | `comparison-scripts/compare-italian.py`       |

-Conflicts are resolved in the next stage using source priority rules.
+Run from the `scripts/` directory:
+
+    python comparison-scripts/compare-english.py
+    python comparison-scripts/compare-italian.py

 ### Stage 3: Merge

@@ -71,13 +81,17 @@ Difficulty is not extracted from sources. It is derived from the final CEFR leve

 The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.

-**Location**: merge-scripts/merge-english-json.py
-**Usage:**
+**Merge Scripts & Priorities:**

-```bash
-cd scripts/
-python merge-scripts/merge-english-json.py
-```
+| Language | Script                                    | Priority (lowest → highest)                  |
+|----------|-------------------------------------------|----------------------------------------------|
+| English  | `merge-scripts/merge-english-json.py`     | `random`, `octanove`, `cefrj`, `en_m3`       |
+| Italian  | `merge-scripts/merge-italian-json.py`     | `italian`, `it_m3`                           |
+
+Run from the `scripts/` directory:
+
+    python merge-scripts/merge-english-json.py
+    python merge-scripts/merge-italian-json.py

 ### Stage 4: Enrichment

@@ -88,9 +102,11 @@ The authoritative merged file is consumed by the database package (packages/db)
 ```
 scripts/
 ├── comparison-scripts/
-│   └── compare-english.py          # Stage 2: compare extracted data
+│   ├── compare-english.py          # Stage 2: compare extracted data
+│   └── compare-italian.py          # Stage 2: compare extracted data
 ├── datafiles/
-│   ├── english-merged.json         # Stage 3 output (authoritative dataset)
+│   ├── english-merged.json # Stage 3 output (authoritative)
+│   ├── italian-merged.json # Stage 3 output (authoritative)
 │   ├── omw-noun.json
 │   └── omw-verb.json
 ├── data-sources/
@@ -105,7 +121,11 @@ scripts/
 │   │   └── random-extracted.json
 │   ├── french/                     # (future)
 │   ├── german/                     # (future)
-│   ├── italian/                    # (future)
+│   ├── italian/
+│   │   ├── it_m3.xls
+│   │   ├── it_m3-extracted.json
+│   │   ├── italian.json
+│   │   └── italian-extracted.json
 │   └── spanish/                    # (future)
 ├── extraction-scripts/
 │   └── english/
@@ -113,6 +133,9 @@ scripts/
 │       ├── extract-en_m3.py
 │       ├── extract-octanove.py
 │       └── extract-random-json.py
+│   └── italian/
+│       ├── extract-it_m3.py
+│       └── extract-italian-json.py
 ├── merge-scripts/
 │   └── merge-english-json.py       # Stage 3: merge into authority
 ├── extract-own-save-to-json.py # script to extract words from wordnet