feat(scripts): add Italian CEFR data pipeline

- Add extractors for Italian sources: it_m3.xls and italian.json
- Add comparison script (compare-italian.py) to report source overlaps and conflicts
- Add merge script (merge-italian-json.py) with priority order ['italian', 'it_m3']
- Output authoritative dataset to datafiles/italian-merged.json
- Update README to document both English and Italian pipelines
lila 2026-04-08 18:32:03 +02:00
parent 59152950d6
commit 3374bd8b20
9 changed files with 208535 additions and 26 deletions


@ -1,11 +1,16 @@
# CEFR Data Pipeline
This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final output (`english-merged.json`) is consumed by the database seeding process in `packages/db`.
This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final outputs (`english-merged.json`, `italian-merged.json`) are consumed by the database seeding process in `packages/db`.
## Overview
The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.
## Supported Languages
- ✅ English (`en`)
- ✅ Italian (`it`)
## Pipeline Stages
### Stage 1: Extraction
@ -22,12 +27,16 @@ Each source file is processed by a dedicated extractor script. The extractor rea
- CEFR levels are validated against A1-C2
- Each record includes the source identifier for traceability
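
The two rules above can be illustrated with a minimal sketch. The field names (`word`, `pos`, `cefr`, `source`) and the function name `validate_record` are assumptions made for illustration only; the actual extractor scripts may use a different record layout.

```python
from typing import Dict, Optional

# The six CEFR levels that records are validated against (A1-C2).
VALID_CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}


def validate_record(record: Dict[str, str], source: str) -> Optional[Dict[str, str]]:
    """Return a normalized record tagged with its source, or None if the level is invalid.

    Hypothetical helper: the key names are assumed, not taken from the real extractors.
    """
    level = str(record.get("cefr", "")).strip().upper()
    if level not in VALID_CEFR_LEVELS:
        return None  # reject anything outside A1-C2
    return {
        "word": record["word"],
        "pos": record["pos"],
        "cefr": level,
        "source": source,  # kept for traceability
    }
```

For example, `validate_record({"word": "casa", "pos": "noun", "cefr": "a1"}, "it_m3")` would yield a record with `cefr` normalized to `A1` and `source` set to `it_m3`, while a level such as `D1` would be rejected.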
**Location:** `extraction-scripts/english/`
**Scripts:**
- `extract-cefrj-csv.py`
- `extract-en_m3.py`
- `extract-octanove.py`
- `extract-random-json.py`
**Extractor Scripts:**
| Language | Source | Script |
|----------|------------------------|---------------------------------------------------------|
| English | `cefrj.csv` | `extraction-scripts/english/extract-cefrj-csv.py` |
| English | `en_m3.xls` | `extraction-scripts/english/extract-en_m3.py` |
| English | `octanove.csv` | `extraction-scripts/english/extract-octanove.py` |
| English | `random.json` | `extraction-scripts/english/extract-random-json.py` |
| Italian | `it_m3.xls` | `extraction-scripts/italian/extract-it_m3.py` |
| Italian | `italian.json` | `extraction-scripts/italian/extract-italian-json.py` |
### Stage 2: Comparison
@ -39,17 +48,18 @@ Before merging, sources are compared to identify agreements and conflicts. This
- Overlap between sources (words appearing in multiple sources)
- Agreement rate (sources assigning the same CEFR level)
- Conflicts (same word/POS with different CEFR levels)
- Database coverage (how many extracted words exist in the database)
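
As a rough illustration of the first three metrics listed above, the following sketch compares two extracted sources keyed on word and POS. It is not the code in `compare-english.py` or `compare-italian.py`; the record keys (`word`, `pos`, `cefr`) and the function name `compare_sources` are assumptions, and database coverage is left out because it needs a database connection.

```python
from typing import Dict, List


def compare_sources(a: List[Dict[str, str]], b: List[Dict[str, str]]) -> dict:
    """Report overlap, agreement rate, and conflicts between two sources.

    Hypothetical helper: assumes each record has 'word', 'pos', and 'cefr' keys.
    """
    index_a = {(r["word"], r["pos"]): r["cefr"] for r in a}
    index_b = {(r["word"], r["pos"]): r["cefr"] for r in b}

    # Overlap: word/POS pairs present in both sources.
    shared = index_a.keys() & index_b.keys()

    # Conflicts: same word/POS, different CEFR level.
    conflicts = sorted(key for key in shared if index_a[key] != index_b[key])

    # Agreement rate: share of overlapping entries with the same level.
    agreement = (len(shared) - len(conflicts)) / len(shared) if shared else 0.0

    return {
        "overlap": len(shared),
        "agreement_rate": agreement,
        "conflicts": conflicts,
    }
```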
**Location:** `comparison-scripts/compare-english.py`
**Usage:**
**Comparison Scripts:**
```bash
cd scripts/
python comparison-scripts/compare-english.py
```
| Language | Script |
|----------|-----------------------------------------------|
| English | `comparison-scripts/compare-english.py` |
| Italian | `comparison-scripts/compare-italian.py` |
Conflicts are resolved in the next stage using source priority rules.
Run from the `scripts/` directory:
python comparison-scripts/compare-english.py
python comparison-scripts/compare-italian.py
### Stage 3: Merge
@ -71,13 +81,17 @@ Difficulty is not extracted from sources. It is derived from the final CEFR leve
The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.
**Location**: merge-scripts/merge-english-json.py
**Usage:**
**Merge Scripts & Priorities:**
```bash
cd scripts/
python merge-scripts/merge-english-json.py
```
| Language | Script | Priority (lowest → highest) |
|----------|-------------------------------------------|----------------------------------------------|
| English | `merge-scripts/merge-english-json.py` | `random`, `octanove`, `cefrj`, `en_m3` |
| Italian | `merge-scripts/merge-italian-json.py` | `italian`, `it_m3` |
Run from the `scripts/` directory:
python merge-scripts/merge-english-json.py
python merge-scripts/merge-italian-json.py
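
The priority rule in the table above can be sketched as follows. This is an illustrative assumption about how a lowest-to-highest priority list might be applied, not the actual contents of the merge scripts; the record keys and the name `merge_by_priority` are hypothetical, and the derived difficulty field is omitted.

```python
from typing import Dict, List, Tuple


def merge_by_priority(
    records_by_source: Dict[str, List[Dict[str, str]]],
    priority: List[str],
) -> List[dict]:
    """Merge records so that later (higher-priority) sources win on conflict.

    Hypothetical helper: `priority` is ordered lowest -> highest,
    e.g. ['italian', 'it_m3'].
    """
    merged: Dict[Tuple[str, str], dict] = {}
    for source in priority:  # walk from lowest to highest priority
        for record in records_by_source.get(source, []):
            key = (record["word"], record["pos"])
            entry = merged.setdefault(
                key,
                {"word": record["word"], "pos": record["pos"], "sources": []},
            )
            # A higher-priority source overwrites the level set by a lower one.
            entry["cefr"] = record["cefr"]
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return list(merged.values())
```

With the Italian order `['italian', 'it_m3']`, a word rated differently by the two sources would keep the `it_m3` level while still listing both sources as contributors.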
### Stage 4: Enrichment
@ -88,9 +102,11 @@ The authoritative merged file is consumed by the database package (packages/db)
```
scripts/
├── comparison-scripts/
│ └── compare-english.py # Stage 2: compare extracted data
│ ├── compare-english.py
│ └── compare-italian.py # Stage 2: compare extracted data
├── datafiles/
│ ├── english-merged.json # Stage 3 output (authoritative dataset)
│ ├── english-merged.json # Stage 3 output (authoritative)
│ ├── italian-merged.json # Stage 3 output (authoritative)
│ ├── omw-noun.json
│ └── omw-verb.json
├── data-sources/
@ -105,7 +121,11 @@ scripts/
│ │ └── random-extracted.json
│ ├── french/ # (future)
│ ├── german/ # (future)
│ ├── italian/ # (future)
│ ├── italian/
│ │ ├── it_m3.xls
│ │ ├── it_m3-extracted.json
│ │ ├── italian.json
│ │ └── italian-extracted.json
│ └── spanish/ # (future)
├── extraction-scripts/
│ └── english/
@ -113,6 +133,9 @@ scripts/
│ ├── extract-en_m3.py
│ ├── extract-octanove.py
│ └── extract-random-json.py
│ └── italian/
│ ├── extract-it_m3.py
│ └── extract-italian-json.py
├── merge-scripts/
│ └── merge-english-json.py # Stage 3: merge into authority
├── extract-own-save-to-json.py # script to extract words from wordnet