feat(scripts): add Italian CEFR data pipeline

- Add extractors for Italian sources: it_m3.xls and italian.json
- Add comparison script (compare-italian.py) to report source overlaps and conflicts
- Add merge script (merge-italian-json.py) with priority order ['italian', 'it_m3']
- Output authoritative dataset to datafiles/italian-merged.json
- Update README to document both English and Italian pipelines
parent 59152950d6
commit 3374bd8b20
9 changed files with 208535 additions and 26 deletions
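
The README changes in this commit describe a comparison stage (overlap, agreement rate, conflicts) and a merge stage driven by a source priority list. The sketch below is a minimal illustration of that idea, not the code in `compare-italian.py` or `merge-italian-json.py`. The record fields (`word`, `pos`, `cefr`, `source`), the function names, and the sample records are assumptions made for the example. The only details taken from the commit are the priority list `['italian', 'it_m3']` and the README's note that priorities run lowest to highest, so `it_m3` wins a conflict.

```python
from collections import defaultdict

# From the commit message; the README lists priorities lowest -> highest.
PRIORITY = ["italian", "it_m3"]


def group_by_key(records):
    """Group records by (word, pos). Field names are assumed, not confirmed."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["word"], rec["pos"])].append(rec)
    return grouped


def compare(records):
    """Report overlap, agreement rate, and conflicts across sources."""
    grouped = group_by_key(records)
    # Keys that appear in more than one source.
    shared = {k: v for k, v in grouped.items() if len({r["source"] for r in v}) > 1}
    # Shared keys whose sources disagree on the CEFR level.
    conflicts = {k: v for k, v in shared.items() if len({r["cefr"] for r in v}) > 1}
    agreement = (len(shared) - len(conflicts)) / len(shared) if shared else 0.0
    return {"overlap": len(shared), "agreement_rate": agreement, "conflicts": sorted(conflicts)}


def merge(records, priority=PRIORITY):
    """Keep one entry per (word, pos); the highest-priority source sets the level."""
    rank = {name: i for i, name in enumerate(priority)}
    merged = []
    for (word, pos), recs in group_by_key(records).items():
        winner = max(recs, key=lambda r: rank[r["source"]])
        merged.append({
            "word": word,
            "pos": pos,
            "cefr": winner["cefr"],
            # List every contributing source, ordered by priority.
            "sources": sorted({r["source"] for r in recs}, key=rank.get),
        })
    return merged


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"word": "casa", "pos": "noun", "cefr": "A1", "source": "italian"},
        {"word": "casa", "pos": "noun", "cefr": "A1", "source": "it_m3"},
        {"word": "bello", "pos": "adjective", "cefr": "A2", "source": "italian"},
        {"word": "bello", "pos": "adjective", "cefr": "B1", "source": "it_m3"},
        {"word": "correre", "pos": "verb", "cefr": "A2", "source": "italian"},
    ]
    print(compare(sample))
    print(merge(sample))
```

In this sketch a word that appears in only one source passes through unchanged, and a conflict is settled purely by position in the priority list. Whether the real scripts normalise POS labels or validate levels before merging is not shown in this commit.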
@@ -1,11 +1,16 @@
 # CEFR Data Pipeline

-This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final output (`english-merged.json`) is consumed by the database seeding process in `packages/db`.
+This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final outputs (`english-merged.json`, `italian-merged.json`) are consumed by the database seeding process in `packages/db`.

 ## Overview

 The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.

+## Supported Languages
+
+- ✅ English (`en`)
+- ✅ Italian (`it`)
+
 ## Pipeline Stages

 ### Stage 1: Extraction
@@ -22,12 +27,16 @@ Each source file is processed by a dedicated extractor script. The extractor rea
 - CEFR levels are validated against A1-C2
 - Each record includes the source identifier for traceability

-**Location:** `extraction-scripts/english/`
-**Scripts:**
-- `extract-cefrj-csv.py`
-- `extract-en_m3.py`
-- `extract-octanove.py`
-- `extract-random-json.py`
+**Extractor Scripts:**
+
+| Language | Source | Script |
+|----------|------------------------|---------------------------------------------------------|
+| English | `cefrj.csv` | `extraction-scripts/english/extract-cefrj-csv.py` |
+| English | `en_m3.xls` | `extraction-scripts/english/extract-en_m3.py` |
+| English | `octanove.csv` | `extraction-scripts/english/extract-octanove.py` |
+| English | `random.json` | `extraction-scripts/english/extract-random-json.py` |
+| Italian | `it_m3.xls` | `extraction-scripts/italian/extract-it_m3.py` |
+| Italian | `italian.json` | `extraction-scripts/italian/extract-italian-json.py` |

 ### Stage 2: Comparison

@@ -39,17 +48,18 @@ Before merging, sources are compared to identify agreements and conflicts. This
 - Overlap between sources (words appearing in multiple sources)
 - Agreement rate (sources assigning the same CEFR level)
 - Conflicts (same word/POS with different CEFR levels)
-- Database coverage (how many extracted words exist in the database)

-**Location:** `comparison-scripts/compare-english.py`
-**Usage:**
+**Comparison Scripts:**

-```bash
-cd scripts/
-python comparison-scripts/compare-english.py
-```
+| Language | Script |
+|----------|-----------------------------------------------|
+| English | `comparison-scripts/compare-english.py` |
+| Italian | `comparison-scripts/compare-italian.py` |

-Conflicts are resolved in the next stage using source priority rules.
+Run from the `scripts/` directory:
+
+python comparison-scripts/compare-english.py
+python comparison-scripts/compare-italian.py

 ### Stage 3: Merge

@@ -71,13 +81,17 @@ Difficulty is not extracted from sources. It is derived from the final CEFR leve

 The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.

-**Location**: merge-scripts/merge-english-json.py
-**Usage:**
+**Merge Scripts & Priorities:**

-```bash
-cd scripts/
-python merge-scripts/merge-english-json.py
-```
+| Language | Script | Priority (lowest → highest) |
+|----------|-------------------------------------------|----------------------------------------------|
+| English | `merge-scripts/merge-english-json.py` | `random`, `octanove`, `cefrj`, `en_m3` |
+| Italian | `merge-scripts/merge-italian-json.py` | `italian`, `it_m3` |
+
+Run from the `scripts/` directory:
+
+python merge-scripts/merge-english-json.py
+python merge-scripts/merge-italian-json.py

 ### Stage 4: Enrichment

@@ -88,9 +102,11 @@ The authoritative merged file is consumed by the database package (packages/db)
 ```
 scripts/
 ├── comparison-scripts/
-│ └── compare-english.py # Stage 2: compare extracted data
+│ ├── compare-english.py
+│ └── compare-italian.py # Stage 2: compare extracted data
 ├── datafiles/
-│ ├── english-merged.json # Stage 3 output (authoritative dataset)
+│ ├── english-merged.json # Stage 3 output (authoritative)
+│ ├── italian-merged.json # Stage 3 output (authoritative)
 │ ├── omw-noun.json
 │ └── omw-verb.json
 ├── data-sources/
@@ -105,7 +121,11 @@ scripts/
 │ │ └── random-extracted.json
 │ ├── french/ # (future)
 │ ├── german/ # (future)
-│ ├── italian/ # (future)
+│ ├── italian/
+│ │ ├── it_m3.xls
+│ │ ├── it_m3-extracted.json
+│ │ ├── italian.json
+│ │ └── italian-extracted.json
 │ └── spanish/ # (future)
 ├── extraction-scripts/
 │ └── english/
@@ -113,6 +133,9 @@ scripts/
 │ ├── extract-en_m3.py
 │ ├── extract-octanove.py
 │ └── extract-random-json.py
+│ └── italian/
+│ ├── extract-it_m3.py
+│ └── extract-italian-json.py
 ├── merge-scripts/
 │ └── merge-english-json.py # Stage 3: merge into authority
 ├── extract-own-save-to-json.py # script to extract words from wordnet