CEFR Data Pipeline
This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final outputs (english-merged.json, italian-merged.json) are consumed by the database seeding process in packages/db.
Overview
The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the lila database package to update translation records.
Supported Languages
- ✅ English (en)
- ✅ Italian (it)
Pipeline Stages
Stage 1: Extraction
Each source file is processed by a dedicated extractor script. The extractor reads the source-specific format, normalizes the data, filters for supported parts of speech, and outputs a standardized JSON file.
Input: Raw source files (JSON, CSV, XLS)
Output: {source}-extracted.json files (same directory as source)
Normalization rules:
- Words are lowercased and trimmed
- Part of speech is mapped to supported values (noun, verb)
- Entries with unsupported POS are skipped
- CEFR levels are validated against A1-C2
- Each record includes the source identifier for traceability
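The normalization rules above can be sketched as a single function. This is a minimal illustration, not the code from the actual extractors: the record field names (`word`, `pos`, `cefr`, `source`) and the `POS_ALIASES` mapping are assumptions made for the example.

```python
from typing import Optional

SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}

# Hypothetical mapping from source-specific POS tags to supported values.
POS_ALIASES = {"n": "noun", "noun": "noun", "v": "verb", "verb": "verb"}


def normalize_entry(word: str, pos: str, cefr: str, source: str) -> Optional[dict]:
    """Return a normalized record, or None if the entry should be skipped."""
    clean_word = word.strip().lower()
    mapped_pos = POS_ALIASES.get(pos.strip().lower())  # None for unsupported POS
    level = cefr.strip().upper()
    if not clean_word or mapped_pos not in SUPPORTED_POS or level not in CEFR_LEVELS:
        return None
    # The source identifier is kept on every record for traceability.
    return {"word": clean_word, "pos": mapped_pos, "cefr": level, "source": source}
```

Returning `None` for rejected entries lets the caller count skipped rows separately from accepted ones, which can help when checking a new source.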
Extractor Scripts:
| Language | Source | Script |
|---|---|---|
| English | cefrj.csv | extraction-scripts/english/extract-cefrj-csv.py |
| English | en_m3.xls | extraction-scripts/english/extract-en_m3.py |
| English | octanove.csv | extraction-scripts/english/extract-octanove.py |
| English | random.json | extraction-scripts/english/extract-random-json.py |
| Italian | it_m3.xls | extraction-scripts/italian/extract-it_m3.py |
| Italian | italian.json | extraction-scripts/italian/extract-italian-json.py |
Stage 2: Comparison
Before merging, sources are compared to identify agreements and conflicts. This stage is read-only and serves as a quality gate.
Input: All {source}-extracted.json files for a language
Output: Console report showing:
- Entry counts per source and CEFR level
- Overlap between sources (words appearing in multiple sources)
- Agreement rate (sources assigning the same CEFR level)
- Conflicts (same word/POS with different CEFR levels)
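As a rough sketch of how such a report could be computed, the function below takes already-loaded extracted records grouped by source name. It is not the code in the comparison scripts; the function name and the record fields (`word`, `pos`, `cefr`) are assumptions for illustration.

```python
from collections import Counter, defaultdict


def compare_sources(sources: dict[str, list[dict]]) -> dict:
    """Summarize counts, overlap, agreement, and conflicts across sources."""
    # Entry counts per source and CEFR level.
    counts = {name: Counter(rec["cefr"] for rec in recs) for name, recs in sources.items()}

    # (word, pos) -> {source: cefr}
    levels_by_key = defaultdict(dict)
    for name, recs in sources.items():
        for rec in recs:
            levels_by_key[(rec["word"], rec["pos"])][name] = rec["cefr"]

    # Overlap: keys seen in more than one source.
    overlap = {key: lv for key, lv in levels_by_key.items() if len(lv) > 1}
    # Conflicts: overlapping keys where the sources disagree on the level.
    conflicts = {key: lv for key, lv in overlap.items() if len(set(lv.values())) > 1}
    agreed = len(overlap) - len(conflicts)
    rate = agreed / len(overlap) if overlap else None
    return {"counts": counts, "overlap": len(overlap), "agreement_rate": rate, "conflicts": conflicts}
```

The function only reads its input and returns a summary, which matches the read-only nature of this stage.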
Comparison Scripts:
| Language | Script |
|---|---|
| English | comparison-scripts/compare-english.py |
| Italian | comparison-scripts/compare-italian.py |
Run from the scripts/ directory:
python comparison-scripts/compare-english.py
python comparison-scripts/compare-italian.py
Stage 3: Merge
Multiple extracted sources are merged into a single authoritative JSON file per language. When the same word/POS appears in multiple sources with different CEFR levels, the conflict is resolved using a predefined priority order.
Input: All {source}-extracted.json files for a language
Output: {language}-merged.json in ../datafiles/
Merge rules:
- Single source: use that source's CEFR level
- Multiple sources agree: use the agreed CEFR level
- Multiple sources conflict: use the level from the highest-priority source
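All three merge rules reduce to "take the record from the highest-priority source that contains the entry". The sketch below illustrates this with the English priority order documented in this README; the function name and record fields are assumptions, and the real merge scripts may be structured differently.

```python
from collections import defaultdict

# English priority from lowest to highest, as documented in this README.
ENGLISH_PRIORITY = ("random", "octanove", "cefrj", "en_m3")


def merge_records(records: list[dict], priority: tuple = ENGLISH_PRIORITY) -> list[dict]:
    """Merge extracted records into one entry per (word, pos)."""
    # A source missing from `priority` raises KeyError rather than being ranked silently.
    rank = {source: position for position, source in enumerate(priority)}

    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["word"], rec["pos"])].append(rec)

    merged = []
    for (word, pos), group in sorted(grouped.items()):
        # Single source, agreement, and conflict are all covered by this choice.
        winner = max(group, key=lambda rec: rank[rec["source"]])
        merged.append({
            "word": word,
            "pos": pos,
            "cefr": winner["cefr"],
            "sources": sorted({rec["source"] for rec in group}),
        })
    return merged
```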
Difficulty derivation: Difficulty is not extracted from sources. It is derived from the final CEFR level:
- A1, A2 → easy
- B1, B2 → intermediate
- C1, C2 → hard
The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.
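The difficulty derivation is a fixed lookup on the final CEFR level. A minimal sketch, assuming each merged entry is a dict with a `cefr` key (the field names and the helper name are illustrative):

```python
DIFFICULTY_BY_CEFR = {
    "A1": "easy", "A2": "easy",
    "B1": "intermediate", "B2": "intermediate",
    "C1": "hard", "C2": "hard",
}


def add_difficulty(entry: dict) -> dict:
    """Return a copy of the entry with the derived difficulty attached."""
    return {**entry, "difficulty": DIFFICULTY_BY_CEFR[entry["cefr"]]}
```

Because difficulty is derived only after conflicts are resolved, it always matches the CEFR level stored in the merged file.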
Merge Scripts & Priorities:
| Language | Script | Priority (lowest → highest) |
|---|---|---|
| English | merge-scripts/merge-english-json.py | random, octanove, cefrj, en_m3 |
| Italian | merge-scripts/merge-italian-json.py | italian, it_m3 |
Run from the scripts/ directory:
python merge-scripts/merge-english-json.py
python merge-scripts/merge-italian-json.py
Stage 4: Enrichment
The authoritative merged file is consumed by the database package (packages/db) during the seeding or update process. This stage is implemented in TypeScript and is not part of the Python scripts in this directory.
File Organization
scripts/
├── comparison-scripts/
│   ├── compare-english.py
│   └── compare-italian.py          # Stage 2: compare extracted data
├── datafiles/
│   ├── english-merged.json         # Stage 3 output (authoritative)
│   ├── italian-merged.json         # Stage 3 output (authoritative)
│   ├── omw-noun.json
│   └── omw-verb.json
├── data-sources/
│   ├── english/
│   │   ├── cefrj.csv
│   │   ├── cefrj-extracted.json
│   │   ├── en_m3.xls
│   │   ├── en_m3-extracted.json
│   │   ├── octanove.csv
│   │   ├── octanove-extracted.json
│   │   ├── random.json
│   │   └── random-extracted.json
│   ├── french/                     # (future)
│   ├── german/                     # (future)
│   ├── italian/
│   │   ├── it_m3.xls
│   │   ├── it_m3-extracted.json
│   │   ├── italian.json
│   │   └── italian-extracted.json
│   └── spanish/                    # (future)
├── extraction-scripts/
│   ├── english/
│   │   ├── extract-cefrj-csv.py
│   │   ├── extract-en_m3.py
│   │   ├── extract-octanove.py
│   │   └── extract-random-json.py
│   └── italian/
│       ├── extract-it_m3.py
│       └── extract-italian-json.py
├── merge-scripts/
│   └── merge-english-json.py       # Stage 3: merge into authority
├── extract-own-save-to-json.py     # script to extract words from wordnet
├── requirements.txt
└── README.md                       # This file
Extracted files are co-located with their sources for easy traceability. Merged files live in ../datafiles/.
Source Priority by Language
Source priority determines which CEFR level wins when sources conflict. Sources are listed from highest to lowest priority:
English:
- en_m3
- cefrj
- octanove
- random
Italian:
- it_m3
- italian
Higher-priority sources override lower-priority sources when conflicts occur. Priority is defined in the merge configuration; for English, this is in merge-scripts/merge-english-json.py.
Data Flow Summary
Raw Source → Extracted JSON → Merged JSON → Database
(1) (2) (3) (4)
- (1) Extract: Transform source formats to normalized records
- (2) Compare: Validate source quality and surface conflicts
- (3) Merge: Resolve conflicts, derive difficulty, create authority
- (4) Enrich: Write to database (handled in packages/db)
Adding New Sources
To add a new source:
- Place the raw file in the appropriate data-sources/{language}/ directory
- Create an extractor script in extraction-scripts/{language}/
- Run the extractor to generate {source}-extracted.json
- Run comparison to assess coverage and conflicts
- Update source priority in the merge configuration if needed
- Run merge to regenerate the authoritative file
- Run enrichment to update the database
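As a starting point for the extractor step, a new CSV source could be handled by a skeleton like the one below. It is only a sketch: the column names (`word`, `pos`, `cefr`), the function name, and the output record fields are hypothetical and need to be adapted to the real source layout.

```python
import csv
import json
from pathlib import Path

SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}


def extract_csv(source_path: Path, source_id: str) -> Path:
    """Read a raw CSV and write {source}-extracted.json beside it."""
    records = []
    with source_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Hypothetical column names; adjust to the real source layout.
            word = row["word"].strip().lower()
            pos = row["pos"].strip().lower()
            cefr = row["cefr"].strip().upper()
            if word and pos in SUPPORTED_POS and cefr in CEFR_LEVELS:
                records.append({"word": word, "pos": pos, "cefr": cefr, "source": source_id})

    # Co-locate the extracted file with its source for traceability.
    out_path = source_path.with_name(f"{source_id}-extracted.json")
    out_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return out_path
```

`ensure_ascii=False` keeps accented characters readable in the output, which matters for Italian entries.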
Constants and Constraints
The pipeline respects these constraints from the lila shared constants:
- Supported languages: en, it
- Supported parts of speech: noun, verb
- CEFR levels: A1, A2, B1, B2, C1, C2
- Difficulty levels: easy, intermediate, hard
Entries violating these constraints are filtered out during extraction.