# CEFR Data Pipeline

This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final outputs (`english-merged.json`, `italian-merged.json`) are consumed by the database seeding process in `packages/db`.

## Overview

The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.

## Supported Languages

- ✅ English (`en`)
- ✅ Italian (`it`)

## Pipeline Stages

### Stage 1: Extraction
Each source file is processed by a dedicated extractor script. The extractor reads the source-specific format, normalizes the data, filters for supported parts of speech, and outputs a standardized JSON file.

**Input:** Raw source files (JSON, CSV, XLS)

**Output:** `{source}-extracted.json` files (same directory as source)

**Normalization rules:**

- Words are lowercased and trimmed
- Part of speech is mapped to supported values (noun, verb)
- Entries with unsupported POS are skipped
- CEFR levels are validated against A1-C2
- Each record includes the source identifier for traceability
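As a rough illustration, the rules above amount to something like the following sketch (identifiers such as `normalize_entry` are illustrative, not the actual internals of the extractor scripts):

```python
# Illustrative normalization pass, mirroring the rules listed above.
SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}

def normalize_entry(word, pos, level, source):
    """Return a standardized record, or None if the entry is filtered out."""
    word = word.strip().lower()
    pos = pos.strip().lower()
    level = level.strip().upper()
    if pos not in SUPPORTED_POS:   # unsupported POS: skip
        return None
    if level not in CEFR_LEVELS:   # invalid CEFR level: skip
        return None
    return {"word": word, "pos": pos, "cefr": level, "source": source}
```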

**Extractor Scripts:**

| Language | Source         | Script                                               |
|----------|----------------|------------------------------------------------------|
| English  | `cefrj.csv`    | `extraction-scripts/english/extract-cefrj-csv.py`    |
| English  | `en_m3.xls`    | `extraction-scripts/english/extract-en_m3.py`        |
| English  | `octanove.csv` | `extraction-scripts/english/extract-octanove.py`     |
| English  | `random.json`  | `extraction-scripts/english/extract-random-json.py`  |
| Italian  | `it_m3.xls`    | `extraction-scripts/italian/extract-it_m3.py`        |
| Italian  | `italian.json` | `extraction-scripts/italian/extract-italian-json.py` |
### Stage 2: Comparison

Before merging, sources are compared to identify agreements and conflicts. This stage is read-only and serves as a quality gate.

**Input:** All `{source}-extracted.json` files for a language

**Output:** Console report showing:

- Entry counts per source and CEFR level
- Overlap between sources (words appearing in multiple sources)
- Agreement rate (sources assigning the same CEFR level)
- Conflicts (same word/POS with different CEFR levels)
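The core of the comparison can be sketched like this (a minimal sketch; `compare_sources` and the record shape are assumptions, not the actual script API):

```python
from collections import defaultdict

def compare_sources(extracted):
    """extracted maps a source name to a list of {word, pos, cefr} records."""
    levels = defaultdict(dict)  # (word, pos) -> {source: cefr}
    for source, entries in extracted.items():
        for e in entries:
            levels[(e["word"], e["pos"])][source] = e["cefr"]
    # Words present in more than one source
    overlap = {k: v for k, v in levels.items() if len(v) > 1}
    # Overlapping words where sources disagree on the CEFR level
    conflicts = {k: v for k, v in overlap.items() if len(set(v.values())) > 1}
    agreement_rate = (len(overlap) - len(conflicts)) / len(overlap) if overlap else 1.0
    return overlap, conflicts, agreement_rate
```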

**Comparison Scripts:**

| Language | Script                                  |
|----------|-----------------------------------------|
| English  | `comparison-scripts/compare-english.py` |
| Italian  | `comparison-scripts/compare-italian.py` |

Run from the `scripts/` directory:

```
python comparison-scripts/compare-english.py
python comparison-scripts/compare-italian.py
```
### Stage 3: Merge

Multiple extracted sources are merged into a single authoritative JSON file per language. When the same word/POS appears in multiple sources with different CEFR levels, the conflict is resolved using a predefined priority order.

**Input:** All `{source}-extracted.json` files for a language

**Output:** `{language}-merged.json` in `../datafiles/`

**Merge rules:**

- Single source: use that source's CEFR level
- Multiple sources agree: use the agreed CEFR level
- Multiple sources conflict: use the level from the highest-priority source
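Conflict resolution by priority might look like this sketch (the priority list mirrors the English order documented later in this README; `resolve_level` and `PRIORITY` are hypothetical names, not the merge scripts' actual internals):

```python
# English source priority, lowest -> highest
PRIORITY = ["random", "octanove", "cefrj", "en_m3"]

def resolve_level(levels_by_source):
    """Pick the CEFR level reported by the highest-priority source present."""
    winner = max(levels_by_source, key=PRIORITY.index)
    return levels_by_source[winner]
```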

**Difficulty derivation:**

Difficulty is not extracted from sources. It is derived from the final CEFR level:

- A1, A2 → easy
- B1, B2 → intermediate
- C1, C2 → hard
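In code, the derivation is a straightforward lookup (a sketch; the actual merge scripts may name this differently):

```python
# CEFR level -> difficulty, per the mapping above
CEFR_TO_DIFFICULTY = {
    "A1": "easy", "A2": "easy",
    "B1": "intermediate", "B2": "intermediate",
    "C1": "hard", "C2": "hard",
}

def derive_difficulty(cefr):
    """Derive the difficulty label from a validated CEFR level."""
    return CEFR_TO_DIFFICULTY[cefr.upper()]
```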

The merged file includes both the CEFR level and the derived difficulty, plus a list of the sources that contributed to each entry.

**Merge Scripts & Priorities:**

| Language | Script                                | Priority (lowest → highest)            |
|----------|---------------------------------------|----------------------------------------|
| English  | `merge-scripts/merge-english-json.py` | `random`, `octanove`, `cefrj`, `en_m3` |
| Italian  | `merge-scripts/merge-italian-json.py` | `italian`, `it_m3`                     |

Run from the `scripts/` directory:

```
python merge-scripts/merge-english-json.py
python merge-scripts/merge-italian-json.py
```
### Stage 4: Enrichment

The authoritative merged file is consumed by the database package (`packages/db`) during the seeding or update process. This stage is implemented in TypeScript and is not part of the Python scripts in this directory.

## File Organization
```
scripts/
├── comparison-scripts/
│   ├── compare-english.py
│   └── compare-italian.py            # Stage 2: compare extracted data
├── datafiles/
│   ├── english-merged.json           # Stage 3 output (authoritative)
│   ├── italian-merged.json           # Stage 3 output (authoritative)
│   ├── omw-noun.json
│   └── omw-verb.json
├── data-sources/
│   ├── english/
│   │   ├── cefrj.csv
│   │   ├── cefrj-extracted.json
│   │   ├── en_m3.xls
│   │   ├── en_m3-extracted.json
│   │   ├── octanove.csv
│   │   ├── octanove-extracted.json
│   │   ├── random.json
│   │   └── random-extracted.json
│   ├── french/                       # (future)
│   ├── german/                       # (future)
│   ├── italian/
│   │   ├── it_m3.xls
│   │   ├── it_m3-extracted.json
│   │   ├── italian.json
│   │   └── italian-extracted.json
│   └── spanish/                      # (future)
├── extraction-scripts/
│   ├── english/
│   │   ├── extract-cefrj-csv.py
│   │   ├── extract-en_m3.py
│   │   ├── extract-octanove.py
│   │   └── extract-random-json.py
│   └── italian/
│       ├── extract-it_m3.py
│       └── extract-italian-json.py
├── merge-scripts/
│   ├── merge-english-json.py         # Stage 3: merge into authority
│   └── merge-italian-json.py
├── extract-own-save-to-json.py       # extracts words from WordNet
├── requirements.txt
└── README.md                         # This file
```

Extracted files are co-located with their sources for easy traceability. Merged files live in `../datafiles/`.
## Source Priority by Language

Source priority determines which CEFR level wins when sources conflict:

**English:**

1. en_m3
2. cefrj
3. octanove
4. random

**Italian:**

1. it_m3
2. italian
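In code form, the same priorities could be kept in one small mapping (hypothetical names; the actual merge scripts may structure this differently):

```python
# Per-language source priority, highest -> lowest, matching the lists above.
SOURCE_PRIORITY = {
    "en": ["en_m3", "cefrj", "octanove", "random"],
    "it": ["it_m3", "italian"],
}
```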

Priority is defined in each merge script (`merge-scripts/merge-english-json.py` and `merge-scripts/merge-italian-json.py`). Higher-priority sources override lower-priority sources when conflicts occur.
## Data Flow Summary

```
Raw Source → Extracted JSON → Merged JSON → Database
    (1)           (2)             (3)          (4)
```

1. **Extract:** Transform source formats to normalized records
2. **Compare:** Validate source quality and surface conflicts
3. **Merge:** Resolve conflicts, derive difficulty, create authority
4. **Enrich:** Write to database (handled in `packages/db`)
## Adding New Sources

To add a new source:

1. Place the raw file in the appropriate `data-sources/{language}/` directory
2. Create an extractor script in `extraction-scripts/{language}/`
3. Run the extractor to generate `{source}-extracted.json`
4. Run comparison to assess coverage and conflicts
5. Update source priority in the merge configuration if needed
6. Run merge to regenerate the authoritative file
7. Run enrichment to update the database
## Constants and Constraints

The pipeline respects these constraints from the Glossa shared constants:

- **Supported languages:** en, it
- **Supported parts of speech:** noun, verb
- **CEFR levels:** A1, A2, B1, B2, C1, C2
- **Difficulty levels:** easy, intermediate, hard

Entries violating these constraints are filtered out during extraction.