CEFR Data Pipeline

This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final outputs (english-merged.json, italian-merged.json) are consumed by the database seeding process in packages/db.

Overview

The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.

Supported Languages

  • English (en)
  • Italian (it)

Pipeline Stages

Stage 1: Extraction

Each source file is processed by a dedicated extractor script. The extractor reads the source-specific format, normalizes the data, filters for supported parts of speech, and outputs a standardized JSON file.

Input: Raw source files (JSON, CSV, XLS)
Output: {source}-extracted.json files (same directory as source)

Normalization rules:

  • Words are lowercased and trimmed
  • Part of speech is mapped to supported values (noun, verb)
  • Entries with unsupported POS are skipped
  • CEFR levels are validated against A1-C2
  • Each record includes the source identifier for traceability

Extractor Scripts:

Language   Source         Script
English    cefrj.csv      extraction-scripts/english/extract-cefrj-csv.py
English    en_m3.xls      extraction-scripts/english/extract-en_m3.py
English    octanove.csv   extraction-scripts/english/extract-octanove.py
English    random.json    extraction-scripts/english/extract-random-json.py
Italian    it_m3.xls      extraction-scripts/italian/extract-it_m3.py
Italian    italian.json   extraction-scripts/italian/extract-italian-json.py

Stage 2: Comparison

Before merging, sources are compared to identify agreements and conflicts. This stage is read-only and serves as a quality gate.

Input: All {source}-extracted.json files for a language
Output: Console report showing:

  • Entry counts per source and CEFR level
  • Overlap between sources (words appearing in multiple sources)
  • Agreement rate (sources assigning the same CEFR level)
  • Conflicts (same word/POS with different CEFR levels)
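The report metrics above reduce to grouping extracted records by (word, POS) and inspecting the level each source assigned. A minimal sketch, assuming the standardized record shape from Stage 1 (the function and summary keys are illustrative):

```python
# Hypothetical comparison summary over Stage 1 records.
from collections import defaultdict

def compare(records: list[dict]) -> dict:
    """Group records by (word, pos) and summarize overlaps and conflicts."""
    by_key: defaultdict = defaultdict(dict)  # (word, pos) -> {source: cefr}
    for r in records:
        by_key[(r["word"], r["pos"])][r["source"]] = r["cefr"]

    # Overlap: the same word/POS appears in more than one source.
    overlap = {k: v for k, v in by_key.items() if len(v) > 1}
    # Conflict: overlapping sources disagree on the CEFR level.
    conflicts = {k: v for k, v in overlap.items() if len(set(v.values())) > 1}
    agreements = len(overlap) - len(conflicts)
    return {
        "total": len(by_key),
        "overlap": len(overlap),
        "agreement_rate": agreements / len(overlap) if overlap else 1.0,
        "conflicts": conflicts,
    }
```

Because this stage only reads the extracted files and prints a summary, it can be rerun freely as a quality gate before every merge.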

Comparison Scripts:

Language   Script
English    comparison-scripts/compare-english.py
Italian    comparison-scripts/compare-italian.py

Run from the scripts/ directory:

python comparison-scripts/compare-english.py
python comparison-scripts/compare-italian.py

Stage 3: Merge

Multiple extracted sources are merged into a single authoritative JSON file per language. When the same word/POS appears in multiple sources with different CEFR levels, the conflict is resolved using a predefined priority order.

Input: All {source}-extracted.json files for a language
Output: {language}-merged.json in ../datafiles/

Merge rules:

  • Single source: use that source's CEFR level
  • Multiple sources agree: use the agreed CEFR level
  • Multiple sources conflict: use the level from the highest-priority source
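All three merge rules collapse into one rule: take the level from the highest-priority source present. A minimal sketch, assuming each word/POS key maps to a {source: level} dict as produced in comparison (the function name is illustrative; every source must appear in the priority list):

```python
# Hypothetical conflict resolution for the merge stage.
def resolve(levels_by_source: dict[str, str], priority: list[str]) -> str:
    """Pick the CEFR level from the highest-priority source present.

    `priority` is ordered lowest -> highest, matching the tables below.
    The single-source and all-agree cases fall out of the same rule.
    """
    best = max(levels_by_source, key=priority.index)
    return levels_by_source[best]
```

For example, with the Italian priority order, resolve({"italian": "A1", "it_m3": "A2"}, ["italian", "it_m3"]) yields "A2", since it_m3 outranks italian.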

Difficulty derivation: Difficulty is not extracted from sources. It is derived from the final CEFR level:

  • A1, A2 → easy
  • B1, B2 → intermediate
  • C1, C2 → hard
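The derivation is a pure lookup over the final CEFR level; a sketch (the table name is illustrative):

```python
# CEFR -> difficulty mapping, as listed above.
DIFFICULTY_BY_CEFR = {
    "A1": "easy", "A2": "easy",
    "B1": "intermediate", "B2": "intermediate",
    "C1": "hard", "C2": "hard",
}

def derive_difficulty(cefr: str) -> str:
    """Derive difficulty from the merged, conflict-resolved CEFR level."""
    return DIFFICULTY_BY_CEFR[cefr]
```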

The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.

Merge Scripts & Priorities:

Language   Script                                 Priority (lowest → highest)
English    merge-scripts/merge-english-json.py    random, octanove, cefrj, en_m3
Italian    merge-scripts/merge-italian-json.py    italian, it_m3

Run from the scripts/ directory:

python merge-scripts/merge-english-json.py
python merge-scripts/merge-italian-json.py

Stage 4: Enrichment

The authoritative merged file is consumed by the database package (packages/db) during the seeding or update process. This stage is implemented in TypeScript and is not part of the Python scripts in this directory.

File Organization

scripts/
├── comparison-scripts/
│   ├── compare-english.py
│   └── compare-italian.py          # Stage 2: compare extracted data
├── datafiles/
│   ├── english-merged.json         # Stage 3 output (authoritative)
│   ├── italian-merged.json         # Stage 3 output (authoritative)
│   ├── omw-noun.json
│   └── omw-verb.json
├── data-sources/
│   ├── english/
│   │   ├── cefrj.csv
│   │   ├── cefrj-extracted.json
│   │   ├── en_m3.xls
│   │   ├── en_m3-extracted.json
│   │   ├── octanove.csv
│   │   ├── octanove-extracted.json
│   │   ├── random.json
│   │   └── random-extracted.json
│   ├── french/                     # (future)
│   ├── german/                     # (future)
│   ├── italian/
│   │   ├── it_m3.xls
│   │   ├── it_m3-extracted.json
│   │   ├── italian.json
│   │   └── italian-extracted.json
│   └── spanish/                    # (future)
├── extraction-scripts/
│   ├── english/
│   │   ├── extract-cefrj-csv.py
│   │   ├── extract-en_m3.py
│   │   ├── extract-octanove.py
│   │   └── extract-random-json.py
│   └── italian/
│       ├── extract-it_m3.py
│       └── extract-italian-json.py
├── merge-scripts/
│   ├── merge-english-json.py       # Stage 3: merge into authority
│   └── merge-italian-json.py       # Stage 3: merge into authority
├── extract-own-save-to-json.py     # extracts words from WordNet
├── requirements.txt
└── README.md                   # This file

Extracted files are co-located with their sources for easy traceability. Merged files live in datafiles/ (../datafiles/ relative to the merge scripts).

Source Priority by Language

Source priority determines which CEFR level wins when sources conflict:

English:

  1. en_m3
  2. cefrj
  3. octanove
  4. random

Italian:

  1. it_m3
  2. italian

Priority is defined in each language's merge script (merge-scripts/merge-english-json.py and merge-scripts/merge-italian-json.py). When sources conflict, a higher-priority source overrides a lower-priority one.

Data Flow Summary

Raw Source → Extracted JSON → Merged JSON → Database
    (1)           (2)            (3)           (4)
  1. Extract: Transform source formats to normalized records
  2. Compare: Validate source quality and surface conflicts
  3. Merge: Resolve conflicts, derive difficulty, create authority
  4. Enrich: Write to database (handled in packages/db)
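Stages 1–3 can be chained from the scripts/ directory. The driver below is a hypothetical convenience wrapper, not a checked-in tool; the script paths mirror the layout shown above:

```python
# Hypothetical pipeline driver for stages 1-3 (assumed, not part of the repo).
import subprocess

EXTRACTORS = {
    "english": [
        "extraction-scripts/english/extract-cefrj-csv.py",
        "extraction-scripts/english/extract-en_m3.py",
        "extraction-scripts/english/extract-octanove.py",
        "extraction-scripts/english/extract-random-json.py",
    ],
    "italian": [
        "extraction-scripts/italian/extract-it_m3.py",
        "extraction-scripts/italian/extract-italian-json.py",
    ],
}

def stage_commands(language: str) -> list[list[str]]:
    """Commands for extract, compare, and merge, in execution order."""
    cmds = [["python", script] for script in EXTRACTORS[language]]
    cmds.append(["python", f"comparison-scripts/compare-{language}.py"])
    cmds.append(["python", f"merge-scripts/merge-{language}-json.py"])
    return cmds

def run_pipeline(language: str) -> None:
    for cmd in stage_commands(language):
        subprocess.run(cmd, check=True)  # stop on the first failing stage
```

Stage 4 stays out of this driver on purpose: enrichment belongs to the TypeScript seeding code in packages/db.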

Adding New Sources

To add a new source:

  1. Place the raw file in the appropriate data-sources/{language}/ directory
  2. Create an extractor script in extraction-scripts/{language}/
  3. Run the extractor to generate {source}-extracted.json
  4. Run comparison to assess coverage and conflicts
  5. Update source priority in the merge configuration if needed
  6. Run merge to regenerate the authoritative file
  7. Run enrichment to update the database

Constants and Constraints

The pipeline respects these constraints from the Glossa shared constants:

  • Supported languages: en, it
  • Supported parts of speech: noun, verb
  • CEFR levels: A1, A2, B1, B2, C1, C2
  • Difficulty levels: easy, intermediate, hard

Entries violating these constraints are filtered out during extraction.