Extraction, comparison, and merging scripts for English are done; the final english.json exists.

This commit is contained in:
lila 2026-04-08 17:50:25 +02:00
parent 3596f76492
commit 59152950d6
14 changed files with 206319 additions and 0 deletions

scripts/README.md
# CEFR Data Pipeline
This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final output (`english-merged.json`) is consumed by the database seeding process in `packages/db`.
## Overview
The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.
## Pipeline Stages
### Stage 1: Extraction
Each source file is processed by a dedicated extractor script. The extractor reads the source-specific format, normalizes the data, filters for supported parts of speech, and outputs a standardized JSON file.
**Input:** Raw source files (JSON, CSV, XLS)
**Output:** `{source}-extracted.json` files (same directory as source)
**Normalization rules:**
- Words are lowercased and trimmed
- Part of speech is mapped to supported values (noun, verb)
- Entries with unsupported POS are skipped
- CEFR levels are validated against A1-C2
- Each record includes the source identifier for traceability
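The normalization rules above can be sketched as a small helper (a minimal illustration; `normalize_entry` and its field names are hypothetical, not the actual extractor code):

```python
SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}

def normalize_entry(word, pos, cefr, source):
    """Return a standardized record, or None if the entry is filtered out."""
    word = word.strip().lower()
    pos = pos.strip().lower()
    cefr = cefr.strip().upper()
    if pos not in SUPPORTED_POS:
        return None  # unsupported part of speech: skip
    if cefr not in CEFR_LEVELS:
        return None  # invalid CEFR level: skip
    return {"word": word, "pos": pos, "cefr": cefr, "source": source}
```

Each extractor applies the same rules, so downstream stages can treat all `{source}-extracted.json` files uniformly.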
**Location:** `extraction-scripts/english/`
**Scripts:**
- `extract-cefrj-csv.py`
- `extract-en_m3.py`
- `extract-octanove.py`
- `extract-random-json.py`
### Stage 2: Comparison
Before merging, sources are compared to identify agreements and conflicts. This stage is read-only and serves as a quality gate.
**Input:** All `{source}-extracted.json` files for a language
**Output:** Console report showing:
- Entry counts per source and CEFR level
- Overlap between sources (words appearing in multiple sources)
- Agreement rate (sources assigning the same CEFR level)
- Conflicts (same word/POS with different CEFR levels)
- Database coverage (how many extracted words exist in the database)
**Location:** `comparison-scripts/compare-english.py`
**Usage:**
```bash
cd scripts/
python comparison-scripts/compare-english.py
```
Conflicts are resolved in the next stage using source priority rules.
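The core of the comparison — grouping records by (word, POS) and splitting multi-source keys into agreements and conflicts — might look roughly like this (a sketch with hypothetical names, not the actual `compare-english.py` code):

```python
from collections import defaultdict

def compare_sources(extracted):
    """extracted: {source_name: [records]}, each record with word/pos/cefr.
    Returns (agreements, conflicts) for keys seen in two or more sources."""
    levels = defaultdict(dict)  # (word, pos) -> {source: cefr}
    for source, records in extracted.items():
        for r in records:
            levels[(r["word"], r["pos"])][source] = r["cefr"]
    agreements, conflicts = [], []
    for key, by_source in levels.items():
        if len(by_source) < 2:
            continue  # word appears in only one source
        if len(set(by_source.values())) == 1:
            agreements.append(key)
        else:
            conflicts.append((key, by_source))
    return agreements, conflicts
```

The agreement rate reported by the script is then just `len(agreements)` over the number of overlapping keys.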
### Stage 3: Merge
Multiple extracted sources are merged into a single authoritative JSON file per language. When the same word/POS appears in multiple sources with different CEFR levels, the conflict is resolved using a predefined priority order.
**Input:** All `{source}-extracted.json` files for a language
**Output:** `{language}-merged.json` in `../datafiles/`
**Merge rules:**
- Single source: use that source's CEFR level
- Multiple sources agree: use the agreed CEFR level
- Multiple sources conflict: use the level from the highest-priority source
**Difficulty derivation:**
Difficulty is not extracted from sources. It is derived from the final CEFR level:
- A1, A2 → easy
- B1, B2 → intermediate
- C1, C2 → hard
The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.
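The merge rules and difficulty derivation can be condensed into one resolution step (a minimal sketch; `merge_entry` is illustrative, though the priority order and difficulty mapping come from this README):

```python
SOURCE_PRIORITY = ["en_m3", "cefrj", "octanove", "random"]  # highest first
DIFFICULTY = {"A1": "easy", "A2": "easy",
              "B1": "intermediate", "B2": "intermediate",
              "C1": "hard", "C2": "hard"}

def merge_entry(by_source):
    """by_source: {source: cefr_level} for one (word, pos) key.
    Takes the level from the highest-priority source, derives difficulty,
    and records every contributing source."""
    winner = min(by_source, key=SOURCE_PRIORITY.index)
    cefr = by_source[winner]
    return {"cefr": cefr,
            "difficulty": DIFFICULTY[cefr],
            "sources": sorted(by_source)}
```

Note that picking the highest-priority source covers all three merge rules: with a single source it returns that source's level, and when sources agree the winner's level is the agreed level anyway.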
**Location:** `merge-scripts/merge-english-json.py`
**Usage:**
```bash
cd scripts/
python merge-scripts/merge-english-json.py
```
### Stage 4: Enrichment
The authoritative merged file is consumed by the database package (packages/db) during the seeding or update process. This stage is implemented in TypeScript and is not part of the Python scripts in this directory.
## File Organization
```
scripts/
├── comparison-scripts/
│ └── compare-english.py # Stage 2: compare extracted data
├── datafiles/
│ ├── english-merged.json # Stage 3 output (authoritative dataset)
│ ├── omw-noun.json
│ └── omw-verb.json
├── data-sources/
│ ├── english/
│ │ ├── cefrj.csv
│ │ ├── cefrj-extracted.json
│ │ ├── en_m3.xls
│ │ ├── en_m3-extracted.json
│ │ ├── octanove.csv
│ │ ├── octanove-extracted.json
│ │ ├── random.json
│ │ └── random-extracted.json
│ ├── french/ # (future)
│ ├── german/ # (future)
│ ├── italian/ # (future)
│ └── spanish/ # (future)
├── extraction-scripts/
│ └── english/
│ ├── extract-cefrj-csv.py
│ ├── extract-en_m3.py
│ ├── extract-octanove.py
│ └── extract-random-json.py
├── merge-scripts/
│ └── merge-english-json.py # Stage 3: merge into authority
├── extract-own-save-to-json.py # script to extract words from wordnet
├── requirements.txt
└── README.md # This file
```
Extracted files are co-located with their sources for easy traceability. Merged files live in `datafiles/` (i.e. `../datafiles/` relative to the merge scripts).
## Source Priority by Language
Source priority determines which CEFR level wins when sources conflict:
**English:**
1. en_m3
2. cefrj
3. octanove
4. random
**Italian:**
1. it_m3
2. italian
Priority is defined in the merge configuration in `merge-scripts/merge-english-json.py`. Higher-priority sources override lower-priority sources when conflicts occur.
## Data Flow Summary
```
Raw Source → Extracted JSON → Merged JSON → Database
(1) (2) (3) (4)
```
1. **Extract:** Transform source formats to normalized records
2. **Compare:** Validate source quality and surface conflicts
3. **Merge:** Resolve conflicts, derive difficulty, create authority
4. **Enrich:** Write to database (handled in packages/db)
## Adding New Sources
To add a new source:
1. Place the raw file in the appropriate `data-sources/{language}/` directory
2. Create an extractor script in `extraction-scripts/{language}/`
3. Run the extractor to generate `{source}-extracted.json`
4. Run comparison to assess coverage and conflicts
5. Update source priority in the merge configuration if needed
6. Run merge to regenerate the authoritative file
7. Run enrichment to update the database
## Constants and Constraints
The pipeline respects these constraints from the Glossa shared constants:
- **Supported languages:** en, it
- **Supported parts of speech:** noun, verb
- **CEFR levels:** A1, A2, B1, B2, C1, C2
- **Difficulty levels:** easy, intermediate, hard
Entries violating these constraints are filtered out during extraction.
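A filter over these constraints might be sketched as follows (illustrative only; the actual constants live in the Glossa shared package, and the record field names here are assumptions):

```python
SUPPORTED_LANGUAGES = {"en", "it"}
SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}
DIFFICULTY_LEVELS = {"easy", "intermediate", "hard"}

def is_valid(record):
    """True if a record satisfies all four shared-constant constraints."""
    return (record.get("language") in SUPPORTED_LANGUAGES
            and record.get("pos") in SUPPORTED_POS
            and record.get("cefr") in CEFR_LEVELS
            and record.get("difficulty") in DIFFICULTY_LEVELS)
```

Running every record through a check like this at extraction time guarantees the merged file never carries values the database schema would reject.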