Extraction, comparison, and merging scripts for English are done; the final english.json exists.

This commit is contained in:
lila 2026-04-08 17:50:25 +02:00
parent 3596f76492
commit 59152950d6
14 changed files with 206319 additions and 0 deletions

scripts/README.md
# CEFR Data Pipeline
This directory contains the source data files and extraction/merge pipeline for generating CEFR-enriched datasets. The final output (`english-merged.json`) is consumed by the database seeding process in `packages/db`.
## Overview
The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.
## Pipeline Stages
### Stage 1: Extraction
Each source file is processed by a dedicated extractor script. The extractor reads the source-specific format, normalizes the data, filters for supported parts of speech, and outputs a standardized JSON file.
**Input:** Raw source files (JSON, CSV, XLS)
**Output:** `{source}-extracted.json` files (same directory as source)
**Normalization rules:**
- Words are lowercased and trimmed
- Part of speech is mapped to supported values (noun, verb)
- Entries with unsupported POS are skipped
- CEFR levels are validated against A1-C2
- Each record includes the source identifier for traceability
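The normalization rules above can be sketched as a small helper (a minimal illustration; `normalize_entry` and its field names are hypothetical, not the actual extractor code):

```python
SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}

def normalize_entry(word, pos, cefr, source):
    """Return a standardized record, or None if the entry is filtered out."""
    word = word.strip().lower()
    pos = pos.strip().lower()
    cefr = cefr.strip().upper()
    if pos not in SUPPORTED_POS:
        return None  # unsupported part of speech: skip
    if cefr not in CEFR_LEVELS:
        return None  # invalid CEFR level: skip
    return {"word": word, "pos": pos, "cefr": cefr, "source": source}
```

Each extractor applies the same rules, so downstream stages can treat all `{source}-extracted.json` files uniformly.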
**Location:** `extraction-scripts/english/`
**Scripts:**
- `extract-cefrj-csv.py`
- `extract-en_m3.py`
- `extract-octanove.py`
- `extract-random-json.py`
### Stage 2: Comparison
Before merging, sources are compared to identify agreements and conflicts. This stage is read-only and serves as a quality gate.
**Input:** All `{source}-extracted.json` files for a language
**Output:** Console report showing:
- Entry counts per source and CEFR level
- Overlap between sources (words appearing in multiple sources)
- Agreement rate (sources assigning the same CEFR level)
- Conflicts (same word/POS with different CEFR levels)
- Database coverage (how many extracted words exist in the database)
**Location:** `comparison-scripts/compare-english.py`
**Usage:**
```bash
cd scripts/
python comparison-scripts/compare-english.py
```
Conflicts are resolved in the next stage using source priority rules.
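The core of the comparison — grouping records by (word, POS) and splitting multi-source keys into agreements and conflicts — might look roughly like this (a sketch with hypothetical names, not the actual `compare-english.py` code):

```python
from collections import defaultdict

def compare_sources(extracted):
    """extracted: {source_name: [records]}, each record with word/pos/cefr.
    Returns (agreements, conflicts) for keys seen in two or more sources."""
    levels = defaultdict(dict)  # (word, pos) -> {source: cefr}
    for source, records in extracted.items():
        for r in records:
            levels[(r["word"], r["pos"])][source] = r["cefr"]
    agreements, conflicts = [], []
    for key, by_source in levels.items():
        if len(by_source) < 2:
            continue  # word appears in only one source
        if len(set(by_source.values())) == 1:
            agreements.append(key)
        else:
            conflicts.append((key, by_source))
    return agreements, conflicts
```

The agreement rate reported by the script is then just `len(agreements)` over the number of overlapping keys.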
### Stage 3: Merge
Multiple extracted sources are merged into a single authoritative JSON file per language. When the same word/POS appears in multiple sources with different CEFR levels, the conflict is resolved using a predefined priority order.
**Input:** All `{source}-extracted.json` files for a language
**Output:** `{language}-merged.json` in `../datafiles/`
**Merge rules:**
- Single source: use that source's CEFR level
- Multiple sources agree: use the agreed CEFR level
- Multiple sources conflict: use the level from the highest-priority source
**Difficulty derivation:**
Difficulty is not extracted from sources. It is derived from the final CEFR level:
- A1, A2 → easy
- B1, B2 → intermediate
- C1, C2 → hard
The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.
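The merge rules and difficulty derivation can be condensed into one resolution step (a minimal sketch; `merge_entry` is illustrative, though the priority order and difficulty mapping come from this README):

```python
SOURCE_PRIORITY = ["en_m3", "cefrj", "octanove", "random"]  # highest first
DIFFICULTY = {"A1": "easy", "A2": "easy",
              "B1": "intermediate", "B2": "intermediate",
              "C1": "hard", "C2": "hard"}

def merge_entry(by_source):
    """by_source: {source: cefr_level} for one (word, pos) key.
    Takes the level from the highest-priority source, derives difficulty,
    and records every contributing source."""
    winner = min(by_source, key=SOURCE_PRIORITY.index)
    cefr = by_source[winner]
    return {"cefr": cefr,
            "difficulty": DIFFICULTY[cefr],
            "sources": sorted(by_source)}
```

Note that picking the highest-priority source covers all three merge rules: with a single source it returns that source's level, and when sources agree the winner's level is the agreed level anyway.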
**Location:** `merge-scripts/merge-english-json.py`
**Usage:**
```bash
cd scripts/
python merge-scripts/merge-english-json.py
```
### Stage 4: Enrichment
The authoritative merged file is consumed by the database package (packages/db) during the seeding or update process. This stage is implemented in TypeScript and is not part of the Python scripts in this directory.
## File Organization
```
scripts/
├── comparison-scripts/
│ └── compare-english.py # Stage 2: compare extracted data
├── datafiles/
│ ├── english-merged.json # Stage 3 output (authoritative dataset)
│ ├── omw-noun.json
│ └── omw-verb.json
├── data-sources/
│ ├── english/
│ │ ├── cefrj.csv
│ │ ├── cefrj-extracted.json
│ │ ├── en_m3.xls
│ │ ├── en_m3-extracted.json
│ │ ├── octanove.csv
│ │ ├── octanove-extracted.json
│ │ ├── random.json
│ │ └── random-extracted.json
│ ├── french/ # (future)
│ ├── german/ # (future)
│ ├── italian/ # (future)
│ └── spanish/ # (future)
├── extraction-scripts/
│ └── english/
│ ├── extract-cefrj-csv.py
│ ├── extract-en_m3.py
│ ├── extract-octanove.py
│ └── extract-random-json.py
├── merge-scripts/
│ └── merge-english-json.py # Stage 3: merge into authority
├── extract-own-save-to-json.py # script to extract words from wordnet
├── requirements.txt
└── README.md # This file
```
Extracted files are co-located with their sources for easy traceability. Merged files live in `datafiles/` (i.e. `../datafiles/` relative to the merge scripts).
## Source Priority by Language
Source priority determines which CEFR level wins when sources conflict:
**English:**
1. en_m3
2. cefrj
3. octanove
4. random
**Italian:**
1. it_m3
2. italian
Priority is defined in the merge configuration in `merge-scripts/merge-english-json.py`. Higher-priority sources override lower-priority sources when conflicts occur.
## Data Flow Summary
```
Raw Source → Extracted JSON → Merged JSON → Database
(1) (2) (3) (4)
```
1. **Extract:** Transform source formats to normalized records
2. **Compare:** Validate source quality and surface conflicts
3. **Merge:** Resolve conflicts, derive difficulty, create authority
4. **Enrich:** Write to database (handled in packages/db)
## Adding New Sources
To add a new source:
1. Place the raw file in the appropriate `data-sources/{language}/` directory
2. Create an extractor script in `extraction-scripts/{language}/`
3. Run the extractor to generate `{source}-extracted.json`
4. Run comparison to assess coverage and conflicts
5. Update source priority in the merge configuration if needed
6. Run merge to regenerate the authoritative file
7. Run enrichment to update the database
## Constants and Constraints
The pipeline respects these constraints from the Glossa shared constants:
- **Supported languages:** en, it
- **Supported parts of speech:** noun, verb
- **CEFR levels:** A1, A2, B1, B2, C1, C2
- **Difficulty levels:** easy, intermediate, hard
Entries violating these constraints are filtered out during extraction.
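A filter over these constraints might be sketched as follows (illustrative only; the actual constants live in the Glossa shared package, and the record field names here are assumptions):

```python
SUPPORTED_LANGUAGES = {"en", "it"}
SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}
DIFFICULTY_LEVELS = {"easy", "intermediate", "hard"}

def is_valid(record):
    """True if a record satisfies all four shared-constant constraints."""
    return (record.get("language") in SUPPORTED_LANGUAGES
            and record.get("pos") in SUPPORTED_POS
            and record.get("cefr") in CEFR_LEVELS
            and record.get("difficulty") in DIFFICULTY_LEVELS)
```

Running every record through a check like this at extraction time guarantees the merged file never carries values the database schema would reject.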