extraction, comparison and merging scripts for english are done, final english.json exists
# CEFR Data Pipeline

This directory contains the source data files and the extraction/merge pipeline for generating CEFR-enriched datasets. The final output (`english-merged.json`) is consumed by the database seeding process in `packages/db`.

## Overview

The pipeline transforms raw vocabulary data from multiple sources into a standardized format, resolves conflicts between sources, and produces an authoritative CEFR dataset per language. This dataset is then used by the Glossa database package to update translation records.
## Pipeline Stages

### Stage 1: Extraction

Each source file is processed by a dedicated extractor script. The extractor reads the source-specific format, normalizes the data, filters for supported parts of speech, and outputs a standardized JSON file.

**Input:** Raw source files (JSON, CSV, XLS)

**Output:** `{source}-extracted.json` files (same directory as source)

**Normalization rules:**

- Words are lowercased and trimmed
- Part of speech is mapped to supported values (noun, verb)
- Entries with unsupported POS are skipped
- CEFR levels are validated against A1-C2
- Each record includes the source identifier for traceability
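Extractor internals vary per source, but the rules above amount to a single normalization step. The following is an illustrative sketch, not the actual extractor API; `POS_MAP` and the record field names are assumptions:

```python
SUPPORTED_POS = {"noun", "verb"}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}

# Hypothetical mapping from a source's POS tag set to the supported values.
POS_MAP = {"n": "noun", "noun": "noun", "v": "verb", "verb": "verb"}

def normalize_entry(raw_word, raw_pos, raw_level, source):
    """Return a normalized record, or None if the entry should be skipped."""
    word = raw_word.strip().lower()
    pos = POS_MAP.get(raw_pos.strip().lower())
    level = raw_level.strip().upper()
    if pos not in SUPPORTED_POS or level not in CEFR_LEVELS:
        return None  # unsupported POS or invalid CEFR level: skip
    return {"word": word, "pos": pos, "cefr": level, "source": source}
```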
**Location:** `extraction-scripts/english/`

**Scripts:**

- `extract-cefrj-csv.py`
- `extract-en_m3.py`
- `extract-octanove.py`
- `extract-random-json.py`
### Stage 2: Comparison

Before merging, sources are compared to identify agreements and conflicts. This stage is read-only and serves as a quality gate.

**Input:** All `{source}-extracted.json` files for a language

**Output:** Console report showing:

- Entry counts per source and CEFR level
- Overlap between sources (words appearing in multiple sources)
- Agreement rate (sources assigning the same CEFR level)
- Conflicts (same word/POS with different CEFR levels)
- Database coverage (how many extracted words exist in the database)
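The overlap, agreement-rate, and conflict metrics can be computed in a few lines. This is a sketch of the idea only; the real script's data structures and function names may differ:

```python
from collections import defaultdict

def compare_sources(sources):
    """Summarize overlap, agreement, and conflicts across extracted sources.

    `sources` maps a source name to a dict of (word, pos) -> CEFR level.
    """
    levels_by_key = defaultdict(dict)
    for name, entries in sources.items():
        for key, level in entries.items():
            levels_by_key[key][name] = level

    # Keys that appear in more than one source
    overlap = {k: v for k, v in levels_by_key.items() if len(v) > 1}
    # Overlapping keys where every source assigns the same level
    agreed = [k for k, v in overlap.items() if len(set(v.values())) == 1]
    conflicts = {k: v for k, v in overlap.items() if len(set(v.values())) > 1}
    rate = len(agreed) / len(overlap) if overlap else 1.0
    return {"overlap": len(overlap), "agreement_rate": rate, "conflicts": conflicts}
```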
**Location:** `comparison-scripts/compare-english.py`

**Usage:**

```bash
cd scripts/
python comparison-scripts/compare-english.py
```

Conflicts are resolved in the next stage using source priority rules.
### Stage 3: Merge

Multiple extracted sources are merged into a single authoritative JSON file per language. When the same word/POS appears in multiple sources with different CEFR levels, the conflict is resolved using a predefined priority order.

**Input:** All `{source}-extracted.json` files for a language

**Output:** `{language}-merged.json` in `../datafiles/`

**Merge rules:**

- Single source: use that source's CEFR level
- Multiple sources agree: use the agreed CEFR level
- Multiple sources conflict: use the level from the highest-priority source
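All three merge rules reduce to one lookup: take the level from the highest-priority contributing source. A minimal sketch, using the English priority order listed later in this README (`resolve_level` is an illustrative name, not the script's API):

```python
# English priority order from the merge configuration; index 0 wins conflicts.
SOURCE_PRIORITY = ["en_m3", "cefrj", "octanove", "random"]

def resolve_level(levels_by_source):
    """Return the CEFR level from the highest-priority contributing source.

    The single-source and full-agreement cases fall out of the same rule:
    the highest-priority source's level is also the only (or agreed) level.
    """
    winner = min(levels_by_source, key=SOURCE_PRIORITY.index)
    return levels_by_source[winner]
```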
**Difficulty derivation:**

Difficulty is not extracted from sources. It is derived from the final CEFR level:

- A1, A2 → easy
- B1, B2 → intermediate
- C1, C2 → hard
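The derivation is a direct table lookup; a minimal sketch (names are illustrative):

```python
CEFR_TO_DIFFICULTY = {
    "A1": "easy", "A2": "easy",
    "B1": "intermediate", "B2": "intermediate",
    "C1": "hard", "C2": "hard",
}

def derive_difficulty(cefr_level):
    """Map a validated CEFR level onto the three difficulty buckets."""
    return CEFR_TO_DIFFICULTY[cefr_level.upper()]
```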
The merged file includes both CEFR level and derived difficulty, plus a list of sources that contributed to each entry.
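A merged entry might therefore look like the following. The field names are an illustration based on the description above, not a schema guarantee:

```json
{
  "word": "cat",
  "pos": "noun",
  "cefr": "A1",
  "difficulty": "easy",
  "sources": ["en_m3", "cefrj"]
}
```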
**Location:** `merge-scripts/merge-english-json.py`

**Usage:**

```bash
cd scripts/
python merge-scripts/merge-english-json.py
```

### Stage 4: Enrichment

The authoritative merged file is consumed by the database package (`packages/db`) during the seeding or update process. This stage is implemented in TypeScript and is not part of the Python scripts in this directory.
## File Organization

```
scripts/
├── comparison-scripts/
│   └── compare-english.py          # Stage 2: compare extracted data
├── datafiles/
│   ├── english-merged.json         # Stage 3 output (authoritative dataset)
│   ├── omw-noun.json
│   └── omw-verb.json
├── data-sources/
│   ├── english/
│   │   ├── cefrj.csv
│   │   ├── cefrj-extracted.json
│   │   ├── en_m3.xls
│   │   ├── en_m3-extracted.json
│   │   ├── octanove.csv
│   │   ├── octanove-extracted.json
│   │   ├── random.json
│   │   └── random-extracted.json
│   ├── french/                     # (future)
│   ├── german/                     # (future)
│   ├── italian/                    # (future)
│   └── spanish/                    # (future)
├── extraction-scripts/
│   └── english/
│       ├── extract-cefrj-csv.py
│       ├── extract-en_m3.py
│       ├── extract-octanove.py
│       └── extract-random-json.py
├── merge-scripts/
│   └── merge-english-json.py       # Stage 3: merge into authority
├── extract-own-save-to-json.py     # script to extract words from WordNet
├── requirements.txt
└── README.md                       # This file
```

Extracted files are co-located with their sources for easy traceability. Merged files live in `../datafiles/`.
## Source Priority by Language

Source priority determines which CEFR level wins when sources conflict:

**English:**

1. en_m3
2. cefrj
3. octanove
4. random

**Italian:**

1. it_m3
2. italian

Priority is defined in the merge configuration (see `merge-scripts/merge-english-json.py`); higher-priority sources override lower-priority sources when conflicts occur.
## Data Flow Summary

```
Raw Source → Extracted JSON → Merged JSON → Database
    (1)           (2)             (3)          (4)
```

1. **Extract:** Transform source formats into normalized records
2. **Compare:** Validate source quality and surface conflicts
3. **Merge:** Resolve conflicts, derive difficulty, produce the authoritative file
4. **Enrich:** Write to the database (handled in `packages/db`)
## Adding New Sources

To add a new source:

1. Place the raw file in the appropriate `data-sources/{language}/` directory
2. Create an extractor script in `extraction-scripts/{language}/`
3. Run the extractor to generate `{source}-extracted.json`
4. Run the comparison to assess coverage and conflicts
5. Update the source priority in the merge configuration if needed
6. Run the merge to regenerate the authoritative file
7. Run enrichment to update the database
## Constants and Constraints

The pipeline respects these constraints from the Glossa shared constants:

- **Supported languages:** en, it
- **Supported parts of speech:** noun, verb
- **CEFR levels:** A1, A2, B1, B2, C1, C2
- **Difficulty levels:** easy, intermediate, hard

Entries violating these constraints are filtered out during extraction.