feat(pipeline): add data pipeline workspace and extraction stage

- rename scripts/ to data-pipeline/, archive existing scripts
- add @lila/pipeline as pnpm workspace package
- add stage-1-extract through stage-5-compare folder structure
- update SUPPORTED_LANGUAGE_CODES (add es, de, fr)
- update SUPPORTED_POS (add adjective, adverb)
- add description field to term_glosses
- add term_examples table
- run and verify db migration
- write and verify extract.py (117,659 synsets across 5 languages)
- write PIPELINE.md
lila 2026-04-21 09:39:36 +02:00
parent e993aac711
commit c9cddf68de
7 changed files with 1054164 additions and 33 deletions

@@ -112,77 +112,83 @@ The pipeline runs in five stages. Each stage is independent and can be re-run wi
### 1. Extract
Reads each language from the OMW SQLite database (`~/.wn_data/wn.db`) and produces a normalized JSON file per language containing all synsets with their translations, glosses, and usage examples across all parts of speech. Adjective satellites are collapsed into adjective at this stage.
Reads the OMW SQLite database (`~/.wn_data/wn.db`) and produces a single normalized JSON file containing all synsets with their translations, glosses, and usage examples across all five languages and all parts of speech. Adjective satellites are collapsed into adjective at this stage.
**Input:** `~/.wn_data/wn.db`
**Output:** `stage-1-extract/output/{lang}.json`
**Output:** `stage-1-extract/output/omw.json`
```bash
python scripts/extract.py
python stage-1-extract/scripts/extract.py
```
Add `--sample` to extract 100 synsets for inspection before running the full
extraction.
Each record in the output looks like this:
```json
{
"source_id": "omw-en-12345",
"pos": "noun",
"source_id": "ili:i1",
"pos": "adjective",
"translations": {
"en": ["dog", "canine"],
"it": ["cane"]
"en": ["able"],
"it": ["abile", "intelligente", "valente", "capace"],
"es": ["capaz"],
"fr": ["comptable"]
},
"glosses": {
"en": "a domesticated carnivorous mammal"
"en": ["(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"]
},
"examples": {
"en": ["the dog barked at the stranger"]
"en": ["able to swim", "she was able to program her computer"]
}
}
```
Note: glosses and examples are not available for all languages. French and Spanish have no glosses in the current OMW database. Coverage detail is in `COVERAGE.md`.
<!-- TODO: verify record shape once extract.py is written -->
> **Note for first run:** Before extracting the full dataset, run the script
> in sample mode to inspect the actual data per language. Real-world wordnet
> data often contains unexpected formatting, missing fields, or inconsistencies
> that are better discovered early. A sample of 50–100 synsets per language is
> enough to verify the output shape and spot anything worth handling before
> processing the full dataset.
Note: glosses and examples are not available for all languages. French and Spanish have no glosses or examples in the current OMW database — these will be generated by the LLM in the enrich stage. Coverage detail is in `COVERAGE.md`.
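The grouping into one record per synset, with adjective satellites folded into adjective, can be sketched roughly as follows. This is an illustration only, not the contents of `extract.py`: the input rows, the `POS_LABELS` mapping, and the `build_records` helper are hypothetical, and the real script reads from `wn.db` rather than from an in-memory list.

```python
from collections import defaultdict

# WordNet-style POS tags mapped to pipeline labels. "s" (adjective satellite)
# is folded into "adjective". The exact tag set is an assumption.
POS_LABELS = {"n": "noun", "v": "verb", "a": "adjective", "s": "adjective", "r": "adverb"}


def build_records(rows):
    """Group (ili, pos, lang, lemma) rows into one record per synset."""
    records = {}
    for ili, pos, lang, lemma in rows:
        record = records.setdefault(
            ili,
            {
                "source_id": f"ili:{ili}",
                "pos": POS_LABELS[pos],
                "translations": defaultdict(list),
            },
        )
        # Keep each lemma once per language, preserving first-seen order.
        if lemma not in record["translations"][lang]:
            record["translations"][lang].append(lemma)
    # Convert the defaultdicts to plain dicts so the result serializes cleanly.
    return [{**r, "translations": dict(r["translations"])} for r in records.values()]


# Hypothetical rows standing in for a database query result.
rows = [
    ("i1", "a", "en", "able"),
    ("i1", "a", "it", "abile"),
    ("i1", "a", "it", "capace"),
    ("i2", "s", "en", "capable"),
]
records = build_records(rows)
```

Keying records on the synset identifier rather than on language is what allows a single combined file to hold all translations side by side.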
### 2. Annotate
Merges the CEFR source files into the extracted data. Each word in each language is looked up in the corresponding CEFR source file. Matched words receive a `cefr_source` vote which carries into the enrich stage. Unmatched words proceed without a vote — the enrich stage handles them entirely.
Reads the combined OMW extract and merges CEFR source data into it. Each translation in each language is matched against the corresponding CEFR source
file by word text and part of speech. Matched translations receive a `cefr_source` vote which carries into the enrich stage. Unmatched translations proceed without a vote.
This stage is language-agnostic and processes all languages in one run.
This stage also extracts native example sentences from the CEFR source files and adds them to the record alongside OMW examples, with `source: "cefr"` to distinguish them.
**Input:** `stage-1-extract/output/{lang}.json` + `stage-2-annotate/sources/cefr/{lang}.json`
**Output:** `stage-2-annotate/output/{lang}.json`
Words appearing in the CEFR source file multiple times with different CEFR levels are written to `conflicts.json` for manual review and excluded from voting until resolved.
**Input:** `stage-1-extract/output/omw.json` + `stage-2-annotate/sources/cefr/{lang}.json`
**Output:**
- `stage-2-annotate/output/{lang}.json` — one per language
- `stage-2-annotate/output/conflicts.json` — cross-language conflicts for review
```bash
pnpm --filter @lila/pipeline annotate
```
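A minimal sketch of the matching step, under stated assumptions: the CEFR entry fields (`word`, `pos`, `cefr`), the case-insensitive comparison, and the helper names `index_cefr` and `vote_for` are all hypothetical and may differ from the actual annotate script. It shows how entries with more than one level for the same word and part of speech could be set aside as conflicts instead of voting.

```python
def index_cefr(entries):
    """Split CEFR entries into unambiguous levels and conflicting ones."""
    levels = {}
    for entry in entries:
        key = (entry["word"].lower(), entry["pos"])
        levels.setdefault(key, set()).add(entry["cefr"])
    # A key with exactly one level can vote; anything else needs review.
    resolved = {k: next(iter(v)) for k, v in levels.items() if len(v) == 1}
    conflicts = {k: sorted(v) for k, v in levels.items() if len(v) > 1}
    return resolved, conflicts


def vote_for(record, lang, resolved):
    """Return {word: {"cefr_source": level}} for matched translations."""
    votes = {}
    for word in record["translations"].get(lang, []):
        level = resolved.get((word.lower(), record["pos"]))
        if level is not None:
            votes[word] = {"cefr_source": level}
    return votes


# Hypothetical CEFR entries; "light" appears twice with different levels.
entries = [
    {"word": "able", "pos": "adjective", "cefr": "B1"},
    {"word": "light", "pos": "adjective", "cefr": "A1"},
    {"word": "light", "pos": "adjective", "cefr": "B2"},
]
record = {"pos": "adjective", "translations": {"en": ["able", "light"]}}

resolved, conflicts = index_cefr(entries)
votes = vote_for(record, "en", resolved)
```

Here `able` receives a vote, while `light` is held back because its two entries disagree.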
Each record in the output extends the extracted record with a `votes` field:
Each record in the output extends the OMW record with a `votes` field and any additional examples from the CEFR source file:
```json
{
"source_id": "omw-en-12345",
"pos": "noun",
"source_id": "ili:i1",
"pos": "adjective",
"translations": {
"en": ["dog", "canine"],
"it": ["cane"]
"en": ["able"],
"it": ["abile", "intelligente", "valente", "capace"],
"es": ["capaz"],
"fr": ["comptable"]
},
"glosses": {
"en": "a domesticated carnivorous mammal"
"en": ["having the necessary means or skill to do something"]
},
"examples": {
"en": ["the dog barked at the stranger"]
"en": [
{ "text": "able to swim", "source": "omw" },
{ "text": "She was able to finish the task.", "source": "cefr" }
]
},
"votes": {
"en": {
"cefr_source": "A1"
"able": { "cefr_source": "B1" }
}
}
}
@ -196,14 +202,16 @@ The enrich stage runs in two rounds, both designed to execute overnight one mode
**Round 1 — generation**
Each model processes every word in every language one term at a time and generates:
Each model processes every word in every language one term at a time and
generates:
- A CEFR level vote for each translation
- A description for each language
- A translation for each language, only if OMW provides none
- A gloss for each language, only if OMW provides none
- Usage examples for each language, only if OMW provides none
OMW data is never duplicated — the script checks what OMW already provides before building the prompt. For glosses and examples, if OMW data exists for that language the LLM skips generation entirely. This significantly reduces compute time for languages with good OMW coverage such as English and Italian.
OMW data is never duplicated — the script checks what OMW already provides before building the prompt. For translations, glosses and examples, if OMW data exists for that language the LLM skips generation entirely. This significantly reduces compute time for languages with good OMW coverage such as English.
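The OMW check described above could look something like the sketch below. The function name `fields_to_generate` and the field labels `cefr_vote` and `description` are hypothetical; only the rule itself (always request a vote and a description, request translations, glosses, and examples only when OMW has none) comes from the text.

```python
def fields_to_generate(record, lang):
    """List the fields the LLM should be asked to generate for one language."""
    # A CEFR vote and a description are always requested.
    fields = ["cefr_vote", "description"]
    # The remaining fields are requested only when OMW provides nothing.
    for field in ("translations", "glosses", "examples"):
        if not record.get(field, {}).get(lang):
            fields.append(field)
    return fields


# Record shaped like the extract output; values shortened for illustration.
record = {
    "source_id": "ili:i1",
    "pos": "adjective",
    "translations": {"en": ["able"], "fr": ["comptable"]},
    "glosses": {"en": ["having the necessary means or skill to do something"]},
    "examples": {"en": ["able to swim"]},
}

english = fields_to_generate(record, "en")
french = fields_to_generate(record, "fr")
german = fields_to_generate(record, "de")
```

With this record, English needs only the two always-generated fields, French additionally needs glosses and examples, and German needs everything.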
All model-generated content is stored with an anonymised source (`model_1`, `model_2` etc.) so models cannot be biased by knowing who generated what in round 2.