diff --git a/.gitignore b/.gitignore index 89fbf2e..ad49f49 100644 --- a/.gitignore +++ b/.gitignore @@ -11,5 +11,7 @@ __pycache__/ *.pyc data-pipeline/archive/ -data-pipeline/output/ -data-pipeline/sources/omw/ +data-pipeline/stage-1-extract/output/ +data-pipeline/stage-2-annotate/output/ +data-pipeline/stage-3-enrich/output/ +data-pipeline/stage-4-merge/output/ diff --git a/data-pipeline/PIPELINE.md b/data-pipeline/PIPELINE.md deleted file mode 100644 index ec0cc3b..0000000 --- a/data-pipeline/PIPELINE.md +++ /dev/null @@ -1,33 +0,0 @@ -# lila data pipeline - -One paragraph: what this is, why it exists, where it feeds into. - -## Overview - Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB - -## Data sources - ### OMW / WordNet - ### Per-language CEFR files - (table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail) - -## Pipeline stages - ### 1. Extract - ### 2. Annotate (CEFR) - ### 3. Enrich (LLM) - ### 4. Merge - ### 5. Compare / QA - Each: what it does, input, output, how to run. - -## LLM setup - - llama.cpp server: how to start it, what port, recommended models - - How the pipeline hits it - - Resuming interrupted runs - -## Supported languages - Table: language code, name, CEFR source file, full detail → COVERAGE.md - -## Adding a new language - Step by step. - -## Constants and constraints - POS values, CEFR levels, difficulty mapping, language codes. 
diff --git a/data-pipeline/scripts/annotate.ts b/data-pipeline/scripts/annotate.ts deleted file mode 100644 index e69de29..0000000 diff --git a/data-pipeline/scripts/compare.ts b/data-pipeline/scripts/compare.ts deleted file mode 100644 index e69de29..0000000 diff --git a/data-pipeline/scripts/enrich.ts b/data-pipeline/scripts/enrich.ts deleted file mode 100644 index e69de29..0000000 diff --git a/data-pipeline/scripts/extract.ts b/data-pipeline/scripts/extract.ts deleted file mode 100644 index e69de29..0000000 diff --git a/data-pipeline/scripts/merge.ts b/data-pipeline/scripts/merge.ts deleted file mode 100644 index e69de29..0000000 diff --git a/documentation/PIPELINE.md b/documentation/PIPELINE.md new file mode 100644 index 0000000..c61a2e2 --- /dev/null +++ b/documentation/PIPELINE.md @@ -0,0 +1,465 @@ +# lila data pipeline + +> **NOTE: BEFORE RUNNING THE PIPELINE, CONSIDER IMPROVING THE CEFR SOURCE +> FILES IN `stage-2-annotate/sources/cefr/`. BETTER SOURCE COVERAGE MEANS +> FEWER WORDS FOR THE LLM TO ANNOTATE FROM SCRATCH, FASTER OVERNIGHT RUNS, +> AND HIGHER CONFIDENCE IN THE FINAL OUTPUT. SEE UNIVERSALCEFR +> (huggingface.co/UniversalCEFR) AND CEFR-J +> (github.com/openlanguageprofiles/olp-en-cefrj) AS STARTING POINTS.** + +This pipeline extracts vocabulary data from the Open Multilingual Wordnet (OMW), annotates it with CEFR levels from curated source files, verifies and enriches annotations using local LLMs, and produces authoritative JSON files per language. These files are consumed by the seeder in `packages/db` to populate the database with terms, translations, glosses, CEFR levels, difficulty ratings, and LLM-generated descriptions. 
+ +## Overview + +```mermaid +flowchart LR + omw[(OMW SQLite DBs)] + cefr[(CEFR JSON files)] + extract[Extract] + annotate[Annotate] + enrich[Enrich] + merge[Merge] + final[(final/lang.json)] + flagged[(flagged/lang.json)] + seeder[packages/db seeder] + db[(Database)] + + omw --> extract + cefr --> annotate + extract --> annotate + annotate --> enrich + enrich --> merge + merge --> final + merge --> flagged + final --> seeder + seeder --> db +``` + +Each stage is a standalone script that reads from the previous stage's output and produces one JSON file per language. Stages can be re-run independently without affecting earlier or later stages. + +The enrich stage is the exception — it produces one checkpoint file per model run per language, plus a compiled votes file once all runs are complete. It is designed to run overnight, one model at a time, and is fully resumable if interrupted. + +Only fully annotated output in `stage-4-merge/output/final/` reaches the database. Words where LLMs could not reach a majority vote land in `stage-4-merge/output/flagged/` and wait for manual review before seeding. + +## Data sources + +### OMW / WordNet + +The Open Multilingual Wordnet (OMW) is the base vocabulary source. It provides synsets — groups of synonymous words — with translations and glosses across multiple languages. One SQLite database per language is downloaded and placed in `sources/omw/`. These files are not committed to git. + +All four parts of speech are extracted: noun, verb, adjective, adverb. WordNet's adjective satellites are collapsed into adjective — this is a WordNet-internal distinction that has no relevance for language learning. Alongside translations and glosses, usage examples are extracted where available and stored in the database as term_examples. + +See **Setup** for download instructions. + +### CEFR source files + +Per-language JSON files in `sources/cefr/` provide the initial CEFR level annotations. 
These files do not cover the full vocabulary extracted from OMW — coverage varies by language. Gaps and disagreements are handled by the enrich stage. + +| Language | File | +|---|---| +| English | `sources/cefr/en.json` | +| Italian | `sources/cefr/it.json` | +| Spanish | `sources/cefr/es.json` | +| German | `sources/cefr/de.json` | +| French | `sources/cefr/fr.json` | + +These files are committed to git. For per-language coverage detail see `COVERAGE.md`. + +### CEFR annotation and verification + +CEFR levels are determined by a majority vote combining all available sources: + +- The CEFR source file counts as one vote (if it has an entry for the word) +- Each LLM model run counts as one vote + +The LLMs verify existing annotations as well as filling gaps — a source file entry does not automatically win. Majority vote across all sources determines the final level. + +If no majority is reached, the word is flagged for manual review and excluded from the database until resolved. + +## Setup + +### OMW databases + +Download the OMW SQLite database for each language using the `wn` Python +library: + +```bash +python -m wn download omw-en:1.4 +python -m wn download omw-it:1.4 +python -m wn download omw-de:1.4 +python -m wn download omw-es:1.4 +python -m wn download omw-fr:1.4 +``` + +The data is stored automatically at `~/.wn_data/wn.db` and is not committed +to git. + +### LLM setup + +See `LLM-SETUP.md`. + +## Pipeline stages + +The pipeline runs in five stages. Each stage is independent and can be re-run without affecting the others. + +| Stage | What it does | +|---|---| +| 1. Extract | Reads OMW SQLite database, outputs normalized JSON per language | +| 2. Annotate | Merges CEFR source files into extracted data, adds source file votes | +| 3. Enrich | Runs local LLMs in two rounds — generation then voting | +| 4. Merge | Resolves votes, derives difficulty, splits into final and flagged | +| 5. 
Compare | Generates COVERAGE.md with detailed quality report | + +### 1. Extract + +Reads each language from the OMW SQLite database (`~/.wn_data/wn.db`) and produces a normalized JSON file per language containing all synsets with their translations, glosses, and usage examples across all parts of speech. Adjective satellites are collapsed into adjective at this stage. + +**Input:** `~/.wn_data/wn.db` +**Output:** `stage-1-extract/output/{lang}.json` + +```bash +python scripts/extract.py +``` + +Each record in the output looks like this: + +```json +{ + "source_id": "omw-en-12345", + "pos": "noun", + "translations": { + "en": ["dog", "canine"], + "it": ["cane"] + }, + "glosses": { + "en": "a domesticated carnivorous mammal" + }, + "examples": { + "en": ["the dog barked at the stranger"] + } +} +``` + +Note: glosses and examples are not available for all languages. French and Spanish have no glosses in the current OMW database. Coverage detail is in `COVERAGE.md`. + + + +> **Note for first run:** Before extracting the full dataset, run the script +> in sample mode to inspect the actual data per language. Real-world wordnet +> data often contains unexpected formatting, missing fields, or inconsistencies +> that are better discovered early. A sample of 50–100 synsets per language is +> enough to verify the output shape and spot anything worth handling before +> processing the full dataset. + +### 2. Annotate + +Merges the CEFR source files into the extracted data. Each word in each language is looked up in the corresponding CEFR source file. Matched words receive a `cefr_source` vote which carries into the enrich stage. Unmatched words proceed without a vote — the enrich stage handles them entirely. + +This stage is language-agnostic and processes all languages in one run. 
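
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the actual `annotate` script: the `annotateRecord` function, the `CefrLookup` map shape, and the lowercase/trim normalisation are all assumptions, and the sketch records one vote per word, which may be more granular than the real output.

```typescript
// Hypothetical types; field names follow the JSON records shown in this document.
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

interface ExtractedRecord {
  source_id: string;
  pos: string;
  translations: Record<string, string[]>;
}

// Assumption: each CEFR source file can be loaded as a word -> level lookup.
type CefrLookup = Map<string, CefrLevel>;

// language -> word -> vote (assumed layout)
type SourceVotes = Record<string, Record<string, { cefr_source: CefrLevel }>>;

function annotateRecord(
  record: ExtractedRecord,
  lookups: Record<string, CefrLookup>,
): ExtractedRecord & { votes: SourceVotes } {
  const votes: SourceVotes = {};
  for (const [lang, words] of Object.entries(record.translations)) {
    const lookup = lookups[lang];
    if (!lookup) continue; // no CEFR source file for this language
    for (const word of words) {
      // Assumption: source files are keyed by lowercase, trimmed word forms.
      const level = lookup.get(word.trim().toLowerCase());
      if (level === undefined) continue; // unmatched: left to the enrich stage
      const langVotes = votes[lang] ?? {};
      langVotes[word] = { cefr_source: level };
      votes[lang] = langVotes;
    }
  }
  return { ...record, votes };
}

const annotated = annotateRecord(
  {
    source_id: "omw-en-12345",
    pos: "noun",
    translations: { en: ["dog", "canine"], it: ["cane"] },
  },
  {
    en: new Map<string, CefrLevel>([["dog", "A1"]]),
    it: new Map<string, CefrLevel>(),
  },
);
console.log(JSON.stringify(annotated.votes));
```

In this sketch only `dog` receives a vote. `canine` and `cane` are absent from the lookups, so they carry no `cefr_source` vote into the enrich stage.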
+ +**Input:** `stage-1-extract/output/{lang}.json` + `stage-2-annotate/sources/cefr/{lang}.json` +**Output:** `stage-2-annotate/output/{lang}.json` + +```bash +pnpm --filter @lila/pipeline annotate +``` + +Each record in the output extends the extracted record with a `votes` field: + +```json +{ + "source_id": "omw-en-12345", + "pos": "noun", + "translations": { + "en": ["dog", "canine"], + "it": ["cane"] + }, + "glosses": { + "en": "a domesticated carnivorous mammal" + }, + "examples": { + "en": ["the dog barked at the stranger"] + }, + "votes": { + "en": { + "cefr_source": "A1" + } + } +} +``` + +Words not present in the CEFR source file will have an empty `votes` object. + +### 3. Enrich + +The enrich stage runs in two rounds, both designed to execute overnight one model at a time. The llama.cpp server must be running locally before starting either round. See `LLM-SETUP.md` for setup instructions. + +**Round 1 — generation** + +Each model processes every word in every language one term at a time and generates: + +- A CEFR level vote for each translation +- A description for each language +- A gloss for each language, only if OMW provides none +- Usage examples for each language, only if OMW provides none + +OMW data is never duplicated — the script checks what OMW already provides before building the prompt. For glosses and examples, if OMW data exists for that language the LLM skips generation entirely. This significantly reduces compute time for languages with good OMW coverage such as English and Italian. + +All model-generated content is stored with an anonymised source (`model_1`, `model_2` etc.) so models cannot be biased by knowing who generated what in round 2. 
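
As a rough sketch of the round 1 rules above, the decision about what to request from a model for a given language could look like this. The names `planRound1Tasks`, `Round1Task`, and `anonymiseModels` are hypothetical, and the model names in the example are placeholders rather than recommended models.

```typescript
// Hypothetical subset of an annotated record, following the JSON shown in this document.
interface AnnotatedRecord {
  translations: Record<string, string[]>;
  glosses: Record<string, string>;
  examples: Record<string, string[]>;
}

type Round1Task = "cefr" | "description" | "gloss" | "examples";

// Decide which fields the prompt should ask the model to produce for one language.
function planRound1Tasks(record: AnnotatedRecord, lang: string): Round1Task[] {
  // CEFR votes and descriptions are always requested.
  const tasks: Round1Task[] = ["cefr", "description"];
  // Glosses and examples are requested only when OMW provides none.
  if (!record.glosses[lang]) tasks.push("gloss");
  const omwExamples = record.examples[lang] ?? [];
  if (omwExamples.length === 0) tasks.push("examples");
  return tasks;
}

// Map real model names to anonymised labels (model_1, model_2, ...) in run order.
function anonymiseModels(modelNames: string[]): Map<string, string> {
  return new Map<string, string>(
    modelNames.map((name, i): [string, string] => [name, `model_${i + 1}`]),
  );
}

const sampleRecord: AnnotatedRecord = {
  translations: { en: ["dog", "canine"], fr: ["chien"] },
  glosses: { en: "a domesticated carnivorous mammal" },
  examples: { en: ["the dog barked at the stranger"] },
};

console.log(planRound1Tasks(sampleRecord, "en")); // CEFR vote and description only
console.log(planRound1Tasks(sampleRecord, "fr")); // all four tasks
const labels = anonymiseModels(["first-model", "second-model"]);
console.log(labels.get("second-model"));
```

Keeping the plan as a plain list of tasks makes it straightforward to build a shorter prompt for languages where OMW already supplies glosses and examples.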
+ +**Input:** `stage-2-annotate/output/{lang}.json` +**Output:** `stage-3-enrich/output/round1/{lang}_{model}.json` per run + +```bash +pnpm --filter @lila/pipeline enrich --round 1 --model {model} +``` + +**Compiling candidates** + +Once all round 1 runs are complete, compile all generated candidates into a single structured file per language. This is the input to round 2. + +**Input:** `stage-3-enrich/output/round1/{lang}_{model}.json` +**Output:** `stage-3-enrich/output/candidates/{lang}_candidates.json` + +```bash +pnpm --filter @lila/pipeline enrich --compile-candidates +``` + +**Round 2 — voting** + +Each model receives the compiled candidate list for every word and votes on: + +- The best gloss candidate (if multiple exist) +- The best description candidate (if multiple exist) +- The best usage examples candidate (if multiple exist) +- A CEFR level vote for each translation + +OMW data is not put to a vote — it automatically wins over any LLM-generated candidate. Round 2 only resolves conflicts between model-generated candidates. The prompt is kept small — one word at a time, a clean numbered candidate list — to fit within a limited context window. + +**Input:** `stage-3-enrich/output/candidates/{lang}_candidates.json` +**Output:** `stage-3-enrich/output/round2/{lang}_{model}.json` per run + +```bash +pnpm --filter @lila/pipeline enrich --round 2 --model {model} +``` + +**Compiling votes** + +Once all round 2 runs are complete, compile all votes into a single file per language. This is the input to the merge stage. 
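
A minimal sketch of the compile step for CEFR votes, assuming each vote can be read as a flat record. The `CefrVote` shape and the `compileCefrVotes` function are hypothetical; this document does not specify the layout of the per-model round 2 files.

```typescript
// Hypothetical flat vote record; the voter is "cefr_source" or an anonymised model label.
interface CefrVote {
  source_id: string;
  lang: string;
  text: string; // the translation being voted on
  voter: string;
  level: string;
}

interface CompiledTranslation {
  text: string;
  votes: Record<string, string>;
}

// source_id -> language -> translations with their collected votes
type CompiledVotes = Record<string, Record<string, CompiledTranslation[]>>;

function compileCefrVotes(allVotes: CefrVote[]): CompiledVotes {
  const compiled: CompiledVotes = {};
  for (const vote of allVotes) {
    const bySynset = compiled[vote.source_id] ?? {};
    compiled[vote.source_id] = bySynset;

    const translations = bySynset[vote.lang] ?? [];
    bySynset[vote.lang] = translations;

    // Reuse the entry for this translation if an earlier vote already created it.
    let entry = translations.find((t) => t.text === vote.text);
    if (!entry) {
      entry = { text: vote.text, votes: {} };
      translations.push(entry);
    }
    entry.votes[vote.voter] = vote.level;
  }
  return compiled;
}

const compiledVotes = compileCefrVotes([
  { source_id: "omw-en-12345", lang: "en", text: "dog", voter: "cefr_source", level: "A1" },
  { source_id: "omw-en-12345", lang: "en", text: "canine", voter: "cefr_source", level: "B2" },
  { source_id: "omw-en-12345", lang: "en", text: "dog", voter: "model_1", level: "A1" },
  { source_id: "omw-en-12345", lang: "en", text: "canine", voter: "model_1", level: "B2" },
  { source_id: "omw-en-12345", lang: "en", text: "canine", voter: "model_2", level: "B1" },
]);
console.log(JSON.stringify(compiledVotes, null, 2));
```

Because votes are keyed by voter, re-running a model and compiling again overwrites that model's earlier vote instead of counting it twice.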
+ +**Input:** `stage-3-enrich/output/round2/{lang}_{model}.json` +**Output:** `stage-3-enrich/output/votes/{lang}_votes.json` + +```bash +pnpm --filter @lila/pipeline enrich --compile-votes +``` + +Each record in the votes file looks like this: + +```json +{ + "source_id": "omw-en-12345", + "pos": "noun", + "translations": { + "en": [ + { + "text": "dog", + "votes": { "cefr_source": "A1", "model_1": "A1", "model_2": "A1" } + }, + { + "text": "canine", + "votes": { "cefr_source": "B2", "model_1": "B2", "model_2": "B1" } + } + ], + "it": [ + { + "text": "cane", + "votes": { "cefr_source": "A1", "model_1": "A1", "model_2": "A1" } + } + ] + }, + "glosses": { + "en": { "text": "a domesticated carnivorous mammal", "source": "omw" }, + "fr": { + "candidates": [ + { "text": "un mammifère carnivore domestiqué", "source": "model_1" }, + { "text": "un animal domestique carnivore", "source": "model_2" } + ], + "votes": { "model_1": 1, "model_2": 1 } + } + }, + "examples": { + "en": [ + { "text": "the dog barked at the stranger", "source": "omw" } + ], + "fr": { + "candidates": [ + { "text": "le chien a aboyé", "source": "model_1" }, + { "text": "le chien gardait la maison", "source": "model_2" } + ], + "votes": { "model_1": 2, "model_2": 1 } + } + }, + "descriptions": { + "en": { + "candidates": [ + { "text": "a common household pet known for loyalty", "source": "model_1" }, + { "text": "a domesticated animal and loyal companion", "source": "model_2" } + ], + "votes": { "model_1": 2, "model_2": 1 } + } + } +} +``` + +### 4. Merge + +Reads the votes file per language and resolves the final value for every field. Produces two output files per language — fully resolved records ready for seeding, and flagged records that need manual review. + +**Merge rules:** + +- OMW data wins automatically and is never overridden +- For CEFR levels: the level with the most votes wins. 
If no majority is reached, that translation is flagged +- For LLM-generated text fields (gloss, examples, descriptions): the candidate with the most votes wins + + + +**Difficulty mapping:** + +| CEFR | Difficulty | +|---|---| +| A1, A2 | easy | +| B1, B2 | intermediate | +| C1, C2 | hard | + +**Input:** `stage-3-enrich/output/votes/{lang}_votes.json` +**Output:** +- `stage-4-merge/output/final/{lang}.json` — fully resolved, ready for seeding +- `stage-4-merge/output/flagged/{lang}.json` — CEFR majority not reached, needs manual review before seeding + +```bash +pnpm --filter @lila/pipeline merge +``` + +Each record in `final/{lang}.json` looks like this: + +```json +{ + "source_id": "omw-en-12345", + "pos": "noun", + "translations": { + "en": [ + { "text": "dog", "cefr_level": "A1", "difficulty": "easy" }, + { "text": "canine", "cefr_level": "B2", "difficulty": "intermediate" } + ], + "it": [ + { "text": "cane", "cefr_level": "A1", "difficulty": "easy" } + ] + }, + "glosses": { + "en": { "text": "a domesticated carnivorous mammal", "source": "omw" }, + "fr": { "text": "un mammifère carnivore domestiqué", "source": "model_1" } + }, + "examples": { + "en": [ + { "text": "the dog barked at the stranger", "source": "omw" } + ], + "fr": [ + { "text": "le chien a aboyé", "source": "model_1" } + ] + }, + "descriptions": { + "en": { + "text": "a common household pet known for loyalty and companionship", + "source": "model_1" + }, + "it": { + "text": "un animale domestico comune noto per la sua fedeltà", + "source": "model_2" + } + } +} +``` + +**Resolving flagged words:** + +Open `stage-4-merge/output/flagged/{lang}.json`, manually set the correct `cefr_level` and `difficulty` for each flagged translation, then move the resolved entries into `stage-4-merge/output/final/{lang}.json`. Re-run the seeder after resolving. + +### 5. Compare / QA + +Read-only. Generates `COVERAGE.md` with a full breakdown of the pipeline +output quality per language. 
Run this after merge to verify output before +seeding the database. + +**Input:** +- `stage-4-merge/output/final/{lang}.json` +- `stage-4-merge/output/flagged/{lang}.json` + +**Output:** `COVERAGE.md` + +```bash +pnpm --filter @lila/pipeline compare +``` + +`COVERAGE.md` reports the following per language: + +- Total synsets extracted +- Total translations per language +- POS breakdown per language — word counts for noun, verb, adjective, adverb +- CEFR coverage per language — how many translations have a resolved CEFR level, broken down by level (A1, A2, B1, B2, C1, C2) +- Difficulty breakdown per language — word counts for easy, intermediate, hard +- Flagged count per language — how many translations are awaiting manual review +- Gloss coverage per language — total glosses, broken down by source (omw vs LLM-generated) and which languages have no glosses at all +- Example coverage per language — same breakdown as glosses +- Description coverage per language — how many translations have a description, broken down by source +- CEFR source file coverage per language — how many words from the source file were matched against OMW translations +- LLM model contribution — how many CEFR votes and text candidates each anonymised model contributed + +## Adding a new language + +1. Add the language code to `SUPPORTED_LANGUAGE_CODES` in `packages/shared/src/constants.ts` +2. Build shared: `pnpm --filter @lila/shared build` +3. Generate and run a DB migration: `pnpm --filter @lila/db generate` then `pnpm --filter @lila/db migrate` +4. Download the OMW lexicon for the language using the `wn` Python library +5. Add a CEFR source file at `stage-2-annotate/sources/cefr/{lang}.json` +6. Run the full pipeline + +## Constants and constraints + +These values are defined in `packages/shared/src/constants.ts` and enforced by database check constraints. The pipeline filters out any entries that violate them. 
+ +| Constant | Values | +|---|---| +| Languages | `en`, `it`, `de`, `es`, `fr` | +| Parts of speech | `noun`, `verb`, `adjective`, `adverb` | +| CEFR levels | `A1`, `A2`, `B1`, `B2`, `C1`, `C2` | +| Difficulty | `easy`, `intermediate`, `hard` | + +Adding a new value to any of these requires a constants update and a database migration before re-running the pipeline. See **Adding a new language** for the full steps — the same process applies for new parts of speech. + +## Further extensions + +These are not part of the current pipeline but are worth considering as the +dataset matures: + +- **Grammatical gender and articles** — Wiktionary dumps contain gender and + article data for nouns across all supported languages. Could be extracted + and stored as a new `translation_forms` table. +- **Conjugations** — Wiktionary also carries verb conjugation tables. Useful + for a future grammar-focused quiz mode. +- **IPA pronunciations** — Wiktionary and Forvo are potential sources for + phonetic transcriptions per language. +- **TTS audio files** — Generate pronunciation audio for each translation + using a local or cloud TTS engine. Stored as static files, served alongside + the quiz UI. +- **Images** — Associate an image with each synset to support visual + vocabulary learning. Could be sourced from open image datasets like + ImageNet or WikiMedia Commons. +- **Frequency data** — Word frequency rankings per language from sources like + the Google Ngram dataset. Useful for smarter difficulty calibration beyond + CEFR levels alone. +- **Improved CEFR source files** — See note at the top of this document. + UniversalCEFR and CEFR-J are good starting points. +- **Additional languages** — The pipeline is language-agnostic. Adding a new + language requires an OMW lexicon, a CEFR source file, and a constants + update. See **Adding a new language**.