# lila data pipeline
This pipeline extracts vocabulary data from Wiktionary via the Kaikki dataset, uses local LLMs to assign CEFR levels and fill content gaps, and writes the authoritative output to `pipeline.db`. The sync script reads this database to populate the production database with vocabulary entries, translations, glosses, CEFR levels, and difficulty ratings.
## Overview
```mermaid
flowchart LR
kaikki[(Kaikki JSONL)]
extract[Extract]
reverselink[Reverse Link Sync]
enrich[Enrich]
pipelinedb[(pipeline.db)]
merge[Merge]
tiebreak[Tiebreak]
compare[Compare]
sync[Sync]
db[(PostgreSQL)]
kaikki --> extract
extract --> pipelinedb
pipelinedb --> reverselink
reverselink --> pipelinedb
pipelinedb --> enrich
enrich --> pipelinedb
pipelinedb --> merge
merge --> pipelinedb
pipelinedb --> tiebreak
tiebreak --> pipelinedb
pipelinedb --> compare
pipelinedb --> sync
sync --> db
```
Each stage is a standalone script that reads from and writes to `pipeline.db`. The pipeline is fully resumable — interrupted overnight runs pick up from the last processed record without losing work.
Stage 1 is a manual prerequisite and is not run by the pipeline orchestrator. See **Stage 1 — Extract** for instructions.
The enrich stage is designed to run overnight, one model at a time. Each model processes every entry and writes results to `pipeline.db` atomically per record.
Only fully resolved records reach the production database. Records where LLMs could not reach a majority vote are handled automatically by the tiebreaker stage before syncing.
## pipeline.db
All pipeline state is stored in `pipeline.db` — a SQLite database in `data-pipeline/db/`. It is created automatically on first run and is not committed to git.
The database serves three purposes:
- **Resumability** — every record is written atomically with a status. Interrupted overnight runs resume from the last pending record without losing work.
- **Vote tracking** — all model votes for CEFR levels and generated content are stored per model per record, giving full auditability of how every decision was reached.
- **Resolved output** — the final resolved records live here and are read by the sync script to seed the production database.
The schema is defined in `data-pipeline/db/schema.sql`. Never edit `pipeline.db` directly — all writes go through the pipeline scripts.
On first run the orchestrator initialises `pipeline.db` automatically and imports the stage 1 output into the base tables. This happens once — subsequent runs skip the import if the base tables are already populated.
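
As a rough illustration of the resumability idea, the sketch below models per-record status in memory. The `RunStatusRow` shape and the `pendingRecordIds` helper are hypothetical; only the `complete` and `needs_review` status values come from the progress query in this document, and the real schema lives in `data-pipeline/db/schema.sql`.

```typescript
// Hypothetical in-memory model of per-record run status; not the real schema.
type RecordStatus = "complete" | "needs_review";

interface RunStatusRow {
  entryId: number;
  stage: string;
  status: RecordStatus;
}

// Returns the ids that still need processing for a stage, in input order.
// Any record without a status row for that stage is treated as pending.
function pendingRecordIds(
  allEntryIds: number[],
  rows: RunStatusRow[],
  stage: string,
): number[] {
  const done = new Set(
    rows.filter((row) => row.stage === stage).map((row) => row.entryId),
  );
  return allEntryIds.filter((id) => !done.has(id));
}

// Simulated state after an interrupted run: entries 1 and 2 were written.
const pendingAfterInterrupt = pendingRecordIds(
  [1, 2, 3, 4],
  [
    { entryId: 1, stage: "round1", status: "complete" },
    { entryId: 2, stage: "round1", status: "needs_review" },
  ],
  "round1",
);
// pendingAfterInterrupt is [3, 4]
```

Because a status row is written together with each record, a restarted run only has to compute this pending set and continue from there.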
## Common commands
### Starting llama.cpp
```bash
cd ~/Downloads/llama.cpp
./build/bin/llama-server \
--model models/qwen3.5-4b-q4_k_m.gguf \
--port 8080 \
--ctx-size 4096 \
--n-gpu-layers 999 \
--host 127.0.0.1 \
--chat-template-kwargs '{"enable_thinking":false}' \
--reasoning-budget 0
```
Verify the server is running:
```bash
curl http://127.0.0.1:8080/health
```
### Running the pipeline
```bash
pnpm --filter @lila/pipeline pipeline:run
```
The pipeline auto-generates a run name from the date and a counter. It picks up where it left off — completed stages are skipped automatically.
### Stage 1 — Extract
```bash
pnpm --filter @lila/pipeline extract
```
Runs in sample mode (500 entries per language) by default. Remove the hardcoded limit in `stage-1-extract/scripts/extract.ts` for a full run.
### Stage 2 — Reverse link sync
```bash
pnpm --filter @lila/pipeline reverse-link
```
### Initialising and importing the database
```bash
# Initialise pipeline.db from schema
pnpm --filter @lila/pipeline db:init
# Import stage 1 output into pipeline.db
pnpm --filter @lila/pipeline db:import
```
### Resetting the database
```bash
# Full reset — delete and reinitialise
rm data-pipeline/db/pipeline.db
pnpm --filter @lila/pipeline db:init
pnpm --filter @lila/pipeline db:import
pnpm --filter @lila/pipeline reverse-link
```
### Resetting enrich stage progress
```bash
# Reset round 1 only (retry failed or incomplete run)
node -e "
const Database = require('better-sqlite3');
const db = new Database('data-pipeline/db/pipeline.db');
const result = db.prepare(\"DELETE FROM run_status WHERE stage = 'round1'\").run();
console.log('Deleted', result.changes, 'rows');
db.close();
"
# Reset all enrich progress (round 1 and round 2)
node -e "
const Database = require('better-sqlite3');
const db = new Database('data-pipeline/db/pipeline.db');
const result = db.prepare(\"DELETE FROM run_status WHERE stage IN ('round1', 'round2')\").run();
console.log('Deleted', result.changes, 'rows');
db.close();
"
```
### Checking pipeline progress
```bash
node -e "
const Database = require('better-sqlite3');
const db = new Database('data-pipeline/db/pipeline.db', { readonly: true });
const total = db.prepare('SELECT COUNT(*) as c FROM entries WHERE language = \\'en\\'').get().c;
const complete = db.prepare(\"SELECT COUNT(*) as c FROM run_status WHERE stage = 'round1' AND status = 'complete'\").get().c;
const needsReview = db.prepare(\"SELECT COUNT(*) as c FROM run_status WHERE stage = 'round1' AND status = 'needs_review'\").get().c;
console.log('Total English entries:', total);
console.log('Round 1 complete:', complete);
console.log('Needs review:', needsReview);
console.log('Pending:', total - complete - needsReview);
db.close();
"
```
## Data source
### Kaikki (Wiktionary)
The pipeline uses pre-extracted Wiktionary data from [kaikki.org](https://kaikki.org), built with the [wiktextract](https://github.com/tatuylonen/wiktextract) tool. This data is updated weekly from the English Wiktionary dump and is freely available under the same license as Wiktionary (CC-BY-SA).
**Why Kaikki instead of OMW:**
Kaikki is structured per word sense. Each headword has multiple senses, and translations are linked to a specific sense rather than a general concept. This prevents the sense disambiguation problems found in OMW, where a single concept entry could contain translations from entirely different meanings of a word.
Each Kaikki entry provides:
- A headword in the entry language
- One or more senses, each with a gloss and examples
- Per-sense translations to other languages with sense hints
- IPA pronunciations and audio file references (deferred — see **Further extensions**)
- Inflected forms (deferred — see **Further extensions**)
The pipeline uses the English Wiktionary edition (`enwiktionary`), which contains entries for all five supported languages with glosses in English.
### CEFR levels
CEFR levels are assigned entirely by LLM majority vote. Each model receives the headword, gloss, and an example sentence and votes on the appropriate level (A1–C2). There are no curated source files; the LLMs are the sole source of CEFR annotations.
If no majority is reached after all model runs, the entry is handled automatically by the tiebreaker stage.
## Setup
### Kaikki data files
Download the pre-extracted Kaikki JSONL files for each language. The files are large, so download them to `stage-1-extract/sources/`, which is not committed to git.
```bash
mkdir -p stage-1-extract/sources
cd stage-1-extract/sources
# English entries (contains translations to all other languages)
wget https://kaikki.org/dictionary/English/kaikki.org-dictionary-English.jsonl.gz
# Per-language files (for entries written in those languages)
wget https://kaikki.org/dictionary/German/kaikki.org-dictionary-German.jsonl.gz
wget https://kaikki.org/dictionary/Italian/kaikki.org-dictionary-Italian.jsonl.gz
wget https://kaikki.org/dictionary/French/kaikki.org-dictionary-French.jsonl.gz
wget https://kaikki.org/dictionary/Spanish/kaikki.org-dictionary-Spanish.jsonl.gz
# Decompress
gunzip *.gz
```
### LLM setup
See `llm-setup.md`.
## Pipeline stages
| Stage | What it does |
| --------------- | ------------------------------------------------------------------------ |
| 1. Extract | Parses Kaikki JSONL, imports entries into `pipeline.db` |
| 2. Reverse link | Inserts missing reverse translations between language pairs |
| 3. Enrich | LLMs fill translation gaps, improve glosses/examples, assign CEFR levels |
| 4. Merge | Resolves LLM votes into final values |
| 4b. Tiebreak | Runs unused models on flagged entries until majority is reached |
| 5. Compare / QA | Generates `COVERAGE.md` with detailed quality report |
| 6. Sync | Upserts resolved records into production PostgreSQL |
### 1. Extract
Parses the Kaikki JSONL files for all five languages and imports them into the base tables of `pipeline.db`. Filters to the four supported parts of speech: noun, verb, adjective, adverb. Each Kaikki sense becomes one row in `vocabulary_entries`. Translations are stored in `entry_translations` with their sense hints.
**Input:** `stage-1-extract/sources/*.jsonl`
**Output:** `pipeline.db`, with the `vocabulary_entries` and `entry_translations` tables populated
```bash
pnpm --filter @lila/pipeline extract
```
Add `--sample 100` to import only 100 entries per language for inspection before running the full import.
Each entry in `pipeline.db` looks like this:
```json
{
"headword": "thrill",
"language": "en",
"pos": "verb",
"sense_index": 0,
"gloss": "To suddenly excite someone, or to give them great pleasure.",
"examples": ["The movie thrilled the audience."],
"translations": [
{ "language": "de", "word": "begeistern", "sense_hint": "suddenly excite" },
{
"language": "fr",
"word": "enthousiasmer",
"sense_hint": "suddenly excite"
},
{ "language": "it", "word": "entusiasmare" },
{ "language": "es", "word": "emocionar" }
]
}
```
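
A minimal sketch of the sense-to-row mapping described in this section. The `RawEntry` input shape is a simplified assumption for illustration and does not mirror the full Kaikki schema; `toRows` and `SUPPORTED_POS` are hypothetical names, not the identifiers used in `extract.ts`.

```typescript
// Simplified, assumed input shape; real Kaikki records carry many more fields.
interface RawSense {
  gloss: string;
  examples: string[];
  translations: { language: string; word: string; sense_hint?: string }[];
}

interface RawEntry {
  word: string;
  language: string;
  pos: string;
  senses: RawSense[];
}

// Row shape matching the JSON example above.
interface VocabularyRow {
  headword: string;
  language: string;
  pos: string;
  sense_index: number;
  gloss: string;
  examples: string[];
  translations: RawSense["translations"];
}

const SUPPORTED_POS = new Set(["noun", "verb", "adjective", "adverb"]);

// Emits one row per sense and drops entries with an unsupported part of speech.
function toRows(entry: RawEntry): VocabularyRow[] {
  if (!SUPPORTED_POS.has(entry.pos)) return [];
  return entry.senses.map((sense, senseIndex) => ({
    headword: entry.word,
    language: entry.language,
    pos: entry.pos,
    sense_index: senseIndex,
    gloss: sense.gloss,
    examples: sense.examples,
    translations: sense.translations,
  }));
}
```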
> **Note:** Stage 1 is a manual prerequisite. It is not run by the pipeline orchestrator (`pipeline.ts`). Run it once before running the orchestrator for the first time, and re-run it manually if the Kaikki source files are updated.
### 2. Reverse link sync
A pure script stage — no LLMs. For each translation pair in `entry_translations`, checks whether the reverse link exists. If English _thrill → begeistern_ exists and the German entry _begeistern_ exists in `vocabulary_entries` but lacks the English back-link, it is inserted automatically.
This runs before the enrich stage so that LLMs only generate translations that are genuinely missing — not translations that would be found by a simple reverse lookup.
**Input:** `pipeline.db` — populated `vocabulary_entries` and `entry_translations`
**Output:** `pipeline.db` — missing reverse links inserted into `entry_translations`
```bash
pnpm --filter @lila/pipeline reverse-link
```
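
The reverse-link check can be sketched in a few lines. This is an in-memory illustration only: `EntryKey`, `TranslationLink`, and `missingReverseLinks` are hypothetical names, and the sketch matches on language and headword alone, whereas the real script works on the `vocabulary_entries` and `entry_translations` tables and may also take senses into account.

```typescript
// Hypothetical in-memory shapes; the real data lives in pipeline.db tables.
interface EntryKey {
  language: string;
  headword: string;
}

interface TranslationLink {
  from: EntryKey;
  to: EntryKey;
}

const keyOf = (k: EntryKey): string => `${k.language}:${k.headword}`;
const linkKey = (l: TranslationLink): string =>
  `${keyOf(l.from)}->${keyOf(l.to)}`;

// Returns the reverse links that should be inserted: the target entry exists,
// but no link points from it back at the source entry.
function missingReverseLinks(
  entries: EntryKey[],
  links: TranslationLink[],
): TranslationLink[] {
  const knownEntries = new Set(entries.map(keyOf));
  const knownLinks = new Set(links.map(linkKey));
  const missing: TranslationLink[] = [];
  for (const link of links) {
    const reverse: TranslationLink = { from: link.to, to: link.from };
    if (!knownEntries.has(keyOf(link.to))) continue; // no target entry to attach to
    if (knownLinks.has(linkKey(reverse))) continue; // back-link already present
    knownLinks.add(linkKey(reverse)); // avoid emitting the same link twice
    missing.push(reverse);
  }
  return missing;
}
```

With an English entry _thrill_, a German entry _begeistern_, and only the link _thrill → begeistern_ present, the function returns the single link _begeistern → thrill_.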
### 3. Enrich
> **Note:** Before running this stage, ensure the llama.cpp server is running
> locally. The orchestrator checks for a running server at
> `http://127.0.0.1:8080/health` and exits with instructions if it is not
> reachable. See `llm-setup.md` for setup instructions.
The enrich stage runs four ordered sub-stages per entry, each building on the output of the one before it.
**Sub-stage order:**
1. **`round1_gloss`** — the LLM reviews the existing gloss. If it is clear and learner-friendly, it confirms it. If not, it generates a better one.
2. **`round1_example`** — the LLM reviews the existing examples. If they are natural and suitable, it confirms them. If not, it generates one better example sentence in the entry language.
3. **`round1_translations`** — using the verified gloss as context, the LLM reviews each existing translation. Valid translations are confirmed. Invalid ones (wrong language, suffixes, garbled text, wrong sense) are explicitly rejected. Missing languages get a generated translation.
4. **`round1_cefr`** — using only the validated translations from the previous sub-stage, the LLM votes on the CEFR level for the headword and for each confirmed translation. Rejected translations never reach this sub-stage.
This ordering ensures the CEFR voting sub-stage only sees clean, verified data.
All output is written to `pipeline.db` atomically per sub-stage per entry. Interrupted runs resume from the last incomplete sub-stage without losing work. Each model is run once — one model, one vote per sub-stage.
**Input:** `pipeline.db` — entries after reverse link sync
**Output:** `pipeline.db` — gloss votes, example votes, translation votes, CEFR votes per entry per model
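
The fixed sub-stage order and the resume behaviour can be sketched as follows. The sub-stage names are taken from the list above; `SUB_STAGES`, `SubStage`, and `nextSubStage` are hypothetical names used for illustration.

```typescript
// Sub-stage names come from this document; everything else is hypothetical.
const SUB_STAGES = [
  "round1_gloss",
  "round1_example",
  "round1_translations",
  "round1_cefr",
] as const;

type SubStage = (typeof SUB_STAGES)[number];

// Returns the first sub-stage that has not been completed for an entry, or
// null when all four are done. Because the order is fixed, round1_cefr is
// never selected while round1_translations is still outstanding.
function nextSubStage(completed: ReadonlySet<SubStage>): SubStage | null {
  for (const stage of SUB_STAGES) {
    if (!completed.has(stage)) return stage;
  }
  return null;
}

// An entry interrupted after the first two sub-stages resumes at the third.
const resumeAt = nextSubStage(
  new Set<SubStage>(["round1_gloss", "round1_example"]),
);
// resumeAt is "round1_translations"
```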
### 4. Merge
Reads all LLM votes from `pipeline.db` and resolves the final value for every field. Writes resolved entries back to `pipeline.db`.
**Merge rules:**
- Kaikki source data wins automatically and is never overridden by LLM output
- For CEFR levels: the level with the most votes wins. If no majority is reached, the entry is flagged for the tiebreaker
- For LLM-generated text fields: the candidate with the most votes wins. If no majority is reached, the tiebreaker runs
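
A minimal sketch of the vote resolution rule. It assumes that "no majority" means a tie for first place (or no votes at all); if the intended rule is a strict more-than-half majority, the comparison would need to change. `VoteOutcome` and `resolveVotes` are hypothetical names.

```typescript
type VoteOutcome =
  | { kind: "resolved"; value: string; votes: number }
  | { kind: "flagged"; tally: Map<string, number> };

// Counts votes and returns the single value with the most votes.
// Assumption: a tie for first place, or an empty vote list, counts as
// "no majority" and flags the entry for the tiebreaker.
function resolveVotes(votes: string[]): VoteOutcome {
  const tally = new Map<string, number>();
  for (const vote of votes) {
    tally.set(vote, (tally.get(vote) ?? 0) + 1);
  }
  // Sort candidates by vote count, highest first.
  const ranked = Array.from(tally.entries()).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return { kind: "flagged", tally };
  const [topValue, topCount] = ranked[0];
  const tiedForFirst = ranked.length > 1 && ranked[1][1] === topCount;
  return tiedForFirst
    ? { kind: "flagged", tally }
    : { kind: "resolved", value: topValue, votes: topCount };
}
```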
**Difficulty mapping:**
| CEFR | Difficulty |
| ------ | ------------ |
| A1, A2 | easy |
| B1, B2 | intermediate |
| C1, C2 | hard |
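
The table above translates directly into a lookup. The identifiers `CefrLevel`, `Difficulty`, `DIFFICULTY_BY_CEFR`, and `difficultyFor` are hypothetical; only the level and difficulty values come from this document.

```typescript
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
type Difficulty = "easy" | "intermediate" | "hard";

// Lookup table mirroring the difficulty mapping table above.
const DIFFICULTY_BY_CEFR: { [level in CefrLevel]: Difficulty } = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

// Maps a resolved CEFR level to its difficulty rating.
function difficultyFor(level: CefrLevel): Difficulty {
  return DIFFICULTY_BY_CEFR[level];
}
```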
**Input:** `pipeline.db` — LLM votes
**Output:** `pipeline.db` — entries updated with resolved values or flagged status
### 4b. Tiebreak
Runs automatically after merge if any entries remain flagged. The script queries `pipeline.db` for flagged entries, identifies which configured models have not yet voted on each entry, and runs those models on the flagged subset only. Merge is re-run after each tiebreaker pass. This repeats until all flagged entries are resolved or no unused models remain.
If unused models are exhausted and flagged entries remain, the script logs a detailed report showing the exact vote split for each unresolved entry and lists available models from OpenRouter that have not been used. Syncing is blocked until all entries are resolved. To continue, add one or more models to the config and re-run the pipeline — the tiebreaker will pick up automatically.
> **Note:** The tiebreaker is not a standalone script. It runs automatically as part of the pipeline orchestrator after merge completes.
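
The selection of models for a tiebreaker pass could look roughly like this. `FlaggedEntry` and `unusedModelsPerEntry` are hypothetical names, and the model identifiers in the usage note are placeholders.

```typescript
// Hypothetical shape: which models have already voted on a flagged entry.
interface FlaggedEntry {
  entryId: number;
  votedModels: string[];
}

// For each flagged entry, lists the configured models that have not voted yet.
// An empty list means the entry cannot be resolved without adding a model
// to the config.
function unusedModelsPerEntry(
  configuredModels: string[],
  flagged: FlaggedEntry[],
): Map<number, string[]> {
  const plan = new Map<number, string[]>();
  for (const entry of flagged) {
    const voted = new Set(entry.votedModels);
    plan.set(
      entry.entryId,
      configuredModels.filter((model) => !voted.has(model)),
    );
  }
  return plan;
}
```

With configured models `model-a`, `model-b`, and `model-c`, an entry that already has votes from the first two is assigned `model-c`, while an entry voted on by all three gets an empty list and would appear in the unresolved report.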
### 5. Compare / QA
Read-only. Generates `COVERAGE.md` with a full breakdown of pipeline output quality per language. Run this after merge to verify output before syncing to the database.
**Input:** `pipeline.db` — entries with status `final`
**Output:** `COVERAGE.md`
`COVERAGE.md` reports the following per language:
- Total entries extracted
- POS breakdown — entry counts for noun, verb, adjective, adverb
- Translation coverage — how many entries have translations in each other language
- CEFR coverage — how many entries have a resolved CEFR level, broken down by level
- Difficulty breakdown — entry counts for easy, intermediate, hard
- Gloss coverage — how many entries have a gloss, broken down by source (Kaikki vs LLM-generated)
- Example coverage — same breakdown as glosses
- LLM model contribution — how many CEFR votes and text candidates each anonymised model contributed
### 6. Sync
The sync script transfers all entries with status `final` in `pipeline.db` to the production PostgreSQL database. It is upsert-based and never wipes existing data. For each entry it checks whether a matching record already exists in the target database:
- **Missing** → insert
- **Present but changed** → update
- **Present and unchanged** → skip
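
The three cases above amount to a small decision function. The sketch below is illustrative only: `SyncedFields` is a hypothetical subset of the synced columns, and `decideSyncAction` is not the name used in the sync script.

```typescript
type SyncAction = "insert" | "update" | "skip";

// Hypothetical subset of the fields that get synced.
interface SyncedFields {
  gloss: string;
  cefr: string;
  difficulty: string;
}

// Decides what to do with one resolved entry, given the matching production
// row (or undefined when there is none).
function decideSyncAction(
  incoming: SyncedFields,
  existing: SyncedFields | undefined,
): SyncAction {
  if (existing === undefined) return "insert";
  const changed =
    incoming.gloss !== existing.gloss ||
    incoming.cefr !== existing.cefr ||
    incoming.difficulty !== existing.difficulty;
  return changed ? "update" : "skip";
}
```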
Run this after all entries are resolved and Compare / QA has been reviewed.
```bash
pnpm --filter @lila/pipeline sync
```
The sync script requires a connection string to the target database. Set `DATABASE_URL` in your `.env` file before running.
## Reports
The pipeline generates a report at the end of every run. Reports are written to `data-pipeline/reports/` as a JSON file and a markdown file with the same name. The markdown is generated from the JSON and contains identical data.
```
data-pipeline/reports/
2026-05-03_run-1.json
2026-05-03_run-1.md
```
The run name is auto-generated from the date and a counter. Reports are not committed to git.
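
One way the run name could be derived is sketched below. It assumes the counter restarts at 1 each day and is one higher than the largest counter already used for that date; `nextRunName` is a hypothetical helper, and only the `YYYY-MM-DD_run-N` format comes from the listing above.

```typescript
// Builds the next run name for a given day, e.g. "2026-05-03_run-2" when
// "2026-05-03_run-1" already exists.
// Assumption: the counter restarts each day.
function nextRunName(isoDate: string, existingNames: string[]): string {
  const prefix = `${isoDate}_run-`;
  let highest = 0;
  for (const existing of existingNames) {
    if (!existing.startsWith(prefix)) continue; // different day or unrelated file
    const counter = Number.parseInt(existing.slice(prefix.length), 10);
    if (Number.isInteger(counter) && counter > highest) highest = counter;
  }
  return `${prefix}${highest + 1}`;
}
```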
**Nightly report** contains:
- Entries processed this run vs total
- Entries remaining per stage
- Average processing speed and estimated nights remaining
- `needs_review` count — entries that failed structural validation
- Per-model progress breakdown
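
The "estimated nights remaining" figure is simple arithmetic. The sketch below assumes each future night processes about as many entries as the average night so far; `estimateNightsRemaining` is a hypothetical helper, not the report generator's actual code.

```typescript
// Estimates how many more overnight runs are needed, assuming each night
// processes about as many entries as the average night so far.
function estimateNightsRemaining(
  processedPerNight: number[],
  totalEntries: number,
): number {
  const processed = processedPerNight.reduce((sum, n) => sum + n, 0);
  const remaining = Math.max(totalEntries - processed, 0);
  if (remaining === 0) return 0;
  if (processed === 0) return Number.POSITIVE_INFINITY; // no speed data yet
  const averagePerNight = processed / processedPerNight.length;
  return Math.ceil(remaining / averagePerNight);
}

// Two nights of 1000 and 1500 entries out of 10000 total:
// 7500 remaining at an average of 1250 per night gives 6 nights.
const nightsLeft = estimateNightsRemaining([1000, 1500], 10000);
```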
**Final report** (generated when all entries are processed) additionally contains:
- Full vote breakdown per model
- Flagged entries with exact vote splits
- Available unused models from OpenRouter for tiebreaking
- Per-model quality metrics — CEFR agreement rate, field coverage, JSON parse rate
## Adding a new language
1. Add the language code to `SUPPORTED_LANGUAGE_CODES` in `packages/shared/src/constants.ts`
2. Build shared: `pnpm --filter @lila/shared build`
3. Generate and run a DB migration: `pnpm --filter @lila/db generate` then `pnpm --filter @lila/db migrate`
4. Download the Kaikki JSONL file for the language from kaikki.org
5. Re-run the full pipeline
## Constants and constraints
These values are defined in `packages/shared/src/constants.ts` and enforced by database check constraints. The pipeline filters out any entries that violate them.
| Constant | Values |
| --------------- | ------------------------------------- |
| Languages | `en`, `it`, `de`, `es`, `fr` |
| Parts of speech | `noun`, `verb`, `adjective`, `adverb` |
| CEFR levels | `A1`, `A2`, `B1`, `B2`, `C1`, `C2` |
| Difficulty | `easy`, `intermediate`, `hard` |
Adding a new value to any of these requires a constants update and a database migration before re-running the pipeline. See **Adding a new language** for the full steps — the same process applies for new parts of speech.
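
For orientation, the constants and the filtering rule might be expressed along these lines. `SUPPORTED_LANGUAGE_CODES` is the name given in this document; `SUPPORTED_POS_VALUES`, `CEFR_LEVEL_VALUES`, `isSupportedLanguage`, and `passesConstraints` are hypothetical, and the actual contents of `constants.ts` may differ.

```typescript
// Values copied from the table above. SUPPORTED_LANGUAGE_CODES is named in
// this document; the other identifiers are hypothetical.
const SUPPORTED_LANGUAGE_CODES = ["en", "it", "de", "es", "fr"] as const;
const SUPPORTED_POS_VALUES = ["noun", "verb", "adjective", "adverb"] as const;
const CEFR_LEVEL_VALUES = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;

type LanguageCode = (typeof SUPPORTED_LANGUAGE_CODES)[number];

// Type guard: narrows an arbitrary string to a supported language code.
function isSupportedLanguage(code: string): code is LanguageCode {
  return (SUPPORTED_LANGUAGE_CODES as readonly string[]).includes(code);
}

// Mirrors the filtering rule: an entry that violates any constraint is dropped.
// A missing CEFR level is allowed, since levels are assigned later by the LLMs.
function passesConstraints(entry: {
  language: string;
  pos: string;
  cefr?: string;
}): boolean {
  return (
    isSupportedLanguage(entry.language) &&
    (SUPPORTED_POS_VALUES as readonly string[]).includes(entry.pos) &&
    (entry.cefr === undefined ||
      (CEFR_LEVEL_VALUES as readonly string[]).includes(entry.cefr))
  );
}
```

Deriving the `LanguageCode` type from the array keeps the runtime list and the compile-time type in step, so adding a language is a single edit before the migration.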
## Further extensions
These are not part of the current pipeline but are worth considering as the dataset matures:
- **IPA pronunciations** — Kaikki includes IPA transcriptions for most entries. Could be extracted and stored in an `entry_pronunciations` table and displayed in the quiz UI.
- **Audio files** — kaikki.org provides bulk audio file downloads (~20GB) for pronunciations. Could be stored as static files and served alongside the quiz UI.
- **Inflected forms** — Kaikki provides conjugation and declension tables in a `forms` array. Useful for a future grammar-focused quiz mode.
- **Grammatical gender** — Kaikki includes grammatical gender for nouns. Could be stored per entry and used as an additional quiz mechanic.
- **Frequency data** — Word frequency rankings per language from sources like the Google Ngram dataset. Useful for smarter difficulty calibration beyond CEFR levels alone.
- **Additional languages** — The pipeline is language-agnostic. Adding a new language requires downloading its Kaikki JSONL file, a constants update, and a database migration. See **Adding a new language**.
## Roadmap
**Current state:** Stage 1 extraction and stage 2 reverse link sync complete and verified on sample data. Stage 3 enrich script written and tested — redesigning to sub-stage architecture for better data quality. llama.cpp running with Qwen3.5-4B.
**Next action:** Rewrite enrich script for sub-stage design.
| Stage | Status |
| --------------- | -------------- |
| 1. Extract | 🔄 in progress |
| 2. Reverse link | 🔄 in progress |
| 3. Enrich | 🔄 in progress |
| 4. Merge | 🔲 not started |
| 4b. Tiebreak | 🔲 not started |
| 5. Compare / QA | 🔲 not started |
| 6. Sync | 🔲 not started |
### Stage 1 — Extract `🔄 in progress`
- [x] Download Kaikki JSONL files for all 5 languages
- [x] Write extraction script
- [x] Write stage 1 validation tests
- [x] Write db schema, init, and import scripts
- [x] Write db import validation tests
- [x] Run sample extraction → `stage-1-extract/output/{lang}.json`
- [ ] Remove sample limit and run full extraction
- [ ] Re-run full import → `pipeline.db`
### Stage 2 — Reverse link sync `🔄 in progress`
- [x] Write reverse link sync script
- [x] Run reverse link sync on sample data → 141 links inserted
- [ ] Run reverse link sync on full data after full extraction
### Stage 3 — Enrich `🔄 in progress`
**Next action:** Rewrite enrich script for sub-stage design.
- [x] Write initial enrich script (single-prompt design)
- [x] Install llama.cpp and verify server
- [x] Smoke test with sample entries
- [ ] Rewrite enrich script for sub-stage design (round1_gloss, round1_example, round1_translations, round1_cefr)
- [ ] Write tests for enrich sub-stages
- [ ] Run full sample, collect metrics
- [ ] Compare providers (local vs OpenRouter free models)
- [ ] Production run — all entries, all models
### Stage 4 — Merge `🔲 not started`
- [ ] Write merge script
- [ ] Write tests
- [ ] Run merge → `pipeline.db`
- [ ] Confirm tiebreaker resolves all flagged entries
### Stage 4b — Tiebreak `🔲 not started`
- [ ] Write tiebreak logic
- [ ] Run tiebreaker for all flagged entries
- [ ] Confirm no flagged entries remain before syncing
### Stage 5 — Compare / QA `🔲 not started`
- [ ] Write compare script
- [ ] Write tests
- [ ] Run compare → `COVERAGE.md`
- [ ] Review output quality before syncing
### Stage 6 — Sync `🔲 not started`
- [ ] Write sync script
- [ ] Write tests
- [ ] Configure `DATABASE_URL` in `.env`
- [ ] Run sync → production PostgreSQL
- [ ] Verify seeded data in production
### Utilities
**`sample/`** — Runs the pipeline against a small sample to produce human-readable output for a quick sanity check before committing to a full run. Run this after any script change before running the full pipeline.