lila 38d8b85228 docs: rewrite data-pipeline.md for Kaikki migration

2026-05-05 17:14:48 +02:00


lila data pipeline

This pipeline extracts vocabulary data from Wiktionary via the Kaikki dataset, enriches it with CEFR levels and fills content gaps using local LLMs, and produces authoritative output in pipeline.db. This database is consumed by the sync script to populate the production database with vocabulary entries, translations, glosses, CEFR levels, and difficulty ratings.

Overview

flowchart LR
    kaikki[(Kaikki JSONL)]
    extract[Extract]
    reverselink[Reverse Link Sync]
    enrich[Enrich]
    pipelinedb[(pipeline.db)]
    merge[Merge]
    tiebreak[Tiebreak]
    compare[Compare]
    sync[Sync]
    db[(PostgreSQL)]

    kaikki --> extract
    extract --> pipelinedb
    pipelinedb --> reverselink
    reverselink --> pipelinedb
    pipelinedb --> enrich
    enrich --> pipelinedb
    pipelinedb --> merge
    merge --> pipelinedb
    pipelinedb --> tiebreak
    tiebreak --> pipelinedb
    pipelinedb --> compare
    pipelinedb --> sync
    sync --> db

Each stage is a standalone script that reads from and writes to pipeline.db. The pipeline is fully resumable — interrupted overnight runs pick up from the last processed record without losing work.

Stage 1 is a manual prerequisite and is not run by the pipeline orchestrator. See Stage 1 — Extract for instructions.

The enrich stage is designed to run overnight, one model at a time. Each model processes every entry and writes results to pipeline.db atomically per record.

Only fully resolved records reach the production database. Records where LLMs could not reach a majority vote are handled automatically by the tiebreaker stage before syncing.

pipeline.db

All pipeline state is stored in pipeline.db — a SQLite database in data-pipeline/db/. It is created automatically on first run and is not committed to git.

The database serves three purposes:

Resumability — every record is written atomically with a status. Interrupted overnight runs resume from the last pending record without losing work.
Vote tracking — all model votes for CEFR levels and generated content are stored per model per record, giving full auditability of how every decision was reached.
Resolved output — the final resolved records live here and are read by the sync script to seed the production database.
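The resumability guarantee can be illustrated with a minimal in-memory sketch. The `PipelineRecord` shape, the status values, and `runStage` are hypothetical stand-ins; the real columns are defined in data-pipeline/db/schema.sql and the real writes go to SQLite.

```typescript
// Hypothetical record shape for illustration only.
type Status = "pending" | "done";

interface PipelineRecord {
  id: number;
  status: Status;
  result?: string;
}

// Process pending records one at a time. Result and status are set together,
// so an interruption between records never leaves a half-written record.
function runStage(
  records: PipelineRecord[],
  work: (id: number) => string,
  maxRecords: number = Infinity,
): number {
  let processed = 0;
  for (const record of records) {
    if (record.status !== "pending") continue; // handled by an earlier run
    if (processed >= maxRecords) break; // stands in for an interrupted run
    const result = work(record.id);
    // In SQLite this would be a single UPDATE inside a transaction.
    record.result = result;
    record.status = "done";
    processed += 1;
  }
  return processed;
}

const records: PipelineRecord[] = [1, 2, 3, 4].map((id) => ({
  id,
  status: "pending" as Status,
}));
const firstRun = runStage(records, (id) => `r${id}`, 2); // stops after two records
const secondRun = runStage(records, (id) => `r${id}`); // resumes with the rest
```

The second call skips the records the first call finished, which is the behaviour the status column is meant to provide.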

The schema is defined in data-pipeline/db/schema.sql. Never edit pipeline.db directly — all writes go through the pipeline scripts.

On first run the orchestrator initialises pipeline.db automatically and imports the stage 1 output into the base tables. This happens once — subsequent runs skip the import if the base tables are already populated.

Data source

Kaikki (Wiktionary)

The pipeline uses pre-extracted Wiktionary data from kaikki.org, built with the wiktextract tool. This data is updated weekly from the English Wiktionary dump and is freely available under the same license as Wiktionary (CC-BY-SA).

Why Kaikki instead of OMW (Open Multilingual Wordnet): Kaikki is structured per word sense. Each headword has one or more senses, and each translation is linked to a specific sense rather than to a general concept. This avoids the sense disambiguation problems found in OMW, where a single concept entry could contain translations from entirely different meanings of a word.

Each Kaikki entry provides:

A headword in the entry language
One or more senses, each with a gloss and examples
Per-sense translations to other languages with sense hints
IPA pronunciations and audio file references (deferred — see Further extensions)
Inflected forms (deferred — see Further extensions)

The pipeline uses the English Wiktionary edition (enwiktionary), which contains entries for all five supported languages with glosses in English.

CEFR levels

CEFR levels are assigned entirely by LLM majority vote. Each model receives the headword, gloss, and an example sentence and votes on the appropriate level (A1–C2). There are no curated source files — the LLMs are the sole source of CEFR annotations.

If no majority is reached after all model runs, the entry is handled automatically by the tiebreaker stage.

Setup

Kaikki data files

Download the pre-extracted Kaikki JSONL files for each language. These are large files — download them to stage-1-extract/sources/ which is not committed to git.

mkdir -p stage-1-extract/sources
cd stage-1-extract/sources

# English entries (contains translations to all other languages)
wget https://kaikki.org/dictionary/English/kaikki.org-dictionary-English.jsonl.gz

# Per-language files (for entries written in those languages)
wget https://kaikki.org/dictionary/German/kaikki.org-dictionary-German.jsonl.gz
wget https://kaikki.org/dictionary/Italian/kaikki.org-dictionary-Italian.jsonl.gz
wget https://kaikki.org/dictionary/French/kaikki.org-dictionary-French.jsonl.gz
wget https://kaikki.org/dictionary/Spanish/kaikki.org-dictionary-Spanish.jsonl.gz

# Decompress
gunzip *.gz

LLM setup

See llm-setup.md.

Pipeline stages

| Stage | What it does |
| --- | --- |
| 1. Extract | Parses Kaikki JSONL, imports entries into `pipeline.db` |
| 2. Reverse link | Inserts missing reverse translations between language pairs |
| 3. Enrich | LLMs fill translation gaps, improve glosses/examples, assign CEFR levels |
| 4. Merge | Resolves LLM votes into final values |
| 4b. Tiebreak | Runs unused models on flagged entries until majority is reached |
| 5. Compare / QA | Generates `COVERAGE.md` with detailed quality report |
| 6. Sync | Upserts resolved records into production PostgreSQL |

1. Extract

Parses the Kaikki JSONL files for all five languages and imports them into the base tables of pipeline.db. Filters to the four supported parts of speech: noun, verb, adjective, adverb. Each Kaikki sense becomes one row in vocabulary_entries. Translations are stored in entry_translations with their sense hints.

Input: stage-1-extract/sources/*.jsonl
Output: pipeline.db (vocabulary_entries and entry_translations tables populated)

pnpm --filter @lila/pipeline extract

Add --sample 100 to import only 100 entries per language for inspection before running the full import.

Each entry in pipeline.db looks like this:

{
  "headword": "thrill",
  "language": "en",
  "pos": "verb",
  "sense_index": 0,
  "gloss": "To suddenly excite someone, or to give them great pleasure.",
  "examples": ["The movie thrilled the audience."],
  "translations": [
    { "language": "de", "word": "begeistern", "sense_hint": "suddenly excite" },
    {
      "language": "fr",
      "word": "enthousiasmer",
      "sense_hint": "suddenly excite"
    },
    { "language": "it", "word": "entusiasmare" },
    { "language": "es", "word": "emocionar" }
  ]
}
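As a rough sketch of how one source record could fan out into rows of this shape: the input type below is a simplified, hypothetical stand-in and not the exact wiktextract schema, and `toEntries` is an illustrative name. The part-of-speech filter and the one-row-per-sense rule are the parts taken from the description above.

```typescript
const SUPPORTED_POS = ["noun", "verb", "adjective", "adverb"] as const;
const SUPPORTED_LANGUAGES = ["en", "it", "de", "es", "fr"] as const;

interface Translation {
  language: string;
  word: string;
  sense_hint?: string;
}

// Simplified, hypothetical input: translations already grouped per sense.
interface SourceRecord {
  headword: string;
  language: string;
  pos: string;
  senses: { gloss: string; examples: string[]; translations: Translation[] }[];
}

interface VocabularyEntry {
  headword: string;
  language: string;
  pos: string;
  sense_index: number;
  gloss: string;
  examples: string[];
  translations: Translation[];
}

function toEntries(record: SourceRecord): VocabularyEntry[] {
  // Drop records whose part of speech is not supported.
  if ((SUPPORTED_POS as readonly string[]).indexOf(record.pos) === -1) return [];
  // One output row per sense; sense_index is the position in the source.
  return record.senses.map((sense, index) => ({
    headword: record.headword,
    language: record.language,
    pos: record.pos,
    sense_index: index,
    gloss: sense.gloss,
    examples: sense.examples,
    // Keep only translations into the other supported languages.
    translations: sense.translations.filter(
      (t) =>
        t.language !== record.language &&
        (SUPPORTED_LANGUAGES as readonly string[]).indexOf(t.language) !== -1,
    ),
  }));
}
```

Raw part-of-speech labels in the source files may differ from the four names used here, so a real extraction script would likely need a mapping step first.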

Note: Stage 1 is a manual prerequisite. It is not run by the pipeline orchestrator (pipeline.ts). Run it once before running the orchestrator for the first time, and re-run it manually if the Kaikki source files are updated.

2. Reverse link sync

A pure script stage — no LLMs. For each translation pair in entry_translations, checks whether the reverse link exists. If English thrill → begeistern exists and the German entry begeistern exists in vocabulary_entries but lacks the English back-link, it is inserted automatically.

This runs before the enrich stage so that LLMs only generate translations that are genuinely missing — not translations that would be found by a simple reverse lookup.
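The reverse lookup can be sketched as a pure function over in-memory data. The types and the name `findMissingReverseLinks` are assumptions for illustration, and matching is done per headword only; the real stage works on the vocabulary_entries and entry_translations tables and may match at sense level.

```typescript
interface EntryRef {
  language: string;
  headword: string;
}

interface TranslationLink {
  from: EntryRef;
  to: EntryRef;
}

// JSON-encoded keys avoid collisions when a headword contains a separator.
const refKey = (ref: EntryRef): string => JSON.stringify([ref.language, ref.headword]);
const linkKey = (link: TranslationLink): string => refKey(link.from) + "->" + refKey(link.to);

function findMissingReverseLinks(
  entries: EntryRef[],
  links: TranslationLink[],
): TranslationLink[] {
  const knownEntries: Record<string, boolean> = {};
  for (const entry of entries) knownEntries[refKey(entry)] = true;

  const knownLinks: Record<string, boolean> = {};
  for (const link of links) knownLinks[linkKey(link)] = true;

  const missing: TranslationLink[] = [];
  for (const link of links) {
    const reverse: TranslationLink = { from: link.to, to: link.from };
    // The target word must exist as an entry of its own.
    if (!knownEntries[refKey(reverse.from)]) continue;
    // Skip links that already exist or were already queued.
    if (knownLinks[linkKey(reverse)]) continue;
    knownLinks[linkKey(reverse)] = true;
    missing.push(reverse);
  }
  return missing;
}
```

Given entries for English thrill and German begeistern and a single link thrill → begeistern, the function returns the one back-link begeistern → thrill.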

Input: pipeline.db (populated vocabulary_entries and entry_translations)
Output: pipeline.db (missing reverse links inserted into entry_translations)

pnpm --filter @lila/pipeline reverse-link

3. Enrich

The enrich stage runs LLMs to fill three types of gaps, in this order:

A — Missing translations: for each entry that has no translation in one or more supported languages after reverse link sync, the LLM generates the best translation for that language given the entry's headword, gloss, and examples.

B — Weak glosses and examples: for each entry where the gloss is missing or the examples are missing, the LLM generates a natural, learner-friendly gloss and one usage example in the entry's language.

C — CEFR levels: for every entry, the LLM assigns a CEFR level (A1–C2) based on the headword, gloss, and examples. This runs for all entries regardless of whether other enrichment was needed.

All output is written to pipeline.db atomically per entry — runs are fully resumable if interrupted. Each model is run once — one model produces one vote.
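To make the gap types concrete, here is a small sketch that lists the work a single entry would need. The `EnrichableEntry` shape, the task names, and `planEnrichment` are hypothetical; only the three rules (missing translations, missing gloss or examples, CEFR for every entry) come from the description above.

```typescript
const LANGUAGES = ["en", "it", "de", "es", "fr"];

// Hypothetical view of an entry as the enrich stage might see it.
interface EnrichableEntry {
  language: string;
  gloss: string | null;
  examples: string[];
  translations: { language: string; word: string }[];
}

type EnrichTask =
  | { kind: "translation"; targetLanguage: string }
  | { kind: "gloss_and_example" }
  | { kind: "cefr" };

function planEnrichment(entry: EnrichableEntry): EnrichTask[] {
  const tasks: EnrichTask[] = [];
  // A: one task per supported language that still has no translation.
  for (const language of LANGUAGES) {
    if (language === entry.language) continue;
    const covered = entry.translations.some((t) => t.language === language);
    if (!covered) tasks.push({ kind: "translation", targetLanguage: language });
  }
  // B: gloss or examples missing.
  if (!entry.gloss || entry.examples.length === 0) {
    tasks.push({ kind: "gloss_and_example" });
  }
  // C: every entry gets a CEFR vote, whatever else was needed.
  tasks.push({ kind: "cefr" });
  return tasks;
}
```

An English entry that already has German and French translations, a gloss, and an example would yield two translation tasks (Italian, Spanish) plus the CEFR task.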

Note: Before running this stage, ensure the llama.cpp server is running locally. The orchestrator checks for a running server at http://127.0.0.1:8080/health and exits with instructions if it is not reachable. See llm-setup.md for setup instructions.

Input: pipeline.db (entries after reverse link sync)
Output: pipeline.db (LLM-generated translations, glosses, examples, and CEFR votes)

pnpm --filter @lila/pipeline run --name "night-1"

4. Merge

Reads all LLM votes from pipeline.db and resolves the final value for every field. Writes resolved entries back to pipeline.db.

Merge rules:

Kaikki source data wins automatically and is never overridden by LLM output
For CEFR levels: the level with the most votes wins. If no majority is reached, the entry is flagged for the tiebreaker
For LLM-generated text fields: the candidate with the most votes wins. If no majority is reached, the tiebreaker runs
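The merge rules can be sketched as one function. Two things here are assumptions: "majority" is read as strictly more than half of the votes cast, and the names `resolveField` and `Resolution` are illustrative. If the actual script treats a plain plurality as sufficient, the comparison would change accordingly.

```typescript
type Resolution =
  | { status: "final"; value: string; source: "kaikki" | "llm" }
  | { status: "flagged"; tally: Record<string, number> };

function resolveField(kaikkiValue: string | null, votes: string[]): Resolution {
  // Rule 1: source data always wins and is never overridden.
  if (kaikkiValue !== null && kaikkiValue !== "") {
    return { status: "final", value: kaikkiValue, source: "kaikki" };
  }
  // Rules 2 and 3: count votes per candidate value.
  // A prototype-free object keeps arbitrary candidate strings safe as keys.
  const tally: Record<string, number> = Object.create(null);
  for (const vote of votes) tally[vote] = (tally[vote] ?? 0) + 1;

  for (const value of Object.keys(tally)) {
    // Assumed definition of majority: more than half of all votes.
    if ((tally[value] ?? 0) * 2 > votes.length) {
      return { status: "final", value, source: "llm" };
    }
  }
  // No majority: keep the vote split so the tiebreaker can report it.
  return { status: "flagged", tally };
}
```

With votes B1, B1, A2 the result is B1; with votes B1, A2 the entry is flagged and carries its vote split.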

Difficulty mapping:

| CEFR | Difficulty |
| --- | --- |
| A1, A2 | easy |
| B1, B2 | intermediate |
| C1, C2 | hard |
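The mapping is small enough to express as a lookup table. The names `DIFFICULTY_BY_CEFR` and `difficultyFor` are illustrative and not taken from the codebase.

```typescript
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
type Difficulty = "easy" | "intermediate" | "hard";

// Typing the table as Record<CefrLevel, Difficulty> makes the compiler
// reject a missing or misspelled level.
const DIFFICULTY_BY_CEFR: Record<CefrLevel, Difficulty> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

function difficultyFor(level: CefrLevel): Difficulty {
  return DIFFICULTY_BY_CEFR[level];
}
```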

Input: pipeline.db (LLM votes)
Output: pipeline.db (entries updated with resolved values or flagged status)

4b. Tiebreak

Runs automatically after merge if any entries remain flagged. The script queries pipeline.db for flagged entries, identifies which configured models have not yet voted on each entry, and runs those models on the flagged subset only. Merge is re-run after each tiebreaker pass. This repeats until all flagged entries are resolved or no unused models remain.
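The selection step ("which configured models have not yet voted on each entry") might look like the following. `FlaggedEntry` and `planTiebreak` are hypothetical names, and the sketch only plans the work; it does not call any model.

```typescript
interface FlaggedEntry {
  id: number;
  votedModels: string[]; // models that already voted on this entry
}

// Returns, for each configured model, the flagged entry ids it has not voted
// on yet. Models with nothing left to do are omitted, so an empty result
// means the unused models are exhausted.
function planTiebreak(
  configuredModels: string[],
  flagged: FlaggedEntry[],
): Record<string, number[]> {
  const plan: Record<string, number[]> = {};
  for (const model of configuredModels) {
    const ids = flagged
      .filter((entry) => entry.votedModels.indexOf(model) === -1)
      .map((entry) => entry.id);
    if (ids.length > 0) plan[model] = ids;
  }
  return plan;
}
```

An orchestrator could run one model from the plan, re-run merge, rebuild the plan from the entries still flagged, and stop when the plan comes back empty.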

If unused models are exhausted and flagged entries remain, the script logs a detailed report showing the exact vote split for each unresolved entry and lists available models from OpenRouter that have not been used. Syncing is blocked until all entries are resolved. To continue, add one or more models to the config and re-run the pipeline — the tiebreaker will pick up automatically.

Note: The tiebreaker is not a standalone script. It runs automatically as part of the pipeline orchestrator after merge completes.

5. Compare / QA

Read-only. Generates COVERAGE.md with a full breakdown of pipeline output quality per language. Run this after merge to verify output before syncing to the database.

Input: pipeline.db (entries with status final)
Output: COVERAGE.md

COVERAGE.md reports the following per language:

Total entries extracted
POS breakdown — entry counts for noun, verb, adjective, adverb
Translation coverage — how many entries have translations in each other language
CEFR coverage — how many entries have a resolved CEFR level, broken down by level
Difficulty breakdown — entry counts for easy, intermediate, hard
Gloss coverage — how many entries have a gloss, broken down by source (Kaikki vs LLM-generated)
Example coverage — same breakdown as glosses
LLM model contribution — how many CEFR votes and text candidates each anonymised model contributed

6. Sync

The sync script transfers all entries with status final in pipeline.db to the production PostgreSQL database. It is upsert-based and never wipes existing data. For each entry it checks whether a matching record already exists in the target database:

Missing → insert
Present but changed → update
Present and unchanged → skip
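The three-way decision can be sketched as a comparison between the resolved entry and whatever the target database already holds. The `ResolvedEntry` fields and the list of compared columns are assumptions; how rows are matched between the two databases is not specified here and is left to the caller.

```typescript
type SyncAction = "insert" | "update" | "skip";

// Hypothetical subset of the columns that get synced.
interface ResolvedEntry {
  headword: string;
  language: string;
  pos: string;
  sense_index: number;
  gloss: string;
  cefr: string;
  difficulty: string;
}

// Fields whose change should trigger an update (assumed list).
const COMPARED_FIELDS = ["gloss", "cefr", "difficulty"] as const;

function decideSyncAction(
  incoming: ResolvedEntry,
  existing: ResolvedEntry | undefined,
): SyncAction {
  if (existing === undefined) return "insert"; // missing in target
  for (const field of COMPARED_FIELDS) {
    if (incoming[field] !== existing[field]) return "update"; // present but changed
  }
  return "skip"; // present and unchanged
}
```

Because unchanged rows are skipped and nothing is deleted, running the sync twice in a row leaves the target database in the same state.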

Run this after all entries are resolved and Compare / QA has been reviewed.

pnpm --filter @lila/pipeline sync

The sync script requires a connection string to the target database. Set DATABASE_URL in your .env file before running.

Reports

The pipeline generates a report at the end of every run. Reports are written to data-pipeline/reports/ as a JSON file and a markdown file with the same name. The markdown is generated from the JSON and contains identical data.

data-pipeline/reports/
  2026-05-03_run-1.json
  2026-05-03_run-1.md

The run name is auto-generated from the date and a counter. Reports are not committed to git.
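One way such a name could be derived, assuming the counter restarts each day and is one higher than the largest counter already present for that date. `nextRunName` is a hypothetical helper, and the real generator may count differently.

```typescript
// date is expected as YYYY-MM-DD; existingNames are report base names
// such as "2026-05-03_run-1" (a trailing ".json" or ".md" is tolerated).
function nextRunName(date: string, existingNames: string[]): string {
  const prefix = `${date}_run-`;
  let highest = 0;
  for (const name of existingNames) {
    if (name.indexOf(prefix) !== 0) continue; // different date or unrelated file
    const counter = parseInt(name.slice(prefix.length), 10);
    if (!isNaN(counter) && counter > highest) highest = counter;
  }
  return `${prefix}${highest + 1}`;
}
```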

Nightly report contains:

Entries processed this run vs total
Entries remaining per stage
Average processing speed and estimated nights remaining
needs_review count — entries that failed structural validation
Per-model progress breakdown
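The "estimated nights remaining" figure is simple arithmetic: remaining entries divided by what one night can process, rounded up. The sketch below assumes speed is measured in entries per hour and that a night has a fixed number of usable hours; both parameters and the function name are illustrative.

```typescript
function estimateNightsRemaining(
  remainingEntries: number,
  entriesPerHour: number,
  hoursPerNight: number,
): number {
  if (remainingEntries <= 0) return 0;
  const perNight = entriesPerHour * hoursPerNight;
  if (perNight <= 0) return Infinity; // no measurable progress yet
  return Math.ceil(remainingEntries / perNight);
}
```

For example, 10,000 remaining entries at 500 entries per hour over 8-hour nights gives 10,000 / 4,000 = 2.5, which rounds up to 3 nights.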

Final report (generated when all entries are processed) additionally contains:

Full vote breakdown per model
Flagged entries with exact vote splits
Available unused models from OpenRouter for tiebreaking
Per-model quality metrics — CEFR agreement rate, field coverage, JSON parse rate

Adding a new language

Add the language code to SUPPORTED_LANGUAGE_CODES in packages/shared/src/constants.ts
Build shared: pnpm --filter @lila/shared build
Generate and run a DB migration: pnpm --filter @lila/db generate then pnpm --filter @lila/db migrate
Download the Kaikki JSONL file for the language from kaikki.org
Re-run the full pipeline

Constants and constraints

These values are defined in packages/shared/src/constants.ts and enforced by database check constraints. The pipeline filters out any entries that violate them.

| Constant | Values |
| --- | --- |
| Languages | `en`, `it`, `de`, `es`, `fr` |
| Parts of speech | `noun`, `verb`, `adjective`, `adverb` |
| CEFR levels | `A1`, `A2`, `B1`, `B2`, `C1`, `C2` |
| Difficulty | `easy`, `intermediate`, `hard` |

Adding a new value to any of these requires a constants update and a database migration before re-running the pipeline. See Adding a new language for the full steps — the same process applies for new parts of speech.
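A sketch of how such constants can double as runtime filters. `SUPPORTED_LANGUAGE_CODES` is the name used in packages/shared/src/constants.ts; the other constant names, the `CandidateEntry` shape, and both helper functions are assumptions for illustration.

```typescript
const SUPPORTED_LANGUAGE_CODES = ["en", "it", "de", "es", "fr"] as const;
const PARTS_OF_SPEECH = ["noun", "verb", "adjective", "adverb"] as const; // assumed name
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const; // assumed name

type LanguageCode = (typeof SUPPORTED_LANGUAGE_CODES)[number];

// Type guard: narrows a plain string to one of the allowed literals.
function isOneOf<T extends string>(allowed: readonly T[], value: string): value is T {
  return (allowed as readonly string[]).indexOf(value) !== -1;
}

interface CandidateEntry {
  language: string;
  pos: string;
  cefr: string | null; // null while no level has been resolved
}

// Mirrors the database check constraints: an entry that fails is filtered out.
function satisfiesConstraints(entry: CandidateEntry): boolean {
  return (
    isOneOf(SUPPORTED_LANGUAGE_CODES, entry.language) &&
    isOneOf(PARTS_OF_SPEECH, entry.pos) &&
    (entry.cefr === null || isOneOf(CEFR_LEVELS, entry.cefr))
  );
}
```

Deriving the union types from the `as const` arrays keeps the compile-time types and the runtime checks in one place, so adding a language only requires touching the array.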

Further extensions

These are not part of the current pipeline but are worth considering as the dataset matures:

IPA pronunciations — Kaikki includes IPA transcriptions for most entries. Could be extracted and stored in an entry_pronunciations table and displayed in the quiz UI.
Audio files — kaikki.org provides bulk audio file downloads (~20GB) for pronunciations. Could be stored as static files and served alongside the quiz UI.
Inflected forms — Kaikki provides conjugation and declension tables in a forms array. Useful for a future grammar-focused quiz mode.
Grammatical gender — Kaikki includes grammatical gender for nouns. Could be stored per entry and used as an additional quiz mechanic.
Frequency data — Word frequency rankings per language from sources like the Google Ngram dataset. Useful for smarter difficulty calibration beyond CEFR levels alone.
Additional languages — The pipeline is language-agnostic. Adding a new language requires downloading its Kaikki JSONL file, a constants update, and a database migration. See Adding a new language.

Roadmap

Current state: Data source migrated from OMW to Kaikki. Production schema and pipeline being rewritten on feat/kaikki-vocabulary-schema. Pipeline infrastructure (orchestrator, db init, reporting, tests) is in place and carries forward.

Next action: Rewrite production schema in packages/db, then rewrite pipeline extraction stage for Kaikki.

| Stage | Status |
| --- | --- |
| 1. Extract | 🔲 not started |
| 2. Reverse link | 🔲 not started |
| 3. Enrich | 🔲 not started |
| 4. Merge | 🔲 not started |
| 4b. Tiebreak | 🔲 not started |
| 5. Compare / QA | 🔲 not started |
| 6. Sync | 🔲 not started |

Stage 1 — Extract `🔲 not started`

Download Kaikki JSONL files for all 5 languages
Write extraction script
Write stage 1 validation tests
Run extraction → pipeline.db

Stage 2 — Reverse link sync `🔲 not started`

Write reverse link sync script
Write tests
Run reverse link sync → pipeline.db

Stage 3 — Enrich `🔲 not started`

Next action: Write the enrich script after production schema is complete.

Write enrich script (missing translations, glosses, examples, CEFR votes)
Write tests
Install llama.cpp and verify server
Smoke test with sample entries
Run full sample, collect metrics
Compare providers (local vs OpenRouter free models)
Production run — all entries, all models

Stage 4 — Merge `🔲 not started`

Write merge script
Write tests
Run merge → pipeline.db
Confirm tiebreaker resolves all flagged entries

Stage 4b — Tiebreak `🔲 not started`

Write tiebreak logic
Run tiebreaker for all flagged entries
Confirm no flagged entries remain before syncing

Stage 5 — Compare / QA `🔲 not started`

Write compare script
Write tests
Run compare → COVERAGE.md
Review output quality before syncing

Stage 6 — Sync `🔲 not started`

Write sync script
Write tests
Configure DATABASE_URL in .env
Run sync → production PostgreSQL
Verify seeded data in production

Utilities

sample/ — Runs the pipeline against a small sample to produce human-readable output for a quick sanity check before committing to a full run. Run this after any script change before running the full pipeline.
