# 05 - Data Pipeline

> **Purpose:** Condensed reference for LLMs working on the Kaikki data pipeline. Covers stages, data flow, and current blockers. For full operational details (llama.cpp setup, provider configs, hardware specs), see the human-readable DATA_PIPELINE.md.
>
> **Last updated:** 2026-05-15
>
> **Depends on:** 00-project-overview.md

---

## Pipeline Overview

```
Kaikki JSONL (Wiktionary extracts)
    ↓
Stage 1: Extract → Parse into pipeline.db (SQLite)
    ↓
Stage 2: Reverse Link → Insert missing reverse translations
    ↓
Stage 3: Enrich → LLMs review glosses, examples, translations, assign CEFR
    ↓
Stage 4: Merge → Resolve LLM votes into final values
    ↓
Stage 4b: Tiebreak → Run unused models on flagged entries
    ↓
Stage 5: Compare / QA → Generate COVERAGE.md quality report
    ↓
Stage 6: Sync → Upsert resolved records into production PostgreSQL
```

**Current state:** Stage 1 and 2 complete on sample data. Stage 3 enrich script being rewritten for sub-stage architecture. Stages 4–6 not started.

---

## Stage 1: Extract

**Input:** `data-pipeline/stage-1-extract/sources/*.jsonl` (Kaikki files, not in git)

**Output:** `pipeline.db`: `vocabulary_entries` and `entry_translations` tables

**What it does:**

- Parses Kaikki JSONL for all 5 languages (en, de, es, fr, it)
- Filters to 4 POS: noun, verb, adjective, adverb
- Each Kaikki sense becomes one `vocabulary_entries` row
- Translations stored in `entry_translations` with sense hints

**Key design:** Kaikki is structured per word sense. Each headword has multiple senses, and translations are linked to a specific sense. This prevents the sense-disambiguation problems of OpenWordNet/OMW.

---

## Stage 2: Reverse Link Sync

**Pure script, no LLMs.** For each translation pair (e.g., English "thrill" → German "begeistern"), checks if the reverse exists (German "begeistern" → English "thrill"). If the German entry exists but lacks the English back-link, inserts it automatically.
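
The reverse-link check described above can be sketched in memory as follows. This is a minimal illustration, not the actual `reverse-link.ts`: the `Translation` shape, `pairKey`, and `findMissingReverseLinks` are hypothetical names, the real columns live in `data-pipeline/db/schema.sql`, and sense hints are ignored for brevity.

```typescript
// Hypothetical row shape for illustration only.
interface Translation {
  sourceLang: string;
  sourceWord: string;
  targetLang: string;
  targetWord: string;
}

// Build a stable key for a directed translation pair.
const pairKey = (t: Translation): string =>
  [t.sourceLang, t.sourceWord, t.targetLang, t.targetWord].join("\u0000");

// Return the reverse links that would need inserting: the target headword
// exists as an entry, but its back-link to the source is missing.
function findMissingReverseLinks(
  translations: Translation[],
  headwords: Set<string>, // existing entries, keyed as "lang:word" (assumed format)
): Translation[] {
  const existing = new Set(translations.map(pairKey));
  const missing: Translation[] = [];
  for (const t of translations) {
    const reverse: Translation = {
      sourceLang: t.targetLang,
      sourceWord: t.targetWord,
      targetLang: t.sourceLang,
      targetWord: t.sourceWord,
    };
    const targetEntryExists = headwords.has(`${t.targetLang}:${t.targetWord}`);
    const key = pairKey(reverse);
    if (targetEntryExists && !existing.has(key)) {
      existing.add(key); // avoid emitting the same reverse link twice
      missing.push(reverse);
    }
  }
  return missing;
}
```

With the single pair English "thrill" → German "begeistern" and both headwords present, the function returns one link, German "begeistern" → English "thrill". If the German entry does not exist, nothing is returned.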

**Why:** Ensures LLMs in Stage 3 only generate translations that are genuinely missing, not translations findable by simple reverse lookup.

---

## Stage 3: Enrich (In Progress, Being Rewritten)

**Current blocker:** The original single-prompt design had problems (skipped invalid translations, triggered reasoning mode, 20% manual review). Being rewritten as four ordered sub-stages.

### Sub-Stage Architecture

Each model processes every entry through four sub-stages in order:

1. **`round1_gloss`**: Review existing gloss. Confirm if clear, generate better one if not.
2. **`round1_example`**: Review examples. Confirm if natural, generate one better sentence.
3. **`round1_translations`**: Validate translations with verified gloss as context. Confirm valid, reject invalid, generate missing.
4. **`round1_cefr`**: Assign CEFR level (A1–C2) to headword and each confirmed translation.

**Why this order:** CEFR sub-stage only sees clean, verified data. Bad translations are rejected before reaching CEFR assignment.

**Voter strategy:** Multiple models vote independently. Each model = one vote per sub-stage. Current plan:

- Primary: Local Qwen3.5-9B (overnight runs, unlimited)
- Secondary: Groq Llama 3.3 70B (cloud, batched)
- Tertiary: Gemini AI Studio (cloud, batched)

**Context enrichment:** Before calling models for gloss/example, pipeline queries Wiktionary API for the headword. Full entry (all senses, usage notes) added to prompt. Fixes category header glosses and short ambiguous glosses.

---

## Stage 4: Merge

Resolves LLM votes into final values per entry.
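
Vote resolution for a single field could look roughly like the sketch below. It follows this section's rules under two stated assumptions: "no majority" is read as either no votes or a tie for the top count, and a Kaikki value, when present, short-circuits the vote. `Resolution` and `resolveField` are hypothetical names, not taken from the pipeline source.

```typescript
// Hypothetical result type for illustration only.
type Resolution =
  | { status: "resolved"; value: string; source: "kaikki" | "llm" }
  | { status: "needs_tiebreak" };

// Resolve one field (gloss, example, translation, or CEFR level) from an
// optional Kaikki value and the list of model votes.
function resolveField(kaikkiValue: string | null, votes: string[]): Resolution {
  // Kaikki source data is never overridden.
  if (kaikkiValue !== null) {
    return { status: "resolved", value: kaikkiValue, source: "kaikki" };
  }

  // Count votes per candidate, remembering first-seen order.
  const counts = new Map<string, number>();
  const candidates: string[] = [];
  for (const vote of votes) {
    const seen = counts.get(vote);
    if (seen === undefined) candidates.push(vote);
    counts.set(vote, (seen ?? 0) + 1);
  }

  // Find the top count and track whether it is shared.
  let best: string | null = null;
  let bestCount = 0;
  let tied = false;
  for (const candidate of candidates) {
    const count = counts.get(candidate) ?? 0;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
      tied = false;
    } else if (count === bestCount) {
      tied = true;
    }
  }

  // Assumption: no votes or a tie at the top means "flag for tiebreaker".
  if (best === null || tied) return { status: "needs_tiebreak" };
  return { status: "resolved", value: best, source: "llm" };
}
```

For example, CEFR votes of `B1`, `B1`, `B2` resolve to `B1`, while `A1`, `A2` produce `needs_tiebreak`, which is the condition Stage 4b acts on.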

**Rules:**

- Kaikki source data wins automatically (never overridden)
- CEFR: level with most votes wins
- Text fields (gloss, example, translation): candidate with most votes wins
- No majority → flag for tiebreaker

**Difficulty mapping:**

| CEFR   | Difficulty   |
| ------ | ------------ |
| A1, A2 | easy         |
| B1, B2 | intermediate |
| C1, C2 | hard         |

---

## Stage 4b: Tiebreak

Runs automatically after merge if flagged entries remain. Queries unused models (not yet voted) and re-runs merge. Repeats until resolved or no unused models remain.

**If still unresolved:** Sync is blocked. Add more models to config and re-run.

---

## Stage 5: Compare / QA

Read-only. Generates `COVERAGE.md` with per-language breakdown:

- Total entries, POS distribution
- Translation coverage per language pair
- CEFR coverage and difficulty breakdown
- Gloss/example coverage by source (Kaikki vs LLM)
- Per-model contribution stats

Run this before syncing to production.

---

## Stage 6: Sync

Upserts all `status = "final"` entries from `pipeline.db` to production PostgreSQL.

**Behavior:**

- Missing → insert
- Present but changed → update
- Present and unchanged → skip

**Idempotent.** Safe to re-run.

---

## Key Constraints

| Constant   | Values                                |
| ---------- | ------------------------------------- |
| Languages  | `en`, `it`, `de`, `es`, `fr`          |
| POS        | `noun`, `verb`, `adjective`, `adverb` |
| CEFR       | `A1`, `A2`, `B1`, `B2`, `C1`, `C2`    |
| Difficulty | `easy`, `intermediate`, `hard`        |

Adding a new value requires updating `packages/shared/src/constants.ts` AND a database migration before re-running the pipeline.

---

## Current Blockers

1. **Enrich sub-stage rewrite**: Stage 3 script needs redesign and testing
2. **Cloud provider integration**: Groq and Gemini not yet wired into pipeline
3. **Batching prompt design**: 5–10 entries per API call for efficiency; not yet designed
4. **Full dataset scale unknown**: Currently running on 500-entry samples. Full Kaikki English file has ~1.3M entries.
   Exact filtered count and runtime estimate not yet known.

---

## Key Files

| File                                                         | Purpose                                                   |
| ------------------------------------------------------------ | --------------------------------------------------------- |
| `data-pipeline/pipeline.ts`                                  | Orchestrator: runs stages in order, handles resumability  |
| `data-pipeline/stage-1-extract/scripts/extract.ts`           | Parse Kaikki JSONL                                        |
| `data-pipeline/stage-2-reverse-link/scripts/reverse-link.ts` | Insert reverse translations                               |
| `data-pipeline/stage-3-enrich/scripts/enrich.ts`             | LLM enrichment (being rewritten)                          |
| `data-pipeline/stage-3-enrich/config.ts`                     | Provider configs (local, OpenRouter, etc.)                |
| `data-pipeline/db/schema.sql`                                | pipeline.db schema                                        |
| `data-pipeline/db/import.ts`                                 | Import stage 1 output into pipeline.db                    |
| `packages/shared/src/constants.ts`                           | Language codes, POS, CEFR, difficulty constants           |
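
For orientation, the shared constants in the last row might be shaped like the sketch below. The values come from the Key Constraints and difficulty-mapping tables; the identifier names (`LANGUAGES`, `CEFR_TO_DIFFICULTY`, `isCefrLevel`, and so on) are assumptions, and the real `constants.ts` may be organized differently.

```typescript
// Hypothetical sketch; `export` is omitted so the snippet runs standalone.
const LANGUAGES = ["en", "it", "de", "es", "fr"] as const;
const PARTS_OF_SPEECH = ["noun", "verb", "adjective", "adverb"] as const;
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const DIFFICULTIES = ["easy", "intermediate", "hard"] as const;

// Union types derived from the arrays, so values and types cannot drift apart.
type Language = (typeof LANGUAGES)[number];
type PartOfSpeech = (typeof PARTS_OF_SPEECH)[number];
type CefrLevel = (typeof CEFR_LEVELS)[number];
type Difficulty = (typeof DIFFICULTIES)[number];

// Difficulty mapping from the Stage 4 table.
const CEFR_TO_DIFFICULTY: Record<CefrLevel, Difficulty> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

// Runtime guard for values read from pipeline.db or from model output.
function isCefrLevel(value: string): value is CefrLevel {
  return (CEFR_LEVELS as readonly string[]).indexOf(value) !== -1;
}
```

Typing the mapping as `Record<CefrLevel, Difficulty>` means that adding a CEFR level to the array without a matching difficulty fails at compile time, which fits the rule that new values must be added deliberately.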