lila/documentation/ai-context/05-data-pipeline.md
2026-05-16 01:59:43 +02:00


05 — Data Pipeline

Purpose: Condensed reference for LLMs working on the Kaikki data pipeline. Covers stages, data flow, and current blockers. For full operational details (llama.cpp setup, provider configs, hardware specs), see the human-readable DATA_PIPELINE.md.

Last updated: 2026-05-15

Depends on: 00-project-overview.md


Pipeline Overview

Kaikki JSONL (Wiktionary extracts)
     ↓
Stage 1: Extract → Parse into pipeline.db (SQLite)
     ↓
Stage 2: Reverse Link → Insert missing reverse translations
     ↓
Stage 3: Enrich → LLMs review glosses, examples, translations, assign CEFR
     ↓
Stage 4: Merge → Resolve LLM votes into final values
     ↓
Stage 4b: Tiebreak → Run unused models on flagged entries
     ↓
Stage 5: Compare / QA → Generate COVERAGE.md quality report
     ↓
Stage 6: Sync → Upsert resolved records into production PostgreSQL

Current state: Stages 1 and 2 are complete on sample data. The Stage 3 enrich script is being rewritten for the sub-stage architecture. Stages 4-6 have not been started.
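As a rough illustration of how an orchestrator could run these stages in order and resume after an interruption, here is a minimal sketch. The names `PipelineStage`, `runPipeline`, and the `completed` set are hypothetical and do not come from `pipeline.ts`; the real orchestrator presumably persists its progress somewhere rather than in memory.

```typescript
// Hypothetical stage names mirroring the overview diagram.
type StageName =
  | "extract" | "reverse-link" | "enrich" | "merge" | "tiebreak" | "qa" | "sync";

interface PipelineStage {
  name: StageName;
  run: () => void; // a thrown error aborts the pipeline at this stage
}

// Runs stages in the given order and skips any stage already marked complete.
// `completed` stands in for whatever persistent progress record the real
// orchestrator keeps (assumption).
function runPipeline(stages: PipelineStage[], completed: Set<StageName>): StageName[] {
  const executed: StageName[] = [];
  for (const stage of stages) {
    if (completed.has(stage.name)) continue; // resume: skip finished work
    stage.run();
    completed.add(stage.name);
    executed.push(stage.name);
  }
  return executed;
}
```

Calling `runPipeline` a second time with the same `completed` set executes nothing, which is the resumability property in miniature.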


Stage 1: Extract

Input: data-pipeline/stage-1-extract/sources/*.jsonl (Kaikki files, not in git)

Output: pipeline.db (vocabulary_entries and entry_translations tables)

What it does:

  • Parses Kaikki JSONL for all 5 languages (en, de, es, fr, it)
  • Filters to 4 POS: noun, verb, adjective, adverb
  • Each Kaikki sense becomes one vocabulary_entries row
  • Translations stored in entry_translations with sense hints

Key design: Kaikki is structured per word sense. Each headword has multiple senses, and translations are linked to a specific sense. This prevents the sense-disambiguation problems of OpenWordNet/OMW.
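The sense-per-row idea can be sketched as follows. This is not the code in `extract.ts`: the row shapes are hypothetical, the Kaikki field names (`lang_code`, `senses[].glosses`, `translations[].code`, `translations[].sense`) reflect one reading of the Kaikki format and should be checked against the actual files, and the `adj`/`adv` mapping assumes Kaikki abbreviates those POS tags.

```typescript
// Subset of a Kaikki JSONL record; only the fields this sketch reads (assumed names).
interface KaikkiRecord {
  word: string;
  lang_code: string;
  pos: string;
  senses?: { glosses?: string[] }[];
  translations?: { code?: string; word?: string; sense?: string }[];
}

// Hypothetical row shapes; the real columns are defined in data-pipeline/db/schema.sql.
interface EntryRow { headword: string; lang: string; pos: string; gloss: string }
interface TranslationRow {
  headword: string; lang: string; targetLang: string; translation: string;
  senseHint: string | null;
}

const PIPELINE_LANGS = new Set(["en", "de", "es", "fr", "it"]);
// Assumption: Kaikki uses "adj" and "adv" where this document says adjective/adverb.
const POS_MAP = new Map([
  ["noun", "noun"], ["verb", "verb"], ["adj", "adjective"], ["adv", "adverb"],
]);

function extractRows(jsonl: string): { entries: EntryRow[]; translations: TranslationRow[] } {
  const entries: EntryRow[] = [];
  const translations: TranslationRow[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const rec = JSON.parse(line) as KaikkiRecord;
    const pos = POS_MAP.get(rec.pos);
    if (pos === undefined || !PIPELINE_LANGS.has(rec.lang_code)) continue;
    // One vocabulary_entries row per sense that has a gloss.
    for (const sense of rec.senses ?? []) {
      const gloss = sense.glosses?.[0];
      if (gloss === undefined) continue;
      entries.push({ headword: rec.word, lang: rec.lang_code, pos, gloss });
    }
    // Translations keep their sense hint so they can be tied back to a sense later.
    for (const t of rec.translations ?? []) {
      if (!t.code || !t.word || !PIPELINE_LANGS.has(t.code)) continue;
      translations.push({
        headword: rec.word, lang: rec.lang_code, targetLang: t.code,
        translation: t.word, senseHint: t.sense ?? null,
      });
    }
  }
  return { entries, translations };
}
```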


Stage 2: Reverse Link

Pure script, no LLMs.

For each translation pair (e.g., English "thrill" → German "begeistern"), checks if the reverse exists (German "begeistern" → English "thrill"). If the German entry exists but lacks the English back-link, inserts it automatically.

Why: Ensures LLMs in Stage 3 only generate translations that are genuinely missing — not translations findable by simple reverse lookup.
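The reverse-link check can be sketched on in-memory data. This is a simplified illustration, not the logic in `reverse-link.ts`: it works at word level rather than sense level, and `TranslationLink`, `findMissingReverseLinks`, and the `"lang:word"` key format are all hypothetical.

```typescript
interface TranslationLink {
  srcLang: string; srcWord: string;
  dstLang: string; dstWord: string;
}

// NUL-separated key so words containing punctuation cannot collide.
const linkKey = (l: TranslationLink): string =>
  [l.srcLang, l.srcWord, l.dstLang, l.dstWord].join("\u0000");

// Returns the reverse links that should be inserted: the target headword exists
// as an entry, but it has no translation pointing back at the source word.
// `existingHeadwords` holds "lang:word" keys (assumed format).
function findMissingReverseLinks(
  links: TranslationLink[],
  existingHeadwords: Set<string>,
): TranslationLink[] {
  const present = new Set(links.map(linkKey));
  const missing: TranslationLink[] = [];
  for (const l of links) {
    const reverse: TranslationLink = {
      srcLang: l.dstLang, srcWord: l.dstWord,
      dstLang: l.srcLang, dstWord: l.srcWord,
    };
    const key = linkKey(reverse);
    const targetExists = existingHeadwords.has(`${reverse.srcLang}:${reverse.srcWord}`);
    if (targetExists && !present.has(key)) {
      present.add(key); // guard against emitting the same reverse link twice
      missing.push(reverse);
    }
  }
  return missing;
}
```

With a single link English "thrill" → German "begeistern" and both headwords present, the function returns the one German → English back-link; if the German entry does not exist, it returns nothing.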


Stage 3: Enrich (In Progress — Being Rewritten)

Current blocker: The original single-prompt design had problems (skipped invalid translations, triggered reasoning mode, 20% manual review). Being rewritten as four ordered sub-stages.

Sub-Stage Architecture

Each model processes every entry through four sub-stages in order:

  1. round1_gloss — Review existing gloss. Confirm if clear, generate better one if not.
  2. round1_example — Review examples. Confirm if natural, generate one better sentence if not.
  3. round1_translations — Validate translations with verified gloss as context. Confirm valid, reject invalid, generate missing.
  4. round1_cefr — Assign CEFR level (A1-C2) to headword and each confirmed translation.

Why this order: CEFR sub-stage only sees clean, verified data. Bad translations are rejected before reaching CEFR assignment.
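A minimal sketch of the ordering guarantee, assuming each sub-stage receives the output of the previous one. `EntryContext`, `ModelCall`, and `runSubStages` are hypothetical names, and the model is a plain function here rather than a real provider call.

```typescript
const SUB_STAGES = [
  "round1_gloss", "round1_example", "round1_translations", "round1_cefr",
] as const;
type SubStage = (typeof SUB_STAGES)[number];

// Hypothetical working state for one entry as it moves through the sub-stages.
interface EntryContext {
  headword: string;
  gloss: string;
  example: string | null;
  translations: string[];
  cefr: string | null;
}

// Hypothetical model adapter: given a sub-stage and the current state,
// returns only the fields it wants to change.
type ModelCall = (stage: SubStage, ctx: EntryContext) => Partial<EntryContext>;

// Runs the four sub-stages in fixed order, feeding each one the result of the
// previous, so round1_cefr only ever sees the verified gloss and translations.
function runSubStages(entry: EntryContext, model: ModelCall): EntryContext {
  let ctx: EntryContext = { ...entry };
  for (const stage of SUB_STAGES) {
    ctx = { ...ctx, ...model(stage, ctx) };
  }
  return ctx;
}
```

In the real pipeline each model's output is one vote per sub-stage rather than a direct overwrite; the sketch only shows why a translation rejected in `round1_translations` never reaches `round1_cefr`.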

Voter strategy: Multiple models vote independently. Each model = one vote per sub-stage. Current plan:

  • Primary: Local Qwen3.5-9B (overnight runs, unlimited)
  • Secondary: Groq Llama 3.3 70B (cloud, batched)
  • Tertiary: Gemini AI Studio (cloud, batched)

Context enrichment: Before calling models for gloss/example, pipeline queries Wiktionary API for the headword. Full entry (all senses, usage notes) added to prompt. Fixes category header glosses and short ambiguous glosses.


Stage 4: Merge

Resolves LLM votes into final values per entry.

Rules:

  • Kaikki source data wins automatically (never overridden)
  • CEFR: level with most votes wins
  • Text fields (gloss, example, translation): candidate with most votes wins
  • No majority → flag for tiebreaker
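The rules above could be expressed roughly as follows. This sketch makes one assumption explicit: "no majority" is read as "no single candidate has strictly more votes than every other", so a tie for first place is flagged. `resolveVotes` and `MergeResult` are hypothetical names.

```typescript
type MergeResult<T> =
  | { status: "final"; value: T }
  | { status: "needs_tiebreak" };

// Resolves one field of one entry.
// - If Kaikki supplied a value, it wins and the votes are ignored.
// - Otherwise the candidate with the most votes wins.
// - A tie for first place (or no votes at all) is flagged for Stage 4b.
function resolveVotes<T extends string>(votes: T[], kaikkiValue?: T): MergeResult<T> {
  if (kaikkiValue !== undefined) return { status: "final", value: kaikkiValue };

  const counts = new Map<T, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);

  let best: T | undefined;
  let bestCount = 0;
  let tied = false;
  for (const [value, count] of counts) {
    if (count > bestCount) {
      best = value;
      bestCount = count;
      tied = false;
    } else if (count === bestCount) {
      tied = true;
    }
  }
  if (best === undefined || tied) return { status: "needs_tiebreak" };
  return { status: "final", value: best };
}
```

For example, votes of `B1`, `B1`, `B2` resolve to `B1`, while `B1`, `B2` produce `needs_tiebreak`. Text candidates would need normalising (whitespace, casing) before counting if near-identical strings should count as the same vote; the sketch compares them exactly.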

Difficulty mapping:

| CEFR   | Difficulty   |
|--------|--------------|
| A1, A2 | easy         |
| B1, B2 | intermediate |
| C1, C2 | hard         |
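The mapping is small enough to encode as a lookup table. A minimal sketch, with `CefrLevel`, `Difficulty`, and `difficultyFor` as illustrative names that may not match `packages/shared/src/constants.ts`:

```typescript
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
type Difficulty = "easy" | "intermediate" | "hard";

// Record<CefrLevel, Difficulty> makes the compiler reject the table
// if a CEFR level is ever added without a difficulty.
const CEFR_TO_DIFFICULTY: Record<CefrLevel, Difficulty> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

function difficultyFor(level: CefrLevel): Difficulty {
  return CEFR_TO_DIFFICULTY[level];
}
```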

Stage 4b: Tiebreak

Runs automatically after merge if flagged entries remain. Queries unused models (not yet voted) and re-runs merge. Repeats until resolved or no unused models remain.

If still unresolved: Sync is blocked. Add more models to config and re-run.
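The tiebreak loop might look like the sketch below. Everything here is hypothetical: `runTiebreak`, `TiebreakOutcome`, and the `voteAndMerge` callback, which stands in for "query this model on the flagged entries, re-run merge, and return how many entries are still flagged".

```typescript
interface TiebreakOutcome {
  modelsUsed: string[];
  stillFlagged: number;
  syncBlocked: boolean; // true when flagged entries remain and no models are left
}

function runTiebreak(
  configuredModels: string[],
  votedModels: string[],
  flaggedBefore: number,
  voteAndMerge: (model: string) => number,
): TiebreakOutcome {
  const voted = new Set(votedModels);
  // Only models that have not voted yet are eligible tiebreakers.
  const unused = configuredModels.filter((m) => !voted.has(m));

  let flagged = flaggedBefore;
  const modelsUsed: string[] = [];
  for (const model of unused) {
    if (flagged === 0) break; // everything resolved, stop early
    flagged = voteAndMerge(model);
    modelsUsed.push(model);
  }
  return { modelsUsed, stillFlagged: flagged, syncBlocked: flagged > 0 };
}
```

When `syncBlocked` comes back `true`, the remedy described above applies: add more models to the config and re-run.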


Stage 5: Compare / QA

Read-only. Generates COVERAGE.md with per-language breakdown:

  • Total entries, POS distribution
  • Translation coverage per language pair
  • CEFR coverage and difficulty breakdown
  • Gloss/example coverage by source (Kaikki vs LLM)
  • Per-model contribution stats

Run this before syncing to production.


Stage 6: Sync

Upserts all status = "final" entries from pipeline.db to production PostgreSQL.

Behavior:

  • Missing → insert
  • Present but changed → update
  • Present and unchanged → skip

Idempotent. Safe to re-run.
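The insert/update/skip decision and its idempotence can be illustrated with an in-memory stand-in for the production table. `SyncRecord`, `syncOnce`, and the single `payload` string are hypothetical simplifications; a real implementation would more likely lean on PostgreSQL's `INSERT ... ON CONFLICT ... DO UPDATE`.

```typescript
interface SyncRecord {
  id: string;
  status: string;  // only "final" records are synced
  payload: string; // stand-in for the resolved column values
}

type SyncAction = "insert" | "update" | "skip";

// Applies one sync pass against `production` (id -> payload) and reports
// how many records fell into each branch.
function syncOnce(
  records: SyncRecord[],
  production: Map<string, string>,
): Record<SyncAction, number> {
  const counts: Record<SyncAction, number> = { insert: 0, update: 0, skip: 0 };
  for (const rec of records) {
    if (rec.status !== "final") continue;
    const existing = production.get(rec.id);
    const action: SyncAction =
      existing === undefined ? "insert" : existing !== rec.payload ? "update" : "skip";
    if (action !== "skip") production.set(rec.id, rec.payload);
    counts[action] += 1;
  }
  return counts;
}
```

Running `syncOnce` twice with the same input produces only skips on the second pass, which is what makes re-running safe.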


Key Constraints

| Constant   | Values                        |
|------------|-------------------------------|
| Languages  | en, it, de, es, fr            |
| POS        | noun, verb, adjective, adverb |
| CEFR       | A1, A2, B1, B2, C1, C2        |
| Difficulty | easy, intermediate, hard      |

Adding a new value requires updating packages/shared/src/constants.ts AND a database migration before re-running the pipeline.
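One common way to keep such constants in a single place is to declare them as `as const` arrays and derive the union types from them. The sketch below is an assumption about how this could look, not a copy of `packages/shared/src/constants.ts`; all identifiers are illustrative.

```typescript
const SUPPORTED_LANGUAGES = ["en", "it", "de", "es", "fr"] as const;
const SUPPORTED_POS = ["noun", "verb", "adjective", "adverb"] as const;
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const DIFFICULTY_LEVELS = ["easy", "intermediate", "hard"] as const;

// Union types derived from the arrays, so adding a value in one place
// updates both the runtime list and the compile-time type.
type LanguageCode = (typeof SUPPORTED_LANGUAGES)[number];
type PartOfSpeech = (typeof SUPPORTED_POS)[number];

// Generic runtime guard for values arriving from JSONL or the database.
function isOneOf<T extends string>(allowed: readonly T[], value: string): value is T {
  return (allowed as readonly string[]).indexOf(value) !== -1;
}
```

A guard such as `isOneOf(SUPPORTED_LANGUAGES, code)` only protects the TypeScript side; the database still needs its own migration, since its constraints are not derived from this file.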


Current Blockers

  1. Enrich sub-stage rewrite — Stage 3 script needs redesign and testing
  2. Cloud provider integration — Groq and Gemini not yet wired into pipeline
  3. Batching prompt design — 5-10 entries per API call for efficiency; not yet designed
  4. Full dataset scale unknown — Currently running on 500-entry samples. Full Kaikki English file has ~1.3M entries. Exact filtered count and runtime estimate not yet known.

Key Files

| File | Purpose |
|------|---------|
| data-pipeline/pipeline.ts | Orchestrator: runs stages in order, handles resumability |
| data-pipeline/stage-1-extract/scripts/extract.ts | Parse Kaikki JSONL |
| data-pipeline/stage-2-reverse-link/scripts/reverse-link.ts | Insert reverse translations |
| data-pipeline/stage-3-enrich/scripts/enrich.ts | LLM enrichment (being rewritten) |
| data-pipeline/stage-3-enrich/config.ts | Provider configs (local, OpenRouter, etc.) |
| data-pipeline/db/schema.sql | pipeline.db schema |
| data-pipeline/db/import.ts | Import stage 1 output into pipeline.db |
| packages/shared/src/constants.ts | Language codes, POS, CEFR, difficulty constants |