updating documentation

2026-05-16 01:59:43 +02:00 · 7e0311683f
commit 7e0311683f
parent 1ba57c7e9d
25 changed files with 2660 additions and 226 deletions
--- a/documentation/ai-context/05-data-pipeline.md
+++ b/documentation/ai-context/05-data-pipeline.md
@@ -0,0 +1,173 @@
+# 05 — Data Pipeline
+
+> **Purpose:** Condensed reference for LLMs working on the Kaikki data pipeline. Covers stages, data flow, and current blockers. For full operational details (llama.cpp setup, provider configs, hardware specs), see the human-readable DATA_PIPELINE.md.
+> **Last updated:** 2026-05-15
+> **Depends on:** 00-project-overview.md
+
+---
+
+## Pipeline Overview
+
+```
+Kaikki JSONL (Wiktionary extracts)
+     ↓
+Stage 1: Extract → Parse into pipeline.db (SQLite)
+     ↓
+Stage 2: Reverse Link → Insert missing reverse translations
+     ↓
+Stage 3: Enrich → LLMs review glosses, examples, translations, assign CEFR
+     ↓
+Stage 4: Merge → Resolve LLM votes into final values
+     ↓
+Stage 4b: Tiebreak → Run unused models on flagged entries
+     ↓
+Stage 5: Compare / QA → Generate COVERAGE.md quality report
+     ↓
+Stage 6: Sync → Upsert resolved records into production PostgreSQL
+```
+
+**Current state:** Stages 1 and 2 are complete on sample data. The Stage 3 enrich script is being rewritten around a sub-stage architecture. Stages 4–6 have not been started.
+
+---
+
+## Stage 1: Extract
+
+**Input:** `data-pipeline/stage-1-extract/sources/*.jsonl` (Kaikki files, not in git)
+**Output:** `pipeline.db` — `vocabulary_entries` and `entry_translations` tables
+
+**What it does:**
+
+- Parses Kaikki JSONL for all 5 languages (en, de, es, fr, it)
+- Filters to 4 POS: noun, verb, adjective, adverb
+- Each Kaikki sense becomes one `vocabulary_entries` row
+- Translations stored in `entry_translations` with sense hints
+
+**Key design:** Kaikki is structured per word sense. A headword can have several senses, and each translation is linked to one specific sense. This avoids the sense-disambiguation problems encountered with OpenWordNet/OMW.
+
+---
+
+## Stage 2: Reverse Link Sync
+
+**Pure script, no LLMs.**
+
+For each translation pair (e.g., English "thrill" → German "begeistern"), checks if the reverse exists (German "begeistern" → English "thrill"). If the German entry exists but lacks the English back-link, inserts it automatically.
+
+**Why:** Ensures LLMs in Stage 3 only generate translations that are genuinely missing — not translations findable by simple reverse lookup.
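The reverse-link check described above can be sketched in memory as follows. This is an illustration under assumptions: the `Link` shape and the `missingReverseLinks` name are hypothetical, sense hints are ignored, and plain sets stand in for the `pipeline.db` tables.

```typescript
// Hypothetical translation link between two headwords.
interface Link {
  fromLang: string;
  fromWord: string;
  toLang: string;
  toWord: string;
}

// Stable string key so links can be stored in a Set.
const linkKey = (l: Link): string =>
  JSON.stringify([l.fromLang, l.fromWord, l.toLang, l.toWord]);

// Returns the reverse links that should be inserted.
// `existingEntries` holds "lang:word" keys for headwords already in the database.
function missingReverseLinks(links: Link[], existingEntries: Set<string>): Link[] {
  const present = new Set<string>(links.map(linkKey));
  const missing: Link[] = [];
  for (const l of links) {
    const reverse: Link = {
      fromLang: l.toLang,
      fromWord: l.toWord,
      toLang: l.fromLang,
      toWord: l.fromWord,
    };
    const targetExists = existingEntries.has(`${l.toLang}:${l.toWord}`);
    if (targetExists && !present.has(linkKey(reverse))) {
      present.add(linkKey(reverse)); // guard against emitting the same link twice
      missing.push(reverse);
    }
  }
  return missing;
}
```

A link is only added when the target headword already has an entry, matching the rule that a back-link is inserted if the German entry exists but lacks it.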
+
+---
+
+## Stage 3: Enrich (In Progress — Being Rewritten)
+
+**Current blocker:** The original single-prompt design had three problems: it skipped invalid translations, triggered reasoning mode, and required manual review for 20% of entries. It is being rewritten as four ordered sub-stages.
+
+### Sub-Stage Architecture
+
+Each model processes every entry through four sub-stages in order:
+
+1. **`round1_gloss`** — Review existing gloss. Confirm if clear, generate better one if not.
+2. **`round1_example`** — Review examples. Confirm if natural, generate one better sentence.
+3. **`round1_translations`** — Validate translations with verified gloss as context. Confirm valid, reject invalid, generate missing.
+4. **`round1_cefr`** — Assign CEFR level (A1–C2) to headword and each confirmed translation.
+
+**Why this order:** CEFR sub-stage only sees clean, verified data. Bad translations are rejected before reaching CEFR assignment.
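The ordering constraint can be sketched as a loop that threads each sub-stage's output into the context of the next one. The `ModelCall` signature and the `runSubStages` helper are hypothetical, and the model call is synchronous here only to keep the sketch short.

```typescript
// Sub-stage names as listed above, in execution order.
const SUB_STAGES = [
  "round1_gloss",
  "round1_example",
  "round1_translations",
  "round1_cefr",
] as const;
type SubStage = (typeof SUB_STAGES)[number];

// Hypothetical model interface: receives the sub-stage name plus everything
// produced so far, and returns that sub-stage's answer as text.
type ModelCall = (
  stage: SubStage,
  context: Readonly<Record<string, string>>
) => string;

// Runs the four sub-stages in order for one entry and one model.
function runSubStages(
  entry: Record<string, string>,
  callModel: ModelCall
): Record<string, string> {
  const context: Record<string, string> = { ...entry };
  for (const stage of SUB_STAGES) {
    // Later sub-stages see the results of earlier ones through `context`.
    context[stage] = callModel(stage, context);
  }
  return context;
}
```

Because `round1_cefr` runs last, its prompt can be built only from the gloss, example, and translations that earlier sub-stages already reviewed.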
+
+**Voter strategy:** Multiple models vote independently, and each model casts one vote per sub-stage. Current plan:
+
+- Primary: Local Qwen3.5-9B (overnight runs, unlimited)
+- Secondary: Groq Llama 3.3 70B (cloud, batched)
+- Tertiary: Gemini AI Studio (cloud, batched)
+
+**Context enrichment:** Before calling models for the gloss and example sub-stages, the pipeline queries the Wiktionary API for the headword and adds the full entry (all senses, usage notes) to the prompt. This addresses two known problem cases: glosses that are only category headers, and short, ambiguous glosses.
+
+---
+
+## Stage 4: Merge
+
+Resolves LLM votes into final values per entry.
+
+**Rules:**
+
+- Kaikki source data wins automatically (never overridden)
+- CEFR: level with most votes wins
+- Text fields (gloss, example, translation): candidate with most votes wins
+- No majority → flag for tiebreaker
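A minimal sketch of these merge rules for a single field. The `resolveField` name and the `Resolution` shape are hypothetical. The sketch also assumes that "no majority" means a tie for first place (or no votes at all); if a strict majority of all votes is intended, the check would need to be stricter.

```typescript
// Outcome of merging one field of one entry.
type Resolution =
  | { status: "resolved"; value: string; source: "kaikki" | "votes" }
  | { status: "flagged" };

// kaikkiValue: the value from source data, or null if Kaikki had none.
// votes: one candidate value per model that voted on this field.
function resolveField(kaikkiValue: string | null, votes: string[]): Resolution {
  // Rule 1: source data always wins and is never overridden.
  if (kaikkiValue !== null) {
    return { status: "resolved", value: kaikkiValue, source: "kaikki" };
  }
  // Rules 2 and 3: tally identical candidates (CEFR levels or text values).
  const counts = new Map<string, number>();
  for (const vote of votes) {
    counts.set(vote, (counts.get(vote) ?? 0) + 1);
  }
  const ranked = Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
  const top = ranked[0];
  const runnerUp = ranked[1];
  // Rule 4: no votes, or a tie for first place, goes to the tiebreaker.
  if (top === undefined || (runnerUp !== undefined && runnerUp[1] === top[1])) {
    return { status: "flagged" };
  }
  return { status: "resolved", value: top[0], source: "votes" };
}
```

Text fields are compared by exact string equality here, so three models that each write a different gloss would produce a three-way tie and be flagged.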
+
+**Difficulty mapping:**
+| CEFR | Difficulty |
+|------|-----------|
+| A1, A2 | easy |
+| B1, B2 | intermediate |
+| C1, C2 | hard |
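The table above translates directly into a lookup. The names `CEFR_TO_DIFFICULTY` and `difficultyFor` are hypothetical; the project's actual constants are said to live in `packages/shared/src/constants.ts`.

```typescript
type Cefr = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
type Difficulty = "easy" | "intermediate" | "hard";

// One entry per CEFR level, mirroring the difficulty mapping table.
const CEFR_TO_DIFFICULTY: Record<Cefr, Difficulty> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

// Type guard: true only for the six known CEFR levels.
function isCefr(value: string): value is Cefr {
  return Object.prototype.hasOwnProperty.call(CEFR_TO_DIFFICULTY, value);
}

// Returns null for anything that is not a known CEFR level,
// so a malformed model answer cannot silently become a difficulty.
function difficultyFor(level: string): Difficulty | null {
  return isCefr(level) ? CEFR_TO_DIFFICULTY[level] : null;
}
```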
+
+---
+
+## Stage 4b: Tiebreak
+
+Runs automatically after merge if flagged entries remain. Queries models that have not yet voted on those entries, then re-runs merge. Repeats until every flagged entry is resolved or no unused models remain.
+
+**If still unresolved:** Sync is blocked. Add more models to config and re-run.
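The tiebreak loop for one flagged field might look like the sketch below. The `Voter` interface, `hasClearWinner`, and `runTiebreak` are hypothetical names, and "resolved" is assumed to mean that exactly one candidate holds the highest vote count.

```typescript
// Hypothetical handle for a model that has not voted on this entry yet.
interface Voter {
  name: string;
  vote: () => string;
}

// True when exactly one candidate has the highest vote count.
function hasClearWinner(votes: string[]): boolean {
  const counts = new Map<string, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  const sorted = Array.from(counts.values()).sort((a, b) => b - a);
  return sorted.length > 0 && sorted[0] !== sorted[1];
}

// Adds one unused model at a time and re-checks after each new vote.
function runTiebreak(
  initialVotes: string[],
  unusedVoters: Voter[]
): { resolved: boolean; votes: string[]; used: string[] } {
  const votes = initialVotes.slice();
  const used: string[] = [];
  for (const voter of unusedVoters) {
    if (hasClearWinner(votes)) break; // stop as soon as the tie is broken
    votes.push(voter.vote());
    used.push(voter.name);
  }
  // resolved === false means sync stays blocked for this entry.
  return { resolved: hasClearWinner(votes), votes, used };
}
```

Stopping at the first clear winner keeps cloud calls to a minimum, at the cost of not collecting every available opinion.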
+
+---
+
+## Stage 5: Compare / QA
+
+Read-only. Generates `COVERAGE.md` with per-language breakdown:
+
+- Total entries, POS distribution
+- Translation coverage per language pair
+- CEFR coverage and difficulty breakdown
+- Gloss/example coverage by source (Kaikki vs LLM)
+- Per-model contribution stats
+
+Run this before syncing to production.
+
+---
+
+## Stage 6: Sync
+
+Upserts all `status = "final"` entries from `pipeline.db` to production PostgreSQL.
+
+**Behavior:**
+
+- Missing → insert
+- Present but changed → update
+- Present and unchanged → skip
+
+**Idempotent.** Safe to re-run.
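The insert/update/skip decision and its idempotence can be sketched with an in-memory map standing in for the production table. `FinalRecord`, `planAction`, and `syncAll` are hypothetical names, and the compared fields are placeholders for whatever columns the real sync checks.

```typescript
type SyncAction = "insert" | "update" | "skip";

// Hypothetical record shape; real columns will differ.
interface FinalRecord {
  id: string;
  gloss: string;
  cefr: string;
}

// Decides what to do with one resolved record given the production copy, if any.
function planAction(local: FinalRecord, remote: FinalRecord | undefined): SyncAction {
  if (remote === undefined) return "insert";
  const changed = local.gloss !== remote.gloss || local.cefr !== remote.cefr;
  return changed ? "update" : "skip";
}

// Applies the plan to an in-memory stand-in for the production table
// and reports the action taken for each record.
function syncAll(
  locals: FinalRecord[],
  production: Map<string, FinalRecord>
): SyncAction[] {
  return locals.map((local) => {
    const action = planAction(local, production.get(local.id));
    if (action !== "skip") production.set(local.id, { ...local });
    return action;
  });
}
```

Running `syncAll` a second time with the same input yields only `skip` actions, which is the property that makes a re-run safe. In PostgreSQL a comparable effect is commonly achieved with `INSERT ... ON CONFLICT ... DO UPDATE`; whether the sync script uses that statement is not stated here.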
+
+---
+
+## Key Constraints
+
+| Constant   | Values                                |
+| ---------- | ------------------------------------- |
+| Languages  | `en`, `it`, `de`, `es`, `fr`          |
+| POS        | `noun`, `verb`, `adjective`, `adverb` |
+| CEFR       | `A1`, `A2`, `B1`, `B2`, `C1`, `C2`    |
+| Difficulty | `easy`, `intermediate`, `hard`        |
+
+Adding a new value requires updating `packages/shared/src/constants.ts` AND a database migration before re-running the pipeline.
+
+---
+
+## Current Blockers
+
+1. **Enrich sub-stage rewrite** — Stage 3 script needs redesign and testing
+2. **Cloud provider integration** — Groq and Gemini not yet wired into pipeline
+3. **Batching prompt design** — 5–10 entries per API call for efficiency; not yet designed
+4. **Full dataset scale unknown** — Currently running on 500-entry samples. Full Kaikki English file has ~1.3M entries. Exact filtered count and runtime estimate not yet known.
+
+---
+
+## Key Files
+
+| File                                                         | Purpose                                                   |
+| ------------------------------------------------------------ | --------------------------------------------------------- |
+| `data-pipeline/pipeline.ts`                                  | Orchestrator — runs stages in order, handles resumability |
+| `data-pipeline/stage-1-extract/scripts/extract.ts`           | Parse Kaikki JSONL                                        |
+| `data-pipeline/stage-2-reverse-link/scripts/reverse-link.ts` | Insert reverse translations                               |
+| `data-pipeline/stage-3-enrich/scripts/enrich.ts`             | LLM enrichment (being rewritten)                          |
+| `data-pipeline/stage-3-enrich/config.ts`                     | Provider configs (local, OpenRouter, etc.)                |
+| `data-pipeline/db/schema.sql`                                | pipeline.db schema                                        |
+| `data-pipeline/db/import.ts`                                 | Import stage 1 output into pipeline.db                    |
+| `packages/shared/src/constants.ts`                           | Language codes, POS, CEFR, difficulty constants           |