lila/documentation/ai-context/05-data-pipeline.md
2026-05-16 01:59:43 +02:00

# 05 — Data Pipeline
> **Purpose:** Condensed reference for LLMs working on the Kaikki data pipeline. Covers stages, data flow, and current blockers. For full operational details (llama.cpp setup, provider configs, hardware specs), see the human-readable DATA_PIPELINE.md.
> **Last updated:** 2026-05-15
> **Depends on:** 00-project-overview.md
---
## Pipeline Overview
```
Kaikki JSONL (Wiktionary extracts)
Stage 1: Extract → Parse into pipeline.db (SQLite)
Stage 2: Reverse Link → Insert missing reverse translations
Stage 3: Enrich → LLMs review glosses, examples, translations, assign CEFR
Stage 4: Merge → Resolve LLM votes into final values
Stage 4b: Tiebreak → Run unused models on flagged entries
Stage 5: Compare / QA → Generate COVERAGE.md quality report
Stage 6: Sync → Upsert resolved records into production PostgreSQL
```
**Current state:** Stages 1 and 2 are complete on sample data. The Stage 3 enrich script is being rewritten for the sub-stage architecture. Stages 4–6 have not been started.
---
## Stage 1: Extract
**Input:** `data-pipeline/stage-1-extract/sources/*.jsonl` (Kaikki files, not in git)
**Output:** `pipeline.db` (`vocabulary_entries` and `entry_translations` tables)
**What it does:**
- Parses Kaikki JSONL for all 5 languages (en, de, es, fr, it)
- Filters to 4 POS: noun, verb, adjective, adverb
- Each Kaikki sense becomes one `vocabulary_entries` row
- Translations stored in `entry_translations` with sense hints
**Key design:** Kaikki is structured per word sense. A headword can have multiple senses, and each translation is linked to a specific sense. This prevents the sense-disambiguation problems of OpenWordNet/OMW.
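
The per-sense mapping can be illustrated with a minimal sketch. The Kaikki field names (`word`, `lang_code`, `pos`, `senses`, `translations`) follow typical wiktextract output and should be checked against the actual source files. The row shapes, the POS tag mapping, and the function name `extractLine` are hypothetical and are not taken from `extract.ts` or `schema.sql`.

```typescript
// Field names follow typical Kaikki/wiktextract output (an assumption);
// the row shapes are hypothetical, not the real schema.sql.
interface KaikkiEntry {
  word: string;
  lang_code: string;
  pos: string;
  senses?: { glosses?: string[] }[];
  translations?: { code?: string; word?: string; sense?: string }[];
}

interface VocabularyEntryRow {
  headword: string;
  lang: string;
  pos: string;
  senseIndex: number;
  gloss: string | null;
}

interface EntryTranslationRow {
  headword: string;
  lang: string;
  targetLang: string;
  targetWord: string;
  senseHint: string | null;
}

const PIPELINE_LANGS = ["en", "de", "es", "fr", "it"];

// Assumed mapping from Kaikki POS tags to the pipeline's four POS values.
const POS_MAP = new Map<string, string>([
  ["noun", "noun"],
  ["verb", "verb"],
  ["adj", "adjective"],
  ["adv", "adverb"],
]);

function extractLine(line: string): {
  entries: VocabularyEntryRow[];
  translations: EntryTranslationRow[];
} {
  const raw = JSON.parse(line) as KaikkiEntry;
  const pos = POS_MAP.get(raw.pos);
  // Drop entries outside the five languages or the four POS values.
  if (PIPELINE_LANGS.indexOf(raw.lang_code) === -1 || pos === undefined) {
    return { entries: [], translations: [] };
  }
  // One vocabulary_entries row per Kaikki sense.
  const entries: VocabularyEntryRow[] = (raw.senses ?? []).map((sense, senseIndex) => ({
    headword: raw.word,
    lang: raw.lang_code,
    pos,
    senseIndex,
    gloss: sense.glosses?.[0] ?? null,
  }));
  // Keep translations into the other pipeline languages, with their sense hint.
  // Restricting targets to the five languages is an assumption of this sketch.
  const translations: EntryTranslationRow[] = [];
  for (const t of raw.translations ?? []) {
    if (!t.code || !t.word) continue;
    if (t.code === raw.lang_code || PIPELINE_LANGS.indexOf(t.code) === -1) continue;
    translations.push({
      headword: raw.word,
      lang: raw.lang_code,
      targetLang: t.code,
      targetWord: t.word,
      senseHint: t.sense ?? null,
    });
  }
  return { entries, translations };
}
```
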
---
## Stage 2: Reverse Link Sync
**Pure script, no LLMs.**
For each translation pair (e.g., English "thrill" → German "begeistern"), checks if the reverse exists (German "begeistern" → English "thrill"). If the German entry exists but lacks the English back-link, inserts it automatically.
**Why:** Ensures LLMs in Stage 3 only generate translations that are genuinely missing — not translations findable by simple reverse lookup.
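
A minimal in-memory sketch of the reverse-link check follows. The `TranslationPair` shape, the `lang:word` headword key, and the function name are hypothetical; the real script works against `pipeline.db` and presumably also tracks senses, which this sketch ignores.

```typescript
// Hypothetical shape of one row in entry_translations, without sense data.
interface TranslationPair {
  sourceLang: string;
  sourceWord: string;
  targetLang: string;
  targetWord: string;
}

// Builds a lookup key; the NUL separator avoids collisions between fields.
function pairKey(p: TranslationPair): string {
  return [p.sourceLang, p.sourceWord, p.targetLang, p.targetWord].join("\u0000");
}

// Returns the reverse links that should be inserted.
// `existingHeadwords` holds keys of the form "lang:word" for known entries.
function findMissingReverseLinks(
  pairs: readonly TranslationPair[],
  existingHeadwords: ReadonlySet<string>,
): TranslationPair[] {
  const present = new Set(pairs.map(pairKey));
  const missing = new Map<string, TranslationPair>();
  for (const p of pairs) {
    const reverse: TranslationPair = {
      sourceLang: p.targetLang,
      sourceWord: p.targetWord,
      targetLang: p.sourceLang,
      targetWord: p.sourceWord,
    };
    // Only link back when the target headword exists as an entry.
    if (!existingHeadwords.has(`${reverse.sourceLang}:${reverse.sourceWord}`)) continue;
    const key = pairKey(reverse);
    // The Map de-duplicates reverse links requested by several pairs.
    if (!present.has(key)) missing.set(key, reverse);
  }
  return Array.from(missing.values());
}
```
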
---
## Stage 3: Enrich (In Progress — Being Rewritten)
**Current blocker:** The original single-prompt design had three problems: it skipped invalid translations, triggered reasoning mode, and left 20% of entries needing manual review. It is being rewritten as four ordered sub-stages.
### Sub-Stage Architecture
Each model processes every entry through four sub-stages in order:
1. **`round1_gloss`** — Review existing gloss. Confirm if clear, generate better one if not.
2. **`round1_example`** — Review examples. Confirm if natural, generate one better sentence.
3. **`round1_translations`** — Validate translations with verified gloss as context. Confirm valid, reject invalid, generate missing.
4. **`round1_cefr`** — Assign CEFR level (A1–C2) to headword and each confirmed translation.
**Why this order:** CEFR sub-stage only sees clean, verified data. Bad translations are rejected before reaching CEFR assignment.
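
The ordering guarantee can be sketched as a fixed list of sub-stages, each receiving the output of the previous one. The `EntryContext` shape and the synchronous `SubStageRunner` signature are hypothetical; real runners would call a model and would most likely be asynchronous.

```typescript
type SubStage = "round1_gloss" | "round1_example" | "round1_translations" | "round1_cefr";

// Order matters: each sub-stage sees the output of the previous one.
const SUB_STAGES: readonly SubStage[] = [
  "round1_gloss",
  "round1_example",
  "round1_translations",
  "round1_cefr",
];

// Hypothetical working state for one entry as it moves through the sub-stages.
interface EntryContext {
  headword: string;
  gloss: string;
  example: string | null;
  translations: string[];
  cefr: Record<string, string>;
}

// Stand-in for one model call per sub-stage (synchronous for simplicity).
type SubStageRunner = (ctx: EntryContext) => EntryContext;

function runSubStages(
  entry: EntryContext,
  runners: Record<SubStage, SubStageRunner>,
): EntryContext {
  let ctx = entry;
  for (const stage of SUB_STAGES) {
    ctx = runners[stage](ctx);
  }
  return ctx;
}
```

Because `round1_cefr` runs last, it only receives translations that survived `round1_translations`.
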
**Voter strategy:** Multiple models vote independently. Each model = one vote per sub-stage. Current plan:
- Primary: Local Qwen3.5-9B (overnight runs, unlimited)
- Secondary: Groq Llama 3.3 70B (cloud, batched)
- Tertiary: Gemini AI Studio (cloud, batched)
**Context enrichment:** Before calling models for the gloss and example sub-stages, the pipeline queries the Wiktionary API for the headword and adds the full entry (all senses, usage notes) to the prompt. This fixes glosses that are only category headers and glosses that are too short to be unambiguous.
---
## Stage 4: Merge
Resolves LLM votes into final values per entry.
**Rules:**
- Kaikki source data wins automatically (never overridden)
- CEFR: level with most votes wins
- Text fields (gloss, example, translation): candidate with most votes wins
- No majority → flag for tiebreaker
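
The rules above can be sketched as a single resolver per field. This is an illustration under assumptions, not the merge script: "no majority" is interpreted here as "no votes, or a tie for first place", and text candidates are compared by exact equality. With three voters this matches a strict majority rule, but with more voters the real rule may differ.

```typescript
type MergeResult<T> =
  | { status: "final"; value: T; source: "kaikki" | "llm" }
  | { status: "needs_tiebreak" };

// `kaikkiValue` is null when the source data has no value for this field.
function resolveField<T>(kaikkiValue: T | null, votes: readonly T[]): MergeResult<T> {
  // Kaikki source data wins automatically and is never overridden.
  if (kaikkiValue !== null) {
    return { status: "final", value: kaikkiValue, source: "kaikki" };
  }
  // Count votes per candidate.
  const tally = new Map<T, number>();
  for (const vote of votes) {
    tally.set(vote, (tally.get(vote) ?? 0) + 1);
  }
  // Find the top candidate and track whether first place is shared.
  let winner: T | null = null;
  let winnerCount = 0;
  let tied = false;
  for (const [candidate, count] of Array.from(tally.entries())) {
    if (count > winnerCount) {
      winner = candidate;
      winnerCount = count;
      tied = false;
    } else if (count === winnerCount) {
      tied = true;
    }
  }
  // No votes or a shared first place: flag for the tiebreaker.
  if (winner === null || tied) return { status: "needs_tiebreak" };
  return { status: "final", value: winner, source: "llm" };
}
```
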
**Difficulty mapping:**
| CEFR | Difficulty |
|------|-----------|
| A1, A2 | easy |
| B1, B2 | intermediate |
| C1, C2 | hard |
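
The mapping table translates directly into a lookup. The type and function names below are hypothetical; only the level and difficulty values come from the table.

```typescript
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
type DifficultyTier = "easy" | "intermediate" | "hard";

// Typing the table as Record<CefrLevel, ...> makes the compiler
// reject the table if any CEFR level is left unmapped.
const CEFR_TO_DIFFICULTY: Record<CefrLevel, DifficultyTier> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

function difficultyFor(level: CefrLevel): DifficultyTier {
  return CEFR_TO_DIFFICULTY[level];
}
```
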
---
## Stage 4b: Tiebreak
Runs automatically after merge if flagged entries remain. Queries models that have not yet voted on the flagged entries and re-runs merge. Repeats until every entry is resolved or no unused models remain.
**If still unresolved:** Sync is blocked. Add more models to config and re-run.
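
The loop can be sketched for a single flagged field as follows. The `askModel` callback, the outcome shape, and the "unique top candidate" resolution test are assumptions made for this illustration; the real stage re-runs the merge script rather than an inline check.

```typescript
interface TiebreakOutcome {
  resolved: boolean;
  votes: string[];
  modelsUsed: string[];
}

// True when exactly one candidate holds the highest vote count.
function hasUniqueWinner(votes: readonly string[]): boolean {
  const counts = new Map<string, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  const sorted = Array.from(counts.values()).sort((a, b) => b - a);
  const [first = 0, second = 0] = sorted;
  return first > second;
}

// `askModel` is a hypothetical stand-in for querying one unused model.
function runTiebreak(
  initialVotes: readonly string[],
  unusedModels: readonly string[],
  askModel: (model: string) => string,
): TiebreakOutcome {
  const votes = [...initialVotes];
  const modelsUsed: string[] = [];
  for (const model of unusedModels) {
    // Stop as soon as the extra votes have produced a winner.
    if (hasUniqueWinner(votes)) break;
    votes.push(askModel(model));
    modelsUsed.push(model);
  }
  // resolved === false means sync stays blocked for this entry.
  return { resolved: hasUniqueWinner(votes), votes, modelsUsed };
}
```
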
---
## Stage 5: Compare / QA
Read-only. Generates `COVERAGE.md` with per-language breakdown:
- Total entries, POS distribution
- Translation coverage per language pair
- CEFR coverage and difficulty breakdown
- Gloss/example coverage by source (Kaikki vs LLM)
- Per-model contribution stats
Run this before syncing to production.
---
## Stage 6: Sync
Upserts all `status = "final"` entries from `pipeline.db` to production PostgreSQL.
**Behavior:**
- Missing → insert
- Present but changed → update
- Present and unchanged → skip
**Idempotent.** Safe to re-run.
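
The three behaviors and the idempotency claim can be illustrated with an in-memory stand-in for the production table. The `FinalRecord` shape, the single `payload` field, and the `Map` used in place of PostgreSQL are all hypothetical simplifications.

```typescript
// Hypothetical record shape; `payload` stands in for all synced columns.
interface FinalRecord {
  id: string;
  payload: string;
}

type SyncAction = "insert" | "update" | "skip";

// `production` is an in-memory stand-in for the PostgreSQL table, keyed by id.
function syncOnce(
  finalEntries: readonly FinalRecord[],
  production: Map<string, string>,
): Record<SyncAction, number> {
  const counts: Record<SyncAction, number> = { insert: 0, update: 0, skip: 0 };
  for (const entry of finalEntries) {
    const current = production.get(entry.id);
    let action: SyncAction;
    if (current === undefined) action = "insert"; // missing
    else if (current !== entry.payload) action = "update"; // present but changed
    else action = "skip"; // present and unchanged
    if (action !== "skip") production.set(entry.id, entry.payload);
    counts[action] += 1;
  }
  return counts;
}
```

Running `syncOnce` a second time with the same input reports every entry as `skip`, which is the property that makes a re-run safe.
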
---
## Key Constraints
| Constant | Values |
| ---------- | ------------------------------------- |
| Languages | `en`, `it`, `de`, `es`, `fr` |
| POS | `noun`, `verb`, `adjective`, `adverb` |
| CEFR | `A1`, `A2`, `B1`, `B2`, `C1`, `C2` |
| Difficulty | `easy`, `intermediate`, `hard` |
Adding a new value requires updating `packages/shared/src/constants.ts` AND a database migration before re-running the pipeline.
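
One common way to keep such constants and their types in sync is a `const` assertion with a derived union type. The sketch below is hypothetical and does not reproduce the actual contents of `packages/shared/src/constants.ts`; the identifier names are invented for illustration.

```typescript
// Hypothetical names; values are taken from the table above.
const SUPPORTED_LANGUAGES = ["en", "it", "de", "es", "fr"] as const;
const SUPPORTED_POS = ["noun", "verb", "adjective", "adverb"] as const;
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const DIFFICULTIES = ["easy", "intermediate", "hard"] as const;

// Union types derived from the arrays, so adding a value updates both.
type Language = typeof SUPPORTED_LANGUAGES[number];
type PartOfSpeech = typeof SUPPORTED_POS[number];

// Generic runtime guard for validating untrusted strings (e.g. LLM output).
function isOneOf<T extends string>(allowed: readonly T[], value: string): value is T {
  return (allowed as readonly string[]).indexOf(value) !== -1;
}
```
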
---
## Current Blockers
1. **Enrich sub-stage rewrite** — Stage 3 script needs redesign and testing
2. **Cloud provider integration** — Groq and Gemini not yet wired into pipeline
3. **Batching prompt design** — 5–10 entries per API call for efficiency; not yet designed
4. **Full dataset scale unknown** — Currently running on 500-entry samples. Full Kaikki English file has ~1.3M entries. Exact filtered count and runtime estimate not yet known.
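
Since the batching design is still open, the following is only a sketch of the grouping step, assuming a fixed batch size within the 5 to 10 range mentioned above. The helper name is hypothetical, and prompt construction and response parsing are out of scope here.

```typescript
// Splits entries into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

For example, 23 entries with a batch size of 10 yield three batches of 10, 10, and 3 entries.
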
---
## Key Files
| File | Purpose |
| ------------------------------------------------------------ | --------------------------------------------------------- |
| `data-pipeline/pipeline.ts` | Orchestrator — runs stages in order, handles resumability |
| `data-pipeline/stage-1-extract/scripts/extract.ts` | Parse Kaikki JSONL |
| `data-pipeline/stage-2-reverse-link/scripts/reverse-link.ts` | Insert reverse translations |
| `data-pipeline/stage-3-enrich/scripts/enrich.ts` | LLM enrichment (being rewritten) |
| `data-pipeline/stage-3-enrich/config.ts` | Provider configs (local, OpenRouter, etc.) |
| `data-pipeline/db/schema.sql` | pipeline.db schema |
| `data-pipeline/db/import.ts` | Import stage 1 output into pipeline.db |
| `packages/shared/src/constants.ts` | Language codes, POS, CEFR, difficulty constants |
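
The orchestrator's "runs stages in order, handles resumability" behavior could work along the lines of the sketch below. The stage identifiers and the policy of resuming from the first incomplete stage are assumptions; how `pipeline.ts` actually records completed stages is not described here.

```typescript
// Hypothetical stage identifiers, in pipeline order.
const STAGE_ORDER = [
  "extract",
  "reverse-link",
  "enrich",
  "merge",
  "tiebreak",
  "compare",
  "sync",
] as const;

type StageName = typeof STAGE_ORDER[number];

// Resumes from the first stage that has not completed and
// re-runs everything after it, since later output depends on it.
function stagesToRun(completed: ReadonlySet<StageName>): StageName[] {
  const firstPending = STAGE_ORDER.findIndex((stage) => !completed.has(stage));
  return firstPending === -1 ? [] : STAGE_ORDER.slice(firstPending);
}
```
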