feat: enrich script working, redesigning to sub-stage architecture

- Enrich script functional with timeout, progress tracking, rejection mechanism
- Identified ordering issue: CEFR voting needs validated translations first
- Redesign: round1_gloss → round1_example → round1_translations → round1_cefr
- Update data-pipeline.md with new sub-stage design and roadmap
- Qwen3.5-4B confirmed working with thinking disabled
2026-05-07 13:09:43 +02:00
commit 73fb12ac35
parent 7f10c35e03
7 changed files with 337 additions and 122 deletions
--- a/documentation/data-pipeline.md
+++ b/documentation/data-pipeline.md
@@ -54,6 +54,109 @@ The schema is defined in `data-pipeline/db/schema.sql`. Never edit `pipeline.db`

 On first run the orchestrator initialises `pipeline.db` automatically and imports the stage 1 output into the base tables. This happens once — subsequent runs skip the import if the base tables are already populated.

+## Common commands
+
+### Starting llama.cpp
+
+```bash
+cd ~/Downloads/llama.cpp
+./build/bin/llama-server \
+  --model models/qwen3.5-4b-q4_k_m.gguf \
+  --port 8080 \
+  --ctx-size 4096 \
+  --n-gpu-layers 999 \
+  --host 127.0.0.1 \
+  --chat-template-kwargs '{"enable_thinking":false}' \
+  --reasoning-budget 0
+```
+
+Verify the server is running:
+
+```bash
+curl http://127.0.0.1:8080/health
+```
+
+### Running the pipeline
+
+```bash
+pnpm --filter @lila/pipeline pipeline:run
+```
+
+The pipeline auto-generates a run name from the date and a counter. It picks up where it left off — completed stages are skipped automatically.
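
The naming scheme can be pictured with a small helper. This is only a sketch: the document does not state the exact format, so the `YYYY-MM-DD-N` pattern and the function name `nextRunName` are assumptions made for illustration.

```typescript
// Hypothetical helper: the real run-name format is not documented here,
// so "YYYY-MM-DD-N" is an assumed convention.
function nextRunName(today: Date, existingNames: string[]): string {
  // toISOString() is always UTC; the first 10 characters are the date.
  const day = today.toISOString().slice(0, 10);
  let counter = 1;
  // Increase the counter until the name is not already taken.
  while (existingNames.indexOf(day + "-" + counter) !== -1) {
    counter++;
  }
  return day + "-" + counter;
}
```

A second run on the same day would get the next free counter value for that date.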
+
+### Stage 1 — Extract
+
+```bash
+pnpm --filter @lila/pipeline extract
+```
+
+Runs in sample mode (500 entries per language) by default. Remove the hardcoded limit in `stage-1-extract/scripts/extract.ts` for a full run.
+
+### Stage 2 — Reverse link sync
+
+```bash
+pnpm --filter @lila/pipeline reverse-link
+```
+
+### Initialising and importing the database
+
+```bash
+# Initialise pipeline.db from schema
+pnpm --filter @lila/pipeline db:init
+
+# Import stage 1 output into pipeline.db
+pnpm --filter @lila/pipeline db:import
+```
+
+### Resetting the database
+
+```bash
+# Full reset — delete and reinitialise
+rm data-pipeline/db/pipeline.db
+pnpm --filter @lila/pipeline db:init
+pnpm --filter @lila/pipeline db:import
+pnpm --filter @lila/pipeline reverse-link
+```
+
+### Resetting enrich stage progress
+
+```bash
+# Reset round 1 only (retry failed or incomplete run)
+node -e "
+const Database = require('better-sqlite3');
+const db = new Database('data-pipeline/db/pipeline.db');
+const result = db.prepare(\"DELETE FROM run_status WHERE stage = 'round1'\").run();
+console.log('Deleted', result.changes, 'rows');
+db.close();
+"
+
+# Reset all enrich progress (round 1 and round 2)
+node -e "
+const Database = require('better-sqlite3');
+const db = new Database('data-pipeline/db/pipeline.db');
+const result = db.prepare(\"DELETE FROM run_status WHERE stage IN ('round1', 'round2')\").run();
+console.log('Deleted', result.changes, 'rows');
+db.close();
+"
+```
+
+### Checking pipeline progress
+
+```bash
+node -e "
+const Database = require('better-sqlite3');
+const db = new Database('data-pipeline/db/pipeline.db', { readonly: true });
+const total = db.prepare(\"SELECT COUNT(*) as c FROM entries WHERE language = 'en'\").get().c;
+const complete = db.prepare(\"SELECT COUNT(*) as c FROM run_status WHERE stage = 'round1' AND status = 'complete'\").get().c;
+const needsReview = db.prepare(\"SELECT COUNT(*) as c FROM run_status WHERE stage = 'round1' AND status = 'needs_review'\").get().c;
+console.log('Total English entries:', total);
+console.log('Round 1 complete:', complete);
+console.log('Needs review:', needsReview);
+console.log('Pending:', total - complete - needsReview);
+db.close();
+"
+```
+
 ## Data source

 ### Kaikki (Wiktionary)
@@ -171,24 +274,31 @@ pnpm --filter @lila/pipeline reverse-link

 ### 3. Enrich

-The enrich stage runs LLMs to fill four types of gaps, in this order:
+> **Note:** Before running this stage, ensure the llama.cpp server is running
+> locally. The orchestrator checks for a running server at
+> `http://127.0.0.1:8080/health` and exits with instructions if it is not
+> reachable. See `llm-setup.md` for setup instructions.

-**A — Missing translations:** for each entry that has no translation in one or more supported languages after reverse link sync, the LLM generates the best translation for that language given the entry's headword, gloss, and examples.
+The enrich stage runs in four ordered sub-stages per entry, designed to build context progressively.

-**B — Weak glosses and examples:** for each entry where the gloss is missing or the examples are missing, the LLM generates a natural, learner-friendly gloss and one usage example in the entry's language.
+**Sub-stage order:**

-**C — CEFR levels:** for every entry, the LLM assigns a CEFR level (A1–C2) based on the headword, gloss, and examples. This runs for all entries regardless of whether other enrichment was needed.
+1. **`round1_gloss`** — the LLM reviews the existing gloss. If it is clear and learner-friendly, it confirms it. If not, it generates a better one.

-All output is written to `pipeline.db` atomically per entry — runs are fully resumable if interrupted. Each model is run once — one model produces one vote.
+2. **`round1_example`** — the LLM reviews the existing examples. If they are natural and suitable, it confirms them. If not, it generates one better example sentence in the entry language.

-> **Note:** Before running this stage, ensure the llama.cpp server is running locally. The orchestrator checks for a running server at `http://127.0.0.1:8080/health` and exits with instructions if it is not reachable. See `llm-setup.md` for setup instructions.
+3. **`round1_translations`** — using the verified gloss as context, the LLM reviews each existing translation. Valid translations are confirmed. Invalid ones (wrong language, suffixes, garbled text, wrong sense) are explicitly rejected. Missing languages get a generated translation.
+
+4. **`round1_cefr`** — using only the validated translations from the previous sub-stage, the LLM votes on the CEFR level for the headword and for each confirmed translation. Rejected translations never reach this sub-stage.
+
+This ordering ensures the CEFR voting sub-stage only sees clean, verified data.
+
+All output is written to `pipeline.db` atomically per sub-stage per entry. Interrupted runs resume from the last incomplete sub-stage without losing work. Each model is run once — one model, one vote per sub-stage.
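
The ordering and resume behaviour described above can be sketched as follows. This is an illustrative outline, not the enrich script itself: the names `runEntry`, `confirmedOnly`, and the in-memory `DoneMap` (standing in for the `run_status` table) are hypothetical.

```typescript
// Sub-stage names come from the document; everything else is a sketch.
const SUB_STAGES = [
  "round1_gloss",
  "round1_example",
  "round1_translations",
  "round1_cefr",
] as const;
type SubStage = (typeof SUB_STAGES)[number];

// Stand-in for run_status: "entryId:subStage" -> completed.
type DoneMap = { [key: string]: boolean };
type Handler = (entryId: number) => void;

// Runs the sub-stages in order, skipping any that already completed.
// If a handler throws, later sub-stages never run and nothing is marked done.
function runEntry(
  entryId: number,
  handlers: Record<SubStage, Handler>,
  done: DoneMap,
): SubStage[] {
  const executed: SubStage[] = [];
  for (const stage of SUB_STAGES) {
    const key = entryId + ":" + stage;
    if (done[key]) continue; // finished in an earlier run
    handlers[stage](entryId);
    done[key] = true; // recorded only after the handler succeeds
    executed.push(stage);
  }
  return executed;
}

type TranslationVote = {
  language: string;
  text: string;
  verdict: "confirmed" | "rejected";
};

// Only confirmed translations are passed on to the CEFR sub-stage.
function confirmedOnly(votes: TranslationVote[]): TranslationVote[] {
  return votes.filter((v) => v.verdict === "confirmed");
}
```

Marking a sub-stage complete only after its handler returns is what makes a rerun safe in this sketch: an interrupted sub-stage is simply retried.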

 **Input:** `pipeline.db` — entries after reverse link sync
-**Output:** `pipeline.db` — LLM-generated translations, glosses, examples, and CEFR votes
+**Output:** `pipeline.db` — gloss votes, example votes, translation votes, CEFR votes per entry per model

-```bash
-pnpm --filter @lila/pipeline run --name "night-1"
-```
+> **Note:** The tiebreaker is not a standalone script. It runs automatically as part of the pipeline orchestrator after merge completes.

 ### 4. Merge

@@ -314,11 +424,9 @@ These are not part of the current pipeline but are worth considering as the data

 ## Roadmap

-**Current state:** Stages 1 and 2 complete and verified on sample data.
-Stage 3 round 1 enrich script written. llama.cpp not yet installed.
-pipeline.db contains 4,156 entries and 4,287 translations across 5 languages.
+**Current state:** Stage 1 extraction and stage 2 reverse link sync complete and verified on sample data. Stage 3 enrich script written and tested — redesigning to sub-stage architecture for better data quality. llama.cpp running with Qwen3.5-4B.

-**Next action:** Install llama.cpp, run smoke test with sample data.
+**Next action:** Rewrite enrich script for sub-stage design.

 | Stage           | Status         |
 | --------------- | -------------- |
@@ -347,14 +455,15 @@ pipeline.db contains 4,156 entries and 4,287 translations across 5 languages.
 - [x] Run reverse link sync on sample data → 141 links inserted
 - [ ] Run reverse link sync on full data after full extraction

-### Stage 3 — Enrich `🔲 not started`
+### Stage 3 — Enrich `🔄 in progress`

-**Next action:** Write the enrich script after production schema is complete.
+**Next action:** Rewrite enrich script for sub-stage design.

-- [x] Write enrich script (missing translations, glosses, examples, CEFR votes)
-- [ ] Write tests
-- [ ] Install llama.cpp and verify server
-- [ ] Smoke test with sample entries
+- [x] Write initial enrich script (single-prompt design)
+- [x] Install llama.cpp and verify server
+- [x] Smoke test with sample entries
+- [ ] Rewrite enrich script for sub-stage design (round1_gloss, round1_example, round1_translations, round1_cefr)
+- [ ] Write tests for enrich sub-stages
 - [ ] Run full sample, collect metrics
 - [ ] Compare providers (local vs OpenRouter free models)
 - [ ] Production run — all entries, all models
--- a/documentation/llm-setup.md
+++ b/documentation/llm-setup.md
@@ -1,17 +1,12 @@
 # LLM Setup — lila pipeline

-This document covers the LLM infrastructure for stage 3 (enrich) of the lila
-data pipeline. It documents the hardware constraints, supported providers,
-model recommendations, and how to configure and swap providers in the test
-and production scripts.
+This document covers the LLM infrastructure for stage 3 (enrich) of the lila data pipeline. It documents the hardware constraints, supported providers, model recommendations, and how to configure and swap providers in the test and production scripts.

 ---

 ## Provider model

-Each provider + model combination counts as one vote in the final majority.
-Running the same model twice is not supported — one model, one vote. To
-increase vote confidence, add more models rather than re-running existing ones.
+Each provider + model combination counts as one vote in the final majority. Running the same model twice is not supported — one model, one vote. To increase vote confidence, add more models rather than re-running existing ones.
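
The one-model-one-vote rule can be sketched as a small tally. The `Vote` shape and the function name `majorityVote` are assumptions for illustration; the actual vote tables in `pipeline.db` may look different, and returning `null` on a tie is just one way to hand the decision to a tiebreaker.

```typescript
// Hypothetical vote shape, not taken from the pipeline schema.
type Vote = { provider: string; model: string; value: string };

// Returns the majority value, or null when there is no single winner.
function majorityVote(votes: Vote[]): string | null {
  // Keep only the first vote per provider + model pair, so re-running
  // the same model cannot add weight.
  const perModel: { [key: string]: string } = Object.create(null);
  for (const v of votes) {
    const key = v.provider + "/" + v.model;
    if (!(key in perModel)) perModel[key] = v.value;
  }

  // Count how many distinct models chose each value.
  const counts: { [value: string]: number } = Object.create(null);
  for (const key of Object.keys(perModel)) {
    const value = perModel[key];
    counts[value] = (counts[value] || 0) + 1;
  }

  let best: string | null = null;
  let bestCount = 0;
  let tied = false;
  for (const value of Object.keys(counts)) {
    if (counts[value] > bestCount) {
      best = value;
      bestCount = counts[value];
      tied = false;
    } else if (counts[value] === bestCount) {
      tied = true;
    }
  }
  return tied ? null : best;
}
```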

 ---

@@ -24,17 +19,13 @@ increase vote confidence, add more models rather than re-running existing ones.
 | GPU       | NVIDIA GeForce GTX 950M — 4 GB VRAM (Maxwell, CUDA compute 5.0) |
 | OS        | Debian GNU/Linux 13 (trixie) x86_64                             |

-**Local inference verdict:** viable for small/quantized models, not for
-production runs. See the [Local inference](#local-inference-llamacpp) section
-for details.
+**Local inference verdict:** viable for small/quantized models, not for production runs. See the [Local inference](#local-inference-llamacpp) section for details.

 ---

 ## Provider overview

-The enrich script uses a single, swappable provider config. All providers
-except Anthropic expose an OpenAI-compatible API, so the same client code
-works across all of them — only `baseURL`, `apiKey`, and `model` change.
+The enrich script uses a single, swappable provider config. All providers except Anthropic expose an OpenAI-compatible API, so the same client code works across all of them — only `baseURL`, `apiKey`, and `model` change.
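
As a rough sketch of what such a swappable config could look like, the snippet below builds an OpenAI-style chat completion request from the three fields named above. The type names, the helper `buildChatRequest`, and the example values are assumptions, not the project's actual code.

```typescript
// Hypothetical config shape built around the three fields the text names.
type ProviderConfig = { baseURL: string; apiKey: string; model: string };

type ChatRequest = {
  url: string;
  headers: { [name: string]: string };
  body: string;
};

// Builds the request without sending it, so any HTTP client can be used.
function buildChatRequest(cfg: ProviderConfig, prompt: string): ChatRequest {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: cfg.baseURL.replace(/\/+$/, "") + "/chat/completions",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + cfg.apiKey,
    },
    body: JSON.stringify({
      model: cfg.model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// Placeholder values for a local llama.cpp server; not the real config.
const localLlama: ProviderConfig = {
  baseURL: "http://127.0.0.1:8080/v1",
  apiKey: "not-needed-locally",
  model: "qwen3.5-4b-q4_k_m",
};
```

Swapping providers then means passing a different `ProviderConfig` object; the request-building code stays the same.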

 | Provider               | Use case                                      | Cost               | Rate limits            |
 | ---------------------- | --------------------------------------------- | ------------------ | ---------------------- |
@@ -49,20 +40,13 @@ works across all of them — only `baseURL`, `apiKey`, and `model` change.

 ### Why local inference is worth testing

-Time is not a constraint — the pipeline scripts are fully resumable. The
-laptop can run overnight for multiple nights. The only question is output
-quality, which the test script evaluates empirically.
+Time is not a constraint — the pipeline scripts are fully resumable. The laptop can run overnight for multiple nights. The only question is output quality, which the test script evaluates empirically.

 ### Hardware constraints

-The GTX 950M has 4 GB VRAM and Maxwell architecture (CUDA compute 5.0).
-llama.cpp supports Maxwell via CUDA backend but newer builds may require
-the `--cuda-no-kv-offload` flag depending on the version.
+The GTX 950M has 4 GB VRAM and Maxwell architecture (CUDA compute 5.0). llama.cpp supports Maxwell via CUDA backend but newer builds may require the `--cuda-no-kv-offload` flag depending on the version.

-llama.cpp splits model layers between GPU and CPU automatically via
-`--n-gpu-layers`. You set how many layers go on the GPU; the rest run on
-CPU/RAM. This means a model larger than VRAM is not a dead end — it runs
-in hybrid mode, slower than full-GPU but much faster than pure CPU.
+llama.cpp splits model layers between GPU and CPU automatically via `--n-gpu-layers`. You set how many layers go on the GPU; the rest run on CPU/RAM. This means a model larger than VRAM is not a dead end — it runs in hybrid mode, slower than full-GPU but much faster than pure CPU.

 Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):

@@ -75,24 +59,19 @@ Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):

 ### Recommended local models

-Two candidates worth testing, covering different points on the size/quality
-tradeoff:
+Two candidates worth testing, covering different points on the size/quality tradeoff:

 **Gemma 4 E4B Instruct (Q4 / UD-Q4_K_XL)**

 - GGUF file: `gemma-4-E4B-it-UD-Q4_K_XL.gguf` (~2.5 GB)
 - Source: https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF
-- Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+
-  language support including all five pipeline languages. First candidate
-  to test.
+- Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+ language support including all five pipeline languages. First candidate to test.

 **Qwen2.5 7B Instruct (Q4_K_M)**

 - GGUF file: `Qwen2.5-7B-Instruct-Q4_K_M.gguf` (~4.5 GB)
 - Source: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
-- Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s.
-  Stronger multilingual generation than any 3–4B model. Second candidate,
-  for comparison against the smaller Gemma 4 E4B.
+- Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s. Stronger multilingual generation than any 3–4B model. Second candidate, for comparison against the smaller Gemma 4 E4B.

 ### Installation