feat: enrich script working, redesigning to sub-stage architecture

- Enrich script functional with timeout, progress tracking, rejection mechanism
- Identified ordering issue: CEFR voting needs validated translations first
- Redesign: round1_gloss → round1_example → round1_translations → round1_cefr
- Update data-pipeline.md with new sub-stage design and roadmap
- Qwen3.5-4B confirmed working with thinking disabled
2026-05-07 13:09:43 +02:00
commit 73fb12ac35
parent 7f10c35e03
7 changed files with 337 additions and 122 deletions
--- a/documentation/llm-setup.md
+++ b/documentation/llm-setup.md
@@ -1,17 +1,12 @@
 # LLM Setup — lila pipeline

-This document covers the LLM infrastructure for stage 3 (enrich) of the lila
-data pipeline. It documents the hardware constraints, supported providers,
-model recommendations, and how to configure and swap providers in the test
-and production scripts.
+This document covers the LLM infrastructure for stage 3 (enrich) of the lila data pipeline. It documents the hardware constraints, supported providers, model recommendations, and how to configure and swap providers in the test and production scripts.

 ---

 ## Provider model

-Each provider + model combination counts as one vote in the final majority.
-Running the same model twice is not supported — one model, one vote. To
-increase vote confidence, add more models rather than re-running existing ones.
+Each provider + model combination counts as one vote in the final majority. Running the same model twice is not supported — one model, one vote. To increase vote confidence, add more models rather than re-running existing ones.

 ---

@@ -24,17 +19,13 @@ increase vote confidence, add more models rather than re-running existing ones.
 | GPU       | NVIDIA GeForce GTX 950M — 4 GB VRAM (Maxwell, CUDA compute 5.0) |
 | OS        | Debian GNU/Linux 13 (trixie) x86_64                             |

-**Local inference verdict:** viable for small/quantized models, not for
-production runs. See the [Local inference](#local-inference-llamacpp) section
-for details.
+**Local inference verdict:** viable for small/quantized models, not for production runs. See the [Local inference](#local-inference-llamacpp) section for details.

 ---

 ## Provider overview

-The enrich script uses a single, swappable provider config. All providers
-except Anthropic expose an OpenAI-compatible API, so the same client code
-works across all of them — only `baseURL`, `apiKey`, and `model` change.
+The enrich script uses a single, swappable provider config. All providers except Anthropic expose an OpenAI-compatible API, so the same client code works across all of them — only `baseURL`, `apiKey`, and `model` change.

 | Provider               | Use case                                      | Cost               | Rate limits            |
 | ---------------------- | --------------------------------------------- | ------------------ | ---------------------- |
@@ -49,20 +40,13 @@ works across all of them — only `baseURL`, `apiKey`, and `model` change.

 ### Why local inference is worth testing

-Time is not a constraint — the pipeline scripts are fully resumable. The
-laptop can run overnight for multiple nights. The only question is output
-quality, which the test script evaluates empirically.
+Time is not a constraint — the pipeline scripts are fully resumable. The laptop can run overnight for multiple nights. The only question is output quality, which the test script evaluates empirically.

 ### Hardware constraints

-The GTX 950M has 4 GB VRAM and Maxwell architecture (CUDA compute 5.0).
-llama.cpp supports Maxwell via CUDA backend but newer builds may require
-the `--cuda-no-kv-offload` flag depending on the version.
+The GTX 950M has 4 GB VRAM and Maxwell architecture (CUDA compute 5.0). llama.cpp supports Maxwell via CUDA backend but newer builds may require the `--cuda-no-kv-offload` flag depending on the version.

-llama.cpp splits model layers between GPU and CPU automatically via
-`--n-gpu-layers`. You set how many layers go on the GPU; the rest run on
-CPU/RAM. This means a model larger than VRAM is not a dead end — it runs
-in hybrid mode, slower than full-GPU but much faster than pure CPU.
+llama.cpp splits model layers between GPU and CPU automatically via `--n-gpu-layers`. You set how many layers go on the GPU; the rest run on CPU/RAM. This means a model larger than VRAM is not a dead end — it runs in hybrid mode, slower than full-GPU but much faster than pure CPU.

 Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):

@@ -75,24 +59,19 @@ Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):

 ### Recommended local models

-Two candidates worth testing, covering different points on the size/quality
-tradeoff:
+Two candidates worth testing, covering different points on the size/quality tradeoff:

 **Gemma 4 E4B Instruct (Q4 / UD-Q4_K_XL)**

 - GGUF file: `gemma-4-E4B-it-UD-Q4_K_XL.gguf` (~2.5 GB)
 - Source: https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF
-- Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+
-  language support including all five pipeline languages. First candidate
-  to test.
+- Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+ language support including all five pipeline languages. First candidate to test.

 **Qwen2.5 7B Instruct (Q4_K_M)**

 - GGUF file: `Qwen2.5-7B-Instruct-Q4_K_M.gguf` (~4.5 GB)
 - Source: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
-- Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s.
-  Stronger multilingual generation than any 3–4B model. Second candidate,
-  for comparison against the smaller Gemma 4 E4B.
+- Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s. Stronger multilingual generation than any 3–4B model. Second candidate, for comparison against the smaller Gemma 4 E4B.

 ### Installation