feat: enrich script working, redesigning to sub-stage architecture

- Enrich script functional with timeout, progress tracking, rejection mechanism
- Identified ordering issue: CEFR voting needs validated translations first
- Redesign: round1_gloss → round1_example → round1_translations → round1_cefr
- Update data-pipeline.md with new sub-stage design and roadmap
- Qwen3.5-4B confirmed working with thinking disabled
lila 2026-05-07 13:09:43 +02:00
parent 7f10c35e03
commit 73fb12ac35
7 changed files with 337 additions and 122 deletions


@ -1,17 +1,12 @@
# LLM Setup — lila pipeline
This document covers the LLM infrastructure for stage 3 (enrich) of the lila
data pipeline. It documents the hardware constraints, supported providers,
model recommendations, and how to configure and swap providers in the test
and production scripts.
This document covers the LLM infrastructure for stage 3 (enrich) of the lila data pipeline. It documents the hardware constraints, supported providers, model recommendations, and how to configure and swap providers in the test and production scripts.
---
## Provider model
Each provider + model combination counts as one vote in the final majority.
Running the same model twice is not supported — one model, one vote. To
increase vote confidence, add more models rather than re-running existing ones.
Each provider + model combination counts as one vote in the final majority. Running the same model twice is not supported — one model, one vote. To increase vote confidence, add more models rather than re-running existing ones.
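
A minimal sketch of the one-model-one-vote rule, not the pipeline's actual code. The `Vote` shape, the `majorityVote` name, and the strict-majority threshold are all assumptions made for illustration; the document only states that each provider + model pair counts once.

```typescript
// Hypothetical shape of one model's answer; field names are assumptions.
interface Vote {
  provider: string;
  model: string;
  value: string;
}

// Returns the value backed by a strict majority of distinct voters,
// or null when no value reaches that threshold (assumed tie policy).
function majorityVote(votes: Vote[]): string | null {
  const seen = new Set<string>();
  const counts = new Map<string, number>();

  for (const v of votes) {
    const voter = `${v.provider}/${v.model}`;
    if (seen.has(voter)) continue; // same provider + model again: ignored
    seen.add(voter);
    counts.set(v.value, (counts.get(v.value) ?? 0) + 1);
  }

  for (const [value, n] of counts) {
    if (n * 2 > seen.size) return value;
  }
  return null;
}
```

Under this sketch, re-running a model cannot shift the result, because the duplicate voter is dropped before counting; only adding a different model changes the tally.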
---
@ -24,17 +19,13 @@ increase vote confidence, add more models rather than re-running existing ones.
| GPU | NVIDIA GeForce GTX 950M — 4 GB VRAM (Maxwell, CUDA compute 5.0) |
| OS | Debian GNU/Linux 13 (trixie) x86_64 |
**Local inference verdict:** viable for small/quantized models, not for
production runs. See the [Local inference](#local-inference-llamacpp) section
for details.
**Local inference verdict:** viable for small/quantized models, not for production runs. See the [Local inference](#local-inference-llamacpp) section for details.
---
## Provider overview
The enrich script uses a single, swappable provider config. All providers
except Anthropic expose an OpenAI-compatible API, so the same client code
works across all of them — only `baseURL`, `apiKey`, and `model` change.
The enrich script uses a single, swappable provider config. All providers except Anthropic expose an OpenAI-compatible API, so the same client code works across all of them — only `baseURL`, `apiKey`, and `model` change.
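
The swappable config could look roughly like the sketch below. The `ProviderConfig` type, the two example entries, their URLs and model names, and the `buildChatRequest` helper are hypothetical; only the three field names come from this document. The `/chat/completions` path and the `Authorization: Bearer` header follow the OpenAI-compatible convention.

```typescript
// The three fields named in this document; everything else is illustrative.
interface ProviderConfig {
  baseURL: string;
  apiKey: string;
  model: string;
}

// Hypothetical entries: placeholder URLs, keys, and model names.
const providers: Record<string, ProviderConfig> = {
  local: { baseURL: "http://localhost:8080/v1", apiKey: "unused", model: "local-model" },
  hosted: { baseURL: "https://api.example.com/v1/", apiKey: "YOUR_KEY", model: "example-model" },
};

// Builds an OpenAI-style chat completion request without sending it.
function buildChatRequest(cfg: ProviderConfig, prompt: string) {
  const base = cfg.baseURL.replace(/\/+$/, ""); // tolerate trailing slashes
  return {
    url: `${base}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${cfg.apiKey}`,
    },
    body: JSON.stringify({
      model: cfg.model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}
```

Swapping providers then means passing a different entry, for example `buildChatRequest(providers.local, "...")`, and handing the result to `fetch`; the request-building code itself stays the same.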
| Provider | Use case | Cost | Rate limits |
| ---------------------- | --------------------------------------------- | ------------------ | ---------------------- |
@ -49,20 +40,13 @@ works across all of them — only `baseURL`, `apiKey`, and `model` change.
### Why local inference is worth testing
Time is not a constraint — the pipeline scripts are fully resumable. The
laptop can run overnight for multiple nights. The only question is output
quality, which the test script evaluates empirically.
Time is not a constraint — the pipeline scripts are fully resumable. The laptop can run overnight for multiple nights. The only question is output quality, which the test script evaluates empirically.
### Hardware constraints
The GTX 950M has 4 GB VRAM and Maxwell architecture (CUDA compute 5.0).
llama.cpp supports Maxwell via CUDA backend but newer builds may require
the `--cuda-no-kv-offload` flag depending on the version.
The GTX 950M has 4 GB VRAM and Maxwell architecture (CUDA compute 5.0). llama.cpp supports Maxwell via CUDA backend but newer builds may require the `--cuda-no-kv-offload` flag depending on the version.
llama.cpp splits model layers between GPU and CPU automatically via
`--n-gpu-layers`. You set how many layers go on the GPU; the rest run on
CPU/RAM. This means a model larger than VRAM is not a dead end — it runs
in hybrid mode, slower than full-GPU but much faster than pure CPU.
llama.cpp splits model layers between GPU and CPU according to `--n-gpu-layers`: you set how many layers go on the GPU, and the rest run on CPU/RAM. This means a model larger than VRAM is not a dead end: it runs in hybrid mode, slower than full-GPU but much faster than pure CPU.
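
A rough starting value for `--n-gpu-layers` can be estimated from file size and layer count. The sketch below is a back-of-the-envelope helper, not part of llama.cpp or the pipeline: `estimateGpuLayers` and its parameters are hypothetical, it assumes all layers are the same size, and it ignores the KV cache and compute buffers unless you pass a reserve. Treat the result as a first guess to tune empirically.

```typescript
// Ballpark count of layers that fit in VRAM, assuming uniform layer size.
// reserveGB is an assumed allowance for KV cache and scratch buffers.
function estimateGpuLayers(
  modelSizeGB: number,
  totalLayers: number,
  usableVramGB: number,
  reserveGB = 0,
): number {
  const perLayerGB = modelSizeGB / totalLayers;
  const budgetGB = Math.max(0, usableVramGB - reserveGB);
  const fit = Math.floor(budgetGB / perLayerGB);
  // Never report more layers than the model has.
  return Math.min(totalLayers, fit);
}

// A ~4.5 GB, 32-layer model against ~3.5 GB of usable VRAM.
const hybridGuess = estimateGpuLayers(4.5, 32, 3.5);

// A ~2.5 GB model fits entirely; 30 layers is a made-up count for illustration.
const fullGpuGuess = estimateGpuLayers(2.5, 30, 3.5);
```

Because real layers are not uniform and quantized tensors differ in size, the value that actually works on a given build can land a few layers above or below this estimate.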
Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):
@ -75,24 +59,19 @@ Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):
### Recommended local models
Two candidates worth testing, covering different points on the size/quality
tradeoff:
Two candidates worth testing, covering different points on the size/quality tradeoff:
**Gemma 4 E4B Instruct (Q4 / UD-Q4_K_XL)**
- GGUF file: `gemma-4-E4B-it-UD-Q4_K_XL.gguf` (~2.5 GB)
- Source: https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF
- Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+
language support including all five pipeline languages. First candidate
to test.
- Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+ language support including all five pipeline languages. First candidate to test.
**Qwen2.5 7B Instruct (Q4_K_M)**
- GGUF file: `Qwen2.5-7B-Instruct-Q4_K_M.gguf` (~4.5 GB)
- Source: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
- Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s.
Stronger multilingual generation than any 3–4B model. Second candidate,
for comparison against the smaller Gemma 4 E4B.
- Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s. Stronger multilingual generation than any 3–4B model. Second candidate, for comparison against the smaller Gemma 4 E4B.
### Installation