feat: enrich script working, redesigning to sub-stage architecture

- Enrich script functional with timeout, progress tracking, rejection mechanism
- Identified ordering issue: CEFR voting needs validated translations first
- Redesign: round1_gloss → round1_example → round1_translations → round1_cefr
- Update data-pipeline.md with new sub-stage design and roadmap
- Qwen3.5-4B confirmed working with thinking disabled
parent 7f10c35e03
commit 73fb12ac35
7 changed files with 337 additions and 122 deletions
@@ -1,17 +1,12 @@
 # LLM Setup — lila pipeline
 
-This document covers the LLM infrastructure for stage 3 (enrich) of the lila
-data pipeline. It documents the hardware constraints, supported providers,
-model recommendations, and how to configure and swap providers in the test
-and production scripts.
+This document covers the LLM infrastructure for stage 3 (enrich) of the lila data pipeline. It documents the hardware constraints, supported providers, model recommendations, and how to configure and swap providers in the test and production scripts.
 
 ---
 
 ## Provider model
 
-Each provider + model combination counts as one vote in the final majority.
-Running the same model twice is not supported — one model, one vote. To
-increase vote confidence, add more models rather than re-running existing ones.
+Each provider + model combination counts as one vote in the final majority. Running the same model twice is not supported — one model, one vote. To increase vote confidence, add more models rather than re-running existing ones.
 
 ---
 
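The "one model, one vote" rule from the Provider model section in the hunk above can be sketched as a small tally function. This is a minimal illustration, not the pipeline's actual code: the `Vote` shape, the `majorityVote` name, and the choice of a strict majority (more than half of all votes) are assumptions made for the example.

```typescript
// Hypothetical shape of one model's answer; field names are assumptions.
interface Vote {
  provider: string;
  model: string;
  value: string;
}

// Tallies votes and returns the value backed by more than half of them,
// or null when no value reaches a strict majority (assumed rule).
function majorityVote(votes: Vote[]): string | null {
  const seen: Record<string, boolean> = Object.create(null);
  const counts: Record<string, number> = Object.create(null);

  for (const v of votes) {
    // One provider + model combination may vote only once.
    const key = `${v.provider}/${v.model}`;
    if (seen[key]) {
      throw new Error(`duplicate vote from ${key}: one model, one vote`);
    }
    seen[key] = true;
    counts[v.value] = (counts[v.value] ?? 0) + 1;
  }

  let best: string | null = null;
  let bestCount = 0;
  for (const value of Object.keys(counts)) {
    if (counts[value] > bestCount) {
      best = value;
      bestCount = counts[value];
    }
  }
  return bestCount * 2 > votes.length ? best : null;
}
```

Rejecting a duplicate provider/model pair up front keeps a re-run of the same model from inflating the count, which is the behaviour the paragraph describes.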
@@ -24,17 +19,13 @@ increase vote confidence, add more models rather than re-running existing ones.
 | GPU | NVIDIA GeForce GTX 950M — 4 GB VRAM (Maxwell, CUDA compute 5.0) |
 | OS | Debian GNU/Linux 13 (trixie) x86_64 |
 
-**Local inference verdict:** viable for small/quantized models, not for
-production runs. See the [Local inference](#local-inference-llamacpp) section
-for details.
+**Local inference verdict:** viable for small/quantized models, not for production runs. See the [Local inference](#local-inference-llamacpp) section for details.
 
 ---
 
 ## Provider overview
 
-The enrich script uses a single, swappable provider config. All providers
-except Anthropic expose an OpenAI-compatible API, so the same client code
-works across all of them — only `baseURL`, `apiKey`, and `model` change.
+The enrich script uses a single, swappable provider config. All providers except Anthropic expose an OpenAI-compatible API, so the same client code works across all of them — only `baseURL`, `apiKey`, and `model` change.
 
 | Provider | Use case | Cost | Rate limits |
 | ---------------------- | --------------------------------------------- | ------------------ | ---------------------- |
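The swappable provider config described under Provider overview in the hunk above might look roughly like this. It is a sketch under stated assumptions: `ProviderConfig`, `ChatRequest`, and `buildChatRequest` are hypothetical names, and the URLs, keys, and model ids are placeholders rather than values from the repository. The `/chat/completions` path and the `Bearer` authorization header follow the usual OpenAI-compatible convention.

```typescript
// Hypothetical config type: the three fields the document says change per provider.
interface ProviderConfig {
  baseURL: string;
  apiKey: string;
  model: string;
}

// Plain description of an HTTP request, so the sketch needs no network access.
interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds an OpenAI-style chat completion request from any provider config.
function buildChatRequest(cfg: ProviderConfig, prompt: string): ChatRequest {
  return {
    // Trim trailing slashes so the path is joined with exactly one "/".
    url: `${cfg.baseURL.replace(/\/+$/, "")}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${cfg.apiKey}`,
    },
    body: JSON.stringify({
      model: cfg.model,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// Placeholder configs (assumed values): swapping providers means swapping this object.
const localProvider: ProviderConfig = {
  baseURL: "http://localhost:8080/v1",
  apiKey: "unused",
  model: "placeholder-local-model",
};
const hostedProvider: ProviderConfig = {
  baseURL: "https://api.example.com/v1/",
  apiKey: "YOUR_API_KEY",
  model: "placeholder-hosted-model",
};
```

Because the request builder only reads the three config fields, the calling code stays identical for every OpenAI-compatible provider; Anthropic would need its own request shape, as the paragraph notes.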
@@ -49,20 +40,13 @@ works across all of them — only `baseURL`, `apiKey`, and `model` change.
 
 ### Why local inference is worth testing
 
-Time is not a constraint — the pipeline scripts are fully resumable. The
-laptop can run overnight for multiple nights. The only question is output
-quality, which the test script evaluates empirically.
+Time is not a constraint — the pipeline scripts are fully resumable. The laptop can run overnight for multiple nights. The only question is output quality, which the test script evaluates empirically.
 
 ### Hardware constraints
 
-The GTX 950M has 4 GB VRAM and Maxwell architecture (CUDA compute 5.0).
-llama.cpp supports Maxwell via CUDA backend but newer builds may require
-the `--cuda-no-kv-offload` flag depending on the version.
+The GTX 950M has 4 GB VRAM and Maxwell architecture (CUDA compute 5.0). llama.cpp supports Maxwell via CUDA backend but newer builds may require the `--cuda-no-kv-offload` flag depending on the version.
 
-llama.cpp splits model layers between GPU and CPU automatically via
-`--n-gpu-layers`. You set how many layers go on the GPU; the rest run on
-CPU/RAM. This means a model larger than VRAM is not a dead end — it runs
-in hybrid mode, slower than full-GPU but much faster than pure CPU.
+llama.cpp splits model layers between GPU and CPU automatically via `--n-gpu-layers`. You set how many layers go on the GPU; the rest run on CPU/RAM. This means a model larger than VRAM is not a dead end — it runs in hybrid mode, slower than full-GPU but much faster than pure CPU.
 
 Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):
 
@@ -75,24 +59,19 @@ Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):
 
 ### Recommended local models
 
-Two candidates worth testing, covering different points on the size/quality
-tradeoff:
+Two candidates worth testing, covering different points on the size/quality tradeoff:
 
 **Gemma 4 E4B Instruct (Q4 / UD-Q4_K_XL)**
 
 - GGUF file: `gemma-4-E4B-it-UD-Q4_K_XL.gguf` (~2.5 GB)
 - Source: https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF
-- Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+
-language support including all five pipeline languages. First candidate
-to test.
+- Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+ language support including all five pipeline languages. First candidate to test.
 
 **Qwen2.5 7B Instruct (Q4_K_M)**
 
 - GGUF file: `Qwen2.5-7B-Instruct-Q4_K_M.gguf` (~4.5 GB)
 - Source: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
-- Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s.
-Stronger multilingual generation than any 3–4B model. Second candidate,
-for comparison against the smaller Gemma 4 E4B.
+- Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s. Stronger multilingual generation than any 3–4B model. Second candidate, for comparison against the smaller Gemma 4 E4B.
 
 ### Installation
 