updating documentation

2026-05-16 01:59:43 +02:00
commit 7e0311683f
parent 1ba57c7e9d
25 changed files with 2660 additions and 226 deletions
--- a/documentation/LLM_SETUP.md
+++ b/documentation/LLM_SETUP.md
@@ -0,0 +1,285 @@
+# LLM Setup — lila pipeline
+
+This document covers the LLM infrastructure for stage 3 (enrich) of the lila data pipeline. It documents the hardware constraints, supported providers, model recommendations, and how to configure and swap providers in the test and production scripts.
+
+---
+
+## Provider model
+
+Each provider + model combination counts as one vote in the final majority. Running the same model twice is not supported — one model, one vote. To increase vote confidence, add more models rather than re-running existing ones.
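+
+The one-model-one-vote rule can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the `Vote` shape, the `majorityVote` name, and the choice to return `null` on a tie are all assumptions.
+
+```typescript
+// Hypothetical shape: one vote per provider + model combination.
+type Vote = { model: string; value: string };
+
+// Returns the value backed by the most models, or null on a tie or empty input.
+// A model name that appears twice is rejected, mirroring "one model, one vote".
+function majorityVote(votes: Vote[]): string | null {
+  const seen = new Set<string>();
+  const counts = new Map<string, number>();
+  for (const { model, value } of votes) {
+    if (seen.has(model)) {
+      throw new Error(`duplicate vote from model: ${model}`);
+    }
+    seen.add(model);
+    counts.set(value, (counts.get(value) ?? 0) + 1);
+  }
+  let best: string | null = null;
+  let bestCount = 0;
+  let tied = false;
+  for (const [value, count] of counts) {
+    if (count > bestCount) {
+      best = value;
+      bestCount = count;
+      tied = false;
+    } else if (count === bestCount) {
+      tied = true;
+    }
+  }
+  return tied ? null : best;
+}
+```
+
+How a tie should be resolved is a design decision this sketch leaves open.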
+
+---
+
+## Hardware (dev machine)
+
+| Component | Spec                                                            |
+| --------- | --------------------------------------------------------------- |
+| CPU       | Intel Core i7-6500U (2 cores / 4 threads, up to 3.10 GHz turbo) |
+| RAM       | 8 GB                                                            |
+| GPU       | NVIDIA GeForce GTX 950M — 4 GB VRAM (Maxwell, CUDA compute 5.0) |
+| OS        | Debian GNU/Linux 13 (trixie) x86_64                             |
+
+**Local inference verdict:** viable for small/quantized models; whether it is good enough for production runs depends on measured output quality. See the [Local inference](#local-inference-llamacpp) section for details.
+
+---
+
+## Provider overview
+
+The enrich script uses a single, swappable provider config. All providers except Anthropic expose an OpenAI-compatible API, so the same client code works for each of them: only `baseURL`, `apiKey`, and `model` change. Anthropic needs a separate adapter.
+
+| Provider               | Use case                                      | Cost               | Rate limits            |
+| ---------------------- | --------------------------------------------- | ------------------ | ---------------------- |
+| llama.cpp (local)      | Quality testing, overnight dev runs           | Free (electricity) | None                   |
+| OpenRouter (free tier) | Quality comparison, multi-model evaluation    | Free               | 50 req/day, 20 req/min |
+| OpenRouter (paid)      | Production runs if local quality insufficient | Pay-per-token      | None                   |
+| Anthropic API          | Quality baseline / reference                  | Pay-per-token      | Standard               |
+
+---
+
+## Local inference (llama.cpp)
+
+### Why local inference is worth testing
+
+Time is not a constraint — the pipeline scripts are fully resumable. The laptop can run overnight for multiple nights. The only question is output quality, which the test script evaluates empirically.
+
+### Hardware constraints
+
+The GTX 950M has 4 GB VRAM and uses the Maxwell architecture (CUDA compute 5.0). llama.cpp supports Maxwell via its CUDA backend. If the KV cache does not fit in VRAM alongside the model weights, the `--no-kv-offload` flag keeps it in system RAM; whether it is needed depends on the build and the model.
+
+llama.cpp splits model layers between GPU and CPU according to `--n-gpu-layers`: you set how many layers go on the GPU, and the rest run on CPU/RAM. A model larger than VRAM is therefore not a dead end. It runs in hybrid mode, slower than full-GPU but much faster than pure CPU.
+
+Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):
+
+| Model size | Q4 VRAM | Mode                          | Est. speed   |
+| ---------- | ------- | ----------------------------- | ------------ |
+| 3B         | ~2.0 GB | Full GPU                      | ~15–20 tok/s |
+| 4B         | ~2.5 GB | Full GPU                      | ~12–18 tok/s |
+| 7B         | ~4.5 GB | Hybrid (~26/32 layers on GPU) | ~8–12 tok/s  |
+| 13B+       | ~8 GB+  | CPU-heavy hybrid              | too slow     |
+
+### Recommended local models
+
+Two candidates worth testing, covering different points on the size/quality tradeoff:
+
+**Gemma 4 E4B Instruct (Q4 / UD-Q4_K_XL)**
+
+- GGUF file: `gemma-4-E4B-it-UD-Q4_K_XL.gguf` (~2.5 GB)
+- Source: https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF
+- Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+ language support including all five pipeline languages. First candidate to test.
+
+**Qwen2.5 7B Instruct (Q4_K_M)**
+
+- GGUF file: `qwen2.5-7b-instruct-q4_k_m.gguf` (~4.5 GB)
+- Source: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
+- Runs in hybrid mode (part of its 28 layers on GPU, rest on CPU), ~8–12 tok/s. Expected to produce stronger multilingual output than the 3–4B models. Second candidate, for comparison against the smaller Gemma 4 E4B.
+
+### Installation
+
+```bash
+# Install build dependencies (the CUDA build also needs the NVIDIA CUDA toolkit, i.e. nvcc)
+sudo apt install build-essential cmake git
+
+# Clone llama.cpp
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+
+# Build with CUDA support (GTX 950M — compute 5.0)
+cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=50
+cmake --build build --config Release -j$(nproc)
+
+# Download a model (example: the Qwen2.5 3B used for the smoke test; adjust path and URL as needed)
+mkdir -p models
+wget -O models/qwen2.5-3b-instruct-q4_k_m.gguf \
+  https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf
+```
+
+### Starting the server
+
+**Gemma 4 E4B** (full GPU):
+
+```bash
+./build/bin/llama-server \
+  --model models/gemma-4-E4B-it-UD-Q4_K_XL.gguf \
+  --port 8080 \
+  --ctx-size 4096 \
+  --n-gpu-layers 999 \
+  --host 127.0.0.1
+```
+
+**Qwen2.5 7B** (hybrid — tune `--n-gpu-layers` to fit your VRAM):
+
+```bash
+./build/bin/llama-server \
+  --model models/qwen2.5-7b-instruct-q4_k_m.gguf \
+  --port 8080 \
+  --ctx-size 4096 \
+  --n-gpu-layers 28 \
+  --host 127.0.0.1
+```
+
+`--n-gpu-layers 999` means "put everything on GPU" — llama.cpp caps at the
+actual layer count automatically, so 999 is safe as a "full offload" value.
+For the 7B hybrid, start with `28` and reduce by 2 if the server reports
+out-of-memory at startup.
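+
+A starting value for `--n-gpu-layers` can also be estimated rather than found purely by trial. The sketch below assumes every layer is the same size and reserves a hypothetical 0.5 GB for the KV cache and buffers. The function name and the reserve are illustrative, and the layer count (28 here) should be read from the model metadata that the server prints while loading.
+
+```typescript
+// Rough estimate of how many layers fit on the GPU.
+// Assumes all layers are the same size, which is only approximately true.
+function estimateGpuLayers(
+  modelFileGB: number, // size of the GGUF file
+  totalLayers: number, // transformer layer count of the model
+  usableVramGB: number, // VRAM left after drivers and desktop
+  reserveGB: number, // hypothetical headroom for KV cache and buffers
+): number {
+  const perLayerGB = modelFileGB / totalLayers;
+  const budgetGB = Math.max(0, usableVramGB - reserveGB);
+  return Math.min(totalLayers, Math.floor(budgetGB / perLayerGB));
+}
+
+// 4.5 GB file, 28 layers, 3.5 GB usable, 0.5 GB reserved
+console.log(estimateGpuLayers(4.5, 28, 3.5, 0.5)); // 18
+```
+
+The result is only a first guess; stepping down after an out-of-memory error remains the final check.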
+
+### Verify the server is running
+
+```bash
+curl http://127.0.0.1:8080/health
+# Expected: {"status":"ok"}
+```
+
+---
+
+## OpenRouter (free tier)
+
+OpenRouter exposes all models via an OpenAI-compatible API. No code changes
+are needed to switch from local llama.cpp to OpenRouter — only the config
+object changes.
+
+### Rate limits (free tier)
+
+- **50 requests per day** (account total, not per model)
+- 20 requests per minute
+
+> **Implication for testing:** assuming one request per record, a 10-record
+> test set leaves headroom to test 4–5 models per day. A 100-record test set
+> needs two days per model (100 requests at 50 per day).
+
+> **Implication for production:** the free tier is not viable for 117k
+> records. If local quality is insufficient, use paid OpenRouter credits or
+> a dedicated provider.
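+
+The arithmetic behind these limits can be checked with a few lines, assuming one request per record (batching several records into one request would change the numbers). The function names are illustrative.
+
+```typescript
+const FREE_REQUESTS_PER_DAY = 50; // from the limits above
+const FREE_REQUESTS_PER_MINUTE = 20;
+
+// Days needed to push `records` through one model at one request per record.
+function daysForRun(records: number, requestsPerDay = FREE_REQUESTS_PER_DAY): number {
+  return Math.ceil(records / requestsPerDay);
+}
+
+// Minimum spacing between requests to stay under the per-minute limit.
+function minDelayMs(requestsPerMinute = FREE_REQUESTS_PER_MINUTE): number {
+  return Math.ceil(60_000 / requestsPerMinute);
+}
+
+console.log(daysForRun(10)); // 1
+console.log(daysForRun(100)); // 2
+console.log(daysForRun(117_000)); // 2340
+console.log(minDelayMs()); // 3000
+```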
+
+### Free models recommended for this pipeline
+
+Ranked by expected multilingual generation quality for en/it/de/fr/es:
+
+| Model ID                                 | Params                | Notes                                                                                |
+| ---------------------------------------- | --------------------- | ------------------------------------------------------------------------------------ |
+| `qwen/qwen3-coder:free`                  | 480B MoE (35B active) | Best free option. Strong multilingual despite "coder" label. Use as quality ceiling. |
+| `qwen/qwen3-next-80b-a3b-instruct:free`  | 80B MoE (3B active)   | Smaller Qwen, useful comparison point.                                               |
+| `nvidia/nemotron-3-super-120b-a12b:free` | 120B MoE (12B active) | 262K context, supports structured output.                                            |
+| `google/gemma-4-31b-it:free`             | 31B                   | 140+ language support, good European language coverage.                              |
+| `z-ai/glm-4.5-air:free`                  | 106B MoE (12B active) | Multilingual-focused.                                                                |
+
+**Skip for this pipeline:**
+
+- Llama models — weaker European language generation than Qwen/Gemma
+- Mistral free tier — requests may be used for model training
+
+### API endpoint
+
+```
+https://openrouter.ai/api/v1/chat/completions
+```
+
+Set `Authorization: Bearer <OPENROUTER_API_KEY>` in the request headers.
+
+---
+
+## Provider configuration in the enrich script
+
+The enrich script reads a single config object. To switch providers,
+change this object and re-run. The `name` field is used as the model
+identifier in `pipeline.db`, so each provider + model combination needs
+its own unique `name`.
+
+```typescript
+// config.ts
+
+export type ProviderConfig = {
+  name: string; // used as model identifier in pipeline.db — must be unique
+  baseURL: string;
+  apiKey: string;
+  model: string;
+  maxTokens: number;
+};
+
+// Local llama.cpp
+export const LOCAL_QWEN3B: ProviderConfig = {
+  name: "local-qwen2.5-3b",
+  baseURL: "http://127.0.0.1:8080/v1",
+  apiKey: "none", // llama.cpp ignores this
+  model: "qwen2.5-3b", // llama.cpp ignores model name, uses loaded model
+  maxTokens: 512,
+};
+
+// OpenRouter — Qwen3 480B (free)
+export const OR_QWEN3_480B: ProviderConfig = {
+  name: "or-qwen3-480b",
+  baseURL: "https://openrouter.ai/api/v1",
+  apiKey: process.env.OPENROUTER_API_KEY!,
+  model: "qwen/qwen3-coder:free",
+  maxTokens: 512,
+};
+
+// OpenRouter — Gemma 4 31B (free)
+export const OR_GEMMA4_31B: ProviderConfig = {
+  name: "or-gemma4-31b",
+  baseURL: "https://openrouter.ai/api/v1",
+  apiKey: process.env.OPENROUTER_API_KEY!,
+  model: "google/gemma-4-31b-it:free",
+  maxTokens: 512,
+};
+
+// Anthropic (reference baseline — different adapter required)
+export const ANTHROPIC_SONNET: ProviderConfig = {
+  name: "anthropic-sonnet",
+  baseURL: "https://api.anthropic.com/v1", // adapter handles format difference
+  apiKey: process.env.ANTHROPIC_API_KEY!,
+  model: "claude-sonnet-4-6",
+  maxTokens: 512,
+};
+```
+
+All output is written to `pipeline.db`. Each record is stored with the
+model name as identifier so results from different providers can be
+compared and compiled into votes.
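+
+Because the local and OpenRouter configs share the OpenAI-compatible format, one request builder can serve both. The sketch below is an assumption about how the enrich script might do this, not its actual code. `buildChatRequest` is a hypothetical name, and the type is repeated only to keep the example self-contained.
+
+```typescript
+// Same shape as ProviderConfig in config.ts, repeated so the sketch stands alone.
+type ProviderConfig = {
+  name: string;
+  baseURL: string;
+  apiKey: string;
+  model: string;
+  maxTokens: number;
+};
+
+// Builds the URL and fetch options for an OpenAI-compatible chat completion call.
+// Not suitable for the Anthropic config, which needs its own adapter.
+function buildChatRequest(config: ProviderConfig, prompt: string) {
+  return {
+    url: `${config.baseURL}/chat/completions`,
+    init: {
+      method: "POST",
+      headers: {
+        "Content-Type": "application/json",
+        Authorization: `Bearer ${config.apiKey}`,
+      },
+      body: JSON.stringify({
+        model: config.model,
+        max_tokens: config.maxTokens,
+        messages: [{ role: "user", content: prompt }],
+      }),
+    },
+  };
+}
+
+// Usage (requires a running provider):
+// const { url, init } = buildChatRequest(LOCAL_QWEN3B, "...");
+// const response = await fetch(url, init);
+```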
+
+---
+
+## Evaluation metrics
+
+The test script measures the following per provider run:
+
+| Metric                   | What it measures                                                                                                                                 |
+| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **JSON parse rate**      | % of responses that are valid, schema-compliant JSON. Critical — a failed parse is a wasted call. Target: >97%                                   |
+| **Field coverage**       | % of records where all required fields are present (cefr votes for all translations, descriptions for all languages, glosses/examples for fr/es) |
+| **CEFR agreement**       | For records that have a `cefr_source` vote, % where the model agrees. Measures calibration.                                                      |
+| **Language correctness** | Manual spot-check only — automated detection not reliable enough                                                                                 |
+| **Tokens/second**        | Local only. Indicates overnight run feasibility                                                                                                  |
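+
+The first two metrics could be computed roughly as follows. This is a sketch under stated assumptions: `REQUIRED_FIELDS` is a hypothetical, simplified field list, and the parse check only confirms that a response is a JSON object, not that it matches the full schema.
+
+```typescript
+// Hypothetical required fields; the real schema is defined by the enrich prompt.
+const REQUIRED_FIELDS = ["cefr", "descriptions"];
+
+// Parses one raw response; returns null when it is not a JSON object.
+function parseRecord(raw: string): Record<string, unknown> | null {
+  try {
+    const value: unknown = JSON.parse(raw);
+    if (typeof value === "object" && value !== null && !Array.isArray(value)) {
+      return value as Record<string, unknown>;
+    }
+    return null;
+  } catch {
+    return null;
+  }
+}
+
+// Percentages over all responses in a run.
+function scoreRun(rawResponses: string[]) {
+  const parsed = rawResponses.map(parseRecord);
+  const ok = parsed.filter((r): r is Record<string, unknown> => r !== null);
+  const complete = ok.filter((r) => REQUIRED_FIELDS.every((f) => f in r));
+  const total = rawResponses.length || 1; // avoid division by zero
+  return {
+    jsonParseRate: (100 * ok.length) / total,
+    fieldCoverage: (100 * complete.length) / total,
+  };
+}
+```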
+
+### Decision thresholds
+
+| Metric          | Threshold | Action if below                                |
+| --------------- | --------- | ---------------------------------------------- |
+| JSON parse rate | < 97%     | Do not use this model for production           |
+| Field coverage  | < 95%     | Prompt needs revision before production        |
+| CEFR agreement  | < 70%     | Model lacks vocabulary knowledge for this task |
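+
+Applied in code, the thresholds might look like this. The type and function names are illustrative; the numbers and actions are the ones from the table.
+
+```typescript
+type RunMetrics = {
+  jsonParseRate: number; // percent
+  fieldCoverage: number; // percent
+  cefrAgreement: number; // percent
+};
+
+// Thresholds and actions copied from the table above.
+const THRESHOLDS: { metric: keyof RunMetrics; min: number; action: string }[] = [
+  { metric: "jsonParseRate", min: 97, action: "Do not use this model for production" },
+  { metric: "fieldCoverage", min: 95, action: "Prompt needs revision before production" },
+  { metric: "cefrAgreement", min: 70, action: "Model lacks vocabulary knowledge for this task" },
+];
+
+// Returns the actions triggered by a run; an empty list means all thresholds passed.
+function failedChecks(metrics: RunMetrics): string[] {
+  return THRESHOLDS.filter((t) => metrics[t.metric] < t.min).map((t) => t.action);
+}
+
+console.log(failedChecks({ jsonParseRate: 98.5, fieldCoverage: 93, cefrAgreement: 74 }));
+// [ 'Prompt needs revision before production' ]
+```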
+
+---
+
+## Recommended test sequence
+
+1. **Start local, minimal dataset (5–10 records)**
+   Install llama.cpp, run Qwen2.5 3B against 5–10 hand-picked records.
+   Verify the server works, the output parses, and the model produces
+   something reasonable. This is purely a smoke test.
+
+2. **Expand local to full 100-record sample**
+   Once the pipeline is confirmed working, run all 100 records locally.
+   Collect metrics. This is your local quality baseline.
+
+3. **Run the same 100 records through OpenRouter free models**
+   At one request per record, each model takes two days (50 req/day
+   limit). Start with `qwen/qwen3-coder:free` as the quality ceiling.
+
+4. **Compare metrics side by side**
+   If the local model clears the decision thresholds and stays close to
+   the cloud models on CEFR agreement and field coverage, proceed with
+   local overnight runs for production. If not, use the cloud model that
+   passed.
+
+5. **Production run**
+   Full 117k records. Resume-safe — each record is written to `pipeline.db`
+   atomically as it is processed. Overnight runs can be stopped and
+   continued at any time without losing work.
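+
+The resume behaviour in step 5 amounts to skipping records that already have a stored result. The sketch below uses an in-memory `Map` as a stand-in for `pipeline.db`, and `enrichOne` is a hypothetical function wrapping the LLM call; neither is the pipeline's actual code.
+
+```typescript
+// Stand-in for pipeline.db: a map keyed by "model name + record id".
+// The real script would use a database; this only illustrates the resume logic.
+type Store = Map<string, string>;
+
+async function enrichAll(
+  recordIds: string[],
+  modelName: string,
+  store: Store,
+  enrichOne: (id: string) => Promise<string>, // hypothetical LLM call
+): Promise<number> {
+  let processed = 0;
+  for (const id of recordIds) {
+    const key = `${modelName}:${id}`;
+    if (store.has(key)) continue; // already done in an earlier run
+    const result = await enrichOne(id);
+    store.set(key, result); // one write per record, so a stop loses at most one call
+    processed++;
+  }
+  return processed;
+}
+```
+
+Keying on both model name and record id lets results from different providers coexist for the same record.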