formatting

2026-04-28 13:18:18 +02:00
commit 4f59f3bc14
parent 2ff7d1759e
23 changed files with 994 additions and 3338 deletions
--- a/documentation/llm-setup.md
+++ b/documentation/llm-setup.md
@@ -9,12 +9,12 @@ and production scripts.

 ## Hardware (dev machine)

-| Component | Spec |
-|---|---|
-| CPU | Intel Core i7-6500U (2 cores / 4 threads @ 3.10 GHz) |
-| RAM | 8 GB |
-| GPU | NVIDIA GeForce GTX 950M — 4 GB VRAM (Maxwell, CUDA compute 5.0) |
-| OS | Debian GNU/Linux 13 (trixie) x86_64 |
+| Component | Spec                                                            |
+| --------- | --------------------------------------------------------------- |
+| CPU       | Intel Core i7-6500U (2 cores / 4 threads @ 3.10 GHz)            |
+| RAM       | 8 GB                                                            |
+| GPU       | NVIDIA GeForce GTX 950M — 4 GB VRAM (Maxwell, CUDA compute 5.0) |
+| OS        | Debian GNU/Linux 13 (trixie) x86_64                             |

 **Local inference verdict:** viable for small/quantized models, not for
 production runs. See the [Local inference](#local-inference-llamacpp) section
@@ -28,12 +28,12 @@ The enrich script uses a single, swappable provider config. All providers
 except Anthropic expose an OpenAI-compatible API, so the same client code
 works across all of them — only `baseURL`, `apiKey`, and `model` change.

-| Provider | Use case | Cost | Rate limits |
-|---|---|---|---|
-| llama.cpp (local) | Quality testing, overnight dev runs | Free (electricity) | None |
-| OpenRouter (free tier) | Quality comparison, multi-model evaluation | Free | 50 req/day, 20 req/min |
-| OpenRouter (paid) | Production runs if local quality insufficient | Pay-per-token | None |
-| Anthropic API | Quality baseline / reference | Pay-per-token | Standard |
+| Provider               | Use case                                      | Cost               | Rate limits            |
+| ---------------------- | --------------------------------------------- | ------------------ | ---------------------- |
+| llama.cpp (local)      | Quality testing, overnight dev runs           | Free (electricity) | None                   |
+| OpenRouter (free tier) | Quality comparison, multi-model evaluation    | Free               | 50 req/day, 20 req/min |
+| OpenRouter (paid)      | Production runs if local quality insufficient | Pay-per-token      | None                   |
+| Anthropic API          | Quality baseline / reference                  | Pay-per-token      | Standard               |

 ---

@@ -58,12 +58,12 @@ in hybrid mode, slower than full-GPU but much faster than pure CPU.

 Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):

-| Model size | Q4 VRAM | Mode | Est. speed |
-|---|---|---|---|
-| 3B | ~2.0 GB | Full GPU | ~15–20 tok/s |
-| 4B | ~2.5 GB | Full GPU | ~12–18 tok/s |
-| 7B | ~4.5 GB | Hybrid (~26/32 layers on GPU) | ~8–12 tok/s |
-| 13B+ | ~8 GB+ | CPU-heavy hybrid | too slow |
+| Model size | Q4 VRAM | Mode                          | Est. speed   |
+| ---------- | ------- | ----------------------------- | ------------ |
+| 3B         | ~2.0 GB | Full GPU                      | ~15–20 tok/s |
+| 4B         | ~2.5 GB | Full GPU                      | ~12–18 tok/s |
+| 7B         | ~4.5 GB | Hybrid (~26/32 layers on GPU) | ~8–12 tok/s  |
+| 13B+       | ~8 GB+  | CPU-heavy hybrid              | too slow     |

 ### Recommended local models

@@ -71,6 +71,7 @@ Two candidates worth testing, covering different points on the size/quality
 tradeoff:

 **Gemma 4 E4B Instruct (Q4 / UD-Q4_K_XL)**
+
 - GGUF file: `gemma-4-E4B-it-UD-Q4_K_XL.gguf` (~2.5 GB)
 - Source: https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF
 - Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+
@@ -78,6 +79,7 @@ tradeoff:
  to test.

 **Qwen2.5 7B Instruct (Q4_K_M)**
+
 - GGUF file: `Qwen2.5-7B-Instruct-Q4_K_M.gguf` (~4.5 GB)
 - Source: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
 - Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s.
@@ -107,6 +109,7 @@ wget -O models/qwen2.5-3b-instruct-q4_k_m.gguf \
 ### Starting the server

 **Gemma 4 E4B** (full GPU):
+
 ```bash
 ./build/bin/llama-server \
  --model models/gemma-4-e4b-it-ud-q4_k_xl.gguf \
@@ -117,6 +120,7 @@ wget -O models/qwen2.5-3b-instruct-q4_k_m.gguf \
 ```

 **Qwen2.5 7B** (hybrid — tune `--n-gpu-layers` to fit your VRAM):
+
 ```bash
 ./build/bin/llama-server \
  --model models/qwen2.5-7b-instruct-q4_k_m.gguf \
@@ -163,15 +167,16 @@ object changes.

 Ranked by expected multilingual generation quality for en/it/de/fr/es:

-| Model ID | Params | Notes |
-|---|---|---|
-| `qwen/qwen3-coder:free` | 480B MoE (35B active) | Best free option. Strong multilingual despite "coder" label. Use as quality ceiling. |
-| `qwen/qwen3-next-80b-a3b-instruct:free` | 80B MoE (3B active) | Smaller Qwen, useful comparison point. |
-| `nvidia/nemotron-3-super-120b-a12b:free` | 120B MoE (12B active) | 262K context, supports structured output. |
-| `google/gemma-4-31b-it:free` | 31B | 140+ language support, good European language coverage. |
-| `zhipuai/glm-4.5-air:free` | MoE | Multilingual-focused. |
+| Model ID                                 | Params                | Notes                                                                                |
+| ---------------------------------------- | --------------------- | ------------------------------------------------------------------------------------ |
+| `qwen/qwen3-coder:free`                  | 480B MoE (35B active) | Best free option. Strong multilingual despite "coder" label. Use as quality ceiling. |
+| `qwen/qwen3-next-80b-a3b-instruct:free`  | 80B MoE (3B active)   | Smaller Qwen, useful comparison point.                                               |
+| `nvidia/nemotron-3-super-120b-a12b:free` | 120B MoE (12B active) | 262K context, supports structured output.                                            |
+| `google/gemma-4-31b-it:free`             | 31B                   | 140+ language support, good European language coverage.                              |
+| `zhipuai/glm-4.5-air:free`               | MoE                   | Multilingual-focused.                                                                |

 **Skip for this pipeline:**
+
 - Llama models — weaker European language generation than Qwen/Gemma
 - Mistral free tier — requests may be used for model training

@@ -194,7 +199,7 @@ change this object and re-run.
 // config.ts

 export type ProviderConfig = {
-  name: string;           // used for output folder naming
+  name: string; // used for output folder naming
  baseURL: string;
  apiKey: string;
  model: string;
@@ -205,8 +210,8 @@ export type ProviderConfig = {
 export const LOCAL_QWEN3B: ProviderConfig = {
  name: "local-qwen2.5-3b",
  baseURL: "http://127.0.0.1:8080/v1",
-  apiKey: "none",          // llama.cpp ignores this
-  model: "qwen2.5-3b",     // llama.cpp ignores model name, uses loaded model
+  apiKey: "none", // llama.cpp ignores this
+  model: "qwen2.5-3b", // llama.cpp ignores model name, uses loaded model
  maxTokens: 512,
 };

@@ -231,7 +236,7 @@ export const OR_GEMMA4_31B: ProviderConfig = {
 // Anthropic (reference baseline — different adapter required)
 export const ANTHROPIC_SONNET: ProviderConfig = {
  name: "anthropic-sonnet",
-  baseURL: "https://api.anthropic.com/v1",  // adapter handles format difference
+  baseURL: "https://api.anthropic.com/v1", // adapter handles format difference
  apiKey: process.env.ANTHROPIC_API_KEY!,
  model: "claude-sonnet-4-6",
  maxTokens: 512,
@@ -239,6 +244,7 @@ export const ANTHROPIC_SONNET: ProviderConfig = {
 ```

 Output from each run lands in:
+
 ```
 stage-3-enrich/test/output/{provider.name}/results.json
 stage-3-enrich/test/output/{provider.name}/metrics.json
@@ -252,21 +258,21 @@ The evaluate script compares all `metrics.json` files side by side.

 The test script measures the following per provider run:

-| Metric | What it measures |
-|---|---|
-| **JSON parse rate** | % of responses that are valid, schema-compliant JSON. Critical — a failed parse is a wasted call. Target: >97% |
-| **Field coverage** | % of records where all required fields are present (cefr votes for all translations, descriptions for all languages, glosses/examples for fr/es) |
-| **CEFR agreement** | For records that have a `cefr_source` vote, % where the model agrees. Measures calibration. |
-| **Language correctness** | Manual spot-check only — automated detection not reliable enough |
-| **Tokens/second** | Local only. Indicates overnight run feasibility |
+| Metric                   | What it measures                                                                                                                                 |
+| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **JSON parse rate**      | % of responses that are valid, schema-compliant JSON. Critical — a failed parse is a wasted call. Target: >97%                                   |
+| **Field coverage**       | % of records where all required fields are present (cefr votes for all translations, descriptions for all languages, glosses/examples for fr/es) |
+| **CEFR agreement**       | For records that have a `cefr_source` vote, % where the model agrees. Measures calibration.                                                      |
+| **Language correctness** | Manual spot-check only — automated detection not reliable enough                                                                                 |
+| **Tokens/second**        | Local only. Indicates overnight run feasibility                                                                                                  |

 ### Decision thresholds

-| Metric | Threshold | Action if below |
-|---|---|---|
-| JSON parse rate | < 97% | Do not use this model for production |
-| Field coverage | < 95% | Prompt needs revision before production |
-| CEFR agreement | < 70% | Model lacks vocabulary knowledge for this task |
+| Metric          | Threshold | Action if below                                |
+| --------------- | --------- | ---------------------------------------------- |
+| JSON parse rate | < 97%     | Do not use this model for production           |
+| Field coverage  | < 95%     | Prompt needs revision before production        |
+| CEFR agreement  | < 70%     | Model lacks vocabulary knowledge for this task |

 ---
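
Note on the "Decision thresholds" table touched in the last hunk: the three rows could be applied mechanically to each provider's `metrics.json`. The following is only a minimal sketch, not code from this repository. The `Metrics` field names, the 0..1 fraction scale, and the `checkThresholds` helper are all assumptions made for illustration; the actual `metrics.json` schema is not shown in this diff.

```typescript
// Hypothetical shape of one provider's metrics.json.
// Field names and the 0..1 scale are assumptions, not the real schema.
type Metrics = {
  jsonParseRate: number;
  fieldCoverage: number;
  cefrAgreement: number;
};

type Rule = { metric: keyof Metrics; threshold: number; action: string };
type Verdict = Rule & { value: number };

// Thresholds and actions copied from the "Decision thresholds" table.
const RULES: Rule[] = [
  { metric: "jsonParseRate", threshold: 0.97, action: "Do not use this model for production" },
  { metric: "fieldCoverage", threshold: 0.95, action: "Prompt needs revision before production" },
  { metric: "cefrAgreement", threshold: 0.7, action: "Model lacks vocabulary knowledge for this task" },
];

// Returns one verdict per metric that falls strictly below its threshold.
function checkThresholds(m: Metrics): Verdict[] {
  return RULES
    .filter((rule) => m[rule.metric] < rule.threshold)
    .map((rule) => ({ ...rule, value: m[rule.metric] }));
}

// Example: only field coverage is below its threshold here.
const failures = checkThresholds({
  jsonParseRate: 0.99,
  fieldCoverage: 0.9,
  cefrAgreement: 0.7,
});
for (const f of failures) {
  console.log(`${f.metric}: ${f.value} < ${f.threshold} -> ${f.action}`);
}
```

The sketch follows the thresholds table literally, so a value exactly at the threshold passes. The metrics table states the JSON parse target as ">97%", which would make exactly 97% a failure; whichever reading is intended, the comparison operator is the only thing that would change.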