lila/documentation/LLM_SETUP.md
2026-05-16 01:59:43 +02:00


LLM Setup — lila pipeline

This document covers the LLM infrastructure for stage 3 (enrich) of the lila data pipeline. It documents the hardware constraints, supported providers, model recommendations, and how to configure and swap providers in the test and production scripts.


Provider model

Each provider + model combination counts as one vote in the final majority. Running the same model twice is not supported — one model, one vote. To increase vote confidence, add more models rather than re-running existing ones.
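The one-model-one-vote rule can be sketched as below. This is a minimal illustration, not the pipeline's actual vote compiler: the `Vote` shape and the `majorityVote` name are hypothetical, and treating a tie as "no result" is an assumption the source does not state.

```typescript
// Hypothetical shape: one vote per provider config name.
type Vote = { model: string; value: string };

// Returns the value with the most votes, or null on a tie or empty input.
// A model that appears more than once is counted only once (first vote wins).
function majorityVote(votes: Vote[]): string | null {
  const seenModels = new Set<string>();
  const counts = new Map<string, number>();
  for (const vote of votes) {
    if (seenModels.has(vote.model)) continue; // one model, one vote
    seenModels.add(vote.model);
    counts.set(vote.value, (counts.get(vote.value) ?? 0) + 1);
  }

  let best: string | null = null;
  let bestCount = 0;
  let tied = false;
  for (const [value, count] of Array.from(counts)) {
    if (count > bestCount) {
      best = value;
      bestCount = count;
      tied = false;
    } else if (count === bestCount) {
      tied = true;
    }
  }
  return tied ? null : best;
}
```

With an even number of models a tie is possible, which is one more reason to add a further model rather than re-run an existing one.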


Hardware (dev machine)

| Component | Spec |
|---|---|
| CPU | Intel Core i7-6500U (2 cores / 4 threads, up to 3.10 GHz turbo) |
| RAM | 8 GB |
| GPU | NVIDIA GeForce GTX 950M, 4 GB VRAM (Maxwell, CUDA compute 5.0) |
| OS | Debian GNU/Linux 13 (trixie) x86_64 |

Local inference verdict: viable for small, quantized models. Whether it is good enough for production runs depends on output quality, which the test script measures. See the Local inference section for details.


Provider overview

The enrich script uses a single, swappable provider config. All providers except Anthropic expose an OpenAI-compatible API, so the same client code works across them: only baseURL, apiKey, and model change. Anthropic needs its own adapter.

| Provider | Use case | Cost | Rate limits |
|---|---|---|---|
| llama.cpp (local) | Quality testing, overnight dev runs | Free (electricity) | None |
| OpenRouter (free tier) | Quality comparison, multi-model evaluation | Free | 50 req/day, 20 req/min |
| OpenRouter (paid) | Production runs if local quality insufficient | Pay-per-token | None |
| Anthropic API | Quality baseline / reference | Pay-per-token | Standard |

Local inference (llama.cpp)

Why local inference is worth testing

Time is not a constraint — the pipeline scripts are fully resumable. The laptop can run overnight for multiple nights. The only question is output quality, which the test script evaluates empirically.

Hardware constraints

The GTX 950M has 4 GB VRAM and Maxwell architecture (CUDA compute 5.0). llama.cpp supports Maxwell through its CUDA backend. Depending on the build and on how tight VRAM is, the --no-kv-offload flag may be needed; it keeps the KV cache in system RAM instead of VRAM.

llama.cpp splits model layers between GPU and CPU according to --n-gpu-layers. You set how many layers go on the GPU; the rest run on CPU/RAM. This means a model larger than VRAM is not a dead end: it runs in hybrid mode, slower than full-GPU but much faster than pure CPU.

Practical estimates for this hardware (~3.5 GB VRAM usable after drivers):

| Model size | VRAM needed (Q4) | Mode | Est. speed |
|---|---|---|---|
| 3B | ~2.0 GB | Full GPU | ~15–20 tok/s |
| 4B | ~2.5 GB | Full GPU | ~12–18 tok/s |
| 7B | ~4.5 GB | Hybrid (~26/32 layers on GPU) | ~8–12 tok/s |
| 13B+ | ~8 GB+ | CPU-heavy hybrid | too slow |
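To turn a tokens-per-second estimate into a run length, a back-of-the-envelope calculation helps. The sketch below is an upper bound under stated assumptions: every record is assumed to generate the full token budget, prompt processing time is ignored, and the `estimateRunNights` name and the 8-hour night are illustrative choices rather than anything defined by the pipeline.

```typescript
// Upper-bound estimate of how many overnight sessions a run needs.
// Assumes each record generates `tokensPerRecord` tokens and ignores
// prompt processing, model load time, and retries.
function estimateRunNights(
  records: number,
  tokensPerRecord: number,
  tokensPerSecond: number,
  hoursPerNight: number,
): number {
  const totalSeconds = (records * tokensPerRecord) / tokensPerSecond;
  return Math.ceil(totalSeconds / (hoursPerNight * 3600));
}

// 100-record sample, 512 tokens per record, 15 tok/s, 8-hour nights.
const sampleNights = estimateRunNights(100, 512, 15, 8);

// Full 117k-record run under the same worst-case assumptions.
const fullRunNights = estimateRunNights(117_000, 512, 15, 8);
```

Under these worst-case inputs the 100-record sample fits in a single night, while the full run comes out at 139 nights. Real responses are likely shorter than the token budget, so measuring average output length during the sample run gives a far better figure to plug in.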

Two candidates worth testing, covering different points on the size/quality tradeoff:

Gemma 4 E4B Instruct (Q4 / UD-Q4_K_XL)

  • GGUF file: gemma-4-E4B-it-UD-Q4_K_XL.gguf (~2.5 GB)
  • Source: https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF
  • Runs fully on GPU. Brand new (April 2025), built for edge hardware, 140+ language support including all five pipeline languages. First candidate to test.

Qwen2.5 7B Instruct (Q4_K_M)

  • GGUF file: Qwen2.5-7B-Instruct-Q4_K_M.gguf (~4.5 GB)
  • Source: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF
  • Runs in hybrid mode (~26 of 32 layers on GPU, rest on CPU), ~8–12 tok/s. Stronger multilingual generation than any 3–4B model. Second candidate, for comparison against the smaller Gemma 4 E4B.

Installation

```shell
# Install build dependencies
sudo apt install build-essential cmake git

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support (GTX 950M, compute 5.0)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=50
cmake --build build --config Release -j$(nproc)

# Download model (example; adjust path as needed)
mkdir -p models
wget -O models/qwen2.5-3b-instruct-q4_k_m.gguf \
  https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf
```

Starting the server

Gemma 4 E4B (full GPU):

```shell
./build/bin/llama-server \
  --model models/gemma-4-e4b-it-ud-q4_k_xl.gguf \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 999 \
  --host 127.0.0.1
```

Qwen2.5 7B (hybrid — tune --n-gpu-layers to fit your VRAM):

```shell
./build/bin/llama-server \
  --model models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 28 \
  --host 127.0.0.1
```

--n-gpu-layers 999 means "put everything on GPU" — llama.cpp caps at the actual layer count automatically, so 999 is safe as a "full offload" value. For the 7B hybrid, start with 28 and reduce by 2 if the server reports out-of-memory at startup.

Verify the server is running

```shell
curl http://127.0.0.1:8080/health
# Expected: {"status":"ok"}
```
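A script that waits for the server before sending records can apply the same check to the response body. The helper below is a small sketch; `isServerReady` is a hypothetical name, and it only inspects the body text, leaving the HTTP call itself to the caller.

```typescript
// Returns true only when a /health response body is {"status":"ok"}.
// Any other body (still loading, error page, empty string) counts as not ready.
function isServerReady(body: string): boolean {
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { status?: unknown }).status === "ok"
    );
  } catch {
    return false; // body was not valid JSON
  }
}
```

A caller would typically poll the endpoint every few seconds and start the run once this returns true.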

OpenRouter (free tier)

OpenRouter exposes all models via an OpenAI-compatible API. No code changes are needed to switch from local llama.cpp to OpenRouter — only the config object changes.

Rate limits (free tier)

  • 50 requests per day (account total, not per model)
  • 20 requests per minute

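Implication for testing: with a 10-record test set you have headroom to test 4–5 models per day. A 100-record test set exceeds the 50-request daily limit at one request per record, so plan two days per model, or one day per model if two records are sent per request.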

Implication for production: the free tier is not viable for 117k records. If local quality is insufficient, use paid OpenRouter credits or a dedicated provider.
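The arithmetic behind both implications is short enough to write down. The sketch below assumes one request per record; `RateLimits`, `minDelayMs`, and `daysNeeded` are illustrative names, and the limit values are the free-tier figures quoted above.

```typescript
type RateLimits = { perDay: number; perMinute: number };

// Free-tier limits as listed in this document.
const FREE_TIER: RateLimits = { perDay: 50, perMinute: 20 };

// Smallest pause between requests that stays under the per-minute limit.
function minDelayMs(limits: RateLimits): number {
  return Math.ceil(60_000 / limits.perMinute);
}

// Number of calendar days needed for a given request count.
function daysNeeded(requests: number, limits: RateLimits): number {
  return Math.ceil(requests / limits.perDay);
}

const pauseMs = minDelayMs(FREE_TIER); // 3000 ms between requests
const testSetDays = daysNeeded(100, FREE_TIER); // 2 days for 100 records
const productionDays = daysNeeded(117_000, FREE_TIER); // 2340 days
```

At 2340 days for the full dataset, the free tier is only useful for evaluation.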

Ranked by expected multilingual generation quality for en/it/de/fr/es:

| Model ID | Params | Notes |
|---|---|---|
| qwen/qwen3-coder:free | 480B MoE (35B active) | Best free option. Strong multilingual despite "coder" label. Use as quality ceiling. |
| qwen/qwen3-next-80b-a3b-instruct:free | 80B MoE (3B active) | Smaller Qwen, useful comparison point. |
| nvidia/nemotron-3-super-120b-a12b:free | 120B MoE (12B active) | 262K context, supports structured output. |
| google/gemma-4-31b-it:free | 31B | 140+ language support, good European language coverage. |
| zhipuai/glm-4.5-air:free | MoE | Multilingual-focused. |

Skip for this pipeline:

  • Llama models — weaker European language generation than Qwen/Gemma
  • Mistral free tier — requests may be used for model training

API endpoint

https://openrouter.ai/api/v1/chat/completions

Set Authorization: Bearer <OPENROUTER_API_KEY> in the request headers.
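Because the endpoint follows the OpenAI chat completions format, the request can be derived entirely from a provider config. The sketch below builds the URL, headers, and body without sending anything. `buildChatRequest` and `ChatRequest` are hypothetical names, the `ProviderConfig` type repeats the one in config.ts, and the single user message is an assumption about how the prompt is passed.

```typescript
// Same shape as ProviderConfig in config.ts.
type ProviderConfig = {
  name: string;
  baseURL: string;
  apiKey: string;
  model: string;
  maxTokens: number;
};

type ChatRequest = {
  url: string;
  headers: Record<string, string>;
  body: string;
};

// Builds an OpenAI-compatible chat completion request for any provider
// that exposes <baseURL>/chat/completions.
function buildChatRequest(config: ProviderConfig, prompt: string): ChatRequest {
  return {
    url: `${config.baseURL}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify({
      model: config.model,
      max_tokens: config.maxTokens,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}
```

The result can be handed to any HTTP client, for example `fetch(req.url, { method: "POST", headers: req.headers, body: req.body })`. Pointing the same function at the local llama.cpp config yields a request for http://127.0.0.1:8080/v1/chat/completions.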


Provider configuration in the enrich script

The enrich script reads a single config object. To switch providers, change this object and re-run. The name field is used as the model identifier in pipeline.db — it must be unique across all runs.

```typescript
// config.ts

export type ProviderConfig = {
  name: string; // used as model identifier in pipeline.db; must be unique
  baseURL: string;
  apiKey: string;
  model: string;
  maxTokens: number;
};

// Local llama.cpp
export const LOCAL_QWEN3B: ProviderConfig = {
  name: "local-qwen2.5-3b",
  baseURL: "http://127.0.0.1:8080/v1",
  apiKey: "none", // llama.cpp ignores this
  model: "qwen2.5-3b", // llama.cpp ignores model name, uses loaded model
  maxTokens: 512,
};

// OpenRouter: Qwen3 480B (free)
export const OR_QWEN3_480B: ProviderConfig = {
  name: "or-qwen3-480b",
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY!,
  model: "qwen/qwen3-coder:free",
  maxTokens: 512,
};

// OpenRouter: Gemma 4 31B (free)
export const OR_GEMMA4_31B: ProviderConfig = {
  name: "or-gemma4-31b",
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY!,
  model: "google/gemma-4-31b-it:free",
  maxTokens: 512,
};

// Anthropic (reference baseline; different adapter required)
export const ANTHROPIC_SONNET: ProviderConfig = {
  name: "anthropic-sonnet",
  baseURL: "https://api.anthropic.com/v1", // adapter handles format difference
  apiKey: process.env.ANTHROPIC_API_KEY!,
  model: "claude-sonnet-4-6",
  maxTokens: 512,
};
```

All output is written to pipeline.db. Each record is stored with the model name as identifier so results from different providers can be compared and compiled into votes.
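Since the name field is the only thing separating one provider's results from another's in pipeline.db, a duplicate name would silently merge two runs. A cheap guard is to check the configs before starting. This is a sketch; `findDuplicateNames` is a hypothetical helper, not part of the enrich script.

```typescript
// Returns every config name that occurs more than once, each listed once.
function findDuplicateNames(configs: { name: string }[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const config of configs) {
    if (seen.has(config.name)) duplicates.add(config.name);
    seen.add(config.name);
  }
  return Array.from(duplicates);
}

// Example: the second and third entries reuse a name.
const clashes = findDuplicateNames([
  { name: "local-qwen2.5-3b" },
  { name: "or-qwen3-480b" },
  { name: "or-qwen3-480b" },
]);
```

An empty result means the configs are safe to run side by side. This only covers configs defined in the current file; names used in earlier runs would have to be checked against pipeline.db itself.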


Evaluation metrics

The test script measures the following per provider run:

| Metric | What it measures |
|---|---|
| JSON parse rate | % of responses that are valid, schema-compliant JSON. Critical: a failed parse is a wasted call. Target: >97% |
| Field coverage | % of records where all required fields are present (cefr votes for all translations, descriptions for all languages, glosses/examples for fr/es) |
| CEFR agreement | For records that have a cefr_source vote, % where the model agrees. Measures calibration. |
| Language correctness | Manual spot-check only; automated detection not reliable enough |
| Tokens/second | Local only. Indicates overnight run feasibility |
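Two of these metrics reduce to simple ratios. The sketch below shows one way to compute them, under assumptions: `jsonParseRate` only checks that a response parses as a JSON object and leaves schema validation out, and the `CefrPair` shape (with `sourceVote` standing in for the cefr_source vote) is hypothetical.

```typescript
// Percentage of raw responses that parse as a JSON object.
// Schema validation is left out of this sketch.
function jsonParseRate(responses: string[]): number {
  if (responses.length === 0) return 0;
  let parsed = 0;
  for (const raw of responses) {
    try {
      const value: unknown = JSON.parse(raw);
      if (typeof value === "object" && value !== null && !Array.isArray(value)) {
        parsed++;
      }
    } catch {
      // invalid JSON counts as a failed parse
    }
  }
  return (parsed / responses.length) * 100;
}

// Hypothetical pairing of a source CEFR vote (if any) with the model's vote.
type CefrPair = { sourceVote?: string; modelVote: string };

// Percentage agreement over records that have a source vote.
// Returns null when no record has one, since the ratio is undefined.
function cefrAgreement(pairs: CefrPair[]): number | null {
  const comparable = pairs.filter((p) => p.sourceVote !== undefined);
  if (comparable.length === 0) return null;
  const agreeing = comparable.filter((p) => p.sourceVote === p.modelVote).length;
  return (agreeing / comparable.length) * 100;
}
```

Records without a source vote are excluded from the agreement denominator rather than counted as disagreements, which matches the definition in the table.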

Decision thresholds

| Metric | Threshold | Action if below |
|---|---|---|
| JSON parse rate | < 97% | Do not use this model for production |
| Field coverage | < 95% | Prompt needs revision before production |
| CEFR agreement | < 70% | Model lacks vocabulary knowledge for this task |
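The thresholds translate directly into a pass/fail check per run. A minimal sketch follows; the `RunMetrics` type and `evaluateRun` function are illustrative names, with all values expressed as percentages.

```typescript
// Metric values for one provider run, as percentages (0 to 100).
type RunMetrics = {
  jsonParseRate: number;
  fieldCoverage: number;
  cefrAgreement: number;
};

// Returns the action for every threshold the run falls below.
// An empty array means the run passed all three checks.
function evaluateRun(metrics: RunMetrics): string[] {
  const actions: string[] = [];
  if (metrics.jsonParseRate < 97) {
    actions.push("JSON parse rate below 97%: do not use this model for production");
  }
  if (metrics.fieldCoverage < 95) {
    actions.push("Field coverage below 95%: revise the prompt before production");
  }
  if (metrics.cefrAgreement < 70) {
    actions.push("CEFR agreement below 70%: model lacks vocabulary knowledge for this task");
  }
  return actions;
}
```

A run with a 98% parse rate, 96% coverage, and 65% agreement would return only the CEFR action.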

  1. Start local, minimal dataset (5–10 records). Install llama.cpp, run Qwen2.5 3B against 5–10 hand-picked records. Verify the server works, the output parses, and the model produces something reasonable. This is purely a smoke test.

  2. Expand local to full 100-record sample. Once the pipeline is confirmed working, run all 100 records locally. Collect metrics. This is your local quality baseline.

  3. Run the same 100 records through OpenRouter free models. The 50 req/day limit allows one model per day only if two records are sent per request; at one request per record, each model takes two days. Start with qwen/qwen3-coder:free as the quality ceiling.

  4. Compare metrics side by side. If local 3B is within acceptable range of the cloud models on CEFR agreement and field coverage, proceed with local overnight runs for production. If not, use the cloud model that passed.

  5. Production run. Full 117k records. Resume-safe: each record is written to pipeline.db atomically as it is processed. Overnight runs can be stopped and continued at any time without losing work.
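Resuming comes down to skipping every record that already has a stored result for the current model. The sketch below models that with an in-memory set in place of pipeline.db; the `"<model name>:<record id>"` key format and the `doneKey` and `pendingRecordIds` helpers are assumptions for illustration, not the actual schema.

```typescript
// Hypothetical key identifying one stored result: "<model name>:<record id>".
type DoneKey = string;

function doneKey(model: string, recordId: number): DoneKey {
  return `${model}:${recordId}`;
}

// Returns the record ids that still need processing for the given model.
// Results stored under other model names do not count as done.
function pendingRecordIds(
  allIds: number[],
  done: Set<DoneKey>,
  model: string,
): number[] {
  return allIds.filter((id) => !done.has(doneKey(model, id)));
}

// Example: record 1 is done locally, record 2 only for another provider.
const done = new Set<DoneKey>([
  doneKey("local-qwen2.5-3b", 1),
  doneKey("or-qwen3-480b", 2),
]);
const pending = pendingRecordIds([1, 2, 3], done, "local-qwen2.5-3b"); // [2, 3]
```

Keying on the model name as well as the record id is what lets several providers work through the same dataset independently.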