lila/documentation/model-strategy.md

# Model Strategy

## The problem

The pipeline requires LLMs to perform four tasks per vocabulary entry:

1. **Gloss review** — confirm or improve the existing gloss
2. **Example review** — confirm or improve existing examples
3. **Translation validation** — confirm valid translations, reject bad data, generate missing ones
4. **CEFR assignment** — assign A1-C2 to the headword and each translation

The core challenge is that vocabulary entries have **multiple senses**. The word "cat" appears five times in the database — as an animal, as slang for "guy", as a nautical term, as a verb meaning "to vomit", and as a verb meaning "to hoist an anchor". Each sense requires a different CEFR level and different translations. A model that only knows "cat" is A1 gets four out of five wrong.

This makes CEFR assignment fundamentally a **sense-disambiguation problem**, not just a vocabulary lookup. Specialized CEFR classifiers (like `cefrpy` or `dksysd/cefr-classifier`) operate at the word or sentence level and cannot distinguish between senses of the same word. General LLMs handle sense disambiguation well but introduce quality and reliability problems that depend heavily on model size.

The secondary challenge is **hardware constraints**. The available local hardware (GTX 950M, 4GB VRAM) can only run models up to approximately 4B parameters fully in GPU memory. Larger models run in hybrid CPU/GPU mode which is significantly slower. Free cloud API tiers are generous enough for the sample dataset but have daily limits that make processing 100k+ entries across multiple sub-stages a multi-day or multi-week operation.

## What we tried and why it failed or worked

### Single-prompt design (abandoned)

The first enrich script sent one large prompt per entry covering all four tasks at once: CEFR voting, gloss improvement, example improvement, and translation validation (including generation of missing translations). This produced the following problems:

- The model skipped translations it considered invalid rather than explicitly rejecting them, causing validation failures
- Bad data in the translation table (`it:free`, `de:-frei`, `es:de fai`) caused consistent validation failures because the model refused to vote on them even when explicitly instructed
- The combined prompt was large enough to trigger reasoning mode on Gemma 4 E4B, consuming all available tokens on thinking before producing output
- 20% of entries required manual review

### Sub-stage design (current)

Splitting into four ordered sub-stages fixed the reasoning and validation problems:

1. `round1_gloss` — LLM reviews the gloss in isolation
2. `round1_example` — LLM reviews examples with verified gloss as context
3. `round1_translations` — LLM validates translations with verified gloss as context
4. `round1_cefr` — LLM assigns CEFR levels only to validated translations

This ordering ensures the CEFR sub-stage never sees bad data. The smaller, focused prompts no longer trigger reasoning mode, and per-entry time dropped from ~120 seconds to ~25 seconds.
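
The ordering above can be sketched as a fixed list that each entry is threaded through. The sub-stage names come from this document; the `Entry` shape, the `Handler` type, and `runSubStages` are hypothetical, and real handlers would be asynchronous LLM calls rather than plain functions.

```typescript
// Sub-stage names are from this document; everything else is illustrative.
type SubStage =
  | "round1_gloss"
  | "round1_example"
  | "round1_translations"
  | "round1_cefr";

// Fixed execution order: each stage sees the output of the one before it.
const SUB_STAGES: readonly SubStage[] = [
  "round1_gloss",
  "round1_example",
  "round1_translations",
  "round1_cefr",
];

// Hypothetical entry shape.
interface Entry {
  headword: string;
  gloss: string;
  examples: string[];
  translations: string[]; // "lang:word" pairs, e.g. "it:free"
}

// A handler takes the entry as left by the previous sub-stage and returns the updated entry.
type Handler = (entry: Entry) => Entry;

function runSubStages(entry: Entry, handlers: Record<SubStage, Handler>): Entry {
  let current = entry;
  for (const stage of SUB_STAGES) {
    current = handlers[stage](current);
  }
  return current;
}
```

Because `round1_translations` runs before `round1_cefr`, any translation it drops is already gone by the time the CEFR handler is called.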

### Gloss quality (ongoing)

Testing on 50 entries with Qwen3.5-4B produced acceptable glosses for ~80% of entries (about 40 of 50). The remaining ~20% fall into three categories:

- **Category header glosses** — Kaikki occasionally uses "Terms relating to people." or "Terms relating to things." as a gloss instead of a real definition. No model handles these correctly because there is no real meaning to improve.
- **Rare/obscure senses** — slang, archaic, and theological senses that a 4B model does not have enough knowledge to handle (e.g. "cat" meaning "to vomit", "word" meaning "Logos, Christ").
- **Short ambiguous glosses** — one or two word glosses with no example context cause hallucination.

### Gemma 4 E4B (rejected)

Gemma 4 E4B is a hybrid reasoning model. Disabling thinking via `--reasoning-budget 0` or `--chat-template-kwargs '{"enable_thinking":false}'` does not work reliably in llama.cpp for the E4B variant — the model either puts reasoning into the content field as plain text or returns empty content with reasoning in `reasoning_content`. Per-entry time exceeded 100 seconds making it impractical.

### Qwen3.5-4B (current local model)

Non-thinking by default for the small series. Runs fully in 4GB VRAM at ~5 seconds per sub-stage. Acceptable quality for common vocabulary (A1-B2) but struggles with rare and specialized senses. Used as the primary local voter.

### Specialized CEFR classifiers (rejected for primary use)

HuggingFace hosts several CEFR text classifiers (`dksysd/cefr-classifier`, `AbdulSami/bert-base-cased-cefr`) and the `cefrpy` Python library maps individual words to CEFR levels. These operate at the word or sentence level and cannot distinguish between senses. "cat" would always be assigned A1 regardless of whether the sense is the animal or obscure nautical slang. Useful only as a sanity check signal, not as a primary voter.

## Available free resources

| Resource                     | Type               | Requests/day      | Quality   | Notes                                                                  |
| ---------------------------- | ------------------ | ----------------- | --------- | ---------------------------------------------------------------------- |
| Local Qwen3.5-4B Q4_K_M      | Local model        | Unlimited         | Decent    | Non-thinking by default, fits in 4GB VRAM, ~5s per sub-stage           |
| Local Qwen3.5-9B Q4_K_M      | Local model        | Unlimited         | Good      | Hybrid CPU/GPU mode on 4GB VRAM, slower but better quality             |
| Local Llama 3.1 8B Q4_K_M    | Local model        | Unlimited         | Decent    | ~4.9GB, light hybrid on 4GB VRAM, different architecture from Qwen     |
| Groq — Llama 3.3 70B         | Cloud API          | 1,000             | Excellent | Best free quality available, 5-10x with batching                       |
| Groq — Llama 3.1 8B          | Cloud API          | 14,400            | Decent    | High volume, similar quality to local 4B                               |
| Google Gemini AI Studio      | Cloud API          | 1,500             | Very good | Google account required, 5-10x with batching                           |
| OpenRouter free rotation     | Cloud API          | 50–1,000          | Varies    | Rotates between free models automatically via `openrouter/free`        |
| Wiktionary API               | Context enrichment | Unlimited         | N/A       | Structured vocabulary data, directly related to Kaikki source          |
| `cefrpy` Python library      | Word lookup        | Unlimited         | Limited   | Deterministic English word CEFR lookup, no sense disambiguation        |
| HuggingFace CEFR classifiers | Text classifier    | Unlimited (local) | Limited   | Sentence-level difficulty, not sense-aware                             |

### Batching

The cloud APIs accept prompts that contain several entries in a single request. Sending 5 entries per request multiplies effective daily capacity by 5:

- Groq Llama 3.3 70B: 1,000 requests → ~5,000 entries/day
- Gemini: 1,500 requests → ~7,500 entries/day
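
The capacity arithmetic above, plus the chunking it relies on, can be sketched as follows. The function names are hypothetical, and `daysNeeded` assumes one request per batch per sub-stage with no retries, so it is a lower bound rather than a forecast.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Entries that fit into one day's request quota at a given batch size.
function entriesPerDay(requestsPerDay: number, batchSize: number): number {
  return requestsPerDay * batchSize;
}

// Days needed to push `entryCount` entries through `subStages` sub-stages
// on a single provider, assuming no retries.
function daysNeeded(
  entryCount: number,
  subStages: number,
  requestsPerDay: number,
  batchSize: number,
): number {
  const requests = Math.ceil(entryCount / batchSize) * subStages;
  return Math.ceil(requests / requestsPerDay);
}
```

With these assumptions, `entriesPerDay(1000, 5)` gives 5,000 and `entriesPerDay(1500, 5)` gives 7,500, matching the figures above. `daysNeeded(100000, 4, 1000, 5)` gives 80: 100,000 entries across four sub-stages is 80,000 batched requests at the Groq 70B quota.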

### Multiple accounts

Prohibited by the terms of service of all providers listed above.

## Final approach per sub-stage

The pipeline runs multiple models as independent voters. Each model processes every entry once and writes its votes to `pipeline.db`. The merge stage resolves disagreements by majority vote. A tiebreaker runs additional models on flagged entries where no majority was reached.
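
A minimal sketch of the merge rule described above, assuming each vote is reduced to a single comparable string (for example a CEFR level or an accept/reject decision). The `Vote` and `MergeResult` shapes and the `mergeVotes` name are hypothetical; the actual schema in `pipeline.db` is not shown in this document.

```typescript
interface Vote {
  voter: string; // e.g. "qwen3.5-9b-local"
  value: string; // e.g. "A1", "reject"
}

type MergeResult =
  | { status: "resolved"; value: string; support: number }
  | { status: "flagged"; reason: "no_majority" };

// Resolve by strict majority: the winning value needs more than half of all votes.
function mergeVotes(votes: readonly Vote[]): MergeResult {
  const counts: Record<string, number> = {};
  for (const vote of votes) {
    counts[vote.value] = (counts[vote.value] ?? 0) + 1;
  }

  let winner = "";
  let best = 0;
  for (const value of Object.keys(counts)) {
    const count = counts[value] ?? 0;
    if (count > best) {
      best = count;
      winner = value;
    }
  }

  if (best * 2 > votes.length) {
    return { status: "resolved", value: winner, support: best };
  }
  // No strict majority: hand the entry to the tiebreaker models.
  return { status: "flagged", reason: "no_majority" };
}
```

With three voters, two matching votes resolve the entry, and three different votes flag it for the tiebreaker.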

### round1_gloss and round1_example

These sub-stages require a model that understands sense context from examples. Specialized classifiers cannot help here — only general LLMs can evaluate whether a gloss correctly describes a specific sense.

**Primary voter:** Local Qwen3.5-9B Q4_K_M. Runs overnight with no request limit and is expected to handle common vocabulary well, though this is not yet tested (see Open questions).

**Secondary voter:** Groq Llama 3.3 70B with 5-entry batching — higher quality, catches errors the local model makes on rare or specialized senses.

**Tertiary voter:** Gemini AI Studio with 5-entry batching — third independent opinion, different training data from both Groq and local model.

**Context enrichment via Wiktionary API:** Before calling any model for the gloss or example sub-stage, the pipeline queries the Wiktionary API for the headword. The API returns the full Wiktionary entry including all senses, usage notes, and examples. This structured data is added to the prompt as additional context, giving the model a much clearer picture of which specific sense it is working with.

This targets two of the failure categories seen in gloss testing:

- **Category header glosses** ("Terms relating to people."): the Wiktionary entry contains the real definition, which the model can use to generate a proper gloss
- **Short ambiguous glosses**: the additional sense context keeps the model from guessing the wrong meaning

The Wiktionary API is free, has no rate limits for reasonable use, and is directly related to the Kaikki data source since Kaikki extracts from Wiktionary.

### round1_translations

Same voter stack as gloss/example. The few-shot examples in the prompt (showing that `it:free` → reject and `de:-frei` → reject) handle the bad data cases that caused validation failures in the single-prompt design.

### round1_cefr

This sub-stage only receives translations that survived the validation step. All bad data is already excluded.

**Primary voter:** Local Qwen3.5-9B Q4_K_M.

**Secondary voter:** Groq Llama 3.3 70B with 5-entry batching.

**Tertiary voter:** Gemini AI Studio with 5-entry batching.

**Sanity check:** `cefrpy` provides a deterministic English word CEFR level as a reference signal. If the majority LLM vote disagrees significantly (e.g. LLMs vote C2 for "cat" the animal), the entry is flagged for human review. `cefrpy` does not vote — it only triggers review flags.
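
The review trigger can be sketched as a distance check on the CEFR scale. The reference level is passed in as plain data here (in the pipeline it would come from `cefrpy`), and the threshold of two levels is an assumption, since this document does not define what counts as "significantly".

```typescript
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
type CefrLevel = (typeof CEFR_LEVELS)[number];

// Number of steps between two levels on the A1..C2 scale.
function levelDistance(a: CefrLevel, b: CefrLevel): number {
  return Math.abs(CEFR_LEVELS.indexOf(a) - CEFR_LEVELS.indexOf(b));
}

// Flag for human review when the LLM majority is far from the word-level reference.
// `maxDistance` is a hypothetical threshold, not a tested value.
function needsReview(
  majority: CefrLevel,
  reference: CefrLevel | null,
  maxDistance = 2,
): boolean {
  if (reference === null) return false; // word unknown to the lookup: no signal
  return levelDistance(majority, reference) > maxDistance;
}
```

`needsReview("C2", "A1")` returns `true` and `needsReview("A2", "A1")` returns `false`. Because the reference is word-level, a legitimately rare sense of a common word would also be flagged by this check, so the flag is a prompt for review rather than evidence of an error.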

### Voter summary

| Sub-stage           | Voter 1            | Voter 2            | Voter 3 |
| ------------------- | ------------------ | ------------------ | ------- |
| round1_gloss        | Qwen3.5-9B (local) | Groq Llama 3.3 70B | Gemini  |
| round1_example      | Qwen3.5-9B (local) | Groq Llama 3.3 70B | Gemini  |
| round1_translations | Qwen3.5-9B (local) | Groq Llama 3.3 70B | Gemini  |
| round1_cefr         | Qwen3.5-9B (local) | Groq Llama 3.3 70B | Gemini  |

Three voters means a majority requires at least two models to agree. Even if the local model gets a difficult sense wrong, the two cloud models can outvote it when they agree on the correct answer.

## Open questions

### Wiktionary API context extraction

The Wiktionary API returns the full entry for a word including all senses. For a word like "free" with 8+ senses, dumping the entire entry into the prompt wastes tokens and may confuse the model. The open question is how to extract only the relevant sense. Options include matching by `sense_index`, fuzzy-matching the Kaikki gloss against Wiktionary glosses, or letting the model see all senses and identify the correct one itself.
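
As one illustration of the fuzzy-matching option, the sketch below scores each Wiktionary gloss by token overlap (Jaccard similarity) with the Kaikki gloss and returns the best index. The function names and the 0.3 cutoff are hypothetical, and token overlap stands in for whatever similarity measure is eventually chosen.

```typescript
// Lowercase, split on non-alphanumerics, and de-duplicate.
function uniqueTokens(text: string): string[] {
  const tokens = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
  return tokens.filter((t, i) => tokens.indexOf(t) === i);
}

// Jaccard similarity of two de-duplicated token lists: |A ∩ B| / |A ∪ B|.
function jaccard(a: readonly string[], b: readonly string[]): number {
  const shared = a.filter((t) => b.indexOf(t) !== -1).length;
  const union = a.length + b.length - shared;
  return union === 0 ? 0 : shared / union;
}

// Index of the best-matching gloss, or -1 when nothing reaches `minScore`.
function bestSenseIndex(
  kaikkiGloss: string,
  wiktionaryGlosses: readonly string[],
  minScore = 0.3,
): number {
  const target = uniqueTokens(kaikkiGloss);
  let bestIndex = -1;
  let bestScore = 0;
  wiktionaryGlosses.forEach((gloss, i) => {
    const score = jaccard(target, uniqueTokens(gloss));
    if (score > bestScore) {
      bestScore = score;
      bestIndex = i;
    }
  });
  return bestScore >= minScore ? bestIndex : -1;
}
```

A return value of -1 would be the cue to fall back to one of the other options, such as showing the model all senses. Category header glosses like "Terms relating to people." share almost no tokens with real definitions, so they tend to land in that fallback path.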

### Batching prompt design

Batching 5-10 entries per API call multiplies effective daily capacity by the batch size. The prompt and validation logic for batched requests are more complex: the model must return a structured JSON object keyed by entry ID, and partial failures (one entry in a batch fails validation) need careful handling. Neither has been designed or tested yet.
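
One possible way to handle partial failures is to validate each expected ID independently and queue only the failures for a retry. This is a sketch of that idea, not a settled design: `splitBatchResponse`, `BatchOutcome`, and the validator callback are hypothetical names.

```typescript
interface BatchOutcome<T> {
  accepted: Record<string, T>; // entry ID -> validated result
  retry: string[]; // entry IDs that were missing or failed validation
}

function splitBatchResponse<T>(
  expectedIds: readonly string[],
  rawResponse: string,
  isValid: (value: unknown) => value is T,
): BatchOutcome<T> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawResponse);
  } catch {
    // The whole response is unusable: retry every entry in the batch.
    return { accepted: {}, retry: expectedIds.slice() };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { accepted: {}, retry: expectedIds.slice() };
  }

  const byId = parsed as Record<string, unknown>;
  const outcome: BatchOutcome<T> = { accepted: {}, retry: [] };
  for (const id of expectedIds) {
    const present = Object.prototype.hasOwnProperty.call(byId, id);
    const value = present ? byId[id] : undefined;
    if (isValid(value)) {
      outcome.accepted[id] = value;
    } else {
      outcome.retry.push(id);
    }
  }
  return outcome;
}
```

Keys the model invents that are not in `expectedIds` are ignored, so one bad entry costs a single retry rather than the whole batch.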

### Groq and Gemini API integration

Neither Groq nor Gemini is integrated into the pipeline yet. Both offer OpenAI-compatible endpoints, so integration should be straightforward: add provider configs to `stage-3-enrich/config.ts` and set API keys in `.env`. The batching prompt design needs to be finalised first.
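
A provider config might look roughly like the sketch below. The `ProviderConfig` shape, the environment variable names, and the helper functions are assumptions, not the existing contents of `stage-3-enrich/config.ts`. The base URLs and model IDs should be checked against each provider's current documentation before use; in particular, this document does not say which Gemini model is intended, so `gemini-2.0-flash` is only a placeholder choice.

```typescript
// Hypothetical config shape for an OpenAI-compatible provider.
interface ProviderConfig {
  name: string;
  baseUrl: string; // OpenAI-compatible base URL
  model: string;
  apiKeyEnv: string; // name of the variable expected in .env (assumed names)
  requestsPerDay: number; // free-tier quota from the resource table
  batchSize: number;
}

const GROQ_70B: ProviderConfig = {
  name: "groq-llama-3.3-70b",
  baseUrl: "https://api.groq.com/openai/v1",
  model: "llama-3.3-70b-versatile",
  apiKeyEnv: "GROQ_API_KEY",
  requestsPerDay: 1000,
  batchSize: 5,
};

const GEMINI: ProviderConfig = {
  name: "gemini-ai-studio",
  baseUrl: "https://generativelanguage.googleapis.com/v1beta/openai",
  model: "gemini-2.0-flash", // placeholder: the intended model is not specified
  apiKeyEnv: "GEMINI_API_KEY",
  requestsPerDay: 1500,
  batchSize: 5,
};

const CLOUD_PROVIDERS: readonly ProviderConfig[] = [GROQ_70B, GEMINI];

// Chat completions endpoint for a provider, tolerating a trailing slash in baseUrl.
function chatCompletionsUrl(provider: ProviderConfig): string {
  return provider.baseUrl.replace(/\/+$/, "") + "/chat/completions";
}

// Names of API key variables that are unset or empty in the given environment.
function missingApiKeys(
  providers: readonly ProviderConfig[],
  env: Record<string, string | undefined>,
): string[] {
  return providers
    .filter((p) => !env[p.apiKeyEnv])
    .map((p) => p.apiKeyEnv);
}
```

Calling `missingApiKeys(CLOUD_PROVIDERS, process.env)` at startup would let the pipeline skip a cloud voter cleanly instead of failing on the first request.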

### OpenRouter free model rotation

OpenRouter's `openrouter/free` router selects a model at random from available free models. This means output style and quality vary between requests, which complicates round 2 voting where models review each other's candidates. May need to pin specific free models rather than using the router.

### Qwen3.5-9B performance on hard cases

The 9B model has not yet been tested. It is expected to handle rare and specialized senses better than the 4B model, but this has not been verified. It needs a test run against the same 50 entries used to evaluate the 4B model.

### llama.cpp Gemma 4 bug

The llama.cpp chat template bug preventing reliable JSON output from Gemma 4 E4B may be fixed in a future release. The model fits in 4GB VRAM and would be a useful additional local voter if the bug is resolved. Worth checking periodically.

### Full dataset scale

The current pipeline runs on a 500-entry sample per language. The full Kaikki English file contains approximately 1.3 million entries, of which a fraction will pass the POS and translation filters. The exact count and the time required to run all sub-stages across all models at full scale are not yet known.

### Category header glosses

Kaikki occasionally uses category headers ("Terms relating to people.", "Terms relating to things.") as glosses. These are not real definitions and no model produces useful output for them. Options include pre-filtering them before the gloss sub-stage and generating a gloss purely from examples, or flagging them as a special case for human review.
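
A pre-filter for these glosses could be as small as the sketch below. The pattern only covers the "Terms relating to ..." form quoted in this document, and the routing rule (generate from examples when any exist, otherwise send to human review) is one hypothetical way of combining the two options, not a decision that has been made.

```typescript
// Matches the category-header form quoted in this document, case-insensitively.
function isCategoryHeaderGloss(gloss: string): boolean {
  return /^terms relating to\b/i.test(gloss.trim());
}

type GlossRoute = "normal" | "gloss_from_examples" | "human_review";

// Decide how the gloss sub-stage should treat an entry.
function routeGloss(gloss: string, exampleCount: number): GlossRoute {
  if (!isCategoryHeaderGloss(gloss)) return "normal";
  return exampleCount > 0 ? "gloss_from_examples" : "human_review";
}
```

If other header patterns turn up in the Kaikki data, they would need to be added to the regular expression by hand.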

## Model downloads

```bash
# Llama 3.1 8B Instruct Q4_K_M
wget -O models/llama-3.1-8b-instruct-q4_k_m.gguf \
  "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"

# Qwen3.5-9B Q4_K_M (5.68GB, hybrid mode, better quality)
wget -O models/qwen3.5-9b-q4_k_m.gguf \
  "https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-Q4_K_M.gguf"

# Qwen3.5-9B Q3_K_S (4.32GB, might fit fully in VRAM)
wget -O models/qwen3.5-9b-q3_k_s.gguf \
  "https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-Q3_K_S.gguf"
```