lila/documentation/model-strategy.md


Model Strategy

The problem

The pipeline requires LLMs to perform four tasks per vocabulary entry:

  1. Gloss review — confirm or improve the existing gloss
  2. Example review — confirm or improve existing examples
  3. Translation validation — confirm valid translations, reject bad data, generate missing ones
  4. CEFR assignment — assign A1-C2 to the headword and each translation

The core challenge is that vocabulary entries have multiple senses. The word "cat" appears five times in the database — as an animal, as slang for "guy", as a nautical term, as a verb meaning "to vomit", and as a verb meaning "to hoist an anchor". Each sense requires a different CEFR level and different translations. A model that only knows "cat" is A1 gets four out of five wrong.

This makes CEFR assignment fundamentally a sense-disambiguation problem, not just a vocabulary lookup. Specialized CEFR classifiers (like cefrpy or dksysd/cefr-classifier) operate at the word or sentence level and cannot distinguish between senses of the same word. General LLMs handle sense disambiguation well but introduce quality and reliability problems that depend heavily on model size.
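The gap between word-level and sense-level assignment can be shown with a small sketch. The glosses and every level other than A1 for the animal sense are illustrative assumptions, not values from the dataset, and `wordLevelLookup` is a hypothetical stand-in for any word-level classifier.

```typescript
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

interface SenseEntry {
  headword: string;
  gloss: string;
  expectedLevel: CefrLevel; // illustrative value, not taken from the dataset
}

// The five "cat" senses described above. Only the animal sense is A1;
// the remaining levels are placeholder assumptions for the demonstration.
const catSenses: SenseEntry[] = [
  { headword: "cat", gloss: "a domesticated feline animal", expectedLevel: "A1" },
  { headword: "cat", gloss: "slang: a guy", expectedLevel: "C1" },
  { headword: "cat", gloss: "a nautical term", expectedLevel: "C2" },
  { headword: "cat", gloss: "to vomit", expectedLevel: "C2" },
  { headword: "cat", gloss: "to hoist an anchor", expectedLevel: "C2" },
];

// A word-level classifier only sees the headword, so every sense gets the same answer.
const wordLevelLookup: Record<string, CefrLevel> = { cat: "A1" };

function countWordLevelErrors(senses: SenseEntry[]): number {
  return senses.filter((s) => wordLevelLookup[s.headword] !== s.expectedLevel).length;
}

console.log(countWordLevelErrors(catSenses)); // 4
```

Any lookup keyed on the headword alone has this ceiling, regardless of how accurate it is for the most common sense.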

The secondary challenge is hardware constraints. The available local hardware (GTX 950M, 4GB VRAM) can only run models up to approximately 4B parameters fully in GPU memory. Larger models run in hybrid CPU/GPU mode which is significantly slower. Free cloud API tiers are generous enough for the sample dataset but have daily limits that make processing 100k+ entries across multiple sub-stages a multi-day or multi-week operation.

What we tried and why it failed or worked

Single-prompt design (abandoned)

The first enrich script sent one large prompt per entry covering all four tasks at once: CEFR voting, gloss improvement, example improvement, and translation validation (including generation of missing translations). This produced the following problems:

  • The model skipped translations it considered invalid rather than explicitly rejecting them, causing validation failures
  • Bad data in the translation table (it:free, de:-frei, es:de fai) caused consistent validation failures because the model refused to vote on them even when explicitly instructed
  • The combined prompt was large enough to trigger reasoning mode on Gemma 4 E4B, consuming all available tokens on thinking before producing output
  • 20% of entries required manual review

Sub-stage design (current)

Splitting into four ordered sub-stages fixed the reasoning and validation problems:

  1. round1_gloss — LLM reviews the gloss in isolation
  2. round1_example — LLM reviews examples with verified gloss as context
  3. round1_translations — LLM validates translations with verified gloss as context
  4. round1_cefr — LLM assigns CEFR levels only to validated translations

This ordering ensures the CEFR sub-stage never sees bad data. The smaller, focused prompts eliminated reasoning mode triggering and reduced per-entry time from ~120 seconds to ~25 seconds.
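A minimal sketch of that ordering, assuming each sub-stage can be modelled as a function that takes the state left by the previous one. The `EntryState` shape, `runSubStages`, and the stub stages are hypothetical; in the real pipeline each stage would call an LLM. The good translations and the `A2` level are placeholders.

```typescript
interface Translation {
  lang: string;
  text: string;
}

interface EntryState {
  headword: string;
  gloss: string;
  examples: string[];
  translations: Translation[];
  cefr: Record<string, string>; // "lang:text" -> CEFR level
}

type SubStage = (state: EntryState) => EntryState;

const SUB_STAGE_ORDER = [
  "round1_gloss",
  "round1_example",
  "round1_translations",
  "round1_cefr",
] as const;
type SubStageName = (typeof SUB_STAGE_ORDER)[number];

// Runs the stages strictly in order, so each one sees the verified output of the last.
function runSubStages(
  initial: EntryState,
  stages: Record<SubStageName, SubStage>
): EntryState {
  let state = initial;
  for (const stageName of SUB_STAGE_ORDER) {
    state = stages[stageName](state);
  }
  return state;
}

// Known-bad rows from the translation table, used by the stub validator below.
const KNOWN_BAD = ["it:free", "de:-frei", "es:de fai"];

const stubStages: Record<SubStageName, SubStage> = {
  round1_gloss: (s) => s,
  round1_example: (s) => s,
  round1_translations: (s) => ({
    ...s,
    translations: s.translations.filter(
      (t) => KNOWN_BAD.indexOf(`${t.lang}:${t.text}`) === -1
    ),
  }),
  round1_cefr: (s) => {
    const cefr: Record<string, string> = {};
    for (const t of s.translations) {
      cefr[`${t.lang}:${t.text}`] = "A2"; // placeholder level
    }
    return { ...s, cefr };
  },
};

const enriched = runSubStages(
  {
    headword: "free",
    gloss: "Not imprisoned or enslaved.",
    examples: [],
    translations: [
      { lang: "it", text: "free" },
      { lang: "it", text: "libero" },
      { lang: "de", text: "-frei" },
      { lang: "de", text: "frei" },
    ],
    cefr: {},
  },
  stubStages
);
console.log(Object.keys(enriched.cefr)); // [ 'it:libero', 'de:frei' ]
```

Because `round1_cefr` reads only `state.translations`, rejected rows cannot reach it by construction.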

Gloss quality (ongoing)

Testing on 50 entries with Qwen3.5-4B showed that ~80% of the reviewed glosses were of good quality. The remaining ~20% fall into three categories:

  • Category header glosses — Kaikki occasionally uses "Terms relating to people." or "Terms relating to things." as a gloss instead of a real definition. No model handles these correctly because there is no real meaning to improve.
  • Rare/obscure senses — slang, archaic, and theological senses that a 4B model does not have enough knowledge to handle (e.g. "cat" meaning "to vomit", "word" meaning "Logos, Christ").
  • Short ambiguous glosses — one or two word glosses with no example context cause hallucination.

Gemma 4 E4B (rejected)

Gemma 4 E4B is a hybrid reasoning model. Disabling thinking via --reasoning-budget 0 or --chat-template-kwargs '{"enable_thinking":false}' does not work reliably in llama.cpp for the E4B variant — the model either puts reasoning into the content field as plain text or returns empty content with reasoning in reasoning_content. Per-entry time exceeded 100 seconds making it impractical.

Qwen3.5-4B (current local model)

Qwen3.5-4B is non-thinking by default, as is the rest of the small series. It runs fully in 4GB VRAM at ~5 seconds per sub-stage. Quality is acceptable for common vocabulary (A1-B2), but the model struggles with rare and specialized senses. It is currently the primary local voter.

Specialized CEFR classifiers (rejected for primary use)

HuggingFace hosts several CEFR text classifiers (dksysd/cefr-classifier, AbdulSami/bert-base-cased-cefr) and the cefrpy Python library maps individual words to CEFR levels. These operate at the word or sentence level and cannot distinguish between senses. "cat" would always be assigned A1 regardless of whether the sense is the animal or obscure nautical slang. Useful only as a sanity check signal, not as a primary voter.

Available free resources

| Resource | Type | Requests/day | Quality | Notes |
| --- | --- | --- | --- | --- |
| Local Qwen3.5-4B Q4_K_M | Local model | Unlimited | Decent | Non-thinking by default, fits in 4GB VRAM, ~5s per sub-stage |
| Local Qwen3.5-9B Q4_K_M | Local model | Unlimited | Good | Hybrid CPU/GPU mode on 4GB VRAM, slower but better quality |
| Local Llama 3.1 8B Q4_K_M | Local model | Unlimited | Decent | ~4.3GB, fits in VRAM or light hybrid, different architecture from Qwen |
| Groq Llama 3.3 70B | Cloud API | 1,000 | Excellent | Best free quality available, 5-10x with batching |
| Groq Llama 3.1 8B | Cloud API | 14,400 | Decent | High volume, similar quality to local 4B |
| Google Gemini AI Studio | Cloud API | 1,500 | Very good | Google account required, 5-10x with batching |
| OpenRouter free rotation | Cloud API | 50-1,000 | Varies | Rotates between free models automatically via openrouter/free |
| Wiktionary API | Context enrichment | Unlimited | N/A | Structured vocabulary data, directly related to Kaikki source |
| cefrpy Python library | Word lookup | Unlimited | Limited | Deterministic English word CEFR lookup, no sense disambiguation |
| HuggingFace CEFR classifiers | Text classifier | Unlimited (local) | Limited | Sentence-level difficulty, not sense-aware |

Batching

A single request to any of the cloud APIs can carry several entries in one prompt. Sending 5 entries per request multiplies effective daily capacity by 5, provided the providers' token limits are not reached first:

  • Groq Llama 3.3 70B: 1,000 requests → ~5,000 entries/day
  • Gemini: 1,500 requests → ~7,500 entries/day
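The arithmetic behind these figures, and what it implies for a full run, can be sketched as follows. The 100,000-entry total is taken from the "100k+" estimate above. The assumption that every sub-stage needs its own pass over every entry for each voter is mine, and the helper names are hypothetical.

```typescript
// Entries a provider can process per day at a given batch size.
function entriesPerDay(requestsPerDay: number, batchSize: number): number {
  return requestsPerDay * batchSize;
}

// Days one voter needs to cover every entry in every sub-stage.
function daysForFullRun(totalEntries: number, subStages: number, perDay: number): number {
  return Math.ceil((totalEntries * subStages) / perDay);
}

const groqPerDay = entriesPerDay(1000, 5);
const geminiPerDay = entriesPerDay(1500, 5);

console.log(groqPerDay); // 5000
console.log(geminiPerDay); // 7500
console.log(daysForFullRun(100000, 4, groqPerDay)); // 80
console.log(daysForFullRun(100000, 4, geminiPerDay)); // 54
```

Under these assumptions the slowest cloud voter sets the overall duration, so a larger batch size on Groq would shorten the run more than any change on the Gemini side.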

Multiple accounts

Prohibited by the terms of service of all providers listed above.

Final approach per sub-stage

The pipeline runs multiple models as independent voters. Each model processes every entry once and writes its votes to pipeline.db. The merge stage resolves disagreements by majority vote. A tiebreaker runs additional models on flagged entries where no majority was reached.
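A minimal sketch of the merge rule, assuming votes are compared as plain strings and that "majority" means strictly more than half of the votes cast. The `Vote` and `MergeResult` shapes are hypothetical and do not reflect the actual pipeline.db schema.

```typescript
interface Vote {
  voter: string;
  value: string;
}

interface MergeResult {
  winner: string | null; // null when no value has a strict majority
  needsTiebreak: boolean;
}

function mergeVotes(votes: Vote[]): MergeResult {
  // Count how many voters chose each value.
  const counts: Record<string, number> = {};
  for (const v of votes) {
    counts[v.value] = (counts[v.value] ?? 0) + 1;
  }
  // A value wins only if more than half of all voters chose it.
  for (const value of Object.keys(counts)) {
    if ((counts[value] ?? 0) * 2 > votes.length) {
      return { winner: value, needsTiebreak: false };
    }
  }
  return { winner: null, needsTiebreak: true };
}

console.log(
  mergeVotes([
    { voter: "qwen", value: "A2" },
    { voter: "groq", value: "A2" },
    { voter: "gemini", value: "B1" },
  ])
); // { winner: 'A2', needsTiebreak: false }

console.log(
  mergeVotes([
    { voter: "qwen", value: "A1" },
    { voter: "groq", value: "B1" },
    { voter: "gemini", value: "C1" },
  ])
); // { winner: null, needsTiebreak: true }
```

Exact string comparison suits CEFR levels and accept/reject verdicts. Free-text outputs such as improved glosses would need some form of normalisation or similarity check before they can be counted as agreeing.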

round1_gloss and round1_example

These sub-stages require a model that understands sense context from examples. Specialized classifiers cannot help here — only general LLMs can evaluate whether a gloss correctly describes a specific sense.

Primary voter: Local Qwen3.5-9B Q4_K_M. It runs overnight with no request limit and is expected to handle common vocabulary well (not yet verified, see the open questions).

Secondary voter: Groq Llama 3.3 70B with 5-entry batching — higher quality, catches errors the local model makes on rare or specialized senses.

Tertiary voter: Gemini AI Studio with 5-entry batching — third independent opinion, different training data from both Groq and local model.

Context enrichment via Wiktionary API: Before calling any model for the gloss or example sub-stage, the pipeline queries the Wiktionary API for the headword. The API returns the full Wiktionary entry including all senses, usage notes, and examples. This structured data is added to the prompt as additional context, giving the model a much clearer picture of which specific sense it is working with.

This directly fixes the two hardest failure cases:

  • Category header glosses ("Terms relating to people.") — the Wiktionary entry contains the real definition which the model can use to generate a proper gloss
  • Short ambiguous glosses — the additional sense context prevents the model from guessing the wrong meaning

The Wiktionary API is free, has no rate limits for reasonable use, and is directly related to the Kaikki data source since Kaikki extracts from Wiktionary.

round1_translations

Same voter stack as gloss/example. The few-shot examples in the prompt (showing that it:free → reject and de:-frei → reject) handle the bad data cases that caused validation failures in the single-prompt design.
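One way such a few-shot prompt could be assembled is sketched below, using the OpenAI-style chat message format. The instruction wording, the output line format, the example gloss, and the accepted `it:libero` row are all assumptions made for illustration. Only the two reject cases come from the text above.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface TranslationCandidate {
  lang: string;
  text: string;
}

function renderCandidates(candidates: TranslationCandidate[]): string {
  return candidates.map((c) => `${c.lang}:${c.text}`).join("\n");
}

function buildTranslationMessages(
  headword: string,
  gloss: string,
  candidates: TranslationCandidate[]
): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "For every candidate, answer with one line: <lang>:<text> -> accept or reject. " +
        "Never skip a candidate.",
    },
    // Few-shot turn: shows that bad rows get an explicit reject instead of being omitted.
    {
      role: "user",
      content:
        "Headword: free\nGloss: Not imprisoned or enslaved.\n" +
        "Candidates:\nit:free\nde:-frei\nit:libero",
    },
    {
      role: "assistant",
      content: "it:free -> reject\nde:-frei -> reject\nit:libero -> accept",
    },
    // The actual entry to validate.
    {
      role: "user",
      content: `Headword: ${headword}\nGloss: ${gloss}\nCandidates:\n${renderCandidates(candidates)}`,
    },
  ];
}

const demoMessages = buildTranslationMessages("cat", "A domesticated feline animal.", [
  { lang: "de", text: "Katze" },
  { lang: "it", text: "gatto" },
]);
console.log(demoMessages.length); // 4
```

Putting the reject cases in an assistant turn, rather than describing them in the instructions, gives the model a concrete output to imitate for rows it would otherwise skip.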

round1_cefr

This sub-stage only receives translations that survived the validation step. All bad data is already excluded.

Primary voter: Local Qwen3.5-9B Q4_K_M.

Secondary voter: Groq Llama 3.3 70B with 5-entry batching.

Tertiary voter: Gemini AI Studio with 5-entry batching.

Sanity check: cefrpy provides a deterministic English word CEFR level as a reference signal. If the majority LLM vote disagrees significantly (e.g. LLMs vote C2 for "cat" the animal), the entry is flagged for human review. cefrpy does not vote — it only triggers review flags.
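The review flag can be sketched as a distance check on the ordered CEFR scale. The threshold of two levels is an assumption, since the text only says "significantly". The reference level is passed in as a plain value, so the sketch does not depend on how cefrpy is actually called.

```typescript
const CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
type CefrLevel = (typeof CEFR_ORDER)[number];

// Number of steps between two levels on the A1-C2 scale.
function levelDistance(a: CefrLevel, b: CefrLevel): number {
  return Math.abs(CEFR_ORDER.indexOf(a) - CEFR_ORDER.indexOf(b));
}

// Flags an entry when the LLM majority is more than maxDistance levels
// away from the word-level reference. The reference never votes.
function needsHumanReview(
  llmMajority: CefrLevel,
  reference: CefrLevel | null,
  maxDistance = 2
): boolean {
  if (reference === null) return false; // word unknown to the reference lookup
  return levelDistance(llmMajority, reference) > maxDistance;
}

console.log(needsHumanReview("C2", "A1")); // true
console.log(needsHumanReview("A2", "A1")); // false
console.log(needsHumanReview("B2", null)); // false
```

Since the reference is word-level, a correctly rated rare sense (for example a C2 vote for the nautical sense of "cat") is flagged as well, so the threshold also controls how many entries end up in human review.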

Voter summary

| Sub-stage | Voter 1 | Voter 2 | Voter 3 |
| --- | --- | --- | --- |
| round1_gloss | Qwen3.5-9B (local) | Groq Llama 3.3 70B | Gemini |
| round1_example | Qwen3.5-9B (local) | Groq Llama 3.3 70B | Gemini |
| round1_translations | Qwen3.5-9B (local) | Groq Llama 3.3 70B | Gemini |
| round1_cefr | Qwen3.5-9B (local) | Groq Llama 3.3 70B | Gemini |

With three voters, a majority requires at least two models to agree. Even if the local model gets a difficult sense wrong, the two cloud models will likely agree on the correct answer and outvote it.

Open questions

Wiktionary API context extraction

The Wiktionary API returns the full entry for a word including all senses. For a word like "free" with 8+ senses, dumping the entire entry into the prompt wastes tokens and may confuse the model. The open question is how to extract only the relevant sense — options include matching by sense_index, fuzzy-matching the Kaikki gloss against Wiktionary glosses, or letting the model see all senses and identify the correct one itself.
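The fuzzy-matching option could look roughly like the sketch below, which scores each Wiktionary gloss by token overlap (Jaccard similarity) with the Kaikki gloss. The tokenizer, the 0.2 cut-off, and the sample glosses are assumptions. A real implementation would probably also need stop-word handling and support for non-ASCII text.

```typescript
// Lowercases, splits on non-alphanumerics, and removes duplicate tokens.
function tokenize(text: string): string[] {
  const tokens = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
  return tokens.filter((t, i) => tokens.indexOf(t) === i);
}

// Jaccard similarity: shared tokens divided by all distinct tokens.
function jaccard(a: string[], b: string[]): number {
  const shared = a.filter((t) => b.indexOf(t) !== -1).length;
  const union = a.length + b.length - shared;
  return union === 0 ? 0 : shared / union;
}

// Returns the index of the best-matching gloss, or -1 if none scores above minScore.
function bestSenseIndex(
  kaikkiGloss: string,
  wiktionaryGlosses: string[],
  minScore = 0.2
): number {
  const target = tokenize(kaikkiGloss);
  let bestIndex = -1;
  let bestScore = minScore;
  for (let i = 0; i < wiktionaryGlosses.length; i++) {
    const score = jaccard(target, tokenize(wiktionaryGlosses[i] ?? ""));
    if (score > bestScore) {
      bestScore = score;
      bestIndex = i;
    }
  }
  return bestIndex;
}

// Made-up glosses for illustration only.
const sampleGlosses = [
  "A domesticated species of feline animal.",
  "To hoist the anchor.",
  "To vomit something.",
];

console.log(bestSenseIndex("To vomit.", sampleGlosses)); // 2
console.log(bestSenseIndex("Terms relating to people.", sampleGlosses)); // -1
```

The second call illustrates a limitation: a category header gloss shares almost no tokens with any real definition, so matching on the gloss alone cannot locate the sense for those entries.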

Batching prompt design

Batching 5-10 entries per API call multiplies effective daily capacity significantly. The prompt and validation logic for batched requests is more complex — the model must return a structured JSON object keyed by entry ID, and partial failures (one entry in a batch fails validation) need careful handling. Not yet designed or tested.
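As a starting point for that design, partial failures could be handled by validating each entry in the batch independently and queueing only the failed IDs for an individual retry. The response shape (a flat JSON object mapping entry ID to CEFR level) and all names below are assumptions, since the actual format has not been designed yet.

```typescript
interface BatchOutcome {
  accepted: Record<string, string>; // entry ID -> validated CEFR level
  retry: string[]; // entry IDs that were missing or invalid
}

const VALID_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"];

function splitBatchResponse(requestedIds: string[], rawJson: string): BatchOutcome {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    // Unparseable output: the whole batch has to be retried.
    return { accepted: {}, retry: requestedIds.slice() };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { accepted: {}, retry: requestedIds.slice() };
  }
  const byId = parsed as Record<string, unknown>;
  const outcome: BatchOutcome = { accepted: {}, retry: [] };
  for (const id of requestedIds) {
    const value = byId[id];
    if (typeof value === "string" && VALID_LEVELS.indexOf(value) !== -1) {
      outcome.accepted[id] = value;
    } else {
      outcome.retry.push(id); // missing or invalid: one bad entry does not sink the batch
    }
  }
  return outcome;
}

const batchOutcome = splitBatchResponse(["e1", "e2", "e3"], '{"e1":"A1","e2":"Z9"}');
console.log(batchOutcome); // { accepted: { e1: 'A1' }, retry: [ 'e2', 'e3' ] }
```

Iterating over the requested IDs rather than the returned keys means entries the model silently dropped are caught, and any extra keys it invented are ignored.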

Groq and Gemini API integration

Neither Groq nor Gemini is integrated into the pipeline yet. Both offer OpenAI-compatible endpoints, so integration should be straightforward: add provider configs to stage-3-enrich/config.ts and set the API keys in .env. The batching prompt design needs to be finalised first.
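A provider config along these lines might be enough. The `ProviderConfig` shape, the environment variable names, and the Gemini model ID are assumptions: the actual structure of stage-3-enrich/config.ts is not shown here, and the Gemini model is a placeholder to replace with whichever free-tier model is chosen.

```typescript
interface ProviderConfig {
  id: string;
  baseUrl: string; // OpenAI-compatible base URL
  model: string;
  apiKeyEnv: string; // name of the .env variable holding the key (hypothetical)
  requestsPerDay: number;
}

const cloudProviders: ProviderConfig[] = [
  {
    id: "groq-llama-3.3-70b",
    baseUrl: "https://api.groq.com/openai/v1",
    model: "llama-3.3-70b-versatile",
    apiKeyEnv: "GROQ_API_KEY",
    requestsPerDay: 1000,
  },
  {
    id: "gemini",
    baseUrl: "https://generativelanguage.googleapis.com/v1beta/openai",
    model: "gemini-2.0-flash", // placeholder model ID
    apiKeyEnv: "GEMINI_API_KEY",
    requestsPerDay: 1500,
  },
];

// Both providers expose the same chat completions path under their base URL.
function chatCompletionsUrl(provider: ProviderConfig): string {
  return `${provider.baseUrl.replace(/\/+$/, "")}/chat/completions`;
}

// Lists the env variable names that are not set, so the run can fail early.
function missingApiKeys(
  providers: ProviderConfig[],
  env: Record<string, string | undefined>
): string[] {
  return providers.filter((p) => !env[p.apiKeyEnv]).map((p) => p.apiKeyEnv);
}

console.log(cloudProviders.map(chatCompletionsUrl));
console.log(missingApiKeys(cloudProviders, { GROQ_API_KEY: "test-key" })); // [ 'GEMINI_API_KEY' ]
```

In a Node-based pipeline `process.env` would be passed as the second argument to `missingApiKeys`. Keeping `requestsPerDay` in the config lets a scheduler stop a voter before it hits the daily limit.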

OpenRouter free model rotation

OpenRouter's openrouter/free router selects a model at random from available free models. This means output style and quality vary between requests, which complicates round 2 voting where models review each other's candidates. May need to pin specific free models rather than using the router.

Qwen3.5-9B performance on hard cases

The 9B model has not yet been tested. It is expected to handle rare and specialized senses better than the 4B model but this has not been verified. Needs a test run against the same 50 entries used to evaluate the 4B model.

llama.cpp Gemma 4 bug

The llama.cpp chat template bug preventing reliable JSON output from Gemma 4 E4B may be fixed in a future release. The model fits in 4GB VRAM and would be a useful additional local voter if the bug is resolved. Worth checking periodically.

Full dataset scale

The current pipeline runs on a 500-entry sample per language. The full Kaikki English file contains approximately 1.3 million entries, of which a fraction will pass the POS and translation filters. The exact count and the time required to run all sub-stages across all models at full scale are not yet known.

Category header glosses

Kaikki occasionally uses category headers ("Terms relating to people.", "Terms relating to things.") as glosses. These are not real definitions and no model produces useful output for them. Options include pre-filtering them before the gloss sub-stage and generating a gloss purely from examples, or flagging them as a special case for human review.
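The pre-filtering option could start from a simple pattern check like the sketch below. It only covers the "Terms relating to ..." form quoted above. Whether Kaikki uses other header phrasings is not known, so the pattern is an assumption to extend as new cases turn up, and the route names are hypothetical.

```typescript
// Matches glosses that begin with the known category header phrasing.
const CATEGORY_HEADER_PATTERN = /^terms relating to\b/i;

function isCategoryHeaderGloss(gloss: string): boolean {
  return CATEGORY_HEADER_PATTERN.test(gloss.trim());
}

type GlossRoute = "special_case" | "gloss_stage";

// Decides whether an entry goes through the normal gloss sub-stage.
function routeGloss(gloss: string): GlossRoute {
  return isCategoryHeaderGloss(gloss) ? "special_case" : "gloss_stage";
}

console.log(routeGloss("Terms relating to people.")); // special_case
console.log(routeGloss("Terms relating to things.")); // special_case
console.log(routeGloss("A domesticated feline animal.")); // gloss_stage
```

Entries routed to `special_case` could then be sent either to example-based gloss generation or to human review, whichever of the two options is chosen.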

wget -O models/llama-3.1-8b-instruct-q4_k_m.gguf "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"

Q4_K_M (5.68GB — hybrid mode, better quality)

wget -O models/qwen3.5-9b-q4_k_m.gguf "https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-Q4_K_M.gguf"

Q3_K_S (4.32GB — might fit fully in VRAM)

wget -O models/qwen3.5-9b-q3_k_s.gguf "https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-Q3_K_S.gguf"