docs: rewrite data-pipeline.md for Kaikki migration

2026-05-05 17:14:48 +02:00
commit 38d8b85228
parent 87aeb072c5
4 changed files with 615 additions and 313 deletions
--- a/data-pipeline/audit.md
+++ b/data-pipeline/audit.md
@@ -0,0 +1,362 @@
 # OMW German Translation Quality Audit
 Instructions: for each entry, check if the German translations
 match the meaning described by the English gloss.
 Mark QUALITY as:
 OK — all German translations fit the meaning
 PARTIAL — some fit, some don't
 BAD — none of the German translations fit
 USELESS — translations are correct but useless for learners
 ---
 1.  [noun] ili:i98680
    EN gloss: the flowering part of a plant or arrangement of flowers on a stalk
    DE gloss: der blühende Teil einer Pflanze oder die Anordnung von Blüten an einem Stiel
    EN words: inflorescence
    DE words: Blütenstand, Infloreszenz
    QUALITY: correct
 2.  [verb] ili:i24675
    EN gloss: make motionless
    DE gloss: unbeweglich machen
    EN words: still
    DE words: stillen, zum Stillstand bringen
    QUALITY: "stillen" means to breastfeed, so it is completely wrong; "zum Stillstand bringen" is correct, but the gloss "unbeweglich machen" sounds odd and no one says it
 3.  [verb] ili:i22153
    EN gloss: lose interest or become bored with something or somebody
    DE gloss: das Interesse an etwas oder jemandem verlieren oder sich langweilen
    EN words: fatigue, jade, pall, tire, weary
    DE words: Langeweile erzeugen, anöden, ermüden, langweilen, sich langweilen, sich zu Tode langweilen, sich öden
    QUALITY: it's ok
 4.  [noun] ili:i74742
    EN gloss: zealous preaching and advocacy of the gospel
    DE gloss: eifriges Predigen und Eintreten für das Evangelium
    EN words: evangelism
    DE words: Evangelisation, Evangelisierung
    QUALITY: ok
 5.  [noun] ili:i115665
    EN gloss: an oxide of iron that is strongly attracted by magnets
    DE gloss: ein Eisenoxid, das stark von Magneten angezogen wird
    EN words: magnetic iron-ore, magnetite
    DE words: Eisenoxiduloxid, Magneteisen, Magneteisenstein, Magnetit
    QUALITY: ok
 6.  [adjective] ili:i17569
    EN gloss: of or relating to fatalism
    DE gloss: von oder im Zusammenhang mit Fatalismus
    EN words: fatalist, fatalistic
    DE words: auf alles gefasst, dem Schicksal ergeben, fatalistisch, gottergeben, schicksalsergeben
    QUALITY: ok
 7.  [adjective] ili:i682
    EN gloss: having no previous example or precedent or parallel
    DE gloss: ohne vorheriges Beispiel oder Präzedenzfall oder Parallele
    EN words: new, unexampled
    DE words: beispiellos, gab es noch nie, ohne Beispiel, ohne Präzedenzfall, ohnegleichen, präzedenzlos, sondergleichen, unvergleichbar
    QUALITY: ok
 8.  [noun] ili:i114018
    EN gloss: a soft silvery metallic element of the rare earth group; isotope 170 emits X-rays and is used in small portable X-ray machines; it occurs in monazite and apatite and xenotime
    DE gloss: ein weiches, silbriges Metallelement der Gruppe der Seltenen Erden; Isotop 170 emittiert Röntgenstrahlen und wird in kleinen tragbaren Röntgengeräten verwendet; es kommt in Monazit und Apatit sowie in Xenotim vor
    EN words: Tm, atomic number 69, thulium
    DE words: Terameter, Tm
    QUALITY: ok
 9.  [noun] ili:i117564
    EN gloss: the rate of some repeating event
    DE gloss: die Geschwindigkeit eines sich wiederholenden Ereignisses
    EN words: pace, tempo
    DE words: Takt, Tempo
    QUALITY: ok
 10. [verb] ili:i31619
    EN gloss: let drop or droop
    DE gloss: fallen oder hängen lassen
    EN words: hang
    DE words: am Galgen sterben lassen, aufhängen, aufknüpfen, erhängen, henken, hängen
    QUALITY: wrong? "let drop" means "fallen lassen", as in dropping something. I'm not sure here: does it really mean to hang someone? If so, then it's ok
 11. [noun] ili:i75571
    EN gloss: a heavy dull sound (as made by impact of heavy objects)
    DE gloss: ein schweres, dumpfes Geräusch (wie beim Aufprall schwerer Gegenstände)
    EN words: clump, clunk, thud, thump, thumping
    DE words: Geklacker, Geklapper, Klackern, Klappern
    QUALITY: ok
 12. [noun] ili:i92290
    EN gloss: a person who makes a promise
    DE gloss: eine Person, die ein Versprechen gibt
    EN words: promiser, promisor
    DE words: Freud'scher Versprecher, Lapsus Linguae, Versprecher, freudscher Versprecher
    QUALITY: completely wrong; a "Versprecher" is a slip of the tongue (you intend to say one thing but say something else), and it has nothing to do with "Versprechen" (a promise)
 13. [noun] ili:i59450
    EN gloss: a vertical well around which there is a stairway
    DE gloss: ein vertikaler Schacht, um den herum eine Treppe verläuft
    EN words: stairwell
    DE words: Ern, Flur, Hausflur, Stiegenhaus, Treppenhaus
    QUALITY: "Treppenhaus" would be the only correct one, right?
 14. [verb] ili:i21908
    EN gloss: smile affectedly or derisively
    DE gloss: affektiert oder spöttisch lächeln
    EN words: simper, smirk
    DE words: in sich hinein lächeln, schmunzeln, vor sich hin lächeln
    QUALITY: the gloss itself would also work as the words here? "schmunzeln" and "lächeln" mean roughly the same, but the "affektiert" and "spöttisch" nuance is missing
 15. [adjective] ili:i10887
    EN gloss: tending to reserve or introspection
    DE gloss: zur Zurückhaltung oder Introspektion neigend
    EN words: indrawn, withdrawn
    DE words: allein, einsam, eremitenhaft, eremitisch, für sich, solo, wie ein Einsiedler, wie ein Eremit, zurückgezogen
    QUALITY: ok
 16. [noun] ili:i113657
    EN gloss: a substance from which another substance is formed (especially by a metabolic reaction)
    DE gloss: ein Stoff, aus dem ein anderer Stoff gebildet wird (insbesondere durch eine Stoffwechselreaktion)
    EN words: precursor
    DE words: Ausgangsstoff, Edukt, Grundstoff, Präkursor, Vorläufer, biologische Vorstufe
    QUALITY: ok
 17. [adjective] ili:i13251
    EN gloss: tastelessly showy
    DE gloss: geschmacklos und auffällig
    EN words: brassy, cheap, flash, flashy, garish, gaudy, gimcrack, loud, meretricious, tacky, tatty, tawdry, trashy
    DE words: aufdringlich, marktschreierisch, reißerisch
    QUALITY: ok
 18. [noun] ili:i68734
    EN gloss: the branch of chemistry that studies the relation between chemical action and the amount of heat absorbed or generated
    DE gloss: der Zweig der Chemie, der die Beziehung zwischen chemischer Wirkung und der absorbierten oder erzeugten Wärmemenge untersucht
    EN words: thermochemistry
    DE words: Thermochemie, chemische Thermodynamik
    QUALITY: ok
 19. [adjective] ili:i12980
    EN gloss: distinguished from others in excellence
    DE gloss: durch hohe Qualität von anderen unterschieden
    EN words: outstanding
    DE words: I a, ausgezeichnet, außergewöhnlich, außerordentlich, besonders, bestens, eins a, exzeptionell, herausragend, schnafte, splendid, trefflich, vortrefflich, vorzüglich
    QUALITY: ok, but "eins a" / "I a" is really very colloquial, and I have never heard "schnafte" or "splendid"; the rest fits
 20. [verb] ili:i30043
    EN gloss: tear down so as to make flat with the ground
    DE gloss: abreißen, um den Boden zu ebnen
    EN words: dismantle, level, pull down, rase, raze, take down, tear down
    DE words: abreißen, aus den Augen verlieren, keinen Kontakt mehr haben zu, nicht länger in Kontakt stehen
    QUALITY: only "abreißen" is correct; the rest does not fit this context at all
 21. [adjective] ili:i14014
    EN gloss: desired or wished for or sought
    DE gloss: gewünscht oder gewünscht oder gesucht
    EN words: wanted
    DE words: benötigt, gesucht, gewünscht
    QUALITY: ok
 22. [verb] ili:i29481
    EN gloss: mar or spoil the appearance of
    DE gloss: das Aussehen verunstalten
    EN words: blemish, deface, disfigure
    DE words: deformieren, entstellen, verhunzen, verschandeln, verunstalten, verunzieren
    QUALITY: ok
 23. [verb] ili:i28605
    EN gloss: spread thickly
    DE gloss: dick auftragen
    EN words: slather
    DE words: beharken, bestreichen, mit Feuer belegen, mit Sperrfeuer belegen
    QUALITY: none of the words is really a synonym for "dick auftragen" (I don't even know whether the English word fits here)
 24. [noun] ili:i92029
    EN gloss: someone who is licensed to operate an aircraft in flight
    DE gloss: jemand, der eine Lizenz zum Führen eines Luftfahrzeugs im Flug hat
    EN words: airplane pilot, pilot
    DE words: Führer, Lotse, Pilot
    QUALITY: only "Pilot" is correct here
 25. [adjective] ili:i8221
    EN gloss: capable of being measured
    DE gloss: in der Lage, gemessen zu werden
    EN words: measurable, mensurable
    DE words: bestimmbar, der Messung zugänglich, erhebbar, mensurabel, messbar
    QUALITY: ok
 26. [noun] ili:i61380
    EN gloss: the spirit of a group that makes the members want the group to succeed
    DE gloss: der Geist einer Gruppe, der die Mitglieder dazu bringt, den Erfolg der Gruppe zu wollen
    EN words: esprit de corps, morale, team spirit
    DE words: Gruppengeist, Teamgeist
    QUALITY: "Gruppengeist" sounds odd, nobody says that; "Teamgeist" is fine
 27. [adjective] ili:i10497
    EN gloss: free of restrictions or qualifications
    DE gloss: Zustand, in dem in einer Wohnung niemand wohnt.
    EN words: clean, clear
    DE words: frei, leer stehend, leerstehend, unbewohnt, ungenutzt, verwaist
    QUALITY: ok
 28. [adjective] ili:i6238
    EN gloss: moving and bending with ease
    DE gloss: anmutig schlank und mit Leichtigkeit biegsam und beweglich
    EN words: lissom, lissome, lithe, lithesome, slender, supple, svelte, sylphlike
    DE words: elastisch, geschmeidig, schlangenartig
    QUALITY: \_\_\_
 29. [noun] ili:i57906
    EN gloss: station for the production and transmission of AM or FM radio broadcasts
    DE gloss: Sender für die Produktion und Übertragung von AM- oder FM-Radiosendungen
    EN words: radio station
    DE words: Radiosender, Rundfunkstation, Sender
    QUALITY: \_\_\_
 30. [noun] ili:i112045
    EN gloss: the purple or black-and-blue area resulting from a bruise
    DE gloss: der violette oder schwarzblaue Bereich, der durch einen Bluterguss entsteht
    EN words: ecchymosis
    DE words: Ekchymose, kleinflächige Hautblutung
    QUALITY: \_\_\_
 31. [adjective] ili:i10839
    EN gloss: capable of being replaced
    DE gloss: kann ersetzt werden
    EN words: replaceable
    DE words: austauschbar, ersetzbar, fungibel
    QUALITY: \_\_\_
 32. [verb] ili:i28714
    EN gloss: whip
    DE gloss: peitschen
    EN words: flagellate, scourge
    DE words: auspeitschen, flagellieren, geißeln, peitschen
    QUALITY: \_\_\_
 33. [noun] ili:i52826
    EN gloss: a mechanical or electrical explosive device or a small amount of explosive; can be used to initiate the reaction of a disrupting explosive
    DE gloss: ein mechanischer oder elektrischer Sprengkörper oder eine kleine Menge Sprengstoff; kann verwendet werden, um die Reaktion eines Sprengstoffs auszulösen
    EN words: cap, detonating device, detonator
    DE words: Auslöser, Zünder, Zündvorrichtung
    QUALITY: \_\_\_
 34. [noun] ili:i115477
    EN gloss: ice crystals forming a white deposit (especially on objects outside)
    DE gloss: Eiskristalle, die einen weißen Belag bilden (insbesondere auf Gegenständen im Freien)
    EN words: frost, hoar, hoarfrost, rime
    DE words: Raufrost, Raureif, Reif
    QUALITY: \_\_\_
 35. [noun] ili:i66650
    EN gloss: the ability to see in reduced illumination (as in moonlight)
    DE gloss: die Fähigkeit, bei reduzierter Beleuchtung zu sehen (wie bei Mondlicht)
    EN words: night vision, night-sight, scotopic vision, twilight vision
    DE words: Nachtsehen, skotopisches Sehen
    QUALITY: \_\_\_
 36. [verb] ili:i26849
    EN gloss: express or utter with a hiss
    DE gloss: mit einem Zischen ausdrücken oder aussprechen
    EN words: hiss, sibilate, siss, sizz
    DE words: Stimme dämpfen, flüstern, hauchen, hinter vorgehaltener Hand, ins Ohr sagen, leise sprechen, mit tonloser Stimme, munkeln, raunen, säuseln, tonlos, tuscheln, wispern, zischeln, zuflüstern
    QUALITY: \_\_\_
 37. [noun] ili:i94222
    EN gloss: a teenager or a young adult male
    DE gloss: ein Jugendlicher oder ein junger Erwachsener
    EN words: young buck, young man
    DE words: Bruder, Bürschchen, Cowboy, Freundchen, Jungs, Kinders, Kollege, Kollegin, Leute, Mann Gottes, Meister, Sportsfreund, Verehrtester, der Herr, guter Mann, junger Mann, mein Gutster, mein Herr
    QUALITY: \_\_\_
 38. [noun] ili:i49310
    EN gloss: dusky grey food fish found from Louisiana and Florida southward
    DE gloss: dunkelgrauer Speisefisch, der von Louisiana und Florida südwärts vorkommt
    EN words: Anisotremus surinamensis, black margate, pompon
    DE words: Pompon, Puschel, Tanzwedel
    QUALITY: \_\_\_
 39. [noun] ili:i50315
    EN gloss: a small vehicle with four wheels in which a baby or child is pushed around
    DE gloss: ein kleines Fahrzeug mit vier Rädern, in dem ein Säugling oder ein Kind herumgeschoben wird
    EN words: baby buggy, baby carriage, carriage, go-cart, perambulator, pram, pushchair, pusher, stroller
    DE words: Kinderwagen, Säuglingskutsche
    QUALITY: \_\_\_
 40. [verb] ili:i31857
    EN gloss: meet at a point
    DE gloss: sich an einem Punkt treffen
    EN words: cross, intersect
    DE words: gegen den Wind segeln, kreuzen
    QUALITY: \_\_\_
 41. [noun] ili:i51632
    EN gloss: a sailboat with two parallel hulls held together by single deck
    DE gloss: ein Boot mit zwei parallelen Rümpfen, die durch ein einziges Deck zusammengehalten werden
    EN words: catamaran
    DE words: Doppelrumpfboot, Katamaran, Zweirumpfboot
    QUALITY: \_\_\_
 42. [verb] ili:i34734
    EN gloss: to be found to exist
    DE gloss: als existent befunden werden
    EN words: occur
    DE words: anzutreffen sein, auftreten, nicht ausbleiben, vorkommen, zu finden sein, zu sehen sein
    QUALITY: \_\_\_
 43. [verb] ili:i25187
    EN gloss: assign too high a value to
    DE gloss: einen zu hohen Wert zuweisen
    EN words: overestimate, overvalue
    DE words: zu hoch bewerten, zu viel Gewicht beimessen, zu viel Wichtigkeit beimessen, überbewerten, überschätzen
    QUALITY: \_\_\_
 44. [noun] ili:i73844
    EN gloss: an expressive style of music
    DE gloss: ein ausdrucksstarker Musikstil
    EN words: genre, music genre, musical genre, musical style
    DE words: Genre, Musikgenre, Musikrichtung, Musikstil, Stilrichtung
    QUALITY: \_\_\_
 45. [noun] ili:i113026
    EN gloss: an abnormal condition in which cerebrospinal fluid collects in the ventricles of the brain; in infants it can cause abnormally rapid growth of the head and bulging fontanelles and a small face; in adults the symptoms are primarily neurological
    DE gloss: ein anormaler Zustand, bei dem sich Liquor in den Hirnventrikeln sammelt; bei Säuglingen kann er zu einem anormal schnellen Wachstum des Kopfes, zu wulstigen Fontanellen und einem kleinen Gesicht führen; bei Erwachsenen sind die Symptome hauptsächlich neurologisch
    EN words: hydrocephalus, hydrocephaly
    DE words: Gehirnwassersucht, Hydrocephalus, Hydrozephalus, Wasserkopf
    QUALITY: \_\_\_
 46. [noun] ili:i62720
    EN gloss: habitual uncleanliness
    DE gloss: gewohnheitsmäßige Unreinheit
    EN words: slovenliness
    DE words: Flickarbeit, Flickenteppich, Flickwerk, Gestümper, Mist, Murks, Murkserei, Pfusch, Pfuscharbeit, Pfuscherei, Schlamperei, Schlendrian, Schluderei, Schund, schlechte Arbeit
    QUALITY: \_\_\_
 47. [noun] ili:i80976
    EN gloss: the government agency in the United Kingdom that is responsible for internal security and counterintelligence overseas
    DE gloss: Regierungsbehörde im Vereinigten Königreich, die für die innere Sicherheit und die Spionageabwehr im Ausland zuständig ist.
    EN words: MI, Military Intelligence Section 6, Secret Intelligence Service
    DE words: MI6, SIS, Secret Intelligence Service, Secret Service, britischer Auslandsgeheimdienst
    QUALITY: \_\_\_
 48. [noun] ili:i60476
    EN gloss: an electrical device by which alternating current of one voltage is changed to another voltage
    DE gloss: ein elektrisches Gerät, mit dem Wechselstrom einer bestimmten Spannung in eine andere Spannung umgewandelt wird
    EN words: transformer
    DE words: Spannungswandler, Trafo, Transformator, Transformer
    QUALITY: \_\_\_
 49. [noun] ili:i37037
    EN gloss: wandering from the main path of a journey
    DE gloss: das Abweichen vom Hauptweg einer Reise
    EN words: digression, excursion
    DE words: Abschweifung, Abstecher, Einschub, Exkurs, Umschweif
    QUALITY: \_\_\_
 50. [noun] ili:i77288
    EN gloss: any meat that is minced and spiced and cooked as patties or used to fill sausages
    DE gloss: jegliches Fleisch, das zerkleinert und gewürzt und als Pasteten gekocht oder zur Füllung von Würsten verwendet wird
    EN words: sausage meat
    DE words: Brät, Wurstbrät
    QUALITY: \_\_\_
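Once the entries are marked, the results can be summarised by counting the `QUALITY:` lines. The following is a minimal sketch, not part of the commit: `tallyQuality` and `classifyMark` are hypothetical helpers. It assumes reviewers use the four labels from the instructions, and it puts free-text notes into an `other` bucket so they can be relabelled by hand.

```typescript
type Label = "OK" | "PARTIAL" | "BAD" | "USELESS";
type Bucket = Label | "unrated" | "other";

// The four labels listed in the audit instructions.
const LABELS: Label[] = ["OK", "PARTIAL", "BAD", "USELESS"];

// Map the raw text after "QUALITY:" to a bucket.
function classifyMark(raw: string): Bucket {
  // Drop markdown escaping so that \_\_\_ becomes ___ again.
  const value = raw.replace(/\\/g, "").trim().toUpperCase();
  if (value === "" || /^_+$/.test(value)) return "unrated";
  const position = LABELS.indexOf(value as Label);
  return position === -1 ? "other" : (LABELS[position] as Label);
}

// Count how many audit entries fall into each bucket.
function tallyQuality(markdown: string): Record<Bucket, number> {
  const counts: Record<Bucket, number> = {
    OK: 0,
    PARTIAL: 0,
    BAD: 0,
    USELESS: 0,
    unrated: 0,
    other: 0,
  };
  for (const line of markdown.split("\n")) {
    const match = /^\s*QUALITY:\s*(.*)$/.exec(line);
    if (match === null) continue; // not a QUALITY line
    counts[classifyMark(match[1] ?? "")] += 1;
  }
  return counts;
}
```

Labels are matched case-insensitively, so a lowercase `ok` counts as `OK`; anything longer than a bare label lands in `other`.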
--- a/data-pipeline/audit.ts
+++ b/data-pipeline/audit.ts
@@ -0,0 +1,87 @@
 import Database from "better-sqlite3";
 import path from "node:path";
 import fs from "node:fs";
 import { fileURLToPath } from "node:url";
 const __dirname = path.dirname(fileURLToPath(import.meta.url));
 const DB_PATH = path.join(__dirname, "db/pipeline.db");
 const db = new Database(DB_PATH, { readonly: true });
 // Pull 50 random synsets that have at least one German translation. The sample is uniform, not stratified by POS.
 const synsets = db
  .prepare(
    `
    SELECT DISTINCT s.source_id, s.pos
    FROM synsets s
    JOIN translations t ON t.source_id = s.source_id
    WHERE t.language = 'de'
    ORDER BY RANDOM()
    LIMIT 50
  `,
  )
  .all() as { source_id: string; pos: string }[];
 const results: string[] = [];
 let index = 0;
 for (const synset of synsets) {
  index++;
  const glosses = db
    .prepare("SELECT language, text FROM glosses WHERE source_id = ?")
    .all(synset.source_id) as { language: string; text: string }[];
  const enGloss = glosses.find((g) => g.language === "en")?.text ?? "—";
  const deGloss = glosses.find((g) => g.language === "de")?.text ?? "—";
  const deTranslations = db
    .prepare(
      "SELECT word FROM translations WHERE source_id = ? AND language = 'de'",
    )
    .all(synset.source_id) as { word: string }[];
  const enTranslations = db
    .prepare(
      "SELECT word FROM translations WHERE source_id = ? AND language = 'en'",
    )
    .all(synset.source_id) as { word: string }[];
  const deWords = deTranslations.map((t) => t.word);
  const enWords = enTranslations.map((t) => t.word);
  results.push(
    [
      `${String(index).padStart(2, " ")}. [${synset.pos}] ${synset.source_id}`,
      `    EN gloss: ${enGloss}`,
      `    DE gloss: ${deGloss}`,
      `    EN words: ${enWords.join(", ")}`,
      `    DE words: ${deWords.join(", ")}`,
      `    QUALITY:  ___`,
      ``,
    ].join("\n"),
  );
 }
 const output = [
  "# OMW German Translation Quality Audit",
  "",
  "Instructions: for each entry, check if the German translations",
  "match the meaning described by the English gloss.",
  "",
  "Mark QUALITY as:",
  "  OK    — all German translations fit the meaning",
  "  PARTIAL — some fit, some don't",
  "  BAD   — none of the German translations fit",
  "  USELESS — translations are correct but useless for learners",
  "",
  "---",
  "",
  ...results,
 ].join("\n");
 const outPath = path.join(__dirname, "audit.md");
 fs.writeFileSync(outPath, output, "utf-8");
 console.log(`Wrote ${synsets.length} entries → ${outPath}`);
 db.close();
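The query in this script draws its 50 rows uniformly at random, so how many rows each part of speech receives is left to chance. If an even split per POS is wanted, for example about 12 rows each, one option is to fetch all candidate rows and sample per group in application code. This is a sketch under that assumption; `stratifiedSample` and `shuffleCopy` are hypothetical helpers, not part of the commit.

```typescript
interface SynsetRow {
  source_id: string;
  pos: string;
}

// Fisher-Yates shuffle on a copy; `random` is injectable for deterministic tests.
function shuffleCopy<T>(items: T[], random: () => number): T[] {
  const copy = items.slice();
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = copy[i] as T;
    copy[i] = copy[j] as T;
    copy[j] = tmp;
  }
  return copy;
}

// Take up to `perPos` random rows from each part-of-speech group.
function stratifiedSample(
  rows: SynsetRow[],
  perPos: number,
  random: () => number = Math.random,
): SynsetRow[] {
  // Prototype-free object so POS strings cannot collide with built-in keys.
  const byPos: Record<string, SynsetRow[]> = Object.create(null);
  for (const row of rows) {
    const bucket = byPos[row.pos] || (byPos[row.pos] = []);
    bucket.push(row);
  }
  let sample: SynsetRow[] = [];
  // Sorted keys keep the output order stable between runs.
  for (const pos of Object.keys(byPos).sort()) {
    const bucket = byPos[pos] || [];
    sample = sample.concat(shuffleCopy(bucket, random).slice(0, perPos));
  }
  return sample;
}
```

Groups with fewer than `perPos` rows contribute everything they have, so the result can be smaller than `perPos` times the number of groups.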
--- a/data-pipeline/db/schema.sql
+++ b/data-pipeline/db/schema.sql
@@ -64,6 +64,13 @@ CREATE TABLE IF NOT EXISTS model_cefr_votes (
  UNIQUE (translation_id, model_name)
 );
 CREATE TABLE IF NOT EXISTS model_translation_rejections (
  id             INTEGER PRIMARY KEY,
  translation_id INTEGER NOT NULL REFERENCES translations(id),
  model_name     TEXT    NOT NULL,
  UNIQUE (translation_id, model_name)
 );
 CREATE TABLE IF NOT EXISTS generated_glosses (
  id         INTEGER PRIMARY KEY,
  source_id  TEXT    NOT NULL REFERENCES synsets(source_id),
--- a/documentation/data-pipeline.md
+++ b/documentation/data-pipeline.md
@@ -1,22 +1,14 @@
 # lila data pipeline
-> **NOTE: BEFORE RUNNING THE PIPELINE, CONSIDER IMPROVING THE CEFR SOURCE
-> FILES IN `stage-2-annotate/sources/cefr/`. BETTER SOURCE COVERAGE MEANS
-> FEWER WORDS FOR THE LLM TO ANNOTATE FROM SCRATCH, FASTER OVERNIGHT RUNS,
-> AND HIGHER CONFIDENCE IN THE FINAL OUTPUT. SEE UNIVERSALCEFR
-> (huggingface.co/UniversalCEFR) AND CEFR-J
-> (github.com/openlanguageprofiles/olp-en-cefrj) AS STARTING POINTS.**
-This pipeline extracts vocabulary data from the Open Multilingual Wordnet (OMW), annotates it with CEFR levels from curated source files, verifies and enriches annotations using local LLMs, and produces authoritative output in `pipeline.db`. This database is consumed by the sync script to populate the production database with terms, translations, glosses, CEFR levels, difficulty ratings, and LLM-generated descriptions.
+This pipeline extracts vocabulary data from Wiktionary via the Kaikki dataset, enriches it with CEFR levels and fills content gaps using local LLMs, and produces authoritative output in `pipeline.db`. This database is consumed by the sync script to populate the production database with vocabulary entries, translations, glosses, CEFR levels, and difficulty ratings.
 ## Overview
 ```mermaid
 flowchart LR
-    omw[(OMW SQLite DBs)]
+    kaikki[(Kaikki JSONL)]
    cefr[(CEFR JSON files)]
    extract[Extract]
-    annotate[Annotate]
+    reverselink[Reverse Link Sync]
    enrich[Enrich]
    pipelinedb[(pipeline.db)]
    merge[Merge]
@@ -25,10 +17,11 @@ flowchart LR
    sync[Sync]
    db[(PostgreSQL)]
-    omw --> extract
+    kaikki --> extract
-    cefr --> annotate
+    extract --> pipelinedb
-    extract --> annotate
+    pipelinedb --> reverselink
-    annotate --> enrich
+    reverselink --> pipelinedb
    pipelinedb --> enrich
    enrich --> pipelinedb
    pipelinedb --> merge
    merge --> pipelinedb
@@ -39,88 +32,75 @@ flowchart LR
    sync --> db
 ```
-Each stage is a standalone script that reads from the previous stage's output. Stages 1 and 2 read and write JSON files. From stage 3 onwards, all output is written to `pipeline.db` — a SQLite database that tracks processing status, LLM output, votes, and resolved records. This makes overnight LLM runs fully resumable and protects against data loss if a run is interrupted.
+Each stage is a standalone script that reads from and writes to `pipeline.db`. The pipeline is fully resumable — interrupted overnight runs pick up from the last processed record without losing work.
 Stage 1 is a manual prerequisite and is not run by the pipeline orchestrator. See **Stage 1 — Extract** for instructions.
-The enrich stage is designed to run overnight, one model at a time. Each model processes every word and writes results to `pipeline.db` atomically per record — interrupted runs resume from the last unprocessed record.
+The enrich stage is designed to run overnight, one model at a time. Each model processes every entry and writes results to `pipeline.db` atomically per record.
-Only fully resolved records reach the production database. Records where LLMs could not reach a majority vote are handled automatically by the tiebreaker stage before seeding.
+Only fully resolved records reach the production database. Records where LLMs could not reach a majority vote are handled automatically by the tiebreaker stage before syncing.
 ## pipeline.db
-All pipeline state from stage 3 onwards is stored in `pipeline.db` — a SQLite
-database in `data-pipeline/db/`. It is created automatically on first run and
-is not committed to git.
+All pipeline state is stored in `pipeline.db` — a SQLite database in `data-pipeline/db/`. It is created automatically on first run and is not committed to git.
 The database serves three purposes:
-- **Resumability** — every record is written atomically with a status. Interrupted
-  overnight runs resume from the last pending record without losing work.
-- **Vote tracking** — all model votes for CEFR levels and generated text are
-  stored per model per record, giving full auditability of how every decision
-  was reached.
-- **Resolved output** — the final resolved records live here and are read by
-  the sync script to seed the production database.
+- **Resumability** — every record is written atomically with a status. Interrupted overnight runs resume from the last pending record without losing work.
+- **Vote tracking** — all model votes for CEFR levels and generated content are stored per model per record, giving full auditability of how every decision was reached.
+- **Resolved output** — the final resolved records live here and are read by the sync script to seed the production database.
 The schema is defined in `data-pipeline/db/schema.sql`. Never edit `pipeline.db` directly — all writes go through the pipeline scripts.
-On first run the orchestrator initialises `pipeline.db` automatically and imports the stage 2 output into the base tables. This happens once — subsequent runs skip the import if the base tables are already populated.
+On first run the orchestrator initialises `pipeline.db` automatically and imports the stage 1 output into the base tables. This happens once — subsequent runs skip the import if the base tables are already populated.
-## Data sources
+## Data source
-### OMW / WordNet
+### Kaikki (Wiktionary)
-The Open Multilingual Wordnet (OMW) is the base vocabulary source. It provides synsets — groups of synonymous words — with translations and glosses across multiple languages. One SQLite database per language is downloaded and placed in `sources/omw/`. These files are not committed to git.
+The pipeline uses pre-extracted Wiktionary data from [kaikki.org](https://kaikki.org), built with the [wiktextract](https://github.com/tatuylonen/wiktextract) tool. This data is updated weekly from the English Wiktionary dump and is freely available under the same license as Wiktionary (CC-BY-SA).
-All four parts of speech are extracted: noun, verb, adjective, adverb. WordNet's adjective satellites are collapsed into adjective — this is a WordNet-internal distinction that has no relevance for language learning. Alongside translations and glosses, usage examples are extracted where available and stored in the database as term_examples.
+**Why Kaikki instead of OMW:**
+Kaikki is structured per word sense. Each headword has multiple senses, and translations are linked to a specific sense rather than a general concept. This prevents the sense disambiguation problems found in OMW, where a single concept entry could contain translations from entirely different meanings of a word.
-See **Setup** for download instructions.
+Each Kaikki entry provides:
-### CEFR source files
+- A headword in the entry language
+- One or more senses, each with a gloss and examples
+- Per-sense translations to other languages with sense hints
+- IPA pronunciations and audio file references (deferred — see **Further extensions**)
+- Inflected forms (deferred — see **Further extensions**)
-Per-language JSON files in `sources/cefr/` provide the initial CEFR level annotations. These files do not cover the full vocabulary extracted from OMW — coverage varies by language. Gaps and disagreements are handled by the enrich stage.
+The pipeline uses the English Wiktionary edition (`enwiktionary`), which contains entries for all five supported languages with glosses in English.
-| Language | File                   |
-| -------- | ---------------------- |
-| English  | `sources/cefr/en.json` |
-| Italian  | `sources/cefr/it.json` |
-| Spanish  | `sources/cefr/es.json` |
-| German   | `sources/cefr/de.json` |
-| French   | `sources/cefr/fr.json` |
+### CEFR levels
-These files are committed to git. For per-language coverage detail see `COVERAGE.md`.
-### CEFR annotation and verification
-CEFR levels are determined by a majority vote combining all available sources:
-- The CEFR source file counts as one vote (if it has an entry for the word)
-- Each LLM model run counts as one vote
-The LLMs verify existing annotations as well as filling gaps — a source file entry does not automatically win. Majority vote across all sources determines the final level.
-Words appearing in the CEFR source file multiple times with different CEFR levels are written to `conflicts.json` and excluded from `cefr_source_votes`. They are still present in `translations` and the LLMs vote on them like any other unannotated word — the conflict is resolved by majority vote.
-If no majority is reached after all model runs, the word is handled automatically by the tiebreaker stage.
+CEFR levels are assigned entirely by LLM majority vote. Each model receives the headword, gloss, and an example sentence and votes on the appropriate level (A1–C2). There are no curated source files — the LLMs are the sole source of CEFR annotations.
+If no majority is reached after all model runs, the entry is handled automatically by the tiebreaker stage.
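Majority voting over CEFR votes can be sketched as follows. The text does not define "majority" precisely, so this example assumes a strict majority (more than half of all votes cast), and `resolveCefr` is a hypothetical name rather than a function from the pipeline. It takes a flat list of votes, whatever their origin, and returns `null` when no level wins, which is the case handed to the tiebreaker stage.

```typescript
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

const CEFR_LEVELS: CefrLevel[] = ["A1", "A2", "B1", "B2", "C1", "C2"];

// Return the level backed by a strict majority of votes, or null if there is none.
function resolveCefr(votes: CefrLevel[]): CefrLevel | null {
  const counts: Record<CefrLevel, number> = {
    A1: 0,
    A2: 0,
    B1: 0,
    B2: 0,
    C1: 0,
    C2: 0,
  };
  for (const vote of votes) counts[vote] += 1;
  for (const level of CEFR_LEVELS) {
    // Strict majority: more than half of all votes (an assumption, see above).
    if (counts[level] * 2 > votes.length) return level;
  }
  return null; // no majority; left for the tiebreaker stage
}
```

With three models, two matching votes are enough; a three-way split or an even tie between two models yields `null`.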
 ## Setup
-### OMW databases
+### Kaikki data files
-Download the OMW SQLite database for each language using the `wn` Python
-library:
+Download the pre-extracted Kaikki JSONL files for each language. These are large files — download them to `stage-1-extract/sources/` which is not committed to git.
 ```bash
-python -m wn download omw-en:1.4
-python -m wn download omw-it:1.4
-python -m wn download omw-de:1.4
-python -m wn download omw-es:1.4
-python -m wn download omw-fr:1.4
-```
-The data is stored automatically at `~/.wn_data/wn.db` and is not committed
-to git.
+mkdir -p stage-1-extract/sources
+cd stage-1-extract/sources
+# English entries (contains translations to all other languages)
+wget https://kaikki.org/dictionary/English/kaikki.org-dictionary-English.jsonl.gz
+# Per-language files (for entries written in those languages)
+wget https://kaikki.org/dictionary/German/kaikki.org-dictionary-German.jsonl.gz
+wget https://kaikki.org/dictionary/Italian/kaikki.org-dictionary-Italian.jsonl.gz
+wget https://kaikki.org/dictionary/French/kaikki.org-dictionary-French.jsonl.gz
+wget https://kaikki.org/dictionary/Spanish/kaikki.org-dictionary-Spanish.jsonl.gz
+# Decompress
+gunzip *.gz
 ```
 ### LLM setup
@@ -128,180 +108,97 @@ See `llm-setup.md`.
 ## Pipeline stages
-The pipeline runs in six stages plus a tiebreaker. Each stage is independent and can be re-run without affecting the others.
-
-| Stage        | What it does                                                         |
-| ------------ | -------------------------------------------------------------------- |
-| 1. Extract   | Reads OMW SQLite database, outputs normalized JSON per language      |
-| 2. Annotate  | Merges CEFR source files into extracted data, adds source file votes |
-| 3. Enrich    | Runs local LLMs in two rounds — generation then voting               |
-| 4. Merge     | Resolves votes, derives difficulty, splits into final and flagged    |
-| 4b. Tiebreak | Runs unused models on flagged translations until majority is reached |
-| 5. Compare   | Generates COVERAGE.md with detailed quality report                   |
-| 6. Sync      | Upserts resolved records into production PostgreSQL                  |
+| Stage           | What it does                                                             |
+| --------------- | ------------------------------------------------------------------------ |
+| 1. Extract      | Parses Kaikki JSONL, imports entries into `pipeline.db`                  |
+| 2. Reverse link | Inserts missing reverse translations between language pairs              |
+| 3. Enrich       | LLMs fill translation gaps, improve glosses/examples, assign CEFR levels |
+| 4. Merge        | Resolves LLM votes into final values                                     |
+| 4b. Tiebreak    | Runs unused models on flagged entries until majority is reached          |
+| 5. Compare / QA | Generates `COVERAGE.md` with detailed quality report                     |
+| 6. Sync         | Upserts resolved records into production PostgreSQL                      |
 ### 1. Extract
-Reads the OMW SQLite database (`~/.wn_data/wn.db`) and produces a single normalized JSON file containing all synsets with their translations, glosses, and usage examples across all five languages and all parts of speech. Adjective satellites are collapsed into adjective at this stage.
+Parses the Kaikki JSONL files for all five languages and imports them into the base tables of `pipeline.db`. Filters to the four supported parts of speech: noun, verb, adjective, adverb. Each Kaikki sense becomes one row in `vocabulary_entries`. Translations are stored in `entry_translations` with their sense hints.
-**Input:** `~/.wn_data/wn.db`
+**Input:** `stage-1-extract/sources/*.jsonl`
-**Output:** `stage-1-extract/output/omw.json`
+**Output:** `pipeline.db` — `vocabulary_entries` and `entry_translations` tables populated
 ```bash
-python stage-1-extract/scripts/extract.py
+pnpm --filter @lila/pipeline extract
 ```
-Add `--sample` to extract 100 synsets for inspection before running the full
+Add `--sample 100` to import only 100 entries per language for inspection before running the full import.
 extraction.
-Each record in the output looks like this:
+Shown as JSON for readability, each imported entry (one `vocabulary_entries` row plus its `entry_translations` rows) looks like this:
 ```json
 {
-  "source_id": "ili:i1",
+  "headword": "thrill",
-  "pos": "adjective",
+  "language": "en",
-  "translations": {
+  "pos": "verb",
-    "en": ["able"],
+  "sense_index": 0,
-    "it": ["abile", "intelligente", "valente", "capace"],
+  "gloss": "To suddenly excite someone, or to give them great pleasure.",
-    "es": ["capaz"],
+  "examples": ["The movie thrilled the audience."],
-    "fr": ["comptable"]
+  "translations": [
-  },
+    { "language": "de", "word": "begeistern", "sense_hint": "suddenly excite" },
-  "glosses": {
+    {
-    "en": [
+      "language": "fr",
-      "(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"
+      "word": "enthousiasmer",
-    ]
+      "sense_hint": "suddenly excite"
-  },
+    },
-  "examples": { "en": ["able to swim", "she was able to program her computer"] }
+    { "language": "it", "word": "entusiasmare" },
    { "language": "es", "word": "emocionar" }
  ]
 }
 ```
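As a rough illustration of the extract step, the sketch below turns one Kaikki JSONL line into one entry per sense and drops unsupported parts of speech. The Kaikki field names (`word`, `lang_code`, `pos`, `senses`, `glosses`, `examples`, `translations`, `code`, `sense`), the POS codes `adj` and `adv`, and the function `parseKaikkiLine` are assumptions made for this example and are not taken from the pipeline source.

```typescript
// Assumed shape of one Kaikki JSONL line (illustrative only).
interface KaikkiLine {
  word: string;
  lang_code: string;
  pos: string;
  senses?: { glosses?: string[]; examples?: { text?: string }[] }[];
  translations?: { code?: string; word?: string; sense?: string }[];
}

interface VocabularyEntry {
  headword: string;
  language: string;
  pos: string;
  sense_index: number;
  gloss: string | null;
  examples: string[];
}

interface EntryTranslation {
  headword: string;
  language: string; // target language code
  word: string;
  sense_hint: string | null;
}

// Assumed mapping from Kaikki POS codes to the four supported parts of speech.
const SUPPORTED_POS = new Map<string, string>([
  ["noun", "noun"],
  ["verb", "verb"],
  ["adj", "adjective"],
  ["adv", "adverb"],
]);

function parseKaikkiLine(
  line: string,
  targetLanguages: string[],
): { entries: VocabularyEntry[]; translations: EntryTranslation[] } {
  const raw = JSON.parse(line) as KaikkiLine;
  const pos = SUPPORTED_POS.get(raw.pos);
  if (pos === undefined) return { entries: [], translations: [] };

  // One entry per sense; the first gloss string is used as the gloss.
  const entries: VocabularyEntry[] = (raw.senses ?? []).map((sense, index) => {
    const examples: string[] = [];
    for (const example of sense.examples ?? []) {
      if (example.text) examples.push(example.text);
    }
    return {
      headword: raw.word,
      language: raw.lang_code,
      pos,
      sense_index: index,
      gloss: sense.glosses?.[0] ?? null,
      examples,
    };
  });

  // Keep only translations into the supported target languages.
  const translations: EntryTranslation[] = [];
  for (const t of raw.translations ?? []) {
    if (t.code && t.word && targetLanguages.indexOf(t.code) !== -1) {
      translations.push({
        headword: raw.word,
        language: t.code,
        word: t.word,
        sense_hint: t.sense ?? null,
      });
    }
  }
  return { entries, translations };
}
```

A real import would stream the file line by line and write both lists to `pipeline.db` instead of returning them.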
-Note: glosses and examples are not available for all languages. French and Spanish have no glosses or examples in the current OMW database — these will be generated by the LLM in the enrich stage. Coverage detail is in `COVERAGE.md`.
+> **Note:** Stage 1 is a manual prerequisite. It is not run by the pipeline orchestrator (`pipeline.ts`). Run it once before running the orchestrator for the first time, and re-run it manually if the Kaikki source files are updated.
-> **Note:** Stage 1 is a manual prerequisite. It is not run by the pipeline
+### 2. Reverse link sync
 > orchestrator (`pipeline.ts`). Run it once before running the orchestrator
 > for the first time, and re-run it manually if the OMW data changes.
-### 2. Annotate
+A pure script stage with no LLM calls. For each translation pair in `entry_translations`, the script checks whether the reverse link exists. If English _thrill → begeistern_ exists and the German entry _begeistern_ exists in `vocabulary_entries` but lacks the English back-link, the back-link is inserted automatically.
-Reads the combined OMW extract and merges CEFR source data into it. Each translation in each language is matched against the corresponding CEFR source file by word text and part of speech. Matched translations receive a `cefr_source` vote which carries into the enrich stage. Unmatched translations proceed without a vote.
+This runs before the enrich stage so that LLMs only generate translations that are genuinely missing — not translations that would be found by a simple reverse lookup.
-This stage also extracts native example sentences from the CEFR source files and adds them to the record alongside OMW examples, with `source: "cefr"` to distinguish them.
+**Input:** `pipeline.db` — populated `vocabulary_entries` and `entry_translations`
-
+**Output:** `pipeline.db` — missing reverse links inserted into `entry_translations`
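The reverse lookup described above can be sketched as a set difference. This is a minimal in-memory illustration that works at word level and ignores senses; the type `TranslationLink` and the functions `entryKey`, `linkKey`, and `findMissingReverseLinks` are hypothetical names, and the real stage presumably operates on the `pipeline.db` tables directly.

```typescript
interface TranslationLink {
  fromLanguage: string;
  fromWord: string;
  toLanguage: string;
  toWord: string;
}

// Stable string keys so entries and links can be stored in Sets.
const entryKey = (language: string, word: string): string =>
  JSON.stringify([language, word]);
const linkKey = (l: TranslationLink): string =>
  JSON.stringify([l.fromLanguage, l.fromWord, l.toLanguage, l.toWord]);

function findMissingReverseLinks(
  links: TranslationLink[],
  existingEntries: Set<string>, // keys built with entryKey()
): TranslationLink[] {
  const present = new Set<string>(links.map(linkKey));
  const missing: TranslationLink[] = [];
  for (const link of links) {
    const reverse: TranslationLink = {
      fromLanguage: link.toLanguage,
      fromWord: link.toWord,
      toLanguage: link.fromLanguage,
      toWord: link.fromWord,
    };
    // Only link back if the translated word has an entry of its own.
    if (!existingEntries.has(entryKey(reverse.fromLanguage, reverse.fromWord))) continue;
    // Skip links that already exist or were already queued.
    if (present.has(linkKey(reverse))) continue;
    present.add(linkKey(reverse));
    missing.push(reverse);
  }
  return missing;
}
```

In the _thrill → begeistern_ example, the function returns the single link _begeistern → thrill_ as long as _begeistern_ has its own entry.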
 Words appearing in the CEFR source file multiple times with different CEFR levels are written to `conflicts.json` and excluded from source voting. The LLMs handle these words like any other unannotated word.
 **Input:** `stage-1-extract/output/omw.json` + `stage-2-annotate/sources/cefr/{lang}.json`
 **Output:**
 - `stage-2-annotate/output/{lang}.json` — one per language
 - `stage-2-annotate/output/conflicts.json` — cross-language conflicts for reference
 ```bash
-pnpm --filter @lila/pipeline annotate
+pnpm --filter @lila/pipeline reverse-link
 ```
 Each record in the output extends the OMW record with a `votes` field and any additional examples from the CEFR source file:
 ```json
 {
  "source_id": "ili:i1",
  "pos": "adjective",
  "translations": {
    "en": ["able"],
    "it": ["abile", "intelligente", "valente", "capace"],
    "es": ["capaz"],
    "fr": ["comptable"]
  },
  "glosses": { "en": ["having the necessary means or skill to do something"] },
  "examples": {
    "en": [
      { "text": "able to swim", "source": "omw" },
      { "text": "She was able to finish the task.", "source": "cefr" }
    ]
  },
  "votes": { "en": { "able": { "cefr_source": "B1" } } }
 }
 ```
 Words not present in the CEFR source file will have an empty `votes` object.
 ### 3. Enrich
-> **Note:** Before running this stage, ensure the llama.cpp server is running
+The enrich stage runs LLMs to fill three types of gaps, in this order:
 > locally. The orchestrator checks for a running server at
 > `http://127.0.0.1:8080/health` and exits with instructions if it is not
 > reachable. See `llm-setup.md` for setup instructions.
-The enrich stage runs in two rounds, both designed to execute overnight one model at a time. All output is written to `pipeline.db` atomically per record — runs are fully resumable if interrupted. Each model is run once — one model produces one vote.
+**A — Missing translations:** for each entry that has no translation in one or more supported languages after reverse link sync, the LLM generates the best translation for that language given the entry's headword, gloss, and examples.
-**Round 1 — generation**
+**B — Weak glosses and examples:** for each entry where the gloss is missing or the examples are missing, the LLM generates a natural, learner-friendly gloss and one usage example in the entry's language.
-Each model processes every word in every language one term at a time and generates:
+**C — CEFR levels:** for every entry, the LLM assigns a CEFR level (A1–C2) based on the headword, gloss, and examples. This runs for all entries regardless of whether other enrichment was needed.
- A CEFR level vote for each translation
+All output is written to `pipeline.db` atomically per entry — runs are fully resumable if interrupted. Each model is run once — one model produces one vote.
 - A description for each language
 - A translation for each language, only if OMW provides none
 - A gloss for each language, only if OMW provides none
 - Usage examples for each language, only if OMW provides none
-OMW data is never duplicated — the script checks what OMW already provides before building the prompt. For translations, glosses and examples, if OMW data exists for that language the LLM skips generation entirely. This significantly reduces compute time for languages with good OMW coverage such as English.
+> **Note:** Before running this stage, ensure the llama.cpp server is running locally. The orchestrator checks for a running server at `http://127.0.0.1:8080/health` and exits with instructions if it is not reachable. See `llm-setup.md` for setup instructions.
-All model-generated content is stored with an anonymised source (`model_1`, `model_2` etc.) so models cannot be biased by knowing who generated what in round 2.
+**Input:** `pipeline.db` — entries after reverse link sync
-
+**Output:** `pipeline.db` — LLM-generated translations, glosses, examples, and CEFR votes
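To make the A, B, C ordering concrete, here is a small sketch of how the work for a single entry might be planned. The names `EnrichTask`, `EntryState`, and `planEnrichment` are hypothetical, and the list of supported languages is passed in as a parameter, not read from the real constants.

```typescript
type EnrichTask =
  | { kind: "translation"; targetLanguage: string }
  | { kind: "gloss_and_example" }
  | { kind: "cefr" };

interface EntryState {
  language: string;
  gloss: string | null;
  examples: string[];
  translatedLanguages: string[]; // languages that already have a translation
}

function planEnrichment(entry: EntryState, supportedLanguages: string[]): EnrichTask[] {
  const tasks: EnrichTask[] = [];

  // A: one task per supported language that still lacks a translation.
  for (const language of supportedLanguages) {
    if (language === entry.language) continue;
    if (entry.translatedLanguages.indexOf(language) === -1) {
      tasks.push({ kind: "translation", targetLanguage: language });
    }
  }

  // B: generate a gloss and an example if either one is missing.
  if (entry.gloss === null || entry.examples.length === 0) {
    tasks.push({ kind: "gloss_and_example" });
  }

  // C: every entry gets a CEFR vote, whether or not A or B applied.
  tasks.push({ kind: "cefr" });
  return tasks;
}
```

An entry that already has a gloss, examples, and all translations still yields one `cefr` task, which matches the rule that CEFR assignment runs for all entries.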
 Each record is written to `pipeline.db` with status `complete` or `needs_review` immediately after processing. If a record fails structural validation (invalid JSON, missing required fields, invalid CEFR value) it is marked `needs_review` and skipped — the run continues without interruption.
 **Input:** `stage-2-annotate/output/{lang}.json`
 **Output:** `pipeline.db` — round 1 results per record per model
 ```bash
-pnpm --filter @lila/pipeline enrich --round 1 --model {model}
+pnpm --filter @lila/pipeline run --name "night-1"
 ```
 **Compiling candidates**
 Once all round 1 runs are complete, compile all generated candidates into a single structured record per term in `pipeline.db`. This is the input to round 2.
 ```bash
 pnpm --filter @lila/pipeline enrich --compile-candidates
 ```
 **Round 2 — voting**
 Each model receives the compiled candidate list for every word and votes on:
 - The best gloss candidate (if multiple exist)
 - The best description candidate (if multiple exist)
 - The best usage examples candidate (if multiple exist)
 - A CEFR level vote for each translation
 OMW data is not put to a vote — it automatically wins over any LLM-generated candidate. Round 2 only resolves conflicts between model-generated candidates. The prompt is kept small — one word at a time, a clean numbered candidate list — to fit within a limited context window.
 **Input:** `pipeline.db` — compiled candidates
 **Output:** `pipeline.db` — round 2 votes per record per model
 ```bash
 pnpm --filter @lila/pipeline enrich --round 2 --model {model}
 ```
 **Compiling votes**
 Once all round 2 runs are complete, compile all votes into a final votes record per term in `pipeline.db`. This is the input to the merge stage.
 ```bash
 pnpm --filter @lila/pipeline enrich --compile-votes
 ```
 ### 4. Merge
-Reads compiled votes from `pipeline.db` and resolves the final value for every field. Updates each record in `pipeline.db` with status `final` or `flagged`.
+Reads all LLM votes from `pipeline.db` and resolves the final value for every field. Writes resolved entries back to `pipeline.db`.
 **Merge rules:**
- OMW data wins automatically and is never overridden
+- Kaikki source data wins automatically and is never overridden by LLM output
- For CEFR levels: the level with the most votes wins. If no majority is
+- For CEFR levels: the level with the most votes wins. If no majority is reached, the entry is flagged for the tiebreaker
-  reached, that translation is flagged for the tiebreaker
+- For LLM-generated text fields: the candidate with the most votes wins. If no majority is reached, the tiebreaker runs
 - For LLM-generated text fields (gloss, examples, descriptions): the
  candidate with the most votes wins. If no majority is reached, the
  tiebreaker runs for that record as well
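The merge rules can be sketched as a single resolver. The text does not say exactly what counts as a majority, so this example assumes a unique plurality: a candidate wins only if it has strictly more votes than every other candidate, and a tie for first place is treated as no majority. `Resolution` and `resolveField` are hypothetical names.

```typescript
type Resolution<T> =
  | { status: "final"; value: T }
  | { status: "flagged"; counts: Map<T, number> };

function resolveField<T>(sourceValue: T | null, votes: T[]): Resolution<T> {
  // Source data wins automatically and is never overridden by LLM output.
  if (sourceValue !== null) return { status: "final", value: sourceValue };

  // Count votes per candidate.
  const counts = new Map<T, number>();
  for (const vote of votes) counts.set(vote, (counts.get(vote) ?? 0) + 1);

  // Rank candidates by vote count, highest first.
  const ranked = Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
  const top = ranked[0];
  const runnerUp = ranked[1];

  // Assumption: the winner needs strictly more votes than the runner-up.
  if (top !== undefined && (runnerUp === undefined || top[1] > runnerUp[1])) {
    return { status: "final", value: top[0] };
  }
  return { status: "flagged", counts };
}
```

Under this assumption, votes of `B1`, `B1`, `A2` resolve to `B1`, while `B1`, `A2` is flagged for the tiebreaker. If the pipeline requires more than half of all votes instead, only the final comparison changes.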
 **Difficulty mapping:**
@ -311,64 +208,44 @@ Reads compiled votes from `pipeline.db` and resolves the final value for every f
 | B1, B2 | intermediate |
 | C1, C2 | hard         |
-**Input:** `pipeline.db` — compiled votes
+**Input:** `pipeline.db` — LLM votes
-**Output:** `pipeline.db` — records updated with status `final` or `flagged`
+**Output:** `pipeline.db` — entries updated with resolved values or flagged status
 ```bash
 pnpm --filter @lila/pipeline merge
 ```
 ### 4b. Tiebreak
-Runs automatically after merge if any translations remain flagged. The script queries `pipeline.db` for flagged translations, identifies which configured models have not yet voted on each word, and runs those models on the flagged subset only. Merge is re-run after each tiebreaker pass. This repeats until all flagged translations are resolved or no unused models remain.
+Runs automatically after merge if any entries remain flagged. The script queries `pipeline.db` for flagged entries, identifies which configured models have not yet voted on each entry, and runs those models on the flagged subset only. Merge is re-run after each tiebreaker pass. This repeats until all flagged entries are resolved or no unused models remain.
-If unused models are exhausted and flagged translations remain, the script logs a detailed report showing the exact vote split for each unresolved word and lists available models from OpenRouter that have not been used. Seeding is blocked until all translations are resolved. To continue, add one or more models to the config and re-run the pipeline — the tiebreaker will pick up automatically.
+If unused models are exhausted and flagged entries remain, the script logs a detailed report showing the exact vote split for each unresolved entry and lists available models from OpenRouter that have not been used. Syncing is blocked until all entries are resolved. To continue, add one or more models to the config and re-run the pipeline — the tiebreaker will pick up automatically.
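One tiebreaker pass could be planned roughly as follows. This is only a sketch: `FlaggedEntry` and `planTiebreak` are hypothetical names, and the real script reads flagged entries and vote history from `pipeline.db`.

```typescript
interface FlaggedEntry {
  id: number;
  votedModels: string[]; // models that have already voted on this entry
}

function planTiebreak(
  configuredModels: string[],
  flagged: FlaggedEntry[],
): { runs: Map<string, number[]>; exhausted: number[] } {
  const runs = new Map<string, number[]>(); // model -> entry ids to process
  const exhausted: number[] = []; // entries with no unused model left

  for (const entry of flagged) {
    const unused = configuredModels.filter(
      (model) => entry.votedModels.indexOf(model) === -1,
    );
    if (unused.length === 0) {
      exhausted.push(entry.id);
      continue;
    }
    for (const model of unused) {
      const ids = runs.get(model) ?? [];
      ids.push(entry.id);
      runs.set(model, ids);
    }
  }
  return { runs, exhausted };
}
```

Entries listed in `exhausted` correspond to the case where syncing stays blocked until another model is added to the config.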
-**Input:** `pipeline.db` — flagged translations from merge
+> **Note:** The tiebreaker is not a standalone script. It runs automatically as part of the pipeline orchestrator after merge completes.
 **Output:** `pipeline.db` — flagged translations resolved to `final`
 > **Note:** The tiebreaker is not a standalone script. It runs automatically
 > as part of the pipeline orchestrator after merge completes.
 ### 5. Compare / QA
-Read-only. Generates `COVERAGE.md` with a full breakdown of the pipeline output quality per language. Run this after merge to verify output before syncing to the database.
+Read-only. Generates `COVERAGE.md` with a full breakdown of pipeline output quality per language. Run this after merge to verify output before syncing to the database.
-**Input:** `pipeline.db` — records with status `final`
+**Input:** `pipeline.db` — entries with status `final`
 **Output:** `COVERAGE.md`
 ```bash
 pnpm --filter @lila/pipeline compare
 ```
 `COVERAGE.md` reports the following per language:
- Total synsets extracted
+- Total entries extracted
- Total translations per language
+- POS breakdown — entry counts for noun, verb, adjective, adverb
- POS breakdown per language — word counts for noun, verb, adjective, adverb
+- Translation coverage — how many entries have translations in each other language
- CEFR coverage per language — how many translations have a resolved CEFR
+- CEFR coverage — how many entries have a resolved CEFR level, broken down by level
-  level, broken down by level (A1, A2, B1, B2, C1, C2)
+- Difficulty breakdown — entry counts for easy, intermediate, hard
- Difficulty breakdown per language — word counts for easy, intermediate, hard
+- Gloss coverage — how many entries have a gloss, broken down by source (Kaikki vs LLM-generated)
- Flagged count per language — how many translations are awaiting manual review
+- Example coverage — same breakdown as glosses
- Gloss coverage per language — total glosses, broken down by source (omw vs
+- LLM model contribution — how many CEFR votes and text candidates each anonymised model contributed
  LLM-generated) and which languages have no glosses at all
 - Example coverage per language — same breakdown as glosses
 - Description coverage per language — how many translations have a description,
  broken down by source
 - CEFR source file coverage per language — how many words from the source file
  were matched against OMW translations
 - LLM model contribution — how many CEFR votes and text candidates each
  anonymised model contributed
 ## Sync
-The sync script transfers all records with status `final` in `pipeline.db` to the production PostgreSQL database. It is upsert-based and never wipes existing data. For each record it checks whether a matching `source_id` already exists in the target database:
+The sync script transfers all entries with status `final` in `pipeline.db` to the production PostgreSQL database. It is upsert-based and never wipes existing data. For each entry it checks whether a matching record already exists in the target database:
 - **Missing** → insert
 - **Present but changed** → update
 - **Present and unchanged** → skip
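The three cases above amount to a small classification step before any write. The sketch below is illustrative: `SyncRecord`, its `key` field, and the compared fields are hypothetical, since the text does not say how a matching record is identified in the target database.

```typescript
type SyncAction = "insert" | "update" | "skip";

// Hypothetical record shape; `key` stands in for whatever identifies a match.
interface SyncRecord {
  key: string;
  gloss: string;
  cefr: string;
}

function classifySync(resolved: SyncRecord, existing: SyncRecord | undefined): SyncAction {
  if (existing === undefined) return "insert"; // missing in target
  const changed = resolved.gloss !== existing.gloss || resolved.cefr !== existing.cefr;
  return changed ? "update" : "skip";
}

function planSync(
  resolved: SyncRecord[],
  target: Map<string, SyncRecord>,
): Record<SyncAction, string[]> {
  const plan: Record<SyncAction, string[]> = { insert: [], update: [], skip: [] };
  for (const record of resolved) {
    plan[classifySync(record, target.get(record.key))].push(record.key);
  }
  return plan;
}
```

Because nothing is deleted and unchanged records are skipped, running the same plan twice against an unchanged source should produce only `skip` actions the second time.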
-Run this after all records are resolved and Compare / QA has been reviewed.
+Run this after all entries are resolved and Compare / QA has been reviewed.
 ```bash
 pnpm --filter @lila/pipeline sync
@ -382,41 +259,34 @@ The pipeline generates a report at the end of every run. Reports are written to
 ```
 data-pipeline/reports/
-  2026-05-03_night-1.json
+  2026-05-03_run-1.json
-  2026-05-03_night-1.md
+  2026-05-03_run-1.md
 ```
-The report name is provided when starting the pipeline:
+The run name is auto-generated from the date and a counter.
 ```bash
 pnpm --filter @lila/pipeline run --name "night-1"
 ```
 **Nightly report** contains:
- Records processed this run vs total
+- Entries processed this run vs total
- Records remaining per stage
+- Entries remaining per stage
 - Average processing speed and estimated nights remaining
- `needs_review` count — records that failed structural validation
+- `needs_review` count — entries that failed structural validation
 - Per-model progress breakdown
-**Final report** (generated when all records are processed) additionally contains:
+**Final report** (generated when all entries are processed) additionally contains:
 - Full vote breakdown per model
- Flagged translations with exact vote splits
+- Flagged entries with exact vote splits
 - Available unused models from OpenRouter for tiebreaking
 - Per-model quality metrics — CEFR agreement rate, field coverage, JSON parse rate
 Reports are not committed to git.
 ## Adding a new language
 1. Add the language code to `SUPPORTED_LANGUAGE_CODES` in `packages/shared/src/constants.ts`
 2. Build shared: `pnpm --filter @lila/shared build`
 3. Generate and run a DB migration: `pnpm --filter @lila/db generate` then `pnpm --filter @lila/db migrate`
-4. Download the OMW lexicon for the language using the `wn` Python library
+4. Download the Kaikki JSONL file for the language from kaikki.org
-5. Add a CEFR source file at `stage-2-annotate/sources/cefr/{lang}.json`
+5. Re-run the full pipeline
 6. Run the full pipeline
 ## Constants and constraints
@ -433,108 +303,84 @@ Adding a new value to any of these requires a constants update and a database mi
 ## Further extensions
-These are not part of the current pipeline but are worth considering as the
+These are not part of the current pipeline but are worth considering as the dataset matures:
 dataset matures:
- **Grammatical gender and articles** — Wiktionary dumps contain gender and
+- **IPA pronunciations** — Kaikki includes IPA transcriptions for most entries. These could be extracted, stored in an `entry_pronunciations` table, and displayed in the quiz UI.
-  article data for nouns across all supported languages. Could be extracted
+- **Audio files** — kaikki.org provides bulk audio file downloads (~20GB) for pronunciations. Could be stored as static files and served alongside the quiz UI.
-  and stored as a new `translation_forms` table.
+- **Inflected forms** — Kaikki provides conjugation and declension tables in a `forms` array. Useful for a future grammar-focused quiz mode.
- **Conjugations** — Wiktionary also carries verb conjugation tables. Useful
+- **Grammatical gender** — Kaikki includes grammatical gender for nouns. Could be stored per entry and used as an additional quiz mechanic.
-  for a future grammar-focused quiz mode.
+- **Frequency data** — Word frequency rankings per language from sources like the Google Ngram dataset. Useful for smarter difficulty calibration beyond CEFR levels alone.
- **IPA pronunciations** — Wiktionary and Forvo are potential sources for
+- **Additional languages** — The pipeline is language-agnostic. Adding a new language requires downloading its Kaikki JSONL file, updating the constants, and running a database migration. See **Adding a new language**.
  phonetic transcriptions per language.
 - **TTS audio files** — Generate pronunciation audio for each translation
  using a local or cloud TTS engine. Stored as static files, served alongside
  the quiz UI.
 - **Images** — Associate an image with each synset to support visual
  vocabulary learning. Could be sourced from open image datasets like
  ImageNet or WikiMedia Commons.
 - **Frequency data** — Word frequency rankings per language from sources like
  the Google Ngram dataset. Useful for smarter difficulty calibration beyond
  CEFR levels alone.
 - **Improved CEFR source files** — See note at the top of this document.
  UniversalCEFR and CEFR-J are good starting points.
 - **Additional languages** — The pipeline is language-agnostic. Adding a new
  language requires an OMW lexicon, a CEFR source file, and a constants
  update. See **Adding a new language**.
 ## Roadmap
-**Current state:** Stages 1 and 2 are complete, validated, and imported into `pipeline.db`. Schema, init, import scripts, validation tests, and fixtures are all in place. Stage 3 scripts have not been written yet and llama.cpp is not installed.
+**Current state:** The data source has been migrated from OMW to Kaikki. The production schema and pipeline are being rewritten on `feat/kaikki-vocabulary-schema`. The pipeline infrastructure (orchestrator, db init, reporting, tests) is in place and carries forward.
-**Next action:** Write the stage 3 round 1 script.
+**Next action:** Rewrite the production schema in `packages/db`, then rewrite the pipeline extraction stage for Kaikki.
 | Stage           | Status         |
 | --------------- | -------------- |
-| 1. Extract      | ✅ complete    |
+| 1. Extract      | 🔲 not started |
-| 2. Annotate     | ✅ complete    |
+| 2. Reverse link | 🔲 not started |
 | 3. Enrich       | 🔲 not started |
 | 4. Merge        | 🔲 not started |
 | 4b. Tiebreak    | 🔲 not started |
 | 5. Compare / QA | 🔲 not started |
 | 6. Sync         | 🔲 not started |
-### Stage 1 — Extract `✅ complete`
+### Stage 1 — Extract `🔲 not started`
- [x] Write extraction script
+- [ ] Download Kaikki JSONL files for all 5 languages
- [x] Run extraction → `stage-1-extract/output/omw.json`
+- [ ] Write extraction script
 - [ ] Write stage 1 validation tests
 - [ ] Run extraction → `pipeline.db`
-### Stage 2 — Annotate `✅ complete`
+### Stage 2 — Reverse link sync `🔲 not started`
- [x] Write annotation script
+- [ ] Write reverse link sync script
- [x] Run annotation → per-language JSON + `conflicts.json`
+- [ ] Write tests
- [x] Add annotate script to package.json
+- [ ] Run reverse link sync → `pipeline.db`
 - [x] Fix duplicate translations in extract.py
 - [x] Write stage 1 and 2 validation tests
 - [x] Write db schema, init, and import scripts
 - [x] Write test fixtures
 ### Stage 3 — Enrich `🔲 not started`
-**Next action:** Write the round 1 generation script.
+**Next action:** Write the enrich script after production schema is complete.
- [ ] Write tests for stage 3
+- [ ] Write enrich script (missing translations, glosses, examples, CEFR votes)
- [ ] Write round 1 script (generation)
+- [ ] Write tests
 - [ ] Write compile-candidates script
 - [ ] Write round 2 script (voting)
 - [ ] Write compile-votes script
 - [ ] Install llama.cpp and verify server
- [ ] Smoke test with 5–10 records
+- [ ] Smoke test with sample entries
- [ ] Run full 100-record sample, collect metrics
+- [ ] Run full sample, collect metrics
 - [ ] Compare providers (local vs OpenRouter free models)
- [ ] Production run — all records, all models
+- [ ] Production run — all entries, all models
 - [ ] Compile candidates → `pipeline.db`
 - [ ] Compile votes → `pipeline.db`
 ### Stage 4 — Merge `🔲 not started`
 - [ ] Write tests for stage 4
 - [ ] Write merge script
 - [ ] Write tests
 - [ ] Run merge → `pipeline.db`
- [ ] Confirm tiebreaker resolves all flagged translations
+- [ ] Confirm tiebreaker resolves all flagged entries
 ### Stage 4b — Tiebreak `🔲 not started`
 - [ ] Write tests for stage 4b
 - [ ] Write tiebreak logic
- [ ] Run tiebreaker for all flagged translations
+- [ ] Run tiebreaker for all flagged entries
- [ ] Confirm no flagged translations remain before seeding
+- [ ] Confirm no flagged entries remain before syncing
 ### Stage 5 — Compare / QA `🔲 not started`
 - [ ] Write tests for stage 5
 - [ ] Write compare script
 - [ ] Write tests
 - [ ] Run compare → `COVERAGE.md`
- [ ] Review output quality before seeding
+- [ ] Review output quality before syncing
 ### Stage 6 — Sync `🔲 not started`
 - [ ] Write tests for stage 6
 - [ ] Write sync script
 - [ ] Write tests
 - [ ] Configure `DATABASE_URL` in `.env`
 - [ ] Run sync → production PostgreSQL
 - [ ] Verify synced data in production
 ### Utilities
-**`test/`** — Runs the pipeline against a small sample to produce human-readable output for a quick sanity check before committing to a full run. Run this after any script change before running the full pipeline.
+**`sample/`** — Runs the pipeline against a small sample to produce human-readable output for a quick sanity check before committing to a full run. Run this after any script change before running the full pipeline.