Compare commits
9 commits
ccfd83d16c
...
38d8b85228
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
38d8b85228 | ||
|
|
87aeb072c5 | ||
|
|
080fad1998 | ||
|
|
4d42fe4397 | ||
|
|
f59399be02 | ||
|
|
4a842140b9 | ||
|
|
4fa3073412 | ||
|
|
74cfc82bdd | ||
|
|
6007fe1e38 |
29 changed files with 2691 additions and 366 deletions
3
.gitignore
vendored
3
.gitignore
vendored
|
|
@ -15,3 +15,6 @@ data-pipeline/stage-1-extract/output/
|
||||||
data-pipeline/stage-2-annotate/output/
|
data-pipeline/stage-2-annotate/output/
|
||||||
data-pipeline/stage-3-enrich/output/
|
data-pipeline/stage-3-enrich/output/
|
||||||
data-pipeline/stage-4-merge/output/
|
data-pipeline/stage-4-merge/output/
|
||||||
|
data-pipeline/db/pipeline.db
|
||||||
|
data-pipeline/reports/
|
||||||
|
data-pipeline/.env
|
||||||
|
|
|
||||||
7
data-pipeline/.env.example
Normal file
7
data-pipeline/.env.example
Normal file
|
|
@ -0,0 +1,7 @@
|
||||||
|
# OpenRouter API key — required for OpenRouter providers
|
||||||
|
# Get one at https://openrouter.ai/keys
|
||||||
|
OPENROUTER_API_KEY=
|
||||||
|
|
||||||
|
# Anthropic API key — required for Anthropic provider (reference baseline only)
|
||||||
|
# Get one at https://console.anthropic.com/
|
||||||
|
ANTHROPIC_API_KEY=
|
||||||
362
data-pipeline/audit.md
Normal file
362
data-pipeline/audit.md
Normal file
|
|
@ -0,0 +1,362 @@
|
||||||
|
# OMW German Translation Quality Audit
|
||||||
|
|
||||||
|
Instructions: for each entry, check if the German translations
|
||||||
|
match the meaning described by the English gloss.
|
||||||
|
|
||||||
|
Mark QUALITY as:
|
||||||
|
OK — all German translations fit the meaning
|
||||||
|
PARTIAL — some fit, some don't
|
||||||
|
BAD — none of the German translations fit
|
||||||
|
USELESS — translations are correct but useless for learners
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
1. [noun] ili:i98680
|
||||||
|
EN gloss: the flowering part of a plant or arrangement of flowers on a stalk
|
||||||
|
DE gloss: der blühende Teil einer Pflanze oder die Anordnung von Blüten an einem Stiel
|
||||||
|
EN words: inflorescence
|
||||||
|
DE words: Blütenstand, Infloreszenz
|
||||||
|
QUALITY: correct
|
||||||
|
|
||||||
|
2. [verb] ili:i24675
|
||||||
|
EN gloss: make motionless
|
||||||
|
DE gloss: unbeweglich machen
|
||||||
|
EN words: still
|
||||||
|
DE words: stillen, zum Stillstand bringen
|
||||||
|
QUALITY: stillen means breastfeeding, so completelyworng, zum stillstand bringen is correct but the gloss sounds weird: unbeweglich machen, no one says this
|
||||||
|
|
||||||
|
3. [verb] ili:i22153
|
||||||
|
EN gloss: lose interest or become bored with something or somebody
|
||||||
|
DE gloss: das Interesse an etwas oder jemandem verlieren oder sich langweilen
|
||||||
|
EN words: fatigue, jade, pall, tire, weary
|
||||||
|
DE words: Langeweile erzeugen, anöden, ermüden, langweilen, sich langweilen, sich zu Tode langweilen, sich öden
|
||||||
|
QUALITY: its ok
|
||||||
|
|
||||||
|
4. [noun] ili:i74742
|
||||||
|
EN gloss: zealous preaching and advocacy of the gospel
|
||||||
|
DE gloss: eifriges Predigen und Eintreten für das Evangelium
|
||||||
|
EN words: evangelism
|
||||||
|
DE words: Evangelisation, Evangelisierung
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
5. [noun] ili:i115665
|
||||||
|
EN gloss: an oxide of iron that is strongly attracted by magnets
|
||||||
|
DE gloss: ein Eisenoxid, das stark von Magneten angezogen wird
|
||||||
|
EN words: magnetic iron-ore, magnetite
|
||||||
|
DE words: Eisenoxiduloxid, Magneteisen, Magneteisenstein, Magnetit
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
6. [adjective] ili:i17569
|
||||||
|
EN gloss: of or relating to fatalism
|
||||||
|
DE gloss: von oder im Zusammenhang mit Fatalismus
|
||||||
|
EN words: fatalist, fatalistic
|
||||||
|
DE words: auf alles gefasst, dem Schicksal ergeben, fatalistisch, gottergeben, schicksalsergeben
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
7. [adjective] ili:i682
|
||||||
|
EN gloss: having no previous example or precedent or parallel
|
||||||
|
DE gloss: ohne vorheriges Beispiel oder Präzedenzfall oder Parallele
|
||||||
|
EN words: new, unexampled
|
||||||
|
DE words: beispiellos, gab es noch nie, ohne Beispiel, ohne Präzedenzfall, ohnegleichen, präzedenzlos, sondergleichen, unvergleichbar
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
8. [noun] ili:i114018
|
||||||
|
EN gloss: a soft silvery metallic element of the rare earth group; isotope 170 emits X-rays and is used in small portable X-ray machines; it occurs in monazite and apatite and xenotime
|
||||||
|
DE gloss: ein weiches, silbriges Metallelement der Gruppe der Seltenen Erden; Isotop 170 emittiert Röntgenstrahlen und wird in kleinen tragbaren Röntgengeräten verwendet; es kommt in Monazit und Apatit sowie in Xenotim vor
|
||||||
|
EN words: Tm, atomic number 69, thulium
|
||||||
|
DE words: Terameter, Tm
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
9. [noun] ili:i117564
|
||||||
|
EN gloss: the rate of some repeating event
|
||||||
|
DE gloss: die Geschwindigkeit eines sich wiederholenden Ereignisses
|
||||||
|
EN words: pace, tempo
|
||||||
|
DE words: Takt, Tempo
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
10. [verb] ili:i31619
|
||||||
|
EN gloss: let drop or droop
|
||||||
|
DE gloss: fallen oder hängen lassen
|
||||||
|
EN words: hang
|
||||||
|
DE words: am Galgen sterben lassen, aufhängen, aufknüpfen, erhängen, henken, hängen
|
||||||
|
QUALITY: wrong,let drop means fallen lassen, like dropping something? im not sure here, does it really mean to hang some one? if so, then its ok
|
||||||
|
|
||||||
|
11. [noun] ili:i75571
|
||||||
|
EN gloss: a heavy dull sound (as made by impact of heavy objects)
|
||||||
|
DE gloss: ein schweres, dumpfes Geräusch (wie beim Aufprall schwerer Gegenstände)
|
||||||
|
EN words: clump, clunk, thud, thump, thumping
|
||||||
|
DE words: Geklacker, Geklapper, Klackern, Klappern
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
12. [noun] ili:i92290
|
||||||
|
EN gloss: a person who makes a promise
|
||||||
|
DE gloss: eine Person, die ein Versprechen gibt
|
||||||
|
EN words: promiser, promisor
|
||||||
|
DE words: Freud'scher Versprecher, Lapsus Linguae, Versprecher, freudscher Versprecher
|
||||||
|
QUALITY: completeley wrong, Versprecher is if you intend to say something but say some thing else, it has nothing to do with Versprechen
|
||||||
|
|
||||||
|
13. [noun] ili:i59450
|
||||||
|
EN gloss: a vertical well around which there is a stairway
|
||||||
|
DE gloss: ein vertikaler Schacht, um den herum eine Treppe verläuft
|
||||||
|
EN words: stairwell
|
||||||
|
DE words: Ern, Flur, Hausflur, Stiegenhaus, Treppenhaus
|
||||||
|
QUALITY: treppenhaus woudl be the only correct one right?
|
||||||
|
|
||||||
|
14. [verb] ili:i21908
|
||||||
|
EN gloss: smile affectedly or derisively
|
||||||
|
DE gloss: affektiert oder spöttisch lächeln
|
||||||
|
EN words: simper, smirk
|
||||||
|
DE words: in sich hinein lächeln, schmunzeln, vor sich hin lächeln
|
||||||
|
QUALITY: the glosses would be also the words here? schmunzeln and lächeln are kind of the same but the affektiert and spöttisch is missing?
|
||||||
|
|
||||||
|
15. [adjective] ili:i10887
|
||||||
|
EN gloss: tending to reserve or introspection
|
||||||
|
DE gloss: zur Zurückhaltung oder Introspektion neigend
|
||||||
|
EN words: indrawn, withdrawn
|
||||||
|
DE words: allein, einsam, eremitenhaft, eremitisch, für sich, solo, wie ein Einsiedler, wie ein Eremit, zurückgezogen
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
16. [noun] ili:i113657
|
||||||
|
EN gloss: a substance from which another substance is formed (especially by a metabolic reaction)
|
||||||
|
DE gloss: ein Stoff, aus dem ein anderer Stoff gebildet wird (insbesondere durch eine Stoffwechselreaktion)
|
||||||
|
EN words: precursor
|
||||||
|
DE words: Ausgangsstoff, Edukt, Grundstoff, Präkursor, Vorläufer, biologische Vorstufe
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
17. [adjective] ili:i13251
|
||||||
|
EN gloss: tastelessly showy
|
||||||
|
DE gloss: geschmacklos und auffällig
|
||||||
|
EN words: brassy, cheap, flash, flashy, garish, gaudy, gimcrack, loud, meretricious, tacky, tatty, tawdry, trashy
|
||||||
|
DE words: aufdringlich, marktschreierisch, reißerisch
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
18. [noun] ili:i68734
|
||||||
|
EN gloss: the branch of chemistry that studies the relation between chemical action and the amount of heat absorbed or generated
|
||||||
|
DE gloss: der Zweig der Chemie, der die Beziehung zwischen chemischer Wirkung und der absorbierten oder erzeugten Wärmemenge untersucht
|
||||||
|
EN words: thermochemistry
|
||||||
|
DE words: Thermochemie, chemische Thermodynamik
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
19. [adjective] ili:i12980
|
||||||
|
EN gloss: distinguished from others in excellence
|
||||||
|
DE gloss: durch hohe Qualität von anderen unterschieden
|
||||||
|
EN words: outstanding
|
||||||
|
DE words: I a, ausgezeichnet, außergewöhnlich, außerordentlich, besonders, bestens, eins a, exzeptionell, herausragend, schnafte, splendid, trefflich, vortrefflich, vorzüglich
|
||||||
|
QUALITY: ok, aber eins a/1a is wirklich sehr starke umgangssprache. und cih habe ncoh nie schnafte oder splendid gehört, der rest passt
|
||||||
|
|
||||||
|
20. [verb] ili:i30043
|
||||||
|
EN gloss: tear down so as to make flat with the ground
|
||||||
|
DE gloss: abreißen, um den Boden zu ebnen
|
||||||
|
EN words: dismantle, level, pull down, rase, raze, take down, tear down
|
||||||
|
DE words: abreißen, aus den Augen verlieren, keinen Kontakt mehr haben zu, nicht länger in Kontakt stehen
|
||||||
|
QUALITY: nur abreißen stimmt, der rest passt in diesem zusammenhang gar nicht!
|
||||||
|
|
||||||
|
21. [adjective] ili:i14014
|
||||||
|
EN gloss: desired or wished for or sought
|
||||||
|
DE gloss: gewünscht oder gewünscht oder gesucht
|
||||||
|
EN words: wanted
|
||||||
|
DE words: benötigt, gesucht, gewünscht
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
22. [verb] ili:i29481
|
||||||
|
EN gloss: mar or spoil the appearance of
|
||||||
|
DE gloss: das Aussehen verunstalten
|
||||||
|
EN words: blemish, deface, disfigure
|
||||||
|
DE words: deformieren, entstellen, verhunzen, verschandeln, verunstalten, verunzieren
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
23. [verb] ili:i28605
|
||||||
|
EN gloss: spread thickly
|
||||||
|
DE gloss: dick auftragen
|
||||||
|
EN words: slather
|
||||||
|
DE words: beharken, bestreichen, mit Feuer belegen, mit Sperrfeuer belegen
|
||||||
|
QUALITY: kein wort ist wirklich ein synonym für dick auftragen, (i dont even know if the english word fits here?)
|
||||||
|
|
||||||
|
24. [noun] ili:i92029
|
||||||
|
EN gloss: someone who is licensed to operate an aircraft in flight
|
||||||
|
DE gloss: jemand, der eine Lizenz zum Führen eines Luftfahrzeugs im Flug hat
|
||||||
|
EN words: airplane pilot, pilot
|
||||||
|
DE words: Führer, Lotse, Pilot
|
||||||
|
QUALITY: nur Pilot stimmt hier
|
||||||
|
|
||||||
|
25. [adjective] ili:i8221
|
||||||
|
EN gloss: capable of being measured
|
||||||
|
DE gloss: in der Lage, gemessen zu werden
|
||||||
|
EN words: measurable, mensurable
|
||||||
|
DE words: bestimmbar, der Messung zugänglich, erhebbar, mensurabel, messbar
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
26. [noun] ili:i61380
|
||||||
|
EN gloss: the spirit of a group that makes the members want the group to succeed
|
||||||
|
DE gloss: der Geist einer Gruppe, der die Mitglieder dazu bringt, den Erfolg der Gruppe zu wollen
|
||||||
|
EN words: esprit de corps, morale, team spirit
|
||||||
|
DE words: Gruppengeist, Teamgeist
|
||||||
|
QUALITY: Gruppengeist hört sich so komisch an, das sagt niemand, teamgeist ist in ordnung
|
||||||
|
|
||||||
|
27. [adjective] ili:i10497
|
||||||
|
EN gloss: free of restrictions or qualifications
|
||||||
|
DE gloss: Zustand, in dem in einer Wohnung niemand wohnt.
|
||||||
|
EN words: clean, clear
|
||||||
|
DE words: frei, leer stehend, leerstehend, unbewohnt, ungenutzt, verwaist
|
||||||
|
QUALITY: ok
|
||||||
|
|
||||||
|
28. [adjective] ili:i6238
|
||||||
|
EN gloss: moving and bending with ease
|
||||||
|
DE gloss: anmutig schlank und mit Leichtigkeit biegsam und beweglich
|
||||||
|
EN words: lissom, lissome, lithe, lithesome, slender, supple, svelte, sylphlike
|
||||||
|
DE words: elastisch, geschmeidig, schlangenartig
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
29. [noun] ili:i57906
|
||||||
|
EN gloss: station for the production and transmission of AM or FM radio broadcasts
|
||||||
|
DE gloss: Sender für die Produktion und Übertragung von AM- oder FM-Radiosendungen
|
||||||
|
EN words: radio station
|
||||||
|
DE words: Radiosender, Rundfunkstation, Sender
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
30. [noun] ili:i112045
|
||||||
|
EN gloss: the purple or black-and-blue area resulting from a bruise
|
||||||
|
DE gloss: der violette oder schwarzblaue Bereich, der durch einen Bluterguss entsteht
|
||||||
|
EN words: ecchymosis
|
||||||
|
DE words: Ekchymose, kleinflächige Hautblutung
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
31. [adjective] ili:i10839
|
||||||
|
EN gloss: capable of being replaced
|
||||||
|
DE gloss: kann ersetzt werden
|
||||||
|
EN words: replaceable
|
||||||
|
DE words: austauschbar, ersetzbar, fungibel
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
32. [verb] ili:i28714
|
||||||
|
EN gloss: whip
|
||||||
|
DE gloss: peitschen
|
||||||
|
EN words: flagellate, scourge
|
||||||
|
DE words: auspeitschen, flagellieren, geißeln, peitschen
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
33. [noun] ili:i52826
|
||||||
|
EN gloss: a mechanical or electrical explosive device or a small amount of explosive; can be used to initiate the reaction of a disrupting explosive
|
||||||
|
DE gloss: ein mechanischer oder elektrischer Sprengkörper oder eine kleine Menge Sprengstoff; kann verwendet werden, um die Reaktion eines Sprengstoffs auszulösen
|
||||||
|
EN words: cap, detonating device, detonator
|
||||||
|
DE words: Auslöser, Zünder, Zündvorrichtung
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
34. [noun] ili:i115477
|
||||||
|
EN gloss: ice crystals forming a white deposit (especially on objects outside)
|
||||||
|
DE gloss: Eiskristalle, die einen weißen Belag bilden (insbesondere auf Gegenständen im Freien)
|
||||||
|
EN words: frost, hoar, hoarfrost, rime
|
||||||
|
DE words: Raufrost, Raureif, Reif
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
35. [noun] ili:i66650
|
||||||
|
EN gloss: the ability to see in reduced illumination (as in moonlight)
|
||||||
|
DE gloss: die Fähigkeit, bei reduzierter Beleuchtung zu sehen (wie bei Mondlicht)
|
||||||
|
EN words: night vision, night-sight, scotopic vision, twilight vision
|
||||||
|
DE words: Nachtsehen, skotopisches Sehen
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
36. [verb] ili:i26849
|
||||||
|
EN gloss: express or utter with a hiss
|
||||||
|
DE gloss: mit einem Zischen ausdrücken oder aussprechen
|
||||||
|
EN words: hiss, sibilate, siss, sizz
|
||||||
|
DE words: Stimme dämpfen, flüstern, hauchen, hinter vorgehaltener Hand, ins Ohr sagen, leise sprechen, mit tonloser Stimme, munkeln, raunen, säuseln, tonlos, tuscheln, wispern, zischeln, zuflüstern
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
37. [noun] ili:i94222
|
||||||
|
EN gloss: a teenager or a young adult male
|
||||||
|
DE gloss: ein Jugendlicher oder ein junger Erwachsener
|
||||||
|
EN words: young buck, young man
|
||||||
|
DE words: Bruder, Bürschchen, Cowboy, Freundchen, Jungs, Kinders, Kollege, Kollegin, Leute, Mann Gottes, Meister, Sportsfreund, Verehrtester, der Herr, guter Mann, junger Mann, mein Gutster, mein Herr
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
38. [noun] ili:i49310
|
||||||
|
EN gloss: dusky grey food fish found from Louisiana and Florida southward
|
||||||
|
DE gloss: dunkelgrauer Speisefisch, der von Louisiana und Florida südwärts vorkommt
|
||||||
|
EN words: Anisotremus surinamensis, black margate, pompon
|
||||||
|
DE words: Pompon, Puschel, Tanzwedel
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
39. [noun] ili:i50315
|
||||||
|
EN gloss: a small vehicle with four wheels in which a baby or child is pushed around
|
||||||
|
DE gloss: ein kleines Fahrzeug mit vier Rädern, in dem ein Säugling oder ein Kind herumgeschoben wird
|
||||||
|
EN words: baby buggy, baby carriage, carriage, go-cart, perambulator, pram, pushchair, pusher, stroller
|
||||||
|
DE words: Kinderwagen, Säuglingskutsche
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
40. [verb] ili:i31857
|
||||||
|
EN gloss: meet at a point
|
||||||
|
DE gloss: sich an einem Punkt treffen
|
||||||
|
EN words: cross, intersect
|
||||||
|
DE words: gegen den Wind segeln, kreuzen
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
41. [noun] ili:i51632
|
||||||
|
EN gloss: a sailboat with two parallel hulls held together by single deck
|
||||||
|
DE gloss: ein Boot mit zwei parallelen Rümpfen, die durch ein einziges Deck zusammengehalten werden
|
||||||
|
EN words: catamaran
|
||||||
|
DE words: Doppelrumpfboot, Katamaran, Zweirumpfboot
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
42. [verb] ili:i34734
|
||||||
|
EN gloss: to be found to exist
|
||||||
|
DE gloss: als existent befunden werden
|
||||||
|
EN words: occur
|
||||||
|
DE words: anzutreffen sein, auftreten, nicht ausbleiben, vorkommen, zu finden sein, zu sehen sein
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
43. [verb] ili:i25187
|
||||||
|
EN gloss: assign too high a value to
|
||||||
|
DE gloss: einen zu hohen Wert zuweisen
|
||||||
|
EN words: overestimate, overvalue
|
||||||
|
DE words: zu hoch bewerten, zu viel Gewicht beimessen, zu viel Wichtigkeit beimessen, überbewerten, überschätzen
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
44. [noun] ili:i73844
|
||||||
|
EN gloss: an expressive style of music
|
||||||
|
DE gloss: ein ausdrucksstarker Musikstil
|
||||||
|
EN words: genre, music genre, musical genre, musical style
|
||||||
|
DE words: Genre, Musikgenre, Musikrichtung, Musikstil, Stilrichtung
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
45. [noun] ili:i113026
|
||||||
|
EN gloss: an abnormal condition in which cerebrospinal fluid collects in the ventricles of the brain; in infants it can cause abnormally rapid growth of the head and bulging fontanelles and a small face; in adults the symptoms are primarily neurological
|
||||||
|
DE gloss: ein anormaler Zustand, bei dem sich Liquor in den Hirnventrikeln sammelt; bei Säuglingen kann er zu einem anormal schnellen Wachstum des Kopfes, zu wulstigen Fontanellen und einem kleinen Gesicht führen; bei Erwachsenen sind die Symptome hauptsächlich neurologisch
|
||||||
|
EN words: hydrocephalus, hydrocephaly
|
||||||
|
DE words: Gehirnwassersucht, Hydrocephalus, Hydrozephalus, Wasserkopf
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
46. [noun] ili:i62720
|
||||||
|
EN gloss: habitual uncleanliness
|
||||||
|
DE gloss: gewohnheitsmäßige Unreinheit
|
||||||
|
EN words: slovenliness
|
||||||
|
DE words: Flickarbeit, Flickenteppich, Flickwerk, Gestümper, Mist, Murks, Murkserei, Pfusch, Pfuscharbeit, Pfuscherei, Schlamperei, Schlendrian, Schluderei, Schund, schlechte Arbeit
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
47. [noun] ili:i80976
|
||||||
|
EN gloss: the government agency in the United Kingdom that is responsible for internal security and counterintelligence overseas
|
||||||
|
DE gloss: Regierungsbehörde im Vereinigten Königreich, die für die innere Sicherheit und die Spionageabwehr im Ausland zuständig ist.
|
||||||
|
EN words: MI, Military Intelligence Section 6, Secret Intelligence Service
|
||||||
|
DE words: MI6, SIS, Secret Intelligence Service, Secret Service, britischer Auslandsgeheimdienst
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
48. [noun] ili:i60476
|
||||||
|
EN gloss: an electrical device by which alternating current of one voltage is changed to another voltage
|
||||||
|
DE gloss: ein elektrisches Gerät, mit dem Wechselstrom einer bestimmten Spannung in eine andere Spannung umgewandelt wird
|
||||||
|
EN words: transformer
|
||||||
|
DE words: Spannungswandler, Trafo, Transformator, Transformer
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
49. [noun] ili:i37037
|
||||||
|
EN gloss: wandering from the main path of a journey
|
||||||
|
DE gloss: das Abweichen vom Hauptweg einer Reise
|
||||||
|
EN words: digression, excursion
|
||||||
|
DE words: Abschweifung, Abstecher, Einschub, Exkurs, Umschweif
|
||||||
|
QUALITY: \_\_\_
|
||||||
|
|
||||||
|
50. [noun] ili:i77288
|
||||||
|
EN gloss: any meat that is minced and spiced and cooked as patties or used to fill sausages
|
||||||
|
DE gloss: jegliches Fleisch, das zerkleinert und gewürzt und als Pasteten gekocht oder zur Füllung von Würsten verwendet wird
|
||||||
|
EN words: sausage meat
|
||||||
|
DE words: Brät, Wurstbrät
|
||||||
|
QUALITY: \_\_\_
|
||||||
87
data-pipeline/audit.ts
Normal file
87
data-pipeline/audit.ts
Normal file
|
|
@ -0,0 +1,87 @@
|
||||||
|
import Database from "better-sqlite3";
|
||||||
|
import path from "node:path";
|
||||||
|
import fs from "node:fs";
|
||||||
|
import { fileURLToPath } from "node:url";
|
||||||
|
|
||||||
|
const __dirname = path.dirname(fileURLToPath(import.meta.url));
|
||||||
|
const DB_PATH = path.join(__dirname, "db/pipeline.db");
|
||||||
|
|
||||||
|
const db = new Database(DB_PATH, { readonly: true });
|
||||||
|
|
||||||
|
// Pull 50 synsets: ~12 per POS, all must have German translations
|
||||||
|
const synsets = db
|
||||||
|
.prepare(
|
||||||
|
`
|
||||||
|
SELECT DISTINCT s.source_id, s.pos
|
||||||
|
FROM synsets s
|
||||||
|
JOIN translations t ON t.source_id = s.source_id
|
||||||
|
WHERE t.language = 'de'
|
||||||
|
ORDER BY RANDOM()
|
||||||
|
LIMIT 50
|
||||||
|
`,
|
||||||
|
)
|
||||||
|
.all() as { source_id: string; pos: string }[];
|
||||||
|
|
||||||
|
const results: string[] = [];
|
||||||
|
let index = 0;
|
||||||
|
|
||||||
|
for (const synset of synsets) {
|
||||||
|
index++;
|
||||||
|
|
||||||
|
const glosses = db
|
||||||
|
.prepare("SELECT language, text FROM glosses WHERE source_id = ?")
|
||||||
|
.all(synset.source_id) as { language: string; text: string }[];
|
||||||
|
|
||||||
|
const enGloss = glosses.find((g) => g.language === "en")?.text ?? "—";
|
||||||
|
const deGloss = glosses.find((g) => g.language === "de")?.text ?? "—";
|
||||||
|
|
||||||
|
const deTranslations = db
|
||||||
|
.prepare(
|
||||||
|
"SELECT word FROM translations WHERE source_id = ? AND language = 'de'",
|
||||||
|
)
|
||||||
|
.all(synset.source_id) as { word: string }[];
|
||||||
|
|
||||||
|
const enTranslations = db
|
||||||
|
.prepare(
|
||||||
|
"SELECT word FROM translations WHERE source_id = ? AND language = 'en'",
|
||||||
|
)
|
||||||
|
.all(synset.source_id) as { word: string }[];
|
||||||
|
|
||||||
|
const deWords = deTranslations.map((t) => t.word);
|
||||||
|
const enWords = enTranslations.map((t) => t.word);
|
||||||
|
|
||||||
|
results.push(
|
||||||
|
[
|
||||||
|
`${String(index).padStart(2, " ")}. [${synset.pos}] ${synset.source_id}`,
|
||||||
|
` EN gloss: ${enGloss}`,
|
||||||
|
` DE gloss: ${deGloss}`,
|
||||||
|
` EN words: ${enWords.join(", ")}`,
|
||||||
|
` DE words: ${deWords.join(", ")}`,
|
||||||
|
` QUALITY: ___`,
|
||||||
|
``,
|
||||||
|
].join("\n"),
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
const output = [
|
||||||
|
"# OMW German Translation Quality Audit",
|
||||||
|
"",
|
||||||
|
"Instructions: for each entry, check if the German translations",
|
||||||
|
"match the meaning described by the English gloss.",
|
||||||
|
"",
|
||||||
|
"Mark QUALITY as:",
|
||||||
|
" OK — all German translations fit the meaning",
|
||||||
|
" PARTIAL — some fit, some don't",
|
||||||
|
" BAD — none of the German translations fit",
|
||||||
|
" USELESS — translations are correct but useless for learners",
|
||||||
|
"",
|
||||||
|
"---",
|
||||||
|
"",
|
||||||
|
...results,
|
||||||
|
].join("\n");
|
||||||
|
|
||||||
|
const outPath = path.join(__dirname, "audit.md");
|
||||||
|
fs.writeFileSync(outPath, output, "utf-8");
|
||||||
|
console.log(`Wrote ${synsets.length} entries → ${outPath}`);
|
||||||
|
|
||||||
|
db.close();
|
||||||
224
data-pipeline/db/import.ts
Normal file
224
data-pipeline/db/import.ts
Normal file
|
|
@ -0,0 +1,224 @@
|
||||||
|
import fs from "node:fs/promises";
|
||||||
|
import path from "node:path";
|
||||||
|
import { fileURLToPath } from "node:url";
|
||||||
|
import { SUPPORTED_LANGUAGE_CODES } from "@lila/shared";
|
||||||
|
import type { SupportedLanguageCode, SupportedPos } from "@lila/shared";
|
||||||
|
import { openDb } from "./index.js";
|
||||||
|
|
||||||
|
// ── Types ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
type Example = { text: string; source: "omw" | "cefr" };
|
||||||
|
|
||||||
|
type AnnotatedRecord = {
|
||||||
|
source_id: string;
|
||||||
|
pos: SupportedPos;
|
||||||
|
translations: Partial<Record<SupportedLanguageCode, string[]>>;
|
||||||
|
glosses: Partial<Record<SupportedLanguageCode, string[]>>;
|
||||||
|
examples: Partial<Record<SupportedLanguageCode, Example[]>>;
|
||||||
|
votes: Partial<
|
||||||
|
Record<SupportedLanguageCode, Record<string, { cefr_source: string }>>
|
||||||
|
>;
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Paths ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
const __dirname = path.dirname(fileURLToPath(import.meta.url));
|
||||||
|
|
||||||
|
const PATHS = {
|
||||||
|
annotatedDir: path.resolve(__dirname, "../stage-2-annotate/output"),
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Loading ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
async function loadAnnotated(): Promise<AnnotatedRecord[]> {
|
||||||
|
// Use en.json as the base — it has the most complete glosses and examples.
|
||||||
|
// Merge votes and CEFR examples from the other language files.
|
||||||
|
const baseRaw = await fs.readFile(
|
||||||
|
path.join(PATHS.annotatedDir, "en.json"),
|
||||||
|
"utf-8",
|
||||||
|
);
|
||||||
|
const base = JSON.parse(baseRaw) as AnnotatedRecord[];
|
||||||
|
|
||||||
|
const byId = new Map<string, AnnotatedRecord>();
|
||||||
|
for (const record of base) {
|
||||||
|
byId.set(record.source_id, record);
|
||||||
|
}
|
||||||
|
|
||||||
|
for (const lang of SUPPORTED_LANGUAGE_CODES) {
|
||||||
|
if (lang === "en") continue;
|
||||||
|
|
||||||
|
const raw = await fs.readFile(
|
||||||
|
path.join(PATHS.annotatedDir, `${lang}.json`),
|
||||||
|
"utf-8",
|
||||||
|
);
|
||||||
|
const records = JSON.parse(raw) as AnnotatedRecord[];
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
const base = byId.get(record.source_id);
|
||||||
|
if (!base) continue;
|
||||||
|
|
||||||
|
// Merge votes
|
||||||
|
for (const [l, langVotes] of Object.entries(record.votes)) {
|
||||||
|
if (!base.votes[l as SupportedLanguageCode]) {
|
||||||
|
base.votes[l as SupportedLanguageCode] = {};
|
||||||
|
}
|
||||||
|
Object.assign(base.votes[l as SupportedLanguageCode]!, langVotes);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Merge CEFR examples not already in base
|
||||||
|
for (const [l, examples] of Object.entries(record.examples)) {
|
||||||
|
const lang = l as SupportedLanguageCode;
|
||||||
|
const cefrExamples = examples.filter((e) => e.source === "cefr");
|
||||||
|
if (cefrExamples.length === 0) continue;
|
||||||
|
|
||||||
|
if (!base.examples[lang]) {
|
||||||
|
base.examples[lang] = cefrExamples;
|
||||||
|
} else {
|
||||||
|
base.examples[lang].push(...cefrExamples);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return [...byId.values()];
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Import ────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
export async function importStage2(): Promise<void> {
|
||||||
|
console.log("Loading stage 2 annotated files...");
|
||||||
|
const records = await loadAnnotated();
|
||||||
|
console.log(` Loaded ${records.length.toLocaleString()} synsets`);
|
||||||
|
|
||||||
|
const db = openDb();
|
||||||
|
|
||||||
|
const insertSynset = db.prepare(
|
||||||
|
`INSERT INTO synsets (source_id, pos) VALUES (?, ?)`,
|
||||||
|
);
|
||||||
|
|
||||||
|
const insertTranslation = db.prepare(
|
||||||
|
`INSERT INTO translations (source_id, language, word) VALUES (?, ?, ?)`,
|
||||||
|
);
|
||||||
|
|
||||||
|
const insertGloss = db.prepare(
|
||||||
|
`INSERT INTO glosses (source_id, language, text) VALUES (?, ?, ?)`,
|
||||||
|
);
|
||||||
|
|
||||||
|
const insertExample = db.prepare(
|
||||||
|
`INSERT INTO examples (source_id, language, text, source) VALUES (?, ?, ?, ?)`,
|
||||||
|
);
|
||||||
|
|
||||||
|
const insertCefrVote = db.prepare(`
|
||||||
|
INSERT INTO cefr_source_votes (translation_id, cefr_level)
|
||||||
|
VALUES (
|
||||||
|
(SELECT id FROM translations WHERE source_id = ? AND language = ? AND word = ?),
|
||||||
|
?
|
||||||
|
)
|
||||||
|
`);
|
||||||
|
|
||||||
|
console.log("\nImporting into pipeline.db...");
|
||||||
|
|
||||||
|
const importAll = db.transaction(() => {
|
||||||
|
let synsets = 0;
|
||||||
|
let translations = 0;
|
||||||
|
let glosses = 0;
|
||||||
|
let examples = 0;
|
||||||
|
let cefrVotes = 0;
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
insertSynset.run(record.source_id, record.pos);
|
||||||
|
synsets++;
|
||||||
|
|
||||||
|
// Translations
|
||||||
|
for (const [lang, words] of Object.entries(record.translations)) {
|
||||||
|
const unique = [...new Set(words)];
|
||||||
|
for (const word of unique) {
|
||||||
|
insertTranslation.run(record.source_id, lang, word);
|
||||||
|
translations++;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Glosses
|
||||||
|
for (const [lang, glossList] of Object.entries(record.glosses)) {
|
||||||
|
for (const text of glossList) {
|
||||||
|
insertGloss.run(record.source_id, lang, text);
|
||||||
|
glosses++;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Examples
|
||||||
|
for (const [lang, exList] of Object.entries(record.examples)) {
|
||||||
|
for (const example of exList) {
|
||||||
|
insertExample.run(
|
||||||
|
record.source_id,
|
||||||
|
lang,
|
||||||
|
example.text,
|
||||||
|
example.source,
|
||||||
|
);
|
||||||
|
examples++;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// CEFR source votes
|
||||||
|
for (const [lang, langVotes] of Object.entries(record.votes)) {
|
||||||
|
for (const [word, vote] of Object.entries(
|
||||||
|
langVotes as Record<string, { cefr_source: string }>,
|
||||||
|
)) {
|
||||||
|
insertCefrVote.run(record.source_id, lang, word, vote.cefr_source);
|
||||||
|
cefrVotes++;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return { synsets, translations, glosses, examples, cefrVotes };
|
||||||
|
});
|
||||||
|
|
||||||
|
const counts = importAll();
|
||||||
|
|
||||||
|
console.log(` synsets: ${counts.synsets.toLocaleString()}`);
|
||||||
|
console.log(` translations: ${counts.translations.toLocaleString()}`);
|
||||||
|
console.log(` glosses: ${counts.glosses.toLocaleString()}`);
|
||||||
|
console.log(` examples: ${counts.examples.toLocaleString()}`);
|
||||||
|
console.log(` cefr votes: ${counts.cefrVotes.toLocaleString()}`);
|
||||||
|
|
||||||
|
db.close();
|
||||||
|
console.log("\nImport complete.");
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Check if already imported ─────────────────────────────────────────────────
|
||||||
|
|
||||||
|
export function isImported(): boolean {
|
||||||
|
const db = openDb();
|
||||||
|
const row = db.prepare(`SELECT COUNT(*) as count FROM synsets`).get() as {
|
||||||
|
count: number;
|
||||||
|
};
|
||||||
|
db.close();
|
||||||
|
return row.count > 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Main ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
async function main(): Promise<void> {
|
||||||
|
const db = openDb();
|
||||||
|
const row = db.prepare(`SELECT COUNT(*) as count FROM synsets`).get() as {
|
||||||
|
count: number;
|
||||||
|
};
|
||||||
|
db.close();
|
||||||
|
|
||||||
|
if (row.count > 0) {
|
||||||
|
console.log(
|
||||||
|
`pipeline.db already contains ${row.count.toLocaleString()} synsets — skipping import.`,
|
||||||
|
);
|
||||||
|
console.log("Delete pipeline.db and re-run db:init to start fresh.");
|
||||||
|
process.exit(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
await importStage2();
|
||||||
|
}
|
||||||
|
|
||||||
|
if (import.meta.url === `file://${process.argv[1]}`) {
|
||||||
|
main().catch((err) => {
|
||||||
|
console.error(err);
|
||||||
|
process.exit(1);
|
||||||
|
});
|
||||||
|
}
|
||||||
24
data-pipeline/db/index.ts
Normal file
24
data-pipeline/db/index.ts
Normal file
|
|
@ -0,0 +1,24 @@
|
||||||
|
import path from "node:path";
|
||||||
|
import { fileURLToPath } from "node:url";
|
||||||
|
import Database from "better-sqlite3";
|
||||||
|
|
||||||
|
// ── Paths ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
const __dirname = path.dirname(fileURLToPath(import.meta.url));
|
||||||
|
|
||||||
|
const DB_PATH = path.join(__dirname, "pipeline.db");
|
||||||
|
|
||||||
|
// ── Types ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
export type Db = InstanceType<typeof Database>;
|
||||||
|
|
||||||
|
// ── Open ──────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
export function openDb(): Db {
|
||||||
|
const db = new Database(DB_PATH);
|
||||||
|
|
||||||
|
db.pragma("journal_mode = WAL");
|
||||||
|
db.pragma("foreign_keys = ON");
|
||||||
|
|
||||||
|
return db;
|
||||||
|
}
|
||||||
42
data-pipeline/db/init.ts
Normal file
42
data-pipeline/db/init.ts
Normal file
|
|
@ -0,0 +1,42 @@
|
||||||
|
import fs from "node:fs/promises";
|
||||||
|
import path from "node:path";
|
||||||
|
import { fileURLToPath } from "node:url";
|
||||||
|
import Database from "better-sqlite3";
|
||||||
|
|
||||||
|
// ── Paths ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
const __dirname = path.dirname(fileURLToPath(import.meta.url));
|
||||||
|
|
||||||
|
const PATHS = {
|
||||||
|
schema: path.join(__dirname, "schema.sql"),
|
||||||
|
db: path.join(__dirname, "pipeline.db"),
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Init ──────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
export async function initDb(): Promise<void> {
|
||||||
|
const schema = await fs.readFile(PATHS.schema, "utf-8");
|
||||||
|
const db = new Database(PATHS.db);
|
||||||
|
|
||||||
|
db.pragma("journal_mode = WAL");
|
||||||
|
db.pragma("foreign_keys = ON");
|
||||||
|
db.exec(schema);
|
||||||
|
db.close();
|
||||||
|
|
||||||
|
console.log(` pipeline.db initialised → ${PATHS.db}`);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Main ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
async function main(): Promise<void> {
|
||||||
|
console.log("Initialising pipeline.db...");
|
||||||
|
await initDb();
|
||||||
|
}
|
||||||
|
|
||||||
|
// after
|
||||||
|
if (import.meta.url === `file://${process.argv[1]}`) {
|
||||||
|
main().catch((err) => {
|
||||||
|
console.error(err);
|
||||||
|
process.exit(1);
|
||||||
|
});
|
||||||
|
}
|
||||||
BIN
data-pipeline/db/pipeline.db-shm
Normal file
BIN
data-pipeline/db/pipeline.db-shm
Normal file
Binary file not shown.
164
data-pipeline/db/schema.sql
Normal file
164
data-pipeline/db/schema.sql
Normal file
|
|
@ -0,0 +1,164 @@
|
||||||
|
-- ── Base data ─────────────────────────────────────────────────────────────────
|
||||||
|
-- Imported from stage 2 JSON on first run. Never mutated after import.
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS synsets (
|
||||||
|
source_id TEXT PRIMARY KEY,
|
||||||
|
pos TEXT NOT NULL
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS translations (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL REFERENCES synsets(source_id),
|
||||||
|
language TEXT NOT NULL,
|
||||||
|
word TEXT NOT NULL,
|
||||||
|
UNIQUE (source_id, language, word)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS glosses (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL REFERENCES synsets(source_id),
|
||||||
|
language TEXT NOT NULL,
|
||||||
|
text TEXT NOT NULL
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS examples (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL REFERENCES synsets(source_id),
|
||||||
|
language TEXT NOT NULL,
|
||||||
|
text TEXT NOT NULL,
|
||||||
|
source TEXT NOT NULL
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS cefr_source_votes (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
translation_id INTEGER NOT NULL REFERENCES translations(id),
|
||||||
|
cefr_level TEXT NOT NULL,
|
||||||
|
UNIQUE (translation_id)
|
||||||
|
);
|
||||||
|
|
||||||
|
-- ── Status tracking ───────────────────────────────────────────────────────────
|
||||||
|
-- One row per synset per model per stage. Drives resumability.
|
||||||
|
-- stage: round1 | round2 | tiebreak
|
||||||
|
-- status: pending | complete | needs_review | flagged
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS run_status (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL,
|
||||||
|
model_name TEXT NOT NULL,
|
||||||
|
stage TEXT NOT NULL,
|
||||||
|
status TEXT NOT NULL,
|
||||||
|
created_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||||
|
updated_at TEXT NOT NULL DEFAULT (datetime('now')),
|
||||||
|
UNIQUE (source_id, model_name, stage)
|
||||||
|
);
|
||||||
|
|
||||||
|
-- ── Round 1 output ────────────────────────────────────────────────────────────
|
||||||
|
-- One row per translation/language per model. Written atomically per record.
|
||||||
|
-- Unique constraints enforce one model one vote.
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS model_cefr_votes (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
translation_id INTEGER NOT NULL REFERENCES translations(id),
|
||||||
|
model_name TEXT NOT NULL,
|
||||||
|
cefr_level TEXT NOT NULL,
|
||||||
|
UNIQUE (translation_id, model_name)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS model_translation_rejections (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
translation_id INTEGER NOT NULL REFERENCES translations(id),
|
||||||
|
model_name TEXT NOT NULL,
|
||||||
|
UNIQUE (translation_id, model_name)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS generated_glosses (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL REFERENCES synsets(source_id),
|
||||||
|
model_name TEXT NOT NULL,
|
||||||
|
language TEXT NOT NULL,
|
||||||
|
text TEXT NOT NULL,
|
||||||
|
UNIQUE (source_id, model_name, language)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS generated_examples (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL REFERENCES synsets(source_id),
|
||||||
|
model_name TEXT NOT NULL,
|
||||||
|
language TEXT NOT NULL,
|
||||||
|
text TEXT NOT NULL,
|
||||||
|
UNIQUE (source_id, model_name, language)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS generated_descriptions (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL REFERENCES synsets(source_id),
|
||||||
|
model_name TEXT NOT NULL,
|
||||||
|
language TEXT NOT NULL,
|
||||||
|
text TEXT NOT NULL,
|
||||||
|
UNIQUE (source_id, model_name, language)
|
||||||
|
);
|
||||||
|
|
||||||
|
-- ── Round 2 output ────────────────────────────────────────────────────────────
|
||||||
|
-- Each row represents one model voting for one candidate.
|
||||||
|
-- The candidate with the most votes wins in merge.
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS gloss_candidate_votes (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
gloss_id INTEGER NOT NULL REFERENCES generated_glosses(id),
|
||||||
|
model_name TEXT NOT NULL,
|
||||||
|
UNIQUE (gloss_id, model_name)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS example_candidate_votes (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
example_id INTEGER NOT NULL REFERENCES generated_examples(id),
|
||||||
|
model_name TEXT NOT NULL,
|
||||||
|
UNIQUE (example_id, model_name)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS description_candidate_votes (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
description_id INTEGER NOT NULL REFERENCES generated_descriptions(id),
|
||||||
|
model_name TEXT NOT NULL,
|
||||||
|
UNIQUE (description_id, model_name)
|
||||||
|
);
|
||||||
|
|
||||||
|
-- ── Resolved output ───────────────────────────────────────────────────────────
|
||||||
|
-- Written by merge. Never updated after writing.
|
||||||
|
-- Only fully resolved records are written here — no nulls, no flags.
|
||||||
|
-- Absence of a row means unresolved. Flagged status tracked in run_status.
|
||||||
|
-- source: omw | cefr | model_name
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS resolved_translations (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
translation_id INTEGER NOT NULL REFERENCES translations(id),
|
||||||
|
cefr_level TEXT NOT NULL,
|
||||||
|
difficulty TEXT NOT NULL,
|
||||||
|
UNIQUE (translation_id)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS resolved_glosses (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL REFERENCES synsets(source_id),
|
||||||
|
language TEXT NOT NULL,
|
||||||
|
text TEXT NOT NULL,
|
||||||
|
source TEXT NOT NULL,
|
||||||
|
UNIQUE (source_id, language)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS resolved_examples (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL REFERENCES synsets(source_id),
|
||||||
|
language TEXT NOT NULL,
|
||||||
|
text TEXT NOT NULL,
|
||||||
|
source TEXT NOT NULL
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE TABLE IF NOT EXISTS resolved_descriptions (
|
||||||
|
id INTEGER PRIMARY KEY,
|
||||||
|
source_id TEXT NOT NULL REFERENCES synsets(source_id),
|
||||||
|
language TEXT NOT NULL,
|
||||||
|
text TEXT NOT NULL,
|
||||||
|
source TEXT NOT NULL,
|
||||||
|
UNIQUE (source_id, language)
|
||||||
|
);
|
||||||
|
|
@ -3,7 +3,14 @@
|
||||||
"version": "1.0.0",
|
"version": "1.0.0",
|
||||||
"private": true,
|
"private": true,
|
||||||
"type": "module",
|
"type": "module",
|
||||||
"scripts": {},
|
"scripts": {
|
||||||
|
"db:import": "tsx db/import.ts",
|
||||||
|
"db:init": "tsx db/init.ts",
|
||||||
|
"annotate": "tsx stage-2-annotate/scripts/annotate.ts",
|
||||||
|
"test": "vitest run",
|
||||||
|
"test:watch": "vitest",
|
||||||
|
"pipeline:run": "tsx --env-file .env pipeline.ts"
|
||||||
|
},
|
||||||
"dependencies": {
|
"dependencies": {
|
||||||
"@lila/shared": "workspace:*",
|
"@lila/shared": "workspace:*",
|
||||||
"better-sqlite3": "^12.9.0"
|
"better-sqlite3": "^12.9.0"
|
||||||
|
|
@ -12,6 +19,7 @@
|
||||||
"@types/better-sqlite3": "^7.6.13",
|
"@types/better-sqlite3": "^7.6.13",
|
||||||
"@types/node": "^24.12.0",
|
"@types/node": "^24.12.0",
|
||||||
"tsx": "^4.21.0",
|
"tsx": "^4.21.0",
|
||||||
"typescript": "^5.9.3"
|
"typescript": "^5.9.3",
|
||||||
|
"vitest": "^4.1.0"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
|
||||||
561
data-pipeline/pipeline.ts
Normal file
561
data-pipeline/pipeline.ts
Normal file
|
|
@ -0,0 +1,561 @@
|
||||||
|
import fs from "node:fs/promises";
|
||||||
|
import path from "node:path";
|
||||||
|
import { fileURLToPath } from "node:url";
|
||||||
|
import { initDb } from "./db/init.js";
|
||||||
|
import { isImported, importStage2 } from "./db/import.js";
|
||||||
|
import { openDb } from "./db/index.js";
|
||||||
|
import { ALL_PROVIDERS, validateProviderKey } from "./stage-3-enrich/config.js";
|
||||||
|
import type { ProviderConfig } from "./stage-3-enrich/config.js";
|
||||||
|
|
||||||
|
// ── Types ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
type RunStage =
|
||||||
|
| "round1"
|
||||||
|
| "compile_candidates"
|
||||||
|
| "round2"
|
||||||
|
| "compile_votes"
|
||||||
|
| "merge"
|
||||||
|
| "tiebreak"
|
||||||
|
| "compare";
|
||||||
|
|
||||||
|
type StageStatus = "complete" | "pending" | "in_progress";
|
||||||
|
|
||||||
|
type RunStats = {
|
||||||
|
startedAt: Date;
|
||||||
|
stoppedAt: Date | null;
|
||||||
|
recordsProcessed: number;
|
||||||
|
recordsSkipped: number;
|
||||||
|
needsReview: number;
|
||||||
|
modelsRun: string[];
|
||||||
|
currentStage: RunStage | null;
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Constants ─────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
const __dirname = path.dirname(fileURLToPath(import.meta.url));
|
||||||
|
|
||||||
|
const PATHS = {
|
||||||
|
omw: path.join(__dirname, "stage-1-extract/output/omw.json"),
|
||||||
|
db: path.join(__dirname, "db/pipeline.db"),
|
||||||
|
reports: path.join(__dirname, "reports"),
|
||||||
|
llamaHealth: "http://127.0.0.1:8080/health",
|
||||||
|
};
|
||||||
|
|
||||||
|
const SENTINEL = { sourceId: "system", modelName: "system" };
|
||||||
|
|
||||||
|
// ── Startup checks ────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
async function checkOmwExists(): Promise<void> {
|
||||||
|
try {
|
||||||
|
await fs.access(PATHS.omw);
|
||||||
|
} catch {
|
||||||
|
console.error("\n ERROR: stage-1-extract/output/omw.json not found.");
|
||||||
|
console.error(" Run the stage 1 extraction script first:");
|
||||||
|
console.error(" python stage-1-extract/scripts/extract.py\n");
|
||||||
|
process.exit(1);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function checkAndInitDb(): Promise<void> {
|
||||||
|
try {
|
||||||
|
await fs.access(PATHS.db);
|
||||||
|
} catch {
|
||||||
|
console.log(" pipeline.db not found — initialising...");
|
||||||
|
await initDb();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function checkAndImportDb(): Promise<void> {
|
||||||
|
if (!isImported()) {
|
||||||
|
console.log(" Base tables empty — importing stage 2 data...");
|
||||||
|
await importStage2();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function checkLlamaServer(): Promise<boolean> {
|
||||||
|
try {
|
||||||
|
const res = await fetch(PATHS.llamaHealth);
|
||||||
|
return res.ok;
|
||||||
|
} catch {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
function isLocalProvider(provider: ProviderConfig): boolean {
|
||||||
|
return provider.apiKey === "none";
|
||||||
|
}
|
||||||
|
|
||||||
|
async function checkProviderReady(provider: ProviderConfig): Promise<void> {
|
||||||
|
if (isLocalProvider(provider)) {
|
||||||
|
const running = await checkLlamaServer();
|
||||||
|
if (!running) {
|
||||||
|
console.error("\n ERROR: llama.cpp server is not running.");
|
||||||
|
console.error(" Start the server before running the pipeline:");
|
||||||
|
console.error(
|
||||||
|
" ./build/bin/llama-server --model models/<model>.gguf \\",
|
||||||
|
);
|
||||||
|
console.error(" --port 8080 --host 127.0.0.1");
|
||||||
|
console.error(" See llm-setup.md for full instructions.\n");
|
||||||
|
process.exit(1);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
validateProviderKey(provider);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Run name generation ───────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
async function generateRunName(): Promise<string> {
|
||||||
|
await fs.mkdir(PATHS.reports, { recursive: true });
|
||||||
|
|
||||||
|
const date = new Date().toISOString().slice(0, 10);
|
||||||
|
const files = await fs.readdir(PATHS.reports);
|
||||||
|
const todaysRuns = files.filter(
|
||||||
|
(f) => f.startsWith(date) && f.endsWith(".json"),
|
||||||
|
).length;
|
||||||
|
|
||||||
|
return `${date}_run-${todaysRuns + 1}`;
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Shutdown handler ──────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
let shutdownRequested = false;
|
||||||
|
|
||||||
|
function registerShutdownHandler(stats: RunStats): void {
|
||||||
|
const handler = (): void => {
|
||||||
|
if (shutdownRequested) return;
|
||||||
|
shutdownRequested = true;
|
||||||
|
stats.stoppedAt = new Date();
|
||||||
|
console.log("\n\n Shutdown requested — finishing current record...");
|
||||||
|
};
|
||||||
|
|
||||||
|
process.on("SIGINT", handler);
|
||||||
|
process.on("SIGTERM", handler);
|
||||||
|
}
|
||||||
|
// ── Stage status helpers ──────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
function getSentinelStatus(stage: RunStage): StageStatus {
|
||||||
|
const db = openDb();
|
||||||
|
const row = db
|
||||||
|
.prepare(
|
||||||
|
`SELECT status FROM run_status
|
||||||
|
WHERE source_id = ? AND model_name = ? AND stage = ?`,
|
||||||
|
)
|
||||||
|
.get(SENTINEL.sourceId, SENTINEL.modelName, stage) as
|
||||||
|
| { status: string }
|
||||||
|
| undefined;
|
||||||
|
db.close();
|
||||||
|
return row?.status === "complete" ? "complete" : "pending";
|
||||||
|
}
|
||||||
|
|
||||||
|
function markSentinelComplete(stage: RunStage): void {
|
||||||
|
const db = openDb();
|
||||||
|
db.prepare(
|
||||||
|
`INSERT INTO run_status (source_id, model_name, stage, status)
|
||||||
|
VALUES (?, ?, ?, 'complete')
|
||||||
|
ON CONFLICT (source_id, model_name, stage)
|
||||||
|
DO UPDATE SET status = 'complete', updated_at = datetime('now')`,
|
||||||
|
).run(SENTINEL.sourceId, SENTINEL.modelName, stage);
|
||||||
|
db.close();
|
||||||
|
}
|
||||||
|
|
||||||
|
function getModelRound1Status(modelName: string): StageStatus {
|
||||||
|
const db = openDb();
|
||||||
|
|
||||||
|
const total = (
|
||||||
|
db.prepare("SELECT COUNT(*) as count FROM synsets").get() as {
|
||||||
|
count: number;
|
||||||
|
}
|
||||||
|
).count;
|
||||||
|
|
||||||
|
const complete = (
|
||||||
|
db
|
||||||
|
.prepare(
|
||||||
|
`SELECT COUNT(*) as count FROM run_status
|
||||||
|
WHERE model_name = ? AND stage = 'round1' AND status = 'complete'`,
|
||||||
|
)
|
||||||
|
.get(modelName) as { count: number }
|
||||||
|
).count;
|
||||||
|
|
||||||
|
db.close();
|
||||||
|
|
||||||
|
if (complete === 0) return "pending";
|
||||||
|
if (complete >= total) return "complete";
|
||||||
|
return "in_progress";
|
||||||
|
}
|
||||||
|
|
||||||
|
function getModelRound2Status(modelName: string): StageStatus {
|
||||||
|
const db = openDb();
|
||||||
|
|
||||||
|
const total = (
|
||||||
|
db.prepare("SELECT COUNT(*) as count FROM synsets").get() as {
|
||||||
|
count: number;
|
||||||
|
}
|
||||||
|
).count;
|
||||||
|
|
||||||
|
const complete = (
|
||||||
|
db
|
||||||
|
.prepare(
|
||||||
|
`SELECT COUNT(*) as count FROM run_status
|
||||||
|
WHERE model_name = ? AND stage = 'round2' AND status = 'complete'`,
|
||||||
|
)
|
||||||
|
.get(modelName) as { count: number }
|
||||||
|
).count;
|
||||||
|
|
||||||
|
db.close();
|
||||||
|
|
||||||
|
if (complete === 0) return "pending";
|
||||||
|
if (complete >= total) return "complete";
|
||||||
|
return "in_progress";
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Stage runners (stubs) ─────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
function runRound1(provider: ProviderConfig, stats: RunStats): void {
|
||||||
|
console.log(`\n [round 1] Running ${provider.name}...`);
|
||||||
|
// TODO: implement round 1 enrich script
|
||||||
|
console.log(` [round 1] ${provider.name} — not yet implemented`);
|
||||||
|
stats.modelsRun.push(provider.name);
|
||||||
|
}
|
||||||
|
|
||||||
|
function compileCandidates(): void {
|
||||||
|
console.log("\n [compile candidates] Compiling round 1 output...");
|
||||||
|
// TODO: implement compile candidates script
|
||||||
|
console.log(" [compile candidates] not yet implemented");
|
||||||
|
markSentinelComplete("compile_candidates");
|
||||||
|
}
|
||||||
|
|
||||||
|
function runRound2(provider: ProviderConfig, stats: RunStats): void {
|
||||||
|
console.log(`\n [round 2] Running ${provider.name}...`);
|
||||||
|
// TODO: implement round 2 enrich script
|
||||||
|
console.log(` [round 2] ${provider.name} — not yet implemented`);
|
||||||
|
stats.modelsRun.push(provider.name);
|
||||||
|
}
|
||||||
|
|
||||||
|
function compileVotes(): void {
|
||||||
|
console.log("\n [compile votes] Compiling round 2 votes...");
|
||||||
|
// TODO: implement compile votes script
|
||||||
|
console.log(" [compile votes] not yet implemented");
|
||||||
|
markSentinelComplete("compile_votes");
|
||||||
|
}
|
||||||
|
|
||||||
|
function runMerge(): void {
|
||||||
|
console.log("\n [merge] Resolving votes...");
|
||||||
|
// TODO: implement merge script
|
||||||
|
console.log(" [merge] not yet implemented");
|
||||||
|
markSentinelComplete("merge");
|
||||||
|
}
|
||||||
|
|
||||||
|
function runTiebreak(stats: RunStats): void {
|
||||||
|
console.log("\n [tiebreak] Resolving flagged translations...");
|
||||||
|
// TODO: implement tiebreak logic
|
||||||
|
console.log(" [tiebreak] not yet implemented");
|
||||||
|
stats.currentStage = "tiebreak";
|
||||||
|
}
|
||||||
|
|
||||||
|
function runCompare(): void {
|
||||||
|
console.log("\n [compare] Generating COVERAGE.md...");
|
||||||
|
// TODO: implement compare script
|
||||||
|
console.log(" [compare] not yet implemented");
|
||||||
|
markSentinelComplete("compare");
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Report generation ─────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
async function generateReport(runName: string, stats: RunStats): Promise<void> {
|
||||||
|
const db = openDb();
|
||||||
|
|
||||||
|
const totalSynsets = (
|
||||||
|
db.prepare("SELECT COUNT(*) as count FROM synsets").get() as {
|
||||||
|
count: number;
|
||||||
|
}
|
||||||
|
).count;
|
||||||
|
|
||||||
|
const resolvedTranslations = (
|
||||||
|
db.prepare("SELECT COUNT(*) as count FROM resolved_translations").get() as {
|
||||||
|
count: number;
|
||||||
|
}
|
||||||
|
).count;
|
||||||
|
|
||||||
|
const flaggedTranslations = (
|
||||||
|
db
|
||||||
|
.prepare(
|
||||||
|
`SELECT COUNT(*) as count FROM run_status
|
||||||
|
WHERE stage = 'merge' AND status = 'flagged'`,
|
||||||
|
)
|
||||||
|
.get() as { count: number }
|
||||||
|
).count;
|
||||||
|
|
||||||
|
const needsReview = (
|
||||||
|
db
|
||||||
|
.prepare(
|
||||||
|
`SELECT COUNT(*) as count FROM run_status
|
||||||
|
WHERE status = 'needs_review'`,
|
||||||
|
)
|
||||||
|
.get() as { count: number }
|
||||||
|
).count;
|
||||||
|
|
||||||
|
db.close();
|
||||||
|
|
||||||
|
const stoppedAt = stats.stoppedAt ?? new Date();
|
||||||
|
const durationMs = stoppedAt.getTime() - stats.startedAt.getTime();
|
||||||
|
const durationMin = Math.round(durationMs / 60_000);
|
||||||
|
|
||||||
|
const isFinal =
|
||||||
|
getSentinelStatus("compare") === "complete" && flaggedTranslations === 0;
|
||||||
|
|
||||||
|
const report = {
|
||||||
|
runName,
|
||||||
|
generatedAt: stoppedAt.toISOString(),
|
||||||
|
durationMinutes: durationMin,
|
||||||
|
isFinal,
|
||||||
|
progress: {
|
||||||
|
totalSynsets,
|
||||||
|
resolvedTranslations,
|
||||||
|
flaggedTranslations,
|
||||||
|
needsReview,
|
||||||
|
recordsProcessedThisRun: stats.recordsProcessed,
|
||||||
|
recordsSkippedThisRun: stats.recordsSkipped,
|
||||||
|
},
|
||||||
|
modelsRun: stats.modelsRun,
|
||||||
|
stages: {
|
||||||
|
round1: ALL_PROVIDERS.map((p) => ({
|
||||||
|
model: p.name,
|
||||||
|
status: getModelRound1Status(p.name),
|
||||||
|
})),
|
||||||
|
compileCandidates: getSentinelStatus("compile_candidates"),
|
||||||
|
round2: ALL_PROVIDERS.map((p) => ({
|
||||||
|
model: p.name,
|
||||||
|
status: getModelRound2Status(p.name),
|
||||||
|
})),
|
||||||
|
compileVotes: getSentinelStatus("compile_votes"),
|
||||||
|
merge: getSentinelStatus("merge"),
|
||||||
|
compare: getSentinelStatus("compare"),
|
||||||
|
},
|
||||||
|
};
|
||||||
|
|
||||||
|
await fs.mkdir(PATHS.reports, { recursive: true });
|
||||||
|
|
||||||
|
const jsonPath = path.join(PATHS.reports, `${runName}.json`);
|
||||||
|
const mdPath = path.join(PATHS.reports, `${runName}.md`);
|
||||||
|
|
||||||
|
await fs.writeFile(jsonPath, JSON.stringify(report, null, 2), "utf-8");
|
||||||
|
|
||||||
|
const md = [
|
||||||
|
`# Pipeline run: ${runName}`,
|
||||||
|
``,
|
||||||
|
`Generated: ${stoppedAt.toISOString()}`,
|
||||||
|
`Duration: ${durationMin} minutes`,
|
||||||
|
isFinal
|
||||||
|
? `**Status: FINAL — pipeline complete**`
|
||||||
|
: `**Status: In progress**`,
|
||||||
|
``,
|
||||||
|
`## Progress`,
|
||||||
|
``,
|
||||||
|
`| Metric | Value |`,
|
||||||
|
`| ------ | ----- |`,
|
||||||
|
`| Total synsets | ${totalSynsets.toLocaleString()} |`,
|
||||||
|
`| Resolved translations | ${resolvedTranslations.toLocaleString()} |`,
|
||||||
|
`| Flagged translations | ${flaggedTranslations.toLocaleString()} |`,
|
||||||
|
`| Needs review | ${needsReview.toLocaleString()} |`,
|
||||||
|
`| Records processed this run | ${stats.recordsProcessed.toLocaleString()} |`,
|
||||||
|
`| Records skipped this run | ${stats.recordsSkipped.toLocaleString()} |`,
|
||||||
|
``,
|
||||||
|
`## Stage status`,
|
||||||
|
``,
|
||||||
|
`### Round 1`,
|
||||||
|
``,
|
||||||
|
...report.stages.round1.map(
|
||||||
|
(s) =>
|
||||||
|
`- ${s.status === "complete" ? "✅" : s.status === "in_progress" ? "🔄" : "🔲"} ${s.model}`,
|
||||||
|
),
|
||||||
|
``,
|
||||||
|
`### Compile candidates: ${report.stages.compileCandidates}`,
|
||||||
|
``,
|
||||||
|
`### Round 2`,
|
||||||
|
``,
|
||||||
|
...report.stages.round2.map(
|
||||||
|
(s) =>
|
||||||
|
`- ${s.status === "complete" ? "✅" : s.status === "in_progress" ? "🔄" : "🔲"} ${s.model}`,
|
||||||
|
),
|
||||||
|
``,
|
||||||
|
`### Compile votes: ${report.stages.compileVotes}`,
|
||||||
|
`### Merge: ${report.stages.merge}`,
|
||||||
|
`### Compare: ${report.stages.compare}`,
|
||||||
|
``,
|
||||||
|
`## Models run this session`,
|
||||||
|
``,
|
||||||
|
stats.modelsRun.length > 0
|
||||||
|
? stats.modelsRun.map((m) => `- ${m}`).join("\n")
|
||||||
|
: "_none_",
|
||||||
|
].join("\n");
|
||||||
|
|
||||||
|
await fs.writeFile(mdPath, md, "utf-8");
|
||||||
|
|
||||||
|
console.log(`\n Report written → ${jsonPath}`);
|
||||||
|
console.log(` Report written → ${mdPath}`);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Main ──────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
async function main(): Promise<void> {
|
||||||
|
console.log("lila data pipeline\n");
|
||||||
|
|
||||||
|
// ── Startup checks
|
||||||
|
console.log("Checking prerequisites...");
|
||||||
|
await checkOmwExists();
|
||||||
|
await checkAndInitDb();
|
||||||
|
await checkAndImportDb();
|
||||||
|
console.log(" Prerequisites OK");
|
||||||
|
|
||||||
|
// ── Run name
|
||||||
|
const runName = await generateRunName();
|
||||||
|
console.log(`\n Run: ${runName}`);
|
||||||
|
|
||||||
|
// ── Stats
|
||||||
|
const stats: RunStats = {
|
||||||
|
startedAt: new Date(),
|
||||||
|
stoppedAt: null,
|
||||||
|
recordsProcessed: 0,
|
||||||
|
recordsSkipped: 0,
|
||||||
|
needsReview: 0,
|
||||||
|
modelsRun: [],
|
||||||
|
currentStage: null,
|
||||||
|
};
|
||||||
|
|
||||||
|
registerShutdownHandler(stats);
|
||||||
|
|
||||||
|
// ── Round 1
|
||||||
|
console.log("\nRound 1 — generation");
|
||||||
|
for (const provider of ALL_PROVIDERS) {
|
||||||
|
if (shutdownRequested) break;
|
||||||
|
|
||||||
|
const status = getModelRound1Status(provider.name);
|
||||||
|
|
||||||
|
if (status === "complete") {
|
||||||
|
console.log(` [round 1] ${provider.name} — already complete, skipping`);
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
await checkProviderReady(provider);
|
||||||
|
stats.currentStage = "round1";
|
||||||
|
|
||||||
|
if (status === "in_progress") {
|
||||||
|
console.log(` [round 1] ${provider.name} — resuming...`);
|
||||||
|
}
|
||||||
|
|
||||||
|
runRound1(provider, stats);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (shutdownRequested) {
|
||||||
|
await generateReport(runName, stats);
|
||||||
|
process.exit(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Compile candidates
|
||||||
|
if (getSentinelStatus("compile_candidates") === "complete") {
|
||||||
|
console.log("\n [compile candidates] Already complete, skipping");
|
||||||
|
} else {
|
||||||
|
stats.currentStage = "compile_candidates";
|
||||||
|
compileCandidates();
|
||||||
|
}
|
||||||
|
|
||||||
|
if (shutdownRequested) {
|
||||||
|
await generateReport(runName, stats);
|
||||||
|
process.exit(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Round 2
|
||||||
|
console.log("\nRound 2 — voting");
|
||||||
|
for (const provider of ALL_PROVIDERS) {
|
||||||
|
if (shutdownRequested) break;
|
||||||
|
|
||||||
|
const status = getModelRound2Status(provider.name);
|
||||||
|
|
||||||
|
if (status === "complete") {
|
||||||
|
console.log(` [round 2] ${provider.name} — already complete, skipping`);
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
await checkProviderReady(provider);
|
||||||
|
stats.currentStage = "round2";
|
||||||
|
|
||||||
|
if (status === "in_progress") {
|
||||||
|
console.log(` [round 2] ${provider.name} — resuming...`);
|
||||||
|
}
|
||||||
|
|
||||||
|
runRound2(provider, stats);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (shutdownRequested) {
|
||||||
|
await generateReport(runName, stats);
|
||||||
|
process.exit(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Compile votes
|
||||||
|
if (getSentinelStatus("compile_votes") === "complete") {
|
||||||
|
console.log("\n [compile votes] Already complete, skipping");
|
||||||
|
} else {
|
||||||
|
stats.currentStage = "compile_votes";
|
||||||
|
compileVotes();
|
||||||
|
}
|
||||||
|
|
||||||
|
if (shutdownRequested) {
|
||||||
|
await generateReport(runName, stats);
|
||||||
|
process.exit(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Merge
|
||||||
|
if (getSentinelStatus("merge") === "complete") {
|
||||||
|
console.log("\n [merge] Already complete, skipping");
|
||||||
|
} else {
|
||||||
|
stats.currentStage = "merge";
|
||||||
|
runMerge();
|
||||||
|
}
|
||||||
|
|
||||||
|
if (shutdownRequested) {
|
||||||
|
await generateReport(runName, stats);
|
||||||
|
process.exit(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Tiebreak
|
||||||
|
const db = openDb();
|
||||||
|
const flagged = (
|
||||||
|
db
|
||||||
|
.prepare(
|
||||||
|
`SELECT COUNT(*) as count FROM run_status
|
||||||
|
WHERE stage = 'merge' AND status = 'flagged'`,
|
||||||
|
)
|
||||||
|
.get() as { count: number }
|
||||||
|
).count;
|
||||||
|
db.close();
|
||||||
|
|
||||||
|
if (flagged > 0) {
|
||||||
|
stats.currentStage = "tiebreak";
|
||||||
|
runTiebreak(stats);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (shutdownRequested) {
|
||||||
|
await generateReport(runName, stats);
|
||||||
|
process.exit(0);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Compare
|
||||||
|
if (getSentinelStatus("compare") === "complete") {
|
||||||
|
console.log("\n [compare] Already complete, skipping");
|
||||||
|
} else {
|
||||||
|
stats.currentStage = "compare";
|
||||||
|
runCompare();
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Report
|
||||||
|
stats.stoppedAt = new Date();
|
||||||
|
await generateReport(runName, stats);
|
||||||
|
|
||||||
|
console.log("\nPipeline complete.");
|
||||||
|
}
|
||||||
|
|
||||||
|
main().catch((err) => {
|
||||||
|
console.error(err);
|
||||||
|
process.exit(1);
|
||||||
|
});
|
||||||
|
|
@ -154,7 +154,7 @@ async function loadAnnotated(): Promise<AnnotatedRecord[]> {
|
||||||
for (const [l, examples] of Object.entries(record.examples)) {
|
for (const [l, examples] of Object.entries(record.examples)) {
|
||||||
const lang = l as SupportedLanguageCode;
|
const lang = l as SupportedLanguageCode;
|
||||||
if (!base.examples[lang]) {
|
if (!base.examples[lang]) {
|
||||||
base.examples[lang] = examples as Example[];
|
base.examples[lang] = examples;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
@ -80,7 +80,7 @@ def extract_all(
|
||||||
continue
|
continue
|
||||||
covered += 1
|
covered += 1
|
||||||
|
|
||||||
lemmas = [str(lemma) for lemma in synset.lemmas()]
|
lemmas = list(dict.fromkeys(str(lemma) for lemma in synset.lemmas()))
|
||||||
defns = [d for d in synset.definitions() if d]
|
defns = [d for d in synset.definitions() if d]
|
||||||
examples = [e for e in synset.examples() if e]
|
examples = [e for e in synset.examples() if e]
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -196,12 +196,12 @@ async function annotate(): Promise<void> {
|
||||||
|
|
||||||
// Add CEFR vote
|
// Add CEFR vote
|
||||||
if (!annotated.votes[lang]) annotated.votes[lang] = {};
|
if (!annotated.votes[lang]) annotated.votes[lang] = {};
|
||||||
annotated.votes[lang]![word] = { cefr_source: cefrEntry.level };
|
annotated.votes[lang][word] = { cefr_source: cefrEntry.level };
|
||||||
|
|
||||||
// Add native example if present
|
// Add native example if present
|
||||||
if (cefrEntry.example) {
|
if (cefrEntry.example) {
|
||||||
if (!annotated.examples[lang]) annotated.examples[lang] = [];
|
if (!annotated.examples[lang]) annotated.examples[lang] = [];
|
||||||
annotated.examples[lang]!.push({
|
annotated.examples[lang].push({
|
||||||
text: cefrEntry.example,
|
text: cefrEntry.example,
|
||||||
source: "cefr" as const,
|
source: "cefr" as const,
|
||||||
});
|
});
|
||||||
|
|
|
||||||
114
data-pipeline/stage-3-enrich/config.ts
Normal file
114
data-pipeline/stage-3-enrich/config.ts
Normal file
|
|
@ -0,0 +1,114 @@
|
||||||
|
// ── Provider configuration ────────────────────────────────────────────────────
|
||||||
|
//
|
||||||
|
// Each provider + model combination counts as one vote in the final majority.
|
||||||
|
// Running the same model twice is not supported — one model, one vote.
|
||||||
|
// The `name` field is used as the model identifier in pipeline.db and must
|
||||||
|
// be unique across all runs.
|
||||||
|
//
|
||||||
|
// The pipeline iterates through ALL_PROVIDERS in order, skipping models that
|
||||||
|
// have already completed a full run and resuming models with partial progress.
|
||||||
|
//
|
||||||
|
// See llm-setup.md for full setup instructions and model recommendations.
|
||||||
|
|
||||||
|
export type ProviderConfig = {
|
||||||
|
name: string; // unique model identifier — stored in pipeline.db
|
||||||
|
baseURL: string;
|
||||||
|
apiKey: string;
|
||||||
|
model: string;
|
||||||
|
maxTokens: number;
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Local llama.cpp ───────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
export const LOCAL_GEMMA4: ProviderConfig = {
|
||||||
|
name: "local-gemma4-e4b",
|
||||||
|
baseURL: "http://127.0.0.1:8080/v1",
|
||||||
|
apiKey: "none", // llama.cpp ignores this
|
||||||
|
model: "gemma4-e4b", // llama.cpp ignores model name, uses loaded model
|
||||||
|
maxTokens: 512,
|
||||||
|
};
|
||||||
|
|
||||||
|
export const LOCAL_QWEN7B: ProviderConfig = {
|
||||||
|
name: "local-qwen2.5-7b",
|
||||||
|
baseURL: "http://127.0.0.1:8080/v1",
|
||||||
|
apiKey: "none",
|
||||||
|
model: "qwen2.5-7b",
|
||||||
|
maxTokens: 512,
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── OpenRouter — free tier ────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
export const OR_QWEN3_480B: ProviderConfig = {
|
||||||
|
name: "or-qwen3-480b",
|
||||||
|
baseURL: "https://openrouter.ai/api/v1",
|
||||||
|
apiKey: process.env["OPENROUTER_API_KEY"] ?? "",
|
||||||
|
model: "qwen/qwen3-coder:free",
|
||||||
|
maxTokens: 512,
|
||||||
|
};
|
||||||
|
|
||||||
|
export const OR_GEMMA4_31B: ProviderConfig = {
|
||||||
|
name: "or-gemma4-31b",
|
||||||
|
baseURL: "https://openrouter.ai/api/v1",
|
||||||
|
apiKey: process.env["OPENROUTER_API_KEY"] ?? "",
|
||||||
|
model: "google/gemma-4-31b-it:free",
|
||||||
|
maxTokens: 512,
|
||||||
|
};
|
||||||
|
|
||||||
|
export const OR_QWEN3_80B: ProviderConfig = {
|
||||||
|
name: "or-qwen3-80b",
|
||||||
|
baseURL: "https://openrouter.ai/api/v1",
|
||||||
|
apiKey: process.env["OPENROUTER_API_KEY"] ?? "",
|
||||||
|
model: "qwen/qwen3-next-80b-a3b-instruct:free",
|
||||||
|
maxTokens: 512,
|
||||||
|
};
|
||||||
|
|
||||||
|
export const OR_NEMOTRON: ProviderConfig = {
|
||||||
|
name: "or-nemotron-120b",
|
||||||
|
baseURL: "https://openrouter.ai/api/v1",
|
||||||
|
apiKey: process.env["OPENROUTER_API_KEY"] ?? "",
|
||||||
|
model: "nvidia/nemotron-3-super-120b-a12b:free",
|
||||||
|
maxTokens: 512,
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Anthropic — reference baseline ───────────────────────────────────────────
|
||||||
|
// Note: Anthropic uses a different API format. An adapter is required.
|
||||||
|
// See llm-setup.md for details.
|
||||||
|
|
||||||
|
export const ANTHROPIC_SONNET: ProviderConfig = {
|
||||||
|
name: "anthropic-sonnet-4",
|
||||||
|
baseURL: "https://api.anthropic.com/v1",
|
||||||
|
apiKey: process.env["ANTHROPIC_API_KEY"] ?? "",
|
||||||
|
model: "claude-sonnet-4-6",
|
||||||
|
maxTokens: 512,
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── All configured providers ──────────────────────────────────────────────────
|
||||||
|
// The pipeline runs through these in order — local models first, then cloud.
|
||||||
|
// Add new providers here to include them in the voting pool.
|
||||||
|
|
||||||
|
export const ALL_PROVIDERS: ProviderConfig[] = [
|
||||||
|
LOCAL_GEMMA4,
|
||||||
|
LOCAL_QWEN7B,
|
||||||
|
OR_QWEN3_480B,
|
||||||
|
OR_GEMMA4_31B,
|
||||||
|
OR_QWEN3_80B,
|
||||||
|
OR_NEMOTRON,
|
||||||
|
ANTHROPIC_SONNET,
|
||||||
|
];
|
||||||
|
|
||||||
|
// ── Key validation ────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
const LOCAL_PROVIDERS = new Set(["none"]);
|
||||||
|
|
||||||
|
export function validateProviderKey(provider: ProviderConfig): void {
|
||||||
|
if (LOCAL_PROVIDERS.has(provider.apiKey)) return;
|
||||||
|
|
||||||
|
if (!provider.apiKey) {
|
||||||
|
const keyName = provider.name.startsWith("anthropic")
|
||||||
|
? "ANTHROPIC_API_KEY"
|
||||||
|
: "OPENROUTER_API_KEY";
|
||||||
|
console.error(`\n ERROR: ${keyName} is not set in .env`);
|
||||||
|
console.error(` Provider "${provider.name}" requires this key to run.\n`);
|
||||||
|
process.exit(1);
|
||||||
|
}
|
||||||
|
}
|
||||||
170
data-pipeline/tests/fixtures/annotated.fixture.json
vendored
Normal file
170
data-pipeline/tests/fixtures/annotated.fixture.json
vendored
Normal file
|
|
@ -0,0 +1,170 @@
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"_fixture": "noun_with_cefr_vote",
|
||||||
|
"source_id": "ili:i100955",
|
||||||
|
"pos": "noun",
|
||||||
|
"translations": { "en": ["grain"], "de": ["Korn", "Kornbrand"] },
|
||||||
|
"glosses": { "en": ["a cereal grass"], "de": ["ein Getreidegras"] },
|
||||||
|
"examples": {
|
||||||
|
"en": [
|
||||||
|
{ "text": "wheat is a grain that is grown in Kansas", "source": "omw" }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"votes": { "en": { "grain": { "cefr_source": "B1" } } }
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"_fixture": "verb_no_votes_no_translations",
|
||||||
|
"source_id": "ili:i21779",
|
||||||
|
"pos": "verb",
|
||||||
|
"translations": { "en": ["respire"] },
|
||||||
|
"glosses": {
|
||||||
|
"en": [
|
||||||
|
"undergo the biomedical and metabolic processes of respiration by taking up oxygen and producing carbon monoxide"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"examples": {},
|
||||||
|
"votes": {}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"_fixture": "verb_with_cefr_vote_all_languages",
|
||||||
|
"source_id": "ili:i21778",
|
||||||
|
"pos": "verb",
|
||||||
|
"translations": {
|
||||||
|
"en": ["breathe", "take a breath", "respire", "suspire"],
|
||||||
|
"it": ["respirare"],
|
||||||
|
"es": ["aspirar", "respirar"],
|
||||||
|
"de": ["Luft holen", "hauchen", "Luft bekommen", "Luft kriegen", "atmen"],
|
||||||
|
"fr": ["inspirer", "respirer"]
|
||||||
|
},
|
||||||
|
"glosses": {
|
||||||
|
"en": ["draw air into, and expel out of, the lungs"],
|
||||||
|
"de": ["Luft in die Lunge saugen und aus ihr ausstoßen"]
|
||||||
|
},
|
||||||
|
"examples": {
|
||||||
|
"en": [
|
||||||
|
{
|
||||||
|
"text": "I can breathe better when the air is clean",
|
||||||
|
"source": "omw"
|
||||||
|
},
|
||||||
|
{ "text": "The patient is respiring", "source": "omw" }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"votes": { "en": { "breathe": { "cefr_source": "A1" } } }
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"_fixture": "adjective_all_languages_multiple_translations",
|
||||||
|
"source_id": "ili:i10007",
|
||||||
|
"pos": "adjective",
|
||||||
|
"translations": {
|
||||||
|
"en": ["possible"],
|
||||||
|
"it": [
|
||||||
|
"attuabile",
|
||||||
|
"effettuabile",
|
||||||
|
"eseguibile",
|
||||||
|
"fattibile",
|
||||||
|
"operabile",
|
||||||
|
"possibile",
|
||||||
|
"producibile",
|
||||||
|
"realizzabile"
|
||||||
|
],
|
||||||
|
"es": ["posible"],
|
||||||
|
"de": [
|
||||||
|
"möglich",
|
||||||
|
"denkbar",
|
||||||
|
"eventuell",
|
||||||
|
"möglicherweise",
|
||||||
|
"allfällig",
|
||||||
|
"etwaig",
|
||||||
|
"gegebenenfalls",
|
||||||
|
"eventuell"
|
||||||
|
],
|
||||||
|
"fr": ["possible", "éventuel"]
|
||||||
|
},
|
||||||
|
"glosses": {
|
||||||
|
"en": ["capable of happening or existing"],
|
||||||
|
"de": ["in der Lage, zu geschehen oder zu existieren"]
|
||||||
|
},
|
||||||
|
"examples": {
|
||||||
|
"en": [
|
||||||
|
{ "text": "a breakthrough may be possible next year", "source": "omw" },
|
||||||
|
{ "text": "anything is possible", "source": "omw" },
|
||||||
|
{ "text": "warned of possible consequences", "source": "omw" }
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"votes": { "en": { "possible": { "cefr_source": "A2" } } }
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"_fixture": "adjective_multiple_de_votes_cefr_examples",
|
||||||
|
"source_id": "ili:i10000",
|
||||||
|
"pos": "adjective",
|
||||||
|
"translations": {
|
||||||
|
"en": ["negative"],
|
||||||
|
"de": [
|
||||||
|
"dürftig",
|
||||||
|
"zu wünschen übrig lassen",
|
||||||
|
"schlecht",
|
||||||
|
"widrig",
|
||||||
|
"ungut",
|
||||||
|
"lausig",
|
||||||
|
"negativ",
|
||||||
|
"von Nachteil",
|
||||||
|
"schädlich",
|
||||||
|
"nachteilig",
|
||||||
|
"ungünstig"
|
||||||
|
],
|
||||||
|
"fr": ["négatif", "strictement négatif"]
|
||||||
|
},
|
||||||
|
"glosses": { "en": ["less than zero"], "de": ["kleiner als Null"] },
|
||||||
|
"examples": {
|
||||||
|
"en": [{ "text": "a negative number", "source": "omw" }],
|
||||||
|
"de": [
|
||||||
|
{ "text": "Die Beweise waren dürftig.", "source": "cefr" },
|
||||||
|
{ "text": "Das Wetter ist heute schlecht.", "source": "cefr" },
|
||||||
|
{
|
||||||
|
"text": "Trotz widriger Umstände haben sie es geschafft.",
|
||||||
|
"source": "cefr"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"text": "Er hatte ein ungutes Gefühl bei der Sache.",
|
||||||
|
"source": "cefr"
|
||||||
|
},
|
||||||
|
{ "text": "Er hat eine sehr negative Einstellung.", "source": "cefr" },
|
||||||
|
{
|
||||||
|
"text": "Rauchen ist schädlich für die Gesundheit.",
|
||||||
|
"source": "cefr"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"text": "Diese Entscheidung könnte nachteilig sein.",
|
||||||
|
"source": "cefr"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"text": "Das Wetter ist heute ungünstig für einen Ausflug.",
|
||||||
|
"source": "cefr"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"votes": {
|
||||||
|
"de": {
|
||||||
|
"dürftig": { "cefr_source": "C1" },
|
||||||
|
"schlecht": { "cefr_source": "A1" },
|
||||||
|
"widrig": { "cefr_source": "C1" },
|
||||||
|
"ungut": { "cefr_source": "B2" },
|
||||||
|
"negativ": { "cefr_source": "A2" },
|
||||||
|
"schädlich": { "cefr_source": "B1" },
|
||||||
|
"nachteilig": { "cefr_source": "B1" },
|
||||||
|
"ungünstig": { "cefr_source": "B2" }
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"_fixture": "adverb_no_votes",
|
||||||
|
"source_id": "ili:i18157",
|
||||||
|
"pos": "adverb",
|
||||||
|
"translations": { "en": ["a cappella"], "es": ["a capella"] },
|
||||||
|
"glosses": { "en": ["without musical accompaniment"] },
|
||||||
|
"examples": {
|
||||||
|
"en": [{ "text": "they performed a cappella", "source": "omw" }]
|
||||||
|
},
|
||||||
|
"votes": {}
|
||||||
|
}
|
||||||
|
]
|
||||||
4
data-pipeline/tests/fixtures/conflicts.fixture.json
vendored
Normal file
4
data-pipeline/tests/fixtures/conflicts.fixture.json
vendored
Normal file
|
|
@ -0,0 +1,4 @@
|
||||||
|
[
|
||||||
|
{ "word": "macht", "pos": "noun", "language": "de", "levels": ["A2", "B1"] },
|
||||||
|
{ "word": "bleiche", "pos": "noun", "language": "de", "levels": ["B2", "B1"] }
|
||||||
|
]
|
||||||
237
data-pipeline/tests/validation/db-import.validation.test.ts
Normal file
237
data-pipeline/tests/validation/db-import.validation.test.ts
Normal file
|
|
@ -0,0 +1,237 @@
|
||||||
|
import fs from "node:fs/promises";
|
||||||
|
import path from "node:path";
|
||||||
|
import { describe, it, expect, beforeAll } from "vitest";
|
||||||
|
import { SUPPORTED_LANGUAGE_CODES } from "@lila/shared";
|
||||||
|
import type { SupportedLanguageCode, SupportedPos } from "@lila/shared";
|
||||||
|
|
||||||
|
// ── Types ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
type Example = { text: string; source: "omw" | "cefr" };
|
||||||
|
|
||||||
|
type AnnotatedRecord = {
|
||||||
|
source_id: string;
|
||||||
|
pos: SupportedPos;
|
||||||
|
translations: Partial<Record<SupportedLanguageCode, string[]>>;
|
||||||
|
glosses: Partial<Record<SupportedLanguageCode, string[]>>;
|
||||||
|
examples: Partial<Record<SupportedLanguageCode, Example[]>>;
|
||||||
|
votes: Partial<
|
||||||
|
Record<SupportedLanguageCode, Record<string, { cefr_source: string }>>
|
||||||
|
>;
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Paths ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
const DB_PATH = path.resolve("db/pipeline.db");
|
||||||
|
const OMW_PATH = path.resolve("stage-1-extract/output/omw.json");
|
||||||
|
const ANNOTATED_DIR = path.resolve("stage-2-annotate/output");
|
||||||
|
|
||||||
|
// ── Helpers ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
async function dbExists(): Promise<boolean> {
|
||||||
|
try {
|
||||||
|
await fs.access(DB_PATH);
|
||||||
|
return true;
|
||||||
|
} catch {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Tests ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
describe("pipeline.db — import validation", () => {
|
||||||
|
let db: import("better-sqlite3").Database;
|
||||||
|
let expectedSynsetCount: number;
|
||||||
|
let expectedCefrVoteCount: number;
|
||||||
|
|
||||||
|
beforeAll(async () => {
|
||||||
|
if (!(await dbExists())) return;
|
||||||
|
|
||||||
|
const Database = (await import("better-sqlite3")).default;
|
||||||
|
db = new Database(DB_PATH, { readonly: true });
|
||||||
|
db.pragma("foreign_keys = ON");
|
||||||
|
|
||||||
|
// Count expected synsets from omw.json
|
||||||
|
const omwRaw = await fs.readFile(OMW_PATH, "utf-8");
|
||||||
|
const omwRecords = JSON.parse(omwRaw) as unknown[];
|
||||||
|
expectedSynsetCount = omwRecords.length;
|
||||||
|
|
||||||
|
// Count expected CEFR votes from stage 2 annotated files.
|
||||||
|
// Merge all language files the same way the import script does —
|
||||||
|
// use en.json as base and merge votes from the other language files.
|
||||||
|
const byId = new Map<string, AnnotatedRecord>();
|
||||||
|
|
||||||
|
const baseRaw = await fs.readFile(
|
||||||
|
path.join(ANNOTATED_DIR, "en.json"),
|
||||||
|
"utf-8",
|
||||||
|
);
|
||||||
|
const base = JSON.parse(baseRaw) as AnnotatedRecord[];
|
||||||
|
for (const record of base) {
|
||||||
|
byId.set(record.source_id, record);
|
||||||
|
}
|
||||||
|
|
||||||
|
for (const lang of SUPPORTED_LANGUAGE_CODES) {
|
||||||
|
if (lang === "en") continue;
|
||||||
|
const raw = await fs.readFile(
|
||||||
|
path.join(ANNOTATED_DIR, `${lang}.json`),
|
||||||
|
"utf-8",
|
||||||
|
);
|
||||||
|
const records = JSON.parse(raw) as AnnotatedRecord[];
|
||||||
|
for (const record of records) {
|
||||||
|
const base = byId.get(record.source_id);
|
||||||
|
if (!base) continue;
|
||||||
|
for (const [l, langVotes] of Object.entries(record.votes)) {
|
||||||
|
if (!base.votes[l as SupportedLanguageCode]) {
|
||||||
|
base.votes[l as SupportedLanguageCode] = {};
|
||||||
|
}
|
||||||
|
Object.assign(base.votes[l as SupportedLanguageCode]!, langVotes);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expectedCefrVoteCount = 0;
|
||||||
|
for (const record of byId.values()) {
|
||||||
|
for (const langVotes of Object.values(record.votes)) {
|
||||||
|
expectedCefrVoteCount += Object.keys(langVotes ?? {}).length;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}, 120_000);
|
||||||
|
|
||||||
|
it("pipeline.db exists — skipping all tests if not", async () => {
|
||||||
|
const exists = await dbExists();
|
||||||
|
if (!exists) {
|
||||||
|
console.warn(
|
||||||
|
"\n pipeline.db not found — run pnpm db:init and pnpm db:import first\n",
|
||||||
|
);
|
||||||
|
}
|
||||||
|
expect(exists).toBe(true);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("synsets count matches omw.json", () => {
|
||||||
|
if (!db) return;
|
||||||
|
const row = db.prepare("SELECT COUNT(*) as count FROM synsets").get() as {
|
||||||
|
count: number;
|
||||||
|
};
|
||||||
|
expect(row.count).toBe(expectedSynsetCount);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every synset has at least one translation", () => {
|
||||||
|
if (!db) return;
|
||||||
|
const rows = db
|
||||||
|
.prepare(
|
||||||
|
`
|
||||||
|
SELECT s.source_id
|
||||||
|
FROM synsets s
|
||||||
|
LEFT JOIN translations t ON t.source_id = s.source_id
|
||||||
|
WHERE t.id IS NULL
|
||||||
|
`,
|
||||||
|
)
|
||||||
|
.all() as { source_id: string }[];
|
||||||
|
|
||||||
|
const errors = rows.map((r) => `${r.source_id}: no translations`);
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every translation belongs to a valid synset", () => {
|
||||||
|
if (!db) return;
|
||||||
|
const rows = db
|
||||||
|
.prepare(
|
||||||
|
`
|
||||||
|
SELECT t.id, t.source_id
|
||||||
|
FROM translations t
|
||||||
|
LEFT JOIN synsets s ON s.source_id = t.source_id
|
||||||
|
WHERE s.source_id IS NULL
|
||||||
|
`,
|
||||||
|
)
|
||||||
|
.all() as { id: number; source_id: string }[];
|
||||||
|
|
||||||
|
const errors = rows.map(
|
||||||
|
(r) => `translation ${r.id}: references missing synset ${r.source_id}`,
|
||||||
|
);
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every cefr_source_vote references a valid translation", () => {
|
||||||
|
if (!db) return;
|
||||||
|
const rows = db
|
||||||
|
.prepare(
|
||||||
|
`
|
||||||
|
SELECT v.id, v.translation_id
|
||||||
|
FROM cefr_source_votes v
|
||||||
|
LEFT JOIN translations t ON t.id = v.translation_id
|
||||||
|
WHERE t.id IS NULL
|
||||||
|
`,
|
||||||
|
)
|
||||||
|
.all() as { id: number; translation_id: number }[];
|
||||||
|
|
||||||
|
const errors = rows.map(
|
||||||
|
(r) =>
|
||||||
|
`cefr_vote ${r.id}: references missing translation ${r.translation_id}`,
|
||||||
|
);
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("cefr_source_votes count matches stage 2 annotated output", () => {
|
||||||
|
if (!db) return;
|
||||||
|
const row = db
|
||||||
|
.prepare("SELECT COUNT(*) as count FROM cefr_source_votes")
|
||||||
|
.get() as { count: number };
|
||||||
|
expect(row.count).toBe(expectedCefrVoteCount);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every example has a valid source", () => {
|
||||||
|
if (!db) return;
|
||||||
|
const rows = db
|
||||||
|
.prepare(
|
||||||
|
`
|
||||||
|
SELECT source_id, language, source
|
||||||
|
FROM examples
|
||||||
|
WHERE source NOT IN ('omw', 'cefr')
|
||||||
|
`,
|
||||||
|
)
|
||||||
|
.all() as { source_id: string; language: string; source: string }[];
|
||||||
|
|
||||||
|
const errors = rows.map(
|
||||||
|
(r) =>
|
||||||
|
`${r.source_id} (${r.language}): invalid example source "${r.source}"`,
|
||||||
|
);
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every example belongs to a valid synset", () => {
|
||||||
|
if (!db) return;
|
||||||
|
const rows = db
|
||||||
|
.prepare(
|
||||||
|
`
|
||||||
|
SELECT e.id, e.source_id
|
||||||
|
FROM examples e
|
||||||
|
LEFT JOIN synsets s ON s.source_id = e.source_id
|
||||||
|
WHERE s.source_id IS NULL
|
||||||
|
`,
|
||||||
|
)
|
||||||
|
.all() as { id: number; source_id: string }[];
|
||||||
|
|
||||||
|
const errors = rows.map(
|
||||||
|
(r) => `example ${r.id}: references missing synset ${r.source_id}`,
|
||||||
|
);
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every gloss belongs to a valid synset", () => {
|
||||||
|
if (!db) return;
|
||||||
|
const rows = db
|
||||||
|
.prepare(
|
||||||
|
`
|
||||||
|
SELECT g.id, g.source_id
|
||||||
|
FROM glosses g
|
||||||
|
LEFT JOIN synsets s ON s.source_id = g.source_id
|
||||||
|
WHERE s.source_id IS NULL
|
||||||
|
`,
|
||||||
|
)
|
||||||
|
.all() as { id: number; source_id: string }[];
|
||||||
|
|
||||||
|
const errors = rows.map(
|
||||||
|
(r) => `gloss ${r.id}: references missing synset ${r.source_id}`,
|
||||||
|
);
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
});
|
||||||
166
data-pipeline/tests/validation/stage-1.validation.test.ts
Normal file
166
data-pipeline/tests/validation/stage-1.validation.test.ts
Normal file
|
|
@ -0,0 +1,166 @@
|
||||||
|
import fs from "node:fs/promises";
|
||||||
|
import path from "node:path";
|
||||||
|
import { describe, it, expect } from "vitest";
|
||||||
|
import { SUPPORTED_POS, SUPPORTED_LANGUAGE_CODES } from "@lila/shared";
|
||||||
|
import type { SupportedPos, SupportedLanguageCode } from "@lila/shared";
|
||||||
|
|
||||||
|
// ── Types ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
type OmwRecord = {
|
||||||
|
source_id: string;
|
||||||
|
pos: SupportedPos;
|
||||||
|
translations: Partial<Record<SupportedLanguageCode, string[]>>;
|
||||||
|
glosses: Partial<Record<SupportedLanguageCode, string[]>>;
|
||||||
|
examples: Partial<Record<SupportedLanguageCode, string[]>>;
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Paths ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
const OMW_PATH = path.resolve("stage-1-extract/output/omw.json");
|
||||||
|
|
||||||
|
// ── Helpers ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
function isValidSourceId(id: string): boolean {
|
||||||
|
return /^ili:i\d+$/.test(id);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Tests ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
describe("stage 1 — omw.json validation", () => {
|
||||||
|
let records: OmwRecord[];
|
||||||
|
|
||||||
|
it("file exists and is valid JSON", async () => {
|
||||||
|
const raw = await fs.readFile(OMW_PATH, "utf-8");
|
||||||
|
records = JSON.parse(raw) as OmwRecord[];
|
||||||
|
expect(records).toBeDefined();
|
||||||
|
});
|
||||||
|
|
||||||
|
it("is a non-empty array", async () => {
|
||||||
|
const raw = await fs.readFile(OMW_PATH, "utf-8");
|
||||||
|
records = JSON.parse(raw) as OmwRecord[];
|
||||||
|
expect(Array.isArray(records)).toBe(true);
|
||||||
|
expect(records.length).toBeGreaterThan(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every record has required fields", async () => {
|
||||||
|
const raw = await fs.readFile(OMW_PATH, "utf-8");
|
||||||
|
records = JSON.parse(raw) as OmwRecord[];
|
||||||
|
|
||||||
|
const errors: string[] = [];
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
if (!record.source_id) {
|
||||||
|
errors.push(`missing source_id`);
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
if (!record.pos) errors.push(`${record.source_id}: missing pos`);
|
||||||
|
if (!record.translations)
|
||||||
|
errors.push(`${record.source_id}: missing translations`);
|
||||||
|
if (!record.glosses) errors.push(`${record.source_id}: missing glosses`);
|
||||||
|
if (!record.examples)
|
||||||
|
errors.push(`${record.source_id}: missing examples`);
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every source_id matches ili:i{number} pattern", async () => {
|
||||||
|
const raw = await fs.readFile(OMW_PATH, "utf-8");
|
||||||
|
records = JSON.parse(raw) as OmwRecord[];
|
||||||
|
|
||||||
|
const errors: string[] = [];
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
if (!isValidSourceId(record.source_id)) {
|
||||||
|
errors.push(`invalid source_id: ${record.source_id}`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every source_id is unique", async () => {
|
||||||
|
const raw = await fs.readFile(OMW_PATH, "utf-8");
|
||||||
|
records = JSON.parse(raw) as OmwRecord[];
|
||||||
|
|
||||||
|
const seen = new Set<string>();
|
||||||
|
const errors: string[] = [];
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
if (seen.has(record.source_id)) {
|
||||||
|
errors.push(`duplicate source_id: ${record.source_id}`);
|
||||||
|
}
|
||||||
|
seen.add(record.source_id);
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every pos is a valid supported value", async () => {
|
||||||
|
const raw = await fs.readFile(OMW_PATH, "utf-8");
|
||||||
|
records = JSON.parse(raw) as OmwRecord[];
|
||||||
|
|
||||||
|
const errors: string[] = [];
|
||||||
|
const validPos = new Set(SUPPORTED_POS);
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
if (!validPos.has(record.pos)) {
|
||||||
|
errors.push(`${record.source_id}: invalid pos "${record.pos}"`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every record has at least one translation in at least one language", async () => {
|
||||||
|
const raw = await fs.readFile(OMW_PATH, "utf-8");
|
||||||
|
records = JSON.parse(raw) as OmwRecord[];
|
||||||
|
|
||||||
|
const errors: string[] = [];
|
||||||
|
const validLangs = new Set(SUPPORTED_LANGUAGE_CODES);
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
const langs = Object.keys(record.translations) as SupportedLanguageCode[];
|
||||||
|
|
||||||
|
if (langs.length === 0) {
|
||||||
|
errors.push(`${record.source_id}: no translations`);
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
for (const lang of langs) {
|
||||||
|
if (!validLangs.has(lang)) {
|
||||||
|
errors.push(`${record.source_id}: unsupported language "${lang}"`);
|
||||||
|
}
|
||||||
|
const words = record.translations[lang] ?? [];
|
||||||
|
if (words.length === 0) {
|
||||||
|
errors.push(`${record.source_id}: empty translations for "${lang}"`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("no duplicate translations within a single synset and language", async () => {
|
||||||
|
const raw = await fs.readFile(OMW_PATH, "utf-8");
|
||||||
|
const records = JSON.parse(raw) as OmwRecord[];
|
||||||
|
|
||||||
|
const errors: string[] = [];
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
for (const [lang, words] of Object.entries(record.translations)) {
|
||||||
|
const seen = new Set<string>();
|
||||||
|
for (const word of words) {
|
||||||
|
if (seen.has(word)) {
|
||||||
|
errors.push(
|
||||||
|
`${record.source_id} (${lang}): duplicate translation "${word}"`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
seen.add(word);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
});
|
||||||
218
data-pipeline/tests/validation/stage-2.validation.test.ts
Normal file
218
data-pipeline/tests/validation/stage-2.validation.test.ts
Normal file
|
|
@ -0,0 +1,218 @@
|
||||||
|
import fs from "node:fs/promises";
|
||||||
|
import path from "node:path";
|
||||||
|
import { describe, it, expect, beforeAll } from "vitest";
|
||||||
|
import {
|
||||||
|
SUPPORTED_POS,
|
||||||
|
SUPPORTED_LANGUAGE_CODES,
|
||||||
|
CEFR_LEVELS,
|
||||||
|
} from "@lila/shared";
|
||||||
|
import type { SupportedPos, SupportedLanguageCode } from "@lila/shared";
|
||||||
|
|
||||||
|
// ── Types ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
type Example = { text: string; source: "omw" | "cefr" };
|
||||||
|
|
||||||
|
type AnnotatedRecord = {
|
||||||
|
source_id: string;
|
||||||
|
pos: SupportedPos;
|
||||||
|
translations: Partial<Record<SupportedLanguageCode, string[]>>;
|
||||||
|
glosses: Partial<Record<SupportedLanguageCode, string[]>>;
|
||||||
|
examples: Partial<Record<SupportedLanguageCode, Example[]>>;
|
||||||
|
votes: Partial<
|
||||||
|
Record<SupportedLanguageCode, Record<string, { cefr_source: string }>>
|
||||||
|
>;
|
||||||
|
};
|
||||||
|
|
||||||
|
type ConflictEntry = {
|
||||||
|
word: string;
|
||||||
|
pos: string;
|
||||||
|
language: SupportedLanguageCode;
|
||||||
|
levels: string[];
|
||||||
|
};
|
||||||
|
|
||||||
|
// ── Paths ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
const OUTPUT_DIR = path.resolve("stage-2-annotate/output");
|
||||||
|
|
||||||
|
// ── Tests ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
describe("stage 2 — annotated output validation", () => {
|
||||||
|
const recordsByLang = new Map<SupportedLanguageCode, AnnotatedRecord[]>();
|
||||||
|
let conflicts: ConflictEntry[] = [];
|
||||||
|
|
||||||
|
beforeAll(async () => {
|
||||||
|
for (const lang of SUPPORTED_LANGUAGE_CODES) {
|
||||||
|
const raw = await fs.readFile(
|
||||||
|
path.join(OUTPUT_DIR, `${lang}.json`),
|
||||||
|
"utf-8",
|
||||||
|
);
|
||||||
|
recordsByLang.set(lang, JSON.parse(raw) as AnnotatedRecord[]);
|
||||||
|
}
|
||||||
|
const raw = await fs.readFile(
|
||||||
|
path.join(OUTPUT_DIR, "conflicts.json"),
|
||||||
|
"utf-8",
|
||||||
|
);
|
||||||
|
conflicts = JSON.parse(raw) as ConflictEntry[];
|
||||||
|
}, 60_000);
|
||||||
|
|
||||||
|
it("all five language files exist", async () => {
|
||||||
|
const errors: string[] = [];
|
||||||
|
|
||||||
|
for (const lang of SUPPORTED_LANGUAGE_CODES) {
|
||||||
|
const filePath = path.join(OUTPUT_DIR, `${lang}.json`);
|
||||||
|
try {
|
||||||
|
await fs.access(filePath);
|
||||||
|
} catch {
|
||||||
|
errors.push(`missing file: ${lang}.json`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("conflicts.json exists", async () => {
|
||||||
|
const filePath = path.join(OUTPUT_DIR, "conflicts.json");
|
||||||
|
await expect(fs.access(filePath)).resolves.toBeUndefined();
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every language file is a non-empty array", () => {
|
||||||
|
const errors: string[] = [];
|
||||||
|
|
||||||
|
for (const lang of SUPPORTED_LANGUAGE_CODES) {
|
||||||
|
const records = recordsByLang.get(lang)!;
|
||||||
|
if (!Array.isArray(records)) {
|
||||||
|
errors.push(`${lang}.json: not an array`);
|
||||||
|
} else if (records.length === 0) {
|
||||||
|
errors.push(`${lang}.json: empty array`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every record has required fields", () => {
|
||||||
|
const errors: string[] = [];
|
||||||
|
|
||||||
|
for (const lang of SUPPORTED_LANGUAGE_CODES) {
|
||||||
|
const records = recordsByLang.get(lang)!;
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
if (!record.source_id) {
|
||||||
|
errors.push(`${lang}: record missing source_id`);
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
if (!record.pos)
|
||||||
|
errors.push(`${lang} ${record.source_id}: missing pos`);
|
||||||
|
if (!record.translations)
|
||||||
|
errors.push(`${lang} ${record.source_id}: missing translations`);
|
||||||
|
if (!record.glosses)
|
||||||
|
errors.push(`${lang} ${record.source_id}: missing glosses`);
|
||||||
|
if (record.examples === undefined)
|
||||||
|
errors.push(`${lang} ${record.source_id}: missing examples`);
|
||||||
|
if (record.votes === undefined)
|
||||||
|
errors.push(`${lang} ${record.source_id}: missing votes`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every pos is a valid supported value", () => {
|
||||||
|
const errors: string[] = [];
|
||||||
|
const validPos = new Set(SUPPORTED_POS);
|
||||||
|
|
||||||
|
for (const lang of SUPPORTED_LANGUAGE_CODES) {
|
||||||
|
const records = recordsByLang.get(lang)!;
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
if (!validPos.has(record.pos)) {
|
||||||
|
errors.push(
|
||||||
|
`${lang} ${record.source_id}: invalid pos "${record.pos}"`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every example has text and a valid source", () => {
|
||||||
|
const errors: string[] = [];
|
||||||
|
const validSources = new Set(["omw", "cefr"]);
|
||||||
|
|
||||||
|
for (const lang of SUPPORTED_LANGUAGE_CODES) {
|
||||||
|
const records = recordsByLang.get(lang)!;
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
for (const [l, examples] of Object.entries(record.examples)) {
|
||||||
|
for (const example of examples) {
|
||||||
|
if (!example.text) {
|
||||||
|
errors.push(
|
||||||
|
`${lang} ${record.source_id} (${l}): example missing text`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
if (!validSources.has(example.source)) {
|
||||||
|
errors.push(
|
||||||
|
`${lang} ${record.source_id} (${l}): invalid example source "${example.source}"`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("every cefr_source vote is a valid CEFR level", () => {
|
||||||
|
const errors: string[] = [];
|
||||||
|
const validLevels = new Set(CEFR_LEVELS);
|
||||||
|
|
||||||
|
for (const lang of SUPPORTED_LANGUAGE_CODES) {
|
||||||
|
const records = recordsByLang.get(lang)!;
|
||||||
|
|
||||||
|
for (const record of records) {
|
||||||
|
for (const [l, langVotes] of Object.entries(record.votes)) {
|
||||||
|
for (const [word, vote] of Object.entries(langVotes ?? {})) {
|
||||||
|
if (
|
||||||
|
!validLevels.has(vote.cefr_source as (typeof CEFR_LEVELS)[number])
|
||||||
|
) {
|
||||||
|
errors.push(
|
||||||
|
`${lang} ${record.source_id} (${l} — "${word}"): invalid cefr_source "${vote.cefr_source}"`,
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
|
||||||
|
it("conflicts.json entries have required fields and valid CEFR levels", () => {
|
||||||
|
const errors: string[] = [];
|
||||||
|
const validLevels = new Set(CEFR_LEVELS);
|
||||||
|
const validLangs = new Set(SUPPORTED_LANGUAGE_CODES);
|
||||||
|
|
||||||
|
for (const entry of conflicts) {
|
||||||
|
if (!entry.word) errors.push(`conflict missing word`);
|
||||||
|
if (!entry.pos) errors.push(`conflict missing pos`);
|
||||||
|
if (!entry.language) {
|
||||||
|
errors.push(`conflict missing language`);
|
||||||
|
} else if (!validLangs.has(entry.language)) {
|
||||||
|
errors.push(`conflict invalid language "${entry.language}"`);
|
||||||
|
}
|
||||||
|
if (!Array.isArray(entry.levels) || entry.levels.length < 2) {
|
||||||
|
errors.push(`${entry.word}: levels must have at least 2 entries`);
|
||||||
|
} else {
|
||||||
|
for (const level of entry.levels) {
|
||||||
|
if (!validLevels.has(level as (typeof CEFR_LEVELS)[number])) {
|
||||||
|
errors.push(`${entry.word}: invalid level "${level}"`);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
expect(errors, `\n${errors.join("\n")}`).toHaveLength(0);
|
||||||
|
});
|
||||||
|
});
|
||||||
|
|
@ -8,5 +8,5 @@
|
||||||
"types": ["node"]
|
"types": ["node"]
|
||||||
},
|
},
|
||||||
"references": [{ "path": "../packages/shared" }],
|
"references": [{ "path": "../packages/shared" }],
|
||||||
"include": ["./**/*"]
|
"include": ["./**/*", "vitest.config.ts"]
|
||||||
}
|
}
|
||||||
|
|
|
||||||
11
data-pipeline/vitest.config.ts
Normal file
11
data-pipeline/vitest.config.ts
Normal file
|
|
@ -0,0 +1,11 @@
|
||||||
|
import { defineConfig } from "vitest/config";
|
||||||
|
|
||||||
|
export default defineConfig({
|
||||||
|
test: {
|
||||||
|
environment: "node",
|
||||||
|
globals: true,
|
||||||
|
include: ["tests/**/*.test.ts"],
|
||||||
|
exclude: ["**/dist/**", "**/node_modules/**"],
|
||||||
|
testTimeout: 60_000,
|
||||||
|
},
|
||||||
|
});
|
||||||
|
|
@ -1,335 +1,204 @@
|
||||||
# lila data pipeline
|
# lila data pipeline
|
||||||
|
|
||||||
> **NOTE: BEFORE RUNNING THE PIPELINE, CONSIDER IMPROVING THE CEFR SOURCE
|
This pipeline extracts vocabulary data from Wiktionary via the Kaikki dataset, enriches it with CEFR levels and fills content gaps using local LLMs, and produces authoritative output in `pipeline.db`. This database is consumed by the sync script to populate the production database with vocabulary entries, translations, glosses, CEFR levels, and difficulty ratings.
|
||||||
> FILES IN `stage-2-annotate/sources/cefr/`. BETTER SOURCE COVERAGE MEANS
|
|
||||||
> FEWER WORDS FOR THE LLM TO ANNOTATE FROM SCRATCH, FASTER OVERNIGHT RUNS,
|
|
||||||
> AND HIGHER CONFIDENCE IN THE FINAL OUTPUT. SEE UNIVERSALCEFR
|
|
||||||
> (huggingface.co/UniversalCEFR) AND CEFR-J
|
|
||||||
> (github.com/openlanguageprofiles/olp-en-cefrj) AS STARTING POINTS.**
|
|
||||||
|
|
||||||
This pipeline extracts vocabulary data from the Open Multilingual Wordnet (OMW), annotates it with CEFR levels from curated source files, verifies and enriches annotations using local LLMs, and produces authoritative JSON files per language. These files are consumed by the seeder in `packages/db` to populate the database with terms, translations, glosses, CEFR levels, difficulty ratings, and LLM-generated descriptions.
|
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
```mermaid
|
```mermaid
|
||||||
flowchart LR
|
flowchart LR
|
||||||
omw[(OMW SQLite DBs)]
|
kaikki[(Kaikki JSONL)]
|
||||||
cefr[(CEFR JSON files)]
|
|
||||||
extract[Extract]
|
extract[Extract]
|
||||||
annotate[Annotate]
|
reverselink[Reverse Link Sync]
|
||||||
enrich[Enrich]
|
enrich[Enrich]
|
||||||
|
pipelinedb[(pipeline.db)]
|
||||||
merge[Merge]
|
merge[Merge]
|
||||||
final[(final/lang.json)]
|
tiebreak[Tiebreak]
|
||||||
flagged[(flagged/lang.json)]
|
compare[Compare]
|
||||||
seeder[packages/db seeder]
|
sync[Sync]
|
||||||
db[(Database)]
|
db[(PostgreSQL)]
|
||||||
|
|
||||||
omw --> extract
|
kaikki --> extract
|
||||||
cefr --> annotate
|
extract --> pipelinedb
|
||||||
extract --> annotate
|
pipelinedb --> reverselink
|
||||||
annotate --> enrich
|
reverselink --> pipelinedb
|
||||||
enrich --> merge
|
pipelinedb --> enrich
|
||||||
merge --> final
|
enrich --> pipelinedb
|
||||||
merge --> flagged
|
pipelinedb --> merge
|
||||||
final --> seeder
|
merge --> pipelinedb
|
||||||
seeder --> db
|
pipelinedb --> tiebreak
|
||||||
|
tiebreak --> pipelinedb
|
||||||
|
pipelinedb --> compare
|
||||||
|
pipelinedb --> sync
|
||||||
|
sync --> db
|
||||||
```
|
```
|
||||||
|
|
||||||
Each stage is a standalone script that reads from the previous stage's output and produces one JSON file per language. Stages can be re-run independently without affecting earlier or later stages.
|
Each stage is a standalone script that reads from and writes to `pipeline.db`. The pipeline is fully resumable — interrupted overnight runs pick up from the last processed record without losing work.
|
||||||
|
|
||||||
The enrich stage is the exception — it produces one checkpoint file per model run per language, plus a compiled votes file once all runs are complete. It is designed to run overnight, one model at a time, and is fully resumable if interrupted.
|
Stage 1 is a manual prerequisite and is not run by the pipeline orchestrator. See **Stage 1 — Extract** for instructions.
|
||||||
|
|
||||||
Only fully annotated output in `stage-4-merge/output/final/` reaches the database. Words where LLMs could not reach a majority vote land in `stage-4-merge/output/flagged/` and wait for manual review before seeding.
|
The enrich stage is designed to run overnight, one model at a time. Each model processes every entry and writes results to `pipeline.db` atomically per record.
|
||||||
|
|
||||||
## Data sources
|
Only fully resolved records reach the production database. Records where LLMs could not reach a majority vote are handled automatically by the tiebreaker stage before syncing.
|
||||||
|
|
||||||
### OMW / WordNet
|
## pipeline.db
|
||||||
|
|
||||||
The Open Multilingual Wordnet (OMW) is the base vocabulary source. It provides synsets — groups of synonymous words — with translations and glosses across multiple languages. One SQLite database per language is downloaded and placed in `sources/omw/`. These files are not committed to git.
|
All pipeline state is stored in `pipeline.db` — a SQLite database in `data-pipeline/db/`. It is created automatically on first run and is not committed to git.
|
||||||
|
|
||||||
All four parts of speech are extracted: noun, verb, adjective, adverb. WordNet's adjective satellites are collapsed into adjective — this is a WordNet-internal distinction that has no relevance for language learning. Alongside translations and glosses, usage examples are extracted where available and stored in the database as term_examples.
|
The database serves three purposes:
|
||||||
|
|
||||||
See **Setup** for download instructions.
|
- **Resumability** — every record is written atomically with a status. Interrupted overnight runs resume from the last pending record without losing work.
|
||||||
|
- **Vote tracking** — all model votes for CEFR levels and generated content are stored per model per record, giving full auditability of how every decision was reached.
|
||||||
|
- **Resolved output** — the final resolved records live here and are read by the sync script to seed the production database.
|
||||||
|
|
||||||
### CEFR source files
|
The schema is defined in `data-pipeline/db/schema.sql`. Never edit `pipeline.db` directly — all writes go through the pipeline scripts.
|
||||||
|
|
||||||
Per-language JSON files in `sources/cefr/` provide the initial CEFR level annotations. These files do not cover the full vocabulary extracted from OMW — coverage varies by language. Gaps and disagreements are handled by the enrich stage.
|
On first run the orchestrator initialises `pipeline.db` automatically and imports the stage 1 output into the base tables. This happens once — subsequent runs skip the import if the base tables are already populated.
|
||||||
|
|
||||||
| Language | File |
|
## Data source
|
||||||
| -------- | ---------------------- |
|
|
||||||
| English | `sources/cefr/en.json` |
|
|
||||||
| Italian | `sources/cefr/it.json` |
|
|
||||||
| Spanish | `sources/cefr/es.json` |
|
|
||||||
| German | `sources/cefr/de.json` |
|
|
||||||
| French | `sources/cefr/fr.json` |
|
|
||||||
|
|
||||||
These files are committed to git. For per-language coverage detail see `COVERAGE.md`.
|
### Kaikki (Wiktionary)
|
||||||
|
|
||||||
### CEFR annotation and verification
|
The pipeline uses pre-extracted Wiktionary data from [kaikki.org](https://kaikki.org), built with the [wiktextract](https://github.com/tatuylonen/wiktextract) tool. This data is updated weekly from the English Wiktionary dump and is freely available under the same license as Wiktionary (CC-BY-SA).
|
||||||
|
|
||||||
CEFR levels are determined by a majority vote combining all available sources:
|
**Why Kaikki instead of OMW:**
|
||||||
|
Kaikki is structured per word sense. Each headword has multiple senses, and translations are linked to a specific sense rather than a general concept. This prevents the sense disambiguation problems found in OMW, where a single concept entry could contain translations from entirely different meanings of a word.
|
||||||
|
|
||||||
- The CEFR source file counts as one vote (if it has an entry for the word)
|
Each Kaikki entry provides:
|
||||||
- Each LLM model run counts as one vote
|
|
||||||
|
|
||||||
The LLMs verify existing annotations as well as filling gaps — a source file entry does not automatically win. Majority vote across all sources determines the final level.
|
- A headword in the entry language
|
||||||
|
- One or more senses, each with a gloss and examples
|
||||||
|
- Per-sense translations to other languages with sense hints
|
||||||
|
- IPA pronunciations and audio file references (deferred — see **Further extensions**)
|
||||||
|
- Inflected forms (deferred — see **Further extensions**)
|
||||||
|
|
||||||
If no majority is reached, the word is flagged for manual review and excluded from the database until resolved.
|
The pipeline uses the English Wiktionary edition (`enwiktionary`), which contains entries for all five supported languages with glosses in English.
|
||||||
|
|
||||||
|
### CEFR levels
|
||||||
|
|
||||||
|
CEFR levels are assigned entirely by LLM majority vote. Each model receives the headword, gloss, and an example sentence and votes on the appropriate level (A1–C2). There are no curated source files — the LLMs are the sole source of CEFR annotations.
|
||||||
|
|
||||||
|
If no majority is reached after all model runs, the entry is handled automatically by the tiebreaker stage.
|
||||||
|
|
||||||
## Setup
|
## Setup
|
||||||
|
|
||||||
### OMW databases
|
### Kaikki data files
|
||||||
|
|
||||||
Download the OMW SQLite database for each language using the `wn` Python
|
Download the pre-extracted Kaikki JSONL files for each language. These are large files — download them to `stage-1-extract/sources/` which is not committed to git.
|
||||||
library:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m wn download omw-en:1.4
|
mkdir -p stage-1-extract/sources
|
||||||
python -m wn download omw-it:1.4
|
cd stage-1-extract/sources
|
||||||
python -m wn download omw-de:1.4
|
|
||||||
python -m wn download omw-es:1.4
|
|
||||||
python -m wn download omw-fr:1.4
|
|
||||||
```
|
|
||||||
|
|
||||||
The data is stored automatically at `~/.wn_data/wn.db` and is not committed
|
# English entries (contains translations to all other languages)
|
||||||
to git.
|
wget https://kaikki.org/dictionary/English/kaikki.org-dictionary-English.jsonl.gz
|
||||||
|
|
||||||
|
# Per-language files (for entries written in those languages)
|
||||||
|
wget https://kaikki.org/dictionary/German/kaikki.org-dictionary-German.jsonl.gz
|
||||||
|
wget https://kaikki.org/dictionary/Italian/kaikki.org-dictionary-Italian.jsonl.gz
|
||||||
|
wget https://kaikki.org/dictionary/French/kaikki.org-dictionary-French.jsonl.gz
|
||||||
|
wget https://kaikki.org/dictionary/Spanish/kaikki.org-dictionary-Spanish.jsonl.gz
|
||||||
|
|
||||||
|
# Decompress
|
||||||
|
gunzip *.gz
|
||||||
|
```
|
||||||
|
|
||||||
### LLM setup
|
### LLM setup
|
||||||
|
|
||||||
See `LLM-SETUP.md`.
|
See `llm-setup.md`.
|
||||||
|
|
||||||
## Pipeline stages
|
## Pipeline stages
|
||||||
|
|
||||||
The pipeline runs in five stages. Each stage is independent and can be re-run without affecting the others.
|
| Stage | What it does |
|
||||||
|
| --------------- | ------------------------------------------------------------------------ |
|
||||||
| Stage | What it does |
|
| 1. Extract | Parses Kaikki JSONL, imports entries into `pipeline.db` |
|
||||||
| ----------- | -------------------------------------------------------------------- |
|
| 2. Reverse link | Inserts missing reverse translations between language pairs |
|
||||||
| 1. Extract | Reads OMW SQLite database, outputs normalized JSON per language |
|
| 3. Enrich | LLMs fill translation gaps, improve glosses/examples, assign CEFR levels |
|
||||||
| 2. Annotate | Merges CEFR source files into extracted data, adds source file votes |
|
| 4. Merge | Resolves LLM votes into final values |
|
||||||
| 3. Enrich | Runs local LLMs in two rounds — generation then voting |
|
| 4b. Tiebreak | Runs unused models on flagged entries until majority is reached |
|
||||||
| 4. Merge | Resolves votes, derives difficulty, splits into final and flagged |
|
| 5. Compare / QA | Generates `COVERAGE.md` with detailed quality report |
|
||||||
| 5. Compare | Generates COVERAGE.md with detailed quality report |
|
| 6. Sync | Upserts resolved records into production PostgreSQL |
|
||||||
|
|
||||||
### 1. Extract
|
### 1. Extract
|
||||||
|
|
||||||
Reads the OMW SQLite database (`~/.wn_data/wn.db`) and produces a single normalized JSON file containing all synsets with their translations, glosses, and usage examples across all five languages and all parts of speech. Adjective satellites are collapsed into adjective at this stage.
|
Parses the Kaikki JSONL files for all five languages and imports them into the base tables of `pipeline.db`. Filters to the four supported parts of speech: noun, verb, adjective, adverb. Each Kaikki sense becomes one row in `vocabulary_entries`. Translations are stored in `entry_translations` with their sense hints.
|
||||||
|
|
||||||
**Input:** `~/.wn_data/wn.db`
|
**Input:** `stage-1-extract/sources/*.jsonl`
|
||||||
**Output:** `stage-1-extract/output/omw.json`
|
**Output:** `pipeline.db` — `vocabulary_entries` and `entry_translations` tables populated
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python stage-1-extract/scripts/extract.py
|
pnpm --filter @lila/pipeline extract
|
||||||
```
|
```
|
||||||
|
|
||||||
Add `--sample` to extract 100 synsets for inspection before running the full
|
Add `--sample 100` to import only 100 entries per language for inspection before running the full import.
|
||||||
extraction.
|
|
||||||
|
|
||||||
Each record in the output looks like this:
|
Each entry in `pipeline.db` looks like this:
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
"source_id": "ili:i1",
|
"headword": "thrill",
|
||||||
"pos": "adjective",
|
"language": "en",
|
||||||
"translations": {
|
"pos": "verb",
|
||||||
"en": ["able"],
|
"sense_index": 0,
|
||||||
"it": ["abile", "intelligente", "valente", "capace"],
|
"gloss": "To suddenly excite someone, or to give them great pleasure.",
|
||||||
"es": ["capaz"],
|
"examples": ["The movie thrilled the audience."],
|
||||||
"fr": ["comptable"]
|
"translations": [
|
||||||
},
|
{ "language": "de", "word": "begeistern", "sense_hint": "suddenly excite" },
|
||||||
"glosses": {
|
{
|
||||||
"en": [
|
"language": "fr",
|
||||||
"(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"
|
"word": "enthousiasmer",
|
||||||
]
|
"sense_hint": "suddenly excite"
|
||||||
},
|
},
|
||||||
"examples": { "en": ["able to swim", "she was able to program her computer"] }
|
{ "language": "it", "word": "entusiasmare" },
|
||||||
|
{ "language": "es", "word": "emocionar" }
|
||||||
|
]
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
Note: glosses and examples are not available for all languages. French and Spanish have no glosses or examples in the current OMW database — these will be generated by the LLM in the enrich stage. Coverage detail is in `COVERAGE.md`.
|
> **Note:** Stage 1 is a manual prerequisite. It is not run by the pipeline orchestrator (`pipeline.ts`). Run it once before running the orchestrator for the first time, and re-run it manually if the Kaikki source files are updated.
|
||||||
|
|
||||||
### 2. Annotate
|
### 2. Reverse link sync
|
||||||
|
|
||||||
Reads the combined OMW extract and merges CEFR source data into it. Each translation in each language is matched against the corresponding CEFR source
|
A pure script stage — no LLMs. For each translation pair in `entry_translations`, checks whether the reverse link exists. If English _thrill → begeistern_ exists and the German entry _begeistern_ exists in `vocabulary_entries` but lacks the English back-link, it is inserted automatically.
|
||||||
file by word text and part of speech. Matched translations receive a `cefr_source` vote which carries into the enrich stage. Unmatched translations proceed without a vote.
|
|
||||||
|
|
||||||
This stage also extracts native example sentences from the CEFR source files and adds them to the record alongside OMW examples, with `source: "cefr"` to distinguish them.
|
This runs before the enrich stage so that LLMs only generate translations that are genuinely missing — not translations that would be found by a simple reverse lookup.
|
||||||
|
|
||||||
Words appearing in the CEFR source file multiple times with different CEFR levels are written to `conflicts.json` for manual review and excluded from voting until resolved.
|
**Input:** `pipeline.db` — populated `vocabulary_entries` and `entry_translations`
|
||||||
|
**Output:** `pipeline.db` — missing reverse links inserted into `entry_translations`
|
||||||
**Input:** `stage-1-extract/output/omw.json` + `stage-2-annotate/sources/cefr/{lang}.json`
|
|
||||||
**Output:**
|
|
||||||
|
|
||||||
- `stage-2-annotate/output/{lang}.json` — one per language
|
|
||||||
- `stage-2-annotate/output/conflicts.json` — cross-language conflicts for review
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
pnpm --filter @lila/pipeline annotate
|
pnpm --filter @lila/pipeline reverse-link
|
||||||
```
|
```
|
||||||
|
|
||||||
Each record in the output extends the OMW record with a `votes` field and any additional examples from the CEFR source file:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"source_id": "ili:i1",
|
|
||||||
"pos": "adjective",
|
|
||||||
"translations": {
|
|
||||||
"en": ["able"],
|
|
||||||
"it": ["abile", "intelligente", "valente", "capace"],
|
|
||||||
"es": ["capaz"],
|
|
||||||
"fr": ["comptable"]
|
|
||||||
},
|
|
||||||
"glosses": { "en": ["having the necessary means or skill to do something"] },
|
|
||||||
"examples": {
|
|
||||||
"en": [
|
|
||||||
{ "text": "able to swim", "source": "omw" },
|
|
||||||
{ "text": "She was able to finish the task.", "source": "cefr" }
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"votes": { "en": { "able": { "cefr_source": "B1" } } }
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Words not present in the CEFR source file will have an empty `votes` object.
|
|
||||||
|
|
||||||
### 3. Enrich
|
### 3. Enrich
|
||||||
|
|
||||||
The enrich stage runs in two rounds, both designed to execute overnight one model at a time. The llama.cpp server must be running locally before starting either round. See `LLM-SETUP.md` for setup instructions.
|
The enrich stage runs LLMs to fill four types of gaps, in this order:
|
||||||
|
|
||||||
**Round 1 — generation**
|
**A — Missing translations:** for each entry that has no translation in one or more supported languages after reverse link sync, the LLM generates the best translation for that language given the entry's headword, gloss, and examples.
|
||||||
|
|
||||||
Each model processes every word in every language one term at a time and
|
**B — Weak glosses and examples:** for each entry where the gloss is missing or the examples are missing, the LLM generates a natural, learner-friendly gloss and one usage example in the entry's language.
|
||||||
generates:
|
|
||||||
|
|
||||||
- A CEFR level vote for each translation
|
**C — CEFR levels:** for every entry, the LLM assigns a CEFR level (A1–C2) based on the headword, gloss, and examples. This runs for all entries regardless of whether other enrichment was needed.
|
||||||
- A description for each language
|
|
||||||
- A translation for each language, only if OMW provides none
|
|
||||||
- A gloss for each language, only if OMW provides none
|
|
||||||
- Usage examples for each language, only if OMW provides none
|
|
||||||
|
|
||||||
OMW data is never duplicated — the script checks what OMW already provides before building the prompt. For translations, glosses and examples, if OMW data exists for that language the LLM skips generation entirely. This significantly reduces compute time for languages with good OMW coverage such as English.
|
All output is written to `pipeline.db` atomically per entry — runs are fully resumable if interrupted. Each model is run once — one model produces one vote.
|
||||||
|
|
||||||
All model-generated content is stored with an anonymised source (`model_1`, `model_2` etc.) so models cannot be biased by knowing who generated what in round 2.
|
> **Note:** Before running this stage, ensure the llama.cpp server is running locally. The orchestrator checks for a running server at `http://127.0.0.1:8080/health` and exits with instructions if it is not reachable. See `llm-setup.md` for setup instructions.
|
||||||
|
|
||||||
**Input:** `stage-2-annotate/output/{lang}.json`
|
**Input:** `pipeline.db` — entries after reverse link sync
|
||||||
**Output:** `stage-3-enrich/output/round1/{lang}_{model}.json` per run
|
**Output:** `pipeline.db` — LLM-generated translations, glosses, examples, and CEFR votes
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
pnpm --filter @lila/pipeline enrich --round 1 --model {model}
|
pnpm --filter @lila/pipeline run --name "night-1"
|
||||||
```
|
|
||||||
|
|
||||||
**Compiling candidates**
|
|
||||||
|
|
||||||
Once all round 1 runs are complete, compile all generated candidates into a single structured file per language. This is the input to round 2.
|
|
||||||
|
|
||||||
**Input:** `stage-3-enrich/output/round1/{lang}_{model}.json`
|
|
||||||
**Output:** `stage-3-enrich/output/candidates/{lang}_candidates.json`
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pnpm --filter @lila/pipeline enrich --compile-candidates
|
|
||||||
```
|
|
||||||
|
|
||||||
**Round 2 — voting**
|
|
||||||
|
|
||||||
Each model receives the compiled candidate list for every word and votes on:
|
|
||||||
|
|
||||||
- The best gloss candidate (if multiple exist)
|
|
||||||
- The best description candidate (if multiple exist)
|
|
||||||
- The best usage examples candidate (if multiple exist)
|
|
||||||
- A CEFR level vote for each translation
|
|
||||||
|
|
||||||
OMW data is not put to a vote — it automatically wins over any LLM-generated candidate. Round 2 only resolves conflicts between model-generated candidates. The prompt is kept small — one word at a time, a clean numbered candidate list — to fit within a limited context window.
|
|
||||||
|
|
||||||
**Input:** `stage-3-enrich/output/candidates/{lang}_candidates.json`
|
|
||||||
**Output:** `stage-3-enrich/output/round2/{lang}_{model}.json` per run
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pnpm --filter @lila/pipeline enrich --round 2 --model {model}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Compiling votes**
|
|
||||||
|
|
||||||
Once all round 2 runs are complete, compile all votes into a single file per language. This is the input to the merge stage.
|
|
||||||
|
|
||||||
**Input:** `stage-3-enrich/output/round2/{lang}_{model}.json`
|
|
||||||
**Output:** `stage-3-enrich/output/votes/{lang}_votes.json`
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pnpm --filter @lila/pipeline enrich --compile-votes
|
|
||||||
```
|
|
||||||
|
|
||||||
Each record in the votes file looks like this:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"source_id": "omw-en-12345",
|
|
||||||
"pos": "noun",
|
|
||||||
"translations": {
|
|
||||||
"en": [
|
|
||||||
{
|
|
||||||
"text": "dog",
|
|
||||||
"votes": { "cefr_source": "A1", "model_1": "A1", "model_2": "A1" }
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"text": "canine",
|
|
||||||
"votes": { "cefr_source": "B2", "model_1": "B2", "model_2": "B1" }
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"it": [
|
|
||||||
{
|
|
||||||
"text": "cane",
|
|
||||||
"votes": { "cefr_source": "A1", "model_1": "A1", "model_2": "A1" }
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"glosses": {
|
|
||||||
"en": { "text": "a domesticated carnivorous mammal", "source": "omw" },
|
|
||||||
"fr": {
|
|
||||||
"candidates": [
|
|
||||||
{ "text": "un mammifère carnivore domestiqué", "source": "model_1" },
|
|
||||||
{ "text": "un animal domestique carnivore", "source": "model_2" }
|
|
||||||
],
|
|
||||||
"votes": { "model_1": 1, "model_2": 1 }
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"examples": {
|
|
||||||
"en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
|
|
||||||
"fr": {
|
|
||||||
"candidates": [
|
|
||||||
{ "text": "le chien a aboyé", "source": "model_1" },
|
|
||||||
{ "text": "le chien gardait la maison", "source": "model_2" }
|
|
||||||
],
|
|
||||||
"votes": { "model_1": 2, "model_2": 1 }
|
|
||||||
}
|
|
||||||
},
|
|
||||||
"descriptions": {
|
|
||||||
"en": {
|
|
||||||
"candidates": [
|
|
||||||
{
|
|
||||||
"text": "a common household pet known for loyalty",
|
|
||||||
"source": "model_1"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"text": "a domesticated animal and loyal companion",
|
|
||||||
"source": "model_2"
|
|
||||||
}
|
|
||||||
],
|
|
||||||
"votes": { "model_1": 2, "model_2": 1 }
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. Merge
|
### 4. Merge
|
||||||
|
|
||||||
Reads the votes file per language and resolves the final value for every field. Produces two output files per language — fully resolved records ready for seeding, and flagged records that need manual review.
|
Reads all LLM votes from `pipeline.db` and resolves the final value for every field. Writes resolved entries back to `pipeline.db`.
|
||||||
|
|
||||||
**Merge rules:**
|
**Merge rules:**
|
||||||
|
|
||||||
- OMW data wins automatically and is never overridden
|
- Kaikki source data wins automatically and is never overridden by LLM output
|
||||||
- For CEFR levels: the level with the most votes wins. If no majority is reached, that translation is flagged
|
- For CEFR levels: the level with the most votes wins. If no majority is reached, the entry is flagged for the tiebreaker
|
||||||
- For LLM-generated text fields (gloss, examples, descriptions): the candidate with the most votes wins
|
- For LLM-generated text fields: the candidate with the most votes wins. If no majority is reached, the tiebreaker runs
|
||||||
|
|
||||||
<!-- TODO: decide fallback strategy when no majority is reached for text fields -->
|
|
||||||
|
|
||||||
**Difficulty mapping:**
|
**Difficulty mapping:**
|
||||||
|
|
||||||
|
|
@ -339,93 +208,85 @@ Reads the votes file per language and resolves the final value for every field.
|
||||||
| B1, B2 | intermediate |
|
| B1, B2 | intermediate |
|
||||||
| C1, C2 | hard |
|
| C1, C2 | hard |
|
||||||
|
|
||||||
**Input:** `stage-3-enrich/output/votes/{lang}_votes.json`
|
**Input:** `pipeline.db` — LLM votes
|
||||||
**Output:**
|
**Output:** `pipeline.db` — entries updated with resolved values or flagged status
|
||||||
|
|
||||||
- `stage-4-merge/output/final/{lang}.json` — fully resolved, ready for seeding
|
### 4b. Tiebreak
|
||||||
- `stage-4-merge/output/flagged/{lang}.json` — CEFR majority not reached, needs manual review before seeding
|
|
||||||
|
|
||||||
```bash
|
Runs automatically after merge if any entries remain flagged. The script queries `pipeline.db` for flagged entries, identifies which configured models have not yet voted on each entry, and runs those models on the flagged subset only. Merge is re-run after each tiebreaker pass. This repeats until all flagged entries are resolved or no unused models remain.
|
||||||
pnpm --filter @lila/pipeline merge
|
|
||||||
```
|
|
||||||
|
|
||||||
Each record in `final/{lang}.json` looks like this:
|
If unused models are exhausted and flagged entries remain, the script logs a detailed report showing the exact vote split for each unresolved entry and lists available models from OpenRouter that have not been used. Syncing is blocked until all entries are resolved. To continue, add one or more models to the config and re-run the pipeline — the tiebreaker will pick up automatically.
|
||||||
|
|
||||||
```json
|
> **Note:** The tiebreaker is not a standalone script. It runs automatically as part of the pipeline orchestrator after merge completes.
|
||||||
{
|
|
||||||
"source_id": "omw-en-12345",
|
|
||||||
"pos": "noun",
|
|
||||||
"translations": {
|
|
||||||
"en": [
|
|
||||||
{ "text": "dog", "cefr_level": "A1", "difficulty": "easy" },
|
|
||||||
{ "text": "canine", "cefr_level": "B2", "difficulty": "intermediate" }
|
|
||||||
],
|
|
||||||
"it": [{ "text": "cane", "cefr_level": "A1", "difficulty": "easy" }]
|
|
||||||
},
|
|
||||||
"glosses": {
|
|
||||||
"en": { "text": "a domesticated carnivorous mammal", "source": "omw" },
|
|
||||||
"fr": { "text": "un mammifère carnivore domestiqué", "source": "model_1" }
|
|
||||||
},
|
|
||||||
"examples": {
|
|
||||||
"en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
|
|
||||||
"fr": [{ "text": "le chien a aboyé", "source": "model_1" }]
|
|
||||||
},
|
|
||||||
"descriptions": {
|
|
||||||
"en": {
|
|
||||||
"text": "a common household pet known for loyalty and companionship",
|
|
||||||
"source": "model_1"
|
|
||||||
},
|
|
||||||
"it": {
|
|
||||||
"text": "un animale domestico comune noto per la sua fedeltà",
|
|
||||||
"source": "model_2"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
**Resolving flagged words:**
|
|
||||||
|
|
||||||
Open `stage-4-merge/output/flagged/{lang}.json`, manually set the correct `cefr_level` and `difficulty` for each flagged translation, then move the resolved entries into `stage-4-merge/output/final/{lang}.json`. Re-run the seeder after resolving.
|
|
||||||
|
|
||||||
### 5. Compare / QA
|
### 5. Compare / QA
|
||||||
|
|
||||||
Read-only. Generates `COVERAGE.md` with a full breakdown of the pipeline
|
Read-only. Generates `COVERAGE.md` with a full breakdown of pipeline output quality per language. Run this after merge to verify output before syncing to the database.
|
||||||
output quality per language. Run this after merge to verify output before
|
|
||||||
seeding the database.
|
|
||||||
|
|
||||||
**Input:**
|
|
||||||
|
|
||||||
- `stage-4-merge/output/final/{lang}.json`
|
|
||||||
- `stage-4-merge/output/flagged/{lang}.json`
|
|
||||||
|
|
||||||
|
**Input:** `pipeline.db` — entries with status `final`
|
||||||
**Output:** `COVERAGE.md`
|
**Output:** `COVERAGE.md`
|
||||||
|
|
||||||
```bash
|
|
||||||
pnpm --filter @lila/pipeline compare
|
|
||||||
```
|
|
||||||
|
|
||||||
`COVERAGE.md` reports the following per language:
|
`COVERAGE.md` reports the following per language:
|
||||||
|
|
||||||
- Total synsets extracted
|
- Total entries extracted
|
||||||
- Total translations per language
|
- POS breakdown — entry counts for noun, verb, adjective, adverb
|
||||||
- POS breakdown per language — word counts for noun, verb, adjective, adverb
|
- Translation coverage — how many entries have translations in each other language
|
||||||
- CEFR coverage per language — how many translations have a resolved CEFR level, broken down by level (A1, A2, B1, B2, C1, C2)
|
- CEFR coverage — how many entries have a resolved CEFR level, broken down by level
|
||||||
- Difficulty breakdown per language — word counts for easy, intermediate, hard
|
- Difficulty breakdown — entry counts for easy, intermediate, hard
|
||||||
- Flagged count per language — how many translations are awaiting manual review
|
- Gloss coverage — how many entries have a gloss, broken down by source (Kaikki vs LLM-generated)
|
||||||
- Gloss coverage per language — total glosses, broken down by source (omw vs LLM-generated) and which languages have no glosses at all
|
- Example coverage — same breakdown as glosses
|
||||||
- Example coverage per language — same breakdown as glosses
|
|
||||||
- Description coverage per language — how many translations have a description, broken down by source
|
|
||||||
- CEFR source file coverage per language — how many words from the source file were matched against OMW translations
|
|
||||||
- LLM model contribution — how many CEFR votes and text candidates each anonymised model contributed
|
- LLM model contribution — how many CEFR votes and text candidates each anonymised model contributed
|
||||||
|
|
||||||
|
## Sync
|
||||||
|
|
||||||
|
The sync script transfers all entries with status `final` in `pipeline.db` to the production PostgreSQL database. It is upsert-based and never wipes existing data. For each entry it checks whether a matching record already exists in the target database:
|
||||||
|
|
||||||
|
- **Missing** → insert
|
||||||
|
- **Present but changed** → update
|
||||||
|
- **Present and unchanged** → skip
|
||||||
|
|
||||||
|
Run this after all entries are resolved and Compare / QA has been reviewed.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pnpm --filter @lila/pipeline sync
|
||||||
|
```
|
||||||
|
|
||||||
|
The sync script requires a connection string to the target database. Set `DATABASE_URL` in your `.env` file before running.
|
||||||
|
|
||||||
|
## Reports
|
||||||
|
|
||||||
|
The pipeline generates a report at the end of every run. Reports are written to `data-pipeline/reports/` as a JSON file and a markdown file with the same name. The markdown is generated from the JSON and contains identical data.
|
||||||
|
|
||||||
|
```
|
||||||
|
data-pipeline/reports/
|
||||||
|
2026-05-03_run-1.json
|
||||||
|
2026-05-03_run-1.md
|
||||||
|
```
|
||||||
|
|
||||||
|
The run name is auto-generated from the date and a counter. Reports are not committed to git.
|
||||||
|
|
||||||
|
**Nightly report** contains:
|
||||||
|
|
||||||
|
- Entries processed this run vs total
|
||||||
|
- Entries remaining per stage
|
||||||
|
- Average processing speed and estimated nights remaining
|
||||||
|
- `needs_review` count — entries that failed structural validation
|
||||||
|
- Per-model progress breakdown
|
||||||
|
|
||||||
|
**Final report** (generated when all entries are processed) additionally contains:
|
||||||
|
|
||||||
|
- Full vote breakdown per model
|
||||||
|
- Flagged entries with exact vote splits
|
||||||
|
- Available unused models from OpenRouter for tiebreaking
|
||||||
|
- Per-model quality metrics — CEFR agreement rate, field coverage, JSON parse rate
|
||||||
|
|
||||||
## Adding a new language
|
## Adding a new language
|
||||||
|
|
||||||
1. Add the language code to `SUPPORTED_LANGUAGE_CODES` in `packages/shared/src/constants.ts`
|
1. Add the language code to `SUPPORTED_LANGUAGE_CODES` in `packages/shared/src/constants.ts`
|
||||||
2. Build shared: `pnpm --filter @lila/shared build`
|
2. Build shared: `pnpm --filter @lila/shared build`
|
||||||
3. Generate and run a DB migration: `pnpm --filter @lila/db generate` then `pnpm --filter @lila/db migrate`
|
3. Generate and run a DB migration: `pnpm --filter @lila/db generate` then `pnpm --filter @lila/db migrate`
|
||||||
4. Download the OMW lexicon for the language using the `wn` Python library
|
4. Download the Kaikki JSONL file for the language from kaikki.org
|
||||||
5. Add a CEFR source file at `stage-2-annotate/sources/cefr/{lang}.json`
|
5. Re-run the full pipeline
|
||||||
6. Run the full pipeline
|
|
||||||
|
|
||||||
## Constants and constraints
|
## Constants and constraints
|
||||||
|
|
||||||
|
|
@ -442,27 +303,84 @@ Adding a new value to any of these requires a constants update and a database mi
|
||||||
|
|
||||||
## Further extensions
|
## Further extensions
|
||||||
|
|
||||||
These are not part of the current pipeline but are worth considering as the
|
These are not part of the current pipeline but are worth considering as the dataset matures:
|
||||||
dataset matures:
|
|
||||||
|
|
||||||
- **Grammatical gender and articles** — Wiktionary dumps contain gender and
|
- **IPA pronunciations** — Kaikki includes IPA transcriptions for most entries. Could be extracted and stored in a `entry_pronunciations` table and displayed in the quiz UI.
|
||||||
article data for nouns across all supported languages. Could be extracted
|
- **Audio files** — kaikki.org provides bulk audio file downloads (~20GB) for pronunciations. Could be stored as static files and served alongside the quiz UI.
|
||||||
and stored as a new `translation_forms` table.
|
- **Inflected forms** — Kaikki provides conjugation and declension tables in a `forms` array. Useful for a future grammar-focused quiz mode.
|
||||||
- **Conjugations** — Wiktionary also carries verb conjugation tables. Useful
|
- **Grammatical gender** — Kaikki includes grammatical gender for nouns. Could be stored per entry and used as an additional quiz mechanic.
|
||||||
for a future grammar-focused quiz mode.
|
- **Frequency data** — Word frequency rankings per language from sources like the Google Ngram dataset. Useful for smarter difficulty calibration beyond CEFR levels alone.
|
||||||
- **IPA pronunciations** — Wiktionary and Forvo are potential sources for
|
- **Additional languages** — The pipeline is language-agnostic. Adding a new language requires downloading its Kaikki JSONL file, a constants update, and a database migration. See **Adding a new language**.
|
||||||
phonetic transcriptions per language.
|
|
||||||
- **TTS audio files** — Generate pronunciation audio for each translation
|
## Roadmap
|
||||||
using a local or cloud TTS engine. Stored as static files, served alongside
|
|
||||||
the quiz UI.
|
**Current state:** Data source migrated from OMW to Kaikki. Production schema and pipeline being rewritten on `feat/kaikki-vocabulary-schema`. Pipeline infrastructure (orchestrator, db init, reporting, tests) is in place and carries forward.
|
||||||
- **Images** — Associate an image with each synset to support visual
|
|
||||||
vocabulary learning. Could be sourced from open image datasets like
|
**Next action:** Rewrite production schema in `packages/db`, then rewrite pipeline extraction stage for Kaikki.
|
||||||
ImageNet or WikiMedia Commons.
|
|
||||||
- **Frequency data** — Word frequency rankings per language from sources like
|
| Stage | Status |
|
||||||
the Google Ngram dataset. Useful for smarter difficulty calibration beyond
|
| --------------- | -------------- |
|
||||||
CEFR levels alone.
|
| 1. Extract | 🔲 not started |
|
||||||
- **Improved CEFR source files** — See note at the top of this document.
|
| 2. Reverse link | 🔲 not started |
|
||||||
UniversalCEFR and CEFR-J are good starting points.
|
| 3. Enrich | 🔲 not started |
|
||||||
- **Additional languages** — The pipeline is language-agnostic. Adding a new
|
| 4. Merge | 🔲 not started |
|
||||||
language requires an OMW lexicon, a CEFR source file, and a constants
|
| 4b. Tiebreak | 🔲 not started |
|
||||||
update. See **Adding a new language**.
|
| 5. Compare / QA | 🔲 not started |
|
||||||
|
| 6. Sync | 🔲 not started |
|
||||||
|
|
||||||
|
### Stage 1 — Extract `🔲 not started`
|
||||||
|
|
||||||
|
- [ ] Download Kaikki JSONL files for all 5 languages
|
||||||
|
- [ ] Write extraction script
|
||||||
|
- [ ] Write stage 1 validation tests
|
||||||
|
- [ ] Run extraction → `pipeline.db`
|
||||||
|
|
||||||
|
### Stage 2 — Reverse link sync `🔲 not started`
|
||||||
|
|
||||||
|
- [ ] Write reverse link sync script
|
||||||
|
- [ ] Write tests
|
||||||
|
- [ ] Run reverse link sync → `pipeline.db`
|
||||||
|
|
||||||
|
### Stage 3 — Enrich `🔲 not started`
|
||||||
|
|
||||||
|
**Next action:** Write the enrich script after production schema is complete.
|
||||||
|
|
||||||
|
- [ ] Write enrich script (missing translations, glosses, examples, CEFR votes)
|
||||||
|
- [ ] Write tests
|
||||||
|
- [ ] Install llama.cpp and verify server
|
||||||
|
- [ ] Smoke test with sample entries
|
||||||
|
- [ ] Run full sample, collect metrics
|
||||||
|
- [ ] Compare providers (local vs OpenRouter free models)
|
||||||
|
- [ ] Production run — all entries, all models
|
||||||
|
|
||||||
|
### Stage 4 — Merge `🔲 not started`
|
||||||
|
|
||||||
|
- [ ] Write merge script
|
||||||
|
- [ ] Write tests
|
||||||
|
- [ ] Run merge → `pipeline.db`
|
||||||
|
- [ ] Confirm tiebreaker resolves all flagged entries
|
||||||
|
|
||||||
|
### Stage 4b — Tiebreak `🔲 not started`
|
||||||
|
|
||||||
|
- [ ] Write tiebreak logic
|
||||||
|
- [ ] Run tiebreaker for all flagged entries
|
||||||
|
- [ ] Confirm no flagged entries remain before syncing
|
||||||
|
|
||||||
|
### Stage 5 — Compare / QA `🔲 not started`
|
||||||
|
|
||||||
|
- [ ] Write compare script
|
||||||
|
- [ ] Write tests
|
||||||
|
- [ ] Run compare → `COVERAGE.md`
|
||||||
|
- [ ] Review output quality before syncing
|
||||||
|
|
||||||
|
### Stage 6 — Sync `🔲 not started`
|
||||||
|
|
||||||
|
- [ ] Write sync script
|
||||||
|
- [ ] Write tests
|
||||||
|
- [ ] Configure `DATABASE_URL` in `.env`
|
||||||
|
- [ ] Run sync → production PostgreSQL
|
||||||
|
- [ ] Verify seeded data in production
|
||||||
|
|
||||||
|
### Utilities
|
||||||
|
|
||||||
|
**`sample/`** — Runs the pipeline against a small sample to produce human-readable output for a quick sanity check before committing to a full run. Run this after any script change before running the full pipeline.
|
||||||
|
|
|
||||||
|
|
@ -7,6 +7,14 @@ and production scripts.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Provider model
|
||||||
|
|
||||||
|
Each provider + model combination counts as one vote in the final majority.
|
||||||
|
Running the same model twice is not supported — one model, one vote. To
|
||||||
|
increase vote confidence, add more models rather than re-running existing ones.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Hardware (dev machine)
|
## Hardware (dev machine)
|
||||||
|
|
||||||
| Component | Spec |
|
| Component | Spec |
|
||||||
|
|
@ -190,16 +198,17 @@ Set `Authorization: Bearer <OPENROUTER_API_KEY>` in the request headers.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Provider configuration in the test script
|
## Provider configuration in the enrich script
|
||||||
|
|
||||||
The enrich test script reads a single config object. To switch providers,
|
The enrich script reads a single config object. To switch providers,
|
||||||
change this object and re-run.
|
change this object and re-run. The `name` field is used as the model
|
||||||
|
identifier in `pipeline.db` — it must be unique across all runs.
|
||||||
|
|
||||||
```typescript
|
```typescript
|
||||||
// config.ts
|
// config.ts
|
||||||
|
|
||||||
export type ProviderConfig = {
|
export type ProviderConfig = {
|
||||||
name: string; // used for output folder naming
|
name: string; // used as model identifier in pipeline.db — must be unique
|
||||||
baseURL: string;
|
baseURL: string;
|
||||||
apiKey: string;
|
apiKey: string;
|
||||||
model: string;
|
model: string;
|
||||||
|
|
@ -243,14 +252,9 @@ export const ANTHROPIC_SONNET: ProviderConfig = {
|
||||||
};
|
};
|
||||||
```
|
```
|
||||||
|
|
||||||
Output from each run lands in:
|
All output is written to `pipeline.db`. Each record is stored with the
|
||||||
|
model name as identifier so results from different providers can be
|
||||||
```
|
compared and compiled into votes.
|
||||||
stage-3-enrich/test/output/{provider.name}/results.json
|
|
||||||
stage-3-enrich/test/output/{provider.name}/metrics.json
|
|
||||||
```
|
|
||||||
|
|
||||||
The evaluate script compares all `metrics.json` files side by side.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
@ -297,5 +301,6 @@ The test script measures the following per provider run:
|
||||||
production. If not, use the cloud model that passed.
|
production. If not, use the cloud model that passed.
|
||||||
|
|
||||||
5. **Production run**
|
5. **Production run**
|
||||||
Full 117k records. Resume-safe — the script checkpoints after each
|
Full 117k records. Resume-safe — each record is written to `pipeline.db`
|
||||||
record so overnight runs can be stopped and continued.
|
atomically as it is processed. Overnight runs can be stopped and
|
||||||
|
continued at any time without losing work.
|
||||||
|
|
|
||||||
|
|
@ -12,7 +12,6 @@ export default defineConfig([
|
||||||
"node_modules/",
|
"node_modules/",
|
||||||
"routeTree.gen.ts",
|
"routeTree.gen.ts",
|
||||||
"scripts/**",
|
"scripts/**",
|
||||||
"data-pipeline/**/*",
|
|
||||||
]),
|
]),
|
||||||
|
|
||||||
eslint.configs.recommended,
|
eslint.configs.recommended,
|
||||||
|
|
|
||||||
|
|
@ -23,7 +23,7 @@
|
||||||
"prettier --write"
|
"prettier --write"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
"packageManager": "pnpm@10.33.1",
|
"packageManager": "pnpm@10.33.2",
|
||||||
"devDependencies": {
|
"devDependencies": {
|
||||||
"@eslint/js": "^10.0.1",
|
"@eslint/js": "^10.0.1",
|
||||||
"@tanstack/eslint-plugin-router": "^1.161.6",
|
"@tanstack/eslint-plugin-router": "^1.161.6",
|
||||||
|
|
|
||||||
5
pnpm-lock.yaml
generated
5
pnpm-lock.yaml
generated
|
|
@ -173,6 +173,9 @@ importers:
|
||||||
typescript:
|
typescript:
|
||||||
specifier: ^5.9.3
|
specifier: ^5.9.3
|
||||||
version: 5.9.3
|
version: 5.9.3
|
||||||
|
vitest:
|
||||||
|
specifier: ^4.1.0
|
||||||
|
version: 4.1.0(@opentelemetry/api@1.9.1)(@types/node@24.12.0)(jsdom@29.0.1(@noble/hashes@2.2.0))(vite@8.0.1(@types/node@24.12.0)(esbuild@0.27.4)(jiti@2.6.1)(tsx@4.21.0)(yaml@2.8.3))
|
||||||
|
|
||||||
packages/db:
|
packages/db:
|
||||||
dependencies:
|
dependencies:
|
||||||
|
|
@ -4391,7 +4394,6 @@ snapshots:
|
||||||
magic-string: 0.30.21
|
magic-string: 0.30.21
|
||||||
optionalDependencies:
|
optionalDependencies:
|
||||||
vite: 8.0.1(@types/node@24.12.0)(esbuild@0.27.4)(jiti@2.6.1)(tsx@4.21.0)(yaml@2.8.3)
|
vite: 8.0.1(@types/node@24.12.0)(esbuild@0.27.4)(jiti@2.6.1)(tsx@4.21.0)(yaml@2.8.3)
|
||||||
optional: true
|
|
||||||
|
|
||||||
'@vitest/mocker@4.1.0(vite@8.0.1(@types/node@25.5.0)(esbuild@0.27.4)(jiti@2.6.1)(tsx@4.21.0)(yaml@2.8.3))':
|
'@vitest/mocker@4.1.0(vite@8.0.1(@types/node@25.5.0)(esbuild@0.27.4)(jiti@2.6.1)(tsx@4.21.0)(yaml@2.8.3))':
|
||||||
dependencies:
|
dependencies:
|
||||||
|
|
@ -6136,7 +6138,6 @@ snapshots:
|
||||||
jsdom: 29.0.1(@noble/hashes@2.2.0)
|
jsdom: 29.0.1(@noble/hashes@2.2.0)
|
||||||
transitivePeerDependencies:
|
transitivePeerDependencies:
|
||||||
- msw
|
- msw
|
||||||
optional: true
|
|
||||||
|
|
||||||
vitest@4.1.0(@opentelemetry/api@1.9.1)(@types/node@25.5.0)(jsdom@29.0.1(@noble/hashes@2.2.0))(vite@8.0.1(@types/node@25.5.0)(esbuild@0.27.4)(jiti@2.6.1)(tsx@4.21.0)(yaml@2.8.3)):
|
vitest@4.1.0(@opentelemetry/api@1.9.1)(@types/node@25.5.0)(jsdom@29.0.1(@noble/hashes@2.2.0))(vite@8.0.1(@types/node@25.5.0)(esbuild@0.27.4)(jiti@2.6.1)(tsx@4.21.0)(yaml@2.8.3)):
|
||||||
dependencies:
|
dependencies:
|
||||||
|
|
|
||||||
Loading…
Add table
Add a link
Reference in a new issue