docs: update data-pipeline.md and llm-setup.md to reflect sqlite architecture

2026-05-02 20:13:05 +02:00 · 2026-05-02 20:13:05 +02:00 · 6007fe1e38
commit 6007fe1e38
parent ccfd83d16c
2 changed files with 175 additions and 157 deletions
--- a/documentation/data-pipeline.md
+++ b/documentation/data-pipeline.md
@ -18,28 +18,31 @@ flowchart LR
    extract[Extract]
    annotate[Annotate]
    enrich[Enrich]
+    pipelinedb[(pipeline.db)]
    merge[Merge]
-    final[(final/lang.json)]
-    flagged[(flagged/lang.json)]
-    seeder[packages/db seeder]
-    db[(Database)]
+    compare[Compare]
+    sync[Sync]
+    db[(PostgreSQL)]
    omw --> extract
    cefr --> annotate
    extract --> annotate
    annotate --> enrich
-    enrich --> merge
-    merge --> final
-    merge --> flagged
-    final --> seeder
-    seeder --> db
+    enrich --> pipelinedb
+    pipelinedb --> merge
+    merge --> pipelinedb
+    pipelinedb --> compare
+    pipelinedb --> sync
+    sync --> db
 ```
-Each stage is a standalone script that reads from the previous stage's output and produces one JSON file per language. Stages can be re-run independently without affecting earlier or later stages.
-The enrich stage is the exception — it produces one checkpoint file per model run per language, plus a compiled votes file once all runs are complete. It is designed to run overnight, one model at a time, and is fully resumable if interrupted.
-Only fully annotated output in `stage-4-merge/output/final/` reaches the database. Words where LLMs could not reach a majority vote land in `stage-4-merge/output/flagged/` and wait for manual review before seeding.
+Each stage is a standalone script that reads from the previous stage's output. Stages 1 and 2 read and write JSON files. From stage 3 onwards, all output is written to `pipeline.db` — a SQLite database that tracks processing status, LLM output, votes, and resolved records. This makes overnight LLM runs fully resumable and protects against data loss if a run is interrupted.
+Stage 1 is a manual prerequisite and is not run by the pipeline orchestrator. See **Stage 1 — Extract** for instructions.
+The enrich stage is designed to run overnight, one model at a time. Each model processes every word and writes results to `pipeline.db` atomically per record — interrupted runs resume from the last unprocessed record.
+Only fully resolved records reach the database. Records where LLMs could not reach a majority vote are marked `flagged` in `pipeline.db` and wait for manual review before syncing.
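The resume behaviour described in this hunk can be illustrated with a minimal sketch. It is not the pipeline's code: a plain object stands in for the status column in `pipeline.db`, and the names `StatusTable`, `pendingIds` and `runOnce` are hypothetical. It also assumes that a record marked `needs_review` is not retried on resume.

```typescript
// Hypothetical stand-in for the per-record status stored in pipeline.db.
type RecordStatus = "complete" | "needs_review";
type StatusTable = { [sourceId: string]: RecordStatus | undefined };

// Ids that have no status yet, in input order. These are the ones a resumed run picks up.
function pendingIds(allIds: string[], table: StatusTable): string[] {
  return allIds.filter((id) => table[id] === undefined);
}

// Processes at most `limit` pending records, then stops (a simulated interruption).
function runOnce(allIds: string[], table: StatusTable, limit: number): number {
  let processed = 0;
  for (const id of pendingIds(allIds, table)) {
    if (processed >= limit) break;
    table[id] = "complete"; // one write per record, so finished work survives a stop
    processed += 1;
  }
  return processed;
}
```

Calling `runOnce` a second time on the same table only touches the ids the first call did not reach, which is the property the per-record writes are meant to provide.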
 ## Data sources
@ -147,6 +150,10 @@ Each record in the output looks like this:
 Note: glosses and examples are not available for all languages. French and Spanish have no glosses or examples in the current OMW database — these will be generated by the LLM in the enrich stage. Coverage detail is in `COVERAGE.md`.
+> **Note:** Stage 1 is a manual prerequisite. It is not run by the pipeline
+> orchestrator (`pipeline.ts`). Run it once before running the orchestrator
+> for the first time, and re-run it manually if the OMW data changes.
 ### 2. Annotate
 Reads the combined OMW extract and merges CEFR source data into it. Each translation in each language is matched against the corresponding CEFR source
@ -193,7 +200,15 @@ Words not present in the CEFR source file will have an empty `votes` object.
 ### 3. Enrich
-The enrich stage runs in two rounds, both designed to execute overnight one model at a time. The llama.cpp server must be running locally before starting either round. See `LLM-SETUP.md` for setup instructions.
+> **Note:** Before running this stage, ensure the llama.cpp server is running
+> locally. The orchestrator checks for a running server at
+> `http://127.0.0.1:8080/health` and exits with instructions if it is not
+> reachable. See `llm-setup.md` for setup instructions.
+The enrich stage runs in two rounds, both designed to execute overnight one
+model at a time. All output is written to `pipeline.db` atomically per record
+— runs are fully resumable if interrupted. Each model is run once — one model
+produces one vote.
 **Round 1 — generation**
@ -206,12 +221,23 @@ generates:
 - A gloss for each language, only if OMW provides none
 - Usage examples for each language, only if OMW provides none
-OMW data is never duplicated — the script checks what OMW already provides before building the prompt. For translations, glosses and examples, if OMW data exists for that language the LLM skips generation entirely. This significantly reduces compute time for languages with good OMW coverage such as English.
+OMW data is never duplicated — the script checks what OMW already provides
+before building the prompt. For translations, glosses and examples, if OMW
+data exists for that language the LLM skips generation entirely. This
+significantly reduces compute time for languages with good OMW coverage such
+as English.
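As a rough illustration of the "check OMW first" rule, the sketch below decides which fields still need LLM generation for one language of one record. The `OmwCoverage` shape and the function name are assumptions made for this example, not taken from the pipeline.

```typescript
type GeneratedField = "translations" | "gloss" | "examples";

// Hypothetical shape: true means OMW already provides that field for the language.
type OmwCoverage = { translations: boolean; gloss: boolean; examples: boolean };

// Returns only the fields the prompt should ask the model to generate.
function fieldsToGenerate(omw: OmwCoverage): GeneratedField[] {
  const all: GeneratedField[] = ["translations", "gloss", "examples"];
  return all.filter((field) => !omw[field]);
}
```

A language with full OMW coverage yields an empty list, so no generation request is needed for it.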
-All model-generated content is stored with an anonymised source (`model_1`, `model_2` etc.) so models cannot be biased by knowing who generated what in round 2.
+All model-generated content is stored with an anonymised source (`model_1`,
+`model_2` etc.) so models cannot be biased by knowing who generated what in
+round 2.
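One way the anonymised labels could be assigned is sketched here. The first-seen ordering and the function name `anonymise` are assumptions; the source only states that labels take the form `model_1`, `model_2` and so on.

```typescript
// Maps each distinct model name to a stable anonymous label in first-seen order.
function anonymise(modelNames: string[]): { [name: string]: string } {
  const labels: { [name: string]: string } = {};
  let next = 1;
  for (const name of modelNames) {
    if (labels[name] === undefined) {
      labels[name] = "model_" + next;
      next += 1;
    }
  }
  return labels;
}
```

The real model name stays out of the round 2 prompt; only the label is shown next to each candidate.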
+Each record is written to `pipeline.db` with status `complete` or
+`needs_review` immediately after processing. If a record fails structural
+validation (invalid JSON, missing required fields, invalid CEFR value) it is
+marked `needs_review` and skipped — the run continues without interruption.
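The three structural checks named above (invalid JSON, missing required fields, invalid CEFR value) might look like the following sketch. The field names `source_id` and `cefr` are assumptions for illustration; the real round 1 output schema is not shown in this diff.

```typescript
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"];

type Round1Status = "complete" | "needs_review";

// Classifies raw model output without throwing, so the run can continue.
function validateRound1(raw: string): Round1Status {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return "needs_review"; // invalid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return "needs_review";
  const record = parsed as { [key: string]: unknown };
  // Hypothetical required fields.
  if (typeof record.source_id !== "string") return "needs_review";
  if (typeof record.cefr !== "string") return "needs_review";
  if (CEFR_LEVELS.indexOf(record.cefr) === -1) return "needs_review"; // invalid CEFR value
  return "complete";
}
```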
 **Input:** `stage-2-annotate/output/{lang}.json`
-**Output:** `stage-3-enrich/output/round1/{lang}_{model}.json` per run
+**Output:** `pipeline.db` — round 1 results per record per model
 ```bash
 pnpm --filter @lila/pipeline enrich --round 1 --model {model}
@ -219,10 +245,9 @@ pnpm --filter @lila/pipeline enrich --round 1 --model {model}
 **Compiling candidates**
-Once all round 1 runs are complete, compile all generated candidates into a single structured file per language. This is the input to round 2.
-
-**Input:** `stage-3-enrich/output/round1/{lang}_{model}.json`
-**Output:** `stage-3-enrich/output/candidates/{lang}_candidates.json`
+Once all round 1 runs are complete, compile all generated candidates into a
+single structured record per term in `pipeline.db`. This is the input to
+round 2.
 ```bash
 pnpm --filter @lila/pipeline enrich --compile-candidates
@ -237,10 +262,13 @@ Each model receives the compiled candidate list for every word and votes on:
 - The best usage examples candidate (if multiple exist)
 - A CEFR level vote for each translation
-OMW data is not put to a vote — it automatically wins over any LLM-generated candidate. Round 2 only resolves conflicts between model-generated candidates. The prompt is kept small — one word at a time, a clean numbered candidate list — to fit within a limited context window.
+OMW data is not put to a vote — it automatically wins over any LLM-generated
+candidate. Round 2 only resolves conflicts between model-generated candidates.
+The prompt is kept small — one word at a time, a clean numbered candidate
+list — to fit within a limited context window.
-**Input:** `stage-3-enrich/output/candidates/{lang}_candidates.json`
+**Input:** `pipeline.db` — compiled candidates
-**Output:** `stage-3-enrich/output/round2/{lang}_{model}.json` per run
+**Output:** `pipeline.db` — round 2 votes per record per model
 ```bash
 pnpm --filter @lila/pipeline enrich --round 2 --model {model}
@ -248,86 +276,26 @@ pnpm --filter @lila/pipeline enrich --round 2 --model {model}
 **Compiling votes**
-Once all round 2 runs are complete, compile all votes into a single file per language. This is the input to the merge stage.
-
-**Input:** `stage-3-enrich/output/round2/{lang}_{model}.json`
-**Output:** `stage-3-enrich/output/votes/{lang}_votes.json`
+Once all round 2 runs are complete, compile all votes into a final votes
+record per term in `pipeline.db`. This is the input to the merge stage.
 ```bash
 pnpm --filter @lila/pipeline enrich --compile-votes
 ```
-Each record in the votes file looks like this:
-```json
-{
-  "source_id": "omw-en-12345",
-  "pos": "noun",
-  "translations": {
-    "en": [
-      {
-        "text": "dog",
-        "votes": { "cefr_source": "A1", "model_1": "A1", "model_2": "A1" }
-      },
-      {
-        "text": "canine",
-        "votes": { "cefr_source": "B2", "model_1": "B2", "model_2": "B1" }
-      }
-    ],
-    "it": [
-      {
-        "text": "cane",
-        "votes": { "cefr_source": "A1", "model_1": "A1", "model_2": "A1" }
-      }
-    ]
-  },
-  "glosses": {
-    "en": { "text": "a domesticated carnivorous mammal", "source": "omw" },
-    "fr": {
-      "candidates": [
-        { "text": "un mammifère carnivore domestiqué", "source": "model_1" },
-        { "text": "un animal domestique carnivore", "source": "model_2" }
-      ],
-      "votes": { "model_1": 1, "model_2": 1 }
-    }
-  },
-  "examples": {
-    "en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
-    "fr": {
-      "candidates": [
-        { "text": "le chien a aboyé", "source": "model_1" },
-        { "text": "le chien gardait la maison", "source": "model_2" }
-      ],
-      "votes": { "model_1": 2, "model_2": 1 }
-    }
-  },
-  "descriptions": {
-    "en": {
-      "candidates": [
-        {
-          "text": "a common household pet known for loyalty",
-          "source": "model_1"
-        },
-        {
-          "text": "a domesticated animal and loyal companion",
-          "source": "model_2"
-        }
-      ],
-      "votes": { "model_1": 2, "model_2": 1 }
-    }
-  }
-}
-```
 ### 4. Merge
-Reads the votes file per language and resolves the final value for every field. Produces two output files per language — fully resolved records ready for seeding, and flagged records that need manual review.
+Reads compiled votes from `pipeline.db` and resolves the final value for
+every field. Updates each record in `pipeline.db` with status `final` or
+`flagged`.
 **Merge rules:**
 - OMW data wins automatically and is never overridden
-- For CEFR levels: the level with the most votes wins. If no majority is reached, that translation is flagged
-- For LLM-generated text fields (gloss, examples, descriptions): the candidate with the most votes wins
+- For CEFR levels: the level with the most votes wins. If no majority is
+  reached, that translation is flagged
+- For LLM-generated text fields (gloss, examples, descriptions): the
+  candidate with the most votes wins
 <!-- TODO: decide fallback strategy when no majority is reached for text fields -->
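The CEFR merge rule can be sketched as a small vote counter. This example assumes "majority" means strictly more than half of all votes cast; a plurality reading is also possible and the source does not settle it. The function name `resolveCefr` is hypothetical.

```typescript
type Cefr = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

// votes maps a voter id (for example "cefr_source" or "model_1") to its CEFR level.
function resolveCefr(votes: { [voter: string]: Cefr }): Cefr | "flagged" {
  const voters = Object.keys(votes);
  const counts: { [level: string]: number } = {};
  for (const voter of voters) {
    const level = votes[voter];
    counts[level] = (counts[level] === undefined ? 0 : counts[level]) + 1;
  }
  for (const level of Object.keys(counts)) {
    // Assumption: a strict majority is required, so ties and empty vote sets are flagged.
    if (counts[level] * 2 > voters.length) return level as Cefr;
  }
  return "flagged";
}
```

With the votes `{ cefr_source: "B2", model_1: "B2", model_2: "B1" }`, two of three voters agree, so the translation resolves to `B2`. A one-to-one split has no majority and is flagged.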
@ -339,65 +307,26 @@ Reads the votes file per language and resolves the final value for every field.
 | B1, B2 | intermediate |
 | C1, C2 | hard         |
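The CEFR-to-difficulty table translates directly into a lookup. The `B1`/`B2` and `C1`/`C2` rows are visible in this hunk. The `A1`/`A2` to `easy` row falls outside it, so that mapping is an assumption here (only `A1` mapped to `easy` appears in the example records).

```typescript
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
type Difficulty = "easy" | "intermediate" | "hard";

const DIFFICULTY_BY_CEFR: { [level in CefrLevel]: Difficulty } = {
  A1: "easy", // seen in the example records
  A2: "easy", // assumed; the table row is outside this hunk
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

function difficultyFor(level: CefrLevel): Difficulty {
  return DIFFICULTY_BY_CEFR[level];
}
```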
-**Input:** `stage-3-enrich/output/votes/{lang}_votes.json`
-**Output:**
-- `stage-4-merge/output/final/{lang}.json` — fully resolved, ready for seeding
-- `stage-4-merge/output/flagged/{lang}.json` — CEFR majority not reached, needs manual review before seeding
+**Input:** `pipeline.db` — compiled votes
+**Output:** `pipeline.db` — records updated with status `final` or `flagged`
 ```bash
 pnpm --filter @lila/pipeline merge
 ```
-Each record in `final/{lang}.json` looks like this:
-```json
-{
-  "source_id": "omw-en-12345",
-  "pos": "noun",
-  "translations": {
-    "en": [
-      { "text": "dog", "cefr_level": "A1", "difficulty": "easy" },
-      { "text": "canine", "cefr_level": "B2", "difficulty": "intermediate" }
-    ],
-    "it": [{ "text": "cane", "cefr_level": "A1", "difficulty": "easy" }]
-  },
-  "glosses": {
-    "en": { "text": "a domesticated carnivorous mammal", "source": "omw" },
-    "fr": { "text": "un mammifère carnivore domestiqué", "source": "model_1" }
-  },
-  "examples": {
-    "en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
-    "fr": [{ "text": "le chien a aboyé", "source": "model_1" }]
-  },
-  "descriptions": {
-    "en": {
-      "text": "a common household pet known for loyalty and companionship",
-      "source": "model_1"
-    },
-    "it": {
-      "text": "un animale domestico comune noto per la sua fedeltà",
-      "source": "model_2"
-    }
-  }
-}
-```
 **Resolving flagged words:**
-Open `stage-4-merge/output/flagged/{lang}.json`, manually set the correct `cefr_level` and `difficulty` for each flagged translation, then move the resolved entries into `stage-4-merge/output/final/{lang}.json`. Re-run the seeder after resolving.
+Query `pipeline.db` for all records with status `flagged`, manually set the
+correct `cefr_level` and `difficulty` for each flagged translation, and update
+the record status to `final`. Re-run the sync script after resolving.
 ### 5. Compare / QA
 Read-only. Generates `COVERAGE.md` with a full breakdown of the pipeline
 output quality per language. Run this after merge to verify output before
-seeding the database.
+syncing to the database.
-**Input:**
-- `stage-4-merge/output/final/{lang}.json`
-- `stage-4-merge/output/flagged/{lang}.json`
+**Input:** `pipeline.db` — records with status `final` and `flagged`
 **Output:** `COVERAGE.md`
 ```bash
@ -409,14 +338,39 @@ pnpm --filter @lila/pipeline compare
 - Total synsets extracted
 - Total translations per language
 - POS breakdown per language — word counts for noun, verb, adjective, adverb
-- CEFR coverage per language — how many translations have a resolved CEFR level, broken down by level (A1, A2, B1, B2, C1, C2)
+- CEFR coverage per language — how many translations have a resolved CEFR
+  level, broken down by level (A1, A2, B1, B2, C1, C2)
 - Difficulty breakdown per language — word counts for easy, intermediate, hard
 - Flagged count per language — how many translations are awaiting manual review
-- Gloss coverage per language — total glosses, broken down by source (omw vs LLM-generated) and which languages have no glosses at all
+- Gloss coverage per language — total glosses, broken down by source (omw vs
+  LLM-generated) and which languages have no glosses at all
 - Example coverage per language — same breakdown as glosses
-- Description coverage per language — how many translations have a description, broken down by source
-- CEFR source file coverage per language — how many words from the source file were matched against OMW translations
-- LLM model contribution — how many CEFR votes and text candidates each anonymised model contributed
+- Description coverage per language — how many translations have a description,
+  broken down by source
+- CEFR source file coverage per language — how many words from the source file
+  were matched against OMW translations
+- LLM model contribution — how many CEFR votes and text candidates each
+  anonymised model contributed
+## Sync
+The sync script transfers all records with status `final` in `pipeline.db` to
+the production PostgreSQL database. It is upsert-based and never wipes
+existing data. For each record it checks whether a matching `source_id`
+already exists in the target database:
+- **Missing** → insert
+- **Present but changed** → update
+- **Present and unchanged** → skip
+Run this after merge and after manually resolving any flagged entries.
+```bash
+pnpm --filter @lila/pipeline sync
+```
+The sync script requires a connection string to the target database. Set
+`DATABASE_URL` in your `.env` file before running.
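The insert, update and skip decision described in the Sync section could be sketched as below. The `SyncRecord` shape, with a serialised `payload` used for change detection, is an assumption for illustration. How the real script compares records is not specified in this diff.

```typescript
type SyncAction = "insert" | "update" | "skip";

// Hypothetical shape: payload is the serialised record content used to detect changes.
type SyncRecord = { source_id: string; payload: string };

// `existing` stands in for a lookup against the target database, keyed by source_id.
function decideSync(
  incoming: SyncRecord,
  existing: { [sourceId: string]: SyncRecord | undefined }
): SyncAction {
  const current = existing[incoming.source_id];
  if (current === undefined) return "insert"; // missing
  if (current.payload !== incoming.payload) return "update"; // present but changed
  return "skip"; // present and unchanged
}
```

Because no branch deletes anything, running the decision repeatedly over the same `final` records leaves the target unchanged after the first pass.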
 ## Adding a new language
@ -466,3 +420,62 @@ dataset matures:
 - **Additional languages** — The pipeline is language-agnostic. Adding a new
  language requires an OMW lexicon, a CEFR source file, and a constants
  update. See **Adding a new language**.
+## Roadmap
+**Current state:** Stages 1 and 2 are complete and output has been reviewed
+for all five languages. Architecture for stages 3–5 and the sync script is
+finalised. Stage 3 scripts have not been written yet and llama.cpp is not
+installed.
+**Next action:** Write the stage 3 round 1 script.
+| Stage           | Status         |
+| --------------- | -------------- |
+| 1. Extract      | ✅ complete    |
+| 2. Annotate     | ✅ complete    |
+| 3. Enrich       | 🔲 not started |
+| 4. Merge        | 🔲 not started |
+| 5. Compare / QA | 🔲 not started |
+### Stage 1 — Extract `✅ complete`
+- [x] Write extraction script
+- [x] Run extraction → `stage-1-extract/output/omw.json`
+### Stage 2 — Annotate `✅ complete`
+- [x] Write annotation script
+- [x] Run annotation → per-language JSON + `conflicts.json`
+### Stage 3 — Enrich `🔲 not started`
+**Next action:** Write the round 1 generation script.
+- [ ] Write round 1 script (generation)
+- [ ] Write compile-candidates script
+- [ ] Write round 2 script (voting)
+- [ ] Write compile-votes script
+- [ ] Install llama.cpp and verify server
+- [ ] Smoke test with 5–10 records
+- [ ] Run full 100-record sample, collect metrics
+- [ ] Compare providers (local vs OpenRouter free models)
+- [ ] Production run — all records, all models
+- [ ] Compile candidates → `stage-3-enrich/output/candidates/{lang}_candidates.json`
+- [ ] Compile votes → `stage-3-enrich/output/votes/{lang}_votes.json`
+### Stage 4 — Merge `🔲 not started`
+- [ ] Write merge script
+- [ ] Run merge → `final/` and `flagged/`
+- [ ] Manually resolve flagged entries
+### Stage 5 — Compare / QA `🔲 not started`
+- [ ] Write compare script
+- [ ] Run compare → `COVERAGE.md`
+- [ ] Review output quality before seeding
+### Utilities
+**`test/`** — Runs the pipeline against a small sample to produce human-readable output for a quick sanity check before committing to a full run. Run this after any script change before running the full pipeline.
--- a/documentation/llm-setup.md
+++ b/documentation/llm-setup.md
@ -7,6 +7,14 @@ and production scripts.
 ---
+## Provider model
+Each provider + model combination counts as one vote in the final majority.
+Running the same model twice is not supported — one model, one vote. To
+increase vote confidence, add more models rather than re-running existing ones.
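A pre-flight check for the "one model, one vote" rule might look like this sketch. It uses only the `name` field of a provider config, and the function `duplicateNames` is hypothetical rather than part of the enrich script.

```typescript
// Minimal subset of a provider config; only `name` matters for this check.
type NamedProvider = { name: string };

// Returns every name that appears more than once, so a run can refuse to start.
function duplicateNames(providers: NamedProvider[]): string[] {
  const seen: { [name: string]: number } = {};
  for (const provider of providers) {
    seen[provider.name] = (seen[provider.name] === undefined ? 0 : seen[provider.name]) + 1;
  }
  return Object.keys(seen).filter((name) => seen[name] > 1);
}
```

An empty result means every configured model would contribute exactly one vote.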
+---
 ## Hardware (dev machine)
 | Component | Spec                                                            |
@ -190,16 +198,17 @@ Set `Authorization: Bearer <OPENROUTER_API_KEY>` in the request headers.
 ---
-## Provider configuration in the test script
-The enrich test script reads a single config object. To switch providers,
-change this object and re-run.
+## Provider configuration in the enrich script
+The enrich script reads a single config object. To switch providers,
+change this object and re-run. The `name` field is used as the model
+identifier in `pipeline.db` — it must be unique across all runs.
 ```typescript
 // config.ts
 export type ProviderConfig = {
-  name: string; // used for output folder naming
+  name: string; // used as model identifier in pipeline.db — must be unique
  baseURL: string;
  apiKey: string;
  model: string;
@ -243,14 +252,9 @@ export const ANTHROPIC_SONNET: ProviderConfig = {
 };
 ```
-Output from each run lands in:
-
-```
-stage-3-enrich/test/output/{provider.name}/results.json
-stage-3-enrich/test/output/{provider.name}/metrics.json
-```
-The evaluate script compares all `metrics.json` files side by side.
+All output is written to `pipeline.db`. Each record is stored with the
+model name as identifier so results from different providers can be
+compared and compiled into votes.
 ---
@ -297,5 +301,6 @@ The test script measures the following per provider run:
   production. If not, use the cloud model that passed.
 5. **Production run**
-   Full 117k records. Resume-safe — the script checkpoints after each
-   record so overnight runs can be stopped and continued.
+   Full 117k records. Resume-safe — each record is written to `pipeline.db`
+   atomically as it is processed. Overnight runs can be stopped and
+   continued at any time without losing work.