docs: update data-pipeline.md and llm-setup.md to reflect sqlite architecture

lila 2026-05-02 20:13:05 +02:00
parent ccfd83d16c
commit 6007fe1e38
2 changed files with 175 additions and 157 deletions


@ -18,28 +18,31 @@ flowchart LR
extract[Extract]
annotate[Annotate]
enrich[Enrich]
pipelinedb[(pipeline.db)]
merge[Merge]
final[(final/lang.json)]
flagged[(flagged/lang.json)]
seeder[packages/db seeder]
db[(Database)]
compare[Compare]
sync[Sync]
db[(PostgreSQL)]
omw --> extract
cefr --> annotate
extract --> annotate
annotate --> enrich
enrich --> merge
merge --> final
merge --> flagged
final --> seeder
seeder --> db
enrich --> pipelinedb
pipelinedb --> merge
merge --> pipelinedb
pipelinedb --> compare
pipelinedb --> sync
sync --> db
```
Each stage is a standalone script that reads from the previous stage's output and produces one JSON file per language. Stages can be re-run independently without affecting earlier or later stages.
Each stage is a standalone script that reads from the previous stage's output. Stages 1 and 2 read and write JSON files. From stage 3 onwards, all output is written to `pipeline.db` — a SQLite database that tracks processing status, LLM output, votes, and resolved records. This makes overnight LLM runs fully resumable and protects against data loss if a run is interrupted.
The enrich stage is the exception — it produces one checkpoint file per model run per language, plus a compiled votes file once all runs are complete. It is designed to run overnight, one model at a time, and is fully resumable if interrupted.
Stage 1 is a manual prerequisite and is not run by the pipeline orchestrator. See **Stage 1 — Extract** for instructions.
Only fully annotated output in `stage-4-merge/output/final/` reaches the database. Words where LLMs could not reach a majority vote land in `stage-4-merge/output/flagged/` and wait for manual review before seeding.
The enrich stage is designed to run overnight, one model at a time. Each model processes every word and writes results to `pipeline.db` atomically per record, so an interrupted run resumes from the first unprocessed record.
Only fully resolved records reach the database. Records where LLMs could not reach a majority vote are marked `flagged` in `pipeline.db` and wait for manual review before syncing.
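To make the resume behaviour concrete, here is a minimal sketch of how a run might work out what is left to do. It assumes each stored result is keyed by `source_id`, model name, and round. The `ResultKey` type and the `keyOf` and `pendingIds` helpers are hypothetical names, and a `Set` stands in for a lookup against `pipeline.db`.

```typescript
// Hypothetical key for one stored result; the real pipeline.db schema may differ.
type ResultKey = { sourceId: string; model: string; round: 1 | 2 };

const keyOf = (k: ResultKey): string => `${k.sourceId}|${k.model}|${k.round}`;

// Returns the source ids that still need processing for this model and round.
// `finished` stands in for the rows already written to pipeline.db.
function pendingIds(
  allIds: string[],
  finished: Set<string>,
  model: string,
  round: 1 | 2,
): string[] {
  return allIds.filter(
    (sourceId) => !finished.has(keyOf({ sourceId, model, round })),
  );
}

// Example: the run was interrupted after the first record was written.
const finishedKeys = new Set<string>([
  keyOf({ sourceId: "omw-en-1", model: "model_a", round: 1 }),
]);
const remainingIds = pendingIds(
  ["omw-en-1", "omw-en-2", "omw-en-3"],
  finishedKeys,
  "model_a",
  1,
);
console.log(remainingIds); // the ids that are not yet in `finishedKeys`
```

Because each record is written on its own, restarting the script with the same model and round only has to repeat this filter.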
## Data sources
@ -147,6 +150,10 @@ Each record in the output looks like this:
Note: glosses and examples are not available for all languages. French and Spanish have no glosses or examples in the current OMW database — these will be generated by the LLM in the enrich stage. Coverage detail is in `COVERAGE.md`.
> **Note:** Stage 1 is a manual prerequisite. It is not run by the pipeline
> orchestrator (`pipeline.ts`). Run it once before running the orchestrator
> for the first time, and re-run it manually if the OMW data changes.
### 2. Annotate
Reads the combined OMW extract and merges CEFR source data into it. Each translation in each language is matched against the corresponding CEFR source
@ -193,7 +200,15 @@ Words not present in the CEFR source file will have an empty `votes` object.
### 3. Enrich
The enrich stage runs in two rounds, both designed to execute overnight one model at a time. The llama.cpp server must be running locally before starting either round. See `LLM-SETUP.md` for setup instructions.
> **Note:** Before running this stage, ensure the llama.cpp server is running
> locally. The orchestrator checks for a running server at
> `http://127.0.0.1:8080/health` and exits with instructions if it is not
> reachable. See `llm-setup.md` for setup instructions.
The enrich stage runs in two rounds, both designed to execute overnight one
model at a time. All output is written to `pipeline.db` atomically per record
— runs are fully resumable if interrupted. Each model is run once — one model
produces one vote.
**Round 1 — generation**
@ -206,12 +221,23 @@ generates:
- A gloss for each language, only if OMW provides none
- Usage examples for each language, only if OMW provides none
OMW data is never duplicated — the script checks what OMW already provides before building the prompt. For translations, glosses and examples, if OMW data exists for that language the LLM skips generation entirely. This significantly reduces compute time for languages with good OMW coverage such as English.
OMW data is never duplicated — the script checks what OMW already provides
before building the prompt. For translations, glosses and examples, if OMW
data exists for that language the LLM skips generation entirely. This
significantly reduces compute time for languages with good OMW coverage such
as English.
All model-generated content is stored with an anonymised source (`model_1`, `model_2` etc.) so models cannot be biased by knowing who generated what in round 2.
All model-generated content is stored with an anonymised source (`model_1`,
`model_2` etc.) so models cannot be biased by knowing who generated what in
round 2.
Each record is written to `pipeline.db` with status `complete` or
`needs_review` immediately after processing. If a record fails structural
validation (invalid JSON, missing required fields, invalid CEFR value) it is
marked `needs_review` and skipped — the run continues without interruption.
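The structural validation described above could look roughly like the following sketch. The flat record shape, the `REQUIRED_FIELDS` list, and the `classifyOutput` name are assumptions made for illustration; the real round 1 schema is not shown in this document.

```typescript
type RecordStatus = "complete" | "needs_review";

const VALID_CEFR = new Set<string>(["A1", "A2", "B1", "B2", "C1", "C2"]);

// Hypothetical required fields; the actual round 1 schema may differ.
const REQUIRED_FIELDS = ["source_id", "text", "cefr_level"];

// Classifies raw model output without throwing, so one bad record never stops the run.
function classifyOutput(raw: string): RecordStatus {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return "needs_review"; // invalid JSON
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return "needs_review"; // not a JSON object
  }
  const record = parsed as Record<string, unknown>;
  if (REQUIRED_FIELDS.some((field) => !(field in record))) {
    return "needs_review"; // missing required field
  }
  const level = record["cefr_level"];
  if (typeof level !== "string" || !VALID_CEFR.has(level)) {
    return "needs_review"; // invalid CEFR value
  }
  return "complete";
}

console.log(classifyOutput('{"source_id":"omw-en-1","text":"dog","cefr_level":"A1"}'));
console.log(classifyOutput("{not json"));
```

Returning a status instead of raising an error is what lets the loop store the result and move on to the next record.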
**Input:** `stage-2-annotate/output/{lang}.json`
**Output:** `stage-3-enrich/output/round1/{lang}_{model}.json` per run
**Output:** `pipeline.db` — round 1 results per record per model
```bash
pnpm --filter @lila/pipeline enrich --round 1 --model {model}
@ -219,10 +245,9 @@ pnpm --filter @lila/pipeline enrich --round 1 --model {model}
**Compiling candidates**
Once all round 1 runs are complete, compile all generated candidates into a single structured file per language. This is the input to round 2.
**Input:** `stage-3-enrich/output/round1/{lang}_{model}.json`
**Output:** `stage-3-enrich/output/candidates/{lang}_candidates.json`
Once all round 1 runs are complete, compile all generated candidates into a
single structured record per term in `pipeline.db`. This is the input to
round 2.
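As an illustration of the compile step, the sketch below groups round 1 rows into one candidate list per term, language, and field. The `Round1Row` shape and the `compileCandidates` name are hypothetical, and dropping exact duplicate texts is an assumption rather than documented behaviour.

```typescript
// Hypothetical shape of one round 1 row; the real pipeline.db schema may differ.
type Round1Row = {
  sourceId: string;
  lang: string;
  field: "gloss" | "example" | "description";
  text: string;
  source: string; // anonymised, e.g. "model_1"
};

type Candidate = { text: string; source: string };

// Groups rows by term, language and field.
// Assumption: identical texts from different models collapse into one candidate.
function compileCandidates(rows: Round1Row[]): Record<string, Candidate[]> {
  const compiled: Record<string, Candidate[]> = {};
  for (const row of rows) {
    const key = `${row.sourceId}|${row.lang}|${row.field}`;
    const list = compiled[key] ?? [];
    if (!list.some((c) => c.text === row.text)) {
      list.push({ text: row.text, source: row.source });
    }
    compiled[key] = list;
  }
  return compiled;
}

const glossRows: Round1Row[] = [
  { sourceId: "omw-en-12345", lang: "fr", field: "gloss", text: "un mammifère carnivore domestiqué", source: "model_1" },
  { sourceId: "omw-en-12345", lang: "fr", field: "gloss", text: "un animal domestique carnivore", source: "model_2" },
];
const compiledGlosses = compileCandidates(glossRows);
console.log(compiledGlosses["omw-en-12345|fr|gloss"]);
```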
```bash
pnpm --filter @lila/pipeline enrich --compile-candidates
@ -237,10 +262,13 @@ Each model receives the compiled candidate list for every word and votes on:
- The best usage examples candidate (if multiple exist)
- A CEFR level vote for each translation
OMW data is not put to a vote — it automatically wins over any LLM-generated candidate. Round 2 only resolves conflicts between model-generated candidates. The prompt is kept small — one word at a time, a clean numbered candidate list — to fit within a limited context window.
OMW data is not put to a vote — it automatically wins over any LLM-generated
candidate. Round 2 only resolves conflicts between model-generated candidates.
The prompt is kept small — one word at a time, a clean numbered candidate
list — to fit within a limited context window.
**Input:** `stage-3-enrich/output/candidates/{lang}_candidates.json`
**Output:** `stage-3-enrich/output/round2/{lang}_{model}.json` per run
**Input:** `pipeline.db` — compiled candidates
**Output:** `pipeline.db` — round 2 votes per record per model
```bash
pnpm --filter @lila/pipeline enrich --round 2 --model {model}
@ -248,86 +276,26 @@ pnpm --filter @lila/pipeline enrich --round 2 --model {model}
**Compiling votes**
Once all round 2 runs are complete, compile all votes into a single file per language. This is the input to the merge stage.
**Input:** `stage-3-enrich/output/round2/{lang}_{model}.json`
**Output:** `stage-3-enrich/output/votes/{lang}_votes.json`
Once all round 2 runs are complete, compile all votes into a final votes
record per term in `pipeline.db`. This is the input to the merge stage.
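A rough sketch of what compiling CEFR votes might involve is shown here. The `VoteRow` shape and the `compileVotes` name are hypothetical, and the choice to let a repeated vote from the same voter overwrite the earlier one is an assumption, since running a model twice is not supported.

```typescript
// Hypothetical shape of one round 2 CEFR vote; the real schema may differ.
type VoteRow = {
  sourceId: string;
  lang: string;
  text: string;
  voter: string; // "cefr_source" or an anonymised model such as "model_1"
  cefr: string;
};

// Collects votes into one object per translation, keyed by voter.
// Assumption: a second vote from the same voter replaces the first.
function compileVotes(rows: VoteRow[]): Record<string, Record<string, string>> {
  const compiled: Record<string, Record<string, string>> = {};
  for (const row of rows) {
    const key = `${row.sourceId}|${row.lang}|${row.text}`;
    const votes = compiled[key] ?? {};
    votes[row.voter] = row.cefr;
    compiled[key] = votes;
  }
  return compiled;
}

const canineVotes = compileVotes([
  { sourceId: "omw-en-12345", lang: "en", text: "canine", voter: "cefr_source", cefr: "B2" },
  { sourceId: "omw-en-12345", lang: "en", text: "canine", voter: "model_1", cefr: "B2" },
  { sourceId: "omw-en-12345", lang: "en", text: "canine", voter: "model_2", cefr: "B1" },
]);
console.log(canineVotes["omw-en-12345|en|canine"]);
```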
```bash
pnpm --filter @lila/pipeline enrich --compile-votes
```
Each record in the votes file looks like this:
```json
{
"source_id": "omw-en-12345",
"pos": "noun",
"translations": {
"en": [
{
"text": "dog",
"votes": { "cefr_source": "A1", "model_1": "A1", "model_2": "A1" }
},
{
"text": "canine",
"votes": { "cefr_source": "B2", "model_1": "B2", "model_2": "B1" }
}
],
"it": [
{
"text": "cane",
"votes": { "cefr_source": "A1", "model_1": "A1", "model_2": "A1" }
}
]
},
"glosses": {
"en": { "text": "a domesticated carnivorous mammal", "source": "omw" },
"fr": {
"candidates": [
{ "text": "un mammifère carnivore domestiqué", "source": "model_1" },
{ "text": "un animal domestique carnivore", "source": "model_2" }
],
"votes": { "model_1": 1, "model_2": 1 }
}
},
"examples": {
"en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
"fr": {
"candidates": [
{ "text": "le chien a aboyé", "source": "model_1" },
{ "text": "le chien gardait la maison", "source": "model_2" }
],
"votes": { "model_1": 2, "model_2": 1 }
}
},
"descriptions": {
"en": {
"candidates": [
{
"text": "a common household pet known for loyalty",
"source": "model_1"
},
{
"text": "a domesticated animal and loyal companion",
"source": "model_2"
}
],
"votes": { "model_1": 2, "model_2": 1 }
}
}
}
```
### 4. Merge
Reads the votes file per language and resolves the final value for every field. Produces two output files per language — fully resolved records ready for seeding, and flagged records that need manual review.
Reads compiled votes from `pipeline.db` and resolves the final value for
every field. Updates each record in `pipeline.db` with status `final` or
`flagged`.
**Merge rules:**
- OMW data wins automatically and is never overridden
- For CEFR levels: the level with the most votes wins. If no majority is reached, that translation is flagged
- For LLM-generated text fields (gloss, examples, descriptions): the candidate with the most votes wins
- For CEFR levels: the level with the most votes wins. If no majority is
reached, that translation is flagged
- For LLM-generated text fields (gloss, examples, descriptions): the
candidate with the most votes wins
<!-- TODO: decide fallback strategy when no majority is reached for text fields -->
@ -339,65 +307,26 @@ Reads the votes file per language and resolves the final value for every field.
| B1, B2 | intermediate |
| C1, C2 | hard |
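The CEFR merge rule and the difficulty mapping can be sketched as follows. This reads "majority" as strictly more than half of all votes, which is an assumption; the text does not say whether a plurality would also count. The `resolveCefr` and `DIFFICULTY` names are hypothetical, and the mapping of A2 to `easy` is assumed, since only A1 is confirmed as `easy` by the sample records.

```typescript
type Cefr = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
type Difficulty = "easy" | "intermediate" | "hard";

// A1 is "easy" in the sample records; A2 mapping to "easy" is an assumption.
const DIFFICULTY: Record<Cefr, Difficulty> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

// Returns the winning level, or null when the translation should be flagged.
// Assumption: a level wins only with strictly more than half of all votes.
function resolveCefr(votes: Record<string, Cefr>): Cefr | null {
  const voters = Object.keys(votes);
  const counts: Partial<Record<Cefr, number>> = {};
  for (const voter of voters) {
    const level = votes[voter];
    counts[level] = (counts[level] ?? 0) + 1;
  }
  for (const level of Object.keys(counts) as Cefr[]) {
    if ((counts[level] ?? 0) * 2 > voters.length) return level;
  }
  return null;
}

const canineLevel = resolveCefr({ cefr_source: "B2", model_1: "B2", model_2: "B1" });
const splitLevel = resolveCefr({ model_1: "A1", model_2: "A2" });
console.log(canineLevel, canineLevel ? DIFFICULTY[canineLevel] : "flagged");
console.log(splitLevel ?? "flagged");
```

Under this reading, two of three votes for B2 resolve the translation, while a one-to-one split is flagged for manual review.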
**Input:** `stage-3-enrich/output/votes/{lang}_votes.json`
**Output:**
- `stage-4-merge/output/final/{lang}.json` — fully resolved, ready for seeding
- `stage-4-merge/output/flagged/{lang}.json` — CEFR majority not reached, needs manual review before seeding
**Input:** `pipeline.db` — compiled votes
**Output:** `pipeline.db` — records updated with status `final` or `flagged`
```bash
pnpm --filter @lila/pipeline merge
```
Each record in `final/{lang}.json` looks like this:
```json
{
"source_id": "omw-en-12345",
"pos": "noun",
"translations": {
"en": [
{ "text": "dog", "cefr_level": "A1", "difficulty": "easy" },
{ "text": "canine", "cefr_level": "B2", "difficulty": "intermediate" }
],
"it": [{ "text": "cane", "cefr_level": "A1", "difficulty": "easy" }]
},
"glosses": {
"en": { "text": "a domesticated carnivorous mammal", "source": "omw" },
"fr": { "text": "un mammifère carnivore domestiqué", "source": "model_1" }
},
"examples": {
"en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
"fr": [{ "text": "le chien a aboyé", "source": "model_1" }]
},
"descriptions": {
"en": {
"text": "a common household pet known for loyalty and companionship",
"source": "model_1"
},
"it": {
"text": "un animale domestico comune noto per la sua fedeltà",
"source": "model_2"
}
}
}
```
**Resolving flagged words:**
Open `stage-4-merge/output/flagged/{lang}.json`, manually set the correct `cefr_level` and `difficulty` for each flagged translation, then move the resolved entries into `stage-4-merge/output/final/{lang}.json`. Re-run the seeder after resolving.
Query `pipeline.db` for all records with status `flagged`, manually set the
correct `cefr_level` and `difficulty` for each flagged translation, and update
the record status to `final`. Re-run the sync script after resolving.
### 5. Compare / QA
Read-only. Generates `COVERAGE.md` with a full breakdown of the pipeline
output quality per language. Run this after merge to verify output before
seeding the database.
**Input:**
- `stage-4-merge/output/final/{lang}.json`
- `stage-4-merge/output/flagged/{lang}.json`
syncing to the database.
**Input:** `pipeline.db` — records with status `final` and `flagged`
**Output:** `COVERAGE.md`
```bash
@ -409,14 +338,39 @@ pnpm --filter @lila/pipeline compare
- Total synsets extracted
- Total translations per language
- POS breakdown per language — word counts for noun, verb, adjective, adverb
- CEFR coverage per language — how many translations have a resolved CEFR level, broken down by level (A1, A2, B1, B2, C1, C2)
- CEFR coverage per language — how many translations have a resolved CEFR
level, broken down by level (A1, A2, B1, B2, C1, C2)
- Difficulty breakdown per language — word counts for easy, intermediate, hard
- Flagged count per language — how many translations are awaiting manual review
- Gloss coverage per language — total glosses, broken down by source (omw vs LLM-generated) and which languages have no glosses at all
- Gloss coverage per language — total glosses, broken down by source (omw vs
LLM-generated) and which languages have no glosses at all
- Example coverage per language — same breakdown as glosses
- Description coverage per language — how many translations have a description, broken down by source
- CEFR source file coverage per language — how many words from the source file were matched against OMW translations
- LLM model contribution — how many CEFR votes and text candidates each anonymised model contributed
- Description coverage per language — how many translations have a description,
broken down by source
- CEFR source file coverage per language — how many words from the source file
were matched against OMW translations
- LLM model contribution — how many CEFR votes and text candidates each
anonymised model contributed
## Sync
The sync script transfers all records with status `final` in `pipeline.db` to
the production PostgreSQL database. It is upsert-based and never wipes
existing data. For each record it checks whether a matching `source_id`
already exists in the target database:
- **Missing** → insert
- **Present but changed** → update
- **Present and unchanged** → skip
Run this after merge and after manually resolving any flagged entries.
```bash
pnpm --filter @lila/pipeline sync
```
The sync script requires a connection string to the target database. Set
`DATABASE_URL` in your `.env` file before running.
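The three-way decision made by the sync script can be sketched as a pure function. The `FinalRecord` shape, the `planSync` name, and the idea of comparing a serialised `payload` string are assumptions for illustration; a `Map` stands in for a lookup against the PostgreSQL target.

```typescript
// Hypothetical record shape; only source_id is named in the text above.
type FinalRecord = { source_id: string; payload: string };
type SyncAction = "insert" | "update" | "skip";

// Decides what to do with one record.
// `existing` maps source_id to the stored payload in the target database.
function planSync(record: FinalRecord, existing: Map<string, string>): SyncAction {
  const stored = existing.get(record.source_id);
  if (stored === undefined) return "insert"; // missing
  return stored === record.payload ? "skip" : "update"; // unchanged vs changed
}

const targetRows = new Map<string, string>([
  ["omw-en-1", '{"text":"dog"}'],
  ["omw-en-2", '{"text":"cat"}'],
]);

console.log(planSync({ source_id: "omw-en-1", payload: '{"text":"dog"}' }, targetRows));
console.log(planSync({ source_id: "omw-en-2", payload: '{"text":"kitten"}' }, targetRows));
console.log(planSync({ source_id: "omw-en-3", payload: '{"text":"bird"}' }, targetRows));
```

Since no branch deletes anything, running the sync repeatedly leaves rows that are already up to date untouched.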
## Adding a new language
@ -466,3 +420,62 @@ dataset matures:
- **Additional languages** — The pipeline is language-agnostic. Adding a new
language requires an OMW lexicon, a CEFR source file, and a constants
update. See **Adding a new language**.
## Roadmap
**Current state:** Stages 1 and 2 are complete and output has been reviewed
for all five languages. Architecture for stages 3–5 and the sync script is
finalised. Stage 3 scripts have not been written yet and llama.cpp is not
installed.
**Next action:** Write the stage 3 round 1 script.
| Stage | Status |
| --------------- | -------------- |
| 1. Extract | ✅ complete |
| 2. Annotate | ✅ complete |
| 3. Enrich | 🔲 not started |
| 4. Merge | 🔲 not started |
| 5. Compare / QA | 🔲 not started |
### Stage 1 — Extract `✅ complete`
- [x] Write extraction script
- [x] Run extraction → `stage-1-extract/output/omw.json`
### Stage 2 — Annotate `✅ complete`
- [x] Write annotation script
- [x] Run annotation → per-language JSON + `conflicts.json`
### Stage 3 — Enrich `🔲 not started`
**Next action:** Write the round 1 generation script.
- [ ] Write round 1 script (generation)
- [ ] Write compile-candidates script
- [ ] Write round 2 script (voting)
- [ ] Write compile-votes script
- [ ] Install llama.cpp and verify server
- [ ] Smoke test with 5–10 records
- [ ] Run full 100-record sample, collect metrics
- [ ] Compare providers (local vs OpenRouter free models)
- [ ] Production run — all records, all models
- [ ] Compile candidates → `pipeline.db`
- [ ] Compile votes → `pipeline.db`
### Stage 4 — Merge `🔲 not started`
- [ ] Write merge script
- [ ] Run merge → records marked `final` or `flagged` in `pipeline.db`
- [ ] Manually resolve flagged entries
### Stage 5 — Compare / QA `🔲 not started`
- [ ] Write compare script
- [ ] Run compare → `COVERAGE.md`
- [ ] Review output quality before syncing
### Utilities
**`test/`** — Runs the pipeline against a small sample to produce human-readable output for a quick sanity check before committing to a full run. Run this after any script change before running the full pipeline.


@ -7,6 +7,14 @@ and production scripts.
---
## Provider model
Each provider + model combination counts as one vote in the final majority.
Running the same model twice is not supported — one model, one vote. To
increase vote confidence, add more models rather than re-running existing ones.
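One way to enforce the one-model-one-vote rule before a run starts is to check the configured names for duplicates. This is a minimal sketch; the `NamedProvider` type is a stand-in that keeps only the `name` field, and `duplicateNames` is a hypothetical helper.

```typescript
// Minimal stand-in for a provider config; only `name` matters for this check.
type NamedProvider = { name: string };

// Returns every name that appears more than once, so a run can refuse to start.
function duplicateNames(providers: NamedProvider[]): string[] {
  const seen = new Set<string>();
  const repeated = new Set<string>();
  for (const provider of providers) {
    if (seen.has(provider.name)) repeated.add(provider.name);
    seen.add(provider.name);
  }
  return Array.from(repeated);
}

const clashes = duplicateNames([
  { name: "local-model-a" },
  { name: "cloud-model-b" },
  { name: "local-model-a" },
]);
if (clashes.length > 0) {
  console.log(`duplicate provider names: ${clashes.join(", ")}`);
}
```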
---
## Hardware (dev machine)
| Component | Spec |
@ -190,16 +198,17 @@ Set `Authorization: Bearer <OPENROUTER_API_KEY>` in the request headers.
---
## Provider configuration in the test script
## Provider configuration in the enrich script
The enrich test script reads a single config object. To switch providers,
change this object and re-run.
The enrich script reads a single config object. To switch providers,
change this object and re-run. The `name` field is used as the model
identifier in `pipeline.db` — it must be unique across all runs.
```typescript
// config.ts
export type ProviderConfig = {
name: string; // used for output folder naming
name: string; // used as model identifier in pipeline.db — must be unique
baseURL: string;
apiKey: string;
model: string;
@ -243,14 +252,9 @@ export const ANTHROPIC_SONNET: ProviderConfig = {
};
```
Output from each run lands in:
```
stage-3-enrich/test/output/{provider.name}/results.json
stage-3-enrich/test/output/{provider.name}/metrics.json
```
The evaluate script compares all `metrics.json` files side by side.
All output is written to `pipeline.db`. Each record is stored with the
model name as identifier so results from different providers can be
compared and compiled into votes.
---
@ -297,5 +301,6 @@ The test script measures the following per provider run:
production. If not, use the cloud model that passed.
5. **Production run**
Full 117k records. Resume-safe — the script checkpoints after each
record so overnight runs can be stopped and continued.
Full 117k records. Resume-safe — each record is written to `pipeline.db`
atomically as it is processed. Overnight runs can be stopped and
continued at any time without losing work.