# lila data pipeline

> **NOTE: BEFORE RUNNING THE PIPELINE, CONSIDER IMPROVING THE CEFR SOURCE
> FILES IN `stage-2-annotate/sources/cefr/`. BETTER SOURCE COVERAGE MEANS
> FEWER WORDS FOR THE LLM TO ANNOTATE FROM SCRATCH, FASTER OVERNIGHT RUNS,
> AND HIGHER CONFIDENCE IN THE FINAL OUTPUT. SEE UNIVERSALCEFR
> (huggingface.co/UniversalCEFR) AND CEFR-J
> (github.com/openlanguageprofiles/olp-en-cefrj) AS STARTING POINTS.**

This pipeline extracts vocabulary data from the Open Multilingual Wordnet (OMW), annotates it with CEFR levels from curated source files, verifies and enriches annotations using local LLMs, and produces authoritative output in `pipeline.db`. This database is consumed by the sync script to populate the production database with terms, translations, glosses, CEFR levels, difficulty ratings, and LLM-generated descriptions.

## Overview

```mermaid
flowchart LR
    omw[(OMW SQLite DBs)]
    cefr[(CEFR JSON files)]
    extract[Extract]
    annotate[Annotate]
    enrich[Enrich]
    pipelinedb[(pipeline.db)]
    merge[Merge]
    tiebreak[Tiebreak]
    compare[Compare]
    sync[Sync]
    db[(PostgreSQL)]

    omw --> extract
    cefr --> annotate
    extract --> annotate
    annotate --> enrich
    enrich --> pipelinedb
    pipelinedb --> merge
    merge --> pipelinedb
    pipelinedb --> tiebreak
    tiebreak --> pipelinedb
    pipelinedb --> compare
    pipelinedb --> sync
    sync --> db
```

Each stage is a standalone script that reads from the previous stage's output. Stage 1 reads the OMW SQLite database and writes JSON; stage 2 reads and writes JSON files. From stage 3 onwards, all output is written to `pipeline.db` — a SQLite database that tracks processing status, LLM output, votes, and resolved records. This makes overnight LLM runs fully resumable and protects against data loss if a run is interrupted.

Stage 1 is a manual prerequisite and is not run by the pipeline orchestrator. See **1. Extract** under **Pipeline stages** for instructions.

The enrich stage is designed to run overnight, one model at a time. Each model processes every word and writes results to `pipeline.db` atomically per record — interrupted runs resume from the last unprocessed record.

Only fully resolved records reach the production database. Records where LLMs could not reach a majority vote are handled automatically by the tiebreaker stage before seeding.

## pipeline.db

All pipeline state from stage 3 onwards is stored in `pipeline.db` — a SQLite
database in `data-pipeline/db/`. It is created automatically on first run and
is not committed to git.

The database serves three purposes:

- **Resumability** — every record is written atomically with a status. Interrupted
  overnight runs resume from the last pending record without losing work.
- **Vote tracking** — all model votes for CEFR levels and generated text are
  stored per model per record, giving full auditability of how every decision
  was reached.
- **Resolved output** — the final resolved records live here and are read by
  the sync script to seed the production database.

The schema is defined in `data-pipeline/db/schema.sql`. Never edit `pipeline.db`
directly — all writes go through the pipeline scripts.

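
The resumability pattern described above can be sketched as follows. This is a minimal illustration in Python, not the pipeline's actual code: the `records` table and its `id`, `status`, and `payload` columns are hypothetical (the real schema lives in `data-pipeline/db/schema.sql` and may differ), and the `limit` parameter exists only to simulate an interrupted run.

```python
import json
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the hypothetical `records` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " payload TEXT)"
    )
    return conn


def process_pending(conn, handler, limit=None):
    """Process pending records one at a time, committing after each one."""
    rows = conn.execute(
        "SELECT id FROM records WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    done = 0
    for (record_id,) in rows:
        if limit is not None and done >= limit:
            break  # stand-in for an interrupted overnight run
        result = handler(record_id)
        with conn:  # one transaction per record: committed or rolled back as a unit
            conn.execute(
                "UPDATE records SET status = 'complete', payload = ? WHERE id = ?",
                (json.dumps(result), record_id),
            )
        done += 1
    return done
```

Because the status is updated in the same transaction as the payload, a second call to `process_pending` only sees records that were never completed, which is what makes a restart safe.
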
## Data sources

### OMW / WordNet

The Open Multilingual Wordnet (OMW) is the base vocabulary source. It provides synsets — groups of synonymous words — with translations and glosses across multiple languages. One lexicon per language is downloaded with the `wn` Python library, which stores them together in a single SQLite database at `~/.wn_data/wn.db`. This file is not committed to git.

All four parts of speech are extracted: noun, verb, adjective, adverb. WordNet's adjective satellites are collapsed into adjective — this is a WordNet-internal distinction that has no relevance for language learning. Alongside translations and glosses, usage examples are extracted where available and stored in the database as `term_examples`.

See **Setup** for download instructions.

### CEFR source files

Per-language JSON files in `stage-2-annotate/sources/cefr/` provide the initial CEFR level annotations. These files do not cover the full vocabulary extracted from OMW — coverage varies by language. Gaps and disagreements are handled by the enrich stage.

| Language | File                                    |
| -------- | --------------------------------------- |
| English  | `stage-2-annotate/sources/cefr/en.json` |
| Italian  | `stage-2-annotate/sources/cefr/it.json` |
| Spanish  | `stage-2-annotate/sources/cefr/es.json` |
| German   | `stage-2-annotate/sources/cefr/de.json` |
| French   | `stage-2-annotate/sources/cefr/fr.json` |

These files are committed to git. For per-language coverage detail see `COVERAGE.md`.

### CEFR annotation and verification

CEFR levels are determined by a majority vote combining all available sources:

- The CEFR source file counts as one vote (if it has an entry for the word)
- Each LLM model run counts as one vote

The LLMs verify existing annotations as well as filling gaps — a source file entry does not automatically win. Majority vote across all sources determines the final level.

Words appearing in the CEFR source file multiple times with different CEFR levels are written to `conflicts.json` and excluded from `cefr_source_votes`. They are still present in `translations` and the LLMs vote on them like any other unannotated word — the conflict is resolved by majority vote.

If no majority is reached after all model runs, the word is handled automatically by the tiebreaker stage.

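
As an illustration of the voting rule, here is a minimal Python sketch. It assumes "majority" means strictly more than half of all votes cast; the document does not define the threshold, so treat that as an assumption. The function name and the vote keys other than `cefr_source` are hypothetical.

```python
from collections import Counter
from typing import Optional


def resolve_cefr(votes: dict[str, str]) -> Optional[str]:
    """Return the level backed by more than half of all votes, else None.

    `votes` maps a voter (e.g. "cefr_source", "model_1") to a CEFR level.
    The source file gets no extra weight: it is one vote among the others.
    """
    if not votes:
        return None
    level, count = Counter(votes.values()).most_common(1)[0]
    # Assumption: a strict majority is required; ties and pluralities fail.
    return level if count * 2 > len(votes) else None
```

With this rule, a source file entry of `B1` supported by one of two models resolves to `B1`, while a one-to-one split returns `None` and would be left for the tiebreaker.
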
## Setup

### OMW databases

Download the OMW SQLite database for each language using the `wn` Python
library:

```bash
python -m wn download omw-en:1.4
python -m wn download omw-it:1.4
python -m wn download omw-de:1.4
python -m wn download omw-es:1.4
python -m wn download omw-fr:1.4
```

The data is stored automatically at `~/.wn_data/wn.db` and is not committed
to git.

### LLM setup

See `llm-setup.md`.

## Pipeline stages

The pipeline runs in six stages plus a tiebreaker. Each stage is independent and can be re-run without affecting the others.

| Stage        | What it does                                                         |
| ------------ | -------------------------------------------------------------------- |
| 1. Extract   | Reads OMW SQLite database, outputs normalized JSON per language      |
| 2. Annotate  | Merges CEFR source files into extracted data, adds source file votes |
| 3. Enrich    | Runs local LLMs in two rounds — generation then voting               |
| 4. Merge     | Resolves votes, derives difficulty, splits into final and flagged    |
| 4b. Tiebreak | Runs unused models on flagged translations until majority is reached |
| 5. Compare   | Generates COVERAGE.md with detailed quality report                   |
| 6. Sync      | Upserts resolved records into production PostgreSQL                  |

### 1. Extract

Reads the OMW SQLite database (`~/.wn_data/wn.db`) and produces a single normalized JSON file containing all synsets with their translations, glosses, and usage examples across all five languages and all parts of speech. Adjective satellites are collapsed into adjective at this stage.

**Input:** `~/.wn_data/wn.db`
**Output:** `stage-1-extract/output/omw.json`

```bash
python stage-1-extract/scripts/extract.py
```

Add `--sample` to extract 100 synsets for inspection before running the full
extraction.

Each record in the output looks like this:

```json
{
  "source_id": "ili:i1",
  "pos": "adjective",
  "translations": {
    "en": ["able"],
    "it": ["abile", "intelligente", "valente", "capace"],
    "es": ["capaz"],
    "fr": ["comptable"]
  },
  "glosses": {
    "en": [
      "(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"
    ]
  },
  "examples": { "en": ["able to swim", "she was able to program her computer"] }
}
```

Note: glosses and examples are not available for all languages. French and Spanish have no glosses or examples in the current OMW database — these will be generated by the LLM in the enrich stage. Coverage detail is in `COVERAGE.md`.

> **Note:** Stage 1 is a manual prerequisite. It is not run by the pipeline
> orchestrator (`pipeline.ts`). Run it once before running the orchestrator
> for the first time, and re-run it manually if the OMW data changes.

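
The adjective-satellite collapse mentioned above comes down to a small lookup. The sketch below uses the standard WordNet part-of-speech tags (`n`, `v`, `a`, `s`, `r`); the mapping name and function are hypothetical and are not taken from `extract.py`.

```python
from typing import Optional

# WordNet POS tags mapped to the four parts of speech the pipeline keeps.
# "s" (adjective satellite) is folded into "adjective".
WORDNET_POS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective",
    "r": "adverb",
}


def normalize_pos(tag: str) -> Optional[str]:
    """Return the normalized part of speech, or None for an unknown tag."""
    return WORDNET_POS.get(tag)
```

Returning `None` for unknown tags lets the caller decide whether to skip the synset or fail loudly.
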
### 2. Annotate

Reads the combined OMW extract and merges CEFR source data into it. Each translation in each language is matched against the corresponding CEFR source file by word text and part of speech. Matched translations receive a `cefr_source` vote which carries into the enrich stage. Unmatched translations proceed without a vote.

This stage also extracts native example sentences from the CEFR source files and adds them to the record alongside OMW examples, with `source: "cefr"` to distinguish them.

Words appearing in the CEFR source file multiple times with different CEFR levels are written to `conflicts.json` and excluded from source voting. The LLMs handle these words like any other unannotated word.

**Input:** `stage-1-extract/output/omw.json` + `stage-2-annotate/sources/cefr/{lang}.json`
**Output:**

- `stage-2-annotate/output/{lang}.json` — one per language
- `stage-2-annotate/output/conflicts.json` — conflicting CEFR source entries from all languages, for reference

```bash
pnpm --filter @lila/pipeline annotate
```

Each record in the output extends the OMW record with a `votes` field and any additional examples from the CEFR source file:

```json
{
  "source_id": "ili:i1",
  "pos": "adjective",
  "translations": {
    "en": ["able"],
    "it": ["abile", "intelligente", "valente", "capace"],
    "es": ["capaz"],
    "fr": ["comptable"]
  },
  "glosses": { "en": ["having the necessary means or skill to do something"] },
  "examples": {
    "en": [
      { "text": "able to swim", "source": "omw" },
      { "text": "She was able to finish the task.", "source": "cefr" }
    ]
  },
  "votes": { "en": { "able": { "cefr_source": "B1" } } }
}
```

Words not present in the CEFR source file will have an empty `votes` object.

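
The matching and conflict handling described in this stage could look roughly like the Python sketch below. The shape of a CEFR source entry (`word`, `pos`, `cefr` keys) and the case-insensitive comparison are assumptions made for illustration; the real source files and matching rules may differ.

```python
def build_cefr_index(entries):
    """Index source entries by (word, pos) and separate out conflicts.

    `entries` is a list of dicts with hypothetical keys "word", "pos", "cefr".
    Returns (index, conflicts): index holds keys with exactly one level,
    conflicts holds keys that appear with more than one level.
    """
    levels = {}
    for entry in entries:
        key = (entry["word"].lower(), entry["pos"])  # assumption: case-insensitive
        levels.setdefault(key, set()).add(entry["cefr"])
    index = {k: next(iter(v)) for k, v in levels.items() if len(v) == 1}
    conflicts = {k: sorted(v) for k, v in levels.items() if len(v) > 1}
    return index, conflicts


def annotate_votes(record, lang, index):
    """Build the `votes` object for one language of one extracted record."""
    votes = {}
    for word in record["translations"].get(lang, []):
        level = index.get((word.lower(), record["pos"]))
        if level is not None:
            votes[word] = {"cefr_source": level}
    # Unmatched and conflicting words simply get no source vote.
    return {lang: votes} if votes else {}
```

Keeping conflicting entries out of the index means they receive no `cefr_source` vote, which matches the behaviour described above: the LLMs treat them like any other unannotated word.
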
### 3. Enrich

> **Note:** Before running this stage, ensure the llama.cpp server is running
> locally. The orchestrator checks for a running server at
> `http://127.0.0.1:8080/health` and exits with instructions if it is not
> reachable. See `llm-setup.md` for setup instructions.

The enrich stage runs in two rounds, both designed to execute overnight one
model at a time. All output is written to `pipeline.db` atomically per record
— runs are fully resumable if interrupted. Each model is run once — one model
produces one vote.

**Round 1 — generation**

Each model processes every word in every language one term at a time and
generates:

- A CEFR level vote for each translation
- A description for each language
- A translation for each language, only if OMW provides none
- A gloss for each language, only if OMW provides none
- Usage examples for each language, only if OMW provides none

OMW data is never duplicated — the script checks what OMW already provides
before building the prompt. For translations, glosses and examples, if OMW
data exists for that language the LLM skips generation entirely. This
significantly reduces compute time for languages with good OMW coverage such
as English.

All model-generated content is stored with an anonymised source (`model_1`,
`model_2` etc.) so models cannot be biased by knowing who generated what in
round 2.

Each record is written to `pipeline.db` with status `complete` or
`needs_review` immediately after processing. If a record fails structural
validation (invalid JSON, missing required fields, invalid CEFR value) it is
marked `needs_review` and skipped — the run continues without interruption.

**Input:** `stage-2-annotate/output/{lang}.json`
**Output:** `pipeline.db` — round 1 results per record per model

```bash
pnpm --filter @lila/pipeline enrich --round 1 --model {model}
```

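
The structural validation step can be sketched in a few lines of Python. The required field names (`cefr`, `description`) are hypothetical stand-ins, since the document does not specify the response shape; only the three failure modes (invalid JSON, missing fields, invalid CEFR value) and the two statuses come from the text above.

```python
import json

CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}
REQUIRED_FIELDS = ("cefr", "description")  # hypothetical field names


def validate_response(raw: str):
    """Return (status, data) for one raw LLM reply.

    status is "complete" when the reply passes all structural checks,
    otherwise "needs_review" with data set to None.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "needs_review", None  # invalid JSON
    if not isinstance(data, dict):
        return "needs_review", None  # valid JSON but not an object
    if any(field not in data for field in REQUIRED_FIELDS):
        return "needs_review", None  # missing required field
    if not isinstance(data["cefr"], str) or data["cefr"] not in CEFR_LEVELS:
        return "needs_review", None  # invalid CEFR value
    return "complete", data
```

The function never raises on bad model output, so a single malformed reply cannot stop an overnight run.
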
**Compiling candidates**

Once all round 1 runs are complete, compile all generated candidates into a
single structured record per term in `pipeline.db`. This is the input to
round 2.

```bash
pnpm --filter @lila/pipeline enrich --compile-candidates
```

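
One way to picture the compile step, together with the anonymised `model_1`, `model_2` labels, is the Python sketch below. The model names, the alias assignment by configuration order, and the merging of identical texts are all assumptions for illustration; the document does not specify how candidates are grouped.

```python
def alias_map(models):
    """Assign stable anonymised labels in configuration order."""
    return {name: f"model_{i}" for i, name in enumerate(models, start=1)}


def compile_candidates(generated, aliases):
    """Collapse per-model outputs for one field into a candidate list.

    `generated` maps a real model name to the text it produced.
    Identical texts (after trimming whitespace) are merged into one
    candidate that lists every anonymised model that proposed it.
    """
    by_text = {}
    for model, text in generated.items():
        by_text.setdefault(text.strip(), []).append(aliases[model])
    return [{"text": text, "sources": sorted(src)} for text, src in by_text.items()]
```

Only the anonymised labels leave this function, so nothing that round 2 sees reveals which model wrote which candidate.
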
**Round 2 — voting**

Each model receives the compiled candidate list for every word and votes on:

- The best gloss candidate (if multiple exist)
- The best description candidate (if multiple exist)
- The best usage examples candidate (if multiple exist)
- A CEFR level vote for each translation

OMW data is not put to a vote — it automatically wins over any LLM-generated
candidate. Round 2 only resolves conflicts between model-generated candidates.
The prompt is kept small — one word at a time, a clean numbered candidate
list — to fit within a limited context window.

**Input:** `pipeline.db` — compiled candidates
**Output:** `pipeline.db` — round 2 votes per record per model

```bash
pnpm --filter @lila/pipeline enrich --round 2 --model {model}
```

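
To make the "one word, numbered candidate list" idea concrete, here is a hedged Python sketch of building such a prompt and reading the reply. The prompt wording and both function names are invented for illustration; the actual prompts are not described in this document.

```python
from typing import Optional


def build_vote_prompt(word: str, lang: str, field: str, candidates: list[str]) -> str:
    """Build a small prompt with a numbered candidate list for one word."""
    lines = [
        f"Word: {word} ({lang})",
        f"Choose the best {field}. Reply with the number only.",
    ]
    lines += [f"{i}. {text}" for i, text in enumerate(candidates, start=1)]
    return "\n".join(lines)


def parse_vote(reply: str, n_candidates: int) -> Optional[int]:
    """Return the zero-based index of the chosen candidate, or None if invalid."""
    reply = reply.strip()
    if reply.isascii() and reply.isdigit() and 1 <= int(reply) <= n_candidates:
        return int(reply) - 1
    return None
```

Asking for a bare number keeps the reply trivial to validate, and an unparseable reply can be handled the same way as any other structural failure.
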
**Compiling votes**

Once all round 2 runs are complete, compile all votes into a final votes
record per term in `pipeline.db`. This is the input to the merge stage.

```bash
pnpm --filter @lila/pipeline enrich --compile-votes
```

### 4. Merge

Reads compiled votes from `pipeline.db` and resolves the final value for
every field. Updates each record in `pipeline.db` with status `final` or
`flagged`.

**Merge rules:**

- OMW data wins automatically and is never overridden
- For CEFR levels: the level with the most votes wins. If no majority is
  reached, that translation is flagged for the tiebreaker
- For LLM-generated text fields (gloss, examples, descriptions): the
  candidate with the most votes wins. If no majority is reached, the
  tiebreaker runs for that record as well

**Difficulty mapping:**

| CEFR   | Difficulty   |
| ------ | ------------ |
| A1, A2 | easy         |
| B1, B2 | intermediate |
| C1, C2 | hard         |

**Input:** `pipeline.db` — compiled votes
**Output:** `pipeline.db` — records updated with status `final` or `flagged`

```bash
pnpm --filter @lila/pipeline merge
```

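
The difficulty table and the `final` / `flagged` split translate directly into a lookup. The Python sketch below assumes the CEFR level has already been resolved (or left as `None` when no majority was reached); the function name and the returned dictionary shape are hypothetical.

```python
from typing import Optional

# Mirrors the difficulty mapping table above.
DIFFICULTY = {
    "A1": "easy",
    "A2": "easy",
    "B1": "intermediate",
    "B2": "intermediate",
    "C1": "hard",
    "C2": "hard",
}


def merge_translation(resolved_level: Optional[str]) -> dict:
    """Derive status and difficulty from a resolved CEFR level.

    A level of None means no majority was reached, so the translation
    is flagged for the tiebreaker and has no difficulty yet.
    """
    if resolved_level is None:
        return {"status": "flagged", "cefr": None, "difficulty": None}
    return {
        "status": "final",
        "cefr": resolved_level,
        "difficulty": DIFFICULTY[resolved_level],
    }
```
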
### 4b. Tiebreak

Runs automatically after merge if any translations remain flagged. The script
queries `pipeline.db` for flagged translations, identifies which configured
models have not yet voted on each word, and runs those models on the flagged
subset only. Merge is re-run after each tiebreaker pass. This repeats until
all flagged translations are resolved or no unused models remain.

If unused models are exhausted and flagged translations remain, the script
logs a detailed report showing the exact vote split for each unresolved word
and lists available models from OpenRouter that have not been used. Seeding
is blocked until all translations are resolved. To continue, add one or more
models to the config and re-run the pipeline — the tiebreaker will pick up
automatically.

**Input:** `pipeline.db` — flagged translations from merge
**Output:** `pipeline.db` — flagged translations resolved to `final`

> **Note:** The tiebreaker is not a standalone script. It runs automatically
> as part of the pipeline orchestrator after merge completes.

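
The tiebreaker loop for a single flagged translation might be sketched as follows. This is an illustration only: `ask` is a hypothetical callback standing in for an LLM call, the strict-majority threshold is an assumption, and the real orchestrator works on batches in `pipeline.db` rather than on one in-memory dictionary.

```python
from collections import Counter


def unused_models(configured, votes):
    """Configured models that have not yet voted on this translation."""
    return [model for model in configured if model not in votes]


def tiebreak(votes, configured, ask):
    """Add votes from unused models until a strict majority emerges.

    Returns (level, votes): level is the winning CEFR level, or None if
    every configured model has voted and the split is still unresolved.
    """
    votes = dict(votes)  # do not mutate the caller's vote record
    for model in unused_models(configured, votes):
        votes[model] = ask(model)
        level, count = Counter(votes.values()).most_common(1)[0]
        if count * 2 > len(votes):
            return level, votes
    return None, votes
```

When the function returns `None`, the returned vote dictionary holds the exact split that the unresolved-word report would need.
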
### 5. Compare / QA

Read-only. Generates `COVERAGE.md` with a full breakdown of the pipeline
output quality per language. Run this after merge to verify output before
syncing to the database.

**Input:** `pipeline.db` — records with status `final`
**Output:** `COVERAGE.md`

```bash
pnpm --filter @lila/pipeline compare
```

`COVERAGE.md` reports the following per language:

- Total synsets extracted
- Total translations per language
- POS breakdown per language — word counts for noun, verb, adjective, adverb
- CEFR coverage per language — how many translations have a resolved CEFR
  level, broken down by level (A1, A2, B1, B2, C1, C2)
- Difficulty breakdown per language — word counts for easy, intermediate, hard
- Flagged count per language — how many translations are still flagged and
  awaiting resolution
- Gloss coverage per language — total glosses, broken down by source (omw vs
  LLM-generated) and which languages have no glosses at all
- Example coverage per language — same breakdown as glosses
- Description coverage per language — how many translations have a description,
  broken down by source
- CEFR source file coverage per language — how many words from the source file
  were matched against OMW translations
- LLM model contribution — how many CEFR votes and text candidates each
  anonymised model contributed

### 6. Sync

The sync script transfers all records with status `final` in `pipeline.db` to
the production PostgreSQL database. It is upsert-based and never wipes
existing data. For each record it checks whether a matching `source_id`
already exists in the target database:

- **Missing** → insert
- **Present but changed** → update
- **Present and unchanged** → skip

Run this after all records are resolved and Compare / QA has been reviewed.

```bash
pnpm --filter @lila/pipeline sync
```

The sync script requires a connection string to the target database. Set
`DATABASE_URL` in your `.env` file before running.

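
The insert / update / skip decision can be illustrated without a database. In the Python sketch below, the target is a plain dictionary keyed by `source_id` and "changed" means the stored row differs from the pipeline record; both are simplifying assumptions, and the real script talks to PostgreSQL.

```python
def sync_action(record, existing):
    """Decide what to do with one final record given the target row, if any."""
    if existing is None:
        return "insert"  # missing in target
    if existing != record:
        return "update"  # present but changed
    return "skip"        # present and unchanged


def plan_sync(records, target):
    """Group source_ids by the action the sync would take.

    `target` maps source_id to the row currently in the target database.
    Nothing in the target is ever deleted by this plan.
    """
    plan = {"insert": [], "update": [], "skip": []}
    for record in records:
        action = sync_action(record, target.get(record["source_id"]))
        plan[action].append(record["source_id"])
    return plan
```

Computing a plan first also makes a dry run cheap: the counts per action can be printed and reviewed before anything is written.
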
## Reports

The pipeline generates a report at the end of every run. Reports are written
to `data-pipeline/reports/` as a JSON file and a markdown file with the same
name. The markdown is generated from the JSON and contains identical data.

```
data-pipeline/reports/
  2026-05-03_night-1.json
  2026-05-03_night-1.md
```

The report name is provided when starting the pipeline:

```bash
pnpm --filter @lila/pipeline run --name "night-1"
```

**Nightly report** contains:

- Records processed this run vs total
- Records remaining per stage
- Average processing speed and estimated nights remaining
- `needs_review` count — records that failed structural validation
- Per-model progress breakdown

**Final report** (generated when all records are processed) additionally contains:

- Full vote breakdown per model
- Flagged translations with exact vote splits
- Available unused models from OpenRouter for tiebreaking
- Per-model quality metrics — CEFR agreement rate, field coverage, JSON parse rate

Reports are not committed to git.

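
The file naming shown above, and the idea of deriving the markdown from the JSON, can be sketched in Python as below. The assumption that the date prefix is the run date in ISO format is inferred from the example file names; the markdown layout and the report keys are invented for illustration.

```python
from datetime import date
from pathlib import Path


def report_paths(name: str, day: date, root: str = "data-pipeline/reports"):
    """Return the (json_path, markdown_path) pair for one run."""
    stem = f"{day.isoformat()}_{name}"
    base = Path(root)
    return base / f"{stem}.json", base / f"{stem}.md"


def render_markdown(report: dict) -> str:
    """Render a flat report dict as markdown so both files carry the same data."""
    lines = [f"# Report {report['name']}", ""]
    lines += [f"- {key}: {value}" for key, value in report.items() if key != "name"]
    return "\n".join(lines)
```

Generating the markdown purely from the JSON dictionary is what keeps the two files from drifting apart.
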
## Adding a new language

1. Add the language code to `SUPPORTED_LANGUAGE_CODES` in `packages/shared/src/constants.ts`
2. Build shared: `pnpm --filter @lila/shared build`
3. Generate and run a DB migration: `pnpm --filter @lila/db generate` then `pnpm --filter @lila/db migrate`
4. Download the OMW lexicon for the language using the `wn` Python library
5. Add a CEFR source file at `stage-2-annotate/sources/cefr/{lang}.json`
6. Run the full pipeline

## Constants and constraints

These values are defined in `packages/shared/src/constants.ts` and enforced by database check constraints. The pipeline filters out any entries that violate them.

| Constant        | Values                                |
| --------------- | ------------------------------------- |
| Languages       | `en`, `it`, `de`, `es`, `fr`          |
| Parts of speech | `noun`, `verb`, `adjective`, `adverb` |
| CEFR levels     | `A1`, `A2`, `B1`, `B2`, `C1`, `C2`    |
| Difficulty      | `easy`, `intermediate`, `hard`        |

Adding a new value to any of these requires a constants update and a database migration before re-running the pipeline. See **Adding a new language** for the full steps — the same process applies for new parts of speech.

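
Filtering against these constants might look like the Python sketch below. The values are copied from the table; the function name and the choice to drop unsupported languages while rejecting unsupported parts of speech outright are assumptions, since the document only says that violating entries are filtered out.

```python
LANGUAGES = {"en", "it", "de", "es", "fr"}
PARTS_OF_SPEECH = {"noun", "verb", "adjective", "adverb"}


def filter_record(record):
    """Return a copy limited to supported languages, or None for a bad POS."""
    if record["pos"] not in PARTS_OF_SPEECH:
        return None  # the whole record violates the POS constraint
    translations = {
        lang: words
        for lang, words in record["translations"].items()
        if lang in LANGUAGES
    }
    return {**record, "translations": translations}
```

Filtering in the pipeline rather than relying on the database check constraints alone means a violation is dropped quietly instead of failing the sync.
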
## Further extensions

These are not part of the current pipeline but are worth considering as the
dataset matures:

- **Grammatical gender and articles** — Wiktionary dumps contain gender and
  article data for nouns across all supported languages. Could be extracted
  and stored as a new `translation_forms` table.
- **Conjugations** — Wiktionary also carries verb conjugation tables. Useful
  for a future grammar-focused quiz mode.
- **IPA pronunciations** — Wiktionary and Forvo are potential sources for
  phonetic transcriptions per language.
- **TTS audio files** — Generate pronunciation audio for each translation
  using a local or cloud TTS engine. Stored as static files, served alongside
  the quiz UI.
- **Images** — Associate an image with each synset to support visual
  vocabulary learning. Could be sourced from open image datasets like
  ImageNet or Wikimedia Commons.
- **Frequency data** — Word frequency rankings per language from sources like
  the Google Ngram dataset. Useful for smarter difficulty calibration beyond
  CEFR levels alone.
- **Improved CEFR source files** — See note at the top of this document.
  UniversalCEFR and CEFR-J are good starting points.
- **Additional languages** — The pipeline is language-agnostic. Adding a new
  language requires an OMW lexicon, a CEFR source file, and a constants
  update. See **Adding a new language**.

## Roadmap

**Current state:** Stages 1 and 2 are complete and output has been reviewed
for all five languages. The architecture for stages 3–6, the tiebreaker, and
the report system is finalised. Stage 3 scripts have not been written yet and
llama.cpp is not installed.

**Next action:** Write the stage 3 round 1 script.

| Stage           | Status         |
| --------------- | -------------- |
| 1. Extract      | ✅ complete    |
| 2. Annotate     | ✅ complete    |
| 3. Enrich       | 🔲 not started |
| 4. Merge        | 🔲 not started |
| 4b. Tiebreak    | 🔲 not started |
| 5. Compare / QA | 🔲 not started |
| 6. Sync         | 🔲 not started |

### Stage 1 — Extract `✅ complete`

- [x] Write extraction script
- [x] Run extraction → `stage-1-extract/output/omw.json`

### Stage 2 — Annotate `✅ complete`

- [x] Write annotation script
- [x] Run annotation → per-language JSON + `conflicts.json`

### Stage 3 — Enrich `🔲 not started`

**Next action:** Write the round 1 generation script.

- [ ] Write tests for stage 3
- [ ] Write round 1 script (generation)
- [ ] Write compile-candidates script
- [ ] Write round 2 script (voting)
- [ ] Write compile-votes script
- [ ] Install llama.cpp and verify server
- [ ] Smoke test with 5–10 records
- [ ] Run full 100-record sample, collect metrics
- [ ] Compare providers (local vs OpenRouter free models)
- [ ] Production run — all records, all models
- [ ] Compile candidates → `pipeline.db`
- [ ] Compile votes → `pipeline.db`

### Stage 4 — Merge `🔲 not started`

- [ ] Write tests for stage 4
- [ ] Write merge script
- [ ] Run merge → `pipeline.db`
- [ ] Confirm tiebreaker resolves all flagged translations

### Stage 4b — Tiebreak `🔲 not started`

- [ ] Write tests for stage 4b
- [ ] Write tiebreak logic
- [ ] Run tiebreaker for all flagged translations
- [ ] Confirm no flagged translations remain before seeding

### Stage 5 — Compare / QA `🔲 not started`

- [ ] Write tests for stage 5
- [ ] Write compare script
- [ ] Run compare → `COVERAGE.md`
- [ ] Review output quality before seeding

### Stage 6 — Sync `🔲 not started`

- [ ] Write tests for stage 6
- [ ] Write sync script
- [ ] Configure `DATABASE_URL` in `.env`
- [ ] Run sync → production PostgreSQL
- [ ] Verify seeded data in production

### Utilities

**`test/`** — Runs the pipeline against a small sample to produce human-readable output for a quick sanity check before committing to a full run. Run this after any script change before running the full pipeline.