From b5a76ee178fb22624440e5c6b397f7059e6049f5 Mon Sep 17 00:00:00 2001
From: lila
Date: Tue, 5 May 2026 18:52:10 +0200
Subject: [PATCH] =?UTF-8?q?docs:=20update=20roadmap=20=E2=80=94=20stage=20?=
 =?UTF-8?q?1=20in=20progress,=20sample=20extraction=20complete?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 documentation/data-pipeline.md | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/documentation/data-pipeline.md b/documentation/data-pipeline.md
index b40fc5c..9543d7f 100644
--- a/documentation/data-pipeline.md
+++ b/documentation/data-pipeline.md
@@ -314,9 +314,9 @@ These are not part of the current pipeline but are worth considering as the data

 ## Roadmap

-**Current state:** Data source migrated from OMW to Kaikki. Production schema and pipeline being rewritten on `feat/kaikki-vocabulary-schema`. Pipeline infrastructure (orchestrator, db init, reporting, tests) is in place and carries forward.
+**Current state:** Production schema migrated to Kaikki flat model. Stage 1 extraction scripts written and sample run complete (500 entries per language). pipeline.db initialised and imported with sample data. Stage 2 reverse link sync not yet written. llama.cpp not installed.

-**Next action:** Rewrite production schema in `packages/db`, then rewrite pipeline extraction stage for Kaikki.
+**Next action:** Write the stage 2 reverse link sync script.

 | Stage | Status |
 | --------------- | -------------- |
@@ -328,12 +328,16 @@ These are not part of the current pipeline but are worth considering as the data
 | 5. Compare / QA | 🔲 not started |
 | 6. Sync | 🔲 not started |

-### Stage 1 — Extract `🔲 not started`
+### Stage 1 — Extract `🔄 in progress`

-- [ ] Download Kaikki JSONL files for all 5 languages
-- [ ] Write extraction script
-- [ ] Write stage 1 validation tests
-- [ ] Run extraction → `pipeline.db`
+- [x] Download Kaikki JSONL files for all 5 languages
+- [x] Write extraction script
+- [x] Write stage 1 validation tests
+- [x] Write db schema, init, and import scripts
+- [x] Write db import validation tests
+- [x] Run sample extraction → `stage-1-extract/output/{lang}.json`
+- [ ] Remove sample limit and run full extraction
+- [ ] Re-run full import → `pipeline.db`

 ### Stage 2 — Reverse link sync `🔲 not started`