docs: update roadmap — stage 1 in progress, sample extraction complete

lila 2026-05-05 18:52:10 +02:00
parent ba2635e3f7
commit b5a76ee178


@@ -314,9 +314,9 @@ These are not part of the current pipeline but are worth considering as the data
## Roadmap
-**Current state:** Data source migrated from OMW to Kaikki. Production schema and pipeline being rewritten on `feat/kaikki-vocabulary-schema`. Pipeline infrastructure (orchestrator, db init, reporting, tests) is in place and carries forward.
+**Current state:** Production schema migrated to Kaikki flat model. Stage 1 extraction scripts written and sample run complete (500 entries per language). pipeline.db initialised and imported with sample data. Stage 2 reverse link sync not yet written. llama.cpp not installed.
-**Next action:** Rewrite production schema in `packages/db`, then rewrite pipeline extraction stage for Kaikki.
+**Next action:** Write the stage 2 reverse link sync script.
| Stage | Status |
| --------------- | -------------- |
@@ -328,12 +328,16 @@ These are not part of the current pipeline but are worth considering as the data
| 5. Compare / QA | 🔲 not started |
| 6. Sync | 🔲 not started |
-### Stage 1 — Extract `🔲 not started`
+### Stage 1 — Extract `🔄 in progress`
-- [ ] Download Kaikki JSONL files for all 5 languages
-- [ ] Write extraction script
-- [ ] Write stage 1 validation tests
-- [ ] Run extraction → `pipeline.db`
+- [x] Download Kaikki JSONL files for all 5 languages
+- [x] Write extraction script
+- [x] Write stage 1 validation tests
+- [x] Write db schema, init, and import scripts
+- [x] Write db import validation tests
+- [x] Run sample extraction → `stage-1-extract/output/{lang}.json`
+- [ ] Remove sample limit and run full extraction
+- [ ] Re-run full import → `pipeline.db`
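The remaining "remove sample limit" step could be as small as making the limit optional. The sketch below is illustrative only: the commit does not show the extraction script or its language, so `iter_entries`, `extract`, the `limit` parameter, and the input file path are all hypothetical. It assumes one JSON object per line in the Kaikki JSONL input and writes a JSON array to the output path.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Iterator, Optional


def iter_entries(jsonl_path: Path) -> Iterator[dict]:
    """Yield one parsed entry per non-empty line of a JSONL file."""
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def extract(jsonl_path: Path, out_path: Path, limit: Optional[int] = 500) -> int:
    """Write up to `limit` entries to out_path as a JSON array.

    limit=500 mirrors the sample run described in the roadmap;
    limit=None (hypothetical switch) would process the whole file.
    Returns the number of entries written.
    """
    # islice with stop=None consumes the entire iterator.
    entries = list(islice(iter_entries(jsonl_path), limit))
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(entries)
```

Under these assumptions, a call such as `extract(Path("en.jsonl"), Path("stage-1-extract/output/en.json"), limit=None)` would perform the full run. Note that this sketch holds all selected entries in memory before writing, which may not suit a full Kaikki dump; a streaming writer would be the safer choice there.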
### Stage 2 — Reverse link sync `🔲 not started`