feat: update pipeline orchestrator for Kaikki — wire up stages 1 and 2

- Replace checkOmwExists with checkExtractedFilesExist
- Wire up importKaikki and reverseLink as real stage implementations
- Track reverse link completion via sentinel row in run_status
- Update report to use resolved_entry_cefr and entry counts
- Stages 3 onwards remain as stubs
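
The sentinel-row idea from the list above can be sketched as follows. This is a minimal illustration in Python with the standard-library `sqlite3` module, not the orchestrator's actual code: the `run_status` columns (`stage`, `completed_at`), the `reverse_link` key, and all helper names are assumptions made for the example.

```python
import sqlite3

# Hypothetical sentinel key; the real key used in run_status is not shown in this commit.
SENTINEL_STAGE = "reverse_link"


def ensure_run_status(conn: sqlite3.Connection) -> None:
    """Create a run_status table (assumed schema) if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS run_status ("
        " stage TEXT PRIMARY KEY,"
        " completed_at TEXT NOT NULL)"
    )


def is_stage_done(conn: sqlite3.Connection, stage: str) -> bool:
    """A stage counts as done when its sentinel row is present."""
    row = conn.execute(
        "SELECT 1 FROM run_status WHERE stage = ?", (stage,)
    ).fetchone()
    return row is not None


def mark_stage_done(conn: sqlite3.Connection, stage: str) -> None:
    """Insert (or refresh) the sentinel row for a stage and commit."""
    conn.execute(
        "INSERT OR REPLACE INTO run_status (stage, completed_at)"
        " VALUES (?, datetime('now'))",
        (stage,),
    )
    conn.commit()


def run_once(conn: sqlite3.Connection, stage: str, work) -> bool:
    """Run work() only if the sentinel is absent; return True if it ran."""
    if is_stage_done(conn, stage):
        return False
    work()
    mark_stage_done(conn, stage)
    return True
```

With this shape, a second pipeline run skips the stage because the sentinel row already exists. If the stage's own writes go through the same connection, writing the sentinel in the same transaction as the stage's data would prevent a crash from leaving the two out of step.
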
lila 2026-05-05 19:04:28 +02:00
parent 6f9a42c707
commit 1c44ef989b
2 changed files with 92 additions and 41 deletions

@@ -314,9 +314,12 @@ These are not part of the current pipeline but are worth considering as the data
## Roadmap
-**Current state:** Production schema migrated to Kaikki flat model. Stage 1 extraction scripts written and sample run complete (500 entries per language). pipeline.db initialised and imported with sample data. Stage 2 reverse link sync not yet written. llama.cpp not installed.
+**Current state:** Stage 1 extraction and stage 2 reverse link sync scripts
+written and verified on sample data. pipeline.db contains 4,156 entries and
+4,287 translations across 5 languages. Stage 3 enrich scripts not yet written.
+llama.cpp not installed.
-**Next action:** Write the stage 2 reverse link sync script.
+**Next action:** Write the stage 3 enrich script.
| Stage | Status |
| --------------- | -------------- |
@@ -339,11 +342,11 @@ These are not part of the current pipeline but are worth considering as the data
- [ ] Remove sample limit and run full extraction
- [ ] Re-run full import → `pipeline.db`
-### Stage 2 — Reverse link sync `🔲 not started`
+### Stage 2 — Reverse link sync `🔄 in progress`
-- [ ] Write reverse link sync script
-- [ ] Write tests
-- [ ] Run reverse link sync `pipeline.db`
+- [x] Write reverse link sync script
+- [x] Run reverse link sync on sample data → 141 links inserted
+- [ ] Run reverse link sync on full data after full extraction
### Stage 3 — Enrich `🔲 not started`