- Translation count test now adds reverse link count to expected total
- Non-English translations test now filters to kaikki source only
- Target language test now filters to kaikki source only, because
reverse links to English are valid and expected
- Replace checkOmwExists with checkExtractedFilesExist
- Wire up importKaikki and reverseLink as real stage implementations
- Track reverse link completion via sentinel row in run_status
- Update report to use resolved_entry_cefr and entry counts
- Stages 3 onwards remain as stubs
- extract.ts now processes all 5 language files, filters non-English
entries by lang_code, and skips translation extraction for non-English
entries (no translations in source files)
- import.ts now imports all 5 language output files and uses the language
field from ExtractedSense instead of hardcoding en
- Sample limit hardcoded to 500 entries per language for development
- Add stage-1-extract/scripts/extract.ts — streams Kaikki JSONL,
filters to supported POS and languages, skips abbreviations and
senses with no translations in supported languages
- Rewrite db/import.ts for Kaikki flat model — tracks sense_index
offsets per headword+pos to handle duplicate JSONL entries
- Rewrite db/schema.sql for Kaikki model — entries, translations,
LLM vote tables, resolved tables
- Add extract and db:import scripts to package.json
- Sample mode hardcoded to 500 entries for development
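
The sense_index offset tracking mentioned above for db/import.ts could look roughly like the sketch below. The ExtractedEntry and IndexedSense shapes and the assignSenseIndexes name are hypothetical, chosen for illustration; the real import code may differ. The idea is that when the same headword+pos appears in more than one JSONL entry, numbering continues from the previous entry instead of restarting at 0.

```typescript
// Hypothetical input shape: one record per Kaikki JSONL entry.
interface ExtractedEntry {
  headword: string;
  pos: string;
  glosses: string[]; // one gloss per sense, for illustration only
}

// Hypothetical output shape: one row per sense, ready for insertion.
interface IndexedSense {
  headword: string;
  pos: string;
  senseIndex: number;
  gloss: string;
}

function assignSenseIndexes(entries: ExtractedEntry[]): IndexedSense[] {
  // Next free sense_index for each headword+pos seen so far.
  const nextIndex = new Map<string, number>();
  const rows: IndexedSense[] = [];

  for (const entry of entries) {
    const key = `${entry.headword}\u0000${entry.pos}`;
    let offset = nextIndex.get(key) ?? 0;
    for (const gloss of entry.glosses) {
      rows.push({
        headword: entry.headword,
        pos: entry.pos,
        senseIndex: offset,
        gloss,
      });
      offset += 1;
    }
    // Remember where a later duplicate entry should continue.
    nextIndex.set(key, offset);
  }
  return rows;
}
```

With this approach a second "run"/verb entry later in the file gets indexes that follow on from the first, so (headword, pos, sense_index) stays unique.
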
- Remove foreign key on run_status.source_id to support sentinel rows
for tracking one-time pipeline steps (compile_candidates, compile_votes,
merge, compare)
- Add stage-3-enrich/config.ts with all provider configurations,
ALL_PROVIDERS ordered local-first, and validateProviderKey() for
startup key checks
- Add .env.example with required API keys for OpenRouter and Anthropic
- Add pipeline:run script to package.json using --env-file .env
- Add data-pipeline/.env to the root .gitignore
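
A minimal sketch of the startup key check described above, under stated assumptions: the ProviderConfig fields, the provider names, the environment variable names (OPENROUTER_API_KEY, ANTHROPIC_API_KEY), and the validateProviderKey signature shown here are all illustrative and may not match stage-3-enrich/config.ts. The environment is passed in explicitly so the function can be exercised without touching the real process environment.

```typescript
// Hypothetical provider description; field names are assumptions.
interface ProviderConfig {
  name: string;
  local: boolean;
  apiKeyEnv?: string; // env var holding the key; absent for local providers
}

// Ordered local-first, so key-free providers are listed before remote ones.
const ALL_PROVIDERS: ProviderConfig[] = [
  { name: "local-model", local: true },
  { name: "openrouter", local: false, apiKeyEnv: "OPENROUTER_API_KEY" },
  { name: "anthropic", local: false, apiKeyEnv: "ANTHROPIC_API_KEY" },
];

// Throws if a provider needs an API key that is missing or blank.
function validateProviderKey(
  provider: ProviderConfig,
  env: Record<string, string | undefined>,
): void {
  if (!provider.apiKeyEnv) return; // nothing to check for local providers
  const value = env[provider.apiKeyEnv];
  if (!value || value.trim() === "") {
    throw new Error(
      `Missing API key ${provider.apiKeyEnv} for provider ${provider.name}`,
    );
  }
}
```

Calling this for every provider before the first stage runs makes a missing key fail at startup instead of partway through enrichment.
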
- Write annotate.ts, which matches CEFR source files against OMW translations
- Match by word text + normalized POS
- Add cefr_source vote to matched translations
- Extract native example sentences from CEFR source files
- Write one annotated JSON per language to stage-2-annotate/output/
- Write conflicts.json for words with multiple CEFR levels
- Update tsconfig to support all stage directories
- Found 2 German conflicts (macht, bleiche)
- Match counts: en 47k, fr 44k, de 26k, it 26k, es 26k
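
The matching step in annotate.ts can be sketched as below. This is an illustration, not the actual implementation: the CefrEntry and Translation shapes, the POS alias table, lowercasing of word text, and casting one vote per level for conflicting words are all assumptions.

```typescript
// Hypothetical shapes for CEFR source rows and translations.
interface CefrEntry { word: string; pos: string; level: string }
interface Vote { source: string; level: string }
interface Translation { text: string; pos: string; votes: Vote[] }
interface Conflict { key: string; levels: string[] }

// Assumed alias table; the real POS normalization may cover more tags.
const POS_ALIASES: Record<string, string> = {
  n: "noun", noun: "noun",
  v: "verb", verb: "verb",
  adj: "adjective", adjective: "adjective",
  adv: "adverb", adverb: "adverb",
};

function normalizePos(pos: string): string {
  const key = pos.trim().toLowerCase();
  return POS_ALIASES[key] ?? key;
}

// Match key: word text (lowercased here, an assumption) + normalized POS.
function matchKey(word: string, pos: string): string {
  return `${word.trim().toLowerCase()}|${normalizePos(pos)}`;
}

function annotate(
  translations: Translation[],
  cefr: CefrEntry[],
): { matched: number; conflicts: Conflict[] } {
  // Collect every CEFR level seen for each word+POS key.
  const levelsByKey = new Map<string, Set<string>>();
  for (const entry of cefr) {
    const key = matchKey(entry.word, entry.pos);
    const levels = levelsByKey.get(key) ?? new Set<string>();
    levels.add(entry.level);
    levelsByKey.set(key, levels);
  }

  // A key with more than one level is a conflict (conflicts.json material).
  const conflicts: Conflict[] = [];
  levelsByKey.forEach((levels, key) => {
    if (levels.size > 1) conflicts.push({ key, levels: Array.from(levels).sort() });
  });

  // Add a cefr_source vote to each translation that matches a key.
  let matched = 0;
  for (const t of translations) {
    const levels = levelsByKey.get(matchKey(t.text, t.pos));
    if (!levels) continue;
    matched += 1;
    levels.forEach((level) => t.votes.push({ source: "cefr_source", level }));
  }
  return { matched, conflicts };
}
```
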