Commit graph

27 commits

Author SHA1 Message Date
lila
04a581efe1 WIP: checkpoint before stage-3 sub-stage rewrite 2026-05-12 22:13:14 +02:00
lila
73fb12ac35 feat: enrich script working, redesigning to sub-stage architecture
- Enrich script functional with timeout, progress tracking, rejection mechanism
- Identified ordering issue: CEFR voting needs validated translations first
- Redesign: round1_gloss → round1_example → round1_translations → round1_cefr
- Update data-pipeline.md with new sub-stage design and roadmap
- Qwen3.5-4B confirmed working with thinking disabled
2026-05-07 13:09:43 +02:00
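The sub-stage order in the commit above (translations validated before CEFR voting) can be sketched as a fixed, ordered list. This is only an illustration: the four stage names come from the commit message, while `SubStage` and `pendingSubStages` are hypothetical names and not taken from the repository.

```typescript
// Sub-stage names as listed in the commit message, in the required order.
const ROUND1_SUB_STAGES = [
  "round1_gloss",
  "round1_example",
  "round1_translations",
  "round1_cefr",
] as const;

type SubStage = (typeof ROUND1_SUB_STAGES)[number];

// Hypothetical helper: returns the sub-stages that still have to run,
// always in pipeline order, so round1_cefr can never precede
// round1_translations.
function pendingSubStages(completed: ReadonlySet<SubStage>): SubStage[] {
  return ROUND1_SUB_STAGES.filter((stage) => !completed.has(stage));
}
```

Keeping the order in a single constant means a resumed run only has to know which sub-stages are already finished.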
lila
9642daf6dd feat: add stage 3 round 1 enrich script and wire into orchestrator 2026-05-05 19:28:38 +02:00
lila
76af2ab093 fix: update db import validation tests to account for reverse links
- Translation count test now adds reverse link count to expected total
- Non-English translations test now filters to kaikki source only
- Target language test now filters to kaikki source only — reverse links
  to English are valid and expected
2026-05-05 19:10:19 +02:00
lila
1c44ef989b feat: update pipeline orchestrator for Kaikki — wire up stages 1 and 2
- Replace checkOmwExists with checkExtractedFilesExist
- Wire up importKaikki and reverseLink as real stage implementations
- Track reverse link completion via sentinel row in run_status
- Update report to use resolved_entry_cefr and entry counts
- Stages 3 onwards remain as stubs
2026-05-05 19:04:28 +02:00
lila
6f9a42c707 feat: add stage 2 reverse link sync script 2026-05-05 18:57:55 +02:00
lila
ba2635e3f7 feat: add stage 1 and db import validation tests for Kaikki schema 2026-05-05 18:51:11 +02:00
lila
0cc643e308 feat: update extractor for all 5 languages, update import for multi-language
- extract.ts now processes all 5 language files, filters non-English
  entries by lang_code, skips translation extraction for non-English
  (no translations in source files)
- import.ts now imports all 5 language output files, uses language
  field from ExtractedSense instead of hardcoding en
- Sample limit hardcoded to 500 entries per language for development
2026-05-05 18:46:32 +02:00
lila
209d52f54b feat: add Kaikki extraction and import scripts for stage 1
- Add stage-1-extract/scripts/extract.ts — streams Kaikki JSONL,
  filters to supported POS and languages, skips abbreviations and
  senses with no translations in supported languages
- Rewrite db/import.ts for Kaikki flat model — tracks sense_index
  offsets per headword+pos to handle duplicate JSONL entries
- Rewrite db/schema.sql for Kaikki model — entries, translations,
  LLM vote tables, resolved tables
- Add extract and db:import scripts to package.json
- Sample mode hardcoded to 500 entries for development
2026-05-05 18:11:53 +02:00
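A minimal sketch of the sense_index offset tracking described in the commit above, assuming each JSONL entry carries a headword, a POS, and an ordered list of senses. The `ExtractedEntry` and `IndexedSense` shapes and the `assignSenseIndexes` name are hypothetical; the actual db/import.ts may be structured differently.

```typescript
// Hypothetical input shape: one record per JSONL entry.
interface ExtractedEntry {
  headword: string;
  pos: string;
  senses: string[]; // glosses, in source order
}

// Hypothetical output shape: one record per sense.
interface IndexedSense {
  headword: string;
  pos: string;
  senseIndex: number;
  gloss: string;
}

// Keeps a running offset per headword+pos so that a second JSONL entry
// for the same headword and POS continues numbering where the first
// one stopped instead of restarting at 0.
function assignSenseIndexes(entries: ExtractedEntry[]): IndexedSense[] {
  const nextIndex = new Map<string, number>();
  const out: IndexedSense[] = [];
  for (const entry of entries) {
    const key = `${entry.headword}\u0000${entry.pos}`;
    let offset = nextIndex.get(key) ?? 0;
    for (const gloss of entry.senses) {
      out.push({
        headword: entry.headword,
        pos: entry.pos,
        senseIndex: offset,
        gloss,
      });
      offset += 1;
    }
    nextIndex.set(key, offset);
  }
  return out;
}
```

With two "bank"/noun entries separated by a "bank"/verb entry, the noun senses would be numbered continuously across both entries, while the verb senses start from 0 on their own counter.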
lila
38d8b85228 docs: rewrite data-pipeline.md for Kaikki migration 2026-05-05 17:14:48 +02:00
lila
87aeb072c5 feat: add pipeline orchestrator skeleton with startup checks, stage runners, shutdown handler, and report generation 2026-05-03 23:01:29 +02:00
lila
080fad1998 feat: enrich stage foundation — provider config, env setup, schema fix
- Remove foreign key on run_status.source_id to support sentinel rows
  for tracking one-time pipeline steps (compile_candidates, compile_votes,
  merge, compare)
- Add stage-3-enrich/config.ts with all provider configurations,
  ALL_PROVIDERS ordered local-first, and validateProviderKey() for
  startup key checks
- Add .env.example with required API keys for OpenRouter and Anthropic
- Add pipeline:run script to package.json using --env-file .env
- Extend root .gitignore to cover data-pipeline/.env
2026-05-03 22:44:14 +02:00
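The local-first provider ordering and startup key check from the commit above might look roughly like the sketch below. Only the names `ALL_PROVIDERS` and `validateProviderKey` appear in the commit message; the `ProviderConfig` shape, the provider names, and the environment variable names are assumptions made for illustration.

```typescript
// Hypothetical config shape; the real config.ts may carry more fields.
interface ProviderConfig {
  name: string;
  local: boolean;
  apiKeyEnv?: string; // env variable holding the key (assumed names)
}

const PROVIDERS: ProviderConfig[] = [
  { name: "openrouter", local: false, apiKeyEnv: "OPENROUTER_API_KEY" },
  { name: "local-llm", local: true },
  { name: "anthropic", local: false, apiKeyEnv: "ANTHROPIC_API_KEY" },
];

// Local providers first; Array.prototype.sort is stable, so remote
// providers keep their declared relative order.
const ALL_PROVIDERS: ProviderConfig[] = [...PROVIDERS].sort(
  (a, b) => Number(b.local) - Number(a.local),
);

// Throws at startup if a provider that needs an API key has none set.
// The environment is passed in so the check stays easy to test.
function validateProviderKey(
  provider: ProviderConfig,
  env: Record<string, string | undefined>,
): void {
  if (provider.apiKeyEnv === undefined) return; // local: no key needed
  const value = env[provider.apiKeyEnv];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing ${provider.apiKeyEnv} for provider ${provider.name}`);
  }
}
```

Running the check for every entry in `ALL_PROVIDERS` before any stage starts would surface a missing key immediately rather than midway through an enrich run.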
lila
4d42fe4397 removing db from git tracking, adding it to gitignore, add db import validation tests 2026-05-03 22:16:43 +02:00
lila
f59399be02 feat: add db import script, fix duplicate translations in extract, add annotate script 2026-05-03 22:05:10 +02:00
lila
4a842140b9 feat: add stage 1 and 2 validation tests 2026-05-03 21:36:56 +02:00
lila
4fa3073412 feat: add db schema, init, and vitest config 2026-05-03 17:56:29 +02:00
lila
74cfc82bdd docs: finalise data-pipeline.md with tiebreak, pipeline.db, reports, sync 2026-05-03 17:21:02 +02:00
lila
4f59f3bc14 formatting 2026-04-28 13:18:18 +02:00
lila
849fcdad86 adding documentation for the llm setup for the data pipeline 2026-04-21 13:22:27 +02:00
lila
214a597e99 feat(pipeline): add annotate stage
- write annotate.ts — matches CEFR source files against OMW translations
- match by word text + normalized POS
- add cefr_source vote to matched translations
- extract native example sentences from CEFR source files
- write one annotated JSON per language to stage-2-annotate/output/
- write conflicts.json for words with multiple CEFR levels
- update tsconfig to support all stage directories
- 2 German conflicts found (macht, bleiche)
- match rates: en 47k, fr 44k, de 26k, it 26k, es 26k
2026-04-21 12:01:56 +02:00
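As a rough illustration of the matching rule in the commit above (word text plus normalized POS, with conflicts recorded when a word has multiple CEFR levels), here is a small sketch. The POS alias table, the lowercasing of word text, and all type and function names are assumptions; annotate.ts itself may normalize differently.

```typescript
type Cefr = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

// Hypothetical row shape for a CEFR source file.
interface CefrRow {
  word: string;
  pos: string;
  level: Cefr;
}

// Assumed POS aliases; the real normalization table is not shown in the log.
const POS_ALIASES: Record<string, string> = {
  n: "noun",
  v: "verb",
  adj: "adjective",
  adv: "adverb",
};

function normalizePos(pos: string): string {
  const p = pos.trim().toLowerCase();
  return POS_ALIASES[p] ?? p;
}

// Match key: word text + normalized POS (lowercasing is an assumption).
function matchKey(word: string, pos: string): string {
  return `${word.trim().toLowerCase()}|${normalizePos(pos)}`;
}

// Keys with exactly one level become votes; keys with several levels
// are reported as conflicts instead of being resolved silently.
function buildCefrIndex(rows: CefrRow[]): {
  votes: Map<string, Cefr>;
  conflicts: Map<string, Cefr[]>;
} {
  const levels = new Map<string, Set<Cefr>>();
  for (const row of rows) {
    const key = matchKey(row.word, row.pos);
    const set = levels.get(key) ?? new Set<Cefr>();
    set.add(row.level);
    levels.set(key, set);
  }
  const votes = new Map<string, Cefr>();
  const conflicts = new Map<string, Cefr[]>();
  for (const [key, set] of levels) {
    const [first, ...rest] = Array.from(set);
    if (first !== undefined && rest.length === 0) votes.set(key, first);
    else conflicts.set(key, Array.from(set));
  }
  return { votes, conflicts };
}
```

A translation would then receive a cefr_source vote when `matchKey(word, pos)` is present in `votes`, and the `conflicts` map corresponds to what the commit writes to conflicts.json.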
lila
9ea35568e5 updating config 2026-04-21 12:01:29 +02:00
lila
c9cddf68de feat(pipeline): add data pipeline workspace and extraction stage
- rename scripts/ to data-pipeline/, archive existing scripts
- add @lila/pipeline as pnpm workspace package
- add stage-1-extract through stage-5-compare folder structure
- update SUPPORTED_LANGUAGE_CODES (add es, de, fr)
- update SUPPORTED_POS (add adjective, adverb)
- add description field to term_glosses
- add term_examples table
- run and verify db migration
- write and verify extract.py (117,659 synsets across 5 languages)
- write PIPELINE.md
2026-04-21 09:39:36 +02:00
lila
07fe256abd documenting the pipeline to enrich the db data, reorganizing the file structure of the data pipeline 2026-04-20 18:28:10 +02:00
lila
a3d19d36f6 adding the data-pipeline to ts and pnpm workspaces 2026-04-20 09:05:27 +02:00
lila
200b14ef64 reorganising folders/files 2026-04-20 08:50:27 +02:00
lila
1f42239779 reorganising file structure 2026-04-20 07:48:44 +02:00
lila
3f125ba162 reorganising data-pipeline folder 2026-04-20 07:37:02 +02:00