lila
4d42fe4397
remove db from git tracking, add it to .gitignore, add db import validation tests
2026-05-03 22:16:43 +02:00
lila
f59399be02
feat: add db import script, fix duplicate translations in extract, add annotate script
2026-05-03 22:05:10 +02:00
lila
4a842140b9
feat: add stage 1 and 2 validation tests
2026-05-03 21:36:56 +02:00
lila
4fa3073412
feat: add db schema, init, and vitest config
2026-05-03 17:56:29 +02:00
lila
74cfc82bdd
docs: finalise data-pipeline.md with tiebreak, pipeline.db, reports, sync
2026-05-03 17:21:02 +02:00
lila
4f59f3bc14
formatting
2026-04-28 13:18:18 +02:00
lila
849fcdad86
adding documentation for the LLM setup for the data pipeline
2026-04-21 13:22:27 +02:00
lila
214a597e99
feat(pipeline): add annotate stage
- write annotate.ts — matches CEFR source files against OMW translations
- match by word text + normalized POS
- add cefr_source vote to matched translations
- extract native example sentences from CEFR source files
- write one annotated JSON per language to stage-2-annotate/output/
- write conflicts.json for words with multiple CEFR levels
- update tsconfig to support all stage directories
- 2 German conflicts found (macht, bleiche)
- match rates: en 47k, fr 44k, de 26k, it 26k, es 26k
2026-04-21 12:01:56 +02:00
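A minimal sketch of the matching step described in the annotate commit above: CEFR entries are matched to translations by word text plus normalized POS, a cefr_source vote is added on a match, and words that carry more than one CEFR level are collected as conflicts. This is not the actual annotate.ts. The types, the POS alias table, the lowercasing of word text, and the choice to skip the vote for conflicting words are all assumptions made for illustration.

```typescript
// Hypothetical shapes; the real annotate.ts types are not shown in this log.
type Pos = "noun" | "verb" | "adjective" | "adverb";

interface Translation {
  word: string;
  pos: Pos;
  votes: Record<string, string>;
}

interface CefrEntry {
  word: string;
  pos: string; // raw tag as found in a CEFR source file
  level: string; // e.g. "A1"
}

interface Conflict {
  word: string;
  pos: Pos;
  levels: string[];
}

// Assumed alias table; the real normalization rules may differ.
const POS_ALIASES = new Map<string, Pos>([
  ["n", "noun"], ["noun", "noun"],
  ["v", "verb"], ["verb", "verb"],
  ["adj", "adjective"], ["adjective", "adjective"],
  ["adv", "adverb"], ["adverb", "adverb"],
]);

function normalizePos(raw: string): Pos | undefined {
  return POS_ALIASES.get(raw.trim().toLowerCase());
}

// Assumption: word text is compared case-insensitively.
function matchKey(word: string, pos: Pos): string {
  return `${word.trim().toLowerCase()}|${pos}`;
}

function annotate(
  translations: Translation[],
  cefr: CefrEntry[],
): { annotated: Translation[]; conflicts: Conflict[] } {
  // Collect every CEFR level seen for each (word, POS) key.
  const levelsByKey = new Map<string, Set<string>>();
  for (const entry of cefr) {
    const pos = normalizePos(entry.pos);
    if (pos === undefined) continue; // unknown POS tag: cannot match
    const key = matchKey(entry.word, pos);
    const levels = levelsByKey.get(key) ?? new Set<string>();
    levels.add(entry.level);
    levelsByKey.set(key, levels);
  }

  const conflicts: Conflict[] = [];
  const seenConflicts = new Set<string>();

  const annotated = translations.map((t) => {
    const key = matchKey(t.word, t.pos);
    const levels = levelsByKey.get(key);
    if (levels === undefined) return t; // no CEFR match

    const levelList = Array.from(levels).sort();
    if (levelList.length > 1) {
      // Multiple levels: record once as a conflict, add no vote (assumption).
      if (!seenConflicts.has(key)) {
        seenConflicts.add(key);
        conflicts.push({ word: t.word, pos: t.pos, levels: levelList });
      }
      return t;
    }

    const level = levelList[0];
    if (level === undefined) return t;
    return { ...t, votes: { ...t.votes, cefr_source: level } };
  });

  return { annotated, conflicts };
}
```

Building the level index once and then mapping over the translations keeps the pass linear in the size of both inputs, and returning new objects leaves the extracted data untouched.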
lila
9ea35568e5
updating config
2026-04-21 12:01:29 +02:00
lila
c9cddf68de
feat(pipeline): add data pipeline workspace and extraction stage
- rename scripts/ to data-pipeline/, archive existing scripts
- add @lila/pipeline as pnpm workspace package
- add stage-1-extract through stage-5-compare folder structure
- update SUPPORTED_LANGUAGE_CODES (add es, de, fr)
- update SUPPORTED_POS (add adjective, adverb)
- add description field to term_glosses
- add term_examples table
- run and verify db migration
- write and verify extract.py (117,659 synsets across 5 languages)
- write PIPELINE.md
2026-04-21 09:39:36 +02:00
lila
07fe256abd
documenting the pipeline to enrich the db data, reorganizing the file structure of the data pipeline
2026-04-20 18:28:10 +02:00
lila
a3d19d36f6
adding the data-pipeline to ts and pnpm workspaces
2026-04-20 09:05:27 +02:00
lila
200b14ef64
reorganising folders/files
2026-04-20 08:50:27 +02:00
lila
1f42239779
reorganising file structure
2026-04-20 07:48:44 +02:00
lila
3f125ba162
reorganising data-pipeline folder
2026-04-20 07:37:02 +02:00