lila/data-pipeline
lila 214a597e99 feat(pipeline): add annotate stage
- write annotate.ts, which matches CEFR source files against OMW translations
- match by word text + normalized POS
- add cefr_source vote to matched translations
- extract native example sentences from CEFR source files
- write one annotated JSON per language to stage-2-annotate/output/
- write conflicts.json for words with multiple CEFR levels
- update tsconfig to support all stage directories
- 2 German conflicts found (macht, bleiche)
- match counts: en 47k, fr 44k, de 26k, it 26k, es 26k
2026-04-21 12:01:56 +02:00
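
The matching described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the contents of annotate.ts: the type names, the POS table, the lowercasing of word text, and the choice to withhold the vote when a word has conflicting levels are all assumptions made for the example. Only the `cefr_source` vote name, the word + normalized POS key, and the idea of a conflicts list come from the commit message.

```typescript
// Hypothetical shapes; the real annotate.ts may use different ones.
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

interface CefrEntry { word: string; pos: string; level: CefrLevel; example?: string }
interface Translation { word: string; pos: string; votes: Record<string, CefrLevel>; examples: string[] }
interface Conflict { word: string; pos: string; levels: CefrLevel[] }

// Assumed normalization table: common POS spellings mapped onto n/v/a/r.
const POS_MAP: Record<string, string> = {
  noun: "n", n: "n", verb: "v", v: "v",
  adjective: "a", adj: "a", a: "a", adverb: "r", adv: "r", r: "r",
};

function normalizePos(pos: string): string {
  const key = pos.trim().toLowerCase();
  return POS_MAP[key] ?? key; // unknown tags pass through unchanged
}

// Match key: word text (lowercased here, an assumption) plus normalized POS.
function matchKey(word: string, pos: string): string {
  return `${word.trim().toLowerCase()}|${normalizePos(pos)}`;
}

function annotate(
  translations: Translation[],
  cefr: CefrEntry[],
): { annotated: Translation[]; conflicts: Conflict[] } {
  // Group CEFR entries by match key so each translation is a single lookup.
  const byKey = new Map<string, CefrEntry[]>();
  cefr.forEach((entry) => {
    const key = matchKey(entry.word, entry.pos);
    const group = byKey.get(key);
    if (group) group.push(entry);
    else byKey.set(key, [entry]);
  });

  // A key whose entries carry more than one distinct level is a conflict.
  const conflicts: Conflict[] = [];
  byKey.forEach((group) => {
    const first = group[0];
    const levels = Array.from(new Set(group.map((e) => e.level)));
    if (first && levels.length > 1) {
      conflicts.push({ word: first.word, pos: normalizePos(first.pos), levels });
    }
  });

  const annotated = translations.map((t) => {
    const group = byKey.get(matchKey(t.word, t.pos));
    if (!group) return t; // no CEFR match: leave the translation untouched
    const [only, ...rest] = Array.from(new Set(group.map((e) => e.level)));
    const examples = group
      .map((e) => e.example)
      .filter((x): x is string => typeof x === "string");
    // Assumption: vote only when the level is unambiguous.
    const votes =
      only !== undefined && rest.length === 0
        ? { ...t.votes, cefr_source: only }
        : t.votes;
    return { ...t, votes, examples: [...t.examples, ...examples] };
  });

  return { annotated, conflicts };
}
```

Keying on a single string keeps the lookup linear in the number of translations. Whether a conflicting word should still receive a vote (for example, the lowest level) is a design choice the commit message does not settle.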
stage-1-extract/scripts   feat(pipeline): add data pipeline workspace and extraction stage   2026-04-21 09:39:36 +02:00
stage-2-annotate          feat(pipeline): add annotate stage                                 2026-04-21 12:01:56 +02:00
COVERAGE.md               reorganising data-pipeline folder                                  2026-04-20 07:37:02 +02:00
package.json              updating config                                                    2026-04-21 12:01:29 +02:00
tsconfig.json             updating config                                                    2026-04-21 12:01:29 +02:00