forgejo-lila/lila

Author	SHA1	Message	Date
lila	04a581efe1	WIP: checkpoint before stage-3 sub-stage rewrite	2026-05-12 22:13:14 +02:00
lila	73fb12ac35	feat: enrich script working, redesigning to sub-stage architecture - Enrich script functional with timeout, progress tracking, rejection mechanism - Identified ordering issue: CEFR voting needs validated translations first - Redesign: round1_gloss → round1_example → round1_translations → round1_cefr - Update data-pipeline.md with new sub-stage design and roadmap - Qwen3.5-4B confirmed working with thinking disabled	2026-05-07 13:09:43 +02:00
lila	9642daf6dd	feat: add stage 3 round 1 enrich script and wire into orchestrator	2026-05-05 19:28:38 +02:00
lila	76af2ab093	fix: update db import validation tests to account for reverse links - Translation count test now adds reverse link count to expected total - Non-English translations test now filters to kaikki source only - Target language test now filters to kaikki source only — reverse links to English are valid and expected	2026-05-05 19:10:19 +02:00
lila	1c44ef989b	feat: update pipeline orchestrator for Kaikki — wire up stages 1 and 2 - Replace checkOmwExists with checkExtractedFilesExist - Wire up importKaikki and reverseLink as real stage implementations - Track reverse link completion via sentinel row in run_status - Update report to use resolved_entry_cefr and entry counts - Stages 3 onwards remain as stubs	2026-05-05 19:04:28 +02:00
lila	6f9a42c707	feat: add stage 2 reverse link sync script	2026-05-05 18:57:55 +02:00
lila	ba2635e3f7	feat: add stage 1 and db import validation tests for Kaikki schema	2026-05-05 18:51:11 +02:00
lila	0cc643e308	feat: update extractor for all 5 languages, update import for multi-language - extract.ts now processes all 5 language files, filters non-English entries by lang_code, skips translation extraction for non-English (no translations in source files) - import.ts now imports all 5 language output files, uses language field from ExtractedSense instead of hardcoding en - Sample limit hardcoded to 500 entries per language for development	2026-05-05 18:46:32 +02:00
lila	209d52f54b	feat: add Kaikki extraction and import scripts for stage 1 - Add stage-1-extract/scripts/extract.ts — streams Kaikki JSONL, filters to supported POS and languages, skips abbreviations and senses with no translations in supported languages - Rewrite db/import.ts for Kaikki flat model — tracks sense_index offsets per headword+pos to handle duplicate JSONL entries - Rewrite db/schema.sql for Kaikki model — entries, translations, LLM vote tables, resolved tables - Add extract and db:import scripts to package.json - Sample mode hardcoded to 500 entries for development	2026-05-05 18:11:53 +02:00
lila	38d8b85228	docs: rewrite data-pipeline.md for Kaikki migration	2026-05-05 17:14:48 +02:00
lila	87aeb072c5	feat: add pipeline orchestrator skeleton with startup checks, stage runners, shutdown handler, and report generation	2026-05-03 23:01:29 +02:00
lila	080fad1998	feat: enrich stage foundation — provider config, env setup, schema fix - Remove foreign key on run_status.source_id to support sentinel rows for tracking one-time pipeline steps (compile_candidates, compile_votes, merge, compare) - Add stage-3-enrich/config.ts with all provider configurations, ALL_PROVIDERS ordered local-first, and validateProviderKey() for startup key checks - Add .env.example with required API keys for OpenRouter and Anthropic - Add pipeline:run script to package.json using --env-file .env - Add .env to root .gitignore coverage for data-pipeline/.env	2026-05-03 22:44:14 +02:00
lila	4d42fe4397	removing db from git tracking, adding it to gitignore, add db import validation tests	2026-05-03 22:16:43 +02:00
lila	f59399be02	feat: add db import script, fix duplicate translations in extract, add annotate script	2026-05-03 22:05:10 +02:00
lila	4a842140b9	feat: add stage 1 and 2 validation tests	2026-05-03 21:36:56 +02:00
lila	4fa3073412	feat: add db schema, init, and vitest config	2026-05-03 17:56:29 +02:00
lila	74cfc82bdd	docs: finalise data-pipeline.md with tiebreak, pipeline.db, reports, sync	2026-05-03 17:21:02 +02:00
lila	4f59f3bc14	formatting	2026-04-28 13:18:18 +02:00
lila	849fcdad86	adding documentation for the llm setup for the data pipeline	2026-04-21 13:22:27 +02:00
lila	214a597e99	feat(pipeline): add annotate stage - write annotate.ts — matches CEFR source files against OMW translations - match by word text + normalized POS - add cefr_source vote to matched translations - extract native example sentences from CEFR source files - write one annotated JSON per language to stage-2-annotate/output/ - write conflicts.json for words with multiple CEFR levels - update tsconfig to support all stage directories - 2 German conflicts found (macht, bleiche) - match rates: en 47k, fr 44k, de 26k, it 26k, es 26k	2026-04-21 12:01:56 +02:00
lila	9ea35568e5	updating config	2026-04-21 12:01:29 +02:00
lila	c9cddf68de	feat(pipeline): add data pipeline workspace and extraction stage - rename scripts/ to data-pipeline/, archive existing scripts - add @lila/pipeline as pnpm workspace package - add stage-1-extract through stage-5-compare folder structure - update SUPPORTED_LANGUAGE_CODES (add es, de, fr) - update SUPPORTED_POS (add adjective, adverb) - add description field to term_glosses - add term_examples table - run and verify db migration - write and verify extract.py (117,659 synsets across 5 languages) - write PIPELINE.md	2026-04-21 09:39:36 +02:00
lila	07fe256abd	documenting the pipeline to enrich the db data, reorganizing the file structure of the data pipeline	2026-04-20 18:28:10 +02:00
lila	a3d19d36f6	adding the data-pipeline to ts and pnpm workspaces	2026-04-20 09:05:27 +02:00
lila	200b14ef64	reorganising folders/files	2026-04-20 08:50:27 +02:00
lila	1f42239779	reorganising file structure	2026-04-20 07:48:44 +02:00
lila	3f125ba162	reorganising data-pipeline folder	2026-04-20 07:37:02 +02:00

27 commits