diff --git a/data-pipeline/COVERAGE.md b/data-pipeline/COVERAGE.md
new file mode 100644
index 0000000..e69de29
diff --git a/data-pipeline/PIPELINE.md b/data-pipeline/PIPELINE.md
new file mode 100644
index 0000000..ec0cc3b
--- /dev/null
+++ b/data-pipeline/PIPELINE.md
@@ -0,0 +1,33 @@
+# lila data pipeline
+
+One paragraph: what this is, why it exists, where it feeds into.
+
+## Overview
+ Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB
+
+## Data sources
+ ### OMW / WordNet
+ ### Per-language CEFR files
+ (table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail)
+
+## Pipeline stages
+ ### 1. Extract
+ ### 2. Annotate (CEFR)
+ ### 3. Enrich (LLM)
+ ### 4. Merge
+ ### 5. Compare / QA
+ Each: what it does, input, output, how to run.
+
+## LLM setup
+ - llama.cpp server: how to start it, what port, recommended models
+ - How the pipeline hits it
+ - Resuming interrupted runs
+
+## Supported languages
+ Table: language code, name, CEFR source file, full detail → COVERAGE.md
+
+## Adding a new language
+ Step by step.
+
+## Constants and constraints
+ POS values, CEFR levels, difficulty mapping, language codes.