# lila data pipeline

One paragraph: what this is, why it exists, where it feeds into.

## Overview

Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB

## Data sources

### OMW / WordNet

### Per-language CEFR files

(table: language, filename, approx. coverage, with a note pointing to COVERAGE.md for detail)

## Pipeline stages

For each stage: what it does, input, output, how to run.

### 1. Extract

### 2. Annotate (CEFR)

### 3. Enrich (LLM)

### 4. Merge

### 5. Compare / QA

## LLM setup

- llama.cpp server: how to start it, what port, recommended models
- How the pipeline hits it
- Resuming interrupted runs

## Supported languages

Table: language code, name, CEFR source file, full detail → COVERAGE.md

## Adding a new language

Step by step.

## Constants and constraints

POS values, CEFR levels, difficulty mapping, language codes.
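
As a starting point for this section, the constants could be written down as a small Python module with a validator, as sketched below. Only the six CEFR levels and the four WordNet parts of speech are standard. The exact strings, the names `POS_VALUES`, `DIFFICULTY_BY_CEFR`, and `validate_entry`, the three difficulty buckets, and the `pos`/`cefr` record fields are all assumptions made for illustration, not lila's actual schema. Language codes are left out because this outline does not list them yet.

```python
from typing import Final

# CEFR defines six levels, from A1 (beginner) to C2 (mastery).
CEFR_LEVELS: Final = ("A1", "A2", "B1", "B2", "C1", "C2")

# WordNet covers these four parts of speech; the exact strings the
# pipeline stores are an assumption.
POS_VALUES: Final = frozenset({"noun", "verb", "adjective", "adverb"})

# Hypothetical mapping from CEFR level to a coarse difficulty bucket.
DIFFICULTY_BY_CEFR: Final = {
    "A1": "easy",
    "A2": "easy",
    "B1": "intermediate",
    "B2": "intermediate",
    "C1": "hard",
    "C2": "hard",
}


def validate_entry(entry: dict) -> list[str]:
    """Return the constraint violations for one record (empty list if valid).

    A missing CEFR level is allowed here, on the assumption that not every
    word has CEFR coverage.
    """
    problems = []
    pos = entry.get("pos")
    if pos not in POS_VALUES:
        problems.append(f"unknown pos: {pos!r}")
    cefr = entry.get("cefr")
    if cefr is not None and cefr not in CEFR_LEVELS:
        problems.append(f"unknown cefr level: {cefr!r}")
    return problems
```

A check like this would fit naturally in the Compare / QA stage, for example by running `validate_entry` over every merged record before the JSON is handed to the TS seeder.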