diff --git a/data-pipeline/COVERAGE.md b/data-pipeline/COVERAGE.md
new file mode 100644
index 0000000..e69de29
diff --git a/data-pipeline/PIPELINE.md b/data-pipeline/PIPELINE.md
new file mode 100644
index 0000000..ec0cc3b
--- /dev/null
+++ b/data-pipeline/PIPELINE.md
@@ -0,0 +1,33 @@
+# lila data pipeline
+
+One paragraph: what this is, why it exists, where it feeds into.
+
+## Overview
+  Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB
+
+## Data sources
+  ### OMW / WordNet
+  ### Per-language CEFR files
+  Table: language, filename, approximate coverage, with a note pointing to COVERAGE.md for detail.
+
+## Pipeline stages
+  ### 1. Extract
+  ### 2. Annotate (CEFR)
+  ### 3. Enrich (LLM)
+  ### 4. Merge
+  ### 5. Compare / QA
+  For each stage: what it does, its input, its output, and how to run it.
+
+## LLM setup
+  - llama.cpp server: how to start it, what port, recommended models
+  - How the pipeline hits it
+  - Resuming interrupted runs
+
+## Supported languages
+  Table: language code, name, CEFR source file, full detail → COVERAGE.md
+
+## Adding a new language
+  Step by step.
+
+## Constants and constraints
+  POS values, CEFR levels, difficulty mapping, language codes.