reorganising data-pipeline folder

2026-04-20 07:37:02 +02:00 · 3f125ba162
commit 3f125ba162
parent cfd2927c4c
2 changed files with 33 additions and 0 deletions
--- a/data-pipeline/PIPELINE.md
+++ b/data-pipeline/PIPELINE.md
@@ -0,0 +1,33 @@
+# lila data pipeline
+
+One paragraph: what this is, why it exists, where it feeds into.
+
+## Overview
+  Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB
+
+## Data sources
+  ### OMW / WordNet
+  ### Per-language CEFR files
+    (table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail)
+
+## Pipeline stages
+  ### 1. Extract
+  ### 2. Annotate (CEFR)
+  ### 3. Enrich (LLM)
+  ### 4. Merge
+  ### 5. Compare / QA
+  Each: what it does, input, output, how to run.
+
+## LLM setup
+  - llama.cpp server: how to start it, what port, recommended models
+  - How the pipeline hits it
+  - Resuming interrupted runs
+
+## Supported languages
+  Table: language code, name, CEFR source file, full detail → COVERAGE.md
+
+## Adding a new language
+  Step by step.
+
+## Constants and constraints
+  POS values, CEFR levels, difficulty mapping, language codes.
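
The "Constants and constraints" entry that closes the outline is still a placeholder. A minimal sketch of how such constants might be expressed and enforced, in Python, is given here for orientation only. Only the six CEFR level names are standard; the POS labels, the difficulty buckets, the language codes, and the helper names `validate_entry` and `difficulty_for` are hypothetical and not taken from the repository.

```python
from __future__ import annotations

# The six CEFR levels, lowest to highest (standard CEFR scale).
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")

# Hypothetical POS labels; the real pipeline may use different values.
POS_VALUES = frozenset({"noun", "verb", "adjective", "adverb"})

# Hypothetical difficulty mapping: two CEFR levels per bucket.
DIFFICULTY_BY_CEFR = {
    "A1": "easy",
    "A2": "easy",
    "B1": "intermediate",
    "B2": "intermediate",
    "C1": "hard",
    "C2": "hard",
}

# Hypothetical language codes, used here only for illustration.
LANGUAGE_CODES = frozenset({"en", "it"})


def difficulty_for(cefr: str | None) -> str | None:
    """Map a CEFR level to a difficulty bucket; None if the level is unknown."""
    if cefr is None:
        return None
    return DIFFICULTY_BY_CEFR.get(cefr)


def validate_entry(entry: dict) -> list[str]:
    """Return a list of constraint violations for one entry (empty if valid)."""
    errors: list[str] = []
    if entry.get("pos") not in POS_VALUES:
        errors.append(f"unknown POS: {entry.get('pos')!r}")
    cefr = entry.get("cefr")
    # A missing CEFR level is allowed (not every word is annotated).
    if cefr is not None and cefr not in CEFR_LEVELS:
        errors.append(f"unknown CEFR level: {cefr!r}")
    if entry.get("lang") not in LANGUAGE_CODES:
        errors.append(f"unknown language code: {entry.get('lang')!r}")
    return errors
```

Keeping the allowed values in one place like this would let each stage (and the seeder) reject bad records early instead of discovering them at insert time. Whether a missing CEFR level should be allowed is itself an assumption of this sketch.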