reorganising data-pipeline folder

2026-04-20 07:37:02 +02:00 · 2026-04-20 07:37:02 +02:00 · 3f125ba162
commit 3f125ba162
parent cfd2927c4c
2 changed files with 33 additions and 0 deletions
--- a/data-pipeline/COVERAGE.md
+++ b/data-pipeline/COVERAGE.md
--- a/data-pipeline/PIPELINE.md
+++ b/data-pipeline/PIPELINE.md
@@ -0,0 +1,33 @@
 # lila data pipeline
 One paragraph: what this is, why it exists, where it feeds into.
 ## Overview
  Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB
 ## Data sources
  ### OMW / WordNet
  ### Per-language CEFR files
    (table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail)
 ## Pipeline stages
  ### 1. Extract
  ### 2. Annotate (CEFR)
  ### 3. Enrich (LLM)
  ### 4. Merge
  ### 5. Compare / QA
  Each: what it does, input, output, how to run.
 ## LLM setup
  - llama.cpp server: how to start it, what port, recommended models
  - How the pipeline hits it
  - Resuming interrupted runs
 ## Supported languages
  Table: language code, name, CEFR source file, full detail → COVERAGE.md
 ## Adding a new language
  Step by step.
 ## Constants and constraints
  POS values, CEFR levels, difficulty mapping, language codes.