reorganising data-pipeline folder

lila 2026-04-20 07:37:02 +02:00
parent cfd2927c4c
commit 3f125ba162
2 changed files with 33 additions and 0 deletions

data-pipeline/PIPELINE.md (new file, 33 additions)

@@ -0,0 +1,33 @@
# lila data pipeline
One paragraph: what this is, why it exists, where it feeds into.
## Overview
Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB
## Data sources
### OMW / WordNet
### Per-language CEFR files
(table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail)
## Pipeline stages
### 1. Extract
### 2. Annotate (CEFR)
### 3. Enrich (LLM)
### 4. Merge
### 5. Compare / QA
For each stage: what it does, its input, its output, and how to run it.
## LLM setup
- llama.cpp server: how to start it, what port, recommended models
- How the pipeline hits it
- Resuming interrupted runs
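
As a placeholder until this section is written, here is a minimal sketch of how a pipeline step might call a local llama.cpp server and skip already-processed entries on resume. It assumes `llama-server` is listening on its default port 8080 and uses its OpenAI-compatible `/v1/chat/completions` endpoint. The prompt text and the `buildChatRequest`, `pending`, and `enrich` helpers are hypothetical names for illustration, not the pipeline's actual code.

```typescript
// Assumption: llama-server was started locally, e.g. `llama-server -m <model>.gguf --port 8080`.
const LLAMA_URL = "http://127.0.0.1:8080/v1/chat/completions";

interface ChatRequest {
  messages: { role: "system" | "user"; content: string }[];
  temperature: number;
}

// Hypothetical prompt builder; the real enrichment prompt is not specified in this outline.
function buildChatRequest(word: string, lang: string): ChatRequest {
  return {
    messages: [
      { role: "system", content: "You annotate vocabulary entries. Reply with JSON only." },
      { role: "user", content: `Language: ${lang}\nWord: ${word}` },
    ],
    temperature: 0,
  };
}

// Resume support: keep only the entries whose id is not already in the finished set.
function pending<T extends { id: string }>(all: T[], done: Set<string>): T[] {
  return all.filter((entry) => !done.has(entry.id));
}

// Sends one entry to the server and returns the raw model reply.
async function enrich(word: string, lang: string): Promise<string> {
  const res = await fetch(LLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatRequest(word, lang)),
  });
  if (!res.ok) throw new Error(`llama-server returned ${res.status}`);
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```

A run that was interrupted could load the ids already written to its output, pass them to `pending`, and call `enrich` only for the remainder.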
## Supported languages
Table: language code, name, CEFR source file, full detail → COVERAGE.md
## Adding a new language
Step by step.
## Constants and constraints
POS values, CEFR levels, difficulty mapping, language codes.
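
One way these constants could be typed on the TypeScript side is sketched below. The six CEFR levels are the standard scale. The `Difficulty` labels and the level-to-difficulty mapping are placeholder assumptions, since this outline does not define them, and all names here are hypothetical.

```typescript
// The six standard CEFR levels, in ascending order.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
type CefrLevel = (typeof CEFR_LEVELS)[number];

// Assumption: a three-step difficulty scale; the real mapping may differ.
type Difficulty = "easy" | "intermediate" | "hard";

const DIFFICULTY_BY_CEFR: Record<CefrLevel, Difficulty> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

// Type guard for validating untrusted input such as values read from JSON.
function isCefrLevel(value: string): value is CefrLevel {
  return (CEFR_LEVELS as readonly string[]).includes(value);
}

// Looks up the (assumed) difficulty bucket for a validated level.
function difficultyFor(level: CefrLevel): Difficulty {
  return DIFFICULTY_BY_CEFR[level];
}
```

Deriving `CefrLevel` from the array keeps the runtime list and the type in sync, so adding or removing a level is a one-line change that the compiler then enforces in the mapping.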