reorganising data-pipeline folder
This commit is contained in:
parent
cfd2927c4c
commit
3f125ba162
2 changed files with 33 additions and 0 deletions
0
data-pipeline/COVERAGE.md
Normal file
0
data-pipeline/COVERAGE.md
Normal file
33
data-pipeline/PIPELINE.md
Normal file
33
data-pipeline/PIPELINE.md
Normal file
|
|
@ -0,0 +1,33 @@
|
||||||
|
# lila data pipeline
|
||||||
|
|
||||||
|
One paragraph: what this is, why it exists, where it feeds into.
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB
|
||||||
|
|
||||||
|
## Data sources
|
||||||
|
### OMW / WordNet
|
||||||
|
### Per-language CEFR files
|
||||||
|
(table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail)
|
||||||
|
|
||||||
|
## Pipeline stages
|
||||||
|
### 1. Extract
|
||||||
|
### 2. Annotate (CEFR)
|
||||||
|
### 3. Enrich (LLM)
|
||||||
|
### 4. Merge
|
||||||
|
### 5. Compare / QA
|
||||||
|
Each: what it does, input, output, how to run.
|
||||||
|
|
||||||
|
## LLM setup
|
||||||
|
- llama.cpp server: how to start it, what port, recommended models
|
||||||
|
- How the pipeline hits it
|
||||||
|
- Resuming interrupted runs
|
||||||
|
|
||||||
|
## Supported languages
|
||||||
|
Table: language code, name, CEFR source file, full detail → COVERAGE.md
|
||||||
|
|
||||||
|
## Adding a new language
|
||||||
|
Step by step.
|
||||||
|
|
||||||
|
## Constants and constraints
|
||||||
|
POS values, CEFR levels, difficulty mapping, language codes.
|
||||||
Loading…
Add table
Add a link
Reference in a new issue