From 3f125ba16209af512cb331df04061cc9093ca93c Mon Sep 17 00:00:00 2001
From: lila
Date: Mon, 20 Apr 2026 07:37:02 +0200
Subject: [PATCH] reorganising data-pipeline folder

---
 data-pipeline/COVERAGE.md |  0
 data-pipeline/PIPELINE.md | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)
 create mode 100644 data-pipeline/COVERAGE.md
 create mode 100644 data-pipeline/PIPELINE.md

diff --git a/data-pipeline/COVERAGE.md b/data-pipeline/COVERAGE.md
new file mode 100644
index 0000000..e69de29
diff --git a/data-pipeline/PIPELINE.md b/data-pipeline/PIPELINE.md
new file mode 100644
index 0000000..ec0cc3b
--- /dev/null
+++ b/data-pipeline/PIPELINE.md
@@ -0,0 +1,33 @@
+# lila data pipeline
+
+One paragraph: what this is, why it exists, where it feeds into.
+
+## Overview
+ Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB
+
+## Data sources
+ ### OMW / WordNet
+ ### Per-language CEFR files
+ (table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail)
+
+## Pipeline stages
+ ### 1. Extract
+ ### 2. Annotate (CEFR)
+ ### 3. Enrich (LLM)
+ ### 4. Merge
+ ### 5. Compare / QA
+ Each: what it does, input, output, how to run.
+
+## LLM setup
+ - llama.cpp server: how to start it, what port, recommended models
+ - How the pipeline hits it
+ - Resuming interrupted runs
+
+## Supported languages
+ Table: language code, name, CEFR source file, full detail → COVERAGE.md
+
+## Adding a new language
+ Step by step.
+
+## Constants and constraints
+ POS values, CEFR levels, difficulty mapping, language codes.