document the pipeline that enriches the db data; reorganize the file structure of the data pipeline

2026-04-20 18:28:10 +02:00 · 07fe256abd
commit 07fe256abd
parent 0ac2cef6e1
8 changed files with 469 additions and 35 deletions
--- a/data-pipeline/PIPELINE.md
+++ b/data-pipeline/PIPELINE.md
@@ -1,33 +0,0 @@
-# lila data pipeline
-
-One paragraph: what this is, why it exists, where it feeds into.
-
-## Overview
-  Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB
-
-## Data sources
-  ### OMW / WordNet
-  ### Per-language CEFR files
-    (table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail)
-
-## Pipeline stages
-  ### 1. Extract
-  ### 2. Annotate (CEFR)
-  ### 3. Enrich (LLM)
-  ### 4. Merge
-  ### 5. Compare / QA
-  Each: what it does, input, output, how to run.
-
-## LLM setup
-  - llama.cpp server: how to start it, what port, recommended models
-  - How the pipeline hits it
-  - Resuming interrupted runs
-
-## Supported languages
-  Table: language code, name, CEFR source file, full detail → COVERAGE.md
-
-## Adding a new language
-  Step by step.
-
-## Constants and constraints
-  POS values, CEFR levels, difficulty mapping, language codes.
--- a/data-pipeline/scripts/annotate.ts
+++ b/data-pipeline/scripts/annotate.ts
--- a/data-pipeline/scripts/compare.ts
+++ b/data-pipeline/scripts/compare.ts
--- a/data-pipeline/scripts/enrich.ts
+++ b/data-pipeline/scripts/enrich.ts
--- a/data-pipeline/scripts/extract.ts
+++ b/data-pipeline/scripts/extract.ts
--- a/data-pipeline/scripts/merge.ts
+++ b/data-pipeline/scripts/merge.ts