lila/data-pipeline/PIPELINE.md
2026-04-20 07:37:02 +02:00

932 B

lila data pipeline

One paragraph: what this is, why it exists, where it feeds into.

Overview

Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB

Data sources

OMW / WordNet

Per-language CEFR files

(table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail)

Pipeline stages

1. Extract

2. Annotate (CEFR)

3. Enrich (LLM)

4. Merge

5. Compare / QA

Each: what it does, input, output, how to run.

LLM setup

  • llama.cpp server: how to start it, what port, recommended models
  • How the pipeline hits it
  • Resuming interrupted runs

Supported languages

Table: language code, name, CEFR source file, full detail → COVERAGE.md

Adding a new language

Step by step.

Constants and constraints

POS values, CEFR levels, difficulty mapping, language codes.