lila/data-pipeline/PIPELINE.md

# lila data pipeline

One paragraph: what this is, why it exists, where it feeds into.

## Overview
  Flow diagram: OMW + CEFR sources → Extract → Annotate → Enrich (LLM) → Merge → JSON → TS seeder → DB

## Data sources
  ### OMW / WordNet
  ### Per-language CEFR files
    (table: language, filename, approx. coverage — with a note pointing to COVERAGE.md for detail)

## Pipeline stages
  ### 1. Extract
  ### 2. Annotate (CEFR)
  ### 3. Enrich (LLM)
  ### 4. Merge
  ### 5. Compare / QA
  Each: what it does, input, output, how to run.

## LLM setup
  - llama.cpp server: how to start it, what port, recommended models
  - How the pipeline hits it
  - Resuming interrupted runs

## Supported languages
  Table: language code, name, CEFR source file, full detail → COVERAGE.md

## Adding a new language
  Step by step.

## Constants and constraints
  POS values, CEFR levels, difficulty mapping, language codes.