formatting

2026-04-28 13:18:18 +02:00 · 4f59f3bc14
commit 4f59f3bc14
parent 2ff7d1759e
23 changed files with 994 additions and 3338 deletions
--- a/documentation/data-pipeline.md
+++ b/documentation/data-pipeline.md
@@ -55,13 +55,13 @@ See **Setup** for download instructions.

 Per-language JSON files in `sources/cefr/` provide the initial CEFR level annotations. These files do not cover the full vocabulary extracted from OMW — coverage varies by language. Gaps and disagreements are handled by the enrich stage.

-| Language | File |
-|---|---|
-| English | `sources/cefr/en.json` |
-| Italian | `sources/cefr/it.json` |
-| Spanish | `sources/cefr/es.json` |
-| German | `sources/cefr/de.json` |
-| French | `sources/cefr/fr.json` |
+| Language | File                   |
+| -------- | ---------------------- |
+| English  | `sources/cefr/en.json` |
+| Italian  | `sources/cefr/it.json` |
+| Spanish  | `sources/cefr/es.json` |
+| German   | `sources/cefr/de.json` |
+| French   | `sources/cefr/fr.json` |

 These files are committed to git. For per-language coverage detail see `COVERAGE.md`.

@@ -102,13 +102,13 @@ See `LLM-SETUP.md`.

 The pipeline runs in five stages. Each stage is independent and can be re-run without affecting the others.

-| Stage | What it does |
-|---|---|
-| 1. Extract | Reads OMW SQLite database, outputs normalized JSON per language |
+| Stage       | What it does                                                         |
+| ----------- | -------------------------------------------------------------------- |
+| 1. Extract  | Reads OMW SQLite database, outputs normalized JSON per language      |
 | 2. Annotate | Merges CEFR source files into extracted data, adds source file votes |
-| 3. Enrich | Runs local LLMs in two rounds — generation then voting |
-| 4. Merge | Resolves votes, derives difficulty, splits into final and flagged |
-| 5. Compare | Generates COVERAGE.md with detailed quality report |
+| 3. Enrich   | Runs local LLMs in two rounds — generation then voting               |
+| 4. Merge    | Resolves votes, derives difficulty, splits into final and flagged    |
+| 5. Compare  | Generates COVERAGE.md with detailed quality report                   |

 ### 1. Extract

@@ -137,11 +137,11 @@ Each record in the output looks like this:
    "fr": ["comptable"]
  },
  "glosses": {
-    "en": ["(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"]
+    "en": [
+      "(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"
+    ]
  },
-  "examples": {
-    "en": ["able to swim", "she was able to program her computer"]
-  }
+  "examples": { "en": ["able to swim", "she was able to program her computer"] }
 }
 ```

@@ -158,6 +158,7 @@ Words appearing in the CEFR source file multiple times with different CEFR level

 **Input:** `stage-1-extract/output/omw.json` + `stage-2-annotate/sources/cefr/{lang}.json`
 **Output:**
+
 - `stage-2-annotate/output/{lang}.json` — one per language
 - `stage-2-annotate/output/conflicts.json` — cross-language conflicts for review

@@ -177,20 +178,14 @@ Each record in the output extends the OMW record with a `votes` field and any ad
    "es": ["capaz"],
    "fr": ["comptable"]
  },
-  "glosses": {
-    "en": ["having the necessary means or skill to do something"]
-  },
+  "glosses": { "en": ["having the necessary means or skill to do something"] },
  "examples": {
    "en": [
      { "text": "able to swim", "source": "omw" },
      { "text": "She was able to finish the task.", "source": "cefr" }
    ]
  },
-  "votes": {
-    "en": {
-      "able": { "cefr_source": "B1" }
-    }
-  }
+  "votes": { "en": { "able": { "cefr_source": "B1" } } }
 }
 ```

@@ -297,9 +292,7 @@ Each record in the votes file looks like this:
    }
  },
  "examples": {
-    "en": [
-      { "text": "the dog barked at the stranger", "source": "omw" }
-    ],
+    "en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
    "fr": {
      "candidates": [
        { "text": "le chien a aboyé", "source": "model_1" },
@@ -311,8 +304,14 @@ Each record in the votes file looks like this:
  "descriptions": {
    "en": {
      "candidates": [
-        { "text": "a common household pet known for loyalty", "source": "model_1" },
-        { "text": "a domesticated animal and loyal companion", "source": "model_2" }
+        {
+          "text": "a common household pet known for loyalty",
+          "source": "model_1"
+        },
+        {
+          "text": "a domesticated animal and loyal companion",
+          "source": "model_2"
+        }
      ],
      "votes": { "model_1": 2, "model_2": 1 }
    }
@@ -334,14 +333,15 @@ Reads the votes file per language and resolves the final value for every field.

 **Difficulty mapping:**

-| CEFR | Difficulty |
-|---|---|
-| A1, A2 | easy |
+| CEFR   | Difficulty   |
+| ------ | ------------ |
+| A1, A2 | easy         |
 | B1, B2 | intermediate |
-| C1, C2 | hard |
+| C1, C2 | hard         |

 **Input:** `stage-3-enrich/output/votes/{lang}_votes.json`
 **Output:**
+
 - `stage-4-merge/output/final/{lang}.json` — fully resolved, ready for seeding
 - `stage-4-merge/output/flagged/{lang}.json` — CEFR majority not reached, needs manual review before seeding

@@ -360,21 +360,15 @@ Each record in `final/{lang}.json` looks like this:
      { "text": "dog", "cefr_level": "A1", "difficulty": "easy" },
      { "text": "canine", "cefr_level": "B2", "difficulty": "intermediate" }
    ],
-    "it": [
-      { "text": "cane", "cefr_level": "A1", "difficulty": "easy" }
-    ]
+    "it": [{ "text": "cane", "cefr_level": "A1", "difficulty": "easy" }]
  },
  "glosses": {
    "en": { "text": "a domesticated carnivorous mammal", "source": "omw" },
    "fr": { "text": "un mammifère carnivore domestiqué", "source": "model_1" }
  },
  "examples": {
-    "en": [
-      { "text": "the dog barked at the stranger", "source": "omw" }
-    ],
-    "fr": [
-      { "text": "le chien a aboyé", "source": "model_1" }
-    ]
+    "en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
+    "fr": [{ "text": "le chien a aboyé", "source": "model_1" }]
  },
  "descriptions": {
    "en": {
@@ -400,6 +394,7 @@ output quality per language. Run this after merge to verify output before
 seeding the database.

 **Input:**
+
 - `stage-4-merge/output/final/{lang}.json`
 - `stage-4-merge/output/flagged/{lang}.json`

@@ -436,12 +431,12 @@ pnpm --filter @lila/pipeline compare

 These values are defined in `packages/shared/src/constants.ts` and enforced by database check constraints. The pipeline filters out any entries that violate them.

-| Constant | Values |
-|---|---|
-| Languages | `en`, `it`, `de`, `es`, `fr` |
+| Constant        | Values                                |
+| --------------- | ------------------------------------- |
+| Languages       | `en`, `it`, `de`, `es`, `fr`          |
 | Parts of speech | `noun`, `verb`, `adjective`, `adverb` |
-| CEFR levels | `A1`, `A2`, `B1`, `B2`, `C1`, `C2` |
-| Difficulty | `easy`, `intermediate`, `hard` |
+| CEFR levels     | `A1`, `A2`, `B1`, `B2`, `C1`, `C2`    |
+| Difficulty      | `easy`, `intermediate`, `hard`        |

 Adding a new value to any of these requires a constants update and a database migration before re-running the pipeline. See **Adding a new language** for the full steps — the same process applies for new parts of speech.
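
For orientation, here is a minimal sketch of how the constants table and the difficulty mapping table could be expressed and used to filter entries. The actual contents of `packages/shared/src/constants.ts` are not shown in this diff, so every identifier below (`LANGUAGES`, `isOneOf`, `filterValidEntries`, and the `language` and `pos` field names) is a hypothetical stand-in. Only the values themselves come from the tables in the document.

```typescript
// Hypothetical sketch: names and record shape are assumptions, not the
// project's real constants.ts. Only the listed values come from the docs.
const LANGUAGES = ["en", "it", "de", "es", "fr"] as const;
const PARTS_OF_SPEECH = ["noun", "verb", "adjective", "adverb"] as const;
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const DIFFICULTIES = ["easy", "intermediate", "hard"] as const;

type Language = (typeof LANGUAGES)[number];
type PartOfSpeech = (typeof PARTS_OF_SPEECH)[number];
type CefrLevel = (typeof CEFR_LEVELS)[number];
type Difficulty = (typeof DIFFICULTIES)[number];

// Mirrors the "Difficulty mapping" table: A1/A2 easy, B1/B2 intermediate, C1/C2 hard.
const DIFFICULTY_BY_CEFR: Record<CefrLevel, Difficulty> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

// Type guard: true when `value` is one of the allowed string literals.
function isOneOf<T extends string>(allowed: readonly T[], value: unknown): value is T {
  return typeof value === "string" && allowed.some((a) => a === value);
}

// Assumed input shape; the real pipeline records carry more fields.
interface CandidateEntry {
  language: string;
  pos: string;
  cefr_level: string;
}

interface ValidEntry {
  language: Language;
  pos: PartOfSpeech;
  cefr_level: CefrLevel;
  difficulty: Difficulty;
}

// Drops entries that violate any constant and derives difficulty for the rest.
function filterValidEntries(entries: CandidateEntry[]): ValidEntry[] {
  const valid: ValidEntry[] = [];
  for (const e of entries) {
    if (
      isOneOf(LANGUAGES, e.language) &&
      isOneOf(PARTS_OF_SPEECH, e.pos) &&
      isOneOf(CEFR_LEVELS, e.cefr_level)
    ) {
      valid.push({
        language: e.language,
        pos: e.pos,
        cefr_level: e.cefr_level,
        difficulty: DIFFICULTY_BY_CEFR[e.cefr_level],
      });
    }
  }
  return valid;
}
```

Deriving the union types from `as const` arrays keeps the runtime check and the compile-time type in one place, so adding a value only requires touching the array (plus the database migration the document describes).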