formatting

lila 2026-04-28 13:18:18 +02:00
parent 2ff7d1759e
commit 4f59f3bc14
23 changed files with 994 additions and 3338 deletions

@@ -55,13 +55,13 @@ See **Setup** for download instructions.
Per-language JSON files in `sources/cefr/` provide the initial CEFR level annotations. These files do not cover the full vocabulary extracted from OMW — coverage varies by language. Gaps and disagreements are handled by the enrich stage.
| Language | File |
|---|---|
| English | `sources/cefr/en.json` |
| Italian | `sources/cefr/it.json` |
| Spanish | `sources/cefr/es.json` |
| German | `sources/cefr/de.json` |
| French | `sources/cefr/fr.json` |
| Language | File |
| -------- | ---------------------- |
| English | `sources/cefr/en.json` |
| Italian | `sources/cefr/it.json` |
| Spanish | `sources/cefr/es.json` |
| German | `sources/cefr/de.json` |
| French | `sources/cefr/fr.json` |
These files are committed to git. For per-language coverage detail see `COVERAGE.md`.
@@ -102,13 +102,13 @@ See `LLM-SETUP.md`.
The pipeline runs in five stages. Each stage is independent and can be re-run without affecting the others.
| Stage | What it does |
|---|---|
| 1. Extract | Reads OMW SQLite database, outputs normalized JSON per language |
| Stage | What it does |
| ----------- | -------------------------------------------------------------------- |
| 1. Extract | Reads OMW SQLite database, outputs normalized JSON per language |
| 2. Annotate | Merges CEFR source files into extracted data, adds source file votes |
| 3. Enrich | Runs local LLMs in two rounds — generation then voting |
| 4. Merge | Resolves votes, derives difficulty, splits into final and flagged |
| 5. Compare | Generates COVERAGE.md with detailed quality report |
| 3. Enrich | Runs local LLMs in two rounds — generation then voting |
| 4. Merge | Resolves votes, derives difficulty, splits into final and flagged |
| 5. Compare | Generates COVERAGE.md with detailed quality report |
### 1. Extract
@@ -137,11 +137,11 @@ Each record in the output looks like this:
"fr": ["comptable"]
},
"glosses": {
"en": ["(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"]
"en": [
"(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"
]
},
"examples": {
"en": ["able to swim", "she was able to program her computer"]
}
"examples": { "en": ["able to swim", "she was able to program her computer"] }
}
```
@@ -158,6 +158,7 @@ Words appearing in the CEFR source file multiple times with different CEFR level
**Input:** `stage-1-extract/output/omw.json` + `stage-2-annotate/sources/cefr/{lang}.json`
**Output:**
- `stage-2-annotate/output/{lang}.json` — one per language
- `stage-2-annotate/output/conflicts.json` — cross-language conflicts for review
@@ -177,20 +178,14 @@ Each record in the output extends the OMW record with a `votes` field and any ad
"es": ["capaz"],
"fr": ["comptable"]
},
"glosses": {
"en": ["having the necessary means or skill to do something"]
},
"glosses": { "en": ["having the necessary means or skill to do something"] },
"examples": {
"en": [
{ "text": "able to swim", "source": "omw" },
{ "text": "She was able to finish the task.", "source": "cefr" }
]
},
"votes": {
"en": {
"able": { "cefr_source": "B1" }
}
}
"votes": { "en": { "able": { "cefr_source": "B1" } } }
}
```
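The annotated record above tags each example with its origin and stores the CEFR source level as a vote. A minimal TypeScript sketch of those two steps follows. The type and function names (`SourcedExample`, `Votes`, `tagExamples`, `addSourceVote`) are hypothetical and not taken from the pipeline code; only the field shapes (`text`, `source`, `cefr_source`) come from the record shown above.

```typescript
// Hypothetical types mirroring the field shapes in the stage 2 record.
type SourcedExample = { text: string; source: string };
// lang -> word -> voter -> CEFR level, e.g. { en: { able: { cefr_source: "B1" } } }
type Votes = Record<string, Record<string, Record<string, string>>>;

// Tag plain example strings with where they came from ("omw" or "cefr").
function tagExamples(omw: string[], cefr: string[]): SourcedExample[] {
  return [
    ...omw.map((text) => ({ text, source: "omw" })),
    ...cefr.map((text) => ({ text, source: "cefr" })),
  ];
}

// Return a copy of `votes` with the CEFR source file's level recorded for one word.
function addSourceVote(votes: Votes, lang: string, word: string, level: string): Votes {
  const langVotes = votes[lang] ?? {};
  const wordVotes = langVotes[word] ?? {};
  return {
    ...votes,
    [lang]: { ...langVotes, [word]: { ...wordVotes, cefr_source: level } },
  };
}
```

Returning a new object instead of mutating the input keeps the sketch easy to test; the real stage may well update records in place.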
@@ -297,9 +292,7 @@ Each record in the votes file looks like this:
}
},
"examples": {
"en": [
{ "text": "the dog barked at the stranger", "source": "omw" }
],
"en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
"fr": {
"candidates": [
{ "text": "le chien a aboyé", "source": "model_1" },
@@ -311,8 +304,14 @@ Each record in the votes file looks like this:
"descriptions": {
"en": {
"candidates": [
{ "text": "a common household pet known for loyalty", "source": "model_1" },
{ "text": "a domesticated animal and loyal companion", "source": "model_2" }
{
"text": "a common household pet known for loyalty",
"source": "model_1"
},
{
"text": "a domesticated animal and loyal companion",
"source": "model_2"
}
],
"votes": { "model_1": 2, "model_2": 1 }
}
@@ -334,14 +333,15 @@ Reads the votes file per language and resolves the final value for every field.
**Difficulty mapping:**
| CEFR | Difficulty |
|---|---|
| A1, A2 | easy |
| CEFR | Difficulty |
| ------ | ------------ |
| A1, A2 | easy |
| B1, B2 | intermediate |
| C1, C2 | hard |
| C1, C2 | hard |
**Input:** `stage-3-enrich/output/votes/{lang}_votes.json`
**Output:**
- `stage-4-merge/output/final/{lang}.json` — fully resolved, ready for seeding
- `stage-4-merge/output/flagged/{lang}.json` — CEFR majority not reached, needs manual review before seeding
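The difficulty mapping and the final/flagged split can be sketched in TypeScript as below. This is an illustration under stated assumptions, not the pipeline's actual code: the names `DIFFICULTY_BY_CEFR`, `resolveCefr`, and `resolveEntry` are hypothetical, and "majority" is assumed to mean a strict majority (more than half of the votes), which the document does not specify.

```typescript
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
type Difficulty = "easy" | "intermediate" | "hard";

const CEFR_LEVELS: CefrLevel[] = ["A1", "A2", "B1", "B2", "C1", "C2"];

// Mapping taken from the difficulty table above.
const DIFFICULTY_BY_CEFR: Record<CefrLevel, Difficulty> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

// Assumption: a level wins only with a strict majority of the votes.
function resolveCefr(votes: CefrLevel[]): CefrLevel | null {
  for (const level of CEFR_LEVELS) {
    const count = votes.filter((v) => v === level).length;
    if (count * 2 > votes.length) return level;
  }
  return null;
}

type Resolved =
  | { bucket: "final"; text: string; cefr_level: CefrLevel; difficulty: Difficulty }
  | { bucket: "flagged"; text: string; votes: CefrLevel[] };

// Route an entry to final/ when a majority exists, otherwise to flagged/.
function resolveEntry(text: string, votes: CefrLevel[]): Resolved {
  const level = resolveCefr(votes);
  if (level === null) return { bucket: "flagged", text, votes };
  return { bucket: "final", text, cefr_level: level, difficulty: DIFFICULTY_BY_CEFR[level] };
}
```

For example, votes of `A1`, `A1`, `B1` resolve to `A1` and difficulty `easy`, while a two-way tie such as `A1`, `B1` has no majority and would be flagged for manual review.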
@@ -360,21 +360,15 @@ Each record in `final/{lang}.json` looks like this:
{ "text": "dog", "cefr_level": "A1", "difficulty": "easy" },
{ "text": "canine", "cefr_level": "B2", "difficulty": "intermediate" }
],
"it": [
{ "text": "cane", "cefr_level": "A1", "difficulty": "easy" }
]
"it": [{ "text": "cane", "cefr_level": "A1", "difficulty": "easy" }]
},
"glosses": {
"en": { "text": "a domesticated carnivorous mammal", "source": "omw" },
"fr": { "text": "un mammifère carnivore domestiqué", "source": "model_1" }
},
"examples": {
"en": [
{ "text": "the dog barked at the stranger", "source": "omw" }
],
"fr": [
{ "text": "le chien a aboyé", "source": "model_1" }
]
"en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
"fr": [{ "text": "le chien a aboyé", "source": "model_1" }]
},
"descriptions": {
"en": {
@@ -400,6 +394,7 @@ output quality per language. Run this after merge to verify output before
seeding the database.
**Input:**
- `stage-4-merge/output/final/{lang}.json`
- `stage-4-merge/output/flagged/{lang}.json`
@@ -436,12 +431,12 @@ pnpm --filter @lila/pipeline compare
These values are defined in `packages/shared/src/constants.ts` and enforced by database check constraints. The pipeline filters out any entries that violate them.
| Constant | Values |
|---|---|
| Languages | `en`, `it`, `de`, `es`, `fr` |
| Constant | Values |
| --------------- | ------------------------------------- |
| Languages | `en`, `it`, `de`, `es`, `fr` |
| Parts of speech | `noun`, `verb`, `adjective`, `adverb` |
| CEFR levels | `A1`, `A2`, `B1`, `B2`, `C1`, `C2` |
| Difficulty | `easy`, `intermediate`, `hard` |
| CEFR levels | `A1`, `A2`, `B1`, `B2`, `C1`, `C2` |
| Difficulty | `easy`, `intermediate`, `hard` |
Adding a new value to any of these requires a constants update and a database migration before re-running the pipeline. See **Adding a new language** for the full steps — the same process applies for new parts of speech.
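As a rough illustration of how the pipeline might filter out entries that violate these constants, here is a TypeScript sketch. The constant names, the `Entry` shape, and `isValidEntry` are assumptions made for this example; the actual exports of `packages/shared/src/constants.ts` are not shown in this document. Only the allowed values come from the table above.

```typescript
// Allowed values from the constants table; the identifier names are hypothetical.
const LANGUAGES: readonly string[] = ["en", "it", "de", "es", "fr"];
const PARTS_OF_SPEECH: readonly string[] = ["noun", "verb", "adjective", "adverb"];
const CEFR_LEVELS: readonly string[] = ["A1", "A2", "B1", "B2", "C1", "C2"];
const DIFFICULTIES: readonly string[] = ["easy", "intermediate", "hard"];

// Hypothetical flattened entry; `lang` and `pos` are assumed field names.
type Entry = { lang: string; pos: string; cefr_level: string; difficulty: string };

// True only when every constrained field holds an allowed value.
function isValidEntry(entry: Entry): boolean {
  return (
    LANGUAGES.indexOf(entry.lang) !== -1 &&
    PARTS_OF_SPEECH.indexOf(entry.pos) !== -1 &&
    CEFR_LEVELS.indexOf(entry.cefr_level) !== -1 &&
    DIFFICULTIES.indexOf(entry.difficulty) !== -1
  );
}

// Drop entries that the database check constraints would reject anyway.
function filterValid(entries: Entry[]): Entry[] {
  return entries.filter(isValidEntry);
}
```

Filtering before seeding means an unsupported value (for instance a language code not yet added to the constants) is dropped in the pipeline instead of failing at insert time.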