formatting
parent 2ff7d1759e
commit 4f59f3bc14
23 changed files with 994 additions and 3338 deletions
@@ -55,13 +55,13 @@ See **Setup** for download instructions.
 
 Per-language JSON files in `sources/cefr/` provide the initial CEFR level annotations. These files do not cover the full vocabulary extracted from OMW — coverage varies by language. Gaps and disagreements are handled by the enrich stage.
 
-| Language | File |
-|---|---|
-| English | `sources/cefr/en.json` |
-| Italian | `sources/cefr/it.json` |
-| Spanish | `sources/cefr/es.json` |
-| German | `sources/cefr/de.json` |
-| French | `sources/cefr/fr.json` |
+| Language | File                   |
+| -------- | ---------------------- |
+| English  | `sources/cefr/en.json` |
+| Italian  | `sources/cefr/it.json` |
+| Spanish  | `sources/cefr/es.json` |
+| German   | `sources/cefr/de.json` |
+| French   | `sources/cefr/fr.json` |
 
 These files are committed to git. For per-language coverage detail see `COVERAGE.md`.
 
@@ -102,13 +102,13 @@ See `LLM-SETUP.md`.
 
 The pipeline runs in five stages. Each stage is independent and can be re-run without affecting the others.
 
-| Stage | What it does |
-|---|---|
-| 1. Extract | Reads OMW SQLite database, outputs normalized JSON per language |
+| Stage       | What it does                                                         |
+| ----------- | -------------------------------------------------------------------- |
+| 1. Extract  | Reads OMW SQLite database, outputs normalized JSON per language      |
 | 2. Annotate | Merges CEFR source files into extracted data, adds source file votes |
-| 3. Enrich | Runs local LLMs in two rounds — generation then voting |
-| 4. Merge | Resolves votes, derives difficulty, splits into final and flagged |
-| 5. Compare | Generates COVERAGE.md with detailed quality report |
+| 3. Enrich   | Runs local LLMs in two rounds — generation then voting               |
+| 4. Merge    | Resolves votes, derives difficulty, splits into final and flagged    |
+| 5. Compare  | Generates COVERAGE.md with detailed quality report                   |
 
 ### 1. Extract
 
@@ -137,11 +137,11 @@ Each record in the output looks like this:
     "fr": ["comptable"]
   },
   "glosses": {
-    "en": ["(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"]
+    "en": [
+      "(usually followed by 'to') having the necessary means or skill or know-how or authority to do something"
+    ]
   },
-  "examples": {
-    "en": ["able to swim", "she was able to program her computer"]
-  }
+  "examples": { "en": ["able to swim", "she was able to program her computer"] }
 }
 ```
 
@@ -158,6 +158,7 @@ Words appearing in the CEFR source file multiple times with different CEFR level
 
 **Input:** `stage-1-extract/output/omw.json` + `stage-2-annotate/sources/cefr/{lang}.json`
 **Output:**
+
 - `stage-2-annotate/output/{lang}.json` — one per language
 - `stage-2-annotate/output/conflicts.json` — cross-language conflicts for review
 
@@ -177,20 +178,14 @@ Each record in the output extends the OMW record with a `votes` field and any ad
     "es": ["capaz"],
     "fr": ["comptable"]
   },
-  "glosses": {
-    "en": ["having the necessary means or skill to do something"]
-  },
+  "glosses": { "en": ["having the necessary means or skill to do something"] },
   "examples": {
     "en": [
       { "text": "able to swim", "source": "omw" },
       { "text": "She was able to finish the task.", "source": "cefr" }
     ]
   },
-  "votes": {
-    "en": {
-      "able": { "cefr_source": "B1" }
-    }
-  }
+  "votes": { "en": { "able": { "cefr_source": "B1" } } }
 }
 ```
 
@@ -297,9 +292,7 @@ Each record in the votes file looks like this:
     }
   },
   "examples": {
-    "en": [
-      { "text": "the dog barked at the stranger", "source": "omw" }
-    ],
+    "en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
     "fr": {
       "candidates": [
         { "text": "le chien a aboyé", "source": "model_1" },
@@ -311,8 +304,14 @@ Each record in the votes file looks like this:
   "descriptions": {
     "en": {
       "candidates": [
-        { "text": "a common household pet known for loyalty", "source": "model_1" },
-        { "text": "a domesticated animal and loyal companion", "source": "model_2" }
+        {
+          "text": "a common household pet known for loyalty",
+          "source": "model_1"
+        },
+        {
+          "text": "a domesticated animal and loyal companion",
+          "source": "model_2"
+        }
       ],
       "votes": { "model_1": 2, "model_2": 1 }
     }
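Note on the hunk above: the votes record pairs a `candidates` list with a `votes` tally keyed by source. A minimal sketch of how a merge step might pick the winning candidate follows. The type names and `pickWinner` are hypothetical, and the tie behaviour (return nothing) is an assumption; the README excerpt does not say how ties are handled.

```typescript
// Hypothetical types; field names mirror the votes-file sample in the hunk above.
interface Candidate {
  text: string;
  source: string;
}

interface CandidateSet {
  candidates: Candidate[];
  votes: Record<string, number>;
}

// Returns the candidate whose source received the most votes.
// Assumption: a tie for first place (or an empty list) yields undefined.
function pickWinner(set: CandidateSet): Candidate | undefined {
  let best: Candidate | undefined;
  let bestVotes = -1;
  let tied = false;
  for (const candidate of set.candidates) {
    const count = set.votes[candidate.source] ?? 0;
    if (count > bestVotes) {
      best = candidate;
      bestVotes = count;
      tied = false;
    } else if (count === bestVotes) {
      tied = true;
    }
  }
  return tied ? undefined : best;
}

// Data copied from the sample record.
const descriptions: CandidateSet = {
  candidates: [
    { text: "a common household pet known for loyalty", source: "model_1" },
    { text: "a domesticated animal and loyal companion", source: "model_2" },
  ],
  votes: { model_1: 2, model_2: 1 },
};

const winner = pickWinner(descriptions);
console.log(winner?.text); // prints "a common household pet known for loyalty"
```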
@@ -334,14 +333,15 @@ Reads the votes file per language and resolves the final value for every field.
 
 **Difficulty mapping:**
 
-| CEFR | Difficulty |
-|---|---|
-| A1, A2 | easy |
+| CEFR   | Difficulty   |
+| ------ | ------------ |
+| A1, A2 | easy         |
 | B1, B2 | intermediate |
-| C1, C2 | hard |
+| C1, C2 | hard         |
 
 **Input:** `stage-3-enrich/output/votes/{lang}_votes.json`
 **Output:**
+
 - `stage-4-merge/output/final/{lang}.json` — fully resolved, ready for seeding
 - `stage-4-merge/output/flagged/{lang}.json` — CEFR majority not reached, needs manual review before seeding
 
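Note on the hunk above: the difficulty table and the final/flagged split can be sketched in a few lines. The mapping is taken directly from the table. Everything else is an assumption made for illustration: the names `resolveCefr` and `resolveEntry` are hypothetical, and "majority" is read here as a strict majority (more than half of the votes), which the README excerpt does not define.

```typescript
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
type Difficulty = "easy" | "intermediate" | "hard";

// Taken from the difficulty mapping table.
const DIFFICULTY_BY_CEFR: Record<CefrLevel, Difficulty> = {
  A1: "easy",
  A2: "easy",
  B1: "intermediate",
  B2: "intermediate",
  C1: "hard",
  C2: "hard",
};

const CEFR_LEVELS: CefrLevel[] = ["A1", "A2", "B1", "B2", "C1", "C2"];

// Assumption: a level wins only with more than half of all votes.
function resolveCefr(votes: CefrLevel[]): CefrLevel | undefined {
  for (const level of CEFR_LEVELS) {
    const count = votes.filter((vote) => vote === level).length;
    if (count * 2 > votes.length) return level;
  }
  return undefined;
}

type Resolution =
  | { status: "final"; cefr_level: CefrLevel; difficulty: Difficulty }
  | { status: "flagged" };

// Hypothetical helper: entries without a majority go to the flagged output.
function resolveEntry(votes: CefrLevel[]): Resolution {
  const level = resolveCefr(votes);
  if (level === undefined) return { status: "flagged" };
  return { status: "final", cefr_level: level, difficulty: DIFFICULTY_BY_CEFR[level] };
}

console.log(resolveEntry(["A1", "A1", "B1"])); // final, A1, easy
console.log(resolveEntry(["A1", "B1"])); // flagged
```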
@@ -360,21 +360,15 @@ Each record in `final/{lang}.json` looks like this:
       { "text": "dog", "cefr_level": "A1", "difficulty": "easy" },
       { "text": "canine", "cefr_level": "B2", "difficulty": "intermediate" }
     ],
-    "it": [
-      { "text": "cane", "cefr_level": "A1", "difficulty": "easy" }
-    ]
+    "it": [{ "text": "cane", "cefr_level": "A1", "difficulty": "easy" }]
   },
   "glosses": {
     "en": { "text": "a domesticated carnivorous mammal", "source": "omw" },
     "fr": { "text": "un mammifère carnivore domestiqué", "source": "model_1" }
   },
   "examples": {
-    "en": [
-      { "text": "the dog barked at the stranger", "source": "omw" }
-    ],
-    "fr": [
-      { "text": "le chien a aboyé", "source": "model_1" }
-    ]
+    "en": [{ "text": "the dog barked at the stranger", "source": "omw" }],
+    "fr": [{ "text": "le chien a aboyé", "source": "model_1" }]
   },
   "descriptions": {
     "en": {
@@ -400,6 +394,7 @@ output quality per language. Run this after merge to verify output before
 seeding the database.
 
 **Input:**
+
 - `stage-4-merge/output/final/{lang}.json`
 - `stage-4-merge/output/flagged/{lang}.json`
 
@@ -436,12 +431,12 @@ pnpm --filter @lila/pipeline compare
 
 These values are defined in `packages/shared/src/constants.ts` and enforced by database check constraints. The pipeline filters out any entries that violate them.
 
-| Constant | Values |
-|---|---|
-| Languages | `en`, `it`, `de`, `es`, `fr` |
+| Constant        | Values                                |
+| --------------- | ------------------------------------- |
+| Languages       | `en`, `it`, `de`, `es`, `fr`          |
 | Parts of speech | `noun`, `verb`, `adjective`, `adverb` |
-| CEFR levels | `A1`, `A2`, `B1`, `B2`, `C1`, `C2` |
-| Difficulty | `easy`, `intermediate`, `hard` |
+| CEFR levels     | `A1`, `A2`, `B1`, `B2`, `C1`, `C2`    |
+| Difficulty      | `easy`, `intermediate`, `hard`        |
 
 Adding a new value to any of these requires a constants update and a database migration before re-running the pipeline. See **Adding a new language** for the full steps — the same process applies for new parts of speech.
 
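Note on the hunk above: the filtering rule it describes could look roughly like the sketch below. The allowed values are copied from the table. The `ALLOWED` object, the `Entry` shape, and the field names `language` and `pos` are hypothetical stand-ins, since the actual exports of `packages/shared/src/constants.ts` are not shown in this diff.

```typescript
// Values copied from the constants table; the object itself is a stand-in
// for whatever packages/shared/src/constants.ts actually exports.
const ALLOWED = {
  language: ["en", "it", "de", "es", "fr"],
  pos: ["noun", "verb", "adjective", "adverb"],
  cefr_level: ["A1", "A2", "B1", "B2", "C1", "C2"],
  difficulty: ["easy", "intermediate", "hard"],
};

// Hypothetical flat entry shape, for illustration only.
interface Entry {
  text: string;
  language: string;
  pos: string;
  cefr_level: string;
  difficulty: string;
}

// True only when every constrained field holds an allowed value.
function isAllowed(entry: Entry): boolean {
  return (
    ALLOWED.language.indexOf(entry.language) !== -1 &&
    ALLOWED.pos.indexOf(entry.pos) !== -1 &&
    ALLOWED.cefr_level.indexOf(entry.cefr_level) !== -1 &&
    ALLOWED.difficulty.indexOf(entry.difficulty) !== -1
  );
}

const entries: Entry[] = [
  { text: "dog", language: "en", pos: "noun", cefr_level: "A1", difficulty: "easy" },
  // "pt" is not in the language list, so this entry is dropped.
  { text: "cão", language: "pt", pos: "noun", cefr_level: "A1", difficulty: "easy" },
  // "pronoun" is not an allowed part of speech, so this entry is dropped too.
  { text: "it", language: "en", pos: "pronoun", cefr_level: "A1", difficulty: "easy" },
];

const kept = entries.filter(isAllowed);
console.log(kept.map((entry) => entry.text)); // prints [ 'dog' ]
```

Keeping the check in one predicate means a new language or part of speech only needs a change to the allowed lists, alongside the database migration the README calls for.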