updating documentation

parent 570dbff25e
commit 60cf48ef97
3 changed files with 243 additions and 31 deletions

documentation/cefr_enrichment.md — new file, 216 lines

@@ -0,0 +1,216 @@
# Phase 4 — CEFR Enrichment Pipeline

## Context

This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer
uses Drizzle ORM with Postgres. The project is called Glossa.

**Read `decisions.md` and `schema.ts` before doing anything.** They contain the full
reasoning behind every decision. Do not deviate from established patterns without
flagging it explicitly.

---

## Current State

The database is fully populated with OMW data:

- 95,882 terms (nouns and verbs)
- 225,997 translations (English and Italian)
- `cefr_level` is null on every translation row — this phase populates it

---

## Goal

Build a pipeline that:

1. Normalizes CEFR word lists from multiple sources into a common JSON format
2. Compares sources to surface agreements and conflicts
3. Merges sources into a single authoritative JSON per language
4. Enriches the `translations` table with `cefr_level` values

All scripts live in `packages/db/src/cefr/`.

---

## Normalized JSON Format

Every source extraction script outputs a JSON file in this exact shape:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
  { "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
  { "word": "run", "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]
```

Field rules:

- `word` — lowercase, trimmed, base form
- `pos` — must match `SUPPORTED_POS` values exactly (`'noun'` or `'verb'`)
- `cefr` — must match `CEFR_LEVELS` exactly (`'A1'`–`'C2'`)
- `source` — short identifier string for the source (`'esl-lounge'`, `'kelly'`, etc.)

Output files go in `packages/db/src/cefr/sources/` named `<source>-<language>.json`,
e.g. `esl-lounge-en.json`, `kelly-en.json`, `kelly-it.json`.
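
The field rules above can be enforced with a small runtime guard before a file is written. A minimal sketch — `CEFR_LEVELS` and `SUPPORTED_POS` are inlined here only to keep the example self-contained; the real scripts must import them from `@glossa/shared`:

```typescript
// Inlined for illustration — import these from @glossa/shared in real scripts.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const SUPPORTED_POS = ["noun", "verb"] as const;

type CefrLevel = (typeof CEFR_LEVELS)[number];
type Pos = (typeof SUPPORTED_POS)[number];

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: CefrLevel;
  source: string;
}

function isNormalizedEntry(value: unknown): value is NormalizedEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.word === "string" &&
    v.word === v.word.trim().toLowerCase() && // lowercase, trimmed
    (SUPPORTED_POS as readonly string[]).includes(v.pos as string) &&
    (CEFR_LEVELS as readonly string[]).includes(v.cefr as string) &&
    typeof v.source === "string" &&
    v.source.length > 0
  );
}
```

Running every extracted entry through a guard like this keeps a malformed source file from silently poisoning the merge step.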

---

## Scripts to Write

### 1. Source extraction scripts (one per source)

`packages/db/src/cefr/extract-<source>.ts`

Each script reads the raw source data (CSV, scraped HTML, whatever format the source
provides) and outputs the normalized JSON format above. Raw source files go in
`packages/db/src/cefr/raw/`.

**Sources to extract for English (start here):**

- `esl-lounge` — word lists at esl-lounge.com, already split by CEFR level and POS.
  Raw data will be provided as text files, one per level.

Add more sources as they become available. Each source is one extraction script and
one output file. Do not combine sources in extraction scripts.
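
Since the esl-lounge files are already split by level and POS, the extraction core reduces to a pure parsing function. This sketch assumes one word per line in the raw text files — an assumption to revisit once the actual dumps arrive:

```typescript
type Pos = "noun" | "verb";
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: CefrLevel;
  source: string;
}

// Hypothetical parser for one esl-lounge level file.
// Assumed raw format: one word per line; blank lines ignored.
function parseLevelFile(raw: string, cefr: CefrLevel, pos: Pos): NormalizedEntry[] {
  return raw
    .split("\n")
    .map((line) => line.trim().toLowerCase()) // enforce the word field rules
    .filter((line) => line.length > 0)
    .map((word) => ({ word, pos, cefr, source: "esl-lounge" }));
}
```

The script itself would loop over the level files in `raw/`, concatenate the results, and write them to `sources/esl-lounge-en.json`.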

---

### 2. Comparison script

`packages/db/src/cefr/compare.ts`

Reads all normalized JSON files from `sources/` and prints a report:

```
=== CEFR Source Comparison ===

Per source:
  esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
  kelly-en: 3,201 entries (A1: ...)

Overlap (words appearing in multiple sources):
  esl-lounge-en ∩ kelly-en: 1,203 words
    Agreement: 1,089 (90.5%)
    Conflict: 114 (9.5%)

Conflicts (sample, first 20):
  word        pos    esl-lounge  kelly
  -------------------------------------
  "achieve"   verb   B1          A2
  "ancient"   adj    B2          B1
  ...

DB coverage (words in sources that match a translation row):
  esl-lounge-en: 1,847 / 2,847 matched (64.9%)
  kelly-en: 2,103 / 3,201 matched (65.7%)
```

This script is read-only — it never writes to the DB.
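
The overlap/agreement/conflict numbers in the report boil down to a pairwise comparison keyed on word + POS. A minimal sketch over in-memory arrays (the real script would read these from the `sources/` JSON files):

```typescript
interface Entry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface Overlap {
  shared: number;   // word+pos pairs present in both sources
  agree: number;    // shared pairs with the same CEFR level
  conflicts: { key: string; a: string; b: string }[];
}

function compareSources(a: Entry[], b: Entry[]): Overlap {
  const key = (e: Entry) => `${e.word}|${e.pos}`;
  const levelsInA = new Map(a.map((e) => [key(e), e.cefr]));
  const result: Overlap = { shared: 0, agree: 0, conflicts: [] };
  for (const e of b) {
    const other = levelsInA.get(key(e));
    if (other === undefined) continue; // only in b — not an overlap
    result.shared++;
    if (other === e.cefr) result.agree++;
    else result.conflicts.push({ key: key(e), a: other, b: e.cefr });
  }
  return result;
}
```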

---

### 3. Merge script

`packages/db/src/cefr/merge.ts`

Reads all normalized JSON files from `sources/` for a given language and produces a
single merged JSON file in `packages/db/src/cefr/merged/<language>.json`.

**Merge rules:**

- If only one source has a word → use that level
- If multiple sources agree → use that level
- If sources conflict → use the level from the highest-priority source

**Source priority order (highest to lowest):**

1. `kelly` — purpose-built for language learning, CEFR-mapped by linguists
2. `esl-lounge` — curated by teachers, reliable but secondary
3. Any additional sources added later

Priority order is defined as a constant at the top of the merge script — easy to
change without touching the logic.

Output format — the same normalized JSON shape, but with the `source` field replaced
by a `sources` array showing which sources contributed:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
  { "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
```
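
The merge rules above can be sketched as: group entries by word + POS, let the highest-priority source win, and record every contributing source. `SOURCE_PRIORITY` here mirrors the constant the real `merge.ts` would define at the top of the file:

```typescript
const SOURCE_PRIORITY = ["kelly", "esl-lounge"]; // highest priority first

interface Entry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

function mergeEntries(entries: Entry[]): MergedEntry[] {
  // Group by word+pos — the merge rules operate per word sense.
  const groups = new Map<string, Entry[]>();
  for (const e of entries) {
    const key = `${e.word}|${e.pos}`;
    const group = groups.get(key) ?? [];
    group.push(e);
    groups.set(key, group);
  }
  // Sources not in the priority list rank after all listed ones.
  const rank = (s: string) => {
    const i = SOURCE_PRIORITY.indexOf(s);
    return i === -1 ? SOURCE_PRIORITY.length : i;
  };
  return [...groups.values()].map((group) => {
    const winner = [...group].sort((a, b) => rank(a.source) - rank(b.source))[0];
    return {
      word: winner.word,
      pos: winner.pos,
      cefr: winner.cefr, // single source, agreement, and conflict all reduce to this
      sources: group.map((e) => e.source),
    };
  });
}
```

Note that "only one source" and "multiple sources agree" fall out of the same priority pick, so there is no separate branching to maintain.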

---

### 4. Enrichment script

`packages/db/src/cefr/enrich.ts`

Reads merged JSON files from `merged/` and writes `cefr_level` to matching
`translations` rows.

**Matching logic:**

- For each entry in the merged JSON, find all `translations` rows where:
  - `language_code` matches the file's language
  - `text` matches the word (case-insensitive, trimmed)
  - The term's `pos` matches the entry's `pos`
- Set `cefr_level` on all matching rows
- Overwrite any existing `cefr_level` values rather than skipping already-set rows, so
  re-running is safe (if the write is implemented as an upsert, Drizzle's
  `onConflictDoUpdate` handles this; a plain `UPDATE` overwrites by default)
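
The matching predicates can be sketched as a pure function over in-memory rows. `TranslationRow` here stands in for a join of `translations` with its term (to get `pos`); the real script would express the same conditions as a Drizzle query, and the field names are illustrative:

```typescript
interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
}

// Stand-in for a translations row joined with its term's pos.
interface TranslationRow {
  id: number;
  languageCode: string;
  text: string;
  termPos: string;
}

// Returns row id -> cefr_level to write. Overwriting the map entry on
// repeat matches makes the operation naturally idempotent.
function matchRows(
  entries: MergedEntry[],
  rows: TranslationRow[],
  languageCode: string,
): Map<number, string> {
  const updates = new Map<number, string>();
  for (const entry of entries) {
    for (const row of rows) {
      if (row.languageCode !== languageCode) continue;
      if (row.text.trim().toLowerCase() !== entry.word) continue; // case-insensitive, trimmed
      if (row.termPos !== entry.pos) continue;
      updates.set(row.id, entry.cefr);
    }
  }
  return updates;
}
```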

**Logging:**

```
=== CEFR Enrichment ===
Language: en
Entries in merged file: 2,847
Matched translation rows: 4,203 (one word can match multiple translations — synonyms)
Unmatched entries: 644 (words not in DB)
Updated: 4,203
```

This script IS idempotent — running it twice produces the same result.

---

## File Structure

```
packages/db/src/cefr/
  raw/                     ← raw source files (gitignored if large)
    esl-lounge-a1-en.txt
    esl-lounge-a2-en.txt
    ...
  sources/                 ← normalized JSON per source per language
    esl-lounge-en.json
    kelly-en.json
    kelly-it.json
  merged/                  ← one authoritative JSON per language
    en.json
    it.json
  extract-esl-lounge.ts    ← extraction script
  extract-kelly.ts         ← extraction script (when Kelly data is available)
  compare.ts               ← comparison report
  merge.ts                 ← merge into authoritative file
  enrich.ts                ← write cefr_level to DB
```

---

## What NOT to do

- Do not hardcode CEFR level strings — always use `CEFR_LEVELS` from `@glossa/shared`
- Do not hardcode POS strings — always use `SUPPORTED_POS` from `@glossa/shared`
- Do not hardcode language codes — always use `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`
- Do not modify the schema
- Do not modify `seed.ts` or `generating-decks.ts`
- Do not skip the comparison step — it exists to surface data quality issues before enrichment
- Do not write `cefr_level` directly from raw source files — always go through normalize → merge → enrich

---

## Definition of Done

- All scripts run without TypeScript errors (`pnpm tsc --noEmit`)
- `extract-esl-lounge.ts` produces a valid normalized JSON file
- `compare.ts` prints a readable report showing coverage and conflicts
- `merge.ts` produces `merged/en.json` with conflict resolution applied
- `enrich.ts` writes `cefr_level` to matching `translations` rows and is idempotent
- Running `enrich.ts` twice produces the same DB state
- At least some `translations` rows have non-null `cefr_level` after enrichment

@@ -325,6 +325,25 @@ Exercise types split naturally into Type A (translation, current model) and Type

---

### Term glosses: Italian coverage is sparse (expected)

OMW gloss data is primarily in English. After full import:

- English glosses: 95,882 (~100% of terms)
- Italian glosses: 1,964 (~2% of terms)

This is not a data pipeline problem — it reflects the actual state of OMW. Italian
glosses simply don't exist for most synsets in the dataset.

**Handling in the UI:** fall back to the English gloss when no gloss exists for the
user's language. This is acceptable UX — a definition in the wrong language is better
than no definition at all.
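
That fallback is a one-liner wherever glosses are rendered. A tiny sketch — the field names are illustrative, not the actual schema:

```typescript
// Hypothetical shape: gloss text keyed by language code.
interface Glosses {
  it?: string;
  en?: string;
}

// Prefer the user's language, fall back to English.
function glossFor(glosses: Glosses, languageCode: "it" | "en"): string | undefined {
  return glosses[languageCode] ?? glosses.en;
}
```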

If Italian gloss coverage needs to improve in the future, Wiktionary is the most
likely source — it has broader multilingual definition coverage than OMW.

---

## Open Research

### Semantic category metadata source

@@ -170,34 +170,11 @@ Too expensive at scale — only viable for small curated additions on top of an

### Implementation roadmap

- [x] Finalize data model
- [x] Write and run migrations
- [x] Fill the database (expand import pipeline)
- [ ] Decide SUBTLEX → cefr_level mapping strategy
- [ ] Generate decks
- [ ] Finalize game selection flow
- [ ] Define Zod schemas in packages/shared
- [ ] Implement API