updating documentation

This commit is contained in:
lila 2026-04-06 17:01:34 +02:00
parent 570dbff25e
commit 60cf48ef97
3 changed files with 243 additions and 31 deletions

@@ -0,0 +1,216 @@
# Phase 4 — CEFR Enrichment Pipeline
## Context
This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer
uses Drizzle ORM with Postgres. The project is called Glossa.
**Read `decisions.md` and `schema.ts` before doing anything.** They contain the full
reasoning behind every decision. Do not deviate from established patterns without
flagging it explicitly.
---
## Current State
The database is fully populated with OMW data:
- 95,882 terms (nouns and verbs)
- 225,997 translations (English and Italian)
- `cefr_level` is null on every translation row — this phase populates it
---
## Goal
Build a pipeline that:
1. Normalizes CEFR word lists from multiple sources into a common JSON format
2. Compares sources to surface agreements and conflicts
3. Merges sources into a single authoritative JSON per language
4. Enriches the `translations` table with `cefr_level` values
All scripts live in `packages/db/src/cefr/`.
---
## Normalized JSON Format
Every source extraction script outputs a JSON file in this exact shape:
```json
[
{ "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
{ "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
{ "word": "run", "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]
```
Field rules:
- `word` — lowercase, trimmed, base form
- `pos` — must match `SUPPORTED_POS` values exactly (`'noun'` or `'verb'`)
- `cefr` — must match `CEFR_LEVELS` exactly (`'A1'` through `'C2'`)
- `source` — short identifier string for the source (`'esl-lounge'`, `'kelly'`, etc.)
Output files go in `packages/db/src/cefr/sources/` named `<source>-<language>.json`
e.g. `esl-lounge-en.json`, `kelly-en.json`, `kelly-it.json`.
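The field rules above can be pinned down with a small type and runtime guard. A sketch — the real `SUPPORTED_POS` and `CEFR_LEVELS` constants live in `@glossa/shared`, so the inlined copies here are stand-ins:

```typescript
// Stand-ins for the real constants from @glossa/shared.
const SUPPORTED_POS = ["noun", "verb"] as const;
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;

type Pos = (typeof SUPPORTED_POS)[number];
type CefrLevel = (typeof CEFR_LEVELS)[number];

interface CefrEntry {
  word: string;   // lowercase, trimmed, base form
  pos: Pos;
  cefr: CefrLevel;
  source: string; // e.g. "esl-lounge"
}

// Runtime guard for data read back from sources/*.json.
function isCefrEntry(value: unknown): value is CefrEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.word === "string" &&
    v.word === v.word.trim().toLowerCase() &&
    (SUPPORTED_POS as readonly string[]).includes(v.pos as string) &&
    (CEFR_LEVELS as readonly string[]).includes(v.cefr as string) &&
    typeof v.source === "string" &&
    v.source.length > 0
  );
}
```

Extraction scripts can run every output entry through a guard like this before writing the file, so malformed rows fail fast instead of surfacing later in `compare.ts`.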
---
## Scripts to Write
### 1. Source extraction scripts (one per source)
`packages/db/src/cefr/extract-<source>.ts`
Each script reads the raw source data (CSV, scraped HTML, whatever format the source
provides) and outputs the normalized JSON format above. Raw source files go in
`packages/db/src/cefr/raw/`.
**Sources to extract for English (start here):**
- `esl-lounge` — word lists at esl-lounge.com, already split by CEFR level and POS.
Raw data will be provided as text files, one per level.
Add more sources as they become available. Each source is one extraction script,
one output file. Do not combine sources in extraction scripts.
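The core of an extraction script is a pure per-file parser. A sketch, assuming the raw esl-lounge files hold one entry per line as `word<TAB>pos` — the actual file format may differ and would need its own parsing:

```typescript
interface Entry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

// ASSUMPTION: one "word<TAB>pos" entry per line (e.g. "bank\tnoun").
// The cefr level comes from the filename, since raw files are split by level.
function parseLevelFile(raw: string, cefr: string, source = "esl-lounge"): Entry[] {
  return raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const [word, pos] = line.split("\t");
      return { word: word.trim().toLowerCase(), pos: pos?.trim() ?? "", cefr, source };
    })
    .filter((e) => e.pos === "noun" || e.pos === "verb"); // keep only SUPPORTED_POS
}
```

The surrounding script would loop over the per-level raw files, concatenate the parsed entries, and write the result to `sources/esl-lounge-en.json`.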
---
### 2. Comparison script
`packages/db/src/cefr/compare.ts`
Reads all normalized JSON files from `sources/` and prints a report:
```
=== CEFR Source Comparison ===
Per source:
esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
kelly-en: 3,201 entries (A1: ...)
Overlap (words appearing in multiple sources):
esl-lounge-en ∩ kelly-en: 1,203 words
Agreement: 1,089 (90.5%)
Conflict: 114 (9.5%)
Conflicts (sample, first 20):
word pos esl-lounge kelly
-------------------------------
"achieve" verb B1 A2
"advance"  noun        B2          B1
...
DB coverage (words in sources that match a translation row):
esl-lounge-en: 1,847 / 2,847 matched (64.9%)
kelly-en: 2,103 / 3,201 matched (65.7%)
```
This script is read-only — it never writes to the DB.
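The overlap/agreement numbers in the report reduce to a keyed join on `word|pos`. A minimal sketch of that computation (function and field names here are illustrative, not prescribed):

```typescript
interface Entry { word: string; pos: string; cefr: string; source: string; }

// Compares two normalized source files: counts overlapping (word, pos) pairs,
// how many agree on the level, and collects the conflicting pairs.
function compareSources(a: Entry[], b: Entry[]) {
  const key = (e: Entry) => `${e.word}|${e.pos}`;
  const byKey = new Map(a.map((e) => [key(e), e]));
  let agreement = 0;
  const conflicts: Array<{ word: string; pos: string; a: string; b: string }> = [];
  for (const e of b) {
    const match = byKey.get(key(e));
    if (!match) continue;
    if (match.cefr === e.cefr) agreement++;
    else conflicts.push({ word: e.word, pos: e.pos, a: match.cefr, b: e.cefr });
  }
  return { overlap: agreement + conflicts.length, agreement, conflicts };
}
```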
---
### 3. Merge script
`packages/db/src/cefr/merge.ts`
Reads all normalized JSON files from `sources/` for a given language and produces a
single merged JSON file in `packages/db/src/cefr/merged/<language>.json`.
**Merge rules:**
- If only one source has a word → use that level
- If multiple sources agree → use that level
- If sources conflict → use the level from the highest-priority source
**Source priority order (highest to lowest):**
1. `kelly` — purpose-built for language learning, CEFR-mapped by linguists
2. `esl-lounge` — curated by teachers, reliable but secondary
3. Any additional sources added later
Priority order is defined as a constant at the top of the merge script — easy to
change without touching the logic.
Output format — the same normalized JSON shape, except that the `source` field is
replaced by a `sources` array showing which sources contributed:
```json
[
{ "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
{ "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
```
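The three merge rules collapse into one: group by `(word, pos)` and let the highest-priority source win (a single source or unanimous agreement is just the degenerate case). A sketch, with the priority constant at the top as the spec requires:

```typescript
// Highest priority first; new sources get appended here.
const SOURCE_PRIORITY = ["kelly", "esl-lounge"];

interface Entry { word: string; pos: string; cefr: string; source: string; }
interface MergedEntry { word: string; pos: string; cefr: string; sources: string[]; }

function mergeEntries(entries: Entry[]): MergedEntry[] {
  // Group entries by (word, pos).
  const groups = new Map<string, Entry[]>();
  for (const e of entries) {
    const k = `${e.word}|${e.pos}`;
    const g = groups.get(k);
    if (g) g.push(e);
    else groups.set(k, [e]);
  }
  // Unknown sources rank below every listed one.
  const rank = (s: string) => {
    const i = SOURCE_PRIORITY.indexOf(s);
    return i === -1 ? SOURCE_PRIORITY.length : i;
  };
  return [...groups.values()].map((g) => {
    const winner = [...g].sort((x, y) => rank(x.source) - rank(y.source))[0];
    return {
      word: winner.word,
      pos: winner.pos,
      cefr: winner.cefr,
      sources: g.map((e) => e.source),
    };
  });
}
```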
---
### 4. Enrichment script
`packages/db/src/cefr/enrich.ts`
Reads merged JSON files from `merged/` and writes `cefr_level` to matching
`translations` rows.
**Matching logic:**
- For each entry in the merged JSON, find all `translations` rows where:
- `language_code` matches the file's language
- `text` matches the word (case-insensitive, trimmed)
- The term's `pos` matches the entry's `pos`
- Set `cefr_level` on all matching rows
- Overwrite any existing `cefr_level` unconditionally with a plain `UPDATE` so
  re-running is safe (Drizzle's `onConflictDoUpdate` is an insert-time upsert
  helper and does not apply here, since no rows are inserted)
**Logging:**
```
=== CEFR Enrichment ===
Language: en
Entries in merged file: 2,847
Matched translation rows: 4,203 (one word can match multiple translations — synonyms)
Unmatched entries: 644 (words not in DB)
Updated: 4,203
```
This script IS idempotent — running it twice produces the same result.
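The matching and idempotency semantics can be modeled in memory before touching the DB. A sketch — the real script issues the equivalent Drizzle/SQL `UPDATE` against `translations`, and the row shape below is an assumption, not the actual schema:

```typescript
interface MergedEntry { word: string; pos: string; cefr: string; sources: string[]; }
interface TranslationRow {
  id: number;
  text: string;
  languageCode: string;
  pos: string;               // in the real schema this comes from the joined term
  cefrLevel: string | null;
}

// Match key: language + POS + case-insensitive, trimmed text.
function matchKey(text: string, languageCode: string, pos: string): string {
  return `${languageCode}|${pos}|${text.trim().toLowerCase()}`;
}

// In-memory model of the UPDATE: overwrites unconditionally, so a re-run
// rewrites the same values and leaves the state unchanged (idempotent).
function applyEnrichment(rows: TranslationRow[], entries: MergedEntry[], language: string): number {
  const levelByKey = new Map(entries.map((e) => [matchKey(e.word, language, e.pos), e.cefr]));
  let updated = 0;
  for (const row of rows) {
    const level = levelByKey.get(matchKey(row.text, row.languageCode, row.pos));
    if (level !== undefined) {
      row.cefrLevel = level;
      updated++;
    }
  }
  return updated;
}
```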
---
## File Structure
```
packages/db/src/cefr/
raw/ ← raw source files (gitignored if large)
esl-lounge-a1-en.txt
esl-lounge-a2-en.txt
...
sources/ ← normalized JSON per source per language
esl-lounge-en.json
kelly-en.json
kelly-it.json
merged/ ← one authoritative JSON per language
en.json
it.json
extract-esl-lounge.ts ← extraction script
extract-kelly.ts ← extraction script (when Kelly data is available)
compare.ts ← comparison report
merge.ts ← merge into authoritative file
enrich.ts ← write cefr_level to DB
```
---
## What NOT to do
- Do not hardcode CEFR level strings — always use `CEFR_LEVELS` from `@glossa/shared`
- Do not hardcode POS strings — always use `SUPPORTED_POS` from `@glossa/shared`
- Do not hardcode language codes — always use `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`
- Do not modify the schema
- Do not modify `seed.ts` or `generating-decks.ts`
- Do not skip the comparison step — it exists to surface data quality issues before enrichment
- Do not write `cefr_level` directly from raw source files — always go through normalize → merge → enrich
---
## Definition of Done
- All scripts run without TypeScript errors (`pnpm tsc --noEmit`)
- `extract-esl-lounge.ts` produces a valid normalized JSON file
- `compare.ts` prints a readable report showing coverage and conflicts
- `merge.ts` produces `merged/en.json` with conflict resolution applied
- `enrich.ts` writes `cefr_level` to matching `translations` rows and is idempotent
- Running `enrich.ts` twice produces the same DB state
- At least some `translations` rows have non-null `cefr_level` after enrichment

@@ -325,6 +325,25 @@ Exercise types split naturally into Type A (translation, current model) and Type
---
### Term glosses: Italian coverage is sparse (expected)
OMW gloss data is primarily in English. After full import:
- English glosses: 95,882 (~100% of terms)
- Italian glosses: 1,964 (~2% of terms)
This is not a data pipeline problem — it reflects the actual state of OMW. Italian
glosses simply don't exist for most synsets in the dataset.
**Handling in the UI:** fall back to the English gloss when no gloss exists for the
user's language. This is acceptable UX — a definition in the wrong language is better
than no definition at all.
If Italian gloss coverage needs to improve in the future, Wiktionary is the most
likely source — it has broader multilingual definition coverage than OMW.
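The fallback rule is a one-liner in the UI layer. A sketch, assuming glosses arrive keyed by language code (the actual shape may differ):

```typescript
// Prefer the user's language; fall back to English, which OMW covers ~100%.
function pickGloss(
  glosses: Record<string, string | undefined>,
  userLanguage: string
): string | undefined {
  return glosses[userLanguage] ?? glosses["en"];
}
```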
---
## Open Research
### Semantic category metadata source

@@ -170,34 +170,11 @@ Too expensive at scale — only viable for small curated additions on top of an
### Implementation roadmap
- [x] Finalize data model
- [x] Write and run migrations
- [x] Fill the database (expand import pipeline)
- [ ] Decide SUBTLEX → cefr_level mapping strategy
- [ ] Generate decks
- [ ] Finalize game selection flow
- [ ] Define Zod schemas in packages/shared
- [ ] Implement API