updating documentation

This commit is contained in:
lila 2026-04-06 17:01:34 +02:00
parent 570dbff25e
commit 60cf48ef97
3 changed files with 243 additions and 31 deletions

@@ -0,0 +1,216 @@
# Phase 4 — CEFR Enrichment Pipeline
## Context
This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer
uses Drizzle ORM with Postgres. The project is called Glossa.
**Read `decisions.md` and `schema.ts` before doing anything.** They contain the full
reasoning behind every decision. Do not deviate from established patterns without
flagging it explicitly.
---
## Current State
The database is fully populated with OMW data:
- 95,882 terms (nouns and verbs)
- 225,997 translations (English and Italian)
- `cefr_level` is null on every translation row — this phase populates it
---
## Goal
Build a pipeline that:
1. Normalizes CEFR word lists from multiple sources into a common JSON format
2. Compares sources to surface agreements and conflicts
3. Merges sources into a single authoritative JSON per language
4. Enriches the `translations` table with `cefr_level` values
All scripts live in `packages/db/src/cefr/`.
---
## Normalized JSON Format
Every source extraction script outputs a JSON file in this exact shape:
```json
[
{ "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
{ "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
{ "word": "run", "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]
```
Field rules:
- `word` — lowercase, trimmed, base form
- `pos` — must match `SUPPORTED_POS` values exactly (`'noun'` or `'verb'`)
- `cefr` — must match `CEFR_LEVELS` exactly (`'A1'` through `'C2'`)
- `source` — short identifier string for the source (`'esl-lounge'`, `'kelly'`, etc.)
Output files go in `packages/db/src/cefr/sources/` named `<source>-<language>.json`
e.g. `esl-lounge-en.json`, `kelly-en.json`, `kelly-it.json`.
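The field rules above can be pinned down with a small type and runtime guard. A sketch — the real `SUPPORTED_POS` and `CEFR_LEVELS` constants live in `@glossa/shared`, so the inlined copies here are stand-ins:

```typescript
// Stand-ins for the real constants from @glossa/shared.
const SUPPORTED_POS = ["noun", "verb"] as const;
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;

type Pos = (typeof SUPPORTED_POS)[number];
type CefrLevel = (typeof CEFR_LEVELS)[number];

interface CefrEntry {
  word: string;   // lowercase, trimmed, base form
  pos: Pos;
  cefr: CefrLevel;
  source: string; // e.g. "esl-lounge"
}

// Runtime guard for data read back from sources/*.json.
function isCefrEntry(value: unknown): value is CefrEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.word === "string" &&
    v.word === v.word.trim().toLowerCase() &&
    (SUPPORTED_POS as readonly string[]).includes(v.pos as string) &&
    (CEFR_LEVELS as readonly string[]).includes(v.cefr as string) &&
    typeof v.source === "string" &&
    v.source.length > 0
  );
}
```

Extraction scripts can run every output entry through a guard like this before writing the file, so malformed rows fail fast instead of surfacing later in `compare.ts`.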
---
## Scripts to Write
### 1. Source extraction scripts (one per source)
`packages/db/src/cefr/extract-<source>.ts`
Each script reads the raw source data (CSV, scraped HTML, whatever format the source
provides) and outputs the normalized JSON format above. Raw source files go in
`packages/db/src/cefr/raw/`.
**Sources to extract for English (start here):**
- `esl-lounge` — word lists at esl-lounge.com, already split by CEFR level and POS.
Raw data will be provided as text files, one per level.
Add more sources as they become available. Each source is one extraction script,
one output file. Do not combine sources in extraction scripts.
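The core of an extraction script is a pure per-file parser. A sketch, assuming the raw esl-lounge files hold one entry per line as `word<TAB>pos` — the actual file format may differ and would need its own parsing:

```typescript
interface Entry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

// ASSUMPTION: one "word<TAB>pos" entry per line (e.g. "bank\tnoun").
// The cefr level comes from the filename, since raw files are split by level.
function parseLevelFile(raw: string, cefr: string, source = "esl-lounge"): Entry[] {
  return raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const [word, pos] = line.split("\t");
      return { word: word.trim().toLowerCase(), pos: pos?.trim() ?? "", cefr, source };
    })
    .filter((e) => e.pos === "noun" || e.pos === "verb"); // keep only SUPPORTED_POS
}
```

The surrounding script would loop over the per-level raw files, concatenate the parsed entries, and write the result to `sources/esl-lounge-en.json`.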
---
### 2. Comparison script
`packages/db/src/cefr/compare.ts`
Reads all normalized JSON files from `sources/` and prints a report:
```
=== CEFR Source Comparison ===
Per source:
esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
kelly-en: 3,201 entries (A1: ...)
Overlap (words appearing in multiple sources):
esl-lounge-en ∩ kelly-en: 1,203 words
Agreement: 1,089 (90.5%)
Conflict: 114 (9.5%)
Conflicts (sample, first 20):
word pos esl-lounge kelly
-------------------------------
"achieve" verb B1 A2
"advance"  noun        B2          B1
...
DB coverage (words in sources that match a translation row):
esl-lounge-en: 1,847 / 2,847 matched (64.9%)
kelly-en: 2,103 / 3,201 matched (65.7%)
```
This script is read-only — it never writes to the DB.
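The overlap/agreement numbers in the report reduce to a keyed join on `word|pos`. A minimal sketch of that computation (function and field names here are illustrative, not prescribed):

```typescript
interface Entry { word: string; pos: string; cefr: string; source: string; }

// Compares two normalized source files: counts overlapping (word, pos) pairs,
// how many agree on the level, and collects the conflicting pairs.
function compareSources(a: Entry[], b: Entry[]) {
  const key = (e: Entry) => `${e.word}|${e.pos}`;
  const byKey = new Map(a.map((e) => [key(e), e]));
  let agreement = 0;
  const conflicts: Array<{ word: string; pos: string; a: string; b: string }> = [];
  for (const e of b) {
    const match = byKey.get(key(e));
    if (!match) continue;
    if (match.cefr === e.cefr) agreement++;
    else conflicts.push({ word: e.word, pos: e.pos, a: match.cefr, b: e.cefr });
  }
  return { overlap: agreement + conflicts.length, agreement, conflicts };
}
```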
---
### 3. Merge script
`packages/db/src/cefr/merge.ts`
Reads all normalized JSON files from `sources/` for a given language and produces a
single merged JSON file in `packages/db/src/cefr/merged/<language>.json`.
**Merge rules:**
- If only one source has a word → use that level
- If multiple sources agree → use that level
- If sources conflict → use the level from the highest-priority source
**Source priority order (highest to lowest):**
1. `kelly` — purpose-built for language learning, CEFR-mapped by linguists
2. `esl-lounge` — curated by teachers, reliable but secondary
3. Any additional sources added later
Priority order is defined as a constant at the top of the merge script — easy to
change without touching the logic.
Output format — the same normalized JSON shape, except that the `source` field is
replaced by a `sources` array showing which sources contributed:
```json
[
{ "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
{ "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
```
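The three merge rules collapse into one: group by `(word, pos)` and let the highest-priority source win (a single source or unanimous agreement is just the degenerate case). A sketch, with the priority constant at the top as the spec requires:

```typescript
// Highest priority first; new sources get appended here.
const SOURCE_PRIORITY = ["kelly", "esl-lounge"];

interface Entry { word: string; pos: string; cefr: string; source: string; }
interface MergedEntry { word: string; pos: string; cefr: string; sources: string[]; }

function mergeEntries(entries: Entry[]): MergedEntry[] {
  // Group entries by (word, pos).
  const groups = new Map<string, Entry[]>();
  for (const e of entries) {
    const k = `${e.word}|${e.pos}`;
    const g = groups.get(k);
    if (g) g.push(e);
    else groups.set(k, [e]);
  }
  // Unknown sources rank below every listed one.
  const rank = (s: string) => {
    const i = SOURCE_PRIORITY.indexOf(s);
    return i === -1 ? SOURCE_PRIORITY.length : i;
  };
  return [...groups.values()].map((g) => {
    const winner = [...g].sort((x, y) => rank(x.source) - rank(y.source))[0];
    return {
      word: winner.word,
      pos: winner.pos,
      cefr: winner.cefr,
      sources: g.map((e) => e.source),
    };
  });
}
```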
---
### 4. Enrichment script
`packages/db/src/cefr/enrich.ts`
Reads merged JSON files from `merged/` and writes `cefr_level` to matching
`translations` rows.
**Matching logic:**
- For each entry in the merged JSON, find all `translations` rows where:
- `language_code` matches the file's language
- `text` matches the word (case-insensitive, trimmed)
- The term's `pos` matches the entry's `pos`
- Set `cefr_level` on all matching rows
- Overwrite any existing `cefr_level` unconditionally with a plain `UPDATE` so
  re-running is safe (Drizzle's `onConflictDoUpdate` is an insert-time upsert
  helper and does not apply here, since no rows are inserted)
**Logging:**
```
=== CEFR Enrichment ===
Language: en
Entries in merged file: 2,847
Matched translation rows: 4,203 (one word can match multiple translations — synonyms)
Unmatched entries: 644 (words not in DB)
Updated: 4,203
```
This script IS idempotent — running it twice produces the same result.
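The matching and idempotency semantics can be modeled in memory before touching the DB. A sketch — the real script issues the equivalent Drizzle/SQL `UPDATE` against `translations`, and the row shape below is an assumption, not the actual schema:

```typescript
interface MergedEntry { word: string; pos: string; cefr: string; sources: string[]; }
interface TranslationRow {
  id: number;
  text: string;
  languageCode: string;
  pos: string;               // in the real schema this comes from the joined term
  cefrLevel: string | null;
}

// Match key: language + POS + case-insensitive, trimmed text.
function matchKey(text: string, languageCode: string, pos: string): string {
  return `${languageCode}|${pos}|${text.trim().toLowerCase()}`;
}

// In-memory model of the UPDATE: overwrites unconditionally, so a re-run
// rewrites the same values and leaves the state unchanged (idempotent).
function applyEnrichment(rows: TranslationRow[], entries: MergedEntry[], language: string): number {
  const levelByKey = new Map(entries.map((e) => [matchKey(e.word, language, e.pos), e.cefr]));
  let updated = 0;
  for (const row of rows) {
    const level = levelByKey.get(matchKey(row.text, row.languageCode, row.pos));
    if (level !== undefined) {
      row.cefrLevel = level;
      updated++;
    }
  }
  return updated;
}
```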
---
## File Structure
```
packages/db/src/cefr/
raw/ ← raw source files (gitignored if large)
esl-lounge-a1-en.txt
esl-lounge-a2-en.txt
...
sources/ ← normalized JSON per source per language
esl-lounge-en.json
kelly-en.json
kelly-it.json
merged/ ← one authoritative JSON per language
en.json
it.json
extract-esl-lounge.ts ← extraction script
extract-kelly.ts ← extraction script (when Kelly data is available)
compare.ts ← comparison report
merge.ts ← merge into authoritative file
enrich.ts ← write cefr_level to DB
```
---
## What NOT to do
- Do not hardcode CEFR level strings — always use `CEFR_LEVELS` from `@glossa/shared`
- Do not hardcode POS strings — always use `SUPPORTED_POS` from `@glossa/shared`
- Do not hardcode language codes — always use `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`
- Do not modify the schema
- Do not modify `seed.ts` or `generating-decks.ts`
- Do not skip the comparison step — it exists to surface data quality issues before enrichment
- Do not write `cefr_level` directly from raw source files — always go through normalize → merge → enrich
---
## Definition of Done
- All scripts run without TypeScript errors (`pnpm tsc --noEmit`)
- `extract-esl-lounge.ts` produces a valid normalized JSON file
- `compare.ts` prints a readable report showing coverage and conflicts
- `merge.ts` produces `merged/en.json` with conflict resolution applied
- `enrich.ts` writes `cefr_level` to matching `translations` rows and is idempotent
- Running `enrich.ts` twice produces the same DB state
- At least some `translations` rows have non-null `cefr_level` after enrichment

@@ -325,6 +325,25 @@ Exercise types split naturally into Type A (translation, current model) and Type
---
### Term glosses: Italian coverage is sparse (expected)
OMW gloss data is primarily in English. After full import:
- English glosses: 95,882 (~100% of terms)
- Italian glosses: 1,964 (~2% of terms)
This is not a data pipeline problem — it reflects the actual state of OMW. Italian
glosses simply don't exist for most synsets in the dataset.
**Handling in the UI:** fall back to the English gloss when no gloss exists for the
user's language. This is acceptable UX — a definition in the wrong language is better
than no definition at all.
If Italian gloss coverage needs to improve in the future, Wiktionary is the most
likely source — it has broader multilingual definition coverage than OMW.
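The fallback rule is a one-liner in the UI layer. A sketch, assuming glosses arrive keyed by language code (the actual shape may differ):

```typescript
// Prefer the user's language; fall back to English, which OMW covers ~100%.
function pickGloss(
  glosses: Record<string, string | undefined>,
  userLanguage: string
): string | undefined {
  return glosses[userLanguage] ?? glosses["en"];
}
```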
---
## Open Research
### Semantic category metadata source

@@ -170,34 +170,11 @@ Too expensive at scale — only viable for small curated additions on top of an
### Implementation roadmap
- [x] Finalize data model
- [x] Write and run migrations
- [x] Fill the database (expand import pipeline)
- [ ] Decide SUBTLEX → cefr_level mapping strategy
- [ ] Generate decks
- [ ] Finalize game selection flow
- [ ] Define Zod schemas in packages/shared
- [ ] Implement API