updating documentation

parent 570dbff25e
commit 60cf48ef97
3 changed files with 243 additions and 31 deletions

documentation/cefr_enrichment.md — new file, 216 lines

@@ -0,0 +1,216 @@
# Phase 4 — CEFR Enrichment Pipeline

## Context

This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer
uses Drizzle ORM with Postgres. The project is called Glossa.

**Read `decisions.md` and `schema.ts` before doing anything.** They contain the full
reasoning behind every decision. Do not deviate from established patterns without
flagging it explicitly.

---

## Current State

The database is fully populated with OMW data:

- 95,882 terms (nouns and verbs)
- 225,997 translations (English and Italian)
- `cefr_level` is null on every translation row — this phase populates it

---

## Goal

Build a pipeline that:

1. Normalizes CEFR word lists from multiple sources into a common JSON format
2. Compares sources to surface agreements and conflicts
3. Merges sources into a single authoritative JSON per language
4. Enriches the `translations` table with `cefr_level` values

All scripts live in `packages/db/src/cefr/`.

---

## Normalized JSON Format

Every source extraction script outputs a JSON file in this exact shape:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
  { "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
  { "word": "run", "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]
```

Field rules:

- `word` — lowercase, trimmed, base form
- `pos` — must match `SUPPORTED_POS` values exactly (`'noun'` or `'verb'`)
- `cefr` — must match `CEFR_LEVELS` exactly (`'A1'`–`'C2'`)
- `source` — short identifier string for the source (`'esl-lounge'`, `'kelly'`, etc.)

Output files go in `packages/db/src/cefr/sources/` named `<source>-<language>.json`,
e.g. `esl-lounge-en.json`, `kelly-en.json`, `kelly-it.json`.
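
The field rules above can be enforced with a small runtime guard before a file is written. A minimal sketch — `CEFR_LEVELS` and `SUPPORTED_POS` are inlined here only to keep the example self-contained; the real scripts must import them from `@glossa/shared`:

```typescript
// Inlined for illustration — import these from @glossa/shared in real scripts.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const SUPPORTED_POS = ["noun", "verb"] as const;

type CefrLevel = (typeof CEFR_LEVELS)[number];
type Pos = (typeof SUPPORTED_POS)[number];

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: CefrLevel;
  source: string;
}

function isNormalizedEntry(value: unknown): value is NormalizedEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.word === "string" &&
    v.word === v.word.trim().toLowerCase() && // lowercase, trimmed
    (SUPPORTED_POS as readonly string[]).includes(v.pos as string) &&
    (CEFR_LEVELS as readonly string[]).includes(v.cefr as string) &&
    typeof v.source === "string" &&
    v.source.length > 0
  );
}
```

Running every extracted entry through a guard like this keeps a malformed source file from silently poisoning the merge step.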

---

## Scripts to Write

### 1. Source extraction scripts (one per source)

`packages/db/src/cefr/extract-<source>.ts`

Each script reads the raw source data (CSV, scraped HTML, whatever format the source
provides) and outputs the normalized JSON format above. Raw source files go in
`packages/db/src/cefr/raw/`.

**Sources to extract for English (start here):**

- `esl-lounge` — word lists at esl-lounge.com, already split by CEFR level and POS.
  Raw data will be provided as text files, one per level.

Add more sources as they become available. Each source is one extraction script and
one output file. Do not combine sources in extraction scripts.
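
Since the esl-lounge files are already split by level and POS, the extraction core reduces to a pure parsing function. This sketch assumes one word per line in the raw text files — an assumption to revisit once the actual dumps arrive:

```typescript
type Pos = "noun" | "verb";
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: CefrLevel;
  source: string;
}

// Hypothetical parser for one esl-lounge level file.
// Assumed raw format: one word per line; blank lines ignored.
function parseLevelFile(raw: string, cefr: CefrLevel, pos: Pos): NormalizedEntry[] {
  return raw
    .split("\n")
    .map((line) => line.trim().toLowerCase()) // enforce the word field rules
    .filter((line) => line.length > 0)
    .map((word) => ({ word, pos, cefr, source: "esl-lounge" }));
}
```

The script itself would loop over the level files in `raw/`, concatenate the results, and write them to `sources/esl-lounge-en.json`.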

---

### 2. Comparison script

`packages/db/src/cefr/compare.ts`

Reads all normalized JSON files from `sources/` and prints a report:

```
=== CEFR Source Comparison ===

Per source:
  esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
  kelly-en: 3,201 entries (A1: ...)

Overlap (words appearing in multiple sources):
  esl-lounge-en ∩ kelly-en: 1,203 words
    Agreement: 1,089 (90.5%)
    Conflict: 114 (9.5%)

Conflicts (sample, first 20):
  word        pos    esl-lounge  kelly
  -------------------------------------
  "achieve"   verb   B1          A2
  "ancient"   adj    B2          B1
  ...

DB coverage (words in sources that match a translation row):
  esl-lounge-en: 1,847 / 2,847 matched (64.9%)
  kelly-en: 2,103 / 3,201 matched (65.7%)
```

This script is read-only — it never writes to the DB.
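
The overlap/agreement/conflict numbers in the report boil down to a pairwise comparison keyed on word + POS. A minimal sketch over in-memory arrays (the real script would read these from the `sources/` JSON files):

```typescript
interface Entry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface Overlap {
  shared: number;   // word+pos pairs present in both sources
  agree: number;    // shared pairs with the same CEFR level
  conflicts: { key: string; a: string; b: string }[];
}

function compareSources(a: Entry[], b: Entry[]): Overlap {
  const key = (e: Entry) => `${e.word}|${e.pos}`;
  const levelsInA = new Map(a.map((e) => [key(e), e.cefr]));
  const result: Overlap = { shared: 0, agree: 0, conflicts: [] };
  for (const e of b) {
    const other = levelsInA.get(key(e));
    if (other === undefined) continue; // only in b — not an overlap
    result.shared++;
    if (other === e.cefr) result.agree++;
    else result.conflicts.push({ key: key(e), a: other, b: e.cefr });
  }
  return result;
}
```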

---

### 3. Merge script

`packages/db/src/cefr/merge.ts`

Reads all normalized JSON files from `sources/` for a given language and produces a
single merged JSON file in `packages/db/src/cefr/merged/<language>.json`.

**Merge rules:**

- If only one source has a word → use that level
- If multiple sources agree → use that level
- If sources conflict → use the level from the highest-priority source

**Source priority order (highest to lowest):**

1. `kelly` — purpose-built for language learning, CEFR-mapped by linguists
2. `esl-lounge` — curated by teachers, reliable but secondary
3. Any additional sources added later

Priority order is defined as a constant at the top of the merge script — easy to
change without touching the logic.

Output format — the same normalized JSON shape, but with the `source` field replaced
by a `sources` array showing which sources contributed:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
  { "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
```
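
The merge rules above can be sketched as: group entries by word + POS, let the highest-priority source win, and record every contributing source. `SOURCE_PRIORITY` here mirrors the constant the real `merge.ts` would define at the top of the file:

```typescript
const SOURCE_PRIORITY = ["kelly", "esl-lounge"]; // highest priority first

interface Entry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

function mergeEntries(entries: Entry[]): MergedEntry[] {
  // Group by word+pos — the merge rules operate per word sense.
  const groups = new Map<string, Entry[]>();
  for (const e of entries) {
    const key = `${e.word}|${e.pos}`;
    const group = groups.get(key) ?? [];
    group.push(e);
    groups.set(key, group);
  }
  // Sources not in the priority list rank after all listed ones.
  const rank = (s: string) => {
    const i = SOURCE_PRIORITY.indexOf(s);
    return i === -1 ? SOURCE_PRIORITY.length : i;
  };
  return [...groups.values()].map((group) => {
    const winner = [...group].sort((a, b) => rank(a.source) - rank(b.source))[0];
    return {
      word: winner.word,
      pos: winner.pos,
      cefr: winner.cefr, // single source, agreement, and conflict all reduce to this
      sources: group.map((e) => e.source),
    };
  });
}
```

Note that "only one source" and "multiple sources agree" fall out of the same priority pick, so there is no separate branching to maintain.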

---

### 4. Enrichment script

`packages/db/src/cefr/enrich.ts`

Reads merged JSON files from `merged/` and writes `cefr_level` to matching
`translations` rows.

**Matching logic:**

- For each entry in the merged JSON, find all `translations` rows where:
  - `language_code` matches the file's language
  - `text` matches the word (case-insensitive, trimmed)
  - The term's `pos` matches the entry's `pos`
- Set `cefr_level` on all matching rows
- Overwrite any existing `cefr_level` values rather than skipping already-set rows, so
  re-running is safe (if the write is implemented as an upsert, Drizzle's
  `onConflictDoUpdate` handles this; a plain `UPDATE` overwrites by default)
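
The matching predicates can be sketched as a pure function over in-memory rows. `TranslationRow` here stands in for a join of `translations` with its term (to get `pos`); the real script would express the same conditions as a Drizzle query, and the field names are illustrative:

```typescript
interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
}

// Stand-in for a translations row joined with its term's pos.
interface TranslationRow {
  id: number;
  languageCode: string;
  text: string;
  termPos: string;
}

// Returns row id -> cefr_level to write. Overwriting the map entry on
// repeat matches makes the operation naturally idempotent.
function matchRows(
  entries: MergedEntry[],
  rows: TranslationRow[],
  languageCode: string,
): Map<number, string> {
  const updates = new Map<number, string>();
  for (const entry of entries) {
    for (const row of rows) {
      if (row.languageCode !== languageCode) continue;
      if (row.text.trim().toLowerCase() !== entry.word) continue; // case-insensitive, trimmed
      if (row.termPos !== entry.pos) continue;
      updates.set(row.id, entry.cefr);
    }
  }
  return updates;
}
```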

**Logging:**

```
=== CEFR Enrichment ===
Language: en
Entries in merged file: 2,847
Matched translation rows: 4,203 (one word can match multiple translations — synonyms)
Unmatched entries: 644 (words not in DB)
Updated: 4,203
```

This script IS idempotent — running it twice produces the same result.

---

## File Structure

```
packages/db/src/cefr/
  raw/                     ← raw source files (gitignored if large)
    esl-lounge-a1-en.txt
    esl-lounge-a2-en.txt
    ...
  sources/                 ← normalized JSON per source per language
    esl-lounge-en.json
    kelly-en.json
    kelly-it.json
  merged/                  ← one authoritative JSON per language
    en.json
    it.json
  extract-esl-lounge.ts    ← extraction script
  extract-kelly.ts         ← extraction script (when Kelly data is available)
  compare.ts               ← comparison report
  merge.ts                 ← merge into authoritative file
  enrich.ts                ← write cefr_level to DB
```

---

## What NOT to do

- Do not hardcode CEFR level strings — always use `CEFR_LEVELS` from `@glossa/shared`
- Do not hardcode POS strings — always use `SUPPORTED_POS` from `@glossa/shared`
- Do not hardcode language codes — always use `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`
- Do not modify the schema
- Do not modify `seed.ts` or `generating-decks.ts`
- Do not skip the comparison step — it exists to surface data quality issues before enrichment
- Do not write `cefr_level` directly from raw source files — always go through normalize → merge → enrich

---

## Definition of Done

- All scripts run without TypeScript errors (`pnpm tsc --noEmit`)
- `extract-esl-lounge.ts` produces a valid normalized JSON file
- `compare.ts` prints a readable report showing coverage and conflicts
- `merge.ts` produces `merged/en.json` with conflict resolution applied
- `enrich.ts` writes `cefr_level` to matching `translations` rows and is idempotent
- Running `enrich.ts` twice produces the same DB state
- At least some `translations` rows have non-null `cefr_level` after enrichment

@@ -325,6 +325,25 @@ Exercise types split naturally into Type A (translation, current model) and Type

---

### Term glosses: Italian coverage is sparse (expected)

OMW gloss data is primarily in English. After full import:

- English glosses: 95,882 (~100% of terms)
- Italian glosses: 1,964 (~2% of terms)

This is not a data pipeline problem — it reflects the actual state of OMW. Italian
glosses simply don't exist for most synsets in the dataset.

**Handling in the UI:** fall back to the English gloss when no gloss exists for the
user's language. This is acceptable UX — a definition in the wrong language is better
than no definition at all.
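
That fallback is a one-liner wherever glosses are rendered. A tiny sketch — the field names are illustrative, not the actual schema:

```typescript
// Hypothetical shape: gloss text keyed by language code.
interface Glosses {
  it?: string;
  en?: string;
}

// Prefer the user's language, fall back to English.
function glossFor(glosses: Glosses, languageCode: "it" | "en"): string | undefined {
  return glosses[languageCode] ?? glosses.en;
}
```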

If Italian gloss coverage needs to improve in the future, Wiktionary is the most
likely source — it has broader multilingual definition coverage than OMW.

---

## Open Research

### Semantic category metadata source

@@ -170,34 +170,11 @@ Too expensive at scale — only viable for small curated additions on top of an

### Implementation roadmap

- [x] Finalize data model
- [x] Write and run migrations
- [x] Fill the database (expand import pipeline)
- [ ] Decide SUBTLEX → cefr_level mapping strategy
- [ ] Generate decks
- [ ] Finalize game selection flow
- [ ] Define Zod schemas in packages/shared
- [ ] Implement API