From 60cf48ef97c1833fdabcde6585b125a8caea50b9 Mon Sep 17 00:00:00 2001
From: lila
Date: Mon, 6 Apr 2026 17:01:34 +0200
Subject: [PATCH] updating documentation

---
 documentation/cefr_enrichment.md   | 216 +++++++++++++++++++++++++++++
 documentation/decisions.md         |  19 +++
 documentation/schema_discussion.md |  39 ++----
 3 files changed, 243 insertions(+), 31 deletions(-)
 create mode 100644 documentation/cefr_enrichment.md

diff --git a/documentation/cefr_enrichment.md b/documentation/cefr_enrichment.md
new file mode 100644
index 0000000..bc6c762
--- /dev/null
+++ b/documentation/cefr_enrichment.md
@@ -0,0 +1,216 @@
# Phase 4 — CEFR Enrichment Pipeline

## Context

This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer
uses Drizzle ORM with Postgres. The project is called Glossa.

**Read `decisions.md` and `schema.ts` before doing anything.** They contain the full
reasoning behind every decision. Do not deviate from established patterns without
flagging it explicitly.

---

## Current State

The database is fully populated with OMW data:

- 95,882 terms (nouns and verbs)
- 225,997 translations (English and Italian)
- `cefr_level` is null on every translation row — this phase populates it

---

## Goal

Build a pipeline that:

1. Normalizes CEFR word lists from multiple sources into a common JSON format
2. Compares sources to surface agreements and conflicts
3. Merges sources into a single authoritative JSON per language
4. Enriches the `translations` table with `cefr_level` values

All scripts live in `packages/db/src/cefr/`.
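The four stages above exchange entries in a single normalized shape (specified in the
next section). A sketch of a shared type and runtime guard for that shape — the real
project should import `CEFR_LEVELS` and `SUPPORTED_POS` from `@glossa/shared`; they are
redeclared here only so the snippet is self-contained:

```typescript
// Sketch only: these constants mirror the @glossa/shared exports.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const SUPPORTED_POS = ["noun", "verb"] as const;

type CefrLevel = (typeof CEFR_LEVELS)[number];
type Pos = (typeof SUPPORTED_POS)[number];

interface NormalizedEntry {
  word: string; // lowercase, trimmed, base form
  pos: Pos;
  cefr: CefrLevel;
  source: string; // e.g. 'esl-lounge', 'kelly'
}

// Runtime guard: rejects anything that violates the field rules,
// including a word that is not already lowercase and trimmed.
function isNormalizedEntry(value: unknown): value is NormalizedEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.word === "string" &&
    v.word === v.word.trim().toLowerCase() &&
    SUPPORTED_POS.includes(v.pos as Pos) &&
    CEFR_LEVELS.includes(v.cefr as CefrLevel) &&
    typeof v.source === "string"
  );
}
```

Extraction scripts can pass their output through `isNormalizedEntry` before writing, so
malformed rows fail loudly at extraction time rather than during merge.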

---

## Normalized JSON Format

Every source extraction script outputs a JSON file in this exact shape:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
  { "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
  { "word": "run", "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]
```

Field rules:

- `word` — lowercase, trimmed, base form
- `pos` — must match `SUPPORTED_POS` values exactly (`'noun'` or `'verb'`)
- `cefr` — must match `CEFR_LEVELS` exactly (`'A1'`–`'C2'`)
- `source` — short identifier string for the source (`'esl-lounge'`, `'kelly'`, etc.)

Output files go in `packages/db/src/cefr/sources/`, named `<source>-<language>.json`,
e.g. `esl-lounge-en.json`, `kelly-en.json`, `kelly-it.json`.

---

## Scripts to Write

### 1. Source extraction scripts (one per source)

`packages/db/src/cefr/extract-<source>.ts`

Each script reads the raw source data (CSV, scraped HTML, whatever format the source
provides) and outputs the normalized JSON format above. Raw source files go in
`packages/db/src/cefr/raw/`.

**Sources to extract for English (start here):**

- `esl-lounge` — word lists at esl-lounge.com, already split by CEFR level and POS.
  Raw data will be provided as text files, one per level.

Add more sources as they become available. Each source is one extraction script and
one output file. Do not combine sources in extraction scripts.

---

### 2. Comparison script

`packages/db/src/cefr/compare.ts`

Reads all normalized JSON files from `sources/` and prints a report:

```
=== CEFR Source Comparison ===

Per source:
  esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
  kelly-en: 3,201 entries (A1: ...)

Overlap (words appearing in multiple sources):
  esl-lounge-en ∩ kelly-en: 1,203 words
    Agreement: 1,089 (90.5%)
    Conflict: 114 (9.5%)

Conflicts (sample, first 20):
  word        pos    esl-lounge  kelly
  -------------------------------------
  "achieve"   verb   B1          A2
  "argue"     verb   B2          B1
  ...

DB coverage (words in sources that match a translation row):
  esl-lounge-en: 1,847 / 2,847 matched (64.9%)
  kelly-en: 2,103 / 3,201 matched (65.7%)
```

This script is read-only — it never writes to the DB.

---

### 3. Merge script

`packages/db/src/cefr/merge.ts`

Reads all normalized JSON files from `sources/` for a given language and produces a
single merged JSON file in `packages/db/src/cefr/merged/<language>.json`.

**Merge rules:**

- If only one source has a word → use that level
- If multiple sources agree → use that level
- If sources conflict → use the level from the highest-priority source

**Source priority order (highest to lowest):**

1. `kelly` — purpose-built for language learning, CEFR-mapped by linguists
2. `esl-lounge` — curated by teachers, reliable but secondary
3. Any additional sources added later

The priority order is defined as a constant at the top of the merge script — easy to
change without touching the logic.

Output format — the same normalized JSON shape, but with the `source` field replaced
by a `sources` array listing the sources that supplied the chosen level:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
  { "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
```

---

### 4. Enrichment script

`packages/db/src/cefr/enrich.ts`

Reads merged JSON files from `merged/` and writes `cefr_level` to matching
`translations` rows.
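The matching itself is pure bookkeeping and can be sketched independently of the
database layer. The row shape below is hypothetical and flattened for illustration —
in the real schema, `pos` lives on the term, not the translation — and the
authoritative matching rules are listed below:

```typescript
// Sketch only: hypothetical flattened row shape, not the real Drizzle schema.
interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
}

interface TranslationRow {
  id: number;
  languageCode: string;
  text: string;
  pos: string; // in the real schema this comes from the joined term
}

// Decide which rows get which level; the actual UPDATEs would run over this map.
function matchCefrLevels(
  entries: MergedEntry[],
  rows: TranslationRow[],
  languageCode: string,
): Map<number, string> {
  // Index merged entries by normalized word + POS.
  const byKey = new Map<string, string>();
  for (const e of entries) byKey.set(`${e.word}|${e.pos}`, e.cefr);

  const updates = new Map<number, string>();
  for (const row of rows) {
    if (row.languageCode !== languageCode) continue;
    // Case-insensitive, trimmed match against the merged word list.
    const level = byKey.get(`${row.text.trim().toLowerCase()}|${row.pos}`);
    if (level !== undefined) updates.set(row.id, level);
  }
  return updates;
}
```

Keeping this step pure makes the matching rules unit-testable without a database; the
script then only has to turn the returned map into `UPDATE` statements.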

**Matching logic:**

- For each entry in the merged JSON, find all `translations` rows where:
  - `language_code` matches the file's language
  - `text` matches the word (case-insensitive, trimmed)
  - The term's `pos` matches the entry's `pos`
- Set `cefr_level` on all matching rows
- Overwrite any existing `cefr_level` unconditionally — this is a plain `UPDATE` of
  existing rows (no insert, so `onConflictDoUpdate` does not apply), which is what
  makes re-running safe

**Logging:**

```
=== CEFR Enrichment ===
Language: en
  Entries in merged file: 2,847
  Matched translation rows: 4,203 (one word can match multiple translations — synonyms)
  Unmatched entries: 644 (words not in DB)
  Updated: 4,203
```

This script IS idempotent — running it twice produces the same result.

---

## File Structure

```
packages/db/src/cefr/
  raw/                     ← raw source files (gitignored if large)
    esl-lounge-a1-en.txt
    esl-lounge-a2-en.txt
    ...
  sources/                 ← normalized JSON per source per language
    esl-lounge-en.json
    kelly-en.json
    kelly-it.json
  merged/                  ← one authoritative JSON per language
    en.json
    it.json
  extract-esl-lounge.ts    ← extraction script
  extract-kelly.ts         ← extraction script (when Kelly data is available)
  compare.ts               ← comparison report
  merge.ts                 ← merge into authoritative file
  enrich.ts                ← write cefr_level to DB
```

---

## What NOT to do

- Do not hardcode CEFR level strings — always use `CEFR_LEVELS` from `@glossa/shared`
- Do not hardcode POS strings — always use `SUPPORTED_POS` from `@glossa/shared`
- Do not hardcode language codes — always use `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`
- Do not modify the schema
- Do not modify `seed.ts` or `generating-decks.ts`
- Do not skip the comparison step — it exists to surface data quality issues before enrichment
- Do not write `cefr_level` directly from raw source files — always go through normalize → merge → enrich

---

## Definition of Done

- All scripts run without TypeScript errors (`pnpm tsc --noEmit`)
- `extract-esl-lounge.ts` produces a valid normalized JSON file
- 
`compare.ts` prints a readable report showing coverage and conflicts
- `merge.ts` produces `merged/en.json` with conflict resolution applied
- `enrich.ts` writes `cefr_level` to matching `translations` rows and is idempotent
- Running `enrich.ts` twice produces the same DB state
- At least some `translations` rows have non-null `cefr_level` after enrichment

diff --git a/documentation/decisions.md b/documentation/decisions.md
index 0fc1244..3112b8c 100644
--- a/documentation/decisions.md
+++ b/documentation/decisions.md
@@ -325,6 +325,25 @@ Exercise types split naturally into Type A (translation, current model) and Type

---

### Term glosses: Italian coverage is sparse (expected)

OMW gloss data is primarily in English. After full import:

- English glosses: 95,882 (~100% of terms)
- Italian glosses: 1,964 (~2% of terms)

This is not a data pipeline problem — it reflects the actual state of OMW. Italian
glosses simply don't exist for most synsets in the dataset.

**Handling in the UI:** fall back to the English gloss when no gloss exists for the
user's language. This is acceptable UX — a definition in the wrong language is better
than no definition at all.

If Italian gloss coverage needs to improve in the future, Wiktionary is the most
likely source — it has broader multilingual definition coverage than OMW.
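The fallback can live in one small helper at display time. A sketch, with a
hypothetical gloss shape (the real rows and column names come from `schema.ts`):

```typescript
// Hypothetical shape: the real gloss rows come from the Drizzle schema.
interface Gloss {
  languageCode: string;
  text: string;
}

// Prefer a gloss in the user's language; otherwise fall back to English;
// return null only when no gloss exists at all.
function pickGloss(glosses: Gloss[], userLanguageCode: string): string | null {
  const own = glosses.find((g) => g.languageCode === userLanguageCode);
  if (own) return own.text;
  const english = glosses.find((g) => g.languageCode === "en");
  return english ? english.text : null;
}
```

Note the project's own rule against hardcoding language codes: in the real helper the
`'en'` fallback should come from a `@glossa/shared` constant rather than a literal.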

---

## Open Research

### Semantic category metadata source

diff --git a/documentation/schema_discussion.md b/documentation/schema_discussion.md
index 0b473f0..577bbed 100644
--- a/documentation/schema_discussion.md
+++ b/documentation/schema_discussion.md
@@ -170,34 +170,11 @@ Too expensive at scale — only viable for small curated additions on top of an
 
 ### implementation roadmap
 
-#### Finalize data model
-
-text
-
-#### Write and run migrations
-
-text
-
-#### Fill the database (expand import pipeline)
-
-text
-
-#### Decide SUBTLEX → cefr_level mapping strategy
-
-text
-
-#### Generate decks
-
-text
-
-#### Finalize game selection flow
-
-text
-
-#### Define Zod schemas in packages/shared
-
-text
-
-#### Implement API
-
-text
+- [x] Finalize data model
+- [x] Write and run migrations
+- [x] Fill the database (expand import pipeline)
+- [ ] Decide SUBTLEX → cefr_level mapping strategy
+- [ ] Generate decks
+- [ ] Finalize game selection flow
+- [ ] Define Zod schemas in packages/shared
+- [ ] Implement API