Phase 4 — CEFR Enrichment Pipeline
Context
This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer uses Drizzle ORM with Postgres. The project is called Glossa.
Read decisions.md and schema.ts before doing anything. They contain the full
reasoning behind every decision. Do not deviate from established patterns without
flagging it explicitly.
Current State
The database is fully populated with OMW data:
- 95,882 terms (nouns and verbs)
- 225,997 translations (English and Italian)
- cefr_level is null on every translation row (this phase populates it)
Goal
Build a pipeline that:
- Normalizes CEFR word lists from multiple sources into a common JSON format
- Compares sources to surface agreements and conflicts
- Merges sources into a single authoritative JSON per language
- Enriches the translations table with cefr_level values
All scripts live in packages/db/src/cefr/.
Normalized JSON Format
Every source extraction script outputs a JSON file in this exact shape:
[
{ "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
{ "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
{ "word": "run", "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]
Field rules:
- word: lowercase, trimmed, base form
- pos: must match SUPPORTED_POS values exactly ('noun' or 'verb')
- cefr: must match CEFR_LEVELS exactly ('A1'–'C2')
- source: short identifier string for the source ('esl-lounge', 'kelly', etc.)
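The field rules can be checked mechanically before a source file is written. The sketch below is one possible shape for such a check, not the project's actual code: validateEntry and RawEntry are hypothetical names, and the two constants are local stand-ins for what @glossa/shared is assumed to export. "Base form" cannot be verified by a simple check like this and is left to the extraction script.

```typescript
// Local stand-ins for the constants assumed to be exported by @glossa/shared.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const SUPPORTED_POS = ["noun", "verb"] as const;

// Hypothetical loose input type: what an extraction script has before validation.
type RawEntry = { word: string; pos: string; cefr: string; source: string };

// Hypothetical helper: returns the list of rule violations (empty means valid).
function validateEntry(entry: RawEntry): string[] {
  const errors: string[] = [];
  if (entry.word.length === 0 || entry.word !== entry.word.trim().toLowerCase()) {
    errors.push("word must be non-empty, lowercase, and trimmed");
  }
  if (!SUPPORTED_POS.some((p) => p === entry.pos)) {
    errors.push(`pos must be one of: ${SUPPORTED_POS.join(", ")}`);
  }
  if (!CEFR_LEVELS.some((level) => level === entry.cefr)) {
    errors.push(`cefr must be one of: ${CEFR_LEVELS.join(", ")}`);
  }
  if (entry.source.trim().length === 0) {
    errors.push("source must be a non-empty identifier");
  }
  return errors;
}
```

In a real script the constants would be imported rather than redeclared, which also satisfies the rule against hardcoding level and POS strings.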
Output files go in packages/db/src/cefr/sources/ and are named <source>-<language>.json,
e.g. esl-lounge-en.json, kelly-en.json, kelly-it.json.
Scripts to Write
1. Source extraction scripts (one per source)
packages/db/src/cefr/extract-<source>.ts
Each script reads the raw source data (CSV, scraped HTML, whatever format the source
provides) and outputs the normalized JSON format above. Raw source files go in
packages/db/src/cefr/raw/.
Sources to extract for English (start here):
- esl-lounge: word lists at esl-lounge.com, already split by CEFR level and POS. Raw data will be provided as text files, one per level.
Add more sources as they become available. Each source is one extraction script, one output file. Do not combine sources in extraction scripts.
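The core of an extraction script is a small normalization step. The sketch below assumes the raw text holds one word per line, which may not match the real esl-lounge files, and it takes pos, cefr, and source as arguments because the raw file layout is not specified here. normalizeWordList is a hypothetical name.

```typescript
type NormalizedEntry = { word: string; pos: string; cefr: string; source: string };

// Hypothetical helper: turns the raw text of one word list into normalized entries.
// Assumption: one word per line. Adjust the splitting to the real raw format.
function normalizeWordList(
  rawText: string,
  pos: string,
  cefr: string,
  source: string,
): NormalizedEntry[] {
  const seen = new Set<string>();
  const entries: NormalizedEntry[] = [];
  for (const line of rawText.split(/\r?\n/)) {
    const word = line.trim().toLowerCase();
    // Skip blank lines and words already emitted for this list.
    if (word.length === 0 || seen.has(word)) continue;
    seen.add(word);
    entries.push({ word, pos, cefr, source });
  }
  return entries;
}
```

Reading the file from raw/ and writing the JSON to sources/ would wrap this function. Keeping the normalization pure makes it easy to test without touching the filesystem.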
2. Comparison script
packages/db/src/cefr/compare.ts
Reads all normalized JSON files from sources/ and prints a report:
=== CEFR Source Comparison ===
Per source:
esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
kelly-en: 3,201 entries (A1: ...)
Overlap (words appearing in multiple sources):
esl-lounge-en ∩ kelly-en: 1,203 words
Agreement: 1,089 (90.5%)
Conflict: 114 (9.5%)
Conflicts (sample, first 20):
word       pos   esl-lounge  kelly
----------------------------------
"achieve"  verb  B1          A2
...
DB coverage (words in sources that match a translation row):
esl-lounge-en: 1,847 / 2,847 matched (64.9%)
kelly-en: 2,103 / 3,201 matched (65.7%)
This script is read-only — it never writes to the DB.
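The overlap, agreement, and conflict figures in the report could be computed along these lines. This is a sketch under two assumptions: entries are compared on the (word, pos) pair rather than on the word alone, and each pair appears at most once per source. compareSources and OverlapReport are hypothetical names.

```typescript
type NormalizedEntry = { word: string; pos: string; cefr: string; source: string };

interface OverlapReport {
  overlap: number;
  agreement: number;
  conflicts: { word: string; pos: string; a: string; b: string }[];
}

// Hypothetical helper: compares two normalized lists on the (word, pos) key.
function compareSources(a: NormalizedEntry[], b: NormalizedEntry[]): OverlapReport {
  const key = (e: NormalizedEntry): string => `${e.word}|${e.pos}`;
  const levelsInA = new Map<string, string>();
  for (const e of a) levelsInA.set(key(e), e.cefr);

  const report: OverlapReport = { overlap: 0, agreement: 0, conflicts: [] };
  for (const e of b) {
    const levelA = levelsInA.get(key(e));
    if (levelA === undefined) continue; // present only in b, so not part of the overlap
    report.overlap += 1;
    if (levelA === e.cefr) {
      report.agreement += 1;
    } else {
      report.conflicts.push({ word: e.word, pos: e.pos, a: levelA, b: e.cefr });
    }
  }
  return report;
}
```

The percentages in the report follow from agreement / overlap and conflicts.length / overlap.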
3. Merge script
packages/db/src/cefr/merge.ts
Reads all normalized JSON files from sources/ for a given language and produces a
single merged JSON file in packages/db/src/cefr/merged/<language>.json.
Merge rules:
- If only one source has a word → use that level
- If multiple sources agree → use that level
- If sources conflict → use the level from the highest-priority source
Source priority order (highest to lowest):
- kelly: purpose-built for language learning, CEFR-mapped by linguists
- esl-lounge: curated by teachers, reliable but secondary
- Any additional sources added later
Priority order is defined as a constant at the top of the merge script — easy to change without touching the logic.
Output format: the same normalized JSON shape, but with the source field replaced by a
sources array listing the sources that contributed the chosen level:
[
{ "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
{ "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
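A minimal sketch of the merge rules follows, assuming entries are grouped on the (word, pos) pair and that the sources array lists only the sources whose level matches the chosen one (inferred from the "achieve" example, not stated as a rule). SOURCE_PRIORITY, priorityRank, and mergeEntries are hypothetical names.

```typescript
type NormalizedEntry = { word: string; pos: string; cefr: string; source: string };
type MergedEntry = { word: string; pos: string; cefr: string; sources: string[] };

// Highest priority first. Kept as a constant so it can change without touching the logic.
const SOURCE_PRIORITY = ["kelly", "esl-lounge"];

// Sources missing from SOURCE_PRIORITY rank after all listed ones.
function priorityRank(source: string): number {
  const index = SOURCE_PRIORITY.indexOf(source);
  return index === -1 ? SOURCE_PRIORITY.length : index;
}

function mergeEntries(entries: NormalizedEntry[]): MergedEntry[] {
  // Group all entries for the same (word, pos) pair.
  const groups = new Map<string, NormalizedEntry[]>();
  for (const e of entries) {
    const k = `${e.word}|${e.pos}`;
    const group = groups.get(k);
    if (group) group.push(e);
    else groups.set(k, [e]);
  }

  const merged: MergedEntry[] = [];
  groups.forEach((group) => {
    // The best-ranked source decides the level. This also covers the
    // single-source and full-agreement cases, where every entry has the same level.
    const winner = group.reduce((best, e) =>
      priorityRank(e.source) < priorityRank(best.source) ? e : best,
    );
    const sources = group
      .filter((e) => e.cefr === winner.cefr)
      .map((e) => e.source)
      .sort();
    merged.push({ word: winner.word, pos: winner.pos, cefr: winner.cefr, sources });
  });
  return merged;
}
```

Because all three merge rules reduce to "take the level of the best-ranked source", no separate branches are needed for the agreement and single-source cases.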
4. Enrichment script
packages/db/src/cefr/enrich.ts
Reads merged JSON files from merged/ and writes cefr_level to matching
translations rows.
Matching logic:
- For each entry in the merged JSON, find all translations rows where:
  - language_code matches the file's language
  - text matches the word (case-insensitive, trimmed)
  - The term's pos matches the entry's pos
- Set cefr_level on all matching rows
- Overwrite any existing cefr_level value so re-running is safe. The rows already exist, so this is an UPDATE; onConflictDoUpdate applies only to inserts.
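The matching logic can be illustrated in memory, without a database. This is only a sketch of the rules above, not Drizzle code: TranslationRow is a hypothetical flattened view of a translations row joined with its term's pos, and applyLevels is a hypothetical name. The real script would express the same conditions as a query.

```typescript
type MergedEntry = { word: string; pos: string; cefr: string; sources: string[] };

// Hypothetical flattened view: a translations row joined with its term's pos.
type TranslationRow = {
  id: number;
  languageCode: string;
  text: string;
  pos: string;
  cefrLevel: string | null;
};

// Applies merged levels to rows in place and returns how many rows matched.
function applyLevels(rows: TranslationRow[], merged: MergedEntry[], language: string): number {
  const levelByKey = new Map<string, string>();
  for (const e of merged) levelByKey.set(`${e.word}|${e.pos}`, e.cefr);

  let matched = 0;
  for (const row of rows) {
    if (row.languageCode !== language) continue;
    // Case-insensitive, trimmed comparison of the row text against the word.
    const level = levelByKey.get(`${row.text.trim().toLowerCase()}|${row.pos}`);
    if (level === undefined) continue;
    row.cefrLevel = level; // plain overwrite, so a second run leaves the same state
    matched += 1;
  }
  return matched;
}
```

Since every matching row is overwritten with the value from the merged file, running the function twice over the same inputs yields the same rows, which is the idempotency property the script needs.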
Logging:
=== CEFR Enrichment ===
Language: en
Entries in merged file: 2,847
Matched translation rows: 4,203 (one word can match multiple translations — synonyms)
Unmatched entries: 644 (words not in DB)
Updated: 4,203
This script IS idempotent — running it twice produces the same result.
File Structure
packages/db/src/cefr/
  raw/                     ← raw source files (gitignored if large)
    esl-lounge-a1-en.txt
    esl-lounge-a2-en.txt
    ...
  sources/                 ← normalized JSON per source per language
    esl-lounge-en.json
    kelly-en.json
    kelly-it.json
  merged/                  ← one authoritative JSON per language
    en.json
    it.json
  extract-esl-lounge.ts    ← extraction script
  extract-kelly.ts         ← extraction script (when Kelly data is available)
  compare.ts               ← comparison report
  merge.ts                 ← merge into authoritative file
  enrich.ts                ← write cefr_level to DB
What NOT to do
- Do not hardcode CEFR level strings; always use CEFR_LEVELS from @glossa/shared
- Do not hardcode POS strings; always use SUPPORTED_POS from @glossa/shared
- Do not hardcode language codes; always use SUPPORTED_LANGUAGE_CODES from @glossa/shared
- Do not modify the schema
- Do not modify seed.ts or generating-decks.ts
- Do not skip the comparison step; it exists to surface data quality issues before enrichment
- Do not write cefr_level directly from raw source files; always go through normalize → merge → enrich
Definition of Done
- All scripts run without TypeScript errors (pnpm tsc --noEmit)
- extract-esl-lounge.ts produces a valid normalized JSON file
- compare.ts prints a readable report showing coverage and conflicts
- merge.ts produces merged/en.json with conflict resolution applied
- enrich.ts writes cefr_level to matching translations rows and is idempotent
- Running enrich.ts twice produces the same DB state
- At least some translations rows have non-null cefr_level after enrichment