lila/documentation/cefr_enrichment.md
2026-04-06 17:01:34 +02:00


Phase 4 — CEFR Enrichment Pipeline

Context

This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer uses Drizzle ORM with Postgres. The project is called Glossa.

Read decisions.md and schema.ts before doing anything. They contain the full reasoning behind every decision. Do not deviate from established patterns without flagging it explicitly.


Current State

The database is fully populated with OMW data:

  • 95,882 terms (nouns and verbs)
  • 225,997 translations (English and Italian)
  • cefr_level is null on every translation row — this phase populates it

Goal

Build a pipeline that:

  1. Normalizes CEFR word lists from multiple sources into a common JSON format
  2. Compares sources to surface agreements and conflicts
  3. Merges sources into a single authoritative JSON per language
  4. Enriches the translations table with cefr_level values

All scripts live in packages/db/src/cefr/.


Normalized JSON Format

Every source extraction script outputs a JSON file in this exact shape:

[
  { "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
  { "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
  { "word": "run",  "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]

Field rules:

  • word — lowercase, trimmed, base form
  • pos — must match SUPPORTED_POS values exactly ('noun' or 'verb')
  • cefr — must match CEFR_LEVELS exactly ('A1' through 'C2')
  • source — short identifier string for the source ('esl-lounge', 'kelly', etc.)

Output files go in packages/db/src/cefr/sources/, named <source>-<language>.json, e.g. esl-lounge-en.json, kelly-en.json, kelly-it.json.
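The shape and field rules above can be pinned down as a small TypeScript type plus a validator that each extraction script could run before writing its output. This is a sketch: in the real scripts CEFR_LEVELS and SUPPORTED_POS come from @glossa/shared (they are inlined here for illustration), and validateEntry is a hypothetical helper name.

```typescript
// Normalized entry shape plus a validator an extraction script could run
// before writing its JSON. CEFR_LEVELS and SUPPORTED_POS are inlined here;
// the real scripts import them from @glossa/shared.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const SUPPORTED_POS = ["noun", "verb"] as const;

interface SourceEntry {
  word: string;   // lowercase, trimmed, base form
  pos: string;    // must be one of SUPPORTED_POS
  cefr: string;   // must be one of CEFR_LEVELS
  source: string; // e.g. "esl-lounge"
}

// Returns an error message for a bad entry, or null if the entry is valid.
function validateEntry(e: SourceEntry): string | null {
  if (e.word !== e.word.trim().toLowerCase() || e.word.length === 0)
    return `word not normalized: "${e.word}"`;
  if (!(SUPPORTED_POS as readonly string[]).includes(e.pos))
    return `unsupported pos: "${e.pos}"`;
  if (!(CEFR_LEVELS as readonly string[]).includes(e.cefr))
    return `invalid cefr level: "${e.cefr}"`;
  if (e.source.trim().length === 0) return "empty source";
  return null;
}
```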


Scripts to Write

1. Source extraction scripts (one per source)

packages/db/src/cefr/extract-<source>.ts

Each script reads the raw source data (CSV, scraped HTML, whatever format the source provides) and outputs the normalized JSON format above. Raw source files go in packages/db/src/cefr/raw/.

Sources to extract for English (start here):

  • esl-lounge — word lists at esl-lounge.com, already split by CEFR level and POS. Raw data will be provided as text files, one per level.

Add more sources as they become available. Each source is one extraction script, one output file. Do not combine sources in extraction scripts.


2. Comparison script

packages/db/src/cefr/compare.ts

Reads all normalized JSON files from sources/ and prints a report:

=== CEFR Source Comparison ===

Per source:
  esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
  kelly-en:      3,201 entries (A1: ...)

Overlap (words appearing in multiple sources):
  esl-lounge-en ∩ kelly-en: 1,203 words
    Agreement:  1,089 (90.5%)
    Conflict:     114 (9.5%)

Conflicts (sample, first 20):
  word         pos    esl-lounge  kelly
  --------------------------------------
  "achieve"    verb   B1          A2
  "advantage"  noun   B2          B1
  ...

DB coverage (words in sources that match a translation row):
  esl-lounge-en: 1,847 / 2,847 matched (64.9%)
  kelly-en:      2,103 / 3,201 matched (65.7%)

This script is read-only — it never writes to the DB.
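The overlap, agreement, and conflict numbers in the report boil down to a pairwise comparison keyed on word + pos. A sketch, assuming both inputs are already in the normalized shape; compareSources is an illustrative helper name.

```typescript
interface Entry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

// Compare two normalized source lists on their word+pos overlap.
function compareSources(a: Entry[], b: Entry[]) {
  const key = (e: Entry) => `${e.word}|${e.pos}`;
  const byKey = new Map(a.map((e) => [key(e), e] as [string, Entry]));

  let agreement = 0;
  const conflicts: { word: string; pos: string; a: string; b: string }[] = [];
  for (const e of b) {
    const match = byKey.get(key(e));
    if (!match) continue; // word only in source b: not part of the overlap
    if (match.cefr === e.cefr) agreement += 1;
    else conflicts.push({ word: e.word, pos: e.pos, a: match.cefr, b: e.cefr });
  }
  return { overlap: agreement + conflicts.length, agreement, conflicts };
}
```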


3. Merge script

packages/db/src/cefr/merge.ts

Reads all normalized JSON files from sources/ for a given language and produces a single merged JSON file in packages/db/src/cefr/merged/<language>.json.

Merge rules:

  • If only one source has a word → use that level
  • If multiple sources agree → use that level
  • If sources conflict → use the level from the highest-priority source

Source priority order (highest to lowest):

  1. kelly — purpose-built for language learning, CEFR-mapped by linguists
  2. esl-lounge — curated by teachers, reliable but secondary
  3. Any additional sources added later

Priority order is defined as a constant at the top of the merge script — easy to change without touching the logic.

Output format — same normalized JSON shape but without source field, replaced by sources array showing which sources contributed:

[
  { "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
  { "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
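The three merge rules collapse into one: within each word+pos group, sort by source priority and take the winner. A sketch with the priority constant the spec calls for; mergeGroup is an illustrative name and assumes the caller has already grouped entries by word + pos.

```typescript
// Priority constant as required by the spec: lower index = higher priority.
// Change the list, not the logic, when sources are added.
const SOURCE_PRIORITY = ["kelly", "esl-lounge"] as const;

interface Entry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

// Merge all entries sharing the same word+pos into one merged entry.
function mergeGroup(group: Entry[]): MergedEntry {
  const rank = (s: string) => {
    const i = (SOURCE_PRIORITY as readonly string[]).indexOf(s);
    return i === -1 ? SOURCE_PRIORITY.length : i; // unknown sources rank last
  };
  // Single source, agreement, and conflict all reduce to "highest priority wins".
  const winner = [...group].sort((x, y) => rank(x.source) - rank(y.source))[0];
  return {
    word: winner.word,
    pos: winner.pos,
    cefr: winner.cefr,
    sources: group.map((e) => e.source),
  };
}
```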

4. Enrichment script

packages/db/src/cefr/enrich.ts

Reads merged JSON files from merged/ and writes cefr_level to matching translations rows.

Matching logic:

  • For each entry in the merged JSON, find all translations rows where:
    • language_code matches the file's language
    • text matches the word (case-insensitive, trimmed)
    • The term's pos matches the entry's pos
  • Set cefr_level on all matching rows
  • Overwrite any existing cefr_level with a plain UPDATE — the rows already exist, so there is no insert conflict to handle (re-running is safe)

Logging:

=== CEFR Enrichment ===
Language: en
  Entries in merged file: 2,847
  Matched translation rows: 4,203  (one word can match multiple translations — synonyms)
  Unmatched entries: 644  (words not in DB)
  Updated: 4,203

This script IS idempotent — running it twice produces the same result.
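The matching logic can be sketched as a pure function over plain objects, which also makes the idempotence easy to see: the computed updates depend only on the inputs. Names here (matchEntries, termPos) are illustrative, not the real schema; the actual script would express this as a Drizzle SELECT joined to terms, followed by an UPDATE of the matched rows.

```typescript
interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

// Stand-in for a translations row joined to its term; field names are
// illustrative, not the real schema.
interface TranslationRow {
  id: number;
  languageCode: string;
  text: string;
  termPos: string;
}

function matchEntries(
  entries: MergedEntry[],
  rows: TranslationRow[],
  languageCode: string, // the merged file's language, e.g. "en"
) {
  const updates: { id: number; cefrLevel: string }[] = [];
  const unmatched: MergedEntry[] = [];
  for (const entry of entries) {
    const hits = rows.filter(
      (r) =>
        r.languageCode === languageCode &&
        r.text.trim().toLowerCase() === entry.word && // case-insensitive, trimmed
        r.termPos === entry.pos,
    );
    if (hits.length === 0) unmatched.push(entry); // word not in DB
    for (const r of hits) updates.push({ id: r.id, cefrLevel: entry.cefr });
  }
  return { updates, unmatched };
}
```

One entry can produce several updates (synonyms share a word), which is why the "Matched translation rows" count in the log can exceed the entry count.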


File Structure

packages/db/src/cefr/
  raw/                        ← raw source files (gitignored if large)
    esl-lounge-a1-en.txt
    esl-lounge-a2-en.txt
    ...
  sources/                    ← normalized JSON per source per language
    esl-lounge-en.json
    kelly-en.json
    kelly-it.json
  merged/                     ← one authoritative JSON per language
    en.json
    it.json
  extract-esl-lounge.ts       ← extraction script
  extract-kelly.ts            ← extraction script (when Kelly data is available)
  compare.ts                  ← comparison report
  merge.ts                    ← merge into authoritative file
  enrich.ts                   ← write cefr_level to DB

What NOT to do

  • Do not hardcode CEFR level strings — always use CEFR_LEVELS from @glossa/shared
  • Do not hardcode POS strings — always use SUPPORTED_POS from @glossa/shared
  • Do not hardcode language codes — always use SUPPORTED_LANGUAGE_CODES from @glossa/shared
  • Do not modify the schema
  • Do not modify seed.ts or generating-decks.ts
  • Do not skip the comparison step — it exists to surface data quality issues before enrichment
  • Do not write cefr_level directly from raw source files — always go through normalize → merge → enrich

Definition of Done

  • All scripts run without TypeScript errors (pnpm tsc --noEmit)
  • extract-esl-lounge.ts produces a valid normalized JSON file
  • compare.ts prints a readable report showing coverage and conflicts
  • merge.ts produces merged/en.json with conflict resolution applied
  • enrich.ts writes cefr_level to matching translations rows and is idempotent
  • Running enrich.ts twice produces the same DB state
  • At least some translations rows have non-null cefr_level after enrichment