# Phase 4: CEFR Enrichment Pipeline

## Context

This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer
uses Drizzle ORM with Postgres. The project is called Glossa.

**Read `decisions.md` and `schema.ts` before doing anything.** They contain the full
reasoning behind every decision. Do not deviate from established patterns without
flagging it explicitly.

---

## Current State

The database is fully populated with OMW (Open Multilingual Wordnet) data:
- 95,882 terms (nouns and verbs)
- 225,997 translations (English and Italian)
- `cefr_level` is null on every translation row; this phase populates it

---

## Goal

Build a pipeline that:
1. Normalizes CEFR word lists from multiple sources into a common JSON format
2. Compares sources to surface agreements and conflicts
3. Merges sources into a single authoritative JSON per language
4. Enriches the `translations` table with `cefr_level` values

All scripts live in `packages/db/src/cefr/`.

---

## Normalized JSON Format

Every source extraction script outputs a JSON file in this exact shape:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
  { "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
  { "word": "run", "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]
```

Field rules:
- `word`: lowercase, trimmed, base form
- `pos`: must match `SUPPORTED_POS` values exactly (`'noun'` or `'verb'`)
- `cefr`: must match `CEFR_LEVELS` exactly (`'A1'` through `'C2'`)
- `source`: short identifier string for the source (`'esl-lounge'`, `'kelly'`, etc.)

Output files go in `packages/db/src/cefr/sources/` and are named `<source>-<language>.json`,
e.g. `esl-lounge-en.json`, `kelly-en.json`, `kelly-it.json`.
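
The field rules above can be checked mechanically before a file is written to `sources/`. The following is a minimal sketch, not code from the existing repo: `normalizeWord` and `isNormalizedEntry` are hypothetical helper names, and the two constants are local stand-ins for `SUPPORTED_POS` and `CEFR_LEVELS` from `@glossa/shared`, redeclared only so the example runs on its own. Real scripts should import the shared constants instead.

```typescript
// Stand-ins for the constants exported by `@glossa/shared` (assumption: the
// real values match the field rules). Import the real ones in actual scripts.
const SUPPORTED_POS = ["noun", "verb"] as const;
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;

type Pos = (typeof SUPPORTED_POS)[number];
type CefrLevel = (typeof CEFR_LEVELS)[number];

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: CefrLevel;
  source: string;
}

// Hypothetical helper: applies the "lowercase, trimmed" part of the `word` rule.
// Reducing a word to its base form is source-specific and is not attempted here.
function normalizeWord(raw: string): string {
  return raw.trim().toLowerCase();
}

// Hypothetical type guard: true only when every field rule holds.
function isNormalizedEntry(value: unknown): value is NormalizedEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.word === "string" &&
    v.word.length > 0 &&
    v.word === normalizeWord(v.word) &&
    typeof v.pos === "string" &&
    (SUPPORTED_POS as readonly string[]).indexOf(v.pos) !== -1 &&
    typeof v.cefr === "string" &&
    (CEFR_LEVELS as readonly string[]).indexOf(v.cefr) !== -1 &&
    typeof v.source === "string" &&
    v.source.length > 0
  );
}
```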

---

## Scripts to Write

### 1. Source extraction scripts (one per source)

`packages/db/src/cefr/extract-<source>.ts`

Each script reads the raw source data (CSV, scraped HTML, or whatever format the source
provides) and outputs the normalized JSON format above. Raw source files go in
`packages/db/src/cefr/raw/`.

**Sources to extract for English (start here):**
- `esl-lounge`: word lists at esl-lounge.com, already split by CEFR level and POS.
  Raw data will be provided as text files, one per level.

Add more sources as they become available. Each source gets one extraction script and
one output file per language. Do not combine sources in extraction scripts.
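
As an illustration of what an extraction script does, here is a rough sketch of the parsing step for one level file. The layout of the raw `esl-lounge` files is not specified here, so the sketch assumes a purely hypothetical layout: a `[noun]` or `[verb]` header line followed by one word per line. `parseLevelFile` is an illustrative name, and `SUPPORTED_POS` is a local stand-in for the constant from `@glossa/shared`. Reading the file from `raw/` and writing the result to `sources/` are left out.

```typescript
// Stand-in for `SUPPORTED_POS` from `@glossa/shared`; import it in real scripts.
const SUPPORTED_POS = ["noun", "verb"] as const;
type Pos = (typeof SUPPORTED_POS)[number];

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: string;
  source: string;
}

// Assumed raw layout (hypothetical, adjust to the real files):
//   [noun]
//   bank
//   [verb]
//   run
function parseLevelFile(text: string, cefr: string, source: string): NormalizedEntry[] {
  const entries: NormalizedEntry[] = [];
  const seen = new Set<string>();
  let currentPos: Pos | null = null;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim().toLowerCase();
    if (line === "") continue;

    // A bracketed line switches the current POS section.
    const header = /^\[(.+)\]$/.exec(line);
    if (header) {
      const sectionName = header[1];
      // Sections with an unsupported POS are skipped entirely.
      currentPos = SUPPORTED_POS.find((p) => p === sectionName) ?? null;
      continue;
    }
    if (currentPos === null) continue;

    // Drop duplicates of the same word + POS within one file.
    const key = `${line}|${currentPos}`;
    if (seen.has(key)) continue;
    seen.add(key);
    entries.push({ word: line, pos: currentPos, cefr, source });
  }
  return entries;
}
```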

---

### 2. Comparison script

`packages/db/src/cefr/compare.ts`

Reads all normalized JSON files from `sources/` and prints a report:

```
=== CEFR Source Comparison ===

Per source:
  esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
  kelly-en: 3,201 entries (A1: ...)

Overlap (words appearing in multiple sources):
  esl-lounge-en ∩ kelly-en: 1,203 words
  Agreement: 1,089 (90.5%)
  Conflict: 114 (9.5%)

Conflicts (sample, first 20):
  word       pos   esl-lounge  kelly
  -------------------------------
  "achieve"  verb  B1          A2
  "ancient"  adj   B2          B1
  ...

DB coverage (words in sources that match a translation row):
  esl-lounge-en: 1,847 / 2,847 matched (64.9%)
  kelly-en: 2,103 / 3,201 matched (65.7%)
```

This script is read-only; it never writes to the DB.
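
The overlap, agreement, and conflict figures in the report could be computed along these lines. This is a sketch under assumptions: `compareSources` and `percent` are hypothetical names, entries are keyed by `word` plus `pos`, and each source is assumed to contain at most one entry per key. Loading the files and the DB coverage query are left out.

```typescript
interface NormalizedEntry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface Conflict {
  word: string;
  pos: string;
  levelA: string;
  levelB: string;
}

interface Comparison {
  overlap: number;
  agreement: number;
  conflicts: Conflict[];
}

// Compares two sources entry by entry, keyed by word + POS.
function compareSources(a: NormalizedEntry[], b: NormalizedEntry[]): Comparison {
  const levelsInA = new Map<string, string>();
  for (const entry of a) {
    levelsInA.set(`${entry.word}|${entry.pos}`, entry.cefr);
  }

  let overlap = 0;
  let agreement = 0;
  const conflicts: Conflict[] = [];
  for (const entry of b) {
    const levelA = levelsInA.get(`${entry.word}|${entry.pos}`);
    if (levelA === undefined) continue; // only in b, not part of the overlap
    overlap += 1;
    if (levelA === entry.cefr) {
      agreement += 1;
    } else {
      conflicts.push({ word: entry.word, pos: entry.pos, levelA, levelB: entry.cefr });
    }
  }
  return { overlap, agreement, conflicts };
}

// Formats a share as a percentage with one decimal, as in the sample report.
function percent(part: number, total: number): string {
  return total === 0 ? "0.0%" : `${((part / total) * 100).toFixed(1)}%`;
}
```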

---

### 3. Merge script

`packages/db/src/cefr/merge.ts`

Reads all normalized JSON files from `sources/` for a given language and produces a
single merged JSON file at `packages/db/src/cefr/merged/<language>.json`.

**Merge rules:**
- If only one source has a word → use that level
- If multiple sources agree → use that level
- If sources conflict → use the level from the highest-priority source

Entries are compared per `word` and `pos` pair, so `bank` (noun) and `bank` (verb) never conflict.

**Source priority order (highest to lowest):**
1. `kelly`: purpose-built for language learning, CEFR-mapped by linguists
2. `esl-lounge`: curated by teachers, reliable but secondary
3. Any additional sources added later

Priority order is defined as a constant at the top of the merge script, so it is easy to
change without touching the logic.

The output uses the same normalized JSON shape, except that the `source` field is replaced by a
`sources` array showing which sources contributed:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
  { "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
```
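
The merge rules can be sketched as a pure function over the combined entries of all sources for one language. Several things here are assumptions rather than settled design: the names `SOURCE_PRIORITY`, `priorityRank`, and `mergeEntries` are illustrative, sources missing from the priority list rank last, and `sources` is taken to list only the sources whose level equals the chosen one (which is one reading of the `achieve` example above). Each source is assumed to contribute at most one entry per `word` and `pos` pair.

```typescript
// Highest priority first. Kept as a single constant so the order can change
// without touching the merge logic.
const SOURCE_PRIORITY: readonly string[] = ["kelly", "esl-lounge"];

interface NormalizedEntry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

// Lower rank wins. Sources not listed in SOURCE_PRIORITY rank last (assumption).
function priorityRank(source: string): number {
  const index = SOURCE_PRIORITY.indexOf(source);
  return index === -1 ? SOURCE_PRIORITY.length : index;
}

function mergeEntries(entries: NormalizedEntry[]): MergedEntry[] {
  // Group entries by word + POS.
  const groups = new Map<string, NormalizedEntry[]>();
  for (const entry of entries) {
    const key = `${entry.word}|${entry.pos}`;
    const group = groups.get(key);
    if (group) group.push(entry);
    else groups.set(key, [entry]);
  }

  const merged: MergedEntry[] = [];
  groups.forEach((group) => {
    // With one source or full agreement, any entry carries the right level;
    // on conflict, the entry from the highest-priority source wins.
    const winner = group.reduce((best, entry) =>
      priorityRank(entry.source) < priorityRank(best.source) ? entry : best
    );
    // Assumption: list only the sources that agree with the chosen level.
    const sources = group
      .filter((entry) => entry.cefr === winner.cefr)
      .map((entry) => entry.source)
      .sort();
    merged.push({ word: winner.word, pos: winner.pos, cefr: winner.cefr, sources });
  });
  return merged;
}
```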

---

### 4. Enrichment script

`packages/db/src/cefr/enrich.ts`

Reads merged JSON files from `merged/` and writes `cefr_level` to matching
`translations` rows.

**Matching logic:**
- For each entry in the merged JSON, find all `translations` rows where:
  - `language_code` matches the file's language
  - `text` matches the word (case-insensitive, trimmed)
  - The term's `pos` matches the entry's `pos`
- Set `cefr_level` on all matching rows
- Overwrite any existing `cefr_level` so that re-running is safe. Note that `onConflictDoUpdate`
  applies to `INSERT` statements; when only existing rows are modified, a plain `UPDATE`
  already overwrites the value.

**Logging:**
```
=== CEFR Enrichment ===
Language: en
Entries in merged file: 2,847
Matched translation rows: 4,203 (one word can match multiple translations, e.g. synonyms)
Unmatched entries: 644 (words not in DB)
Updated: 4,203
```

This script IS idempotent: running it twice produces the same result.
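
The matching logic can be separated from the database write and tested on its own. The sketch below is one possible shape, not the established pattern from `schema.ts`: `TranslationRow` is a hypothetical view of a `translations` row joined with its term's `pos`, and `planUpdates` is an illustrative name. It only computes which rows would receive which level and which entries have no match; the actual write with Drizzle is left out. Because the plan depends only on its inputs, applying it twice leaves the same `cefr_level` values.

```typescript
interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
}

// Hypothetical row shape: a translation joined with the POS of its term.
interface TranslationRow {
  id: number;
  languageCode: string;
  text: string;
  pos: string;
}

interface PlannedUpdate {
  translationId: number;
  cefrLevel: string;
}

interface EnrichmentPlan {
  updates: PlannedUpdate[];
  unmatched: MergedEntry[];
}

function planUpdates(
  merged: MergedEntry[],
  rows: TranslationRow[],
  languageCode: string
): EnrichmentPlan {
  // Index rows of the target language by normalized text + POS.
  const rowsByKey = new Map<string, TranslationRow[]>();
  for (const row of rows) {
    if (row.languageCode !== languageCode) continue;
    const key = `${row.text.trim().toLowerCase()}|${row.pos}`;
    const bucket = rowsByKey.get(key);
    if (bucket) bucket.push(row);
    else rowsByKey.set(key, [row]);
  }

  const updates: PlannedUpdate[] = [];
  const unmatched: MergedEntry[] = [];
  for (const entry of merged) {
    const matches = rowsByKey.get(`${entry.word}|${entry.pos}`);
    if (!matches) {
      unmatched.push(entry); // word not in DB
      continue;
    }
    // One entry can match several translation rows; all of them get the level.
    for (const row of matches) {
      updates.push({ translationId: row.id, cefrLevel: entry.cefr });
    }
  }
  return { updates, unmatched };
}
```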

---

## File Structure

```
packages/db/src/cefr/
  raw/                      ← raw source files (gitignored if large)
    esl-lounge-a1-en.txt
    esl-lounge-a2-en.txt
    ...
  sources/                  ← normalized JSON per source per language
    esl-lounge-en.json
    kelly-en.json
    kelly-it.json
  merged/                   ← one authoritative JSON per language
    en.json
    it.json
  extract-esl-lounge.ts     ← extraction script
  extract-kelly.ts          ← extraction script (when Kelly data is available)
  compare.ts                ← comparison report
  merge.ts                  ← merge into authoritative file
  enrich.ts                 ← write cefr_level to DB
```

---

## What NOT to do

- Do not hardcode CEFR level strings; always use `CEFR_LEVELS` from `@glossa/shared`
- Do not hardcode POS strings; always use `SUPPORTED_POS` from `@glossa/shared`
- Do not hardcode language codes; always use `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`
- Do not modify the schema
- Do not modify `seed.ts` or `generating-decks.ts`
- Do not skip the comparison step; it exists to surface data quality issues before enrichment
- Do not write `cefr_level` directly from raw source files; always go through normalize → merge → enrich

---

## Definition of Done

- All scripts type-check without TypeScript errors (`pnpm tsc --noEmit`)
- `extract-esl-lounge.ts` produces a valid normalized JSON file
- `compare.ts` prints a readable report showing coverage and conflicts
- `merge.ts` produces `merged/en.json` with conflict resolution applied
- `enrich.ts` writes `cefr_level` to matching `translations` rows and is idempotent
  (running it twice produces the same DB state)
- At least some `translations` rows have non-null `cefr_level` after enrichment