updating documentation
parent
570dbff25e
commit
60cf48ef97
3 changed files with 243 additions and 31 deletions
216
documentation/cefr_enrichment.md
Normal file
@ -0,0 +1,216 @@
# Phase 4 — CEFR Enrichment Pipeline

## Context

This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer
uses Drizzle ORM with Postgres. The project is called Glossa.

**Read `decisions.md` and `schema.ts` before doing anything.** They contain the full
reasoning behind every decision. Do not deviate from established patterns without
flagging it explicitly.

---
## Current State

The database is fully populated with OMW data:
- 95,882 terms (nouns and verbs)
- 225,997 translations (English and Italian)
- `cefr_level` is null on every translation row — this phase populates it

---
## Goal

Build a pipeline that:
1. Normalizes CEFR word lists from multiple sources into a common JSON format
2. Compares sources to surface agreements and conflicts
3. Merges sources into a single authoritative JSON per language
4. Enriches the `translations` table with `cefr_level` values

All scripts live in `packages/db/src/cefr/`.

---
## Normalized JSON Format

Every source extraction script outputs a JSON file in this exact shape:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
  { "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
  { "word": "run", "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]
```
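A minimal sketch of how an extraction script might type and check entries of this shape. The helper name `toNormalizedEntry` is hypothetical, and the two constants are local stand-ins so the snippet runs on its own; the real scripts are expected to import `CEFR_LEVELS` and `SUPPORTED_POS` from `@glossa/shared`. The lowercase and trim step and the exact-match checks follow the field rules in this section.

```typescript
// Local stand-ins for illustration only. In the real scripts these constants
// come from `@glossa/shared` and must not be redefined.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const SUPPORTED_POS = ["noun", "verb"] as const;

type CefrLevel = (typeof CEFR_LEVELS)[number];
type Pos = (typeof SUPPORTED_POS)[number];

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: CefrLevel;
  source: string;
}

// Type guards: exact matches only, no case folding for `pos` or `cefr`.
function isPos(value: string): value is Pos {
  return SUPPORTED_POS.some((p) => p === value);
}

function isCefrLevel(value: string): value is CefrLevel {
  return CEFR_LEVELS.some((level) => level === value);
}

// Hypothetical helper: returns a clean entry, or null when the raw record
// cannot be expressed in the normalized shape (unknown POS, unknown level,
// or an empty word). Whether the word is a base form is not checked here.
function toNormalizedEntry(
  raw: { word: string; pos: string; cefr: string },
  source: string,
): NormalizedEntry | null {
  const word = raw.word.trim().toLowerCase();
  if (word === "" || !isPos(raw.pos) || !isCefrLevel(raw.cefr)) {
    return null;
  }
  return { word, pos: raw.pos, cefr: raw.cefr, source };
}
```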
Field rules:
- `word` — lowercase, trimmed, base form
- `pos` — must match `SUPPORTED_POS` values exactly (`'noun'` or `'verb'`)
- `cefr` — must match `CEFR_LEVELS` exactly (`'A1'`–`'C2'`)
- `source` — short identifier string for the source (`'esl-lounge'`, `'kelly'`, etc.)

Output files go in `packages/db/src/cefr/sources/` and are named `<source>-<language>.json`,
e.g. `esl-lounge-en.json`, `kelly-en.json`, `kelly-it.json`.

---
## Scripts to Write

### 1. Source extraction scripts (one per source)

`packages/db/src/cefr/extract-<source>.ts`

Each script reads the raw source data (CSV, scraped HTML, whatever format the source
provides) and outputs the normalized JSON format above. Raw source files go in
`packages/db/src/cefr/raw/`.

**Sources to extract for English (start here):**
- `esl-lounge` — word lists at esl-lounge.com, already split by CEFR level and POS.
  Raw data will be provided as text files, one per level.

Add more sources as they become available. Each source is one extraction script,
one output file. Do not combine sources in extraction scripts.

---
### 2. Comparison script

`packages/db/src/cefr/compare.ts`

Reads all normalized JSON files from `sources/` and prints a report:

```
=== CEFR Source Comparison ===

Per source:
  esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
  kelly-en:      3,201 entries (A1: ...)

Overlap (words appearing in multiple sources):
  esl-lounge-en ∩ kelly-en: 1,203 words
  Agreement: 1,089 (90.5%)
  Conflict:    114 (9.5%)

Conflicts (sample, first 20):
  word       pos   esl-lounge  kelly
  ----------------------------------
  "achieve"  verb  B1          A2
  "ancient"  adj   B2          B1
  ...

DB coverage (words in sources that match a translation row):
  esl-lounge-en: 1,847 / 2,847 matched (64.9%)
  kelly-en:      2,103 / 3,201 matched (65.7%)
```
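The overlap, agreement, and conflict figures in a report like this could be computed along the following lines. This is a sketch under assumptions, not the required implementation: `compareSources`, `keyOf`, and `percent` are hypothetical names, entries are matched on word plus POS, and each source is assumed to list a given word and POS at most once.

```typescript
interface NormalizedEntry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface OverlapReport {
  overlap: number;
  agreement: number;
  conflicts: Array<{ word: string; pos: string; a: string; b: string }>;
}

// Key entries by word and POS so "bank" (noun) and "bank" (verb) stay separate.
const keyOf = (entry: { word: string; pos: string }): string =>
  `${entry.word}|${entry.pos}`;

// Compare two normalized sources. Pure function: reads arrays, writes nothing.
function compareSources(a: NormalizedEntry[], b: NormalizedEntry[]): OverlapReport {
  const levelsInB = new Map<string, string>();
  for (const entry of b) {
    levelsInB.set(keyOf(entry), entry.cefr);
  }

  const report: OverlapReport = { overlap: 0, agreement: 0, conflicts: [] };
  for (const entry of a) {
    const other = levelsInB.get(keyOf(entry));
    if (other === undefined) continue; // only present in source a
    report.overlap += 1;
    if (other === entry.cefr) {
      report.agreement += 1;
    } else {
      report.conflicts.push({ word: entry.word, pos: entry.pos, a: entry.cefr, b: other });
    }
  }
  return report;
}

// Format a share of a total as a percentage with one decimal place.
const percent = (part: number, total: number): string =>
  total === 0 ? "0.0%" : `${((part / total) * 100).toFixed(1)}%`;
```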
This script is read-only — it never writes to the DB.

---
### 3. Merge script

`packages/db/src/cefr/merge.ts`

Reads all normalized JSON files from `sources/` for a given language and produces a
single merged JSON file in `packages/db/src/cefr/merged/<language>.json`.

**Merge rules:**
- If only one source has a word → use that level
- If multiple sources agree → use that level
- If sources conflict → use the level from the highest-priority source

**Source priority order (highest to lowest):**
1. `kelly` — purpose-built for language learning, CEFR-mapped by linguists
2. `esl-lounge` — curated by teachers, reliable but secondary
3. Any additional sources added later

Priority order is defined as a constant at the top of the merge script — easy to
change without touching the logic.

Output format: the same normalized JSON shape, but with the `source` field replaced by
a `sources` array showing which sources contributed:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
  { "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
```
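The merge rules can be sketched roughly as follows. The names `SOURCE_PRIORITY`, `rank`, and `mergeSources` are hypothetical. The sketch also assumes that `sources` lists only the sources whose level equals the chosen one, which is one reading of the example output (a source that lost a conflict is left out), and that sources missing from the priority list rank last.

```typescript
interface NormalizedEntry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

// Highest priority first. Kept as a constant so it can change without
// touching the merge logic. Sources not listed here rank last (assumption).
const SOURCE_PRIORITY: readonly string[] = ["kelly", "esl-lounge"];

function rank(source: string): number {
  const index = SOURCE_PRIORITY.indexOf(source);
  return index === -1 ? SOURCE_PRIORITY.length : index;
}

function mergeSources(entries: NormalizedEntry[]): MergedEntry[] {
  // Group all entries for the same word and POS across sources.
  const groups = new Map<string, NormalizedEntry[]>();
  for (const entry of entries) {
    const key = `${entry.word}|${entry.pos}`;
    const group = groups.get(key);
    if (group) {
      group.push(entry);
    } else {
      groups.set(key, [entry]);
    }
  }

  const merged: MergedEntry[] = [];
  groups.forEach((group) => {
    // The best-ranked source decides the level. With a single source or full
    // agreement this is trivially the shared level.
    const ranked = group.slice().sort((x, y) => rank(x.source) - rank(y.source));
    const winner = ranked[0];
    if (winner === undefined) return; // groups are never empty
    const sources = ranked
      .filter((entry) => entry.cefr === winner.cefr)
      .map((entry) => entry.source)
      .sort();
    merged.push({ word: winner.word, pos: winner.pos, cefr: winner.cefr, sources });
  });
  return merged;
}
```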
---

### 4. Enrichment script

`packages/db/src/cefr/enrich.ts`

Reads merged JSON files from `merged/` and writes `cefr_level` to matching
`translations` rows.

**Matching logic:**
- For each entry in the merged JSON, find all `translations` rows where:
  - `language_code` matches the file's language
  - `text` matches the word (case-insensitive, trimmed)
  - The term's `pos` matches the entry's `pos`
- Set `cefr_level` on all matching rows
- Overwrite existing values so that re-running is safe. A plain Drizzle `update().set().where()` does this; `onConflictDoUpdate` exists only on `insert()` and is not needed to update existing rows

**Logging:**
```
=== CEFR Enrichment ===
Language: en
Entries in merged file: 2,847
Matched translation rows: 4,203 (one word can match multiple translations — synonyms)
Unmatched entries: 644 (words not in DB)
Updated: 4,203
```
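One way to keep the matching rules testable is to separate them from the SQL: a pure planning step decides which rows get which level, and the script then applies the plan. The sketch below shows only that planning step. `TranslationRow` is a hypothetical, simplified view of a `translations` row joined with its term's `pos` (the real column names live in `schema.ts`), and `planEnrichment` is an illustrative name. The lengths of `updates` and `unmatched` would feed the "Matched translation rows" and "Unmatched entries" log lines.

```typescript
interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

// Hypothetical shape: a translation row joined with the POS of its term.
interface TranslationRow {
  id: number;
  languageCode: string;
  text: string;
  pos: string;
}

interface EnrichmentPlan {
  updates: Array<{ id: number; cefrLevel: string }>;
  unmatched: MergedEntry[];
}

// Case-insensitive, trimmed comparison as required by the matching rules.
const normalize = (text: string): string => text.trim().toLowerCase();

function planEnrichment(
  language: string,
  merged: MergedEntry[],
  rows: TranslationRow[],
): EnrichmentPlan {
  // Index the rows of this language by normalized text and POS.
  const rowsByKey = new Map<string, TranslationRow[]>();
  for (const row of rows) {
    if (row.languageCode !== language) continue;
    const key = `${normalize(row.text)}|${row.pos}`;
    const bucket = rowsByKey.get(key);
    if (bucket) {
      bucket.push(row);
    } else {
      rowsByKey.set(key, [row]);
    }
  }

  // One merged entry may match several rows (synonyms), or none at all.
  const plan: EnrichmentPlan = { updates: [], unmatched: [] };
  for (const entry of merged) {
    const matches = rowsByKey.get(`${normalize(entry.word)}|${entry.pos}`);
    if (!matches) {
      plan.unmatched.push(entry);
      continue;
    }
    for (const row of matches) {
      plan.updates.push({ id: row.id, cefrLevel: entry.cefr });
    }
  }
  return plan;
}
```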
This script IS idempotent — running it twice produces the same result.

---
## File Structure

```
packages/db/src/cefr/
  raw/                      ← raw source files (gitignored if large)
    esl-lounge-a1-en.txt
    esl-lounge-a2-en.txt
    ...
  sources/                  ← normalized JSON per source per language
    esl-lounge-en.json
    kelly-en.json
    kelly-it.json
  merged/                   ← one authoritative JSON per language
    en.json
    it.json
  extract-esl-lounge.ts     ← extraction script
  extract-kelly.ts          ← extraction script (when Kelly data is available)
  compare.ts                ← comparison report
  merge.ts                  ← merge into authoritative file
  enrich.ts                 ← write cefr_level to DB
```
---

## What NOT to do

- Do not hardcode CEFR level strings — always use `CEFR_LEVELS` from `@glossa/shared`
- Do not hardcode POS strings — always use `SUPPORTED_POS` from `@glossa/shared`
- Do not hardcode language codes — always use `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`
- Do not modify the schema
- Do not modify `seed.ts` or `generating-decks.ts`
- Do not skip the comparison step — it exists to surface data quality issues before enrichment
- Do not write `cefr_level` directly from raw source files — always go through normalize → merge → enrich

---
## Definition of Done

- All scripts run without TypeScript errors (`pnpm tsc --noEmit`)
- `extract-esl-lounge.ts` produces a valid normalized JSON file
- `compare.ts` prints a readable report showing coverage and conflicts
- `merge.ts` produces `merged/en.json` with conflict resolution applied
- `enrich.ts` writes `cefr_level` to matching `translations` rows and is idempotent
- Running `enrich.ts` twice produces the same DB state
- At least some `translations` rows have non-null `cefr_level` after enrichment