adding datafiles + updating documentation

2026-04-07 00:00:58 +02:00
commit 0cb9fe1485
parent 60cf48ef97
18 changed files with 2245974 additions and 1250 deletions
--- a/data-sources/english/.~lock.en_m3.xls#
+++ b/data-sources/english/.~lock.en_m3.xls#
@@ -0,0 +1 @@
 ,languagedev,laptop,06.04.2026 23:24,file:///home/languagedev/.config/libreoffice/4;
--- a/data-sources/english/.~lock.octanove-vocabulary-profile-c1c2-1.0.csv#
+++ b/data-sources/english/.~lock.octanove-vocabulary-profile-c1c2-1.0.csv#
@@ -0,0 +1 @@
 ,languagedev,laptop,06.04.2026 23:25,file:///home/languagedev/.config/libreoffice/4;
--- a/data-sources/english/cefrj-vocabulary-profile-1.5.csv
+++ b/data-sources/english/cefrj-vocabulary-profile-1.5.csv
--- a/data-sources/english/en_m3.xls
+++ b/data-sources/english/en_m3.xls
--- a/data-sources/english/english.json
+++ b/data-sources/english/english.json
--- a/data-sources/english/octanove-vocabulary-profile-c1c2-1.0.csv
+++ b/data-sources/english/octanove-vocabulary-profile-c1c2-1.0.csv
--- a/data-sources/french/french.json
+++ b/data-sources/french/french.json
--- a/data-sources/german/german.json
+++ b/data-sources/german/german.json
--- a/data-sources/italian/.~lock.it_m3.xls#
+++ b/data-sources/italian/.~lock.it_m3.xls#
@@ -0,0 +1 @@
 ,languagedev,laptop,06.04.2026 23:23,file:///home/languagedev/.config/libreoffice/4;
--- a/data-sources/italian/it-list_with_glossas.csv
+++ b/data-sources/italian/it-list_with_glossas.csv
--- a/data-sources/italian/it_m3.xls
+++ b/data-sources/italian/it_m3.xls
--- a/data-sources/italian/italian.json
+++ b/data-sources/italian/italian.json
--- a/data-sources/italian/subtlex-it.csv
+++ b/data-sources/italian/subtlex-it.csv
--- a/data-sources/italian/wordlist_of_italian_words_660000_parole_italiane.txt
+++ b/data-sources/italian/wordlist_of_italian_words_660000_parole_italiane.txt
--- a/data-sources/spanish/spanish.json
+++ b/data-sources/spanish/spanish.json
--- a/documentation/cefr_enrichment.md
+++ b/documentation/cefr_enrichment.md
@@ -1,216 +0,0 @@
 # Phase 4 — CEFR Enrichment Pipeline
 ## Context
 This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer
 uses Drizzle ORM with Postgres. The project is called Glossa.
 **Read `decisions.md` and `schema.ts` before doing anything.** They contain the full
 reasoning behind every decision. Do not deviate from established patterns without
 flagging it explicitly.
 ---
 ## Current State
 The database is fully populated with OMW data:
 - 95,882 terms (nouns and verbs)
 - 225,997 translations (English and Italian)
 - `cefr_level` is null on every translation row — this phase populates it
 ---
 ## Goal
 Build a pipeline that:
 1. Normalizes CEFR word lists from multiple sources into a common JSON format
 2. Compares sources to surface agreements and conflicts
 3. Merges sources into a single authoritative JSON per language
 4. Enriches the `translations` table with `cefr_level` values
 All scripts live in `packages/db/src/cefr/`.
 ---
 ## Normalized JSON Format
 Every source extraction script outputs a JSON file in this exact shape:
 ```json
 [
  { "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
  { "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
  { "word": "run",  "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
 ]
 ```
 Field rules:
 - `word` — lowercase, trimmed, base form
 - `pos` — must match `SUPPORTED_POS` values exactly (`'noun'` or `'verb'`)
 - `cefr` — must match `CEFR_LEVELS` exactly (`'A1'`–`'C2'`)
 - `source` — short identifier string for the source (`'esl-lounge'`, `'kelly'`, etc.)
 Output files go in `packages/db/src/cefr/sources/` named `<source>-<language>.json`
 e.g. `esl-lounge-en.json`, `kelly-en.json`, `kelly-it.json`.
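The field rules above could be enforced at extraction time with a small guard. This is a minimal sketch, not the project's implementation: `normalizeEntry` is a hypothetical helper, and the two constant arrays are local stand-ins for the values that `@glossa/shared` is assumed to export. It does not reduce words to their base form; that step is left to each extraction script.

```typescript
// Local stand-ins for CEFR_LEVELS and SUPPORTED_POS; the real values are
// assumed to come from @glossa/shared.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const SUPPORTED_POS = ["noun", "verb"] as const;

type CefrLevel = (typeof CEFR_LEVELS)[number];
type Pos = (typeof SUPPORTED_POS)[number];

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: CefrLevel;
  source: string;
}

// Hypothetical helper: returns a normalized entry, or null when the raw row
// violates one of the field rules (empty word, unknown pos, unknown level).
function normalizeEntry(
  raw: { word: string; pos: string; cefr: string },
  source: string,
): NormalizedEntry | null {
  const word = raw.word.trim().toLowerCase();
  const pos = SUPPORTED_POS.find((p) => p === raw.pos);
  const cefr = CEFR_LEVELS.find((level) => level === raw.cefr);
  if (word === "" || pos === undefined || cefr === undefined) {
    return null;
  }
  return { word, pos, cefr, source };
}
```

Returning `null` instead of throwing lets an extraction script count and log rejected rows without aborting the whole run.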
 ---
 ## Scripts to Write
 ### 1. Source extraction scripts (one per source)
 `packages/db/src/cefr/extract-<source>.ts`
 Each script reads the raw source data (CSV, scraped HTML, whatever format the source
 provides) and outputs the normalized JSON format above. Raw source files go in
 `packages/db/src/cefr/raw/`.
 **Sources to extract for English (start here):**
 - `esl-lounge` — word lists at esl-lounge.com, already split by CEFR level and POS.
  Raw data will be provided as text files, one per level.
 Add more sources as they become available. Each source is one extraction script,
 one output file. Do not combine sources in extraction scripts.
 ---
 ### 2. Comparison script
 `packages/db/src/cefr/compare.ts`
 Reads all normalized JSON files from `sources/` and prints a report:
 ```
 === CEFR Source Comparison ===
 Per source:
  esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
  kelly-en:      3,201 entries (A1: ...)
 Overlap (words appearing in multiple sources):
  esl-lounge-en ∩ kelly-en: 1,203 words
    Agreement:  1,089 (90.5%)
    Conflict:     114 (9.5%)
 Conflicts (sample, first 20):
  word       pos    esl-lounge  kelly
  -------------------------------
  "achieve"  verb   B1          A2
  "ancient"  adj    B2          B1
  ...
 DB coverage (words in sources that match a translation row):
  esl-lounge-en: 1,847 / 2,847 matched (64.9%)
  kelly-en:      2,103 / 3,201 matched (65.7%)
 ```
 This script is read-only — it never writes to the DB.
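The overlap, agreement, and conflict figures in the report above could be derived roughly as follows. This is a sketch under two assumptions that the spec does not state: entries are keyed on `word` plus `pos`, and each source holds at most one entry per key. `compareSources` and the `Conflict` shape are hypothetical names.

```typescript
interface Entry {
  word: string;
  pos: string;
  cefr: string;
}

interface Conflict {
  word: string;
  pos: string;
  levelA: string;
  levelB: string;
}

// Hypothetical helper: compares two normalized source lists in memory.
// It never touches the database.
function compareSources(a: Entry[], b: Entry[]) {
  const entriesA = new Map<string, Entry>();
  for (const e of a) {
    entriesA.set(`${e.word}|${e.pos}`, e);
  }

  let agreement = 0;
  const conflicts: Conflict[] = [];
  for (const e of b) {
    const match = entriesA.get(`${e.word}|${e.pos}`);
    if (match === undefined) continue; // only in source b, not an overlap
    if (match.cefr === e.cefr) {
      agreement += 1;
    } else {
      conflicts.push({ word: e.word, pos: e.pos, levelA: match.cefr, levelB: e.cefr });
    }
  }

  return { overlap: agreement + conflicts.length, agreement, conflicts };
}
```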
 ---
 ### 3. Merge script
 `packages/db/src/cefr/merge.ts`
 Reads all normalized JSON files from `sources/` for a given language and produces a
 single merged JSON file in `packages/db/src/cefr/merged/<language>.json`.
 **Merge rules:**
 - If only one source has a word → use that level
 - If multiple sources agree → use that level
 - If sources conflict → use the level from the highest-priority source
 **Source priority order (highest to lowest):**
 1. `kelly` — purpose-built for language learning, CEFR-mapped by linguists
 2. `esl-lounge` — curated by teachers, reliable but secondary
 3. Any additional sources added later
 Priority order is defined as a constant at the top of the merge script — easy to
 change without touching the logic.
 Output format — same normalized JSON shape but without `source` field, replaced by
 `sources` array showing which sources contributed:
 ```json
 [
  { "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
  { "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
 ]
 ```
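The merge rules and output shape above can be sketched as below. This is one possible reading, not the final `merge.ts`: `mergeEntries` and `priorityOf` are hypothetical names, and `sources` is taken to list only the sources that agree with the chosen level, which is what the `achieve` example suggests.

```typescript
// Highest priority first. Sources not listed here rank below all listed ones.
const SOURCE_PRIORITY = ["kelly", "esl-lounge"];

interface SourceEntry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

function priorityOf(source: string): number {
  const index = SOURCE_PRIORITY.indexOf(source);
  return index === -1 ? SOURCE_PRIORITY.length : index;
}

// Hypothetical helper: groups entries by word + pos, then picks the level
// from the highest-priority source in each group.
function mergeEntries(entries: SourceEntry[]): MergedEntry[] {
  const groups = new Map<string, SourceEntry[]>();
  for (const e of entries) {
    const key = `${e.word}|${e.pos}`;
    const group = groups.get(key);
    if (group) {
      group.push(e);
    } else {
      groups.set(key, [e]);
    }
  }

  const merged: MergedEntry[] = [];
  groups.forEach((group) => {
    const winner = group.reduce((best, e) =>
      priorityOf(e.source) < priorityOf(best.source) ? e : best,
    );
    // Assumption: only sources that agree with the winning level are listed.
    const sources = group
      .filter((e) => e.cefr === winner.cefr)
      .map((e) => e.source)
      .sort();
    merged.push({ word: winner.word, pos: winner.pos, cefr: winner.cefr, sources });
  });
  return merged;
}
```

A single source and full agreement both fall out of the same code path: the winner's level is the only level present, so no separate branch is needed.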
 ---
 ### 4. Enrichment script
 `packages/db/src/cefr/enrich.ts`
 Reads merged JSON files from `merged/` and writes `cefr_level` to matching
 `translations` rows.
 **Matching logic:**
 - For each entry in the merged JSON, find all `translations` rows where:
  - `language_code` matches the file's language
  - `text` matches the word (case-insensitive, trimmed)
  - The term's `pos` matches the entry's `pos`
 - Set `cefr_level` on all matching rows
 - Write with a plain `UPDATE` so existing values are overwritten and re-running is safe (`onConflictDoUpdate` belongs to inserts, not to updating rows that already exist)
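The matching logic above can be tried out in memory before any database write is involved. This is a rough sketch: `planUpdates` and the `TranslationRow` shape (including the numeric `id` and the joined `pos`) are assumptions rather than the real schema, and the actual write to `translations` is left out.

```typescript
interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
}

// Assumed row shape: a translation joined with its term's pos.
interface TranslationRow {
  id: number;
  languageCode: string;
  text: string;
  pos: string;
}

// Hypothetical helper: works out which rows would receive which level.
// Calling it twice on the same input yields the same plan.
function planUpdates(entries: MergedEntry[], rows: TranslationRow[], languageCode: string) {
  const levelByKey = new Map<string, string>();
  for (const e of entries) {
    levelByKey.set(`${e.word.trim().toLowerCase()}|${e.pos}`, e.cefr);
  }

  const updates: Array<{ id: number; cefr: string }> = [];
  const matchedKeys = new Set<string>();
  for (const row of rows) {
    if (row.languageCode !== languageCode) continue;
    const key = `${row.text.trim().toLowerCase()}|${row.pos}`;
    const cefr = levelByKey.get(key);
    if (cefr === undefined) continue; // no CEFR data for this row
    updates.push({ id: row.id, cefr });
    matchedKeys.add(key);
  }

  return { updates, unmatchedEntries: levelByKey.size - matchedKeys.size };
}
```

The two return values line up with the "Matched translation rows" and "Unmatched entries" counts in the log format.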
 **Logging:**
 ```
 === CEFR Enrichment ===
 Language: en
  Entries in merged file: 2,847
  Matched translation rows: 4,203  (one word can match multiple translations — synonyms)
  Unmatched entries: 644  (words not in DB)
  Updated: 4,203
 ```
 This script IS idempotent — running it twice produces the same result.
 ---
 ## File Structure
 ```
 packages/db/src/cefr/
  raw/                        ← raw source files (gitignored if large)
    esl-lounge-a1-en.txt
    esl-lounge-a2-en.txt
    ...
  sources/                    ← normalized JSON per source per language
    esl-lounge-en.json
    kelly-en.json
    kelly-it.json
  merged/                     ← one authoritative JSON per language
    en.json
    it.json
  extract-esl-lounge.ts       ← extraction script
  extract-kelly.ts            ← extraction script (when Kelly data is available)
  compare.ts                  ← comparison report
  merge.ts                    ← merge into authoritative file
  enrich.ts                   ← write cefr_level to DB
 ```
 ---
 ## What NOT to do
 - Do not hardcode CEFR level strings — always use `CEFR_LEVELS` from `@glossa/shared`
 - Do not hardcode POS strings — always use `SUPPORTED_POS` from `@glossa/shared`
 - Do not hardcode language codes — always use `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`
 - Do not modify the schema
 - Do not modify `seed.ts` or `generating-decks.ts`
 - Do not skip the comparison step — it exists to surface data quality issues before enrichment
 - Do not write `cefr_level` directly from raw source files — always go through normalize → merge → enrich
 ---
 ## Definition of Done
 - All scripts run without TypeScript errors (`pnpm tsc --noEmit`)
 - `extract-esl-lounge.ts` produces a valid normalized JSON file
 - `compare.ts` prints a readable report showing coverage and conflicts
 - `merge.ts` produces `merged/en.json` with conflict resolution applied
 - `enrich.ts` writes `cefr_level` to matching `translations` rows and is idempotent
 - Running `enrich.ts` twice produces the same DB state
 - At least some `translations` rows have non-null `cefr_level` after enrichment
--- a/packages/db/src/data/wordlists/top1000englishnouns
+++ b/packages/db/src/data/wordlists/top1000englishnouns
--- a/packages/db/src/data/wordlists/top1000englishnouns-missing
+++ b/packages/db/src/data/wordlists/top1000englishnouns-missing
@@ -1,34 +0,0 @@
 a
 other
 us
 may
 st
 paul
 new
 software
 oxford
 english
 mary
 japan
 while
 pp
 membership
 manchester
 tony
 alan
 jones
 un
 northern
 simon
 behalf
 co
 graham
 joe
 guy
 lewis
 jane
 taylor
 co-operation
 travel
 self
 thatcher