adding datafiles + updating documentation
parent 60cf48ef97
commit 0cb9fe1485
18 changed files with 2245974 additions and 1250 deletions
1
data-sources/english/.~lock.en_m3.xls#
Normal file
@@ -0,0 +1 @@
,languagedev,laptop,06.04.2026 23:24,file:///home/languagedev/.config/libreoffice/4;
@@ -0,0 +1 @@
,languagedev,laptop,06.04.2026 23:25,file:///home/languagedev/.config/libreoffice/4;
7800
data-sources/english/cefrj-vocabulary-profile-1.5.csv
Normal file
File diff suppressed because it is too large
BIN
data-sources/english/en_m3.xls
Normal file
Binary file not shown.
186374
data-sources/english/english.json
Normal file
File diff suppressed because it is too large
2137
data-sources/english/octanove-vocabulary-profile-c1c2-1.0.csv
Normal file
File diff suppressed because it is too large
193382
data-sources/french/french.json
Normal file
File diff suppressed because it is too large
324482
data-sources/german/german.json
Normal file
File diff suppressed because it is too large
1
data-sources/italian/.~lock.it_m3.xls#
Normal file
@@ -0,0 +1 @@
,languagedev,laptop,06.04.2026 23:23,file:///home/languagedev/.config/libreoffice/4;
2987
data-sources/italian/it-list_with_glossas.csv
Normal file
File diff suppressed because it is too large
BIN
data-sources/italian/it_m3.xls
Normal file
Binary file not shown.
185759
data-sources/italian/italian.json
Normal file
File diff suppressed because it is too large
517565
data-sources/italian/subtlex-it.csv
Normal file
File diff suppressed because it is too large
661563
data-sources/italian/wordlist_of_italian_words_660000_parole_italiane.txt
Normal file
File diff suppressed because it is too large
163922
data-sources/spanish/spanish.json
Normal file
File diff suppressed because it is too large
@@ -1,216 +0,0 @@
# Phase 4 — CEFR Enrichment Pipeline

## Context

This is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. The data layer
uses Drizzle ORM with Postgres. The project is called Glossa.

**Read `decisions.md` and `schema.ts` before doing anything.** They contain the full
reasoning behind every decision. Do not deviate from established patterns without
flagging it explicitly.

---

## Current State

The database is fully populated with OMW data:
- 95,882 terms (nouns and verbs)
- 225,997 translations (English and Italian)
- `cefr_level` is null on every translation row — this phase populates it

---

## Goal

Build a pipeline that:
1. Normalizes CEFR word lists from multiple sources into a common JSON format
2. Compares sources to surface agreements and conflicts
3. Merges sources into a single authoritative JSON per language
4. Enriches the `translations` table with `cefr_level` values

All scripts live in `packages/db/src/cefr/`.

---

## Normalized JSON Format

Every source extraction script outputs a JSON file in this exact shape:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "source": "esl-lounge" },
  { "word": "bank", "pos": "verb", "cefr": "B1", "source": "esl-lounge" },
  { "word": "run", "pos": "verb", "cefr": "A1", "source": "esl-lounge" }
]
```

Field rules:
- `word` — lowercase, trimmed, base form
- `pos` — must match `SUPPORTED_POS` values exactly (`'noun'` or `'verb'`)
- `cefr` — must match `CEFR_LEVELS` exactly (`'A1'`–`'C2'`)
- `source` — short identifier string for the source (`'esl-lounge'`, `'kelly'`, etc.)

Output files go in `packages/db/src/cefr/sources/` named `<source>-<language>.json`
e.g. `esl-lounge-en.json`, `kelly-en.json`, `kelly-it.json`.
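
The field rules above can be checked mechanically. The following is a minimal sketch, not the project's actual code: `normalizeEntry` is a hypothetical helper, and the two constants are local stand-ins so the block runs on its own. In the real scripts they would be imported from `@glossa/shared`, whose exact exports are not shown here. The "base form" rule cannot be verified by code like this and is left to the source data.

```typescript
// Local stand-ins (assumption) for the constants exported by `@glossa/shared`.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const SUPPORTED_POS = ["noun", "verb"] as const;

type CefrLevel = (typeof CEFR_LEVELS)[number];
type Pos = (typeof SUPPORTED_POS)[number];

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: CefrLevel;
  source: string;
}

interface RawEntry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

// Hypothetical helper: returns a cleaned entry, or null if any field rule fails.
function normalizeEntry(raw: RawEntry): NormalizedEntry | null {
  const word = raw.word.trim().toLowerCase();
  // Exact matches only: "Noun" or "a1" are rejected rather than silently fixed.
  const pos = SUPPORTED_POS.find((p) => p === raw.pos);
  const cefr = CEFR_LEVELS.find((l) => l === raw.cefr);
  if (word === "" || pos === undefined || cefr === undefined || raw.source === "") {
    return null;
  }
  return { word, pos, cefr, source: raw.source };
}
```

Returning `null` instead of throwing lets an extraction script count and log rejected rows without aborting the whole file.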

---

## Scripts to Write

### 1. Source extraction scripts (one per source)

`packages/db/src/cefr/extract-<source>.ts`

Each script reads the raw source data (CSV, scraped HTML, whatever format the source
provides) and outputs the normalized JSON format above. Raw source files go in
`packages/db/src/cefr/raw/`.

**Sources to extract for English (start here):**
- `esl-lounge` — word lists at esl-lounge.com, already split by CEFR level and POS.
  Raw data will be provided as text files, one per level.

Add more sources as they become available. Each source is one extraction script,
one output file. Do not combine sources in extraction scripts.
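
As a rough illustration of one extraction step, the sketch below parses the text of a single per-level file into normalized entries. The raw line format is an assumption, since the spec does not define it: one `word<TAB>pos` pair per line. `parseLevelFile` is a hypothetical name, and the constants are local stand-ins for the `@glossa/shared` exports.

```typescript
// Local stand-ins (assumption) for the constants exported by `@glossa/shared`.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const SUPPORTED_POS = ["noun", "verb"] as const;

type CefrLevel = (typeof CEFR_LEVELS)[number];
type Pos = (typeof SUPPORTED_POS)[number];

interface NormalizedEntry {
  word: string;
  pos: Pos;
  cefr: CefrLevel;
  source: string;
}

// Assumed raw format (not confirmed by the spec): one `word<TAB>pos` pair per line.
function parseLevelFile(text: string, cefr: CefrLevel, source: string): NormalizedEntry[] {
  const seen = new Set<string>();
  const entries: NormalizedEntry[] = [];
  for (const line of text.split(/\r?\n/)) {
    const [rawWord = "", rawPos = ""] = line.split("\t");
    const word = rawWord.trim().toLowerCase();
    const pos = SUPPORTED_POS.find((p) => p === rawPos.trim().toLowerCase());
    if (word === "" || pos === undefined) continue; // skip blank lines and unsupported POS
    const key = `${word}|${pos}`;
    if (seen.has(key)) continue; // drop duplicates within one file
    seen.add(key);
    entries.push({ word, pos, cefr, source });
  }
  return entries;
}
```

Keeping the parser a pure function of the file text makes it easy to test without touching `raw/`; the script around it would only read the file, call the parser once per level, and write the combined array to `sources/`.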

---

### 2. Comparison script

`packages/db/src/cefr/compare.ts`

Reads all normalized JSON files from `sources/` and prints a report:

```
=== CEFR Source Comparison ===

Per source:
  esl-lounge-en: 2,847 entries (A1: 312, A2: 445, B1: 623, B2: 701, C1: 489, C2: 277)
  kelly-en: 3,201 entries (A1: ...)

Overlap (words appearing in multiple sources):
  esl-lounge-en ∩ kelly-en: 1,203 words
  Agreement: 1,089 (90.5%)
  Conflict: 114 (9.5%)

Conflicts (sample, first 20):
  word       pos   esl-lounge  kelly
  -------------------------------
  "achieve"  verb  B1          A2
  "ancient"  adj   B2          B1
  ...

DB coverage (words in sources that match a translation row):
  esl-lounge-en: 1,847 / 2,847 matched (64.9%)
  kelly-en: 2,103 / 3,201 matched (65.7%)
```

This script is read-only — it never writes to the DB.
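
The overlap, agreement, and conflict figures in the report could be computed along these lines. This is a sketch under two assumptions: entries are keyed on `word` plus `pos`, and each key appears at most once per source. `compareSources` and `OverlapReport` are hypothetical names.

```typescript
interface NormalizedEntry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface Conflict {
  word: string;
  pos: string;
  levelA: string;
  levelB: string;
}

interface OverlapReport {
  overlap: number;
  agreement: number;
  conflicts: Conflict[];
}

// Compares two normalized sources for one language without touching the DB.
function compareSources(a: NormalizedEntry[], b: NormalizedEntry[]): OverlapReport {
  const levelsInB = new Map<string, string>();
  for (const e of b) levelsInB.set(`${e.word}|${e.pos}`, e.cefr);

  const report: OverlapReport = { overlap: 0, agreement: 0, conflicts: [] };
  for (const e of a) {
    const other = levelsInB.get(`${e.word}|${e.pos}`);
    if (other === undefined) continue; // entry exists only in source A
    report.overlap += 1;
    if (other === e.cefr) {
      report.agreement += 1;
    } else {
      report.conflicts.push({ word: e.word, pos: e.pos, levelA: e.cefr, levelB: other });
    }
  }
  return report;
}
```

The percentages in the report follow directly: agreement divided by overlap, and `conflicts.length` divided by overlap.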

---

### 3. Merge script

`packages/db/src/cefr/merge.ts`

Reads all normalized JSON files from `sources/` for a given language and produces a
single merged JSON file in `packages/db/src/cefr/merged/<language>.json`.

**Merge rules:**
- If only one source has a word → use that level
- If multiple sources agree → use that level
- If sources conflict → use the level from the highest-priority source

**Source priority order (highest to lowest):**
1. `kelly` — purpose-built for language learning, CEFR-mapped by linguists
2. `esl-lounge` — curated by teachers, reliable but secondary
3. Any additional sources added later

Priority order is defined as a constant at the top of the merge script — easy to
change without touching the logic.

Output format — same normalized JSON shape but without `source` field, replaced by
`sources` array showing which sources contributed:

```json
[
  { "word": "bank", "pos": "noun", "cefr": "A1", "sources": ["esl-lounge", "kelly"] },
  { "word": "achieve", "pos": "verb", "cefr": "A2", "sources": ["kelly"] }
]
```
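
The merge rules and the priority constant might be sketched as follows. `SOURCE_PRIORITY`, `priorityRank`, and `mergeSources` are hypothetical names. One detail is inferred from the example output rather than stated outright: `sources` lists only the sources whose level matches the chosen one, which is why `achieve` carries just `kelly`.

```typescript
interface NormalizedEntry {
  word: string;
  pos: string;
  cefr: string;
  source: string;
}

interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

// Highest priority first, mirroring the order listed above.
const SOURCE_PRIORITY: readonly string[] = ["kelly", "esl-lounge"];

function priorityRank(source: string): number {
  const i = SOURCE_PRIORITY.indexOf(source);
  return i === -1 ? SOURCE_PRIORITY.length : i; // sources not listed rank last
}

function mergeSources(entries: NormalizedEntry[]): MergedEntry[] {
  // Group entries from all sources by word + POS.
  const groups = new Map<string, NormalizedEntry[]>();
  for (const e of entries) {
    const key = `${e.word}|${e.pos}`;
    const group = groups.get(key);
    if (group) group.push(e);
    else groups.set(key, [e]);
  }

  const merged: MergedEntry[] = [];
  for (const group of groups.values()) {
    const ranked = [...group].sort((x, y) => priorityRank(x.source) - priorityRank(y.source));
    const winner = ranked[0];
    if (winner === undefined) continue;
    // Keep only the sources that agree with the winning level (inferred from the example).
    const sources = ranked
      .filter((e) => e.cefr === winner.cefr)
      .map((e) => e.source)
      .sort();
    merged.push({ word: winner.word, pos: winner.pos, cefr: winner.cefr, sources });
  }
  return merged;
}
```

Because all three merge rules reduce to "take the level of the best-ranked source in the group", a single sort covers the single-source, agreement, and conflict cases.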

---

### 4. Enrichment script

`packages/db/src/cefr/enrich.ts`

Reads merged JSON files from `merged/` and writes `cefr_level` to matching
`translations` rows.

**Matching logic:**
- For each entry in the merged JSON, find all `translations` rows where:
  - `language_code` matches the file's language
  - `text` matches the word (case-insensitive, trimmed)
  - The term's `pos` matches the entry's `pos`
- Set `cefr_level` on all matching rows
- Use `onConflictDoUpdate` to overwrite existing values (re-running is safe)

**Logging:**
```
=== CEFR Enrichment ===
Language: en
Entries in merged file: 2,847
Matched translation rows: 4,203 (one word can match multiple translations — synonyms)
Unmatched entries: 644 (words not in DB)
Updated: 4,203
```

This script IS idempotent — running it twice produces the same result.
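
The matching logic above can be illustrated in memory, without a database. This is only a sketch: `TranslationRow` is a hypothetical row shape (a translation joined with its term's POS), not the real schema from `schema.ts`, and `planEnrichment` is a hypothetical name. It computes which rows would receive which level; the actual write through Drizzle is deliberately left out.

```typescript
interface MergedEntry {
  word: string;
  pos: string;
  cefr: string;
  sources: string[];
}

// Hypothetical row shape: a translation joined with its term's POS.
interface TranslationRow {
  id: number;
  languageCode: string;
  text: string;
  pos: string;
}

interface EnrichmentPlan {
  updates: Map<number, string>; // translation id -> cefr level to write
  unmatched: number; // merged entries with no matching row
}

function planEnrichment(
  merged: MergedEntry[],
  rows: TranslationRow[],
  languageCode: string,
): EnrichmentPlan {
  // Index rows of the target language by normalized text + POS.
  const index = new Map<string, number[]>();
  for (const row of rows) {
    if (row.languageCode !== languageCode) continue;
    const key = `${row.text.trim().toLowerCase()}|${row.pos}`;
    const ids = index.get(key);
    if (ids) ids.push(row.id);
    else index.set(key, [row.id]);
  }

  const plan: EnrichmentPlan = { updates: new Map(), unmatched: 0 };
  for (const entry of merged) {
    const ids = index.get(`${entry.word}|${entry.pos}`);
    if (ids === undefined) {
      plan.unmatched += 1;
      continue;
    }
    for (const id of ids) plan.updates.set(id, entry.cefr); // one entry may hit several rows
  }
  return plan;
}
```

The plan depends only on the merged file and the current rows, so computing and applying it a second time sets the same values again, which is the idempotence the script requires.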

---

## File Structure

```
packages/db/src/cefr/
  raw/                        ← raw source files (gitignored if large)
    esl-lounge-a1-en.txt
    esl-lounge-a2-en.txt
    ...
  sources/                    ← normalized JSON per source per language
    esl-lounge-en.json
    kelly-en.json
    kelly-it.json
  merged/                     ← one authoritative JSON per language
    en.json
    it.json
  extract-esl-lounge.ts       ← extraction script
  extract-kelly.ts            ← extraction script (when Kelly data is available)
  compare.ts                  ← comparison report
  merge.ts                    ← merge into authoritative file
  enrich.ts                   ← write cefr_level to DB
```

---

## What NOT to do

- Do not hardcode CEFR level strings — always use `CEFR_LEVELS` from `@glossa/shared`
- Do not hardcode POS strings — always use `SUPPORTED_POS` from `@glossa/shared`
- Do not hardcode language codes — always use `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`
- Do not modify the schema
- Do not modify `seed.ts` or `generating-decks.ts`
- Do not skip the comparison step — it exists to surface data quality issues before enrichment
- Do not write `cefr_level` directly from raw source files — always go through normalize → merge → enrich

---

## Definition of Done

- All scripts run without TypeScript errors (`pnpm tsc --noEmit`)
- `extract-esl-lounge.ts` produces a valid normalized JSON file
- `compare.ts` prints a readable report showing coverage and conflicts
- `merge.ts` produces `merged/en.json` with conflict resolution applied
- `enrich.ts` writes `cefr_level` to matching `translations` rows and is idempotent
- Running `enrich.ts` twice produces the same DB state
- At least some `translations` rows have non-null `cefr_level` after enrichment
File diff suppressed because it is too large
@@ -1,34 +0,0 @@
a
other
us
may
st
paul
new
software
oxford
english
mary
japan
while
pp
membership
manchester
tony
alan
jones
un
northern
simon
behalf
co
graham
joe
guy
lewis
jane
taylor
co-operation
travel
self
thatcher