refactor: migrate to deck-based vocabulary curation
Database Schema:
- Add decks table for curated word lists (A1, Most Common, etc.)
- Add deck_terms join table with position ordering
- Link rooms to decks via rooms.deck_id FK
- Remove frequency_rank from terms (now deck-scoped)
- Change users.id to uuid, add openauth_sub for auth mapping
- Add room_players.left_at for disconnect tracking
- Add rooms.updated_at for stale room recovery
- Add CHECK constraints for data integrity (pos, status, etc.)

Extraction Script:
- Rewrite extract.py to mirror the complete OMW dataset
- Extract all 25,204 bilingual noun synsets (en-it)
- Remove frequency filtering and block lists
- Output all lemmas per synset for full synonym support
- Seed data now uncurated; decks handle selection

Architecture:
- Separate concerns: raw OMW data in DB, curation in decks
- Enables user-created decks and multiple difficulty levels
- Rooms select vocabulary by choosing a deck
Parent: e9e750da3e
Commit: be7a7903c5
9 changed files with 349148 additions and 492 deletions
@@ -56,9 +56,28 @@ Production will use Nginx to serve static Vite build output.
`app.ts` exports a `createApp()` factory function. `server.ts` imports it and calls `.listen()`. This allows tests to import the app directly without starting a server, keeping tests isolated and fast.

### Data model: `terms` + `translations` (not flat word pairs)

### Data model: `decks` separate from `terms` (not frequency_rank filtering)

Words are modelled as language-neutral concepts with one or more translations per language. Adding a new language pair requires no schema changes — only new rows in `translations` and `language_pairs`. The flat `english/italian` column pattern is explicitly avoided.

**Original approach:** Store `frequency_rank` on the `terms` table and filter by rank range for difficulty.

**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:

- Rank 1: "In" → "indio" (chemical symbol: Indium)
- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium)
- Rank 7: "He" → "elio" (chemical symbol: Helium)

These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning.

**Decision:**

- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering)
- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
- `deck_terms` junction table links terms to decks with position ordering
- `rooms.deck_id` specifies which vocabulary deck a game uses

**Benefits:**

- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
- Bad data (chemical symbols, obscure words) excluded at deck level, not schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes
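
As a sketch of how a room's vocabulary query looks under this model (sqlite3 stands in for the real PostgreSQL/Drizzle schema here; the second synset id is an illustrative placeholder):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE terms (id INTEGER PRIMARY KEY, synset_id TEXT UNIQUE);
CREATE TABLE decks (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE deck_terms (
    deck_id INTEGER REFERENCES decks(id),
    term_id INTEGER REFERENCES terms(id),
    position INTEGER,
    PRIMARY KEY (deck_id, term_id)
);
""")
# Raw OMW data: everything is imported, including junk like chemical elements.
conn.executemany("INSERT INTO terms VALUES (?, ?)",
                 [(1, "wn:02084071n"),    # dog
                  (2, "wn:00000001n")])   # illustrative junk entry
# Curation happens at the deck level: only "dog" makes the A1 deck.
conn.execute("INSERT INTO decks VALUES (1, 'A1')")
conn.execute("INSERT INTO deck_terms VALUES (1, 1, 1)")

def get_deck_terms(conn, deck_id, limit):
    """A room's vocabulary is whatever its chosen deck contains, in position order."""
    rows = conn.execute(
        """SELECT t.synset_id
           FROM deck_terms dt JOIN terms t ON t.id = dt.term_id
           WHERE dt.deck_id = ? ORDER BY dt.position LIMIT ?""",
        (deck_id, limit)).fetchall()
    return [r[0] for r in rows]
```

The junk term stays in `terms` but never reaches a game, because rooms only ever read through a deck.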

### Multiplayer mechanic: simultaneous answers (not buzz-first)
@@ -136,17 +155,54 @@ Then `sudo sysctl -p` or restart Docker.
## Data Model

### Translations: no isPrimary column

WordNet synsets often have multiple lemmas per language (e.g. "dog", "domestic dog", "Canis familiaris"). An earlier design included an `isPrimary` boolean to mark which translation to display in the quiz.

This was dropped because:

- The schema cannot enforce a single primary per (term_id, language_code) with a boolean alone — multiple rows can be true simultaneously
- Fixing that requires either a partial unique index or application-level logic, both adding complexity for no MVP benefit
- The ambiguity can be resolved earlier and more cleanly

**Decision:** the Python extraction script picks one translation per language per synset at extraction time, taking the first lemma (WordNet orders lemmas by frequency). By the time data enters the database the choice is already made. The `translations` table carries a UNIQUE (term_id, language_code) constraint, which enforces exactly one translation per language at the database level. No `isPrimary` column exists.

### Users: internal UUID + openauth_sub (not sub as PK)

**Original approach:** Use the OpenAuth `sub` claim directly as `users.id` (text primary key).

**Problem:** Embeds the auth provider in the primary key (e.g. `"google|12345"`). If OpenAuth changes format or a second provider is added, the PK change cascades through all FKs (`rooms.host_id`, `room_players.user_id`).

**Decision:**

- `users.id` = internal UUID (stable FK target)
- `users.openauth_sub` = text UNIQUE (auth provider claim)
- Allows adding multiple auth providers per user later without FK changes
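
A sketch of the first-login upsert this split implies (sqlite3 and Python here purely for illustration; the real API layer is TypeScript + Drizzle):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE users (
    id TEXT PRIMARY KEY,               -- internal UUID, stable FK target
    openauth_sub TEXT NOT NULL UNIQUE  -- provider claim, free to change shape
)""")

def upsert_user(conn, sub):
    """Map an OpenAuth sub claim to the internal id, creating the row on first login."""
    row = conn.execute("SELECT id FROM users WHERE openauth_sub = ?", (sub,)).fetchone()
    if row:
        return row[0]
    new_id = str(uuid.uuid4())
    conn.execute("INSERT INTO users VALUES (?, ?)", (new_id, sub))
    return new_id
```

Foreign keys only ever see the UUID, so a provider-format change touches one unique column, not the whole FK graph.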

### Rooms: `updated_at` for stale recovery only

Most tables omit `updated_at` (unnecessary for MVP). `rooms.updated_at` is kept specifically for stale room recovery—identifying rooms stuck in `in_progress` status after server crashes.
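
A recovery sweep under this design could look like the following sketch (sqlite3 and epoch seconds for illustration; the real column is a PostgreSQL timestamp):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE rooms (
    id INTEGER PRIMARY KEY,
    status TEXT,
    updated_at REAL  -- epoch seconds in this sketch; a timestamp column in PostgreSQL
)""")
now = time.time()
conn.execute("INSERT INTO rooms VALUES (1, 'in_progress', ?)", (now - 7200,))  # stale
conn.execute("INSERT INTO rooms VALUES (2, 'in_progress', ?)", (now,))         # live

def stale_rooms(conn, older_than_seconds=3600):
    """Rooms stuck in_progress whose updated_at went quiet (e.g. after a crash)."""
    cutoff = time.time() - older_than_seconds
    return [r[0] for r in conn.execute(
        "SELECT id FROM rooms WHERE status = 'in_progress' AND updated_at < ?",
        (cutoff,))]
```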

### Translations: UNIQUE (term_id, language_code, text)

Allows multiple synonyms per language per term (e.g. "dog", "hound" for same synset). Prevents exact duplicate rows. Homonyms (e.g. "Lead" metal vs. "Lead" guide) are handled by different `term_id` values (different synsets), so no constraint conflict.
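
The constraint's behavior, sketched with sqlite3 for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE translations (
    term_id INTEGER,
    language_code TEXT,
    text TEXT,
    UNIQUE (term_id, language_code, text)
)""")
# Synonyms for one synset are fine: distinct text values under the same key prefix.
conn.execute("INSERT INTO translations VALUES (1, 'en', 'dog')")
conn.execute("INSERT INTO translations VALUES (1, 'en', 'hound')")
# A homonym lives under a different term_id (different synset), so no conflict.
conn.execute("INSERT INTO translations VALUES (2, 'en', 'lead')")
# An exact duplicate row is rejected by the constraint.
try:
    conn.execute("INSERT INTO translations VALUES (1, 'en', 'dog')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```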

### Decks: `pair_id` is nullable

`decks.pair_id` references `language_pairs` but is nullable. Reasons:

- Single-language decks (e.g. "English Grammar")
- Multi-pair decks (e.g. "Cognates" spanning EN-IT and EN-FR)
- System decks (created by the app, not tied to a specific user)

---
@@ -1,3 +1,45 @@
# notes

## tasks

- pinning dependencies in package.json files

## open word net

download libraries via

```bash
python -c 'import wn; wn.download("omw-fr")'
```

libraries:

odenet:1.4
omw-es:1.4
omw-fr:1.4
omw-it:1.4
omw-en:1.4

upgrade wn package:

```bash
pip install --upgrade wn
```

check if wn is available, e.g. Italian:

```bash
python -c "import wn; print(len(wn.words(lang='it', lexicon='omw-it:1.4')))"
```

remove a library:

```bash
python -c "import wn; wn.remove('oewn:2024')"
```

list all libraries:

```bash
python -c "import wn; print(wn.lexicons())"
```
@ -1,301 +0,0 @@
|
|||
# Task: Run `scripts/extract_omw.py` locally → generates `packages/db/src/seed.json`

**Goal**: Produce a committed, validated JSON file containing 1000 English-Italian noun pairs ranked by frequency.
**Done when**: `packages/db/src/seed.json` exists in version control with exactly 1000 entries, all validated.

---

## Step 1: Python Environment Setup

**Prerequisites**: Python 3.11+ installed locally (not in Docker)

- [x] Create `scripts/requirements.txt`:

      nltk>=3.8

- [x] Add to `.gitignore`:

      venv/
      __pycache__/
      *.pyc

- [x] Create virtual environment:

```bash
cd scripts
python -m venv venv
source venv/bin/activate
```

- [x] Install dependencies:

```bash
pip install -r requirements.txt
```

- [x] Create `scripts/download_data.py` to fetch NLTK corpora:

```python
import nltk

def main():
    print("Downloading WordNet...")
    nltk.download("wordnet")

    print("Downloading OMW 1.4...")
    nltk.download("omw-1.4")

    print("Downloading WordNet IC...")
    nltk.download("wordnet_ic")

    print("Done.")

if __name__ == "__main__":
    main()
```

- [x] Run it once to cache data locally (~100MB in `~/nltk_data/`):

```bash
python download_data.py
```

- [x] Verify data downloaded: check `~/nltk_data/corpora/` exists with wordnet, omw, wordnet_ic folders
## Step 2: Data Exploration (Throwaway Script)

Before writing extraction logic, understand the data shape.

[ ] Create scripts/explore.py:
    - Import nltk.corpus.wordnet as wn
    - Import nltk.corpus.wordnet_ic as wnic
    - Load semcor_ic = wnic.ic('ic-semcor.dat')
[ ] Print sample synset:

```python
dog = wn.synset('dog.n.01')
print(f"Offset: {dog.offset():08d}")
print(f"POS: {dog.pos()}")
print(f"English lemmas: {dog.lemma_names()}")
print(f"Italian lemmas: {dog.lemma_names(lang='ita')}")
print(f"Frequency: {dog.res_similarity(dog, semcor_ic)}")
```

[ ] Test 5 common words: dog, house, car, water, time
[ ] Document findings:
    [ ] Synset ID format confirmed: {offset:08d}{pos} → 02084071n
    [ ] Italian availability: what percentage have translations?
    [ ] Multi-word handling: underscores in lemma_names()?
    [ ] Frequency scores: numeric range and distribution
[ ] Test edge cases:
    [ ] Word with multiple synsets (homonyms): bank, run
    [ ] Word with multiple Italian translations per synset
    [ ] Word with no Italian translation
[ ] Delete or keep explore.py (optional reference)

Decision checkpoint: Confirm frequency ranking strategy

- Option A: synset.count() — raw lemma occurrence count
- Option B: res_similarity with SemCor IC — information content score
- Document choice and rationale in comments
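
For intuition on Option B: Resnik similarity of a synset with itself reduces to the synset's own information content, IC = -log p. A toy sketch with invented counts (not real SemCor data):

```python
import math

# Invented occurrence counts standing in for SemCor lemma counts.
counts = {"dog": 100, "house": 80, "helium": 1}
total = sum(counts.values())

def information_content(word):
    """IC = -log p(word): rarer words carry MORE information content."""
    return -math.log(counts[word] / total)

# Option A ranks by raw occurrence count, descending.
by_count = sorted(counts, key=counts.get, reverse=True)
```

Note the direction: a high IC score marks a rare synset, so treating raw IC as a "frequency" score and sorting descending would surface obscure words first — worth checking against Option A before committing.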

## Step 3: Extraction Script Implementation

Create scripts/extract_omw.py with the full pipeline.

[ ] Imports and setup:

```python
import json
from collections import defaultdict
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
```

[ ] Load information content for frequency ranking:

```python
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
```

[ ] Define target count and output path:

```python
TARGET_COUNT = 1000
OUTPUT_PATH = '../packages/db/src/seed.json'
```

[ ] Iterate all noun synsets and collect candidates:

```python
candidates = []
for synset in wn.all_synsets(pos='n'):
    italian_lemmas = synset.lemma_names(lang='ita')
    if not italian_lemmas:
        continue

    english_lemmas = synset.lemma_names()

    # Calculate frequency score
    try:
        # Using self-similarity as frequency proxy
        freq_score = synset.res_similarity(synset, semcor_ic)
    except Exception:
        freq_score = 0

    candidates.append({
        'synset': synset,
        'offset': synset.offset(),
        'pos': synset.pos(),
        'freq_score': freq_score,
        'english': english_lemmas,
        'italian': italian_lemmas,
    })
```

[ ] Sort by frequency descending:

```python
candidates.sort(key=lambda x: x['freq_score'], reverse=True)
```

[ ] Slice top 1000:

```python
top_1000 = candidates[:TARGET_COUNT]
```

[ ] Build output structure matching schema needs:

```python
seed_data = []
for rank, candidate in enumerate(top_1000, start=1):
    # Normalize multi-word expressions: replace underscores with spaces
    english_normalized = [w.replace('_', ' ') for w in candidate['english']]
    italian_normalized = [w.replace('_', ' ') for w in candidate['italian']]

    seed_data.append({
        'synset_id': f"wn:{candidate['offset']:08d}{candidate['pos']}",
        'pos': 'noun',  # Map 'n' to full word for schema
        'frequency_rank': rank,
        'english_lemmas': english_normalized,
        'italian_lemmas': italian_normalized,
    })
```

[ ] Write formatted JSON:

```python
with open(OUTPUT_PATH, 'w', encoding='utf-8') as f:
    json.dump(seed_data, f, indent=2, ensure_ascii=False)

print(f"Generated {len(seed_data)} entries at {OUTPUT_PATH}")
```

Edge cases to handle:

[ ] Skip synsets with empty Italian list (already filtered)
[ ] Handle res_similarity exceptions (some synsets lack IC data)
[ ] Normalize underscores to spaces in all lemmas
[ ] Ensure UTF-8 encoding for Italian characters (à, è, ì, ò, ù)
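
The whitespace and underscore rules can share one helper; a sketch (the order-preserving dedup is an extra safeguard beyond the checklist):

```python
def normalize_lemmas(lemmas):
    """Underscores to spaces, strip whitespace, drop empties; dedup preserving order."""
    seen, out = set(), []
    for lemma in lemmas:
        cleaned = lemma.replace("_", " ").strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            out.append(cleaned)
    return out
```

UTF-8 is then just a matter of writing with `encoding='utf-8'` and `ensure_ascii=False`, as the JSON step above already does.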

## Step 4: Validation Script

Create scripts/validate_seed.py to verify output quality.

[ ] Load and parse JSON:

```python
import json
from pathlib import Path

SEED_PATH = '../packages/db/src/seed.json'

with open(SEED_PATH, 'r', encoding='utf-8') as f:
    data = json.load(f)
```

[ ] Run validation checks:
    [ ] Count check: len(data) == 1000
    [ ] Rank check: all entries have frequency_rank from 1 to 1000, no gaps, no duplicates
    [ ] Synset ID format: matches regex ^wn:\d{8}[nvar]$ (noun, verb, adjective, adverb)
    [ ] POS check: all are noun (since we filtered pos='n')
    [ ] Italian presence: every entry has italian_lemmas with at least 1 item
    [ ] English presence: every entry has english_lemmas with at least 1 item
    [ ] No duplicate synset IDs
    [ ] No empty strings in lemma arrays
    [ ] No leading/trailing whitespace in lemmas
[ ] Print summary statistics:
    - Total entries
    - Average English lemmas per entry
    - Average Italian lemmas per entry
    - Sample entries (ranks 1, 500, 1000) for manual inspection
[ ] Exit with error code if any check fails
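
A sketch of a few of these checks as a function returning failure messages (field names follow the seed structure from Step 3):

```python
import re

SYNSET_ID_RE = re.compile(r"^wn:\d{8}[nvar]$")

def validate(data, target_count=1000):
    """Run a subset of the checks above; return a list of failure messages."""
    failures = []
    if len(data) != target_count:
        failures.append(f"expected {target_count} entries, got {len(data)}")
    ranks = sorted(entry["frequency_rank"] for entry in data)
    if ranks != list(range(1, len(data) + 1)):
        failures.append("frequency_rank is not 1..N without gaps or duplicates")
    for entry in data:
        if not SYNSET_ID_RE.match(entry["synset_id"]):
            failures.append(f"bad synset_id: {entry['synset_id']}")
        if not entry["english_lemmas"] or not entry["italian_lemmas"]:
            failures.append(f"missing lemmas for {entry['synset_id']}")
    return failures
```

The script can then `sys.exit(1)` when the returned list is non-empty.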

## Step 5: Execution and Iteration

[ ] Run extraction:

```bash
cd scripts
source venv/bin/activate
python extract_omw.py
```

[ ] Run validation:

```bash
python validate_seed.py
```

[ ] If validation fails, fix extract_omw.py and re-run:
    [ ] Too few entries? Relax filters or reduce target count
    [ ] Data quality issues? Add normalization logic
    [ ] Format mismatches? Adjust output structure
[ ] Manual sanity check: open seed.json, read first 10 and last 10 entries
    - Do translations make sense?
    - Are frequencies plausible (common words first)?

## Step 6: Git Integration

[ ] Verify file location: packages/db/src/seed.json
[ ] Check file size: should be ~200-500KB (if larger, investigate)
[ ] Stage the file:

```bash
git add packages/db/src/seed.json
```

[ ] Commit with descriptive message:

    feat(data): add seed.json with 1000 English-Italian noun pairs

    Generated from WordNet 3.0 + OMW 1.4 using SemCor IC frequency ranking.
    Top entry: dog/cane (rank 1). Bottom entry: [word] (rank 1000).

[ ] Push to current feature branch

## Step 7: Documentation Update

[ ] Update documentation/decisions.md with:
    [ ] Frequency ranking method chosen (SemCor IC vs count) and why
    [ ] How multi-word expressions are handled (underscore → space)
    [ ] OMW data quality notes (coverage percentage, any manual fixes)
    [ ] Seed file structure (for future maintainers)

Definition of Done

[ ] scripts/extract_omw.py exists and runs without errors
[ ] scripts/validate_seed.py passes all checks
[ ] packages/db/src/seed.json committed to git with exactly 1000 entries
[ ] Manual sample check confirms sensible translations
[ ] decisions.md updated with extraction methodology
[ ] Virtual environment and Python cache files are gitignored

Out of Scope (for this task)

- Distractor generation (happens in API layer later)
- Additional parts of speech (verbs, adjectives)
- Data updates/re-seeding strategy (MVP assumes static seed)
- Database insertion (next task: seed.ts)
@@ -2,158 +2,145 @@
Each phase produces a working, deployable increment. Nothing is built speculatively.

---

## Phase 0 — Foundation

**Goal**: Empty repo that builds, lints, and runs end-to-end.
**Done when**: `pnpm dev` starts both apps; `GET /api/health` returns 200; React renders a hello page.

- [x] Initialise pnpm workspace monorepo: `apps/web`, `apps/api`, `packages/shared`, `packages/db`
- [x] Configure TypeScript project references across packages
- [x] Set up ESLint + Prettier with shared configs in root
- [x] Set up Vitest in the `api` and `web` apps and both packages
- [x] Scaffold Express app with `GET /api/health`
- [x] Scaffold Vite + React app with TanStack Router (single root route)
- [x] Configure Drizzle ORM + connection to local PostgreSQL
- [x] Write first migration (empty — just validates the pipeline works)
- [x] `docker-compose.yml` for local dev: `api`, `web`, `postgres`, `valkey`
- [x] `.env.example` files for `apps/api` and `apps/web`
- [x] update decisions.md

---
## Phase 1 — Vocabulary Data

**Goal**: Word data lives in the DB and can be queried via the API.
**Done when**: `GET /api/decks/1/terms?limit=10` returns 10 terms from a specific deck.

- [x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json` (imports ALL available OMW noun synsets, no frequency filtering)
- [ ] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
- [ ] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
- [ ] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
- [ ] Write `scripts/build_decks.ts` (reads external CEFR lists, matches to DB, creates decks)
- [ ] Download CEFR A1/A2 noun lists (from GitHub repos)
- [ ] Run `pnpm db:seed` → populates terms
- [ ] Run `pnpm db:build-decks` → creates curated decks
- [ ] Implement `DeckRepository.getTerms(deckId, limit, offset)`
- [ ] Implement `QuizService.attachDistractors(terms)` — same POS, server-side, no duplicates
- [ ] Implement `GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms` endpoints
- [ ] Define Zod response schemas in `packages/shared`
- [ ] Unit tests for `QuizService` (correct POS filtering, never includes the answer)
- [ ] update decisions.md
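
The distractor rule is small enough to sketch up front; Python is used here for illustration only, since the real `QuizService` is TypeScript:

```python
import random

def attach_distractors(term, pool, k=3, rng=random):
    """Pick k wrong options: same POS as the prompt, never the answer, no duplicates."""
    candidates = sorted({t["italian"] for t in pool
                         if t["pos"] == term["pos"] and t["italian"] != term["italian"]})
    return rng.sample(candidates, k)
```

Building the candidate set before sampling guarantees the answer can never appear and no option repeats, which is exactly what the unit tests above should assert.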

---

## Phase 2 — Auth

**Goal**: Users can log in via Google or GitHub and stay logged in.
**Done when**: JWT from OpenAuth is validated by the API; protected routes redirect unauthenticated users; user row is created on first login.

- [ ] Add OpenAuth service to `docker-compose.yml`
- [ ] Write Drizzle schema: `users` (uuid `id`, text `openauth_sub`, no games_played/won columns)
- [ ] Write and run migration (includes `updated_at` + triggers)
- [ ] Implement JWT validation middleware in `apps/api`
- [ ] Implement `GET /api/auth/me` (validate token, upsert user row via `openauth_sub`, return user)
- [ ] Define auth Zod schemas in `packages/shared`
- [ ] Frontend: login page with "Continue with Google" + "Continue with GitHub" buttons
- [ ] Frontend: redirect to `auth.yourdomain.com` → receive JWT → store in memory + HttpOnly cookie
- [ ] Frontend: TanStack Router auth guard (redirects unauthenticated users)
- [ ] Frontend: TanStack Query `api.ts` attaches token to every request
- [ ] Unit tests for JWT middleware
- [ ] update decisions.md

---
## Phase 3 — Single-player Mode

**Goal**: A logged-in user can complete a full solo quiz session.
**Done when**: User sees 10 questions, picks answers, sees their final score.

- [ ] Frontend: `/singleplayer` route
- [ ] `useQuizSession` hook: fetch terms, manage question index + score state
- [ ] `QuestionCard` component: prompt word + 4 answer buttons
- [ ] `OptionButton` component: idle / correct / wrong states
- [ ] `ScoreScreen` component: final score + play-again button
- [ ] TanStack Query integration for `GET /terms`
- [ ] RTL tests for `QuestionCard` and `OptionButton`
- [ ] update decisions.md

---
## Phase 4 — Multiplayer Rooms (Lobby)

**Goal**: Players can create and join rooms; the host sees all joined players in real time.
**Done when**: Two browser tabs can join the same room and see each other's display names update live via WebSocket.

- [ ] Write Drizzle schema: `rooms`, `room_players` (add `deck_id` FK to rooms)
- [ ] Write and run migration (includes CHECK constraints: `code = UPPER(code)`, `status`, `max_players`)
- [ ] Add indexes: `idx_rooms_host`, `idx_room_players_score`
- [ ] `POST /rooms` and `POST /rooms/:code/join` REST endpoints
- [ ] `RoomService`: create room with short code, join room, enforce max player limit
- [ ] `POST /rooms` accepts `deck_id` (which vocabulary deck to use)
- [ ] WebSocket server: attach `ws` upgrade handler to the Express HTTP server
- [ ] WS auth middleware: validate OpenAuth JWT on upgrade
- [ ] WS message router: dispatch incoming messages by `type`
- [ ] `room:join` / `room:leave` handlers → broadcast `room:state` to all room members
- [ ] Room membership tracked in Valkey (ephemeral) + `room_players` in PostgreSQL (durable)
- [ ] Define all WS event Zod schemas in `packages/shared`
- [ ] Frontend: `/multiplayer/lobby` — create room form + join-by-code form
- [ ] Frontend: `/multiplayer/room/:code` — player list, room code display, "Start Game" (host only)
- [ ] Frontend: `ws.ts` singleton WS client with reconnect on drop
- [ ] Frontend: Zustand `gameStore` handles incoming `room:state` events
- [ ] update decisions.md

---
## Phase 5 — Multiplayer Game

**Goal**: Host starts a game; all players answer simultaneously in real time; a winner is declared.
**Done when**: 2–4 players complete a 10-round game with correct live scores and a winner screen.

- [ ] `GameService`: generate question sequence for a room, enforce server-side 15 s timer
- [ ] `room:start` WS handler → begin question loop, broadcast first `game:question`
- [ ] `game:answer` WS handler → collect per-player answers
- [ ] On all-answered or timeout → evaluate, broadcast `game:answer_result`
- [ ] After N rounds → broadcast `game:finished`, update `rooms.status` + `room_players.score` in DB (transactional)
- [ ] Frontend: `/multiplayer/game/:code` route
- [ ] Frontend: extend Zustand store with `currentQuestion`, `roundAnswers`, `scores`
- [ ] Frontend: reuse `QuestionCard` + `OptionButton`; add countdown timer ring
- [ ] Frontend: `ScoreBoard` component — live per-player scores after each round
- [ ] Frontend: `GameFinished` screen — winner highlight, final scores, "Play Again" button
- [ ] Unit tests for `GameService` (round evaluation, tie-breaking, timeout auto-advance)
- [ ] update decisions.md

---
## Phase 6 — Production Deployment
|
||||
|
||||
**Goal**: App is live on Hetzner, accessible via HTTPS on all subdomains.
|
||||
**Done when**: `https://app.yourdomain.com` loads; `wss://api.yourdomain.com` connects; auth flow works end-to-end.
|
||||
Goal: App is live on Hetzner, accessible via HTTPS on all subdomains.
|
||||
Done when: `https://app.yourdomain.com` loads; `wss://api.yourdomain.com` connects; auth flow works end-to-end.
|
||||
|
||||
- [ ] `docker-compose.prod.yml`: all services + `nginx-proxy` + `acme-companion`
- [ ] Nginx config per container: `VIRTUAL_HOST` + `LETSENCRYPT_HOST` env vars
- [ ] Production `.env` files on VPS (OpenAuth secrets, DB credentials, Valkey URL)
- [ ] Drizzle migration runs on `api` container start (includes CHECK constraints + triggers)
- [ ] Seed production DB (run `seed.ts` once)
- [ ] Smoke test: login → solo game → create room → multiplayer game end-to-end
- [ ] update decisions.md

---
## Phase 7 — Polish & Hardening _(post-MVP)_

Not required to ship, but address before real users arrive.

- [ ] Rate limiting on API endpoints (`express-rate-limit`)
- [ ] Graceful WS reconnect with exponential back-off
- [ ] React error boundaries
- [ ] `GET /users/me/stats` endpoint (aggregates from `room_players`) + profile page
- [ ] Accessibility pass (keyboard nav, ARIA on quiz buttons)
- [ ] Favicon, page titles, Open Graph meta
- [ ] CI/CD pipeline (GitHub Actions → SSH deploy on push to `main`)
- [ ] Database backups (cron → Hetzner Object Storage)
- [ ] update decisions.md

---

## Dependency Graph

```
Phase 0 (Foundation)
└── Phase 1 (Vocabulary Data)
    └── Phase 2 (Auth)
        …
        └── Phase 4 (Room Lobby)
            └── Phase 5 (Multiplayer Game)
                └── Phase 6 (Deployment)
```

│   ├── shared/                  # Zod schemas, TypeScript types, constants
│   └── db/                      # Drizzle schema, migrations, seed script
├── scripts/
│   ├── datafiles/
│   │   └── en-it-nouns.json
│   └── extract-en-it-nouns.py   # One-time WordNet + OMW extraction → datafiles/en-it-nouns.json
├── docker-compose.yml
├── docker-compose.prod.yml
├── pnpm-workspace.yaml
SSL is fully automatic via `nginx-proxy` + `acme-companion`. No manual Certbot needed.

### 5.1 Valkey Key Structure

Ephemeral room state is stored in Valkey with a TTL (e.g. 1 hour); PostgreSQL stores durable history only.

Key format: `room:{code}:{field}`

| Key | Type | TTL | Description |
|-------------------------------|------|-----|-------------|
| `room:{code}:state` | Hash | 1h | Current question index, round status |
| `room:{code}:players` | Set | 1h | Connected user IDs |
| `room:{code}:answers:{round}` | Hash | 15m | Temporary storage for the current round's answers |

**Recovery strategy**: if the server crashes mid-game, Valkey data is lost and PostgreSQL `room_players.score` remains 0. A startup health check resets the room status to `finished` when `rooms.updated_at` is stale.
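A minimal sketch of the key builders and the staleness check behind the recovery strategy. The helper names are hypothetical; the key layout and TTL values match the table above:

```typescript
// Hypothetical key helpers for the `room:{code}:{field}` layout.
const ROOM_TTL_SECONDS = 60 * 60;     // 1h for state + players keys
const ANSWERS_TTL_SECONDS = 15 * 60;  // 15m for per-round answer keys

const roomKeys = {
  state: (code: string) => `room:${code}:state`,
  players: (code: string) => `room:${code}:players`,
  answers: (code: string, round: number) => `room:${code}:answers:${round}`,
};

// Startup health check: a room whose updated_at is older than the state
// TTL can no longer have live Valkey data, so it can be reset to "finished".
function isStaleRoom(updatedAt: Date, now: Date, ttlSeconds: number = ROOM_TTL_SECONDS): boolean {
  return now.getTime() - updatedAt.getTime() > ttlSeconds * 1000;
}
```

Because every room key carries its room code, a single `SCAN room:{code}:*` can clean up one room without touching others.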
---
|
||||
|
||||
## 6. Data Model

### Design principle

Words are modelled as language-neutral concepts (**terms**) separate from learning curricula (**decks**). Adding a new language pair requires no schema changes — only new rows in `translations`, `decks`, and `language_pairs`. The flat `english/italian` column pattern is explicitly avoided.

### Core tables

```
terms
  id            uuid PK
  synset_id     text UNIQUE          -- OMW ILI, e.g. "ili:i12345"
  pos           varchar(20)          -- NOT NULL, CHECK (pos IN ('noun', 'verb', 'adjective', 'adverb'))
  created_at    timestamptz DEFAULT now()
  -- REMOVED: frequency_rank (handled at deck level)

translations
  id            uuid PK
  term_id       uuid FK → terms.id
  language_code varchar(10)          -- NOT NULL, BCP 47: "en", "it"
  text          text                 -- NOT NULL
  created_at    timestamptz DEFAULT now()
  UNIQUE (term_id, language_code, text)   -- allow synonyms, prevent exact duplicates

term_glosses
  id            uuid PK
  term_id       uuid FK → terms.id
  language_code varchar(10)          -- NOT NULL
  text          text                 -- NOT NULL
  type          varchar(20)          -- NULLABLE, CHECK (type IN ('definition', 'example'))
  created_at    timestamptz DEFAULT now()

language_pairs
  id            uuid PK
  source        varchar(10)          -- NOT NULL, e.g. "en"
  target        varchar(10)          -- NOT NULL, e.g. "it"
  label         text                 -- "English → Italian"
  active        boolean DEFAULT true
  UNIQUE (source, target)

decks
  id            uuid PK
  name          text                 -- NOT NULL, e.g. "A1 Italian Nouns", "Most Common 1000"
  description   text                 -- NULLABLE
  pair_id       uuid FK → language_pairs.id   -- NULLABLE (single-language or multi-pair decks)
  created_by    uuid FK → users.id            -- NULLABLE (system decks)
  is_public     boolean DEFAULT true
  created_at    timestamptz DEFAULT now()

deck_terms
  deck_id       uuid FK → decks.id
  term_id       uuid FK → terms.id
  position      smallint             -- NOT NULL, ordering within deck (1, 2, 3, ...)
  added_at      timestamptz DEFAULT now()
  PRIMARY KEY (deck_id, term_id)

users
  id            uuid PK              -- internal stable ID (FK target)
  openauth_sub  text UNIQUE          -- NOT NULL, OpenAuth `sub` claim, e.g. "google|12345"
  email         varchar(255) UNIQUE  -- NULLABLE (GitHub users may lack email)
  display_name  varchar(100)
  created_at    timestamptz DEFAULT now()
  last_login_at timestamptz
  -- REMOVED: games_played, games_won (derive from room_players)

rooms
  id            uuid PK
  code          varchar(8) UNIQUE    -- NOT NULL, human-readable e.g. "WOLF-42", CHECK (code = UPPER(code))
  host_id       uuid FK → users.id
  pair_id       uuid FK → language_pairs.id
  deck_id       uuid FK → decks.id   -- which vocabulary deck this room uses
  status        varchar(20)          -- NOT NULL, CHECK (status IN ('waiting', 'in_progress', 'finished'))
  max_players   smallint             -- NOT NULL, DEFAULT 4, CHECK (max_players BETWEEN 2 AND 10)
  round_count   smallint             -- NOT NULL, DEFAULT 10, CHECK (round_count BETWEEN 5 AND 20)
  created_at    timestamptz DEFAULT now()
  updated_at    timestamptz DEFAULT now()   -- for stale room recovery

room_players
  room_id       uuid FK → rooms.id
  user_id       uuid FK → users.id
  score         integer DEFAULT 0    -- final score only (written at game end)
  joined_at     timestamptz DEFAULT now()
  left_at       timestamptz          -- populated on WS disconnect/leave
  PRIMARY KEY (room_id, user_id)
```
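The CHECK constraints on `rooms` can also be mirrored as an app-level guard, so invalid rows are rejected with readable errors before they reach Postgres. A sketch under assumed names (`NewRoom`, `validateRoom` are not from the codebase); the database constraints remain the source of truth:

```typescript
// App-level mirror of the rooms CHECK constraints in the schema above.
type RoomStatus = "waiting" | "in_progress" | "finished";

interface NewRoom {
  code: string;
  status: RoomStatus;
  maxPlayers: number;  // CHECK (max_players BETWEEN 2 AND 10)
  roundCount: number;  // CHECK (round_count BETWEEN 5 AND 20)
}

function validateRoom(room: NewRoom): string[] {
  const errors: string[] = [];
  if (room.code !== room.code.toUpperCase()) errors.push("code must be upper-case");
  if (room.maxPlayers < 2 || room.maxPlayers > 10) errors.push("max_players must be between 2 and 10");
  if (room.roundCount < 5 || room.roundCount > 20) errors.push("round_count must be between 5 and 20");
  return errors;
}
```

Duplicating the rules is a trade-off: friendlier errors at the API boundary, at the cost of keeping the guard in sync with migrations.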

### Indexes

```sql
-- Vocabulary
CREATE INDEX idx_terms_pos ON terms (pos);
CREATE INDEX idx_translations_lang ON translations (language_code, term_id);

-- Decks
CREATE INDEX idx_decks_pair ON decks (pair_id, is_public);
CREATE INDEX idx_decks_creator ON decks (created_by);
CREATE INDEX idx_deck_terms_term ON deck_terms (term_id);

-- Language pairs
CREATE INDEX idx_pairs_active ON language_pairs (active, source, target);

-- Rooms
CREATE INDEX idx_rooms_status ON rooms (status);
CREATE INDEX idx_rooms_host ON rooms (host_id);
-- NOTE: no separate index on rooms.code — the UNIQUE constraint creates one automatically

-- Room players
CREATE INDEX idx_room_players_user ON room_players (user_id);
CREATE INDEX idx_room_players_score ON room_players (room_id, score DESC);
```

**Repository logic note**: `DeckRepository.getTerms(deckId, limit, offset)` fetches terms from a specific deck, ordered by `deck_terms.position`. For random practice within a deck, `WHERE deck_id = X ORDER BY random() LIMIT N` is safe because a deck is bounded (e.g. 500 terms max), unlike the full `terms` table.
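`ORDER BY random() LIMIT N` over a bounded deck amounts to sampling N terms without replacement. The same idea in memory, as a partial Fisher–Yates shuffle (illustrative only, not the repository's actual code):

```typescript
// Partial Fisher–Yates shuffle: picks n distinct items uniformly at
// random, the in-memory equivalent of ORDER BY random() LIMIT n.
function sampleTerms<T>(deckTerms: readonly T[], n: number): T[] {
  const pool = [...deckTerms];
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    // Swap a random remaining element into position i.
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}
```

For a 500-term deck either approach is cheap; `ORDER BY random()` only becomes a problem on unbounded tables, which is exactly why the repository never samples `terms` directly.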
---
### Source

- **Open Multilingual Wordnet (OMW)** — English & Italian nouns, matched via the Interlingual Index (ILI)
- **External CEFR lists** — for deck curation (e.g. GitHub: ecom/cefr-lists)

### Extraction process

1. Run `scripts/extract-en-it-nouns.py` once locally using the `wn` library
   - Imports ALL bilingual noun synsets (no frequency filtering)
   - Output: `datafiles/en-it-nouns.json` — committed to the repo
2. Run `pnpm db:seed` — populates the `terms` + `translations` tables from the JSON
3. Run `pnpm db:build-decks` — matches external CEFR lists to DB terms, creating `decks` + `deck_terms` rows

### Benefits of the deck-based approach

- WordNet frequency data is unreliable (e.g. chemical symbols rank high)
- Curricula can come from external sources (CEFR, Oxford 3000, SUBTLEX)
- Bad data is excluded at the deck level, not the schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes

`terms.synset_id` stores the OMW ILI (e.g. `ili:i12345`) for traceability and future re-imports with additional languages.
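Step 3's matching can be sketched as a lookup from translation text to term id, producing ordered `deck_terms` rows. The function and type names here (`buildDeckRows`, `Translation`) are hypothetical, not the actual `db:build-decks` implementation:

```typescript
// Hypothetical sketch of the db:build-decks matching step: given a
// curated word list (e.g. a CEFR A1 list) and the translations already
// seeded in the DB, produce ordered deck_terms rows. Unmatched words
// are reported rather than silently dropped.
interface Translation {
  termId: string;
  languageCode: string;
  text: string;
}

interface DeckTermRow {
  termId: string;
  position: number; // deck_terms.position, 1-based
}

function buildDeckRows(
  wordList: readonly string[],
  translations: readonly Translation[],
  languageCode: string,
): { rows: DeckTermRow[]; unmatched: string[] } {
  // First matching translation wins; case-insensitive lookup.
  const byText = new Map<string, string>();
  for (const t of translations) {
    const key = t.text.toLowerCase();
    if (t.languageCode === languageCode && !byText.has(key)) {
      byText.set(key, t.termId);
    }
  }
  const rows: DeckTermRow[] = [];
  const unmatched: string[] = [];
  const seen = new Set<string>();
  for (const word of wordList) {
    const termId = byText.get(word.toLowerCase());
    if (termId === undefined) {
      unmatched.push(word);
    } else if (!seen.has(termId)) {
      seen.add(termId); // a term appears at most once per deck (PK constraint)
      rows.push({ termId, position: rows.length + 1 });
    }
  }
  return { rows, unmatched };
}
```

Reporting `unmatched` words makes the curation step auditable: a long unmatched list usually means the external list uses inflected or multiword forms that OMW lemmas do not cover.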
---

**`scripts/datafiles/en-it-nouns.json`** — new file, 348,711 lines (diff suppressed because it is too large).

**`scripts/extract_omw.py`** — deleted (a helper that only ran `nltk.download()` for `wordnet`, `omw-1.4`, and `wordnet_ic`).

**`scripts/extract-en-it-nouns.py`** — new file:
"""
|
||||
scripts/extract-en-it-nouns.py
|
||||
|
||||
Extract ALL bilingual nouns from Open Multilingual Wordnet (OMW).
|
||||
Output mirrors the terms table schema exactly — no filtering, no ranking.
|
||||
Decks handle curation later.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import wn
|
||||
|
||||
|
||||
def extract_bilingual_nouns(
|
||||
source_lang: str = "en",
|
||||
target_lang: str = "it",
|
||||
output_path: str = "datafiles/en-it-nouns.json",
|
||||
) -> None:
|
||||
"""
|
||||
Extract all noun synsets present in both languages via ILI.
|
||||
|
||||
Args:
|
||||
source_lang: Source language code (e.g., "en" for English)
|
||||
target_lang: Target language code (e.g., "it" for Italian)
|
||||
output_path: Where to write the seed JSON
|
||||
"""
|
||||
print(f"Loading WordNets: {source_lang=}, {target_lang=}")
|
||||
|
||||
try:
|
||||
source_wn = wn.Wordnet(lang=source_lang)
|
||||
target_wn = wn.Wordnet(lang=target_lang)
|
||||
except wn.Error as e:
|
||||
print(f"Error loading WordNet: {e}")
|
||||
print(f"Run: wn download omw-{target_lang}:1.4 oewn:2024")
|
||||
sys.exit(1)
|
||||
|
||||
# Index nouns by ILI (Inter-Lingual Index)
|
||||
source_by_ili: dict[str, wn.Synset] = {}
|
||||
for synset in source_wn.synsets(pos="n"):
|
||||
if synset.ili:
|
||||
source_by_ili[synset.ili] = synset
|
||||
|
||||
target_by_ili: dict[str, wn.Synset] = {}
|
||||
for synset in target_wn.synsets(pos="n"):
|
||||
if synset.ili:
|
||||
target_by_ili[synset.ili] = synset
|
||||
|
||||
# Find bilingual synsets (present in both languages)
|
||||
common_ilis = set(source_by_ili.keys()) & set(target_by_ili.keys())
|
||||
print(f"Found {len(common_ilis):,} bilingual noun synsets")
|
||||
|
||||
# Build seed data matching schema exactly
|
||||
terms: list[dict] = []
|
||||
|
||||
for ili in sorted(common_ilis, key=lambda x: int(x[1:])):
|
||||
en_syn = source_by_ili[ili]
|
||||
it_syn = target_by_ili[ili]
|
||||
|
||||
# All lemmas (synonyms) for each language
|
||||
en_lemmas = [str(lemma) for lemma in en_syn.lemmas()]
|
||||
it_lemmas = [str(lemma) for lemma in it_syn.lemmas()]
|
||||
|
||||
term = {
|
||||
"synset_id": f"ili:{ili}", # e.g., "ili:i12345"
|
||||
"pos": "noun",
|
||||
"translations": {source_lang: en_lemmas, target_lang: it_lemmas},
|
||||
# Note: id, created_at added by seed.ts during insert
|
||||
}
|
||||
terms.append(term)
|
||||
|
||||
# Ensure output directory exists
|
||||
output_file = Path(output_path)
|
||||
output_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Write JSON
|
||||
with open(output_file, "w", encoding="utf-8") as f:
|
||||
json.dump(terms, f, indent=2, ensure_ascii=False)
|
||||
|
||||
print(f"Wrote {len(terms):,} terms to {output_path}")
|
||||
|
||||
# Summary stats
|
||||
total_en_lemmas = sum(len(t["translations"][source_lang]) for t in terms)
|
||||
total_it_lemmas = sum(len(t["translations"][target_lang]) for t in terms)
|
||||
|
||||
print(f"\nLemma counts:")
|
||||
print(
|
||||
f" English: {total_en_lemmas:,} total ({total_en_lemmas / len(terms):.1f} avg per synset)"
|
||||
)
|
||||
print(
|
||||
f" Italian: {total_it_lemmas:,} total ({total_it_lemmas / len(terms):.1f} avg per synset)"
|
||||
)
|
||||
|
||||
# Sample output
|
||||
print(f"\n--- Sample terms ---")
|
||||
for t in terms[1000:1005]:
|
||||
print(
|
||||
f"{t['synset_id']}: {t['translations'][source_lang]} -> {t['translations'][target_lang]}"
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
extract_bilingual_nouns()
|
||||
|
|

Python dependencies: `nltk>=3.8` is replaced by `wn==1.1.0`.