refactor: migrate to deck-based vocabulary curation

Database Schema:
- Add decks table for curated word lists (A1, Most Common, etc.)
- Add deck_terms join table with position ordering
- Link rooms to decks via rooms.deck_id FK
- Remove frequency_rank from terms (now deck-scoped)
- Change users.id to uuid, add openauth_sub for auth mapping
- Add room_players.left_at for disconnect tracking
- Add rooms.updated_at for stale room recovery
- Add CHECK constraints for data integrity (pos, status, etc.)

Extraction Script:
- Rewrite extract.py to mirror complete OMW dataset
- Extract all 25,204 bilingual noun synsets (en-it)
- Remove frequency filtering and block lists
- Output all lemmas per synset for full synonym support
- Seed data now uncurated; decks handle selection

Architecture:
- Separate concerns: raw OMW data in DB, curation in decks
- Enables user-created decks and multiple difficulty levels
- Rooms select vocabulary by choosing a deck
lila 2026-03-27 16:53:26 +01:00
parent e9e750da3e
commit be7a7903c5
9 changed files with 349148 additions and 492 deletions


@@ -56,9 +56,28 @@ Production will use Nginx to serve static Vite build output.
`app.ts` exports a `createApp()` factory function. `server.ts` imports it and calls `.listen()`. This allows tests to import the app directly without starting a server, keeping tests isolated and fast.
### Data model: `decks` separate from `terms` (not frequency_rank filtering)
**Original approach:** Store `frequency_rank` on `terms` table and filter by rank range for difficulty.
**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:
- Rank 1: "In" → "indio" (chemical symbol: Indium)
- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium)
- Rank 7: "He" → "elio" (chemical symbol: Helium)
These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning.
**Decision:**
- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering)
- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
- `deck_terms` junction table links terms to decks with position ordering
- `rooms.deck_id` specifies which vocabulary deck a game uses
**Benefits:**
- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
- Bad data (chemical symbols, obscure words) excluded at deck level, not schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes
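The deck/term split above can be sketched with stdlib `sqlite3` (table and column names follow the decision; the query shape is an illustrative assumption, not the project's Drizzle code):

```python
import sqlite3

# Minimal in-memory sketch of the deck-based curation layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE terms (id INTEGER PRIMARY KEY, synset_id TEXT UNIQUE);
CREATE TABLE decks (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE deck_terms (
    deck_id INTEGER REFERENCES decks(id),
    term_id INTEGER REFERENCES terms(id),
    position INTEGER NOT NULL,
    PRIMARY KEY (deck_id, term_id)
);
""")
# terms holds ALL synsets, including junk like chemical symbols...
conn.executemany("INSERT INTO terms VALUES (?, ?)",
                 [(1, "ili:i1"), (2, "ili:i2"), (3, "ili:i3")])
conn.execute("INSERT INTO decks VALUES (1, 'A1 Italian Nouns')")
# ...but a deck references only the curated subset, in position order.
conn.executemany("INSERT INTO deck_terms VALUES (1, ?, ?)", [(3, 1), (1, 2)])

# A room configured with deck_id=1 draws its vocabulary like this:
rows = conn.execute("""
    SELECT t.synset_id FROM deck_terms dt
    JOIN terms t ON t.id = dt.term_id
    WHERE dt.deck_id = ? ORDER BY dt.position
""", (1,)).fetchall()
print([r[0] for r in rows])  # ['ili:i3', 'ili:i1']
```

Bad data simply never gets a `deck_terms` row; the schema stays untouched.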
### Multiplayer mechanic: simultaneous answers (not buzz-first)
@@ -136,17 +155,54 @@ Then `sudo sysctl -p` or restart Docker.
## Data Model
### Users: internal UUID + openauth_sub (not sub as PK)
**Original approach:** Use OpenAuth `sub` claim directly as `users.id` (text primary key).
**Problem:** Embeds auth provider in the primary key (e.g. `"google|12345"`). If OpenAuth changes format or a second provider is added, the PK cascades through all FKs (`rooms.host_id`, `room_players.user_id`).
**Decision:**
- `users.id` = internal UUID (stable FK target)
- `users.openauth_sub` = text UNIQUE (auth provider claim)
- Allows adding multiple auth providers per user later without FK changes
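The upsert-on-login flow implied by this decision can be sketched with stdlib `sqlite3` (the helper name and query shape are assumptions for illustration):

```python
import sqlite3
import uuid

# Sketch: internal UUID PK, provider claim in a separate UNIQUE column.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE users (
    id TEXT PRIMARY KEY,               -- internal UUID, stable FK target
    openauth_sub TEXT UNIQUE NOT NULL  -- e.g. "google|12345"
)""")

def upsert_user(sub):
    """Return the internal id for this auth subject, creating the row on first login."""
    row = conn.execute("SELECT id FROM users WHERE openauth_sub = ?", (sub,)).fetchone()
    if row:
        return row[0]
    new_id = str(uuid.uuid4())
    conn.execute("INSERT INTO users (id, openauth_sub) VALUES (?, ?)", (new_id, sub))
    return new_id

first = upsert_user("google|12345")
again = upsert_user("google|12345")
assert first == again  # same internal id on every login; FKs never see the provider string
```

If the provider string ever changes format, only `openauth_sub` is touched; every FK keeps pointing at the unchanged UUID.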
### Rooms: `updated_at` for stale recovery only
Most tables omit `updated_at` (unnecessary for MVP). `rooms.updated_at` is kept specifically for stale room recovery—identifying rooms stuck in `in_progress` status after server crashes.
### Translations: UNIQUE (term_id, language_code, text)
Allows multiple synonyms per language per term (e.g. "dog", "hound" for same synset). Prevents exact duplicate rows. Homonyms (e.g. "Lead" metal vs. "Lead" guide) are handled by different `term_id` values (different synsets), so no constraint conflict.
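The three-column UNIQUE constraint can be demonstrated with stdlib `sqlite3` (a sketch; the real schema lives in Drizzle):

```python
import sqlite3

# Sketch of UNIQUE (term_id, language_code, text): synonyms allowed,
# exact duplicate rows rejected.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE translations (
    term_id INTEGER NOT NULL,
    language_code TEXT NOT NULL,
    text TEXT NOT NULL,
    UNIQUE (term_id, language_code, text)
)""")
ins = "INSERT INTO translations VALUES (?, ?, ?)"
conn.execute(ins, (1, "en", "dog"))
conn.execute(ins, (1, "en", "hound"))     # synonym for the same synset: allowed
try:
    conn.execute(ins, (1, "en", "dog"))   # exact duplicate row: rejected
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
assert duplicate_allowed is False
```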
### Decks: `pair_id` is nullable
`decks.pair_id` references `language_pairs` but is nullable. Reasons:
- Single-language decks (e.g. "English Grammar")
- Multi-pair decks (e.g. "Cognates" spanning EN-IT and EN-FR)
- System decks (created by app, not tied to specific user)
---


@@ -1,3 +1,45 @@
# notes
## tasks
- pinning dependencies in package.json files
## open word net
download libraries via
```bash
python -c 'import wn; wn.download("omw-fr")';
```
libraries:
odenet:1.4
omw-es:1.4
omw-fr:1.4
omw-it:1.4
omw-en:1.4
upgrade wn package:
```bash
pip install --upgrade wn
```
check if a lexicon is available, e.g. italian:
```bash
python -c "import wn; print(len(wn.words(lang='it', lexicon='omw-it:1.4')))"
```
remove a library:
```bash
python -c "import wn; wn.remove('oewn:2024')"
```
list all libraries:
```bash
python -c "import wn; print(wn.lexicons())"
```


@ -1,301 +0,0 @@
# Task: Run `scripts/extract_omw.py` locally → generates `packages/db/src/seed.json`
**Goal**: Produce a committed, validated JSON file containing 1000 English-Italian noun pairs ranked by frequency.
**Done when**: `packages/db/src/seed.json` exists in version control with exactly 1000 entries, all validated.
---
## Step 1: Python Environment Setup
**Prerequisites**: Python 3.11+ installed locally (not in Docker)
- [x] Create `scripts/requirements.txt`:
  ```
  nltk>=3.8
  ```
- [x] Add to `.gitignore`:
  ```
  venv/
  __pycache__/
  *.pyc
  ```
- [x] Create virtual environment:
```bash
cd scripts
python -m venv venv
source venv/bin/activate
```
- [x] Install dependencies:
```bash
pip install -r requirements.txt
```
- [x] Create scripts/download_data.py to fetch NLTK corpora:
```Python
import nltk
def main():
print("Downloading WordNet...")
nltk.download("wordnet")
print("Downloading OMW 1.4...")
nltk.download("omw-1.4")
print("Downloading WordNet IC...")
nltk.download("wordnet_ic")
print("Done.")
if __name__ == "__main__":
main()
```
- [x] Run it once to cache data locally (~100MB in ~/nltk_data/):
```bash
python download_data.py
```
- [x] Verify data downloaded: check ~/nltk_data/corpora/ exists with wordnet, omw, wordnet_ic folders
## Step 2: Data Exploration (Throwaway Script)
Before writing extraction logic, understand the data shape.
- [ ] Create `scripts/explore.py`:
  - Import `nltk.corpus.wordnet` as `wn`
  - Import `nltk.corpus.wordnet_ic` as `wnic`
  - Load `semcor_ic = wnic.ic('ic-semcor.dat')`
- [ ] Print sample synset:
  ```python
  dog = wn.synset('dog.n.01')
  print(f"Offset: {dog.offset():08d}")
  print(f"POS: {dog.pos()}")
  print(f"English lemmas: {dog.lemma_names()}")
  print(f"Italian lemmas: {dog.lemma_names(lang='ita')}")
  print(f"Frequency: {dog.res_similarity(dog, semcor_ic)}")
  ```
- [ ] Test 5 common words: dog, house, car, water, time
- [ ] Document findings:
  - [ ] Synset ID format confirmed: `{offset:08d}{pos}` → `02084071n`
  - [ ] Italian availability: what percentage have translations?
  - [ ] Multi-word handling: underscores in `lemma_names()`?
  - [ ] Frequency scores: numeric range and distribution
- [ ] Test edge cases:
  - [ ] Word with multiple synsets (homonyms): bank, run
  - [ ] Word with multiple Italian translations per synset
  - [ ] Word with no Italian translation
- [ ] Delete or keep `explore.py` (optional reference)

**Decision checkpoint**: Confirm frequency ranking strategy
- Option A: `synset.count()` — raw lemma occurrence count
- Option B: `res_similarity` with SemCor IC — information content score
- Document choice and rationale in comments
## Step 3: Extraction Script Implementation
Create scripts/extract_omw.py with the full pipeline.
- [ ] Imports and setup:
  ```python
  import json
  from collections import defaultdict
  from nltk.corpus import wordnet as wn
  from nltk.corpus import wordnet_ic
  ```
- [ ] Load information content for frequency ranking:
  ```python
  semcor_ic = wordnet_ic.ic('ic-semcor.dat')
  ```
- [ ] Define target count and output path:
  ```python
  TARGET_COUNT = 1000
  OUTPUT_PATH = '../packages/db/src/seed.json'
  ```
- [ ] Iterate all noun synsets and collect candidates:
  ```python
  candidates = []
  for synset in wn.all_synsets(pos='n'):
      italian_lemmas = synset.lemma_names(lang='ita')
      if not italian_lemmas:
          continue
      english_lemmas = synset.lemma_names()
      # Calculate frequency score
      try:
          # Using self-similarity as frequency proxy
          freq_score = synset.res_similarity(synset, semcor_ic)
      except Exception:
          freq_score = 0
      candidates.append({
          'synset': synset,
          'offset': synset.offset(),
          'pos': synset.pos(),
          'freq_score': freq_score,
          'english': english_lemmas,
          'italian': italian_lemmas,
      })
  ```
- [ ] Sort by frequency descending:
  ```python
  candidates.sort(key=lambda x: x['freq_score'], reverse=True)
  ```
- [ ] Slice top 1000:
  ```python
  top_1000 = candidates[:TARGET_COUNT]
  ```
- [ ] Build output structure matching schema needs:
  ```python
  seed_data = []
  for rank, candidate in enumerate(top_1000, start=1):
      # Normalize multi-word expressions: replace underscores with spaces
      english_normalized = [w.replace('_', ' ') for w in candidate['english']]
      italian_normalized = [w.replace('_', ' ') for w in candidate['italian']]
      seed_data.append({
          'synset_id': f"wn:{candidate['offset']:08d}{candidate['pos']}",
          'pos': 'noun',  # Map 'n' to full word for schema
          'frequency_rank': rank,
          'english_lemmas': english_normalized,
          'italian_lemmas': italian_normalized,
      })
  ```
- [ ] Write formatted JSON:
  ```python
  with open(OUTPUT_PATH, 'w', encoding='utf-8') as f:
      json.dump(seed_data, f, indent=2, ensure_ascii=False)
  print(f"Generated {len(seed_data)} entries at {OUTPUT_PATH}")
  ```

Edge cases to handle:
- [ ] Skip synsets with empty Italian list (already filtered)
- [ ] Handle `res_similarity` exceptions (some synsets lack IC data)
- [ ] Normalize underscores to spaces in all lemmas
- [ ] Ensure UTF-8 encoding for Italian characters (à, è, ì, ò, ù)
## Step 4: Validation Script
Create scripts/validate_seed.py to verify output quality.
- [ ] Load and parse JSON:
  ```python
  import json
  from pathlib import Path

  SEED_PATH = '../packages/db/src/seed.json'
  with open(SEED_PATH, 'r', encoding='utf-8') as f:
      data = json.load(f)
  ```
- [ ] Run validation checks:
  - [ ] Count check: `len(data) == 1000`
  - [ ] Rank check: all entries have `frequency_rank` from 1 to 1000, no gaps, no duplicates
  - [ ] Synset ID format: matches regex `^wn:\d{8}[nvar]$` (noun, verb, adjective, adverb)
  - [ ] POS check: all are `noun` (since we filtered `pos='n'`)
  - [ ] Italian presence: every entry has `italian_lemmas` with at least 1 item
  - [ ] English presence: every entry has `english_lemmas` with at least 1 item
  - [ ] No duplicate synset IDs
  - [ ] No empty strings in lemma arrays
  - [ ] No leading/trailing whitespace in lemmas
- [ ] Print summary statistics:
  - Total entries
  - Average English lemmas per entry
  - Average Italian lemmas per entry
  - Sample entries (ranks 1, 500, 1000) for manual inspection
- [ ] Exit with error code if any check fails
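The synset-ID format check from the list above can be run as a quick stdlib sketch:

```python
import re

# Validates IDs like "wn:02084071n" — 8-digit offset plus a POS letter.
SYNSET_ID_RE = re.compile(r"^wn:\d{8}[nvar]$")  # noun, verb, adjective, adverb

assert SYNSET_ID_RE.match("wn:02084071n")       # valid noun synset
assert not SYNSET_ID_RE.match("wn:2084071n")    # only 7 digits
assert not SYNSET_ID_RE.match("wn:02084071x")   # unknown POS letter
```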
## Step 5: Execution and Iteration
- [ ] Run extraction:
  ```bash
  cd scripts
  source venv/bin/activate
  python extract_omw.py
  ```
- [ ] Run validation:
  ```bash
  python validate_seed.py
  ```
- [ ] If validation fails, fix `extract_omw.py` and re-run:
  - [ ] Too few entries? Relax filters or reduce target count
  - [ ] Data quality issues? Add normalization logic
  - [ ] Format mismatches? Adjust output structure
- [ ] Manual sanity check: open `seed.json`, read first 10 and last 10 entries
  - Do translations make sense?
  - Are frequencies plausible (common words first)?
## Step 6: Git Integration
- [ ] Verify file location: `packages/db/src/seed.json`
- [ ] Check file size: should be ~200-500KB (if larger, investigate)
- [ ] Stage the file:
  ```bash
  git add packages/db/src/seed.json
  ```
- [ ] Commit with descriptive message:
  ```
  feat(data): add seed.json with 1000 English-Italian noun pairs

  Generated from WordNet 3.0 + OMW 1.4 using SemCor IC frequency ranking.
  Top entry: dog/cane (rank 1). Bottom entry: [word] (rank 1000).
  ```
- [ ] Push to current feature branch
## Step 7: Documentation Update
- [ ] Update `documentation/decisions.md` with:
  - [ ] Frequency ranking method chosen (SemCor IC vs count) and why
  - [ ] How multi-word expressions are handled (underscore → space)
  - [ ] OMW data quality notes (coverage percentage, any manual fixes)
  - [ ] Seed file structure (for future maintainers)

## Definition of Done
- [ ] `scripts/extract_omw.py` exists and runs without errors
- [ ] `scripts/validate_seed.py` passes all checks
- [ ] `packages/db/src/seed.json` committed to git with exactly 1000 entries
- [ ] Manual sample check confirms sensible translations
- [ ] `decisions.md` updated with extraction methodology
- [ ] Virtual environment and Python cache files are gitignored

## Out of Scope (for this task)
- Distractor generation (happens in API layer later)
- Additional parts of speech (verbs, adjectives)
- Data updates/re-seeding strategy (MVP assumes static seed)
- Database insertion (next task: `seed.ts`)


@@ -2,158 +2,145 @@
Each phase produces a working, deployable increment. Nothing is built speculatively.
---
## Phase 0 — Foundation
**Goal**: Empty repo that builds, lints, and runs end-to-end.
**Done when**: `pnpm dev` starts both apps; `GET /api/health` returns 200; React renders a hello page.
- [x] Initialise pnpm workspace monorepo: `apps/web`, `apps/api`, `packages/shared`, `packages/db`
- [x] Configure TypeScript project references across packages
- [x] Set up ESLint + Prettier with shared configs in root
- [x] Set up Vitest in `api` and `web` and both packages
- [x] Scaffold Express app with `GET /api/health`
- [x] Scaffold Vite + React app with TanStack Router (single root route)
- [x] Configure Drizzle ORM + connection to local PostgreSQL
- [x] Write first migration (empty — just validates the pipeline works)
- [x] `docker-compose.yml` for local dev: `api`, `web`, `postgres`, `valkey`
- [x] `.env.example` files for `apps/api` and `apps/web`
- [x] Update decisions.md
---
## Phase 1 — Vocabulary Data
**Goal**: Word data lives in the DB and can be queried via the API.
**Done when**: `GET /api/decks/1/terms?limit=10` returns 10 terms from a specific deck.
- [x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json`
  - Import ALL available OMW noun synsets (no frequency filtering)
- [ ] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
- [ ] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
- [ ] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
- [ ] Write `scripts/build_decks.ts` (reads external CEFR lists, matches to DB, creates decks)
- [ ] Download CEFR A1/A2 noun lists (from GitHub repos)
- [ ] Run `pnpm db:seed` → populates terms
- [ ] Run `pnpm db:build-decks` → creates curated decks
- [ ] Implement `DeckRepository.getTerms(deckId, limit, offset)`
- [ ] Implement `QuizService.attachDistractors(terms)` — same POS, server-side, no duplicates
- [ ] Implement `GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms` endpoints
- [ ] Define Zod response schemas in `packages/shared`
- [ ] Unit tests for `QuizService` (correct POS filtering, never includes the answer)
- [ ] Update decisions.md
---
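The distractor rule (same POS, never the correct answer, no duplicate options) is implemented in TypeScript in the project; a language-neutral Python sketch, with a hypothetical term shape chosen for illustration:

```python
import random

# Hypothetical sketch of QuizService.attachDistractors:
# pick k options of the same POS, excluding the answer, with no duplicates.
def attach_distractors(term, pool, k=3, seed=0):
    rng = random.Random(seed)
    candidates = [
        t["translation"] for t in pool
        if t["pos"] == term["pos"] and t["translation"] != term["translation"]
    ]
    # de-duplicate before sampling so every option is distinct
    distractors = rng.sample(sorted(set(candidates)), k)
    return {**term, "distractors": distractors}

pool = [
    {"pos": "noun", "translation": w}
    for w in ["cane", "casa", "acqua", "tempo", "libro"]
] + [{"pos": "verb", "translation": "andare"}]
q = attach_distractors({"pos": "noun", "translation": "cane"}, pool)
assert "cane" not in q["distractors"]       # never includes the answer
assert "andare" not in q["distractors"]     # POS filter holds
assert len(set(q["distractors"])) == 3      # no duplicates
```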
## Phase 2 — Auth
**Goal**: Users can log in via Google or GitHub and stay logged in.
**Done when**: JWT from OpenAuth is validated by the API; protected routes redirect unauthenticated users; user row is created on first login.
- [ ] Add OpenAuth service to `docker-compose.yml`
- [ ] Write Drizzle schema: `users` (uuid `id`, text `openauth_sub`, no games_played/won columns)
- [ ] Write and run migration (includes `updated_at` + triggers)
- [ ] Implement JWT validation middleware in `apps/api`
- [ ] Implement `GET /api/auth/me` (validate token, upsert user row via `openauth_sub`, return user)
- [ ] Define auth Zod schemas in `packages/shared`
- [ ] Frontend: login page with "Continue with Google" + "Continue with GitHub" buttons
- [ ] Frontend: redirect to `auth.yourdomain.com` → receive JWT → store in memory + HttpOnly cookie
- [ ] Frontend: TanStack Router auth guard (redirects unauthenticated users)
- [ ] Frontend: TanStack Query `api.ts` attaches token to every request
- [ ] Unit tests for JWT middleware
- [ ] Update decisions.md
---
## Phase 3 — Single-player Mode
**Goal**: A logged-in user can complete a full solo quiz session.
**Done when**: User sees 10 questions, picks answers, sees their final score.
- [ ] Frontend: `/singleplayer` route
- [ ] `useQuizSession` hook: fetch terms, manage question index + score state
- [ ] `QuestionCard` component: prompt word + 4 answer buttons
- [ ] `OptionButton` component: idle / correct / wrong states
- [ ] `ScoreScreen` component: final score + play-again button
- [ ] TanStack Query integration for `GET /terms`
- [ ] RTL tests for `QuestionCard` and `OptionButton`
- [ ] Update decisions.md
---
## Phase 4 — Multiplayer Rooms (Lobby)
**Goal**: Players can create and join rooms; the host sees all joined players in real time.
**Done when**: Two browser tabs can join the same room and see each other's display names update live via WebSocket.
- [ ] Write Drizzle schema: `rooms`, `room_players` (add `deck_id` FK to rooms)
- [ ] Write and run migration (includes CHECK constraints: `code = UPPER(code)`, `status`, `max_players`)
- [ ] Add indexes: `idx_rooms_host`, `idx_room_players_score`
- [ ] `POST /rooms` and `POST /rooms/:code/join` REST endpoints
- [ ] `RoomService`: create room with short code, join room, enforce max player limit
- [ ] `POST /rooms` accepts `deck_id` (which vocabulary deck to use)
- [ ] WebSocket server: attach `ws` upgrade handler to the Express HTTP server
- [ ] WS auth middleware: validate OpenAuth JWT on upgrade
- [ ] WS message router: dispatch incoming messages by `type`
- [ ] `room:join` / `room:leave` handlers → broadcast `room:state` to all room members
- [ ] Room membership tracked in Valkey (ephemeral) + `room_players` in PostgreSQL (durable)
- [ ] Define all WS event Zod schemas in `packages/shared`
- [ ] Frontend: `/multiplayer/lobby` — create room form + join-by-code form
- [ ] Frontend: `/multiplayer/room/:code` — player list, room code display, "Start Game" (host only)
- [ ] Frontend: `ws.ts` singleton WS client with reconnect on drop
- [ ] Frontend: Zustand `gameStore` handles incoming `room:state` events
- [ ] Update decisions.md
---
## Phase 5 — Multiplayer Game
**Goal**: Host starts a game; all players answer simultaneously in real time; a winner is declared.
**Done when**: 24 players complete a 10-round game with correct live scores and a winner screen.
- [ ] `GameService`: generate question sequence for a room, enforce server-side 15 s timer
- [ ] `room:start` WS handler → begin question loop, broadcast first `game:question`
- [ ] `game:answer` WS handler → collect per-player answers
- [ ] On all-answered or timeout → evaluate, broadcast `game:answer_result`
- [ ] After N rounds → broadcast `game:finished`, update `rooms.status` + `room_players.score` in DB (transactional)
- [ ] Frontend: `/multiplayer/game/:code` route
- [ ] Frontend: extend Zustand store with `currentQuestion`, `roundAnswers`, `scores`
- [ ] Frontend: reuse `QuestionCard` + `OptionButton`; add countdown timer ring
- [ ] Frontend: `ScoreBoard` component — live per-player scores after each round
- [ ] Frontend: `GameFinished` screen — winner highlight, final scores, "Play Again" button
- [ ] Unit tests for `GameService` (round evaluation, tie-breaking, timeout auto-advance)
- [ ] Update decisions.md
---
## Phase 6 — Production Deployment
**Goal**: App is live on Hetzner, accessible via HTTPS on all subdomains.
**Done when**: `https://app.yourdomain.com` loads; `wss://api.yourdomain.com` connects; auth flow works end-to-end.
- [ ] `docker-compose.prod.yml`: all services + `nginx-proxy` + `acme-companion`
- [ ] Nginx config per container: `VIRTUAL_HOST` + `LETSENCRYPT_HOST` env vars
- [ ] Production `.env` files on VPS (OpenAuth secrets, DB credentials, Valkey URL)
- [ ] Drizzle migration runs on `api` container start (includes CHECK constraints + triggers)
- [ ] Seed production DB (run `seed.ts` once)
- [ ] Smoke test: login → solo game → create room → multiplayer game end-to-end
- [ ] Update decisions.md
---
## Phase 7 — Polish & Hardening _(post-MVP)_
Not required to ship, but address before real users arrive.
- [ ] Rate limiting on API endpoints (`express-rate-limit`)
- [ ] Graceful WS reconnect with exponential back-off
- [ ] React error boundaries
- [ ] `GET /users/me/stats` endpoint (aggregates from `room_players`) + profile page
- [ ] Accessibility pass (keyboard nav, ARIA on quiz buttons)
- [ ] Favicon, page titles, Open Graph meta
- [ ] CI/CD pipeline (GitHub Actions → SSH deploy on push to `main`)
- [ ] Database backups (cron → Hetzner Object Storage)
- [ ] Update decisions.md
---
## Dependency Graph
```
Phase 0 (Foundation)
└── Phase 1 (Vocabulary Data)
└── Phase 2 (Auth)
@@ -161,4 +148,3 @@ Phase 0 (Foundation)
└── Phase 4 (Room Lobby)
└── Phase 5 (Multiplayer Game)
└── Phase 6 (Deployment)
```


@@ -71,7 +71,9 @@ vocab-trainer/
│ ├── shared/ # Zod schemas, TypeScript types, constants
│ └── db/ # Drizzle schema, migrations, seed script
├── scripts/
│ ├── datafiles/
│ │ └── en-it-nouns.json
│ └── extract-en-it-nouns.py # One-time WordNet + OMW extraction → en-it-nouns.json
├── docker-compose.yml
├── docker-compose.prod.yml
├── pnpm-workspace.yaml
@@ -155,73 +157,137 @@ nginx-proxy (:80/:443)
SSL is fully automatic via `nginx-proxy` + `acme-companion`. No manual Certbot needed.
### 5.1 Valkey Key Structure
Ephemeral room state is stored in Valkey with TTL (e.g., 1 hour).
PostgreSQL stores durable history only.
Key Format: `room:{code}:{field}`
| Key | Type | TTL | Description |
|------------------------------|---------|-------|-------------|
| `room:{code}:state` | Hash | 1h | Current question index, round status |
| `room:{code}:players` | Set | 1h | List of connected user IDs |
| `room:{code}:answers:{round}`| Hash | 15m | Temp storage for current round answers |
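A tiny sketch of the key scheme (the helper name is hypothetical):

```python
# Builds keys of the form room:{code}:{field}, with an optional round
# suffix for the per-round answers hash.
def room_key(code, field, round_no=None):
    suffix = f"{field}:{round_no}" if round_no is not None else field
    return f"room:{code}:{suffix}"

assert room_key("WOLF-42", "state") == "room:WOLF-42:state"
assert room_key("WOLF-42", "answers", 3) == "room:WOLF-42:answers:3"
```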
**Recovery strategy:** If the server crashes mid-game, Valkey data is lost and PostgreSQL `room_players.score` remains 0. A startup health check resets rooms stuck in `in_progress` to `finished` whenever `rooms.updated_at` is stale.
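The startup check can be sketched with stdlib `sqlite3` (table and column names follow the schema; the 10-minute staleness threshold is an assumption):

```python
import sqlite3

# Sketch: rooms stuck in_progress with a stale updated_at get reset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (code TEXT, status TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO rooms VALUES (?, ?, ?)", [
    ("WOLF-42", "in_progress", "2026-03-27 10:00:00"),  # stale: crashed mid-game
    ("LYNX-07", "in_progress", "2026-03-27 16:55:00"),  # fresh: still running
])

cutoff = "2026-03-27 16:45:00"  # now minus 10 minutes, computed at startup
conn.execute(
    "UPDATE rooms SET status = 'finished' "
    "WHERE status = 'in_progress' AND updated_at < ?", (cutoff,))

statuses = dict(conn.execute("SELECT code, status FROM rooms"))
# WOLF-42 is now 'finished'; LYNX-07 stays 'in_progress'
```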
---
## 6. Data Model
### Design principle
## Design principle
Words are modelled as language-neutral **terms** with one or more **translations** per language. Adding a new language pair (e.g. EnglishFrench) requires **no schema changes** — only new rows in `translations` and `language_pairs`. The flat `english/italian` column pattern is explicitly avoided.
Words are modelled as language-neutral concepts (terms) separate from learning curricula (decks).
Adding a new language pair requires no schema changes — only new rows in `translations`, `decks`, and `language_pairs`.
### Core tables
## Core tables
```
terms
id uuid PK
synset_id text UNIQUE -- WordNet synset offset e.g. "wn:01234567n"
pos varchar(20) -- "noun" | "verb" | "adjective"
frequency_rank integer -- 11000, reserved for difficulty filtering
created_at timestamptz
synset_id text UNIQUE -- OMW ILI (e.g. "ili:i12345")
pos varchar(20) -- NOT NULL, CHECK (pos IN ('noun', 'verb', 'adjective', 'adverb'))
created_at timestamptz DEFAULT now()
-- REMOVED: frequency_rank (handled at deck level)
translations
id uuid PK
term_id uuid FK → terms.id
language_code varchar(10) -- NOT NULL, BCP 47: "en", "it"
text text -- NOT NULL
created_at timestamptz DEFAULT now()
UNIQUE (term_id, language_code, text) -- Allow synonyms, prevent exact duplicates
term_glosses
id uuid PK
term_id uuid FK → terms.id
language_code varchar(10) -- NOT NULL
text text -- NOT NULL
type varchar(20) -- CHECK (type IN ('definition', 'example')), NULLABLE
created_at timestamptz DEFAULT now()
language_pairs
id uuid PK
source varchar(10) -- NOT NULL, e.g. "en"
target varchar(10) -- NOT NULL, e.g. "it"
label text -- e.g. "English → Italian"
active boolean DEFAULT true
UNIQUE (source, target)
decks
id uuid PK
name text -- NOT NULL (e.g. "A1 Italian Nouns", "Most Common 1000")
description text -- NULLABLE
pair_id uuid FK → language_pairs.id -- NULLABLE (for single-language or multi-pair decks)
created_by uuid FK → users.id -- NULLABLE (for system decks)
is_public boolean DEFAULT true
created_at timestamptz DEFAULT now()
deck_terms
deck_id uuid FK → decks.id
term_id uuid FK → terms.id
position smallint -- NOT NULL, ordering within deck (1, 2, 3...)
added_at timestamptz DEFAULT now()
PRIMARY KEY (deck_id, term_id)
users
id uuid PK -- Internal stable ID (FK target)
openauth_sub text UNIQUE -- NOT NULL, OpenAuth `sub` claim (e.g. "google|12345")
email varchar(255) UNIQUE -- NULLABLE (GitHub users may lack email)
display_name varchar(100)
created_at timestamptz DEFAULT now()
last_login_at timestamptz
-- REMOVED: games_played, games_won (derive from room_players)
rooms
id uuid PK
code varchar(8) UNIQUE -- NOT NULL, CHECK (code = UPPER(code)), human-readable e.g. "WOLF-42"
host_id uuid FK → users.id
pair_id uuid FK → language_pairs.id
deck_id uuid FK → decks.id -- Which vocabulary deck this room uses
status varchar(20) -- NOT NULL, CHECK (status IN ('waiting', 'in_progress', 'finished'))
max_players smallint -- NOT NULL, DEFAULT 4, CHECK (max_players BETWEEN 2 AND 10)
round_count smallint -- NOT NULL, DEFAULT 10, CHECK (round_count BETWEEN 5 AND 20)
created_at timestamptz DEFAULT now()
updated_at timestamptz DEFAULT now() -- For stale room recovery
room_players
room_id uuid FK → rooms.id
user_id uuid FK → users.id
score integer DEFAULT 0 -- Final score only (written at game end)
joined_at timestamptz DEFAULT now()
left_at timestamptz -- Populated on WS disconnect/leave
PRIMARY KEY (room_id, user_id)
```
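The no-schema-change claim above can be made concrete: adding English→French is a data-only migration. A TypeScript sketch of the statements involved (illustrative only; `ili:i12345` is the placeholder ILI from the schema comments, and `'chien'` is a made-up example row):

```typescript
// Data-only migration sketch: a new language pair needs INSERTs, never ALTER TABLE.
const addEnglishFrench: string[] = [
  `INSERT INTO language_pairs (source, target, label)
   VALUES ('en', 'fr', 'English → French')`,
  // One row per French lemma, keyed to the existing term by its shared ILI:
  `INSERT INTO translations (term_id, language_code, text)
   SELECT id, 'fr', 'chien' FROM terms WHERE synset_id = 'ili:i12345'`,
];
```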
### Indexes

```sql
-- Vocabulary
CREATE INDEX idx_terms_pos ON terms (pos);
CREATE INDEX idx_translations_lang ON translations (language_code, term_id);

-- Decks
CREATE INDEX idx_decks_pair ON decks (pair_id, is_public);
CREATE INDEX idx_decks_creator ON decks (created_by);
CREATE INDEX idx_deck_terms_term ON deck_terms (term_id);

-- Language pairs
CREATE INDEX idx_pairs_active ON language_pairs (active, source, target);

-- Rooms
CREATE INDEX idx_rooms_status ON rooms (status);
CREATE INDEX idx_rooms_host ON rooms (host_id);
-- NOTE: idx_rooms_code omitted (UNIQUE constraint creates index automatically)

-- Room players
CREATE INDEX idx_room_players_user ON room_players (user_id);
CREATE INDEX idx_room_players_score ON room_players (room_id, score DESC);
```
### Repository Logic Note

`DeckRepository.getTerms(deckId, limit, offset)` fetches terms from a specific deck, ordered by `deck_terms.position`. For random practice within a deck, `WHERE deck_id = X ORDER BY random() LIMIT N` is acceptable because decks are bounded (e.g. 500 terms max), so the `random()` sort never touches the full `terms` table.
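The ordered fetch can be sketched as a pure query builder. This is a hypothetical helper showing the shape of the query `DeckRepository.getTerms` would execute; the real repository presumably binds it to pg or a query builder:

```typescript
// Hypothetical builder for DeckRepository.getTerms; returns SQL text + bind params.
function buildGetTermsQuery(deckId: string, limit = 50, offset = 0) {
  return {
    sql: `SELECT t.id, t.synset_id, t.pos
            FROM deck_terms dt
            JOIN terms t ON t.id = dt.term_id
           WHERE dt.deck_id = $1
           ORDER BY dt.position
           LIMIT $2 OFFSET $3`,
    params: [deckId, limit, offset] as const,
  };
}
```

Keeping the SQL in a pure function makes the ordering contract (`deck_terms.position`) trivially unit-testable without a database.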
---
## 7. Vocabulary Data
### Source

- **Open Multilingual Wordnet (OMW)** — English & Italian nouns linked via the Interlingual Index (ILI)
- **External CEFR lists** — for deck curation (e.g. GitHub: ecom/cefr-lists)
### Extraction process
1. Run `extract-en-it-nouns.py` once locally using the `wn` library
   - Imports **all** bilingual noun synsets (no frequency filtering)
   - Output: `datafiles/en-it-nouns.json` — committed to the repo
2. Run `pnpm db:seed` — populates the `terms` + `translations` tables from the JSON
3. Run `pnpm db:build-decks` — matches external CEFR lists against DB terms and creates `decks` + `deck_terms` rows
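Step 2 can be sketched as a pure transform from one JSON synset entry to DB rows. The field names (`ili`, `en`, `it`) are assumptions about the extractor's output, not a documented format:

```typescript
// Sketch of the db:seed transform (assumed JSON shape).
type SynsetEntry = { ili: string; pos: string; en: string[]; it: string[] };

function toRows(entry: SynsetEntry, termId: string) {
  return {
    term: { id: termId, synset_id: entry.ili, pos: entry.pos },
    // One translation row per lemma: full synonym support, with exact
    // duplicates blocked by UNIQUE (term_id, language_code, text).
    translations: [
      ...entry.en.map((text) => ({ term_id: termId, language_code: "en", text })),
      ...entry.it.map((text) => ({ term_id: termId, language_code: "it", text })),
    ],
  };
}
```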
### Benefits of the deck-based approach
- WordNet frequency data is unreliable (e.g. chemical symbols rank high)
- Curricula can come from external sources (CEFR, Oxford 3000, SUBTLEX)
- Bad data excluded at deck level, not schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes
`terms.synset_id` stores the OMW ILI (e.g. `ili:i12345`) for traceability and future re-imports with additional languages.
---