refactor: migrate to deck-based vocabulary curation

Database Schema:
- Add decks table for curated word lists (A1, Most Common, etc.)
- Add deck_terms join table with position ordering
- Link rooms to decks via rooms.deck_id FK
- Remove frequency_rank from terms (now deck-scoped)
- Change users.id to uuid, add openauth_sub for auth mapping
- Add room_players.left_at for disconnect tracking
- Add rooms.updated_at for stale room recovery
- Add CHECK constraints for data integrity (pos, status, etc.)

Extraction Script:
- Rewrite extract.py to mirror complete OMW dataset
- Extract all 25,204 bilingual noun synsets (en-it)
- Remove frequency filtering and block lists
- Output all lemmas per synset for full synonym support
- Seed data now uncurated; decks handle selection

Architecture:
- Separate concerns: raw OMW data in DB, curation in decks
- Enables user-created decks and multiple difficulty levels
- Rooms select vocabulary by choosing a deck
This commit is contained in:
lila 2026-03-27 16:53:26 +01:00
parent e9e750da3e
commit be7a7903c5
9 changed files with 349148 additions and 492 deletions

View file

@ -56,9 +56,28 @@ Production will use Nginx to serve static Vite build output.
`app.ts` exports a `createApp()` factory function. `server.ts` imports it and calls `.listen()`. This allows tests to import the app directly without starting a server, keeping tests isolated and fast.
### Data model: `terms` + `translations` (not flat word pairs)
### Data model: `decks` separate from `terms` (not frequency_rank filtering)
Words are modelled as language-neutral concepts with one or more translations per language. Adding a new language pair requires no schema changes — only new rows in `translations` and `language_pairs`. The flat `english/italian` column pattern is explicitly avoided.
**Original approach:** Store `frequency_rank` on `terms` table and filter by rank range for difficulty.
**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:
- Rank 1: "In" → "indio" (chemical symbol: Indium)
- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium)
- Rank 7: "He" → "elio" (chemical symbol: Helium)
These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning.
**Decision:**
- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering)
- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
- `deck_terms` junction table links terms to decks with position ordering
- `rooms.deck_id` specifies which vocabulary deck a game uses
**Benefits:**
- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
- Bad data (chemical symbols, obscure words) excluded at deck level, not schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes
### Multiplayer mechanic: simultaneous answers (not buzz-first)
@ -136,17 +155,54 @@ Then `sudo sysctl -p` or restart Docker.
## Data Model
### Translations: no isPrimary column
### Users: internal UUID + openauth_sub (not sub as PK)
WordNet synsets often have multiple lemmas per language (e.g. "dog", "domestic dog", "Canis familiaris"). An earlier design included an isPrimary boolean to mark which translation to display in the quiz.
This was dropped because:
**Original approach:** Use OpenAuth `sub` claim directly as `users.id` (text primary key).
The schema cannot enforce a single primary per (term_id, language_code) with a boolean alone — multiple rows can be true simultaneously
Fixing that requires either a partial unique index or application-level logic, both adding complexity for no MVP benefit
The ambiguity can be resolved earlier and more cleanly
**Problem:** Embeds auth provider in the primary key (e.g. `"google|12345"`). If OpenAuth changes format or a second provider is added, the PK cascades through all FKs (`rooms.host_id`, `room_players.user_id`).
Decision: the Python extraction script picks one translation per language per synset at extraction time, taking the first lemma (WordNet orders lemmas by frequency). By the time data enters the database the choice is already made.
The translations table carries a UNIQUE (term_id, language_code) constraint, which enforces exactly one translation per language at the database level. No isPrimary column exists.
**Decision:**
- `users.id` = internal UUID (stable FK target)
- `users.openauth_sub` = text UNIQUE (auth provider claim)
- Allows adding multiple auth providers per user later without FK changes
### Rooms: `updated_at` for stale recovery only
Most tables omit `updated_at` (unnecessary for MVP). `rooms.updated_at` is kept specifically for stale room recovery—identifying rooms stuck in `in_progress` status after server crashes.
### Translations: UNIQUE (term_id, language_code, text)
Allows multiple synonyms per language per term (e.g. "dog", "hound" for same synset). Prevents exact duplicate rows. Homonyms (e.g. "Lead" metal vs. "Lead" guide) are handled by different `term_id` values (different synsets), so no constraint conflict.
### Decks: `pair_id` is nullable
`decks.pair_id` references `language_pairs` but is nullable. Reasons:
- Single-language decks (e.g. "English Grammar")
- Multi-pair decks (e.g. "Cognates" spanning EN-IT and EN-FR)
- System decks (created by app, not tied to specific user)
### Decks separate from terms (not frequency_rank filtering)
**Original approach:** Store `frequency_rank` on `terms` table and filter by rank range for difficulty.
**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:
- Rank 1: "In" → "indio" (chemical symbol: Indium)
- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium)
- Rank 7: "He" → "elio" (chemical symbol: Helium)
These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning.
**Decision:**
- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering)
- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
- `deck_terms` junction table links terms to decks with position ordering
- `rooms.deck_id` specifies which vocabulary deck a game uses
**Benefits:**
- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
- Bad data (chemical symbols, obscure words) excluded at deck level, not schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes
---