refactor: migrate to deck-based vocabulary curation

Database Schema:
- Add decks table for curated word lists (A1, Most Common, etc.)
- Add deck_terms join table with position ordering
- Link rooms to decks via rooms.deck_id FK
- Remove frequency_rank from terms (now deck-scoped)
- Change users.id to uuid, add openauth_sub for auth mapping
- Add room_players.left_at for disconnect tracking
- Add rooms.updated_at for stale room recovery
- Add CHECK constraints for data integrity (pos, status, etc.)

Extraction Script:
- Rewrite extract.py to mirror complete OMW dataset
- Extract all 25,204 bilingual noun synsets (en-it)
- Remove frequency filtering and block lists
- Output all lemmas per synset for full synonym support
- Seed data now uncurated; decks handle selection

Architecture:
- Separate concerns: raw OMW data in DB, curation in decks
- Enables user-created decks and multiple difficulty levels
- Rooms select vocabulary by choosing a deck
2026-03-27 16:53:26 +01:00
commit be7a7903c5
parent e9e750da3e
9 changed files with 349148 additions and 492 deletions
--- a/documentation/spec.md
+++ b/documentation/spec.md
@@ -71,7 +71,9 @@ vocab-trainer/
 │   ├── shared/               # Zod schemas, TypeScript types, constants
 │   └── db/                   # Drizzle schema, migrations, seed script
 ├── scripts/
-│   └── extract_omw.py        # One-time WordNet + OMW extraction → seed.json
+│   ├── datafiles/
+│   │   └── en-it-nouns.json
+│   └── extract-en-it-nouns.py        # One-time WordNet + OMW extraction → datafiles/en-it-nouns.json
 ├── docker-compose.yml
 ├── docker-compose.prod.yml
 ├── pnpm-workspace.yaml
@@ -155,73 +157,137 @@ nginx-proxy (:80/:443)

 SSL is fully automatic via `nginx-proxy` + `acme-companion`. No manual Certbot needed.

+### 5.1 Valkey Key Structure
+
+Ephemeral room state is stored in Valkey with TTL (e.g., 1 hour). 
+PostgreSQL stores durable history only.
+
+Key format: `room:{code}:{field}`
+
+| Key                           | Type | TTL | Description                                 |
+|-------------------------------|------|-----|---------------------------------------------|
+| `room:{code}:state`           | Hash | 1h  | Current question index, round status        |
+| `room:{code}:players`         | Set  | 1h  | IDs of connected users                      |
+| `room:{code}:answers:{round}` | Hash | 15m | Temporary storage for current round answers |
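
One way to keep this key layout in a single place is a small helper that builds each key together with its TTL. The following is a minimal sketch, not code from the repository: `roomKeys`, `ROOM_TTL_SECONDS`, and `ANSWERS_TTL_SECONDS` are hypothetical names, and the TTL values simply mirror the table.

```typescript
// Hypothetical helper (not part of the repo): builds the Valkey keys listed
// in the table. TTLs are in seconds and mirror the table (1h and 15m).
const ROOM_TTL_SECONDS = 60 * 60;
const ANSWERS_TTL_SECONDS = 15 * 60;

function roomKeys(code: string) {
  const base = `room:${code}`;
  return {
    state: { key: `${base}:state`, ttl: ROOM_TTL_SECONDS },
    players: { key: `${base}:players`, ttl: ROOM_TTL_SECONDS },
    // Answers are stored per round, so the round number is part of the key.
    answers: (round: number) => ({
      key: `${base}:answers:${round}`,
      ttl: ANSWERS_TTL_SECONDS,
    }),
  };
}

const exampleKeys = roomKeys("WOLF-42");
console.log(exampleKeys.state.key);      // room:WOLF-42:state
console.log(exampleKeys.answers(3).key); // room:WOLF-42:answers:3
```

Returning the TTL next to the key means a caller cannot write a key without also knowing which expiry to set on it.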
+
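+**Recovery strategy:** If the server crashes mid-game, the Valkey data is lost.
+Because final scores are written to PostgreSQL only at game end, `room_players.score` remains 0 for the interrupted game.
+On startup, a health check resets the status of rooms whose `updated_at` is stale to `finished`.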
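
The startup health check could look roughly like the sketch below. It works on in-memory rows so the rule is easy to see. The names `RoomRow`, `recoverStaleRooms`, and `STALE_AFTER_MS` are hypothetical, and the one-hour threshold is an assumption borrowed from the Valkey TTL, since the spec does not define when a room counts as stale.

```typescript
type RoomStatus = "waiting" | "in_progress" | "finished";

interface RoomRow {
  id: string;
  status: RoomStatus;
  updatedAt: Date;
}

// Assumption: a room is stale after one hour without updates,
// mirroring the 1h Valkey TTL. The spec does not fix this threshold.
const STALE_AFTER_MS = 60 * 60 * 1000;

// Returns a copy of the rows in which every unfinished room whose
// updatedAt is older than the threshold has been marked "finished".
function recoverStaleRooms(
  rooms: RoomRow[],
  now: Date,
  staleAfterMs: number = STALE_AFTER_MS,
): RoomRow[] {
  return rooms.map((room): RoomRow => {
    const ageMs = now.getTime() - room.updatedAt.getTime();
    const isStale = room.status !== "finished" && ageMs > staleAfterMs;
    return isStale ? { ...room, status: "finished", updatedAt: now } : room;
  });
}
```

In the real service the same condition would presumably run as a single `UPDATE` against `rooms`. The pure function only illustrates which rows that statement should touch.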
+
 ---

 ## 6. Data Model

-### Design principle
+## Design principle

-Words are modelled as language-neutral **terms** with one or more **translations** per language. Adding a new language pair (e.g. English–French) requires **no schema changes** — only new rows in `translations` and `language_pairs`. The flat `english/italian` column pattern is explicitly avoided.
+Words are modelled as language-neutral concepts (terms) separate from learning curricula (decks).
+Adding a new language pair requires no schema changes — only new rows in `translations`, `decks`, and `language_pairs`.

-### Core tables
+## Core tables

-```
 terms
  id            uuid PK
-  synset_id     text UNIQUE     -- WordNet synset offset e.g. "wn:01234567n"
-  pos           varchar(20)     -- "noun" | "verb" | "adjective"
-  frequency_rank integer        -- 1–1000, reserved for difficulty filtering
-  created_at    timestamptz
+  synset_id     text UNIQUE     -- OMW ILI (e.g. "ili:i12345")
+  pos           varchar(20)     -- NOT NULL, CHECK (pos IN ('noun', 'verb', 'adjective', 'adverb'))
+  created_at    timestamptz DEFAULT now()
+  -- REMOVED: frequency_rank (handled at deck level)

 translations
  id            uuid PK
  term_id       uuid FK → terms.id
-  language_code varchar(10)     -- BCP 47: "en", "it", "de", ...
-  text          text
-  UNIQUE (term_id, language_code)
+  language_code varchar(10)     -- NOT NULL, BCP 47: "en", "it"
+  text          text            -- NOT NULL
+  created_at    timestamptz DEFAULT now()
+  UNIQUE (term_id, language_code, text) -- Allow synonyms, prevent exact duplicates
+
+term_glosses
+  id            uuid PK
+  term_id       uuid FK → terms.id
+  language_code varchar(10)     -- NOT NULL
+  text          text            -- NOT NULL
+  type          varchar(20)     -- CHECK (type IN ('definition', 'example')), NULLABLE
+  created_at    timestamptz DEFAULT now()

 language_pairs
  id      uuid PK
-  source  varchar(10)           -- "en"
-  target  varchar(10)           -- "it"
-  label   text                  -- "English → Italian"
+  source  varchar(10)           -- NOT NULL
+  target  varchar(10)           -- NOT NULL
+  label   text
  active  boolean DEFAULT true
  UNIQUE (source, target)

+decks
+  id          uuid PK
+  name        text              -- NOT NULL (e.g. "A1 Italian Nouns", "Most Common 1000")
+  description text              -- NULLABLE
+  pair_id     uuid FK → language_pairs.id  -- NULLABLE (for single-language or multi-pair decks)
+  created_by  uuid FK → users.id           -- NULLABLE (for system decks)
+  is_public   boolean DEFAULT true
+  created_at  timestamptz DEFAULT now()
+
+deck_terms
+  deck_id     uuid FK → decks.id
+  term_id     uuid FK → terms.id
+  position    smallint          -- NOT NULL, ordering within deck (1, 2, 3...)
+  added_at    timestamptz DEFAULT now()
+  PRIMARY KEY (deck_id, term_id)
+
 users
-  id            uuid PK         -- OpenAuth sub claim
-  email         varchar(255) UNIQUE
+  id            uuid PK         -- Internal stable ID (FK target)
+  openauth_sub  text UNIQUE     -- NOT NULL, OpenAuth `sub` claim (e.g. "google|12345")
+  email         varchar(255) UNIQUE -- NULLABLE (GitHub users may lack email)
  display_name  varchar(100)
-  games_played  integer DEFAULT 0
-  games_won     integer DEFAULT 0
-  created_at    timestamptz
+  created_at    timestamptz DEFAULT now()
  last_login_at timestamptz
+  -- REMOVED: games_played, games_won (derive from room_players)

 rooms
  id           uuid PK
-  code         varchar(8) UNIQUE  -- human-readable e.g. "WOLF-42"
+  code         varchar(8) UNIQUE -- NOT NULL, CHECK (code = UPPER(code))
  host_id      uuid FK → users.id
  pair_id      uuid FK → language_pairs.id
-  status       text               -- "waiting" | "in_progress" | "finished"
-  max_players  smallint DEFAULT 4
-  round_count  smallint DEFAULT 10
-  created_at   timestamptz
+  deck_id      uuid FK → decks.id         -- Which vocabulary deck this room uses
+  status       varchar(20)       -- NOT NULL, CHECK (status IN ('waiting', 'in_progress', 'finished'))
+  max_players  smallint          -- NOT NULL, DEFAULT 4, CHECK (max_players BETWEEN 2 AND 10)
+  round_count  smallint          -- NOT NULL, DEFAULT 10, CHECK (round_count BETWEEN 5 AND 20)
+  created_at   timestamptz DEFAULT now()
+  updated_at   timestamptz DEFAULT now() -- For stale room recovery

 room_players
  room_id   uuid FK → rooms.id
  user_id   uuid FK → users.id
-  score     integer DEFAULT 0
-  joined_at timestamptz
+  score     integer DEFAULT 0   -- Final score only (written at game end)
+  joined_at timestamptz DEFAULT now()
+  left_at   timestamptz         -- Populated on WS disconnect/leave
  PRIMARY KEY (room_id, user_id)
-```

-### Indexes
+Indexes
+-- Vocabulary
+CREATE INDEX idx_terms_pos ON terms (pos);
+CREATE INDEX idx_translations_lang ON translations (language_code, term_id);

-```sql
-CREATE INDEX ON terms (pos, frequency_rank);
-CREATE INDEX ON rooms (status);
-CREATE INDEX ON room_players (user_id);
-```
+-- Decks
+CREATE INDEX idx_decks_pair ON decks (pair_id, is_public);
+CREATE INDEX idx_decks_creator ON decks (created_by);
+CREATE INDEX idx_deck_terms_term ON deck_terms (term_id);
+
+-- Language Pairs
+CREATE INDEX idx_pairs_active ON language_pairs (active, source, target);
+
+-- Rooms
+CREATE INDEX idx_rooms_status ON rooms (status);
+CREATE INDEX idx_rooms_host ON rooms (host_id);
+-- NOTE: idx_rooms_code omitted (UNIQUE constraint creates index automatically)
+
+-- Room Players
+CREATE INDEX idx_room_players_user ON room_players (user_id);
+CREATE INDEX idx_room_players_score ON room_players (room_id, score DESC);
+
+**Repository logic note:** `DeckRepository.getTerms(deckId, limit, offset)` fetches the terms of a specific deck, ordered by `deck_terms.position`.
+For random practice within a deck, use `WHERE deck_id = X ORDER BY random() LIMIT N`.
+This is safe because `ORDER BY random()` sorts only the rows of one bounded deck (e.g. 500 terms max), not the full `terms` table.
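
The two access patterns in this note can be modelled in memory as follows. This is an illustrative sketch only: the real `DeckRepository` would issue SQL, and `DeckTerm`, `getTerms`, and `sampleTerms` as written here are hypothetical stand-ins. The random variant uses a Fisher-Yates shuffle in place of `ORDER BY random()`.

```typescript
interface DeckTerm {
  deckId: string;
  termId: string;
  position: number;
}

// Ordered page of a deck: filter by deck, sort by position, apply offset/limit.
function getTerms(
  rows: DeckTerm[],
  deckId: string,
  limit: number,
  offset: number,
): DeckTerm[] {
  return rows
    .filter((r) => r.deckId === deckId)
    .sort((a, b) => a.position - b.position)
    .slice(offset, offset + limit);
}

// Random practice set: shuffle only the rows of one deck, then take n.
// The cost grows with the deck size, not with the size of the whole table.
function sampleTerms(
  rows: DeckTerm[],
  deckId: string,
  n: number,
  rng: () => number = Math.random,
): DeckTerm[] {
  const pool = rows.filter((r) => r.deckId === deckId);
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = pool[i]!;
    pool[i] = pool[j]!;
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}
```

Passing the random source as a parameter keeps the sampler deterministic in tests.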

 ---

@@ -229,17 +295,26 @@ CREATE INDEX ON room_players (user_id);

 ### Source

- **Princeton WordNet** — English words + synset IDs
- **Open Multilingual Wordnet (OMW)** — Italian translations keyed by synset ID
+- **Open Multilingual Wordnet (OMW)**: English and Italian nouns via the Interlingual Index (ILI)
+- **External CEFR lists**: for deck curation (e.g. GitHub: ecom/cefr-lists)

 ### Extraction process

-1. Run `scripts/extract_omw.py` once locally using NLTK
-2. Filter to the 1 000 most common nouns (by WordNet frequency data)
-3. Output: `packages/db/src/seed.json` — committed to the repo
-4. `packages/db/src/seed.ts` reads the JSON and populates `terms` + `translations`
+1. Run `scripts/extract-en-it-nouns.py` once locally using the `wn` library
+   - Imports all bilingual noun synsets (no frequency filtering)
+   - Output: `scripts/datafiles/en-it-nouns.json`, committed to the repo
+2. Run `pnpm db:seed` to populate the `terms` and `translations` tables from the JSON
+3. Run `pnpm db:build-decks` to match external CEFR lists against DB terms and create `decks` and `deck_terms`
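
The matching in step 3 might work along these lines. The spec does not describe how `db:build-decks` matches words, so everything below is an assumption for illustration: the function name `buildDeckTerms`, the case-insensitive lemma comparison, and the choice to take the first matching term when a word maps to several synsets.

```typescript
interface TranslationRow {
  termId: string;
  languageCode: string;
  text: string;
}

interface DeckTermRow {
  deckId: string;
  termId: string;
  position: number;
}

// Hypothetical matcher: turns an external word list into deck_terms rows.
// Unmatched words are skipped, and each term appears at most once per deck,
// consistent with PRIMARY KEY (deck_id, term_id).
function buildDeckTerms(
  deckId: string,
  wordList: string[],
  translations: TranslationRow[],
  languageCode: string,
): DeckTermRow[] {
  // Index lemma -> first matching term ID for the requested language.
  const termByLemma = new Map<string, string>();
  for (const t of translations) {
    if (t.languageCode !== languageCode) continue;
    const lemma = t.text.trim().toLowerCase();
    if (!termByLemma.has(lemma)) termByLemma.set(lemma, t.termId);
  }

  const seen = new Set<string>();
  const rows: DeckTermRow[] = [];
  for (const word of wordList) {
    const termId = termByLemma.get(word.trim().toLowerCase());
    if (termId === undefined || seen.has(termId)) continue;
    seen.add(termId);
    // Positions start at 1 and follow the order of the source list.
    rows.push({ deckId, termId, position: rows.length + 1 });
  }
  return rows;
}
```

Taking the first match is the weakest part of this sketch. A polysemous word has several synsets, so a real build step would likely need manual review or sense ranking.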

-`terms.synset_id` stores the WordNet offset (e.g. `wn:01234567n`) for traceability and future re-imports with additional languages.
+### Benefits of deck-based approach
+
+- WordNet frequency data is unreliable (e.g. chemical symbols rank high)
+- Curricula can come from external sources (CEFR, Oxford 3000, SUBTLEX)
+- Bad data excluded at deck level, not schema level
+- Users can create custom decks later
+- Multiple difficulty levels without schema changes
+
+`terms.synset_id` stores the OMW ILI (e.g. `ili:i12345`) for traceability and future re-imports with additional languages.

 ---