# Vocabulary Trainer — Project Specification

## 1. Overview

A multiplayer English–Italian vocabulary trainer with a Duolingo-style quiz interface (one word prompt, four answer choices). Supports both single-player practice and real-time competitive multiplayer rooms of 2–4 players. Designed from the ground up to be language-pair agnostic.

### Core Principles

- **Minimal but extendable**: Working product fast, clean architecture for future growth
- **Mobile-first**: Touch-friendly Duolingo-like UX
- **Type safety end-to-end**: TypeScript + Zod schemas shared between frontend and backend

---

## 2. Technology Stack

| Layer                | Technology                    |
| -------------------- | ----------------------------- |
| Monorepo             | pnpm workspaces               |
| Frontend             | React 18, Vite, TypeScript    |
| Routing              | TanStack Router               |
| Server state         | TanStack Query                |
| Client state         | Zustand                       |
| Styling              | Tailwind CSS + shadcn/ui      |
| Backend              | Node.js, Express, TypeScript  |
| Realtime             | WebSockets (`ws` library)     |
| Database             | PostgreSQL 18                 |
| ORM                  | Drizzle ORM                   |
| Cache / Queue        | Valkey 9                      |
| Auth                 | OpenAuth (Google + GitHub)    |
| Validation           | Zod (shared schemas)          |
| Testing              | Vitest, React Testing Library |
| Linting / Formatting | ESLint, Prettier              |
| Containerisation     | Docker, Docker Compose        |
| Hosting              | Hetzner VPS                   |

### Why `ws` over Socket.io

`ws` is the raw WebSocket library. For rooms of 2–4 players there is no need for Socket.io's transport fallbacks or room-management abstractions. The protocol is defined explicitly in `packages/shared`, which gives the same guarantees without the overhead.

### Why Valkey

Valkey stores ephemeral room state that does not need to survive a server restart. It keeps the PostgreSQL schema clean and makes room lookups O(1).

### Why pnpm workspaces without Turborepo

Turborepo adds parallel task running and build caching on top of pnpm workspaces.
For a two-app monorepo of this size, the plain pnpm workspace commands (`pnpm -r run build`, `pnpm --filter`) are sufficient and there is one less tool to configure and maintain.

---

## 3. Repository Structure

```
vocab-trainer/
├── apps/
│   ├── web/                    # React SPA (Vite + TanStack Router)
│   │   ├── src/
│   │   │   ├── routes/
│   │   │   ├── components/
│   │   │   ├── stores/         # Zustand stores
│   │   │   └── lib/
│   │   └── Dockerfile
│   └── api/                    # Express REST + WebSocket server
│       ├── src/
│       │   ├── routes/
│       │   ├── services/
│       │   ├── repositories/
│       │   └── websocket/
│       └── Dockerfile
├── packages/
│   ├── shared/                 # Zod schemas, TypeScript types, constants
│   └── db/                     # Drizzle schema, migrations, seed script
├── scripts/
│   ├── datafiles/
│   │   └── en-it-nouns.json
│   └── extract-en-it-nouns.py  # One-time WordNet + OMW extraction → datafiles/en-it-nouns.json
├── docker-compose.yml
├── docker-compose.prod.yml
├── pnpm-workspace.yaml
└── package.json
```

`packages/shared` is the contract between frontend and backend. All request/response shapes and WebSocket event payloads are defined there as Zod schemas and inferred TypeScript types — never duplicated.

### pnpm workspace config

`pnpm-workspace.yaml` declares:

```
packages:
  - 'apps/*'
  - 'packages/*'
```

### Root scripts

The root `package.json` defines convenience scripts that delegate to workspaces:

- `dev` — starts `api` and `web` in parallel
- `build` — builds all packages in dependency order
- `test` — runs Vitest across all workspaces
- `lint` — runs ESLint across all workspaces

For parallel dev, use `concurrently` or just two terminal tabs for MVP.

---

## 4. Architecture — N-Tier / Layered

```
┌────────────────────────────────────┐
│  Presentation (React SPA)          │  apps/web
├────────────────────────────────────┤
│  API / Transport                   │  HTTP REST + WebSocket
├────────────────────────────────────┤
│  Application (Controllers)         │  apps/api/src/routes
│  Domain (Business logic)           │  apps/api/src/services
│  Data Access (Repositories)        │  apps/api/src/repositories
├────────────────────────────────────┤
│  Database (PostgreSQL via Drizzle) │  packages/db
│  Cache (Valkey)                    │  apps/api/src/lib/valkey.ts
└────────────────────────────────────┘
```

Each layer only communicates with the layer directly below it. Business logic lives in services, not in route handlers or repositories.

---

## 5. Infrastructure

### Domain structure

| Subdomain             | Service                 |
| --------------------- | ----------------------- |
| `app.yourdomain.com`  | React frontend          |
| `api.yourdomain.com`  | Express API + WebSocket |
| `auth.yourdomain.com` | OpenAuth service        |

### Docker Compose services (production)

| Container        | Role                                        |
| ---------------- | ------------------------------------------- |
| `postgres`       | PostgreSQL 18, named volume                 |
| `valkey`         | Valkey 9, ephemeral (no persistence needed) |
| `openauth`       | OpenAuth service                            |
| `api`            | Express + WS server                         |
| `web`            | Nginx serving the Vite build                |
| `nginx-proxy`    | Automatic reverse proxy                     |
| `acme-companion` | Let's Encrypt certificate automation        |

```
nginx-proxy (:80/:443)
  app.domain  → web:80
  api.domain  → api:3000  (HTTP + WS upgrade)
  auth.domain → openauth:3001
```

SSL is fully automatic via `nginx-proxy` + `acme-companion`. No manual Certbot needed.

### 5.1 Valkey Key Structure

Ephemeral room state is stored in Valkey with a TTL (e.g., 1 hour). PostgreSQL stores durable history only.
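A minimal sketch of how the API might build the `room:{code}:{field}` keys and TTLs that this section specifies. The helper names (`roomKey`, `answersKey`) and the assumption that rounds are numbered from 1 are illustrative only; the upper-casing follows the `CHECK (code = UPPER(code))` constraint on `rooms.code`.

```typescript
// Hypothetical key helpers; names and shapes are illustrative, not part of the spec.
type RoomField = "state" | "players";

const ONE_HOUR_SECONDS = 60 * 60;
const FIFTEEN_MINUTES_SECONDS = 15 * 60;

interface KeySpec {
  key: string;
  ttlSeconds: number;
}

// Builds room:{code}:state or room:{code}:players with the 1h TTL.
// Room codes are stored upper-case, so the code is normalised here.
function roomKey(code: string, field: RoomField): KeySpec {
  return {
    key: `room:${code.toUpperCase()}:${field}`,
    ttlSeconds: ONE_HOUR_SECONDS,
  };
}

// Builds room:{code}:answers:{round} with the 15m TTL.
// Assumption: rounds are numbered from 1.
function answersKey(code: string, round: number): KeySpec {
  if (!Number.isInteger(round) || round < 1) {
    throw new RangeError(`round must be a positive integer, got ${round}`);
  }
  return {
    key: `room:${code.toUpperCase()}:answers:${round}`,
    ttlSeconds: FIFTEEN_MINUTES_SECONDS,
  };
}

console.log(roomKey("abcd1234", "state"));
console.log(answersKey("ABCD1234", 3));
```

Keeping key construction in one place means the key format and TTLs can change without touching the services that read or write room state.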
**Key format:** `room:{code}:{field}`

| Key                           | Type | TTL | Description                            |
| ----------------------------- | ---- | --- | -------------------------------------- |
| `room:{code}:state`           | Hash | 1h  | Current question index, round status   |
| `room:{code}:players`         | Set  | 1h  | Set of connected user IDs              |
| `room:{code}:answers:{round}` | Hash | 15m | Temp storage for current round answers |

**Recovery strategy:** If the server crashes mid-game, the Valkey data is lost and PostgreSQL `room_players.score` remains 0, because scores are only written at game end. A startup health check sets the room status to `finished` if `updated_at` is stale.

---

## 6. Data Model

### Design principle

Words are modelled as language-neutral concepts (terms) separate from learning curricula (decks). Adding a new language pair requires no schema changes — only new rows in `translations`, `decks`, and `language_pairs`.

### Core tables

    terms
      id          uuid PK
      synset_id   text UNIQUE                -- OMW ILI (e.g. "ili:i12345")
      pos         varchar(20)                -- NOT NULL, CHECK (pos IN ('noun', 'verb', 'adjective', 'adverb'))
      created_at  timestamptz DEFAULT now()
      -- REMOVED: frequency_rank (handled at deck level)

    translations
      id             uuid PK
      term_id        uuid FK → terms.id
      language_code  varchar(10)             -- NOT NULL, BCP 47: "en", "it"
      text           text                    -- NOT NULL
      created_at     timestamptz DEFAULT now()
      UNIQUE (term_id, language_code, text)  -- Allow synonyms, prevent exact duplicates

    term_glosses
      id             uuid PK
      term_id        uuid FK → terms.id
      language_code  varchar(10)             -- NOT NULL
      text           text                    -- NOT NULL
      created_at     timestamptz DEFAULT now()

    language_pairs
      id      uuid PK
      source  varchar(10)                    -- NOT NULL
      target  varchar(10)                    -- NOT NULL
      label   text
      active  boolean DEFAULT true
      UNIQUE (source, target)

    decks
      id           uuid PK
      name         text                      -- NOT NULL (e.g. "A1 Italian Nouns", "Most Common 1000")
      description  text                      -- NULLABLE
      pair_id      uuid FK → language_pairs.id  -- NULLABLE (for single-language or multi-pair decks)
      created_by   uuid FK → users.id        -- NULLABLE (for system decks)
      is_public    boolean DEFAULT true
      created_at   timestamptz DEFAULT now()

    deck_terms
      deck_id   uuid FK → decks.id
      term_id   uuid FK → terms.id
      position  smallint                     -- NOT NULL, ordering within deck (1, 2, 3...)
      added_at  timestamptz DEFAULT now()
      PRIMARY KEY (deck_id, term_id)

    users
      id             uuid PK                 -- Internal stable ID (FK target)
      openauth_sub   text UNIQUE             -- NOT NULL, OpenAuth `sub` claim (e.g. "google|12345")
      email          varchar(255) UNIQUE     -- NULLABLE (GitHub users may lack email)
      display_name   varchar(100)
      created_at     timestamptz DEFAULT now()
      last_login_at  timestamptz
      -- REMOVED: games_played, games_won (derive from room_players)

    rooms
      id           uuid PK
      code         varchar(8) UNIQUE         -- NOT NULL, CHECK (code = UPPER(code))
      host_id      uuid FK → users.id
      pair_id      uuid FK → language_pairs.id
      deck_id      uuid FK → decks.id        -- Which vocabulary deck this room uses
      status       varchar(20)               -- NOT NULL, CHECK (status IN ('waiting', 'in_progress', 'finished'))
      max_players  smallint                  -- NOT NULL, DEFAULT 4, CHECK (max_players BETWEEN 2 AND 10)
      round_count  smallint                  -- NOT NULL, DEFAULT 10, CHECK (round_count BETWEEN 5 AND 20)
      created_at   timestamptz DEFAULT now()
      updated_at   timestamptz DEFAULT now() -- For stale room recovery

    room_players
      room_id    uuid FK → rooms.id
      user_id    uuid FK → users.id
      score      integer DEFAULT 0           -- Final score only (written at game end)
      joined_at  timestamptz DEFAULT now()
      left_at    timestamptz                 -- Populated on WS disconnect/leave
      PRIMARY KEY (room_id, user_id)

### Indexes

    -- Vocabulary
    CREATE INDEX idx_terms_pos ON terms (pos);
    CREATE INDEX idx_translations_lang ON translations (language_code, term_id);

    -- Decks
    CREATE INDEX idx_decks_pair ON decks (pair_id, is_public);
    CREATE INDEX idx_decks_creator ON decks (created_by);
    CREATE INDEX idx_deck_terms_term ON deck_terms (term_id);

    -- Language Pairs
    CREATE INDEX idx_pairs_active ON language_pairs (active, source, target);

    -- Rooms
    CREATE INDEX idx_rooms_status ON rooms (status);
    CREATE INDEX idx_rooms_host ON rooms (host_id);
    -- NOTE: idx_rooms_code omitted (UNIQUE constraint creates index automatically)

    -- Room Players
    CREATE INDEX idx_room_players_user ON room_players (user_id);
    CREATE INDEX idx_room_players_score ON room_players (room_id, score DESC);

### Repository logic note

`DeckRepository.getTerms(deckId, limit, offset)` fetches terms from a specific deck. The query uses `deck_terms.position` for ordering. For random practice within a deck: `WHERE deck_id = X ORDER BY random() LIMIT N` (safe because a deck is bounded, e.g., 500 terms max, not the full table).

---

## 7. Vocabulary Data — WordNet + OMW

### Source

- Open Multilingual Wordnet (OMW): English & Italian nouns via the Interlingual Index (ILI)
- External CEFR lists: for deck curation (e.g. GitHub: ecom/cefr-lists)

### Extraction process

1. Run `extract-en-it-nouns.py` once locally using the `wn` library
   - Imports ALL bilingual noun synsets (no frequency filtering)
   - Output: `datafiles/en-it-nouns.json` — committed to repo
2. Run `pnpm db:seed` — populates `terms` + `translations` tables from JSON
3. Run `pnpm db:build-decks` — matches external CEFR lists to DB terms, creates `decks` + `deck_terms`

### Benefits of deck-based approach

- WordNet frequency data is unreliable (e.g. chemical symbols rank high)
- Curricula can come from external sources (CEFR, Oxford 3000, SUBTLEX)
- Bad data excluded at deck level, not schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes

`terms.synset_id` stores the OMW ILI (e.g. `ili:i12345`) for traceability and future re-imports with additional languages.

---

## 8. Authentication — OpenAuth

All auth is delegated to the OpenAuth service at `auth.yourdomain.com`. Providers: Google, GitHub.

The API validates the JWT from OpenAuth on every protected request.
User rows are created on first login and updated on later logins, looked up by the `sub` claim stored in `users.openauth_sub`. The internal `users.id` UUID stays the primary key and foreign-key target.

**Auth endpoint on the API:**

| Method | Path           | Description                 |
| ------ | -------------- | --------------------------- |
| GET    | `/api/auth/me` | Validate token, return user |

All other auth flows (login, callback, token refresh) are handled entirely by OpenAuth — the frontend redirects to `auth.yourdomain.com` and receives a JWT back.

---

## 9. REST API

All endpoints are prefixed with `/api`. Request and response bodies are validated with Zod on both sides using schemas from `packages/shared`.

### Vocabulary

| Method | Path                         | Description                       |
| ------ | ---------------------------- | --------------------------------- |
| GET    | `/language-pairs`            | List active language pairs        |
| GET    | `/terms?pair=en-it&limit=10` | Fetch quiz terms with distractors |

### Rooms

| Method | Path                | Description                         |
| ------ | ------------------- | ----------------------------------- |
| POST   | `/rooms`            | Create a room → returns room + code |
| GET    | `/rooms/:code`      | Get current room state              |
| POST   | `/rooms/:code/join` | Join a room                         |

### Users

| Method | Path              | Description            |
| ------ | ----------------- | ---------------------- |
| GET    | `/users/me`       | Current user profile   |
| GET    | `/users/me/stats` | Games played, win rate |

---

## 10. WebSocket Protocol

One WS connection per client. Authenticated by passing the OpenAuth JWT as a query param on the upgrade request: `wss://api.yourdomain.com?token=...`.

All messages are JSON: `{ type: string, payload: unknown }`. The full set of types is a Zod discriminated union in `packages/shared` — both sides validate every message they receive.
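To make "validate every message" concrete, here is a dependency-free sketch of the idea behind a discriminated union: switch on `type`, then check the payload for that type. The spec calls for Zod schemas in `packages/shared`; this hand-written version only illustrates the shape and is not the project's implementation. The name `parseClientMessage` and the assumption that `code`, `questionId`, and `answerId` are strings are hypothetical. It covers the client → server message types of this protocol.

```typescript
// Client → server messages, modelled as a TypeScript discriminated union.
// Assumption: code, questionId and answerId are strings.
type ClientMessage =
  | { type: "room:join"; payload: { code: string } }
  | { type: "room:leave" }
  | { type: "room:start" }
  | { type: "game:answer"; payload: { questionId: string; answerId: string } };

type ParseResult =
  | { ok: true; message: ClientMessage }
  | { ok: false; error: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical helper: turn one raw WebSocket frame into a typed message,
// or an error string suitable for the protocol's `error` reply.
function parseClientMessage(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
  if (!isRecord(data) || typeof data.type !== "string") {
    return { ok: false, error: "missing message type" };
  }
  const type: string = data.type;
  const payload: unknown = data.payload;

  switch (type) {
    case "room:join":
      if (isRecord(payload) && typeof payload.code === "string") {
        return { ok: true, message: { type: "room:join", payload: { code: payload.code } } };
      }
      return { ok: false, error: "room:join requires { code }" };
    case "room:leave":
      return { ok: true, message: { type: "room:leave" } };
    case "room:start":
      return { ok: true, message: { type: "room:start" } };
    case "game:answer":
      if (
        isRecord(payload) &&
        typeof payload.questionId === "string" &&
        typeof payload.answerId === "string"
      ) {
        return {
          ok: true,
          message: {
            type: "game:answer",
            payload: { questionId: payload.questionId, answerId: payload.answerId },
          },
        };
      }
      return { ok: false, error: "game:answer requires { questionId, answerId }" };
    default:
      return { ok: false, error: `unknown message type: ${type}` };
  }
}

console.log(parseClientMessage('{"type":"room:join","payload":{"code":"ABCD1234"}}'));
console.log(parseClientMessage('{"type":"game:answer","payload":{}}'));
```

A failed parse maps naturally onto the server's `error` message, so malformed frames never reach the game logic.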
### Client → Server

| type          | payload                    | Description                      |
| ------------- | -------------------------- | -------------------------------- |
| `room:join`   | `{ code }`                 | Subscribe to a room's WS channel |
| `room:leave`  | —                          | Unsubscribe                      |
| `room:start`  | —                          | Host starts the game             |
| `game:answer` | `{ questionId, answerId }` | Player submits an answer         |

### Server → Client

| type                 | payload                                            | Description                                          |
| -------------------- | -------------------------------------------------- | ---------------------------------------------------- |
| `room:state`         | Full room snapshot                                 | Sent on join and on any player join/leave            |
| `game:question`      | `{ id, prompt, options[], timeLimit }`             | New question broadcast to all players                |
| `game:answer_result` | `{ questionId, correct, correctAnswerId, scores }` | Broadcast after all players answer or on timeout     |
| `game:finished`      | `{ scores[], winner }`                             | End of game summary                                  |
| `error`              | `{ message }`                                      | Protocol or validation error                         |

### Multiplayer game mechanic — simultaneous answers

All players see the same question at the same time. Everyone submits independently. The server waits until all players have answered **or** the 15-second timeout fires — then broadcasts `game:answer_result` with updated scores. There is no buzz-first mechanic. This keeps the experience Duolingo-like and symmetric.

### Game flow

```
host creates room (REST)
→ players join via room code (REST + WS room:join)
→ room:state broadcasts player list
→ host sends room:start
→ server broadcasts game:question
→ players send game:answer
→ server collects all answers or waits for timeout
→ server broadcasts game:answer_result
→ repeat for N rounds
→ server broadcasts game:finished
```

### Room state in Valkey

Active room state (connected players, current question, answers received this round) is stored in Valkey with a TTL. PostgreSQL holds the durable record (`rooms`, `room_players`). On server restart, in-progress games are considered abandoned — acceptable for MVP.

---

## 11. Game Mechanics

- **Question format**: source-language word prompt + 4 target-language choices (1 correct + 3 distractors of the same POS)
- **Distractors**: generated server-side, never include the correct answer, never repeat within a session
- **Scoring**: +1 point per correct answer. Speed bonus is out of scope for MVP.
- **Timer**: 15 seconds per question, server-authoritative
- **Single-player**: uses `GET /terms` and runs entirely client-side. No WebSocket.

---

## 12. Frontend Structure

```
apps/web/src/
├── routes/
│   ├── index.tsx               # Landing / mode select
│   ├── auth/
│   ├── singleplayer/
│   └── multiplayer/
│       ├── lobby.tsx           # Create or join by code
│       ├── room.$code.tsx      # Waiting room
│       └── game.$code.tsx      # Active game
├── components/
│   ├── quiz/                   # QuestionCard, OptionButton, ScoreBoard
│   ├── room/                   # PlayerList, RoomCode, ReadyState
│   └── ui/                     # shadcn/ui wrappers: Button, Card, Dialog ...
├── stores/
│   └── gameStore.ts            # Zustand: game session, scores, WS state
├── lib/
│   ├── api.ts                  # TanStack Query wrappers
│   └── ws.ts                   # WS client singleton
└── main.tsx
```

### Zustand store (single store for MVP)

```typescript
interface AppStore {
  user: User | null;
  gameSession: GameSession | null;
  currentQuestion: Question | null;
  scores: Record<string, number>; // assumed shape: user ID → score
  isLoading: boolean;
  error: string | null;
}
```

TanStack Query handles all server data fetching. Zustand handles ephemeral UI and WebSocket-driven state.

---

## 13. Testing Strategy

| Type        | Tool                 | Scope                                               |
| ----------- | -------------------- | --------------------------------------------------- |
| Unit        | Vitest               | Services, QuizService distractor logic, Zod schemas |
| Component   | Vitest + RTL         | QuestionCard, OptionButton, auth forms              |
| Integration | Vitest               | API route handlers against a test DB                |
| E2E         | Out of scope for MVP | —                                                   |

Tests are co-located with source files (`*.test.ts` / `*.test.tsx`).
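As an illustration of the kind of logic these unit tests would target, here is a minimal sketch of distractor selection following the rules in section 11: same POS as the answer, never the correct answer, no duplicates. The `Candidate` shape and the `pickDistractors` name are assumptions, not the project's actual `QuizService` API, and the "never repeat within a session" rule is deliberately left out. The random source is injectable so a test can make the result deterministic.

```typescript
// Hypothetical shape of one candidate translation in the target language.
interface Candidate {
  termId: string;
  pos: "noun" | "verb" | "adjective" | "adverb";
  text: string;
}

// Pick `count` distractors for `answer` from `pool`.
// Throws if the pool does not contain enough eligible candidates.
function pickDistractors(
  answer: Candidate,
  pool: Candidate[],
  count: number = 3,
  random: () => number = Math.random,
): Candidate[] {
  // Seeding the set with the answer's text rejects synonyms that would
  // display exactly like the correct option.
  const seenTexts = new Set<string>([answer.text]);
  const eligible: Candidate[] = [];
  for (const candidate of pool) {
    const samePos = candidate.pos === answer.pos;
    const isAnswer = candidate.termId === answer.termId;
    if (!samePos || isAnswer || seenTexts.has(candidate.text)) continue;
    seenTexts.add(candidate.text);
    eligible.push(candidate);
  }
  if (eligible.length < count) {
    throw new Error(`need ${count} distractors, only ${eligible.length} eligible`);
  }
  // Partial Fisher-Yates shuffle: randomise only the first `count` slots.
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (eligible.length - i));
    [eligible[i], eligible[j]] = [eligible[j], eligible[i]];
  }
  return eligible.slice(0, count);
}

const answer: Candidate = { termId: "t1", pos: "noun", text: "cane" };
const pool: Candidate[] = [
  answer,
  { termId: "t2", pos: "noun", text: "gatto" },
  { termId: "t3", pos: "verb", text: "correre" },
  { termId: "t4", pos: "noun", text: "casa" },
  { termId: "t5", pos: "noun", text: "gatto" }, // duplicate text, skipped
  { termId: "t6", pos: "noun", text: "libro" },
];
console.log(pickDistractors(answer, pool).map((c) => c.text));
```

Because the function takes its randomness as a parameter, a co-located `*.test.ts` file could pass a fixed generator and assert on exact output.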
**Critical paths to cover:**

- Distractor generation (correct POS, no duplicates, never includes answer)
- Answer validation (server-side, correct scoring)
- Game session lifecycle (create → play → complete)
- JWT validation middleware

---

## 14. Definition of Done

### Functional

- [ ] User can log in via Google or GitHub (OpenAuth)
- [ ] User can play singleplayer: 10 rounds, score, result screen
- [ ] User can create a room and share a code
- [ ] User can join a room via code
- [ ] Multiplayer: 10 rounds, simultaneous answers, real-time score sync
- [ ] 1 000 English–Italian words seeded from WordNet + OMW

### Technical

- [ ] Deployed to Hetzner with HTTPS on all three subdomains
- [ ] Docker Compose running all services
- [ ] Drizzle migrations applied on container start
- [ ] 10–20 passing tests covering critical paths
- [ ] pnpm workspace build pipeline green

---

## 15. Out of Scope (MVP)

- Difficulty levels _(can be added as extra decks; `frequency_rank` was removed from `terms`)_
- Additional language pairs _(schema already supports it — just add rows)_
- Leaderboards _(derivable from `room_players`; `games_played` and `games_won` are not stored)_
- Streaks / daily challenges
- Friends / private invites
- Audio pronunciation
- CI/CD pipeline (manual deploy for now)
- Rate limiting _(add before going public)_
- Admin panel for vocabulary management