diff --git a/apps/api/tsconfig.json b/apps/api/tsconfig.json index 52192c0..b470f84 100644 --- a/apps/api/tsconfig.json +++ b/apps/api/tsconfig.json @@ -2,7 +2,7 @@ "extends": "../../tsconfig.base.json", "references": [ { "path": "../../packages/shared" }, - { "path": "../../packages/db" }, + { "path": "../../packages/db" } ], "compilerOptions": { "module": "NodeNext", @@ -10,7 +10,7 @@ "outDir": "./dist", "resolveJsonModule": true, "rootDir": ".", - "types": ["vitest/globals"], + "types": ["vitest/globals"] }, - "include": ["src", "vitest.config.ts"], + "include": ["src", "vitest.config.ts"] } diff --git a/documentation/api-development.md b/documentation/api-development.md deleted file mode 100644 index 4611cf7..0000000 --- a/documentation/api-development.md +++ /dev/null @@ -1,348 +0,0 @@ -# Glossa — Architecture & API Development Summary - -A record of all architectural discussions, decisions, and outcomes from the initial -API design through the quiz model implementation. - ---- - -## Project Overview - -Glossa is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. Users see a -word and pick from 4 possible translations. Supports singleplayer and multiplayer. -Stack: Express API, React frontend, Drizzle ORM, Postgres, Valkey, WebSockets. - ---- - -## Architectural Foundation - -### The Layered Architecture - -The core mental model established for the entire API: - -```text -HTTP Request - ↓ - Router — maps URL + HTTP method to a controller - ↓ - Controller — handles HTTP only: validates input, calls service, sends response - ↓ - Service — business logic only: no HTTP, no direct DB access - ↓ - Model — database queries only: no business logic - ↓ - Database -``` - -**The rule:** each layer only talks to the layer directly below it. A controller never -touches the database. A service never reads `req.body`. A model never knows what a quiz is. 
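The layering rule above can be sketched end-to-end in a few lines. Everything below is illustrative: `findTerms`, `prepareTerms`, and `startGameController` are hypothetical stand-ins for the real Glossa functions, and the database is faked as an in-memory array.

```typescript
type TermRow = { id: string; text: string };

// Model — database queries only (faked here with an in-memory table)
const termTable: TermRow[] = [{ id: "1", text: "cosa" }];
const findTerms = async (limit: number): Promise<TermRow[]> =>
  termTable.slice(0, limit);

// Service — business logic only: no HTTP, no direct DB access
const prepareTerms = async (rounds: number): Promise<string[]> => {
  const rows = await findTerms(rounds);
  return rows.map((row) => row.text);
};

// Controller — HTTP only: validates input, calls the service, shapes the response.
// It never touches termTable directly.
const startGameController = async (body: { rounds?: unknown }) => {
  const rounds = Number(body.rounds);
  if (!Number.isInteger(rounds) || rounds < 1) {
    return { status: 400, payload: { error: "rounds must be a positive integer" } };
  }
  return { status: 200, payload: { terms: await prepareTerms(rounds) } };
};
```

The point of the sketch is the dependency direction: the controller knows the service, the service knows the model, and nothing reaches past its immediate neighbour.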
- -### Monorepo Package Responsibilities - -| Package | Owns | -| ----------------- | -------------------------------------------------------- | -| `packages/shared` | Zod schemas, constants, derived TypeScript types | -| `packages/db` | Drizzle schema, DB connection, all model/query functions | -| `apps/api` | Router, controllers, services | -| `apps/web` | React frontend, consumes types from shared | - -**Key principle:** all database code lives in `packages/db`. `apps/api` never imports -`drizzle-orm` for queries — it only calls functions exported from `packages/db`. - ---- - -## Problems Faced & Solutions - -- Problem 1: Messy API structure - **Symptom:** responsibilities bleeding across layers — DB code in controllers, business - logic in routes. - **Solution:** strict layered architecture with one responsibility per layer. -- Problem 2: No shared contract between API and frontend - **Symptom:** API could return different shapes silently, frontend breaks at runtime. - **Solution:** Zod schemas in `packages/shared` as the single source of truth. Both API - (validation) and frontend (type inference) consume the same schemas. -- Problem 3: Type safety gaps - **Symptom:** TypeScript `any` types on model parameters, `Number` vs `number` confusion. - **Solution:** derived types from constants using `typeof CONSTANT[number]` pattern. - All valid values defined once in constants, types derived automatically. -- Problem 4: `getGameTerms` in wrong package - **Symptom:** model queries living in `apps/api/src/models/` meant `apps/api` had a - direct `drizzle-orm` dependency and was accessing the DB itself. - **Solution:** moved models folder to `packages/db/src/models/`. All Drizzle code now - lives in one package. -- Problem 5: Deck generation complexity - **Initial assumption:** 12 decks needed (nouns/verbs × easy/intermediate/hard × en/it). - **Correction:** decks are pools, not presets. POS and difficulty are query filters applied - at runtime — not deck properties. 
Only 2 decks needed (en-core, it-core). - **Final decision:** skip deck generation entirely for MVP. Query the terms table directly - with difficulty + POS filters. Revisit post-MVP when spaced repetition or progression - features require curated pools. -- Problem 6: GAME_ROUNDS type conflict - **Problem:** `z.enum()` only accepts strings. `GAME_ROUNDS = ["3", "10"]` works with - `z.enum()` but requires `Number(rounds)` conversion in the service. - **Decision:** keep as strings, convert to number in the service before passing to the - model. Documented coupling acknowledged with a comment. -- Problem 7: Gloss join could multiply question rows. Schema allowed multiple glosses per term per language, so the left join would duplicate rows. Fixed by tightening the unique constraint. -- Problem 8: Model leaked quiz semantics. Return fields were named prompt / answer, baking HTTP-layer concepts into the database layer. Renamed to neutral field names. -- Problem 9: AnswerResult wasn't self-contained. Frontend needed selectedOptionId to render feedback but the schema didn't include it (reasoning was "client already knows"). Discovered during frontend work; added the field. -- Problem 10: Distractor could duplicate the correct answer text. Different terms can share the same translation. Fixed with ne(translations.text, excludeText) in the query. -- Problem 11: TypeScript strict mode flagged Fisher-Yates shuffle array access. noUncheckedIndexedAccess treats result[i] as T | undefined. Fixed with non-null assertion and temp variable pattern. - ---- - -## Decisions Made - -- Zod schemas belong in `packages/shared` - Both the API and frontend import from the same schemas. If the shape changes, TypeScript - compilation fails in both places simultaneously — silent drift is impossible. -- Server-side answer evaluation - The correct answer is never sent to the frontend in `QuizQuestion`. It is only revealed - in `AnswerResult` after the client submits. 
Prevents cheating and keeps game logic - authoritative on the server. -- `safeParse` over `parse` in controllers - `parse` throws a raw Zod error → ugly 500 response. `safeParse` returns a result object - → clean 400 with early return. Global error handler to be implemented later (Step 6 of - roadmap) will centralise this pattern. -- POST not GET for game start - `GET` requests have no body. Game configuration is submitted as a JSON body → `POST` is - semantically correct. -- `express.json()` middleware required - Without it, `req.body` is `undefined`. Added to `createApp()` in `app.ts`. -- Type naming: PascalCase - TypeScript convention. `supportedLanguageCode` → `SupportedLanguageCode` etc. -- Primitive types: always lowercase - `number` not `Number`, `string` not `String`. The uppercase versions are object wrappers - and not assignable to Drizzle's expected primitive types. -- Model parameters use shared types, not `GameRequestType` - The model layer should not know about `GameRequestType` — that's an HTTP boundary concern. - Instead, parameters are typed using the derived constant types (`SupportedLanguageCode`, - `SupportedPos`, `DifficultyLevel`) exported from `packages/shared`. -- One gloss per term per language. The unique constraint on term_glosses was tightened from (term_id, language_code, text) to (term_id, language_code) to prevent the left join from multiplying question rows. Revisit if multiple glosses per language are ever needed (e.g. register or domain variants). -- Model returns neutral field names, not quiz semantics. getGameTerms returns sourceText / targetText / sourceGloss rather than prompt / answer / gloss. Quiz semantics are applied in the service layer. Keeps the model reusable for non-quiz features. -- Asymmetric difficulty filter. Difficulty is filtered on the target (answer) side only. A word can be A2 in Italian but B1 in English, and what matters is the difficulty of the word being learned. -- optionId as integer 0-3, not UUID. 
Options only need uniqueness within a single question; cheating prevented by shuffling, not opaque IDs. -- questionId and sessionId as UUIDs. Globally unique, opaque, natural Valkey keys when storage moves later. -- gloss is string | null rather than optional, for predictable shape on the frontend. -- GameSessionStore stores only the answer key (questionId → correctOptionId). Minimal payload for easy Valkey migration. -- All GameSessionStore methods are async even for the in-memory implementation, so the service layer is already written for Valkey. -- Distractors fetched per-question (N+1 queries). Correct shape for the problem; 10 queries on local Postgres is negligible latency. -- No fallback logic for insufficient distractors. Data volumes are sufficient; strict query throws if something is genuinely broken. -- Distractor query excludes both the correct term ID and the correct answer text, preventing duplicate options from different terms with the same translation. -- Submit-before-send flow on frontend: user selects, then confirms. Prevents misclicks. -- AppError base class over error code maps. A statusCode on the error itself means the middleware doesn't need a lookup table. New error types are self-contained — one class, one status code. -- next(error) over res.status().json() in controllers. Express requires explicit next(error) for async handlers. Centralises all error formatting in one place. Controllers stay clean — validate, call service, send response. -- Zod .message over .issues[0]?.message. Returns all validation failures, not just the first. Output is verbose (raw JSON string) — revisit formatting post-MVP if the frontend needs structured error objects. - ---- - -## Global error handler: typed error classes + central middleware - -Three-layer pattern: error classes define the shape, services throw them, middleware catches them. - -AppError is the base class — carries a statusCode and a message. 
ValidationError (400) and NotFoundError (404) extend it. Adding a new error type is one class with a super() call. - -Controllers wrap their body in try/catch and call next(error) in the catch block. They never build error responses themselves. This is required because Express does not catch errors from async handlers automatically — without next(error), an unhandled rejection crashes the process. - -The middleware in app.ts (registered after all routes) checks instanceof AppError. Known errors get their statusCode and message. Unknown errors get logged and return a generic 500 — no stack traces leak to the client. - -Zod validation error format: used gameSettings.error.message rather than gameSettings.error.issues[0]?.message. This sends all validation failures at once instead of just the first. Tradeoff: the output is a raw JSON string, not a clean object. Acceptable for MVP — if the frontend needs structured errors later, format .issues into { field, message }[] in the ValidationError constructor. - -Where errors are thrown: ValidationError is thrown in the controller (it's the layer that runs safeParse). NotFoundError is thrown in the service (it's the layer that knows whether a session or question exists). The service still doesn't know about HTTP — it throws a typed error, and the middleware maps it to a status code. 
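The three-layer error pattern can be sketched as follows. The class names match the document; the `errorToResponse` helper is a framework-free approximation of what the Express middleware does, not the project's actual middleware signature.

```typescript
class AppError extends Error {
  constructor(
    public readonly statusCode: number,
    message: string,
  ) {
    super(message);
  }
}

// Adding a new error type is one class with a super() call
class ValidationError extends AppError {
  constructor(message: string) {
    super(400, message);
  }
}

class NotFoundError extends AppError {
  constructor(message: string) {
    super(404, message);
  }
}

// What the central middleware does: known errors keep their statusCode and
// message; unknown errors collapse to a generic 500 so no internals leak.
const errorToResponse = (error: unknown): { status: number; body: { error: string } } =>
  error instanceof AppError
    ? { status: error.statusCode, body: { error: error.message } }
    : { status: 500, body: { error: "Internal server error" } };
```

In the real app this logic lives in a four-argument `(err, req, res, next)` middleware registered after all routes in `app.ts`.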
- ---- - -## Data Pipeline Work (Pre-API) - -### CEFR Enrichment Pipeline (completed) - -A staged ETL pipeline was built to enrich translation records with CEFR levels and -difficulty ratings: - -```text -Raw source files - ↓ -extract-*.py — normalise each source to standard JSON - ↓ -compare-*.py — quality gate: surface conflicts between sources (read-only) - ↓ -merge-*.py — resolve conflicts by source priority, derive difficulty - ↓ -enrich.ts — write cefr_level + difficulty to DB translations table -``` - -**Source priority:** - -- English: `en_m3` > `cefrj` > `octanove` > `random` -- Italian: `it_m3` > `italian` - -**Enrichment results:** - -| Language | Enriched | Total | Coverage | -| -------- | -------- | ------- | -------- | -| English | 42,527 | 171,394 | ~25% | -| Italian | 23,061 | 54,603 | ~42% | - -Both languages have sufficient coverage for MVP. Italian C2 has only 242 terms — noted -as a potential constraint for the distractor algorithm at high difficulty. - ---- - -## API Schemas (packages/shared) - -### `GameRequestSchema` - -```typescript -{ - source_language: z.enum(SUPPORTED_LANGUAGE_CODES), - target_language: z.enum(SUPPORTED_LANGUAGE_CODES), - pos: z.enum(SUPPORTED_POS), - difficulty: z.enum(DIFFICULTY_LEVELS), - rounds: z.enum(GAME_ROUNDS), -} -``` - -AnswerOption: { optionId: number (0-3), text: string } -GameQuestion: { questionId: uuid, prompt: string, gloss: string | null, options: AnswerOption[4] } -GameSession: { sessionId: uuid, questions: GameQuestion[] } -AnswerSubmission: { sessionId: uuid, questionId: uuid, selectedOptionId: number (0-3) } -AnswerResult: { questionId: uuid, isCorrect: boolean, correctOptionId: number (0-3), selectedOptionId: number (0-3) } - ---- - -## API Endpoints - -```text -POST /api/v1/game/start GameRequest → QuizQuestion[] -POST /api/v1/game/answer AnswerSubmission → AnswerResult -``` - ---- - -## Current File Structure (apps/api) - -```text -apps/api/src/ -├── app.ts — Express app, express.json() middleware 
-├── server.ts — starts server on PORT -├── routes/ -│ ├── apiRouter.ts — mounts /health and /game routers -│ ├── gameRouter.ts — POST /start → createGame controller -│ └── healthRouter.ts -├── controllers/ -│ └── gameController.ts — validates GameRequest, calls service -└── services/ - └── gameService.ts — calls getGameTerms, returns raw rows -``` - ---- - -## Current File Structure (packages/db) - -```text -packages/db/src/ -├── db/ -│ └── schema.ts — Drizzle schema (terms, translations, users, decks...) -├── models/ -│ └── termModel.ts — getGameTerms() query -└── index.ts — exports db connection + getGameTerms -``` - ---- - -## Completed Tasks - -- [x] Layered architecture established and understood -- [x] `GameRequestSchema` defined in `packages/shared` -- [x] Derived types (`SupportedLanguageCode`, `SupportedPos`, `DifficultyLevel`) exported from constants -- [x] `getGameTerms()` model implemented with POS / language / difficulty / limit filters -- [x] Model correctly placed in `packages/db` -- [x] `prepareGameQuestions()` service skeleton calling the model -- [x] `createGame` controller with Zod `safeParse` validation -- [x] `POST /api/v1/game/start` route wired -- [x] End-to-end pipeline verified with test script — returns correct rows -- [x] CEFR enrichment pipeline complete for English and Italian -- [x] Double join on translations implemented (source + target language) -- [x] Gloss left join implemented -- [x] Model return type uses neutral field names (sourceText, targetText, sourceGloss) -- [x] Schema: gloss unique constraint tightened to one gloss per term per language -- [x] Zod schemas defined: AnswerOption, GameQuestion, GameSession, AnswerSubmission, AnswerResult -- [x] getDistractors model implemented with POS/difficulty/language/excludeTermId/excludeText filters -- [x] createGameSession service: fetches terms, fetches distractors per question, shuffles options, stores session, returns GameSession -- [x] evaluateAnswer service: looks up session, 
compares submitted optionId to stored correct answer, returns AnswerResult -- [x] GameSessionStore interface + InMemoryGameSessionStore (Map-backed, swappable to Valkey) -- [x] POST /api/v1/game/answer endpoint wired (route, controller, service) -- [x] selectedOptionId added to AnswerResult (discovered during frontend work) -- [x] Minimal frontend: /play route with settings UI, QuestionCard, OptionButton, ScoreScreen -- [x] Vite proxy configured for dev - ---- - -## Roadmap Ahead - -### Step 1 — Learn SQL fundamentals - done - -Concepts needed: SELECT, FROM, JOIN, WHERE, LIMIT. -Resources: sqlzoo.net or Khan Academy SQL section. -Required before: implementing the double join for source language prompt. - -### Step 2 — Complete the model layer - done - -- Double join on `translations` — once for source language (prompt), once for target language (answer) -- `GlossModel.getGloss(termId, languageCode)` — fetch gloss if available - -### Step 3 — Define remaining Zod schemas - done - -- `QuizQuestion`, `QuizOption`, `AnswerSubmission`, `AnswerResult` in `packages/shared` - -### Step 4 — Complete the service layer - done - -- `QuizService.buildSession()` — assemble raw rows into `QuizQuestion[]` - - Generate `questionId` per question - - Map source language translation as prompt - - Attach gloss if available - - Fetch 3 distractors (same POS, different term, same difficulty) - - Shuffle options so correct answer is not always in same position -- `QuizService.evaluateAnswer()` — validate correctness, return `AnswerResult` - -### Step 5 — Implement answer endpoint - done - -- `POST /api/v1/game/answer` route, controller, service method - -### Step 6 — Global error handler - done - -- Typed error classes (`ValidationError`, `NotFoundError`) -- Central error middleware in `app.ts` -- Remove temporary `safeParse` error handling from controllers - -### Step 7 — Tests - done - -- Unit tests for `QuizService` — correct POS filtering, distractor never equals correct answer -- 
Unit tests for `evaluateAnswer` — correct and incorrect cases -- Integration tests for both endpoints - -### Step 8 — Auth (Phase 2 from original roadmap) - -- OpenAuth integration -- JWT validation middleware -- `GET /api/auth/me` endpoint -- Frontend auth guard - ---- - -## Open Questions - -- **Distractor algorithm:** when Italian C2 has only 242 terms, should the difficulty - filter fall back gracefully or return an error? Decision needed before implementing - `buildSession()`. => resolved - -- **Session statefulness:** game loop is currently stateless (fetch all questions upfront). - Confirm this is still the intended MVP approach before building `buildSession()`. => resolved - -- **Glosses can leak answers:** some WordNet glosses contain the target-language - word in the definition text (e.g. "Padre" appearing in the English gloss for - "father"). Address during the post-MVP data enrichment pass — either clean the - glosses, replace them with custom definitions, or filter at the service layer. => resolved diff --git a/documentation/data-seeding-notes.md b/documentation/data-seeding-notes.md deleted file mode 100644 index 2f68bd7..0000000 --- a/documentation/data-seeding-notes.md +++ /dev/null @@ -1,351 +0,0 @@ -# WordNet Seeding Script — Session Summary - -## Project Context - -A multiplayer English–Italian vocabulary trainer (Glossa) built with a pnpm monorepo. Vocabulary data comes from Open Multilingual Wordnet (OMW) and is extracted into JSON files, then seeded into a PostgreSQL database via Drizzle ORM. - ---- - -## 1. 
JSON Extraction Format - -Each synset extracted from WordNet is represented as: - -```json -{ - "synset_id": "ili:i35545", - "pos": "noun", - "translations": { "en": ["entity"], "it": ["cosa", "entità"] } -} -``` - -**Fields:** - -- `synset_id` — OMW Interlingual Index ID, maps to `terms.synset_id` in the DB -- `pos` — part of speech, matches the CHECK constraint on `terms.pos` -- `translations` — object of language code → array of lemmas (synonyms within a synset) - -**Glosses** are not extracted — the `term_glosses` table exists in the schema for future use but is not needed for the MVP quiz mechanic. - ---- - -## 2. Database Schema (relevant tables) - -```text -terms - id uuid PK - synset_id text UNIQUE - pos varchar(20) - created_at timestamptz - -translations - id uuid PK - term_id uuid FK → terms.id (CASCADE) - language_code varchar(10) - text text - created_at timestamptz - UNIQUE (term_id, language_code, text) -``` - ---- - -## 3. Seeding Script — v1 (batch, truncate-based) - -### Approach - -- Read a single JSON file -- Batch inserts into `terms` and `translations` in groups of 500 -- Truncate tables before each run for a clean slate - -### Key decisions made during development - -| Issue | Resolution | -| -------------------------------- | --------------------------------------------------- | -| `JSON.parse` returns `any` | Added `Array.isArray` check before casting | -| `forEach` doesn't await | Switched to `for...of` | -| Empty array types | Used Drizzle's `$inferInsert` types | -| `translations` naming conflict | Renamed local variable to `translationRows` | -| Final batch not flushed | Added `if (termsArray.length > 0)` guard after loop | -| Exact batch size check `=== 500` | Changed to `>= 500` | - -### Final script structure - -```ts -import fs from "node:fs/promises"; -import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared"; -import { db } from "@glossa/db"; -import { terms, translations } from "@glossa/db/schema"; - -type POS = 
(typeof SUPPORTED_POS)[number]; -type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number]; -type TermInsert = typeof terms.$inferInsert; -type TranslationInsert = typeof translations.$inferInsert; -type Synset = { - synset_id: string; - pos: POS; - translations: Record<LANGUAGE_CODE, string[]>; -}; - -const dataDir = "../../scripts/datafiles/"; - -const readFromJsonFile = async (filepath: string): Promise<Synset[]> => { - const data = await fs.readFile(filepath, "utf8"); - const parsed = JSON.parse(data); - if (!Array.isArray(parsed)) throw new Error("Expected a JSON array"); - return parsed as Synset[]; -}; - -const uploadToDB = async ( - termsData: TermInsert[], - translationsData: TranslationInsert[], -) => { - await db.insert(terms).values(termsData); - await db.insert(translations).values(translationsData); -}; - -const main = async () => { - console.log("Reading JSON file..."); - const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json"); - console.log(`Loaded ${allSynsets.length} synsets`); - - const termsArray: TermInsert[] = []; - const translationsArray: TranslationInsert[] = []; - let batchCount = 0; - - for (const synset of allSynsets) { - const term = { - id: crypto.randomUUID(), - synset_id: synset.synset_id, - pos: synset.pos, - }; - - const translationRows = Object.entries(synset.translations).flatMap( - ([lang, lemmas]) => - lemmas.map((lemma) => ({ - id: crypto.randomUUID(), - term_id: term.id, - language_code: lang as LANGUAGE_CODE, - text: lemma, - })), - ); - - translationsArray.push(...translationRows); - termsArray.push(term); - - if (termsArray.length >= 500) { - batchCount++; - console.log( - `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`, - ); - await uploadToDB(termsArray, translationsArray); - termsArray.length = 0; - translationsArray.length = 0; - } - } - - if (termsArray.length > 0) { - batchCount++; - console.log( - `Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`, - ); - await 
uploadToDB(termsArray, translationsArray); - } - - console.log(`Seeding complete — ${allSynsets.length} synsets inserted`); -}; - -main().catch((error) => { - console.error(error); - process.exit(1); -}); -``` - ---- - -## 4. Pitfalls Encountered - -### Duplicate key on re-run - -Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding. - -```bash -docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;" -``` - -### `onConflictDoNothing` breaks FK references - -When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID is never written to the DB. Subsequent `translations` inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding. - -### DATABASE_URL misconfigured - -Correct format: - -```text -DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa -``` - -### Tables not found after `docker compose up` - -Migrations must be applied first: `npx drizzle-kit migrate` - ---- - -## 5. Running the Script - -```bash -# Start the DB container -docker compose up -d postgres - -# Apply migrations -npx drizzle-kit migrate - -# Truncate existing data (if re-seeding) -docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;" - -# Run the seed script -npx tsx src/seed-en-it-nouns.ts - -# Verify -docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;" -``` - ---- - -## 6. Seeding Script — v2 (incremental upsert, multi-file) - -### Motivation - -The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating. 
### File naming convention - -One JSON file per language pair per POS: - -```text -scripts/datafiles/ - en-it-nouns.json - en-fr-nouns.json - en-it-verbs.json - de-it-nouns.json - ... -``` - -### How incremental upsert works - -For a concept like "dog" already in the DB with English and Italian: - -1. Import `en-fr-nouns.json` -2. Upsert `terms` by `synset_id` — finds existing row, returns its real ID -3. `dog (en)` already exists → skipped by `onConflictDoNothing` -4. `chien (fr)` is new → inserted - -The concept is **extended**, not replaced. - -### Tradeoff vs batch approach - -Batching is no longer possible since you need the real `term.id` from the DB before inserting translations. Each synset is processed individually. For 25k rows this is still fast enough. - -### Key types added - -```ts -type Synset = { - synset_id: string; - pos: POS; - translations: Partial<Record<LANGUAGE_CODE, string[]>>; // Partial — file only contains a subset of languages -}; - -type FileName = { - sourceLang: LANGUAGE_CODE; - targetLang: LANGUAGE_CODE; - pos: POS; -}; -``` - -### Filename validation - -```ts -const parseFilename = (filename: string): FileName => { - const parts = filename.replace(".json", "").split("-"); - if (parts.length !== 3) - throw new Error( - `Invalid filename format: ${filename}. 
Expected: sourcelang-targetlang-pos.json`, - ); - const [sourceLang, targetLang, pos] = parts; - if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE)) - throw new Error(`Unsupported language code: ${sourceLang}`); - if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE)) - throw new Error(`Unsupported language code: ${targetLang}`); - if (!SUPPORTED_POS.includes(pos as POS)) - throw new Error(`Unsupported POS: ${pos}`); - return { - sourceLang: sourceLang as LANGUAGE_CODE, - targetLang: targetLang as LANGUAGE_CODE, - pos: pos as POS, - }; -}; -``` - -### Upsert function (WIP) - -```ts -const upsertSynset = async ( - synset: Synset, - fileInfo: FileName, -): Promise<{ termInserted: boolean; translationsInserted: number }> => { - const [upsertedTerm] = await db - .insert(terms) - .values({ synset_id: synset.synset_id, pos: synset.pos }) - .onConflictDoUpdate({ target: terms.synset_id, set: { pos: synset.pos } }) - .returning({ id: terms.id, created_at: terms.created_at }); - - const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000); - - const translationRows = Object.entries(synset.translations).flatMap( - ([lang, lemmas]) => - lemmas!.map((lemma) => ({ - id: crypto.randomUUID(), - term_id: upsertedTerm.id, - language_code: lang as LANGUAGE_CODE, - text: lemma, - })), - ); - - const result = await db - .insert(translations) - .values(translationRows) - .onConflictDoNothing() - .returning({ id: translations.id }); - - return { termInserted, translationsInserted: result.length }; -}; -``` - ---- - -## 7. 
Strategy Comparison - -| Strategy | Use case | Pros | Cons | -| ------------------ | ----------------------------- | --------------------- | -------------------- | -| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data | -| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower | -| Migrations-as-data | Production audit trail | Clean history | Files accumulate | -| Diff-based sync | Large production datasets | Minimal writes | Complex to implement | - ---- - -## 8. packages/db — package.json exports fix - -The `exports` field must be an object, not an array: - -```json -"exports": { - ".": "./src/index.ts", - "./schema": "./src/db/schema.ts" -} -``` - -Imports then resolve as: - -```ts -import { db } from "@glossa/db"; -import { terms, translations } from "@glossa/db/schema"; -``` diff --git a/documentation/decisions.md b/documentation/decisions.md index ade4fad..11ab391 100644 --- a/documentation/decisions.md +++ b/documentation/decisions.md @@ -1,6 +1,6 @@ # Decisions Log -A record of non-obvious technical decisions made during development, with reasoning. Intended to preserve context across sessions. +A record of non-obvious technical decisions made during development, with reasoning. Intended to preserve context across sessions. Grouped by topic area. --- @@ -32,21 +32,11 @@ All auth delegated to OpenAuth service at `auth.yourdomain.com`. 
Providers: Goog ### Multi-stage builds for monorepo context -Both `apps/web` and `apps/api` use multi-stage Dockerfiles (`deps`, `dev`, `builder`, `runner`) because: - -- The monorepo structure requires copying `pnpm-workspace.yaml`, root `package.json`, and cross-dependencies (`packages/shared`, `packages/db`) before installing -- `node_modules` paths differ between host and container due to workspace hoisting -- Stages allow caching `pnpm install` separately from source code changes +Both `apps/web` and `apps/api` use multi-stage Dockerfiles (`deps`, `dev`, `builder`, `runner`) because the monorepo structure requires copying `pnpm-workspace.yaml`, root `package.json`, and cross-dependencies before installing. Stages allow caching `pnpm install` separately from source code changes. ### Vite as dev server (not Nginx) -In development, `apps/web` uses `vite dev` directly, not Nginx. Reasons: - -- Hot Module Replacement (HMR) requires Vite's WebSocket dev server -- Source maps and error overlay need direct Vite integration -- Nginx would add unnecessary proxy complexity for local dev - -Production will use Nginx to serve static Vite build output. +In development, `apps/web` uses `vite dev` directly, not Nginx. HMR requires Vite's WebSocket dev server. Production will use Nginx to serve static Vite build output. --- @@ -54,41 +44,111 @@ Production will use Nginx to serve static Vite build output. ### Express app structure: factory function pattern -`app.ts` exports a `createApp()` factory function. `server.ts` imports it and calls `.listen()`. This allows tests to import the app directly without starting a server, keeping tests isolated and fast. +`app.ts` exports a `createApp()` factory function. `server.ts` imports it and calls `.listen()`. This allows tests to import the app directly without starting a server (used by supertest). 
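The testability benefit of the factory pattern can be shown without Express at all. This is a dependency-free sketch, not the real `createApp()` (which returns an Express instance): the shape illustrates why tests can exercise routes directly without ever opening a port.

```typescript
type Handler = (body: unknown) => { status: number; body: unknown };

// Factory: builds the app object fresh on each call, nothing listens yet
const createApp = () => {
  const routes = new Map<string, Handler>();
  routes.set("GET /health", () => ({ status: 200, body: { ok: true } }));
  return {
    handle: (method: string, path: string, body?: unknown) => {
      const handler = routes.get(`${method} ${path}`);
      return handler ? handler(body) : { status: 404, body: { error: "not found" } };
    },
  };
};
```

Only `server.ts` would bind the result to a port; a test just calls `createApp().handle(...)` and inspects the response, which is exactly what supertest does against the real Express app.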
-### Data model: `decks` separate from `terms` (not frequency_rank filtering) +### Zod schemas belong in `packages/shared` -**Original approach:** Store `frequency_rank` on `terms` table and filter by rank range for difficulty. +Both the API and frontend import from the same schemas. If the shape changes, TypeScript compilation fails in both places simultaneously — silent drift is impossible. -**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like: +### Server-side answer evaluation -- Rank 1: "In" → "indio" (chemical symbol: Indium) -- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium) -- Rank 7: "He" → "elio" (chemical symbol: Helium) +The correct answer is never sent to the frontend in `GameQuestion`. It is only revealed in `AnswerResult` after the client submits. Prevents cheating and keeps game logic authoritative on the server. -These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning. +### `safeParse` over `parse` in controllers -**Decision:** +`parse` throws a raw Zod error → ugly 500 response. `safeParse` returns a result object → clean 400 with early return via the error handler. -- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering) -- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.) -- `deck_terms` junction table links terms to decks with position ordering -- `rooms.deck_id` specifies which vocabulary deck a game uses +### POST not GET for game start -**Benefits:** +`GET` requests have no body. Game configuration is submitted as a JSON body → `POST` is semantically correct. 
-- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX) -- Bad data (chemical symbols, obscure words) excluded at deck level, not schema level -- Users can create custom decks later -- Multiple difficulty levels without schema changes +### Model parameters use shared types, not `GameRequestType` + +The model layer should not know about `GameRequestType` — that's an HTTP boundary concern. Parameters are typed using the derived constant types (`SupportedLanguageCode`, `SupportedPos`, `DifficultyLevel`) exported from `packages/shared`. + +### Model returns neutral field names, not quiz semantics + +`getGameTerms` returns `sourceText` / `targetText` / `sourceGloss` rather than `prompt` / `answer` / `gloss`. Quiz semantics are applied in the service layer. Keeps the model reusable for non-quiz features. + +### Asymmetric difficulty filter + +Difficulty is filtered on the target (answer) side only. A word can be A2 in Italian but B1 in English, and what matters is the difficulty of the word being learned. + +### optionId as integer 0-3, not UUID + +Options only need uniqueness within a single question; cheating prevented by shuffling, not opaque IDs. + +### questionId and sessionId as UUIDs + +Globally unique, opaque, natural Valkey keys when storage moves later. + +### gloss is `string | null` rather than optional + +Predictable shape on the frontend — always present, sometimes null. + +### GameSessionStore stores only the answer key + +Minimal payload (`questionId → correctOptionId`) for easy Valkey migration. All methods are async even for the in-memory implementation, so the service layer is already written for Valkey. + +### Distractors fetched per-question (N+1 queries) + +Correct shape for the problem; 10 queries on local Postgres is negligible latency. + +### No fallback logic for insufficient distractors + +Data volumes are sufficient; strict query throws if something is genuinely broken. 
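The option shuffle referenced above can be sketched as a standard Fisher-Yates pass (illustrative, not the project's actual code). The temp-variable swap with non-null assertions keeps it compatible with `noUncheckedIndexedAccess`, under which `result[i]` is typed `T | undefined`:

```typescript
function shuffle<T>(input: readonly T[]): T[] {
  const result = [...input];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    // Non-null assertions are safe: i and j are always in bounds here.
    const tmp = result[i]!;
    result[i] = result[j]!;
    result[j] = tmp;
  }
  return result;
}
```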
+ +### Distractor query excludes both term ID and answer text + +Prevents duplicate options from different terms with the same translation. + +### Submit-before-send flow on frontend + +User selects, then confirms. Prevents misclicks. ### Multiplayer mechanic: simultaneous answers (not buzz-first) -All players see the same question at the same time and submit independently. The server waits for all answers or a 15-second timeout, then broadcasts the result. This keeps the experience Duolingo-like and symmetric. A buzz-first mechanic was considered and rejected. +All players see the same question at the same time and submit independently. The server waits for all answers or a 15-second timeout, then broadcasts the result. Keeps the experience symmetric. ### Room model: room codes (not matchmaking queue) -Players create rooms and share a human-readable code (e.g. `WOLF-42`) to invite friends. Auto-matchmaking via a queue is out of scope for MVP. Valkey is included in the stack and can support a queue in a future phase. +Players create rooms and share a human-readable code (e.g. `WOLF-42`). Auto-matchmaking deferred. + +--- + +## Error Handling + +### `AppError` base class over error code maps + +A `statusCode` on the error itself means the middleware doesn't need a lookup table. New error types are self-contained — one class, one status code. `ValidationError` (400) and `NotFoundError` (404) extend `AppError`. + +### `next(error)` over `res.status().json()` in controllers + +Express requires explicit `next(error)` for async handlers — it does not catch async errors automatically. Centralises all error formatting in one middleware. Controllers stay clean: validate, call service, send response. + +### Zod `.message` over `.issues[0]?.message` + +Returns all validation failures at once, not just the first. Output is verbose (raw JSON string) — revisit formatting post-MVP if the frontend needs structured `{ field, message }[]` error objects. 
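A minimal sketch of the `AppError` hierarchy and the status mapping (class names from the text; the middleware is reduced to a plain function instead of an Express error-handler signature):

```typescript
class AppError extends Error {
  // statusCode lives on the error itself — no lookup table needed downstream.
  constructor(message: string, public readonly statusCode: number) {
    super(message);
    this.name = new.target.name;
  }
}

class ValidationError extends AppError {
  constructor(message: string) { super(message, 400); }
}

class NotFoundError extends AppError {
  constructor(message: string) { super(message, 404); }
}

// The one place where errors become HTTP responses (the error middleware):
function toHttpError(err: unknown): { status: number; body: { error: string } } {
  if (err instanceof AppError) {
    return { status: err.statusCode, body: { error: err.message } };
  }
  // Anything unexpected is a 500 with no internals leaked.
  return { status: 500, body: { error: "Internal server error" } };
}
```

Controllers only ever call `next(error)`; the mapping above runs once, in the middleware.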
+ +### Where errors are thrown + +`ValidationError` is thrown in the controller (the layer that runs `safeParse`). `NotFoundError` is thrown in the service (the layer that knows whether a session or question exists). The service doesn't know about HTTP — it throws a typed error, and the middleware maps it to a status code. + +--- + +## Testing + +### Mocked DB for unit tests (not test database) + +Unit tests mock `@glossa/db` via `vi.mock` — the real database is never touched. Tests run in milliseconds with no infrastructure dependency. Integration tests with a real test DB are deferred post-MVP. + +### Co-located test files + +`gameService.test.ts` lives next to `gameService.ts`, not in a separate `__tests__/` directory. Convention matches the `vitest` default and keeps related files together. + +### supertest for endpoint tests + +Uses `createApp()` factory directly — no server started. Tests the full HTTP layer (routing, middleware, error handler) with real request/response assertions. --- @@ -96,19 +156,31 @@ Players create rooms and share a human-readable code (e.g. `WOLF-42`) to invite ### Base config: no `lib`, `module`, or `moduleResolution` -These are intentionally omitted from `tsconfig.base.json` because different packages need different values — `apps/api` uses `NodeNext`, `apps/web` uses `ESNext`/`bundler` (Vite), and mixing them in the base caused errors. Each package declares its own. +Intentionally omitted from `tsconfig.base.json` because different packages need different values — `apps/api` uses `NodeNext`, `apps/web` uses `ESNext`/`bundler` (Vite). Each package declares its own. ### `outDir: "./dist"` per package -The base config originally had `outDir: "dist"` which resolved relative to the base file location, pointing to the root `dist` folder. Overridden in each package with `"./dist"` to ensure compiled output stays inside the package. 
+The base config originally had `outDir: "dist"` which resolved relative to the base file location, pointing to the root `dist` folder. Overridden in each package with `"./dist"`. ### `apps/web` tsconfig: deferred to Vite scaffold -The web tsconfig was left as a placeholder and filled in after `pnpm create vite` generated `tsconfig.json`, `tsconfig.app.json`, and `tsconfig.node.json`. The generated files were then trimmed to remove options already covered by the base. +Filled in after `pnpm create vite` generated tsconfig files. The generated files were trimmed to remove options already covered by the base. ### `rootDir: "."` on `apps/api` -Set explicitly to allow `vitest.config.ts` (which lives outside `src/`) to be included in the TypeScript program. Without it, TypeScript infers `rootDir` as `src/` and rejects any file outside that directory. +Set explicitly to allow `vitest.config.ts` (outside `src/`) to be included in the TypeScript program. + +### Type naming: PascalCase + +`supportedLanguageCode` → `SupportedLanguageCode`. TypeScript convention. + +### Primitive types: always lowercase + +`number` not `Number`, `string` not `String`. The uppercase versions are object wrappers and not assignable to Drizzle's expected primitive types. + +### `globals: true` with `"types": ["vitest/globals"]` + +Using Vitest globals requires `"types": ["vitest/globals"]` in each package's tsconfig. Added to `apps/api`, `packages/shared`, `packages/db`, and `apps/web/tsconfig.app.json`. --- @@ -116,43 +188,11 @@ Set explicitly to allow `vitest.config.ts` (which lives outside `src/`) to be in ### Two-config approach for `apps/web` -The root `eslint.config.mjs` handles TypeScript linting across all packages. `apps/web/eslint.config.js` is kept as a local addition for React-specific plugins only: `eslint-plugin-react-hooks` and `eslint-plugin-react-refresh`. ESLint flat config merges them automatically by directory proximity — no explicit import between them needed. 
+Root `eslint.config.mjs` handles TypeScript linting across all packages. `apps/web/eslint.config.js` adds React-specific plugins only. ESLint flat config merges them by directory proximity. ### Coverage config at root only -Vitest coverage configuration lives in the root `vitest.config.ts` only. Individual package configs omit it to produce a single aggregated report rather than separate per-package reports. - -### `globals: true` with `"types": ["vitest/globals"]` - -Using Vitest globals (`describe`, `it`, `expect` without imports) requires `"types": ["vitest/globals"]` in each package's tsconfig `compilerOptions`. Added to `apps/api`, `packages/shared`, and `packages/db`. Added to `apps/web/tsconfig.app.json`. - ---- - -## Known Issues / Dev Notes - -### glossa-web has no healthcheck - -The `web` service in `docker-compose.yml` has no `healthcheck` defined. Reason: Vite's dev server (`vite dev`) has no built-in health endpoint. Unlike the API's `/api/health`, there's no URL to poll. - -Workaround: `depends_on` uses `api` healthcheck as proxy. For production (Nginx), add a health endpoint or use TCP port check. - -### Valkey memory overcommit warning - -Valkey logs this on start in development: - -```text -WARNING Memory overcommit must be enabled for proper functionality -``` - -This is **harmless in dev** but should be fixed before production. The warning appears because Docker containers don't inherit host sysctl settings by default. - -Fix: Add to host `/etc/sysctl.conf`: - -```conf -vm.overcommit_memory = 1 -``` - -Then `sudo sysctl -p` or restart Docker. +Vitest coverage configuration lives in the root `vitest.config.ts` only. Produces a single aggregated report. --- @@ -160,190 +200,135 @@ Then `sudo sysctl -p` or restart Docker. ### Users: internal UUID + openauth_sub (not sub as PK) -**Original approach:** Use OpenAuth `sub` claim directly as `users.id` (text primary key). - -**Problem:** Embeds auth provider in the primary key (e.g. `"google|12345"`). 
If OpenAuth changes format or a second provider is added, the PK cascades through all FKs (`rooms.host_id`, `room_players.user_id`).
-
-**Decision:**
-
-- `users.id` = internal UUID (stable FK target)
-- `users.openauth_sub` = text UNIQUE (auth provider claim)
-- Allows adding multiple auth providers per user later without FK changes
+Embedding the auth provider in the primary key would cascade through all FKs if OpenAuth changes format. `users.id` = internal UUID (stable FK target). `users.openauth_sub` = text UNIQUE (auth provider claim).
 
 ### Rooms: `updated_at` for stale recovery only
 
-Most tables omit `updated_at` (unnecessary for MVP). `rooms.updated_at` is kept specifically for stale room recovery—identifying rooms stuck in `in_progress` status after server crashes.
+Most tables omit `updated_at`. `rooms.updated_at` is kept specifically for identifying rooms stuck in `in_progress` status after server crashes.
 
 ### Translations: UNIQUE (term_id, language_code, text)
 
-Allows multiple synonyms per language per term (e.g. "dog", "hound" for same synset). Prevents exact duplicate rows. Homonyms (e.g. "Lead" metal vs. "Lead" guide) are handled by different `term_id` values (different synsets), so no constraint conflict.
+Allows multiple synonyms per language per term (e.g. "dog", "hound" for same synset). Prevents exact duplicate rows.
+
+### One gloss per term per language
+
+The unique constraint on `term_glosses` was tightened from `(term_id, language_code, text)` to `(term_id, language_code)` to prevent left joins from multiplying question rows. Revisit if multiple glosses per language are ever needed.
 
 ### Decks: `source_language` + `validated_languages` (not `pair_id`)
 
-**Original approach:** `decks.pair_id` references `language_pairs`, tying each deck to a single language pair.
-
-**Problem:** One deck can serve multiple target languages as long as translations exist for all its terms. A `pair_id` FK would require duplicating the deck for each target language. 
- -**Decision:** - -- `decks.source_language` — the language the wordlist was curated from (e.g. `"en"`). A deck sourced from an English frequency list is fundamentally different from one sourced from an Italian list. -- `decks.validated_languages` — array of language codes (excluding `source_language`) for which full translation coverage exists across all terms in the deck. Recalculated and updated on every run of the generation script. -- The language pair used for a quiz session is determined at session start, not at deck creation time. - -**Benefits:** - -- One deck serves multiple target languages (e.g. en→it and en→fr) without duplication -- `validated_languages` stays accurate as translation data grows -- DB enforces via CHECK constraint that `source_language` is never included in `validated_languages` +One deck can serve multiple target languages as long as translations exist for all its terms. `source_language` is the language the wordlist was curated from. `validated_languages` is recalculated on every generation script run. Enforced via CHECK: `source_language` is never in `validated_languages`. ### Decks: wordlist tiers as scope (not POS-split decks) -**Rejected approach:** one deck per POS (e.g. `en-nouns`, `en-verbs`). - -**Problem:** POS is already a filterable column on `terms`, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication in the generation script. - -**Decision:** one deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary at query time. The user never sees or picks a deck — they pick a direction, POS, and difficulty, and the app resolves those to the right deck + filters. 
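The `source_language` / `validated_languages` CHECK described above can be sketched as (column names from the text, constraint name assumed):

```sql
-- A deck's source language must never appear in its own validated targets.
ALTER TABLE decks ADD CONSTRAINT decks_source_not_validated
  CHECK (NOT (source_language = ANY (validated_languages)));
```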
- -Progression works by expanding the deck set as the user advances: - -```sql -WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000') - AND t.pos = 'noun' - AND t.cefr_level = 'B1' -``` - -Decks must not overlap — each term appears in exactly one tier. The generation script already deduplicates, so this is enforced at import time. +One deck per frequency tier per source language (e.g. `en-core-1000`). POS, difficulty, and category are query filters applied inside that boundary. Decks must not overlap — each term appears in exactly one tier. ### Decks: SUBTLEX as wordlist source (not manual curation) -**Problem:** the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what is actually common in Italian. - -**Decision:** use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology, making them comparable across languages. - -This is why `decks.source_language` is not just a technical detail — it is the reason the data model is correct: - -- `en-core-1000` built from SUBTLEX-EN → used when source language is English (en→it) -- `it-core-1000` built from SUBTLEX-IT → used when source language is Italian (it→en) - -Same translation data underneath, correctly frequency-grounded per direction. Two wordlist files, two generation script runs. - -### Terms: `synset_id` nullable (not NOT NULL) - -**Problem:** non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. `NOT NULL` is too strict. - -**Decision:** make `synset_id` nullable. `synset_id` remains the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It is not removed. 
- -Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are not considered equal), so no additional constraint logic is needed beyond dropping `notNull()`. For extra defensiveness a partial unique index can be added later: - -```sql -CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL; -``` - -### Terms: `source` + `source_id` columns - -Once multiple import pipelines exist (OMW today, Wiktionary later), `synset_id` alone is insufficient as an idempotency key — Wiktionary terms won't have a synset ID. - -**Decision:** add `source` (varchar, e.g. `'omw'`, `'wiktionary'`, null for manual) and `source_id` (text, the pipeline's internal identifier) with a unique constraint on the pair: - -```ts -unique("unique_source_id").on(table.source, table.source_id); -``` - -Postgres allows multiple `NULL` pairs under a unique constraint, so manual entries don't conflict. For existing OMW terms, backfill `source = 'omw'` and `source_id = synset_id`. `synset_id` remains for now to avoid pipeline churn — deprecate it during a future pipeline refactor. - -No CHECK constraint on `source` — it is only written by controlled import scripts, not user input. A free varchar is sufficient. - -### Translations: `cefr_level` column (deferred population, not on `terms`) - -CEFR difficulty is language-relative, not concept-relative. "House" in English is A1, "domicile" is also English but B2 — same concept, different words, different difficulty. Moving `cefr_level` to `translations` allows each language's word to have its own level independently. - -Added as nullable `varchar(2)` with CHECK constraint against `CEFR_LEVELS` (`A1`–`C2`) on the `translations` table. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. 
Also included in the `translations` index since the quiz query filters on it: - -```ts -index("idx_translations_lang").on( - table.language_code, - table.cefr_level, - table.term_id, -); -``` +The most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian. SUBTLEX exists in per-language editions derived from subtitle corpora using the same methodology — making them comparable. `en-core-1000` built from SUBTLEX-EN, `it-core-1000` from SUBTLEX-IT. ### `language_pairs` table: dropped -Valid language pairs are already implicitly defined by `decks.source_language` + `decks.validated_languages`. The table was redundant — the same information can be derived directly from decks: +Valid pairs are implicitly defined by `decks.source_language` + `decks.validated_languages`. The table was redundant. -```sql -SELECT DISTINCT source_language, unnest(validated_languages) AS target_language -FROM decks -WHERE validated_languages != '{}' -``` +### Terms: `synset_id` nullable (not NOT NULL) -The only thing `language_pairs` added was an `active` flag to manually disable a direction. This is an edge case not needed for MVP. Dropped to remove a maintenance surface that required staying in sync with deck data. +Non-WordNet terms won't have a synset ID. Postgres `UNIQUE` on a nullable column allows multiple NULL values. -### Schema: `categories` + `term_categories` (empty for MVP) +### Terms: `source` + `source_id` columns -Added to schema now, left empty for MVP. Grammar and Media work without them — Grammar maps to POS (already on `terms`), Media maps to deck membership. Thematic categories (animals, kitchen, etc.) require a metadata source that is still under research. +Once multiple import pipelines exist (OMW, Wiktionary), `synset_id` alone is insufficient as an idempotency key. Unique constraint on the pair. Postgres allows multiple NULL pairs. `synset_id` remains for now — deprecate during a future pipeline refactor. 
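An idempotent import keyed on `(source, source_id)` might look like the following sketch (values are illustrative; the conflict target assumes the unique constraint on the pair):

```sql
-- Re-running the import is a no-op for rows that already exist;
-- manual entries with NULL source/source_id never conflict.
INSERT INTO terms (pos, source, source_id)
VALUES ('noun', 'omw', '02084071-n')
ON CONFLICT (source, source_id) DO NOTHING;
```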
-```sql -categories: id, slug, label, created_at -term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id) -``` +### `cefr_level` on `translations` (not `terms`) -See Open Research section for source options. +CEFR difficulty is language-relative, not concept-relative. "House" in English is A1, "domicile" is also English but B2 — same concept, different words, different difficulty. Added as nullable `varchar(2)` with CHECK. -### Schema constraints: CHECK over pgEnum for extensible value sets +### Categories + term_categories: empty for MVP -**Question:** use `pgEnum` for columns like `pos`, `cefr_level`, and `source` since the values are driven by TypeScript constants anyway? +Schema exists. Grammar maps to POS (already on `terms`), Media maps to deck membership. Thematic categories require a metadata source still under research. -**Decision:** no. Use CHECK constraints for any value set that will grow over time. +### CHECK over pgEnum for extensible value sets -**Reason:** `ALTER TYPE enum_name ADD VALUE` in Postgres is non-transactional — it cannot be rolled back if a migration fails partway through, leaving the DB in a dirty state that requires manual intervention. CHECK constraints are fully transactional — if the migration fails it rolls back cleanly. +`ALTER TYPE enum_name ADD VALUE` in Postgres is non-transactional — cannot be rolled back if a migration fails. CHECK constraints are fully transactional. Rule: pgEnum for truly static sets, CHECK for any set tied to a growing constant. -**Rule of thumb:** pgEnum is appropriate for truly static value sets that will never grow (e.g. `('pending', 'active', 'cancelled')` on an orders table). Any value set tied to a growing constant in the codebase (`SUPPORTED_POS`, `CEFR_LEVELS`, `SUPPORTED_LANGUAGE_CODES`) stays as a CHECK constraint. 
+### `language_code` always CHECK-constrained -### Schema constraints: `language_code` always CHECK-constrained +Unlike `source` (only written by import scripts), `language_code` is a query-critical filter column. A typo would silently produce missing data. Rule: any column game queries filter on should be CHECK-constrained. -`language_code` columns on `translations` and `term_glosses` are constrained via CHECK against `SUPPORTED_LANGUAGE_CODES`, the same pattern used for `pos` and `cefr_level`. +### Unique constraints make explicit FK indexes redundant -**Reason:** unlike `source`, which is only written by controlled import scripts and failing silently is recoverable, `language_code` is a query-critical filter column. A typo (`'ita'` instead of `'it'`, `'en '` with a trailing space) would silently produce missing data in the UI — terms with no translation shown, glosses not displayed — which is harder to debug than a DB constraint violation. - -**Rule:** any column that game queries filter on should be CHECK-constrained. Columns only used for internal bookkeeping (like `source`) can be left as free varchars. - -### Schema: unique constraints make explicit FK indexes redundant - -Postgres automatically creates an index to enforce a unique constraint. An explicit index on a column that is already the leading column of a unique constraint is redundant. - -Example: `unique("unique_term_gloss").on(term_id, language_code, text)` already indexes `term_id` as the leading column. A separate `index("idx_term_glosses_term").on(term_id)` adds no value and was dropped. - -**Rule:** before adding an explicit index, check whether an existing unique constraint already covers it. - -### Future extensions: morphology and pronunciation (deferred, additive) - -The following features are explicitly deferred post-MVP. All are purely additive — new tables referencing existing `terms` rows via FK. 
No existing schema changes required when implemented: - -- `noun_forms` — gender, singular, plural, articles per language (source: Wiktionary) -- `verb_forms` — conjugation tables per language (source: Wiktionary) -- `term_pronunciations` — IPA and audio URLs per language (source: Wiktionary / Forvo) - -Exercise types split naturally into Type A (translation, current model) and Type B (morphology, future). The data layer is independent — the same `terms` anchor both. +Postgres automatically creates an index to enforce a unique constraint. A separate index on the leading column of an existing unique constraint adds no value. --- -### Term glosses: Italian coverage is sparse (expected) +## Data Pipeline -OMW gloss data is primarily in English. After full import: +### Seeding v1: batch, truncate-based -- English glosses: 95,882 (~100% of terms) -- Italian glosses: 1,964 (~2% of terms) +For dev/first-time setup. Read JSON, batch inserts in groups of 500, truncate tables before each run. Simple and fast. -This is not a data pipeline problem — it reflects the actual state of OMW. Italian -glosses simply don't exist for most synsets in the dataset. +Key pitfalls encountered: -**Handling in the UI:** fall back to the English gloss when no gloss exists for the -user's language. This is acceptable UX — a definition in the wrong language is better -than no definition at all. +- Duplicate key on re-run: truncate before seeding +- `onConflictDoNothing` breaks FK references: when it skips a `terms` insert, the in-memory UUID is never written, causing FK violations on `translations` +- `forEach` doesn't await: use `for...of` +- Final batch not flushed: guard with `if (termsArray.length > 0)` after loop -If Italian gloss coverage needs to improve in the future, Wiktionary is the most -likely source — it has broader multilingual definition coverage than OMW. +### Seeding v2: incremental upsert, multi-file + +For production / adding languages. 
Extends the database without truncating. Each synset processed individually (no batching — need real `term.id` from DB before inserting translations). Filename convention: `sourcelang-targetlang-pos.json`. + +### CEFR enrichment pipeline + +Staged ETL: `extract-*.py` → `compare-*.py` (quality gate) → `merge-*.py` (resolve conflicts) → `enrich.ts` (write to DB). Source priority: English `en_m3 > cefrj > octanove > random`, Italian `it_m3 > italian`. + +Enrichment results: English 42,527/171,394 (~25%), Italian 23,061/54,603 (~42%). Both sufficient for MVP. Italian C2 has only 242 terms — noted as constraint for distractor algorithm. + +### Term glosses: Italian coverage is sparse + +OMW gloss data is primarily English. English glosses: 95,882 (~100%), Italian: 1,964 (~2%). UI falls back to English gloss when no gloss exists for the user's language. + +### Glosses can leak answers + +Some WordNet glosses contain the target-language word in the definition text (e.g. "Padre" in the English gloss for "father"). Address during post-MVP data enrichment — clean glosses, replace with custom definitions, or filter at service layer. + +### `packages/db` exports fix + +The `exports` field must be an object, not an array: + +```json +"exports": { + ".": "./src/index.ts", + "./schema": "./src/db/schema.ts" +} +``` + +--- + +## API Development: Problems & Solutions + +1. **Messy API structure.** Responsibilities bleeding across layers. Fixed with strict layered architecture. +2. **No shared contract.** API could return different shapes silently. Fixed with Zod schemas in `packages/shared`. +3. **Type safety gaps.** `any` types, `Number` vs `number`. Fixed with derived types from constants. +4. **`getGameTerms` in wrong package.** Model queries in `apps/api` meant direct `drizzle-orm` dependency. Moved to `packages/db/src/models/`. +5. **Deck generation complexity.** 12 decks assumed, only 2 needed. Then skipped entirely for MVP — query terms table directly. +6. 
**GAME_ROUNDS type conflict.** `z.enum()` only accepts strings. Keep as strings, convert to number in service. +7. **Gloss join multiplied rows.** Multiple glosses per term per language. Fixed by tightening unique constraint. +8. **Model leaked quiz semantics.** Return fields named `prompt`/`answer`. Renamed to neutral `sourceText`/`targetText`. +9. **AnswerResult wasn't self-contained.** Frontend needed `selectedOptionId` but schema didn't include it. Added. +10. **Distractor could duplicate correct answer.** Different terms with same translation. Fixed with `ne(translations.text, excludeText)`. +11. **TypeScript strict mode flagged Fisher-Yates shuffle.** `noUncheckedIndexedAccess` treats `result[i]` as `T | undefined`. Fixed with non-null assertion + temp variable. + +--- + +## Known Issues / Dev Notes + +### glossa-web has no healthcheck + +Vite's dev server has no built-in health endpoint. `depends_on` uses API healthcheck as proxy. For production (Nginx), add a health endpoint or TCP port check. + +### Valkey memory overcommit warning + +Harmless in dev. Fix before production: add `vm.overcommit_memory = 1` to host `/etc/sysctl.conf`. --- @@ -351,88 +336,26 @@ likely source — it has broader multilingual definition coverage than OMW. ### Semantic category metadata source -Categories (`animals`, `kitchen`, etc.) are in the schema but empty for MVP. -Grammar and Media work without them (Grammar = POS filter, Media = deck membership). -Needs research before populating `term_categories`. Options: +Categories (`animals`, `kitchen`, etc.) are in the schema but empty. Options researched: -**Option 1: WordNet domain labels** -Already in OMW, extractable in the existing pipeline. Free, no extra dependency. -Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals"). - -**Option 2: Princeton WordNet Domains** -Separate project built on WordNet. ~200 hierarchical domains mapped to synsets. 
More structured -and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1. - -**Option 3: Kelly Project** -Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, -multiple languages. Could solve frequency tiers (`cefr_level`) and semantic categories in one shot. -Investigate coverage for your languages and POS range first. - -**Option 4: BabelNet / WikiData** -Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories. -Problem: complex integration, BabelNet has commercial licensing restrictions, WikiData category -trees are deep and noisy. - -**Option 5: LLM-assisted categorization** -Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import. -Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible -without saving output. Good fallback if structured sources have insufficient coverage. - -**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill** -Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass. -Combines automation with control. Likely the most practical approach. - -**Option 7: Manual curation** -Flat file mapping synset IDs to your own category slugs. Full control, matches UI exactly. -Too expensive at scale — only viable for small curated additions on top of an automated baseline. +1. **WordNet domain labels** — already in OMW, coarse and patchy +2. **Princeton WordNet Domains** — ~200 hierarchical domains, freely available, meaningfully better +3. **Kelly Project** — CEFR levels AND semantic fields, designed for language learning. Could solve frequency tiers and categories in one shot +4. **BabelNet / WikiData** — rich but complex integration, licensing issues +5. **LLM-assisted categorization** — fast and cheap at current term counts, not reproducible without saving output +6. 
**Hybrid (WordNet Domains + LLM gap-fill)** — likely most practical +7. **Manual curation** — full control, too expensive at scale **Current recommendation:** research Kelly Project first. If coverage is insufficient, go with Option 6. ---- +### SUBTLEX → `cefr_level` mapping strategy -## Current State +Raw frequency ranks need mapping to A1–C2 bands before tiered decks are meaningful. Decision pending. -Phase 0 complete. Phase 1 data pipeline complete. Phase 2 data model finalized and migrated. +### Future extensions: morphology and pronunciation -### Completed (Phase 1 — data pipeline) +All deferred post-MVP, purely additive (new tables referencing existing `terms`): -- [x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json` -- [x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms` -- [x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`) -- [x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks) -- [x] Write `packages/db/src/generating-decks.ts` — idempotent deck generation script - - reads and deduplicates source wordlist - - matches words to DB terms (homonyms included) - - writes unmatched words to `-missing` file - - determines `validated_languages` by checking full translation coverage per language - - creates deck if it doesn't exist, adds only missing terms on subsequent runs - - recalculates and persists `validated_languages` on every run - -### Completed (Phase 2 — data model) - -- [x] `synset_id` removed, replaced by `source` + `source_id` on `terms` -- [x] `cefr_level` added to `translations` (not `terms` — difficulty is language-relative) -- [x] `language_code` CHECK constraint added to `translations` and `term_glosses` -- [x] `language_pairs` table dropped — pairs derived from decks at query time -- [x] `is_public` and `added_at` dropped from `decks` and `deck_terms` -- [x] `type` added to `decks` with CHECK against 
`SUPPORTED_DECK_TYPES` -- [x] `topics` and `term_topics` tables added (empty for MVP) -- [x] Migration generated and run against fresh database - -### Known data facts (pre-wipe, for reference) - -- Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate) -- Term IDs resolved: 3171 (higher than word count due to homonyms) -- Words not found in DB: 34 -- Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages` - -### Next (Phase 3 — data pipeline + API) - -1. done -2. done -3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations -4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful -5. **Generate decks** — run generation script with SUBTLEX-grounded wordlists per source language -6. **Finalize game selection flow** — direction → category → POS → difficulty → round count -7. **Define Zod schemas in `packages/shared`** — based on finalized game flow and API shape -8. **Implement API** +- `noun_forms` — gender, singular, plural, articles per language (source: Wiktionary) +- `verb_forms` — conjugation tables per language (source: Wiktionary) +- `term_pronunciations` — IPA and audio URLs per language (source: Wiktionary / Forvo) diff --git a/documentation/mvp.md b/documentation/mvp.md deleted file mode 100644 index 1bad19e..0000000 --- a/documentation/mvp.md +++ /dev/null @@ -1,469 +0,0 @@ -# glossa mvp - -> **This document is the single source of truth for the project.** -> It is written to be handed to any LLM as context. It contains the project vision, the current MVP scope, the tech stack, the working methodology, and the roadmap. - ---- - -## 1. Project Overview - -A vocabulary trainer for English–Italian words. 
The quiz format is Duolingo-style: one word is shown as a prompt, and the user picks the correct translation from four choices (1 correct + 3 distractors of the same part-of-speech). The long-term vision is a multiplayer competitive game, but the MVP is a polished singleplayer experience. - -**The core learning loop:** -Show word → pick answer → see result → next word → final score - -The vocabulary data comes from WordNet + the Open Multilingual Wordnet (OMW). A one-time Python script extracts English–Italian noun pairs and seeds the database. The data model is language-pair agnostic by design — adding a new language later requires no schema changes. - ---- - -## 2. What the Full Product Looks Like (Long-Term Vision) - -- Users log in via Google or GitHub (OpenAuth) -- Singleplayer mode: 10-round quiz, score screen -- Multiplayer mode: create a room, share a code, 2–4 players answer simultaneously in real time, live scores, winner screen -- 1000+ English–Italian nouns seeded from WordNet - -This is documented in `spec.md` and the full `roadmap.md`. The MVP deliberately ignores most of it. - ---- - -## 3. MVP Scope - -**Goal:** A working, presentable singleplayer quiz that can be shown to real people. 
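In code, the question format described in the overview can be sketched roughly like this (the `QuizQuestion` shape and `buildQuestion` helper are invented for illustration; the real shapes will live in `packages/shared` as Zod schemas):

```typescript
// Illustrative shapes only; the project's actual schemas are defined as Zod
// schemas in packages/shared, not hand-written like this.
interface QuizOption {
  text: string;    // target-language word
  correct: boolean;
}

interface QuizQuestion {
  prompt: string;        // source-language word
  options: QuizOption[]; // 1 correct + 3 same-POS distractors, shuffled
}

// Build one question from a correct translation and three distractors.
function buildQuestion(prompt: string, answer: string, distractors: string[]): QuizQuestion {
  if (distractors.length !== 3) throw new Error("exactly 3 distractors required");
  const options: QuizOption[] = [
    { text: answer, correct: true },
    ...distractors.map((text) => ({ text, correct: false })),
  ];
  // Fisher-Yates shuffle so the correct answer is not always in the same slot.
  for (let i = options.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [options[i], options[j]] = [options[j], options[i]];
  }
  return { prompt, options };
}
```

A question for "dog" would carry "cane" plus three noun distractors; the UI only needs to render `prompt` and the four `options`.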
### What is IN the MVP

- Vocabulary data in a PostgreSQL database (already seeded)
- REST API that returns quiz terms with distractors
- Singleplayer quiz UI: 10 questions, answer feedback, score screen
- Clean, mobile-friendly UI (Tailwind + shadcn/ui)
- Local dev only (no deployment for MVP)

### What is CUT from the MVP

| Feature                         | Why cut                                |
| ------------------------------- | -------------------------------------- |
| Authentication (OpenAuth)       | No user accounts needed for a demo     |
| Multiplayer (WebSockets, rooms) | Core quiz works without it             |
| Valkey / Redis cache            | Only needed for multiplayer room state |
| Deployment to Hetzner           | Ship to people locally first           |
| User stats / profiles           | Needs auth                             |
| Testing suite                   | Add after the UI stabilises            |

These are not deleted from the plan — they are deferred. The architecture is already designed to support them. See Section 12 (Post-MVP Ladder).

---

## 4. Technology Stack

The monorepo structure and tooling are already set up (Phase 0 complete). This is the full stack — the MVP uses a subset of it.

| Layer        | Technology                     | MVP?        |
| ------------ | ------------------------------ | ----------- |
| Monorepo     | pnpm workspaces                | ✅          |
| Frontend     | React 18, Vite, TypeScript     | ✅          |
| Routing      | TanStack Router                | ✅          |
| Server state | TanStack Query                 | ✅          |
| Client state | Zustand                        | ✅          |
| Styling      | Tailwind CSS + shadcn/ui       | ✅          |
| Backend      | Node.js, Express, TypeScript   | ✅          |
| Database     | PostgreSQL + Drizzle ORM       | ✅          |
| Validation   | Zod (shared schemas)           | ✅          |
| Auth         | OpenAuth (Google + GitHub)     | ❌ post-MVP |
| Realtime     | WebSockets (`ws` library)      | ❌ post-MVP |
| Cache        | Valkey                         | ❌ post-MVP |
| Testing      | Vitest, React Testing Library  | ❌ post-MVP |
| Deployment   | Docker Compose, Hetzner, Nginx | ❌ post-MVP |

### Repository Structure (actual, as of Phase 1 data pipeline complete)

```text
vocab-trainer/
├── apps/
│   ├── api/
│   │   └── src/
│   │       ├── app.ts        # createApp() factory — routes registered here
│   │       └── server.ts     # calls app.listen()
│   └── web/
│       └── src/
│           ├── routes/
│           │   ├── __root.tsx
│           │   ├── index.tsx  # placeholder landing page
│           │   └── about.tsx
│           ├── main.tsx
│           └── index.css
├── packages/
│   ├── shared/
│   │   └── src/
│   │       ├── index.ts      # empty — Zod schemas go here next
│   │       └── constants.ts
│   └── db/
│       ├── drizzle/          # migration SQL files
│       └── src/
│           ├── db/schema.ts  # full Drizzle schema
│           ├── seeding-datafiles.ts  # seeds terms + translations
│           ├── generating-decks.ts   # builds curated decks
│           └── index.ts
├── documentation/            # all project docs live here
│   ├── spec.md
│   ├── roadmap.md
│   ├── decisions.md
│   ├── mvp.md                # this file
│   └── CLAUDE.md
├── scripts/
│   ├── extract-en-it-nouns.py
│   └── datafiles/en-it-nouns.json
├── docker-compose.yml
└── pnpm-workspace.yaml
```

**What does not exist yet (to be built in MVP phases):**

- `apps/api/src/routes/` — no route handlers yet
- `apps/api/src/services/` — no business logic yet
- `apps/api/src/repositories/` — no DB queries yet
- `apps/web/src/components/` — no UI components yet
- 
`apps/web/src/stores/` — no Zustand store yet -- `apps/web/src/lib/api.ts` — no TanStack Query wrappers yet -- `packages/shared/src/schemas/` — no Zod schemas yet - -`packages/shared` is the contract between frontend and backend. All request/response shapes are defined there as Zod schemas — never duplicated. - ---- - -## 5. Data Model (relevant tables for MVP) - -```javascript -export const terms = pgTable( - "terms", - { - id: uuid().primaryKey().defaultRandom(), - synset_id: text().unique().notNull(), - pos: varchar({ length: 20 }).notNull(), - created_at: timestamp({ withTimezone: true }).defaultNow().notNull(), - }, - (table) => [ - check( - "pos_check", - sql`${table.pos} IN (${sql.raw(SUPPORTED_POS.map((p) => `'${p}'`).join(", "))})`, - ), - index("idx_terms_pos").on(table.pos), - ], -); - -export const translations = pgTable( - "translations", - { - id: uuid().primaryKey().defaultRandom(), - term_id: uuid() - .notNull() - .references(() => terms.id, { onDelete: "cascade" }), - language_code: varchar({ length: 10 }).notNull(), - text: text().notNull(), - created_at: timestamp({ withTimezone: true }).defaultNow().notNull(), - }, - (table) => [ - unique("unique_translations").on( - table.term_id, - table.language_code, - table.text, - ), - index("idx_translations_lang").on(table.language_code, table.term_id), - ], -); - -export const decks = pgTable( - "decks", - { - id: uuid().primaryKey().defaultRandom(), - name: text().notNull(), - description: text(), - source_language: varchar({ length: 10 }).notNull(), - validated_languages: varchar({ length: 10 }).array().notNull().default([]), - is_public: boolean().default(false).notNull(), - created_at: timestamp({ withTimezone: true }).defaultNow().notNull(), - }, - (table) => [ - check( - "source_language_check", - sql`${table.source_language} IN (${sql.raw(SUPPORTED_LANGUAGE_CODES.map((l) => `'${l}'`).join(", "))})`, - ), - check( - "validated_languages_check", - sql`validated_languages <@ 
ARRAY[${sql.raw(SUPPORTED_LANGUAGE_CODES.map((l) => `'${l}'`).join(", "))}]::varchar[]`, - ), - check( - "validated_languages_excludes_source", - sql`NOT (${table.source_language} = ANY(${table.validated_languages}))`, - ), - unique("unique_deck_name").on(table.name, table.source_language), - ], -); - -export const deck_terms = pgTable( - "deck_terms", - { - deck_id: uuid() - .notNull() - .references(() => decks.id, { onDelete: "cascade" }), - term_id: uuid() - .notNull() - .references(() => terms.id, { onDelete: "cascade" }), - added_at: timestamp({ withTimezone: true }).defaultNow().notNull(), - }, - (table) => [primaryKey({ columns: [table.deck_id, table.term_id] })], -); -``` - -The seed + deck-build scripts have already been run. Data exists in the database. - ---- - -## 6. API Endpoints (MVP) - -All endpoints prefixed `/api`. Schemas live in `packages/shared` and are validated with Zod on both sides. - -| Method | Path | Description | -| ------ | ---------------------- | --------------------------------------- | -| GET | `/api/health` | Health check (already done) | -| GET | `/api/language-pairs` | List active language pairs | -| GET | `/api/decks` | List available decks | -| GET | `/api/decks/:id/terms` | Fetch terms with distractors for a quiz | - -### Distractor Logic - -The `QuizService` picks 3 distractors server-side: - -- Same part-of-speech as the correct answer -- Never the correct answer -- Never repeated within a session - ---- - -## 7. 
Frontend Structure (MVP) - -```text -apps/web/src/ -├── routes/ -│ ├── index.tsx # Landing page / mode select -│ └── singleplayer/ -│ └── index.tsx # The quiz -├── components/ -│ ├── quiz/ -│ │ ├── QuestionCard.tsx # Prompt word + 4 answer buttons -│ │ ├── OptionButton.tsx # idle / correct / wrong states -│ │ └── ScoreScreen.tsx # Final score + play again -│ └── ui/ # shadcn/ui wrappers -├── stores/ -│ └── gameStore.ts # Zustand: question index, score, answers -└── lib/ - └── api.ts # TanStack Query fetch wrappers -``` - -### State Management - -TanStack Query handles fetching quiz data from the API. Zustand handles the local quiz session (current question index, score, selected answers). There is no overlap between the two. - ---- - -## 8. Working Methodology - -> **Read this section before asking for help with any task.** - -This project is a learning exercise. The goal is to understand the code, not just to ship it. - -### How tasks are structured - -The roadmap (Section 10) lists broad phases. When work starts on a phase, it gets broken into smaller, concrete subtasks with clear done-conditions before any code is written. - -### How to use an LLM for help - -When asking an LLM for help: - -1. **Paste this document** (or the relevant sections) as context -2. **Describe what you're working on** and what specifically you're stuck on -3. **Ask for hints, not solutions.** Example prompts: - - "I'm trying to implement X. My current approach is Y. What am I missing conceptually?" - - "Here is my code. What would you change about the structure and why?" - - "Can you point me to the relevant docs for Z?" - -### Refactoring workflow - -After completing a task or a block of work: - -1. Share the current state of the code with the LLM -2. Ask: _"What would you refactor here, and why? Don't show me the code — point me in the right direction and link relevant documentation."_ -3. 
The LLM should explain the _what_ and _why_, link to relevant docs/guides, and let you implement the fix yourself - -**The LLM should never write the implementation for you.** If it does, ask it to delete it and explain the concept instead. - -### Decisions log - -Keep a `decisions.md` file in the root. When you make a non-obvious choice (a library, a pattern, a trade-off), write one short paragraph explaining what you chose and why. This is also useful context for any LLM session. - ---- - -## 9. Game Mechanics - -- **Format**: source-language word prompt + 4 target-language choices -- **Distractors**: same POS, server-side, never the correct answer, no repeats in a session -- **Session length**: 10 questions -- **Scoring**: +1 per correct answer (no speed bonus for MVP) -- **Timer**: none in singleplayer MVP -- **No auth required**: anonymous users - ---- - -## 10. MVP Roadmap - -> Tasks are written at a high level. When starting a phase, break it into smaller subtasks before writing any code. - -### Current Status - -**Phase 0 (Foundation) — ✅ Complete** -**Phase 1 (Vocabulary Data) — 🔄 Data pipeline complete. API layer is the immediate next step.** - -What is already in the database: - -- 999 unique English terms (nouns), fully seeded from WordNet/OMW -- 3171 term IDs resolved (higher than word count due to homonyms) -- Full Italian translation coverage (3171/3171 terms) -- Decks created and populated via `packages/db/src/generating-decks.ts` -- 34 words from the source wordlist had no WordNet match (expected, not a bug) - ---- - -### Phase 1 — Finish the API Layer - -**Goal:** The frontend can fetch quiz data from the API. - -**Done when:** `GET /api/decks/1/terms?limit=10` returns 10 terms, each with 3 distractors of the same POS attached. 
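The three distractor rules can be sketched as a small pure function (hypothetical names, not the actual `QuizService` code):

```typescript
interface Term {
  id: string;
  pos: string;
  translation: string;
}

// Pick 3 distractors for a term from a candidate pool, honouring the rules:
// same POS, never the correct answer, never repeated within the session.
function pickDistractors(term: Term, pool: Term[], usedIds: Set<string>): Term[] {
  const candidates = pool.filter(
    (c) =>
      c.pos === term.pos &&                 // same part of speech
      c.id !== term.id &&                   // never the correct answer
      c.translation !== term.translation && // avoid identical answer text
      !usedIds.has(c.id),                   // never repeated within a session
  );
  // Naive shuffle of a copy; fine for a sketch, biased in practice.
  const shuffled = [...candidates].sort(() => Math.random() - 0.5);
  const picked = shuffled.slice(0, 3);
  if (picked.length < 3) throw new Error("not enough distractor candidates");
  picked.forEach((d) => usedIds.add(d.id));
  return picked;
}
```

Because `usedIds` persists across calls, the same distractor is never shown twice in one session.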
- -**Broadly, what needs to happen:** - -- Define Zod response schemas in `packages/shared` for terms, decks, and language pairs -- Implement a repository layer that queries the DB for terms belonging to a deck -- Implement a service layer that attaches distractors to each term (same POS, no duplicates, no correct answer included) -- Wire up the REST endpoints (`GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms`) -- Manually test the endpoints (curl or a REST client like Bruno/Insomnia) - -**Key concepts to understand before starting:** - -- Drizzle ORM query patterns (joins, where clauses) -- The repository pattern (data access separated from business logic) -- Zod schema definition and inference -- How pnpm workspace packages reference each other - ---- - -### Phase 2 — Singleplayer Quiz UI - -**Goal:** A user can complete a full 10-question quiz in the browser. - -**Done when:** User visits `/singleplayer`, answers 10 questions, sees a score screen, and can play again. - -**Broadly, what needs to happen:** - -- Build the `QuestionCard` component (prompt word + 4 answer buttons) -- Build the `OptionButton` component with three visual states: idle, correct, wrong -- Build the `ScoreScreen` component (score summary + play again) -- Implement a Zustand store to track quiz session state (current question index, score, whether an answer has been picked) -- Wire up TanStack Query to fetch terms from the API on mount -- Create the `/singleplayer` route and assemble the components -- Handle the between-question transition (brief delay showing result → next question) - -**Key concepts to understand before starting:** - -- TanStack Query: `useQuery`, loading/error states -- Zustand: defining a store, reading and writing state from components -- TanStack Router: defining routes, navigating between them -- React component composition -- Controlled state for the answer selection (which button is selected, when to lock input) - ---- - -### Phase 3 — UI Polish - 
-**Goal:** The app looks good enough to show to people. - -**Done when:** The quiz is usable on mobile, readable on desktop, and has a coherent visual style. - -**Broadly, what needs to happen:** - -- Apply Tailwind utility classes and shadcn/ui components consistently -- Make the layout mobile-first (touch-friendly buttons, readable font sizes) -- Add a simple landing page (`/`) with a "Start Quiz" button -- Add loading and error states for the API fetch -- Visual feedback on correct/wrong answers (colour, maybe a brief animation) -- Deck selection: let the user pick a deck from a list before starting - -**Key concepts to understand before starting:** - -- Tailwind CSS utility-first approach -- shadcn/ui component library and how to add components -- Responsive design with Tailwind breakpoints -- CSS transitions for simple animations - ---- - -## 11. Key Technical Decisions - -These are the non-obvious decisions already made. Any LLM helping with this project should be aware of them and not suggest alternatives without good reason. - -### Architecture - -**Express app: factory function pattern** -`app.ts` exports `createApp()`. `server.ts` imports it and calls `.listen()`. This keeps tests isolated — a test can import the app without starting a server. - -**Layered architecture: routes → services → repositories** -Business logic lives in services, not route handlers or repositories. Each layer only talks to the layer directly below it. For the MVP API, this means: - -- `routes/` — parse request, call service, return response -- `services/` — business logic (e.g. attaching distractors) -- `repositories/` — all DB queries live here, nowhere else - -**Shared Zod schemas in `packages/shared`** -All request/response shapes are defined once as Zod schemas in `packages/shared` and imported by both `apps/api` and `apps/web`. Types are inferred from schemas (`z.infer`), never written by hand. 
- -### Data Model - -**Decks separate from terms (not frequency-rank filtering)** -Terms are raw WordNet data. Decks are curated lists. This separation exists because WordNet frequency data is unreliable for learning — common chemical element symbols ranked highly, for example. Bad words are excluded at the deck level, not filtered from `terms`. - -**Deck language model: `source_language` + `validated_languages` array** -A deck is not tied to a single language pair. `source_language` is the language the wordlist was curated from. `validated_languages` is an array of target languages with full translation coverage — calculated and updated by the deck generation script on every run. - -### Tooling - -**Drizzle ORM (not Prisma):** No binary, no engine. Queries map closely to SQL. Works naturally with Zod. Migrations are plain SQL files. - -**`tsx` as TypeScript runner (not `ts-node`):** Faster, zero config, uses esbuild. Does not type-check — that is handled by `tsc` and the editor. - -**pnpm workspaces (not Turborepo):** Two apps don't need the extra build caching complexity. - ---- - -## 12. Post-MVP Ladder - -These phases are deferred but planned. The architecture already supports them. - -| Phase | What it adds | -| ----------------- | -------------------------------------------------------------- | -| Auth | OpenAuth (Google + GitHub), JWT middleware, user rows in DB | -| User Stats | Games played, score history, profile page | -| Multiplayer Lobby | Room creation, join by code, WebSocket connection | -| Multiplayer Game | Simultaneous answers, server timer, live scores, winner screen | -| Deployment | Docker Compose prod config, Nginx, Let's Encrypt, Hetzner VPS | -| Hardening | Rate limiting, error boundaries, CI/CD, DB backups | - -Each of these maps to a phase in the full `roadmap.md`. - ---- - -## 13. 
Definition of Done (MVP)

- [ ] `GET /api/decks/:id/terms` returns terms with correct distractors
- [ ] User can complete a 10-question quiz without errors
- [ ] Score screen shows final result and a play-again option
- [ ] App is usable on a mobile screen
- [ ] No hardcoded data — everything comes from the database
diff --git a/documentation/roadmap.md b/documentation/roadmap.md
deleted file mode 100644
index 2a43140..0000000
--- a/documentation/roadmap.md
+++ /dev/null
@@ -1,176 +0,0 @@
# Vocabulary Trainer — Roadmap

Each phase produces a working, deployable increment. Nothing is built speculatively.

## Phase 0 — Foundation

Goal: Empty repo that builds, lints, and runs end-to-end.
Done when: `pnpm dev` starts both apps; `GET /api/health` returns 200; React renders a hello page.

[x] Initialise pnpm workspace monorepo: `apps/web`, `apps/api`, `packages/shared`, `packages/db`
[x] Configure TypeScript project references across packages
[x] Set up ESLint + Prettier with shared configs in root
[x] Set up Vitest in `api`, `web`, and both packages
[x] Scaffold Express app with `GET /api/health`
[x] Scaffold Vite + React app with TanStack Router (single root route)
[x] Configure Drizzle ORM + connection to local PostgreSQL
[x] Write first migration (empty — just validates the pipeline works)
[x] `docker-compose.yml` for local dev: `api`, `web`, `postgres`, `valkey`
[x] `.env.example` files for `apps/api` and `apps/web`
[x] update decisions.md

## Phase 1 — Vocabulary Data

Goal: Word data lives in the DB and can be queried via the API.
Done when: `GET /api/decks/1/terms?limit=10` returns 10 terms from a specific deck.
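The repository contract behind that endpoint can be illustrated with an in-memory stand-in (the real version would be a Drizzle join over `deck_terms` and `terms`; all names here are illustrative):

```typescript
interface DeckTermRow { deck_id: string; term_id: string; }
interface TermRow { id: string; pos: string; }

// In-memory stand-in for DeckRepository.getTerms(deckId, limit, offset):
// join deck_terms to terms, scope by deck, then page the result.
function getTerms(
  deckTerms: DeckTermRow[],
  terms: TermRow[],
  deckId: string,
  limit: number,
  offset = 0,
): TermRow[] {
  const byId = new Map(terms.map((t) => [t.id, t] as [string, TermRow]));
  return deckTerms
    .filter((dt) => dt.deck_id === deckId)       // scope: deck membership
    .map((dt) => byId.get(dt.term_id))            // join to term rows
    .filter((t): t is TermRow => t !== undefined) // drop dangling references
    .slice(offset, offset + limit);               // paging
}
```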
- -[x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json` -[x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms` -[x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`) -[x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks) -[x] Download CEFR A1/A2 noun lists (from GitHub repos) -[x] Write `scripts/build_decks.ts` (reads external CEFR lists, matches to DB, creates decks) -[x] Run `pnpm db:seed` → populates terms -[x] Run `pnpm db:build-deck` → creates curated decks -[x] Define Zod response schemas in `packages/shared` -[x] Implement `DeckRepository.getTerms(deckId, limit, offset)` => no decks needed anymore -[x] Implement `QuizService.attachDistractors(terms)` — same POS, server-side, no duplicates -[x] Implement `GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms` endpoints => no language pairs, not needed anymore -[ ] Unit tests for `QuizService` (correct POS filtering, never includes the answer) -[ ] update decisions.md - -## Phase 2 — Auth - -Goal: Users can log in via Google or GitHub and stay logged in. -Done when: JWT from OpenAuth is validated by the API; protected routes redirect unauthenticated users; user row is created on first login. 
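The middleware task above can be sketched without committing to a JWT library; `verifyToken` below stands in for real signature and expiry checking (e.g. against OpenAuth's keys), and the request/response shapes are simplified:

```typescript
interface Req { headers: Record<string, string | undefined>; user?: { sub: string }; }
interface Res { status: (code: number) => Res; json: (body: unknown) => void; }
type Next = () => void;

// Stand-in for real verification (signature, expiry, issuer checks).
type VerifyToken = (token: string) => { sub: string } | null;

// Express-style middleware: extract the Bearer token, verify it,
// attach the user to the request, or reject with 401.
function makeAuthMiddleware(verifyToken: VerifyToken) {
  return (req: Req, res: Res, next: Next): void => {
    const header = req.headers["authorization"];
    if (!header?.startsWith("Bearer ")) {
      res.status(401).json({ error: "missing bearer token" });
      return;
    }
    const payload = verifyToken(header.slice("Bearer ".length));
    if (!payload) {
      res.status(401).json({ error: "invalid token" });
      return;
    }
    req.user = payload; // downstream handlers can upsert via the sub claim
    next();
  };
}
```

Injecting `verifyToken` keeps the middleware testable without a running auth service.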
- -[ ] Add OpenAuth service to `docker-compose.yml` -[ ] Write Drizzle schema: `users` (uuid `id`, text `openauth_sub`, no games_played/won columns) -[ ] Write and run migration (includes `updated_at` + triggers) -[ ] Implement JWT validation middleware in `apps/api` -[ ] Implement `GET /api/auth/me` (validate token, upsert user row via `openauth_sub`, return user) -[ ] Define auth Zod schemas in `packages/shared` -[ ] Frontend: login page with "Continue with Google" + "Continue with GitHub" buttons -[ ] Frontend: redirect to `auth.yourdomain.com` → receive JWT → store in memory + HttpOnly cookie -[ ] Frontend: TanStack Router auth guard (redirects unauthenticated users) -[ ] Frontend: TanStack Query `api.ts` attaches token to every request -[ ] Unit tests for JWT middleware -[ ] update decisions.md - -## Phase 3 — Single-player Mode - -Goal: A logged-in user can complete a full solo quiz session. -Done when: User sees 10 questions, picks answers, sees their final score. - -[ ] Frontend: `/singleplayer` route -[ ] `useQuizSession` hook: fetch terms, manage question index + score state -[ ] `QuestionCard` component: prompt word + 4 answer buttons -[ ] `OptionButton` component: idle / correct / wrong states -[ ] `ScoreScreen` component: final score + play-again button -[ ] TanStack Query integration for `GET /terms` -[ ] RTL tests for `QuestionCard` and `OptionButton` -[ ] update decisions.md - -## Phase 4 — Multiplayer Rooms (Lobby) - -Goal: Players can create and join rooms; the host sees all joined players in real time. -Done when: Two browser tabs can join the same room and see each other's display names update live via WebSocket. 
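A minimal in-memory sketch of the `RoomService` behaviour listed above; the code length and alphabet are assumptions, not decided values:

```typescript
interface Room { code: string; maxPlayers: number; players: string[]; }

// Short, uppercase join code; ambiguous characters (0/O, 1/I/L) omitted.
function generateRoomCode(length = 5): string {
  const alphabet = "ABCDEFGHJKMNPQRSTUVWXYZ23456789";
  let code = "";
  for (let i = 0; i < length; i++) {
    code += alphabet[Math.floor(Math.random() * alphabet.length)];
  }
  return code;
}

// Join a room, enforcing the max player limit and rejecting duplicate joins.
function joinRoom(room: Room, playerId: string): void {
  if (room.players.length >= room.maxPlayers) throw new Error("room is full");
  if (room.players.includes(playerId)) throw new Error("already joined");
  room.players.push(playerId);
}
```

In the real service the durable copy of this state lives in `room_players` (PostgreSQL) with the ephemeral copy in Valkey, per the checklist below.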
- -[ ] Write Drizzle schema: `rooms`, `room_players` (add `deck_id` FK to rooms) -[ ] Write and run migration (includes CHECK constraints: `code=UPPER(code)`, `status`, `max_players`) -[ ] Add indexes: `idx_rooms_host`, `idx_room_players_score` -[ ] `POST /rooms` and `POST /rooms/:code/join` REST endpoints -[ ] `RoomService`: create room with short code, join room, enforce max player limit -[ ] `POST /rooms` accepts `deck_id` (which vocabulary deck to use) -[ ] WebSocket server: attach `ws` upgrade handler to the Express HTTP server -[ ] WS auth middleware: validate OpenAuth JWT on upgrade -[ ] WS message router: dispatch incoming messages by `type` -[ ] `room:join` / `room:leave` handlers → broadcast `room:state` to all room members -[ ] Room membership tracked in Valkey (ephemeral) + `room_players` in PostgreSQL (durable) -[ ] Define all WS event Zod schemas in `packages/shared` -[ ] Frontend: `/multiplayer/lobby` — create room form + join-by-code form -[ ] Frontend: `/multiplayer/room/:code` — player list, room code display, "Start Game" (host only) -[ ] Frontend: `ws.ts` singleton WS client with reconnect on drop -[ ] Frontend: Zustand `gameStore` handles incoming `room:state` events -[ ] update decisions.md - -## Phase 5 — Multiplayer Game - -Goal: Host starts a game; all players answer simultaneously in real time; a winner is declared. -Done when: 2–4 players complete a 10-round game with correct live scores and a winner screen. 
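Round evaluation can be sketched independently of the WebSocket layer. This assumes the same +1-per-correct-answer scoring as singleplayer and treats a tie as multiple winners (one possible tie policy, not a decided one):

```typescript
interface RoundAnswer { playerId: string; optionId: string; }

// Score one round: +1 per correct answer. Players who never answer
// correctly simply stay absent from the map (score 0).
function evaluateRound(
  answers: RoundAnswer[],
  correctOptionId: string,
  scores: Map<string, number>,
): void {
  for (const a of answers) {
    if (a.optionId === correctOptionId) {
      scores.set(a.playerId, (scores.get(a.playerId) ?? 0) + 1);
    }
  }
}

// After the final round, everyone holding the top score wins.
function winners(scores: Map<string, number>): string[] {
  const top = Math.max(...scores.values());
  return [...scores.entries()].filter(([, s]) => s === top).map(([id]) => id);
}
```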
- -[ ] `GameService`: generate question sequence for a room, enforce server-side 15 s timer -[ ] `room:start` WS handler → begin question loop, broadcast first `game:question` -[ ] `game:answer` WS handler → collect per-player answers -[ ] On all-answered or timeout → evaluate, broadcast `game:answer_result` -[ ] After N rounds → broadcast `game:finished`, update `rooms.status` + `room_players.score` in DB (transactional) -[ ] Frontend: `/multiplayer/game/:code` route -[ ] Frontend: extend Zustand store with `currentQuestion`, `roundAnswers`, `scores` -[ ] Frontend: reuse `QuestionCard` + `OptionButton`; add countdown timer ring -[ ] Frontend: `ScoreBoard` component — live per-player scores after each round -[ ] Frontend: `GameFinished` screen — winner highlight, final scores, "Play Again" button -[ ] Unit tests for `GameService` (round evaluation, tie-breaking, timeout auto-advance) -[ ] update decisions.md - -## Phase 6 — Production Deployment - -Goal: App is live on Hetzner, accessible via HTTPS on all subdomains. -Done when: `https://app.yourdomain.com` loads; `wss://api.yourdomain.com` connects; auth flow works end-to-end. - -[ ] `docker-compose.prod.yml`: all services + `nginx-proxy` + `acme-companion` -[ ] Nginx config per container: `VIRTUAL_HOST` + `LETSENCRYPT_HOST` env vars -[ ] Production `.env` files on VPS (OpenAuth secrets, DB credentials, Valkey URL) -[ ] Drizzle migration runs on `api` container start (includes CHECK constraints + triggers) -[ ] Seed production DB (run `seed.ts` once) -[ ] Smoke test: login → solo game → create room → multiplayer game end-to-end -[ ] update decisions.md - -## Phase 7 — Polish & Hardening (post-MVP) - -Not required to ship, but address before real users arrive. 
[ ] Rate limiting on API endpoints (`express-rate-limit`)
[ ] Graceful WS reconnect with exponential back-off
[ ] React error boundaries
[ ] `GET /users/me/stats` endpoint (aggregates from `room_players`) + profile page
[ ] Accessibility pass (keyboard nav, ARIA on quiz buttons)
[ ] Favicon, page titles, Open Graph meta
[ ] CI/CD pipeline (GitHub Actions → SSH deploy on push to `main`)
[ ] Database backups (cron → Hetzner Object Storage)
[ ] update decisions.md

## Dependency Graph

```text
Phase 0 (Foundation)
└── Phase 1 (Vocabulary Data)
    └── Phase 2 (Auth)
        ├── Phase 3 (Singleplayer) ← parallel with Phase 4
        └── Phase 4 (Room Lobby)
            └── Phase 5 (Multiplayer Game)
                └── Phase 6 (Deployment)
```

---

## UI sketch

I was sketching the UI of the menu and came up with some questions.

This would be the flow to start a singleplayer game:

main menu => singleplayer, multiplayer, settings
singleplayer => language selection
"I speak English" => "I want to learn Italian" (both languages are dropdowns to select the fitting language)
language selection => category selection => pure grammar, media (practicing on song lyrics or Breaking Bad subtitles)
pure grammar => POS selection => nouns or verbs (in MVP)
nouns has 3 subcategories => singular (1-on-1 translation, dog => cane), plural (plural practice, cane => cani for example), gender/articles (il cane or la cane, for example)
verbs has 2 subcategories => infinitive (1-on-1 translation, to talk => parlare) or conjugations (the user is shown the infinitive and a table with all personal pronouns and has to fill in the gaps with the corresponding conjugations)
POS selection => difficulty selection (from A1 to C2)
afterwards, a start game button

---

This raises the questions:

- how to store the plural and articles of nouns in the database
- how to store the conjugations of verbs
- what about IPA?
- links to audio files to hear how a word is pronounced?
- one table per language and POS, e.g. `italian_verbs`, `french_nouns`, `german_adjectives`?
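One direction for the storage questions above, matching what the schema discussion later settles on (additive per-language form tables keyed by term, rather than one table per language such as `italian_verbs`), can be sketched as row shapes. All field names and values here are assumptions for illustration:

```typescript
// Noun morphology: one row per (term, language), not one table per language.
interface NounFormsRow {
  term_id: string;
  language_code: string;     // e.g. "it"
  gender: "m" | "f" | "n" | null;
  singular: string;          // "cane"
  plural: string;            // "cani"
  definite_article: string;  // "il"
}

// Verb conjugations, keyed by a person/tense label.
interface VerbFormsRow {
  term_id: string;
  language_code: string;
  infinitive: string;                   // "parlare"
  conjugations: Record<string, string>; // { "pres.1sg": "parlo", ... }
}

// IPA + audio link, answering the pronunciation questions.
interface TermPronunciationRow {
  term_id: string;
  language_code: string;
  ipa: string | null;       // "ˈkaːne"
  audio_url: string | null;
}

// Hypothetical sample rows (ids invented):
const cane: NounFormsRow = {
  term_id: "term-1",
  language_code: "it",
  gender: "m",
  singular: "cane",
  plural: "cani",
  definite_article: "il",
};

const parlare: VerbFormsRow = {
  term_id: "term-2",
  language_code: "it",
  infinitive: "parlare",
  conjugations: { "pres.1sg": "parlo" },
};
```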
diff --git a/documentation/schema_discussion.md b/documentation/schema_discussion.md
deleted file mode 100644
index 77b7d35..0000000
--- a/documentation/schema_discussion.md
+++ /dev/null
@@ -1,186 +0,0 @@
# Glossa — Schema & Architecture Discussion Summary

## Project Overview

A vocabulary trainer in the style of Duolingo (see a word, pick from 4 translations). Built as a monorepo with a Drizzle/Postgres data layer. Phase 1 (data pipeline) is complete; the API layer is next.

---

## Game Flow (MVP)

Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count (3 or 10) → game starts.

**Top-level categories (MVP):**

- **Grammar** — practice nouns, verb conjugations, etc.
- **Media** — practice vocabulary from specific books, films, songs, etc.

**Post-MVP categories (not in scope yet):**

- Animals, kitchen, and other thematic word groups

---

## Schema Decisions Made

### Deck model: `source_language` + `validated_languages` (not `pair_id`)

A deck is a curated pool of terms sourced from a specific language (e.g. an English frequency list). The language pair used for a quiz is chosen at session start, not at deck creation.

- `decks.source_language` — the language the wordlist was curated from
- `decks.validated_languages` — array of target language codes for which full translation coverage exists across all terms; recalculated on every generation script run
- Enforced via CHECK: `source_language` is never in `validated_languages`
- One deck serves en→it and en→fr without duplication

### Architecture: deck as curated pool (Option 2)

Three options were considered:

| Option             | Description                                              | Notes                                                    |
| ------------------ | -------------------------------------------------------- | -------------------------------------------------------- |
| 1. Pure filter     | No decks, query the whole terms table                    | No curatorial control; imported junk ends up in the game  |
| 2.
Deck as pool ✅ | Decks define scope, term metadata drives filtering | Clean separation of concerns | -| 3. Deck as preset | Deck encodes filter config (category + POS + difficulty) | Combinatorial explosion; can't reuse terms across decks | - -**Decision: Option 2.** Decks solve the curation problem (which terms are game-ready). Term metadata solves the filtering problem (which subset to show today). These are separate concerns and should stay separate. - -The quiz query joins `deck_terms` for scope, then filters by `pos`, `cefr_level`, and later `category` — all independently. - -### Missing from schema: `cefr_level` and categories - -The game flow requires filtering by difficulty and category, but neither is in the schema yet. - -**Difficulty (`cefr_level`):** - -- Belongs on `terms`, not on `decks` -- Add as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`) -- Add now (nullable), populate later — backfilling a full terms table post-MVP is costly - -**Categories:** - -- Separate `categories` table + `term_categories` join table -- Do not use an enum or array on `terms` — a term can belong to multiple categories, and new categories should not require migrations - -```sql -categories: id, slug, label, created_at -term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id) -``` - -### Deck scope: wordlists, not POS splits - -**Rejected approach:** one deck per POS (e.g. `en-nouns`, `en-verbs`). Problem: POS is already a filterable column on `terms`, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication logic in the generation script. - -**Decision:** one deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary. 
The user never sees or picks a deck — they pick "Nouns, B1" and the app resolves that to the right deck + filters. - -### Deck progression: tiered frequency lists - -When a user exhausts a deck, the app expands scope by adding the next tier: - -```sql -WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000') - AND t.pos = 'noun' - AND t.cefr_level = 'B1' -``` - -Requirements for this to work cleanly: - -- Decks must not overlap — each word appears in exactly one tier -- The generation script already deduplicates, so this is enforced at import time -- Unlocking logic (when to add the next deck) lives in user learning state, not in the deck structure — for MVP, query all tiers at once or hardcode active decks - -### Wordlist source: SUBTLEX (not manual curation) - -**Problem:** the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what's actually common in Italian. - -**Decision:** use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology — making them comparable across languages. - -This maps directly onto `decks.source_language`: - -- `en-core-1000` — built from SUBTLEX-EN, used when the user's source language is English -- `it-core-1000` — built from SUBTLEX-IT, used when the source language is Italian - -When the user picks en→it, the app queries `en-core-1000`. When they pick it→en, it queries `it-core-1000`. Same translation data, correctly frequency-grounded per direction. Two wordlist files, two generation script runs — the schema already supports this. - -### Missing from schema: user learning state - -The current schema has no concept of a user's progress. 
Not blocking for the API layer right now, but will be needed before the game loop is functional: - -- `user_decks` — which decks a user is studying -- `user_term_progress` — per `(user_id, term_id, language_pair)`: `next_review_at`, `interval_days`, correct/attempt counts for spaced repetition -- `quiz_answers` — optional history log for stats and debugging - -### `synset_id`: make nullable, don't remove - -`synset_id` is the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It should stay. - -**Problem:** non-WordNet terms (custom words added later) won't have a synset ID, so `NOT NULL` is too strict. - -**Decision:** make `synset_id` nullable. Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are not considered equal), so no constraint changes are needed beyond dropping `notNull()`. - -For extra defensiveness, a partial unique index can be added later: - -```sql -CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL; -``` - ---- - -## Open Questions / Deferred - -- **User learning state** — not needed for the API layer but must be designed before the game loop ships -- **Distractors** — generated at query time (random same-POS terms from the same deck); no schema needed -- **`cefr_level` data source** — WordNet frequency data was already found to be unreliable; external CEFR lists (Oxford 3000, SUBTLEX) will be needed to populate this field - ---- - -### Open: semantic category metadata source - -Categories (`animals`, `kitchen`, etc.) are in the schema but empty for MVP. -Grammar and Media work without them (Grammar = POS filter, Media = deck membership). -Needs research before populating `term_categories`. Options: - -**Option 1: WordNet domain labels** -Already in OMW, extractable in the existing pipeline. Free, no extra dependency. -Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals"). 
 - -**Option 2: WordNet Domains** -Separate project built on WordNet. ~200 hierarchical domains mapped to synsets. More structured -and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1. - -**Option 3: Kelly Project** -Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, -multiple languages. Could solve frequency tiers (cefr_level) and semantic categories in one shot. -Investigate coverage for your languages and POS range first. - -**Option 4: BabelNet / Wikidata** -Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories. -Problem: complex integration, BabelNet has commercial licensing restrictions, Wikidata category -trees are deep and noisy. - -**Option 5: LLM-assisted categorization** -Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import. -Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible -without saving output. Good fallback if structured sources have insufficient coverage. - -**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill** -Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass. -Combines automation with control. Likely the most practical approach. - -**Option 7: Manual curation** -Flat file mapping synset IDs to your own category slugs. Full control, matches UI exactly. -Too expensive at scale — only viable for small curated additions on top of an automated baseline. - -**Current recommendation:** research Kelly Project first. If coverage is insufficient, go with Option 6.
 - --- -### Implementation roadmap - -- [x] Finalize data model -- [x] Write and run migrations -- [x] Fill the database (expand import pipeline) -- [ ] Decide SUBTLEX → cefr_level mapping strategy -- [ ] Generate decks -- [ ] Finalize game selection flow -- [ ] Define Zod schemas in packages/shared -- [ ] Implement API diff --git a/documentation/spec.md b/documentation/spec.md index 2461c76..a358d62 100644 --- a/documentation/spec.md +++ b/documentation/spec.md @@ -1,518 +1,366 @@ -# Vocabulary Trainer — Project Specification +# Glossa — Project Specification -## 1. Overview +> **This document is the single source of truth for the project.** +> It is written to be handed to any LLM as context. It contains the project vision, the current MVP scope, the tech stack, the architecture, and the roadmap. -A multiplayer English–Italian vocabulary trainer with a Duolingo-style quiz interface (one word prompt, four answer choices). Supports both single-player practice and real-time competitive multiplayer rooms of 2–4 players. Designed from the ground up to be language-pair agnostic. +--- + +## 1. Project Overview + +A vocabulary trainer for English–Italian words. The quiz format is Duolingo-style: one word is shown as a prompt, and the user picks the correct translation from four choices (1 correct + 3 distractors of the same part-of-speech). The long-term vision is a multiplayer competitive game, but the MVP is a polished singleplayer experience. + +**The core learning loop:** +Show word → pick answer → see result → next word → final score + +The vocabulary data comes from WordNet + the Open Multilingual Wordnet (OMW). A one-time Python script extracts English–Italian noun pairs and seeds the database. The data model is language-pair agnostic by design — adding a new language later requires no schema changes.
### Core Principles -- **Minimal but extendable**: Working product fast, clean architecture for future growth -- **Mobile-first**: Touch-friendly Duolingo-like UX +- **Minimal but extendable**: working product fast, clean architecture for future growth +- **Mobile-first**: touch-friendly Duolingo-like UX - **Type safety end-to-end**: TypeScript + Zod schemas shared between frontend and backend --- -## 2. Technology Stack +## 2. Full Product Vision (Long-Term) -| Layer | Technology | -| -------------------- | ----------------------------- | -| Monorepo | pnpm workspaces | -| Frontend | React 18, Vite, TypeScript | -| Routing | TanStack Router | -| Server state | TanStack Query | -| Client state | Zustand | -| Styling | Tailwind CSS + shadcn/ui | -| Backend | Node.js, Express, TypeScript | -| Realtime | WebSockets (`ws` library) | -| Database | PostgreSQL 18 | -| ORM | Drizzle ORM | -| Cache / Queue | Valkey 9 | -| Auth | OpenAuth (Google + GitHub) | -| Validation | Zod (shared schemas) | -| Testing | Vitest, React Testing Library | -| Linting / Formatting | ESLint, Prettier | -| Containerisation | Docker, Docker Compose | -| Hosting | Hetzner VPS | +- Users log in via Google or GitHub (OpenAuth) +- Singleplayer mode: 10-round quiz, score screen +- Multiplayer mode: create a room, share a code, 2–4 players answer simultaneously in real time, live scores, winner screen +- 1000+ English–Italian nouns seeded from WordNet -### Why `ws` over Socket.io - -`ws` is the raw WebSocket library. For rooms of 2–4 players there is no need for Socket.io's transport fallbacks or room-management abstractions. The protocol is defined explicitly in `packages/shared`, which gives the same guarantees without the overhead. - -### Why Valkey - -Valkey stores ephemeral room state that does not need to survive a server restart. It keeps the PostgreSQL schema clean and makes room lookups O(1). 
- -### Why pnpm workspaces without Turborepo - -Turborepo adds parallel task running and build caching on top of pnpm workspaces. For a two-app monorepo of this size, the plain pnpm workspace commands (`pnpm -r run build`, `pnpm --filter`) are sufficient and there is one less tool to configure and maintain. +This is the full vision. The MVP deliberately ignores most of it. --- -## 3. Repository Structure +## 3. MVP Scope -```tree +**Goal:** A working, presentable singleplayer quiz that can be shown to real people. + +### What is IN the MVP + +- Vocabulary data in a PostgreSQL database (already seeded) +- REST API that returns quiz terms with distractors +- Singleplayer quiz UI: configurable rounds (3 or 10), answer feedback, score screen +- Clean, mobile-friendly UI (Tailwind + shadcn/ui) +- Global error handler with typed error classes +- Unit + integration tests for the API +- Local dev only (no deployment for MVP) + +### What is CUT from the MVP + +| Feature | Why cut | +| ------------------------------- | -------------------------------------- | +| Authentication (OpenAuth) | No user accounts needed for a demo | +| Multiplayer (WebSockets, rooms) | Core quiz works without it | +| Valkey / Redis cache | Only needed for multiplayer room state | +| Deployment to Hetzner | Ship to people locally first | +| User stats / profiles | Needs auth | + +These are not deleted from the plan — they are deferred. The architecture is already designed to support them. See Section 11 (Post-MVP Ladder). + +--- + +## 4. Technology Stack + +The monorepo structure and tooling are already set up. This is the full stack — the MVP uses a subset of it. + +| Layer | Technology | MVP? 
| +| ------------ | ------------------------------ | ----------- | +| Monorepo | pnpm workspaces | ✅ | +| Frontend | React 18, Vite, TypeScript | ✅ | +| Routing | TanStack Router | ✅ | +| Server state | TanStack Query | ✅ | +| Client state | Zustand | ✅ | +| Styling | Tailwind CSS + shadcn/ui | ✅ | +| Backend | Node.js, Express, TypeScript | ✅ | +| Database | PostgreSQL + Drizzle ORM | ✅ | +| Validation | Zod (shared schemas) | ✅ | +| Testing | Vitest, supertest | ✅ | +| Auth | OpenAuth (Google + GitHub) | ❌ post-MVP | +| Realtime | WebSockets (`ws` library) | ❌ post-MVP | +| Cache | Valkey | ❌ post-MVP | +| Deployment | Docker Compose, Hetzner, Nginx | ❌ post-MVP | + +--- + +## 5. Repository Structure + +```text vocab-trainer/ ├── apps/ -│ ├── web/ # React SPA (Vite + TanStack Router) -│ │ ├── src/ -│ │ │ ├── routes/ -│ │ │ ├── components/ -│ │ │ ├── stores/ # Zustand stores -│ │ │ └── lib/ -│ │ └── Dockerfile -│ └── api/ # Express REST + WebSocket server -│ ├── src/ -│ │ ├── routes/ -│ │ ├── services/ -│ │ ├── repositories/ -│ │ └── websocket/ -│ └── Dockerfile +│ ├── api/ +│ │ └── src/ +│ │ ├── app.ts — createApp() factory, express.json(), error middleware +│ │ ├── server.ts — starts server on PORT +│ │ ├── errors/ +│ │ │ └── AppError.ts — AppError, ValidationError, NotFoundError +│ │ ├── middleware/ +│ │ │ └── errorHandler.ts — central error middleware +│ │ ├── routes/ +│ │ │ ├── apiRouter.ts — mounts /health and /game routers +│ │ │ ├── gameRouter.ts — POST /start, POST /answer +│ │ │ └── healthRouter.ts +│ │ ├── controllers/ +│ │ │ └── gameController.ts — validates input, calls service, sends response +│ │ ├── services/ +│ │ │ ├── gameService.ts — builds quiz sessions, evaluates answers +│ │ │ └── gameService.test.ts — unit tests (mocked DB) +│ │ └── gameSessionStore/ +│ │ ├── GameSessionStore.ts — interface (async, Valkey-ready) +│ │ ├── InMemoryGameSessionStore.ts +│ │ └── index.ts +│ └── web/ +│ └── src/ +│ ├── routes/ +│ │ ├── index.tsx — landing page +│ 
│ └── play.tsx — the quiz +│ ├── components/ +│ │ └── game/ +│ │ ├── GameSetup.tsx — settings UI +│ │ ├── QuestionCard.tsx — prompt + 4 options +│ │ ├── OptionButton.tsx — idle / correct / wrong states +│ │ └── ScoreScreen.tsx — final score + play again +│ └── main.tsx ├── packages/ -│ ├── shared/ # Zod schemas, TypeScript types, constants -│ └── db/ # Drizzle schema, migrations, seed script -├── scripts/ -| ├── datafiles/ -│ | └── en-it-nouns.json -│ └── extract-en-it-nouns.py # One-time WordNet + OMW extraction → seed.json +│ ├── shared/ +│ │ └── src/ +│ │ ├── constants.ts — SUPPORTED_POS, DIFFICULTY_LEVELS, etc. +│ │ ├── schemas/game.ts — Zod schemas for all game types +│ │ └── index.ts +│ └── db/ +│ ├── drizzle/ — migration SQL files +│ └── src/ +│ ├── db/schema.ts — Drizzle schema +│ ├── models/termModel.ts — getGameTerms(), getDistractors() +│ ├── seeding-datafiles.ts — seeds terms + translations from JSON +│ ├── seeding-cefr-levels.ts — enriches translations with CEFR data +│ ├── generating-deck.ts — builds curated decks +│ └── index.ts +├── scripts/ — Python extraction/comparison/merge scripts +├── documentation/ — project docs ├── docker-compose.yml -├── docker-compose.prod.yml -├── pnpm-workspace.yaml -└── package.json +└── pnpm-workspace.yaml ``` -`packages/shared` is the contract between frontend and backend. All request/response shapes and WebSocket event payloads are defined there as Zod schemas and inferred TypeScript types — never duplicated. - -### pnpm workspace config - -`pnpm-workspace.yaml` declares: - -```yaml -packages: - - 'apps/*' - - 'packages/*' -``` - -### Root scripts - -The root `package.json` defines convenience scripts that delegate to workspaces: - -- `dev` — starts `api` and `web` in parallel -- `build` — builds all packages in dependency order -- `test` — runs Vitest across all workspaces -- `lint` — runs ESLint across all workspaces - -For parallel dev, use `concurrently` or just two terminal tabs for MVP. 
+`packages/shared` is the contract between frontend and backend. All request/response shapes are defined there as Zod schemas — never duplicated. --- -## 4. Architecture — N-Tier / Layered +## 6. Architecture + +### The Layered Architecture ```text -┌────────────────────────────────────┐ -│ Presentation (React SPA) │ apps/web -├────────────────────────────────────┤ -│ API / Transport │ HTTP REST + WebSocket -├────────────────────────────────────┤ -│ Application (Controllers) │ apps/api/src/routes -│ Domain (Business logic) │ apps/api/src/services -│ Data Access (Repositories) │ apps/api/src/repositories -├────────────────────────────────────┤ -│ Database (PostgreSQL via Drizzle) │ packages/db -│ Cache (Valkey) │ apps/api/src/lib/valkey.ts -└────────────────────────────────────┘ +HTTP Request + ↓ + Router — maps URL + HTTP method to a controller + ↓ + Controller — handles HTTP only: validates input, calls service, sends response + ↓ + Service — business logic only: no HTTP, no direct DB access + ↓ + Model — database queries only: no business logic + ↓ + Database ``` -Each layer only communicates with the layer directly below it. Business logic lives in services, not in route handlers or repositories. +**The rule:** each layer only talks to the layer directly below it. A controller never touches the database. A service never reads `req.body`. A model never knows what a quiz is. + +### Monorepo Package Responsibilities + +| Package | Owns | +| ----------------- | -------------------------------------------------------- | +| `packages/shared` | Zod schemas, constants, derived TypeScript types | +| `packages/db` | Drizzle schema, DB connection, all model/query functions | +| `apps/api` | Router, controllers, services, error handling | +| `apps/web` | React frontend, consumes types from shared | + +**Key principle:** all database code lives in `packages/db`. `apps/api` never imports `drizzle-orm` for queries — it only calls functions exported from `packages/db`. 
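The layering rule can be made concrete with a dependency-free sketch. `getGameTerms` is the model function named elsewhere in this spec; `buildQuiz` and `startGameController` are hypothetical stand-ins for the real service and controller, stubbed so the call chain is visible without Express or Drizzle:

```typescript
// Model layer (packages/db): database access only, knows nothing about quizzes.
// Stubbed with fixed rows in place of a real Drizzle query.
type TermRow = { id: string; text: string };
async function getGameTerms(limit: number): Promise<TermRow[]> {
  const rows = [
    { id: "t1", text: "cane" },
    { id: "t2", text: "gatto" },
    { id: "t3", text: "pane" },
  ];
  return rows.slice(0, limit);
}

// Service layer (apps/api): business logic only. No HTTP, no direct DB access,
// just calls into the model layer.
async function buildQuiz(rounds: number) {
  const terms = await getGameTerms(rounds);
  return { questions: terms.map((t) => ({ prompt: t.text })) };
}

// Controller layer (apps/api): HTTP concerns only. Validates input (a simplified
// stand-in for the real Zod safeParse), calls the service, shapes the response.
async function startGameController(
  body: unknown
): Promise<{ status: number; json: unknown }> {
  const rounds = (body as { rounds?: number }).rounds;
  if (rounds !== 3 && rounds !== 10) {
    return { status: 400, json: { error: "rounds must be 3 or 10" } };
  }
  return { status: 200, json: await buildQuiz(rounds) };
}
```

Each function only calls the layer directly below it; the controller never sees a `TermRow`, and the model never sees a request body.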
--- -## 5. Infrastructure +## 7. Data Model (Current State) -### Domain structure +Words are modelled as language-neutral concepts (terms) separate from learning curricula (decks). Adding a new language pair requires no schema changes — only new rows in `translations`, `decks`. -| Subdomain | Service | -| --------------------- | ----------------------- | -| `app.yourdomain.com` | React frontend | -| `api.yourdomain.com` | Express API + WebSocket | -| `auth.yourdomain.com` | OpenAuth service | +**Core tables:** `terms`, `translations`, `term_glosses`, `decks`, `deck_terms`, `categories`, `term_categories` -### Docker Compose services (production) +Key columns on `terms`: `id` (uuid), `pos` (CHECK-constrained), `source`, `source_id` (unique pair for idempotent imports) -| Container | Role | -| ---------------- | ------------------------------------------- | -| `postgres` | PostgreSQL 16, named volume | -| `valkey` | Valkey 8, ephemeral (no persistence needed) | -| `openauth` | OpenAuth service | -| `api` | Express + WS server | -| `web` | Nginx serving the Vite build | -| `nginx-proxy` | Automatic reverse proxy | -| `acme-companion` | Let's Encrypt certificate automation | +Key columns on `translations`: `id`, `term_id` (FK), `language_code` (CHECK-constrained), `text`, `cefr_level` (nullable varchar(2), CHECK A1–C2) -```docker -nginx-proxy (:80/:443) - app.domain → web:80 - api.domain → api:3000 (HTTP + WS upgrade) - auth.domain → openauth:3001 -``` +Deck model uses `source_language` + `validated_languages` array — one deck serves multiple target languages. Decks are frequency tiers (e.g. `en-core-1000`), not POS splits. -SSL is fully automatic via `nginx-proxy` + `acme-companion`. No manual Certbot needed. - -### 5.1 Valkey Key Structure - -Ephemeral room state is stored in Valkey with TTL (e.g., 1 hour). -PostgreSQL stores durable history only. 
- -Key Format: `room:{code}:{field}` - -| Key | Type | TTL | Description | - -|------------------------------|---------|-------|-------------| -| `room:{code}:state` | Hash | 1h | Current question index, round status | -| `room:{code}:players` | Set | 1h | List of connected user IDs | -| `room:{code}:answers:{round}`| Hash | 15m | Temp storage for current round answers | - -Recovery Strategy -If server crashes mid-game, Valkey data is lost. -PostgreSQL `room_players.score` remains 0. -Room status is reset to `finished` via startup health check if `updated_at` is stale. +Full schema is in `packages/db/src/db/schema.ts`. --- -## 6. Data Model +## 8. API -## Design principle - -Words are modelled as language-neutral concepts (terms) separate from learning curricula (decks). -Adding a new language pair requires no schema changes — only new rows in `translations`, `decks`, and `language_pairs`. - -## Core tables - -terms -id uuid PK -synset_id text UNIQUE -- OMW ILI (e.g. "ili:i12345") -pos varchar(20) -- NOT NULL, CHECK (pos IN ('noun', 'verb', 'adjective', 'adverb')) -created_at timestamptz DEFAULT now() --- REMOVED: frequency_rank (handled at deck level) - -translations -id uuid PK -term_id uuid FK → terms.id -language_code varchar(10) -- NOT NULL, BCP 47: "en", "it" -text text -- NOT NULL -created_at timestamptz DEFAULT now() -UNIQUE (term_id, language_code, text) -- Allow synonyms, prevent exact duplicates - -term_glosses -id uuid PK -term_id uuid FK → terms.id -language_code varchar(10) -- NOT NULL -text text -- NOT NULL -created_at timestamptz DEFAULT now() - -language_pairs -id uuid PK -source varchar(10) -- NOT NULL -target varchar(10) -- NOT NULL -label text -active boolean DEFAULT true -UNIQUE (source, target) - -decks -id uuid PK -name text -- NOT NULL (e.g. 
"A1 Italian Nouns", "Most Common 1000") -description text -- NULLABLE -pair_id uuid FK → language_pairs.id -- NULLABLE (for single-language or multi-pair decks) -created_by uuid FK → users.id -- NULLABLE (for system decks) -is_public boolean DEFAULT true -created_at timestamptz DEFAULT now() - -deck_terms -deck_id uuid FK → decks.id -term_id uuid FK → terms.id -position smallint -- NOT NULL, ordering within deck (1, 2, 3...) -added_at timestamptz DEFAULT now() -PRIMARY KEY (deck_id, term_id) - -users -id uuid PK -- Internal stable ID (FK target) -openauth_sub text UNIQUE -- NOT NULL, OpenAuth `sub` claim (e.g. "google|12345") -email varchar(255) UNIQUE -- NULLABLE (GitHub users may lack email) -display_name varchar(100) -created_at timestamptz DEFAULT now() -last_login_at timestamptz --- REMOVED: games_played, games_won (derive from room_players) - -rooms -id uuid PK -code varchar(8) UNIQUE -- NOT NULL, CHECK (code = UPPER(code)) -host_id uuid FK → users.id -pair_id uuid FK → language_pairs.id -deck_id uuid FK → decks.id -- Which vocabulary deck this room uses -status varchar(20) -- NOT NULL, CHECK (status IN ('waiting', 'in_progress', 'finished')) -max_players smallint -- NOT NULL, DEFAULT 4, CHECK (max_players BETWEEN 2 AND 10) -round_count smallint -- NOT NULL, DEFAULT 10, CHECK (round_count BETWEEN 5 AND 20) -created_at timestamptz DEFAULT now() -updated_at timestamptz DEFAULT now() -- For stale room recovery - -room_players -room_id uuid FK → rooms.id -user_id uuid FK → users.id -score integer DEFAULT 0 -- Final score only (written at game end) -joined_at timestamptz DEFAULT now() -left_at timestamptz -- Populated on WS disconnect/leave -PRIMARY KEY (room_id, user_id) - -Indexes --- Vocabulary -CREATE INDEX idx_terms_pos ON terms (pos); -CREATE INDEX idx_translations_lang ON translations (language_code, term_id); - --- Decks -CREATE INDEX idx_decks_pair ON decks (pair_id, is_public); -CREATE INDEX idx_decks_creator ON decks (created_by); -CREATE INDEX 
idx_deck_terms_term ON deck_terms (term_id); - --- Language Pairs -CREATE INDEX idx_pairs_active ON language_pairs (active, source, target); - --- Rooms -CREATE INDEX idx_rooms_status ON rooms (status); -CREATE INDEX idx_rooms_host ON rooms (host_id); --- NOTE: idx_rooms_code omitted (UNIQUE constraint creates index automatically) - --- Room Players -CREATE INDEX idx_room_players_user ON room_players (user_id); -CREATE INDEX idx_room_players_score ON room_players (room_id, score DESC); - -Repository Logic Note -`DeckRepository.getTerms(deckId, limit, offset)` fetches terms from a specific deck. -Query uses `deck_terms.position` for ordering. -For random practice within a deck: `WHERE deck_id = X ORDER BY random() LIMIT N` -(safe because deck is bounded, e.g., 500 terms max, not full table). - ---- - -## 7. Vocabulary Data — WordNet + OMW - -### Source - -Open Multilingual Wordnet (OMW) — English & Italian nouns via Interlingual Index (ILI) -External CEFR lists — For deck curation (e.g. GitHub: ecom/cefr-lists) - -### Extraction process - -1. Run `extract-en-it-nouns.py` once locally using `wn` library - - Imports ALL bilingual noun synsets (no frequency filtering) - - Output: `datafiles/en-it-nouns.json` — committed to repo -2. Run `pnpm db:seed` — populates `terms` + `translations` tables from JSON -3. Run `pnpm db:build-decks` — matches external CEFR lists to DB terms, creates `decks` + `deck_terms` - -### Benefits of deck-based approach - -- WordNet frequency data is unreliable (e.g. chemical symbols rank high) -- Curricula can come from external sources (CEFR, Oxford 3000, SUBTLEX) -- Bad data excluded at deck level, not schema level -- Users can create custom decks later -- Multiple difficulty levels without schema changes - -`terms.synset_id` stores the OMW ILI (e.g. `ili:i12345`) for traceability and future re-imports with additional languages. - ---- - -## 8. 
Authentication — OpenAuth - -All auth is delegated to the OpenAuth service at `auth.yourdomain.com`. Providers: Google, GitHub. - -The API validates the JWT from OpenAuth on every protected request. User rows are created or updated on first login via the `sub` claim as the primary key. - -**Auth endpoint on the API:** - -| Method | Path | Description | -| ------ | -------------- | --------------------------- | -| GET | `/api/auth/me` | Validate token, return user | - -All other auth flows (login, callback, token refresh) are handled entirely by OpenAuth — the frontend redirects to `auth.yourdomain.com` and receives a JWT back. - ---- - -## 9. REST API - -All endpoints prefixed `/api`. Request and response bodies validated with Zod on both sides using schemas from `packages/shared`. - -### Vocabulary - -| Method | Path | Description | -| ------ | ---------------------------- | --------------------------------- | -| GET | `/language-pairs` | List active language pairs | -| GET | `/terms?pair=en-it&limit=10` | Fetch quiz terms with distractors | - -### Rooms - -| Method | Path | Description | -| ------ | ------------------- | ----------------------------------- | -| POST | `/rooms` | Create a room → returns room + code | -| GET | `/rooms/:code` | Get current room state | -| POST | `/rooms/:code/join` | Join a room | - -### Users - -| Method | Path | Description | -| ------ | ----------------- | ---------------------- | -| GET | `/users/me` | Current user profile | -| GET | `/users/me/stats` | Games played, win rate | - ---- - -## 10. WebSocket Protocol - -One WS connection per client. Authenticated by passing the OpenAuth JWT as a query param on the upgrade request: `wss://api.yourdomain.com?token=...`. - -All messages are JSON: `{ type: string, payload: unknown }`. The full set of types is a Zod discriminated union in `packages/shared` — both sides validate every message they receive. 
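A dependency-free sketch of that contract: the real project defines it as a Zod discriminated union in `packages/shared`, but a hand-rolled type guard shows the same shape without the library. The message names mirror the client-to-server table in this section; the validation details are illustrative only:

```typescript
// The client-to-server protocol as a TypeScript discriminated union.
type ClientMessage =
  | { type: "room:join"; payload: { code: string } }
  | { type: "room:leave" }
  | { type: "room:start" }
  | { type: "game:answer"; payload: { questionId: string; answerId: string } };

// Parse and validate one incoming frame. Returns null for anything that does
// not match the contract, mirroring a failed Zod parse.
function parseClientMessage(raw: string): ClientMessage | null {
  let msg: any;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  switch (msg?.type) {
    case "room:join":
      return typeof msg.payload?.code === "string" ? msg : null;
    case "room:leave":
    case "room:start":
      return { type: msg.type };
    case "game:answer":
      return typeof msg.payload?.questionId === "string" &&
        typeof msg.payload?.answerId === "string"
        ? msg
        : null;
    default:
      return null; // unknown message types are rejected
  }
}
```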
- -### Client → Server - -| type | payload | Description | -| ------------- | -------------------------- | -------------------------------- | -| `room:join` | `{ code }` | Subscribe to a room's WS channel | -| `room:leave` | — | Unsubscribe | -| `room:start` | — | Host starts the game | -| `game:answer` | `{ questionId, answerId }` | Player submits an answer | - -### Server → Client - -| type | payload | Description | -| -------------------- | -------------------------------------------------- | ----------------------------------------- | -| `room:state` | Full room snapshot | Sent on join and on any player join/leave | -| `game:question` | `{ id, prompt, options[], timeLimit }` | New question broadcast to all players | -| `game:answer_result` | `{ questionId, correct, correctAnswerId, scores }` | Broadcast after all answer or timeout | -| `game:finished` | `{ scores[], winner }` | End of game summary | -| `error` | `{ message }` | Protocol or validation error | - -### Multiplayer game mechanic — simultaneous answers - -All players see the same question at the same time. Everyone submits independently. The server waits until all players have answered **or** the 15-second timeout fires — then broadcasts `game:answer_result` with updated scores. There is no buzz-first mechanic. This keeps the experience Duolingo-like and symmetric. 
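The wait-for-all-answers-or-timeout step can be sketched as a small helper. `collectAnswers` is a hypothetical name, not the real server code; it only shows the race between "everyone answered" and the server timer:

```typescript
// Collect one round of answers. The returned promise resolves either when
// every player has submitted or when the timeout fires, whichever is first.
function collectAnswers(playerIds: string[], timeoutMs: number) {
  const answers = new Map<string, string>();
  let settle!: (a: Map<string, string>) => void;
  const done = new Promise<Map<string, string>>((resolve) => (settle = resolve));

  // Timeout path: broadcast with whatever answers arrived in time.
  const timer = setTimeout(() => settle(answers), timeoutMs);

  return {
    submit(playerId: string, answerId: string) {
      // Ignore unknown players and duplicate submissions.
      if (!playerIds.includes(playerId) || answers.has(playerId)) return;
      answers.set(playerId, answerId);
      if (answers.size === playerIds.length) {
        clearTimeout(timer); // everyone answered early
        settle(answers);
      }
    },
    done, // broadcast game:answer_result once this resolves
  };
}
```

With a 15-second timer this gives the symmetric, Duolingo-like round described above: nobody can be "beaten to the buzzer", and slow players simply score nothing for that round.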
- -### Game flow +### Endpoints ```text -host creates room (REST) → -players join via room code (REST + WS room:join) → -room:state broadcasts player list → -host sends room:start → -server broadcasts game:question → -players send game:answer → -server collects all answers or waits for timeout → -server broadcasts game:answer_result → -repeat for N rounds → -server broadcasts game:finished +POST /api/v1/game/start GameRequest → GameSession +POST /api/v1/game/answer AnswerSubmission → AnswerResult +GET /api/v1/health Health check ``` -### Room state in Valkey +### Schemas (packages/shared) -Active room state (connected players, current question, answers received this round) is stored in Valkey with a TTL. PostgreSQL holds the durable record (`rooms`, `room_players`). On server restart, in-progress games are considered abandoned — acceptable for MVP. +**GameRequest:** `{ source_language, target_language, pos, difficulty, rounds }` +**GameSession:** `{ sessionId: uuid, questions: GameQuestion[] }` +**GameQuestion:** `{ questionId: uuid, prompt: string, gloss: string | null, options: AnswerOption[4] }` +**AnswerOption:** `{ optionId: number (0-3), text: string }` +**AnswerSubmission:** `{ sessionId: uuid, questionId: uuid, selectedOptionId: number (0-3) }` +**AnswerResult:** `{ questionId: uuid, isCorrect: boolean, correctOptionId: number (0-3), selectedOptionId: number (0-3) }` + +### Error Handling + +Typed error classes (`AppError` base, `ValidationError` 400, `NotFoundError` 404) with central error middleware. Controllers validate with `safeParse`, throw on failure, and call `next(error)` in the catch. The middleware maps `AppError` instances to HTTP status codes; unknown errors return 500. 
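A minimal sketch of this pattern. The class names match the spec (`AppError`, `ValidationError`, `NotFoundError`); `toHttpError` is a hypothetical stand-in for the Express error middleware so the mapping is testable without a server:

```typescript
// Base class: every expected failure carries its own HTTP status code.
class AppError extends Error {
  constructor(message: string, readonly statusCode: number) {
    super(message);
  }
}

class ValidationError extends AppError {
  constructor(message: string) {
    super(message, 400);
  }
}

class NotFoundError extends AppError {
  constructor(message: string) {
    super(message, 404);
  }
}

// Central mapping, as done by the error middleware: known AppErrors keep
// their status; anything unexpected becomes an opaque 500.
function toHttpError(err: unknown): { status: number; body: { error: string } } {
  if (err instanceof AppError) {
    return { status: err.statusCode, body: { error: err.message } };
  }
  return { status: 500, body: { error: "Internal server error" } };
}
```

In the real app the same mapping lives in a four-argument Express error handler, reached via `next(error)` from the controllers.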
+
+### Key Design Rules
+
+- Server-side answer evaluation: the correct answer is never sent to the frontend
+- `POST`, not `GET`, for game start (the configuration travels in the request body)
+- `safeParse` over `parse` (clean 400s, not raw Zod 500s)
+- Session state stored in `GameSessionStore` (in-memory now, Valkey later)

---

-## 11. Game Mechanics
+## 9. Game Mechanics

-- **Question format**: source-language word prompt + 4 target-language choices (1 correct + 3 distractors of the same POS)
-- **Distractors**: generated server-side, never include the correct answer, never repeat within a session
-- **Scoring**: +1 point per correct answer. Speed bonus is out of scope for MVP.
-- **Timer**: 15 seconds per question, server-authoritative
-- **Single-player**: uses `GET /terms` and runs entirely client-side. No WebSocket.
+- **Format**: source-language word prompt + 4 target-language choices
+- **Distractors**: same POS, same difficulty, server-side, never the correct answer, never repeated within a session
+- **Session length**: 3 or 10 questions (configurable)
+- **Scoring**: +1 per correct answer (no speed bonus for MVP)
+- **Timer**: none in the singleplayer MVP
+- **No auth required**: anonymous users
+- **Submit-before-send**: the user selects an answer, then confirms it (prevents misclicks)

---

-## 12. Frontend Structure
+## 10. Working Methodology

-```tree
-apps/web/src/
-├── routes/
-│   ├── index.tsx          # Landing / mode select
-│   ├── auth/
-│   ├── singleplayer/
-│   └── multiplayer/
-│       ├── lobby.tsx      # Create or join by code
-│       ├── room.$code.tsx # Waiting room
-│       └── game.$code.tsx # Active game
-├── components/
-│   ├── quiz/              # QuestionCard, OptionButton, ScoreBoard
-│   ├── room/              # PlayerList, RoomCode, ReadyState
-│   └── ui/                # shadcn/ui wrappers: Button, Card, Dialog ...
-├── stores/
-│   └── gameStore.ts       # Zustand: game session, scores, WS state
-├── lib/
-│   ├── api.ts             # TanStack Query wrappers
-│   └── ws.ts              # WS client singleton
-└── main.tsx
-```
+This project is a learning exercise. The goal is to understand the code, not just to ship it.
+
+### How to use an LLM for help
+
+1. Paste this document as context
+2. Describe what you're working on and what you're stuck on
+3. Ask for hints, not solutions
+
+### Refactoring workflow
+
+After completing a task, share the code and ask what to refactor and why. The LLM should explain the concept, not write the implementation.
+
+---
+
+## 11. Post-MVP Ladder
+
+| Phase             | What it adds                                                   |
+| ----------------- | -------------------------------------------------------------- |
+| Auth              | OpenAuth (Google + GitHub), JWT middleware, user rows in DB    |
+| User Stats        | Games played, score history, profile page                      |
+| Multiplayer Lobby | Room creation, join by code, WebSocket connection              |
+| Multiplayer Game  | Simultaneous answers, server timer, live scores, winner screen |
+| Deployment        | Docker Compose prod config, Nginx, Let's Encrypt, Hetzner VPS  |
+| Hardening         | Rate limiting, error boundaries, CI/CD, DB backups             |
+
+### Future Data Model Extensions (deferred, additive)
+
+- `noun_forms` — gender, singular, plural, articles per language
+- `verb_forms` — conjugation tables per language
+- `term_pronunciations` — IPA and audio URLs per language
+- `user_decks` — which decks a user is studying
+- `user_term_progress` — spaced repetition state per user/term/language
+- `quiz_answers` — history log for stats
+
+All of these are new tables that reference existing `terms` rows via foreign keys; no changes to the existing schema are required.
+
+### Multiplayer Architecture (deferred)
+
+- WebSocket protocol: `ws` library, Zod discriminated union for message types
+- Room model: human-readable codes (e.g. `WOLF-42`), not a matchmaking queue
+- Game mechanic: simultaneous answers, 15-second server timer, all players see the same question
+- Valkey for ephemeral room state, PostgreSQL for durable records
+
+### Infrastructure (deferred)
+
+- `app.yourdomain.com` → React frontend
+- `api.yourdomain.com` → Express API + WebSocket
+- `auth.yourdomain.com` → OpenAuth service
+- Docker Compose with `nginx-proxy` + `acme-companion` for automatic SSL
+
+---
+
+## 12. Definition of Done (MVP)
+
+- [x] API returns quiz terms with correct distractors
+- [x] User can complete a quiz without errors
+- [x] Score screen shows final result and a play-again option
+- [x] App is usable on a mobile screen
+- [x] No hardcoded data — everything comes from the database
+- [x] Global error handler with typed error classes
+- [x] Unit + integration tests for API
+
+---
+
+## 13. Roadmap
+
+### Phase 0 — Foundation ✅
+
+Empty repo that builds, lints, and runs end-to-end. `pnpm dev` starts both apps; `GET /api/health` returns 200; React renders a hello page.
+
+### Phase 1 — Vocabulary Data + API ✅
+
+Word data lives in the DB. API returns quiz sessions with distractors. CEFR enrichment pipeline complete. Global error handler and tests implemented.
+
+### Phase 2 — Singleplayer Quiz UI ✅
+
+User can complete a full quiz in the browser. Settings UI, question cards, answer feedback, score screen.
+
+### Phase 3 — Auth
+
+Users can log in via Google or GitHub and stay logged in. JWT validated by the API. User row created on first login.
+
+### Phase 4 — Multiplayer Lobby
+
+Players can create and join rooms. Two browser tabs can join the same room and see each other via WebSocket.
+
+### Phase 5 — Multiplayer Game
+
+Host starts a game. All players answer simultaneously in real time. A winner is declared.
+
+### Phase 6 — Production Deployment
+
+App is live on Hetzner with HTTPS. Auth flow works end-to-end.
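The first Definition-of-Done item above ("API returns quiz terms with correct distractors") rests on the distractor rules from the Game Mechanics section: same POS, same difficulty, never the correct answer, never repeated within a session. A minimal sketch, assuming illustrative names (`Term`, `pickDistractors` are not the project's actual identifiers; the real queries live in `packages/db` and the shared types come from Zod schemas in `packages/shared`):

```typescript
// Hypothetical shapes for illustration only; the real types are derived
// from Zod schemas in packages/shared.
interface Term {
  id: number;
  pos: "noun" | "verb";
  difficulty: "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
  translation: string;
}

// Pick `count` distractors that share the answer's POS and difficulty,
// are not the answer itself, and were not already shown this session.
function pickDistractors(
  pool: Term[],
  answer: Term,
  usedIds: Set<number>,
  count = 3,
): Term[] {
  const candidates = pool.filter(
    (t) =>
      t.id !== answer.id &&
      t.pos === answer.pos &&
      t.difficulty === answer.difficulty &&
      !usedIds.has(t.id),
  );
  // Fisher–Yates shuffle so the picks are a uniform random sample.
  for (let i = candidates.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [candidates[i], candidates[j]] = [candidates[j], candidates[i]];
  }
  return candidates.slice(0, count);
}
```

Serving only the four `translation` strings to the client, and evaluating the submitted choice server-side, keeps the correct answer off the wire, in line with the Key Design Rules.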
+
+### Phase 7 — Polish & Hardening
+
+Rate limiting, reconnect logic, error boundaries, CI/CD, DB backups.
+
+### Dependency Graph
+
+```text
+Phase 0 (Foundation)
+└── Phase 1 (Vocabulary Data + API)
+    └── Phase 2 (Singleplayer UI)
+        └── Phase 3 (Auth)
+            ├── Phase 4 (Room Lobby)
+            │   └── Phase 5 (Multiplayer Game)
+            │       └── Phase 6 (Deployment)
+            └── Phase 7 (Hardening)
+```

-### Zustand store (single store for MVP)
-
-```typescript
-interface AppStore {
-  user: User | null;
-  gameSession: GameSession | null;
-  currentQuestion: Question | null;
-  scores: Record;
-  isLoading: boolean;
-  error: string | null;
-}
-```
-
-TanStack Query handles all server data fetching. Zustand handles ephemeral UI and WebSocket-driven state.
-
---

-## 13. Testing Strategy
+## 14. Game Flow (Future)

-| Type        | Tool                 | Scope                                               |
-| ----------- | -------------------- | --------------------------------------------------- |
-| Unit        | Vitest               | Services, QuizService distractor logic, Zod schemas |
-| Component   | Vitest + RTL         | QuestionCard, OptionButton, auth forms              |
-| Integration | Vitest               | API route handlers against a test DB                |
-| E2E         | Out of scope for MVP | —                                                   |
+Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count → game starts.

-Tests are co-located with source files (`*.test.ts` / `*.test.tsx`).
+**Top-level categories (post-MVP):**

-**Critical paths to cover:**
-
-- Distractor generation (correct POS, no duplicates, never includes answer)
-- Answer validation (server-side, correct scoring)
-- Game session lifecycle (create → play → complete)
-- JWT validation middleware
-
----
-
-## 14. Definition of Done
-
-### Functional
-
-- [ ] User can log in via Google or GitHub (OpenAuth)
-- [ ] User can play singleplayer: 10 rounds, score, result screen
-- [ ] User can create a room and share a code
-- [ ] User can join a room via code
-- [ ] Multiplayer: 10 rounds, simultaneous answers, real-time score sync
-- [ ] 1 000 English–Italian words seeded from WordNet + OMW
-
-### Technical
-
-- [ ] Deployed to Hetzner with HTTPS on all three subdomains
-- [ ] Docker Compose running all services
-- [ ] Drizzle migrations applied on container start
-- [ ] 10–20 passing tests covering critical paths
-- [ ] pnpm workspace build pipeline green
-
----
-
-## 15. Out of Scope (MVP)
-
-- Difficulty levels _(`frequency_rank` column exists, ready to use)_
-- Additional language pairs _(schema already supports it — just add rows)_
-- Leaderboards _(`games_played`, `games_won` columns exist)_
-- Streaks / daily challenges
-- Friends / private invites
-- Audio pronunciation
-- CI/CD pipeline (manual deploy for now)
-- Rate limiting _(add before going public)_
-- Admin panel for vocabulary management
+- **Grammar** — practice nouns, verb conjugations, etc.
+- **Media** — practice vocabulary from specific books, films, songs, etc.
+- **Thematic** — animals, kitchen, etc. (requires category metadata research)
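The selection flow above boils down to a small settings object that the API has to validate. A hedged sketch, assuming illustrative names and values (`parseGameSettings` and the constant arrays are not the project's actual identifiers; the real contract would be a Zod schema in `packages/shared`), using the `typeof CONSTANT[number]` derived-type pattern the project already follows:

```typescript
// Illustrative settings shape for the singleplayer flow; every field name
// and value here is an assumption, not the project's real schema.
const DIRECTIONS = ["en-it", "it-en"] as const;
const DIFFICULTIES = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
const ROUND_COUNTS = [3, 10] as const;

// Types derived from the constants: valid values are defined exactly once.
type Direction = (typeof DIRECTIONS)[number];
type Difficulty = (typeof DIFFICULTIES)[number];
type RoundCount = (typeof ROUND_COUNTS)[number];

interface GameSettings {
  direction: Direction;
  pos: "noun" | "verb";
  difficulty: Difficulty;
  rounds: RoundCount;
}

// Narrow an unknown request body to GameSettings. Returning null instead of
// throwing mirrors the safeParse rule: the controller maps null to a clean 400.
function parseGameSettings(body: unknown): GameSettings | null {
  if (typeof body !== "object" || body === null) return null;
  const b = body as Record<string, unknown>;
  if (!DIRECTIONS.includes(b.direction as Direction)) return null;
  if (b.pos !== "noun" && b.pos !== "verb") return null;
  if (!DIFFICULTIES.includes(b.difficulty as Difficulty)) return null;
  if (!ROUND_COUNTS.includes(b.rounds as RoundCount)) return null;
  return b as unknown as GameSettings;
}
```

In the layered architecture this check belongs to the controller: the service below it only ever sees an already-validated `GameSettings`, never `req.body`.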