updating documentation, formatting

This commit is contained in:
lila 2026-04-12 09:28:35 +02:00
parent e320f43d8e
commit 047196c973
8 changed files with 523 additions and 2282 deletions


@ -2,7 +2,7 @@
"extends": "../../tsconfig.base.json",
"references": [
{ "path": "../../packages/shared" },
{ "path": "../../packages/db" },
{ "path": "../../packages/db" }
],
"compilerOptions": {
"module": "NodeNext",
@ -10,7 +10,7 @@
"outDir": "./dist",
"resolveJsonModule": true,
"rootDir": ".",
"types": ["vitest/globals"],
"types": ["vitest/globals"]
},
"include": ["src", "vitest.config.ts"],
"include": ["src", "vitest.config.ts"]
}


@ -1,348 +0,0 @@
# Glossa — Architecture & API Development Summary
A record of all architectural discussions, decisions, and outcomes from the initial
API design through the quiz model implementation.
---
## Project Overview
Glossa is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. Users see a
word and pick from 4 possible translations. Supports singleplayer and multiplayer.
Stack: Express API, React frontend, Drizzle ORM, Postgres, Valkey, WebSockets.
---
## Architectural Foundation
### The Layered Architecture
The core mental model established for the entire API:
```text
HTTP Request
    ↓
Router     — maps URL + HTTP method to a controller
    ↓
Controller — handles HTTP only: validates input, calls service, sends response
    ↓
Service    — business logic only: no HTTP, no direct DB access
    ↓
Model      — database queries only: no business logic
    ↓
Database
```
**The rule:** each layer only talks to the layer directly below it. A controller never
touches the database. A service never reads `req.body`. A model never knows what a quiz is.
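The layering can be sketched in a few lines. This is a dependency-free illustration, not the real code: the bodies are hypothetical stand-ins, and the real model runs Drizzle queries while the real controller works with Express `req`/`res`.

```typescript
// Model layer — database queries only (faked here with an in-memory array).
const termModel = {
  getTerms: async (_pos: string, limit: number): Promise<string[]> =>
    ["dog", "cat", "house"].slice(0, limit), // stand-in for a Drizzle query
};

// Service layer — business logic only: no HTTP, no direct DB access.
const gameService = {
  prepareGameQuestions: async (pos: string, rounds: number) => {
    const terms = await termModel.getTerms(pos, rounds);
    return terms.map((prompt, i) => ({ questionId: String(i), prompt }));
  },
};

// Controller layer — HTTP only: validate input, call the service, send a response.
const createGame = async (body: { pos?: string; rounds?: number }) => {
  if (!body.pos || !body.rounds) {
    return { status: 400 as const, json: { error: "invalid request" } };
  }
  return {
    status: 200 as const,
    json: await gameService.prepareGameQuestions(body.pos, body.rounds),
  };
};
```

Note that the controller never touches `termModel`, and the service never sees the request body: each layer only talks to the one below it.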
### Monorepo Package Responsibilities
| Package | Owns |
| ----------------- | -------------------------------------------------------- |
| `packages/shared` | Zod schemas, constants, derived TypeScript types |
| `packages/db` | Drizzle schema, DB connection, all model/query functions |
| `apps/api` | Router, controllers, services |
| `apps/web` | React frontend, consumes types from shared |
**Key principle:** all database code lives in `packages/db`. `apps/api` never imports
`drizzle-orm` for queries — it only calls functions exported from `packages/db`.
---
## Problems Faced & Solutions
- Problem 1: Messy API structure
**Symptom:** responsibilities bleeding across layers — DB code in controllers, business
logic in routes.
**Solution:** strict layered architecture with one responsibility per layer.
- Problem 2: No shared contract between API and frontend
**Symptom:** API could return different shapes silently, frontend breaks at runtime.
**Solution:** Zod schemas in `packages/shared` as the single source of truth. Both API
(validation) and frontend (type inference) consume the same schemas.
- Problem 3: Type safety gaps
**Symptom:** TypeScript `any` types on model parameters, `Number` vs `number` confusion.
**Solution:** derived types from constants using `typeof CONSTANT[number]` pattern.
All valid values defined once in constants, types derived automatically.
- Problem 4: `getGameTerms` in wrong package
**Symptom:** model queries living in `apps/api/src/models/` meant `apps/api` had a
direct `drizzle-orm` dependency and was accessing the DB itself.
**Solution:** moved models folder to `packages/db/src/models/`. All Drizzle code now
lives in one package.
- Problem 5: Deck generation complexity
**Initial assumption:** 12 decks needed (nouns/verbs × easy/intermediate/hard × en/it).
**Correction:** decks are pools, not presets. POS and difficulty are query filters applied
at runtime — not deck properties. Only 2 decks needed (en-core, it-core).
**Final decision:** skip deck generation entirely for MVP. Query the terms table directly
with difficulty + POS filters. Revisit post-MVP when spaced repetition or progression
features require curated pools.
- Problem 6: GAME_ROUNDS type conflict
**Problem:** `z.enum()` only accepts strings. `GAME_ROUNDS = ["3", "10"]` works with
`z.enum()` but requires `Number(rounds)` conversion in the service.
**Decision:** keep as strings, convert to number in the service before passing to the
model. Documented coupling acknowledged with a comment.
- Problem 7: Gloss join could multiply question rows
**Symptom:** the schema allowed multiple glosses per term per language, so the left join duplicated question rows.
**Solution:** tightened the unique constraint to one gloss per term per language.
- Problem 8: Model leaked quiz semantics
**Symptom:** return fields were named `prompt` / `answer`, baking quiz semantics into the database layer.
**Solution:** renamed to neutral field names (`sourceText` / `targetText` / `sourceGloss`).
- Problem 9: `AnswerResult` wasn't self-contained
**Symptom:** the frontend needed `selectedOptionId` to render feedback, but the schema omitted it (the reasoning was "the client already knows"). Discovered during frontend work.
**Solution:** added the field to the schema.
- Problem 10: Distractor could duplicate the correct answer text
**Symptom:** different terms can share the same translation, so a distractor could match the correct answer's text.
**Solution:** added `ne(translations.text, excludeText)` to the distractor query.
- Problem 11: Strict mode flagged the Fisher-Yates shuffle
**Symptom:** `noUncheckedIndexedAccess` treats `result[i]` as `T | undefined`, so direct swaps fail to typecheck.
**Solution:** non-null assertions plus a temp-variable swap.
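The Fisher-Yates fix from Problem 11 looks roughly like this (a sketch: the variable names are illustrative, not copied from the service):

```typescript
// Fisher-Yates shuffle. Under noUncheckedIndexedAccess, result[i] is T | undefined,
// so the swap goes through a temp variable with non-null assertions — safe here
// because i and j are always valid indices.
const shuffle = <T>(input: readonly T[]): T[] => {
  const result = [...input];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    const tmp = result[i]!;
    result[i] = result[j]!;
    result[j] = tmp;
  }
  return result;
};
```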
---
## Decisions Made
- Zod schemas belong in `packages/shared`
Both the API and frontend import from the same schemas. If the shape changes, TypeScript
compilation fails in both places simultaneously — silent drift is impossible.
- Server-side answer evaluation
The correct answer is never sent to the frontend in `GameQuestion`. It is only revealed
in `AnswerResult` after the client submits. Prevents cheating and keeps game logic
authoritative on the server.
- `safeParse` over `parse` in controllers
`parse` throws a raw Zod error → ugly 500 response. `safeParse` returns a result object
→ clean 400 with early return. Global error handler to be implemented later (Step 6 of
roadmap) will centralise this pattern.
- POST not GET for game start
`GET` requests have no body. Game configuration is submitted as a JSON body → `POST` is
semantically correct.
- `express.json()` middleware required
Without it, `req.body` is `undefined`. Added to `createApp()` in `app.ts`.
- Type naming: PascalCase
TypeScript convention. `supportedLanguageCode` → `SupportedLanguageCode` etc.
- Primitive types: always lowercase
`number` not `Number`, `string` not `String`. The uppercase versions are object wrappers
and not assignable to Drizzle's expected primitive types.
- Model parameters use shared types, not `GameRequestType`
The model layer should not know about `GameRequestType` — that's an HTTP boundary concern.
Instead, parameters are typed using the derived constant types (`SupportedLanguageCode`,
`SupportedPos`, `DifficultyLevel`) exported from `packages/shared`.
- One gloss per term per language. The unique constraint on term_glosses was tightened from (term_id, language_code, text) to (term_id, language_code) to prevent the left join from multiplying question rows. Revisit if multiple glosses per language are ever needed (e.g. register or domain variants).
- Model returns neutral field names, not quiz semantics. getGameTerms returns sourceText / targetText / sourceGloss rather than prompt / answer / gloss. Quiz semantics are applied in the service layer. Keeps the model reusable for non-quiz features.
- Asymmetric difficulty filter. Difficulty is filtered on the target (answer) side only. A word can be A2 in Italian but B1 in English, and what matters is the difficulty of the word being learned.
- optionId as integer 0-3, not UUID. Options only need uniqueness within a single question; cheating prevented by shuffling, not opaque IDs.
- questionId and sessionId as UUIDs. Globally unique, opaque, natural Valkey keys when storage moves later.
- gloss is string | null rather than optional, for predictable shape on the frontend.
- GameSessionStore stores only the answer key (questionId → correctOptionId). Minimal payload for easy Valkey migration.
- All GameSessionStore methods are async even for the in-memory implementation, so the service layer is already written for Valkey.
- Distractors fetched per-question (N+1 queries). Correct shape for the problem; 10 queries on local Postgres is negligible latency.
- No fallback logic for insufficient distractors. Data volumes are sufficient; strict query throws if something is genuinely broken.
- Distractor query excludes both the correct term ID and the correct answer text, preventing duplicate options from different terms with the same translation.
- Submit-before-send flow on frontend: user selects, then confirms. Prevents misclicks.
- AppError base class over error code maps. A statusCode on the error itself means the middleware doesn't need a lookup table. New error types are self-contained — one class, one status code.
- next(error) over res.status().json() in controllers. Express requires explicit next(error) for async handlers. Centralises all error formatting in one place. Controllers stay clean — validate, call service, send response.
- Zod .message over .issues[0]?.message. Returns all validation failures, not just the first. Output is verbose (raw JSON string) — revisit formatting post-MVP if the frontend needs structured error objects.
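The derived-type pattern referenced in several decisions above can be sketched like this (the constant values here are illustrative; the real constants live in `packages/shared`):

```typescript
// All valid values defined once, `as const` so the literal types survive.
const SUPPORTED_LANGUAGE_CODES = ["en", "it"] as const;
const DIFFICULTY_LEVELS = ["easy", "intermediate", "hard"] as const;

// Types derived automatically from the constants: no duplication, no drift.
type SupportedLanguageCode = (typeof SUPPORTED_LANGUAGE_CODES)[number]; // "en" | "it"
type DifficultyLevel = (typeof DIFFICULTY_LEVELS)[number];

// Model parameters use these shared types (lowercase primitives, never Number/String),
// not the HTTP-layer GameRequestType.
const getGameTerms = (
  lang: SupportedLanguageCode,
  difficulty: DifficultyLevel,
  limit: number,
): string => `${lang}:${difficulty}:${limit}`; // stand-in body for illustration
```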
---
## Global error handler: typed error classes + central middleware
Three-layer pattern: error classes define the shape, services throw them, middleware catches them.
AppError is the base class — carries a statusCode and a message. ValidationError (400) and NotFoundError (404) extend it. Adding a new error type is one class with a super() call.
Controllers wrap their body in try/catch and call next(error) in the catch block. They never build error responses themselves. This is required because Express does not catch errors from async handlers automatically — without next(error), an unhandled rejection crashes the process.
The middleware in app.ts (registered after all routes) checks instanceof AppError. Known errors get their statusCode and message. Unknown errors get logged and return a generic 500 — no stack traces leak to the client.
Zod validation error format: used gameSettings.error.message rather than gameSettings.error.issues[0]?.message. This sends all validation failures at once instead of just the first. Tradeoff: the output is a raw JSON string, not a clean object. Acceptable for MVP — if the frontend needs structured errors later, format .issues into { field, message }[] in the ValidationError constructor.
Where errors are thrown: ValidationError is thrown in the controller (it's the layer that runs safeParse). NotFoundError is thrown in the service (it's the layer that knows whether a session or question exists). The service still doesn't know about HTTP — it throws a typed error, and the middleware maps it to a status code.
---
## Data Pipeline Work (Pre-API)
### CEFR Enrichment Pipeline (completed)
A staged ETL pipeline was built to enrich translation records with CEFR levels and
difficulty ratings:
```text
Raw source files
    ↓
extract-*.py — normalise each source to standard JSON
    ↓
compare-*.py — quality gate: surface conflicts between sources (read-only)
    ↓
merge-*.py   — resolve conflicts by source priority, derive difficulty
    ↓
enrich.ts    — write cefr_level + difficulty to DB translations table
```
**Source priority:**
- English: `en_m3` > `cefrj` > `octanove` > `random`
- Italian: `it_m3` > `italian`
**Enrichment results:**
| Language | Enriched | Total | Coverage |
| -------- | -------- | ------- | -------- |
| English | 42,527 | 171,394 | ~25% |
| Italian | 23,061 | 54,603 | ~42% |
Both languages have sufficient coverage for MVP. Italian C2 has only 242 terms — noted
as a potential constraint for the distractor algorithm at high difficulty.
---
## API Schemas (packages/shared)
### `GameRequestSchema`
```typescript
{
source_language: z.enum(SUPPORTED_LANGUAGE_CODES),
target_language: z.enum(SUPPORTED_LANGUAGE_CODES),
pos: z.enum(SUPPORTED_POS),
difficulty: z.enum(DIFFICULTY_LEVELS),
rounds: z.enum(GAME_ROUNDS),
}
```
The remaining schemas, in shorthand:
```text
AnswerOption      { optionId: number (0-3), text: string }
GameQuestion      { questionId: uuid, prompt: string, gloss: string | null, options: AnswerOption[4] }
GameSession       { sessionId: uuid, questions: GameQuestion[] }
AnswerSubmission  { sessionId: uuid, questionId: uuid, selectedOptionId: number (0-3) }
AnswerResult      { questionId: uuid, isCorrect: boolean, correctOptionId: number (0-3), selectedOptionId: number (0-3) }
```
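A dependency-free sketch of these shapes and the self-contained `AnswerResult` decision (plain TS mirrors; in the real code these are inferred from the Zod schemas, and the evaluator name here is hypothetical):

```typescript
// Plain TS mirrors of the shapes above.
type AnswerOption = { optionId: number; text: string };
type GameQuestion = {
  questionId: string;
  prompt: string;
  gloss: string | null; // always present, sometimes null — predictable shape
  options: AnswerOption[];
};
type AnswerResult = {
  questionId: string;
  isCorrect: boolean;
  correctOptionId: number;
  selectedOptionId: number; // echoed back so the frontend can render feedback
};

// Hypothetical evaluator showing why AnswerResult carries everything the
// frontend needs, including the option the client itself submitted.
const evaluate = (
  questionId: string,
  selectedOptionId: number,
  correctOptionId: number,
): AnswerResult => ({
  questionId,
  isCorrect: selectedOptionId === correctOptionId,
  correctOptionId,
  selectedOptionId,
});
```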
---
## API Endpoints
```text
POST /api/v1/game/start GameRequest → GameSession
POST /api/v1/game/answer AnswerSubmission → AnswerResult
```
---
## Current File Structure (apps/api)
```text
apps/api/src/
├── app.ts — Express app, express.json() middleware
├── server.ts — starts server on PORT
├── routes/
│ ├── apiRouter.ts — mounts /health and /game routers
│ ├── gameRouter.ts — POST /start → createGame controller
│ └── healthRouter.ts
├── controllers/
│ └── gameController.ts — validates GameRequest, calls service
└── services/
└── gameService.ts — calls getGameTerms, returns raw rows
```
---
## Current File Structure (packages/db)
```text
packages/db/src/
├── db/
│ └── schema.ts — Drizzle schema (terms, translations, users, decks...)
├── models/
│ └── termModel.ts — getGameTerms() query
└── index.ts — exports db connection + getGameTerms
```
---
## Completed Tasks
- [x] Layered architecture established and understood
- [x] `GameRequestSchema` defined in `packages/shared`
- [x] Derived types (`SupportedLanguageCode`, `SupportedPos`, `DifficultyLevel`) exported from constants
- [x] `getGameTerms()` model implemented with POS / language / difficulty / limit filters
- [x] Model correctly placed in `packages/db`
- [x] `prepareGameQuestions()` service skeleton calling the model
- [x] `createGame` controller with Zod `safeParse` validation
- [x] `POST /api/v1/game/start` route wired
- [x] End-to-end pipeline verified with test script — returns correct rows
- [x] CEFR enrichment pipeline complete for English and Italian
- [x] Double join on translations implemented (source + target language)
- [x] Gloss left join implemented
- [x] Model return type uses neutral field names (sourceText, targetText, sourceGloss)
- [x] Schema: gloss unique constraint tightened to one gloss per term per language
- [x] Zod schemas defined: AnswerOption, GameQuestion, GameSession, AnswerSubmission, AnswerResult
- [x] getDistractors model implemented with POS/difficulty/language/excludeTermId/excludeText filters
- [x] createGameSession service: fetches terms, fetches distractors per question, shuffles options, stores session, returns GameSession
- [x] evaluateAnswer service: looks up session, compares submitted optionId to stored correct answer, returns AnswerResult
- [x] GameSessionStore interface + InMemoryGameSessionStore (Map-backed, swappable to Valkey)
- [x] POST /api/v1/game/answer endpoint wired (route, controller, service)
- [x] selectedOptionId added to AnswerResult (discovered during frontend work)
- [x] Minimal frontend: /play route with settings UI, QuestionCard, OptionButton, ScoreScreen
- [x] Vite proxy configured for dev
---
## Roadmap Ahead
### Step 1 — Learn SQL fundamentals - done
Concepts needed: SELECT, FROM, JOIN, WHERE, LIMIT.
Resources: sqlzoo.net or Khan Academy SQL section.
Required before: implementing the double join for source language prompt.
### Step 2 — Complete the model layer - done
- Double join on `translations` — once for source language (prompt), once for target language (answer)
- `GlossModel.getGloss(termId, languageCode)` — fetch gloss if available
### Step 3 — Define remaining Zod schemas - done
- `QuizQuestion`, `QuizOption`, `AnswerSubmission`, `AnswerResult` in `packages/shared`
### Step 4 — Complete the service layer - done
- `QuizService.buildSession()` — assemble raw rows into `QuizQuestion[]`
- Generate `questionId` per question
- Map source language translation as prompt
- Attach gloss if available
- Fetch 3 distractors (same POS, different term, same difficulty)
- Shuffle options so correct answer is not always in same position
- `QuizService.evaluateAnswer()` — validate correctness, return `AnswerResult`
### Step 5 — Implement answer endpoint - done
- `POST /api/v1/game/answer` route, controller, service method
### Step 6 — Global error handler - done
- Typed error classes (`ValidationError`, `NotFoundError`)
- Central error middleware in `app.ts`
- Remove temporary `safeParse` error handling from controllers
### Step 7 — Tests - done
- Unit tests for `QuizService` — correct POS filtering, distractor never equals correct answer
- Unit tests for `evaluateAnswer` — correct and incorrect cases
- Integration tests for both endpoints
### Step 8 — Auth (Phase 2 from original roadmap)
- OpenAuth integration
- JWT validation middleware
- `GET /api/auth/me` endpoint
- Frontend auth guard
---
## Open Questions
- **Distractor algorithm:** when Italian C2 has only 242 terms, should the difficulty
filter fall back gracefully or return an error? Decision needed before implementing
`buildSession()`. => resolved
- **Session statefulness:** game loop is currently stateless (fetch all questions upfront).
Confirm this is still the intended MVP approach before building `buildSession()`. => resolved
- **Glosses can leak answers:** some WordNet glosses contain the target-language
word in the definition text (e.g. "Padre" appearing in the English gloss for
"father"). Address during the post-MVP data enrichment pass — either clean the
glosses, replace them with custom definitions, or filter at the service layer. => resolved


@ -1,351 +0,0 @@
# WordNet Seeding Script — Session Summary
## Project Context
A multiplayer English–Italian vocabulary trainer (Glossa) built with a pnpm monorepo. Vocabulary data comes from Open Multilingual Wordnet (OMW) and is extracted into JSON files, then seeded into a PostgreSQL database via Drizzle ORM.
---
## 1. JSON Extraction Format
Each synset extracted from WordNet is represented as:
```json
{
"synset_id": "ili:i35545",
"pos": "noun",
"translations": { "en": ["entity"], "it": ["cosa", "entità"] }
}
```
**Fields:**
- `synset_id` — OMW Interlingual Index ID, maps to `terms.synset_id` in the DB
- `pos` — part of speech, matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within a synset)
**Glosses** are not extracted — the `term_glosses` table exists in the schema for future use but is not needed for the MVP quiz mechanic.
---
## 2. Database Schema (relevant tables)
```text
terms
id uuid PK
synset_id text UNIQUE
pos varchar(20)
created_at timestamptz
translations
id uuid PK
term_id uuid FK → terms.id (CASCADE)
language_code varchar(10)
text text
created_at timestamptz
UNIQUE (term_id, language_code, text)
```
---
## 3. Seeding Script — v1 (batch, truncate-based)
### Approach
- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate
### Key decisions made during development
| Issue | Resolution |
| -------------------------------- | --------------------------------------------------- |
| `JSON.parse` returns `any` | Added `Array.isArray` check before casting |
| `forEach` doesn't await | Switched to `for...of` |
| Empty array types | Used Drizzle's `$inferInsert` types |
| `translations` naming conflict | Renamed local variable to `translationRows` |
| Final batch not flushed | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500` |
### Final script structure
```ts
import fs from "node:fs/promises";
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;
type Synset = {
synset_id: string;
pos: POS;
translations: Record<LANGUAGE_CODE, string[]>;
};
const dataDir = "../../scripts/datafiles/";
const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
const data = await fs.readFile(filepath, "utf8");
const parsed = JSON.parse(data);
if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
return parsed as Synset[];
};
const uploadToDB = async (
termsData: TermInsert[],
translationsData: TranslationInsert[],
) => {
await db.insert(terms).values(termsData);
await db.insert(translations).values(translationsData);
};
const main = async () => {
console.log("Reading JSON file...");
const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
console.log(`Loaded ${allSynsets.length} synsets`);
const termsArray: TermInsert[] = [];
const translationsArray: TranslationInsert[] = [];
let batchCount = 0;
for (const synset of allSynsets) {
const term = {
id: crypto.randomUUID(),
synset_id: synset.synset_id,
pos: synset.pos,
};
const translationRows = Object.entries(synset.translations).flatMap(
([lang, lemmas]) =>
lemmas.map((lemma) => ({
id: crypto.randomUUID(),
term_id: term.id,
language_code: lang as LANGUAGE_CODE,
text: lemma,
})),
);
translationsArray.push(...translationRows);
termsArray.push(term);
if (termsArray.length >= 500) {
batchCount++;
console.log(
`Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
);
await uploadToDB(termsArray, translationsArray);
termsArray.length = 0;
translationsArray.length = 0;
}
}
if (termsArray.length > 0) {
batchCount++;
console.log(
`Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`,
);
await uploadToDB(termsArray, translationsArray);
}
console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};
main().catch((error) => {
console.error(error);
process.exit(1);
});
```
---
## 4. Pitfalls Encountered
### Duplicate key on re-run
Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding.
```bash
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
```
### `onConflictDoNothing` breaks FK references
When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID is never written to the DB. Subsequent `translations` inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding.
### DATABASE_URL misconfigured
Correct format:
```text
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
```
### Tables not found after `docker compose up`
Migrations must be applied first: `npx drizzle-kit migrate`
---
## 5. Running the Script
```bash
# Start the DB container
docker compose up -d postgres
# Apply migrations
npx drizzle-kit migrate
# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
# Run the seed script
npx tsx src/seed-en-it-nouns.ts
# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"
```
---
## 6. Seeding Script — v2 (incremental upsert, multi-file)
### Motivation
The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.
### File naming convention
One JSON file per language pair per POS:
```text
scripts/datafiles/
en-it-nouns.json
en-fr-nouns.json
en-it-verbs.json
de-it-nouns.json
...
```
### How incremental upsert works
For a concept like "dog" already in the DB with English and Italian:
1. Import `en-fr-nouns.json`
2. Upsert `terms` by `synset_id` — finds existing row, returns its real ID
3. `dog (en)` already exists → skipped by `onConflictDoNothing`
4. `chien (fr)` is new → inserted
The concept is **extended**, not replaced.
### Tradeoff vs batch approach
Batching is no longer possible since you need the real `term.id` from the DB before inserting translations. Each synset is processed individually. For 25k rows this is still fast enough.
### Key types added
```ts
type Synset = {
synset_id: string;
pos: POS;
translations: Partial<Record<LANGUAGE_CODE, string[]>>; // Partial — file only contains subset of languages
};
type FileName = {
sourceLang: LANGUAGE_CODE;
targetLang: LANGUAGE_CODE;
pos: POS;
};
```
### Filename validation
```ts
const parseFilename = (filename: string): FileName => {
const parts = filename.replace(".json", "").split("-");
if (parts.length !== 3)
throw new Error(
`Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
);
const [sourceLang, targetLang, pos] = parts;
if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
throw new Error(`Unsupported language code: ${sourceLang}`);
if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
throw new Error(`Unsupported language code: ${targetLang}`);
if (!SUPPORTED_POS.includes(pos as POS))
throw new Error(`Unsupported POS: ${pos}`);
return {
sourceLang: sourceLang as LANGUAGE_CODE,
targetLang: targetLang as LANGUAGE_CODE,
pos: pos as POS,
};
};
```
### Upsert function (WIP)
```ts
const upsertSynset = async (
synset: Synset,
fileInfo: FileName,
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
const [upsertedTerm] = await db
.insert(terms)
.values({ synset_id: synset.synset_id, pos: synset.pos })
.onConflictDoUpdate({ target: terms.synset_id, set: { pos: synset.pos } })
.returning({ id: terms.id, created_at: terms.created_at });
// Heuristic: a row created within the last second is treated as a fresh insert.
const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);
const translationRows = Object.entries(synset.translations).flatMap(
([lang, lemmas]) =>
lemmas!.map((lemma) => ({
id: crypto.randomUUID(),
term_id: upsertedTerm.id,
language_code: lang as LANGUAGE_CODE,
text: lemma,
})),
);
const result = await db
.insert(translations)
.values(translationRows)
.onConflictDoNothing()
.returning({ id: translations.id });
return { termInserted, translationsInserted: result.length };
};
```
---
## 7. Strategy Comparison
| Strategy | Use case | Pros | Cons |
| ------------------ | ----------------------------- | --------------------- | -------------------- |
| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower |
| Migrations-as-data | Production audit trail | Clean history | Files accumulate |
| Diff-based sync | Large production datasets | Minimal writes | Complex to implement |
---
## 8. packages/db — package.json exports fix
The `exports` field must be an object, not an array:
```json
"exports": {
".": "./src/index.ts",
"./schema": "./src/db/schema.ts"
}
```
Imports then resolve as:
```ts
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
```


@ -1,6 +1,6 @@
# Decisions Log
A record of non-obvious technical decisions made during development, with reasoning. Intended to preserve context across sessions.
A record of non-obvious technical decisions made during development, with reasoning. Intended to preserve context across sessions. Grouped by topic area.
---
@ -32,21 +32,11 @@ All auth delegated to OpenAuth service at `auth.yourdomain.com`. Providers: Goog
### Multi-stage builds for monorepo context
Both `apps/web` and `apps/api` use multi-stage Dockerfiles (`deps`, `dev`, `builder`, `runner`) because:
- The monorepo structure requires copying `pnpm-workspace.yaml`, root `package.json`, and cross-dependencies (`packages/shared`, `packages/db`) before installing
- `node_modules` paths differ between host and container due to workspace hoisting
- Stages allow caching `pnpm install` separately from source code changes
Both `apps/web` and `apps/api` use multi-stage Dockerfiles (`deps`, `dev`, `builder`, `runner`) because the monorepo structure requires copying `pnpm-workspace.yaml`, root `package.json`, and cross-dependencies before installing. Stages allow caching `pnpm install` separately from source code changes.
### Vite as dev server (not Nginx)
In development, `apps/web` uses `vite dev` directly, not Nginx. Reasons:
- Hot Module Replacement (HMR) requires Vite's WebSocket dev server
- Source maps and error overlay need direct Vite integration
- Nginx would add unnecessary proxy complexity for local dev
Production will use Nginx to serve static Vite build output.
In development, `apps/web` uses `vite dev` directly, not Nginx. HMR requires Vite's WebSocket dev server. Production will use Nginx to serve static Vite build output.
---
@ -54,41 +44,111 @@ Production will use Nginx to serve static Vite build output.
### Express app structure: factory function pattern
`app.ts` exports a `createApp()` factory function. `server.ts` imports it and calls `.listen()`. This allows tests to import the app directly without starting a server, keeping tests isolated and fast.
`app.ts` exports a `createApp()` factory function. `server.ts` imports it and calls `.listen()`. This allows tests to import the app directly without starting a server (used by supertest).
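A stdlib-only sketch of the factory pattern, with `node:http` standing in for Express (the real `createApp()` builds an Express app and registers middleware and routers):

```typescript
import { createServer, type Server } from "node:http";

// Factory: builds the app but never calls .listen() itself.
export const createApp = (): Server =>
  createServer((_req, res) => {
    res.setHeader("content-type", "application/json");
    res.end(JSON.stringify({ ok: true }));
  });

// server.ts: createApp().listen(PORT)
// tests: import createApp() directly and never call listen(), so no port is opened.
```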
### Data model: `decks` separate from `terms` (not frequency_rank filtering)
### Zod schemas belong in `packages/shared`
**Original approach:** Store `frequency_rank` on `terms` table and filter by rank range for difficulty.
Both the API and frontend import from the same schemas. If the shape changes, TypeScript compilation fails in both places simultaneously — silent drift is impossible.
**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:
### Server-side answer evaluation
- Rank 1: "In" → "indio" (chemical symbol: Indium)
- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium)
- Rank 7: "He" → "elio" (chemical symbol: Helium)
The correct answer is never sent to the frontend in `GameQuestion`. It is only revealed in `AnswerResult` after the client submits. Prevents cheating and keeps game logic authoritative on the server.
These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning.
### `safeParse` over `parse` in controllers
**Decision:**
`parse` throws a raw Zod error → ugly 500 response. `safeParse` returns a result object → clean 400 with early return via the error handler.
- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering)
- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
- `deck_terms` junction table links terms to decks with position ordering
- `rooms.deck_id` specifies which vocabulary deck a game uses
### POST not GET for game start
**Benefits:**
`GET` requests have no body. Game configuration is submitted as a JSON body → `POST` is semantically correct.
- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
- Bad data (chemical symbols, obscure words) excluded at deck level, not schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes
### Model parameters use shared types, not `GameRequestType`
The model layer should not know about `GameRequestType` — that's an HTTP boundary concern. Parameters are typed using the derived constant types (`SupportedLanguageCode`, `SupportedPos`, `DifficultyLevel`) exported from `packages/shared`.
### Model returns neutral field names, not quiz semantics
`getGameTerms` returns `sourceText` / `targetText` / `sourceGloss` rather than `prompt` / `answer` / `gloss`. Quiz semantics are applied in the service layer. Keeps the model reusable for non-quiz features.
### Asymmetric difficulty filter
Difficulty is filtered on the target (answer) side only. A word can be A2 in Italian but B1 in English, and what matters is the difficulty of the word being learned.
### optionId as integer 0-3, not UUID
Options only need uniqueness within a single question; cheating prevented by shuffling, not opaque IDs.
### questionId and sessionId as UUIDs
Globally unique, opaque, natural Valkey keys when storage moves later.
### gloss is `string | null` rather than optional
Predictable shape on the frontend — always present, sometimes null.
### GameSessionStore stores only the answer key
Minimal payload (`questionId → correctOptionId`) for easy Valkey migration. All methods are async even for the in-memory implementation, so the service layer is already written for Valkey.
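An in-memory sketch of such a store. The class name matches the document; the method names and nested-map layout are assumptions. Every method is async even though nothing awaits, so a Valkey-backed implementation can swap in without touching the service layer:

```typescript
class GameSessionStore {
  // sessionId → (questionId → correctOptionId)
  private sessions = new Map<string, Map<string, number>>();

  async setAnswer(sessionId: string, questionId: string, correctOptionId: number): Promise<void> {
    const session = this.sessions.get(sessionId) ?? new Map<string, number>();
    session.set(questionId, correctOptionId);
    this.sessions.set(sessionId, session);
  }

  async getAnswer(sessionId: string, questionId: string): Promise<number | undefined> {
    return this.sessions.get(sessionId)?.get(questionId);
  }

  async deleteSession(sessionId: string): Promise<void> {
    this.sessions.delete(sessionId);
  }
}
```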
### Distractors fetched per-question (N+1 queries)
Correct shape for the problem; 10 queries on local Postgres is negligible latency.
### No fallback logic for insufficient distractors
Data volumes are sufficient; strict query throws if something is genuinely broken.
### Distractor query excludes both term ID and answer text
Prevents duplicate options from different terms with the same translation.
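The double exclusion can be sketched as a plain filter over fetched rows (the row shape and function name are assumptions; in the real query both conditions live in SQL as `ne(...)` clauses):

```typescript
interface TranslationRow {
  termId: number;
  text: string;
}

// Exclude both the answer's term and any row that shares the answer's text —
// a different term with the identical translation would otherwise duplicate
// the correct option.
function pickDistractors(
  rows: TranslationRow[],
  answerTermId: number,
  answerText: string,
  n = 3,
): string[] {
  return rows
    .filter((r) => r.termId !== answerTermId && r.text !== answerText)
    .slice(0, n)
    .map((r) => r.text);
}
```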
### Submit-before-send flow on frontend
User selects, then confirms. Prevents misclicks.
### Multiplayer mechanic: simultaneous answers (not buzz-first)
All players see the same question at the same time and submit independently. The server waits for all answers or a 15-second timeout, then broadcasts the result. This keeps the experience Duolingo-like and symmetric. A buzz-first mechanic was considered and rejected.
### Room model: room codes (not matchmaking queue)
Players create rooms and share a human-readable code (e.g. `WOLF-42`) to invite friends. Auto-matchmaking via a queue is out of scope for MVP. Valkey is included in the stack and can support a queue in a future phase.
---
## Error Handling
### `AppError` base class over error code maps
A `statusCode` on the error itself means the middleware doesn't need a lookup table. New error types are self-contained — one class, one status code. `ValidationError` (400) and `NotFoundError` (404) extend `AppError`.
### `next(error)` over `res.status().json()` in controllers
Express requires explicit `next(error)` for async handlers — it does not catch async errors automatically. Centralises all error formatting in one middleware. Controllers stay clean: validate, call service, send response.
### Zod `.message` over `.issues[0]?.message`
Returns all validation failures at once, not just the first. Output is verbose (raw JSON string) — revisit formatting post-MVP if the frontend needs structured `{ field, message }[]` error objects.
### Where errors are thrown
`ValidationError` is thrown in the controller (the layer that runs `safeParse`). `NotFoundError` is thrown in the service (the layer that knows whether a session or question exists). The service doesn't know about HTTP — it throws a typed error, and the middleware maps it to a status code.
---
## Testing
### Mocked DB for unit tests (not test database)
Unit tests mock `@glossa/db` via `vi.mock` — the real database is never touched. Tests run in milliseconds with no infrastructure dependency. Integration tests with a real test DB are deferred post-MVP.
### Co-located test files
`gameService.test.ts` lives next to `gameService.ts`, not in a separate `__tests__/` directory. Convention matches the `vitest` default and keeps related files together.
### supertest for endpoint tests
Uses `createApp()` factory directly — no server started. Tests the full HTTP layer (routing, middleware, error handler) with real request/response assertions.
---
### Base config: no `lib`, `module`, or `moduleResolution`
These are intentionally omitted from `tsconfig.base.json` because different packages need different values — `apps/api` uses `NodeNext`, `apps/web` uses `ESNext`/`bundler` (Vite), and mixing them in the base caused errors. Each package declares its own.
### `outDir: "./dist"` per package
The base config originally had `outDir: "dist"` which resolved relative to the base file location, pointing to the root `dist` folder. Overridden in each package with `"./dist"` to ensure compiled output stays inside the package.
### `apps/web` tsconfig: deferred to Vite scaffold
The web tsconfig was left as a placeholder and filled in after `pnpm create vite` generated `tsconfig.json`, `tsconfig.app.json`, and `tsconfig.node.json`. The generated files were then trimmed to remove options already covered by the base.
### `rootDir: "."` on `apps/api`
Set explicitly to allow `vitest.config.ts` (which lives outside `src/`) to be included in the TypeScript program. Without it, TypeScript infers `rootDir` as `src/` and rejects any file outside that directory.
### Type naming: PascalCase
`supportedLanguageCode``SupportedLanguageCode`. TypeScript convention.
### Primitive types: always lowercase
`number` not `Number`, `string` not `String`. The uppercase versions are object wrappers and not assignable to Drizzle's expected primitive types.
---
### Two-config approach for `apps/web`
The root `eslint.config.mjs` handles TypeScript linting across all packages. `apps/web/eslint.config.js` is kept as a local addition for React-specific plugins only: `eslint-plugin-react-hooks` and `eslint-plugin-react-refresh`. ESLint flat config merges them automatically by directory proximity — no explicit import between them needed.
### Coverage config at root only
Vitest coverage configuration lives in the root `vitest.config.ts` only. Individual package configs omit it to produce a single aggregated report rather than separate per-package reports.
### `globals: true` with `"types": ["vitest/globals"]`
Using Vitest globals (`describe`, `it`, `expect` without imports) requires `"types": ["vitest/globals"]` in each package's tsconfig `compilerOptions`. Added to `apps/api`, `packages/shared`, and `packages/db`. Added to `apps/web/tsconfig.app.json`.
---
## Known Issues / Dev Notes
### glossa-web has no healthcheck
The `web` service in `docker-compose.yml` has no `healthcheck` defined. Reason: Vite's dev server (`vite dev`) has no built-in health endpoint. Unlike the API's `/api/health`, there's no URL to poll.
Workaround: `depends_on` uses `api` healthcheck as proxy. For production (Nginx), add a health endpoint or use TCP port check.
### Valkey memory overcommit warning
Valkey logs this on start in development:
```text
WARNING Memory overcommit must be enabled for proper functionality
```
This is **harmless in dev** but should be fixed before production. The warning appears because Docker containers don't inherit host sysctl settings by default.
Fix: Add to host `/etc/sysctl.conf`:
```conf
vm.overcommit_memory = 1
```
Then `sudo sysctl -p` or restart Docker.
---
### Users: internal UUID + openauth_sub (not sub as PK)
**Original approach:** Use OpenAuth `sub` claim directly as `users.id` (text primary key).
**Problem:** Embeds auth provider in the primary key (e.g. `"google|12345"`). If OpenAuth changes format or a second provider is added, the PK cascades through all FKs (`rooms.host_id`, `room_players.user_id`).
**Decision:**
- `users.id` = internal UUID (stable FK target)
- `users.openauth_sub` = text UNIQUE (auth provider claim)
- Allows adding multiple auth providers per user later without FK changes
### Rooms: `updated_at` for stale recovery only
Most tables omit `updated_at` (unnecessary for MVP). `rooms.updated_at` is kept specifically for stale room recovery—identifying rooms stuck in `in_progress` status after server crashes.
### Translations: UNIQUE (term_id, language_code, text)
Allows multiple synonyms per language per term (e.g. "dog", "hound" for same synset). Prevents exact duplicate rows. Homonyms (e.g. "Lead" metal vs. "Lead" guide) are handled by different `term_id` values (different synsets), so no constraint conflict.
### One gloss per term per language
The unique constraint on `term_glosses` was tightened from `(term_id, language_code, text)` to `(term_id, language_code)` to prevent left joins from multiplying question rows. Revisit if multiple glosses per language are ever needed.
### Decks: `source_language` + `validated_languages` (not `pair_id`)
**Original approach:** `decks.pair_id` references `language_pairs`, tying each deck to a single language pair.
**Problem:** One deck can serve multiple target languages as long as translations exist for all its terms. A `pair_id` FK would require duplicating the deck for each target language.
**Decision:**
- `decks.source_language` — the language the wordlist was curated from (e.g. `"en"`). A deck sourced from an English frequency list is fundamentally different from one sourced from an Italian list.
- `decks.validated_languages` — array of language codes (excluding `source_language`) for which full translation coverage exists across all terms in the deck. Recalculated and updated on every run of the generation script.
- The language pair used for a quiz session is determined at session start, not at deck creation time.
**Benefits:**
- One deck serves multiple target languages (e.g. en→it and en→fr) without duplication
- `validated_languages` stays accurate as translation data grows
- DB enforces via CHECK constraint that `source_language` is never included in `validated_languages`
### Decks: wordlist tiers as scope (not POS-split decks)
**Rejected approach:** one deck per POS (e.g. `en-nouns`, `en-verbs`).
**Problem:** POS is already a filterable column on `terms`, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication in the generation script.
**Decision:** one deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary at query time. The user never sees or picks a deck — they pick a direction, POS, and difficulty, and the app resolves those to the right deck + filters.
Progression works by expanding the deck set as the user advances:
```sql
WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
AND t.pos = 'noun'
AND t.cefr_level = 'B1'
```
Decks must not overlap — each term appears in exactly one tier. The generation script already deduplicates, so this is enforced at import time.
### Decks: SUBTLEX as wordlist source (not manual curation)
**Problem:** the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what is actually common in Italian.
**Decision:** use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology, making them comparable across languages.
This is why `decks.source_language` is not just a technical detail — it is the reason the data model is correct:
- `en-core-1000` built from SUBTLEX-EN → used when source language is English (en→it)
- `it-core-1000` built from SUBTLEX-IT → used when source language is Italian (it→en)
Same translation data underneath, correctly frequency-grounded per direction. Two wordlist files, two generation script runs.
### Terms: `synset_id` nullable (not NOT NULL)
**Problem:** non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. `NOT NULL` is too strict.
**Decision:** make `synset_id` nullable. `synset_id` remains the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It is not removed.
Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are not considered equal), so no additional constraint logic is needed beyond dropping `notNull()`. For extra defensiveness a partial unique index can be added later:
```sql
CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
```
### Terms: `source` + `source_id` columns
Once multiple import pipelines exist (OMW today, Wiktionary later), `synset_id` alone is insufficient as an idempotency key — Wiktionary terms won't have a synset ID.
**Decision:** add `source` (varchar, e.g. `'omw'`, `'wiktionary'`, null for manual) and `source_id` (text, the pipeline's internal identifier) with a unique constraint on the pair:
```ts
unique("unique_source_id").on(table.source, table.source_id);
```
Postgres allows multiple `NULL` pairs under a unique constraint, so manual entries don't conflict. For existing OMW terms, backfill `source = 'omw'` and `source_id = synset_id`. `synset_id` remains for now to avoid pipeline churn — deprecate it during a future pipeline refactor.
No CHECK constraint on `source` — it is only written by controlled import scripts, not user input. A free varchar is sufficient.
### Translations: `cefr_level` column (deferred population, not on `terms`)
CEFR difficulty is language-relative, not concept-relative. "House" in English is A1, "domicile" is also English but B2 — same concept, different words, different difficulty. Moving `cefr_level` to `translations` allows each language's word to have its own level independently.
Added as nullable `varchar(2)` with CHECK constraint against `CEFR_LEVELS` (`A1``C2`) on the `translations` table. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Also included in the `translations` index since the quiz query filters on it:
```ts
index("idx_translations_lang").on(
table.language_code,
table.cefr_level,
table.term_id,
);
```
### `language_pairs` table: dropped

Valid language pairs are already implicitly defined by `decks.source_language` + `decks.validated_languages`. The table was redundant — the same information can be derived directly from decks:

```sql
SELECT DISTINCT source_language, unnest(validated_languages) AS target_language
FROM decks
WHERE validated_languages != '{}'
```

The only thing `language_pairs` added was an `active` flag to manually disable a direction. This is an edge case not needed for MVP. Dropped to remove a maintenance surface that required staying in sync with deck data.
### Schema: `categories` + `term_categories` (empty for MVP)

Added to schema now, left empty for MVP. Grammar and Media work without them — Grammar maps to POS (already on `terms`), Media maps to deck membership. Thematic categories (animals, kitchen, etc.) require a metadata source that is still under research.

```sql
categories: id, slug, label, created_at
term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
```

See Open Research section for source options.

### Schema constraints: CHECK over pgEnum for extensible value sets

**Question:** use `pgEnum` for columns like `pos`, `cefr_level`, and `source`, since the values are driven by TypeScript constants anyway?

**Decision:** no. Use CHECK constraints for any value set that will grow over time.

**Reason:** `ALTER TYPE enum_name ADD VALUE` in Postgres is non-transactional — it cannot be rolled back if a migration fails partway through, leaving the DB in a dirty state that requires manual intervention. CHECK constraints are fully transactional — if the migration fails it rolls back cleanly.

**Rule of thumb:** pgEnum is appropriate for truly static value sets that will never grow (e.g. `('pending', 'active', 'cancelled')` on an orders table). Any value set tied to a growing constant in the codebase (`SUPPORTED_POS`, `CEFR_LEVELS`, `SUPPORTED_LANGUAGE_CODES`) stays as a CHECK constraint.

### Schema constraints: `language_code` always CHECK-constrained

`language_code` columns on `translations` and `term_glosses` are constrained via CHECK against `SUPPORTED_LANGUAGE_CODES`, the same pattern used for `pos` and `cefr_level`.

**Reason:** unlike `source`, which is only written by controlled import scripts and where failing silently is recoverable, `language_code` is a query-critical filter column. A typo (`'ita'` instead of `'it'`, `'en '` with a trailing space) would silently produce missing data in the UI — terms with no translation shown, glosses not displayed — which is harder to debug than a DB constraint violation.

**Rule:** any column that game queries filter on should be CHECK-constrained. Columns only used for internal bookkeeping (like `source`) can be left as free varchars.
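One way to keep such CHECK constraints and the TypeScript types in lockstep is to derive the constraint expression from the shared constant. A sketch under assumptions: the constant values match the document, but the helper name is hypothetical and the real schema builds the expression through Drizzle rather than raw strings:

```typescript
// The shared constant is the single source of truth.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;

// Build a CHECK expression from a constant list, so the DB constraint and the
// derived TypeScript union can never drift apart.
function checkInList(column: string, values: readonly string[]): string {
  return `CHECK (${column} IN (${values.map((v) => `'${v}'`).join(", ")}))`;
}

const cefrCheck = checkInList("cefr_level", CEFR_LEVELS);
// → CHECK (cefr_level IN ('A1', 'A2', 'B1', 'B2', 'C1', 'C2'))
```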
### Schema: unique constraints make explicit FK indexes redundant
Postgres automatically creates an index to enforce a unique constraint. An explicit index on a column that is already the leading column of a unique constraint is redundant.
Example: `unique("unique_term_gloss").on(term_id, language_code, text)` already indexes `term_id` as the leading column. A separate `index("idx_term_glosses_term").on(term_id)` adds no value and was dropped.
**Rule:** before adding an explicit index, check whether an existing unique constraint already covers it.
### Future extensions: morphology and pronunciation (deferred, additive)
The following features are explicitly deferred post-MVP. All are purely additive — new tables referencing existing `terms` rows via FK. No existing schema changes required when implemented:
- `noun_forms` — gender, singular, plural, articles per language (source: Wiktionary)
- `verb_forms` — conjugation tables per language (source: Wiktionary)
- `term_pronunciations` — IPA and audio URLs per language (source: Wiktionary / Forvo)
Exercise types split naturally into Type A (translation, current model) and Type B (morphology, future). The data layer is independent — the same `terms` anchor both.
---

## Data Pipeline

### Seeding v1: batch, truncate-based

For dev/first-time setup. Read JSON, batch inserts in groups of 500, truncate tables before each run. Simple and fast.

Key pitfalls encountered:

- Duplicate key on re-run: truncate before seeding
- `onConflictDoNothing` breaks FK references: when it skips a `terms` insert, the in-memory UUID is never written, causing FK violations on `translations`
- `forEach` doesn't await: use `for...of`
- Final batch not flushed: guard with `if (termsArray.length > 0)` after the loop

### Seeding v2: incremental upsert, multi-file

For production / adding languages. Extends the database without truncating. Each synset is processed individually (no batching — the real `term.id` is needed from the DB before inserting translations). Filename convention: `sourcelang-targetlang-pos.json`.

### CEFR enrichment pipeline

Staged ETL: `extract-*.py` → `compare-*.py` (quality gate) → `merge-*.py` (resolve conflicts) → `enrich.ts` (write to DB). Source priority: English `en_m3 > cefrj > octanove > random`, Italian `it_m3 > italian`.

Enrichment results: English 42,527/171,394 (~25%), Italian 23,061/54,603 (~42%). Both sufficient for MVP. Italian C2 has only 242 terms — noted as a constraint for the distractor algorithm.

### Term glosses: Italian coverage is sparse (expected)

OMW gloss data is primarily in English. After full import:

- English glosses: 95,882 (~100% of terms)
- Italian glosses: 1,964 (~2% of terms)

This is not a data pipeline problem — it reflects the actual state of OMW. Italian glosses simply don't exist for most synsets in the dataset.

**Handling in the UI:** fall back to the English gloss when no gloss exists for the user's language. This is acceptable UX — a definition in the wrong language is better than no definition at all.

If Italian gloss coverage needs to improve in the future, Wiktionary is the most likely source — it has broader multilingual definition coverage than OMW.
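The `forEach`-doesn't-await seeding pitfall noted above can be sketched in isolation (the function name is hypothetical):

```typescript
// `items.forEach(async (i) => { await ...; })` fires the callbacks and returns
// immediately — nothing waits for them, so the seed script "finishes" before
// the inserts do. `for...of` awaits each iteration in order.
async function processAll(items: number[]): Promise<number[]> {
  const done: number[] = [];
  for (const item of items) {
    await Promise.resolve(); // stand-in for an awaited DB insert
    done.push(item);
  }
  return done;
}
```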
### Glosses can leak answers
Some WordNet glosses contain the target-language word in the definition text (e.g. "Padre" in the English gloss for "father"). Address during post-MVP data enrichment — clean glosses, replace with custom definitions, or filter at service layer.
### `packages/db` exports fix
The `exports` field must be an object, not an array:
```json
"exports": {
".": "./src/index.ts",
"./schema": "./src/db/schema.ts"
}
```
---
## API Development: Problems & Solutions
1. **Messy API structure.** Responsibilities bleeding across layers. Fixed with strict layered architecture.
2. **No shared contract.** API could return different shapes silently. Fixed with Zod schemas in `packages/shared`.
3. **Type safety gaps.** `any` types, `Number` vs `number`. Fixed with derived types from constants.
4. **`getGameTerms` in wrong package.** Model queries in `apps/api` meant direct `drizzle-orm` dependency. Moved to `packages/db/src/models/`.
5. **Deck generation complexity.** 12 decks assumed, only 2 needed. Then skipped entirely for MVP — query terms table directly.
6. **GAME_ROUNDS type conflict.** `z.enum()` only accepts strings. Keep as strings, convert to number in service.
7. **Gloss join multiplied rows.** Multiple glosses per term per language. Fixed by tightening unique constraint.
8. **Model leaked quiz semantics.** Return fields named `prompt`/`answer`. Renamed to neutral `sourceText`/`targetText`.
9. **AnswerResult wasn't self-contained.** Frontend needed `selectedOptionId` but schema didn't include it. Added.
10. **Distractor could duplicate correct answer.** Different terms with same translation. Fixed with `ne(translations.text, excludeText)`.
11. **TypeScript strict mode flagged Fisher-Yates shuffle.** `noUncheckedIndexedAccess` treats `result[i]` as `T | undefined`. Fixed with non-null assertion + temp variable.
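The shuffle fix in item 11 can be sketched as follows; this is a standard Fisher-Yates written to satisfy `noUncheckedIndexedAccess`, not the project's exact code:

```typescript
// Fisher-Yates shuffle. Under noUncheckedIndexedAccess, `result[i]` is typed
// `T | undefined`; the temp variable plus non-null assertions keep the swap
// type-safe, since i and j are always in bounds.
function shuffle<T>(items: readonly T[]): T[] {
  const result = [...items];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    const tmp = result[i]!;
    result[i] = result[j]!;
    result[j] = tmp;
  }
  return result;
}
```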
---
## Open Research
### Semantic category metadata source
Categories (`animals`, `kitchen`, etc.) are in the schema but empty for MVP.
Grammar and Media work without them (Grammar = POS filter, Media = deck membership).
Needs research before populating `term_categories`. Options:
**Option 1: WordNet domain labels**
Already in OMW, extractable in the existing pipeline. Free, no extra dependency.
Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals").
**Option 2: Princeton WordNet Domains**
Separate project built on WordNet. ~200 hierarchical domains mapped to synsets. More structured
and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1.
**Option 3: Kelly Project**
Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning,
multiple languages. Could solve frequency tiers (`cefr_level`) and semantic categories in one shot.
Investigate coverage for the project's languages and POS range first.
**Option 4: BabelNet / WikiData**
Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories.
Problem: complex integration, BabelNet has commercial licensing restrictions, WikiData category
trees are deep and noisy.
**Option 5: LLM-assisted categorization**
Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import.
Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible
without saving output. Good fallback if structured sources have insufficient coverage.
**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill**
Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass.
Combines automation with control. Likely the most practical approach.
**Option 7: Manual curation**
Flat file mapping synset IDs to the project's own category slugs. Full control, matches UI exactly.
Too expensive at scale — only viable for small curated additions on top of an automated baseline.
**Current recommendation:** research Kelly Project first. If coverage is insufficient, go with Option 6.
---
### SUBTLEX → `cefr_level` mapping strategy
## Current State
Raw frequency ranks need mapping to A1C2 bands before tiered decks are meaningful. Decision pending.
Phase 0 complete. Phase 1 data pipeline complete. Phase 2 data model finalized and migrated.
### Future extensions: morphology and pronunciation
### Completed (Phase 1 — data pipeline)
All deferred post-MVP, purely additive (new tables referencing existing `terms`):
- [x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json`
- [x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
- [x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
- [x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
- [x] Write `packages/db/src/generating-decks.ts` — idempotent deck generation script
- reads and deduplicates source wordlist
- matches words to DB terms (homonyms included)
- writes unmatched words to `-missing` file
- determines `validated_languages` by checking full translation coverage per language
- creates deck if it doesn't exist, adds only missing terms on subsequent runs
- recalculates and persists `validated_languages` on every run
### Completed (Phase 2 — data model)
- [x] `synset_id` removed, replaced by `source` + `source_id` on `terms`
- [x] `cefr_level` added to `translations` (not `terms` — difficulty is language-relative)
- [x] `language_code` CHECK constraint added to `translations` and `term_glosses`
- [x] `language_pairs` table dropped — pairs derived from decks at query time
- [x] `is_public` and `added_at` dropped from `decks` and `deck_terms`
- [x] `type` added to `decks` with CHECK against `SUPPORTED_DECK_TYPES`
- [x] `topics` and `term_topics` tables added (empty for MVP)
- [x] Migration generated and run against fresh database
### Known data facts (pre-wipe, for reference)
- Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate)
- Term IDs resolved: 3171 (higher than word count due to homonyms)
- Words not found in DB: 34
- Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`
### Next (Phase 3 — data pipeline + API)
1. done
2. done
3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
5. **Generate decks** — run generation script with SUBTLEX-grounded wordlists per source language
6. **Finalize game selection flow** — direction → category → POS → difficulty → round count
7. **Define Zod schemas in `packages/shared`** — based on finalized game flow and API shape
8. **Implement API**
9. **Add word-form tables** — deferred post-MVP, purely additive (new tables referencing existing `terms`):
   - `noun_forms` — gender, singular, plural, articles per language (source: Wiktionary)
   - `verb_forms` — conjugation tables per language (source: Wiktionary)
   - `term_pronunciations` — IPA and audio URLs per language (source: Wiktionary / Forvo)
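For item 4, one plausible (but undecided) SUBTLEX → CEFR mapping is a rank-threshold scheme — the thresholds below are placeholders for illustration, not a committed choice:

```typescript
const CEFR_BANDS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
type CefrLevel = (typeof CEFR_BANDS)[number];

// rank is the 1-based position in the per-language SUBTLEX frequency list.
// Thresholds are illustrative: the real cut-offs still need to be decided.
function cefrFromRank(
  rank: number,
  thresholds: number[] = [500, 1000, 2000, 4000, 8000],
): CefrLevel {
  const idx = thresholds.findIndex((t) => rank <= t);
  return idx === -1 ? "C2" : CEFR_BANDS[idx];
}
```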

# glossa mvp
> **This document is the single source of truth for the project.**
> It is written to be handed to any LLM as context. It contains the project vision, the current MVP scope, the tech stack, the working methodology, and the roadmap.
---
## 1. Project Overview
A vocabulary trainer for English–Italian words. The quiz format is Duolingo-style: one word is shown as a prompt, and the user picks the correct translation from four choices (1 correct + 3 distractors of the same part-of-speech). The long-term vision is a multiplayer competitive game, but the MVP is a polished singleplayer experience.
**The core learning loop:**
Show word → pick answer → see result → next word → final score
The vocabulary data comes from WordNet + the Open Multilingual Wordnet (OMW). A one-time Python script extracts English–Italian noun pairs and seeds the database. The data model is language-pair agnostic by design — adding a new language later requires no schema changes.
---
## 2. What the Full Product Looks Like (Long-Term Vision)
- Users log in via Google or GitHub (OpenAuth)
- Singleplayer mode: 10-round quiz, score screen
- Multiplayer mode: create a room, share a code, 2–4 players answer simultaneously in real time, live scores, winner screen
- 1000+ English–Italian nouns seeded from WordNet
This is documented in `spec.md` and the full `roadmap.md`. The MVP deliberately ignores most of it.
---
## 3. MVP Scope
**Goal:** A working, presentable singleplayer quiz that can be shown to real people.
### What is IN the MVP
- Vocabulary data in a PostgreSQL database (already seeded)
- REST API that returns quiz terms with distractors
- Singleplayer quiz UI: 10 questions, answer feedback, score screen
- Clean, mobile-friendly UI (Tailwind + shadcn/ui)
- Local dev only (no deployment for MVP)
### What is CUT from the MVP
| Feature | Why cut |
| ------------------------------- | -------------------------------------- |
| Authentication (OpenAuth) | No user accounts needed for a demo |
| Multiplayer (WebSockets, rooms) | Core quiz works without it |
| Valkey / Redis cache | Only needed for multiplayer room state |
| Deployment to Hetzner | Ship to people locally first |
| User stats / profiles | Needs auth |
| Testing suite | Add after the UI stabilises |
These are not deleted from the plan — they are deferred. The architecture is already designed to support them. See Section 9 (Post-MVP Ladder).
---
## 4. Technology Stack
The monorepo structure and tooling are already set up (Phase 0 complete). This is the full stack — the MVP uses a subset of it.
| Layer | Technology | MVP? |
| ------------ | ------------------------------ | ----------- |
| Monorepo | pnpm workspaces | ✅ |
| Frontend | React 18, Vite, TypeScript | ✅ |
| Routing | TanStack Router | ✅ |
| Server state | TanStack Query | ✅ |
| Client state | Zustand | ✅ |
| Styling | Tailwind CSS + shadcn/ui | ✅ |
| Backend | Node.js, Express, TypeScript | ✅ |
| Database | PostgreSQL + Drizzle ORM | ✅ |
| Validation | Zod (shared schemas) | ✅ |
| Auth | OpenAuth (Google + GitHub) | ❌ post-MVP |
| Realtime | WebSockets (`ws` library) | ❌ post-MVP |
| Cache | Valkey | ❌ post-MVP |
| Testing | Vitest, React Testing Library | ❌ post-MVP |
| Deployment | Docker Compose, Hetzner, Nginx | ❌ post-MVP |
### Repository Structure (actual, as of Phase 1 data pipeline complete)
```text
vocab-trainer/
├── apps/
│ ├── api/
│ │ └── src/
│ │ ├── app.ts # createApp() factory — routes registered here
│ │ └── server.ts # calls app.listen()
│ └── web/
│ └── src/
│ ├── routes/
│ │ ├── __root.tsx
│ │ ├── index.tsx # placeholder landing page
│ │ └── about.tsx
│ ├── main.tsx
│ └── index.css
├── packages/
│ ├── shared/
│ │ └── src/
│ │ ├── index.ts # empty — Zod schemas go here next
│ │ └── constants.ts
│ └── db/
│ ├── drizzle/ # migration SQL files
│ └── src/
│ ├── db/schema.ts # full Drizzle schema
│ ├── seeding-datafiles.ts # seeds terms + translations
│ │ ├── generating-decks.ts # builds curated decks
│ └── index.ts
├── documentation/ # all project docs live here
│ ├── spec.md
│ ├── roadmap.md
│ ├── decisions.md
│ ├── mvp.md # this file
│ └── CLAUDE.md
├── scripts/
│ ├── extract-en-it-nouns.py
│ ├── datafiles/en-it-nouns.json
├── docker-compose.yml
└── pnpm-workspace.yaml
```
**What does not exist yet (to be built in MVP phases):**
- `apps/api/src/routes/` — no route handlers yet
- `apps/api/src/services/` — no business logic yet
- `apps/api/src/repositories/` — no DB queries yet
- `apps/web/src/components/` — no UI components yet
- `apps/web/src/stores/` — no Zustand store yet
- `apps/web/src/lib/api.ts` — no TanStack Query wrappers yet
- `packages/shared/src/schemas/` — no Zod schemas yet
`packages/shared` is the contract between frontend and backend. All request/response shapes are defined there as Zod schemas — never duplicated.
---
## 5. Data Model (relevant tables for MVP)
```typescript
// Imports assumed from the project's setup: drizzle-orm plus the shared
// constants (the constants' import path here is illustrative).
import { sql } from "drizzle-orm";
import {
  boolean,
  check,
  index,
  pgTable,
  primaryKey,
  text,
  timestamp,
  unique,
  uuid,
  varchar,
} from "drizzle-orm/pg-core";
import { SUPPORTED_POS, SUPPORTED_LANGUAGE_CODES } from "../constants";

export const terms = pgTable(
"terms",
{
id: uuid().primaryKey().defaultRandom(),
synset_id: text().unique().notNull(),
pos: varchar({ length: 20 }).notNull(),
created_at: timestamp({ withTimezone: true }).defaultNow().notNull(),
},
(table) => [
check(
"pos_check",
sql`${table.pos} IN (${sql.raw(SUPPORTED_POS.map((p) => `'${p}'`).join(", "))})`,
),
index("idx_terms_pos").on(table.pos),
],
);
export const translations = pgTable(
"translations",
{
id: uuid().primaryKey().defaultRandom(),
term_id: uuid()
.notNull()
.references(() => terms.id, { onDelete: "cascade" }),
language_code: varchar({ length: 10 }).notNull(),
text: text().notNull(),
created_at: timestamp({ withTimezone: true }).defaultNow().notNull(),
},
(table) => [
unique("unique_translations").on(
table.term_id,
table.language_code,
table.text,
),
index("idx_translations_lang").on(table.language_code, table.term_id),
],
);
export const decks = pgTable(
"decks",
{
id: uuid().primaryKey().defaultRandom(),
name: text().notNull(),
description: text(),
source_language: varchar({ length: 10 }).notNull(),
validated_languages: varchar({ length: 10 }).array().notNull().default([]),
is_public: boolean().default(false).notNull(),
created_at: timestamp({ withTimezone: true }).defaultNow().notNull(),
},
(table) => [
check(
"source_language_check",
sql`${table.source_language} IN (${sql.raw(SUPPORTED_LANGUAGE_CODES.map((l) => `'${l}'`).join(", "))})`,
),
check(
"validated_languages_check",
sql`validated_languages <@ ARRAY[${sql.raw(SUPPORTED_LANGUAGE_CODES.map((l) => `'${l}'`).join(", "))}]::varchar[]`,
),
check(
"validated_languages_excludes_source",
sql`NOT (${table.source_language} = ANY(${table.validated_languages}))`,
),
unique("unique_deck_name").on(table.name, table.source_language),
],
);
export const deck_terms = pgTable(
"deck_terms",
{
deck_id: uuid()
.notNull()
.references(() => decks.id, { onDelete: "cascade" }),
term_id: uuid()
.notNull()
.references(() => terms.id, { onDelete: "cascade" }),
added_at: timestamp({ withTimezone: true }).defaultNow().notNull(),
},
(table) => [primaryKey({ columns: [table.deck_id, table.term_id] })],
);
```
The seed + deck-build scripts have already been run. Data exists in the database.
---
## 6. API Endpoints (MVP)
All endpoints prefixed `/api`. Schemas live in `packages/shared` and are validated with Zod on both sides.
| Method | Path | Description |
| ------ | ---------------------- | --------------------------------------- |
| GET | `/api/health` | Health check (already done) |
| GET | `/api/language-pairs` | List active language pairs |
| GET | `/api/decks` | List available decks |
| GET | `/api/decks/:id/terms` | Fetch terms with distractors for a quiz |
### Distractor Logic
The `QuizService` picks 3 distractors server-side:
- Same part-of-speech as the correct answer
- Never the correct answer
- Never repeated within a session
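The three rules above can be sketched as a pure selection function (names hypothetical; the real `QuizService` would draw the candidate pool from the deck's terms):

```typescript
type Candidate = { termId: string; text: string; pos: string };

// Pick `count` distractors: same POS, never the correct answer,
// never a text already shown this session.
function pickDistractors(
  correct: Candidate,
  pool: Candidate[],
  usedTexts: Set<string>, // texts already shown this session
  count = 3,
): Candidate[] {
  const eligible = pool.filter(
    (c) =>
      c.pos === correct.pos &&
      c.text !== correct.text &&
      c.termId !== correct.termId &&
      !usedTexts.has(c.text),
  );
  // Fisher–Yates shuffle, then take the first `count`.
  for (let i = eligible.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [eligible[i], eligible[j]] = [eligible[j], eligible[i]];
  }
  const picked = eligible.slice(0, count);
  picked.forEach((c) => usedTexts.add(c.text));
  return picked;
}
```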
---
## 7. Frontend Structure (MVP)
```text
apps/web/src/
├── routes/
│ ├── index.tsx # Landing page / mode select
│ └── singleplayer/
│ └── index.tsx # The quiz
├── components/
│ ├── quiz/
│ │ ├── QuestionCard.tsx # Prompt word + 4 answer buttons
│ │ ├── OptionButton.tsx # idle / correct / wrong states
│ │ └── ScoreScreen.tsx # Final score + play again
│ └── ui/ # shadcn/ui wrappers
├── stores/
│ └── gameStore.ts # Zustand: question index, score, answers
└── lib/
└── api.ts # TanStack Query fetch wrappers
```
### State Management
TanStack Query handles fetching quiz data from the API. Zustand handles the local quiz session (current question index, score, selected answers). There is no overlap between the two.
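A dependency-free sketch of the quiz-session state the Zustand store would hold — in the app this would be `create()` from zustand; the field names are assumptions:

```typescript
type QuizState = {
  questionIndex: number;
  score: number;
  answers: (string | null)[]; // selected answer per question, null = unanswered
};

const initialState = (rounds: number): QuizState => ({
  questionIndex: 0,
  score: 0,
  answers: Array(rounds).fill(null),
});

// Record the current answer, bump the score if correct, advance the index.
// Immutable update, as a Zustand set() callback would do it.
function answerCurrent(s: QuizState, choice: string, correct: boolean): QuizState {
  const answers = s.answers.slice();
  answers[s.questionIndex] = choice;
  return {
    questionIndex: s.questionIndex + 1,
    score: s.score + (correct ? 1 : 0),
    answers,
  };
}
```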
---
## 8. Working Methodology
> **Read this section before asking for help with any task.**
This project is a learning exercise. The goal is to understand the code, not just to ship it.
### How tasks are structured
The roadmap (Section 10) lists broad phases. When work starts on a phase, it gets broken into smaller, concrete subtasks with clear done-conditions before any code is written.
### How to use an LLM for help
When asking an LLM for help:
1. **Paste this document** (or the relevant sections) as context
2. **Describe what you're working on** and what specifically you're stuck on
3. **Ask for hints, not solutions.** Example prompts:
- "I'm trying to implement X. My current approach is Y. What am I missing conceptually?"
- "Here is my code. What would you change about the structure and why?"
- "Can you point me to the relevant docs for Z?"
### Refactoring workflow
After completing a task or a block of work:
1. Share the current state of the code with the LLM
2. Ask: _"What would you refactor here, and why? Don't show me the code — point me in the right direction and link relevant documentation."_
3. The LLM should explain the _what_ and _why_, link to relevant docs/guides, and let you implement the fix yourself
**The LLM should never write the implementation for you.** If it does, ask it to delete it and explain the concept instead.
### Decisions log
Keep a `decisions.md` file in the root. When you make a non-obvious choice (a library, a pattern, a trade-off), write one short paragraph explaining what you chose and why. This is also useful context for any LLM session.
---
## 9. Game Mechanics
- **Format**: source-language word prompt + 4 target-language choices
- **Distractors**: same POS, server-side, never the correct answer, no repeats in a session
- **Session length**: 10 questions
- **Scoring**: +1 per correct answer (no speed bonus for MVP)
- **Timer**: none in singleplayer MVP
- **No auth required**: anonymous users
---
## 10. MVP Roadmap
> Tasks are written at a high level. When starting a phase, break it into smaller subtasks before writing any code.
### Current Status
**Phase 0 (Foundation) — ✅ Complete**
**Phase 1 (Vocabulary Data) — 🔄 Data pipeline complete. API layer is the immediate next step.**
What is already in the database:
- 999 unique English terms (nouns), fully seeded from WordNet/OMW
- 3171 term IDs resolved (higher than word count due to homonyms)
- Full Italian translation coverage (3171/3171 terms)
- Decks created and populated via `packages/db/src/generating-decks.ts`
- 34 words from the source wordlist had no WordNet match (expected, not a bug)
---
### Phase 1 — Finish the API Layer
**Goal:** The frontend can fetch quiz data from the API.
**Done when:** `GET /api/decks/1/terms?limit=10` returns 10 terms, each with 3 distractors of the same POS attached.
**Broadly, what needs to happen:**
- Define Zod response schemas in `packages/shared` for terms, decks, and language pairs
- Implement a repository layer that queries the DB for terms belonging to a deck
- Implement a service layer that attaches distractors to each term (same POS, no duplicates, no correct answer included)
- Wire up the REST endpoints (`GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms`)
- Manually test the endpoints (curl or a REST client like Bruno/Insomnia)
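The Zod schemas do not exist yet; as a sketch of what each item in the `GET /api/decks/:id/terms` response might look like, here is an assumed shape as plain TypeScript with a minimal runtime guard (in the project this would be a Zod schema in `packages/shared`, with the type derived via `z.infer`; field names are assumptions):

```typescript
type QuizTerm = {
  termId: string;
  prompt: string;        // word in the source language
  answer: string;        // correct translation
  distractors: string[]; // exactly 3, same POS, never equal to the answer
};

function isQuizTerm(v: unknown): v is QuizTerm {
  const o = v as QuizTerm;
  return (
    !!o &&
    typeof o.termId === "string" &&
    typeof o.prompt === "string" &&
    typeof o.answer === "string" &&
    Array.isArray(o.distractors) &&
    o.distractors.length === 3 &&
    o.distractors.every((d) => typeof d === "string" && d !== o.answer)
  );
}
```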
**Key concepts to understand before starting:**
- Drizzle ORM query patterns (joins, where clauses)
- The repository pattern (data access separated from business logic)
- Zod schema definition and inference
- How pnpm workspace packages reference each other
---
### Phase 2 — Singleplayer Quiz UI
**Goal:** A user can complete a full 10-question quiz in the browser.
**Done when:** User visits `/singleplayer`, answers 10 questions, sees a score screen, and can play again.
**Broadly, what needs to happen:**
- Build the `QuestionCard` component (prompt word + 4 answer buttons)
- Build the `OptionButton` component with three visual states: idle, correct, wrong
- Build the `ScoreScreen` component (score summary + play again)
- Implement a Zustand store to track quiz session state (current question index, score, whether an answer has been picked)
- Wire up TanStack Query to fetch terms from the API on mount
- Create the `/singleplayer` route and assemble the components
- Handle the between-question transition (brief delay showing result → next question)
**Key concepts to understand before starting:**
- TanStack Query: `useQuery`, loading/error states
- Zustand: defining a store, reading and writing state from components
- TanStack Router: defining routes, navigating between them
- React component composition
- Controlled state for the answer selection (which button is selected, when to lock input)
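The last point — locking input after the first pick — can be modelled as a tiny state transition, independent of React (a sketch, not the component code):

```typescript
type AnswerPhase =
  | { kind: "open" }
  | { kind: "locked"; picked: string; correct: boolean };

// First click locks the question; further clicks are ignored.
function pick(phase: AnswerPhase, choice: string, answer: string): AnswerPhase {
  if (phase.kind === "locked") return phase;
  return { kind: "locked", picked: choice, correct: choice === answer };
}
```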
---
### Phase 3 — UI Polish
**Goal:** The app looks good enough to show to people.
**Done when:** The quiz is usable on mobile, readable on desktop, and has a coherent visual style.
**Broadly, what needs to happen:**
- Apply Tailwind utility classes and shadcn/ui components consistently
- Make the layout mobile-first (touch-friendly buttons, readable font sizes)
- Add a simple landing page (`/`) with a "Start Quiz" button
- Add loading and error states for the API fetch
- Visual feedback on correct/wrong answers (colour, maybe a brief animation)
- Deck selection: let the user pick a deck from a list before starting
**Key concepts to understand before starting:**
- Tailwind CSS utility-first approach
- shadcn/ui component library and how to add components
- Responsive design with Tailwind breakpoints
- CSS transitions for simple animations
---
## 11. Key Technical Decisions
These are the non-obvious decisions already made. Any LLM helping with this project should be aware of them and not suggest alternatives without good reason.
### Architecture
**Express app: factory function pattern**
`app.ts` exports `createApp()`. `server.ts` imports it and calls `.listen()`. This keeps tests isolated — a test can import the app without starting a server.
**Layered architecture: routes → services → repositories**
Business logic lives in services, not route handlers or repositories. Each layer only talks to the layer directly below it. For the MVP API, this means:
- `routes/` — parse request, call service, return response
- `services/` — business logic (e.g. attaching distractors)
- `repositories/` — all DB queries live here, nowhere else
**Shared Zod schemas in `packages/shared`**
All request/response shapes are defined once as Zod schemas in `packages/shared` and imported by both `apps/api` and `apps/web`. Types are inferred from schemas (`z.infer<typeof Schema>`), never written by hand.
### Data Model
**Decks separate from terms (not frequency-rank filtering)**
Terms are raw WordNet data. Decks are curated lists. This separation exists because WordNet frequency data is unreliable for learning — common chemical element symbols ranked highly, for example. Bad words are excluded at the deck level, not filtered from `terms`.
**Deck language model: `source_language` + `validated_languages` array**
A deck is not tied to a single language pair. `source_language` is the language the wordlist was curated from. `validated_languages` is an array of target languages with full translation coverage — calculated and updated by the deck generation script on every run.
### Tooling
**Drizzle ORM (not Prisma):** No binary, no engine. Queries map closely to SQL. Works naturally with Zod. Migrations are plain SQL files.
**`tsx` as TypeScript runner (not `ts-node`):** Faster, zero config, uses esbuild. Does not type-check — that is handled by `tsc` and the editor.
**pnpm workspaces (not Turborepo):** Two apps don't need the extra build caching complexity.
---
## 12. Post-MVP Ladder
These phases are deferred but planned. The architecture already supports them.
| Phase | What it adds |
| ----------------- | -------------------------------------------------------------- |
| Auth | OpenAuth (Google + GitHub), JWT middleware, user rows in DB |
| User Stats | Games played, score history, profile page |
| Multiplayer Lobby | Room creation, join by code, WebSocket connection |
| Multiplayer Game | Simultaneous answers, server timer, live scores, winner screen |
| Deployment | Docker Compose prod config, Nginx, Let's Encrypt, Hetzner VPS |
| Hardening | Rate limiting, error boundaries, CI/CD, DB backups |
Each of these maps to a phase in the full `roadmap.md`.
---
## 13. Definition of Done (MVP)
- [ ] `GET /api/decks/:id/terms` returns terms with correct distractors
- [ ] User can complete a 10-question quiz without errors
- [ ] Score screen shows final result and a play-again option
- [ ] App is usable on a mobile screen
- [ ] No hardcoded data — everything comes from the database

# Vocabulary Trainer — Roadmap
Each phase produces a working, deployable increment. Nothing is built speculatively.
## Phase 0 — Foundation
Goal: Empty repo that builds, lints, and runs end-to-end.
Done when: `pnpm dev` starts both apps; `GET /api/health` returns 200; React renders a hello page.
- [x] Initialise pnpm workspace monorepo: `apps/web`, `apps/api`, `packages/shared`, `packages/db`
- [x] Configure TypeScript project references across packages
- [x] Set up ESLint + Prettier with shared configs in root
- [x] Set up Vitest in `api`, `web`, and both packages
- [x] Scaffold Express app with `GET /api/health`
- [x] Scaffold Vite + React app with TanStack Router (single root route)
- [x] Configure Drizzle ORM + connection to local PostgreSQL
- [x] Write first migration (empty — just validates the pipeline works)
- [x] `docker-compose.yml` for local dev: `api`, `web`, `postgres`, `valkey`
- [x] `.env.example` files for `apps/api` and `apps/web`
- [x] update decisions.md
## Phase 1 — Vocabulary Data
Goal: Word data lives in the DB and can be queried via the API.
Done when: `GET /api/decks/1/terms?limit=10` returns 10 terms from a specific deck.
- [x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json`
- [x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
- [x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
- [x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
- [x] Download CEFR A1/A2 noun lists (from GitHub repos)
- [x] Write `scripts/build_decks.ts` (reads external CEFR lists, matches to DB, creates decks)
- [x] Run `pnpm db:seed` → populates terms
- [x] Run `pnpm db:build-deck` → creates curated decks
- [x] Define Zod response schemas in `packages/shared`
- [x] Implement `DeckRepository.getTerms(deckId, limit, offset)` => no decks needed anymore
- [x] Implement `QuizService.attachDistractors(terms)` — same POS, server-side, no duplicates
- [x] Implement `GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms` endpoints => no language pairs, not needed anymore
- [ ] Unit tests for `QuizService` (correct POS filtering, never includes the answer)
- [ ] update decisions.md
## Phase 2 — Auth
Goal: Users can log in via Google or GitHub and stay logged in.
Done when: JWT from OpenAuth is validated by the API; protected routes redirect unauthenticated users; user row is created on first login.
- [ ] Add OpenAuth service to `docker-compose.yml`
- [ ] Write Drizzle schema: `users` (uuid `id`, text `openauth_sub`, no games_played/won columns)
- [ ] Write and run migration (includes `updated_at` + triggers)
- [ ] Implement JWT validation middleware in `apps/api`
- [ ] Implement `GET /api/auth/me` (validate token, upsert user row via `openauth_sub`, return user)
- [ ] Define auth Zod schemas in `packages/shared`
- [ ] Frontend: login page with "Continue with Google" + "Continue with GitHub" buttons
- [ ] Frontend: redirect to `auth.yourdomain.com` → receive JWT → store in memory + HttpOnly cookie
- [ ] Frontend: TanStack Router auth guard (redirects unauthenticated users)
- [ ] Frontend: TanStack Query `api.ts` attaches token to every request
- [ ] Unit tests for JWT middleware
- [ ] update decisions.md
## Phase 3 — Single-player Mode
Goal: A logged-in user can complete a full solo quiz session.
Done when: User sees 10 questions, picks answers, sees their final score.
- [ ] Frontend: `/singleplayer` route
- [ ] `useQuizSession` hook: fetch terms, manage question index + score state
- [ ] `QuestionCard` component: prompt word + 4 answer buttons
- [ ] `OptionButton` component: idle / correct / wrong states
- [ ] `ScoreScreen` component: final score + play-again button
- [ ] TanStack Query integration for `GET /terms`
- [ ] RTL tests for `QuestionCard` and `OptionButton`
- [ ] update decisions.md
## Phase 4 — Multiplayer Rooms (Lobby)
Goal: Players can create and join rooms; the host sees all joined players in real time.
Done when: Two browser tabs can join the same room and see each other's display names update live via WebSocket.
- [ ] Write Drizzle schema: `rooms`, `room_players` (add `deck_id` FK to rooms)
- [ ] Write and run migration (includes CHECK constraints: `code=UPPER(code)`, `status`, `max_players`)
- [ ] Add indexes: `idx_rooms_host`, `idx_room_players_score`
- [ ] `POST /rooms` and `POST /rooms/:code/join` REST endpoints
- [ ] `RoomService`: create room with short code, join room, enforce max player limit
- [ ] `POST /rooms` accepts `deck_id` (which vocabulary deck to use)
- [ ] WebSocket server: attach `ws` upgrade handler to the Express HTTP server
- [ ] WS auth middleware: validate OpenAuth JWT on upgrade
- [ ] WS message router: dispatch incoming messages by `type`
- [ ] `room:join` / `room:leave` handlers → broadcast `room:state` to all room members
- [ ] Room membership tracked in Valkey (ephemeral) + `room_players` in PostgreSQL (durable)
- [ ] Define all WS event Zod schemas in `packages/shared`
- [ ] Frontend: `/multiplayer/lobby` — create room form + join-by-code form
- [ ] Frontend: `/multiplayer/room/:code` — player list, room code display, "Start Game" (host only)
- [ ] Frontend: `ws.ts` singleton WS client with reconnect on drop
- [ ] Frontend: Zustand `gameStore` handles incoming `room:state` events
- [ ] update decisions.md
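The WS message router item above amounts to a dispatch table keyed by `type`. A minimal sketch (event names are the planned ones; the handler signature is simplified to return a string so the logic is testable without a socket):

```typescript
type WsMessage = { type: string; payload?: unknown };
type Handler = (payload: unknown) => string; // simplified: returns the broadcast event name

// Parse incoming frames and dispatch by `type`; unknown or malformed
// messages map to error events instead of throwing.
function createRouter(handlers: Record<string, Handler>) {
  return (raw: string): string => {
    let msg: WsMessage;
    try {
      msg = JSON.parse(raw);
    } catch {
      return "error:malformed";
    }
    const handler = handlers[msg.type];
    return handler ? handler(msg.payload) : "error:unknown_type";
  };
}
```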
## Phase 5 — Multiplayer Game
Goal: Host starts a game; all players answer simultaneously in real time; a winner is declared.
Done when: 2–4 players complete a 10-round game with correct live scores and a winner screen.
- [ ] `GameService`: generate question sequence for a room, enforce server-side 15 s timer
- [ ] `room:start` WS handler → begin question loop, broadcast first `game:question`
- [ ] `game:answer` WS handler → collect per-player answers
- [ ] On all-answered or timeout → evaluate, broadcast `game:answer_result`
- [ ] After N rounds → broadcast `game:finished`, update `rooms.status` + `room_players.score` in DB (transactional)
- [ ] Frontend: `/multiplayer/game/:code` route
- [ ] Frontend: extend Zustand store with `currentQuestion`, `roundAnswers`, `scores`
- [ ] Frontend: reuse `QuestionCard` + `OptionButton`; add countdown timer ring
- [ ] Frontend: `ScoreBoard` component — live per-player scores after each round
- [ ] Frontend: `GameFinished` screen — winner highlight, final scores, "Play Again" button
- [ ] Unit tests for `GameService` (round evaluation, tie-breaking, timeout auto-advance)
- [ ] update decisions.md
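The round-resolution rule above ("on all-answered or timeout → evaluate") can be sketched independently of WebSockets — a hedged sketch of what `GameService` might compute, with hypothetical names:

```typescript
type RoundState = {
  players: string[];
  answers: Map<string, string>; // playerId → chosen option
  deadline: number;             // epoch ms, set when the question was sent
};

// A round resolves when everyone has answered or the server timer expired.
function roundResolved(r: RoundState, now: number): boolean {
  return now >= r.deadline || r.players.every((p) => r.answers.has(p));
}

// +1 per correct answer; an unanswered round counts as wrong.
function scoreRound(r: RoundState, correctOption: string): Map<string, number> {
  return new Map(
    r.players.map((p): [string, number] => [p, r.answers.get(p) === correctOption ? 1 : 0]),
  );
}
```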
## Phase 6 — Production Deployment
Goal: App is live on Hetzner, accessible via HTTPS on all subdomains.
Done when: `https://app.yourdomain.com` loads; `wss://api.yourdomain.com` connects; auth flow works end-to-end.
- [ ] `docker-compose.prod.yml`: all services + `nginx-proxy` + `acme-companion`
- [ ] Nginx config per container: `VIRTUAL_HOST` + `LETSENCRYPT_HOST` env vars
- [ ] Production `.env` files on VPS (OpenAuth secrets, DB credentials, Valkey URL)
- [ ] Drizzle migration runs on `api` container start (includes CHECK constraints + triggers)
- [ ] Seed production DB (run `seed.ts` once)
- [ ] Smoke test: login → solo game → create room → multiplayer game end-to-end
- [ ] update decisions.md
## Phase 7 — Polish & Hardening (post-MVP)
Not required to ship, but address before real users arrive.
- [ ] Rate limiting on API endpoints (`express-rate-limit`)
- [ ] Graceful WS reconnect with exponential back-off
- [ ] React error boundaries
- [ ] `GET /users/me/stats` endpoint (aggregates from `room_players`) + profile page
- [ ] Accessibility pass (keyboard nav, ARIA on quiz buttons)
- [ ] Favicon, page titles, Open Graph meta
- [ ] CI/CD pipeline (GitHub Actions → SSH deploy on push to `main`)
- [ ] Database backups (cron → Hetzner Object Storage)
- [ ] update decisions.md
## Dependency Graph

```text
Phase 0 (Foundation)
└── Phase 1 (Vocabulary Data)
    └── Phase 2 (Auth)
        ├── Phase 3 (Singleplayer) ← parallel with Phase 4
        └── Phase 4 (Room Lobby)
            └── Phase 5 (Multiplayer Game)
                └── Phase 6 (Deployment)
```
---
## ui sketch
i was sketching the ui of the menu and came up with some questions.
this would be the flow to start a single player game:
main menu => singleplayer, multiplayer, settings
singleplayer => language selection
"i speak english" => "i want to learn italian" (both languages are dropdowns to select the fitting language)
language selection => category selection => pure grammar, media (practicing on song lyrics or breaking bad subtitles)
pure grammar => pos selection => nouns or verbs (in mvp)
nouns has 3 subcategories => singular (1-on-1 translation, dog => cane), plural (plural practice, cane => cani for example), gender/articles (il cane or la cane for example)
verbs has 2 subcategories => infinitive (1-on-1 translation, to talk => parlare) or conjugations (the user is shown the infinitive and a table with all personal pronouns and has to fill in the gaps with the corresponding conjugations)
pos selection => difficulty selection (from a1 to c2)
afterwards start game button
---
this raises the questions:
- how to store the plural, articles of nouns in database
- how to store the conjugations of verbs
- what about ipa?
- links to audiofiles to listen how a word is pronounced?
- one table for italian_verbs, french_nouns, german_adjectives?

# Glossa — Schema & Architecture Discussion Summary
## Project Overview
A vocabulary trainer in the style of Duolingo (see a word, pick from 4 translations). Built as a monorepo with a Drizzle/Postgres data layer. Phase 1 (data pipeline) is complete; the API layer is next.
---
## Game Flow (MVP)
Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count (3 or 10) → game starts.
**Top-level categories (MVP):**
- **Grammar** — practice nouns, verb conjugations, etc.
- **Media** — practice vocabulary from specific books, films, songs, etc.
**Post-MVP categories (not in scope yet):**
- Animals, kitchen, and other thematic word groups
---
## Schema Decisions Made
### Deck model: `source_language` + `validated_languages` (not `pair_id`)
A deck is a curated pool of terms sourced from a specific language (e.g. an English frequency list). The language pair used for a quiz is chosen at session start, not at deck creation.
- `decks.source_language` — the language the wordlist was curated from
- `decks.validated_languages` — array of target language codes for which full translation coverage exists across all terms; recalculated on every generation script run
- Enforced via CHECK: `source_language` is never in `validated_languages`
- One deck serves en→it and en→fr without duplication
### Architecture: deck as curated pool (Option 2)
Three options were considered:
| Option | Description | Problem |
| ------------------ | -------------------------------------------------------- | ------------------------------------------------------- |
| 1. Pure filter | No decks, query the whole terms table | No curatorial control; import junk ends up in the game |
| 2. Deck as pool ✅ | Decks define scope, term metadata drives filtering | Clean separation of concerns |
| 3. Deck as preset | Deck encodes filter config (category + POS + difficulty) | Combinatorial explosion; can't reuse terms across decks |
**Decision: Option 2.** Decks solve the curation problem (which terms are game-ready). Term metadata solves the filtering problem (which subset to show today). These are separate concerns and should stay separate.
The quiz query joins `deck_terms` for scope, then filters by `pos`, `cefr_level`, and later `category` — all independently.
### Missing from schema: `cefr_level` and categories
The game flow requires filtering by difficulty and category, but neither is in the schema yet.
**Difficulty (`cefr_level`):**
- Belongs on `terms`, not on `decks`
- Add as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`)
- Add now (nullable), populate later — backfilling a full terms table post-MVP is costly
**Categories:**
- Separate `categories` table + `term_categories` join table
- Do not use an enum or array on `terms` — a term can belong to multiple categories, and new categories should not require migrations
```sql
categories: id, slug, label, created_at
term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
```
### Deck scope: wordlists, not POS splits
**Rejected approach:** one deck per POS (e.g. `en-nouns`, `en-verbs`). Problem: POS is already a filterable column on `terms`, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication logic in the generation script.
**Decision:** one deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary. The user never sees or picks a deck — they pick "Nouns, B1" and the app resolves that to the right deck + filters.
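The resolution step — "Nouns, B1" → deck + filters — could look like this (the deck-id naming convention and the single active tier are assumptions, not decided):

```typescript
type Selection = { sourceLang: string; pos: string; cefr: string };

// Hypothetical deck naming ("<lang>-core-1000"); activeTiers would later
// come from user learning state (tier unlocking).
function resolveQuery(sel: Selection, activeTiers: string[] = ["core-1000"]) {
  return {
    deckIds: activeTiers.map((t) => `${sel.sourceLang}-${t}`),
    filters: { pos: sel.pos, cefr_level: sel.cefr },
  };
}
```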
### Deck progression: tiered frequency lists
When a user exhausts a deck, the app expands scope by adding the next tier:
```sql
WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
AND t.pos = 'noun'
AND t.cefr_level = 'B1'
```
Requirements for this to work cleanly:
- Decks must not overlap — each word appears in exactly one tier
- The generation script already deduplicates, so this is enforced at import time
- Unlocking logic (when to add the next deck) lives in user learning state, not in the deck structure — for MVP, query all tiers at once or hardcode active decks
### Wordlist source: SUBTLEX (not manual curation)
**Problem:** the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what's actually common in Italian.
**Decision:** use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology — making them comparable across languages.
This maps directly onto `decks.source_language`:
- `en-core-1000` — built from SUBTLEX-EN, used when the user's source language is English
- `it-core-1000` — built from SUBTLEX-IT, used when the source language is Italian
When the user picks en→it, the app queries `en-core-1000`. When they pick it→en, it queries `it-core-1000`. Same translation data, correctly frequency-grounded per direction. Two wordlist files, two generation script runs — the schema already supports this.
### Missing from schema: user learning state
The current schema has no concept of a user's progress. Not blocking for the API layer right now, but will be needed before the game loop is functional:
- `user_decks` — which decks a user is studying
- `user_term_progress` — per `(user_id, term_id, language_pair)`: `next_review_at`, `interval_days`, correct/attempt counts for spaced repetition
- `quiz_answers` — optional history log for stats and debugging
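As a sketch of what `user_term_progress.interval_days` scheduling could look like: the actual spaced-repetition algorithm is not yet specified, so the doubling rule and the cap below are placeholder assumptions, not a design decision.

```typescript
// Placeholder spaced-repetition update for user_term_progress:
// double the interval on a correct answer, reset on a wrong one.
// The real scheduling algorithm is still an open design question.
function nextIntervalDays(current: number, correct: boolean): number {
  if (!correct) return 1;            // wrong → review again tomorrow
  return Math.min(current * 2, 180); // correct → double, capped at ~6 months
}

console.log(nextIntervalDays(4, true)); // → 8
```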
### `synset_id`: make nullable, don't remove
`synset_id` is the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It should stay.
**Problem:** non-WordNet terms (custom words added later) won't have a synset ID, so `NOT NULL` is too strict.
**Decision:** make `synset_id` nullable. Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are not considered equal), so no constraint changes are needed beyond dropping `notNull()`.
For extra defensiveness, a partial unique index can be added later:
```sql
CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
```
---
## Open Questions / Deferred
- **User learning state** — not needed for the API layer but must be designed before the game loop ships
- **Distractors** — generated at query time (random same-POS terms from the same deck); no schema needed
- **`cefr_level` data source** — WordNet frequency data was already found to be unreliable; external CEFR lists (Oxford 3000, SUBTLEX) will be needed to populate this field
---
### Open: semantic category metadata source
Categories (`animals`, `kitchen`, etc.) are in the schema but empty for MVP.
Grammar and Media work without them (Grammar = POS filter, Media = deck membership).
Needs research before populating `term_categories`. Options:
**Option 1: WordNet domain labels**
Already in OMW, extractable in the existing pipeline. Free, no extra dependency.
Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals").
**Option 2: Princeton WordNet Domains**
Separate project built on WordNet. ~200 hierarchical domains mapped to synsets. More structured
and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1.
**Option 3: Kelly Project**
Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning,
multiple languages. Could solve frequency tiers (cefr_level) and semantic categories in one shot.
Investigate coverage for your languages and POS range first.
**Option 4: BabelNet / WikiData**
Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories.
Problem: complex integration, BabelNet has commercial licensing restrictions, WikiData category
trees are deep and noisy.
**Option 5: LLM-assisted categorization**
Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import.
Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible
without saving output. Good fallback if structured sources have insufficient coverage.
**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill**
Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass.
Combines automation with control. Likely the most practical approach.
**Option 7: Manual curation**
Flat file mapping synset IDs to your own category slugs. Full control, matches UI exactly.
Too expensive at scale — only viable for small curated additions on top of an automated baseline.
**Current recommendation:** research Kelly Project first. If coverage is insufficient, go with Option 6.
---
### Implementation roadmap
- [x] Finalize data model
- [x] Write and run migrations
- [x] Fill the database (expand import pipeline)
- [ ] Decide SUBTLEX → cefr_level mapping strategy
- [ ] Generate decks
- [ ] Finalize game selection flow
- [ ] Define Zod schemas in packages/shared
- [ ] Implement API

@@ -1,518 +1,366 @@
# Glossa — Project Specification
## 1. Overview
> **This document is the single source of truth for the project.**
> It is written to be handed to any LLM as context. It contains the project vision, the current MVP scope, the tech stack, the architecture, and the roadmap.
A multiplayer English–Italian vocabulary trainer with a Duolingo-style quiz interface (one word prompt, four answer choices). Supports both single-player practice and real-time competitive multiplayer rooms of 2–4 players. Designed from the ground up to be language-pair agnostic.
---
## 1. Project Overview
A vocabulary trainer for English–Italian words. The quiz format is Duolingo-style: one word is shown as a prompt, and the user picks the correct translation from four choices (1 correct + 3 distractors of the same part of speech). The long-term vision is a multiplayer competitive game, but the MVP is a polished singleplayer experience.
**The core learning loop:**
Show word → pick answer → see result → next word → final score
The vocabulary data comes from WordNet + the Open Multilingual Wordnet (OMW). A one-time Python script extracts English–Italian noun pairs and seeds the database. The data model is language-pair agnostic by design — adding a new language later requires no schema changes.
### Core Principles
- **Minimal but extendable**: working product fast, clean architecture for future growth
- **Mobile-first**: touch-friendly Duolingo-like UX
- **Type safety end-to-end**: TypeScript + Zod schemas shared between frontend and backend
---
## 2. Technology Stack
## 2. Full Product Vision (Long-Term)
| Layer | Technology |
| -------------------- | ----------------------------- |
| Monorepo | pnpm workspaces |
| Frontend | React 18, Vite, TypeScript |
| Routing | TanStack Router |
| Server state | TanStack Query |
| Client state | Zustand |
| Styling | Tailwind CSS + shadcn/ui |
| Backend | Node.js, Express, TypeScript |
| Realtime | WebSockets (`ws` library) |
| Database | PostgreSQL 18 |
| ORM | Drizzle ORM |
| Cache / Queue | Valkey 9 |
| Auth | OpenAuth (Google + GitHub) |
| Validation | Zod (shared schemas) |
| Testing | Vitest, React Testing Library |
| Linting / Formatting | ESLint, Prettier |
| Containerisation | Docker, Docker Compose |
| Hosting | Hetzner VPS |
- Users log in via Google or GitHub (OpenAuth)
- Singleplayer mode: 10-round quiz, score screen
- Multiplayer mode: create a room, share a code, 2–4 players answer simultaneously in real time, live scores, winner screen
- 1000+ English–Italian nouns seeded from WordNet
### Why `ws` over Socket.io
`ws` is the raw WebSocket library. For rooms of 2–4 players there is no need for Socket.io's transport fallbacks or room-management abstractions. The protocol is defined explicitly in `packages/shared`, which gives the same guarantees without the overhead.
### Why Valkey
Valkey stores ephemeral room state that does not need to survive a server restart. It keeps the PostgreSQL schema clean and makes room lookups O(1).
### Why pnpm workspaces without Turborepo
Turborepo adds parallel task running and build caching on top of pnpm workspaces. For a two-app monorepo of this size, the plain pnpm workspace commands (`pnpm -r run build`, `pnpm --filter`) are sufficient and there is one less tool to configure and maintain.
This is the full vision. The MVP deliberately ignores most of it.
---
## 3. Repository Structure
## 3. MVP Scope
**Goal:** A working, presentable singleplayer quiz that can be shown to real people.
### What is IN the MVP
- Vocabulary data in a PostgreSQL database (already seeded)
- REST API that returns quiz terms with distractors
- Singleplayer quiz UI: configurable rounds (3 or 10), answer feedback, score screen
- Clean, mobile-friendly UI (Tailwind + shadcn/ui)
- Global error handler with typed error classes
- Unit + integration tests for the API
- Local dev only (no deployment for MVP)
### What is CUT from the MVP
| Feature | Why cut |
| ------------------------------- | -------------------------------------- |
| Authentication (OpenAuth) | No user accounts needed for a demo |
| Multiplayer (WebSockets, rooms) | Core quiz works without it |
| Valkey / Redis cache | Only needed for multiplayer room state |
| Deployment to Hetzner | Ship to people locally first |
| User stats / profiles | Needs auth |
These are not deleted from the plan — they are deferred. The architecture is already designed to support them. See Section 11 (Post-MVP Ladder).
---
## 4. Technology Stack
The monorepo structure and tooling are already set up. This is the full stack — the MVP uses a subset of it.
| Layer | Technology | MVP? |
| ------------ | ------------------------------ | ----------- |
| Monorepo | pnpm workspaces | ✅ |
| Frontend | React 18, Vite, TypeScript | ✅ |
| Routing | TanStack Router | ✅ |
| Server state | TanStack Query | ✅ |
| Client state | Zustand | ✅ |
| Styling | Tailwind CSS + shadcn/ui | ✅ |
| Backend | Node.js, Express, TypeScript | ✅ |
| Database | PostgreSQL + Drizzle ORM | ✅ |
| Validation | Zod (shared schemas) | ✅ |
| Testing | Vitest, supertest | ✅ |
| Auth | OpenAuth (Google + GitHub) | ❌ post-MVP |
| Realtime | WebSockets (`ws` library) | ❌ post-MVP |
| Cache | Valkey | ❌ post-MVP |
| Deployment | Docker Compose, Hetzner, Nginx | ❌ post-MVP |
---
## 5. Repository Structure
```text
vocab-trainer/
├── apps/
│ ├── web/ # React SPA (Vite + TanStack Router)
│ │ ├── src/
│ │ │ ├── routes/
│ │ │ ├── components/
│ │ │ ├── stores/ # Zustand stores
│ │ │ └── lib/
│ │ └── Dockerfile
│ └── api/ # Express REST + WebSocket server
│ ├── src/
│ │ ├── routes/
│ │ ├── services/
│ │ ├── repositories/
│ │ └── websocket/
│ └── Dockerfile
│ ├── api/
│ │ └── src/
│ │ ├── app.ts — createApp() factory, express.json(), error middleware
│ │ ├── server.ts — starts server on PORT
│ │ ├── errors/
│ │ │ └── AppError.ts — AppError, ValidationError, NotFoundError
│ │ ├── middleware/
│ │ │ └── errorHandler.ts — central error middleware
│ │ ├── routes/
│ │ │ ├── apiRouter.ts — mounts /health and /game routers
│ │ │ ├── gameRouter.ts — POST /start, POST /answer
│ │ │ └── healthRouter.ts
│ │ ├── controllers/
│ │ │ └── gameController.ts — validates input, calls service, sends response
│ │ ├── services/
│ │ │ ├── gameService.ts — builds quiz sessions, evaluates answers
│ │ │ └── gameService.test.ts — unit tests (mocked DB)
│ │ └── gameSessionStore/
│ │ ├── GameSessionStore.ts — interface (async, Valkey-ready)
│ │ ├── InMemoryGameSessionStore.ts
│ │ └── index.ts
│ └── web/
│ └── src/
│ ├── routes/
│ │ ├── index.tsx — landing page
│ │ └── play.tsx — the quiz
│ ├── components/
│ │ └── game/
│ │ ├── GameSetup.tsx — settings UI
│ │ ├── QuestionCard.tsx — prompt + 4 options
│ │ ├── OptionButton.tsx — idle / correct / wrong states
│ │ └── ScoreScreen.tsx — final score + play again
│ └── main.tsx
├── packages/
│ ├── shared/ # Zod schemas, TypeScript types, constants
│ └── db/ # Drizzle schema, migrations, seed script
├── scripts/
│   ├── datafiles/
│   │   └── en-it-nouns.json
│ └── extract-en-it-nouns.py # One-time WordNet + OMW extraction → seed.json
│ ├── shared/
│ │ └── src/
│ │ ├── constants.ts — SUPPORTED_POS, DIFFICULTY_LEVELS, etc.
│ │ ├── schemas/game.ts — Zod schemas for all game types
│ │ └── index.ts
│ └── db/
│ ├── drizzle/ — migration SQL files
│ └── src/
│ ├── db/schema.ts — Drizzle schema
│ ├── models/termModel.ts — getGameTerms(), getDistractors()
│ ├── seeding-datafiles.ts — seeds terms + translations from JSON
│ ├── seeding-cefr-levels.ts — enriches translations with CEFR data
│ ├── generating-deck.ts — builds curated decks
│ └── index.ts
├── scripts/ — Python extraction/comparison/merge scripts
├── documentation/ — project docs
├── docker-compose.yml
├── docker-compose.prod.yml
├── pnpm-workspace.yaml
└── package.json
└── pnpm-workspace.yaml
```
`packages/shared` is the contract between frontend and backend. All request/response shapes and WebSocket event payloads are defined there as Zod schemas and inferred TypeScript types — never duplicated.
### pnpm workspace config
`pnpm-workspace.yaml` declares:
```yaml
packages:
- 'apps/*'
- 'packages/*'
```
### Root scripts
The root `package.json` defines convenience scripts that delegate to workspaces:
- `dev` — starts `api` and `web` in parallel
- `build` — builds all packages in dependency order
- `test` — runs Vitest across all workspaces
- `lint` — runs ESLint across all workspaces
For parallel dev, use `concurrently` or just two terminal tabs for MVP.
`packages/shared` is the contract between frontend and backend. All request/response shapes are defined there as Zod schemas — never duplicated.
---
## 4. Architecture — N-Tier / Layered
## 6. Architecture
### The Layered Architecture
```text
┌────────────────────────────────────┐
│ Presentation (React SPA) │ apps/web
├────────────────────────────────────┤
│ API / Transport │ HTTP REST + WebSocket
├────────────────────────────────────┤
│ Application (Controllers) │ apps/api/src/routes
│ Domain (Business logic) │ apps/api/src/services
│ Data Access (Repositories) │ apps/api/src/repositories
├────────────────────────────────────┤
│ Database (PostgreSQL via Drizzle) │ packages/db
│ Cache (Valkey) │ apps/api/src/lib/valkey.ts
└────────────────────────────────────┘
HTTP Request
Router — maps URL + HTTP method to a controller
Controller — handles HTTP only: validates input, calls service, sends response
Service — business logic only: no HTTP, no direct DB access
Model — database queries only: no business logic
Database
```
Each layer only communicates with the layer directly below it. Business logic lives in services, not in route handlers or repositories.
**The rule:** each layer only talks to the layer directly below it. A controller never touches the database. A service never reads `req.body`. A model never knows what a quiz is.
### Monorepo Package Responsibilities
| Package | Owns |
| ----------------- | -------------------------------------------------------- |
| `packages/shared` | Zod schemas, constants, derived TypeScript types |
| `packages/db` | Drizzle schema, DB connection, all model/query functions |
| `apps/api` | Router, controllers, services, error handling |
| `apps/web` | React frontend, consumes types from shared |
**Key principle:** all database code lives in `packages/db`. `apps/api` never imports `drizzle-orm` for queries — it only calls functions exported from `packages/db`.
---
## 5. Infrastructure
## 7. Data Model (Current State)
### Domain structure
Words are modelled as language-neutral concepts (terms) separate from learning curricula (decks). Adding a new language pair requires no schema changes — only new rows in `translations`, `decks`.
| Subdomain | Service |
| --------------------- | ----------------------- |
| `app.yourdomain.com` | React frontend |
| `api.yourdomain.com` | Express API + WebSocket |
| `auth.yourdomain.com` | OpenAuth service |
**Core tables:** `terms`, `translations`, `term_glosses`, `decks`, `deck_terms`, `categories`, `term_categories`
### Docker Compose services (production)
Key columns on `terms`: `id` (uuid), `pos` (CHECK-constrained), `source`, `source_id` (unique pair for idempotent imports)
| Container | Role |
| ---------------- | ------------------------------------------- |
| `postgres` | PostgreSQL 16, named volume |
| `valkey` | Valkey 8, ephemeral (no persistence needed) |
| `openauth` | OpenAuth service |
| `api` | Express + WS server |
| `web` | Nginx serving the Vite build |
| `nginx-proxy` | Automatic reverse proxy |
| `acme-companion` | Let's Encrypt certificate automation |
Key columns on `translations`: `id`, `term_id` (FK), `language_code` (CHECK-constrained), `text`, `cefr_level` (nullable varchar(2), CHECK A1–C2)
```text
nginx-proxy (:80/:443)
app.domain → web:80
api.domain → api:3000 (HTTP + WS upgrade)
auth.domain → openauth:3001
```
Deck model uses `source_language` + `validated_languages` array — one deck serves multiple target languages. Decks are frequency tiers (e.g. `en-core-1000`), not POS splits.
SSL is fully automatic via `nginx-proxy` + `acme-companion`. No manual Certbot needed.
### 5.1 Valkey Key Structure
Ephemeral room state is stored in Valkey with TTL (e.g., 1 hour).
PostgreSQL stores durable history only.
Key Format: `room:{code}:{field}`
| Key | Type | TTL | Description |
|------------------------------|---------|-------|-------------|
| `room:{code}:state` | Hash | 1h | Current question index, round status |
| `room:{code}:players` | Set | 1h | List of connected user IDs |
| `room:{code}:answers:{round}`| Hash | 15m | Temp storage for current round answers |
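The key layout above can be centralized in small builder functions so that no key string is ever assembled ad hoc; the names and constants below are illustrative, not existing code.

```typescript
// Hypothetical key builders matching the `room:{code}:{field}` convention above.
const ROOM_TTL_SECONDS = 60 * 60;    // 1h for state and players
const ANSWERS_TTL_SECONDS = 15 * 60; // 15m for per-round answers

const roomStateKey = (code: string) => `room:${code}:state`;
const roomPlayersKey = (code: string) => `room:${code}:players`;
const roomAnswersKey = (code: string, round: number) =>
  `room:${code}:answers:${round}`;

console.log(roomAnswersKey("WOLF-42", 3)); // → "room:WOLF-42:answers:3"
```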
**Recovery strategy:** if the server crashes mid-game, Valkey data is lost. PostgreSQL `room_players.score` remains 0. Room status is reset to `finished` via a startup health check if `updated_at` is stale.
Full schema is in `packages/db/src/db/schema.ts`.
---
## 6. Data Model
## 8. API
### Design principle
Words are modelled as language-neutral concepts (terms) separate from learning curricula (decks).
Adding a new language pair requires no schema changes — only new rows in `translations`, `decks`, and `language_pairs`.
### Core tables
```text
terms
  id            uuid PK
  synset_id     text UNIQUE            -- OMW ILI (e.g. "ili:i12345")
  pos           varchar(20)            -- NOT NULL, CHECK (pos IN ('noun', 'verb', 'adjective', 'adverb'))
  created_at    timestamptz DEFAULT now()
  -- REMOVED: frequency_rank (handled at deck level)

translations
  id            uuid PK
  term_id       uuid FK → terms.id
  language_code varchar(10)            -- NOT NULL, BCP 47: "en", "it"
  text          text                   -- NOT NULL
  created_at    timestamptz DEFAULT now()
  UNIQUE (term_id, language_code, text) -- Allow synonyms, prevent exact duplicates

term_glosses
  id            uuid PK
  term_id       uuid FK → terms.id
  language_code varchar(10)            -- NOT NULL
  text          text                   -- NOT NULL
  created_at    timestamptz DEFAULT now()

language_pairs
  id            uuid PK
  source        varchar(10)            -- NOT NULL
  target        varchar(10)            -- NOT NULL
  label         text
  active        boolean DEFAULT true
  UNIQUE (source, target)

decks
  id            uuid PK
  name          text                   -- NOT NULL (e.g. "A1 Italian Nouns", "Most Common 1000")
  description   text                   -- NULLABLE
  pair_id       uuid FK → language_pairs.id -- NULLABLE (for single-language or multi-pair decks)
  created_by    uuid FK → users.id     -- NULLABLE (for system decks)
  is_public     boolean DEFAULT true
  created_at    timestamptz DEFAULT now()

deck_terms
  deck_id       uuid FK → decks.id
  term_id       uuid FK → terms.id
  position      smallint               -- NOT NULL, ordering within deck (1, 2, 3...)
  added_at      timestamptz DEFAULT now()
  PRIMARY KEY (deck_id, term_id)

users
  id            uuid PK                -- Internal stable ID (FK target)
  openauth_sub  text UNIQUE            -- NOT NULL, OpenAuth `sub` claim (e.g. "google|12345")
  email         varchar(255) UNIQUE    -- NULLABLE (GitHub users may lack email)
  display_name  varchar(100)
  created_at    timestamptz DEFAULT now()
  last_login_at timestamptz
  -- REMOVED: games_played, games_won (derive from room_players)

rooms
  id            uuid PK
  code          varchar(8) UNIQUE      -- NOT NULL, CHECK (code = UPPER(code))
  host_id       uuid FK → users.id
  pair_id       uuid FK → language_pairs.id
  deck_id       uuid FK → decks.id     -- Which vocabulary deck this room uses
  status        varchar(20)            -- NOT NULL, CHECK (status IN ('waiting', 'in_progress', 'finished'))
  max_players   smallint               -- NOT NULL, DEFAULT 4, CHECK (max_players BETWEEN 2 AND 10)
  round_count   smallint               -- NOT NULL, DEFAULT 10, CHECK (round_count BETWEEN 5 AND 20)
  created_at    timestamptz DEFAULT now()
  updated_at    timestamptz DEFAULT now() -- For stale room recovery

room_players
  room_id       uuid FK → rooms.id
  user_id       uuid FK → users.id
  score         integer DEFAULT 0      -- Final score only (written at game end)
  joined_at     timestamptz DEFAULT now()
  left_at       timestamptz            -- Populated on WS disconnect/leave
  PRIMARY KEY (room_id, user_id)
```
### Indexes
```sql
-- Vocabulary
CREATE INDEX idx_terms_pos ON terms (pos);
CREATE INDEX idx_translations_lang ON translations (language_code, term_id);

-- Decks
CREATE INDEX idx_decks_pair ON decks (pair_id, is_public);
CREATE INDEX idx_decks_creator ON decks (created_by);
CREATE INDEX idx_deck_terms_term ON deck_terms (term_id);

-- Language pairs
CREATE INDEX idx_pairs_active ON language_pairs (active, source, target);

-- Rooms
CREATE INDEX idx_rooms_status ON rooms (status);
CREATE INDEX idx_rooms_host ON rooms (host_id);
-- NOTE: idx_rooms_code omitted (UNIQUE constraint creates index automatically)

-- Room players
CREATE INDEX idx_room_players_user ON room_players (user_id);
CREATE INDEX idx_room_players_score ON room_players (room_id, score DESC);
```
### Repository logic note
`DeckRepository.getTerms(deckId, limit, offset)` fetches terms from a specific deck, ordered by `deck_terms.position`. For random practice within a deck, `WHERE deck_id = X ORDER BY random() LIMIT N` is acceptable: a deck is bounded (e.g. 500 terms max), so this is not a full-table scan.
---
## 7. Vocabulary Data — WordNet + OMW
### Source
Open Multilingual Wordnet (OMW) — English & Italian nouns via Interlingual Index (ILI)
External CEFR lists — For deck curation (e.g. GitHub: ecom/cefr-lists)
### Extraction process
1. Run `extract-en-it-nouns.py` once locally using `wn` library
- Imports ALL bilingual noun synsets (no frequency filtering)
- Output: `datafiles/en-it-nouns.json` — committed to repo
2. Run `pnpm db:seed` — populates `terms` + `translations` tables from JSON
3. Run `pnpm db:build-decks` — matches external CEFR lists to DB terms, creates `decks` + `deck_terms`
### Benefits of deck-based approach
- WordNet frequency data is unreliable (e.g. chemical symbols rank high)
- Curricula can come from external sources (CEFR, Oxford 3000, SUBTLEX)
- Bad data excluded at deck level, not schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes
`terms.synset_id` stores the OMW ILI (e.g. `ili:i12345`) for traceability and future re-imports with additional languages.
---
## 8. Authentication — OpenAuth
All auth is delegated to the OpenAuth service at `auth.yourdomain.com`. Providers: Google, GitHub.
The API validates the JWT from OpenAuth on every protected request. User rows are created or updated on first login via the `sub` claim as the primary key.
**Auth endpoint on the API:**
| Method | Path | Description |
| ------ | -------------- | --------------------------- |
| GET | `/api/auth/me` | Validate token, return user |
All other auth flows (login, callback, token refresh) are handled entirely by OpenAuth — the frontend redirects to `auth.yourdomain.com` and receives a JWT back.
---
## 9. REST API
All endpoints prefixed `/api`. Request and response bodies validated with Zod on both sides using schemas from `packages/shared`.
### Vocabulary
| Method | Path | Description |
| ------ | ---------------------------- | --------------------------------- |
| GET | `/language-pairs` | List active language pairs |
| GET | `/terms?pair=en-it&limit=10` | Fetch quiz terms with distractors |
### Rooms
| Method | Path | Description |
| ------ | ------------------- | ----------------------------------- |
| POST | `/rooms` | Create a room → returns room + code |
| GET | `/rooms/:code` | Get current room state |
| POST | `/rooms/:code/join` | Join a room |
### Users
| Method | Path | Description |
| ------ | ----------------- | ---------------------- |
| GET | `/users/me` | Current user profile |
| GET | `/users/me/stats` | Games played, win rate |
---
## 10. WebSocket Protocol
One WS connection per client. Authenticated by passing the OpenAuth JWT as a query param on the upgrade request: `wss://api.yourdomain.com?token=...`.
All messages are JSON: `{ type: string, payload: unknown }`. The full set of types is a Zod discriminated union in `packages/shared` — both sides validate every message they receive.
### Client → Server
| type | payload | Description |
| ------------- | -------------------------- | -------------------------------- |
| `room:join` | `{ code }` | Subscribe to a room's WS channel |
| `room:leave` | — | Unsubscribe |
| `room:start` | — | Host starts the game |
| `game:answer` | `{ questionId, answerId }` | Player submits an answer |
### Server → Client
| type | payload | Description |
| -------------------- | -------------------------------------------------- | ----------------------------------------- |
| `room:state` | Full room snapshot | Sent on join and on any player join/leave |
| `game:question` | `{ id, prompt, options[], timeLimit }` | New question broadcast to all players |
| `game:answer_result` | `{ questionId, correct, correctAnswerId, scores }` | Broadcast after all answer or timeout |
| `game:finished` | `{ scores[], winner }` | End of game summary |
| `error` | `{ message }` | Protocol or validation error |
### Multiplayer game mechanic — simultaneous answers
All players see the same question at the same time. Everyone submits independently. The server waits until all players have answered **or** the 15-second timeout fires — then broadcasts `game:answer_result` with updated scores. There is no buzz-first mechanic. This keeps the experience Duolingo-like and symmetric.
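Reduced to a pure decision function, the rule the server applies each round looks like this; in the real server it is driven by WS messages and a 15-second timer, and the names here are hypothetical.

```typescript
// Sketch of the "all answered or timeout" check the server runs each round.
// `answers` maps player id → selected option id for the current round.
function roundComplete(
  playerIds: string[],
  answers: Map<string, number>,
  timedOut: boolean,
): boolean {
  return timedOut || playerIds.every((id) => answers.has(id));
}

const answers = new Map([["p1", 2]]);
console.log(roundComplete(["p1", "p2"], answers, false)); // → false
answers.set("p2", 0);
console.log(roundComplete(["p1", "p2"], answers, false)); // → true
```

When this returns true, the server broadcasts `game:answer_result` and advances the round.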
### Game flow
### Endpoints
```text
host creates room (REST) →
players join via room code (REST + WS room:join) →
room:state broadcasts player list →
host sends room:start →
server broadcasts game:question →
players send game:answer →
server collects all answers or waits for timeout →
server broadcasts game:answer_result →
repeat for N rounds →
server broadcasts game:finished
POST /api/v1/game/start GameRequest → GameSession
POST /api/v1/game/answer AnswerSubmission → AnswerResult
GET /api/v1/health Health check
```
### Room state in Valkey
### Schemas (packages/shared)
Active room state (connected players, current question, answers received this round) is stored in Valkey with a TTL. PostgreSQL holds the durable record (`rooms`, `room_players`). On server restart, in-progress games are considered abandoned — acceptable for MVP.
**GameRequest:** `{ source_language, target_language, pos, difficulty, rounds }`
**GameSession:** `{ sessionId: uuid, questions: GameQuestion[] }`
**GameQuestion:** `{ questionId: uuid, prompt: string, gloss: string | null, options: AnswerOption[4] }`
**AnswerOption:** `{ optionId: number (0-3), text: string }`
**AnswerSubmission:** `{ sessionId: uuid, questionId: uuid, selectedOptionId: number (0-3) }`
**AnswerResult:** `{ questionId: uuid, isCorrect: boolean, correctOptionId: number (0-3), selectedOptionId: number (0-3) }`
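For illustration, the inferred TypeScript shapes of `GameQuestion` and `AnswerOption` might look as follows; the authoritative definitions are the Zod schemas in `packages/shared`, and the sample values are invented.

```typescript
// Plain TypeScript mirror of the shared shapes listed above.
// The real definitions are Zod schemas; these are their inferred forms.
interface AnswerOption {
  optionId: 0 | 1 | 2 | 3;
  text: string;
}

interface GameQuestion {
  questionId: string; // uuid
  prompt: string;
  gloss: string | null;
  options: AnswerOption[]; // always length 4
}

// Hypothetical sample question (values invented for illustration):
const sample: GameQuestion = {
  questionId: "q-0001",
  prompt: "dog",
  gloss: "a domesticated canid",
  options: [
    { optionId: 0, text: "cane" },
    { optionId: 1, text: "gatto" },
    { optionId: 2, text: "pane" },
    { optionId: 3, text: "vino" },
  ],
};
console.log(sample.options.length); // → 4
```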
### Error Handling
Typed error classes (`AppError` base, `ValidationError` 400, `NotFoundError` 404) with central error middleware. Controllers validate with `safeParse`, throw on failure, and call `next(error)` in the catch. The middleware maps `AppError` instances to HTTP status codes; unknown errors return 500.
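A minimal sketch of that error hierarchy, assuming the field names shown; the real classes in `apps/api/src/errors` may differ in detail.

```typescript
// Minimal sketch of the typed error classes described above.
// Field names are assumptions; the real implementation may differ.
class AppError extends Error {
  constructor(message: string, public readonly statusCode: number) {
    super(message);
    this.name = new.target.name;
  }
}

class ValidationError extends AppError {
  constructor(message = "Invalid request") {
    super(message, 400);
  }
}

class NotFoundError extends AppError {
  constructor(message = "Not found") {
    super(message, 404);
  }
}

// The central middleware maps AppError instances to their status code
// and everything else to a generic 500:
function toStatus(err: unknown): number {
  return err instanceof AppError ? err.statusCode : 500;
}

console.log(toStatus(new NotFoundError())); // → 404
```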
### Key Design Rules
- Server-side answer evaluation: the correct answer is never sent to the frontend
- `POST` not `GET` for game start (configuration in request body)
- `safeParse` over `parse` (clean 400s, not raw Zod 500s)
- Session state stored in `GameSessionStore` (in-memory now, Valkey later)
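The first rule can be made concrete with a small sketch: the correct option id lives only in the server-held session object and never appears in any payload sent to the client. The `StoredQuestion` shape is an assumption for illustration.

```typescript
// Server-side answer evaluation: the correct option id stays on the server.
// `StoredQuestion` is an assumed shape of the session-held question record.
interface StoredQuestion {
  questionId: string;
  correctOptionId: number; // 0-3, never sent to the client
}

function evaluateAnswer(q: StoredQuestion, selectedOptionId: number) {
  // Mirrors the AnswerResult shape listed above.
  return {
    questionId: q.questionId,
    isCorrect: selectedOptionId === q.correctOptionId,
    correctOptionId: q.correctOptionId,
    selectedOptionId,
  };
}

console.log(
  evaluateAnswer({ questionId: "q1", correctOptionId: 2 }, 2).isCorrect,
); // → true
```

Only after the client submits does the response reveal `correctOptionId`, so the answer cannot be scraped from the question payload.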
---
## 11. Game Mechanics
## 9. Game Mechanics
- **Question format**: source-language word prompt + 4 target-language choices (1 correct + 3 distractors of the same POS)
- **Distractors**: generated server-side, never include the correct answer, never repeat within a session
- **Scoring**: +1 point per correct answer. Speed bonus is out of scope for MVP.
- **Timer**: 15 seconds per question, server-authoritative
- **Single-player**: uses `GET /terms` and runs entirely client-side. No WebSocket.
- **Format**: source-language word prompt + 4 target-language choices
- **Distractors**: same POS, same difficulty, server-side, never the correct answer, never repeated within a session
- **Session length**: 3 or 10 questions (configurable)
- **Scoring**: +1 per correct answer (no speed bonus for MVP)
- **Timer**: none in singleplayer MVP
- **No auth required**: anonymous users
- **Submit-before-send**: user selects, then confirms (prevents misclicks)
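A sketch of distractor selection under those rules, assuming the model layer has already filtered the candidate pool to the same POS and difficulty; the function name and shapes are illustrative, not the actual service code.

```typescript
// Illustrative distractor picker: given a pre-filtered candidate pool,
// pick options that are not the correct answer and have not been used
// earlier in the session.
function pickDistractors(
  pool: string[],
  correct: string,
  usedThisSession: Set<string>,
  count = 3,
): string[] {
  const candidates = pool.filter(
    (t) => t !== correct && !usedThisSession.has(t),
  );
  // Fisher–Yates shuffle so repeated calls vary.
  for (let i = candidates.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [candidates[i], candidates[j]] = [candidates[j], candidates[i]];
  }
  return candidates.slice(0, count);
}

const picked = pickDistractors(
  ["cane", "gatto", "casa", "pane", "vino"],
  "cane",
  new Set(["vino"]),
);
console.log(picked.length); // → 3
```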
---
## 12. Frontend Structure
## 10. Working Methodology
```tree
apps/web/src/
├── routes/
│ ├── index.tsx # Landing / mode select
│ ├── auth/
│ ├── singleplayer/
│ └── multiplayer/
│ ├── lobby.tsx # Create or join by code
│ ├── room.$code.tsx # Waiting room
│ └── game.$code.tsx # Active game
├── components/
│ ├── quiz/ # QuestionCard, OptionButton, ScoreBoard
│ ├── room/ # PlayerList, RoomCode, ReadyState
│ └── ui/ # shadcn/ui wrappers: Button, Card, Dialog ...
├── stores/
│ └── gameStore.ts # Zustand: game session, scores, WS state
├── lib/
│ ├── api.ts # TanStack Query wrappers
│ └── ws.ts # WS client singleton
└── main.tsx
```
This project is a learning exercise. The goal is to understand the code, not just to ship it.
### How to use an LLM for help
1. Paste this document as context
2. Describe what you're working on and what you're stuck on
3. Ask for hints, not solutions
### Refactoring workflow
After completing a task: share the code, ask what to refactor and why. The LLM should explain the concept, not write the implementation.
---
## 11. Post-MVP Ladder
| Phase | What it adds |
| ----------------- | -------------------------------------------------------------- |
| Auth | OpenAuth (Google + GitHub), JWT middleware, user rows in DB |
| User Stats | Games played, score history, profile page |
| Multiplayer Lobby | Room creation, join by code, WebSocket connection |
| Multiplayer Game | Simultaneous answers, server timer, live scores, winner screen |
| Deployment | Docker Compose prod config, Nginx, Let's Encrypt, Hetzner VPS |
| Hardening | Rate limiting, error boundaries, CI/CD, DB backups |
### Future Data Model Extensions (deferred, additive)
- `noun_forms` — gender, singular, plural, articles per language
- `verb_forms` — conjugation tables per language
- `term_pronunciations` — IPA and audio URLs per language
- `user_decks` — which decks a user is studying
- `user_term_progress` — spaced repetition state per user/term/language
- `quiz_answers` — history log for stats
All are new tables referencing existing `terms` rows via FK. No existing schema changes required.
### Multiplayer Architecture (deferred)
- WebSocket protocol: `ws` library, Zod discriminated union for message types
- Room model: human-readable codes (e.g. `WOLF-42`), not matchmaking queue
- Game mechanic: simultaneous answers, 15-second server timer, all players see same question
- Valkey for ephemeral room state, PostgreSQL for durable records
### Infrastructure (deferred)
- `app.yourdomain.com` → React frontend
- `api.yourdomain.com` → Express API + WebSocket
- `auth.yourdomain.com` → OpenAuth service
- Docker Compose with `nginx-proxy` + `acme-companion` for automatic SSL
---
## 12. Definition of Done (MVP)
- [x] API returns quiz terms with correct distractors
- [x] User can complete a quiz without errors
- [x] Score screen shows final result and a play-again option
- [x] App is usable on a mobile screen
- [x] No hardcoded data — everything comes from the database
- [x] Global error handler with typed error classes
- [x] Unit + integration tests for API
---
## 13. Roadmap
### Phase 0 — Foundation ✅
Empty repo that builds, lints, and runs end-to-end. `pnpm dev` starts both apps; `GET /api/health` returns 200; React renders a hello page.
### Phase 1 — Vocabulary Data + API ✅
Word data lives in the DB. API returns quiz sessions with distractors. CEFR enrichment pipeline complete. Global error handler and tests implemented.
### Phase 2 — Singleplayer Quiz UI ✅
User can complete a full quiz in the browser. Settings UI, question cards, answer feedback, score screen.
### Phase 3 — Auth
Users can log in via Google or GitHub and stay logged in. JWT validated by API. User row created on first login.
### Phase 4 — Multiplayer Lobby
Players can create and join rooms. Two browser tabs can join the same room and see each other via WebSocket.
### Phase 5 — Multiplayer Game
Host starts a game. All players answer simultaneously in real time. Winner declared.
### Phase 6 — Production Deployment
App is live on Hetzner with HTTPS. Auth flow works end-to-end.
### Phase 7 — Polish & Hardening
Rate limiting, reconnect logic, error boundaries, CI/CD, DB backups.
### Dependency Graph
```text
Phase 0 (Foundation)
└── Phase 1 (Vocabulary Data + API)
└── Phase 2 (Singleplayer UI)
└── Phase 3 (Auth)
├── Phase 4 (Room Lobby)
│ └── Phase 5 (Multiplayer Game)
│ └── Phase 6 (Deployment)
└── Phase 7 (Hardening)
```
### Zustand store (single store for MVP)
```typescript
interface AppStore {
user: User | null;
gameSession: GameSession | null;
currentQuestion: Question | null;
scores: Record<string, number>;
isLoading: boolean;
error: string | null;
}
```
TanStack Query handles all server data fetching. Zustand handles ephemeral UI and WebSocket-driven state.
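To illustrate that split in plain TypeScript (no Zustand API shown — this is a reducer-style helper, not actual store code): a WebSocket score event touches only the ephemeral `scores` slice, never the server cache.

```typescript
// Mirrors the `scores` field of the AppStore interface above.
interface ScoreState {
  scores: Record<string, number>;
}

// Illustrative immutable update, as a Zustand set() callback would do it.
function applyScoreUpdate(
  state: ScoreState,
  playerId: string,
  delta: number,
): ScoreState {
  return {
    scores: {
      ...state.scores,
      [playerId]: (state.scores[playerId] ?? 0) + delta,
    },
  };
}

console.log(applyScoreUpdate({ scores: { p1: 2 } }, "p1", 1).scores.p1); // 3
```

Server-derived data (word lists, session history) stays in TanStack Query's cache; only transient game state like this lives in the store.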
---
## 14. Testing Strategy

| Type        | Tool                 | Scope                                               |
| ----------- | -------------------- | --------------------------------------------------- |
| Unit        | Vitest               | Services, QuizService distractor logic, Zod schemas |
| Component   | Vitest + RTL         | QuestionCard, OptionButton, auth forms              |
| Integration | Vitest               | API route handlers against a test DB                |
| E2E         | Out of scope for MVP | —                                                   |

Tests are co-located with source files (`*.test.ts` / `*.test.tsx`).

**Critical paths to cover:**

- Distractor generation (correct POS, no duplicates, never includes answer)
- Answer validation (server-side, correct scoring)
- Game session lifecycle (create → play → complete)
- JWT validation middleware

---

## 15. Game Flow (Future)

Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count → game starts.

**Top-level categories (post-MVP):** Grammar, Media, and Thematic (detailed under Out of Scope).
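The distractor-generation rules listed under the critical paths (correct POS, no duplicates, never the answer) can be sketched as follows — the `Term` shape is assumed for illustration; the real type lives in `packages/shared`:

```typescript
// Illustrative shape; the real Term type lives in packages/shared.
interface Term {
  id: number;
  text: string;
  pos: string; // part of speech
}

// Sketch of the rules under test: same POS as the answer,
// no duplicates, and never the correct answer itself.
function pickDistractors(
  answer: Term,
  pool: Term[],
  count = 3,
  rand: () => number = Math.random,
): Term[] {
  const candidates = pool.filter(
    (t) => t.pos === answer.pos && t.id !== answer.id,
  );
  // Fisher-Yates shuffle on a copy so the pool is untouched.
  const shuffled = [...candidates];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled.slice(0, count);
}
```

Because candidates are filtered by `id` before the shuffle, the invariants hold by construction, which keeps the unit tests down to checking length, POS, and uniqueness.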
---
## 16. Definition of Done
### Functional
- [ ] User can log in via Google or GitHub (OpenAuth)
- [ ] User can play singleplayer: 10 rounds, score, result screen
- [ ] User can create a room and share a code
- [ ] User can join a room via code
- [ ] Multiplayer: 10 rounds, simultaneous answers, real-time score sync
- [ ] 1 000 English–Italian words seeded from WordNet + OMW
### Technical
- [ ] Deployed to Hetzner with HTTPS on all three subdomains
- [ ] Docker Compose running all services
- [ ] Drizzle migrations applied on container start
- [ ] 10–20 passing tests covering critical paths
- [ ] pnpm workspace build pipeline green
---
## 17. Out of Scope (MVP)
- Difficulty levels _(`frequency_rank` column exists, ready to use)_
- Additional language pairs _(schema already supports it — just add rows)_
- Leaderboards _(`games_played`, `games_won` columns exist)_
- Streaks / daily challenges
- Friends / private invites
- Audio pronunciation
- CI/CD pipeline (manual deploy for now)
- Rate limiting _(add before going public)_
- Admin panel for vocabulary management
- **Grammar** — practice nouns, verb conjugations, etc.
- **Media** — practice vocabulary from specific books, films, songs, etc.
- **Thematic** — animals, kitchen, etc. (requires category metadata research)