feat(api): implement game terms query with double join

- Add double join on translations for source/target languages
- Left join term_glosses for optional source-language glosses
- Filter difficulty on target side only (intentionally asymmetric:
  a word's difficulty can differ between languages, and what matters
  is the difficulty of the word being learned)
- Return neutral field names (sourceText, targetText, sourceGloss)
  instead of quiz semantics; service layer maps to prompt/answer
- Tighten term_glosses unique constraint to (term_id, language_code)
  to prevent the left join from multiplying question rows
- Add TODO for ORDER BY RANDOM() scaling post-MVP
lila 2026-04-10 18:02:03 +02:00
parent 9fc3ba375a
commit b59fac493d
4 changed files with 356 additions and 28 deletions


@@ -0,0 +1,288 @@
# Glossa — Architecture & API Development Summary
A record of all architectural discussions, decisions, and outcomes from the initial
API design through the quiz model implementation.
---
## Project Overview
Glossa is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. Users see a
word and pick from 4 possible translations. Supports singleplayer and multiplayer.
Stack: Express API, React frontend, Drizzle ORM, Postgres, Valkey, WebSockets.
---
## Architectural Foundation
### The Layered Architecture
The core mental model established for the entire API:
```
HTTP Request
Router — maps URL + HTTP method to a controller
Controller — handles HTTP only: validates input, calls service, sends response
Service — business logic only: no HTTP, no direct DB access
Model — database queries only: no business logic
Database
```
**The rule:** each layer only talks to the layer directly below it. A controller never
touches the database. A service never reads `req.body`. A model never knows what a quiz is.
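The rule above can be sketched end to end for a single endpoint. The layer names (`createGame`, `prepareGameQuestions`, `getGameTerms`) follow this summary; the bodies are illustrative stubs, not the project's real code, and the model is faked with an in-memory array so the sketch is self-contained.

```typescript
type GameRequest = { target_language: string; rounds: string };
type TermRow = { termId: string; targetText: string };

// Model: database queries only (stubbed here with an in-memory array)
const getGameTerms = async (language: string, limit: number): Promise<TermRow[]> =>
  [{ termId: "1", targetText: "ciao" }].slice(0, limit);

// Service: business logic only; converts types, never touches req/res
const prepareGameQuestions = async (request: GameRequest): Promise<TermRow[]> =>
  getGameTerms(request.target_language, Number(request.rounds));

// Controller: HTTP only; validates input, calls the service, sends the response
const createGame = async (
  body: unknown,
  send: (status: number, data: unknown) => void,
) => {
  const request = body as GameRequest; // the real controller validates with Zod safeParse
  send(200, await prepareGameQuestions(request));
};
```

Each function only calls the layer directly below it, so swapping the stubbed model for the real Drizzle query touches nothing above it.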
### Monorepo Package Responsibilities
| Package | Owns |
|---------|------|
| `packages/shared` | Zod schemas, constants, derived TypeScript types |
| `packages/db` | Drizzle schema, DB connection, all model/query functions |
| `apps/api` | Router, controllers, services |
| `apps/web` | React frontend, consumes types from shared |
**Key principle:** all database code lives in `packages/db`. `apps/api` never imports
`drizzle-orm` for queries — it only calls functions exported from `packages/db`.
---
## Problems Faced & Solutions
- Problem 1: Messy API structure
**Symptom:** responsibilities bleeding across layers — DB code in controllers, business
logic in routes.
**Solution:** strict layered architecture with one responsibility per layer.
- Problem 2: No shared contract between API and frontend
**Symptom:** API could return different shapes silently, frontend breaks at runtime.
**Solution:** Zod schemas in `packages/shared` as the single source of truth. Both API
(validation) and frontend (type inference) consume the same schemas.
- Problem 3: Type safety gaps
**Symptom:** TypeScript `any` types on model parameters, `Number` vs `number` confusion.
**Solution:** derived types from constants using `typeof CONSTANT[number]` pattern.
All valid values defined once in constants, types derived automatically.
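The `typeof CONSTANT[number]` pattern in code, using the constants named in this summary (the difficulty values shown are illustrative):

```typescript
// Define valid values once as a readonly tuple...
const SUPPORTED_LANGUAGE_CODES = ["en", "it"] as const;
const DIFFICULTY_LEVELS = ["easy", "intermediate", "hard"] as const;

// ...and derive the union types automatically; no value is repeated anywhere.
type SupportedLanguageCode = (typeof SUPPORTED_LANGUAGE_CODES)[number]; // "en" | "it"
type DifficultyLevel = (typeof DIFFICULTY_LEVELS)[number];

// Lowercase primitives, not object wrappers: `number`, never `Number`.
const pick = (code: SupportedLanguageCode, limit: number): string => `${code}:${limit}`;
```

Passing `pick("de", 3)` is now a compile error, so invalid values never reach the model layer.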
- Problem 4: `getGameTerms` in wrong package
**Symptom:** model queries living in `apps/api/src/models/` meant `apps/api` had a
direct `drizzle-orm` dependency and was accessing the DB itself.
**Solution:** moved models folder to `packages/db/src/models/`. All Drizzle code now
lives in one package.
- Problem 5: Deck generation complexity
**Initial assumption:** 12 decks needed (nouns/verbs × easy/intermediate/hard × en/it).
**Correction:** decks are pools, not presets. POS and difficulty are query filters applied
at runtime — not deck properties. Only 2 decks needed (en-core, it-core).
**Final decision:** skip deck generation entirely for MVP. Query the terms table directly
with difficulty + POS filters. Revisit post-MVP when spaced repetition or progression
features require curated pools.
- Problem 6: GAME_ROUNDS type conflict
**Problem:** `z.enum()` only accepts strings. `GAME_ROUNDS = ["3", "10"]` works with
`z.enum()` but requires `Number(rounds)` conversion in the service.
**Decision:** keep as strings, convert to number in the service before passing to the
model. Documented coupling acknowledged with a comment.
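The Problem 6 decision in miniature; `toRoundCount` is a hypothetical helper name for the service-side conversion:

```typescript
// z.enum() only accepts strings, so rounds are stored as string literals...
const GAME_ROUNDS = ["3", "10"] as const;
type GameRounds = (typeof GAME_ROUNDS)[number];

// ...and the service converts to number before calling the model.
// Documented coupling: GAME_ROUNDS values must stay numeric strings.
const toRoundCount = (rounds: GameRounds): number => Number(rounds);
```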
- Problem 7: Gloss join could multiply question rows
**Symptom:** the schema allowed multiple glosses per term per language, so the left join would duplicate rows.
**Solution:** tightened the unique constraint on `term_glosses` to (term_id, language_code).
- Problem 8: Model leaked quiz semantics
**Symptom:** return fields were named prompt / answer, baking HTTP-layer concepts into the database layer.
**Solution:** renamed to neutral field names (sourceText / targetText / sourceGloss).
---
## Decisions Made
- Zod schemas belong in `packages/shared`
Both the API and frontend import from the same schemas. If the shape changes, TypeScript
compilation fails in both places simultaneously — silent drift is impossible.
- Server-side answer evaluation
The correct answer is never sent to the frontend in `QuizQuestion`. It is only revealed
in `AnswerResult` after the client submits. Prevents cheating and keeps game logic
authoritative on the server.
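The server-authoritative check can be sketched as follows; the types mirror the planned `AnswerResult` schema and `evaluateAnswer` is a hypothetical service helper:

```typescript
type AnswerResult = {
  correct: boolean;
  correctOptionId: string;
  selectedOptionId: string;
};

const evaluateAnswer = (
  correctOptionId: string,  // from server-side state, never sent with the question
  selectedOptionId: string, // from the client's AnswerSubmission
): AnswerResult => ({
  correct: selectedOptionId === correctOptionId,
  correctOptionId,
  selectedOptionId,
});
```

The correct option id only appears in the response *after* the client has committed to an answer.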
- `safeParse` over `parse` in controllers
`parse` throws a raw Zod error → ugly 500 response. `safeParse` returns a result object
→ clean 400 with early return. Global error handler to be implemented later (Step 6 of
roadmap) will centralise this pattern.
- POST not GET for game start
A body on a `GET` request has no defined semantics and is ignored by most tooling. Game
configuration is submitted as a JSON body → `POST` is semantically correct.
- `express.json()` middleware required
Without it, `req.body` is `undefined`. Added to `createApp()` in `app.ts`.
- Type naming: PascalCase
TypeScript convention. `supportedLanguageCode` → `SupportedLanguageCode` etc.
- Primitive types: always lowercase
`number` not `Number`, `string` not `String`. The uppercase versions are object wrappers
and not assignable to Drizzle's expected primitive types.
- Model parameters use shared types, not `GameRequestType`
The model layer should not know about `GameRequestType` — that's an HTTP boundary concern.
Instead, parameters are typed using the derived constant types (`SupportedLanguageCode`,
`SupportedPos`, `DifficultyLevel`) exported from `packages/shared`.
- One gloss per term per language. The unique constraint on term_glosses was tightened from (term_id, language_code, text) to (term_id, language_code) to prevent the left join from multiplying question rows. Revisit if multiple glosses per language are ever needed (e.g. register or domain variants).
- Model returns neutral field names, not quiz semantics. getGameTerms returns sourceText / targetText / sourceGloss rather than prompt / answer / gloss. Quiz semantics are applied in the service layer. Keeps the model reusable for non-quiz features.
- Asymmetric difficulty filter. Difficulty is filtered on the target (answer) side only. A word can be A2 in Italian but B1 in English, and what matters is the difficulty of the word being learned.
---
## Data Pipeline Work (Pre-API)
### CEFR Enrichment Pipeline (completed)
A staged ETL pipeline was built to enrich translation records with CEFR levels and
difficulty ratings:
```
Raw source files
extract-*.py — normalise each source to standard JSON
compare-*.py — quality gate: surface conflicts between sources (read-only)
merge-*.py — resolve conflicts by source priority, derive difficulty
enrich.ts — write cefr_level + difficulty to DB translations table
```
**Source priority:**
- English: `en_m3` > `cefrj` > `octanove` > `random`
- Italian: `it_m3` > `italian`
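The merge stage's priority rule amounts to "take the value from the highest-priority source that has one". A sketch of that resolution (the real pipeline is Python; `resolveByPriority` is an illustrative helper, and the priority list is the English one above):

```typescript
const EN_PRIORITY = ["en_m3", "cefrj", "octanove", "random"] as const;

// candidates maps source name → CEFR level proposed by that source
const resolveByPriority = (
  priority: readonly string[],
  candidates: Record<string, string>,
): string | undefined => {
  for (const source of priority) {
    if (candidates[source] !== undefined) return candidates[source]; // first hit wins
  }
  return undefined; // no source had a value; term stays unenriched
};
```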
**Enrichment results:**
| Language | Enriched | Total | Coverage |
|----------|----------|-------|----------|
| English | 42,527 | 171,394 | ~25% |
| Italian | 23,061 | 54,603 | ~42% |
Both languages have sufficient coverage for MVP. Italian C2 has only 242 terms — noted
as a potential constraint for the distractor algorithm at high difficulty.
---
## API Schemas (packages/shared)
### `GameRequestSchema` (implemented)
```typescript
{
  source_language: z.enum(SUPPORTED_LANGUAGE_CODES),
  target_language: z.enum(SUPPORTED_LANGUAGE_CODES),
  pos: z.enum(SUPPORTED_POS),
  difficulty: z.enum(DIFFICULTY_LEVELS),
  rounds: z.enum(GAME_ROUNDS),
}
```
### Planned schemas (not yet implemented)
```
QuizQuestion — prompt, optional gloss, 4 options (no correct answer)
QuizOption — optionId + text
AnswerSubmission — questionId + selectedOptionId
AnswerResult — correct boolean, correctOptionId, selectedOptionId
```
---
## API Endpoints
```
POST /api/v1/game/start GameRequest → QuizQuestion[]
POST /api/v1/game/answer AnswerSubmission → AnswerResult
```
---
## Current File Structure (apps/api)
```
apps/api/src/
├── app.ts    — Express app, express.json() middleware
├── server.ts — starts server on PORT
├── routes/
│   ├── apiRouter.ts  — mounts /health and /game routers
│   ├── gameRouter.ts — POST /start → createGame controller
│   └── healthRouter.ts
├── controllers/
│   └── gameController.ts — validates GameRequest, calls service
└── services/
    └── gameService.ts — calls getGameTerms, returns raw rows
```
---
## Current File Structure (packages/db)
```
packages/db/src/
├── db/
│   └── schema.ts — Drizzle schema (terms, translations, users, decks...)
├── models/
│   └── termModel.ts — getGameTerms() query
└── index.ts — exports db connection + getGameTerms
```
---
## Completed Tasks
- [x] Layered architecture established and understood
- [x] `GameRequestSchema` defined in `packages/shared`
- [x] Derived types (`SupportedLanguageCode`, `SupportedPos`, `DifficultyLevel`) exported from constants
- [x] `getGameTerms()` model implemented with POS / language / difficulty / limit filters
- [x] Model correctly placed in `packages/db`
- [x] `prepareGameQuestions()` service skeleton calling the model
- [x] `createGame` controller with Zod `safeParse` validation
- [x] `POST /api/v1/game/start` route wired
- [x] End-to-end pipeline verified with test script — returns correct rows
- [x] CEFR enrichment pipeline complete for English and Italian
- [x] Double join on translations implemented (source + target language)
- [x] Gloss left join implemented
- [x] Model return type uses neutral field names (sourceText, targetText, sourceGloss)
- [x] Schema: gloss unique constraint tightened to one gloss per term per language
---
## Roadmap Ahead
### Step 1 — Learn SQL fundamentals (in progress)
Concepts needed: SELECT, FROM, JOIN, WHERE, LIMIT.
Resources: sqlzoo.net or Khan Academy SQL section.
Required before: implementing the double join for source language prompt.
### Step 2 — Complete the model layer
- Double join on `translations` — once for source language (prompt), once for target language (answer)
- `GlossModel.getGloss(termId, languageCode)` — fetch gloss if available
### Step 3 — Define remaining Zod schemas
- `QuizQuestion`, `QuizOption`, `AnswerSubmission`, `AnswerResult` in `packages/shared`
### Step 4 — Complete the service layer
- `QuizService.buildSession()` — assemble raw rows into `QuizQuestion[]`
- Generate `questionId` per question
- Map source language translation as prompt
- Attach gloss if available
- Fetch 3 distractors (same POS, different term, same difficulty)
- Shuffle options so correct answer is not always in same position
- `QuizService.evaluateAnswer()` — validate correctness, return `AnswerResult`
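The "shuffle options" step can use a Fisher–Yates shuffle so the correct answer is equally likely to land in any of the four positions. `buildOptions` is a hypothetical helper combining the correct option with its three distractors; the `QuizOption` shape matches the planned schema:

```typescript
type QuizOption = { optionId: string; text: string };

// Fisher–Yates: unbiased in-place shuffle of a copy
const shuffle = <T>(items: T[]): T[] => {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
};

const buildOptions = (correct: QuizOption, distractors: QuizOption[]): QuizOption[] =>
  shuffle([correct, ...distractors]);
```

A naive `sort(() => Math.random() - 0.5)` would bias positions; Fisher–Yates does not.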
### Step 5 — Implement answer endpoint
- `POST /api/v1/game/answer` route, controller, service method
### Step 6 — Global error handler
- Typed error classes (`ValidationError`, `NotFoundError`)
- Central error middleware in `app.ts`
- Remove temporary `safeParse` error handling from controllers
### Step 7 — Tests
- Unit tests for `QuizService` — correct POS filtering, distractor never equals correct answer
- Unit tests for `evaluateAnswer` — correct and incorrect cases
- Integration tests for both endpoints
### Step 8 — Auth (Phase 2 from original roadmap)
- OpenAuth integration
- JWT validation middleware
- `GET /api/auth/me` endpoint
- Frontend auth guard
---
## Open Questions
- **Distractor algorithm:** when Italian C2 has only 242 terms, should the difficulty
filter fall back gracefully or return an error? Decision needed before implementing
`buildSession()`.
- **Session statefulness:** game loop is currently stateless (fetch all questions upfront).
Confirm this is still the intended MVP approach before building `buildSession()`.


@@ -5,6 +5,15 @@
- pinning dependencies in package.json files
- rethink organisation of datafiles and wordlists
## notes
- backend advice: https://github.com/MohdOwaisShah/backend
- openapi
- bruno for api testing
- tailscale
- husky/lint-staged
- musicforprogramming.net
## openwordnet
download libraries via
@@ -44,17 +53,3 @@ list all libraries:
```bash
python -c "import wn; print(wn.lexicons())"
```
## drizzle
generate migration file, go to packages/db, then:
```bash
pnpm drizzle-kit generate
```
execute migration, go to packages/db (docker containers need to be running):
```bash
DATABASE_URL=postgresql://username:password@localhost:5432/database pnpm drizzle-kit migrate
```


@@ -51,11 +51,7 @@ export const term_glosses = pgTable(
    created_at: timestamp({ withTimezone: true }).defaultNow().notNull(),
  },
  (table) => [
    unique("unique_term_gloss").on(table.term_id, table.language_code),
    check(
      "language_code_check",
      sql`${table.language_code} IN (${sql.raw(SUPPORTED_LANGUAGE_CODES.map((l) => `'${l}'`).join(", "))})`,


@@ -1,6 +1,7 @@
import { db } from "@glossa/db";
import { eq, and, isNotNull, sql } from "drizzle-orm";
import { terms, translations, term_glosses } from "@glossa/db/schema";
import { alias } from "drizzle-orm/pg-core";
import type {
  SupportedLanguageCode,
@@ -8,25 +9,73 @@ import type {
  DifficultyLevel,
} from "@glossa/shared";
export type TranslationPairRow = {
  termId: string;
  sourceText: string;
  targetText: string;
  sourceGloss: string | null;
};

// Note: difficulty filter is intentionally asymmetric. We filter on the target
// (answer) side only — a word can be A2 in Italian but B1 in English, and what
// matters for the learner is the difficulty of the word they're being taught.
export const getGameTerms = async (
  sourceLanguage: SupportedLanguageCode,
  targetLanguage: SupportedLanguageCode,
  pos: SupportedPos,
  difficulty: DifficultyLevel,
  rounds: number,
): Promise<TranslationPairRow[]> => {
  // Two aliases over translations so source and target join independently.
  const sourceTranslations = alias(translations, "source_translations");
  const targetTranslations = alias(translations, "target_translations");

  const rows = await db
    .select({
      termId: terms.id,
      sourceText: sourceTranslations.text,
      targetText: targetTranslations.text,
      sourceGloss: term_glosses.text,
    })
    .from(terms)
    .innerJoin(
      sourceTranslations,
      and(
        eq(sourceTranslations.term_id, terms.id),
        eq(sourceTranslations.language_code, sourceLanguage), // language filter in the join condition
      ),
    )
    .innerJoin(
      targetTranslations,
      and(
        eq(targetTranslations.term_id, terms.id),
        eq(targetTranslations.language_code, targetLanguage), // language filter in the join condition
      ),
    )
    .leftJoin(
      term_glosses,
      and(
        eq(term_glosses.term_id, terms.id),
        eq(term_glosses.language_code, sourceLanguage),
      ),
    )
    .where(
      and(
        eq(terms.pos, pos),
        eq(targetTranslations.difficulty, difficulty),
        isNotNull(sourceTranslations.difficulty), // data quality guard: drop unenriched rows
      ),
    )
    // TODO(post-mvp): ORDER BY RANDOM() sorts the entire filtered result set
    // before applying LIMIT, which is fine at current data volumes (low
    // thousands of rows after POS + difficulty filters) but degrades as the
    // terms table grows. Once the database is fully populated and tagged,
    // replace with one of:
    //   - TABLESAMPLE BERNOULLI(n) for approximate sampling on large tables
    //   - Random offset: SELECT ... OFFSET floor(random() * (SELECT count(*) ...))
    //   - Pre-computed random column with a btree index, reshuffled periodically
    // Benchmark first — don't optimise until it actually hurts.
    .orderBy(sql`RANDOM()`)
    .limit(rounds);

  return rows;
};