From b59fac493dc329270a940b79529df99439b000a4 Mon Sep 17 00:00:00 2001
From: lila
Date: Fri, 10 Apr 2026 18:02:03 +0200
Subject: [PATCH] feat(api): implement game terms query with double join

- Add double join on translations for source/target languages
- Left join term_glosses for optional source-language glosses
- Filter difficulty on target side only (intentionally asymmetric:
  a word's difficulty can differ between languages, and what matters
  is the difficulty of the word being learned)
- Return neutral field names (sourceText, targetText, sourceGloss)
  instead of quiz semantics; service layer maps to prompt/answer
- Tighten term_glosses unique constraint to (term_id, language_code)
  to prevent the left join from multiplying question rows
- Add TODO for ORDER BY RANDOM() scaling post-MVP
---
 documentation/api-development.md    | 288 ++++++++++++++++++++++++++++
 documentation/notes.md              |  23 +--
 packages/db/src/db/schema.ts        |   6 +-
 packages/db/src/models/termModel.ts |  67 ++++++-
 4 files changed, 356 insertions(+), 28 deletions(-)
 create mode 100644 documentation/api-development.md

diff --git a/documentation/api-development.md b/documentation/api-development.md
new file mode 100644
index 0000000..9ccf1eb
--- /dev/null
+++ b/documentation/api-development.md
@@ -0,0 +1,288 @@
+# Glossa — Architecture & API Development Summary
+
+A record of all architectural discussions, decisions, and outcomes from the initial
+API design through the quiz model implementation.
+
+---
+
+## Project Overview
+
+Glossa is a vocabulary trainer (Duolingo-style) built as a pnpm monorepo. Users see a
+word and pick from 4 possible translations. Supports singleplayer and multiplayer.
+Stack: Express API, React frontend, Drizzle ORM, Postgres, Valkey, WebSockets.
+
+---
+
+## Architectural Foundation
+
+### The Layered Architecture
+
+The core mental model established for the entire API:
+
+```
+HTTP Request
+    ↓
+  Router       — maps URL + HTTP method to a controller
+    ↓
+  Controller   — handles HTTP only: validates input, calls service, sends response
+    ↓
+  Service      — business logic only: no HTTP, no direct DB access
+    ↓
+  Model        — database queries only: no business logic
+    ↓
+  Database
+```
+
+**The rule:** each layer only talks to the layer directly below it. A controller never
+touches the database. A service never reads `req.body`. A model never knows what a quiz is.
+
+### Monorepo Package Responsibilities
+
+| Package | Owns |
+|---------|------|
+| `packages/shared` | Zod schemas, constants, derived TypeScript types |
+| `packages/db` | Drizzle schema, DB connection, all model/query functions |
+| `apps/api` | Router, controllers, services |
+| `apps/web` | React frontend, consumes types from shared |
+
+**Key principle:** all database code lives in `packages/db`. `apps/api` never imports
+`drizzle-orm` for queries — it only calls functions exported from `packages/db`.
+
+---
+
+## Problems Faced & Solutions
+
+- Problem 1: Messy API structure
+**Symptom:** responsibilities bleeding across layers — DB code in controllers, business
+logic in routes.
+**Solution:** strict layered architecture with one responsibility per layer.
+- Problem 2: No shared contract between API and frontend
+**Symptom:** API could return different shapes silently, frontend breaks at runtime.
+**Solution:** Zod schemas in `packages/shared` as the single source of truth. Both API
+(validation) and frontend (type inference) consume the same schemas.
+- Problem 3: Type safety gaps
+**Symptom:** TypeScript `any` types on model parameters, `Number` vs `number` confusion.
+**Solution:** derived types from constants using `typeof CONSTANT[number]` pattern.
+All valid values defined once in constants, types derived automatically.
+- Problem 4: `getGameTerms` in wrong package
+**Symptom:** model queries living in `apps/api/src/models/` meant `apps/api` had a
+direct `drizzle-orm` dependency and was accessing the DB itself.
+**Solution:** moved models folder to `packages/db/src/models/`. All Drizzle code now
+lives in one package.
+- Problem 5: Deck generation complexity
+**Initial assumption:** 12 decks needed (nouns/verbs × easy/intermediate/hard × en/it).
+**Correction:** decks are pools, not presets. POS and difficulty are query filters applied
+at runtime — not deck properties. Only 2 decks needed (en-core, it-core).
+**Final decision:** skip deck generation entirely for MVP. Query the terms table directly
+with difficulty + POS filters. Revisit post-MVP when spaced repetition or progression
+features require curated pools.
+- Problem 6: GAME_ROUNDS type conflict
+**Problem:** `z.enum()` only accepts strings. `GAME_ROUNDS = ["3", "10"]` works with
+`z.enum()` but requires `Number(rounds)` conversion in the service.
+**Decision:** keep as strings, convert to number in the service before passing to the
+model. Documented coupling acknowledged with a comment.
+- Problem 7: Gloss join could multiply question rows. Schema allowed multiple glosses per term per language, so the left join would duplicate rows. Fixed by tightening the unique constraint.
+- Problem 8: Model leaked quiz semantics. Return fields were named prompt / answer, baking service-layer concepts into the database layer. Renamed to neutral field names.
+---
+
+## Decisions Made
+
+- Zod schemas belong in `packages/shared`
+Both the API and frontend import from the same schemas. If the shape changes, TypeScript
+compilation fails in both places simultaneously — silent drift is impossible.
+- Server-side answer evaluation
+The correct answer is never sent to the frontend in `QuizQuestion`. It is only revealed
+in `AnswerResult` after the client submits. Prevents cheating and keeps game logic
+authoritative on the server.
+- `safeParse` over `parse` in controllers
+`parse` throws a raw Zod error → ugly 500 response. `safeParse` returns a result object
+→ clean 400 with early return. Global error handler to be implemented later (Step 6 of
+roadmap) will centralise this pattern.
+- POST not GET for game start
+`GET` requests have no body. Game configuration is submitted as a JSON body → `POST` is
+semantically correct.
+- `express.json()` middleware required
+Without it, `req.body` is `undefined`. Added to `createApp()` in `app.ts`.
+- Type naming: PascalCase
+TypeScript convention. `supportedLanguageCode` → `SupportedLanguageCode` etc.
+- Primitive types: always lowercase
+`number` not `Number`, `string` not `String`. The uppercase versions are object wrappers
+and not assignable to Drizzle's expected primitive types.
+- Model parameters use shared types, not `GameRequestType`
+The model layer should not know about `GameRequestType` — that's an HTTP boundary concern.
+Instead, parameters are typed using the derived constant types (`SupportedLanguageCode`,
+`SupportedPos`, `DifficultyLevel`) exported from `packages/shared`.
+- One gloss per term per language. The unique constraint on term_glosses was tightened from (term_id, language_code, text) to (term_id, language_code) to prevent the left join from multiplying question rows. Revisit if multiple glosses per language are ever needed (e.g. register or domain variants).
+- Model returns neutral field names, not quiz semantics. getGameTerms returns sourceText / targetText / sourceGloss rather than prompt / answer / gloss. Quiz semantics are applied in the service layer. Keeps the model reusable for non-quiz features.
+- Asymmetric difficulty filter. Difficulty is filtered on the target (answer) side only. A word can be A2 in Italian but B1 in English, and what matters is the difficulty of the word being learned.
+
+---
+
+## Data Pipeline Work (Pre-API)
+
+### CEFR Enrichment Pipeline (completed)
+
+A staged ETL pipeline was built to enrich translation records with CEFR levels and
+difficulty ratings:
+
+```
+Raw source files
+    ↓
+extract-*.py — normalise each source to standard JSON
+    ↓
+compare-*.py — quality gate: surface conflicts between sources (read-only)
+    ↓
+merge-*.py — resolve conflicts by source priority, derive difficulty
+    ↓
+enrich.ts — write cefr_level + difficulty to DB translations table
+```
+
+**Source priority:**
+- English: `en_m3` > `cefrj` > `octanove` > `random`
+- Italian: `it_m3` > `italian`
+
+**Enrichment results:**
+
+| Language | Enriched | Total | Coverage |
+|----------|----------|-------|----------|
+| English | 42,527 | 171,394 | ~25% |
+| Italian | 23,061 | 54,603 | ~42% |
+
+Both languages have sufficient coverage for MVP. Italian C2 has only 242 terms — noted
+as a potential constraint for the distractor algorithm at high difficulty.
+
+---
+
+## API Schemas (packages/shared)
+
+### `GameRequestSchema` (implemented)
+```typescript
+{
+  source_language: z.enum(SUPPORTED_LANGUAGE_CODES),
+  target_language: z.enum(SUPPORTED_LANGUAGE_CODES),
+  pos: z.enum(SUPPORTED_POS),
+  difficulty: z.enum(DIFFICULTY_LEVELS),
+  rounds: z.enum(GAME_ROUNDS),
+}
+```
+
+### Planned schemas (not yet implemented)
+```
+QuizQuestion — prompt, optional gloss, 4 options (no correct answer)
+QuizOption — optionId + text
+AnswerSubmission — questionId + selectedOptionId
+AnswerResult — correct boolean, correctOptionId, selectedOptionId
+```
+
+---
+
+## API Endpoints
+
+```
+POST /api/v1/game/start   GameRequest → QuizQuestion[]
+POST /api/v1/game/answer  AnswerSubmission → AnswerResult
+```
+
+---
+
+## Current File Structure (apps/api)
+
+```
+apps/api/src/
+├── app.ts — Express app, express.json() middleware
+├── server.ts — starts server on PORT
+├── routes/
+│   ├── apiRouter.ts — mounts /health and /game routers
+│   ├── gameRouter.ts — POST /start → createGame
controller
+│   └── healthRouter.ts
+├── controllers/
+│   └── gameController.ts — validates GameRequest, calls service
+└── services/
+    └── gameService.ts — calls getGameTerms, returns raw rows
+```
+
+---
+
+## Current File Structure (packages/db)
+
+```
+packages/db/src/
+├── db/
+│   └── schema.ts — Drizzle schema (terms, translations, users, decks...)
+├── models/
+│   └── termModel.ts — getGameTerms() query
+└── index.ts — exports db connection + getGameTerms
+```
+
+---
+
+## Completed Tasks
+
+- [x] Layered architecture established and understood
+- [x] `GameRequestSchema` defined in `packages/shared`
+- [x] Derived types (`SupportedLanguageCode`, `SupportedPos`, `DifficultyLevel`) exported from constants
+- [x] `getGameTerms()` model implemented with POS / language / difficulty / limit filters
+- [x] Model correctly placed in `packages/db`
+- [x] `prepareGameQuestions()` service skeleton calling the model
+- [x] `createGame` controller with Zod `safeParse` validation
+- [x] `POST /api/v1/game/start` route wired
+- [x] End-to-end pipeline verified with test script — returns correct rows
+- [x] CEFR enrichment pipeline complete for English and Italian
+- [x] Double join on translations implemented (source + target language)
+- [x] Gloss left join implemented
+- [x] Model return type uses neutral field names (sourceText, targetText, sourceGloss)
+- [x] Schema: gloss unique constraint tightened to one gloss per term per language
+
+---
+
+## Roadmap Ahead
+
+### Step 1 — Learn SQL fundamentals (in progress)
+Concepts needed: SELECT, FROM, JOIN, WHERE, LIMIT.
+Resources: sqlzoo.net or Khan Academy SQL section.
+Required before: implementing the double join for source language prompt.
+
+### Step 2 — Complete the model layer
+- Double join on `translations` — once for source language (prompt), once for target language (answer)
+- `GlossModel.getGloss(termId, languageCode)` — fetch gloss if available
+
+### Step 3 — Define remaining Zod schemas
+- `QuizQuestion`, `QuizOption`, `AnswerSubmission`, `AnswerResult` in `packages/shared`
+
+### Step 4 — Complete the service layer
+- `QuizService.buildSession()` — assemble raw rows into `QuizQuestion[]`
+  - Generate `questionId` per question
+  - Map source language translation as prompt
+  - Attach gloss if available
+  - Fetch 3 distractors (same POS, different term, same difficulty)
+  - Shuffle options so correct answer is not always in same position
+- `QuizService.evaluateAnswer()` — validate correctness, return `AnswerResult`
+
+### Step 5 — Implement answer endpoint
+- `POST /api/v1/game/answer` route, controller, service method
+
+### Step 6 — Global error handler
+- Typed error classes (`ValidationError`, `NotFoundError`)
+- Central error middleware in `app.ts`
+- Remove temporary `safeParse` error handling from controllers
+
+### Step 7 — Tests
+- Unit tests for `QuizService` — correct POS filtering, distractor never equals correct answer
+- Unit tests for `evaluateAnswer` — correct and incorrect cases
+- Integration tests for both endpoints
+
+### Step 8 — Auth (Phase 2 from original roadmap)
+- OpenAuth integration
+- JWT validation middleware
+- `GET /api/auth/me` endpoint
+- Frontend auth guard
+
+---
+
+## Open Questions
+
+- **Distractor algorithm:** when Italian C2 has only 242 terms, should the difficulty
+  filter fall back gracefully or return an error? Decision needed before implementing
+  `buildSession()`.
+- **Session statefulness:** game loop is currently stateless (fetch all questions upfront).
+  Confirm this is still the intended MVP approach before building `buildSession()`.
diff --git a/documentation/notes.md b/documentation/notes.md
index 4ff408e..0d52be8 100644
--- a/documentation/notes.md
+++ b/documentation/notes.md
@@ -5,6 +5,15 @@
 - pinning dependencies in package.json files
 - rethink organisation of datafiles and wordlists
 
+## notes
+
+- backend advice: https://github.com/MohdOwaisShah/backend
+- openapi
+- bruno for api testing
+- tailscale
+- husky/lint-staged
+- musicforprogramming.net
+
 ## openwordnet
 
 download libraries via
@@ -44,17 +53,3 @@ list all libraries:
 ```bash
 python -c "import wn; print(wn.lexicons())"
 ```
-
-## drizzle
-
-generate migration file, go to packages/db, then:
-
-```bash
-pnpm drizzle-kit generate
-```
-
-execute migration, go to packages/db (docker containers need to be running):
-
-```bash
-DATABASE_URL=postgresql://username:password@localhost:5432/database pnpm drizzle-kit migrate
-```
diff --git a/packages/db/src/db/schema.ts b/packages/db/src/db/schema.ts
index 637f127..b7b91b0 100644
--- a/packages/db/src/db/schema.ts
+++ b/packages/db/src/db/schema.ts
@@ -51,11 +51,7 @@ export const term_glosses = pgTable(
     created_at: timestamp({ withTimezone: true }).defaultNow().notNull(),
   },
   (table) => [
-    unique("unique_term_gloss").on(
-      table.term_id,
-      table.language_code,
-      table.text,
-    ),
+    unique("unique_term_gloss").on(table.term_id, table.language_code),
     check(
       "language_code_check",
       sql`${table.language_code} IN (${sql.raw(SUPPORTED_LANGUAGE_CODES.map((l) => `'${l}'`).join(", "))})`,
diff --git a/packages/db/src/models/termModel.ts b/packages/db/src/models/termModel.ts
index b1efea3..1c01b53 100644
--- a/packages/db/src/models/termModel.ts
+++ b/packages/db/src/models/termModel.ts
@@ -1,6 +1,7 @@
 import { db } from "@glossa/db";
-import { eq, and } from "drizzle-orm";
-import { terms, translations } from "@glossa/db/schema";
+import { eq, and, isNotNull, sql } from "drizzle-orm";
+import { terms, translations, term_glosses } from "@glossa/db/schema";
+import { alias } from "drizzle-orm/pg-core";
 
 import type {
   SupportedLanguageCode,
@@ -8,25 +9,73 @@ import type {
   DifficultyLevel,
 } from "@glossa/shared";
 
+export type TranslationPairRow = {
+  termId: string;
+  sourceText: string;
+  targetText: string;
+  sourceGloss: string | null;
+};
+
+// Note: difficulty filter is intentionally asymmetric. We filter on the target
+// (answer) side only — a word can be A2 in Italian but B1 in English, and what
+// matters for the learner is the difficulty of the word they're being taught.
+
 export const getGameTerms = async (
   sourceLanguage: SupportedLanguageCode,
   targetLanguage: SupportedLanguageCode,
   pos: SupportedPos,
   difficulty: DifficultyLevel,
-  count: number,
-) => {
+  rounds: number,
+): Promise<TranslationPairRow[]> => {
+  const sourceTranslations = alias(translations, "source_translations");
+  const targetTranslations = alias(translations, "target_translations");
+
   const rows = await db
-    .select()
+    .select({
+      termId: terms.id,
+      sourceText: sourceTranslations.text,
+      targetText: targetTranslations.text,
+      sourceGloss: term_glosses.text,
+    })
     .from(terms)
-    .innerJoin(translations, eq(translations.term_id, terms.id))
+    .innerJoin(
+      sourceTranslations,
+      and(
+        eq(sourceTranslations.term_id, terms.id),
+        eq(sourceTranslations.language_code, sourceLanguage), // language filter lives in the join condition
+      ),
+    )
+    .innerJoin(
+      targetTranslations,
+      and(
+        eq(targetTranslations.term_id, terms.id),
+        eq(targetTranslations.language_code, targetLanguage), // language filter lives in the join condition
+      ),
+    )
+    .leftJoin(
+      term_glosses,
+      and(
+        eq(term_glosses.term_id, terms.id),
+        eq(term_glosses.language_code, sourceLanguage),
+      ),
+    )
     .where(
       and(
         eq(terms.pos, pos),
-        eq(translations.language_code, targetLanguage),
-        eq(translations.difficulty, difficulty),
+        eq(targetTranslations.difficulty, difficulty),
+        isNotNull(sourceTranslations.difficulty), // source only needs some difficulty tag, at any level
       ),
+    )
-    .limit(count);
+    // TODO(post-mvp): ORDER BY RANDOM() sorts the entire filtered result set before
+    // applying LIMIT, which is fine at current data volumes (low thousands of rows
+    // after POS + difficulty filters) but degrades as the terms table grows. Once
+    // the database is fully populated and tagged, replace with one of:
+    // - TABLESAMPLE BERNOULLI(n) for approximate sampling on large tables
+    // - Random offset: SELECT ... OFFSET floor(random() * (SELECT count(*) ...))
+    // - Pre-computed random column with a btree index, reshuffled periodically
+    // Benchmark first — don't optimise until it actually hurts.
+    .orderBy(sql`RANDOM()`)
+    .limit(rounds);
 
   return rows;
 };
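
Not part of this patch: the commit message says the service layer maps the neutral model fields to prompt/answer. A minimal sketch of that mapping might look like the following. `QuizItem` and `toQuizItem` are hypothetical names chosen for illustration, the sample rows are made up, and `TranslationPairRow` is redeclared locally only so the snippet stands alone.

```typescript
// Local copy of the row shape exported by termModel.ts in this patch.
type TranslationPairRow = {
  termId: string;
  sourceText: string;
  targetText: string;
  sourceGloss: string | null;
};

// Hypothetical quiz-facing shape; the real QuizQuestion schema is still planned.
type QuizItem = {
  termId: string;
  prompt: string;
  answer: string;
  gloss?: string;
};

// Map one neutral model row to quiz semantics (prompt/answer).
const toQuizItem = (row: TranslationPairRow): QuizItem => ({
  termId: row.termId,
  prompt: row.sourceText,
  answer: row.targetText,
  // Omit the gloss key entirely when the left join found no gloss.
  ...(row.sourceGloss !== null ? { gloss: row.sourceGloss } : {}),
});

// Made-up rows in the shape the model returns.
const rows: TranslationPairRow[] = [
  { termId: "t1", sourceText: "dog", targetText: "cane", sourceGloss: "a domesticated canine" },
  { termId: "t2", sourceText: "house", targetText: "casa", sourceGloss: null },
];

const items: QuizItem[] = rows.map(toQuizItem);
console.log(items);
```

Keeping this mapping in the service leaves `getGameTerms` free of quiz vocabulary, so the same rows could back a non-quiz feature without changes to the model.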