updating documentation, formatting

This commit is contained in:
lila 2026-04-12 09:28:35 +02:00
parent e320f43d8e
commit 047196c973
8 changed files with 523 additions and 2282 deletions


# Decisions Log
A record of non-obvious technical decisions made during development, with reasoning. Intended to preserve context across sessions. Grouped by topic area.
---
### Multi-stage builds for monorepo context
Both `apps/web` and `apps/api` use multi-stage Dockerfiles (`deps`, `dev`, `builder`, `runner`) because the monorepo structure requires copying `pnpm-workspace.yaml`, root `package.json`, and cross-dependencies before installing. Stages allow caching `pnpm install` separately from source code changes.
### Vite as dev server (not Nginx)
In development, `apps/web` uses `vite dev` directly, not Nginx. HMR requires Vite's WebSocket dev server. Production will use Nginx to serve static Vite build output.
---
### Express app structure: factory function pattern
`app.ts` exports a `createApp()` factory function. `server.ts` imports it and calls `.listen()`. This allows tests to import the app directly without starting a server (used by supertest).
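As a sketch of the pattern, using Node's built-in `http` module as a stand-in for Express (the route and shapes here are assumptions, not the project's actual code):

```typescript
import type { IncomingMessage, ServerResponse } from "node:http";

// app.ts would export this factory; it builds the handler but never binds a port
function createApp() {
  return (req: IncomingMessage, res: ServerResponse): void => {
    if (req.url === "/api/health") {
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ status: "ok" }));
      return;
    }
    res.writeHead(404);
    res.end();
  };
}

// server.ts would be the only place .listen() is called:
// createServer(createApp()).listen(3000);
```

Because the factory returns a plain handler, tests exercise it directly with no port, no teardown, and no cross-test interference.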
### Zod schemas belong in `packages/shared`
Both the API and frontend import from the same schemas. If the shape changes, TypeScript compilation fails in both places simultaneously — silent drift is impossible.
### Server-side answer evaluation
The correct answer is never sent to the frontend in `GameQuestion`. It is only revealed in `AnswerResult` after the client submits. Prevents cheating and keeps game logic authoritative on the server.
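A minimal sketch of the shape split; these interfaces are illustrative, not the project's actual Zod schemas:

```typescript
// What the client receives: no correct answer field, nothing to cheat with
interface GameQuestion {
  questionId: string;
  prompt: string;
  options: { optionId: number; text: string }[];
}

// What the client receives back after submitting
interface AnswerResult {
  questionId: string;
  selectedOptionId: number;
  correctOptionId: number; // revealed only after submission
  correct: boolean;
}

// Server-side evaluation against an answer key the client never sees
function evaluateAnswer(
  answerKey: Map<string, number>,
  questionId: string,
  selectedOptionId: number,
): AnswerResult {
  const correctOptionId = answerKey.get(questionId);
  if (correctOptionId === undefined) throw new Error("unknown question");
  return {
    questionId,
    selectedOptionId,
    correctOptionId,
    correct: selectedOptionId === correctOptionId,
  };
}
```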
### `safeParse` over `parse` in controllers
`parse` throws a raw Zod error → ugly 500 response. `safeParse` returns a result object → clean 400 with early return via the error handler.
### POST not GET for game start
`GET` requests have no body. Game configuration is submitted as a JSON body → `POST` is semantically correct.
### Model parameters use shared types, not `GameRequestType`
The model layer should not know about `GameRequestType` — that's an HTTP boundary concern. Parameters are typed using the derived constant types (`SupportedLanguageCode`, `SupportedPos`, `DifficultyLevel`) exported from `packages/shared`.
### Model returns neutral field names, not quiz semantics
`getGameTerms` returns `sourceText` / `targetText` / `sourceGloss` rather than `prompt` / `answer` / `gloss`. Quiz semantics are applied in the service layer. Keeps the model reusable for non-quiz features.
### Asymmetric difficulty filter
Difficulty is filtered on the target (answer) side only. A word can be A2 in Italian but B1 in English, and what matters is the difficulty of the word being learned.
### optionId as integer 0-3, not UUID
Options only need uniqueness within a single question; cheating prevented by shuffling, not opaque IDs.
### questionId and sessionId as UUIDs
Globally unique, opaque, natural Valkey keys when storage moves later.
### gloss is `string | null` rather than optional
Predictable shape on the frontend — always present, sometimes null.
### GameSessionStore stores only the answer key
Minimal payload (`questionId → correctOptionId`) for easy Valkey migration. All methods are async even for the in-memory implementation, so the service layer is already written for Valkey.
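A minimal sketch, assuming hypothetical class and method names; every method is async so a Valkey-backed implementation can drop in without touching the service layer:

```typescript
class InMemorySessionStore {
  private answers = new Map<string, number>();

  // Async even though the Map is synchronous: keeps call sites future-proof
  async setAnswer(questionId: string, correctOptionId: number): Promise<void> {
    this.answers.set(questionId, correctOptionId);
  }

  async getAnswer(questionId: string): Promise<number | undefined> {
    return this.answers.get(questionId);
  }

  async deleteAnswer(questionId: string): Promise<void> {
    this.answers.delete(questionId);
  }
}
```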
### Distractors fetched per-question (N+1 queries)
Correct shape for the problem; 10 queries on local Postgres is negligible latency.
### No fallback logic for insufficient distractors
Data volumes are sufficient; strict query throws if something is genuinely broken.
### Distractor query excludes both term ID and answer text
Prevents duplicate options from different terms with the same translation.
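The exclusion logic, sketched in plain TypeScript (the real implementation is a Drizzle query; the row shape and names are assumptions):

```typescript
interface TranslationRow {
  termId: string;
  text: string;
}

// Exclude both the answer's term and its exact text, so a different term
// with the same translation can never appear as a "wrong" option.
function pickDistractors(
  candidates: TranslationRow[],
  answerTermId: string,
  answerText: string,
  count: number,
): TranslationRow[] {
  return candidates
    .filter((row) => row.termId !== answerTermId && row.text !== answerText)
    .slice(0, count);
}
```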
### Submit-before-send flow on frontend
User selects, then confirms. Prevents misclicks.
### Multiplayer mechanic: simultaneous answers (not buzz-first)
All players see the same question at the same time and submit independently. The server waits for all answers or a 15-second timeout, then broadcasts the result. Keeps the experience symmetric.
### Room model: room codes (not matchmaking queue)
Players create rooms and share a human-readable code (e.g. `WOLF-42`). Auto-matchmaking deferred.
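A possible generator for such codes (the word list and exact format are assumptions, not the project's actual code):

```typescript
// Human-readable room codes like "WOLF-42": a short word plus two digits
const CODE_WORDS = ["WOLF", "BEAR", "HAWK", "LYNX", "IBEX", "TOAD"];

function generateRoomCode(): string {
  const word = CODE_WORDS[Math.floor(Math.random() * CODE_WORDS.length)];
  const num = Math.floor(Math.random() * 90) + 10; // 10..99
  return `${word}-${num}`;
}
```

Collisions would need a uniqueness check against active rooms on creation.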
---
## Error Handling
### `AppError` base class over error code maps
A `statusCode` on the error itself means the middleware doesn't need a lookup table. New error types are self-contained — one class, one status code. `ValidationError` (400) and `NotFoundError` (404) extend `AppError`.
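A sketch consistent with the decision above (exact constructor signatures are assumptions):

```typescript
// Base class carries the HTTP status with the error itself
class AppError extends Error {
  constructor(message: string, public readonly statusCode: number) {
    super(message);
    this.name = new.target.name;
  }
}

class ValidationError extends AppError {
  constructor(message: string) {
    super(message, 400);
  }
}

class NotFoundError extends AppError {
  constructor(message: string) {
    super(message, 404);
  }
}

// The middleware needs no lookup table: status travels with the error
function statusFor(err: unknown): number {
  return err instanceof AppError ? err.statusCode : 500;
}
```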
### `next(error)` over `res.status().json()` in controllers
Express requires explicit `next(error)` for async handlers — it does not catch async errors automatically. Centralises all error formatting in one middleware. Controllers stay clean: validate, call service, send response.
### Zod `.message` over `.issues[0]?.message`
Returns all validation failures at once, not just the first. Output is verbose (raw JSON string) — revisit formatting post-MVP if the frontend needs structured `{ field, message }[]` error objects.
### Where errors are thrown
`ValidationError` is thrown in the controller (the layer that runs `safeParse`). `NotFoundError` is thrown in the service (the layer that knows whether a session or question exists). The service doesn't know about HTTP — it throws a typed error, and the middleware maps it to a status code.
---
## Testing
### Mocked DB for unit tests (not test database)
Unit tests mock `@glossa/db` via `vi.mock` — the real database is never touched. Tests run in milliseconds with no infrastructure dependency. Integration tests with a real test DB are deferred post-MVP.
### Co-located test files
`gameService.test.ts` lives next to `gameService.ts`, not in a separate `__tests__/` directory. Convention matches the `vitest` default and keeps related files together.
### supertest for endpoint tests
Uses `createApp()` factory directly — no server started. Tests the full HTTP layer (routing, middleware, error handler) with real request/response assertions.
---
### Base config: no `lib`, `module`, or `moduleResolution`
Intentionally omitted from `tsconfig.base.json` because different packages need different values — `apps/api` uses `NodeNext`, `apps/web` uses `ESNext`/`bundler` (Vite). Each package declares its own.
### `outDir: "./dist"` per package
The base config originally had `outDir: "dist"` which resolved relative to the base file location, pointing to the root `dist` folder. Overridden in each package with `"./dist"`.
### `apps/web` tsconfig: deferred to Vite scaffold
Filled in after `pnpm create vite` generated tsconfig files. The generated files were trimmed to remove options already covered by the base.
### `rootDir: "."` on `apps/api`
Set explicitly to allow `vitest.config.ts` (outside `src/`) to be included in the TypeScript program.
### Type naming: PascalCase
`supportedLanguageCode` → `SupportedLanguageCode`. TypeScript convention.
### Primitive types: always lowercase
`number` not `Number`, `string` not `String`. The uppercase versions are object wrappers and not assignable to Drizzle's expected primitive types.
### `globals: true` with `"types": ["vitest/globals"]`
Using Vitest globals requires `"types": ["vitest/globals"]` in each package's tsconfig. Added to `apps/api`, `packages/shared`, `packages/db`, and `apps/web/tsconfig.app.json`.
---
### Two-config approach for `apps/web`
Root `eslint.config.mjs` handles TypeScript linting across all packages. `apps/web/eslint.config.js` adds React-specific plugins only. ESLint flat config merges them by directory proximity.
### Coverage config at root only
Vitest coverage configuration lives in the root `vitest.config.ts` only. Produces a single aggregated report.
---
### Users: internal UUID + openauth_sub (not sub as PK)
Embedding the auth provider in the primary key would cascade through all FKs if OpenAuth changes format. `users.id` = internal UUID (stable FK target). `users.openauth_sub` = text UNIQUE (auth provider claim).
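An illustrative DDL sketch of the decision (the default expression and exact DDL are assumptions):

```sql
CREATE TABLE users (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- stable internal FK target
  openauth_sub text NOT NULL UNIQUE               -- provider claim, swappable later
);
```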
### Rooms: `updated_at` for stale recovery only
Most tables omit `updated_at`. `rooms.updated_at` is kept specifically for identifying rooms stuck in `in_progress` status after server crashes.
### Translations: UNIQUE (term_id, language_code, text)
Allows multiple synonyms per language per term (e.g. "dog", "hound" for same synset). Prevents exact duplicate rows.
### One gloss per term per language
The unique constraint on `term_glosses` was tightened from `(term_id, language_code, text)` to `(term_id, language_code)` to prevent left joins from multiplying question rows. Revisit if multiple glosses per language are ever needed.
### Decks: `source_language` + `validated_languages` (not `pair_id`)
One deck can serve multiple target languages as long as translations exist for all its terms. `source_language` is the language the wordlist was curated from. `validated_languages` is recalculated on every generation script run. Enforced via CHECK: `source_language` is never in `validated_languages`.
### Decks: wordlist tiers as scope (not POS-split decks)
One deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary at query time; the user never picks a deck, only a direction, POS, and difficulty. Decks must not overlap — each term appears in exactly one tier, enforced at import time by the generation script. Progression works by expanding the deck set as the user advances:
```sql
WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
  AND t.pos = 'noun'
  AND t.cefr_level = 'B1'
```
### Decks: SUBTLEX as wordlist source (not manual curation)
The most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian. SUBTLEX exists in per-language editions derived from subtitle corpora using the same methodology — making them comparable. `en-core-1000` built from SUBTLEX-EN, `it-core-1000` from SUBTLEX-IT.
### `language_pairs` table: dropped
Valid pairs are implicitly defined by `decks.source_language` + `decks.validated_languages`. The table was redundant; the same pairs can be derived directly from decks:
```sql
SELECT DISTINCT source_language, unnest(validated_languages) AS target_language
FROM decks
WHERE validated_languages != '{}'
```
### Terms: `synset_id` nullable (not NOT NULL)
Non-WordNet terms won't have a synset ID. Postgres `UNIQUE` on a nullable column allows multiple NULL values.
### Terms: `source` + `source_id` columns
Once multiple import pipelines exist (OMW, Wiktionary), `synset_id` alone is insufficient as an idempotency key. Unique constraint on the pair. Postgres allows multiple NULL pairs. `synset_id` remains for now — deprecate during a future pipeline refactor.
### `cefr_level` on `translations` (not `terms`)
CEFR difficulty is language-relative, not concept-relative. "House" in English is A1, "domicile" is also English but B2 — same concept, different words, different difficulty. Added as nullable `varchar(2)` with CHECK.
### Categories + term_categories: empty for MVP
Schema exists. Grammar maps to POS (already on `terms`), Media maps to deck membership. Thematic categories require a metadata source still under research.
### CHECK over pgEnum for extensible value sets
`ALTER TYPE enum_name ADD VALUE` in Postgres is non-transactional — cannot be rolled back if a migration fails. CHECK constraints are fully transactional. Rule: pgEnum for truly static sets, CHECK for any set tied to a growing constant.
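For contrast, a sketch of what extending a CHECK-constrained set looks like (constraint and column names are assumptions):

```sql
-- Growing a CHECK-constrained set is an ordinary, fully transactional migration
BEGIN;
ALTER TABLE terms DROP CONSTRAINT terms_pos_check;
ALTER TABLE terms ADD CONSTRAINT terms_pos_check
  CHECK (pos IN ('noun', 'verb', 'adjective', 'adverb'));
COMMIT;
-- ALTER TYPE ... ADD VALUE, by contrast, has long-standing transactional
-- restrictions in Postgres, which is the motivation above.
```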
### `language_code` always CHECK-constrained
Unlike `source` (only written by import scripts), `language_code` is a query-critical filter column. A typo would silently produce missing data. Rule: any column game queries filter on should be CHECK-constrained.
### Unique constraints make explicit FK indexes redundant
Postgres automatically creates an index to enforce a unique constraint. A separate index on the leading column of an existing unique constraint adds no value. Rule: before adding an explicit index, check whether an existing unique constraint already covers it.
### Future extensions: morphology and pronunciation (deferred, additive)
Deferred post-MVP, all purely additive: new tables referencing existing `terms` rows via FK, with no changes to the existing schema. Planned: `noun_forms` (gender, singular, plural, articles), `verb_forms` (conjugation tables), `term_pronunciations` (IPA, audio URLs). Sources: Wiktionary / Forvo. Exercise types split into Type A (translation, current model) and Type B (morphology, future); the same `terms` rows anchor both.
---
## Data Pipeline
### Seeding v1: batch, truncate-based
For dev/first-time setup. Read JSON, batch inserts in groups of 500, truncate tables before each run. Simple and fast.
Key pitfalls encountered:
- Duplicate key on re-run: truncate before seeding
- `onConflictDoNothing` breaks FK references: when it skips a `terms` insert, the in-memory UUID is never written, causing FK violations on `translations`
- `forEach` doesn't await: use `for...of`
- Final batch not flushed: guard with `if (termsArray.length > 0)` after loop
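The `forEach` pitfall in isolation; `insertBatch` is a hypothetical stand-in for the DB call:

```typescript
// `batches.forEach(async ...)` fires the callbacks and returns immediately,
// so nothing downstream waits for the inserts. `for...of` awaits each one.
async function seedAll(
  batches: string[][],
  insertBatch: (batch: string[]) => Promise<void>,
): Promise<number> {
  let inserted = 0;
  for (const batch of batches) {
    await insertBatch(batch); // would NOT be awaited inside forEach
    inserted += batch.length;
  }
  return inserted;
}
```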
### Seeding v2: incremental upsert, multi-file
For production / adding languages. Extends the database without truncating. Each synset processed individually (no batching — need real `term.id` from DB before inserting translations). Filename convention: `sourcelang-targetlang-pos.json`.
### CEFR enrichment pipeline
Staged ETL: `extract-*.py` → `compare-*.py` (quality gate) → `merge-*.py` (resolve conflicts) → `enrich.ts` (write to DB). Source priority: English `en_m3 > cefrj > octanove > random`, Italian `it_m3 > italian`.
Enrichment results: English 42,527/171,394 (~25%), Italian 23,061/54,603 (~42%). Both sufficient for MVP. Italian C2 has only 242 terms — noted as constraint for distractor algorithm.
### Term glosses: Italian coverage is sparse
OMW gloss data is primarily English. English glosses: 95,882 (~100%), Italian: 1,964 (~2%). UI falls back to English gloss when no gloss exists for the user's language.
### Glosses can leak answers
Some WordNet glosses contain the target-language word in the definition text (e.g. "Padre" in the English gloss for "father"). Address during post-MVP data enrichment — clean glosses, replace with custom definitions, or filter at service layer.
### `packages/db` exports fix
The `exports` field must be an object, not an array:
```json
"exports": {
".": "./src/index.ts",
"./schema": "./src/db/schema.ts"
}
```
---
## API Development: Problems & Solutions
1. **Messy API structure.** Responsibilities bleeding across layers. Fixed with strict layered architecture.
2. **No shared contract.** API could return different shapes silently. Fixed with Zod schemas in `packages/shared`.
3. **Type safety gaps.** `any` types, `Number` vs `number`. Fixed with derived types from constants.
4. **`getGameTerms` in wrong package.** Model queries in `apps/api` meant direct `drizzle-orm` dependency. Moved to `packages/db/src/models/`.
5. **Deck generation complexity.** 12 decks assumed, only 2 needed. Then skipped entirely for MVP — query terms table directly.
6. **GAME_ROUNDS type conflict.** `z.enum()` only accepts strings. Keep as strings, convert to number in service.
7. **Gloss join multiplied rows.** Multiple glosses per term per language. Fixed by tightening unique constraint.
8. **Model leaked quiz semantics.** Return fields named `prompt`/`answer`. Renamed to neutral `sourceText`/`targetText`.
9. **AnswerResult wasn't self-contained.** Frontend needed `selectedOptionId` but schema didn't include it. Added.
10. **Distractor could duplicate correct answer.** Different terms with same translation. Fixed with `ne(translations.text, excludeText)`.
11. **TypeScript strict mode flagged Fisher-Yates shuffle.** `noUncheckedIndexedAccess` treats `result[i]` as `T | undefined`. Fixed with non-null assertion + temp variable.
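The fix from item 11, sketched (a hedged reconstruction, not the project's exact code):

```typescript
// Fisher-Yates shuffle under `noUncheckedIndexedAccess`: result[i] is typed
// `T | undefined`, so a temp variable plus non-null assertions keeps the
// swap well-typed. The assertions are safe because i and j are in bounds.
function shuffle<T>(input: readonly T[]): T[] {
  const result = [...input];
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    const tmp = result[i]!;
    result[i] = result[j]!;
    result[j] = tmp;
  }
  return result;
}
```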
---
## Known Issues / Dev Notes
### glossa-web has no healthcheck
Vite's dev server has no built-in health endpoint. `depends_on` uses API healthcheck as proxy. For production (Nginx), add a health endpoint or TCP port check.
### Valkey memory overcommit warning
Harmless in dev. Fix before production: add `vm.overcommit_memory = 1` to host `/etc/sysctl.conf`.
---
### Semantic category metadata source
Categories (`animals`, `kitchen`, etc.) are in the schema but empty. Options researched:
1. **WordNet domain labels** — already in OMW, coarse and patchy
2. **Princeton WordNet Domains** — ~200 hierarchical domains, freely available, meaningfully better
3. **Kelly Project** — CEFR levels AND semantic fields, designed for language learning. Could solve frequency tiers and categories in one shot
4. **BabelNet / WikiData** — rich but complex integration, licensing issues
5. **LLM-assisted categorization** — fast and cheap at current term counts, not reproducible without saving output
6. **Hybrid (WordNet Domains + LLM gap-fill)** — likely most practical
7. **Manual curation** — full control, too expensive at scale
**Current recommendation:** research Kelly Project first. If coverage is insufficient, go with Option 6.
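The Option 6 split can be sketched as a pure function; all names here (`splitByCoverage`, the tag strings) are hypothetical, not the pipeline's actual API. Baseline tags come from WordNet Domains; terms without a tag are collected for an LLM pass whose output would be saved for reproducibility.

```typescript
type TermId = number;

// Hypothetical sketch of the hybrid flow's first step: partition terms into
// those covered by the WordNet Domains baseline and those needing LLM gap-fill.
export function splitByCoverage(
  termIds: TermId[],
  baseline: Map<TermId, string>, // WordNet Domains tag per term, where available
): { tagged: Map<TermId, string>; needsLlm: TermId[] } {
  const tagged = new Map<TermId, string>();
  const needsLlm: TermId[] = [];
  for (const id of termIds) {
    const tag = baseline.get(id);
    if (tag !== undefined) tagged.set(id, tag);
    else needsLlm.push(id);
  }
  return { tagged, needsLlm };
}
```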
---
### SUBTLEX → `cefr_level` mapping strategy

Raw frequency ranks need mapping to A1–C2 bands before tiered decks are meaningful. Decision pending.

---

## Current State

Phase 0 complete. Phase 1 data pipeline complete. Phase 2 data model finalized and migrated.

### Completed (Phase 1 — data pipeline)
- [x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json`
- [x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
- [x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
- [x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
- [x] Write `packages/db/src/generating-decks.ts` — idempotent deck generation script
- reads and deduplicates source wordlist
- matches words to DB terms (homonyms included)
- writes unmatched words to `-missing` file
- determines `validated_languages` by checking full translation coverage per language
- creates deck if it doesn't exist, adds only missing terms on subsequent runs
- recalculates and persists `validated_languages` on every run
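The `validated_languages` recalculation described above reduces to a small pure function; the names here are illustrative, not the script's actual API. A language is validated for a deck only if every deck term has at least one translation in it.

```typescript
// Sketch: filter candidate languages down to those with full translation
// coverage across all of a deck's terms.
export function validatedLanguages(
  deckTermIds: number[],
  translations: Map<number, Set<string>>, // termId -> language codes with a translation
  candidateLanguages: string[],
): string[] {
  return candidateLanguages.filter((lang) =>
    deckTermIds.every((id) => translations.get(id)?.has(lang) ?? false),
  );
}
```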
### Completed (Phase 2 — data model)
- [x] `synset_id` removed, replaced by `source` + `source_id` on `terms`
- [x] `cefr_level` added to `translations` (not `terms` — difficulty is language-relative)
- [x] `language_code` CHECK constraint added to `translations` and `term_glosses`
- [x] `language_pairs` table dropped — pairs derived from decks at query time
- [x] `is_public` and `added_at` dropped from `decks` and `deck_terms`
- [x] `type` added to `decks` with CHECK against `SUPPORTED_DECK_TYPES`
- [x] `topics` and `term_topics` tables added (empty for MVP)
- [x] Migration generated and run against fresh database
### Known data facts (pre-wipe, for reference)
- Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate)
- Term IDs resolved: 3171 (higher than word count due to homonyms)
- Words not found in DB: 34
- Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`
### Next (Phase 3 — data pipeline + API)
1. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
2. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
3. **Generate decks** — run generation script with SUBTLEX-grounded wordlists per source language
4. **Finalize game selection flow** — direction → category → POS → difficulty → round count
5. **Define Zod schemas in `packages/shared`** — based on finalized game flow and API shape
6. **Implement API**
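One possible shape for the pending SUBTLEX → `cefr_level` mapping is a rank-threshold table. The thresholds below are placeholder assumptions for illustration, not project decisions.

```typescript
// Placeholder thresholds — the actual rank cutoffs are an open decision.
const CEFR_THRESHOLDS: Array<[number, string]> = [
  [500, "A1"],
  [1500, "A2"],
  [3500, "B1"],
  [7000, "B2"],
  [15000, "C1"],
  [Infinity, "C2"],
];

// Map a SUBTLEX frequency rank (1 = most frequent) to a CEFR band.
export function cefrFromRank(rank: number): string {
  for (const [maxRank, level] of CEFR_THRESHOLDS) {
    if (rank <= maxRank) return level;
  }
  return "C2"; // unreachable given the Infinity sentinel
}
```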
### Future extensions: morphology and pronunciation

All deferred post-MVP, purely additive (new tables referencing existing `terms`):

- `noun_forms` — gender, singular, plural, articles per language (source: Wiktionary)
- `verb_forms` — conjugation tables per language (source: Wiktionary)
- `term_pronunciations` — IPA and audio URLs per language (source: Wiktionary / Forvo)