lila/documentation/decisions.md
2026-04-06 17:01:34 +02:00


Decisions Log

A record of non-obvious technical decisions made during development, with reasoning. Intended to preserve context across sessions.


Tooling

Monorepo: pnpm workspaces (not Turborepo)

Turborepo adds parallel task running and build caching on top of pnpm workspaces. For a two-app monorepo of this size, plain pnpm workspace commands are sufficient and there is one less tool to configure and maintain.

TypeScript runner: tsx (not ts-node)

tsx is faster, requires no configuration, and uses esbuild under the hood. ts-node is older and more complex to configure. tsx does not do type checking — that is handled separately by tsc and the editor. Installed as a dev dependency in apps/api only.

ORM: Drizzle (not Prisma)

Drizzle is lighter — no binary, no engine. Queries map closely to SQL. Migrations are plain SQL files. Works naturally with Zod for type inference. Prisma would add Docker complexity (engine binary in containers) and abstraction that is not needed for this schema.

WebSocket: ws library (not Socket.io)

For rooms of 24 players, Socket.io's room management, transport fallbacks, and reconnection abstractions are unnecessary overhead. The WS protocol is defined explicitly as a Zod discriminated union in packages/shared, giving the same type safety guarantees. Reconnection logic is deferred to Phase 7.
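
The shape of such a protocol can be sketched with plain TypeScript discriminated-union types (in the repo the source of truth is a Zod discriminated union in packages/shared, which yields equivalent static types plus runtime validation; the message and field names below are illustrative, not the actual protocol):

```typescript
// Illustrative server-to-client message union; the real protocol lives in
// packages/shared as a Zod schema. Message and field names are assumptions.
type ServerMessage =
  | { type: "question"; round: number; prompt: string; options: string[] }
  | { type: "round_result"; round: number; correctIndex: number }
  | { type: "player_joined"; playerId: string; name: string };

// Narrowing on the `type` tag works the same whether the union is hand-written
// or inferred from a Zod schema.
function describeMessage(msg: ServerMessage): string {
  switch (msg.type) {
    case "question":
      return `round ${msg.round}: ${msg.prompt}`;
    case "round_result":
      return `round ${msg.round} answer: ${msg.correctIndex}`;
    case "player_joined":
      return `${msg.name} joined`;
  }
}
```

The exhaustive switch is the payoff: adding a message variant to the schema makes every unhandled site a compile error.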

Auth: OpenAuth (not rolling own JWT)

All auth delegated to OpenAuth service at auth.yourdomain.com. Providers: Google, GitHub. The API validates the JWT on every protected request. User rows are created or updated on first login, keyed by the sub claim (originally used directly as the primary key; later revised to an internal UUID with a unique openauth_sub column, see Data Model).


Docker

Multi-stage builds for monorepo context

Both apps/web and apps/api use multi-stage Dockerfiles (deps, dev, builder, runner) because:

  • The monorepo structure requires copying pnpm-workspace.yaml, root package.json, and cross-dependencies (packages/shared, packages/db) before installing
  • node_modules paths differ between host and container due to workspace hoisting
  • Stages allow caching pnpm install separately from source code changes

Vite as dev server (not Nginx)

In development, apps/web uses vite dev directly, not Nginx. Reasons:

  • Hot Module Replacement (HMR) requires Vite's WebSocket dev server
  • Source maps and error overlay need direct Vite integration
  • Nginx would add unnecessary proxy complexity for local dev

Production will use Nginx to serve static Vite build output.


Architecture

Express app structure: factory function pattern

app.ts exports a createApp() factory function. server.ts imports it and calls .listen(). This allows tests to import the app directly without starting a server, keeping tests isolated and fast.
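
The same split can be sketched with Node's built-in http module (the real app builds an Express app, but the pattern is identical: createApp() builds the handler, server.ts owns .listen()):

```typescript
// app.ts equivalent: a factory that builds the request handler without
// binding a port. Express is swapped for node:http to keep this self-contained.
import { createServer, type IncomingMessage, type ServerResponse } from "node:http";

export function createApp() {
  return (req: IncomingMessage, res: ServerResponse): void => {
    if (req.url === "/api/health") {
      res.writeHead(200, { "content-type": "application/json" });
      res.end(JSON.stringify({ status: "ok" }));
    } else {
      res.writeHead(404);
      res.end();
    }
  };
}

// server.ts equivalent: the only place a port is bound.
// createServer(createApp()).listen(3000);
```

Tests construct the app directly (or bind a throwaway server on port 0) without ever touching the real entry point.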

Data model: decks separate from terms (not frequency_rank filtering)

Original approach: Store frequency_rank on terms table and filter by rank range for difficulty.

Problem discovered: WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:

  • Rank 1: "In" → "indio" (the chemical symbol for indium)
  • Rank 2: "Be" → "berillio" (the chemical symbol for beryllium)
  • Rank 7: "He" → "elio" (the chemical symbol for helium)

These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning.

Decision:

  • terms table stores ALL available OMW synsets (raw data, no frequency filtering)
  • decks table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
  • deck_terms junction table links terms to decks with position ordering
  • rooms.deck_id specifies which vocabulary deck a game uses

Benefits:

  • Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
  • Bad data (chemical symbols, obscure words) excluded at deck level, not schema level
  • Users can create custom decks later
  • Multiple difficulty levels without schema changes

Multiplayer mechanic: simultaneous answers (not buzz-first)

All players see the same question at the same time and submit independently. The server waits for all answers or a 15-second timeout, then broadcasts the result. This keeps the experience Duolingo-like and symmetric. A buzz-first mechanic was considered and rejected: it rewards reaction speed over recall and breaks that symmetry.
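
The resolution rule can be sketched as a promise that settles when every player has answered or when the timeout fires, whichever comes first (all names here are illustrative, not taken from the repo):

```typescript
// Sketch of the round-resolution rule. The real server would attach this to a
// room's WebSocket connections; here submit() is just a function.
type Answer = { playerId: string; choice: number };

function collectAnswers(playerIds: string[], timeoutMs: number) {
  const answers = new Map<string, Answer>();
  let resolveDone!: (collected: Answer[]) => void;
  const done = new Promise<Answer[]>((resolve) => {
    resolveDone = resolve;
  });
  // Timeout path: resolve with whatever has been collected so far.
  const timer = setTimeout(() => resolveDone([...answers.values()]), timeoutMs);
  const submit = (a: Answer) => {
    // Ignore unknown players and double submissions.
    if (!playerIds.includes(a.playerId) || answers.has(a.playerId)) return;
    answers.set(a.playerId, a);
    if (answers.size === playerIds.length) {
      clearTimeout(timer); // everyone answered early: resolve immediately
      resolveDone([...answers.values()]);
    }
  };
  return { submit, done };
}
```

In the game loop this would be called with the room's player IDs and timeoutMs = 15_000, and the resolved array scored and broadcast as the round result.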

Room model: room codes (not matchmaking queue)

Players create rooms and share a human-readable code (e.g. WOLF-42) to invite friends. Auto-matchmaking via a queue is out of scope for MVP. Valkey is included in the stack and can support a queue in a future phase.
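
A code generator in that style is a few lines (the word list and format are assumptions; a real implementation would also retry on collision with an active room):

```typescript
// Illustrative WOLF-42-style room code generator: short word + two-digit number.
// The word list is hypothetical, not the repo's.
const CODE_WORDS = ["WOLF", "BEAR", "HAWK", "LYNX", "DEER", "FROG"];

function generateRoomCode(): string {
  const word = CODE_WORDS[Math.floor(Math.random() * CODE_WORDS.length)];
  const num = Math.floor(Math.random() * 90) + 10; // 10-99, always two digits
  return `${word}-${num}`;
}
```

The small keyspace is fine for friend-invite rooms; codes only need to be unique among concurrently active rooms, not globally.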


TypeScript Configuration

Base config: no lib, module, or moduleResolution

These are intentionally omitted from tsconfig.base.json because different packages need different values — apps/api uses NodeNext, apps/web uses ESNext/bundler (Vite), and mixing them in the base caused errors. Each package declares its own.

outDir: "./dist" per package

The base config originally had outDir: "dist" which resolved relative to the base file location, pointing to the root dist folder. Overridden in each package with "./dist" to ensure compiled output stays inside the package.

apps/web tsconfig: deferred to Vite scaffold

The web tsconfig was left as a placeholder and filled in after pnpm create vite generated tsconfig.json, tsconfig.app.json, and tsconfig.node.json. The generated files were then trimmed to remove options already covered by the base.

rootDir: "." on apps/api

Set explicitly to allow vitest.config.ts (which lives outside src/) to be included in the TypeScript program. Without it, TypeScript infers rootDir as src/ and rejects any file outside that directory.


ESLint

Two-config approach for apps/web

The root eslint.config.mjs handles TypeScript linting across all packages. apps/web/eslint.config.js is kept as a local addition for React-specific plugins only: eslint-plugin-react-hooks and eslint-plugin-react-refresh. ESLint flat config merges them automatically by directory proximity — no explicit import between them needed.

Vitest

Coverage config at root only

Vitest coverage configuration lives in the root vitest.config.ts only. Individual package configs omit it to produce a single aggregated report rather than separate per-package reports.

globals: true with "types": ["vitest/globals"]

Using Vitest globals (describe, it, expect without imports) requires "types": ["vitest/globals"] in each package's tsconfig compilerOptions. Added to apps/api, packages/shared, and packages/db, and to apps/web/tsconfig.app.json.


Known Issues / Dev Notes

glossa-web has no healthcheck

The web service in docker-compose.yml has no healthcheck defined. Reason: Vite's dev server (vite dev) has no built-in health endpoint. Unlike the API's /api/health, there's no URL to poll.

Workaround: depends_on gates the web service on the api healthcheck as a proxy. For production (Nginx), add a health endpoint or use a TCP port check.

Valkey memory overcommit warning

Valkey logs this on start in development:

WARNING Memory overcommit must be enabled for proper functionality

This is harmless in dev but should be fixed before production. vm.overcommit_memory is a host-level kernel setting, not namespaced per container, so it must be enabled on the Docker host itself.

Fix: Add to host /etc/sysctl.conf:

vm.overcommit_memory = 1

Then sudo sysctl -p or restart Docker.


Data Model

Users: internal UUID + openauth_sub (not sub as PK)

Original approach: Use OpenAuth sub claim directly as users.id (text primary key).

Problem: Embeds auth provider in the primary key (e.g. "google|12345"). If OpenAuth changes format or a second provider is added, the PK cascades through all FKs (rooms.host_id, room_players.user_id).

Decision:

  • users.id = internal UUID (stable FK target)
  • users.openauth_sub = text UNIQUE (auth provider claim)
  • Allows adding multiple auth providers per user later without FK changes
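
The find-or-create flow this enables can be sketched as a single upsert keyed on openauth_sub (column names beyond users.id and users.openauth_sub, such as display_name, are assumptions):

```sql
-- Hypothetical first-login upsert: find-or-create by the provider claim,
-- returning the stable internal UUID used everywhere else as an FK target.
INSERT INTO users (id, openauth_sub, display_name)
VALUES (gen_random_uuid(), $1, $2)
ON CONFLICT (openauth_sub)
DO UPDATE SET display_name = EXCLUDED.display_name
RETURNING id;
```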

Rooms: updated_at for stale recovery only

Most tables omit updated_at (unnecessary for MVP). rooms.updated_at is kept specifically for stale room recovery—identifying rooms stuck in in_progress status after server crashes.

Translations: UNIQUE (term_id, language_code, text)

Allows multiple synonyms per language per term (e.g. "dog", "hound" for the same synset). Prevents exact duplicate rows. Homonyms (e.g. "lead" the metal vs. "lead" to guide) are handled by different term_id values (different synsets), so there is no constraint conflict.

Decks: source_language + validated_languages (not pair_id)

Original approach: decks.pair_id references language_pairs, tying each deck to a single language pair.

Problem: One deck can serve multiple target languages as long as translations exist for all its terms. A pair_id FK would require duplicating the deck for each target language.

Decision:

  • decks.source_language — the language the wordlist was curated from (e.g. "en"). A deck sourced from an English frequency list is fundamentally different from one sourced from an Italian list.
  • decks.validated_languages — array of language codes (excluding source_language) for which full translation coverage exists across all terms in the deck. Recalculated and updated on every run of the generation script.
  • The language pair used for a quiz session is determined at session start, not at deck creation time.

Benefits:

  • One deck serves multiple target languages (e.g. en→it and en→fr) without duplication
  • validated_languages stays accurate as translation data grows
  • DB enforces via CHECK constraint that source_language is never included in validated_languages
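
The recalculation rule reduces to: a target language is validated for a deck iff every term in the deck has at least one translation in that language. A sketch (data shapes are illustrative; the real script reads them from Postgres):

```typescript
// Sketch of the validated_languages rule. The Map represents, per term_id,
// the set of language codes for which at least one translation exists.
function computeValidatedLanguages(
  deckTermIds: number[],
  coverage: Map<number, Set<string>>,
  sourceLanguage: string,
): string[] {
  // Candidate languages: anything that appears on at least one term.
  const candidates = new Set<string>();
  for (const langs of coverage.values()) {
    for (const lang of langs) candidates.add(lang);
  }
  candidates.delete(sourceLanguage); // DB CHECK: source language is never validated
  // Keep only languages with full coverage across every term in the deck.
  return [...candidates].filter((lang) =>
    deckTermIds.every((id) => coverage.get(id)?.has(lang) ?? false),
  );
}
```

The result replaces the stored array wholesale on every generation run, so stale entries disappear as soon as coverage regresses.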

Decks: wordlist tiers as scope (not POS-split decks)

Rejected approach: one deck per POS (e.g. en-nouns, en-verbs).

Problem: POS is already a filterable column on terms, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication in the generation script.

Decision: one deck per frequency tier per source language (e.g. en-core-1000, en-core-2000). POS, difficulty, and category are query filters applied inside that boundary at query time. The user never sees or picks a deck — they pick a direction, POS, and difficulty, and the app resolves those to the right deck + filters.

Progression works by expanding the deck set as the user advances:

WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
  AND t.pos = 'noun'
  AND t.cefr_level = 'B1'

Decks must not overlap — each term appears in exactly one tier. The generation script already deduplicates, so this is enforced at import time.

Decks: SUBTLEX as wordlist source (not manual curation)

Problem: the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what is actually common in Italian.

Decision: use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology, making them comparable across languages.

This is why decks.source_language is not just a technical detail — it is the reason the data model is correct:

  • en-core-1000 built from SUBTLEX-EN → used when source language is English (en→it)
  • it-core-1000 built from SUBTLEX-IT → used when source language is Italian (it→en)

Same translation data underneath, correctly frequency-grounded per direction. Two wordlist files, two generation script runs.

Terms: synset_id nullable (not NOT NULL)

Problem: non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. NOT NULL is too strict.

Decision: make synset_id nullable. synset_id remains the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It is not removed.

Postgres UNIQUE on a nullable column allows multiple NULL values (nulls are not considered equal), so no additional constraint logic is needed beyond dropping notNull(). For extra defensiveness a partial unique index can be added later:

CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;

Terms: source + source_id columns

Once multiple import pipelines exist (OMW today, Wiktionary later), synset_id alone is insufficient as an idempotency key — Wiktionary terms won't have a synset ID.

Decision: add source (varchar, e.g. 'omw', 'wiktionary', null for manual) and source_id (text, the pipeline's internal identifier) with a unique constraint on the pair:

unique("unique_source_id").on(table.source, table.source_id)

Postgres allows multiple NULL pairs under a unique constraint, so manual entries don't conflict. For existing OMW terms, backfill source = 'omw' and source_id = synset_id. synset_id remains for now to avoid pipeline churn — deprecate it during a future pipeline refactor.
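
The backfill is a single idempotent statement, assuming the column names as described:

```sql
-- One-time backfill for existing OMW rows; safe to re-run.
UPDATE terms
SET source = 'omw', source_id = synset_id
WHERE synset_id IS NOT NULL AND source IS NULL;
```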

No CHECK constraint on source — it is only written by controlled import scripts, not user input. A free varchar is sufficient.

Translations: cefr_level column (deferred population, not on terms)

CEFR difficulty is language-relative, not concept-relative. "House" in English is A1, "domicile" is also English but B2 — same concept, different words, different difficulty. Moving cefr_level to translations allows each language's word to have its own level independently.

Added as nullable varchar(2) with CHECK constraint against CEFR_LEVELS (A1–C2) on the translations table. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Also included in the translations index since the quiz query filters on it:

index("idx_translations_lang").on(table.language_code, table.cefr_level, table.term_id)

language_pairs table: dropped

Valid language pairs are already implicitly defined by decks.source_language + decks.validated_languages. The table was redundant — the same information can be derived directly from decks:

SELECT DISTINCT source_language, unnest(validated_languages) AS target_language
FROM decks
WHERE validated_languages != '{}'

The only thing language_pairs added was an active flag to manually disable a direction. This is an edge case not needed for MVP. Dropped to remove a maintenance surface that required staying in sync with deck data.

Schema: categories + term_categories (empty for MVP)

Added to schema now, left empty for MVP. Grammar and Media work without them — Grammar maps to POS (already on terms), Media maps to deck membership. Thematic categories (animals, kitchen, etc.) require a metadata source that is still under research.

categories:      id, slug, label, created_at
term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)

See Open Research section for source options.

Schema constraints: CHECK over pgEnum for extensible value sets

Question: use pgEnum for columns like pos, cefr_level, and source since the values are driven by TypeScript constants anyway?

Decision: no. Use CHECK constraints for any value set that will grow over time.

Reason: ALTER TYPE enum_name ADD VALUE interacts badly with migrations — on Postgres versions before 12 it cannot run inside a transaction block at all, and on 12+ the added value cannot be used until the transaction commits. A migration that fails partway can leave the DB in a dirty state requiring manual intervention. CHECK constraints are fully transactional — if the migration fails it rolls back cleanly.

Rule of thumb: pgEnum is appropriate for truly static value sets that will never grow (e.g. ('pending', 'active', 'cancelled') on an orders table). Any value set tied to a growing constant in the codebase (SUPPORTED_POS, CEFR_LEVELS, SUPPORTED_LANGUAGE_CODES) stays as a CHECK constraint.
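
For contrast, widening a CHECK constraint is an ordinary transactional migration (table and constraint names here are illustrative):

```sql
-- Widening a CHECK constraint runs inside a normal transaction: if any later
-- statement in the migration fails, everything rolls back cleanly.
BEGIN;
ALTER TABLE terms DROP CONSTRAINT terms_pos_check;
ALTER TABLE terms ADD CONSTRAINT terms_pos_check
  CHECK (pos IN ('noun', 'verb', 'adjective', 'adverb', 'pronoun'));
COMMIT;
```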

Schema constraints: language_code always CHECK-constrained

language_code columns on translations and term_glosses are constrained via CHECK against SUPPORTED_LANGUAGE_CODES, the same pattern used for pos and cefr_level.

Reason: unlike source, which is only written by controlled import scripts and failing silently is recoverable, language_code is a query-critical filter column. A typo ('ita' instead of 'it', 'en ' with a trailing space) would silently produce missing data in the UI — terms with no translation shown, glosses not displayed — which is harder to debug than a DB constraint violation.

Rule: any column that game queries filter on should be CHECK-constrained. Columns only used for internal bookkeeping (like source) can be left as free varchars.

Schema: unique constraints make explicit FK indexes redundant

Postgres automatically creates an index to enforce a unique constraint. An explicit index on a column that is already the leading column of a unique constraint is redundant.

Example: unique("unique_term_gloss").on(term_id, language_code, text) already indexes term_id as the leading column. A separate index("idx_term_glosses_term").on(term_id) adds no value and was dropped.

Rule: before adding an explicit index, check whether an existing unique constraint already covers it.

Future extensions: morphology and pronunciation (deferred, additive)

The following features are explicitly deferred post-MVP. All are purely additive — new tables referencing existing terms rows via FK. No existing schema changes required when implemented:

  • noun_forms — gender, singular, plural, articles per language (source: Wiktionary)
  • verb_forms — conjugation tables per language (source: Wiktionary)
  • term_pronunciations — IPA and audio URLs per language (source: Wiktionary / Forvo)

Exercise types split naturally into Type A (translation, current model) and Type B (morphology, future). The data layer is independent — the same terms anchor both.


Term glosses: Italian coverage is sparse (expected)

OMW gloss data is primarily in English. After full import:

  • English glosses: 95,882 (~100% of terms)
  • Italian glosses: 1,964 (~2% of terms)

This is not a data pipeline problem — it reflects the actual state of OMW. Italian glosses simply don't exist for most synsets in the dataset.

Handling in the UI: fall back to the English gloss when no gloss exists for the user's language. This is acceptable UX — a definition in the wrong language is better than no definition at all.
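
The fallback is one lookup chain (function and field names are illustrative):

```typescript
// Gloss fallback sketch: prefer the user's language, fall back to English,
// return undefined if neither exists.
type Gloss = { language_code: string; text: string };

function pickGloss(glosses: Gloss[], userLang: string): string | undefined {
  return (
    glosses.find((g) => g.language_code === userLang)?.text ??
    glosses.find((g) => g.language_code === "en")?.text
  );
}
```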

If Italian gloss coverage needs to improve in the future, Wiktionary is the most likely source — it has broader multilingual definition coverage than OMW.


Open Research

Semantic category metadata source

Categories (animals, kitchen, etc.) are in the schema but empty for MVP. Grammar and Media work without them (Grammar = POS filter, Media = deck membership). Needs research before populating term_categories. Options:

Option 1: WordNet domain labels. Already in OMW, extractable in the existing pipeline. Free, no extra dependency. Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals").

Option 2: Princeton WordNet Domains. Separate project built on WordNet. ~200 hierarchical domains mapped to synsets. More structured and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1.

Option 3: Kelly Project. Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, multiple languages. Could solve frequency tiers (cefr_level) and semantic categories in one shot. Investigate coverage for your languages and POS range first.

Option 4: BabelNet / WikiData. Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories. Problem: complex integration, BabelNet has commercial licensing restrictions, WikiData category trees are deep and noisy.

Option 5: LLM-assisted categorization. Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import. Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible without saving output. Good fallback if structured sources have insufficient coverage.

Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill. Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass. Combines automation with control. Likely the most practical approach.

Option 7: Manual curation. Flat file mapping synset IDs to your own category slugs. Full control, matches UI exactly. Too expensive at scale — only viable for small curated additions on top of an automated baseline.

Current recommendation: research Kelly Project first. If coverage is insufficient, go with Option 6.


Current State

Phase 0 complete. Phase 1 data pipeline complete. Phase 2 data model finalized and migrated.

Completed (Phase 1 — data pipeline)

  • Run extract-en-it-nouns.py locally → generates datafiles/en-it-nouns.json
  • Write Drizzle schema: terms, translations, language_pairs, term_glosses, decks, deck_terms
  • Write and run migration (includes CHECK constraints for pos, gloss_type)
  • Write packages/db/src/seed.ts (imports ALL terms + translations, NO decks)
  • Write packages/db/src/generating-decks.ts — idempotent deck generation script
    • reads and deduplicates source wordlist
    • matches words to DB terms (homonyms included)
    • writes unmatched words to -missing file
    • determines validated_languages by checking full translation coverage per language
    • creates deck if it doesn't exist, adds only missing terms on subsequent runs
    • recalculates and persists validated_languages on every run

Completed (Phase 2 — data model)

  • synset_id removed, replaced by source + source_id on terms
  • cefr_level added to translations (not terms — difficulty is language-relative)
  • language_code CHECK constraint added to translations and term_glosses
  • language_pairs table dropped — pairs derived from decks at query time
  • is_public and added_at dropped from decks and deck_terms
  • type added to decks with CHECK against SUPPORTED_DECK_TYPES
  • topics and term_topics tables added (empty for MVP)
  • Migration generated and run against fresh database

Known data facts (pre-wipe, for reference)

  • Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate)
  • Term IDs resolved: 3171 (higher than word count due to homonyms)
  • Words not found in DB: 34
  • Italian (it) coverage: 3171 / 3171 — full coverage, included in validated_languages

Next (Phase 3 — data pipeline + API)

  1. Expand data pipeline — import all OMW languages and POS, not just English nouns with Italian translations
  2. Decide SUBTLEX → cefr_level mapping strategy — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
  3. Generate decks — run generation script with SUBTLEX-grounded wordlists per source language
  4. Finalize game selection flow — direction → category → POS → difficulty → round count
  5. Define Zod schemas in packages/shared — based on finalized game flow and API shape
  6. Implement API