Glossa — Schema & Architecture Discussion Summary
Project Overview
A vocabulary trainer in the style of Duolingo (see a word, pick from 4 translations). Built as a monorepo with a Drizzle/Postgres data layer. Phase 1 (data pipeline) is complete; the API layer is next.
Game Flow (MVP)
Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count (3 or 10) → game starts.
Top-level categories (MVP):
- Grammar — practice nouns, verb conjugations, etc.
- Media — practice vocabulary from specific books, films, songs, etc.
Post-MVP categories (not in scope yet):
- Animals, kitchen, and other thematic word groups
Schema Decisions Made
Deck model: source_language + validated_languages (not pair_id)
A deck is a curated pool of terms sourced from a specific language (e.g. an English frequency list). The language pair used for a quiz is chosen at session start, not at deck creation.
- `decks.source_language` — the language the wordlist was curated from
- `decks.validated_languages` — array of target language codes for which full translation coverage exists across all terms; recalculated on every generation script run
- Enforced via CHECK: `source_language` is never in `validated_languages`
- One deck serves en→it and en→fr without duplication
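A sketch of that CHECK in Postgres (constraint name hypothetical; assumes `validated_languages` is a `text[]` column):

```sql
-- Hypothetical constraint name; assumes validated_languages is text[].
ALTER TABLE decks
  ADD CONSTRAINT decks_source_not_in_validated
  CHECK (NOT (source_language = ANY (validated_languages)));
```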
Architecture: deck as curated pool (Option 2)
Three options were considered:
| Option | Description | Problem |
|---|---|---|
| 1. Pure filter | No decks, query the whole terms table | No curatorial control; import junk ends up in the game |
| 2. Deck as pool ✅ | Decks define scope, term metadata drives filtering | None; clean separation of concerns |
| 3. Deck as preset | Deck encodes filter config (category + POS + difficulty) | Combinatorial explosion; can't reuse terms across decks |
Decision: Option 2. Decks solve the curation problem (which terms are game-ready). Term metadata solves the filtering problem (which subset to show today). These are separate concerns and should stay separate.
The quiz query joins deck_terms for scope, then filters by pos, cefr_level, and later category — all independently.
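As a sketch (join column names `dt.term_id`/`t.id` are assumed; the filter values are illustrative):

```sql
-- Deck membership sets the scope; term metadata does the filtering.
SELECT t.*
FROM terms t
JOIN deck_terms dt ON dt.term_id = t.id
WHERE dt.deck_id = 'en-core-1000'  -- scope: curated pool
  AND t.pos = 'noun'               -- filter: part of speech
  AND t.cefr_level = 'B1';         -- filter: difficulty
```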
Missing from schema: cefr_level and categories
The game flow requires filtering by difficulty and category, but neither is in the schema yet.
Difficulty (cefr_level):
- Belongs on `terms`, not on `decks`
- Add as a nullable `varchar(2)` with a CHECK constraint (A1–C2)
- Add now (nullable), populate later — backfilling a full terms table post-MVP is costly
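A migration sketch for this column (Postgres syntax; whether the CHECK lists levels explicitly is an assumption):

```sql
-- Nullable now, backfilled later; CHECK restricts values to A1–C2.
ALTER TABLE terms
  ADD COLUMN cefr_level varchar(2)
  CHECK (cefr_level IN ('A1', 'A2', 'B1', 'B2', 'C1', 'C2'));
```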
Categories:
- Separate `categories` table + `term_categories` join table
- Do not use an enum or array on `terms` — a term can belong to multiple categories, and new categories should not require migrations
- `categories`: id, slug, label, created_at
- `term_categories`: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
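A DDL sketch matching the outline above (the outline only names the columns, so all types here are assumptions):

```sql
-- Column types assumed; the composite PK prevents duplicate tags.
CREATE TABLE categories (
  id         serial PRIMARY KEY,
  slug       text NOT NULL UNIQUE,
  label      text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE term_categories (
  term_id     integer NOT NULL REFERENCES terms (id),
  category_id integer NOT NULL REFERENCES categories (id),
  PRIMARY KEY (term_id, category_id)
);
```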
Deck scope: wordlists, not POS splits
Rejected approach: one deck per POS (e.g. en-nouns, en-verbs). Problem: POS is already a filterable column on terms, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication logic in the generation script.
Decision: one deck per frequency tier per source language (e.g. en-core-1000, en-core-2000). POS, difficulty, and category are query filters applied inside that boundary. The user never sees or picks a deck — they pick "Nouns, B1" and the app resolves that to the right deck + filters.
Deck progression: tiered frequency lists
When a user exhausts a deck, the app expands scope by adding the next tier:
```sql
WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
  AND t.pos = 'noun'
  AND t.cefr_level = 'B1'
```
Requirements for this to work cleanly:
- Decks must not overlap — each word appears in exactly one tier
- The generation script already deduplicates, so this is enforced at import time
- Unlocking logic (when to add the next deck) lives in user learning state, not in the deck structure — for MVP, query all tiers at once or hardcode active decks
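One way to verify the no-overlap invariant after a generation run (a hedged sketch; deck ids illustrative):

```sql
-- Expect zero rows: a term may appear in at most one frequency tier.
SELECT dt.term_id, count(*) AS tier_count
FROM deck_terms dt
WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
GROUP BY dt.term_id
HAVING count(*) > 1;
```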
Wordlist source: SUBTLEX (not manual curation)
Problem: the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what's actually common in Italian.
Decision: use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology — making them comparable across languages.
This maps directly onto decks.source_language:
- `en-core-1000` — built from SUBTLEX-EN, used when the user's source language is English
- `it-core-1000` — built from SUBTLEX-IT, used when the source language is Italian
When the user picks en→it, the app queries en-core-1000. When they pick it→en, it queries it-core-1000. Same translation data, correctly frequency-grounded per direction. Two wordlist files, two generation script runs — the schema already supports this.
Missing from schema: user learning state
The current schema has no concept of a user's progress. Not blocking for the API layer right now, but will be needed before the game loop is functional:
- `user_decks` — which decks a user is studying
- `user_term_progress` — per `(user_id, term_id, language_pair)`: `next_review_at`, `interval_days`, correct/attempt counts for spaced repetition
- `quiz_answers` — optional history log for stats and debugging
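A rough DDL sketch of the spaced-repetition table (column names taken from the bullet above; key types and defaults are assumptions):

```sql
-- PK covers (user, term, direction) so progress is per language pair.
CREATE TABLE user_term_progress (
  user_id        integer     NOT NULL,
  term_id        integer     NOT NULL REFERENCES terms (id),
  language_pair  text        NOT NULL,  -- e.g. 'en-it'
  next_review_at timestamptz,
  interval_days  integer     NOT NULL DEFAULT 0,
  correct_count  integer     NOT NULL DEFAULT 0,
  attempt_count  integer     NOT NULL DEFAULT 0,
  PRIMARY KEY (user_id, term_id, language_pair)
);
```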
synset_id: make nullable, don't remove
synset_id is the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It should stay.
Problem: non-WordNet terms (custom words added later) won't have a synset ID, so NOT NULL is too strict.
Decision: make synset_id nullable. Postgres UNIQUE on a nullable column allows multiple NULL values (nulls are not considered equal), so no constraint changes are needed beyond dropping notNull().
For extra defensiveness, a partial unique index can be added later:
```sql
CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
```
Open Questions / Deferred
- User learning state — not needed for the API layer but must be designed before the game loop ships
- Distractors — generated at query time (random same-POS terms from the same deck); no schema needed
- `cefr_level` data source — WordNet frequency data was already found to be unreliable; external CEFR lists (Oxford 3000, SUBTLEX) will be needed to populate this field
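The distractor approach noted above (random same-POS terms from the same deck) could be sketched as follows. `$1` stands for the correct term's id, and the `lemma` column name is an assumption:

```sql
-- Three wrong answers: same deck, same POS, not the correct term.
SELECT t.id, t.lemma
FROM terms t
JOIN deck_terms dt ON dt.term_id = t.id
WHERE dt.deck_id = 'en-core-1000'
  AND t.pos = 'noun'
  AND t.id <> $1
ORDER BY random()
LIMIT 3;
```

`ORDER BY random()` is fine at current term counts; it would need revisiting only if the terms table grows by orders of magnitude.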
Open: semantic category metadata source
Categories (animals, kitchen, etc.) are in the schema but empty for MVP.
Grammar and Media work without them (Grammar = POS filter, Media = deck membership).
Needs research before populating term_categories. Options:
- **Option 1: WordNet domain labels.** Already in OMW, extractable in the existing pipeline. Free, no extra dependency. Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals").
- **Option 2: Princeton WordNet Domains.** A separate project built on WordNet: ~200 hierarchical domains mapped to synsets. More structured and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1.
- **Option 3: Kelly Project.** Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, available for multiple languages. Could solve frequency tiers (`cefr_level`) and semantic categories in one shot. Investigate coverage for your languages and POS range first.
- **Option 4: BabelNet / WikiData.** Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories. Problem: complex integration, BabelNet has commercial licensing restrictions, and WikiData category trees are deep and noisy.
- **Option 5: LLM-assisted categorization.** Run terms through Claude/GPT-4 with a fixed category list, spot-check the output, import. Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible without saving output. Good fallback if structured sources have insufficient coverage.
- **Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill.** Use Option 2 for automated coverage, an LLM for terms with no domain tag, plus a manual spot-check pass. Combines automation with control. Likely the most practical approach.
- **Option 7: Manual curation.** A flat file mapping synset IDs to your own category slugs. Full control, matches the UI exactly. Too expensive at scale — only viable for small curated additions on top of an automated baseline.
Current recommendation: research Kelly Project first. If coverage is insufficient, go with Option 6.
Implementation Roadmap
- Finalize data model
- Write and run migrations
- Fill the database (expand import pipeline)
- Decide SUBTLEX → cefr_level mapping strategy
- Generate decks
- Finalize game selection flow
- Define Zod schemas in packages/shared
- Implement API