# Glossa — Schema & Architecture Discussion Summary
## Project Overview
A vocabulary trainer in the style of Duolingo (see a word, pick from 4 translations). Built as a monorepo with a Drizzle/Postgres data layer. Phase 1 (data pipeline) is complete; the API layer is next.
## Game Flow (MVP)
Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count (3 or 10) → game starts.
Top-level categories (MVP):
- Grammar — practice nouns, verb conjugations, etc.
- Media — practice vocabulary from specific books, films, songs, etc.
Post-MVP categories (not in scope yet):
- Animals, kitchen, and other thematic word groups
## Schema Decisions Made
### Deck model: `source_language` + `validated_languages` (not `pair_id`)
A deck is a curated pool of terms sourced from a specific language (e.g. an English frequency list). The language pair used for a quiz is chosen at session start, not at deck creation.
- `decks.source_language` — the language the wordlist was curated from
- `decks.validated_languages` — array of target language codes for which full translation coverage exists across all terms; recalculated on every generation script run
- Enforced via CHECK: `source_language` is never in `validated_languages`
- One deck serves en→it and en→fr without duplication
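A minimal sketch of what this table shape could look like in plain SQL (column names follow the notes above; the `id` format and exact CHECK syntax are assumptions, not the project's actual migration):

```sql
-- Hypothetical sketch of the decks table; names follow the notes above.
CREATE TABLE decks (
  id                  text PRIMARY KEY,     -- e.g. 'en-core-1000'
  source_language     varchar(2) NOT NULL,  -- language the wordlist was curated from
  validated_languages varchar(2)[] NOT NULL DEFAULT '{}',
  -- the source language must never appear among the validated targets
  CHECK (NOT (source_language = ANY (validated_languages)))
);
```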
### Architecture: deck as curated pool (Option 2)
Three options were considered:
| Option | Description | Assessment |
|---|---|---|
| 1. Pure filter | No decks, query the whole terms table | No curatorial control; imported junk ends up in the game |
| 2. Deck as pool ✅ | Decks define scope, term metadata drives filtering | Clean separation of concerns |
| 3. Deck as preset | Deck encodes filter config (category + POS + difficulty) | Combinatorial explosion; can't reuse terms across decks |
Decision: Option 2. Decks solve the curation problem (which terms are game-ready). Term metadata solves the filtering problem (which subset to show today). These are separate concerns and should stay separate.
The quiz query joins deck_terms for scope, then filters by pos, cefr_level, and later category — all independently.
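Sketched as SQL, assuming a `deck_terms(deck_id, term_id)` join table; the `lemma` column and exact names are illustrative, not confirmed schema:

```sql
-- Illustrative quiz scope query: deck membership provides curation,
-- term metadata provides filtering.
SELECT t.id, t.lemma
FROM terms t
JOIN deck_terms dt ON dt.term_id = t.id
WHERE dt.deck_id = 'en-core-1000'
  AND t.pos = 'noun'
  AND t.cefr_level = 'B1';
```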
### Missing from schema: `cefr_level` and categories
The game flow requires filtering by difficulty and category, but neither is in the schema yet.
Difficulty (`cefr_level`):
- Belongs on `terms`, not on `decks`
- Add as a nullable `varchar(2)` with a CHECK constraint (A1–C2)
- Add now (nullable), populate later — backfilling a full terms table post-MVP is costly
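As a migration, this could look like the following (a sketch, not the project's actual Drizzle migration):

```sql
-- Hypothetical migration: nullable CEFR level with a CHECK constraint.
ALTER TABLE terms
  ADD COLUMN cefr_level varchar(2)
  CHECK (cefr_level IN ('A1', 'A2', 'B1', 'B2', 'C1', 'C2'));
```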
Categories:
- Separate `categories` table + `term_categories` join table
- Do not use an enum or array on `terms` — a term can belong to multiple categories, and new categories should not require migrations
- `categories`: id, slug, label, created_at
- `term_categories`: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
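In SQL, the two tables above might look like this (ID types and constraints are assumptions):

```sql
-- Sketch of the two category tables described above.
CREATE TABLE categories (
  id         serial PRIMARY KEY,
  slug       text NOT NULL UNIQUE,
  label      text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE term_categories (
  term_id     integer NOT NULL REFERENCES terms (id),
  category_id integer NOT NULL REFERENCES categories (id),
  PRIMARY KEY (term_id, category_id)
);
```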
### Deck scope: wordlists, not POS splits
Rejected approach: one deck per POS (e.g. en-nouns, en-verbs). Problem: POS is already a filterable column on terms, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication logic in the generation script.
Decision: one deck per frequency tier per source language (e.g. en-core-1000, en-core-2000). POS, difficulty, and category are query filters applied inside that boundary. The user never sees or picks a deck — they pick "Nouns, B1" and the app resolves that to the right deck + filters.
### Deck progression: tiered frequency lists
When a user exhausts a deck, the app expands scope by adding the next tier:
```sql
WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
  AND t.pos = 'noun'
  AND t.cefr_level = 'B1'
```
Requirements for this to work cleanly:
- Decks must not overlap — each word appears in exactly one tier
- The generation script already deduplicates, so this is enforced at import time
- Unlocking logic (when to add the next deck) lives in user learning state, not in the deck structure — for MVP, query all tiers at once or hardcode active decks
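The no-overlap requirement is enforced at import time, but it can also be verified after the fact with a simple sanity check (a suggestion, not part of the existing pipeline):

```sql
-- Sanity check for the no-overlap requirement: any term appearing in
-- more than one deck indicates a bug in the generation script's dedup.
SELECT term_id, count(*) AS deck_count
FROM deck_terms
GROUP BY term_id
HAVING count(*) > 1;
```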
### Wordlist source: SUBTLEX (not manual curation)
Problem: the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what's actually common in Italian.
Decision: use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology — making them comparable across languages.
This maps directly onto `decks.source_language`:
- `en-core-1000` — built from SUBTLEX-EN, used when the user's source language is English
- `it-core-1000` — built from SUBTLEX-IT, used when the source language is Italian
When the user picks en→it, the app queries en-core-1000. When they pick it→en, it queries it-core-1000. Same translation data, correctly frequency-grounded per direction. Two wordlist files, two generation script runs — the schema already supports this.
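Resolving a session's direction to a deck could then be a single lookup (a sketch; the real resolution may well live in application code instead):

```sql
-- Hypothetical deck resolution for a session: pick the deck curated
-- from the chosen source language, validated for the chosen target.
SELECT id
FROM decks
WHERE source_language = 'en'            -- user picked en→it
  AND 'it' = ANY (validated_languages);
```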
### Missing from schema: user learning state
The current schema has no concept of a user's progress. Not blocking for the API layer right now, but will be needed before the game loop is functional:
- `user_decks` — which decks a user is studying
- `user_term_progress` — per `(user_id, term_id, language_pair)`: `next_review_at`, `interval_days`, correct/attempt counts for spaced repetition
- `quiz_answers` — optional history log for stats and debugging
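A possible shape for the spaced-repetition table (every name and type here is provisional; this is a design sketch, not a committed schema):

```sql
-- Sketch of per-term spaced-repetition state.
CREATE TABLE user_term_progress (
  user_id        integer NOT NULL,
  term_id        integer NOT NULL REFERENCES terms (id),
  language_pair  text    NOT NULL,      -- e.g. 'en-it'
  next_review_at timestamptz,
  interval_days  integer NOT NULL DEFAULT 0,
  correct_count  integer NOT NULL DEFAULT 0,
  attempt_count  integer NOT NULL DEFAULT 0,
  PRIMARY KEY (user_id, term_id, language_pair)
);
```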
### `synset_id`: make nullable, don't remove
synset_id is the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It should stay.
Problem: non-WordNet terms (custom words added later) won't have a synset ID, so NOT NULL is too strict.
Decision: make synset_id nullable. Postgres UNIQUE on a nullable column allows multiple NULL values (nulls are not considered equal), so no constraint changes are needed beyond dropping notNull().
For extra defensiveness, a partial unique index can be added later:
```sql
CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
```
## Open Questions / Deferred
- User learning state — not needed for the API layer but must be designed before the game loop ships
- Distractors — generated at query time (random same-POS terms from the same deck); no schema needed
- `cefr_level` data source — WordNet frequency data was already found to be unreliable; external CEFR lists (Oxford 3000, SUBTLEX) will be needed to populate this field
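The query-time distractor approach above could be sketched like this (table and column names follow earlier sections; `$1` is the correct answer's term id, and `ORDER BY random()` is assumed acceptable at current table sizes):

```sql
-- Possible distractor query: three random same-POS terms from the
-- same deck, excluding the correct answer.
SELECT t.id, t.lemma
FROM terms t
JOIN deck_terms dt ON dt.term_id = t.id
WHERE dt.deck_id = 'en-core-1000'
  AND t.pos = 'noun'
  AND t.id <> $1        -- the correct answer's term id
ORDER BY random()
LIMIT 3;
```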
### Open: semantic category metadata source
Categories (animals, kitchen, etc.) are in the schema but empty for MVP.
Grammar and Media work without them (Grammar = POS filter, Media = deck membership).
Needs research before populating term_categories. Options:
**Option 1: WordNet domain labels.** Already in OMW, extractable in the existing pipeline. Free, no extra dependency. Problem: coarse and patchy — many terms untagged, and the vocabulary is academic ("fauna" rather than "animals").

**Option 2: Princeton WordNet Domains.** Separate project built on WordNet. ~200 hierarchical domains mapped to synsets. More structured and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1.

**Option 3: Kelly Project.** Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, available for multiple languages. Could solve CEFR levels (`cefr_level`) and semantic categories in one shot. Investigate coverage for your languages and POS range first.

**Option 4: BabelNet / WikiData.** Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories. Problem: complex integration, BabelNet has commercial licensing restrictions, and WikiData category trees are deep and noisy.

**Option 5: LLM-assisted categorization.** Run terms through Claude/GPT-4 with a fixed category list, spot-check the output, import. Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible without saving the output. Good fallback if structured sources have insufficient coverage.

**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill.** Use Option 2 for automated coverage, an LLM for terms with no domain tag, then a manual spot-check pass. Combines automation with control. Likely the most practical approach.

**Option 7: Manual curation.** Flat file mapping synset IDs to your own category slugs. Full control, matches the UI exactly. Too expensive at scale — only viable for small curated additions on top of an automated baseline.
Current recommendation: research Kelly Project first. If coverage is insufficient, go with Option 6.
## Implementation Roadmap
1. Finalize data model
2. Write and run migrations
3. Fill the database (expand import pipeline)
4. Decide SUBTLEX → cefr_level mapping strategy
5. Generate decks
6. Finalize game selection flow
7. Define Zod schemas in packages/shared
8. Implement API