# Glossa — Schema & Architecture Discussion Summary

## Project Overview

A vocabulary trainer in the style of Duolingo (see a word, pick from 4 translations). Built as a monorepo with a Drizzle/Postgres data layer. Phase 1 (data pipeline) is complete; the API layer is next.

---
## Game Flow (MVP)

Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count (3 or 10) → game starts.

**Top-level categories (MVP):**

- **Grammar** — practice nouns, verb conjugations, etc.
- **Media** — practice vocabulary from specific books, films, songs, etc.

**Post-MVP categories (not in scope yet):**

- Animals, kitchen, and other thematic word groups

---
## Schema Decisions Made

### Deck model: `source_language` + `validated_languages` (not `pair_id`)

A deck is a curated pool of terms sourced from a specific language (e.g. an English frequency list). The language pair used for a quiz is chosen at session start, not at deck creation.
- `decks.source_language` — the language the wordlist was curated from
- `decks.validated_languages` — array of target language codes for which full translation coverage exists across all terms; recalculated on every generation script run
- Enforced via CHECK: `source_language` is never in `validated_languages`
- One deck serves en→it and en→fr without duplication
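That CHECK can be expressed directly against the array column. A sketch in plain SQL — the constraint name is hypothetical, and the real Drizzle schema may phrase it differently:

```sql
ALTER TABLE decks
  ADD CONSTRAINT source_not_in_validated
  -- a deck's source language must never appear in its own target list
  CHECK (NOT (source_language = ANY (validated_languages)));
```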
### Architecture: deck as curated pool (Option 2)

Three options were considered:

| Option | Description | Assessment |
|--------|-------------|------------|
| 1. Pure filter | No decks; query the whole terms table | No curatorial control — imported junk ends up in the game |
| 2. Deck as pool ✅ | Decks define scope; term metadata drives filtering | Clean separation of concerns |
| 3. Deck as preset | Deck encodes filter config (category + POS + difficulty) | Combinatorial explosion; terms can't be reused across decks |

**Decision: Option 2.** Decks solve the curation problem (which terms are game-ready). Term metadata solves the filtering problem (which subset to show today). These are separate concerns and should stay separate.

The quiz query joins `deck_terms` for scope, then filters by `pos`, `cefr_level`, and later `category` — all independently.
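As a sketch, the resulting quiz query — column names like `lemma` and the literal filter values are illustrative assumptions:

```sql
SELECT t.id, t.lemma
FROM terms t
JOIN deck_terms dt ON dt.term_id = t.id   -- deck membership = scope
WHERE dt.deck_id = 'en-core-1000'
  AND t.pos = 'noun'                      -- filter 1: part of speech
  AND t.cefr_level = 'B1'                 -- filter 2: difficulty
ORDER BY random()
LIMIT 10;
```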
### Missing from schema: `cefr_level` and categories

The game flow requires filtering by difficulty and category, but neither is in the schema yet.

**Difficulty (`cefr_level`):**

- Belongs on `terms`, not on `decks`
- Add as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`)
- Add now (nullable), populate later — backfilling a full terms table post-MVP is costly

**Categories:**

- Separate `categories` table + `term_categories` join table
- Do not use an enum or array on `terms` — a term can belong to multiple categories, and new categories should not require migrations

```
categories: id, slug, label, created_at
term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
```
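Sketched as plain SQL — the project uses Drizzle, so the real migration would be generated from the schema definition; the exact types and constraint details here are assumptions:

```sql
-- difficulty lives on terms, nullable until backfilled
ALTER TABLE terms
  ADD COLUMN cefr_level varchar(2)
  CHECK (cefr_level IN ('A1', 'A2', 'B1', 'B2', 'C1', 'C2'));

-- categories as a lookup table plus a many-to-many join,
-- so new categories never require a migration
CREATE TABLE categories (
  id         serial PRIMARY KEY,
  slug       text NOT NULL UNIQUE,
  label      text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE term_categories (
  term_id     integer NOT NULL REFERENCES terms (id),
  category_id integer NOT NULL REFERENCES categories (id),
  PRIMARY KEY (term_id, category_id)
);
```

Note that the CHECK passes for `NULL` values, so the column stays nullable by design.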
### Deck scope: wordlists, not POS splits

**Rejected approach:** one deck per POS (e.g. `en-nouns`, `en-verbs`). Problem: POS is already a filterable column on `terms`, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication logic in the generation script.

**Decision:** one deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary. The user never sees or picks a deck — they pick "Nouns, B1" and the app resolves that to the right deck + filters.
### Deck progression: tiered frequency lists

When a user exhausts a deck, the app expands scope by adding the next tier:

```sql
WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
  AND t.pos = 'noun'
  AND t.cefr_level = 'B1'
```

Requirements for this to work cleanly:

- Decks must not overlap — each word appears in exactly one tier
- The generation script already deduplicates, so this is enforced at import time
- Unlocking logic (when to add the next deck) lives in user learning state, not in the deck structure — for MVP, query all tiers at once or hardcode active decks
### Wordlist source: SUBTLEX (not manual curation)

**Problem:** the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what's actually common in Italian.

**Decision:** use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology — making them comparable across languages.

This maps directly onto `decks.source_language`:

- `en-core-1000` — built from SUBTLEX-EN, used when the user's source language is English
- `it-core-1000` — built from SUBTLEX-IT, used when the source language is Italian

When the user picks en→it, the app queries `en-core-1000`. When they pick it→en, it queries `it-core-1000`. Same translation data, correctly frequency-grounded per direction. Two wordlist files, two generation script runs — the schema already supports this.
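Deck resolution for a session direction might then look like this — a sketch with literal language codes for the en→it case:

```sql
-- pick the deck curated from the session's source language,
-- requiring full translation coverage for the target language
SELECT id
FROM decks
WHERE source_language = 'en'
  AND 'it' = ANY (validated_languages);
```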
### Missing from schema: user learning state

The current schema has no concept of a user's progress. Not blocking for the API layer right now, but will be needed before the game loop is functional:

- `user_decks` — which decks a user is studying
- `user_term_progress` — per `(user_id, term_id, language_pair)`: `next_review_at`, `interval_days`, correct/attempt counts for spaced repetition
- `quiz_answers` — optional history log for stats and debugging
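A possible shape for `user_term_progress` — a sketch only; column names beyond those listed above are assumptions, and the spaced-repetition fields will depend on the algorithm chosen:

```sql
CREATE TABLE user_term_progress (
  user_id        integer     NOT NULL,     -- FK to a users table, assumed
  term_id        integer     NOT NULL REFERENCES terms (id),
  language_pair  varchar(5)  NOT NULL,     -- e.g. 'en-it'
  next_review_at timestamptz,              -- NULL until first answer
  interval_days  integer     NOT NULL DEFAULT 0,
  correct_count  integer     NOT NULL DEFAULT 0,
  attempt_count  integer     NOT NULL DEFAULT 0,
  PRIMARY KEY (user_id, term_id, language_pair)
);
```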
### `synset_id`: make nullable, don't remove

`synset_id` is the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It should stay.

**Problem:** non-WordNet terms (custom words added later) won't have a synset ID, so `NOT NULL` is too strict.

**Decision:** make `synset_id` nullable. Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are not considered equal), so no constraint changes are needed beyond dropping `notNull()`.

For extra defensiveness, a partial unique index can be added later:

```sql
CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
```

---
## Open Questions / Deferred

- **User learning state** — not needed for the API layer but must be designed before the game loop ships
- **Distractors** — generated at query time (random same-POS terms from the same deck); no schema needed
- **`cefr_level` data source** — WordNet frequency data was already found to be unreliable; external CEFR lists (Oxford 3000, SUBTLEX) will be needed to populate this field
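The distractor approach can be sketched as a query — same caveats as elsewhere: `lemma` and the `$1`/`$2` parameter placeholders are assumptions:

```sql
-- three wrong answers: random same-POS terms from the same deck,
-- excluding the term being quizzed
SELECT t.id, t.lemma
FROM terms t
JOIN deck_terms dt ON dt.term_id = t.id
WHERE dt.deck_id = $1
  AND t.pos = 'noun'
  AND t.id <> $2
ORDER BY random()
LIMIT 3;
```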
---

### Open: semantic category metadata source

Categories (`animals`, `kitchen`, etc.) are in the schema but empty for MVP. Grammar and Media work without them (Grammar = POS filter, Media = deck membership). Populating `term_categories` needs research first. Options:

**Option 1: WordNet domain labels**
Already in OMW and extractable in the existing pipeline. Free, no extra dependency.
Problem: coarse and patchy — many terms are untagged, and the vocabulary is academic ("fauna", not "animals").

**Option 2: WordNet Domains**
A separate project built on Princeton WordNet: ~200 hierarchical domains mapped to synsets. More structured and consistent than the basic WordNet labels. Freely available. Meaningfully better than Option 1.

**Option 3: Kelly Project**
Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, available for multiple languages. Could solve frequency tiers (`cefr_level`) and semantic categories in one shot. Investigate coverage for the project's languages and POS range first.

**Option 4: BabelNet / WikiData**
Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories.
Problem: complex integration; BabelNet has commercial licensing restrictions, and WikiData category trees are deep and noisy.

**Option 5: LLM-assisted categorization**
Run terms through Claude/GPT-4 with a fixed category list, spot-check the output, import. Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible without saving the output. A good fallback if structured sources have insufficient coverage.

**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill**
Use Option 2 for automated coverage, an LLM for terms with no domain tag, and a manual spot-check pass. Combines automation with control. Likely the most practical approach.

**Option 7: Manual curation**
A flat file mapping synset IDs to the project's own category slugs. Full control, matches the UI exactly. Too expensive at scale — only viable for small curated additions on top of an automated baseline.

**Current recommendation:** research the Kelly Project first. If coverage is insufficient, go with Option 6.

---
### Implementation roadmap

- [x] Finalize data model
- [x] Write and run migrations
- [x] Fill the database (expand import pipeline)
- [ ] Decide SUBTLEX → `cefr_level` mapping strategy
- [ ] Generate decks
- [ ] Finalize game selection flow
- [ ] Define Zod schemas in `packages/shared`
- [ ] Implement API