lila/documentation/schema_discussion.md
2026-04-05 00:33:05 +02:00

# Glossa — Schema & Architecture Discussion Summary
## Project Overview
A vocabulary trainer in the style of Duolingo (see a word, pick from 4 translations). Built as a monorepo with a Drizzle/Postgres data layer. Phase 1 (data pipeline) is complete; the API layer is next.
---
## Game Flow (MVP)
Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count (3 or 10) → game starts.
**Top-level categories (MVP):**
- **Grammar** — practice nouns, verb conjugations, etc.
- **Media** — practice vocabulary from specific books, films, songs, etc.
**Post-MVP categories (not in scope yet):**
- Animals, kitchen, and other thematic word groups
---
## Schema Decisions Made
### Deck model: `source_language` + `validated_languages` (not `pair_id`)
A deck is a curated pool of terms sourced from a specific language (e.g. an English frequency list). The language pair used for a quiz is chosen at session start, not at deck creation.
- `decks.source_language` — the language the wordlist was curated from
- `decks.validated_languages` — array of target language codes for which full translation coverage exists across all terms; recalculated on every generation script run
- Enforced via CHECK: `source_language` is never in `validated_languages`
- One deck serves en→it and en→fr without duplication
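The recalculation the generation script performs on each run can be sketched in plain TypeScript. This is illustrative only: the `Term` shape and function name are assumptions, not the actual pipeline code.

```typescript
// Hypothetical shape of a term's translation coverage, keyed by language code.
type Term = { translations: Record<string, string | undefined> };

// Recompute decks.validated_languages: a target language qualifies only if
// every term in the deck has a translation for it. The source language is
// excluded, mirroring the CHECK constraint.
function recalcValidatedLanguages(
  sourceLanguage: string,
  deckTerms: Term[],
  candidateLanguages: string[],
): string[] {
  return candidateLanguages.filter(
    (lang) =>
      lang !== sourceLanguage &&
      deckTerms.every((t) => t.translations[lang] !== undefined),
  );
}
```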
### Architecture: deck as curated pool (Option 2)
Three options were considered:
| Option | Description | Problem |
|--------|-------------|---------|
| 1. Pure filter | No decks, query the whole terms table | No curatorial control; import junk ends up in the game |
| 2. Deck as pool ✅ | Decks define scope, term metadata drives filtering | Clean separation of concerns |
| 3. Deck as preset | Deck encodes filter config (category + POS + difficulty) | Combinatorial explosion; can't reuse terms across decks |
**Decision: Option 2.** Decks solve the curation problem (which terms are game-ready). Term metadata solves the filtering problem (which subset to show today). These are separate concerns and should stay separate.
The quiz query joins `deck_terms` for scope, then filters by `pos`, `cefr_level`, and later `category` — all independently.
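A minimal sketch of how those independent filters could be assembled into a parameterized WHERE clause. The function and its input shape are hypothetical; only the table aliases and column names come from the schema discussion above.

```typescript
// Hypothetical filter input: deck scope plus independent metadata filters.
type QuizFilters = { deckIds: string[]; pos?: string; cefrLevel?: string };

// Build a parameterized WHERE clause: deck_terms (dt) provides scope, term
// metadata (t.pos, t.cefr_level) narrows it. Each filter is optional and
// independent of the others.
function buildQuizWhere(f: QuizFilters): { sql: string; params: unknown[] } {
  const clauses: string[] = [];
  const params: unknown[] = [];
  params.push(f.deckIds);
  clauses.push(`dt.deck_id = ANY($${params.length})`);
  if (f.pos !== undefined) {
    params.push(f.pos);
    clauses.push(`t.pos = $${params.length}`);
  }
  if (f.cefrLevel !== undefined) {
    params.push(f.cefrLevel);
    clauses.push(`t.cefr_level = $${params.length}`);
  }
  return { sql: clauses.join(" AND "), params };
}
```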
### Missing from schema: `cefr_level` and categories
The game flow requires filtering by difficulty and category, but neither is in the schema yet.
**Difficulty (`cefr_level`):**
- Belongs on `terms`, not on `decks`
- Add as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`)
- Add now (nullable), populate later — backfilling a full terms table post-MVP is costly
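On the application side, a type guard mirroring that CHECK constraint might look like this (names are illustrative):

```typescript
// The six CEFR levels the CHECK constraint would allow.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
type CefrLevel = (typeof CEFR_LEVELS)[number];

// Type guard mirroring the database CHECK: accepts exactly A1 through C2;
// anything else (including null for not-yet-populated terms) is rejected.
function isCefrLevel(value: string | null): value is CefrLevel {
  return value !== null && (CEFR_LEVELS as readonly string[]).includes(value);
}
```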
**Categories:**
- Separate `categories` table + `term_categories` join table
- Do not use an enum or array on `terms` — a term can belong to multiple categories, and new categories should not require migrations
```sql
categories: id, slug, label, created_at
term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
```
### Deck scope: wordlists, not POS splits
**Rejected approach:** one deck per POS (e.g. `en-nouns`, `en-verbs`). Problem: POS is already a filterable column on `terms`, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication logic in the generation script.
**Decision:** one deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary. The user never sees or picks a deck — they pick "Nouns, B1" and the app resolves that to the right deck + filters.
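The resolution step described above could be sketched like this. The tier slug and function shape are assumptions for illustration, not settled API:

```typescript
// Hypothetical game selection as chosen in the UI.
type Selection = { sourceLanguage: string; pos: string; cefrLevel: string };

// Resolve a UI selection to internal query inputs. The user never sees a
// deck: the source language determines the frequency-tier deck slug, and
// POS/difficulty become metadata filters inside that scope.
function resolveSelection(sel: Selection): {
  deckIds: string[];
  pos: string;
  cefrLevel: string;
} {
  return {
    // MVP: start with the first tier only; progression adds more tiers.
    deckIds: [`${sel.sourceLanguage}-core-1000`],
    pos: sel.pos,
    cefrLevel: sel.cefrLevel,
  };
}
```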
### Deck progression: tiered frequency lists
When a user exhausts a deck, the app expands scope by adding the next tier:
```sql
WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
AND t.pos = 'noun'
AND t.cefr_level = 'B1'
```
Requirements for this to work cleanly:
- Decks must not overlap — each word appears in exactly one tier
- The generation script already deduplicates, so this is enforced at import time
- Unlocking logic (when to add the next deck) lives in user learning state, not in the deck structure — for MVP, query all tiers at once or hardcode active decks
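The no-overlap requirement can be enforced with a simple import-time guard in the generation script. A sketch (the data shape is hypothetical):

```typescript
// Import-time guard: frequency tiers must be disjoint so each word lives
// in exactly one deck. Throws on the first word found in two tiers.
function assertDisjointTiers(tiers: Map<string, Set<string>>): void {
  const seen = new Map<string, string>(); // word -> deck it was first seen in
  for (const [deckId, words] of tiers) {
    for (const word of words) {
      const prior = seen.get(word);
      if (prior !== undefined) {
        throw new Error(`"${word}" appears in both ${prior} and ${deckId}`);
      }
      seen.set(word, deckId);
    }
  }
}
```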
### Wordlist source: SUBTLEX (not manual curation)
**Problem:** the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what's actually common in Italian.
**Decision:** use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology — making them comparable across languages.
This maps directly onto `decks.source_language`:
- `en-core-1000` — built from SUBTLEX-EN, used when the user's source language is English
- `it-core-1000` — built from SUBTLEX-IT, used when the source language is Italian
When the user picks en→it, the app queries `en-core-1000`. When they pick it→en, it queries `it-core-1000`. Same translation data, correctly frequency-grounded per direction. Two wordlist files, two generation script runs — the schema already supports this.
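The direction-to-deck mapping is small enough to state as code. A sketch (the `-core-1000` tier suffix is illustrative):

```typescript
// A quiz direction as picked at session start.
type Direction = { source: string; target: string };

// The deck is keyed by the *source* language of the direction: en→it
// queries the SUBTLEX-EN-derived deck, it→en the SUBTLEX-IT one.
function deckForDirection(dir: Direction): string {
  return `${dir.source}-core-1000`;
}
```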
### Missing from schema: user learning state
The current schema has no concept of a user's progress. Not blocking for the API layer right now, but will be needed before the game loop is functional:
- `user_decks` — which decks a user is studying
- `user_term_progress` — per `(user_id, term_id, language_pair)`: `next_review_at`, `interval_days`, correct/attempt counts for spaced repetition
- `quiz_answers` — optional history log for stats and debugging
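To make the `user_term_progress` columns concrete, here is one possible update rule. The scheduling algorithm is NOT decided; doubling-on-correct is a placeholder assumption, and the row shape just mirrors the proposed columns:

```typescript
// Hypothetical per-(user, term, language_pair) progress row, mirroring the
// proposed user_term_progress columns.
type TermProgress = {
  nextReviewAt: Date;
  intervalDays: number;
  correctCount: number;
  attemptCount: number;
};

// Placeholder rule (not a decided algorithm): double the interval on a
// correct answer, reset to 1 day on a miss.
function updateProgress(p: TermProgress, correct: boolean, now: Date): TermProgress {
  const intervalDays = correct ? Math.max(1, p.intervalDays * 2) : 1;
  return {
    intervalDays,
    nextReviewAt: new Date(now.getTime() + intervalDays * 24 * 60 * 60 * 1000),
    correctCount: p.correctCount + (correct ? 1 : 0),
    attemptCount: p.attemptCount + 1,
  };
}
```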
### `synset_id`: make nullable, don't remove
`synset_id` is the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It should stay.
**Problem:** non-WordNet terms (custom words added later) won't have a synset ID, so `NOT NULL` is too strict.
**Decision:** make `synset_id` nullable. Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are not considered equal), so no constraint changes are needed beyond dropping `notNull()`.
For extra defensiveness, a partial unique index can be added later:
```sql
CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
```
---
## Open Questions / Deferred
- **User learning state** — not needed for the API layer but must be designed before the game loop ships
- **Distractors** — generated at query time (random same-POS terms from the same deck); no schema needed
- **`cefr_level` data source** — WordNet frequency data was already found to be unreliable; external CEFR lists (Oxford 3000, SUBTLEX) will be needed to populate this field
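The distractor rule above (random same-POS terms from the same deck, no schema needed) can be sketched as a pure function; `pool` stands in for the same-deck, same-POS rows a real query would return:

```typescript
// Query-time distractor generation: pick n random terms from the candidate
// pool, excluding the correct answer.
function pickDistractors(pool: string[], answer: string, n: number): string[] {
  const candidates = pool.filter((w) => w !== answer);
  // Fisher–Yates shuffle, then take the first n.
  for (let i = candidates.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [candidates[i], candidates[j]] = [candidates[j], candidates[i]];
  }
  return candidates.slice(0, n);
}
```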
---
### Open: semantic category metadata source
Categories (`animals`, `kitchen`, etc.) are in the schema but empty for MVP.
Grammar and Media work without them (Grammar = POS filter, Media = deck membership).
Needs research before populating `term_categories`. Options:
**Option 1: WordNet domain labels**
Already in OMW, extractable in the existing pipeline. Free, no extra dependency.
Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals").
**Option 2: Princeton WordNet Domains**
Separate project built on WordNet. ~200 hierarchical domains mapped to synsets. More structured
and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1.
**Option 3: Kelly Project**
Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning,
multiple languages. Could solve frequency tiers (cefr_level) and semantic categories in one shot.
Investigate coverage for your languages and POS range first.
**Option 4: BabelNet / WikiData**
Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories.
Problem: complex integration, BabelNet has commercial licensing restrictions, WikiData category
trees are deep and noisy.
**Option 5: LLM-assisted categorization**
Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import.
Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible
without saving output. Good fallback if structured sources have insufficient coverage.
**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill**
Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass.
Combines automation with control. Likely the most practical approach.
**Option 7: Manual curation**
Flat file mapping synset IDs to your own category slugs. Full control, matches UI exactly.
Too expensive at scale — only viable for small curated additions on top of an automated baseline.
**Current recommendation:** research Kelly Project first. If coverage is insufficient, go with Option 6.
---
### Implementation roadmap
1. Finalize data model
2. Write and run migrations
3. Fill the database (expand import pipeline)
4. Decide SUBTLEX → cefr_level mapping strategy
5. Generate decks
6. Finalize game selection flow
7. Define Zod schemas in `packages/shared`
8. Implement API