From c24967dc747b8054bc7f43074b4cf5b37d952f10 Mon Sep 17 00:00:00 2001 From: lila Date: Sun, 5 Apr 2026 00:33:05 +0200 Subject: [PATCH] updating docs --- documentation/decisions.md | 127 +++++++++++++++++- documentation/roadmap.md | 25 ++++ documentation/schema_discussion.md | 203 +++++++++++++++++++++++++++++ 3 files changed, 349 insertions(+), 6 deletions(-) create mode 100644 documentation/schema_discussion.md diff --git a/documentation/decisions.md b/documentation/decisions.md index 057a5a5..8b6d3a0 100644 --- a/documentation/decisions.md +++ b/documentation/decisions.md @@ -196,6 +196,118 @@ Allows multiple synonyms per language per term (e.g. "dog", "hound" for same syn - `validated_languages` stays accurate as translation data grows - DB enforces via CHECK constraint that `source_language` is never included in `validated_languages` + +### Decks: wordlist tiers as scope (not POS-split decks) + +**Rejected approach:** one deck per POS (e.g. `en-nouns`, `en-verbs`). + +**Problem:** POS is already a filterable column on `terms`, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication in the generation script. + +**Decision:** one deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary at query time. The user never sees or picks a deck — they pick a direction, POS, and difficulty, and the app resolves those to the right deck + filters. + +Progression works by expanding the deck set as the user advances: + +```sql +WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000') + AND t.pos = 'noun' + AND t.cefr_level = 'B1' +``` + +Decks must not overlap — each term appears in exactly one tier. The generation script already deduplicates, so this is enforced at import time. 
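The selection → scope resolution described above can be sketched as follows. This is a minimal illustration only — `resolveSelection`, `QuizScope`, and the hardcoded tier list are hypothetical names and assumptions, not the actual implementation:

```typescript
// Hypothetical resolver sketch — all names here are illustrative assumptions,
// not part of the existing codebase.

type Pos = "noun" | "verb";
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

interface Selection {
  sourceLanguage: string; // e.g. "en" for en→it
  pos: Pos;
  cefrLevel: CefrLevel;
  unlockedTiers: number;  // how many frequency tiers the user has unlocked
}

interface QuizScope {
  deckIds: string[];                            // scope: decks joined via deck_terms
  filters: { pos: Pos; cefrLevel: CefrLevel };  // filters applied inside that scope
}

// Assumed tier sizes matching the example deck slugs (en-core-1000, en-core-2000).
const TIER_SIZES = [1000, 2000];

function resolveSelection(sel: Selection): QuizScope {
  // Deck = scope (frequency tier per source language); POS/difficulty = filters.
  const deckIds = TIER_SIZES.slice(0, sel.unlockedTiers).map(
    (size) => `${sel.sourceLanguage}-core-${size}`,
  );
  return { deckIds, filters: { pos: sel.pos, cefrLevel: sel.cefrLevel } };
}
```

The resulting `deckIds` feed the `IN (...)` list of the quiz query above, and `filters` become its `AND` conditions.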
+ +### Decks: SUBTLEX as wordlist source (not manual curation) + +**Problem:** the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what is actually common in Italian. + +**Decision:** use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology, making them comparable across languages. + +This is why `decks.source_language` is not just a technical detail — it is the reason the data model is correct: + +- `en-core-1000` built from SUBTLEX-EN → used when source language is English (en→it) +- `it-core-1000` built from SUBTLEX-IT → used when source language is Italian (it→en) + +Same translation data underneath, correctly frequency-grounded per direction. Two wordlist files, two generation script runs. + +### Terms: `synset_id` nullable (not NOT NULL) + +**Problem:** non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. `NOT NULL` is too strict. + +**Decision:** make `synset_id` nullable. `synset_id` remains the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It is not removed. + +Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are not considered equal), so no additional constraint logic is needed beyond dropping `notNull()`. For extra defensiveness a partial unique index can be added later: + +```sql +CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL; +``` + +### Terms: `cefr_level` column (deferred population) + +Added as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`). Belongs on `terms`, not on `decks` — difficulty is a property of the term, not the curated list. 
Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later. + +### Schema: `categories` + `term_categories` (empty for MVP) + +Added to schema now, left empty for MVP. Grammar and Media work without them — Grammar maps to POS (already on `terms`), Media maps to deck membership. Thematic categories (animals, kitchen, etc.) require a metadata source that is still under research. + +```sql +categories: id, slug, label, created_at +term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id) +``` + +See open research section below for source options. + +### Future extensions: morphology and pronunciation (deferred, additive) + +The following features are explicitly deferred post-MVP. All are purely additive — new tables referencing existing `terms` rows via FK. No existing schema changes required when implemented: + +- `noun_forms` — gender, singular, plural, articles per language (source: Wiktionary) +- `verb_forms` — conjugation tables per language (source: Wiktionary) +- `term_pronunciations` — IPA and audio URLs per language (source: Wiktionary / Forvo) + +Exercise types split naturally into Type A (translation, current model) and Type B (morphology, future). The data layer is independent — the same `terms` anchor both. + +--- + +## Open Research + +### Semantic category metadata source + +Categories (`animals`, `kitchen`, etc.) are in the schema but empty for MVP. +Grammar and Media work without them (Grammar = POS filter, Media = deck membership). +Needs research before populating `term_categories`. Options: + +**Option 1: WordNet domain labels** +Already in OMW, extractable in the existing pipeline. Free, no extra dependency. +Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals"). + +**Option 2: Princeton WordNet Domains** +Separate project built on WordNet. 
~200 hierarchical domains mapped to synsets. More structured +and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1. + +**Option 3: Kelly Project** +Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, +multiple languages. Could solve frequency tiers (`cefr_level`) and semantic categories in one shot. +Investigate coverage for your languages and POS range first. + +**Option 4: BabelNet / WikiData** +Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories. +Problem: complex integration, BabelNet has commercial licensing restrictions, WikiData category +trees are deep and noisy. + +**Option 5: LLM-assisted categorization** +Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import. +Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible +without saving output. Good fallback if structured sources have insufficient coverage. + +**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill** +Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass. +Combines automation with control. Likely the most practical approach. + +**Option 7: Manual curation** +Flat file mapping synset IDs to your own category slugs. Full control, matches UI exactly. +Too expensive at scale — only viable for small curated additions on top of an automated baseline. + +**Current recommendation:** research Kelly Project first. If coverage is insufficient, go with Option 6. + --- ## Current State @@ -223,10 +335,13 @@ Phase 0 complete. Phase 1 data pipeline complete. 
- Words not found in DB: 34 - Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages` -### Next (Phase 1 — API layer) +### Roadmap to API implementation -- [ ] Define Zod response schemas in `packages/shared` -- [ ] Implement `DeckRepository.getTerms(deckId, limit, offset)` -- [ ] Implement `QuizService.attachDistractors(terms)` -- [ ] Implement `GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms` endpoints -- [ ] Unit tests for `QuizService` +1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `cefr_level` to `terms`, add `categories` + `term_categories` tables +2. **Write and run migrations** — schema changes before any data expansion +3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations +4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful +5. **Generate decks** — run generation script with SUBTLEX-grounded wordlists per source language +6. **Finalize game selection flow** — direction → category → POS → difficulty → round count +7. **Define Zod schemas in `packages/shared`** — based on finalized game flow and API shape +8. **Implement API** diff --git a/documentation/roadmap.md b/documentation/roadmap.md index db130f7..871cab2 100644 --- a/documentation/roadmap.md +++ b/documentation/roadmap.md @@ -147,3 +147,28 @@ Phase 0 (Foundation) └── Phase 4 (Room Lobby) └── Phase 5 (Multiplayer Game) └── Phase 6 (Deployment) + + +--- + +## ui sketch + +i was sketching the ui of the menu and came up with some questions.  
+
+this would be the flow to start a single player game:
+main menu => singleplayer, multiplayer, settings
+singleplayer => language selection
+"i speak english" => "i want to learn italian" (both languages are dropdowns to select the appropriate language)
+language selection => category selection => pure grammar, media (as discussed, practicing on song lyrics or Breaking Bad subtitles)
+pure grammar => pos selection => nouns or verbs (in mvp)
+nouns have 3 subcategories => singular (1-on-1 translation dog => cane), plural (practicing cane => cani, for example), gender/articles (il cane or la cane, for example)
+verbs have 2 subcategories => infinitive (1-on-1 translation to talk => parlare) or conjugations (the user is shown the infinitive and a table with all personal pronouns and has to fill in the gaps with the corresponding conjugations)
+pos selection => difficulty selection (from a1 to c2)
+afterwards a start game button
+---
+this raises some questions:
+- how to store the plurals and articles of nouns in the database
+- how to store the conjugations of verbs
+- what about IPA?
+- links to audio files to hear how a word is pronounced?
+- one table for italian_verbs, french_nouns, german_adjectives?
diff --git a/documentation/schema_discussion.md b/documentation/schema_discussion.md
new file mode 100644
index 0000000..0b473f0
--- /dev/null
+++ b/documentation/schema_discussion.md
@@ -0,0 +1,203 @@
+# Glossa — Schema & Architecture Discussion Summary
+
+## Project Overview
+
+A vocabulary trainer in the style of Duolingo (see a word, pick from 4 translations). Built as a monorepo with a Drizzle/Postgres data layer. Phase 1 (data pipeline) is complete; the API layer is next.
+
+---
+
+## Game Flow (MVP)
+
+Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count (3 or 10) → game starts.
+
+**Top-level categories (MVP):**
+- **Grammar** — practice nouns, verb conjugations, etc.
+- **Media** — practice vocabulary from specific books, films, songs, etc. + +**Post-MVP categories (not in scope yet):** +- Animals, kitchen, and other thematic word groups + +--- + +## Schema Decisions Made + +### Deck model: `source_language` + `validated_languages` (not `pair_id`) + +A deck is a curated pool of terms sourced from a specific language (e.g. an English frequency list). The language pair used for a quiz is chosen at session start, not at deck creation. + +- `decks.source_language` — the language the wordlist was curated from +- `decks.validated_languages` — array of target language codes for which full translation coverage exists across all terms; recalculated on every generation script run +- Enforced via CHECK: `source_language` is never in `validated_languages` +- One deck serves en→it and en→fr without duplication + +### Architecture: deck as curated pool (Option 2) + +Three options were considered: + +| Option | Description | Problem | +|--------|-------------|---------| +| 1. Pure filter | No decks, query the whole terms table | No curatorial control; import junk ends up in the game | +| 2. Deck as pool ✅ | Decks define scope, term metadata drives filtering | Clean separation of concerns | +| 3. Deck as preset | Deck encodes filter config (category + POS + difficulty) | Combinatorial explosion; can't reuse terms across decks | + +**Decision: Option 2.** Decks solve the curation problem (which terms are game-ready). Term metadata solves the filtering problem (which subset to show today). These are separate concerns and should stay separate. + +The quiz query joins `deck_terms` for scope, then filters by `pos`, `cefr_level`, and later `category` — all independently. + +### Missing from schema: `cefr_level` and categories + +The game flow requires filtering by difficulty and category, but neither is in the schema yet. 
+ +**Difficulty (`cefr_level`):** +- Belongs on `terms`, not on `decks` +- Add as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`) +- Add now (nullable), populate later — backfilling a full terms table post-MVP is costly + +**Categories:** +- Separate `categories` table + `term_categories` join table +- Do not use an enum or array on `terms` — a term can belong to multiple categories, and new categories should not require migrations + +```sql +categories: id, slug, label, created_at +term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id) +``` + +### Deck scope: wordlists, not POS splits + +**Rejected approach:** one deck per POS (e.g. `en-nouns`, `en-verbs`). Problem: POS is already a filterable column on `terms`, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication logic in the generation script. + +**Decision:** one deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary. The user never sees or picks a deck — they pick "Nouns, B1" and the app resolves that to the right deck + filters. 
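The import-time deduplication mentioned above could look roughly like this. A simplified sketch under stated assumptions — `assignTiers`, `RankedTerm`, and the tier bounds are illustrative, and real terms are keyed by synset, not bare lemma:

```typescript
// Illustrative tier-assignment sketch for the generation script — names and
// shapes are assumptions, not the real pipeline code.

interface RankedTerm {
  lemma: string;
  rank: number; // frequency rank from the wordlist, 1 = most frequent
}

// Cumulative cutoffs: tier 1 = the 1000 most frequent lemmas, tier 2 = the next 1000.
const TIER_BOUNDS = [1000, 2000];

/** Assign each unique lemma to exactly one tier so decks never overlap. */
function assignTiers(terms: RankedTerm[], lang: string): Map<string, string> {
  const seen = new Set<string>();
  const deckOf = new Map<string, string>();
  // Sort by rank so a lemma that appears twice (e.g. "run" as noun and verb)
  // is kept only at its best rank — the duplicate is dropped, not re-tiered.
  for (const t of [...terms].sort((a, b) => a.rank - b.rank)) {
    if (seen.has(t.lemma)) continue; // dedupe at import time
    seen.add(t.lemma);
    const bound = TIER_BOUNDS.find((b) => seen.size <= b);
    if (bound !== undefined) deckOf.set(t.lemma, `${lang}-core-${bound}`);
  }
  return deckOf;
}
```

Because assignment happens once at import, the non-overlap invariant never needs to be re-checked at query time.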
+ +### Deck progression: tiered frequency lists + +When a user exhausts a deck, the app expands scope by adding the next tier: + +```sql +WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000') + AND t.pos = 'noun' + AND t.cefr_level = 'B1' +``` + +Requirements for this to work cleanly: +- Decks must not overlap — each word appears in exactly one tier +- The generation script already deduplicates, so this is enforced at import time +- Unlocking logic (when to add the next deck) lives in user learning state, not in the deck structure — for MVP, query all tiers at once or hardcode active decks + +### Wordlist source: SUBTLEX (not manual curation) + +**Problem:** the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what's actually common in Italian. + +**Decision:** use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology — making them comparable across languages. + +This maps directly onto `decks.source_language`: +- `en-core-1000` — built from SUBTLEX-EN, used when the user's source language is English +- `it-core-1000` — built from SUBTLEX-IT, used when the source language is Italian + +When the user picks en→it, the app queries `en-core-1000`. When they pick it→en, it queries `it-core-1000`. Same translation data, correctly frequency-grounded per direction. Two wordlist files, two generation script runs — the schema already supports this. + +### Missing from schema: user learning state + +The current schema has no concept of a user's progress. 
Not blocking for the API layer right now, but will be needed before the game loop is functional: + +- `user_decks` — which decks a user is studying +- `user_term_progress` — per `(user_id, term_id, language_pair)`: `next_review_at`, `interval_days`, correct/attempt counts for spaced repetition +- `quiz_answers` — optional history log for stats and debugging + +### `synset_id`: make nullable, don't remove + +`synset_id` is the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It should stay. + +**Problem:** non-WordNet terms (custom words added later) won't have a synset ID, so `NOT NULL` is too strict. + +**Decision:** make `synset_id` nullable. Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are not considered equal), so no constraint changes are needed beyond dropping `notNull()`. + +For extra defensiveness, a partial unique index can be added later: + +```sql +CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL; +``` + +--- + +## Open Questions / Deferred + +- **User learning state** — not needed for the API layer but must be designed before the game loop ships +- **Distractors** — generated at query time (random same-POS terms from the same deck); no schema needed +- **`cefr_level` data source** — WordNet frequency data was already found to be unreliable; external CEFR lists (Oxford 3000, SUBTLEX) will be needed to populate this field + +--- + +### Open: semantic category metadata source + +Categories (`animals`, `kitchen`, etc.) are in the schema but empty for MVP. +Grammar and Media work without them (Grammar = POS filter, Media = deck membership). +Needs research before populating `term_categories`. Options: + +**Option 1: WordNet domain labels** +Already in OMW, extractable in the existing pipeline. Free, no extra dependency. +Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals"). 
+ +**Option 2: Princeton WordNet Domains** +Separate project built on WordNet. ~200 hierarchical domains mapped to synsets. More structured +and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1. + +**Option 3: Kelly Project** +Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, +multiple languages. Could solve frequency tiers (cefr_level) and semantic categories in one shot. +Investigate coverage for your languages and POS range first. + +**Option 4: BabelNet / WikiData** +Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories. +Problem: complex integration, BabelNet has commercial licensing restrictions, WikiData category +trees are deep and noisy. + +**Option 5: LLM-assisted categorization** +Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import. +Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible +without saving output. Good fallback if structured sources have insufficient coverage. + +**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill** +Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass. +Combines automation with control. Likely the most practical approach. + +**Option 7: Manual curation** +Flat file mapping synset IDs to your own category slugs. Full control, matches UI exactly. +Too expensive at scale — only viable for small curated additions on top of an automated baseline. + +**Current recommendation:** research Kelly Project first. If coverage is insufficient, go with Option 6. 
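The Option 6 flow could be sketched like this. Everything here is a hypothetical illustration — the domain labels, `DOMAIN_TO_CATEGORY` mapping, and `planCategorization` are assumptions, not a real WordNet Domains integration:

```typescript
// Sketch of the hybrid flow: automated baseline first, untagged terms queued
// for an LLM gap-fill pass. All names are hypothetical placeholders.

// Hand-maintained mapping from (assumed) domain labels to app category slugs;
// only domains listed here produce a category.
const DOMAIN_TO_CATEGORY: Record<string, string> = {
  animals: "animals",
  gastronomy: "kitchen",
};

interface Term {
  id: number;
  lemma: string;
  domains: string[]; // domain tags from the automated baseline, may be empty
}

interface CategorizationPlan {
  assigned: Map<number, string[]>; // term id → category slugs from the baseline
  gapFillQueue: Term[];            // untagged terms for the LLM pass
}

function planCategorization(terms: Term[]): CategorizationPlan {
  const assigned = new Map<number, string[]>();
  const gapFillQueue: Term[] = [];
  for (const term of terms) {
    const slugs = term.domains
      .map((d) => DOMAIN_TO_CATEGORY[d])
      .filter((s): s is string => s !== undefined);
    if (slugs.length > 0) assigned.set(term.id, [...new Set(slugs)]);
    else gapFillQueue.push(term); // no usable domain tag → LLM gap-fill
  }
  return { assigned, gapFillQueue };
}
```

The gap-fill queue output would then go through the manual spot-check pass before any rows land in `term_categories`.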
+
+---
+
+### Implementation roadmap
+
+#### Finalize data model
+
+text
+
+#### Write and run migrations
+
+text
+
+#### Fill the database (expand import pipeline)
+
+text
+
+#### Decide SUBTLEX → `cefr_level` mapping strategy
+
+text
+
+#### Generate decks
+
+text
+
+#### Finalize game selection flow
+
+text
+
+#### Define Zod schemas in `packages/shared`
+
+text
+
+#### Implement API
+
+text