lila/documentation/schema_discussion.md
2026-04-06 17:01:34 +02:00


Glossa — Schema & Architecture Discussion Summary

Project Overview

A vocabulary trainer in the style of Duolingo (see a word, pick from 4 translations). Built as a monorepo with a Drizzle/Postgres data layer. Phase 1 (data pipeline) is complete; the API layer is next.


Game Flow (MVP)

Singleplayer: choose direction (en→it or it→en) → top-level category → part of speech → difficulty (A1–C2) → round count (3 or 10) → game starts.

Top-level categories (MVP):

  • Grammar — practice nouns, verb conjugations, etc.
  • Media — practice vocabulary from specific books, films, songs, etc.

Post-MVP categories (not in scope yet):

  • Animals, kitchen, and other thematic word groups

Schema Decisions Made

Deck model: source_language + validated_languages (not pair_id)

A deck is a curated pool of terms sourced from a specific language (e.g. an English frequency list). The language pair used for a quiz is chosen at session start, not at deck creation.

  • decks.source_language — the language the wordlist was curated from
  • decks.validated_languages — array of target language codes for which full translation coverage exists across all terms; recalculated on every generation script run
  • Enforced via CHECK: source_language is never in validated_languages
  • One deck serves en→it and en→fr without duplication
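A sketch of the decks table in plain SQL, based on the columns and CHECK described above (column types and the constraint name are assumptions, not the actual Drizzle schema):

```sql
CREATE TABLE decks (
  id                  text PRIMARY KEY,        -- e.g. a frequency-tier slug
  source_language     varchar(2) NOT NULL,     -- language the wordlist was curated from
  validated_languages varchar(2)[] NOT NULL DEFAULT '{}',
  -- the source language must never appear among the validated targets
  CONSTRAINT decks_source_not_validated
    CHECK (source_language <> ALL (validated_languages))
);
```

`<> ALL (array)` is the Postgres idiom for "not an element of"; the generation script would rewrite `validated_languages` on each run.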

Architecture: deck as curated pool (Option 2)

Three options were considered:

| Option | Description | Problem / Outcome |
| --- | --- | --- |
| 1. Pure filter | No decks; query the whole terms table | No curatorial control; imported junk ends up in the game |
| 2. Deck as pool | Decks define scope; term metadata drives filtering | Clean separation of concerns |
| 3. Deck as preset | Deck encodes filter config (category + POS + difficulty) | Combinatorial explosion; can't reuse terms across decks |

Decision: Option 2. Decks solve the curation problem (which terms are game-ready). Term metadata solves the filtering problem (which subset to show today). These are separate concerns and should stay separate.

The quiz query joins deck_terms for scope, then filters by pos, cefr_level, and later category — all independently.
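A sketch of that query (table and column names follow the discussion; exact identifiers may differ in the Drizzle schema):

```sql
SELECT t.*
FROM deck_terms dt
JOIN terms t ON t.id = dt.term_id
WHERE dt.deck_id = :deck_id        -- deck membership sets the scope
  AND t.pos = 'noun'               -- each filter is independent
  AND t.cefr_level = 'B1';         -- category join would slot in the same way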

Missing from schema: cefr_level and categories

The game flow requires filtering by difficulty and category, but neither is in the schema yet.

Difficulty (cefr_level):

  • Belongs on terms, not on decks
  • Add as a nullable varchar(2) with a CHECK constraint (A1–C2)
  • Add now (nullable), populate later — backfilling a full terms table post-MVP is costly
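As a migration sketch (the CHECK values follow directly from the CEFR scale; the constraint syntax is standard Postgres):

```sql
ALTER TABLE terms
  ADD COLUMN cefr_level varchar(2)
  CHECK (cefr_level IN ('A1', 'A2', 'B1', 'B2', 'C1', 'C2'));
```

Being nullable, the column costs nothing until the external CEFR data lands; NULL rows simply never match a difficulty filter.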

Categories:

  • Separate categories table + term_categories join table
  • Do not use an enum or array on terms — a term can belong to multiple categories, and new categories should not require migrations
categories:      id, slug, label, created_at
term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
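The same shape as DDL (column types are assumptions; the doc only specifies names and the composite PK):

```sql
CREATE TABLE categories (
  id         serial PRIMARY KEY,
  slug       text NOT NULL UNIQUE,
  label      text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE term_categories (
  term_id     integer NOT NULL REFERENCES terms (id),
  category_id integer NOT NULL REFERENCES categories (id),
  PRIMARY KEY (term_id, category_id)
);
```

New categories become plain INSERTs into categories — no migration needed, which is the point of rejecting an enum.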

Deck scope: wordlists, not POS splits

Rejected approach: one deck per POS (e.g. en-nouns, en-verbs). Problem: POS is already a filterable column on terms, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication logic in the generation script.

Decision: one deck per frequency tier per source language (e.g. en-core-1000, en-core-2000). POS, difficulty, and category are query filters applied inside that boundary. The user never sees or picks a deck — they pick "Nouns, B1" and the app resolves that to the right deck + filters.

Deck progression: tiered frequency lists

When a user exhausts a deck, the app expands scope by adding the next tier:

WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
  AND t.pos = 'noun'
  AND t.cefr_level = 'B1'

Requirements for this to work cleanly:

  • Decks must not overlap — each word appears in exactly one tier
  • The generation script already deduplicates, so this is enforced at import time
  • Unlocking logic (when to add the next deck) lives in user learning state, not in the deck structure — for MVP, query all tiers at once or hardcode active decks
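Since non-overlap is enforced only at import time, a cheap sanity check against the live data could look like this (returns zero rows when tiering is intact):

```sql
-- any term appearing in more than one deck breaks the tier model
SELECT dt.term_id, count(*) AS deck_count
FROM deck_terms dt
GROUP BY dt.term_id
HAVING count(*) > 1;
```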

Wordlist source: SUBTLEX (not manual curation)

Problem: the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what's actually common in Italian.

Decision: use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology — making them comparable across languages.

This maps directly onto decks.source_language:

  • en-core-1000 — built from SUBTLEX-EN, used when the user's source language is English
  • it-core-1000 — built from SUBTLEX-IT, used when the source language is Italian

When the user picks en→it, the app queries en-core-1000. When they pick it→en, it queries it-core-1000. Same translation data, correctly frequency-grounded per direction. Two wordlist files, two generation script runs — the schema already supports this.

Missing from schema: user learning state

The current schema has no concept of a user's progress. Not blocking for the API layer right now, but will be needed before the game loop is functional:

  • user_decks — which decks a user is studying
  • user_term_progress — per (user_id, term_id, language_pair): next_review_at, interval_days, correct/attempt counts for spaced repetition
  • quiz_answers — optional history log for stats and debugging
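A possible shape for the central table, using only the fields listed above (all types and the pair encoding are assumptions to be settled when this is designed properly):

```sql
CREATE TABLE user_term_progress (
  user_id        integer NOT NULL,
  term_id        integer NOT NULL REFERENCES terms (id),
  language_pair  varchar(5) NOT NULL,   -- e.g. 'en-it'
  next_review_at timestamptz,
  interval_days  integer NOT NULL DEFAULT 0,
  correct_count  integer NOT NULL DEFAULT 0,
  attempt_count  integer NOT NULL DEFAULT 0,
  PRIMARY KEY (user_id, term_id, language_pair)
);
```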

synset_id: make nullable, don't remove

synset_id is the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It should stay.

Problem: non-WordNet terms (custom words added later) won't have a synset ID, so NOT NULL is too strict.

Decision: make synset_id nullable. Postgres UNIQUE on a nullable column allows multiple NULL values (nulls are not considered equal), so no constraint changes are needed beyond dropping notNull().

For extra defensiveness, a partial unique index can be added later:

CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;

Open Questions / Deferred

  • User learning state — not needed for the API layer but must be designed before the game loop ships
  • Distractors — generated at query time (random same-POS terms from the same deck); no schema needed
  • cefr_level data source — WordNet frequency data was already found to be unreliable; external CEFR lists (Oxford 3000, SUBTLEX) will be needed to populate this field
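The query-time distractor generation mentioned above could be sketched as follows (the lemma column is an assumption; `ORDER BY random()` is fine at current term counts):

```sql
-- 3 random same-POS terms from the same deck, excluding the correct answer
SELECT t.id, t.lemma
FROM deck_terms dt
JOIN terms t ON t.id = dt.term_id
WHERE dt.deck_id = :deck_id
  AND t.pos = :pos
  AND t.id <> :answer_term_id
ORDER BY random()
LIMIT 3;
```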

Open: semantic category metadata source

Categories (animals, kitchen, etc.) are in the schema but empty for MVP. Grammar and Media work without them (Grammar = POS filter, Media = deck membership). Needs research before populating term_categories. Options:

Option 1: WordNet domain labels Already in OMW, extractable in the existing pipeline. Free, no extra dependency. Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals").

Option 2: Princeton WordNet Domains Separate project built on WordNet. ~200 hierarchical domains mapped to synsets. More structured and consistent than basic WordNet labels. Freely available. Meaningfully better than Option 1.

Option 3: Kelly Project Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, multiple languages. Could solve frequency tiers (cefr_level) and semantic categories in one shot. Investigate coverage for your languages and POS range first.

Option 4: BabelNet / WikiData Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories. Problem: complex integration, BabelNet has commercial licensing restrictions, WikiData category trees are deep and noisy.

Option 5: LLM-assisted categorization Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import. Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible without saving output. Good fallback if structured sources have insufficient coverage.

Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass. Combines automation with control. Likely the most practical approach.

Option 7: Manual curation Flat file mapping synset IDs to your own category slugs. Full control, matches UI exactly. Too expensive at scale — only viable for small curated additions on top of an automated baseline.

Current recommendation: research Kelly Project first. If coverage is insufficient, go with Option 6.


Implementation Roadmap

  • Finalize data model
  • Write and run migrations
  • Fill the database (expand import pipeline)
  • Decide SUBTLEX → cefr_level mapping strategy
  • Generate decks
  • Finalize game selection flow
  • Define Zod schemas in packages/shared
  • Implement API