lila/documentation/ai-context/02-data-model.md
2026-05-16 01:59:43 +02:00

11 KiB
Raw Permalink Blame History

02 — Data Model

Purpose: Database schema reference for LLMs working on features that query or modify data. Concatenate with 00-project-overview.md and 99-current-task.md. Last updated: 2026-05-15 Depends on: 00-project-overview.md


Core Tables

terms — Language-neutral concepts

Column Type Constraints Notes
id uuid PK
pos varchar CHECK: noun, verb, adjective, adverb Part of speech
source varchar Pipeline that created this term (e.g. kaikki, omw)
source_id varchar UNIQUE(source, source_id) Idempotency key for imports
synset_id varchar nullable WordNet synset ID. Nullable for non-WordNet terms.
created_at timestamp default now()

Rule: One row per concept. The word "cat" (animal) and "cat" (nautical) are separate rows because they have different source_id values.


translations — Per-language words

Column Type Constraints Notes
id uuid PK
term_id uuid FK → terms.id
language_code varchar(2) CHECK: en, it, de, es, fr
text varchar The actual word
cefr_level varchar(2) nullable, CHECK: A1C2 Difficulty of THIS word in THIS language
created_at timestamp default now()

Unique constraint: (term_id, language_code, text) — allows synonyms (e.g. "dog" and "hound" for same term), prevents exact duplicates.

Key design: cefr_level is on translations, not terms. "House" in English is A1; "domicile" is also English but B2 — same concept, different words, different difficulty.


term_glosses — Definitions per language

Column Type Constraints Notes
id uuid PK
term_id uuid FK → terms.id
language_code varchar(2) CHECK: en, it, de, es, fr
text text Definition/explanation
created_at timestamp default now()

Unique constraint: (term_id, language_code) — one gloss per term per language. Prevents left joins from multiplying question rows.

Note: Italian gloss coverage is sparse (~2% of terms have Italian glosses). UI falls back to English gloss when no gloss exists for the user's language.


decks — Curated wordlists

Column Type Constraints Notes
id uuid PK
name varchar e.g. en-core-1000
source_language varchar(2) CHECK Language the wordlist was built from
validated_languages varchar(2)[] CHECK: source_language NOT IN validated_languages Languages with complete translations for all deck terms
description text nullable
created_at timestamp default now()

Design: One deck per frequency tier per source language. POS, difficulty, and category are query filters, not separate decks. Decks must not overlap — each term appears in exactly one tier.

Source: SUBTLEX frequency lists (per-language editions, same methodology).


deck_terms — Junction table

Column Type Constraints Notes
deck_id uuid FK → decks.id
term_id uuid FK → terms.id
created_at timestamp default now()

PK: (deck_id, term_id)


Auth Tables (managed by Better Auth)

Better Auth creates and owns these tables. Do not modify directly.

user

Column Type Notes
id varchar PK
name varchar Display name
email varchar
email_verified boolean
image varchar nullable, avatar URL
created_at timestamp
updated_at timestamp

session

Column Type Notes
id varchar PK
user_id varchar FK → user.id
token varchar Session token
expires_at timestamp
ip_address varchar nullable
user_agent text nullable
created_at timestamp
Column Type Notes
id varchar PK
user_id varchar FK → user.id
account_id varchar Provider's user ID
provider_id varchar google or github
access_token text nullable
refresh_token text nullable
id_token text nullable
expires_at timestamp nullable

Note: One user can have multiple accounts (Google + GitHub linked to same user).

verification

Email verification tokens. Unused for social-only auth but managed by Better Auth.


Lobby Tables (Multiplayer)

lobbies

Column Type Constraints Notes
id uuid PK
code varchar UNIQUE Human-readable room code (e.g. WOLF-42)
host_id varchar FK → user.id
status varchar CHECK: waiting, in_progress, finished
max_players integer default 4
settings jsonb nullable Game mode, round count, timer duration, etc.
created_at timestamp default now()
updated_at timestamp default now() Used for stale recovery

lobby_players

Column Type Constraints Notes
id uuid PK
lobby_id uuid FK → lobbies.id
user_id varchar FK → user.id
display_name varchar Player's shown name in lobby
is_host boolean default false
joined_at timestamp default now()

Unique constraint: (lobby_id, user_id) — one entry per player per lobby.


Key Relationships

terms (1) ←──→ (N) translations
terms (1) ←──→ (N) term_glosses
terms (N) ←──→ (N) decks via deck_terms
user (1) ←──→ (N) sessions
user (1) ←──→ (N) accounts
user (1) ←──→ (N) lobbies (as host)
user (1) ←──→ (N) lobby_players
lobbies (1) ←──→ (N) lobby_players

Query Patterns

Get quiz terms (singleplayer)

SELECT t.id, t.pos, src.text AS source_text, tgt.text AS target_text, g.text AS gloss
FROM terms t
JOIN translations src ON src.term_id = t.id AND src.language_code = ?
JOIN translations tgt ON tgt.term_id = t.id AND tgt.language_code = ?
LEFT JOIN term_glosses g ON g.term_id = t.id AND g.language_code = ?
WHERE t.pos = ? AND tgt.cefr_level IN (?)
LIMIT ?

Get distractors

SELECT text FROM translations
WHERE language_code = ? AND pos = ? AND cefr_level IN (?)
AND term_id != ? AND text != ?
ORDER BY RANDOM()
LIMIT 3

Note: This is the N+1 query mentioned in BACKLOG.md. Each question fetches 3 distractors separately. Batching is planned.


Deferred Schema Extensions (Not Yet Implemented)

These tables are planned but do not exist yet. All are additive — they reference existing terms rows via FK.

Table Purpose Trigger
noun_forms Gender, singular, plural, articles per language Grammar quiz mode
verb_forms Conjugation tables per language Grammar quiz mode
term_pronunciations IPA + audio URLs per language Pronunciation quiz mode
user_decks Which decks a user studies User customization
user_term_progress Spaced repetition state per user/term/language SRS review queue
quiz_answers Answer history for stats/analytics User stats dashboard