lila/documentation/ai-context/02-data-model.md
2026-05-16 01:59:43 +02:00

221 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 02 — Data Model
> **Purpose:** Database schema reference for LLMs working on features that query or modify data. Concatenate with 00-project-overview.md and 99-current-task.md.
> **Last updated:** 2026-05-15
> **Depends on:** 00-project-overview.md
---
## Core Tables
### `terms` — Language-neutral concepts
| Column | Type | Constraints | Notes |
| ------------ | --------- | -------------------------------------------- | ------------------------------------------------------ |
| `id` | uuid | PK | |
| `pos` | varchar | CHECK: `noun`, `verb`, `adjective`, `adverb` | Part of speech |
| `source` | varchar | | Pipeline that created this term (e.g. `kaikki`, `omw`) |
| `source_id` | varchar | UNIQUE(`source`, `source_id`) | Idempotency key for imports |
| `synset_id` | varchar | nullable | WordNet synset ID. Nullable for non-WordNet terms. |
| `created_at` | timestamp | default now() | |
**Rule:** One row per concept. The word "cat" (animal) and "cat" (nautical) are separate rows because they have different `source_id` values.
---
### `translations` — Per-language words
| Column | Type | Constraints | Notes |
| --------------- | ---------- | ----------------------------------- | ---------------------------------------- |
| `id` | uuid | PK | |
| `term_id` | uuid | FK → terms.id | |
| `language_code` | varchar(2) | CHECK: `en`, `it`, `de`, `es`, `fr` | |
| `text` | varchar | | The actual word |
| `cefr_level` | varchar(2) | nullable, CHECK: `A1``C2` | Difficulty of THIS word in THIS language |
| `created_at` | timestamp | default now() | |
**Unique constraint:** (`term_id`, `language_code`, `text`) — allows synonyms (e.g. "dog" and "hound" for same term), prevents exact duplicates.
**Key design:** `cefr_level` is on `translations`, not `terms`. "House" in English is A1; "domicile" is also English but B2 — same concept, different words, different difficulty.
---
### `term_glosses` — Definitions per language
| Column | Type | Constraints | Notes |
| --------------- | ---------- | ----------------------------------- | ---------------------- |
| `id` | uuid | PK | |
| `term_id` | uuid | FK → terms.id | |
| `language_code` | varchar(2) | CHECK: `en`, `it`, `de`, `es`, `fr` | |
| `text` | text | | Definition/explanation |
| `created_at` | timestamp | default now() | |
**Unique constraint:** (`term_id`, `language_code`) — one gloss per term per language. Prevents left joins from multiplying question rows.
**Note:** Italian gloss coverage is sparse (~2% of terms have Italian glosses). UI falls back to English gloss when no gloss exists for the user's language.
---
### `decks` — Curated wordlists
| Column | Type | Constraints | Notes |
| --------------------- | ------------ | ------------------------------------------------- | ------------------------------------------------------- |
| `id` | uuid | PK | |
| `name` | varchar | | e.g. `en-core-1000` |
| `source_language` | varchar(2) | CHECK | Language the wordlist was built from |
| `validated_languages` | varchar(2)[] | CHECK: source_language NOT IN validated_languages | Languages with complete translations for all deck terms |
| `description` | text | nullable | |
| `created_at` | timestamp | default now() | |
**Design:** One deck per frequency tier per source language. POS, difficulty, and category are query filters, not separate decks. Decks must not overlap — each term appears in exactly one tier.
**Source:** SUBTLEX frequency lists (per-language editions, same methodology).
---
### `deck_terms` — Junction table
| Column | Type | Constraints | Notes |
| ------------ | --------- | ------------- | ----- |
| `deck_id` | uuid | FK → decks.id | |
| `term_id` | uuid | FK → terms.id | |
| `created_at` | timestamp | default now() | |
**PK:** (`deck_id`, `term_id`)
---
## Auth Tables (managed by Better Auth)
Better Auth creates and owns these tables. Do not modify directly.
### `user`
| Column | Type | Notes |
| ---------------- | --------- | -------------------- |
| `id` | varchar | PK |
| `name` | varchar | Display name |
| `email` | varchar | |
| `email_verified` | boolean | |
| `image` | varchar | nullable, avatar URL |
| `created_at` | timestamp | |
| `updated_at` | timestamp | |
### `session`
| Column | Type | Notes |
| ------------ | --------- | ------------- |
| `id` | varchar | PK |
| `user_id` | varchar | FK → user.id |
| `token` | varchar | Session token |
| `expires_at` | timestamp | |
| `ip_address` | varchar | nullable |
| `user_agent` | text | nullable |
| `created_at` | timestamp | |
### `account` — Social provider links
| Column | Type | Notes |
| --------------- | --------- | -------------------- |
| `id` | varchar | PK |
| `user_id` | varchar | FK → user.id |
| `account_id` | varchar | Provider's user ID |
| `provider_id` | varchar | `google` or `github` |
| `access_token` | text | nullable |
| `refresh_token` | text | nullable |
| `id_token` | text | nullable |
| `expires_at` | timestamp | nullable |
**Note:** One user can have multiple accounts (Google + GitHub linked to same user).
### `verification`
Email verification tokens. Unused for social-only auth but managed by Better Auth.
---
## Lobby Tables (Multiplayer)
### `lobbies`
| Column | Type | Constraints | Notes |
| ------------- | --------- | ------------------------------------------- | -------------------------------------------- |
| `id` | uuid | PK | |
| `code` | varchar | UNIQUE | Human-readable room code (e.g. `WOLF-42`) |
| `host_id` | varchar | FK → user.id | |
| `status` | varchar | CHECK: `waiting`, `in_progress`, `finished` | |
| `max_players` | integer | default 4 | |
| `settings` | jsonb | nullable | Game mode, round count, timer duration, etc. |
| `created_at` | timestamp | default now() | |
| `updated_at` | timestamp | default now() | Used for stale recovery |
### `lobby_players`
| Column | Type | Constraints | Notes |
| -------------- | --------- | --------------- | ---------------------------- |
| `id` | uuid | PK | |
| `lobby_id` | uuid | FK → lobbies.id | |
| `user_id` | varchar | FK → user.id | |
| `display_name` | varchar | | Player's shown name in lobby |
| `is_host` | boolean | default false | |
| `joined_at` | timestamp | default now() | |
**Unique constraint:** (`lobby_id`, `user_id`) — one entry per player per lobby.
---
## Key Relationships
```
terms (1) ←──→ (N) translations
terms (1) ←──→ (N) term_glosses
terms (N) ←──→ (N) decks via deck_terms
user (1) ←──→ (N) sessions
user (1) ←──→ (N) accounts
user (1) ←──→ (N) lobbies (as host)
user (1) ←──→ (N) lobby_players
lobbies (1) ←──→ (N) lobby_players
```
---
## Query Patterns
### Get quiz terms (singleplayer)
```sql
SELECT t.id, t.pos, src.text AS source_text, tgt.text AS target_text, g.text AS gloss
FROM terms t
JOIN translations src ON src.term_id = t.id AND src.language_code = ?
JOIN translations tgt ON tgt.term_id = t.id AND tgt.language_code = ?
LEFT JOIN term_glosses g ON g.term_id = t.id AND g.language_code = ?
WHERE t.pos = ? AND tgt.cefr_level IN (?)
LIMIT ?
```
### Get distractors
```sql
SELECT text FROM translations
WHERE language_code = ? AND pos = ? AND cefr_level IN (?)
AND term_id != ? AND text != ?
ORDER BY RANDOM()
LIMIT 3
```
**Note:** This is the N+1 query mentioned in BACKLOG.md. Each question fetches 3 distractors separately. Batching is planned.
---
## Deferred Schema Extensions (Not Yet Implemented)
These tables are planned but do not exist yet. All are additive — they reference existing `terms` rows via FK.
| Table | Purpose | Trigger |
| --------------------- | ----------------------------------------------- | ----------------------- |
| `noun_forms` | Gender, singular, plural, articles per language | Grammar quiz mode |
| `verb_forms` | Conjugation tables per language | Grammar quiz mode |
| `term_pronunciations` | IPA + audio URLs per language | Pronunciation quiz mode |
| `user_decks` | Which decks a user studies | User customization |
| `user_term_progress` | Spaced repetition state per user/term/language | SRS review queue |
| `quiz_answers` | Answer history for stats/analytics | User stats dashboard |