updating documentation

lila 2026-04-01 18:02:12 +02:00
parent 3bb8bfdb39
commit b0c0baf9ab


@@ -178,60 +178,55 @@ Most tables omit `updated_at` (unnecessary for MVP). `rooms.updated_at` is kept
Allows multiple synonyms per language per term (e.g. "dog", "hound" for same synset). Prevents exact duplicate rows. Homonyms (e.g. "Lead" metal vs. "Lead" guide) are handled by different `term_id` values (different synsets), so no constraint conflict.
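
The rule above can be sketched as an in-memory mirror of the constraint; the row shape `(termId, languageCode, word)` is an assumption for illustration, not the actual schema columns:

```typescript
// Sketch of the uniqueness rule on translations. Column names are
// illustrative; the real constraint lives in the DB schema.
interface TranslationRow { termId: number; languageCode: string; word: string }

// Synonyms (same termId + language, different word) and homonyms (same
// word, different termId) both pass; only an exact repeat is rejected.
function insertUnique(rows: TranslationRow[], row: TranslationRow): boolean {
  const key = (r: TranslationRow) => `${r.termId}|${r.languageCode}|${r.word}`;
  if (rows.some(r => key(r) === key(row))) return false; // exact duplicate row
  rows.push(row);
  return true;
}
```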
### Decks: `source_language` + `validated_languages` (not `pair_id`)

**Original approach:** `decks.pair_id` references `language_pairs`, tying each deck to a single language pair.

**Problem:** One deck can serve multiple target languages as long as translations exist for all its terms. A `pair_id` FK would require duplicating the deck for each target language.

**Decision:**
- `decks.source_language` — the language the wordlist was curated from (e.g. `"en"`). A deck sourced from an English frequency list is fundamentally different from one sourced from an Italian list.
- `decks.validated_languages` — array of language codes (excluding `source_language`) for which full translation coverage exists across all terms in the deck. Recalculated and updated on every run of the generation script.
- The language pair used for a quiz session is determined at session start, not at deck creation time.

**Benefits:**
- One deck serves multiple target languages (e.g. en→it and en→fr) without duplication
- `validated_languages` stays accurate as translation data grows
- The DB enforces via CHECK constraint that `source_language` is never included in `validated_languages`
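
The `validated_languages` invariant can be mirrored at the application layer; a minimal sketch, assuming the column names described in this section (the SQL form in the comment is an assumed shape, not the actual migration):

```typescript
// Application-side mirror of the DB rule that `source_language` must
// never appear in `validated_languages`. In Postgres the constraint
// could look like (assumed form, not the actual migration):
//   CHECK (NOT (source_language = ANY (validated_languages)))
function isValidDeckLanguages(sourceLanguage: string, validatedLanguages: string[]): boolean {
  return !validatedLanguages.includes(sourceLanguage);
}
```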
### Decks separate from terms (not frequency_rank filtering)
**Original approach:** Store `frequency_rank` on `terms` table and filter by rank range for difficulty.
**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:
- Rank 1: "In" → "indio" (chemical symbol: Indium)
- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium)
- Rank 7: "He" → "elio" (chemical symbol: Helium)
These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning.
**Decision:**
- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering)
- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
- `deck_terms` junction table links terms to decks with position ordering
- `rooms.deck_id` specifies which vocabulary deck a game uses
**Benefits:**
- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
- Bad data (chemical symbols, obscure words) excluded at deck level, not schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes
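
The terms/decks/`deck_terms` split above can be sketched as an in-memory join; the shapes are assumptions for illustration, not the actual Drizzle schema:

```typescript
// Illustrative in-memory mirror of the terms / decks / deck_terms split.
interface Term { id: number; word: string }
interface DeckTerm { deckId: number; termId: number; position: number }

// Resolve a deck's vocabulary in curated order via the junction table,
// leaving the raw terms table untouched by any difficulty filtering.
function getDeckTerms(deckId: number, deckTerms: DeckTerm[], terms: Map<number, Term>): Term[] {
  return deckTerms
    .filter(dt => dt.deckId === deckId)
    .sort((a, b) => a.position - b.position)
    .map(dt => terms.get(dt.termId))
    .filter((t): t is Term => t !== undefined);
}
```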
---
## Current State
Phase 0 complete. Phase 1 data pipeline complete.

### Completed checkboxes (Phase 0)
- [x] Initialise pnpm workspace monorepo: `apps/web`, `apps/api`, `packages/shared`, `packages/db`
- [x] Configure TypeScript project references across packages
- [x] Set up ESLint + Prettier with shared configs in root
- [x] Set up Vitest in `api` and `web` and both packages
- [x] Scaffold Express app with `GET /api/health`
- [x] Scaffold Vite + React app with TanStack Router (single root route)
- [x] Configure Drizzle ORM + connection to local PostgreSQL
- [x] Write first migration (empty — just validates the pipeline works)
- [x] `docker-compose.yml` for local dev: `api`, `web`, `postgres`, `valkey`
- [x] `.env.example` files for `apps/api` and `apps/web`
- [x] Update `decisions.md`
### Completed (Phase 1 — data pipeline)
- [x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json`
- [x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
- [x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
- [x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
- [x] Write `packages/db/src/generating-decks.ts` — idempotent deck generation script
  - reads and deduplicates source wordlist
  - matches words to DB terms (homonyms included)
  - writes unmatched words to `-missing` file
  - determines `validated_languages` by checking full translation coverage per language
  - creates deck if it doesn't exist, adds only missing terms on subsequent runs
  - recalculates and persists `validated_languages` on every run
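
The coverage step above could be sketched like this; the data shapes are assumptions for illustration, not the script's actual types:

```typescript
// Determine which target languages have full translation coverage for a
// deck. `coverage` maps termId -> set of language codes that have at
// least one translation for that term (assumed shape for this sketch).
function computeValidatedLanguages(
  deckTermIds: number[],
  coverage: Map<number, Set<string>>,
  candidateLanguages: string[],
  sourceLanguage: string,
): string[] {
  return candidateLanguages.filter(
    lang =>
      lang !== sourceLanguage && // the source side is never "validated"
      deckTermIds.every(id => coverage.get(id)?.has(lang) ?? false),
  );
}
```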
### Known data facts
- Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate)
- Term IDs resolved: 3171 (higher than word count due to homonyms)
- Words not found in DB: 34
- Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`
### Next (Phase 1 — API layer)
- [ ] Define Zod response schemas in `packages/shared`
- [ ] Implement `DeckRepository.getTerms(deckId, limit, offset)`
- [ ] Implement `QuizService.attachDistractors(terms)`
- [ ] Implement `GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms` endpoints
- [ ] Unit tests for `QuizService`
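
A possible shape for `QuizService.attachDistractors` — purely illustrative, since the real signature is still to be defined; the deterministic "first n others" selection is a sketch simplification (a real service would likely randomize):

```typescript
interface QuizTerm { id: number; word: string; answer: string }
interface QuizQuestion extends QuizTerm { distractors: string[] }

// Attach up to n wrong options per term, drawn from the other terms'
// answers in the same batch; never reuse the correct answer.
function attachDistractors(terms: QuizTerm[], n = 3): QuizQuestion[] {
  return terms.map(term => ({
    ...term,
    distractors: terms
      .filter(t => t.id !== term.id && t.answer !== term.answer)
      .slice(0, n)
      .map(t => t.answer),
  }));
}
```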