updating documentation
This commit is contained in:
parent
3bb8bfdb39
commit
b0c0baf9ab
1 changed file with 35 additions and 40 deletions
@@ -178,60 +178,55 @@ Most tables omit `updated_at` (unnecessary for MVP). `rooms.updated_at` is kept
Allows multiple synonyms per language per term (e.g. "dog" and "hound" for the same synset). Prevents exact duplicate rows. Homonyms (e.g. "lead" the metal vs. "lead" meaning to guide) are handled by different `term_id` values (different synsets), so there is no constraint conflict.
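
A minimal sketch of how that uniqueness rule behaves, assuming a composite unique key on `(term_id, lang, text)` (the in-memory model and names here are illustrative, not the actual schema):

```typescript
// Illustrative model of a UNIQUE (term_id, lang, text) constraint:
// exact duplicates are rejected; synonyms and homonyms are not.
type TranslationRow = { termId: number; lang: string; text: string };

const key = (r: TranslationRow) => `${r.termId}|${r.lang}|${r.text}`;

function insertUnique(rows: TranslationRow[], row: TranslationRow): boolean {
  const seen = new Set(rows.map(key));
  if (seen.has(key(row))) return false; // exact duplicate row
  rows.push(row);
  return true;
}

const rows: TranslationRow[] = [];
insertUnique(rows, { termId: 1, lang: "en", text: "dog" });   // ok
insertUnique(rows, { termId: 1, lang: "en", text: "hound" }); // ok: synonym, same synset
insertUnique(rows, { termId: 2, lang: "en", text: "lead" });  // ok
insertUnique(rows, { termId: 3, lang: "en", text: "lead" });  // ok: homonym, different term_id
insertUnique(rows, { termId: 1, lang: "en", text: "dog" });   // rejected: exact duplicate
```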

### Decks: `source_language` + `validated_languages` (not `pair_id`)

**Original approach:** `decks.pair_id` referenced `language_pairs`, tying each deck to a single language pair. The column was left nullable to accommodate single-language decks (e.g. "English Grammar"), multi-pair decks (e.g. "Cognates" spanning EN-IT and EN-FR), and system decks (created by the app, not tied to a specific user).

**Problem:** One deck can serve multiple target languages as long as translations exist for all its terms. A `pair_id` FK would require duplicating the deck for each target language.

**Decision:**

- `decks.source_language` — the language the wordlist was curated from (e.g. `"en"`). A deck sourced from an English frequency list is fundamentally different from one sourced from an Italian list.
- `decks.validated_languages` — array of language codes (excluding `source_language`) for which full translation coverage exists across all terms in the deck. Recalculated and updated on every run of the generation script.
- The language pair used for a quiz session is determined at session start, not at deck creation time.

**Benefits:**

- One deck serves multiple target languages (e.g. en→it and en→fr) without duplication
- `validated_languages` stays accurate as translation data grows
- DB enforces via CHECK constraint that `source_language` is never included in `validated_languages`

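The coverage check behind `validated_languages` can be sketched as a pure function (the `DeckTerm` shape and function name are assumptions for illustration, not the actual generation-script API):

```typescript
// Hypothetical shape: each deck term carries its translations per language code.
type DeckTerm = { termId: number; translations: Record<string, string[]> };

// A language is "validated" when every term in the deck has at least one
// translation in it; the source language itself is always excluded
// (mirroring the CHECK constraint described above).
function computeValidatedLanguages(sourceLanguage: string, terms: DeckTerm[]): string[] {
  const candidates = new Set<string>();
  for (const t of terms) for (const lang of Object.keys(t.translations)) candidates.add(lang);
  candidates.delete(sourceLanguage);
  return [...candidates]
    .filter((lang) => terms.every((t) => (t.translations[lang] ?? []).length > 0))
    .sort();
}

const terms: DeckTerm[] = [
  { termId: 1, translations: { en: ["dog"], it: ["cane"], fr: ["chien"] } },
  { termId: 2, translations: { en: ["house"], it: ["casa"] } }, // no French coverage
];
console.log(computeValidatedLanguages("en", terms)); // → ["it"]
```

Because the result is recomputed from scratch on each run, growing translation data can only add languages, never leave stale ones behind.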

### Decks separate from terms (not `frequency_rank` filtering)

**Original approach:** Store `frequency_rank` on the `terms` table and filter by rank range for difficulty.

**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:

- Rank 1: "In" → "indio" (chemical symbol: Indium)
- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium)
- Rank 7: "He" → "elio" (chemical symbol: Helium)

These are technically "common" in WordNet (every chemical element is a noun there) but useless for vocabulary learning.

**Decision:**

- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering)
- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
- `deck_terms` junction table links terms to decks with position ordering
- `rooms.deck_id` specifies which vocabulary deck a game uses
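
The relationships in the list above, sketched as TypeScript row shapes (field names are illustrative assumptions, not the actual Drizzle schema):

```typescript
// Illustrative row shapes for the tables described above (not the real schema).
interface Term { id: number; pos: string }                      // all OMW synsets, unfiltered
interface Deck { id: number; name: string; sourceLanguage: string; validatedLanguages: string[] }
interface DeckTermLink { deckId: number; termId: number; position: number } // junction with ordering
interface Room { id: number; deckId: number }                   // a game picks its deck here

// Resolving a room's terms in deck order, as the junction table's
// position column allows:
function termsForRoom(room: Room, links: DeckTermLink[]): number[] {
  return links
    .filter((l) => l.deckId === room.deckId)
    .sort((a, b) => a.position - b.position)
    .map((l) => l.termId);
}
```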

**Benefits:**

- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
- Bad data (chemical symbols, obscure words) is excluded at the deck level, not the schema level
- Users can create custom decks later
- Multiple difficulty levels without schema changes

---
## Current State

Phase 0 complete. Phase 1 data pipeline complete.

### Completed (Phase 0)

- [x] Initialise pnpm workspace monorepo: `apps/web`, `apps/api`, `packages/shared`, `packages/db`
- [x] Configure TypeScript project references across packages
- [x] Set up ESLint + Prettier with shared configs in the root
- [x] Set up Vitest in the `api` and `web` apps and in both packages
- [x] Scaffold Express app with `GET /api/health`
- [x] Scaffold Vite + React app with TanStack Router (single root route)
- [x] Configure Drizzle ORM + connection to local PostgreSQL
- [x] Write the first migration (empty — just validates that the pipeline works)
- [x] `docker-compose.yml` for local dev: `api`, `web`, `postgres`, `valkey`
- [x] `.env.example` files for `apps/api` and `apps/web`
- [x] Update `decisions.md`

### Completed (Phase 1 — data pipeline)
- [x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json`
- [x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
- [x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
- [x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
- [x] Write `packages/db/src/generating-decks.ts` — idempotent deck generation script:
  - reads and deduplicates the source wordlist
  - matches words to DB terms (homonyms included)
  - writes unmatched words to a `-missing` file
  - determines `validated_languages` by checking full translation coverage per language
  - creates the deck if it doesn't exist; adds only missing terms on subsequent runs
  - recalculates and persists `validated_languages` on every run

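The steps above could be sketched like this, using a pure in-memory stand-in for the DB (names like `generateDeck` and the `Db` shape are illustrative, not the actual script's API; the `validated_languages` recalculation and `-missing` file write are reduced to return values):

```typescript
// In-memory stand-in for the idempotent deck-generation flow (illustrative).
type Db = { deckTerms: Map<string, Set<number>> }; // deck name → term ids already present

function generateDeck(
  db: Db,
  deckName: string,
  wordlist: string[],
  lookup: (word: string) => number[], // word → matching term ids (homonyms give several)
): { added: number; missing: string[] } {
  // 1. Deduplicate the source wordlist.
  const words = [...new Set(wordlist)];

  // 2. Match words to terms; collect unmatched words for the "-missing" file.
  const missing: string[] = [];
  const termIds = new Set<number>();
  for (const w of words) {
    const ids = lookup(w);
    if (ids.length === 0) missing.push(w);
    for (const id of ids) termIds.add(id);
  }

  // 3. Create the deck if absent; on reruns, add only terms not already present.
  const existing = db.deckTerms.get(deckName) ?? new Set<number>();
  let added = 0;
  for (const id of termIds) if (!existing.has(id)) { existing.add(id); added++; }
  db.deckTerms.set(deckName, existing);
  return { added, missing };
}

// Running twice adds nothing the second time — the idempotency the list describes.
const db: Db = { deckTerms: new Map() };
const lookup = (w: string) => (w === "lead" ? [2, 3] : w === "dog" ? [1] : []);
generateDeck(db, "demo", ["dog", "lead", "dog", "xyzzy"], lookup); // added: 3, missing: ["xyzzy"]
generateDeck(db, "demo", ["dog", "lead"], lookup);                 // added: 0
```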
### Known data facts

- Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate)
- Term IDs resolved: 3171 (higher than the word count because homonyms map one word to several terms)
- Words not found in the DB: 34
- Italian (`it`) coverage: 3171 / 3171 — full coverage, so `it` is included in `validated_languages`

### Next (Phase 1 — API layer)

- [ ] Define Zod response schemas in `packages/shared`
- [ ] Implement `DeckRepository.getTerms(deckId, limit, offset)`
- [ ] Implement `QuizService.attachDistractors(terms)`
- [ ] Implement the `GET /language-pairs`, `GET /decks`, and `GET /decks/:id/terms` endpoints
- [ ] Unit tests for `QuizService`
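
`QuizService.attachDistractors` is not specified here; one plausible sketch (the shapes, signature, and strategy are assumptions) draws each term's wrong answers from the other terms in the same batch:

```typescript
// Illustrative distractor attachment: each quiz term gets `count` wrong answers
// drawn from the translations of OTHER terms in the same batch.
type QuizTerm = { termId: number; answer: string };
type QuizQuestion = QuizTerm & { options: string[] };

function attachDistractors(terms: QuizTerm[], count = 3): QuizQuestion[] {
  return terms.map((term) => {
    const pool = terms.filter((t) => t.termId !== term.termId).map((t) => t.answer);
    // Fisher–Yates shuffle, then take `count` distractors.
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [pool[i], pool[j]] = [pool[j], pool[i]];
    }
    const options = [term.answer, ...pool.slice(0, count)];
    // Shuffle options too so the correct answer isn't always first.
    for (let i = options.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [options[i], options[j]] = [options[j], options[i]];
    }
    return { ...term, options };
  });
}
```

Drawing distractors from the same deck keeps them at a comparable difficulty level, which is one reason to implement this in the service layer rather than in SQL.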