From c49c2fe2c30c6bb64b5b64a8e8b417539a5a6611 Mon Sep 17 00:00:00 2001
From: lila
Date: Sun, 5 Apr 2026 19:28:53 +0200
Subject: [PATCH] updating docs

---
 documentation/decisions.md | 52 ++++++++++----------------------------
 1 file changed, 14 insertions(+), 38 deletions(-)

diff --git a/documentation/decisions.md b/documentation/decisions.md
index 75e2d52..0fc1244 100644
--- a/documentation/decisions.md
+++ b/documentation/decisions.md
@@ -228,37 +228,6 @@ This is why `decks.source_language` is not just a technical detail — it is the
 Same translation data underneath, correctly frequency-grounded per direction.
 Two wordlist files, two generation script runs.

-### Decks: media metadata structure (post-MVP, options documented)
-
-When the Media hierarchy is implemented, each media type (TV show, movie, book, song)
-has different attributes. Three options considered:
-
-**Option A: One table with nullable columns**
-All media types in one table, type-specific columns nullable. Simple but becomes a sparse
-matrix as media types grow.
-
-**Option B: Separate table per media type**
-```ts
-tv_metadata: deck_id, title, season, episode
-movie_metadata: deck_id, title, year
-book_metadata: deck_id, title, author, year
-song_metadata: deck_id, title, artist, album, year
-```
-Each table has exactly the right columns. Clean and queryable, more tables to maintain.
-
-**Option C: JSONB for flexible attributes**
-```ts
-media_metadata: deck_id, media_type, title, attributes jsonb
-```
-Type-specific fields in a JSON blob. No migration needed for new media types but
-attributes are not schema-validated and harder to query.
-
-**Current recommendation:** Option A to start (few media types initially, sparse
-columns manageable), migrate to Option B if the number of media types grows.
-Option C only if media types become numerous and unpredictable.
-
-Decision deferred until Media is actually built.
-
 ### Terms: `synset_id` nullable (not NOT NULL)

 **Problem:** non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. `NOT NULL` is too strict.
@@ -401,7 +370,7 @@ Too expensive at scale — only viable for small curated additions on top of an

 ## Current State

-Phase 0 complete. Phase 1 data pipeline complete.
+Phase 0 complete. Phase 1 data pipeline complete. Phase 2 data model finalized and migrated.

 ### Completed (Phase 1 — data pipeline)

@@ -417,19 +386,26 @@ Phase 0 complete. Phase 1 data pipeline complete.
 - creates deck if it doesn't exist, adds only missing terms on subsequent runs
 - recalculates and persists `validated_languages` on every run

-### Known data facts
+### Completed (Phase 2 — data model)
+
+- [x] `synset_id` removed, replaced by `source` + `source_id` on `terms`
+- [x] `cefr_level` added to `translations` (not `terms` — difficulty is language-relative)
+- [x] `language_code` CHECK constraint added to `translations` and `term_glosses`
+- [x] `language_pairs` table dropped — pairs derived from decks at query time
+- [x] `is_public` and `added_at` dropped from `decks` and `deck_terms`
+- [x] `type` added to `decks` with CHECK against `SUPPORTED_DECK_TYPES`
+- [x] `topics` and `term_topics` tables added (empty for MVP)
+- [x] Migration generated and run against fresh database
+
+### Known data facts (pre-wipe, for reference)

 - Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate)
 - Term IDs resolved: 3171 (higher than word count due to homonyms)
 - Words not found in DB: 34
 - Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`

-### Next (Phase 2 — data model + pipeline)
+### Next (Phase 3 — data pipeline + API)

-Roadmap to API implementation:
-
-1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `source` + `source_id` to `terms`, add `cefr_level` to `translations`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses`, drop `language_pairs`
-2. **Write and run migrations** — schema changes before any data expansion
 3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
 4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
 5. **Generate decks** — run generation script with SUBTLEX-grounded wordlists per source language