updating docs

This commit is contained in:
lila 2026-04-05 19:28:53 +02:00
parent e80f291c41
commit c49c2fe2c3


@ -228,37 +228,6 @@ This is why `decks.source_language` is not just a technical detail — it is the
Same translation data underneath, correctly frequency-grounded per direction. Two wordlist files, two generation script runs.
### Decks: media metadata structure (post-MVP, options documented)
When the Media hierarchy is implemented, each media type (TV show, movie, book, song)
has different attributes. Three options considered:
**Option A: One table with nullable columns**
All media types in one table, type-specific columns nullable. Simple but becomes a sparse
matrix as media types grow.
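To make the sparsity concrete, a minimal sketch of Option A as one wide row type, with a helper that flags which type-specific columns a row left null. Column names here are illustrative guesses, not the real schema:

```ts
type MediaType = "tv" | "movie" | "book" | "song";

// Option A: every type-specific column lives on one row type, nullable.
interface MediaMetadata {
  deck_id: number;
  media_type: MediaType;
  title: string;
  season: number | null;   // tv only
  episode: number | null;  // tv only
  author: string | null;   // book only
  artist: string | null;   // song only
  album: string | null;    // song only
  year: number | null;     // movie / book / song
}

// Columns that should be filled for this row's media type (illustrative).
const requiredByType: Record<MediaType, (keyof MediaMetadata)[]> = {
  tv: ["season", "episode"],
  movie: ["year"],
  book: ["author", "year"],
  song: ["artist", "album", "year"],
};

// Type-specific columns expected for this row but left null.
function missingColumns(row: MediaMetadata): (keyof MediaMetadata)[] {
  return requiredByType[row.media_type].filter((col) => row[col] === null);
}
```

Every new media type widens the row and lengthens the null-heavy tail for all existing rows, which is the sparse-matrix cost the option description notes.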
**Option B: Separate table per media type**
```ts
tv_metadata: deck_id, title, season, episode
movie_metadata: deck_id, title, year
book_metadata: deck_id, title, author, year
song_metadata: deck_id, title, artist, album, year
```
Each table has exactly the right columns. Clean and queryable, more tables to maintain.
**Option C: JSONB for flexible attributes**
```ts
media_metadata: deck_id, media_type, title, attributes jsonb
```
Type-specific fields in a JSON blob. No migration needed for new media types but
attributes are not schema-validated and harder to query.
**Current recommendation:** Option A to start (few media types initially, sparse
columns manageable), migrate to Option B if the number of media types grows.
Option C only if media types become numerous and unpredictable.
Decision deferred until Media is actually built.
### Terms: `synset_id` nullable (not NOT NULL)
**Problem:** non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. `NOT NULL` is too strict.
@ -401,7 +370,7 @@ Too expensive at scale — only viable for small curated additions on top of an
## Current State
Phase 0 complete. Phase 1 data pipeline complete.
Phase 0 complete. Phase 1 data pipeline complete. Phase 2 data model finalized and migrated.
### Completed (Phase 1 — data pipeline)
@ -417,19 +386,26 @@ Phase 0 complete. Phase 1 data pipeline complete.
- creates deck if it doesn't exist, adds only missing terms on subsequent runs
- recalculates and persists `validated_languages` on every run
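The idempotent behaviour above can be sketched as two pure steps: diff the resolved term IDs against what the deck already holds, and recompute validated languages from per-language coverage. This is a sketch, not the actual generation script, and it assumes full coverage is the validation criterion (consistent with the Italian 3171/3171 case below):

```ts
// Only term IDs not already in the deck are added on a re-run.
function termsToAdd(existingTermIds: Set<number>, resolvedTermIds: number[]): number[] {
  return resolvedTermIds.filter((id) => !existingTermIds.has(id));
}

// A language is validated when every deck term has a translation in it
// (assumption: full coverage is the criterion).
function validatedLanguages(
  coverage: Map<string, { translated: number; total: number }>,
): string[] {
  return [...coverage.entries()]
    .filter(([, c]) => c.translated === c.total)
    .map(([lang]) => lang)
    .sort();
}
```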
### Known data facts
### Completed (Phase 2 — data model)
- [x] `synset_id` removed, replaced by `source` + `source_id` on `terms`
- [x] `cefr_level` added to `translations` (not `terms` — difficulty is language-relative)
- [x] `language_code` CHECK constraint added to `translations` and `term_glosses`
- [x] `language_pairs` table dropped — pairs derived from decks at query time
- [x] `is_public` and `added_at` dropped from `decks` and `deck_terms`
- [x] `type` added to `decks` with CHECK against `SUPPORTED_DECK_TYPES`
- [x] `topics` and `term_topics` tables added (empty for MVP)
- [x] Migration generated and run against fresh database
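With `language_pairs` dropped, pair derivation can be sketched as a distinct-pairs pass over deck rows. `target_language` is an assumed column name (only `decks.source_language` is confirmed above), so this is illustrative of the query-time approach rather than the real query:

```ts
// Assumed deck shape; target_language is a hypothetical column name.
interface DeckRow {
  source_language: string;
  target_language: string;
}

// Distinct (source, target) pairs, derived at query time instead of
// being stored in a language_pairs table.
function derivePairs(decks: DeckRow[]): [string, string][] {
  const seen = new Set<string>();
  const pairs: [string, string][] = [];
  for (const d of decks) {
    const key = `${d.source_language}:${d.target_language}`;
    if (!seen.has(key)) {
      seen.add(key);
      pairs.push([d.source_language, d.target_language]);
    }
  }
  return pairs;
}
```

In SQL terms this is just `SELECT DISTINCT` over the two columns, which is why the dedicated table was redundant.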
### Known data facts (pre-wipe, for reference)
- Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate)
- Term IDs resolved: 3171 (higher than word count due to homonyms)
- Words not found in DB: 34
- Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`
### Next (Phase 2 — data model + pipeline)
### Next (Phase 3 — data pipeline + API)
Roadmap to API implementation:
1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `source` + `source_id` to `terms`, add `cefr_level` to `translations`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses`, drop `language_pairs`
2. **Write and run migrations** — schema changes before any data expansion
3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
5. **Generate decks** — run generation script with SUBTLEX-grounded wordlists per source language
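One possible shape for the mapping in step 4 is a rank-cutoff table. The cutoffs below are invented placeholders to show the mechanism; the actual thresholds (and whether a simple cutoff table is even the right strategy) are exactly the open decision:

```ts
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

// Placeholder cutoffs: a word ranked at or below the first matching
// threshold gets that band. These numbers are illustrative only.
const RANK_CUTOFFS: [number, CefrLevel][] = [
  [500, "A1"],
  [1000, "A2"],
  [2000, "B1"],
  [4000, "B2"],
  [8000, "C1"],
];

function cefrForRank(rank: number): CefrLevel {
  for (const [maxRank, level] of RANK_CUTOFFS) {
    if (rank <= maxRank) return level;
  }
  return "C2"; // everything rarer than the last cutoff
}
```

A cutoff table keeps the mapping per-language (each source language gets its own SUBTLEX ranks), which matches the earlier point that difficulty is language-relative.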