updating docs

This commit is contained in:
lila 2026-04-05 19:28:53 +02:00
parent e80f291c41
commit c49c2fe2c3


@ -228,37 +228,6 @@ This is why `decks.source_language` is not just a technical detail — it is the
Same translation data underneath, correctly frequency-grounded per direction. Two wordlist files, two generation script runs.
### Decks: media metadata structure (post-MVP, options documented)
When the Media hierarchy is implemented, each media type (TV show, movie, book, song)
has different attributes. Three options considered:
**Option A: One table with nullable columns**
All media types in one table, type-specific columns nullable. Simple but becomes a sparse
matrix as media types grow.
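To make the sparsity concrete, a minimal sketch of Option A as one wide row type, with a helper that flags which type-specific columns a row left null. Column names here are illustrative guesses, not the real schema:

```ts
type MediaType = "tv" | "movie" | "book" | "song";

// Option A: every type-specific column lives on one row type, nullable.
interface MediaMetadata {
  deck_id: number;
  media_type: MediaType;
  title: string;
  season: number | null;   // tv only
  episode: number | null;  // tv only
  author: string | null;   // book only
  artist: string | null;   // song only
  album: string | null;    // song only
  year: number | null;     // movie / book / song
}

// Columns that should be filled for this row's media type (illustrative).
const requiredByType: Record<MediaType, (keyof MediaMetadata)[]> = {
  tv: ["season", "episode"],
  movie: ["year"],
  book: ["author", "year"],
  song: ["artist", "album", "year"],
};

// Type-specific columns expected for this row but left null.
function missingColumns(row: MediaMetadata): (keyof MediaMetadata)[] {
  return requiredByType[row.media_type].filter((col) => row[col] === null);
}
```

Every new media type widens the row and lengthens the null-heavy tail for all existing rows, which is the sparse-matrix cost the option description notes.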
**Option B: Separate table per media type**
```ts
tv_metadata: deck_id, title, season, episode
movie_metadata: deck_id, title, year
book_metadata: deck_id, title, author, year
song_metadata: deck_id, title, artist, album, year
```
Each table has exactly the right columns. Clean and queryable, more tables to maintain.
**Option C: JSONB for flexible attributes**
```ts
media_metadata: deck_id, media_type, title, attributes jsonb
```
Type-specific fields in a JSON blob. No migration needed for new media types but
attributes are not schema-validated and harder to query.
**Current recommendation:** Option A to start (few media types initially, sparse
columns manageable), migrate to Option B if the number of media types grows.
Option C only if media types become numerous and unpredictable.
Decision deferred until Media is actually built.
### Terms: `synset_id` nullable (not NOT NULL)
**Problem:** non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. `NOT NULL` is too strict.
@ -401,7 +370,7 @@ Too expensive at scale — only viable for small curated additions on top of an
## Current State
Phase 0 complete. Phase 1 data pipeline complete.
Phase 0 complete. Phase 1 data pipeline complete. Phase 2 data model finalized and migrated.
### Completed (Phase 1 — data pipeline)
@ -417,19 +386,26 @@ Phase 0 complete. Phase 1 data pipeline complete.
- creates deck if it doesn't exist, adds only missing terms on subsequent runs
- recalculates and persists `validated_languages` on every run
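The idempotent behaviour above can be sketched as two pure steps: diff the resolved term IDs against what the deck already holds, and recompute validated languages from per-language coverage. This is a sketch, not the actual generation script, and it assumes full coverage is the validation criterion (consistent with the Italian 3171/3171 case below):

```ts
// Only term IDs not already in the deck are added on a re-run.
function termsToAdd(existingTermIds: Set<number>, resolvedTermIds: number[]): number[] {
  return resolvedTermIds.filter((id) => !existingTermIds.has(id));
}

// A language is validated when every deck term has a translation in it
// (assumption: full coverage is the criterion).
function validatedLanguages(
  coverage: Map<string, { translated: number; total: number }>,
): string[] {
  return [...coverage.entries()]
    .filter(([, c]) => c.translated === c.total)
    .map(([lang]) => lang)
    .sort();
}
```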
### Known data facts
### Completed (Phase 2 — data model)
- [x] `synset_id` removed, replaced by `source` + `source_id` on `terms`
- [x] `cefr_level` added to `translations` (not `terms` — difficulty is language-relative)
- [x] `language_code` CHECK constraint added to `translations` and `term_glosses`
- [x] `language_pairs` table dropped — pairs derived from decks at query time
- [x] `is_public` and `added_at` dropped from `decks` and `deck_terms`
- [x] `type` added to `decks` with CHECK against `SUPPORTED_DECK_TYPES`
- [x] `topics` and `term_topics` tables added (empty for MVP)
- [x] Migration generated and run against fresh database
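With `language_pairs` dropped, pair derivation can be sketched as a distinct-pairs pass over deck rows. `target_language` is an assumed column name (only `decks.source_language` is confirmed above), so this is illustrative of the query-time approach rather than the real query:

```ts
// Assumed deck shape; target_language is a hypothetical column name.
interface DeckRow {
  source_language: string;
  target_language: string;
}

// Distinct (source, target) pairs, derived at query time instead of
// being stored in a language_pairs table.
function derivePairs(decks: DeckRow[]): [string, string][] {
  const seen = new Set<string>();
  const pairs: [string, string][] = [];
  for (const d of decks) {
    const key = `${d.source_language}:${d.target_language}`;
    if (!seen.has(key)) {
      seen.add(key);
      pairs.push([d.source_language, d.target_language]);
    }
  }
  return pairs;
}
```

In SQL terms this is just `SELECT DISTINCT` over the two columns, which is why the dedicated table was redundant.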
### Known data facts (pre-wipe, for reference)
- Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate)
- Term IDs resolved: 3171 (higher than word count due to homonyms)
- Words not found in DB: 34
- Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`
### Next (Phase 2 — data model + pipeline)
### Next (Phase 3 — data pipeline + API)
Roadmap to API implementation:
1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `source` + `source_id` to `terms`, add `cefr_level` to `translations`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses`, drop `language_pairs`
2. **Write and run migrations** — schema changes before any data expansion
3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
5. **Generate decks** — run generation script with SUBTLEX-grounded wordlists per source language
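One possible shape for the mapping in step 4 is a rank-cutoff table. The cutoffs below are invented placeholders to show the mechanism; the actual thresholds (and whether a simple cutoff table is even the right strategy) are exactly the open decision:

```ts
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

// Placeholder cutoffs: a word ranked at or below the first matching
// threshold gets that band. These numbers are illustrative only.
const RANK_CUTOFFS: [number, CefrLevel][] = [
  [500, "A1"],
  [1000, "A2"],
  [2000, "B1"],
  [4000, "B2"],
  [8000, "C1"],
];

function cefrForRank(rank: number): CefrLevel {
  for (const [maxRank, level] of RANK_CUTOFFS) {
    if (rank <= maxRank) return level;
  }
  return "C2"; // everything rarer than the last cutoff
}
```

A cutoff table keeps the mapping per-language (each source language gets its own SUBTLEX ranks), which matches the earlier point that difficulty is language-relative.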