updating docs
This commit is contained in: parent e80f291c41, commit c49c2fe2c3. 1 changed file with 14 additions and 38 deletions.
@@ -228,37 +228,6 @@ This is why `decks.source_language` is not just a technical detail — it is the
Same translation data underneath, correctly frequency-grounded per direction. Two wordlist files, two generation script runs.
### Decks: media metadata structure (post-MVP, options documented)
When the Media hierarchy is implemented, each media type (TV show, movie, book, song) has different attributes. Three options considered:
**Option A: One table with nullable columns**
All media types in one table, type-specific columns nullable. Simple, but becomes a sparse matrix as media types grow.
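The sparse-matrix tradeoff can be sketched as a single row type. This is a hypothetical TypeScript shape assuming the columns listed under the other options; none of these names are the actual schema.

```ts
// Hypothetical Option A row: one table, every type-specific column nullable.
type MediaType = "tv" | "movie" | "book" | "song";

interface MediaMetadataRow {
  deck_id: number;
  media_type: MediaType;
  title: string;
  season: number | null;  // tv only
  episode: number | null; // tv only
  year: number | null;    // movie / book / song
  author: string | null;  // book only
  artist: string | null;  // song only
  album: string | null;   // song only
}

// A movie row must still carry every other type's columns as nulls:
const movie: MediaMetadataRow = {
  deck_id: 1,
  media_type: "movie",
  title: "Example",
  season: null,
  episode: null,
  year: 1999,
  author: null,
  artist: null,
  album: null,
};
```

Five of the nine columns are null for a movie, and each new media type widens that gap.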
**Option B: Separate table per media type**
```ts
tv_metadata: deck_id, title, season, episode
movie_metadata: deck_id, title, year
book_metadata: deck_id, title, author, year
song_metadata: deck_id, title, artist, album, year
```
Each table has exactly the right columns. Clean and queryable, but more tables to maintain.
**Option C: JSONB for flexible attributes**
```ts
media_metadata: deck_id, media_type, title, attributes jsonb
```
Type-specific fields in a JSON blob. No migration needed for new media types, but attributes are not schema-validated and are harder to query.
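The "not schema-validated" drawback usually means validation moves into application code. A minimal sketch of what that might look like, assuming per-type required keys (the names here are illustrative, not a real attribute set):

```ts
// Hypothetical app-side check for Option C: the database cannot validate the
// JSONB blob, so each media_type needs its own required-attribute list.
type MediaType = "tv" | "movie" | "book" | "song";

const requiredAttributes: Record<MediaType, string[]> = {
  tv: ["season", "episode"],
  movie: ["year"],
  book: ["author", "year"],
  song: ["artist", "album", "year"],
};

// Returns the missing keys; an empty array means the blob is acceptable.
function missingKeys(
  mediaType: MediaType,
  attributes: Record<string, unknown>,
): string[] {
  return requiredAttributes[mediaType].filter((k) => !(k in attributes));
}
```

For example, `missingKeys("song", { artist: "X" })` reports `["album", "year"]`, a check the schema would have done for free under Option B.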
**Current recommendation:** Option A to start (few media types initially, sparse columns manageable), migrate to Option B if the number of media types grows. Option C only if media types become numerous and unpredictable.
Decision deferred until Media is actually built.
### Terms: `synset_id` nullable (not NOT NULL)
**Problem:** non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. `NOT NULL` is too strict.
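A minimal sketch of the resulting term shape, in hypothetical TypeScript (the field names and the example synset ID are illustrative):

```ts
// With synset_id nullable, WordNet-backed and custom terms share one shape.
interface TermRow {
  id: number;
  lemma: string;
  synset_id: string | null; // null for custom or Wiktionary-sourced terms
}

const wordnetTerm: TermRow = { id: 1, lemma: "dog", synset_id: "02084071-n" };
const customTerm: TermRow = { id: 2, lemma: "doomscroll", synset_id: null };
```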
@@ -401,7 +370,7 @@ Too expensive at scale — only viable for small curated additions on top of an
## Current State
-Phase 0 complete. Phase 1 data pipeline complete.
+Phase 0 complete. Phase 1 data pipeline complete. Phase 2 data model finalized and migrated.
### Completed (Phase 1 — data pipeline)
@@ -417,19 +386,26 @@ Phase 0 complete. Phase 1 data pipeline complete.
- creates deck if it doesn't exist, adds only missing terms on subsequent runs
- recalculates and persists `validated_languages` on every run
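The "adds only missing terms" behavior above is essentially a set difference. A sketch under assumed names (this is not the actual pipeline code):

```ts
// Idempotent-run sketch: diff the wordlist's resolved term IDs against what
// the deck already contains, and insert only the difference.
function missingTermIds(
  resolvedTermIds: number[],
  existingDeckTermIds: Set<number>,
): number[] {
  return resolvedTermIds.filter((id) => !existingDeckTermIds.has(id));
}

const firstRun = missingTermIds([1, 2, 3], new Set());           // inserts [1, 2, 3]
const secondRun = missingTermIds([1, 2, 3], new Set([1, 2, 3])); // inserts nothing
```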
-### Known data facts
+### Completed (Phase 2 — data model)
- [x] `synset_id` removed, replaced by `source` + `source_id` on `terms`
- [x] `cefr_level` added to `translations` (not `terms` — difficulty is language-relative)
- [x] `language_code` CHECK constraint added to `translations` and `term_glosses`
- [x] `language_pairs` table dropped — pairs derived from decks at query time
- [x] `is_public` and `added_at` dropped from `decks` and `deck_terms`
- [x] `type` added to `decks` with CHECK against `SUPPORTED_DECK_TYPES`
- [x] `topics` and `term_topics` tables added (empty for MVP)
- [x] Migration generated and run against fresh database
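For the dropped `language_pairs` table, the derive-at-query-time idea can be sketched as follows. Field names are assumptions based on columns mentioned elsewhere in this document (`source_language`, `validated_languages`), and in practice this would likely be a SQL query rather than application code.

```ts
// Derive distinct (source, target) pairs from deck rows instead of storing
// a language_pairs table.
interface DeckRow {
  source_language: string;
  validated_languages: string[];
}

function derivePairs(decks: DeckRow[]): [string, string][] {
  const seen = new Set<string>();
  const pairs: [string, string][] = [];
  for (const deck of decks) {
    for (const target of deck.validated_languages) {
      if (target === deck.source_language) continue; // skip self-pairs
      const key = `${deck.source_language}->${target}`;
      if (!seen.has(key)) {
        seen.add(key);
        pairs.push([deck.source_language, target]);
      }
    }
  }
  return pairs;
}

const pairs = derivePairs([
  { source_language: "en", validated_languages: ["it"] },
  { source_language: "en", validated_languages: ["it", "es"] },
]); // [["en", "it"], ["en", "es"]]
```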
### Known data facts (pre-wipe, for reference)
- Wordlist: 999 unique words after deduplication (1000 lines, 1 duplicate)
- Term IDs resolved: 3171 (higher than word count due to homonyms)
- Words not found in DB: 34
- Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`
-### Next (Phase 2 — data model + pipeline)
+### Next (Phase 3 — data pipeline + API)
Roadmap to API implementation:
1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `source` + `source_id` to `terms`, add `cefr_level` to `translations`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses`, drop `language_pairs`
2. **Write and run migrations** — schema changes before any data expansion
3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
5. **Generate decks** — run generation script with SUBTLEX-grounded wordlists per source language
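Step 4 is the one open design question in this list. As a strawman, a rank-band mapping might look like the following; the thresholds are invented placeholders, and choosing the real cut-offs is exactly what step 4 has to decide.

```ts
// Placeholder SUBTLEX-rank to CEFR-band mapping. The thresholds below are
// NOT decided anywhere in this document; they only illustrate the shape.
type CefrLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

function cefrFromRank(rank: number): CefrLevel {
  if (rank <= 500) return "A1";
  if (rank <= 1000) return "A2";
  if (rank <= 2000) return "B1";
  if (rank <= 4000) return "B2";
  if (rank <= 8000) return "C1";
  return "C2";
}
```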
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue