refactoring data model
This commit is contained in: parent b16b5db3f7, commit e80f291c41
3 changed files with 95 additions and 64 deletions
@ -228,6 +228,37 @@ This is why `decks.source_language` is not just a technical detail — it is the
Same translation data underneath, correctly frequency-grounded per direction. Two wordlist files, two generation script runs.
### Decks: media metadata structure (post-MVP, options documented)

When the Media hierarchy is implemented, each media type (TV show, movie, book, song) has different attributes. Three options considered:

**Option A: One table with nullable columns**

All media types in one table, type-specific columns nullable. Simple but becomes a sparse matrix as media types grow.

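In the same shorthand as the other option sketches in this section, Option A would collapse everything into one wide table whose columns are the union of all per-type attributes (column names illustrative):

```ts
media_metadata: deck_id, media_type, title, season, episode, author, artist, album, year
```

Everything past `title` would be nullable, which is the sparse-matrix cost noted above.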
**Option B: Separate table per media type**

```ts
tv_metadata: deck_id, title, season, episode
movie_metadata: deck_id, title, year
book_metadata: deck_id, title, author, year
song_metadata: deck_id, title, artist, album, year
```

Each table has exactly the right columns. Clean and queryable, but more tables to maintain.
**Option C: JSONB for flexible attributes**

```ts
media_metadata: deck_id, media_type, title, attributes jsonb
```

Type-specific fields in a JSON blob. No migration needed for new media types, but attributes are not schema-validated and are harder to query.

**Current recommendation:** Option A to start (few media types initially, sparse columns manageable); migrate to Option B if the number of media types grows. Option C only if media types become numerous and unpredictable.

Decision deferred until Media is actually built.
### Terms: `synset_id` nullable (not NOT NULL)
**Problem:** non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. `NOT NULL` is too strict.
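As a sketch, the relaxation itself is a one-line migration (assuming the column exists today as `NOT NULL`):

```sql
-- Let non-WordNet terms (custom words, Wiktionary-sourced entries) omit a synset.
ALTER TABLE terms ALTER COLUMN synset_id DROP NOT NULL;
```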
@ -254,9 +285,27 @@ Postgres allows multiple `NULL` pairs under a unique constraint, so manual entri
No CHECK constraint on `source` — it is only written by controlled import scripts, not user input. A free varchar is sufficient.

### Translations: `cefr_level` column (deferred population, not on `terms`)

CEFR difficulty is language-relative, not concept-relative. "House" in English is A1, "domicile" is also English but B2 — same concept, different words, different difficulty. Moving `cefr_level` to `translations` allows each language's word to have its own level independently.
Added as nullable `varchar(2)` with CHECK constraint against `CEFR_LEVELS` (`A1`–`C2`) on the `translations` table. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Also included in the `translations` index since the quiz query filters on it:

```ts
index("idx_translations_lang").on(table.language_code, table.cefr_level, table.term_id)
```
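The column itself could be sketched in Drizzle like this (illustrative only; the CHECK against `CEFR_LEVELS` would live in the table definition or the migration SQL):

```ts
// Nullable by default in Drizzle (no .notNull());
// CHECK (cefr_level IN ('A1','A2','B1','B2','C1','C2')) added alongside.
cefr_level: varchar("cefr_level", { length: 2 }),
```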
### `language_pairs` table: dropped
Valid language pairs are already implicitly defined by `decks.source_language` + `decks.validated_languages`. The table was redundant — the same information can be derived directly from decks:
```sql
SELECT DISTINCT source_language, unnest(validated_languages) AS target_language
FROM decks
WHERE validated_languages != '{}'
```
The only thing `language_pairs` added was an `active` flag to manually disable a direction. This is an edge case not needed for MVP. Dropped to remove a maintenance surface that required staying in sync with deck data.
### Schema: `categories` + `term_categories` (empty for MVP)
@ -379,7 +428,7 @@ Phase 0 complete. Phase 1 data pipeline complete.
Roadmap to API implementation:
1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `source` + `source_id` to `terms`, add `cefr_level` to `translations`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses`, drop `language_pairs`
2. **Write and run migrations** — schema changes before any data expansion
3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
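One possible shape for that mapping, with placeholder rank cutoffs that are assumptions, not decided values:

```typescript
// Sketch: map a SUBTLEX frequency rank to a CEFR band.
// The cutoffs are illustrative placeholders, not decided values.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;
type CefrLevel = (typeof CEFR_LEVELS)[number];

// Most frequent words map to the lowest bands.
const RANK_CUTOFFS: Array<[number, CefrLevel]> = [
  [500, "A1"],
  [1500, "A2"],
  [3500, "B1"],
  [7000, "B2"],
  [15000, "C1"],
];

function cefrFromRank(rank: number): CefrLevel {
  for (const [maxRank, level] of RANK_CUTOFFS) {
    if (rank <= maxRank) return level;
  }
  return "C2"; // rarer than every cutoff: top band
}
```

Whatever the final cutoffs, the mapping stays a pure rank-to-band lookup, so it can run at import time when `cefr_level` is populated.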