updating documentation
This commit is contained in:
parent
7d80b20390
commit
bfc09180f1
1 changed files with 46 additions and 4 deletions
|
|
@ -240,9 +240,23 @@ Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are
|
|||
CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
|
||||
```
|
||||
|
||||
### Terms: `source` + `source_id` columns
|
||||
|
||||
Once multiple import pipelines exist (OMW today, Wiktionary later), `synset_id` alone is insufficient as an idempotency key — Wiktionary terms won't have a synset ID.
|
||||
|
||||
**Decision:** add `source` (varchar, e.g. `'omw'`, `'wiktionary'`, null for manual) and `source_id` (text, the pipeline's internal identifier) with a unique constraint on the pair:
|
||||
|
||||
```ts
|
||||
unique("unique_source_id").on(table.source, table.source_id)
|
||||
```
|
||||
|
||||
Postgres allows multiple `NULL` pairs under a unique constraint, so manual entries don't conflict. For existing OMW terms, backfill `source = 'omw'` and `source_id = synset_id`. `synset_id` remains for now to avoid pipeline churn — deprecate it during a future pipeline refactor.
|
||||
|
||||
No CHECK constraint on `source` — it is only written by controlled import scripts, not user input. A free varchar is sufficient.
|
||||
|
||||
### Terms: `cefr_level` column (deferred population)
|
||||
|
||||
Added as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`). Belongs on `terms`, not on `decks` — difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later.
|
||||
Added as nullable `varchar(2)` with CHECK constraint against `CEFR_LEVELS` (`A1`–`C2`). Belongs on `terms`, not `decks` — difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later.
|
||||
|
||||
### Schema: `categories` + `term_categories` (empty for MVP)
|
||||
|
||||
|
|
@ -253,7 +267,33 @@ categories: id, slug, label, created_at
|
|||
term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
|
||||
```
|
||||
|
||||
See open research section below for source options.
|
||||
See Open Research section for source options.
|
||||
|
||||
### Schema constraints: CHECK over pgEnum for extensible value sets
|
||||
|
||||
**Question:** use `pgEnum` for columns like `pos`, `cefr_level`, and `source` since the values are driven by TypeScript constants anyway?
|
||||
|
||||
**Decision:** no. Use CHECK constraints for any value set that will grow over time.
|
||||
|
||||
**Reason:** `ALTER TYPE enum_name ADD VALUE` in Postgres is non-transactional — it cannot be rolled back if a migration fails partway through, leaving the DB in a dirty state that requires manual intervention. CHECK constraints are fully transactional — if the migration fails it rolls back cleanly.
|
||||
|
||||
**Rule of thumb:** pgEnum is appropriate for truly static value sets that will never grow (e.g. `('pending', 'active', 'cancelled')` on an orders table). Any value set tied to a growing constant in the codebase (`SUPPORTED_POS`, `CEFR_LEVELS`, `SUPPORTED_LANGUAGE_CODES`) stays as a CHECK constraint.
|
||||
|
||||
### Schema constraints: `language_code` always CHECK-constrained
|
||||
|
||||
`language_code` columns on `translations` and `term_glosses` are constrained via CHECK against `SUPPORTED_LANGUAGE_CODES`, the same pattern used for `pos` and `cefr_level`.
|
||||
|
||||
**Reason:** unlike `source`, which is only written by controlled import scripts and failing silently is recoverable, `language_code` is a query-critical filter column. A typo (`'ita'` instead of `'it'`, `'en '` with a trailing space) would silently produce missing data in the UI — terms with no translation shown, glosses not displayed — which is harder to debug than a DB constraint violation.
|
||||
|
||||
**Rule:** any column that game queries filter on should be CHECK-constrained. Columns only used for internal bookkeeping (like `source`) can be left as free varchars.
|
||||
|
||||
### Schema: unique constraints make explicit FK indexes redundant
|
||||
|
||||
Postgres automatically creates an index to enforce a unique constraint. An explicit index on a column that is already the leading column of a unique constraint is redundant.
|
||||
|
||||
Example: `unique("unique_term_gloss").on(term_id, language_code, text)` already indexes `term_id` as the leading column. A separate `index("idx_term_glosses_term").on(term_id)` adds no value and was dropped.
|
||||
|
||||
**Rule:** before adding an explicit index, check whether an existing unique constraint already covers it.
|
||||
|
||||
### Future extensions: morphology and pronunciation (deferred, additive)
|
||||
|
||||
|
|
@ -335,9 +375,11 @@ Phase 0 complete. Phase 1 data pipeline complete.
|
|||
- Words not found in DB: 34
|
||||
- Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`
|
||||
|
||||
### Roadmap to API implementation
|
||||
### Next (Phase 2 — data model + pipeline)
|
||||
|
||||
1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `cefr_level` to `terms`, add `categories` + `term_categories` tables
|
||||
Roadmap to API implementation:
|
||||
|
||||
1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `source` + `source_id` + `cefr_level` to `terms`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses`
|
||||
2. **Write and run migrations** — schema changes before any data expansion
|
||||
3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
|
||||
4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue