updating documentation

2026-04-05 01:21:18 +02:00 · 2026-04-05 01:21:18 +02:00 · bfc09180f1
commit bfc09180f1
parent 7d80b20390
1 changed files with 46 additions and 4 deletions
--- a/documentation/decisions.md
+++ b/documentation/decisions.md
@@ -240,9 +240,23 @@ Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are
 CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
 ```
 ### Terms: `source` + `source_id` columns
 Once multiple import pipelines exist (OMW today, Wiktionary later), `synset_id` alone is insufficient as an idempotency key — Wiktionary terms won't have a synset ID.
 **Decision:** add `source` (varchar, e.g. `'omw'`, `'wiktionary'`, null for manual) and `source_id` (text, the pipeline's internal identifier) with a unique constraint on the pair:
 ```ts
 unique("unique_source_id").on(table.source, table.source_id)
 ```
 Postgres allows multiple `NULL` pairs under a unique constraint, so manual entries don't conflict. For existing OMW terms, backfill `source = 'omw'` and `source_id = synset_id`. `synset_id` remains for now to avoid pipeline churn — deprecate it during a future pipeline refactor.
 No CHECK constraint on `source` — it is only written by controlled import scripts, not user input. A free varchar is sufficient.
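The NULL behaviour described above can be modelled in a few lines. This is only a sketch of Postgres's default unique-constraint semantics (NULLs are treated as distinct), not the project's import code; `TermKey`, `violatesUniqueSourceId`, and the sample ID are hypothetical names introduced for illustration.

```ts
// Hypothetical row shape; only `source` and `source_id` come from the decision above.
type TermKey = { source: string | null; source_id: string | null };

// Models the default Postgres rule for a unique constraint on (source, source_id):
// two rows conflict only when both columns are non-null and equal in both rows.
function violatesUniqueSourceId(existing: TermKey[], incoming: TermKey): boolean {
  if (incoming.source === null || incoming.source_id === null) {
    return false; // any NULL in the pair makes the row distinct from all others
  }
  return existing.some(
    (row) => row.source === incoming.source && row.source_id === incoming.source_id
  );
}

const rows: TermKey[] = [
  { source: "omw", source_id: "example-synset-1" }, // illustrative ID, not real data
  { source: null, source_id: null }, // manual entry
];

// Re-importing the same OMW term is rejected, which is what makes the pair an idempotency key.
const duplicateImport = violatesUniqueSourceId(rows, { source: "omw", source_id: "example-synset-1" });
// A second manual entry (NULL, NULL) does not conflict with the first.
const secondManual = violatesUniqueSourceId(rows, { source: null, source_id: null });
// The same identifier from a different pipeline is a different key.
const sameIdOtherSource = violatesUniqueSourceId(rows, { source: "wiktionary", source_id: "example-synset-1" });
```

In the database itself the constraint does this work; the sketch is only meant to make the three cases (duplicate import, repeated manual entry, same ID from another source) explicit.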
 ### Terms: `cefr_level` column (deferred population)
-Added as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`). Belongs on `terms`, not on `decks` — difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later.
+Added as nullable `varchar(2)` with CHECK constraint against `CEFR_LEVELS` (`A1`–`C2`). Belongs on `terms`, not `decks` — difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later.
 ### Schema: `categories` + `term_categories` (empty for MVP)
@@ -253,7 +267,33 @@ categories:      id, slug, label, created_at
 term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
 ```
-See open research section below for source options.
+See Open Research section for source options.
 ### Schema constraints: CHECK over pgEnum for extensible value sets
 **Question:** use `pgEnum` for columns like `pos`, `cefr_level`, and `source` since the values are driven by TypeScript constants anyway?
 **Decision:** no. Use CHECK constraints for any value set that will grow over time.
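 **Reason:** enum types are hard to evolve safely inside migrations. Before Postgres 12, `ALTER TYPE enum_name ADD VALUE` could not run inside a transaction block at all. From Postgres 12 onward it can, but the new value cannot be used until the transaction commits, and Postgres has no statement for dropping an enum value once it exists. A migration that adds a value outside a transaction and then fails partway through leaves the DB in a dirty state that requires manual intervention. CHECK constraints are fully transactional: if the migration fails, it rolls back cleanly.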
 **Rule of thumb:** pgEnum is appropriate for truly static value sets that will never grow (e.g. `('pending', 'active', 'cancelled')` on an orders table). Any value set tied to a growing constant in the codebase (`SUPPORTED_POS`, `CEFR_LEVELS`, `SUPPORTED_LANGUAGE_CODES`) stays as a CHECK constraint.
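One way to keep a CHECK constraint in step with a growing TypeScript constant is to derive the SQL from the constant. The sketch below assumes `CEFR_LEVELS` holds the six standard levels; `quoteLiteral` and `buildCheckClause` are hypothetical helpers, not part of Drizzle or of this codebase, and the column name is assumed to be a trusted identifier.

```ts
// Assumed contents of the constant referenced above (the six standard CEFR levels).
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;

// Quotes a value as a SQL string literal, doubling any embedded single quotes.
function quoteLiteral(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

// Builds a CHECK clause restricting `column` to the allowed values.
// `column` is interpolated as-is, so it must come from code, never from user input.
function buildCheckClause(column: string, allowed: readonly string[]): string {
  const list = allowed.map(quoteLiteral).join(", ");
  return `CHECK (${column} IN (${list}))`;
}

const cefrCheck = buildCheckClause("cefr_level", CEFR_LEVELS);
// cefrCheck === "CHECK (cefr_level IN ('A1', 'A2', 'B1', 'B2', 'C1', 'C2'))"
```

Because a CHECK expression that evaluates to NULL is treated as satisfied, a constraint of this shape still admits `NULL`, which suits a nullable column such as `cefr_level`. Growing the value set then means dropping and re-adding the constraint inside one transactional migration.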
 ### Schema constraints: `language_code` always CHECK-constrained
 `language_code` columns on `translations` and `term_glosses` are constrained via CHECK against `SUPPORTED_LANGUAGE_CODES`, the same pattern used for `pos` and `cefr_level`.
 **Reason:** unlike `source`, which is only written by controlled import scripts and where a silent failure is recoverable, `language_code` is a query-critical filter column. A typo (`'ita'` instead of `'it'`, or `'en '` with a trailing space) would silently produce missing data in the UI, such as terms shown with no translation or glosses not displayed. That is harder to debug than a DB constraint violation.
 **Rule:** any column that game queries filter on should be CHECK-constrained. Columns only used for internal bookkeeping (like `source`) can be left as free varchars.
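The same constant can also back an application-side guard, so a bad code is caught in the import script before the database rejects it. This is a sketch only: the contents of `SUPPORTED_LANGUAGE_CODES` are assumed from the two codes mentioned in this document, and `isSupportedLanguageCode` is a hypothetical helper.

```ts
// Assumed contents; the real constant may list more languages.
const SUPPORTED_LANGUAGE_CODES = ["en", "it"] as const;
type LanguageCode = (typeof SUPPORTED_LANGUAGE_CODES)[number];

// Type guard: exact match only, so no trimming or case folding is applied.
function isSupportedLanguageCode(value: string): value is LanguageCode {
  return SUPPORTED_LANGUAGE_CODES.some((code) => code === value);
}

// The two typos from the paragraph above are both rejected.
const languageChecks = ["it", "ita", "en "].map((code) => isSupportedLanguageCode(code));
// languageChecks is [true, false, false]
```

The guard complements the CHECK constraint rather than replacing it: the constraint remains the last line of defence for any write path that bypasses the script.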
 ### Schema: unique constraints make explicit FK indexes redundant
 Postgres automatically creates an index to enforce a unique constraint. An explicit index on a column that is already the leading column of a unique constraint is redundant.
 Example: `unique("unique_term_gloss").on(term_id, language_code, text)` already indexes `term_id` as the leading column. A separate `index("idx_term_glosses_term").on(term_id)` adds no value and was dropped.
 **Rule:** before adding an explicit index, check whether an existing unique constraint already covers it.
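The rule amounts to a leading-prefix check on column lists. The sketch below is a simplification with a hypothetical helper name (`isRedundantIndex`); it compares plain column names only and ignores partial indexes, expression indexes, sort order, and index type, all of which can make a seemingly redundant index worth keeping.

```ts
// Returns true when the proposed index columns are a leading prefix of the
// columns of an existing unique constraint, so the implicit index already covers them.
function isRedundantIndex(indexColumns: string[], uniqueColumns: string[]): boolean {
  if (indexColumns.length === 0 || indexColumns.length > uniqueColumns.length) {
    return false;
  }
  return indexColumns.every((column, i) => column === uniqueColumns[i]);
}

// Columns of the `unique_term_gloss` constraint from the example above.
const uniqueTermGloss = ["term_id", "language_code", "text"];

const termIdIndexRedundant = isRedundantIndex(["term_id"], uniqueTermGloss); // true
const languageIndexRedundant = isRedundantIndex(["language_code"], uniqueTermGloss); // false
```

An index on `language_code` alone is not covered, because `language_code` is not the leading column of the constraint.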
 ### Future extensions: morphology and pronunciation (deferred, additive)
@@ -335,9 +375,11 @@ Phase 0 complete. Phase 1 data pipeline complete.
 - Words not found in DB: 34
 - Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`
-### Roadmap to API implementation
+### Next (Phase 2 — data model + pipeline)
-1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `cefr_level` to `terms`, add `categories` + `term_categories` tables
+Roadmap to API implementation:
 1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `source` + `source_id` + `cefr_level` to `terms`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses`
 2. **Write and run migrations** — schema changes before any data expansion
 3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
 4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful