From bfc09180f1c835f76edf8a77390b2ff4b3ed8d1e Mon Sep 17 00:00:00 2001
From: lila <beiweitemderbeste@protonmail.com>
Date: Sun, 5 Apr 2026 01:21:18 +0200
Subject: [PATCH] updating documentation

---
 documentation/decisions.md | 50 +++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/documentation/decisions.md b/documentation/decisions.md
index 8b6d3a0..fd02d4c 100644
--- a/documentation/decisions.md
+++ b/documentation/decisions.md
@@ -240,9 +240,23 @@ Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are
 CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
 ```
 
+### Terms: `source` + `source_id` columns
+
+Once multiple import pipelines exist (OMW today, Wiktionary later), `synset_id` alone is insufficient as an idempotency key — Wiktionary terms won't have a synset ID.
+
+**Decision:** add `source` (varchar, e.g. `'omw'`, `'wiktionary'`, null for manual) and `source_id` (text, the pipeline's internal identifier) with a unique constraint on the pair:
+
+```ts
+unique("unique_source_id").on(table.source, table.source_id)
+```
+
+Postgres allows multiple `NULL` pairs under a unique constraint, so manual entries don't conflict. For existing OMW terms, backfill `source = 'omw'` and `source_id = synset_id`. `synset_id` remains for now to avoid pipeline churn — deprecate it during a future pipeline refactor.
+
+No CHECK constraint on `source` — it is only written by controlled import scripts, not user input. A free varchar is sufficient.
+
 ### Terms: `cefr_level` column (deferred population)
 
-Added as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`). Belongs on `terms`, not on `decks` — difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later.
+Added as nullable `varchar(2)` with CHECK constraint against `CEFR_LEVELS` (`A1`–`C2`). Belongs on `terms`, not `decks` — difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later.
 
 ### Schema: `categories` + `term_categories` (empty for MVP)
 
@@ -253,7 +267,33 @@ categories:      id, slug, label, created_at
 term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
 ```
 
-See open research section below for source options.
+See Open Research section for source options.
+
+### Schema constraints: CHECK over pgEnum for extensible value sets
+
+**Question:** use `pgEnum` for columns like `pos`, `cefr_level`, and `source` since the values are driven by TypeScript constants anyway?
+
+**Decision:** no. Use CHECK constraints for any value set that will grow over time.
+
+**Reason:** `ALTER TYPE enum_name ADD VALUE` in Postgres is non-transactional — it cannot be rolled back if a migration fails partway through, leaving the DB in a dirty state that requires manual intervention. CHECK constraints are fully transactional — if the migration fails it rolls back cleanly.
+
+**Rule of thumb:** pgEnum is appropriate for truly static value sets that will never grow (e.g. `('pending', 'active', 'cancelled')` on an orders table). Any value set tied to a growing constant in the codebase (`SUPPORTED_POS`, `CEFR_LEVELS`, `SUPPORTED_LANGUAGE_CODES`) stays as a CHECK constraint.
+
+### Schema constraints: `language_code` always CHECK-constrained
+
+`language_code` columns on `translations` and `term_glosses` are constrained via CHECK against `SUPPORTED_LANGUAGE_CODES`, the same pattern used for `pos` and `cefr_level`.
+
+**Reason:** unlike `source`, which is only written by controlled import scripts and failing silently is recoverable, `language_code` is a query-critical filter column. A typo (`'ita'` instead of `'it'`, `'en '` with a trailing space) would silently produce missing data in the UI — terms with no translation shown, glosses not displayed — which is harder to debug than a DB constraint violation.
+
+**Rule:** any column that game queries filter on should be CHECK-constrained. Columns only used for internal bookkeeping (like `source`) can be left as free varchars.
+
+### Schema: unique constraints make explicit FK indexes redundant
+
+Postgres automatically creates an index to enforce a unique constraint. An explicit index on a column that is already the leading column of a unique constraint is redundant.
+
+Example: `unique("unique_term_gloss").on(term_id, language_code, text)` already indexes `term_id` as the leading column. A separate `index("idx_term_glosses_term").on(term_id)` adds no value and was dropped.
+
+**Rule:** before adding an explicit index, check whether an existing unique constraint already covers it.
 
 ### Future extensions: morphology and pronunciation (deferred, additive)
 
@@ -335,9 +375,11 @@ Phase 0 complete. Phase 1 data pipeline complete.
 - Words not found in DB: 34
 - Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`
 
-### Roadmap to API implementation
+### Next (Phase 2 — data model + pipeline)
 
-1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `cefr_level` to `terms`, add `categories` + `term_categories` tables
+Roadmap to API implementation:
+
+1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `source` + `source_id` + `cefr_level` to `terms`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses`
 2. **Write and run migrations** — schema changes before any data expansion
 3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
 4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful