From bfc09180f1c835f76edf8a77390b2ff4b3ed8d1e Mon Sep 17 00:00:00 2001 From: lila Date: Sun, 5 Apr 2026 01:21:18 +0200 Subject: [PATCH] updating documentation --- documentation/decisions.md | 50 +++++++++++++++++++++++++++++++++++--- 1 file changed, 46 insertions(+), 4 deletions(-) diff --git a/documentation/decisions.md b/documentation/decisions.md index 8b6d3a0..fd02d4c 100644 --- a/documentation/decisions.md +++ b/documentation/decisions.md @@ -240,9 +240,23 @@ Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL; ``` +### Terms: `source` + `source_id` columns + +Once multiple import pipelines exist (OMW today, Wiktionary later), `synset_id` alone is insufficient as an idempotency key — Wiktionary terms won't have a synset ID. + +**Decision:** add `source` (varchar, e.g. `'omw'`, `'wiktionary'`, null for manual) and `source_id` (text, the pipeline's internal identifier) with a unique constraint on the pair: + +```ts +unique("unique_source_id").on(table.source, table.source_id) +``` + +Postgres allows multiple `NULL` pairs under a unique constraint, so manual entries don't conflict. For existing OMW terms, backfill `source = 'omw'` and `source_id = synset_id`. `synset_id` remains for now to avoid pipeline churn — deprecate it during a future pipeline refactor. + +No CHECK constraint on `source` — it is only written by controlled import scripts, not user input. A free varchar is sufficient. + ### Terms: `cefr_level` column (deferred population) -Added as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`). Belongs on `terms`, not on `decks` — difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later. +Added as nullable `varchar(2)` with CHECK constraint against `CEFR_LEVELS` (`A1`–`C2`). Belongs on `terms`, not `decks` — difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later. ### Schema: `categories` + `term_categories` (empty for MVP) @@ -253,7 +267,33 @@ categories: id, slug, label, created_at term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id) ``` -See open research section below for source options. +See Open Research section for source options. + +### Schema constraints: CHECK over pgEnum for extensible value sets + +**Question:** use `pgEnum` for columns like `pos`, `cefr_level`, and `source` since the values are driven by TypeScript constants anyway? + +**Decision:** no. Use CHECK constraints for any value set that will grow over time. + +**Reason:** `ALTER TYPE enum_name ADD VALUE` in Postgres is non-transactional — it cannot be rolled back if a migration fails partway through, leaving the DB in a dirty state that requires manual intervention. CHECK constraints are fully transactional — if the migration fails it rolls back cleanly. + +**Rule of thumb:** pgEnum is appropriate for truly static value sets that will never grow (e.g. `('pending', 'active', 'cancelled')` on an orders table). Any value set tied to a growing constant in the codebase (`SUPPORTED_POS`, `CEFR_LEVELS`, `SUPPORTED_LANGUAGE_CODES`) stays as a CHECK constraint. + +### Schema constraints: `language_code` always CHECK-constrained + +`language_code` columns on `translations` and `term_glosses` are constrained via CHECK against `SUPPORTED_LANGUAGE_CODES`, the same pattern used for `pos` and `cefr_level`. + +**Reason:** unlike `source`, which is only written by controlled import scripts and failing silently is recoverable, `language_code` is a query-critical filter column. A typo (`'ita'` instead of `'it'`, `'en '` with a trailing space) would silently produce missing data in the UI — terms with no translation shown, glosses not displayed — which is harder to debug than a DB constraint violation. + +**Rule:** any column that game queries filter on should be CHECK-constrained. Columns only used for internal bookkeeping (like `source`) can be left as free varchars. + +### Schema: unique constraints make explicit FK indexes redundant + +Postgres automatically creates an index to enforce a unique constraint. An explicit index on a column that is already the leading column of a unique constraint is redundant. + +Example: `unique("unique_term_gloss").on(term_id, language_code, text)` already indexes `term_id` as the leading column. A separate `index("idx_term_glosses_term").on(term_id)` adds no value and was dropped. + +**Rule:** before adding an explicit index, check whether an existing unique constraint already covers it. ### Future extensions: morphology and pronunciation (deferred, additive) @@ -335,9 +375,11 @@ Phase 0 complete. Phase 1 data pipeline complete. - Words not found in DB: 34 - Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages` -### Roadmap to API implementation +### Next (Phase 2 — data model + pipeline) -1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `cefr_level` to `terms`, add `categories` + `term_categories` tables +Roadmap to API implementation: + +1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `source` + `source_id` + `cefr_level` to `terms`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses` 2. **Write and run migrations** — schema changes before any data expansion 3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations 4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful