updating documentation

lila 2026-04-05 01:21:18 +02:00
parent 7d80b20390
commit bfc09180f1

@@ -240,9 +240,23 @@ Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are
```
CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
```
### Terms: `source` + `source_id` columns
Once multiple import pipelines exist (OMW today, Wiktionary later), `synset_id` alone is insufficient as an idempotency key — Wiktionary terms won't have a synset ID.
**Decision:** add `source` (varchar, e.g. `'omw'`, `'wiktionary'`, null for manual) and `source_id` (text, the pipeline's internal identifier) with a unique constraint on the pair:
```ts
unique("unique_source_id").on(table.source, table.source_id)
```
Postgres allows multiple `NULL` pairs under a unique constraint, so manual entries don't conflict. For existing OMW terms, backfill `source = 'omw'` and `source_id = synset_id`. `synset_id` remains for now to avoid pipeline churn — deprecate it during a future pipeline refactor.
No CHECK constraint on `source` — it is only written by controlled import scripts, not user input. A free varchar is sufficient.
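
The uniqueness rule above can be illustrated with a small in-memory model. This is only a sketch of how Postgres treats `NULL` under a default unique constraint, not the project's code; the `TermKey` type, the `violatesUniqueSource` helper, and the `example-synset-1` identifier are hypothetical names made up for the example.

```ts
// Hypothetical model of UNIQUE (source, source_id) with Postgres's default
// behavior, where NULL never compares equal to anything (including NULL).
type TermKey = { source: string | null; sourceId: string | null };

function violatesUniqueSource(existing: TermKey[], candidate: TermKey): boolean {
  // A candidate with NULL in either column can never collide.
  if (candidate.source === null || candidate.sourceId === null) return false;
  return existing.some(
    (row) => row.source === candidate.source && row.sourceId === candidate.sourceId,
  );
}

// Placeholder rows: one imported term and one manual entry.
const existingTerms: TermKey[] = [
  { source: "omw", sourceId: "example-synset-1" },
  { source: null, sourceId: null },
];

// A second manual entry does not conflict with the first.
const secondManual = violatesUniqueSource(existingTerms, { source: null, sourceId: null });
// Re-importing the same OMW term does conflict, which is what makes the pair
// usable as an idempotency key.
const reimport = violatesUniqueSource(existingTerms, { source: "omw", sourceId: "example-synset-1" });
```

The same `source_id` under a different `source` is a distinct key, so two pipelines can reuse identifiers without colliding.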
### Terms: `cefr_level` column (deferred population)
Added as a nullable `varchar(2)` with a CHECK constraint against `CEFR_LEVELS` (`A1`–`C2`). Belongs on `terms`, not `decks`: difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later.
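
As a rough sketch of how a CHECK clause could be derived from the constant, the snippet below builds the SQL condition as a string. The contents of `CEFR_LEVELS` (the six standard CEFR bands) and the `buildCheckClause` helper are assumptions for illustration; the real constant and the way the schema consumes it may differ.

```ts
// Assumed contents of the constant: the six standard CEFR bands.
const CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"] as const;

// Hypothetical helper: turns a list of allowed values into a SQL condition
// suitable for CHECK (...). Single quotes are doubled for SQL string literals.
function buildCheckClause(column: string, allowed: readonly string[]): string {
  const list = allowed.map((v) => `'${v.replace(/'/g, "''")}'`).join(", ");
  return `${column} IN (${list})`;
}

const cefrCheck = buildCheckClause("cefr_level", CEFR_LEVELS);
// cefrCheck === "cefr_level IN ('A1', 'A2', 'B1', 'B2', 'C1', 'C2')"
```

A CHECK condition that evaluates to `NULL` counts as satisfied in Postgres, so this clause still permits the null values the column holds for MVP.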
### Schema: `categories` + `term_categories` (empty for MVP)
@@ -253,7 +267,33 @@ categories: id, slug, label, created_at
```
term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
```
See the Open Research section for source options.
### Schema constraints: CHECK over pgEnum for extensible value sets
**Question:** use `pgEnum` for columns like `pos`, `cefr_level`, and `source` since the values are driven by TypeScript constants anyway?
**Decision:** no. Use CHECK constraints for any value set that will grow over time.
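**Reason:** enum types in Postgres are awkward to evolve. Before Postgres 12, `ALTER TYPE enum_name ADD VALUE` could not run inside a transaction block at all, so a migration that failed partway through could leave the DB in a dirty state requiring manual intervention. From Postgres 12 onward it can run inside a transaction, but the new value cannot be used until that transaction commits, and an enum value cannot be dropped once added. CHECK constraints are fully transactional: they can be dropped and re-created in one transaction, and a failed migration rolls back cleanly.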
**Rule of thumb:** pgEnum is appropriate for truly static value sets that will never grow (e.g. `('pending', 'active', 'cancelled')` on an orders table). Any value set tied to a growing constant in the codebase (`SUPPORTED_POS`, `CEFR_LEVELS`, `SUPPORTED_LANGUAGE_CODES`) stays as a CHECK constraint.
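
To make the transactional path concrete, here is a minimal sketch of a helper that emits the SQL for widening a CHECK-constrained value set. The helper name, the constraint name `terms_pos_check`, and the POS values are hypothetical; a real migration tool would normally generate and wrap these statements itself.

```ts
// Hypothetical helper: statements that replace a CHECK constraint atomically.
// If either ALTER fails, the surrounding transaction rolls back and the old
// constraint stays in place.
function extendCheckMigration(
  table: string,
  constraint: string,
  column: string,
  allowed: readonly string[],
): string[] {
  const list = allowed.map((v) => `'${v.replace(/'/g, "''")}'`).join(", ");
  return [
    "BEGIN;",
    `ALTER TABLE ${table} DROP CONSTRAINT ${constraint};`,
    `ALTER TABLE ${table} ADD CONSTRAINT ${constraint} CHECK (${column} IN (${list}));`,
    "COMMIT;",
  ];
}

// Assumed example: adding 'adjective' to a POS set that held 'noun' and 'verb'.
const posMigration = extendCheckMigration("terms", "terms_pos_check", "pos", [
  "noun",
  "verb",
  "adjective",
]);
```

Re-adding the constraint makes Postgres re-validate existing rows, which is cheap while the table is small.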
### Schema constraints: `language_code` always CHECK-constrained
`language_code` columns on `translations` and `term_glosses` are constrained via CHECK against `SUPPORTED_LANGUAGE_CODES`, the same pattern used for `pos` and `cefr_level`.
**Reason:** unlike `source`, which is written only by controlled import scripts and where a silent failure is recoverable, `language_code` is a query-critical filter column. A typo (`'ita'` instead of `'it'`, `'en '` with a trailing space) would silently produce missing data in the UI (terms with no translation shown, glosses not displayed), which is harder to debug than a DB constraint violation.
**Rule:** any column that game queries filter on should be CHECK-constrained. Columns only used for internal bookkeeping (like `source`) can be left as free varchars.
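
The typo cases above are exactly what an exact-match membership test rejects. The sketch below mirrors the CHECK in TypeScript so an import script could fail early; the contents of `SUPPORTED_LANGUAGE_CODES` and the `isSupportedLanguageCode` guard are assumptions for illustration, and the DB constraint remains the actual safety net.

```ts
// Assumed contents; the real constant lives in the codebase and may be longer.
const SUPPORTED_LANGUAGE_CODES = ["en", "it"] as const;
type LanguageCode = (typeof SUPPORTED_LANGUAGE_CODES)[number];

// Hypothetical type guard: exact string match only, no trimming or
// normalization, which matches how a SQL IN (...) check behaves on varchar.
function isSupportedLanguageCode(value: string): value is LanguageCode {
  return SUPPORTED_LANGUAGE_CODES.some((code) => code === value);
}

const okItalian = isSupportedLanguageCode("it");      // true
const threeLetter = isSupportedLanguageCode("ita");   // false
const trailingSpace = isSupportedLanguageCode("en "); // false
```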
### Schema: unique constraints make explicit FK indexes redundant
Postgres automatically creates an index to enforce a unique constraint. An explicit index on a column that is already the leading column of a unique constraint is redundant.
Example: `unique("unique_term_gloss").on(term_id, language_code, text)` already indexes `term_id` as the leading column. A separate `index("idx_term_glosses_term").on(term_id)` adds no value and was dropped.
**Rule:** before adding an explicit index, check whether an existing unique constraint already covers it.
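
The check in this rule amounts to a leading-prefix comparison on column lists. The following is a simplified sketch with a hypothetical `isCoveredByUnique` helper; it deliberately ignores partial indexes, expression indexes, and non-default index types, all of which can make a seemingly redundant index worth keeping.

```ts
// Hypothetical helper: an index is redundant if its columns are a leading
// prefix of the columns of some existing unique constraint.
function isCoveredByUnique(indexColumns: string[], uniqueConstraints: string[][]): boolean {
  return uniqueConstraints.some(
    (uniqueCols) =>
      indexColumns.length <= uniqueCols.length &&
      indexColumns.every((col, i) => uniqueCols[i] === col),
  );
}

// Column list taken from the unique_term_gloss example above.
const termGlossUniques = [["term_id", "language_code", "text"]];

const termIdCovered = isCoveredByUnique(["term_id"], termGlossUniques);        // true
const languageCovered = isCoveredByUnique(["language_code"], termGlossUniques); // false
```

An index on `language_code` alone is not covered, because it is not the leading column of the unique constraint.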
### Future extensions: morphology and pronunciation (deferred, additive)
@@ -335,9 +375,11 @@ Phase 0 complete. Phase 1 data pipeline complete.
- Words not found in DB: 34
- Italian (`it`) coverage: 3171 / 3171 (full coverage; included in `validated_languages`)
### Next (Phase 2: data model + pipeline)
Roadmap to API implementation:
1. **Finalize data model**: apply decisions above: `synset_id` nullable, add `source` + `source_id` + `cefr_level` to `terms`, add `categories` + `term_categories` tables, add `language_code` CHECK to `translations` and `term_glosses`
2. **Write and run migrations**: schema changes before any data expansion
3. **Expand data pipeline**: import all OMW languages and POS, not just English nouns with Italian translations
4. **Decide SUBTLEX → `cefr_level` mapping strategy**: raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful