- `validated_languages` stays accurate as translation data grows
- DB enforces via CHECK constraint that `source_language` is never included in `validated_languages`

### Decks: wordlist tiers as scope (not POS-split decks)

**Rejected approach:** one deck per POS (e.g. `en-nouns`, `en-verbs`).

**Problem:** POS is already a filterable column on `terms`, so a POS-scoped deck duplicates logic the query already handles for free. A word like "run" (noun and verb, different synsets) would also appear in two decks, requiring deduplication in the generation script.

**Decision:** one deck per frequency tier per source language (e.g. `en-core-1000`, `en-core-2000`). POS, difficulty, and category are query filters applied inside that boundary at query time. The user never sees or picks a deck — they pick a direction, POS, and difficulty, and the app resolves those to the right deck + filters.

Progression works by expanding the deck set as the user advances:

```sql
WHERE dt.deck_id IN ('en-core-1000', 'en-core-2000')
  AND t.pos = 'noun'
  AND t.cefr_level = 'B1'
```

Decks must not overlap — each term appears in exactly one tier. The generation script already deduplicates, so this is enforced at import time.
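
The non-overlap invariant can be spot-checked directly in the database. A sketch, assuming the join table behind the `dt` alias above is named `deck_terms`:

```sql
-- Should return zero rows if tiers are disjoint
-- (the table name deck_terms is an assumption)
SELECT term_id, count(*) AS deck_count
FROM deck_terms
GROUP BY term_id
HAVING count(*) > 1;
```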

### Decks: SUBTLEX as wordlist source (not manual curation)

**Problem:** the most common 1000 nouns in English are not the same 1000 nouns that are most common in Italian — not just in translation, but conceptually. Building decks from English frequency data alone gives Italian learners a distorted picture of what is actually common in Italian.

**Decision:** use SUBTLEX, which exists in per-language editions (SUBTLEX-EN, SUBTLEX-IT, etc.) derived from subtitle corpora using the same methodology, making them comparable across languages.

This is why `decks.source_language` is not just a technical detail — it is the reason the data model is correct:

- `en-core-1000` built from SUBTLEX-EN → used when source language is English (en→it)
- `it-core-1000` built from SUBTLEX-IT → used when source language is Italian (it→en)

Same translation data underneath, correctly frequency-grounded per direction. Two wordlist files, two generation script runs.
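
The resulting `decks` rows might look like this (column names beyond `source_language` are assumptions, not confirmed by the schema):

```sql
-- Hypothetical rows: the same tier concept per source language,
-- each generated from that language's own SUBTLEX edition
INSERT INTO decks (id, source_language, label) VALUES
  ('en-core-1000', 'en', 'English core 1000'),
  ('it-core-1000', 'it', 'Italian core 1000');
```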

### Terms: `synset_id` nullable (not NOT NULL)

**Problem:** non-WordNet terms (custom words, Wiktionary-sourced entries added later) won't have a synset ID. `NOT NULL` is too strict.

**Decision:** make `synset_id` nullable. `synset_id` remains the WordNet idempotency key — it prevents duplicate imports on re-runs and allows cross-referencing back to WordNet. It is not removed.

Postgres `UNIQUE` on a nullable column allows multiple `NULL` values (nulls are not considered equal), so no additional constraint logic is needed beyond dropping `notNull()`. For extra defensiveness a partial unique index can be added later:

```sql
CREATE UNIQUE INDEX idx_terms_synset_id ON terms (synset_id) WHERE synset_id IS NOT NULL;
```

### Terms: `cefr_level` column (deferred population)

Added as a nullable `varchar(2)` with a CHECK constraint (`A1`–`C2`). Belongs on `terms`, not on `decks` — difficulty is a property of the term, not the curated list. Left null for MVP; populated later via SUBTLEX or an external CEFR wordlist. Added now while the table is small to avoid a costly backfill migration later.
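
In raw SQL the column could look like the following sketch (the actual change would go through a Drizzle migration):

```sql
ALTER TABLE terms
  ADD COLUMN cefr_level varchar(2)
  CHECK (cefr_level IN ('A1', 'A2', 'B1', 'B2', 'C1', 'C2'));
```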

### Schema: `categories` + `term_categories` (empty for MVP)

Added to schema now, left empty for MVP. Grammar and Media work without them — Grammar maps to POS (already on `terms`), Media maps to deck membership. Thematic categories (animals, kitchen, etc.) require a metadata source that is still under research.

```sql
categories: id, slug, label, created_at
term_categories: term_id → terms.id, category_id → categories.id, PK(term_id, category_id)
```
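
Expanded into DDL, the shorthand above could look like this (column types are assumptions; the authoritative definition lives in the Drizzle schema):

```sql
CREATE TABLE categories (
  id serial PRIMARY KEY,
  slug text NOT NULL UNIQUE,
  label text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE term_categories (
  term_id int NOT NULL REFERENCES terms (id),
  category_id int NOT NULL REFERENCES categories (id),
  PRIMARY KEY (term_id, category_id)
);
```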

See open research section below for source options.

### Future extensions: morphology and pronunciation (deferred, additive)

The following features are explicitly deferred post-MVP. All are purely additive — new tables referencing existing `terms` rows via FK. No existing schema changes required when implemented:

- `noun_forms` — gender, singular, plural, articles per language (source: Wiktionary)
- `verb_forms` — conjugation tables per language (source: Wiktionary)
- `term_pronunciations` — IPA and audio URLs per language (source: Wiktionary / Forvo)
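
As an illustration of the additive pattern, one of these tables might look like the following (all columns hypothetical):

```sql
-- Hypothetical shape; purely additive, FK onto existing terms rows
CREATE TABLE term_pronunciations (
  term_id int NOT NULL REFERENCES terms (id),
  language varchar(2) NOT NULL,
  ipa text,
  audio_url text,
  PRIMARY KEY (term_id, language)
);
```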

Exercise types split naturally into Type A (translation, current model) and Type B (morphology, future). The data layer is independent — the same `terms` anchor both.

---

## Open Research

### Semantic category metadata source

Categories (`animals`, `kitchen`, etc.) are in the schema but empty for MVP.
Grammar and Media work without them (Grammar = POS filter, Media = deck membership).
Needs research before populating `term_categories`. Options:

**Option 1: WordNet domain labels**
Already in OMW, extractable in the existing pipeline. Free, no extra dependency.
Problem: coarse and patchy — many terms untagged, vocabulary is academic ("fauna" not "animals").

**Option 2: WordNet Domains**
Separate project built on Princeton WordNet. ~200 hierarchical domains mapped to synsets. More structured and consistent than the basic WordNet labels. Freely available. Meaningfully better than Option 1.

**Option 3: Kelly Project**
Frequency lists with CEFR levels AND semantic field tags, explicitly designed for language learning, available in multiple languages. Could solve frequency tiers (`cefr_level`) and semantic categories in one shot. Investigate coverage for your languages and POS range first.

**Option 4: BabelNet / WikiData**
Rich, multilingual, community-maintained. Maps WordNet synsets to Wikipedia categories.
Problem: complex integration, BabelNet has commercial licensing restrictions, WikiData category trees are deep and noisy.

**Option 5: LLM-assisted categorization**
Run terms through Claude/GPT-4 with a fixed category list, spot-check output, import.
Fast and cheap at current term counts (3171 terms ≈ negligible cost). Not reproducible without saving output. Good fallback if structured sources have insufficient coverage.

**Option 6: Hybrid — WordNet Domains as baseline, LLM gap-fill**
Use Option 2 for automated coverage, LLM for terms with no domain tag, manual spot-check pass.
Combines automation with control. Likely the most practical approach.
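
Under Option 6, the LLM pass would target terms left untagged by the automated baseline. A sketch of the candidate query (the `lemma` column name is an assumption):

```sql
-- Terms with no category yet: candidates for the LLM gap-fill pass
SELECT t.id, t.lemma
FROM terms t
LEFT JOIN term_categories tc ON tc.term_id = t.id
WHERE tc.term_id IS NULL;
```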

**Option 7: Manual curation**
Flat file mapping synset IDs to your own category slugs. Full control, matches UI exactly.
Too expensive at scale — only viable for small curated additions on top of an automated baseline.

**Current recommendation:** research Kelly Project first. If coverage is insufficient, go with Option 6.

---

## Current State

Phase 0 complete. Phase 1 data pipeline complete.

- Words not found in DB: 34
- Italian (`it`) coverage: 3171 / 3171 — full coverage, included in `validated_languages`

### Roadmap to API implementation

1. **Finalize data model** — apply decisions above: `synset_id` nullable, add `cefr_level` to `terms`, add `categories` + `term_categories` tables
2. **Write and run migrations** — schema changes before any data expansion
3. **Expand data pipeline** — import all OMW languages and POS, not just English nouns with Italian translations
4. **Decide SUBTLEX → `cefr_level` mapping strategy** — raw frequency ranks need a mapping to A1–C2 bands before tiered decks are meaningful
5. **Generate decks** — run generation script with SUBTLEX-grounded wordlists per source language
6. **Finalize game selection flow** — direction → category → POS → difficulty → round count
7. **Define Zod schemas in `packages/shared`** — based on finalized game flow and API shape
8. **Implement API**
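
For step 4, one naive rank-threshold mapping as a starting point (the thresholds and the `freq_rank` column name are placeholders, not a decision):

```sql
UPDATE terms SET cefr_level = CASE
  WHEN freq_rank <= 500  THEN 'A1'
  WHEN freq_rank <= 1000 THEN 'A2'
  WHEN freq_rank <= 2000 THEN 'B1'
  WHEN freq_rank <= 4000 THEN 'B2'
  WHEN freq_rank <= 8000 THEN 'C1'
  ELSE 'C2'
END;
```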