formatting

lila 2026-03-31 10:06:06 +02:00
parent 20fa6a9331
commit e3a2136720
11 changed files with 72803 additions and 408878 deletions


@@ -14,14 +14,12 @@ Each synset extracted from WordNet is represented as:
{
"synset_id": "ili:i35545",
"pos": "noun",
"translations": {
"en": ["entity"],
"it": ["cosa", "entità"]
}
"translations": { "en": ["entity"], "it": ["cosa", "entità"] }
}
```
**Fields:**
- `synset_id` — OMW Interlingual Index ID, maps to `terms.synset_id` in the DB
- `pos` — part of speech, matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within a synset)
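For reference, the same record as a TypeScript shape (a sketch; the `pos` union mirrors the CHECK constraint on `terms.pos` described in the schema section):
```ts
// Sketch of one extracted synset record (type name is illustrative)
type Synset = {
  synset_id: string; // OMW ILI, e.g. "ili:i35545"
  pos: "noun" | "verb" | "adjective" | "adverb";
  translations: Record<string, string[]>; // language code → lemmas (synonyms)
};
```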
@@ -53,20 +51,21 @@ translations
## 3. Seeding Script — v1 (batch, truncate-based)
### Approach
- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate
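A minimal sketch of that flow, assuming the `db` instance and schema exports shown in §8 (the file path and batch mechanics here are illustrative):
```ts
import { readFile } from "node:fs/promises";
import { sql } from "drizzle-orm";
import { db } from "@glossa/db";
import { terms } from "@glossa/db/schema";

const seed = async () => {
  // Clean slate: wipe both tables so re-runs never hit unique constraints
  await db.execute(sql`TRUNCATE translations, terms CASCADE`);

  const parsed: unknown = JSON.parse(
    await readFile("datafiles/en-it-nouns.json", "utf8"),
  );
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array of synsets");

  // Insert terms in groups of 500 (translations inserts omitted for brevity)
  for (let i = 0; i < parsed.length; i += 500) {
    const batch = parsed.slice(i, i + 500);
    await db
      .insert(terms)
      .values(batch.map((s) => ({ synset_id: s.synset_id, pos: s.pos })));
  }
};
```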
### Key decisions made during development
-| Issue | Resolution |
-|-------|-----------|
-| `JSON.parse` returns `any` | Added `Array.isArray` check before casting |
-| `forEach` doesn't await | Switched to `for...of` |
-| Empty array types | Used Drizzle's `$inferInsert` types |
-| `translations` naming conflict | Renamed local variable to `translationRows` |
-| Final batch not flushed | Added `if (termsArray.length > 0)` guard after loop |
-| Exact batch size check `=== 500` | Changed to `>= 500` |
+| Issue                            | Resolution                                           |
+| -------------------------------- | ---------------------------------------------------- |
+| `JSON.parse` returns `any`       | Added `Array.isArray` check before casting           |
+| `forEach` doesn't await          | Switched to `for...of`                               |
+| Empty array types                | Used Drizzle's `$inferInsert` types                  |
+| `translations` naming conflict   | Renamed local variable to `translationRows`          |
+| Final batch not flushed          | Added `if (termsArray.length > 0)` guard after loop  |
+| Exact batch size check `=== 500` | Changed to `>= 500`                                  |
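The `$inferInsert` fix in practice: derive the batch array types from the Drizzle schema instead of leaving them as `any[]` (a sketch):
```ts
import { terms, translations } from "@glossa/db/schema";

// Row types inferred from the schema, so inserts are fully typed
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;

const termsArray: TermInsert[] = [];
const translationsArray: TranslationInsert[] = [];
```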
### Final script structure
@@ -134,7 +133,9 @@ const main = async () => {
if (termsArray.length >= 500) {
batchCount++;
-      console.log(`Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`);
+      console.log(
+        `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
+      );
await uploadToDB(termsArray, translationsArray);
termsArray.length = 0;
translationsArray.length = 0;
@@ -143,7 +144,9 @@ const main = async () => {
if (termsArray.length > 0) {
batchCount++;
-      console.log(`Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`);
+      console.log(
+        `Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`,
+      );
await uploadToDB(termsArray, translationsArray);
}
@@ -161,6 +164,7 @@ main().catch((error) => {
## 4. Pitfalls Encountered
### Duplicate key on re-run
Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding.
```bash
@@ -168,15 +172,19 @@ docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translatio
```
### `onConflictDoNothing` breaks FK references
When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID is never written to the DB. Subsequent `translations` inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding.
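The failure mode, sketched (assuming the v1 script generates UUIDs client-side, as the in-memory reference implies):
```ts
import { randomUUID } from "node:crypto";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";

const newId = randomUUID();

// Conflict on synset_id: the row is skipped and newId is never persisted
await db
  .insert(terms)
  .values({ id: newId, synset_id: "ili:i35545", pos: "noun" })
  .onConflictDoNothing();

// FK violation: term_id points at a UUID that was never written to the DB
await db
  .insert(translations)
  .values({ term_id: newId, language_code: "en", text: "entity" });
```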
### DATABASE_URL misconfigured
Correct format:
```
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
```
### Tables not found after `docker compose up`
Migrations must be applied first: `npx drizzle-kit migrate`
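Assumed order of operations (container and script paths as used elsewhere in these notes; the `tsx` runner is an assumption):
```bash
docker compose up -d              # start the glossa-database container
npx drizzle-kit migrate           # create tables from migrations
npx tsx packages/db/src/seed.ts   # only then seed
```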
---
@@ -205,10 +213,13 @@ docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FRO
## 6. Seeding Script — v2 (incremental upsert, multi-file)
### Motivation
The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.
### File naming convention
One JSON file per language pair per POS:
```
scripts/datafiles/
en-it-nouns.json
@@ -219,7 +230,9 @@ scripts/datafiles/
```
### How incremental upsert works
For a concept like "dog" already in the DB with English and Italian:
1. Import `en-fr-nouns.json`
2. Upsert `terms` by `synset_id` — finds existing row, returns its real ID
3. `dog (en)` already exists → skipped by `onConflictDoNothing`
@@ -228,6 +241,7 @@ For a concept like "dog" already in the DB with English and Italian:
The concept is **extended**, not replaced.
### Tradeoff vs batch approach
Batching is no longer possible since you need the real `term.id` from the DB before inserting translations. Each synset is processed individually. For 25k rows this is still fast enough.
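A sketch of the per-synset driver loop (the real `upsertSynset` signature is cut off in the hunk below, so the argument list here is an assumption):
```ts
import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";

const dataDir = "scripts/datafiles";

for (const filename of await readdir(dataDir)) {
  const { sourceLang, targetLang } = parseFilename(filename);
  const synsets = JSON.parse(await readFile(join(dataDir, filename), "utf8"));
  for (const synset of synsets) {
    // One round-trip per synset: the upsert returns the real term.id
    // that the subsequent translation inserts depend on
    await upsertSynset(synset, sourceLang, targetLang);
  }
}
```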
### Key types added
@@ -252,7 +266,9 @@ type FileName = {
const parseFilename = (filename: string): FileName => {
const parts = filename.replace(".json", "").split("-");
if (parts.length !== 3)
-    throw new Error(`Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`);
+    throw new Error(
+      `Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
+    );
const [sourceLang, targetLang, pos] = parts;
if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
throw new Error(`Unsupported language code: ${sourceLang}`);
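The supporting constants and the `FileName` type are cut off by the hunk; a plausible shape, assuming BCP 47 codes (the actual language list is an assumption):
```ts
const SUPPORTED_LANGUAGE_CODES = ["en", "it", "fr"] as const;
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];

type FileName = {
  sourceLang: LANGUAGE_CODE;
  targetLang: LANGUAGE_CODE;
  pos: string;
};
```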
@@ -278,10 +294,7 @@ const upsertSynset = async (
const [upsertedTerm] = await db
.insert(terms)
.values({ synset_id: synset.synset_id, pos: synset.pos })
-    .onConflictDoUpdate({
-      target: terms.synset_id,
-      set: { pos: synset.pos },
-    })
+    .onConflictDoUpdate({ target: terms.synset_id, set: { pos: synset.pos } })
.returning({ id: terms.id, created_at: terms.created_at });
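// Heuristic: a created_at within the last second means this upsert inserted
// a fresh row rather than updating an existing one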
const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);
@@ -310,12 +323,12 @@ const upsertSynset = async (
## 7. Strategy Comparison
-| Strategy | Use case | Pros | Cons |
-|----------|----------|------|------|
-| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data |
-| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower |
-| Migrations-as-data | Production audit trail | Clean history | Files accumulate |
-| Diff-based sync | Large production datasets | Minimal writes | Complex to implement |
+| Strategy           | Use case                      | Pros                  | Cons                 |
+| ------------------ | ----------------------------- | --------------------- | -------------------- |
+| Truncate + batch   | Dev / first-time setup        | Fast, simple          | Wipes all data       |
+| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower  |
+| Migrations-as-data | Production audit trail        | Clean history         | Files accumulate     |
+| Diff-based sync    | Large production datasets     | Minimal writes        | Complex to implement |
---
@@ -331,6 +344,7 @@ The `exports` field must be an object, not an array:
```
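A minimal sketch of such an exports map (the file paths are assumptions; the subpaths match the imports shown below):
```json
{
  "name": "@glossa/db",
  "exports": {
    ".": "./src/index.ts",
    "./schema": "./src/schema.ts"
  }
}
```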
Imports then resolve as:
```ts
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";


@@ -61,19 +61,22 @@ Production will use Nginx to serve static Vite build output.
**Original approach:** Store `frequency_rank` on `terms` table and filter by rank range for difficulty.
**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:
- Rank 1: "In" → "indio" (chemical symbol: Indium)
- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium)
- Rank 7: "He" → "elio" (chemical symbol: Helium)
These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning.
**Decision:**
- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering)
- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
- `deck_terms` junction table links terms to decks with position ordering
- `rooms.deck_id` specifies which vocabulary deck a game uses
**Benefits:**
- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
- Bad data (chemical symbols, obscure words) excluded at deck level, not schema level
- Users can create custom decks later
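What deck-level curation buys at query time: one ordered join, with bad data simply absent from the deck (a Drizzle sketch; the snake_case export names are assumptions):
```ts
import { eq } from "drizzle-orm";
import { db } from "@glossa/db";
import { deck_terms, terms } from "@glossa/db/schema";

// All terms in one deck, in curated order
const getDeckTerms = (deckId: string) =>
  db
    .select({ id: terms.id, synsetId: terms.synset_id, pos: terms.pos })
    .from(deck_terms)
    .innerJoin(terms, eq(deck_terms.term_id, terms.id))
    .where(eq(deck_terms.deck_id, deckId))
    .orderBy(deck_terms.position);
```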
@@ -161,7 +164,8 @@ Then `sudo sysctl -p` or restart Docker.
**Problem:** Embeds auth provider in the primary key (e.g. `"google|12345"`). If OpenAuth changes format or a second provider is added, the PK cascades through all FKs (`rooms.host_id`, `room_players.user_id`).
**Decision:**
- `users.id` = internal UUID (stable FK target)
- `users.openauth_sub` = text UNIQUE (auth provider claim)
- Allows adding multiple auth providers per user later without FK changes
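The indirection in practice, resolving a provider claim to the stable internal UUID (a sketch; the upsert-on-login pattern is an assumption):
```ts
import { db } from "@glossa/db";
import { users } from "@glossa/db/schema";

// FKs always reference users.id; openauth_sub is only a lookup key
const getOrCreateUser = async (sub: string, email?: string) => {
  const [user] = await db
    .insert(users)
    .values({ openauth_sub: sub, email })
    .onConflictDoUpdate({
      target: users.openauth_sub,
      set: { last_login_at: new Date() },
    })
    .returning({ id: users.id });
  return user.id;
};
```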
@@ -177,6 +181,7 @@ Allows multiple synonyms per language per term (e.g. "dog", "hound" for same syn
### Decks: `pair_id` is nullable
`decks.pair_id` references `language_pairs` but is nullable. Reasons:
- Single-language decks (e.g. "English Grammar")
- Multi-pair decks (e.g. "Cognates" spanning EN-IT and EN-FR)
- System decks (created by app, not tied to specific user)
@@ -186,19 +191,22 @@ Allows multiple synonyms per language per term (e.g. "dog", "hound" for same syn
**Original approach:** Store `frequency_rank` on `terms` table and filter by rank range for difficulty.
**Problem discovered:** WordNet/OMW frequency data is unreliable for language learning. Extraction produced results like:
- Rank 1: "In" → "indio" (chemical symbol: Indium)
- Rank 2: "Be" → "berillio" (chemical symbol: Beryllium)
- Rank 7: "He" → "elio" (chemical symbol: Helium)
These are technically "common" in WordNet (every element is a noun) but useless for vocabulary learning.
**Decision:**
- `terms` table stores ALL available OMW synsets (raw data, no frequency filtering)
- `decks` table stores curated learning lists (A1, A2, B1, "Most Common 1000", etc.)
- `deck_terms` junction table links terms to decks with position ordering
- `rooms.deck_id` specifies which vocabulary deck a game uses
**Benefits:**
- Curricula can come from external sources (CEFR lists, Oxford 3000, SUBTLEX)
- Bad data (chemical symbols, obscure words) excluded at deck level, not schema level
- Users can create custom decks later


@@ -4,7 +4,7 @@
- pinning dependencies in package.json files
- add this to drizzle migrations file:
  ✅ ALTER TABLE terms ADD CHECK (pos IN ('noun', 'verb', 'adjective', etc));
## openwordnet


@@ -25,7 +25,7 @@ Goal: Word data lives in the DB and can be queried via the API.
Done when: `GET /api/decks/1/terms?limit=10` returns 10 terms from a specific deck.
[x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json`
-- Import ALL available OMW noun synsets (no frequency filtering)
[x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
[x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
[x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
@@ -142,9 +142,9 @@ Not required to ship, but address before real users arrive.
Dependency Graph
Phase 0 (Foundation)
└── Phase 1 (Vocabulary Data)
    └── Phase 2 (Auth)
        ├── Phase 3 (Singleplayer) ← parallel with Phase 4
        └── Phase 4 (Room Lobby)
            └── Phase 5 (Multiplayer Game)
                └── Phase 6 (Deployment)


@@ -72,7 +72,7 @@ vocab-trainer/
│   └── db/                      # Drizzle schema, migrations, seed script
├── scripts/
│   ├── datafiles/
│   │   └── en-it-nouns.json
│   └── extract-en-it-nouns.py   # One-time WordNet + OMW extraction → seed.json
├── docker-compose.yml
├── docker-compose.prod.yml
@@ -159,19 +159,19 @@ SSL is fully automatic via `nginx-proxy` + `acme-companion`. No manual Certbot n
### 5.1 Valkey Key Structure
Ephemeral room state is stored in Valkey with TTL (e.g., 1 hour).
PostgreSQL stores durable history only.
Key Format: `room:{code}:{field}`
| Key                           | Type | TTL | Description                            |
| ----------------------------- | ---- | --- | -------------------------------------- |
| `room:{code}:state`           | Hash | 1h  | Current question index, round status   |
| `room:{code}:players`         | Set  | 1h  | List of connected user IDs             |
| `room:{code}:answers:{round}` | Hash | 15m | Temp storage for current round answers |
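A sketch of writing one answer with the TTLs above, assuming an ioredis-compatible client (Valkey speaks the Redis protocol; the client setup is illustrative):
```ts
import Redis from "ioredis";

const valkey = new Redis(process.env.VALKEY_URL ?? "redis://localhost:6379");

const recordAnswer = async (
  code: string,
  round: number,
  userId: string,
  answer: string,
) => {
  const key = `room:${code}:answers:${round}`;
  await valkey.hset(key, userId, answer); // one hash field per player
  await valkey.expire(key, 15 * 60); // 15m TTL, refreshed on each write
};
```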
Recovery Strategy
If the server crashes mid-game, Valkey data is lost.
PostgreSQL `room_players.score` remains 0.
Room status is reset to `finished` via startup health check if `updated_at` is stale.
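That health check, sketched as a single statement run at startup (the staleness threshold is an assumption):
```ts
import { sql } from "drizzle-orm";
import { db } from "@glossa/db";

// Close out rooms whose in-flight state died with Valkey
await db.execute(sql`
  UPDATE rooms
  SET status = 'finished'
  WHERE status = 'in_progress'
    AND updated_at < now() - interval '1 hour'
`);
```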
---
@@ -186,79 +186,79 @@ Adding a new language pair requires no schema changes — only new rows in `tran
## Core tables
terms
id uuid PK
synset_id text UNIQUE -- OMW ILI (e.g. "ili:i12345")
pos varchar(20) -- NOT NULL, CHECK (pos IN ('noun', 'verb', 'adjective', 'adverb'))
created_at timestamptz DEFAULT now()
-- REMOVED: frequency_rank (handled at deck level)
translations
id uuid PK
term_id uuid FK → terms.id
language_code varchar(10) -- NOT NULL, BCP 47: "en", "it"
text text -- NOT NULL
created_at timestamptz DEFAULT now()
UNIQUE (term_id, language_code, text) -- Allow synonyms, prevent exact duplicates
term_glosses
id uuid PK
term_id uuid FK → terms.id
language_code varchar(10) -- NOT NULL
text text -- NOT NULL
created_at timestamptz DEFAULT now()
language_pairs
id uuid PK
source varchar(10) -- NOT NULL
target varchar(10) -- NOT NULL
label text
active boolean DEFAULT true
UNIQUE (source, target)
decks
id uuid PK
name text -- NOT NULL (e.g. "A1 Italian Nouns", "Most Common 1000")
description text -- NULLABLE
pair_id uuid FK → language_pairs.id -- NULLABLE (for single-language or multi-pair decks)
created_by uuid FK → users.id -- NULLABLE (for system decks)
is_public boolean DEFAULT true
created_at timestamptz DEFAULT now()
deck_terms
deck_id uuid FK → decks.id
term_id uuid FK → terms.id
position smallint -- NOT NULL, ordering within deck (1, 2, 3...)
added_at timestamptz DEFAULT now()
PRIMARY KEY (deck_id, term_id)
users
id uuid PK -- Internal stable ID (FK target)
openauth_sub text UNIQUE -- NOT NULL, OpenAuth `sub` claim (e.g. "google|12345")
email varchar(255) UNIQUE -- NULLABLE (GitHub users may lack email)
display_name varchar(100)
created_at timestamptz DEFAULT now()
last_login_at timestamptz
-- REMOVED: games_played, games_won (derive from room_players)
rooms
id uuid PK
code varchar(8) UNIQUE -- NOT NULL, CHECK (code = UPPER(code))
host_id uuid FK → users.id
pair_id uuid FK → language_pairs.id
deck_id uuid FK → decks.id -- Which vocabulary deck this room uses
status varchar(20) -- NOT NULL, CHECK (status IN ('waiting', 'in_progress', 'finished'))
max_players smallint -- NOT NULL, DEFAULT 4, CHECK (max_players BETWEEN 2 AND 10)
round_count smallint -- NOT NULL, DEFAULT 10, CHECK (round_count BETWEEN 5 AND 20)
created_at timestamptz DEFAULT now()
updated_at timestamptz DEFAULT now() -- For stale room recovery
room_players
room_id uuid FK → rooms.id
user_id uuid FK → users.id
score integer DEFAULT 0 -- Final score only (written at game end)
joined_at timestamptz DEFAULT now()
left_at timestamptz -- Populated on WS disconnect/leave
PRIMARY KEY (room_id, user_id)
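For orientation, the `terms` spec above as a Drizzle table (a sketch; per the TODO note in this commit, the CHECK constraint lives in raw migration SQL rather than in the schema file):
```ts
import { pgTable, uuid, text, varchar, timestamp } from "drizzle-orm/pg-core";

export const terms = pgTable("terms", {
  id: uuid("id").primaryKey().defaultRandom(),
  synset_id: text("synset_id").unique(), // OMW ILI, e.g. "ili:i12345"
  pos: varchar("pos", { length: 20 }).notNull(), // CHECK (pos IN (...)) added in migration
  created_at: timestamp("created_at", { withTimezone: true }).defaultNow(),
});
```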
Indexes
-- Vocabulary
@@ -501,8 +501,6 @@ Tests are co-located with source files (`*.test.ts` / `*.test.tsx`).
- [ ] 10–20 passing tests covering critical paths
- [ ] pnpm workspace build pipeline green
---
## 15. Out of Scope (MVP)