feat(db): add incremental upsert seed script for WordNet vocabulary
Implements packages/db/src/seed.ts — reads all JSON files from scripts/datafiles/, validates filenames against the supported language codes and POS values, and upserts synsets into terms and translations via onConflictDoNothing. Safe to re-run; a duplicate run produces 0 writes.
Parent: 55885336ba · Commit: 2b177aad5b
12 changed files with 1349 additions and 10 deletions
337 documentation/data-seeding-notes.md (new file)
# WordNet Seeding Script — Session Summary

## Project Context

A multiplayer English–Italian vocabulary trainer (Glossa) built as a pnpm monorepo. Vocabulary data comes from the Open Multilingual Wordnet (OMW), is extracted into JSON files, and is then seeded into a PostgreSQL database via Drizzle ORM.

---

## 1. JSON Extraction Format

Each synset extracted from WordNet is represented as:

```json
{
  "synset_id": "ili:i35545",
  "pos": "noun",
  "translations": {
    "en": ["entity"],
    "it": ["cosa", "entità"]
  }
}
```

**Fields:**

- `synset_id` — OMW Interlingual Index ID; maps to `terms.synset_id` in the DB
- `pos` — part of speech; matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within the synset)

**Glosses** are not extracted — the `term_glosses` table exists in the schema for future use but is not needed for the MVP quiz mechanic.
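
Since `JSON.parse` gives back `any` (a pitfall noted in the v1 decisions table), a runtime type guard can reject malformed rows before casting. A minimal sketch, assuming illustrative language-code and POS lists in place of the real `@glossa/shared` constants:

```typescript
// Minimal runtime guard for the extraction format above. The language-code
// and POS lists here are illustrative assumptions, not the real
// @glossa/shared constants.
const LANGS = ["en", "it", "fr", "de"] as const;
const POS_VALUES = ["noun", "verb", "adjective"] as const;

type Synset = {
  synset_id: string;
  pos: (typeof POS_VALUES)[number];
  translations: Partial<Record<(typeof LANGS)[number], string[]>>;
};

const isSynset = (value: unknown): value is Synset => {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.synset_id !== "string") return false;
  if (!POS_VALUES.includes(v.pos as Synset["pos"])) return false;
  if (typeof v.translations !== "object" || v.translations === null) return false;
  // every translations entry must be a known language code mapping to string[]
  return Object.entries(v.translations as Record<string, unknown>).every(
    ([lang, lemmas]) =>
      (LANGS as readonly string[]).includes(lang) &&
      Array.isArray(lemmas) &&
      lemmas.every((l) => typeof l === "string"),
  );
};
```

Filtering a parsed file with `isSynset` turns bad rows into a visible count instead of a mid-run DB error.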
## 2. Database Schema (relevant tables)

```
terms
  id          uuid PK
  synset_id   text UNIQUE
  pos         varchar(20)
  created_at  timestamptz

translations
  id             uuid PK
  term_id        uuid FK → terms.id (CASCADE)
  language_code  varchar(10)
  text           text
  created_at     timestamptz
  UNIQUE (term_id, language_code, text)
```

---

## 3. Seeding Script — v1 (batch, truncate-based)

### Approach

- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate

### Key decisions made during development

| Issue | Resolution |
|-------|-----------|
| `JSON.parse` returns `any` | Added `Array.isArray` check before casting |
| `forEach` doesn't await | Switched to `for...of` |
| Empty array types | Used Drizzle's `$inferInsert` types |
| `translations` naming conflict | Renamed local variable to `translationRows` |
| Final batch not flushed | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500` |
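
The last three table rows amount to one generic pattern: buffer rows, flush when the threshold is reached, and always flush the remainder after the loop. A standalone sketch of that pattern, with an injected `flush` standing in for `uploadToDB`:

```typescript
// Generic buffered-flush sketch of the batching fixes from the table:
// flush on `>= batchSize` (not `=== batchSize`) and always flush the
// remainder after the loop. `flush` stands in for uploadToDB.
const processInBatches = async <T>(
  items: Iterable<T>,
  batchSize: number,
  flush: (batch: T[]) => Promise<void>,
): Promise<number> => {
  let buffer: T[] = [];
  let batches = 0;
  for (const item of items) {
    buffer.push(item);
    if (buffer.length >= batchSize) {
      batches++;
      await flush(buffer);
      buffer = [];
    }
  }
  if (buffer.length > 0) { // the guard that flushes the final partial batch
    batches++;
    await flush(buffer);
  }
  return batches;
};
```

Running it over 5 items with a batch size of 2 produces 3 flushes; dropping the final guard would silently lose the fifth item, which is exactly the bug the table records.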
### Final script structure

```ts
import fs from "node:fs/promises";
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";

type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;

type Synset = {
  synset_id: string;
  pos: POS;
  translations: Record<LANGUAGE_CODE, string[]>;
};

const dataDir = "../../scripts/datafiles/";

const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
  const data = await fs.readFile(filepath, "utf8");
  const parsed = JSON.parse(data);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
  return parsed as Synset[];
};

const uploadToDB = async (
  termsData: TermInsert[],
  translationsData: TranslationInsert[],
) => {
  await db.insert(terms).values(termsData);
  await db.insert(translations).values(translationsData);
};

const main = async () => {
  console.log("Reading JSON file...");
  const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
  console.log(`Loaded ${allSynsets.length} synsets`);

  const termsArray: TermInsert[] = [];
  const translationsArray: TranslationInsert[] = [];
  let batchCount = 0;

  for (const synset of allSynsets) {
    const term = {
      id: crypto.randomUUID(),
      synset_id: synset.synset_id,
      pos: synset.pos,
    };

    const translationRows = Object.entries(synset.translations).flatMap(
      ([lang, lemmas]) =>
        lemmas.map((lemma) => ({
          id: crypto.randomUUID(),
          term_id: term.id,
          language_code: lang as LANGUAGE_CODE,
          text: lemma,
        })),
    );

    translationsArray.push(...translationRows);
    termsArray.push(term);

    if (termsArray.length >= 500) {
      batchCount++;
      console.log(
        `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
      );
      await uploadToDB(termsArray, translationsArray);
      termsArray.length = 0;
      translationsArray.length = 0;
    }
  }

  if (termsArray.length > 0) {
    batchCount++;
    console.log(`Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`);
    await uploadToDB(termsArray, translationsArray);
  }

  console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```
---

## 4. Pitfalls Encountered

### Duplicate key on re-run

Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding.

```bash
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
```

### `onConflictDoNothing` breaks FK references

When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID is never written to the DB. Subsequent `translations` inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding.
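
The failure mode can be reproduced without a database. This sketch models only the two constraints involved, the unique index on `synset_id` and the FK behind `translations.term_id`, as in-memory collections; the IDs are illustrative:

```typescript
// In-memory model of the pitfall: a unique index on synset_id plus a FK
// check from translations to terms. Skipping the conflicting insert (as
// onConflictDoNothing does) leaves the freshly generated UUID dangling.
const termsBySynset = new Map<string, string>(); // synset_id -> term id
const termIds = new Set<string>();

const insertTermDoNothing = (synsetId: string, newId: string): void => {
  if (termsBySynset.has(synsetId)) return; // conflict: row skipped, newId never stored
  termsBySynset.set(synsetId, newId);
  termIds.add(newId);
};

const insertTranslation = (termId: string): void => {
  if (!termIds.has(termId)) throw new Error("FK violation: term_id not found");
};

// First run: term inserted, translation insert succeeds.
insertTermDoNothing("ili:i35545", "uuid-1");
insertTranslation("uuid-1");

// Second run: the conflict is silently skipped, so "uuid-2" is dangling
// and the translation insert fails with the FK violation.
insertTermDoNothing("ili:i35545", "uuid-2");
let fkViolation = false;
try {
  insertTranslation("uuid-2");
} catch {
  fkViolation = true;
}
```

The v2 script avoids this by asking the DB for the real id via `returning` instead of trusting the in-memory UUID.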
### DATABASE_URL misconfigured

Correct format:

```
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
```

### Tables not found after `docker compose up`

Migrations must be applied first: `npx drizzle-kit migrate`

---

## 5. Running the Script

```bash
# Start the DB container
docker compose up -d postgres

# Apply migrations
npx drizzle-kit migrate

# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"

# Run the seed script
npx tsx src/seed-en-it-nouns.ts

# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"
```
---

## 6. Seeding Script — v2 (incremental upsert, multi-file)

### Motivation

The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.

### File naming convention

One JSON file per language pair per POS:

```
scripts/datafiles/
  en-it-nouns.json
  en-fr-nouns.json
  en-it-verbs.json
  de-it-nouns.json
  ...
```

### How incremental upsert works

For a concept like "dog" already in the DB with English and Italian:

1. Import `en-fr-nouns.json`
2. Upsert `terms` by `synset_id` — finds the existing row, returns its real ID
3. `dog (en)` already exists → skipped by `onConflictDoNothing`
4. `chien (fr)` is new → inserted

The concept is **extended**, not replaced.
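
The four steps can be simulated in memory to check the invariant. Here the term table is keyed by `synset_id` and translations are deduplicated on `(term_id, language_code, text)`, mirroring the real constraints; the synset ID and lemmas are illustrative:

```typescript
// In-memory sketch of the incremental upsert: terms are keyed by synset_id
// so a re-import returns the existing id, and translations are deduplicated
// on (term_id, language_code, text). IDs and lemmas are illustrative.
const termIdBySynset = new Map<string, string>();
const translationKeys = new Set<string>();
let nextId = 1;

const upsertTerm = (synsetId: string): string => {
  const existing = termIdBySynset.get(synsetId);
  if (existing !== undefined) return existing; // conflict -> real existing id
  const id = `term-${nextId++}`;
  termIdBySynset.set(synsetId, id);
  return id;
};

const insertTranslationDoNothing = (termId: string, lang: string, text: string): boolean => {
  const key = `${termId}\u0000${lang}\u0000${text}`;
  if (translationKeys.has(key)) return false; // duplicate skipped
  translationKeys.add(key);
  return true;
};

// The en-it import seeds the concept...
const dogId = upsertTerm("ili:dog-example");
insertTranslationDoNothing(dogId, "en", "dog");
insertTranslationDoNothing(dogId, "it", "cane");

// ...and a later en-fr import extends it instead of replacing it.
const sameId = upsertTerm("ili:dog-example");
const enSkipped = insertTranslationDoNothing(sameId, "en", "dog"); // already there
const frInserted = insertTranslationDoNothing(sameId, "fr", "chien"); // new lemma
```

The second upsert returns the same id, the duplicate English lemma is skipped, and only the French lemma is added.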
### Tradeoff vs batch approach

Batching is no longer possible, since the real `term.id` must come back from the DB before the translations can be inserted. Each synset is processed individually. For ~25k rows this is still fast enough.

### Key types added

```ts
type Synset = {
  synset_id: string;
  pos: POS;
  // Partial — each file only contains a subset of languages
  translations: Partial<Record<LANGUAGE_CODE, string[]>>;
};

type FileName = {
  sourceLang: LANGUAGE_CODE;
  targetLang: LANGUAGE_CODE;
  pos: POS;
};
```

### Filename validation

```ts
const parseFilename = (filename: string): FileName => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(
      `Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
    );
  const [sourceLang, targetLang, pos] = parts;
  if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${targetLang}`);
  if (!SUPPORTED_POS.includes(pos as POS))
    throw new Error(`Unsupported POS: ${pos}`);
  return {
    sourceLang: sourceLang as LANGUAGE_CODE,
    targetLang: targetLang as LANGUAGE_CODE,
    pos: pos as POS,
  };
};
```
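
For a quick feel of the validation behavior, here is a self-contained condensation of `parseFilename` with assumed constant values (the real lists live in `@glossa/shared`, and the actual POS spellings may differ):

```typescript
// Self-contained condensation of parseFilename for illustration only.
// The constant values are assumptions standing in for @glossa/shared.
const SUPPORTED_LANGUAGE_CODES = ["en", "it", "fr", "de"] as const;
const SUPPORTED_POS = ["nouns", "verbs", "adjectives"] as const;
type Lang = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type Pos = (typeof SUPPORTED_POS)[number];

const parseFilename = (filename: string): { sourceLang: Lang; targetLang: Lang; pos: Pos } => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(`Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`);
  const [sourceLang, targetLang, pos] = parts;
  if (!(SUPPORTED_LANGUAGE_CODES as readonly string[]).includes(sourceLang))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  if (!(SUPPORTED_LANGUAGE_CODES as readonly string[]).includes(targetLang))
    throw new Error(`Unsupported language code: ${targetLang}`);
  if (!(SUPPORTED_POS as readonly string[]).includes(pos))
    throw new Error(`Unsupported POS: ${pos}`);
  return { sourceLang: sourceLang as Lang, targetLang: targetLang as Lang, pos: pos as Pos };
};

const parsed = parseFilename("en-it-nouns.json"); // { sourceLang: "en", targetLang: "it", pos: "nouns" }
```

A file like `en-xx-nouns.json` is rejected with `Unsupported language code: xx` before any row is read.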
### Upsert function (WIP)

```ts
const upsertSynset = async (
  synset: Synset,
  fileInfo: FileName,
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
  const [upsertedTerm] = await db
    .insert(terms)
    .values({ synset_id: synset.synset_id, pos: synset.pos })
    .onConflictDoUpdate({
      target: terms.synset_id,
      set: { pos: synset.pos },
    })
    .returning({ id: terms.id, created_at: terms.created_at });

  // Heuristic: treat the term as newly inserted if its created_at is within
  // the last second (an update returns the original, older timestamp).
  const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);

  const translationRows = Object.entries(synset.translations).flatMap(
    ([lang, lemmas]) =>
      lemmas!.map((lemma) => ({
        id: crypto.randomUUID(),
        term_id: upsertedTerm.id,
        language_code: lang as LANGUAGE_CODE,
        text: lemma,
      })),
  );

  const result = await db
    .insert(translations)
    .values(translationRows)
    .onConflictDoNothing()
    .returning({ id: translations.id });

  return { termInserted, translationsInserted: result.length };
};
```
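
The driver loop around `upsertSynset` is not shown in this summary. Based on the commit description (read all JSON files from `scripts/datafiles/`, validate, upsert), it could be sketched as follows; the upsert is injected so the loop can run without a DB, and all names here are assumptions rather than the actual seed.ts:

```typescript
import fs from "node:fs/promises";
import path from "node:path";

// Hedged sketch of the v2 driver: scan the data directory, read each JSON
// file, and feed every synset to an upsert function. The upsert is injected
// so the loop can run without a DB; the real script would pass upsertSynset
// and also run parseFilename on each file name first.
type UpsertFn = (synset: unknown, file: string) => Promise<void>;

const seedAll = async (dataDir: string, upsert: UpsertFn): Promise<number> => {
  const files = (await fs.readdir(dataDir)).filter((f) => f.endsWith(".json"));
  let total = 0;
  for (const file of files) {
    const raw = await fs.readFile(path.join(dataDir, file), "utf8");
    const synsets = JSON.parse(raw);
    if (!Array.isArray(synsets)) throw new Error(`Expected a JSON array in ${file}`);
    for (const synset of synsets) {
      await upsert(synset, file); // per-row upsert: no batching, by design
      total++;
    }
  }
  return total;
};
```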
---

## 7. Strategy Comparison

| Strategy | Use case | Pros | Cons |
|----------|----------|------|------|
| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower |
| Migrations-as-data | Production audit trail | Clean history | Files accumulate |
| Diff-based sync | Large production datasets | Minimal writes | Complex to implement |

---

## 8. packages/db — package.json exports fix

The `exports` field must be an object, not an array:

```json
"exports": {
  ".": "./src/index.ts",
  "./schema": "./src/db/schema.ts"
}
```

Imports then resolve as:

```ts
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
```
@ -6,7 +6,7 @@

- add this to drizzle migrations file:
  ✅ ALTER TABLE terms ADD CHECK (pos IN ('noun', 'verb', 'adjective', etc));

## open word net
## openwordnet

download libraries via
@ -45,3 +45,17 @@ list all libraries:

```bash
python -c "import wn; print(wn.lexicons())"
```

## drizzle

generate migration file, go to packages/db, then:

```bash
pnpm drizzle-kit generate
```

execute migration, go to packages/db (docker containers need to be running):

```bash
DATABASE_URL=postgresql://username:password@localhost:5432/database pnpm drizzle-kit migrate
```
@ -26,17 +26,17 @@ Done when: `GET /api/decks/1/terms?limit=10` returns 10 terms from a specific deck

[x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json`
-- Import ALL available OMW noun synsets (no frequency filtering)
[ ] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
[ ] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
[ ] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
[ ] Write `scripts/build_decks.ts` (reads external CEFR lists, matches to DB, creates decks)
[x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
[x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
[x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
[ ] Download CEFR A1/A2 noun lists (from GitHub repos)
[ ] Write `scripts/build_decks.ts` (reads external CEFR lists, matches to DB, creates decks)
[ ] Run `pnpm db:seed` → populates terms
[ ] Run `pnpm db:build-decks` → creates curated decks
[ ] Define Zod response schemas in `packages/shared`
[ ] Implement `DeckRepository.getTerms(deckId, limit, offset)`
[ ] Implement `QuizService.attachDistractors(terms)` — same POS, server-side, no duplicates
[ ] Implement `GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms` endpoints
[ ] Define Zod response schemas in `packages/shared`
[ ] Unit tests for `QuizService` (correct POS filtering, never includes the answer)
[ ] update decisions.md
@ -205,7 +205,6 @@ term_glosses

  term_id        uuid FK → terms.id
  language_code  varchar(10)  -- NOT NULL
  text           text         -- NOT NULL
  type           varchar(20)  -- CHECK (type IN ('definition', 'example')), NULLABLE
  created_at     timestamptz DEFAULT now()

language_pairs