# WordNet Seeding Script — Session Summary

## Project Context

A multiplayer English–Italian vocabulary trainer (Glossa) built with a pnpm monorepo. Vocabulary data comes from the Open Multilingual Wordnet (OMW) and is extracted into JSON files, then seeded into a PostgreSQL database via Drizzle ORM.

---

## 1. JSON Extraction Format

Each synset extracted from WordNet is represented as:

```json
{
  "synset_id": "ili:i35545",
  "pos": "noun",
  "translations": {
    "en": ["entity"],
    "it": ["cosa", "entità"]
  }
}
```

**Fields:**

- `synset_id` — OMW Interlingual Index ID; maps to `terms.synset_id` in the DB
- `pos` — part of speech; matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within a synset)

**Glosses** are not extracted — the `term_glosses` table exists in the schema for future use but is not needed for the MVP quiz mechanic.

---

## 2. Database Schema (relevant tables)

```
terms
  id          uuid PK
  synset_id   text UNIQUE
  pos         varchar(20)
  created_at  timestamptz

translations
  id             uuid PK
  term_id        uuid FK → terms.id (CASCADE)
  language_code  varchar(10)
  text           text
  created_at     timestamptz
  UNIQUE (term_id, language_code, text)
```

---

## 3. Seeding Script — v1 (batch, truncate-based)

### Approach

- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate

### Key decisions made during development

| Issue                            | Resolution                                          |
| -------------------------------- | --------------------------------------------------- |
| `JSON.parse` returns `any`       | Added `Array.isArray` check before casting          |
| `forEach` doesn't await          | Switched to `for...of`                              |
| Empty array types                | Used Drizzle's `$inferInsert` types                 |
| `translations` naming conflict   | Renamed local variable to `translationRows`         |
| Final batch not flushed          | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500`                                 |

### Final script structure

```ts
import fs from "node:fs/promises";
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";

type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;

type Synset = {
  synset_id: string;
  pos: POS;
  translations: Record<LANGUAGE_CODE, string[]>;
};

const dataDir = "../../scripts/datafiles/";

const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
  const data = await fs.readFile(filepath, "utf8");
  const parsed = JSON.parse(data);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
  return parsed as Synset[];
};

const uploadToDB = async (
  termsData: TermInsert[],
  translationsData: TranslationInsert[],
) => {
  await db.insert(terms).values(termsData);
  await db.insert(translations).values(translationsData);
};

const main = async () => {
  console.log("Reading JSON file...");
  const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
  console.log(`Loaded ${allSynsets.length} synsets`);

  const termsArray: TermInsert[] = [];
  const translationsArray: TranslationInsert[] = [];
  let batchCount = 0;

  for (const synset of allSynsets) {
    const term = {
      id: crypto.randomUUID(),
      synset_id: synset.synset_id,
      pos: synset.pos,
    };

    const translationRows = Object.entries(synset.translations).flatMap(
      ([lang, lemmas]) =>
        lemmas.map((lemma) => ({
          id: crypto.randomUUID(),
          term_id: term.id,
          language_code: lang as LANGUAGE_CODE,
          text: lemma,
        })),
    );

    translationsArray.push(...translationRows);
    termsArray.push(term);

    if (termsArray.length >= 500) {
      batchCount++;
      console.log(
        `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
      );
      await uploadToDB(termsArray, translationsArray);
      termsArray.length = 0;
      translationsArray.length = 0;
    }
  }

  if (termsArray.length > 0) {
    batchCount++;
    console.log(
      `Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`,
    );
    await uploadToDB(termsArray, translationsArray);
  }

  console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```

---

## 4. Pitfalls Encountered

### Duplicate key on re-run

Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding.

```bash
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
```

### `onConflictDoNothing` breaks FK references

When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID is never written to the DB. Subsequent `translations` inserts then reference that non-existent UUID, causing an FK violation. This is why the truncate approach is correct for batch seeding.

### DATABASE_URL misconfigured

Correct format:

```
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
```

### Tables not found after `docker compose up`

Migrations must be applied first: `npx drizzle-kit migrate`

---

## 5. Running the Script

```bash
# Start the DB container
docker compose up -d postgres

# Apply migrations
npx drizzle-kit migrate

# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"

# Run the seed script
npx tsx src/seed-en-it-nouns.ts

# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"
```

---

## 6. Seeding Script — v2 (incremental upsert, multi-file)

### Motivation

The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.

### File naming convention

One JSON file per language pair per POS:

```
scripts/datafiles/
  en-it-nouns.json
  en-fr-nouns.json
  en-it-verbs.json
  de-it-nouns.json
  ...
```

### How incremental upsert works

For a concept like "dog" already in the DB with English and Italian:

1. Import `en-fr-nouns.json`
2. Upsert `terms` by `synset_id` — finds the existing row, returns its real ID
3. `dog (en)` already exists → skipped by `onConflictDoNothing`
4. `chien (fr)` is new → inserted

The concept is **extended**, not replaced.

### Tradeoff vs batch approach

Batching is no longer possible, since the real `term.id` from the DB is needed before the translations can be inserted. Each synset is processed individually. For 25k rows this is still fast enough.

### Key types added

```ts
type Synset = {
  synset_id: string;
  pos: POS;
  // Partial — each file contains only a subset of languages
  translations: Partial<Record<LANGUAGE_CODE, string[]>>;
};

type FileName = {
  sourceLang: LANGUAGE_CODE;
  targetLang: LANGUAGE_CODE;
  pos: POS;
};
```

### Filename validation

```ts
const parseFilename = (filename: string): FileName => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(
      `Invalid filename format: ${filename}.
Expected: sourcelang-targetlang-pos.json`,
    );

  const [sourceLang, targetLang, pos] = parts;

  if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${targetLang}`);
  if (!SUPPORTED_POS.includes(pos as POS))
    throw new Error(`Unsupported POS: ${pos}`);

  return {
    sourceLang: sourceLang as LANGUAGE_CODE,
    targetLang: targetLang as LANGUAGE_CODE,
    pos: pos as POS,
  };
};
```

### Upsert function (WIP)

```ts
const upsertSynset = async (
  synset: Synset,
  fileInfo: FileName,
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
  const [upsertedTerm] = await db
    .insert(terms)
    .values({ synset_id: synset.synset_id, pos: synset.pos })
    .onConflictDoUpdate({ target: terms.synset_id, set: { pos: synset.pos } })
    .returning({ id: terms.id, created_at: terms.created_at });

  // WIP heuristic: treat a row created within the last second as newly inserted
  const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);

  const translationRows = Object.entries(synset.translations).flatMap(
    ([lang, lemmas]) =>
      (lemmas ?? []).map((lemma) => ({
        id: crypto.randomUUID(),
        term_id: upsertedTerm.id,
        language_code: lang as LANGUAGE_CODE,
        text: lemma,
      })),
  );

  // Guard: never call .values() with an empty array
  if (translationRows.length === 0)
    return { termInserted, translationsInserted: 0 };

  const result = await db
    .insert(translations)
    .values(translationRows)
    .onConflictDoNothing()
    .returning({ id: translations.id });

  return { termInserted, translationsInserted: result.length };
};
```

---

## 7. Strategy Comparison

| Strategy           | Use case                      | Pros                  | Cons                 |
| ------------------ | ----------------------------- | --------------------- | -------------------- |
| Truncate + batch   | Dev / first-time setup        | Fast, simple          | Wipes all data       |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower  |
| Migrations-as-data | Production audit trail        | Clean history         | Files accumulate     |
| Diff-based sync    | Large production datasets     | Minimal writes        | Complex to implement |

---

## 8. packages/db — package.json exports fix

The `exports` field must be an object, not an array:

```json
"exports": {
  ".": "./src/index.ts",
  "./schema": "./src/db/schema.ts"
}
```

Imports then resolve as:

```ts
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
```
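A closing aside on the `JSON.parse` fix from §3: the `Array.isArray` check validates only the top level, so a malformed element would still slip through to the DB insert. A fuller runtime guard for the §1 format could look like the sketch below. The literal `POS_LIST` / `LANG_LIST` arrays are illustrative stand-ins for `SUPPORTED_POS` and `SUPPORTED_LANGUAGE_CODES` from `@glossa/shared`, which this self-contained sketch does not import.

```typescript
// Stand-ins for the @glossa/shared constants (assumed shapes, not the real lists)
const POS_LIST = ["noun", "verb", "adjective", "adverb"] as const;
const LANG_LIST = ["en", "it", "fr", "de"] as const;

type POS = (typeof POS_LIST)[number];
type LanguageCode = (typeof LANG_LIST)[number];

type Synset = {
  synset_id: string;
  pos: POS;
  translations: Partial<Record<LanguageCode, string[]>>;
};

const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((x) => typeof x === "string");

// Type guard: checks one parsed element against the §1 extraction format
const isSynset = (v: unknown): v is Synset => {
  if (typeof v !== "object" || v === null) return false;
  const o = v as Record<string, unknown>;
  return (
    typeof o.synset_id === "string" &&
    POS_LIST.includes(o.pos as POS) &&
    typeof o.translations === "object" &&
    o.translations !== null &&
    !Array.isArray(o.translations) &&
    Object.entries(o.translations).every(
      ([lang, lemmas]) =>
        LANG_LIST.includes(lang as LanguageCode) && isStringArray(lemmas),
    )
  );
};
```

Inside `readFromJsonFile`, `parsed.every(isSynset)` could then replace the bare `as Synset[]` cast, turning a bad datafile into an early, descriptive failure instead of a mid-seed DB error.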