Implements packages/db/src/seed.ts — reads all JSON files from scripts/datafiles/, validates filenames against supported language codes and POS, and upserts synsets into `terms` and `translations` via `onConflictDoNothing`. Safe to re-run; a duplicate run produces 0 writes.
WordNet Seeding Script — Session Summary
Project Context
A multiplayer English–Italian vocabulary trainer (Glossa) built with a pnpm monorepo. Vocabulary data comes from Open Multilingual Wordnet (OMW) and is extracted into JSON files, then seeded into a PostgreSQL database via Drizzle ORM.
1. JSON Extraction Format
Each synset extracted from WordNet is represented as:
{
"synset_id": "ili:i35545",
"pos": "noun",
"translations": {
"en": ["entity"],
"it": ["cosa", "entità"]
}
}
Fields:
- `synset_id` — OMW Interlingual Index ID; maps to `terms.synset_id` in the DB
- `pos` — part of speech; matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within a synset)
Glosses are not extracted — the term_glosses table exists in the schema for future use but is not needed for the MVP quiz mechanic.
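The expected shape can also be checked at runtime before seeding. A minimal type-guard sketch — the guard below is an illustration, not part of the actual script, which only checks `Array.isArray` before casting:

```typescript
// Runtime guard for one extracted synset object (illustrative).
type RawSynset = {
  synset_id: string;
  pos: string;
  translations: Record<string, string[]>;
};

function isRawSynset(value: unknown): value is RawSynset {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.synset_id === "string" &&
    typeof v.pos === "string" &&
    typeof v.translations === "object" &&
    v.translations !== null &&
    // Every entry must be a language code mapped to an array of strings.
    Object.values(v.translations as Record<string, unknown>).every(
      (lemmas) => Array.isArray(lemmas) && lemmas.every((l) => typeof l === "string"),
    )
  );
}
```

Filtering the parsed array through this guard would turn a malformed extraction file into a clear error instead of a mid-seed database failure.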
2. Database Schema (relevant tables)
terms
id uuid PK
synset_id text UNIQUE
pos varchar(20)
created_at timestamptz
translations
id uuid PK
term_id uuid FK → terms.id (CASCADE)
language_code varchar(10)
text text
created_at timestamptz
UNIQUE (term_id, language_code, text)
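In Drizzle terms, the two tables above could be declared roughly as follows. This is a sketch reconstructed from the column layout shown here — the project's actual schema file may differ in helper choices and naming:

```typescript
import { pgTable, uuid, text, varchar, timestamp, unique } from "drizzle-orm/pg-core";

export const terms = pgTable("terms", {
  id: uuid("id").primaryKey().defaultRandom(),
  synset_id: text("synset_id").notNull().unique(),
  pos: varchar("pos", { length: 20 }).notNull(),
  created_at: timestamp("created_at", { withTimezone: true }).defaultNow(),
});

export const translations = pgTable(
  "translations",
  {
    id: uuid("id").primaryKey().defaultRandom(),
    // CASCADE delete: removing a term removes its translations.
    term_id: uuid("term_id")
      .notNull()
      .references(() => terms.id, { onDelete: "cascade" }),
    language_code: varchar("language_code", { length: 10 }).notNull(),
    text: text("text").notNull(),
    created_at: timestamp("created_at", { withTimezone: true }).defaultNow(),
  },
  (t) => ({
    // Matches UNIQUE (term_id, language_code, text) above.
    uniqueLemma: unique().on(t.term_id, t.language_code, t.text),
  }),
);
```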
3. Seeding Script — v1 (batch, truncate-based)
Approach
- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate
Key decisions made during development
| Issue | Resolution |
|---|---|
| `JSON.parse` returns `any` | Added `Array.isArray` check before casting |
| `forEach` doesn't await | Switched to `for...of` |
| Empty array types | Used Drizzle's `$inferInsert` types |
| `translations` naming conflict | Renamed local variable to `translationRows` |
| Final batch not flushed | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500` |
Final script structure
import fs from "node:fs/promises";
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;
type Synset = {
synset_id: string;
pos: POS;
translations: Record<LANGUAGE_CODE, string[]>;
};
const dataDir = "../../scripts/datafiles/";
const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
const data = await fs.readFile(filepath, "utf8");
const parsed = JSON.parse(data);
if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
return parsed as Synset[];
};
const uploadToDB = async (
termsData: TermInsert[],
translationsData: TranslationInsert[],
) => {
await db.insert(terms).values(termsData);
await db.insert(translations).values(translationsData);
};
const main = async () => {
console.log("Reading JSON file...");
const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
console.log(`Loaded ${allSynsets.length} synsets`);
const termsArray: TermInsert[] = [];
const translationsArray: TranslationInsert[] = [];
let batchCount = 0;
for (const synset of allSynsets) {
const term = {
id: crypto.randomUUID(),
synset_id: synset.synset_id,
pos: synset.pos,
};
const translationRows = Object.entries(synset.translations).flatMap(
([lang, lemmas]) =>
lemmas.map((lemma) => ({
id: crypto.randomUUID(),
term_id: term.id,
language_code: lang as LANGUAGE_CODE,
text: lemma,
})),
);
translationsArray.push(...translationRows);
termsArray.push(term);
if (termsArray.length >= 500) {
batchCount++;
console.log(`Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`);
await uploadToDB(termsArray, translationsArray);
termsArray.length = 0;
translationsArray.length = 0;
}
}
if (termsArray.length > 0) {
batchCount++;
console.log(`Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`);
await uploadToDB(termsArray, translationsArray);
}
console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};
main().catch((error) => {
console.error(error);
process.exit(1);
});
4. Pitfalls Encountered
Duplicate key on re-run
Running the script twice causes duplicate key value violates unique constraint "terms_synset_id_unique". Fix: truncate before seeding.
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
onConflictDoNothing breaks FK references
When onConflictDoNothing skips a terms insert, the in-memory UUID is never written to the DB. Subsequent translations inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding.
DATABASE_URL misconfigured
Correct format:
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
Tables not found after docker compose up
Migrations must be applied first: npx drizzle-kit migrate
5. Running the Script
# Start the DB container
docker compose up -d postgres
# Apply migrations
npx drizzle-kit migrate
# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
# Run the seed script
npx tsx src/seed-en-it-nouns.ts
# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"
6. Seeding Script — v2 (incremental upsert, multi-file)
Motivation
The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.
File naming convention
One JSON file per language pair per POS:
scripts/datafiles/
en-it-nouns.json
en-fr-nouns.json
en-it-verbs.json
de-it-nouns.json
...
How incremental upsert works
For a concept like "dog" already in the DB with English and Italian:
- Import `en-fr-nouns.json`
- Upsert `terms` by `synset_id` — finds the existing row, returns its real ID
- `dog (en)` already exists → skipped by `onConflictDoNothing`
- `chien (fr)` is new → inserted
The concept is extended, not replaced.
Tradeoff vs batch approach
Batching is no longer possible since you need the real term.id from the DB before inserting translations. Each synset is processed individually. For 25k rows this is still fast enough.
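If per-synset round-trips ever become a bottleneck, a small concurrency limiter can recover throughput without reintroducing batching. A generic sketch — this helper is an assumption, not part of the script:

```typescript
// Run an async worker over items with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each runner repeatedly claims the next unprocessed index.
  const runners = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  });
  await Promise.all(runners);
  return results;
}
```

Seeding would then look like `await mapWithConcurrency(allSynsets, 10, (s) => upsertSynset(s, fileInfo))`, trading the strict one-at-a-time ordering for a bounded number of parallel upserts.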
Key types added
type Synset = {
synset_id: string;
pos: POS;
translations: Partial<Record<LANGUAGE_CODE, string[]>>; // Partial — file only contains subset of languages
};
type FileName = {
sourceLang: LANGUAGE_CODE;
targetLang: LANGUAGE_CODE;
pos: POS;
};
Filename validation
const parseFilename = (filename: string): FileName => {
const parts = filename.replace(".json", "").split("-");
if (parts.length !== 3)
throw new Error(`Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`);
const [sourceLang, targetLang, pos] = parts;
if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
throw new Error(`Unsupported language code: ${sourceLang}`);
if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
throw new Error(`Unsupported language code: ${targetLang}`);
if (!SUPPORTED_POS.includes(pos as POS))
throw new Error(`Unsupported POS: ${pos}`);
return {
sourceLang: sourceLang as LANGUAGE_CODE,
targetLang: targetLang as LANGUAGE_CODE,
pos: pos as POS,
};
};
Upsert function (WIP)
const upsertSynset = async (
synset: Synset,
fileInfo: FileName,
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
const [upsertedTerm] = await db
.insert(terms)
.values({ synset_id: synset.synset_id, pos: synset.pos })
.onConflictDoUpdate({
target: terms.synset_id,
set: { pos: synset.pos },
})
.returning({ id: terms.id, created_at: terms.created_at });
// Heuristic: a row created within the last second is assumed to be a fresh insert.
// Fragile under clock skew or slow transactions, but acceptable for a seed script.
const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);
const translationRows = Object.entries(synset.translations).flatMap(
([lang, lemmas]) =>
lemmas!.map((lemma) => ({
id: crypto.randomUUID(),
term_id: upsertedTerm.id,
language_code: lang as LANGUAGE_CODE,
text: lemma,
})),
);
const result = await db
.insert(translations)
.values(translationRows)
.onConflictDoNothing()
.returning({ id: translations.id });
return { termInserted, translationsInserted: result.length };
};
7. Strategy Comparison
| Strategy | Use case | Pros | Cons |
|---|---|---|---|
| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower |
| Migrations-as-data | Production audit trail | Clean history | Files accumulate |
| Diff-based sync | Large production datasets | Minimal writes | Complex to implement |
8. packages/db — package.json exports fix
The exports field must be an object, not an array:
"exports": {
".": "./src/index.ts",
"./schema": "./src/db/schema.ts"
}
Imports then resolve as:
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";