Implements packages/db/src/seed.ts — reads all JSON files from scripts/datafiles/, validates filenames against supported language codes and POS, and upserts synsets into `terms` and `translations` via `onConflictDoNothing`. Safe to re-run; a duplicate run produces 0 writes.
WordNet Seeding Script — Session Summary
Project Context
A multiplayer English–Italian vocabulary trainer (Glossa) built with a pnpm monorepo. Vocabulary data comes from Open Multilingual Wordnet (OMW) and is extracted into JSON files, then seeded into a PostgreSQL database via Drizzle ORM.
1. JSON Extraction Format
Each synset extracted from WordNet is represented as:
{
"synset_id": "ili:i35545",
"pos": "noun",
"translations": {
"en": ["entity"],
"it": ["cosa", "entità"]
}
}
Fields:
- `synset_id` — OMW Interlingual Index ID; maps to `terms.synset_id` in the DB
- `pos` — part of speech; matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within a synset)
Glosses are not extracted — the term_glosses table exists in the schema for future use but is not needed for the MVP quiz mechanic.
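The expected shape can also be checked at runtime before seeding. A minimal type-guard sketch — the guard below is an illustration, not part of the actual script, which only checks `Array.isArray` before casting:

```typescript
// Runtime guard for one extracted synset object (illustrative).
type RawSynset = {
  synset_id: string;
  pos: string;
  translations: Record<string, string[]>;
};

function isRawSynset(value: unknown): value is RawSynset {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.synset_id === "string" &&
    typeof v.pos === "string" &&
    typeof v.translations === "object" &&
    v.translations !== null &&
    // Every entry must be a language code mapped to an array of strings.
    Object.values(v.translations as Record<string, unknown>).every(
      (lemmas) => Array.isArray(lemmas) && lemmas.every((l) => typeof l === "string"),
    )
  );
}
```

Filtering the parsed array through this guard would turn a malformed extraction file into a clear error instead of a mid-seed database failure.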
2. Database Schema (relevant tables)
terms
id uuid PK
synset_id text UNIQUE
pos varchar(20)
created_at timestamptz
translations
id uuid PK
term_id uuid FK → terms.id (CASCADE)
language_code varchar(10)
text text
created_at timestamptz
UNIQUE (term_id, language_code, text)
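In Drizzle terms, the two tables above could be declared roughly as follows. This is a sketch reconstructed from the column layout shown here — the project's actual schema file may differ in helper choices and naming:

```typescript
import { pgTable, uuid, text, varchar, timestamp, unique } from "drizzle-orm/pg-core";

export const terms = pgTable("terms", {
  id: uuid("id").primaryKey().defaultRandom(),
  synset_id: text("synset_id").notNull().unique(),
  pos: varchar("pos", { length: 20 }).notNull(),
  created_at: timestamp("created_at", { withTimezone: true }).defaultNow(),
});

export const translations = pgTable(
  "translations",
  {
    id: uuid("id").primaryKey().defaultRandom(),
    // CASCADE delete: removing a term removes its translations.
    term_id: uuid("term_id")
      .notNull()
      .references(() => terms.id, { onDelete: "cascade" }),
    language_code: varchar("language_code", { length: 10 }).notNull(),
    text: text("text").notNull(),
    created_at: timestamp("created_at", { withTimezone: true }).defaultNow(),
  },
  (t) => ({
    // Matches UNIQUE (term_id, language_code, text) above.
    uniqueLemma: unique().on(t.term_id, t.language_code, t.text),
  }),
);
```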
3. Seeding Script — v1 (batch, truncate-based)
Approach
- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate
Key decisions made during development
| Issue | Resolution |
|---|---|
| `JSON.parse` returns `any` | Added `Array.isArray` check before casting |
| `forEach` doesn't await | Switched to `for...of` |
| Empty array types | Used Drizzle's `$inferInsert` types |
| `translations` naming conflict | Renamed local variable to `translationRows` |
| Final batch not flushed | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500` |
Final script structure
import fs from "node:fs/promises";
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;
type Synset = {
synset_id: string;
pos: POS;
translations: Record<LANGUAGE_CODE, string[]>;
};
const dataDir = "../../scripts/datafiles/";
const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
const data = await fs.readFile(filepath, "utf8");
const parsed = JSON.parse(data);
if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
return parsed as Synset[];
};
const uploadToDB = async (
termsData: TermInsert[],
translationsData: TranslationInsert[],
) => {
await db.insert(terms).values(termsData);
await db.insert(translations).values(translationsData);
};
const main = async () => {
console.log("Reading JSON file...");
const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
console.log(`Loaded ${allSynsets.length} synsets`);
const termsArray: TermInsert[] = [];
const translationsArray: TranslationInsert[] = [];
let batchCount = 0;
for (const synset of allSynsets) {
const term = {
id: crypto.randomUUID(),
synset_id: synset.synset_id,
pos: synset.pos,
};
const translationRows = Object.entries(synset.translations).flatMap(
([lang, lemmas]) =>
lemmas.map((lemma) => ({
id: crypto.randomUUID(),
term_id: term.id,
language_code: lang as LANGUAGE_CODE,
text: lemma,
})),
);
translationsArray.push(...translationRows);
termsArray.push(term);
if (termsArray.length >= 500) {
batchCount++;
console.log(`Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`);
await uploadToDB(termsArray, translationsArray);
termsArray.length = 0;
translationsArray.length = 0;
}
}
if (termsArray.length > 0) {
batchCount++;
console.log(`Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`);
await uploadToDB(termsArray, translationsArray);
}
console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};
main().catch((error) => {
console.error(error);
process.exit(1);
});
4. Pitfalls Encountered
Duplicate key on re-run
Running the script twice causes duplicate key value violates unique constraint "terms_synset_id_unique". Fix: truncate before seeding.
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
onConflictDoNothing breaks FK references
When onConflictDoNothing skips a terms insert, the in-memory UUID is never written to the DB. Subsequent translations inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding.
DATABASE_URL misconfigured
Correct format:
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
Tables not found after docker compose up
Migrations must be applied first: npx drizzle-kit migrate
5. Running the Script
# Start the DB container
docker compose up -d postgres
# Apply migrations
npx drizzle-kit migrate
# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
# Run the seed script
npx tsx src/seed-en-it-nouns.ts
# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"
6. Seeding Script — v2 (incremental upsert, multi-file)
Motivation
The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.
File naming convention
One JSON file per language pair per POS:
scripts/datafiles/
en-it-nouns.json
en-fr-nouns.json
en-it-verbs.json
de-it-nouns.json
...
How incremental upsert works
For a concept like "dog" already in the DB with English and Italian:
- Import `en-fr-nouns.json`
- Upsert `terms` by `synset_id` — finds the existing row, returns its real ID
- `dog (en)` already exists → skipped by `onConflictDoNothing`
- `chien (fr)` is new → inserted
The concept is extended, not replaced.
Tradeoff vs batch approach
Batching is no longer possible since you need the real term.id from the DB before inserting translations. Each synset is processed individually. For 25k rows this is still fast enough.
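If per-synset round-trips ever become a bottleneck, a small concurrency limiter can recover throughput without reintroducing batching. A generic sketch — this helper is an assumption, not part of the script:

```typescript
// Run an async worker over items with at most `limit` calls in flight at once.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each runner repeatedly claims the next unprocessed index.
  const runners = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  });
  await Promise.all(runners);
  return results;
}
```

Seeding would then look like `await mapWithConcurrency(allSynsets, 10, (s) => upsertSynset(s, fileInfo))`, trading the strict one-at-a-time ordering for a bounded number of parallel upserts.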
Key types added
type Synset = {
synset_id: string;
pos: POS;
translations: Partial<Record<LANGUAGE_CODE, string[]>>; // Partial — file only contains subset of languages
};
type FileName = {
sourceLang: LANGUAGE_CODE;
targetLang: LANGUAGE_CODE;
pos: POS;
};
Filename validation
const parseFilename = (filename: string): FileName => {
const parts = filename.replace(".json", "").split("-");
if (parts.length !== 3)
throw new Error(`Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`);
const [sourceLang, targetLang, pos] = parts;
if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
throw new Error(`Unsupported language code: ${sourceLang}`);
if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
throw new Error(`Unsupported language code: ${targetLang}`);
if (!SUPPORTED_POS.includes(pos as POS))
throw new Error(`Unsupported POS: ${pos}`);
return {
sourceLang: sourceLang as LANGUAGE_CODE,
targetLang: targetLang as LANGUAGE_CODE,
pos: pos as POS,
};
};
Upsert function (WIP)
const upsertSynset = async (
synset: Synset,
fileInfo: FileName,
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
const [upsertedTerm] = await db
.insert(terms)
.values({ synset_id: synset.synset_id, pos: synset.pos })
.onConflictDoUpdate({
target: terms.synset_id,
set: { pos: synset.pos },
})
.returning({ id: terms.id, created_at: terms.created_at });
// Heuristic: a row created within the last second is assumed to be a fresh insert.
// Fragile under clock skew or slow transactions, but acceptable for a seed script.
const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);
const translationRows = Object.entries(synset.translations).flatMap(
([lang, lemmas]) =>
lemmas!.map((lemma) => ({
id: crypto.randomUUID(),
term_id: upsertedTerm.id,
language_code: lang as LANGUAGE_CODE,
text: lemma,
})),
);
const result = await db
.insert(translations)
.values(translationRows)
.onConflictDoNothing()
.returning({ id: translations.id });
return { termInserted, translationsInserted: result.length };
};
7. Strategy Comparison
| Strategy | Use case | Pros | Cons |
|---|---|---|---|
| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower |
| Migrations-as-data | Production audit trail | Clean history | Files accumulate |
| Diff-based sync | Large production datasets | Minimal writes | Complex to implement |
8. packages/db — package.json exports fix
The exports field must be an object, not an array:
"exports": {
".": "./src/index.ts",
"./schema": "./src/db/schema.ts"
}
Imports then resolve as:
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";