lila/documentation/data-seeding-notes.md
2026-03-31 10:06:06 +02:00


WordNet Seeding Script — Session Summary

Project Context

A multiplayer English-Italian vocabulary trainer (Glossa) built as a pnpm monorepo. Vocabulary data comes from the Open Multilingual Wordnet (OMW): it is extracted into JSON files, then seeded into a PostgreSQL database via Drizzle ORM.


1. JSON Extraction Format

Each synset extracted from WordNet is represented as:

{
  "synset_id": "ili:i35545",
  "pos": "noun",
  "translations": { "en": ["entity"], "it": ["cosa", "entità"] }
}

Fields:

  • synset_id — OMW Interlingual Index ID, maps to terms.synset_id in the DB
  • pos — part of speech, matches the CHECK constraint on terms.pos
  • translations — object of language code → array of lemmas (synonyms within a synset)

Glosses are not extracted — the term_glosses table exists in the schema for future use but is not needed for the MVP quiz mechanic.
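Beyond checking that the parsed JSON is an array, each entry's shape can be validated before casting. A minimal type-guard sketch (the loose `string` types here are illustrative — the real script narrows `pos` and language codes via the shared constants shown later):

```typescript
type Synset = {
  synset_id: string;
  pos: string;
  translations: Record<string, string[]>;
};

// Narrow an unknown parsed value to the Synset shape described above.
function isSynset(value: unknown): value is Synset {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.synset_id !== "string" || typeof v.pos !== "string") return false;
  if (typeof v.translations !== "object" || v.translations === null) return false;
  // Every language entry must be an array of strings (lemmas).
  return Object.values(v.translations).every(
    (lemmas) => Array.isArray(lemmas) && lemmas.every((l) => typeof l === "string"),
  );
}
```

Entries failing the guard could then be collected and reported instead of aborting the whole run.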


2. Database Schema (relevant tables)

terms
  id          uuid PK
  synset_id   text UNIQUE
  pos         varchar(20)
  created_at  timestamptz

translations
  id            uuid PK
  term_id       uuid FK → terms.id (CASCADE)
  language_code varchar(10)
  text          text
  created_at    timestamptz
  UNIQUE (term_id, language_code, text)

3. Seeding Script — v1 (batch, truncate-based)

Approach

  • Read a single JSON file
  • Batch inserts into terms and translations in groups of 500
  • Truncate tables before each run for a clean slate

Key decisions made during development

| Issue | Resolution |
| --- | --- |
| JSON.parse returns any | Added an Array.isArray check before casting |
| forEach doesn't await | Switched to for...of |
| Empty array types | Used Drizzle's $inferInsert types |
| translations naming conflict | Renamed the local variable to translationRows |
| Final batch not flushed | Added an if (termsArray.length > 0) guard after the loop |
| Exact batch size check === 500 | Changed to >= 500 |

Final script structure

import fs from "node:fs/promises";
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";

type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;
type Synset = {
  synset_id: string;
  pos: POS;
  translations: Record<LANGUAGE_CODE, string[]>;
};

const dataDir = "../../scripts/datafiles/";

const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
  const data = await fs.readFile(filepath, "utf8");
  const parsed = JSON.parse(data);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
  return parsed as Synset[];
};

const uploadToDB = async (
  termsData: TermInsert[],
  translationsData: TranslationInsert[],
) => {
  await db.insert(terms).values(termsData);
  await db.insert(translations).values(translationsData);
};

const main = async () => {
  console.log("Reading JSON file...");
  const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
  console.log(`Loaded ${allSynsets.length} synsets`);

  const termsArray: TermInsert[] = [];
  const translationsArray: TranslationInsert[] = [];
  let batchCount = 0;

  for (const synset of allSynsets) {
    const term = {
      id: crypto.randomUUID(),
      synset_id: synset.synset_id,
      pos: synset.pos,
    };

    const translationRows = Object.entries(synset.translations).flatMap(
      ([lang, lemmas]) =>
        lemmas.map((lemma) => ({
          id: crypto.randomUUID(),
          term_id: term.id,
          language_code: lang as LANGUAGE_CODE,
          text: lemma,
        })),
    );

    translationsArray.push(...translationRows);
    termsArray.push(term);

    if (termsArray.length >= 500) {
      batchCount++;
      console.log(
        `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
      );
      await uploadToDB(termsArray, translationsArray);
      termsArray.length = 0;
      translationsArray.length = 0;
    }
  }

  if (termsArray.length > 0) {
    batchCount++;
    console.log(
      `Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`,
    );
    await uploadToDB(termsArray, translationsArray);
  }

  console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};

main().catch((error) => {
  console.error(error);
  process.exit(1);
});

4. Pitfalls Encountered

Duplicate key on re-run

Running the script twice causes duplicate key value violates unique constraint "terms_synset_id_unique". Fix: truncate before seeding.

docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"

onConflictDoNothing breaks FK references

When onConflictDoNothing skips a terms insert, the in-memory UUID is never written to the DB. Subsequent translations inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding.
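This failure mode can be reproduced without a database. The sketch below simulates the unique index on synset_id with a plain Map (names are illustrative):

```typescript
// In-memory stand-in for the terms table's unique index on synset_id.
const dbTerms = new Map<string, string>(); // synset_id -> id actually stored

// Mimics INSERT ... ON CONFLICT DO NOTHING: a duplicate is silently skipped.
function insertTermDoNothing(synsetId: string, candidateId: string): string {
  if (!dbTerms.has(synsetId)) dbTerms.set(synsetId, candidateId);
  return candidateId; // the id the seeding script keeps in memory
}

const firstRunId = insertTermDoNothing("ili:i35545", "uuid-run-1");
const secondRunId = insertTermDoNothing("ili:i35545", "uuid-run-2"); // skipped

// The second run still holds "uuid-run-2" in memory, but the table only
// ever stored "uuid-run-1" — any translation row pointing at secondRunId
// violates the foreign key.
```

The v2 script sidesteps this by using onConflictDoUpdate with .returning(), which always hands back the row's real id.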

DATABASE_URL misconfigured

Correct format:

DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa

Tables not found after docker compose up

Migrations must be applied first: npx drizzle-kit migrate


5. Running the Script

# Start the DB container
docker compose up -d postgres

# Apply migrations
npx drizzle-kit migrate

# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"

# Run the seed script
npx tsx src/seed-en-it-nouns.ts

# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"

6. Seeding Script — v2 (incremental upsert, multi-file)

Motivation

The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.

File naming convention

One JSON file per language pair per POS:

scripts/datafiles/
  en-it-nouns.json
  en-fr-nouns.json
  en-it-verbs.json
  de-it-nouns.json
  ...

How incremental upsert works

For a concept like "dog" already in the DB with English and Italian:

  1. Import en-fr-nouns.json
  2. Upsert terms by synset_id — finds existing row, returns its real ID
  3. dog (en) already exists → skipped by onConflictDoNothing
  4. chien (fr) is new → inserted

The concept is extended, not replaced.
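The same extension step, pictured as set union over (language_code, text) pairs — a minimal in-memory sketch (function and type names are hypothetical), mirroring the UNIQUE (term_id, language_code, text) constraint plus onConflictDoNothing:

```typescript
type Lemma = { language_code: string; text: string };

// Merge newly imported lemmas into an existing set, skipping duplicates.
function extendConcept(existing: Lemma[], incoming: Lemma[]): Lemma[] {
  const key = (l: Lemma) => `${l.language_code}\u0000${l.text}`;
  const seen = new Set(existing.map(key));
  const added = incoming.filter((l) => !seen.has(key(l)));
  return [...existing, ...added];
}

const dog = [
  { language_code: "en", text: "dog" },
  { language_code: "it", text: "cane" },
];
const fromFrenchFile = [
  { language_code: "en", text: "dog" },   // duplicate, skipped
  { language_code: "fr", text: "chien" }, // new, added
];
const merged = extendConcept(dog, fromFrenchFile);
```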

Tradeoff vs batch approach

The simple batch pattern no longer applies: each synset's real term.id must come back from the DB before its translations can be inserted, so synsets are processed one at a time. For ~25k rows this is still fast enough.

Key types added

type Synset = {
  synset_id: string;
  pos: POS;
  translations: Partial<Record<LANGUAGE_CODE, string[]>>; // Partial — file only contains subset of languages
};

type FileName = {
  sourceLang: LANGUAGE_CODE;
  targetLang: LANGUAGE_CODE;
  pos: POS;
};

Filename validation

const parseFilename = (filename: string): FileName => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(
      `Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
    );
  const [sourceLang, targetLang, pos] = parts;
  if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${targetLang}`);
  if (!SUPPORTED_POS.includes(pos as POS))
    throw new Error(`Unsupported POS: ${pos}`);
  return {
    sourceLang: sourceLang as LANGUAGE_CODE,
    targetLang: targetLang as LANGUAGE_CODE,
    pos: pos as POS,
  };
};

Upsert function (WIP)

const upsertSynset = async (
  synset: Synset,
  fileInfo: FileName,
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
  const [upsertedTerm] = await db
    .insert(terms)
    .values({ synset_id: synset.synset_id, pos: synset.pos })
    .onConflictDoUpdate({ target: terms.synset_id, set: { pos: synset.pos } })
    .returning({ id: terms.id, created_at: terms.created_at });

  // Heuristic: treat a row created within the last second as freshly inserted.
  // (WIP — fragile under clock skew; Postgres's `RETURNING (xmax = 0)` trick is
  // a more reliable inserted-vs-updated signal.)
  const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);

  const translationRows = Object.entries(synset.translations).flatMap(
    ([lang, lemmas]) =>
      lemmas!.map((lemma) => ({
        id: crypto.randomUUID(),
        term_id: upsertedTerm.id,
        language_code: lang as LANGUAGE_CODE,
        text: lemma,
      })),
  );

  // Drizzle's values() rejects an empty array at runtime, so bail out early.
  if (translationRows.length === 0) {
    return { termInserted, translationsInserted: 0 };
  }

  const result = await db
    .insert(translations)
    .values(translationRows)
    .onConflictDoNothing()
    .returning({ id: translations.id });

  return { termInserted, translationsInserted: result.length };
};
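A driver loop over all data files would collect these per-synset results; the tally behind the final log line can be sketched as a pure helper (the helper name is hypothetical, and the driver itself is omitted since it depends on the db and filesystem):

```typescript
type UpsertResult = { termInserted: boolean; translationsInserted: number };

// Fold per-synset results into run totals for the closing log message.
function summarize(results: UpsertResult[]): {
  newTerms: number;
  newTranslations: number;
} {
  let newTerms = 0;
  let newTranslations = 0;
  for (const r of results) {
    if (r.termInserted) newTerms++;
    newTranslations += r.translationsInserted;
  }
  return { newTerms, newTranslations };
}

const totals = summarize([
  { termInserted: true, translationsInserted: 2 },  // new concept, two lemmas
  { termInserted: false, translationsInserted: 1 }, // existing concept extended
]);
```

Reporting "N new terms, M new translations" makes a no-op re-run (all zeros) immediately visible.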

7. Strategy Comparison

| Strategy | Use case | Pros | Cons |
| --- | --- | --- | --- |
| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower |
| Migrations-as-data | Production audit trail | Clean history | Files accumulate |
| Diff-based sync | Large production datasets | Minimal writes | Complex to implement |

8. packages/db — package.json exports fix

The exports field must be an object, not an array:

"exports": {
  ".": "./src/index.ts",
  "./schema": "./src/db/schema.ts"
}

Imports then resolve as:

import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";