lila/documentation/data-seeding-notes.md
2026-03-31 10:06:06 +02:00

# WordNet Seeding Script — Session Summary
## Project Context
A multiplayer English–Italian vocabulary trainer (Glossa) built as a pnpm monorepo. Vocabulary data comes from the Open Multilingual Wordnet (OMW): synsets are extracted into JSON files and then seeded into a PostgreSQL database via Drizzle ORM.
---
## 1. JSON Extraction Format
Each synset extracted from WordNet is represented as:
```json
{
  "synset_id": "ili:i35545",
  "pos": "noun",
  "translations": { "en": ["entity"], "it": ["cosa", "entità"] }
}
```
**Fields:**
- `synset_id` — OMW Interlingual Index ID, maps to `terms.synset_id` in the DB
- `pos` — part of speech, matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within a synset)
**Glosses** are not extracted — the `term_glosses` table exists in the schema for future use but is not needed for the MVP quiz mechanic.
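Since `JSON.parse` returns untyped data, the extracted shape can be checked at runtime before casting. A minimal sketch of such a guard — `RawSynset` and `isRawSynset` are hypothetical names, not part of the project:

```typescript
// Hypothetical runtime guard for the extracted synset shape above.
type RawSynset = {
  synset_id: string;
  pos: string;
  translations: Record<string, string[]>;
};

const isRawSynset = (value: unknown): value is RawSynset => {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.synset_id !== "string" || typeof v.pos !== "string") return false;
  if (typeof v.translations !== "object" || v.translations === null) return false;
  // Every language entry must be an array of strings (the lemmas).
  return Object.values(v.translations).every(
    (lemmas) => Array.isArray(lemmas) && lemmas.every((l) => typeof l === "string"),
  );
};

console.log(
  isRawSynset({
    synset_id: "ili:i35545",
    pos: "noun",
    translations: { en: ["entity"], it: ["cosa", "entità"] },
  }),
); // true
console.log(isRawSynset({ synset_id: "x" })); // false (missing pos/translations)
```

Filtering the parsed array through this guard would fail fast on malformed extraction output instead of at insert time.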
---
## 2. Database Schema (relevant tables)
```
terms
  id          uuid         PK
  synset_id   text         UNIQUE
  pos         varchar(20)
  created_at  timestamptz

translations
  id             uuid         PK
  term_id        uuid         FK → terms.id (CASCADE)
  language_code  varchar(10)
  text           text
  created_at     timestamptz

  UNIQUE (term_id, language_code, text)
```
---
## 3. Seeding Script — v1 (batch, truncate-based)
### Approach
- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate
### Key decisions made during development
| Issue | Resolution |
| -------------------------------- | --------------------------------------------------- |
| `JSON.parse` returns `any` | Added `Array.isArray` check before casting |
| `forEach` doesn't await | Switched to `for...of` |
| Empty array types | Used Drizzle's `$inferInsert` types |
| `translations` naming conflict | Renamed local variable to `translationRows` |
| Final batch not flushed | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500` |
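The last two fixes in the table (`>= 500` instead of `=== 500`, plus the post-loop flush) generalize to a small batching pattern. A sketch with a caller-supplied uploader — `seedInBatches` is a hypothetical helper, not project code:

```typescript
// Generic batching sketch: accumulate rows, flush every `batchSize`,
// and flush the remainder after the loop (the bug fixed above).
const seedInBatches = async <T>(
  rows: T[],
  batchSize: number,
  upload: (batch: T[]) => Promise<void>,
): Promise<number> => {
  let buffer: T[] = [];
  let batches = 0;
  for (const row of rows) {
    buffer.push(row);
    if (buffer.length >= batchSize) {
      await upload(buffer);
      batches++;
      buffer = [];
    }
  }
  if (buffer.length > 0) {
    // Without this guard the final partial batch is silently dropped.
    await upload(buffer);
    batches++;
  }
  return batches;
};

// 1250 rows at batch size 500 -> two full batches plus one partial batch.
let uploaded = 0;
seedInBatches(
  Array.from({ length: 1250 }, (_, i) => i),
  500,
  async (batch) => {
    uploaded += batch.length;
  },
).then((batches) => console.log(batches, uploaded)); // 3 1250
```

The same shape applies to any insert-heavy script, not just this seeder.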
### Final script structure
```ts
import crypto from "node:crypto";
import fs from "node:fs/promises";
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";

type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;

type Synset = {
  synset_id: string;
  pos: POS;
  translations: Record<LANGUAGE_CODE, string[]>;
};

const dataDir = "../../scripts/datafiles/";

const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
  const data = await fs.readFile(filepath, "utf8");
  const parsed = JSON.parse(data);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
  return parsed as Synset[];
};

const uploadToDB = async (
  termsData: TermInsert[],
  translationsData: TranslationInsert[],
) => {
  await db.insert(terms).values(termsData);
  await db.insert(translations).values(translationsData);
};

const main = async () => {
  console.log("Reading JSON file...");
  const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
  console.log(`Loaded ${allSynsets.length} synsets`);

  const termsArray: TermInsert[] = [];
  const translationsArray: TranslationInsert[] = [];
  let batchCount = 0;

  for (const synset of allSynsets) {
    const term = {
      id: crypto.randomUUID(),
      synset_id: synset.synset_id,
      pos: synset.pos,
    };
    const translationRows = Object.entries(synset.translations).flatMap(
      ([lang, lemmas]) =>
        lemmas.map((lemma) => ({
          id: crypto.randomUUID(),
          term_id: term.id,
          language_code: lang as LANGUAGE_CODE,
          text: lemma,
        })),
    );
    translationsArray.push(...translationRows);
    termsArray.push(term);

    if (termsArray.length >= 500) {
      batchCount++;
      console.log(
        `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
      );
      await uploadToDB(termsArray, translationsArray);
      termsArray.length = 0;
      translationsArray.length = 0;
    }
  }

  // Flush the final partial batch.
  if (termsArray.length > 0) {
    batchCount++;
    console.log(
      `Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`,
    );
    await uploadToDB(termsArray, translationsArray);
  }

  console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```
---
## 4. Pitfalls Encountered
### Duplicate key on re-run
Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding.
```bash
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
```
### `onConflictDoNothing` breaks FK references
When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID is never written to the DB. Subsequent `translations` inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding.
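The failure mode can be reproduced with a toy in-memory model — plain `Map`s standing in for Postgres; every name here is hypothetical illustration, not project code:

```typescript
// Simulate `terms` (unique on synset_id) and the FK check on translations.
const termsById = new Map<string, string>(); // id -> synset_id
const idBySynset = new Map<string, string>(); // synset_id -> id

// "onConflictDoNothing": skip the insert when synset_id already exists,
// which means the freshly generated in-memory id is never stored.
const insertTermSkipConflict = (id: string, synsetId: string): void => {
  if (idBySynset.has(synsetId)) return; // skipped; `id` is lost
  idBySynset.set(synsetId, id);
  termsById.set(id, synsetId);
};

const insertTranslation = (termId: string, text: string): void => {
  if (!termsById.has(termId)) {
    throw new Error(`FK violation: term ${termId} does not exist`);
  }
};

insertTermSkipConflict("uuid-run-1", "ili:i35545");
insertTranslation("uuid-run-1", "entity"); // ok: term was really inserted

insertTermSkipConflict("uuid-run-2", "ili:i35545"); // conflict -> skipped
let fkError = "";
try {
  insertTranslation("uuid-run-2", "entity"); // references the never-stored UUID
} catch (e) {
  fkError = (e as Error).message;
}
console.log(fkError); // "FK violation: term uuid-run-2 does not exist"
```

The v2 script avoids this by using `.returning()` to fetch the term's real database ID before inserting translations.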
### DATABASE_URL misconfigured
Correct format:
```
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
```
### Tables not found after `docker compose up`
Migrations must be applied first: `npx drizzle-kit migrate`
---
## 5. Running the Script
```bash
# Start the DB container
docker compose up -d postgres
# Apply migrations
npx drizzle-kit migrate
# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
# Run the seed script
npx tsx src/seed-en-it-nouns.ts
# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"
```
---
## 6. Seeding Script — v2 (incremental upsert, multi-file)
### Motivation
The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.
### File naming convention
One JSON file per language pair per POS:
```
scripts/datafiles/
en-it-nouns.json
en-fr-nouns.json
en-it-verbs.json
de-it-nouns.json
...
```
### How incremental upsert works
For a concept like "dog" already in the DB with English and Italian:
1. Import `en-fr-nouns.json`
2. Upsert `terms` by `synset_id` — finds existing row, returns its real ID
3. `dog (en)` already exists → skipped by `onConflictDoNothing`
4. `chien (fr)` is new → inserted
The concept is **extended**, not replaced.
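The extend-not-replace behaviour above can be sketched with an in-memory model — a hypothetical `upsert` helper and a made-up `ili:dog` id, not real OMW data:

```typescript
// Each concept is keyed by synset_id; translations map lang -> set of lemmas.
type Concept = { translations: Map<string, Set<string>> };
const dbBySynset = new Map<string, Concept>();

// Upsert: reuse the existing concept for a synset_id, add only new lemmas.
const upsert = (synsetId: string, incoming: Record<string, string[]>): void => {
  const concept = dbBySynset.get(synsetId) ?? { translations: new Map() };
  dbBySynset.set(synsetId, concept);
  for (const [lang, lemmas] of Object.entries(incoming)) {
    const existing = concept.translations.get(lang) ?? new Set<string>();
    concept.translations.set(lang, existing);
    for (const lemma of lemmas) existing.add(lemma); // duplicates are no-ops
  }
};

// The en-it file seeds the concept; the en-fr file extends it.
upsert("ili:dog", { en: ["dog"], it: ["cane"] });
upsert("ili:dog", { en: ["dog"], fr: ["chien"] }); // "dog (en)" skipped, "chien (fr)" added

const dog = dbBySynset.get("ili:dog")!;
console.log([...dog.translations.keys()]); // [ 'en', 'it', 'fr' ]
```

The `Set` plays the role of the `UNIQUE (term_id, language_code, text)` constraint, and the `Map` lookup by `synsetId` plays the role of the `terms.synset_id` upsert.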
### Tradeoff vs batch approach
Batching is no longer possible since you need the real `term.id` from the DB before inserting translations. Each synset is processed individually. For 25k rows this is still fast enough.
### Key types added
```ts
type Synset = {
  synset_id: string;
  pos: POS;
  // Partial — each file only contains a subset of languages
  translations: Partial<Record<LANGUAGE_CODE, string[]>>;
};

type FileName = {
  sourceLang: LANGUAGE_CODE;
  targetLang: LANGUAGE_CODE;
  pos: POS;
};
```
### Filename validation
```ts
const parseFilename = (filename: string): FileName => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(
      `Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
    );
  const [sourceLang, targetLang, pos] = parts;
  if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${targetLang}`);
  if (!SUPPORTED_POS.includes(pos as POS))
    throw new Error(`Unsupported POS: ${pos}`);
  return {
    sourceLang: sourceLang as LANGUAGE_CODE,
    targetLang: targetLang as LANGUAGE_CODE,
    pos: pos as POS,
  };
};
```
### Upsert function (WIP)
```ts
const upsertSynset = async (
  synset: Synset,
  fileInfo: FileName, // reserved for logging; unused so far (WIP)
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
  const [upsertedTerm] = await db
    .insert(terms)
    .values({ synset_id: synset.synset_id, pos: synset.pos })
    .onConflictDoUpdate({ target: terms.synset_id, set: { pos: synset.pos } })
    .returning({ id: terms.id, created_at: terms.created_at });

  // Heuristic: a row created within the last second is treated as newly inserted.
  const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);

  const translationRows = Object.entries(synset.translations).flatMap(
    ([lang, lemmas]) =>
      lemmas!.map((lemma) => ({
        id: crypto.randomUUID(),
        term_id: upsertedTerm.id,
        language_code: lang as LANGUAGE_CODE,
        text: lemma,
      })),
  );
  // Guard: Drizzle's .values() throws on an empty array, so skip lemma-less synsets.
  if (translationRows.length === 0)
    return { termInserted, translationsInserted: 0 };

  const result = await db
    .insert(translations)
    .values(translationRows)
    .onConflictDoNothing()
    .returning({ id: translations.id });

  return { termInserted, translationsInserted: result.length };
};
```
---
## 7. Strategy Comparison
| Strategy | Use case | Pros | Cons |
| ------------------ | ----------------------------- | --------------------- | -------------------- |
| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower |
| Migrations-as-data | Production audit trail | Clean history | Files accumulate |
| Diff-based sync | Large production datasets | Minimal writes | Complex to implement |
---
## 8. packages/db — package.json exports fix
The `exports` field must be an object, not an array:
```json
"exports": {
".": "./src/index.ts",
"./schema": "./src/db/schema.ts"
}
```
Imports then resolve as:
```ts
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
```