# WordNet Seeding Script — Session Summary

## Project Context

A multiplayer English–Italian vocabulary trainer (Glossa) built with a pnpm monorepo. Vocabulary data comes from Open Multilingual Wordnet (OMW) and is extracted into JSON files, then seeded into a PostgreSQL database via Drizzle ORM.

---

## 1. JSON Extraction Format

Each synset extracted from WordNet is represented as:

```json
{
  "synset_id": "ili:i35545",
  "pos": "noun",
  "translations": { "en": ["entity"], "it": ["cosa", "entità"] }
}
```

**Fields:**

- `synset_id` — OMW Interlingual Index ID; maps to `terms.synset_id` in the DB
- `pos` — part of speech; matches the CHECK constraint on `terms.pos`
- `translations` — object mapping language code → array of lemmas (synonyms within the synset)
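Beyond the bare `Array.isArray` check used later in the script, each record can be validated with a runtime type guard. A minimal sketch — the `isSynset` helper is illustrative, not part of the codebase:

```typescript
// Illustrative runtime guard for the extraction format above.
// The helper name and loose string types are assumptions for the sketch.
type RawSynset = {
  synset_id: string;
  pos: string;
  translations: Record<string, string[]>;
};

const isSynset = (value: unknown): value is RawSynset => {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.synset_id === "string" &&
    typeof v.pos === "string" &&
    typeof v.translations === "object" &&
    v.translations !== null &&
    // every language maps to an array of string lemmas
    Object.values(v.translations).every(
      (lemmas) => Array.isArray(lemmas) && lemmas.every((l) => typeof l === "string"),
    )
  );
};
```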

**Glosses** are not extracted — the `term_glosses` table exists in the schema for future use but is not needed for the MVP quiz mechanic.

---

## 2. Database Schema (relevant tables)

```text
terms
  id          uuid PK
  synset_id   text UNIQUE
  pos         varchar(20)
  created_at  timestamptz

translations
  id             uuid PK
  term_id        uuid FK → terms.id (CASCADE)
  language_code  varchar(10)
  text           text
  created_at     timestamptz
  UNIQUE (term_id, language_code, text)
```
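For reference, these tables could be declared with `drizzle-orm/pg-core` roughly as follows. This is a sketch, not the project's actual schema file — the defaults (`defaultRandom`, `defaultNow`) and `notNull` choices are assumptions:

```typescript
// Sketch of the schema layout above in drizzle-orm/pg-core.
// Column names mirror the text diagram; defaults are assumptions.
import { pgTable, uuid, text, varchar, timestamp, unique } from "drizzle-orm/pg-core";

export const terms = pgTable("terms", {
  id: uuid("id").primaryKey().defaultRandom(),
  synset_id: text("synset_id").unique().notNull(),
  pos: varchar("pos", { length: 20 }).notNull(),
  created_at: timestamp("created_at", { withTimezone: true }).defaultNow(),
});

export const translations = pgTable(
  "translations",
  {
    id: uuid("id").primaryKey().defaultRandom(),
    term_id: uuid("term_id")
      .notNull()
      .references(() => terms.id, { onDelete: "cascade" }),
    language_code: varchar("language_code", { length: 10 }).notNull(),
    text: text("text").notNull(),
    created_at: timestamp("created_at", { withTimezone: true }).defaultNow(),
  },
  // composite UNIQUE (term_id, language_code, text)
  (t) => [unique().on(t.term_id, t.language_code, t.text)],
);
```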

---

## 3. Seeding Script — v1 (batch, truncate-based)

### Approach

- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate

### Key decisions made during development

| Issue                            | Resolution                                          |
| -------------------------------- | --------------------------------------------------- |
| `JSON.parse` returns `any`       | Added `Array.isArray` check before casting          |
| `forEach` doesn't await          | Switched to `for...of`                              |
| Empty array types                | Used Drizzle's `$inferInsert` types                 |
| `translations` naming conflict   | Renamed local variable to `translationRows`         |
| Final batch not flushed          | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500`                                 |
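The `forEach` pitfall is worth a concrete illustration: an `async` callback inside `forEach` is fired and forgotten, while `for...of` actually awaits each step. A standalone sketch (function names are illustrative):

```typescript
// forEach does not await async callbacks: the loop "finishes" before any
// of the awaited work does. for...of awaits each iteration in order.
const delay = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

const runForEach = async (items: number[], out: number[]) => {
  items.forEach(async (n) => {
    await delay(10);
    out.push(n);
  });
  // Returns immediately — out is still empty here.
};

const runForOf = async (items: number[], out: number[]) => {
  for (const n of items) {
    await delay(10);
    out.push(n);
  }
  // Returns only after every item has been processed.
};
```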

### Final script structure

```ts
import fs from "node:fs/promises";
import crypto from "node:crypto"; // explicit import so crypto.randomUUID works across Node versions
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";

type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;

type Synset = {
  synset_id: string;
  pos: POS;
  translations: Record<LANGUAGE_CODE, string[]>;
};

const dataDir = "../../scripts/datafiles/";

const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
  const data = await fs.readFile(filepath, "utf8");
  const parsed = JSON.parse(data);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
  return parsed as Synset[];
};

const uploadToDB = async (
  termsData: TermInsert[],
  translationsData: TranslationInsert[],
) => {
  await db.insert(terms).values(termsData);
  await db.insert(translations).values(translationsData);
};

const main = async () => {
  console.log("Reading JSON file...");
  const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
  console.log(`Loaded ${allSynsets.length} synsets`);

  const termsArray: TermInsert[] = [];
  const translationsArray: TranslationInsert[] = [];
  let batchCount = 0;

  for (const synset of allSynsets) {
    const term = {
      id: crypto.randomUUID(),
      synset_id: synset.synset_id,
      pos: synset.pos,
    };

    const translationRows = Object.entries(synset.translations).flatMap(
      ([lang, lemmas]) =>
        lemmas.map((lemma) => ({
          id: crypto.randomUUID(),
          term_id: term.id,
          language_code: lang as LANGUAGE_CODE,
          text: lemma,
        })),
    );

    translationsArray.push(...translationRows);
    termsArray.push(term);

    if (termsArray.length >= 500) {
      batchCount++;
      console.log(
        `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
      );
      await uploadToDB(termsArray, translationsArray);
      termsArray.length = 0;
      translationsArray.length = 0;
    }
  }

  if (termsArray.length > 0) {
    batchCount++;
    console.log(
      `Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`,
    );
    await uploadToDB(termsArray, translationsArray);
  }

  console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```
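The flush-plus-final-batch pattern in the loop above generalizes to a small helper. A standalone sketch (the `flushBatches` name is mine, not from the script):

```typescript
// Accumulate items and flush every `size`, then flush the remainder —
// the same `>= size` check and post-loop guard as in the seeding script.
const flushBatches = async <T>(
  items: T[],
  size: number,
  flush: (batch: T[]) => Promise<void>,
): Promise<number> => {
  let batchCount = 0;
  const buffer: T[] = [];
  for (const item of items) {
    buffer.push(item);
    if (buffer.length >= size) {
      batchCount++;
      await flush(buffer.splice(0)); // hand over a copy, reset the buffer
    }
  }
  if (buffer.length > 0) {
    batchCount++;
    await flush(buffer.splice(0)); // final partial batch
  }
  return batchCount;
};
```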

---

## 4. Pitfalls Encountered

### Duplicate key on re-run

Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding.

```bash
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
```

### `onConflictDoNothing` breaks FK references

When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID generated for that term is never written to the DB. Subsequent `translations` inserts then reference a non-existent UUID, causing an FK violation. This is why the truncate approach is the right fit for batch seeding.
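The failure mode can be simulated in memory with a plain `Map` standing in for the `terms` table (names here are illustrative, not real project code):

```typescript
// Simulates why skip-on-conflict breaks FKs: a re-run generates a fresh
// in-memory UUID that is never written, so its translations would dangle.
const termsTable = new Map<string, string>(); // synset_id -> id (stand-in for terms)

const insertTermSkipOnConflict = (synsetId: string, id: string): void => {
  // onConflictDoNothing: an existing synset_id leaves the old row untouched
  if (!termsTable.has(synsetId)) termsTable.set(synsetId, id);
};

const wouldViolateFk = (termId: string): boolean =>
  // a translations row referencing termId fails unless that id exists in terms
  ![...termsTable.values()].includes(termId);
```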

### DATABASE_URL misconfigured

Correct format:

```text
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
```

### Tables not found after `docker compose up`

Migrations must be applied first: `npx drizzle-kit migrate`

---

## 5. Running the Script

```bash
# Start the DB container
docker compose up -d postgres

# Apply migrations
npx drizzle-kit migrate

# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"

# Run the seed script
npx tsx src/seed-en-it-nouns.ts

# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"
```

---

## 6. Seeding Script — v2 (incremental upsert, multi-file)

### Motivation

The truncate approach is fine for dev but unsuitable for production — it wipes all existing data on every run. The v2 approach extends the database incrementally without ever truncating.

### File naming convention

One JSON file per language pair per POS:

```text
scripts/datafiles/
  en-it-nouns.json
  en-fr-nouns.json
  en-it-verbs.json
  de-it-nouns.json
  ...
```

### How incremental upsert works

For a concept like "dog" already in the DB with English and Italian:

1. Import `en-fr-nouns.json`
2. Upsert `terms` by `synset_id` — finds the existing row and returns its real ID
3. `dog (en)` already exists → skipped by `onConflictDoNothing`
4. `chien (fr)` is new → inserted

The concept is **extended**, not replaced.
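The extend-not-replace semantics can be sketched as a toy in-memory model (not the real DB code — names and types here are illustrative):

```typescript
// Toy model of incremental upsert: lemmas are added per language;
// existing (language, lemma) pairs are left untouched.
type Concept = Record<string, Set<string>>; // language_code -> lemmas

const upsertLemmas = (concept: Concept, lang: string, lemmas: string[]): void => {
  concept[lang] ??= new Set(); // new language → new empty lemma set
  for (const lemma of lemmas) concept[lang].add(lemma); // duplicates are no-ops
};
```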

### Tradeoff vs batch approach

Batching is no longer possible, since the real `term.id` must come back from the DB before its translations can be inserted. Each synset is processed individually. For ~25k rows this is still fast enough.

### Key types added

```ts
type Synset = {
  synset_id: string;
  pos: POS;
  translations: Partial<Record<LANGUAGE_CODE, string[]>>; // Partial — each file only contains a subset of languages
};

type FileName = {
  sourceLang: LANGUAGE_CODE;
  targetLang: LANGUAGE_CODE;
  pos: POS;
};
```

### Filename validation

```ts
const parseFilename = (filename: string): FileName => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(
      `Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
    );
  const [sourceLang, targetLang, pos] = parts;
  if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${targetLang}`);
  if (!SUPPORTED_POS.includes(pos as POS))
    throw new Error(`Unsupported POS: ${pos}`);
  return {
    sourceLang: sourceLang as LANGUAGE_CODE,
    targetLang: targetLang as LANGUAGE_CODE,
    pos: pos as POS,
  };
};
```

### Upsert function (WIP)

```ts
const upsertSynset = async (
  synset: Synset,
  fileInfo: FileName, // reserved for logging/stats (WIP)
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
  const [upsertedTerm] = await db
    .insert(terms)
    .values({ synset_id: synset.synset_id, pos: synset.pos })
    .onConflictDoUpdate({ target: terms.synset_id, set: { pos: synset.pos } })
    .returning({ id: terms.id, created_at: terms.created_at });

  // Heuristic: treat the term as newly inserted if its created_at is within
  // the last second — an existing row keeps its original timestamp.
  const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);

  const translationRows = Object.entries(synset.translations).flatMap(
    ([lang, lemmas]) =>
      // non-null assertion: entries of a Partial<Record> are typed as possibly undefined
      lemmas!.map((lemma) => ({
        id: crypto.randomUUID(),
        term_id: upsertedTerm.id,
        language_code: lang as LANGUAGE_CODE,
        text: lemma,
      })),
  );

  const result = await db
    .insert(translations)
    .values(translationRows)
    .onConflictDoNothing()
    .returning({ id: translations.id });

  return { termInserted, translationsInserted: result.length };
};
```

---

## 7. Strategy Comparison

| Strategy           | Use case                      | Pros                  | Cons                 |
| ------------------ | ----------------------------- | --------------------- | -------------------- |
| Truncate + batch   | Dev / first-time setup        | Fast, simple          | Wipes all data       |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower  |
| Migrations-as-data | Production audit trail        | Clean history         | Files accumulate     |
| Diff-based sync    | Large production datasets     | Minimal writes        | Complex to implement |

---

## 8. packages/db — package.json exports fix

The `exports` field must be an object, not an array:

```json
"exports": {
  ".": "./src/index.ts",
  "./schema": "./src/db/schema.ts"
}
```

Imports then resolve as:

```ts
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
```