lila/documentation/data-seeding-notes.md
2026-03-31 10:06:06 +02:00

# WordNet Seeding Script — Session Summary
## Project Context
A multiplayer English–Italian vocabulary trainer (Glossa) built as a pnpm monorepo. Vocabulary data comes from the Open Multilingual Wordnet (OMW): synsets are extracted into JSON files and then seeded into a PostgreSQL database via Drizzle ORM.
---
## 1. JSON Extraction Format
Each synset extracted from WordNet is represented as:
```json
{
  "synset_id": "ili:i35545",
  "pos": "noun",
  "translations": { "en": ["entity"], "it": ["cosa", "entità"] }
}
```
**Fields:**
- `synset_id` — OMW Interlingual Index ID, maps to `terms.synset_id` in the DB
- `pos` — part of speech, matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within a synset)
**Glosses** are not extracted — the `term_glosses` table exists in the schema for future use but is not needed for the MVP quiz mechanic.
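Since `JSON.parse` returns untyped data, the extracted shape can be checked at runtime before casting. A minimal sketch of such a guard — `RawSynset` and `isRawSynset` are hypothetical names, not part of the project:

```typescript
// Hypothetical runtime guard for the extracted synset shape above.
type RawSynset = {
  synset_id: string;
  pos: string;
  translations: Record<string, string[]>;
};

const isRawSynset = (value: unknown): value is RawSynset => {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.synset_id !== "string" || typeof v.pos !== "string") return false;
  if (typeof v.translations !== "object" || v.translations === null) return false;
  // Every language entry must be an array of strings (the lemmas).
  return Object.values(v.translations).every(
    (lemmas) => Array.isArray(lemmas) && lemmas.every((l) => typeof l === "string"),
  );
};

console.log(
  isRawSynset({
    synset_id: "ili:i35545",
    pos: "noun",
    translations: { en: ["entity"], it: ["cosa", "entità"] },
  }),
); // true
console.log(isRawSynset({ synset_id: "x" })); // false (missing pos/translations)
```

Filtering the parsed array through this guard would fail fast on malformed extraction output instead of at insert time.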
---
## 2. Database Schema (relevant tables)
```
terms
  id          uuid         PK
  synset_id   text         UNIQUE
  pos         varchar(20)
  created_at  timestamptz

translations
  id             uuid         PK
  term_id        uuid         FK → terms.id (CASCADE)
  language_code  varchar(10)
  text           text
  created_at     timestamptz

  UNIQUE (term_id, language_code, text)
```
---
## 3. Seeding Script — v1 (batch, truncate-based)
### Approach
- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate
### Key decisions made during development
| Issue | Resolution |
| -------------------------------- | --------------------------------------------------- |
| `JSON.parse` returns `any` | Added `Array.isArray` check before casting |
| `forEach` doesn't await | Switched to `for...of` |
| Empty array types | Used Drizzle's `$inferInsert` types |
| `translations` naming conflict | Renamed local variable to `translationRows` |
| Final batch not flushed | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500` |
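The last two fixes in the table (`>= 500` instead of `=== 500`, plus the post-loop flush) generalize to a small batching pattern. A sketch with a caller-supplied uploader — `seedInBatches` is a hypothetical helper, not project code:

```typescript
// Generic batching sketch: accumulate rows, flush every `batchSize`,
// and flush the remainder after the loop (the bug fixed above).
const seedInBatches = async <T>(
  rows: T[],
  batchSize: number,
  upload: (batch: T[]) => Promise<void>,
): Promise<number> => {
  let buffer: T[] = [];
  let batches = 0;
  for (const row of rows) {
    buffer.push(row);
    if (buffer.length >= batchSize) {
      await upload(buffer);
      batches++;
      buffer = [];
    }
  }
  if (buffer.length > 0) {
    // Without this guard the final partial batch is silently dropped.
    await upload(buffer);
    batches++;
  }
  return batches;
};

// 1250 rows at batch size 500 -> two full batches plus one partial batch.
let uploaded = 0;
seedInBatches(
  Array.from({ length: 1250 }, (_, i) => i),
  500,
  async (batch) => {
    uploaded += batch.length;
  },
).then((batches) => console.log(batches, uploaded)); // 3 1250
```

The same shape applies to any insert-heavy script, not just this seeder.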
### Final script structure
```ts
import crypto from "node:crypto";
import fs from "node:fs/promises";
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";

type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;

type Synset = {
  synset_id: string;
  pos: POS;
  translations: Record<LANGUAGE_CODE, string[]>;
};

const dataDir = "../../scripts/datafiles/";

const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
  const data = await fs.readFile(filepath, "utf8");
  const parsed = JSON.parse(data);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
  return parsed as Synset[];
};

const uploadToDB = async (
  termsData: TermInsert[],
  translationsData: TranslationInsert[],
) => {
  await db.insert(terms).values(termsData);
  await db.insert(translations).values(translationsData);
};

const main = async () => {
  console.log("Reading JSON file...");
  const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
  console.log(`Loaded ${allSynsets.length} synsets`);

  const termsArray: TermInsert[] = [];
  const translationsArray: TranslationInsert[] = [];
  let batchCount = 0;

  for (const synset of allSynsets) {
    const term = {
      id: crypto.randomUUID(),
      synset_id: synset.synset_id,
      pos: synset.pos,
    };
    const translationRows = Object.entries(synset.translations).flatMap(
      ([lang, lemmas]) =>
        lemmas.map((lemma) => ({
          id: crypto.randomUUID(),
          term_id: term.id,
          language_code: lang as LANGUAGE_CODE,
          text: lemma,
        })),
    );
    translationsArray.push(...translationRows);
    termsArray.push(term);

    if (termsArray.length >= 500) {
      batchCount++;
      console.log(
        `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
      );
      await uploadToDB(termsArray, translationsArray);
      termsArray.length = 0;
      translationsArray.length = 0;
    }
  }

  // Flush the final partial batch.
  if (termsArray.length > 0) {
    batchCount++;
    console.log(
      `Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`,
    );
    await uploadToDB(termsArray, translationsArray);
  }

  console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```
---
## 4. Pitfalls Encountered
### Duplicate key on re-run
Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding.
```bash
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
```
### `onConflictDoNothing` breaks FK references
When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID is never written to the DB. Subsequent `translations` inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding.
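The failure mode can be reproduced with a toy in-memory model — plain `Map`s standing in for Postgres; every name here is hypothetical illustration, not project code:

```typescript
// Simulate `terms` (unique on synset_id) and the FK check on translations.
const termsById = new Map<string, string>(); // id -> synset_id
const idBySynset = new Map<string, string>(); // synset_id -> id

// "onConflictDoNothing": skip the insert when synset_id already exists,
// which means the freshly generated in-memory id is never stored.
const insertTermSkipConflict = (id: string, synsetId: string): void => {
  if (idBySynset.has(synsetId)) return; // skipped; `id` is lost
  idBySynset.set(synsetId, id);
  termsById.set(id, synsetId);
};

const insertTranslation = (termId: string, text: string): void => {
  if (!termsById.has(termId)) {
    throw new Error(`FK violation: term ${termId} does not exist`);
  }
};

insertTermSkipConflict("uuid-run-1", "ili:i35545");
insertTranslation("uuid-run-1", "entity"); // ok: term was really inserted

insertTermSkipConflict("uuid-run-2", "ili:i35545"); // conflict -> skipped
let fkError = "";
try {
  insertTranslation("uuid-run-2", "entity"); // references the never-stored UUID
} catch (e) {
  fkError = (e as Error).message;
}
console.log(fkError); // "FK violation: term uuid-run-2 does not exist"
```

The v2 script avoids this by using `.returning()` to fetch the term's real database ID before inserting translations.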
### DATABASE_URL misconfigured
Correct format:
```
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
```
### Tables not found after `docker compose up`
Migrations must be applied first: `npx drizzle-kit migrate`
---
## 5. Running the Script
```bash
# Start the DB container
docker compose up -d postgres
# Apply migrations
npx drizzle-kit migrate
# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
# Run the seed script
npx tsx src/seed-en-it-nouns.ts
# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"
```
---
## 6. Seeding Script — v2 (incremental upsert, multi-file)
### Motivation
The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.
### File naming convention
One JSON file per language pair per POS:
```
scripts/datafiles/
en-it-nouns.json
en-fr-nouns.json
en-it-verbs.json
de-it-nouns.json
...
```
### How incremental upsert works
For a concept like "dog" already in the DB with English and Italian:
1. Import `en-fr-nouns.json`
2. Upsert `terms` by `synset_id` — finds existing row, returns its real ID
3. `dog (en)` already exists → skipped by `onConflictDoNothing`
4. `chien (fr)` is new → inserted
The concept is **extended**, not replaced.
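The extend-not-replace behaviour above can be sketched with an in-memory model — a hypothetical `upsert` helper and a made-up `ili:dog` id, not real OMW data:

```typescript
// Each concept is keyed by synset_id; translations map lang -> set of lemmas.
type Concept = { translations: Map<string, Set<string>> };
const dbBySynset = new Map<string, Concept>();

// Upsert: reuse the existing concept for a synset_id, add only new lemmas.
const upsert = (synsetId: string, incoming: Record<string, string[]>): void => {
  const concept = dbBySynset.get(synsetId) ?? { translations: new Map() };
  dbBySynset.set(synsetId, concept);
  for (const [lang, lemmas] of Object.entries(incoming)) {
    const existing = concept.translations.get(lang) ?? new Set<string>();
    concept.translations.set(lang, existing);
    for (const lemma of lemmas) existing.add(lemma); // duplicates are no-ops
  }
};

// The en-it file seeds the concept; the en-fr file extends it.
upsert("ili:dog", { en: ["dog"], it: ["cane"] });
upsert("ili:dog", { en: ["dog"], fr: ["chien"] }); // "dog (en)" skipped, "chien (fr)" added

const dog = dbBySynset.get("ili:dog")!;
console.log([...dog.translations.keys()]); // [ 'en', 'it', 'fr' ]
```

The `Set` plays the role of the `UNIQUE (term_id, language_code, text)` constraint, and the `Map` lookup by `synsetId` plays the role of the `terms.synset_id` upsert.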
### Tradeoff vs batch approach
Batching is no longer possible since you need the real `term.id` from the DB before inserting translations. Each synset is processed individually. For 25k rows this is still fast enough.
### Key types added
```ts
type Synset = {
  synset_id: string;
  pos: POS;
  // Partial — each file only contains a subset of languages
  translations: Partial<Record<LANGUAGE_CODE, string[]>>;
};

type FileName = {
  sourceLang: LANGUAGE_CODE;
  targetLang: LANGUAGE_CODE;
  pos: POS;
};
```
### Filename validation
```ts
const parseFilename = (filename: string): FileName => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(
      `Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
    );
  const [sourceLang, targetLang, pos] = parts;
  if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${targetLang}`);
  if (!SUPPORTED_POS.includes(pos as POS))
    throw new Error(`Unsupported POS: ${pos}`);
  return {
    sourceLang: sourceLang as LANGUAGE_CODE,
    targetLang: targetLang as LANGUAGE_CODE,
    pos: pos as POS,
  };
};
```
### Upsert function (WIP)
```ts
const upsertSynset = async (
  synset: Synset,
  fileInfo: FileName, // reserved for logging; unused so far (WIP)
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
  const [upsertedTerm] = await db
    .insert(terms)
    .values({ synset_id: synset.synset_id, pos: synset.pos })
    .onConflictDoUpdate({ target: terms.synset_id, set: { pos: synset.pos } })
    .returning({ id: terms.id, created_at: terms.created_at });

  // Heuristic: a row created within the last second is treated as newly inserted.
  const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);

  const translationRows = Object.entries(synset.translations).flatMap(
    ([lang, lemmas]) =>
      lemmas!.map((lemma) => ({
        id: crypto.randomUUID(),
        term_id: upsertedTerm.id,
        language_code: lang as LANGUAGE_CODE,
        text: lemma,
      })),
  );
  // Guard: Drizzle's .values() throws on an empty array, so skip lemma-less synsets.
  if (translationRows.length === 0)
    return { termInserted, translationsInserted: 0 };

  const result = await db
    .insert(translations)
    .values(translationRows)
    .onConflictDoNothing()
    .returning({ id: translations.id });

  return { termInserted, translationsInserted: result.length };
};
```
---
## 7. Strategy Comparison
| Strategy | Use case | Pros | Cons |
| ------------------ | ----------------------------- | --------------------- | -------------------- |
| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower |
| Migrations-as-data | Production audit trail | Clean history | Files accumulate |
| Diff-based sync | Large production datasets | Minimal writes | Complex to implement |
---
## 8. packages/db — package.json exports fix
The `exports` field must be an object, not an array:
```json
"exports": {
".": "./src/index.ts",
"./schema": "./src/db/schema.ts"
}
```
Imports then resolve as:
```ts
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
```