feat(db): add incremental upsert seed script for WordNet vocabulary
Implements packages/db/src/seed.ts — reads all JSON files from scripts/datafiles/, validates filenames against the supported language codes and POS values, and upserts synsets into terms and translations via onConflictDoNothing. Safe to re-run; a duplicate run produces 0 writes.
Parent: 55885336ba · Commit: 2b177aad5b
12 changed files with 1349 additions and 10 deletions
337 documentation/data-seeding-notes.md (new file)
# WordNet Seeding Script — Session Summary

## Project Context

A multiplayer English–Italian vocabulary trainer (Glossa) built as a pnpm monorepo. Vocabulary data comes from the Open Multilingual Wordnet (OMW), is extracted into JSON files, and is then seeded into a PostgreSQL database via Drizzle ORM.

---

## 1. JSON Extraction Format

Each synset extracted from WordNet is represented as:

```json
{
  "synset_id": "ili:i35545",
  "pos": "noun",
  "translations": {
    "en": ["entity"],
    "it": ["cosa", "entità"]
  }
}
```

**Fields:**

- `synset_id` — OMW Interlingual Index ID; maps to `terms.synset_id` in the DB
- `pos` — part of speech; matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within the synset)

**Glosses** are not extracted — the `term_glosses` table exists in the schema for future use but is not needed for the MVP quiz mechanic.
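
Since `JSON.parse` gives back `any` (a pitfall noted in the v1 decisions table), a runtime type guard can reject malformed rows before casting. A minimal sketch, assuming illustrative language-code and POS lists in place of the real `@glossa/shared` constants:

```typescript
// Minimal runtime guard for the extraction format above. The language-code
// and POS lists here are illustrative assumptions, not the real
// @glossa/shared constants.
const LANGS = ["en", "it", "fr", "de"] as const;
const POS_VALUES = ["noun", "verb", "adjective"] as const;

type Synset = {
  synset_id: string;
  pos: (typeof POS_VALUES)[number];
  translations: Partial<Record<(typeof LANGS)[number], string[]>>;
};

const isSynset = (value: unknown): value is Synset => {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.synset_id !== "string") return false;
  if (!POS_VALUES.includes(v.pos as Synset["pos"])) return false;
  if (typeof v.translations !== "object" || v.translations === null) return false;
  // every translations entry must be a known language code mapping to string[]
  return Object.entries(v.translations as Record<string, unknown>).every(
    ([lang, lemmas]) =>
      (LANGS as readonly string[]).includes(lang) &&
      Array.isArray(lemmas) &&
      lemmas.every((l) => typeof l === "string"),
  );
};
```

Filtering a parsed file with `isSynset` turns bad rows into a visible count instead of a mid-run DB error.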
## 2. Database Schema (relevant tables)

```
terms
  id          uuid PK
  synset_id   text UNIQUE
  pos         varchar(20)
  created_at  timestamptz

translations
  id             uuid PK
  term_id        uuid FK → terms.id (CASCADE)
  language_code  varchar(10)
  text           text
  created_at     timestamptz
  UNIQUE (term_id, language_code, text)
```

---

## 3. Seeding Script — v1 (batch, truncate-based)

### Approach

- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate

### Key decisions made during development

| Issue | Resolution |
|-------|-----------|
| `JSON.parse` returns `any` | Added `Array.isArray` check before casting |
| `forEach` doesn't await | Switched to `for...of` |
| Empty array types | Used Drizzle's `$inferInsert` types |
| `translations` naming conflict | Renamed local variable to `translationRows` |
| Final batch not flushed | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500` |
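
The last three table rows amount to one generic pattern: buffer rows, flush when the threshold is reached, and always flush the remainder after the loop. A standalone sketch of that pattern, with an injected `flush` standing in for `uploadToDB`:

```typescript
// Generic buffered-flush sketch of the batching fixes from the table:
// flush on `>= batchSize` (not `=== batchSize`) and always flush the
// remainder after the loop. `flush` stands in for uploadToDB.
const processInBatches = async <T>(
  items: Iterable<T>,
  batchSize: number,
  flush: (batch: T[]) => Promise<void>,
): Promise<number> => {
  let buffer: T[] = [];
  let batches = 0;
  for (const item of items) {
    buffer.push(item);
    if (buffer.length >= batchSize) {
      batches++;
      await flush(buffer);
      buffer = [];
    }
  }
  if (buffer.length > 0) { // the guard that flushes the final partial batch
    batches++;
    await flush(buffer);
  }
  return batches;
};
```

Running it over 5 items with a batch size of 2 produces 3 flushes; dropping the final guard would silently lose the fifth item, which is exactly the bug the table records.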
### Final script structure

```ts
import fs from "node:fs/promises";
import { SUPPORTED_LANGUAGE_CODES, SUPPORTED_POS } from "@glossa/shared";
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";

type POS = (typeof SUPPORTED_POS)[number];
type LANGUAGE_CODE = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type TermInsert = typeof terms.$inferInsert;
type TranslationInsert = typeof translations.$inferInsert;

type Synset = {
  synset_id: string;
  pos: POS;
  translations: Record<LANGUAGE_CODE, string[]>;
};

const dataDir = "../../scripts/datafiles/";

const readFromJsonFile = async (filepath: string): Promise<Synset[]> => {
  const data = await fs.readFile(filepath, "utf8");
  const parsed = JSON.parse(data);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");
  return parsed as Synset[];
};

const uploadToDB = async (
  termsData: TermInsert[],
  translationsData: TranslationInsert[],
) => {
  await db.insert(terms).values(termsData);
  await db.insert(translations).values(translationsData);
};

const main = async () => {
  console.log("Reading JSON file...");
  const allSynsets = await readFromJsonFile(dataDir + "en-it-nouns.json");
  console.log(`Loaded ${allSynsets.length} synsets`);

  const termsArray: TermInsert[] = [];
  const translationsArray: TranslationInsert[] = [];
  let batchCount = 0;

  for (const synset of allSynsets) {
    const term = {
      id: crypto.randomUUID(),
      synset_id: synset.synset_id,
      pos: synset.pos,
    };

    const translationRows = Object.entries(synset.translations).flatMap(
      ([lang, lemmas]) =>
        lemmas.map((lemma) => ({
          id: crypto.randomUUID(),
          term_id: term.id,
          language_code: lang as LANGUAGE_CODE,
          text: lemma,
        })),
    );

    translationsArray.push(...translationRows);
    termsArray.push(term);

    if (termsArray.length >= 500) {
      batchCount++;
      console.log(
        `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
      );
      await uploadToDB(termsArray, translationsArray);
      termsArray.length = 0;
      translationsArray.length = 0;
    }
  }

  if (termsArray.length > 0) {
    batchCount++;
    console.log(`Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`);
    await uploadToDB(termsArray, translationsArray);
  }

  console.log(`Seeding complete — ${allSynsets.length} synsets inserted`);
};

main().catch((error) => {
  console.error(error);
  process.exit(1);
});
```
---

## 4. Pitfalls Encountered

### Duplicate key on re-run

Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding.

```bash
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"
```

### `onConflictDoNothing` breaks FK references

When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID is never written to the DB. Subsequent `translations` inserts reference that non-existent UUID, causing a FK violation. This is why the truncate approach is correct for batch seeding.
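
The failure mode can be reproduced without a database. This sketch models only the two constraints involved, the unique index on `synset_id` and the FK behind `translations.term_id`, as in-memory collections; the IDs are illustrative:

```typescript
// In-memory model of the pitfall: a unique index on synset_id plus a FK
// check from translations to terms. Skipping the conflicting insert (as
// onConflictDoNothing does) leaves the freshly generated UUID dangling.
const termsBySynset = new Map<string, string>(); // synset_id -> term id
const termIds = new Set<string>();

const insertTermDoNothing = (synsetId: string, newId: string): void => {
  if (termsBySynset.has(synsetId)) return; // conflict: row skipped, newId never stored
  termsBySynset.set(synsetId, newId);
  termIds.add(newId);
};

const insertTranslation = (termId: string): void => {
  if (!termIds.has(termId)) throw new Error("FK violation: term_id not found");
};

// First run: term inserted, translation insert succeeds.
insertTermDoNothing("ili:i35545", "uuid-1");
insertTranslation("uuid-1");

// Second run: the conflict is silently skipped, so "uuid-2" is dangling
// and the translation insert fails with the FK violation.
insertTermDoNothing("ili:i35545", "uuid-2");
let fkViolation = false;
try {
  insertTranslation("uuid-2");
} catch {
  fkViolation = true;
}
```

The v2 script avoids this by asking the DB for the real id via `returning` instead of trusting the in-memory UUID.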
### DATABASE_URL misconfigured

Correct format:

```
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
```

### Tables not found after `docker compose up`

Migrations must be applied first: `npx drizzle-kit migrate`

---

## 5. Running the Script

```bash
# Start the DB container
docker compose up -d postgres

# Apply migrations
npx drizzle-kit migrate

# Truncate existing data (if re-seeding)
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translations, terms CASCADE;"

# Run the seed script
npx tsx src/seed-en-it-nouns.ts

# Verify
docker exec -it glossa-database psql -U glossa -d glossa -c "SELECT COUNT(*) FROM terms; SELECT COUNT(*) FROM translations;"
```
---

## 6. Seeding Script — v2 (incremental upsert, multi-file)

### Motivation

The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.

### File naming convention

One JSON file per language pair per POS:

```
scripts/datafiles/
  en-it-nouns.json
  en-fr-nouns.json
  en-it-verbs.json
  de-it-nouns.json
  ...
```

### How incremental upsert works

For a concept like "dog" already in the DB with English and Italian:

1. Import `en-fr-nouns.json`
2. Upsert `terms` by `synset_id` — finds the existing row, returns its real ID
3. `dog (en)` already exists → skipped by `onConflictDoNothing`
4. `chien (fr)` is new → inserted

The concept is **extended**, not replaced.
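
The four steps can be simulated in memory to check the invariant. Here the term table is keyed by `synset_id` and translations are deduplicated on `(term_id, language_code, text)`, mirroring the real constraints; the synset ID and lemmas are illustrative:

```typescript
// In-memory sketch of the incremental upsert: terms are keyed by synset_id
// so a re-import returns the existing id, and translations are deduplicated
// on (term_id, language_code, text). IDs and lemmas are illustrative.
const termIdBySynset = new Map<string, string>();
const translationKeys = new Set<string>();
let nextId = 1;

const upsertTerm = (synsetId: string): string => {
  const existing = termIdBySynset.get(synsetId);
  if (existing !== undefined) return existing; // conflict -> real existing id
  const id = `term-${nextId++}`;
  termIdBySynset.set(synsetId, id);
  return id;
};

const insertTranslationDoNothing = (termId: string, lang: string, text: string): boolean => {
  const key = `${termId}\u0000${lang}\u0000${text}`;
  if (translationKeys.has(key)) return false; // duplicate skipped
  translationKeys.add(key);
  return true;
};

// The en-it import seeds the concept...
const dogId = upsertTerm("ili:dog-example");
insertTranslationDoNothing(dogId, "en", "dog");
insertTranslationDoNothing(dogId, "it", "cane");

// ...and a later en-fr import extends it instead of replacing it.
const sameId = upsertTerm("ili:dog-example");
const enSkipped = insertTranslationDoNothing(sameId, "en", "dog"); // already there
const frInserted = insertTranslationDoNothing(sameId, "fr", "chien"); // new lemma
```

The second upsert returns the same id, the duplicate English lemma is skipped, and only the French lemma is added.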
### Tradeoff vs batch approach

Batching is no longer possible, since the real `term.id` must come back from the DB before the translations can be inserted. Each synset is processed individually. For ~25k rows this is still fast enough.

### Key types added

```ts
type Synset = {
  synset_id: string;
  pos: POS;
  // Partial — each file only contains a subset of languages
  translations: Partial<Record<LANGUAGE_CODE, string[]>>;
};

type FileName = {
  sourceLang: LANGUAGE_CODE;
  targetLang: LANGUAGE_CODE;
  pos: POS;
};
```

### Filename validation

```ts
const parseFilename = (filename: string): FileName => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(
      `Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
    );
  const [sourceLang, targetLang, pos] = parts;
  if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  if (!SUPPORTED_LANGUAGE_CODES.includes(targetLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${targetLang}`);
  if (!SUPPORTED_POS.includes(pos as POS))
    throw new Error(`Unsupported POS: ${pos}`);
  return {
    sourceLang: sourceLang as LANGUAGE_CODE,
    targetLang: targetLang as LANGUAGE_CODE,
    pos: pos as POS,
  };
};
```
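
For a quick feel of the validation behavior, here is a self-contained condensation of `parseFilename` with assumed constant values (the real lists live in `@glossa/shared`, and the actual POS spellings may differ):

```typescript
// Self-contained condensation of parseFilename for illustration only.
// The constant values are assumptions standing in for @glossa/shared.
const SUPPORTED_LANGUAGE_CODES = ["en", "it", "fr", "de"] as const;
const SUPPORTED_POS = ["nouns", "verbs", "adjectives"] as const;
type Lang = (typeof SUPPORTED_LANGUAGE_CODES)[number];
type Pos = (typeof SUPPORTED_POS)[number];

const parseFilename = (filename: string): { sourceLang: Lang; targetLang: Lang; pos: Pos } => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(`Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`);
  const [sourceLang, targetLang, pos] = parts;
  if (!(SUPPORTED_LANGUAGE_CODES as readonly string[]).includes(sourceLang))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  if (!(SUPPORTED_LANGUAGE_CODES as readonly string[]).includes(targetLang))
    throw new Error(`Unsupported language code: ${targetLang}`);
  if (!(SUPPORTED_POS as readonly string[]).includes(pos))
    throw new Error(`Unsupported POS: ${pos}`);
  return { sourceLang: sourceLang as Lang, targetLang: targetLang as Lang, pos: pos as Pos };
};

const parsed = parseFilename("en-it-nouns.json"); // { sourceLang: "en", targetLang: "it", pos: "nouns" }
```

A file like `en-xx-nouns.json` is rejected with `Unsupported language code: xx` before any row is read.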
### Upsert function (WIP)

```ts
const upsertSynset = async (
  synset: Synset,
  fileInfo: FileName,
): Promise<{ termInserted: boolean; translationsInserted: number }> => {
  const [upsertedTerm] = await db
    .insert(terms)
    .values({ synset_id: synset.synset_id, pos: synset.pos })
    .onConflictDoUpdate({
      target: terms.synset_id,
      set: { pos: synset.pos },
    })
    .returning({ id: terms.id, created_at: terms.created_at });

  // Heuristic: treat the term as newly inserted if its created_at is within
  // the last second (an update returns the original, older timestamp).
  const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);

  const translationRows = Object.entries(synset.translations).flatMap(
    ([lang, lemmas]) =>
      lemmas!.map((lemma) => ({
        id: crypto.randomUUID(),
        term_id: upsertedTerm.id,
        language_code: lang as LANGUAGE_CODE,
        text: lemma,
      })),
  );

  const result = await db
    .insert(translations)
    .values(translationRows)
    .onConflictDoNothing()
    .returning({ id: translations.id });

  return { termInserted, translationsInserted: result.length };
};
```
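
The driver loop around `upsertSynset` is not shown in this summary. Based on the commit description (read all JSON files from `scripts/datafiles/`, validate, upsert), it could be sketched as follows; the upsert is injected so the loop can run without a DB, and all names here are assumptions rather than the actual seed.ts:

```typescript
import fs from "node:fs/promises";
import path from "node:path";

// Hedged sketch of the v2 driver: scan the data directory, read each JSON
// file, and feed every synset to an upsert function. The upsert is injected
// so the loop can run without a DB; the real script would pass upsertSynset
// and also run parseFilename on each file name first.
type UpsertFn = (synset: unknown, file: string) => Promise<void>;

const seedAll = async (dataDir: string, upsert: UpsertFn): Promise<number> => {
  const files = (await fs.readdir(dataDir)).filter((f) => f.endsWith(".json"));
  let total = 0;
  for (const file of files) {
    const raw = await fs.readFile(path.join(dataDir, file), "utf8");
    const synsets = JSON.parse(raw);
    if (!Array.isArray(synsets)) throw new Error(`Expected a JSON array in ${file}`);
    for (const synset of synsets) {
      await upsert(synset, file); // per-row upsert: no batching, by design
      total++;
    }
  }
  return total;
};
```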
---

## 7. Strategy Comparison

| Strategy | Use case | Pros | Cons |
|----------|----------|------|------|
| Truncate + batch | Dev / first-time setup | Fast, simple | Wipes all data |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower |
| Migrations-as-data | Production audit trail | Clean history | Files accumulate |
| Diff-based sync | Large production datasets | Minimal writes | Complex to implement |

---

## 8. packages/db — package.json exports fix

The `exports` field must be an object, not an array:

```json
"exports": {
  ".": "./src/index.ts",
  "./schema": "./src/db/schema.ts"
}
```

Imports then resolve as:

```ts
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
```
@ -6,7 +6,7 @@

- add this to drizzle migrations file:
  ✅ ALTER TABLE terms ADD CHECK (pos IN ('noun', 'verb', 'adjective', etc));

## open word net
## openwordnet

download libraries via
@ -45,3 +45,17 @@ list all libraries:

```bash
python -c "import wn; print(wn.lexicons())"
```

## drizzle

generate migration file, go to packages/db, then:

```bash
pnpm drizzle-kit generate
```

execute migration, go to packages/db (docker containers need to be running):

```bash
DATABASE_URL=postgresql://username:password@localhost:5432/database pnpm drizzle-kit migrate
```
@ -26,17 +26,17 @@ Done when: `GET /api/decks/1/terms?limit=10` returns 10 terms from a specific deck

[x] Run `extract-en-it-nouns.py` locally → generates `datafiles/en-it-nouns.json`
-- Import ALL available OMW noun synsets (no frequency filtering)
[ ] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
[ ] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
[ ] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
[ ] Write `scripts/build_decks.ts` (reads external CEFR lists, matches to DB, creates decks)
[x] Write Drizzle schema: `terms`, `translations`, `language_pairs`, `term_glosses`, `decks`, `deck_terms`
[x] Write and run migration (includes CHECK constraints for `pos`, `gloss_type`)
[x] Write `packages/db/src/seed.ts` (imports ALL terms + translations, NO decks)
[ ] Download CEFR A1/A2 noun lists (from GitHub repos)
[ ] Write `scripts/build_decks.ts` (reads external CEFR lists, matches to DB, creates decks)
[ ] Run `pnpm db:seed` → populates terms
[ ] Run `pnpm db:build-decks` → creates curated decks
[ ] Define Zod response schemas in `packages/shared`
[ ] Implement `DeckRepository.getTerms(deckId, limit, offset)`
[ ] Implement `QuizService.attachDistractors(terms)` — same POS, server-side, no duplicates
[ ] Implement `GET /language-pairs`, `GET /decks`, `GET /decks/:id/terms` endpoints
[ ] Define Zod response schemas in `packages/shared`
[ ] Unit tests for `QuizService` (correct POS filtering, never includes the answer)
[ ] update decisions.md
@ -205,7 +205,6 @@ term_glosses

  term_id        uuid FK → terms.id
  language_code  varchar(10)  -- NOT NULL
  text           text         -- NOT NULL
  type           varchar(20)  -- CHECK (type IN ('definition', 'example')), NULLABLE
  created_at     timestamptz DEFAULT now()

language_pairs