Each synset extracted from WordNet is represented as:

```json
{
  "synset_id": "ili:i35545",
  "pos": "noun",
  "translations": { "en": ["entity"], "it": ["cosa", "entità"] }
}
```

**Fields:**

- `synset_id` — OMW Interlingual Index ID, maps to `terms.synset_id` in the DB
- `pos` — part of speech, matches the CHECK constraint on `terms.pos`
- `translations` — object of language code → array of lemmas (synonyms within a synset)
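This shape can be pinned down with a TypeScript type before seeding. A minimal sketch (the `Synset` type name is mine, not necessarily the project's), including the `Array.isArray` guard used against `JSON.parse` returning `any`:

```typescript
// Assumed shape for one extracted synset (type name is illustrative).
type Synset = {
  synset_id: string;
  pos: string;
  translations: Record<string, string[]>;
};

const raw = `[
  {
    "synset_id": "ili:i35545",
    "pos": "noun",
    "translations": { "en": ["entity"], "it": ["cosa", "entità"] }
  }
]`;

// JSON.parse returns any; guard before casting.
const parsed: unknown = JSON.parse(raw);
if (!Array.isArray(parsed)) throw new Error("Expected a JSON array of synsets");
const synsets = parsed as Synset[];

console.log(synsets[0].translations.it); // [ 'cosa', 'entità' ]
```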
|
@ -53,20 +51,21 @@ translations
|
|||
## 3. Seeding Script — v1 (batch, truncate-based)

### Approach

- Read a single JSON file
- Batch inserts into `terms` and `translations` in groups of 500
- Truncate tables before each run for a clean slate

### Key decisions made during development

| Issue                            | Resolution                                          |
| -------------------------------- | --------------------------------------------------- |
| `JSON.parse` returns `any`       | Added `Array.isArray` check before casting          |
| `forEach` doesn't await          | Switched to `for...of`                              |
| Empty array types                | Used Drizzle's `$inferInsert` types                 |
| `translations` naming conflict   | Renamed local variable to `translationRows`         |
| Final batch not flushed          | Added `if (termsArray.length > 0)` guard after loop |
| Exact batch size check `=== 500` | Changed to `>= 500`                                 |

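The `forEach` fix in the table is the subtlest one: `Array.prototype.forEach` discards the promises returned by an async callback, so the caller resumes before any awaited work finishes, while `for...of` awaits each iteration. A minimal, self-contained demonstration (names are illustrative):

```typescript
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// forEach: the async callbacks are fired and forgotten; the function
// returns before any of them completes.
async function withForEach(ids: number[]): Promise<number[]> {
  const done: number[] = [];
  ids.forEach(async (id) => {
    await delay(id);
    done.push(id);
  });
  return done; // still empty at this point
}

// for...of: each iteration is awaited, so completions happen in order
// and all work is finished before returning.
async function withForOf(ids: number[]): Promise<number[]> {
  const done: number[] = [];
  for (const id of ids) {
    await delay(id);
    done.push(id);
  }
  return done;
}
```

`await withForEach([30, 10])` resolves to an empty array (the pushes are still pending), while `await withForOf([30, 10])` resolves to `[30, 10]`.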
### Final script structure

```ts
const main = async () => {
  // ...
  if (termsArray.length >= 500) {
    batchCount++;
    console.log(
      `Uploading batch ${batchCount} (${batchCount * 500}/${allSynsets.length} synsets)...`,
    );
    await uploadToDB(termsArray, translationsArray);
    // reset the in-memory buffers for the next batch
    termsArray.length = 0;
    translationsArray.length = 0;
  }
  // ...
  if (termsArray.length > 0) {
    batchCount++;
    console.log(
      `Uploading final batch (${allSynsets.length}/${allSynsets.length} synsets)...`,
    );
    await uploadToDB(termsArray, translationsArray);
  }
};

main().catch((error) => {
  // ...
});
```

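The flush logic in `main()` can be exercised end-to-end with a mock uploader. A sketch with stand-in types (`uploadToDB` and the row arrays are simplified to numbers; the real script builds Drizzle insert rows):

```typescript
const BATCH_SIZE = 500;

// Simplified version of the v1 batching loop: flush whenever the buffer
// reaches BATCH_SIZE, then flush the final partial batch after the loop.
async function seed(
  allSynsets: number[],
  uploadToDB: (batch: number[]) => Promise<void>,
): Promise<number> {
  const termsArray: number[] = [];
  let batchCount = 0;
  for (const synset of allSynsets) {
    termsArray.push(synset);
    if (termsArray.length >= BATCH_SIZE) {
      batchCount++;
      await uploadToDB([...termsArray]);
      termsArray.length = 0; // reset the buffer for the next batch
    }
  }
  if (termsArray.length > 0) {
    batchCount++; // the final partial batch would otherwise be lost
    await uploadToDB([...termsArray]);
  }
  return batchCount;
}
```

Seeding 1,200 mock rows produces three uploads of 500, 500, and 200 rows.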
## 4. Pitfalls Encountered

### Duplicate key on re-run

Running the script twice causes `duplicate key value violates unique constraint "terms_synset_id_unique"`. Fix: truncate before seeding.

```bash
docker exec -it glossa-database psql -U glossa -d glossa -c "TRUNCATE translatio
```

### `onConflictDoNothing` breaks FK references

When `onConflictDoNothing` skips a `terms` insert, the in-memory UUID generated for that row is never written to the DB. Subsequent `translations` inserts still reference that non-existent UUID, causing an FK violation. This is why the truncate approach is correct for batch seeding.

### DATABASE_URL misconfigured

Correct format:

```
DATABASE_URL=postgresql://glossa:glossa@localhost:5432/glossa
```
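A quick sanity check is to parse the value with the WHATWG `URL` class, which splits out user, host, port, and database name (the hardcoded string below mirrors the example above; real code would read `process.env.DATABASE_URL`):

```typescript
// Parse the connection string and inspect its components.
const dbUrl = new URL("postgresql://glossa:glossa@localhost:5432/glossa");

console.log(dbUrl.protocol); // "postgresql:"
console.log(dbUrl.username); // "glossa"
console.log(dbUrl.hostname); // "localhost"
console.log(dbUrl.port);     // "5432"
console.log(dbUrl.pathname); // "/glossa"  (the database name)
```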

### Tables not found after `docker compose up`

Migrations must be applied first: `npx drizzle-kit migrate`

---
## 6. Seeding Script — v2 (incremental upsert, multi-file)

### Motivation

The truncate approach is fine for dev but unsuitable for production — it wipes all data. The v2 approach extends the database incrementally without ever truncating.

### File naming convention

One JSON file per language pair per POS:

```
scripts/datafiles/
  en-it-nouns.json
  ...
```

### How incremental upsert works

For a concept like "dog" already in the DB with English and Italian:

1. Import `en-fr-nouns.json`
2. Upsert `terms` by `synset_id` — finds the existing row, returns its real ID
3. `dog (en)` already exists → skipped by `onConflictDoNothing`
4. The new French lemma is inserted as a `translations` row pointing at the existing term's ID

The concept is **extended**, not replaced.

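The extend-not-replace behavior can be simulated in memory. A sketch where a `Map` and a `Set` stand in for the `terms` and `translations` tables (the `ili:dog` ID and the lemmas are made up for illustration; real IDs look like `ili:i35545`):

```typescript
// In-memory stand-ins for the two tables.
const termsTable = new Map<string, { id: number; pos: string }>(); // synset_id → term row
const translationsTable = new Set<string>(); // "<termId>:<lang>:<lemma>"
let nextId = 1;

// Upsert a term by synset_id: returns the real id whether it existed or not
// (onConflictDoUpdate-style).
function upsertTerm(synsetId: string, pos: string): number {
  const existing = termsTable.get(synsetId);
  if (existing) {
    existing.pos = pos;
    return existing.id;
  }
  const id = nextId++;
  termsTable.set(synsetId, { id, pos });
  return id;
}

// Insert a translation, silently skipping duplicates (onConflictDoNothing-style).
function upsertTranslation(termId: number, lang: string, lemma: string): void {
  translationsTable.add(`${termId}:${lang}:${lemma}`);
}

// First import: an en-it file adds "dog" with English and Italian.
const dogId = upsertTerm("ili:dog", "noun");
upsertTranslation(dogId, "en", "dog");
upsertTranslation(dogId, "it", "cane");

// Second import: an en-fr file finds the same term and only adds French.
const sameId = upsertTerm("ili:dog", "noun");
upsertTranslation(sameId, "en", "dog"); // duplicate → no-op
upsertTranslation(sameId, "fr", "chien");

console.log(sameId === dogId, translationsTable.size); // true 3
```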
### Tradeoff vs batch approach

Batching is no longer possible, since you need the real `term.id` from the DB before inserting translations. Each synset is processed individually. For 25k rows this is still fast enough.

### Key types added

```ts
type FileName = {
  // ...
};

const parseFilename = (filename: string): FileName => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(
      `Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
    );
  const [sourceLang, targetLang, pos] = parts;
  if (!SUPPORTED_LANGUAGE_CODES.includes(sourceLang as LANGUAGE_CODE))
    throw new Error(`Unsupported language code: ${sourceLang}`);
  // ...
};
```
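A self-contained sketch of the same parsing logic that can be run directly (the `FileName` shape and the language list are assumed stand-ins for the project's actual definitions):

```typescript
// Assumed stand-ins for the project's FileName type and language list.
type FileName = { sourceLang: string; targetLang: string; pos: string };
const SUPPORTED_LANGUAGE_CODES = ["en", "it", "fr"];

const parseFilename = (filename: string): FileName => {
  const parts = filename.replace(".json", "").split("-");
  if (parts.length !== 3)
    throw new Error(
      `Invalid filename format: ${filename}. Expected: sourcelang-targetlang-pos.json`,
    );
  const [sourceLang, targetLang, pos] = parts;
  // Validate both language codes, not just the source.
  for (const code of [sourceLang, targetLang])
    if (!SUPPORTED_LANGUAGE_CODES.includes(code))
      throw new Error(`Unsupported language code: ${code}`);
  return { sourceLang, targetLang, pos };
};

console.log(parseFilename("en-fr-nouns.json")); // { sourceLang: 'en', targetLang: 'fr', pos: 'nouns' }
```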

```ts
const upsertSynset = async (/* ... */) => {
  const [upsertedTerm] = await db
    .insert(terms)
    .values({ synset_id: synset.synset_id, pos: synset.pos })
    .onConflictDoUpdate({ target: terms.synset_id, set: { pos: synset.pos } })
    .returning({ id: terms.id, created_at: terms.created_at });

  // heuristic: the row was newly inserted if its created_at is within the last second
  const termInserted = upsertedTerm.created_at > new Date(Date.now() - 1000);
  // ...
};
```

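The `termInserted` heuristic deserves a note: it infers "this call inserted the row" from `created_at` being less than a second old. A minimal sketch of that comparison (the function name is mine):

```typescript
// A row is treated as freshly inserted when its created_at timestamp
// falls inside the last `windowMs` milliseconds.
const wasJustInserted = (createdAt: Date, windowMs = 1000): boolean =>
  createdAt.getTime() > Date.now() - windowMs;

console.log(wasJustInserted(new Date())); // true
console.log(wasJustInserted(new Date(Date.now() - 60_000))); // false
```

Note that this only works when the app and DB clocks agree to within the window.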
## 7. Strategy Comparison

| Strategy           | Use case                      | Pros                  | Cons                 |
| ------------------ | ----------------------------- | --------------------- | -------------------- |
| Truncate + batch   | Dev / first-time setup        | Fast, simple          | Wipes all data       |
| Incremental upsert | Production / adding languages | Safe, non-destructive | No batching, slower  |
| Migrations-as-data | Production audit trail        | Clean history         | Files accumulate     |
| Diff-based sync    | Large production datasets     | Minimal writes        | Complex to implement |

---

The `exports` field must be an object, not an array.

Imports then resolve as:

```ts
import { db } from "@glossa/db";
import { terms, translations } from "@glossa/db/schema";
```
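For reference, a minimal sketch of the `exports` object shape these imports imply (the paths are assumptions about where the package's entry points live, not the project's actual config):

```json
{
  "exports": {
    ".": "./src/index.ts",
    "./schema": "./src/schema.ts"
  }
}
```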