refactoring

This commit is contained in:
lila 2026-04-02 13:37:54 +02:00
parent b0c0baf9ab
commit cdedbc44cd
2 changed files with 416 additions and 175 deletions

View file

@ -0,0 +1,332 @@
# Code Review: `build-top-english-nouns-deck` seed script
Hey, good work getting this to a finished, working state — that's genuinely the hardest part. Below is feedback structured the way a mentor would give it: what the problem is, why it matters in a real codebase, and how to fix it. Work through these one by one when you refactor.
---
## 1. Function names should be imperative, not gerunds
### What you wrote
```ts
const readingFromWordlist = async () => { ... }
const checkingSourceWordsAgainstDB = async () => { ... }
```
### Why it's a problem
Functions represent *actions*. In English, imperative verbs describe actions: `read`, `fetch`, `build`. Gerunds (`reading`, `checking`) describe ongoing processes — they read like you're narrating what's happening rather than declaring what a function does. This isn't just style preference: when you're scanning a call stack or reading `main()`, imperative names parse faster because they match the mental model of "I am calling this to do a thing."
### How to fix it
```ts
const readWordlist = async () => { ... }
const resolveSourceTerms = async () => { ... } // "checking" undersells what it returns
const writeMissingWords = async () => { ... }
```
Note the rename of `checkingSourceWordsAgainstDB``resolveSourceTerms`. The original name describes the *mechanism* (checking against DB). A better name describes the *result* (resolving words into term IDs). Callers don't need to know it hits the DB.
### Further reading
- [Clean Code, Chapter 2 Meaningful Names](https://www.oreilly.com/library/view/clean-code-a/9780136083238/) — specifically the section on "Use Intention-Revealing Names"
- [Google TypeScript Style Guide Naming](https://google.github.io/styleguide/tsguide.html#naming-style)
---
## 2. N+1 query pattern in `validateLanguages` and `logLanguageCoverage`
### What you wrote
```ts
for (const language of languages) {
const rows = await db
.selectDistinct({ termId: translations.term_id })
.from(translations)
.where(
and(
inArray(translations.term_id, termIds),
eq(translations.language_code, language),
),
);
}
```
### Why it's a problem
This fires one database query *per language*. If you have 15 supported languages, that's 15 round trips. Each round trip has network latency, connection overhead, and query planning cost. The database already knows how to aggregate across all languages in a single pass — you're just not asking it to.
This pattern is called **N+1** (one query to get the list, then N queries for each item in the list) and it's one of the most common performance mistakes in applications that use databases. At 15 languages it's fine. At 50 languages with 100k terms, your script will be the reason someone gets paged at 2am.
### How to fix it
Ask the database to do the grouping for you in a single query:
```ts
import { count, ne } from "drizzle-orm";
const coverage = await db
.select({
language: translations.language_code,
coveredCount: count(translations.term_id),
})
.from(translations)
.where(
and(
inArray(translations.term_id, termIds),
ne(translations.language_code, sourceLanguage),
),
)
.groupBy(translations.language_code);
const validatedLanguages = coverage
.filter((row) => row.coveredCount === termIds.length)
.map((row) => row.language);
```
One query. The database returns a row per language with the count of covered terms. You filter in JS. Done.
### Further reading
- [Drizzle ORM `groupBy` and aggregations](https://orm.drizzle.team/docs/select#aggregations)
- ["What is the N+1 query problem" — StackOverflow](https://stackoverflow.com/questions/97197/what-is-the-n1-select-query-problem-and-how-can-it-be-avoided)
---
## 3. Two functions doing the same database work
### What you wrote
`validateLanguages` and `logLanguageCoverage` both loop over languages and fire the same query per language. You wrote the same logic twice.
### Why it's a problem
This is a violation of **DRY** (Don't Repeat Yourself). The immediate cost is that any bug in the query exists in two places — fixing one doesn't fix the other. The deeper cost is that it doubles your database load for no reason: you fetch the coverage data, use it to compute `validatedLanguages`, throw it away, then fetch it again just to log it.
### How to fix it
Once you apply the fix from point 2, you have a single `coverage` array. Use it for both purposes:
```ts
const coverage = await db... // single query from point 2
// Use for validation
const validatedLanguages = coverage
.filter((row) => row.coveredCount === termIds.length)
.map((row) => row.language);
// Use for logging
for (const row of coverage) {
console.log(` ${row.language}: ${row.coveredCount} / ${termIds.length} terms covered`);
}
```
No second trip to the database.
### Further reading
- [The DRY Principle](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)
---
## 4. Unnecessary array copying inside a loop
### What you wrote
```ts
const wordToTermIds = new Map<string, string[]>();
for (const row of rows) {
const existing = wordToTermIds.get(word) ?? [];
wordToTermIds.set(word, [...existing, row.termId]); // spreads the whole array every iteration
}
```
### Why it's a problem
`[...existing, row.termId]` creates a *brand new array* every time and copies all the previous elements into it. If "bank" has 3 homonyms, you allocate arrays of size 0, 1, 2, and 3 — throwing the first three away. This is an `O(n²)` memory allocation pattern. For 1000 words it's invisible. In a tighter loop or with more data, it adds up.
This pattern comes from functional programming habits (immutability is good there). But in a one-off script building a local data structure, there's no reason to avoid mutation.
### How to fix it
```ts
const wordToTermIds = new Map<string, string[]>();
for (const row of rows) {
const word = row.text.toLowerCase();
if (!wordToTermIds.has(word)) {
wordToTermIds.set(word, []);
}
wordToTermIds.get(word)!.push(row.termId);
}
```
Get the array once, push into it. No copies.
### Further reading
- [MDN Array.prototype.push()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/push)
- [Big O Notation primer](https://www.freecodecamp.org/news/big-o-notation-why-it-matters-and-why-it-doesnt-1674cfa8a23c/) — worth understanding O(n²) vs O(n)
---
## 5. No database transaction — your "idempotent" script can corrupt state
### What you wrote
```ts
deckId = await createDeck(validatedLanguages); // step 1
const addedCount = await addTermsToDeck(deckId, termIds); // step 2
await updateValidatedLanguages(deckId, validatedLanguages); // step 3
```
### Why it's a problem
These three operations are separate database round trips with nothing tying them together. If step 2 throws (network blip, constraint violation, anything), you end up with a deck row that has no terms. Run the script again and it finds the existing deck, skips creation, then tries to add terms — but now your `validated_languages` from the previous partial run might be stale. The script *appears* to recover, but you can't be sure of what state you're in.
A **transaction** is a guarantee: either all steps succeed together, or none of them do. If anything fails mid-way, the database rolls back to the state before the transaction started. This is fundamental to writing scripts that touch multiple tables.
### How to fix it
```ts
await db.transaction(async (tx) => {
const existingDeck = await findExistingDeck(tx);
let deckId: string;
if (!existingDeck) {
deckId = await createDeck(tx, validatedLanguages);
} else {
deckId = existingDeck.id;
}
await addTermsToDeck(tx, deckId, termIds);
await updateValidatedLanguages(tx, deckId, validatedLanguages);
});
```
You'll need to thread the `tx` (transaction context) through your functions instead of using the global `db` — that's the key change.
### Further reading
- [Drizzle ORM Transactions](https://orm.drizzle.team/docs/transactions)
- [PostgreSQL What is a Transaction?](https://www.postgresql.org/docs/current/tutorial-transactions.html)
- [ACID properties explained](https://www.databricks.com/glossary/acid-transactions) — Atomicity is what protects you here
---
## 6. The `isNewDeck` flag is unnecessary
### What you wrote
```ts
let isNewDeck: boolean;
if (!existingDeck) {
deckId = await createDeck(validatedLanguages);
isNewDeck = true;
} else {
deckId = existingDeck.id;
isNewDeck = false;
}
// ...later...
if (!isNewDeck) {
await updateValidatedLanguages(deckId, validatedLanguages);
}
```
### Why it's a problem
You introduced `isNewDeck` to avoid calling `updateValidatedLanguages` when the deck was just created — reasoning that you already passed `validatedLanguages` to `createDeck`. But that means you're calling `updateValidatedLanguages` in *one path* and `createDeck(..., validatedLanguages)` in the *other* path. The intent (always keep validated languages current) is the same in both cases, but the code splits it into two branches you have to mentally reconcile.
The cleaner model: always call `updateValidatedLanguages` after finding or creating the deck. Then `createDeck` doesn't need `validatedLanguages` at all, and `isNewDeck` disappears.
### How to fix it
```ts
const deckId = existingDeck
? existingDeck.id
: await createDeck(); // no validatedLanguages needed here
await addTermsToDeck(deckId, termIds);
await updateValidatedLanguages(deckId, validatedLanguages); // always runs
```
Fewer variables, one clear flow.
---
## 7. Comments explain *what*, not *why*
### What you wrote
```ts
// new Set() automatically discards duplicate values,
// and spreading it back with ... converts it to a plain array again.
// So if "bank" appears twice in the file,
// the resulting array will only contain it once.
const words = [...new Set(raw.split("\n").map((w) => w.trim().toLowerCase()).filter(Boolean))];
```
### Why it's a problem
Comments that re-explain what the code literally does are called **noise comments**. They add length without adding understanding — any developer who can read this script already knows what `Set` does. Worse, they can get out of date if the code changes but the comment doesn't.
Good comments explain *why* a decision was made, not *what* the code does. The code already says what it does.
Meanwhile, your most complex line — `const termIds = [...new Set(Array.from(wordToTermIds.values()).flat())]` — has no comment at all. That's the one that earns a note.
### How to fix it
```ts
// Deduplicate: multiple words can map to the same term ID (e.g. via synonyms)
const termIds = [...new Set(Array.from(wordToTermIds.values()).flat())];
```
And remove the Set explanation from `readWordlist`. The code is clear.
### Further reading
- [Clean Code, Chapter 4 Comments](https://www.oreilly.com/library/view/clean-code-a/9780136083238/) — specifically "Explain Yourself in Code" and "Noise Comments"
---
## 8. The finished roadmap comment should be deleted
### What you wrote
```ts
/*
* roadmap
* [x] Setup
* [x] Read wordlist
* ...all checked off
*/
```
### Why it's a problem
This was useful *while you were planning*. Now that every item is checked, it communicates nothing except "this is done" — which the existence of a working script already communicates. Leaving it in adds noise to the file header and signals that you're not sure what belongs in source control vs. a task tracker.
### How to fix it
Delete it. Use GitHub Issues, a Notion doc, or even a scratchpad file for planning notes. Source code is the output of planning, not the place to store it.
---
## 9. No log levels — everything goes to `console.log`
### What you wrote
```ts
console.log("📖 Reading word list...");
console.log(` ${sourceWords.length} words loaded\n`);
// ...and so on for every step
```
### Why it's a problem
In a real environment — CI/CD pipelines, server logs, anything beyond your local terminal — all of this output lands in the same stream at the same priority. Actual errors (`console.error`) get buried in progress logs. There's no way to run the script quietly when you just need the summary, or verbosely when you're debugging.
For a one-off seed script this is low priority, but it's a habit worth building early.
### How to fix it
At minimum, use `console.error` for actual errors (not just in the catch block — also for things like "deck creation returned no ID"). For the detailed per-language breakdown, consider putting it behind a `--verbose` CLI flag so you can run the script cleanly in CI without dumping hundreds of lines of coverage data.
```ts
// Basic approach
if (process.argv.includes("--verbose")) {
await logLanguageCoverage(termIds);
}
```
### Further reading
- [Node.js `process.argv`](https://nodejs.org/en/learn/command-line/nodejs-accept-arguments-from-the-command-line)
- For a proper solution later: [pino](https://github.com/pinojs/pino) — a lightweight structured logger widely used in Node.js
---
## Summary
| # | Issue | Priority |
|---|-------|----------|
| 1 | Gerund function names | Low — style, but builds good habits |
| 2 | N+1 queries | High — real performance impact |
| 3 | Duplicate query logic | High — bugs in two places |
| 4 | Array spread in loop | Medium — inefficient pattern to unlearn |
| 5 | No transaction | High — can corrupt database state |
| 6 | `isNewDeck` flag | Low — unnecessary complexity |
| 7 | Comments explain what, not why | Low — style, but important long-term |
| 8 | Roadmap comment left in | Low — cleanup |
| 9 | No log levels | Low — good habit to build |
Start with **2, 3, and 5** — those are the ones that would cause real problems in production. The rest are about writing code that's easier to read and maintain over time.
Good luck with the refactor. Come back with the updated script when you're done.

View file

@ -1,56 +1,20 @@
/*
*
* Builds the "top English nouns" deck from a curated wordlist of the 1000 most
* frequently used English nouns. The deck has English as its source language
* meaning it was curated from an English-centric frequency list, and a separate
* deck would be needed for other source languages. For each word in the list, all
* matching term IDs are looked up in the database via the translations table
* (language: "en", POS: "noun") homonyms are intentionally included as separate
* cards since the quiz UI displays a gloss alongside each word. Words from the
* list that have no DB match are skipped and written to a file for future
* reference. The script is idempotent: if the deck already exists, only terms
* present in the source but missing from the deck are added; terms already in the
* deck are left untouched; terms in the deck but absent from the source are never
* removed. After resolving all matched terms, the script determines
* validated_for_languages by checking which languages excluding the source
* language have full translation coverage across all matched terms, and updates
* the array on every run.
*/
/*
* roadmap
*
* [x] Setup - hardcoded path, name, description, source language, POS
* [x] Read wordlist - load and deduplicate the 1000 nouns
* [x] Query terms - match to database, collect all term IDs per word (including homonyms)
* [x] Write missing words to file for future reference
* [x] Determine validated_languages - find languages (excluding source) with full coverage across all matched terms
* [x] Check idempotency - if deck exists, diff matched terms against existing deck_terms
* [x] Create deck if it doesn't exist - insert with name, source_language, validated_languages
* [x] Add new terms - insert only term IDs present in source but missing from deck
* [x] Update validated_languages - recalculate and update on every run
* [x] Report - summary of words found, missing, added, and validated languages
*/
import fs from "node:fs/promises";
import { db } from "@glossa/db";
import { translations, terms, decks, deck_terms } from "@glossa/db/schema";
import { inArray, and, eq } from "drizzle-orm";
import { SUPPORTED_LANGUAGE_CODES } from "@glossa/shared";
import { inArray, and, eq, count, ne } from "drizzle-orm";
const pathToWordlist = "./src/data/wordlists/top1000englishnouns";
const nameOfDeck = "top english nouns";
const descriptionOfDeck =
"Most frequently used English nouns for vocabulary practice";
const sourceLanguage = "en";
const sourcePOS = "noun";
type DbOrTx = Parameters<Parameters<typeof db.transaction>[0]>[0];
// new Set() automatically discards duplicate values,
// and spreading it back with ... converts it to a plain array again.
// So if "bank" appears twice in the file,
// the resulting array will only contain it once.
const readingFromWordlist = async () => {
const raw = await fs.readFile(pathToWordlist, "utf8");
const config = {
pathToWordlist: "./src/data/wordlists/top1000englishnouns",
deckName: "top english nouns",
deckDescription: "Most frequently used English nouns for vocabulary practice",
sourceLanguage: "en",
sourcePOS: "noun",
} as const;
const readWordList = async () => {
const raw = await fs.readFile(config.pathToWordlist, "utf8");
const words = [
...new Set(
raw
@ -62,7 +26,7 @@ const readingFromWordlist = async () => {
return words;
};
const checkingSourceWordsAgainstDB = async (words: string[]) => {
const resolveSourceTerms = async (words: string[]) => {
const rows = await db
.select({ text: translations.text, termId: translations.term_id })
.from(translations)
@ -70,17 +34,21 @@ const checkingSourceWordsAgainstDB = async (words: string[]) => {
.where(
and(
inArray(translations.text, words),
eq(translations.language_code, sourceLanguage),
eq(terms.pos, sourcePOS),
eq(translations.language_code, config.sourceLanguage),
eq(terms.pos, config.sourcePOS),
),
);
const wordToTermIds = new Map<string, string[]>();
for (const row of rows) {
const word = row.text.toLowerCase();
const existing = wordToTermIds.get(word) ?? [];
wordToTermIds.set(word, [...existing, row.termId]);
if (!wordToTermIds.has(word)) {
wordToTermIds.set(word, []);
}
wordToTermIds.get(word)!.push(row.termId);
}
// Deduplicate: multiple words can map to the same term ID (e.g. via synonyms)
const termIds = [...new Set(Array.from(wordToTermIds.values()).flat())];
const missingWords = words.filter((w) => !wordToTermIds.has(w));
@ -88,100 +56,52 @@ const checkingSourceWordsAgainstDB = async (words: string[]) => {
};
const writeMissingWordsToFile = async (missingWords: string[]) => {
const outputPath = `${pathToWordlist}-missing`;
const outputPath = `${config.pathToWordlist}-missing`;
await fs.writeFile(outputPath, missingWords.join("\n"), "utf8");
};
const validateLanguages = async (sourceLanguage: string, termIds: string[]) => {
// create array of language code from the supported languages
// remove source language from it
const languages = SUPPORTED_LANGUAGE_CODES.filter(
(language) => language !== sourceLanguage,
);
const validatedLanguages: string[] = [];
// For each remaining language, count how many of the termIds have a translation in that language
for (const language of languages) {
const rows = await db
.selectDistinct({ termId: translations.term_id })
const coverage = await db
.select({
language: translations.language_code,
coveredCount: count(translations.term_id),
})
.from(translations)
.where(
and(
inArray(translations.term_id, termIds),
eq(translations.language_code, language),
ne(translations.language_code, sourceLanguage),
),
);
if (rows.length === termIds.length) {
validatedLanguages.push(language);
}
}
)
.groupBy(translations.language_code);
// If the count equals termIds.length → full coverage → include in result
// Return the array of fully covered languages
return validatedLanguages;
const validatedLanguages = coverage
.filter((row) => Number(row.coveredCount) === termIds.length)
.map((row) => row.language);
return { coverage, validatedLanguages };
};
// Check idempotency — if deck exists, diff matched terms against existing deck_terms
const findExistingDeck = async () => {
const existing = await db
const findExistingDeck = async (tx: DbOrTx) => {
const existing = await tx
.select({ id: decks.id, validatedForLanguages: decks.validated_languages })
.from(decks)
.where(
and(
eq(decks.name, nameOfDeck),
eq(decks.source_language, sourceLanguage),
eq(decks.name, config.deckName),
eq(decks.source_language, config.sourceLanguage),
),
);
return existing[0] ?? null;
};
// logging translation coverage per language across all matched terms
const logLanguageCoverage = async (termIds: string[]) => {
const languages = SUPPORTED_LANGUAGE_CODES.filter(
(language) => language !== sourceLanguage,
);
for (const language of languages) {
const rows = await db
.selectDistinct({ termId: translations.term_id })
.from(translations)
.where(
and(
inArray(translations.term_id, termIds),
eq(translations.language_code, language),
),
);
console.log(
` ${language}: ${rows.length} / ${termIds.length} terms covered`,
);
const coveredIds = new Set(rows.map((r) => r.termId));
const missingTermIds = termIds.filter((id) => !coveredIds.has(id));
console.log(` missing term IDs count:`, missingTermIds.length);
const missingEnglish = await db
.selectDistinct({ text: translations.text })
.from(translations)
.where(
and(
inArray(translations.term_id, missingTermIds),
eq(translations.language_code, "en"),
),
);
console.log(
` missing words in ${language}:`,
missingEnglish.map((r) => r.text),
"\n",
);
}
};
// creating a deck
const createDeck = async (validatedLanguages: string[]) => {
const result = await db
const createDeck = async (tx: DbOrTx, validatedLanguages: string[]) => {
const result = await tx
.insert(decks)
.values({
name: nameOfDeck,
description: descriptionOfDeck,
source_language: sourceLanguage,
name: config.deckName,
description: config.deckDescription,
source_language: config.sourceLanguage,
validated_languages: validatedLanguages,
is_public: false,
})
@ -191,36 +111,27 @@ const createDeck = async (validatedLanguages: string[]) => {
return created.id;
};
// Diffs termIds against the existing deck_terms for this deck and inserts only
// the ones not already present. Returns the count of newly inserted terms.
const addTermsToDeck = async (
tx: DbOrTx,
deckId: string,
termIds: string[],
): Promise<number> => {
const existingRows = await db
.select({ termId: deck_terms.term_id })
.from(deck_terms)
.where(eq(deck_terms.deck_id, deckId));
if (termIds.length === 0) return 0;
const existingTermIds = new Set(existingRows.map((r) => r.termId));
const newTermIds = termIds.filter((id) => !existingTermIds.has(id));
if (newTermIds.length === 0) return 0;
await db
await tx
.insert(deck_terms)
.values(newTermIds.map((termId) => ({ deck_id: deckId, term_id: termId })));
.values(termIds.map((termId) => ({ deck_id: deckId, term_id: termId })))
.onConflictDoNothing();
return newTermIds.length;
return termIds.length;
};
// Recalculates and persists validated_languages on every run so the field stays
// accurate as translation coverage grows over time.
const updateValidatedLanguages = async (
tx: DbOrTx,
deckId: string,
validatedLanguages: string[],
): Promise<void> => {
await db
await tx
.update(decks)
.set({ validated_languages: validatedLanguages })
.where(eq(decks.id, deckId));
@ -228,12 +139,11 @@ const updateValidatedLanguages = async (
const main = async () => {
console.log("📖 Reading word list...");
const sourceWords = await readingFromWordlist();
const sourceWords = await readWordList();
console.log(` ${sourceWords.length} words loaded\n`);
console.log("🔍 Checking against database...");
const { termIds, missingWords } =
await checkingSourceWordsAgainstDB(sourceWords);
const { termIds, missingWords } = await resolveSourceTerms(sourceWords);
console.log(` ${termIds.length} terms found`);
console.log(` ${missingWords.length} words not found in DB\n`);
@ -241,43 +151,42 @@ const main = async () => {
await writeMissingWordsToFile(missingWords);
console.log("✅ Validating languages...");
const validatedLanguages = await validateLanguages(sourceLanguage, termIds);
const { coverage, validatedLanguages } = await validateLanguages(
config.sourceLanguage,
termIds,
);
console.log(
` Validated languages: ${JSON.stringify(validatedLanguages)}\n`,
);
console.log("🔬 Language coverage breakdown...");
await logLanguageCoverage(termIds);
for (const row of coverage) {
console.log(
` ${row.language}: ${row.coveredCount} / ${termIds.length} terms covered`,
);
}
console.log("🃏 Looking for existing deck...");
const existingDeck = await findExistingDeck();
const addedCount = await db.transaction(async (tx) => {
const existingDeck = await findExistingDeck(tx);
const deckId = existingDeck
? existingDeck.id
: await createDeck(tx, validatedLanguages);
let deckId: string;
let isNewDeck: boolean;
const addedCount = await addTermsToDeck(tx, deckId, termIds);
if (!existingDeck) {
console.log(" No existing deck found, will create one\n");
console.log("🆕 Creating deck...");
deckId = await createDeck(validatedLanguages);
console.log(` Deck created with id: ${deckId}\n`);
isNewDeck = true;
} else {
console.log(` Found existing deck with id: ${existingDeck.id}\n`);
deckId = existingDeck.id;
isNewDeck = false;
const currentLanguages = existingDeck?.validatedForLanguages ?? [];
const hasChanged =
JSON.stringify([...currentLanguages].sort()) !==
JSON.stringify([...validatedLanguages].sort());
if (hasChanged) {
await updateValidatedLanguages(tx, deckId, validatedLanguages);
}
console.log(" Adding terms to deck...");
const addedCount = await addTermsToDeck(deckId, termIds);
return addedCount;
});
const alreadyPresentCount = termIds.length - addedCount;
console.log(` ${addedCount} terms added`);
console.log(` ${alreadyPresentCount} terms already in deck\n`);
if (!isNewDeck) {
console.log("🔄 Updating validated languages...");
await updateValidatedLanguages(deckId, validatedLanguages);
console.log(` Updated to: ${JSON.stringify(validatedLanguages)}\n`);
}
console.log("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━");
console.log("📊 Summary");