lila/documentation/audits/generating-decks.md
2026-04-10 20:09:46 +02:00

# Code Review: `build-top-english-nouns-deck` seed script
Hey, good work getting this to a finished, working state — that's genuinely the hardest part. Below is feedback structured the way a mentor would give it: what the problem is, why it matters in a real codebase, and how to fix it. Work through these one by one when you refactor.
---
## 1. Function names should be imperative, not gerunds
### What you wrote
```ts
const readingFromWordlist = async () => { ... }
const checkingSourceWordsAgainstDB = async () => { ... }
```
### Why it's a problem
Functions represent _actions_. In English, imperative verbs describe actions: `read`, `fetch`, `build`. Gerunds (`reading`, `checking`) describe ongoing processes — they read like you're narrating what's happening rather than declaring what a function does. This isn't just a style preference: when you're scanning a call stack or reading `main()`, imperative names parse faster because they match the mental model of "I am calling this to do a thing."
### How to fix it
```ts
const readWordlist = async () => { ... }
const resolveSourceTerms = async () => { ... } // "checking" undersells what it returns
const writeMissingWords = async () => { ... }
```
Note the rename of `checkingSourceWordsAgainstDB` to `resolveSourceTerms`. The original name describes the _mechanism_ (checking against DB). A better name describes the _result_ (resolving words into term IDs). Callers don't need to know it hits the DB.
### Further reading
- [Clean Code, Chapter 2: Meaningful Names](https://www.oreilly.com/library/view/clean-code-a/9780136083238/) — specifically the section on "Use Intention-Revealing Names"
- [Google TypeScript Style Guide: Naming](https://google.github.io/styleguide/tsguide.html#naming-style)
---
## 2. N+1 query pattern in `validateLanguages` and `logLanguageCoverage`
### What you wrote
```ts
for (const language of languages) {
  const rows = await db
    .selectDistinct({ termId: translations.term_id })
    .from(translations)
    .where(
      and(
        inArray(translations.term_id, termIds),
        eq(translations.language_code, language),
      ),
    );
}
```
### Why it's a problem
This fires one database query _per language_. If you have 15 supported languages, that's 15 round trips. Each round trip has network latency, connection overhead, and query planning cost. The database already knows how to aggregate across all languages in a single pass — you're just not asking it to.
This pattern is called **N+1** (one query to get the list, then N queries for each item in the list) and it's one of the most common performance mistakes in applications that use databases. At 15 languages it's fine. At 50 languages with 100k terms, your script will be the reason someone gets paged at 2am.
### How to fix it
Ask the database to do the grouping for you in a single query:
```ts
import { countDistinct, ne } from "drizzle-orm";

const coverage = await db
  .select({
    language: translations.language_code,
    // countDistinct mirrors the selectDistinct in the original loop:
    // a term with several translation rows in one language counts once
    coveredCount: countDistinct(translations.term_id),
  })
  .from(translations)
  .where(
    and(
      inArray(translations.term_id, termIds),
      ne(translations.language_code, sourceLanguage),
    ),
  )
  .groupBy(translations.language_code);

const validatedLanguages = coverage
  .filter((row) => row.coveredCount === termIds.length)
  .map((row) => row.language);
```
One query. The database returns a row per language with the count of covered terms. You filter in JS. Done.
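If you want to sanity-check the JS half without touching a database, the filter-and-map step runs fine on hand-rolled rows. A self-contained sketch with toy data (the languages and counts are made up, standing in for real query output):

```ts
// Toy coverage rows standing in for the groupBy result
type CoverageRow = { language: string; coveredCount: number };

const termIds = ["t1", "t2", "t3"]; // pretend the deck has 3 terms

const coverage: CoverageRow[] = [
  { language: "de", coveredCount: 3 }, // covers every term
  { language: "fr", coveredCount: 2 }, // missing one term
  { language: "es", coveredCount: 3 },
];

const validatedLanguages = coverage
  .filter((row) => row.coveredCount === termIds.length)
  .map((row) => row.language);

console.log(validatedLanguages); // → ["de", "es"]
```

Only languages whose covered count matches the full term list survive, which is exactly the validation rule the per-language loop was expressing with N queries.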
### Further reading
- [Drizzle ORM `groupBy` and aggregations](https://orm.drizzle.team/docs/select#aggregations)
- ["What is the N+1 query problem" — StackOverflow](https://stackoverflow.com/questions/97197/what-is-the-n1-select-query-problem-and-how-can-it-be-avoided)
---
## 3. Two functions doing the same database work
### What you wrote
`validateLanguages` and `logLanguageCoverage` both loop over languages and fire the same query per language. You wrote the same logic twice.
### Why it's a problem
This is a violation of **DRY** (Don't Repeat Yourself). The immediate cost is that any bug in the query exists in two places — fixing one doesn't fix the other. The deeper cost is that it doubles your database load for no reason: you fetch the coverage data, use it to compute `validatedLanguages`, throw it away, then fetch it again just to log it.
### How to fix it
Once you apply the fix from point 2, you have a single `coverage` array. Use it for both purposes:
```ts
const coverage = await db... // single query from point 2

// Use for validation
const validatedLanguages = coverage
  .filter((row) => row.coveredCount === termIds.length)
  .map((row) => row.language);

// Use for logging
for (const row of coverage) {
  console.log(`  ${row.language}: ${row.coveredCount} / ${termIds.length} terms covered`);
}
```
No second trip to the database.
### Further reading
- [The DRY Principle](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)
---
## 4. Unnecessary array copying inside a loop
### What you wrote
```ts
const wordToTermIds = new Map<string, string[]>();

for (const row of rows) {
  const existing = wordToTermIds.get(word) ?? [];
  wordToTermIds.set(word, [...existing, row.termId]); // spreads the whole array every iteration
}
```
### Why it's a problem
`[...existing, row.termId]` creates a _brand new array_ every time and copies all the previous elements into it. If "bank" has 3 homonyms, you allocate arrays of size 0, 1, 2, and 3 — throwing the first three away. This is an `O(n²)` memory allocation pattern. For 1000 words it's invisible. In a tighter loop or with more data, it adds up.
This pattern comes from functional programming habits (immutability is good there). But in a one-off script building a local data structure, there's no reason to avoid mutation.
### How to fix it
```ts
const wordToTermIds = new Map<string, string[]>();

for (const row of rows) {
  const word = row.text.toLowerCase();
  if (!wordToTermIds.has(word)) {
    wordToTermIds.set(word, []);
  }
  wordToTermIds.get(word)!.push(row.termId);
}
```
Get the array once, push into it. No copies.
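To see the grouping behave end to end, here's a self-contained run with toy rows (the row shape is assumed from the script; the term IDs are made up). "Bank" and "bank" are homonyms mapping to different terms and land in the same bucket:

```ts
// Toy rows: two spellings of "bank" map to different term IDs
type Row = { text: string; termId: string };

const rows: Row[] = [
  { text: "Bank", termId: "t1" },
  { text: "bank", termId: "t2" },
  { text: "tree", termId: "t3" },
];

const wordToTermIds = new Map<string, string[]>();
for (const row of rows) {
  const word = row.text.toLowerCase();
  let ids = wordToTermIds.get(word);
  if (!ids) {
    ids = [];
    wordToTermIds.set(word, ids); // one array per word, created once
  }
  ids.push(row.termId); // mutate in place, no copies
}

console.log(wordToTermIds.get("bank")); // → ["t1", "t2"]
```

Each word gets exactly one array allocation regardless of how many term IDs it accumulates.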
### Further reading
- [MDN Array.prototype.push()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/push)
- [Big O Notation primer](https://www.freecodecamp.org/news/big-o-notation-why-it-matters-and-why-it-doesnt-1674cfa8a23c/) — worth understanding O(n²) vs O(n)
---
## 5. No database transaction — your "idempotent" script can corrupt state
### What you wrote
```ts
deckId = await createDeck(validatedLanguages); // step 1
const addedCount = await addTermsToDeck(deckId, termIds); // step 2
await updateValidatedLanguages(deckId, validatedLanguages); // step 3
```
### Why it's a problem
These three operations are separate database round trips with nothing tying them together. If step 2 throws (network blip, constraint violation, anything), you end up with a deck row that has no terms. Run the script again and it finds the existing deck, skips creation, then tries to add terms — but now your `validated_languages` from the previous partial run might be stale. The script _appears_ to recover, but you can't be sure of what state you're in.
A **transaction** is a guarantee: either all steps succeed together, or none of them do. If anything fails mid-way, the database rolls back to the state before the transaction started. This is fundamental to writing scripts that touch multiple tables.
### How to fix it
```ts
await db.transaction(async (tx) => {
  const existingDeck = await findExistingDeck(tx);

  let deckId: string;
  if (!existingDeck) {
    deckId = await createDeck(tx, validatedLanguages);
  } else {
    deckId = existingDeck.id;
  }

  await addTermsToDeck(tx, deckId, termIds);
  await updateValidatedLanguages(tx, deckId, validatedLanguages);
});
```
You'll need to thread the `tx` (transaction context) through your functions instead of using the global `db` — that's the key change.
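If the rollback guarantee feels abstract, here's a synchronous toy model of it (not Drizzle; real transactions are async and enforced by the database, this class is purely illustrative). A snapshot is taken before the callback runs, and any throw restores it:

```ts
// Toy "database" with snapshot-based rollback, to make
// "all steps succeed or none do" concrete
class ToyDb {
  rows: string[] = [];

  transaction(fn: (tx: ToyDb) => void): void {
    const snapshot = [...this.rows]; // state before the transaction
    try {
      fn(this);
    } catch (err) {
      this.rows = snapshot; // rollback: discard partial writes
      throw err;
    }
  }
}

const db = new ToyDb();
try {
  db.transaction((tx) => {
    tx.rows.push("deck row");         // step 1 succeeds
    throw new Error("step 2 failed"); // step 2 blows up
  });
} catch {
  // the script would log and exit here
}

console.log(db.rows.length); // → 0: the deck row from step 1 is gone
```

Without the snapshot-and-restore, `db.rows` would keep the orphaned deck row — which is exactly the half-written state the real script can end up in today.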
### Further reading
- [Drizzle ORM Transactions](https://orm.drizzle.team/docs/transactions)
- [PostgreSQL: What is a Transaction?](https://www.postgresql.org/docs/current/tutorial-transactions.html)
- [ACID properties explained](https://www.databricks.com/glossary/acid-transactions) — Atomicity is what protects you here
---
## 6. The `isNewDeck` flag is unnecessary
### What you wrote
```ts
let isNewDeck: boolean;
if (!existingDeck) {
  deckId = await createDeck(validatedLanguages);
  isNewDeck = true;
} else {
  deckId = existingDeck.id;
  isNewDeck = false;
}

// ...later...
if (!isNewDeck) {
  await updateValidatedLanguages(deckId, validatedLanguages);
}
```
### Why it's a problem
You introduced `isNewDeck` to avoid calling `updateValidatedLanguages` when the deck was just created — reasoning that you already passed `validatedLanguages` to `createDeck`. But that means you're calling `updateValidatedLanguages` in _one path_ and `createDeck(..., validatedLanguages)` in the _other_ path. The intent (always keep validated languages current) is the same in both cases, but the code splits it into two branches you have to mentally reconcile.
The cleaner model: always call `updateValidatedLanguages` after finding or creating the deck. Then `createDeck` doesn't need `validatedLanguages` at all, and `isNewDeck` disappears.
### How to fix it
```ts
const deckId = existingDeck ? existingDeck.id : await createDeck(); // no validatedLanguages needed here
await addTermsToDeck(deckId, termIds);
await updateValidatedLanguages(deckId, validatedLanguages); // always runs
```
Fewer variables, one clear flow.
---
## 7. Comments explain _what_, not _why_
### What you wrote
```ts
// new Set() automatically discards duplicate values,
// and spreading it back with ... converts it to a plain array again.
// So if "bank" appears twice in the file,
// the resulting array will only contain it once.
const words = [
  ...new Set(
    raw
      .split("\n")
      .map((w) => w.trim().toLowerCase())
      .filter(Boolean),
  ),
];
```
### Why it's a problem
Comments that re-explain what the code literally does are called **noise comments**. They add length without adding understanding — any developer who can read this script already knows what `Set` does. Worse, they can get out of date if the code changes but the comment doesn't.
Good comments explain _why_ a decision was made, not _what_ the code does. The code already says what it does.
Meanwhile, your most complex line — `const termIds = [...new Set(Array.from(wordToTermIds.values()).flat())]` — has no comment at all. That's the one that earns a note.
### How to fix it
```ts
// Deduplicate: multiple words can map to the same term ID (e.g. via synonyms)
const termIds = [...new Set(Array.from(wordToTermIds.values()).flat())];
```
And remove the Set explanation from `readWordlist`. The code is clear.
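To convince yourself the dedupe line does what that comment claims, run it on a toy map where two words share a term ID (made-up IDs, not from the real data):

```ts
// Two words pointing at overlapping term IDs (toy data)
const wordToTermIds = new Map<string, string[]>([
  ["bank", ["t1", "t2"]],
  ["shore", ["t2"]], // "t2" is reachable from both words
]);

const termIds = [...new Set(Array.from(wordToTermIds.values()).flat())];

console.log(termIds); // → ["t1", "t2"]: "t2" appears once
```

`flat()` merges the per-word arrays, and the `Set` collapses the shared ID — which is the non-obvious step the comment should be pointing at.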
### Further reading
- [Clean Code, Chapter 4: Comments](https://www.oreilly.com/library/view/clean-code-a/9780136083238/) — specifically "Explain Yourself in Code" and "Noise Comments"
---
## 8. The finished roadmap comment should be deleted
### What you wrote
```ts
/*
* roadmap
* [x] Setup
* [x] Read wordlist
* ...all checked off
*/
```
### Why it's a problem
This was useful _while you were planning_. Now that every item is checked, it communicates nothing except "this is done" — which the existence of a working script already communicates. Leaving it in adds noise to the file header and signals that you're not sure what belongs in source control vs. a task tracker.
### How to fix it
Delete it. Use GitHub Issues, a Notion doc, or even a scratchpad file for planning notes. Source code is the output of planning, not the place to store it.
---
## 9. No log levels — everything goes to `console.log`
### What you wrote
```ts
console.log("📖 Reading word list...");
console.log(` ${sourceWords.length} words loaded\n`);
// ...and so on for every step
```
### Why it's a problem
In a real environment — CI/CD pipelines, server logs, anything beyond your local terminal — all of this output lands in the same stream at the same priority. Actual errors (`console.error`) get buried in progress logs. There's no way to run the script quietly when you just need the summary, or verbosely when you're debugging.
For a one-off seed script this is low priority, but it's a habit worth building early.
### How to fix it
At minimum, use `console.error` for actual errors (not just in the catch block — also for things like "deck creation returned no ID"). For the detailed per-language breakdown, consider putting it behind a `--verbose` CLI flag so you can run the script cleanly in CI without dumping hundreds of lines of coverage data.
```ts
// Basic approach
if (process.argv.includes("--verbose")) {
await logLanguageCoverage(termIds);
}
```
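If you want one step beyond the raw flag check before reaching for pino, a tiny leveled wrapper is enough. This is a hypothetical helper (the function and level names are mine, not from your script): debug lines only appear with `--verbose`, and errors always go to stderr:

```ts
// Minimal leveled logger sketch
type Level = "debug" | "info" | "error";
const LEVELS: Record<Level, number> = { debug: 0, info: 1, error: 2 };

const threshold: Level = process.argv.includes("--verbose") ? "debug" : "info";

function shouldLog(level: Level): boolean {
  return LEVELS[level] >= LEVELS[threshold];
}

function log(level: Level, message: string): void {
  if (!shouldLog(level)) return;
  // errors go to stderr so they survive piping stdout elsewhere
  (level === "error" ? console.error : console.log)(message);
}

log("info", "📖 Reading word list...");       // always shown
log("debug", "  de: 980 / 1000 terms");       // only with --verbose
log("error", "Deck creation returned no ID"); // goes to stderr
```

Twenty lines, no dependencies, and your CI output shrinks to the summary unless someone asks for more.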
### Further reading
- [Node.js `process.argv`](https://nodejs.org/en/learn/command-line/nodejs-accept-arguments-from-the-command-line)
- For a proper solution later: [pino](https://github.com/pinojs/pino) — a lightweight structured logger widely used in Node.js
---
## Summary
| # | Issue | Priority |
| --- | ------------------------------ | --------------------------------------- |
| 1 | Gerund function names | Low — style, but builds good habits |
| 2 | N+1 queries | High — real performance impact |
| 3 | Duplicate query logic | High — bugs in two places |
| 4 | Array spread in loop | Medium — inefficient pattern to unlearn |
| 5 | No transaction | High — can corrupt database state |
| 6 | `isNewDeck` flag | Low — unnecessary complexity |
| 7 | Comments explain what, not why | Low — style, but important long-term |
| 8 | Roadmap comment left in | Low — cleanup |
| 9 | No log levels | Low — good habit to build |
Start with **2, 3, and 5** — those are the ones that would cause real problems in production. The rest are about writing code that's easier to read and maintain over time.
Good luck with the refactor. Come back with the updated script when you're done.