lila/documentation/audits/generating-decks.md
2026-04-10 20:09:46 +02:00

# Code Review: `build-top-english-nouns-deck` seed script

Hey, good work getting this to a finished, working state — that's genuinely the hardest part. Below is feedback structured the way a mentor would give it: what the problem is, why it matters in a real codebase, and how to fix it. Work through these one by one when you refactor.


## 1. Function names should be imperative, not gerunds

**What you wrote**

```typescript
const readingFromWordlist = async () => { ... }
const checkingSourceWordsAgainstDB = async () => { ... }
```

**Why it's a problem**

Functions represent actions. In English, imperative verbs describe actions: read, fetch, build. Gerunds (reading, checking) describe ongoing processes — they read like you're narrating what's happening rather than declaring what a function does. This isn't just a style preference: when you're scanning a call stack or reading `main()`, imperative names parse faster because they match the mental model of "I am calling this to do a thing."

**How to fix it**

```typescript
const readWordlist = async () => { ... }
const resolveSourceTerms = async () => { ... }  // "checking" undersells what it returns
const writeMissingWords = async () => { ... }
```

Note the rename of `checkingSourceWordsAgainstDB` → `resolveSourceTerms`. The original name describes the mechanism (checking against the DB). A better name describes the result (resolving words into term IDs). Callers don't need to know it hits the DB.



## 2. N+1 query pattern in `validateLanguages` and `logLanguageCoverage`

**What you wrote**

```typescript
for (const language of languages) {
  const rows = await db
    .selectDistinct({ termId: translations.term_id })
    .from(translations)
    .where(
      and(
        inArray(translations.term_id, termIds),
        eq(translations.language_code, language),
      ),
    );
}
```

**Why it's a problem**

This fires one database query per language. If you have 15 supported languages, that's 15 round trips. Each round trip has network latency, connection overhead, and query planning cost. The database already knows how to aggregate across all languages in a single pass — you're just not asking it to.

This pattern is called N+1 (one query to get the list, then N queries for each item in the list) and it's one of the most common performance mistakes in applications that use databases. At 15 languages it's fine. At 50 languages with 100k terms, your script will be the reason someone gets paged at 2am.

**How to fix it**

Ask the database to do the grouping for you in a single query:

```typescript
import { count, ne } from "drizzle-orm";

const coverage = await db
  .select({
    language: translations.language_code,
    coveredCount: count(translations.term_id),
  })
  .from(translations)
  .where(
    and(
      inArray(translations.term_id, termIds),
      ne(translations.language_code, sourceLanguage),
    ),
  )
  .groupBy(translations.language_code);

const validatedLanguages = coverage
  .filter((row) => row.coveredCount === termIds.length)
  .map((row) => row.language);
```

One query. The database returns a row per language with the count of covered terms. You filter in JS. Done.
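To make the post-query step concrete, here is the same filtering run on hypothetical in-memory data (the names mirror the script's; the rows are made up):

```typescript
// Hypothetical result of the single grouped query: one row per language,
// carrying the count of terms covered in that language.
const termIds = ["t1", "t2", "t3"];
const coverage = [
  { language: "de", coveredCount: 3 }, // all 3 terms translated
  { language: "fr", coveredCount: 2 }, // one term missing
];

// A language validates only when every term is covered.
const validatedLanguages = coverage
  .filter((row) => row.coveredCount === termIds.length)
  .map((row) => row.language);
```

Only `de` survives the filter; `fr` is dropped because one term lacks a translation.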



## 3. Two functions doing the same database work

**What you wrote**

`validateLanguages` and `logLanguageCoverage` both loop over languages and fire the same query per language. You wrote the same logic twice.

**Why it's a problem**

This is a violation of DRY (Don't Repeat Yourself). The immediate cost is that any bug in the query exists in two places — fixing one doesn't fix the other. The deeper cost is that it doubles your database load for no reason: you fetch the coverage data, use it to compute `validatedLanguages`, throw it away, then fetch it again just to log it.

**How to fix it**

Once you apply the fix from point 2, you have a single `coverage` array. Use it for both purposes:

```typescript
const coverage = await db...  // single query from point 2

// Use for validation
const validatedLanguages = coverage
  .filter((row) => row.coveredCount === termIds.length)
  .map((row) => row.language);

// Use for logging
for (const row of coverage) {
  console.log(`  ${row.language}: ${row.coveredCount} / ${termIds.length} terms covered`);
}
```

No second trip to the database.



## 4. Unnecessary array copying inside a loop

**What you wrote**

```typescript
const wordToTermIds = new Map<string, string[]>();
for (const row of rows) {
  const existing = wordToTermIds.get(word) ?? [];
  wordToTermIds.set(word, [...existing, row.termId]); // spreads the whole array every iteration
}
```

**Why it's a problem**

`[...existing, row.termId]` creates a brand-new array every time and copies all the previous elements into it. If "bank" has 3 homonyms, you allocate arrays of size 0, 1, 2, and 3 — throwing the first three away. This is an O(n²) memory allocation pattern. For 1000 words it's invisible. In a tighter loop or with more data, it adds up.

This pattern comes from functional programming habits (immutability is good there). But in a one-off script building a local data structure, there's no reason to avoid mutation.

**How to fix it**

```typescript
const wordToTermIds = new Map<string, string[]>();
for (const row of rows) {
  const word = row.text.toLowerCase();
  if (!wordToTermIds.has(word)) {
    wordToTermIds.set(word, []);
  }
  wordToTermIds.get(word)!.push(row.termId);
}
```

Get the array once, push into it. No copies.
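As a small aside: the `has`/`set`/`get` sequence above does three map lookups per row. Grabbing the array once trims that down. A self-contained sketch on made-up rows:

```typescript
// Made-up rows standing in for the DB result.
const rows = [
  { text: "Bank", termId: "t1" },
  { text: "bank", termId: "t2" },
  { text: "tree", termId: "t3" },
];

const wordToTermIds = new Map<string, string[]>();
for (const row of rows) {
  const word = row.text.toLowerCase();
  let ids = wordToTermIds.get(word);
  if (ids === undefined) {
    ids = [];
    wordToTermIds.set(word, ids);
  }
  ids.push(row.termId); // mutate in place, no copies
}
```

Either variant is fine for this script; the point is simply to stop rebuilding the array.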



## 5. No database transaction — your "idempotent" script can corrupt state

**What you wrote**

```typescript
deckId = await createDeck(validatedLanguages); // step 1
const addedCount = await addTermsToDeck(deckId, termIds); // step 2
await updateValidatedLanguages(deckId, validatedLanguages); // step 3
```

**Why it's a problem**

These three operations are separate database round trips with nothing tying them together. If step 2 throws (network blip, constraint violation, anything), you end up with a deck row that has no terms. Run the script again and it finds the existing deck, skips creation, then tries to add terms — but now your `validated_languages` from the previous partial run might be stale. The script appears to recover, but you can't be sure what state you're in.

A transaction is a guarantee: either all steps succeed together, or none of them do. If anything fails mid-way, the database rolls back to the state before the transaction started. This is fundamental to writing scripts that touch multiple tables.
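To see that guarantee in isolation, here is a toy, synchronous, in-memory model of rollback (purely illustrative, nothing drizzle-specific):

```typescript
// Toy model of a transaction over an in-memory store: either the whole
// callback succeeds, or the store is restored to its pre-transaction state.
const runInTransaction = <T>(
  store: Map<string, string>,
  work: (tx: Map<string, string>) => T,
): T => {
  const snapshot = new Map(store); // remember state at BEGIN
  try {
    return work(store);
  } catch (err) {
    store.clear(); // ROLLBACK: put the snapshot back
    for (const [key, value] of snapshot) store.set(key, value);
    throw err;
  }
};

const store = new Map([["deck", "none"]]);
try {
  runInTransaction(store, (tx) => {
    tx.set("deck", "created");        // step 1 succeeds
    throw new Error("step 2 failed"); // step 2 blows up
  });
} catch {
  // the error propagates, but the store is untouched
}
```

A real database does this with write-ahead logging rather than a full snapshot, but the observable contract is the same: a failed step leaves no trace.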

**How to fix it**

```typescript
await db.transaction(async (tx) => {
  const existingDeck = await findExistingDeck(tx);

  let deckId: string;
  if (!existingDeck) {
    deckId = await createDeck(tx, validatedLanguages);
  } else {
    deckId = existingDeck.id;
  }

  await addTermsToDeck(tx, deckId, termIds);
  await updateValidatedLanguages(tx, deckId, validatedLanguages);
});
```

You'll need to thread the `tx` (transaction context) through your functions instead of using the global `db` — that's the key change.



## 6. The `isNewDeck` flag is unnecessary

**What you wrote**

```typescript
let isNewDeck: boolean;

if (!existingDeck) {
  deckId = await createDeck(validatedLanguages);
  isNewDeck = true;
} else {
  deckId = existingDeck.id;
  isNewDeck = false;
}

// ...later...
if (!isNewDeck) {
  await updateValidatedLanguages(deckId, validatedLanguages);
}
```

**Why it's a problem**

You introduced `isNewDeck` to avoid calling `updateValidatedLanguages` when the deck was just created — reasoning that you already passed `validatedLanguages` to `createDeck`. But that means you're calling `updateValidatedLanguages` in one path and `createDeck(..., validatedLanguages)` in the other. The intent (always keep validated languages current) is the same in both cases, but the code splits it into two branches you have to mentally reconcile.

The cleaner model: always call `updateValidatedLanguages` after finding or creating the deck. Then `createDeck` doesn't need `validatedLanguages` at all, and `isNewDeck` disappears.

**How to fix it**

```typescript
const deckId = existingDeck ? existingDeck.id : await createDeck(); // no validatedLanguages needed here

await addTermsToDeck(deckId, termIds);
await updateValidatedLanguages(deckId, validatedLanguages); // always runs
```

Fewer variables, one clear flow.


## 7. Comments explain what, not why

**What you wrote**

```typescript
// new Set() automatically discards duplicate values,
// and spreading it back with ... converts it to a plain array again.
// So if "bank" appears twice in the file,
// the resulting array will only contain it once.
const words = [
  ...new Set(
    raw
      .split("\n")
      .map((w) => w.trim().toLowerCase())
      .filter(Boolean),
  ),
];
```

**Why it's a problem**

Comments that re-explain what the code literally does are called noise comments. They add length without adding understanding — any developer who can read this script already knows what `Set` does. Worse, they can get out of date if the code changes but the comment doesn't.

Good comments explain why a decision was made, not what the code does. The code already says what it does.

Meanwhile, your most complex line — `const termIds = [...new Set(Array.from(wordToTermIds.values()).flat())]` — has no comment at all. That's the one that earns a note.

**How to fix it**

```typescript
// Deduplicate: multiple words can map to the same term ID (e.g. via synonyms)
const termIds = [...new Set(Array.from(wordToTermIds.values()).flat())];
```

And remove the `Set` explanation from `readWordlist`. The code is clear.
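To confirm what that dedupe line actually buys you, here it is on made-up data where two words share a term:

```typescript
// Hypothetical map: "bank" and "shore" resolve to an overlapping term ID.
const wordToTermIds = new Map<string, string[]>([
  ["bank", ["t1", "t2"]],
  ["shore", ["t2"]],
]);

// Deduplicate: multiple words can map to the same term ID
const termIds = [...new Set(Array.from(wordToTermIds.values()).flat())];
```

The flattened list is `["t1", "t2", "t2"]`; the `Set` collapses it to two unique IDs.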



## 8. The finished roadmap comment should be deleted

**What you wrote**

```typescript
/*
 * roadmap
 * [x] Setup
 * [x] Read wordlist
 * ...all checked off
 */
```

**Why it's a problem**

This was useful while you were planning. Now that every item is checked, it communicates nothing except "this is done" — which the existence of a working script already communicates. Leaving it in adds noise to the file header and signals that you're not sure what belongs in source control vs. a task tracker.

**How to fix it**

Delete it. Use GitHub Issues, a Notion doc, or even a scratchpad file for planning notes. Source code is the output of planning, not the place to store it.


## 9. No log levels — everything goes to `console.log`

**What you wrote**

```typescript
console.log("📖 Reading word list...");
console.log(`   ${sourceWords.length} words loaded\n`);
// ...and so on for every step
```

**Why it's a problem**

In a real environment — CI/CD pipelines, server logs, anything beyond your local terminal — all of this output lands in the same stream at the same priority. Actual errors (`console.error`) get buried in progress logs. There's no way to run the script quietly when you just need the summary, or verbosely when you're debugging.

For a one-off seed script this is low priority, but it's a habit worth building early.

**How to fix it**

At minimum, use `console.error` for actual errors (not just in the catch block — also for things like "deck creation returned no ID"). For the detailed per-language breakdown, consider putting it behind a `--verbose` CLI flag so you can run the script cleanly in CI without dumping hundreds of lines of coverage data.

```typescript
// Basic approach
if (process.argv.includes("--verbose")) {
  await logLanguageCoverage(termIds);
}
```
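If you want one step beyond the raw flag check, a tiny leveled logger keeps the call sites clean. This is a sketch with made-up names (`makeLogger` is not an existing helper in the codebase):

```typescript
// Hypothetical helper: wrap console in three levels. `verbose` would come
// from process.argv in the real script.
const makeLogger = (verbose: boolean) => ({
  info: (msg: string) => console.log(msg),
  debug: (msg: string) => {
    if (verbose) console.log(msg); // dropped unless --verbose
  },
  error: (msg: string) => console.error(msg),
});

const log = makeLogger(process.argv.includes("--verbose"));
log.info("Seeding deck...");           // always printed
log.debug("per-language coverage..."); // only with --verbose
```

Call sites then read `log.debug(...)` instead of wrapping every noisy line in its own `if`.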



## Summary

| # | Issue | Priority |
|---|-------|----------|
| 1 | Gerund function names | Low — style, but builds good habits |
| 2 | N+1 queries | High — real performance impact |
| 3 | Duplicate query logic | High — bugs in two places |
| 4 | Array spread in loop | Medium — inefficient pattern to unlearn |
| 5 | No transaction | High — can corrupt database state |
| 6 | `isNewDeck` flag | Low — unnecessary complexity |
| 7 | Comments explain what, not why | Low — style, but important long-term |
| 8 | Roadmap comment left in | Low — cleanup |
| 9 | No log levels | Low — good habit to build |

Start with 2, 3, and 5 — those are the ones that would cause real problems in production. The rest are about writing code that's easier to read and maintain over time.

Good luck with the refactor. Come back with the updated script when you're done.