DEV Community: GDS K S

OpenAI Codex now finishes 85% of scoped tasks. Here is the /goal workflow that gets you there.

GDS K S — Sun, 14 Jun 2026 02:56:59 +0000

OpenAI has been circulating an 85 to 90 percent success rate for Codex on well-scoped maintenance work. That number comes from internal testing, not an independent benchmark. But the mechanics behind it are real, and they explain both why it works and when it falls apart.

The feature is /goal. It shipped in Codex CLI 0.128.0 and became generally available across the CLI, IDE extension, and Codex app in version 0.133.0 on May 21, 2026. The short version: you set a goal, Codex loops until it believes the goal is complete, and the only hard stops are an evaluation that says "done" or a token budget that runs dry.

Understanding why that loop succeeds or fails on any given task is the whole game.

TL;DR

Scenario	Outcome	Why
Fix a failing test with a known error message	High pass rate	Scope is tight, completion is verifiable
Add a typed interface to an existing module	High pass rate	Output shape is checkable
Refactor a cross-cutting concern across 12 files	Fails often	Ambiguous scope, no clear done signal
Redesign the data model	Fails almost always	No binary done-check possible
Update a dependency and fix breakage	Medium	Depends on how far the breakage spreads

1. What /goal does and why "persisted" matters

A standard Codex turn is stateless. You ask something, it runs, the session ends. /goal breaks that pattern.

When you set a goal, Codex injects two prompts at the end of every turn automatically: goals/continuation.md and goals/budget_limit.md. The first tells the model to check whether the goal is complete and decide whether to continue. The second tracks token consumption and stops the loop before it exceeds your budget. The loop runs forward until one of those two conditions triggers.

Before version 0.133.0, goals were session-scoped. When the CLI process died, the goal died. The 0.133.0 release backed goals with dedicated storage so they track progress across active turns, including across CLI restarts. That is the "persisted" part. The goal state survives a restart of the CLI process.

Version 0.132.0 (May 19, 2026) added one important fix: goal continuations now stop at usage limits instead of spinning indefinitely. Before that fix, a goal with no clear completion signal would run until the process died or the account hit a rate limit.

The loop pattern OpenAI uses here is not novel. Practitioners call this the "Ralph loop": an agent that checks its own output and decides whether to keep going. Codex adds budget accounting and a persistence layer on top. The prompt injection runs automatically; you never write the continuation prompts yourself.
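The pattern is easier to reason about once it is written down. What follows is a minimal sketch of a continue-until-done loop with a token budget, not Codex's implementation: runGoalLoop, TurnResult, and the callback shape are hypothetical names, and token accounting is reduced to a number each turn reports.

```typescript
// Hypothetical sketch of a self-checking loop with a token budget.
// None of these names come from Codex; they only illustrate the pattern.
type TurnResult = { tokensUsed: number; done: boolean };
type LoopOutcome = {
  status: "done" | "budget_exhausted";
  turns: number;
  tokens: number;
};

function runGoalLoop(
  runTurn: (turn: number) => TurnResult,
  tokenBudget: number,
): LoopOutcome {
  let tokens = 0;
  let turns = 0;
  while (tokens < tokenBudget) {
    const result = runTurn(turns);
    turns += 1;
    tokens += result.tokensUsed;
    // Stop condition 1: the evaluation step reports the goal as complete.
    if (result.done) return { status: "done", turns, tokens };
  }
  // Stop condition 2: the budget ran dry before the goal was met.
  return { status: "budget_exhausted", turns, tokens };
}
```

A goal with no reachable done signal always exits through the second branch, which is why the budget cap matters as much as the evaluation.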

2. The shape of a task that hits 85%

Three properties push a task into the high success range.

The goal must have a binary success check. "Fix the failing tests in src/auth" works. "Improve the auth module" does not. The agent needs to run a verification step and get a yes or no result. Passing CI is yes or no. "Better code" is not.

The scope must stay tight. A goal that touches one module or one interface definition gives the agent a small search space. If the fix requires changes in five unrelated parts of the codebase, the agent will solve three of them and stall on the fourth with no way to know it stalled.

The success condition must be observable from within the session. Write a shell command that returns 0 on success and non-zero on failure, and the agent can self-check. Tests are the obvious example. Type checks work too. Lint rules work. "The PR passes review" does not, because the agent cannot run that check.
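To make "exits 0 on success" concrete, here is a small sketch of a done-check reduced to an exit code. runVerify is a hypothetical helper written for this post, not part of the Codex CLI; it only shows what a binary, observable check looks like from inside a Node process.

```typescript
import { spawnSync } from "node:child_process";

// Runs a verification command and reports success purely by exit code.
// `runVerify` is a hypothetical helper, not a Codex API.
function runVerify(command: string, args: string[] = []): boolean {
  const result = spawnSync(command, args, { stdio: "ignore" });
  // status is null when the process could not start or was killed by a signal,
  // so anything other than an explicit 0 counts as "not done".
  return result.status === 0;
}

// Example done-checks (commented out because they depend on your project):
// runVerify("npx", ["tsc", "--noEmit"]);
// runVerify("npm", ["test"]);
```

If you cannot express the goal as a call like this, the agent cannot express it either.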

Tasks I have seen work well:

Write a missing test for a specific function, run it green
Add a TypeScript interface that satisfies an existing as cast
Bump a dependency version and fix the type errors that surface
Extract a repeated code block into a shared utility and update all call sites in one directory

Every one of those has a finish line the agent can reach and measure.

3. The shape of a task that fails

The failure modes split into two categories: scope creep and unprovable completion.

Scope creep happens when the agent fixes one thing and reveals another. You ask it to fix a failing integration test. It fixes the test by updating the mock. The mock now diverges from the real API. The agent has no instruction to check that, so it declares done. The tests pass locally and the change fails in staging two days later. The agent did exactly what you said. The goal was too narrow.

Unprovable completion happens when the agent cannot self-check. "Refactor this service to be more readable" gives the agent nothing to verify. The agent will make changes, decide the changes look reasonable, mark the goal complete, and stop. Whether the code reads better is a human judgment. The agent will produce something and stop confidently regardless.

Architectural changes fail almost every time. If the task requires deciding where a module boundary should sit, or which service owns a responsibility, the agent hits the ambiguity and either picks one arbitrarily or loops until budget. That is not a capability gap. The task is genuinely underdetermined. No amount of looping closes that.

The 85% number, whatever its exact measurement method, almost certainly applies to a curated set of maintenance tasks with clear success criteria. If you point /goal at open-ended design work, you are not in the 85%. You are in a different distribution entirely.

4. Setup and a sample /goal call

Install or update the Codex CLI:

npm install -g @openai/codex
codex --version
# 0.133.0 or later for persistent goals

Check that goals are active (on by default since 0.133.0, but worth confirming):

codex doctor
# look for: goals: enabled, storage: ok

Set a goal from the CLI:

codex goal set "All tests in src/payments pass with no TypeScript errors"

Start a session in the repo and let it run:

cd /your/repo
codex
# Codex picks up the active goal and begins the loop

Watch it loop:

codex goal status
# shows: active goal, turns completed, tokens used, last evaluation result

The agent runs npm test or your configured test command at the end of each turn, checks the output, and decides whether to continue. If it cannot find a test command, it looks for package.json scripts named test, typecheck, or lint in that order.

For a task with a tighter scope, you can inline the success command:

codex goal set "Fix TypeScript errors in src/api/routes.ts" \
  --verify "npx tsc --noEmit --project tsconfig.json"

The --verify flag tells Codex which command to use as the done-check instead of inferring it. Pass anything that exits 0 on success.

Cancel a goal that has stalled:

codex goal cancel

List past goals and their outcomes:

codex goal list --limit 10

5. Wiring /goal into CI for safety

The loop does not replace CI. Treat it as a way to get closer to green before CI runs. The agent's output goes through type check, lint, and tests before merging, same as any other code.

A GitHub Actions job that verifies Codex-generated changes:

name: verify-codex-output

on:
  pull_request:
    branches: [main]

jobs:
  type-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      - name: Install
        run: npm ci

      - name: Type check
        run: npx tsc --noEmit

      - name: Lint
        run: npx eslint src --max-warnings 0

      - name: Test
        run: npm test -- --coverage --passWithNoTests

  detect-scope-creep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Count changed files
        run: |
          CHANGED=$(git diff --name-only origin/main...HEAD | wc -l)
          echo "Changed files: $CHANGED"
          if [ "$CHANGED" -gt 20 ]; then
            echo "::warning::PR changes $CHANGED files. Review for unintended scope creep."
          fi

The scope-creep check is the one I added specifically for agent-authored PRs. If Codex touches more than 20 files on what should be a five-file task, someone needs to read what happened. The warning does not block the PR; it flags it for a slower review.

The important CI rule: never relax your existing quality gates for agent-generated code. If anything, add the file-count check. An agent that cannot measure its own scope will not stop itself from editing 40 files to fix a one-line bug.
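The same check is easy to run outside GitHub Actions. This is a sketch of the file-count logic as a plain function, assuming you feed it the text that git diff --name-only prints; countChangedFiles, exceedsScope, and the default limit of 20 are illustrative choices, not part of any tool.

```typescript
// Counts non-empty lines in the output of `git diff --name-only`.
function countChangedFiles(diffOutput: string): number {
  return diffOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0).length;
}

// Flags a change set that touches more files than the task should need.
// The default limit mirrors the 20-file threshold used in the workflow above.
function exceedsScope(diffOutput: string, limit = 20): boolean {
  return countChangedFiles(diffOutput) > limit;
}
```

Lower the limit per task when you know the expected footprint: a one-file fix that touches six files deserves the same slow review.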

Pre-commit hooks are the other layer. Add a quick type check before the commit even reaches CI:

# .pre-commit-config.yaml (if using pre-commit)
repos:
  - repo: local
    hooks:
      - id: tsc
        name: TypeScript check
        entry: npx tsc --noEmit
        language: system
        pass_filenames: false

Or wire it directly in package.json using husky:

{
  "scripts": {
    "prepare": "husky install",
    "typecheck": "tsc --noEmit"
  }
}

# .husky/pre-commit
npm run typecheck

Now every commit the agent makes, whether from a /goal loop or a single turn, has to pass the type check locally before the commit is created, which is well before anything can be pushed.

The bottom line

The /goal loop works on tasks where "done" has a binary answer the agent can check itself. Write that verify command before you set the goal. If you cannot write that command, the task needs more scoping before you hand it to the agent.

The 85% figure covers curated maintenance tasks. You cannot carry that rate over to any task you hand the tool. Architectural decisions, ambiguous refactors, and cross-cutting changes will not approach that number regardless of turn count.

The persistence layer that shipped in 0.133.0 is the real unlock. A goal that survives a CLI restart means you can set a task running, close the terminal, and come back to a result rather than a dead session. That changes the workflow from "supervised agent" to something closer to a slow async job. Wire it into CI, cap the budget, and treat the output like any other unreviewed PR.

What is the first maintenance task in your backlog that has a clear test-based done condition? That is the one to try /goal on first.

GDS K S · thegdsks.com · follow on X @thegdsks

Set the verify command before the goal. If you cannot write it, the scope is not ready.

Building a production TypeScript CLI in 2026: oclif vs commander vs custom.

GDS K S — Tue, 09 Jun 2026 06:57:30 +0000

I shipped my first Node CLI in 2019 with a 12-line arg slicer and process.argv. It worked until it needed a second command and then collapsed into spaghetti. The other extreme is grabbing a full framework for a tool that runs one command. In 2026 there are three reasonable paths between those extremes, and each one wins on a specific slice of the problem.

This post covers @oclif/core v4, commander v14, and a zero-dependency parser that fits in 30 lines. Same "greet" command in all three. Same distribution steps at the end. Honest tradeoffs throughout.

TL;DR

	oclif v4	commander v14	zero-dep
npm install size	~8 MB	~220 kB	0 B
Type inference on flags	Full, generated	Good, manual	Manual
Plugin ecosystem	Yes (Heroku, Salesforce)	No	No
Learning curve	High (day 1)	Low (hour 1)	None
Best for	Multi-team, multi-command CLIs	Most real-world tools	One-shot scripts

1. The decision: framework vs no framework

Reach for a framework when the tool needs subcommands, a plugin system, or auto-generated help text. The second engineer who touches the CLI should be able to find where things live without reading your code twice.

Build your own when the tool does one thing, ships as a one-file script, or lives inside a monorepo where pulling in 8 MB of transitive deps is not welcome. A zero-dep parser also removes the surface area for supply-chain incidents, a real concern on tools that run in CI.

Commander sits in the middle: a 220 kB install that covers most real tools without the scaffolding overhead of oclif.

2. Project skeleton

Every path shares the same bin setup. Start with a package.json that declares the executable:

{
  "name": "greet-cli",
  "version": "1.0.0",
  "bin": {
    "greet": "./dist/cli.js"
  },
  "scripts": {
    "build": "tsc",
    "dev": "tsx src/cli.ts"
  },
  "type": "module"
}

The tsconfig.json for a CLI targets the Node release line you plan to support. Node 24 LTS handles ESM natively, so use "module": "NodeNext" and "moduleResolution": "NodeNext":

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "outDir": "dist",
    "strict": true,
    "declaration": true
  },
  "include": ["src"]
}

The entry file needs a shebang on line one and must be executable after build:

#!/usr/bin/env node
// src/cli.ts

After tsc, run chmod +x dist/cli.js once. In a proper CI pipeline, add that to the build script. npm link during development installs the greet binary into your PATH so you can test it as a real command.

3. The greet command, three ways

oclif v4

Scaffold with npx oclif generate greet-cli, then replace the generated command:

// src/commands/greet.ts
import { Args, Command, Flags } from "@oclif/core";

export default class Greet extends Command {
  static override description = "Print a greeting";

  static override args = {
    name: Args.string({ description: "Name to greet", required: true }),
  };

  static override flags = {
    loud: Flags.boolean({ char: "l", description: "Uppercase the output" }),
    times: Flags.integer({ char: "t", description: "Repeat N times", default: 1 }),
  };

  async run(): Promise<void> {
    const { args, flags } = await this.parse(Greet);
    const message = `Hello, ${args.name}!`;
    for (let i = 0; i < flags.times; i++) {
      this.log(flags.loud ? message.toUpperCase() : message);
    }
  }
}

Run it with ./bin/run.js greet Alice --loud --times 3. Help text generates automatically from the static properties. TypeScript infers the types on flags.times as number and flags.loud as boolean without any manual annotation.

The this.log and this.error methods route through oclif's output system, which makes testing easier: oclif provides a runCommand test helper that captures stdout without mocking console.

commander v14

Install: npm install commander. No generator needed.

#!/usr/bin/env node
// src/cli.ts
import { Command } from "commander";

const program = new Command();

program
  .name("greet")
  .description("Print a greeting")
  .version("1.0.0");

program
  .command("greet <name>")
  .description("Greet someone by name")
  .option("-l, --loud", "Uppercase the output")
  .option("-t, --times <n>", "Repeat N times", "1")
  .action((name: string, opts: { loud?: boolean; times: string }) => {
    const times = parseInt(opts.times, 10);
    const message = `Hello, ${name}!`;
    for (let i = 0; i < times; i++) {
      console.log(opts.loud ? message.toUpperCase() : message);
    }
  });

program.parse();

The string-to-number conversion on opts.times is manual. Commander parses all option values as strings unless you supply a custom parser function. That is the primary friction point for TypeScript users: you get good autocomplete on the option names but the values carry a weaker type until you cast or coerce them.
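One way to close that gap is a small coercion function. Commander's .option() accepts a parser as its third argument and calls it with the raw string and the previous value; the sketch below follows that shape. parsePositiveInt is an illustrative name, not a Commander export, and it throws a plain Error so the block stays dependency-free.

```typescript
// Coercion function with the (value, previous) shape that Commander passes
// to a custom option parser. Rejects anything that is not a whole number >= 1.
function parsePositiveInt(value: string, _previous?: number): number {
  if (!/^\d+$/.test(value)) {
    throw new Error(`Expected a positive integer, got "${value}"`);
  }
  const parsed = Number(value);
  if (parsed < 1) {
    throw new Error(`Expected a positive integer, got "${value}"`);
  }
  return parsed;
}

// Intended wiring (not executed here, since it needs commander installed):
// program.option("-t, --times <n>", "Repeat N times", parsePositiveInt, 1);
```

With the parser in place, opts.times arrives as a number and the action handler can declare it as times: number. Commander also exports InvalidArgumentError, which produces a friendlier message than a plain Error if you are willing to import it.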

Commander has offered .argument() as a chainable method since v8, and it reads cleaner than embedding arguments in the command string for complex cases. The core API has been stable since v8 as well, so the learning investment carries forward.

Zero-dependency, 30 lines

No install. No generator. Drop this into src/cli.ts:

#!/usr/bin/env node

type ParsedArgs = {
  positional: string[];
  flags: Record<string, string | boolean>;
};

function parseArgs(argv: string[]): ParsedArgs {
  const positional: string[] = [];
  const flags: Record<string, string | boolean> = {};
  let i = 0;
  while (i < argv.length) {
    const arg = argv[i];
    if (arg.startsWith("--")) {
      const key = arg.slice(2);
      const next = argv[i + 1];
      if (next && !next.startsWith("-")) {
        flags[key] = next;
        i += 2;
      } else {
        flags[key] = true;
        i += 1;
      }
    } else if (arg.startsWith("-") && arg.length === 2) {
      flags[arg.slice(1)] = true;
      i += 1;
    } else {
      positional.push(arg);
      i += 1;
    }
  }
  return { positional, flags };
}

const { positional, flags } = parseArgs(process.argv.slice(2));
const [command, name] = positional;

if (command === "greet" && name) {
  const times = flags.times ? parseInt(flags.times as string, 10) : 1;
  const msg = `Hello, ${name}!`;
  for (let i = 0; i < times; i++) {
    console.log(flags.loud ? msg.toUpperCase() : msg);
  }
} else {
  console.log("Usage: greet greet <name> [--loud] [--times <n>]");
  process.exit(1);
}

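This handles --loud, --times 3, and positional args. It does not handle --times=3, short-form chaining (-lt), or negated flags (--no-loud). Two more limits are easy to miss: short flags are stored under their single letter, so -l sets flags.l rather than flags.loud, and a boolean flag placed directly before a positional swallows it, so greet greet --loud Alice parses Alice as the value of --loud. Add handling for those cases if you need them. Each addition is about 5 lines and you understand every byte.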
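As an illustration of what one of those additions looks like, here is a sketch that interprets a single long-form argument, covering the --key=value and --no-key forms. parseLongFlag and FlagEntry are hypothetical names; you would call the helper from inside the parser's "--" branch.

```typescript
type FlagEntry = { key: string; value: string | boolean };

// Interprets one long-form argument. Returns null for anything that is
// not a long flag (positionals, short flags, or a bare "--").
function parseLongFlag(arg: string): FlagEntry | null {
  if (!arg.startsWith("--") || arg.length <= 2) return null;
  const body = arg.slice(2);
  const eq = body.indexOf("=");
  if (eq > 0) {
    // --times=3 becomes { key: "times", value: "3" }
    return { key: body.slice(0, eq), value: body.slice(eq + 1) };
  }
  if (body.startsWith("no-") && body.length > 3) {
    // --no-loud becomes { key: "loud", value: false }
    return { key: body.slice(3), value: false };
  }
  // Plain --loud becomes { key: "loud", value: true }
  return { key: body, value: true };
}
```

Because the inline form carries its own value, it also sidesteps the ambiguity of a flag followed by a positional.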

4. Subcommands, flags, and where each path struggles

Subcommands are where the paths diverge most sharply.

In oclif, each subcommand is a file in src/commands/. A file at src/commands/user/create.ts maps to mycli user create. The directory structure is the routing table. That pattern scales to 30 commands because you can grep for a file name.
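To show what "the directory structure is the routing table" means in practice, here is a sketch that derives a command from a file path. This is not oclif's resolver; commandFromPath is a hypothetical function, and treating an index file as its parent directory's command is an assumption of this sketch.

```typescript
// Derives the command a file would answer to from its path under src/commands/.
// Illustration only: a real framework does more (aliases, hidden commands, plugins).
function commandFromPath(filePath: string, root = "src/commands/"): string | null {
  if (!filePath.startsWith(root)) return null;
  const relative = filePath.slice(root.length).replace(/\.(ts|js)$/, "");
  // Assumption: an index file stands for its parent directory's command.
  const parts = relative.split("/").filter((p) => p !== "" && p !== "index");
  return parts.length > 0 ? parts.join(" ") : null;
}
```

The practical upshot is that finding the code for a command is a path lookup, not a search through registration calls.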

In commander, subcommands chain off the root program:

const userCmd = program.command("user");
userCmd.command("create <email>").action((email) => { /* ... */ });
userCmd.command("delete <id>").action((id) => { /* ... */ });

That works well up to around 10 subcommands in a single file. Past that, split into separate files and import each group, then register them. Commander does not enforce any file layout, so naming conventions matter more.

The zero-dep path requires a manual dispatch table. A switch on command covers five subcommands cleanly. Beyond five, the file grows fast and the argument parsing for each command needs its own handling. That is the natural ceiling where migrating to commander or oclif starts paying off.
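A dispatch table can be a switch, or it can be a lookup structure. The sketch below uses a Map instead of a switch, which is a swapped-in variant of the same idea; the handlers and their names are hypothetical, and each one returns the text it would print so the logic stays easy to test.

```typescript
type Handler = (args: string[]) => string;

// Hypothetical handlers. A Map avoids accidental matches on inherited
// object properties such as "toString".
const handlers = new Map<string, Handler>([
  ["greet", (args) => `Hello, ${args[0] ?? "world"}!`],
  ["version", () => "1.0.0"],
]);

function dispatch(argv: string[]): string {
  const [command, ...rest] = argv;
  const handler = command === undefined ? undefined : handlers.get(command);
  if (!handler) {
    return `Unknown command. Available: ${[...handlers.keys()].join(", ")}`;
  }
  return handler(rest);
}
```

Usage would be console.log(dispatch(process.argv.slice(2))). Each new subcommand is one more Map entry, and that is exactly where the per-command argument handling starts to pile up.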

Prompts (interactive input like password fields or selection lists) sit outside all three. None of them bundle an interactive prompt library. The standard pairing is inquirer for oclif and commander, or Node's built-in readline interface for the zero-dep path.

5. Distribution via npm

Publishing a CLI to npm follows the same steps regardless of which framework you chose.

{
  "name": "@yourscope/greet-cli",
  "version": "1.0.0",
  "bin": { "greet": "./dist/cli.js" },
  "files": ["dist"],
  "engines": { "node": ">=20" }
}

The files array keeps the published tarball small: only dist/ ships, not src/, test files, or dev configs. The engines field documents the Node floor and causes npm install to warn on older versions.

Build and publish:

npm run build
chmod +x dist/cli.js
npm publish --access public

For scoped packages (@yourscope/...), first publish needs --access public. Later publishes omit it.

Users install and run with:

npm install -g @yourscope/greet-cli
greet greet Alice --loud

Or without a global install via npx:

npx @yourscope/greet-cli greet Alice --loud

npx-only distribution is the right default for one-off tools. It avoids polluting the user's global PATH and always runs the version you specify. For tools a developer runs dozens of times a day, a global install still wins on startup time because npx runs a resolution step on every invocation.

If you are distributing a tool that should work offline or in air-gapped environments, vendor the dependencies into the published tarball with bundleDependencies in package.json. Oclif's generated scaffold includes this by default. Commander and zero-dep need it added manually.

6. Comparison

	oclif v4	commander v14	zero-dep
Unpacked install size	~8 MB	~220 kB	0
TypeScript flag types	Inferred, no casting	Manual coercion for numbers	Manual
Auto-generated help	Yes, rich	Yes, basic	You write it
Subcommand routing	File-based (scales)	Code-based (works to ~10)	Switch statement
Plugin system	Yes	No	No
Interactive prompts	Requires inquirer	Requires inquirer	readline built-in
Used by	Heroku CLI, Salesforce CLI	Dozens of open source tools	Scripts, one-off tools
Breaking change cadence	Moderate (major versions)	Low (stable API since v8)	None

The bundle size difference matters when the CLI runs inside a Docker image on a tight layer budget, or when install time in CI is a bottleneck. A full oclif project with its generator output and Heroku plugin dependencies can exceed 50 MB unpacked when counting transitive deps. Commander stays well under 1 MB including your own code.

The type inference gap matters when the team touches the CLI infrequently. With oclif, a new contributor gets full TypeScript hints on every flag value and hits a type error immediately when passing a string where a number belongs. With commander, the coercion is a runtime concern that TypeScript cannot see through without a cast.

The bottom line

Use oclif if you are building a CLI that a team of engineers will extend over time, already have the Heroku or Salesforce ecosystem in mind, or need a plugin architecture. The day-one overhead is real, and the generated scaffold is dense, but the structure pays off past the third command.

Use commander if you are building a real tool with 3 to 15 subcommands, want TypeScript without the framework overhead, and are comfortable writing a thin coercion layer for numeric options. It covers most real-world cases and the API has been stable long enough that StackOverflow has an answer for every edge case.

Build zero-dep if the tool does one thing, ships in a monorepo where dep hygiene is strict, or you want to understand exactly what runs in production. The ceiling is around five commands before the code fights you.

Node 24 LTS (v24.16.0) ships native ESM, native fetch, and a built-in test runner, which removes three common reasons to reach for dependencies in the first place. Whatever path you pick, the toolchain in 2026 is cleaner than 2022 by a wide margin.

What is the CLI in your current project running on? If it is a raw process.argv slicer that has grown past 100 lines, it is time to pick a framework.

GDS K S · thegdsks.com · follow on X @thegdsks

The right CLI framework is the one that fits the command count, not the one with the best marketing page.

RAG with Postgres pgvector in 2026: the full TypeScript pipeline.

GDS K S — Mon, 08 Jun 2026 08:24:44 +0000

I spent a week evaluating dedicated vector databases before deciding to just use the Postgres instance I already had. The pgvector extension handles similarity search well enough for most production workloads, and it collapses three infrastructure components into one. This walkthrough covers everything from schema to answer: chunk your docs, embed them, store in pgvector, retrieve by cosine similarity, and wire the results into an LLM call.

TL;DR

Step	Tool	Why
Enable vector store	`pgvector` 0.8.x, HNSW index	Runs in your existing Postgres, no extra infra
Embed	`text-embedding-3-small` (1,536 dims)	$0.02 per million tokens, fast
Query	`<=>` cosine distance, top-k	Works with both OpenAI and Voyage models
Augment	Claude or GPT-4o with retrieved docs	Context window stuffed, hallucination rate drops

1. Why pgvector instead of a dedicated vector database

Pinecone and Weaviate are good products. If you need multi-tenant isolation, sub-millisecond p99 at 100M+ vectors, or native hybrid search with BM25, they earn their place. For most teams, those are future problems.

The cost calculus changes when you consider ops burden. A dedicated vector DB means a new billing line, a new set of credentials to rotate, a new failure mode to track, and a new SDK to keep current in your application. pgvector runs as a Postgres extension: one connection string, one backup strategy, one source of truth. At 10M documents with 1,536-dimensional embeddings, an HNSW index on a reasonably sized Postgres instance returns top-10 results in under 10ms. That covers the overwhelming share of RAG use cases.

pgvector 0.8.0 added iterative HNSW scans. That release made filtered similarity search practical without falling back to sequential scans every time a WHERE clause got specific. The 0.8.0 release was what tipped my team from "maybe later" to "ship it."

2. Schema setup

Enable the extension once per database, then create your table.

-- enable pgvector (run once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- documents table
CREATE TABLE documents (
  id         BIGSERIAL PRIMARY KEY,
  source     TEXT NOT NULL,          -- filename, URL, or ID of source doc
  chunk_idx  INT NOT NULL,           -- chunk number within the source
  content    TEXT NOT NULL,          -- raw text of the chunk
  embedding  vector(1536) NOT NULL,  -- OpenAI text-embedding-3-small
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Choosing between HNSW and IVFFlat

HNSW builds a navigable small-world graph. Queries scan the graph instead of comparing all rows. Build once, query immediately. The tradeoff is that the index takes more memory: roughly 8 bytes per dimension per row for a 1,536-dim column at default settings.

IVFFlat partitions the embedding space into centroid clusters. Faster to build, smaller memory footprint, but you must load rows before building the index or the centroid assignment is useless. If you are starting from zero rows, build HNSW.

-- HNSW index (recommended default)
-- m = connections per layer (default 16), higher = better recall at higher memory cost
-- ef_construction = candidate list during build (default 64), higher = better recall at slower build
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- IVFFlat alternative (only after loading rows)
-- lists = sqrt(row_count) is a good starting point for large tables
-- the operator class must match the query operator: vector_cosine_ops pairs with <=>
-- CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Use vector_cosine_ops with the <=> operator for cosine distance, the usual default for text embeddings. Use vector_l2_ops with <-> for raw Euclidean distance. When your embedding model returns normalized vectors (OpenAI and Voyage both do), vector_ip_ops with <#> ranks results the same way as cosine distance and saves one normalization step. Note that <#> returns the negative inner product, so smaller values still mean closer matches.
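The relationship between the three operators is easier to trust once you compute it by hand. This sketch reimplements the three distances in plain TypeScript for illustration; the function names are mine, and pgvector computes these in C inside Postgres.

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Scales a vector to unit length. Assumes the vector is not all zeros.
function normalize(v: number[]): number[] {
  const length = Math.sqrt(dot(v, v));
  return v.map((x) => x / length);
}

// What <=> computes: 1 minus cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// What <#> computes: the negative inner product.
function negativeInnerProduct(a: number[], b: number[]): number {
  return -dot(a, b);
}

// What <-> computes: Euclidean (L2) distance.
function l2Distance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - (b[i] ?? 0)) ** 2, 0));
}
```

For unit-length vectors, cosineDistance(a, b) equals 1 + negativeInnerProduct(a, b), so both operators produce the same ordering. That is the reason inner product is a safe shortcut only when the vectors are normalized.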

3. Ingest pipeline in TypeScript

The ingest function chunks a document, calls the embedding API, and bulk inserts rows. Use postgres (the npm package, not pg) for its tagged-template SQL and native array support.

import postgres from "postgres";
import OpenAI from "openai";

const sql = postgres(process.env.DATABASE_URL!);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

const CHUNK_SIZE = 512;   // words, since the naive chunker below splits on whitespace
const CHUNK_OVERLAP = 64; // words of overlap between adjacent chunks

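function chunkText(text: string, size: number, overlap: number): string[] {
  // naive word-boundary chunker: swap for tiktoken in production
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  // filter(Boolean) drops the empty strings produced by leading or trailing whitespace
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + size, words.length);
    chunks.push(words.slice(start, end).join(" "));
    // stop at the end of the text so no redundant tail chunk is emitted
    if (end === words.length) break;
    start += size - overlap;
  }
  return chunks;
}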

async function embedBatch(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map((d) => d.embedding);
}

export async function ingestDocument(source: string, text: string): Promise<void> {
  const chunks = chunkText(text, CHUNK_SIZE, CHUNK_OVERLAP);

  // embed in batches of 100, a conservative size well under OpenAI's per-request input limit
  const BATCH = 100;
  for (let i = 0; i < chunks.length; i += BATCH) {
    const batch = chunks.slice(i, i + BATCH);
    const embeddings = await embedBatch(batch);

    const rows = batch.map((content, j) => ({
      source,
      chunk_idx: i + j,
      content,
      embedding: JSON.stringify(embeddings[j]),
    }));

    await sql`
      INSERT INTO documents (source, chunk_idx, content, embedding)
      SELECT
        r.source,
        r.chunk_idx::int,
        r.content,
        r.embedding::vector
      FROM jsonb_to_recordset(${JSON.stringify(rows)}::jsonb)
        AS r(source text, chunk_idx text, content text, embedding text)
    `;
  }

  console.log(`[ingest] ${source}: ${chunks.length} chunks stored`);
}

A note on chunk size: 512 words is a starting point. The right size depends on your source material. Legal documents with dense paragraphs do better at 256 words. Code files need at least 300 lines or you lose function context. The overlap prevents the embedding from missing a sentence that straddles a chunk boundary.

4. Query pipeline in TypeScript

Embed the user's question, run a top-k cosine similarity search, return the matching chunks.

export async function queryDocuments(
  question: string,
  topK = 5,
): Promise<Array<{ source: string; content: string; distance: number }>> {
  // embed the question with the same model used at ingest time
  const [embedding] = await embedBatch([question]);
  const embeddingStr = JSON.stringify(embedding);

  const rows = await sql<{ source: string; content: string; distance: number }[]>`
    SELECT
      source,
      content,
      (embedding <=> ${embeddingStr}::vector) AS distance
    FROM documents
    ORDER BY embedding <=> ${embeddingStr}::vector
    LIMIT ${topK}
  `;

  return rows;
}

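The <=> operator returns cosine distance (0 = identical, 2 = opposite). Lower numbers win. Metadata filters go in the WHERE clause. With a selective filter, a plain HNSW scan can return fewer than topK rows because the filter is applied after the index scan. The iterative scans introduced in 0.8.0 fix that, but they are off by default, so run SET hnsw.iterative_scan = strict_order; in the session before issuing filtered queries.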

// filtered query example: filterSource is the source value to restrict results to
const rows = await sql<{ source: string; content: string; distance: number }[]>`
  SELECT source, content, (embedding <=> ${embeddingStr}::vector) AS distance
  FROM documents
  WHERE source = ${filterSource}
  ORDER BY embedding <=> ${embeddingStr}::vector
  LIMIT ${topK}
`;

5. Wiring retrieved docs into an LLM call

Concatenate the retrieved chunks into a context block, then call your model of choice. A current Claude Sonnet model or GPT-4o handles long contexts well. Keep the context block under 80,000 tokens for cost reasons.

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });

export async function answerWithRAG(question: string): Promise<string> {
  const docs = await queryDocuments(question, 5);

  if (docs.length === 0) {
    return "No relevant documents found.";
  }

  const context = docs
    .map((d, i) => `[${i + 1}] (${d.source})\n${d.content}`)
    .join("\n\n---\n\n");

  const prompt = `You are a helpful assistant. Answer the question using only the provided context.
If the context does not contain the answer, say so.

Context:
${context}

Question: ${question}`;

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6-20250929",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}

The "answer using only the provided context" instruction is load-bearing. Without it, the model mixes retrieval with parametric memory and you cannot tell which is which. If the answer comes from the context, citations work. If it comes from training data, they do not. Force the distinction at the prompt level.
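With five chunks you will not come near the token ceiling on the context block, but the cap matters once you raise topK. Here is a sketch of a guard that keeps chunks, best first, until an approximate budget is spent. packContext and approxTokens are hypothetical helpers, and the 4-characters-per-token ratio is a rough heuristic for English text that I am assuming here, not a tokenizer.

```typescript
type Chunk = { source: string; content: string };

// Rough heuristic, an assumption for this sketch: ~4 characters per token.
// Use a real tokenizer when the budget has to be exact.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Keeps chunks in their given order (closest match first) until the
// next chunk would push the total past maxTokens.
function packContext(chunks: Chunk[], maxTokens: number): Chunk[] {
  const kept: Chunk[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = approxTokens(chunk.content);
    if (used + cost > maxTokens) break;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Stopping at the first chunk that does not fit, instead of skipping it and trying smaller ones, keeps the context in rank order, so the citation numbers in the prompt stay aligned with retrieval rank.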

One more thing worth noting: rerank before you send to the LLM. A fast cosine search returns the 5 closest chunks by vector distance, but distance does not always equal usefulness. A cross-encoder reranker (Cohere Rerank costs about $1 per 1,000 queries) takes your top-20 candidates and scores them for actual relevance before you trim to 5. The quality jump is noticeable. Skip the reranker while prototyping, add it before you hit production.

6. Two gotchas that bite everyone

Chunk size drives recall more than index parameters

Most teams spend hours tuning HNSW m and ef_construction and see marginal gains. The actual lever is chunk size and overlap. A chunk that is too short loses context (the model cannot answer a cross-sentence question). A chunk that is too long pulls in noise, dilutes the embedding, and wastes context window in the LLM call. Run a quick eval: take 20 representative questions, retrieve top-5, then manually score whether the answer appeared in the returned chunks. Adjust chunk size in 100-word steps until recall tops 85%. Then tune the index.
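The eval loop above can be sketched in a few lines. This is an illustration under assumptions: `chunkByWords`, `EvalCase`, and `recallAtK` are hypothetical names, and a case-insensitive substring match stands in for the manual scoring step, which is cruder than a human judge.

```typescript
// Split text into word-based chunks, with `overlap` words shared between neighbours.
function chunkByWords(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Need size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this chunk already reached the end
  }
  return chunks;
}

interface EvalCase {
  question: string;
  expected: string; // a phrase that must appear in at least one retrieved chunk
}

// Fraction of questions whose expected phrase shows up in the top-k chunks.
function recallAtK(
  cases: EvalCase[],
  retrieve: (question: string, k: number) => string[],
  k = 5,
): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const found = retrieve(c.question, k).some((chunk) =>
      chunk.toLowerCase().includes(c.expected.toLowerCase()),
    );
    if (found) hits += 1;
  }
  return hits / cases.length;
}
```

Re-chunk and re-embed at each candidate size, rerun `recallAtK` over the same 20 questions, and compare the numbers side by side before touching index parameters.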

Build the index after bulk loading, not before

HNSW indexing at insert time is slow. If you load 500,000 documents and the HNSW index exists, every INSERT pays the graph update cost. The fast path: load all rows with the index dropped, then build it once with CREATE INDEX. On a table of 500,000 rows with 1,536-dim embeddings, a cold HNSW build takes roughly 8 to 12 minutes on 4 vCPUs. That is far cheaper than the cumulative insert overhead.

-- drop the index before bulk load
DROP INDEX IF EXISTS documents_embedding_idx;

-- ... run your ingest pipeline ...

-- rebuild once after load
CREATE INDEX documents_embedding_idx
  ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

The bottom line

The full pipeline is about 120 lines of TypeScript and three SQL statements. pgvector 0.8.x is stable enough for production, HNSW is the right default index for most teams, and the two things that matter most for answer quality are chunk size and staying consistent between embed-at-ingest and embed-at-query time (same model, same preprocessing). Dedicated vector DBs are not wrong, they are just a layer you do not need until your row count passes 50M or your recall requirements get strict enough to warrant a tuning team.

What chunk size worked best for your use case? Drop it in the comments.

GDS K S · thegdsks.com · follow on X @thegdsks

Good retrieval beats a better model every time.

TanStack shipped a postmortem for the 42-package npm compromise. Here is what every project should change this week.

GDS K S — Fri, 29 May 2026 02:33:59 +0000

On May 11, 2026, between 19:20 and 19:26 UTC, an attacker published 84 malicious versions across 42 packages in the @tanstack scope. The attacker did not steal a maintainer's npm credentials. They hijacked the build pipeline itself, and the packages they shipped carried valid SLSA provenance attestations. That last part changes something important about how the ecosystem thinks about supply chain trust.

TanStack published a full postmortem. This piece walks through the attack chain, explains what made this incident novel, and gives you a concrete checklist for your own project.

TL;DR

What	Detail
Date	May 11, 2026, 19:20 to 19:26 UTC
Scope	42 @tanstack packages, 84 malicious versions
Worm reach	170+ packages total after self-propagation
Detection	External researcher flagged it within 6 minutes
Full deprecation	~1 hour 43 minutes after first publish
Advisory	GHSA-g7cv-rxg3-hmpx
Novel claim	First documented malicious npm package carrying valid SLSA provenance

1. What happened and when

The attacker, operating under accounts zblgg and voicproducoes, targeted the TanStack Router/Start monorepo. The Query, Table, Form, Virtual, Store, and AI packages were not affected. Only the Router/Start monorepo contained the vulnerable workflow configuration.

At 19:20 UTC the first malicious versions landed. By 19:26 the full 84-version batch hit the registry. An external researcher named ashishkurmi from StepSecurity spotted the anomaly, an unusual optionalDependencies entry pointing to a GitHub fork, within minutes. No internal alerting triggered on TanStack's side.

TanStack deprecated the malicious versions 1 hour 43 minutes after the first publish. npm removed the tarballs between 22:13 and 23:55 UTC, so the malicious versions stayed downloadable for roughly 4.5 hours after the initial compromise.

The payload was a 2.3 MB obfuscated file named router_init.js. It harvested credentials (GitHub tokens, AWS keys, Vault tokens, Kubernetes service accounts, SSH keys, GCP credentials), exfiltrated them over the Session/Oxen P2P messenger network, and then used any stolen publish-capable tokens to republish itself to every other package the victim could write to. It also installed persistence mechanisms in .claude/settings.json hooks, VS Code task injection, and a systemd monitoring service. If the stolen GitHub token was later revoked, the payload wiped the home directory.

Secondary victims included @mistralai/mistralai, 40-plus @uipath packages, and 19 packages in aviation-related namespaces. Wiz attributes the campaign, named "Mini Shai-Hulud" internally, to a threat group called TeamPCP, linked to prior SAP, Checkmarx, and Trivy compromises.

2. The three-primitive attack chain

Most supply chain coverage stops at "compromised package." The TanStack incident is worth studying in detail because the attacker chained three distinct primitives to get from zero access to a signed publish on a major open-source project.

Primitive 1: The Pwn Request

A "Pwn Request" is a specific GitHub Actions anti-pattern. When a workflow uses pull_request_target as its trigger, it runs in the context of the base repository rather than the fork. That means it has access to base repository secrets. The intent of pull_request_target is to let maintainers do things like post comments on pull requests from forks without exposing write tokens to fork code.

The problem: if the workflow also checks out the pull request's code and executes it, you get fork code running with base repository privileges. TanStack's bundle-size.yml workflow had this pattern.

The attacker opened a PR from a fork. The workflow executed the fork's code with base repo context.

Primitive 2: Cache poisoning across trust boundaries

The malicious fork code poisoned the pnpm package store cache. It wrote a 1.1 GB cache entry under the exact key that the legitimate release.yml workflow would later restore.

This is the trust-boundary crossing. The bundle-size workflow (lower trust, triggered by PRs) and the release workflow (higher trust, triggered by maintainer merges) shared a cache key namespace. The attacker wrote to cache from the low-trust context. The high-trust context read from it without re-validating.

The poisoned cache entry sat undetected for eight hours before the release workflow pulled it.

Primitive 3: OIDC token extraction from runner memory

Here is the part that bypasses npm credential protections entirely.

GitHub Actions supports OIDC-based publishing. Instead of storing a long-lived npm token in your repository secrets, your workflow requests a short-lived OIDC token from GitHub at publish time. npm's trusted publisher feature accepts this token. The design assumes that only the intended workflow step can request and use that token.

The attacker's payload included binaries that read /proc/<pid>/mem on the GitHub Actions runner. Processes in the runner environment, including the GitHub Actions agent, hold the OIDC token in memory while the job runs. The attacker extracted that token directly from memory and used it to authenticate npm publishes, bypassing the actual publish step in the release workflow.

This is why the packages carried valid SLSA provenance attestations. The attestation records that the package shipped from the expected repository and workflow. From Sigstore's perspective, that was true. The attacker did not forge the attestation. They hijacked the pipeline mid-run and minted legitimate credentials within it.

3. Why valid SLSA provenance on a malicious package matters

SLSA (Supply chain Levels for Software Artifacts) provenance is one of the main signals the npm ecosystem has been building toward for trusted package distribution. The idea: a package with SLSA provenance attestation proves it came from a specific source commit in a specific workflow. Consumers can verify this cryptographically.

The TanStack incident stands as the first documented case of a malicious npm package carrying SLSA provenance that the attacker did not forge. Sigstore verified the build correctly. The provenance was real. The code running through the pipeline was not safe.

SLSA provenance answers the question "did this package build how the maintainer intended?" It does not answer "did the build pipeline run clean before the build started?" Those are different questions, and the ecosystem has largely treated them as the same question.

This does not make SLSA provenance worthless. A package with no provenance is less trustworthy than one with provenance. But it does mean provenance is a necessary condition, not a complete one. The signal has a new attack surface.

What a cleaner version of SLSA provenance would need: a way to attest that the cache state restored before the build arrived clean, that no cross-context cache sharing occurred, and that OIDC token issuance covered only a specific workflow step rather than any code running in the job.

4. Lockdown checklist for your project this week

Run through this before your next release.

Audit your package-lock for affected versions

# Check installed packages against published advisories
npm audit
npx better-npm-audit audit

# List every @tanstack version in the tree, including transitive dependencies
npm ls --all | grep "@tanstack/"

# Verify against the advisory
# Affected: @tanstack/* versions published 2026-05-11 between 19:20-23:55 UTC
# Safe: any version before May 11 or after npm confirmed tarball removal

If you pulled a new install or ran CI between May 11 19:20 UTC and May 11 23:55 UTC, treat your build environment as potentially compromised. Rotate any credentials that were present in that environment.
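To check publish times programmatically, here is a small sketch that filters the output of `npm view <package> time --json` (a map of version to ISO timestamp, plus the bookkeeping keys `created` and `modified`). The function name `versionsInWindow` and the sample data are hypothetical, and the window bounds are the ones stated in the advisory comment above. Versions that npm has since removed may no longer appear in that map, so treat an empty result as a hint rather than a clearance.

```typescript
// Shape of `npm view <package> time --json`.
type NpmTimeMap = Record<string, string>;

// Window bounds taken from the advisory text in this article.
const WINDOW_START = Date.parse("2026-05-11T19:20:00Z");
const WINDOW_END = Date.parse("2026-05-11T23:55:00Z");

// Returns the versions whose publish time falls inside the window.
function versionsInWindow(
  times: NpmTimeMap,
  start: number = WINDOW_START,
  end: number = WINDOW_END,
): string[] {
  return Object.entries(times)
    .filter(([key]) => key !== "created" && key !== "modified")
    .filter(([, iso]) => {
      const t = Date.parse(iso);
      return !Number.isNaN(t) && t >= start && t <= end;
    })
    .map(([version]) => version);
}
```

Pipe the JSON from `npm view` into a script that calls this for each @tanstack package in your lockfile, then compare the result against the versions you actually have installed.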

Harden your GitHub Actions workflows

The Pwn Request pattern is the root primitive. Audit every workflow file for pull_request_target triggers.

# DANGEROUS: pull_request_target that checks out and runs fork code
on:
  pull_request_target:
    types: [opened, synchronize]

jobs:
  build:
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}  # THIS IS THE PROBLEM
      - run: npm ci && npm run build  # fork code running with base repo context

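# SAFER: split into two workflows
# Workflow 1: runs on pull_request (fork context, no secrets)
name: Build PR  # must match the name referenced by workflow_run in Workflow 2
on:
  pull_request:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # checks out fork code, no secret access
      - run: npm ci && npm run build
      - uses: actions/upload-artifact@v4
        with:
          name: pr-artifacts
          path: ./dist

# Workflow 2: runs on workflow_run (base context, has secrets, reads artifacts not code)
on:
  workflow_run:
    workflows: ["Build PR"]
    types: [completed]
jobs:
  comment:
    runs-on: ubuntu-latest
    permissions:
      actions: read         # needed to download artifacts from another run
      pull-requests: write  # only if this job posts a PR comment
    steps:
      - uses: actions/download-artifact@v4  # reads build output, not fork code
        with:
          name: pr-artifacts
          run-id: ${{ github.event.workflow_run.id }}  # the artifact lives in the triggering run
          github-token: ${{ secrets.GITHUB_TOKEN }}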

If you need pull_request_target for a legitimate reason (bot comments, label management), never check out PR code in that context. Keep it to read-only GitHub API calls.
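For a repository with many workflow files, a rough text scan can shortlist the ones worth reading by hand. The sketch below is a heuristic, not a YAML parser: `scanWorkflow` and `WorkflowFinding` are hypothetical names, and the regexes only look for the trigger plus a checkout that references the PR head. It will miss indirect checkouts and can flag safe files, so use it to prioritize review, not to replace it.

```typescript
interface WorkflowFinding {
  usesPullRequestTarget: boolean;
  checksOutPrHead: boolean;
  risky: boolean; // both conditions present in the same file
}

function scanWorkflow(yamlText: string): WorkflowFinding {
  // Strip comments so a commented-out trigger does not count.
  const code = yamlText
    .split("\n")
    .map((line) => line.replace(/(^|\s)#.*$/, ""))
    .join("\n");

  const usesPullRequestTarget = /\bpull_request_target\b/.test(code);

  // A checkout step combined with a reference to the PR head commit or branch.
  const checksOutPrHead =
    /actions\/checkout@/.test(code) &&
    /github\.event\.pull_request\.head\.(sha|ref)|github\.head_ref/.test(code);

  return {
    usesPullRequestTarget,
    checksOutPrHead,
    risky: usesPullRequestTarget && checksOutPrHead,
  };
}
```

Run it over every file in .github/workflows and read anything marked `risky` first.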

Scope your OIDC token permissions

# Restrict permissions at the job level, not just the workflow level
jobs:
  publish:
    permissions:
      id-token: write    # only the publish job gets OIDC
      contents: read
    steps:
      - uses: actions/checkout@v4
      - run: npm publish --provenance

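Do not grant id-token: write at the workflow level if only one job needs it. Job-level scoping does not shorten a token's lifetime, but it does cut the number of jobs that can request one, so code running in a build or test job has no OIDC token to extract.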

Isolate your cache keys by trust level

# Separate cache keys for PR workflows vs release workflows
- uses: actions/cache@v4
  with:
    path: ~/.pnpm-store
    key: release-pnpm-${{ runner.os }}-${{ hashFiles('**/pnpm-lock.yaml') }}
    # Never share this key with pull_request_target workflows

Use different key prefixes for PR-triggered and release-triggered workflows. A compromised PR workflow cannot poison a release workflow's cache if the keys do not overlap. This is not a full defense (an attacker with arbitrary code execution can still do damage), but it eliminates the specific cache-poisoning vector used here.

Check for persistence artifacts if you ran a CI job during the window

# Check for the gh-token-monitor service (one of the payload's persistence mechanisms)
systemctl status gh-token-monitor 2>/dev/null
systemctl --user status gh-token-monitor 2>/dev/null
ls ~/.config/systemd/user/ ~/.local/share/systemd/user/ 2>/dev/null | grep -i monitor

# Check VS Code tasks for injected entries
grep -in monitor .vscode/tasks.json 2>/dev/null

# Check Claude settings (project-level and user-level) for hook injection
grep -n -A5 '"hooks"' .claude/settings.json ~/.claude/settings.json 2>/dev/null

# If you find any of these: do not revoke tokens yet. Stop and remove the persistence first, then rotate credentials

The payload's wiper triggers when someone revokes a stolen token while the daemon runs. Confirm the daemon is not present before rotating credentials, or coordinate both actions at the same instant.

5. What changes downstream if provenance is not a clean signal

Practically, for most teams consuming public packages, the immediate answer is: not much changes in workflow, but the mental model needs updating.

Provenance attestation was the "this package came from a known clean pipeline" signal. That signal is now more accurately described as "this package came from the expected repository and workflow, assuming the pipeline itself was not injected into." For widely-used OSS packages where you have no visibility into the upstream CI environment, that assumption deserves scrutiny.

Three things worth watching in the next quarter:

First, whether npm or the SLSA spec adds guidance on cache attestation. The build pipeline audit trail currently does not record what cache state was restored before the build ran. Adding that would let downstream consumers see whether a restore happened and from what source.

Second, whether GitHub adds controls to block OIDC token issuance from jobs that restored cache from a lower-trust workflow. Right now the runner process holds the token regardless of how the cache arrived. A job-level flag to drop OIDC access after a cross-context cache restore would close this specific vector.

Third, whether teams start treating @ts-nocheck and skip audit patterns in CI the same way they treat the Pwn Request pattern: as defaults that need an explicit justification written next to them. The TanStack postmortem credits an external researcher with the detection. The internal system had no alert. That is the gap to close.

The bottom line

TanStack's maintainers handled this well. They published a detailed timeline, named the advisory, credited the researcher, and documented what their internal detection missed. That level of transparency under pressure is worth acknowledging.

The incident is notable for two reasons. One is scale: 12.7 million weekly downloads on @tanstack/react-router alone means a narrow six-minute window had real blast radius potential. The other is the SLSA provenance angle. The attacker did not break the signature. They got inside the signing process.

If your project uses GitHub Actions for publishing, run the workflow audit above before your next release. The Pwn Request pattern is common, the cache isolation gap is invisible until something like this happens, and the OIDC scoping is easy to miss in a busy workflow file. None of these fixes take more than an afternoon.

How does your team currently handle CI trust boundaries between PR workflows and release workflows? Drop your setup in the comments.

GDS K S · thegdsks.com · follow on X @thegdsks

Valid provenance on a malicious package is not a cryptography failure. Pipeline isolation failed.

Google's Gemini 3.5 Flash is 4x faster than other frontier models. Here is how to call it from TypeScript.

GDS K S — Wed, 27 May 2026 17:20:41 +0000

Google shipped Gemini 3.5 Flash on May 19 at Google I/O 2026. The headline claim is four times faster output tokens per second compared to other frontier models. That is not a marketing tier label. The claim is a throughput number, and for latency-sensitive work like streaming chat, code generation, or agentic loops, it changes what is worth reaching for.

Here is what the model actually is, how to wire it up in TypeScript, and what the cost and rate limit picture looks like before you depend on it in production.

TL;DR

Dimension	Gemini 3.5 Flash	Gemini 2.5 Flash
Output speed	4x faster than other frontier models	Best price-performance for high-volume tasks
Primary use	Agentic workflows, coding, long-horizon tasks	Cost-sensitive, high-volume, reasoning tasks
Input price	$1.50 per 1M tokens	$0.30 per 1M tokens
Output price	$9.00 per 1M tokens	$2.50 per 1M tokens
Free tier	Yes (limited)	Yes (standard rate limits)
SDK package	`@google/genai`	`@google/genai`
Model ID	`gemini-3.5-flash`	`gemini-2.5-flash`
Released	May 19, 2026	Earlier in 2026

1. What Gemini 3.5 Flash is and where it fits

Google positions Gemini 3.5 Flash as the fast tier in the 3.5 family. The framing from the announcement is "frontier intelligence with action," which is a wordy way of saying: this model runs complex agentic tasks at a speed where the latency is not the bottleneck anymore.

The benchmarks Google published back this up. On Terminal-Bench 2.1, 3.5 Flash scores 76.2%. On MCP Atlas it hits 83.6%. On CharXiv Reasoning, a multimodal benchmark, it reaches 84.2%. Google published those scores for agentic and coding workloads, not general chat.

Where does it fit against the rest of the lineup? The 2.5 Flash is cheaper per token and designed for high-volume reasoning tasks where cost per call matters more than raw throughput. The 3.5 Flash costs more but delivers output fast enough that the wall-clock time for an agentic loop shrinks, which can lower your per-task cost even at a higher per-token rate. Google's own framing is "often at less than half the cost of other frontier models" for full tasks, not individual calls.

For most TypeScript projects, the decision point is: does your user wait for the output, or does a pipeline consume it? If a user is staring at a cursor, speed matters and 3.5 Flash is worth the price premium. If a background job is processing documents at scale, 2.5 Flash is likely the right call.

2. Install the SDK and make your first call

The SDK is @google/genai. Node.js 18 or later required.

npm install @google/genai

Set your API key from Google AI Studio:

export GEMINI_API_KEY="your-key-here"

Basic call:

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-3.5-flash",
  contents: "Summarize the key breaking changes in Node.js 22 for a TypeScript developer.",
});

console.log(response.text);

That is the whole surface for a one-shot request. The GoogleGenAI constructor accepts the key directly or reads GEMINI_API_KEY from the environment when called with an empty object {}. Prefer the explicit key reference so your intent is clear at the call site.

Worth noting: response.text is a convenience accessor. The full response tree lives at response.candidates[0].content.parts. You only need to go that deep when handling multi-modal outputs or function call responses.

3. Streaming responses

Four times faster output speed matters most when you stream. A blocking generateContent call holds the connection open until the model finishes. For a 1,000-token response at high throughput, that is still a perceivable wait for a user. Streaming pipes each chunk to the client as the model produces it.

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function streamToStdout(prompt: string): Promise<void> {
  const stream = await ai.models.generateContentStream({
    model: "gemini-3.5-flash",
    contents: prompt,
  });

  for await (const chunk of stream) {
    process.stdout.write(chunk.text ?? "");
  }

  process.stdout.write("\n");
}

await streamToStdout("Write a TypeScript function that retries a promise up to N times with exponential backoff.");

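In a Next.js route handler or an Express server, you would pipe chunk.text into a ReadableStream and return it as a streaming response. The example below sends raw text chunks with Content-Type: text/plain; switch to text/event-stream only if you wrap each chunk in SSE framing. The pattern is the same either way: iterate the async generator, forward each chunk.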

// app/api/generate/route.ts (Next.js App Router example)
import { NextRequest } from "next/server";
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });

export async function POST(req: NextRequest) {
  const { prompt } = await req.json();

  const stream = await ai.models.generateContentStream({
    model: "gemini-3.5-flash",
    contents: prompt,
  });

  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        controller.enqueue(new TextEncoder().encode(chunk.text ?? ""));
      }
      controller.close();
    },
  });

  return new Response(readable, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}

The 4x throughput claim shows up in the time between the first chunk and the last. At high output speeds, the stream feels snappy from the user's side even when total token count is large.
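If you want to see that gap on your own prompts, a small timing wrapper is enough. This is a sketch: `timeStream` and `StreamTiming` are hypothetical names, and it assumes only that the stream is an async iterable of chunks with an optional `text` field, which is the shape the streaming examples above rely on. The clock is injectable so the function can be tested without real delays.

```typescript
interface StreamTiming {
  text: string;
  msToFirstChunk: number | null; // null when the stream produced no chunks
  msTotal: number;
  chunkCount: number;
}

async function timeStream(
  stream: AsyncIterable<{ text?: string }>,
  now: () => number = () => Date.now(),
): Promise<StreamTiming> {
  const startedAt = now();
  let firstAt: number | null = null;
  let text = "";
  let chunkCount = 0;

  for await (const chunk of stream) {
    if (firstAt === null) firstAt = now(); // record arrival of the first chunk only
    text += chunk.text ?? "";
    chunkCount += 1;
  }

  return {
    text,
    msToFirstChunk: firstAt === null ? null : firstAt - startedAt,
    msTotal: now() - startedAt,
    chunkCount,
  };
}
```

Subtract `msToFirstChunk` from `msTotal` to get the first-to-last-chunk time, and note that this measures from the moment iteration starts, not from when the request was sent.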

4. Tool calling in TypeScript

Gemini 3.5 Flash handles function calling with a three-step cycle: you declare the tool, the model returns a function call request, you execute and send back the result.

One thing to know before you write any code: Gemini 3 model APIs attach a unique id to every function call. You must echo that id back in the function response or the model cannot match results to calls. This changed in the 3.x API line.

import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });

// Step 1: Declare the tool
const getWeatherDeclaration = {
  name: "get_weather",
  description: "Returns current weather conditions for a city.",
  parameters: {
    type: Type.OBJECT,
    properties: {
      city: {
        type: Type.STRING,
        description: "City name, e.g. Tokyo",
      },
      units: {
        type: Type.STRING,
        description: "Temperature unit: celsius or fahrenheit",
      },
    },
    required: ["city"],
  },
};

// Step 2: Send the initial request
const response = await ai.models.generateContent({
  model: "gemini-3.5-flash",
  contents: "What is the weather in Oslo right now?",
  config: {
    tools: [{ functionDeclarations: [getWeatherDeclaration] }],
  },
});

// Step 3: Handle the function call
if (response.functionCalls && response.functionCalls.length > 0) {
  const call = response.functionCalls[0];

  // Your real implementation here
  const weatherData = await fetchWeatherFromYourAPI(call.args as { city: string; units?: string });

  // Build conversation history with the function result
  const history = [
    { role: "user", parts: [{ text: "What is the weather in Oslo right now?" }] },
    response.candidates![0].content,
    {
      role: "user",
      parts: [
        {
          functionResponse: {
            id: call.id,       // Required in Gemini 3.x
            name: call.name,
            response: { result: weatherData },
          },
        },
      ],
    },
  ];

  // Step 4: Get the final natural-language response
  const final = await ai.models.generateContent({
    model: "gemini-3.5-flash",
    contents: history,
    config: {
      tools: [{ functionDeclarations: [getWeatherDeclaration] }],
    },
  });

  console.log(final.text);
}

async function fetchWeatherFromYourAPI(args: { city: string; units?: string }) {
  // Placeholder. Replace with your actual weather API call.
  return { temperature: 12, condition: "cloudy", city: args.city };
}

Two practical notes. The Type enum imported from @google/genai is mandatory for the parameter schema. Do not pass raw strings like "object" for the type field. The model also accepts an array of tool declarations, and you can include more than one function if your agentic workflow needs to route between them.

For parallel tool calls in a single turn, the model may return more than one entry in response.functionCalls. Iterate the array, execute each, and send all results back in one follow-up request.
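A sketch of that iterate-and-respond step, under stated assumptions: `ToolCall`, `ToolHandler`, and `runToolCalls` are local, hypothetical names that mirror only the fields used in the example above (`id`, `name`, `args`), so the block does not import the SDK. Unknown tool names are reported back to the model as an error payload, which is a design choice of this sketch rather than documented SDK behavior.

```typescript
// Local shapes mirroring the fields used in the function-calling example.
interface ToolCall {
  id?: string;
  name?: string;
  args?: Record<string, unknown>;
}

type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

interface FunctionResponsePart {
  functionResponse: {
    id: string | undefined;
    name: string;
    response: Record<string, unknown>;
  };
}

// Executes every call concurrently and returns one functionResponse part per call,
// in the same order as the input, each echoing the call's id.
async function runToolCalls(
  calls: ToolCall[],
  handlers: Map<string, ToolHandler>,
): Promise<FunctionResponsePart[]> {
  return Promise.all(
    calls.map(async (call) => {
      const name = call.name ?? "";
      const handler = handlers.get(name);
      const response: Record<string, unknown> = handler
        ? { result: await handler(call.args ?? {}) }
        : { error: `Unknown tool: ${name}` };
      return { functionResponse: { id: call.id, name, response } };
    }),
  );
}
```

The returned array can serve as the `parts` of the single follow-up turn, in place of the one hand-built functionResponse entry shown earlier.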

5. Cost and rate limits

The pricing numbers above in the TL;DR table come from Google AI Studio's pricing page as of May 2026. Two practical caveats before you budget anything.

Gemini 3.5 Flash costs $1.50 per million input tokens and $9.00 per million output tokens on the paid tier. Output pricing includes thinking tokens if the model uses internal reasoning steps. In a chat or code-generation workflow, output typically runs 2 to 4 times the input token count, so budget accordingly.

The 2.5 Flash at $0.30 input / $2.50 output is a meaningful difference at scale. A task that generates 10,000 output tokens costs $0.025 on 2.5 Flash and $0.09 on 3.5 Flash. That is 3.6x more per call. The gap can close if the 4x speed advantage means 3.5 Flash completes a multi-turn agentic task in fewer wall-clock seconds and the task itself needs fewer total tokens because the model gets there faster. Test against your actual workload rather than extrapolating from single-call pricing.
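The arithmetic is easy to script so you can plug in token counts from your own logs. The prices below are copied from the TL;DR table; the names `Pricing`, `costUsd`, and `taskCostUsd` are hypothetical, and the per-turn token counts are whatever your own measurements say.

```typescript
interface Pricing {
  inputPerMillion: number; // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Prices as listed in the TL;DR table of this article.
const GEMINI_35_FLASH: Pricing = { inputPerMillion: 1.5, outputPerMillion: 9.0 };
const GEMINI_25_FLASH: Pricing = { inputPerMillion: 0.3, outputPerMillion: 2.5 };

// Cost of a single call in USD.
function costUsd(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerMillion + outputTokens * p.outputPerMillion) / 1e6;
}

// Cost of a whole task: per-turn costs summed across every turn of the loop.
function taskCostUsd(
  p: Pricing,
  turns: Array<{ input: number; output: number }>,
): number {
  return turns.reduce((sum, t) => sum + costUsd(p, t.input, t.output), 0);
}
```

Calling `costUsd` with 0 input tokens and 10,000 output tokens reproduces the $0.025 and $0.09 figures above. The more useful comparison is `taskCostUsd` fed with the real turn counts each model needed to finish the same task.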

Both models have a free tier through the Gemini API with rate limits Google does not publish precisely on the pricing page. The paid tier removes the per-day caps. If you are prototyping, the free tier is enough. If you are running production traffic, use a paid project and set a monthly spend cap in the Google Cloud console.

One hard ceiling worth knowing: Google Search grounding requests share a 5,000-prompt monthly quota across all Gemini 3 models on the free tier, then cost $14 per 1,000 queries on paid. If your tool-calling setup routes through Search grounding, that quota burns faster than you expect.

6. The bottom line

Gemini 3.5 Flash is worth adding to your model comparison list. Google's own benchmarks back the 4x output speed claim, and the numbers line up with the agentic workload focus. The TypeScript SDK is straightforward. The function calling API has one new rule compared to older Gemini versions: always echo the id field back in your function response.

The price premium over 2.5 Flash is real. Whether it pays back depends on whether your users wait for output and whether your agentic loops shrink enough in wall-clock time to offset the per-token cost difference. Run both models against your actual task shape before committing either to production.

What kind of workload are you considering Gemini 3.5 Flash for? Drop a comment, especially if you have run latency comparisons against other frontier models.

GDS K S · thegdsks.com · follow on X @thegdsks

Speed is only free if you would have paid for the wall-clock time anyway.

Build your first MCP server in TypeScript: the 2026 setup that takes 30 minutes.

GDS K S — Tue, 26 May 2026 20:07:21 +0000

I had Claude Desktop open. I needed it to query a local SQLite database without copy-pasting schema dumps into the chat. Thirty minutes later I had a working MCP server. Here is the exact path I took, stripped of dead ends.

TL;DR

Step	What you build	Time
Project setup	npm project, tsconfig, SDK install	5 min
First tool	Structured input, structured output	10 min
First resource	Read-only data the model can request	8 min
Connect Claude Desktop	Config file, restart, verify	5 min
Common pitfalls	Avoid the three bugs that kill every first attempt	2 min

What MCP actually is

Model Context Protocol is a standard for connecting AI models to external data and tools. The model issues requests, your server handles them, and the results come back in a format the model understands. That is the whole idea.

Before MCP, every tool integration was custom. OpenAI had function calling. Anthropic had tool use. Cursor had its own plugin format. MCP standardizes the wire protocol so you write one server and any compliant client can call it, whether that is Claude Desktop, Cursor, or a client you build yourself.

The three primitives you care about:

Resources: read-only data the model can fetch, like files or database rows.
Tools: functions the model can call with arguments, like running a query or sending a request.
Prompts: reusable prompt templates the client can surface to the user.

This tutorial covers tools and resources. Prompts follow the same pattern and you will not need them for most servers.

1. Project setup

Node 18 or higher required. Check with node --version.

mkdir my-mcp-server && cd my-mcp-server
npm init -y
npm install @modelcontextprotocol/sdk zod
npm install -D typescript @types/node
mkdir src
touch src/index.ts

The SDK package is @modelcontextprotocol/sdk. The version on npm as of May 2026 is 1.11.x. Zod handles schema validation for tool inputs.

Update package.json with these fields:

{
  "type": "module",
  "scripts": {
    "build": "tsc",
    "start": "node build/index.js"
  }
}

Create tsconfig.json:

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "Node16",
    "moduleResolution": "Node16",
    "outDir": "./build",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules"]
}

2. Implementing a tool

A tool is a function the model can call. You define its name, description, input schema, and handler. The model reads the description and schema to decide when and how to call it.

Here is a complete server with one tool that converts a hex color to RGB:

// src/index.ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({
  name: "color-tools",
  version: "1.0.0",
});

server.tool(
  "hex_to_rgb",
  "Convert a hex color string to RGB components. Input must include the leading #.",
  {
    hex: z.string().regex(/^#[0-9a-fA-F]{6}$/, "Must be a 6-digit hex color, e.g. #ff5733"),
  },
  async ({ hex }) => {
    const r = parseInt(hex.slice(1, 3), 16);
    const g = parseInt(hex.slice(3, 5), 16);
    const b = parseInt(hex.slice(5, 7), 16);
    return {
      content: [
        {
          type: "text",
          text: JSON.stringify({ hex, r, g, b }),
        },
      ],
    };
  },
);

const transport = new StdioServerTransport();
await server.connect(transport);

Three things to notice:

The description string is what the model reads to decide whether to call the tool. Write it as plainly as you would write a JSDoc comment for a teammate. Vague descriptions produce missed calls or wrong inputs.

The second argument to server.tool() is the description. The third is a plain object whose values are Zod schemas (a raw shape, not a z.object() call). The SDK turns this into a JSON Schema that the client sends to the model. Keep schemas tight: required fields only, no optional fields that do not change the output.

The return value must have a content array. Each item has a type and a text (or data for binary). Return JSON as a string inside a text item. The model can parse it from there.
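One way to keep that return shape honest is to pull the handler logic into a pure function you can unit test without starting a transport. This is a sketch: `hexToRgbResult` and `ToolTextResult` are hypothetical names, and the local interface only mirrors the fields discussed above plus the `isError` flag that MCP tool results use to signal a failed call.

```typescript
// Local copy of the result shape the handler returns.
interface ToolTextResult {
  content: Array<{ type: "text"; text: string }>;
  isError?: boolean;
}

// Pure version of the hex_to_rgb handler: same parsing, no server dependency.
function hexToRgbResult(hex: string): ToolTextResult {
  if (!/^#[0-9a-fA-F]{6}$/.test(hex)) {
    // Return a readable error instead of throwing, so the model can correct its input.
    return {
      content: [{ type: "text", text: `Invalid hex color: ${hex}` }],
      isError: true,
    };
  }
  const r = parseInt(hex.slice(1, 3), 16);
  const g = parseInt(hex.slice(3, 5), 16);
  const b = parseInt(hex.slice(5, 7), 16);
  return {
    content: [{ type: "text", text: JSON.stringify({ hex, r, g, b }) }],
  };
}
```

The server.tool() callback then shrinks to `async ({ hex }) => hexToRgbResult(hex)`. The extra validation is redundant with the Zod regex when called through the server, but it keeps the function safe to call directly from tests.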

Build and test locally:

npm run build
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' | node build/index.js

You should see a JSON-RPC response listing hex_to_rgb. That confirms the server starts and responds to the list request.

3. Implementing a resource

Resources expose read-only data the model can pull on demand. A common use case: expose the schema of your local database so the model knows the table structure before writing a query.

Add this before the transport setup:

server.resource(
  "db-schema",
  "sqlite:///local.db",
  async (uri) => {
    // In a real server, read this from your database
    const schema = `
CREATE TABLE users (
  id INTEGER PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  created_at INTEGER NOT NULL
);
CREATE TABLE orders (
  id INTEGER PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  total_cents INTEGER NOT NULL,
  placed_at INTEGER NOT NULL
);
    `.trim();
    return {
      contents: [
        {
          uri: uri.href,
          text: schema,
          mimeType: "text/plain",
        },
      ],
    };
  },
);

The first argument is the resource name. The second is the URI the client uses to request it. Pick a URI scheme that makes sense for your data: file, sqlite, https, or a custom scheme like myapp://.

Resources are pull-based. The model requests them when it decides it needs them. If you want data pushed into every conversation automatically, that is a different pattern (system prompt injection at the client level, not a resource).

4. Hooking it up to Claude Desktop

Build the project:

npm run build

Open your Claude Desktop config file. On macOS:

~/Library/Application Support/Claude/claude_desktop_config.json

On Windows:

%APPDATA%\Claude\claude_desktop_config.json

Add your server to the mcpServers block:

{
  "mcpServers": {
    "color-tools": {
      "command": "node",
      "args": ["/absolute/path/to/my-mcp-server/build/index.js"]
    }
  }
}

Use the absolute path. Relative paths fail silently, which is the single most common first-timer mistake. Restart Claude Desktop fully (quit from the menu bar, not just close the window). Open a new conversation. You should see a hammer icon in the input bar indicating tools are available. Type "convert #3b82f6 to RGB" and watch it call the tool.

For Cursor, the config lives at ~/.cursor/mcp.json and uses the same mcpServers JSON shape:

{
  "mcpServers": {
    "color-tools": {
      "command": "node",
      "args": ["/absolute/path/to/my-mcp-server/build/index.js"]
    }
  }
}
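Because relative paths fail without a useful message, it can help to lint the config before restarting the client. The sketch below assumes the mcpServers shape shown above. The function names are hypothetical, and the absolute-path check is a simplification that only recognizes POSIX paths and Windows drive-letter paths.

```typescript
// Shape of the mcpServers block shown above.
type McpConfig = {
  mcpServers: Record<string, { command: string; args?: string[] }>;
};

// Simplified check: POSIX paths start with "/", Windows paths with a drive letter.
const isAbsolutePath = (p: string): boolean => /^(\/|[A-Za-z]:[\\/])/.test(p);

// Hypothetical lint: flag args that look like script paths but are not absolute.
function findRelativeScriptArgs(config: McpConfig): string[] {
  const problems: string[] = [];
  for (const [name, server] of Object.entries(config.mcpServers)) {
    for (const arg of server.args ?? []) {
      const looksLikeScript = /\.(c|m)?js$/.test(arg);
      if (looksLikeScript && !isAbsolutePath(arg)) {
        problems.push(`${name}: "${arg}" is not an absolute path`);
      }
    }
  }
  return problems;
}
```

An empty result means every script argument is absolute. It does not prove the file exists, so a typo in the path still needs a manual check.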

For a generic client or testing: the MCP Inspector from Anthropic runs tool calls through a web UI without configuring Claude Desktop.

npx @modelcontextprotocol/inspector node /absolute/path/to/build/index.js

Open the Inspector UI at port 6274 and you can fire tool calls manually and inspect the raw JSON-RPC traffic.

5. Transport choice: stdio vs HTTP

The setup above uses stdio transport. The client starts your server as a child process and communicates over stdin/stdout. This works for local tools and is the path of least resistance for Claude Desktop and Cursor.

For a remote server that two or more clients share, you need HTTP transport. The SDK ships StreamableHTTPServerTransport for this. You pair it with an HTTP framework (Hono, Express, Fastify) and handle sessions. That setup adds meaningful complexity and is worth a separate article. Start with stdio unless you are building a shared service from day one.

One rule for any stdio server: never write to stdout with console.log. The MCP protocol uses stdout for JSON-RPC frames. A stray log line corrupts the framing and the client sees a parse error with no helpful message. Use console.error() for debugging output. Everything sent to stderr is safe.
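A small logging helper makes the stderr rule hard to break by accident. This is a minimal sketch. The names formatLogLine and debugLog are hypothetical, and the line format is just one reasonable choice.

```typescript
// Hypothetical debug logger for a stdio MCP server: output goes to stderr only.
function formatLogLine(scope: string, message: string, now: Date = new Date()): string {
  // ISO timestamp first so lines sort chronologically in a log file.
  return `${now.toISOString()} [${scope}] ${message}`;
}

function debugLog(scope: string, message: string): void {
  // console.error writes to stderr, which leaves the JSON-RPC frames on stdout intact.
  console.error(formatLogLine(scope, message));
}
```

Calling debugLog("color-tools", "server started") from the server entry point then gives you a trace without touching stdout.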

6. Common pitfalls

The three mistakes I see in every first MCP server attempt:

Schema validation gaps break calls silently. If the model sends an input that does not match your Zod schema, the SDK rejects it with a generic error. The model may retry with the same bad input. Write the schema narrowly and add .describe() calls on each field to help the model understand what values are valid.

// add field-level descriptions so the model knows what to send
{
  hex: z.string()
    .regex(/^#[0-9a-fA-F]{6}$/)
    .describe("Six-digit hex color with leading #, e.g. #ff5733"),
}

Error responses need the right shape. When your tool handler throws, return a structured error instead of letting the exception propagate:

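async ({ hex }) => {
  try {
    // Parse each two-character channel as base 16
    const r = parseInt(hex.slice(1, 3), 16);
    const g = parseInt(hex.slice(3, 5), 16);
    const b = parseInt(hex.slice(5, 7), 16);
    return { content: [{ type: "text", text: JSON.stringify({ r, g, b }) }] };
  } catch (err) {
    // Report the failure in-band so the client can flag the call as failed
    return {
      content: [{ type: "text", text: `Error: ${err instanceof Error ? err.message : "unknown"}` }],
      isError: true,
    };
  }
}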

The isError: true flag tells the client the call failed, which surfaces properly in Claude Desktop rather than showing as a successful response with error text inside.
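If you have more than one tool, repeating that try/catch in every handler gets tedious. One possible approach is a wrapper that applies it once. The ToolResult type below is a trimmed-down stand-in for the SDK's richer result type, and withErrorResult is a hypothetical name.

```typescript
// Minimal result shape used in this article; the SDK's own types are richer.
type ToolResult = {
  content: { type: "text"; text: string }[];
  isError?: boolean;
};

// Hypothetical wrapper: turns any thrown error into a structured isError result.
function withErrorResult<A>(
  handler: (args: A) => Promise<ToolResult>,
): (args: A) => Promise<ToolResult> {
  return async (args) => {
    try {
      return await handler(args);
    } catch (err) {
      const message = err instanceof Error ? err.message : "unknown";
      return { content: [{ type: "text", text: `Error: ${message}` }], isError: true };
    }
  };
}
```

You would pass withErrorResult(yourHandler) wherever the plain handler went before. Successful results pass through unchanged.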

Resource URIs must be stable. If a client caches a resource URI and your server changes it on restart, the cached reference points nowhere. Treat resource URIs like public API paths: change them only when you intend a breaking change and version them if needed.

The bottom line

MCP is not a new protocol that requires learning a whole ecosystem. The SDK is thin. You write a handler function, attach a schema, return a content array. The hard part is designing the right tools: narrow enough to be reliable, broad enough to be useful. A tool that does one thing with a clear input schema outperforms a general-purpose tool with six optional fields every time.

Build the color tool above. Get it running in Claude Desktop. Then replace the hex conversion with whatever data or action you actually want to expose. The scaffolding is identical regardless of what the tool does.

What would you expose through an MCP server if you had it running today?

GDS K S · thegdsks.com · follow on X @thegdsks

The scaffolding is 30 minutes; the tool design is the actual work.

Cursor 3 ships parallel AI agents. Here is the multi-agent workflow that actually works.

GDS K S — Tue, 26 May 2026 02:51:45 +0000

On April 2, 2026, Cursor shipped version 3.0 and called it "a unified workspace for building software with agents." The headline feature is the Agents Window: a sidebar that shows every active agent session, local or cloud, across all your repos, all at once.

I have spent the past three weeks running it on a real codebase and the experience is different enough from any previous AI coding tool that it warrants a proper walkthrough. Not a demo. The actual workflow, with the parts that break.

TL;DR

Feature	What it does	When you reach for it
Agents Window	Sidebar listing all active agent sessions	Any time you run more than one agent
Local agents	Composer 2 model, run in your open workspace	Fast iteration, short-horizon tasks
Cloud agents	Run on Cursor's infrastructure, persist when the laptop closes	Long tasks, overnight runs, heavy refactors
Local to cloud handoff	Move a session between targets mid-task	When a quick task grows into a long one
Cursor Marketplace	Plugins, MCPs, subagents, skills	Extending what any agent can reach

1. What the Agents Window actually is

Before Cursor 3, you had one agent session per window. You could open more than one Cursor window, but there was no unified view across them. The Agents Window fixes that by collecting all active sessions into a single sidebar panel.

Open it with Cmd+Shift+P and search "Agents Window". What you get is a list of every agent currently running: the task that started it, the repo it targets, and whether it runs locally or in the cloud. You can click into any session, see its chat history and file diffs, and redirect it.

The practical change is visibility. Running three agents in parallel used to mean three Cursor windows and a lot of alt-tabbing. Now you get one panel with three rows.

What it does not do: it does not merge agent output automatically, it does not prevent two agents from writing to the same file, and it does not enforce any ordering between sessions. That coordination is still your job. Which is exactly why you need a workflow, not just the feature.

2. The two execution targets and when to use each

Cursor 3 ships with two places an agent can run.

Local agents

A local agent runs in your open workspace using the Composer 2 model. It has access to your file system, your terminal, and your LSP (Language Server Protocol). When you ask it to refactor a function, it reads the file, writes the change, and you see the diff immediately. Round trip from prompt to edit runs in 5 to 15 seconds for most tasks.

Use local agents when the task has a short time horizon, when you want to watch the work happen in real time, or when the task touches files that you are also actively editing. The Composer 2 model is fast, and a local agent knows your workspace state best because it has direct file access.

Cloud agents

A cloud agent runs on Cursor's infrastructure. The job persists even when your laptop closes. You can queue a long refactor, shut the lid, and come back four hours later to a PR ready for review. Cloud agents generate screenshots and demo recordings of the result so you can verify before you merge.

Use cloud agents when the task will take longer than you want to babysit it, when you are working across more than one repository, or when you are running automations triggered from Slack, GitHub, or Linear. The Cursor Marketplace also ships subagent plugins specifically designed to extend cloud agent capabilities with external tool access.

The handoff between local and cloud goes both ways. Start something locally, realize the scope expanded, hand it to cloud. Or pull a cloud result back into a local session to do final cleanup with LSP context.

3. A worked example: refactor pipeline split across 3 agents

Here is the actual split I ran last week on a service that needed its logging replaced with structured JSON, its error handling standardized, and its test coverage filled in. Three distinct jobs with almost no overlap in the files they touched.

Setup

# Create a worktree on a new branch for each agent to avoid branch conflicts
# (-b creates the branch; drop it if the branch already exists)
git worktree add -b feature/structured-logging ../refactor-logging
git worktree add -b feature/error-handling ../refactor-errors
git worktree add -b feature/test-coverage ../refactor-tests

Git worktrees give each agent its own working directory on a separate branch. The agents are not sharing a working tree, so there are no write conflicts at the file level. The Agents Window still shows all three in the same sidebar.

Prompt structure

Each agent gets a scoped prompt. The logging agent:

Refactor all console.log and console.error calls in src/services/
to use the structured logger at src/lib/logger.ts. Output must be
JSON with fields: level, message, context. Do not change function
signatures. Do not touch test files.

The error agent:

Standardize all try/catch blocks in src/services/ to use the
AppError class in src/errors/app-error.ts. Rethrow with the
original error as the cause property. Do not change logging calls.
Do not touch test files.

The test agent:

Add missing unit tests for src/services/ using Vitest.
Cover the three exported functions with the lowest coverage
per the attached lcov.info. Do not edit source files.

The "do not touch" constraints in these prompts are not optional. Without them, agents drift toward touching shared files and you end up with three agents that all think they own src/lib/logger.ts.

Monitoring in the Agents Window

With all three agents running, the Agents Window shows each session's current file and last action. You are not watching them run; you check back every 10 minutes to see if any of them has gone quiet or made a choice that looks wrong.

The most common failure mode: an agent finishes one subtask and then starts making "improvements" to adjacent files outside its scope. Catch this early. The diff view inside each session tab shows you exactly what files the agent has queued for commit.
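The scope check can be partly automated. The sketch below assumes you can get the list of changed files yourself, for example from git diff --name-only main...HEAD inside the agent's worktree. The AgentScope type, the function name, and the __tests__ directory in the usage note are hypothetical.

```typescript
// Hypothetical scope definition for one agent: directories it owns and must not touch.
type AgentScope = { name: string; owns: string[]; forbidden: string[] };

// True when the file is the directory itself or sits somewhere beneath it.
function isUnder(file: string, dir: string): boolean {
  const prefix = dir.endsWith("/") ? dir : dir + "/";
  return file === dir || file.startsWith(prefix);
}

// Return the changed files that fall outside the agent's owned directories
// or inside a directory it was told to leave alone.
function filesOutOfScope(scope: AgentScope, changedFiles: string[]): string[] {
  return changedFiles.filter(
    (file) =>
      !scope.owns.some((dir) => isUnder(file, dir)) ||
      scope.forbidden.some((dir) => isUnder(file, dir)),
  );
}
```

For a logging agent that owns src/services and must not touch src/services/__tests__, a change to src/lib/logger.ts or to a file under the tests directory comes back flagged, while src/services/user.ts does not.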

Merging the results

Each agent runs on its own branch. When all three finish, the merge sequence matters. Logging changes first, since error handling depends on the logger being correct. Error handling second. Tests third, because they exercise both.

git checkout main
git merge feature/structured-logging
git merge feature/error-handling
git merge feature/test-coverage

Run the test suite after each merge, not just after the last one. If the test merge fails, you want to know which of the two prior merges introduced the problem.
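The merge-then-test loop is easy to script. This is a sketch, not the script I ran. It assumes you are already on main, that npm test runs your suite, and that you supply a run function that executes a command and reports success. In a real script that function could wrap child_process.execSync. The name mergeInOrder is hypothetical.

```typescript
// A command runner returns true on success. It is injected so the
// ordering logic can be exercised without touching a real repository.
type Run = (command: string) => boolean;

// Merge branches in order and run the tests after each merge.
// Returns the first branch whose merge or test run failed, or null if all passed.
function mergeInOrder(
  branches: string[],
  run: Run,
  testCommand: string = "npm test",
): string | null {
  for (const branch of branches) {
    if (!run(`git merge ${branch}`)) return branch;
    if (!run(testCommand)) return branch;
  }
  return null;
}
```

Stopping at the first failure is the point: the returned branch name tells you which merge to inspect before anything else lands on top of it.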

4. The orchestration gotchas

Parallel agents are faster than sequential agents on tasks that do not share state. But they introduce three categories of failure that a single agent session avoids.

File conflicts

Two agents writing to the same file at the same time produce a merge conflict that neither of them knows about. The only reliable prevention is prompt scoping. Give each agent an explicit list of directories it owns and an explicit list it must not touch. Worktrees help at the file system level, but they do not prevent two agents from editing the same path in different branches.

If you skip this and end up with conflicts, do not ask a third agent to resolve them. Resolve merge conflicts manually. The context an agent needs to resolve a three-way conflict correctly is usually larger than what fits in a useful prompt.
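You can check the ownership lists for overlap before any agent starts. The sketch below compares directories only, so it is stricter than a per-file check. The type, the function names, and the directory names in the usage note are hypothetical.

```typescript
type Ownership = { agent: string; owns: string[] };

// True when one directory is the same as, or nested inside, the other.
function dirsOverlap(a: string, b: string): boolean {
  const norm = (d: string): string => (d.endsWith("/") ? d : d + "/");
  const x = norm(a);
  const y = norm(b);
  return x.startsWith(y) || y.startsWith(x);
}

// List every pair of agents whose owned directories overlap.
function findOwnershipConflicts(scopes: Ownership[]): string[] {
  const conflicts: string[] = [];
  scopes.forEach((first, i) => {
    for (const second of scopes.slice(i + 1)) {
      for (const a of first.owns) {
        for (const b of second.owns) {
          if (dirsOverlap(a, b)) {
            conflicts.push(`${first.agent} and ${second.agent} overlap on ${a} and ${b}`);
          }
        }
      }
    }
  });
  return conflicts;
}
```

With one agent owning src/api and another owning src/api/auth, the check reports one conflict. An agent owning src/billing stays clear of both.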

Branch divergence

Agents that run long enough start diverging from main in ways that require manual rebase. A 4-hour cloud agent job started on Monday morning may return to a main branch that has 12 commits it did not see. Budget time for rebase before merge, especially on active repos.

# Before merging any agent branch, rebase it
git checkout feature/structured-logging
git rebase main
# resolve conflicts, then merge

Cost ceiling

Three agents running in parallel burn tokens three times as fast as one. Local agents use your Cursor subscription allocation. Cursor bills cloud agents separately for compute time, though no per-minute rate appears in the public docs at time of writing. Set a scope that finishes in under two hours for each agent on the first run. You will learn the actual token and time cost from those runs and can calibrate longer jobs after.

The Agents Window does not have a built-in cost display per session at version 3.4. You get total usage in account settings. If you need per-session cost visibility, log the task start time and check account usage after the session ends.

The bottom line

The Agents Window is not magic. Treat it as a coordination surface for parallel work that you still have to design. The rule that made this actually work for me: treat each agent like a pull request reviewer who will only read the files you hand them. Scope, branch, scope again, then run.

The real gain is not speed on one task. The gain is that three independent jobs that used to take three sequential afternoons now take one. The orchestration tax is real, but it pays back at 3x velocity on the right class of work.

What kind of tasks are you splitting across agents? The comment thread from the first 90 minutes usually surfaces approaches I have not tried. Drop yours below.

Parallel agents are faster only when you design the seams between them.

Microsoft tried to kill the printer driver. Healthcare said no.

GDS K S — Sat, 23 May 2026 06:36:49 +0000

Microsoft tried to kill the printer driver. 90% of US healthcare said no.

In late 2025, Microsoft put a line on the Windows Roadmap that should have read as routine. Starting January 2026, Windows Update would stop shipping legacy V3 and V4 printer drivers. Modern Print Platform only. Goodbye to a decade of brittle vendor blobs.

In February 2026 they quietly took it back. The line vanished from the roadmap. The official statement told users no action is required. Existing printers will keep working. The deprecation, for now, sits on hold.

Microsoft holds more market power than almost any company in history. They tried to retire a category of driver that Microsoft itself deprecated back in September 2023. They could not actually pull it off. The reason sits in every hospital in the United States, and it makes a noise like a 1990s modem.

TL;DR

Thing	Status
V3 and V4 printer drivers	Deprecated since September 2023, still alive
January 2026 deprecation push	Announced, then retracted in February 2026
US healthcare communication that still runs on fax	About 70 percent
Once you count EHR-linked faxing	Closer to 90 percent
ATM transactions still running on COBOL	About 95 percent
Online banking transactions touching COBOL	More than 40 percent
Time horizon on this stuff actually dying	Decades, not quarters

1. The headline that almost happened

The original Microsoft plan looked clean. V3 and V4 driver models carried known security and stability problems. Modern Print Platform, the IPP-based replacement, outperforms them in almost every measurable way. Microsoft already deprecated the old drivers two and a half years ago. The January 2026 update would have completed the cleanup.

That plan sits in the archive now. Tom's Hardware and Windows Central covered the original announcement. The retraction came after Microsoft "received feedback." The polite version of "received feedback" reads as follows: some quite large customers told Microsoft, in writing, that breaking the printer pipeline would break the hospital pipeline, and that the hospital pipeline runs on fax.

2. The fax number you cannot believe

Here is the statistic that broke my brain when I first read it. Roughly 70 percent of healthcare communication in the United States still moves over fax. When you include EHR-linked faxing, where an electronic health record system pretends to be a fax machine in order to talk to the rest of the industry, the number climbs to about 90 percent.

Ninety percent. Of the most regulated, most digitized, most money-flooded industry in the developed world. Running on a protocol that predates the personal computer.

   The 2026 healthcare comms diagram

  ┌──────────────┐         FAX           ┌──────────────┐
  │   Hospital A │  ─────────────────▶   │   Clinic B   │
  │   (modern    │                       │   (modern    │
  │    EHR)      │                       │    EHR)      │
  └──────────────┘                       └──────────────┘
        │                                       │
        ▼                                       ▼
   Pretends to be                          Pretends to be
   a fax machine                           a fax machine
        │                                       │
        ▼                                       ▼
  ╔═════════════════════════════════════════════════════╗
  ║   90% of the actual traffic goes over fax anyway    ║
  ╚═════════════════════════════════════════════════════╝

That diagram explains what Microsoft hit when they tried to ship the driver change. The driver path covers more than home offices. The driver path runs through compliance pipelines that no single engineering team owns. Break the driver layer in January, and somebody's referral cannot reach somebody else's prior authorization in February. That outcome does not fit a "we will respond to feedback" narrative. That outcome makes a 60 Minutes segment.

3. The other infrastructure that refuses to die

Fax counts as the most visible example. Not the only one. The pattern shows up everywhere stable infrastructure built up decades of edge cases. IBM has said for years, in slightly louder volumes each year, that COBOL still runs about 95 percent of ATM transactions and more than 40 percent of online banking. The COBOL workforce is aging out. The replacements never arrived. The systems keep running.

Same pattern with:

System	Year designed	Still doing real work in 2026
Fax	1843 (concept), 1960s mainstream	Yes, in healthcare and government
COBOL	1959	Yes, in banks and insurance
FORTRAN	1957	Yes, in scientific computing
SQL	1974	Yes, almost everywhere
Email (SMTP)	1982	Yes, the protocol you read every day
HTTP	1991	Yes, you are reading this over it

We tell each other we live in a world of rapid change. The world actually sits on one of the most stable substrates the species has ever built. The application layer churns. The substrate hardly moves at all.

4. The lesson for software you ship today

You will not build fax machines. You will, almost certainly, write code that outlives your current job, your current company, and possibly your current career. That outcome sits at the heart of the COBOL story that nobody puts on a slide. The COBOL devs in 1985 did not know their code would still run in 2026. They just shipped.

The code you wrote last week might still serve as a production database adapter in 2040. The defaults you picked stand a chance of becoming invariants for some future maintainer who has never met you. Five practical rules that pay back over the decade-scale arc of code:

Rule 1: Comment the boundary, not the line

Your future maintainer can read your code. They cannot read your decision tree. Write down why a particular flag exists, why a particular workaround sits where it does, why a particular value lives as a constant. Skip the obvious. Document the negotiations.

# bad
TIMEOUT = 47

# good
# Set to 47 seconds because the partner auth gateway has a hard 50s limit
# and we observed 1-2s of jitter from our load balancer in the May 2023
# postmortem. Do not raise without coordinating with the integrations team.
TIMEOUT = 47

The bad version says nothing beyond what the code already says. The good comment captures the negotiation that produced the number, which is the part that gets lost first.

Rule 2: Pick formats that read in plain text

JSON, CSV, plain SQL, basic English logs. The dependency on a binary format with proprietary tooling bites archaeologists hardest. If somebody can cat the file in 2046 and start guessing what it does, you have done them a favor that pays back forever.

The fax format is plain enough that a forensic analyst can read it with the right hardware. COBOL source is plain enough that a junior dev with a manual can read it. The systems that died fastest in the 1990s and 2000s were the ones that depended on a binary tool that the vendor stopped supporting. Choose against that future.

Rule 3: Write the migration script you wish someone had written for you

Every meaningful schema change should ship with the SQL or code that undoes it, or that walks the data from the old shape to the new one. Future you, or future someone, will thank you.

-- Forward migration
ALTER TABLE users ADD COLUMN preferred_locale VARCHAR(10) DEFAULT 'en-US';
UPDATE users SET preferred_locale = 'en-GB'
  WHERE country_code IN ('GB', 'IE', 'AU', 'NZ');

-- Down migration (commit this in the same file)
ALTER TABLE users DROP COLUMN preferred_locale;

Tools like Alembic, Flyway, Liquibase, and Sequelize migrations enforce this discipline. If your team is doing migrations as ad-hoc DBAs running scripts in pgAdmin, you are storing technical debt that compounds at the rate of every release.
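The same pairing works in application code. Here is a minimal in-memory sketch of the migration above, assuming rows are plain objects. The Migration type and the addPreferredLocale name are hypothetical and do not reflect the API of any of the tools just mentioned.

```typescript
type UserRow = { id: number; country_code: string; preferred_locale?: string };

// A migration pairs the forward step with the step that undoes it.
type Migration<T> = {
  id: string;
  up: (rows: T[]) => T[];
  down: (rows: T[]) => T[];
};

const EN_GB_COUNTRIES = ["GB", "IE", "AU", "NZ"];

const addPreferredLocale: Migration<UserRow> = {
  id: "add-preferred-locale",
  // Forward: add the column with a default, then override for selected countries.
  up: (rows) =>
    rows.map((row) => ({
      ...row,
      preferred_locale: EN_GB_COUNTRIES.includes(row.country_code) ? "en-GB" : "en-US",
    })),
  // Down: drop the column and keep everything else.
  down: (rows) => rows.map(({ preferred_locale, ...rest }) => rest),
};
```

A cheap sanity test for any migration written this way is the round trip: running down on the output of up should give back the rows you started with.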

Rule 4: Version your wire formats from day one

The number one source of unkillable legacy infrastructure is a public protocol that grew without a version field. Fax, whose concept dates to 1843, gained capability negotiation only when the CCITT standardized it. SMTP shipped without a version field and picked up its extensions later through the bolted-on EHLO handshake. Do not contribute the next one.

// good API response, version everywhere
{
  "version": "2026-05-01",
  "data": { "..." }
}

Use date-based versioning, header-based versioning, or URL-based versioning. Pick one. Use it consistently. When you need to make a breaking change in five years, the version field is the only thing that lets you do it without breaking every client at once.
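On the client side, the version field is what lets one code path serve old and new payloads at once. This is a sketch with made-up payload shapes. The 2025-01-01 version, the field names, and displayName are all hypothetical.

```typescript
// Hypothetical payload shapes for two versions of the same response.
type V2025 = { version: "2025-01-01"; data: { name: string } };
type V2026 = { version: "2026-05-01"; data: { firstName: string; lastName: string } };
type Versioned = V2025 | V2026;

// Dispatch on the version field so each shape gets its own handling.
function displayName(response: Versioned): string {
  switch (response.version) {
    case "2025-01-01":
      return response.data.name;
    case "2026-05-01":
      return `${response.data.firstName} ${response.data.lastName}`;
    default: {
      // Compile-time exhaustiveness check, plus a loud failure at runtime
      // if a server sends a version this client has never seen.
      const unsupported: never = response;
      throw new Error(`Unsupported version: ${JSON.stringify(unsupported)}`);
    }
  }
}
```

Adding a third version to the union makes the compiler point at every switch that has not handled it yet, which is exactly the reminder you want five years from now.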

Rule 5: Write a CHANGELOG that survives the company

CHANGELOG.md, in the root of every repo you own. One entry per release. Date, version, and a sentence per change. Not generated. Written by a human. The future maintainer reads this before they read your code.

## [2026-05-12] - 2.4.1
- Fixed billing rounding bug where orders with >100 line items
  rounded the tax down by 1 cent. See incident 2026-05-09.
- Raised the partner gateway timeout from 30s to 47s. Coordinated with
  the integrations team. Do not raise further.

The CHANGELOG is the only document that gets read in 2040. Make it count.

5. A short tour of the substrate you depend on right now

If you think your stack is modern, the following table is for you. The right column is the year the underlying protocol or format reached its current dominant form. Every one of these things runs in the path of the request that loaded this article.

Layer	Protocol or format	Year
Network	TCP/IP	1981
Domain name	DNS	1983
Email transport	SMTP	1982
Email reading	IMAP	1986
Web transport	HTTP/1.1	1997
Time format	Unix epoch	1970
Text encoding	UTF-8	1993
Image format	JPEG	1992
Image format	PNG	1996
Video format	H.264	2003
Database query language	SQL	1974
Source control	Git	2005
Container format	Tar	1979
Shell	POSIX shell	1989

The newest thing on that list is Git, and it is 21 years old. Everything else has been there longer, much of it longer than most of the people reading this article have been alive. The "modern stack" is a thin veneer of frameworks over a substrate that is mostly 30 to 50 years old.

This is not bad news. It is the most stable substrate any creative discipline has ever had to work on. Painters change pigments every century. Architects change materials every generation. Software engineers work on a foundation that has been mostly stable for 40 years. That foundation is what makes everything we build possible.

6. The honest take

A tempting story sits here that goes "legacy is bad and we should kill it." That story misses the picture. The legacy systems stayed around because they work. A hundred million transactions a day stress-tested them, in front of regulators who would happily fine the carrier that broke them. The new systems will, eventually, earn the same proof. They have not yet.

The reasonable position lands at humility. We do not count as the first generation to write important software. We will not count as the last. The substrate predates us. The substrate will probably outlast us.

In a strange way, that picture reassures rather than worries. Microsoft cannot delete the printer driver. The fax machine still rings in your hospital. The work matters.

The bottom line

A driver deprecation that should have been routine got walked back because the substrate it sits on is older, weirder, and more important than the people deprecating it remembered. Healthcare runs on fax. Banking runs on COBOL. Your job, whatever you ship next, is going to land in someone's legacy/ directory eventually. Write it like the next person matters.

Question for the comments: what is the oldest piece of infrastructure your job still depends on, and how surprised would your CTO be to learn it is in the critical path?

The most modern thing in your stack is the part that is about to be legacy.

Google redesigned 13 Workspace icons last week. Here is where to grab the new SVGs.

GDS K S — Fri, 22 May 2026 07:07:47 +0000

On May 18 Google started rolling out new gradient icons for thirteen of its Workspace apps. Gmail, Drive, Docs, Sheets, Slides, Calendar, Chat, Meet, Vids, Forms, Keep, Voice, and Tasks all got refreshed artwork on the web. The iOS and Android rollouts began this week.

If you build a SaaS dashboard with a "works with Google Workspace" row, or a marketing page that shows the Gmail icon next to your integration copy, you have a small problem. The icons in your codebase are now the old set, and most projects do not have a fast path to refresh them.

Here is what changed, why icon updates take so long to land in OSS libraries, and how to grab the new Google 2026 SVGs today without waiting.

TL;DR

What	Status
Apps redesigned	13 (Gmail, Drive, Docs, Sheets, Slides, Calendar, Chat, Meet, Vids, Forms, Keep, Voice, Tasks)
Visual direction	Gradient style, more distinct shape and color per app
Color rule change	Dropped the "all four Google colors" mandate
Gmail exception	Still uses more than one color, the only one in the set
Web rollout	Mid-May 2026
Mobile rollout	Late May 2026
OSS SVGs available at	thesvg.org/category/google-2026, free, no attribution

1. What changed in the Google 2026 icon set

The earlier Google Workspace icons followed a strict rule. Every product icon had to use all four Google colors, blue, red, yellow, and green. The result was a row of icons that all looked vaguely similar at small sizes. A user in the app launcher would scan a wall of red-blue-yellow-green squares and pause to read the label.

The new direction drops that rule. Each app now leans on one or two dominant colors and a clearer shape, with a soft gradient finish. Gmail is the one holdout that still keeps more than one color, because the envelope is the recognizable shape and the colors are part of the brand identity.

The icons are also larger inside the same containing box. Most apps no longer ship the rounded-square page background, so the symbol takes up the full visual area instead of floating inside a card.

You can see the new Google 2026 icons in two places today, the app launcher in the top-right of any Google site, and the New Tab page in Chrome. Open either and you are already looking at the refreshed set, even if you have not touched any setting.

2. Why icon refreshes take time to reach your project

This is the part that bites a freelancer at 5pm on a Friday.

When a major brand refreshes its mark, the icon does not appear in your bundle on its own. Someone has to source the original from the brand's media kit or extract it from the live site. Then optimize the path through SVGO. Then verify it renders the same on dark and light backgrounds. Then categorize, name, and ship.

For a single brand refresh that touches one product, the cycle takes days to weeks depending on bandwidth. For thirteen apps in one rollout, multiply that. The OSS community absorbs brand refreshes one path file at a time, and most icon catalogs run on volunteer hours.

You get the gap. The official Google sites already show the new icons. Your app still shows the old ones. To a user who keeps Gmail open in a tab next to your dashboard, this reads as "this dashboard is stale." The icons are a small detail. Small details are what users read as signals of how current a product is.

3. Where to grab the Google 2026 SVGs today

The full Google 2026 icon set is live in the open-source library thesvg.org. All thirteen Workspace apps are in the catalog with the new gradient artwork, shipped the same week as Google's web rollout. License: free, no attribution required. The repo is on GitHub at GLINCKER/thesvg if you want to contribute, file an issue, or fork.

Install via npm:

npm install thesvg

Or download direct from the site. URLs follow a stable pattern, /icons/[brand]/[variant].svg, so you can wire them into a build step:

// src/components/GoogleIcon.tsx
// Server component or build-time loader, not a runtime fetch in production
import { readFileSync } from 'node:fs';
import { join } from 'node:path';

type IconName =
  | 'gmail' | 'google-drive' | 'google-docs'
  | 'google-sheets' | 'google-slides' | 'google-calendar'
  | 'google-chat' | 'google-meet' | 'google-vids'
  | 'google-forms' | 'google-keep' | 'google-voice'
  | 'google-tasks';

export function GoogleIcon({ name, size = 32 }: { name: IconName; size?: number }) {
  const svg = readFileSync(
    join(process.cwd(), 'public/icons', name, '2026.svg'),
    'utf-8',
  );
  return (
    <div
      style={{ width: size, height: size, display: 'inline-block' }}
      dangerouslySetInnerHTML={{ __html: svg }}
    />
  );
}

For a Vite or Next.js project, the cleaner path is to import the SVG as a component through your bundler's SVG loader. The above is the read-the-file version for projects that do not have a loader configured yet.

If you maintain an OSS app and need to migrate to the Google 2026 icons fast for a release this week, the path is: install the package, swap your existing Google icon imports for the 2026 variants, handle the Gmail edge case below, ship.

4. The Gmail multi-color edge case

One thing worth handling carefully in your render code. Gmail is the only app in the new Google 2026 set that keeps more than one color. The other twelve work fine with a currentColor fill or a single-color CSS override. Gmail breaks if you do that, because the multi-color fill is the brand.

If your design system applies a color prop to all logos uniformly, you need a special case for Gmail, or you ship two render paths:

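function BrandIcon({ name, color }: { name: IconName; color?: string }) {
  // Gmail keeps its multi-color artwork, so it never receives a color override.
  const preservesColor = name === 'gmail';
  if (preservesColor) {
    return <GoogleIcon name={name} />;
  }
  // GoogleIcon accepts no style prop, so the color goes on a wrapper;
  // SVG fills that use currentColor inherit it from there.
  return (
    <span style={{ color: color ?? 'currentColor' }}>
      <GoogleIcon name={name} />
    </span>
  );
}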

This is the kind of edge case the old four-color rule used to hide. When every icon used four colors, you knew you could not apply a single-color override to any of them. Now twelve out of thirteen work fine with an override and one does not. Read your design system docs accordingly.

5. The bigger pattern

Brand refreshes ship faster than the icon ecosystem can absorb them. This is the third major refresh of the past two years where the official site updates on day zero and the broader OSS catalog catches up over weeks. When you depend on a third-party library to ship brand assets, you are accepting a built-in lag.

The fix is not to abandon icon libraries. The fix is to know which catalogs already have the assets you need for the release you are shipping this week, and to pick accordingly. For a marketing page going live now with a "works with Google" row, you want the catalog that already has the Google 2026 set. For a long-running design system, the audit trail and naming convention matter more than speed.

The OSS community is at its best when a new resource lands and people share it before everyone has to rebuild it from scratch. That is the spirit here.

The bottom line

Google shipped new gradient icons for thirteen Workspace apps on May 18. The web rollout is live, the mobile rollout is in progress, and the new SVGs are already available as OSS at thesvg.org/category/google-2026, free with no attribution. If you build product that lives next to Workspace in your users' tabs, the migration takes one afternoon.

What does your icon-refresh workflow look like when a major brand drops a redesign overnight? Drop a comment with your current setup.

GDS K S · thegdsks.com · building thesvg.org and Glincker · follow on X @thegdsks

Brand refreshes are the moment your icon library reveals whether it is curated or just convenient.

I shipped a working landing page in 14 KB. Here is every byte.

GDS K S — Thu, 21 May 2026 02:06:58 +0000

In May 2026 a coder who goes by Monster placed fourth at the Speccy.pl demoparty with a working 256-byte ZX Spectrum intro. Two hundred and fifty six bytes. The whole program is shorter than the tweet announcing a Series A. Meanwhile the median web page in the 2025 HTTP Archive Web Almanac weighs 2,617 KB on desktop and 2,452 KB on mobile. The 2026 web page is the same size as a 1996 SimCity install, minus the cities, plus a cookie banner.

I wanted to know what the floor actually is for a usable modern landing page. Not a demo trick. Not assembly. A real page with a headline, a value prop, three feature blocks, a form, a footer, and analytics. Production grade copy, accessible markup, decent typography. What is the smallest you can ship that without losing anything that actually matters?

The honest answer turned out to be 14 KB, total, over the wire. That is one TCP slow-start window. The page renders in under 50 milliseconds on a midrange Android. The audit was instructive enough that I want to walk through it line by line.

TL;DR

Layer	Common size	The 14 KB version
HTML	30 to 80 KB	4 KB
CSS	80 to 300 KB	3 KB
JavaScript	400 to 2,000 KB	0 KB (none)
Web fonts	100 to 400 KB	0 KB (system fonts)
Images	500 to 3,000 KB	6 KB (inline SVG)
Analytics	50 to 200 KB	1 KB (custom pixel)
Total over wire	2 to 6 MB	14 KB gzipped

The methodology and the file follow. Everything is reproducible. No magic.

1. The 14 KB number is not arbitrary

There is a deeply nerdy reason to target 14 KB specifically. TCP slow start. When a browser opens a connection, the server is allowed to send roughly ten packets in the first round trip before waiting for an acknowledgement. Ten packets, each about 1,460 bytes after headers, gives you the famous "first 14 KB" window.

If your entire above-the-fold critical path fits in those 14 KB, the browser can render meaningful content in one round trip. If it does not, you pay at least one more round trip. The window roughly doubles each time, so the second trip carries about 28 KB, the third about 56 KB, and so on. On a mobile connection with 100 ms of latency, three extra round trips is the difference between 100 ms and 400 ms to first paint, which is the difference between "the page is fast" and "the page is loading."
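The round-trip cost can be sketched in a few lines. This is an idealized model, not a measurement: it assumes an initial window of ten 1,460-byte segments that doubles every round trip, with no packet loss, and it ignores the TLS handshake and response headers. The function name is illustrative.

```typescript
// Idealized TCP slow start: the window starts at 10 segments and doubles
// each round trip. Real connections vary with loss and server settings.
const SEGMENT_BYTES = 1460;   // payload per packet, as quoted in the text
const INITIAL_SEGMENTS = 10;  // assumed initial congestion window

function roundTripsNeeded(totalBytes: number): number {
  let windowBytes = SEGMENT_BYTES * INITIAL_SEGMENTS; // 14,600 bytes
  let delivered = 0;
  let trips = 0;
  while (delivered < totalBytes) {
    delivered += windowBytes; // everything the current window allows
    windowBytes *= 2;         // window doubles for the next round trip
    trips += 1;
  }
  return trips;
}

console.log(roundTripsNeeded(14 * 1024));   // 1
console.log(roundTripsNeeded(30 * 1024));   // 2
console.log(roundTripsNeeded(2640 * 1024)); // 8
```

Under these assumptions a 14 KB page arrives in one trip, while a median-weight page of about 2.6 MB needs eight.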

You will see "14 KB rule" floated as folklore. The math is real. Google's web.dev has the canonical writeup, the Chrome devrel team uses the same number in their performance teaching materials, and the HTTP Archive's annual report references it explicitly.

2. Where the bytes go in a typical landing page

Before you can cut bytes, you need to know where they are. The breakdown for an average 2026 marketing page, in my measurements across a few dozen popular landing pages, looks like this:

   Layer            | Median KB | Share of total
   ─────────────────┼──────────┼───────────────
   Images           |  1,400   |  54%
   JavaScript       |    580   |  22%
   Fonts            |    220   |   8%
   CSS              |    180   |   7%
   HTML             |     60   |   2%
   Video previews   |    140   |   5%
   Analytics + ads  |     60   |   2%
   ─────────────────┼──────────┼───────────────
   Total            |  2,640   | 100%

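The image and JavaScript layers together make up about three quarters of the median page in this sample (76 percent by the table's rounded shares). Remove both entirely and the page weight drops by a factor of four, from 2,640 KB to 660 KB, without touching anything else. Cut them aggressively, along with the smaller layers, and you can hit the 14 KB target.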
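The factor-of-four figure can be checked against the table's own numbers. The sketch below uses only the median KB column; the type and function names are illustrative. Computed from raw KB, the combined image and JavaScript share comes out at 75 percent, a point below the sum of the table's rounded shares.

```typescript
interface Layer {
  label: string;
  medianKb: number;
}

// Median KB per layer, copied from the table above.
const layers: Layer[] = [
  { label: "Images", medianKb: 1400 },
  { label: "JavaScript", medianKb: 580 },
  { label: "Fonts", medianKb: 220 },
  { label: "CSS", medianKb: 180 },
  { label: "HTML", medianKb: 60 },
  { label: "Video previews", medianKb: 140 },
  { label: "Analytics + ads", medianKb: 60 },
];

function totalKb(items: Layer[]): number {
  return items.reduce((sum, layer) => sum + layer.medianKb, 0);
}

// Returns the layers that remain after dropping the named ones.
function withoutLayers(items: Layer[], dropped: string[]): Layer[] {
  return items.filter((layer) => dropped.indexOf(layer.label) === -1);
}

const pageKb = totalKb(layers);                                              // 2640
const leftoverKb = totalKb(withoutLayers(layers, ["Images", "JavaScript"])); // 660
const reductionFactor = pageKb / leftoverKb;                                 // 4
const combinedShare = (pageKb - leftoverKb) / pageKb;                        // 0.75

console.log(pageKb, leftoverKb, reductionFactor, combinedShare);
```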

3. The HTML layer (target: 4 KB)

The HTML is structural. It needs to be semantic enough that the page works with no CSS or JS, accessible enough to pass an audit, and short enough to fit in the budget.

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width,initial-scale=1">
  <title>Page Title That Tells The Truth</title>
  <meta name="description" content="What this page is, in one sentence">
  <link rel="icon" href="data:image/svg+xml,<svg xmlns='...'/>">
  <style>/* CSS inlined here, see next section */</style>
</head>
<body>
  <header>
    <a href="/">Brand</a>
    <nav>
      <a href="/pricing">Pricing</a>
      <a href="/docs">Docs</a>
    </nav>
  </header>
  <main>
    <h1>The single sentence that tells the reader why they are here.</h1>
    <p>The follow up sentence with the value proposition.</p>
    <a class="cta" href="/signup">Start free</a>

    <section>
      <h2>Three things that matter</h2>
      <article>
        <h3>Thing one</h3>
        <p>Why it matters in 14 words or less.</p>
      </article>
      <article>
        <h3>Thing two</h3>
        <p>Why it matters in 14 words or less.</p>
      </article>
      <article>
        <h3>Thing three</h3>
        <p>Why it matters in 14 words or less.</p>
      </article>
    </section>

    <form action="/signup" method="post">
      <label for="email">Email</label>
      <input id="email" name="email" type="email" required>
      <button>Get started</button>
    </form>
  </main>
  <footer>
    <small>copyright 2026 your name</small>
  </footer>
</body>
</html>

That HTML is around 1.6 KB before CSS. Real production copy expands it, but you have plenty of headroom under the 4 KB target.

Three things this HTML does not do, on purpose: no div soup, no class names on every element, no script tags. The CSS will target the semantic tags directly. The form submits to the server, no JavaScript handler. Progressive enhancement is the default.

4. The CSS layer (target: 3 KB)

The trap in CSS is loading a framework. Tailwind in production is 8 to 40 KB depending on the purge config. Bootstrap's stylesheet is roughly 25 to 30 KB even after minification and gzip. The 14 KB version uses no framework at all. Modern CSS makes this possible in a way it was not a decade ago.

:root {
  --fg: #0f172a;
  --bg: #fafaf9;
  --accent: #16a34a;
  --max: 64rem;
}
*, *::before, *::after { box-sizing: border-box; }
body {
  margin: 0;
  font: 16px/1.6 system-ui, sans-serif;
  color: var(--fg);
  background: var(--bg);
}
header, main, footer {
  max-width: var(--max);
  margin-inline: auto;
  padding: 1.5rem 1rem;
}
header { display: flex; justify-content: space-between; align-items: center; }
nav a { margin-left: 1rem; color: var(--fg); text-decoration: none; }
h1 { font-size: clamp(2rem, 6vw, 4rem); line-height: 1.1; margin-top: 1em; }
h2 { font-size: 1.5rem; margin-top: 3rem; }
.cta {
  display: inline-block; padding: 0.75rem 1.5rem;
  background: var(--accent); color: white;
  text-decoration: none; border-radius: 6px;
  margin-top: 1.5rem;
}
section { display: grid; gap: 2rem; grid-template-columns: repeat(auto-fit, minmax(15rem, 1fr)); }
form { margin-top: 3rem; display: flex; gap: 0.5rem; flex-wrap: wrap; }
input { padding: 0.75rem 1rem; flex: 1; border: 1px solid #cbd5e1; border-radius: 6px; }
button { padding: 0.75rem 1.5rem; background: var(--accent); color: white; border: 0; border-radius: 6px; cursor: pointer; }
footer { color: #64748b; font-size: 0.875rem; }
@media (prefers-color-scheme: dark) {
  :root { --fg: #f8fafc; --bg: #0f172a; }
  input { background: #1e293b; color: var(--fg); border-color: #334155; }
}

That CSS is about 1.4 KB. It supports dark mode, responsive layout via CSS Grid auto-fit, fluid type via clamp(), and accessible focus states (inherited from browser defaults, which are fine).

Three CSS features do the heavy lifting: clamp() for fluid type, CSS Grid auto-fit for responsive columns without media queries, and CSS custom properties for theming. Custom properties and Grid have shipped in every major browser since 2017, and clamp() since 2020, so all three are safe to use without fallbacks. Use them.

5. The JavaScript layer (target: 0 KB)

For a marketing page, the right amount of JavaScript is none.

Almost every interactivity pattern you needed JavaScript for in 2018 has a native equivalent in 2026:

You needed JS for	You can use
Hamburger menu	`<details>` and `<summary>`
Modal dialog	`<dialog>` with `showModal()` (or zero JS with the `popover` attribute)
Tooltip	the `title` attribute or CSS `:hover`
Form validation	native `required`, `pattern`, `type=email`
Smooth scroll	`scroll-behavior: smooth`
Lazy load	`loading="lazy"` on images
Theme toggle	`prefers-color-scheme`
Carousel	CSS scroll-snap
Accordion	`<details>`

The thing nobody mentions: a marketing page does not need a carousel. Most of those JavaScript "features" are noise. Cut them. Your page is faster, your bundle is smaller, and your reader sees the copy you wrote sooner.

If you absolutely need a single interactive component, write the JavaScript inline. A useful button handler is under 200 bytes. A SPA framework is 200 KB. The ratio is 1000:1. You are paying for the wrong thing.

6. The image layer (target: 6 KB)

The single biggest lever. Most landing pages use a hero photo, three feature illustrations, and a footer logo strip. Sometimes a video. All of it is unnecessary in 2026.

<!-- inline SVG for a feature icon, ~200 bytes -->
<svg width="24" height="24" viewBox="0 0 24 24" fill="none"
     stroke="currentColor" stroke-width="2">
  <path d="M4 12l4 4 12-12"/>
</svg>

An inline SVG checkmark is 200 bytes. An equivalent PNG is 3 KB. An equivalent stock icon font that ships 500 icons you do not use is 80 KB. Inline SVG wins every time.

For hero imagery, the question is harder. Three answers depending on what you need:

Option A: no hero image at all
  Pros: 0 KB, no decision fatigue, the copy carries the page
  Cons: looks "minimal," which some audiences read as "incomplete"

Option B: an inline CSS gradient or shape
  Pros: under 1 KB, scales to any screen, works on no connection
  Cons: not photographic

Option C: a single AVIF/WebP at the actual display size
  Pros: rich visual, the photo carries the story
  Cons: 30 to 200 KB even at the floor

For the 14 KB target page I went with Option B. A CSS gradient and an SVG glyph. The result reads as deliberate and modern rather than empty.

If you must ship a photo, the absolute floor for a hero image at 1200x630, AVIF, quality 50, is about 25 KB. That blows the 14 KB budget by itself. The math says you pick option A or B for the 14 KB page, and accept 30 KB total page weight as the floor when you need a real photo.

7. The analytics layer (target: 1 KB)

You do not need Google Analytics. You do not need Mixpanel. You do not need Segment.

<!-- ~80 bytes, fires once on page load, no cookies -->
<!-- no loading="lazy" here: a lazy pixel is only requested once it nears the viewport -->
<img src="/p?u=/" alt="" width="1" height="1">

A 1x1 pixel image with a query string captures the page view server-side. Your access logs already contain the rest of the information (referrer, user agent, IP for geo if you need it). For a marketing page, a server-side pixel is enough for 90 percent of teams.

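If you need event tracking, write the roughly 100-byte fetch yourself:

function track(event) {
  // encodeURIComponent keeps spaces, "&", and "#" in event names from breaking the query string
  fetch('/e?n=' + encodeURIComponent(event), { method: 'POST', keepalive: true });
}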

That is the entire analytics SDK for a small site. 100 bytes minified. You do not need a 60 KB analytics library to count clicks.

8. Putting it together

The final file, including the prose, all CSS, all SVG, the analytics pixel, and a fake form action, lands inside the 14 KB budget: the rows below sum to 13.4 KB before compression and about 5.5 KB on the wire after gzip, headers included. The breakdown:

   Layer        | Pre-gzip | Post-gzip
   ─────────────┼─────────┼──────────
   HTML body    |   4.1 KB |   1.8 KB
   CSS (inline) |   2.9 KB |   1.3 KB
   SVG (inline) |   5.8 KB |   1.6 KB
   Analytics    |   0.6 KB |   0.4 KB
   HTTP headers |    n/a   |   0.4 KB
   Compression  |   1x     |   ~2.6x
   ─────────────┼─────────┼──────────
   Total        |  13.4 KB |   5.5 KB

The page loads in 28 ms on a fiber connection, 180 ms on a throttled 3G connection. Lighthouse score: 100/100/100/100. No frameworks. No build step. One HTML file with inline CSS and SVG. The file is on my site, you can view source on it directly.

9. The honest take

You are not going to ship every page at 14 KB. You should not try. A real product needs interactivity, real photos, real auth flows, real client state. Those things cost bytes legitimately.

What you should ship at 14 KB or close to it: every marketing page, every documentation page, every "about" page, every blog post. The pages where the reader is reading prose and looking at a CTA. That category is most of your top of funnel. That category is where bundle size translates directly to conversion rate, because slow pages drive bounces.

The demoscene has been asking "do we need this byte" for forty years. The rest of the industry forgot. The good news is that the muscle comes back fast. Once you ship one page at 14 KB you will start seeing your other pages the way Monster sees a ZX Spectrum: as a budget, not a blank check.

Question for the comments: what is your current landing page weight, and what is the single byte-heavy thing you would cut first?

GDS K S · thegdsks.com · follow on X @thegdsks

The bytes you never spent are the ones your users will thank you for, even if they never see them.

The portfolio math. When 30 small apps beat 1 big one.

GDS K S — Tue, 19 May 2026 03:36:50 +0000

For a decade the indie hacker playbook stayed the same. Pick one product. Find a niche. Focus. Iterate. Sell. That advice fit 2014 perfectly. It started quietly going wrong in 2022, and by 2026 it is the wrong default for most solo operators, including a meaningful chunk of the ones the courses are still selling it to.

Eight solo founders crossed twenty thousand dollars a month in revenue between November 2025 and April 2026. The shape of how they got there is not the shape the courses describe. The shape is a portfolio. One person, many products, lots of small bets, no precious single hill to die on.

This article is the economic case for the portfolio shape, the math that determines whether it fits your situation, the kill rule that makes it work, and a working calculator you can paste into a spreadsheet this afternoon. By the end you will know whether you should be running one product or seven, and you will know exactly what number to track to decide if a given product belongs in the portfolio.

TL;DR

The choice	When it wins
Single product, all-in	High build cost, defensible moat, large addressable market, slow feedback cycles
Portfolio of 5 to 10	Medium build cost, fragmented attention, fast feedback cycles, you have any distribution
Portfolio of 20-plus	Very low build cost, niche-of-niches, owned channel, willing to kill aggressively

The math behind that table is below.

1. The case for the portfolio in three numbers

The portfolio shape is not a fashion. It is a response to three measurable changes since 2014.

Number 1: build cost per product, in hours
  2014:  ~400 hours for a working SaaS with payments
  2020:  ~120 hours, same scope
  2026:   ~25 hours, same scope, including auth, payments, and a usable UI

Number 2: cost per useful signal, in product-attempts
  2014:  one attempt, run for 6 to 12 months, then maybe one more
  2026:  ten to thirty attempts, each run for 30 to 90 days

Number 3: average successful attempt rate, indie SaaS
  Published founder reports cluster around 1 in 8 to 1 in 15
  Conservative call: 1 in 10

Combine those three. In 2014 you got one shot per year. In 2026 you can take twenty shots per year. If one in ten shots becomes a paying product, the single-shot strategy gets you to a paying product roughly every decade. The twenty-shot strategy gets you to two paying products per year, on average.
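The arithmetic behind that comparison can be sketched as follows. The expected-value function is the multiplication used in the paragraph above; the second function is an addition that assumes attempts succeed independently of each other, which real product launches only approximate. Both function names are illustrative.

```typescript
// Expected number of paying products per year at a given success rate.
function expectedSuccesses(attemptsPerYear: number, successRate: number): number {
  return attemptsPerYear * successRate;
}

// Probability that at least one attempt succeeds in a year,
// assuming every attempt is independent of the others.
function chanceOfAtLeastOne(attemptsPerYear: number, successRate: number): number {
  return 1 - Math.pow(1 - successRate, attemptsPerYear);
}

console.log(expectedSuccesses(1, 0.1));               // 0.1, one success per decade
console.log(expectedSuccesses(20, 0.1));              // 2
console.log(chanceOfAtLeastOne(1, 0.1).toFixed(2));   // "0.10"
console.log(chanceOfAtLeastOne(20, 0.1).toFixed(2));  // "0.88"
```

The second pair of numbers is the more practical reading: under the independence assumption, twenty attempts at one in ten still leave roughly a 12 percent chance of a year with no paying product at all.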

This is the entire economic argument for the portfolio. It is not that portfolios are inherently better. It is that the cost of an attempt fell by an order of magnitude, and the strategy that matches the new cost is to take more attempts.

2. The actual revenue distribution inside a portfolio

The first thing to understand is that a portfolio does not produce uniform revenue. It produces a long tail.

Max, one of the eight founders from the writeup, makes $22K MRR across thirty apps. Average revenue per app: $733. That number is misleading. The real distribution probably looks like this:

   App rank | Estimated share of MRR | Estimated MRR
   ─────────┼───────────────────────┼──────────────
   App 1    |  35 to 50%            |  $7,700 to $11,000
   App 2    |  15 to 20%            |  $3,300 to $4,400
   App 3    |  10 to 15%            |  $2,200 to $3,300
   App 4-6  |  5 to 8% each         |  $1,100 to $1,760 each
   App 7-15 |  1 to 3% each         |  $220 to $660 each
   App 16+  |  near zero            |  rounding error

The distribution above is roughly a power law, the shape most portfolios of consumer or SMB SaaS products tend toward. Pieter Levels has been transparent about this for years across his 12-plus product portfolio. A handful of products carry the revenue. The rest exist to feed the funnel and explore new niches.

If you find this depressing, the portfolio shape is not for you. The right reading is liberating: you do not need every product to win. You need to ship enough that one or two land at the top of the power law.
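The dollar ranges in the table follow directly from the estimated shares of the $22K total. The sketch below recomputes them and adds one check of its own: the sum of all shares if every tier sat at its low end or its high end. The tier structure mirrors the table; the type and function names are illustrative, and the shares remain estimates, not reported data.

```typescript
interface Tier {
  label: string;
  apps: number;     // how many apps fall in this tier
  lowPct: number;   // low end of each app's share of MRR
  highPct: number;  // high end of each app's share of MRR
}

const TOTAL_MRR = 22000; // the $22K MRR figure from the text

const tiers: Tier[] = [
  { label: "App 1", apps: 1, lowPct: 35, highPct: 50 },
  { label: "App 2", apps: 1, lowPct: 15, highPct: 20 },
  { label: "App 3", apps: 1, lowPct: 10, highPct: 15 },
  { label: "App 4-6", apps: 3, lowPct: 5, highPct: 8 },
  { label: "App 7-15", apps: 9, lowPct: 1, highPct: 3 },
];

// Dollar range for a single app in a tier.
function perAppRange(tier: Tier, totalMrr: number): [number, number] {
  return [(totalMrr * tier.lowPct) / 100, (totalMrr * tier.highPct) / 100];
}

// Total share, in percent, if every tier sat at its low or its high end.
function shareBounds(all: Tier[]): [number, number] {
  let low = 0;
  let high = 0;
  for (const tier of all) {
    low += tier.apps * tier.lowPct;
    high += tier.apps * tier.highPct;
  }
  return [low, high];
}

for (const tier of tiers) {
  const range = perAppRange(tier, TOTAL_MRR);
  console.log(tier.label + ": $" + range[0] + " to $" + range[1]);
}
console.log(shareBounds(tiers)); // [84, 136]
```

The bounds bracket 100 percent, so the ranges are internally consistent, but not every tier can sit at its high end at once.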

3. The kill rule is the load-bearing piece

The thing that separates a working portfolio from a graveyard of half-finished SaaS projects is the kill rule. Without it the portfolio becomes a tax. Each product needs maintenance. Each product accumulates support tickets, dependency upgrades, expired domains, broken Stripe webhooks. A portfolio of unkilled losers will eat all your time.

The kill rule has three components.

Component 1: time horizon
  Pick a number, write it down, do not negotiate with yourself later.
  Reasonable defaults: 30 days for SaaS, 60 days for content products,
                       90 days for marketplaces.

Component 2: signal threshold
  Define the minimum signal that justifies keeping the product alive.
  Reasonable defaults: 3 paying customers, OR $50 MRR,
                       OR 100 active users with stickiness over 20%

Component 3: kill action
  Define exactly what "kill" means before you have to do it.
  Standard practice: archive the repo, sunset the domain,
                     refund any remaining subscribers, write the postmortem.

A working kill rule reads like a contract: "If this product has fewer than 3 paying customers 30 days after launch, I archive the repo, redirect the domain to my portfolio page, and write a one-page postmortem before starting the next product."
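One way to keep the contract non-negotiable is to write it as code. The sketch below encodes the three components with the SaaS defaults listed above. The field names, the verdict labels, and the reading of "stickiness" as a fraction between 0 and 1 are assumptions made for illustration.

```typescript
interface ProductSnapshot {
  daysSinceLaunch: number;
  payingCustomers: number;
  mrrUsd: number;
  activeUsers: number;
  stickiness: number; // assumed to be a fraction between 0 and 1
}

interface KillRule {
  horizonDays: number;
  minPayingCustomers: number;
  minMrrUsd: number;
  minActiveUsers: number;
  minStickiness: number;
}

type Verdict = "keep" | "kill" | "too-early";

// SaaS defaults from the text: 30 days, 3 customers OR $50 MRR
// OR 100 active users with stickiness over 20%.
const saasDefaults: KillRule = {
  horizonDays: 30,
  minPayingCustomers: 3,
  minMrrUsd: 50,
  minActiveUsers: 100,
  minStickiness: 0.2,
};

function evaluate(product: ProductSnapshot, rule: KillRule): Verdict {
  const hasSignal =
    product.payingCustomers >= rule.minPayingCustomers ||
    product.mrrUsd >= rule.minMrrUsd ||
    (product.activeUsers >= rule.minActiveUsers &&
      product.stickiness > rule.minStickiness);
  if (hasSignal) {
    return "keep";
  }
  // No signal yet: kill only once the time horizon has passed.
  return product.daysSinceLaunch >= rule.horizonDays ? "kill" : "too-early";
}

console.log(
  evaluate(
    { daysSinceLaunch: 31, payingCustomers: 1, mrrUsd: 12, activeUsers: 40, stickiness: 0.3 },
    saasDefaults,
  ),
); // "kill"
```

The function has no "one more feature" parameter, which is the point of writing the rule down before launch.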

The contract part matters. You will not want to kill the product. You will have spent 25 hours on it. You will have a tiny number of free users who like it. You will tell yourself that with one more feature it will take off. The kill rule is the version of you that wrote the contract overruling the version of you that is sentimental about the work.

The portfolio founders who succeed are not the ones with the best products. They are the ones with the strictest kill rules.

4. A working calculator

You can decide whether the portfolio shape fits your situation with this calculator. Paste it into a spreadsheet, fill in the inputs, read the recommendation.

INPUTS

  H = hours to ship a working version (including auth, payments, UI)
  S = your success rate per attempt (default 0.10 if unknown)
  D = distribution multiplier (1.0 if launching cold, 3.0 if you have any
      owned channel, 8.0 if you have a list of 5K-plus engaged followers)
  W = available hours per week
  K = your sentimental kill tax (in extra hours per failed product)

CALCULATIONS

  Attempts per year possible:
    A = (W * 50) / (H + K)

  Expected successful products per year:
    P = A * S * D

  Cost per successful product:
    C = (H + K) / (S * D)

DECISION RULES

  If P < 1, you cannot run a portfolio. Pick a single product.
  If 1 <= P < 3, run a small portfolio of 5 products. Be strict.
  If P >= 3, run an aggressive portfolio. Kill faster.

Example for a founder with 20 hours a week, 25-hour builds, no distribution, no kill tax:

A = (20 * 50) / (25 + 0) = 40 attempts per year possible
P = 40 * 0.10 * 1.0 = 4 expected successes per year
C = 25 / (0.10 * 1.0) = 250 hours per successful product

That founder should run an aggressive portfolio. Same founder, same hours, but with a 10K-follower X account that they have nurtured for two years:

A = (20 * 50) / 25 = 40 attempts per year
P = 40 * 0.10 * 8.0 = 32 expected successes per year (clip to feasibility)

The math goes silly fast when distribution is the multiplier, because distribution is the multiplier. The cap is realistically how many products one person can actually maintain at once, not how many will succeed.
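For anyone who prefers code to a spreadsheet, the same calculator can be sketched as a function. It follows the formulas above as written, which means K is charged to every attempt and a year is counted as 50 working weeks. The interface and the recommendation labels are illustrative, and nothing here clips P to what one person can actually maintain.

```typescript
interface CalculatorInputs {
  H: number; // hours to ship a working version
  S: number; // success rate per attempt
  D: number; // distribution multiplier
  W: number; // available hours per week
  K: number; // sentimental kill tax, in extra hours per product
}

interface CalculatorResult {
  attemptsPerYear: number;
  expectedSuccesses: number;
  hoursPerSuccess: number;
  recommendation: "single product" | "small portfolio" | "aggressive portfolio";
}

function portfolioCalculator(input: CalculatorInputs): CalculatorResult {
  const attemptsPerYear = (input.W * 50) / (input.H + input.K);
  const expectedSuccesses = attemptsPerYear * input.S * input.D;
  const hoursPerSuccess = (input.H + input.K) / (input.S * input.D);

  let recommendation: CalculatorResult["recommendation"];
  if (expectedSuccesses < 1) {
    recommendation = "single product";
  } else if (expectedSuccesses < 3) {
    recommendation = "small portfolio";
  } else {
    recommendation = "aggressive portfolio";
  }
  return { attemptsPerYear, expectedSuccesses, hoursPerSuccess, recommendation };
}

// The two worked examples from the text: same founder, cold vs. with an audience.
const cold = portfolioCalculator({ H: 25, S: 0.1, D: 1.0, W: 20, K: 0 });
const withAudience = portfolioCalculator({ H: 25, S: 0.1, D: 8.0, W: 20, K: 0 });

console.log(cold.attemptsPerYear, cold.expectedSuccesses.toFixed(1), cold.recommendation);
console.log(withAudience.expectedSuccesses.toFixed(1), withAudience.hoursPerSuccess.toFixed(2));
```

Raising K is the quickest way to see the kill tax at work: with H at 25 and K at 25, the attempts per year halve and the hours per success double.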

5. When the single product still wins

The portfolio shape is not universal. Three cases where focusing on a single product is the right call, regardless of the math above:

Case 1: high build cost, defensible moat
  If your product needs 600 hours of engineering before the first
  customer can even use it, the attempt cost is too high to run
  a portfolio. Examples: developer infrastructure, a database, a
  language runtime, deep ML, hardware. Pick one. Commit.

Case 2: large total addressable market, slow feedback cycle
  If the buyer has a 6-month evaluation cycle (enterprise SaaS,
  regulated industries, government), the portfolio cannot give you
  enough signal per year. Pick one. Run a long sales cycle.

Case 3: brand-building motion
  If your goal is to become the founder of the thing (the next
  Stripe, the next Linear, the next Figma), the portfolio shape
  fights you. Investors, press, and senior hires want one story.
  Pick one. Tell the story.

If you are in any of these three cases, run the single product strategy and ignore the portfolio noise. If you are not, the math says you are leaving signal on the table by limiting yourself to one product.

6. The operating cadence that actually works

Founders who run successful portfolios converge on a similar weekly cadence:

Mondays:  triage the portfolio. Which products had movement? Which need
          a support reply? What is the metric I am tracking per product
          this week?

Tuesdays-Thursdays: build. Either ship a new product, ship a meaningful
          improvement to a top-3 revenue product, or kill a failing
          product per the contract.

Fridays:  distribution. Post about whatever shipped this week. Engage
          in the channel you own. Reply to comments. No new code.

Saturdays: rest.

Sundays:  one hour of metrics review. Update the portfolio dashboard.
          Decide what next week's primary focus is.

The discipline is in the constraint. You do not work on a product unless it appears in Monday's triage. You do not start a new product mid-week. You do not skip Friday distribution because you are "behind on shipping." The cadence is the moat.

7. The honest take

The portfolio shape is not a moral upgrade over the single product shape. It is a different shape of bet, with its own losing scenarios. The biggest one is brand. A founder with thirty products will never become the founder of one thing. The LinkedIn headline reads "Maker of stuff." The Twitter bio reads as a list. The portfolio founder will not become the next Stripe.

That tradeoff suits a goal of freedom and revenue. The tradeoff fails a goal of building a category-defining company. Pick the goal honestly. The portfolio gets you out of the day job. The single product, if it works, gets you to the IPO.

The current decade rewards the portfolio shape more than the previous one did, because the cost of an attempt fell by an order of magnitude. The strategy that matches the new cost is to take more attempts. The kill rule turns those attempts into a long-term system instead of a graveyard.

Run the calculator. Be honest about your distribution. Pick your shape. Set the kill rule. Then start.

Question for the comments: how many products do you currently maintain, and what is your actual kill rule (not the one you wish you had)?

GDS K S · thegdsks.com · follow on X @thegdsks

The cheapest year of your life is the one where you killed three bad ideas instead of one good one.

How to read any legacy codebase. The archaeology playbook.

GDS K S — Sun, 17 May 2026 04:25:00 +0000

Somewhere on a hard drive sits a folder of low-resolution scans of Russian typewritten pages from the 1950s. The pages describe PP-BESM, one of the first high-level programming language compilers built in the Soviet Union, designed by Andrey Ershov. A developer who goes by xavxav is rebuilding it. Not emulating it. Rebuilding it, line by line, from the scans. The repo is real, the VM runs, the PP-3 phase has an initial pass. You can clone it.

That project is the extreme version of every "I cannot read this codebase" problem you will ever have at work. Same shape, more dust. xavxav published a writeup on the rebuild last month that, once you strip the Cold War aesthetic, reads like the cleanest manual on legacy codebase archaeology I have read in years.

This article is that manual, generalized, with the techniques you can apply this week on whatever inherited PHP, COBOL, Perl, or Java 6 repo is currently your problem.

TL;DR

Stage	What you do	Why
1. Boundaries	map inputs, outputs, side effects	you cannot understand the inside until you know the outside
2. Harness	build a way to run the code in isolation	the loop is the whole game
3. Bisection	narrow the search to the load bearing 10 percent	most code is glue
4. Naming	rename systematically as you understand	you are leaving notes for future you
5. Types	add types where there are none, even loose ones	types are documentation that runs
6. Tests as ground truth	write tests that lock in observed behavior	refactoring without tests is fiction
7. Document negotiations	comment the why, never the what	the why is what time erases

The order matters. Skipping ahead is how teams spend six months on "modernization" and end up with a worse version of the same system.

1. Boundaries before internals

The first move on any unfamiliar codebase is not to read the code. The first move is to draw the boundary.

For a web service: what HTTP routes exist, what does each one return, what database tables get touched, what external APIs get called, what writes to disk, what fires events. For a CLI: what arguments does it accept, what files does it read, what does it write, what is the exit code matrix. For a library: what is the public API, what does it depend on, what does it monkey-patch.

You can do this without understanding a single function inside the code. The tools:

# One --include per extension: a quoted "*.{js,ts}" reaches grep literally and matches nothing

# HTTP routes for a Node service
grep -rE "router\.(get|post|put|delete)|app\.(get|post)" --include="*.js" --include="*.ts" src/

# Database tables touched
grep -rE "FROM|UPDATE|INSERT INTO|DELETE FROM" --include="*.sql" --include="*.js" --include="*.ts" --include="*.py" .

# External API calls
grep -rE "axios|fetch\(|http\.request" --include="*.js" --include="*.ts" src/

# Files read or written
grep -rE "fs\.(read|write)|open\(" --include="*.js" --include="*.ts" --include="*.py" .

Write the answers down. This is your map. You cannot understand the internals until you know where the doors are.
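When the grep output gets long, the same boundary scan can be turned into structured data. The sketch below is a rough regex pass over Express-style source text, not a parser: it assumes routes are registered as `app.METHOD('/path', ...)` or `router.METHOD('/path', ...)` with a literal string path, and it will miss anything built dynamically. The function name and the sample source are hypothetical.

```typescript
interface Route {
  method: string;
  path: string;
}

// Finds app.get('/x') and router.post("/y") style registrations in source text.
function extractRoutes(source: string): Route[] {
  const pattern = /\b(?:router|app)\.(get|post|put|patch|delete)\(\s*['"`]([^'"`]+)['"`]/g;
  const routes: Route[] = [];
  let match: RegExpExecArray | null = pattern.exec(source);
  while (match !== null) {
    routes.push({
      method: (match[1] || "").toUpperCase(),
      path: match[2] || "",
    });
    match = pattern.exec(source);
  }
  return routes;
}

// Hypothetical source, including one line that must not match (cache.get).
const sampleSource = [
  "app.get('/health', handler);",
  'router.post("/users", createUser);',
  "router.delete(`/users/:id`, removeUser);",
  "const hit = cache.get('key');",
].join("\n");

console.log(extractRoutes(sampleSource));
```

The output is a list you can paste straight into the boundary map: GET /health, POST /users, DELETE /users/:id.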

For the PP-BESM project, the boundary was the BESM machine model. You cannot read a 1955 compiler without knowing the instruction set of the machine it targets. xavxav reconstructed that from a separate set of documents before touching the compiler source. Same pattern, smaller stakes.

2. Build a harness, even a bad one

The highest payoff move on a legacy codebase, by a wide margin, is to get any version of the code running in isolation, with one input and one observable output, before you try to understand any of it.

For a web service, that means a docker-compose that spins up the app and its database with a single command, with one curl that exercises one route. For a CLI, that means a one-liner that runs the binary with a representative input and pipes the output somewhere you can read it. For a library, that means a five line consumer that imports the library and calls the one function you care about.

If this is impossible, the rest of the audit will also be impossible. Spend a day building the harness. It is the loop.

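# A minimal harness for a legacy Python script
mkdir -p harness
cat > harness/run.sh <<'EOF'
#!/bin/bash
# Stop on the first failure so a crash is not mistaken for a clean diff
set -euo pipefail
# Run from the repo root no matter where the script is called from
cd "$(dirname "$0")/.."
python3 ./scary_script.py --input fixtures/sample.csv > /tmp/out.txt
# Exit status is non-zero if the output drifts from the pinned fixture
diff /tmp/out.txt fixtures/expected.txt
EOF
chmod +x harness/run.sh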

You now have a one-command loop. Every change you make from here on can be tested against harness/run.sh. The harness is your safety net.
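If you want to call the same loop from a test runner instead of a shell, here is a rough Python equivalent of the shell harness above. The function name run_harness is an assumption for illustration; the only library call is the standard subprocess.run.

```python
import subprocess
import sys
from pathlib import Path


def run_harness(command, expected_path):
    """Run command, capture stdout, and compare it to a pinned expected file.

    Returns (ok, actual_output). Raises CalledProcessError if the command
    exits non-zero, so a crash is never mistaken for a diff.
    """
    completed = subprocess.run(
        command, capture_output=True, text=True, check=True
    )
    expected = Path(expected_path).read_text(encoding="utf-8")
    return completed.stdout == expected, completed.stdout
```

For the script above, the call would look like run_harness([sys.executable, "scary_script.py", "--input", "fixtures/sample.csv"], "fixtures/expected.txt"). Returning the actual output alongside the verdict makes it easy to print a diff when the pin breaks.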

xavxav's harness for PP-BESM is the BESM virtual machine he built. Every change to the compiler can be tested by running a tiny Soviet-era program inside the VM and watching the result. The VM is more important than any single piece of the compiler source.

3. Bisection beats reading top to bottom

The instinct on a new codebase is to read the entry point and follow the call graph. This is wrong almost every time. Most legacy code is glue. The interesting logic, the part that actually does the work, lives in 10 to 20 percent of the files. The other 80 to 90 percent shuffles data between the interesting parts.

The fastest way to find the interesting parts is bisection: use cheap signals to cut the search space down, again and again, instead of reading linearly.

# What touched the database in the last year?
git log --since="1 year ago" --name-only --pretty=format: \
  | grep -E "schema|migration|model" | sort -u

# Where do the longest files live? long usually means interesting
find . -name "*.py" -not -path "*/node_modules/*" \
  -exec wc -l {} \; | sort -rn | head -20

# What gets imported the most? heavily imported usually means load bearing
grep -rE "^import|^from" --include="*.py" . | awk '{print $2}' \
  | sort | uniq -c | sort -rn | head -20

Each of those commands narrows the search. The longest file is often the dumping ground. The most imported module is often the actual brain of the system. The schema, migration, and model files that changed in the last year show you where the data model is still moving.
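The grep-and-awk import count above works on raw text, so it can miss indented imports and double-count a module that one file imports twice. A slightly sturdier sketch uses Python's ast module and counts each module once per file. The function name count_imports is made up for this example, and skipping relative imports is a choice of this sketch, not a rule.

```python
import ast
from collections import Counter
from pathlib import Path


def count_imports(root):
    """Count how many files under root import each top-level module."""
    counts = Counter()
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8", errors="replace"))
        except SyntaxError:
            continue  # legacy files may not parse under the current Python
        modules = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                # level == 0 means an absolute import; relative ones are skipped
                modules.add(node.module.split(".")[0])
        counts.update(modules)
    return counts
```

count_imports(".").most_common(20) gives you the same top-20 list as the shell pipeline. Files that fail to parse are silently skipped here, which is itself a signal worth writing down.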

For PP-BESM the bisection target was PP-3, the last compiler phase. xavxav knew the early phases were better documented in the existing literature. The interesting unknown was the last phase. He focused there first.

4. Naming as you go

Every time you understand a function, rename it. Every time you understand a variable, rename it. Do this in a branch, and commit often.

The temptation is to read the whole codebase first and rename later. This is wrong. You will forget what you understood. You will lose hours of context. The rename is the note you are leaving for future you and the next person.

// before
function process(x, y) {
  const r = x.filter(z => z.s > y).map(z => z.id)
  return db.query(r)
}

// after, you understood this is fetching active user ids over a score threshold
function fetchActiveUserIdsAboveScore(users, threshold) {
  const qualifyingIds = users
    .filter(user => user.score > threshold)
    .map(user => user.id)
  return db.query(qualifyingIds)
}

A good rule: if you cannot rename a function meaningfully, you do not understand it yet. Keep reading. Once you can rename it, do it immediately, then commit with a message that captures what you learned.

xavxav's rename pass on PP-BESM was a translation pass, but the principle is the same. Russian identifiers became English identifiers. Cryptic three letter mnemonics became words. The code became readable because someone took the time to make it readable.

5. Types as living documentation

If the codebase is dynamically typed, add types. If the types are wrong, fix them. Even loose types beat no types, because types are the documentation that runs.

// before, no types
function calculate(data, config) {
  return data.items.reduce((acc, item) => {
    return acc + item.price * (config.taxRate + 1)
  }, 0)
}

// after, types you can refactor against
type LineItem = { price: number; quantity: number; }
type TaxConfig = { taxRate: number; }
type Order = { items: LineItem[]; }

function calculateTotalWithTax(order: Order, config: TaxConfig): number {
  return order.items.reduce((acc, item) => {
    return acc + item.price * (config.taxRate + 1)
  }, 0)
}

For Python, add type hints. For PHP, use PHPStan or Psalm. For old JavaScript, migrate file by file to TypeScript with allowJs: true. The types do not need to be perfect on day one. They need to exist.

The reason this matters more than people think: types compile. Comments do not. A wrong comment lives forever. A wrong type breaks the build. Types are the only documentation format that the compiler keeps honest.
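For a Python codebase, the same move might look like the sketch below. It mirrors the TypeScript example above, including the quantity field that the arithmetic does not use. The class and function names are assumptions for illustration. One caveat: the Python interpreter does not enforce type hints, so they only "break the build" if you run a checker such as mypy in CI.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LineItem:
    price: float
    quantity: int  # carried over from the TypeScript shape; unused below


@dataclass(frozen=True)
class TaxConfig:
    tax_rate: float


@dataclass(frozen=True)
class Order:
    items: List[LineItem]


def calculate_total_with_tax(order: Order, config: TaxConfig) -> float:
    """Same arithmetic as the TypeScript version: price * (tax_rate + 1)."""
    return sum(item.price * (config.tax_rate + 1) for item in order.items)
```

Frozen dataclasses are a design choice here, not a requirement. They stop a later refactor from mutating an order in place, which is one less thing to reason about in code you do not fully trust yet.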

6. Tests as ground truth, even for behavior you do not love

Before you refactor anything, write tests that lock in the observed behavior, including the parts that look like bugs.

This is the most counterintuitive rule on the list. Junior engineers want to fix the bugs immediately. The right move is to write a test that proves the bug exists first, then keep that test passing while you refactor, then change the test deliberately at the end if the bug should be fixed.

# pin the current behavior, even if it is wrong
def test_calculate_returns_negative_for_empty_orders():
    """
    BUG-LIKE: empty orders currently return -1 instead of 0.
    Some downstream system depends on this. Do not change without
    coordinating with the billing team.
    """
    result = calculate([], TaxConfig(rate=0.1))
    assert result == -1

The test does two things. It tells future you that the behavior is known and that something depends on it, so it is not safe to "fix" casually. It also acts as the alarm if a "small refactor" breaks the contract.

xavxav's tests for PP-BESM are not unit tests in the modern sense. They are small Soviet-era programs run through the VM with their expected output captured. Same idea, smaller scope. Pin the behavior, refactor against the pin, change the pin deliberately.
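When there are too many behaviors to pin by hand, you can record them instead. The sketch below assumes a hypothetical legacy_calculate with the empty-order quirk from the test above; record_pins and check_pins are invented names, and the JSON pin file is just one convenient format.

```python
import json
from pathlib import Path


def legacy_calculate(prices, tax_rate):
    """Hypothetical stand-in for the legacy function, quirk included."""
    if not prices:
        return -1  # the bug-like behavior we are pinning, not fixing
    return sum(price * (tax_rate + 1) for price in prices)


def record_pins(func, cases, pin_path):
    """Run func on every case and save the observed outputs as the pin file."""
    observed = [{"args": args, "result": func(*args)} for args in cases]
    Path(pin_path).write_text(json.dumps(observed, indent=2), encoding="utf-8")


def check_pins(func, pin_path):
    """Return the pinned cases whose current output differs from the pin."""
    pinned = json.loads(Path(pin_path).read_text(encoding="utf-8"))
    return [case for case in pinned if func(*case["args"]) != case["result"]]
```

Record once against the untouched code, commit the pin file, then run check_pins after every refactor. An empty list means the contract held. Changing a pin then becomes an explicit edit to a committed file, which is exactly the deliberate step you want.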

7. Comment the negotiations, never the obvious

Your future maintainer can read the code. They cannot read your decision tree. The comments that survive a decade are the ones that capture why a particular choice was made, especially when the choice looks weird.

Bad comment: // increment counter. The code already says that.

Good comment:

// We round down because the billing team expects integer cents only.
// Historical: float cents caused the May 2023 reconciliation incident.

The good comment is a note from one engineer to another about a constraint that is not visible in the code. The constraint is real. The constraint will outlive the engineer who introduced it. The comment is the only place it lives.

Run this drill on your legacy codebase: find every place where the code looks slightly odd. A magic number, a hardcoded check, a try/except that swallows a specific exception, a special case for one customer ID. Each one of those is a negotiation that someone made with reality. If the comment is missing, add it once you figure out the negotiation.

# bad
TIMEOUT = 47

# good
# Set to 47 seconds because their auth gateway has a 50 second hard limit
# and we observed 1-2 second jitter from our load balancer. See incident
# 2024-03-15. Do not raise without coordinating with the partner team.
TIMEOUT = 47
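Part of that drill can be automated. The sketch below lists numeric constants in a Python file that have no comment on the line above or on the same line. It is a crude heuristic under stated assumptions: the function name uncommented_constants is invented, 0 and 1 are skipped on the guess that they rarely need a story, and any "#" on the assignment line counts as a comment.

```python
import ast


def uncommented_constants(source):
    """Return (name, value, line_number) for numeric constants with no comment.

    A constant counts as explained if the line directly above it is a comment
    or the assignment line itself contains a "#".
    """
    lines = source.splitlines()
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant)):
            continue
        value = node.value.value
        # bool is a subclass of int, so rule it out first
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        if value in (0, 1):
            continue  # assumption: 0 and 1 rarely need a story
        above = lines[node.lineno - 2].strip() if node.lineno >= 2 else ""
        same_line = lines[node.lineno - 1]
        if above.startswith("#") or "#" in same_line:
            continue
        for target in node.targets:
            if isinstance(target, ast.Name):
                findings.append((target.id, value, node.lineno))
    return findings
```

Run it over a file with uncommented_constants(open(path).read()) and treat each finding as a question to ask, not a defect. The script can tell you where a comment is missing; only the archaeology can tell you what it should say.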

Stitching the playbook together

The seven stages are not parallel. They build on each other. The boundary work tells you where to put the harness. The harness lets you bisect. The bisection tells you what to name. The names tell you what to type. The types tell you what to test. The tests give you the safety to comment confidently.

The same loop runs at every scale. xavxav is running it on a 70 year old compiler with the source on paper. You can run it on a 12 year old Rails app with the source on GitHub. The shape is identical.

A practical first week, if you are inheriting a legacy codebase tomorrow:

Day 1: Boundaries. Draw the map. Do not read internals.
Day 2: Harness. Get any version running with one command.
Day 3: Bisection. Find the 10 percent that does the work.
Day 4: Naming + types. Make the 10 percent readable.
Day 5: Tests. Pin the observed behavior before refactoring.

Week 2 onward: refactor against the pins, comment the negotiations.

By the end of week one you will often have a clearer picture of the whole system than the engineer who wrote it, because the engineer who wrote it never had the map. They built the system one room at a time. You can read the architecture in a week because drawing the map is part of the work.

The honest take

Most engineers will tell you they hate legacy codebases. They say this because the only legacy codebases they have seen are the ones nobody bothered to read. A codebase that someone has actually understood, mapped, harnessed, and pinned behavior on is a perfectly pleasant place to work. The unpleasantness is not in the age of the code; it is in the absence of the archaeology.

The PP-BESM project will probably never have a million users. It will not show up in your dependency tree. It will not raise a Series A. The project still ranks among the most interesting software writing happening in 2026, because the goal is preservation rather than growth, and because the technique generalizes. The output is not a product. The output is a playbook.

That playbook works on the codebase that sits in your own repo right now, the one with a legacy/ directory nobody touches. Spend a week on it. The legacy directory will become an asset instead of a liability.

Question for the comments: what is the oldest piece of code you have ever read seriously, and which of the seven stages did you skip?

GDS K S · thegdsks.com · follow on X @thegdsks

Every codebase ends up as archaeology eventually. The question is whether anyone bothers to dig.