Track 01 · Technical craft

Member of Technical Staff

“The job is no longer writing every line. The job is being the person who knows what should exist, gets it built at machine speed, and can tell when it is wrong.”

6 modules · 18 lessons · 8 weeks part-time · Capstone: a shipped agentic product

Why this track exists

In 2026, a single engineer with a fleet of AI agents ships what a team of eight shipped in 2022. But the leverage only flows to people who can specify precisely, decompose problems, and verify results. The rest produce impressive-looking systems that fall over in production.

This track trains both layers: the AI-native workflow (context engineering, agent orchestration, evals) and the eternal fundamentals underneath it (reading code, debugging, systems thinking). The fundamentals are what let you supervise a machine that is faster than you and confidently overrule it when it is wrong.

It is built for engineers, scientists, and technical founders who want to work the way the best small teams already work: high trust in tools, zero trust in unverified output.

Curriculum

Six modules, from mindset to production.

Module 1

The AI-Native Mindset

Before tools and tactics: a correct mental model of what these systems are, what they are reliably good at, and how the division of labor between you and them actually works.

1.1 What a model actually is

A language model is not a database and not a colleague. It is an extremely capable pattern engine that produces the most plausible continuation of the context you give it. Plausible is not the same as true, and the gap between the two is where every production incident lives. Once you internalize this, the model's behavior stops being mysterious: great with patterns it has seen, confident even when guessing, and entirely dependent on what you put in front of it.

The practical consequence: you control quality through context and verification, not through hope. Everything else in this track follows from that one sentence.

Practice. Ask a model a question in your own specialty deep enough that you can grade the answer. Note exactly where it was right, where it was plausible-but-wrong, and what context would have prevented the error.

1.2 The new division of labor

The 2026 split is simple: machines draft, humans decide. AI handles the breadth — boilerplate, migrations, first drafts, test scaffolding, research sweeps. You own the depth: architecture, trade-offs, the definition of done, and anything where being wrong is expensive. The failure mode of juniors is doing work the machine should do. The failure mode of seniors is delegating decisions the machine should never own.

Audit your week through this lens and you will usually find a large share of your hours, as a rough estimate 40–60%, sitting on the wrong side of the line.

Practice. List everything you did at work last week. Mark each item: should this have been drafted by AI, decided by me, or both? Move one recurring task across the line this week.

1.3 Verification as a way of life

The defining professional habit of the AI era is cheap, systematic verification. Every artifact a machine hands you arrives with an invisible question mark, and your job is to design the fastest honest way to remove it: run the code, click the flow, check the citation, re-derive the number. Teams that ship fast with AI are not the ones who trust it most — they are the ones whose verification loop is so fast that trust is unnecessary.

Verification effort should scale with blast radius. A typo fix needs a glance; a payment path needs tests, review, and a staged rollout — no matter who or what wrote it.
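
One way to make this rule concrete is a small lookup from risk tier to required checks. The sketch below is illustrative only: the tier names, the check names, and the `required_checks` helper are assumptions made for this example, not a standard.

```python
from enum import IntEnum


class BlastRadius(IntEnum):
    """Hypothetical risk tiers; rename and reorder them for your own project."""
    TRIVIAL = 1    # e.g. a typo fix
    MODERATE = 2   # e.g. a change to an internal tool
    CRITICAL = 3   # e.g. a payment path


# Checks introduced at each tier (illustrative names).
# Higher tiers inherit everything required by lower tiers.
CHECKS_BY_TIER = {
    BlastRadius.TRIVIAL: ["glance at diff"],
    BlastRadius.MODERATE: ["run test suite", "smoke flow"],
    BlastRadius.CRITICAL: ["human review", "staged rollout"],
}


def required_checks(radius: BlastRadius) -> list[str]:
    """Return every check required at or below the given tier."""
    checks: list[str] = []
    for tier in BlastRadius:  # iterates in definition order
        if tier <= radius:
            checks.extend(CHECKS_BY_TIER[tier])
    return checks
```

The point of writing it down is that the tier depends on what the change touches, never on who or what wrote it.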

Practice. For your current project, write down the three cheapest checks that would catch 80% of AI mistakes (a test command, a smoke flow, a diff review ritual). Make them a habit before accepting any generated change.

Module 2

Context Engineering

Model quality is mostly context quality. This module makes you deliberate about what the machine knows at the moment it acts.

2.1 The context window is the workspace

A model can only reason over what is in front of it. Most "the AI is dumb" complaints are really "I gave it two vague sentences and none of the constraints I knew." Treat the context window like a new hire's first day: the relevant files, the conventions, the goal, the things that must not break. Curate hard — burying the signal under irrelevant dumps degrades output just as surely as starving it.

The skill is selection: enough context to determine the right answer, little enough that the right answer is easy to find.
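
Selection under a budget can be sketched in a few lines. This is a minimal illustration, assuming you (or a retriever) have already scored each snippet for relevance; the `Snippet` type, the `build_context` name, and the use of character count as a crude stand-in for tokens are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    label: str        # e.g. "goal", "conventions", "billing.py"
    text: str
    relevance: float  # 0..1, assigned by you or a retriever (assumption)


def build_context(snippets: list[Snippet], budget_chars: int) -> str:
    """Pack the most relevant snippets first; skip any that would exceed the budget."""
    chosen: list[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        if used + len(snippet.text) <= budget_chars:
            chosen.append(snippet)
            used += len(snippet.text)
    # Label each block so the model can tell the goal from the reference material.
    return "\n\n".join(f"## {s.label}\n{s.text}" for s in chosen)
```

With a tight budget, a low-relevance log dump is dropped while the goal and constraints survive, which is the behavior the paragraph above asks for.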

Practice. Take a task an AI recently did poorly for you. Rewrite the request with curated context — goal, constraints, examples, relevant code — and compare results side by side.

2.2 Prompts are specifications

The eternal skill hiding inside "prompting" is specification writing — the thing great engineers were always better at than everyone else. State the goal, the inputs and outputs, the edge cases, the non-goals, and what done looks like. Ambiguity in, ambiguity out: a model fills every gap in your spec with the most statistically common choice, which is rarely your choice.

Write specs in plain language, with examples. An example of the desired output is worth ten paragraphs of description, for machines exactly as for contractors.
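
A spec can also be kept as a small structure so that empty sections are visible before the request is sent. The following is a sketch under assumptions: the `Spec` class and its field names simply mirror the list above and are not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """Hypothetical spec container mirroring: goal, I/O, edge cases, non-goals, done."""
    goal: str
    inputs_outputs: str
    edge_cases: list[str] = field(default_factory=list)
    non_goals: list[str] = field(default_factory=list)
    done_when: list[str] = field(default_factory=list)
    example_output: str = ""

    def missing_sections(self) -> list[str]:
        """Names of sections left empty; each is a gap the model will fill for you."""
        return [name for name, value in vars(self).items() if not value]

    def render(self) -> str:
        """Plain-language spec text, ready to paste into a request."""
        def bullets(items: list[str]) -> str:
            return "\n".join(f"- {item}" for item in items) or "- (none stated)"

        return "\n\n".join([
            f"Goal: {self.goal}",
            f"Inputs and outputs: {self.inputs_outputs}",
            f"Edge cases:\n{bullets(self.edge_cases)}",
            f"Non-goals:\n{bullets(self.non_goals)}",
            f"Done when:\n{bullets(self.done_when)}",
            f"Example of desired output:\n{self.example_output or '(none given)'}",
        ])
```

Calling `missing_sections()` before handing the spec over is a cheap way to find the holes yourself rather than discovering them in the output.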

Practice. Write a one-page spec for a small feature — goal, constraints, edge cases, acceptance checks — and hand it to an AI agent untouched. Every clarifying question it needs to ask is a hole in your spec.

2.3 Grounding: retrieval, search, and sources

When the answer must be true rather than plausible, ground the model in real sources: your documents, your database, live search, your codebase. Retrieval is the general pattern — fetch the relevant facts, put them in context, and instruct the model to answer from them and say so when they don't contain the answer. Most hallucination problems are grounding problems wearing a scary name.

Design rule: any claim that will be acted on should be traceable to a source a human can check in one click.
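
The retrieval pattern (fetch, put in context, instruct) can be sketched without any model call. This is a toy under stated assumptions: word overlap stands in for a real retriever, and `retrieve`, `grounded_prompt`, and the document ids are hypothetical names chosen for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased alphanumeric words; a deliberately crude matcher."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Rank documents by word overlap with the question; drop those with none."""
    q = tokenize(question)
    scored = [(len(q & tokenize(body)), doc_id) for doc_id, body in docs.items()]
    top = sorted((s for s in scored if s[0] > 0), reverse=True)[:k]
    return [(doc_id, docs[doc_id]) for _, doc_id in top]


def grounded_prompt(question: str, docs: dict[str, str], k: int = 2) -> str:
    """Build a prompt that carries its sources and tells the model when to decline."""
    sources = retrieve(question, docs, k)
    listing = "\n".join(f"[{doc_id}] {body}" for doc_id, body in sources)
    return (
        "Answer using only the sources below and cite source ids in brackets.\n"
        "If the sources do not contain the answer, say so instead of guessing.\n\n"
        f"Sources:\n{listing or '(no sources found)'}\n\n"
        f"Question: {question}"
    )
```

Because each source keeps its id in the prompt, a cited answer can be traced back to a document a human can open and check.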

Practice. Build a tiny grounded assistant: ten of your own documents, retrieval into context, and answers with citations. Test it with five questions whose answers you know — including one the documents cannot answer.

Module 3

Building Agents

From chat to systems that act: tools, orchestration, and the engineering of things that can fail on their own.

3.1 Tools, not chat

An agent is a model in a loop with tools: it reads state, decides, acts, observes the result, and repeats. The quality of an agent is mostly the quality of its tools — small, well-named, well-described actions with crisp inputs and honest error messages. A vague tool produces a flailing agent the same way a vague API produces buggy clients.

Start embarrassingly small: one agent, three tools, one job. Reliability at small scope is the foundation everything larger is built on.
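
The loop itself is short enough to write out. In this minimal sketch the model is replaced by a scripted stand-in so the code runs offline; `run_agent`, `scripted_policy`, and the `word_count` tool are hypothetical names, and a real agent would call a model where `decide` is invoked.

```python
from typing import Callable

Tool = Callable[[str], str]                          # small action: text in, text out
Policy = Callable[[list[str]], tuple[str, str]]      # history -> (action, argument)


def run_agent(decide: Policy, tools: dict[str, Tool], goal: str,
              max_steps: int = 5) -> list[str]:
    """Loop: decide, act, observe, repeat, until 'done' or the step budget runs out."""
    history = [f"goal: {goal}"]
    for _ in range(max_steps):
        action, arg = decide(history)
        if action == "done":
            history.append(f"done: {arg}")
            break
        tool = tools.get(action)
        if tool is None:
            # Honest error message: say what went wrong and what is available.
            observation = f"error: unknown tool '{action}'; available: {sorted(tools)}"
        else:
            try:
                observation = tool(arg)
            except Exception as exc:
                observation = f"error: {exc}"
        history.append(f"{action}({arg!r}) -> {observation}")
    return history


def scripted_policy(history: list[str]) -> tuple[str, str]:
    """Stand-in for a model: call one tool, then report the last observation."""
    if len(history) == 1:
        return ("word_count", "a b c")
    return ("done", history[-1])


TOOLS: dict[str, Tool] = {"word_count": lambda text: str(len(text.split()))}
```

The returned history doubles as the failure log: every action, argument, and observation is recorded in order.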

Practice. Build a single-purpose agent with no more than three tools that completes one real task you do weekly. Run it ten times and log every failure and its cause.

3.2 Decomposition and orchestration

Big tasks defeat single agents the way big functions defeat single programmers. The remedy is the same: decomposition. Split work into steps with checkable outputs, fan independent steps out in parallel, and keep the control flow in deterministic code while the model handles judgment inside each step. The orchestrator decides what happens; agents decide how.

A good decomposition has a test for every seam — you can tell which step failed without reading every transcript.
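
A pipeline with a check at every seam might look like the sketch below. It is an illustration under assumptions: `Step`, `SeamFailure`, `run_pipeline`, and `fan_out` are hypothetical names, and the toy steps use plain string functions where a real system would call an agent.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]      # the "how": in practice, an agent or model call
    check: Callable[[Any], bool]   # the seam test: is this output usable downstream?


class SeamFailure(Exception):
    """Raised when a step's output fails its own check."""


def fan_out(worker: Callable[[Any], Any], items: list[Any]) -> list[Any]:
    """Run independent items in parallel; results keep the input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(worker, items))


def run_pipeline(steps: list[Step], payload: Any) -> Any:
    """Deterministic control flow: the orchestrator, not the model, picks the next step."""
    for step in steps:
        payload = step.run(payload)
        if not step.check(payload):
            raise SeamFailure(f"step '{step.name}' produced output that failed its check")
    return payload


# Toy pipeline: split a task list, then process each part independently.
STEPS = [
    Step("split", lambda text: text.split(";"), lambda parts: len(parts) > 0),
    Step("clean", lambda parts: fan_out(str.strip, parts), lambda parts: all(parts)),
]
```

When a check fails, the exception names the step, so you can tell which seam broke without reading every transcript.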

Practice. Take a task too big for one prompt (a research report, a multi-file refactor). Draw the pipeline: steps, what each consumes and produces, and how you verify each seam. Then build it.

3.3 Guardrails and failure modes

Agents fail in characteristic ways: they loop, they overreach, they declare victory early, they take a destructive shortcut to satisfy the letter of the goal. Engineering for this means least-privilege tools, budgets on time and actions, human approval on irreversible steps, and logs you can actually replay. Assume the agent will eventually do the worst thing its permissions allow — then shrink the permissions.

The goal is not an agent that never fails. It is a system where failure is bounded, visible, and cheap.
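
Several of these controls fit in one thin wrapper around the tool set. The sketch below is a minimal illustration, not a complete safety layer: `GuardedTools` and its parameters are hypothetical, and it covers an action budget but not a time budget.

```python
from typing import Callable


class GuardrailError(Exception):
    """Raised when a call is blocked by a guardrail."""


class GuardedTools:
    def __init__(self, tools: dict[str, Callable[[str], str]],
                 irreversible: set[str],
                 approve: Callable[[str, str], bool],
                 max_actions: int = 10) -> None:
        self.tools = tools                # least privilege: only these can be called
        self.irreversible = irreversible  # names that need a human yes first
        self.approve = approve            # callback (name, arg) -> bool
        self.max_actions = max_actions    # budget on attempted actions
        self.log: list[tuple[str, str, str]] = []  # replayable record of attempts

    def call(self, name: str, arg: str) -> str:
        if len(self.log) >= self.max_actions:
            raise GuardrailError("action budget exhausted")
        if name not in self.tools:
            self.log.append((name, arg, "denied: not permitted"))
            raise GuardrailError(f"tool '{name}' is not permitted")
        if name in self.irreversible and not self.approve(name, arg):
            self.log.append((name, arg, "denied: not approved"))
            raise GuardrailError(f"tool '{name}' requires approval")
        result = self.tools[name](arg)
        self.log.append((name, arg, result))
        return result
```

Denied attempts are logged and count against the budget, so a looping agent is both visible and bounded.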

Practice. Red-team your own agent: write down the three worst things it could do with its current permissions, then change the design so the worst one is impossible and the other two are recoverable.

Module 4

Evals — Measuring What "Good" Means

"It looks right" does not scale. Evals are how AI-native teams turn quality from a feeling into a number they can improve.

4.1 Why demos lie

Every AI feature demos well, because the demo is run by the person who built it, on inputs it handles. Production is run by strangers on inputs you never imagined. The gap between the two is invisible until you measure it — which is why teams without evals oscillate between overconfidence and panic, shipping on anecdotes and rolling back on anecdotes.

An eval is just a held-out set of real inputs with graded expected outputs, run on every change. It is unit testing for behavior that is probabilistic instead of deterministic.

Practice. Collect twenty real inputs for an AI feature you use or build — including the ugly ones. Define what a passing answer looks like for each. You have just written your first eval set.

4.2 Building an eval loop

A working eval loop has four parts: a dataset of real cases, a grader (exact match, assertions, or a model judging against a rubric), a score you track over time, and a habit — every prompt change, model swap, or pipeline edit runs the evals before it ships. With that loop, improving the system becomes engineering; without it, it's superstition.

Grade what matters, not what is easy: correctness first, then tone, format, and cost. And keep feeding production failures back into the dataset — that is where evals earn their keep.
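
The four parts fit in a short script. This is a minimal sketch, assuming an exact-match grader and a system that maps an input string to an output string; `run_evals`, `score`, `regressions`, and the two stub systems are hypothetical names for illustration.

```python
from typing import Callable

# Dataset: real cases with graded expected outputs (two toy cases here).
CASES = [
    {"id": "c1", "input": "2+2", "expected": "4"},
    {"id": "c2", "input": "capital of France", "expected": "Paris"},
]


def run_evals(system: Callable[[str], str], cases: list[dict]) -> dict[str, bool]:
    """Grader: exact match after trimming and lowercasing. Returns pass/fail per case."""
    return {
        case["id"]: system(case["input"]).strip().lower() == case["expected"].strip().lower()
        for case in cases
    }


def score(results: dict[str, bool]) -> float:
    """Score: the fraction of cases that pass. Track this number over time."""
    return sum(results.values()) / len(results) if results else 0.0


def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Cases that passed before a change and fail after it."""
    return sorted(cid for cid, passed in before.items() if passed and not after.get(cid, False))


# Stub systems standing in for two versions of a prompt or pipeline.
def system_v1(question: str) -> str:
    return {"2+2": "4"}.get(question, "unknown")


def system_v2(question: str) -> str:
    return {"capital of France": "paris"}.get(question, "unknown")
```

Comparing `score` alone would call v1 and v2 equal; `regressions` shows that v2 broke a case v1 handled, which is why both belong in the habit.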

Practice. Automate the twenty cases from lesson 4.1 into a script that outputs a score. Change your prompt and watch the number move. Now make a change that improves the score without breaking any previously passing case.

4.3 Reviewing machine-written code

Reviewing machine-written code is a different sport from reviewing human-written code. The machine does not get tired or skip the boring parts, but it confidently invents APIs, silently drops requirements, and writes plausible handling for error paths that can never occur while missing the one that will. Review for intent first (does this do what was asked?), then for the classic AI tells: unverified assumptions, dead code dressed as robustness, and tests that assert the bug.

Never approve a diff you wouldn't be able to explain line by line to a colleague. That standard is what keeps velocity from turning into debt.
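
One item on such a checklist, confirming that a referenced API exists, can be partly automated. The sketch below assumes the names appear as dotted paths and that the packages are installed locally; `api_exists` is a hypothetical helper. It confirms only that a name resolves, not that its signature or behavior matches what the diff assumes.

```python
import importlib


def api_exists(dotted_path: str) -> bool:
    """True if 'package.module.attr' resolves to a real object in this environment.

    Caveat: importing a module runs its top-level code, so use this only on
    packages you already trust enough to import.
    """
    parts = dotted_path.split(".")
    # Try the longest importable module prefix first, then walk the remaining attributes.
    for split in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:split]))
        except ImportError:
            continue
        for attr in parts[split:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True
    return False
```

A call such as `api_exists("json.to_string")` returns `False`, which is exactly the kind of confidently invented name this review step is meant to catch.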

Practice. Have an AI implement a small feature, then review the diff with a written checklist: requirements covered, APIs verified to exist, errors handled honestly, tests that would fail if the feature broke.

Module 5

Shipping at Machine Speed

Leverage means nothing if it dies in a prototype folder. This module is about the path from idea to running in front of users — repeatedly, safely, fast.

5.1 The one-day prototype

In 2026 the correct response to "would this work?" is almost never a meeting — it is a prototype by tomorrow morning. AI collapsed the cost of finding out. The skill is scoping: cut the idea down to the single riskiest assumption, build only what tests it, and fake everything else. A prototype is a question, not a product; it succeeds by producing an answer, even when the answer is no.

People who prototype weekly develop an unfair advantage: their opinions are backed by evidence while everyone else's are backed by slides.

Practice. Pick an idea you have debated for over a month. Define its riskiest assumption, then build the smallest thing that tests it — in one day, using every AI tool you have.

5.2 From demo to production

The distance from demo to production has not collapsed — it has just become legible. It is a checklist, not a mystery: handle the malformed input, add auth, bound the costs, log enough to debug a 3 a.m. failure, decide what happens when the model is down or slow or wrong. AI can write most of this too, but only if you ask — models optimize for the happy path unless the spec says otherwise.

Ship behind a flag, to a small slice, with a way back. Boring deployment hygiene is what makes aggressive speed safe.
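
A flag with a small slice and a way back can be sketched with a stable hash. This is an illustration under assumptions: `in_rollout`, `answer`, and the feature name `ai_answers` are hypothetical, and a real deployment would read the percentage from configuration instead of a function argument.

```python
import hashlib
from typing import Callable


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically place a user in or out of the first `percent` of 100 buckets."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


def answer(user_id: str, question: str,
           model_call: Callable[[str], str],
           fallback: Callable[[str], str],
           percent: int) -> str:
    """Serve the AI path to a slice of users; everyone else gets the existing path."""
    if not in_rollout(user_id, "ai_answers", percent):
        return fallback(question)
    try:
        return model_call(question)
    except Exception:
        # Model down or erroring: degrade to the known-good path instead of failing.
        return fallback(question)
```

Setting `percent` to 0 is the way back: every user returns to the fallback path without a deploy.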

Practice. Take your prototype and write its production gap list — everything between here and real users. Estimate each item, then knock out the top three with AI assistance.

5.3 Operating AI systems

An AI system in production degrades differently from normal software: the code is unchanged but the world shifts — new input patterns, model updates, a data source that quietly changed format. Operating one means watching quality, not just uptime: eval scores on live traffic samples, cost per request, latency, and the rate of "the AI said something weird" reports. Every incident becomes a new eval case; that is the flywheel.

Budget for it: a system you cannot afford to watch is a system you cannot afford to run.
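
Watching quality rather than uptime comes down to comparing a metrics snapshot against thresholds. The sketch below is illustrative: the `Threshold` type, the metric names, and every numeric limit are placeholders to be replaced with values from your own system.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool


# Placeholder limits for illustration only.
THRESHOLDS = [
    Threshold("eval_pass_rate", 0.90, higher_is_better=True),
    Threshold("cost_per_request_usd", 0.05, higher_is_better=False),
    Threshold("p95_latency_s", 2.0, higher_is_better=False),
]


def breaches(snapshot: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alert line per breached or missing metric."""
    alerts: list[str] = []
    for t in thresholds:
        value = snapshot.get(t.metric)
        if value is None:
            # A metric that stopped reporting is itself an incident.
            alerts.append(f"{t.metric}: no data")
        elif (value < t.limit) if t.higher_is_better else (value > t.limit):
            alerts.append(f"{t.metric}: {value} breaches limit {t.limit}")
    return alerts
```

Treating missing data as an alert matters here, because a quality metric that silently stops reporting looks identical to a healthy one on an uptime dashboard.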

Practice. Define a one-page runbook for an AI feature: the three quality metrics you watch, alert thresholds, the rollback procedure, and where new failure cases get filed.

Module 6

Eternal Fundamentals

The skills that made engineers great in 1976 and will still make them great in 2076. AI raises their value, because they are exactly what supervising fast machines requires.

6.1 Reading code is the superpower now

When machines write most of the code, the binding human skill flips from writing to reading. Reading fast and deep — tracing data flow, spotting the assumption a function silently makes, sensing where the design fights itself — is what lets you review ten times more code than you write. The engineers who thrive are the ones for whom a 500-line diff is an afternoon, not a week.

Reading is trainable the same way writing is: deliberately, on excellent material, with questions in hand.

Practice. Spend 45 minutes reading a well-regarded open-source codebase in your stack. Write down its three best design decisions and one you would challenge — and why.

6.2 Debugging as epistemology

Debugging is the purest form of the verification mindset: form a hypothesis, design the cheapest experiment that could kill it, run it, update. It is the scientific method at keyboard speed, and it transfers to everything — broken pipelines, weird agent behavior, business numbers that don't add up. AI is a phenomenal debugging partner, but only for the person who can state symptoms precisely and judge proposed causes against evidence.

The anti-pattern of the era is "ask the AI to fix it" in a loop until the error message changes. That is not debugging; it is gambling with extra steps.

Practice. The next time you hit a bug, write the hypothesis before touching the code: "I believe X because Y; if true, Z will show it." Run exactly that check. Count how many hypotheses the bug takes.

6.3 Systems thinking and the courage to be simple

When generating code is free, complexity becomes the silent killer — every component you add is cheap to create and expensive forever. Systems thinking is seeing the whole: where state lives, what talks to what, which dependency will hurt you in a year. The discipline that follows from it is subtraction. The best AI-native engineers are conspicuous for how little they build: fewer services, fewer layers, fewer clever abstractions.

Ask of every addition: what would have to be true for us to delete this in six months? If you cannot name a condition, you are building a permanent liability.

Practice. Diagram your current system from memory — boxes, arrows, state. Find one component that exists for historical reasons rather than current ones, and write the one-paragraph case for deleting it.

Capstone

Ship an agentic product in fourteen days.

Alone or in a pair, take a real problem — yours, your employer's, or a pilot customer's — from spec to a deployed, operating system with users. The constraint is the lesson: machine-speed execution with human-grade judgment.

Days 1–2: written spec with risks, non-goals, and acceptance checks.
Days 3–5: prototype that tests the riskiest assumption with real users.
Days 6–10: production build — guardrails, evals, observability.
Days 11–14: ship, operate, and present the system plus its eval scores.