From observation to instrument
The companion research paper, When does an AI build want a person?, presented the four categories as observation. This paper is a sequel: how to use the categories. The argument we make is short and operational. If you can detect, in your own AI-build workflow, when a build is approaching one of the four moments, you can route human judgment into it deliberately rather than reactively. You can write a policy around it. You can measure it. You can audit it.
The press, in our product, is one implementation of that routing. It is not the only implementation. A code review on a merged PR, a deploy gate, a Slack-triggered async question to a tech lead: all of these are implementations of the same underlying primitive, a moment in the build where a human decides whether the next step is allowed.
Category one: the confidence wall
The build technically works. The builder has clicked through it ten times. They cannot bring themselves to ship it because they do not know if it will hold. This is the most common category. Thirty-eight percent of the presses we analyzed were of this shape.
What makes it detectable
Three signals, in our data. First, the build has been completed to a runnable state for at least twenty minutes without shipping. Second, the builder has been re-running the same flow in dev or preview, in a way that suggests they’re looking for a problem rather than building. Third, recent conversation context with the AI tool has shifted from construction questions (how do I do X) to validation questions (does this look right).
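The three signals above can be sketched as a simple heuristic. The `BuildSession` shape, its field names, and the numeric thresholds are assumptions for illustration, not an API our product exposes.

```python
from dataclasses import dataclass, field

@dataclass
class BuildSession:
    # Hypothetical session shape; every field name here is illustrative.
    minutes_since_runnable: float      # time since the build last reached a runnable state
    same_flow_reruns: int              # consecutive re-runs of one flow in dev or preview
    recent_questions: list = field(default_factory=list)  # tagged "construction" or "validation"

def looks_like_confidence_wall(s: BuildSession) -> bool:
    """Category one: runnable but unshipped, re-testing, asking validation questions."""
    stalled = s.minutes_since_runnable >= 20           # signal 1: runnable for 20+ minutes
    retesting = s.same_flow_reruns >= 3                # signal 2: threshold is a guess
    recent = s.recent_questions[-5:]
    validating = recent.count("validation") > recent.count("construction")  # signal 3
    return stalled and retesting and validating
```

In practice a detector like this would only nudge the builder toward the press surface, not fire it automatically.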
What the human work looks like
Read once. Answer one question: can this ship to a customer? The session is short by nature; our average is 14 minutes; the resolution rate is 96% because the question is tractable. The builder is not, in this category, asking for a fix. They are asking for permission, or for an articulated reason permission should not be granted.
What gets it wrong
Treating it as a code review. A code review tries to make the code better. The builder doesn’t want better code. They want a one-sentence answer. Engineers who try to do a full review here over-deliver, frustrate the builder, and slow the ship. The right move is the answer, then the smallest set of notes the builder can act on.
Category two: the integration cliff
The AI produced a complete, runnable artifact. Connecting it to anything outside the sandbox (a database, an auth provider, a payment gateway, an email service) fails repeatedly in non-obvious ways. Twenty-nine percent of presses in our sample.
What makes it detectable
Repeated environment-variable churn. Failed deploys with shape differences in the error each time. Configuration files with contradictory values across copies. The build works locally; something fundamental is wrong about how it tries to reach the outside world. The AI has, in our experience, generated code that assumes one shape of integration and the customer’s environment has another.
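The first two signals lend themselves to a sketch: repeated env-var edits plus a run of failed deploys whose errors differ each time. The function shape and thresholds are assumptions, not part of any real tooling.

```python
def looks_like_integration_cliff(env_var_edits: int, deploy_errors: list[str]) -> bool:
    """Category two: env-var churn plus deploys that fail differently every attempt."""
    churning = env_var_edits >= 5                      # threshold is an assumption
    # Errors that differ in shape each time suggest a wrong integration assumption,
    # not a single fixable bug.
    failing_differently = (
        len(deploy_errors) >= 2 and len(set(deploy_errors)) == len(deploy_errors)
    )
    return churning and failing_differently
```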
What the human work looks like
Read the integration carefully. Identify which assumption is wrong. Rewrite the smallest section of code that fixes it. Document the choice in a comment, in plain language, so the next AI session does not undo the fix.
What gets it wrong
Letting the AI tool retry. A common pattern: the AI sees the failed deploy, generates a new attempt, the attempt has the same wrong assumption, the deploy fails again. Without a human in the loop, the loop continues. The right move is to break the loop with a senior engineer’s read.
Category three: the deploy moment
The build runs locally. The builder has a domain. They have not deployed software before. Seventeen percent of presses, and the longest in average duration: 1h 12min.
What makes it detectable
The build is in a runnable state. The builder is on the deploy-platform’s onboarding flow for the first time. DNS questions enter the conversation. Production-readiness questions enter the conversation. The build has not yet touched a customer.
What the human work looks like
Walk through the deploy with the builder. Set up the small number of things AI tools systematically miss: TLS certificates, environment-variable separation between preview and production, a deployable rollback procedure, a basic uptime check. The work is teachable; the human delivers the knowledge once and the builder shouldn’t need it again for the next deploy.
Category four: the ownership question
The build is shipped. Something has changed in the world. A payment failed; a user reported a bug; a third-party API silently changed shape. Eight percent of presses, but the highest sustained-relationship rate: 71% of these presses continue into an ongoing relationship.
What makes it detectable
The build is in production. The press happens not at build time but during the day, often during a customer’s business hours. The trigger is external, a customer email, an alert from a monitoring tool, a teammate flagging a number that doesn’t look right.
What the human work looks like
Diagnose. Decide. Sometimes the right answer is a code change; sometimes it’s a configuration change; sometimes it’s telling the customer about a third-party-vendor incident that isn’t about the build at all. The work is closer to on-call than to building. It rewards engineers with operational experience.
Instrumenting your own workflow
Here is what we recommend for a platform team that wants to use this framework without adopting Relay.
Build a press surface
One button, one URL, one Slack command, the choice doesn’t matter. What matters is that there is a single well-known place where a non-engineer can summon a software engineer mid-build, and that the surface is fast enough to be worth pressing.
Detect the four moments
Not algorithmically. Behaviorally. Tell every team that uses AI builders that there are four moments when the press is expected: before customer ship, before integration, before deploy, after a production incident. Make the press the default behavior at those moments rather than the exception.
Measure
Three numbers tell you whether the framework is working. Press volume per builder per week (the right level is non-zero for every active builder). Resolution rate (above 90% means the press is finding the right person). Builder retention (builders who get value from the press come back; builders who don't get value will not).
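A minimal sketch of computing those three numbers from a list of press records; the record shape here is an assumption, not a schema we publish.

```python
def press_metrics(presses: list[dict], active_builders: set[str], weeks: int) -> dict:
    """Compute the three health numbers.
    Each record is assumed to look like {"builder": str, "resolved": bool}."""
    per_builder = {b: 0 for b in active_builders}
    for p in presses:
        per_builder[p["builder"]] = per_builder.get(p["builder"], 0) + 1
    return {
        # press volume per builder per week
        "volume_per_week": {b: n / weeks for b, n in per_builder.items()},
        # share of presses that reached a resolution
        "resolution_rate": sum(p["resolved"] for p in presses) / max(len(presses), 1),
        # active builders who never pressed: the number that should be empty
        "silent_builders": [b for b, n in per_builder.items() if n == 0],
    }
```

The `silent_builders` list is the actionable output: it names the builders for whom the press is not yet the default behavior.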
| Category | Trigger signal | Routing | Time-box |
|---|---|---|---|
| Confidence wall | 20+ min completed but unshipped | Senior, any stack | ≤ 20 min |
| Integration cliff | Repeated env-var churn | Senior, stack-matched | ≤ 60 min |
| Deploy moment | First-time deploy on platform | Senior with deploys | ≤ 90 min |
| Ownership question | Production-time external trigger | Senior with on-call exp | Open-ended |
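The routing table above can be expressed directly in code, which makes the "pool the bench but route by category" rule enforceable rather than aspirational. The category keys and profile strings are our own labels, not a published schema.

```python
# Routing table from the taxonomy: category -> (engineer profile, time-box in minutes).
# None means open-ended, as with the ownership question.
ROUTING = {
    "confidence_wall":    ("senior, any stack",          20),
    "integration_cliff":  ("senior, stack-matched",      60),
    "deploy_moment":      ("senior with deploy exp",     90),
    "ownership_question": ("senior with on-call exp",  None),
}

def route(category: str) -> tuple:
    """Return the engineer profile and time-box for a press in the given category."""
    profile, timebox_minutes = ROUTING[category]
    return profile, timebox_minutes
```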
Turning the taxonomy into policy
Two policies are worth writing first.
The press-required policy
Specifies which builds, in which categories, may not ship without a press. Most companies start narrow: regulated data paths, identity flows, payments. Expand from there as the team builds confidence in the tooling.
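A press-required policy can be a few lines of code in the ship path. The tag names below (`payments`, `identity`, `regulated_data`) follow the starting set named above; the function shape is an illustrative assumption.

```python
# Surfaces that may not ship without a press; start narrow, expand with confidence.
PRESS_REQUIRED_TAGS = {"payments", "identity", "regulated_data"}

def may_ship(build_tags: set[str], press_completed: bool) -> bool:
    """A build touching a press-required surface may not ship without a completed press."""
    touches_required_surface = bool(build_tags & PRESS_REQUIRED_TAGS)
    return press_completed or not touches_required_surface
```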
The press-encouraged policy
Specifies the surfaces where a press is recommended but not required. The point of this policy is to remove stigma. The builder pressing for help should be celebrated, not penalized; the policy makes the encouragement explicit.
Anti-patterns we see most often
Routing all four categories to the same engineer pool. The categories want different skills. A great deploy-moment engineer is not necessarily a great ownership-question engineer. Pool the bench but route by category.
Treating the press as a help desk. A help desk has tickets, queues, and SLAs measured in business days. The press is real-time, measured in minutes. Confusing the two collapses quality and ruins the builder's incentive to press.
Measuring press time-to-close as the primary KPI. A fast close on a build that should not have shipped is worse than a slow close that prevented the ship. Measure builder-shipped-without-incident as the primary; press time as secondary.
Skipping the ownership-question category. Most teams we've worked with build for the first three categories and forget the fourth. The fourth is where the relationship either compounds or breaks. Invest there even though the volume is the smallest.