Human-in-the-loop and accountability: who is on the hook when the agent acts

Sooner or later an autonomous agent takes an action it shouldn't have: it refunds the wrong order, sends the wrong email, deletes the wrong record. That is not a reason to keep agents out of production; it is the reason production has a discipline for it. The discipline has two halves. Human-in-the-loop is the gate before the act: the points where a person has to say yes before the agent does something the system can't take back. Accountability is the answer after the act: who owns the outcome, and what record proves what happened. This page covers both: the gate that catches the wrong action before it fires, and the trace and the named owner that make someone responsible for the wrong actions that do fire. It is principle 9 made operational: a human is accountable for what an AI system does, and "the model decided" is not an answer anyone gets to give.

The mistake teams make is treating the human gate as a setting — on for caution, off for speed. It is neither. It is a function of consequence, and the rule that sets it is simple: the cost of a wrong autonomous action sets the bar for requiring approval. A read-only lookup needs no gate; it can't hurt anyone if it's wrong. An action that moves money, sends a message a customer will read, or deletes a record needs one, because a wrong call there is a wrong call you can't un-make. The skill is not "more gates" or "fewer gates." It is putting the gate exactly where the blast radius justifies the friction, and nowhere it doesn't.

When a human must be in the loop

A human gate is not a blanket policy applied to every action — it is targeted at the actions where autonomy is too expensive to risk. Here are the five situations that demand one, what makes each one dangerous, and the gate that answers it. The first column is the trigger; if an action hits any row, it should not fire unattended.

Situation	Why it needs a human	The gate
Irreversible action	Deleting a record, sending an email, issuing a payment or a refund — once it fires it can't be taken back, and a wrong call is a permanent wrong call (principle 1: the demo only ever read; production writes).	A confirmation step before the action commits — the agent proposes, a person approves, and only then does it execute.
Low model confidence	The model is uncertain, the input was ambiguous, or the request sits at the edge of what the agent was built for — exactly where a confident wrong answer is most likely.	Route the low-confidence case to a person instead of guessing; the agent escalates rather than acts.
High blast radius	The action touches many records, a high-value account, or a downstream system — the damage of getting it wrong scales with reach, not with how clever the action looks (principle 5).	Approval sized to the radius: a bulk or high-value action gates even when a single low-value one wouldn't.
Sensitive or compliance-bound decision	The decision involves regulated data, a legal or financial consequence, or a policy a regulator can ask you to justify — a place where "the agent decided" is not a defensible record.	A human makes or signs off on the decision, and the sign-off is logged as part of the compliance trail.
Action outside the verified scope	The agent is reaching for something its tools, Topics, or instructions did not clearly authorize — the boundary is blurry and the planner is guessing.	Refuse-and-escalate: the agent declines the out-of-scope act and hands it to a person rather than improvising past its bounds.

These are not five separate rules; they are five faces of one rule. Each row is a place where the cost of being wrong autonomously is higher than the cost of asking. Anthropic's own guidance for an agent that can act lands in the same place — among the defenses against a prompt injection slipping through, it names the procedural one directly: require confirmation before sensitive actions, so that even an instruction that beats every screen still can't move money or delete a record without a person saying yes. The gate is the last line that does not depend on the model behaving.
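The five triggers can be read as one predicate over a proposed action. What follows is a minimal sketch, not a platform feature: the `ProposedAction` record, its field names, and the three thresholds are all illustrative assumptions that a real system would replace with its own policy.

```python
from dataclasses import dataclass

# All names and thresholds below are illustrative assumptions, not platform settings.
CONFIDENCE_FLOOR = 0.8   # below this, the agent escalates instead of acting
MAX_RECORDS = 10         # actions touching more records than this gate
MAX_VALUE = 500.0        # actions worth more than this gate


@dataclass
class ProposedAction:
    name: str
    reversible: bool
    confidence: float        # the model's confidence estimate, 0.0 to 1.0
    records_touched: int
    value: float             # monetary value at stake, 0 if none
    compliance_bound: bool   # regulated data or a legal/financial consequence
    in_verified_scope: bool  # clearly authorized by the agent's tools and instructions


def gate_triggers(action: ProposedAction) -> list[str]:
    """Return the triggers this action hits; an empty list means it may run unattended."""
    triggers = []
    if not action.reversible:
        triggers.append("irreversible")
    if action.confidence < CONFIDENCE_FLOOR:
        triggers.append("low_confidence")
    if action.records_touched > MAX_RECORDS or action.value > MAX_VALUE:
        triggers.append("high_blast_radius")
    if action.compliance_bound:
        triggers.append("compliance_bound")
    if not action.in_verified_scope:
        triggers.append("out_of_scope")
    return triggers


lookup = ProposedAction("order_lookup", True, 0.95, 1, 0.0, False, True)
refund = ProposedAction("issue_refund", False, 0.95, 1, 40.0, False, True)

print(gate_triggers(lookup))  # []
print(gate_triggers(refund))  # ['irreversible']
```

Returning the list of triggers rather than a bare yes or no keeps the reason for the gate available to whoever reviews the action.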

Escalation: handing the human the whole thread

A gate is only useful if the person on the other side of it can actually make the call — and that depends entirely on what they're handed. Escalation done wrong drops a human into a decision with no context: an approval prompt that says "the agent wants to issue a refund — yes or no?" forces the person to either rubber-stamp it or stop and reconstruct the whole conversation by hand. Escalation done right hands them the case already assembled.

On Agentforce this is built into the platform. The Instructions you write for an agent specify not just what it does but when to act and when to hand off to a human (agentforce-agents), and when the agent escalates it hands the person the full conversation — the thread, the context, the action it was about to take — so the human decides on the same facts the agent had, not a one-line summary of them. Off-platform you build the same move yourself: the escalation has to carry the conversation, the inputs, and the proposed action into whatever queue or interface the human works from, or you've built a gate that asks people to approve things they can't see.

And the decision the human makes is logged. Who approved, who rejected, what they were shown, and when — that record is not paperwork, it is the other half of the gate. A human-in-the-loop step that doesn't log the human's decision gives you the friction of approval with none of the accountability, because three months later "a person approved this" is a claim you can't back up. The gate and the trace are the same discipline seen from before and after the act.
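For an off-platform build, the escalation payload and the logged decision might look like the sketch below. Everything here is hypothetical: the `EscalationCase` fields, the `record_decision` helper, and the choice to store a SHA-256 fingerprint of the payload as evidence of what the reviewer was shown are assumptions, and an in-memory list stands in for a durable, append-only log.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class EscalationCase:
    """Everything the reviewer needs to decide on the same facts the agent had."""
    case_id: str
    conversation: list[dict]  # the full thread, one dict per turn
    inputs: dict              # what the agent retrieved before proposing the action
    proposed_action: dict     # the tool name and arguments it was about to use
    reason: str               # which trigger sent this to a human


def fingerprint(case: EscalationCase) -> str:
    """Stable hash of the payload, so the log can show what the reviewer saw."""
    payload = json.dumps(asdict(case), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def record_decision(log: list[dict], case: EscalationCase,
                    reviewer: str, approved: bool) -> dict:
    """Append who decided, what they decided, what they were shown, and when."""
    entry = {
        "case_id": case.case_id,
        "reviewer": reviewer,
        "decision": "approved" if approved else "rejected",
        "shown_sha256": fingerprint(case),
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
    log.append(entry)
    return entry


case = EscalationCase(
    case_id="case-001",
    conversation=[{"role": "customer", "text": "I was charged twice for order A100."}],
    inputs={"order_id": "A100", "charges": 2},
    proposed_action={"tool": "issue_refund", "args": {"order_id": "A100", "amount": 40.0}},
    reason="irreversible",
)
decision_log: list[dict] = []
entry = record_decision(decision_log, case, reviewer="j.doe", approved=True)
print(entry["case_id"], entry["decision"])  # case-001 approved
```

The fingerprint is one way to make "what they were shown" checkable later: if the stored case is altered after the decision, its hash no longer matches the log entry.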

Verification before a sensitive action

Not every gate has to be a human. For a whole class of sensitive actions, what stands between the agent and a costly mistake is a verification step — a check the agent has to pass before the consequential action is allowed to fire. The canonical case is the refund: before an agent issues one, a verification step confirms the customer's identity and that the order is theirs and eligible. The refund Action does not run on the model's say-so; it runs only after the verification gate clears.

This is the tool discipline from tools and actions, pointed at the act instead of the argument. A verification step can be a deterministic check the agent calls, or — in a multi-step build — a dedicated verification subagent whose only job is to confirm the preconditions before the sensitive Action is reachable, the same separation-of-duties pattern an agent uses to keep the thing that acts distinct from the thing that checks. The point is the same either way: a sensitive action is gated on a verification that has to pass, so that the agent cannot skip straight to the consequence. Where a human gate puts a person in the loop, a verification gate puts a check in the loop — and for the actions where the precondition is mechanical (is this the right customer, is this order eligible), the check is faster and just as binding.

The verification gate and the human gate are not alternatives; they stack. A refund might require both — a verification that the customer and order are valid, and, above a value threshold, a human sign-off on top. The blast-radius rule decides how many gates an action gets: the more a wrong call costs, the more has to clear before it fires.
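The stacking can be sketched as two checks run in order: mechanical verification first, then a value threshold that adds the human gate. This is an illustration under stated assumptions; the `RefundRequest` fields, the `next_step` function, and the 100.0 threshold are hypothetical, not taken from any platform.

```python
from dataclasses import dataclass

# Illustrative threshold: refunds above this amount also need a human sign-off.
HUMAN_APPROVAL_THRESHOLD = 100.0


@dataclass
class RefundRequest:
    customer_id: str
    order_customer_id: str  # the customer the order actually belongs to
    identity_verified: bool
    order_eligible: bool
    amount: float


def verify(request: RefundRequest) -> list[str]:
    """Deterministic precondition checks; returns the failures, empty if all pass."""
    failures = []
    if not request.identity_verified:
        failures.append("identity not verified")
    if request.customer_id != request.order_customer_id:
        failures.append("order belongs to another customer")
    if not request.order_eligible:
        failures.append("order not eligible for refund")
    return failures


def next_step(request: RefundRequest) -> str:
    """Decide what happens to the refund: refuse, escalate, or execute."""
    failures = verify(request)
    if failures:
        return "refuse: " + "; ".join(failures)
    if request.amount > HUMAN_APPROVAL_THRESHOLD:
        return "escalate: human sign-off required"
    return "execute"


small = RefundRequest("c1", "c1", True, True, 25.0)
large = RefundRequest("c1", "c1", True, True, 900.0)
wrong_owner = RefundRequest("c1", "c2", True, True, 25.0)

print(next_step(small))        # execute
print(next_step(large))        # escalate: human sign-off required
print(next_step(wrong_owner))  # refuse: order belongs to another customer
```

Verification runs before the threshold check on purpose: a reviewer should only ever be asked to sign off on a refund whose preconditions have already cleared.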

Accountability: a person owns the outcome, not the model

Here is the line the whole subcategory has been walking toward. When an AI system does something wrong, a person is accountable — not the model. "The agent decided" is not an answer; it names no one and fixes nothing. Accountability means there is a human or a team who owns the outcome of what the system does in production — who answers for it to the customer, the auditor, the business — and that ownership does not transfer to the software because the software is the one that executed.

Accountability is not a feeling; it has a mechanism, and the mechanism is the trace. You cannot own an outcome you cannot reconstruct, which is why principle 11 — trace everything — and principle 9 — a human is accountable — are the same requirement stated twice. The trace is the record: what the agent was asked, what it retrieved, which tools it called with what arguments, what it decided, and what gate it passed or what human approved it. When the wrong action fires, the trace is how the accountable person reconstructs how it happened, fixes the cause, and proves to whoever is asking what the system actually did. Without it, accountability is a name on an org chart with no evidence behind it.
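One minimal shape for such a trace is a stream of JSON lines, one per step, keyed by a run identifier so a single incident can be pulled back out. The sketch below is an assumption about format, not a prescribed schema: the `trace_event` and `reconstruct` helpers and the event kinds are hypothetical, and a Python list stands in for a durable log store.

```python
import json
from datetime import datetime, timezone


def trace_event(run_id: str, kind: str, detail: dict) -> str:
    """Serialize one step of a run as a JSON line."""
    return json.dumps({
        "run_id": run_id,
        "at": datetime.now(timezone.utc).isoformat(),
        "kind": kind,  # e.g. request, retrieval, tool_call, decision, gate
        "detail": detail,
    })


def reconstruct(lines: list[str], run_id: str) -> list[dict]:
    """Return the events of one run, in the order they were written."""
    events = (json.loads(line) for line in lines)
    return [event for event in events if event["run_id"] == run_id]


# Two interleaved runs, as they would appear in a shared log.
trace = [
    trace_event("run-1", "request", {"text": "Refund my order"}),
    trace_event("run-1", "retrieval", {"order_id": "A100", "status": "delivered"}),
    trace_event("run-2", "request", {"text": "Where is my parcel?"}),
    trace_event("run-1", "tool_call", {"tool": "issue_refund", "args": {"order_id": "A100"}}),
    trace_event("run-1", "gate", {"type": "human", "approved_by": "j.doe"}),
]

for event in reconstruct(trace, "run-1"):
    print(event["kind"], event["detail"])
```

Recording the gate as an event in the same stream as the tool call is what lets the accountable owner see, in one read, both what the agent did and who allowed it.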

This is where this page meets the evaluation subcategory. The trace that accountability depends on is the same trace that tracing and monitoring builds for online evaluation: one record with two readers, the operator watching for a silent degrade and the accountable owner reconstructing a specific incident. On Agentforce, the audit trail the Einstein Trust Layer keeps of every interaction is the precondition for owning what the agent did (agentforce-testing-and-observability); Agentforce Observability exports the session trace so the record exists where the agent ran. Off-platform, you build the trace yourself, and if you don't, you have an agent acting in production that no one can account for, which is the same as having no accountability at all.

The spine: built-in on Agentforce, built by you off it

Like every dimension in this subcategory, human-in-the-loop and accountability compose by where the system runs (principle 7) — the discipline is identical, and the platform decides whether you inherit the machinery or assemble it.

Agentforce builds it into the platform. Escalation to a human is a first-class move: the Instructions specify when to hand off, and the platform hands the person the full conversation. The security model and the Einstein Trust Layer's audit trail mean the interaction is logged by construction, so the record accountability needs exists without you wiring it. Verification before a sensitive Action is the Action-and-permission discipline the platform already runs. What you own is the policy — which actions gate, at what threshold, who the accountable owner is — because the Trust Layer governs the language side, not what your Actions do, and the blast radius of an Action that writes or deletes is still yours to bound (guardrails-and-safety).
Off-platform you build the approval step and the log. On the Claude API and your own infrastructure, the human gate is a step you implement: a pause that routes the proposed action into a human queue with the full context, waits for the decision, and records it. The verification check is a tool the agent calls before the sensitive one is reachable. And the trace is yours to write, the per-run record of inputs, tool calls, decisions, and approvals, because nothing logs it for you. Anthropic's guidance assumes exactly this build: apply least privilege so a slipped injection can do minimal damage, require confirmation before sensitive actions, and red-team the agent before deploy so you've confirmed the confirmation steps actually catch what the screens miss.
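The off-platform pause can be sketched as a small executor that parks sensitive tool calls until a decision is on file. This is a sketch under stated assumptions, not a reference implementation: `GatedExecutor`, `SENSITIVE_TOOLS`, and `ApprovalRequired` are hypothetical names, and in-memory dictionaries stand in for a real review queue and a durable decision store.

```python
from typing import Callable

# Hypothetical registry: tool names that must not run without an approval on file.
SENSITIVE_TOOLS = {"issue_refund", "delete_record"}


class ApprovalRequired(Exception):
    """Raised when a sensitive call is resumed without an approved decision."""


class GatedExecutor:
    def __init__(self, tools: dict[str, Callable[..., object]]):
        self.tools = tools
        self.pending: dict[str, dict] = {}            # request_id -> call awaiting review
        self.approved: set[str] = set()               # request_ids a reviewer approved
        self.decisions: list[tuple[str, bool]] = []   # every decision, in order

    def propose(self, request_id: str, tool: str, args: dict) -> str:
        """Run safe tools at once; park sensitive ones until a human decides."""
        if tool not in SENSITIVE_TOOLS:
            return f"done: {self.tools[tool](**args)}"
        self.pending[request_id] = {"tool": tool, "args": args}
        return "pending human approval"

    def decide(self, request_id: str, approved: bool) -> None:
        """Record the reviewer's decision; a rejection drops the parked call."""
        self.decisions.append((request_id, approved))
        if approved:
            self.approved.add(request_id)
        else:
            self.pending.pop(request_id, None)

    def resume(self, request_id: str) -> object:
        """Execute a parked call, but only if its approval is on file."""
        if request_id not in self.approved or request_id not in self.pending:
            raise ApprovalRequired(request_id)
        call = self.pending.pop(request_id)
        self.approved.discard(request_id)  # an approval is single-use
        return self.tools[call["tool"]](**call["args"])


executor = GatedExecutor({
    "order_lookup": lambda order_id: f"order {order_id} found",
    "issue_refund": lambda order_id, amount: f"refunded {amount} on {order_id}",
})

print(executor.propose("r1", "order_lookup", {"order_id": "A100"}))  # done: order A100 found
print(executor.propose("r2", "issue_refund", {"order_id": "A100", "amount": 40}))  # pending human approval
executor.decide("r2", approved=True)
print(executor.resume("r2"))  # refunded 40 on A100
```

The gate lives in the executor rather than in the prompt, so an instruction that talks the model into calling the refund tool still cannot make it run without a recorded approval.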

You do not pick one and pledge loyalty. A real system runs both: an Agentforce agent escalating governed actions to a person on-platform, and an off-platform agent gating its own consequential step. The handoff between them is a seam in accountability (principle 9): the moment the work crosses from one surface to the other is the moment to be explicit about who owns what happens next. The same toolkit-composition logic runs through the AI Engineering principles.

The through-line

An autonomous agent will eventually act wrong; the question production answers is whether a human was in the loop first and who owns the outcome after. The gate is sized by one rule — the cost of a wrong autonomous action sets the bar for approval — and it triggers on five situations: an irreversible action, low model confidence, high blast radius, a sensitive or compliance-bound decision, and an action outside the verified scope. Escalation hands the human the whole thread, not a one-line prompt, and logs the decision. A verification step gates the sensitive action on a check that has to pass — verify the customer before the refund — and it stacks with the human gate rather than replacing it. And accountability is the through-line: a person owns the outcome, the model never does, and the trace is the record that makes ownership real. Agentforce builds the escalation, the verification, and the audit trail into the platform; off-platform you build the approval step and the log yourself. The model was never the thing on the hook. A person is — and the gate and the trace are how you make sure there's a person who can be.

AI Engineering principles — a human is accountable (9), trace everything (11), govern every tool's blast radius (5), ship safely (5)
What is production readiness — the dimension map this page is the accountability corner of
Production gotchas — the failure modes a shipped agent inherits, including the unattended wrong action this page gates
Guardrails and safety — the input/output safety layer; the confirmation-before-sensitive-action gate is shared with it
PII and governance — the lawful-handling and retention discipline the compliance-bound gate sits next to
Deploying to production — the release mechanics the human gate and the trace operate inside
Production Style Guide — the bar a production system clears before it ships
Tracing and monitoring — the trace accountability depends on, read here for replay rather than for the silent degrade
Agentforce testing and observability — the platform-native audit trail and session trace the accountable owner reads
Tools and actions — least privilege and the approval gate; the verification step is the same discipline pointed at the act
Agentforce agents — the Instructions that say when to hand off and the Trust Layer audit trail this page builds on
