Guardrails for AI Agents: Why Prompts Aren't Enough

The model finished editing six files. It wrote "Done. The refactor is complete." and stopped.

It never ran the tests.

You ask it why. It apologizes. It runs the tests. They fail. It fixes one. Declares done again. Hasn't run the tests since.

This is the AI agent reality in 2026. The models are good. The tools are good. The thing in between — the part where the model decides what to call, when to verify, when to stop — is probabilistic and gets it wrong often enough that we've all built coping strategies. We add "ALWAYS run tests before saying done" to system prompts. We tell it "DO NOT use rm -rf". We re-read the diff before committing.

None of that scales. Instructions are soft. Tool calls are hard. The fix is to put policy where the actions happen, not where the model reads them.

Octomind's next release ships a three-section policy file that does exactly that. This post explains what's in it, why we built it the way we did, and what you'll be able to do with it.

Three Failure Modes LLMs Cannot Self-Correct

Every "the AI agent went off the rails" story falls into one of three buckets. Each one needs a different intervention point.

Pre-call: it's about to do something dangerous. The model is one token away from git push --force to main, rm -rf /tmp/build, or cat .env > /tmp/leak.txt. You don't want to argue with it. You want to stop the call before it executes.

The pattern of LLMs blowing past safety instructions is well-documented. Research on instruction following showed a 39% performance drop in multi-turn conversations, with "meta-cognitive instructions like check yourself and verify" being "the first to fall."

OpenAI's own instruction-hierarchy work acknowledges that models often treat system prompts with the same priority as user messages.

So when you write "never use rm -rf" in the system prompt, you're playing the odds. Most calls will obey it. The one that doesn't is the one that wakes you up at 3 AM.

OWASP's Top 10 for LLM Applications lists "Excessive Agency" as #6 — overly broad tool permissions are how AI agents end up causing harm. The fix isn't "tell the model not to". The fix is "make the tool not callable for that input".

Post-result: it ran the tool, the result tells you something the model isn't going to notice. cargo build returned 47 lines of warnings. The model read the exit code, said "build succeeded," and moved on. You wanted those warnings. Or: a view call returned 4,000 tokens of output and the model summarized it as "looks fine." You wanted it to scan for specific patterns.

The model has a finite attention budget. It's optimizing for forward progress. Results that aren't structured "you need to do something" get glossed over. In long sessions, earlier directives fade, and this kind of structured result analysis is usually the first thing to slip.

End-of-turn: it claimed done, but it didn't actually do the thing. Edited five files, never built. Wrote new code, never tested. Refactored a module, never ran the linter. The model is genuinely trying. It's just that "I'm done" is the easiest token sequence to produce and there's nothing pulling it back to verify.

This is the failure mode that surprises people the most, because you've already put the instruction in. Your system prompt literally says "after editing any code, run cargo test." You wrote it twice. You bolded it. You added a <critical> tag. And the agent still wraps up without running tests often enough that you can't trust it. Same story with linters, type checks, schema migrations, format-on-save scripts. The longer the session and the more tools the model has fired, the more reliably the trailing "and now run X" instruction gets pruned out of effective attention.

It's worse with plans. The model produces a five-step plan, knocks out steps one through three, partially does step four, and then declares the project complete. Anyone who's run an agent against a multi-file refactor knows the shape of this: it nails the easy half, the rest goes unattended, and you have to babysit it back to finish. The model isn't malicious or lazy in any cognitive sense — it's just optimizing the next token under finite attention. "We did the work, here's a summary" is locally cheaper than "we still owe two more steps, let me continue." Prompt engineering can shift the ratio but won't fix it.

GitHub recently rate-limited Copilot agent minutes because runaway agents were burning through their budget. Anthropic shipped Claude 4 with new training data specifically aimed at reducing "premature completion" claims. Everyone is feeling this. Better training helps. But you can't train your way out of probabilistic adherence — you can only narrow the failure distribution. The deterministic guarantee comes from outside the model.

Validators are the answer here. Stop writing "always run the tests" in the system prompt and instead write a six-line rule that says: at the end of every turn where the model claimed done and edited files, but didn't run tests, inject "you said done but the tests haven't been run — run them." That message lands as a user turn in the conversation. The model either runs the tests or has to explicitly refuse — and refusing surfaces the gap to you rather than hiding it. The instruction lives outside the prompt, so it doesn't get diluted by 200 turns of tool calls. It runs every time, the same way, whether the session is fresh or 90 minutes deep.

Why "Just Prompt It Better" Doesn't Work

The standard playbook is some combination of:

System prompt instructions. "Never run rm -rf. Always test before claiming done." Works until it doesn't.
Allowlist filtering. Strip dangerous tools from the model's tool list. Coarse — turns off shell entirely instead of just shell rm -rf.
Sandbox the whole thing. Run the agent in a container that can't reach your real files. Solves blast radius but makes the agent useless for real work.
Human-in-the-loop confirmation. Click "approve" on every tool call. Solves accuracy at the cost of throughput.

None of these are wrong. They're just operating at the wrong granularity. You don't want to disable shell — you want shell rm -rf to be denied while shell ls runs fine. You don't want to manually approve every call — you want the dangerous five to require review and the safe five hundred to fly through.

The thing missing is a per-tool-call policy layer that's:

Deterministic — string match, regex, set membership. No AI in the decision loop.
Contextual — knows what's happened earlier in the session, what's loaded, what's been called.
Composable — pre-call deny + post-result react + end-of-turn check, all in one place.
Scoped to capabilities, not specific tools — a rule that says "filesystem-write" should cover text_editor, batch_edit, and whatever new write tool gets added next month without rewriting the policy.

That last one is the part nobody else gets right. We'll come back to it.

Introducing `.agents/guardrails.toml`

The new file ships as .agents/guardrails.toml in your project root. Three section types, evaluated at three different phases of the tool-execution lifecycle:

Section	When it fires	What it does
`[[guard]]`	Before the tool runs	Denies the call. Model sees `[guardrail] <your message>`.
`[[hook]]`	After the result lands	Runs a script. Non-zero exit injects stdout as a user message.
`[[validator]]`	End of assistant turn	Runs a script. Non-zero exit injects wrapped feedback.

All three share a small matching DSL. All three are optional. Missing file = zero policy. Drop one in and you have a policy.

`[[guard]]` — Deny Before Execution

[[guard]]
match   = "shell(command=^rm\\s+-rf?)"
message = "rm -rf blocked. Specify the exact paths to remove."

[[guard]]
match   = "shell(command=git push.*(--force|-f)\\b)"
message = "Force push blocked. Use --force-with-lease and confirm."

[[guard]]
match   = "filesystem-read(paths=\\.env)"
message = "Refusing to read .env files."

The match field uses a DSL: capability(arg=regex). The model emits a tool call. Before the call executes, every [[guard]] rule is evaluated in declaration order. First match wins. If matched, the call never reaches the executor — the model gets back a synthetic tool error with your message and adapts on its own.

No wasted compute. No tool spawn. The dangerous call dies in the policy layer.

Three things make this useful in practice:

Capability-targeted, not tool-targeted. shell covers any tool tagged as the shell capability. filesystem-read covers view, read, and any other read tool the agent has loaded. You write the policy once and it covers the surface area.
Arg-targeted regex. Match against a specific argument. command=^rm matches the command argument; paths=\.env matches against the paths array even when it's ["a.rs","b/.env"]. Strings, arrays, and objects all serialize cleanly into the haystack.
History conditions. when = ["+filesystem-read", "-shell(command=cargo test)"] reads as "fire only if filesystem-read was used AND cargo test was NOT used". Built from +used and -unused primitives over the session call log.

The history piece matters. A common pattern is "don't let it run ls if it has the filesystem view tool" — but only if the filesystem capability is actually loaded. That's two predicates: has = "filesystem-read" (filter is loaded) plus when = ["-filesystem-read"] (it hasn't been used yet). Both express in one rule.

`[[hook]]` — React to Results

[[hook]]
match  = "shell(command=cargo (build|test|check))"
result = "error\\[E\\d+\\]"
script = ".agents/hooks/cargo-summary.sh"

[[hook]]
on     = "error"
script = ".agents/hooks/log-failures.sh"

Hooks run after a tool result lands but before that result returns to the model. Each hook can filter on:

match — same DSL as guards, against the call
result — regex against the result content (empty results match ^$)
on — "success", "error", or "any" (default)

When all filters match, the script runs. The script receives a JSON payload on stdin describing the call and its result. Exit 0 = no action. Exit non-zero = the script's stdout gets injected as a user message before the model's next turn.

This is where you put structured post-call validation. The model just ran cargo build. The exit code was 0 but you grepped the output for warning: and found ten of them. You don't want the model to ignore them. You inject "Build succeeded with 10 warnings — please address them before proceeding." Now the model has the choice in its context as if you said it.

You can compose hooks. They fire in parallel, all injects accumulate. Multiple matching hooks all run — there's no first-match-wins for hooks. They're additive policy, not selection. The same goes for validators below.

Use cases we've already built:

Cargo error summaries. Long compiler output → one-line "fix E0382 and E0277 first" inject.
Secret-leak detection. Pipe the result through grep -E 'AKIA[0-9A-Z]{16}|sk-[a-zA-Z0-9]{32}' → if found, inject "potential secret leaked, redact before proceeding."
Long-running tool warnings. If result.length > 50000, inject "consider narrower query — last result was too large to process well."

`[[validator]]` — End-of-Turn Checks

[[validator]]
name   = "test-before-done"
match  = "(?i)\\b(done|finished|completed)\\b"
when   = ["+filesystem-write", "-shell(command=cargo test)"]
roles  = ["developer"]
script = ".agents/validators/remind-tests.sh"

Validators fire at the end of an assistant turn — when the model emits its final message with no further tool calls. They're the answer to "the model said done but didn't actually verify."

Each validator has four optional filters that AND together:

roles — only fire for specific session roles (developer, reviewer, etc.). Cheapest filter, runs first.
when — same +used / -unused DSL as guards. Evaluated against call_log[cursor..] — i.e. tool calls since this validator last ran. Per-validator cursor advances every time the validator fires.
match — regex against the assistant's final message text. Useful for catching premature "done" claims.
name — required identifier. Used for the cursor and the injection XML tag.

The cursor matters. With when = ["+filesystem-write"], the validator fires once when the first write happens, then resets. If the model writes again on the next turn, it fires again. If the model never writes again, the validator stays silent. You're not paying for redundant checks.

Injection is wrapped: <validation validator="test-before-done">...</validation>. The model sees this as a user message in the next turn. The wrapping prevents the validator output from being re-matched by skill auto-activation (we shipped that in 0.29.0 — same anti-recursion principle applies here).

Concrete scenarios this unlocks:

Force test run after edit. when = ["+filesystem-write", "-shell(command=cargo test)"] + a script that says "you edited code but didn't run tests." Fires every turn the model edits without testing.
Always-on linter. No filters at all. Validator runs every turn end, lints whatever changed, injects errors as a user message.
Lint claim verification. match = "(?i)\\bclean\\b|\\bno warnings\\b" + a script that runs cargo clippy. Catches the model saying "no warnings" when there are seven.
Verify commit messages match work. when = ["+shell(command=git commit)"] + a script that compares the commit message to the diff. Inject "commit message doesn't reflect changes" if drift.

The same shape works for any "since the last time we checked, did the agent meet a condition" question. The cursor mechanic generalizes; you write the script for the actual rule.

Capabilities Over Tools — The Flexibility Win

Here's the part everyone else has gotten wrong.

Most policy systems for AI agents target specific tools. "Block Bash." "Restrict Edit." This works for one project, one model, one tool catalog. The moment you add a new MCP server, swap to a different filesystem provider, or compose two agents with overlapping capabilities, your rules go stale.

Octomind's policy targets capabilities, not tool names. A capability is a logical grouping defined in a tap manifest:

shell — execute shell commands. Provided by octofs today, could be provided by bash, zsh-runner, or whatever you compose in tomorrow.
filesystem-read — read files. Covers view, read, extract_lines, ast_grep.
filesystem-write — modify files. Covers text_editor, batch_edit, write.
network — make HTTP requests. Covers fetch, curl, custom HTTP tools.

You write a rule against filesystem-write. It covers all current write tools. Six months from now you add a new MCP server with its own edit tool. As long as that tool is tagged as filesystem-write in its capability manifest, your rule covers it automatically.

This isn't a small win. It's the difference between a policy you ship once and a policy you rewrite every release. The shell rules we test internally — rm -rf, git push --force, secret-laden curl — are the same on macOS, Linux, in Docker, with octofs or with any future shell provider. You don't need to care which tool provides the capability.

You can still match by tool name when you need precision — shell(command=git push) is implicit "any tool in the shell capability with a command arg matching git push". Most of the time you don't need that precision. You're really saying "this kind of action".

How History Works (Practical Mental Model)

There's one shared piece of state: a per-session call log. Every successful tool call gets appended as (capability, params). Blocked calls (denied by a guard) are not recorded — they didn't happen, so they shouldn't satisfy history conditions on retry.

[[guard]] when reads the whole log — session-wide history.
[[validator]] when reads log[cursor..] — slice since this validator's last run.
[[hook]] doesn't use when — it's per-result.

This means the same DSL expression has consistent meaning. +filesystem-write in a guard means "was filesystem-write used at any point this session". +filesystem-write in a validator means "was filesystem-write used since this validator last ran". Same predicate, different window.

If you've used Cursor's or Cline's policy systems, you'll recognize the shape — but they don't have the cursor mechanic for validators, and they don't have capability-level matching. You write more rules for less coverage.

Pipeline View

LLM emits tool calls [t0, t1, …]
  ↓
[[guard]] evaluation (sequential, arrival order):
  for each ti:
    resolve capability
    evaluate guards → first match denies
    if allowed → record in call log → spawn task
  ↓
parallel tool execution (only allowed tools)
  ↓
[[hook]] evaluation:
  for each (call, real result):
    match + result + on filters
    spawn matching scripts in parallel
    non-zero exits → inbox push
  ↓
results to LLM
  ↓
LLM final message (no more tool calls)
  ↓
[[validator]] evaluation:
  for each validator in declaration order:
    role → when (cursor slice) → match (assistant text)
    survivors spawn in parallel
    advance per-validator cursors
    non-zero exits → wrap + inbox push
  ↓
inbox flushes into next API call as user messages

Three intervention points. Same DSL. Same call log. Composable.

Catching Drift: Habits the Model Can't Hold

Pre-call deny gets the attention because it's the obvious sharp-knife scenario. But the under-appreciated value of this whole system is enforcing habits. Things you'd want the model to always do, that it usually does, but that it skips often enough that you can't trust it.

Some examples from how we actually use this internally:

Memorize when something new shows up. The model just discovered an undocumented constraint in a vendor's API. It used the discovery, finished the turn, and moved on. The knowledge dies with the session. A validator with when = ["+memorize"] is the wrong direction — that fires only when memory was written. You want the opposite: when = ["+filesystem-read", "-memorize"] (or whatever the memory tool's capability tag is), so when the model has clearly learned something but didn't persist it, you nudge: "you found something worth remembering — did you write it to memory?" This isn't [[guard]] territory — you don't want to block. It's a [[validator]] because the heuristic is "session is ending, did you preserve what you learned?"

Finish what you planned. The model wrote a plan with five checkboxes. It marked three as done in its output but didn't actually do the remaining two. A validator reads the plan file, parses checked vs unchecked, and if there are unchecked items but the model said "done" — inject "plan has 2 unchecked items, list them." The plan file lives in the project, the validator script knows the format, the check is deterministic. Suddenly partial completion stops being a failure mode you babysit and starts being a soft assertion the model has to navigate.

Stay in the conventions. New files dropped in src/ need a license header. Test files need _test.rs suffix. Markdown headings shouldn't go past H4. The model knows these rules from the system prompt, mostly follows them, occasionally drifts. A hook fires on every filesystem-write and runs a tiny script that checks the convention — exit 1 with "file you wrote is missing the header" gets the model to fix it before it moves on. The instruction lives in code, not in attention.

Run the validation pipeline before claiming done. This is the meta-pattern: you have an existing CI script — ./scripts/preflight.sh — that runs lint + typecheck + test + a custom invariant check. You don't want to add four validators. You want one validator that runs the pipeline. The script chains into your existing tooling and only injects something when the pipeline itself fails. The validator is the trigger; the work is in your existing scripts. You're not duplicating policy in TOML — you're invoking what you already trust.

Catch the slow drift in tone/structure. Long sessions drift. The model starts producing terse one-line summaries when you want full reports, or starts using console.log when the convention is pino. A validator that greps the assistant message for patterns you don't want, and reminds when it sees them, catches drift before you'd notice it manually.

What ties these together: they're all "the model is supposed to do X, sometimes forgets, and the cost of forgetting is real." Before validators, the answer was longer system prompts. After validators, the answer is small scripts that check the observable outcome. Scripts don't dilute. Scripts don't compete for attention. Scripts run the same way every time.

What You'll Be Able to Do in Practice

A few scenarios from real internal use:

1. Force pre-flight checks on a dangerous command.

[[guard]]
match   = "shell(command=git push)"
when    = ["-shell(command=git status)"]
message = "Run `git status` first to confirm what's staged."

The model can't push until status has been checked this session. Doesn't matter how many sub-tools it composes, the policy holds.

2. Force tests before "done."

[[validator]]
name   = "test-before-done"
match  = "(?i)\\b(done|finished|completed)\\b"
when   = ["+filesystem-write", "-shell(command=cargo test)"]
script = ".agents/validators/remind-tests.sh"

The validator catches premature completion claims. The script writes "you edited code but didn't run tests" to stdout and exits 1. The model gets that as a user message. Loop continues until tests are actually run or the model stops claiming done.

3. Pre-emptively redact dangerous patterns from results.

[[hook]]
match  = "shell"
script = ".agents/hooks/redact-secrets.sh"

Script grabs the tool result, runs it through a regex sweep for AWS keys, GitHub tokens, etc. If found, exits 1 with "result contained N potential secrets, redact before continuing." Model sees the warning before re-acting on the data.

4. Block reads of sensitive paths.

[[guard]]
match   = "filesystem-read(paths=(\\.env|secrets/|private/))"
message = "Refusing to read sensitive paths. Specify what you actually need."

Covers every read tool — view, read, extract_lines — without listing each one.

5. Always-on lint.

[[validator]]
name   = "always-lint"
script = ".agents/validators/lint.sh"

No filters. Runs every turn end. Lints whatever changed. Injects errors. The model deals with them or stops working. Cheaper than letting bad code drift into the codebase.

6. Plan completion check.

[[validator]]
name   = "plan-completion"
match  = "(?i)\\b(done|all set|that's everything)\\b"
script = ".agents/validators/plan-unchecked.sh"

The script reads a known plan file (e.g. .agents/plan.md), counts unchecked items, and if there are any while the model claims completion — exits 1 with "plan still has N unchecked items: ..." This shuts down the "I knocked out the easy half, you do the rest" pattern. The model has to actually finish or explicitly acknowledge the gap.

7. Memory-write hygiene.

[[validator]]
name   = "remember-what-you-learned"
when   = ["+filesystem-read", "-memorize"]
script = ".agents/validators/memory-nudge.sh"

When the model reads through a lot of the codebase but never writes anything to memory, nudge: "you've been exploring — anything worth memorizing for next session?" Especially useful in long research sessions where context will be discarded and re-learning costs real tokens.

8. Pre-flight pipeline trigger.

[[validator]]
name   = "preflight-on-done"
match  = "(?i)\\b(done|finished|ready to merge)\\b"
when   = ["+filesystem-write"]
script = "./scripts/preflight.sh"

The script is your existing CI preflight — runs lint, typecheck, tests, custom invariants. Exits 0 if clean, non-zero with the failure summary if anything broke. The validator's only job is the trigger. The actual rules live in code you already maintain.

9. Caller-must-update-tests check.

[[validator]]
name   = "test-cohabits-edit"
when   = ["+filesystem-write"]
script = ".agents/validators/touched-without-tests.sh"

Script compares the set of edited source files to the set of edited test files. If src/foo.rs changed but src/foo_tests.rs didn't — inject "foo.rs changed without a corresponding test update." Catches the "I just added a function and didn't add a test for it" drift.

What's Coming and When

This ships in the next Octomind release. The implementation is complete, tested, and we're using it internally on Octomind itself — the same rules that catch git push --force here will be in your hands soon.

The article is the spec. Drop a .agents/guardrails.toml in your project, install Octomind, and the policy applies. No global config. No daemon. No special setup.

A few extensions we're shaping for post-release:

Let validator scripts suggest actions (inject "you should run X" with a structured action the model can accept/reject) instead of plain text.
Let [[hook]] modify tool results (filter, redact, append) rather than only react to them.
Add better introspection: /guardrails to show what fired, what didn't, and what would have been blocked (likely 0.31).

If you have a use case that doesn't fit cleanly into guard / hook / validator, open an issue. We'll use those cases to shape what ships next.

Why This Matters

LLMs are not going to stop being probabilistic. Better models will reduce the failure rate. They won't eliminate it. The thing that has the highest leverage is moving policy out of the model and into the tool layer.

The model is good at intentions. It's bad at boundaries. You can give it a million-token system prompt about never running rm -rf and it'll run rm -rf the one time the context happens to break the pattern. Or you can write three lines of TOML that make shell(command=^rm -rf) an impossible call. The latter is correct every time. The former is correct most of the time.

Guardrails are how we ship correct-every-time alongside the probabilistic model. Pre-call deny is the seatbelt. Post-result hooks are the dashcam. End-of-turn validators are the parking attendant. They're cheap, declarative, deterministic, and they target capabilities so they survive tool churn.

Octomind has had guardrails for tools internally for a while. This release exposes them as a first-class config surface. We think every serious AI agent runtime will have something like this within twelve months. Ours is shipping now.

Octomind on GitHub · Documentation · 0.29.0 release post

Guardrails for AI Agents: Why Prompts Aren't Enough

Three Failure Modes LLMs Cannot Self-Correct

Why "Just Prompt It Better" Doesn't Work

Introducing .agents/guardrails.toml

[[guard]] — Deny Before Execution

[[hook]] — React to Results

[[validator]] — End-of-Turn Checks