What's the difference between an agent firewall and guardrails?

Guardrails operate inside the model pipeline, checking the model's intent before it acts. An agent firewall operates at the network layer, scanning HTTP requests and MCP tool calls after the model acts. Guardrails catch bad decisions. Agent firewalls catch bad traffic. They fail in different ways, which is why you want both.

Can a prompt injection bypass guardrails?

Yes. Guardrails run in the same process as the model, often using the same model architecture to detect attacks. A sufficiently crafted injection can fool both the primary model and the guardrail. An agent firewall operates outside the model's trust boundary at the network layer, so injection that fools the model can't also fool the firewall's pattern matching.

Do I need guardrails if I have an agent firewall?

Yes, if you can deploy them. Guardrails catch unsafe intent before the agent acts. An agent firewall catches unsafe traffic after the agent acts. Neither alone covers everything. Defense in depth means multiple independent layers.

Agent Firewall vs Guardrails

The short version

Guardrails check the model’s intent before it acts. They run inside the inference pipeline.

An agent firewall checks what goes over the wire after the model acts. It runs at the network layer.

Guardrails catch bad reasoning. Agent firewalls catch bad traffic. They fail in different ways. Use both.

The trust boundary problem

Here’s why this matters: guardrails and the model share a trust boundary.

Guardrails like NeMo Guardrails, Guardrails AI, and LlamaFirewall run in the same process as the model or in the same inference pipeline. They use the same text processing, the same tokenization, and sometimes the same model architecture to detect attacks.

A prompt injection that’s good enough to fool the model has a decent chance of fooling the guardrail too. They’re processing the same input with similar techniques.

An agent firewall operates outside that trust boundary. It sees raw HTTP requests and MCP messages. Regex-based DLP doesn’t care what the model was thinking. It just checks whether the outbound request contains an API key. Pattern matching for injection doesn’t need to understand context. It just checks whether the response contains “ignore previous instructions.”

Different layers, different techniques, different failure modes. That’s defense in depth.

How guardrails work

Guardrails intercept model interactions before they reach external systems:

User Input → Guardrail (check input) → Model → Guardrail (check output) → Action

NeMo Guardrails (NVIDIA): Define conversation rails in a custom language (Colang). The model’s outputs are checked against allowed patterns before being executed.

Guardrails AI: Define validators using Pydantic models. Model outputs are validated against schemas and can be corrected automatically.

LlamaFirewall (Meta): Three-scanner pipeline. PromptGuard classifies inputs, AlignmentCheck audits chain-of-thought, CodeShield scans generated code.

All three are Python libraries. They hook into the model pipeline. They’re effective when you control the inference chain.

How an agent firewall works

An agent firewall intercepts network traffic after the model has decided to act:

Model decides to act → Agent sends request → Agent Firewall (scan request + response) → External system

It doesn’t know or care what the model was thinking. It scans:

Outbound HTTP for credential patterns (DLP)
Inbound HTTP for prompt injection patterns
MCP tool arguments for credential leaks
MCP tool descriptions for poisoned instructions
MCP tool description changes (rug-pulls)
DNS for exfiltration attempts
Destination IPs for SSRF

What guardrails catch that firewalls don’t

Unsafe reasoning. If the model is thinking “I should read the SSH key file,” guardrails can catch that intent before the agent writes the code. An agent firewall only sees the result after execution.

Bad code generation. Tools like CodeShield (in LlamaFirewall) can scan generated code for known vulnerabilities before it runs.

Off-topic behavior. NeMo Guardrails can constrain the model to stay on-topic and follow conversational rails. Agent firewalls don’t care about conversation flow.

Hallucination filtering. Some guardrail frameworks include factuality checks. Firewalls don’t validate content accuracy.

What firewalls catch that guardrails don’t

Credential leaks. When an agent’s outbound request contains an AWS key encoded in base64, DLP catches it. Guardrails don’t scan outbound HTTP.

MCP tool poisoning. A malicious MCP server can change its tool descriptions mid-session (rug-pull) to instruct the agent to exfiltrate data. An agent firewall fingerprints descriptions and detects changes. Guardrails don’t monitor MCP tool descriptions.

SSRF. An injection could tell the agent to request http://169.254.169.254/latest/meta-data/ to steal cloud credentials. An agent firewall blocks private IP requests. Guardrails don’t operate at the network layer.

Post-bypass traffic. If an injection gets past the guardrail (and some do), the resulting malicious request still has to go through the agent firewall. Two independent chances to catch the attack.

Closed-pipeline agents. Claude Code, Cursor, GitHub Copilot, and most commercial agents use hosted models. You can’t insert guardrails into their inference chain. But you can route their traffic through a proxy.

The bypass problem

Guardrails have a known bypass problem. Research has demonstrated:

High bypass rates against LlamaFirewall’s PromptGuard v1 using encoding tricks and language switching (PromptGuard 2 improved significantly, though independent benchmarks are still limited)
Jailbreak techniques that circumvent NeMo Guardrails conversation constraints
Adversarial inputs specifically crafted to pass guardrail checks while still being malicious

This doesn’t mean guardrails are useless. They catch a lot of attacks. But they operate in the same trust domain as the model, so a sufficiently clever attack can fool both.

Agent firewalls have a different bypass surface. Pattern-matching misses novel phrasings. DLP regex misses encrypted payloads. But these failures are independent of the guardrail’s failures. A prompt injection that fools the model and the guardrail might still trigger DLP when the resulting request contains a recognizable credential pattern.

Side-by-side

	Guardrails	Agent Firewall
Where it runs	In the model pipeline	At the network boundary
What it inspects	Model inputs/outputs, reasoning	HTTP requests, MCP messages
Credential scanning	No	Yes (DLP)
Injection detection	Model-based classification	Pattern matching
MCP security	No	Yes
SSRF protection	No	Yes
Works with closed agents	No (need pipeline access)	Yes (proxy-based)
Can be bypassed by injection	Yes (same trust boundary)	Different failure modes

How to use both

The best setup puts guardrails inside the pipeline and a firewall outside it:

User Input → Guardrail → Model → Guardrail → Agent → Agent Firewall → External System

Guardrails catch bad intent. The firewall catches bad traffic. If one misses something, the other might not.

Practical setup for custom Python agents:

Use LlamaFirewall or NeMo Guardrails in your agent code
Run Pipelock as the proxy
Set HTTPS_PROXY=http://127.0.0.1:8888 for the agent process
Wrap MCP servers with pipelock mcp proxy

For commercial agents (Claude Code, Cursor): Guardrails aren’t an option since you can’t modify the pipeline. Use Pipelock at the network layer. It’s your only enforcement point.

How Pipelock fits

Pipelock is an open-source agent firewall. It handles the network layer: DLP, injection detection, SSRF, MCP scanning, and rate limiting. It’s the second layer of defense that catches what guardrails miss.

It doesn’t replace guardrails. If you can deploy guardrails, do it. Pipelock handles the traffic that guardrails can’t see.