The short version
Guardrails check the model’s intent before it acts. They run inside the inference pipeline.
An agent firewall checks what goes over the wire after the model acts. It runs at the network layer.
Guardrails catch bad reasoning. Agent firewalls catch bad traffic. They fail in different ways. Use both.
The trust boundary problem
Here’s why this matters: guardrails and the model share a trust boundary.
Guardrails like NeMo Guardrails, Guardrails AI, and LlamaFirewall run in the same process as the model or in the same inference pipeline. They use the same text processing, the same tokenization, and sometimes the same model architecture to detect attacks.
A prompt injection that’s good enough to fool the model has a decent chance of fooling the guardrail too. They’re processing the same input with similar techniques.
An agent firewall operates outside that trust boundary. It sees raw HTTP requests and MCP messages. Regex-based DLP doesn’t care what the model was thinking. It just checks whether the outbound request contains an API key. Pattern matching for injection doesn’t need to understand context. It just checks whether the response contains “ignore previous instructions.”
Different layers, different techniques, different failure modes. That’s defense in depth.
How guardrails work
Guardrails intercept model interactions before they reach external systems:
User Input → Guardrail (check input) → Model → Guardrail (check output) → Action
NeMo Guardrails (NVIDIA): Define conversation rails in a custom language (Colang). The model’s outputs are checked against allowed patterns before being executed.
Guardrails AI: Define validators using Pydantic models. Model outputs are validated against schemas and can be corrected automatically.
LlamaFirewall (Meta): Three-scanner pipeline. PromptGuard classifies inputs, AlignmentCheck audits chain-of-thought, CodeShield scans generated code.
All three are Python libraries. They hook into the model pipeline. They’re effective when you control the inference chain.
How an agent firewall works
An agent firewall intercepts network traffic after the model has decided to act:
Model decides to act → Agent sends request → Agent Firewall (scan request + response) → External system
It doesn’t know or care what the model was thinking. It scans:
- Outbound HTTP for credential patterns (DLP)
- Inbound HTTP for prompt injection patterns
- MCP tool arguments for credential leaks
- MCP tool descriptions for poisoned instructions
- MCP tool description changes (rug-pulls)
- DNS for exfiltration attempts
- Destination IPs for SSRF
What guardrails catch that firewalls don’t
Unsafe reasoning. If the model is thinking “I should read the SSH key file,” guardrails can catch that intent before the agent writes the code. An agent firewall only sees the result after execution.
Bad code generation. Tools like CodeShield (in LlamaFirewall) can scan generated code for known vulnerabilities before it runs.
Off-topic behavior. NeMo Guardrails can constrain the model to stay on-topic and follow conversational rails. Agent firewalls don’t care about conversation flow.
Hallucination filtering. Some guardrail frameworks include factuality checks. Firewalls don’t validate content accuracy.
What firewalls catch that guardrails don’t
Credential leaks. When an agent’s outbound request contains an AWS key encoded in base64, DLP catches it. Guardrails don’t scan outbound HTTP.
MCP tool poisoning. A malicious MCP server can change its tool descriptions mid-session (rug-pull) to instruct the agent to exfiltrate data. An agent firewall fingerprints descriptions and detects changes. Guardrails don’t monitor MCP tool descriptions.
SSRF. An injection could tell the agent to request http://169.254.169.254/latest/meta-data/ to steal cloud credentials. An agent firewall blocks private IP requests. Guardrails don’t operate at the network layer.
Post-bypass traffic. If an injection gets past the guardrail (and some do), the resulting malicious request still has to go through the agent firewall. Two independent chances to catch the attack.
Closed-pipeline agents. Claude Code, Cursor, GitHub Copilot, and most commercial agents use hosted models. You can’t insert guardrails into their inference chain. But you can route their traffic through a proxy.
The bypass problem
Guardrails have a known bypass problem. Research has demonstrated:
- High bypass rates against LlamaFirewall’s PromptGuard v1 using encoding tricks and language switching (PromptGuard 2 improved significantly, though independent benchmarks are still limited)
- Jailbreak techniques that circumvent NeMo Guardrails conversation constraints
- Adversarial inputs specifically crafted to pass guardrail checks while still being malicious
This doesn’t mean guardrails are useless. They catch a lot of attacks. But they operate in the same trust domain as the model, so a sufficiently clever attack can fool both.
Agent firewalls have a different bypass surface. Pattern-matching misses novel phrasings. DLP regex misses encrypted payloads. But these failures are independent of the guardrail’s failures. A prompt injection that fools the model and the guardrail might still trigger DLP when the resulting request contains a recognizable credential pattern.
Side-by-side
| Guardrails | Agent Firewall | |
|---|---|---|
| Where it runs | In the model pipeline | At the network boundary |
| What it inspects | Model inputs/outputs, reasoning | HTTP requests, MCP messages |
| Credential scanning | No | Yes (DLP) |
| Injection detection | Model-based classification | Pattern matching |
| MCP security | No | Yes |
| SSRF protection | No | Yes |
| Works with closed agents | No (need pipeline access) | Yes (proxy-based) |
| Can be bypassed by injection | Yes (same trust boundary) | Different failure modes |
How to use both
The best setup puts guardrails inside the pipeline and a firewall outside it:
User Input → Guardrail → Model → Guardrail → Agent → Agent Firewall → External System
Guardrails catch bad intent. The firewall catches bad traffic. If one misses something, the other might not.
Practical setup for custom Python agents:
- Use LlamaFirewall or NeMo Guardrails in your agent code
- Run Pipelock as the proxy
- Set
HTTPS_PROXY=http://127.0.0.1:8888for the agent process - Wrap MCP servers with
pipelock mcp proxy
For commercial agents (Claude Code, Cursor): Guardrails aren’t an option since you can’t modify the pipeline. Use Pipelock at the network layer. It’s your only enforcement point.
How Pipelock fits
Pipelock is an open-source agent firewall. It handles the network layer: DLP, injection detection, SSRF, MCP scanning, and rate limiting. It’s the second layer of defense that catches what guardrails miss.
It doesn’t replace guardrails. If you can deploy guardrails, do it. Pipelock handles the traffic that guardrails can’t see.
Further reading
- What is an agent firewall? : full definition, threat coverage, and evaluation checklist
- Pipelock vs LlamaFirewall : detailed head-to-head with Meta’s guardrail
- Prompt Injection: Network-Layer Defense : how firewalls catch injection at the proxy
- Agent Firewall vs WAF : another commonly confused comparison
- Pipelock on GitHub