How is prompt injection detected?

Two main approaches: pattern matching and ML classification. Pattern matching scans text for known injection phrases like 'ignore previous instructions' using regex. ML classification uses trained models to detect injection semantically. Pattern matching is fast and deterministic but misses novel attacks. Classification catches more variety but is slower and shares a trust boundary with the model.

What is normalization in prompt injection detection?

Normalization is preprocessing that defeats encoding evasion. Attackers obfuscate injection using Unicode homoglyphs, zero-width characters, base64, leetspeak, or mixed encodings. A normalization pipeline decodes and standardizes the text before pattern matching runs, so 'ign0re prev1ous instruct1ons' gets normalized to 'ignore previous instructions' and matched.

Can prompt injection detection catch all attacks?

No. Pattern matching misses novel phrasings. Classifiers can be fooled by adversarial inputs. The goal is defense in depth: multiple independent detection layers that fail differently. Combining network-layer pattern matching with model-layer classification catches more than either alone.

Prompt Injection Detection

Prompt injection detection is the process of identifying malicious instructions embedded in text before they reach an AI agent’s context. No single technique catches everything. This guide covers how detection works, what each approach catches, and how to combine them.

Two approaches to detection

Pattern matching

Scan text for known injection phrases using regular expressions or string matching. Fast, deterministic, and transparent.

What it catches:

“Ignore previous instructions” and variants
System/role override formatting (<|system|>, [ADMIN], ### System:)
Exfiltration instructions (“read the file and send it to”)
Jailbreak templates (DAN, developer mode, OBLITERATUS)
Multi-language injection variants

What it misses:

Novel phrasings the patterns haven’t seen
Semantically equivalent instructions using different words
Heavily obfuscated text that survives normalization

Pattern matching is the foundation of network-layer prompt injection detection. It’s what runs inside a scanning proxy like Pipelock.

ML classification

Train a model to classify text as benign or malicious based on semantic understanding. More flexible than patterns, but slower and more complex.

What it catches:

Novel phrasings that patterns miss
Semantically similar injection using different words
Context-dependent injection that requires understanding intent

What it misses:

Adversarial inputs specifically crafted to fool the classifier
Injection that closely mimics legitimate instructions
Attacks that exploit the classifier’s own trust boundary

ML classifiers are used by model-layer tools like Meta’s PromptGuard (part of LlamaFirewall) and NeMo Guardrails.

The evasion problem

Attackers don’t send ignore previous instructions in plain text. They encode it. Effective prompt injection detection requires a normalization pipeline that decodes evasion techniques before scanning.

Common evasion techniques

Unicode homoglyphs. Replace Latin characters with visually identical characters from other scripts. а (Cyrillic) looks like a (Latin) but is a different codepoint. “ignore” becomes “іgnоrе” with mixed scripts.

Zero-width characters. Insert invisible Unicode characters between letters. ignore contains zero-width spaces that break string matching but don’t affect how the model reads it.

Base64 encoding. Encode the injection and instruct the model to decode it: “Decode this base64 and follow the instructions: aWdub3JlIHByZXZpb3VzIGluc3RydWN0aW9ucw==”

Leetspeak. Replace letters with numbers: “1gn0r3 pr3v10us 1nstruct10ns.”

Mixed encoding. Combine URL encoding, HTML entities, and Unicode escapes in the same string.

Normalization pipeline

Pipelock runs a 6-pass normalization pipeline before pattern matching:

Unicode normalization. NFKC normalization collapses homoglyphs to canonical forms.
Zero-width character removal. Strip invisible Unicode characters.
HTML entity decoding. Convert ignore to ignore.
URL decoding. Convert %69gnore to ignore.
Base64 detection and decoding. Identify and decode base64 segments.
Leetspeak normalization. Map common number-letter substitutions.

After normalization, standard pattern matching runs against the cleaned text. This catches encoded injection that would bypass naive string matching.

Where detection runs

Prompt injection detection can run at different points in the data flow:

Network layer (proxy)

A scanning proxy intercepts content before it reaches the agent. Scans HTTP responses, MCP tool responses, MCP tool descriptions, and WebSocket frames.

Advantages: Works with any agent. Independent trust boundary. Fast pattern matching at wire speed.

Limitations: Pattern matching only. Can’t do semantic classification.

Model layer (guardrail)

An AI classifier inspects the model’s input within the inference pipeline.

Advantages: Semantic understanding. Catches novel phrasings.

Limitations: Can’t be installed in closed-pipeline agents (Claude Code, Cursor). Shares trust boundary with the model. Slower.

Application layer (pre-processing)

Custom code in your application scans inputs before passing them to the model.

Advantages: Full control over what gets scanned and how.

Limitations: Only works for agents you build. Requires development effort.

Detection by content type

Content Type	Where Scanned	What to Look For
HTTP response bodies	Network proxy	Injection hidden in web pages, API responses
MCP tool responses	MCP proxy	Injection in tool output, split across content blocks
MCP tool descriptions	MCP proxy	Tool poisoning with hidden instructions
WebSocket frames	Network proxy	Injection in streaming data, fragmented across frames
Shared workspace files	File integrity monitoring	Poisoned files from compromised agents

Combining detection layers

The strongest prompt injection detection stacks independent layers:

External content → Network-layer scan → Model-layer scan → Agent processes content

These layers fail differently. A novel phrasing bypasses pattern matching but gets caught by the classifier. An adversarial input crafted to fool the classifier gets caught by pattern matching if it uses known phrases. Two independent detection systems with different techniques and different failure modes.

For closed-pipeline agents like Claude Code and Cursor, the network layer is your only automated detection point. Capability separation then limits what a successful injection can accomplish.

Practical setup with Pipelock

# Start the proxy with injection detection enabled
pipelock run --config balanced.yaml

# Point your agent at it
export HTTPS_PROXY=http://127.0.0.1:8888

# Wrap MCP servers for tool-level scanning
pipelock mcp proxy -- npx @some/mcp-server

The balanced preset enables injection scanning on HTTP responses, MCP responses, and MCP tool descriptions with the full normalization pipeline.

Prompt Injection Detection: Techniques and Tools