What chatbot security means

Chatbot security is the practice of protecting users, operators, and downstream systems from the failures and attacks unique to chatbots that use large language models. The boundary is wider than it looks: a chatbot is a model plus a prompt plus everything it reads plus everything it can do.

The main risk categories:

  • Credential and PII leaks — the chatbot reads sensitive content and sends it somewhere it should not.
  • Prompt injection and indirect prompt injection — content the chatbot reads (user message, retrieved document, tool response, webpage) overrides its intended behavior.
  • Jailbreaks — adversarial prompts that bypass safety training.
  • Oversharing — the chatbot reveals system prompts, internal context, or backend data.
  • Hallucinations — the chatbot confidently states things that are not true and the user acts on them.
  • Tool abuse — when the chatbot can call APIs or take actions, those calls become an attack surface.
  • Supply-chain compromise — the model provider or hosted service is compromised.
  • Inadequate logging and audit — without logs, any of the above is invisible after the fact.

Each category has a defense pattern. The defenses combine; no single control catches everything.

Are AI chatbots safe to use?

AI chatbots are safe to use for low-stakes tasks. Safety degrades as the chatbot is given access to sensitive data, allowed to call tools or make outbound network requests, or relied on for decisions that have legal or financial consequences.

The right question is not “is this chatbot safe?” It is “is this chatbot safe for this specific task, with this specific data, and these specific permissions?” Treat the chatbot’s output as untrusted user input by default, and put security controls between the chatbot and any system it can affect.

Defending against the major chatbot security risks

Credential and PII leaks

DLP (data loss prevention) scanning on every outbound request from the chatbot catches credential patterns (API keys, tokens, private keys), payment card numbers, and other regex-detectable secrets. The right place for DLP is at the network boundary, not inside the chatbot, because attackers can encode secrets to evade prompt-level filters.
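A minimal sketch of the pattern-matching half of such a scan, in Python. The pattern names and regexes here are illustrative; a production DLP engine would add entropy checks, decoding of common encodings, and a far larger ruleset:

```python
import re

# Illustrative credential patterns only; real DLP rulesets are much larger
# and also catch encoded or high-entropy secrets.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_api_key": re.compile(
        r"\b(?:api|secret)[_-]?key['\"]?\s*[:=]\s*['\"]?[A-Za-z0-9]{20,}"
    ),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_outbound(payload: str) -> list[str]:
    """Return the names of secret patterns found in an outbound request body."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(payload)]
```

A proxy at the network boundary would call `scan_outbound` on every request body leaving the chatbot's host and block or redact on a match.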

For PII (names, emails, social security numbers), the answer is more about input gating: do not give the chatbot access to PII it does not need. If access is needed, log every retrieval, scope by user, and audit who saw what.

Prompt injection and indirect prompt injection

A model cannot reliably tell instructions apart from data. The defense is to scan the data — every webpage the chatbot fetches, every tool response it reads, every document it retrieves — for known injection patterns before the chatbot sees them. Pattern matching is imperfect and an arms race; the goal is to raise the cost of an attack, not to claim 100% coverage.
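A toy version of that scan might look like the following. The pattern list is illustrative and deliberately small — which is exactly why this layer raises attack cost rather than guaranteeing coverage:

```python
import re

# Known-injection heuristics; illustrative, not exhaustive. Case-insensitive
# because injected instructions are often mixed-case.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all |any )?(previous|prior|above) instructions",
        r"you are now\b",
        r"system prompt",
        r"disregard (the |your )?(rules|guidelines|instructions)",
        r"<\s*/?\s*(system|instruction)\s*>",  # fake role/markup tags
    )
]

def flag_untrusted_content(text: str) -> bool:
    """True if fetched content matches a known injection pattern.

    Flagged content should be quarantined or stripped before the model
    ever sees it, not merely logged."""
    return any(p.search(text) for p in INJECTION_PATTERNS)
```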

See prompt injection defense at the network layer for the technical pattern, and LLM prompt injection for the broader category.

Jailbreaks

Jailbreak attempts come through the user message. Refusing to answer is a model-layer response and not a security control — a jailbreak that succeeds means refusal failed. Real defense is layered: input filtering for known jailbreak patterns, monitoring for unusual response content (refusals that turn into compliance), and limiting what a jailbroken chatbot can actually do downstream.

If your chatbot has tool access, the network layer matters more than the prompt layer. A jailbroken chatbot that cannot call dangerous tools because the tool policy blocks them is a much smaller incident than one that can.

Oversharing system prompts and backend data

Set the chatbot’s system prompt with the assumption that users will see it. Anything truly sensitive does not go in the system prompt. Backend data the chatbot can access should be scoped by user identity at the data layer, not by trusting the chatbot to filter.
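Data-layer scoping can be sketched like this, assuming a hypothetical orders table and a `fetch_orders_for` helper. The key property: the user id comes from the authenticated session, never from model output, so even a fully jailbroken chatbot cannot widen the query:

```python
import sqlite3

def fetch_orders_for(conn: sqlite3.Connection, session_user_id: int):
    """Scope at the data layer: session_user_id comes from the authenticated
    session, not from anything the model produced. The chatbot can only ever
    see rows this query returns."""
    cur = conn.execute(
        "SELECT id, status FROM orders WHERE user_id = ?",
        (session_user_id,),  # parameterized; model text never reaches the SQL
    )
    return cur.fetchall()
```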

Hallucinations

Not strictly a security risk, but operationally similar: a confident-but-wrong answer that the user acts on can cause real damage. Guard with citations to source material, RAG over a curated knowledge base, retrieval result verification, and human-in-the-loop for any high-stakes action.

Tool abuse and confused deputy

The moment a chatbot can call a tool, the tool’s permissions become the chatbot’s permissions. A tool that holds an API token and accepts model-influenced parameters is a confused deputy: the model can be tricked into invoking the tool on the attacker’s behalf with privileges the attacker should not have.

The defenses are pre-execution allow/deny rules on tool calls, argument validation, scoping tools to the minimum permission they need, and runtime inspection of every tool call before it runs.
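A minimal pre-execution policy check might look like this. The tool names, size limits, and validators are hypothetical; the important property is that the check runs outside the model, before anything executes:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

# Illustrative allowlist plus per-tool argument validation.
ALLOWED_TOOLS = {"search_docs", "create_ticket"}

def valid_args(call: ToolCall) -> bool:
    if call.name == "create_ticket":
        # Hypothetical rules: cap ticket body size, forbid attachments.
        return len(call.args.get("body", "")) < 4000 and "attachment" not in call.args
    return True

def authorize(call: ToolCall) -> bool:
    """Deny by default: a tool call runs only if the tool is allowlisted
    AND its arguments pass validation."""
    return call.name in ALLOWED_TOOLS and valid_args(call)
```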

Supply-chain compromise

Pin model versions when possible. For hosted chatbots, monitor the provider’s security posture and incident response. For self-hosted chatbots, scan dependencies (the SDK, the model file, any containers) and watch for tampering. Signed model checkpoints and reproducible builds help where they are available.
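For a self-hosted model file, one concrete form of tamper-watching is a pinned checksum, sketched below. The pinned value here is a placeholder (it is the SHA-256 of empty input) and would be replaced with the digest of the artifact you actually reviewed:

```python
import hashlib

# Placeholder pin: the SHA-256 of empty input. Replace with the digest of
# the exact model file you reviewed and approved.
PINNED_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

def verify_model(path: str) -> bool:
    """Refuse to load a model file whose SHA-256 does not match the pin."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest() == PINNED_SHA256
```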

Inadequate logging and audit

Every chatbot interaction needs to be loggable. At minimum: timestamp, user identity, prompt, response, any tools called and their arguments, any external content retrieved, any safety filter outcomes. Hash-chained signed audit logs are a stronger version of this for regulated environments.
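The hash-chaining idea can be sketched in a few lines: each entry embeds the hash of the entry before it, so verification fails from the first edited or deleted record onward. (A regulated deployment would additionally sign each entry; signing is omitted here for brevity.)

```python
import hashlib
import json
import time

class AuditLog:
    """Hash-chained log: each entry embeds the previous entry's hash,
    so editing or deleting a record breaks verification from that point on."""

    def __init__(self):
        self.entries = []
        self._prev = "0" * 64  # genesis hash

    def append(self, record: dict) -> dict:
        entry = {"ts": time.time(), "prev": self._prev, **record}
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self._prev = entry["hash"]
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```

Each appended record would carry the minimum fields listed above (user identity, prompt, response, tool calls, retrieved content, filter outcomes).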

Where chatbot security ends and agent security begins

Chatbot security covers the conversational surface. Agent security extends to chatbots that can also act: call tools, make HTTP requests, run shell commands, query databases, write to files.

Every additional capability is an additional attack surface. Chatbot security tends to focus on prompt-level controls and content moderation. Agent security adds:

  • Runtime network controls (egress filtering, DLP on every outbound request)
  • MCP traffic scanning (tool descriptions and tool responses scrutinized)
  • Process sandboxing (filesystem and syscall isolation)
  • Signed evidence of what the agent did (audit trail you can hand to a compliance reviewer)

If your chatbot can take actions, you need both layers. See What is an agent firewall? for the network-layer pattern.

Practical chatbot security checklist

For any chatbot deployment that touches sensitive data or has tool access:

  • DLP scanning on every outbound request from the chatbot’s host
  • Prompt injection scanning on every piece of content the chatbot reads (RAG sources, tool responses, fetched URLs)
  • Tool allowlist with explicit deny for dangerous categories (file writes outside scoped paths, network egress to unknown destinations, environment dumps)
  • System prompts authored under the assumption that users will see them
  • Backend data access scoped at the data layer per user identity, not by trusting the chatbot to filter
  • Citations or source attribution on factual answers
  • Hash-chained audit log capturing prompts, responses, tool calls, and retrieved content
  • Rate limits per user and per tool to bound any single incident
  • Kill switch the operator can flip without restarting the chatbot
  • Signed evidence trail for any high-stakes action

If the chatbot is also an agent, add the controls in the agent security best practices checklist.

Ready to protect your own setup?