Getting Started
Security for AI agents is fundamentally different from traditional application security. Your agent interprets natural language, which means every user input is a potentially executable instruction. Prompt injection — where a user crafts input that overrides or subverts the agent's instructions — is the defining vulnerability of LLM-based systems.
The OWASP Top 10 for LLM Applications is essential reading before building any production agent. It covers injection, data leakage, insecure output handling, and more. But understanding the taxonomy is just the start; you need to implement concrete defenses.
Key Concepts
Prompt injection defense requires multiple layers. No single technique is sufficient. Start with input validation that flags known patterns:
import re

INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",
    r"you\s+are\s+now\s+",
    r"system\s*prompt",
    r"disregard\s+(all\s+)?(prior|previous|above)",
    r"new\s+instruction",
]

def detect_injection(user_input: str) -> bool:
    """Check input against known injection patterns."""
    text = user_input.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, text):
            return True
    return False
Output filtering prevents your agent from leaking sensitive data. Even if your agent was not trained on private data, it might echo back PII from user inputs, reveal internal system details, or generate content that looks like real credentials:
PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    "api_key": r"\b(sk|pk)[-_][a-zA-Z0-9]{32,}\b",
}

def redact_output(text: str) -> str:
    """Replace anything matching a PII pattern with a redaction marker."""
    for name, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f"[REDACTED_{name.upper()}]", text)
    return text
RBAC (Role-Based Access Control) restricts which tools and data sources an agent can access based on the user's role. A customer support agent should not have access to the same tools as an admin agent. Define permissions as explicit allow-lists, never deny-lists.
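An allow-list-based RBAC check can be sketched as follows. The role names and tool names here are illustrative, not part of any particular framework:

```python
# Hypothetical role-to-tool mapping. Each role gets an explicit allow-list;
# anything not listed is denied by default.
ROLE_TOOL_ALLOWLIST = {
    "support": {"search_kb", "create_ticket"},
    "admin": {"search_kb", "create_ticket", "delete_user", "export_data"},
}

def authorized_tools(role: str) -> set:
    """Return the allow-list for a role; unknown roles get nothing."""
    return ROLE_TOOL_ALLOWLIST.get(role, set())

def can_invoke(role: str, tool: str) -> bool:
    """A tool call is permitted only if it appears in the role's allow-list."""
    return tool in authorized_tools(role)
```

Because unknown roles map to an empty set, a misconfigured or missing role fails closed rather than open — the key property that allow-lists have and deny-lists lack.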
Audit logging records every agent interaction with enough context to reconstruct what happened. Include timestamps, user identifiers, hashed inputs, security flags triggered, and the tools invoked. This is not optional in regulated environments — it is a compliance requirement.
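One way to build such a record, hashing the raw input so the log itself never stores user data verbatim (the field names are illustrative):

```python
import hashlib
import json
import time

def audit_record(user_id: str, user_input: str,
                 flags: list, tools: list) -> str:
    """Build one structured JSON audit entry for an agent interaction.

    The input is stored as a SHA-256 hash, so the log can prove what was
    submitted without retaining the content itself.
    """
    record = {
        "timestamp": time.time(),
        "user_id": user_id,
        "input_sha256": hashlib.sha256(user_input.encode("utf-8")).hexdigest(),
        "security_flags": flags,
        "tools_invoked": tools,
    }
    return json.dumps(record)
```

Emitting one JSON object per line makes the log trivially ingestible by standard log pipelines.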
Hands-On Practice
Build a SecurityLayer class that wraps your agent. Every request passes through validate_input() before reaching the agent and filter_output() before returning to the user. Add structured JSON audit logging that captures both the security checks performed and their results. Test your layer with known prompt injection payloads from the OWASP testing guide.
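A minimal sketch of such a wrapper, reusing one pattern from each list above for brevity (the agent is any callable taking and returning a string; names are illustrative):

```python
import hashlib
import json
import re
import time

INJECTION_PATTERNS = [r"ignore\s+(all\s+)?previous\s+instructions"]
PII_PATTERNS = {"ssn": r"\b\d{3}-\d{2}-\d{4}\b"}

class SecurityLayer:
    """Wraps an agent callable with input validation, output filtering,
    and structured JSON audit logging."""

    def __init__(self, agent):
        self.agent = agent
        self.audit_log = []  # in practice, write to durable storage

    def validate_input(self, text: str) -> list:
        """Return the list of injection patterns the input matched."""
        lowered = text.lower()
        return [p for p in INJECTION_PATTERNS if re.search(p, lowered)]

    def filter_output(self, text: str) -> str:
        """Redact anything in the response that matches a PII pattern."""
        for name, pattern in PII_PATTERNS.items():
            text = re.sub(pattern, f"[REDACTED_{name.upper()}]", text)
        return text

    def handle(self, user_id: str, user_input: str) -> str:
        flags = self.validate_input(user_input)
        # Block flagged requests before they ever reach the agent.
        response = "Request blocked." if flags else self.agent(user_input)
        response = self.filter_output(response)
        self.audit_log.append(json.dumps({
            "timestamp": time.time(),
            "user_id": user_id,
            "input_sha256": hashlib.sha256(user_input.encode("utf-8")).hexdigest(),
            "security_flags": flags,
        }))
        return response
```

Note that filtering runs on every response, including block messages, so a bypass in one layer does not disable the others.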