Key Concepts
Jailbreaking
Bypassing a model's safety training through adversarial prompting
What it is
Jailbreaking is the practice of constructing prompts that circumvent an LLM's safety fine-tuning or system prompt instructions, eliciting outputs the model was trained to refuse. Common techniques include roleplay framing, hypothetical framing, prompt injection, and multi-step manipulation.
Safety training from RLHF is not robust: it is a statistical tendency, not a hard rule. With enough creativity, a determined user can often find framings that bypass it. System prompt content meant to stay secret (such as API keys or confidential instructions) is also frequently extractable through careful prompting.