Key Concepts
Jailbreaking
Bypassing a model's safety training through adversarial prompting
What it is
Jailbreaking is the practice of constructing prompts that circumvent an LLM's safety fine-tuning or system prompt instructions, eliciting outputs the model was trained to refuse. Common techniques include roleplay framing, hypothetical framing, prompt injection, and multi-step manipulation.
Safety training from RLHF is not robust: it is a statistical tendency, not a hard rule. With enough creativity, a determined user can often find framings that bypass it. System prompt content meant to stay secret (such as API keys or confidential instructions) is also frequently extractable through careful prompting.