Jailbreaking risk for AI
Last updated: Dec 12, 2024
Risks associated with input
Inference
Multi-category
New to generative AI

Description

A jailbreaking attack attempts to break through the guardrails that are established in the model so that the model performs restricted actions.

Why is jailbreaking a concern for foundation models?

Jailbreaking attacks can be used to alter model behavior to benefit the attacker. If such attacks are not properly controlled, business entities can face fines, reputational harm, and other legal consequences.

Example

Bypassing LLM guardrails

Researchers at Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI claim to have discovered a simple prompt addendum that allowed them to trick models into generating biased, false, and otherwise toxic information. The researchers showed that they could circumvent model guardrails in a more automated way. These attacks were shown to be effective against a wide range of open source and proprietary products, including ChatGPT, Google Bard, Meta’s LLaMA, Anthropic’s Claude, and others.

Parent topic: AI risk atlas

We provide examples covered by the press to help explain many of the risks of foundation models. Many of these events are either still evolving or have been resolved, and referencing them can help the reader understand the potential risks and work towards mitigations. These examples are highlighted for illustrative purposes only.
