Researchers Use AI to Jailbreak ChatGPT, Other LLMs

"Tree of Attacks With Pruning" is the latest in a growing string of methods for eliciting unintended behavior from a large language model.

Large language model concept
Source: Ole.CNX via Shutterstock

The exploding use of large language models in industry and across organizations has sparked a flurry of research activity focused on testing how susceptible LLMs are to generating harmful and biased content when prompted in specific ways.

The latest example is a new paper from researchers at Robust Intelligence and Yale University that describes a completely automated way to get even state-of-the-art black box LLMs to escape guardrails put in place by their creators and generate toxic content.

Tree of Attacks With Pruning

Black box LLMs are large language models, such as those behind ChatGPT, whose architecture, datasets, training methodologies, and other details are not publicly known.

The new method, which the researchers have dubbed Tree of Attacks with Pruning (TAP), uses an unaligned LLM to "jailbreak" an aligned LLM, that is, to get it to breach its guardrails, quickly and with a high success rate. An aligned LLM, such as the one behind ChatGPT and other AI chatbots, is explicitly designed to minimize potential for harm and would not, for example, normally respond to a request for information on how to build a bomb. An unaligned LLM is optimized for accuracy and generally has fewer such constraints, or none at all.

With TAP, the researchers have shown how they can get an unaligned LLM to prompt an aligned target LLM on a potentially harmful topic and then use the target's response to keep refining the original prompt. The process continues until one of the generated prompts jailbreaks the target LLM and gets it to produce the requested information. The researchers found that they were able to use small LLMs to jailbreak even the latest aligned LLMs.
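The refine-on-response loop described above is a general pattern: send a candidate, inspect the reply, adjust, and stop on success or when a query budget runs out. The sketch below illustrates only that control flow, using a harmless number-guessing stand-in. It is not the researchers' code. The function names, the `max_queries` budget of 20, and the toy responder are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple, TypeVar

C = TypeVar("C")  # candidate type (hypothetical; a prompt would be a str)


def refine_until_success(
    initial: C,
    respond: Callable[[C], str],    # hypothetical stand-in for the queried system
    refine: Callable[[C, str], C],  # hypothetical: builds the next candidate from the reply
    succeeded: Callable[[str], bool],
    max_queries: int = 20,          # assumed budget, not a figure from the paper
) -> Tuple[Optional[C], int]:
    """Return (successful candidate or None, number of queries used)."""
    candidate = initial
    for used in range(1, max_queries + 1):
        response = respond(candidate)
        if succeeded(response):
            return candidate, used
        # Use the reply to steer the next attempt.
        candidate = refine(candidate, response)
    return None, max_queries


# Toy stand-in: the candidate is a (low, high) range and the guess is its midpoint.
SECRET = 37


def toy_respond(bounds: Tuple[int, int]) -> str:
    guess = (bounds[0] + bounds[1]) // 2
    if guess == SECRET:
        return "ok"
    return "higher" if guess < SECRET else "lower"


def toy_refine(bounds: Tuple[int, int], response: str) -> Tuple[int, int]:
    guess = (bounds[0] + bounds[1]) // 2
    return (guess + 1, bounds[1]) if response == "higher" else (bounds[0], guess - 1)


result = refine_until_success((0, 100), toy_respond, toy_refine, lambda r: r == "ok")
print(result)
```

In this toy run the loop narrows the range twice and succeeds on the third query. The point is only that each reply feeds the next attempt, which is what keeps the number of queries small.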

"In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts using only a small number of queries," the researchers wrote. "This significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks."

Rapidly Proliferating Research Interest

The new research is the latest among a growing number of studies in recent months that show how LLMs can be coaxed into unintended behavior, such as revealing training data and sensitive information, when given the right prompt. Some of the research has focused on getting LLMs to reveal potentially harmful or unintended information by directly interacting with them via engineered prompts. Other studies have shown how adversaries can elicit the same behavior from a target LLM via indirect prompts hidden in text, audio, and image samples in data the model would likely retrieve when responding to a user input.

Such prompt injection methods for getting a model to diverge from intended behavior have relied, at least to some extent, on manual interaction. And the output that the prompts have generated has often been nonsensical. The new TAP research is a refinement of earlier studies that show how these attacks can be implemented in a completely automated, more reliable way.

In October, researchers at the University of Pennsylvania released details of a new algorithm they developed for jailbreaking an LLM using another LLM, called Prompt Automatic Iterative Refinement (PAIR). "At a high level, PAIR pits two black-box LLMs — which we call the attacker and the target — against one another; the attacker model is programmed to creatively discover candidate prompts which will jailbreak the target model," the researchers noted. According to them, in tests PAIR was capable of triggering "semantically meaningful," or human-interpretable, jailbreaks in a mere 20 queries. The researchers described that as a 10,000-fold improvement over previous jailbreak techniques.

Highly Effective

The new TAP method that the researchers at Robust Intelligence and Yale developed is different in that it uses what the researchers call a "tree-of-thought" reasoning process.

"Crucially, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks," the researchers wrote. "Using tree-of-thought reasoning allows TAP to navigate a large search space of prompts and pruning reduces the total number of queries sent to the target."
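To make the pruning idea concrete, here is a minimal sketch of a breadth-limited tree search that scores candidates with a cheap estimate and discards the weakest before spending any queries on them. It is a generic illustration on a harmless toy string search, not the TAP implementation. Every name in it (`expand`, `cheap_score`, `query_target`, `is_success`, `width`, `depth`) is a hypothetical placeholder, and the actual method may prune differently.

```python
from typing import Callable, List, Optional, Tuple


def tree_search_with_pruning(
    root: str,
    expand: Callable[[str], List[str]],   # hypothetical: proposes variants of a candidate
    cheap_score: Callable[[str], float],  # hypothetical: pre-query estimate used for pruning
    query_target: Callable[[str], str],   # hypothetical: the expensive call being rationed
    is_success: Callable[[str], bool],    # hypothetical: judges the target's reply
    width: int = 2,
    depth: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (successful candidate or None, number of target queries used)."""
    frontier = [root]
    queries = 0
    for _ in range(depth):
        # Branch: every surviving node proposes several children.
        children = [child for node in frontier for child in expand(node)]
        # Prune before querying: keep only the `width` most promising children.
        children.sort(key=cheap_score, reverse=True)
        survivors = children[:width]
        for candidate in survivors:
            queries += 1
            if is_success(query_target(candidate)):
                return candidate, queries
        frontier = survivors
    return None, queries


# Toy stand-in: search for a short string, scoring candidates by matching prefix length.
SECRET = "cab"


def toy_expand(s: str) -> List[str]:
    return [s + ch for ch in "abc"]


def toy_score(s: str) -> float:
    n = 0
    for a, b in zip(s, SECRET):
        if a != b:
            break
        n += 1
    return float(n)


result = tree_search_with_pruning(
    "", toy_expand, toy_score, str.upper, lambda reply: reply == "CAB", width=1, depth=3
)
print(result)
```

With `width=1`, the toy search reaches the goal after three target queries, whereas querying every node of the same tree could take up to 39 (3 + 9 + 27). That gap is the saving the researchers attribute to pruning before sending prompts to the target.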

Such research is important because many organizations are rushing to integrate LLM technologies into their applications and operations without much thought to the potential security and privacy implications. As the TAP researchers noted in their report, many of the LLMs depend on guardrails that model developers implement to protect against unintended behavior. "However, even with the considerable time and effort spent by the likes of OpenAI, Google, and Meta, these guardrails are not resilient enough to protect enterprises and their users today," the researchers said. "Concerns surrounding model risk, biases, and potential adversarial exploits have come to the forefront."

About the Author(s)

Jai Vijayan, Contributing Writer

Jai Vijayan is a seasoned technology reporter with over 20 years of experience in IT trade journalism. He was most recently a Senior Editor at Computerworld, where he covered information security and data privacy issues for the publication. Over the course of his 20-year career at Computerworld, Jai also covered a variety of other technology topics, including big data, Hadoop, Internet of Things, e-voting, and data analytics. Prior to Computerworld, Jai covered technology issues for The Economic Times in Bangalore, India. Jai has a Master's degree in Statistics and lives in Naperville, Ill.
