Simple Hacking Technique Can Extract ChatGPT Training Data

Apparently all it takes to get a chatbot to start spilling its secrets is prompting it to repeat certain words like "poem" forever.

Jai Vijayan, Contributing Writer

December 1, 2023

5 Min Read

ChatGPT chatbot screen seen on smartphone and laptop display with Chat GPT login screen on the background.

Source: Ascannio via Shutterstock

Can getting ChatGPT to repeat the same word over and over again cause it to regurgitate large amounts of its training data, including personally identifiable information and other data scraped from the Web?

The answer is an emphatic yes, according to a team of researchers at Google DeepMind, Cornell University, and four other universities who tested the hugely popular generative AI chatbot's susceptibility to leaking data when prompted in a specific way.

'Poem' as a Trigger Word

In a report this week, the researchers described how they got ChatGPT to spew out memorized portions of its training data merely by prompting it to repeat words like "poem," "company," "send," "make," and "part" forever.

For example, when the researchers prompted ChatGPT to repeat the word "poem" forever, the chatbot initially responded by repeating the word as instructed. But after a few hundred times, ChatGPT began generating "often nonsensical" output, a small fraction of which included memorized training data such as an individual's email signature and personal contact information.

The researchers discovered that some words were better at getting the generative AI model to spill memorized data than others. For instance, prompting the chatbot to repeat the word "company" caused it to emit training data 164 times more often than other words, such as "know."

Data that the researchers were able to extract from ChatGPT in this manner included personally identifiable information on dozens of individuals; explicit content (when the researchers used an NSFW word as a prompt); verbatim paragraphs from books and poems (when the prompts contained the word "book" or "poem"); and URLs, unique user identifiers, bitcoin addresses, and programming code.

A Potentially Big Privacy Issue?

"Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples," the researchers wrote in their paper titled "Scalable Extraction of Training Data from (Production) Language Models."

"Our extrapolation to larger budgets suggests that dedicated adversaries could extract far more data," they wrote. The researchers estimated an adversary could extract 10 times more data with more queries.

Dark Reading's attempts to use some of the prompts in the study did not generate the output the researchers mentioned in their report. It's unclear if that's because ChatGPT creator OpenAI has addressed the underlying issues after the researchers disclosed their findings to the company in late August. OpenAI did not immediately respond to a Dark Reading request for comment.

The new research is the latest attempt to understand the privacy implications of developers using massive datasets scraped from different — and often not fully disclosed — sources to train their AI models.

Previous research has shown that large language models (LLMs) such as ChatGPT often can inadvertently memorize verbatim patterns and phrases in their training datasets. The tendency for such memorization increases with the size of the training data.

Researchers have shown how such memorized data is often discoverable in a model's output. Other researchers have shown how adversaries can use so-called divergence attacks to extract training data from an LLM. A divergence attack is one in which an adversary uses intentionally crafted prompts or inputs to get an LLM to generate outputs that diverge significantly from what it would typically produce.

In many of these studies, researchers have used open source models — where the training datasets and algorithms are known — to test the susceptibility of LLM to data memorization and leaks. The studies have also typically involved base AI models that have not been aligned to operate in a manner like an AI chatbot such as ChatGPT.

A Divergence Attack on ChatGPT

The latest study is an attempt to show how a divergence attack can work on a sophisticated closed, generative AI chatbot whose training data and algorithms remain mostly unknown. The study involved the researchers developing a way to get ChatGPT "to 'escape' out of its alignment training" and getting it to "behave like a base language model, outputting text in a typical Internet-text style." The prompting strategy they discovered (of getting ChatGPT to repeat the same word incessantly) caused precisely such an outcome, resulting in the model spewing out memorized data.

To verify that the data the model was generating was indeed training data, the researchers first built an auxiliary dataset containing some 9 terabytes of data from four of the largest LLM pre-training datasets — The Pile, RefinedWeb, RedPajama, and Dolma. They then compared the output data from ChatGPT against the auxiliary dataset and found numerous matches.

The researchers figured they were likely underestimating the extent of data memorization in ChatGPT because they were comparing the outputs of their prompting only against the 9-terabyte auxiliary dataset. So they took some 494 of ChatGPT's outputs from their prompts and manually searched for verbatim matches on Google. The exercise yielded 150 exact matches, compared to just 70 against the auxiliary dataset.

"We detect nearly twice as many model outputs are memorized in our manual search analysis than were detected in our (comparatively small)" auxiliary dataset, the researchers noted. "Our paper suggests that training data can easily be extracted from the best language models of the past few years through simple techniques."

The attack that the researchers described in their report is specific to ChatGPT and does not work against other LLMs. But the paper should help "warn practitioners that they should not train and deploy LLMs for any privacy-sensitive applications without extreme safeguards," they noted.

About the Author

Jai Vijayan, Contributing Writer

Jai Vijayan is a seasoned technology reporter with over 20 years of experience in IT trade journalism. He was most recently a Senior Editor at Computerworld, where he covered information security and data privacy issues for the publication. Over the course of his 20-year career at Computerworld, Jai also covered a variety of other technology topics, including big data, Hadoop, Internet of Things, e-voting, and data analytics. Prior to Computerworld, Jai covered technology issues for The Economic Times in Bangalore, India. Jai has a Master's degree in Statistics and lives in Naperville, Ill.

Related Topics

Related Topics

Related Topics

Related Topics

Simple Hacking Technique Can Extract ChatGPT Training Data

'Poem' as a Trigger Word

A Potentially Big Privacy Issue?

A Divergence Attack on ChatGPT

About the Author

Editor's Choice

Related Topics

Related Topics

Related Topics

Related Topics

<span class="ArticleBase-LargeTitle">Simple Hacking Technique Can Extract ChatGPT Training Data</span>Simple Hacking Technique Can Extract ChatGPT Training Data

'Poem' as a Trigger Word

A Potentially Big Privacy Issue?

A Divergence Attack on ChatGPT

About the Author

Editor's Choice

Simple Hacking Technique Can Extract ChatGPT Training Data