Learning how to break the latest AI models is important, but security researchers should also question whether there are enough guardrails to prevent the technology's misuse.

4 Min Read
Robot looking at the user.
Source: Atchariya wattanakul via Alamy Stock Photo

Samsung has banned some uses of ChatGPT, Ford Motor and Volkswagen shuttered their self-driving car firm, and a letter calling for a pause in training more powerful AI systems has garnered more than 25,000 signatures.

Overreactions? No, says Davi Ottenheimer, the vice president of trust and digital ethics at Inrupt, a startup creating digital identity and security solutions. A pause is needed to develop better approaches to testing, not just of the security, but the safety of machine-learning and artificial-intelligence models. These include ChatGPT, self-driving vehicles, and autonomous drones.

A steady stream of security researchers and technologists have already found ways to circumvent protections placed on AI systems, but society needs to have broader discussions about how to test and improve safety, say Ottenheimer, who will give a presentation on the topic at the RSA Conference in San Francisco next week.

"Especially from the context of a pentest, I'm supposed to go in and basically assess [an AI system] for safety, but what's missing is that we're not making a decision about whether it is safe, whether the application is acceptable," he says. A server's security, for example, does not speak to whether the system is safe "if you are running the server in a way that's unacceptable ... and we need to get to that level with AI."

With the introduction of ChatGPT in November, interest in artificial intelligence and machine learning — already surging due to applications in the data science field — took off. The eerie capabilities of the large language model (LLM) to seemingly understand human language and to synthesize coherent responses has led to a surge in proposed applications based on the technology and other forms of AI. ChatGPT has already been used to triage security incidents,  and a more advanced LLM forms the core of Microsoft's Security Copilot.

Yet the generative pre-trained transformer (GPT) is just one form of AI model, and all of them can have significant problems with bias, false positives, and other issues.

Exploiting Robots Is Easy

These shortcomings, and a general lack of explainability in AI models, means that any model can be attacked in ways that the creators may not have imagined, Inrupt's Ottenheimer will say in his RSA Conference presentation, Pentesting AI: How to Hunt a Robot. If AI models are quickly adopted without adequate study, they may make their way into critical application, where they could be attacked or fail spectacularly, he says.

"It's actually super easy to make them fail," Ottenheimer says. "Most people are looking at it as, 'Can I fool it in this one area?' but that's not the discussion you should be having, because — oh my god — you're using this technology in a totally inappropriate way."

Recent research demonstrates how simple attacking AI can be. Asking ChatGPT to mimic specific people, also known as assigning it a persona, can result in the AI model breaking its guardrails, according to a team of researchers from the Allen Institute for AI, the Georgia Institute of Technology, and Princeton University. The researchers had ChatGPT assume a variety of personas, and even a general persona — such as a "bad person" — can result in the large language model using toxic language, the team stated in a paper published on April 11.

With a plethora of products already shipping using ChatGPT, the researchers warn that it can unexpectedly result in harmful behavior.

"We hope that our findings inspire the broader AI community to rethink the efficacy of current safety guardrails and develop better techniques that lead to robust, safe, and trustworthy AI systems," the researchers stated in their paper.

Time to Turn Off the Tech & Do a Reset

Ottenheimer breaks down AI tests into six categories based on the traditional CIA triad: Confidentiality, integrity, and availability. False positives, for example, can lead to significant costs for society, such as emergency responders who are overtaxed because skiers' Apple watches are dialing 9-1-1 due to jarring runs down slopes. The academic research on using personas to jailbreak ChatGPT's content protections is similar to other research, which created a persona DAN (Do Anything Now) that allowed users to bypass safeguards.

Companies and researchers need to find ways to do a hard reset of such systems, to purge any toxic inputs, but at the same time teach the AI to take such actions in the future.

"You actually have to reset it, such that the harms don't happen again, or you have to reset it in a way that the harms can be undone," Ottenheimer says.

Finally, the threat to privacy is a significant threat as well as large language models use a vast data set, typically copied from the Internet, without the permissions of the publishers of that data. Italy has given OpenAI until the end of April to find ways to protect people's data and allow correction or deletion. And the efforts may grow as the European Data Protection Board (EDPB) has launched a task force dedicated to studying the issue and fostering cooperation.

About the Author(s)

Robert Lemos, Contributing Writer

Veteran technology journalist of more than 20 years. Former research engineer. Written for more than two dozen publications, including CNET News.com, Dark Reading, MIT's Technology Review, Popular Science, and Wired News. Five awards for journalism, including Best Deadline Journalism (Online) in 2003 for coverage of the Blaster worm. Crunches numbers on various trends using Python and R. Recent reports include analyses of the shortage in cybersecurity workers and annual vulnerability trends.

Keep up with the latest cybersecurity threats, newly discovered vulnerabilities, data breach information, and emerging trends. Delivered daily or weekly right to your email inbox.

You May Also Like


More Insights