While some cybercriminals have bypassed guardrails to force legitimate AI models to turn bad, building their own malicious chatbot platforms and using open source models pose a greater threat.


Searching for ways to use large language models (LLMs) to streamline attacks and dodge defenses, cyberattackers face three choices: play a cat-and-mouse game to evade the guardrails put in place by the makers of major AI models like ChatGPT; spend the time and effort to train their own AI model; or conscript an uncensored open source model or something from the Dark Web to do their bidding.

Last month, underground developers appeared to have taken the first approach, releasing an AI-powered malicious front-end service, Dark Gemini, that likely modified prompts sent to legitimate LLMs to break restrictions on writing malicious programs and geolocating people in photographs. While many security professionals were not impressed with the capabilities demonstrated by the service, the chatbot did show what could be accomplished with little effort. 

While Dark Gemini has not made much of a splash, the systematic approach of creating a front end to bypass the guardrails restricting legitimate LLMs shows that a minimalist approach can deliver significant AI capabilities, such as text synthesis and translation, to make current attacks, such as phishing, more effective.

Offensive AI: Subvert, Buy, or Build?

Dark Gemini is the latest example of finding ways to trick "born good" AIs into doing the dirty work. In February, Microsoft and OpenAI warned that nation-state threat actors — including those from China, Iran, North Korea, and Russia — were using the firms' LLMs to augment the threat groups' operations. Earlier this month, researchers at AI security firm HiddenLayer noted that the guardrails set up to limit unsafe responses from Google's Gemini could easily be bypassed.

Yet using AI for more complex components of an attack — such as building sophisticated malware — will likely prove difficult enough with the hurdles created by current guardrails, says Dov Lerner, security research lead at threat intelligence firm Cybersixgill.

"To truly be effective, [any malware] needs to be evasive, it needs to dodge any sort of defenses that are there, and certainly, if it's malware being deployed on an enterprise system, then [it] needs to be very sophisticated," he says. "So I don't think AI can write [malware] programs right now."

Enter "born malicious" options for sale on the Dark Web. AI chatbots trained on content from the Dark Web have already proliferated, including FraudGPT, WormGPT, and DarkBART. Uncensored AI models based on Llama 2 and the hybrid Wizard-Vicuna model are also available as pre-trained downloads from repositories.

Other approaches, however, will likely lead to more serious threats. Cybercriminals with access to unrestricted AI models through Hugging Face and other AI-model repositories could create their own platforms with specific capabilities, says Dylan Davis, threat intelligence analyst at Recorded Future's Insikt Group.

"The impact unrestricted models will have on the threat landscape" will be significant, he says. "These models are easily accessible ..., easy to stand up, [and] they’re constantly improving — much better than most [Dark Web] models — and getting more efficient."

"This is typical of the cybersecurity arms race that repeats itself over and over," says Alex Cox, director of the threat intelligence team at LastPass. "With a disruptive technology like AI, you see quick adoption by both good and bad guys, with defensive mechanisms and processes being put in place by the good guys."

The AI Arms Race & Defense Strategies

As attackers continue to search for ways of using AI, defenders will be hard-pressed to maintain AI guardrails against attacks like prompt injection, says Recorded Future's Davis. To create defenses that are hard to bypass, companies need to conduct in-depth adversarial testing and create rules designed to filter out or censor both inputs and outputs — an expensive proposition, he says.

"Adversarial training is currently one of the more robust ways to [create] a resilient model, but there’s a massive tradeoff here between safety and model ability," Davis says. "The more adversarial training, the less 'useful' the models become, so most model creators will err on the side of usability, as any sane business would."
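The input/output filtering Davis describes can be sketched in a few lines. This is a deliberately minimal illustration, assuming a simple regex deny list; the patterns and function names are hypothetical, and production guardrails rely on trained classifiers rather than static rules, which attackers quickly learn to evade:

```python
import re

# Hypothetical deny-list patterns for illustration only. Real guardrails
# use trained safety classifiers, not static regexes.
BLOCKED_INPUT_PATTERNS = [
    r"ignore (all|previous) instructions",   # common prompt-injection phrasing
    r"pretend (you are|to be)",              # role-play jailbreak phrasing
]
BLOCKED_OUTPUT_PATTERNS = [
    r"-----BEGIN (RSA|OPENSSH) PRIVATE KEY-----",  # sensitive-data leak
]

def filter_prompt(prompt: str) -> bool:
    """Return True if the user prompt passes the input guardrail."""
    return not any(re.search(p, prompt, re.IGNORECASE)
                   for p in BLOCKED_INPUT_PATTERNS)

def filter_response(response: str) -> bool:
    """Return True if the model's output passes the output guardrail."""
    return not any(re.search(p, response, re.IGNORECASE)
                   for p in BLOCKED_OUTPUT_PATTERNS)
```

The sketch also shows why front ends like Dark Gemini work: a rule-based filter only blocks the phrasings its authors anticipated, so rewording or obfuscating a prompt slips past it — hence Davis's point that robust filtering requires expensive adversarial testing.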

Defending against underground developers creating their own models, or using pretrained open source models in ways that were not anticipated, is nearly impossible. In those cases, defenders have to treat such tools as part of the cybersecurity arms race and adapt as attackers gain new capabilities, LastPass' Cox says. 

"Guardrails and generative AI safety should be viewed like any other input validation process, and the protections need to be evaluated, re-evaluated, and red-teamed on a regular basis as capabilities improve and vulnerabilities are discovered," he says. "In that sense, it’s just another technology that needs to be managed in the vulnerability assessment world."

About the Author(s)

Robert Lemos, Contributing Writer

Veteran technology journalist of more than 20 years. Former research engineer. Written for more than two dozen publications, including CNET News.com, Dark Reading, MIT's Technology Review, Popular Science, and Wired News. Five awards for journalism, including Best Deadline Journalism (Online) in 2003 for coverage of the Blaster worm. Crunches numbers on various trends using Python and R. Recent reports include analyses of the shortage in cybersecurity workers and annual vulnerability trends.
