Virtual kidnapping is just one of many new artificial intelligence attack types that threat actors have begun deploying, as voice cloning emerges as a potent new imposter tool.


An incident earlier this year in which a cybercriminal attempted to extort $1 million from an Arizona-based woman whose daughter he claimed to have kidnapped is an early example of what security experts say is the growing danger from voice cloning enabled by artificial intelligence.

As part of the extortion attempt, the alleged kidnapper threatened to drug and physically abuse the girl while letting her distraught mother, Jennifer DeStefano, hear over the phone what appeared to be her daughter's yelling, crying, and frantic pleas for help.

But those pleas turned out to be deepfakes.

Deepfakes Create Identical Voices

In recounting details of the incident, DeStefano told police she had been convinced the individual on the phone had actually kidnapped her daughter because the alleged victim's voice sounded identical to her daughter's.

The incident is one of a rapidly growing number of instances in which cybercriminals have exploited AI-enabled tools to try to scam people. The problem has become so rampant that the FBI in early June issued a warning to consumers that criminals are manipulating benign videos and photos to target people in various kinds of extortion attempts.

"The FBI continues to receive reports from victims, including minor children and non-consenting adults, whose photos or videos were altered into explicit content," the agency warned. "The photos or videos are then publicly circulated on social media or pornographic websites, for the purpose of harassing victims or sextortion schemes." Scams involving deepfakes have added a new twist to so-called imposter scams, which last year cost US consumers a startling $2.6 billion in losses, according to the Federal Trade Commission.

In many instances, all attackers need to create deepfake video and audio of the kind that fooled DeStefano is a very small sample of biometric content, Trend Micro said in a report this week highlighting the threat. Even a few seconds of audio that an individual might post on social media platforms like Facebook, TikTok, and Instagram is enough for a threat actor to clone that individual's voice. Helpfully for them, a slew of AI tools is readily available, with many more on the way, that allow them to clone voices relatively easily using small voice samples harvested from various sources, according to Trend Micro researchers.

"Malicious actors who are able to create a deepfake voice of someone's child can use an input script (possibly one that's pulled from a movie script) to make the child appear to be crying, screaming, and in deep distress," Trend Micro researchers Craig Gibson and Josiah Hagen wrote in their report. "The malicious actors could then use this deepfake voice as proof that they have the targeted victim's child in their possession to pressure the victim into sending large ransom amounts."

A Plethora of AI Imposter Tools

Some examples of AI-enabled voice cloning tools include ElevenLabs' VoiceLab, Resemble.AI, Speechify, and VoiceCopy. Many of the tools are available only for a fee, though some offer freemium versions for trial. Even so, the cost to use these tools is often well under $50 a month, making them readily accessible to those engaged in imposter scams.

A large volume of videos, audio clips, and other identity-containing data is readily available on the Dark Web, and threat actors can correlate it with publicly available information to identify targets for virtual kidnapping scams like the one DeStefano experienced, as well as other imposter scams, Trend Micro noted. In fact, specific tools for enabling virtual kidnapping scams are emerging on the Dark Web that threat actors can use to hone their attacks, the researchers said in emailed comments to Dark Reading.

AI tools such as ChatGPT allow attackers to fuse data (including video, voice, and geolocation data) from disparate sources to essentially narrow down groups of people they can target in voice cloning or other scams. Much as social network analysis and propensities (SNAP) modeling allows marketers to determine the likelihood of customers taking specific actions, attackers can leverage tools like ChatGPT to focus on potential victims. "Attacks are enhanced by feeding user data, such as likes, into the prompt for content creation," the Trend Micro researchers say. As one example, they offer the prompt: "Convince this cat-loving woman, living alone who likes musicals, that their adult son has been kidnapped." Tools like ChatGPT allow imposters to generate, in a wholly automated way, the entire conversation that an imposter might use in a voice cloning scam, they add.

Expect also to see threat actors use SIM-jacking, in which they essentially hijack an individual's phone, in imposter scams such as virtual kidnapping. "When virtual kidnappers use this scheme on a supposedly kidnapped person, the phone number becomes unreachable, which can increase the chances of a successful ransom payout," Trend Micro said. Security professionals can also expect to see threat actors incorporate communication paths that are harder to block, like voice and video, in ransomware attacks and other cyber-extortion schemes, the security vendor said.

Cloning Vendors Cognizant of the Cyber-Risks

Several voice cloning vendors are themselves aware of the threat and appear to be taking measures to mitigate the risk. In Twitter messages earlier this year, ElevenLabs said it had seen an increasing number of voice-cloning misuse cases among users of its beta platform. In response, the company said it was considering adding account checks, such as full ID verification and verifying copyright to the voice. A third option would be to manually verify each request for cloning a voice sample.

Microsoft, which has developed an AI-enabled text-to-speech technology called Vall-E, has warned of the potential for threat actors to misuse its technology to spoof voice identification or impersonate specific speakers. "If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model," the company said.

Facebook parent Meta, which has developed a generative AI tool for speech called Voicebox, has decided to go slow in making the tool generally available, citing concerns over potential misuse. The company has claimed the technology uses a sophisticated new approach to cloning a voice from raw audio and an accompanying transcription. "There are many exciting use cases for generative speech models, but because of the potential risks of misuse, we are not making the Voicebox model or code publicly available at this time," Meta researchers wrote in a recent post describing the technology.

About the Author(s)

Jai Vijayan, Contributing Writer

Jai Vijayan is a seasoned technology reporter with over 20 years of experience in IT trade journalism. He was most recently a Senior Editor at Computerworld, where he covered information security and data privacy issues for the publication. Over the course of his 20-year career at Computerworld, Jai also covered a variety of other technology topics, including big data, Hadoop, Internet of Things, e-voting, and data analytics. Prior to Computerworld, Jai covered technology issues for The Economic Times in Bangalore, India. Jai has a Master's degree in Statistics and lives in Naperville, Ill.
