Dark Reading is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Application Security

10:00 AM
Gary McGraw Ph.D.

How to Secure Machine Learning

Part two of a series on avoiding potential security risks with ML.

When the field of software security was in its infancy 25 years ago, much hullabaloo was made over software vulnerabilities and their associated exploits. Hackers busied themselves exposing and exploiting bugs in everyday systems even as those systems were being rapidly migrated to the Internet. The popular press breathlessly covered each exploit. Nobody really concerned themselves with solving the underlying software engineering and configuration problems since finding and fixing the flood of individual bugs seemed like good progress. This hamster-wheel-like process came to be known as "penetrate and patch."

After several years of public bug whack-a-mole and debates over disclosure, it became clear that bad software was at the heart of computer security and that we would do well to figure out how to build secure software. That was 20 years ago, at the turn of the millennium. These days, software security is an important part of any progressive security program. To be sure, much work remains to be done in software security, but we really do know what that work should be.

Though machine learning (ML) - and artificial intelligence in general - has been around even longer than computer security, until very recently not much attention has been paid to the security of ML systems themselves. Over the last few years, a number of spectacular theoretical attacks on ML systems have led to the same kind of breathless press coverage that we experienced during the early days of computer security. It all seems strikingly familiar. Exploit a bug, hype things up in the media, lather, rinse, repeat.

Machine learning appears to have made impressive progress on many tasks, including image classification, machine translation, autonomous vehicle control, and playing complex games such as chess, Go, and Atari video games. This has led to hyperbolic popular press coverage of AI and has elevated deep learning to an almost magical status in the eyes of the public. ML, especially of the deep learning sort, is not magic, however. It is simply sophisticated associative learning technology based on algorithms developed over the last 30 years. In fact, much of the recent progress in the field can be attributed to faster CPUs and much larger data sets rather than to any particular scientific breakthrough.

ML has become so popular that its application, though often poorly understood and partially motivated by hype, is exploding. In my view, this is not necessarily a good thing. I am concerned with the systematic risk invoked by adopting ML in a haphazard fashion. My current research with the Berryville Institute of Machine Learning (BIML) is focused on understanding and categorizing security engineering risks introduced by ML at the design level [1].

We need to do better work to secure our ML systems, moving well beyond attack-of-the-day and penetrate-and-patch toward real security engineering.

Top 5 Machine Learning Security Risks

Building security in for machine learning presents an interesting set of challenges. Primary among these is the fact that in any machine learning system, data plays an outsized role in system security. In fact, my view is that the datasets an ML system is trained, tested, and ultimately operated on account for 60% or more of overall security risk, while the learning algorithms and other technical aspects of the system (including source code) account for the rest.

For that reason, in my work with BIML, I have focused my attention on architectural risk analysis, sometimes called an ARA (touchpoint number two for software security), as the most effective way to get started. This stands in contrast to starting with touchpoint one (code review), but the reason should be mostly obvious: if most of the risk resides in the data rather than in the code, reviewing code alone will miss it.

In a January 2020 report titled "An Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning," BIML published an ARA as an important first step in its mission to help engineers and researchers secure ML systems. In the report, we painstakingly identified 78 risks. Of those 78 risks, I present the top five here. (For a treatment of the remaining 73 risks and a set of scientific references, please see the report itself.)

1. Adversarial Examples

Probably the most commonly discussed attacks against machine learning have come to be known as adversarial examples. The basic idea is to fool a machine learning system by providing malicious input, often involving very small perturbations, that causes the system to make a false prediction or categorization. Though the coverage and resulting attention might be disproportionately large, swamping other important ML risks, adversarial examples are very much real.
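To make the idea concrete, here is a minimal sketch of a small-perturbation attack on a toy linear classifier. It is an illustration only and does not come from the BIML report: the weights, bias, input, and the helper names `predict` and `craft_adversarial` are all hypothetical, and real attacks target far more complex models.

```python
def predict(weights, bias, features):
    """Linear classifier: return 1 if the weighted score is positive, else 0."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def craft_adversarial(weights, features, epsilon):
    """Shift every feature by at most epsilon in the direction that lowers the score."""
    return [f - epsilon * sign(w) for w, f in zip(weights, features)]


# Hypothetical model parameters and input, chosen only for illustration.
WEIGHTS = [0.5, -0.25, 0.75, -0.5]
BIAS = -0.125
original = [0.5, 0.25, 0.5, 0.5]

adversarial = craft_adversarial(WEIGHTS, original, epsilon=0.125)
largest_change = max(abs(a - o) for a, o in zip(adversarial, original))

print("clean input class:    ", predict(WEIGHTS, BIAS, original))
print("perturbed input class:", predict(WEIGHTS, BIAS, adversarial))
print("largest feature change:", largest_change)
```

With these made-up numbers, no feature moves by more than 0.125, yet the predicted class flips from 1 to 0.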

2. Data Poisoning

Data plays an outsized role in the security of an ML system. That's because an ML system learns to do what it does directly from data. If an attacker can intentionally manipulate the data being used by an ML system in a coordinated fashion, the entire system can be compromised. Data-poisoning attacks require special attention. In particular, ML engineers should consider what fraction of the training data an attacker can control and to what extent.

Several data sources are subject to poisoning attacks, including raw data in the world and the datasets that are assembled to train, test, and validate an ML system. In such an attack, an attacker intentionally manipulates the data, possibly in a coordinated fashion, to cause ML training to go awry. In some sense, this is a risk related both to data sensitivity and to the fact that the data themselves carry so much of the water in an ML system.
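The question of what fraction of the training data an attacker controls can be explored with a toy experiment. The sketch below is my own illustration, not the report's method: it trains a deliberately simple nearest-centroid classifier on one-dimensional data, then retrains it after an attacker injects a few mislabeled points. The dataset, the injected points, and the function names are all hypothetical.

```python
from statistics import mean


def train_centroids(samples):
    """Compute the mean value of each class from (value, label) pairs."""
    labels = {label for _, label in samples}
    return {
        label: mean(value for value, lab in samples if lab == label)
        for label in labels
    }


def predict(centroids, value):
    """Assign value to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


# Hypothetical clean training set: class 0 clusters near 2, class 1 near 8.
clean = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]

# Attacker-controlled points: low values deliberately labeled as class 1.
injected = [(3.0, 1), (3.5, 1), (4.0, 1)]
poisoned = clean + injected
attacker_fraction = len(injected) / len(poisoned)

probe = 4.0
clean_prediction = predict(train_centroids(clean), probe)
poisoned_prediction = predict(train_centroids(poisoned), probe)

print("fraction of data controlled by attacker:", attacker_fraction)
print("prediction with clean training data:   ", clean_prediction)
print("prediction with poisoned training data:", poisoned_prediction)
```

In this toy setup, an attacker who controls one third of the training data drags the class 1 centroid far enough that the probe value changes class. Varying the number of injected points gives a rough feel for how sensitive a given learner is.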

3. Online System Manipulation

An ML system is said to be "online" when it continues to learn during operational use, modifying its behavior over time. In this case, a clever attacker can deliberately nudge the still-learning system in the wrong direction through system input and slowly "retrain" the ML system to do the wrong thing. Note that such an attack can be both subtle and reasonably easy to carry out. This risk is complex, demanding that ML engineers consider data provenance, algorithm choice, and system operations in order to properly address it.
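A minimal sketch of this kind of slow nudging follows. It assumes a hypothetical online anomaly detector that keeps a running baseline and learns from every input it accepts; the class `OnlineDetector`, its parameters, and the numbers are invented for illustration and are not drawn from the BIML report.

```python
class OnlineDetector:
    """Toy online learner: accepts values near a baseline and learns from them."""

    def __init__(self, baseline, tolerance, rate):
        self.baseline = baseline    # current notion of "normal"
        self.tolerance = tolerance  # maximum accepted distance from the baseline
        self.rate = rate            # how strongly each accepted input shifts the baseline

    def is_normal(self, value):
        """Check a value against the current baseline without learning from it."""
        return abs(value - self.baseline) <= self.tolerance

    def observe(self, value):
        """Process an operational input; accepted inputs update the baseline."""
        accepted = self.is_normal(value)
        if accepted:
            self.baseline += self.rate * (value - self.baseline)
        return accepted


TARGET = 20.0  # value the attacker ultimately wants treated as normal

detector = OnlineDetector(baseline=10.0, tolerance=2.0, rate=0.5)
rejected_at_start = not detector.is_normal(TARGET)

# The attacker repeatedly submits the largest value that is still accepted,
# so every individual input looks unremarkable.
steps = 0
while not detector.is_normal(TARGET):
    detector.observe(detector.baseline + detector.tolerance)
    steps += 1

print("target rejected at start:", rejected_at_start)
print("inputs needed to shift the baseline:", steps)
print("final baseline:", detector.baseline)
```

Each input in the loop sits inside the accepted range, which is what makes the drift hard to spot. Controls such as tracking data provenance or bounding how far the baseline may move over time address different parts of this problem.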

4. Transfer-Learning Attack

Many ML systems are constructed by tuning an already trained base model so that its somewhat generic capabilities are fine-tuned with a round of specialized training. A transfer attack presents an important risk in this situation. In cases where the pretrained model is widely available, an attacker may be able to use it to devise attacks that are robust enough to succeed against your tuned, task-specific model, even though that model is unavailable to the attacker. You should also consider whether the ML system you are fine-tuning could possibly be a Trojan that includes sneaky, unanticipated ML behavior.

ML systems are intentionally reused in transfer situations, so the risk of transfer outside of intended use applies. Groups posting models for transfer would do well to describe precisely what their systems do and how they control the risks in this document.
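The first half of this risk, an attack devised on the public model carrying over to a private tuned one, can be sketched with two toy linear classifiers. Everything here is an assumption made for illustration: `PUBLIC_BASE` stands in for a widely available pretrained model, `PRIVATE_TUNED` for a fine-tuned variant whose weights the attacker never sees, and the numbers are invented.

```python
def classify(weights, bias, features):
    """Linear classifier: return 1 if the weighted score is positive, else 0."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def attack_with_surrogate(surrogate_weights, features, epsilon):
    """Perturb the input using only the public (surrogate) model's weights."""
    steps = [
        epsilon if w > 0 else -epsilon if w < 0 else 0.0
        for w in surrogate_weights
    ]
    return [f - s for f, s in zip(features, steps)]


# Hypothetical weights: the tuned model differs only slightly from its base.
PUBLIC_BASE = [0.5, -0.25, 0.75, -0.5]
PRIVATE_TUNED = [0.5, -0.375, 0.75, -0.375]
BIAS = -0.125

clean_input = [0.5, 0.25, 0.5, 0.5]
crafted = attack_with_surrogate(PUBLIC_BASE, clean_input, epsilon=0.125)

tuned_before = classify(PRIVATE_TUNED, BIAS, clean_input)
tuned_after = classify(PRIVATE_TUNED, BIAS, crafted)

print("tuned model on clean input:  ", tuned_before)
print("tuned model on crafted input:", tuned_after)
```

The crafted input is computed without touching `PRIVATE_TUNED`, yet in this toy case it still changes the tuned model's answer, because fine-tuning left the two models closely related.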

5. Data Confidentiality

Data protection is difficult enough without throwing ML into the mix. One unique challenge in ML is protecting sensitive or confidential data that, through training, are built right into a model. Subtle but effective extraction attacks against an ML system's data are an important category of risk.

Preserving data confidentiality in an ML system is more challenging than in a standard computing situation. That's because an ML system that is trained up on confidential or sensitive data will have some aspects of those data built right into it through training. Attacks to extract sensitive and confidential information from ML systems (indirectly through normal use) are well known. Note that even sub-symbolic "feature" extraction may be useful since that can be used to hone adversarial input attacks.
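One simple form of this leakage is membership inference: learning whether a specific record was in the training set purely by querying the model. The sketch below uses a deliberately leaky toy model (it memorizes its training records and reports a distance-based confidence). The records, the confidence formula, and the function names are hypothetical, and real extraction attacks are considerably more subtle.

```python
import math


def train(records):
    """Toy "model" that simply memorizes its training records."""
    return list(records)


def confidence(model, query):
    """Return a score in (0, 1] that is higher the closer the query is to training data."""
    nearest = min(math.dist(query, record) for record in model)
    return 1.0 / (1.0 + nearest)


def infer_membership(model, candidate, threshold=0.99):
    """Attacker's guess: very high confidence suggests the record was trained on."""
    return confidence(model, candidate) >= threshold


# Hypothetical sensitive records: (age, systolic blood pressure).
training_records = [(34, 120.0), (51, 140.0), (29, 110.0)]
model = train(training_records)

member_guess = infer_membership(model, (51, 140.0))
non_member_guess = infer_membership(model, (40, 125.0))

print("record (51, 140.0) inferred as training data:", member_guess)
print("record (40, 125.0) inferred as training data:", non_member_guess)
```

The attacker here uses nothing but the normal query interface, which is the point: the confidential data are exposed indirectly through ordinary use of the model.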

Securing ML

BIML's basic architectural risk analysis identifies 78 specific risks associated with a generic ML system. The report organizes the risks by common component and also includes some system-wide risks. The BIML risk analysis results are meant to help ML systems engineers in securing their own particular ML systems.

In my view, ML systems engineers can devise and field a more secure ML system by carefully considering the BIML risks while designing, implementing, and fielding their own specific ML system. In security, the devil is in the details, and BIML attempts to provide as much detail as possible regarding ML security risks and some basic controls.

1. G. McGraw, H. Figueroa, V. Shepardson, and R. Bonett, "An Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning." Technical Report from the Berryville Institute of Machine Learning (BIML), https://berryvilleiml.com/results/ara.pdf (accessed 6.3.20).



Gary McGraw is co-founder of the Berryville Institute of Machine Learning. He is a globally recognized authority on software security and the author of eight best-selling books on this topic. His titles include Software Security, Exploiting Software, Building Secure Software, ...

User Rank: Author
7/17/2020 | 6:36:48 PM
Very interesting article
This is a very interesting list of security risks to a machine learning system. I particularly liked Gary McGraw's point about how the problem of data confidentiality takes on another dimension entirely when some aspects of the data are retained in trained models. I have seen firsthand how such risks become an insurmountable barrier to many AI projects.