
Application Security

6/19/2020 10:00 AM
Gary McGraw Ph.D.
Expert Insights

How to Secure Machine Learning

Part two of a series on avoiding potential security risks with ML.

When the field of software security was in its infancy 25 years ago, much hullabaloo was made over software vulnerabilities and their associated exploits. Hackers busied themselves exposing and exploiting bugs in everyday systems even as those systems were being rapidly migrated to the Internet. The popular press breathlessly covered each exploit. Nobody really concerned themselves with solving the underlying software engineering and configuration problems since finding and fixing the flood of individual bugs seemed like good progress. This hamster-wheel-like process came to be known as "penetrate and patch."

After several years of public bug whack-a-mole and debates over disclosure, it became clear that bad software was at the heart of computer security and that we would do well to figure out how to build secure software. That was 20 years ago, at the turn of the millennium. These days, software security is an important part of any progressive security program. To be sure, much work remains to be done in software security, but we really do know what that work should be.

Though machine learning (ML) - and artificial intelligence in general - has been around even longer than computer security, until very recently not much attention has been paid to the security of ML systems themselves. Over the last few years, a number of spectacular theoretical attacks on ML systems have led to the same kind of breathless press coverage that we experienced during the early days of computer security. It all seems strikingly familiar. Exploit a bug, hype things up in the media, lather, rinse, repeat.

Machine Learning appears to have made impressive progress on many tasks including image classification, machine translation, autonomous vehicle control, playing complex games including chess, Go, and Atari video games, and more. This has led to hyperbolic popular press coverage of AI, and has elevated deep learning to an almost magical status in the eyes of the public. ML, especially of the deep learning sort, is not magic, however. It is simply sophisticated, associative learning technology based on algorithms developed over the last 30 years. In fact, much of the recent progress in the field can be attributed to faster CPUs and much larger data sets rather than to any particular scientific breakthrough.

ML has become so popular that its application, though often poorly understood and partially motivated by hype, is exploding. In my view, this is not necessarily a good thing. I am concerned with the systemic risk invoked by adopting ML in a haphazard fashion. My current research with the Berryville Institute of Machine Learning (BIML) is focused on understanding and categorizing security engineering risks introduced by ML at the design level. [1]

We need to do better work to secure our ML systems, moving well beyond "attack of the day" and penetrate-and-patch toward real security engineering.

Top 5 Machine Learning Security Risks

Building security in for machine learning presents an interesting set of challenges. Primary among these is the fact that in any machine learning system, data plays an outsized role in system security. In fact, my view is that the datasets an ML system is trained, tested, and ultimately operated on account for 60% or more of overall security risk, while the learning algorithms and other technical aspects of the system (including source code) account for the rest.

For that reason, in my work with BIML, I have focused my attention on architectural risk analysis, sometimes called an ARA (touchpoint number two for software security), as the most effective approach to get started with. This stands in contrast to starting with touchpoint one (code review), but the reasons why should be mostly obvious.

In a January 2020 report titled "An Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning," BIML published an ARA as an important first step in its mission to help engineers and researchers secure ML systems. In the report, we painstakingly identified 78 risks. Of those 78 risks, I present the top five here. (For a treatment of the remaining 73 risks and a set of scientific references, please see the report itself.)

1. Adversarial Examples

Probably the most commonly discussed attacks against machine learning have come to be known as adversarial examples. The basic idea is to fool a machine learning system by providing malicious input, often involving very small perturbations, that causes the system to make a false prediction or categorization. Though coverage and the resulting attention might be disproportionately large, swamping out other important ML risks, adversarial examples are very much real.
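To make the idea concrete, here is a minimal sketch, not taken from the BIML report, of a fast-gradient-style perturbation against a toy linear classifier. The model, the input, the `predict` helper, and the `epsilon` budget are all illustrative assumptions; real attacks target far more complex models but follow the same logic of stepping each feature against the model's gradient.

```python
# Illustrative sketch only: a tiny fast-gradient-style adversarial example
# against a hand-rolled linear classifier (score = w.x, predict 1 if positive).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=20)            # "trained" model weights (assumed known here)

def predict(x):
    return int(w @ x > 0)

x = 0.05 * w                       # a legitimate input that scores weakly as class 1
epsilon = 0.3                      # small per-feature perturbation budget

# For a linear score, the gradient with respect to x is just w, so stepping
# each feature against sign(w) pushes the score down as fast as possible.
x_adv = x - epsilon * np.sign(w)

print("clean prediction      :", predict(x))       # 1
print("max per-feature change:", np.max(np.abs(x_adv - x)))
print("adversarial prediction:", predict(x_adv))   # usually flips to 0
```

The perturbation is bounded per feature, which is why such inputs can look essentially unchanged to a human while flipping the model's decision.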

2. Data Poisoning

Data plays an outsized role in the security of an ML system. That's because an ML system learns to do what it does directly from data. If an attacker can intentionally manipulate the data being used by an ML system in a coordinated fashion, the entire system can be compromised. Data-poisoning attacks require special attention. In particular, ML engineers should consider what fraction of the training data an attacker can control and to what extent.

Several data sources are subject to poisoning attacks, in which an attacker intentionally manipulates data, including raw data in the world and the datasets assembled to train, test, and validate an ML system, possibly in a coordinated fashion, to cause ML training to go awry. In some sense, this is a risk related both to data sensitivity and to the fact that the data themselves carry so much of the water in an ML system.
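The question of how much training data an attacker controls can be explored directly. The following is a minimal, hypothetical sketch of a label-flipping poisoning attack using scikit-learn; the dataset, the `poisoned_accuracy` helper, and the chosen fractions are illustrative assumptions, not part of the BIML analysis.

```python
# Illustrative sketch only: label-flipping poisoning on a controlled fraction
# of the training set, to show how accuracy degrades as attacker control grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def poisoned_accuracy(fraction_controlled):
    """Train on data where the attacker flips labels on the fraction they control."""
    y_poisoned = y_train.copy()
    n_poison = int(fraction_controlled * len(y_poisoned))
    idx = np.random.default_rng(0).choice(len(y_poisoned), n_poison, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]      # flip binary labels
    model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
    return model.score(X_test, y_test)

for frac in (0.0, 0.1, 0.3, 0.5):
    print(f"attacker controls {frac:.0%} of training data -> "
          f"test accuracy {poisoned_accuracy(frac):.2f}")
```

Running a loop like this is one simple way for engineers to reason about the "what fraction can an attacker control?" question before deployment.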

3. Online System Manipulation

An ML system is said to be "online" when it continues to learn during operational use, modifying its behavior over time. In this case, a clever attacker can nudge the still-learning system in the wrong direction on purpose through system input and slowly "retrain" it to do the wrong thing. Note that such an attack can be both subtle and reasonably easy to carry out. This risk is complex, demanding that ML engineers consider data provenance, algorithm choice, and system operations in order to properly address it.
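A hedged sketch of the dynamic: the snippet below uses scikit-learn's incremental `partial_fit` interface as a stand-in for an online system and feeds it a stream of attacker-mislabeled inputs. The dataset, the number of malicious updates, and the split are illustrative assumptions; the point is only that incremental updates give an attacker a lever.

```python
# Illustrative sketch only: an "online" model updated with partial_fit can be
# nudged off course by a stream of well-formed inputs carrying wrong labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=1)
X_clean, y_clean = X[:2000], y[:2000]
X_eval, y_eval = X[2000:], y[2000:]

model = SGDClassifier(random_state=1)
model.partial_fit(X_clean, y_clean, classes=np.array([0, 1]))
print("accuracy before manipulation:", round(model.score(X_eval, y_eval), 2))

# The attacker repeatedly submits legitimate-looking inputs with wrong labels,
# "retraining" the still-learning system a little at a time.
rng = np.random.default_rng(1)
for _ in range(200):
    i = rng.integers(len(X_clean))
    model.partial_fit(X_clean[i:i + 1], np.array([1 - y_clean[i]]))

print("accuracy after manipulation: ", round(model.score(X_eval, y_eval), 2))
```

Controls such as logging data provenance and rate-limiting or auditing feedback signals target exactly this lever.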

4. Transfer-Learning Attack

Many ML systems are constructed by tuning an already trained base model so that its somewhat generic capabilities are fine-tuned with a round of specialized training. A transfer attack presents an important risk in this situation. In cases where the pretrained model is widely available, an attacker may be able to devise attacks using it that will be robust enough to succeed against your (unavailable to the attacker) tuned, task-specific model. You should also consider whether the ML system you are fine-tuning could be a Trojan that includes sneaky, unanticipated ML behavior.

ML systems are reused intentionally in transfer situations, so the risk of transfer outside of intended use applies. Groups posting models for transfer would do well to describe precisely what their systems do and how they control the risks identified in this document.
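To illustrate why attacks crafted on a public base model can carry over, here is a minimal, hypothetical sketch with two related linear models standing in for a public pretrained model and a private fine-tuned derivative. The weights, the perturbation size, and the `predict` helper are illustrative assumptions.

```python
# Illustrative sketch only: an adversarial perturbation crafted against a
# publicly available "base" model often transfers to a privately fine-tuned
# derivative, because the two models remain closely related.
import numpy as np

rng = np.random.default_rng(2)
w_base = rng.normal(size=50)                  # public pretrained model (known to attacker)
w_tuned = w_base + 0.1 * rng.normal(size=50)  # private fine-tuned model (hidden from attacker)

def predict(w, x):
    return int(w @ x > 0)

x = 0.05 * w_tuned                    # input the tuned model classifies as class 1
x_adv = x - 0.3 * np.sign(w_base)     # perturbation crafted using the *base* model only

print("tuned model on clean input       :", predict(w_tuned, x))      # 1
print("tuned model on transferred attack:", predict(w_tuned, x_adv))  # usually flips to 0
```

The closer the fine-tuned model stays to the public base, the more reliably attacks of this kind transfer, which is one reason the provenance and documentation of posted models matter.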

5. Data Confidentiality

Data protection is difficult enough without throwing ML into the mix. One unique challenge in ML is protecting sensitive or confidential data that, through training, are built right into a model. Subtle but effective extraction attacks against an ML system's data are an important category of risk.

Preserving data confidentiality in an ML system is more challenging than in a standard computing situation. That's because an ML system that is trained up on confidential or sensitive data will have some aspects of those data built right into it through training. Attacks to extract sensitive and confidential information from ML systems (indirectly through normal use) are well known. Note that even sub-symbolic "feature" extraction may be useful since that can be used to hone adversarial input attacks.
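One way to see the risk is a crude membership-inference probe, sketched below as a hypothetical example: if a model is systematically more confident on its training records than on unseen ones, an attacker querying it can guess whether a given record was in the (possibly confidential) training set. The dataset, model choice, and `confidence` helper are illustrative assumptions.

```python
# Illustrative sketch only: a confidence-gap probe of the kind that underlies
# membership-inference attacks against trained models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=3)
X_train, X_out, y_train, y_out = train_test_split(X, y, random_state=3)

model = RandomForestClassifier(random_state=3).fit(X_train, y_train)

def confidence(model, X, y):
    """Model's predicted probability for the true label of each record."""
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

print("mean confidence on training records:", confidence(model, X_train, y_train).mean())
print("mean confidence on unseen records  :", confidence(model, X_out, y_out).mean())
# A large gap between these two numbers is the signal an attacker exploits.
```

Techniques such as regularization and differentially private training aim to shrink exactly this gap.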

Securing ML

BIML's basic architectural risk analysis identifies 78 specific risks associated with a generic ML system. The report organizes the risks by common component and also includes some system-wide risks. The BIML risk analysis results are meant to help ML systems engineers in securing their own particular ML systems.

In my view, ML systems engineers can devise and field a more secure ML system by carefully considering the BIML risks while designing, implementing, and fielding their own specific ML system. In security, the devil is in the details, and BIML attempts to provide as much detail as possible regarding ML security risks and some basic controls.

[1] G. McGraw, H. Figueroa, V. Shepardson, and R. Bonett, "An Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning," technical report, Berryville Institute of Machine Learning (BIML), https://berryvilleiml.com/results/ara.pdf (accessed 6.3.20).

Gary McGraw is co-founder of the Berryville Institute of Machine Learning. He is a globally recognized authority on software security and the author of eight best-selling books on this topic. His titles include Software Security, Exploiting Software, Building Secure Software, ...
 
