Dark Reading is part of the Informa Tech Division of Informa PLC


Threat Intelligence

6/13/2017 06:15 PM

How Bad Data Alters Machine Learning Results

Machine learning models tested on single sources of data can prove inaccurate when presented with new sources of information.

The effectiveness of machine learning models may vary between the test phase and their use "in the wild" on actual consumer data.

Many research papers claim high malware-detection rates and low false-positive rates for machine learning, and often deep learning, models. However, nearly all of these results come from a single source of data, which the authors use both to train and to test their models.

Machine learning has grown more advanced but remains underused in security, says Hillary Sanders, data scientist for Sophos' data science research group. She anticipates usage will increase in coming years to address the rise of different forms of malware.

Historically, Sanders explains, static signatures have been used to detect malware. This method doesn't scale well because software needs to be updated with new signatures as more malware is created. Machine learning and deep learning automatically generate more flexible patterns, which could better detect malicious content compared with stricter static signatures.
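The contrast Sanders describes can be sketched in a few lines. This is a hypothetical illustration, not Sophos code: an exact-hash signature misses a never-before-seen variant, while a flexible feature-based check (a stand-in for a learned model) still fires.

```python
# Hypothetical sketch: exact signatures vs. a flexible feature-based check.
import hashlib

KNOWN_MALWARE = {"malware-v1 payload"}
SIGNATURES = {hashlib.sha256(s.encode()).hexdigest() for s in KNOWN_MALWARE}

def signature_detect(sample: str) -> bool:
    # Exact-match signature: any byte-level change evades detection.
    return hashlib.sha256(sample.encode()).hexdigest() in SIGNATURES

def feature_detect(sample: str) -> bool:
    # Stand-in for a learned model: scores a flexible pattern shared
    # across variants rather than an exact fingerprint.
    return "malware" in sample

variant = "malware-v2 payload"  # never seen before
print(signature_detect(variant))  # static signature misses the variant
print(feature_detect(variant))    # flexible pattern still fires
```

The point is not the toy rule itself but the shape of the trade-off: signatures require an update per sample, while a learned pattern can cover samples it was never shown.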

"This enables us to move away from signature detection and more toward deep learning detection, which doesn't really require signatures and is going to be better at detecting malware that has never been seen before," she says.

The challenge is in creating a deep learning model that detects forms of malware that don't yet exist. Sanders points to the problem of testing these models only on current data when, ideally, they must detect future malware strains across different clients and environments.

"We can't be sure the data we trained on is going to be super similar to the data in organization deployment," she explains. "If we're training on data that isn't like the data we want to eventually test on, our model might fail catastrophically."

In current machine learning research, accuracy estimates don't consider how systems will process future data. Sanders says modern publications lack time-decay analysis and sensitivity analysis, which could lead to a lack of trust among those who rely on this information.

"If researchers forget to focus on sensitivity testing and time decay, our models are liable to fail catastrophically in the wild," she explains.

Time-decay analysis simulates how a model's accuracy degrades as data changes over time, she explains. Consider a dataset spanning January through April: a model trained only on data from before February 1 will perform well on January's data, but its accuracy will begin to decay on data from February onward.
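That decay can be simulated on synthetic data. The sketch below is a toy under invented assumptions, not Sanders' methodology: a threshold "model" is frozen on January-like data, then scored month by month as the malicious feature distribution drifts away from what it was trained on.

```python
# Toy time-decay analysis on synthetic data: a model frozen on
# pre-February data is scored monthly as the malware feature drifts.
import random

random.seed(0)

def month_data(drift, n=1000):
    # Malicious samples cluster near 1.0 + drift; benign near 0.0.
    mal = [(1.0 + drift + random.gauss(0, 0.3), 1) for _ in range(n)]
    ben = [(random.gauss(0, 0.3), 0) for _ in range(n)]
    return mal + ben

def accuracy(threshold, data):
    # Fraction of samples the frozen threshold classifies correctly.
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

threshold = 0.5  # "trained" on January, where drift = 0
for month, drift in [("Jan", 0.0), ("Feb", -0.4), ("Mar", -0.8), ("Apr", -1.2)]:
    print(month, round(accuracy(threshold, month_data(drift)), 3))
```

Accuracy falls each month because the frozen decision boundary was fit to a distribution that no longer holds, which is exactly the failure mode time-decay analysis is meant to surface.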

Sensitivity analysis tweaks inputs for machine learning models to see how output is affected. Sanders will present sensitivity results in her presentation titled "Garbage In Garbage Out: How Purportedly Great Machine Learning Models Can Be Screwed Up By Bad Data" at this year's Black Hat USA conference in Las Vegas.
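A minimal version of such a sensitivity analysis might look like the following. The scoring function, feature names, and weights here are all hypothetical stand-ins for a trained model:

```python
# Hypothetical sensitivity analysis: perturb one input feature at a
# time and measure how much the model's output score moves.
def model_score(features):
    # Stand-in linear scorer; a real model would be a trained network.
    w = {"url_length": 0.02, "digit_ratio": 3.0, "entropy": 0.5}
    return sum(w[k] * v for k, v in features.items())

base = {"url_length": 40, "digit_ratio": 0.1, "entropy": 3.2}
baseline = model_score(base)

for name in base:
    perturbed = dict(base)
    perturbed[name] *= 1.10                    # nudge one feature by 10%
    delta = model_score(perturbed) - baseline  # resulting output shift
    print(f"{name}: {delta:+.3f}")
```

Features whose small perturbations swing the score the most are the ones a biased or shifting data source is most likely to break.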

This analysis will include a deep learning model designed to detect malicious URLs, which was trained and tested using three sources of URL data. As part of her discussion, she'll examine what drove the results, focusing on how the data sources differ and on the higher-level feature activations the neural network identified in some datasets but not in others.
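Cross-source evaluation of this kind can be sketched with synthetic data. The feeds, length distributions, and threshold "model" below are invented for illustration; a real analysis would use actual URL corpora and a trained network:

```python
# Sketch of cross-source evaluation: fit a simple threshold "model" on
# each synthetic URL feed, then test it on all feeds to expose bias.
import random

random.seed(1)

def make_source(mal_len, ben_len, n=500):
    # Each feed labels longer URLs malicious, but lengths vary by feed.
    mal = [(random.gauss(mal_len, 5), 1) for _ in range(n)]
    ben = [(random.gauss(ben_len, 5), 0) for _ in range(n)]
    return mal + ben

sources = {
    "feed_a": make_source(60, 30),
    "feed_b": make_source(55, 25),
    "feed_c": make_source(40, 10),
}

def fit_threshold(data):
    # Midpoint of the two class means: a stand-in for real training.
    mal = [x for x, y in data if y]
    ben = [x for x, y in data if not y]
    return (sum(mal) / len(mal) + sum(ben) / len(ben)) / 2

def accuracy(threshold, data):
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

for train in sources:
    t = fit_threshold(sources[train])
    row = {test: round(accuracy(t, sources[test]), 2) for test in sources}
    print(train, row)
```

The diagonal of the resulting matrix (train and test on the same feed) looks excellent, while off-diagonal cells drop, which is the single-source overestimate the research-paper numbers tend to hide.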

For security teams, the end goal with deep learning is to stop malware. If the training and testing data is biased relative to real-world data, models are likely to miss threats in deployment.

"You ignore the thing you could be optimizing for," says Sanders. "You could miss swaths of malware."

Black Hat USA returns to the Mandalay Bay in Las Vegas, Nevada, July 22-27, 2017.

 


Kelly Sheridan is the Staff Editor at Dark Reading, where she focuses on cybersecurity news and analysis. She is a business technology journalist who previously reported for InformationWeek, where she covered Microsoft, and Insurance & Technology, where she covered financial ...
