Dark Reading is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

Operations

1/31/2018
10:30 AM
Anup Ghosh
Anup Ghosh
Commentary
Connect Directly
Twitter
RSS
E-Mail vvv
50%
50%

5 Questions to Ask about Machine Learning

Marketing hyperbole often exceeds reality. Here are questions you should ask before buying.

How tired are we of "artificial intelligence" and "machine learning" being sprinkled like pixie dust on every product being hawked by vendors? The challenge for cybersecurity professionals is to see through the fog and figure out what's real and what's just marketing hyperbole.

Often, marketing hyperbole exceeds the reality. Notoriously, Tesla's Autopilot sensors can be fooled in certain edge conditions, iPhone X can be fooled to unlock a phone by a doppelganger, and Apple's Siri isn't very good at taking directions. Even the winning team in the DARPA Cyber Grand Challenge lost spectacularly to actual hackers at the DEFCON conference following its win against other machines at Black Hat.

Machine learning is built on recursive algorithms and mathematics, making the concept itself difficult for many to comprehend. So how can buyers and practitioners decipher what's "real" machine learning technology from marketing spin and, just as importantly, what is effective versus what is not?

The five questions below go to the heart of how well a particular machine learning approach performs in detecting attacks, regardless of which particular algorithm it uses.

1. That detection rate you quote in your marketing materials is impressive, but what's the corresponding false-positive rate?

The false-positive rate is the flip side of detection rates. False positives and true detection rates go hand in hand. In fact, a system can be tuned to optimize false positives or true detections to acceptable levels. The receiver operating characteristic (ROC) is a curve that shows the relation between true detections versus false positives. Pick a false-positive rate on the curve and you'll see the corresponding true detection rate of the algorithm. If a vendor can't or won't show you a ROC curve for its system, you can bet it hasn't done proper machine learning research, or the results are not something it would brag about.

2. How often does your model need updating, and how much does your model's accuracy drop off between updates?

Just as important as detection and false-positive rates is the ability of the model to age well. Machine learning models will age with time as the training data it trained on becomes obsolete. The ability of a machine learning model to generalize from what it has trained on can be measured by its decay rate, the rate at which the model’s performance declines with time as the data it trained on ages. A good machine learning model will age slowly, which in practice means it will not need to be replaced that often. For comparison, traditional signature-based models need updating daily. A good machine learning model only needs to be replaced once every few months rather than every few days. The decay rate is heavily influenced by the training data. A diverse training set leads to a stable model, and a narrow training set ages out very fast.

3. Does your machine learning algorithm make decisions in real time?

Depending on your application, you can use machine learning for retrospective forensic analysis or for inline blocking — that is, blocking attacks as they occur in real time. If used for inline blocking, the approach needs to operate in real time, typically measured in milliseconds. In general, this rules out online lookups because of round-trip times from the cloud. Real-time performance requires a compact model able to run on-premises in the device's memory. Asking the real-time performance of the model is one way of figuring out whether the model is compact enough to block attacks in real time. 

4. What is your training set?

The most overlooked important attribute in machine learning is the training set. The performance of a machine learning algorithm depends on the quality of the training set. Good, curated training sets that are robust to change, reflect real-world conditions, and diverse are hard to acquire, but they are incredibly important for effective performance. If the data on which the model is trained is not representative of the threats you will face, then the performance on your network will suffer regardless of how the model was tested. Models tested on narrow data sets will have misleading performance results.

5. How well does your machine learning system scale?

The good and bad news for machine learning in security is that there is a massive amount of data on which to train. Machine learning algorithms typically require those massive amounts of data to properly learn the phenomena it is trying to detect. That's the good news. The bad news is the models must be able to scale to Internet-sized databases that change continuously. Understanding how much data an algorithm is trained on gives an indication of its scalability. Understanding the footprint of the model gives an indication of its ability to compactly represent and process Internet-scale databases.

As you can see, for a machine learning approach to be successful, it must do the following:

  • Have high detection rates and low false positives on known and unknown attacks, with a published ROC curve.
  • Be trained on a robust training set that is representative of real-world threats.
  • Continue to deliver high performance for months after each update.
  • Provide real-time performance (threat blocking) without consuming large amounts of system resources such as memory and disk.
  • Scale reliably, without using more memory or losing performance, even as the training set increases.

Next time you talk to a company that claims to use machine learning in its products, be sure to get answers to these questions.

Related Content:

Anup Ghosh is Chief Strategist, Next-Gen Endpoint, at Sophos. Ghosh was previously Founder and CEO at Invincea until Invincea was acquired by Sophos in March 2017. Prior to founding Invincea, he was a Program Manager at the Defense Advanced Research Projects Agency (DARPA). ... View Full Bio
Comment  | 
Print  | 
More Insights
Comments
Newest First  |  Oldest First  |  Threaded View
vietnamvisaservice
50%
50%
vietnamvisaservice,
User Rank: Apprentice
1/31/2018 | 11:44:41 PM
Thank you
That sound good
SOC 2s & Third-Party Assessments: How to Prevent Them from Being Used in a Data Breach Lawsuit
Beth Burgin Waller, Chair, Cybersecurity & Data Privacy Practice , Woods Rogers PLC,  12/5/2019
Navigating Security in the Cloud
Diya Jolly, Chief Product Officer, Okta,  12/4/2019
Register for Dark Reading Newsletters
White Papers
Video
Cartoon Contest
Write a Caption, Win a Starbucks Card! Click Here
Latest Comment: This comment is waiting for review by our moderators.
Current Issue
Navigating the Deluge of Security Data
In this Tech Digest, Dark Reading shares the experiences of some top security practitioners as they navigate volumes of security data. We examine some examples of how enterprises can cull this data to find the clues they need.
Flash Poll
Rethinking Enterprise Data Defense
Rethinking Enterprise Data Defense
Frustrated with recurring intrusions and breaches, cybersecurity professionals are questioning some of the industrys conventional wisdom. Heres a look at what theyre thinking about.
Twitter Feed
Dark Reading - Bug Report
Bug Report
Enterprise Vulnerabilities
From DHS/US-CERT's National Vulnerability Database
CVE-2019-19619
PUBLISHED: 2019-12-06
domain/section/markdown/markdown.go in Documize before 3.5.1 mishandles untrusted Markdown content. This was addressed by adding the bluemonday HTML sanitizer to defend against XSS.
CVE-2019-19616
PUBLISHED: 2019-12-06
An Insecure Direct Object Reference (IDOR) vulnerability in the Xtivia Web Time and Expense (WebTE) interface used for Microsoft Dynamics NAV before 2017 allows an attacker to download arbitrary files by specifying arbitrary values for the recId and filename parameters of the /Home/GetAttachment fun...
CVE-2019-19617
PUBLISHED: 2019-12-06
phpMyAdmin before 4.9.2 does not escape certain Git information, related to libraries/classes/Display/GitRevision.php and libraries/classes/Footer.php.
CVE-2012-1114
PUBLISHED: 2019-12-05
A Cross-Site Scripting (XSS) vulnerability exists in LDAP Account Manager (LAM) Pro 3.6 in the filter parameter to cmd.php in an export and exporter_id action. and the filteruid parameter to list.php.
CVE-2012-1115
PUBLISHED: 2019-12-05
A Cross-Site Scripting (XSS) vulnerability exists in LDAP Account Manager (LAM) Pro 3.6 in the export, add_value_form, and dn parameters to cmd.php.