Commentary | Raffael Marty | 1/11/2018 10:30 AM
AI in Cybersecurity: Where We Stand & Where We Need to Go

How security practitioners can incorporate expert knowledge into machine learning algorithms that reveal security insights, safeguard data, and keep attackers out.

With the omnipresence of the term artificial intelligence (AI) and the increased popularity of deep learning, a lot of security practitioners are being lured into believing that these approaches are the magic silver bullet we have been waiting for to solve all of our security challenges. But deep learning — or any other machine learning (ML) approach — is just a tool. And it's not a tool we should use on its own. We need to incorporate expert knowledge for the algorithms to reveal actual security insights.

Before continuing this post, I will stop using the term artificial intelligence and revert to the term machine learning. We don't have AI or, to be precise, artificial general intelligence (AGI) yet, so let's not distract ourselves with these false concepts.

Where do we stand today with AI — excuse me, machine learning — in cybersecurity? We first need to look at our goals: broadly speaking, we are trying to use ML to identify malicious behavior or malicious entities; call them hackers, attackers, malware, unwanted behavior, and so on. In other words, it comes down to finding anomalies. Beware: one of the biggest challenges in finding anomalies is defining what is "normal." For example, can you define what is normal behavior for your laptop, day in and day out? Don't forget to think of that new application you downloaded recently. How do you differentiate that download from one triggered by an attacker? In abstract terms, only a subset of statistical anomalies constitutes interesting security events.
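To make that concrete, here is a minimal sketch (in Python, with invented numbers) of the statistical view of "normal": flag days whose outbound traffic deviates sharply from a host's own baseline. The flagged day is only a statistical outlier; whether it is a legitimate application download or an attacker exfiltrating data is exactly the question the statistics alone cannot answer.

```python
# Minimal sketch: flagging statistical anomalies in daily outbound traffic
# for a single laptop. All numbers are invented for illustration.
import statistics

daily_outbound_mb = [120, 135, 110, 128, 142, 960, 125]  # day 6 is unusual

mean = statistics.mean(daily_outbound_mb)
stdev = statistics.stdev(daily_outbound_mb)

for day, volume in enumerate(daily_outbound_mb, start=1):
    z = (volume - mean) / stdev
    if abs(z) > 2:  # crude threshold; "normal" here is just the recent mean
        # A z-score flags a statistical outlier, not an attack: the spike
        # could be a new software download or data exfiltration.
        print(f"day {day}: {volume} MB (z={z:.1f}) -- investigate")
```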

Applying Machine Learning to Security
Within machine learning, we can look at two categories of approaches: supervised and unsupervised. Supervised ML is great at classifying data — for example, learning whether something is "good" or "bad." To do so, these approaches need large collections of training data to learn what each class of data looks like. Supervised algorithms learn the properties of the training data and then apply that knowledge to classify new, previously unseen data. Unsupervised ML is well suited for making large data sets easier to analyze and understand. Unfortunately, it is not that well suited to finding anomalies.

Let's take a more detailed look at the different groups of ML algorithms, starting with the supervised case.

Supervised Machine Learning
Supervised ML is where machine learning has made the biggest impact in cybersecurity. The two poster use cases are malware identification and spam detection. Today's approaches to malware identification have greatly benefited from deep learning, which has helped drop false-positive rates to very low numbers while reducing false-negative rates at the same time. Malware identification works so well because of the availability of millions of labeled samples (both malware and benign applications) to use as training data. These samples allow us to train deep belief networks extremely well. Spam detection is a very similar problem in the sense that we have a lot of training data with which to teach our algorithms to differentiate spam from legitimate email.
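As a hedged illustration of that workflow, the sketch below trains a classifier on synthetic "malware" and "benign" feature vectors. The features (file size, entropy, imported API count) and the data are invented stand-ins; a real system would train a deep network on millions of labeled binaries rather than a random forest on a thousand points.

```python
# Sketch of the supervised workflow described above, using scikit-learn.
# Features and labels are synthetic stand-ins for real labeled samples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Hypothetical per-sample features: [file size KB, entropy, imported API count]
benign = rng.normal(loc=[300, 5.0, 40], scale=[80, 0.5, 10], size=(500, 3))
malware = rng.normal(loc=[150, 7.2, 12], scale=[60, 0.4, 6], size=(500, 3))

X = np.vstack([benign, malware])
y = np.array([0] * 500 + [1] * 500)  # 0 = benign, 1 = malware

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# False positives and false negatives show up in precision and recall.
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["benign", "malware"]))
```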

Where we don't have great training data is in most other areas — for example, in the realm of detecting attacks from network traffic. We have tried for almost two decades to come up with good training data sets for these problems, but we still do not have a suitable one. Without one, we cannot train our algorithms. In addition, there are other problems, such as the inability to deterministically label data, the challenge of cleaning it, and the difficulty of understanding the semantics of a data record.

Unsupervised Machine Learning
Unsupervised approaches are great for data exploration. They can be used to reduce the number of dimensions, or fields, of data to look at (dimensionality reduction) or to group records together (clustering and association rules). However, these algorithms are of limited use when it comes to identifying anomalies or attacks. Clustering could be interesting for finding anomalies: maybe we can find ways to separate "normal" and "abnormal" entities, such as users or devices, into their own clusters? It turns out the fundamental problem that makes this hard is that clustering in security suffers from the lack of good distance functions and from the poor "explainability" of the resulting clusters. You can find more information about the challenges with distance functions and explainability in this blog post.
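Here is a minimal sketch of that clustering idea, with invented device features. Note how the feature choice and scaling effectively are the distance function: Euclidean distance over hand-picked, rescaled features rarely matches a security analyst's notion of "similar," which is precisely where this approach gets into trouble.

```python
# Sketch: clustering devices by coarse behavioral features with k-means.
# The feature selection and scaling below define the distance function,
# which is exactly the weak point called out above.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical per-device features: [flows/hour, distinct peers, MB out/day]
workstations = rng.normal([50, 20, 200], [10, 5, 50], size=(40, 3))
servers = rng.normal([500, 300, 5000], [80, 40, 800], size=(10, 3))
X = np.vstack([workstations, servers])

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X_scaled)

# Cluster sizes fall out easily -- but which cluster is "abnormal", and why?
print(np.bincount(labels))
```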

Context and Expert Knowledge
In addition to the challenges already mentioned for identifying anomalies with ML, there are significant building blocks we need to add. The first one is context. Context is anything that helps us better understand the types of entities (devices, applications, and users) present in the data. Context for a device includes things like its role, its location, and its owner. For example, rather than looking at network traffic logs in isolation, we need to add context to make sense of the data. Knowing which machines are DNS servers on the network helps us understand which of them should be responding to DNS queries; a non-DNS server that is responding to DNS requests could be a sign of an attack.
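The DNS example can be expressed in a few lines. The inventory and log records below are invented for illustration; the point is that the alert only exists because of the contextual lookup against the asset inventory.

```python
# Sketch of context enrichment: flag hosts answering DNS queries that the
# asset inventory does not list as DNS servers. All records are invented.
KNOWN_DNS_SERVERS = {"10.0.0.2", "10.0.0.3"}  # context: the asset inventory

dns_responses = [  # parsed from traffic logs: (responding host, client)
    ("10.0.0.2", "10.0.1.15"),
    ("10.0.0.3", "10.0.1.22"),
    ("10.0.1.99", "10.0.1.15"),  # a workstation answering DNS queries
]

for responder, client in dns_responses:
    if responder not in KNOWN_DNS_SERVERS:
        # Without the inventory, this record looks like ordinary UDP/53
        # traffic; with context, it stands out as a possible rogue resolver
        # or DNS spoofing.
        print(f"alert: non-DNS server {responder} answered a query from {client}")
```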

In addition to context, we need to build systems and algorithms with expert knowledge. This is very different from throwing algorithms at the wall and seeing which ones yield anything potentially useful. One of the interesting approaches in the area of knowledge capture that I would love to see get more attention is Bayesian belief networks. Has anyone done anything interesting with those (in security)? Please share in the comments below.
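For readers unfamiliar with the idea, the toy sketch below shows the smallest possible version of encoding expert knowledge probabilistically: a single Bayes'-rule update from one observation, with all probabilities invented. A real belief network chains many such conditional-probability tables across dozens of variables, but the principle is the same: the experts supply the structure and the likelihoods, and the data supplies the evidence.

```python
# Toy illustration: expert knowledge captured as prior and likelihoods,
# updated with one observation via Bayes' rule. All numbers are invented.
P_COMPROMISED = 0.01                 # expert prior for any given host
P_BEACON_GIVEN_COMPROMISED = 0.70    # expert: most implants beacon out
P_BEACON_GIVEN_CLEAN = 0.05          # expert: some benign software polls too

def posterior_compromised(beaconing_observed: bool) -> float:
    """Update the belief that a host is compromised from one observation."""
    p_b_c = P_BEACON_GIVEN_COMPROMISED if beaconing_observed else 1 - P_BEACON_GIVEN_COMPROMISED
    p_b_n = P_BEACON_GIVEN_CLEAN if beaconing_observed else 1 - P_BEACON_GIVEN_CLEAN
    evidence = p_b_c * P_COMPROMISED + p_b_n * (1 - P_COMPROMISED)
    return p_b_c * P_COMPROMISED / evidence

# Seeing beaconing raises the belief from 1% to roughly 12% -- suspicious,
# but not yet conclusive, which matches analyst intuition.
print(f"P(compromised | beaconing) = {posterior_compromised(True):.3f}")
```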

Rather than trying to use algorithms to solve really hard problems, we should also consider building systems that help make security analysts more effective. Visualization is a great candidate in this area. Instead of having analysts look at thousands of rows of data, we can give them visual representations of the data that unlock a deeper understanding in a very short amount of time. Visualization is also a great tool for verifying and understanding the results of ML algorithms.
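As a sketch of what that can look like, the snippet below (synthetic data, matplotlib) renders a per-host, per-hour traffic heatmap. A burst that would hide in thousands of log rows is immediately visible as a bright band, and the same view can be used to sanity-check what an ML model flags.

```python
# Sketch: a heatmap so an analyst can eyeball per-host traffic instead of
# scanning thousands of log rows. Data is synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
hosts = [f"host-{i}" for i in range(8)]
traffic = rng.poisson(lam=20, size=(8, 24))  # MB per host per hour
traffic[5, 2:5] = 300  # one host bursting at 2-4 AM stands out visually

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(traffic, aspect="auto", cmap="viridis")
ax.set_yticks(range(len(hosts)))
ax.set_yticklabels(hosts)
ax.set_xlabel("hour of day")
fig.colorbar(im, ax=ax, label="MB transferred")
plt.tight_layout()
plt.show()
```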

This visualization of 100GB of network traffic over a period of one week is used by experts to determine 'normal' behavior and identify potential anomalies.

In the ancient practice of Zen, koans are a tool, a stepping stone on the path to enlightenment. ML is much the same: a tool you have to learn how to apply and use in order to come to new understanding and find attackers in your systems.

Raffael Marty is vice president of security analytics at Sophos. He is one of the world's most recognized authorities on security data analytics, big data, and visualization. Previously, Marty launched pixlcloud, a visual analytics platform, and Loggly, a cloud-based log ...
Comments
cknisley44 | 1/12/2018 4:37:14 PM
Bayesian Belief Networks for Cybersecurity
Really nice article highlighting exactly the things we at Haystax Technology talk about regarding the use of AI in cybersecurity. It seems every vendor has launched a new version of their product in the past year with the tagline "New and improved with AI."

There are great places where ML is perfect for cybersecurity, but only as a part of the solution. The heart of our Constellation Analytics Platform is a Bayesian Inference Network built over years of research to reason like a team of cyber experts across weak inputs from disparate sensor systems. Some of those sensors need ML techniques to find the real signal, some don't - but they all come together in a BayesNet that prioritizes the threat and ultimately the risk. 

We believe this is the only viable approach today, given the lack of training data you highlighted; and even if you had some training data, sensor outputs change frequently as the environment changes. Only with additional context can you uncover the malicious events that represent the highest risks.
cchio | 1/11/2018 11:38:23 PM
Great post!
Enjoyed your post and agree with many of the points you made. I think a lot of what we have to say in the book we are publishing next month will resonate with you - https://www.amazon.com/Machine-Learning-Security-Protecting-Algorithms/dp/1491979909 Happy to chat more!