How security practitioners can incorporate expert knowledge into machine learning algorithms that reveal security insights, safeguard data, and keep attackers out.

Raffael Marty, VP Security Analytics, Sophos

January 11, 2018

This visualization of 100GB of network traffic over a period of one week is used to determine 'normal' behavior and identify potential anomalies by experts.

With the omnipresence of the term artificial intelligence (AI) and the increased popularity of deep learning, a lot of security practitioners are being lured into believing that these approaches are the magic silver bullet we have been waiting for to solve all of our security challenges. But deep learning — or any other machine learning (ML) approach — is just a tool. And it's not a tool we should use on its own. We need to incorporate expert knowledge for the algorithms to reveal actual security insights.

Before continuing this post, I will stop using the term artificial intelligence and go back to using the term machine learning. We don't have AI or, to be precise, artificial general intelligence (AGI) yet, so let's not distract ourselves with these false concepts.

Where do we stand today with AI — excuse me, machine learning — in cybersecurity? We first need to look at our goals: To make a broad statement, we are trying to use ML to identify malicious behavior or malicious entities; call them hackers, attackers, malware, unwanted behavior, etc. In other words, it comes down to finding anomalies. Beware: to find anomalies, one of the biggest challenges is to define what is "normal." For example, can you define what is normal behavior for your laptop day in, day out? Don't forget to think of that new application you downloaded recently. How do you differentiate that from a download triggered by an attacker? In abstract terms, only a subset of statistical anomalies contains interesting security events.
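
To see why a statistical anomaly is not automatically a security event, consider a minimal sketch that flags outliers with a simple z-score. The traffic numbers, the threshold, and the function name are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Return indices of values more than `threshold` sample standard
    deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical megabytes downloaded per day by one laptop (made-up data).
daily_mb = [120, 135, 110, 128, 140, 125, 900, 130, 118, 122]

flagged = zscore_outliers(daily_mb)
print(flagged)  # [6]
```

The algorithm flags day 6, but it cannot say whether that spike was the new application you installed or a download triggered by an attacker. Answering that question takes knowledge the numbers alone do not contain.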

Applying Machine Learning to Security
Within machine learning, we can look at two categories of approaches: supervised and unsupervised. Supervised ML is great at classifying data, for example, learning whether something is "good" or "bad." To do so, these approaches need large collections of training data to learn what these classes of data look like. Supervised algorithms learn the properties of the training data and then apply the acquired knowledge to classify new, previously unseen data. Unsupervised ML is well suited for making large data sets easier to analyze and understand. Unfortunately, unsupervised approaches are not that well suited to finding anomalies.

Let's take a more detailed look at the different groups of ML algorithms, starting with the supervised case.

Supervised Machine Learning
Supervised ML is where machine learning has made the biggest impact in cybersecurity. The two poster use cases are malware identification and spam detection. Today's approaches in malware identification have greatly benefited from deep learning, which has helped drop false positive rates to very low numbers while reducing the false negative rates at the same time. Malware identification works so well because millions of labeled samples (both malware and benign applications) are available as training data. These samples allow us to train deep belief networks extremely well. Spam detection is a very similar problem in the sense that we have a lot of training data to train our algorithms to differentiate spam from legitimate emails.
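
The core idea of supervised learning, learning from labeled samples and then classifying unseen data, can be sketched in a few lines. The example below is a toy naive Bayes spam classifier, not the deep learning approach production systems use, and the four training messages are invented for illustration. Real systems depend on the millions of labeled samples mentioned above.

```python
import math
from collections import Counter

def train(samples):
    """samples: list of (text, label). Returns per-label word counts
    and the number of training messages per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest log-posterior, using naive Bayes
    with add-one (Laplace) smoothing."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data.
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the quarterly report", "ham"),
]
model = train(training)

print(classify("claim your free prize", *model))         # spam
print(classify("agenda for the report review", *model))  # ham
```

Everything the classifier knows comes from the labels. Without labeled data of comparable quality, the same approach has nothing to learn from.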

Where we don't have great training data is in most other areas — for example, in the realm of detecting attacks from network traffic. We have tried for almost two decades to come up with good training data sets for these problems, but we still do not have a suitable one. Without one, we cannot train our algorithms. In addition, there are other problems such as the inability to deterministically label data, the challenges associated with cleaning data, or understanding the semantics of a data record.

Unsupervised Machine Learning
Unsupervised approaches are great for data exploration. They can be used to reduce the number of dimensions or fields of data to look at (dimensionality reduction) or to group records together (clustering and association rules). However, these algorithms are of limited use when it comes to identifying anomalies or attacks. Clustering could be interesting for finding anomalies. Maybe we can find ways to cluster "normal" and "abnormal" entities, such as users or devices? It turns out that two fundamental problems make this hard to do: clustering in security suffers from a lack of good distance functions and from poor "explainability" of the resulting clusters. You can find more information about the challenge with distance functions and explainability in this blog post.
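
One way the distance-function problem can show up is sketched below. This is a hypothetical illustration, not an example taken from the referenced blog post: the flow records are invented, and plain Euclidean distance stands in for whatever metric a clustering algorithm might use.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical flow records: (destination port, kilobytes transferred).
http = (80, 500)
https = (443, 500)
ssh = (22, 500)
telnet = (23, 500)

d_http_https = euclidean(http, https)  # 363.0
d_http_ssh = euclidean(http, ssh)      # 58.0
d_ssh_telnet = euclidean(ssh, telnet)  # 1.0

print(d_http_https, d_http_ssh, d_ssh_telnet)
```

By this metric, HTTP is "closer" to SSH than to HTTPS, and SSH and Telnet are nearly identical. Port numbers are identifiers, not quantities, so arithmetic distance between them carries no security meaning. A useful distance function has to encode what the fields mean, which again requires expert knowledge.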

Context and Expert Knowledge
In addition to the already mentioned challenges for identifying anomalies with ML, there are significant building blocks we need to add. The first one is context. Context is anything that helps us better understand the types of the entities (devices, applications, and users) present in the data. Context for devices includes things like a device's role, its location, its owner, etc. For example, rather than looking at network traffic logs in isolation, we need to add context to make sense of the data. Knowing which machines represent DNS servers on the network helps understand which of them should be responding to DNS queries. A non-DNS server that is responding to DNS requests could be a sign of an attack.
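
The DNS example can be sketched as a simple lookup against context data. The IP addresses, the record layout, and the function name below are hypothetical, and traffic sent from source port 53 is used as a rough stand-in for "responding to DNS requests."

```python
# Hypothetical context: hosts an asset inventory lists as DNS servers.
KNOWN_DNS_SERVERS = {"10.0.0.53", "10.0.1.53"}

# Hypothetical flow records: (source IP, source port, protocol).
flows = [
    ("10.0.0.53", 53, "udp"),
    ("10.0.2.17", 53, "udp"),
    ("10.0.2.17", 443, "tcp"),
    ("10.0.1.53", 53, "tcp"),
]

def unexpected_dns_responders(flows, dns_servers):
    """Return the sorted source IPs that send traffic from port 53
    but are not listed as DNS servers."""
    return sorted(
        {src for src, sport, _proto in flows
         if sport == 53 and src not in dns_servers}
    )

suspects = unexpected_dns_responders(flows, KNOWN_DNS_SERVERS)
print(suspects)  # ['10.0.2.17']
```

The logic is trivial. The hard part is keeping the context, here the set of known DNS servers, accurate and up to date.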

In addition to context, we need to build systems and algorithms with expert knowledge. This is very different from throwing an algorithm at the wall and seeing what yields anything potentially useful. One of the interesting approaches in the area of knowledge capture that I would love to see get more attention is Bayesian belief networks. Anyone done anything interesting with those (in security)? Please share in the comments below.
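
To give a flavor of how a Bayesian belief network captures expert knowledge, here is a minimal sketch of a two-layer network: one hidden cause ("compromised") and two observable indicators that are assumed to be conditionally independent given that cause. The structure, the indicator names, and every probability are hypothetical values an expert might supply; none come from real data.

```python
# Hypothetical expert estimate: prior probability that a host is compromised.
P_COMPROMISED = 0.01

# Hypothetical expert estimates: P(indicator observed | compromised?).
LIKELIHOODS = {
    "beaconing": {True: 0.70, False: 0.02},
    "new_admin_account": {True: 0.40, False: 0.01},
}

def posterior_compromised(evidence):
    """Return P(compromised | evidence) by enumerating both states of the
    parent node. `evidence` maps indicator name -> True/False (observed or not)."""
    joint = {}
    for compromised in (True, False):
        p = P_COMPROMISED if compromised else 1 - P_COMPROMISED
        for name, observed in evidence.items():
            p_obs = LIKELIHOODS[name][compromised]
            p *= p_obs if observed else 1 - p_obs
        joint[compromised] = p
    return joint[True] / (joint[True] + joint[False])

p_one = posterior_compromised({"beaconing": True})
p_both = posterior_compromised({"beaconing": True, "new_admin_account": True})
print(round(p_one, 2), round(p_both, 2))  # 0.26 0.93
```

With these made-up numbers, one indicator alone raises the belief from 1% to roughly 26%, and both together raise it to roughly 93%. The appeal is that every number in the model is an explicit, reviewable statement of expert judgment.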

Rather than only trying to use algorithms to solve really hard problems, we should also consider building systems that help make security analysts more effective. Visualization is a great candidate in this area. Instead of looking at thousands of rows of data, analysts can look at visual representations of the data that unlock a deeper understanding in a very short amount of time. Visualization is also a great tool to verify and understand the results of ML algorithms.

In the ancient practice of Zen, koans are a tool, a stepping stone on the way to enlightenment. ML is the same: it's a tool that you have to learn how to apply and use in order to come to new understanding and find attackers in your systems.

About the Author(s)

Raffael Marty is vice president of security analytics at Sophos. He is one of the world's most recognized authorities on security data analytics, big data and visualization. Previously, Marty launched pixlcloud, a visual analytics platform, and Loggly, a cloud-based log management solution. With a track record at companies including IBM Research, ArcSight, and Splunk, he is thoroughly familiar with established practices and emerging trends in the big data and security analytics space. Marty is the author of Applied Security Visualization and a frequent speaker at academic and industry events. Zen meditation has become an important part of Raffy's life, sometimes leading to insights not in data but in life.
