Go to any security conference and you’ll likely hear someone say cybersecurity is a “big data problem” or claim artificial intelligence (AI) and machine learning (ML) are the best hope we have to improve our defenses. Most organizations collect mountains of data and urgently need help sifting through it all to find signs of an attack or breach. First-generation tools, such as SIEM (security information and event management) systems, gave security teams a way to correlate events and triage the data. Then solutions with powerful search and indexing capabilities emerged, enabling security teams to quickly search through massive amounts of indexed data.
These tools have helped enormously, but they leave us with two challenges: vast amounts of data of unknown value that we don’t know when to discard, and the nagging worry that the security team might have missed a needle somewhere in the haystack.
Can machine learning help? Can an algorithm reliably find the needles and give us the confidence to discard the haystack of data that represents normal activity? It’s an appealing idea. We’ve all experienced the power of ML systems in Google search, the recommendation engines of Amazon and Netflix, and the powerful spam-filtering capabilities of Web mail providers. Former Symantec CTO Amit Mital once said that ML offers one of the “few beacons of hope in this mess.”
But it’s important not to succumb to hubris. Google Flu Trends, the company’s fabled attempt to identify flu epidemics from search data, turned out to be woefully inaccurate. And the domain of cybersecurity is characterized by weak signals, intelligent actors, a large attack surface, and a huge number of variables. Here, there is no guarantee that using ML/AI will leave you any better off than relying on skilled experts to do the hard work.
Unfortunately, that hasn’t stopped the marketing spin.
What’s Normal, Anyway?
It’s important to remember there is no silver bullet in security, and there’s no evidence at all that these tools help. ML is good at finding similarities between things (such as spam emails), but it’s not so good at finding anomalies. In fact, any discussion of anomalous behavior presumes that it is possible to describe normal behavior. Unfortunately, decades of research confirm that human activity, application behavior, and network traffic are all heavily auto-correlated, making it hard to understand what activity is normal. This gives malicious actors plenty of opportunity to “hide in plain sight” and even an opportunity to train the system that malicious activity is normal.
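The “hide in plain sight” risk can be seen even in a toy detector. The sketch below is purely illustrative (the threshold rule and traffic numbers are invented, not taken from any real product): it flags values far above a learned baseline, then shows how an attacker who stays just under the alarm line can gradually train the detector that ever-larger activity is normal.

```python
# Toy z-score anomaly detector over a learned baseline (illustrative only;
# all numbers are invented). Flags values more than 3 std devs above mean.
from statistics import mean, stdev

def alarm_line(history, k=3.0):
    """The threshold above which a value is considered anomalous."""
    return mean(history) + k * stdev(history)

def is_anomalous(history, value, k=3.0):
    return value > alarm_line(history, k)

# Benign daily outbound traffic (MB) per host: mean 100, stdev 2.
benign = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_anomalous(benign, 500))   # True: a sudden spike is flagged

# An attacker who probes the alarm threshold creeps up just below it,
# "training" the detector that ever-larger transfers are normal.
poisoned = list(benign)
for _ in range(20):
    step = alarm_line(poisoned) * 0.99     # stay just under the alarm line
    assert not is_anomalous(poisoned, step)
    poisoned.append(step)                  # absorbed into the baseline

print(alarm_line(poisoned) > alarm_line(benign))  # True: the line has crept up
```

The point is not the specific statistic: any detector whose notion of “normal” is learned from observed activity can be nudged by an adversary who controls some of that activity.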
Trained vs. Untrained Learning
Any ML system must attempt to separate and classify activity based either on pre-defined categories (supervised, or “trained,” learning) or on categories it discovers itself (unsupervised learning). Training an ML engine using human experts seems like a great idea, but it assumes that attackers won’t subtly vary their behavior over time in response. Self-learned categories, meanwhile, are often impossible for humans to interpret, and ML systems are poor at explaining why a particular activity is anomalous or how it relates to other activity. So when the ML system delivers an alert, you still have to do the hard work of deciding whether it is a false positive before you can work out how the anomaly relates to other activity in the system.
Is It Real?
There is a huge difference between being pleased when Netflix recommends a movie you like, and expecting Netflix to never recommend a movie that you don’t like. So while applying ML to your security feeds might deliver some helpful insights, you cannot rely on such a system to reliably deliver only valid results. In our industry, the difference is cost – time spent understanding why an alert was triggered and whether or not it is a false positive. Ponemon estimates that a typical large enterprise spends up to 395 hours per week processing false alerts - about $1.27 million per year. Unfortunately, you also cannot rely on an ML system to find all anomalies, so you have no way to know if an attacker may still be lurking on your network, and no way to know when to throw away data.
Experts Are Still Better
Cybersecurity is a domain where human expertise will always be needed to pick through the subtle differences between anomalies. Rather than waste money on the unproven promises of ML and AI-based security technologies, I recommend that you invest in your experts, and in tools that enhance their ability to quickly search for and identify components of a new attack. In the context of endpoint security, an emerging category of tools that Gartner calls “Endpoint Detection & Response” plays an important role in equipping the security team with real-time insight into indicators of compromise on the endpoint. Here, both continuous monitoring and real-time search are key.
ML Cannot Protect You
A final caution – obvious as it may be: Post-hoc analysis of monitoring data cannot prevent a vulnerable system from being compromised in the first place. Ultimately, we need to quickly adopt technologies and infrastructure that are more secure by design. By way of example, segmenting the enterprise network and placing all PCs on a separate, routed network segment, forcing users to authenticate to get access to privileged applications, makes it much harder for malware to penetrate and move sideways in the organization. Virtualization and micro-segmentation take this a step further, restricting flows of activity in your networks and making your applications more resilient to attack. Overall, good infrastructure architecture can make the biggest difference in your security posture – reducing the size of the haystack and making the business of defending the enterprise much easier.
Black Hat Europe returns to the beautiful city of Amsterdam, Netherlands November 12 & 13, 2015. Click here for more information and to register.