The New Secret Weapon in Breach Detection: Math and Data Science

It's time for organizations across industries to use math and data science to assess the probabilities of a breach. Here's how.

The days of looking at log files to find security breaches are long gone. Don't get me wrong — log files are still useful. They are vital to confirming a breach and its cause, and necessary for forensics and remediation workflows. But manually sifting through logs to identify trouble is a waste of time in an era during which data grows exponentially, seemingly by the hour. This is further compounded by the complex interconnectedness and opacity of the digital supply chain necessary to deliver modern services.

For many of us, sitting through high school and college math courses, such as calculus, spurred a common question: "When am I ever going to use this in real life?" But for those of us who found our way to the information security world, the answer to that question is "now."

It's time for organizations across industries to take a page from the financial services book and use math and data science to assess the probabilities of a breach. Specifically, security teams can leverage time series data to build mathematical models that describe user behavior, and then look for anomalies and assign a probability that something is wrong.
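
To make "assign a probability" concrete, here is a minimal sketch of one way to do it. It assumes a user's daily count of some critical operation is roughly normally distributed, which is a simplifying assumption and not something real traffic is guaranteed to satisfy. The function name and the sample numbers are hypothetical and for illustration only.

```python
import math
import statistics

def tail_probability(history, observed):
    """Chance of seeing a value at least as large as `observed`,
    ASSUMING the history is roughly normally distributed."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        # A perfectly flat history: anything above it is unprecedented.
        return 1.0 if observed <= mean else 0.0
    z = (observed - mean) / stdev
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical daily counts of one critical operation for one user.
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]

for observed in (6, 40):
    p = tail_probability(history, observed)
    print(f"observed={observed}: probability under the baseline = {p:.6f}")
```

A small probability does not prove a breach; it only says the observation is unlikely under the learned baseline and deserves a closer look.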

Here are some of the elements and basic concepts of math and data science that organizations can use to improve their breach detection:

  • Derivatives. The word "derivative" may sound fancy, but here it simply means the rate at which a quantity changes over time. For our purposes, the quantity is the number of authentication failures per unit of time (per hour, per day, and so on), and a sudden increase in that number is the signal worth watching. For example, if authentication failures jump from five or ten per day to 100 or more, it's a sign that a breach is being attempted (best case) or has already happened (worst case). The point is to watch how fast the count is changing, not just the count itself.
  • Mathematical models. Another concept that's useful in our field is building mathematical models of asset behavior. For example, think of a software-as-a-service product or platform as an asset. How can we establish a baseline that can then be used to spot anomalies? If you use GitHub as a code repository, you might model it by monitoring metrics over time for a set of critical operations, such as "clone," "merge," "delete," "add user," or "generate access token."
  • Cardinality. These examples can also include the notion of cardinality, which is the number of elements in a set. The set could be the known devices a user logs in from, or the specific critical operations being performed; a change in its size is a possible indicator of compromise. But before we can spot a change, we first need to "learn" the normal value. As a basic example, say the CEO almost always logs in from three devices per day: a phone, a tablet, and a laptop. If that number grows to four or five, it could be that the CEO started working on a new device or two (still worth confirming). But if it jumps significantly, there is a high probability of a breach.
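
The derivative and cardinality ideas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production detector: the event format (user, day, device ID), the thresholds, and all sample values are hypothetical, and a real system would learn its baselines from history instead of hard-coding them.

```python
from collections import defaultdict

def rate_of_change(counts):
    """Discrete derivative: change in count between consecutive time buckets."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]

def spikes(counts, threshold):
    """Indices of buckets whose increase over the previous bucket exceeds threshold."""
    return [i + 1 for i, delta in enumerate(rate_of_change(counts)) if delta > threshold]

def device_cardinality(events):
    """Count distinct devices per (user, day) from (user, day, device_id) tuples."""
    seen = defaultdict(set)
    for user, day, device in events:
        seen[(user, day)].add(device)
    return {key: len(devices) for key, devices in seen.items()}

# Hypothetical authentication failures per day.
daily_auth_failures = [5, 8, 6, 10, 7, 112, 130]
print(rate_of_change(daily_auth_failures))        # [3, -2, 4, -3, 105, 18]
print(spikes(daily_auth_failures, threshold=50))  # [5]

# Hypothetical login events: a normal Monday, then a suspicious Tuesday.
events = [("ceo", "mon", d) for d in ("phone", "tablet", "laptop", "phone")]
events += [("ceo", "tue", f"device-{n}") for n in range(9)]

baseline, tolerance = 3, 2  # assumed learned baseline and allowed drift
for (user, day), count in device_cardinality(events).items():
    status = "investigate" if count > baseline + tolerance else "ok"
    print(user, day, count, status)
```

Note that the spike is flagged on the day the count jumps (index 5), even though the following day's count is higher still; that is the difference between watching the rate of change and watching the quantity.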

Many organizations and security teams still do breach detection the old-fashioned way, collecting logs from anywhere and everywhere and searching them for patterns or regular expressions, but that is clearly no longer adequate. Again, logs are still useful for forensics. But to limit the window of exposure and improve time-to-detection so that remediation can begin sooner, combining time series data with math and data science principles is proving to be extremely valuable.
