The great promise machine learning holds for the security industry is its ability to detect advanced and unknown attacks, particularly those leading to data breaches. Applications range from traditional uses, such as malware detection, to newer areas like detecting attackers who have already circumvented preventative security.
Unfortunately, machine learning, which is rapidly becoming a popular marketing term, has lost much of its meaning because virtually all vendors define it differently. One way to get beyond the jargon is to look at ML from the perspective of who actually performs it, and where. But first, some basic concepts and definitions.
Any ML algorithm is only as strong as the data modeling behind it; the algorithm itself plays a secondary role. If the selected input parameters include none that can predict the result, even a fancy algorithm will produce results with very low accuracy, and it will generate a lot of noise when used outside of a lab environment.
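To make the point about data modeling concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the synthetic data, the feature names `signal` and `noise`, and the two schemes being compared. It shows a single learned threshold on a predictive feature against a nearest-neighbour lookup on a feature unrelated to the label.

```python
import random

rng = random.Random(0)  # fixed seed so the toy run is repeatable

def make_sample():
    # Hypothetical features: 'signal' drives the label, 'noise' is unrelated to it.
    signal = rng.random()
    noise = rng.random()
    label = 1 if signal > 0.5 else 0
    return signal, noise, label

train = [make_sample() for _ in range(300)]
test = [make_sample() for _ in range(200)]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Simple scheme, right data: a threshold halfway between the class means of 'signal'.
pos_mean = mean(s for s, _, y in train if y == 1)
neg_mean = mean(s for s, _, y in train if y == 0)
threshold = (pos_mean + neg_mean) / 2

def simple_predict(signal):
    return 1 if signal > threshold else 0

# More elaborate scheme, wrong data: nearest neighbour using 'noise' only.
def knn_predict(noise):
    nearest = min(train, key=lambda row: abs(row[1] - noise))
    return nearest[2]

simple_acc = mean(simple_predict(s) == y for s, _, y in test)
noise_acc = mean(knn_predict(n) == y for _, n, y in test)
print(f"threshold on predictive feature: {simple_acc:.2f}")
print(f"nearest neighbour on unrelated feature: {noise_acc:.2f}")
```

In this contrived setup the simple threshold should land close to perfect accuracy, while the nearest-neighbour scheme should hover around chance, because the feature it sees carries no information about the label.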
A basic principle in data science is that simple schemes with the right data modeling work better than complex schemes. So in evaluating options, it’s wise to look for vendors that have real domain expertise rather than a large staff of PhDs. That’s because understanding the parameters and various scenarios is more important than the development of an algorithm for correlating data. Domain expertise directly affects the quality of the data modeling. Consequently, if it’s hard to understand how ML is used, it probably means that it is not relevant to the way the product works.
As for understanding the various flavors of ML, one approach is to divide products into categories based on who (or what) actually performs the machine learning work: the vendor, the product or the end-user.
In the vast majority of cases, the term machine learning actually describes one of the tools that the vendor uses to develop its product or generate threat intelligence. In these cases, the vendor performs ML in its lab, rather than the product doing it on premises.
A typical example: AV and URL filtering vendors that perform ML behind the scenes. In order to keep their signatures (or threat intelligence) reasonably current and to process heavy loads of malware and viruses that have been encountered, vendors need to leverage ML in their labs to automate the classification and signature creation process. This use of ML occurs in the vendor’s lab and results in signatures or threat intelligence that the product then uses to detect specific patterns or artifacts.
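The product side of this model can be sketched as a plain lookup. The snippet below is a deliberately minimal illustration, assuming that a signature is nothing more than a SHA-256 hash of a file; real signatures are usually richer than this. The set `KNOWN_BAD_SHA256` and the sample payloads are hypothetical.

```python
import hashlib

# Hypothetical signature set, standing in for what the vendor's lab ships to the product.
KNOWN_BAD_SHA256 = {
    hashlib.sha256(b"malicious-payload-v1").hexdigest(),
}

def is_flagged(file_bytes: bytes) -> bool:
    """Deterministic lookup: the same input gets the same verdict in any environment."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_SHA256

known = is_flagged(b"malicious-payload-v1")    # matches a shipped signature
variant = is_flagged(b"malicious-payload-v2")  # slightly changed sample, no signature yet
print(known, variant)
```

The product does no learning of its own here. Any ML happened earlier, when the signature set was built.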
Typical products: AV, sandboxing, anti-bot, whitelisting and rule-based event correlation.
Advantage: the products are deterministic and will always operate in the same way, regardless of the environment.
Disadvantage: the products are rule-based and can leverage only known artifacts, which leads to low detection accuracy for anything new (e.g. AVs inherently don’t detect new malware well). Attackers can also test against the product until they find a way to circumvent its detection.
Some products perform ML as an integral part of their function, typically for behavioral detection. In this case the product “learns” the specific environment and uses that information for detection. For example, it might observe a user or machine starting to access resources that it never accessed before and that the user’s peer group doesn’t typically access. There is no predetermined rule, signature or pattern that can detect this. Accurate detection is possible only by profiling normal behavior in the particular network and applying that knowledge to detect anomalous behavior.
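A greatly simplified sketch of that profiling idea follows. It assumes access events arrive as `(user, resource)` pairs and that peer groups are already known; the function names, users and resources are all hypothetical. Real products would rely on statistical models rather than exact set membership.

```python
from collections import defaultdict

def build_profiles(events):
    """Learning period: record which resources each user has accessed."""
    profiles = defaultdict(set)
    for user, resource in events:
        profiles[user].add(resource)
    return profiles

def is_anomalous(user, resource, profiles, peer_groups):
    """Flag access that is new for the user and unseen among the user's peers."""
    if resource in profiles.get(user, set()):
        return False
    peers = peer_groups.get(user, set())
    return not any(resource in profiles.get(peer, set()) for peer in peers)

# Hypothetical history observed in one particular environment.
history = [
    ("alice", "wiki"), ("alice", "crm"),
    ("bob", "wiki"), ("bob", "billing"),
    ("carol", "source-repo"),
]
peer_groups = {"alice": {"bob"}, "bob": {"alice"}, "carol": set()}
profiles = build_profiles(history)

print(is_anomalous("alice", "billing", profiles, peer_groups))      # a peer uses it
print(is_anomalous("alice", "source-repo", profiles, peer_groups))  # new to user and peers
```

The verdict depends entirely on what was observed in this environment. The same access could be normal in another network.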
“Behavioral analysis” by itself doesn’t mean machine learning. Many products look at behaviors and apply rules or signatures. For example, sandboxing products typically run malware in a sandbox environment, examine its behavior and then compare that behavior against a list of rules previously developed by the vendor in its lab (using different methods, including machine learning). In this case the product itself does not perform any ML. A product that performs ML must have a self-training/learning/profiling period. Products that don’t operate this way do not belong in this category, even if they are said to perform “behavioral analysis” or “detection”.
A relatively new security application for machine learning is detection of attacks that have evaded preventative security. While malware detection doesn’t necessarily need ML-capable products, more general behavioral attack detection is usually based around the activities of a human attacker or insider. The system has to essentially customize its logic to the environment in order to accurately detect the activities. This area represents a substantial break from traditional security in that the goal is to identify unknown anomalous behaviors that neither the end user nor the vendor specified in advance, rather than evaluate against known, already-defined technical artifacts.
Typical products: fraud detection, anomaly detection, attack detection, behavioral detection. A product in this category has to have a self-learning/profiling period, so other “behavioral analysis” products are not included here.
Advantage: By leveraging ML, these products can obtain higher detection accuracy and a lower rate of false positives. They automatically tune their detection to each specific environment and can detect unknown activity that neither the end user nor the vendor specified in advance. Additionally, they can’t be “gamed” the way a statically defined technical artifact can be learned and then circumvented by an attacker.
Disadvantage: Detection depends on the profile of the specific environment, making the process less predictable. The products are optimized for automated detection rather than for generic queries on the data.
This category includes products that are toolkits used by data scientists to perform ML. For example, business intelligence (BI) tools enable the end user to define datasets and run correlations, regressions and clustering algorithms. In this case the end user is the data scientist who leverages ML, and the product is only a tool at his or her disposal. The end user decides which data to process, what parameters to use and how to interpret the results.
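As a small illustration of the toolkit model, the sketch below computes a Pearson correlation by hand. The two daily count series and their names are made up for the example. The tool only does the arithmetic; choosing these two series and deciding what the number means is left to the analyst.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical daily counts chosen by the analyst, not by the tool.
failed_logins = [2, 3, 1, 8, 9, 2, 10]
helpdesk_tickets = [1, 1, 0, 4, 5, 1, 6]

r = pearson(failed_logins, helpdesk_tickets)
print(f"correlation: {r:.2f}")
```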
Typical products: Business intelligence products, mathematical/statistical analysis toolkits, SIEM products with analytics toolkits.
Advantage: Lets the user perform custom analytics on custom datasets.
Disadvantage: Can only be leveraged if the security team has data scientists. The responsibility is on the analyst, rather than the tool, to define the problem, the input data and the conclusions. The analyst will not see patterns that he or she wasn’t looking for. Collecting the data needed for custom analytics is also a heavy task that requires additional products and storage.