
2/11/2016
12:30 PM
Giora Engel
Commentary

3 Flavors of Machine Learning: Who, What & Where

To get beyond the jargon of ML, you have to consider who (or what) performs the actual work of detecting advanced attacks: vendor, product or end-user.

The great promise machine learning holds for the security industry is its ability to detect advanced and unknown attacks -- particularly those leading to data breaches. Applications range from traditional uses, such as malware detection, to newer areas like detecting attackers who have circumvented preventative security.

Unfortunately, machine learning (ML), which is rapidly becoming a popular marketing term, has lost much of its meaning because virtually all vendors define it differently. One way to get beyond the jargon is to look at ML from the perspective of who actually performs it, and where. But first, some basic concepts and definitions.

Any ML algorithm is only as strong as the data modeling behind it; the algorithm itself plays only a secondary role. If the selected input parameters include none that can predict the result, even a sophisticated algorithm will produce inaccurate results, and it will generate a lot of noise when used outside of a lab environment.
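The point about data modeling can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data is synthetic, and the two feature names (new_destinations, hostname_length) are hypothetical. The same trivial threshold classifier is trained on each feature in turn; only the feature that actually carries signal yields useful accuracy.

```python
import random

def accuracy(values, labels, threshold):
    """Fraction of samples classified correctly when value > threshold means malicious."""
    hits = sum((v > threshold) == (y == 1) for v, y in zip(values, labels))
    return hits / len(labels)

def fit_threshold(values, labels):
    """Pick the midpoint cut-off that classifies the training data best."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return max(candidates, key=lambda t: accuracy(values, labels, t))

# Synthetic data (assumption): label 1 = malicious host, label 0 = benign host.
rng = random.Random(0)
labels = [i % 2 for i in range(1000)]

# Hypothetical feature that tracks the label: new destinations contacted per day.
new_destinations = [rng.uniform(8, 20) if y else rng.uniform(0, 3) for y in labels]

# Hypothetical feature that is unrelated to the label: length of the hostname.
hostname_length = [rng.randint(5, 15) for _ in labels]

def holdout_accuracy(feature):
    """Fit on the first half of the data, score on the second half."""
    train_x, test_x = feature[:500], feature[500:]
    train_y, test_y = labels[:500], labels[500:]
    return accuracy(test_x, test_y, fit_threshold(train_x, train_y))

acc_predictive = holdout_accuracy(new_destinations)
acc_unrelated = holdout_accuracy(hostname_length)
print(f"predictive feature: {acc_predictive:.2f}")
print(f"unrelated feature:  {acc_unrelated:.2f}")
```

The classifier is deliberately the simplest one possible. Swapping in a more elaborate algorithm would not rescue the unrelated feature, because that feature contains nothing to learn from.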

A basic principle in data science is that simple schemes with the right data modeling work better than complex schemes. So in evaluating options, it’s wise to look for vendors that have real domain expertise rather than a large staff of PhDs. That’s because understanding the parameters and various scenarios is more important than the development of an algorithm for correlating data. Domain expertise directly affects the quality of the data modeling. Consequently, if it’s hard to understand how ML is used, it probably means that it is not relevant to the way the product works.

As for understanding the various flavors of ML, one approach is to divide products into categories based on who (or what) actually performs the machine learning work: the vendor, the product or the end-user.

The Vendor
In the vast majority of cases, the term machine learning actually describes one of the tools the vendor uses to develop its product or generate threat intelligence. In these cases, the vendor performs ML in its lab, rather than the product doing it on premises.

A typical example: AV and URL filtering vendors that perform ML behind the scenes. In order to keep their signatures (or threat intelligence) reasonably current and to process heavy loads of malware and viruses that have been encountered, vendors need to leverage ML in their labs to automate the classification and signature creation process. This use of ML occurs in the vendor’s lab and results in signatures or threat intelligence that the product then uses to detect specific patterns or artifacts.

Typical products: AV, sandboxing, anti-bot, whitelisting and rule-based event correlation.

Advantage: the products are deterministic and will always operate in the same way, regardless of the environment.

Disadvantage: the products are rule-based and can leverage only known artifacts, which leads to low detection accuracy (e.g. AVs inherently don’t detect new malware well). Attackers can circumvent detection and test against the product.
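As a rough sketch of this category, the product side can be pictured as a plain lookup against artifacts the vendor shipped. The signature set and payloads below are hypothetical stand-ins, not any vendor's actual format. The lookup is deterministic, and a trivially modified sample no longer matches.

```python
import hashlib

# Hypothetical signature set, standing in for what a vendor's lab would ship.
KNOWN_BAD_SHA256 = {hashlib.sha256(b"malicious payload v1").hexdigest()}

def is_known_malware(file_bytes: bytes) -> bool:
    """Return True only if the file's hash matches a shipped signature."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_SHA256

original = b"malicious payload v1"
variant = original + b" "  # one appended byte changes the hash

print(is_known_malware(original))  # True: exact match with a known artifact
print(is_known_malware(variant))   # False: the modified sample is not recognized
```

No learning happens in the product here; whatever ML the vendor used stays in the lab that produced the signature set.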

The Product
Some products perform ML as an integral part of their function, typically for behavioral detection. In this case the product “learns” the specific environment and uses that information for detection. For example, the product may observe a user or machine starting to access resources that it never accessed before and that the user’s peer group doesn’t typically access. There is no predetermined rule, signature or pattern that can detect this. You can only achieve an accurate detection by profiling normal behavior in the particular network and applying that knowledge to detect anomalous behavior.
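A minimal sketch of this kind of profiling follows, under simplifying assumptions: the class name, users, peer groups and resources are all hypothetical, and "typically access" is reduced to "at least one peer accessed it during the profiling period." A real product would use far richer statistics.

```python
from collections import defaultdict

class AccessProfiler:
    """Learns which resources each user touches, then flags departures from that baseline."""

    def __init__(self, peer_group_of):
        self.peer_group_of = peer_group_of      # user -> peer group name
        self.seen_by_user = defaultdict(set)    # user -> resources seen while profiling

    def learn(self, user, resource):
        """Record one access observed during the profiling period."""
        self.seen_by_user[user].add(resource)

    def is_anomalous(self, user, resource):
        """True if neither the user nor any peer accessed the resource while profiling."""
        if resource in self.seen_by_user.get(user, set()):
            return False
        group = self.peer_group_of[user]
        peers = [u for u, g in self.peer_group_of.items() if g == group and u != user]
        return not any(resource in self.seen_by_user.get(p, set()) for p in peers)

profiler = AccessProfiler({"alice": "finance", "bob": "finance", "carol": "engineering"})

# Profiling period: the baseline comes from this environment, not from a vendor rule.
for user, resource in [("alice", "ledger-db"), ("bob", "ledger-db"),
                       ("bob", "payroll-share"), ("carol", "source-repo")]:
    profiler.learn(user, resource)

print(profiler.is_anomalous("alice", "ledger-db"))      # False: alice's own baseline
print(profiler.is_anomalous("alice", "payroll-share"))  # False: a finance peer uses it
print(profiler.is_anomalous("alice", "source-repo"))    # True: new for alice and her peers
```

The same code deployed in a different network would flag different accesses, which is the environment-specific behavior this category is defined by.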

“Behavioral analysis” by itself doesn’t mean machine learning. Many products look at behaviors and apply rules or signatures. For example, sandboxing products typically run malware in a sandbox environment, examine its behavior and then compare the behavior against a list of rules previously developed by the vendor in their lab (using different methods, including machine learning). In this case the product itself does not perform any ML. A product that performs ML must have a self-training/learning/profiling period. Products that don’t operate this way do not belong in this category, even if they are said to perform “behavioral analysis” or “detection”.

A relatively new security application for machine learning is detection of attacks that have evaded preventative security. While malware detection doesn’t necessarily need ML-capable products, more general behavioral attack detection is usually based around the activities of a human attacker or insider. The system has to essentially customize its logic to the environment in order to accurately detect the activities. This area represents a substantial break from traditional security in that the goal is to identify unknown anomalous behaviors that neither the end user nor the vendor specified in advance, rather than evaluate against known, already-defined technical artifacts.

Typical products: fraud detection, anomaly detection, attack detection, behavioral detection. A product in this category has to have a self-learning/profiling period, so other “behavioral analysis” products are not included here.

Advantage: Leveraging ML, these products can obtain higher detection accuracy and a lower rate of false positives. They automatically tune their detection to each specific environment and can detect unknown activity that neither the end user nor the vendor specified in advance. Additionally, they can’t be “gamed” by attackers in the way a statically defined technical artifact can be learned and then circumvented.

Disadvantage: The detection depends on the profile of the specific environment, making the process less predictable. The products are optimized for automated detection rather than for generic queries on the data.

The End-user
This category includes products that are toolkits used by data scientists to perform ML. For example, business intelligence (BI) tools enable the end user to define datasets and run correlations, regressions and clustering algorithms. In this case the end user is the data scientist who leverages ML, and the product is only a tool at his or her disposal. The end user decides which data to process, what parameters to use and how to interpret the results.
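To make the division of labor concrete, here is a small sketch of the kind of analysis an end user might run in such a toolkit. The dataset and column names are hypothetical, and the Pearson correlation is written out by hand so the example has no dependencies. The tool only computes the number; choosing the columns and deciding what the result means is left to the analyst.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical per-host daily figures that the analyst chose to compare.
failed_logins = [2, 0, 1, 14, 3, 22, 1]
outbound_megabytes = [40, 35, 38, 310, 52, 480, 36]

r = pearson(failed_logins, outbound_megabytes)
print(f"correlation between failed logins and outbound traffic: {r:.2f}")
```

Whether a strong correlation here indicates an attack, a backup job or a coincidence is a judgment the toolkit cannot make.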

Typical products: Business intelligence products, mathematical/statistical analysis toolkits, SIEM products with analytics toolkits.

Advantage: Lets the user perform custom analytics on custom datasets.

Disadvantage: Can only be leveraged if the security team has data scientists. The responsibility is on the analyst rather than the tool to define the problem, the input data and the conclusions. The analyst will not see patterns that he or she wasn’t looking for. Supporting custom analytics also makes data collection a heavy task that requires additional products and storage.


Giora Engel, vice president, product & strategy at LightCyber, is a serial entrepreneur with many years of technological and managerial experience. For nearly a decade, he served as an officer in an elite technological unit in the Israel Defense Forces, where he initiated and ...
Comments
JouCTO,
User Rank: Apprentice
2/14/2016 | 9:38:22 AM
Outstanding
A refreshingly accurate and honest review of machine learning. Thank you, Giora!

 