Black Hat 2015 is in the books, and once again it was a crazy year in Las Vegas with a mountain of stellar speakers and content. There is an incredible amount to learn in our industry and all of us struggle to keep up. So when I was asked by Dark Reading to author a post on an interesting session, I was happy to oblige. In some ways, I have a bit of an edge. As a long-time Black Hat speaker, I have the honor of serving on the speaker review board to help select many of the presentations on the schedule, all of which I’m always excited to see firsthand!
At the same time, choosing which session to attend is a serious challenge, because there’s simply not enough time to see them all. In the end, I narrowed in on a talk about data science and machine learning, presented by Josh Saxe, to discuss in this blog post.
Josh’s research and presentation dove into the reasons why data science and machine learning apply to malware, and in particular malware detection, threat intelligence, malware analysis, as well as the scalability of malware analytics. For me, this particular technology and application of advanced math is absolutely fascinating. So few of us in the information security industry have a background in this stuff, and it’s time to get familiar. The sooner the better! I’m also incredibly curious about how data science and machine learning may apply to other areas of security, particularly application security – my day job and passion.
In his presentation, Josh correctly illustrates that the battle we face in security is asymmetric: the bad guys’ workload remains constant, while the good guys’ work is ever increasing. Bottom line, given enough time, and right now it ain’t a lot, the bad guys win, which is to say, we get hacked. Yeah, we already know that. The net effect on a defender’s day job, ahead of the hack, is being buried in a sea of log data, which is not only annoying, but expensive.
Malware identification, analysis, and classification are now more difficult than ever and effectiveness is little better than a coin flip. For corporate defenders, there is simply too much log data to ever sort through manually, no matter how proficient you are at grep. Getting any real actionable value at scale requires machine learning to isolate signals in the noise.
On stage, Josh showed a visualization of the fight using malware as an example. SUPER cool stuff! He showed how to take in a bunch of sample applications and analyze them with machine learning algorithms to detect malware. These algorithms cluster related malware strains together, making them easier to identify with the naked eye. This process can happen in seconds or minutes, which is exactly what we need to make security decisions quickly. Manual log analysis, by contrast, can take months or never happen at all. We’ll lose every battle that way.
The more malware samples are analyzed, the smarter the machine learning algorithms become, and the more clearly malware families cluster together. The machine gets smarter with each new strain, making it harder for malware or viruses to break through and be successful without adapting, which increases the attackers’ costs.
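To make the clustering idea a bit more concrete, here is a minimal sketch in Python. To be clear, this is not the method Josh presented. It is a toy illustration that assumes byte n-grams as features, Jaccard similarity as the distance measure, and a simple similarity threshold to link samples into families. The sample names, the made-up byte strings, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of byte n-grams in a sample (a crude feature set)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared features divided by all distinct features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(samples: dict, threshold: float = 0.5) -> list:
    """Link any two samples whose similarity meets the threshold."""
    features = {name: ngrams(blob) for name, blob in samples.items()}
    parent = {name: name for name in samples}  # union-find forest

    def find(x):
        # Walk up to the root of x's group, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(samples, 2):
        if jaccard(features[a], features[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in samples:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical stand-ins for real binaries: two variants of one family
# plus an unrelated benign program.
samples = {
    "family_a_v1": b"connect;download payload;encrypt files;demand ransom",
    "family_a_v2": b"connect;download payload;encrypt files;demand bitcoin",
    "benign_tool": b"open window;render text;save document;exit cleanly",
}

for group in cluster(samples):
    print(group)
```

With these toy inputs, the two variants land in one group and the benign program sits alone. Real systems use far richer features and proper clustering algorithms, but the economics are the same: a new variant that reuses most of its code still lands next to its family.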
Josh also cautioned that algorithms tend to go stale over time, requiring them to evolve or risk generating an increasing volume of time-wasting false positives. The idea here, and hopefully the eventual outcome, is the workload will remain fixed for the good guys, while the bad guys have to increase theirs. Turning the economic tables!
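One simple way to notice a model going stale is to watch how often analysts mark its alerts as false positives. The sketch below is my own illustration, not something from the talk. It assumes analysts give a verdict on each alert, and the class name, window size, and 20% threshold are hypothetical choices.

```python
from collections import deque


class StalenessMonitor:
    """Track the false-positive rate over the most recent analyst verdicts."""

    def __init__(self, window: int = 100, max_fp_rate: float = 0.2):
        # Each entry is True if the alert turned out to be a false positive.
        self.verdicts = deque(maxlen=window)
        self.max_fp_rate = max_fp_rate

    def record(self, was_false_positive: bool) -> None:
        self.verdicts.append(was_false_positive)

    def fp_rate(self) -> float:
        if not self.verdicts:
            return 0.0
        return sum(self.verdicts) / len(self.verdicts)

    def needs_retraining(self) -> bool:
        # Only judge once the window is full, so a handful of early
        # alerts cannot trigger a retrain on their own.
        window_full = len(self.verdicts) == self.verdicts.maxlen
        return window_full and self.fp_rate() > self.max_fp_rate


monitor = StalenessMonitor(window=10, max_fp_rate=0.2)
for _ in range(10):
    monitor.record(False)   # model is fresh: alerts are real
for _ in range(5):
    monitor.record(True)    # later on, false positives creep in
print(monitor.fp_rate(), monitor.needs_retraining())
```

The rolling window matters here: a lifetime average would hide a recent decline behind months of good results.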
Data science and machine learning can also be applied to application security in the same way. It’s possible to identify application entry points, vulnerabilities, defects and more in code using machine learning much faster than most current technology, let alone what a human could do manually. This is a technique that WhiteHat has been using with success for the past decade: login detection, 404 page detection, page-crawling, attack surface detection, and other areas where machine learning is crucial.
Taking this same idea into other areas of security will speed up the process of making the Internet a safer and more secure place for everyone. Machine learning is unquestionably a very powerful tool, and I believe it points a way forward for the industry. I enjoyed Josh’s talk very much and highly recommend it to other practitioners. I for one am going to start honing my machine learning skills. The hardest part is knowing where to begin, but this is as good a place as any.
[Learn more about the pitfalls and promises of data science directly from Josh Saxe in his Dark Reading Radio interview with Community Editor Marilyn Cohodas.]