If machine learning can be demonstrated to solve particular use cases in an open forum, more analysts will be willing to adopt the technology in their workflows.

Andrew Fast, Chief Data Scientist and Co-Founder, Counterflow AI

January 30, 2019

4 Min Read

As a data scientist, I'm always looking for new patterns and insights that guide action — especially ways to make data science more effective for cybersecurity. One pattern I see consistently throughout the industry is the inability to operationalize machine learning in a modern security operations center. The challenge is that the capabilities behind different machine-learning models are difficult to explain. And if those of us in security can't understand how something works, and how to apply it to what we do, why on earth would we trust it?

Machine learning (ML) can revolutionize the security industry, change the way we identify threats, and mitigate disruption to the business. We've all heard that. Things break down when we start to talk about ML more in practice and less in theory.  

Trust is built through education, testing, and experience. Unfortunately, commercial interests have muddied the situation. Far too often, we see commercial offerings rolled out with assurances that customers can hit the ground running on day one, without any explanation of how the artificial intelligence (AI) behind them arrives at specific insights. We call this a "black-box approach," but more "explainable AI" approaches are needed. We don't need to be told why to use a hammer. We need to be told how.

Understanding the "how" comes from practice and learning from others. This points to another fundamental requirement: easy access to ML code with which to experiment and share outcomes and experiences with a like-minded community.

That door leads us to the open source community. The typical security analyst comes to the table with a specific challenge that needs to be solved in a network environment, such as defending against sophisticated threat actors. The analyst knows how to write rules to prevent a specific tactic or technique from being used again, but cannot detect the patterns needed to proactively hunt threats without models that dynamically assess data as it arrives. If machine learning can be demonstrated to solve particular use cases in an open forum, more analysts will be willing to adopt the technology in their workflows.
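To make that gap concrete, here is a minimal, hypothetical Python sketch: a fixed, signature-style rule catches only the known pattern it encodes, while even a simple running baseline can score each event as it arrives. The field names, the example rule, and the sample events are illustrative assumptions, not anything prescribed in this article.

    import statistics

    def static_rule(event: dict) -> bool:
        # Signature-style logic: flags only the exact indicator it was written for.
        return "/wp-login.php" in event.get("uri", "")

    class RunningBaseline:
        """Keeps the values seen so far and scores each new event against them."""
        def __init__(self):
            self.values = []

        def score(self, value: float) -> float:
            # Compare the new event to the history, then add it to the history.
            if len(self.values) >= 2:
                mean = statistics.fmean(self.values)
                std = statistics.pstdev(self.values) or 1.0
                z = (value - mean) / std
            else:
                z = 0.0  # not enough history yet to judge
            self.values.append(value)
            return z

    baseline = RunningBaseline()
    events = [
        {"uri": "/index.html", "resp_bytes": 1200.0},
        {"uri": "/about", "resp_bytes": 900.0},
        {"uri": "/index.html", "resp_bytes": 1100.0},
        {"uri": "/wp-login.php", "resp_bytes": 950.0},
        {"uri": "/export", "resp_bytes": 48_000_000.0},
    ]
    for event in events:
        z = baseline.score(event["resp_bytes"])
        print(event["uri"], static_rule(event), round(z, 1))

The rule only ever flags the pattern it was written for; the baseline flags the unusual final transfer it was never explicitly told about, which is the kind of dynamic assessment the analyst is missing.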

Sharing code for use and constant alteration by others, for the good of others and the enterprises they serve, has proved to be a wonderful learning mechanism. Two decades ago, security engineers and analysts faced a similar challenge when few understood how to accurately assess network packets. Then along came Snort, which changed the game. Engineers and analysts learned to assess packets at their own pace, experimented with the code and signatures in simple ways, and in time began to trust real-time traffic analysis in the network intrusion detection system. The open signature ecosystem has grown over time into a global effort.

In recent months, ML code has become readily available in the open source community, offering security analysts opportunities to explore, experiment with, and exchange ideas about ML models, putting them on a path toward easier data pattern recognition. As analysts begin their journey of testing ML code and models for themselves, here are three best practices to keep in mind:

  • Come prepared with a specific problem: No technology is magic. Machine learning can only solve problems for which it is well-suited. Coming to the table with a defined problem makes it easier to determine whether ML can help and, more importantly, helps avoid wasted time and spinning wheels that force a return to square one.

  • Start with the end in mind: Having an idea about how the model could be used in production is a helpful guide during model development. A great model that can't be deployed in production is worthless. Starting with the end guides decisions about algorithm choice, data selection, and which question to address.

  • Remember that simplicity is the name of the game: Start with simple data counts, look at frequencies and standard deviations, and gradually move to statistics and then on to ML models (a brief sketch of that progression follows this list). Simpler approaches can be deployed more easily. Remember: A model in the lab does not produce value until it is used on live data.
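As a quick illustration of that counts-first progression, here is a hedged sketch assuming hypothetical flow records with src_ip and bytes fields in a pandas DataFrame; the column names, sample values, and the z-score threshold are illustrative, not drawn from this article.

    import pandas as pd

    # Hypothetical flow records; in practice these would come from live traffic.
    flows = pd.DataFrame({
        "src_ip": ["10.0.0.5", "10.0.0.5", "10.0.0.7", "10.0.0.7",
                   "10.0.0.8", "10.0.0.8", "10.0.0.9", "10.0.0.9"],
        "bytes":  [1200, 900, 1100, 950, 1050, 980, 45_000_000, 1010],
    })

    # Step 1: simple counts -- how many flows does each source host generate?
    counts = flows.groupby("src_ip").size()

    # Step 2: basic statistics -- mean and standard deviation of bytes per flow.
    mean_bytes = flows["bytes"].mean()
    std_bytes = flows["bytes"].std()

    # Step 3: a plain z-score flag; anything far from the norm deserves a look
    # before any ML model enters the picture.
    flows["z_score"] = (flows["bytes"] - mean_bytes) / std_bytes
    suspicious = flows[flows["z_score"].abs() > 2]

    print(counts)
    print(suspicious[["src_ip", "bytes", "z_score"]])

Only when these simple views stop answering the question is it worth reaching for a trained model.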

Sharing one's experiences experimenting with models is vital to advancing the adoption of machine learning and building trust over time. As more problems are shared, a deeper catalog of use-case recipes can be built to help analysts optimize their ML models. Analysts helping other analysts, for the good of the community and the good of enterprises, is common. It is very easy to detect a pattern here. All doors lead to open source.


About the Author

Andrew Fast

Chief Data Scientist and Co-Founder, Counterflow AI

Andrew Fast is the chief data scientist and co-founder of CounterFlow AI. CounterFlow AI builds advanced network traffic analysis solutions for world-class security operations centers (SOCs). Previously, Dr. Fast served as the chief scientist at Elder Research, a leading data science consulting firm, where he helped hundreds of companies expand their data science capabilities. He is a frequent author, teacher, and invited speaker on data science topics. In 2012, he co-authored the book "Practical Text Mining," which was published by Elsevier and won the PROSE Award for top book in the field of Computing and Information Sciences for that year. His work on analyzing NFL coaching trees was featured on ESPN.com in 2009. Dr. Fast received PhD and MS degrees in Computer Science from the University of Massachusetts Amherst and a BS in Computer Science from Bethel University.
