Operations | Commentary
By Nik Whitfield | 12/21/2016, 10:30 AM

Security Analytics: Don't Let Your Data Lake Turn Into A Data Swamp

It's easy to get bogged down when looking for insights from data using Hadoop. But that doesn't have to happen, and these tips can help.

Many technology and security teams, particularly in finance, are running joint data lake projects to build data analytics capabilities on Hadoop.

The goal for security teams that are doing this is to create a platform that lets them gain meaningful, timely insights from shared data to help solve a wide range of problems. These range from continuous monitoring of cyber hygiene factors across the IT environment (e.g., asset, vulnerability, configuration, and access management) to identifying threat actors moving across the network by correlating logs across large, cumbersome data sets such as those from Web proxies, Active Directory, DNS, and NetFlow.
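To make that kind of cross-source correlation concrete, here is a minimal PySpark sketch that joins Web proxy logs against DNS logs to surface hosts contacting rarely resolved domains. The file paths, column names, and the rarity threshold are all hypothetical; the shape of the query, not the specifics, is the point.

```python
# Minimal sketch: correlate Web proxy and DNS logs to surface hosts
# contacting rarely seen domains. Paths, column names, and the
# threshold below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("proxy-dns-correlation").getOrCreate()

proxy = spark.read.json("hdfs:///security/logs/web_proxy/")  # src_ip, domain, bytes_out
dns = spark.read.json("hdfs:///security/logs/dns/")          # client_ip, query

# Count how often each domain is resolved across the estate.
domain_counts = dns.groupBy("query").count()

# Domains resolved fewer than 5 times count as "rare" for this sketch.
rare = domain_counts.filter(F.col("count") < 5)

# Hosts that sent traffic to those rare domains via the proxy.
suspects = (proxy.join(rare, proxy.domain == rare.query)
                 .select("src_ip", "domain", "bytes_out"))

suspects.show(20, truncate=False)
```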

The reason for this trend is that some members of security teams (typically the chief information security officer and leadership team, control managers, security operations, and incident response) have recognized that they're all looking for tailored insights from different analyses of overlapping data sets. For example, the same data that can advance the CISO's executive communications with risk and audit functions can simplify security operations' job in monitoring and detecting malicious external or internal activity across devices, users, and applications.

By building a platform that can store this data and support all the analysis required on it, security teams are looking to consolidate the output of all their security solutions and, where possible, simplify their tech stacks.

Generally, data lake projects go through four phases: building the data lake, ingesting data, doing analysis, and delivering insight. Each phase has challenges that must be navigated.

The first challenge is to get a data lake that can ingest the relevant data sets at the right frequency and speed, and that enables varied analysis techniques to generate relevant insights. The second is to build and run efficient analysis on those data sets. The third is to do that in a way that delivers the insights stakeholders need. The fourth is to give stakeholders a way to interact with information that's tailored to their concerns, decisions, and accountability.

Most projects today run into problems during the first two phases. Here's why.

Phase 1 Problems: Issues with the Build
Building a platform for data analysis on Hadoop is hard. Hadoop isn't a single technology; it's a complex ecosystem of components: HDFS, Spark, Kafka, YARN, Hive, HBase, Phoenix, and Solr, to name a few. Figuring out which of these components play nicely with each other is tricky. Some work for certain use cases but not others. First you need to know how to connect the parts that do work together to solve the range of problems you face. Then you need to know when those joined-up parts will hit their limits for the analysis you want to conduct. And that's before you've considered that several vendors ship competing Hadoop distributions.
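As one example of what "connecting the parts" looks like in practice, a common wiring is Kafka feeding Spark Structured Streaming, which lands events on HDFS as Parquet. The sketch below assumes a broker address, topic name, and paths that are placeholders, and it requires the Spark-Kafka connector matching your distribution to be on the classpath.

```python
# Minimal sketch of one common component pairing: Kafka -> Spark
# Structured Streaming -> Parquet on HDFS. Broker, topic, and paths
# are placeholders; the spark-sql-kafka connector must be available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "security-events")
               .load())

# Kafka delivers key/value as binary; cast the payload to a string
# before any downstream parsing.
payloads = events.selectExpr("CAST(value AS STRING) AS raw_event")

query = (payloads.writeStream
                 .format("parquet")
                 .option("path", "hdfs:///security/landing/events/")
                 .option("checkpointLocation", "hdfs:///security/checkpoints/events/")
                 .start())

query.awaitTermination()
```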

Sometimes you'll find that you need to swap out an underlying part of your data lake to get the performance you need, whether for data ingestion or for the math your data scientists want to apply to the data. This requires experienced (and expensive) people who can experiment quickly when new components are released that offer greater functionality or better performance.

If you build a data lake in-house, work on a minimum-viable-product model: develop the capabilities you need for the analysis at hand rather than trying to build a platform that will solve every problem you can imagine. Start with a small set of problems and relatively simple data sets. Speak to your peers about unproductive paths they've already walked down, and budget carefully. If you go down the "buy" route, grill vendors about their promises of a "data lake out of a box." At what point does the data analysis workflow break? How well does the data lake extend as your scale and IT operating model change?

Phase 2 Problems: Ingestion Indigestion
Once you have a data lake, it's time to ingest data. Here are two common errors.

First: "Let's ingest all the data we can get our hands on, then figure out what analysis we want to do later." This leads to problems because sooner or later the CFO will demand to see some value (that is, insight) from the investment. Then either it becomes apparent that the data sets aren't suited to generating meaningful insights and budgets get cut, or analysis begins but valuable correlations are hard to find in the data at hand, and the insights presented under time pressure are underwhelming.

The second error is to ingest data sets that aren't well curated, cleaned, and understood. The result is a data swamp, not a data lake. This often happens when Hadoop is used to replicate ad hoc data analysis methods already in place. It can also happen when the IT operations team doesn't have time to move the most useful data across the network.
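One way to keep the swamp at bay is to enforce an explicit schema at ingestion time and quarantine anything that doesn't parse, rather than letting malformed records land in the curated zone. Below is a minimal PySpark sketch of that idea; the paths and fields are hypothetical, and it relies on Spark's standard permissive-mode corrupt-record column.

```python
# Minimal sketch of curated ingestion: enforce an explicit schema and
# divert malformed records instead of letting them pollute the lake.
# Paths and fields are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("curated-ingest").getOrCreate()

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("host", StringType()),
    StructField("user", StringType()),
    StructField("_corrupt_record", StringType()),  # captures unparseable rows
])

raw = spark.read.schema(schema).option("mode", "PERMISSIVE").json(
    "hdfs:///security/raw/auth_events/")

# Cache before filtering on the corrupt-record column, per Spark's
# documented guidance for queries touching that internal column.
raw = raw.cache()

clean = raw.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
rejects = raw.filter(F.col("_corrupt_record").isNotNull())

clean.write.mode("append").parquet("hdfs:///security/curated/auth_events/")
rejects.write.mode("append").json("hdfs:///security/quarantine/auth_events/")
```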

There are three principles that can help when ingesting data for analysis:

  1. Identify data that is readily available and enables fast time to value at low cost.
  2. Prioritize data that supports solving problems you know you have and that you know you can solve.
  3. Use the minimum amount of data needed to deliver maximum insights to many stakeholders.

For example, there's a lot you can achieve with just these data sources: vulnerability scans, antivirus, Active Directory, and your configuration management database.
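As an illustration of the insight even two of those sources can yield, the sketch below joins a vulnerability scan export with CMDB asset records to rank remediation work by business criticality. File names, column names, and the severity cutoff are assumptions for the example.

```python
# Minimal sketch: combine a vulnerability scan export with CMDB asset
# records to rank remediation work by business criticality. File
# names, columns, and the severity cutoff are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("vuln-cmdb-insight").getOrCreate()

vulns = spark.read.csv("hdfs:///security/curated/vuln_scan.csv",
                       header=True, inferSchema=True)   # host, cve, severity
assets = spark.read.csv("hdfs:///security/curated/cmdb.csv",
                        header=True, inferSchema=True)  # host, owner, criticality

# Critical findings on the assets the business cares most about.
priority = (vulns.filter(F.col("severity") >= 9)
                 .join(assets, "host")
                 .groupBy("owner", "criticality")
                 .agg(F.count("cve").alias("critical_findings"))
                 .orderBy(F.desc("critical_findings")))

priority.show()
```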

Lastly, if you're going to spend time understanding, cleaning, and ingesting data, make sure the data sets you choose can solve complementary problems and lay the foundation for tackling more complicated problems later.

Nik Whitfield is a noted computer scientist and cybersecurity technology entrepreneur. In 2014 he founded Panaseer, a cybersecurity software company that gives businesses visibility and insight into their cybersecurity weaknesses.