
Commentary | Nik Whitfield | 12/21/2016, 10:30 AM

Security Analytics: Don't Let Your Data Lake Turn Into A Data Swamp

It's easy to get bogged down when looking for insights from data using Hadoop. But that doesn't have to happen, and these tips can help.

Many technology and security teams, particularly in finance, are running data lake projects together to build data analytics capabilities using Hadoop.

The goal for security teams doing this is to create a platform that lets them gain meaningful, timely insights from shared data to help solve a wide range of problems. These problems range from continuous monitoring of cyber hygiene factors across the IT environment (e.g., asset, vulnerability, configuration, and access management) to identifying threat actors moving across their networks by correlating logs across large, cumbersome data sets such as those from Web proxies, Active Directory, DNS, and NetFlow.
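To make that kind of correlation concrete, here is a minimal PySpark sketch of one such analysis. The paths and column names are illustrative assumptions, not anything from a real deployment: it flags hosts whose DNS queries resolve domains that never show up in outbound proxy traffic, a pattern sometimes worth investigating as possible tunnelling or direct-to-IP command-and-control activity.

# Minimal sketch: correlate DNS logs with Web proxy logs.
# Paths and column names (src_host, query_domain, dest_domain) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("proxy-dns-correlation").getOrCreate()

proxy = spark.read.parquet("/data/lake/web_proxy")   # hypothetical path
dns = spark.read.parquet("/data/lake/dns")           # hypothetical path

# Domains requested over DNS but never seen leaving via the proxy.
suspect = (dns
           .select("src_host", "query_domain")
           .join(proxy.select(F.col("dest_domain").alias("query_domain")),
                 on="query_domain", how="left_anti")
           .groupBy("src_host")
           .agg(F.countDistinct("query_domain").alias("unmatched_domains"))
           .orderBy(F.desc("unmatched_domains")))

suspect.show(20)

Even a simple anti-join like this illustrates why the underlying platform matters: the same query that runs in seconds on a well-built lake can be impractical on a poorly assembled one.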

The reason for this trend is that some members of security teams (typically the chief information security officer and leadership team, control managers, security operations, and incident response) have recognized that they're all looking for tailored insights from different analyses of overlapping data sets. For example, the same data that can advance the CISO's executive communications with risk and audit functions can simplify security operations' job of monitoring for and detecting malicious external or internal activity across devices, users, and applications.

By building a platform that can store and enable all the analysis required on this data, security teams are looking to consolidate the output of all their security solutions and, where possible, simplify their tech stacks.

Generally, data lake projects go through four phases: build data lake; ingest data; do analysis; deliver insight. At each of these phases, challenges must be navigated. 

The first is to get a data lake that can support ingesting relevant data sets at the right frequency and speed — and enable varied analysis techniques to generate relevant insights. The second is to build and run efficient analysis on these data sets. The third is to do that in a way that delivers the insights that stakeholders need. The fourth is to give stakeholders a way to interact with information that's tailored to their concerns, decisions, and accountability.

Most projects today run into problems during the first two phases. Here's why.

Phase 1 Problems: Issues with the Build
Building a platform for data analysis on Hadoop is hard. Hadoop isn't a single technology; it's a rather complex ecosystem of technology components: HDFS, Spark, Kafka, YARN, Hive, HBase, Phoenix, and Solr, to name a few. Figuring out which components of this ecosystem play nicely with each other is tricky. Some components work for some use cases but not for others. First you need to know how to connect the parts that work with each other to solve the range of problems you face. Then you need to know when the limits of these joined-up parts will be reached for the analysis you want to conduct. And that's before you've considered that there are several different vendors offering Hadoop distributions.
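As a hedged sketch of how a few of those parts are commonly wired together (broker addresses, topic name, and paths below are assumptions for illustration only), Spark Structured Streaming can read security events from a Kafka topic and land them in HDFS as Parquet that Hive or Spark SQL can later query.

# Sketch: Kafka -> Spark Structured Streaming -> Parquet on HDFS.
# Broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "security-events")
          .load()
          .selectExpr("CAST(value AS STRING) AS raw_event",
                      "timestamp AS ingest_time"))

query = (events.writeStream
         .format("parquet")
         .option("path", "/data/lake/raw/security_events")
         .option("checkpointLocation", "/data/lake/_checkpoints/security_events")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()

Each link in that chain (Kafka retention, Spark checkpointing, file sizes on HDFS) has its own limits, which is exactly where the joined-up parts tend to break first.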

Sometimes you'll find that you need to swap out an underlying part of your data lake to get the performance you need (either for data ingestion or the math your data scientists want to apply to the data). This requires expensive people with experience and the ability to experiment quickly when new components are released that enable greater functionality or performance improvements. 

If you build a data lake in-house, work on a minimum viable product model. Develop the capabilities you need for the analysis you need rather than trying to build a platform that will solve every problem you can imagine. Start with a small set of problems and with relatively simple data sets. Speak to your peers about paths they walked down that were nonproductive and budget carefully. If you go down the "buy" route, grill vendors about their promises of a "data lake out of a box." At what point does the data analysis workflow break? What is the extensibility of the data lake as your scale and IT operating model change?

Phase 2 Problems: Ingestion Indigestion
Once you have a data lake, it's time to ingest data. Here are two errors commonly made.

First: "Let's ingest all the data we can get our hands on, then figure out what analysis we want to do later." This leads to problems because sooner or later the CFO will demand to see some value (that is, insight) from the investment. At that point, either it becomes apparent that the data sets aren't well suited to generating meaningful insights and budgets get cut, or analysis begins but valuable correlations are hard to find in the data at hand, and the insights that can be presented under time pressure are underwhelming.

The second error is to ingest data sets that aren't well curated, cleaned, and understood. The result is that you end up with a data swamp, not a data lake. This often happens when Hadoop is used to replicate ad hoc data analysis methods already in place. It can also happen when the IT operations team doesn't have time to move the data that would be most useful across the network. 
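One way to avoid the swamp is to enforce curation at the point of ingestion. Below is a minimal sketch, assuming illustrative paths and column names: an explicit schema is applied on read, and rows that fail basic quality checks are quarantined rather than mixed into the curated store.

# Sketch: curated ingestion with an enforced schema and a quarantine area.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("curated-ingest").getOrCreate()

schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("hostname", StringType()),
    StructField("username", StringType()),
    StructField("event_type", StringType()),
])

raw = spark.read.schema(schema).json("/landing/ad_logons")  # hypothetical landing zone

# Keep only rows with the fields every downstream analysis depends on.
clean = raw.filter(F.col("event_time").isNotNull() & F.col("hostname").isNotNull())
rejects = raw.subtract(clean)

clean.write.mode("append").parquet("/data/lake/curated/ad_logons")
rejects.write.mode("append").parquet("/data/lake/quarantine/ad_logons")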

There are three principles that can help when ingesting data for analysis:

  1. Identify data that is readily available and enables fast time to value at low cost.
  2. Prioritize data that supports solving problems you know you have and that you know you can solve.
  3. Use the minimum amount of data needed to deliver maximum insights to many stakeholders.

For example, there's a lot you can achieve with just these data sources: vulnerability scan, antivirus, Active Directory, and your configuration management database.
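As one hedged illustration of the kind of insight those four sources support (table and column names below are assumptions, not a prescribed model), joining vulnerability scan results against the CMDB asset inventory surfaces assets that have never been scanned, broken down by business unit.

# Sketch: scan coverage by business unit from CMDB + vulnerability scan data.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scan-coverage").getOrCreate()

cmdb = spark.table("cmdb_assets")           # hypothetical Hive table
scans = spark.table("vuln_scan_results")    # hypothetical Hive table

unscanned = (cmdb
             .join(scans.select("asset_id").distinct(), on="asset_id", how="left_anti")
             .groupBy("business_unit")
             .agg(F.count("asset_id").alias("unscanned_assets")))

unscanned.orderBy(F.desc("unscanned_assets")).show()

The same small set of tables can answer questions for the CISO (coverage trends), security operations (which assets to chase), and audit (evidence of control operation), which is the point of principle 3.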

Lastly, if you're going to spend time understanding, cleaning, and ingesting data, it's worth making sure the data sets you choose can solve complementary problems and lay the foundation to solve more complicated problems more easily. 


Nik Whitfield is the founder and CEO at Panaseer. He founded the company with the mission to make organizations cybersecurity risk-intelligent. His team created the Panaseer Platform to automate the breadth and depth of visibility required to take control of ...