Many technology and security teams, particularly in finance, are running data lake projects together to build data analytics capabilities using Hadoop.
The goal for security teams that are doing this is to create a platform that lets them gain meaningful, timely insights from similar data to help solve a wide range of problems. These problems range from continuous monitoring of cyber hygiene factors across the IT environment (e.g., asset, vulnerability, configuration, and access management) to identifying threat actors moving across their networks by correlating logs across large, cumbersome data sets such as those from Web proxies, Active Directory, DNS, and NetFlow.
The reason for this trend is that some members of security teams (typically the chief information security officer and leadership team, control managers, security operations, and incident response) have recognized that they're all looking for tailored insights from different analyses of overlapping data sets. For example, the same data that can advance the CISO's executive communications with risk and audit functions can simplify security operations' job of monitoring for and detecting malicious external or internal activity across devices, users, and applications.
By building a platform that can store and enable all the analysis required on this data, security teams are looking to consolidate the output of all their security solutions and, where possible, simplify their tech stacks.
Generally, data lake projects go through four phases: build data lake; ingest data; do analysis; deliver insight. At each of these phases, challenges must be navigated.
The first challenge is to build a data lake that can ingest relevant data sets at the right frequency and speed, and that supports varied analysis techniques to generate relevant insights. The second is to build and run efficient analysis on these data sets. The third is to do that in a way that delivers the insights that stakeholders need. The fourth is to give stakeholders a way to interact with information that's tailored to their concerns, decisions, and accountability.
Most projects today run into problems during the first two phases. Here's why.
Phase 1 Problems: Issues with the Build
Building a platform for data analysis on Hadoop is hard. Hadoop isn't a single technology; it's a rather complex ecosystem of technology components: HDFS, Spark, Kafka, YARN, Hive, HBase, Phoenix, and Solr, to name a few. Figuring out which components of this ecosystem play nice with each other is tricky. Some components work for some use cases, but not for others. First you need to know how to connect the parts that work with each other to solve the range of problems you face. Then you need to know when the limits of these joined-up parts will be reached for the analysis you want to conduct. That's before you've considered that there are several different vendors of the Hadoop distribution.
Sometimes you'll find that you need to swap out an underlying part of your data lake to get the performance you need (either for data ingestion or the math your data scientists want to apply to the data). This requires expensive people with experience and the ability to experiment quickly when new components are released that enable greater functionality or performance improvements.
If you build a data lake in-house, work on a minimum viable product model. Develop the capabilities you need for the analysis you need rather than trying to build a platform that will solve every problem you can imagine. Start with a small set of problems and with relatively simple data sets. Ask your peers which paths they walked down turned out to be nonproductive, and budget carefully. If you go down the "buy" route, grill vendors about their promises of a "data lake out of a box." At what point does the data analysis workflow break? How extensible is the data lake as your scale and IT operating model change?
Phase 2 Problems: Ingestion Indigestion
Once you have a data lake, it's time to ingest data. Here are two errors commonly made.
First: "Let's ingest all data we can get our hands on, then figure out what analysis we want to do later." This leads to problems because sooner or later, the CFO will demand to see some value (that is, insight) from the investment. At that point, one of two things happens. Either it becomes apparent that the data sets aren't suited to generating meaningful insights, and budgets get cut; or analysis begins, but valuable correlations are hard to find in the data on hand, and the insights that can be presented under time pressure are underwhelming.
The second error is to ingest data sets that aren't well curated, cleaned, and understood. The result is that you end up with a data swamp, not a data lake. This often happens when Hadoop is used to replicate ad hoc data analysis methods already in place. It can also happen when the IT operations team doesn't have time to move the data that would be most useful across the network.
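One way to keep poorly curated data out of the lake is to put a validation gate in front of ingestion. The following is a minimal sketch in Python, not a prescribed implementation: the record layout (hostname, scanned_at, cve_id, severity), the severity values, and the function names are all assumptions made for illustration. Records that fail the checks are quarantined with a reason rather than silently loaded.

```python
from datetime import datetime

# Hypothetical schema for a vulnerability-scan record; field names are illustrative.
REQUIRED_FIELDS = {"hostname": str, "scanned_at": str, "cve_id": str, "severity": str}
VALID_SEVERITIES = {"low", "medium", "high", "critical"}  # assumed vocabulary


def validate(record):
    """Return a list of problems found in one record (empty list means clean)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = record.get(field)
        if not isinstance(value, expected_type) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    if problems:
        return problems  # skip content checks when required fields are absent
    try:
        datetime.fromisoformat(record["scanned_at"])
    except ValueError:
        problems.append("scanned_at is not an ISO 8601 timestamp")
    if record["severity"].lower() not in VALID_SEVERITIES:
        problems.append(f"unknown severity: {record['severity']}")
    return problems


def partition(records):
    """Split records into (clean, quarantined) so bad rows never reach the lake."""
    clean, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            quarantined.append({"record": record, "problems": problems})
        else:
            # Normalize values so later joins do not depend on source casing.
            clean.append({
                **record,
                "hostname": record["hostname"].strip().lower(),
                "severity": record["severity"].lower(),
            })
    return clean, quarantined
```

Keeping the quarantined records, together with the reasons they failed, gives the team that owns the source system something concrete to fix, which is part of understanding a data set before relying on it.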
There are three principles that can help when ingesting data for analysis:
- Identify data that is readily available and enables fast time to value at low cost.
- Prioritize data that supports solving problems you know you have and that you know you can solve.
- Use the minimum amount of data needed to deliver maximum insights to many stakeholders.
For example, there's a lot you can achieve with just these data sources: vulnerability scan, antivirus, Active Directory, and your configuration management database.
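To make that concrete, here is a small Python sketch of the kind of cyber hygiene view these four sources can support once they share a common key. Everything in it is hypothetical: the field names, the sample records, and the choice of hostname as the join key are assumptions for illustration, and a real deployment would run an equivalent join inside the data lake rather than over in-memory lists.

```python
# All records below are made-up illustrations, not real data.
cmdb = [
    {"hostname": "web01", "owner": "ecommerce"},
    {"hostname": "db01", "owner": "finance"},
]
vuln_scan = [
    {"hostname": "web01", "severity": "critical"},
    {"hostname": "web01", "severity": "low"},
    {"hostname": "db01", "severity": "medium"},
]
antivirus = [
    {"hostname": "web01", "agent_running": True},
    {"hostname": "db01", "agent_running": False},
]
active_directory = [
    {"hostname": "web01", "enabled": True},
    {"hostname": "db01", "enabled": True},
    {"hostname": "lab07", "enabled": True},
]


def hygiene_report(cmdb, vuln_scan, antivirus, active_directory):
    """Join the four sources on hostname and flag basic hygiene gaps."""
    av_hosts = {r["hostname"] for r in antivirus if r["agent_running"]}
    ad_hosts = {r["hostname"] for r in active_directory if r["enabled"]}
    known_hosts = {r["hostname"] for r in cmdb}

    # Count critical findings per host from the vulnerability scan.
    critical = {}
    for finding in vuln_scan:
        if finding["severity"] == "critical":
            host = finding["hostname"]
            critical[host] = critical.get(host, 0) + 1

    report = [
        {
            "hostname": asset["hostname"],
            "owner": asset["owner"],
            "missing_antivirus": asset["hostname"] not in av_hosts,
            "critical_vulns": critical.get(asset["hostname"], 0),
        }
        for asset in cmdb
    ]
    # Machines active in the directory but absent from the CMDB are untracked assets.
    untracked = sorted(ad_hosts - known_hosts)
    return report, untracked


report, untracked = hygiene_report(cmdb, vuln_scan, antivirus, active_directory)
```

The same joined view can serve several stakeholders at once: the per-owner rows support control managers and executive reporting, while the untracked hosts give security operations a starting list to investigate.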
Lastly, if you're going to spend time understanding, cleaning, and ingesting data, it's worth making sure the data sets you choose can solve complementary problems and lay the foundation to solve more complicated problems more easily.