Know Thyself Through Data-Driven Security Q&A

Two financial services security pros discuss how to correlate and contextualize data an organization already has to create actionable metrics that can bolster risk management practices
It's almost an inevitability at IT security conferences that some speaker will break out the Sun Tzu quote about knowing your enemy and yourself to avoid disaster in battle. But in this day of threat intelligence feeds and cyberawareness, all too often the emphasis is put on intelligence-gathering about the adversary. Meanwhile, the more obvious and often more available data about oneself remains unharvested.

At the recent UNITED Security Summit, two banking executives from a top 25 U.S. financial institution (who shared best practices on the condition of not naming their employer) challenged that lack of self-awareness, advising fellow practitioners to take a deeper dive into readily available data about their systems, users, and patterns in their environments to improve their risk management strategies with meaningful action. That process starts and ends with what Kelly White, vice president and information security manager, called a security Q&A for an organization.

"In order to set yourself up to be able to answer those security questions, the primary steps in doing that is setting it up as a data-centric problem," he said, explaining that the process involves collecting the right data, correlating that data, and then contextualizing the data so it's meaningful to the business.

He explained that as his organization -- which manages 30,000 to 40,000 systems on its network -- embarked on the years-long process of finding the right questions to ask and going through the collection, correlation and contextualization to answer them, it experienced plenty of hiccups early on.

"We weren't asking sophisticated security questions, nor were we doing a good job contextualizing it," he said, explaining that the Q&A was limited to simply asking, for example, what vulnerabilities its scanners were producing, and the answer was limited to the reports those scanners produced.

But questions like that don't offer much value to the risk management process on their own. Instead, said White's colleague, Adam Collins, organizations should be trying to raise the bar of sophistication. For example, network instrumentation data could offer a good view of the organization's network footprint, so a question to ask there would be, "How did the footprint change between yesterday and today?" Or you may have knowledge about how many rogue systems are running on the network, but the real question to ask would be, "How many rogue systems are there, and who has accessed them in the past week?"
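
To make those two questions concrete, here is a minimal Python sketch. It assumes the daily footprint is available as a set of observed IP addresses and that access records can be reduced to (user, IP, timestamp) tuples. All names and sample values (`footprint_change`, `rogue_access`, the addresses, the dates) are hypothetical illustrations, not anything the bank described.

```python
from datetime import datetime, timedelta

# Hypothetical snapshots: IP addresses observed on the network each day.
yesterday = {"10.0.0.5", "10.0.0.8", "10.0.1.20"}
today = {"10.0.0.5", "10.0.1.20", "10.0.2.77"}

# Hypothetical asset inventory and access log (user, destination IP, time).
inventory = {"10.0.0.5", "10.0.0.8", "10.0.1.20"}
access_log = [
    ("alice", "10.0.2.77", datetime(2024, 1, 9)),
    ("bob", "10.0.0.5", datetime(2024, 1, 9)),
    ("carol", "10.0.2.77", datetime(2023, 12, 1)),
]

def footprint_change(before, after):
    """Return (appeared, disappeared) hosts between two snapshots."""
    return after - before, before - after

def rogue_access(observed, inventory, access_log, now, days=7):
    """Map each rogue IP (observed but not inventoried) to recent users."""
    rogue = observed - inventory
    cutoff = now - timedelta(days=days)
    users_by_ip = {}
    for user, ip, ts in access_log:
        if ip in rogue and ts >= cutoff:
            users_by_ip.setdefault(ip, set()).add(user)
    return users_by_ip

appeared, disappeared = footprint_change(yesterday, today)
recent = rogue_access(today, inventory, access_log, now=datetime(2024, 1, 10))
print("appeared:", sorted(appeared))
print("disappeared:", sorted(disappeared))
print("rogue access in the past week:", recent)
```

The point of the sketch is that neither answer comes from one tool's report; each needs two data sets joined on a shared key, here the IP address.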

"The good news is that the data that you need to answer your security questions, you already have. The data is there in your environment," said Collins, senior information security engineer for the bank. "It may not be simple to get at. It may be spread out over these different data points -- your platform configuration, your NetFlow data, your server and network vulns, your malware events, your network ingress -- but it's there."

Correlate And Contextualize
According to Collins, with the advent of big data stores, "the sky's the limit" on how much data an organization can store and analyze to get value from the information it is collecting.

He and White reported that their organization takes data from about 100 different types of data feeds. These include feeds like SQL server logs, firewall logs, system logs, PCAP files, and Active Directory information through LDAP queries.

"But when you look at scattered sets of data like this, it can seem unapproachable," White said. "The trick is extracting that and putting it into a central location where it can be analyzed."

He said that NoSQL systems have made it much easier for his organization to build out a centralized common system to do correlation and analysis. Collins also advised organizations to remember that 80 percent of the benefit will come from about 20 percent of the data points. "So it's good to start at the high-value data point targets, like user Web activity, Active Directory to bind everything together, your vulns -- whether on workstations, laptops, or servers -- your malware agent logs to know infection rates, your IP ports and addresses, and DNS," he said. "That's a good starting point and less overwhelming than trying to take all the data at once."

Collection is one thing, but it's the correlation that makes it possible to answer sophisticated security questions about oneself, White said. "What happens when you correlate is that it exponentially magnifies the value of that data as opposed to when it stands alone," he said, explaining that when data is tied together, "we can answer some cool, more useful actionable security questions."

One of the most important correlations, White said, is through DNS IP to host name mapping.

"We're just grabbing DNS activity logs for resolution, and, really, 95 percent of the time it boils down to taking an IP address, mapping it to a host name, and then, in some cases, we also map to a MAC address by pulling CAM tables off of switches," he said. "For users we extract from Active Directory, so how do we tie them to the system they've been interacting with? Again, from the domain authentication logs we can get their IP address and, from there, based on DNS, we can get host name. Nine times out of 10 it's that simple."

But correlation isn’t the only important part of the metrics equation. Adding in business context is also critical.

"Our work is only as useful as it is to the business and into the action that it influences," he said, explaining that often the big question is what business unit or process some particular metric relates to. In the banking world, it could be a matter of tying specific metrics to a customer sales system or call center systems or Internet banking system. But with so many back-end systems interrelated, those waters can get muddied very quickly.

"One of the more interesting, more challenging, issues we had is when you've got a large network and a lot of systems on that network, you start to say over time, 'What's the relationship of one system to another system?'"

To answer that contextual question for their organization, White and Collins said their team has had success leveraging NetFlow data coming off of its Cisco network infrastructure.

"That tells you in summary form who is talking to who and over what port," White said. "We know, for example, one system that belongs to our e-commerce site, then based on that NetFlow data we can say, 'OK, well, who does that system talk to? Well, it talks to these two app servers, and these two app servers talk to these systems, and it looks like they're talking this database language.'"


Putting It Together For Meaningful Answers Through Metrics
So what does all that correlation and contextualization look like in the real world? According to Collins, it can mean the difference between handing a business unit a report that says it has X number of vulnerabilities on a laundry list of assets and handing it an enterprise threat readiness report.

"Since we've taken in more data, we've asked more complicated security questions, we've correlated that data, and we've added this rich context, we're able say, 'Here's the different vulnerabilities broken down by insider threat, outsider threat, by regulation, by each individual threat and also going across the columns by the business unit,'" he said.

As for security Q&A, the probing questions are based on what the organization needs to know, not on what data is offered ready-made by a security tool.

For example, they said their organization has asked which users have the worst security behavior. And by correlating system configuration information, Web proxy events, and malware events, they learned that 90 percent of the problems come from 1 percent of the users.

"Which really sets us up to do targeted, follow-up security awareness training," White said.

What's more, they took that a step further and asked, "Which users are the riskiest users?" They tied the answers from the previous question to the organization's application risk catalog and to user permissions to see how bad behavior looked across populations of users with access to the highest-priority applications.
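
One way to picture that second question is to weight each user's behavior by the most sensitive application the user can reach. The scoring rule in this Python sketch (event count multiplied by the highest application risk rating) is an assumption made for illustration, as are the 1-to-5 rating scale, the application names, and the `rank_by_risk` function. The article does not say how the bank scored its users.

```python
# Hypothetical per-user counts of bad-behavior events.
event_counts = {"u01": 10, "u02": 3, "u03": 2}

# Hypothetical application risk catalog on a 1 (low) to 5 (high) scale.
app_risk = {"wire-transfer": 5, "intranet-wiki": 1}

# Hypothetical user permissions: which applications each user can access.
user_apps = {
    "u01": ["intranet-wiki"],
    "u02": ["wire-transfer", "intranet-wiki"],
    "u03": [],
}

def rank_by_risk(event_counts, user_apps, app_risk):
    """Rank users by events weighted by their riskiest accessible app."""
    ranked = []
    for user, n_events in event_counts.items():
        highest = max(
            (app_risk.get(app, 0) for app in user_apps.get(user, [])),
            default=0,
        )
        ranked.append((user, n_events, highest, n_events * highest))
    return sorted(ranked, key=lambda row: row[3], reverse=True)

ranked = rank_by_risk(event_counts, user_apps, app_risk)
for user, n_events, highest, score in ranked:
    print(user, "events:", n_events, "max app risk:", highest, "score:", score)
```

In this made-up sample, the user with the most events is not the riskiest one, because a user with fewer events has access to a higher-rated application.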

Like building up muscle through regular exercise, regularly asking and answering difficult security questions hones thought processes about data collection and correlation that can yield creative answers to some of the toughest metrics problems. For example, one of the most "intractable" problems faced by White and plenty of others in the industry is understanding where sensitive data resides in unstructured data stores, and who has access to those repositories.

In his organization's case, answering that question meant pointing a Google appliance at its systems and configuring it to crawl and index unstructured data so that his team could run regular expressions against the indexed content.

"You get the uniform resource locator and the filename and type of content found and the number of those records," he said, explaining that combining that with Active Directory information for user permissions to fileshares or SharePoint can pinpoint who has access to the sensitive information.

As other organizations seek to engage in data-driven Q&A like White and Collins' organization did, Collins said a real key to the correlation and contextualization process is ensuring that there's a common language for the data sets. It's also important to understand who the owners are for every asset and every system.

"It's great you collect this stuff," he said, "but if you don't have anyone you can communicate back to and have them act on it, it's not really that valuable."
