Analytics
7/8/2014
12:00 PM
Timber Wolfe
Commentary

6 Tips for Using Big Data to Hunt Cyberthreats

You need to be smart about harnessing big data to defend against today's security threats, data breaches, and attacks.

Ask 10 different people what big data is, and you may get 10 different answers. For the sake of this article, big data refers to the mining of usable information from the large amounts of data being created around the world every day. While companies look to take advantage of all this data to improve operations, increase sales, and lower costs, many are discovering that it can also be used for security by offering a broader view of risk and vulnerabilities.

Big data makes it possible to analyze massive numbers of potential security events and draw connections between them, producing a prioritized list of threats. By connecting otherwise disparate data, security professionals can take a proactive approach and prevent attacks rather than merely react to them.

In today’s complex network environments, advanced persistent threats (APTs) and other cyberthreats can be rooted out by leveraging intelligence from data providers that invest the time and resources to find where threats originate and which IP addresses and URLs they communicate with.

To put this approach to work against cyberthreats, your appliances should monitor threat feeds from trusted providers for indicators of compromise (IOCs), including domain name system (DNS) feeds, command-and-control (C2) feeds, and black/white lists, and correlate them against your own data sets to hunt for threats.
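As a rough illustration, not tied to any particular vendor's feed format, the Python sketch below merges two hypothetical feed files with an outbound connection log and flags matches as candidate IOCs. The file names and CSV columns are assumptions for the example.

# correlate_feeds.py -- minimal sketch: match outbound connections against feed data.
# File names, formats, and column layouts are illustrative assumptions.
import csv

def load_feed(path):
    """Load one indicator per line (IP, domain, or URL) into a set."""
    with open(path) as fh:
        return {line.strip().lower() for line in fh if line.strip() and not line.startswith("#")}

def hunt(conn_log_path, blocklist):
    """Yield log rows whose destination appears in the blocklist."""
    with open(conn_log_path) as fh:
        for row in csv.DictReader(fh):   # expects columns: timestamp, src_ip, dst
            if row["dst"].strip().lower() in blocklist:
                yield row

if __name__ == "__main__":
    iocs = load_feed("c2_feed.txt") | load_feed("dns_blacklist.txt")
    for hit in hunt("outbound_connections.csv", iocs):
        print(f'{hit["timestamp"]} {hit["src_ip"]} -> {hit["dst"]}  [matches threat feed]')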

Here are six tips for using big data to help wipe out cyberthreats in your organization.

DNS feed
A DNS data feed can provide lists of newly registered or newly created domains and domains commonly used for spamming. These lists can be incorporated into black and white lists, and traffic to the blacklisted domains should be null routed and logged for further analysis.

Using the network's own DNS servers to check outgoing queries will reveal domains that are not resolving, and a pattern of such failures could mean you have discovered a domain generation algorithm (DGA). All of this information can be put to use in defending your company's network. One other gem here is that the incident response team would have the data it needs to track down the suspect machine: the LAN data. That is yet another reason to log as much LAN traffic as possible.
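As a hedged example of that idea, the sketch below counts failed lookups per internal host from a DNS query log to surface possible DGA activity. The CSV layout and the threshold are illustrative assumptions, not any specific DNS server's export format.

# dga_hint.py -- flag LAN hosts with many failed lookups, a possible DGA indicator.
# The log format (CSV: client_ip, qname, rcode) and threshold are assumptions.
import csv
from collections import Counter, defaultdict

FAIL_THRESHOLD = 25   # tune for your environment

def suspect_hosts(dns_log_path):
    """Return {client_ip: (failure_count, sample_domains)} for hosts over the threshold."""
    failures = Counter()
    samples = defaultdict(list)
    with open(dns_log_path) as fh:
        for row in csv.DictReader(fh):
            if row["rcode"].strip().upper() == "NXDOMAIN":
                failures[row["client_ip"]] += 1
                if len(samples[row["client_ip"]]) < 5:
                    samples[row["client_ip"]].append(row["qname"])
    return {ip: (n, samples[ip]) for ip, n in failures.items() if n >= FAIL_THRESHOLD}

if __name__ == "__main__":
    for ip, (count, domains) in suspect_hosts("dns_queries.csv").items():
        print(f"{ip}: {count} failed lookups, e.g. {', '.join(domains)}")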

C2 systems
Incorporating C2 data will provide blacklists of IP addresses and domains, and there are plenty of these lists out there. Under no circumstances should traffic from a corporate network be reaching out to known C2 systems. If the network has an incident response team with time to investigate cybercrime, this is also a good opportunity to redirect an infection to one of the network's honeypot machines.
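As one illustration, a C2 blocklist can be turned directly into null routes. The sketch below emits Linux-style blackhole route commands for each IP in a hypothetical feed file; the file name is an assumption, and the commands should be adapted to your routing platform.

# c2_blackhole.py -- emit null-route commands for IPs on a C2 blocklist.
# The feed file name and the Linux "ip route" syntax are assumptions; adapt as needed.
import ipaddress

def load_c2_ips(path):
    """Return the IP addresses from a one-indicator-per-line blocklist file."""
    ips = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            try:
                ips.append(str(ipaddress.ip_address(line)))
            except ValueError:
                pass   # skip domains and URLs; handle those at the DNS or proxy level
    return ips

if __name__ == "__main__":
    for ip in load_c2_ips("c2_blacklist.txt"):
        print(f"ip route add blackhole {ip}/32")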

Threat intelligence
There are IP address and domain reputation feeds that can help determine whether an address is likely to be safe. Some feed providers return a binary answer, for example, a "go" or "no go." More modern feed vendors are beginning to reply with threat-level indicators instead.

For example, an IP address may be evaluated and the answer will contain a "threat rating" rather than a binary response. Individual appliance managers can then decide what level of risk they are willing to accept on their networks based on the threat rating and the scale being used.
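A minimal sketch of that decision, assuming a hypothetical 0-100 rating scale and a stubbed-out reputation lookup rather than a real vendor API:

# reputation_policy.py -- decide a per-appliance action from a numeric threat rating.
# The 0-100 scale, thresholds, and lookup_reputation stub are illustrative assumptions.

def lookup_reputation(ip):
    """Stub standing in for a vendor reputation query; returns 0 (clean) to 100 (known bad)."""
    demo_scores = {"203.0.113.10": 85, "198.51.100.7": 40}
    return demo_scores.get(ip, 5)

def action_for(ip, block_at=75, alert_at=40):
    """Map a threat rating to block / alert / allow based on the appliance owner's risk tolerance."""
    score = lookup_reputation(ip)
    if score >= block_at:
        return "block"
    if score >= alert_at:
        return "alert"   # let it pass, but log it for the analysts
    return "allow"

if __name__ == "__main__":
    for ip in ("203.0.113.10", "198.51.100.7", "192.0.2.1"):
        print(ip, "->", action_for(ip))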

Network traffic logs
Plenty of vendors offer appliances that will log all of the network's traffic, or just parts of it. When using big data to hunt threats, it is very easy to get lost in the noise and the day-to-day work cycles. However, to hunt threats effectively and validate the work being done, a log of the network traffic, in some form, is a basic requirement. This log will also aid in the analysis of a data breach.
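Even a flat connection log answers basic hunting questions. The sketch below summarizes each host's top external destinations by byte count; the CSV columns are an assumed layout, not a specific appliance's export format.

# log_summary.py -- summarize outbound destinations per host from a flat connection log.
# The CSV columns (timestamp, src_ip, dst, bytes) are an assumed layout.
import csv
from collections import defaultdict

def destinations_by_host(log_path):
    """Return {src_ip: {dst: total_bytes}} aggregated from the log."""
    totals = defaultdict(lambda: defaultdict(int))
    with open(log_path) as fh:
        for row in csv.DictReader(fh):
            totals[row["src_ip"]][row["dst"]] += int(row["bytes"])
    return totals

if __name__ == "__main__":
    for host, dests in destinations_by_host("outbound_connections.csv").items():
        top = sorted(dests.items(), key=lambda kv: kv[1], reverse=True)[:5]
        print(host, "->", ", ".join(f"{d} ({b} bytes)" for d, b in top))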

At least one honeypot
Honeypots can be effective in identifying malware targeting a particular network, and the implications here are tremendous. In addition, signatures can be created for the payloads dropped on the honeypot and then used to hunt for the same malware on other LAN machines and to watch the LAN for similar network traffic.

This information is priceless in the case of a targeted attack that may not have been detected by any of the antivirus (AV) vendors. That means your AV software and appliances do not yet know of this attack vector, and if they cannot identify it themselves, there is no protection for the LAN.
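One concrete way to act on honeypot captures, sketched below, is to hash every dropped payload and hunt for matching files on other machines; the directory paths are purely illustrative assumptions.

# honeypot_hashes.py -- hash payloads captured by a honeypot, then hunt for them elsewhere.
# Directory paths are illustrative assumptions.
import hashlib
from pathlib import Path

def hash_payloads(capture_dir):
    """Return {sha256: filename} for every file the honeypot captured."""
    hashes = {}
    for path in Path(capture_dir).rglob("*"):
        if path.is_file():
            hashes[hashlib.sha256(path.read_bytes()).hexdigest()] = path.name
    return hashes

def hunt_on_host(suspect_dir, known_bad):
    """Report files on another machine whose hash matches a honeypot capture."""
    for path in Path(suspect_dir).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in known_bad:
                print(f"{path}: matches honeypot capture {known_bad[digest]}")

if __name__ == "__main__":
    bad = hash_payloads("/var/honeypot/captures")
    hunt_on_host("/mnt/suspect_host", bad)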

Data quality matters
Finally, it is important to call attention to the data feed itself. There are lots of vendors selling data, and its quality and precision must be of the utmost concern. For this reason, it is critical that organizations develop in-house data evaluation teams that can ask pointed, big-picture questions about vendor data quality, and check the answers against a sample, as the sketch after this list shows. For example:

  • How recently was the newest data added?
  • Are significant sample sets available during evaluations?
  • How many new entries are added daily?
  • Is all or part of this data available freely?
  • How long has the vendor been collecting this data?
  • How large is its team?
  • Can we see a sample contract? (The terms should be no more than a year at a time.)
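
Some of these questions can be answered empirically from an evaluation sample. The sketch below computes freshness and average daily growth from a hypothetical sample file that includes first-seen dates; the CSV layout is an assumption.

# feed_eval.py -- basic freshness and growth metrics for an evaluation sample of a feed.
# Assumes a CSV sample with columns: indicator, first_seen (YYYY-MM-DD).
import csv
from collections import Counter
from datetime import date, datetime

def evaluate(sample_path):
    """Return simple quality metrics computed from the vendor's evaluation sample."""
    first_seen = []
    with open(sample_path) as fh:
        for row in csv.DictReader(fh):
            first_seen.append(datetime.strptime(row["first_seen"], "%Y-%m-%d").date())
    per_day = Counter(first_seen)
    return {
        "entries": len(first_seen),
        "days_since_newest_entry": (date.today() - max(first_seen)).days,
        "avg_new_entries_per_day": len(first_seen) / max(len(per_day), 1),
    }

if __name__ == "__main__":
    print(evaluate("vendor_sample.csv"))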

Undoubtedly, security breaches and fraud incidents will continue to make headlines. Even though organizations are taking steps to address APTs and other attacks, the fact remains that traditional security technologies lack the sophisticated capabilities necessary to detect and protect against such attacks. By utilizing big data, organizations can create more robust threat and risk detection programs -- and prevent malicious activity on a wider and deeper scale.

Timber Wolfe is a Principal Security Engineer (BSCE, ECSA, LPT, CEH, CHFI) at TrainACE, a progressive hacking and cyber security training and content organization. TrainACE employs security trainers and researchers to develop bleeding-edge training classes and create helpful ...
Comments
Marilyn Cohodas
User Rank: Strategist
7/16/2014 | 8:24:25 AM
Re: Resolving Data Conflicts
Really appreciate the time and detail you are putting into your answers, timber. Very thoughtful and useful! Thanks. 
twolfe22101
User Rank: Author
7/15/2014 | 8:43:27 PM
Re: Resolving Data Conflicts
Hi Chris,

Yes, the resolution here is in how you store the data. The trick is to store it in a relational database. The primary key is either a GUID or the IP, domain, or URL itself; in either case you would want to make that column unique, so there is no doubling up of storage for the IP, domain, or URL. The only thing that gets duplicated is the GUID.

Once you have the GUID or other unique key, you add a second table. That table may contain duplicates, which is why you would probably want to use a GUID in both it and the primary table; with a GUID, storage grows linearly and predictably. Each row in the second table carries the GUID (or unique ID) relating it back to the primary table, plus something you know about the IP, domain, or URL, i.e., metadata such as malware, spam, or honeypot hit.

For example, say an IP address (x.y.z.a) is found in feed ZEBRA. The IP is added to the primary table if it does not already exist, and the returned GUID or unique identifier, along with the feed's metadata, is added to the related table. There now exists one row in the primary table for x.y.z.a. Later you receive more data from feed ANTELOPE containing the same IP, this time marked as spam, whereas the first time you saw it the IP was marked as malware.

Now there are two metadata entries for x.y.z.a. If an appliance needs this feed, it can use the smaller table, the one containing just the GUID and the IP, domain, or URL, which is very fast and efficient to store, read, and pass around. Say the appliance then gets a hit against the blacklist for x.y.z.a. The hit can be null routed, routed to the honeypot, or simply rejected (if it is spam), and a row in yet another relational table is updated with the GUID for x.y.z.a. At some point an analyst may want to examine that hits table; they pull up the GUID, then the IP, then query the metadata table and see that the IP has been tagged as both spam and malware. If the data is tracked further (see the paragraph below), the analyst would also see, for example, that it appeared 15 times in feed DEER and was always seen after it was provided in ANTELOPE and others, roughly twenty-four hours later.

This is generally how it is done. You may not want to discount the other feeds just because they supply the same IP, domain, or URL. When two feeds provide exactly the same data, add yet another relational table: the metadata rows each have a GUID, and those GUIDs can be stored in a duplicates table (or whatever you want to call it) so that analysts or feed evaluators can view duplicate hits or collisions across competing feeds. I would also keep timestamps on every entry. They tell you when the data arrived from each feed, which tips the analyst or feed evaluator off to which vendors are supplying duplicate data, and that can help lower the negotiated prices of those feeds or exclude them completely.
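A minimal sketch of this layout (SQLite and the table and column names here are illustrative choices, not a prescribed design):

# indicator_store.py -- minimal sketch of the indicator/observation layout described above.
# SQLite and the table/column names are illustrative assumptions, not a prescribed design.
import sqlite3
import uuid

conn = sqlite3.connect("threat_intel.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS indicators (
    guid      TEXT PRIMARY KEY,
    indicator TEXT UNIQUE NOT NULL          -- the IP, domain, or URL, stored once
);
CREATE TABLE IF NOT EXISTS observations (
    guid      TEXT NOT NULL REFERENCES indicators(guid),
    feed      TEXT NOT NULL,                -- e.g. ZEBRA, ANTELOPE
    category  TEXT NOT NULL,                -- e.g. malware, spam, honeypot hit
    seen_at   TEXT NOT NULL                 -- timestamp for later feed-overlap analysis
);
""")

def record(indicator, feed, category, seen_at):
    """Insert the indicator once, then add one observation row per feed sighting."""
    row = conn.execute("SELECT guid FROM indicators WHERE indicator = ?", (indicator,)).fetchone()
    guid = row[0] if row else str(uuid.uuid4())
    if not row:
        conn.execute("INSERT INTO indicators VALUES (?, ?)", (guid, indicator))
    conn.execute("INSERT INTO observations VALUES (?, ?, ?, ?)", (guid, feed, category, seen_at))
    conn.commit()

record("203.0.113.10", "ZEBRA", "malware", "2014-07-14T10:00:00Z")
record("203.0.113.10", "ANTELOPE", "spam", "2014-07-15T10:00:00Z")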


I hope this helps.  Let me know if you have any further questions.
Chris Weltzien
User Rank: Apprentice
7/15/2014 | 12:44:35 PM
Re: Resolving Data Conflicts
It's certainly an opportunity. We've seen a steady stream of companies over the past 6 months looking to validate conflicting threat intelligence. The key is the ability to run a real-time assessment from multiple IPs while simulating multiple devices.
Marilyn Cohodas
User Rank: Strategist
7/15/2014 | 10:15:25 AM
Re: Resolving Data Conflicts
@ChrisW415 data arbiters? Is that a new market niche? Where do you look for those services...
Chris Weltzien
User Rank: Apprentice
7/14/2014 | 8:23:12 PM
Resolving Data Conflicts
Good piece, Timber. One thing we're seeing, likely due to the barrage of data providers, is partners/customers that have conflicting data and are looking for "data arbitration." Do you have any thoughts or experience with this?
JasonSachowski
User Rank: Author
7/14/2014 | 7:01:16 PM
Re: 6 Tips for Using Big Data to Hunt Cyberthreats
Agreed, @twolfe22101, that technology contributes a lot, but I still think cyber intelligence should be cyclic and not one-and-done. The real value of data mining cyber intelligence comes from how the data evolves and becomes more robust as we continue to wash/rinse/repeat.

Also, we can't forget that without the human factor putting meaningful context to the data, collecting cyber intelligence is meaningless. There always have to be security professionals involved in the process to make sense of what is being collected and develop actionable outputs.
twolfe22101
User Rank: Author
7/14/2014 | 4:46:09 PM
Re: 6 Tips for Using Big Data to Hunt Cyberthreats
Hi Jason,

Great advice. I would add one extra step to your wash/rinse/repeat process: plug the feed into an appliance. The data should be analyzed, correlated with your own data sets, and then massaged so it can be fed into your appliances, which can do the washing, or automation, for you. The manual work is analyzing the data and then setting up an automated process to pick it up periodically and make it available to the appliances. That makes it a one-off manual effort per feed you introduce rather than a continual rinse/wash/repeat process.

Cheers!
twolfe22101
User Rank: Author
7/11/2014 | 11:01:12 PM
Re: Some great actionable advice
Hi,

The 'right data' would depend on the expertise you are trying to enhance. I say that because this data only helps identify and hone threat indicators from large data sets. (The publications I write articles for are careful not to let us mention companies by name in the articles; I guess I can here.)

What I mean by 'the expertise you are trying to enhance' is that if the security team does not know what to do with DNS data, then it would not matter whether I thought DNS data was the single best source or not.

The idea is to correlate the input to your systems with the data you are generating and use that correlation to identify bad actors. While this can be done with many different types of data, depending on the expertise of the security team, the most common, and what I would consider the low-hanging fruit, is DNS. As the article states, the team needs to be able to:
- Identify new domains
- Identify newly transferred domains
- Identify domains that were offline since their creation and are now, all of a sudden, flooded with traffic
- Identify domains that are meaningless, etc ...

and investigate this traffic thoroughly.  

The second most important would be email data feeds. There are lists of blacklisted domains and IPs, and full URLs to the infection source. This information can be used to search for IOCs in the large data sets.

There are other data sources, like Dark Net data, that may be plugged directly into the appliances. If your company does not have that expertise, these may be the most important type for you.

Honeypots are also very useful (again, they require expertise) for catching the threats targeting your local or regional networks. You may be surprised how many threats you see there that are not present in the data feeds.

I hope this helps!  



JasonSachowski
User Rank: Author
7/11/2014 | 10:22:55 AM
Re: 6 Tips for Using Big Data to Hunt Cyberthreats
The best approach to figuring out what data is important is to define what intelligence the system will provide and its relevance to the business. The potential data sources for big data analytics are like a loose thread on a sweater: the more you pull on it, the bigger it gets, which goes back to defining the purpose and goals of the system.

The most effective cyberthreat intelligence comes from collecting data that is relevant and strategic to the business while taking into account longer-term trending and analysis. Cyberthreat analysis should be a cycle (wash/rinse/repeat) of collecting, analyzing, and reporting data instead of a one-time effort (start/middle/end).
Marilyn Cohodas
User Rank: Strategist
7/8/2014 | 5:04:46 PM
Some great actionable advice
Thanks for a great blog, @trainACE. Curious to know which of the data feeds you think are most important. Or are they all critical?