Comments
6 Tips for Using Big Data to Hunt Cyberthreats
Marilyn Cohodas,
User Rank: Strategist
7/16/2014 | 8:24:25 AM
Re: Resolving Data Conflicts
Really appreciate the time and detail you are putting into your answers, Timber. Very thoughtful and useful! Thanks.
twolfe22101,
User Rank: Author
7/15/2014 | 8:43:27 PM
Re: Resolving Data Conflicts
Hi Chris,

Yes, the resolution here is how you store your data.  The trick is to store the data in a relational database.  The primary key can be either a GUID or the IP, domain, or URL itself; in either case you would want to make that column unique.  There would be no doubling up of storage with respect to the IP, domain, or URL.  The only thing that would be duplicated, as a key in the related tables, is the GUID.


Once you have the GUID or other unique key, you add a second table.  That table may contain duplicates, which is why you would probably want to use a GUID in both it and the primary table; if a GUID is used, storage grows linearly and predictably.  Either way, each row in this second table carries the GUID or unique ID that relates it back to the primary table, along with something you know about the IP, domain, or URL, i.e. metadata such as malware, spam, or a honeypot hit.


An example: you have an IP address (x.y.z.a) found in feed ZEBRA.  The IP is added to the primary table if it does not already exist, and the returned GUID or unique identifier (or the actual IP, domain, or URL, if that is what you are keying on) is added with the feed's data to the related table.  Now there is one row in the primary table for the IP x.y.z.a.  Later you receive data from feed ANTELOPE that contains the same IP, but marked as spam, whereas the first time you saw it the feed marked it as malware.

Now there are two metadata entries for IP x.y.z.a, but still only one row in the primary table.  If you have an appliance that needs this feed, the smaller table can be used: the one containing just the GUID and/or the IP, domain, and URL.  It is very fast and efficient to store, read, and pass around.  Now suppose the appliance gets a hit against the blacklist feed containing x.y.z.a.  The traffic can be null-routed, routed to a honeypot, or simply rejected (if it is spam), and a row in yet another relational table can be updated with the GUID of the x.y.z.a IP.  At some point an analyst may wish to examine the data in that hits table.  The analyst pulls up the GUID, then the IP, then queries the related table with all of the metadata, and sees that the IP has been tagged as both spam and malware.  If the data is also tracked further (see the paragraph below), the analyst would also see, for example, that the IP was seen 15 times in feed DEER, always about twenty-four hours after it was provided by ANTELOPE and others.


This is generally how it is done.  You may not want to discount the other feeds just because they supply the same IP, domain, or URL.  When two feeds provide the exact same data, you can add yet another relational table: since the table containing all of the metadata about the IPs, domains, and URLs also has a GUID on each row, that GUID can be stored in a "duplicates" table (or whatever you would like to call it) so that analysts or feed evaluators can view duplicate hits or collisions across competing feeds.  I would also keep timestamps on each entry; they tell you when the data arrived from each feed, tip the analyst or feed evaluator off as to which vendors are supplying you with duplicate data, and can help you negotiate lower feed prices or exclude a feed entirely.
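
If it helps to see the shape of this, here is a minimal sketch of the two-table layout described above, using Python and SQLite purely for illustration; the table names, columns, and helper function are my own, not from any particular product:

    import sqlite3
    import uuid

    conn = sqlite3.connect("threat_intel.db")
    cur = conn.cursor()

    cur.executescript("""
    CREATE TABLE IF NOT EXISTS indicators (
        guid      TEXT PRIMARY KEY,
        indicator TEXT UNIQUE NOT NULL        -- the IP, domain, or URL itself
    );
    CREATE TABLE IF NOT EXISTS indicator_metadata (
        guid    TEXT NOT NULL REFERENCES indicators(guid),
        feed    TEXT NOT NULL,                -- e.g. ZEBRA, ANTELOPE, DEER
        tag     TEXT NOT NULL,                -- e.g. malware, spam, honeypot hit
        seen_at TEXT NOT NULL                 -- timestamp, for comparing feeds later
    );
    """)

    def record_sighting(indicator, feed, tag, seen_at):
        """Insert the indicator once, then append per-feed metadata keyed by its GUID."""
        row = cur.execute("SELECT guid FROM indicators WHERE indicator = ?",
                          (indicator,)).fetchone()
        guid = row[0] if row else str(uuid.uuid4())
        if not row:
            cur.execute("INSERT INTO indicators (guid, indicator) VALUES (?, ?)",
                        (guid, indicator))
        cur.execute("INSERT INTO indicator_metadata (guid, feed, tag, seen_at) "
                    "VALUES (?, ?, ?, ?)", (guid, feed, tag, seen_at))
        conn.commit()
        return guid

    # The same IP arriving from two feeds yields one row in indicators
    # and two metadata rows an analyst can later pull up by GUID.
    record_sighting("x.y.z.a", "ZEBRA", "malware", "2014-07-14T00:00:00Z")
    record_sighting("x.y.z.a", "ANTELOPE", "spam", "2014-07-15T00:00:00Z")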


I hope this helps.  Let me know if you have any further questions.
Chris Weltzien,
User Rank: Author
7/15/2014 | 12:44:35 PM
Re: Resolving Data Conflicts
It's certainly an opportunity. We've seen a steady stream of companies over the past six months looking to validate conflicting threat intelligence. The key is the ability to run a real-time assessment from multiple IPs while simulating multiple devices.
Marilyn Cohodas,
User Rank: Strategist
7/15/2014 | 10:15:25 AM
Re: Resolving Data Conflicts
@ChrisW415 data arbiters? Is that a new market niche? Where do you look for those services...
Chris Weltzien,
User Rank: Author
7/14/2014 | 8:23:12 PM
Resolving Data Conflicts
Good piece, Timber. One thing we're seeing, likely due to the barrage of data providers, is partners/customers that have conflicting data and are looking for "data arbitration." Do you have any thoughts or experience with this?
JasonSachowski,
User Rank: Author
7/14/2014 | 7:01:16 PM
Re: 6 Tips for Using Big Data to Hunt Cyberthreats
Agreed, @twolfe22101, that technology plays a large part, but I still think that cyber intelligence should be cyclic and not one-and-done. The real value of data mining cyber intelligence comes from how the data evolves and becomes more robust as we continue to wash/rinse/repeat.

Also, we can't forget that without the human factor to put meaningful context to the data, the collecting of cyber intelligence is meaningless. There always have to be security professionals involved in the process to make sense of what is being collected and develop actionable outputs.
twolfe22101,
User Rank: Author
7/14/2014 | 4:46:09 PM
Re: 6 Tips for Using Big Data to Hunt Cyberthreats
Hi Jason,

Great advice. I would also add one extra step to your wash/rinse/repeat process: plug the feed into an appliance.  The data should be analyzed and correlated against your own data sets, then massaged so it can be loaded into your appliances, which then do the washing (the automation) for you.  The manual work is analyzing the data once and setting up an automated process that picks it up periodically and makes it available to the appliances.  That makes it a one-off manual effort per feed you introduce, not so much a rinse/wash/repeat process.
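
Purely as an illustration of the "pick it up periodically" part, something like the sketch below; the feed URL and drop path are made-up placeholders:

    import time
    import urllib.request

    FEED_URL = "https://feeds.example.com/blacklist.txt"   # placeholder, not a real feed
    APPLIANCE_DROP = "/var/feeds/blacklist.txt"            # wherever your appliance reads from

    def pull_feed_once():
        """Fetch the feed and drop it where the appliance can pick it up."""
        with urllib.request.urlopen(FEED_URL, timeout=30) as resp:
            data = resp.read()
        with open(APPLIANCE_DROP, "wb") as out:
            out.write(data)

    # In practice this would be a cron job or scheduler; a sleep loop
    # just keeps the sketch self-contained.
    while True:
        pull_feed_once()
        time.sleep(3600)   # refresh hourly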

Cheers!
twolfe22101,
User Rank: Author
7/11/2014 | 11:01:12 PM
Re: Some great actionable advice
Hi,

The 'right data' would depend on the expertise you are trying to enhance.  I say that because this data only helps identify and hone threat indicators from large data sets.  The people I write articles for are careful not to allow us to mention companies by name in the articles; I guess I can here.

What I mean by 'the expertise you are trying to enhance' is that if the security team does not know what to do with DNS data, then it would not matter whether I thought DNS data was the single best source or not.

The idea is to correlate the input to your systems with the data you are generating and use that correlation to identify bad actors.  While this can be done with many different types of data, depending on the expertise of the security team, the most common, and what I would consider the 'low-hanging fruit', is DNS.  As the article states, the team needs to be able to:
- Identify new domains
- Identify newly transferred domains
- Identify domains that have been offline since their creation and are now suddenly flooded with traffic
- Identify domains that are meaningless, etc ...

and investigate this traffic thoroughly.  
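
As a rough sketch of the first and third checks in that list, assuming you already keep per-domain records of when a domain was first seen and its daily query counts (the record format and thresholds here are my own, for illustration only):

    from datetime import datetime, timedelta

    NEW_DOMAIN_WINDOW = timedelta(days=30)   # "new" if first seen within the last month
    SPIKE_THRESHOLD = 1000                   # queries/day that counts as "flooded"

    def flag_suspicious(domain, record, now):
        """record = {"first_seen": datetime, "daily_queries": [int, ...]}"""
        flags = []
        if now - record["first_seen"] < NEW_DOMAIN_WINDOW:
            flags.append("newly seen domain")
        history, today = record["daily_queries"][:-1], record["daily_queries"][-1]
        if history and max(history) == 0 and today >= SPIKE_THRESHOLD:
            flags.append("offline since creation, now flooded with traffic")
        return flags

    record = {"first_seen": datetime(2014, 7, 1),
              "daily_queries": [0, 0, 0, 0, 0, 0, 4200]}
    print(flag_suspicious("xj3k9q.example", record, now=datetime(2014, 7, 11)))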

The second most important would be email data feeds.  These provide lists of blacklisted domains and IPs, and full URLs to the infection source.  This information can be used to search for IOCs in your large data sets.
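
A trivial sketch of that kind of IOC search, assuming a one-domain-per-line blacklist file and a whitespace-delimited proxy log with the requested host in the third field (both assumptions are mine, for illustration only):

    # Load the blacklist feed into a set for fast lookups.
    with open("blacklist_domains.txt") as f:
        bad_domains = {line.strip().lower() for line in f if line.strip()}

    hits = []
    with open("proxy.log") as logs:
        for line in logs:
            fields = line.split()
            if len(fields) > 2 and fields[2].lower() in bad_domains:
                hits.append(line.rstrip())

    print(f"{len(hits)} possible IOC hits found")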

There are other data sources, like Dark Net data, that may be plugged directly into the appliances.  If your company does not have that expertise in-house, these may be the most important type for you.

Honeypots are also very useful (again, they require expertise) for 'catching' the threats targeting your local or regional networks.  You may be surprised to find how many threats you see there that are not present in the data feeds.

I hope this helps!  



JasonSachowski,
User Rank: Author
7/11/2014 | 10:22:55 AM
Re: 6 Tips for Using Big Data to Hunt Cyberthreats
The best approach to figuring out what data is important is to define what intelligence the system will provide and its relevance to the business.  The potential data sources for Big Data analytics are comparable to a loose thread on a sweater: the more you pull on it, the bigger it gets, which goes back to defining the purpose and goals of the system.

The most effective cyberthreat intelligence comes from collecting data that is relevant and strategic to the business while taking into account longer-term trending and analysis.  Cyberthreat analysis should be a cycle (wash/rinse/repeat) of collecting, analyzing, and reporting data instead of a one-time effort (start/middle/end).
Marilyn Cohodas,
User Rank: Strategist
7/8/2014 | 5:04:46 PM
Some great actionable advice
Thanks for a great blog, @trainACE. Curious to know which of the data feeds you think are most important. Or are they all critical?

