Dark Reading is part of the Informa Tech Division of Informa PLC


Comments
6 Tips for Using Big Data to Hunt Cyberthreats
Marilyn Cohodas,
User Rank: Strategist
7/16/2014 | 8:24:25 AM
Re: Resolving Data Conflicts
Really appreciate the time and detail you are putting into your answers, timber. Very thoughtful and useful! Thanks. 
twolfe22101,
User Rank: Author
7/15/2014 | 8:43:27 PM
Re: Resolving Data Conflicts
Hi Chris,

Yes, the resolution here is in how you store your data.  The trick is to store the data in a relational database.  The primary key can be either a GUID or the IP, domain, or URL itself; in either case you would want to make that column unique.  There would be no doubling up of storage with respect to the IP, domain, or URL.  The only thing that would be duplicated is the GUID.


Once you have the GUID, or whatever the unique key is, you add another table.  That other table may contain duplicates, which is why you would probably want to use a GUID both in it and in the primary table.  If a GUID is used, storage grows linearly and predictably.  Each row in the other table carries the GUID or unique ID that relates it back to the primary table, plus something you know about the IP, domain, or URL, i.e. metadata, e.g. malware, spam, honeypot hit.
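A minimal sketch of this two-table layout, using an in-memory SQLite database; the table and column names here are illustrative, not from the original post:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Primary table: exactly one row per unique indicator (IP, domain, or URL).
cur.execute("""
    CREATE TABLE indicators (
        guid      TEXT PRIMARY KEY,
        indicator TEXT NOT NULL UNIQUE
    )
""")

# Related metadata table: may hold many rows per indicator, keyed by GUID.
cur.execute("""
    CREATE TABLE indicator_metadata (
        guid    TEXT NOT NULL REFERENCES indicators(guid),
        feed    TEXT NOT NULL,
        tag     TEXT NOT NULL,   -- e.g. malware, spam, honeypot hit
        seen_at TEXT NOT NULL
    )
""")

def upsert_indicator(indicator):
    """Return the GUID for an indicator, inserting a row only if it is new."""
    row = cur.execute(
        "SELECT guid FROM indicators WHERE indicator = ?", (indicator,)
    ).fetchone()
    if row:
        return row[0]
    guid = str(uuid.uuid4())
    cur.execute("INSERT INTO indicators VALUES (?, ?)", (guid, indicator))
    return guid
```

The UNIQUE constraint on `indicator` is what prevents the doubling-up of storage: repeated sightings only ever add rows to the metadata table.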


An example: you have an IP address (x.y.z.a) found in feed ZEBRA.  The IP is added to the primary table if it did not already exist.  The returned GUID, or the unique identifier in the event you are using the actual IP, domain, or URL, is then added to the related table along with the feed metadata.  Now there exists one row in the primary table with the IP x.y.z.a.  Later you receive more information from feed ANTELOPE that contains that same IP, but marked as spam, whereas the first time you saw it, it was marked as malware.

Now there are two metadata entries for IP x.y.z.a.  If you have an appliance that needs this feed, the smaller table can be used; that is the table containing just the GUID and/or the IP, domain, and URL, which is very fast and efficient to store, read, and pass around.  Say the appliance then gets a hit on the blacklist feed containing x.y.z.a.  The hit can be null-routed, routed to a honeypot, or simply rejected (in the event it is spam).  A row in yet another relational table is then updated with the GUID of the x.y.z.a IP, and at some point an analyst may wish to see or analyze the data in that hits table.  The analyst pulls up the GUID, then the IP, then queries the relational table with all of the metadata, and sees that the IP has been tagged as both spam and malware.  If the data is also further tracked (see the paragraph below), the analyst would additionally see, for example, that the IP appeared 15 times in feed DEER and was always seen roughly twenty-four hours after it was provided in ANTELOPE and others.


This is generally how it is done.  You may not want to discount the other feeds just because they carry the same IP, domain, or URL.  When two feeds provide the exact same data, record that in yet another relational table: the table containing all of the metadata about the IPs, domains, and URLs also has a GUID on each row, and that GUID can be stored in a 'duplicates' table, or whatever you would like to call it, so that analysts or feed evaluators can view duplicate hits or collisions from competing feeds.  I would also keep timestamps on each entry.  They tell you when the data arrived from each respective feed, which tips the analyst or feed evaluator off as to which vendors are supplying duplicate data and can help lower the negotiated prices of those feeds or exclude them completely.
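The feed-evaluation idea can be sketched in a few lines; this is a self-contained illustration with made-up timestamps, reusing the hypothetical feed names from the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE sightings (indicator TEXT, feed TEXT, tag TEXT, seen_at TEXT)"
)

# One IP sighted in three feeds at different times (illustrative data).
cur.executemany("INSERT INTO sightings VALUES (?, ?, ?, ?)", [
    ("x.y.z.a", "ZEBRA",    "malware", "2014-07-13T08:00:00"),
    ("x.y.z.a", "ANTELOPE", "spam",    "2014-07-14T08:00:00"),
    ("x.y.z.a", "DEER",     "spam",    "2014-07-15T08:00:00"),
])

# Collision report: which indicators arrive from more than one feed,
# when they were first seen, and which feed delivered them first.
report = cur.execute("""
    SELECT indicator,
           COUNT(DISTINCT feed)        AS n_feeds,
           MIN(seen_at)                AS first_seen,
           (SELECT feed FROM sightings s2
             WHERE s2.indicator = s1.indicator
             ORDER BY seen_at LIMIT 1) AS first_feed
    FROM sightings s1
    GROUP BY indicator
    HAVING n_feeds > 1
""").fetchall()
```

A feed evaluator scanning this report would see at a glance that DEER is consistently the latest supplier of data the other feeds already provided.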


I hope this helps.  Let me know if you have any further questions.
Chris Weltzien,
User Rank: Author
7/15/2014 | 12:44:35 PM
Re: Resolving Data Conflicts
It's certainly an opportunity. We've seen a steady stream of companies over the past six months looking to validate conflicting threat intelligence. The key is the ability to run a real-time assessment from multiple IPs while simulating multiple devices.
Marilyn Cohodas,
User Rank: Strategist
7/15/2014 | 10:15:25 AM
Re: Resolving Data Conflicts
@ChrisW415 data arbiters? Is that a new market niche? Where do you look for those services...
Chris Weltzien,
User Rank: Author
7/14/2014 | 8:23:12 PM
Resolving Data Conflicts
Good piece, Timber. One thing we're seeing, likely due to the barrage of data providers, is partners/customers that have conflicting data and are looking for "data arbitration." Do you have any thoughts or experience with this?
JasonSachowski,
User Rank: Author
7/14/2014 | 7:01:16 PM
Re: 6 Tips for Using Big Data to Hunt Cyberthreats
Agreed @twolfe22101 that technology contributes largely but I still think that cyber intelligence should be cyclic and not one-and-done. The real value of data mining cyber intelligence comes from how the data evolves and becomes more robust as we continue to wash/rinse/repeat.

Also, we can't forget that without the human factor to put meaningful context around the data, collecting cyber intelligence is meaningless. There always have to be security professionals involved in the process to make sense of what is being collected and develop actionable outputs.
twolfe22101,
User Rank: Author
7/14/2014 | 4:46:09 PM
Re: 6 Tips for Using Big Data to Hunt Cyberthreats
Hi Jason,

Great advice. I would also add one extra step to your wash/rinse/repeat process: plug the feed into an appliance.  The data should be analyzed and correlated against your own data sets, then massaged so it can be fed into your appliances, which do the washing, i.e. the automation, for you.  The manual part is analyzing the data and setting up an automated process that picks it up periodically and delivers it to your appliances, or makes it available to them.  That makes this a one-off manual step per feed you introduce, rather than an ongoing rinse/wash/repeat process.
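A rough sketch of that one-off-per-feed automation; `fetch_feed` and `push_to_appliance` are placeholders for whatever your environment actually provides (an HTTP fetch, an appliance API, etc.):

```python
def normalize(raw_lines):
    """Strip comments and blank lines, and deduplicate indicators
    while preserving first-seen order."""
    seen = set()
    for line in raw_lines:
        indicator = line.strip()
        if indicator and not indicator.startswith("#") and indicator not in seen:
            seen.add(indicator)
            yield indicator

def ingest(fetch_feed, push_to_appliance):
    """One scheduled run: fetch the raw feed, normalize it, hand it off."""
    indicators = list(normalize(fetch_feed()))
    push_to_appliance(indicators)
    return indicators
```

Schedule `ingest` per feed (cron, for example) and the appliances take it from there; the analyst only touches a feed once, at onboarding.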

Cheers!
twolfe22101,
User Rank: Author
7/11/2014 | 11:01:12 PM
Re: Some great actionable advice
Hi,

The 'right data' would depend on the expertise you are trying to enhance.  I say that as this data only helps identify and hone threat indicators from large data sets.  The people I write articles for are careful not to allow us to mention companies by name in the articles.  I guess I can here.

What I mean by 'the expertise you are trying to enhance' is that if the security team does not know what to do with DNS data, then it would not matter whether I thought DNS data was the single best source or not.

The idea is to correlate the input to your systems with the data you are generating, and use that correlation to identify bad actors.  While this can be done with many different types of data, depending on the expertise of the security team, the most common, and what I would consider the 'low-hanging fruit', is DNS.  As the article states, the team needs to be able to:
- Identify new domains
- Identify newly transferred domains
- Identify domains that were offline since their creation and are now, all of a sudden, flooded with traffic
- Identify domains that are meaningless, etc ...

and investigate this traffic thoroughly.  
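The first and last checks in that list can be sketched as simple heuristics; the thresholds below are illustrative, not tuned values, and a real pipeline would pull registration dates from WHOIS or passive DNS:

```python
import math
from collections import Counter
from datetime import datetime, timedelta, timezone

def shannon_entropy(s):
    """Bits of entropy per character; high values suggest random-looking labels."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def flag_domain(domain, registered_at, now):
    """Return the reasons a domain deserves a closer look."""
    reasons = []
    if (now - registered_at) < timedelta(days=30):
        reasons.append("recently registered")   # "new domain" check
    label = domain.split(".")[0]
    if shannon_entropy(label) > 3.5:
        reasons.append("high-entropy label")    # "meaningless" name check
    return reasons
```

A DGA-style name like `xk2q9vbz7w1m.com` trips both heuristics, while a long-registered dictionary-word domain trips neither; anything flagged goes to an analyst for the thorough investigation described above.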

The second most important would be email data feeds.  These include lists of blacklisted domains and IPs, and full URLs to the infection source.  This information can be used to search your large data sets for IOCs.
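At its simplest, that IOC search is a set-membership test over your logs; the blacklist entries and proxy-log rows below are all made-up illustrations:

```python
# Hypothetical blacklist feed (domains and IPs) and a proxy-log extract.
blacklist = {"bad-domain.example", "203.0.113.7", "spam-site.example"}

proxy_log = [
    ("2014-07-11T09:12:01", "10.0.0.5", "bad-domain.example"),
    ("2014-07-11T09:12:05", "10.0.0.8", "intranet.corp"),
    ("2014-07-11T09:13:44", "10.0.0.5", "203.0.113.7"),
]

# Flag every log entry whose destination appears in the blacklist feed;
# set membership keeps each lookup O(1) even for very large feeds.
hits = [entry for entry in proxy_log if entry[2] in blacklist]
```

At scale the same join runs inside whatever big-data stack holds the logs, but the shape of the query is identical.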

There are other data sources, like dark net data, that may be plugged directly into the appliances.  If your company does not have in-house expertise, these may be the most important type for you.

Honeypots are also very useful, though again they require expertise, for 'catching' the threats targeting your local or regional networks.  You may be surprised how many threats you see there that are not present in the data feeds.

I hope this helps!  



JasonSachowski,
User Rank: Author
7/11/2014 | 10:22:55 AM
Re: 6 Tips for Using Big Data to Hunt Cyberthreats
The best approach to figuring out what data is important is to define what intelligence the system will provide and its relevance to the business.  The set of potential data sources for Big Data analytics is comparable to a loose thread on a sweater: the more you pull on it, the bigger it gets.  Which goes back to defining the purpose and goals of the system.

The most effective cyberthreat intelligence comes from collecting data that is relevant and strategic to the business, while taking into account longer-term trending and analysis.  Cyberthreat analysis should be a cycle (wash/rinse/repeat) of collecting, analyzing, and reporting data rather than a one-time effort (start/middle/end).
Marilyn Cohodas,
User Rank: Strategist
7/8/2014 | 5:04:46 PM
Some great actionable advice
Thanks for a great blog, @trainACE. Curious to know which of the data feeds you think are most important. Or are they all critical?

