Re: Resolving Data Conflicts
Hi Chris,
Yes, the resolution here lies in how you store your data. The trick is to store it in a relational database. The primary key can be either a GUID or the indicator itself (the IP, Domain or URL). In either case you would want to make the indicator column unique, so that each IP, Domain or URL is stored only once and there is no doubling up of storage. The only thing that would be duplicated is the GUID, which is repeated in the related tables.
Now that you have the GUID or the unique key, whatever it is, you add another table. This other table may hold several rows for the same IP, Domain or URL, which is why you would probably want to use a GUID in both it and the primary table. If a GUID is used, the storage grows linearly and predictably. In either case, each row in the 'other table' carries the GUID or unique ID that relates it back to the primary table. So it contains that repeated identifier plus something you know about the IP, Domain or URL, i.e. metadata such as malware, spam or honeypot hit.
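To make the two tables concrete, here is a minimal sketch, assuming SQLite driven from Python. The table and column names (indicator, indicator_meta, value, tag) are made up for illustration, not anything standard, and x.y.z.a is just a placeholder string.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Primary table: one row per IP, Domain or URL (names are hypothetical).
CREATE TABLE indicator (
    guid  TEXT PRIMARY KEY,
    value TEXT NOT NULL UNIQUE
);
-- Related table: many metadata rows may point at the same indicator.
CREATE TABLE indicator_meta (
    guid           TEXT PRIMARY KEY,
    indicator_guid TEXT NOT NULL REFERENCES indicator(guid),
    tag            TEXT NOT NULL
);
""")

ip_guid = str(uuid.uuid4())
conn.execute("INSERT INTO indicator (guid, value) VALUES (?, ?)",
             (ip_guid, "x.y.z.a"))

# Only the GUID is repeated in the related table, not the IP itself.
for tag in ("malware", "spam"):
    conn.execute(
        "INSERT INTO indicator_meta (guid, indicator_guid, tag) VALUES (?, ?, ?)",
        (str(uuid.uuid4()), ip_guid, tag),
    )

# The UNIQUE column refuses a second copy of the same IP.
try:
    conn.execute("INSERT INTO indicator (guid, value) VALUES (?, ?)",
                 (str(uuid.uuid4()), "x.y.z.a"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

meta_count = conn.execute("SELECT COUNT(*) FROM indicator_meta").fetchone()[0]
```

Any relational database would do the same job; SQLite is used here only because it needs no setup.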
As an example, say you have an IP address (x.y.z.a) found in feed ZEBRA. The IP is added to the primary table if it does not already exist. Then the returned GUID (or the unique identifier, in the event you are using the actual IP, Domain or URL as the key) and the feed's data are added to the related table. Now there is one row in the primary table with the IP x.y.z.a. Later you receive more information from feed ANTELOPE that contains the same IP, this time marked as Spam, whereas the first time you saw it, it was marked as malware.
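The ZEBRA and ANTELOPE example could be sketched like this, again assuming SQLite and Python. The schema and the record_sighting helper are hypothetical names for illustration only.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE indicator (
    guid  TEXT PRIMARY KEY,
    value TEXT NOT NULL UNIQUE
);
CREATE TABLE indicator_meta (
    guid           TEXT PRIMARY KEY,
    indicator_guid TEXT NOT NULL REFERENCES indicator(guid),
    feed           TEXT NOT NULL,
    tag            TEXT NOT NULL
);
""")

def record_sighting(conn, value, feed, tag):
    """Add the indicator if it is new, then attach one metadata row.

    Returns the indicator's GUID (hypothetical helper, not a library call).
    """
    row = conn.execute("SELECT guid FROM indicator WHERE value = ?",
                       (value,)).fetchone()
    if row is None:
        guid = str(uuid.uuid4())
        conn.execute("INSERT INTO indicator (guid, value) VALUES (?, ?)",
                     (guid, value))
    else:
        guid = row[0]
    conn.execute(
        "INSERT INTO indicator_meta (guid, indicator_guid, feed, tag) "
        "VALUES (?, ?, ?, ?)",
        (str(uuid.uuid4()), guid, feed, tag),
    )
    return guid

first = record_sighting(conn, "x.y.z.a", "ZEBRA", "malware")
second = record_sighting(conn, "x.y.z.a", "ANTELOPE", "spam")

primary_rows = conn.execute("SELECT COUNT(*) FROM indicator").fetchone()[0]
meta_rows = conn.execute("SELECT COUNT(*) FROM indicator_meta").fetchone()[0]
print(primary_rows, meta_rows)  # 1 2
```

Both calls return the same GUID, so the IP is stored once while each feed's claim about it gets its own row.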
Now there are two entries for IP x.y.z.a in the related table. If you have an appliance that needs this feed, the smaller table may be used, that is, the table that contains the GUID and/or the IP, Domain or URL. It is very fast and efficient to store, read, pass around, etc. Now say we have a hit in the appliance for the blacklist feed containing x.y.z.a. The hit can be null-routed, routed to the honeypot or just rejected (in the event it is spam). A row in yet another relational table, the hits table, may then be updated with the GUID of the x.y.z.a IP.
At some point an analyst may wish to see or analyze the data in the hits table. The analyst would pull up the GUID, then the IP, and then query the other relational table with all of the metadata. The analyst would see that the IP has been tagged as both Spam and malware. If the data is also further tracked (see the paragraph below), the analyst would also see that it was seen 15 times from feed DEER, and always after it was provided by ANTELOPE and others, roughly twenty-four hours later.
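Here is a rough sketch of the appliance lookup, the hits table and the analyst's query, assuming SQLite and Python. The hit table, the in-memory blocklist dict and the on_connection function are all hypothetical; what the appliance actually does with a match (null route, honeypot, reject) is left to the caller.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE indicator (guid TEXT PRIMARY KEY, value TEXT NOT NULL UNIQUE);
CREATE TABLE indicator_meta (
    guid           TEXT PRIMARY KEY,
    indicator_guid TEXT NOT NULL REFERENCES indicator(guid),
    feed           TEXT NOT NULL,
    tag            TEXT NOT NULL
);
CREATE TABLE hit (
    indicator_guid TEXT NOT NULL REFERENCES indicator(guid),
    seen_at        TEXT NOT NULL
);
""")

ip_guid = str(uuid.uuid4())
conn.execute("INSERT INTO indicator VALUES (?, ?)", (ip_guid, "x.y.z.a"))
conn.executemany(
    "INSERT INTO indicator_meta VALUES (?, ?, ?, ?)",
    [(str(uuid.uuid4()), ip_guid, "ZEBRA", "malware"),
     (str(uuid.uuid4()), ip_guid, "ANTELOPE", "spam")],
)

# The appliance only needs the small table: value -> GUID.
blocklist = dict(conn.execute("SELECT value, guid FROM indicator").fetchall())

def on_connection(conn, source_ip):
    """Return True and log a hit if source_ip is on the blocklist."""
    guid = blocklist.get(source_ip)
    if guid is None:
        return False
    conn.execute("INSERT INTO hit VALUES (?, ?)",
                 (guid, datetime.now(timezone.utc).isoformat()))
    return True  # caller decides: null route, honeypot, or reject

blocked = on_connection(conn, "x.y.z.a")

# Analyst view: go from the hit to the IP to everything known about it.
tags = [row[0] for row in conn.execute("""
    SELECT DISTINCT m.tag
    FROM hit h
    JOIN indicator i      ON i.guid = h.indicator_guid
    JOIN indicator_meta m ON m.indicator_guid = i.guid
    ORDER BY m.tag
""")]
print(tags)  # ['malware', 'spam']
```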
This is generally how this is done. You may not want to discount all of the other feeds just because they have the same IP, Domain or URL. In the event two feeds provide the exact same data, yet another relational table can be used. That is, each row in the table containing all of the metadata about the IPs, Domains and URLs also has its own GUID. That GUID could then be stored in a duplicates table, or whatever you would like to call it, so that the analysts or feed evaluators could view duplicate hits or collisions from competing feeds. I would also keep a timestamp on each entry. This tells you when the data came from the respective feeds. It would tip the analyst or feed evaluator off as to which feed vendors were supplying you with duplicate data, and could further help in negotiating lower prices for those feeds or in excluding them completely.
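One way the duplicates table and timestamps might look, as a sketch only, assuming SQLite and Python. The duplicate table, the add_meta helper and the dates are invented for illustration, and the metadata table is simplified to hold the IP directly rather than a GUID pointing at a primary table.

```python
import sqlite3
import uuid
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE indicator_meta (
    guid      TEXT PRIMARY KEY,
    indicator TEXT NOT NULL,  -- simplified: the IP itself, not a GUID
    feed      TEXT NOT NULL,
    tag       TEXT NOT NULL,
    seen_at   TEXT NOT NULL   -- ISO 8601, all in UTC so text order = time order
);
CREATE TABLE duplicate (
    original_guid  TEXT NOT NULL REFERENCES indicator_meta(guid),
    duplicate_guid TEXT NOT NULL REFERENCES indicator_meta(guid)
);
""")

def add_meta(conn, indicator, feed, tag, seen_at):
    """Store a metadata row; if another feed already said the same thing,
    link the new row to the earliest such row in the duplicate table."""
    guid = str(uuid.uuid4())
    earliest = conn.execute(
        "SELECT guid FROM indicator_meta "
        "WHERE indicator = ? AND tag = ? AND feed <> ? "
        "ORDER BY seen_at LIMIT 1",
        (indicator, tag, feed),
    ).fetchone()
    conn.execute("INSERT INTO indicator_meta VALUES (?, ?, ?, ?, ?)",
                 (guid, indicator, feed, tag, seen_at.isoformat()))
    if earliest is not None:
        conn.execute("INSERT INTO duplicate VALUES (?, ?)", (earliest[0], guid))
    return guid

t0 = datetime(2014, 7, 1, tzinfo=timezone.utc)  # made-up timestamp
add_meta(conn, "x.y.z.a", "ANTELOPE", "spam", t0)
add_meta(conn, "x.y.z.a", "DEER", "spam", t0 + timedelta(hours=24))

# Feed evaluator view: who repeated whom, and how many hours later.
report = []
for first_feed, first_seen, late_feed, late_seen in conn.execute("""
    SELECT o.feed, o.seen_at, d.feed, d.seen_at
    FROM duplicate x
    JOIN indicator_meta o ON o.guid = x.original_guid
    JOIN indicator_meta d ON d.guid = x.duplicate_guid
"""):
    lag = datetime.fromisoformat(late_seen) - datetime.fromisoformat(first_seen)
    report.append((first_feed, late_feed, lag.total_seconds() / 3600))

print(report)  # [('ANTELOPE', 'DEER', 24.0)]
```

What counts as "the exact same data" is a judgment call; here it is taken to mean the same indicator with the same tag from a different feed.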
I hope this helps. Let me know if you have any further questions.
7/16/2014 | 8:24:25 AM