The term "big data" is often misunderstood. In fact, it has been used so often, by so many people, to push so many specific agendas that the term has become almost meaningless.
Yes, big data is storing and processing very large data sets. However, it embodies a lot more than that.
When trying to get a handle on big data, it's helpful to consider it more an idea than a specific size or technology. In its simplest terms, the big data phenomenon is driven by the intersection of three trends: mountains of data that contain valuable information, the abundance of cheap commodity computing resources and virtually free analytics tools. When talking about the security of big data environments, it's the last item -- virtually free analytics tools -- that most often raises security concerns.
As of this writing, there are more than 120 variations of big data management systems focusing on different data types (for example, geolocation data, documents and tuple storage).
These systems use many different query models; different data storage models; and different task management, orchestration and resource management tools. While big data is often described as anti-relational (as shown by the term "NoSQL"), that concept also fails to capture the essence of big data.
It's true that big data implementations cast off many of the core features of relational databases to get around the associated performance issues, but make no mistake: Some big data environments offer relational structures, transactional consistency and structured query processing.
Since conventional definitions fail to capture the essence of big data, think about it in terms of the key elements that make up big data environments. They use many nodes for distributed data storage and management.
They store multiple copies of data, "sharding" pieces of data across multiple nodes. This provides fail-safe operation in the event any single node fails, and it means queries move to the data, where processing resources are available. It's this distributed cluster of data nodes cooperating to handle data management and data queries that makes big data different from "big iron."
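The sharding-with-replication idea described above can be sketched in a few lines. This is a simplified illustration, not any particular platform's placement scheme; the node names and replica count are made up for the example.

```python
import hashlib

def shard_nodes(key: str, nodes: list[str], replicas: int = 3) -> list[str]:
    """Pick a primary node for `key` plus replica nodes for fail-safe storage."""
    # Hash the key so placement is deterministic: any node can compute
    # where a piece of data lives without asking a central coordinator.
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    primary = digest % len(nodes)
    # Store copies on the next nodes around the ring, so losing any
    # single node never loses the only copy of the data.
    return [nodes[(primary + i) % len(nodes)] for i in range(min(replicas, len(nodes)))]

cluster = ["node-a", "node-b", "node-c", "node-d"]
print(shard_nodes("user:1001", cluster))  # three distinct nodes from the cluster
```

Because every node can compute this mapping independently, a query can be routed straight to the nodes holding the relevant shard -- which is exactly why the processing "moves to the data."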
The essential characteristics of big data -- the things that allow it to handle data management and processing requirements that outstrip previous data management systems, such as volume, data velocity, distributed architecture and parallel processing -- are what make securing these systems all the more difficult. The clusters are somewhat open and self-organizing, and they allow users to communicate with multiple data nodes simultaneously.
Validating which data nodes and which clients should have access to information is difficult. The elastic nature of big data means new nodes are automatically meshed into the cluster, sharing data and query results to handle client tasks.
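One way to see why elastic membership complicates access control is with consistent hashing -- a common placement scheme, assumed here purely for illustration. The moment a new node joins the ring, it silently takes ownership of a slice of the key space; no central gatekeeper has to approve the handoff.

```python
import hashlib

def _h(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

def owner(key: str, nodes: list[str]) -> str:
    # Place nodes and keys on a hash ring; the key belongs to the
    # nearest node clockwise from its position.
    return min(nodes, key=lambda n: (_h(n) - _h(key)) % 2**256)

keys = [f"rec:{i}" for i in range(1000)]
before = {k: owner(k, ["node-a", "node-b", "node-c"]) for k in keys}
after = {k: owner(k, ["node-a", "node-b", "node-c", "node-d"]) for k in keys}

# Every key that moved now belongs to the newcomer -- the cluster
# re-sharded itself automatically when node-d appeared.
moved = [k for k in keys if before[k] != after[k]]
```

The convenience is also the security problem: if a rogue node can announce itself, the same self-organizing logic hands it data and query traffic automatically.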
In the mad race to do more with big data -- to add new features and push the boundaries of scalability -- the vast majority of development resources go to improving big data scalability, ease of use and analysis capabilities.
A very low percentage of resources goes into adding security features. But you want security features embedded in the big data platforms. You want developers to be able to enable those features as needed during the design and deployment phases. You want security to be just as scalable, high-performance and self-organizing as the clusters themselves. The problem is that the available security products typically aren't included with open source systems or with the majority of commercial bundles.
To find out more about the key security components behind big data -- and for a list of myths about big data, as well as seven key tips on securing it -- download the free report on big data security.