"How do you secure Big Data environments?" is the new question people ask. The first time someone asked me this, my gut reaction was to consider what security features we have in relational systems, how they protect data and the database, and then show which facilities are missing from big data clusters.
But this is one of those cases where gut reactions are totally wrong: that approach misses the essential differences between big data clusters and relational databases, both architectural and operational. A reasonable answer would not come for many weeks, as the question kicked off a several-month-long research project into big data systems and how to secure them.
In a future post, I'll go into detail about what big data is and work through some of the specific issues in securing these systems. They're a lot different from relational systems, so we need a bit more discussion of how big data clusters work, and of the architectural differences between the two, before we can dive into different approaches to securing them. For now, I do want to highlight the differences in available security features. Most security professionals think in terms of risks, threats, and responses, and the methods to counter threats remain the same, be it big data or relational databases. So it's helpful to consider what we rely on today to understand what's missing.
A quick look at threat-response models for all types of databases:
Data at rest protection
Encryption is the accepted method of protecting archives and data files from unwanted inspection or any attempt to examine data outside of database interfaces. Any data encryption system will be supported by key management.
Unwanted system access or usage
User and administrative access management -- a.k.a. user names and passwords -- is the normal way to gate access to the database. Privilege management is how features and functions are allocated to different users/roles.
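As a rough illustration of privilege management, here is a minimal sketch of a role-based check. The role names, user names, and privilege strings are all hypothetical and do not come from any particular database product; real systems keep these mappings in the catalog and enforce them inside the engine.

```python
# Hypothetical role-to-privilege mapping; every name here is illustrative.
ROLE_PRIVILEGES = {
    "analyst": {"select"},
    "app_service": {"select", "insert", "update"},
    "dba": {"select", "insert", "update", "delete", "grant"},
}

# Hypothetical assignment of users to roles.
USER_ROLES = {
    "alice": {"analyst"},
    "bob": {"dba"},
}


def is_allowed(user: str, privilege: str) -> bool:
    """Return True if any role assigned to the user grants the privilege."""
    roles = USER_ROLES.get(user, set())
    return any(privilege in ROLE_PRIVILEGES.get(role, set()) for role in roles)
```

An unknown user has no roles and is therefore denied everything, which is the default you generally want from an access check.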
Fraud and misuse detection
Separation of duties is key in making fraud and misuse more difficult by requiring physical or virtual participation by two or more people. Logging and activity monitoring are used to track activity and forensically analyze what transpired.
Unwanted inspection of data or queries over the network is addressed via network layer encryption.
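To make the idea concrete, here is a minimal sketch of separation of duties paired with an audit trail. The `ChangeRequest` class and its method names are hypothetical, invented for this example; it only shows the rule that the person requesting a sensitive change cannot be the one who approves it, and that both outcomes get logged.

```python
import logging

# Audit logger; a real deployment would ship these records to protected storage.
audit_log = logging.getLogger("audit")


class ChangeRequest:
    """Hypothetical sensitive operation that needs sign-off from someone else."""

    def __init__(self, description: str, requester: str):
        self.description = description
        self.requester = requester
        self.approvers = set()

    def approve(self, approver: str) -> None:
        # Separation of duties: the requester may not approve their own change.
        if approver == self.requester:
            audit_log.warning("self-approval blocked: %s on %r", approver, self.description)
            raise PermissionError("requester cannot approve their own change")
        self.approvers.add(approver)
        audit_log.info("approved: %s signed off on %r", approver, self.description)

    def can_execute(self, required_approvals: int = 1) -> bool:
        """The change may run only once enough distinct people have approved."""
        return len(self.approvers) >= required_approvals
```

The log entries are what make after-the-fact forensic analysis possible; the approval rule is what makes the fraud harder in the first place.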
Injection or malicious queries
Application layer defenses, built-in database parsing, query interception and filtering, dynamic masking, and activity monitoring are all means to thwart injection and malicious queries and -- potentially -- unwanted map-reduce or similar operations.
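One of the simplest application layer defenses is to bind user input as a parameter instead of splicing it into the query text. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `accounts` table and the function names are made up for the example.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 250)])


def find_unsafe(name: str):
    # Vulnerable: the input becomes part of the SQL text itself.
    query = f"SELECT name, balance FROM accounts WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_safe(name: str):
    # Safer: the input is bound as a value and is never parsed as SQL.
    query = "SELECT name, balance FROM accounts WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


# Classic injection string that turns the WHERE clause into a tautology.
malicious = "x' OR '1'='1"
```

With the malicious input, `find_unsafe` returns every row in the table, while `find_safe` returns nothing because no account has that literal name.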
Data and transactional integrity is either provided at the app layer or, if the database has an understanding of what constitutes a transaction, enforced by the database.
Exploits and code weaknesses
Configuration and patch management are the principal approaches to fixing database flaws. In some cases application layer protections and monitoring (a.k.a. virtual patching) can help as well.
Databases are inherently multitenant, and constructs like schemas, features like groups or role-based access, and facilities to logically segregate data access through labels keep each tenant's data separate.
Data leakage and overprivileged user protections
Encryption at the application layer is used as a backstop should these other security measures fail. Leaked data, without the key, is inaccessible. Tools like masking and tokenization remove sensitive data from the database altogether.
With most big data environments, many of the protections we rely on are not included within the base set of functions. For example, Hadoop will not provide a means to encrypt stored data, configuration and patch management, identity management, groups and roles, query and data type integrity, or transactional integrity. The concepts of label security, schemas, communication security, and logging are available -- usually via an add-on package -- but not by default.
The good news is that several missing capabilities can be bolted on, either by the application developer or by IT support. The bad news is that some of these will work only to a point: they are not designed to scale the way big data clusters do, and implementing them creates a performance bottleneck.
In the next post I'll branch into the specifics of big data and introduce the essential characteristics that define it, to help you better understand the security issues.
Adrian Lane is an analyst/CTO with Securosis LLC, an independent security consulting practice. Special to Dark Reading.