Tech Insight: Finding Security-Sensitive Data - on a Shoestring Budget

Thanks to open-source tools, discovering the heart of your data doesn't always mean paying an arm and a leg

June 20, 2008

Securing your organization's most sensitive data is a great idea -- if you know all the places where that data resides. Most enterprises don't. In fact, a report issued last week by the Verizon Business Investigative Response Team indicates that 66 percent of data breaches studied involved data that the victim company was unaware of. (See Verizon Study Links External Hacks to Internal Mistakes.)

How can you identify and locate your organization's most sensitive data? Many vendors offer data loss prevention (DLP) and other discovery tools, and many of these products show promise. But they are neither cheap nor trivial to deploy. Does your data discovery process have to wait until you get the time and budget to deploy DLP?

Thankfully, no. It’s possible to get a jump-start on discovering sensitive data using freely available and open source tools -- provided that you understand what your company needs to identify and protect. The tools range in functionality from simple searching of files on desktops and laptops to spidering and searching Website content.

In the educational environment, where I work, some of the best-known and most widely used "discovery" tools include Cornell University’s Spider, Virginia Tech’s Find_SSN (and Find_CCN), the University of Texas at Austin’s Sensitive Number Finder (SENF), and the University of Illinois at Urbana-Champaign’s Firefly. These tools were created to address issues that educational institutions were having with faculty and staff, some of whom were storing personal information, such as Social Security numbers, on their own machines instead of on securely managed, central file servers.

The tools listed above are primarily used for finding SSNs stored on computers and servers ("data at rest") using what DLP vendors call "rule-based matching" -- regular expressions that match a specific pattern of letters and numbers. These regular expressions can be modified to find SSNs, credit card numbers (CCNs), account numbers, and any other data that fits into an easily identifiable pattern. Unfortunately, this type of searching is prone to false positives, so it requires human oversight to determine if there really is a match.
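As a rough illustration of rule-based matching, the Python sketch below searches text for SSN-shaped and card-number-shaped strings. It is not taken from Spider, Find_SSN, SENF, or Firefly; the patterns, the function names, and the two filters (skipping SSN ranges that are never issued, and applying the Luhn checksum to card numbers) are assumptions added here to show how false positives can be trimmed before a person reviews the hits.

```python
import re

# Candidate SSN: three digits, two digits, four digits, separated by hyphens.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
# Candidate card number: 13 to 16 digits, optionally separated by spaces or hyphens.
CCN_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs that still need human review."""
    hits = []
    for m in SSN_RE.finditer(text):
        area, group, serial = m.groups()
        # Assumed filter: drop values that are never issued as SSNs.
        if area in ("000", "666") or area[0] == "9":
            continue
        if group == "00" or serial == "0000":
            continue
        hits.append(("SSN", m.group()))
    for m in CCN_RE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        # Assumed filter: keep only digit runs that pass the Luhn check.
        if luhn_ok(digits):
            hits.append(("CCN", m.group()))
    return hits
```

Even with these filters, a nine-digit part number or an order ID written with hyphens can still match, which is why the results are treated as candidates rather than confirmed findings.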

None of the freely available "discovery" tools can yet identify sensitive data in transit ("data in motion"), although some network-based and host-based DLP solutions can. However, it is possible to use your network intrusion detection system (IDS) to identify sensitive information as it traverses your network, essentially acting as a DLP system. You can get signatures for commonly used sensitive information, such as SSNs and CCNs, that can be detected by Snort, the freely available and open source network IDS. However, as with rule-based matching, these signatures can lead to false positives.
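To make the idea of a content signature concrete, here is a minimal Python sketch of the kind of pattern test such a signature performs on a packet payload. It is not Snort rule syntax and does not use any Snort API; the signature names, the patterns, and the `inspect_payload` function are hypothetical and exist only for illustration.

```python
import re

# Byte-level patterns, because network payloads arrive as raw bytes.
# Both patterns are illustrative assumptions, not published signatures.
SIGNATURES = {
    "possible SSN": re.compile(rb"\b\d{3}-\d{2}-\d{4}\b"),
    "possible CCN": re.compile(rb"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def inspect_payload(payload: bytes) -> list[str]:
    """Return the name of every signature that matches this payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(payload)]
```

A form post containing `ssn=078-05-1120` would trigger the first signature, but so would an unrelated field such as `order=555-12-3456`, which is the false-positive problem in miniature. Pattern matching of this kind also sees nothing inside encrypted traffic.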

Some DLP products can scan application servers, such as database, email, and Web servers. Most free tools can't do this -- they are designed to work with data at rest and don’t understand the protocols necessary to interact with the servers and do the searching. To use these tools, you'd have to export the server's data to a text file or store your mail in a specific type of supported container format. The one exception is Cornell’s Spider, which can traverse HTTP and HTTPS Websites and scan for sensitive information.

In addition, search engines like Yahoo and Google are quite effective at finding sensitive information that has been published on a Website, something my university found out the hard way last November. It's better to find this sort of data yourself, before an outsider does.

Ultimately, a data classification and labeling system can go a long way toward making your efforts to identify and secure your data more effective. The use of unique labels for files containing sensitive information can make it easier to find them, whether they are at rest or in transit.

These unique labels can be used like digital watermarks or honeytokens, and you can write IDS rules to identify them when the data is in transit. You can also write regular expressions for the tools that search data at rest, using them to identify these honeytokens and to help audit computers that shouldn’t be storing a particular classification of data.
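A label audit of data at rest can be sketched in a few lines of Python. The label format below (`DATACLASS-RESTRICTED-` followed by eight hexadecimal characters) is purely hypothetical, as is the `find_labeled_files` function; a real classification scheme would define its own labels, and the pattern would change to match.

```python
import os
import re

# Hypothetical label format; substitute whatever your classification scheme uses.
LABEL_RE = re.compile(rb"DATACLASS-(?:RESTRICTED|CONFIDENTIAL)-[0-9A-F]{8}")


def find_labeled_files(root: str) -> dict[str, list[str]]:
    """Map each file under root to the sorted, unique labels found inside it."""
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Reads the whole file into memory; fine for a sketch,
                # but large files would need chunked reading.
                with open(path, "rb") as fh:
                    data = fh.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort the audit
            labels = sorted({m.decode("ascii") for m in LABEL_RE.findall(data)})
            if labels:
                results[path] = labels
    return results
```

Run against a user's home directory or a file share, any non-empty result points to a machine holding data of a classification it may not be cleared to store. Because the labels are unique strings rather than generic digit patterns, false positives should be far rarer than with SSN or CCN matching.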

These freely available tools for identifying sensitive information are powerful, but they don't have the scope or features of DLP solutions. Still, they can give your enterprise a head start in finding where sensitive information lives -- which, in turn, can be used to help justify the need for a more comprehensive (and expensive) DLP solution.

About the Author(s)

Dark Reading Staff

Dark Reading

Dark Reading is a leading cybersecurity media site.
