Black Hat researcher releases new lexical analysis tool that doesn't rely on regular expressions

Don Bailey, Founder & CEO, Lab Mouse Security

July 26, 2012


Even many years after gaining prominence as one of the most popular and convenient ways for criminals to break into corporate databases through vulnerable web applications, SQL injection remains the apple of many a black hat hacker's eye. While plenty of factors conspire against enterprises doing a better job of preventing these attacks, one of the most fundamental is that SQL injection attacks are very difficult to detect. This week at Black Hat, a researcher released a new tool, designed to be embedded in applications, that makes detection easier.

Part of the problem with many existing detection mechanisms, including those in many web application firewalls, is their dependence on regular expressions, Nick Galbreath, director of engineering at Etsy, told his audience yesterday. Analysis using regular expressions quickly gets bogged down because SQL is such a rich, complicated language. He cited a 2005 Black Hat talk by Hanson and Patterson that showed how regular expressions can break down and produce false positives.

"So what happens is, a lot of the web application firewalls have sort of ended up using what I call regular expression soup," he says. "It's impossible to debug and test against. Regular expressions, no matter what you do, are gonna miss something and something that you don't want is going to be flagged as a false positive."

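Both failure modes Galbreath describes are easy to reproduce. The Python sketch below uses a single made-up pattern, not a rule from any real firewall, and shows it flagging an innocent sentence while missing a classic injection string. The rule and the function name are assumptions chosen for illustration.

```python
import re

# Hypothetical rule, invented for illustration; not taken from any real WAF.
NAIVE_RULE = re.compile(r"\bunion\b.*\bselect\b", re.IGNORECASE)


def looks_like_sqli(text):
    """Return True if the naive rule fires on the input."""
    return NAIVE_RULE.search(text) is not None


benign = "The student union will select a new president"
attack = "1' OR '1'='1"

print(looks_like_sqli(benign))  # True: a false positive
print(looks_like_sqli(attack))  # False: a missed injection
```

Patching either case usually means adding another pattern, which is how a rule set can grow into the "soup" he describes.
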
One of the big difficulties in analyzing user input for potential SQL injection is that it is very tough to automatically tell the difference between things like phone numbers or Twitter handles and the snippets of SQL that attackers use to inject code.

"It turns out to be a difficult problem. How do you detect if user input is SQL, good input, or what? Is that my phone number or an arithmetic expression? Is it a Twitter handle or is it a SQL variable?" he says. "So trying to disambiguate these things turns out to be a hard problem."

As Galbreath examined that problem, he considered using existing SQL parsers to do the heavy lifting. But as he dove into them, he found that each parses only its own flavor of SQL and that they are not really designed to handle partial bits of code. They are also hard to extend and are preoccupied with correctness, because they are usually meant to ensure code runs properly. Someone hunting for SQL injection isn't so worried about correctness.

So instead of depending on tools not specifically meant for SQL injection analysis, Galbreath wrote his own.

"It sounds crazy but it turns out it's pretty straightforward and not so bad (because) we don't need it to actually run SQL," he says. "What it does is it converts input into a stream of tokens. There's a master list of keywords and functions which is sort of combined against all the major databases. It's not completely intractable and it also handles the comments, strings, literals and all the weird cases and things like that."

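To make the idea of a token stream concrete, here is a minimal hand-written scanner in Python. It is a sketch of the general technique only, not libinjection's code: the tiny keyword list, the token type names and the `tokenize` function are all assumptions made for illustration, and the real library is written in C with a far larger keyword table.

```python
# Illustrative only: the keyword list and token names are assumptions,
# not libinjection's real tables.
KEYWORDS = {"select", "union", "from", "where", "and", "or", "insert", "drop"}
OPERATORS = set("+-*/=<>")


def tokenize(text):
    """Return a list of (type, value) tokens for a fragment of possible SQL."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif text.startswith("--", i):
            # Line comment: runs to the end of the line (or input).
            end = text.find("\n", i)
            end = n if end == -1 else end
            tokens.append(("comment", text[i:end]))
            i = end
        elif text.startswith("/*", i):
            # Block comment: an unterminated one runs to the end of input.
            end = text.find("*/", i + 2)
            end = n if end == -1 else end + 2
            tokens.append(("comment", text[i:end]))
            i = end
        elif ch in "'\"":
            # String literal: fragments are expected, so a missing closing
            # quote simply ends the string at the end of input.
            end = text.find(ch, i + 1)
            end = n if end == -1 else end + 1
            tokens.append(("string", text[i:end]))
            i = end
        elif ch.isdigit():
            j = i
            while j < n and (text[j].isdigit() or text[j] == "."):
                j += 1
            tokens.append(("number", text[i:j]))
            i = j
        elif ch.isalpha() or ch in "_@":
            j = i + 1
            while j < n and (text[j].isalnum() or text[j] == "_"):
                j += 1
            word = text[i:j]
            if word.startswith("@"):
                kind = "variable"
            elif word.lower() in KEYWORDS:
                kind = "keyword"
            else:
                kind = "name"
            tokens.append((kind, word))
            i = j
        elif ch in OPERATORS:
            tokens.append(("operator", ch))
            i += 1
        else:
            tokens.append(("other", ch))
            i += 1
    return tokens


print([kind for kind, _ in tokenize("555-1234")])
# ['number', 'operator', 'number']
print([kind for kind, _ in tokenize("1 UNION SELECT 1 -- hi")])
# ['number', 'keyword', 'keyword', 'number', 'comment']
```

Note that the scanner never rejects input. Unlike a full SQL parser, it accepts partial fragments and simply labels what it sees, which matches the point that correctness is not the goal here.
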
Called libinjection, it's an open source C library that takes a lexical analysis approach and was trained with real user input data from his company's site, a top-50 Internet site with a rich base of user input. The tokenization approach makes the tool more lightweight and streamlines the analysis of user data.

"So it goes through, disambiguates, merges tokens, specializes, merges strings together, does all the stuff it needs to do and then it does one last step, which is really designed to reduce false positives," he says. "If it sees a bunch of arithmetic operations together, it just merges them all together. My phone number just turns into 1. We don't actually care what the value is because SQL injection doesn't care what the value is, just that there's a number there. Same thing with multiple nested parentheses, it just gets rid of them."

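The folding step in that last quote can be sketched in a few lines of Python. This is an assumption-laden illustration rather than libinjection's actual logic: tokens are reduced to plain type labels, the labels themselves ("number", "arith" and so on) are invented for the example, and all parentheses are dropped outright as a simplification of "gets rid of them."

```python
def fold(tokens):
    """Simplify a list of token-type labels.

    Step 1 drops grouping parentheses. Step 2 collapses every
    number-operator-number run into a single number, so the value of an
    arithmetic expression is ignored and only its presence is kept.
    """
    flat = [tok for tok in tokens if tok not in ("(", ")")]
    out = []
    for tok in flat:
        out.append(tok)
        # Keep collapsing while the tail looks like: number arith number
        while out[-3:] == ["number", "arith", "number"]:
            del out[-2:]
    return out


# A phone number such as 555-867-5309, already tokenized (hypothetical labels)
phone = ["number", "arith", "number", "arith", "number"]
print(fold(phone))  # ['number']

# Nested parentheses around arithmetic, e.g. ((1+2))
nested = ["(", "(", "number", "arith", "number", ")", ")"]
print(fold(nested))  # ['number']

# A fragment with SQL keywords is left recognizable
attack = ["number", "keyword", "keyword", "number", "arith", "number"]
print(fold(attack))  # ['number', 'keyword', 'keyword', 'number']
```

After folding, a phone number and a lone digit look identical, which is how this step cuts down on false positives.
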
By parsing and analyzing tokens in this way, Galbreath found that his tool doesn't have to sift through every byte of user data to decide whether input is SQL injection or benign. In fact, after testing millions of user inputs and SQL injection samples, he found that the magic number of tokens needed to distinguish SQL injection from benign input was just five.

"That's pretty interesting compared to regular expression, because then you're parsing the entire input. If you have 10 megs of input, it's going to be parsing 10 megs of data," he says. "This, as soon as it hits 5 tokens, done."

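The early exit is what keeps the cost flat regardless of input size. The Python sketch below shows one way to get that behavior with a lazy token generator that is abandoned after five tokens. The scanner is deliberately crude, and the one-letter token codes and the `fingerprint` function are assumptions for illustration, not libinjection's API.

```python
from itertools import islice

MAX_TOKENS = 5  # the "magic number" quoted in the article


def lazy_tokens(text):
    """Yield (code, end_offset) pairs one at a time, scanning left to right.

    Codes are made up for this sketch: 'n' number, 'w' word, 'o' other.
    """
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        j = i + 1
        if ch.isdigit():
            while j < n and text[j].isdigit():
                j += 1
            code = "n"
        elif ch.isalpha():
            while j < n and text[j].isalnum():
                j += 1
            code = "w"
        else:
            code = "o"
        yield code, j
        i = j


def fingerprint(text):
    """Return (pattern, chars_scanned) using at most MAX_TOKENS tokens."""
    first = list(islice(lazy_tokens(text), MAX_TOKENS))
    pattern = "".join(code for code, _ in first)
    scanned = first[-1][1] if first else 0
    return pattern, scanned


# A short payload followed by a large amount of padding
payload = "1 OR 1 = 1 " + "x" * 1_000_000
print(fingerprint(payload))  # ('nwnon', 10)
```

Only the first 10 characters are examined here, even though the input is about a megabyte long. A regular expression applied to the same input would generally have to walk the whole string.
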
About the Author(s)

Don A. Bailey is a pioneer in security for mobile technology, the Internet of Things, and embedded systems. He has a long history of ground-breaking research, protecting mobile users from worldwide tracking systems, securing automobiles from remote attack, and mitigating crippling IoT risks. He has given almost a dozen talks at various Black Hat events, demonstrating new risks and weaknesses in next-generation technology, along with their solutions. His expertise has been used by energy companies, mobile engineering firms, and other corporations worldwide to build safer prototypes, and to strengthen existing products. Mr. Bailey's goal is to build systems and methodologies for securing the next generation of Internet technologies, such as mobile and IoT systems that bridge the physical and digital worlds together. 
