The common practice of blurring text using pixelation — where text is replaced with pixels representing averaged values — always bothered security researcher Dan Petro. Intuitively, he knew that the technique still leaked too much information about the underlying text.
When another researcher posted a challenge on Twitter to decipher a pixelated phrase, Petro, a lead researcher at offensive-security firm Bishop Fox, went to work. Over the next six months, he created a tool that essentially performs a dictionary attack against the obfuscated text.
"It has always bothered me because I knew that there was no way that this was a secure method of redaction," he says. "Clearly the technique leaks information, and there has to be a way to get the text back out again."
Yesterday Petro published an analysis of the problem and released his open source tool, Unredacter, so others can replicate his work. The JumpSec researcher who originally posted the challenge confirmed to Petro that he had recovered the random text correctly.
The blog post is a timely reminder that poor redaction can lead to data leakage, but other researchers have explored the problem over at least the past two decades.
In 2005, a paper published by researchers at Lehigh University and a document-processing company outlined the many ways that redaction can leak information. For example, the text may not be completely obscured, or certain features, such as ascenders and descenders, may remain visible when redaction is incomplete. In addition, if there are a limited number of keywords, then the length of the redacted text could allow matching to the keyword dictionary.
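The length-matching leak can be illustrated with a minimal sketch. The character widths, the keyword list, and the function names below are all hypothetical, chosen only for illustration; a real attack would measure widths from the actual font used in the document.

```python
# Hypothetical per-character advance widths, in pixels, for a proportional font.
CHAR_WIDTH = {"i": 3, "l": 3, "m": 11, "w": 10}
DEFAULT_WIDTH = 7  # assumed width for every other character


def rendered_width(word):
    """Approximate the on-page width of a word, in pixels."""
    return sum(CHAR_WIDTH.get(ch, DEFAULT_WIDTH) for ch in word)


def candidates_for(box_width, keywords, tolerance=2):
    """Return the keywords whose rendered width fits the redaction box."""
    return [w for w in keywords
            if abs(rendered_width(w) - box_width) <= tolerance]


keywords = ["alice", "bob", "mallory", "william"]
box = rendered_width("mallory")  # width of the black box, as measured by the attacker
print(candidates_for(box, keywords))  # ['mallory']
```

With a short keyword list, the width of the box alone can be enough to single out one candidate, even though none of the text is visible.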
Researchers have also focused specifically on pixelation as a flawed method of redaction. In a 2014 blog post, for example, a machine-learning and robotics engineer outlined the math behind recovering pixelated text. In a more extensive 2016 research paper, four researchers explored the limits of a probabilistic approach known as a Hidden Markov Model, which performed well, with recognition rates of better than 80%, even when working on low-quality images. Recognition rates declined only when the text had a mosaic or blur radius of 20 pixels or more, an indication of how much information is mathematically mixed together to achieve the blur effect.
"Mosaicing and blurring are popular forms of redaction because they have a certain aesthetic appeal to the naked eye," the researchers stated in that paper. "The images that these methods produce are highly suggestive of text; as a result, they do not disrupt the visual appearance of documents to the same extent as cut-out or black-box methods for redaction. But while mosaicing and blurring are lossy transformations, they preserve far more information than most users realize."
Pixelation blurs information by essentially reducing the resolution of a given piece of text — 12-point text that may use 16 pixels vertically could be translated to a 4-pixel-high grid, for example. An attacker who knows that the blurred image is text can recover the underlying information if they know the font size and the actual font.
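The averaging step can be sketched in a few lines. This is a simplified illustration rather than the algorithm of any particular image editor: it assumes a grayscale image stored as a list of rows, with dimensions that are exact multiples of the block size, and the two tiny "glyphs" are made up for the example.

```python
def pixelate(image, block):
    """Replace each block x block tile with the average of its pixels.

    `image` is a list of rows of grayscale values (0-255). Height and width
    are assumed to be multiples of `block` to keep the sketch short.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for top in range(0, height, block):
        for left in range(0, width, block):
            tile = [image[y][x]
                    for y in range(top, top + block)
                    for x in range(left, left + block)]
            avg = sum(tile) // len(tile)
            for y in range(top, top + block):
                for x in range(left, left + block):
                    out[y][x] = avg
    return out


# Two made-up 4x4 "glyphs" that differ only in how many columns are lit.
glyph_a = [[0, 0, 255, 255] for _ in range(4)]
glyph_b = [[0, 255, 255, 255] for _ in range(4)]

print(pixelate(glyph_a, 4)[0][0])  # 127
print(pixelate(glyph_b, 4)[0][0])  # 191
```

Even after each glyph is collapsed to a single averaged value, the two remain distinguishable, which is the information an attacker works from.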
"These are fairly reasonable assumptions, I would assert, since the attacker in a realistic scenario would likely have received a full report, with just one piece redacted out," Petro stated in his Feb. 15 blog post. "In our challenge text, you can see a few words right above the pixelated text that give us this information."
In many ways, the problem is similar to hashing because a given pixelation method turns a word into a unique pattern of pixels. Secure hash functions, however, are designed for diffusion, a property whereby a small change in the input leads to a completely different output. Pixelation has no such property, so it can be solved one character at a time, in a Hollywood-esque password-cracking sequence, as can be seen in the video for the Unredacter tool.
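The character-at-a-time idea can be shown with a toy model. This is not Unredacter's code: the three-column "font" in `glyph`, the one-dimensional `pixelate`, and the backtracking search in `recover` are all assumptions made for illustration. The sketch guesses one character, renders and pixelates the guess, and keeps it only if every block that the guess fully determines matches the redacted image.

```python
import string

ALPHABET = string.ascii_lowercase
BLOCK = 2  # pixelation block width in columns, assumed known to the attacker


def glyph(ch):
    # Hypothetical 3-column "font": each character maps to three intensities.
    return [(ord(ch) * k) % 256 for k in (7, 13, 29)]


def render(text):
    """Concatenate the glyph columns of every character in `text`."""
    return [value for ch in text for value in glyph(ch)]


def pixelate(cols, block=BLOCK):
    """Average each run of `block` columns (the last run may be shorter)."""
    return [sum(cols[i:i + block]) // len(cols[i:i + block])
            for i in range(0, len(cols), block)]


def recover(target, length, block=BLOCK):
    """Depth-first search for a string whose pixelation equals `target`."""
    def search(prefix):
        if len(prefix) == length:
            return prefix if pixelate(render(prefix), block) == target else None
        for ch in ALPHABET:
            guess = prefix + ch
            cols = render(guess)
            full = len(cols) // block  # blocks fully determined by the guess
            if pixelate(cols[:full * block], block) == target[:full]:
                found = search(guess)
                if found is not None:
                    return found
        return None
    return search("")


secret = "petro"
redacted = pixelate(render(secret))
guess = recover(redacted, len(secret))
print(guess, pixelate(render(guess)) == redacted)
```

Because the blocks straddle character boundaries, a guess sometimes has to be revisited, which is why the sketch backtracks instead of committing greedily. Averaging can also produce collisions, so the search returns the first string consistent with the redacted image, which is not guaranteed to be the original in general. The key contrast with a secure hash is that changing the last character leaves the earlier blocks untouched, so wrong guesses are pruned early instead of only after the whole string has been tried.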
Petro is not the first developer of a program to recover pixelated text. Another program, Depix, takes a similar approach, but the Bishop Fox researcher found the results to be very sensitive to noise in the input image. Researchers at security firm Positive Security surveyed other methods, including ways to improve the resolution of facial images from low-resolution thumbnails.
Companies should use a black box for redaction instead. As long as all parts of the sensitive text are covered, the technique leaks little information, Petro says.
"The only secure way to redact text is a black bar that completely covers the text because even a slight amount of information could compromise the data," he says. "Even then, you have to worry about context clues," such as the length of the redacted text.