Regularly updating software components can eliminate two-thirds of the vulnerabilities found in container images, while minimizing the number of libraries can also reduce the attack surface area in some cases, according to research by a team at Concordia University in Montreal.
The research, which focused on containerized applications used in high-performance computing (HPC) environments for neuroimage processing, analyzed 44 container images using vulnerability scanners and found that the median container image had more than 320 vulnerabilities. Containers based on lightweight Linux distributions, such as Alpine Linux, had far fewer vulnerabilities, suggesting that minimizing the volume of code can also reduce the number of vulnerabilities, the research team said in a paper posted online last week.
While the researchers focused on containerized applications for analyzing images of the brain, the issue with vulnerabilities is not particular to that discipline or to data science packages, says Tristan Glatard, associate professor in the department of computer science and software engineering at Concordia University.
"The problem is general — it's not specific to a particular data analysis software or OS distribution," says Glatard. "There is no particularly bad guy. ... We didn't find any particular origin of vulnerabilities."
The research highlights that updating the packages included in images is a proven way for users of Docker and Singularity containers to reduce the number of vulnerabilities in the software. Last year, for example, one survey of Docker images found that 60% had at least one moderate vulnerability, while 20% had at least one high-risk vulnerability. Unfortunately, data scientists, like enterprise IT workers, are often wary that updates may break critical software.
The researchers, however, urged other scientists and data specialists to become more proactive about container security.
"[I]n neuroimaging, as in other disciplines, software updates are generally discouraged because they can affect analysis results by introducing numerical perturbations in the computations," the researchers stated in the paper. "We believe that this position is not viable from an IT security perspective, and that it could endanger the entire Big Data processing infrastructure, starting with the HPC centers."
The research team used a script to determine the package manager for a specific image and then ran the manager's update function to install the most recent software versions. Both the original image and updated images were scanned with a variety of vulnerability scanners: Anchore, Vuls, and Clair for Docker images, and the Singularity Container Tools for Singularity images.
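The paper's script is not reproduced here, but the general approach of detecting a package manager and running its update function can be sketched in a few lines of Python. The following is a minimal illustration, not the researchers' code: the function names, the list of package managers, and the exact update commands are assumptions chosen for the example, and the script is presumed to run inside the container being updated.

```python
import shutil

# Assumed mapping from package-manager binary to a typical update command.
# This list is illustrative; the researchers' script may cover other
# managers or use different flags.
UPDATE_COMMANDS = {
    "apt-get": "apt-get update && apt-get -y upgrade",
    "dnf": "dnf -y upgrade",
    "yum": "yum -y update",
    "apk": "apk update && apk upgrade",
}


def detect_package_manager(which=shutil.which):
    """Return the first known package manager found on the PATH, or None.

    The `which` argument is a hypothetical hook so the lookup can be
    replaced in tests; by default it is shutil.which.
    """
    for name in UPDATE_COMMANDS:
        if which(name) is not None:
            return name
    return None


def update_command(which=shutil.which):
    """Return the shell command that would update the image, or None."""
    manager = detect_package_manager(which)
    return UPDATE_COMMANDS[manager] if manager else None


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch is safe to try.
    print(update_command() or "no known package manager found")
```

Printing the command instead of executing it keeps the sketch harmless; in practice the command would be run while building the updated image, which is then scanned alongside the original.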
The number of vulnerabilities found varied from about 1,700 for one image to nearly zero for a handful of others. While the average number of vulnerabilities per image was 460, the median image had 321 vulnerabilities. The vulnerability count rose roughly linearly with the number of packages, with about 1.7 security issues discovered per software component on average, according to the research. Updating the containers, however, removed almost two-thirds of the security issues, lowering the vulnerability density to an average of about 0.6 per software package.
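The "almost two-thirds" figure follows directly from the two density numbers reported above. As a quick check, using only the rounded values quoted in this article (the function name is made up for the illustration):

```python
def fraction_removed(before, after):
    """Fraction of vulnerabilities eliminated, given before/after densities."""
    return (before - after) / before


# Rounded per-package vulnerability densities quoted in the article.
density_before = 1.7  # vulnerabilities per package, original images
density_after = 0.6   # vulnerabilities per package, updated images

reduction = fraction_removed(density_before, density_after)
print(f"{reduction:.0%}")  # prints 65%
```

Because the inputs are rounded averages, the result is only an approximation of the reduction measured in the paper.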
Minimizing the number of packages often reduced the number of vulnerabilities, but the impact was uneven. In some cases, removing unnecessary packages had no impact, especially when there were few extraneous packages. However, using the Alpine Linux distribution — a minimal version of Linux commonly used as a base image in Docker containers — typically reduced the attack surface area, says Glatard.
"Container images based on Alpine Linux are an exception, though: They have less vulnerabilities overall," he says. "This isn't because of better software or anything else [other] than limiting the number of software packages present in Alpine Linux images."
Like enterprise IT teams, data scientists are often concerned that updates will break, or at least change, their analyses, and so they avoid updating the software components in an image, says Concordia's Glatard. He urged image users to regularly check whether they are using the latest software.
"I think data scientists should aim at minimizing software dependencies in container images and update them," he says. "Updates, however, can be a bit tricky, as in some cases they might change the outcome of analyses. Currently, you don't want to update software in the midst of an experiment, as it might introduce a bias in your results."
In addition, data scientists and the users of scientific software should make their analyses more robust to changes, which can help ensure that software updates don't affect the results of data analysis.