Google Cloud's 'Dataproc' Abuse Risk Endangers Corporate Data Stores

There's a new way for hackers to abuse the cloud, this time with data analysts and scientists in the crosshairs.


Lackluster security controls in one of Google's cloud services for data scientists could allow hackers to create applications, execute operations, and access data in Internet-facing environments.

The issue lies with Google Cloud's "Dataproc," a managed service for running large-scale data processing and analytics workloads via Apache Hadoop, Spark, and more than 30 other open source tools and frameworks.

A so-called "abuse risk" to Dataproc, outlined by the Orca Research Pod on Dec. 12, rests on two ports that Dataproc leaves open by default. If an attacker achieves initial server compromise in an exposed cloud environment (through a common misconfiguration, say), they could take advantage of missing security checks to reach connected resources, such as data scientists' reams of sensitive data. They could also tamper with the victim's cloud environment, for instance by creating applications and submitting jobs.

"One can imagine that the data used for analysis is likely to contain proprietary as well as sensitive data, which, if breached could provide bad actors with customer data, business intelligence, and other data that could be used for competitive intelligence," says Roi Nisimi, cloud threat researcher at Orca Security.

Exposed Dataproc in Default Private Cloud

Dataproc's issues begin with the fact that two Web interfaces exposed on every cluster's master node, YARN ResourceManager on port 8088 and the Apache Hadoop Distributed File System (HDFS) NameNode on port 9870, don't require any authentication.

"The two ports mentioned above are served for all addresses," according to Orca. "Which means to fully access them, the one single prerequisite is Internet access. So one not properly segmented cluster can cause great damage."
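For admins who want to see whether those two interfaces answer from a given vantage point, a minimal sketch in Python follows. It only attempts a TCP connection to ports 8088 and 9870 and reports which ones accept it. The function name and the `DATAPROC_UI_PORTS` table are illustrative assumptions, not part of Orca's research or any Google tooling, and the check should only be run against clusters you administer.

```python
import socket

# Ports named in the article: the YARN ResourceManager and HDFS NameNode web UIs.
DATAPROC_UI_PORTS = {8088: "YARN ResourceManager", 9870: "HDFS NameNode"}


def reachable_ports(host, ports=DATAPROC_UI_PORTS, timeout=2.0):
    """Return the entries of `ports` that accept a TCP connection on `host`."""
    open_ports = {}
    for port, name in ports.items():
        try:
            # A completed TCP handshake means the port is reachable from here.
            with socket.create_connection((host, port), timeout=timeout):
                open_ports[port] = name
        except OSError:
            # Refused, filtered, or timed out: treat as not reachable.
            pass
    return open_ports
```

Running it from a machine inside the same subnetwork as the cluster, and again from one outside, gives a rough picture of how far the interfaces are exposed. A successful connection says nothing about authentication on its own; it only shows the port is reachable.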

As for the specific potential attack path, the researchers note that it's "fairly simple." 

Figure: Google Dataproc data leak attack flow.

Google Cloud projects come with a default virtual private cloud (VPC) network for Compute Engine instances, which, while limiting most inbound connections, does not limit connections within an organization's internal subnetwork. So, if an attacker can breach and execute code on an instance in the default VPC (say, one left open to the Internet), they have a path to Dataproc clusters in that network, because those two interfaces are left open by default.

"The attacker can now tunnel through the compromised machine to access both Web interfaces," the researchers explained. "They can use the YARN endpoint to create applications, submit jobs and perform Cloud Storage operations. ... Or worse, they can use the HDFS endpoint to browse through the storage file system and obtain full access to sensitive data."

That starting point is realistic, the researchers noted: "Having an Internet-facing remote code execution (RCE)-vulnerable Compute Engine instance is not farfetched."

The researchers brought their findings to Google, but the issue has not yet been resolved.

Nisimi says that Google could implement a fix rather easily. “Potential solutions would prevent unauthenticated access to the cluster Web interfaces,” he explains. “For example, Google could enable authentication by default in the underlying open source software (OSS) managed solution, so that GCP Dataproc only allows authenticated access.”

Orca did acknowledge that Google's Dataproc documentation highlights this potential security risk and suggests avoiding open firewall rules on a public network, but "they don’t take into account the risk of an attacker already having an initial foothold on a Compute Engine instance — which would give them unauthenticated access to GCP Dataproc as well," according to the Orca post.

In response to a request for comment from Dark Reading, a Google Cloud spokesperson notes, “The security of our customers' environments is a top priority. We enforce strict security practices including Custom Org Constraints for Dataproc customers. This allows project administrators to enforce additional rules for managing their security configuration for clusters.”

The spokesperson adds, “When these org constraints are properly enforced, these suggested exploits are not possible. We have found no evidence that customers have been impacted by this potential risk.”

Avoiding Cyber-Risk in Exposed Dataproc

To address such possibilities, the researchers recommended that Dataproc admins practice effective vulnerability management and properly segment their networks by creating independent clusters in different subnets, without cross-contamination with other services. Admins can also adjust firewall rules, or move to other VPCs.

Unless Google itself implements some sort of fix, the researchers wrote, "it’s up to organizations themselves to ensure that their GCP Dataproc clusters are not configured in a way that makes them vulnerable."

About the Author(s)

Nate Nelson, Contributing Writer

Nate Nelson is a freelance writer based in New York City. Formerly a reporter at Threatpost, he contributes to a number of cybersecurity blogs and podcasts. He writes "Malicious Life" -- an award-winning Top 20 tech podcast on Apple and Spotify -- and hosts every other episode, featuring interviews with leading voices in security. He also co-hosts "The Industrial Security Podcast," the most popular show in its field.
