Some of us "seasoned" computer science professionals recall the early days of computing, before the Web and before PageRank, the key algorithmic innovation that enabled Google to grow to its current mammoth scale. Much has been written about Google's history and the rise of effective web search engines that ranked web pages so users could easily find the information most relevant to them.
At the time, some in the computer science community concerned with security and privacy expressed fears that Google's web crawling and indexing might be illegal. Certainly, copyright issues would be in play if wholesale copying of web content wasn't permissible. Many of these issues were resolved over the years through agreed-upon rules of the road that permit crawling, page analysis, and indexing, subject to the policies and terms of service announced by webmasters. In a perfect Internet, all would be good.
Today, web crawling is continuous and ubiquitous, and it has broadened in scope from web pages to general Internet searches and file shares. The downside is that Google searches can also capture and index files and data exposed in cloud shares. Alongside the many legitimate web crawlers that adhere to the rules in robots.txt, there are malicious crawlers that ignore these rules and scan and probe, sometimes successfully, to capture documents shared in the cloud. It may not be immediately apparent when a cloud share has been visited by a spider. After all, it isn't obvious when your website has been crawled unless you explicitly look for it.
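To make the robots.txt convention concrete, here is a minimal sketch of the check a well-behaved crawler performs before fetching a URL, using Python's standard urllib.robotparser module. The robots.txt content, the "ExampleBot" agent name, and the example.com URLs are assumptions made up for illustration. A malicious crawler simply skips this step, which is why robots.txt is a courtesy and not an access control.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /shares/
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/shares/lottery.csv"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```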
This is why it pays to be proactive. We experienced a related incident firsthand at Columbia University, where I work as a computer science professor. Long ago, before there were so many regulations around protecting personally identifiable information, student Social Security numbers were used as the unique identifier for entering the housing lottery for a dorm room on campus. The files associated with this lottery were then stored in the cloud and forgotten. That is, until Google's indexing made the Social Security numbers public and searchable, creating an incident years after the files were stored and the students had moved on from the university. The university's security team was able to remove the links and has since spent more time educating faculty and students on data privacy best practices. The team has also set up a scanning system to monitor for any instances of students' Social Security numbers being shared.
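The university's actual scanning system is not described here, so the following is only a rough sketch of the general idea: searching text for strings shaped like Social Security numbers. The pattern is an assumption that covers only the dashed NNN-NN-NNNN form and rejects a few number ranges that are never issued; it will miss undashed numbers and can produce false positives, so hits are candidates for human review. Matches are masked so that the scanner's own output does not become a new leak.

```python
import re

# Assumed pattern: dashed form only, excluding area 000, 666, and 9xx,
# group 00, and serial 0000, which are not issued.
SSN_PATTERN = re.compile(
    r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b"
)


def find_ssn_candidates(text: str) -> list[tuple[int, str]]:
    """Return (line_number, masked_value) for each SSN-shaped string."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in SSN_PATTERN.finditer(line):
            # Keep only the last four digits in the report.
            hits.append((line_no, "***-**-" + match.group()[-4:]))
    return hits


if __name__ == "__main__":
    sample = "name,id\nA. Student,123-45-6789\nphone 555-1234\n"
    for line_no, masked in find_ssn_candidates(sample):
        print(f"line {line_no}: {masked}")
```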
It is these types of incidents that drove the university to take precautions, update security policies, and anticipate risks related to Google indexing and link sharing. Just recently, data from more than 90 companies, including Box, was exposed through Box accounts because employees shared web links.
How can security teams understand just how pervasive link-sharing risks are in their organizations? First, administrators should make sure the default access setting for shared links is "people in your company" to reduce accidental exposure of data to the public. Second, security policies for cloud-resident data should mirror any policies that apply to data stored on premises. That includes policies about downloading or sharing certain kinds of sensitive data, as well as encryption of sensitive data.
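One way to gauge how pervasive the risk is would be to audit an inventory of existing shared links for any that are open to the public. The sketch below assumes such an inventory has already been exported from the cloud provider. The SharedLink record and its access values ("public", "company", "invited") are hypothetical and do not correspond to any particular vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedLink:
    """Hypothetical record for one shared link in an exported inventory."""
    path: str
    owner: str
    access: str  # assumed values: "public", "company", "invited"


def find_public_links(links: list[SharedLink]) -> list[SharedLink]:
    """Return the links that anyone holding the URL can open."""
    return [link for link in links if link.access == "public"]


if __name__ == "__main__":
    inventory = [
        SharedLink("/finance/q3.xlsx", "alice", "company"),
        SharedLink("/housing/lottery.csv", "bob", "public"),
    ]
    for link in find_public_links(inventory):
        print(f"PUBLIC: {link.path} (owner: {link.owner})")
```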
Defenders typically resort to cloud log analysis to determine the extent of the problem. Such log analytics can alert personnel to possibly misconfigured cloud share access controls, or to user security violations in which a shared link gives an interested spider access to a broad collection of documents.
The log analytics aren't easy to do, but capturing all events, including time stamps, source IPs, agent strings, and URLs requested, is the basic starting point. Numerous products are available to assist in the process: for example, tools that use tracert to trace the source IPs and tools that analyze the timing of requests. Being alert to spiders is important, but once a spider has done its job and the shared documents have been exposed, what's next?
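As a rough illustration of that starting point, the sketch below extracts the time stamp, source IP, agent string, and requested URL from each log line, then flags clients that either announce themselves as crawlers or request shared files unusually fast. It assumes the logs resemble the Apache "combined" access log format, which real cloud share logs may not. The /shares/ prefix, the 30-requests-per-minute threshold, and the agent keywords are arbitrary assumptions, and a malicious spider can easily forge its agent string.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed input: Apache "combined" access log lines.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
BOT_HINTS = ("bot", "crawler", "spider")  # assumed keywords, easily evaded


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["ts"] = datetime.strptime(event["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return event


def flag_suspected_spiders(lines, share_prefix="/shares/", max_per_minute=30):
    """Map (ip, agent) to the reasons that client looks like a spider."""
    per_client = defaultdict(list)
    for line in lines:
        event = parse_line(line)
        if event and event["url"].startswith(share_prefix):
            per_client[(event["ip"], event["agent"])].append(event["ts"])

    flagged = {}
    for (ip, agent), stamps in per_client.items():
        reasons = []
        if any(hint in agent.lower() for hint in BOT_HINTS):
            reasons.append("self-declared crawler")
        # Sliding window: more than max_per_minute requests within 60 seconds.
        stamps.sort()
        start = 0
        for end, ts in enumerate(stamps):
            while (ts - stamps[start]).total_seconds() > 60:
                start += 1
            if end - start + 1 > max_per_minute:
                reasons.append("high request rate")
                break
        if reasons:
            flagged[(ip, agent)] = reasons
    return flagged
```

Calling flag_suspected_spiders on an iterable of log lines returns a dictionary keyed by (source IP, agent string), which gives an analyst a short list of clients to investigate first.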
Once a spider has scanned and indexed the files in the cloud share, the data owner has lost the ability to control access to them; in essence, all bets are off. So, the immediate questions security teams need to answer are: What was lost? Who is affected? Who is responsible? How did it get lost? Can it be prevented from happening again?
Cloud log analysis can help answer some of these questions. Appropriate mitigation actions in a case like this also include shutting down credentials for the person who shared the link, revoking user access to cloud-resident files, folders, or cloud shares, and, in some cases, decommissioning a public cloud folder and reconfiguring security settings for future files. That is how some of the organizations involved in the Box data leak responded.
At some point in the future, the information in cloud activity logs could be analyzed automatically using artificial intelligence, machine learning, or other technologies to lessen the workload of security professionals. Rather than spending resources digging through cloud logs, it may be possible to send teams real-time notifications when cloud security policies are violated, or when unsanctioned users open or download cloud-resident files that weren't meant for them.
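Even without machine learning, the notification idea can be sketched with simple rules. The example below checks a stream of file events against a policy that lists which users are sanctioned for each folder and yields an alert message for every violation. The FileEvent fields, the policy layout, and the user and folder names are all hypothetical; a real system would read events from the cloud provider's activity log and deliver alerts through a paging or messaging service.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FileEvent:
    """Hypothetical activity record taken from a cloud log."""
    user: str
    action: str  # e.g. "open" or "download"
    path: str


# Hypothetical policy: folder prefix -> users sanctioned to access it.
POLICY = {"/shares/housing/": {"registrar", "housing-admin"}}


def violations(events: Iterable[FileEvent],
               policy: dict[str, set[str]]) -> Iterator[str]:
    """Yield an alert for each event by a user not sanctioned for the folder."""
    for event in events:
        for prefix, allowed in policy.items():
            if event.path.startswith(prefix) and event.user not in allowed:
                yield (f"ALERT: {event.user} performed {event.action} "
                       f"on {event.path}")


if __name__ == "__main__":
    stream = [
        FileEvent("registrar", "open", "/shares/housing/lottery.csv"),
        FileEvent("guest42", "download", "/shares/housing/lottery.csv"),
    ]
    for alert in violations(stream, POLICY):
        print(alert)
```

Because violations is a generator, it can consume events as they arrive instead of waiting for a complete log file.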