Sprawling CrowdStrike Incident Mitigation Showcases Resilience Gaps
A painful recovery from arguably one of the worst IT outages ever continues, and the focus is shifting to what can be done to prevent something similar from happening again.
July 23, 2024
The fact that a few lines of errant code could cause disruption on the scale that CrowdStrike's update has over the past four days has focused unparalleled attention on the urgent need for greater resiliency and redundancy in enterprise information technology stacks worldwide.
Few expect that getting there will be easy. But almost everyone agrees that the developments of the past few days underscore the need for better preparedness, better impact mitigation, and fresh ideas for recoverability from technology failures of the sort that happened last week.
The havoc started on July 19 when a small CrowdStrike content update for the Windows version of the company's Falcon endpoint security technology caused systems failures worldwide. Numerous airlines, banks, airports, hospitals, hotels, manufacturing companies, and others reported that their Windows systems had become essentially inoperable, stuck in a blue screen of death (BSOD) state and refusing to restart despite repeated reboot attempts. Microsoft estimated the faulty CrowdStrike update affected some 8.5 million Windows systems worldwide.
As if the recovery issues were not enough of a challenge, threat actors added to them this week by taking advantage of the chaos to try to distribute phishing emails, information stealers, and other badware. On July 22, for example, CrowdStrike warned of threat actors using a fake CrowdStrike recovery manual to distribute a hitherto unseen information stealer dubbed Daolpu. Earlier, the security vendor warned of threat actors attempting to distribute a malicious zip archive to users in South America; it purported to be a hotfix from the company but actually loaded the RemCos Trojan. Others, such as KnowBe4, reported phishing attempts using the CrowdStrike issue as a lure starting just hours after news of the problem first began surfacing.
CrowdStrike: A National Security Issue?
On July 22, the US House Committee on Homeland Security demanded an explanation from CrowdStrike CEO George Kurtz on what went wrong and the measures the company will implement to prevent a similar incident in the future. In a letter to Kurtz, the committee pointed to the sheer magnitude of the disruption in the US — more than 3,000 cancelled flights, 11,800 flight delays, surgery cancellations, 911 call center outages — as reasons why the issue cannot be ignored.
"This incident must serve as a broader warning about the national security risks associated with network dependency," Mark Green, the chairman of the committee, wrote. Malicious cyber actors backed by nation-states, such as China and Russia, are watching our response to this incident closely."
Both CrowdStrike and Microsoft have released updates and guidance, including self-remediation tips for remote users, to help organizations restore their systems. Microsoft on Monday updated its recovery tool with expanded logging, error handling capabilities, and two repair options to help organizations expedite recovery.
A Mammoth Recovery Task
Even so, the task of restoring systems will be enormous and time-consuming, says Thomas Mackenzie, director of product strategy at Lansweeper. "It depends on a number of factors, including, but not limited to, whether there are backups in place to roll back to, and whether the assets are virtualized or not," he says. "Microsoft has released a tool to fix this problem, but if the asset has BitLocker and requires the key, then it can't be used. It’s not a trivial task if you’re talking about a lot of assets across different locations."
Danny Jenkins, CEO at ThreatLocker, says his company's testing shows it takes about 15 minutes per computer to recover manually — something that will be required in many cases.
"If all computers are office-based, it could be reduced to about four minutes per device, assuming they are close to each other," he says, but adds that restoration will be significantly harder when remote users are involved. "A company with 10,000 devices is going to take about 666 person hours to recover. Remote recovery makes it more likely to be three times that."
Encryption recovery keys are another issue. Each device has its own BitLocker recovery key, which must be entered before it can be booted into Safe Mode for remediation.
"This could extend recovery time [significantly], assuming you have them saved somewhere," Jenkins says. "It is also a really long manual key to type in." Organizations could try using another security tool to block CrowdStrike from running so as to enable automated recovery, he adds.
The Dangers of an Interconnected World
This CrowdStrike incident is a reminder that in an increasingly technology-dependent and interconnected world, sometimes things will go wrong, says Melissa Bischoping, director of endpoint security at Tanium. In this case, they went wrong in a way that was technically simple to remediate but involved an astronomical amount of effort in practice because it required human intervention on nearly every impacted endpoint in the first few days.
"Going forward, we must [focus] on resilience and redundancy in the technology we build and deploy across the globe," she says. "It is inevitable that failures will happen in technology. Having layers of resilience, real-time visibility, and business continuity plans which account for the most complex remediation must be at the center of every risk management conversation."
The incident, not surprisingly, has prompted questions about the wisdom of giving technology vendors the untrammeled ability to make automatic updates to their software on customer systems, often without so much as asking permission first, Bischoping says: "We place a lot of trust in the providers that deliver software to our organizations. It's imperative that we have conversations about allowing the customer to remain in control of changes to endpoints, and balance the need to deploy the most up-to-date information with each environment’s unique risk acceptance strategy."
Allowing organizations some level of control over the rate at which endpoints receive changes is a critical component of risk mitigation, she says.
Paul Davis, field chief information security officer (CISO) at JFrog, says the CrowdStrike incident is a reminder why proactive testing and preparedness are key to preventing massive disruption.
"Organizations affected by this must also take an honest look at their operations — what pieces of your tech stack went offline, who could have done their jobs better, who was prevented from doing their jobs, who were essential to the business, what could the org live without during downtime, and what pieces of the business need to be protected the most," he says. "The answers to these questions will define your crisis response plan and will give you a blueprint on how to act when an outage of this magnitude occurs."
The key takeaway for organizations here is that the software supply chain can be complex, with multiple interconnecting parts and tools, where even small errors can have massive impacts. "Slow ramp-ups and careful deployment are key" when it comes to rolling out updates, Davis says. "Never let the cure be worse than the disease, where an update causes more disruption than the bug it’s trying to fix."
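Davis' point about slow ramp-ups describes what is often called a staged, or ring-based, rollout: ship a change to a small slice of the fleet, verify health, then widen. The sketch below illustrates that general idea in Python; the ring sizes, update identifier, and health check are hypothetical placeholders, and it is not a description of how CrowdStrike or any particular vendor deploys updates.

```python
import time

# Hypothetical deployment rings: deploy to a small slice of the fleet first,
# check health, then widen. Ring fractions here are illustrative only.
RINGS = [
    ("canary", 0.01),   # 1% of the fleet
    ("early", 0.10),    # 10%
    ("broad", 1.00),    # everyone
]

def fleet_is_healthy(ring_name: str) -> bool:
    """Placeholder health check: in practice this would query crash and boot
    telemetry for devices in the given ring before widening the rollout."""
    return True

def staged_rollout(update_id: str, fleet_size: int, soak_seconds: int = 0) -> None:
    """Roll an update out ring by ring, halting if a ring looks unhealthy."""
    deployed = 0
    for ring_name, fraction in RINGS:
        target = int(fleet_size * fraction)
        batch = max(target - deployed, 0)
        print(f"[{update_id}] deploying to {batch} devices in ring '{ring_name}'")
        deployed = max(deployed, target)
        time.sleep(soak_seconds)  # let the change soak before checking health
        if not fleet_is_healthy(ring_name):
            print(f"[{update_id}] health check failed in '{ring_name}'; halting")
            return
    print(f"[{update_id}] rollout complete: {deployed} devices updated")

# Example run with a hypothetical update name and fleet size
staged_rollout("content-update-291", fleet_size=10_000)
```

The value of this pattern in an incident like last week's is that a defect surfaces on a small number of machines, and the rollout halts, before the change ever reaches the broad ring.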