UPDATED 14:18 EDT / JULY 19 2024

CLOUD

World reboots as CrowdStrike update cripples global systems running Microsoft Windows

In an incident that underscores the vulnerabilities of our technology-reliant world, a massive software update failure rippled across key sectors today, affecting everything from airlines and medical facilities to news outlets and public safety systems.

At the center of the disruption is a significant malfunction involving CrowdStrike Holdings Inc. software and Microsoft Corp.’s Windows and cloud services.

The chain reaction began when cybersecurity giant CrowdStrike rolled out an automatic software update to computers running Microsoft Windows. The update proved problematic: some systems could not process it, resulting in the infamous “blue screen of death,” a hallmark of Windows crashes. The failure forced system administrators to manually remediate and reboot affected computers, creating widespread chaos.
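CrowdStrike’s publicly posted workaround involved booting affected machines into Safe Mode or the Windows Recovery Environment and removing the faulty channel file before rebooting. As an illustration only, here is a minimal Python sketch of that cleanup step; the directory path and filename pattern come from the vendor’s public guidance rather than from this reporting, and in practice most remediation was done by hand from the recovery console.

```python
# Illustrative sketch of the publicly reported CrowdStrike workaround:
# from Safe Mode / the Windows Recovery Environment, delete the faulty
# channel file so the Falcon sensor stops crash-looping the host.
# Path and pattern are taken from CrowdStrike's public guidance; verify
# against current vendor instructions before running anything like this.
from pathlib import Path

CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"  # the defective channel file series

def remove_faulty_channel_files(dry_run: bool = True) -> list[Path]:
    """Find (and optionally delete) channel files matching the faulty pattern."""
    matches = list(CROWDSTRIKE_DIR.glob(FAULTY_PATTERN))
    for f in matches:
        print(f"{'Would delete' if dry_run else 'Deleting'}: {f}")
        if not dry_run:
            f.unlink()
    return matches

if __name__ == "__main__":
    # Dry run by default; a reboot is still required after removal.
    remove_faulty_channel_files(dry_run=True)
```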

“This wasn’t a cybersecurity attack,” CrowdStrike said in a statement, “just IT operations on Windows servers and Microsoft apps gone bad.” Systems running Linux and macOS were unaffected, isolating the crisis to Microsoft-dependent infrastructure.

CrowdStrike Chief Executive George Kurtz addressed the situation on Twitter/X, stating, “CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts.” Microsoft has not denied its involvement in the failure, but it has suggested that the fault lies with CrowdStrike for pushing the update.

The fact that only Microsoft systems were affected has been widely noted on Twitter/X. Many, including this reporter, are unsure whether to be more shocked by Microsoft’s vulnerability and CrowdStrike’s level of system access, or by the fact that so many critical systems and services rely on Windows machines and servers.

The fallout from this event was immediate and widespread. Major airlines grounded flights, hospitals faced operational disruptions, businesses halted activities, and media outlets struggled to broadcast. The unprecedented IT failure laid bare the fragility of our technology-dependent world.

Enterprise Technology Research conducted a CrowdStrike-focused flash survey of 100 IT decision makers, nearly three-quarters of them from large organizations, to capture the initial impact of and reaction to the widespread outages. Among other findings, 96% of organizations were affected by system crashes or outages, 68% experienced somewhat significant to very significant impacts, and 6% shut down nearly all essential operations.

ETR’s survey indicated CrowdStrike could see considerable fallout from the incident: 56% of respondents said they’re very or somewhat unlikely to replace CrowdStrike, but 39% said they’re very or somewhat likely to replace the company’s products and 5% said they’re certain to do so. Some 55% are reconsidering their reliance on CrowdStrike, while 38% see no need to change. Respondents expressed disappointment in the company’s testing and response, urging better quality assurance processes and improved communication, ETR added.

Analysts’ angles

Rob Strechay, an analyst at theCUBE Research, described the incident as a quality assurance failure. “This will either be a release engineering process issue with CrowdStrike, where safeguards normally used for software deployments were missed or not included, or, even worse, they were done and the results were ignored or missed,” he said. “We will see what comes out of the post-mortem.”

Christophe Bertrand, another analyst at theCUBE Research, offered a detailed perspective on the incident. “Today’s issue may have initially seemed like a cyberattack or ransomware, but it was actually a self-inflicted mistake,” he said. “This event is not exactly like a ransomware attack, because the vendors quickly identified the issue, but it still caused widespread problems across many systems in different regions.”

Bertrand added that the problem appears to have stemmed from an inadequately tested update. “It’s puzzling how a Blue Screen of Death loop was missed during testing,” he said. “However, the same strategies and best practices used to safely test updates, revert virtual machine versions, and ensure stable environments can also be used to restore previous versions to a stable state. This process is similar to recovering from ransomware once the issue is identified.”

Bertrand further elaborated on the recovery process, emphasizing the role of backup and recovery vendors. “Most backup and recovery vendors, including storage vendors with similar capabilities, can support this recovery work,” he said. “It’s akin to a ‘sandbox’ or ‘testing room’ – a nonproduction environment that sets up VMs for testing purposes, including testing different operating environments and application updates. It’s uncertain as of now whether most end-users will opt to remove the problematic system file causing the reboot loop or revert their VMs and affected systems to a previously known good state on a large scale.”
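The “revert to a previously known good state” pattern Bertrand describes can be sketched in the abstract. In the hypothetical Python below, `Snapshot` and `revert_vm` are stand-ins for whatever snapshot catalog and restore call a given backup or virtualization platform actually exposes; the logic simply picks the newest snapshot taken before the faulty update landed.

```python
# Hypothetical sketch of the "revert to last known good state" recovery pattern.
# Snapshot and revert_vm are stand-ins for a real backup/virtualization API.
from dataclasses import dataclass
from datetime import datetime, timezone

# Approximate time the faulty content update began rolling out (UTC); in a real
# recovery you would take this from the vendor's advisory or your own telemetry.
BAD_UPDATE_TIME = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)

@dataclass
class Snapshot:
    vm_name: str
    snapshot_id: str
    taken_at: datetime

def last_known_good(snapshots: list[Snapshot]) -> Snapshot | None:
    """Return the newest snapshot captured before the faulty update shipped."""
    candidates = [s for s in snapshots if s.taken_at < BAD_UPDATE_TIME]
    return max(candidates, key=lambda s: s.taken_at, default=None)

def revert_vm(snapshot: Snapshot) -> None:
    # Placeholder: a real implementation would call the platform's restore API.
    print(f"Reverting {snapshot.vm_name} to snapshot {snapshot.snapshot_id} "
          f"taken at {snapshot.taken_at.isoformat()}")

if __name__ == "__main__":
    catalog = [
        Snapshot("win-app-01", "snap-101", datetime(2024, 7, 18, 23, 0, tzinfo=timezone.utc)),
        Snapshot("win-app-01", "snap-102", datetime(2024, 7, 19, 5, 0, tzinfo=timezone.utc)),
    ]
    good = last_known_good(catalog)
    if good is not None:
        revert_vm(good)
```

As Bertrand notes, whether organizations take this snapshot-revert route or simply remove the problematic file at scale remains to be seen.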

Many questions remain unanswered: Why was this update so crucial that it warranted a rapid, global rollout? Is this standard practice? Why weren’t measures taken to limit the impact of this deployment?

Bob Laliberte, another analyst from theCUBE Research, criticized the lack of testing. “How could they have not tested this against the exact software release prior to getting anywhere near the production environment?” he said. “Why wasn’t it rolled out to a portion of the environment before deploying to multiple regions? Did it take down all Azure Windows servers or just specific companies? Why didn’t these companies have environments replicated across multiple availability zones/regions?”
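Laliberte’s staged-rollout point is easy to illustrate in the abstract. The hedged Python sketch below is not CrowdStrike’s actual release pipeline, whose internals are not described in this article; it simply shows the generic canary pattern the analysts are calling for: push an update to progressively larger waves of hosts and halt on the first failed health check.

```python
# Generic staged (canary) rollout pattern: deploy to progressively larger
# waves of hosts and halt on the first failed health check. This illustrates
# the practice the analysts describe, not any vendor's real pipeline.
import random
from typing import Callable, Sequence

def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    waves: Sequence[float] = (0.01, 0.10, 0.50, 1.00),
) -> bool:
    """Deploy in waves sized as fractions of the fleet; abort if any wave fails."""
    done = 0
    for fraction in waves:
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[done:target]
        for host in wave:
            deploy(host)
        if not all(healthy(host) for host in wave):
            print(f"Halting rollout: health check failed in the {fraction:.0%} wave")
            return False
        done = target
        print(f"Wave complete: {done}/{len(hosts)} hosts updated")
    return True

if __name__ == "__main__":
    fleet = [f"host-{i:04d}" for i in range(1000)]
    # Simulated deploy and health check; a real pipeline would gate on crash
    # telemetry, boot success and sensor heartbeats before widening the rollout.
    staged_rollout(fleet, deploy=lambda h: None, healthy=lambda h: random.random() > 0.001)
```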

IT departments worldwide are now grappling with recovery efforts. “This is similar to what it would have been like had they been hit by an attack that encrypted the servers,” said Strechay. “They need to be able to go back to as close to before the new update or patch as possible and walk those bits forward. Lots of organizations will be spending substantial amounts on recovery efforts and enhancing their readiness for future incidents.”

Despite the turmoil, Goldman Sachs has maintained its stance on CrowdStrike stock. “While it’s still early, we expect to see minimal share shifts in endpoint as a result of the incident, although we expect to see noise in competitor go-to-market processes. We maintain our 12-month price target at $400 based on 55x Q5-Q8 FCF,” the firm said in a note cited by CNBC. Shares of CrowdStrike were down about 11% in midday trading.

As the world grapples with the aftermath of this major IT failure, the event serves as a stark reminder of the vulnerabilities inherent in our reliance on technology.

Image: SiliconANGLE/Ideogram
