The Triggering Event
The recent CrowdStrike outage left little doubt that our mission-critical cyber infrastructure is still a work in progress. It affected an estimated 8.5 million Windows devices worldwide, canceling more than 5,000 flights, darkening video screens in Times Square, and interrupting major supply chains.
Although that’s less than 1% of all Windows installations, the outage underscores the fragility at the very foundation of the IT tech stack. It’s also a reminder that malicious actors aren’t the only threat capable of cutting a swath of disruption across multiple industries and sectors. In this case, the culprit was a system designed to prevent just that by thwarting malicious attacks.
As the event unfolded, unsolicited advice for CrowdStrike streamed in from all corners of the internet: adopt an N-1 release policy, stagger rollouts, use canary deployments. But unless you work for CrowdStrike, the most important lessons aren’t in their code or practices at all. The big takeaways for IT leadership are: 1) understand that this could happen again, and 2) reexamine the value of strategic support partners that can multiply your IT manpower on demand and help your team meet unforeseen challenges head-on.
Let’s examine the outage details and then consider the strategic partnerships that ensure you don’t have to go it alone next time.
What Actually Happened
The press labeled it an IT catastrophe and a meltdown, but Microsoft said that less than 1% of Windows instances were affected worldwide. At the very least, the media may want to reserve some hyperbole for next time.
We know the root cause: a routine, silent update pushed by CrowdStrike included a channel configuration file containing “bad” data, which was not validated before it was loaded. Affected systems began displaying a STOP error in a continuous loop, aka the Blue Screen of Death (BSOD), the once all-too-common signature of a Windows failure. The BSOD is now so rare that seeing this relic from the early days verges on the nostalgic.
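To make “validate before loading” concrete, here is a minimal sketch in Python. It is not CrowdStrike’s code or file format: the comma-separated record layout, the field count, and the names validate_content and load_content are all assumptions made for illustration. The idea is only that a new content file is checked in full before any of it replaces what is already running, and that a failed check leaves the current rules in place.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical schema width, chosen only for this illustration.
EXPECTED_FIELDS = 4


@dataclass
class ValidationResult:
    ok: bool
    errors: list[str]


def validate_content(lines: list[str], expected_fields: int = EXPECTED_FIELDS) -> ValidationResult:
    """Check every record before any of it is handed to the loader."""
    errors: list[str] = []
    if not lines:
        errors.append("file is empty")
    for number, line in enumerate(lines, start=1):
        fields = line.split(",")
        if len(fields) != expected_fields:
            errors.append(
                f"line {number}: expected {expected_fields} fields, got {len(fields)}"
            )
        elif any(not field.strip() for field in fields):
            errors.append(f"line {number}: empty field")
    return ValidationResult(ok=not errors, errors=errors)


def load_content(lines: list[str], current: list[list[str]]):
    """Return the new rules only if they validate; otherwise keep the current ones."""
    result = validate_content(lines)
    if not result.ok:
        # Reject the whole update and keep running on the last known-good rules.
        return current, result.errors
    return [line.split(",") for line in lines], []
```

In this sketch a single malformed record rejects the entire update, so the system keeps running on its last known-good rules instead of failing while parsing the new file.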
Yes, CrowdStrike took responsibility, and yes, they could have done better, and they undoubtedly will. But after the incident, an ETR survey showed 55% of IT decision makers were considering reducing reliance on CrowdStrike. Doing so may not be as easy as it sounds and may not solve anything. Their product is best-of-breed, and the incident was only partly about them anyway. CrowdStrike will bulletproof the update process so that precise scenario won’t happen again soon. The important point is that CrowdStrike really just exposed deeper, systemic fragility in cyberinfrastructure worldwide.
Crisis Response Strategies
The applications and services embedded in tech stacks across the IT landscape, in the cloud and on-premises, aren’t getting any simpler. Change is constant, and governing the mashup of multi-vendor systems, open-source apps, containers, microservices, and multiple operating systems remains a hard problem for IT. It gets harder as complexity grows and more black-box machine-learning models are deployed.
Of course, no organization can afford to have full-time staff for worst-case scenarios, but if it could scale its resources instantly up or down to match demand, it would.
IT leaders should look hard at the value of engaging support partners. There’s no substitute for senior, hands-on IT expertise in a crisis. Multi-vendor support partners can tailor a service to fit your unique needs, including contingency plans that scale up support quickly in response to unanticipated events, break/fix support, and cybersecurity management. Having an instantly scalable support partner that knows your network and your tech stack inside and out increasingly makes good business sense. With a multi-vendor support service (“MVSS”), an organization is supported by the same multi-disciplined engineering team and an integrated management platform, regardless of its requirements. This approach ensures operational efficiency and effectiveness across multiple enterprise vendors, which is vital to getting back online after an incident like this one with CrowdStrike.
Next Steps – Finding a Support Partner
You may be well-positioned to elevate your vendor support game if you’ve already deployed the best practice regimen of multi-layered cybersecurity, internal education, redundancy and failover, DevOps, and disaster recovery plans.
Using multi-vendor support partners isn’t a new concept, but IT is at a tipping point. Scale and complexity are stressing staffing to the limit, and the business case for engaging an MVSS organization is growing.
Consider Xcelocloud MVSS365, a comprehensive multi-vendor support solution that is delivered through several top-tier technology solution providers.
Make no mistake, this incident and its repercussions are still unfolding. One study reported that 60% of CrowdStrike users are reconsidering their reliance on the product. The only problem is that CrowdStrike is in best-of-breed territory, and any migration is fraught with its own risks.
Parametrix Insurance estimates damage to Fortune 500 companies at $5.4B this time. Few things in IT can be assumed, but it’s a safe assumption that an IT outage as widespread as the CrowdStrike incident will happen again.