On Friday, July 19, 2024, IT professionals worldwide were called to help their organizations recover when millions of Windows machines went down. Third-party multi-vendor support engineers also leaped into action to help their customers cope and recover. One of those engineers was Casey, a senior IT engineer here at Xcelocloud, where we deliver support for over 20 market-leading technologies, Microsoft being a key one. On July 19th, Casey found himself in the eye of the CrowdStrike/Microsoft storm. This is his story “from the trenches,” and a look at how a multi-vendor support approach got customers back up and running.
Question: Casey, what’s your role and daily life like at Xcelocloud?
Casey: I handle a wide range of IT tasks. One day, I could develop software in .NET or Rust; the next, I might troubleshoot a customer’s problem. Then, I could work on an internal request to design a solution for a product we plan to offer. On another day, I might lead a team to recover from a ransomware attack. I’m like a Swiss Army knife in that way, using my technical skills to tackle whatever challenges come our way.
Question: When a global outage like the one with Microsoft occurs, your customers probably wish they had more IT staff to conduct a root-cause analysis.
Casey: Absolutely. The entire IT world struggles to have enough talent in particular areas. Most people specialize in a product, like being an Azure or Splunk expert. That leaves companies with expensive people who each have one strong skill set. At Xcelocloud, we prefer generalists with a knack for multiple things so we can provide comprehensive support. Rarely does an IT problem remain in the technology for which the ticket was opened. Usually, when dealing with OEMs like Microsoft, if they decide that the problem is not their responsibility, they will direct you to another vendor, and their support ends there. This can leave you with a hefty bill and no resolution. You may still not even know the root cause, which could originate anywhere upstream from where the issue appears.
The Incident
Question: How did the CrowdStrike issue unfold on the day of the event, and did you know it was happening before you started hearing from customers?
Casey: I woke up on East Coast time to a lot of phone calls and text messages. Our global helpdesk was already digging into the issue. There was a lot of confusion initially, but it was clearly a global problem, and the common factor was CrowdStrike. Articles were being shared online with manual steps on how to solve the blue screens Microsoft users were seeing. We know our customers well, so we knew which ones would likely be impacted. This helped us prioritize who needed help. At that point, I took charge of the “catastrophe management.”
Question: How long did it take to resolve with each customer? Did you write scripts for multiple customers?
Casey: When attempting to remove the faulty CrowdStrike files quickly, users ran into obstacles such as needing the local admin password for each system. This was particularly problematic for those using LAPS, as each server and workstation had a unique 32+ character password, making manual entry tedious. By Friday afternoon, I had developed scripts for Hyper-V and VMware environments that fully automated the removal process, eliminating the need for manual intervention on individual VMs.
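To give a concrete picture of the removal step Casey mentions, here is a minimal Python sketch. It is not his script, which this interview does not show. The file pattern and folder are taken from CrowdStrike's public remediation guidance rather than from this conversation, and the function name, the `dry_run` flag, and the idea of running it against a mounted Windows volume are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: pattern and folder come from CrowdStrike's public remediation
# guidance for the July 19, 2024 incident, not from this interview.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"
CROWDSTRIKE_DIR = Path("Windows") / "System32" / "drivers" / "CrowdStrike"


def remove_faulty_channel_files(windows_volume_root: Path, dry_run: bool = True) -> list[Path]:
    """Find (and optionally delete) the faulty channel files on a mounted Windows volume.

    Hypothetical helper: returns the matching paths so the caller can log them.
    Nothing is deleted unless dry_run is False.
    """
    target_dir = windows_volume_root / CROWDSTRIKE_DIR
    if not target_dir.is_dir():
        # CrowdStrike is not installed on this volume, or the volume is not mounted.
        return []

    # Note: glob matching is case-sensitive on most non-Windows file systems.
    matches = sorted(target_dir.glob(CHANNEL_FILE_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Run from a rescue environment, a call such as `remove_faulty_channel_files(Path("/mnt/windows"), dry_run=False)` would need no local admin password, which is one plausible reason a boot-image approach sidesteps the LAPS problem described above. The mount point is a placeholder.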
Several customers, each with tens of thousands of servers to recover, volunteered to run these scripts early on, helping to ensure their reliability. The scripts automated the discovery of VMs that were running but not in a healthy state, attached and booted a bootable image to clean up the CrowdStrike files, and then rebooted the VMs back to normal.
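The workflow described above (find VMs that are running but unhealthy, attach and boot a rescue image, then reboot normally) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Hyper-V or VMware scripts: the `VM` fields, the `RecordingHypervisor` class, and all of its method names are hypothetical stand-ins for whatever the real hypervisor tooling provides.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    powered_on: bool
    guest_healthy: bool  # assumption: e.g. a guest heartbeat or tools status


@dataclass
class RecordingHypervisor:
    """Hypothetical stand-in for a hypervisor client; it only records each step."""
    log: list = field(default_factory=list)

    def attach_rescue_image(self, vm: VM) -> None:
        self.log.append(("attach_rescue_image", vm.name))

    def boot_from_rescue_image(self, vm: VM) -> None:
        # In the process described, the rescue image performs the file cleanup.
        self.log.append(("boot_from_rescue_image", vm.name))

    def detach_rescue_image(self, vm: VM) -> None:
        self.log.append(("detach_rescue_image", vm.name))

    def reboot_normally(self, vm: VM) -> None:
        self.log.append(("reboot_normally", vm.name))


def find_stuck_vms(vms: list) -> list:
    """Select VMs that are powered on but not reporting a healthy guest."""
    return [vm for vm in vms if vm.powered_on and not vm.guest_healthy]


def recover(vm: VM, hypervisor: RecordingHypervisor) -> None:
    """Run the recovery steps for one VM, in order."""
    hypervisor.attach_rescue_image(vm)
    hypervisor.boot_from_rescue_image(vm)
    hypervisor.detach_rescue_image(vm)
    hypervisor.reboot_normally(vm)


if __name__ == "__main__":
    inventory = [
        VM("web-01", powered_on=True, guest_healthy=False),
        VM("db-01", powered_on=True, guest_healthy=True),
        VM("old-01", powered_on=False, guest_healthy=False),
    ]
    hv = RecordingHypervisor()
    for vm in find_stuck_vms(inventory):
        recover(vm, hv)
    print(hv.log)
```

Filtering on "powered on but unhealthy" keeps the automation away from VMs that are working or intentionally shut down, which matters when the same script runs across tens of thousands of machines.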
One customer had 24,000 desktops across 300 locations that needed recovery by Monday, which was daunting! I adapted the process to run from PXE or a bootable USB drive, automating the recovery for desktops. This process required no technical skill from the user unless a BitLocker key was needed, allowing anyone to assist. Without this automation, skilled technicians would have been needed for every workstation. Ultimately, these automated processes saved thousands of hours of tedious labor.
One extremely fortunate thing about the timing of the incident was that it happened in the summertime. This meant the impact on our education customers’ learning environment was minimal. It would have had a significant impact if they had been in session.
Reflecting on What Happened
Question: What else should we know about the CrowdStrike incident?
Casey: The root kernel driver passed qualification by the WHQL lab at Microsoft. However, the issue arose from dynamically loaded code that wasn’t properly validated. Hopefully, this will serve as a wake-up call to exercise more caution. There are other similar products with kernel access and the potential to cause problems. Kernel driver issues are typically related to security software components. Only a few different drivers operate at the kernel level across many customers. A similar situation occurred with CrowdStrike on Linux last May or April, affecting several Linux flavors. This was not a Microsoft problem but rather the responsibility of CrowdStrike. Microsoft has been working to address some of the architectural issues related to kernel access but is encountering regulatory hurdles.
The Multi-Vendor Support Service Mission
Question: When resolving issues like this, how do you balance reliance on tools with shoot-from-the-hip scripting and brute-force troubleshooting?
Casey: In disaster scenarios, there’s always some shooting from the hip. Each customer’s environment is unique. We use a multitude of tools, but if none are available, we’re not afraid to get hands-on. The complex problems are the ones I enjoy. Catastrophes are exciting times for me because they’re challenging problems to solve.
Question: What makes Xcelocloud so adept at tackling these kinds of issues?
Casey: Our engineers are multi-disciplined and cover all bases. If it’s a Cisco problem, or a VMware problem, or a Microsoft problem…we help from start to finish; we don’t just pass it off. This minimizes the need for customers to explain the issue multiple times.
Question: Do any recent incidents stand out where customers were surprised by quick resolution?
Casey: Yes, it happens a lot. We’ve had situations where customers had problems that persisted for a year, and once they brought us in, we solved them in less than two hours thanks to our multi-vendor approach and unique skill sets.
Question: Any final thoughts?
Casey: Remember, I work on lawnmowers! I often say that in public to avoid a thousand questions about IT.
Multi-Vendor Support Services for Crisis Management
Managing multiple technology vendors can be challenging, especially during times of crisis. As demonstrated here, multi-vendor support services can provide expert help and fast issue resolution. If your organization didn’t respond to the CrowdStrike incident as quickly as you would have liked, or if you want to be better prepared for the next emergency, ask your preferred reseller if they offer multi-vendor support services to improve your business’s efficiency and peace of mind. Want to know which multi-vendor support services are powered by Xcelocloud? Contact us, and we will refer you to a partner that best fits your needs.