CrowdStrike Incident Shows the Importance of Availability in the Security Triad

On Friday, July 19, 2024, CrowdStrike, a well-known security company, caused an incident that significantly impacted the availability of critical computer systems worldwide, with Microsoft estimating that it affected 8.5 million Windows devices. Those affected include major airlines, financial institutions, healthcare systems, media broadcasters, telecommunication companies, and others. In addition, Reuters reports “over half of Fortune 500 companies and many government bodies such as the top US cybersecurity agency itself, the Cybersecurity and Infrastructure Security Agency use the company’s software.” With such a large footprint, this incident underscores the immense responsibility companies like CrowdStrike have to ensure the safety and security of their products for their customers.

The CrowdStrike Incident: A Technical Overview

CrowdStrike offers a range of cybersecurity products, primarily through its Falcon platform, with many of its services revolving around its Falcon Sensor software agent, which is designed to protect systems from malicious hackers. To stay ahead of threats and protect its customer base, CrowdStrike routinely pushes out minor updates to its Falcon Sensor that include new threat intelligence, signatures, and detection rules. Unfortunately, on Friday, July 19, at 04:09 UTC, one of those minor updates became a big problem.

The minor update contained a malformed Channel file named C-00000291*.sys. It’s important to note that despite having a .sys extension, the Channel file is not a traditional driver but instead a configuration file. This particular file controls how the Falcon sensor evaluates named pipe execution on Windows systems. Named pipes provide a method to facilitate communications between different processes within the operating system.

When the Falcon sensor read the malformed Channel file, it attempted to read from memory address 0x9c (156). An address this low is typically invalid, meaning that no physical or virtual memory is mapped to it. When code attempts to read from or write to an invalid address, the processor's Memory Management Unit (MMU) detects the attempt and raises a fault that the operating system must handle. In an ordinary user-mode process, this results in an access violation and termination of the affected process. The Falcon sensor, however, runs with kernel-level privileges within the system process, which manages critical system resources and operations, so the fault could not be contained to a single process and instead resulted in a system crash.
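The failure mode described above, code reading data that is not where it expects it to be, is usually prevented by validating input before using it. The following Python sketch illustrates the idea under stated assumptions: the pipe-delimited rule format, the field count, and every name in it (`parse_rules`, `load_or_keep_previous`, `EXPECTED_FIELDS`) are hypothetical and do not represent CrowdStrike's actual file format or code. Python raises an exception rather than reading invalid memory, so this only models the principle of rejecting a malformed file as a whole and keeping the last known-good configuration.

```python
from dataclasses import dataclass

# Hypothetical: the number of fields every rule is expected to carry.
EXPECTED_FIELDS = 4


@dataclass
class Rule:
    fields: list[str]


class MalformedContentError(ValueError):
    """Raised when a content file fails validation."""


def parse_rules(raw: str) -> list[Rule]:
    """Parse one rule per non-empty line; reject the whole file on any bad line."""
    rules = []
    for line_no, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("|")
        # Check the shape of the data before any code indexes into it.
        if len(fields) != EXPECTED_FIELDS:
            raise MalformedContentError(
                f"line {line_no}: expected {EXPECTED_FIELDS} fields, got {len(fields)}"
            )
        rules.append(Rule(fields))
    return rules


def load_or_keep_previous(raw: str, previous: list[Rule]) -> list[Rule]:
    """Return the newly parsed rules, or the last known-good set if parsing fails."""
    try:
        return parse_rules(raw)
    except MalformedContentError:
        return previous
```

The design choice worth noting is that a bad update degrades to "no new rules" instead of taking down the host that loads it.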

Recovering from the CrowdStrike Incident

Once the system crashed, users were presented with the dreaded Windows Blue Screen of Death and boot loops. Recovering from this issue requires removing the malformed Channel file from the system. Doing so requires booting the system into Safe Mode or the Windows Recovery Environment and deleting the file located in the C:\Windows\System32\drivers\CrowdStrike directory. Of course, manual intervention often involves physical access to the affected devices, which slows remediation. Where BitLocker is enabled, a 48-digit recovery key is also required, which further extends recovery time. In many cases, that BitLocker key was stored on a system that was itself affected by the malformed file, leaving many IT professionals without a way to access the keys. Unfortunately, without the key, recovery is likely impossible for those systems.
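The removal step above can be sketched in a few lines of Python. This is an illustration of the logic only: Python is unlikely to be available in Safe Mode or a recovery environment, where the deletion would normally be done from the command prompt. The function name `remove_channel_files` and its `dry_run` parameter are hypothetical; the directory and the file pattern are the ones given in this article.

```python
from pathlib import Path

# Pattern for the malformed Channel file named in this article.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_channel_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Find Channel files matching the bad pattern and optionally delete them.

    With dry_run=True (the default) nothing is deleted; the matches are
    only returned so they can be reviewed first.
    """
    matches = sorted(driver_dir.glob(CHANNEL_FILE_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

A call such as `remove_channel_files(Path(r"C:\Windows\System32\drivers\CrowdStrike"))` would list the matching files without touching them; passing `dry_run=False` performs the deletion.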

Hindsight is 20/20

What most people don’t seem to know is that months before the Windows incident, CrowdStrike broke Linux in a similar manner. In April 2024, CrowdStrike pushed an update that caused all Debian Linux servers in a civic lab to crash and left them unable to reboot. In this instance, the update was not compatible with the then-current version of Debian, even though that version was allegedly supported.

A root cause analysis found that the Debian Linux configurations had not been included in CrowdStrike’s test matrix. Users of the Rocky Linux distribution also experienced disruptions after upgrading to Rocky Linux 9.4, with servers crashing due to a kernel bug introduced by a CrowdStrike update.

On the topic of bad updates, in another twist, CrowdStrike’s founder and CEO, George Kurtz, was McAfee’s Chief Technology Officer during a 2010 incident. That incident also resulted in a global outage when a defective update was pushed to the McAfee Antivirus product for Windows XP.

The fact that CrowdStrike has had two similar incidents (that we know of) related to updates is concerning. All of these cases, including the McAfee incident, seem to indicate that little to no quality assurance testing was done before the updates were pushed. This trend highlights the importance of following QA best practices and testing updates on a predefined set of systems before rolling them out across all production systems, to ensure system stability and uptime.
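Testing on a predefined set of systems before a full rollout is often implemented as a staged (ring-based) deployment. The sketch below is a minimal illustration under stated assumptions, not a description of any vendor's release pipeline: the function `staged_rollout`, the `deploy` callback, and the `max_failure_rate` threshold are all hypothetical names introduced here.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> tuple[list[str], bool]:
    """Deploy ring by ring and halt before the next ring if too many hosts fail.

    `deploy` is a hypothetical callback that returns True when a host is
    healthy after the update. Returns (hosts_updated, completed).
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            if deploy(host):
                updated.append(host)
            else:
                failures += 1
        # Stop here so later (larger) rings never receive a bad update.
        if ring and failures / len(ring) > max_failure_rate:
            return updated, False
    return updated, True
```

With a small first ring of test machines, a defective update fails there and the rollout returns `completed=False` before any production host is touched.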

The National Security Risk, Essential Technology Companies, and the Threat Actor’s Perspective

From a threat actor's perspective, an incident like this is an opportunity to embed and distribute malicious code. As IT teams rushed to fix the corporate computers affected by the CrowdStrike incident, various threat actors created phishing sites containing malware to entice IT teams to download “official fixes” or to log in to fake support portals. John Hammond, a well-known cybersecurity researcher, posted a thread on X listing over 10 domains created to take advantage of the CrowdStrike catastrophe.

While CrowdStrike is currently in the spotlight, it’s important to recognize that many companies, like CrowdStrike, are considered essential technology providers with widespread deployments around the world. These deployments are on computer systems and infrastructures that form the backbone of our connected world in private companies, public companies, and even government agencies. A trivial mistake or failure on their part can have a catastrophic global impact. When incidents are caused by deliberate actions, the consequences can be even more significant.

This might sound too far-fetched for some, but the SolarWinds data breach was the direct result of exactly this type of activity.

The SolarWinds attack, attributed to Russia's Foreign Intelligence Service (SVR), involved threat actors injecting malware into the Orion software updates. Those updates impacted roughly 18,000 customers when they were deployed, including sensitive government agencies like the Department of Homeland Security, the Department of Energy, and the Department of Commerce. The attackers were not only able to capture an abundance of data but also covertly monitored their victims' communications and operations, possibly compromising national security.

Other vulnerable essential companies create similar risks without being used as malware distribution hubs. The Progress Software (MOVEit) data breach from 2023, attributed to the Cl0p ransomware gang, is a prime example. This breach affected 2,600 organizations and roughly 77 million individuals globally. It impacted government agencies, financial institutions, healthcare providers, educational institutions, airlines and travel companies, media, retail, and professional service companies. Like SolarWinds, the nature and scope of the data compromised also posed a significant national security risk.

These incidents highlight the critical importance of robust security practices for essential technology companies. They must not only secure their own infrastructures but also ensure their products do not compromise the safety and security of their customers. As demonstrated by the SolarWinds and Progress Software breaches, the consequences of neglecting these responsibilities can be severe, impacting national security and affecting millions globally. While the recent CrowdStrike incident was not caused by a threat actor, it serves as a reminder that vigilance and proactive measures are essential to protect the backbone of our connected world. Ensuring availability, alongside confidentiality and integrity, is crucial in maintaining a secure and resilient digital environment.