Windows users worldwide complained about their PCs crashing and showing a Blue Screen of Death. (AI Generated)
On July 18 and 19, massive outages in Microsoft’s Windows operating system (OS) were observed across the world, causing severe disruptions in critical sectors including aviation, railways, media broadcasts, banking, stock exchanges and hospitals; affected systems displayed the Blue Screen of Death (BSOD). The cause identified so far is a faulty update to the Falcon sensor software from the cyber security firm CrowdStrike, deployed at endpoints running Windows OS, which brought those systems down.
Microsoft and CrowdStrike have a strategic partnership on cyber security, and as part of it, Falcon integrates with various Microsoft products, including Azure, Microsoft 365 and Windows OS. CrowdStrike confirmed that Linux and Apple operating systems were not affected by the outage.
These outages are by far the largest the world has seen, and in an increasingly interconnected digital world they have set alarm bells ringing on multiple fronts. Most large-scale outages until now have been confined to individual sectors and attributed to external attackers using hacking and distributed denial-of-service tools. This episode was not due to hacking but to an internal system failure, in which two tech entities, giants in their fields, failed to prevent the disaster and the resulting disruption.
While restoration of services has begun, these outages have clearly flagged multiple issues. First, what protocols were in place, under the service level agreements (SLAs) that Microsoft and CrowdStrike signed, for smooth patch management and upgrades? Second, can such functions be left entirely to self-regulation by tech entities, with no oversight, when critical networks are involved? Third, can nations still avoid cooperating on global cyber security and stability issues, citing national priorities or technical incompatibility? The answer is a clear “no”. A comprehensive effort has to be made to examine current protocols, as risks and vulnerabilities increase.
Software companies have a fundamental responsibility to ensure the reliability and stability of their systems, and they should be held accountable for outages for several reasons. Outages erode customer trust and damage reputation; taking responsibility demonstrates a commitment to quality and reliability. System failures can cause significant financial losses for clients and end-users, and companies should bear some of this cost to incentivise better practices. Accountability also encourages investment in robust architecture, testing and disaster recovery plans. Companies should further maintain appropriate insurance coverage and establish compensation policies for affected customers.
While SLAs between companies and their clients or partners play a role in maintaining security standards, they are insufficient for several reasons. Companies may prioritise cost-cutting or rapid development over comprehensive security measures, and many such shortcomings go unnoticed until a breach occurs and an investigation follows. Without external oversight, vulnerabilities may go unreported or unaddressed. SLAs also vary widely between companies, leading to gaps in overall network security. Microsoft and CrowdStrike may have a good model of partnership on cyber security, but it might still fall short of the standards that oversight bodies would demand. Moreover, SLAs typically focus on performance metrics rather than comprehensive security practices.
Thus, regulatory oversight is crucial for approaching risk management comprehensively. Regulators can establish and enforce minimum security requirements across critical sectors and put audit protocols in place. Mandatory disclosure of breaches and vulnerabilities will improve overall industry security. Companies often shy away from reporting breaches and vulnerabilities to national-level CERTs; in such cases, strict penalties should be imposed. Regulatory penalties for non-compliance can push companies to prioritise security, and government oversight can drive investment in cutting-edge security technologies and practices.
Besides the regulatory approach, which is largely a national function, global cooperation also has to be fostered. So far, engagements have failed to create a global consensus on the matter; geopolitics around the larger issues of cyberspace, rather than the need for secure and stable networks, remains predominant. Securing critical infrastructure and ensuring its stability and availability is well defined in the 11 norms of responsible state behaviour in cyberspace agreed under a UN expert body, and ongoing dialogues at multiple levels also treat this as a priority area. But actions on the ground belie that seriousness, and tech entities too shy away from providing stronger support for better software products and patch management, besides cyber security measures and best practices.
Hopefully, these outages will spark greater engagement within the tech community on how to secure critical networks, from writing code to vulnerability management and disaster mitigation strategies. However, the need for regulation and oversight cannot be overlooked in the name of SLAs.
The writer, a defence and cyber security analyst, is former country head of General Dynamics