Cloudflare outage: Why today’s internet is so fragile
The Straits Times

SAN FRANCISCO – For much of the world, there is no longer any such thing as being offline. The internet underpins the global financial and consumer ecosystem, enabling instant communication and transactions.
While the system is integral to so much human activity, it is also fragile, costing billions of dollars and creating huge inconvenience when part of it stops working.
Widespread outages have happened regularly in 2025, with technical glitches at major web infrastructure providers bringing down services for millions of users.
One 15-hour outage at Amazon.com data centres in October locked British kids out of gaming platform Roblox, stopped workers from making Zoom calls and forced on-call engineers in India to cancel plans for the Diwali religious holiday.
In mid-November, a malfunction at web security firm Cloudflare took down a swathe of sites, including ChatGPT, the New Jersey transit authority and social-media platform X.
It may seem strange that a problem at one provider can trigger such a cascade of damage. The reason lies in how the internet has evolved since its inception, and in the cost and efficiency shortcuts made by companies whose services are relied upon by millions of consumers.
Since the internet is not just packets of data but also a lot of physical infrastructure, these incidents can stem from a range of causes such as a software bug, an overheating data centre or a frayed cable. It is surprising it does not crash more often.
How do users access the internet?
When a user in Britain types google.com into their phone or computer, this kicks off a complicated but lightning-fast set of processes.
All devices – phones, PCs, servers – connected to the internet are assigned identifiers called IP addresses and use the Domain Name System, or DNS, to locate and talk to each other.
Meanwhile, sites and apps such as Google are made up of packets of data, comprising text, images and functionality.
To load up Google, a user’s device sends a request for those packets of data through WiFi, mobile data or a wired connection. The request travels along physical infrastructure such as routers, cables, switches, regional data centres and perhaps via undersea cables until it reaches the correct Google server.
That server, which sits inside a data centre alongside hundreds of thousands of other Google servers, examines the request and funnels the relevant packets of data back to the user through the same global infrastructure networks.
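The first step in that chain, the DNS lookup, can be sketched in a few lines of Python using the standard library's `socket` module. This is an illustrative sketch of the lookup a device performs before any page data moves, not how any particular browser implements it:

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Ask the Domain Name System which IP addresses sit behind a hostname."""
    infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
    # Each entry ends with a (address, port, ...) tuple; keep the unique addresses.
    return sorted({info[4][0] for info in infos})

# Typing google.com into a browser triggers a lookup like:
#   resolve("google.com")
# which returns one or more IP addresses the request can then be routed to.
```

Only once the name has been resolved to an address can the request for packets of data begin its journey across routers, switches and cables.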
Outages can happen when anything along that interconnected chain goes wrong. And a large part of why outages happen at scale now is due to the rise of cloud computing.
Why do outages feel so disruptive now?
A major change in the way the internet works lies in where data and infrastructure are kept.
During the 1990s and early 2000s, any company that had its own website probably had its own servers located at its offices or headquarters – called on-premises.
Alternatively, it rented servers from another company but still managed the hardware and software.
At the level of the individual user, anything involving a computer also involved storing information locally: music, photos and files would all be kept on hard drives.
Any outage might come down to a single corrupted file and, while losing thousands of digital photos would be irritating, it would not affect other users.
Cloud computing went mainstream after Amazon, mainly known for being an online retail giant, realised its engineers spent an inordinate amount of time solving the same problems around computing infrastructure and data storage. It built shared infrastructure to alleviate that burden, then realised the concept could be applied to much of the internet.
The idea took off, and now most internet users and businesses rely on cloud computing in some way.
After Amazon Web Services launched, Microsoft and Alphabet’s Google followed with rival services, and the three tech companies came to dominate cloud computing globally.
In practice, that means operating millions of servers in data centres. These are generally organised into “regions” – separate clusters of server farms that serve a particular country or area.
Certain regions might handle more traffic, meaning there is a disproportionate impact if one goes down. Some companies might have regional dependencies that they are not aware of, leading to services failing because of an outage outside of their region.
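The hidden-dependency problem described above can be made concrete with a toy model. The service names and region identifiers below are hypothetical, chosen only to show how one region's failure can reach services that do not obviously run there:

```python
# Hypothetical map of which cloud regions each service depends on.
DEPENDENCIES = {
    "checkout": ["us-east-1"],
    "search":   ["eu-west-2"],
    "login":    ["us-east-1", "eu-west-2"],  # hidden cross-region dependency
}

def affected_by(region: str) -> list[str]:
    """Services that break, at least partially, when one region goes down."""
    return sorted(s for s, regions in DEPENDENCIES.items() if region in regions)

# An outage in us-east-1 takes out both checkout and login:
# affected_by("us-east-1") -> ['checkout', 'login']
```

A company that believes "login" lives entirely in Europe would still see it fail during a US outage, which is exactly the kind of surprise the big cloud incidents have exposed.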
The AWS outage in October came down to a bug affecting one of its key services, causing cascading failures and taking out several major sites and services.
Why are cloud services such as AWS, Microsoft Azure and Google Cloud so dominant?
In Britain, for example, AWS and Microsoft’s Azure cloud service make up more than 70 per cent of the cloud computing market. This is the result of early-mover advantage, existing incumbency in enterprise tech for Microsoft, and sheer financial firepower.
But reliance on a handful of dominant cloud providers has some knock-on effects. An outage can now take out swathes of the internet.
The hyperscalers, as they came to be known, have also been criticised for business practices that make it harder for new entrants in the market, and for businesses to transfer their cloud computing contracts to other providers.
Because the technology infrastructure needed for each cloud service is different, it can be very costly for businesses to change providers. Cloud engineers will also often follow certification processes for one specific provider, adding to the cost and difficulty of diversifying.
What can go wrong with big cloud services?
One reason the major cloud providers are so popular is that they are, in general, reliable. Cloud computing is a fast-growing aspect of their businesses, and it is in their interests to keep services running smoothly.
Still, that cannot account for the increasingly interconnected nature of technology services, especially when a handful of companies dominate different layers of the infrastructure.
CrowdStrike, for example, is not a major technology company, but it is dominant in cybersecurity. Its popularity means it runs on millions of Microsoft Windows systems, often those running critical operations in big organisations.
When it pushed a faulty software update through the cloud in July 2024, it instantly caused a “Blue Screen of Death” on millions of computers. While the fault was not in the cloud itself, the simultaneous nature of cloud-based updates took out millions of machines at once.
What can companies do to protect themselves against the risk of outages?
The most important thing companies can do is plan for an outage before it actually happens. That might mean paying for a back-up service in case their primary region goes down, or adapting their infrastructure so the most critical services have “in-house” back-up servers.
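In code, that kind of planning often amounts to a failover loop: try the primary endpoint, and fall back to a backup when it is unreachable. The URLs below are placeholders, and real systems layer on health checks and circuit breakers, but the core shape is a simple sketch like this:

```python
import urllib.request
from urllib.error import URLError

# Hypothetical endpoints: a primary cloud region, then an in-house backup.
ENDPOINTS = [
    "https://primary.example-cloud.com/api/health",
    "https://backup.on-prem.example.com/api/health",
]

def fetch_with_failover(endpoints: list[str], timeout: float = 2.0) -> bytes:
    """Try each endpoint in order; return the first successful response body."""
    last_error: Exception | None = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except URLError as err:
            last_error = err  # this endpoint is down; fall through to the next
    raise RuntimeError(f"all endpoints failed: {last_error}")
```

Spending on a second region or an on-premises backup only pays off if the application actually knows how to switch over, which is why rehearsing the failover path matters as much as buying it.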
As for the rest of us, there is not a huge amount you can do during a big system failure apart from wait for the cloud provider to fix it.
Take a step back from the screen, touch some grass, and marvel at the complex infrastructure behind the computer that (most of the time) keeps it all running smoothly. BLOOMBERG