add new post

Bruno Carlin 2025-06-12 01:45:15 +02:00
parent 4aab9f1242
commit ae6163c100
GPG key ID: 8E254EA0FFEB9B6D
3 changed files with 144 additions and 2 deletions

.gitignore vendored

```diff
@@ -1,2 +1,3 @@
 /public
 /resources
+/.compress_state
```


```diff
@@ -1,4 +1,27 @@
 #!/usr/bin/env bash
-rm -rf public
+set -euo pipefail
 hugo build --minify
+incremental-compress \
+    -dir public \
+    -statedir .compress_state \
+    -types html,css,js,json,xml,ico,svg,md,otf,woff,ttf,woff2,webmanifest \
+    -zstd=false \
+    -verbose
+rsync \
+    --perms \
+    --times \
+    --update \
+    --partial \
+    --progress \
+    --recursive \
+    --checksum \
+    --compress \
+    --links \
+    --delete-after \
+    --owner \
+    --usermap "*:bcarlin_net" \
+    --group \
+    --groupmap "*:bcarlin_net" \
+    public/ root@192.168.1.25:/home/bcarlin_net/www
```


@@ -0,0 +1,118 @@
---
title: 'Prepare for the Next Internet Outage'
slug: 'prepare-for-the-next-internet-outage'
date: '2025-06-14T04:05:48+02:00'
tags: [architecture, cloud]
summary: >
  A reflection on recent internet outages and my takeaways for building more
  resilient web services.
---
Last Thursday, [the Internet broke](https://mashable.com/article/google-down-cloudflare-twitch-character-ai-internet-outage).
Again. Yes, the media turned a two-hour outage into a clickbait-friendly global
crisis.

What made this incident significant was not just the disruption of Google Cloud
but the hundreds of websites and applications that went down at the same time.
They included some major ones like Cloudflare, which uses GCP for some of its
services. Since Cloudflare is a widespread CDN, cache, and proxy, its failure
created a domino effect that broke, in turn, countless websites.
It reminds us of the fragile interconnectedness of our digital world. I don't
want to point fingers, but rather learn lessons from this incident. This wasn't
just a random hiccup; it highlighted fundamental principles that, in the age of
"everything as a service," we might have inadvertently overlooked.

Here are my key takeaways.
## Do Not Put All Your Eggs in the Same Vendor Basket
The cloud means infinite scaling, infinite storage, infinite compute power,
infinite flexibility. It is built on the promise of reducing costs (which can be
true when used correctly). However, this hides an overlooked truth and its
biggest risk: single-vendor dependency. The recent outage showed how an outage
at a single vendor, or even of a single component within their infrastructure,
can have a cascading effect on most services.
Now, let's add to the mix that
[AWS, Azure and Google Cloud Platform have a combined market share of 63% in value](https://www.crn.com/news/cloud/2025/cloud-market-share-q1-2025-aws-dips-microsoft-and-google-show-growth?page=1&itc=refresh).
Even if your business does not use these infrastructure providers directly,
chances are that you use vendors who rely on them, or vendors who might in turn
rely on them. Yes, chances are that your SaaS application depends on at least
one of these vendors.
**What you can do**:
* *Map Your Dependencies*: Do you truly know all the services your core product
  relies on, directly and indirectly? Which IaaS, PaaS, APIs, CDN, and so on are
  you using? What are they, in turn, using? Do you rely on npm to build your
  product? Is your app deployed with a GitHub Action? The more you know, the
  more you're prepared.
* *Vendor Due Diligence*: Uptime guarantees (3? 4? 5 nines?) are just marketing.
  Take them as such. What is your vendor's architecture? Its continuity plan?
  Its transparency about incidents? Those are far more important criteria.
* *Consider Multi-Cloud Strategies*: You would not put all your servers in the
  same datacenter, so do not put all your infrastructure with the same IaaS
  provider! (And if you would, you should do something about that too!)
## Own Your Data, Own Your Business
The cloud and API world we live in is great. It allows us to build fast, iterate
quickly, test things and improve our solutions. Need authentication? Use
Supabase or Auth0. Online payment? There is Stripe or PayPal. Transactional
emails? SendGrid and Mailchimp. Search? Algolia. The list can be long, but now
you can work on creating value.
Yet, as the outage showed, if these services become unavailable, your users
might be locked out, or your application might cease to function, regardless of
your own infrastructure's health. This can lead to a significant loss of control
over core business operations and data access. Third-party services ARE single
points of failure!
**What you can do**:
* *Fallback Mechanisms for Core Services*: If a service becomes unavailable, how
  do you replace it? Can you develop an alternative to fall back on?
* *Robust Data Mirroring*: Ensure you have regular, accessible backups of your
  critical data, even if it primarily resides with a third party. Can you
  restore it quickly to a different environment if needed?
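A fallback mechanism can be as simple as an ordered list of providers, tried until one succeeds. A sketch (the provider names and `send` callables are placeholders for whichever SDKs you actually use):

```python
def send_with_fallback(message, providers):
    """Try each (name, send_fn) pair in order; return the name that succeeded.

    Raises only if every provider fails, keeping the error details.
    """
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception as exc:  # noqa: BLE001 - any provider error triggers fallback
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")
```

The key design point is that the caller never knows or cares which vendor did the work; swapping or reordering providers is a one-line change.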
## Build for Resilience
Resilience has always been a consequence of redundancy: you should always have a
backup system that can provide the service while your main system is down.

But redundancy alone is not enough. Your application should also be designed to
be fault tolerant and to use all or part of the backup system when needed. At
the very least, it should ensure that the impact on your users is as small as
possible: the inability to send an email should never block your whole
application.
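"An email failure should never block the application" translates to degrading gracefully instead of propagating the error. A sketch, assuming a background worker exists to drain the queue later (the `notify_user` helper and its return values are illustrative, not a real API):

```python
import queue

# Messages that could not be sent immediately; a background worker
# (not shown) would retry these once the provider recovers.
outbox = queue.Queue()

def notify_user(send, message):
    """Try to send now; on failure, queue for later instead of failing the request."""
    try:
        send(message)
        return "sent"
    except Exception:  # the email provider being down is not the user's problem
        outbox.put(message)
        return "queued"
```

The request that triggered the email completes either way; only the delivery latency changes.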
**What you can do**:
* *Distributed Architectures*: Design your systems with principles like
  microservices. Deploy your services on several IaaS providers. Replicate
  critical data across several providers. The goal is to limit the impact of any
  single component failure.
* *Self-Healing Systems*: Implement mechanisms that can automatically detect
  failures, reroute traffic, or restart services without human intervention. The
  quicker your system can react, the less impact an outage will have.
* *Design for Failure*: Don't wait for an external event to expose your
  weaknesses; by then it is too late. Add some automated failure tests to your
  CI pipeline: what if the client has a 5-second latency to your server? What if
  the database is unavailable? What if a payment cannot be processed right away?
  What is the user *experience* like when something goes wrong? Those issues
  WILL happen.
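A classic building block for the self-healing behavior described above is a circuit breaker: after repeated failures it stops hammering the broken dependency and serves a fallback until a cool-down elapses. A minimal sketch (thresholds and the fallback are illustrative; production code would use a battle-tested library):

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors;
    while open, use the fallback instead of the failing dependency."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                if fallback is not None:
                    return fallback(*args, **kwargs)
                raise RuntimeError("circuit open")
            # Cool-down elapsed: half-open, allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The fallback here is the degraded experience you designed for: a cached answer, a "try again later" page, anything but a hung request.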
## Conclusion
The next outage will come. That's for sure. Maybe not as big, but there will be
outages that affect your business.

Be prepared:

* Know your infrastructure, your vendors, their vendors, and so on.
* Assess risks on a regular basis. Your app evolves, and so do your vendors.
  What is true at one moment may not be true the next.
* Plan for the worst case. Incidents will happen. Your job is to make sure the
  user experience is not impacted.