add new post
parent 4aab9f1242
commit ae6163c100
3 changed files with 144 additions and 2 deletions

.gitignore (+1)

```diff
@@ -1,2 +1,3 @@
 /public
 /resources
+/.compress_state
```

build.sh (+25 -2)

```diff
@@ -1,4 +1,27 @@
 #!/usr/bin/env bash
 
-rm -rf public
-hugo build --minify
+set -euo pipefail
+
+hugo build --minify
+incremental-compress \
+-dir public \
+-statedir .compress_state \
+-types html,css,js,json,xml,ico,svg,md,otf,woff,ttf,woff2,webmanifest \
+-zstd=false \
+-verbose
+rsync \
+--perms \
+--times \
+--update \
+--partial \
+--progress \
+--recursive \
+--checksum \
+--compress \
+--links \
+--delete-after \
+--owner \
+--usermap "*:bcarlin_net" \
+--group \
+--groupmap "*:bcarlin_net" \
+public/ root@192.168.1.25:/home/bcarlin_net/www
```

content/blog/007-prepare-for-the-next-internet-outage.md (new file, +118)

```yaml
---
title: 'Prepare for the Next Internet Outage'
slug: 'prepare-for-the-next-internet-outage'
date: '2025-06-14T04:05:48+02:00'
tags: [architecture, cloud]
summary: >
  A reflection on recent internet outages and my takeaways to build more
  resilient web services.
---
```

Last Thursday, [the Internet broke](https://mashable.com/article/google-down-cloudflare-twitch-character-ai-internet-outage).
Again. Yes, the media turned a two-hour outage into a clickbait-friendly global
crisis.

What made this incident significant was not just the disruption of Google Cloud,
but the hundreds of websites and applications that went down at the same time.
These included some major ones, like Cloudflare, which uses GCP for some of its
services. Because Cloudflare is a widespread CDN, cache, and proxy, its failure
created a domino effect that, in turn, broke countless websites.

It reminds us of the fragile interconnectedness of our digital world. I don’t
want to point fingers, but rather to learn lessons from this incident. This
wasn't just a random hiccup; it highlighted fundamental principles that, in the
age of "everything as a service," we might have inadvertently overlooked.

Here are my key takeaways.

## Do Not Put All Your Eggs in the Same Vendor Basket

The cloud means infinite scaling, infinite storage, infinite compute power,
infinite flexibility. It is built on the promise of reducing costs (which can be
true when used correctly). However, this hides an overlooked truth and its
biggest risk: single-vendor dependency. The recent outage showed how a failure
at a single vendor, or even in a single component of its infrastructure, can
cascade to many other services.

Now, let’s add to the mix that
[AWS, Azure and Google Cloud Platform have a combined market share of 63% in value](https://www.crn.com/news/cloud/2025/cloud-market-share-q1-2025-aws-dips-microsoft-and-google-show-growth?page=1&itc=refresh).
Even if your business does not use these infrastructure providers directly,
chances are that you use vendors who rely on them, or who rely on other vendors
that do. Yes, chances are that your SaaS application depends on at least one of
these vendors.

**What you can do**:

* *Map Your Dependencies*: Do you truly know all the services your core product
  relies on, directly and indirectly? Which IaaS, PaaS, APIs, CDNs, and so on are
  you using? What are they, in turn, using? Do you rely on npm to build your
  product? Is your app deployed with a GitHub Action? The more you know, the
  better prepared you are.
* *Vendor Due Diligence*: Uptime guarantees (3? 4? 5 nines?) are just marketing.
  Treat them as such. What is your vendor’s architecture? Its continuity plan?
  Its transparency on incidents? Those are far more important criteria.
* *Consider Multi-Cloud Strategies*: You would not put all your servers in the
  same datacenter, would you? Then do not put all your infrastructure with the
  same IaaS provider! (If you would, you should do something about it!)
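
To make "directly and indirectly" concrete, here is a minimal Bash sketch, assuming you keep a hand-written inventory of which service relies on which. The `DEPS` map, every service name in it, and the `transitive_deps` function are hypothetical. The function walks the map depth-first and prints every service your app ends up depending on, each one once. It needs Bash 4.3 or later (associative arrays and negative array indices), so not the Bash 3.2 that ships with macOS.

```bash
#!/usr/bin/env bash
# Requires Bash 4.3+ (associative arrays, negative array indices).

# Hypothetical inventory: each service maps to the services it relies on
# directly (space-separated). Replace with your own dependency map.
declare -A DEPS=(
  [my-app]="auth-provider cdn payments ci"
  [auth-provider]="gcp"
  [cdn]="gcp"
  [payments]="aws"
  [ci]="azure"
)

# Print every service that "$1" depends on, directly or indirectly,
# once each, using an iterative depth-first walk over DEPS.
transitive_deps() {
  local -A seen=()
  local -a stack=("$1")
  local node dep
  seen[$1]=1                  # never report the starting service itself
  while ((${#stack[@]} > 0)); do
    node=${stack[-1]}         # pop the most recently pushed service
    unset 'stack[-1]'
    # Services without an entry (e.g. "gcp") have no known dependencies.
    for dep in ${DEPS[$node]:-}; do
      if [[ -z ${seen[$dep]:-} ]]; then
        seen[$dep]=1
        stack+=("$dep")
        echo "$dep"
      fi
    done
  done
}

transitive_deps my-app | sort
```

With this made-up inventory, the script lists `auth-provider`, `aws`, `azure`, `cdn`, `ci`, `gcp` and `payments`. Note that `gcp` is reached through both `auth-provider` and `cdn`: two vendors that look independent on paper share a single point of failure.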

## Own Your Data, Own Your Business

The cloud and API world we live in is great. It allows us to build fast, iterate
quickly, test things and improve our solutions. Need authentication? Use
Supabase or Auth0. Online payments? There is Stripe or PayPal. Transactional
emails? SendGrid and Mailchimp. Search? Algolia. The list can be long, but now
you can focus on creating value.

Yet, as the outage showed, if these services become unavailable, your users
might be locked out, or your application might cease to function, regardless of
your own infrastructure's health. This can lead to a significant loss of control
over core business operations and data access. Third-party services ARE single
points of failure!

**What you can do**:

* *Fallback Mechanisms for Core Services*: If a service becomes unavailable, how
  do you replace it? Can you develop an alternative to fall back on?
* *Robust Data Mirroring*: Ensure you have regular, accessible backups of your
  critical data, even if it primarily resides with a third party. Can you
  restore it quickly to a different environment if needed?
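
A fallback does not have to be elaborate to be useful. The Bash sketch below assumes that both the primary service and its alternative can be called as commands; the `with_fallback` helper is hypothetical, not part of any tool mentioned in this post. It runs the primary command and, only if that fails, logs the failure and runs the fallback.

```bash
#!/usr/bin/env bash

# Hypothetical helper: run a primary command; if it fails, log the failure
# to stderr and run the fallback command instead.
# Usage: with_fallback <primary command...> -- <fallback command...>
with_fallback() {
  local -a primary=()
  # Collect the primary command's words up to the "--" separator.
  while (($# > 0)) && [[ $1 != -- ]]; do
    primary+=("$1")
    shift
  done
  # Require a non-empty primary command, the separator and a fallback.
  if ((${#primary[@]} == 0 || $# < 2)); then
    echo "usage: with_fallback <primary...> -- <fallback...>" >&2
    return 2
  fi
  shift # drop the "--" separator
  if "${primary[@]}"; then
    return 0
  fi
  echo "warning: '${primary[*]}' failed, falling back to '$*'" >&2
  "$@"
}

# Example (hypothetical endpoints): query the primary API, else a mirror
# hosted with a different provider.
# with_fallback curl -fsS --max-time 5 https://api.example.com/status \
#   -- curl -fsS --max-time 5 https://mirror.example.net/status
```

The hard part is not the helper but having a second command worth calling: a mirror on another provider, a restored copy of your data, or a degraded local mode.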

## Build for Resilience

Resilience has always been a consequence of redundancy. You should always have a
backup system that can keep the service running while your main system is down.

But redundancy alone is not enough. Your application should also be designed to
be fault tolerant and to use all or part of the backup system when needed. At
the very least, it should keep the impact on your users as small as possible:
failing to send an email should never block your whole application.
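
The same rule applies at any scale, even in a shell script. Under `set -euo pipefail`, any failing command aborts the script: exactly what you want for a critical step, and exactly what you do not want for an email notification. Here is a minimal sketch, where `best_effort` is a hypothetical helper and `save_order` and `send_confirmation_email` are made-up stand-ins for real tasks.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Run a non-critical step: on failure, log a warning and carry on.
# Calling the command from an `if` condition keeps `set -e` from
# aborting the script when it fails.
best_effort() {
  if ! "$@"; then
    echo "warning: non-critical step failed: $*" >&2
  fi
  return 0
}

# Hypothetical stand-ins for a critical and a non-critical task.
save_order() { echo "order saved"; }
send_confirmation_email() { echo "SMTP server unreachable" >&2; return 1; }

save_order                           # critical: a failure here stops the script
best_effort send_confirmation_email  # non-critical: a failure is only logged
echo "request completed"
```

Only wrap steps you have explicitly classified as non-critical; wrapping everything would hide real failures. One Bash subtlety: `set -e` is ignored inside a function called from an `if` condition, so a multi-step function wrapped this way keeps running even if one of its intermediate commands fails.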

**What you can do**:

* *Distributed Architectures*: Design your systems with principles like
  microservices. Deploy your services on several IaaS providers. Replicate
  critical data across several providers. The goal is to limit the impact of any
  single component failure.
* *Self-Healing Systems*: Implement mechanisms that can automatically detect
  failures, reroute traffic, or restart services without human intervention. The
  quicker your system can react, the less impact an outage will have.
* *Design for Failure*: Don't wait for an external event to expose your
  weaknesses; by then it is too late. Add some automated failure tests to your CI
  pipeline: what if the client has a 5-second latency with your server? What if
  the database is unavailable? What if a payment cannot be processed right away?
  What is the user *experience* like when something goes wrong? Those issues
  WILL happen.
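
Failure tests can start small. This Bash sketch injects the 5-second latency mentioned above into a hypothetical `fake_backend` and checks that a hypothetical `fetch_with_deadline` gives up after 1 second (an arbitrary deadline) and serves a degraded "cached data" response instead of hanging. It assumes GNU coreutils `timeout` is available, which is not the case by default on macOS.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical dependency; LATENCY (seconds) simulates a slow server.
fake_backend() {
  sleep "${LATENCY:-0}"
  echo "fresh data"
}
# `timeout` only runs programs, so export the function for a child bash.
export -f fake_backend

# Call the backend with a deadline (in seconds). On timeout or error, serve
# a degraded response instead of hanging or failing the whole request.
fetch_with_deadline() {
  local deadline=$1
  if ! timeout "$deadline" bash -c fake_backend; then
    echo "cached data"
  fi
}

# Failure-injection test: a 5-second backend latency must not hold the
# caller past its 1-second deadline; the degraded response is served.
test_slow_backend_degrades_gracefully() {
  local result
  result=$(export LATENCY=5; fetch_with_deadline 1)
  [[ $result == "cached data" ]]
}

# Control test: a fast backend still returns fresh data.
test_fast_backend_returns_fresh_data() {
  local result
  result=$(export LATENCY=0; fetch_with_deadline 1)
  [[ $result == "fresh data" ]]
}

test_slow_backend_degrades_gracefully
test_fast_backend_returns_fresh_data
echo "all failure-injection tests passed"
```

Because of `set -e`, a failing test stops the script with a non-zero status, which is what a CI pipeline needs to mark the job as failed.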

## Conclusion

The next outage will come. That’s for sure. Maybe not as big, but there will be
some that affect your business.

Be prepared:

* Know your infrastructure, your vendors, their vendors, etc.
* Assess risks on a regular basis. Your app evolves, and so do your vendors. What
  is true at one moment may not be true at the next.
* Plan for the worst case. Incidents will happen. Your job is to make sure that
  the user experience is not impacted.