diff --git a/.gitignore b/.gitignore
index 7ba0a09..ce63271 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
 /public
 /resources
+/.compress_state
diff --git a/build.sh b/build.sh
index 343c0b6..27ee107 100755
--- a/build.sh
+++ b/build.sh
@@ -1,4 +1,27 @@
 #!/usr/bin/env bash
-rm -rf public
-hugo build --minify
+set -euo pipefail
+
+hugo build --minify
+incremental-compress \
+  -dir public \
+  -statedir .compress_state \
+  -types html,css,js,json,xml,ico,svg,md,otf,woff,ttf,woff2,webmanifest \
+  -zstd=false \
+  -verbose
+rsync \
+  --perms \
+  --times \
+  --update \
+  --partial \
+  --progress \
+  --recursive \
+  --checksum \
+  --compress \
+  --links \
+  --delete-after \
+  --owner \
+  --usermap "*:bcarlin_net" \
+  --group \
+  --groupmap "*:bcarlin_net" \
+  public/ root@192.168.1.25:/home/bcarlin_net/www
diff --git a/content/blog/007-prepare-for-the-next-internet-outage.md b/content/blog/007-prepare-for-the-next-internet-outage.md
new file mode 100644
index 0000000..3532e6b
--- /dev/null
+++ b/content/blog/007-prepare-for-the-next-internet-outage.md
@@ -0,0 +1,118 @@
+---
+title: 'Prepare for the Next Internet Outage'
+slug: 'prepare-for-the-next-internet-outage'
+date: '2025-06-14T04:05:48+02:00'
+tags: [architecture, cloud]
+summary: >
+  A reflection on recent internet outages and my takeaways to build more
+  resilient web services.
+---
+
+Last Thursday, [the Internet broke](https://mashable.com/article/google-down-cloudflare-twitch-character-ai-internet-outage).
+Again. Yes, the media turned a two-hour outage into a clickbait-friendly global
+crisis.
+
+What made this incident significant was not just the disruption of Google Cloud
+but the hundreds of websites and applications that went down at the same time.
+This included some major ones like Cloudflare, which uses GCP for some of its
+services. Since Cloudflare is a widespread CDN, cache and proxy, its failure
+created a domino effect and broke, in turn, countless websites.
+
+It reminds us of the fragile interconnectedness of our digital world. I don’t
+want to point fingers, but rather to learn lessons from this incident. This
+wasn't just a random hiccup; it highlighted fundamental principles that, in the
+age of "everything as a service," we might have inadvertently overlooked.
+
+Here are my key takeaways.
+
+## Do Not Put All Your Eggs in the Same Vendor Basket
+
+The cloud means infinite scaling, infinite storage, infinite compute power,
+infinite flexibility. It is built on the promise of reducing costs (which can
+be true when used correctly). However, this hides an often overlooked truth and
+the cloud's biggest risk: single-vendor dependency. The recent outage showed
+how the failure of a single vendor, or even of a single component within its
+infrastructure, can have a cascading effect on countless services.
+
+Now, let’s add to the mix that
+[AWS, Azure and Google Cloud Platform have a combined market share of 63% in value](https://www.crn.com/news/cloud/2025/cloud-market-share-q1-2025-aws-dips-microsoft-and-google-show-growth?page=1&itc=refresh).
+Even if your business does not use these infrastructure providers directly,
+chances are that you use vendors who rely on them, or vendors whose own vendors
+do. Yes, chances are that your SaaS application is dependent on at least one of
+these vendors.
+
+**What you can do**:
+
+* *Map Your Dependencies*: Do you truly know all the services your core product
+  relies on, directly and indirectly? Which IaaS, PaaS, APIs, CDN, and so on
+  are you using? What are they, in turn, using? Do you rely on NpmJS to build
+  your product? Is your app deployed with a GitHub Action? The more you know,
+  the more you're prepared.
+* *Vendor Due Diligence*: Uptime guarantees (3? 4? 5 nines?) are just
+  marketing. Take them as such. What is your vendor's architecture? Its
+  continuity plan? Its transparency on incidents? Those are far more important
+  criteria.
+* *Consider Multi-Cloud Strategies*: You would not put all your servers in the
+  same datacenter, would you? Then do not put all your infrastructure with the
+  same IaaS provider! (And if you would, you should do something about it!)
+
+## Own Your Data, Own Your Business
+
+The cloud and API world we live in is great. It allows us to build fast,
+iterate quickly, test things and improve our solutions. Need authentication?
+Use Supabase or Auth0. Online payment? There is Stripe or PayPal. Transactional
+emails? SendGrid and MailChimp. Search? Algolia. The list can be long, but now
+you can work on creating value.
+
+Yet, as the outage showed, if these services become unavailable, your users
+might be locked out, or your application might cease to function, regardless of
+your own infrastructure's health. This can lead to a significant loss of
+control over core business operations and data access. Third-party services
+ARE single points of failure!
+
+**What you can do**:
+
+* *Fallback Mechanisms for Core Services*: If a service becomes unavailable,
+  how do you replace it? Can you develop an alternative to fall back on? (See
+  the sketch after this list.)
+* *Robust Data Mirroring*: Ensure you have regular, accessible backups of your
+  critical data, even if it primarily resides with a third party. Can you
+  restore it quickly to a different environment if needed?
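+
+As a rough illustration of the fallback idea, here is a minimal sketch in
+Python. It is not tied to any real provider SDK: `send_via_provider` stands in
+for whatever email client you actually use, and the "outbox" is just a local
+spool directory that a periodic retry job drains once the provider is back.
+
+```python
+import json
+import logging
+import time
+from pathlib import Path
+
+OUTBOX = Path("outbox")  # local spool used while the provider is down
+
+
+def send_via_provider(message: dict) -> None:
+    """Placeholder for your real email provider client; raises on failure."""
+    raise ConnectionError("provider unreachable")  # simulate an outage
+
+
+def send_email(message: dict) -> bool:
+    """Try the primary provider; on failure, spool locally for a later retry."""
+    try:
+        send_via_provider(message)
+        return True
+    except Exception:
+        logging.warning("email provider unavailable, spooling message")
+        OUTBOX.mkdir(exist_ok=True)
+        spool_file = OUTBOX / f"{int(time.time() * 1000)}.json"
+        spool_file.write_text(json.dumps(message))
+        return False  # the caller keeps working; a retry job drains the outbox
+
+
+def drain_outbox() -> None:
+    """Periodic job: retry every spooled message once the provider is back."""
+    for spool_file in sorted(OUTBOX.glob("*.json")):
+        message = json.loads(spool_file.read_text())
+        try:
+            send_via_provider(message)
+            spool_file.unlink()
+        except Exception:
+            break  # still down; try again on the next run
+```
+
+The exact code matters less than its shape: when the provider fails, the
+feature degrades (the email goes out later) instead of breaking the request
+that triggered it.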
+
+## Build for Resilience
+
+Resilience has always been a consequence of redundancy. You should always have
+a backup system that can take over while your main system is down.
+
+But having redundancy alone is not enough. Your application should also be
+designed to be fault tolerant and to use all or part of the backup system when
+needed. At the very least, it should keep the impact on your users as small as
+possible: being unable to send an email should never block your whole
+application.
+
+**What you can do**:
+
+* *Distributed Architectures*: Design your systems around principles like
+  microservices. Deploy your services on several IaaS providers. Replicate
+  critical data across several providers. The goal is to limit the impact of
+  any single component failure.
+* *Self-Healing Systems*: Implement mechanisms that can automatically detect
+  failures, reroute traffic, or restart services without human intervention.
+  The quicker your system can react, the less impact an outage will have.
+* *Design for failure*: Don't wait for an external event to expose your
+  weaknesses. By then it is too late. Add some automated failure tests to your
+  CI pipeline: what if the client has a 5 second latency with your server? What
+  if the database is unavailable? What if a payment cannot be processed right
+  away? What is the user *experience* like when something goes wrong? Those
+  issues WILL happen. (See the sketch after this list.)
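+
+To make that concrete, here is a minimal sketch of such a failure test, again
+in Python with `unittest`. The `PaymentGateway`, `checkout` function and retry
+queue are made-up stand-ins for your own code; the point is only to inject the
+failure (here, a timeout) and assert that the user still gets an acceptable
+outcome.
+
+```python
+import unittest
+from unittest import mock
+
+
+class PaymentGateway:
+    """Stand-in for a real payment API client."""
+
+    def charge(self, order_id: str, amount: int) -> None:
+        raise NotImplementedError  # the real client calls an external API
+
+
+def checkout(gateway, retry_queue, order_id, amount):
+    """Charge now if possible; otherwise accept the order and retry later."""
+    try:
+        gateway.charge(order_id, amount)
+        return "paid"
+    except TimeoutError:
+        retry_queue.append((order_id, amount))  # degrade gracefully
+        return "accepted, payment pending"
+
+
+class CheckoutFailureTest(unittest.TestCase):
+    def test_payment_timeout_does_not_lose_the_order(self):
+        gateway = mock.Mock(spec=PaymentGateway)
+        gateway.charge.side_effect = TimeoutError  # inject the failure
+        retry_queue = []
+
+        status = checkout(gateway, retry_queue, order_id="42", amount=1999)
+
+        # The user gets an answer and the order is queued, not dropped.
+        self.assertEqual(status, "accepted, payment pending")
+        self.assertEqual(retry_queue, [("42", 1999)])
+
+
+if __name__ == "__main__":
+    unittest.main()
+```
+
+Run in CI, a test like this fails the build as soon as a change turns a
+provider outage into a broken checkout.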
+
+## Conclusion
+
+The next outage will come. That's for sure. Maybe not as big, but some outages
+will affect your business.
+
+Be prepared:
+
+* Know your infrastructure, your vendors, their vendors, and so on.
+* Assess risks on a regular basis. Your app evolves, and so do your vendors.
+  What is true at one moment may not be at the next.
+* Plan for the worst case. Incidents will happen. Your job is to make sure the
+  user experience is not impacted.