this post was submitted on 07 Nov 2023
[–] [email protected] 14 points 1 year ago (1 children)

It also sounded like they had apps hosted only there that had to be online for everything else to work.

But the 4-minutes-instead-of-10 batteries certainly didn’t help.

[–] [email protected] 3 points 1 year ago

That's exactly it.
https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/

Here is a quick summary, but the actual postmortem is worth reading.
Classic example of a cascading failure (domino effect). Luckily they were resilient enough that it wasn't a full outage.

Basically, new features get developed fast and iterated on quickly. When they mature, they get integrated into the high availability cluster.
There are also some services that are deliberately not clustered. One of these is logging: when the logging core service is down, logs are supposed to pile up "at the edge".
Unfortunately, some services were too tightly coupled to the logging core. They were supposed to be highly available, but they couldn't cope with the core logging service being down.
Whilst HA failover had been tested, the core services had never been taken offline, so all this was missed.
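
To make the coupling point concrete, here is a rough sketch of the "logs pile up at the edge" idea: a service buffers log lines locally when the logging core is unreachable, instead of failing the request. Everything here (`BufferedLogger`, `CoreLoggingDown`, the `send` callable, `handle_request`) is a made-up name for illustration, not anything from Cloudflare's actual stack.

```python
from collections import deque


class CoreLoggingDown(Exception):
    """Raised by the (hypothetical) core logging client when it is unreachable."""


class BufferedLogger:
    """Queue log lines locally when the core is down instead of failing the caller."""

    def __init__(self, send, max_buffered=10_000):
        self._send = send  # hypothetical callable that ships one line to the core
        # Bounded buffer: once full, the oldest lines are dropped.
        self._buffer = deque(maxlen=max_buffered)

    def log(self, line):
        self._buffer.append(line)
        self.flush()

    def flush(self):
        """Drain the buffer in order; stop quietly at the first failure."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except CoreLoggingDown:
                return False  # core still down, keep the backlog for later
            self._buffer.popleft()
        return True


def handle_request(logger, payload):
    """A request handler that keeps working whether or not logging is reachable."""
    logger.log(f"handled {payload}")
    return payload.upper()
```

The tightly coupled version would let the exception from `send` propagate out of `handle_request`, which is roughly the failure mode described above. Note the bounded buffer does drop the oldest lines once it fills up, so this alone would not meet a "no logs are dropped" goal.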

All of which resulted in inconsistent high availability across different services and products. A lot of new features failed as expected, and some mature features that shouldn't have failed did.

When they brought their disaster recovery site up, there were some things that needed manual configuration, and some newer features that hadn't been tested in a disaster recovery scenario.

They are now focusing significant resources on (quoting the postmortem):

  • Remove dependencies on our core data centers for control plane configuration of all services and move them wherever possible to be powered first by our distributed network.
  • Ensure that the control plane running on the network continues to function even if all our core data centers are offline.
  • Require that all products and features that are designated Generally Available must rely on the high availability cluster (if they rely on any of our core data centers), without having any software dependencies on specific facilities.
  • Require all products and features that are designated Generally Available have a reliable disaster recovery plan that is tested.
  • Test the blast radius of system failures and minimize the number of services that are impacted by a failure.
  • Implement more rigorous chaos testing of all data center functions including the full removal of each of our core data center facilities.
  • Thorough auditing of all core data centers and a plan to reaudit to ensure they comply with our standards.
  • Logging and analytics disaster recovery plan that ensures no logs are dropped even in the case of a failure of all our core facilities.
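
For the "blast radius" item in the list above, a toy way to reason about it is to walk a map of hard dependencies and see what goes down when one thing is removed. This is only a sketch; the service names and the `HARD_DEPS` map are invented for the example and are not Cloudflare's real topology.

```python
def blast_radius(hard_deps, failed):
    """Return every service that goes down, directly or transitively, when `failed` is offline."""
    down = set(failed)
    changed = True
    while changed:  # repeat until no new service is pulled down
        changed = False
        for service, deps in hard_deps.items():
            if service not in down and any(d in down for d in deps):
                down.add(service)
                changed = True
    return down


# Hypothetical dependency map: service -> services it cannot run without.
HARD_DEPS = {
    "logging-core": [],
    "analytics": ["logging-core"],
    "dashboard": ["analytics"],
    "edge-proxy": [],
}

print(sorted(blast_radius(HARD_DEPS, {"logging-core"})))
# → ['analytics', 'dashboard', 'logging-core']
```

A chaos test is the real-world version of this: actually take the thing offline and check that what failed matches what the map says should fail.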