Cloudflare Dashboard Outage: A Deep Dive into Root Cause, Remediation, and Future Improvements
On November 21st, 2023, Cloudflare customers experienced intermittent disruptions to dashboard and API access. We understand the frustration this caused, and we want to provide a comprehensive post-mortem detailing the incident, our response, and the steps we are taking to prevent a recurrence. This isn't just about fixing a problem; it's about strengthening the reliability of the platform millions rely on. At Cloudflare, we prioritize transparency and continuous improvement, and this incident is an important learning opportunity.
What Happened: A Cascade of Issues
The initial incident stemmed from unexpectedly high load on our Tenant Service, a core component responsible for managing customer configurations. This service, which runs on Kubernetes across a subset of our data centers, began exhibiting unusually high resource utilization. Our initial response focused on immediate mitigation: we rapidly scaled up the number of pods available to handle the increased demand. While this temporarily improved availability, it did not fully resolve the underlying issue, which indicated a deeper problem than simply insufficient capacity.
Further investigation revealed a bug within the Tenant Service itself. A subsequent patch, deployed with the intention of improving API health and restoring dashboard functionality, unfortunately exacerbated the situation, leading to a second, more significant outage (as visualized in the graph below). The patch was swiftly reverted, underscoring the importance of robust testing and controlled rollouts.
[Image: https://cf-assets.www.cloudflare.com/zkvhlag99gkb/UOY1fEUaSzxRE6tNrsBPu/fd02638a5d2e37e47f5c9a9888b5eac3/BLOG-3011_3.png]
(API Error Rate during the outage. Note the spike coinciding with the second deployment.)
Crucially, this outage was contained within our control plane. Because of the architectural separation between our control plane and data plane, Cloudflare's core network services, the services that protect your websites and applications, remained fully operational. The vast majority of Cloudflare users were unaffected and experienced no disruption to their internet traffic. Impact was primarily limited to customers actively making configuration changes or using the dashboard.
Why It Was Difficult to Diagnose: The Thundering Herd & Lack of Granular Visibility
The situation was complicated by a classic distributed-systems problem known as the "thundering herd." When the Tenant Service was restarted as part of the remediation efforts, all active dashboard sessions simultaneously attempted to re-authenticate with the API. This sudden surge overwhelmed the service, contributing to the instability. The effect was amplified by a pre-existing bug in our dashboard logic. A hotfix addressing this dashboard bug was deployed promptly after the incident's impact subsided. We are now implementing further changes to the dashboard, including randomized retry delays, to distribute load and prevent future contention.
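To make the retry-delay idea concrete, here is a minimal sketch of retrying an API call with exponential backoff plus full jitter, so that many clients restarting at the same moment spread their retries out instead of hitting the API in lockstep. The function name, parameters, and thresholds are hypothetical illustrations, not our actual dashboard code.

```typescript
// Minimal sketch: retry a request with exponential backoff and "full jitter"
// so simultaneous client restarts do not recreate a thundering herd.
async function fetchWithJitteredRetry(
  url: string,
  maxAttempts = 5,
  baseDelayMs = 500,
  maxDelayMs = 30_000,
): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const response = await fetch(url);
      if (response.ok) return response;
      // Treat 5xx as retryable; surface other statuses immediately.
      if (response.status < 500) return response;
    } catch {
      // Network error: fall through and retry.
    }
    // Full jitter: a uniformly random delay up to the exponential cap.
    const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
    const delay = Math.random() * cap;
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error(`Request to ${url} failed after ${maxAttempts} attempts`);
}
```

Full jitter (a uniformly random delay up to the exponential cap) is a common choice here because it decorrelates clients most aggressively, which is exactly what a thundering-herd scenario calls for.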
A significant challenge during the incident was differentiating between legitimate new requests and retries. We observed a considerable increase in API usage but lacked the granular visibility to determine whether that traffic came from retries or genuinely new requests. Had we been able to quickly identify a sustained volume of new requests, it would have been a strong indicator of a looping issue within the dashboard, which ultimately proved to be the case.
Our Response & Lessons Learned: Strengthening Reliability & Observability
We are committed to learning from this incident and implementing improvements across multiple areas. Here’s a detailed breakdown of the actions we’re taking:
* Automated Rollbacks with Argo Rollouts: We are accelerating the migration of our services to Argo Rollouts, a progressive delivery controller for Kubernetes that automatically monitors deployments for errors and rolls back problematic changes. Had Argo Rollouts been in place for the Tenant Service, the problematic second deployment would have been automatically reverted, significantly limiting the scope of the outage. This migration was already planned, and we have elevated its priority (a simplified sketch of the rollback principle follows this list).
* Enhanced Capacity Planning & Monitoring: We have significantly increased the resources allocated to the Tenant Service to handle peak loads and future growth. More importantly, we are refining our monitoring so that we are alerted before the service approaches its capacity limits, including more sophisticated metrics and alerting thresholds.
* Improved API Request Visibility: We are modifying the dashboard's API calls to include metadata identifying whether a request is a retry or a brand-new request (see the second sketch after this list). This will provide critical insight during future incidents, enabling faster and more accurate diagnosis.
* Dashboard Resilience: Beyond the hotfix addressing the initial bug, we are introducing randomized retry delays in the dashboard (as sketched earlier), smoothing out load spikes and preventing a repeat of the thundering-herd effect when services restart.
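For the Argo Rollouts item above: the real configuration lives in Kubernetes manifests rather than application code, but the underlying principle, gating a deployment on a health metric and reverting automatically when it degrades, can be sketched roughly as follows. The deploy, rollback, and metric helpers here are hypothetical placeholders, not our tooling and not the Argo Rollouts API.

```typescript
// Sketch of the automated-rollback principle: after deploying, watch an
// error-rate metric and revert automatically if it crosses a threshold.
// deploy(), rollback(), and getErrorRate() are placeholders.
async function guardedDeploy(
  deploy: () => Promise<void>,
  rollback: () => Promise<void>,
  getErrorRate: () => Promise<number>, // e.g. share of 5xx responses, from a metrics system
  maxErrorRate = 0.05,                 // roll back if more than 5% of requests fail
  checks = 10,                         // number of post-deploy health checks
  intervalMs = 30_000,                 // time between checks
): Promise<void> {
  await deploy();
  for (let i = 0; i < checks; i++) {
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    const errorRate = await getErrorRate();
    if (errorRate > maxErrorRate) {
      // Health budget exceeded: revert without waiting for a human.
      await rollback();
      throw new Error(
        `Rolled back: error rate ${errorRate.toFixed(3)} exceeded ${maxErrorRate}`,
      );
    }
  }
}
```

Argo Rollouts expresses this same loop declaratively, tying metric analysis to a Rollout resource, so no imperative glue code like the above is required.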
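And for the API request visibility item: one way to tag requests so the backend can separate retries from genuinely new traffic is sketched below. The header names and helper are illustrative assumptions, not Cloudflare's actual API contract.

```typescript
// Hypothetical sketch: tag each dashboard API request so the backend can
// distinguish brand-new requests from retries. Header names are illustrative.
interface TaggedRequestInit extends RequestInit {
  requestId: string; // stable ID shared by an original request and all of its retries
  attempt: number;   // 0 for the original request, 1+ for retries
}

function tagRequest(init: TaggedRequestInit): RequestInit {
  const { requestId, attempt, ...rest } = init;
  const headers = new Headers(rest.headers);
  headers.set("x-client-request-id", requestId);
  headers.set("x-client-retry-attempt", String(attempt));
  return { ...rest, headers };
}

// Usage: reuse the same requestId on every retry so server-side analytics
// can count distinct requests separately from retry volume.
const requestId = crypto.randomUUID();
const response = await fetch(
  "https://api.example.com/v1/zones", // placeholder endpoint
  tagRequest({ requestId, attempt: 0 }),
);
```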
