Article | pckt.blog | 7 min read

Bluesky's April 2026 Outage: A Detailed Post-Mortem

By Jim's Pckt

AI Summary

I'm Jim from Bluesky, and I want to explain what led to our recent outage, which affected about half of our users for roughly eight hours. First, I apologize for the disruption; it was the worst outage I've seen here. The problem began over the weekend, when we noticed dips in the AppView's request chart, indicating downtime. I initially suspected a transit issue, but our network monitoring showed no problems. I then found a spike in error logs from our data backend that pointed to port exhaustion on connections to memcached.

The root cause was hard to find because our observability was poor. Our data plane uses memcached to reduce load on our main Scylla database. A new internal service was sending large batches of URIs to the GetPostRecord RPC, which, unlike other endpoints, had no concurrency limit. As a result, large batches launched thousands of goroutines at once, overwhelming memcached and exhausting ports.
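The missing limit described above can be illustrated with a counting semaphore. This is a minimal sketch, not Bluesky's code: fetchAll, the fetch callback, the limit of 8, and the example URIs are all hypothetical names and values chosen for illustration. It uses only the Go standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll calls fetch once per URI but never runs more than limit
// calls at the same time. Results keep the order of the input slice.
// (Hypothetical helper; the real GetPostRecord handler is not shown in the source.)
func fetchAll(uris []string, limit int, fetch func(string) string) []string {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	results := make([]string, len(uris))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i, uri := range uris {
		sem <- struct{}{} // blocks while `limit` fetches are in flight
		wg.Add(1)
		go func(i int, uri string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			results[i] = fetch(uri)
		}(i, uri)
	}
	wg.Wait()
	return results
}

func main() {
	// Made-up batch of URIs standing in for a large request.
	uris := make([]string, 1000)
	for i := range uris {
		uris[i] = fmt.Sprintf("at://example/post/%d", i)
	}
	records := fetchAll(uris, 8, func(uri string) string {
		return "record for " + uri // stand-in for a cache or database lookup
	})
	fmt.Println("fetched", len(records), "records")
}
```

With this shape, a batch of any size opens at most 8 backend connections at a time instead of one per URI; the right limit for a real service would depend on its connection pool and backend capacity.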

On Monday the situation worsened into a 'death spiral'. The flood of memcached errors overwhelmed our logging system, which caused the Go runtime to spawn excessive OS threads. That strained the garbage collector and led to out-of-memory (OOM) errors. The OOMs, combined with saturated connection pools, caused the service to fail repeatedly.

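To stabilize the system temporarily, we implemented a custom dialer that spread connections across random loopback IPs. A TCP connection is identified by its source and destination addresses and ports, so using many loopback addresses raises the number of connections available before ports run out. This unconventional fix held until we identified and corrected the root cause.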
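A dialer of this kind might look roughly like the sketch below. The summary does not say whether the source or the destination address was randomized, so this version assumes the source side: it binds each outgoing connection to a random address in 127.0.0.0/8 using the standard net.Dialer LocalAddr field. The function names and the local test listener are hypothetical. Linux accepts any 127.0.0.0/8 address on the loopback interface by default; other operating systems may need those addresses configured first, in which case the dial returns an error.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net"
	"time"
)

// randomLoopback returns a random address in 127.0.0.0/8, avoiding
// 0 and 255 in the last octet.
func randomLoopback() net.IP {
	return net.IPv4(127, byte(rand.Intn(256)), byte(rand.Intn(256)), byte(1+rand.Intn(254)))
}

// dialFromRandomLoopback binds each new connection to a random loopback
// source address and lets the OS pick the source port (Port: 0).
// Assumption: the target is reachable over the loopback interface.
func dialFromRandomLoopback(ctx context.Context, network, addr string) (net.Conn, error) {
	d := net.Dialer{
		Timeout:   2 * time.Second,
		LocalAddr: &net.TCPAddr{IP: randomLoopback(), Port: 0},
	}
	return d.DialContext(ctx, network, addr)
}

func main() {
	// Local listener standing in for memcached.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			c.Close()
		}
	}()

	for i := 0; i < 3; i++ {
		conn, err := dialFromRandomLoopback(context.Background(), "tcp", ln.Addr().String())
		if err != nil {
			fmt.Println("dial failed:", err)
			continue
		}
		fmt.Println("connected from", conn.LocalAddr())
		conn.Close()
	}
}
```

Each source address gets its own ephemeral port range toward the same destination, which is why this relieves the pressure without reducing the number of connections. It treats the symptom only; the unbounded concurrency still has to be fixed.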

Reflecting on this incident, I want to emphasize the need for robust observability and metrics to pinpoint issues amid chaos. Logging should be strategic and, in high-scale systems, complemented by tools like Prometheus or OpenTelemetry (OTel). I apologize again for the service interruption; we are committed to preventing such incidents in the future.
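One way to read "strategic" logging is to count every error but log only a sample, so an error storm cannot flood the logging path. The sketch below is a standard-library stand-in for that idea, not the Prometheus or OpenTelemetry API: errorMeter, its fields, and errDialFailed are hypothetical, and in a real service the counter would more likely be exported through one of those tools.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync/atomic"
)

// errDialFailed is a made-up error used for the demonstration.
var errDialFailed = errors.New("dial tcp: cannot assign requested address")

// errorMeter counts every error but logs only the first one and then
// every `every`-th one after it.
type errorMeter struct {
	total atomic.Int64 // every error observed
	every int64        // sampling interval for log lines
	logf  func(format string, args ...any)
}

// Observe records err and reports whether a log line was emitted.
func (m *errorMeter) Observe(err error) bool {
	every := m.every
	if every < 1 {
		every = 1 // log everything if no interval is set
	}
	n := m.total.Add(1)
	if (n-1)%every != 0 {
		return false // counted, but not logged
	}
	m.logf("backend error (total so far: %d): %v", n, err)
	return true
}

// Total returns the number of errors observed so far.
func (m *errorMeter) Total() int64 { return m.total.Load() }

func main() {
	m := &errorMeter{every: 1000, logf: log.Printf}
	for i := 0; i < 2500; i++ {
		m.Observe(errDialFailed)
	}
	fmt.Println("errors counted:", m.Total())
}
```

The counter stays accurate under load because incrementing it is a single atomic operation, while the number of log writes is bounded by the sampling interval rather than by the error rate.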

Key Concepts

Observability

Observability is the ability to measure the internal states of a system by examining its outputs. It involves collecting, analyzing, and visualizing data to understand system behavior and diagnose issues.

Concurrency Management

Concurrency management involves controlling how multiple processes or threads execute at the same time, so that they operate correctly and efficiently without conflicts or resource exhaustion.

Category

Technology

Summarized by Mente