Twitch is the world’s largest live streaming platform for individuals. There are many ways to view Twitch, including desktop browsers, mobile devices, game consoles, and TV apps. The Client Delivery Platform team owns the infrastructure that delivers Twitch clients to users. Last year, we designed next-generation availability defenses for one of our critical microservices that raised availability from 99.9% (3 9s) to 99.99% (4 9s). In this post I’ll share our design, guiding principles, and results.
Whenever you design high availability for a service, you should take into account both the common considerations that apply to any cloud service and the specific opportunities for your service. This post covers what we’ve done beyond the basics.
We developed the following high availability principles for our service:
- Treat availability like security.
- Practice defense-in-depth.
- Use your Content Delivery Network (CDN) as a full partner to protect your service.
- Design failover mechanisms for service unavailability.
- Overprovision as a standing defense.
- Be ready to quickly add capacity when you need it.
- Use proportional provisioning to balance overprovisioning and cost.
- Regularly profile load and tune service allocation.
- Consider the impact of availability decisions on other areas such as latency.
- Invest in alerting, and avoid the extremes of false alarms and unawareness.
- Don’t rely on standing defenses: respond to attacks.
- Optimize your service for maximum efficiency.
- Commit to ongoing innovation and continuous improvement.
Delivering the Web Platform
30 million people visit www.twitch.tv on a typical day. Sage, the microservice this post is about, has the job of serving up index.html. That sounds pretty simple, but Sage also supports canary releases, A/B testing, policy enforcement, and search engine metadata retrieval. If Sage fails to do its job, the user doesn’t reach Twitch and can’t enjoy live video, chat, and everything else Twitch offers. We care about high availability for Sage because you only get one chance to make a first impression.
Let’s define some metrics. We’re measuring availability as the error-free rate of the Sage service load balancers, reported weekly. A related metric is deliverability, the error-free rate from CDN to browser. Availability measures the service, while deliverability measures the actual customer experience.
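As a rough illustration of the availability metric, the sketch below computes an error-free rate from request and error counts and reports how many 9s it reaches. The function names and the weekly totals are hypothetical, not Sage's actual tooling or data.

```python
def error_free_rate(total_requests: int, error_responses: int) -> float:
    """Fraction of requests that completed without an error."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - error_responses) / total_requests


def count_nines(total_requests: int, error_responses: int, cap: int = 9) -> int:
    """Count the 9s reached, using integer math to avoid float rounding.

    An error-free rate of at least 1 - 10**-n means errors * 10**n <= total.
    The cap stops the loop when there are no errors at all.
    """
    nines = 0
    while nines < cap and error_responses * 10 ** (nines + 1) <= total_requests:
        nines += 1
    return nines


# Hypothetical weekly load balancer totals, not Twitch data.
week_total = 1_000_000
print(error_free_rate(week_total, 1_000), count_nines(week_total, 1_000))
print(error_free_rate(week_total, 100), count_nines(week_total, 100))
```

With these made-up totals, 1,000 errors per million requests is 3 9s and 100 errors per million is 4 9s. The same calculation applied to CDN-to-browser counts would give deliverability.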
Sage runs in over a dozen AWS data centers around the world, fronted by hundreds of CDN points-of-presence. Last year, it came to our attention that Sage’s availability had dropped slightly below 99.9% one day in March. 3 9s of availability is a minimum expectation of all critical Twitch services. When that happened a second time, we took the short-term action of tripling servers in the affected regions while we performed a deep analysis and decided what to do on a long-term basis.
Our analysis revealed several opportunities for improvement, and our engineers proposed a number of innovative ideas for increasing resilience they had either prototyped or were eager to try. We believed we could obtain an order of magnitude improvement, and proceeded to develop next-generation availability defenses for Sage.
Threat Modeling: Treating Availability like Security
The chart below shows what a denial of service attack on Sage looks like in one region: a burst of requests at several times the normal load. Distributed denial-of-service (DDoS) attacks hit multiple regions simultaneously. Surges due to a popular event can last hours or even days.
Sage Requests during a Denial of Service attack
Not all availability threats are hostile. Traffic can increase for perfectly innocent reasons, like growth, or surge during an unanticipated, wildly popular event. Regardless of intent, a challenge to availability means war!
Medieval castles are a good analogy for visualizing availability defenses, so we’ll be relating our principles to castle defense. While that might seem more appropriate for a security discussion, you can’t separate availability from security, and many tools for security design work equally well for availability. Chief among them is threat modeling: We first work backwards from service disruptions, identifying what can cause or contribute to a failure. Then we design mitigations for those vulnerabilities.
Defense-in-Depth: Multiple Layers of Defense
Castles practiced defense-in-depth, the technique of stacking multiple layers of defense to wear down and demoralize attackers. Some of the defenses were obvious, like the high, thick stone castle walls. As attackers made it past one formidable defense, they ran into another one. Penetrating a castle meant overcoming an overwhelming number of defenses that could include natural features, the surrounding town, moat, drawbridge, portcullis, archers, battlements, murder holes, outer walls, inner walls, and the keep tower. Defense-in-depth was effective in medieval times, and it remains a respected defense strategy today.
Defenses for a Medieval Castle
Defense-in-depth is essential for availability, since we can’t rely on any one mechanism to fully protect us. We leverage multiple, redundant layers of defense for superior protection. When an attacker penetrates one layer of defense, there’s another one behind it, wearing down the attacker’s resolve and resources to the point where further attack is not worth it. Each of our defense mechanisms is part of a defense-in-depth mosaic.
Thinking End-to-End: CDN Defenses
Before attackers reached a castle, they had to first get past the outer defenses. Castles took advantage of natural features of the surrounding area such as cliffs and rivers. Castles situated on hills or mountains could more easily detect invaders, and it was harder for the enemy to move soldiers and weapons into position. The surrounding town was an integrated part of the defenses. The water-filled moat surrounding the castle discouraged tunneling.
The CDN is our first opportunity to recognize hostile traffic and do something about it. For Sage, the CDN recognizes and filters known unwanted bots while passing on friendly bots such as search engine crawlers. During a prolonged attack, if we’re able to identify distinguishing characteristics of attacker traffic, we can configure the CDN to intercept it.
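The filtering decision described above can be sketched in a few lines. This is only an illustration in Python: real CDNs are configured in their own rule languages, and the `EdgeFilter` class, the bot names, and the request fields here are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EdgeFilter:
    """Hypothetical edge rule set: pass friendly bots, block known bad ones."""
    friendly_bots: set
    blocked_bots: set
    # Predicates added during an attack once attacker traffic can be recognized.
    attack_signatures: list = field(default_factory=list)

    def add_signature(self, matches: Callable[[dict], bool]) -> None:
        self.attack_signatures.append(matches)

    def decide(self, request: dict) -> str:
        agent = request.get("user_agent", "").lower()
        if any(name in agent for name in self.friendly_bots):
            return "pass"
        if any(name in agent for name in self.blocked_bots):
            return "block"
        if any(matches(request) for matches in self.attack_signatures):
            return "block"
        return "pass"


edge = EdgeFilter(friendly_bots={"googlebot"}, blocked_bots={"badscraper"})
print(edge.decide({"user_agent": "BadScraper/1.0"}))
```

User agent strings are easy to spoof, so a production rule would likely combine them with other signals; the sketch only shows the shape of the decision.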
When a castle was under siege, the attackers isolated it, cutting off its food and water supply. Well-supplied castles had the reserves to wait out a lengthy siege. The 1266 siege of Kenilworth Castle lasted 6 months before the occupants ran out of food and surrendered. Cities could hold off even longer: the 1648-1669 Siege of Candia lasted 21 years!
A service owner must likewise consider what to do when starved of resources. Our senior systems engineer, James Hartshorn, devised a facility we call Sage Origin Backup. Sage and the CDN don’t normally cache index.html because new releases are frequent. However, we do cache the latest release of index.html for failover purposes. When Sage is unavailable and responds to a request with a 5xx error, the CDN steps in and serves a stale index.html in its place.
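The failover behavior can be modeled as a small wrapper around the origin. This is a minimal sketch of the serve-stale-on-error idea, not the actual Sage Origin Backup implementation; the `OriginBackup` class and the `fetch_from_origin` callable are assumptions made for illustration.

```python
class OriginBackup:
    """Keep the last good index.html and serve it when the origin fails."""

    def __init__(self, fetch_from_origin):
        # fetch_from_origin is a hypothetical callable returning (status, body).
        self._fetch = fetch_from_origin
        self._last_good = None

    def serve(self):
        try:
            status, body = self._fetch()
        except ConnectionError:
            # Treat an unreachable origin like a 5xx response.
            status, body = 503, ""
        if status < 500:
            if status == 200:
                self._last_good = body  # remember the latest release
            return status, body, "origin"
        if self._last_good is not None:
            return 200, self._last_good, "stale-backup"
        return status, body, "origin-error"  # nothing cached yet


responses = iter([(200, "<html>release 41</html>"), (503, "")])
backup = OriginBackup(lambda: next(responses))
print(backup.serve())
print(backup.serve())
```

In the example, the second request gets the previously cached page even though the origin answered with a 503. The trade-off is that users may briefly see an older release, which is usually preferable to an error page.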
The Outer Walls: Overprovisioning Is Your Standing Defense
The most obvious layer of castle defense is its thick stone wall, also called a curtain wall. Everyone knows the importance of castle walls. It’s your standing defense, the foundation of what protects you. Visually intimidating, the wall was a constant reminder to would-be attackers of the castle’s integrity.
The service equivalent of a castle wall is our provisioning level, the amount of capacity in each region. When an attack comes, suddenly and without warning, our standing defense needs to be adequate. We allocate capacity not for standard load but for standard attack load.
Sage had traditionally been running on hundreds of servers in over a dozen AWS data centers around the world, all highly over-provisioned to guard against denial of service attacks. Each region had the same capacity. While that arrangement had been working well for years, some things had changed over time. How much did load vary around the world? Were we even in the right regions?
We sought to better understand our traffic and load. Analysis revealed a wide variance in regional load: our busiest region saw 24 times the traffic of our least-busy region! Moreover, some regions were more prone to denial of service attacks than others. That meant our “thick skin” of protection was not as thick as we thought it was everywhere. It was also uncomfortable to have 45% of all traffic routed through just 2 data centers: while there was no capacity concern, a full regional outage could impact a large portion of our users.
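To see why equal capacity per region gives unequal protection, consider this small sketch. All numbers are hypothetical (only the 24x ratio between the busiest and least-busy region comes from the analysis above), and the sizing rule is an assumed illustration of proportional provisioning, not our production formula.

```python
import math


def proportional_servers(peak_rps, headroom, rps_per_server, min_servers=2):
    """Servers per region sized to that region's peak load times a headroom factor."""
    return {
        region: max(min_servers, math.ceil(peak * headroom / rps_per_server))
        for region, peak in peak_rps.items()
    }


def headroom_by_region(servers, peak_rps, rps_per_server):
    """How many multiples of its own peak load each region can absorb."""
    return {
        region: servers[region] * rps_per_server / peak_rps[region]
        for region in peak_rps
    }


# Hypothetical peaks: the busy region sees 24x the quiet region's traffic.
peak_rps = {"busy": 2400, "quiet": 100}
uniform = {"busy": 40, "quiet": 40}
proportional = proportional_servers(peak_rps, headroom=3, rps_per_server=100)

print(headroom_by_region(uniform, peak_rps, rps_per_server=100))
print(headroom_by_region(proportional, peak_rps, rps_per_server=100))
```

With the uniform allocation, the quiet region can absorb 40 times its peak while the busy region can absorb less than twice its peak. The proportional allocation gives both regions the same multiple of headroom.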
The resultant heat map of requests per hour by region allowed us to profile each region’s load. We also studied requests by minute to understand the intensity of surges.
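A short burst can nearly disappear in hourly totals, which is why per-minute data matters. The sketch below uses made-up request counts and a hypothetical `surge_intensity` measure (peak minute divided by median minute) to show the difference.

```python
from statistics import median


def hourly_totals(requests_per_minute):
    """Sum per-minute counts into per-hour buckets."""
    return [
        sum(requests_per_minute[i:i + 60])
        for i in range(0, len(requests_per_minute), 60)
    ]


def surge_intensity(requests_per_minute):
    """Peak minute divided by the median minute."""
    baseline = median(requests_per_minute)
    if baseline <= 0:
        raise ValueError("median load must be positive")
    return max(requests_per_minute) / baseline


# Made-up data: a normal hour, then an hour containing a 5-minute burst.
quiet_hour = [1000] * 60
burst_hour = [1000] * 55 + [10000] * 5

totals = hourly_totals(quiet_hour + burst_hour)
print(totals[1] / totals[0])        # hourly view of the burst
print(surge_intensity(burst_hour))  # per-minute view of the burst
```

In this example the hourly view shows the burst hour at 1.75 times the normal hour, while the per-minute view shows a peak of 10 times the baseline.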
More details From Twitch Blog – Defend Your Castle