Breaking Production for Fun and Profit

Running Game Days at Stripe

Dan Frank (@danielhfrank)
Danielle Sucher (@DanielleSucher)
Franklin Hu (@thisisfranklin)

the best way to accept payments online

Game Days

injecting failures into running applications

Who are we?

Happy Captain Picard Day!

Why do we run game days?

Things might blow up!

explosions are better when we're awake

If you die in prod...

Controlled experiments, not chaos

An investment in reliability

C'mon, let's break stuff!

There are so many failure modes we could try!

Single node failure

Process failure (`kill -STOP $PID`, `svc -d svcname`)
Machine failure
Machine reboot (`echo b > /proc/sysrq-trigger`)
Disk failure
Disk degradation
Disk full
Out of file descriptors
Network degradation
CPU degradation
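
A minimal sketch of the process-failure case, tried on a throwaway `sleep` process rather than a real service. It assumes a Linux host, since it reads `/proc/<pid>/status`, and the helper name `proc_state` is made up for this example.

```shell
#!/bin/sh
# Sketch: freeze and thaw a throwaway process to imitate a hung service.
# The `sleep` process is a stand-in; nothing here touches a real service.

# proc_state PID: print the one-letter state from /proc (T = stopped).
proc_state() {
  while read -r key value rest; do
    if [ "$key" = "State:" ]; then
      echo "$value"
      return 0
    fi
  done < "/proc/$1/status"
}

sleep 300 &
victim=$!

kill -STOP "$victim"   # the process keeps its PID but stops being scheduled
sleep 0.2
stopped_state=$(proc_state "$victim")

kill -CONT "$victim"   # let it run again
sleep 0.2
resumed_state=$(proc_state "$victim")

# Clean up the stand-in process.
kill "$victim"
wait "$victim" 2>/dev/null || true

echo "while stopped: $stopped_state, after resume: $resumed_state"
```

A stopped process is a gentler first test than `kill -9`: clients see it hang rather than disappear, and `kill -CONT` undoes it.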

Correlated node failures

Machine failures
Disk failures
Disk degradations
Disks full
Network degradations
CPU degradations

Network partition

Between availability zones
Between services
Within services

Misbehaving users

Denial of Service
Malicious input

Database failures

Autoincrement field exhausted
Indexes missing
Healthchecks fail?

Non-routine operations

Launching instances
Terminating instances

Okay, but which failure modes can we test easily?

A few easily tested examples

Process failure
Machine failure
Network partition between services
Denial of Service
`sudo rm -rf /`

low-risk tests at first

Prepare hypotheses in advance
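
One way to keep yourself honest is to write each expectation down before the test and compare it with what you observed afterwards. The sketch below is only an illustration: `check_hypothesis` and the values passed to it are hypothetical, not a tool from the talk.

```shell
#!/bin/sh
# Sketch: record expectations first, then compare them with what happened.
# The hypothesis names and values passed in are illustrative only.

# check_hypothesis NAME EXPECTED OBSERVED
# Prints one line per hypothesis; returns non-zero when reality disagreed.
check_hypothesis() {
  if [ "$2" = "$3" ]; then
    echo "confirmed: $1 ($2)"
  else
    echo "surprise:  $1 (expected $2, observed $3)"
    return 1
  fi
}
```

For example, `check_hypothesis latency fine fine` reports a confirmed hypothesis, while `check_hypothesis alerting email page` reports a surprise worth a follow-up.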

Warn people!

Longbridge road works - sign - Danger! Open excavations (CC BY 2.0)

Planning experiments

instance failure

hypotheses

Latency should be fine
Alerting: no pages, only emails

Latency was fine, yay!

Do we really wanna get paged over this?

network partition

hypotheses

Charges should go through successfully, just without being scored
Latency should be fine
Alerting: We should get paged!

partition the network


            iptables -A ... DROP

Oh noes, latency spiked!

repair the network


            iptables -D ... DROP
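
A sketch of wrapping the partition and repair steps so the rule is always removed, even if the script is interrupted. The rule here (outbound TCP to port 9999) is a placeholder and not the rule elided above; the function names and the `DRY_RUN` switch are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: hold a DROP rule for a fixed window and always remove it afterwards.
# RULE is a made-up example (outbound TCP to port 9999), NOT the rule elided
# on the slides. DRY_RUN defaults to 1, which only prints the commands.

DRY_RUN="${DRY_RUN:-1}"
RULE="OUTPUT -p tcp --dport 9999 -j DROP"

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# $RULE is left unquoted on purpose so it splits into separate arguments.
partition() { run iptables -A $RULE; }
repair()    { run iptables -D $RULE; }

# with_partition SECONDS: partition, wait, then repair.
# The traps make sure the rule is removed if the script exits or is
# interrupted while the partition is in place.
with_partition() {
  trap repair EXIT
  trap 'exit 1' INT TERM
  partition
  sleep "$1"
  repair
  trap - EXIT INT TERM
}
```

With the default dry run, `with_partition 60` prints the two `iptables` commands a minute apart. Setting `DRY_RUN=0` and running as root would apply them for real.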

Better timeouts would be nice...
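
To illustrate what a tighter timeout buys, here is a sketch that bounds a call to a slow dependency and falls back to "unscored" when the deadline passes. The scorer is a fake (a `sleep` followed by an `echo`), and `score_with_deadline` is a hypothetical name; it assumes the `timeout` utility from coreutils is available.

```shell
#!/bin/sh
# Sketch: bound a call to a slow dependency and degrade instead of hanging.
# The "scorer" is a fake: it sleeps for a while, then prints a made-up score.

# score_with_deadline DEADLINE_SECONDS SCORER_DELAY_SECONDS
# Prints the scorer's answer if it arrives within the deadline,
# otherwise prints "unscored" so the caller can carry on.
score_with_deadline() {
  if result=$(timeout "$1" sh -c "sleep $2; echo score=0.1"); then
    echo "$result"
  else
    echo "unscored"
  fi
}
```

`score_with_deadline 1 0` prints the fake score, while `score_with_deadline 1 3` gives up after a second and prints `unscored`, which mirrors the hypothesis that charges should still go through without being scored.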

let's kill our redis cluster's primary node


            kill -9 $REDIS_PID

A little healthy fear

hypotheses

One of the secondary nodes should be promoted to primary
When the old primary node comes back up, it should come up as a secondary
Scoring should continue normally
Unless things don't go as above, no one should get paged
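
The first two hypotheses can be checked by asking each node for its role before and after the kill. This is a sketch that assumes `redis-cli` is installed; the function names are made up for this example, and Redis itself reports the roles as `master` and `slave` rather than primary and secondary.

```shell
#!/bin/sh
# Sketch: read a node's replication role from `INFO replication` output.
# Function names are illustrative; host and port come from the caller.

# role_from_info: read INFO replication text on stdin, print the role value.
# redis-cli output uses CRLF line endings, so strip the carriage returns.
role_from_info() {
  tr -d '\r' | sed -n 's/^role:\(.*\)$/\1/p'
}

# redis_role HOST PORT: ask a live node for its role (needs redis-cli).
redis_role() {
  redis-cli -h "$1" -p "$2" info replication | role_from_info
}
```

Calling `redis_role` against every node before the test and again after the old primary restarts gives a quick before-and-after view of which node holds which role.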

Where's our data?!

time to clean up

What happened?!

Well, we'd recently turned off snapshotting on our primary node...

failover didn't happen...

Franklin: With Game Days, you may think you're testing the failure properties of an isolated system, but you'd be wrong. Until you've exercised these paths, you have no idea what might happen or how failure in one system may spill over to another.

df: Because of this unpredictability, you need to broadcast that the exercise is happening, what to expect, and where to report escalated problems; a dedicated Slack channel is a good idea. Tell the engineering org and the people on call, but also support, account managers (AMs), and so on. And of course, grab the pager for the relevant rotation.

d: To make this work, it's important to establish a culture around these exercises. You can't take the rest of the org by surprise by dropping prod databases, or even by just announcing that you'll do so; make sure everyone understands why we're running the exercises, so they're on board.

Franklin: Now, if you ever tell anyone outside the company you run these exercises, your reputation will be ruined forever! Especially if your customers are trusting you with processing their money!

df: Just kidding! Writing publicly about the lessons learned from these exercises shows the world that you care about reliability, and it can lead to wider discussion of issues in the underlying systems. In our case, speaking about the behavior we saw with Redis led to a wide (if a little one-sided) discussion of these characteristics on Twitter. The behavior was confirmed, and lots of other people learned about it. Even better, we got some Redis developers into the mix, looking not just at the behavior on restart but also at the underlying source of snapshotting latency. Fixes were proposed (pushed?).

d: Publicizing your practices and values lets other engineers know what you're all about, and that can make you some new friends! Because of this, we got to hire more awesome people who are great at breaking stuff!

df: And we get to travel to cool places like Portland! Maybe we'll make some more friends while we're here...
