Breaking
Production for Fun and Profit

Running Game Days at Stripe

Dan Frank (@danielhfrank)
Danielle Sucher (@DanielleSucher)
Franklin Hu (@thisisfranklin)

Stripe!

the best way to accept payments online

Game Days

injecting failures

into running applications

Who are we?

Happy Captain Picard Day!

Captain Picard Day

Why do we run

game days?

Things might blow up!

explosions are better

when we're awake

If you die in prod...

Why test in prod?

Controlled experiments,

not chaos

An investment in

reliability

C'mon, let's break stuff!

There are so many

failure modes

we could try!

Single node failure

  • Process failure (`kill -STOP`, `svc -d svcname`)
  • Machine failure
  • Machine reboot (`echo b > /proc/sysrq-trigger`)
  • Disk failure
  • Disk degradation
  • Disk full
  • Out of file descriptors
  • Network degradation
  • CPU degradation
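
As a small illustration of the process-failure item, a process can be frozen with `SIGSTOP` and later resumed with `SIGCONT`, which imitates a hung (rather than crashed) process. This is a minimal sketch, not the tooling used at Stripe; `pause_and_resume` is a hypothetical helper name, and the observation command is whatever check you want to run while the target is frozen.

```shell
#!/usr/bin/env bash
# Hypothetical helper: freeze a process, run an observation command,
# then always resume the process and report the observation's exit status.
pause_and_resume() {
  local pid="$1"
  shift
  # SIGSTOP cannot be caught or ignored, so the target stops executing.
  kill -STOP "$pid" || return 1
  # Run the caller-supplied observation step while the target is frozen.
  "$@"
  local rc=$?
  # Resume the target regardless of how the observation went.
  kill -CONT "$pid"
  return "$rc"
}
```

For example, `pause_and_resume "$PID" sleep 30` holds the process frozen for 30 seconds while you watch dashboards and alerts, then lets it continue.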

Correlated node failures

  • Machine failures
  • Disk failures
  • Disk degradations
  • Disks full
  • Network degradations
  • CPU degradations

Network partition

  • Between availability zones
  • Between services
  • Within services

Misbehaving users

  • Denial of Service
  • Malicious input

Database failures

  • Autoincrement field exhausted
  • Indexes missing
  • Healthchecks fail?

Non-routine operations

  • Launching instances
  • Terminating instances

Okay, but which

failure modes

can we test

easily?

A few easily tested examples

  • Process failure
  • Machine failure
  • Network partition between services
  • Denial of Service
  • `sudo rm -rf /`

low-risk tests at first

out on a limb

Prepare hypotheses

in advance

Warn people!

Longbridge road works - sign - Danger! Open excavations (CC BY 2.0)

Planning experiments

Planning spreadsheet
healthy network

instance failure

terminate an instance

hypotheses

  • Latency should be fine
  • Alerting: no pages, only emails
exTERMINATE

Latency was fine, yay!

Do we really wanna get paged over this?

network partition

hypotheses

  • Charges should go through successfully, just without being scored
  • Latency should be fine
  • Alerting: We should get paged!

partition the network


            iptables -A ... DROP
          

Oh noes, latency spiked!

repair the network


            iptables -D ... DROP
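
The partition and repair commands above elide the rule details, and this sketch does not try to recover them. It only shows one common shape such a pair can take, assuming the goal is to drop outbound traffic to a single downstream host. `partition_rule` is a hypothetical helper, `192.0.2.10` is a documentation-range placeholder address, and the helper prints the command instead of running it so the add and delete forms can be reviewed side by side.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print (do not execute) an iptables command
# that adds or removes a DROP rule for outbound traffic to one host.
partition_rule() {
  local action="$1" dest="$2"
  case "$action" in
    add)
      # -A appends the rule to the OUTPUT chain.
      echo "iptables -A OUTPUT -d $dest -j DROP"
      ;;
    remove)
      # -D deletes the rule matching the same specification.
      echo "iptables -D OUTPUT -d $dest -j DROP"
      ;;
    *)
      echo "usage: partition_rule add|remove <dest-ip>" >&2
      return 2
      ;;
  esac
}

# Placeholder address from the TEST-NET-1 documentation range (an assumption).
partition_rule add 192.0.2.10
partition_rule remove 192.0.2.10
```

Printing first makes it easy to put the matching `-D` command into the plan before the `-A` command is ever applied.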
          

Better timeouts would be nice...
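
One way to act on that is to bound the scoring call and fall back when it does not answer in time, so the charge can proceed unscored as the hypothesis expected. This is a minimal sketch, assuming the GNU coreutils `timeout` command is available; `score_or_skip` and the `unscored` marker are hypothetical names, not part of Stripe's system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a scoring command with a time limit.
# Prints the command's output on success, or "unscored" if it fails
# or does not finish within the limit.
score_or_skip() {
  local limit="$1"
  shift
  local result
  # timeout exits non-zero (124) when the limit is reached.
  if result="$(timeout "$limit" "$@")"; then
    echo "$result"
  else
    echo "unscored"
  fi
}
```

With a limit well below the caller's own latency budget, a partitioned scoring service costs at most that limit per request instead of stalling the charge.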

let's kill

our redis cluster's

primary node


            kill -9 $REDIS_PID
          

A little

healthy fear

hypotheses

  • One of the secondary nodes should be promoted to primary
  • When the old primary node comes back up, it should come up as a secondary
  • Scoring should continue normally
  • Unless things don't go as above, no one should get paged
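
The first two hypotheses can be checked by asking each node for its role before and after the kill. The sketch below assumes `redis-cli` is installed and parses the `role` field of `INFO replication`, which Redis reports as `master` or `slave` (the primary and secondary of this talk). `redis_role` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `INFO replication` output on stdin and
# print the value of its role field.
redis_role() {
  # Redis separates INFO lines with CRLF, so strip carriage returns first.
  tr -d '\r' | awk -F: '$1 == "role" { print $2 }'
}
```

A possible invocation is `redis-cli -h "$HOST" info replication | redis_role`, where `$HOST` is a placeholder for each node in the cluster.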

Where's our data?!

Covering up the bang (CC BY 2.0)

time to clean up

Cleanup

What happened?!

Well, we'd recently turned off snapshotting on our primary node...
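
A pre-flight check for this can be sketched as follows. It assumes the output of `redis-cli config get save`, which prints the parameter name on one line and its value on the next; an empty value means RDB snapshotting is disabled. `snapshotting_enabled` is a hypothetical helper name, and this is an illustration, not what the speakers ran.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `CONFIG GET save` output on stdin and succeed
# only if the save value (second line) is non-empty.
snapshotting_enabled() {
  local value
  value="$(sed -n '2p' | tr -d '\r')"
  [ -n "$value" ]
}
```

A possible invocation is `redis-cli -h "$HOST" config get save | snapshotting_enabled || echo "snapshots are off"`, with `$HOST` as a placeholder for the node being checked.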

Redis bad setup

failover didn't happen...

Redis primary rising from the dead
Surprise!
Bad [database] romance
Shhhh! We get scared (CC BY 2.0)
