Production for Fun and Profit

Running Game Days at Stripe

Dan Frank (@danielhfrank)
Danielle Sucher (@DanielleSucher)
Franklin Hu (@thisisfranklin)


Game Days

injecting failures

into running applications

Who are we?

Why do we run

game days?

Things might blow up!

explosions are better

when we're awake

If you die in prod...

Why test in prod?

Controlled experiments,

not chaos

An investment in


C'mon, let's break stuff!


There are so many

failure modes

we could try!

Single node failure

  • Process failure (`kill -STOP`, `svc -d svcname`)
  • Machine failure
  • Machine reboot (`echo b > /proc/sysrq-trigger`)
  • Disk failure
  • Disk degradation
  • Disk full
  • Out of file descriptors
  • Network degradation
  • CPU degradation
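As a rough illustration of the process-failure item above, a hung process can be simulated by stopping it for a while and then resuming it. This is only a sketch: `pause_for` is a hypothetical helper name, not something from the talk, and it assumes you are allowed to signal the target PID.

```shell
# Hypothetical helper: freeze a process, wait, then let it continue.
# Usage: pause_for PID SECONDS
pause_for() {
  pid=$1
  seconds=$2
  # SIGSTOP cannot be caught or ignored, so the process looks hung to its callers.
  kill -STOP "$pid" || return 1
  sleep "$seconds"
  # SIGCONT resumes the process where it left off.
  kill -CONT "$pid"
}
```

For example, `pause_for 12345 30` would freeze PID 12345 for thirty seconds. Unlike killing the process, this leaves its sockets open, so it tends to exercise client timeouts rather than connection-refused handling.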

Correlated node failures

  • Machine failures
  • Disk failures
  • Disk degradations
  • Disks full
  • Network degradations
  • CPU degradations

Network partition

  • Between availability zones
  • Between services
  • Within services

Misbehaving users

  • Denial of Service
  • Malicious input

Database failures

  • Autoincrement field exhausted
  • Indexes missing
  • Healthchecks fail?

Non-routine operations

  • Launching instances
  • Terminating instances

Okay, but which

failure modes

can we test


A few easily tested examples

  • Process failure
  • Machine failure
  • Network partition between services
  • Denial of Service
  • `sudo rm -rf /`

low-risk tests at first

out on a limb

Prepare hypotheses

in advance

Warn people!

Planning experiments

Planning spreadsheet
healthy network

instance failure

terminate an instance


  • Latency should be fine
  • Alerting: no pages, only emails

Latency was fine, yay!

Latency was fine

Do we really wanna get paged over this?

network partition


  • Charges should go through successfully, just without being scored
  • Latency should be fine
  • Alerting: We should get paged!

partition the network

            iptables -A ... DROP

Oh noes, latency spiked!

Latency spiked

repair the network

            iptables -D ... DROP
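One way to make sure the repair step always mirrors the partition step is to wrap both in a single function that takes the rule once. The sketch below is an assumption about how that could look, not the tooling used in the talk: `ipt`, `run_partition`, and `DRY_RUN` are hypothetical names, and the rule in the usage note is made up for illustration. With `DRY_RUN=1` (the default here) it only prints the commands.

```shell
# Hypothetical dry-run switch: 1 prints the commands, 0 runs iptables for real.
DRY_RUN=${DRY_RUN:-1}

ipt() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "iptables $*"
  else
    iptables "$@"
  fi
}

# Usage: run_partition SECONDS RULE_SPEC...
# RULE_SPEC is everything that would follow -A, supplied by the caller.
run_partition() {
  seconds=$1
  shift
  # Append the rule to start the partition.
  ipt -A "$@" || return 1
  sleep "$seconds"
  # Delete the same rule spec to repair the network.
  ipt -D "$@"
}
```

For example, `run_partition 60 INPUT -s 10.0.0.5 -j DROP` (an invented rule) prints the `-A` command, waits a minute, then prints the matching `-D` command. Passing the rule spec only once avoids the failure mode where the delete does not match what was added.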

Better timeouts would be nice...
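To make the timeout idea concrete, here is a minimal sketch of the behavior the hypothesis asked for: bound the scoring call with a deadline and fall back to an unscored result if it does not answer in time. The function name and the `scored:`/`unscored` output format are assumptions for illustration, and the scorer is whatever external command the caller passes in. It relies on the coreutils `timeout` command.

```shell
# Usage: score_with_deadline SECONDS COMMAND [ARGS...]
# COMMAND is a stand-in for the call to the scoring service. It must be an
# external command, since `timeout` cannot run shell functions.
score_with_deadline() {
  deadline=$1
  shift
  # `timeout` kills the command and exits non-zero once the deadline passes.
  if result=$(timeout "$deadline" "$@"); then
    echo "scored:$result"
  else
    # Scorer was too slow or failed: carry on without a score.
    echo "unscored"
  fi
}
```

With a deadline like this, a partitioned scorer costs each request at most the deadline rather than however long the connection takes to give up.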

let's kill

our redis cluster's

primary node

            kill -9 $REDIS_PID

A little

healthy fear


  • One of the secondary nodes should be promoted to primary
  • When the old primary node comes back up, it should come up as a secondary
  • Scoring should continue normally
  • No one should get paged unless one of the above fails
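The first two hypotheses above can be checked mechanically by asking each node for its role before and after the kill. The sketch below assumes `redis-cli` is available and parses the `role:` field of `INFO replication`, which Redis reports as `master` or `slave` (primary and secondary in this deck's terms). The helper names are hypothetical.

```shell
# Read `INFO replication` output on stdin and print the node's role.
# Redis terminates lines with CRLF, so strip the carriage returns first.
parse_role() {
  tr -d '\r' | sed -n 's/^role://p'
}

# Hypothetical wrapper: node_role HOST [PORT]
node_role() {
  redis-cli -h "$1" -p "${2:-6379}" INFO replication | parse_role
}

# Count how many of the given roles are "master".
# The failover hypothesis expects exactly one, before and after.
count_primaries() {
  n=0
  for role in "$@"; do
    if [ "$role" = "master" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

Collecting `node_role` for every node and passing the results to `count_primaries` gives a quick sanity check: 0 means no promotion happened, and 2 means the old primary came back still believing it was primary.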

Where's our data?!

time to clean up


What happened?!

Well, we'd recently turned off snapshotting on our primary node...

Redis bad setup

failover didn't happen...

Redis primary rising from the dead
Bad [database] romance
