Fast and Simple: Observing Code and Infra Deployments at Honeycomb

A presentation at Kong Destination:Automation 2020 in July 2020 by Shelby Spees

Slide 1

Fast and Simple: Observing Code and Infra Deployments at Honeycomb. Liz Fong-Jones & Shelby Spees (@lizthegrey & @shelbyspees). #Automate2020, July 16, 2020. With illustrations by @emilywithcurls!

Slide 2

Observability is evolving quickly. 2 @lizthegrey & @shelbyspees at #Automate2020

Slide 3

We need velocity and reliability.

Slide 4

A dozen engineers build Honeycomb.

Slide 5

We make systems humane to run,

Slide 6

by ingesting telemetry,

Slide 7

enabling data exploration,

Slide 8

and empowering engineers.

Slide 9

Yes, we deploy on Fridays.

Slide 10

Slide 11

Continuous delivery is an investment.

Slide 12

How did we get there?

Slide 13

Investment in tooling paid off,

Slide 14

but we didn’t need all the soup.

Slide 15

We needed to be thoughtful.

Slide 16

and we needed cultural process too!

Slide 17

Continuously evaluate tech debt.

Slide 18

and improve where it matters.

Slide 19

Speed up product development.

Slide 20

and infrastructure safety.

Slide 21

Embrace risk, but mitigate it.

Slide 22

and never stop improving.

Slide 23

Shipping prod features

Slide 24

What’s our recipe?

Slide 25

Instrument as we code.

Slide 26

Functional and visual testing.

Slide 27

Slide 28

Design for feature flag deployment.

Slide 29

Automated integration.

Slide 30

CircleCI’s view of Honeycomb build & deploy

Slide 31

Honeycomb’s trace of Honeycomb build & deploy

Slide 32

Automated integration.

Slide 33

Human PR review.

Slide 34

Automated integration.

Slide 35

Green button merge.

Slide 36

Auto-updates, rollbacks, & pins.

Slide 37

Observe behavior in prod.

Slide 38

Prod: customers observe data.

Slide 39

Dogfood observes prod.

Slide 40

Image: adoption curve on SLOs

Slide 41

Kibble observes dogfood.

Slide 42

That’s how 12 engineers deploy 12x/day!

Slide 43

DORA data describes feedback loops.

Slide 44

Start with lead time. 10 min builds (x3 at worst), 1h for peer review, hourly push train = 3 hours to deploy a change.
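
The lead-time arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the constant names are made up for illustration, and it assumes the worst case means three full builds plus a full hour of waiting for the next push train. Under those assumptions the total stays within the roughly 3 hours the slide quotes.

```python
# Worst-case lead time, using the figures quoted on the slide.
# All names are illustrative; how the three builds are counted is an assumption.
BUILD_MINUTES = 10        # one CI build
MAX_BUILDS = 3            # "x3 at worst"
REVIEW_MINUTES = 60       # "1h for peer review"
PUSH_TRAIN_MINUTES = 60   # hourly push train: longest wait for the next train


def worst_case_lead_time_minutes() -> int:
    """Sum the build, review, and push-train waits for a single change."""
    return BUILD_MINUTES * MAX_BUILDS + REVIEW_MINUTES + PUSH_TRAIN_MINUTES


total = worst_case_lead_time_minutes()
print(f"{total} min = {total / 60:.1f} h")  # prints "150 min = 2.5 h"
```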

Slide 45

Start with lead time. Deploy frequency goes up. Builds go out every hour if there’s a change. 1-2 new commits per build artifact.

Slide 46

Start with lead time. Deploy frequency goes up. Change fail rate goes down. Increased confidence via testing. Flag-flip or fix-forward, not emergency rollback. 0.1% fail rate.

Slide 47

Start with lead time. Deploy frequency goes up. Change fail rate goes down. Time to restore goes down. Flag flip takes 30 seconds. Rollback to previous build takes <10 min. Fix-forward takes 20 min.

Slide 48

High productivity product engineering: Start with lead time (<3 hours). Deploy frequency goes up (hourly, >12x/day). Change fail rate goes down (<0.1%). Time to restore goes down (seconds to minutes).

Slide 49

But what about infra?

Slide 50

Infrastructure empowers product.

Slide 51

Kubernetes isn’t the goal.

Slide 52

Reliability and simplicity are.

Slide 53

Everyone starts somewhere.

Slide 54

Automate the painful parts.

Slide 55

Fix the duct tape!

Slide 56

Keep the environment clean.

Slide 57

How do we do it?

Slide 58

Raw VMs are simpler than containers.

Slide 59

Cold boot from Chef, use cron to sync.

Slide 60

Outsource utilities like blob storage.

Slide 61

Repeatable infrastructure with code.

Slide 62

Centralize state & locking.

Slide 63

Remove barriers to setup.

Slide 64

Diff and release in browser.

Slide 65

Slide 66

Remote run from git.

Slide 67

Slide 68

Sentinel guardrails.

Slide 69

Notify only on risky changes.

Slide 70

Deploy changes incrementally!

Slide 71

Use modern components.

Slide 72

Feature flags… for infra!

Slide 73

Ephemeral fleets & autoscaling.

Slide 74

Quarantine bad traffic.

Slide 75

Delete unused code & components.

Slide 76

Refactor continuously!

Slide 77

What if it goes wrong?

Slide 78

Slide 79

Slide 80

How broken is “too broken”?

Slide 81

Service Level Objectives define success.

Slide 82

SLOs are common language.

Slide 83

How many eligible events did we see?

Slide 84

HTTP Code 200? Latency < 100ms?

Slide 85

Availability: Good / Eligible Events

Slide 86

Use a window and target percentage.

Slide 87

Error budget: allowed unavailability
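
Slides 83 to 87 build up an SLO from individual events, so a small Python sketch may help tie the pieces together. It assumes each event carries an HTTP status and a latency, and treats "good" as status 200 with latency under 100 ms, as on slide 84. The `Event` class, the `eligible` flag, and the function names are hypothetical and are not Honeycomb's API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    status: int             # HTTP status code
    latency_ms: float       # request duration in milliseconds
    eligible: bool = True   # hypothetical flag: does this event count toward the SLO?


def is_good(e: Event) -> bool:
    """Slide 84's criteria: HTTP 200 and latency under 100 ms."""
    return e.status == 200 and e.latency_ms < 100


def availability(events) -> float:
    """Good events divided by eligible events, for events inside the SLO window."""
    eligible = [e for e in events if e.eligible]
    if not eligible:
        return 1.0  # assumption: no eligible traffic means nothing failed
    return sum(is_good(e) for e in eligible) / len(eligible)


def error_budget_events(eligible_count: int, target: float) -> float:
    """How many eligible events may fail in the window before the target is missed."""
    return eligible_count * (1 - target)


events = [Event(200, 20), Event(200, 250), Event(500, 10), Event(200, 40)]
print(availability(events))                           # prints 0.5
print(round(error_budget_events(1_000_000, 0.999)))   # prints 1000
```

With a 99.9% target, one million eligible events in the window leave room for about a thousand bad ones; that allowance is the error budget.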

Slide 88

Drive alerting with SLOs.
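
One common way to drive alerts from an SLO is to watch the burn rate: the observed failure fraction divided by the failure fraction the target allows. The Python sketch below illustrates that general idea only. It does not describe how Honeycomb implements its alerts, and the function names and the threshold of 2.0 are assumptions.

```python
def burn_rate(bad: int, eligible: int, target: float) -> float:
    """Speed at which the error budget is being spent; about 1.0 means exactly on budget."""
    if eligible == 0:
        return 0.0  # no traffic, so nothing is being burned
    return (bad / eligible) / (1 - target)


def should_alert(bad: int, eligible: int, target: float, threshold: float) -> bool:
    """Alert when the budget is burning at least `threshold` times faster than allowed."""
    return burn_rate(bad, eligible, target) >= threshold


# 50 bad events out of 10,000 against a 99.9% target burns roughly 5x too fast.
print(should_alert(50, 10_000, 0.999, threshold=2.0))  # prints True
# 5 bad events out of 10,000 is about half the allowed rate.
print(should_alert(5, 10_000, 0.999, threshold=2.0))   # prints False
```

Alerting on burn rate rather than on individual errors means a brief blip stays quiet while a sustained problem pages someone.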

Slide 89

We keep SLOs at Honeycomb.

Slide 90

We store incoming telemetry.

Slide 91

Default dashboards usually load in 1s.

Slide 92

Often, queries come back under 10s.

Slide 93

User Data Throughput

Slide 94

We dropped customer data.

Slide 95

but rolled back (at human speed)

Slide 96

We communicated to customers.

Slide 97

We’d burned triple our error budget.

Slide 98

We halted deploys.

Slide 99

How did this happen?

Slide 100

We checked in code that didn’t build. We had experimental CI build wiring. Our scripts deployed empty binaries. There was no health check & rollback.
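
The two gaps named on this slide, empty binaries and no health check with rollback, can be illustrated with a short Python sketch. It is a hypothetical guard, not Honeycomb's deploy script: `install` and `healthy` are caller-supplied stand-ins for whatever a real pipeline uses, and every name here is an assumption.

```python
import os
import tempfile
from typing import Callable


def artifact_looks_valid(path: str, min_bytes: int = 1) -> bool:
    """Refuse artifacts that are missing or empty."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes


def deploy_with_rollback(
    new_artifact: str,
    previous_artifact: str,
    install: Callable[[str], None],   # hypothetical: puts an artifact into service
    healthy: Callable[[], bool],      # hypothetical: post-deploy health check
) -> str:
    """Install the new artifact, then roll back if the health check fails."""
    if not artifact_looks_valid(new_artifact):
        return "rejected"
    install(new_artifact)
    if healthy():
        return "deployed"
    install(previous_artifact)
    return "rolled_back"


# An empty build is rejected before anything is installed.
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty-build")
    open(empty, "wb").close()
    installed = []
    print(deploy_with_rollback(empty, "previous-build", installed.append, lambda: True))
    # prints "rejected"
```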

Slide 101

We re-prioritized stability.

Slide 102

We mitigated the key risks,

Slide 103

then resumed building.

Slide 104

What’s ahead for us?

Slide 105

Be more reliable & scalable.

Slide 106

Launch new services easily.

Slide 107

Burn less money.

Slide 108

Continue modernizing & refactoring.

Slide 109

Sleep easily at night.

Slide 110

You can do this too, step by step.

Slide 111

Read our blog! hny.co/blog

Slide 112

Understand & control production. Go faster on stable infra. Manage risk and iterate. lizthegrey.com; @lizthegrey shelbyspees.com; @shelbyspees /liz