Getting Started with Observability for Chaos Engineering

A presentation at Chaos Carnival 2021 in February 2021 by Shelby Spees

Slide 1

Getting Started with Observability for Chaos Engineering
Illustrations by @emilywithcurls!

Slide 2

Shelby Spees
Developer Advocate at Honeycomb.io
@shelbyspees
© 2021 Hound Technology, Inc. All Rights Reserved.

Slide 3

Chaos at 3pm, not 3am
@shelbyspees at #ChaosCarnival2021

Slide 4

Experimenting in the dark is risky

Slide 5

We need observability!

What is observability? In software, the ability to understand and explain any state a system can get into, no matter how novel or bizarre, without deploying new code.

Slide 6

Learn more from chaos

Slide 7

Better account for business risks

Slide 8

Lots of observability tooling out there

Slide 9

Honeycomb is a system analytics tool

Slide 10

It works by ingesting your system telemetry

What is telemetry? Data gathered about the state of your system at runtime, often sent to external tooling for monitoring or analysis. From tele- (distance) + -metry (measure).

Slide 11

In the form of structured events

Capture
● runtime context
● benchmarking data

Enable
● trace visualizations
● asking novel questions

{
  "timestamp": "2018-03-20T00:47:25.339Z",
  "app.interesting_thing": "banana",
  "duration_ms": 772.446625,
  "handler.name": "main.hello",
  "handler.pattern": "/hello/",
  "handler.type": "http.HandlerFunc",
  "meta.beeline_version": "0.2.0",
  "meta.local_hostname": "cobbler.local",
  "meta.span_type": "root",
  "meta.type": "http_request",
  "name": "main.hello",
  "request.content_length": 0,
  "request.header.user_agent": "curl/7.54.0",
  "request.host": "localhost:8080",
  "request.http_version": "HTTP/1.1",
  "request.method": "GET",
  "request.path": "/hello/",
  "request.remote_addr": "127.0.0.1:60379",
  "response.status_code": 200,
  "service_name": "sample app",
  "trace.span_id": "9e4fe697-3ea9-48c9-b673-72d7ddf118a6",
  "trace.trace_id": "b64c89a9-7671-4732-bef1-9ef75ab831f6"
}
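As a rough illustration of the idea, here is a minimal sketch of an application building one structured event per unit of work and attaching custom context to it. The `handle_hello` handler and the `emit` helper are hypothetical names invented for this example, and printing a JSON line stands in for sending the event to an observability backend.

```python
import json
import time
from datetime import datetime, timezone


def emit(event: dict) -> str:
    """Serialize one event as a single JSON line (stand-in for sending it to a backend)."""
    line = json.dumps(event, sort_keys=True)
    print(line)
    return line


def handle_hello(path: str = "/hello/") -> dict:
    """Hypothetical request handler that records one wide event describing the request."""
    start = time.perf_counter()
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "name": "main.hello",
        "request.method": "GET",
        "request.path": path,
        "service_name": "sample app",
    }
    # Custom, app-specific context goes on the same event as the standard fields.
    event["app.interesting_thing"] = "banana"
    event["response.status_code"] = 200
    event["duration_ms"] = (time.perf_counter() - start) * 1000.0
    emit(event)
    return event


if __name__ == "__main__":
    handle_hello()
```

Keeping everything about one request on a single event is what makes it possible to slice by any field later.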

Slide 12

We index on individual fields for fast querying

Slide 13

How do we know we’re doing a good job?

Slide 14

Service Level Objectives (SLOs)
The API for your engineering team

Slide 15

SLOs are a common language

Slide 16

Think in terms of events in context

Slide 17

SLI: for each event, good or bad?

Slide 18

Use a window and target percentage
Of all eligible events in a 30-day window, X% should meet the threshold for “good”.
Example SLO: 99.9% of Home Page loads in the past 30 days were fast enough.
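One way to picture the window-and-target idea is the sketch below. It assumes each event has already been reduced to a `(timestamp, is_good)` pair; the function names and the default 99.9% target are illustrative choices, not any particular tool's API.

```python
from datetime import datetime, timedelta, timezone


def slo_compliance(events, now, window_days=30):
    """Return the fraction of events in the trailing window that were good.

    Each event is a (timestamp, is_good) pair; events outside the window
    are ignored.
    """
    cutoff = now - timedelta(days=window_days)
    in_window = [good for ts, good in events if cutoff <= ts <= now]
    if not in_window:
        return None  # no events in the window, so compliance is undefined
    return sum(in_window) / len(in_window)


def meets_slo(compliance, target=0.999):
    """True when measured compliance is at or above the target percentage."""
    return compliance is not None and compliance >= target
```

For example, 999 good events and 1 bad event inside the window give a compliance of 0.999, which meets a 99.9% target; a bad event from 45 days ago does not count.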

Slide 19

A good SLO barely keeps users happy

Slide 20

Error budget: allowed unavailability
(100% − 99.9% SLO target) × 1 million requests/month = 1,000 requests can fail/month
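The arithmetic behind the error budget can be sketched in a few lines. The budget is the share of requests the target allows to fail, so it is computed from one minus the target. The function names here are illustrative only.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests allowed to fail in the window for a given SLO target."""
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed: int) -> int:
    """Failures still allowed before the budget is spent (negative once it is blown)."""
    return error_budget(slo_target, total_requests) - failed
```

With a 99.9% target and 1 million requests per month, `error_budget(0.999, 1_000_000)` gives 1000; after 250 failures, 750 remain.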

Slide 21

When is it okay to take risks?

Slide 22

When is it not okay?

Slide 23

Ingest Latency SLO and Error Budget

Slide 24

Defining an SLO
Better data for better goal-setting

Slide 25

Instrumenting ingest code

Slide 26

Instrumenting ingest code

Slide 27

Their fault or our fault?

Slide 28

Capturing error context

Slide 29

Defining our SLI: eligible events
NOT(STARTS_WITH($request.header.user_agent, "collectd")),
NOT(STARTS_WITH($app.err, "deprecated endpoint")),
NOT(EQUALS($response.status_code, 401)),
NOT(EQUALS($response.status_code, 403)),
NOT(EQUALS($app.dropped, "their fault")),
EQUALS($request.endpoint, "batch"),
EQUALS($request.method, "POST"),
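To make the eligibility filter concrete, here is a sketch of the same conditions as a Python predicate over an event dictionary. It assumes the listed clauses are combined with a logical AND, which the slide does not show explicitly, and the `is_eligible` name is invented for this example.

```python
def is_eligible(event: dict) -> bool:
    """Apply the slide's filters (assumed to be ANDed): skip excluded traffic, keep batch POSTs."""
    user_agent = str(event.get("request.header.user_agent", ""))
    app_err = str(event.get("app.err", ""))
    status = event.get("response.status_code")
    return (
        not user_agent.startswith("collectd")
        and not app_err.startswith("deprecated endpoint")
        and status not in (401, 403)
        and event.get("app.dropped") != "their fault"
        and event.get("request.endpoint") == "batch"
        and event.get("request.method") == "POST"
    )
```

Events that fail this check are left out of the SLI entirely, so they count neither for nor against the error budget.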

Slide 30

Defining our SLI: good events
EQUALS($response.status_code, 200)
normalized_duration = duration_ms / (batch_size > 500 ? batch_size : 500)
IF(normalized_duration < 5)
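The "good event" rule can be sketched the same way. This assumes the status check and the latency check must both hold, and that `duration_ms` and `batch_size` are fields on the event; the function names are illustrative.

```python
def normalized_duration(duration_ms: float, batch_size: int) -> float:
    """Latency per event, treating batches smaller than 500 events as size 500."""
    return duration_ms / max(batch_size, 500)


def is_good(event: dict) -> bool:
    """Good means a 200 response and a normalized duration under 5."""
    if event.get("response.status_code") != 200:
        return False
    return normalized_duration(event["duration_ms"], event["batch_size"]) < 5
```

Flooring the divisor at 500 keeps small batches from looking artificially slow per event: a 1000 ms request carrying 100 events normalizes to 2.0, not 10.0.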

Slide 31

Ingest Latency SLO and Error Budget

Slide 32

Get Started with Observability
Prepare yourself for chaos

Slide 33

Choose an observability tool

Slide 34

Use OpenTelemetry!
● Vendor-neutral OSS instrumentation framework
● Java | C# | Go | JavaScript | Python | Rust | C++ | Erlang/Elixir | (and more!)
● Auto-instrumentation for HTTP and gRPC
Visit OpenTelemetry.io

Slide 35

Dip a toe in the water
● Start with auto-instrumentation (especially for tracing!)
● Choose one app or service in the critical path
● Send from your dev environment
● Deploy a canary branch to a subset of prod

Slide 36

Iterating your instrumentation
● Build on auto-instrumentation
● Start adding custom fields in your code
● Instrument where there’s risk (read: change)
● Set up distributed tracing

Slide 37

Experiment with Observability
Designing your experiment
● What’s my hypothesis about the system?
● Can I retrace the failure and resolution?
● Test the telemetry too!

Slide 38

Celebrate successes and failures.

Slide 39

Reach out!
honeycomb.io/shelby
@shelbyspees

Slide 40

Questions?

Slide 41

Experimenting in Prod

Slide 42

[Architecture diagram. Labels: Event batch, Single events, Partition queues (3), Indexing workers (3), Field indexes, S3]

Slide 43

[Architecture diagram. Labels: Event batch, Shepherd, Single events, Partition queues (3), Kafka + Zookeeper, Indexing workers (3), Retriever, Field indexes, S3]

Slide 44

Infrequent changes.

Slide 45

Long-running processes.

Slide 46

Data integrity and consistency.

Slide 47

Delicate failover dances

Slide 48

[Architecture diagram. Labels: Event batch, Single events, Partition queues (3), Indexing workers (3), Field indexes, S3]

Slide 49

[Architecture diagram. Labels: Event batch, Single events, Partition queues (3), Indexing workers (3), Field indexes, S3]

Slide 50

[Architecture diagram. Labels: Event batch, Single events, Partition queues (3), Indexing workers (3), Field indexes, S3]

Slide 51

[Architecture diagram. Labels: Event batch, Single events, Partition queues (3), Indexing workers (2), Indexing replay, Field indexes, S3]

Slide 52

Restart one server & service at a time.

Slide 53

Monitor for changes using SLIs.
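As one possible way to act on this during an experiment, the sketch below compares the SLI measured while the experiment runs against a baseline and flags a drop beyond a tolerance. The `should_abort` name, the boolean event representation, and the 0.001 default tolerance are all assumptions made for illustration.

```python
def sli(events) -> float:
    """Fraction of events judged good; events is a list of booleans."""
    return sum(events) / len(events) if events else 1.0


def should_abort(baseline, during, max_drop=0.001) -> bool:
    """Flag the experiment when the SLI drops by more than max_drop versus baseline."""
    return sli(baseline) - sli(during) > max_drop
```

For instance, a baseline of 1000 good events against an experiment window with 990 good and 10 bad events is a drop of 0.01, which exceeds the tolerance and signals a stop.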

Slide 54

[Architecture diagram. Labels: Event batch, Single events, Partition queues (3), Indexing workers (3), Field indexes, S3]

Slide 55

[Architecture diagram. Labels: Event batch, Single events, Partition queues (3), Indexing workers (3), Field indexes, S3]

Slide 56

[Diagram. Labels: Alerting worker, Zookeeper cluster, Alerting worker]

Slide 57

[Diagram. Labels: Alerting worker, Zookeeper cluster, Alerting worker]

Slide 58

[Diagram. Labels: Alerting worker, Zookeeper cluster, Alerting worker]

Slide 59

[Architecture diagram. Labels: Partition queues (3), Indexing workers (2 per partition queue), Single events, Field indexes, S3]

Slide 60

ARM64 hosts
Spot instances

Slide 61

Reach out!
honeycomb.io/shelby
@shelbyspees

Slide 62

Questions?