Observability for Software Teams

A presentation at DeveloperWeek 2021 in February 2021 by Shelby Spees

Slide 1

V3-21
Observability for Software Teams
Shelby Spees, Developer Advocate @ Honeycomb.io
Illustrations by @emilywithcurls!

Slide 2

Production is increasingly complex
© 2021 Hound Technology, Inc. All Rights Reserved. @shelbyspees at #DevWeek2021

Slide 3

Slide 4

Software teams
DevOps is more than “Dev” and “Ops”:
● application engineers
● platform engineers
● infrastructure engineers
● SREs
● SDETs
● support engineers
● and more!

Slide 5

We’ve come a long way

Slide 6

But we still encounter problems
emergent failure modes: small failures cascading together to degrade or take down a system
see also: how.complexsystems.fail

Slide 7

Getting paged can be scary

Slide 8

Our tools don’t answer our questions

Slide 9

We don’t make forward progress
42% of developer time is spent on dealing with bad code and tech debt
Source: The Developer Coefficient, Stripe, 2018

Slide 10

It doesn’t have to be this way
We have the technology!

Slide 11

How can we better interact with production?
Our applications have all the answers at runtime!
We want to
● capture those answers
● make that data available to interrogate
Big companies have done this
● at Facebook: Scuba
● at Google: Dapper

Slide 12

We need observability!
What is observability?
in software, the ability to understand and explain any state a system can get into, no matter how novel or bizarre, without deploying new code

Slide 13

How to gain observability?
Observability requires
● capturing telemetry data with lots of runtime context
● interacting with that telemetry in near-real time

Slide 14

Instrumentation generates telemetry
instrumentation: code or tooling that captures data about the state of your running system
telemetry: data generated by your system that documents its state
{
  "timestamp": "2018-03-20T00:47:25.339Z",
  "app.interesting_thing": "banana",
  "duration_ms": 772.446625,
  "handler.name": "main.hello",
  "handler.pattern": "/hello/",
  "handler.type": "http.HandlerFunc",
  "meta.beeline_version": "0.2.0",
  "meta.local_hostname": "cobbler.local",
  "meta.span_type": "root",
  "meta.type": "http_request",
  "name": "main.hello",
  "request.content_length": 0,
  "request.header.user_agent": "curl/7.54.0",
  "request.host": "localhost:8080",
  "request.http_version": "HTTP/1.1",
  "request.method": "GET",
  "request.path": "/hello/",
  "request.remote_addr": "127.0.0.1:60379",
  "response.status_code": 200,
  "service_name": "sample app",
  "trace.span_id": "9e4fe697-3ea9-48c9-b673-72d7ddf118a6",
  "trace.trace_id": "b64c89a9-7671-4732-bef1-9ef75ab831f6"
}

Slide 15

You’re already collecting data
You’ve probably instrumented your system to generate telemetry like
● metrics
● logs
● traces
That’s great!

Slide 16

But traditional approaches aren’t enough

Slide 17

Observability is for hard-to-debug problems
Distributed Systems: small change causing downstream effects?
Poor Performance: what part of the app is worth optimizing?
Subset of Traffic: only some users are complaining?

Slide 18

Metrics
great when thinking about infrastructure! but they:
● only answer known questions
● don’t capture application context
● don’t support high-cardinality use cases

Slide 19

Metrics have little context
https://www.tylervigen.com/spurious-correlations

Slide 20

Flat logs
● capturing output from the code itself
● no standard log formatting across libraries or services
● requires string parsing, which is expensive at scale
Centralized logging services are expensive and slow to query
● great for compliance, not for debugging

Slide 21

Distributed traces
Visualization of the stack trace across your distributed system
Traces are formed by a directed graph made up of objects called spans
● each span has a start time and a duration, and points to its parent span

Slide 22

distributed trace visualization in Jaeger UI

Slide 23

1.243s API call
What made it slow?

Slide 24

Structured events!

Slide 25

What are structured events?
structured logs + benchmarking + (optionally) tracing
{
  "Timestamp": "2018-07-03T04:57:12.517022Z",
  "duration_ms": 619.703,
  "request.method": "GET",
  "request.path": "/",
  "request.user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
  "response.status_code": 200,
  "trace.span_id": "3eada0ce-934b-4ffd-bb72-2c9f57b02bf1",
  "trace.parent_id": "aa23-3becd17d726-607d2150c-5397-4e8e",
  "trace.trace_id": "07d2150c-5397-4e8e-aa23-3becd17d7266",
  …
}

Slide 26

Keep all your runtime context together
No predefined index or schema: add fields as needed
Key-value pairs: less expensive to parse and query
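A minimal sketch of the "add fields as needed" idea, assuming an event is just a Ruby Hash serialized to JSON. The `build_event` helper is hypothetical, not part of any library; the field names are borrowed from the example events in this deck.

```ruby
require 'json'
require 'time'

# Hypothetical helper: start one structured event for a unit of work.
def build_event(fields = {})
  { 'timestamp' => Time.now.utc.iso8601(3) }.merge(fields)
end

event = build_event('request.method' => 'GET', 'request.path' => '/hello/')

# No schema to migrate: new context is just another key-value pair.
event['app.interesting_thing'] = 'banana'
event['response.status_code'] = 200

# One JSON object per event; consumers read keys instead of parsing strings.
line = JSON.generate(event)
puts line
```

Because every value is stored under a named key, a query tool can filter or group on any field without string parsing.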

Slide 27

Flat logs vs. structured events
Flat logs get written eagerly
You can’t keep track of state changes, even within a specific context

Slide 28

write as you go

Slide 29

Flat logs vs. structured events
Structured events store all the data within your context, from beginning to end

Slide 30

write at the end
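The "write at the end" pattern can be sketched roughly as follows: fields accumulate while the work runs, and a single record is emitted when the work finishes, including the duration and any error. `WideEvent`, its methods, and the `checkout` field names are hypothetical names chosen for illustration, not an API from the talk.

```ruby
require 'json'
require 'stringio'

# Hypothetical event builder: collects context during a unit of work
# and writes exactly one record when the block exits.
class WideEvent
  def initialize(name, out: $stdout)
    @fields = { 'name' => name }
    @out = out
  end

  def add_field(key, value)
    @fields[key] = value
  end

  def emit
    @out.puts(JSON.generate(@fields))
  end

  def self.record(name, out: $stdout)
    event = new(name, out: out)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield event
  rescue StandardError => e
    event.add_field('error', e.message)
    raise
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    event.add_field('duration_ms', (elapsed * 1000).round(3))
    event.emit
  end
end

buffer = StringIO.new
WideEvent.record('checkout', out: buffer) do |event|
  event.add_field('cart.items', 3)
  event.add_field('payment.provider', 'example')
end
puts buffer.string
```

A flat-log version of the same work would have written one line per step; here the whole context arrives together in one record.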

Slide 31

From events: time-series graphs
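One way to picture how a graph can be derived from events: bucket the raw events by time and aggregate each bucket at query time. This is only a sketch under assumed sample data; the three events below are made up for illustration, with field names following the earlier example events.

```ruby
require 'time'

# Hypothetical sample events.
events = [
  { 'timestamp' => '2018-03-20T00:47:05Z', 'duration_ms' => 120.0 },
  { 'timestamp' => '2018-03-20T00:47:40Z', 'duration_ms' => 772.4 },
  { 'timestamp' => '2018-03-20T00:48:10Z', 'duration_ms' => 95.5 }
]

# Group events by the minute they occurred in, then aggregate per bucket.
series = events
  .group_by { |e| Time.iso8601(e['timestamp']).strftime('%H:%M') }
  .map do |minute, bucket|
    durations = bucket.map { |e| e['duration_ms'] }
    [minute, { count: bucket.size, max_ms: durations.max }]
  end
  .sort_by(&:first)
  .to_h

series.each { |minute, stats| puts "#{minute} #{stats}" }
```

Because the aggregation happens after the fact, the same events could just as well be grouped by any other field they carry.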

Slide 32

Slide 33

From events: error logging

Slide 34

From events: tracing
If your events have a start time, a duration, and a parent, then you can generate a trace!
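As a rough illustration of that claim, the sketch below rebuilds a trace tree from a flat list of events using only an id, a parent id, a start offset, and a duration. The `Span` struct, the `render_trace` helper, and the sample values are all hypothetical.

```ruby
# Hypothetical events: parent_id is nil for the root span.
Span = Struct.new(:id, :parent_id, :name, :start_ms, :duration_ms)

spans = [
  Span.new('a', nil, 'GET /timeline', 0, 1243),
  Span.new('b', 'a', 'get_all_tweets', 5, 1200),
  Span.new('c', 'b', 'get_batch', 10, 600),
  Span.new('d', 'b', 'get_batch', 615, 580)
]

# Index children by parent id, then walk from the root in start-time order.
def render_trace(spans)
  children = spans.group_by(&:parent_id)
  lines = []
  walk = lambda do |span, depth|
    lines << format('%s%s (%dms)', '  ' * depth, span.name, span.duration_ms)
    (children[span.id] || []).sort_by(&:start_ms).each do |child|
      walk.call(child, depth + 1)
    end
  end
  (children[nil] || []).each { |root| walk.call(root, 0) }
  lines
end

puts render_trace(spans)
```

The indentation in the output mirrors the waterfall view a tracing UI draws from the same three pieces of information.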

Slide 35

sql.active_record: 4.951s

Slide 36

Slide 37

Instrument your code

Slide 38

Instrument with OpenTelemetry!
Vendor-neutral framework with auto-instrumentation
Learn more: OpenTelemetry.io

Slide 39

Start with auto-instrumentation
Rich context and distributed tracing out of the box:
● HTTP headers
● gRPC calls
● SQL queries

Slide 40

Slide 41

Custom instrumentation
What’s important for the service that I’m providing?
● lib/ code consumed by your request endpoints
● 3rd-party API calls

Slide 42

def get_all_tweets(user)
  all = []
  options = { count: 200, include_rts: true }
  loop do
    tweets = client.user_timeline(user, options)
    return all if tweets.empty?
    all += tweets
    options[:max_id] = tweets.last.id - 1
  end
end

Slide 43

def get_all_tweets(user)
  Instrumentation.start_span(name: 'get_all_tweets') do
    Instrumentation.add_field_to_trace('user', user)
    all = []
    options = { count: 200, include_rts: true }
    loop do
      Instrumentation.start_span(name: 'get_batch') do
        Instrumentation.add_field('options', options)
        tweets = client.user_timeline(user, options)
        return all if tweets.empty?
        all += tweets
        options[:max_id] = tweets.last.id - 1
      end
    end
  end
end
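The slide does not show what sits behind `Instrumentation`. To make the three calls concrete, here is a hypothetical in-memory stand-in with the same method names; it is an assumption for illustration only (single-threaded, no export), where a real SDK would send finished spans to a backend.

```ruby
# Hypothetical stand-in for the Instrumentation module used on the slide.
module Instrumentation
  @stack = []        # spans currently open, innermost last
  @trace_fields = {} # fields copied onto every span that finishes later
  @finished = []     # completed spans, in completion order

  class << self
    attr_reader :finished

    # Open a span, run the block, and record the span when the block exits
    # (including exits via `return` or an exception).
    def start_span(name:)
      span = { 'name' => name, 'parent' => @stack.last && @stack.last['name'] }
      @stack.push(span)
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      yield
    ensure
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      span['duration_ms'] = (elapsed * 1000).round(3)
      @finished << @trace_fields.merge(@stack.pop)
    end

    # Attach a field to the current (innermost) span only.
    def add_field(key, value)
      @stack.last[key] = value
    end

    # Attach a field to every span that finishes from now on.
    def add_field_to_trace(key, value)
      @trace_fields[key] = value
    end
  end
end

Instrumentation.start_span(name: 'outer') do
  Instrumentation.add_field_to_trace('user', 'example_user')
  Instrumentation.start_span(name: 'inner') do
    Instrumentation.add_field('attempt', 1)
  end
end

Instrumentation.finished.each { |span| puts span }
```

The `ensure` clause matters for the method on the slide: its early `return all` leaves both blocks at once, and both spans still get recorded.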

Slide 44

How does this benefit teams?

Slide 45

Better data → better conversations
Shared source of truth.
Improved knowledge transfer.
Shared ownership of production.

Slide 46

Observability-driven development
Before making a change, ask: How will I know this code is working as intended in prod?
Observe the current state: What am I trying to change?
Then implement + instrument: Look, it’s working!

Slide 47

One tool for dev and prod
Fast feedback loop in dev
Strong debugging skills in prod
Read more: The Future of Developer Careers, go.hny.co/dev-careers

Slide 48

Level up your team
Better data leads to better conversations
Custom instrumentation enables self-serve querying
Knowledge transfer creates production ownership

Slide 49

Where to start?

Slide 50

Start where you’re at
Take inventory of the tools and data your team has access to
● How often are people actually using them?
● How often are you actually able to answer questions?
● How much of the time do you rely on experience (and guessing) vs. data?

Slide 51

Instrument your code
Structured events!
● How to capture them?
● Look into OpenTelemetry
● Is there auto-instrumentation for your tech stack?

Slide 52

Start small and grow
Start with one app or one chunk of code in your critical path
● Critical code helps you learn the fastest
Try out some tools
● Some vendors offer a free tier or trial
● Send from dev or deploy a canary

Slide 53

Reach out!
Get these slides: hny.co/shelby
Twitter: @shelbyspees

Slide 54

Visit our Booth!
Get O’Reilly pre-release chapters!
Get Honeycomb swag!

Slide 55

Questions?