Observability for Software Teams

A presentation at Cloud Native Kitchen in December 2020 by Shelby Spees

Slide 1

Observability for Software Teams

Shelby Spees, Developer Advocate, Honeycomb.io

Illustrations by @emilywithcurls!

@shelbyspees at #CloudNativeKitchen

Slide 2

Modern software teams

DevOps is more than “Dev” and “Ops”:

  • application engineers
  • platform engineers
  • infrastructure engineers
  • SREs
  • SDETs
  • support engineers
  • and more!

Slide 3

Distributed systems

Slide 4

Then versus now

Slide 5

We've come a long way

Slide 6

But we still encounter problems

Modern software systems experience emergent failure modes → small failures cascade together to degrade or take down a system. See also: how.complexsystems.fail

Slide 7

So what happens?

Customers complain. Can’t reproduce. Knowledge silos. Meaningless dashboards.

Slide 8

Getting paged can be scary

Where do I even start?

Slide 9

Our graphs don’t answer our questions

Slide 10

42% of developer time is spent dealing with bad code and tech debt (The Developer Coefficient, Stripe, 2018)

Slide 11

We need observability.

What is observability?

In software: the ability to ask new questions about your system without deploying code changes.

What data allows us to ask new questions?

Slide 12

Code has the context

Your code has the answers to this while it’s running…

Improving observability involves capturing that context!

Slide 13

Observability helps with hard-to-debug problems

  • Distributed systems → small change causing downstream effects?
  • Poor performance → what part of the app is worth optimizing?
  • Subset of traffic → only some users are complaining?

Slide 14

Some terminology to know (1/2)

  • instrumentation: code or tooling that captures data about the state of your running system
  • telemetry: data generated by your system that documents its state

You add instrumentation, which generates telemetry.

Slide 15

Some terminology to know (2/2)

  • dimensionality: how many fields or attributes a piece of data has attached to it
  • cardinality: how many possible values a dimension can have

Dimensions == fields. Cardinality == values.
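
The two terms can be illustrated with a small sketch. The sample events and field names below are hypothetical, and the code counts the values actually observed, which is only a lower bound on how many values a dimension can have.

```ruby
require 'set'

# Hypothetical sample events; the field names are illustrative only.
EVENTS = [
  { 'request.method' => 'GET',  'response.status_code' => 200, 'user.id' => 'u-1001' },
  { 'request.method' => 'GET',  'response.status_code' => 500, 'user.id' => 'u-1002' },
  { 'request.method' => 'POST', 'response.status_code' => 200, 'user.id' => 'u-1003' }
].freeze

# Dimensionality: the number of distinct fields seen across the events.
def dimensionality(events)
  events.flat_map(&:keys).uniq.size
end

# Cardinality: the number of distinct values observed for each field.
def cardinality(events)
  seen = Hash.new { |hash, field| hash[field] = Set.new }
  events.each do |event|
    event.each { |field, value| seen[field] << value }
  end
  seen.transform_values(&:size)
end

puts dimensionality(EVENTS)
puts cardinality(EVENTS)
```

A field like `user.id` grows a new value with every user, which is what makes it high-cardinality, while `request.method` stays small no matter how much traffic arrives.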

Slide 16

You’re already collecting data

You’ve probably instrumented your system to generate telemetry like

  • metrics
  • logs
  • traces

That’s great!

Slide 17

But traditional approaches aren’t enough

These data formats don’t help to debug novel failure modes.

Slide 18

Metrics

Great when thinking about infrastructure! But they:

  • only answer known questions
  • don’t capture application context
  • don’t support high-cardinality use cases

Slide 19

Metrics have little context

https://www.tylervigen.com/spurious-correlations

Slide 20

Flat logs

  • capturing output from the code itself
  • no standard log formatting across libraries or services
  • requires string parsing, which is expensive at scale

Centralized logging services are expensive and slow to query

  • great for compliance, not for debugging
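
To make the string-parsing point concrete, here is a rough sketch that pulls the same value out of a flat log line and out of a structured one. The flat log format and the pattern that matches it are invented for illustration; real log formats vary by library and service, which is the problem.

```ruby
require 'json'

# Hypothetical flat log line; the format is invented for illustration.
FLAT_LINE = '2018-07-03T04:57:12Z INFO GET / 200 in 619.703ms'

# The same information written as key-value pairs.
STRUCTURED_LINE = '{"Timestamp":"2018-07-03T04:57:12Z","request.method":"GET",' \
                  '"request.path":"/","response.status_code":200,"duration_ms":619.703}'

# A flat line needs a hand-written pattern that must track every format change.
FLAT_PATTERN = /\A(?<timestamp>\S+) \w+ (?<method>[A-Z]+) (?<path>\S+) (?<status>\d{3}) in (?<duration>[\d.]+)ms\z/

# Returns the duration in milliseconds, or nil when the line does not match.
def duration_from_flat(line)
  match = FLAT_PATTERN.match(line)
  match && match[:duration].to_f
end

# A structured line is parsed generically and fields are looked up by name.
def duration_from_structured(line)
  JSON.parse(line)['duration_ms']
end

puts duration_from_flat(FLAT_LINE)
puts duration_from_structured(STRUCTURED_LINE)
```

Any library that words its log line differently silently falls through the pattern and returns nil, while the structured lookup keeps working as long as the field name is the same.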

Slide 21

Distributed traces

Visualization of the stack trace across your distributed system

Traces are formed by a directed graph made up of objects called spans

  • each span has a start time and a duration, and points to its parent span
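
As a minimal sketch of that structure, a span can be modeled as a record with a start time, a duration, and a parent pointer. The `Span` struct, its field names, and the sample spans are assumptions made for this example and do not reflect any particular tracing library.

```ruby
# Assumed minimal span shape: a start time, a duration, and a pointer to a
# parent span (nil for the root span).
Span = Struct.new(:id, :parent_id, :name, :start_ms, :duration_ms, keyword_init: true)

# Hypothetical spans for a single request.
SPANS = [
  Span.new(id: 'a', parent_id: nil, name: 'GET /timeline', start_ms: 0, duration_ms: 120),
  Span.new(id: 'b', parent_id: 'a', name: 'auth.check', start_ms: 5, duration_ms: 10),
  Span.new(id: 'c', parent_id: 'a', name: 'sql.query', start_ms: 20, duration_ms: 90)
].freeze

# Follows parent pointers from the given span up to the root and returns
# the span names along the way.
def path_to_root(spans, span_id)
  by_id = spans.to_h { |span| [span.id, span] }
  path = []
  current = by_id[span_id]
  while current
    path << current.name
    current = by_id[current.parent_id]
  end
  path
end

puts path_to_root(SPANS, 'c').join(' <- ')
```

Because every span only knows its parent, the whole graph can be reassembled from spans that arrive in any order.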

Slide 22

Distributed trace visualization in Jaeger UI

Slide 23

1.243s API call

What made it slow?

Slide 24

Structured events!

Slide 25

What are structured events?

structured logs + benchmarking + (optionally) tracing

    {
      "Timestamp": "2018-07-03T04:57:12.517022Z",
      "duration_ms": 619.703,
      "request.method": "GET",
      "request.path": "/",
      "request.user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
      "response.status_code": 200,
      "trace.span_id": "3eada0ce-934b-4ffd-bb72-2c9f57b02bf1",
      "trace.parent_id": "aa23-3becd17d726-607d2150c-5397-4e8e",
      "trace.trace_id": "07d2150c-5397-4e8e-aa23-3becd17d7266",
      …
    }
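
One way to produce an event of this shape is to time a unit of work and attach the timing to the fields you already know. This is a sketch under assumptions: the `timed_event` helper is hypothetical, and only the field names are borrowed from the sample event.

```ruby
require 'json'
require 'time'

# Times a block of work and returns one structured event describing it.
# The helper is hypothetical; field names mirror the sample event.
def timed_event(fields)
  started_at = Time.now.utc
  clock_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - clock_start
  fields.merge(
    'Timestamp' => started_at.iso8601(6),
    'duration_ms' => (elapsed * 1000).round(3)
  )
end

# Stand-in for real work: sleep briefly so the duration is non-zero.
EXAMPLE_EVENT = timed_event('request.method' => 'GET', 'request.path' => '/') { sleep 0.01 }
puts JSON.generate(EXAMPLE_EVENT)
```

The "structured logs + benchmarking" part is all here: named fields plus a measured duration. Adding the tracing part would mean attaching span, parent, and trace identifiers as further fields.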

Slide 29

Keep all your runtime context together

  • No predefined index or schema → add fields as needed
  • Key-value pairs → less expensive to parse and query

Slide 30

Flat logs vs. structured events

Flat logs get written eagerly

You can’t keep track of state changes, even within a specific context

Slide 31

  • write as you go
  • write as you go
  • write as you go

Slide 32

Flat logs vs. structured events

Structured events store all the data within your context, from beginning to end

Slide 33

write at the end

Slide 34

From events: time-series graphs

Slide 35

Slide 36

From events: error logging

Slide 37

From events: tracing

If your events have a start time, a duration, and a parent, then you can generate a trace!

Slide 38

sql.active_record: 4.951s

Slide 39

Slide 40

Instrument your code

Slide 41

Instrument with OpenTelemetry!

Vendor-neutral framework with auto-instrumentation

Learn more: OpenTelemetry.io

Slide 42

Start with auto-instrumentation

Rich context and distributed tracing out of the box:

  • HTTP headers
  • gRPC calls
  • SQL queries

Slide 43

Slide 44

Custom instrumentation

What’s important for the service that I’m providing?

  • lib/ code consumed by your request endpoints
  • 3rd-party API calls

Slide 45

    # Pages backwards through a user's timeline until no tweets remain.
    def get_all_tweets(user)
      all = []
      options = { count: 200, include_rts: true }
      loop do
        tweets = client.user_timeline(user, options)
        return all if tweets.empty?
        all += tweets
        # Ask for tweets older than the last one already fetched.
        options[:max_id] = tweets.last.id - 1
      end
    end

Slide 46

    def get_all_tweets(user)
      # One span for the whole call; the user field applies to the entire trace.
      Instrumentation.start_span(name: 'get_all_tweets') do
        Instrumentation.add_field_to_trace('user', user)
        all = []
        options = { count: 200, include_rts: true }
        loop do
          # One child span per page, tagged with the options used for that page.
          Instrumentation.start_span(name: 'get_batch') do
            Instrumentation.add_field('options', options)
            tweets = client.user_timeline(user, options)
            return all if tweets.empty?
            all += tweets
            options[:max_id] = tweets.last.id - 1
          end
        end
      end
    end
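
The `Instrumentation` helper on this slide is not defined in the deck. The following is a hypothetical stand-in, not a real library, showing one way the three calls could behave: spans nest through a stack, fields attach to the innermost span or to the whole trace, and each span is emitted as one structured event when its block exits.

```ruby
require 'json'

# Hypothetical stand-in for the Instrumentation helper used on the slide.
module Instrumentation
  EMITTED = []          # finished span events, in completion order
  @stack = []           # currently open spans, innermost last
  @trace_fields = {}    # fields copied onto every span that finishes later

  class << self
    # Opens a span, runs the block, and emits one event when the block exits,
    # including exits caused by an early return or an exception.
    def start_span(name:)
      span = { 'name' => name, 'parent' => @stack.last && @stack.last['name'] }
      @stack.push(span)
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      yield
    ensure
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      span['duration_ms'] = (elapsed * 1000).round(3)
      @stack.pop
      EMITTED << @trace_fields.merge(span)
      @trace_fields = {} if @stack.empty?
    end

    # Adds a field to the innermost open span only.
    def add_field(key, value)
      @stack.last[key] = value
    end

    # Adds a field to every span in the trace that has not yet finished.
    def add_field_to_trace(key, value)
      @trace_fields[key] = value
    end
  end
end

Instrumentation.start_span(name: 'outer') do
  Instrumentation.add_field_to_trace('user', 'example_user')
  Instrumentation.start_span(name: 'inner') do
    Instrumentation.add_field('batch', 1)
  end
end

Instrumentation::EMITTED.each { |event| puts JSON.generate(event) }
```

The `ensure` clause matters for code like `get_all_tweets`, where `return all` exits from inside two nested spans: both spans still record a duration and get emitted.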

Slide 47

How does this benefit teams?

Slide 48

Better data → better conversations

  • Shared source of truth.
  • Improved knowledge transfer.
  • Shared ownership of production.

Slide 49

Observability-driven development

  • When you’re assigned a ticket, ask: How will I know this code is working as intended in prod?
  • Observe the current state: What am I trying to change?
  • Then implement + instrument: Look, it’s working!

Slide 50

One tool for dev and prod

  • Fast feedback loop in dev
  • Strong debugging skills in prod

Read more: “The Future of Developer Careers” go.hny.co/dev-careers

Slide 51

Level up your team

  • Better data → better conversations
  • Custom instrumentation → self-serve querying
  • Knowledge transfer → production ownership

Slide 52

Where to start?

Slide 53

Start where you’re at

Take inventory of the tools and data your team has access to.

  • How often are people actually using them?
  • How often are you actually able to answer questions?
  • How much of the time do you rely on experience (and guessing) vs. data?

Slide 54

Instrument your code

Structured events!

  • How to capture them? OpenTelemetry.io
  • Is there auto-instrumentation for your tech stack?

Slide 55

Start small and grow

Start with one app or one chunk of code in your critical path

  • Critical code helps you learn the fastest

Try out some tools

  • Some vendors offer a free tier or trial
  • Send from dev or deploy a canary

Slide 56

Reach out!

shelby@honeycomb.io

  • Get these slides: hny.co/shelby
  • Schedule 30 minutes: hny.co/meet/shelby