Observability for Software Teams

A presentation at DIDevOps 2020 Summer Send-off in August 2020 by Shelby Spees

Slide 1

Observability for Software Teams
Shelby Spees
Developer Advocate, Honeycomb.io
@shelbyspees #DIDevops

Slide 2

What is observability?
observability (n.): In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
Why do we care? Modern software services are complex systems, and observability helps us understand the health and reliability of those systems.

Slide 3

Distributed systems
Nowadays, it's rare to run production services that don't talk across the network:
● DB queries
● API calls
● build pipelines
● package downloads

Slide 4

How modern software systems fail
● rarely see the same failure twice
● auto-remediation (autoscaling, load balancing, retries, failovers)
Instead, we see emergent failure modes
● small failures cascade together to degrade or take down the system
See also: https://how.complexsystems.fail

Slide 5

Hard-to-debug problems
Distributed systems: which microservice is experiencing issues?
Poor performance: what part of the app is inefficient?
Subset of traffic: why are only some users complaining?
Your code has the answers to these questions while it's running...
Improving observability involves capturing that context!

Slide 6

How to improve observability?
Good observability comes from how you interact with your system's telemetry data.
telemetry (n.): The collection of measurements or other data at remote points and their automatic transmission to receiving equipment for monitoring.
Start by improving your telemetry!

Slide 7

Software telemetry
● external probes - e.g. Pingdom
● metrics - e.g. CloudWatch, statsd
● flat logs
● distributed traces
● structured events

Slide 8

External probes
Sometimes called "black box" probes, they ping your server every X minutes: Is it up?
A more sophisticated approach can involve API or UI test suites: Does it return the expected response?
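As a rough illustration of the two probe questions above ("Is it up?" and "Does it return the expected response?"), here is a minimal Ruby sketch. The `probe` method, its `fetch` parameter, the `FakeResponse` struct, and the example URL are all assumptions made for this example; they are not from the talk. The real request would go through the standard library's `Net::HTTP.get_response`, but the usage below swaps in a stub so it runs without a network.

```ruby
require "net/http"
require "uri"

# Black-box check: does the endpoint answer, and does it answer as expected?
# `fetch` is injectable so the check can be exercised without a network.
def probe(url, expected_body: nil, fetch: ->(u) { Net::HTTP.get_response(URI(u)) })
  response = fetch.call(url)
  up = response.code.to_i.between?(200, 299)
  matches = expected_body.nil? || response.body.include?(expected_body)
  { up: up, as_expected: up && matches }
rescue StandardError => e
  # Timeouts, DNS failures, refused connections: the probe counts them as down.
  { up: false, as_expected: false, error: e.class.name }
end

# Hypothetical stubbed response standing in for a real server.
FakeResponse = Struct.new(:code, :body)
healthy = ->(_url) { FakeResponse.new("200", '{"status":"ok"}') }
p probe("https://example.test/health", expected_body: "ok", fetch: healthy)
```

Note that the probe only sees the outside of the system: it can say that something is wrong, but not what.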

Slide 9

Metrics
a numeric value tracked over time
● CPU usage by host
● ALB 5XXs
● active connection count sum

Slide 10

Metrics have little context
https://www.tylervigen.com/spurious-correlations

Slide 11

Metrics answer known questions
● must decide in advance what question you're asking
● high-cardinality metrics are expensive, e.g. latency for each customer
cardinality - the number of possible values something can have
● low-cardinality: true/false, days of the week
● high-cardinality: UUID, build_id, social security number, RGB colors
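To make cardinality concrete, here is a small Ruby sketch that counts the distinct values seen per field across a handful of events. The event fields (`is_admin`, `weekday`, `build_id`) and their values are made up for illustration.

```ruby
# Hypothetical sample events; field names and values are illustrative only.
SAMPLE_EVENTS = [
  { is_admin: true,  weekday: "Mon", build_id: "b-1001" },
  { is_admin: false, weekday: "Mon", build_id: "b-1002" },
  { is_admin: false, weekday: "Tue", build_id: "b-1003" },
  { is_admin: true,  weekday: "Tue", build_id: "b-1004" }
].freeze

# Observed cardinality of a field = number of distinct values seen for it.
def cardinality(events, field)
  events.map { |event| event[field] }.uniq.size
end

%i[is_admin weekday build_id].each do |field|
  puts "#{field}: #{cardinality(SAMPLE_EVENTS, field)}"
end
```

A boolean tops out at two values and a weekday at seven, while a field like `build_id` keeps growing with every build, which is what makes it costly to track as a separate metric series.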

Slide 12

Metrics tell you about symptoms
They don't give you direction when debugging.
What was happening in the code when CPU utilization spiked?
You need to come up with a theory and then go find evidence to support it.

Slide 13

Flat logs
● capturing output from the code itself
● no standard log formatting across libraries or services
● requires string parsing, which doesn't scale
Centralized logging services are expensive and slow to query
● great for compliance, not for debugging

Slide 14

Distributed traces
Visualization of the stack trace across your distributed system
Traces are formed by a tree data structure made up of objects called spans
● each span has a start time and a duration, and points to its parent span

Slide 15


Slide 16

1.243s API call
What made it slow?

Slide 17

Structured events
Keep the context around your system's state together in a handy chunk of data (usually stored as JSON blobs)
Ways to generate structured event data:
● structured logs
● SDKs

Slide 18

Flat logs vs. structured data
Flat logs get written eagerly, so you can't keep track of state changes within a specific context
Structured logs/events store all the data within your context, from beginning to end
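The contrast between eager flat logs and a single structured event can be sketched roughly as follows. Both handler methods, the field names, and the hard-coded values are hypothetical; the point is only where the write happens.

```ruby
require "json"

# Flat logging: each line is written as soon as it happens ("write as you go").
def handle_with_flat_logs(user_id, log)
  log << "request started"
  log << "looked up user #{user_id}"
  log << "request finished"
end

# Structured event: accumulate context in one hash, emit once ("write at the end").
def handle_with_event(user_id, out)
  event = { name: "handle_request", user_id: user_id }
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  event[:cache_hit] = false     # context learned mid-request (made-up value)
  event[:rows_returned] = 3     # more context, attached to the same event
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  event[:duration_ms] = (elapsed * 1000).round(3)
  out << JSON.generate(event)   # one wide record per unit of work
end

flat = []
handle_with_flat_logs(42, flat)
events = []
handle_with_event(42, events)
puts flat
puts events
```

The flat version produces three unrelated strings that have to be stitched back together later; the structured version produces one record that already carries the user, the outcome, and the duration.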

Slide 19


Slide 20

write as you go

Slide 21

write at the end

Slide 22

From events: metrics
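One way to picture deriving a metric from events: a metric such as "5XX count" is a count over the events that match a filter, and the same events can still be broken down by any field afterwards. The sketch below assumes made-up request events with `status`, `duration_ms`, and `customer` fields.

```ruby
# Hypothetical request events; fields and values are illustrative.
REQUEST_EVENTS = [
  { status: 200, duration_ms: 120, customer: "acme" },
  { status: 500, duration_ms: 950, customer: "acme" },
  { status: 200, duration_ms: 80,  customer: "globex" },
  { status: 503, duration_ms: 40,  customer: "globex" }
].freeze

# A "5XX count" metric is just a count over events matching a filter.
def count_5xx(events)
  events.count { |e| e[:status] >= 500 }
end

# Because the raw events keep their context, the same metric can be
# broken down by another field after the fact.
def count_5xx_by(events, field)
  events.select { |e| e[:status] >= 500 }
        .group_by { |e| e[field] }
        .transform_values(&:size)
end

puts count_5xx(REQUEST_EVENTS)
p count_5xx_by(REQUEST_EVENTS, :customer)
```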

Slide 23

From events: logs

Slide 24

From events: tracing
Traces are formed by a tree data structure made up of objects called spans
● each span has a start time and a duration, and points to its parent span
If your events have a start time, a duration, and a parent, then you can generate a trace!
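As a sketch of that idea, the following Ruby groups span events by their parent and walks the resulting tree, printing children in start-time order. The field names (`id`, `parent_id`, `start_ms`, `duration_ms`) and the sample spans are assumptions for this example, not any vendor's schema.

```ruby
# Hypothetical span events: each has a start time, a duration, and a parent.
SPANS = [
  { id: "a", parent_id: nil, name: "GET /timeline", start_ms: 0,    duration_ms: 1243 },
  { id: "c", parent_id: "a", name: "db query",      start_ms: 40,   duration_ms: 1100 },
  { id: "b", parent_id: "a", name: "auth",          start_ms: 5,    duration_ms: 30 },
  { id: "e", parent_id: "a", name: "render",        start_ms: 1150, duration_ms: 60 },
  { id: "d", parent_id: "c", name: "fetch rows",    start_ms: 50,   duration_ms: 1000 }
].freeze

# Render the trace as indented lines, children ordered by start time.
def render_trace(spans)
  children = spans.group_by { |s| s[:parent_id] }
  lines = []
  walk = lambda do |parent_id, depth|
    (children[parent_id] || []).sort_by { |s| s[:start_ms] }.each do |span|
      lines << "#{'  ' * depth}#{span[:name]} (#{span[:duration_ms]}ms)"
      walk.call(span[:id], depth + 1)
    end
  end
  walk.call(nil, 0) # the root span is the one with no parent
  lines
end

puts render_trace(SPANS)
```

The events arrive in no particular order; the parent pointers and start times are enough to rebuild the waterfall view.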

Slide 25

sql.active_record: 4.951s

Slide 26

Back to: How to improve observability?
Good observability comes from how you interact with your system's telemetry data
● iterating on your questions
● observability-driven development (ODD)
● shared ownership of production

Slide 27


Slide 28

Dropdown from the field
Make a new query with this filter

Slide 29

this user?
WHERE the SQL query contains this user_id

Slide 30

this user?

Slide 31

just slow?
DB queries that take more than 2s

Slide 32

The Core Analysis Loop
FILTER: Filter out the variable or group by it.
INSPECT: Find something that looks unusual in the data.
HYPOTHESIZE: Is there a variable that could explain why there are differences in the values?
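One pass through this loop can be sketched over in-memory events: find the unusual events (here, DB queries slower than 2 seconds), then group them by a candidate variable to see whether it explains the difference. The events, field names, and the 2-second default threshold below are illustrative assumptions.

```ruby
# Hypothetical DB query events; field names and values are illustrative.
DB_EVENTS = [
  { user_id: 7,  duration_s: 4.951, table: "tweets" },
  { user_id: 7,  duration_s: 3.2,   table: "tweets" },
  { user_id: 12, duration_s: 0.04,  table: "users" },
  { user_id: 31, duration_s: 2.4,   table: "tweets" }
].freeze

# INSPECT: keep only the events that look unusual (slower than a threshold).
def slow_queries(events, threshold_s: 2.0)
  events.select { |e| e[:duration_s] > threshold_s }
end

# HYPOTHESIZE + FILTER: group the unusual events by a candidate variable
# to see whether that variable explains the difference.
def count_by(events, field)
  events.group_by { |e| e[field] }.transform_values(&:size)
end

slow = slow_queries(DB_EVENTS)
p count_by(slow, :user_id)
p count_by(slow, :table)
```

If one grouping concentrates the slow events (a single user, a single table), that becomes the next filter and the loop repeats.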

Slide 33


Slide 34

Recap: Telemetry
metrics go wide
logs and traces go deep
rich events can go wide AND deep: shift between views while holding onto the original context

Slide 35

Break down
Ask new questions
More context gives you more insight

Slide 36

How bad is your worst case?
Don't do the math up front
Calculate your metrics with raw data and keep your context
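A small numeric sketch of why pre-aggregating can mislead: when each host reports only its own average, averaging those averages weights every host equally and the outlier disappears. The host names and latency values are invented for this example.

```ruby
# Hypothetical per-host request latencies in milliseconds.
HOST_LATENCIES_MS = {
  "host-a" => [100, 110, 90, 100],
  "host-b" => [100, 2000]
}.freeze

def mean(values)
  values.sum.to_f / values.size
end

# Math done up front: each host reports only its average, and the
# averages are averaged again, ignoring how many requests each host saw.
def average_of_averages(groups)
  mean(groups.values.map { |values| mean(values) })
end

# Math done at query time over the raw values, with the worst case intact.
def raw_summary(groups)
  raw = groups.values.flatten
  { mean: mean(raw).round(1), max: raw.max }
end

puts average_of_averages(HOST_LATENCIES_MS)
p raw_summary(HOST_LATENCIES_MS)
```

With these numbers the average of averages is 575.0 ms, the mean over the raw data is about 416.7 ms, and only the raw data still shows the 2000 ms request.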

Slide 37

Observability-Driven Development
Writing your instrumentation before you write your code
How will I know this code is working as intended in prod?
Supports multiple goals
● correctness
● maintainability
● reliability
● observability
blog post: A Next Step Beyond Test-Driven Development

Slide 38

Same tool in dev and prod
Fast feedback loop in dev
Is the code doing what I expect?
Strong muscle memory in prod
● production ownership
● active exploration
● smoother incident response

Slide 39

Set up your observability tooling
How do you get data in?
● deploy an agent?
● rewrite your logs to be structured?
● add a package to your code?

Slide 40

Auto-instrumentation
Adding observability doesn't have to be a lot of work.
Some integrations can hook into popular frameworks to give you rich context and distributed tracing out of the box!
● e.g. HTTP requests, DB queries

Slide 41


Slide 42

Custom instrumentation
Because you, the developer, know what matters.
Add fields relevant to business logic, user experience: What's important for the service that I'm providing?
Send everything! Filter down when you're querying later.

Slide 43

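# `client` is not defined on the slide; it is assumed to be a configured Twitter API client.
def get_all_tweets(user)
  all = []
  options = { count: 200, include_rts: true }
  loop do
    # Fetch one page of the user's timeline
    tweets = client.user_timeline(user, options)
    return all if tweets.empty?
    all += tweets
    # Ask for tweets older than the last one seen
    options[:max_id] = tweets.last.id - 1
  end
end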

Slide 44

def get_all_tweets(user)
  # One span for the whole operation
  Instrumentation.start_span(name: 'get_all_tweets') do
    # Attach the user to every span in this trace
    Instrumentation.add_field_to_trace('user', user)
    all = []
    options = { count: 200, include_rts: true }
    loop do
      # One child span per batch fetched
      Instrumentation.start_span(name: 'get_batch') do
        Instrumentation.add_field('options', options)
        tweets = client.user_timeline(user, options)
        return all if tweets.empty?
        all += tweets
        options[:max_id] = tweets.last.id - 1
      end
    end
  end
end
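The slide does not show what `Instrumentation` is. To make the pattern easier to experiment with, here is a hypothetical in-memory stand-in that supports the three calls used on the slide (`start_span`, `add_field`, `add_field_to_trace`) and records one event per span instead of sending data anywhere. It is a sketch for illustration, not Honeycomb's SDK or any real library's API; the `parent` and `duration_ms` fields are assumptions.

```ruby
# Hypothetical stand-in for the Instrumentation module used on the slide.
module Instrumentation
  @stack = []        # spans currently open, innermost last
  @trace_fields = {} # fields copied onto every span that finishes afterwards
  @events = []       # finished spans, in completion order

  class << self
    attr_reader :events

    def start_span(name:)
      span = { name: name, parent: @stack.last && @stack.last[:name] }
      @stack.push(span)
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      yield
    ensure
      # Runs even when the block returns early from the enclosing method.
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      span[:duration_ms] = (elapsed * 1000).round(3)
      @stack.pop
      @events << @trace_fields.merge(span)
    end

    # Adds a field to the innermost open span only.
    def add_field(key, value)
      @stack.last[key] = value
    end

    # Adds a field to every span that finishes from now on.
    def add_field_to_trace(key, value)
      @trace_fields[key] = value
    end
  end
end

Instrumentation.start_span(name: "get_all_tweets") do
  Instrumentation.add_field_to_trace("user", "example_user")
  Instrumentation.start_span(name: "get_batch") do
    Instrumentation.add_field("options", { count: 200 })
  end
end
p Instrumentation.events.map { |e| e[:name] }
```

Because each span is written when it closes, the inner `get_batch` event is recorded before the outer `get_all_tweets` event, and both carry the trace-level `user` field.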

Slide 45

Level up your team
Better data → better conversations
Custom instrumentation → self-serve querying
Knowledge transfer → production ownership

Slide 46

Reach out!
shelby@honeycomb.io
@shelbyspees
honeycomb.io/meet/shelby
honeycomb.io/shelby