Adopting Graviton2: How Honeycomb Reduced Infra Spend by 40% on Our Highest-Volume Service

A presentation at Conf42: Cloud Native 2021 in April 2021 by Shelby Spees

Slide 1

Adopting Graviton2: How Honeycomb Reduced Infra Spend by 40% on Its Highest-Volume Service

Slide 2

Shelby Spees
Developer Advocate at Honeycomb.io
@shelbyspees

© 2021 Hound Technology, Inc. All Rights Reserved.

Slide 3

Why Graviton2?

Promised improvements
● cost
● performance
● environmental impact

Andy Jassy announces Graviton2 instance types during keynote at AWS re:Invent 2019

@shelbyspees at #Conf42 #CloudNative

Slide 4

More efficient processor architecture

Why it’s cheaper/power-efficient
● x86 is CISC
● Arm is RISC
● More of Arm CPU die dedicated towards just doing compute
● 7nm process node = less power consumption

Why it’s faster
● x86 SMT: 2 vCPU = 1 execution unit
● Arm: 1 vCPU = 1 execution unit
● Arm execution units not shared between threads running on different vCPUs
● Less tail latency and performance variability

Slide 5

One year later

Andy Jassy talks about Honeycomb during keynote at AWS re:Invent 2020

Slide 6

Is it worth the RISC?

What’s important to Honeycomb?

Slide 7

Data storage engine and analytics tool

What Honeycomb does
● Ingests customers’ telemetry
● Indexes on every column
● Enables near-real-time querying on newly ingested data

Slide 8

Service Level Objectives (SLOs)

Common language between engineers and business stakeholders

Slide 9

SLOs are user flows

Honeycomb’s SLOs
● home page loads quickly
● user-run queries are fast
● customer data gets ingested fast

Slide 10

Error budget: allowed unavailability
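To make "allowed unavailability" concrete, here is a minimal sketch of the arithmetic. The 99.9% target and 30-day window are illustrative assumptions, not Honeycomb's actual SLO settings, and `error_budget` is a hypothetical helper name.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the unavailability an SLO permits over a rolling window.

    slo_target is a fraction (0.999 means 99.9%); both values are assumptions
    chosen for illustration.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    # The budget is simply the part of the window the target does not cover.
    return window * (1.0 - slo_target)


# Hypothetical example: a 99.9% target over 30 days.
budget = error_budget(0.999, timedelta(days=30))
print(f"allowed unavailability: {budget.total_seconds() / 60:.1f} minutes")
```

The same calculation works for event-based SLOs by replacing the window with a total event count.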

Slide 11

Alert proactively based on budget burn rate
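A burn-rate alert can be sketched roughly as follows. This is one common formulation under stated assumptions, not a description of Honeycomb's alerting: the function names, the event counts, the 30-day window, and the 24-hour threshold are all hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    anything higher spends it early.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_error_rate = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_rate


def hours_until_exhausted(budget_remaining: float, rate: float,
                          window_hours: float) -> float:
    """Hours left if the current burn rate holds.

    budget_remaining is the unspent fraction of the budget (1.0 = untouched).
    """
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_hours / rate


# Hypothetical numbers: 50 failed requests out of 10,000 against a 99.9% SLO,
# with half of a 30-day (720-hour) budget already spent.
rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
remaining = hours_until_exhausted(budget_remaining=0.5, rate=rate, window_hours=720)
if remaining < 24:  # assumed alert threshold
    print(f"page on-call: budget exhausted in about {remaining:.0f} hours")
else:
    print(f"burn rate {rate:.1f}x, about {remaining:.0f} hours of budget left")
```

Alerting on projected exhaustion rather than on raw error counts is what makes the alert proactive: it fires while there is still budget to protect.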

Slide 12

Period of reliability = time to cut costs

Infra is our #2 expense after taking care of our honeybees
Infra cost scales with traffic
“Cost of Goods Sold” and other business acronyms

Slide 13

Choosing where to start

Slide 14

Prod: customers observe data

Slide 15

Dogfood observes prod

Production telemetry → Dogfood ingest
Same code as production

Slide 16

(No text on this slide.)

Slide 17

Kibble observes dogfood

Slide 18

Service Architecture

Honeycomb’s services
● shepherd (ingest API)
● kafka (ingest event streaming)
● retriever (indexing and querying)
● poodle (frontend web app)
● refinery (sampling)
● doodle (images)
● labrador (docs, bins, nginx redirects)
● basset (alerting, lives on poodle)
● basenji (encryption)

Slide 19

Shepherd: ingest API service

Why Shepherd?
● highest-traffic service
● stateless, most straightforward
● only scales on CPU utilization
● cares about throughput first, latency close second

Slide 20

Preparing to test out the change

Slide 21

Is it feasible to migrate?

What’s needed?
● Base images & tooling (Docker or AMI)
● Audit application code for arch-specific code (e.g. inline assembly)
● CI tooling (producing build artifacts)
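The audit step can be partly automated. The sketch below is a rough heuristic for a Go codebase, not the process Honeycomb used: it flags assembly files (`.s`), files whose names carry an x86 `GOARCH` suffix, and files with an x86 build constraint near the top. The function name `find_arch_specific` and the 20-line scan limit are assumptions.

```python
import os
import re
from typing import List

# Go builds files named *_amd64.go or *_386.go only for that architecture.
ARCH_SUFFIX = re.compile(r"_(amd64|386)(_test)?\.go$")
# Matches both the current (//go:build) and legacy (// +build) constraint forms.
# Negated constraints such as "!amd64" are flagged too, so a human can review them.
BUILD_TAG = re.compile(r"^//\s*(go:build|\+build)\b.*\b(amd64|386)\b")


def find_arch_specific(root: str) -> List[str]:
    """Return paths under root that look tied to an x86 architecture."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".s") or ARCH_SUFFIX.search(name):
                hits.append(path)
            elif name.endswith(".go"):
                with open(path, encoding="utf-8", errors="replace") as fh:
                    # Build constraints precede the package clause, so only
                    # the first lines are scanned (20 is an arbitrary limit).
                    head = [next(fh, "") for _ in range(20)]
                if any(BUILD_TAG.match(line) for line in head):
                    hits.append(path)
    return sorted(hits)
```

Calling `find_arch_specific(".")` from a repository root returns a list to review by hand. An empty list does not prove portability, since vendored dependencies and cgo code need their own check.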

Slide 22

Producing artifacts for Arm64

Honeycomb uses Go
● Don’t need an Arm box to cross-compile
● Need an Arm box to build Arm Docker images efficiently

Other languages
● Java, Python use arch-independent binaries, no changes needed
● C++ with hand-assembly would need updates
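Cross-compiling Go for Arm64 comes down to setting the `GOOS` and `GOARCH` environment variables before `go build`. A minimal sketch of a CI helper that assembles that invocation follows; the helper name, package path, and output path are hypothetical, and this is not Honeycomb's build tooling.

```python
import os
from typing import Dict, List, Tuple


def arm64_build(package: str = "./cmd/app",
                output: str = "bin/app-linux-arm64") -> Tuple[List[str], Dict[str, str]]:
    """Return the command and environment for a linux/arm64 Go build.

    package and output are placeholder paths chosen for illustration.
    """
    env = dict(os.environ)
    env.update({
        "GOOS": "linux",     # target operating system
        "GOARCH": "arm64",   # target architecture (Graviton2 is 64-bit Arm)
        "CGO_ENABLED": "0",  # pure-Go build, so no Arm C toolchain is required
    })
    cmd = ["go", "build", "-o", output, package]
    return cmd, env


cmd, env = arm64_build()
print(" ".join(cmd), "with GOOS=%s GOARCH=%s" % (env["GOOS"], env["GOARCH"]))
```

On a build host with the Go toolchain installed, the pair can be passed to `subprocess.run(cmd, env=env, check=True)`. The x86 artifact comes from the same source by switching `GOARCH` to `amd64`.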

Slide 23

Initial findings

m6g is superior to c5 for our workloads
● lower cost on-demand
● more RAM
● lower median latency
● significantly lower tail latency

Cost of this experiment? A few spare afternoons.

Slide 24

A/B testing

Limited variables
● same build ID (different compilation targets)
● single service

Slow rollout
● started with one instance
● bumped to 20% to observe

Slide 25

Distribution of request latency on different instance types
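A comparison like this one can be approximated by grouping request latencies by instance type and reading off the median and a tail percentile. The sketch below uses the nearest-rank method; the sample values are synthetic placeholders, not Honeycomb's measurements, and the function names are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def latency_summary(samples: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group (instance_type, latency_ms) pairs and report p50 and p99 per type."""
    by_type = defaultdict(list)
    for instance_type, latency_ms in samples:
        by_type[instance_type].append(latency_ms)
    summary = {}
    for instance_type, values in by_type.items():
        values.sort()
        summary[instance_type] = {
            "p50": percentile(values, 50),
            "p99": percentile(values, 99),
        }
    return summary


# Synthetic latencies in milliseconds, for illustration only.
samples = [("c5", 10.0), ("c5", 12.0), ("c5", 40.0),
           ("m6g", 9.0), ("m6g", 10.0), ("m6g", 15.0)]
for instance_type, stats in sorted(latency_summary(samples).items()):
    print(instance_type, stats)
```

Looking at both percentiles matters here because the talk's claim is about tail latency as well as the median.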

Slide 26

(No text on this slide.)

Slide 27

CPU utilization on old architecture

Slide 28

CPU utilization on Graviton2 instances

Slide 29

Migration to Graviton2 instances in dogfood Shepherd, February 2020 to April 2021

Slide 30

Dogfood Shepherd cost reduction

Dogfood Shepherd EC2 cost, grouped by instance type

Slide 31

What happened next?

Slide 32

Migrated prod Shepherd

Production Shepherd EC2 cost, grouped by instance type

Slide 33

Migrated prod Retriever

Retriever is our query engine
● Cost savings weren’t a goal
● Instead, we tuned performance

For a 10% increase in cost, we could get a 3x performance improvement!
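The trade-off on this slide can be restated as performance per dollar. A minimal sketch of that arithmetic, assuming "3x performance" and "10% increase in cost" are both measured against the same baseline (the helper name is hypothetical):

```python
def perf_per_dollar_gain(perf_multiplier: float, cost_multiplier: float) -> float:
    """Relative change in performance per unit of spend versus the baseline."""
    if cost_multiplier <= 0:
        raise ValueError("cost_multiplier must be positive")
    return perf_multiplier / cost_multiplier


# 3x the performance at 1.10x the cost, per the slide.
gain = perf_per_dollar_gain(perf_multiplier=3.0, cost_multiplier=1.10)
print(f"{gain:.2f}x performance per dollar")
```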

Slide 34

Production Retriever migration

Slide 35

(No text on this slide.)

Slide 36

AWS ran out of m6gd spot instances

Slide 37

Kafka

Longtime Confluent Kafka users
First to use Kafka on Graviton2 at scale

Changed multiple variables at once
● move to tiered storage
● i3en → c6gn
● AWS Nitro

Read more: go.hny.co/kafka-lessons

Slide 38

Kafka + the long tail

Services fully on Graviton2:
● shepherd
● retriever
● poodle
● refinery

Slide 39

Graviton2 going strong

Amount of traffic running on Graviton2 instances

Slide 40

Takeaways

Slide 41

Have a measurable goal in mind

Need to be able to compare to baseline!

Ask yourself:
● What are you currently measuring?
● Do your existing dashboards reflect customer impact?

Most importantly:
● Start by measuring something
● Then learn and iterate

Slide 42

Acknowledge hidden risks

Examples of hidden risks
● Operational complexity
● Existing tech debt
● Cost of learning new tech and practices
● Vendor code and architecture
● Upstream dependencies

Slide 43

Take care of your people

Existing incident response practices
● Escalate when you need a break / hand-off
● Remind (or enforce) time off work to make up for off-hours incident response

Newly official Honeycomb policy
● Incident responders are encouraged to expense meals for themselves and family during an incident

Slide 44

Optimize for safety

Ensure people don’t feel rushed.

Complexity multiplies
● if a software program change takes t hours,
● software system change takes 3t hours
● software product change also takes 3t hours
● software system product change = 9t hours

Maintain tight feedback loops, but not everything has an immediate impact.

Source: Code Complete, 2nd Ed.
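The multipliers on this slide compose: being part of a system triples the effort, being a product triples it again. A small sketch of that rule of thumb, with a hypothetical function name and a 2-hour base estimate chosen only for illustration:

```python
def estimate_hours(t: float, is_system: bool, is_product: bool) -> float:
    """Scale a base estimate of t hours by the slide's 3x factors.

    program: t, system: 3t, product: 3t, system product: 9t.
    """
    multiplier = (3 if is_system else 1) * (3 if is_product else 1)
    return t * multiplier


base = 2.0  # hours for the change in a standalone program (assumed)
print(estimate_hours(base, is_system=False, is_product=False))  # program
print(estimate_hours(base, is_system=True, is_product=False))   # system
print(estimate_hours(base, is_system=True, is_product=True))    # system product
```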

Slide 45

Graviton2 blog posts

March 2020: go.hny.co/arm64
April 2021: go.hny.co/graviton2-retro

Slide 46

Learn about observability

Make production a more welcoming place.

Read more: go.hny.co/o11y101

Slide 47

Reach out!

honeycomb.io/shelby
@shelbyspees

Slide 48

www.honeycomb.io