Finding a needle in a Haystack
A talk about Observability
I work as a technical product manager on the first open source product from Expedia. I am happiest when talking about the problems we have solved, the approaches we tried, and the challenges we overcame. I previously worked as a software developer and a tester for about four years, and I understand those perspectives and pains well, having lived them. I hold a double master's in computer science and graduated from USC in 2015. I thrive in the space created by the intersection of tech, product, and management.
We at Expedia work on a mission of connecting people to places through the power of technology. To accomplish this, we build and run hundreds of micro-services, many of which take part in serving a single customer request. Now, what happens when one or more services fail at the same time? To improve the observability of our system and provide a high quality of service, we need to connect these failure points across our distributed topology and so reduce mean time to know (MTTK) and mean time to resolve (MTTR).
Google published a paper, Dapper, that introduced the idea of distributed tracing. Contemporary tracing solutions such as Zipkin, AWS X-Ray, and Jaeger each aim to solve the problem in their own way. They provide a good set of features, but they either use custom APIs that lock us into their platforms or are not extensible enough to suit our needs. Along came the OpenTracing standard, which offers consistent, expressive, vendor-neutral APIs for tracing systems.
In this talk, we will present the journey of distributed tracing at Expedia, which started with Zipkin as a prototype and ended with us building our own open source solution on the OpenTracing APIs. We will do a deep dive into our architecture and demonstrate how we ingest terabytes of tracing data in production for hundreds of our micro-services running on AWS. We are keen to show how the increased network bandwidth of the EC2 C5 instance type and EBS-backed Cassandra helped us build a highly performant and available solution. We will show how AWS enabled us to apply horizontal scaling: we keep the members (EC2 instances) of our Kubernetes cluster quite lean and increase the fleet size on demand for both the compute and data store nodes.
Expedia has a unique deployment topology in which every line of business deploys its micro-services in its own segment (a VPC), so building a distributed tracing solution for such a service mesh was a challenging task. We will show how two streaming solutions, AWS Kinesis and Apache Kafka, complemented each other: Kinesis helped us stitch together multiple VPCs, and Kafka helped us build a cost-friendly, extensible solution. The tracing data that flows from Kinesis to Kafka can serve several interesting use cases. At the time of writing, in addition to our primary use case of distributed tracing, we use this data to trend service errors, latencies, and rates, to perform anomaly detection on the aggregated trends, and to build service-dependency graphs.
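As a rough illustration of the trending use case, the sketch below rolls individual spans up into per-service, per-minute counts, error rates, and mean latencies. It is not the actual pipeline: the `Span` fields and the `trend_by_minute` helper are hypothetical names chosen for this example, and a real deployment would consume spans from a stream rather than an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not a real tracing schema)."""
    service: str
    start_epoch_s: int   # span start time, seconds since the epoch
    duration_ms: float   # how long the operation took
    error: bool          # whether the span was tagged as failed


def trend_by_minute(spans):
    """Aggregate spans into (service, minute) buckets.

    Returns a dict keyed by (service, minute_start_epoch_s) whose values
    hold the request count, error rate, and mean latency for that bucket.
    """
    buckets = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        # Round the start time down to the beginning of its minute.
        minute = span.start_epoch_s - span.start_epoch_s % 60
        bucket = buckets[(span.service, minute)]
        bucket["count"] += 1
        bucket["errors"] += int(span.error)
        bucket["total_ms"] += span.duration_ms
    return {
        key: {
            "count": b["count"],
            "error_rate": b["errors"] / b["count"],
            "mean_ms": b["total_ms"] / b["count"],
        }
        for key, b in buckets.items()
    }


if __name__ == "__main__":
    sample = [
        Span("checkout", 120, 10.0, False),
        Span("checkout", 150, 30.0, True),
        Span("search", 185, 5.0, False),
    ]
    for key, stats in sorted(trend_by_minute(sample).items()):
        print(key, stats)
```

Aggregates of this shape are the kind of input an anomaly detector could consume, since they turn a stream of individual spans into a regular time series per service.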
The significance of tracing data decays as time progresses; for us, its useful life is 90 days. We are required to keep this data for at least 3 months to help diagnose problems that happened in the past. We will highlight how we helped our developers and Ops team use Cassandra and ElasticSearch to investigate the most recent trace data (the last 4 days) and fall back to Athena (querying data stored in AWS S3) for historical analysis. It is also worth mentioning how S3 Infrequent Access policies helped keep our AWS cost in check while providing a high-availability guarantee for petabytes of our tracing data.
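The tiering described above can be sketched as a small routing function that picks a store from the age of the trace being requested. This is only an illustration under stated assumptions: the `pick_store` name, the returned labels, and the treatment of data older than 90 days as expired are all hypothetical, and only the 4-day and 90-day windows come from the abstract.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=4)    # recent traces: Cassandra + ElasticSearch
RETENTION = timedelta(days=90)    # older traces: Athena over S3


def pick_store(trace_time, now):
    """Return a label for the store to query for a trace of the given age.

    Both arguments are timezone-aware datetimes. Treating anything older
    than the retention window as "expired" is an assumption of this sketch.
    """
    age = now - trace_time
    if age < timedelta(0):
        raise ValueError("trace_time is in the future")
    if age <= HOT_WINDOW:
        return "cassandra+elasticsearch"
    if age <= RETENTION:
        return "athena"
    return "expired"


if __name__ == "__main__":
    now = datetime(2019, 4, 27, tzinfo=timezone.utc)
    for days in (1, 30, 120):
        print(days, pick_store(now - timedelta(days=days), now))
```

Keeping the windows as named constants makes the trade-off explicit: widening the hot window improves query latency for more traces at the cost of a larger Cassandra and ElasticSearch footprint.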
- 45 min
- LinuxFest Northwest 2019