SREcon17 Asia/Australia: SRE Your gRPC—Building Reliable Distributed Systems Illustrated with gRPC
Grainne Sheerin and Gabe Krabbe, Google
Distributed systems have sharp edges, and we have a wealth of experience cutting ourselves on them. We want to share our experience with SREs elsewhere, so they can skip making the same mistakes and join us in making exciting new ones instead!
We will share practical suggestions from 14 years of failing gracefully:
– In a distributed service, every component is a frontend to another one down the stack. How can it deal with backend failures so that the service as a whole does not go down?
– In a distributed service, every component is a backend for another one up the stack. How can it be scaled and managed, avoiding overload and under-use?
– In a distributed service, latency is often the biggest uncertainty. How can it be kept predictable?
– In a distributed service, each component's contribution to availability, processing, and latency costs is hard to assign. When things (inevitably) go wrong, which components are to blame? When they work, where are the biggest opportunities for improvement?
We will cover best and worst practices, using specific gRPC examples for illustration.
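To make the first and third questions above concrete, here is a minimal sketch of one common pattern: retrying a failing backend call with jittered exponential backoff while never exceeding an overall deadline. It is an illustration only, not taken from the talk. The names `call_with_deadline` and `BackendUnavailable` and all parameter values are hypothetical, and the sketch deliberately avoids the gRPC library so it runs on its own.

```python
import random
import time


class BackendUnavailable(Exception):
    """Hypothetical transient error standing in for a failed backend call."""


def call_with_deadline(call, deadline_s, max_attempts=3, base_backoff_s=0.05,
                       clock=time.monotonic, sleep=time.sleep):
    """Retry call(remaining_s) until success, retry exhaustion, or deadline expiry.

    `call` receives the time budget still available, so it can pass that
    budget on to its own backends instead of starting a fresh timeout.
    """
    deadline = clock() + deadline_s
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget spent: stop instead of adding load to the backend
        try:
            return call(remaining)
        except BackendUnavailable as err:
            last_error = err
        if attempt + 1 < max_attempts:
            # Exponential backoff with full jitter, capped at the time left.
            backoff = random.uniform(0, base_backoff_s * 2 ** attempt)
            sleep(min(backoff, max(0.0, deadline - clock())))
    raise TimeoutError("deadline exceeded or retries exhausted") from last_error
```

With a real gRPC Python stub, the `remaining` value could be passed as the `timeout` argument of the stub call, so the whole chain of calls shares one latency budget. The caps on attempts and on total time are what keep retries from amplifying an overload further down the stack.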
via YouTube https://youtu.be/eoy9z0UlaII