Bugsnag processes hundreds of millions of errors each day, and to handle this data we’ve prioritized building a scalable, performant, and robust backend system. This comes along with many technical challenges from which we’ve learned a lot. Most recently, we launched the new Releases dashboard, a project that required us to scale our system to handle the significant increase in service calls required to track releases and sessions for our users.
While work on the Releases dashboard was underway, the Engineering team was also breaking down Bugsnag’s backend functionality into a system of microservices we call the Pipeline. We knew that extending the Pipeline to support releases would mean adding several new services and modifying existing ones, and we also anticipated many new server and client interactions. To handle all of these architectural changes, we needed a consistent way of designing, implementing, and integrating our services. And we wanted a platform-agnostic approach — Bugsnag is a polyglot company and our services are written in Java, Ruby, Go, and Node.js.
In this post, we’ll walk you through why we opted for gRPC as our default communication framework for the Pipeline.
Our existing systems have traditionally used REST APIs with JSON payloads for synchronous communication. That choice was made based on the overwhelming maturity, familiarity, and tooling available, but as our cross-continent engineering teams grew, we needed to design a consistent, agreed-upon RESTful API. Unfortunately, it felt like we were trying to shoehorn simple method calls into a data-driven RESTful interface. The magical combination of verbs, headers, URL identifiers, resource URLs, and payloads that satisfied a RESTful interface and still made for a clean, simple, functional API seemed an impossible dream. REST has lots of rules and interpretations, which in most cases result in a RESTish interface that takes extra time and effort to keep pure.
Eventually, the complications with our REST API led us to search out alternatives. We wanted our microservices to be as isolated from one another as possible in order to reduce interactions and keep services decoupled. Simplicity would be key, as it would allow us to create a workable service in as little time as possible and keep us from jumping through hoops.
Choosing a communication framework should not be undertaken lightly. Large organizations (like Netflix) can have backend systems powered by over 500 microservices. Migrating those services to replace inadequate inter-service communication can cost a large number of engineering cycles, making it logistically and financially impractical. Investing time in choosing the right framework from the start can save a lot of wasted effort in the future.
We spent a significant amount of time drawing up evaluation criteria and researching our options. Here I’ll walk you through what that looked like for Bugsnag.
When researching the options available, we used specific criteria to assess them. Our list of things to consider was based on what would work best for a microservice architecture. Our main goals were to make communication cheap enough to use liberally, to remove complexity from that communication so we could communicate freely, and to make clear where responsibility lies for each service. Some of these technical concerns were:
Service APIs are one of the most important interfaces to get right as they are crucial in setting service expectations during development. Settling on the design for a service API can be an arduous task, which is amplified when different teams are responsible for the different services involved. Minimizing wasted time and effort due to mismatched expectations is as valuable as reducing coding time. Since Bugsnag has a cross-continent engineering team, we only get a few cycles of communication each day, so we have to make the most of them by streamlining our communication and leaving less open to interpretation; otherwise mistakes are easy to make and work can easily be delayed.
Here are some of the design considerations we had when choosing the framework:
In addition to the criteria listed above, we needed to choose a framework that is easily extensible. As microservices gain traction, we demand more and more “out of the box” features synonymous with this architecture, especially as we move forward and try to add more complexity to our system. The features we wish for include:
Most frameworks will not provide all these features, but at the very least, they should be extensible enough to add in when needed.
There was no single framework that ticked all the boxes. Some options we explored were Facebook’s Thrift, Apache Hadoop’s Avro, Twitter’s Finagle, and even using a JSON schema.
Our needs seemed more aligned with remote procedure calls, or RPCs, which give us the fine-grained control we needed. Another attraction of RPCs is the use of interface definition languages, or IDLs. An IDL allows us to describe a service API in a language-independent format, decoupling the interface from any specific programming language. IDLs can provide a host of benefits, including a single source of truth for the service API, and can potentially be used to generate client and server code for interacting with these services. Examples of IDLs include Thrift, Avro, CORBA, and, of course, Protocol Buffers.
In the end, the clear winner was gRPC with Protocol Buffers.
We chose gRPC because it met our feature needs (including extensibility going forward), has an active community behind it, and is built on HTTP/2.
gRPC is a high-performance, lightweight communication framework designed for making traditional RPC calls, and developed by Google (but no, the g doesn't stand for Google). The framework uses HTTP/2, the latest revision of the HTTP protocol, designed primarily for low latency and for multiplexing requests over a single TCP connection using streams. This makes gRPC amazingly fast and flexible compared to REST over HTTP/1.1.
The performance of gRPC was critical for setting up our Pipeline to handle the massive increase in calls we were expecting for the Releases dashboard. Also, because HTTP/2 is a standardized protocol, we can leverage tools and techniques developed for it (like Envoy proxies) that come with first-class support for gRPC. And thanks to multiplexed streams, we are not limited to simple request/response calls: gRPC also supports bi-directional streaming.
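To make the multiplexing point concrete, here is a minimal Go sketch of a client issuing several RPCs concurrently over a single connection, with each call travelling as its own HTTP/2 stream. The client stub and message types (pb.NewErrorsClient, pb.ErrorStatusUpdateRequest), the import path, and the service address are hypothetical stand-ins for code generated from the protobuf service we show later in this post.

package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"

	pb "example.com/bugsnag/error_service" // hypothetical generated package
)

func main() {
	// A single TCP connection; HTTP/2 multiplexes all concurrent RPCs over it.
	conn, err := grpc.Dial("errors-service:50051", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("failed to connect: %v", err)
	}
	defer conn.Close()
	client := pb.NewErrorsClient(conn)

	ids := []string{
		"587826d70000000000000001",
		"587826d70000000000000002",
		"587826d70000000000000003",
	}
	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		go func(errorID string) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), time.Second)
			defer cancel()
			// Each call runs as an independent stream on the same connection.
			_, err := client.OpenErrors(ctx, &pb.ErrorStatusUpdateRequest{ErrorIds: []string{errorID}})
			if err != nil {
				log.Printf("OpenErrors(%s): %v", errorID, err)
			}
		}(id)
	}
	wg.Wait()
}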
Protocol Buffers, or protobufs, are a way of defining and serializing structured data into an efficient binary format, also developed by Google. Protocol Buffers were one of the main reasons we chose gRPC, as the two work very well together. We previously had many issues related to versioning that we wanted to fix. With microservices we roll out changes and updates constantly, so we need interfaces that can adapt while staying forwards and backwards compatible, and protobufs are very good at this. Since they are a binary format, payloads are also small and quick to send over the wire.
Protobuf messages are described using their associated IDL which gives a compact, strongly typed, backwards compatible format for defining messages and RPC services. We use the latest proto3 specification, with a real-life example of a protobuf message shown here.
// Defines a request to update the status of one or more errors.
message ErrorStatusUpdateRequest {
  // The list of error IDs that specify which errors should be updated.
  // The error IDs need to belong to the same project or the call will fail.
  // Example:
  // "587826d70000000000000001"
  // "587826d70000000000000002"
  // "587826d70000000000000003"
  repeated string error_ids = 1;

  // The ID of the user that has triggered the update if known.
  // This is for auditing purposes only.
  // Example: "587826d70000000000000004"
  string user_id = 2;

  // The ID of the project that the errors belong to if known.
  // If the project ID is not provided, it will be queried in mongo.
  // The call will fail if all of the error IDs do not belong to the same project.
  // Example: "587826d70000000000000005"
  string project_id = 3;
}
All fields are optional according to proto3. Default values will always be used if a field is not set. This, combined with field numbering, provides an API that is very resistant to breaking changes. By following some simple rules, forward and backwards compatibility can be the default for most API changes.
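As a quick illustration of that default behaviour, here is a hedged Go sketch assuming types generated from the ErrorStatusUpdateRequest message above (the import path is hypothetical): a sender that only sets error_ids still produces a message the receiver can decode, with the unset fields simply coming back as their defaults.

package main

import (
	"fmt"
	"log"

	"google.golang.org/protobuf/proto"

	pb "example.com/bugsnag/error_service" // hypothetical generated package
)

func main() {
	// The sender sets only error_ids; user_id and project_id are left unset.
	msg := &pb.ErrorStatusUpdateRequest{
		ErrorIds: []string{"587826d70000000000000001"},
	}
	data, err := proto.Marshal(msg)
	if err != nil {
		log.Fatal(err)
	}

	// The receiver decodes the payload; unset fields come back as defaults
	// ("" for strings) rather than causing an error.
	req := &pb.ErrorStatusUpdateRequest{}
	if err := proto.Unmarshal(data, req); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("user_id=%q project_id=%q\n", req.GetUserId(), req.GetProjectId())
}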
The protobuf format also allows an RPC service itself to be defined. The service endpoints live alongside the message structures providing a self-contained definition of the RPC service in a single protobuf file. This has been very useful for our cross-continent engineering team who can understand how the service works, generate a client, and start using it, all from just one file. Here is an example of one of our services:
syntax = "proto3";
package bugsnag.error_service;
service Errors {
// Attempt to get one or more errors.
// Returns information for each error requested.
// Possible exception response statuses:
// * NOT_FOUND - The error with the requested ID could not be found
// * INVALID_ARGUMENT - The error ID was not a valid 12-byte ObjectID string
rpc GetErrors (GetErrorsRequest) returns (GetErrorsResponse) {}
// Attempt to open the errors specified in the request.
// Returns whether or not the overall operation succeeded.
// Possible exception response statuses:
// * NOT_FOUND - One or more errors could not be found
// * INVALID_ARGUMENT - One of the request fields was missing or invalid,
// see status description for details
rpc OpenErrors (ErrorStatusUpdateRequest) returns (ErrorStatusUpdateResponse) {}
// Attempt to fix the errors specified in the request.
// Returns whether or not the overall operation succeeded.
// Possible exception response statuses:
// * NOT_FOUND - One or more errors could not be found
// * INVALID_ARGUMENT - One of the request fields was missing or invalid,
// see status description for details
rpc FixErrors (ErrorStatusUpdateRequest) returns (ErrorStatusUpdateResponse) {}
}
// Defines a request to update the status of one or more errors.
message ErrorStatusUpdateRequest {
...
The framework is capable of generating code to interact with these services using just the protobuf files, which has been another advantage for us since it can automatically generate all the classes we need. This generated code takes care of the message modeling and provides a stub class with overridable method calls relating to the endpoints of your service. A wide range of languages are supported including C++, Java, Python, Go, Ruby, C#, Node, Android, Objective-C, and PHP. However, maintaining and synchronizing generated code with its protobuf file is a problem. We’ve been able to solve this by auto-generating client libraries using Protobuf files, and we’ll be sharing more about this in our next blog post, coming soon.
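To give a feel for that generated code, here is a rough Go sketch of what implementing the server stub for the Errors service might look like. The generated identifiers (pb.UnimplementedErrorsServer, pb.RegisterErrorsServer, the message types) and the import path are assumptions based on typical protoc output and will vary by language and plugin version.

package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	pb "example.com/bugsnag/error_service" // hypothetical generated package
)

// errorsServer implements the generated service interface; embedding the
// Unimplemented struct keeps it forward compatible with new RPCs.
type errorsServer struct {
	pb.UnimplementedErrorsServer
}

func (s *errorsServer) OpenErrors(ctx context.Context, req *pb.ErrorStatusUpdateRequest) (*pb.ErrorStatusUpdateResponse, error) {
	if len(req.GetErrorIds()) == 0 {
		// Maps directly onto the INVALID_ARGUMENT status documented in the proto file.
		return nil, status.Error(codes.InvalidArgument, "error_ids must not be empty")
	}
	// ... look up the errors and mark them as open ...
	return &pb.ErrorStatusUpdateResponse{}, nil
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	pb.RegisterErrorsServer(srv, &errorsServer{})
	log.Fatal(srv.Serve(lis))
}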
One of the best features of gRPC is the middleware pattern it supports, called interceptors. Interceptors allow any gRPC implementation to be extended (which, you'll remember, was important for us), giving us easy access to the start and end of every request so we can implement our own microservice best practices. gRPC also has built-in support for a range of authentication mechanisms, including SSL/TLS.
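As a sketch of how this looks in practice, here is a simple unary server interceptor in Go that logs the method name and duration of every call; the interceptor types used here (grpc.UnaryServerInfo, grpc.UnaryHandler, grpc.UnaryInterceptor) are part of the standard grpc-go package, while the service being served is left out for brevity.

package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
)

// loggingInterceptor runs at the start and end of every unary RPC,
// wrapping the actual handler.
func loggingInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req) // invoke the real RPC handler
	log.Printf("method=%s duration=%s err=%v", info.FullMethod, time.Since(start), err)
	return resp, err
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	// The interceptor is registered once and then wraps every unary call.
	srv := grpc.NewServer(grpc.UnaryInterceptor(loggingInterceptor))
	// Service registration (e.g. the Errors service) would go here.
	log.Fatal(srv.Serve(lis))
}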
We’re at the beginning of our gRPC adoption, and we’re looking to the community to provide more tools and techniques. We’re excited to join this vibrant community and have some ideas on future projects we’d like to see open sourced or possibly write ourselves.
gRPC is still relatively new, and the development tools available are lacking, especially compared to the veteran REST over HTTP/1.1. This is especially apparent when searching for tutorials and examples, as only a handful exist. The binary format also makes messages opaque, requiring effort to decode. Although there are some options to help, e.g. JSON transcoders (we'll write more about this in a coming blog post), we anticipated needing to do some groundwork to provide a smooth development experience with gRPC.
The popularity of gRPC has grown dramatically over the past few years, with large-scale adoption from major companies such as Square, Lyft, Netflix, Docker, Cisco, and CoreOS. Netflix Ribbon has been the de facto standard for RPC-style microservice communication over REST; this year Netflix announced they are transitioning to gRPC due to its multi-language support and better extensibility/composability. The framework also joined the Cloud Native Computing Foundation in March 2017, alongside heavyweights Kubernetes and Prometheus. The gRPC community is very active, with the open source gRPC ecosystem listing exciting projects on the horizon.
In addition, gRPC has principles with which we agree.
Lyft gave a great talk on their move to gRPC that closely mirrors our own experience: Generating Unified APIs with Protocol Buffers and gRPC. Well worth checking out.
These are still early days for gRPC and there are some definite teething troubles, but the future looks bright. Overall, we're happy with how gRPC has integrated into our backend systems and excited to see how this framework develops.