Riot Messaging Service

Hey there! My name is Michal "0xDEADB33F" Ptaszek, and I’m a software architect at Riot. Today I would like to talk about communication. But not the kind of communication you’re probably thinking of. I want to talk about the other, more exciting kind of communication: LoL players communicating with chat servers during a tense game; authentication servers communicating with the LoL client on login; microservices that route state changes between clients in the middle of the night - you know, that kind of communication.

At a high level, communication between services can be split into two groups: synchronous requests, where the sender is blocked until it receives a response, and asynchronous eventing, where messages are fired off to the receiver without waiting for a reaction. I want to talk about the latter through the lens of the Riot Messaging Service, a service we built to support scenarios where backend services must inform clients about certain events - such as state changes - in an asynchronous way. The Riot Messaging Service is specifically designed to handle service-to-client messaging, which comes with its own challenges, requirements, and assumptions.

In this article, I’ll start with a discussion about state changes and the importance of stateful services. Then I’ll move on to the architecture of the Riot Messaging Service that enables linear scalability and high fault tolerance. Last but not least, I’ll focus on a slice of our journey from drawing board to production servers that can support 10 million player connections on a single box. Let’s dive in!

State

At Riot Games, we’re big fans of microservice architecture. This approach to developing software has tons of benefits for us, including separation of concerns, independent deployability, scalability, and language flexibility (some teams love Java, others like Go, some are even crazy enough to write their stuff in Erlang).

Each of the many microservices running in League is responsible for its own state. Let's take Clubs as an example. Clubs are player-created, player-organized, and player-controlled social groups. The Clubs service stores membership information, tags, messages of the day, member ranks, and a few other useful details. Whenever you log into the client, it fetches your latest state from the Clubs service and renders it so that you and your friends can pick up where you left off.  

But once we have the initial state, how do we receive updates when, for example, an officer changes the message of the day ("Team comp of the day: yordles!")? Should the client actively poll the Clubs service every few seconds to check if there were any changes (the "are we there yet?" approach)? Should the client wait until a player relogs to fetch a fresh state (hmm, questionable)? Or maybe clients should establish persistent connections to the Clubs service (using the Comet approach, for instance), giving us hundreds of thousands of active connections on the Clubs service side and dozens of connections in each client - one for each service the client will use?

In short, there are lots of options, and the Riot Messaging Service (RMS for short) has been built specifically to solve this problem. It's a backend service that allows other services to publish messages and enables clients to receive them. So when a state change occurs in a player’s Club, the Clubs service can publish a message to RMS asking it to inform the other Club members’ clients to refresh their local Clubs state.

Philosophically, RMS is similar to a mobile push notification service. Such a service delivers small push events to phones and supports a whole constellation of mobile applications without needing to understand what's actually being routed through it. Mobile push notifications are also one-way async events, and in most cases it's up to the receiving application to act on the notification it receives.

RMS Under the Hood

At a high level, RMS consists of two main tiers: RMS Edge and RMS Routing.

The RMS Edge (RMSE) tier is a collection of independent servers responsible for hosting player client connections. League clients connect to an RMSE node sitting behind a load balancer using an encrypted WebSocket connection. The connection is established after successful authentication, persists throughout the player’s session, and is terminated on logout.

On top of handling authentication and holding client connections, each RMSE server is also responsible for delivering incoming published messages to connected clients. For every new session registered or existing session terminated, an RMSE node will make a request to the RMS Routing tier to inform it about the change.

Because RMSE servers don't know about each other, they are 100% linearly scalable. Another interesting property is that failures are isolated to the affected server: if one server crashes or has performance hiccups, its issues will not affect adjacent servers.

Here's a simple diagram showing how the RMSE tier is structured:

The RMS Routing (RMSR) tier, in turn, is a layer of clustered servers responsible for maintaining a global view of all client sessions across all RMSE servers. RMSR nodes hold a global, distributed table mapping player identifiers to the RMSE nodes that keep their sessions. The RMSR tier also processes incoming published messages from other services and routes them to the proper RMSE nodes. Finally, RMSR servers keep track of the health of RMSE tier nodes and perform the necessary cleanups whenever something bad happens to one of them.
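
To make the session table a little more concrete, here is a minimal, single-node sketch of the idea in Erlang using an ETS bag table. The real RMSR table is clustered and distributed across nodes; the table and function names below are my own illustrative assumptions, not the production code:

%% Illustrative sketch only: a bag table lets one player id map to several
%% RMSE nodes, e.g. when the player has more than one active session.
init_session_table() ->
    ets:new(rms_sessions, [bag, named_table, public]).

register_session(PlayerId, RmseNode) ->
    ets:insert(rms_sessions, {PlayerId, RmseNode}).

unregister_session(PlayerId, RmseNode) ->
    ets:delete_object(rms_sessions, {PlayerId, RmseNode}).

%% Resolve a recipient to the RMSE nodes currently holding its sessions.
nodes_for(PlayerId) ->
    [Node || {_Id, Node} <- ets:lookup(rms_sessions, PlayerId)].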

Architecturally, RMSR looks like this:

Publishing Messages

Now that we’ve got the basic architecture down, let's focus for a second on how we publish messages and how they’re delivered to the connected clients.

Messages intended for RMS follow the consistent format seen below. These simple JSON blobs are published to the RMSR REST interface:

{
   "resource": "clubs/v1/clubs/665632A9-EF44-41CB-BF03-01F2BA533FE7",
   "service": "https://clubs.na.lol.riotgames.com",
   "version": "2016-10-12 09:33:43.1245",
   "recipients": ["3029B94E-412F-484C-B4E7-BD01073EA629", "81CA1B98-F906-4DDD-86CC-A3FF82A16505"]
}

Each message class is identifiable using the service and resource fields, which both follow REST resource modeling principles.

RMS can also deliver short payloads associated with messages. This is typically done if the updated state is relatively small, or to encode a REST method to use when contacting the publisher service.

{
   "resource": "clubs/v1/clubs/665632A9-EF44-41CB-BF03-01F2BA533FE7",
   "service": "https://clubs.na.lol.riotgames.com",
   "version": "2016-10-12 09:33:43.1245",
   "recipients": ["3029B94E-412F-484C-B4E7-BD01073EA629"],
   "payload": "{\"method\": \"GET\"}"
}
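
To show how little a publisher has to do, here's a hedged sketch of publishing such a message from an Erlang service using OTP's httpc and the jsx JSON library. The endpoint URL, field values, and function name are illustrative assumptions - this example doesn't know the actual RMSR REST path:

%% Illustrative only: POST an RMS message to the RMSR REST interface.
%% RmsrUrl is an assumed publish endpoint passed in as a string;
%% ClubId is a binary UUID and Recipients is a list of binary player UUIDs.
publish_club_update(RmsrUrl, ClubId, Recipients) ->
    {ok, _} = application:ensure_all_started(inets),
    {ok, _} = application:ensure_all_started(ssl),
    Message = #{<<"resource">>   => <<"clubs/v1/clubs/", ClubId/binary>>,
                <<"service">>    => <<"https://clubs.na.lol.riotgames.com">>,
                <<"version">>    => <<"2016-10-12 09:33:43.1245">>,
                <<"recipients">> => Recipients,
                <<"payload">>    => <<"{\"method\": \"GET\"}">>},
    Body = jsx:encode(Message),
    {ok, {{_HttpVsn, 200, _Reason}, _Headers, _RespBody}} =
        httpc:request(post, {RmsrUrl, [], "application/json", Body}, [], []),
    ok.

That one request is really all a publishing service needs; RMS itself never has to know what a Club is.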

Once a published message is delivered to one of the RMSR servers, it’s parsed and validated. Its recipients list is then resolved, using the global RMSR session table, into the set of RMSE servers that hold at least one session for any of the recipients, and the list is stripped from the message. Finally, the trimmed-down message is forwarded to each of those RMSE nodes.
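
Here's a rough sketch of that routing step, reusing the hypothetical nodes_for/1 lookup from the session-table sketch above. The message shape and the rmse_delivery registered process name are likewise assumptions for illustration, not RMS internals:

%% Illustrative only: group recipients by the RMSE nodes hosting their
%% sessions, strip the recipients list, and forward one copy per node.
route(Message = #{<<"recipients">> := Recipients}) ->
    ByNode = lists:foldl(
               fun(PlayerId, Acc) ->
                       lists:foldl(
                         fun(Node, Acc2) ->
                                 maps:update_with(Node,
                                                  fun(Ids) -> [PlayerId | Ids] end,
                                                  [PlayerId],
                                                  Acc2)
                         end, Acc, nodes_for(PlayerId))
               end, #{}, Recipients),
    Stripped = maps:remove(<<"recipients">>, Message),
    maps:foreach(
      fun(RmseNode, LocalRecipients) ->
              %% Hypothetical registered delivery process on each RMSE node.
              {rmse_delivery, RmseNode} ! {deliver, LocalRecipients, Stripped}
      end, ByNode).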

Upon arriving at an RMSE node, the message is forwarded to each recipient’s session handler, serialized, and pushed to the client via the WebSocket connection.

In the end, this is what the client receives:

{
   "resource": "clubs/v1/clubs/665632A9-EF44-41CB-BF03-01F2BA533FE7",
   "service": "https://clubs.na.lol.riotgames.com",
   "version": "2015-10-12 09:33:43.1245",
   "timestamp": 1444677045952,
   "payload": "{\"method\": \"GET\"}"
}

Internally, clients have a simple subscription table that allows their plugins to be notified of local resource state changes (see the previous article on LCU Architecture for more details). Whenever a new RMS message arrives, the client matches the resource field encoded in the message against that table and notifies all interested local consumers of the update. Some plugins re-fetch their state by hitting the publisher at the URL formed by concatenating the service and resource fields, while others only consume the incoming event.
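
The real subscription table lives in the C++ client, but conceptually the lookup boils down to something like this Erlang-flavored sketch (the prefix matching and table shape are my assumptions for illustration):

%% Illustrative only: invoke every locally registered callback whose
%% subscribed resource prefix matches the incoming message's resource.
notify_subscribers(#{<<"resource">> := Resource} = Message, Subscriptions) ->
    [Callback(Message)
     || {Prefix, Callback} <- Subscriptions,
        binary:match(Resource, Prefix) =:= {0, byte_size(Prefix)}].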

This approach to messaging is extremely flexible and easy for teams across Riot to integrate with. Supporting a new type of event requires a single REST request on the publishing side and two lines of C++ in the client to receive and process the message. This enables immediate updates for players, faster development for Riot teams, and a standard, efficient, and elegant way to deliver async messages from servers to clients. Win-win-win!

Road to Production

We implemented RMS entirely in Erlang, which helped us enforce proper design patterns, simplified concurrency and distribution, and let us re-use many of the existing internal libraries we built for other services like the Chat Service.

Once built, RMS releases are packaged into Docker containers, giving us service isolation and relocatability, simplifying deployment and configuration, and making us fairly independent of the underlying OS. For other examples of Docker usage at Riot, check out Maxfield Stewart’s post here.

The service is deployed to AWS, giving us enormous flexibility when it comes to scale, failure testing, and in-development experimentation. All AWS resources - soup to nuts - are managed by Terraform, an infrastructure automation tool that allows us to express our entire infrastructure as code. We can edit it programmatically, avoid costly human misclicks, version it, and much more. As an added bonus, we can spin up and tear down new RMS environments within a few minutes without having to manually configure each one. That allows us to create ephemeral environments just for testing and then destroy them when we no longer need them.

RMS itself has been designed as a global service - although technically it can follow League of Legends’ sharding model, it is game-agnostic, and can handle multiple shards at once. It's also a fundamental messaging layer that's meant to be used by almost all backend player-facing services, so it has to be robust. We went to great lengths to make sure RMS is reliable, fast, scalable, and deployable with 0 downtime (more on that later).

Since the service’s CCU (Concurrently Connected Users) grows with the number of active clients, we wanted to establish a known capacity for a single RMSE and RMSR node, and then come up with a formula that would let us scale the cluster when needed.

Load Testing

Unfortunately, load testing by running millions of actual League clients is not the most feasible option here (we would need thousands of servers to generate such load), so we came up with something better: a dedicated RMS Load Test Tool.

The Load Test Tool is responsible for simulating real world RMS scenarios. In simple terms, it hammers RMSE with tens of thousands of WebSocket connections, publishes tons of RMS messages addressed to said clients, then waits for the messages to arrive:

After running the full loop, it compares timestamps encoded in the message payloads and produces a number of useful statistics, such as publish-to-delivery latency histograms, publish response times, authentication durations, etc:

The Load Test Tool is also linearly scalable: each instance can hold ~60k active RMS connections (limited by the ~2^16 - 1 available TCP source ports, less those reserved for system services), and we can grow the simulated CCU by adding more Load Test Tool boxes.

After a couple of rounds of write code -> deploy -> load test -> profile, and after fixing a few bottlenecks in the system, we established that the largest single RMSE machine we used (an r3.8xlarge instance) can handle at least 10M concurrent connections. This number - and the fact that we can easily add servers to increase RMSE capacity - greatly exceeded our expectations. We were thrilled to tick the load-test box on our checklist.

Fun fact: in order to get to 10M connections, we had to spin up 170 Load Test Tool servers. During testing, 170 terminals could not be squeezed into my MacBook's screen using csshX, so I had to split them up into 5 groups of terminals:

During the tests, a cluster of Load Test Tools was logging in almost 2000 clients per second, publishing and receiving over 20,000 messages per second, and consistently hitting ~15ms for end-to-end publish-to-receive scenarios.

Having a baseline for a single (albeit pretty beefy) server allowed us to define autoscaling policies that would kick in whenever the service is running out of capacity or wasting resources. At the end of the day we decided to run RMSE on much smaller instances to avoid putting all our eggs in one basket (losing 10M connections in case of an instance failure is not fun).
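
Purely for illustration, the kind of back-of-the-envelope sizing this baseline enables looks like the sketch below - the capacity and headroom numbers are made up, not real RMS figures:

%% Illustrative only: estimate how many RMSE nodes a target CCU needs,
%% given a measured per-node capacity and some headroom for failover.
required_rmse_nodes(TargetCcu, PerNodeCapacity, Headroom) ->
    ceil(TargetCcu * (1 + Headroom) / PerNodeCapacity).

%% Example: required_rmse_nodes(7500000, 1000000, 0.3) returns 10.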

Fault Tolerance

Like most distributed systems, RMS is exposed to the classic fallacies of distributed computing - chief among them the assumption that the network is reliable. To address network reliability, we took Jepsen, a framework for distributed systems verification with fault injection, and implemented an RMS plugin for it. Jepsen was responsible for simulating random network failures: splitting RMSR clusters into random halves, introducing communication issues between the RMSE and RMSR tiers, and messing with the cluster topology in ways we couldn’t even imagine. This approach helped us catch a series of nasty bugs that would have caused big headaches if they had hit production servers.

0 Downtime Deployments

Since a single RMS cluster handles multiple League of Legends shards, there is practically never an ideal time to take the service down without serious impact to players. In order to deploy new features and bug fixes in the smoothest way possible, we needed the ability to release new versions of RMS to production servers without interrupting existing traffic.

Let’s look at RMSE first. Since each RMSE node holds a persistent WebSocket connection to its players, we couldn't simply follow the blue-green deployment paradigm and shut the nodes down whenever we please. To overcome this difficulty, we decided to bolster RMS with something called session durability. Whenever RMSE detects an unexpected disconnect of the client connection, or whenever RMSR detects an RMSE node loss, the affected sessions are put into a ‘durable’ mode. Messages addressed to affected players are cached in memory and delivered to the client on successful reconnect to the system. If the client doesn’t reconnect in the next few minutes or the system is close to running out of memory, durable sessions and their queued messages are purged.
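
Here's a heavily simplified sketch of the durable-session idea - a per-player buffer that is either flushed on reconnect or expired after a time-to-live. The process structure, message names, and the five-minute TTL are assumptions for illustration, not the real RMS implementation:

%% Illustrative only: buffer messages for a disconnected player, flush
%% them on reconnect, or drop them once the TTL expires.
-define(DURABLE_TTL_MS, 5 * 60 * 1000).

start_durable_session(PlayerId) ->
    spawn(fun() ->
                  Timer = erlang:send_after(?DURABLE_TTL_MS, self(), expire),
                  durable_loop(PlayerId, queue:new(), Timer)
          end).

durable_loop(PlayerId, Pending, Timer) ->
    receive
        {deliver, Message} ->
            %% Queue messages while the player is disconnected.
            durable_loop(PlayerId, queue:in(Message, Pending), Timer);
        {reconnected, ClientPid} ->
            %% Flush everything that piled up while the player was away.
            erlang:cancel_timer(Timer),
            [ClientPid ! {push, M} || M <- queue:to_list(Pending)],
            ok;
        expire ->
            %% The player never came back: purge the queued messages.
            ok
    end.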

A client that reconnects successfully won't even know it missed any messages, though its processing might be slightly delayed due to the reconnection itself. Failure to reconnect on time will result in the client re-syncing its state with all services since it might have missed important state updates while the RMS connection was down.

This technique allowed us to terminate entire RMSE nodes without draining their connections, making the RMSE tier rollout fast and easy.

For RMSR we’re able to spin up a secondary cluster, sync its state with the one that's currently running and handling production load, and route all traffic to the new boxes with one command.

Both RMSE and RMSR tiers can be deployed independently from one another with no player-facing impact, allowing us to introduce new features and fix bugs without costly maintenance windows.

Production Mode

Finally, with the open beta of the League client update, RMS was put to the test. Real players started connecting to the service, services started publishing genuine messages, and the client began consuming them.

So far (knock on wood!) the service is performing as predicted: we haven't had a single failure or dropped message, we’ve been able to deploy to production clusters on multiple occasions, and our metrics reports look similar to our load tests (diligence: #worth).

Moving forward, we would like more services to leverage RMS for player messaging and, once the AIR client is deprecated, to have all players around the world connect to RMS!

I hope you enjoyed this post. If you have any questions or comments, please share them below. See you next time!

Posted by Michal Ptaszek