Bug Blog: Esports Trade Issue

Hello! My name is Ryan Price, and I work on the backend services that power League of Legends, focusing on the efficiency and reliability of the game loop as well as a few other out-of-game experiences. I’ve been at Riot for roughly four and a half years at this point, and I’ve been in the League services space that entire time. I currently serve as the tech lead for the League Services Engineering (LSE) team, making me accountable for the general technical direction of the team.

In this article, I’m going to be walking you through a recent bug that was impacting our competitive leagues, and how we dove deep into one of League of Legends’ most legacy pieces of technology to mitigate it.

Context

For the first couple weeks of the LoL Esports summer split, multiple teams and regions were reporting issues during scrims and live matches played on the Tournament Realm, our offline client used for competitive play. The issues ranged from slow interactions when selecting a Champion to not being able to save your Runes and Masteries. But where we saw the most impact for esports was with Champion trading at the end of Champ Select.

Players would try to swap Champions during Champ Select, but clicking the “Trade” button did nothing until the timer ran out and the system automatically traded the Champions.

This was an issue because various leagues have a rule that all trades must be finished before the timer reaches 20 seconds. If the trades bugged out, Champ Select would simply be aborted and remade before ever going into game. This was especially relevant for games that required double swaps, as there just wasn’t enough time left in Champ Select to complete them.

To complicate things further, we hadn’t made changes to the sections of code related to Champion trading in years, so we were not optimistic that this would be as simple as rolling back a recent change.


Bug captured from player perspective during a match

A Tale of Two Champ Selects

Behind the scenes, there are two different implementations of Champ Select that players interact with – the “new” flow that goes through our microservice-based solution which is used on our live shards (think what you, a player of League, normally play on), and the “old” flow that goes through our legacy monolithic server application, which we call “the Platform”.

Esports uses the old flow to power Tournament Realm Champ Select, primarily because it doesn’t require as many dependencies to function, but also because some additional esports-only features have been built into this flow. Live shards use the old flow for some things as well, but much less frequently.

Triaging the Legacy Platform

Hard to Find What You Don’t Know You’re Looking For

One of the cool things about working on League is that we have a ton of data at our disposal. It's how we power things like SLIs and performance tracking, as well as how we diagnose issues in both internal and live environments. The downside to all this is that it can feel like searching for a needle in a haystack when we look at a broad-impact issue like this one.


Platform metrics show almost everything to do with the game loop is just… slow…

Knowing that this manifests in Champion trades taking a while, we can narrow this down and look at just those calls. Unsurprisingly, they are also pretty rough.


Platform metrics narrowed down to just Champion trade operations.

These metrics are measured from the server side, so we know this is the time it takes for the server to process the request, not how long it takes for the client to send the request over the network and receive a response. This means that the issue is somewhere within the boundaries of this service call hierarchy. Given that all of the trading methods are affected, it seems likely that they all stem from the same issue. With that in mind, the common thread between these calls, and all of legacy Champ Select really, is the in-memory data grid cache we use to power the Platform.

Navigating and Understanding Cache Operations

We first believed that this might be triggered by a race condition in the Platform’s interactions with the cache. We began determining what calls to the cache we were making on these requests, to understand whether we could be locking up the cache or allowing it to get into an incorrect state. For most of our interactions with the cache, we leverage Entry Processors, which provide lock safety to prevent concurrent writes (see the sketch below).


Example of how entry processors interact with a cache.
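
To make the pattern concrete, here is a minimal, hypothetical sketch of an entry processor mediating a Champion trade. It is written against a Hazelcast-style in-memory data grid API purely for illustration; the Platform’s actual cache, game object, and method names all differ.

```java
// A minimal sketch of an entry processor mediating a Champion trade, written against a
// Hazelcast-style API for illustration only. GameState, swapChampions(), and the key
// names are hypothetical and not the Platform's actual code.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.EntryProcessor;
import com.hazelcast.map.IMap;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class TradeExample {

    // Stand-in for the legacy game object stored as a single cache entry.
    public static class GameState implements Serializable {
        public Map<Long, Integer> selectedChampionByPlayer = new HashMap<>();

        public void swapChampions(long playerA, long playerB) {
            Integer a = selectedChampionByPlayer.get(playerA);
            Integer b = selectedChampionByPlayer.get(playerB);
            selectedChampionByPlayer.put(playerA, b);
            selectedChampionByPlayer.put(playerB, a);
        }
    }

    // The grid locks the key while process() runs, so two trades against the same
    // game cannot interleave and corrupt the entry.
    public static class TradeProcessor implements EntryProcessor<String, GameState, Boolean> {
        private final long requesterId;
        private final long responderId;

        public TradeProcessor(long requesterId, long responderId) {
            this.requesterId = requesterId;
            this.responderId = responderId;
        }

        @Override
        public Boolean process(Map.Entry<String, GameState> entry) {
            GameState game = entry.getValue();   // deserializes the whole entry
            if (game == null) {
                return false;
            }
            game.swapChampions(requesterId, responderId);
            entry.setValue(game);                // serializes the whole entry back
            return true;
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, GameState> games = hz.getMap("games");

        GameState game = new GameState();
        game.selectedChampionByPlayer.put(1L, 103);  // player 1 locked in champion 103
        game.selectedChampionByPlayer.put(2L, 157);  // player 2 locked in champion 157
        games.put("game-42", game);

        // Runs on the member that owns the key, against the live entry.
        Boolean traded = games.executeOnKey("game-42", new TradeProcessor(1L, 2L));
        System.out.println("Traded: " + traded);
        hz.shutdown();
    }
}
```

The detail that matters for this story: with a setup like this, the value is deserialized on getValue() and serialized back on setValue(), so the cost of every trade scales with the size of the game object.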

Given that we have protections in place against race conditions, we looked at what the cache metrics were showing us about operation latencies.


Roughly 500ms avg operation latencies were slowing down champ select. Over 5s p99 operation latencies!

This was the first piece of information that actually pointed us toward the solution, rather than just confirming that something was wrong overall! From the cache’s perspective, it was taking far longer than expected to run the entry processor logic, slowing down all of the Champ Select operations. The weird thing, though, is that we were only seeing this in esports regions, not in the limited use we have on live shards. So, how could we capture what was going on from the cache’s perspective without having to wait for the LCS to have problems?

Leveraging Live Data: Champions Queue

Thankfully, we have an esports-like environment that gets used every night! We dumped some data for games that were currently in Champ Select from the shard that powers Champions Queue. We leveraged one of the LCS casters’ streams (thanks Kobe!) to know, from the player’s perspective, who was in the game, the state of Champ Select, and any other relevant info we could compare against the object we dumped from the cache. To our surprise, what we found was huge, literally!


You’re in for a bad time when gist is telling you the file is too big…
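
For a sense of how we quantified “too big,” here is a rough sketch of the kind of size check you can run on a dumped entry. It reuses the hypothetical GameState stand-in from the earlier sketch and plain Java serialization; the real Platform object and its serializer are different.

```java
// Rough sketch: serialize a dumped game object and see how large the payload is.
// Uses plain Java serialization and the hypothetical GameState stand-in from the
// earlier sketch, not the Platform's actual serializer or data model.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class EntrySizeCheck {
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        TradeExample.GameState game = new TradeExample.GameState();
        // ... populate with players, observers, and their inventories ...
        byte[] payload = serialize(game);
        System.out.printf("Serialized game object: %.1f kB%n", payload.length / 1000.0);
    }
}
```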

We found that the game object was massive. Further inspection showed us that the cause of this was threefold.

First, due to a design decision we made previously, we store the player’s inventory of available Champions and Skins on the game object for trade logic and selection validation. Normally this doesn’t affect performance, but players in esports realms have unlocked accounts, which means they own every Skin and Champion in the game. That’s a lot of data! It’s also multiplied by 10, once for each player in the lobby.


Nearly 7000 records to say you own everything.

Second, we weren’t aggressive enough in cleaning up this data when we transitioned from the lobby into Champ Select. Since observers can move from an observer slot to a player slot, we need their inventory during the lobby stage as well. However, we weren’t cleaning that up when we transitioned into Champ Select, causing us to keep around even more unnecessary data than we already were! The kicker is that esports matches also normally run with up to 20 observers, further increasing the size of the object.


11 entries for a 10 player lobby doesn’t seem right…
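
A hedged sketch of the cleanup we were missing might look something like this; the types and method names are hypothetical and not the Platform’s actual code.

```java
// Hypothetical sketch of the missing cleanup: once the lobby transitions into
// Champ Select, observers can no longer take a player slot, so their cached
// inventories can be dropped from the game object.
import java.util.Map;
import java.util.Set;

public class ChampSelectTransition {

    // Hypothetical per-participant inventory snapshot held on the game object.
    record Inventory(Set<Integer> ownedChampionIds, Set<Integer> ownedSkinIds) {}

    /**
     * Keep inventories only for the accounts that actually occupy player slots;
     * observers no longer need selection/trade validation after the transition.
     */
    static void pruneObserverInventories(Map<Long, Inventory> inventoriesByAccount,
                                         Set<Long> playerAccountIds) {
        inventoriesByAccount.keySet().retainAll(playerAccountIds);
    }
}
```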

Finally, we had multiple inefficiencies in the structure of our legacy game cache objects, which caused us to serialize fields that had no reason to be serialized.


Keying this map by the entire player object added more unused fields to the serialized object.
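
To illustrate the kind of structural inefficiency the screenshot above calls out, here is a hypothetical before-and-after: keying by a small, stable identifier instead of the full player object keeps unrelated fields out of the serialized entry.

```java
// Hypothetical illustration of the structural fix. PlayerDto and Selection are
// stand-ins, not the Platform's actual types.
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class SelectionStorage {

    // Heavy object: serializing it as a map key drags every field along for the ride.
    static class PlayerDto implements Serializable {
        long accountId;
        String summonerName;
        Map<Integer, Long> ownedSkins = new HashMap<>(); // thousands of entries on unlocked accounts
        // ... many more fields ...
    }

    static class Selection implements Serializable {
        int championId;
        int skinId;
    }

    // Before: every key serializes the whole PlayerDto into the cache entry.
    Map<PlayerDto, Selection> selectionsByPlayer = new HashMap<>();

    // After: key by the stable identifier only; the rest of the player data stays
    // out of the serialized object.
    Map<Long, Selection> selectionsByAccountId = new HashMap<>();
}
```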

Overall, the serialized size of the cache entry was far larger than what we normally care to manage, especially for the game object. Every action during Champ Select involves multiple back-and-forth serializations, and running these operations on an object of that size was disastrous.

Improvements and Mitigations

Improving the Problem Holistically

Now that the core of the problem is understood, we can substantially reduce the serialized size of our cache entries. We can remove duplicate data that we don’t need anymore and store the data that we still need more efficiently. This lets us continue to handle the ever-increasing inventories of long-term players until this feature is eventually deprecated in favor of a unified path with the microservice-based Champ Select.
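
As one illustrative example of storing the remaining data more efficiently, ownership can be kept as a compact set of IDs rather than a full record per owned item. This is a sketch of the idea, not the Platform’s actual data model.

```java
// Illustrative only: represent ownership as a bitset of IDs instead of full
// per-item records. Not the Platform's actual data model.
import java.io.Serializable;
import java.util.BitSet;
import java.util.Collection;

public class CompactOwnership implements Serializable {
    // One bit per champion ID; a few hundred champions fit in well under 100 bytes,
    // versus a full record object for every owned item.
    private final BitSet ownedChampions = new BitSet();

    public CompactOwnership(Collection<Integer> ownedChampionIds) {
        ownedChampionIds.forEach(ownedChampions::set);
    }

    public boolean owns(int championId) {
        return ownedChampions.get(championId);
    }
}
```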

Mitigating the Problem For Esports Specifically

However, sometimes the best solution isn’t the most technically advanced or elegant. Given that esports is a special use case where every player on the shard has every Skin and Champion, we can provide some optimizations that will prevent them from having to ever deal with this issue again.

Adding a configuration for tournament shards to turn off inventory verification proved to be a simple and effective fix for this issue. It allowed us to stop persisting the inventory in the legacy game cache object altogether for those shards.


Just trust me, I have everything.
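
To give a sense of how simple that escape hatch can be, here is a hypothetical sketch of a shard-level flag that short-circuits inventory verification (reusing the CompactOwnership stand-in from the earlier sketch). The flag and class names are illustrative and not the actual Platform configuration.

```java
// Hypothetical sketch of an esports-only escape hatch: a shard-level flag that skips
// inventory verification entirely, so the inventory never needs to be persisted on
// the game object. Names are illustrative, not the Platform's actual configuration.
public class ChampionSelectValidator {

    private final boolean skipInventoryVerification; // e.g. true only on tournament shards

    public ChampionSelectValidator(boolean skipInventoryVerification) {
        this.skipInventoryVerification = skipInventoryVerification;
    }

    public boolean canSelect(long accountId, int championId, CompactOwnership ownership) {
        if (skipInventoryVerification) {
            // Tournament Realm accounts own everything, so there is nothing to verify.
            return true;
        }
        return ownership.owns(championId);
    }
}
```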

Before and After

Moment of truth time! We deployed a hotfix to some select tournament shards, being sure not to impact any live games being played, and the results we saw were quite promising. We saw a ~10x reduction in the size of the cache entry. Additionally, we saw a ~100x reduction in latencies for the champion trade related operations!


~ 1.5s vs ~10ms for the server to respond

 


~600ms vs ~4ms for the entry processor execution

 


~150kB vs. ~15kB for the game object cache entry size

Wrapping Up

There were a couple takeaways that I would like to share with you coming out of this triage.

First, working in legacy systems can be incredibly difficult, especially when there aren’t many SMEs in the specific area you need to triage. Sticking to your standard triage and debugging playbook can help you navigate these sticky situations and ultimately lead you to success. The ability to work through problems on complex, interconnected systems is one of the key skills we encourage and support engineers to improve while working on League Services Engineering.

Second, as engineers, it can be easy to create a solution that ultimately solves an issue but adds additional complexity as a tradeoff. Sometimes it can be just as effective to come up with a simple solution to the problem. Given the nature of this issue and the impact it was having, returning service to all of our esports athletes and fans was our top priority.

And finally, this is just a reminder of how old League really is and how much cool content there is in the game. The longer League is around, the more likely we are to have these weird issues crop up, whether it’s due to inventories continuing to expand, databases filling up with billions of records, or just more and more players hopping on to play games. There is never a dull moment when working on services for League of Legends.

To all of our esports fans, I’d just like to say thank you for bearing with us while we worked to fix this. As an avid fan myself, I was glad we could get this turned around quickly and get back to another exciting split!

Posted by Ryan Price