Determinism in League of Legends: Fixing Divergences

The price of determinism is eternal vigilance.

I’m Rick Hoskinson, an engineer on the League of Legends Core Gameplay Initiative, and this is the final article in the “Determinism” series. Previously in this series, we explained the broad strokes required to achieve determinism in the League of Legends game server. However, there remained the task of tracking down divergences in legacy code while building maintainable systems that would ensure future divergence regressions could be found and fixed. Thoughtful implementation of these systems allows us to sustain Project Chronobreak features without dedicating a full-time team.

In this final article, we’ll dig into how we detect divergences and fix them.

Detecting Divergences

At the heart of divergence detection is the ability to compare results from two separate runs of the game. Ideally we compare the results of the original game to the playback of a server network recording. We can also compare the results of two back-to-back playbacks, but there are certain categories of divergences this technique may miss. As we discussed earlier in the series, games provide a convenient measurement quantum - the game frame.

A common technique for evaluating current game state is to checksum game memory and compare the results between two connected peers after some period of game update frames. This is particularly common in deterministic peer-to-peer multiplayer games. While this technique is relatively easy to maintain and implement, it has some problems:

It’s easy to tell that a divergence has occurred, but it’s difficult to drill down to the line of code where the divergence originates.
If address randomization exists and the game state contains pointers, all those pointers must be converted to deterministic handles.
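
To make the first drawback concrete, here is a minimal Python sketch of per-frame checksumming. The state layout and the names `frame_checksum` and `first_mismatched_frame` are assumptions made for illustration, not League code, and CRC32 is only a convenient stand-in for whatever hash a real game would use.

```python
import json
import zlib


def frame_checksum(state: dict) -> int:
    """Checksum one frame of game state (hypothetical layout).

    Keys are sorted so the serialized bytes do not depend on insertion order.
    """
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return zlib.crc32(payload)


def first_mismatched_frame(run_a: list, run_b: list):
    """Return the index of the first frame whose checksums differ, or None."""
    for frame, (a, b) in enumerate(zip(run_a, run_b)):
        if frame_checksum(a) != frame_checksum(b):
            return frame
    return None


# Two made-up runs that agree on frame 0 and disagree on frame 1.
original = [{"minion": {"x": 1.0, "y": 0.0}}, {"minion": {"x": 2.0, "y": 0.0}}]
playback = [{"minion": {"x": 1.0, "y": 0.0}}, {"minion": {"x": 2.0, "y": 7.5}}]
mismatch = first_mismatched_frame(original, playback)
```

All the comparison yields is a frame number. Nothing in the checksum says which variable changed, let alone which line of code wrote it.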

The solution we chose was to instead manually log game state as a set of hierarchical key-value pairs in a file. This helps us kill two birds with one stone: we can compare the outputs of these logs and we can use the logs to help us pinpoint where in the codebase the divergence might have occurred.

This solution comes with some cons as well:

Log file sizes can be huge.
Recording to and reading back from these logs has severe performance implications.

Game state logs contain a JSON-based representation of the evolving game state in the League of Legends servers. We generally capture these logs at the end of every single frame, though we have the capacity to decrease our sampling rate or limit logging to a specific segment of frames.

To save on space, we don’t log the state of every game variable every frame, but only on frames where the value changes. This critical optimization massively improves I/O performance at the cost of running value comparisons against the existing state at log write time. Even with this optimization, game state logging on the server can add a huge amount of performance overhead. Therefore we also have two levels of detail for game state logging: basic and detailed.

We toyed with the idea of adding an intermediate logging level to help with debugging classes of problems where the divergence only occurs between the original recording and the playback. These are the dreaded Nondeterministic System State Leaks I discussed in the implementation article. Thankfully, we were able to build enough intuition for the kinds of nondeterministic systems that leaked data into the game state that we never had to pull the trigger on additional LODs.
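
The change-only logging described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual logger: the class name `GameStateLog`, the slash-separated key paths, and the `{"frame": ..., "changed": ...}` record shape are all invented here.

```python
import json


class GameStateLog:
    """Hypothetical change-only logger: writes a key only when its value changes."""

    def __init__(self):
        self._last = {}   # most recent value seen for each flattened key
        self.lines = []   # one JSON record per frame that had changes

    def _flatten(self, node, prefix=""):
        # Turn nested dicts into ("minion/pos/y", value) pairs.
        for key, value in node.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            if isinstance(value, dict):
                yield from self._flatten(value, path)
            else:
                yield path, value

    def end_frame(self, frame, state):
        """Log the keys that changed this frame and return them."""
        changed = {}
        for path, value in self._flatten(state):
            # This comparison is the CPU cost paid in exchange for less I/O.
            if path not in self._last or self._last[path] != value:
                changed[path] = value
                self._last[path] = value
        if changed:
            record = {"frame": frame, "changed": changed}
            self.lines.append(json.dumps(record, sort_keys=True))
        return changed


log = GameStateLog()
first = log.end_frame(0, {"minion": {"hp": 100, "pos": {"x": 1.0, "y": 0.0}}})
second = log.end_frame(1, {"minion": {"hp": 100, "pos": {"x": 2.0, "y": 0.0}}})
third = log.end_frame(2, {"minion": {"hp": 100, "pos": {"x": 2.0, "y": 0.0}}})
```

Frame 0 writes every key, frame 1 writes only the x position that moved, and frame 2 writes nothing at all. The sketch does not handle keys that disappear from the state, which a real logger would need to.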

We wrote a Python script that compares two game state logs and returns a succinct “diff-log” of the details pertaining to the first known divergence. As discussed in previous blog posts, only the first divergence really matters, since the game state quickly descends into chaos after a single divergence.
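
A comparison of this kind might look something like the sketch below. It is not the script itself: the function name `first_divergence` and the one-JSON-record-per-line log format, with `frame` and `changed` fields, are assumptions for the sake of the example.

```python
import json


def first_divergence(log_a_lines, log_b_lines):
    """Return (frame, {key: (value_a, value_b)}) for the earliest differing frame.

    Each line is assumed to be a JSON object such as
    {"frame": 12, "changed": {"cast/target/y": 0.0}}; this format is hypothetical.
    Returns None when the two logs match.
    """
    a = {rec["frame"]: rec["changed"] for rec in map(json.loads, log_a_lines)}
    b = {rec["frame"]: rec["changed"] for rec in map(json.loads, log_b_lines)}
    missing = object()  # sentinel: the key was not written in that run
    for frame in sorted(set(a) | set(b)):
        changed_a, changed_b = a.get(frame, {}), b.get(frame, {})
        diff = {}
        for key in sorted(set(changed_a) | set(changed_b)):
            va = changed_a.get(key, missing)
            vb = changed_b.get(key, missing)
            if va != vb:
                diff[key] = (None if va is missing else va,
                             None if vb is missing else vb)
        if diff:
            return frame, diff  # stop here: only the first divergence matters
    return None


recorded = [
    '{"frame": 0, "changed": {"cast/target/y": 0.0}}',
    '{"frame": 3, "changed": {"cast/target/y": 0.0, "hp": 90}}',
    '{"frame": 4, "changed": {"hp": 80}}',
]
replayed = [
    '{"frame": 0, "changed": {"cast/target/y": 0.0}}',
    '{"frame": 3, "changed": {"cast/target/y": 1.5e+30, "hp": 90}}',
    '{"frame": 4, "changed": {"hp": 10}}',
]
result = first_divergence(recorded, replayed)
```

The function returns as soon as it finds a differing frame, so the mismatch on frame 4 never appears in the report.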

The standard workflow for testing for and investigating divergences is as follows:

Record a server network recording and basic game state log on one of the Riot environments for a real-world game
On a separate PC, play back the server network recording and create a new basic game state log
Compare the two basic game state logs
If the basic game state logs are different, re-run the playback twice using detailed logging, comparing the detailed logs to each other
Investigate the divergence starting with the variables that diverged
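
The steps above can be read as a small piece of control flow. In the sketch below, `investigate`, `playback`, and `first_divergence` are hypothetical names; the playback and comparison steps are passed in as functions because their real implementations are outside the scope of this article.

```python
def investigate(recording, basic_log, playback, first_divergence):
    """Sketch of the workflow; `playback` and `first_divergence` are stand-ins.

    playback(recording, detail) -> log, where detail is "basic" or "detailed".
    first_divergence(log_a, log_b) -> the first difference, or None if equal.
    """
    replayed_basic = playback(recording, "basic")
    if first_divergence(basic_log, replayed_basic) is None:
        return None  # the playback matched the original game
    # Basic logs differ: replay twice with detailed logging and compare the two.
    detailed_a = playback(recording, "detailed")
    detailed_b = playback(recording, "detailed")
    return first_divergence(detailed_a, detailed_b)


# Fake stand-ins so the sketch runs: every playback returns a distinct log.
calls = []


def fake_playback(recording, detail):
    calls.append(detail)
    return [detail, len(calls)]


def fake_diff(log_a, log_b):
    return None if log_a == log_b else (log_a, log_b)


clean = investigate("rec", ["basic", 1], fake_playback, fake_diff)
diverged = investigate("rec", ["basic", 0], fake_playback, fake_diff)
```

Note that the detailed comparison is between two playbacks, not between the playback and the original game, so it can miss the categories of divergence mentioned earlier.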

Automated Testing

To ensure that determinism and Project Chronobreak would remain a functional part of League for years to come, we built a test automation process that runs on dedicated hardware. To isolate problems as they occur in the wild, the test automation hardware is set up to be as similar as possible to the actual servers that run League’s game server instances.

The automated process is very similar to the workflow described above, except that it works on a large set of games (currently, we test on 2-3 thousand real-world games per day). The public beta servers are an ideal environment for this testing, as we have a large number of games from which to harvest test scenarios, and we're able to catch divergence-causing bugs before they go out to Live and our esports environments.

Case Study: Resolving a Divergence

Now that we’ve established our basic methodology, let’s follow the story of a divergence we recently discovered in the League code base. I can’t take credit for this rundown - the investigation and subsequent fix were carried out by our illustrious engineering manager for League Core Gameplay, Robin Maddock.

This divergence was quite rare, happening in just 1 of every 5,000 or so real-world games. Our test automation had caught it in basic logging, but we’d failed to reproduce the problem in our detailed logs for several months.

We’d finally gotten the break we needed, and the test automation had yielded a useful detailed log we could use to narrow in on the problem. The failure: a y-coordinate value of a floating point vector that represented the target position of a spellcast could diverge between runs. In the divergent case, the value was set to a nonsensical number. Given that this number was typically 0 but would occasionally be another seemingly random float, our suspicion quickly fell on an uninitialized variable as the culprit.

The 3-coordinate floating point vector class in our game does not initialize itself to zeroes due to the performance cost of instantiating each variable. While that seems like a poor tradeoff given the risk to determinism, those tiny costs can add up; the 3-float vector is one of the most commonly used types in the game, representing positions, directions, velocities, and numerous other high-touch gameplay properties.

There’s some trial and error in investigating a divergence. Sometimes simply walking back in the code from all of the writes to a gameplay variable is too complex to be useful. When we first looked at the divergence, there were no obvious culprits that may have fed an uninitialized y-coord into this variable.

Taking a step back, we decided the best way to proceed was to obtain an improved repro rate to allow us to better instrument the code. We started by simply scripting the playback to run in a loop, each time creating a detailed log and comparing. The repro rate of the issue was too low for this to be consistent.

We created a new game build from the original sources and changed the implementation to inject information into unallocated memory. Any changes to the code are tricky, since there’s a chance that floating point optimization paths chosen by the compiler might change subtly in other areas of the code. We started by removing our small block allocator and then injecting garbage into the memory buffers on alloc and free. We re-ran the playback, again with no repro. This gave us a critical hint: the issue was likely coming from something allocated on the stack.

In game code, vector errors tend to propagate from other vectors, so we focused on the vector class itself as a way to implement fault injection. We manually zero-initialized the vector components. Again, there was no divergence, but we weren’t necessarily in a bad place since we were more confident that the compiler wasn’t going to make our lives harder by changing floating point optimizations on us. We then tried initializing the components of the vector to FLT_MAX, and voilà! We could reproduce the issue 100% of the time.

Our next step was to gain some more insight into the game state when the issue occurred. We attempted to capture the divergence in a debugger at the exact moment that the unexpected variable was assigned. Back when building Chronobreak, we’d anticipated this need, and we built a debugging-only wrapper for members that would record assignments during a playback to a log. It would then read the log on a second run, assessing in real time if the new assignments had changed. This allowed us to quickly capture a breakpoint at the exact moment the CastInfo::TargetPosition changed.
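
The record-then-verify idea behind that wrapper can be sketched as follows. This is a simplified illustration, not the real debugging wrapper: the class name `AssignmentTracker` is invented, and it keeps the assignment log in memory where the real one wrote it to a file.

```python
class AssignmentTracker:
    """Hypothetical debug-only helper: record assignments, then verify a rerun."""

    def __init__(self, recorded=None):
        # With no prior recording we only record; otherwise we also verify.
        self.recorded = recorded
        self.log = []

    def assign(self, name, value):
        index = len(self.log)
        self.log.append((name, value))
        if self.recorded is not None:
            expected = self.recorded[index] if index < len(self.recorded) else None
            if expected != (name, value):
                # In a debugger session, this is where the breakpoint would fire.
                raise AssertionError(
                    f"assignment #{index} diverged: "
                    f"expected {expected}, got {(name, value)}"
                )
        return value


# First run: record every assignment to the tracked member.
first_run = AssignmentTracker()
first_run.assign("CastInfo.TargetPosition.y", 0.0)
first_run.assign("CastInfo.TargetPosition.y", 0.0)

# Second run: the same assignments are checked as they happen.
second_run = AssignmentTracker(recorded=first_run.log)
second_run.assign("CastInfo.TargetPosition.y", 0.0)
try:
    second_run.assign("CastInfo.TargetPosition.y", 3.4e38)
    caught = None
except AssertionError as err:
    caught = str(err)
```

Because the check runs at assignment time, the failure surfaces with the full call stack of the offending write still available, which is what makes the breakpoint useful.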

Here’s where things got interesting. From the context of the breakpoint, we discovered that this was a champion auto-attack on a minion, which takes its value from a minion’s position. Going back further, this position was inferred from our pathfinding system, which was further buried in the relatively complex interpolation code coming from the pathing between waypoints. This dizzying trail eventually led to the discovery that minion waypoints themselves were diverging and occasionally trickling a divergent state all the way into the ability casting system.

At this point we had the option to add more detailed logging to the transient state in the pathfinding system that led us to this issue. That could take days of iteration and didn’t guarantee more clarity on a solution. It would also add a lot of uncertainty, so we decided to change our method of attack.

We temporarily added logging for the construction of every vector, which recorded about 580,000 constructor calls. Next, we added a static counter and a static threshold to the constructor. This would initialize the y-coord to 0 (no divergence) until the static threshold was hit, then it would start initializing the y-coord to FLT_MAX (100% divergence).
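
A rough Python model of that instrumented constructor is shown below. The real class is part of the game's native code; `Vec3`, `constructed`, and `threshold` are names made up for this sketch, and x and z are zeroed here only because Python has no uninitialized memory.

```python
import struct

# Largest finite 32-bit float (FLT_MAX in C), decoded from its bit pattern.
FLT_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]


class Vec3:
    """Stand-in for the game's 3-float vector, with fault injection on y."""

    constructed = 0   # static counter: how many vectors have been built so far
    threshold = 0     # static threshold: first construction that gets FLT_MAX

    def __init__(self):
        index = Vec3.constructed
        Vec3.constructed += 1
        self.x = 0.0
        self.z = 0.0
        # Below the threshold: benign zero. At or above it: poison value.
        self.y = 0.0 if index < Vec3.threshold else FLT_MAX


# With a threshold of 2, the first two vectors are safe and the rest are poisoned.
Vec3.constructed, Vec3.threshold = 0, 2
ys = [Vec3().y for _ in range(4)]
```

Since playback is deterministic, the Nth vector constructed is the same vector on every run, which is what makes a single integer threshold a meaningful dial.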

Now here’s the magic trick: you can binary search 580,000 candidates in at most 20 iterations (580,000 < 2^20). Since the playback and comparison loop of our failing server network recording took about 6 seconds, we could start subdividing that threshold and re-testing over and over again. On the first run, we tested with the threshold set at 580,000/2, so only the latter half of the vector constructions were forced to diverge. If the run diverged, then we’d know the divergence was in the latter half of the vector instances. If it didn’t, it had to be in the earlier set.

We continued like that, halving the range of constructor calls in which we forced the divergence until we hit the exact call whose vector was never initialized after creation. It was then a matter of adding a breakpoint at the offending constructor call and running one more playback. Bingo! It turned out the issue came down to a conditional statement with a comparison to an uninitialized variable.
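
The search itself is short enough to sketch. Since 2^19 = 524,288 < 580,000 < 2^20 = 1,048,576, twenty halvings are enough, and at about 6 seconds per run that is roughly two minutes of playback. In the code below, `find_culprit` and `diverges` are hypothetical names, and `diverges(threshold)` stands in for one full playback-and-compare run.

```python
def find_culprit(total, diverges):
    """Return (index, runs): the construction index that causes the divergence.

    diverges(threshold) stands in for one playback-and-compare run in which
    constructions numbered >= threshold are poisoned; it returns True if that
    run diverged. Assumes exactly one culprit index in range(total).
    """
    lo, hi = 0, total   # invariant: the culprit index lies in [lo, hi)
    runs = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        runs += 1
        if diverges(mid):
            lo = mid    # the culprit was poisoned, so it is at or after mid
        else:
            hi = mid    # the culprit was zeroed, so it is before mid
    return lo, runs


culprit = 412_345       # made-up index, for demonstration only
index, runs = find_culprit(580_000, lambda threshold: culprit >= threshold)
```

The sketch assumes a single uninitialized vector is responsible. With several culprits the search would still converge, but only on the last one in construction order.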

What’s amazing is that the seeds of this divergence originated long before the assignment of the invalid y-coordinate. The divergence occurred in our network level of detail system - code we use to determine if a minion pathing update should be sent to a specific player based on the position of their camera. Talk about subtle!

The fix was trivial variable initialization, but there’s little chance we would have found the core issue by brute force code inspection. Only through creative instrumentation were we able to quickly drive toward a solution.

This is the nature of divergences and the cost of determinism - there isn’t a one-size-fits-all solution to resolving every divergence in the game, but the very repeatability of a deterministic playback opens doors to all manner of creative investigation methods.

Conclusion

I’ve been writing about, speaking about, and discussing determinism for a long time now, and I’m guessing everyone is quite sick of me blathering on about it. The Project Chronobreak team moved on from active determinism development in May of 2017, and the feature set is currently in maintenance mode within the League organization. We continue to see value in the feature, both for esports and for day-to-day development. We’ve been effectively using deterministic playbacks to diagnose gameplay bugs and discover new bugs at the heart of divergences.

The biggest win might also be the hardest to quantify: we dramatically improved code quality, maintainability, and reliability of a number of low-level systems in the game. This not only improves our agility as game developers, but gives us the confidence to take bigger gameplay and feature risks as League development continues. To me, this is the most exciting aspect - from strong technical foundations we can deliver more impactful gameplay features.

We owe the success of this project primarily to a very strong development team and clear product goals. However, there is one key learning that I’d like to pass on that made all the difference in bringing this feature through to completion. You can make bold changes to a game in large-scale release, but you need to be able to roll out systemic replacements in parallel with the legacy tech. We couldn’t afford to make mistakes that would require a redeploy to fix; our success was contingent on learning to effectively code new systems in parallel with tried-and-tested technology.

If you have further questions about Chronobreak or League determinism, be sure to post them in the comments and I’ll do my best to answer. Thank you so much for bearing with me over the last year!

Part I: Introduction
Part II: Implementation
Part III: Unified Clock
Part IV: Fixing Divergences (this article)

Posted by Rick Hoskinson