A Taxonomy of Tech Debt

Hi there. I’m Bill “LtRandolph” Clark, and I’m the engineering manager for the Champions team on LoL. I’ve worked on several different teams on League over the past years, but one focus has been consistent: I’m obsessed with tech debt. I want to find it, I want to understand it, and where possible, I want to fix it.

When engineers talk about any existing piece of technology - for example League of Legends patch 8.4 - we often talk about tech debt. I define tech debt as code or data that future developers will pay a cost for. Countless blog posts, articles, and definitions have been written about this scourge of software development. This post will focus on types of tech debt I’ve seen during my time working at Riot, and a model for discussing it that we’re starting to use internally. If you only take away one lesson from this article, I hope you remember the “contagion” metric discussed below.

Metrics

In order to make good decisions about what problems to fix now and what to fix eventually (or, realistically, never), we need a way to measure a particular piece of tech debt. I’ve identified three major axes to evaluate it on: impact, fix cost, and contagion.

Impact

The first axis is the most obvious: the impact of the debt. This takes the form of player-facing issues (bugs, missing features, unexpected behavior), and developer-facing issues (slower implementation, workflow issues, random useless shit to remember). It’s worth noting that “developer” in this case can be anyone of any discipline. Some tech debt gets in the way of engineers writing new code, some blocks designers creating new scripts, some interferes with VFX artists making new particles, etc.

Fix Cost

The second axis is the cost to fix the tech debt. Fixing an issue in our code or data takes a measurable amount of someone’s time. If it’s a deeply rooted assumption that affects every line of code in the game, it may take weeks or months of engineering time. If it’s a dumb error in a single function, it may be fixable in a matter of minutes. Regardless of the time to implement a fix, though, we also must consider the risk of actually deploying that fix. Even a system I consider “wrong” can still be used as a tool to make a great game. If I change the way our scripting engine handles errors, or how particles compute their spawn time, that could break any of the 500+ spells on 140+ champions in the game.

Contagion

The third axis is something I’ve become obsessed with: contagion. If this tech debt is allowed to continue to exist, how much will it spread? That spreading can result from other systems interfacing with the afflicted system, from copy-pasting data built on top of the system, or from influencing the way other engineers will choose to implement new features.

If a piece of tech debt is well-contained, the cost to fix it later is basically the same as the cost to fix it now. You can weigh how much impact it has today when deciding when a fix makes sense. If, on the other hand, a piece of tech debt is highly contagious, it will steadily become harder and harder to fix. What’s particularly gross about contagious tech debt is that its impact tends to increase as more and more systems become infected by the technical compromise at its core.
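One way to picture that difference is a toy projection of fix cost over time. The sketch below is purely illustrative: the `DebtItem` fields mirror the three axes, but `projected_fix_cost`, its compounding formula, and the `spread_per_point` rate are assumptions invented for this example, not a model taken from any real tracking tool. A contagion score of 1 is treated as “well-contained,” so its cost stays flat.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    name: str
    impact: int     # 1 (negligible) to 5 (severe)
    fix_cost: int   # 1 (minutes) to 5 (months, high risk)
    contagion: int  # 1 (well-contained) to 5 (spreads everywhere)


def projected_fix_cost(item: DebtItem, periods: int,
                       spread_per_point: float = 0.10) -> float:
    """Assumed model: every contagion point above 1 adds
    `spread_per_point` of compound growth to the fix cost per period."""
    growth = 1.0 + spread_per_point * (item.contagion - 1)
    return item.fix_cost * growth ** periods


contained = DebtItem("contained", impact=1, fix_cost=2, contagion=1)
contagious = DebtItem("contagious", impact=4, fix_cost=4, contagion=4)

for item in (contained, contagious):
    costs = [round(projected_fix_cost(item, n), 1) for n in (0, 4, 8)]
    print(item.name, costs)
```

The specific numbers are arbitrary. What matters is the shape: the contained item costs the same whenever you get to it, while the contagious one compounds.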

Types of Debt

Now that we have a framework for measuring a particular piece of tech debt, let’s talk about some broad categories of tech debt that I’ve seen on League of Legends.

Local Debt

Local debt resembles the classic “black box” model of programming. As far as the rest of the game is concerned, the local system (spell, network layer, script engine) works pretty reliably. No one needs to keep the debt in mind as they develop around the system. But if anyone opens the lid and looks inside, they’ll be horrified, disgusted, or completely confused by what they see.

You can find a couple of real-world examples of local debt in your eyes. Due to how the eye is constructed, the image projected onto your retina is upside down. More significantly, the optic nerve creates a blind spot near the middle of each eye’s field of view. This malformed data is sent to the visual centers of the brain, which must flip the image and fill the blind spots so the rest of the brain can interact with the “correct” image. These oddities are localized to the eye/optic nerve system and easily avoided by other systems, so they’re “good enough.”

One of the most famous instances of local debt in League of Legends is Jarvan’s Cataclysm, which is made of minions to this day. When designers need to attach gameplay effects to a location (or a set of locations), one of the tools available to them is the ability to spawn an “invisible minion.” RiotXypherous describes what I mean by “minion” here. These game objects are a stable and well-understood way to track and execute scripted logic. In cases like Jarvan’s wall, you need to spawn a large number of minions (24 to be precise) to make sure that no one can squeeze through the wall. An alternate solution could be a ring-terrain construct consisting of a single logical piece controlling the pathability of Cataclysm. If we took this approach, we could clean up the logic and slightly reduce computation cost. Let’s take a look at Cataclysm using our impact, fix cost, and contagion model to see why a fix isn’t currently the best option.

Cataclysm Metrics

1.    Impact: 1 / 5

Back when there were 12 minions, people would occasionally squeak through the wall, so Riot Exgeniar bumped it to 24. The fact that the wall is made of minions pretty much never influences any other developer as they make new content. (As an aside, the infamous “Jarvan Ult Hitch” was due to the confluence of this debt and a loading bug from trying to read missing auto-attack definitions.)

2.    Fix cost: 2 / 5

We don’t currently have the ability to composite shapes to create custom geometry without new code. If we wanted to create a ring shape to do an “area trigger” to more efficiently enforce Jarvan’s barrier, we would have to write some bespoke math to calculate collisions with a ring. We’re exploring Constructive Solid Geometry for other purposes, which may cataclysmically reduce the cost to fix.

3.    Contagion: 1 / 5

No one needs to take the implementation of Jarvan’s wall into account when developing features, which keeps it well contained. The one risk of contagion is other designers copy/pasting the implementation into their new champions (which has happened here and there). But as far as implementation problems go, the potential spread of Cataclysm is low and well understood.
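The “bespoke math” for a ring-shaped area trigger is not exotic in itself. Here is a minimal sketch of the core test, assuming a flat 2D plane and a wall described by a center plus inner and outer radii. The function name, parameters, and the radii used are all hypothetical, and a real fix would also have to handle movement paths, unit sizes, and engine integration, which is where the cost actually lives.

```python
def in_ring(point, center, inner_radius, outer_radius):
    """Return True if `point` lies inside the ring (annulus) around `center`."""
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    dist_sq = dx * dx + dy * dy
    # Compare squared distances to avoid a square root per query.
    return inner_radius ** 2 <= dist_sq <= outer_radius ** 2


# Hypothetical wall: centered at the origin, 1 unit thick.
wall = {"center": (0, 0), "inner_radius": 4, "outer_radius": 5}

print(in_ring((0, 0), **wall))    # standing in the middle of the arena
print(in_ring((4.5, 0), **wall))  # standing in the wall itself
print(in_ring((6, 0), **wall))    # standing outside
```

A single check like this would stand in for 24 separate minion collision tests, which is the “slightly reduce computation cost” benefit mentioned earlier.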

This is a pretty typical shape for local debt. In general, local debt is defined by a low contagion score. If the impact is higher than the cost to fix, it tends to get fixed by a good citizen before too long.

When considering whether to fix local debt, first ask yourself if it’s worth it. If the debt is truly not contagious, it should be safe to leave alone for as long as necessary. One of the biggest mistakes I observe is an instinct to jump on local debt that itches an engineer’s perfectionist side when it doesn’t have a broad enough impact to warrant the effort. If you do decide to make a fix, it’s usually easy to confirm the fix and regression test, due to the locality of the change.

Some recent examples of local debt that have been fixed include bugs with inhibitors causing champions to path towards 0,0,0 in certain circumstances, Janna’s Monsoon ignoring spell shields, and Tear of the Goddess stacking on manaless casts.

MacGyver Debt

MacGyver debt is named after the TV show from the mid-80s. Angus MacGyver would solve problems using his Swiss Army knife, duct tape, and whatever else was on hand.

His solutions often involved attaching two unlikely pieces; in the context of tech debt, this means two conflicting systems are “duct-taped” together at their interface points throughout the codebase.

Seattle (among other cities) has a dramatic example of MacGyver debt as you can see above. The city used to have two competing settlements, each with its own grid. When those settlements grew into the modern Emerald City, the slightly different grids were mushed together, resulting in awkwardly shaped blocks and buildings and a less-than-efficient use of space. I’m particularly amused by the little shaved-off corner of the building in the bottom left.

One of the best examples of MacGyver debt in the LoL codebase is the use of C++’s std::string vs. our custom AString class. Both are ways to store, modify, and pass around strings of characters. In general, we’ve found that std::string leads to lots of “hidden” memory allocations and performance costs, and makes it easy to write code that does bad things. AString is specifically designed with thoughtful memory management in mind. Our strategy for replacing std::string with AString was to allow both to exist in the codebase and provide conversions between the two (via .c_str() and .Get() respectively). We gave AString a number of ease-of-use improvements that make it easier to work with and encouraged engineers to replace std::string at their leisure as they change code. Thus, we’re slowly phasing std::string out and the “duct tape” interface between the two systems slowly shrinks as we tidy up more of our code.

std::string vs AString Metrics

1.    Impact: 2 / 5

Most of the high-impact allocations from std::string have been phased out via profiling, so at this point, the main cost is the small mental switching cost to convert from one system to the other.

2.    Fix cost: 3 / 5

The conversion to AString isn’t just a find-and-replace. There are a few flavors of AString for different purposes (AStackString for initial allocation in stack memory, ARefString for references to static strings, in addition to the heap-allocated base AString). A real, thinking human needs to look at a replacement site to do it right. It will be a long, slow process to phase out the old system.

3.    Contagion: -2 / 5

By making AString easier to work with than std::string, we’ve actually managed to flip contagion to our side. Every time an engineer checks in a change to game code, there’s a chance that AString has spread further, like a virus.

The biggest cost to most MacGyver debt tends to be the intellectual cost of switching modes when crossing boundaries. If some bug or feature is held back by being on the “wrong” system, a targeted move to the “right” system tends to be straightforward. The relative contagion of the new system vs. the old system is the key metric to keep an eye out for. If you can flip that balance to favor the new system, then the better system will inevitably win.

When considering whether to fix MacGyver debt, try to find ways to make the globally better system more desirable at a local level. If a time-pressured engineer making greedy optimizations during their day-to-day work chooses to move towards the desired end state, then you’re well on your way.

The other approach that can work is to do brute-force large-scale refactors. Depending on how closely the systems map, it may be possible to fix some or all of your MacGyver debt via clever regexes.
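As a rough illustration of what a regex-driven pass might look like, the sketch below rewrites one mechanical pattern and flags everything else for a human. It assumes, purely for the sake of the example, that a `const std::string&` parameter can be swapped for `const AString&`; whether that holds depends on the real AString API, which isn’t shown here. `migrate_line` and both patterns are hypothetical.

```python
import re

# The one pattern this sketch treats as safe to rewrite mechanically.
CONST_REF = re.compile(r"\bconst\s+std::string\s*&")
# Any remaining use of the old type, which needs a human decision
# (AStackString? ARefString? base AString?).
ANY_STD_STRING = re.compile(r"\bstd::string\b")


def migrate_line(line):
    """Return (rewritten_line, needs_review) for one line of C++ source."""
    new_line = CONST_REF.sub("const AString&", line)
    needs_review = bool(ANY_STD_STRING.search(new_line))
    return new_line, needs_review


source = [
    "void Log(const std::string& msg);",
    "std::string name = GetName();",
]
for line in source:
    print(migrate_line(line))
```

Splitting the work this way lets the regex take the boring cases while keeping a real, thinking human on the sites where the right flavor of replacement isn’t obvious.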

Foundational Debt

Foundational debt is when some assumption lies deep in the heart of your system and has been baked into the way the entire thing works. Foundational debt is sometimes hard to recognize for experienced users of a system because it’s seen as “just the way it is.”

A hilariously stupid piece of real-world foundational debt is the measurement system referred to as United States Customary Units. Having grown up in the US, my brain is filled with useless conversions, such as 5,280 feet in a mile, 2 pints in a quart, and 4 quarts in a gallon. The US government has considered switching to metric multiple times, but we remain one of seven countries that haven’t adopted Système International as the official measurement system. This debt is baked into road signs, recipes, elementary schools, and human minds.

We’ve talked about several of the big pieces of foundational debt that Riot has been tackling in previous Tech Blog articles like Determinism in League of Legends and Game Data Server.

Another example of foundational debt that I think about a lot is our use of the Lua scripting language. Designers on League use a tool called BlockBuilder to create complex behaviors by stringing together blocks of functionality like getting the distance between points, spawning minions, dealing damage, or doing all manner of script flow control. The set of operations designers choose from is varied but limited, and the parameters for each operation are constrained. Yet long, long ago, in the prehistory of League of Legends, the decision was made not to store the blocks and parameters in a simple, constrained format that matches the data. Instead they’re stored as arrays and tables in the powerful, beautiful, and entirely-too-complex-for-this-purpose Lua language. A decade or so of game development has since taken place upon that foundation, and now manipulating Lua objects is one of the most common operations in the engine.

BlockBuilder Lua Metrics

1.    Impact: 4 / 5

The mismatch between Lua and this problem space has many costs. Every callstack is polluted with ~6 marshalling stack frames for each frame of BlockBuilder logic. Those marshalling operations are not cheap in terms of server CPU usage. Reading diffs of script changes is needlessly difficult. Parsing/searching script files to determine functionality requires a fairly in-depth understanding of the Lua language.

2.    Fix cost: 4 / 5

Since Lua is so deeply embedded into the engine, digging it out would be difficult. A current proposal is to create a wrapper class that behaves like the Lua objects, but is a much simpler struct under the hood, so we can steadily morph our scripting innards into something more suitable. But any way we approach it, we’ll need to be careful and thoughtful.

3.    Contagion: 4 / 5

Every time a system bumps up against scripting (which is the core unit of logic for LoL), that system will be shaped by the operations and requirements of the Lua backend. We average a new Building Block every ~3-4 days, each of which directly manipulates Lua objects. The longer we don’t replace Lua, the harder it becomes to replace Lua.
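To make the wrapper idea from the fix-cost discussion a little more concrete, here is a minimal sketch of a constrained block record that still answers table-style lookups. Everything in it is an assumption for illustration: the `Block` class, the `BLOCK_SCHEMA` table, and the parameter names are invented (only the ApplyDamage block name comes from the real toolset), and the actual proposal would live in the engine’s native code rather than look like this.

```python
from dataclasses import dataclass

# Hypothetical schema: the parameters each block type is allowed to take.
BLOCK_SCHEMA = {
    "ApplyDamage": {"Target", "Damage"},
    "SpawnMinion": {"Position", "Team"},
}


@dataclass(frozen=True)
class Block:
    op: str
    params: dict

    def __post_init__(self):
        # Reject anything the schema does not describe, which a
        # free-form table would silently accept.
        allowed = BLOCK_SCHEMA.get(self.op)
        if allowed is None:
            raise ValueError(f"unknown block type: {self.op}")
        unexpected = set(self.params) - allowed
        if unexpected:
            raise ValueError(f"{self.op} does not take {sorted(unexpected)}")

    def __getitem__(self, key):
        # Table-style lookup so legacy call sites can keep indexing by name.
        return self.params[key]


block = Block("ApplyDamage", {"Target": "Owner", "Damage": 20})
print(block["Damage"])
```

The design choice being illustrated is that old call sites keep their familiar access pattern while the storage underneath becomes simple, validated data.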

Foundational debt tends to index highly on all three axes. The high cost encourages sticking with the janky system, which is often the right call, but the high impact and high contagion mean that fixing egregious foundational debt can have a huge payoff.

The most common strategy for fixing foundational debt that I’ve observed at Riot is to stand up the new system alongside the old one. If possible, I recommend then converting the foundational debt to MacGyver debt by slowly porting systems over to using the new system with conversion operations available to cross between new and old. This allows you to start reaping the benefits in targeted areas easily while limiting exposure to risk. Sometimes such a conversion isn’t possible, though. In that case, creating a compile time (or if possible, loading time) switch can help build confidence in the new system without forcing you to go all-in. The former is in use for the GDS conversion, and the latter worked for Determinism.
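A load-time switch can be as small as one decision made while the game boots. The sketch below assumes two interchangeable implementations behind the same `get` method and a config flag named `use_new_stats`. The class names, the flag, and the sample data are all hypothetical stand-ins, not taken from the real codebase.

```python
class LegacyStats:
    """Stand-in for the old system: values stored as strings."""

    def __init__(self):
        self._data = {"BaseArmor": "25"}

    def get(self, key):
        return float(self._data[key])


class NewStats:
    """Stand-in for the replacement: values stored with their real type."""

    def __init__(self):
        self._data = {"BaseArmor": 25.0}

    def get(self, key):
        return self._data[key]


def load_stats(config):
    # Decided once at load time, so the rest of the game never has to
    # know which backend it is talking to.
    if config.get("use_new_stats", False):
        return NewStats()
    return LegacyStats()


print(load_stats({}).get("BaseArmor"))
print(load_stats({"use_new_stats": True}).get("BaseArmor"))
```

Because the old path stays the default, the new system can be exercised in targeted environments until there is enough confidence to flip the flag for everyone.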

Data Debt

Data debt starts with a piece of tech debt from one of the other categories. Perhaps it’s a bug in the scripting system, a less-than-desirable file format for items, or two systems that don’t play very well with each other. But then a ton of content (art, scripts, sounds, etc.) gets built on top of that code deficiency. Before too long, fixing the initial tech debt becomes extremely risky and it becomes painfully hard to tell what you’ll break if you try to fix anything.

My favorite real-world example for understanding data debt is DNA. The genome of an organism is slowly built up over millions of years through lossy copies (mutations), transcription errors, and evolutionary pressure. Some copy errors are useless but benign, others are harmful, and others confer powerful advantages. Attempting to figure out what any piece of DNA actually does is incredibly difficult. We fully understand what the base pairs mean, and how sets of base pairs translate into amino acids for protein construction. We are even starting to understand more about some non-coding roles that DNA can play. But in the 3 billion plus base pairs of the human genome, there’s still so much we don’t even remotely understand. Radiolab’s episode about CRISPR highlights one such puzzle that was recently cracked.

Data debt on League of Legends is most impactful when it turns an otherwise trivial fix into a grueling ordeal. I will share one tiny example, but trust me: data debt is one of the most crucial considerations when making changes to the LoL engine. Our game engineers develop deep knowledge of how game systems were implemented and become quite skilled at predicting what data may break when they change some piece of code.

A memorable piece of data debt that we fixed a few years ago involved block parameters in our BlockBuilder scripting language. Above, you can see a toy example where I try to increase Owner’s armor by a variable plus a constant. I would expect Owner to receive 25 bonus armor: 20 from the variable Delta, which is passed into the block, and 5 from the constant. Since the variable’s name matches the parameter name, however, this used to yield 40. (Don’t even ask me why it wouldn’t yield 45; I have no idea what thought process led to the early-out.)
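For readers who want to poke at the numbers, here is a tiny reproduction of the observed behavior. It is emphatically not the engine’s code: `resolve_bonus` and its arguments are hypothetical, and the “buggy” branch is written only to reproduce the 25-versus-40 outcome described in this example, not to reflect whatever the original early-out actually did.

```python
def resolve_bonus(param_name, variable_name, constant, variables, buggy):
    """Compute variable + constant for a block parameter.

    With `buggy` set, a variable whose name matches the parameter name
    is counted twice and the constant is never added.
    """
    value = variables[variable_name]
    if buggy and variable_name == param_name:
        return value + value  # the nonsensical early-out
    return value + constant


variables = {"Delta": 20}

print(resolve_bonus("Delta", "Delta", 5, variables, buggy=False))
print(resolve_bonus("Delta", "Delta", 5, variables, buggy=True))
```

Renaming the variable to anything other than the parameter name sidesteps the buggy branch entirely, which is part of why the problem was easy to trigger by accident and hard to spot.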

When NoopMoney, an engineer on Champions team, went to fix this nonsensical behavior, all he had to do was delete 4 lines of code. But with a highly contagious piece of debt like this, even a small change requires thorough planning. Any of 400,000 lines of script in LoL might have any of their numerical parameters being doubled by this bug. What’s worse, those scripts are “behaving correctly” in that the game is balanced and tuned around those potentially doubled values. NoopMoney had to make the fix toggleable on Live (in case we had any unexpected bugs come up) in addition to doing extensive regex searching and QA sweeps to try to identify which scripts might rely on this bug to function properly. In the end, the problems from fixing this bug were fairly minor; a small handful of champions needed their scripts altered. But the data debt made it difficult to predict.

Parameter Naming Bug Metrics

1.    Impact: 2 / 5

This bug’s impact was small when it occurred. It doubled a passed-in value, and potentially discarded a constant. But it became yet another bit of otherwise useless tribal knowledge that designers and engineers needed to carry around (once they became aware of it). Developer mindshare is too valuable a resource to be wasted like this.

2.    Fix cost: 2 / 5

All told, the fix was straightforward. By creating a live feature toggle, we were able to increase our confidence in the safety of the fix. The most expensive part was the initial screening to try to evaluate the scope of the problem in order to target testing.

3.    Contagion: 4 / 5

The unfortunate thing about this bug was that it preyed on an extremely logical behavior. If, for example, you want to deal damage to a unit, saving the value in a variable called “Damage” is totally logical. Sadly, though, the ApplyDamage block took in the amount in a parameter of the same name, thus triggering the bug. Then, if someone else wants to make a similar spell, they’d copy/paste your blocks and carry on, thus spreading the bug even farther.

In general, data debt indexes high on cost to fix since it makes changes hard to evaluate. More worryingly, it’s almost always extraordinarily contagious due to a few properties of data (as opposed to code). First, it’s generally acceptable to create a new piece of data with a copy/paste of an existing piece of data. If you’re making a new skillshot spell, starting with Ezreal’s Mystic Shot can save you tons of time. Any issues with an existing piece of data are propagated out to its descendants. Second, data is rarely subjected to technical review akin to code reviews. This makes it difficult to notice and halt the spread of bad practices even if they’re widely known. Finally, fixing any issue in the data typically requires a human being with eyes and a brain to verify - a compiler and formal logic won’t cut it.

When fixing data debt, I’ve observed two main approaches. The first I call the “do it right checkbox.” This means making a toggle between the old “broken” behavior and the new “fixed” behavior for data creators. Ideally, you make the fixed version default while you make sure old content uses the broken version. Then, like with MacGyver debt, you can do a slow and steady replacement to get things onto the new version. This has a permanent cost of adding more and more crap to your editing UI.
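A minimal sketch of the checkbox pattern, assuming content is loaded from key/value records: the `BlockData` class and the `use_fixed_behavior` field are hypothetical names. The trick is that the two defaults differ. Content created fresh gets the fix, while files saved before the field existed fall back to the old behavior.

```python
from dataclasses import dataclass


@dataclass
class BlockData:
    name: str
    # New content created in the editor gets the fixed behavior.
    use_fixed_behavior: bool = True


def load_block(raw):
    # Files saved before the checkbox existed have no such field, so
    # they keep the old behavior they were tuned around.
    return BlockData(raw["name"], raw.get("use_fixed_behavior", False))


fresh = BlockData("NewSpell")
old = load_block({"name": "OldSpell"})
ported = load_block({"name": "OldSpell", "use_fixed_behavior": True})

print(fresh.use_fixed_behavior, old.use_fixed_behavior, ported.use_fixed_behavior)
```

Porting a piece of old content then means ticking the box and retesting that one asset, rather than retesting the whole game at once.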

The second approach is the “just fix the damn thing” approach, like NoopMoney used on the parameter naming bug. This means fixing the bug and then trying to repair all the data that’s meaningfully affected. Several techniques can make this less terrifying. First is doing a lot of greps and regex searching to try to understand the theoretical impact. Second is a bunch of targeted testing. Finally, you can prepare a toggle to enable reverting to the old behavior once the fix ships in case you missed something worse than the bug you’re fixing. It’s worth noting that Determinism helps us a lot with testing for these types of changes by letting us confirm that the server produces the same results before and after a change.
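The targeted-testing step can be sketched as a before/after comparison over existing data. The example below assumes a deterministic evaluation function and a list of script records. The record layout, `evaluate`, and `find_affected` are hypothetical, and the buggy branch only mimics the parameter naming bug described earlier in this section.

```python
def evaluate(block, fix_enabled):
    """Deterministic stand-in for running one block on the server."""
    value = block["value"]
    if not fix_enabled and block["variable"] == block["param"]:
        return value * 2  # old behavior: doubled, constant dropped
    return value + block["constant"]


def find_affected(blocks):
    """List scripts whose result changes when the fix is switched on."""
    return [b["script"] for b in blocks
            if evaluate(b, fix_enabled=False) != evaluate(b, fix_enabled=True)]


blocks = [
    {"script": "SpellA", "param": "Delta", "variable": "Delta",
     "value": 20, "constant": 5},
    {"script": "SpellB", "param": "Delta", "variable": "Bonus",
     "value": 20, "constant": 5},
]

print(find_affected(blocks))
```

Keeping `fix_enabled` as a runtime argument also doubles as the revert toggle: if something worse than the original bug slips through, the old branch is still there to switch back to.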

Summary

When measuring a piece of tech debt, you can use impact (to customers and to developers), fix cost (time and risk), and contagion. I believe most developers regularly consider impact and fix cost, while I’ve rarely encountered discussions of contagion. Contagion can be a developer’s worst enemy as a problem burrows in and becomes harder and harder to dislodge. It is possible, however, to turn contagion into a weapon by making your fix more contagious than the problem.

Working on League, most of the tech debt I’ve seen falls into one of the 4 categories I’ve presented here. Local debt, like a black box of gross. MacGyver debt, where 2 or more systems are duct-taped together with conversion functions. Foundational debt, when the entire structure is built on some unfortunate assumptions. Data debt, when enormous quantities of data are piled on some other type of debt, making it risky and time-consuming to fix.

I hope this post helps provide some useful terms for thinking about and discussing tech debt. Hit the comments below if you have questions or wisdom about how to deal with tech debt.

 
Posted by Bill Clark