Right now, the personalized homepage for a popular e-commerce platform is recommending me a football pool toy alongside a pack of “self-sealing” water balloons. This despite the fact that I do not have kids, don’t own a pool and can see four inches of snow outside my window.
The platform is either giving me way too much long-term-planning credit, or it might be relying too heavily on past data — specifically when I got my nieces some water guns for their birthday last year.
Most recommendation systems are very good at leveraging user history, but the less sophisticated ones aren’t as dynamic and responsive as they could be, especially when preferences change or new contexts emerge, like sudden snowstorms.
Enter reinforcement learning.
What Is Reinforcement Learning?
Reinforcement learning is a branch of machine learning, distinct from supervised learning and unsupervised learning. Rather than being trained on a body of clearly labeled data, reinforcement learning systems “learn” through trial and error as agents run actions across a state space, improving their decision process through a reward structure. The system balances exploration of new options against exploitation of acquired knowledge.
Reinforcement learning (RL), a branch of machine learning, tries to produce better outcomes by exploring an environment through trial and error. As in any good education, feedback is critical. The RL mechanism, called an agent, runs through trial after trial, taking actions within a state, or the conditions of the environment. Here’s a good analogy of states via O’Reilly: “For a robot that is learning to walk, the state is the position of its two legs. For a Go program, the state is the positions of all the pieces on the board.”
Each action produces a feedback signal. Unlike in unsupervised learning, an RL agent is “trying to maximize a reward signal instead of trying to find hidden structure,” explain Richard Sutton and Andrew Barto in Reinforcement Learning: An Introduction, a foundational text in the field.
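To make the agent, state, action and reward vocabulary concrete, here is a minimal sketch of that loop in Python. Everything in it is an assumption made for illustration: the `LineWorld` environment, its single reward at the far end and the `run_episode` helper are hypothetical and do not come from any library mentioned in this article.

```python
import random


class LineWorld:
    """Hypothetical toy environment: positions 0..4, with a reward for reaching 4."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The action is -1 (left) or +1 (right); the state is the agent's position.
        self.state = min(max(self.state + action, 0), self.size - 1)
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0  # the feedback signal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward the agent collected."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


rng = random.Random(0)


def random_policy(state):
    """An agent that has learned nothing yet: pure trial and error."""
    return rng.choice([-1, 1])
```

An agent that always steps right collects the reward in four steps, while one that always steps left never sees it. A learning algorithm's job is to get from the random policy to the first one using only the reward signal.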
By scurrying to maximize validation, RL might sound a bit…insecure? It is, in a sense. Because it doesn’t have much information — or, in the context of the water toy-obsessed personalizer, not much useful information — it needs to learn by trying many different approaches in order to perform confidently.
“You have a reinforcement learning problem when the data that you want to learn on is created by the solution,” said John Langford, principal research scientist at Microsoft Research.
Langford is a leading reinforcement learning researcher. He coined the term contextual bandits, which is an extension of the concept of the multi-armed bandit. That’s a reference to slot machines (aka one-armed bandits) and the gambler’s dilemma of choosing which machine to feed their money.
In reinforcement learning, there’s an eternal balancing act between exploitation, when the system chooses a path it has already learned to be “good” (as in a slot machine that’s paying out well), and exploration, or charting new territory to find better possible options. The upside of exploring is that the new option might be a jackpot; the risk is that it might be a terminal bust.
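The slot-machine dilemma is easy to simulate. The sketch below uses one simple strategy, often called epsilon-greedy: most of the time pull the arm with the best payout estimate so far, and a small fraction of the time pull one at random. The payout probabilities, the 10 percent exploration rate and the function name are illustrative assumptions, not values from any real system.

```python
import random


def epsilon_greedy_bandit(payout_probs, epsilon=0.1, pulls=5000, seed=0):
    """Simulate epsilon-greedy play on slot machines with the given win probabilities."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    counts = [0] * n_arms      # how often each arm was pulled
    values = [0.0] * n_arms    # running average reward per arm
    total_reward = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=values.__getitem__)  # exploit
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean
        total_reward += reward
    return counts, values, total_reward


counts, values, total_reward = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```

With `epsilon=0` the gambler can stay stuck on whichever machine paid out first. With a little exploration, the pulls concentrate on the best machine over time.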
You can see that tension at work in all reinforcement learning systems, including Microsoft’s Personalizer, a culmination of Langford’s research. It’s essentially a recommender system for non-experts that can be incorporated into different platforms where such a service might be useful, like a news site or online retailer. Anheuser-Busch InBev is using it now to fashion recommendations within an online store to better serve small grocers in Mexico, according to Microsoft.
As in many reinforcement learning systems, the default exploration/exploitation setting on Personalizer is epsilon greedy. The setting is “very naive and simplistic,” Langford said, but, importantly, and unlike more sophisticated alternatives, it allows for counterfactual evaluation. That is, the user can see what would have happened under a different trajectory, or “policy.”
“The ability to actually try out alternative features or alternative learning rates after the fact is very valuable in practice,” he said.
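A rough sketch of why epsilon-greedy logs support that kind of after-the-fact analysis: because every action had a known, nonzero chance of being chosen, logged rewards can be reweighted to estimate how a different policy would have fared. The code below uses inverse propensity scoring, a standard estimator for this purpose; it is not Personalizer's implementation, and the snow/sun contexts, action numbering and function names are all hypothetical.

```python
import random


def log_epsilon_greedy(contexts, greedy_policy, reward_fn, n_actions, epsilon, rng):
    """Collect (context, action, reward, probability) tuples under epsilon-greedy."""
    logs = []
    for x in contexts:
        greedy = greedy_policy(x)
        action = rng.randrange(n_actions) if rng.random() < epsilon else greedy
        # Probability the logging policy gave to the action it actually took.
        prob = epsilon / n_actions + (1 - epsilon if action == greedy else 0.0)
        logs.append((x, action, reward_fn(x, action), prob))
    return logs


def ips_estimate(logs, target_policy):
    """Inverse propensity score estimate of the target policy's average reward."""
    total = 0.0
    for x, action, reward, prob in logs:
        if target_policy(x) == action:
            total += reward / prob
    return total / len(logs)


# Hypothetical setup: action 0 = sled, action 1 = water balloons.
rng = random.Random(0)
contexts = [rng.choice(["snow", "sun"]) for _ in range(20000)]
reward_fn = lambda x, a: 1.0 if (x == "snow") == (a == 0) else 0.0
always_balloons = lambda x: 1                      # the policy that was deployed
weather_aware = lambda x: 0 if x == "snow" else 1  # the policy we wish we had tried

logs = log_epsilon_greedy(contexts, always_balloons, reward_fn, 2, 0.2, rng)
estimate = ips_estimate(logs, weather_aware)
```

Here the weather-aware policy was never deployed, yet its estimated reward comes out close to its true value of 1.0. That only works because the log stores the probability of each action.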
In Personalizer, end-user actions are sent to the platform’s Rank API, which then determines, based on those inputs and probabilities, whether to exploit or explore. Adventurous users can put a thumb on the scale, if they choose, by toggling toward a higher or lower exploitation percentage. (It’s also possible to swap out epsilon greedy for a completely different algorithm, although that’d require more advanced know-how.)
Reinforcement learning has made quick inroads into recommendation systems. You see it in action every time you fire up Netflix, which turbocharges A/B testing with contextual bandits to tailor the artwork of a movie or series to the viewer.
But its knack for chicken-or-egg problems has also made it a popular method in robotics, where it helps machines learn to grip and move; gaming, where it can power procedural generation; and HVAC, where an RL-enabled thermostat can “learn” the intricacies of a building’s climate. It’s even helping build the perfect Cheeto.
At the same time, reinforcement learning faces real challenges. Some are technical; some are more diffuse. Below, we’ll explore a few of them through the lens of two projects — one practical (Microsoft’s aforementioned Personalizer) and one novel (researchers’ mission to teach RL agents to find a diamond in Minecraft).
It Has To Work, Even with Limited Data
The idea that a deployed solution has to, well, work in the real world is an obvious one, but it’s so important that Langford emphasized it immediately in our conversation. The luxury of limited variables afforded by sandbox environments just doesn’t exist beyond their walls.
The most highly publicized implementation of reinforcement learning has probably been AlphaGo, the DeepMind program that famously bested a human Go champion. Feats like that leave the general public thinking the technology is borderline magical.
“In the long term, maybe [the applications] are really fantastic,” said Langford, who also emphasized his preference for practical rather than “adventure” applications at Microsoft’s recent Reinforcement Learning Day event. “But in the short term, with what we know how to do well now, it’s important to understand the distinction between a simulated environment and the real world.”
“In a simulated environment, you have an unlimited amount of data,” he added. “That’s a key distinction [in terms of] trying to solve many otherwise natural applications of reinforcement learning.”
Getting the Data Logging Right
The personalization and recommendation realm has grown to emphasize the importance of reliable data logging. Langford knows the pain that errant logging can cause — and it doesn’t take much to cause big problems.
“I’ve been involved with a bunch of reinforcement learning projects where the data logging was wrong, and that destroyed the project,” Langford said. “There’s a strong failure mode associated with seemingly minor logging failures.”
A common mistake: Instead of logging the actual features used to make a prediction, an engineer logs a reference to those features. If the referenced values change before it’s time to train the model, the training data no longer matches what the system saw when it made the prediction, and the system starts to break down.
“The quality of the resulting model can decay very rapidly,” he said.
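The failure mode is easy to reproduce in miniature. In the sketch below, which is purely illustrative, `feature_store`, `user_42` and both logging functions are hypothetical names. One function logs a reference to the features, and the other logs a snapshot of the values the model actually saw.

```python
import copy

# Hypothetical mutable store of user features.
feature_store = {"user_42": {"last_purchase": "water guns", "season": "summer"}}


def log_by_reference(event_log, user_id, action):
    # Fragile: stores only a pointer to the features, which may change later.
    event_log.append({"features_ref": user_id, "action": action})


def log_by_value(event_log, user_id, action):
    # Safer: snapshot the exact features the model saw at decision time.
    snapshot = copy.deepcopy(feature_store[user_id])
    event_log.append({"features": snapshot, "action": action})


event_log = []
log_by_reference(event_log, "user_42", "recommend_water_balloons")
log_by_value(event_log, "user_42", "recommend_water_balloons")

# Months later, the store is updated before anyone trains on the log.
feature_store["user_42"]["season"] = "winter"

resolved = feature_store[event_log[0]["features_ref"]]  # now says "winter"
snapshot = event_log[1]["features"]                     # still says "summer"
```

Training on `resolved` would pair a summertime decision with wintertime features. The snapshot costs more storage but keeps each logged decision tied to the inputs that produced it.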
Choose Reward Structures Wisely
A reinforcement learning system “teaches itself,” so to speak, by collecting reward signals across the many actions and states the agent runs through. But it still has to be guided by human hands to some extent. One of the biggest questions is also one of the most elemental: What should be rewarded?
That’s easy enough in the context of, say, a simple game. The agent managed to score a point? Give that agent a reward. But what does that mean in the context of a recommendation system?
Langford explained by way of counterexample. One could conceivably misuse Personalizer to try to predict how many ads to place on a website. To do so, you might tie the reward to the amount of revenue generated per event. The system would happily choose to litter your site with ads, and users would probably flee.
The approach should be to align the rewards with the long-term goal as best as possible, Langford said. That’s sometimes easier said than done. The sample complexity, for instance, can shoot up when the reward reflects only a long-term outcome rather than a short-term proxy, which is to say it can take a really long time before the model becomes reliable.
“There’s a bit of a tension between the most natural declaration of the problem, which may involve just optimizing the long-term reward, and the most tractable version of the problem to solve, which may involve optimizing short-term proxy,” he said. “That’s one of the finesses in trying to frame reinforcement learning problems effectively.”
Don’t Rely Too Hard on Reward Shaping
There are a couple of ways to tackle the reward challenge. The big one is what’s known as reward shaping. It can involve complex mathematical functions, but it essentially means leaning on domain knowledge to add intermediate rewards or amplify the reward signal. Think of it as signposting smaller rewards along the way, before a task is fully completed, in order to nudge an agent in a certain direction.
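For a flavor of what shaping looks like in code, here is a minimal sketch of one well-known variant, potential-based shaping, which adds a bonus of gamma times the potential of the next state minus the potential of the current one. The distance-to-goal potential and the function names are hypothetical choices for illustration; real shaping functions encode whatever domain knowledge the designer has.

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.99):
    """Add a potential-based shaping bonus: gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


def negative_distance_to_goal(goal):
    """Hypothetical potential: the closer to the goal, the higher the potential."""
    def potential(state):
        return -abs(goal - state)
    return potential


phi = negative_distance_to_goal(goal=10)
toward = shaped_reward(0.0, state=3, next_state=4, potential=phi, gamma=1.0)  # 1.0
away = shaped_reward(0.0, state=4, next_state=3, potential=phi, gamma=1.0)    # -1.0
```

Even though the environment itself pays nothing until the goal is reached, the agent now gets a small nudge on every step. The catch is that someone had to know that distance to the goal was the right thing to measure.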
Reward shaping isn’t bad, but it takes time and intervention. For William Guss, a research scientist at OpenAI, it’s symptomatic of what he sees as the field’s overreliance on domain-specific hacks, rather than truly cross-functional models. It’s a deep, constitutional challenge for reinforcement learning — one that Guss and his colleagues are trying to solve with Minecraft.
Guss heads up MineRL, a competition that asks entrants to develop systems that can direct a character to mine a diamond within the game’s 3D world. That’s more complex than it might sound, mainly because the task is hierarchical. Before you can mine a diamond, you need an iron pickaxe. To make an iron pickaxe, you need iron ore, which you have to mine with a stone pickaxe, which in turn needs wood for a handle.
But the real challenge comes from the restrictions Guss and his co-organizers impose. First, no reward shaping. Also, entrants’ systems are trained on a limited amount of compute — six CPUs and one GPU. And even though no one unearthed a diamond during the inaugural competition in 2019, the second competition, currently underway, is even harder, thanks to some neural network-generated world obfuscation.
The Problem of Sample Inefficiency
The primary goal behind MineRL is to push for solutions that address the problem of sample inefficiency, or the fact that it often takes an incredibly long time for a system to be sufficiently rewarded into competence. AlphaGo Zero, for instance, self-played nearly 5 million games of Go before reaching its crowning achievement.
MineRL goes about this in part by incorporating a technique called imitation learning. Think of it as an assistive dataset. Human Minecraft players were invited to install a recording mod that sent video of their gameplay to the research team’s server, providing some 60 million state-action-reward samples.
It’s not the same as using detailed, labeled data to train a supervised learning system, but rather a way to provide the algorithm with “a strong set of priors,” Guss said. Video of k-means-clustered samples of the demonstration footage shows an agent navigating the digital 3D world. It’s not as logical as if you or I steered the character, but it’s far more logical than random.
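In its simplest form, learning from demonstrations can mean copying what humans most often did in each situation, an approach usually called behavioral cloning. The sketch below is a deliberately tiny, tabular version with made-up state and action names. Competition entries work from game frames with neural networks, so treat this only as an illustration of the idea of a prior, not of any entrant's method.

```python
from collections import Counter, defaultdict


def clone_policy(demonstrations):
    """Build a lookup policy that copies the most common human action in each state."""
    action_counts = defaultdict(Counter)
    for state, action in demonstrations:
        action_counts[state][action] += 1
    return {
        state: counts.most_common(1)[0][0]
        for state, counts in action_counts.items()
    }


# Hypothetical demonstration data: (state, action) pairs recorded from players.
demos = [
    ("near_tree", "chop"),
    ("near_tree", "chop"),
    ("near_tree", "jump"),
    ("has_wood", "craft"),
]
policy = clone_policy(demos)  # {"near_tree": "chop", "has_wood": "craft"}
```

An agent that starts from this lookup table instead of from random behavior has far less of the world left to discover by trial and error.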
By providing greater sample efficiency, imitation learning also tackles the common reinforcement learning problem of sparse rewards. An agent might make thousands of decisions, or time steps, within an episode, but it’s only rewarded at the end of the sequence. What exactly were the steps that made it successful? Reward shaping offers an (imperfect) solution, but, again, that’s precisely the kind of tinkering Guss hopes to minimize.
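One standard, if blunt, way to spread an end-of-sequence reward back over the steps that led to it is discounting: each step is credited with the rewards that follow it, shrunk by a factor gamma per step. The sketch below assumes a single episode stored as a list of per-step rewards; the function name and the gamma value are illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every time step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A sparse episode: nothing until the final step.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Over thousands of steps the credit reaching the earliest decisions becomes vanishingly small, which is one way to see why sparse rewards make learning so slow.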
“By the time you engineer a reward function that gives you a good signal at every time step, you basically solve the task,” he said. “You could write a program to do it. That’s a hyperbolic statement, but it’s kind of true.”
It Has To Be Reproducible
There’s been a growing movement in AI in recent years to counteract the so-called reproducibility crisis, a high-stakes version of the classic it-worked-on-my-machine coding problem. The crisis manifests in problems ranging from AI research that selectively reports algorithm runs to idealized results courtesy of heavy GPU firepower.
Guss and his colleagues made a concerted effort to foster reproducibility with the contest. For one, the code is public.
“That’s really important and hasn’t been the case for a lot of RL papers,” Guss said.
Second, rather than having entrants submit their agents, which could conceivably be trained with research-lab levels of GPU wattage, they’re required to submit code, which is then used to train the agents on the organizers’ machines. They also introduce randomizing elements to make sure results track across different versions of the game.
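A small habit in the same spirit, sketched below under stated assumptions: run the same experiment under several fixed random seeds and report the spread, not just the best run. The `noisy_trial` function is a hypothetical stand-in for training and scoring an agent, and none of this reflects the competition's actual evaluation code.

```python
import random
import statistics


def evaluate_across_seeds(run_trial, seeds):
    """Run the same experiment under several seeds and summarize the scores."""
    scores = [run_trial(random.Random(seed)) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
    }


def noisy_trial(rng):
    # Hypothetical stand-in for "train and score an agent"; a real run costs far more.
    return sum(rng.random() for _ in range(100)) / 100


report = evaluate_across_seeds(noisy_trial, seeds=[0, 1, 2, 3, 4])
```

Because each trial gets its own seeded generator, anyone rerunning the code with the same seeds gets the same report, and the standard deviation shows how much of a headline score is luck.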
By keeping the actual training on organizers’ machines, with deliberately restricted computing power, the competition also addresses the challenge of access. For well-funded research labs, access to high levels of computing power isn’t much of an issue. But those resources are not so well distributed elsewhere.
“Go to any state school and ask your average undergrad, Master’s or PhD student if they have enough GPUs to train AlphaGo, or train Atari, in an hour,” Guss said. “Go to any university in the southern hemisphere and ask students, ‘Do you have the resources?’ The answer is no.”
In other words, in order to reach even greater heights, reinforcement learning needs to be both sample efficient and equitable.