The reward that wasn't
The objective you can measure is almost never the one you want to optimize.
Teach a robot to walk by rewarding forward velocity and it learns to fall forward. Train a recommendation system to maximize clicks and it learns to promote outrage-inducing content.
This problem arises when teams build a reward function from what is easy to measure (logs, clicks, and so on), then discover that the system found a way to maximize it that is counterproductive.
Writing a reward function should be a way to formally state your business objective. Before training, you should be able to answer: under what policy would an agent optimize this reward while producing behavior we don't want? This is a static analysis of the reward function that can save heaps of time later.
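A rough sketch of what that pre-training check can look like in code. The reward terms and the "falling forward" probe below are made up for illustration; the point is to write the proxy and the intended objective side by side and feed both a policy you would hate to deploy:

```python
# Hypothetical proxy reward for the walking robot: forward velocity only.
def proxy_reward(state):
    return state["forward_velocity"]

# What we actually want also penalizes ending up on the floor.
def intended_objective(state):
    return state["forward_velocity"] - 10.0 * state["has_fallen"]

# Adversarial probe: a policy that simply topples forward scores well
# on the proxy and terribly on the intended objective.
falling_forward = {"forward_velocity": 2.5, "has_fallen": 1.0}
steady_walking = {"forward_velocity": 1.0, "has_fallen": 0.0}

for name, s in [("falling forward", falling_forward), ("steady walking", steady_walking)]:
    print(f"{name}: proxy={proxy_reward(s):.1f}, intended={intended_objective(s):.1f}")
```

If a two-line probe like this already exposes a gap between the proxy and the objective, the trained policy will find a much bigger one.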
The sim-to-real gap
Training in simulation is unavoidable for most RL problems. However, simulation results are misleading if you blindly assume that the simulated environment captures the distribution of states in the actual production environment. It rarely does, especially when the environment reacts to our models.
For example, users can learn to avoid pop-ups and ads, regardless of how smart our placement is. Mitigating out-of-distribution degradation requires sensitivity analysis to distributional shift and continuous monitoring after deployment.
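What "continuous monitoring" can mean concretely: histogram a few state features from the simulation or training distribution, do the same for live traffic, and alert when a divergence measure drifts. The feature, the distributions, and the threshold below are all placeholders; calibrate against your own data:

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms over the same bins."""
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

bins = np.linspace(0.0, 1.0, 21)
sim_feature = np.random.beta(2, 5, size=10_000)   # stand-in for simulation states
prod_feature = np.random.beta(2, 2, size=10_000)  # stand-in for production states

sim_hist, _ = np.histogram(sim_feature, bins=bins)
prod_hist, _ = np.histogram(prod_feature, bins=bins)

drift = kl_divergence(prod_hist.astype(float), sim_hist.astype(float))
if drift > 0.1:  # threshold is a placeholder, not a recommendation
    print(f"state distribution drift detected: KL={drift:.3f}")
```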
No task formalization
The most common root cause I find in failed RL projects: nobody formalized the decision task before coding started. What is a state? What is an action? What does the reward depend on? Is the environment partially observable?
These questions have precise answers that determine which algorithms apply, what training data you need, how you detect convergence, and what can break. Teams that skip this preparatory work face training instability, reward hacking, or a deployed policy that works only part of the time.
What production requires
Formalize the decision task first.
Write it down before any code: states, actions, transition function, reward function, discount factor, and so on. If the environment is partially observable, specify how beliefs are represented and updated. If this runs past two pages, the problem is not yet well-defined.
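One way to keep yourself honest about the "write it down" step is to make the spec a small piece of code rather than a wiki page. Everything in the example below is hypothetical; the structure is the point:

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

@dataclass
class DecisionTaskSpec:
    states: str                    # schema or prose describing the state space
    actions: Sequence[str]         # discrete action set, or a bounds description
    reward: Callable[..., float]   # reward as a function of (state, action, next_state)
    discount: float                # gamma
    partially_observable: bool
    belief_update: str = ""        # how beliefs are represented and updated, if POMDP
    known_failure_modes: list = field(default_factory=list)

spec = DecisionTaskSpec(
    states="user session: recency, frequency, last 10 item categories",
    actions=["recommend_a", "recommend_b", "show_nothing"],
    reward=lambda s, a, s_next: s_next.get("long_term_value_delta", 0.0),
    discount=0.99,
    partially_observable=True,
    belief_update="exponential moving average over observed engagement signals",
    known_failure_modes=["clickbait loop", "cold-start users out of distribution"],
)
```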
Select the algorithm from the formalization, not the toolbox.
Each algorithm has its use case. For example, PPO (proximal policy optimization) works well in many continuous-control problems. But it performs poorly on sparse-reward environments, high-dimensional discrete action spaces, and problems with strong epistemic uncertainty. The formalization tells you which of these you have.
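A deliberately simplified sketch of how the selection can follow from the spec. The mapping below is not a complete decision procedure; it just illustrates that algorithm choice is a function of task properties, not of whatever the toolbox installs by default:

```python
def candidate_algorithms(task):
    """Toy mapping from formalization properties to algorithm families."""
    candidates = []
    if task["action_space"] == "continuous":
        candidates += ["PPO", "SAC"]
    elif task["action_space"] == "discrete" and task["num_actions"] <= 100:
        candidates += ["DQN-family", "PPO"]
    if task["sparse_reward"]:
        candidates += ["reward shaping", "curriculum learning", "intrinsic motivation"]
    if task["high_epistemic_uncertainty"]:
        candidates += ["model-based RL", "ensembles / posterior sampling"]
    return candidates

print(candidate_algorithms({
    "action_space": "continuous",
    "num_actions": None,
    "sparse_reward": True,
    "high_epistemic_uncertainty": False,
}))
```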
Define convergence before training.
A converged policy produces consistent behavior across a representative distribution of initial states, does not collapse under distributional perturbations, and shows measurable alignment between the proxy reward and the business objective. The best practice is to build evaluation infrastructure alongside training infrastructure.
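A sketch of what that evaluation harness can look like, assuming a gym-style environment factory and a business metric logged separately from the training reward. The thresholds are placeholders:

```python
import numpy as np

def evaluate(policy, env_factory, n_episodes=200, perturb=None):
    """Roll out a policy over a distribution of initial states, optionally perturbed."""
    returns, business = [], []
    for seed in range(n_episodes):
        env = env_factory(seed=seed, perturb=perturb)
        obs, done, ep_return, info = env.reset(), False, 0.0, {}
        while not done:
            obs, reward, done, info = env.step(policy(obs))
            ep_return += reward
        returns.append(ep_return)
        business.append(info.get("business_metric", 0.0))
    return np.array(returns), np.array(business)

def is_converged(policy, env_factory):
    r, b = evaluate(policy, env_factory)
    r_shift, _ = evaluate(policy, env_factory, perturb="mild")
    consistent = r.std() / (abs(r.mean()) + 1e-9) < 0.2   # stable across initial states
    robust = r_shift.mean() > 0.8 * r.mean()              # survives a perturbation
    aligned = np.corrcoef(r, b)[0, 1] > 0.5               # proxy tracks business metric
    return consistent and robust and aligned
```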
Deploy in shadow mode first.
Let the system learn from actions by humans or existing rule-based systems, and check if its suggestions are improvements. This establishes a baseline distribution of states to calibrate against. Once live, monitor for distribution shift, reward hacking, and policy degradation.
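Shadow mode is mostly plumbing. A minimal sketch, assuming the incumbent decision-maker and the candidate policy are both callables over the same state representation (all names are hypothetical):

```python
import json
import time

def shadow_step(state, incumbent, candidate_policy, log_file):
    """Execute only the incumbent's action; record what the RL policy would have done."""
    executed = incumbent(state)           # human or rule-based decision actually taken
    suggested = candidate_policy(state)   # RL policy runs in shadow, never executed
    log_file.write(json.dumps({
        "ts": time.time(),
        "state": state,
        "executed_action": executed,
        "suggested_action": suggested,
        "agreement": executed == suggested,
    }) + "\n")
    return executed   # production behavior is unchanged during shadow mode
```

The log gives you both the baseline state distribution and a per-state comparison to analyze before the policy ever acts on anything.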
Get the implementation right.
Algorithm selection and implementation quality are separate problems. On a recent engagement, an inference engine for a tech company, rewriting a correct but suboptimal implementation produced a 130× speedup and a 10× memory reduction, making much higher workloads feasible. The algorithm was unchanged; the implementation made the difference.
Biological RL algorithms matter
The standard modern RL toolkit — Q-learning, Actor-Critic, policy gradients — emerged as much from engineering as from behavioral biology and psychology. Rich Sutton, the inventor of TD learning, double-majored in psychology and computer science and first applied TD learning to explain animal learning.
This matters practically. Academic RL since the 2010s has been optimized for benchmark performance: showing that a new algorithm outperforms a baseline on Atari or another environment. Dealing with the pitfalls outlined above remains outside the scope of most benchmark papers. Practitioners who treat benchmark-tuned algorithms as production-ready artifacts find this out the hard way.
Animals routinely face the same problems as RL algorithms: poor observability (where are the predators? Where is food?), imperfect proxy rewards (is eating this thing right now really good for survival?), out-of-distribution states (what should I do with that green banana?). It turns out that animals do not use Q-learning or any of the other common algorithms. Instead, they seem to use the little-known QV-learning algorithm, which was discovered independently by computer scientists and my research group. They also use shortcuts and heuristics that, while shunned by pure RL theorists, can be extremely useful in practice.
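For the curious, here is a minimal tabular sketch of the core QV-learning update as I understand it from the computer-science literature: the state-value function V is learned with ordinary temporal-difference updates, and Q bootstraps from V(s') rather than from a max over its own estimates, as Q-learning does. Treat it as an illustration, not a reference implementation:

```python
import numpy as np

def qv_update(Q, V, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.99):
    """One tabular QV-learning update for the transition (s, a, r, s_next)."""
    target = r + gamma * V[s_next]
    Q[s, a] += alpha * (target - Q[s, a])   # Q bootstraps from V, not from max_a Q
    V[s] += beta * (target - V[s])          # V learns by plain TD(0)

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
V = np.zeros(n_states)
qv_update(Q, V, s=0, a=1, r=1.0, s_next=3)
```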
Animals have it harder than most RL systems. They cannot die thousands of times on their way to learning an effective policy. In fact, they cannot even die once! If they use QV-learning plus heuristics, there must be a good reason. Taking inspiration from biological RL is underappreciated in current practice, but it can help us to avoid the pitfalls discussed above.
Four questions worth asking an RL consultant
- Can they write down your decision task before writing any code?
  This distinguishes someone who understands your problem from someone who has a framework they apply to everything.
- Do they write production code themselves?
  Correctness properties that are clear in the design get lost when implementation is handed to a separate team. The ideal consultant is someone who can specify, implement, and verify the algorithm.
- What will their approach fail on?
  Every method has boundary conditions. If a consultant cannot describe where their recommended algorithm breaks down, they don't know it well enough to deploy it in your system.
- How have they handled distributional shift after deployment?
  Training metrics are easy. Monitoring a live system for policy degradation and distribution drift is harder and more consequential. Ask what their plan is.