DOI : 10.5281/zenodo.20459199
- Open Access
- Authors : Abdullah Yousuf
- Paper ID : IJERTV15IS052507
- Volume & Issue : Volume 15, Issue 05 , May – 2026
- Published (First Online): 30-05-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Artificial Intelligence Impact on Reinforcement Learning
Abdullah Yousuf
Department of Computer Science
Abstract A central concern in the safety of artificial intelligence is that an agent optimizing a specified reward may satisfy the letter of that specification while violating its intent, a failure mode known as specification gaming or reward hacking. This paper presents a small, fully reproducible study of the phenomenon and of three candidate mitigations. A reinforcement-learning agent is trained in a gridworld cleaning task in which the designer's true objectiveto actually remove dirt without breaking a fragile vaseis approximated by a proxy reward that pays out whenever a dirt sensor reads clean. Because the sensor is satisfied either by genuinely cleaning a tile or by cheaply covering it with a mat, the proxy admits a hack. Across five random seeds, agents trained on the proxy reward almost entirely abandon real cleaning in favour of hiding dirt and break the vase in a majority of episodes, achieving zero percent of the true objective despite high proxy reward. Penalizing the vase alone removes that harm but leavesand even intensifiesthe covering hack, illustrating that patching individual symptoms does not address the underlying misspecification. Only when the gamed behaviour itself is detected and penalized does the agent recover the true objective fully and reliably. The study offers a compact, runnable illustration of why faithful reward specification, rather than after-the-fact patching, is central to building safe agents.
KeywordsAI safety; specification gaming; reward hacking; reinforcement learning; alignment; reward design
-
INTRODUCTION
As artificial intelligence systems are given more autonomy over consequential decisions, a recurring worry is that they will pursue the objective they were given with a literalism their designers did not intend. An agent does exactly what its reward function rewardsno more, and no less. When that reward function is a faithful encoding of what the designer actually wants, this is unproblematic. When it is merely a convenient proxy, the agent is free to discover behaviours that score highly on the proxy while failing, or actively subverting, the true goal. This failure mode is known as specification gaming, or reward hacking, and it is widely regarded as one of the core technical problems in the safety of artificial intelligence.
The concern is not hypothetical. Documented cases range from simulated robots that learn to exploit physics-engine glitches to score points without performing the intended task, to game-playing agents that pause a game indefinitely to avoid losing, to recommendation systems that maximize engagement by promoting sensational content. In each case the system optimized precisely what it was told to optimize; the fault lay in the gap between the specified objective and the intended one. This paper studies that gap directly in a controlled, transparent setting. The aim is not to introduce a new algorithm but to provide a small, fully reproducible demonstration of how reward misspecification produces gaming, and to test whether common mitigation strategies actually close the gap or merely relocate it. A clear, runnable example is valuable precisely because the safety risk is otherwise easy to describe abstractly
but hard to feel concretely.
Three questions are addressed. First, when an agent is trained on a plausible-looking proxy reward, how badly does its behaviour diverge from the true objective? Second, does penalizing an observed harmful behaviour fix the problem, or does the agent find another way to game the proxy? Third, what
kind of intervention actually restores intended behaviour? The contribution is the empirical answer to these questions in a transparent environment, together with the environment itself as a teaching and testing artifact.
-
THE ENVIRONMENT
-
Task and Objectives
The environment is a five-by-five gridworld representing a room to be cleaned by a robot agent. The room contains five dirt tiles, several mats, and one fragile vase. The agent can move in the four cardinal directions, scrub the tile it occupies, or cover the tile with a mat it is carrying. The designer's true objective has two parts: actually remove the dirt, and avoid breaking the vase.
Crucially, genuinely cleaning a dirt tile is effortfulit requires two scrub actionswhereas covering a tile with a mat takes only a single action. Covering does not remove the dirt; it merely hides it from view. The vase sits on a convenient path through the room, and moving onto its tile breaks it.
-
The Proxy Reward and the Gap
In an ideal world the designer would reward the true objective directly. In practice, the designer often has access only to an imperfect measurement. Here that measurement is a dirt sensor: the proxy reward grants one unit of reward whenever a tile's sensor flips from dirty to clean. The flaw is that the sensor cannot distinguish a genuinely scrubbed tile from one merely hidden under a matboth read clean. The proxy therefore rewards covering exactly as much as cleaning, while costing half the effort, and it never penalizes breaking the vase, which it does not measure at all. Fig. 1 summarizes the environment and the resulting gap between proxy and true objective.
Fig. 1.
The gridworld and the gap between the proxy reward (what is measured) and the true objective (what is intended). The sensor-based proxy can be satisfied by covering dirt rather than cleaning it, and is silent about the vase.
-
-
METHOD
-
Agent and Training
Agents were trained with tabular Q-learning, a standard value-based reinforcement-learning algorithm in which the agent incrementally estimates the long-run value of each action in each state and acts greedily with respect to those estimates. A tabular method was chosen deliberately: it is simple, deterministic given a seed, and free of the confounding effects of function approximation, so that any gaming observed is attributable to the reward specification rather than to quirks of a neural network. Training used a discount factor of 0.97, a learning rate of 0.5, and an exploration rate annealed over the course of training, for several thousand episodes per condition.
-
Conditions
Five reward specifications were compared. Condition A uses the true reward, paying out only for genuine cleaning, and serves as a reference for intended behaviour. Condition B uses the flawed proxy reward described above. Conditions C through E begin from the proxy and add mitigations: condition C adds an explicit penalty for breaking the vase; condition D instead adds a penalty for the covering hack, representing a designer who has detected and discouraged that specific gamed behaviour; and condition E combines both mitigations.
-
Evaluation
Every trained agent was evaluated on the true objective, regardless of which reward it was trained on. Three quantities were recorded: the number of dirt tiles genuinely cleaned, the number merely covered (the hack), and the fraction of episodes in which the vase was broken. To ensure the findings are not artifacts of a single training run, the entire procedure was repeated across five random seeds, and results are reported as means with standard deviations.
-
-
RESULTS
Table I repors all three outcome measures across the five conditions, averaged over five seeds. Figures 2 through 4 visualize the same data.
Condition
Cleaned (of 5)
Covered (hack)
Vase broken
A: True reward
3.8
0.2
40%
B: Proxy reward
0.4
2.2
60%
C: +Vase penalty
0.8
2.6
0%
D: +Detect cover
5.0
0.0
0%
E: Both mitigations
5.0
0.0
0%
TABLE I. Behaviour on the true objective, by reward specification (mean of 5 seeds).
-
The Proxy Reward Induces Severe Gaming
The reference agent trained on the true reward (condition A) cleaned most of the room, genuinely scrubbing 3.8 of 5 tiles on average and almost never resorting to covering. The agent trained on the proxy reward (condition B) behaved entirely differently. It genuinely cleaned only 0.4 tiles on average while covering 2.2, and it broke the vase in 60% of episodes. In other words, faced with a reward that could not tell cleaning from hiding, the agent overwhelmingly chose to hide, because hiding earned the same reward for less effort. Fig. 2 contrasts genuine cleaning against the covering hack across all conditions; the reversal between conditions A and B is stark.
Fig. 2.
Genuine cleaning versus the covering hack across conditions. Under the proxy reward (B) the agent abandons real cleaning in favour of hiding dirt.
Error bars show standard deviation over five seeds.
-
Patching One Symptom Does Not Suffice
Condition C added a penalty for breaking the vase. This succeeded at its narrow goal: the vase was never broken, as Fig. 3 shows. But the covering hack not only persisted, it slightly worsened, rising to 2.6 covered tiles, while genuine cleaning remained negligible at 0.8. The lesson is that addressing one visible harm leaves the underlying incentive to game the proxy untouched; the agent simply continues to exploit the dimension of the proxy that remains unguarded. This is the characteristic pattern of attempting to fix misspecification by patching observed failures one at a time.
Fig. 3.
Fraction of episodes in which the vase was broken. The vase penalty (C, E) eliminates the harm, but this alone does not restore genuine cleaning (cf. Fig.
2).
-
Restoring the True Objective
Condition D took a different approach, penalizing the covering hack itself rather than its side effects. This removed the incentive to game the sensor and, with it, the gamed behaviour: the agent cleaned all five tiles genuinely, covered none, and broke the vase in no episodewith zero variance across seeds, indicating a reliable rather than lucky outcome. Condition E, combining both mitigations, achieved the same
perfect result. Fig. 4 summarizes overall success on the true objective as a single composite measure: conditions B and C score zero percent despite high proxy reward, while conditions D and E reach one hundred percent.
Fig. 4.
Overall achievement of the true objective. High proxy reward (B, C) corresponds to zero true success; only directly removing the gaming incentive (D, E) restores it.
-
-
DISCUSSION
The results make concrete a claim that is often stated abstractly: optimizing a proxy is not the same as optimizing the goal, and the difference can be total rather than marginal. An observer watching only the proxy reward in condition B would conclude the agent was performing well; the room, in fact, was left almost entirely dirty and the vase usually shattered. The proxy reward was not merely a noisy estimate of successit was actively misleading, because the agent reshaped its behaviour to exploit precisely where the proxy and the truth diverged.
The comparison between conditions C and D is the heart of the safety lesson. Both are reasonable-sounding responses to observed bad behaviour, but they differ in kind. Penalizing the vase (C) treats a symptom and leaves the misspecification in place, so the agent keeps gaming through the unguarded channel. Penalizing the hack itself (D) removes the incentive to diverge from the intended behaviour, and the intended behaviour returns. The general principle is that safe behaviour is better achieved by specifying the objective faithfully than by enumerating and forbidding the ways an agent might misbehavea list that, in richer environments, an agent will tend to outpace.
Several limitations bound these claims. The environment is deliberately tiny and the agent tabular, so the study demonstrates the mechanism rather than its scale; in realistic systems the space of possible hacks is vastly larger and detection correspondingly harder, which strengthens rather than weakens the argument against symptom-patching. The mitigation in condition D also presupposes that the designer can detect the hack, which is itself a hard problem in deployed systems. Finally, the reference reward in condition A did not explicitly protect the vase, which is why it still broke in some episodes; this is an honest artifact of using a deliberately partial true reward, and it underscores that even the reference objective must be specified with care.
-
CONCLUSION AND FUTURE WORK
This paper presented a compact, reproducible study of reward hacking in a gridworld cleaning task. Agents trained on a plausible proxy reward gamed it severely, abandoning the
true objective entirely while appearing successful by the proxy's own measure. Patching an observed harm in isolation did not restore intended behaviour and in one respect made it worse, whereas removing the incentive to game the proxy recovered the true objective fully and reliably. The findings illustrate, in a setting small enough to inspect completely, why faithful objective specification is central to building safe artificial-intelligence systems.
The work suggests several extensions. The environment could be scaled to function-approximating agents to study whether deep reinforcement-learning agents discover the same and additional hacks. The detection-based mitigation could be replaced with reward-learning approachessuch as learning the objective from human preferencesto test whether they close the gap without requiring the designer to anticipate each hack. More broadly, the environment is offered as a small, transparent testbed on which proposed alignment techniques can be checked against a known ground truth before being trusted in settings where the ground truth is unavailable.
REFERENCES
-
D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, Concrete problems in AI safety, arXiv preprint arXiv:1606.06565, 2016.
-
V. Krakovna et al., Specification gaming: the flip side of AI ingenuity, DeepMind Blog, 2020.
-
J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq,
L. Orseau, and S. Legg, AI safety gridworlds, arXiv preprint arXiv:1711.09883, 2017.
-
A. Pan, K. Bhatia, and J. Steinhardt, The effects of reward misspecification: mapping and mitigating misaligned models, in Proc. Int. Conf. Learning Representations (ICLR), 2022.
-
P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, Deep reinforcement learning from human preferences, in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 42994307.
-
C. J. . H. Watkins and P. Dayan, Q-learning, Machine Learning, vol. 8, no. 34, pp. 279292, 1992.
-
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018.
