Summary
This paper extends prior work on MONA (Myopic Optimization with Non-myopic Approval) in the public Camera Dropbox environment. It reproduces the environment, adds learned-approval experiments, and draws out design implications for mitigating reward hacking.
Why it matters
Reward-hacking problems are not just prompt-level failures. They emerge from the interaction between objective design, environment feedback, monitoring, and the agent's learned strategy. Reproduction-first work helps distinguish robust safety claims from one-off demonstrations.
Citation
Heath, Nathan. Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation. arXiv:2603.29993, 2026.