Summary

This paper extends prior work on MONA (Myopic Optimization with Non-myopic Approval) in the public Camera Dropbox environment. It reproduces the environment, adds learned-approval experiments, and draws out design implications for mitigating reward hacking.

Why it matters

Reward-hacking problems are not just prompt-level failures. They emerge from the interaction between objective design, environment feedback, monitoring, and the agent's learned strategy. Reproduction-first work helps distinguish robust safety claims from one-off demonstrations.

Citation

Heath, Nathan. Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation. arXiv:2603.29993, 2026.
