
\begin{abstract}
How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the agent?
Standard RL algorithms assume a secure reward function, and can thus perform poorly in settings where agents can tamper with the reward-generating mechanism.
We present a principled solution to the problem of learning from influenceable feedback, which combines approval with a decoupled feedback collection procedure.
For a natural class of corruption functions, decoupled approval algorithms have aligned incentives both at convergence and for their local updates.
Empirically, they also scale to complex 3D environments where tampering is possible.
\end{abstract}
