$\;

\tcp{Train reward model (convergence judged by human)}
\While{$