\begin{abstract}
Entropy regularization is known to improve exploration %and has been recently leveraged to show convergence guarantees
in sequential decision-making problems.
We show that this same mechanism can also lead to nearly unbiased and lower-variance estimates of the mean reward in the optimize-and-estimate structured bandit setting.
Mean reward estimation (i.e., population estimation) tasks have recently been shown to be essential in public policy settings, where legal constraints often require precise estimates of population metrics.
We further show that regularizing with entropy and KL divergence can yield a better trade-off between reward and estimator variance than existing baselines, while remaining nearly unbiased.
These properties of entropy regularization point to a promising way of bridging the optimal exploration and estimation literatures.
\end{abstract}