
\begin{abstract}
Entropy regularization is known to improve exploration %and has been recently leveraged to show convergence guarantees
in sequential decision-making problems.
We show that this same mechanism can also lead to nearly unbiased and lower-variance estimates of the mean reward in the optimize-and-estimate structured bandit setting.
Mean reward estimation (i.e., population estimation) tasks have recently been shown to be essential in public policy settings, where legal constraints often require precise estimates of population metrics.
We further show that leveraging entropy and KL divergence can yield a better trade-off between reward and estimator variance than existing baselines, all while remaining nearly unbiased.
These properties of entropy regularization illustrate an exciting potential for bridging the optimal exploration and estimation literatures.
\end{abstract}