
\begin{abstract}
Learning strategies for imperfect information games from samples of interaction is a challenging problem. A common method for this setting, Monte Carlo Counterfactual Regret Minimization (MCCFR), can have slow long-term convergence rates due to high variance. In this paper, we introduce a variance reduction technique (VR-MCCFR) that applies to any sampling variant of MCCFR. Using this technique, per-iteration estimated values and updates are reformulated as a function of sampled values and state-action baselines, similar to the use of baselines in policy gradient reinforcement learning.
The new formulation allows estimates to be bootstrapped from other estimates within the same episode, propagating the benefits of baselines along the sampled trajectory; the estimates remain unbiased even when bootstrapping from other estimates. Finally, we show that given a perfect baseline, the variance of the value estimates can be reduced to zero.
Experimental evaluation shows that VR-MCCFR brings an order-of-magnitude speedup, while the empirical variance decreases by three orders of magnitude.
The decreased variance allows CFR+ to be used with sampling for the first time, increasing the speedup to two orders of magnitude.
\end{abstract}
