\begin{abstract}
Dynamic optimization of mean and variance in Markov decision
processes (MDPs) is a long-standing challenge caused by the failure
of dynamic programming. In this paper, we propose a new approach for
finding the globally optimal policy for combined metrics of
steady-state mean and variance in an infinite-horizon undiscounted
MDP. By introducing the concepts of pseudo mean and pseudo variance,
we convert the original problem to a bilevel MDP problem, where the
inner problem is a standard MDP optimizing pseudo mean-variance and
the outer problem is a single-parameter selection problem optimizing
pseudo mean. We use the sensitivity analysis of MDPs to derive the
properties of this bilevel problem. By solving inner standard MDPs
for pseudo mean-variance optimization, we can identify worse policy
spaces dominated by optimal policies of the pseudo problems. We
propose an optimization algorithm that can find the globally
optimal policy by repeatedly removing worse policy spaces. The
convergence and complexity of the algorithm are studied. Another
policy dominance property is also proposed to further improve the
efficiency of the algorithm. Numerical experiments demonstrate the
performance and efficiency of our algorithms. To the best of our
knowledge, our algorithm is the first that efficiently finds the
globally optimal policy of mean-variance optimization in MDPs. These
results are also valid for minimizing variance metrics alone in
MDPs.
\end{abstract}