\begin{abstract}
This paper studies risk-averse mean-variance optimization in
infinite-horizon discounted Markov decision processes (MDPs). The
variance metric measures reward variability over the whole process,
with future deviations discounted to their present values. This
discounted mean-variance optimization yields a reward function that
depends on a discounted mean, and this dependency renders
traditional dynamic programming methods inapplicable because it
breaks a crucial property, namely time consistency. To deal with this
unorthodox problem, we introduce a pseudo mean that transforms the
intractable MDP into a standard one with a redefined reward
function, and we derive a discounted mean-variance performance
difference formula. With the pseudo mean, we propose a unified
algorithmic framework with a bilevel optimization structure for
discounted mean-variance optimization. The framework unifies a
variety of algorithms for several variance-related problems,
including, but not limited to, risk-averse variance and
mean-variance optimization in discounted and average MDPs.
Furthermore, the proposed framework can supply convergence analyses
that are missing from the literature. Taking value iteration as an
example, we develop a discounted mean-variance value iteration
algorithm and prove its convergence to a local optimum with the aid
of a Bellman local-optimality equation. Finally, we conduct a
numerical experiment on portfolio management to validate the
proposed algorithm.
\end{abstract}