\begin{abstract}

Convex Q-learning is a recent approach to reinforcement learning, motivated by the possibility of a firmer theory for convergence and by the possibility of making use of greater a~priori knowledge regarding policy or value function structure. This paper explores algorithm design in the continuous-time domain, with a finite-horizon optimal control objective. The main contributions are:
\begin{romannum}
\item
Algorithm design is based on a new \textit{Q-ODE}, which defines a model-free characterization of the Hamilton--Jacobi--Bellman equation.

\item
The Q-ODE motivates a new formulation of Convex Q-learning that avoids the approximations appearing in prior work.
The Bellman error used in the algorithm is defined by filtered measurements, which is beneficial in the presence of measurement noise.

\item
A characterization of boundedness of the constraint region is obtained through a non-trivial extension of recent results from the discrete-time setting.

\item
The theory is illustrated through an application to resource allocation for distributed energy resources, a setting for which it is ideally suited.
\end{romannum}

\end{abstract}