
\begin{abstract}

This paper considers online optimization for a system that performs a sequence of back-to-back tasks.
Each task can be processed in one of multiple processing modes that affect the duration of the task, the reward earned, and an additional vector of penalties (such as energy or cost).
Let $A[k]$ be a random matrix of parameters that specifies the duration, reward, and penalty vector under each processing mode for task $k$.
The goal is to observe $A[k]$ at the start of each new task $k$ and then choose a processing mode for the task so that, over time, the time average reward is maximized subject to time average penalty constraints.
This is a \emph{renewal optimization problem} and is challenging because the probability distribution for the $A[k]$ sequence is unknown.
Prior work shows that any algorithm that comes within $\epsilon$ of optimality must have $\Omega(1/\epsilon^2)$ convergence time.
The only known algorithm that can meet this bound operates without time average penalty constraints and uses a diminishing stepsize that cannot adapt when probabilities change.
This paper develops a new algorithm that is adaptive and comes within $O(\epsilon)$ of optimality for any interval of $\Theta(1/\epsilon^2)$ tasks over which probabilities are held fixed, regardless of probabilities before the start of the interval.

\end{abstract}