
\begin{abstract}

In this supplementary material, we first provide a theoretical analysis of the convergence rate (Sec.~\ref{sec:convergence}) and sample complexity (Sec.~\ref{sec:sample_complexity}) of the Peer $Q$-Learning algorithm. We then extend the approach to the multi-outcome setting, with theoretical proofs (Sec.~\ref{sec:multi-outcome-ap}). We also show extensions to other modern DRL algorithms in Sec.~\ref{sec:drl-ap}, and further discuss the effectiveness of PeerRL in Sec.~\ref{supp:error_r}. We then provide more ``tie-breaking'' examples under varied noise models, together with a Python-style code snippet, in Sec.~\ref{supp:tie-break}.

In Sec.~\ref{sec: peerbc}, we provide the technical proofs for the proposed PeerBC approach under mild assumptions. We then report the experimental setup (Sec.~\ref{sec:exp_setup}), the implementation details (Sec.~\ref{sec:implementation_details}), and additional experiments, including complete results for Figure 2 and Table 1 (Sec.~\ref{sec:complete_results}), a sensitivity analysis of the peer penalty coefficient $\xi$ (Sec.~\ref{sec:sensitivity}), and a study of the stochasticity of the behavioral cloning policy (Sec.~\ref{sec:stochastic}).

A summary of the contents of this supplementary material is provided below.

\end{abstract}
