1: \begin{abstract}
2: A central goal of machine learning is to learn robust representations that capture the causal relationship between input features and output labels.
3: % While machine learning models are able to learn complex prediction rules by minimizing the training error, they also
4: However, minimizing empirical risk over finite or biased datasets often results in models latching on to \emph{spurious correlations} between the training input/output pairs that are not fundamental to the problem at hand.
5: % Models that fit these correlations often fail on inputs where the spurious correlation does not hold.
6: In this paper, we define and analyze robust and spurious representations using the information-theoretic concept of \emph{minimal sufficient statistics}.
7: We prove that even when the only bias is in the input distribution (i.e.,~\emph{covariate shift}), models can still pick up spurious features from their training data.
8: Group distributionally robust optimization (DRO) provides an effective tool to alleviate covariate shift by minimizing the \emph{worst-case} training loss over a set of pre-defined groups.
9: Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations that occur in the data.
10: % under ``imperfect'' partitions where groups are not created from exact spurious factors.
11: To address this, we further propose to minimize the worst-case losses over a more flexible set of distributions that are defined on the \emph{joint distribution} of groups and instances, instead of treating each group as a whole at optimization time.
12: Through extensive experiments on one image and two language tasks, we show that our model is significantly more robust than comparable baselines under various partitions.
13: Our code is available at \url{https://github.com/violet-zct/group-conditional-DRO}.
14:
15: \end{abstract}
16: