\begin{abstract}

We consider the variable selection problem, which seeks to identify
important variables influencing a response $Y$ out of many candidate
features $X_1, \ldots, X_p$. We wish to do so while offering
finite-sample guarantees about the fraction of false
positives---selected variables $X_j$ that in fact have no effect on
$Y$ after the other features are known. When the number of features
$p$ is large (perhaps even larger than the sample size $n$), and we
have no prior knowledge regarding the type of dependence between $Y$
and $X$, the model-X knockoffs framework nonetheless allows us to
select a model with a guaranteed bound on the false discovery rate,
as long as the distribution of the feature vector
$X=(X_1,\dots,X_p)$ is exactly known. This model selection procedure
operates by constructing ``knockoff copies'' of each of the $p$
features, which are then used as a control group to ensure that the
model selection algorithm is not choosing too many irrelevant
features. In this work, we study the practical setting where the
distribution of $X$ can only be estimated, rather than known
exactly, and the knockoff copies of the $X_j$'s are therefore
constructed somewhat incorrectly. Our results, which are free of
any modeling assumption whatsoever, show that the resulting model
selection procedure incurs an inflation of the false discovery rate
that is proportional to our errors in estimating the distribution of
each feature $X_j$ conditional on the remaining features
$\{X_k:k\neq j\}$. The model-X knockoffs framework is therefore
robust to errors in the underlying assumptions on the distribution of
$X$, making it an effective method for many practical applications,
such as genome-wide association studies, where the underlying
distribution of the features $X_1,\dots,X_p$ is estimated accurately
but not known exactly.
\end{abstract}
