1: \begin{abstract}
2: Many applications, such as photon-limited imaging and genomics, involve large datasets with noisy entries from exponential family distributions. It is of interest to estimate the covariance structure and principal components of the noiseless distribution.
3: %In photon-limited imaging (e.g. XFEL) we want to estimate the covariance of the pixel intensities of 2-D images, where the pixels are low-intensity Poisson variables. In genomics we want to estimate population structure from biallelic---Binomial(2)---genetic markers such as Single Nucleotide Polymorphisms (SNPs).
4: Principal Component Analysis (PCA), the standard method for this setting, can be inefficient when the noise is non-Gaussian.
5:
6: We develop $e$PCA (exponential family PCA), a new methodology for PCA on exponential family distributions. $e$PCA can be used for dimensionality reduction and denoising of large data matrices. $e$PCA involves the eigendecomposition of a new covariance matrix estimator, constructed in a simple and deterministic way using moment calculations, shrinkage, and random matrix theory. %$e$PCA is as fast as PCA and is suitable for datasets with multiple types of variables.
7:
8: We provide several theoretical justifications for our estimator, including its finite-sample convergence rate and the Marchenko--Pastur law in high dimensions. %A key step of $e$PCA is \emph{homogenization}, a specific variable weighting. For SNPs, this recovers the widely used Hardy-Weinberg equilibrium (HWE) normalization. We show that homogenization improves the signal strength, providing justification for HWE normalization.
9: $e$PCA compares favorably to PCA and various PCA alternatives for exponential families, in simulations as well as in XFEL and SNP data analysis.
10: An open-source implementation is \href{http://github.com/lydiatliu/epca/}{available}.
11:
12:
13: \end{abstract}
14: