\begin{abstract}
  Extraordinary amounts of data are being produced in many branches of
  science.  Proven statistical methods are no longer applicable to
  extraordinarily large data sets because of computational limitations.
  A critical step in big data analysis is data reduction. Existing
  investigations in the context of linear regression focus on
  subsampling-based methods. However, this approach is not only prone
  to sampling errors, but also leads to a covariance matrix of the
  estimators that is typically bounded from below by a term of the
  order of the inverse of the subdata size.
  We propose a novel approach, termed
  information-based optimal subdata selection (IBOSS). Compared to
  leading existing subdata methods, the IBOSS approach has the
  following advantages:
  (i) it is significantly faster; (ii) it is suitable for distributed
  parallel computing; (iii) the variances of the slope parameter
  estimators converge to 0 as the full data size increases even if the
  subdata size is fixed, i.e., the convergence rate depends on the
  full data size; (iv) data analysis for IBOSS subdata is
  straightforward and the sampling distribution of an IBOSS estimator
  is easy to assess. Theoretical results and extensive simulations
  demonstrate that the IBOSS approach is superior to subsampling-based
  methods, sometimes by orders of magnitude.  The advantages of the
  new approach are also illustrated through analysis of real data.
\end{abstract}
