1: \begin{abstract}
2: Extraordinary amounts of data are being produced in many branches of
3: science. Proven statistical methods are no longer applicable to
4: extraordinarily large data sets due to computational limitations. A
5: critical step in big data analysis is data reduction. Existing
6: investigations in the context of linear regression focus on
7: subsampling-based methods. However, not only is this approach prone
8: to sampling errors, it also leads to a covariance matrix of the
9: estimators that is typically bounded from below by a term of the order of the inverse of the subdata size.
10: We propose a novel approach, termed
11: information-based optimal subdata selection (IBOSS). Compared to
12: leading existing subdata methods, the IBOSS approach has the following advantages:
13: (i) it is significantly faster; (ii) it is suitable for distributed
14: parallel computing; (iii) the variances of the slope parameter
15: estimators converge to 0 as the full data size increases even if the
16: subdata size is fixed, i.e., the convergence rate depends on the
17: full data size; (iv) data analysis for IBOSS subdata is
18: straightforward and the sampling distribution of an IBOSS estimator
19: is easy to assess. Theoretical results and extensive simulations
20: demonstrate that the IBOSS approach is superior to subsampling-based
21: methods, sometimes by orders of magnitude. The advantages of the
22: new approach are also illustrated through an analysis of real data.
23: \end{abstract}
24: