\begin{abstract}
  Extraordinary amounts of data are being produced in many branches of
  science.  Proven statistical methods are no longer applicable to
  extraordinarily large data sets due to computational limitations.  A
  critical step in big data analysis is data reduction.  Existing
  investigations in the context of linear regression focus on
  subsampling-based methods.  However, not only is this approach prone
  to sampling errors, it also leads to a covariance matrix of the
  estimators that is typically bounded from below by a term of the
  order of the inverse of the subdata size.  We propose a novel
  approach, termed information-based optimal subdata selection
  (IBOSS).  Compared to leading existing subdata methods, the IBOSS
  approach has the following advantages: (i) it is significantly
  faster; (ii) it is suitable for distributed parallel computing;
  (iii) the variances of the slope parameter estimators converge to 0
  as the full data size increases even if the subdata size is fixed,
  i.e., the convergence rate depends on the full data size; (iv) data
  analysis for IBOSS subdata is straightforward and the sampling
  distribution of an IBOSS estimator is easy to assess.  Theoretical
  results and extensive simulations demonstrate that the IBOSS
  approach is superior to subsampling-based methods, sometimes by
  orders of magnitude.  The advantages of the new approach are also
  illustrated through analysis of real data.
\end{abstract}