\begin{abstract}
In this paper, we consider the problem of partitioning a small data sample of
  size $n$ drawn from a mixture of $2$ sub-gaussian distributions. In
  particular, in a recent paper (Zhou 2023a) we designed and analyzed two
  computationally efficient algorithms that partition the data into two groups
  approximately according to their population of origin, given a small sample.
  Our work is
  motivated by the application of clustering individuals according to their
  population of origin using markers, when the divergence between any two of
  the populations is small. Moreover, we are interested in the case where
  individual features are of low average quality $\gamma$, and we want to use
  as few of them as possible to correctly partition the sample. Here we use
  $p \gamma$ to denote the $\ell_2^2$ distance between the two population
  centers (mean vectors), namely, $\mu^{(1)}, \mu^{(2)} \in \R^p$. We allow a full
  range of tradeoffs between $n$, $p$, and $\gamma$ in the sense that partial
  recovery (success rate $< 100\%$) is feasible once the signal-to-noise ratio
  (SNR) $s^2 := \min\{np \gamma^2, p \gamma\}$ is lower bounded by a constant.
  Our work builds upon the semidefinite relaxation of an integer quadratic
  program that is formulated essentially as finding the maximum cut on a graph,
  where edge weights in the cut represent dissimilarity scores between two
  nodes based on their $p$ features, as formulated in Zhou (2023a). More
  importantly, we prove that the
  misclassification error decays exponentially with respect to the SNR $s^2$ in
  the present paper. The significance of such an exponentially decaying error
  bound is that when $s^2 = \Omega(\log n)$, perfect recovery of the cluster
  structure is accomplished. This result was stated in Zhou (2023a)
  without proof. We therefore present the full proof in the present work.
\end{abstract}