0a0723aeaa4b7426.tex
1: \begin{abstract}
2: 
3: In this paper, we consider the problem of partitioning a small data sample of
4:   size $n$ drawn from a mixture of $2$ sub-gaussian distributions. In
5:   particular, in a recent paper (Zhou 2023a) we design and analyze two
6:   computationally efficient algorithms that partition data into two groups
7:   approximately according to their population of origin given a small sample.  Our work is
8:   motivated by the application of clustering individuals according to their
9:   population of origin using markers, when the divergence between any two of
10:   the populations is small.  Moreover, we are interested in the case where
11:   individual features are of low average quality $\gamma$, and we want to use
12:   as few of them as possible to correctly partition the sample.  Here we use $p
13:   \gamma$ to denote the $\ell_2^2$ distance between two population centers
14:   (mean vectors), namely, $\mu^{(1)}, \mu^{(2)} \in \R^p$.  We allow a full
15:   range of tradeoffs  between $n, p, \gamma$ in the sense that partial recovery
16:   (success rate $< 100\%$) is feasible once the signal-to-noise ratio (SNR) $s^2 :=
17:   \min\{np \gamma^2,  p \gamma\}$ is lower bounded by a constant.  Our work
18:   builds upon the semidefinite relaxation from Zhou (2023a) of an integer
19:   quadratic program that is formulated essentially as finding the maximum cut
20:   on a graph, where edge weights in the cut represent dissimilarity scores
21:   between two nodes based on their $p$ features. More importantly, we prove that the
22:   misclassification error decays exponentially with respect to the SNR $s^2$ in
23:   the present paper. The significance of such an exponentially decaying error
24:   bound is that, when $s^2 = \Omega(\log n)$, perfect recovery of the cluster
25:   structure is  accomplished. This result was introduced in Zhou (2023a)
26:   without a proof. We  therefore present the full proof in the present work.
27: \end{abstract}
28: