\begin{abstract}
In this paper, we consider the problem of
partitioning a small data sample of size $n$ drawn from a mixture of
two sub-gaussian distributions. In particular, we analyze computationally efficient algorithms,
previously proposed by the same author, that partition the data into two groups approximately
according to their population of origin.
This work is motivated by the application of clustering individuals according to
their population of origin using $p$ markers, when the divergence between any two
of the populations is small.
We build upon the semidefinite relaxation of an
integer quadratic program that is formulated essentially as finding
the maximum cut on a graph, where edge weights in the cut represent
dissimilarity scores between two nodes based on their $p$ features.
Here we use $\Delta^2 := p \gamma$ to denote the $\ell_2^2$ distance
between the two centers (mean vectors), namely, $\mu^{(1)}, \mu^{(2)} \in \R^p$.
The goal is to allow a full range of tradeoffs between $n$, $p$, and $\gamma$, in the sense that
partial recovery (success rate $< 100\%$) is feasible once the signal-to-noise
ratio (SNR) $s^2 := \min\{np \gamma^2, \Delta^2\}$ is lower bounded by a
constant. Importantly, we prove that the misclassification error
decays exponentially with respect to the SNR $s^2$. This result was
announced earlier without a full proof; we present the full proof in this work.
Finally, for balanced partitions, we consider a variant of SDP1 and show
that the new estimator has a superb debiasing property. This is novel
to the best of our knowledge.
\end{abstract}