\begin{abstract}
In this paper, we consider the problem of
partitioning a small data sample of size $n$ drawn from a mixture of
two sub-gaussian distributions. In particular, we analyze computationally
efficient algorithms, proposed by the same author, that partition the data
into two groups approximately according to their population of origin.
This work is motivated by the application of clustering individuals according to
their population of origin using $p$ markers, when the divergence between any two
of the populations is small.
We build upon the semidefinite relaxation of an
integer quadratic program that is formulated essentially as finding
the maximum cut on a graph, where edge weights in the cut represent
dissimilarity scores between two nodes based on their $p$ features.
Here we use $\Delta^2 := p \gamma$ to denote the $\ell_2^2$ distance
between the two centers (mean vectors) $\mu^{(1)}, \mu^{(2)} \in \R^p$.
The goal is to allow a full range of tradeoffs between $n$, $p$, and $\gamma$, in the sense that
partial recovery (success rate $< 100\%$) is feasible once the signal-to-noise
ratio (SNR) $s^2 := \min\{np \gamma^2, \Delta^2\}$ is lower bounded by a
constant. Importantly, we prove that the misclassification error
decays exponentially with respect to the SNR $s^2$. This result was
introduced earlier without a full proof; we present the full proof here.
Finally, for balanced partitions, we consider a variant of SDP1 and show
that the new estimator has a superb debiasing property. To the best of our
knowledge, this is novel.
\end{abstract}