\begin{abstract}
In this paper, we consider the problem of
partitioning a small data sample of size $n$ drawn from a mixture of
two sub-gaussian distributions. In particular, we analyze computationally efficient algorithms,
previously proposed by the same author, that partition the data into two groups approximately
according to their population of origin given a small sample.
This work is motivated by the application of clustering individuals according to
their population of origin using $p$ markers, when the divergence between any two
of the populations is small.
We build upon the semidefinite relaxation of an
integer quadratic program that is formulated essentially as finding
the maximum cut on a graph, where edge weights in the cut represent
dissimilarity scores between two nodes based on their $p$ features.
Here we use $\Delta^2 := p \gamma$ to denote the $\ell_2^2$ distance
between the two centers (mean vectors) $\mu^{(1)}, \mu^{(2)} \in \R^p$.
The goal is to allow a full range of tradeoffs between $n$, $p$, and $\gamma$ in the sense that
partial recovery (success rate $< 100\%$) is feasible once the signal-to-noise
ratio (SNR) $s^2 := \min\{np \gamma^2, \Delta^2\}$ is lower bounded by a
constant. Importantly, we prove that the misclassification error
decays exponentially with respect to the SNR $s^2$. This result was
introduced earlier without a full proof; we therefore present the full proof here.
Finally, for balanced partitions, we consider a variant of SDP1 and show
that the new estimator has a superb debiasing property. To the best of our knowledge, this is novel.
\end{abstract}
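
For orientation, the displays below sketch a generic maximum-cut integer quadratic program and its standard semidefinite relaxation, of the kind the abstract refers to. The dissimilarity weights $A_{ij}$, the labels $x_i$, and the matrix variable $Z$ are notation assumed here for illustration only; this is not claimed to be the exact form of SDP1.
\[
  \max_{x \in \{-1,1\}^n} \; \frac{1}{2} \sum_{i<j} A_{ij} \, (1 - x_i x_j)
\]
Replacing each product $x_i x_j$ by the entry $Z_{ij}$ of a positive semidefinite matrix with unit diagonal, and dropping the rank-one requirement $Z = x x^T$, gives the relaxation
\[
  \max_{Z \succeq 0, \; Z_{ii} = 1 \ (i = 1, \ldots, n)} \; \frac{1}{2} \sum_{i<j} A_{ij} \, (1 - Z_{ij}).
\]
As a purely hypothetical numerical illustration of the SNR, take $n = 100$, $p = 10^4$, and $\gamma = 10^{-3}$. Then $\Delta^2 = p \gamma = 10$ and $n p \gamma^2 = 1$, so that
\[
  s^2 = \min\{ n p \gamma^2, \, \Delta^2 \} = \min\{1, \, 10\} = 1 ,
\]
and in this regime the sample-size term $n p \gamma^2$, rather than the separation $\Delta^2$, determines the SNR.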