
1: \begin{abstract}

2:       We give oracle inequalities for procedures that combine quantization and variable selection via a weighted Lasso $k$-means type algorithm. The results are derived for a general family of weights, which can be tuned to scale the influence of the variables in different ways. Moreover, these theoretical guarantees are shown to adapt to the sparsity of the optimal codebooks, when such sparsity is present. Even without any sparsity assumption on the optimal codebooks, our procedure is proved to be close to a sparse approximation of the optimal codebooks, as has been done for Generalized Linear Models in regression. If the optimal codebooks have a sparse support, we also show that this support can be asymptotically recovered, which yields an asymptotic upper bound on the probability of misclassification. These results are illustrated with Gaussian mixture models in arbitrary dimension with sparsity assumptions on the means, which are standard distributions in model-based clustering.

3:

4:       % Recent results in quantization theory provide theoretical bounds on the distortion of squared-norm based quantizers (see, e.g., \cite{Biau08} or \cite{Levrard14}). These bounds are valid whenever the source distribution has a bounded support, regardless of the dimension of the underlying Hilbert space.

5:

6:       %However, it remains of interest to select the relevant variables for quantization. This task is usually performed using coordinate energy-ratio thresholding (see, e.g., \cite{Antoniadis10} or \cite{Steinley08}), or by maximizing a constrained empirical Between Cluster Sum of Squares criterion (see, e.g., \cite{Chang14} or \cite{Witten10}). This paper offers a Lasso type procedure to select the relevant variables for $k$-means clustering, as presented in \cite{Sun12}. Moreover, some non-asymptotic convergence results on the distortion are derived for this procedure, along with consistency results toward sparse codebooks.

7: \end{abstract}

8: