1: \begin{abstract}
2: In this paper, we study the problem of sparse multiple kernel learning (MKL), where the goal is to efficiently learn a combination of a small, fixed number of kernels, selected from a large pool, that yields a kernel classifier with small prediction error. We develop an efficient algorithm based on greedy coordinate descent that achieves a geometric convergence rate under appropriate conditions. This rate is obtained by measuring the size of functional gradients with an empirical $\ell_2$ norm, which depends on the empirical data distribution. In contrast, previous algorithms measure the size of gradients with a functional norm, which is independent of the data samples. We also establish a generalization error bound for the learned sparse kernel classifier using the technique of local Rademacher complexity.
7: %In this paper, we study the problem of sparse Multiple Kernel Learning (MKL), where the goal is to efficiently learn a combination of a fixed number of kernels from a large pool of kernels that could lead to a kernel classifier with a small prediction error. We develop two efficient algorithms, based on the theory of greedy coordinate descent, %~\citep{shai-2010-trade},%
8: %for sparse MKL. The essential difference between the two algorithms is how to measure the size of functional gradients when selecting the kernel with the ``largest'' gradient. The first algorithm measures the size of functional gradient by its functional norm, and the second algorithm measures the gradient by its $L_2$ norm that depends on the empirical data distribution. We show that the algorithm based on the functional norm achieves the error of $O(1/d)$ where $d$ is the number of selected kernels, while the second algorithm based on the $L_2$ norm is able to achieve a geometric convergence rate in $d$ under appropriate conditions, and is therefore significantly more effective than the first algorithm.
9: \end{abstract}
10: