1: \begin{abstract}
2: The problem of multimodal clustering arises whenever the data are
3: gathered with several physically different sensors. Observations
4: from different modalities are not necessarily aligned in the sense
5: that there is no obvious way to associate or to compare them in
6: some common space. One solution is to run a separate
7: clustering task for each modality. The main
8: difficulty with such an approach is to guarantee that the unimodal
9: clusterings are mutually consistent. In this paper we show that
10: multimodal clustering can be addressed within a novel framework,
11: namely \textit{conjugate mixture models}. These models exploit the
12: explicit transformations that are often available between an
13: unobserved parameter space (objects) and each one of the
14: observation spaces (sensors). We formulate the problem as a
15: likelihood maximization task and we derive the associated
16: \textit{conjugate expectation-maximization} algorithm. The
17: convergence properties of the proposed algorithm are thoroughly
18: investigated. We propose several local and global optimization
19: techniques to increase its convergence speed, compare two
20: initialization strategies, and introduce a consistent
21: model-selection criterion. The algorithm and its
22: variants are tested and evaluated on the task of 3D
23: localization of several speakers using both auditory and visual
24: data.
25:
26: \end{abstract}
27: