\begin{abstract}
		Motivated by a series of applications in data integration, language translation, bioinformatics, and computer vision, we consider spherical regression with two sets of unit-length vectors when the data are corrupted by a small fraction of mismatched response-predictor pairs. We propose a three-step algorithm. First, we initialize the parameters by solving an orthogonal Procrustes problem to estimate a translation matrix $\Wbb$, ignoring the mismatch. Second, we estimate a mapping matrix that aims to correct the mismatch, using hard-thresholding to induce sparsity while incorporating potential group information. Finally, we obtain a refined estimate of $\Wbb$ by removing the estimated mismatched pairs.
		We derive error bounds for the initial estimate of $\Wbb$ in both the fixed- and high-dimensional settings. We demonstrate that the refined estimate of $\Wbb$ achieves an error rate as good as if no mismatch were present. We show that our mapping recovery method not only correctly distinguishes one-to-one from one-to-many correspondences, but also consistently identifies the matched pairs and estimates the weight vector for combined correspondence.
		We examine the finite-sample performance of the proposed method through extensive simulation studies and an application to the unsupervised translation of medical codes using electronic health records data.
	\end{abstract}