1: \documentclass{sig-alternate}
2:
3: \newcommand{\cut}{{\rm cut}}
4: \newcommand{\Ncut}{{\rm Ncut}}
5: \newcommand{\s}{{\rm s}}
6: \newcommand{\diag}{{\rm diag}}
7: \newcommand{\op}{{\rm op}}
8: \newcommand{\R}{{\cal R}}
9: \newcommand{\tf}{{\rm tf}}
10: \newcommand{\df}{{\rm df}}
11: \newcommand{\sre}{{\rm sre}}
12: \newcommand{\svd}{{\rm svd}}
13: \newcommand{\nnz}{{\rm nnz}}
14: \newcommand{\trace}{{\rm trace}}
15:
16: \begin{document}
17: %
18: % --- Author Metadata here ---
\conferenceinfo{CIKM}{'01 November 5-10, 2001, Atlanta, Georgia, USA}
20: \CopyrightYear{2001} % Allows default copyright year (2000) to be over-ridden - IF NEED BE.
21: %\crdata{0-12345-67-8/90/01} % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.
22: % --- End of Author Metadata ---
23:
24: \title{Bipartite Graph Partitioning and Data
25: Clustering\titlenote{Part of this work was done while Xiaofeng He
26: was a graduate research assistant at NERSC, Berkeley National Lab.}
27: }
28: %\subtitle{[Extended Abstract]
29: %\titlenote{A full version of this paper is available as
30: %\textit{Author's Guide to Preparing ACM SIG Proceedings Using
31: %\LaTeX$2_\epsilon$\ and BibTeX} at
32: %\texttt{www.acm.org/eaddress.htm}}}
33: %
34: % You need the command \numberofauthors to handle the ``boxing''
35: % and alignment of the authors under the title, and to add
36: % a section for authors number 4 through n.
37: %
38: % Up to the first three authors are aligned under the title;
39: % use the \alignauthor commands below to handle those names
40: % and affiliations. Add names, affiliations, addresses for
41: % additional authors as the argument to \additionalauthors;
42: % these will be set for you without further effort on your
43: % part as the last section in the body of your article BEFORE
44: % References or any Appendices.
45:
46: \numberofauthors{3}
47: %
48: % You can go ahead and credit authors number 4+ here;
49: % their names will appear in a section called
50: % ``Additional Authors'' just before the Appendices
51: % (if there are any) or Bibliography (if there
52: % aren't)
53:
54: % Put no more than the first THREE authors in the \author command
55: \author{
56: %
57: % The command \alignauthor (no curly braces needed) should
58: % precede each author name, affiliation/snail-mail address and
59: % e-mail address. Additionally, tag each line of
60: % affiliation/address with \affaddr, and tag the
61: %% e-mail address with \email.
62: \alignauthor Hongyuan Zha \\[2pt] Xiaofeng He\\
63: \affaddr{Dept. of Comp. Sci. \& Eng.}\\
64: \affaddr{Penn State Univ.}\\
65: \affaddr{State College, PA 16802}\\
66: \email{\{zha,xhe\}@cse.psu.edu}
67: \alignauthor Chris Ding \\[2pt] Horst Simon\\
68: \affaddr{NERSC Division}\\
69: \affaddr{Berkeley National Lab.}\\
70: \affaddr{Berkeley, CA 94720}\\
71: \email{\{chqding,hdsimon\}@lbl.gov}
72: \alignauthor Ming Gu\\
73: \affaddr{Dept. of Math.}\\
74: \affaddr{U.C. Berkeley}\\
75: \affaddr{Berkeley, CA 94720}\\
76: \email{mgu@math.berkeley.edu}
77: }
78: %\additionalauthors{Additional authors: John Smith (The Th{\o}rv\"{a}ld Group,
79: %email: {\texttt{jsmith@affiliation.org}}) and Julius P.~Kumquat
80: %(The Kumquat Consortium, email: {\texttt{jpkumquat@consortium.net}}).}
81: %\date{30 July 1999}
82: \maketitle
83: \begin{abstract}
Many data types arising from data mining applications
can be modeled as bipartite graphs; examples include
terms and documents in a text corpus, customers and
purchased items in market basket analysis, and reviewers
and movies in a movie recommender system. In this paper,
89: we propose a new data clustering method based on
90: partitioning the underlying bipartite graph. The partition
91: is constructed by minimizing a {\it normalized}
92: sum of edge weights between {\it unmatched} pairs of vertices
93: of the bipartite graph.
94: We show that an approximate solution to the minimization
95: problem can be obtained by computing
96: a partial singular value decomposition (SVD)
97: of the associated edge weight
98: matrix of the bipartite graph. We point out the connection
99: of our clustering algorithm to correspondence analysis used in
100: multivariate analysis. We also briefly discuss the issue
101: of assigning data objects to multiple clusters.
102: In the experimental results, we apply our clustering
103: algorithm to the problem of document clustering to illustrate its
104: effectiveness and efficiency.
105: \end{abstract}
106:
107: % A category with only the three required fields
108: \category{H.3.3}{Information Search and Retrieval}{Clustering}
109: \category{G.1.3}{Numerical Linear Algebra}{Singular value decomposition}
110: %A category including the fourth, optional field follows...
111: \category{G.2.2}{Graph Theory}{Graph algorithms}
112:
113: \terms{Algorithms, theory}
114:
115: \keywords{document clustering, bipartite graph, graph partitioning,
116: spectral relaxation, singular value decomposition,
117: correspondence analysis}
118:
119: \section{Introduction}\label{sec:int}
120: Cluster analysis is an important tool for exploratory data mining
121: applications arising from many diverse disciplines. Informally,
122: cluster analysis seeks to partition a given data set into compact
123: clusters so that data objects within a cluster are more similar
than those in distinct clusters. The literature on cluster analysis
is enormous and includes contributions from many research communities
(see \cite{Everitt,Gordon} for
recent surveys of some classical approaches). Many traditional
128: clustering algorithms are based
129: on the assumption that the given dataset
130: consists of covariate information (or attributes) for each individual
131: data object, and cluster analysis can be cast as a problem of
132: grouping a set of $n$-dimensional vectors each representing
133: a data object in the dataset. A familiar example
134: is document clustering using the vector space
135: model \cite{bele:00}. Here each document
136: is represented by an $n$-dimensional vector, and each
137: coordinate of the vector corresponds to a term in a vocabulary of size $n$.
138: This formulation
139: leads to the so-called term-document matrix $A=(a_{ij})$
140: for the representation of the collection of documents,
141: where $a_{ij}$ is the
142: so-called term frequency, i.e.,
143: the number of times term $i$ occurs in document $j$.
144: In this vector
145: space model terms and documents are treated asymmetrically with
146: terms considered as the covariates or attributes of documents. It is
147: also possible to treat both terms and documents as first-class citizens
148: in a symmetric fashion, and consider $a_{ij}$ as the frequency of
149: co-occurrence of term $i$ and document $j$ as is done,
150: for example, in probabilistic
latent semantic indexing \cite{hoff:99}.\footnote{Our clustering
algorithm computes an approximate globally optimal solution, while probabilistic
latent semantic indexing relies on the EM algorithm and therefore might be
prone to local minima even with the help of some annealing process.}
155: In this paper, we
156: follow this basic principle and propose a new approach
157: to model terms and documents as vertices in a bipartite graph
158: with edges of the graph indicating the co-occurrence of terms and documents.
159: In addition
160: we can optionally
161: use edge weights to indicate the frequency of this co-occurrence.
Cluster analysis for document collections
in this context is based on a very intuitive notion: documents
are grouped by topics. On the one hand,
documents on a topic tend to use the same subset
of terms more heavily, and these terms form a term cluster;
on the other hand, a topic is usually characterized
by a subset of terms, and documents that use those terms heavily tend to
be about that particular topic. It is this interplay of terms and
documents that gives rise to what we call bi-clustering, by which
terms and documents are simultaneously grouped into
{\it semantically coherent} clusters.
173:
Within our bipartite graph model, the clustering problem can be
solved by constructing vertex partitions of the graph. Many criteria have been
176: proposed for measuring the quality of graph partitions
177: of undirected graphs \cite{Chung,Shi}. In this paper, we show
178: how to adapt those criteria for bipartite graph partitioning and
179: therefore solve the bi-clustering problem. A great variety of
180: objective functions have been proposed for cluster analysis without
181: efficient algorithms for finding the (approximate) optimal solutions.
182: We will show that our bipartite graph formulation naturally leads to
183: partial SVD problems for the underlying edge weight matrix
184: which admit efficient
185: {\it global} optimal solutions.
186: The rest of the paper
187: is organized as follows: in section \ref{se:bi}, we propose a
188: new criterion for
189: bipartite graph partitioning which tends to produce balanced
190: clusters. In section \ref{se:svd}, we show that our criterion
191: leads to an optimization problem that can be approximately
192: solved by computing a partial SVD
193: of the weight matrix of the bipartite graph.
In section \ref{se:corr}, we connect our approximate
solution to correspondence analysis used in multivariate data
analysis. In section \ref{se:over}, we briefly
discuss how to deal with overlapping clusters.
198: In section \ref{se:exp}, we
199: describe experimental results on
200: bi-clustering a dataset of newsgroup articles. We conclude the paper
201: in section \ref{se:con} and give pointers to future research.
202:
203: \section{Bipartite graph partitioning}\label{se:bi}
204: We denote a graph by $G(V,E)$, where $V$ is the vertex set
205: and $E$ is the edge set of the
206: graph. A graph $G(V,E)$
207: is {\it bipartite} with two vertex
208: classes $X$ and $Y$ if $V = X\cup Y$ with
209: $X\cap Y = \emptyset$ and each edge in $E$ has
210: one endpoint in $X$ and one endpoint in $Y$.
We consider a weighted bipartite graph $G(X,Y,W)$ with
$W = (w_{ij})$, where $w_{ij} > 0$ denotes the weight of the
edge between vertices $i \in X$ and $j \in Y$. We let $w_{ij}=0$ if there
is no edge between vertices $i$ and $j$.
215: In the context
216: of document clustering, $X$ represents
217: the set of terms and $Y$ represents the set of documents, and $w_{ij}$
218: can be used to denote the number of times term $i$ occurs in
219: document $j$.
220: A vertex partition of $G(X,Y,W)$
221: denoted by $\Pi(A,B)$
222: is defined by a partition of the vertex sets
223: $X$ and $Y$, respectively: $X=A\cup A^c$, and $Y=B\cup B^c$, where
for a set $S$, $S^c$ denotes its complement. By convention,
225: we pair $A$ with $B$, and $A^c$ with $B^c$. We say that
226: a pair of vertices $ x \in X$ and $y \in Y$ is {\it matched} with
227: respect to a partition $\Pi(A,B)$ if there is an edge
228: between $x$ and $y$, and either $x \in A$ and $y \in B$ or
229: $x \in A^c$ and $y \in B^c$. For any two subsets of vertices
230: $ S \subset X$ and $T \subset Y$, define
231: \[ W(S,T) = \sum_{i \in S, j \in T} w_{ij},\]
232: i.e., $W(S,T)$ is the sum of the weights of edges with one
233: endpoint in $S$ and one endpoint in $T$. The quantity
234: $W(S,T)$ can be
235: considered as measuring the association
236: between the vertex sets $S$ and
$T$. In the context of cluster analysis,
edge weights measure the similarity between data objects.
239: To partition data objects into
240: clusters, we seek a partition of $G(X,Y,W)$ such that the
241: association (similarity)
242: between unmatched vertices is as small as possible.
243: One possibility is to consider for a partition $\Pi(A,B)$
244: the following quantity
245: \begin{equation}\label{eq:cut}
246: \begin{array}{ll} \cut(A,B) & \equiv W(A,B^c) + W(A^c, B)\\[3pt]
247: & = \sum_{i \in A, j \in B^c} w_{ij} +
248: \sum_{i \in A^c, j \in B} w_{ij}.
249: \end{array}
250: \end{equation}
251: Intuitively, choosing $\Pi(A,B)$
252: to minimize $\cut(A,B)$ will give rise to a partition that
253: minimizes the sum of all the edge weights between unmatched
254: vertices. In the context of document clustering, we try to find
255: two document clusters $B$ and $B^c$ which have few terms in
256: common, and the documents in $B$ mostly use terms in $A$ and
257: those in $B^c$ use terms in $A^c$.
Unfortunately, choosing a partition based entirely on
$\cut(A,B)$ tends to produce unbalanced clusters, i.e.,
the sizes of $A$ and/or $B$ or their complements tend to be
small.
262: Inspired by the work in \cite{Chung,Driessche,Shi}, we propose
263: the following normalized variant of the edge cut in (\ref{eq:cut})
264: \[ \Ncut(A,B) \equiv \frac{\cut(A,B)}{W(A,Y) + W(X,B)}\]
265: \[ + \frac{\cut(A^c,B^c)}{W(A^c,Y) + W(X,B^c)}.\]
The intuition behind this criterion is that we want not only
a partition with small edge cut, but also the two
subgraphs formed between the matched vertices to be as dense
as possible. This latter requirement is partially
270: satisfied by introducing
271: the normalizing denominators in the above equation.\footnote{A
272: more natural criterion seems to be
273: \[\frac{\cut(A,B)}{W(A,B)}
274: + \frac{\cut(A^c,B^c)}{W(A^c,B^c)}.\]
However, it can be shown that it leads to an SVD
276: problem with the same set of left and right singular vectors.}
277: Our bi-clustering problem is now equivalent to
278: the following optimization problem
279: \[ \min_{\Pi(A,B)} \Ncut(A,B),\]
280: i.e., finding partitions of the vertex sets $X$ and $Y$ to
281: minimize the normalized cut of the bipartite graph $G(X,Y,W)$.
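
To make the definitions concrete, consider a small hypothetical example
(the weights are chosen purely for illustration and are not taken from
our experiments) with three terms, four documents, and
\[ W = \left[\begin{array}{cccc}
2 & 1 & 0 & 0\\
1 & 2 & 1 & 0\\
0 & 0 & 2 & 3
\end{array}\right]. \]
For the partition with $A=\{1,2\}$ and $B=\{1,2\}$, the only edge between
unmatched vertices is the one with weight $w_{23}=1$, so
$\cut(A,B) = W(A,B^c)+W(A^c,B) = 1+0 = 1$. The normalizing terms are
$W(A,Y)=7$, $W(X,B)=6$, $W(A^c,Y)=5$ and $W(X,B^c)=6$, and therefore
\[ \Ncut(A,B) = \frac{1}{7+6} + \frac{1}{5+6} = \frac{24}{143}
\approx 0.168. \]
If instead we take $A=\{1\}$ and $B=\{1,2\}$, then $\cut(A,B) = 0+3 = 3$,
$W(A,Y)+W(X,B) = 3+6$ and $W(A^c,Y)+W(X,B^c) = 9+6$, so that
\[ \Ncut(A,B) = \frac{3}{9} + \frac{3}{15} = \frac{8}{15}
\approx 0.533, \]
and the criterion prefers the first partition.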
282:
283: \section{Approximate solutions using singular vectors}\label{se:svd}
Given a bipartite graph $G(X,Y,W)$
and an associated partition $\Pi(A,B)$, let us reorder
286: the vertices of $X$ and $Y$ so that vertices in $A$ and $B$ are
287: ordered before vertices in $A^c$ and $B^c$, respectively. The
288: weight matrix $W$ can be written in a block format
289: \begin{equation}\label{eq:w} W = \left[\begin{array}{cc}
290: W_{11}&W_{12}\\
291: W_{21}&W_{22}
292: \end{array}\right],\end{equation}
293: i.e., the rows of $W_{11}$ correspond to
294: the vertices in the vertex set $A$ and
295: the columns of $W_{11}$ correspond to
296: those in $B$. Therefore
297: $G(A,B,W_{11})$ denotes the weighted bipartite graph
298: corresponding to the vertex sets $A$ and $B$.
299: For any $m$-by-$n$ matrix
300: $H = (h_{ij})$, define
301: \[ \s(H) = \sum_{i=1}^m \sum_{j=1}^n h_{ij},\]
302: i.e., $\s(H)$ is the sum of all the elements of $H$.
It is easy to see from the definition of $\Ncut$ that
304: \[ \Ncut(A,B) = \frac{\s(W_{12}) + \s(W_{21})}{2\s(W_{11}) +\s(W_{12})
305: +\s(W_{21})}\]\[ +
306: \frac{\s(W_{12}) + \s(W_{21})}{2\s(W_{22}) +\s(W_{12})
307: +\s(W_{21})}.\]
308: In order to make connections to
309: SVD problems, we
310: first consider the case when $W$ is symmetric.\footnote{A different
311: proof for the
312: symmetric case was first derived in \cite{Shi}. However, our derivation
313: is simpler and more transparent and leads naturally to the SVD
314: problems for the rectangular case.}
315: It is easy
316: to see that with $W$ symmetric (denoting $\Ncut(A,A)$ by $\Ncut(A)$),
317: we have
318: \begin{equation}\label{eq:sym}
319: \Ncut(A) = \frac{\s(W_{12})}{\s(W_{11})+\s(W_{12})}
320: + \frac{\s(W_{12})}{\s(W_{22})+\s(W_{12})}.\end{equation}
321: Let $e$ be the vector
322: with all its elements equal to $1$. Let $D$ be the diagonal matrix
323: such that $We = De$. Then $(D-W)e=0$. Let $x=(x_i)$ be the vector with
324: \[ x_i = \left\{\begin{array}{rl}
325: 1, & i \in A,\\
326: -1, & i \in A^c.
327: \end{array}
328: \right.
329: \]
330: It is easy to verify that
331: \[ \s(W_{12}) = x^T(D-W)x/4.\]
332: Define
333: \[ p \equiv \frac{\s(W_{11})+\s(W_{12})}{\s(W_{11})+2\s(W_{12})+\s(W_{22})}
334: =\frac{\s(W_{11})+\s(W_{12})}{e^TDe}.\]
335: Then
336: \[ \begin{array}{c}
337: \s(W_{11})+\s(W_{12}) = p e^TDe, \\[3pt]
338: \s(W_{22})+\s(W_{12}) = (1-p)e^TDe,
339: \end{array}
340: \]
341: and
342: \begin{equation}\label{eq:n}
343: \Ncut(A) = \frac{x^T(D-W)x}{4p(1-p)e^TDe}.
344: \end{equation}
Since $(D-W)e=0$, for any scalar $s$ we have
346: \[ (se+x)^T(D-W)(se+x)= x^T(D-W)x.\]
347: To cast (\ref{eq:n}) in the form of a Rayleigh quotient,
348: we need to find
349: $s$ such that
350: \[ (se+x)^TD(se+x)= 4p(1-p)e^TDe.\]
351: Since $x^TDx = e^TDe$, it follows from the above equation that
$s = 1-2p$. Now let $y=(1-2p)e + x$; it is easy to see that
353: $y^TDe = ((1-2p)e + x)^TDe = 0$, and
354: \[ y_i = \left\{\begin{array}{rl}
355: 2(1-p)>0, & i \in A,\\
356: -2p<0, & i \in A^c.
357: \end{array}
358: \right.
359: \]
360: Thus
361: \[ \min_{A} \Ncut(A) = \min \left\{ \frac{y^T(D-W)y}{y^TDy}
362: \;\; | \;\; y \in S\right\},\]
363: where
364: \[ S=\{ y \;\; | \;\;
365: y^TDe =0, y_i
366: \in \{ 2(1-p), -2p\} \}.\]
367: If we drop the constraints $y_i
368: \in \{ 2(1-p), -2p\}$ and let
369: the elements of $y$ take
370: arbitrary continuous values, then the optimal $y$ can be approximated by
371: the following relaxed {\it continuous} minimization problem,
372: \begin{equation}\label{eq:y} \min \left\{ \frac{y^T(D-W)y}{y^TDy}
373: \;\; | \;\;y^TDe =0\right\}.\end{equation}
Notice that it follows from $We = De$ that
\[ D^{-1/2}WD^{-1/2} (D^{1/2}e) = D^{1/2}e,\]
and therefore $D^{1/2}e$ is an eigenvector of
$D^{-1/2}WD^{-1/2}$ corresponding to the eigenvalue $1$. It
is easy to show that all the eigenvalues of $D^{-1/2}WD^{-1/2}$
have absolute value at most $1$ (see the Appendix). With the substitution
$\hat{y} = D^{1/2}y$, the quotient in (\ref{eq:y}) becomes
$1 - \hat{y}^TD^{-1/2}WD^{-1/2}\hat{y}/\hat{y}^T\hat{y}$ and the constraint
becomes $\hat{y}^T(D^{1/2}e)=0$. Thus the optimal $y$ in
(\ref{eq:y}) can be computed as $y = D^{-1/2}\hat{y}$, where
$\hat{y}$ is the eigenvector of $D^{-1/2}WD^{-1/2}$ corresponding to its
{\it second} largest eigenvalue.
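
As a sanity check of this derivation, consider the following hypothetical
$2$-by-$2$ symmetric example (chosen only for illustration):
\[ W = \left[\begin{array}{cc}
2 & 1\\
1 & 3
\end{array}\right], \quad A = \{1\}, \]
so that $\s(W_{11})=2$, $\s(W_{12})=1$, $\s(W_{22})=3$, $D=\diag(3,4)$
and $x=(1,-1)^T$. Then $x^T(D-W)x = 4 = 4\s(W_{12})$, $p=3/7$,
$e^TDe = 7$, and both (\ref{eq:sym}) and (\ref{eq:n}) give
\[ \Ncut(A) = \frac{1}{3}+\frac{1}{4}
= \frac{4}{4\cdot\frac{3}{7}\cdot\frac{4}{7}\cdot 7} = \frac{7}{12}. \]
The shifted vector is $y = (1/7)e + x = (8/7,-6/7)^T$, which satisfies
$y^TDe = 3\cdot\frac{8}{7} - 4\cdot\frac{6}{7} = 0$ and
$y^TDy = 48/7 = 4p(1-p)e^TDe$. The scaled matrix $D^{-1/2}WD^{-1/2}$ has
eigenvalues $1$ and $5/12$ with eigenvectors $D^{1/2}e = (\sqrt{3},2)^T$
and $\hat{y} = (2,-\sqrt{3})^T$, respectively. Scaling back gives
$D^{-1/2}\hat{y} = (2/\sqrt{3},-\sqrt{3}/2)^T$, a positive multiple of $y$,
and $1-5/12 = 7/12$ reproduces $\Ncut(A)$. With only two vertices the
constraint $y^TDe=0$ leaves a single direction, so the relaxation happens
to be exact here; in general it only provides a lower bound on the
minimal normalized cut.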
383:
384: Now we return to the rectangular case for the weight matrix $W$,
385: and let $D_X$ and $D_Y$ be diagonal matrices such that
386: \begin{equation}\label{eq:xy}
387: We = D_X e, \quad W^Te = D_Ye.
388: \end{equation}
389: Consider a partition $\Pi(A,B)$, and define
390: \[ u_i = \left\{\begin{array}{rl}
391: 1, & i \in A\\
392: -1, & i \in A^c
393: \end{array}
394: \right., \quad
395: v_i = \left\{\begin{array}{rl}
396: 1, & i \in B\\
397: -1, & i \in B^c
398: \end{array}
399: \right.
400: \]
401: Let $W$ have the block form as in (\ref{eq:w}), and consider the
402: augmented symmetric matrix\footnote{In \cite{heko:00}, the Laplacian
403: of $\hat{W}$ is used for partitioning a rectangular matrix
404: in the context of designing load-balanced matrix-vector multiplication
405: algorithms for parallel computation. However, the eigenvalue
406: problem of the Laplacian
407: of $\hat{W}$ does not lead to a simpler singular value problem.}
408: \[ \hat{W} = \left[\begin{array}{cc}
409: 0 & W\\
410: W^T & 0
411: \end{array}\right]
412: = \left[\begin{array}{cc|cc}
413: 0 & 0 & W_{11} & W_{12}\\
414: 0 & 0 & W_{21} & W_{22}\\ \hline
415: W_{11}^T & W_{21}^T & 0 & 0 \\
416: W_{12}^T & W_{22}^T & 0 & 0
417: \end{array}\right].\]
418: If we interchange the second and third block rows and columns
419: of the above matrix, we obtain
420: \[ \left[\begin{array}{cc|cc}
421: 0 & W_{11} & 0 & W_{12}\\
422: W_{11}^T & 0 & W_{21}^T & 0\\ \hline
423: 0 & W_{21} & 0 & W_{22} \\
424: W_{12}^T & 0 & W_{22}^T & 0
425: \end{array}\right] \equiv
426: \left[\begin{array}{cc}
427: \hat{W}_{11} & \hat{W}_{12}\\
428: \hat{W}_{12}^T & \hat{W}_{22}
429: \end{array}\right],\]
430: and the normalized cut can be written as
431: \[ \Ncut(A,B) = \frac{\s(\hat{W}_{12})}{\s(\hat{W}_{11})+\s(\hat{W}_{12})}
432: + \frac{\s(\hat{W}_{12})}{\s(\hat{W}_{22})+\s(\hat{W}_{12})},\]
433: a form that resembles the symmetric case (\ref{eq:sym}). Define
434: \[ q = \frac{2\s(W_{11}) +\s(W_{12})
435: +\s(W_{21})}{e^TD_Xe + e^TD_Ye}.\]
436: Then we have
437: \[ \Ncut(A,B) = \frac{-2x^TWy + x^TD_Xx + y^TD_Yy}{x^TD_Xx + y^TD_Yy}\]\[
438: = 1- \frac{2x^TWy}{x^TD_Xx + y^TD_Yy},\]
where $x = (1-2q)e +u$ and $y = (1-2q)e + v$. It is also easy to see that
440: \begin{equation}\label{eq:q}
441: x^TD_Xe + y^TD_Ye = 0, \quad x_i, y_i \in \{ 2(1-q), -2q\}.
442: \end{equation}
443: Therefore,
444: \[ \min_{\Pi(A,B)}
445: \Ncut(A,B)\]\[ = 1-\max_{x \neq 0, y \neq 0}
446: \left\{ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy}
447: \;\; | \;\; x, y \; \mbox{\rm satisfy } (\ref{eq:q})\right\}.\]
448: Ignoring the discrete constraints on the elements of $x$ and $y$, we
449: have the following continuous maximization problem,
450: \begin{equation}\label{eq:yz}
451: \max_{x \neq 0, y \neq 0}
452: \left\{ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy}\;\; | \;\;
453: x^TD_Xe + y^TD_Ye = 0 \right\}.
454: \end{equation}
Without the constraint
456: $x^TD_Xe + y^TD_Ye = 0$, the above problem is equivalent to
457: computing the largest singular triplet of $D_X^{-1/2} W D_Y^{-1/2}$
458: (see the Appendix).
459: From (\ref{eq:xy}), we have
460: \[ \begin{array}{c}
461: D_X^{-1/2} W D_Y^{-1/2} (D_Y^{1/2}e) = D_X^{1/2} e, \\[3pt]
462: (D_X^{-1/2} W D_Y^{-1/2})^T (D_X^{1/2}e) = D_Y^{1/2} e,
463: \end{array}
464: \]
465: and similarly to
466: the symmetric case, it is easy to show that all the
467: singular values of $D_X^{-1/2} W D_Y^{-1/2}$
468: are at most $1$. Therefore, an optimal pair $\{x,y\}$ for
469: (\ref{eq:yz}) can be computed as
470: $x = D_X^{-1/2} \hat{x}$ and $y = D_Y^{-1/2} \hat{y}$,
471: where $\hat{x}$ and $\hat{y}$ are the {\it second}
472: largest left and right singular vectors of $D_X^{-1/2} W D_Y^{-1/2}$,
473: respectively (see the Appendix).
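
The rectangular case can be checked on a hypothetical $2$-by-$2$
nonsymmetric example (again chosen only for illustration):
\[ W = \left[\begin{array}{cc}
2 & 1\\
0 & 3
\end{array}\right], \quad A = \{1\}, \quad B = \{1\}. \]
Here $D_X=\diag(3,3)$, $D_Y=\diag(2,4)$, and directly from the definition
$\Ncut(A,B) = 1/5 + 1/7 = 12/35$. We also have $q = 5/12$, so that
$x = y = (7/6,-5/6)^T$, and
\[ 2x^TWy = \frac{23}{3}, \quad
x^TD_Xx + y^TD_Yy = \frac{37}{6} + \frac{11}{2} = \frac{35}{3}, \]
which gives $1 - 23/35 = 12/35$ as expected. The constraint in
(\ref{eq:q}) holds since $x^TD_Xe = 1$ and $y^TD_Ye = -1$.
The scaled matrix $D_X^{-1/2} W D_Y^{-1/2}$ has singular values $1$ and
$1/\sqrt{2}$, so the relaxed problem (\ref{eq:yz}) attains
$1/\sqrt{2} \approx 0.707 \geq 23/35 \approx 0.657$; in other words, the
relaxation yields the lower bound $1-1/\sqrt{2} \approx 0.293$ on the
minimal $\Ncut$, which for this example equals $12/35 \approx 0.343$.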
With the above discussion, we can now summarize our
basic approach for bipartite graph clustering, which incorporates
a recursive procedure.
477:
478: \bigskip
479:
480: \begin{center}
481: \fbox{\parbox{7.7cm}{
482: {\sc Algorithm.} Spectral Recursive Embedding (SRE)
483:
Given a weighted bipartite graph $G(X,Y,W)$
with edge weight matrix $W$:
486:
487: \begin{enumerate}
488: \item Compute $D_X$ and $D_Y$ and form the scaled weight matrix
489: $\hat{W}=D_X^{-1/2} W D_Y^{-1/2}$.
490: \item Compute the {\it second} largest left and right
491: singular vectors of $\hat{W}$, $\hat{x}$ and $\hat{y}$.
492: \item Find cut points $c_x$ and $c_y$ for $x=D_X^{-1/2}\hat{x}$
493: and $y=D_Y^{-1/2}\hat{y}$, respectively.
494: \item Form partitions $A=\{i \;\;| \;\;x_i \geq c_x\}$ and
495: $A^c=\{i \;\;| \;\;x_i < c_x\}$ for vertex set $X$, and
496: $B=\{j \;\;| \;\;y_j \geq c_y\}$ and
497: $B^c=\{j \;\;|\;\; y_j < c_y\}$ for vertex set $Y$.
498: \item Recursively partition the sub-graphs $G(A,B)$
499: and $G(A^c,B^c)$ if necessary.
500:
501: \end{enumerate}
502: }}
503: \end{center}
504:
505: \bigskip
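
To illustrate the steps of the algorithm, consider a hypothetical
$4$-by-$4$ example, constructed so that the singular vectors are available
in closed form (it is not taken from our experiments):
\[ W = \left[\begin{array}{cccc}
4 & 4 & 1 & 1\\
4 & 4 & 1 & 1\\
1 & 1 & 4 & 4\\
1 & 1 & 4 & 4
\end{array}\right]. \]
In Step 1, every row and column sum equals $10$, so $D_X = D_Y = 10I$
and the scaled weight matrix is $W/10$. In Step 2, the singular values of
the scaled matrix are $1$, $3/5$, $0$, $0$, and the second largest left
and right singular vectors are
\[ \hat{x} = \hat{y} = \frac{1}{2}(1,1,-1,-1)^T \]
up to a common sign. In Steps 3 and 4, with the simple choice
$c_x = c_y = 0$, we obtain $A=\{1,2\}$ and $B=\{1,2\}$. For this partition
\[ \Ncut(A,B) = \frac{8}{40} + \frac{8}{40} = \frac{2}{5}
= 1 - \frac{3}{5}, \]
so the relaxation is tight in this idealized case. Flipping the sign of
both singular vectors merely exchanges the roles of $(A,B)$ and
$(A^c,B^c)$.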
506:
Two basic strategies can be used for selecting the cut points
$c_x$ and $c_y$. The simplest strategy is to set $c_x=0$ and
$c_y=0$. A more computationally intensive approach is to base
the selection on $\Ncut$: check $N$ equally spaced splitting
points of $x$ and $y$, respectively, and choose the cut
points $c_x$ and $c_y$ with the smallest $\Ncut$ \cite{Shi}.
513:
{\bf Computational complexity.} The major computational cost
of SRE is Step 2, computing the left and right singular vectors,
which can be obtained either by the power method or, more robustly,
by the Lanczos bidiagonalization process \cite[Chapter 9]{govl:96}.
The Lanczos method is an iterative process for computing
partial SVDs in
which each iterative step involves the computation of two matrix-vector
multiplications $\hat{W}u$ and $\hat{W}^Tv$ for some vectors
$u$ and $v$. The computational cost of these is
roughly proportional to $\nnz(\hat{W})$,
the number of nonzero elements of $\hat{W}$. The total
computational
cost of SRE is $O(c_{\sre}k_{\svd}\nnz(\hat{W}))$, where
$c_{\sre}$ is the level of recursion and $k_{\svd}$ is the
number of Lanczos iteration steps. In general, $k_{\svd}$ depends on
the singular value gaps of $\hat{W}$. Also notice that
$\nnz(\hat{W})= n_w n$, where $n_w$ is the average number of
distinct terms per document and $n$ is the total number of documents.
Therefore, for fixed $c_{\sre}$, $k_{\svd}$ and $n_w$, the total cost of
SRE is linear in the
number of documents to be clustered.
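
As a rough back-of-the-envelope illustration of this estimate, suppose
that $n = 2\times 10^4$, $n_w = 100$, $k_{\svd} = 50$ and $c_{\sre} = 5$
(all four values are assumptions made for the sake of the arithmetic,
not measurements). Then
\[ \nnz(\hat{W}) = n_w n = 2\times 10^6 \]
and
\[ c_{\sre}k_{\svd}\nnz(\hat{W}) = 5 \cdot 50 \cdot 2\times 10^6
= 5\times 10^8, \]
i.e., on the order of $10^8$ to $10^9$ elementary operations up to the
constant hidden in the $O(\cdot)$ notation. Doubling $n$ while keeping
the other three quantities fixed doubles this estimate.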
534:
535:
536: \section{Connections to correspondence analysis}\label{se:corr}
537: In its basic form correspondence analysis is applied to an
538: $m$-by-$n$ two-way
table of counts $W$ \cite{benz:92,gree:93,veri:99}. Let $w=\s(W)$ be
the sum of all the elements of $W$, and let $D_X$ and $D_Y$ be the diagonal
matrices defined in section \ref{se:svd}. Correspondence analysis
542: seeks to compute the largest singular triplets of the matrix
543: $Z=(z_{ij}) \in \R^{m \times n}$ with
544: \[ z_{ij}= \frac{w_{ij}/w - (D_X(i,i)/w)(D_Y(j,j)/w)}
545: {\sqrt{(D_X(i,i)/w)(D_Y(j,j)/w)}}.\]
546: The matrix $Z$ can be considered as the correlation matrix of two
547: group indicator matrices for the original $W$ \cite{veri:99}.
548: We now show that the SVD of $Z$ is closely related to the
549: SVD of $\hat{W} \equiv D^{-1/2}_XWD_Y^{-1/2}$.
550: In fact, in section \ref{se:svd},
551: we showed that $D_X^{1/2}e$ and $D_Y^{1/2}e$ are the left and right
552: singular vectors of $\hat{W}$ corresponding to the singular value one,
553: and it is also easy to show that all the singular values of
554: $\hat{W}$ are at most $1$. Therefore,
555: the rest of the singular values and singular vectors of
556: $\hat{W}$ can be found by computing the SVD of
557: the following rank-one modification
558: of $\hat{W}$
\[D^{-1/2}_XWD_Y^{-1/2}-
\frac{D_X^{1/2}ee^TD_Y^{1/2}}{\|D_X^{1/2}e\|_2\|D_Y^{1/2}e\|_2}\]
which, since $\|D_X^{1/2}e\|_2\|D_Y^{1/2}e\|_2 = w$, has $(i,j)$ element
\[ \frac{w_{ij}}{\sqrt{D_X(i,i)D_Y(j,j)}} -
\frac{\sqrt{D_X(i,i)D_Y(j,j)}}{w} = z_{ij},\]
i.e., exactly the $(i,j)$ element of $Z$.
565: Therefore, normalized-cut based cluster analysis and correspondence
566: analysis arrive at the same SVD problems even though they start with
completely different principles. It is worthwhile to explore
more deeply the interplay between these two different points of view and
approaches, for example, using the statistical analysis of
correspondence analysis to provide better strategies for selecting cut
points and estimating the number of clusters.
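
As a small numerical check of this relation, take the hypothetical
table of counts (chosen only for illustration)
\[ W = \left[\begin{array}{cc}
2 & 1\\
0 & 3
\end{array}\right], \]
for which $w=6$, $D_X=\diag(3,3)$ and $D_Y=\diag(2,4)$. From the
definition of $Z$,
\[ z_{11} = \frac{2/6 - (3/6)(2/6)}{\sqrt{(3/6)(2/6)}}
= \frac{\sqrt{6}}{6}, \]
while the $(1,1)$ element of the rank-one modification of $\hat{W}$ is
\[ \frac{2}{\sqrt{3\cdot 2}} - \frac{\sqrt{3\cdot 2}}{6}
= \frac{\sqrt{6}}{6}. \]
Computing the remaining entries in the same way gives
\[ Z = \frac{1}{6}\left[\begin{array}{rr}
\sqrt{6} & -\sqrt{3}\\
-\sqrt{6} & \sqrt{3}
\end{array}\right], \]
a rank-one matrix whose only nonzero singular value is $1/\sqrt{2}$,
which is the second largest singular value of $\hat{W}$ for this example.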
572:
573: \section{Partitions with overlaps}\label{se:over}
574: So far in our discussion, we have only looked at {\it hard}
575: clustering, i.e., a data object belongs to one and only
one cluster. In many situations, especially when there is much
overlap among the clusters, it is more advantageous to allow
data objects to belong to more than one cluster. For
579: example, in document clustering, certain groups of words can
580: be shared by two clusters. Is it possible
581: to model this overlap using our bipartite graph model and also
582: find efficient approximate solutions? The answer seems to be yes,
583: but our results at this point are rather preliminary and we will
584: only illustrate the possibilities. Our basic idea is that when computing
585: $\Ncut(A,B)$, we should disregard the contributions of the
586: set of vertices that is in the overlap. More specifically,
587: let $X=A\cup O_X \cup \bar{A}$ and $Y=B\cup O_Y\cup \bar{B}$, where
588: $O_X$ denotes the overlap between
589: the vertex subsets
590: $A\cup O_X$ and $\bar{A}\cup O_X$, and
$O_Y$ the overlap between $B\cup O_Y$ and $\bar{B}\cup O_Y$. We then compute
592: \[ \Ncut(A,B,\bar{A},\bar{B}) =\frac{\cut(A,B)}{W(A,Y)+W(X,B)}\]\[
593: +\frac{\cut(\bar{A},\bar{B})}{W(\bar{A},Y)+W(X,\bar{B})}.\]
594: However, we can make
595: $\Ncut(A,B,\bar{A},\bar{B})$ smaller simply by putting more
596: vertices in the overlap. Therefore, we need to balance these
597: two competing quantities: the size of the overlap and the modified
598: normalized cut $\Ncut(A,B,\bar{A},\bar{B})$ by minimizing
599: \[ \Ncut(A,B,\bar{A},\bar{B}) + \alpha(|O_X| + |O_Y|),\]
600: where $\alpha$ is a regularization parameter. How to find an
601: efficient method for computing the (approximate) optimal
602: solution to the above minimization problem still needs to be
investigated. We close this section by presenting an illustrative
example showing that in some situations, the singular vectors
already automatically separate the overlap sets while giving
the coordinates for carrying out clustering.
607:
608: \begin{figure}[t]
609: \centerline{
610: \mbox{\psfig{file=corr1.ps,height=1.8in,width=1.6in}
611: \psfig{file=corr2.ps,height=1.8in,width=1.6in}}}
612: \caption{Sparsity patterns of a test matrix before clustering
613: (left) and after clustering (right)}
614: \label{fi:op}
615: \end{figure}
616:
617: {\sc Example 1.} We construct a sparse $m$-by-$n$ rectangular matrix
618: \[ W = \left[\begin{array}{cc}
619: W_{11}&W_{12}\\
620: W_{21}&W_{22}
\end{array}\right]\]
622: so that $W_{11}$ and $W_{22}$ are relatively denser than $W_{12}$
623: and $W_{21}$. We also add some dense rows and columns to the matrix $W$
624: to represent row and column overlaps.
625: The left panel of Figure \ref{fi:op} shows the sparsity pattern of
626: $\bar{W}$,
627: a matrix obtained by randomly permuting
628: the rows and columns of $W$. We then compute the
629: second largest left and right singular vectors of
630: $D_X^{-1/2} \bar{W} D_Y^{-1/2}$, say $x$ and $y$, then sort the rows and
631: columns of $\bar{W}$ according to the values of the entries in
632: $D_X^{-1/2}x$ and $D_Y^{-1/2}y$, respectively. The sparsity
633: pattern of this permuted $\bar{W}$ is shown on the right panel of
Figure \ref{fi:op}. As can be seen, the singular vectors not
635: only do the job of clustering but at the same time also
636: concentrate the dense rows and columns at the boundary of the two
637: clusters.
638:
639: \section{Experiments}\label{se:exp}
640: In this section we present our experimental results on clustering
641: a dataset of newsgroup articles submitted to 20
642: newsgroups.\footnote{
643: The newsgroup dataset together with the {\tt bow} toolkit for
644: processing it
645: can be downloaded from
646: {\tt http://www.cs.cmu.edu/afs/cs/project/theo-11/www/}
647:
648: \noindent
649: {\tt naive-bayes.html}.}
650: This dataset contains about
651: 20,000 articles (email messages) evenly divided among the 20
652: newsgroups. We list the names of the newsgroups together
653: with the associated group labels (the labels will be
654: used in the sequel to identify the newsgroups).
655:
656:
657: \begin{verbatim}
NG1:  alt.atheism
NG2:  comp.graphics
NG3:  comp.os.ms-windows.misc
NG4:  comp.sys.ibm.pc.hardware
NG5:  comp.sys.mac.hardware
NG6:  comp.windows.x
NG7:  misc.forsale
NG8:  rec.autos
NG9:  rec.motorcycles
NG10: rec.sport.baseball
NG11: rec.sport.hockey
NG12: sci.crypt
NG13: sci.electronics
NG14: sci.med
NG15: sci.space
NG16: soc.religion.christian
NG17: talk.politics.guns
NG18: talk.politics.mideast
NG19: talk.politics.misc
NG20: talk.religion.misc
678: \end{verbatim}
679:
\begin{table*}
\caption{Comparison of spectral recursive embedding (SRE), PDDP, and K-means
(NG1/NG2)}
\label{tb:12}
684: \begin{center}
685: \begin{tabular}{llrrr}
686: Mixture & SRE & PDDP & K-means \\ \hline\hline
687: 50/50 & $92.12\pm 3.52$\% & $91.90\pm 3.19$\% $(53,10,37)$&$ 76.93\pm 14.42$\% $(82,2,10)$\\ \hline
688: 50/100 & $90.57\pm 3.11$\% & $86.11\pm 3.94$\% $(86, 5,9)$&$ 76.74\pm 14.01$\% $(80,2,18)$\\ \hline
689: 50/150 & $88.04\pm 3.90$\% & $78.60\pm 5.03$\% $(98, 0, 2)$&$68.80\pm 13.55$\% $(88, 0, 12)$\\ \hline
690: 50/200 & $82.77\pm 5.24$\% & $70.43\pm 6.04$\% $(97,0,3)$&$69.22\pm 12.34$\% $(83,1,16)$\\ \hline
691: \end{tabular}
692: \end{center}
693: \end{table*}
694:
\begin{table*}
\caption{Comparison of spectral embedding (SRE), PDDP, and K-means
(NG10/NG11)
}\label{tb:1011}
699: \begin{center}
700: \begin{tabular}{llrrr}
701: Mixture & SRE & PDDP & K-means \\ \hline\hline
702: 50/50 & $74.56\pm 8.93$\% & $73.40\pm 10.07$\% $(56,6,38)$ &$61.61\pm 8.77$\% $(86,0,14)$\\ \hline
703: 50/100 & $67.13\pm 7.17$\% & $67.10\pm 10.20$\% $(52,1,47)$ &$64.40\pm 9.37$\% $(59,1,40)$\\ \hline
704: 50/150 & $58.30\pm 5.99$\% & $58.72\pm 7.48$\% $(52,1,47)$ &$62.53\pm 8.20$\% $(36,1,63)$\\ \hline
705: 50/200 & $57.55\pm 5.69$\% & $56.63\pm 4.84$\% $(58,1,41)$ &$60.82\pm 7.54$\% $(39,2,59)$\\ \hline
706: \end{tabular}
707: \end{center}
708: \end{table*}
709:
We used the {\it bow} toolkit to construct the term-document
matrix for this dataset. Specifically, we used the tokenization option
that strips the UseNet headers, and we also applied stemming
\cite{mcca:96}. Some of the newsgroups overlap substantially, for
example, the five {\tt comp.*} newsgroups about
computers. In fact, several articles are posted to multiple newsgroups.
Before we apply clustering algorithms to the dataset, several
preprocessing steps need to be considered. Two standard steps
are weighting and feature selection. For weighting, we considered
a variant of the tf.idf weighting scheme,
$\tf\log_2(n/\df),$
where $\tf$ is the term frequency, $\df$ is the document
frequency, and $n$ is the number of documents, as well as several other variations
listed in \cite{bele:00}.
For feature selection, we looked at three approaches: 1)
deleting terms that occur fewer than a certain number of
times in the dataset; 2) deleting terms that
occur in fewer than a certain number of
documents in the dataset; 3) selecting terms according to the mutual
information of terms and documents, defined as
\[ I(y) = \sum_{x} p(x,y)\log\frac{p(x,y)}{p(x)p(y)},\]
where $y$ represents a term and $x$ a document \cite{slti:00}.
In general, we found that the traditional tf.idf based
weighting schemes do not improve performance for SRE. One possible
explanation comes from the connection with correspondence analysis:
the raw frequencies are samples of co-occurrence probabilities,
and the pre- and post-multiplication by $D_X^{-1/2}$ and
$D_Y^{-1/2}$ in $D_X^{-1/2}(D-W)D_Y^{-1/2}$ {\it automatically}
takes weighting into account. We did, however, find that
trimming the raw frequencies can sometimes improve
performance for SRE, especially in the anomalous cases where
some words occur in certain documents an unusually large number of times,
skewing the clustering process.
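
As a minimal illustration of the weighting and trimming options
discussed above, the following Python sketch applies the
$\tf\log_2(n/\df)$ weighting and a cap on the raw frequencies to a
tiny term-document matrix. The toy counts and the function names
{\tt tfidf} and {\tt trim} are assumptions made for illustration only;
they are not part of the {\tt bow} toolkit.

\begin{verbatim}
import math

def tfidf(counts):
    """Weight each term row by tf * log2(n / df)."""
    n = len(counts[0])        # number of documents
    out = []
    for row in counts:
        df = sum(1 for c in row if c > 0)
        idf = math.log2(n / df) if df else 0.0
        out.append([c * idf for c in row])
    return out

def trim(counts, cap):
    """Clip every raw frequency at cap."""
    return [[min(c, cap) for c in row]
            for row in counts]

# rows are terms, columns are documents (toy data)
counts = [[3, 0, 0, 1],
          [1, 1, 1, 1],
          [0, 40, 2, 0]]
weighted = tfidf(counts)
trimmed = trim(counts, 10)
\end{verbatim}

In this toy example the second term occurs in every document and
therefore receives weight zero, while trimming reduces the outlying
count of $40$ to the cap of $10$.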
743:
744:
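The mutual information criterion $I(y)$ used for feature selection can
be sketched as follows. This is only an illustration under the
assumption that the joint probabilities $p(x,y)$ are estimated by the
raw counts divided by their total, and that terms with zero count
contribute nothing to the sum; the function name {\tt term\_mi} and
the toy counts are hypothetical.

\begin{verbatim}
import math

def term_mi(counts):
    """I(y) for each term y (row); the columns
    are the documents x."""
    total = sum(sum(row) for row in counts)
    n = len(counts[0])
    px = [sum(row[j] for row in counts) / total
          for j in range(n)]
    scores = []
    for row in counts:
        py = sum(row) / total
        s = 0.0
        for j, c in enumerate(row):
            if c > 0:     # 0 log 0 is taken as 0
                pxy = c / total
                s += pxy * math.log(
                    pxy / (px[j] * py))
        scores.append(s)
    return scores

counts = [[4, 0],   # only in document 0
          [0, 4],   # only in document 1
          [2, 2]]   # spread evenly
scores = term_mi(counts)
ranked = sorted(range(len(scores)),
                key=lambda y: -scores[y])
\end{verbatim}

The evenly spread term carries no information about the documents and
is ranked last, so a selection rule that keeps the top-ranked terms
would discard it first.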
745:
746:
For the purpose of comparison, we consider two other clustering
methods: 1) the K-means method \cite{Gordon}; 2) the principal direction
divisive partitioning (PDDP) method \cite{bole:98}. The K-means method is
a widely used cluster analysis tool. The variant we used employs
the Euclidean distance to measure the dissimilarity between
two documents. When applying K-means,
we {\it normalize} each document so that it has
Euclidean length one. In essence, we use the cosine of the angle
between two document vectors when
measuring their similarity. We have also tried K-means without
document length normalization; the results are far worse, and therefore
we do not report them. Since K-means is
an iterative method, we need to specify a stopping criterion. For
the variant we used, we compare the centroids between two
consecutive iterations, and stop when the difference is smaller
than a pre-defined tolerance.
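
The K-means variant described above can be sketched in a few lines of
Python. This is a minimal illustration, not the implementation used in
the experiments: the initial centroids, the tolerance {\tt tol}, the
iteration limit and the toy documents are all assumptions. For unit
vectors $a$ and $b$ we have $\|a-b\|^2 = 2 - 2\cos(a,b)$, which is why
the Euclidean distance after normalization behaves like the cosine
measure.

\begin{verbatim}
import math

def unit(v):
    """Scale v to Euclidean length one."""
    s = math.sqrt(sum(a * a for a in v))
    return [a / s for a in v]

def dist2(u, v):
    return sum((a - b) ** 2
               for a, b in zip(u, v))

def kmeans(docs, cents, tol=1e-8, max_iter=100):
    docs = [unit(d) for d in docs]
    labels = []
    for _ in range(max_iter):
        # assign each document to nearest centroid
        labels = [min(range(len(cents)),
                      key=lambda k: dist2(d, cents[k]))
                  for d in docs]
        new = []
        for k, old in enumerate(cents):
            mem = [d for d, l in zip(docs, labels)
                   if l == k]
            if mem:
                new.append([sum(c) / len(mem)
                            for c in zip(*mem)])
            else:          # empty cluster: keep old
                new.append(old)
        shift = max(dist2(a, b)
                    for a, b in zip(new, cents))
        cents = new
        if shift < tol:    # centroids stopped moving
            break
    return labels, cents

docs = [[5, 1, 0], [8, 2, 0],
        [0, 1, 6], [0, 3, 9]]
init = [unit(docs[0]), unit(docs[2])]
labels, cents = kmeans(docs, init)
\end{verbatim}

With a different choice of {\tt init} the iteration may stop at a
different local minimum.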
763:
764:
PDDP is another clustering method that utilizes singular
vectors. It is based on the idea of principal component
analysis and has been shown to
outperform several standard clustering methods
such as the hierarchical agglomerative algorithm \cite{bole:98}.
First, each document is considered as a
multivariate data point. The documents are normalized
to have unit Euclidean length and then centered, i.e., let
$W$ be the term-document matrix, and $w$ be the average of
the columns of $W$. Compute the largest singular value triplet
$\{u,\sigma,v\}$ of
$W-we^T$. Then split the set of documents based on their
values in the vector $v=(v_i)$: one simple scheme is to
let those with
positive $v_i$ go into one cluster and those
with nonpositive $v_i$ into another cluster. Then the
whole process is repeated on the term-document matrices of
the two clusters, respectively. Although both our clustering
method SRE and PDDP
make use of the singular vectors of some versions of the
term-document matrices, they are derived from fundamentally
different principles. PDDP is a feature-based clustering method,
projecting all the data points onto the one-dimensional subspace
spanned by the first principal axis; SRE is a similarity-based
clustering method in which two co-occurring variables (terms and
documents in the context of document clustering) are
simultaneously clustered. Unlike SRE, PDDP does not
have a well-defined objective function
for minimization. It only partitions the columns of
the term-document matrices, while SRE partitions both the
rows and the columns. This has a significant impact on the
computational costs.
PDDP, however, has the advantage that it can be applied to
datasets with both positive and negative values, while SRE can only be
applied to datasets with nonnegative values.
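
A single PDDP split as described above can be sketched as follows.
The sketch is illustrative only: it computes the leading right
singular vector of $W-we^T$ by a plain power iteration, which is an
assumption made to keep the example self-contained (any SVD routine
could be used instead), and the function name {\tt pddp\_split}, the
starting vector and the toy matrix are hypothetical.

\begin{verbatim}
import math

def pddp_split(W, iters=200):
    """One PDDP split of the columns of W."""
    m, n = len(W), len(W[0])
    # unit Euclidean length for every document
    nrm = [math.sqrt(sum(W[i][j] ** 2
                         for i in range(m)))
           for j in range(n)]
    A = [[W[i][j] / nrm[j] for j in range(n)]
         for i in range(m)]
    # center: subtract the mean column w
    B = []
    for row in A:
        mean = sum(row) / n
        B.append([a - mean for a in row])
    # power iteration on B^T B; the start vector
    # must not be a multiple of e, since B e = 0
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iters):
        Bv = [sum(r[j] * v[j] for j in range(n))
              for r in B]
        v = [sum(B[i][j] * Bv[i]
                 for i in range(m))
             for j in range(n)]
        s = math.sqrt(sum(a * a for a in v))
        v = [a / s for a in v]
    pos = [j for j in range(n) if v[j] > 0]
    rest = [j for j in range(n) if v[j] <= 0]
    return pos, rest

# rows are terms, columns are documents (toy data)
W = [[5, 8, 0, 0],
     [1, 2, 1, 3],
     [0, 0, 6, 9]]
pos, rest = pddp_split(W)
\end{verbatim}

Applying the same function recursively to the columns in {\tt pos} and
in {\tt rest} would give the divisive hierarchy.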
800:
\begin{table*}
\caption{Comparison of spectral embedding (SRE), PDDP, and K-means
(NG18/NG19)
}\label{tb:1819}
805: \begin{center}
806: \begin{tabular}{llrrr}
807: Mixture & SRE & PDDP & K-means \\ \hline\hline
808: 50/50 & $73.66\pm 10.53$\% & $69.52\pm 12.83$\% $(65,12,32)$ & $62.25 \pm 9.94$\% $(82,1,17)$\\ \hline
809: 50/100 & $67.23\pm 7.84$\% & $67.84\pm 7.30$\% $(46,5,49)$& $60.91\pm 7.92$\% $(65,13,32)$\\ \hline
810: 50/150 & $65.83\pm 12.79$\% & $60.37\pm 9.85$\% $(53,3,44)$ &$63.32\pm 8.26$\% $(58,3,39)$\\ \hline
811: 50/200 & $61.23\pm 9.88$\% & $60.76\pm 5.55$\% $(40,1,59)$ &$64.50\pm 7.58$\% $(34,0,66)$\\ \hline
812: \end{tabular}
813: \end{center}
814: \end{table*}
815:
816:
\begin{table*}
\caption{Confusion matrix for newsgroups $\{2, 9, 10, 15, 18\}$
}\label{tb:con}
820: \begin{center}
821: \begin{tabular}{|l||c|c|c|c|c|}
822: \hline\hline
823: &mideast &graphics & space & baseball & motorcycles \\ \hline\hline
824: cluster 1& 87& 0& 0& 2& 0\\ \hline
825: cluster 2& 7& 90& 7& 6& 7\\ \hline
826: cluster 3& 3& 9& 84& 1& 1\\ \hline
827: cluster 4& 0& 0& 1& 88& 0\\ \hline
828: cluster 5& 3& 1& 8& 3& 92\\ \hline\hline
829: \end{tabular}
830: \end{center}
831: \end{table*}
832:
833:
834:
{\sc Example 2.} In this example, we examine binary clustering
with uneven clusters. We consider three pairs of newsgroups:
newsgroups 1 and 2 are well-separated, 10 and 11 are
less well-separated, and 18 and 19 have a lot of overlap.
We used document frequency as the feature
selection criterion and deleted
words that occur in fewer than $5$ documents in each dataset we
used. For both K-means and PDDP we apply tf.idf weighting together
with document length normalization so that each document vector
has Euclidean norm one. For SRE we trim the raw frequencies
so that the maximum is $10$.
For each newsgroup pair, we select four
types of mixtures of articles from the two newsgroups: $x/y$ indicates
that $x$ articles are from the first group and $y$ articles are
from the second group. The results are listed in Table
1 for groups 1 and 2, Table 2 for groups 10 and
11, and Table 3 for groups 18 and 19. We list
the means and standard deviations over 100 random samples.
For PDDP and K-means we also include a triplet of numbers
indicating in how many of the 100 samples SRE performs better than (the first
number), the same as (the second number), and worse than (the third number)
the corresponding method (PDDP or K-means).
We should emphasize that
the K-means method can only find a local minimum, and the results
depend on the initial values and the stopping criterion. This is also
reflected by the large standard deviations associated with
the K-means method.
From the three
tests we can conclude that both SRE and PDDP outperform the K-means
method in most settings, although K-means has the highest mean accuracy
for some of the skewed mixtures in Tables 2 and 3.
The performance of SRE and PDDP is similar in balanced
mixtures, but SRE tends to be superior to PDDP in skewed mixtures.
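
The table entries described above (mean, standard deviation and the
better/same/worse triplet) can be computed from per-sample accuracies
as in the following sketch. The accuracies are made up for
illustration, the function name {\tt summarize} is hypothetical, and
the use of the sample standard deviation is an assumption.

\begin{verbatim}
import statistics

def summarize(acc_a, acc_b):
    """Mean and std of acc_b, and how often
    method A is better than, equal to, or
    worse than method B on the same samples."""
    better = sum(a > b
                 for a, b in zip(acc_a, acc_b))
    same = sum(a == b
               for a, b in zip(acc_a, acc_b))
    worse = len(acc_a) - better - same
    return (statistics.mean(acc_b),
            statistics.stdev(acc_b),
            (better, same, worse))

# made-up accuracies (percent) on five samples
sre = [92.0, 90.0, 88.0, 95.0, 91.0]
other = [90.0, 90.0, 89.0, 80.0, 85.0]
mean, std, triplet = summarize(sre, other)
\end{verbatim}

By construction the three counts in the triplet add up to the number
of samples.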
866:
867:
868:
{\sc Example 3.} In this example, we consider an easy multi-cluster case:
we examine the five newsgroups $2, 9, 10, 15, 18$, which
were also considered in \cite{slti:00}. We sample 100
articles from each newsgroup, and we use mutual information for
feature selection.
We use the minimum normalized cut as the cut point at each level
of the recursion.
For one sample, Table 4 gives the confusion matrix.
The accuracy for this sample is $88.2$\%. We also tested two
other samples, with accuracies $85.4$\%
and $81.2$\%. These
compare
favorably
with the accuracies of $59$\%, $58$\% and $53$\%
reported for three samples in \cite{slti:00}.
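
The accuracy figure for the confusion matrix in Table 4 can be
reproduced with the short sketch below. It assumes that cluster $i$ is
matched to the newsgroup in column $i$ of the table, so that the
correctly clustered articles are those on the diagonal.

\begin{verbatim}
# rows: clusters 1-5; columns: mideast, graphics,
# space, baseball, motorcycles (from Table 4)
confusion = [[87, 0, 0, 2, 0],
             [7, 90, 7, 6, 7],
             [3, 9, 84, 1, 1],
             [0, 0, 1, 88, 0],
             [3, 1, 8, 3, 92]]
# assumption: cluster i corresponds to column i
correct = sum(confusion[i][i] for i in range(5))
total = sum(map(sum, confusion))
accuracy = 100.0 * correct / total
print(f"{accuracy:.1f}%")   # prints 88.2%
\end{verbatim}
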
Below we also list the top few words for
each cluster, computed by mutual information.
886:
887: \begin{verbatim}
888: Cluster 1:
889: armenian israel arab palestinian peopl jew isra
890: iran muslim kill turkis war greek iraqi adl call
891:
892: Cluster 2:
893: imag file bit green gif mail graphic colour
894: group version comput jpeg blue xv ftp ac uk list
895:
896: Cluster 3:
897: univers space nasa theori system mission henri
898: moon cost sky launch orbit shuttl physic work
899:
900: Cluster 4:
901: clutch year game gant player team hirschbeck
902: basebal won hi lost ball defens base run win
903:
904: Cluster 5:
905: bike dog lock ride don wave drive black
906: articl write apr motorcycl ca turn dod insur
907: \end{verbatim}
908:
\section{Conclusions and future work}\label{se:con}
910: In this paper, we formulate a class of clustering problems as
911: bipartite graph partitioning problems, and we show that
912: efficient optimal solutions can be found by computing the
913: partial singular value decomposition of some scaled edge weight
914: matrices. However, we have also shown that there still remain
915: many challenging problems. One area that needs further investigation
916: is the selection of cut points and number of clusters using
917: multiple left and right singular vectors, and
918: the possibility of adding local refinements to improve
919: clustering quality.\footnote{It will be
920: difficult to use local refinement for PDDP
921: because it does not have a global objective function
922: for minimization.} Another area is to find
923: efficient algorithms for handling overlapping clusters. Finally,
the treatment of missing data under our bipartite graph model,
especially when we apply our spectral clustering methods to
the analysis of data from recommender systems, also deserves
further investigation.
928:
929:
930: \section{Acknowledgments}
931: The work of Hongyuan Zha and Xiaofeng He was supported in
932: part by NSF grant CCR-9901986. The work of Xiaofeng He,
Chris Ding and Horst Simon was supported in
part by the Department of Energy through an LBL LDRD fund.
935:
936: \bibliographystyle{plain}
937: \bibliography{ref}
938:
939:
940: \appendix
941: \section{Some proofs}
In this appendix we prove three
results. 1) All the
eigenvalues of $D^{-1/2}WD^{-1/2}$ have absolute value at
most $1$. Equivalently, we need to prove that the eigenvalues
of the generalized eigenvalue problem $Wx = \lambda Dx$
have absolute value at
most $1$. Indeed, let $x=(x_i)_{i=1}^n \neq 0$ be an eigenvector and let $i$ be such that
$|x_i| = \max_j |x_j|$. Then it follows from
\[ \lambda d_i x_i = \sum_{j=1}^n w_{ij} x_j\]
and $w_{ij} \geq 0$ that
\[ |\lambda| \leq \sum_{j=1}^n \frac{w_{ij}}{d_i}\,\frac{|x_j|}{|x_i|}
\leq \sum_{j=1}^n w_{ij}/d_i = 1.\]
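
As a small sanity check of this bound, consider the following worked
example; the $2\times 2$ matrix is chosen purely for illustration and
does not come from the experiments. Let
\[ W = \left[\begin{array}{cc} 1 & 2\\ 2 & 3 \end{array}\right],
\quad D = \diag(3,5), \]
so that
\[ D^{-1/2}WD^{-1/2} =
\left[\begin{array}{cc} 1/3 & 2/\sqrt{15}\\ 2/\sqrt{15} & 3/5
\end{array}\right]. \]
This matrix has trace $14/15$ and determinant $1/5-4/15=-1/15$, hence
its characteristic polynomial is
\[ \lambda^2 - \frac{14}{15}\lambda - \frac{1}{15}
= (\lambda-1)\left(\lambda+\frac{1}{15}\right), \]
and its eigenvalues $1$ and $-1/15$ indeed have absolute value at most
$1$.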
953:
954: 2) We prove that
955: \[ \sigma_{\max}(D_X^{-1/2}WD_Y^{-1/2})
956: =\max_{x \neq 0, y \neq 0} \frac{2 x^TWy}{x^TD_Xx + y^TD_Yy}.\]
957: Let $\hat{x}= D^{1/2}_Xx$ and $\hat{y}= D^{1/2}_Yy$, then
958: \begin{equation}\label{eq:ff}
959: \frac{2 x^TWy}{x^TD_Xx + y^TD_Yy} =
960: \frac{2 \hat{x}^TD_X^{-1/2}WD_Y^{-1/2}\hat{y}}
961: {\hat{x}^T\hat{x} + \hat{y}^T\hat{y}}.\end{equation}
962: Let $D_X^{-1/2}WD_Y^{-1/2}=U\Sigma V^T$ be its SVD with
963: \[ U = [u_1, \dots, u_m], \quad V=[v_1,\dots, v_n] \]
964: and
965: \[ \Sigma = \diag(\sigma_1, \dots, \sigma_{\min\{m,n\}}), \quad
966: \sigma_1 = \sigma_{\max}(D_X^{-1/2}WD_Y^{-1/2}).\] Then
967: we can expand $\hat{x}$ and $\hat{y}$ as
968: \begin{equation}\label{eq:hh}
969: \hat{x} = \sum_{i} \hat{x}_i u_i, \quad \hat{y} = \sum_{i} \hat{y}_i v_i,
970: \end{equation}
971: and (\ref{eq:ff}) becomes
972: \[ \frac{2\sum_{i} \sigma_i \hat{x}_i\hat{y}_i}{\sum_i \hat{x}_i^2 +
973: \sum_i \hat{y}_i^2}
974: \leq \frac{2\sigma_1 \sqrt{\sum_i \hat{x}_i^2}\sqrt{\sum_i \hat{y}_i^2}}
975: {\sum_i \hat{x}_i^2 + \sum_i \hat{y}_i^2} \leq \sigma_1.\]
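Here the first inequality uses $\sigma_i \leq \sigma_1$ together with the
Cauchy-Schwarz inequality, and the second uses $2ab \leq a^2+b^2$.
Taking $\hat{x}_1=\hat{y}_1=1$ and all other coefficients equal to zero
achieves the maximum.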
977:
978: 3) Now we consider
979: the constraint
980: \[ x^TD_Xe + y^TD_Ye = 0\]
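which is equivalent to
$\hat{x}_1+\hat{y}_1=0$ using the expansions in (\ref{eq:hh}),
since $u_1$ and $v_1$ are proportional to $D_X^{1/2}e$ and
$D_Y^{1/2}e$, respectively, and these two vectors have the same norm.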
983: We can always scale the vectors $\hat{x}$ and $\hat{y}$
984: without changing the maximum so that
985: $\hat{x}_1 \geq 0$ and $\hat{y}_1 \geq 0$.
986: Hence $\hat{x}_1+\hat{y}_1=0$ implies that
987: $\hat{x}_1=\hat{y}_1=0$. It is then easy to see that
988: \[\sigma_2 = \max\left\{ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy}\;\; | \;\;
989: x^TD_Xe + y^TD_Ye = 0 \right\},\]
and the maximum is achieved by the left and
right singular vectors of $D_X^{-1/2} W D_Y^{-1/2}$ corresponding to
its second largest singular value.
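
The result in 3) can be checked numerically on a small example. The
following Python sketch removes the known top singular triplet
($\sigma_1=1$, with $u_1$ and $v_1$ proportional to $D_X^{1/2}e$ and
$D_Y^{1/2}e$) and then finds the second triplet by power iteration.
The power iteration, the function name {\tt second\_singular}, the toy
weight matrix and the rule of splitting at zero are all assumptions
made for illustration; splitting at zero is only one simple choice of
cut point.

\begin{verbatim}
import math

def second_singular(W, iters=500):
    """Second singular triplet of
    Dx^{-1/2} W Dy^{-1/2}."""
    m, n = len(W), len(W[0])
    dx = [sum(row) for row in W]
    dy = [sum(W[i][j] for i in range(m))
          for j in range(n)]
    tot = sum(dx)
    # scaled matrix minus u_1 v_1^T
    M = [[(W[i][j] / math.sqrt(dx[i] * dy[j])
           - math.sqrt(dx[i] * dy[j]) / tot)
          for j in range(n)] for i in range(m)]
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iters):
        u = [sum(M[i][j] * v[j]
                 for j in range(n))
             for i in range(m)]
        v = [sum(M[i][j] * u[i]
                 for i in range(m))
             for j in range(n)]
        s = math.sqrt(sum(a * a for a in v))
        v = [a / s for a in v]
    u = [sum(M[i][j] * v[j] for j in range(n))
         for i in range(m)]
    sigma = math.sqrt(sum(a * a for a in u))
    u = [a / sigma for a in u]
    return sigma, u, v, dx, dy

# two blocks with weak coupling (toy data)
W = [[5, 4, 0, 0],
     [4, 5, 1, 0],
     [0, 1, 5, 4],
     [0, 0, 4, 5]]
sigma, u, v, dx, dy = second_singular(W)
row_side = [a > 0 for a in u]
col_side = [b > 0 for b in v]
# x^T D_X e + y^T D_Y e for x = D_X^{-1/2} u
# and y = D_Y^{-1/2} v
resid = (sum(math.sqrt(d) * a
             for d, a in zip(dx, u))
         + sum(math.sqrt(d) * b
               for d, b in zip(dy, v)))
\end{verbatim}

For this toy matrix the signs of $u$ and $v$ separate the first two
rows and columns from the last two, and the constraint residual
{\tt resid} vanishes up to rounding error.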
992:
993:
994: \end{document}
995:
996:
997:
998:
999: