cs0108018/final.tex
1: \documentclass{sig-alternate}
2: 
3: \newcommand{\cut}{{\rm cut}}
4: \newcommand{\Ncut}{{\rm Ncut}}
5: \newcommand{\s}{{\rm s}}
6: \newcommand{\diag}{{\rm diag}}
7: \newcommand{\op}{{\rm op}}
8: \newcommand{\R}{{\cal R}}
9: \newcommand{\tf}{{\rm tf}}
10: \newcommand{\df}{{\rm df}}
11: \newcommand{\sre}{{\rm sre}}
12: \newcommand{\svd}{{\rm svd}}
13: \newcommand{\nnz}{{\rm nnz}}
14: \newcommand{\trace}{{\rm trace}}
15: 
16: \begin{document}
17: %
18: % --- Author Metadata here ---
19: \conferenceinfo{CIKM}{'01 November 5-10, 2001, Atlanta, Georgia. USA}
20: \CopyrightYear{2001} % Allows default copyright year (2000) to be over-ridden - IF NEED BE.
21: %\crdata{0-12345-67-8/90/01}  % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.
22: % --- End of Author Metadata ---
23: 
24: \title{Bipartite Graph Partitioning and Data 
25: Clustering\titlenote{Part of this work was done while Xiaofeng He
26: was a graduate research assistant at NERSC, Berkeley National Lab.}
27: }
28: %\subtitle{[Extended Abstract]
29: %\titlenote{A full version of this paper is available as
30: %\textit{Author's Guide to Preparing ACM SIG Proceedings Using
31: %\LaTeX$2_\epsilon$\ and BibTeX} at
32: %\texttt{www.acm.org/eaddress.htm}}}
33: %
34: % You need the command \numberofauthors to handle the ``boxing''
35: % and alignment of the authors under the title, and to add
36: % a section for authors number 4 through n.
37: %
38: % Up to the first three authors are aligned under the title;
39: % use the \alignauthor commands below to handle those names
40: % and affiliations. Add names, affiliations, addresses for
41: % additional authors as the argument to \additionalauthors;
42: % these will be set for you without further effort on your
43: % part as the last section in the body of your article BEFORE
44: % References or any Appendices.
45: 
46: \numberofauthors{3}
47: %
48: % You can go ahead and credit authors number 4+ here;
49: % their names will appear in a section called
50: % ``Additional Authors'' just before the Appendices
51: % (if there are any) or Bibliography (if there
52: % aren't)
53: 
54: % Put no more than the first THREE authors in the \author command
55: \author{
56: %
57: % The command \alignauthor (no curly braces needed) should
58: % precede each author name, affiliation/snail-mail address and
59: % e-mail address. Additionally, tag each line of
60: % affiliation/address with \affaddr, and tag the
61: %% e-mail address with \email.
62: \alignauthor Hongyuan Zha \\[2pt] Xiaofeng He\\
63:        \affaddr{Dept. of Comp. Sci. \& Eng.}\\
64:        \affaddr{Penn State Univ.}\\
65:        \affaddr{State College, PA 16802}\\
66:        \email{\{zha,xhe\}@cse.psu.edu}
67: \alignauthor Chris Ding \\[2pt] Horst Simon\\
68:        \affaddr{NERSC Division}\\
69:        \affaddr{Berkeley National Lab.}\\
70:        \affaddr{Berkeley, CA 94720}\\
71:        \email{\{chqding,hdsimon\}@lbl.gov}
72: \alignauthor Ming Gu\\
73:        \affaddr{Dept. of Math.}\\
74:        \affaddr{U.C. Berkeley}\\
75:        \affaddr{Berkeley, CA 94720}\\
76:        \email{mgu@math.berkeley.edu}
77: }
78: %\additionalauthors{Additional authors: John Smith (The Th{\o}rv\"{a}ld Group,
79: %email: {\texttt{jsmith@affiliation.org}}) and Julius P.~Kumquat
80: %(The Kumquat Consortium, email: {\texttt{jpkumquat@consortium.net}}).}
81: %\date{30 July 1999}
82: \maketitle
83: \begin{abstract}
84: Many data types arising from data mining applications
85: can be modeled as bipartite graphs; examples include
86: terms and documents in a text corpus, customers and 
87: purchasing items in market basket analysis and reviewers
88: and movies in a movie recommender system. In this paper,
89: we propose a new data clustering method based on
90: partitioning the underlying bipartite graph. The partition
91: is constructed by minimizing a {\it normalized}
92: sum of edge weights between {\it unmatched} pairs of vertices
93: of the bipartite graph.
94: We show that an approximate solution to the minimization
95: problem can be obtained by computing
96: a partial singular value decomposition (SVD) 
97: of the associated edge weight
98: matrix of the bipartite graph. We  point out the connection
99: of our clustering algorithm to correspondence analysis used in
100: multivariate analysis. We also briefly discuss the issue
101: of assigning data objects to multiple clusters.
102: In the experimental results, we apply our clustering
103: algorithm to the problem of document clustering to illustrate its
104: effectiveness and efficiency.
105: \end{abstract}
106: 
107: % A category with only the three required fields
108: \category{H.3.3}{Information Search and Retrieval}{Clustering}
109: \category{G.1.3}{Numerical Linear Algebra}{Singular value decomposition}
110: %A category including the fourth, optional field follows...
111: \category{G.2.2}{Graph Theory}{Graph algorithms}
112: 
113: \terms{Algorithms, theory}
114: 
115: \keywords{document clustering, bipartite graph, graph partitioning,
116: spectral relaxation, singular value decomposition,
117: correspondence analysis}
118: 
119: \section{Introduction}\label{sec:int}
120: Cluster analysis is an important tool for exploratory data mining
121: applications arising from many diverse disciplines. Informally,
122: cluster analysis seeks to partition a given data set into compact
123: clusters so that data objects within a cluster are more similar
124: than those in distinct clusters. The literature on cluster analysis
125: is enormous, including contributions from many research communities
126: (see \cite{Everitt,Gordon} for 
127: recent surveys of some classical approaches). Many traditional
128: clustering algorithms are based
129: on the assumption that the given dataset
130: consists of covariate information (or attributes) for each individual
131: data object, and cluster analysis can be cast as a problem of
132: grouping a set of $n$-dimensional vectors each representing
133: a data object in the dataset. A familiar example
134: is document clustering using the vector space 
135: model \cite{bele:00}. Here each document
136: is represented by an $n$-dimensional vector, and each
137: coordinate of the vector corresponds to a term in a vocabulary of size $n$.
138:  This formulation
139: leads to  the so-called term-document matrix $A=(a_{ij})$
140: for the representation of the collection of documents,
141: where $a_{ij}$ is the
142: so-called term frequency, i.e.,
143:  the number of times term $i$ occurs in document $j$.
144: In this vector
145: space model terms and documents are treated asymmetrically with
146: terms considered as the covariates or attributes of documents. It is
147: also possible to treat both terms and documents as first-class citizens
148: in a symmetric fashion, and consider $a_{ij}$ as the frequency of
149: co-occurrence of term $i$ and document $j$ as is done,
150: for example,  in probabilistic
151: latent semantic indexing \cite{hoff:99}.\footnote{Our clustering
152: algorithm computes an approximate global optimal solution while probabilistic
153: latent semantic indexing relies on the EM algorithm and therefore might be
154: prone to local minima even with the help of some annealing process.}
155: In this paper, we
156: follow this basic principle and propose a new approach 
157: to model terms and documents as vertices in a bipartite graph
158: with edges of the graph indicating the co-occurrence of terms and documents.
159: In addition
160: we can optionally
161:  use edge weights to indicate the frequency of this co-occurrence.
162: Cluster analysis for document collections
163: in this context is based on a very intuitive notion: documents
164: are grouped by topics. On the one hand,
165: documents in a topic tend to use the same subset of terms more heavily,
166: and these terms form a term cluster;
167: on the other hand, a topic is usually characterized
168: by a subset of terms and those documents heavily using those terms tend to
169: be about that particular topic. It is this interplay of terms and
170: documents which gives rise to what we call bi-clustering by which
171: terms and documents are simultaneously grouped into
172: {\it semantically coherent} clusters.
173: 
174: Within our bipartite graph model, the clustering problem can be
175: solved by constructing vertex partitions of the graph. Many criteria have been
176: proposed for measuring the quality of graph partitions 
177: of undirected graphs \cite{Chung,Shi}. In this paper, we show 
178: how to adapt those criteria for bipartite graph partitioning and
179: therefore solve the bi-clustering problem. A great variety of
180: objective functions have been proposed for cluster analysis without
181: efficient algorithms for finding the (approximate) optimal solutions.
182: We will show that our bipartite graph formulation naturally leads to
183: partial SVD problems for the underlying edge weight matrix
184: which admit efficient
185: {\it global} optimal solutions.
186: The rest of the paper
187: is organized as follows: in section \ref{se:bi}, we propose a 
188: new criterion for 
189: bipartite graph partitioning which tends to produce balanced
190: clusters. In section \ref{se:svd}, we show that our criterion
191: leads to an optimization problem that can be approximately
192: solved by computing a partial SVD
193: of the weight matrix of the bipartite graph. 
194: In section \ref{se:corr}, we connect our approximate
195: solution to correspondence analysis used in multivariate data
196: analysis. In section \ref{se:over}, we briefly
197: discuss how to deal with clusters with overlaps.
198: In section \ref{se:exp}, we 
199: describe experimental results on 
200: bi-clustering a dataset of newsgroup articles. We conclude the paper
201: in section \ref{se:con} and give pointers to future research.
202: 
203: \section{Bipartite graph partitioning}\label{se:bi}
204: We denote a graph by $G(V,E)$, where $V$ is the vertex set
205: and $E$ is the edge set of the
206: graph. A graph $G(V,E)$
207: is {\it bipartite} with two vertex
208: classes $X$ and $Y$ if $V = X\cup Y$ with
209: $X\cap Y = \emptyset$ and each edge in $E$ has
210: one endpoint in $X$ and one endpoint in $Y$. 
211: We consider a weighted bipartite graph $G(X,Y,W)$ with
212: $W = (w_{ij})$ where $w_{ij} > 0$ denotes the weight of the
213: edge between vertices $i$ and $j$. We let $w_{ij}=0$ if there
214: is no edge between vertices $i$ and $j$. 
215: In the context
216: of document clustering, $X$ represents 
217: the set of  terms and $Y$ represents the set of documents, and $w_{ij}$
218: can be used to denote the number of times term $i$ occurs in
219: document $j$.
220: A vertex partition of $G(X,Y,W)$
221: denoted by $\Pi(A,B)$
222: is defined by  a partition of the vertex sets
223: $X$ and $Y$, respectively: $X=A\cup A^c$, and $Y=B\cup B^c$, where
224: for a set $S$, $S^c$ denotes its complement. By convention,
225: we pair $A$ with $B$, and $A^c$ with $B^c$. We say that
226: a pair of vertices $ x \in X$ and $y \in Y$ is {\it matched} with
227: respect to a partition $\Pi(A,B)$ if there is an edge
228: between $x$ and $y$, and either $x \in A$ and $y \in B$ or
229: $x \in A^c$ and $y \in B^c$. For any two subsets of vertices
230: $ S \subset X$ and $T \subset Y$, define
231: \[ W(S,T) = \sum_{i \in S, j \in T} w_{ij},\]
232: i.e., $W(S,T)$ is the sum of the weights of edges with one
233: endpoint in $S$ and one endpoint in $T$. The quantity
234: $W(S,T)$ can be
235: considered as measuring the association
236: between the  vertex sets $S$ and 
237: $T$. In the context of cluster analysis, the
238: edge weight measures the similarity between data objects.
239: To partition data objects into
240: clusters, we seek a partition of $G(X,Y,W)$ such that the
241: association (similarity) 
242: between unmatched vertices is as small as possible.
243: One possibility is to consider for a partition $\Pi(A,B)$
244: the following quantity
245: \begin{equation}\label{eq:cut}
246: \begin{array}{ll} \cut(A,B) & \equiv W(A,B^c) + W(A^c, B)\\[3pt]
247:    & = \sum_{i \in A, j \in B^c} w_{ij} + 
248:      \sum_{i \in A^c, j \in B} w_{ij}.
249: \end{array}
250: \end{equation}
251: Intuitively, choosing $\Pi(A,B)$
252: to minimize $\cut(A,B)$ will give rise to a partition that
253: minimizes the sum of all the edge weights between unmatched
254: vertices. In the context of document clustering, we try to find
255: two document clusters $B$ and $B^c$ which have few terms in
256: common, and the documents in $B$ mostly use terms in $A$ and 
257: those in $B^c$ use terms in $A^c$.
258: Unfortunately, choosing a partition based entirely on
259: $\cut(A,B)$ tends to produce unbalanced clusters, i.e.,
260: the sizes of $A$ and/or $B$ or their complements tend to be
261: small.
262: Inspired by the work in \cite{Chung,Driessche,Shi}, we propose
263: the following normalized variant of the edge cut in (\ref{eq:cut})
264: \[ \Ncut(A,B) \equiv \frac{\cut(A,B)}{W(A,Y) + W(X,B)}\]
265: \[          + \frac{\cut(A^c,B^c)}{W(A^c,Y) + W(X,B^c)}.\]
266: The intuition behind this criterion is that we want not only
267: a partition with small edge cut, but also the two
268: subgraphs formed between the matched vertices to be as dense
269: as possible. This latter requirement is partially
270: satisfied by introducing
271: the normalizing denominators in the above equation.\footnote{A
272: more natural criterion seems to be
273: \[\frac{\cut(A,B)}{W(A,B)}
274:           + \frac{\cut(A^c,B^c)}{W(A^c,B^c)}.\]
275: However, it can be shown that it leads to an SVD 
276: problem with the same set of left and right singular vectors.} 
277: Our bi-clustering problem is now equivalent to
278: the following optimization problem
279: \[ \min_{\Pi(A,B)} \Ncut(A,B),\]
280: i.e., finding partitions of the vertex sets $X$ and $Y$ to
281: minimize the normalized cut of the bipartite graph $G(X,Y,W)$.
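
As a small illustration of the difference between the two criteria,
consider the following example. The matrix is hypothetical and is
chosen only so that the arithmetic can be followed by hand.
Let $X=\{1,2,3\}$, $Y=\{1,2,3\}$ and
\[ W = \left[\begin{array}{ccc}
       2 & 1 & 0\\
       1 & 2 & 0\\
       0 & 1 & 3
       \end{array}\right].\]
For $A=\{1,2\}$ and $B=\{1,2\}$ we have $W(A,B^c)=0$ and $W(A^c,B)=1$,
so $\cut(A,B)=\cut(A^c,B^c)=1$, while
$W(A,Y)=6$, $W(X,B)=7$, $W(A^c,Y)=4$ and $W(X,B^c)=3$. Hence
\[ \Ncut(A,B) = \frac{1}{6+7} + \frac{1}{4+3} = \frac{20}{91} \approx 0.22.\]
For the less balanced choice $A=\{1\}$ and $B=\{1\}$ we obtain
$\cut(A,B)=2$, $W(A,Y)=W(X,B)=3$ and $W(A^c,Y)=W(X,B^c)=7$, so that
\[ \Ncut(A,B) = \frac{2}{3+3} + \frac{2}{7+7} = \frac{10}{21} \approx 0.48.\]
In this example the normalized criterion prefers the first partition,
which has both the smaller edge cut and the denser matched subgraphs.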
282: 
283: \section{Approximate solutions using singular vectors}\label{se:svd}
284: Given a bipartite graph $G(X,Y,W)$
285: and an associated partition $\Pi(A,B)$, let us reorder
286: the vertices of $X$ and $Y$ so that vertices in $A$ and $B$ are
287: ordered before vertices in $A^c$ and $B^c$, respectively. The
288: weight matrix $W$ can be written in a block format
289: \begin{equation}\label{eq:w} W = \left[\begin{array}{cc}
290:        W_{11}&W_{12}\\
291:        W_{21}&W_{22}
292:        \end{array}\right],\end{equation}
293: i.e., the rows of $W_{11}$ correspond to 
294: the vertices in the vertex set $A$ and
295: the columns of $W_{11}$ correspond to 
296: those in $B$. Therefore
297: $G(A,B,W_{11})$  denotes the weighted bipartite graph
298: corresponding to the vertex sets $A$ and $B$. 
299: For any $m$-by-$n$ matrix 
300: $H = (h_{ij})$, define
301: \[ \s(H) = \sum_{i=1}^m \sum_{j=1}^n h_{ij},\]
302: i.e., $\s(H)$ is the sum of all the elements of $H$.
303: It is easy to see from the definition of $\Ncut$ that
304: \[ \Ncut(A,B) = \frac{\s(W_{12}) + \s(W_{21})}{2\s(W_{11}) +\s(W_{12}) 
305: +\s(W_{21})}\]\[ +
306: \frac{\s(W_{12}) + \s(W_{21})}{2\s(W_{22}) +\s(W_{12}) 
307: +\s(W_{21})}.\]
308: In order to make connections to
309: SVD problems, we
310: first consider the case when $W$ is symmetric.\footnote{A different
311: proof for the
312: symmetric case was first derived in \cite{Shi}. However, our derivation
313: is simpler and more transparent and leads naturally to the SVD
314: problems for the rectangular case.}
315: It is easy
316: to see that with $W$ symmetric (denoting $\Ncut(A,A)$ by $\Ncut(A)$),
317: we have
318: \begin{equation}\label{eq:sym}
319:  \Ncut(A) = \frac{\s(W_{12})}{\s(W_{11})+\s(W_{12})}
320: + \frac{\s(W_{12})}{\s(W_{22})+\s(W_{12})}.\end{equation}
321: Let $e$ be the vector
322: with all its elements equal to $1$. Let $D$ be the diagonal matrix
323: such that $We = De$. Then $(D-W)e=0$. Let $x=(x_i)$ be the vector with
324: \[  x_i = \left\{\begin{array}{rl}
325:                  1, & i \in A,\\
326:                 -1, & i \in A^c.
327:                  \end{array}
328:            \right.
329: \] 
330: It is easy to verify that
331: \[ \s(W_{12}) = x^T(D-W)x/4.\]
332: Define
333: \[ p \equiv \frac{\s(W_{11})+\s(W_{12})}{\s(W_{11})+2\s(W_{12})+\s(W_{22})}
334:      =\frac{\s(W_{11})+\s(W_{12})}{e^TDe}.\]
335: Then
336: \[ \begin{array}{c}
337: \s(W_{11})+\s(W_{12}) = p e^TDe, \\[3pt]
338:    \s(W_{22})+\s(W_{12}) = (1-p)e^TDe,
339: \end{array}
340: \]
341: and
342: \begin{equation}\label{eq:n}
343:  \Ncut(A) = \frac{x^T(D-W)x}{4p(1-p)e^TDe}.
344: \end{equation}
345: Since $(D-W)e=0$, for any scalar $s$ we have
346: \[ (se+x)^T(D-W)(se+x)= x^T(D-W)x.\]
347: To cast (\ref{eq:n}) in the form of a Rayleigh quotient, 
348: we need to find
349: $s$ such that
350: \[ (se+x)^TD(se+x)= 4p(1-p)e^TDe.\]
351: Since $x^TDx = e^TDe$, it follows from the above equation that
352: $s = 1-2p$. Now let $y=(1-2p)e + x$, it is easy to see that
353: $y^TDe = ((1-2p)e + x)^TDe = 0$, and
354: \[  y_i = \left\{\begin{array}{rl}
355:                  2(1-p)>0, & i \in A,\\
356:                 -2p<0, & i \in A^c.
357:                  \end{array}
358:            \right.
359: \]
360: Thus
361: \[ \min_{A} \Ncut(A) = \min \left\{ \frac{y^T(D-W)y}{y^TDy} 
362: \;\; | \;\; y \in S\right\},\]
363: where
364: \[ S=\{ y \;\; | \;\;
365: y^TDe =0, y_i 
366: \in \{ 2(1-p), -2p\} \}.\]
367: If we drop the constraints $y_i 
368: \in \{ 2(1-p), -2p\}$ and let
369: the elements of $y$ take
370: arbitrary continuous values, then the optimal $y$ can be approximated by
371: the solution of the following relaxed {\it continuous} minimization problem,
372: \begin{equation}\label{eq:y} \min \left\{ \frac{y^T(D-W)y}{y^TDy} 
373: \;\; | \;\;y^TDe =0\right\}.\end{equation}
374: Notice that it follows from $We = De$ that 
375: \[ D^{-1/2}WD^{-1/2} (D^{1/2}e) = D^{1/2}e,\]
376: and therefore $D^{1/2}e$ is an eigenvector of
377: $D^{-1/2}WD^{-1/2}$ corresponding to the eigenvalue $1$. It
378: is easy to show that all the eigenvalues of $D^{-1/2}WD^{-1/2}$
379: have absolute value at most $1$ (see the Appendix). With the substitution $\hat{y} = D^{1/2}y$, the quotient in (\ref{eq:y}) becomes $1 - \hat{y}^TD^{-1/2}WD^{-1/2}\hat{y}/\hat{y}^T\hat{y}$ and the constraint becomes $\hat{y}^TD^{1/2}e = 0$. Thus the optimal $y$ in
380: (\ref{eq:y}) can be computed as $y = D^{-1/2}\hat{y}$, where
381: $\hat{y}$ is the eigenvector of $D^{-1/2}WD^{-1/2}$
382: corresponding to its {\it second} largest eigenvalue.
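
As a sanity check of the derivation for the symmetric case, consider
the following small example. The matrix is hypothetical and is chosen
only so that the arithmetic can be followed by hand. Let
\[ W = \left[\begin{array}{ccc}
       0 & 2 & 1\\
       2 & 0 & 0\\
       1 & 0 & 0
       \end{array}\right], \quad A=\{1,2\}, \quad A^c=\{3\},\]
so that $D=\diag(3,2,1)$, $e^TDe=6$, $\s(W_{11})=4$, $\s(W_{12})=1$
and $\s(W_{22})=0$. From (\ref{eq:sym}),
\[ \Ncut(A) = \frac{1}{4+1} + \frac{1}{0+1} = \frac{6}{5}.\]
On the other hand, $x=(1,1,-1)^T$ gives $x^T(D-W)x = 4 = 4\,\s(W_{12})$,
and $p=5/6$, so that (\ref{eq:n}) yields
\[ \Ncut(A) = \frac{4}{4\cdot\frac{5}{6}\cdot\frac{1}{6}\cdot 6} = \frac{6}{5}.\]
Finally, $y=(1-2p)e+x=(1/3,\,1/3,\,-5/3)^T$ satisfies
$y^TDe = 1 + 2/3 - 5/3 = 0$ and $y^TDy = 10/3$, and the
Rayleigh quotient $y^T(D-W)y/y^TDy = 4/(10/3) = 6/5$ reproduces the same value.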
383: 
384: Now we return to the rectangular case for the weight matrix $W$,
385: and let $D_X$ and $D_Y$ be diagonal matrices such that
386: \begin{equation}\label{eq:xy} 
387: We = D_X e, \quad W^Te = D_Ye.
388: \end{equation}
389: Consider a partition $\Pi(A,B)$, and define
390: \[ u_i = \left\{\begin{array}{rl}
391:                  1, & i \in A\\
392:                 -1, & i \in A^c
393:                  \end{array}
394:            \right., \quad
395: v_i = \left\{\begin{array}{rl}
396:                  1, & i \in B\\
397:                 -1, & i \in B^c
398:                  \end{array}
399:            \right.
400: \]
401: Let $W$ have the block form as in (\ref{eq:w}), and consider the
402: augmented symmetric matrix\footnote{In \cite{heko:00}, the Laplacian
403: of $\hat{W}$ is used for partitioning a rectangular matrix
404: in the context of designing load-balanced matrix-vector multiplication
405: algorithms for parallel computation. However, the eigenvalue
406: problem of the Laplacian
407: of $\hat{W}$ does not lead to a simpler singular value problem.}
408: \[ \hat{W} = \left[\begin{array}{cc}
409:                    0 & W\\
410:                    W^T & 0
411:              \end{array}\right]
412:    = \left[\begin{array}{cc|cc}
413:                    0 & 0 & W_{11} & W_{12}\\
414:                    0 & 0 & W_{21} & W_{22}\\ \hline
415:                    W_{11}^T & W_{21}^T & 0 & 0 \\
416:                    W_{12}^T & W_{22}^T & 0 & 0
417:                       \end{array}\right].\]
418: If we interchange the second and third block rows and columns
419: of the above matrix, we obtain
420: \[ \left[\begin{array}{cc|cc}
421:                    0 & W_{11} & 0 & W_{12}\\
422:                    W_{11}^T & 0 & W_{21}^T & 0\\ \hline
423:                    0 & W_{21} & 0 & W_{22} \\
424:                    W_{12}^T & 0 & W_{22}^T & 0
425:                       \end{array}\right] \equiv
426:     \left[\begin{array}{cc}
427:                    \hat{W}_{11} & \hat{W}_{12}\\
428:                    \hat{W}_{12}^T & \hat{W}_{22}
429:              \end{array}\right],\]
430: and the normalized cut can be written as
431: \[ \Ncut(A,B) = \frac{\s(\hat{W}_{12})}{\s(\hat{W}_{11})+\s(\hat{W}_{12})}
432: + \frac{\s(\hat{W}_{12})}{\s(\hat{W}_{22})+\s(\hat{W}_{12})},\]
433: a form that resembles the symmetric case (\ref{eq:sym}). Define
434: \[ q = \frac{2\s(W_{11}) +\s(W_{12}) 
435: +\s(W_{21})}{e^TD_Xe + e^TD_Ye}.\]
436: Then we have
437: \[ \Ncut(A,B) = \frac{-2x^TWy + x^TD_Xx + y^TD_Yy}{x^TD_Xx + y^TD_Yy}\]\[
438:               = 1- \frac{2x^TWy}{x^TD_Xx + y^TD_Yy},\]
439: where $x = (1-2q)e +u, y = (1-2q)e + v$. It is also easy to see that
440: \begin{equation}\label{eq:q} 
441: x^TD_Xe + y^TD_Ye = 0, \quad x_i, y_i \in \{ 2(1-q), -2q\}.
442: \end{equation}
443: Therefore,
444: \[ \min_{\Pi(A,B)} 
445: \Ncut(A,B)\]\[ = 1-\max_{x \neq 0, y \neq 0}
446: \left\{ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy} 
447: \;\; | \;\;  x, y \; \mbox{\rm satisfy } (\ref{eq:q})\right\}.\]
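As a numerical check of the above expression for $\Ncut(A,B)$, consider
the following small example (the matrix is hypothetical and is chosen
only so that the arithmetic can be followed by hand). Let
\[ W = \left[\begin{array}{ccc}
       2 & 1 & 0\\
       1 & 2 & 0\\
       0 & 1 & 3
       \end{array}\right], \quad A=\{1,2\}, \quad B=\{1,2\}.\]
Then $D_X=\diag(3,3,4)$, $D_Y=\diag(3,4,3)$,
$e^TD_Xe+e^TD_Ye=20$ and $q=(2\cdot 6+0+1)/20=13/20$, so that
$x=y=(7/10,\,7/10,\,-13/10)^T$. We find
$x^TD_Xe=-1$ and $y^TD_Ye=1$, hence $x^TD_Xe+y^TD_Ye=0$, and
\[ x^TD_Xx = 9.7, \quad y^TD_Yy = 8.5, \quad x^TWy = 7.1,\]
which gives
\[ 1-\frac{2x^TWy}{x^TD_Xx + y^TD_Yy} = 1-\frac{14.2}{18.2} = \frac{20}{91}.\]
This agrees with the value $1/13+1/7=20/91$ obtained directly
from the definition of $\Ncut(A,B)$.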
448: Ignoring the discrete constraints on the elements of $x$ and $y$, we
449: have the following continuous maximization problem,
450: \begin{equation}\label{eq:yz}
451: \max_{x \neq 0, y \neq 0}
452: \left\{ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy}\;\; | \;\; 
453: x^TD_Xe + y^TD_Ye = 0 \right\}.
454: \end{equation}
455: Without the constraint 
456: $x^TD_Xe + y^TD_Ye = 0$, the above problem is equivalent to
457: computing the largest singular triplet of $D_X^{-1/2} W D_Y^{-1/2}$
458: (see the Appendix).
459: From (\ref{eq:xy}), we have
460: \[ \begin{array}{c}
461: D_X^{-1/2} W D_Y^{-1/2} (D_Y^{1/2}e) = D_X^{1/2} e, \\[3pt]
462: (D_X^{-1/2} W D_Y^{-1/2})^T (D_X^{1/2}e) = D_Y^{1/2} e,
463: \end{array}
464: \]
465: and similarly to
466: the symmetric case, it is easy to show that all the 
467: singular values of $D_X^{-1/2} W D_Y^{-1/2}$
468: are at most $1$. Therefore, an optimal pair $\{x,y\}$ for
469: (\ref{eq:yz}) can be computed as
470: $x = D_X^{-1/2} \hat{x}$ and $y = D_Y^{-1/2} \hat{y}$,
471: where $\hat{x}$ and $\hat{y}$ are the {\it second}
472: largest left and right singular vectors of $D_X^{-1/2} W D_Y^{-1/2}$,
473: respectively (see the Appendix). 
474: With the above discussion, we can now summarize our
475: basic approach for bipartite graph clustering incorporating
476: a recursive procedure.
477: 
478: \bigskip
479: 
480: \begin{center}
481: \fbox{\parbox{7.7cm}{
482: {\sc Algorithm.} Spectral Recursive Embedding (SRE) 
483: 
484: Given a weighted bipartite graph $G = (X,Y,E)$
485: with its edge weight matrix $W$:
486: 
487:  \begin{enumerate}
488:    \item Compute $D_X$ and $D_Y$ and form the scaled weight matrix
489:          $\hat{W}=D_X^{-1/2} W D_Y^{-1/2}$.
490:    \item Compute the {\it second} largest left and right
491:          singular vectors of $\hat{W}$, $\hat{x}$ and $\hat{y}$.
492:    \item Find cut points $c_x$ and $c_y$ for $x=D_X^{-1/2}\hat{x}$
493:          and $y=D_Y^{-1/2}\hat{y}$, respectively.
494:    \item Form partitions $A=\{i \;\;| \;\;x_i \geq c_x\}$ and 
495:          $A^c=\{i \;\;| \;\;x_i < c_x\}$ for vertex set $X$, and
496:          $B=\{j \;\;| \;\;y_j \geq c_y\}$ and 
497:          $B^c=\{j \;\;|\;\; y_j < c_y\}$ for vertex set $Y$.
498:    \item Recursively partition the sub-graphs $G(A,B)$
499:           and $G(A^c,B^c)$ if necessary.
500:          
501:  \end{enumerate}
502: }}
503: \end{center}
504: 
505: \bigskip
506: 
507: Two basic strategies can be used for selecting the cut points
508: $c_x$ and $c_y$. The simplest strategy is to set $c_x=0$ and
509: $c_y=0$. A more computationally intensive approach is to base
510: the selection on $\Ncut$: check $N$ equally spaced splitting
511: points of $x$ and $y$, respectively, and choose the cut
512: points $c_x$ and $c_y$ with the smallest $\Ncut$ \cite{Shi}.
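
To illustrate the steps of SRE on the smallest nontrivial case,
consider the following example. The matrix is hypothetical and is
chosen only so that the singular vectors can be written down by hand. Let
\[ W = \left[\begin{array}{cc}
       3 & 1\\
       1 & 3
       \end{array}\right], \quad D_X = D_Y = \diag(4,4), \quad
   D_X^{-1/2} W D_Y^{-1/2} = \frac{1}{4}\left[\begin{array}{cc}
       3 & 1\\
       1 & 3
       \end{array}\right].\]
The scaled matrix has singular values $\sigma_1 = 1$, with left and right
singular vectors $(1,1)^T/\sqrt{2}$, and $\sigma_2 = 1/2$, with
$\hat{x}=\hat{y}=(1,-1)^T/\sqrt{2}$. Step 3 gives
$x=D_X^{-1/2}\hat{x}=(1,-1)^T/(2\sqrt{2})$ and the same vector for $y$,
and the cut points $c_x=c_y=0$ yield $A=\{1\}$, $A^c=\{2\}$,
$B=\{1\}$ and $B^c=\{2\}$. For this partition
\[ \Ncut(A,B) = \frac{2}{4+4} + \frac{2}{4+4} = \frac{1}{2},\]
which here equals $1-\sigma_2$.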
513: 
514: {\bf Computational complexity.} The major computational cost
515: of SRE is in Step 2, computing the left and right singular vectors,
516: which can be obtained either by the power method or, more robustly,
517: by the Lanczos bidiagonalization process \cite[Chapter 9]{govl:96}. 
518: The Lanczos method is an iterative process for computing
519: partial SVDs in 
520: which  each iterative step involves the computation of two matrix-vector
521: multiplications $\hat{W}u$ and $\hat{W}^Tv$ for some vectors
522: $u$ and $v$. The computational cost of these is 
523: roughly proportional to $\nnz(\hat{W})$,
524: the number of nonzero elements of $\hat{W}$. The total
525: computational 
526: cost of SRE is $O(c_{\sre}k_{\svd}\nnz(\hat{W}))$, where 
527: $c_{\sre}$ is the level of recursion and $k_{\svd}$ is the
528: number of Lanczos iteration steps. In general, $k_{\svd}$ depends on
529: the singular value gaps of $\hat{W}$. Also notice that
530: $\nnz(\hat{W})= n_w n$, where $n_w$ is the average number of
531: distinct terms per document and $n$ is the total number of documents.
532: Therefore, the total cost of SRE is in general linear in the
533: number of documents to be clustered.
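
To give a rough feeling for the orders of magnitude involved, suppose,
purely for illustration, that $n = 2\times 10^4$, $n_w = 100$,
$k_{\svd} = 50$ and $c_{\sre} = 5$. All four values are assumptions
made for this estimate and are not measurements. Then
\[ \nnz(\hat{W}) = n_w n = 2 \times 10^6, \qquad
   c_{\sre}k_{\svd}\nnz(\hat{W}) = 5 \cdot 50 \cdot 2\times 10^6
   = 5 \times 10^8,\]
and doubling $n$ with $n_w$, $k_{\svd}$ and $c_{\sre}$ held fixed
doubles this estimate.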
534: 
535: 
536: \section{Connections to correspondence analysis}\label{se:corr}
537: In its basic form correspondence analysis is applied to an 
538: $m$-by-$n$ two-way
539: table of counts $W$ \cite{benz:92,gree:93,veri:99}. Let $w=\s(W)$ be
540: the sum of all the elements of $W$, and let $D_X$ and $D_Y$ be the diagonal
541: matrices defined in section \ref{se:svd}. Correspondence analysis
542: seeks to compute the largest singular triplets of the matrix 
543: $Z=(z_{ij}) \in \R^{m \times n}$ with
544: \[ z_{ij}= \frac{w_{ij}/w - (D_X(i,i)/w)(D_Y(j,j)/w)}
545:              {\sqrt{(D_X(i,i)/w)(D_Y(j,j)/w)}}.\]
546: The matrix $Z$ can be considered as the correlation matrix of two
547: group indicator matrices for the original $W$ \cite{veri:99}. 
548: We now show that the SVD of $Z$ is closely related to the
549: SVD of $\hat{W} \equiv D^{-1/2}_XWD_Y^{-1/2}$. 
550: In fact, in section \ref{se:svd},
551: we showed that $D_X^{1/2}e$ and $D_Y^{1/2}e$ are the left and right
552: singular vectors of $\hat{W}$ corresponding to the singular value one,
553: and it is also easy to show that all the singular values of
554: $\hat{W}$ are at most $1$. Therefore, 
555: the rest of the singular values and singular vectors of
556: $\hat{W}$ can be found by computing the SVD of 
557: the following  rank-one modification
558: of $\hat{W}$
559: \[D^{-1/2}_XWD_Y^{-1/2}-
560: \frac{D_X^{1/2}ee^TD_Y^{1/2}}{\|D_X^{1/2}e\|_2\|D_Y^{1/2}e\|_2}\]
561: which has $(i,j)$ element
562: \[  \frac{w_{ij}}{\sqrt{D_X(i,i)D_Y(j,j)}} - 
563:      \frac{\sqrt{D_X(i,i)D_Y(j,j)}}{w} = z_{ij},\]
564: i.e., it is exactly the $(i,j)$ element of $Z$.
565: Therefore, normalized-cut based  cluster analysis and correspondence
566: analysis arrive at the same SVD problems even though they start with
567: completely different principles. It is worthwhile to explore
568: more deeply the interplay between these two different points of view and
569: approaches, for example, using the statistical analysis of
570: correspondence analysis to provide better strategies for selecting cut
571: points and estimating the number of clusters.
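
The relation between the two SVD problems can be followed by hand on a
small example. The table below is hypothetical and is chosen only for
ease of computation. Let
\[ W = \left[\begin{array}{cc}
       3 & 1\\
       1 & 3
       \end{array}\right], \quad w = 8, \quad D_X = D_Y = \diag(4,4).\]
Then $z_{ij} = (w_{ij}/8 - 1/4)/(1/2) = w_{ij}/4 - 1/2$, i.e.,
\[ Z = \frac{1}{4}\left[\begin{array}{rr}
       1 & -1\\
       -1 & 1
       \end{array}\right].\]
On the other hand, $\|D_X^{1/2}e\|_2\|D_Y^{1/2}e\|_2 = 8$ and
$D_X^{1/2}ee^TD_Y^{1/2} = 4ee^T$, so that
\[ D^{-1/2}_XWD_Y^{-1/2} - \frac{1}{2}ee^T
   = \frac{1}{4}\left[\begin{array}{cc}
       3 & 1\\
       1 & 3
       \end{array}\right] - \frac{1}{2}\left[\begin{array}{cc}
       1 & 1\\
       1 & 1
       \end{array}\right]
   = \frac{1}{4}\left[\begin{array}{rr}
       1 & -1\\
       -1 & 1
       \end{array}\right],\]
which here coincides with $Z$. Its only nonzero singular value is $1/2$,
with left and right singular vectors $(1,-1)^T/\sqrt{2}$, and this is the
second singular triplet of $D^{-1/2}_XWD_Y^{-1/2}$.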
572: 
573: \section{Partitions with overlaps}\label{se:over}
574: So far in our discussion, we have only looked at {\it hard}
575: clustering, i.e., a data object belongs to one and only
576: one cluster. In many situations, especially when there is much
577: overlap among the clusters, it is more advantageous to allow
578: data objects to belong to more than one cluster. For
579: example, in document clustering, certain groups of words can
580: be shared by two clusters. Is it possible
581: to model this overlap using our bipartite graph model and also
582: find efficient approximate solutions? The answer seems to be yes,
583: but our results at this point are rather preliminary and we will
584: only illustrate the possibilities. Our basic idea is that when computing
585: $\Ncut(A,B)$, we should disregard the contributions of the
586: set of vertices that is in the overlap. More specifically,
587: let $X=A\cup O_X \cup \bar{A}$ and $Y=B\cup O_Y\cup \bar{B}$, where
588: $O_X$ denotes the overlap between 
589: the vertex subsets
590: $A\cup O_X$ and $\bar{A}\cup O_X$, and
591: $O_Y$ the overlap between $B\cup O_Y$ and $\bar{B}\cup O_Y$. We then compute
592: \[ \Ncut(A,B,\bar{A},\bar{B}) =\frac{\cut(A,B)}{W(A,Y)+W(X,B)}\]\[
593: +\frac{\cut(\bar{A},\bar{B})}{W(\bar{A},Y)+W(X,\bar{B})}.\]
594: However, we can make 
595: $\Ncut(A,B,\bar{A},\bar{B})$ smaller simply by putting more
596: vertices in the overlap. Therefore, we need to balance these
597: two competing quantities: the size of the overlap and the modified
598: normalized cut $\Ncut(A,B,\bar{A},\bar{B})$ by minimizing
599: \[ \Ncut(A,B,\bar{A},\bar{B}) + \alpha(|O_X| + |O_Y|),\]
600: where $\alpha$ is a regularization parameter. How to find an
601: efficient method for computing the (approximate) optimal
602: solution to the above minimization problem still needs to be
603: investigated. We close this section by presenting an illustrative
604: example showing that in some situations, the singular vectors
605: already automatically separate the overlap sets while giving
606: the coordinates for carrying out clustering.
607: 
608: \begin{figure}[t]
609: \centerline{
610: \mbox{\psfig{file=corr1.ps,height=1.8in,width=1.6in}
611: \psfig{file=corr2.ps,height=1.8in,width=1.6in}}}
612: \caption{Sparsity patterns of a test matrix before clustering
613: (left) and after clustering (right)}
614: \label{fi:op}
615: \end{figure}
616: 
617: {\sc Example 1.} We construct a sparse $m$-by-$n$ rectangular matrix 
618: \[ W = \left[\begin{array}{cc}
619:              W_{11}&W_{12}\\
620:              W_{21}&W_{22}
621:        \end{array}\right]\]
622: so that $W_{11}$ and $W_{22}$ are relatively denser than $W_{12}$
623: and $W_{21}$. We also add some dense rows and columns to the matrix $W$
624: to represent row and column overlaps.
625: The left panel of Figure \ref{fi:op} shows the sparsity pattern of 
626: $\bar{W}$,
627: a matrix obtained by randomly permuting
628: the rows and columns of $W$. We then compute the
629: second largest left and right singular vectors of 
630: $D_X^{-1/2} \bar{W} D_Y^{-1/2}$, say $x$ and $y$, then sort the rows and
631: columns of $\bar{W}$ according to the values of the entries in
632: $D_X^{-1/2}x$ and $D_Y^{-1/2}y$, respectively. The sparsity
633: pattern of this permuted $\bar{W}$ is shown on the right panel of
634: Figure \ref{fi:op}. As can be seen, the singular vectors not
635: only do the job of clustering but at the same time also
636: concentrate the dense rows and columns at the boundary of the two
637: clusters.
638: 
639: \section{Experiments}\label{se:exp}
640: In this section we present our experimental results on clustering
641: a dataset of newsgroup articles submitted to  20 
642: newsgroups.\footnote{
643: The newsgroup dataset together with the {\tt bow} toolkit for
644: processing it  
645: can be downloaded from
646: {\tt http://www.cs.cmu.edu/afs/cs/project/theo-11/www/}
647: 
648: \noindent
649: {\tt naive-bayes.html}.}
650: This dataset contains about
651: 20,000 articles (email messages) evenly divided among the 20
652: newsgroups. We list the names of the newsgroups together
653: with the associated group labels (the labels will be
654: used in the sequel to identify the newsgroups).
655: 
656: 
657: \begin{verbatim}
658:        NG1: alt.atheism
659:        NG2: comp.graphics
660:        NG3: comp.os.ms-windows.misc
661:        NG4: comp.sys.ibm.pc.hardware
662:        NG5: comp.sys.mac.hardware
663:        NG6: comp.windows.x
664:        NG7: misc.forsale
665:        NG8: rec.autos
666:        NG9: rec.motorcycles
667:        NG10: rec.sport.baseball
668:        NG11: rec.sport.hockey
669:        NG12: sci.crypt
670:        NG13: sci.electronics
671:        NG14: sci.med
672:        NG15: sci.space
673:        NG16: soc.religion.christian
674:        NG17: talk.politics.guns
675:        NG18: talk.politics.mideast
676:        NG19: talk.politics.misc
677:        NG20: talk.religion.misc
678: \end{verbatim}
679: 
\begin{table*}
\caption{Comparison of spectral embedding (SRE), PDDP, and K-means
(NG1/NG2)
}\label{tb:12}
\begin{center}
\begin{tabular}{llrr}
686: Mixture & SRE & PDDP & K-means \\ \hline\hline
687: 50/50   & $92.12\pm 3.52$\% & $91.90\pm 3.19$\% $(53,10,37)$&$ 76.93\pm 14.42$\% $(82,2,10)$\\ \hline
688: 50/100  & $90.57\pm 3.11$\% & $86.11\pm 3.94$\% $(86, 5,9)$&$ 76.74\pm 14.01$\% $(80,2,18)$\\ \hline
689: 50/150  & $88.04\pm 3.90$\% & $78.60\pm 5.03$\% $(98, 0, 2)$&$68.80\pm 13.55$\% $(88, 0, 12)$\\ \hline
690: 50/200  & $82.77\pm 5.24$\% & $70.43\pm 6.04$\% $(97,0,3)$&$69.22\pm 12.34$\% $(83,1,16)$\\ \hline
691: \end{tabular}
692: \end{center}
693: \end{table*}
694: 
\begin{table*}
\caption{Comparison of spectral embedding (SRE), PDDP, and K-means
(NG10/NG11)
}\label{tb:1011}
\begin{center}
\begin{tabular}{llrr}
701: Mixture & SRE & PDDP & K-means \\ \hline\hline
702: 50/50   & $74.56\pm 8.93$\% & $73.40\pm 10.07$\% $(56,6,38)$ &$61.61\pm 8.77$\% $(86,0,14)$\\ \hline
703: 50/100  & $67.13\pm 7.17$\% & $67.10\pm 10.20$\% $(52,1,47)$ &$64.40\pm 9.37$\% $(59,1,40)$\\ \hline
704: 50/150  & $58.30\pm 5.99$\% & $58.72\pm 7.48$\% $(52,1,47)$ &$62.53\pm 8.20$\% $(36,1,63)$\\ \hline
705: 50/200  & $57.55\pm 5.69$\% & $56.63\pm 4.84$\% $(58,1,41)$ &$60.82\pm 7.54$\% $(39,2,59)$\\ \hline
706: \end{tabular}
707: \end{center}
708: \end{table*}
709: 
We used the {\it bow} toolkit to construct the term-document
matrix for this dataset; specifically, we used the tokenization option
that strips the UseNet headers, and we also applied stemming
\cite{mcca:96}. Some of the newsgroups have large overlaps, for
example, the five {\tt comp.*} newsgroups about
computers. In fact, several articles are posted to multiple newsgroups.
Before we apply clustering algorithms to the dataset, several
preprocessing steps need to be considered. Two standard steps
are weighting and feature selection. For weighting, we considered
a variant of the tf.idf weighting scheme,
$\tf\log_2(n/\df),$
where $\tf$ is the term frequency, $\df$ is the document
frequency and $n$ is the number of documents, as well as several other variations
listed in \cite{bele:00}.
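
As a quick numerical illustration of this weight, reading $n$ as the
number of documents in the collection (the numbers are made up for the
example and are not taken from the newsgroup data), a term with
$\tf=3$ in a document and $\df=16$ in a collection of $n=1024$
documents receives
\[ \tf\log_2(n/\df) = 3\log_2(1024/16) = 3\cdot 6 = 18, \]
whereas a term that occurs in every document ($\df=n$) receives weight
$\tf\log_2 1 = 0$ regardless of its frequency.
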
For feature selection, we looked at three approaches: 1)
deleting terms that occur fewer than a certain number of
times in the dataset; 2) deleting terms that
occur in fewer than a certain number of
documents in the dataset; 3) selecting terms according to the mutual
information of terms and documents, defined as
\[ I(y) = \sum_{x} p(x,y)\log\bigl(p(x,y)/(p(x)p(y))\bigr),\]
where $y$ represents a term and $x$ a document \cite{slti:00}.
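
To illustrate how this score behaves, consider a hypothetical
collection with two documents $x_1, x_2$ and a single term $y$. The
probabilities below are invented for the example, and we assume base-2
logarithms since the base is not fixed above. With
$p(x_1)=p(x_2)=1/2$, $p(y)=1/2$, $p(x_1,y)=3/8$ and $p(x_2,y)=1/8$,
\[ I(y) = \frac{3}{8}\log_2\frac{3/8}{1/4} + \frac{1}{8}\log_2\frac{1/8}{1/4}
 = \frac{3}{8}\log_2\frac{3}{2} - \frac{1}{8} \approx 0.094. \]
If instead the term is spread evenly, $p(x_1,y)=p(x_2,y)=1/4$, both
ratios equal $1$ and $I(y)=0$. Under this criterion a term whose
occurrences are concentrated in particular documents therefore scores
higher than one that is spread uniformly.
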
In general we found that the traditional tf.idf based
weighting schemes do not improve performance for SRE. One possible
explanation comes from the connection with correspondence analysis:
the raw frequencies are samples of co-occurrence probabilities,
and the pre- and post-multiplication by $D_X^{-1/2}$ and
$D_Y^{-1/2}$ in $D_X^{-1/2}(D-W)D_Y^{-1/2}$ {\it automatically}
take weighting into account. We did, however, find that
trimming the raw frequencies can sometimes improve
performance for SRE, especially in the anomalous cases where
some words occur in certain documents an unusually large number of times,
skewing the clustering process.
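
One simple way to write the trimming step is as a cap on the raw
counts. This is only a sketch: the symbols $f_{ij}$ (raw frequency of
term $i$ in document $j$) and $c$ (the cap) are introduced here for
illustration, and $c=10$ is the value used for SRE in Example 2. For a
term with raw counts $(1, 3, 25, 12)$ in four documents,
\[ \tilde{f}_{ij} = \min(f_{ij}, c), \qquad (1, 3, 25, 12) \mapsto (1, 3, 10, 10). \]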
743: 
744: 
745: 
746: 
For the purpose of comparison, we consider two other clustering
methods: 1) the K-means method \cite{Gordon}; 2) the principal direction
divisive partitioning (PDDP) method \cite{bole:98}. The K-means method is
a widely used cluster analysis tool. The variant we used employs
the Euclidean distance to measure the dissimilarity between
two documents. When applying K-means,
we {\it normalize} each document vector so that it has
Euclidean length one. In essence, we use the cosine of the angle
between two document vectors when
measuring their similarity. We also tried K-means without
document length normalization; the results are far worse, and therefore
we do not report them. Since the K-means method is
an iterative method, we need to specify a stopping criterion. For
the variant we used, we compare the centroids between two
consecutive iterations, and stop when the difference is smaller
than a pre-defined tolerance.
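
The remark above that Euclidean distance on length-normalized
documents amounts to using the cosine can be made explicit. For two
document vectors $a$ and $b$ with $\|a\|_2=\|b\|_2=1$ and angle
$\theta$ between them (notation introduced here for illustration),
\[ \|a-b\|_2^2 = a^Ta + b^Tb - 2a^Tb = 2 - 2\cos\theta, \]
so a smaller Euclidean distance corresponds exactly to a larger cosine
similarity.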
763: 
764: 
PDDP is another clustering method that utilizes singular
vectors. It is based on the idea of principal component
analysis and has been shown to
outperform several standard clustering methods
such as the hierarchical agglomerative algorithm \cite{bole:98}.
First, each document is considered as a
multivariate data point. The set of documents is normalized
to have unit Euclidean length and then centered, i.e., let
$W$ be the term-document matrix, and $w$ be the average of
the columns of $W$. Compute the largest singular value triplet
$\{u,\sigma,v\}$ of
$W-we^T$. Then split the set of documents based on their
values in the vector $v=(v_i)$: one simple scheme is to
let those with
positive $v_i$ go into one cluster and those
with nonpositive $v_i$ into another cluster. Then the
whole process is repeated on the term-document matrices of
the two clusters, respectively. Although both our clustering
method SRE and PDDP
make use of the singular vectors of some versions of the
term-document matrices, they are derived from fundamentally
different principles. PDDP is a feature-based clustering method,
projecting all the data points to the one-dimensional subspace
spanned by the first principal axis; SRE is a similarity-based
clustering method, in which two co-occurring variables (terms and
documents in the context of document clustering) are
simultaneously clustered. Unlike SRE, PDDP does not
have a well-defined objective function
for minimization. It only partitions the columns of
the term-document matrices, while SRE partitions both their
rows and columns. This has a significant impact on the
computational costs.
PDDP, however, has the advantage that it can be applied to
datasets with both positive and negative values, while SRE can only be
applied to datasets with nonnegative data values.
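
A single PDDP split can be traced by hand on a toy collection. The
matrix below is an illustration only, with two terms and four
documents whose columns already have unit length:
\[ W = \left[\begin{array}{cccc}
             1&1&0&0\\
             0&0&1&1
       \end{array}\right], \quad
   w = \left[\begin{array}{c} 1/2\\ 1/2 \end{array}\right], \quad
   W - we^T = \frac{1}{2}\left[\begin{array}{rrrr}
             1&1&-1&-1\\
             -1&-1&1&1
       \end{array}\right].\]
The centered matrix has rank one and equals $\sigma u v^T$ with
\[ \sigma = \sqrt{2}, \quad u = (1,-1)^T/\sqrt{2}, \quad
   v = (1,1,-1,-1)^T/2. \]
Documents 1 and 2 have positive $v_i$ and form one cluster, while
documents 3 and 4 form the other, which matches the term each pair
uses.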
800: 
\begin{table*}
\caption{Comparison of spectral embedding (SRE), PDDP, and K-means
(NG18/NG19)
}\label{tb:1819}
\begin{center}
\begin{tabular}{llrr}
807: Mixture & SRE & PDDP & K-means \\ \hline\hline
808: 50/50  & $73.66\pm 10.53$\% & $69.52\pm 12.83$\% $(65,12,32)$ & $62.25 \pm 9.94$\% $(82,1,17)$\\ \hline
809: 50/100   & $67.23\pm 7.84$\% & $67.84\pm 7.30$\% $(46,5,49)$& $60.91\pm 7.92$\% $(65,13,32)$\\ \hline
810: 50/150  & $65.83\pm 12.79$\% & $60.37\pm 9.85$\% $(53,3,44)$ &$63.32\pm 8.26$\% $(58,3,39)$\\ \hline
811: 50/200  & $61.23\pm 9.88$\% & $60.76\pm 5.55$\% $(40,1,59)$ &$64.50\pm 7.58$\% $(34,0,66)$\\ \hline
812: \end{tabular}
813: \end{center}
814: \end{table*}
815: 
816: 
\begin{table*}
\caption{Confusion matrix for newsgroups $\{2, 9, 10, 15, 18\}$
}\label{tb:con}
820: \begin{center}
821: \begin{tabular}{|l||c|c|c|c|c|}
822: \hline\hline
823:  &mideast &graphics & space & baseball &  motorcycles \\ \hline\hline
824: cluster 1&  87&  0&   0&    2&   0\\ \hline
825: cluster 2&  7&   90&  7&    6&   7\\ \hline
826: cluster 3&  3&   9&   84&   1&   1\\ \hline
827: cluster 4&  0&   0&   1&    88&  0\\ \hline
828: cluster 5&  3&   1&   8&    3&   92\\ \hline\hline
829: \end{tabular}
830: \end{center}
831: \end{table*}
832: 
833: 
834: 
{\sc Example 2.} In this example, we examine binary clustering
with uneven clusters. We consider three pairs of newsgroups:
newsgroups 1 and 2 are well-separated, 10 and 11 are
less well-separated, and 18 and 19 have a lot of overlap.
We used document frequency as the feature
selection criterion and deleted
words that occur in fewer than $5$ documents in each dataset we
used. For both K-means and PDDP we apply tf.idf weighting together
with document length normalization so that each document vector
has Euclidean norm one. For SRE we trim the raw frequencies
so that the maximum is $10$.
For each newsgroup pair, we select four
types of mixtures of articles from the two newsgroups: $x/y$ indicates
that $x$ articles are from the first group and $y$ articles are
from the second group. The results are listed in Table
1 for groups 1 and 2, Table 2 for groups 10 and
11, and Table 3 for groups 18 and 19. We list
the means and standard deviations over 100 random samples.
For PDDP and K-means we also include a triplet of numbers
which indicates on how many of the 100 samples SRE performs better (the first
number), the same (the second number), and worse (the third number) than
the corresponding method (PDDP or K-means).
We should emphasize that the
K-means method can only find a local minimum, and the results
depend on the initial values and stopping criteria. This is also
reflected by the large standard deviations associated with
the K-means method.
From the three
tests we can conclude that both SRE and PDDP outperform the K-means
method. The performance of SRE and PDDP is similar for balanced
mixtures, but SRE is superior to PDDP for skewed mixtures.
866: 
867: 
868: 
{\sc Example 3.} In this example, we consider an easy multi-cluster case:
we examine the five newsgroups $2, 9, 10, 15, 18$, which
were also considered in \cite{slti:00}. We sample 100
articles from each newsgroup and use mutual information for
feature selection.
We use the minimum normalized cut as the cut point at each level
of the recursion.
For one sample, Table 4 gives the confusion matrix.
The accuracy for this sample is $88.2$\%. We also tested two
other samples, with accuracies $85.4$\%
and $81.2$\%,
which compare
favorably
with the accuracies of $59$\%, $58$\% and $53$\% reported for three samples in \cite{slti:00}.
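
The accuracy figure for the first sample can be checked directly from
Table 4, assuming accuracy is measured as the fraction of articles on
the diagonal of the confusion matrix (each column sums to the $100$
articles sampled from that newsgroup):
\[ \frac{87+90+84+88+92}{5\times 100} = \frac{441}{500} = 88.2\%. \]
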
Below we also list the top few words for
each cluster, computed by mutual information.
886: 
887: \begin{verbatim}
888: Cluster 1:
889:  armenian israel arab palestinian peopl jew isra
890:  iran muslim kill turkis war greek iraqi adl call
891: 
892: Cluster 2: 
893:  imag file bit green gif mail graphic colour
894:  group version comput jpeg blue xv ftp ac uk list
895: 
896: Cluster 3: 
897:  univers space nasa theori system mission henri 
898:  moon cost sky launch orbit shuttl physic work 
899: 
900: Cluster 4: 
901:  clutch year game gant player team hirschbeck 
902:  basebal won hi lost ball defens base run win
903: 
904: Cluster 5: 
905:  bike dog lock ride don wave drive black
906:  articl write apr motorcycl ca turn dod insur
907: \end{verbatim}
908: 
\section{Conclusions and future work}\label{se:con}
In this paper, we formulate a class of clustering problems as
bipartite graph partitioning problems, and we show that
efficient optimal solutions can be found by computing the
partial singular value decomposition of some scaled edge weight
matrices. However, we have also shown that
many challenging problems remain. One area that needs further investigation
is the selection of cut points and the number of clusters using
multiple left and right singular vectors, and
the possibility of adding local refinements to improve
clustering quality.\footnote{It will be
difficult to use local refinement for PDDP
because it does not have a global objective function
for minimization.} Another area is to find
efficient algorithms for handling overlapping clusters. Finally,
the treatment of missing data under our bipartite graph model,
especially when we apply our spectral clustering methods to
the data analysis of recommender systems, also deserves
further investigation.
928: 
929: 
930: \section{Acknowledgments}
The work of Hongyuan Zha and Xiaofeng He was supported in
part by NSF grant CCR-9901986. The work of Xiaofeng He,
Chris Ding and Horst Simon was supported in
part by the Department of Energy through an LBL LDRD fund.
935: 
936: \bibliographystyle{plain}
937: \bibliography{ref}
938: 
939: 
940: \appendix
941: \section{Some proofs} 
In this appendix we prove three
results. 1) All the
eigenvalues of $D^{-1/2}WD^{-1/2}$ have absolute value at
most $1$. Equivalently, we need to prove that the eigenvalues
of the generalized eigenvalue problem $Wx = \lambda Dx$
have absolute value at
most $1$. In fact, let $x=(x_i)_{i=1}^n$ be an eigenvector and let $i$ be such that
$|x_i| = \max_j |x_j|$. Then it follows from
\[ \lambda d_i x_i = \sum_{j=1}^n w_{ij} x_j\]
that
\[ |\lambda| \leq \sum_{j=1}^n \frac{w_{ij}}{d_i}\frac{|x_j|}{|x_i|}
 \leq \sum_{j=1}^n w_{ij}/d_i = 1.\]
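
The bound is attained, as a two-node example shows (the matrix is
chosen here for illustration only). For
\[ W = \left[\begin{array}{cc}
             0&1\\
             1&0
       \end{array}\right], \quad D = \diag(1,1), \]
we have $D^{-1/2}WD^{-1/2}=W$, whose eigenvalues are $1$ and $-1$ with
eigenvectors $(1,1)^T$ and $(1,-1)^T$. Both $\lambda=1$ and
$\lambda=-1$ can therefore occur, and the absolute value in the
statement cannot be dropped.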
953: 
954: 2) We prove that
955: \[ \sigma_{\max}(D_X^{-1/2}WD_Y^{-1/2})
956: =\max_{x \neq 0, y \neq 0} \frac{2 x^TWy}{x^TD_Xx + y^TD_Yy}.\]
957: Let $\hat{x}= D^{1/2}_Xx$ and $\hat{y}= D^{1/2}_Yy$, then
958: \begin{equation}\label{eq:ff}
959: \frac{2 x^TWy}{x^TD_Xx + y^TD_Yy} = 
960:  \frac{2 \hat{x}^TD_X^{-1/2}WD_Y^{-1/2}\hat{y}}
961: {\hat{x}^T\hat{x} + \hat{y}^T\hat{y}}.\end{equation}
962: Let $D_X^{-1/2}WD_Y^{-1/2}=U\Sigma V^T$ be its SVD with
963: \[ U = [u_1, \dots, u_m], \quad V=[v_1,\dots, v_n] \]
964: and
965: \[ \Sigma = \diag(\sigma_1, \dots, \sigma_{\min\{m,n\}}), \quad
966: \sigma_1 = \sigma_{\max}(D_X^{-1/2}WD_Y^{-1/2}).\] Then
967: we can expand $\hat{x}$ and $\hat{y}$ as
968: \begin{equation}\label{eq:hh}
969:  \hat{x} = \sum_{i} \hat{x}_i u_i, \quad \hat{y} = \sum_{i} \hat{y}_i v_i, 
970: \end{equation}
971: and (\ref{eq:ff}) becomes
972: \[ \frac{2\sum_{i} \sigma_i \hat{x}_i\hat{y}_i}{\sum_i \hat{x}_i^2 + 
973: \sum_i \hat{y}_i^2}
974: \leq \frac{2\sigma_1 \sqrt{\sum_i \hat{x}_i^2}\sqrt{\sum_i \hat{y}_i^2}}
975: {\sum_i \hat{x}_i^2 + \sum_i \hat{y}_i^2} \leq \sigma_1.\]
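Here the first inequality follows from $\sigma_i \leq \sigma_1$ and the
Cauchy--Schwarz inequality, and the second from $2ab \leq a^2+b^2$.
Taking $\hat{x}_1=\hat{y}_1=1$ and all other coefficients equal to
zero achieves the maximum.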
977: 
3) Now we consider
the constraint
\[ x^TD_Xe + y^TD_Ye = 0,\]
which is equivalent to
$\hat{x}_1+\hat{y}_1=0$ using the expansions in (\ref{eq:hh}),
since $u_1$ and $v_1$ are proportional to $D_X^{1/2}e$ and
$D_Y^{1/2}e$, which have the same Euclidean norm.
We can always scale the vectors $\hat{x}$ and $\hat{y}$
without changing the maximum so that
$\hat{x}_1 \geq 0$ and $\hat{y}_1 \geq 0$.
Hence $\hat{x}_1+\hat{y}_1=0$ implies that
$\hat{x}_1=\hat{y}_1=0$. It is then easy to see that
\[\sigma_2 = \max\left\{ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy}\;\; | \;\;
x^TD_Xe + y^TD_Ye = 0 \right\},\]
and the maximum is achieved by the left and
right singular vectors of $D_X^{-1/2} W D_Y^{-1/2}$ associated with
its second largest singular value.
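
Results 2) and 3) can be verified by hand on a small example. The
matrix is chosen for illustration, and we take $D_X$ and $D_Y$ to be
the diagonal matrices of row and column sums of $W$, consistent with
the normalization $\sum_j w_{ij}/d_i = 1$ used in 1). Let
\[ W = \left[\begin{array}{cc}
             3&1\\
             1&3
       \end{array}\right], \quad D_X = D_Y = \diag(4,4), \]
so that $D_X^{-1/2}WD_Y^{-1/2} = W/4$ has singular values
$\sigma_1=1$ and $\sigma_2=1/2$. For $x=y=e=(1,1)^T$,
\[ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy} = \frac{2\cdot 8}{8+8} = 1 = \sigma_1, \]
while $x=y=(1,-1)^T$ satisfies $x^TD_Xe + y^TD_Ye = 0$ and gives
\[ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy} = \frac{2\cdot 4}{8+8} = \frac{1}{2} = \sigma_2. \]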
992: 
993: 
994: \end{document}