0108:cs0108018/final.tex

1: \documentclass{sig-alternate}

2:

3: \newcommand{\cut}{{\rm cut}}

4: \newcommand{\Ncut}{{\rm Ncut}}

5: \newcommand{\s}{{\rm s}}

6: \newcommand{\diag}{{\rm diag}}

7: \newcommand{\op}{{\rm op}}

8: \newcommand{\R}{{\cal R}}

9: \newcommand{\tf}{{\rm tf}}

10: \newcommand{\df}{{\rm df}}

11: \newcommand{\sre}{{\rm sre}}

12: \newcommand{\svd}{{\rm svd}}

13: \newcommand{\nnz}{{\rm nnz}}

14: \newcommand{\trace}{{\rm trace}}

15:

16: \begin{document}

17: %

18: % --- Author Metadata here ---

19: \conferenceinfo{CIKM}{'01 November 5-10, 2001, Atlanta, Georgia, USA}

20: \CopyrightYear{2001} % Allows default copyright year (2000) to be over-ridden - IF NEED BE.

21: %\crdata{0-12345-67-8/90/01}  % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.

22: % --- End of Author Metadata ---

23:

24: \title{Bipartite Graph Partitioning and Data

25: Clustering\titlenote{Part of this work was done while Xiaofeng He

26: was a graduate research assistant at NERSC, Berkeley National Lab.}

27: }

28: %\subtitle{[Extended Abstract]

29: %\titlenote{A full version of this paper is available as

30: %\textit{Author's Guide to Preparing ACM SIG Proceedings Using

31: %\LaTeX$2_\epsilon$\ and BibTeX} at

32: %\texttt{www.acm.org/eaddress.htm}}}

33: %

34: % You need the command \numberofauthors to handle the ``boxing''

35: % and alignment of the authors under the title, and to add

36: % a section for authors number 4 through n.

37: %

38: % Up to the first three authors are aligned under the title;

39: % use the \alignauthor commands below to handle those names

40: % and affiliations. Add names, affiliations, addresses for

41: % additional authors as the argument to \additionalauthors;

42: % these will be set for you without further effort on your

43: % part as the last section in the body of your article BEFORE

44: % References or any Appendices.

45:

46: \numberofauthors{3}

47: %

48: % You can go ahead and credit authors number 4+ here;

49: % their names will appear in a section called

50: % ``Additional Authors'' just before the Appendices

51: % (if there are any) or Bibliography (if there

52: % aren't)

53:

54: % Put no more than the first THREE authors in the \author command

55: \author{

56: %

57: % The command \alignauthor (no curly braces needed) should

58: % precede each author name, affiliation/snail-mail address and

59: % e-mail address. Additionally, tag each line of

60: % affiliation/address with \affaddr, and tag the

61: %% e-mail address with \email.

62: \alignauthor Hongyuan Zha \\[2pt] Xiaofeng He\\

63:        \affaddr{Dept. of Comp. Sci. \& Eng.}\\

64:        \affaddr{Penn State Univ.}\\

65:        \affaddr{State College, PA 16802}\\

66:        \email{\{zha,xhe\}@cse.psu.edu}

67: \alignauthor Chris Ding \\[2pt] Horst Simon\\

68:        \affaddr{NERSC Division}\\

69:        \affaddr{Berkeley National Lab.}\\

70:        \affaddr{Berkeley, CA 94720}\\

71:        \email{\{chqding,hdsimon\}@lbl.gov}

72: \alignauthor Ming Gu\\

73:        \affaddr{Dept. of Math.}\\

74:        \affaddr{U.C. Berkeley}\\

75:        \affaddr{Berkeley, CA 94720}\\

76:        \email{mgu@math.berkeley.edu}

77: }

78: %\additionalauthors{Additional authors: John Smith (The Th{\o}rv\"{a}ld Group,

79: %email: {\texttt{jsmith@affiliation.org}}) and Julius P.~Kumquat

80: %(The Kumquat Consortium, email: {\texttt{jpkumquat@consortium.net}}).}

81: %\date{30 July 1999}

82: \maketitle

83: \begin{abstract}

84: Many data types arising from data mining applications

85: can be modeled as bipartite graphs; examples include

86: terms and documents in a text corpus, customers and

87: purchasing items in market basket analysis and reviewers

88: and movies in a movie recommender system. In this paper,

89: we propose a new data clustering method based on

90: partitioning the underlying bipartite graph. The partition

91: is constructed by minimizing a {\it normalized}

92: sum of edge weights between {\it unmatched} pairs of vertices

93: of the bipartite graph.

94: We show that an approximate solution to the minimization

95: problem can be obtained by computing

96: a partial singular value decomposition (SVD)

97: of the associated edge weight

98: matrix of the bipartite graph. We  point out the connection

99: of our clustering algorithm to correspondence analysis used in

100: multivariate analysis. We also briefly discuss the issue

101: of assigning data objects to multiple clusters.

102: In the experimental results, we apply our clustering

103: algorithm to the problem of document clustering to illustrate its

104: effectiveness and efficiency.

105: \end{abstract}

106:

107: % A category with only the three required fields

108: \category{H.3.3}{Information Search and Retrieval}{Clustering}

109: \category{G.1.3}{Numerical Linear Algebra}{Singular value decomposition}

110: %A category including the fourth, optional field follows...

111: \category{G.2.2}{Graph Theory}{Graph algorithms}

112:

113: \terms{Algorithms, theory}

114:

115: \keywords{document clustering, bipartite graph, graph partitioning,

116: spectral relaxation, singular value decomposition,

117: correspondence analysis}

118:

119: \section{Introduction}\label{sec:int}

120: Cluster analysis is an important tool for exploratory data mining

121: applications arising from many diverse disciplines. Informally,

122: cluster analysis seeks to partition a given data set into compact

123: clusters so that data objects within a cluster are more similar

124: than those in distinct clusters. The literature on cluster analysis

125: is enormous and includes contributions from many research communities

126: (see \cite{Everitt,Gordon} for

127: recent surveys of some classical approaches). Many traditional

128: clustering algorithms are based

129: on the assumption that the given dataset

130: consists of covariate information (or attributes) for each individual

131: data object, and cluster analysis can be cast as a problem of

132: grouping a set of $n$-dimensional vectors each representing

133: a data object in the dataset. A familiar example

134: is document clustering using the vector space

135: model \cite{bele:00}. Here each document

136: is represented by an $n$-dimensional vector, and each

137: coordinate of the vector corresponds to a term in a vocabulary of size $n$.

138:  This formulation

139: leads to  the so-called term-document matrix $A=(a_{ij})$

140: for the representation of the collection of documents,

141: where $a_{ij}$ is the

142: so-called term frequency, i.e.,

143:  the number of times term $i$ occurs in document $j$.

144: In this vector

145: space model terms and documents are treated asymmetrically with

146: terms considered as the covariates or attributes of documents. It is

147: also possible to treat both terms and documents as first-class citizens

148: in a symmetric fashion, and consider $a_{ij}$ as the frequency of

149: co-occurrence of term $i$ and document $j$ as is done,

150: for example,  in probabilistic

151: latent semantic indexing \cite{hoff:99}.\footnote{Our clustering

152: algorithm computes an approximate global optimal solution while probabilistic

153: latent semantic indexing relies on the EM algorithm and therefore might be

154: prone to local minima even with the help of some annealing process.}

155: In this paper, we

156: follow this basic principle and propose a new approach

157: to model terms and documents as vertices in a bipartite graph

158: with edges of the graph indicating the co-occurrence of terms and documents.

159: In addition

160: we can optionally

161:  use edge weights to indicate the frequency of this co-occurrence.

162: Cluster analysis for document collections

163: in this context is based on a very intuitive notion: documents

164: are grouped by topic. On the one hand,

165: documents on a topic tend to use the same subset of terms more heavily,

166: and these terms form a term cluster;

167: on the other hand, a topic is usually characterized

168: by a subset of terms, and documents that use those terms heavily tend to

169: be about that particular topic. It is this interplay of terms and

170: documents which gives rise to what we call bi-clustering by which

171: terms and documents are simultaneously grouped into

172: {\it semantically coherent} clusters.

173:

174: Within our bipartite graph model, the clustering problem can be

175: solved by constructing vertex partitions of the graph. Many criteria have been

176: proposed for measuring the quality of graph partitions

177: of undirected graphs \cite{Chung,Shi}. In this paper, we show

178: how to adapt those criteria for bipartite graph partitioning and

179: therefore solve the bi-clustering problem. A great variety of

180: objective functions have been proposed for cluster analysis without

181: efficient algorithms for finding the (approximate) optimal solutions.

182: We will show that our bipartite graph formulation naturally leads to

183: partial SVD problems for the underlying edge weight matrix

184: which admit efficient

185: {\it global} optimal solutions.

186: The rest of the paper

187: is organized as follows: in section \ref{se:bi}, we propose a

188: new criterion for

189: bipartite graph partitioning which tends to produce balanced

190: clusters. In section \ref{se:svd}, we show that our criterion

191: leads to an optimization problem that can be approximately

192: solved by computing a partial SVD

193: of the weight matrix of the bipartite graph.

194: In section \ref{se:corr}, we connect our approximate

195: solution to correspondence analysis used in multivariate data

196: analysis. In section \ref{se:over}, we briefly

197: discuss how to deal with clusters with overlaps.

198: In section \ref{se:exp}, we

199: describe experimental results on

200: bi-clustering a dataset of newsgroup articles. We conclude the paper

201: in section \ref{se:con} and give pointers to future research.

202:

203: \section{Bipartite graph partitioning}\label{se:bi}

204: We denote a graph by $G(V,E)$, where $V$ is the vertex set

205: and $E$ is the edge set of the

206: graph. A graph $G(V,E)$

207: is {\it bipartite} with two vertex

208: classes $X$ and $Y$ if $V = X\cup Y$ with

209: $X\cap Y = \emptyset$ and each edge in $E$ has

210: one endpoint in $X$ and one endpoint in $Y$.

211: We consider a weighted bipartite graph $G(X,Y,W)$ with

212: $W = (w_{ij})$ where $w_{ij} > 0$ denotes the weight of the

213: edge between vertex $i$ and $j$. We let $w_{ij}=0$ if there

214: is no edge between vertices $i$ and $j$.

215: In the context

216: of document clustering, $X$ represents

217: the set of  terms and $Y$ represents the set of documents, and $w_{ij}$

218: can be used to denote the number of times term $i$ occurs in

219: document $j$.

220: A vertex partition of $G(X,Y,W)$

221: denoted by $\Pi(A,B)$

222: is defined by  a partition of the vertex sets

223: $X$ and $Y$, respectively: $X=A\cup A^c$, and $Y=B\cup B^c$, where

224: for a set $S$, $S^c$ denotes its complement. By convention,

225: we pair $A$ with $B$, and $A^c$ with $B^c$. We say that

226: a pair of vertices $ x \in X$ and $y \in Y$ is {\it matched} with

227: respect to a partition $\Pi(A,B)$ if there is an edge

228: between $x$ and $y$, and either $x \in A$ and $y \in B$ or

229: $x \in A^c$ and $y \in B^c$. For any two subsets of vertices

230: $ S \subset X$ and $T \subset Y$, define

231: \[ W(S,T) = \sum_{i \in S, j \in T} w_{ij},\]

232: i.e., $W(S,T)$ is the sum of the weights of edges with one

233: endpoint in $S$ and one endpoint in $T$. The quantity

234: $W(S,T)$ can be

235: considered as measuring the association

236: between the  vertex sets $S$ and

237: $T$. In the context of cluster analysis

238: edge weight measures the similarity between data objects.

239: To partition data objects into

240: clusters, we seek a partition of $G(X,Y,W)$ such that the

241: association (similarity)

242: between unmatched vertices is as small as possible.

243: One possibility is to consider for a partition $\Pi(A,B)$

244: the following quantity

245: \begin{equation}\label{eq:cut}

246: \begin{array}{ll} \cut(A,B) & \equiv W(A,B^c) + W(A^c, B)\\[3pt]

247:    & = \sum_{i \in A, j \in B^c} w_{ij} +

248:      \sum_{i \in A^c, j \in B} w_{ij}.

249: \end{array}

250: \end{equation}

251: Intuitively, choosing $\Pi(A,B)$

252: to minimize $\cut(A,B)$ will give rise to a partition that

253: minimizes the sum of all the edge weights between unmatched

254: vertices. In the context of document clustering, we try to find

255: two document clusters $B$ and $B^c$ which have few terms in

256: common, and the documents in $B$ mostly use terms in $A$ and

257: those in $B^c$ use terms in $A^c$.

258: Unfortunately, choosing a partition based entirely on

259: $\cut(A,B)$ tends to produce unbalanced clusters, i.e.,

260: the sizes of $A$ and/or $B$ or their complements tend to be

261: small.

262: Inspired by the work in \cite{Chung,Driessche,Shi}, we propose

263: the following normalized variant of the edge cut in (\ref{eq:cut})

264: \[ \Ncut(A,B) \equiv \frac{\cut(A,B)}{W(A,Y) + W(X,B)}\]

265: \[          + \frac{\cut(A^c,B^c)}{W(A^c,Y) + W(X,B^c)}.\]

266: The intuition behind this criterion is that we want not only

267: a partition with small edge cut, but also the two

268: subgraphs formed between the matched vertices to be as dense

269: as possible. This latter requirement is partially

270: satisfied by introducing

271: the normalizing denominators in the above equation.\footnote{A

272: more natural criterion seems to be

273: \[\frac{\cut(A,B)}{W(A,B)}

274:           + \frac{\cut(A^c,B^c)}{W(A^c,B^c)}.\]

275: However, it can be shown that it leads to an SVD

276: problem with the same set of left and right singular vectors.}

277: Our bi-clustering problem is now equivalent to

278: the following optimization problem

279: \[ \min_{\Pi(A,B)} \Ncut(A,B),\]

280: i.e., finding partitions of the vertex sets $X$ and $Y$ to

281: minimize the normalized cut of the bipartite graph $G(X,Y,W)$.
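As a small illustration (the numbers here are hypothetical and chosen only
to make the arithmetic easy), take two terms and two documents with
\[ W = \left[\begin{array}{cc} 3 & 1\\ 1 & 2 \end{array}\right]. \]
For $A=\{1\}$, $B=\{1\}$ we get $\cut(A,B) = w_{12}+w_{21} = 2$,
$W(A,Y)+W(X,B) = 4+4$ and $W(A^c,Y)+W(X,B^c) = 3+3$, so
\[ \Ncut(A,B) = \frac{2}{8} + \frac{2}{6} = \frac{7}{12}. \]
Pairing the first term with the second document instead, i.e.,
$A=\{1\}$, $B=\{2\}$, gives $\cut(A,B) = w_{11}+w_{22} = 5$ and
\[ \Ncut(A,B) = \frac{5}{4+3} + \frac{5}{3+4} = \frac{10}{7}, \]
so the criterion prefers the first partition, which keeps the heavy
edges between matched vertices.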

282:

283: \section{Approximate solutions using singular vectors}\label{se:svd}

284: Given a bipartite graph $G(X,Y,W)$

285: and an associated partition $\Pi(A,B)$, let us reorder

286: the vertices of $X$ and $Y$ so that vertices in $A$ and $B$ are

287: ordered before vertices in $A^c$ and $B^c$, respectively. The

288: weight matrix $W$ can be written in a block format

289: \begin{equation}\label{eq:w} W = \left[\begin{array}{cc}

290:        W_{11}&W_{12}\\

291:        W_{21}&W_{22}

292:        \end{array}\right],\end{equation}

293: i.e., the rows of $W_{11}$ correspond to

294: the vertices in the vertex set $A$ and

295: the columns of $W_{11}$ correspond to

296: those in $B$. Therefore

297: $G(A,B,W_{11})$  denotes the weighted bipartite graph

298: corresponding to the vertex sets $A$ and $B$.

299: For any $m$-by-$n$ matrix

300: $H = (h_{ij})$, define

301: \[ \s(H) = \sum_{i=1}^m \sum_{j=1}^n h_{ij},\]

302: i.e., $\s(H)$ is the sum of all the elements of $H$.

303: It is easy to see from the definition of $\Ncut$ that

304: \[ \Ncut(A,B) = \frac{\s(W_{12}) + \s(W_{21})}{2\s(W_{11}) +\s(W_{12})

305: +\s(W_{21})}\]\[ +

306: \frac{\s(W_{12}) + \s(W_{21})}{2\s(W_{22}) +\s(W_{12})

307: +\s(W_{21})}.\]
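To spell out this step, the block sums relate to the quantities in the
definition of $\Ncut$ as follows (a short check that uses only the block
form (\ref{eq:w})):
\[ \begin{array}{l}
\cut(A,B) = \cut(A^c,B^c) = \s(W_{12}) + \s(W_{21}),\\[3pt]
W(A,Y) = \s(W_{11}) + \s(W_{12}), \quad W(X,B) = \s(W_{11}) + \s(W_{21}),\\[3pt]
W(A^c,Y) = \s(W_{21}) + \s(W_{22}), \quad W(X,B^c) = \s(W_{12}) + \s(W_{22}).
\end{array} \]
Adding the two terms in each denominator gives the expression above.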

308: In order to make connections to

309: SVD problems, we

310: first consider the case when $W$ is symmetric.\footnote{A different

311: proof for the

312: symmetric case was first derived in \cite{Shi}. However, our derivation

313: is simpler and more transparent and leads naturally to the SVD

314: problems for the rectangular case.}

315: It is easy

316: to see that with $W$ symmetric (denoting $\Ncut(A,A)$ by $\Ncut(A)$),

317: we have

318: \begin{equation}\label{eq:sym}

319:  \Ncut(A) = \frac{\s(W_{12})}{\s(W_{11})+\s(W_{12})}

320: + \frac{\s(W_{12})}{\s(W_{22})+\s(W_{12})}.\end{equation}

321: Let $e$ be the vector

322: with all its elements equal to $1$. Let $D$ be the diagonal matrix

323: such that $We = De$. Then $(D-W)e=0$. Let $x=(x_i)$ be the vector with

324: \[  x_i = \left\{\begin{array}{rl}

325:                  1, & i \in A,\\

326:                 -1, & i \in A^c.

327:                  \end{array}

328:            \right.

329: \]

330: It is easy to verify that

331: \[ \s(W_{12}) = x^T(D-W)x/4.\]
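One way to check this identity: since $x_i^2=1$ and $W$ is symmetric,
so that $\s(W_{21})=\s(W_{12})$, we have
\[ \begin{array}{l}
x^TDx = e^TDe = \s(W_{11}) + 2\s(W_{12}) + \s(W_{22}),\\[3pt]
x^TWx = \s(W_{11}) - 2\s(W_{12}) + \s(W_{22}),
\end{array} \]
and subtracting the second line from the first leaves
$x^T(D-W)x = 4\s(W_{12})$.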

332: Define

333: \[ p \equiv \frac{\s(W_{11})+\s(W_{12})}{\s(W_{11})+2\s(W_{12})+\s(W_{22})}

334:      =\frac{\s(W_{11})+\s(W_{12})}{e^TDe}.\]

335: Then

336: \[ \begin{array}{c}

337: \s(W_{11})+\s(W_{12}) = p e^TDe, \\[3pt]

338:    \s(W_{22})+\s(W_{12}) = (1-p)e^TDe,

339: \end{array}

340: \]

341: and

342: \begin{equation}\label{eq:n}

343:  \Ncut(A) = \frac{x^T(D-W)x}{4p(1-p)e^TDe}.

344: \end{equation}

345: Since $(D-W)e=0$, for any scalar $s$ we have

346: \[ (se+x)^T(D-W)(se+x)= x^T(D-W)x.\]

347: To cast (\ref{eq:n}) in the form of a Rayleigh quotient,

348: we need to find

349: $s$ such that

350: \[ (se+x)^TD(se+x)= 4p(1-p)e^TDe.\]

351: Since $x^TDx = e^TDe$, it follows from the above equation that

352: $s = 1-2p$. Now let $y=(1-2p)e + x$; it is easy to see that

353: $y^TDe = ((1-2p)e + x)^TDe = 0$, and

354: \[  y_i = \left\{\begin{array}{rl}

355:                  2(1-p)>0, & i \in A,\\

356:                 -2p<0, & i \in A^c.

357:                  \end{array}

358:            \right.

359: \]
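The value of $s$ can be recovered as follows (a sketch of the omitted
algebra). From the definition of $p$,
\[ x^TDe = (\s(W_{11})+\s(W_{12})) - (\s(W_{12})+\s(W_{22})) = (2p-1)e^TDe, \]
so that, using $x^TDx = e^TDe$,
\[ (se+x)^TD(se+x) = (s^2 + 2s(2p-1) + 1)e^TDe. \]
Equating this with $4p(1-p)e^TDe$ and using $1-4p(1-p) = (2p-1)^2$ gives
$(s + 2p - 1)^2 = 0$, i.e., $s=1-2p$. The same expression for $x^TDe$
yields $y^TDe = ((1-2p) + (2p-1))e^TDe = 0$.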

360: Thus

361: \[ \min_{A} \Ncut(A) = \min \left\{ \frac{y^T(D-W)y}{y^TDy}

362: \;\; | \;\; y \in S\right\},\]

363: where

364: \[ S=\{ y \;\; | \;\;

365: y^TDe =0, y_i

366: \in \{ 2(1-p), -2p\} \}.\]

367: If we drop the constraints $y_i

368: \in \{ 2(1-p), -2p\}$ and let

369: the elements of $y$ take

370: arbitrary continuous values, then the optimal $y$ can be approximated by

371: the following relaxed {\it continuous} minimization problem,

372: \begin{equation}\label{eq:y} \min \left\{ \frac{y^T(D-W)y}{y^TDy}

373: \;\; | \;\;y^TDe =0\right\}.\end{equation}

374: Notice that it follows from $We = De$ that

375: \[ D^{-1/2}WD^{-1/2} (D^{1/2}e) = D^{1/2}e,\]

376: and therefore $D^{1/2}e$ is an eigenvector of

377: $D^{-1/2}WD^{-1/2}$ corresponding to the eigenvalue $1$. It

378: is easy to show that all the eigenvalues of $D^{-1/2}WD^{-1/2}$

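379: have absolute value at most $1$ (see the Appendix). Substituting $\hat{y} = D^{1/2}y$ turns the quotient in (\ref{eq:y}) into $1 - \hat{y}^TD^{-1/2}WD^{-1/2}\hat{y}/\hat{y}^T\hat{y}$ and the constraint into $\hat{y}^TD^{1/2}e = 0$. Thus the optimal $y$ in

380: (\ref{eq:y}) can be computed as $y = D^{-1/2}\hat{y}$, where

381: $\hat{y}$ is the eigenvector of $D^{-1/2}WD^{-1/2}$

382: corresponding to the {\it second} largest eigenvalue.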

383:

384: Now we return to the rectangular case for the weight matrix $W$,

385: and let $D_X$ and $D_Y$ be diagonal matrices such that

386: \begin{equation}\label{eq:xy}

387: We = D_X e, \quad W^Te = D_Ye.

388: \end{equation}

389: Consider a partition $\Pi(A,B)$, and define

390: \[ u_i = \left\{\begin{array}{rl}

391:                  1, & i \in A\\

392:                 -1, & i \in A^c

393:                  \end{array}

394:            \right., \quad

395: v_i = \left\{\begin{array}{rl}

396:                  1, & i \in B\\

397:                 -1, & i \in B^c

398:                  \end{array}

399:            \right.

400: \]

401: Let $W$ have the block form as in (\ref{eq:w}), and consider the

402: augmented symmetric matrix\footnote{In \cite{heko:00}, the Laplacian

403: of $\hat{W}$ is used for partitioning a rectangular matrix

404: in the context of designing load-balanced matrix-vector multiplication

405: algorithms for parallel computation. However, the eigenvalue

406: problem of the Laplacian

407: of $\hat{W}$ does not lead to a simpler singular value problem.}

408: \[ \hat{W} = \left[\begin{array}{cc}

409:                    0 & W\\

410:                    W^T & 0

411:              \end{array}\right]

412:    = \left[\begin{array}{cc|cc}

413:                    0 & 0 & W_{11} & W_{12}\\

414:                    0 & 0 & W_{21} & W_{22}\\ \hline

415:                    W_{11}^T & W_{21}^T & 0 & 0 \\

416:                    W_{12}^T & W_{22}^T & 0 & 0

417:                       \end{array}\right].\]

418: If we interchange the second and third block rows and columns

419: of the above matrix, we obtain

420: \[ \left[\begin{array}{cc|cc}

421:                    0 & W_{11} & 0 & W_{12}\\

422:                    W_{11}^T & 0 & W_{21}^T & 0\\ \hline

423:                    0 & W_{21} & 0 & W_{22} \\

424:                    W_{12}^T & 0 & W_{22}^T & 0

425:                       \end{array}\right] \equiv

426:     \left[\begin{array}{cc}

427:                    \hat{W}_{11} & \hat{W}_{12}\\

428:                    \hat{W}_{12}^T & \hat{W}_{22}

429:              \end{array}\right],\]

430: and the normalized cut can be written as

431: \[ \Ncut(A,B) = \frac{\s(\hat{W}_{12})}{\s(\hat{W}_{11})+\s(\hat{W}_{12})}

432: + \frac{\s(\hat{W}_{12})}{\s(\hat{W}_{22})+\s(\hat{W}_{12})},\]

433: a form that resembles the symmetric case (\ref{eq:sym}). Define

434: \[ q = \frac{2\s(W_{11}) +\s(W_{12})

435: +\s(W_{21})}{e^TD_Xe + e^TD_Ye}.\]

436: Then we have

437: \[ \Ncut(A,B) = \frac{-2x^TWy + x^TD_Xx + y^TD_Yy}{x^TD_Xx + y^TD_Yy}\]\[

438:               = 1- \frac{2x^TWy}{x^TD_Xx + y^TD_Yy},\]

439: where $x = (1-2q)e +u$ and $y = (1-2q)e + v$. It is also easy to see that

440: \begin{equation}\label{eq:q}

441: x^TD_Xe + y^TD_Ye = 0, \quad x_i, y_i \in \{ 2(1-q), -2q\}.

442: \end{equation}
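The algebra behind this representation can be sketched as follows; the
shorthand $w \equiv e^TD_Xe = e^TD_Ye$ for the total edge weight and the
scalar $t$ are introduced here only for this check. Since $u_i^2=v_i^2=1$,
\[ \begin{array}{l}
u^TD_Xu + v^TD_Yv - 2u^TWv = 4(\s(W_{12}) + \s(W_{21})) = 4\,\cut(A,B),\\[3pt]
u^TD_Xe + v^TD_Ye = 2(\s(W_{11}) - \s(W_{22})) = 2(2q-1)w.
\end{array} \]
For $x = te+u$ and $y = te+v$, the relations (\ref{eq:xy}) imply that
$x^TD_Xx + y^TD_Yy - 2x^TWy$ does not depend on $t$, while
\[ x^TD_Xx + y^TD_Yy = 2w(t^2 + 2t(2q-1) + 1), \quad
   x^TD_Xe + y^TD_Ye = 2w(t + 2q - 1). \]
Choosing $t = 1-2q$ makes the second quantity zero and the first equal to
$8q(1-q)w$, so the quotient becomes
\[ \frac{4\,\cut(A,B)}{8q(1-q)w} = \frac{\cut(A,B)}{2qw} + \frac{\cut(A,B)}{2(1-q)w}, \]
which is $\Ncut(A,B)$ because $W(A,Y)+W(X,B) = 2qw$ and
$W(A^c,Y)+W(X,B^c) = 2(1-q)w$.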

443: Therefore,

444: \[ \min_{\Pi(A,B)}

445: \Ncut(A,B)\]\[ = 1-\max_{x \neq 0, y \neq 0}

446: \left\{ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy}

447: \;\; | \;\;  x, y \; \mbox{\rm satisfy } (\ref{eq:q})\right\}.\]

448: Ignoring the discrete constraints on the elements of $x$ and $y$, we

449: have the following continuous maximization problem,

450: \begin{equation}\label{eq:yz}

451: \max_{x \neq 0, y \neq 0}

452: \left\{ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy}\;\; | \;\;

453: x^TD_Xe + y^TD_Ye = 0 \right\}.

454: \end{equation}

455: Without the constraints

456: $x^TD_Xe + y^TD_Ye = 0$, the above problem is equivalent to

457: computing the largest singular triplet of $D_X^{-1/2} W D_Y^{-1/2}$

458: (see the Appendix).

459: From (\ref{eq:xy}), we have

460: \[ \begin{array}{c}

461: D_X^{-1/2} W D_Y^{-1/2} (D_Y^{1/2}e) = D_X^{1/2} e, \\[3pt]

462: (D_X^{-1/2} W D_Y^{-1/2})^T (D_X^{1/2}e) = D_Y^{1/2} e,

463: \end{array}

464: \]

465: and similarly to

466: the symmetric case, it is easy to show that all the

467: singular values of $D_X^{-1/2} W D_Y^{-1/2}$

468: are at most $1$. Therefore, an optimal pair $\{x,y\}$ for

469: (\ref{eq:yz}) can be computed as

470: $x = D_X^{-1/2} \hat{x}$ and $y = D_Y^{-1/2} \hat{y}$,

471: where $\hat{x}$ and $\hat{y}$ are the {\it second}

472: largest left and right singular vectors of $D_X^{-1/2} W D_Y^{-1/2}$,

473: respectively (see the Appendix).

474: With the above discussion, we can now summarize our

475: basic approach for bipartite graph clustering incorporating

476: a recursive procedure.

477:

478: \bigskip

479:

480: \begin{center}

481: \fbox{\parbox{7.7cm}{

482: {\sc Algorithm.} Spectral Recursive Embedding (SRE)

483:

484: Given a weighted bipartite graph $G = (X,Y,E)$

485: with its edge weight matrix $W$:

486:

487:  \begin{enumerate}

488:    \item Compute $D_X$ and $D_Y$ and form the scaled weight matrix

489:          $\hat{W}=D_X^{-1/2} W D_Y^{-1/2}$.

490:    \item Compute the {\it second} largest left and right

491:          singular vectors of $\hat{W}$, $\hat{x}$ and $\hat{y}$.

492:    \item Find cut points $c_x$ and $c_y$ for $x=D_X^{-1/2}\hat{x}$

493:          and $y=D_Y^{-1/2}\hat{y}$, respectively.

494:    \item Form partitions $A=\{i \;\;| \;\;x_i \geq c_x\}$ and

495:          $A^c=\{i \;\;| \;\;x_i < c_x\}$ for vertex set $X$, and

496:          $B=\{j \;\;| \;\;y_j \geq c_y\}$ and

497:          $B^c=\{j \;\;|\;\; y_j < c_y\}$ for vertex set $Y$.

498:    \item Recursively partition the sub-graphs $G(A,B)$

499:           and $G(A^c,B^c)$ if necessary.

500:

501:  \end{enumerate}

502: }}

503: \end{center}

504:

505: \bigskip

506:

507: Two basic strategies can be used for selecting the cut points

508: $c_x$ and $c_y$. The simplest strategy is to set $c_x=0$ and

509: $c_y=0$. A more computationally intensive approach is to base

510: the selection on $\Ncut$: check $N$ equally spaced splitting

511: points of $x$ and $y$, respectively, and choose the cut

512: points $c_x$ and $c_y$ with the smallest $\Ncut$ \cite{Shi}.
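
As a minimal worked illustration of the steps of SRE (a hypothetical
$2$-by-$2$ example, not taken from the experiments), let
\[ W = \left[\begin{array}{cc} 1 & \epsilon\\ \epsilon & 1 \end{array}\right],
   \quad 0 < \epsilon < 1. \]
Then $D_X = D_Y = (1+\epsilon)I$ and $\hat{W} = W/(1+\epsilon)$, whose
singular values are $1$ and $(1-\epsilon)/(1+\epsilon)$, with left and
right singular vectors proportional to $(1,1)^T$ and $(1,-1)^T$,
respectively. Up to a common sign, the second singular vectors give $x$
and $y$ proportional to $(1,-1)^T$, and the cut points $c_x=c_y=0$ yield
$A=B=\{1\}$ and $A^c=B^c=\{2\}$. For this partition
\[ \Ncut(A,B) = \frac{2\epsilon}{2(1+\epsilon)} + \frac{2\epsilon}{2(1+\epsilon)}
   = \frac{2\epsilon}{1+\epsilon} = 1 - \frac{1-\epsilon}{1+\epsilon}, \]
which is one minus the second singular value of $\hat{W}$, so in this
small case the value suggested by the relaxation (\ref{eq:yz}) is
attained by a genuine partition.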

513:

514: {\bf Computational complexity.} The major computational cost

515: of SRE is Step 2, which computes the left and right singular vectors;

516: these can be obtained either by the power method or, more robustly,

517: by the Lanczos bidiagonalization process \cite[Chapter 9]{govl:96}.

518: The Lanczos method is an iterative process for computing

519: partial SVDs in

520: which  each iterative step involves the computation of two matrix-vector

521: multiplications $\hat{W}u$ and $\hat{W}^Tv$ for some vectors

522: $u$ and $v$. The computational cost of these is

523: roughly proportional to $\nnz(\hat{W})$,

524: the number of nonzero elements of $\hat{W}$. The total

525: computational

526: cost of SRE is $O(c_{\sre}k_{\svd}\nnz(\hat{W}))$, where

527: $c_{\sre}$ is the level of recursion and $k_{\svd}$ is the

528: number of Lanczos iteration steps. In general, $k_{\svd}$ depends on

529: the singular value gaps of $\hat{W}$. Also notice that

530: $\nnz(\hat{W})= n_w n$, where $n_w$ is the average number of

531: terms per document and $n$ is the total number of documents.

532: Therefore, the total cost of SRE is in general linear in the

533: number of documents to be clustered.

534:

535:

536: \section{Connections to correspondence analysis}\label{se:corr}

537: In its basic form correspondence analysis is applied to an

538: $m$-by-$n$ two-way

539: table of counts $W$ \cite{benz:92,gree:93,veri:99}. Let $w=\s(W)$ be

540: the sum of all the elements of $W$, and let $D_X$ and $D_Y$ be the diagonal

541: matrices defined in section \ref{se:svd}. Correspondence analysis

542: seeks to compute the largest singular triplets of the matrix

543: $Z=(z_{ij}) \in \R^{m \times n}$ with

544: \[ z_{ij}= \frac{w_{ij}/w - (D_X(i,i)/w)(D_Y(j,j)/w)}

545:              {\sqrt{(D_X(i,i)/w)(D_Y(j,j)/w)}}.\]

546: The matrix $Z$ can be considered as the correlation matrix of two

547: group indicator matrices for the original $W$ \cite{veri:99}.

548: We now show that the SVD of $Z$ is closely related to the

549: SVD of $\hat{W} \equiv D^{-1/2}_XWD_Y^{-1/2}$.

550: In fact, in section \ref{se:svd},

551: we showed that $D_X^{1/2}e$ and $D_Y^{1/2}e$ are the left and right

552: singular vectors of $\hat{W}$ corresponding to the singular value one,

553: and it is also easy to show that all the singular values of

554: $\hat{W}$ are at most $1$. Therefore,

555: the rest of the singular values and singular vectors of

556: $\hat{W}$ can be found by computing the SVD of

557: the following  rank-one modification

558: of $\hat{W}$

559: \[D^{-1/2}_XWD_Y^{-1/2}-

560: \frac{D_X^{1/2}ee^TD_Y^{1/2}}{\|D_X^{1/2}e\|_2\|D_Y^{1/2}e\|_2}\]

561: which, since $\|D_X^{1/2}e\|_2 = \|D_Y^{1/2}e\|_2 = \sqrt{w}$, has $(i,j)$ element

562: \[  \frac{w_{ij}}{\sqrt{D_X(i,i)D_Y(j,j)}} -

563:      \frac{\sqrt{D_X(i,i)D_Y(j,j)}}{w} = z_{ij},\]

564: and therefore coincides with the $(i,j)$ element of $Z$.

565: Therefore, normalized-cut based  cluster analysis and correspondence

566: analysis arrive at the same SVD problems even though they start with

567: completely different principles. It is worthwhile to explore

568: more deeply the interplay between these two different points of view and

569: approaches, for example, using the statistical analysis of

570: correspondence analysis to provide better strategies for selecting cut

571: points and estimating the number of clusters.

572:

573: \section{Partitions with overlaps}\label{se:over}

574: So far in our discussion, we have only looked at {\it hard}

575: clustering, i.e., a data object belongs to one and only

576: one cluster. In many situations, especially when there is much

577: overlap among the clusters, it is more advantageous to allow

578: data objects to belong to different clusters. For

579: example, in document clustering, certain groups of words can

580: be shared by two clusters. Is it possible

581: to model this overlap using our bipartite graph model and also

582: find efficient approximate solutions? The answer seems to be yes,

583: but our results at this point are rather preliminary and we will

584: only illustrate the possibilities. Our basic idea is that when computing

585: $\Ncut(A,B)$, we should disregard the contributions of the

586: set of vertices that is in the overlap. More specifically,

587: let $X=A\cup O_X \cup \bar{A}$ and $Y=B\cup O_Y\cup \bar{B}$, where

588: $O_X$ denotes the overlap between

589: the vertex subsets

590: $A\cup O_X$ and $\bar{A}\cup O_X$, and

591: $O_Y$ the overlap between $B\cup O_Y$ and $\bar{B}\cup O_Y$, we compute

592: \[ \Ncut(A,B,\bar{A},\bar{B}) =\frac{\cut(A,B)}{W(A,Y)+W(X,B)}\]\[

593: +\frac{\cut(\bar{A},\bar{B})}{W(\bar{A},Y)+W(X,\bar{B})}.\]

594: However, we can make

595: $\Ncut(A,B,\bar{A},\bar{B})$ smaller simply by putting more

596: vertices in the overlap. Therefore, we need to balance these

597: two competing quantities: the size of the overlap and the modified

598: normalized cut $\Ncut(A,B,\bar{A},\bar{B})$ by minimizing

599: \[ \Ncut(A,B,\bar{A},\bar{B}) + \alpha(|O_X| + |O_Y|),\]

600: where $\alpha$ is a regularization parameter. How to find an

601: efficient method for computing the (approximate) optimal

602: solution to the above minimization problem still needs to be

603: investigated. We close this section by presenting an illustrative

604: example showing that in some situations, the singular vectors

605: already automatically separate the overlap sets while giving

606: the coordinates for carrying out clustering.

607:

608: \begin{figure}[t]

609: \centerline{

610: \mbox{\psfig{file=corr1.ps,height=1.8in,width=1.6in}

611: \psfig{file=corr2.ps,height=1.8in,width=1.6in}}}

612: \caption{Sparsity patterns of a test matrix before clustering

613: (left) and after clustering (right)}

614: \label{fi:op}

615: \end{figure}

616:

617: {\sc Example 1.} We construct a sparse $m$-by-$n$ rectangular matrix

618: \[ W = \left[\begin{array}{cc}

619:              W_{11}&W_{12}\\

620:              W_{21}&W_{22}

621:        \end{array}\right]\]

622: so that $W_{11}$ and $W_{22}$ are relatively denser than $W_{12}$

623: and $W_{21}$. We also add some dense rows and columns to the matrix $W$

624: to represent row and column overlaps.

625: The left panel of Figure \ref{fi:op} shows the sparsity pattern of

626: $\bar{W}$,

627: a matrix obtained by randomly permuting

628: the rows and columns of $W$. We then compute the

629: second largest left and right singular vectors of

630: $D_X^{-1/2} \bar{W} D_Y^{-1/2}$, say $x$ and $y$, then sort the rows and

631: columns of $\bar{W}$ according to the values of the entries in

632: $D_X^{-1/2}x$ and $D_Y^{-1/2}y$, respectively. The sparsity

633: pattern of this permuted $\bar{W}$ is shown on the right panel of

634: Figure \ref{fi:op}. As can be seen, the singular vectors not

635: only do the job of clustering but at the same time also

636: concentrate the dense rows and columns at the boundary of the two

637: clusters.

638:

639: \section{Experiments}\label{se:exp}

640: In this section we present our experimental results on clustering

641: a dataset of newsgroup articles submitted to  20

642: newsgroups.\footnote{

643: The newsgroup dataset together with the {\tt bow} toolkit for

644: processing it

645: can be downloaded from

646: {\tt http://www.cs.cmu.edu/afs/cs/project/theo-11/www/}

647:

648: \noindent

649: {\tt naive-bayes.html}.}

650: This dataset contains about

651: 20,000 articles (email messages) evenly divided among the 20

652: newsgroups. We list the names of the newsgroups together

653: with the associated group labels (the labels will be

654: used in the sequel to identify the newsgroups).

655:

656:

657: \begin{verbatim}

       NG1: alt.atheism
       NG2: comp.graphics
       NG3: comp.os.ms-windows.misc
       NG4: comp.sys.ibm.pc.hardware
       NG5: comp.sys.mac.hardware
       NG6: comp.windows.x
       NG7: misc.forsale
       NG8: rec.autos
       NG9: rec.motorcycles
       NG10: rec.sport.baseball
       NG11: rec.sport.hockey
       NG12: sci.crypt
       NG13: sci.electronics
       NG14: sci.med
       NG15: sci.space
       NG16: soc.religion.christian
       NG17: talk.politics.guns
       NG18: talk.politics.mideast
       NG19: talk.politics.misc
       NG20: talk.religion.misc

678: \end{verbatim}

679:

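\begin{table*}
\caption{Comparison of spectral embedding (SRE), PDDP, and K-means
(NG1/NG2). Entries are means and standard deviations over 100 random
samples; each triplet counts the samples on which SRE performed better
than, the same as, and worse than the method in that column.}
\label{tb:12}
\begin{center}
\begin{tabular}{llrr}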

686: Mixture & SRE & PDDP & K-means \\ \hline\hline

687: 50/50   & $92.12\pm 3.52$\% & $91.90\pm 3.19$\% $(53,10,37)$&$ 76.93\pm 14.42$\% $(82,2,10)$\\ \hline

688: 50/100  & $90.57\pm 3.11$\% & $86.11\pm 3.94$\% $(86, 5,9)$&$ 76.74\pm 14.01$\% $(80,2,18)$\\ \hline

689: 50/150  & $88.04\pm 3.90$\% & $78.60\pm 5.03$\% $(98, 0, 2)$&$68.80\pm 13.55$\% $(88, 0, 12)$\\ \hline

690: 50/200  & $82.77\pm 5.24$\% & $70.43\pm 6.04$\% $(97,0,3)$&$69.22\pm 12.34$\% $(83,1,16)$\\ \hline

691: \end{tabular}

692: \end{center}

693: \end{table*}

694:

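\begin{table*}
\caption{Comparison of spectral embedding (SRE), PDDP, and K-means
(NG10/NG11). Entries are means and standard deviations over 100 random
samples; each triplet counts the samples on which SRE performed better
than, the same as, and worse than the method in that column.}
\label{tb:1011}
\begin{center}
\begin{tabular}{llrr}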

701: Mixture & SRE & PDDP & K-means \\ \hline\hline

702: 50/50   & $74.56\pm 8.93$\% & $73.40\pm 10.07$\% $(56,6,38)$ &$61.61\pm 8.77$\% $(86,0,14)$\\ \hline

703: 50/100  & $67.13\pm 7.17$\% & $67.10\pm 10.20$\% $(52,1,47)$ &$64.40\pm 9.37$\% $(59,1,40)$\\ \hline

704: 50/150  & $58.30\pm 5.99$\% & $58.72\pm 7.48$\% $(52,1,47)$ &$62.53\pm 8.20$\% $(36,1,63)$\\ \hline

705: 50/200  & $57.55\pm 5.69$\% & $56.63\pm 4.84$\% $(58,1,41)$ &$60.82\pm 7.54$\% $(39,2,59)$\\ \hline

706: \end{tabular}

707: \end{center}

708: \end{table*}

709:

We used the {\tt bow} toolkit to construct the term-document
matrix for this dataset. Specifically, we used the tokenization option
that strips the UseNet headers, and we also applied stemming
\cite{mcca:96}. Some of the newsgroups have large overlaps, for
example, the five {\tt comp.*} newsgroups about
computers. In fact, several articles are posted to multiple newsgroups.

Before applying clustering algorithms to the dataset, several
preprocessing steps need to be considered. Two standard steps
are weighting and feature selection. For weighting, we considered
a variant of the tf.idf weighting scheme,
$\tf\log_2(n/\df)$,
where $\tf$ is the term frequency, $\df$ is the document
frequency and $n$ is the number of documents, as well as several
other variations listed in \cite{bele:00}.
For feature selection, we looked at three approaches: 1)
deleting terms that occur fewer than a certain number of
times in the dataset; 2) deleting terms that
occur in fewer than a certain number of
documents in the dataset; 3) selecting terms according to the mutual
information of terms and documents, defined as
\[ I(y) = \sum_{x} p(x,y)\log\frac{p(x,y)}{p(x)p(y)},\]
where $y$ represents a term and $x$ a document \cite{slti:00}.

In general, we found that the traditional tf.idf based
weighting schemes do not improve performance for SRE. One possible
explanation comes from the connection with correspondence analysis:
the raw frequencies are samples of co-occurrence probabilities,
and the pre- and post-multiplication by $D_X^{-1/2}$ and
$D_Y^{-1/2}$ in $D_X^{-1/2}WD_Y^{-1/2}$ {\it automatically}
takes weighting into account. We did, however, find that
trimming the raw frequencies can sometimes improve
performance for SRE, especially in anomalous cases where
some words occur in certain documents an unusually large number of
times, skewing the clustering process.
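
The preprocessing choices discussed above can be illustrated with a
short Python sketch. It is not the {\tt bow} implementation: the
function names, the trimming threshold {\tt cap}, and the estimation of
$p(x,y)$ by normalized raw counts are assumptions made here for
illustration, and terms with $p(x,y)=0$ are taken to contribute zero to
$I(y)$.

\begin{verbatim}
import math

def tfidf_weight(tf, df, n):
    """tf * log2(n / df): df is the document frequency of the term
    and n the number of documents."""
    return tf * math.log2(n / df)

def trim(tf, cap):
    """Cap a raw frequency at `cap` (a hypothetical parameter)."""
    return min(tf, cap)

def mutual_information(counts):
    """counts[x][y] is the frequency of term y in document x.
    Returns the score I(y) for every term y."""
    total = sum(sum(row) for row in counts)
    n_terms = len(counts[0])
    p_doc = [sum(row) / total for row in counts]
    p_term = [sum(row[y] for row in counts) / total
              for y in range(n_terms)]
    scores = []
    for y in range(n_terms):
        s = 0.0
        for x, row in enumerate(counts):
            if row[y] > 0:              # 0 * log 0 is taken as 0
                p_xy = row[y] / total
                s += p_xy * math.log(p_xy / (p_doc[x] * p_term[y]))
        scores.append(s)
    return scores

# Toy data: two documents, two terms (not taken from the newsgroups).
scores = mutual_information([[4, 2], [0, 2]])
\end{verbatim}

In this toy example the first term, which occurs in only one document,
receives the larger score and would be kept first by the mutual
information criterion.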

743:

744:

745:

746:

For the purpose of comparison, we consider two other clustering
methods: 1) the K-means method \cite{Gordon}; 2) the principal direction
divisive partitioning (PDDP) method \cite{bole:98}. The K-means method is
a widely used cluster analysis tool. The variant we used employs
the Euclidean distance to measure the dissimilarity between
two documents. When applying K-means,
we {\it normalize} each document vector so that it has
Euclidean length one. In essence, we use the cosine of the angle
$\theta$ between two document vectors when
measuring their similarity, since for unit vectors $a$ and $b$ we have
$\|a-b\|^2 = 2 - 2\cos\theta$. We also tried K-means without
document length normalization; the results were far worse, and
we therefore do not report them. Since K-means is
an iterative method, we need to specify a stopping criterion. For
the variant we used, we compare the centroids between two
consecutive iterations, and stop when the difference is smaller
than a pre-defined tolerance.
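
A minimal Python sketch of a K-means variant of this kind is given
below. It is an illustration under stated assumptions rather than the
code used in our experiments: the names {\tt kmeans}, {\tt tol} and
{\tt max\_iter}, the seeding with the first $k$ documents, and the use
of the largest centroid displacement as the ``difference'' between
consecutive iterations are all choices made for this example.

\begin{verbatim}
import math

def unit(v):
    """Scale a document vector to Euclidean length one."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(docs, k, tol=1e-6, max_iter=100):
    points = [unit(d) for d in docs]
    centroids = [list(p) for p in points[:k]]    # assumed seeding
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid in Euclidean distance.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new.append([sum(col) / len(members)
                            for col in zip(*members)])
            else:
                new.append(centroids[c])         # keep empty cluster
        shift = max(math.sqrt(dist2(a, b))
                    for a, b in zip(centroids, new))
        centroids = new
        if shift < tol:                          # stopping criterion
            break
    return labels

# Toy documents over three terms (not taken from the newsgroups).
docs = [[3, 0, 0], [0, 2, 1], [5, 1, 0], [0, 1, 1]]
labels = kmeans(docs, 2)
\end{verbatim}

On these four toy documents the first and third end up in one cluster
and the second and fourth in the other. A different seeding can lead to
a different local minimum.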

763:

764:

PDDP is another clustering method that utilizes singular
vectors. It is based on the idea of principal component
analysis and has been shown to
outperform several standard clustering methods
such as the hierarchical agglomerative algorithm \cite{bole:98}.
First, each document is considered as a
multivariate data point. The documents are normalized
to have unit Euclidean length and then centered, i.e., let
$W$ be the term-document matrix, $w$ be the average of
the columns of $W$, and $e$ be the vector of all ones. Compute the
largest singular value triplet
$\{u,\sigma,v\}$ of
$W-we^T$. Then split the set of documents based on their
values in the vector $v=(v_i)$: one simple scheme is to
let those with
positive $v_i$ go into one cluster and those
with nonpositive $v_i$ into another cluster. Then the
whole process is repeated on the term-document matrices of
the two clusters, respectively. Although both our clustering
method SRE and PDDP
make use of the singular vectors of some versions of the
term-document matrices, they are derived from fundamentally
different principles. PDDP is a feature-based clustering method,
projecting all the data points onto the one-dimensional subspace
spanned by the first principal axis; SRE is a similarity-based
clustering method in which two co-occurring variables (terms and
documents in the context of document clustering) are
simultaneously clustered. Unlike SRE, PDDP does not
have a well-defined objective function
for minimization. It only partitions the columns of
the term-document matrix, while SRE partitions both its
rows and columns. This has a significant impact on the
computational costs.
PDDP, however, has the advantage that it can be applied to
datasets with both positive and negative values, while SRE can only be
applied to datasets with nonnegative values.
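
A single PDDP split as described above can be sketched as follows. This
is an illustration only and not the implementation of \cite{bole:98}:
the function name {\tt pddp\_split} is hypothetical, a plain power
iteration on $C^TC$ with $C=W-we^T$ stands in for a proper SVD routine,
and only one level of the recursion is shown.

\begin{verbatim}
import math

def pddp_split(W, iters=200):
    """W[t][d] is the weight of term t in document d.
    Returns two lists of document indices."""
    m, n = len(W), len(W[0])
    # Normalize every document (column) to unit Euclidean length.
    norms = [math.sqrt(sum(W[t][d] ** 2 for t in range(m)))
             for d in range(n)]
    U = [[W[t][d] / norms[d] for d in range(n)] for t in range(m)]
    # Center: subtract the mean column w, i.e. form C = W - w e^T.
    w = [sum(row) / n for row in U]
    C = [[U[t][d] - w[t] for d in range(n)] for t in range(m)]
    # Power iteration on C^T C for the leading right singular vector.
    v = [float(d + 1) for d in range(n)]       # arbitrary start
    for _ in range(iters):
        u = [sum(C[t][d] * v[d] for d in range(n)) for t in range(m)]
        v = [sum(C[t][d] * u[t] for t in range(m)) for d in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    left = [d for d in range(n) if v[d] > 0]
    right = [d for d in range(n) if v[d] <= 0]
    return left, right

# Toy term-document matrix: 3 terms (rows), 4 documents (columns).
W = [[2, 0, 3, 0],
     [1, 1, 1, 2],
     [0, 2, 0, 3]]
left, right = pddp_split(W)
\end{verbatim}

Documents 0 and 2 end up on one side of the split and documents 1 and 3
on the other. Which side is called {\tt left} depends on the sign of the
computed singular vector, which is arbitrary.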

800:

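\begin{table*}
\caption{Comparison of spectral embedding (SRE), PDDP, and K-means
(NG18/NG19). Entries are means and standard deviations over 100 random
samples; each triplet counts the samples on which SRE performed better
than, the same as, and worse than the method in that column.}
\label{tb:1819}
\begin{center}
\begin{tabular}{llrr}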

807: Mixture & SRE & PDDP & K-means \\ \hline\hline

808: 50/50  & $73.66\pm 10.53$\% & $69.52\pm 12.83$\% $(65,12,32)$ & $62.25 \pm 9.94$\% $(82,1,17)$\\ \hline

809: 50/100   & $67.23\pm 7.84$\% & $67.84\pm 7.30$\% $(46,5,49)$& $60.91\pm 7.92$\% $(65,13,32)$\\ \hline

810: 50/150  & $65.83\pm 12.79$\% & $60.37\pm 9.85$\% $(53,3,44)$ &$63.32\pm 8.26$\% $(58,3,39)$\\ \hline

811: 50/200  & $61.23\pm 9.88$\% & $60.76\pm 5.55$\% $(40,1,59)$ &$64.50\pm 7.58$\% $(34,0,66)$\\ \hline

812: \end{tabular}

813: \end{center}

814: \end{table*}

815:

816:

\begin{table*}
\caption{Confusion matrix for newsgroups $\{2, 9, 10, 15, 18\}$ (rows:
computed clusters; columns: original newsgroups)}
\label{tb:con}

820: \begin{center}

821: \begin{tabular}{|l||c|c|c|c|c|}

822: \hline\hline

823:  &mideast &graphics & space & baseball &  motorcycles \\ \hline\hline

824: cluster 1&  87&  0&   0&    2&   0\\ \hline

825: cluster 2&  7&   90&  7&    6&   7\\ \hline

826: cluster 3&  3&   9&   84&   1&   1\\ \hline

827: cluster 4&  0&   0&   1&    88&  0\\ \hline

828: cluster 5&  3&   1&   8&    3&   92\\ \hline\hline

829: \end{tabular}

830: \end{center}

831: \end{table*}

832:

833:

834:

{\sc Example 2.} In this example, we examine binary clustering
with uneven clusters. We consider three pairs of newsgroups:
newsgroups 1 and 2 are well-separated, 10 and 11 are
less well-separated, and 18 and 19 have a lot of overlap.
We used document frequency as the feature
selection criterion and deleted
words that occur in fewer than $5$ documents in each dataset we
used. For both K-means and PDDP we apply tf.idf weighting together
with document length normalization so that each document vector
has Euclidean norm one. For SRE we trim the raw frequencies
so that the maximum is $10$.
For each newsgroup pair, we select four
types of mixture of articles from the two newsgroups: $x/y$ indicates
that $x$ articles are from the first group and $y$ articles are
from the second group. The results are listed in Table
1 for groups 1 and 2, Table 2 for groups 10 and
11, and Table 3 for groups 18 and 19. We list
the means and standard deviations over 100 random samples.
For PDDP and K-means we also include a triplet of numbers
indicating on how many of the 100 samples SRE performs better than
(the first number), the same as (the second number) and worse than
(the third number) the corresponding method (PDDP or K-means).
We should emphasize that
the K-means method can only find a local minimum, and that its results
depend on the initial values and the stopping criterion. This is also
reflected in the large standard deviations associated with
the K-means method.
From the three
tests we can conclude that both SRE and PDDP outperform the K-means
method in most of the settings tested. The performance of SRE and
PDDP is similar in balanced
mixtures, but SRE is superior to PDDP in skewed mixtures.
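
The way each table entry is summarized can be made concrete with a
small Python sketch. The function name {\tt summarize} and the
accuracies below are made up for illustration (they are not data from
our experiments), and the use of the sample standard deviation is an
assumption.

\begin{verbatim}
from statistics import mean, stdev

def summarize(acc_sre, acc_other):
    """Mean and standard deviation of both methods, plus the
    (better, same, worse) triplet counted from SRE's side."""
    pairs = list(zip(acc_sre, acc_other))
    better = sum(a > b for a, b in pairs)
    same = sum(a == b for a, b in pairs)
    worse = sum(a < b for a, b in pairs)
    return {
        "sre": (mean(acc_sre), stdev(acc_sre)),
        "other": (mean(acc_other), stdev(acc_other)),
        "triplet": (better, same, worse),
    }

# Made-up accuracies on four samples.
result = summarize([0.9, 0.8, 0.7, 0.6], [0.8, 0.8, 0.9, 0.5])
\end{verbatim}

Here SRE is better on two samples, tied on one and worse on one, so the
triplet is $(2,1,1)$. By construction, the three counts add up to the
number of samples.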

866:

867:

868:

{\sc Example 3.} In this example, we consider an easy multi-cluster case:
we examine the five newsgroups $2, 9, 10, 15, 18$, which
were also considered in \cite{slti:00}. We sample 100
articles from each newsgroup and use mutual information for
feature selection.
We use the minimum normalized cut as the cut point at each level
of the recursion.
For one sample, Table 4 gives the confusion matrix.
The accuracy for this sample is $88.2$\%. We also tested two
other samples, obtaining accuracies of $85.4$\%
and $81.2$\%. These results compare
favorably
with the accuracies of $59$\%, $58$\% and $53$\% reported
for three samples in \cite{slti:00}.
Below we also list the top few words for
each cluster, as selected by mutual information.

886:

887: \begin{verbatim}

888: Cluster 1:

889:  armenian israel arab palestinian peopl jew isra

890:  iran muslim kill turkis war greek iraqi adl call

891:

892: Cluster 2:

893:  imag file bit green gif mail graphic colour

894:  group version comput jpeg blue xv ftp ac uk list

895:

896: Cluster 3:

897:  univers space nasa theori system mission henri

898:  moon cost sky launch orbit shuttl physic work

899:

900: Cluster 4:

901:  clutch year game gant player team hirschbeck

902:  basebal won hi lost ball defens base run win

903:

904: Cluster 5:

905:  bike dog lock ride don wave drive black

906:  articl write apr motorcycl ca turn dod insur

907: \end{verbatim}
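
As a consistency check on the accuracy quoted in Example 3, and
assuming that accuracy means the fraction of articles that fall in the
cluster matched to their newsgroup (the largest entry in each column of
Table 4), the confusion matrix gives
\[ \frac{87+90+84+88+92}{5\times 100} = \frac{441}{500} = 88.2\%. \]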

908:

\section{Conclusions and future work}\label{se:con}

In this paper, we formulate a class of clustering problems as
bipartite graph partitioning problems, and we show that
approximate solutions can be found efficiently by computing a
partial singular value decomposition of some scaled edge weight
matrices. However, we have also seen that many challenging
problems remain. One area that needs further investigation
is the selection of cut points and of the number of clusters using
multiple left and right singular vectors, together with
the possibility of adding local refinements to improve
clustering quality.\footnote{It will be
difficult to use local refinement for PDDP
because it does not have a global objective function
for minimization.} Another area is to find
efficient algorithms for handling overlapping clusters. Finally,
the treatment of missing data under our bipartite graph model,
especially when we apply our spectral clustering methods to
the analysis of data from recommender systems, also deserves
further investigation.

928:

929:

930: \section{Acknowledgments}

The work of Hongyuan Zha and Xiaofeng He was supported in
part by NSF grant CCR-9901986. The work of Xiaofeng He,
Chris Ding and Horst Simon was supported in
part by the Department of Energy through an LBL LDRD fund.

935:

936: \bibliographystyle{plain}

937: \bibliography{ref}

938:

939:

940: \appendix

941: \section{Some proofs}

In this appendix we prove three
results. 1) All the
eigenvalues of $D^{-1/2}WD^{-1/2}$ have absolute value at
most $1$. Equivalently, we need to prove that the eigenvalues
of the generalized eigenvalue problem $Wx = \lambda Dx$
have absolute value at
most $1$. In fact, let $x=(x_i)_{i=1}^n$ be an eigenvector and let $i$
be such that
$|x_i| = \max_j |x_j|$; then it follows from

950: \[ \lambda d_i x_i = \sum_{j=1}^n w_{ij} x_j\]

951: that

952: \[ |\lambda| \leq \sum_{j=1}^n w_{ij}/d_i = 1.\]
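
The step between the two displays can be spelled out as follows. This
is a sketch that assumes, as the final equality above indicates, that
$w_{ij}\geq 0$ and $d_i=\sum_{j=1}^n w_{ij}>0$. Taking absolute values
and using $|x_j|\leq |x_i|$ for all $j$,
\[ |\lambda|\, d_i |x_i| = \Bigl|\sum_{j=1}^n w_{ij}x_j\Bigr|
\leq \sum_{j=1}^n w_{ij}|x_j|
\leq |x_i| \sum_{j=1}^n w_{ij} = d_i |x_i|, \]
and dividing by $d_i|x_i|>0$ (an eigenvector is nonzero, so $|x_i|>0$)
gives $|\lambda|\leq 1$.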

953:

954: 2) We prove that

955: \[ \sigma_{\max}(D_X^{-1/2}WD_Y^{-1/2})

956: =\max_{x \neq 0, y \neq 0} \frac{2 x^TWy}{x^TD_Xx + y^TD_Yy}.\]

957: Let $\hat{x}= D^{1/2}_Xx$ and $\hat{y}= D^{1/2}_Yy$, then

958: \begin{equation}\label{eq:ff}

959: \frac{2 x^TWy}{x^TD_Xx + y^TD_Yy} =

960:  \frac{2 \hat{x}^TD_X^{-1/2}WD_Y^{-1/2}\hat{y}}

961: {\hat{x}^T\hat{x} + \hat{y}^T\hat{y}}.\end{equation}

962: Let $D_X^{-1/2}WD_Y^{-1/2}=U\Sigma V^T$ be its SVD with

963: \[ U = [u_1, \dots, u_m], \quad V=[v_1,\dots, v_n] \]

964: and

965: \[ \Sigma = \diag(\sigma_1, \dots, \sigma_{\min\{m,n\}}), \quad

966: \sigma_1 = \sigma_{\max}(D_X^{-1/2}WD_Y^{-1/2}).\] Then

967: we can expand $\hat{x}$ and $\hat{y}$ as

968: \begin{equation}\label{eq:hh}

969:  \hat{x} = \sum_{i} \hat{x}_i u_i, \quad \hat{y} = \sum_{i} \hat{y}_i v_i,

970: \end{equation}

971: and (\ref{eq:ff}) becomes

972: \[ \frac{2\sum_{i} \sigma_i \hat{x}_i\hat{y}_i}{\sum_i \hat{x}_i^2 +

973: \sum_i \hat{y}_i^2}

974: \leq \frac{2\sigma_1 \sqrt{\sum_i \hat{x}_i^2}\sqrt{\sum_i \hat{y}_i^2}}

975: {\sum_i \hat{x}_i^2 + \sum_i \hat{y}_i^2} \leq \sigma_1.\]

Taking $\hat{x}_1=\hat{y}_1=1$ and all other coefficients equal to
zero, i.e., $\hat{x}=u_1$ and $\hat{y}=v_1$, achieves the maximum.
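
For completeness, the two inequalities in the last display can be
justified as follows (a short sketch that uses only the orthonormality
of the singular vectors and $0\leq\sigma_i\leq\sigma_1$). By the
Cauchy--Schwarz inequality,
\[ \sum_i \sigma_i\hat{x}_i\hat{y}_i
\leq \sigma_1\sum_i|\hat{x}_i|\,|\hat{y}_i|
\leq \sigma_1\sqrt{\sum_i\hat{x}_i^2}\sqrt{\sum_i\hat{y}_i^2}, \]
and the second inequality follows from $2ab\leq a^2+b^2$ with
$a=\sqrt{\sum_i\hat{x}_i^2}$ and $b=\sqrt{\sum_i\hat{y}_i^2}$.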

977:

978: 3) Now we consider

979: the constraint

980: \[ x^TD_Xe + y^TD_Ye = 0\]

981: which is equivalent to

982: $\hat{x}_1+\hat{y}_1=0$ using the expansions in (\ref{eq:hh}).

Under this constraint $\hat{x}_1\hat{y}_1 = -\hat{x}_1^2 \leq 0$, so
the $i=1$ term can only decrease the numerator
$2\sum_i \sigma_i\hat{x}_i\hat{y}_i$ while it increases the
denominator. Since the maximum is nonnegative, when searching for it
we can therefore assume that
$\hat{x}_1=\hat{y}_1=0$. It is then easy to see that

988: \[\sigma_2 = \max\left\{ \frac{2x^TWy}{x^TD_Xx + y^TD_Yy}\;\; | \;\;

989: x^TD_Xe + y^TD_Ye = 0 \right\},\]

990: and the maximum is achieved by the second largest left and

991: right singular vectors of $D_X^{-1/2} W D_Y^{-1/2}$.
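
The equivalence between the constraint and $\hat{x}_1+\hat{y}_1=0$ used
in 3) can be checked directly. The following sketch assumes that
$D_X=\diag(We)$ and $D_Y=\diag(W^Te)$, with $e$ the vector of all ones
of the appropriate dimension, and that $W$ has nonnegative entries. Then
$e^TD_Xe=e^TD_Ye=e^TWe\equiv c^2$, and
\[ (D_X^{-1/2}WD_Y^{-1/2})\,D_Y^{1/2}e = D_X^{-1/2}We = D_X^{1/2}e,
\qquad
(D_X^{-1/2}WD_Y^{-1/2})^T D_X^{1/2}e = D_Y^{1/2}e, \]
so $u_1 = D_X^{1/2}e/c$ and $v_1=D_Y^{1/2}e/c$ form a singular pair
with singular value $1$, which is the largest one by the bound in 1)
applied to the associated symmetric problem. Consequently, using the
expansions in (\ref{eq:hh}),
\[ x^TD_Xe + y^TD_Ye = \hat{x}^TD_X^{1/2}e + \hat{y}^TD_Y^{1/2}e
= c\,(\hat{x}_1+\hat{y}_1). \]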

992:

993:

994: \end{document}

995:

996:

997:

998:

999: