
1: \documentclass[aps,twocolumn,prl]{revtex4}

2: %\documentclass[aps,preprint]{revtex4}

3: %\documentclass[aps,twocolumn,a4]{revtex4}

4:

5: \usepackage{epsfig}

6: \newcommand{\be}{\begin{equation}}

7: \newcommand{\ee}{\end{equation}}

8: \newcommand{\bea}{\begin{eqnarray}}

9: \newcommand{\eea}{\end{eqnarray}}

10: \newcommand{\ba}{\begin{array}}

11: \newcommand{\ea}{\end{array}}

12:

13: \begin{document}

14: \title{Hierarchical Clustering Using Mutual Information}

15: \author{Alexander Kraskov, Harald St\"ogbauer, Ralph G. Andrzejak, and Peter Grassberger}

16: \affiliation{John-von-Neumann Institute for Computing, Forschungszentrum J\"ulich,

17:    D-52425 J\"ulich, Germany}

18:

19: \date{\today}

20: \begin{abstract}

21: We present a method for hierarchical clustering of data, called the {\it mutual information clustering} (MIC)

22: algorithm. It uses mutual information (MI) as a similarity measure and exploits its grouping property: The MI

23: between three objects $X, Y,$ and $Z$ is equal to the sum of the MI between $X$ and $Y$, plus the MI between $Z$

24: and the combined object $(XY)$. We use this both in the Shannon (probabilistic) version of information theory

25: and in the Kolmogorov (algorithmic) version. We apply our method to the construction of phylogenetic trees from

26: mitochondrial DNA sequences and to the output of independent component analysis (ICA) as illustrated with the

27: ECG of a pregnant woman.

28: \end{abstract}

29:

30: \maketitle

31:

32: Classifying or organizing data is very important in all scientific disciplines and is fundamental for

33: understanding and learning \cite{jain-dubes}. Classification can be exclusive or overlapping, supervised or

34: unsupervised. In the following we will be interested only in exclusive unsupervised classification, called

35: clustering.

36:

37: An instance of a clustering problem consists of a set of objects and a set of properties (called the characteristic

38: vector) for each object. The goal of clustering is separation of objects into groups using only the

39: characteristic vectors. Cluster analysis organizes data either as a single grouping of individuals into

40: non-overlapping clusters or as a hierarchy of nested partitions. The latter is called hierarchical clustering

41: (HC). Because of the wide range of applications, there is a large variety of clustering methods in

42: use; see e.g. \cite{jain-dubes} for an overview.

43:

44:  The crucial point of all clustering algorithms is the choice

45: of a {\it proximity measure}. This is obtained from the characteristic vectors and can be either an indicator

46: for similarity or dissimilarity. In the latter case it is convenient but not obligatory to satisfy the standard

47: axioms of a metric (positivity, symmetry, and triangle inequality). Among HC methods one should distinguish

48: between those where one uses the characteristic vectors only at the first level of the hierarchy and derives the

49: proximities between clusters from the proximities of their constituents, and methods where the proximities are

50: calculated each time from their characteristic vectors. The latter strategy (which is used also in the present

51: paper) allows of course for more flexibility but might also be computationally more costly.

52:

53: Quite generally, the ``objects'' to be clustered can be either single (finite) patterns (e.g. DNA sequences) or

54: random variables, i.e. {\it probability distributions}. In the latter case the data are usually supplied in the form

55: of a statistical sample, and one of the simplest and most widely used similarity measures is the linear

56: (Pearson) correlation coefficient. But this is not sensitive to nonlinear dependencies which do not manifest

57: themselves in the covariance and can thus miss important features. This is in contrast to mutual information

58: (MI) which is also singled out by its information theoretic background \cite{cover-thomas}. Indeed, MI is zero

59: only if the two random variables are strictly independent.

60:

61:

62: Another important feature of MI is that it also has an ``algorithmic'' cousin, defined within algorithmic

63: (Kolmogorov) information theory \cite{li-vi} which measures the similarity between individual objects. For a

64: thorough discussion of distance measures based on algorithmic MI and for their application to clustering, see

65: \cite{li1,li2}.

66:

67: Essential for the present application is the {\it grouping property} of MI,

68: \be

69:    I(X,Y,Z) = I(X,Y) + I((X,Y),Z).                        \label{group}

70: \ee

71: Within Shannon information theory this is an exact theorem, while it is true in the algorithmic version up to

72: the usual logarithmic correction terms \cite{li-vi}. Since $X,Y,$ and $Z$ can be themselves composite,

73: Eq.(\ref{group}) can be used recursively for a cluster decomposition of MI. This motivates the main idea of our

74: clustering method: instead of using e.g. centers of mass in order to treat clusters like individual objects in

75: an approximate way only, we treat them exactly like individual objects when using MI as proximity measure.

76: We thus propose the following scheme for clustering $n$ objects with MIC:\\

77: (1) Compute a proximity matrix based on pairwise mutual informations; assign $n$ clusters

78: such that each cluster contains exactly one object;\\

79: (2) find the two closest clusters $i$ and $j$; \\

80: (3) create a new cluster $(ij)$ by combining $i$ and $j$; \\

81: (4) delete the lines/columns with indices $i$ and $j$ from the proximity matrix, and add one line/column

82: containing the proximities between cluster $(ij)$ and all

83: other clusters; \\

84: (5) if the number of clusters is still $>2$, go to (2); otherwise join the two remaining clusters and stop.

85:
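The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the results reported here: the names {\tt mic\_cluster}, {\tt distance} and {\tt combine} are hypothetical, and the caller has to supply both the proximity measure and the rule for forming the combined object $(ij)$. The proximities of a new cluster are recomputed from the combined object itself, as required in step (4).

```python
from itertools import combinations


def mic_cluster(objects, distance, combine):
    """Agglomerative clustering following steps (1)-(5) of the text.

    objects  : list of items to cluster
    distance : callable (a, b) -> float, smaller means closer (assumed)
    combine  : callable (a, b) -> combined object (assumed)
    Returns (tree, merges): a nested tuple of object indices and the
    list of (subtree, height) pairs in the order the merges happened.
    """
    # (1) one cluster per object; store (subtree, data) under an integer id
    clusters = {i: (i, obj) for i, obj in enumerate(objects)}
    prox = {frozenset((i, j)): distance(clusters[i][1], clusters[j][1])
            for i, j in combinations(clusters, 2)}
    next_id = len(objects)
    merges = []
    while len(clusters) > 1:
        # (2) find the two closest clusters
        pair = min(prox, key=prox.get)
        i, j = sorted(pair)
        height = prox[pair]
        # (3) create the combined cluster (ij)
        subtree = (clusters[i][0], clusters[j][0])
        data = combine(clusters[i][1], clusters[j][1])
        # (4) drop rows/columns i and j, add proximities of the new cluster,
        #     recomputed from the combined object itself
        del clusters[i], clusters[j]
        prox = {p: d for p, d in prox.items() if i not in p and j not in p}
        for k, (_, other) in clusters.items():
            prox[frozenset((next_id, k))] = distance(data, other)
        clusters[next_id] = (subtree, data)
        merges.append((subtree, height))
        next_id += 1  # (5) repeat until a single cluster is left
    tree, _ = next(iter(clusters.values()))
    return tree, merges
```

For a toy check one may take finite sets as objects, set union as {\tt combine}, and the Jaccard distance $1-|A\cap B|/|A\cup B|$ as a stand-in proximity measure; the loop then joins the most strongly overlapping sets first.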

86: {\bf Shannon Theory:} Here, $X\equiv X_1, Y\equiv X_2, \ldots$ are random variables. If they are discrete,

87: entropies are defined as usual, $H(X) = - \sum_i p_i(X) \log p_i(X)$ etc. The MI is defined as

88: \be

89:    I(X_1,\ldots, X_n) = \sum_{k=1}^n H(X_k) - H(X_1,\ldots, X_n).

90: \ee

91: Eq.(\ref{group}) can be checked easily, together with its generalization to arbitrary groupings. It means that

92: MI can be {\it decomposed into hierarchical levels}. By iterating it, one can decompose $I(X_1,\ldots, X_n)$ for

93: any $n>2$ and for any partitioning of the set $(X_1,\ldots, X_n)$ into the MIs between elements within one cluster

94: and MIs between clusters.

95:
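As a quick check of Eq.(\ref{group}) for three discrete variables, one can insert the definition of MI on the right-hand side, treating the pair $(X,Y)$ as a single variable with entropy $H(X,Y)$; the joint entropy $H(X,Y)$ then cancels:
\bea
   I(X,Y) + I((X,Y),Z) &=& [H(X)+H(Y)-H(X,Y)] \nonumber \\
                       & & + [H(X,Y)+H(Z)-H(X,Y,Z)] \nonumber \\
                       &=& H(X)+H(Y)+H(Z)-H(X,Y,Z) \nonumber \\
                       &=& I(X,Y,Z). \nonumber
\eea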

96: For continuous variables with densities $\mu_X$ etc., one first introduces some binning (`coarse-graining'), and

97: applies the above to the binned variables. If $x$ is a vector with dimension $m$ and each bin has Lebesgue

98: measure $\Delta^m$, then $p_i(X) \approx \mu_X(x)\Delta^m$ with $x$ chosen suitably in bin $i$, and

99: \be

100:    H_{\rm bin}(X) \approx \tilde{H}(X) - m \log \Delta

101: \ee

102: where the {\it differential entropy} is given by

103: \be

104:    \tilde{H}(X) = -\int dx \;\mu_X(x) \log \mu_X(x).

105: \ee
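The relation between $H_{\rm bin}(X)$ and $\tilde{H}(X)$ can be made plausible in one step. As a sketch, assume that $\mu_X$ is smooth on the scale of a bin, and write $x_i$ (our notation) for the point chosen in bin $i$, so that the sum over bins can be replaced by an integral:
\bea
   H_{\rm bin}(X) &=& -\sum_i p_i(X) \log p_i(X) \nonumber \\
      &\approx& -\sum_i \mu_X(x_i)\Delta^m \,[\log \mu_X(x_i) + m\log\Delta] \nonumber \\
      &\approx& -\int dx \;\mu_X(x) \log \mu_X(x) - m\log\Delta, \nonumber
\eea
where the normalization $\int dx \;\mu_X(x) = 1$ was used in the last line.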

106: Notice that $H_{\rm bin}(X)$ is a true (average) information and is thus non-negative, but $\tilde{H}(X)$ is not

107: an information, can be negative, and is not invariant under homeomorphisms $x\to \phi(x)$.

108:

109: Joint entropies, conditional entropies, and MI are defined as above, with sums replaced by integrals. Like

110: $\tilde{H}(X)$, joint and conditional entropies are neither positive (semi-)definite nor invariant. But MI,

111: defined as

112: \be

113:    I(X,Y) = \int\!\!\!\int dx dy \;\mu_{XY}(x,y) \;\log{\mu_{XY}(x,y)\over \mu_X(x)\mu_Y(y)}\;,

114:    \label{mi}

115: \ee

116: is non-negative and invariant under $x\to \phi(x)$ and $y\to \psi(y)$. It is (the limit of) a true information,

117: \be

118:    I(X,Y) = \lim_{\Delta\to 0} [H_{\rm bin}(X)+H_{\rm bin}(Y)-H_{\rm bin}(X,Y)].

119: \ee

120:
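The limit exists because the divergent terms cancel. Writing $m_x$ and $m_y$ for the dimensions of $X$ and $Y$ (so that the joint variable has dimension $m_x+m_y$), the binned entropies combine as
\bea
   & & H_{\rm bin}(X)+H_{\rm bin}(Y)-H_{\rm bin}(X,Y) \nonumber \\
   & & \quad \approx [\tilde{H}(X) - m_x\log\Delta] + [\tilde{H}(Y) - m_y\log\Delta] \nonumber \\
   & & \qquad - [\tilde{H}(X,Y) - (m_x+m_y)\log\Delta] \nonumber \\
   & & \quad = \tilde{H}(X)+\tilde{H}(Y)-\tilde{H}(X,Y), \nonumber
\eea
which is Eq.(\ref{mi}) written in terms of differential entropies.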

121: In applications, one usually has the data available in the form of $N$ sample points $(x_i,y_i), \, i=1,\ldots, N$,

122: which are assumed to be i.i.d. realizations. There exist numerous algorithms to estimate $I(X,Y)$ and entropies.

123: We use in the following the MI estimators proposed recently in Ref.~\cite{mi}, and we refer to this paper for a

124: review of alternative methods.

125:
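To make the binned definition concrete, the following sketch computes the naive plug-in estimate $H_{\rm bin}(X)+H_{\rm bin}(Y)-H_{\rm bin}(X,Y)$ from a sample of scalar pairs. It is {\it not} the estimator of Ref.~\cite{mi}, which avoids fixed bins; it is only a simple stand-in that shows what is being estimated. The function names and the choice of equal-width bins are assumptions made for this illustration, and the estimate is known to be biased for small samples or many bins.

```python
import math
from collections import Counter


def bin_index(v, lo, hi, bins):
    """Index of the equal-width bin containing v; the maximum goes to the last bin."""
    if hi == lo:
        return 0
    return min(int((v - lo) / (hi - lo) * bins), bins - 1)


def binned_mi(xs, ys, bins):
    """Plug-in MI estimate (in nats) from paired scalar samples xs, ys."""
    n = len(xs)
    lox, hix = min(xs), max(xs)
    loy, hiy = min(ys), max(ys)
    ix = [bin_index(x, lox, hix, bins) for x in xs]
    iy = [bin_index(y, loy, hiy, bins) for y in ys]

    def entropy(labels):
        # Shannon entropy of the empirical distribution of the labels
        return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

    # H_bin(X) + H_bin(Y) - H_bin(X,Y)
    return entropy(ix) + entropy(iy) - entropy(list(zip(ix, iy)))
```

For two identical samples the estimate reduces to the binned entropy of the sample, and for a sample whose joint histogram factorizes exactly it vanishes.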

126:

127: {\bf Algorithmic Information Theory:} In contrast to Shannon theory where the basic objects are random variables

128: and entropies are {\it average} informations, algorithmic information theory deals with individual symbol

129: strings and with the actual information needed to specify them. To ``specify'' a sequence $X$ means here to give

130: the necessary input to a universal computer $U$, such that $U$ prints $X$ on its output and stops. The analogue

131: of entropy, usually called the {\it complexity} $K(X)$ of $X$, is the minimal length of an input which

132: leads to the output $X$, for fixed $U$. It depends on $U$, but it can be shown that this dependence is weak and

133: can be neglected in the limit when $K(X)$ is large \cite{li-vi}.

134:

135: Let us denote the concatenation of two strings $X$ and $Y$ as $XY$. Its complexity is $K(XY)$. It is intuitively

136: clear that $K(XY)$ should be larger than $K(X)$ but cannot be larger than the sum $K(X)+K(Y)$. Finally, one

137: expects that $K(X|Y)$, defined as the minimal length of a program printing $X$ when $Y$ is furnished as

138: auxiliary input, is related to $K(XY)-K(Y)$. Indeed, one can show \cite{li-vi} (again within correction terms

139: which become irrelevant asymptotically) that

140: \be

141:    0 \leq K(X|Y) \simeq K(XY)-K(Y) \leq K(X).

142: \ee

143: Notice the close similarity with Shannon entropy. The algorithmic information in $Y$ about $X$ is finally

144: \be

145:    I_{\rm alg}(X,Y) = K(X) - K(X|Y) \simeq K(X)+K(Y)-K(XY),

146: \ee

147: and similarly for more than two strings. Within the same additive correction terms, one shows that it is

148: symmetric, $I_{\rm alg}(X,Y) =I_{\rm alg}(Y,X)$, and can thus serve as an analogue of mutual information.

149:

150: $K(X)$ is in general not computable. But one can easily give upper bounds: The length of any input which

151: produces $X$ (e.g. by spelling it out verbatim) is an upper bound. Improved upper bounds are provided by any

152: file compression algorithm.

153: %Good compression algorithms will give good approximations to $K(X)$, and algorithms whose

154: %performance does not depend on the input file length (in particular since they do not segment the file during

155: %compression) will be crucial for the following.

156:

157:

158: {\bf MI-Based Distance Measures:} When comparing objects with different marginal or joint informations, it seems

159: intuitively clear that one should prefer {\it relative} distances over absolute ones, in order to minimize the

160: dependence on the total information. We here use

161: %only the one derived in \cite{li1}, since it gave more robust

162: %results in our applications, although it had been argued that the one given in \cite{li2} has theoretical

163: %advantages.  More precisely,

164: the quantity \cite{li1,cluster}

165: \be

166:    D(X,Y) = 1 - \frac{I(X,Y)}{H(X,Y)}                  \label{eq:dist}

167: \ee

168: which is a metric, with $D(X,X)=0$ and $D(X,Y)\leq 1$ for all pairs $(X,Y)$. The algorithmic version is also

169: {\it universal}: If $X\approx Y$ according to any non-trivial distance measure, then $X\approx Y$ also according

170: to $D$.

171:
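The algorithmic version of $D$ can be sketched with any general-purpose compressor. The example below uses Python's standard {\tt zlib} module purely as a stand-in upper bound for $K$; it is not the GenCompress program used for the DNA data, and the function names are hypothetical. It estimates $I_{\rm alg}(X,Y) \simeq K(X)+K(Y)-K(XY)$ and divides by $K(XY)$ as in Eq.(\ref{eq:dist}).

```python
import random
import zlib


def K(data: bytes) -> int:
    """Upper bound on the complexity: length of the zlib-compressed string."""
    return len(zlib.compress(data, 9))


def algorithmic_mi(x: bytes, y: bytes) -> int:
    """Estimate of I_alg(X,Y) ~ K(X) + K(Y) - K(XY), with XY the concatenation."""
    return K(x) + K(y) - K(x + y)


def distance(x: bytes, y: bytes) -> float:
    """Estimate of D(X,Y) = 1 - I_alg(X,Y) / K(XY)."""
    return 1.0 - algorithmic_mi(x, y) / K(x + y)


def random_bytes(n: int, seed: int) -> bytes:
    """Reproducible pseudo-random test string (demo helper only)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))
```

Because a real compressor only bounds $K$ from above, the estimated $D(X,X)$ is small but not exactly zero. In addition, {\tt zlib} can only exploit repetitions within a 32 kB window, so for long sequences a compressor without such a limit would be needed.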

172: A difficulty appears in the Shannon framework, if we deal with continuous random variables. As we mentioned

173: above, $\tilde{H}(X,Y)$ is not invariant under homeomorphisms (including rescalings) and not even positive

174: definite, while $H_{\rm bin}$ diverges when $\Delta\to 0$. We thus modified Eq.(\ref{eq:dist}) by replacing

175: $H(X,Y)$ by $H_{\rm bin}(X,Y)$ and replacing $D(X,Y)$ by the similarity measure

176: \be

177:    S(X,Y) = \lim_{\Delta\to 0} (D(X,Y)-1)\log\Delta  = \frac{I(X,Y)}{m_x+m_y}.

178:                                          \label{S}

179: \ee

180:
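The second equality in Eq.(\ref{S}) follows from the small-$\Delta$ behavior of the binned joint entropy. With $m_x$ and $m_y$ the dimensions of $X$ and $Y$, one has $H_{\rm bin}(X,Y) \approx \tilde{H}(X,Y) - (m_x+m_y)\log\Delta$, and therefore
\bea
   (D(X,Y)-1)\log\Delta &=& -\frac{I(X,Y)\,\log\Delta}{H_{\rm bin}(X,Y)} \nonumber \\
      &\approx& \frac{-I(X,Y)\,\log\Delta}{\tilde{H}(X,Y) - (m_x+m_y)\log\Delta} \nonumber \\
      &\to& \frac{I(X,Y)}{m_x+m_y} \qquad (\Delta\to 0), \nonumber
\eea
since the finite term $\tilde{H}(X,Y)$ becomes negligible compared to $(m_x+m_y)\log\Delta$.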

181: {\bf A Phylogenetic Tree for Mammals:} We study the mitochondrial DNA of a group of 34 mammals (see Fig.~\ref{phylotree}). The

182: same data \cite{Genebank} had previously been analyzed in \cite{li1,Reyes00}. This group includes among others

183: some rodents, ferungulates, and primates.

184:

185: Obviously we are here dealing with the algorithmic version of information theory, and informations are estimated

186: by lossless data compression. For constructing the proximity matrix between individual taxa, we proceed

187: essentially as in Ref.~\cite{li1}, using the special compression program GenCompress \cite{GenComp}.

188:

189: \begin{figure}

190:   \begin{center}

191:     \psfig{file=phylotree.eps,height=75mm,angle=270}

192:     \caption{Phylogenetic tree for 34 mammals.

193:        The heights of nodes are the distances

194:        between the joining daughter clusters.}

195:     \label{phylotree}

196:   \end{center}

197:   \vspace{-8mm}

198: \end{figure}

199:

200: In Ref.~\cite{li1}, this proximity matrix was then used as the input to a standard HC algorithm

201: (neighbour-joining and hypercleaning) to produce an evolutionary tree. Instead we use the MIC algorithm with

202: distance $D(X,Y)$. The joining of two clusters is obtained by simply concatenating the DNA sequences. There is

203: of course an arbitrariness in the order of concatenation: $XY$ and $YX$ give in general compressed

204: sequences of different lengths. But we found this to have negligible effect on the evolutionary tree.

205:

206: The overall structure of this tree closely resembles the one shown in Ref.~\cite{Reyes00}. All primates are

207: correctly clustered and also the relative order of the ferungulates (blue whale to horse) is in accordance with

208: Ref.~\cite{Reyes00}. On the other hand, there are a number of connections which obviously do not reflect the

209: true evolutionary tree, see for example the guinea pig with bat and elephant with platypus.

210: %

211: %As shown in Fig.~1 all primates are correctly clustered together. Also, the relative order of ferungulates

212: %closely resembles the one shown in Ref. \cite{Reyes00}, although it deviates in some details. The clusters of

213: %the guinea pig and the bat, and the one of platypus and elephant, obviously do not reflect the true evolutionary

214: %tree.

215: %

216: But the latter two, in spite of being joined together, have a very large distance from each other; thus their

217: clustering just reflects the fact that neither the platypus nor the elephant has other close relatives in the

218: sample. All in all, however, the results shown in Fig.~\ref{phylotree} already capture surprisingly well the overall structure

219: shown in Ref.~\cite{Reyes00}. Dividing MI by the total information is essential for this success. If we had used

220: the non-normalized $I_{\rm alg}(X,Y)$ itself, the clustering algorithm used in \cite{li1} would not change much,

221: since all 34 DNA sequences have roughly the same length. But our MIC algorithm would break down completely:

222: After the first cluster formation, we have DNA sequences of very different lengths, and longer sequences tend

223: also to have larger MI, even if they are not closely related.

224:

225: The concatenation of $X$ and $Y$ will of course not lead to a plausible sequence of the common ancestor, but it

226: {\it optimally represents the information} about it. This information is essential to find the correct way

227: through higher hierarchy levels of the evolutionary tree, and it is preserved in concatenating.

228:

229:

230: {\bf Clustering of Minimally Dependent Components in an Electrocardiogram:} As our second application we choose

231: a case where Shannon theory is the proper setting. We show in Fig.~\ref{ICAECG0} an ECG recorded from the abdomen and thorax

232: of a pregnant woman \cite{ECGdata}. It is already seen from this graph that there are at least two important

233: components in this ECG: the heartbeat of the mother and of the fetus. In addition there is noise from various

234: sources (muscle activity, measurement noise, etc.). While it is easy to detect anomalies in the mother's ECG

235: from such a recording, it would be difficult to detect them in the fetal ECG.

236:

237: As a first approximation we can assume that the total ECG is a linear superposition of several independent

238: sources (mother, child, noise$_1$, noise$_2$,...). A standard method to disentangle such superpositions is {\it

239: independent component analysis} (ICA) \cite{ICA}. There, one tries to recover the sources by means of linear

240: transformation $s_i(t) = \sum_{j=1}^n W_{ij} x_j(t)$, where $W_{ij}$ is determined by minimizing the estimated

241: MI between the $s_i$.

242:

243: In reality things are not so simple. For instance, the sources might not be independent, the number of sources

244: (including noise sources!) might be different from the number of channels, and the mixing might involve delays.

245: For the present case this implies that the heartbeat of the mother is seen in several reconstructed components

246: $s_i$, and that the ``independent'' components are not independent at all. In particular, all components $s_i$

247: which have large contributions from the mother form a cluster with large intra-cluster MIs and small

248: inter-cluster MIs. The same is true for the fetal ECG, albeit less pronounced. To obtain clean recordings of the

249: fetal and maternal ECGs, we proceeded as follows \cite{Harald}.

250:

251: Since we expect different delays in the different channels, we first used Takens delay embedding \cite{Takens80}

252: with time delay 0.002$\,$s and embedding dimension 3, resulting in $24$ channels. We then formed 24 linear

253: combinations $s_i(t)$. We used the MI estimator of Ref.~\cite{mi}; for details see \cite{cluster}. Five of the resulting

254: least dependent components contain strong contributions of the mother's heartbeat, three are dominated by the

255: fetus. The rest contain mostly noise \cite{cluster}.

256:

257: In plotting the actual dendrogram (Fig.~\ref{ClustECG}) we used $S(X,Y)$ for the cluster analysis but used the

258: MI of the clusters to determine the height at which the two branches join. The MI of the first five channels,

259: e.g., is $\approx 1.44$~nats, while that of channels 6 to 8 is $\approx 0.3$~nats. For any two clusters (tuples)

260: $X=X_1\ldots X_n$ and $Y=Y_1\ldots Y_m$ one has $I(X,Y) \geq I(X)+I(Y)$. This guarantees, if the MI is estimated

261: correctly, that the tree is drawn properly. The two slight glitches (when clusters (1 - 14) and (15 - 18) join,

262: and when (21 - 22) is joined with 23) result from small errors in estimating MI. They in no way affect our

263: conclusions.

264:
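The inequality used above is a direct consequence of the generalized grouping property. Writing $I(X) \equiv I(X_1,\ldots,X_n)$ and $I(Y) \equiv I(Y_1,\ldots,Y_m)$ for the MIs within the two clusters, and $I((X),(Y)) = H(X)+H(Y)-H(X,Y)$ for the MI between the clusters treated as single objects, the definition of MI gives
\bea
   I(X,Y) &=& \sum_{k=1}^n H(X_k) + \sum_{l=1}^m H(Y_l) - H(X,Y) \nonumber \\
          &=& I(X) + I(Y) + I((X),(Y)), \nonumber
\eea
and $I(X,Y) \geq I(X)+I(Y)$ follows because $I((X),(Y))$ is non-negative. The height of a node in the dendrogram therefore cannot lie below the heights of its two daughters, provided the MI estimates are exact.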

265: \begin{figure}

266:   \begin{center}

267:     \psfig{file=ECG1s.eps,width=8cm}

268:     \caption{ECG of a pregnant woman (sampling rate 500 Hz).}

269:     \label{ICAECG0}

270:   \end{center}

271: \end{figure}

272:

273: \begin{figure}

274:   \begin{center}

275:     \psfig{file=ECGclust.eps,width=8cm,angle=0}

276:     \caption{Dendrogram for least dependent components.}

277:     \label{ClustECG}

278:   \end{center}

279:   \vspace{-8mm}

280: \end{figure}

281:

282: In Fig.~\ref{ClustECG} one can clearly see two big clusters corresponding to the mother and to the child. There

283: are also some small clusters which should be considered as noise. For reconstructing the mother and child

284: contributions to Fig.~\ref{ICAECG0}, we have to decide on one specific clustering from the entire hierarchy. We

285: decided to make the cut at inter-cluster MI equal to 0.1, i.e. two clusters $X$ and $Y$ are joined whenever

286: $I((X),(Y)) \equiv I(X,Y)-I(X)-I(Y) \geq 0.1$. Reconstructing the first five traces of the original ECG from the

287: child components only, we obtain Fig.~\ref{reconstruct}.

288:

289: \begin{figure}[t]

290:   \begin{center}

291:     \psfig{file=ECG3small.eps,width=8cm}

292:     \caption{ECG where all contributions except those of the child cluster have

293:      been removed.}

294:     \label{reconstruct}

295:   \end{center}

296:   \vspace{-8mm}

297: \end{figure}

298:

299: In summary, we have shown that MI can not only be used as a proximity measure in clustering, but that it also

300: suggests a conceptually very simple and natural hierarchical clustering algorithm. We do not claim that this

301: algorithm, called {\it mutual information clustering} (MIC), is always superior to other algorithms. Indeed, MI

302: is in general not easy to estimate. Obviously, when only crude estimates are possible, also MIC will not give

303: very good results. But as MI estimates are becoming better, also the results of MIC should improve. The present

304: paper was partly triggered by the development of a new class of MI estimators for continuous random variables

305: which have very small bias and also rather small variances \cite{mi}.

306:

307: We have illustrated our method with two applications, one from genetics and one from cardiology. MIC might not

308: give the optimal clustering for either application, but it seems promising that one common method gives decent

309: results for both, although they are very different.

310:

311: The results of MIC should improve, if more data become available. This is trivial, if we mean by that longer

312: time sequences in the application to ECG, and longer parts of the genome. It is less trivial that we expect MIC

313: to make fewer mistakes in a phylogenetic tree, when more species are included. The reason is that close-by

314: species will be correctly joined anyhow, and families -- which now are represented only by single species and

315: thus are poorly characterized -- will be much better described by the concatenated genomes if more species are

316: included.

317:

318: We would like to thank Arndt von Haesseler, Walter Nadler and Volker Roth for many useful discussions.

319:

320: \begin{thebibliography}{30}

321:

322: \bibitem{jain-dubes} A.K. Jain and R.C. Dubes, {\it Algorithms for Clustering Data} (Prentice

323:    Hall, Englewood Cliffs, NJ, 1988).

324:

325: \bibitem{cover-thomas} T.M. Cover and J.A. Thomas, {\it Elements of Information

326:    Theory} (Wiley, New York 1991).

327:

328: \bibitem{li-vi} M. Li and P. Vitanyi, {\it An Introduction to Kolmogorov Complexity

329:    and its Applications}, 2nd ed. (Springer, New York 1997).

330:

331: \bibitem{li1} M. Li \textit{et al.}, Bioinformatics, {\bf 17}, 149 (2001).

332:

333: \bibitem{li2} M. Li \textit{et al.},  e-print CC/0111054 (2002).

334:

335: \bibitem{mi} A. Kraskov, H. St{\"o}gbauer and P. Grassberger, e-print cond-mat/0305641

336:    (2003).

337:

338: %\bibitem{cluster} A. Kraskov, H. St{\"o}gbauer, R.G. Andrzejak, and P. Grassberger,

339: %   e-print cond-mat/.... (2003).

340:

341: \bibitem{cluster} A. Kraskov, H. St{\"o}gbauer, R.G. Andrzejak, and P. Grassberger, to be published

342:

343: %\bibitem{footnote} We assume that densities really exists. If not

344: %   (e.g. if $X$ lives on a fractal set), then $m$ is to be replaced by the Hausdorff dimension

345: %   of the measure $\mu$.

346:

347: \bibitem{Genebank} http://www.ncbi.nlm.nih.gov/

348:

349: \bibitem{Reyes00} A. Reyes \textit{et al.}, Mol. Biol. Evol. {\bf 17}, 979 (2000).

350:

351: \bibitem{GenComp} http://www.cs.ucsb.edu/\~{}mli/Bioinf/software/index.html

352:

353: \bibitem{ECGdata} B.L.R. De Moor, ed., {\sf www.esat.kuleuven.ac.be/sista/daisy} (1997).

354:

355: \bibitem{ICA} A. Hyv\"arinen, J. Karhunen, and E. Oja, {\it Independent

356:    Component Analysis} (Wiley, New York 2001).

357:

358: \bibitem{Harald} H. St{\"o}gbauer, A. Kraskov, and P. Grassberger,  to be published

359:

360: \bibitem{Takens80} F. Takens. In {\it

361:    Dynamical Systems and Turbulence}, eds. D.A. Rand and L.S. Young, Springer Lecture Notes

362:    in Mathematics 898, page 366 (Springer, Berlin 1980).

363:

364: \end{thebibliography}

365:

366: \end{document}

367: