1: \documentclass[aps,twocolumn,prl]{revtex4}
2: %\documentclass[aps,preprint]{revtex4}
3: %\documentclass[aps,twocolumn,a4]{revtex4}
4:
5: \usepackage{epsfig}
6: \newcommand{\be}{\begin{equation}}
7: \newcommand{\ee}{\end{equation}}
8: \newcommand{\bea}{\begin{eqnarray}}
9: \newcommand{\eea}{\end{eqnarray}}
10: \newcommand{\ba}{\begin{array}}
11: \newcommand{\ea}{\end{array}}
12:
13: \begin{document}
14: \title{Hierarchical Clustering Using Mutual Information}
15: \author{Alexander Kraskov, Harald St\"ogbauer, Ralph G. Andrzejak, and Peter Grassberger}
16: \affiliation{John-von-Neumann Institute for Computing, Forschungszentrum J\"ulich,
17: D-52425 J\"ulich, Germany}
18:
19: \date{\today}
20: \begin{abstract}
We present a method for hierarchical clustering of data, called the {\it mutual information clustering} (MIC)
22: algorithm. It uses mutual information (MI) as a similarity measure and exploits its grouping property: The MI
23: between three objects $X, Y,$ and $Z$ is equal to the sum of the MI between $X$ and $Y$, plus the MI between $Z$
24: and the combined object $(XY)$. We use this both in the Shannon (probabilistic) version of information theory
25: and in the Kolmogorov (algorithmic) version. We apply our method to the construction of phylogenetic trees from
26: mitochondrial DNA sequences and to the output of independent components analysis (ICA) as illustrated with the
27: ECG of a pregnant woman.
28: \end{abstract}
29:
30: \maketitle
31:
32: Classification or organizing of data is very important in all scientific disciplines and is fundamental for
33: understanding and learning \cite{jain-dubes}. Classification can be exclusive or overlapping, supervised or
34: unsupervised. In the following we will be interested only in exclusive unsupervised classification, called
35: clustering.
36:
An instance of a clustering problem consists of a set of objects and a set of properties (called the characteristic
vector) for each object. The goal of clustering is the separation of objects into groups using only the
characteristic vectors. Cluster analysis organizes data either as a single grouping of individuals into
non-overlapping clusters or as a hierarchy of nested partitions. The latter is called hierarchical clustering
(HC). Because of the wide range of applications, there is a large variety of clustering methods in
use; see e.g. \cite{jain-dubes} for an overview.
43:
44: The crucial point of all clustering algorithms is the choice
45: of a {\it proximity measure}. This is obtained from the characteristic vectors and can be either an indicator
46: for similarity or dissimilarity. In the latter case it is convenient but not obligatory to satisfy the standard
47: axioms of a metric (positivity, symmetry, and triangle inequality). Among HC methods one should distinguish
48: between those where one uses the characteristic vectors only at the first level of the hierarchy and derives the
49: proximities between clusters from the proximities of their constituents, and methods where the proximities are
50: calculated each time from their characteristic vectors. The latter strategy (which is used also in the present
51: paper) allows of course for more flexibility but might also be computationally more costly.
52:
53: Quite generally, the ``objects" to be clustered can be either single (finite) patterns (e.g. DNA sequences) or
54: random variables, i.e. {\it probability distributions}. In the latter case the data are usually supplied in form
55: of a statistical sample, and one of the simplest and most widely used similarity measures is the linear
56: (Pearson) correlation coefficient. But this is not sensitive to nonlinear dependencies which do not manifest
57: themselves in the covariance and can thus miss important features. This is in contrast to mutual information
58: (MI) which is also singled out by its information theoretic background \cite{cover-thomas}. Indeed, MI is zero
59: only if the two random variables are strictly independent.
60:
61:
62: Another important feature of MI is that it has also an ``algorithmic" cousin, defined within algorithmic
63: (Kolmogorov) information theory \cite{li-vi} which measures the similarity between individual objects. For a
64: thorough discussion of distance measures based on algorithmic MI and for their application to clustering, see
65: \cite{li1,li2}.
66:
67: Essential for the present application is the {\it grouping property} of MI,
68: \be
69: I(X,Y,Z) = I(X,Y) + I((X,Y),Z). \label{group}
70: \ee
71: Within Shannon information theory this is an exact theorem, while it is true in the algorithmic version up to
72: the usual logarithmic correction terms \cite{li-vi}. Since $X,Y,$ and $Z$ can be themselves composite,
73: Eq.(\ref{group}) can be used recursively for a cluster decomposition of MI. This motivates the main idea of our
clustering method: instead of using e.g. centers of mass in order to treat clusters like individual objects in
an approximate way only, we treat them exactly like individual objects when using MI as proximity measure.

We thus propose the following scheme for clustering $n$ objects with MIC:\\
77: (1) Compute a proximity matrix based on pairwise mutual informations; assign $n$ clusters
78: such that each cluster contains exactly one object;\\
79: (2) find the two closest clusters $i$ and $j$; \\
80: (3) create a new cluster $(ij)$ by combining $i$ and $j$; \\
81: (4) delete the lines/columns with indices $i$ and $j$ from the proximity matrix, and add one line/column
82: containing the proximities between cluster $(ij)$ and all
83: other clusters; \\
84: (5) if the number of clusters is still $>2$, goto (2); else join the two clusters and stop.
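
Steps (1)--(5) can be sketched in a few lines of Python for discrete samples. This is only a minimal illustration under stated assumptions: entropies are naive plug-in estimates from relative frequencies, the proximity is $1 - I/H$ with $H$ the joint entropy, and all proximities are recomputed in every pass instead of updating the matrix as in step (4). The names \texttt{entropy}, \texttt{distance} and \texttt{mic} are illustrative.

\begin{verbatim}
```python
from collections import Counter
from math import log

def entropy(columns):
    """Plug-in Shannon entropy (nats) of the joint variable in `columns`."""
    rows = list(zip(*columns))          # one tuple per sample point
    n = len(rows)
    return -sum(c / n * log(c / n) for c in Counter(rows).values())

def distance(a, b):
    """1 - I/H between two clusters, each a list of sample columns."""
    h_joint = entropy(a + b)
    if h_joint == 0.0:                  # both clusters are constant
        return 0.0
    mi = entropy(a) + entropy(b) - h_joint
    return 1.0 - mi / h_joint

def mic(data):
    """Agglomerate the variables in `data` (dict: name -> list of samples).

    Returns the merges in order as (members_i, members_j, distance).
    """
    # step (1): one cluster per object
    clusters = {(name,): [col] for name, col in data.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        # step (2): closest pair of clusters
        d, i, j = min(
            (distance(clusters[p], clusters[q]), p, q)
            for k, p in enumerate(keys) for q in keys[k + 1:]
        )
        # steps (3)-(4): the new cluster is the joint variable of its
        # members; its proximities are recomputed in the next pass
        clusters[i + j] = clusters.pop(i) + clusters.pop(j)
        merges.append((i, j, d))
    return merges
```
\end{verbatim}

Because a joined cluster is simply the list of its members' sample columns, its MI with any other cluster is computed from the data themselves and not from the proximities of its constituents.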
85:
86: {\bf Shannon Theory:} Here, $X\equiv X_1, Y\equiv X_2, \ldots$ are random variables. If they are discrete,
87: entropies are defined as usual $H(X) = - \sum_ip_i(X) \log p_i(X)$ etc. The MI is defined as
88: \be
89: I(X_1,\ldots, X_n) = \sum_{k=1}^n H(X_k) - H(X_1,\ldots, X_n).
90: \ee
91: Eq.(\ref{group}) can be checked easily, together with its generalization to arbitrary groupings. It means that
92: MI can be {\it decomposed into hierarchical levels}. By iterating it, one can decompose $I(X_1\ldots X_n)$ for
93: any $n>2$ and for any partitioning of the set $(X_1\ldots X_n)$ into the MIs between elements within one cluster
94: and MIs between clusters.
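
The check for three variables can be spelled out by inserting the definition of MI in terms of entropies into the right-hand side of Eq.(\ref{group}):
\bea
I(X,Y) + I((X,Y),Z) &=& [H(X)+H(Y)-H(X,Y)] \nonumber \\
 & & + [H(X,Y)+H(Z)-H(X,Y,Z)] \nonumber \\
 &=& H(X)+H(Y)+H(Z)-H(X,Y,Z), \nonumber
\eea
which is $I(X,Y,Z)$. The joint entropy $H(X,Y)$ of the combined object cancels.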
95:
96: For continuous variables with densities $\mu_X$ etc., one first introduces some binning (`coarse-graining'), and
applies the above to the binned variables. If $x$ is a vector with dimension $m$ and each bin has side length
$\Delta$ (i.e. Lebesgue measure $\Delta^m$), then $p_i(X) \approx \mu_X(x)\Delta^m$ with $x$ chosen suitably in bin $i$, and
99: \be
100: H_{\rm bin}(X) \approx \tilde{H}(X) - m \log \Delta
101: \ee
102: where the {\it differential entropy} is given by
103: \be
104: \tilde{H}(X) = -\int dx \;\mu_X(x) \log \mu_X(x).
105: \ee
106: Notice that $H_{\rm bin}(X)$ is a true (average) information and is thus non-negative, but $\tilde{H}(X)$ is not
107: an information, can be negative, and is not invariant under homeomorphisms $x\to \phi(x)$.
108:
109: Joint entropies, conditional entropies, and MI are defined as above, with sums replaced by integrals. Like
110: $\tilde{H}(X)$, joint and conditional entropies are neither positive (semi-)definite nor invariant. But MI,
111: defined as
112: \be
113: I(X,Y) = \int\!\!\!\int dx dy \;\mu_{XY}(x,y) \;\log{\mu_{XY}(x,y)\over \mu_X(x)\mu_Y(y)}\;,
114: \label{mi}
115: \ee
116: is non-negative and invariant under $x\to \phi(x)$ and $y\to \psi(y)$. It is (the limit of) a true information,
117: \be
118: I(X,Y) = \lim_{\Delta\to 0} [H_{\rm bin}(X)+H_{\rm bin}(Y)-H_{\rm bin}(X,Y)].
119: \ee
120:
121: In applications, one usually has the data available in form of $N$ sample points $(x_i,y_i), \, i=1,\ldots N$
122: which are assumed to be i.i.d. realizations. There exist numerous algorithms to estimate $I(X,Y)$ and entropies.
123: We use in the following the MI estimators proposed recently in Ref.~\cite{mi}, and we refer to this paper for a
124: review of alternative methods.
125:
126:
127: {\bf Algorithmic Information Theory:} In contrast to Shannon theory where the basic objects are random variables
128: and entropies are {\it average} informations, algorithmic information theory deals with individual symbol
129: strings and with the actual information needed to specify them. To ``specify" a sequence $X$ means here to give
the necessary input to a universal computer $U$, such that $U$ prints $X$ on its output and stops. The analogue
of entropy, usually called the {\it complexity} $K(X)$ of $X$, is the minimal length of an input which
132: leads to the output $X$, for fixed $U$. It depends on $U$, but it can be shown that this dependence is weak and
133: can be neglected in the limit when $K(X)$ is large \cite{li-vi}.
134:
135: Let us denote the concatenation of two strings $X$ and $Y$ as $XY$. Its complexity is $K(XY)$. It is intuitively
136: clear that $K(XY)$ should be larger than $K(X)$ but cannot be larger than the sum $K(X)+K(Y)$. Finally, one
137: expects that $K(X|Y)$, defined as the minimal length of a program printing $X$ when $Y$ is furnished as
138: auxiliary input, is related to $K(XY)-K(Y)$. Indeed, one can show \cite{li-vi} (again within correction terms
139: which become irrelevant asymptotically) that
140: \be
141: 0 \leq K(X|Y) \simeq K(XY)-K(Y) \leq K(X).
142: \ee
143: Notice the close similarity with Shannon entropy. The algorithmic information in $Y$ about $X$ is finally
144: \be
145: I_{\rm alg}(X,Y) = K(X) - K(X|Y) \simeq K(X)+K(Y)-K(XY),
146: \ee
147: and similarly for more than two strings. Within the same additive correction terms, one shows that it is
symmetric, $I_{\rm alg}(X,Y) =I_{\rm alg}(Y,X)$, and can thus serve as an analogue of mutual information.
149:
150: $K(X)$ is in general not computable. But one can easily give upper bounds: The length of any input which
151: produces $X$ (e.g. by spelling it out verbatim) is an upper bound. Improved upper bounds are provided by any
152: file compression algorithm.
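
As a rough illustration of such upper bounds, the following Python sketch uses compressed lengths as stand-ins for $K$. It assumes the general-purpose compressor zlib from the Python standard library, which gives much cruder bounds than a compressor tailored to the data; the function names \texttt{c} and \texttt{i\_alg} are illustrative.

\begin{verbatim}
```python
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes: a computable upper-bound proxy for K."""
    return len(zlib.compress(data, 9))

def i_alg(x: bytes, y: bytes) -> int:
    """Estimate of I_alg(X,Y) ~ K(X) + K(Y) - K(XY) from compressed lengths."""
    return c(x) + c(y) - c(x + y)
```
\end{verbatim}

With such a proxy the estimate is only approximately symmetric, since compressing $XY$ and $YX$ generally yields different lengths.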
153: %Good compression algorithms will give good approximations to $K(X)$, and algorithms whose
154: %performance does not depend on the input file length (in particular since they do not segment the file during
155: %compression) will be crucial for the following.
156:
157:
158: {\bf MI-Based Distance Measures:} When comparing objects with different marginal or joint informations, it seems
159: intuitively clear that one should prefer {\it relative} distances over absolute ones, in order to minimize the
160: dependence on the total information. We here use
161: %only the one derived in \cite{li1}, since it gave more robust
162: %results in our applications, although it had been argued that the one given in \cite{li2} has theoretical
163: %advantages. More precisely,
164: the quantity \cite{li1,cluster}
165: \be
166: D(X,Y) = 1 - \frac{I(X,Y)}{H(X,Y)} \label{eq:dist}
167: \ee
which is a metric, with $D(X,X)=0$ and $D(X,Y)\leq 1$ for all pairs $(X,Y)$. The algorithmic version is also
169: {\it universal}: If $X\approx Y$ according to any non-trivial distance measure, then $X\approx Y$ also according
170: to $D$.
171:
172: A difficulty appears in the Shannon framework, if we deal with continuous random variables. As we mentioned
173: above, $\tilde{H}(X,Y)$ is not invariant under homeomorphisms (including rescalings) and not even positive
174: definite, while $H_{\rm bin}$ diverges when $\Delta\to 0$. We thus modified Eq.(\ref{eq:dist}) by replacing
175: $H(X,Y)$ by $H_{\rm bin}(X,Y)$ and replacing $D(X,Y)$ by the similarity measure
176: \be
S(X,Y) = \lim_{\Delta\to 0} (D(X,Y)-1)\log\Delta = \frac{I(X,Y)}{m_x+m_y},
\label{S}
\ee
where $m_x$ and $m_y$ are the dimensions of $X$ and $Y$.
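
The limit in Eq.(\ref{S}) can be made explicit, assuming the densities exist so that $H_{\rm bin}(X,Y) \approx \tilde{H}(X,Y) - (m_x+m_y)\log\Delta$, as for a single variable of dimension $m_x+m_y$:
\bea
(D(X,Y)-1)\log\Delta &=& -\frac{I(X,Y)\,\log\Delta}{H_{\rm bin}(X,Y)} \nonumber \\
 &\approx& \frac{I(X,Y)\,\log\Delta}{(m_x+m_y)\log\Delta - \tilde{H}(X,Y)} \nonumber \\
 &\to& \frac{I(X,Y)}{m_x+m_y} \quad (\Delta\to 0). \nonumber
\eea
The finite term $\tilde{H}(X,Y)$ becomes negligible against the diverging $\log\Delta$, so $S$ depends only on the invariant quantity $I(X,Y)$ and on the dimensions.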
180:
{\bf A Phylogenetic Tree for Mammals:} We study the mitochondrial DNA of a group of 34 mammals (see Fig.~\ref{phylotree}). The
182: same data \cite{Genebank} had previously been analyzed in \cite{li1,Reyes00}. This group includes among others
183: some rodents, ferungulates, and primates.
184:
185: Obviously we are here dealing with the algorithmic version of information theory, and informations are estimated
186: by lossless data compression. For constructing the proximity matrix between individual taxa, we proceed
187: essentially as in Ref.~\cite{li1}, using the special compression program GenCompress \cite{GenComp}.
188:
189: \begin{figure}
190: \begin{center}
191: \psfig{file=phylotree.eps,height=75mm,angle=270}
192: \caption{Phylogenetic tree for 34 mammals.
193: The heights of nodes are the distances
194: between the joining daughter clusters.}
195: \label{phylotree}
196: \end{center}
197: \vspace{-8mm}
198: \end{figure}
199:
200: In Ref.~\cite{li1}, this proximity matrix was then used as the input to a standard HC algorithm
201: (neighbour-joining and hypercleaning) to produce an evolutionary tree. Instead we use the MIC algorithm with
202: distance $D(X,Y)$. The joining of two clusters is obtained by simply concatenating the DNA sequences. There is
of course an arbitrariness in the order of concatenation: $XY$ and $YX$ give in general compressed
204: sequences of different lengths. But we found this to have negligible effect on the evolutionary tree.
205:
206: The overall structure of this tree closely resembles the one shown in Ref.~\cite{Reyes00}. All primates are
207: correctly clustered and also the relative order of the ferungulates (blue whale to horse) is in accordance with
208: Ref.~\cite{Reyes00}. On the other hand, there are a number of connections which obviously do not reflect the
209: true evolutionary tree, see for example the guinea pig with bat and elephant with platypus.
210: %
211: %As shown in Fig.~1 all primates are correctly clustered together. Also, the relative order of ferungulates
212: %closely resembles the one shown in Ref. \cite{Reyes00}, although it deviates in some details. The clusters of
213: %the guinea pig and the bat, and the one of platypus and elephant, obviously do not reflect the true evolutionary
214: %tree.
215: %
But the latter two, in spite of being joined together, have a very large distance from each other; their
clustering thus just reflects the fact that neither the platypus nor the elephant has other close relatives in the
sample. All in all, however, the results shown in Fig.~\ref{phylotree} already capture surprisingly well the overall structure
shown in Ref.~\cite{Reyes00}. Dividing MI by the total information is essential for this success. If we had used
the non-normalized $I_{\rm alg}(X,Y)$ itself, the clustering algorithm used in \cite{li1} would not change much,
since all 34 DNA sequences have roughly the same length. But our MIC algorithm would fail completely:
222: After the first cluster formation, we have DNA sequences of very different lengths, and longer sequences tend
223: also to have larger MI, even if they are not closely related.
224:
225: The concatenation of $X$ and $Y$ will of course not lead to a plausible sequence of the common ancestor, but it
226: {\it optimally represents the information} about it. This information is essential to find the correct way
227: through higher hierarchy levels of the evolutionary tree, and it is preserved in concatenating.
228:
229:
230: {\bf Clustering of Minimally Dependent Components in an Electrocardiogram:} As our second application we choose
a case where Shannon theory is the proper setting. We show in Fig.~\ref{ICAECG0} an ECG recorded from the abdomen and thorax
232: of a pregnant woman \cite{ECGdata}. It is already seen from this graph that there are at least two important
233: components in this ECG: the heartbeat of the mother and of the fetus. In addition there is noise from various
234: sources (muscle activity, measurement noise, etc.). While it is easy to detect anomalies in the mother's ECG
235: from such a recording, it would be difficult to detect them in the fetal ECG.
236:
237: As a first approximation we can assume that the total ECG is a linear superposition of several independent
238: sources (mother, child, noise$_1$, noise$_2$,...). A standard method to disentangle such superpositions is {\it
independent component analysis} (ICA) \cite{ICA}. There, one tries to recover the sources by means of a linear
240: transformation $s_i(t) = \sum_{j=1}^n W_{ij} x_j(t)$, where $W_{ij}$ is determined by minimizing the estimated
241: MI between the $s_i$.
242:
243: In reality things are not so simple. For instance, the sources might not be independent, the number of sources
244: (including noise sources!) might be different from the number of channels, and the mixing might involve delays.
245: For the present case this implies that the heartbeat of the mother is seen in several reconstructed components
246: $s_i$, and that the ``independent" components are not independent at all. In particular, all components $s_i$
247: which have large contributions from the mother form a cluster with large intra-cluster MIs and small
248: inter-cluster MIs. The same is true for the fetal ECG, albeit less pronounced. To obtain clean recordings of the
249: fetal and maternal ECGs, we proceeded as follows \cite{Harald}.
250:
251: Since we expect different delays in the different channels, we first used Takens delay embedding \cite{Takens80}
252: with time delay 0.002$\,$s and embedding dimension 3, resulting in $24$ channels. We then formed 24 linear
combinations $s_i(t)$. We use the MI estimator of Ref.~\cite{mi}; for details see \cite{cluster}. Five of the resulting
least dependent components contain strong contributions of the mother's heartbeat, three are dominated by the
fetus. The rest contain mostly noise \cite{cluster}.
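
The embedding step can be sketched as follows. At the sampling rate of 500 Hz quoted in the caption of Fig.~\ref{ICAECG0}, a delay of 0.002$\,$s corresponds to one sample, and 24 embedded channels at embedding dimension 3 imply 8 recorded channels; the latter is an inference of this sketch and is not stated explicitly above. The function name \texttt{delay\_embed} and its arguments are illustrative.

\begin{verbatim}
```python
def delay_embed(channels, dim=3, lag=1):
    """Stack `dim` lagged copies of every channel.

    channels: list of equally long sample lists.
    Returns len(channels) * dim series of length T - (dim - 1) * lag,
    where T is the length of each input channel.
    """
    t = len(channels[0]) - (dim - 1) * lag
    return [ch[k * lag : k * lag + t]
            for ch in channels for k in range(dim)]
```
\end{verbatim}

The linear combinations $s_i(t)$ are then formed from the embedded channels rather than from the raw recordings.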
256:
257: In plotting the actual dendrogram (Fig.~\ref{ClustECG}) we used $S(X,Y)$ for the cluster analysis but used the
258: MI of the clusters to determine the height at which the two branches join. The MI of the first five channels,
259: e.g., is $\approx 1.44$~nats, while that of channels 6 to 8 is $\approx 0.3$~nats. For any two clusters (tuples)
260: $X=X_1\ldots X_n$ and $Y=Y_1\ldots Y_m$ one has $I(X,Y) \geq I(X)+I(Y)$. This guarantees, if the MI is estimated
261: correctly, that the tree is drawn properly. The two slight glitches (when clusters (1 - 14) and (15 - 18) join,
and when (21 - 22) is joined with 23) result from small errors in estimating MI. They in no way affect our
conclusions.
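
The inequality used above follows from the grouping property, Eq.(\ref{group}), in its generalization to arbitrary groupings, applied to the two tuples:
\be
I(X_1,\ldots,X_n,Y_1,\ldots,Y_m) = I(X) + I(Y) + I((X),(Y)), \nonumber
\ee
where $I((X),(Y)) \geq 0$ denotes the MI between the two clusters, each treated as a single object. With exact MI values, every node therefore lies at a height of at least the sum of the heights of its two daughter nodes.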
264:
265: \begin{figure}
266: \begin{center}
267: \psfig{file=ECG1s.eps,width=8cm}
268: \caption{ECG of a pregnant woman (sampling rate 500 Hz).}
269: \label{ICAECG0}
270: \end{center}
271: \end{figure}
272:
273: \begin{figure}
274: \begin{center}
275: \psfig{file=ECGclust.eps,width=8cm,angle=0}
276: \caption{Dendrogram for least dependent components.}
277: \label{ClustECG}
278: \end{center}
279: \vspace{-8mm}
280: \end{figure}
281:
282: In Fig.~\ref{ClustECG} one can clearly see two big clusters corresponding to the mother and to the child. There
283: are also some small clusters which should be considered as noise. For reconstructing the mother and child
284: contributions to Fig.~\ref{ICAECG0}, we have to decide on one specific clustering from the entire hierarchy. We
285: decided to make the cut at inter-cluster MI equal to 0.1, i.e. two clusters $X$ and $Y$ are joined whenever
286: $I((X),(Y)) \equiv I(X,Y)-I(X)-I(Y) \geq 0.1$. Reconstructing the first five traces of the original ECG from the
287: child components only, we obtain Fig.~\ref{reconstruct}.
288:
289: \begin{figure}[t]
290: \begin{center}
291: \psfig{file=ECG3small.eps,width=8cm}
292: \caption{ECG where all contributions except those of the child cluster have
293: been removed.}
294: \label{reconstruct}
295: \end{center}
296: \vspace{-8mm}
297: \end{figure}
298:
299: In summary, we have shown that MI can not only be used as a proximity measure in clustering, but that it also
300: suggests a conceptually very simple and natural hierarchical clustering algorithm. We do not claim that this
301: algorithm, called {\it mutual information clustering} (MIC), is always superior to other algorithms. Indeed, MI
is in general not easy to estimate. Obviously, when only crude estimates are possible, MIC will not give
very good results either. But as MI estimates become better, the results of MIC should improve as well. The present
304: paper was partly triggered by the development of a new class of MI estimators for continuous random variables
305: which have very small bias and also rather small variances \cite{mi}.
306:
We have illustrated our method with two applications, one from genetics and one from cardiology. MIC might not
give the optimal clustering for either application, but it seems promising that one common method gives decent
results for both, although they are very different.
310:
The results of MIC should improve if more data become available. This is trivial if we mean by that longer
312: time sequences in the application to ECG, and longer parts of the genome. It is less trivial that we expect MIC
313: to make fewer mistakes in a phylogenetic tree, when more species are included. The reason is that close-by
314: species will be correctly joined anyhow, and families -- which now are represented only by single species and
315: thus are poorly characterized -- will be much better described by the concatenated genomes if more species are
316: included.
317:
We would like to thank Arndt von Haeseler, Walter Nadler and Volker Roth for many useful discussions.
319:
320: \begin{thebibliography}{30}
321:
\bibitem{jain-dubes} A.K. Jain and R.C. Dubes, {\it Algorithms for Clustering Data} (Prentice
323: Hall, Englewood Cliffs, NJ, 1988).
324:
325: \bibitem{cover-thomas} T.M. Cover and J.A. Thomas, {\it Elements of Information
326: Theory} (Wiley, New York 1991).
327:
328: \bibitem{li-vi} M. Li and P. Vitanyi, {\it An Introduction to Kolmogorov Complexity
329: and its Applications}, 2nd ed. (Springer, New York 1997).
330:
331: \bibitem{li1} M. Li \textit{et al.}, Bioinformatics, {\bf 17}, 149 (2001).
332:
333: \bibitem{li2} M. Li \textit{et al.}, e-print CC/0111054 (2002).
334:
335: \bibitem{mi} A. Kraskov, H. St{\"o}gbauer and P. Grassberger, e-print cond-mat/0305641
336: (2003).
337:
338: %\bibitem{cluster} A. Kraskov, H. St{\"o}gbauer, R.G. Andrzejak, and P. Grassberger,
339: % e-print cond-mat/.... (2003).
340:
\bibitem{cluster} A. Kraskov, H. St{\"o}gbauer, R.G. Andrzejak, and P. Grassberger, to be published.
342:
343: %\bibitem{footnote} We assume that densities really exists. If not
344: % (e.g. if $X$ lives on a fractal set), then $m$ is to be replaced by the Hausdorff dimension
345: % of the measure $\mu$.
346:
347: \bibitem{Genebank} http://www.ncbi.nlm.nih.gov/
348:
349: \bibitem{Reyes00} A. Reyes \textit{et al.}, Mol. Biol. Evol. {\bf 17}, 979 (2000).
350:
\bibitem{GenComp} http://www.cs.ucsb.edu/\~{}mli/Bioinf/software/index.html
352:
353: \bibitem{ECGdata} B.L.R. De Moor, ed., {\sf www.esat.kuleuven.ac.be/sista/daisy} (1997).
354:
355: \bibitem{ICA} A. Hyv\"arinen, J. Karhunen, and E. Oja, {\it Independent
356: Component Analysis} (Wiley, New York 2001).
357:
\bibitem{Harald} H. St{\"o}gbauer, A. Kraskov, and P. Grassberger, to be published.
359:
360: \bibitem{Takens80} F. Takens. In {\it
361: Dynamical Systems and Turbulence}, eds. D.A. Rand and L.S. Young, Springer Lecture Notes
362: in Mathematics 898, page 366 (Springer, Berlin 1980).
363:
364: \end{thebibliography}
365:
366: \end{document}
367: