1: \documentclass[twocolumn]{article}
2: %\documentclass[aps,preprint]{revtex4}
3: %\documentclass[aps,twocolumn]{revtex4}
4: \usepackage{epsfig}
5: \newcommand{\be}{\begin{equation}}
6: \newcommand{\ee}{\end{equation}}
7: \newcommand{\bea}{\begin{eqnarray}}
8: \newcommand{\eea}{\end{eqnarray}}
9: \newcommand{\ba}{\begin{array}}
10: \newcommand{\ea}{\end{array}}
11:
12: \begin{document}
13: \title{Hierarchical Clustering Based on Mutual Information}
14: \author{Alexander Kraskov, Harald St\"ogbauer, Ralph G. Andrzejak, and Peter Grassberger \\
15: %}
16: %\affiliation{
17: John-von-Neumann Institute for Computing, \\ Forschungszentrum J\"ulich, D-52425 J\"ulich, Germany}
18:
19: \date{\today}
20: \maketitle
21: \begin{abstract}
22: \noindent \textbf{Motivation:} Clustering is a frequently used concept in a variety of bioinformatical
23: applications. We present a new method for hierarchical clustering of data called the {\it mutual information
24: clustering} (MIC) algorithm. It uses mutual information (MI) as a similarity measure and exploits its grouping
25: property: The MI between three objects $X, Y,$ and $Z$ is equal to the sum of the MI between $X$ and $Y$, plus
26: the MI between $Z$ and the combined object $(XY)$.
27:
28: \noindent \textbf{Results:} We use this both in the Shannon (probabilistic) version of information theory, where
29: the ``objects" are probability distributions represented by random samples, and in the Kolmogorov (algorithmic)
30: version, where the ``objects" are symbol sequences. We apply our method to the construction of mammal
31: phylogenetic trees from mitochondrial DNA sequences and we reconstruct the fetal ECG from the output of
32: independent components analysis (ICA) applied to the ECG of a pregnant woman.
33:
34: \noindent \textbf{Availability:} The programs for estimation of MI and for clustering (probabilistic version)
35: are available at \textsf{http://www.fz-juelich.de/nic/cs/software}.
36:
37: \noindent \textbf{Contact:} \textsf{a.kraskov@fz-juelich.de}
38: \end{abstract}
39: %\maketitle
40:
56:
57: \section{Introduction}
58:
59: Classification or organizing of data is very important in all scientific disciplines. It is one of the most
60: fundamental mechanisms of understanding and learning \cite{jain-dubes}. Depending on the problem, classification
61: can be exclusive or overlapping, supervised or unsupervised. In the following we will be interested only in
62: exclusive unsupervised classification. This type of classification is usually called clustering or cluster
63: analysis.
64:
65: An instance of a clustering problem consists of a set of objects and a set of properties (called the characteristic
66: vector) for each object. The goal of clustering is the separation of objects into groups using only the
67: characteristic vectors. Indeed, in general only certain aspects of the characteristic vectors will be relevant,
68: and extracting these relevant features is one field where mutual information (MI) plays a major role
69: \cite{bottleneck}, but we shall not deal with this here. Cluster analysis organizes data either as a single
70: grouping of individuals into non-overlapping clusters or as a hierarchy of nested partitions. The first approach
71: is called partitional clustering (PC), the second one is hierarchical clustering (HC). One of the main features
72: of HC methods is the visual impact of the {\it dendrogram} which enables one to see how objects are being merged
73: into clusters. From any HC one can obtain a PC by restricting oneself to a ``horizontal" cut through the
74: dendrogram, while one cannot go in the other direction and obtain a full hierarchy from a single PC. Because of
75: their wide range of applications, there is a large variety of different clustering methods in use
76: \cite{jain-dubes}.
77:
78: The crucial point of all clustering algorithms is the choice of a {\it proximity measure}. This is obtained from
79: the characteristic vectors and can be either an indicator for similarity (i.e. large for similar and small for
80: dissimilar objects), or for dissimilarity. In the latter case it is convenient, but not obligatory, that it satisfy
81: the standard axioms of a metric (positivity, symmetry, and triangle inequality). A matrix of all pairwise
82: proximities is called a proximity matrix. Among HC methods one should distinguish between those where one uses the
83: characteristic vectors only at the first level of the hierarchy and derives the proximities between clusters
84: from the proximities of their constituents, and methods where the proximities are calculated each time from
85: their characteristic vectors. The latter strategy (which is used also in the present paper) allows of course for
86: more flexibility but might also be computationally more costly.
87:
88: Quite generally, the ``objects" to be clustered can be either single (finite) patterns (e.g. DNA sequences) or
89: random variables, i.e. {\it probability distributions}. In the latter case the data are usually supplied in the form
90: of a statistical sample, and one of the simplest and most widely used similarity measures is the linear
91: (Pearson) correlation coefficient. But this is not sensitive to nonlinear dependencies which do not manifest
92: themselves in the covariance and can thus miss important features. This is in contrast to mutual information
93: (MI) which is also singled out by its information theoretic background \cite{cover-thomas}. Indeed, MI is zero
94: only if the two random variables are strictly independent.
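A minimal illustration of this point (a toy example chosen here for illustration, not taken from any data set): let $X$ take the values $-1,0,1$ with probability $1/3$ each, and let $Y=X^2$. Then
\be
   {\rm cov}(X,Y) = \langle X^3\rangle - \langle X\rangle\langle X^2\rangle = 0,
\ee
so the Pearson coefficient vanishes although $Y$ is completely determined by $X$. Since the conditional entropy $H(Y|X)$ is zero, the MI (in natural units) equals the entropy of $Y$,
\be
   I(X,Y) = H(Y) = \frac{1}{3}\log 3 + \frac{2}{3}\log\frac{3}{2} \approx 0.64,
\ee
which is clearly different from zero.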
95:
96: Another important feature of MI is that it has also an ``algorithmic" cousin, defined within algorithmic
97: (Kolmogorov) information theory \cite{li-vi} which measures the similarity between individual objects. For a
98: thorough discussion of distance measures based on algorithmic MI and for their application to clustering, see
99: \cite{li1,li2}.
100:
101: Another feature of MI which is essential for the present application is its {\it grouping property}: The MI
102: between three objects (distributions) $X, Y,$ and $Z$ is equal to the sum of the MI between $X$ and $Y$, plus
103: the MI between $Z$ and the combined object (joint distribution) $(XY)$,
104: \be
105: I(X,Y,Z) = I(X,Y) + I((X,Y),Z). \label{group}
106: \ee
107: Within Shannon information theory this is an exact theorem (see below), while it is true in the algorithmic
108: version up to the usual logarithmic correction terms \cite{li-vi}. Since $X,Y,$ and $Z$ can be themselves
109: composite, Eq.(\ref{group}) can be used recursively for a cluster decomposition of MI. This motivates the main
110: idea of our clustering method: instead of using e.g. centers of mass in order to treat clusters like
111: individual objects in an approximate way, we treat them exactly like individual objects when using MI as
112: proximity measure.
113:
114: More precisely, we propose the following scheme for clustering $n$ objects with MIC:\\
115: (1) Compute a proximity matrix based on pairwise mutual informations; assign $n$ clusters such that each cluster
116: contains exactly one object;\\
117: (2) find the two closest clusters $i$ and $j$; \\
118: (3) create a new cluster $(ij)$ by combining $i$ and $j$; \\
119: (4) delete the lines/columns with indices $i$ and $j$ from the proximity matrix, and add one line/column
120: containing the proximities between cluster $(ij)$ and all
121: other clusters; \\
122: (5) if the number of clusters is still $>2$, go to (2); else join the two clusters and stop.
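To get a feeling for the cost of this scheme, one can count the proximities that have to be evaluated (a simple bookkeeping sketch, assuming that every proximity is computed exactly once and that no shortcuts are used). Step (1) needs $n(n-1)/2$ pairwise values. Each pass through steps (2)-(4) that starts with $c$ clusters needs $c-2$ new values, and such passes occur for $c=n,n-1,\ldots,3$. In total,
\bea
   \frac{n(n-1)}{2} + \sum_{c=3}^{n}(c-2) &=& \frac{n(n-1)}{2} + \frac{(n-1)(n-2)}{2} \nonumber \\
      &=& (n-1)^2,
\eea
e.g. $33^2=1089$ evaluations for $n=34$ objects.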
123:
124: In the next section we shall review the pertinent properties of MI, both in the Shannon and in the algorithmic
125: version. This is applied in Sec.~3 to construct a phylogenetic tree using mitochondrial DNA and in Sec.~4 to
126: cluster the output channels of an independent component analysis (ICA) of an electrocardiogram (ECG) of a
127: pregnant woman, and to reconstruct from this the maternal and fetal ECGs. We finish with our conclusions in
128: Sec.~6.
129:
130:
131: \section{Mutual Information}
132:
133: \subsection{Shannon Theory}
134:
135: Assume that one has two random variables $X$ and $Y$. If they are discrete, we write $p_i(X) = {\rm
136: prob}(X=x_i)$, $p_i(Y) = {\rm prob}(Y=y_i)$, and $p_{ij} = {\rm prob}(X=x_i,Y=y_j)$ for the marginal and joint
137: distributions. Otherwise (and if they have finite densities) we denote the densities by $\mu_X(x),\mu_Y(y)$, and
138: $\mu(x,y)$. Entropies are defined for the discrete case as usual by $H(X) = - \sum_ip_i(X) \log p_i(X)$, $H(Y) =
139: - \sum_ip_i(Y) \log p_i(Y)$, and $H(X,Y)=-\sum_{i,j} p_{ij} \log p_{ij}$. Conditional entropies are defined as
140: $H(X|Y) = H(X,Y)-H(Y) = -\sum_{i,j} p_{ij} \log p_{i|j}$. The base of the logarithm determines the units in
141: which information is measured. In particular, taking base two leads to information measured in bits. In the
142: following, we will always use natural logarithms. The MI between $X$ and $Y$ is finally defined as
143: \bea
144: I(X,Y) &=& H(X)+H(Y)-H(X,Y) \nonumber \\
145:               &=& \sum_{i,j} p_{ij}\;\log{p_{ij}\over p_i(X)p_j(Y)}.
146: \eea
147: It can be shown to be non-negative, and is zero only when $X$ and $Y$ are strictly independent. For $n$ random
148: variables $X_1,X_2\ldots X_n$, the MI is defined as
149: \be
150: I(X_1,\ldots, X_n) = \sum_{k=1}^n H(X_k) - H(X_1,\ldots, X_n).
151: \ee This quantity is often referred to as (generalized) redundancy, in order to distinguish it from different
152: ``mutual informations" which are constructed analogously to higher order cumulants, but we shall not follow this
153: usage. Eq.(\ref{group}) can be checked easily, together with its generalization to arbitrary groupings. It means
154: that MI can be {\it decomposed into hierarchical levels}. By iterating it, one can decompose $I(X_1\ldots X_n)$
155: for any $n>2$ and for any partitioning of the set $(X_1\ldots X_n)$ into the MIs between elements within one
156: cluster and MIs between clusters.
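Spelled out for three variables, the check is a short calculation that uses only the definitions above:
\bea
   \lefteqn{I(X,Y) + I((X,Y),Z)} \nonumber \\
        &=& [H(X)+H(Y)-H(X,Y)] \nonumber \\
        & & +\; [H(X,Y)+H(Z)-H(X,Y,Z)] \nonumber \\
        &=& H(X)+H(Y)+H(Z)-H(X,Y,Z) \nonumber \\
        &=& I(X,Y,Z).
\eea
Here the combined object $(X,Y)$ is treated as a single random variable with entropy $H(X,Y)$. In the same way one finds for four variables grouped into two pairs, for instance, $I(X_1,X_2,X_3,X_4) = I(X_1,X_2) + I(X_3,X_4) + I((X_1,X_2),(X_3,X_4))$.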
157:
158:
159: For continuous variables one first introduces some binning (`coarse-graining'), and applies the above to the
160: binned variables. If $x$ is a vector with dimension $m$ and each bin has side length $\Delta$ (Lebesgue measure $\Delta^m$), then $p_i(X)
161: \approx \mu_X(x)\Delta^m$ with $x$ chosen suitably in bin $i$, and \footnote{Notice that we have here assumed
162: that densities really exist. If not (e.g. if $X$ lives on a fractal set), then $m$ is to be replaced by the
163: Hausdorff dimension of the measure $\mu$.}
164: \be
165: H_{\rm bin}(X) \approx \tilde{H}(X) - m \log \Delta
166: \ee
167: where the {\it differential entropy} is given by
168: \be
169: \tilde{H}(X) = -\int dx \;\mu_X(x) \log \mu_X(x).
170: \ee
171: Notice that $H_{\rm bin}(X)$ is a true (average) information and is thus non-negative, but $\tilde{H}(X)$ is not
172: an information and can be negative. Also, $\tilde{H}(X)$ is not invariant under homeomorphisms $x\to \phi(x)$.
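A simple example (chosen here for illustration only) makes these statements concrete. For a scalar $X$ distributed uniformly on $[0,a]$ one has $\mu_X(x)=1/a$ and
\be
   \tilde{H}(X) = \log a,
\ee
which is negative for $a<1$ and changes to $\log 2a$ under the rescaling $x\to 2x$. If $[0,a]$ is divided into $a/\Delta$ bins of equal width $\Delta$ (assuming that $a/\Delta$ is an integer), each bin has probability $\Delta/a$ and
\be
   H_{\rm bin}(X) = \log\frac{a}{\Delta} = \tilde{H}(X) - \log\Delta \geq 0,
\ee
in agreement with the general relation for $m=1$.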
173:
174: Joint entropies, conditional entropies, and MI are defined as above, with sums replaced by integrals. Like
175: $\tilde{H}(X)$, joint and conditional entropies are neither positive (semi-)definite nor invariant. But MI,
176: defined as
177: \be
178: I(X,Y) = \int\!\!\!\int dx dy \;\mu_{XY}(x,y) \;\log{\mu_{XY}(x,y)\over \mu_X(x)\mu_Y(y)}\;,
179: \label{mi}
180: \ee
181: is non-negative and invariant under $x\to \phi(x)$ and $y\to \psi(y)$. It is (the limit of) a true information,
182: \be
183: I(X,Y) = \lim_{\Delta\to 0} [H_{\rm bin}(X)+H_{\rm bin}(Y)-H_{\rm bin}(X,Y)].
184: \ee
185:
186: In applications, one usually has the data available in the form of a statistical sample. To estimate $I(X,Y)$ one
187: starts from $N$ bivariate measurements $(x_i,y_i), \, i=1,\ldots N$ which are assumed to be iid (independent
188: identically distributed) realizations. There exist numerous algorithms to estimate $I(X,Y)$ and entropies. We
189: shall use in the following the MI estimators proposed recently in Ref.~\cite{mi}, and we refer to this paper for
190: a review of alternative methods.
191:
192:
193: \subsection{Algorithmic Information Theory}
194:
195: In contrast to Shannon theory where the basic objects are random variables and entropies are {\it average}
196: informations, algorithmic information theory deals with individual symbol strings and with the actual
197: information needed to specify them. To ``specify" a sequence $X$ means here to give the necessary input to a
198: universal computer $U$, such that $U$ prints $X$ on its output and stops. The analogue of entropy, called here
199: usually the {\it complexity} $K(X)$ of $X$, is the minimal length of an input which leads to the output $X$, for
200: fixed $U$. It depends on $U$, but it can be shown that this dependence is weak and can be neglected in the limit
201: when $K(X)$ is large \cite{li-vi}.
202:
203: Let us denote the concatenation of two strings $X$ and $Y$ as $XY$. Its complexity is $K(XY)$. It is intuitively
204: clear that $K(XY)$ should be larger than $K(X)$ but cannot be larger than the sum $K(X)+K(Y)$. Finally, one
205: expects that $K(X|Y)$, defined as the minimal length of a program printing $X$ when $Y$ is furnished as
206: auxiliary input, is related to $K(XY)-K(Y)$. Indeed, one can show \cite{li-vi} (again within correction terms
207: which become irrelevant asymptotically) that
208: \be
209: 0 \leq K(X|Y) \simeq K(XY)-K(Y) \leq K(X).
210: \ee
211: Notice the close similarity with Shannon entropy.
212:
213: The algorithmic information in $Y$ about $X$ is finally defined as
214: \be
215: I_{\rm alg}(X,Y) = K(X) - K(X|Y) \simeq K(X)+K(Y)-K(XY).
216: \ee
217: Within the same additive correction terms, one shows that it is symmetric, $I_{\rm alg}(X,Y) = I_{\rm
218: alg}(Y,X)$, and can thus serve as an analogue of mutual information.
219:
220: From the halting theorem it follows that $K(X)$ is in general not computable. But one can easily give upper
221: bounds. Indeed, the length of any input which produces $X$ (e.g. by spelling it out verbatim) is an upper bound.
222: Improved upper bounds are provided by any file compression algorithm such as gzip or UNIX ``compress". Good
223: compression algorithms will give good approximations to $K(X)$, and algorithms whose performance does not depend
224: on the input file length (in particular since they do not segment the file during compression) will be crucial
225: for the following.
226:
227:
228: \subsection{MI-Based Distance Measures}
229:
230: Mutual information itself is a similarity measure in the sense that small values imply large ``distances" in a
231: loose sense. But it would be useful to modify it such that the resulting quantity is a metric in the strict
232: sense, i.e. satisfies the triangle inequality. Indeed, the first such metric is well known \cite{cover-thomas}:
233: The quantity
234: \be
235: d(X,Y)=H(X|Y)+H(Y|X)=H(X,Y)-I(X,Y) \label{d}
236: \ee
237: satisfies the triangle inequality, in addition to being non-negative and symmetric and to satisfying $d(X,X)=0$.
238: The proof proceeds by first showing that for any $Z$
239: \be
240: H(X|Y) \leq H(X,Z|Y) \leq H(X|Z)+H(Z|Y). \label{lemma0}
241: \ee
242:
243: But $d(X,Y)$ is not appropriate for our purposes. Since we want to compare the proximity between two single
244: objects and that between two clusters containing maybe many objects, we would like the distance measure to be
245: unbiased by the sizes of the clusters. As argued forcefully in \cite{li1,li2}, this is not true for $I_{\rm
246: alg}(X,Y)$, and for the same reasons it is not true for $I(X,Y)$ or $d(X,Y)$ either: A mutual information of
247: a thousand bits should be considered as large if $X$ and $Y$ themselves are just a thousand bits long, but it
248: should be considered as very small if $X$ and $Y$ were each huge, say one million bits.
249:
250: As shown in \cite{li1,li2} within the algorithmic framework, one can form two different distances which measure
251: {\it relative} distance, i.e. which are normalized by dividing by a total entropy. We sketch here only the
252: theorems and proofs for the Shannon version; they are indeed very similar to their algorithmic analogues in
253: \cite{li1,li2}.
254:
255: \noindent {\sc Theorem 1}: The quantity
256: \be
257: D(X,Y) = 1 - \frac{I(X,Y)}{H(X,Y)} = \frac{d(X,Y)}{H(X,Y)} \label{eq:dist}
258: \ee
259: is a metric, with $D(X,X)=0$ and $D(X,Y)\leq 1$ for all pairs $(X,Y)$.
260:
261: \noindent {\sc Proof}: Symmetry, positivity and boundedness are obvious. Since $D(X,Y)$ can be written as
262: \be
263: D(X,Y)=\frac{H(X|Y)}{H(X,Y)}+\frac{H(Y|X)}{H(Y,X)},
264: \ee
265: it is sufficient for the proof of the triangle inequality to show that each of the two terms on the r.h.s. is
266: bounded by an analogous inequality, i.e.
267: \be
268: \frac{H(X|Y)}{H(X,Y)} \leq \frac{H(X|Z)}{H(X,Z)}+\frac{H(Z|Y)}{H(Z,Y)} \label{lemma}
269: \ee
270: and similarly for the second term. Eq.(\ref{lemma}) is proven straightforwardly, using Eq.(\ref{lemma0}) and the
271: basic inequalities $H(X) \geq 0$, $H(X,Y) \leq H(X,Y,Z)$ and $H(X|Z)\geq 0$:
272: \bea
273: \frac{H(X|Y)}{H(X,Y)} &=& \frac{H(X|Y)}{H(Y)+H(X|Y)} \nonumber \\
274: & \leq & \frac{H(X|Z)+H(Z|Y)}{H(Y)+H(X|Z)+H(Z|Y)} \nonumber \\
275: & = & \frac{H(X|Z)+H(Z|Y)}{H(X|Z)+H(Y,Z)} \nonumber \\
276: & \leq &\frac{H(X|Z)}{H(X|Z)+H(Z)}+\frac{H(Z|Y)}{H(Y,Z)} \nonumber \\
277: & = & \frac{H(X|Z)}{H(X,Z)}+\frac{H(Z|Y)}{H(Z,Y)}.
278: \eea
279:
280: \noindent {\sc Theorem 2}: The quantity
281: \bea
282: D'(X,Y) & = & 1 - \frac{I(X,Y)}{\max\{H(X),H(Y)\}} \nonumber \\
283: & = & \frac{\max\{H(X|Y),H(Y|X)\}}{\max\{H(X),H(Y)\}} \label{eq:dist2}
284: \eea
285: is also a metric, also with $D'(X,X)=0$ and $D'(X,Y)\leq 1$ for all pairs $(X,Y)$. It is sharper than $D$, i.e.
286: $D'(X,Y) \leq D(X,Y)$.
287:
288: \noindent
289: {\sc Proof}: Again we have only to prove the triangle inequality, the other parts are trivial. For this we have to distinguish different cases \cite{li2}.\\
290: Case 1: $\max\{H(Z),H(Y)\}\leq H(X)$. Using Eq.(\ref{lemma0}) and $H(X)\geq \max\{H(Z),H(Y)\}$ we obtain
291: \bea
292: D'(X,Y) & = & \frac{H(X|Y)}{H(X)} \leq \frac{H(X|Z)}{H(X)} + \frac{H(Z|Y)}{H(X)} \nonumber \\
293:         & \leq & D'(X,Z)+D'(Z,Y).
294: \eea
295: Case 2: $\max\{H(Z),H(X)\}\leq H(Y)$. This is completely analogous.\\
296: Case 3: $H(X)\leq H(Y)< H(Z)$. We now have to show that
297: \bea
298: D'(X,Y) & = & \frac{H(Y|X)}{H(Y)} \leq \frac{H(Y|Z)+H(Z|X)}{H(Y)} \nonumber \\
299: & \stackrel{?}{\leq} & D'(X,Z)+D'(Z,Y) \nonumber \\
300: & = & \frac{H(Z|X)}{H(Z)}+\frac{H(Z|Y)}{H(Z)}. \label{dd}
301: \eea
302: Indeed, if the r.h.s. of the first line is less than 1, then
303: \bea
304: \frac{H(Y|X)}{H(Y)} & \leq &\frac{H(Y|Z)+H(Z|X)}{H(Y)} \nonumber \\
305: &\leq & \frac{H(Y|Z)+H(Z|X)+H(Z)-H(Y)}{H(Z)} \nonumber \\
306: & = & \frac{H(Z|Y)+H(Z|X)}{H(Z)},
307: \eea
308: and Eq.(\ref{dd}) holds. If it is larger than 1, then also $(H(Z|Y)+H(Z|X))/H(Z) \geq 1$.
309: Eq.(\ref{dd}) must now also hold, since $H(Y|X)/H(Y) \leq 1$. \\
310: Case 4: $H(Y)\leq H(X)< H(Z)$. This is completely analogous to case 3.
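For orientation, here is a small numerical example (a toy distribution chosen here for illustration). Let $X$ and $Y$ be binary with joint probabilities $p_{11}=p_{22}=0.4$ and $p_{12}=p_{21}=0.1$. Then $H(X)=H(Y)=\log 2\approx 0.6931$ and $H(X,Y)\approx 1.1936$, so that $I(X,Y)\approx 0.1927$ and
\bea
   d(X,Y) &=& H(X,Y)-I(X,Y) \approx 1.001, \nonumber \\
   D(X,Y) &=& 1-\frac{0.1927}{1.1936} \approx 0.839, \nonumber \\
   D'(X,Y) &=& 1-\frac{0.1927}{0.6931} \approx 0.722,
\eea
consistent with $D'(X,Y)\leq D(X,Y)\leq 1$.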
311:
312: Apart from scaling correctly with the total information, in contrast to $d(X,Y)$, the algorithmic analogues of
313: $D(X,Y)$ and $D'(X,Y)$ are also {\it universal} \cite{li2}. Essentially this means that if $X\approx Y$
314: according to any non-trivial distance measure, then $X\approx Y$ also according to $D$, and even more so (by
315: a factor of up to 2) according to $D'$. In contrast to the other properties of $D$ and $D'$, this is not easy to
316: carry over from algorithmic to Shannon theory. The proof in Ref.~\cite{li2} depends on $X$ and $Y$ being
317: discrete, which is obviously not true for probability distributions. Based on the universality argument, it was
318: argued in \cite{li2} that $D'$ should be superior to $D$, but the numerical studies shown in that reference did
319: not show a clear difference between them. In the following we shall therefore use primarily $D$ for simplicity,
320: but we checked that using $D'$ did not give systematically better results.
321:
322: A major difficulty appears in the Shannon framework, if we deal with continuous random variables. As we
323: mentioned above, Shannon informations are only finite for coarse-grained variables, while they diverge if the
324: resolution tends to zero. This means that dividing MI by the entropy as in the definitions of $D$ and $D'$
325: becomes problematic. One has essentially two alternative possibilities. The first is to actually introduce some
326: coarse-graining, although it would not have been necessary for the definition of $I(X,Y)$, and divide by the
327: coarse-grained entropies. This introduces an arbitrariness, since the scale $\Delta$ is completely ad hoc,
328: unless it can be fixed by some independent arguments. We have found no such arguments, and thus we propose the
329: second alternative. There we take $\Delta \to 0$. In this case $H(X) \sim -m_x \log \Delta$, with $m_x$ being the
330: dimension of $X$. In this limit $D$ and $D'$ would tend to 1. But using similarity measures
331: \bea
332: S(X,Y) = (1-D(X,Y))\log(1/\Delta), \\
333: S'(X,Y) = (1-D'(X,Y))\log(1/\Delta)
334: \eea
335: instead of $D$ and $D'$ gives {\it exactly} the same results in MIC, and
336: \be
337: S(X,Y) = \frac{I(X,Y)}{m_x+m_y}, \quad S'(X,Y) = \frac{I(X,Y)}{\max\{m_x,m_y\}}.
338: \label{S}
339: \ee
340: Thus, when dealing with continuous variables, we divide the MI either by the sum or by the maximum of the
341: dimensions. When starting with scalar variables and when $X$ is a cluster variable obtained by joining $m$
342: elementary variables, then its dimension is just $m_x=m$.
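As an illustration of how Eq.(\ref{S}) enters the MIC scheme, consider three jointly Gaussian scalar variables with unit variances and correlation coefficients $\rho_{12}=0.8$, $\rho_{13}=\rho_{23}=0.3$ (a toy example chosen here; for Gaussians we use the standard closed-form expression for MI in terms of covariance determinants instead of an estimator). The pairwise MIs are $I(X_i,X_j)=-\frac{1}{2}\log(1-\rho_{ij}^2)$, giving
\bea
   S(X_1,X_2) &\approx& \frac{0.511}{2} \approx 0.255, \nonumber \\
   S(X_1,X_3) = S(X_2,X_3) &\approx& \frac{0.047}{2} \approx 0.024,
\eea
so $X_1$ and $X_2$ are joined first. With ${\bf C}$ the full correlation matrix and ${\bf C}_{12}$ its upper left $2\times 2$ block, one has $\det {\bf C}_{12}=0.36$ and $\det{\bf C}=0.324$, hence
\bea
   I((X_1,X_2),X_3) &=& \frac{1}{2}\log\frac{\det {\bf C}_{12}}{\det {\bf C}} \approx 0.053, \nonumber \\
   S((X_1,X_2),X_3) &\approx& \frac{0.053}{2+1} \approx 0.018.
\eea
The grouping property is satisfied: $0.511+0.053 = 0.564 \approx -\frac{1}{2}\log\det{\bf C}$.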
343:
344:
345: \section{A Phylogenetic Tree for Mammals}
346:
347: As a first application, we study the mitochondrial DNA of a group of 34 mammals (see Fig.~\ref{phylotree}). Exactly the same
348: data \cite{Genebank} had previously been analyzed in \cite{li1,Reyes00}. This group includes among
349: others\footnote{opossum (\textit{Didelphis virginiana}), wallaroo (\textit{Macropus robustus}), and platypus
350: (\textit{Ornithorhynchus anatinus})} some rodents\footnote{rabbit (\textit{Oryctolagus cuniculus}), guinea pig
351: (\textit{Cavia porcellus}), fat dormouse (\textit{Glis glis}), rat (\textit{Rattus norvegicus}), squirrel
352: (\textit{Sciurus vulgaris}), and mouse (\textit{Mus musculus})}, ferungulates\footnote{horse (\textit{Equus caballus}),
353: donkey (\textit{Equus asinus}), Indian rhinoceros (\textit{Rhinoceros unicornis}), white rhinoceros
354: (\textit{Ceratotherium simum}), harbor seal (\textit{Phoca vitulina}), grey seal (\textit{Halichoerus grypus}),
355: cat (\textit{Felis catus}), dog (\textit{Canis familiaris}), fin whale (\textit{Balaenoptera physalus}), blue
356: whale (\textit{Balaenoptera musculus}), cow (\textit{Bos taurus}), sheep (\textit{Ovis aries}), pig (\textit{Sus
357: scrofa}), hippopotamus (\textit{Hippopotamus amphibius}), neotropical fruit bat (\textit{Artibeus jamaicensis}),
358: African elephant (\textit{Loxodonta africana}), aardvark (\textit{Orycteropus afer}), and armadillo
359: (\textit{Dasypus novemcinctus})}, and primates\footnote{human (\textit{Homo sapiens}), common chimpanzee
360: (\textit{Pan troglodytes}), pygmy chimpanzee (\textit{Pan paniscus}), gorilla (\textit{Gorilla gorilla}),
361: orangutan (\textit{Pongo pygmaeus}), gibbon (\textit{Hylobates lar}), and baboon (\textit{Papio hamadryas})}. It
362: had been chosen in \cite{li1} because of doubts about the relative closeness among these three groups
363: \cite{cao,Reyes00}.
364:
365: Obviously, we are here dealing with the algorithmic version of information theory, and informations are
366: estimated by lossless data compression. For constructing the proximity matrix between individual taxa, we
367: proceed essentially as in Ref.~\cite{li1}. But in addition to using the special compression program GenCompress
368: \cite{GenComp}, we also tested several general purpose compression programs such as BWTzip \cite{BWTzip} and the
369: UNIX tool bzip2.
370:
371: In Ref.~\cite{li1}, this proximity matrix was then used as the input to a standard HC algorithm
372: (neighbour-joining and hypercleaning) to produce an evolutionary tree. It is here where our treatment deviates
373: crucially. We used the MIC algorithm described in Sec.~1, with distance $D(X,Y)$. The joining of two clusters
374: (the third step in the MIC algorithm) is obtained by simply concatenating the DNA sequences. There is of course
375: an arbitrariness in the order of concatenation: $XY$ and $YX$ give in general compressed sequences of
376: different lengths. But we found this to have negligible effect on the evolutionary tree. The resulting
377: evolutionary tree obtained with GenCompress is shown in Fig.~\ref{phylotree}.
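One natural way to turn compressed file lengths into distances can be sketched as follows (in our notation $C(\cdot)$ denotes the length of the compressed file, used as a stand-in for $K(\cdot)$; this is a sketch based on the formulas of Sec.~2 and not necessarily identical in every detail to the procedure of Ref.~\cite{li1}; the numbers are invented for illustration and are not results). The algorithmic MI and the distance of Eq.(\ref{eq:dist}) are approximated by
\bea
   I_{\rm alg}(X,Y) &\approx& C(X)+C(Y)-C(XY), \nonumber \\
   D(X,Y) &\approx& 1-\frac{C(X)+C(Y)-C(XY)}{C(XY)}.
\eea
For example, hypothetical compressed lengths $C(X)=4000$, $C(Y)=4100$, and $C(XY)=6500$ bytes would give $I_{\rm alg}\approx 1600$ bytes and $D\approx 0.754$. After joining $X$ and $Y$, the concatenated string $XY$ is compressed together with each remaining sequence $Z$ to obtain $C(XYZ)$ and hence $D(XY,Z)$.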
378:
379: \begin{figure}
380: \begin{center}
381: \psfig{file=phylotree.eps,height=80mm,angle=270}
382: % \psfig{file=phylotree.eps,width=12cm,angle=270}
383: \caption{Phylogenetic tree for 34 mammals (31 eutherians plus 3 non-placental mammals).
384: In contrast to Fig.~\ref{ClustECG}, the heights of nodes are equal to the distances
385: between the joining daughter clusters.}
386: \label{phylotree}
387: \end{center}
388: \vspace{-8mm}
389: \end{figure}
390:
391:
392: As shown in Fig.~\ref{phylotree} the overall structure of this tree closely resembles the one shown in
393: Ref.~\cite{Reyes00}. All primates are correctly clustered and also the relative order of the ferungulates is in
394: accordance with Ref.~\cite{Reyes00}. On the other hand, there are a number of connections which obviously do not
395: reflect the true evolutionary tree, see for example the guinea pig with bat and elephant with platypus. But the
396: latter two, in spite of being joined together, have a very large distance from each other, thus their clustering
397: just reflects the fact that neither the platypus nor the elephant has other close relatives in the sample. All
398: in all, however, already the results shown in Fig.~\ref{phylotree} capture surprisingly well the overall structure shown in
399: Ref. \cite{Reyes00}. Dividing MI by the total information is essential for this success. If we had used the
400: non-normalized $I_{\rm alg}(X,Y)$ itself, the clustering algorithm used in \cite{li1} would not change much,
401: since all 34 DNA sequences have roughly the same length. But our MIC algorithm would break down completely:
402: After the first cluster formation, we have DNA sequences of very different lengths, and longer sequences tend
403: also to have larger MI, even if they are not closely related.
404:
405: A heuristic reasoning for the use of MIC for the reconstruction of an evolutionary tree might be given as
406: follows: Suppose that a proximity matrix has been calculated for a set of DNA sequences and the smallest
407: distance is found for the pair $(X,Y)$. Ideally, one would remove the sequences $X$ and $Y$, replace them by the
408: sequence of the common ancestor (say $Z$) of the two species, update the proximity matrix to find the smallest
409: entry in the reduced set of species, and so on. But the DNA sequence of the common ancestor is not available.
410: One solution might be that one tries to reconstruct it by making some compromise between the sequences $X$ and
411: $Y$. Instead, we essentially propose to concatenate the sequences $X$ and $Y$. This will of course not lead to a
412: plausible sequence of the common ancestor, but it will {\it optimally represent the information} about the
413: common ancestor. During the evolution since the time of the ancestor $Z$, some parts of its genome might have
414: changed both in $X$ and in $Y$. These parts are of little use in constructing any phylogenetic tree. Other parts
415: might not have changed in either. They are recognized anyhow by any sensible algorithm. Finally, some parts of
416: its genome will have mutated significantly in $X$ but not in $Y$, and vice versa. This information is essential
417: to find the correct way through higher hierarchy levels of the evolutionary tree, and it is preserved in
418: concatenating.
419:
420:
421: \section{Clustering of Minimally Dependent Components in an Electrocardiogram}
422:
423: As our second application we choose a case where Shannon theory is the proper setting. We show in Fig.~\ref{ICAECG0} an ECG
424: recorded from the abdomen and thorax of a pregnant woman \cite{ECGdata} (8 channels, sampling rate 500 Hz,
425: 5$\,$s total). It is already seen from this graph that there are at least two important components in this ECG:
426: the heartbeat of the mother, with a frequency of $\approx 3$ beats/s, and the heartbeat of the fetus with roughly
427: twice this frequency. The two are not synchronized. In addition there is noise from various sources (muscle
428: activity, measurement noise, etc.). While it is easy to detect anomalies in the mother's ECG from such a
429: recording, it would be difficult to detect them in the fetal ECG.
430:
431: As a first approximation we can assume that the total ECG is a linear superposition of several independent
432: sources (mother, child, noise$_1$, noise$_2$,...). A standard method to disentangle such superpositions is {\it
433: independent component analysis} (ICA) \cite{ICA}. In the simplest case one has $n$ independent sources
434: $s_i(t),\; i=1\ldots n$ and $n$ measured channels $x_i(t)$ obtained by instantaneous superpositions with a time
435: independent non-singular matrix ${\bf A}$,
436: \be
437: x_i(t) = \sum_{j=1}^n A_{ij} s_j(t)\;.
438: \ee
439: In this case the sources can be reconstructed by applying the inverse transformation ${\bf W} = {\bf A}^{-1}$
440: which is obtained by minimizing the (estimated) mutual informations between the transformed components $y_i(t) =
441: \sum_{j=1}^n W_{ij} x_j(t)$. If some of the sources are Gaussian, this leads to ambiguities \cite{ICA}, but it
442: gives a unique solution if the sources have more structure.
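A two-channel toy case (with numbers chosen here for illustration) makes the notation concrete. For
\be
   {\bf A} = \left(\ba{cc} 1 & 1/2 \\ 1/2 & 1 \ea\right), \quad
   {\bf W} = {\bf A}^{-1} = \frac{1}{3}\left(\ba{cc} 4 & -2 \\ -2 & 4 \ea\right),
\ee
one has $y_1 = (4x_1-2x_2)/3 = s_1$ and $y_2 = (4x_2-2x_1)/3 = s_2$. Note that minimizing the MI can fix ${\bf W}$ only up to a permutation and a rescaling of the components, since neither operation changes the MI between them.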
443:
444: In reality things are not so simple. For instance, the sources might not be independent, the number of sources
445: (including noise sources!) might be different from the number of channels, and the mixing might involve delays.
446: For the present case this implies that the heartbeat of the mother is seen in several reconstructed components
447: $y_i$, and that the supposedly ``independent" components are not independent at all. In particular, all
448: components $y_i$ which have large contributions from the mother form a cluster with large intra-cluster MIs and
449: small inter-cluster MIs. The same is true for the fetal ECG, albeit less pronounced.
450: It is thus our aim to \\
451: 1) optimally decompose the signals into least dependent components;\\
452: 2) cluster these components hierarchically such that the most dependent ones are
453: grouped together;\\
454: 3) decide on an optimal level of the hierarchy, such that the clusters make most sense
455: physiologically;\\
456: 4) project onto these clusters and apply the inverse transformations to obtain cleaned signals for the sources
457: of interest.
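
Step 4 can be written compactly as follows. This is a sketch under the assumption that the de-mixing matrix
${\bf W}$ is invertible; the index set $C$ and the diagonal matrix ${\bf P}_C$ are notation introduced here. Let
$({\bf P}_C)_{ii}=1$ if component $y_i$ belongs to the chosen cluster $C$ and $0$ otherwise, with all
off-diagonal elements vanishing. The cleaned signal in the original channels is then
\be
   {\bf x}_C(t) = {\bf W}^{-1}\,{\bf P}_C\,{\bf W}\,{\bf x}(t)\;,
\ee
so that all components outside $C$ are set to zero before transforming back. Choosing $C$ to contain all
components gives ${\bf P}_C={\bf 1}$ and returns the original recording ${\bf x}(t)$.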
458:
459: \begin{figure}
460: \begin{center}
461: \psfig{file=ECG1s.eps,width=80mm}
462: \caption{ECG of a pregnant woman.}
463: \label{ICAECG0}
464: \end{center}
465: \vspace{-8mm}
466: \end{figure}
467:
468: \begin{figure}
469: \begin{center}
470: \psfig{file=ECG2s.eps,width=80mm}
471: \caption{Least dependent components of the ECG shown in Fig.~\ref{ICAECG0}, after increasing
472: the number of channels by delay embedding.}
473: \label{ICAECG}
474: \end{center}
475: \vspace{-8mm}
476: \end{figure}
477:
478: Technically we proceeded as follows \cite{Harald}:
479:
480: Since we expect different delays in the different channels, we first used Takens delay embedding \cite{Takens80}
481: with time delay 0.002$\,$s and embedding dimension 3, resulting in $24$ channels. We then formed 24 linear
482: combinations $y_i(t)$ and determined the de-mixing coefficients $W_{ij}$ by minimizing the overall mutual
483: information between them, using the MI estimator proposed in \cite{mi}. There, two classes of estimators were
introduced, one with square and the other with rectangular neighbourhoods. Within each class, one can choose the
number of neighbours, called $k$ in the following, on which the estimate is based. Small values of $k$ lead to a
small bias but to large statistical errors, while the opposite is true for large $k$. But even for very large
$k$ the bias is zero when the true MI is zero, and otherwise it is systematic in that absolute values of the MI
are underestimated. Therefore this bias does not affect the determination of the optimal de-mixing matrix. It
does, however, depend on the dimension of the random variables, so large values of $k$ are not suitable for the
clustering. We thus proceeded as follows: we first used $k=100$ and square neighbourhoods to obtain the least
491: dependent components $y_i(t)$, and then used $k=3$ with rectangular neighbourhoods for the clustering. The
492: resulting least dependent components are shown in Fig.~\ref{ICAECG}. They are sorted such that the first
components (1--5) are dominated by the mother's ECG, while the next three contain large contributions from the
fetus. The remaining components contain mostly noise, although some still seem to be mixed.
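
For concreteness, the embedding step can be sketched as follows. We assume here that the recording has $8$
channels, which is what embedding dimension 3 and a total of 24 channels imply; the symbol $\tau$ and the
ordering of the delayed copies are chosen for illustration only. Each measured channel $x_i(t)$ is replaced by
the three delayed copies
\be
   \left( x_i(t),\; x_i(t-\tau),\; x_i(t-2\tau) \right), \qquad \tau = 0.002\,{\rm s}\;,
\ee
and collecting them for $i=1,\ldots,8$ gives the $8\times 3 = 24$ channels from which the $y_i(t)$ are formed.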
495:
496: These results obtained by visual inspection are fully supported by the cluster analysis. The dendrogram is shown
in Fig.~\ref{ClustECG}. In constructing it we used $S(X,Y)$ (Eq.~(\ref{S})) as similarity measure to find the
498: correct topology. Again we would have obtained much worse results if we had not normalized it by dividing MI by
499: $m_X+m_Y$. In plotting the actual dendrogram, however, we used the MI of the cluster to determine the height at
which the two daughters join. The MI of the first five channels, e.g., is $\approx 1.43$, while that of channels
6 to 8 is $\approx 0.34$. For any two clusters (tuples) $X=X_1\ldots X_n$ and $Y=Y_1\ldots Y_m$ one has $I(X,Y)
\geq I(X)+I(Y)$, where $I(X,Y)$ denotes the MI of the joined cluster $(X_1\ldots X_n Y_1\ldots Y_m)$. This
guarantees, if the MI is estimated correctly, that no cluster is drawn lower than its daughters. The two
slight glitches (when clusters (1--14) and (15--18) join, and when (21--22) is joined with 23) result from small
errors in estimating MI. They in no way affect our conclusions.
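
The inequality between the MI of a cluster and the MIs of its two daughters follows from the grouping property.
As a sketch, write $I(X)$ and $I(Y)$ for the MIs within the two daughter clusters, and
$I\big((X_1\ldots X_n),(Y_1\ldots Y_m)\big)$ for the MI between them when each is treated as a single combined
object (this last notation is introduced here for illustration). Then
\bea
   I(X_1,\ldots,X_n,Y_1,\ldots,Y_m) &=& I(X) + I(Y) \nonumber \\
     & & +\, I\big((X_1\ldots X_n),(Y_1\ldots Y_m)\big)\;,
\eea
and the last term is an ordinary MI between two objects and hence non-negative. With the values quoted above, a
cluster formed by directly joining channels 1--5 with channels 6--8 would have to lie at a height of at least
$1.43+0.34=1.77$ if all MIs were estimated exactly.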
505:
506: \begin{figure}
507: \begin{center}
508: \psfig{file=ECGclust.eps,width=80mm,angle=0}
509: \caption{Dendrogram for least dependent components. The height where the two branches of
510: a cluster join corresponds to the MI of the cluster.}
511: \label{ClustECG}
512: \end{center}
513: \vspace{-8mm}
514: \end{figure}
515:
516: In Fig.~\ref{ClustECG} one can clearly see two big clusters corresponding to the mother and to the child. There
517: are also some small clusters which should be considered as noise. For reconstructing the mother and child
518: contributions to Fig.~\ref{ICAECG0}, we have to decide on one specific clustering from the entire hierarchy. We
519: decided to make the cut such that mother and child are separated. The resulting clusters are indicated in
520: Fig.~\ref{ClustECG} and were already anticipated in sorting the channels. Reconstructing the original ECG from
521: the child components only, we obtain Fig.~\ref{reconstruct}.
522:
523: \begin{figure}
524: \begin{center}
525: \psfig{file=ECG3s.eps,width=80mm}
526: \caption{Original ECG where all contributions except those of the child cluster have
527: been removed.}
528: \label{reconstruct}
529: \vspace{-8mm}
530: \end{center}
531: \end{figure}
532:
533:
534: \section{Conclusions}
535:
536: We have shown that MI can not only be used as a proximity measure in clustering, but that it also suggests a
537: conceptually very simple and natural hierarchical clustering algorithm. We do not claim that this algorithm,
538: called {\it mutual information clustering} (MIC), is always superior to other algorithms. Indeed, MI is in
general not easy to estimate. Obviously, when only crude estimates are possible, MIC will not give optimal
results either. But as MI estimates improve, the results of MIC should improve as well. The present paper was
541: partly triggered by the development of a new class of MI estimators for continuous random variables which have
542: very small bias and also rather small variances \cite{mi}.
543:
We have illustrated our method with two applications, one from genetics and one from cardiology. MIC might not
give the very best clustering for either application, but it seems promising that one common method gives decent
results for both, although they are very different.
547:
The results of MIC should improve if more data become available. This is trivial if we mean by that longer
time sequences in the application to ECG, and longer parts of the genome in the application of Sec.~3. It is less
trivial that we expect MIC to make fewer mistakes in a phylogenetic tree when more species are included. The
551: reason is that close-by species will be correctly joined anyhow, and families -- which now are represented only
552: by single species and thus are poorly characterized -- will be much better described by the concatenated genomes
553: if more species are included.
554:
555: There are two versions of information theory, algorithmic and probabilistic, and therefore there are also two
556: variants of MI and of MIC. We discussed in detail one application of each, and showed that indeed common
557: concepts were involved in both. In particular it was crucial to normalize MI properly, so that it is essentially
558: the {\it relative} MI which is used as proximity measure. For conventional clustering algorithms using
559: algorithmic MI as proximity measure this had already been stressed in \cite{li1,li2}, but it is even more
560: important in MIC, both in the algorithmic and in the probabilistic versions.
561:
562: In the probabilistic version, one studies the clustering of probability distributions. But usually distributions
563: are not provided as such, but are given implicitly by finite random samples drawn (more or less) independently
564: from them. On the other hand, the full power of algorithmic information theory is only reached for infinitely
565: long sequences, and in this limit any individual sequence defines a sequence of probability measures on finite
566: subsequences. Thus the strict distinction between the two theories is somewhat blurred in practice.
567: Nevertheless, one should not confuse the similarity between two sequences (two English books, say) and that
568: between their subsequence statistics. Two sequences are maximally different if they are completely random, but
their statistics for short subsequences are then identical (all subsequences appear in both with equal
570: probabilities). Thus one should always be aware of what similarities or independencies one is looking for. The
571: fact that MI can be used in similar ways for all these problems is not trivial.
572:
573:
We would like to thank Arndt von Haeseler, Walter Nadler and Volker Roth for many useful discussions.
575:
576: \bibliographystyle{apalike}
577: \bibliography{lit}
578:
579: \end{document}
580: