0311:q-bio0311039/mm.tex

1: \documentclass[twocolumn]{article}

2: %\documentclass[aps,preprint]{revtex4}

3: %\documentclass[aps,twocolumn]{revtex4}

4: \usepackage{epsfig}

5: \newcommand{\be}{\begin{equation}}

6: \newcommand{\ee}{\end{equation}}

7: \newcommand{\bea}{\begin{eqnarray}}

8: \newcommand{\eea}{\end{eqnarray}}

9: \newcommand{\ba}{\begin{array}}

10: \newcommand{\ea}{\end{array}}

11:

12: \begin{document}

13: \title{Hierarchical Clustering Based on Mutual Information}

14: \author{Alexander Kraskov, Harald St\"ogbauer, Ralph G. Andrzejak, and Peter Grassberger \\

15: %}

16: %\affiliation{

17: John-von-Neumann Institute for Computing, \\ Forschungszentrum J\"ulich,D-52425 J\"ulich, Germany}

18:

19: \date{\today}

20: \maketitle

21: \begin{abstract}

22: \noindent \textbf{Motivation:} Clustering is a frequently used concept in a variety of bioinformatical

23: applications. We present a new method for hierarchical clustering of data called the {\it mutual information

24: clustering} (MIC) algorithm. It uses mutual information (MI) as a similarity measure and exploits its grouping

25: property: The MI between three objects $X, Y,$ and $Z$ is equal to the sum of the MI between $X$ and $Y$, plus

26: the MI between $Z$ and the combined object $(XY)$.

27:

28: \noindent \textbf{Results:} We use this both in the Shannon (probabilistic) version of information theory, where

29: the ``objects" are probability distributions represented by random samples, and in the Kolmogorov (algorithmic)

30: version, where the ``objects" are symbol sequences. We apply our method to the construction of mammal

31: phylogenetic trees from mitochondrial DNA sequences and we reconstruct the fetal ECG from the output of

32: independent components analysis (ICA) applied to the ECG of a pregnant woman.

33:

34: \noindent \textbf{Availability:} The programs for estimation of MI and for clustering (probabilistic version)

35: are available at \textsf{http://www.fz-juelich.de/nic/cs/software}.

36:

37: \noindent \textbf{Contact:} \textsf{a.kraskov@fz-juelich.de}

38: \end{abstract}

39: %\maketitle

40:

56:

57: \section{Introduction}

58:

59: Classification or organizing of data is very important in all scientific disciplines. It is one of the most

60: fundamental mechanisms of understanding and learning \cite{jain-dubes}. Depending on the problem, classification

61: can be exclusive or overlapping, supervised or unsupervised. In the following we will be interested only in

62: exclusive unsupervised classification. This type of classification is usually called clustering or cluster

63: analysis.

64:

65: An instance of a clustering problem consists of a set of objects and a set of properties (called the characteristic

66: vector) for each object. The goal of clustering is the separation of objects into groups using only the

67: characteristic vectors. Indeed, in general only certain aspects of the characteristic vectors will be relevant,

68: and extracting these relevant features is one field where mutual information (MI) plays a major role

69: \cite{bottleneck}, but we shall not deal with this here. Cluster analysis organizes data either as a single

70: grouping of individuals into non-overlapping clusters or as a hierarchy of nested partitions. The first approach

71: is called partitional clustering (PC), the second one is hierarchical clustering (HC). One of the main features

72: of HC methods is the visual impact of the {\it dendrogram} which enables one to see how objects are being merged

73: into clusters. From any HC one can obtain a PC by restricting oneself to a ``horizontal" cut through the

74: dendrogram, while one cannot go in the other direction and obtain a full hierarchy from a single PC. Because of

75: their wide range of applications, there is a large variety of different clustering methods in use

76: \cite{jain-dubes}.

77:

78: The crucial point of all clustering algorithms is the choice of a {\it proximity measure}. This is obtained from

79: the characteristic vectors and can be either an indicator for similarity (i.e. large for similar and small for

80: dissimilar objects), or dissimilarity. In the latter case it is convenient but not obligatory if it satisfies

81: the standard axioms of a metric (positivity, symmetry, and triangle inequality). A matrix of all pairwise

82: proximities is called the proximity matrix. Among HC methods one should distinguish between those where one uses the

83: characteristic vectors only at the first level of the hierarchy and derives the proximities between clusters

84: from the proximities of their constituents, and methods where the proximities are calculated each time from

85: their characteristic vectors. The latter strategy (which is used also in the present paper) allows of course for

86: more flexibility but might also be computationally more costly.

87:

88: Quite generally, the ``objects" to be clustered can be either single (finite) patterns (e.g. DNA sequences) or

89: random variables, i.e. {\it probability distributions}. In the latter case the data are usually supplied in the form

90: of a statistical sample, and one of the simplest and most widely used similarity measures is the linear

91: (Pearson) correlation coefficient. But this is not sensitive to nonlinear dependencies which do not manifest

92: themselves in the covariance and can thus miss important features. This is in contrast to mutual information

93: (MI) which is also singled out by its information theoretic background \cite{cover-thomas}. Indeed, MI is zero

94: only if the two random variables are strictly independent.

95:

96: Another important feature of MI is that it has also an ``algorithmic" cousin, defined within algorithmic

97: (Kolmogorov) information theory \cite{li-vi} which measures the similarity between individual objects. For a

98: thorough discussion of distance measures based on algorithmic MI and for their application to clustering, see

99: \cite{li1,li2}.

100:

101: Another feature of MI which is essential for the present application is its {\it grouping property}: The MI

102: between three objects (distributions) $X, Y,$ and $Z$ is equal to the sum of the MI between $X$ and $Y$, plus

103: the MI between $Z$ and the combined object (joint distribution) $(XY)$,

104: \be

105:    I(X,Y,Z) = I(X,Y) + I((X,Y),Z).                        \label{group}

106: \ee

107: Within Shannon information theory this is an exact theorem (see below), while it is true in the algorithmic

108: version up to the usual logarithmic correction terms \cite{li-vi}. Since $X,Y,$ and $Z$ can be themselves

109: composite, Eq.(\ref{group}) can be used recursively for a cluster decomposition of MI. This motivates the main

110: idea of our clustering method: instead of using e.g. centers of masses in order to treat clusters like

111: individual objects in an approximative way, we treat them exactly like  individual objects when using MI as

112: proximity measure.

113:

114: More precisely, we propose the following scheme for clustering $n$ objects with MIC:\\

115: (1) Compute a proximity matrix based on pairwise mutual informations; assign $n$ clusters such that each cluster

116: contains exactly one object;\\

117: (2) find the two closest clusters $i$ and $j$; \\

118: (3) create a new cluster $(ij)$ by combining $i$ and $j$; \\

119: (4) delete the lines/columns with indices $i$ and $j$ from the proximity matrix, and add one line/column

120: containing the proximities between cluster $(ij)$ and all

121: other clusters; \\

122: (5) if the number of clusters is still $>2$, goto (2); else join the two clusters and stop.
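
As a rough guide to the cost of this scheme, one can count the proximity evaluations it requires. This is only a sketch: it assumes $n\geq 3$ and that each (symmetric) proximity is computed exactly once, and the symbol $N_{\rm prox}$ is notation introduced only for this count. Step (1) needs $n(n-1)/2$ evaluations. Each pass through steps (2)--(4) that starts with $k$ clusters adds $k-2$ new proximities, and such passes occur for $k=n,n-1,\ldots,3$, so that
\be
   N_{\rm prox} = \frac{n(n-1)}{2} + \sum_{k=3}^{n}(k-2)
                = \frac{n(n-1)}{2} + \frac{(n-1)(n-2)}{2} = (n-1)^2 .
\ee
For $n=34$ objects this gives $33^2=1089$ evaluations, compared to $561$ for the initial proximity matrix alone.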

123:

124: In the next section we shall review the pertinent properties of MI, both in the Shannon and in the algorithmic

125: version. This is applied in Sec.~3 to construct a phylogenetic tree using mitochondrial DNA and in Sec.~4 to

126: cluster the output channels of an independent component analysis (ICA) of an electrocardiogram (ECG) of a

127: pregnant woman, and to reconstruct from this the maternal and fetal ECGs. We finish with our conclusions in

128: Sec.~6.

129:

130:

131: \section{Mutual Information}

132:

133: \subsection{Shannon Theory}

134:

135: Assume that one has two random variables $X$ and $Y$. If they are discrete, we write $p_i(X) = {\rm

136: prob}(X=x_i)$, $p_i(Y) = {\rm prob}(Y=y_i)$, and $p_{ij} = {\rm prob}(X=x_i,Y=y_j)$ for the marginal and joint

137: distribution. Otherwise (and if they have finite densities) we denote the densities by $\mu_X(x),\mu_Y(y)$, and

138: $\mu(x,y)$. Entropies are defined for the discrete case as usual by $H(X) = - \sum_ip_i(X) \log p_i(X)$, $H(Y) =

139: - \sum_ip_i(Y) \log p_i(Y)$, and $H(X,Y)=-\sum_{i,j} p_{ij} \log p_{ij}$. Conditional entropies are defined as

140: $H(X|Y) = H(X,Y)-H(Y) = -\sum_{i,j} p_{ij} \log p_{i|j}$. The base of the logarithm determines the units in

141: which information is measured. In particular, taking base two leads to information measured in bits. In the

142: following, we will always use natural logarithms. The MI between $X$ and $Y$ is finally defined as

143: \bea

144:    I(X,Y) &=& H(X)+H(Y)-H(X,Y) \nonumber  \\

145:           &=& \sum_{i,j} p_{ij}\;\log{p_{ij}\over p_i(X)p_j(Y)}.

146: \eea

147: It can be shown to be non-negative, and is zero only when $X$ and $Y$ are strictly independent. For $n$ random

148: variables $X_1,X_2\ldots X_n$, the MI is defined as

149: \be

150:    I(X_1,\ldots, X_n) = \sum_{k=1}^n H(X_k) - H(X_1,\ldots, X_n).

151: \ee This quantity is often referred to as (generalized) redundancy, in order to distinguish it from different

152: ``mutual informations" which are constructed analogously to higher order cumulants, but we shall not follow this

153: usage. Eq.(\ref{group}) can be checked easily, together with its generalization to arbitrary groupings. It means

154: that MI can be {\it decomposed into hierarchical levels}. By iterating it, one can decompose $I(X_1\ldots X_n)$

155: for any $n>2$ and for any partitioning of the set $(X_1\ldots X_n)$ into the MIs between elements within one

156: cluster and MIs between clusters.
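
For three variables the check of Eq.(\ref{group}) takes only a few lines. Inserting the definitions of the two terms on its r.h.s., and treating the combined object $(X,Y)$ as a single variable with entropy $H(X,Y)$, the joint entropy $H(X,Y)$ cancels:
\bea
   I(X,Y) + I((X,Y),Z) &=& [H(X)+H(Y)-H(X,Y)] \nonumber \\
      & & +\, [H(X,Y)+H(Z)-H(X,Y,Z)] \nonumber \\
      &=& H(X)+H(Y)+H(Z)-H(X,Y,Z) \nonumber \\
      &=& I(X,Y,Z).
\eea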

157:

158:

159: For continuous variables one first introduces some binning (`coarse-graining'), and applies the above to the

160: binned variables. If $x$ is a vector with dimension $m$ and each bin has side length $\Delta$ (i.e. Lebesgue measure $\Delta^m$), then $p_i(X)

161: \approx \mu_X(x)\Delta^m$ with $x$ chosen suitably in bin $i$, and \footnote{Notice that we have here assumed

162: that densities really exist. If not (e.g. if $X$ lives on a fractal set), then $m$ is to be replaced by the

163: Hausdorff dimension of the measure $\mu$.}

164: \be

165:    H_{\rm bin}(X) \approx \tilde{H}(X) - m \log \Delta

166: \ee

167: where the {\it differential entropy} is given by

168: \be

169:    \tilde{H}(X) = -\int dx \;\mu_X(x) \log \mu_X(x).

170: \ee

171: Notice that $H_{\rm bin}(X)$ is a true (average) information and is thus non-negative, but $\tilde{H}(X)$ is not

172: an information and can be negative. Also, $\tilde{H}(X)$ is not invariant under homeomorphisms $x\to \phi(x)$.
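
A simple illustration, chosen only as an example and not taken from the data studied below: for a scalar $X$ distributed uniformly on $[0,a]$ one has $\mu_X(x)=1/a$ and
\be
   \tilde{H}(X) = -\int_0^a dx \;\frac{1}{a}\,\log\frac{1}{a} = \log a,
\ee
which is negative for $a<1$. Under the rescaling $x\to cx$ with $c>0$ the variable becomes uniform on $[0,ca]$, so that $\tilde{H}$ changes by $\log c$. This shows the lack of invariance explicitly.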

173:

174: Joint entropies, conditional entropies, and MI are defined as above, with sums replaced by integrals. Like

175: $\tilde{H}(X)$, joint and conditional entropies are neither positive (semi-)definite nor invariant. But MI,

176: defined as

177: \be

178:    I(X,Y) = \int\!\!\!\int dx dy \;\mu_{XY}(x,y) \;\log{\mu_{XY}(x,y)\over \mu_X(x)\mu_Y(y)}\;,

179:    \label{mi}

180: \ee

181: is non-negative and invariant under $x\to \phi(x)$ and $y\to \psi(y)$. It is (the limit of) a true information,

182: \be

183:    I(X,Y) = \lim_{\Delta\to 0} [H_{\rm bin}(X)+H_{\rm bin}(Y)-H_{\rm bin}(X,Y)].

184: \ee
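
As an illustration, consider the standard textbook case of a bivariate Gaussian with correlation coefficient $r$, $|r|<1$ (the symbols $r$, $\sigma_X$ and $\sigma_Y$ are introduced only for this example). The differential entropies are $\tilde{H}(X)=\frac{1}{2}\log(2\pi e\,\sigma_X^2)$, $\tilde{H}(Y)=\frac{1}{2}\log(2\pi e\,\sigma_Y^2)$, and $\tilde{H}(X,Y)=\frac{1}{2}\log[(2\pi e)^2\sigma_X^2\sigma_Y^2(1-r^2)]$, so that
\be
   I(X,Y) = \tilde{H}(X)+\tilde{H}(Y)-\tilde{H}(X,Y) = -\frac{1}{2}\log(1-r^2).
\ee
The variances drop out, in agreement with the invariance under rescaling of $x$ and $y$, and $I(X,Y)=0$ only for $r=0$.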

185:

186: In applications, one usually has the data available in the form of a statistical sample. To estimate $I(X,Y)$ one

187: starts from $N$ bivariate measurements $(x_i,y_i), \, i=1,\ldots N$ which are assumed to be iid (independent

188: identically distributed) realizations. There exist numerous algorithms to estimate $I(X,Y)$ and entropies. We

189: shall use in the following the MI estimators proposed recently in Ref.~\cite{mi}, and we refer to this paper for

190: a review of alternative methods.

191:

192:

193: \subsection{Algorithmic Information Theory}

194:

195: In contrast to Shannon theory where the basic objects are random variables and entropies are {\it average}

196: informations, algorithmic information theory deals with individual symbol strings and with the actual

197: information needed to specify them. To ``specify" a sequence $X$ means here to give the necessary input to a

198: universal computer $U$, such that $U$ prints $X$ on its output and stops. The analogue of entropy, usually

199: called the {\it complexity} $K(X)$ of $X$, is the minimal length of an input which leads to the output $X$, for

200: fixed $U$. It depends on $U$, but it can be shown that this dependence is weak and can be neglected in the limit

201: when $K(X)$ is large \cite{li-vi}.

202:

203: Let us denote the concatenation of two strings $X$ and $Y$ as $XY$. Its complexity is $K(XY)$. It is intuitively

204: clear that $K(XY)$ should be larger than $K(X)$ but cannot be larger than the sum $K(X)+K(Y)$. Finally, one

205: expects that $K(X|Y)$, defined as the minimal length of a program printing $X$ when $Y$ is furnished as

206: auxiliary input, is related to $K(XY)-K(Y)$. Indeed, one can show \cite{li-vi} (again within correction terms

207: which become irrelevant asymptotically) that

208: \be

209:    0 \leq K(X|Y) \simeq K(XY)-K(Y) \leq K(X).

210: \ee

211: Notice the close similarity with Shannon entropy.

212:

213: The algorithmic information in $Y$ about $X$ is finally defined as

214: \be

215:    I_{\rm alg}(X,Y) = K(X) - K(X|Y) \simeq K(X)+K(Y)-K(XY).

216: \ee

217: Within the same additive correction terms, one shows that it is symmetric, $I_{\rm alg}(X,Y) = I_{\rm

218: alg}(Y,X)$, and can thus serve as an analogue of mutual information.

219:

220: From the halting theorem it follows that $K(X)$ is in general not computable. But one can easily give upper

221: bounds. Indeed, the length of any input which produces $X$ (e.g. by spelling it out verbatim) is an upper bound.

222: Improved upper bounds are provided by any file compression algorithm such as gnuzip or UNIX ``compress". Good

223: compression algorithms will give good approximations to $K(X)$, and algorithms whose performance does not depend

224: on the input file length (in particular since they do not segment the file during compression) will be crucial

225: for the following.

226:

227:

228: \subsection{MI-Based Distance Measures}

229:

230: Mutual information itself is a similarity measure in the sense that small values imply large ``distances" in a

231: loose sense. But it would be useful to modify it such that the resulting quantity is a metric in the strict

232: sense, i.e. satisfies the triangle inequality. Indeed, the first such metric is well known \cite{cover-thomas}:

233: The quantity

234: \be

235:    d(X,Y)=H(X|Y)+H(Y|X)=H(X,Y)-I(X,Y)                   \label{d}

236: \ee

237: satisfies the triangle inequality, in addition to being non-negative and symmetric and to satisfying $d(X,X)=0$.

238: The proof proceeds by first showing that for any $Z$

239: \be

240:    H(X|Y) \leq H(X,Z|Y) \leq H(X|Z)+H(Z|Y).                                        \label{lemma0}

241: \ee
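
Both inequalities follow from the chain rule for conditional entropies; we sketch the steps here for completeness. For the first one,
\be
   H(X,Z|Y) = H(X|Y) + H(Z|X,Y) \geq H(X|Y),
\ee
since $H(Z|X,Y)\geq 0$. For the second one,
\be
   H(X,Z|Y) = H(Z|Y) + H(X|Z,Y) \leq H(Z|Y) + H(X|Z),
\ee
since conditioning on the additional variable $Y$ cannot increase the entropy. Adding Eq.(\ref{lemma0}) to the same inequality with $X$ and $Y$ interchanged then gives
\be
   d(X,Y) \leq [H(X|Z)+H(Z|X)] + [H(Z|Y)+H(Y|Z)] = d(X,Z)+d(Z,Y).
\ee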

242:

243: But $d(X,Y)$ is not appropriate for our purposes. Since we want to compare the proximity between two single

244: objects and that between two clusters containing maybe many objects, we would like the distance measure to be

245: unbiased by the sizes of the clusters. As argued forcefully in \cite{li1,li2}, this is not true for $I_{\rm

246: alg}(X,Y)$, and for the same reasons it is not true for $I(X,Y)$ or $d(X,Y)$ either: A mutual information of

247: a thousand bits should be considered as large if $X$ and $Y$ themselves are just a thousand bits long, but it

248: should be considered as very small if $X$ and $Y$ were each huge, say one million bits.

249:

250: As shown in \cite{li1,li2} within the algorithmic framework, one can form two different distances which measure

251: {\it relative} distance, i.e. which are normalized by dividing by a total entropy. We sketch here only the

252: theorems and proofs for the Shannon version; they are indeed very similar to their algorithmic analogues in

253: \cite{li1,li2}.

254:

255: \noindent {\sc Theorem 1}: The quantity

256: \be

257:    D(X,Y) = 1 - \frac{I(X,Y)}{H(X,Y)} = \frac{d(X,Y)}{H(X,Y)}                 \label{eq:dist}

258: \ee

259: is a metric, with $D(X,X)=0$ and $D(X,Y)\leq 1$ for all pairs $(X,Y)$.

260:

261: \noindent {\sc Proof}: Symmetry, positivity and boundedness are obvious. Since $D(X,Y)$ can be written as

262: \be

263:     D(X,Y)=\frac{H(X|Y)}{H(X,Y)}+\frac{H(Y|X)}{H(Y,X)},

264: \ee

265: it is sufficient for the proof of the triangle inequality to show that each of the two terms on the r.h.s.  is

266: bounded by an analogous inequality, i.e.

267: \be

268:    \frac{H(X|Y)}{H(X,Y)} \leq \frac{H(X|Z)}{H(X,Z)}+\frac{H(Z|Y)}{H(Z,Y)}   \label{lemma}

269: \ee

270: and similarly for the second term. Eq.(\ref{lemma}) is proven straightforwardly, using Eq.(\ref{lemma0}) and the

271: basic inequalities $H(X) \geq 0$, $H(X,Y) \leq H(X,Y,Z)$ and $H(X|Z)\geq 0$:

272: \bea

273:     \frac{H(X|Y)}{H(X,Y)} &=& \frac{H(X|Y)}{H(Y)+H(X|Y)}  \nonumber \\

274:     & \leq & \frac{H(X|Z)+H(Z|Y)}{H(Y)+H(X|Z)+H(Z|Y)} \nonumber \\

275:     & = & \frac{H(X|Z)+H(Z|Y)}{H(X|Z)+H(Y,Z)} \nonumber \\

276:     & \leq &\frac{H(X|Z)}{H(X|Z)+H(Z)}+\frac{H(Z|Y)}{H(Y,Z)} \nonumber \\

277:     & = & \frac{H(X|Z)}{H(X,Z)}+\frac{H(Z|Y)}{H(Z,Y)}.

278: \eea

279:

280: \noindent {\sc Theorem 2}: The quantity

281: \bea

282:    D'(X,Y) & = & 1 - \frac{I(X,Y)}{\max\{H(X),H(Y)\}}    \nonumber \\

283:      & = & \frac{\max\{H(X|Y),H(Y|X)\}}{\max\{H(X),H(Y)\}}                \label{eq:dist2}

284: \eea

285: is also a metric, also with $D'(X,X)=0$ and $D'(X,Y)\leq 1$ for all pairs $(X,Y)$. It is sharper than $D$, i.e.

286: $D'(X,Y) \leq D(X,Y)$.
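
The last statement follows from $H(X,Y)\geq \max\{H(X),H(Y)\}$, which gives $I(X,Y)/H(X,Y) \leq I(X,Y)/\max\{H(X),H(Y)\}$. As a numerical illustration, with values chosen only for this example, take $H(X)=H(Y)=2$ and $I(X,Y)=1$, so that $H(X,Y)=3$. Then
\be
   D(X,Y) = 1-\frac{1}{3} = \frac{2}{3}, \qquad D'(X,Y) = 1-\frac{1}{2} = \frac{1}{2}.
\ee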

287:

288: \noindent

289: {\sc Proof}: Again we have only to prove the triangle inequality, the other parts are trivial. For this we have to distinguish different cases \cite{li2}.\\

290: Case 1: $\max\{H(Z),H(Y)\}\leq H(X)$. Using Eq.(\ref{lemma0}) we obtain

291: \bea

292:     D'(X,Y) & = & \frac{H(X|Y)}{H(X)} \leq \frac{H(X|Z)}{H(X)} + \frac{H(Z|Y)}{H(X)}  \nonumber \\

293:             & \leq & D'(X,Z)+D'(Z,Y).

294: \eea

295: Case 2: $\max\{H(Z),H(X)\}\leq H(Y)$. This is completely analogous.\\

296: Case 3: $H(X)\leq H(Y)< H(Z)$. We now have to show that

297: \bea

298:     D'(X,Y) & = &  \frac{H(Y|X)}{H(Y)} \leq \frac{H(Y|Z)+H(Z|X)}{H(Y)}  \nonumber \\

299:             & \stackrel{?}{\leq} &  D'(X,Z)+D'(Z,Y) \nonumber \\

300:             & = &  \frac{H(Z|X)}{H(Z)}+\frac{H(Z|Y)}{H(Z)}.      \label{dd}

301: \eea

302: Indeed, if the r.h.s. of the first line is less than 1, then

303: \bea

304:     \frac{H(Y|X)}{H(Y)} & \leq &\frac{H(Y|Z)+H(Z|X)}{H(Y)}  \nonumber \\

305:              &\leq & \frac{H(Y|Z)+H(Z|X)+H(Z)-H(Y)}{H(Z)} \nonumber \\

306:              & = &   \frac{H(Z|Y)+H(Z|X)}{H(Z)},

307: \eea

308: and Eq.(\ref{dd}) holds. If it is larger than 1, then also $(H(Z|Y)+H(Z|X))/H(Z) \geq 1$.

309: Eq.(\ref{dd}) must now also hold, since $H(Y|X)/H(Y) \leq 1$. \\

310: Case 4: $H(Y)\leq H(X)< H(Z)$. This is completely analogous to case 3.

311:

312: Apart from scaling correctly with the total information, in contrast to $d(X,Y)$, the algorithmic analogs of

313: $D(X,Y)$ and $D'(X,Y)$ are also {\it universal} \cite{li2}. Essentially this means that if $X\approx Y$

314: according to any non-trivial distance measure, then $X\approx Y$ also according to $D$, and even more so (by

315: a factor of up to 2) according to $D'$. In contrast to the other properties of $D$ and $D'$, this is not easy to

316: carry over from algorithmic to Shannon theory. The proof in Ref.~\cite{li2} depends on $X$ and $Y$ being

317: discrete, which is obviously not true for probability distributions. Based on the universality argument, it was

318: argued in \cite{li2} that $D'$ should be superior to $D$, but the numerical studies shown in that reference did

319: not show a clear difference between them. In the following we shall therefore use primarily $D$ for simplicity,

320: but we checked that using $D'$ did not give systematically better results.

321:

322: A major difficulty appears in the Shannon framework, if we deal with continuous random variables. As we

323: mentioned above, Shannon informations are only finite for coarse-grained variables, while they diverge if the

324: resolution tends to zero. This means that dividing MI by the entropy as in the definitions of $D$ and $D'$

325: becomes problematic. One has essentially two alternative possibilities. The first is to actually introduce some

326: coarse-graining, although it would not have been necessary for the definition of $I(X,Y)$, and divide by the

327: coarse-grained entropies. This introduces an arbitrariness, since the scale $\Delta$ is completely ad hoc,

328: unless it can be fixed by some independent arguments. We have found no such arguments, and thus we propose the

329: second alternative. There we take $\Delta \to 0$. In this case $H(X) \sim -m_x \log \Delta$, with $m_x$ being the

330: dimension of $X$. In this limit $D$ and $D'$ would tend to 1. But using similarity measures

331: \bea

332:    S(X,Y) = (1-D(X,Y))\log(1/\Delta),          \\

333:    S'(X,Y) = (1-D'(X,Y))\log(1/\Delta)

334: \eea

335: instead of $D$ and $D'$ gives {\it exactly} the same results in MIC, and

336: \be

337:    S(X,Y) = \frac{I(X,Y)}{m_x+m_y}, \quad S'(X,Y) = \frac{I(X,Y)}{\max\{m_x,m_y\}}.

338:                                          \label{S}

339: \ee

340: Thus, when dealing with continuous variables, we divide the MI either by the sum or by the maximum of the

341: dimensions. When starting with scalar variables and when $X$ is a cluster variable obtained by joining $m$

342: elementary variables, then its dimension is just $m_x=m$.
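
The step from the definitions of $S$ and $S'$ to Eq.(\ref{S}) can be spelled out as follows, keeping only the leading terms for $\Delta\to 0$ and assuming that the joint density exists. With $H(X)\simeq m_x\log(1/\Delta)$, $H(Y)\simeq m_y\log(1/\Delta)$ and $H(X,Y)\simeq (m_x+m_y)\log(1/\Delta)$ one finds
\be
   S(X,Y) = \frac{I(X,Y)}{H(X,Y)}\,\log(1/\Delta) \;\to\; \frac{I(X,Y)}{m_x+m_y},
\ee
and in the same way $S'(X,Y)\to I(X,Y)/\max\{m_x,m_y\}$. For example, for two scalar variables ($m_x=m_y=1$) one has $S=I(X,Y)/2$ and $S'=I(X,Y)$, while for a cluster of three scalar variables and a single one ($m_x=3$, $m_y=1$) one has $S=I(X,Y)/4$ and $S'=I(X,Y)/3$.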

343:

344:

345: \section{A Phylogenetic Tree for Mammals}

346:

347: As a first application, we study the mitochondrial DNA of a group of 34 mammals (see Fig.~1).  Exactly the same

348: data \cite{Genebank} had previously been analyzed in \cite{li1,Reyes00}. This group includes among

349: others\footnote{opossum (\textit{Didelphis virginiana}), wallaroo (\textit{Macropus robustus}), and platypus

350: (\textit{Ornithorhynchus anatinus})} some rodents\footnote{rabbit (\textit{Oryctolagus cuniculus}), guinea pig

351: (\textit{Cavia porcellus}), fat dormouse (\textit{Glis glis}), rat (\textit{Rattus norvegicus}), squirrel

352: (\textit{Sciurus vulgaris}), and mouse (\textit{Mus musculus})}, ferungulates\footnote{horse (\textit{Equus caballus}),

353: donkey (\textit{Equus asinus}), Indian rhinoceros (\textit{Rhinoceros unicornis}), white rhinoceros

354: (\textit{Ceratotherium simum}), harbor seal (\textit{Phoca vitulina}), grey seal (\textit{Halichoerus grypus}),

355: cat (\textit{Felis catus}), dog (\textit{Canis familiaris}), fin whale (\textit{Balaenoptera physalus}), blue

356: whale (\textit{Balaenoptera musculus}), cow (\textit{Bos taurus}), sheep (\textit{Ovis aries}), pig (\textit{Sus

357: scrofa}), hippopotamus (\textit{Hippopotamus amphibius}), neotropical fruit bat (\textit{Artibeus jamaicensis}),

358: African elephant (\textit{Loxodonta africana}), aardvark (\textit{Orycteropus afer}), and armadillo

359: (\textit{Dasypus novemcinctus})}, and primates\footnote{human (\textit{Homo sapiens}), common chimpanzee

360: (\textit{Pan troglodytes}), pygmy chimpanzee (\textit{Pan paniscus}), gorilla (\textit{Gorilla gorilla}),

361: orangutan (\textit{Pongo pygmaeus}), gibbon (\textit{Hylobates lar}), and baboon (\textit{Papio hamadryas})}. It

362: had been chosen in \cite{li1} because of doubts about the relative closeness among these three groups

363: \cite{cao,Reyes00}.

364:

365: Obviously, we are here dealing with the algorithmic version of information theory, and informations are

366: estimated by lossless data compression. For constructing the proximity matrix between individual taxa, we

367: proceed essentially as in Ref.~\cite{li1}. But in addition to using the special compression program GenCompress

368: \cite{GenComp}, we also tested several general purpose compression programs such as BWTzip \cite{BWTzip} and the

369: UNIX tool bzip2.

370:

371: In Ref.~\cite{li1}, this proximity matrix was then used as the input to a standard HC algorithm

372: (neighbour-joining and hypercleaning) to produce an evolutionary tree. It is here where our treatment deviates

373: crucially. We used the MIC algorithm described in Sec.~1, with distance $D(X,Y)$. The joining of two clusters

374: (the third step in the MIC algorithm) is obtained by simply concatenating the DNA sequences. There is of course

375: an arbitrariness in the order of concatenation: $XY$ and $YX$ give in general compressed sequences of

376: different lengths. But we found this to have negligible effect on the evolutionary tree. The resulting

377: evolutionary tree obtained with GenCompress is shown in Fig.~\ref{phylotree}.
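
One way to write the distance in terms of compressed file lengths is the following. This is a sketch in notation introduced only here, and it need not coincide in every detail with the procedure of Ref.~\cite{li1}: $C(X)$ denotes the length of the output of the chosen compression program applied to the sequence $X$ and serves as an estimate of $K(X)$. Then
\be
   I_{\rm alg}(X,Y) \approx C(X)+C(Y)-C(XY), \qquad
   D(X,Y) \approx 1 - \frac{C(X)+C(Y)-C(XY)}{C(XY)}.
\ee
Replacing $C(XY)$ by $C(YX)$ gives the alternative estimate for the opposite order of concatenation.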

378:

379: \begin{figure}

380:   \begin{center}

381:    \psfig{file=phylotree.eps,height=80mm,angle=270}

382: %    \psfig{file=phylotree.eps,width=12cm,angle=270}

383:     \caption{Phylogenetic tree for 34 mammals (31 eutherians plus 3 non-placental mammals).

384:        In contrast to Fig.~\ref{ClustECG}, the heights of nodes are equal to the distances

385:        between the joining daughter clusters.}

386:     \label{phylotree}

387: \end{center}

388: \vspace{-8mm}

389: \end{figure}

390:

391:

392: As shown in Fig.~\ref{phylotree} the overall structure of this tree closely resembles the one shown in

393: Ref.~\cite{Reyes00}. All primates are correctly clustered and also the relative order of the ferungulates is in

394: accordance with Ref.~\cite{Reyes00}. On the other hand, there are a number of connections which obviously do not

395: reflect the true evolutionary tree, see for example the guinea pig with bat and elephant with platypus. But the

396: latter two, in spite of being joined together, have a very large distance from each other, thus their clustering

397: just reflects the fact that neither the platypus nor the elephant has other close relatives in the sample. All

398: in all, however, already the results shown in Fig.~1 capture surprisingly well the overall structure shown in

399: Ref. \cite{Reyes00}. Dividing MI by the total information is essential for this success. If we had used the

400: non-normalized $I_{\rm alg}(X,Y)$ itself, the clustering algorithm used in \cite{li1} would not change much,

401: since all 34 DNA sequences have roughly the same length. But our MIC algorithm would be completely screwed up:

402: After the first cluster formation, we have DNA sequences of very different lengths, and longer sequences tend

403: also to have larger MI, even if they are not closely related.

404:

405: A heuristic reasoning for the use of MIC for the reconstruction of an evolutionary tree might be given as

406: follows: Suppose that a proximity matrix has been calculated for a set of DNA sequences and the smallest

407: distance is found for the pair $(X,Y)$. Ideally, one would remove the sequences $X$ and $Y$, replace them by the

408: sequence of the common ancestor (say $Z$) of the two species, update the proximity matrix to find the smallest

409: entry in the reduced set of species, and so on. But the DNA sequence of the common ancestor is not available.

410: One solution might be that one tries to reconstruct it by making some compromise between the sequences $X$ and

411: $Y$. Instead, we essentially propose to concatenate the sequences $X$ and $Y$. This will of course not lead to a

412: plausible sequence of the common ancestor, but it will {\it optimally represent the information} about the

413: common ancestor. During the evolution since the time of the ancestor $Z$, some parts of its genome might have

414: changed both in $X$ and in $Y$. These parts are of little use in constructing any phylogenetic tree. Other parts

415: might not have changed in either. They are recognized anyhow by any sensible algorithm. Finally, some parts of

416: its genome will have mutated significantly in $X$ but not in $Y$, and vice versa. This information is essential

417: to find the correct way through higher hierarchy levels of the evolutionary tree, and it is preserved in

418: concatenating.

419:

420:

421: \section{Clustering of Minimally Dependent Components in an Electrocardiogram}

422:

423: As our second application we choose a case where Shannon theory is the proper setting. We show in Fig.~2 an ECG

424: recorded from the abdomen and thorax of a pregnant woman \cite{ECGdata} (8 channels, sampling rate 500 Hz,

425: 5$\,$s total). It is already seen from this graph that there are at least two important components in this ECG:

426: the heartbeat of the mother, with a frequency of $\approx 3$ beats/s, and the heartbeat of the fetus with roughly

427: twice this frequency. The two are not synchronized. In addition there is noise from various sources (muscle

428: activity, measurement noise, etc.). While it is easy to detect anomalies in the mother's ECG from such a

429: recording, it would be difficult to detect them in the fetal ECG.

430:

431: As a first approximation we can assume that the total ECG is a linear superposition of several independent

432: sources (mother, child, noise$_1$, noise$_2$,...). A standard method to disentangle such superpositions is {\it

433: independent component analysis} (ICA) \cite{ICA}. In the simplest case one has $n$ independent sources

434: $s_i(t),\; i=1\ldots n$ and $n$ measured channels $x_i(t)$ obtained by instantaneous superpositions with a time

435: independent non-singular matrix ${\bf A}$,

436: \be

437:    x_i(t) = \sum_{j=1}^n A_{ij} s_j(t)\;.

438: \ee

439: In this case the sources can be reconstructed by applying the inverse transformation ${\bf W} = {\bf A}^{-1}$

440: which is obtained by minimizing the (estimated) mutual informations between the transformed components $y_i(t) =

441: \sum_{j=1}^n W_{ij} x_j(t)$. If some of the sources are Gaussian, this leads to ambiguities \cite{ICA}, but it

442: gives a unique solution if the sources have more structure.

443:

444: In reality things are not so simple. For instance, the sources might not be independent, the number of sources

445: (including noise sources!) might be different from the number of channels, and the mixing might involve delays.

446: For the present case this implies that the heartbeat of the mother is seen in several reconstructed components

447: $y_i$, and that the supposedly ``independent'' components are not independent at all. In particular, all

448: components $y_i$ which have large contributions from the mother form a cluster with large intra-cluster MIs and

449: small inter-cluster MIs. The same is true for the fetal ECG, albeit less pronounced.

450: It is thus our aim to \\

451: 1) optimally decompose the signals into least dependent components;\\

452: 2) cluster these components hierarchically such that the most dependent ones are

453: grouped together;\\

454: 3) decide on an optimal level of the hierarchy, such that the clusters make most sense

455: physiologically;\\

456: 4) project onto these clusters and apply the inverse transformations to obtain cleaned signals for the sources

457: of interest.

458:

459: \begin{figure}

460:   \begin{center}

461:     \psfig{file=ECG1s.eps,width=80mm}

462:     \caption{ECG of a pregnant woman.}

463:     \label{ICAECG0}

464: \end{center}

465: \vspace{-8mm}

466: \end{figure}

467:

468: \begin{figure}

469:   \begin{center}

470:     \psfig{file=ECG2s.eps,width=80mm}

471:     \caption{Least dependent components of the ECG shown in Fig.~\ref{ICAECG0}, after increasing

472:      the number of channels by delay embedding.}

473:     \label{ICAECG}

474: \end{center}

475: \vspace{-8mm}

476: \end{figure}

477:

478: Technically we proceeded as follows \cite{Harald}:

479:

480: Since we expect different delays in the different channels, we first used Takens delay embedding \cite{Takens80}

481: with time delay 0.002$\,$s and embedding dimension 3, resulting in $24$ channels. We then formed 24 linear

482: combinations $y_i(t)$ and determined the de-mixing coefficients $W_{ij}$ by minimizing the overall mutual

483: information between them, using the MI estimator proposed in \cite{mi}. There, two classes of estimators were

484: introduced, one with square and the other with rectangular neighbourhoods. Within each class, one can choose the

485: number of neighbours, called $k$ in the following, on which the estimate is based. Small values of $k$ lead to a

486: small bias but to large statistical errors, while the opposite is true for large $k$. But even for very large

487: $k$ the bias is zero when the true MI is zero, and it is systematically such that absolute values of the MI are

488: underestimated. Therefore this bias does not affect the determination of the optimal de-mixing matrix. It does,

489: however, depend on the dimension of the random variables, so large values of $k$ are not suitable for the

490: clustering. We therefore first used $k=100$ and square neighbourhoods to obtain the least

491: dependent components $y_i(t)$, and then used $k=3$ with rectangular neighbourhoods for the clustering. The

492: resulting least dependent components are shown in Fig.~\ref{ICAECG}. They are sorted such that the first

493: components (1--5) are dominated by the mother's ECG, while the next three contain large contributions from the

494: fetus. The remaining components contain mostly noise, although some still seem to be mixed.
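The delay embedding step can be sketched as follows. At a sampling rate of 500 Hz, a delay of 0.002$\,$s corresponds to one sample, so embedding dimension 3 turns 8 channels into 24. The sketch uses random numbers as a placeholder for the recording; the function name \texttt{delay\_embed} and the ordering of the lagged copies are assumptions made for illustration, not taken from the original implementation.

```python
import numpy as np


def delay_embed(x, dim=3, lag=1):
    """Stack `dim` lagged copies of every channel of x (shape: channels x samples)."""
    n_samples = x.shape[1] - (dim - 1) * lag
    copies = [x[:, d * lag : d * lag + n_samples] for d in range(dim)]
    return np.vstack(copies)


fs = 500                      # sampling rate in Hz (from the text)
lag = round(0.002 * fs)       # 0.002 s delay -> 1 sample
# placeholder for the 8-channel, 5 s recording
ecg = np.random.default_rng(0).standard_normal((8, 5 * fs))
embedded = delay_embed(ecg, dim=3, lag=lag)
print(embedded.shape)         # (24, 2498)
```

Two samples are lost at the end of the record because the last lagged copy starts two samples later than the first.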

495:

496: These results obtained by visual inspection are fully supported by the cluster analysis. The dendrogram is shown

497: in Fig.~\ref{ClustECG}. In constructing it we used $S(X,Y)$ (Eq.~(\ref{S})) as the similarity measure to find the

498: correct topology. Again we would have obtained much worse results if we had not normalized it by dividing MI by

499: $m_X+m_Y$. In plotting the actual dendrogram, however, we used the MI of the cluster to determine the height at

500: which the two daughters join. The MI of the first five channels, e.g., is $\approx 1.43$, while that of channels

501: 6 to 8 is $\approx 0.34$. For any two clusters (tuples) $X=X_1\ldots X_n$ and $Y=Y_1\ldots Y_m$ one has $I(X,Y)

502: \geq I(X)+I(Y)$. This guarantees, if the MI is estimated correctly, that the tree is drawn properly. The two

503: slight glitches (when clusters (1--14) and (15--18) join, and when (21--22) is joined with 23) result from small

504: errors in estimating MI. They in no way affect our conclusions.
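The construction of the dendrogram can be illustrated with a minimal sketch under a strong simplifying assumption: the variables are jointly Gaussian with a known correlation matrix, so that the MI of a cluster is exactly $-\frac{1}{2}\ln\det C$ of the corresponding sub-matrix and no estimator is needed. By the grouping property, the MI between two clusters is the MI of the joined cluster minus the MIs of the two parts; since this difference is non-negative, the height of a node is never below the sum of the heights of its daughters. The names \texttt{total\_mi} and \texttt{mic} and the example correlation matrix are hypothetical.

```python
import numpy as np


def total_mi(corr, idx):
    """MI among the variables in idx for a Gaussian with correlation matrix corr (nats)."""
    sub = corr[np.ix_(idx, idx)]
    return -0.5 * np.linalg.slogdet(sub)[1]


def mic(corr):
    """Greedy clustering: repeatedly join the two clusters with the largest normalized MI."""
    clusters = [(i,) for i in range(corr.shape[0])]
    merges = []  # (left, right, height), height = MI of the joined cluster
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                x, y = clusters[a], clusters[b]
                # MI between the two tuples via the grouping property
                between = total_mi(corr, x + y) - total_mi(corr, x) - total_mi(corr, y)
                s_xy = between / (len(x) + len(y))   # divide by m_X + m_Y
                if best is None or s_xy > best[0]:
                    best = (s_xy, a, b)
        _, a, b = best
        x, y = clusters[a], clusters[b]
        merges.append((x, y, total_mi(corr, x + y)))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [x + y]
    return merges


# hypothetical example: variables 0-2 strongly dependent, 3-4 weakly dependent
corr = np.eye(5)
corr[:3, :3] = 0.8
corr[3:, 3:] = 0.6
np.fill_diagonal(corr, 1.0)
merges = mic(corr)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In this toy example the two blocks are joined only in the last step, and the height of the root equals the sum of the heights of the two blocks because they are independent of each other.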

505:

506: \begin{figure}

507:   \begin{center}

508:     \psfig{file=ECGclust.eps,width=80mm,angle=0}

509:     \caption{Dendrogram for least dependent components. The height where the two branches of

510:     a cluster join corresponds to the MI of the cluster.}

511:     \label{ClustECG}

512: \end{center}

513: \vspace{-8mm}

514: \end{figure}

515:

516: In Fig.~\ref{ClustECG} one can clearly see two big clusters corresponding to the mother and to the child. There

517: are also some small clusters which should be considered as noise. For reconstructing the mother and child

518: contributions to Fig.~\ref{ICAECG0}, we have to decide on one specific clustering from the entire hierarchy. We

519: decided to make the cut such that mother and child are separated. The resulting clusters are indicated in

520: Fig.~\ref{ClustECG} and were already anticipated in sorting the channels. Reconstructing the original ECG from

521: the child components only, we obtain Fig.~\ref{reconstruct}.
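The projection step can be sketched as follows: transform to the least dependent components, set all components outside the chosen cluster to zero, and apply the inverse transformation. The sketch assumes a square de-mixing matrix that is known exactly; the mixing matrix, the three Laplace-distributed sources and the function name \texttt{reconstruct\_from\_cluster} are hypothetical and serve only as an illustration.

```python
import numpy as np


def reconstruct_from_cluster(x, W, keep):
    """Project channels x onto the components listed in `keep` and map back."""
    y = W @ x                          # least dependent components
    mask = np.zeros(W.shape[0])
    mask[list(keep)] = 1.0             # 1 for components in the chosen cluster
    return np.linalg.inv(W) @ (mask[:, None] * y)


rng = np.random.default_rng(1)
s = rng.laplace(size=(3, 1000))        # hypothetical sources: "mother", "child", "noise"
A = np.array([[1.0, 0.4, 0.2],
              [0.3, 1.0, 0.1],
              [0.2, 0.5, 1.0]])        # hypothetical mixing matrix
x = A @ s
W = np.linalg.inv(A)                   # assume the de-mixing matrix is known exactly
child_only = reconstruct_from_cluster(x, W, keep=[1])
```

Here \texttt{child\_only} contains, for every channel, only the contribution of the second source; keeping all components returns the original channels.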

522:

523: \begin{figure}

524:   \begin{center}

525:     \psfig{file=ECG3s.eps,width=80mm}

526:     \caption{Original ECG where all contributions except those of the child cluster have

527:      been removed.}

528:     \label{reconstruct}

529: \vspace{-8mm}

530: \end{center}

531: \end{figure}

532:

533:

534: \section{Conclusions}

535:

536: We have shown that MI can not only be used as a proximity measure in clustering, but that it also suggests a

537: conceptually very simple and natural hierarchical clustering algorithm. We do not claim that this algorithm,

538: called {\it mutual information clustering} (MIC), is always superior to other algorithms. Indeed, MI is in

539: general not easy to estimate. Obviously, when only crude estimates are possible, MIC will not give optimal

540: results either. But as MI estimates become better, the results of MIC should improve as well. The present paper was

541: partly triggered by the development of a new class of MI estimators for continuous random variables which have

542: very small bias and also rather small variances \cite{mi}.

543:

544: We have illustrated our method with two applications, one from genetics and one from cardiology. MIC might not

545: give the very best clustering for either application, but it seems promising that one common method gives decent

546: results for both, although they are very different.

547:

548: The results of MIC should improve if more data become available. This is trivial if we mean by that longer

549: time sequences in the application to ECG, and longer parts of the genome in the application of Sec.~3. It is less

550: trivial that we expect MIC to make fewer mistakes in a phylogenetic tree when more species are included. The

551: reason is that close-by species will be correctly joined anyhow, and families -- which now are represented only

552: by single species and thus are poorly characterized -- will be much better described by the concatenated genomes

553: if more species are included.

554:

555: There are two versions of information theory, algorithmic and probabilistic, and therefore there are also two

556: variants of MI and of MIC. We discussed in detail one application of each, and showed that indeed common

557: concepts were involved in both. In particular it was crucial to normalize MI properly, so that it is essentially

558: the {\it relative} MI which is used as proximity measure. For conventional clustering algorithms using

559: algorithmic MI as proximity measure this had already been stressed in \cite{li1,li2}, but it is even more

560: important in MIC, both in the algorithmic and in the probabilistic versions.

561:

562: In the probabilistic version, one studies the clustering of probability distributions. But usually distributions

563: are not provided as such, but are given implicitly by finite random samples drawn (more or less) independently

564: from them. On the other hand, the full power of algorithmic information theory is only reached for infinitely

565: long sequences, and in this limit any individual sequence defines a sequence of probability measures on finite

566: subsequences. Thus the strict distinction between the two theories is somewhat blurred in practice.

567: Nevertheless, one should not confuse the similarity between two sequences (two English books, say) and that

568: between their subsequence statistics. Two sequences are maximally different if they are completely random, but

569: their statistics for short subsequences are then identical (all subsequences appear in both with equal

570: probabilities). Thus one should always be aware of what similarities or independencies one is looking for. The

571: fact that MI can be used in similar ways for all these problems is not trivial.

572:

573:

574: We would like to thank Arndt von Haeseler, Walter Nadler and Volker Roth for many useful discussions.

575:

576: \bibliographystyle{apalike}

577: \bibliography{lit}

578:

579: \end{document}

580: