
1: %\documentclass{acm_proc_article-sp}

2: %\documentclass{sig-alternate}

3: \documentclass{article}

4: \usepackage{itw2005}

5: \usepackage{amsmath,amstext,amsthm,amssymb}

6: \usepackage{latex8}

7: \usepackage{times}

8:

9:

10:

11:

12: %\usepackage{amsmath,amstext,amsthm,amssymb,epsf}

13: \usepackage{amsmath,amstext,amssymb,epsf}

14: %\usepackage{fullpage,latexsym}

15: \usepackage{epsfig}

16: \usepackage{verbatim}

17: \usepackage{pslatex}

18:

19:

20: %\bibliographystyle{plain}

21:

22: \newcommand{\gzip}{ \texttt {gzip} }

23: \newcommand{\bzip}{ \texttt {bzip2} }

24: \newcommand{\NID}{ \textsc {NID} }

25: \newcommand{\NGD}{ \textsc {NGD} }

26: \newcommand{\NDD}{ \textsc {NDD} }

27: \newcommand{\NCD}{ \textsc {NCD} }

28: \newcommand{\NCDf}[2]{ \NCD(#1,#2) }

29: \newcommand{\SVM}{ \textsc {SVM} }

30:

31: \newtheorem{theorem}{\sc Theorem}

32: \newtheorem{lemma}{\sc Lemma}

33: \newtheorem{coro}{\sc Corollary}

34: \newtheorem{nota}{\sc Notation}

35: \newtheorem{defin}{\sc Definition}

36: \newtheorem{rem}{\sc Remark}

37: \newtheorem{cla}{\sc Claim}

38: \newtheorem{ex}{\sc Example}

39: \newenvironment{remark}{\begin{rem}}{\hspace*{\fill}$\diamondsuit$\end{rem}}

40: %\newenvironment{proof}{\par \sc Proof.\rm}{\hspace*{\fill}$\Box$\vspace{1ex}}

41: \newenvironment{example}{\begin{ex}}{\hspace*{\fill}$\Diamond$\end{ex}}

42: \newenvironment{claim}{\begin{cla}}{\end{cla}}

43: \newenvironment{corollary}{\begin{coro}}{\end{coro}}

44: \newenvironment{definition}{\begin{defin}}{\end{defin}}

45: %\newenvironment{remark}{\begin{rem}}{\end{rem}}

46: \newenvironment{notation}{\begin{nota}}{\end{nota}}

47:

48:

49: \itwtitle{Universal Similarity}

50:

51: %\numberofauthors{2}

52: %\author{

53: %\alignauthor Rudi Cilibrasi\titlenote{Supported in part by the Netherlands

54:  %BSIK/BRICKS project,

55: %and by NWO project 612.55.002. Address: CWI, Kruislaan 413, 1098 SJ

56: %Amsterdam, The Netherlands. Email: Rudi.Cilibrasi@cwi.nl}\\

57: %\affaddr{CWI}

58: %\affaddr{Kruislaan 413}\\

59: %\affaddr{1098 SJ Amsterdam, The Netherlands}\\

60: %\email{Rudi.Cilibrasi@cwi.nl}

61: %\alignauthor Paul Vitanyi\titlenote{Part of this work was done while the author was on sabbatical leave

62: %at National ICT of Australia, Sydney Laboratory at UNSW.

63: %Supported in part

64: %by the EU  EU Project RESQ IST-2001-37559,

65: %the ESF QiT Programmme,

66: %the EU NoE PASCAL, and the Netherlands BSIK/BRICKS project.

67: %Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands.

68: %Email: Paul.Vitanyi@cwi.nl}\\

69: %\affaddr{CWI,}\\

70: %\affaddr{University of Amsterdam, and}\\

71: %\affaddr{National ICT of Australia}\\

72: %%\email{Paul.Vitanyi@cwi.nl}

73: %}

74:

75:

76: %\itwauthor{Rudi Cilibrasi}{CWI, Amsterdam, The Netherlands.

77: %{\tt Rudi.Cilibrasi@cwi.nl}}

78: %\itwsecondauthor{Paul Vitanyi}{CWI, Amsterdam, the Netherlands.

79: %{\tt paulv@cwi.nl}}

80: \itwauthor{Paul Vitanyi\thanks{Part of this work was done while the author was on sabbatical leave

81: at National ICT of Australia, Sydney Laboratory at UNSW.

82: Supported in part

83: by the EU Project RESQ IST-2001-37559,

84: the ESF QiT Programme,

85: the EU NoE PASCAL, and the Netherlands BSIK/BRICKS project.

86: Address: CWI, Kruislaan 413, 1098SJ Amsterdam, The Netherlands.

87: {\tt paulv@cwi.nl}

88: }}{CWI, University of Amsterdam, National ICT of Australia}

89:

90:

91: \begin{document}

92: \itwmaketitle

93:

94:

95:

96: \begin{itwabstract}

97: We survey a new area of parameter-free similarity distance measures

98: useful in data-mining,

99: pattern recognition, learning and automatic semantics extraction.

100: Given a family of distances on a set of objects,

101: a distance is universal up to a certain precision for that family if it

102: minorizes every distance in the family between every two objects

103: in the set, up to the stated precision (we do not require the universal

104: distance to be an element of the family).

105: We consider similarity distances

106: for two types of objects: literal objects that as such contain all of their

107: meaning, like genomes or books, and names for objects.

108: The latter may have

109: literal embodiments like the first type, but may also

110: be abstract like ``red'' or ``christianity.'' For the first type

111: we consider

112: a family of computable distance measures

113: corresponding to parameters expressing similarity according to

114: particular features

115: between

116: pairs of literal objects. For the second type we consider similarity

117: distances generated by web users corresponding to particular semantic

118: relations between the (names for) the designated objects.

119: For both families we give universal similarity

120: distance measures, incorporating all particular distance measures

121: in the family. In the first case the universal

122: distance is based on compression and in the second

123: case it is based on Google page counts related to search terms.

124: In both cases experiments on a massive scale give evidence of the

125: viability of the approaches.

126: \end{itwabstract}

127:

128: \begin{itwpaper}

129:

130: \itwsection{Introduction}

131: Objects can be given literally, like the literal

132: four-letter genome of a mouse,

133: or the literal text of {\em War and Peace} by Tolstoy. For

134: simplicity we take it that all meaning of the object

135: is represented by the literal object itself. Objects can also be

136: given by name, like ``the four-letter genome of a mouse,''

137: or ``the text of {\em War and Peace} by Tolstoy.'' There are

138: also objects that cannot be given literally, but only by name

139: and acquire their meaning from their contexts in background common

140: knowledge in humankind, like ``home'' or ``red.''

141: In the literal setting, objective similarity of objects can be established

142: by feature analysis, one type of similarity per feature.

143: In the abstract ``name'' setting, all similarity must depend on

144: background knowledge and common semantics relations,

145: which is inherently subjective and ``in the mind of the beholder.''

146:

147: \itwsection{Compression Based Similarity}

148: All data are created equal but some data are more alike than others.

149: We have recently proposed methods expressing this alikeness,

150: using a new similarity metric based on compression.

151: It is parameter-free in that it

152: doesn't use any features or background knowledge about the data, and can without

153: changes be applied to different areas and across area boundaries.

154: It is universal in that it approximates the parameter

155: expressing similarity of the dominant feature in all pairwise

156: comparisons.

157: It is robust in the sense that its success appears independent

158: from the type of compressor used.

159: The clustering we use is hierarchical clustering in dendrograms

160: based on a new fast heuristic for the quartet method.

161: The method is available as an open-source software tool, \cite{Ci03}.

162:

163: {\bf Feature-Based Similarities:}

164: We are presented with unknown data and

165: the question is to determine the similarities among them

166: and group like with like together. Commonly, the data are

167: of a certain type: music files, transaction records of ATM machines,

168: credit card applications, genomic data. In these data there are

169: hidden relations that we would like to get out in the open.

170: For example, from genomic data one can extract

171: letter- or block frequencies (the blocks are over the four-letter alphabet);

172:  from music files one can extract

173: various specific numerical features,

174: related to pitch, rhythm, harmony etc.

175: One can extract such features using for instance

176: Fourier transforms~\cite{TC02} or wavelet transforms~\cite{GKCwavelet},

177: to quantify parameters expressing similarity.

178: The resulting vectors corresponding to the various files are then

179: classified or clustered using existing classification software, based on

180: various standard statistical pattern recognition classifiers~\cite{TC02},

181: Bayesian classifiers~\cite{DTWml},

182: hidden Markov models~\cite{CVfolk},

183: ensembles of nearest-neighbor classifiers~\cite{GKCwavelet}

184: or neural networks~\cite{DTWml,Sneural}.

185: For example, in music one feature would be to look for rhythm in the sense

186: of beats per minute. One can make a histogram where each histogram

187: bin corresponds to a particular tempo in beats-per-minute and

188: the associated peak shows how frequent and strong that

189: particular periodicity was over the entire piece. In \cite{TC02}

190: we see a gradual change from a few high peaks to many low and spread-out

191: ones going from hip-hop, rock, jazz, to classical. One can use this

192: similarity type to try to cluster pieces in these categories.

193: However, such a method requires specific and detailed knowledge of

194: the problem area, since one needs to know what features to look for.

195:

196: {\bf Non-Feature Similarities:}

197: Our aim

198: is to capture, in a single similarity metric,

199: {\em every effective distance\/}:

200: effective versions of Hamming distance, Euclidean distance,

201: edit distances, alignment distance, Lempel-Ziv distance,

202: and so on.

203: This metric should be so general that it works in every

204: domain: music, text, literature, programs, genomes, executables,

205: natural language determination,

206: equally and simultaneously.

207: It would be able to simultaneously detect {\em all\/}

208: similarities between pieces that other effective distances can detect

209: separately.

210:

211: Such a ``universal'' metric

212: was co-developed by us in \cite{LBCKKZ01,malivitch:simmet}, as a normalized

213: version of the ``information metric'' of \cite{liminvit:kolmbook,BGLVZ}.

214: Roughly speaking, two objects are deemed close if

215: we can significantly ``compress'' one given the information

216: in the other, the idea being that if two pieces are more similar,

217: then we can more succinctly describe one given the other.

218: The mathematics used is based on Kolmogorov complexity theory \cite{liminvit:kolmbook}.

219: In \cite{malivitch:simmet} we defined a

220: new class of (possibly non-metric) distances, taking values in $[0,1]$ and

221: appropriate for measuring effective

222: similarity relations between sequences, say one type of similarity

223: per distance, and {\em vice versa}. It was shown that an appropriately

224: ``normalized'' information distance

225: minorizes every distance

226: in the class.

227: It discovers all effective similarities in the sense that if two

228: objects are close according to some effective similarity, then

229: they are also close according to the normalized information distance.

230: Put differently, the normalized information distance represents

231: similarity according to the dominating shared feature between

232: the two objects being compared.

233: In comparisons of more than two objects,

234: different pairs may have different dominating features.

235: For every two objects,

236: this universal metric distance zooms in on the dominant

237: similarity between those two objects

238:  out of a wide class of admissible similarity

239: features. In \cite{malivitch:simmet} we proved its optimality

240: and universality.

241: The normalized information distance also satisfies the metric

242: (in)equalities, and takes values in $[0,1]$;

243: hence it may be called {\em ``the'' similarity metric}.

244:

245: {\bf Normalized Compression Distance:}

246: Unfortunately, the universality of the normalized information distance

247: comes at the price of noncomputability, since it is based on the uncomputable

248: notion of Kolmogorov complexity.

249: But since the Kolmogorov

250: complexity of a string or file is the length

251: of the ultimate compressed version of that

252: file,

253: we can use real data compression programs to approximate the Kolmogorov

254: complexity.

255: Therefore, to apply this ideal precise mathematical theory in real life,

256: we have to replace the use of  the noncomputable

257: Kolmogorov complexity by an approximation

258: using a standard real-world compressor.

259: Thus, if $C$ is a compressor and we use $C(x)$

260: to denote the length of the compressed version of a string $x$,

261: then we arrive at the {\em Normalized Compression Distance}:

262: \begin{equation}\label{eq.ncd}

263:  \NCD(x,y) = \frac{C(xy) - \min(C(x),C(y))}{\max(C(x),C(y))},

264: \end{equation}

265: where for convenience we have replaced the pair $(x,y)$ in the formula

266: by the concatenation $xy$,

267: see \cite{malivitch:simmet,civit:cbc}.

268: In \cite{civit:cbc} we propose axioms to capture the real-world setting,

269: and show that \eqref{eq.ncd}

270: approximates optimality.

271: Actually, the

272: \NCD is a family of distance functions parameterized

273: by the given data

274: compressor $C$.

275:
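The definition \eqref{eq.ncd} translates almost literally into code. The following is a minimal sketch, assuming the \texttt{zlib} and \texttt{bz2} modules of the Python standard library as stand-ins for the compressor $C$; the function names are ours, chosen for illustration, and this is not the implementation used in our experiments.

\begin{verbatim}
```python
import bz2
import zlib


def compressed_len(data: bytes, compressor: str = "zlib") -> int:
    """Length in bytes of the compressed data; plays the role of C(x)."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bz2":
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """Normalized Compression Distance, with C(xy) taken on the concatenation."""
    cx = compressed_len(x, compressor)
    cy = compressed_len(y, compressor)
    cxy = compressed_len(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)
```
\end{verbatim}

With a real compressor the value can slightly exceed 1, and $\NCD(x,x)$ is small but usually not exactly 0; both effects reflect the imperfection of the compressor rather than of the formula.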

276: {\bf Universality of NCD:} In \cite{civit:cbc} we prove that the

277: \NCD is universal with respect to the family of all

278: admissible normalized distances---a special class that

279: is argued to contain all parameters and features of

280: similarity that are effective.

281: The compression-based \NCD method

282: establishes a universal similarity metric \eqref{eq.ncd} among objects

283: given as finite binary strings

284: \cite{BGLVZ,LBCKKZ01,malivitch:simmet,civit:cbc,Ke04}, and has been applied to

285: objects like genomes, music pieces in MIDI format, computer programs

286: in Ruby or C, pictures in simple bitmap formats, or time sequences such as

287: heart rhythm data, as well as to heterogeneous data and anomaly detection.

288: This method is feature-free in the sense

289: that it doesn't analyze the files looking for particular

290: features; rather it analyzes all features simultaneously

291: and determines the similarity between every pair of objects

292: according to the most dominant shared feature. The crucial

293: point is that the method analyzes the objects themselves.

294: This precludes comparison of abstract notions or other objects

295: that don't lend themselves to direct analysis, like

296: emotions, colors, Socrates, Plato, Mike Bonanno and Albert Einstein.

297:

298:

299: \itwsection{Google-Based Similarity}

300: To make computers more intelligent one would like

301: to represent meaning in computer-digestible form.

302: Long-term and labor-intensive efforts like

303: the {\em Cyc} project \cite{cyc:intro} and the {\em WordNet}

304: project \cite{wordnet} try to establish semantic relations

305: between common objects, or, more precisely, {\em names} for those

306: objects. The idea is to create

307: a semantic web of such vast proportions that rudimentary intelligence

308: and knowledge about the real world spontaneously emerges.

309: This comes at the great cost of designing structures capable

310: of manipulating knowledge, and entering high

311: quality contents in these structures

312: by knowledgeable human experts. While the efforts are long-running

313: and large scale, the overall information entered is minute compared

314: to what is available on the world-wide-web.

315:

316: The rise of the world-wide-web has enticed millions of users

317: to type in trillions of characters to create billions of web pages of

318: on average low quality contents. The sheer mass of the information

319: available about almost every conceivable topic makes it likely

320: that extremes will cancel and the majority or average is meaningful

321: in a low-quality approximate sense. We devise a general

322: method to tap the amorphous low-grade knowledge available for free

323: on the world-wide-web, typed in by local users aiming at personal

324: gratification of diverse objectives, and yet globally achieving

325: what is effectively the largest semantic electronic database in the world.

326: Moreover, this database is available for all by using any search engine

327: that can return aggregate page-count estimates like Google for a large

328: range of search-queries.

329:

330: While the previous \NCD method that compares the objects themselves using

331: \eqref{eq.ncd} is

332: particularly suited to obtain knowledge about the similarity of

333: objects themselves, irrespective of common beliefs about such

334: similarities, we now develop a method that uses only the name

335: of an object and obtains knowledge about the similarity of objects

336: by tapping available information generated by multitudes of

337: web users.

338: Here we are reminded of the words of D.H. Rumsfeld \cite{Ru01}

339: ``A trained ape can know an awful lot/

340: Of what is going on in this world,/

341: Just by punching on his mouse/

342: For a relatively modest cost!''

343: The new method is useful to extract knowledge from a given corpus of

344: knowledge, in this case the Google database, but not to

345: obtain true facts that are not common knowledge in that database.

346: For example, common viewpoints on the creation myths in different

347: religions

348: may be extracted by the Googling method, but contentious questions

349: of fact concerning the phylogeny of species can be better approached

350: by using the genomes of these species, rather than by opinion.

351:

352:

353: {\bf Googling for Knowledge:}

354: Let us start with a simple intuitive justification (not to be mistaken

355: for a substitute for the underlying mathematics)

356:  of the approach we propose in \cite{CV04}.

357: The Google search engine indexes

358: around ten billion pages on the web today. Each such page can be

359: viewed as a set of index terms. A search for a particular index term,

360: say ``horse'', returns a certain number of hits (web pages where

361: this term occurred), say 46,700,000. The number of hits for the

362: search term ``rider'' is, say, 12,200,000. It is also possible to search

363: for the pages where both ``horse'' and ``rider'' occur. This gives,

364: say, 2,630,000 hits.

365: This can be easily put in the standard  probabilistic framework.

366: If $w$ is a web page and $x$ a search term, then we write $x \in w$

367: to mean that Google returns web page $w$ when presented with search

368: term $x$.

369: An {\em event} is a set of web pages

370: returned by Google after

371: it has been presented with a search term.

372: We can view the event as the collection of all contexts of

373: the search term, background knowledge, as induced by the

374: accessible web pages for the Google search engine.

375: If the search term is $x$, then we denote the event by ${\bf x}$,

376: and define ${\bf x} = \{w: x \in w \}$.

377: The {\em probability} $p(x)$ of an event ${\bf x }$ is

378: the number of web pages

379: in the event divided by the overall number $M$ of web pages possibly

380: returned by Google. Thus, $p( x)= |{\bf x}|/M$.

381: At the time of writing, Google searches 8,058,044,651 web pages.

382: Define the joint event ${\bf x}  \bigcap {\bf y} = \{ w : x,y \in w\}$

383: as the set of web pages returned by Google,

384: containing both the search term $x$ and

385: the search term $y$. The joint probability

386: $p(x,  y) = |\{ w : x,y \in w\}|/M $ is the number of

387: web pages in the joint event  divided by the

388: overall number $M$ of web pages possibly

389: returned by Google.

390: This notation also allows us to define the probability $p(x|y)$

391: of {\em conditional} events ${\bf x}|{\bf y}

392: = ({\bf x} \bigcap {\bf y})/{\bf y}$ defined by

393: $p(x| y) = p( x,y)/p(y)$.

394:

395:

396: In the above example we have therefore $p(horse) \approx  0.0058$,

397: $p(rider)$ $ \approx 0.0015$, $p(horse,rider) \approx 0.0003$.

398: We conclude that the probability $p(horse|rider)$

399:  of ``horse'' accompanying ``rider''

400: is $\approx 1/5$ and the probability $p(rider|horse)$ of ``rider'' accompanying

401: ``horse'' is $\approx 1/19$.  The probabilities are asymmetric, and it is the

402: least probability that is the significant one. A very general search term

403: like ``the'' occurs in virtually all (English language) web pages.

404: Hence $p(the|rider) \approx 1$, and for almost all search

405: terms $x$ we have $p(the|x) \approx 1$. But $p(rider|the) \ll 1$,

406: say about equal to $p(rider)$, and gives the relevant information

407: about the association of the two terms.

408:
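As a check on the arithmetic, the quoted values follow directly from the hit counts and $M = 8{,}058{,}044{,}651$ (a worked restatement of the numbers above, not an additional result):
\begin{align*}
p(horse) &= 46{,}700{,}000/M \approx 0.0058, \\
p(rider) &= 12{,}200{,}000/M \approx 0.0015, \\
p(horse,rider) &= 2{,}630{,}000/M \approx 0.0003, \\
p(horse|rider) &= 2{,}630{,}000/12{,}200{,}000 \approx 0.216, \\
p(rider|horse) &= 2{,}630{,}000/46{,}700{,}000 \approx 0.056.
\end{align*}
Dividing the rounded probabilities instead, $0.0003/0.0058 \approx 0.052$, gives the figure of roughly $1/19$ quoted above.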

409: Our first attempt therefore could be the distance

410: \[ D_1 (x,y) = \min \{ p(x|y),p(y|x) \}.

411: \]

412: Experimenting with this distance gives bad results. One reason

413: is that the differences among small probabilities have increasing

414: significance the smaller the probabilities involved are. Another

415: reason is that we deal with absolute probabilities: two notions

416: that have very small probabilities each and have $D_1$-distance

417: $\epsilon$ are much less similar than two notions that have

418: much larger probabilities and have the same $D_1$-distance.

419: To resolve the first problem we take the negative logarithm

420: of the items being minimized, resulting in

421: \[

422:  D_2 (x,y)  = \max \{  \log 1/p(x|y),  \log 1/p(y|x) \}.

423: \]

424: To resolve the second problem we normalize $D_2(x,y)$ by dividing

425: by the maximum of $\log 1/p(x), \log 1/p(y)$.

426: Altogether, we obtain

427: the following normalized distance

428: \[

429: D_3 (x,y) = \frac{ \max \{ \log 1/p(x|y),  \log 1/p(y|x) \}}

430: { \max \{ \log 1/p(x) , \log 1/p(y) \}},

431: \]

432: for $p(x|y) > 0$ (and hence $p(y|x)>0$),

433:  and $D_3 (x,y) = \infty $ for $p(x|y)=0$ (and hence $p(y|x)=0$). Note that

434: $p(x|y) = p(x,y)/p(y)=0$ means that the search terms

435: ``$x$'' and ``$y$'' never occur together.

436: The two conditional probabilities are either both 0 or

437: they are both strictly positive. Moreover, if either of $p(x), p(y)$

438: is 0, then so are the conditional probabilities, but not necessarily

439: vice versa.

440:

441: We note that in the conditional probabilities the total number $M$,

442: of web pages indexed by Google, is divided out. Therefore, the

443: conditional probabilities are independent of $M$, and can be

444: replaced by the number of pages, the {\em frequency}, returned by Google.

445: Define the {\em frequency} $f(x)$ of search term $x$ as the

446: number of pages a Google search for $x$ returns:

447: $f(x)= Mp(x)$, $f(x,y)=Mp(x,y)$, and $p(x|y) = f(x,y)/f(y)$.

448: Rewriting $D_3$ results in

449: our final notion, the {\em normalized

450: Google distance (\NGD)}, defined by

451: \begin{equation}\label{eq.ngd}

452: \NGD(x,y) = \frac{  \max \{\log f(x), \log f(y)\}  - \log f(x,y) }{

453: \log M - \min\{\log f(x), \log f(y) \}},

454: \end{equation}

455: and if $f(x),f(y)>0$ and $f(x,y)=0$ then $\NGD(x,y)= \infty$.

456: From \eqref{eq.ngd} we see that

457: \begin{enumerate}

458: \item

459: $\NGD(x,y)$ is undefined for  $f(x)=f(y)=0$;

460: \item

461: $\NGD(x,y) = \infty$ for $f(x,y)=0$ and either or both $f(x)>0$

462: and $f(y)>0$; and

463: \item

464: $ \NGD(x,y) \geq 0$ otherwise.

465: \end{enumerate}

466:
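The step from $D_3$ to \eqref{eq.ngd} can be spelled out as follows, using only $p(x)=f(x)/M$ and $p(x|y)=f(x,y)/f(y)$ from above:
\begin{align*}
\log 1/p(x|y) &= \log f(y) - \log f(x,y), \\
\log 1/p(y|x) &= \log f(x) - \log f(x,y), \\
\max \{ \log 1/p(x|y), \log 1/p(y|x) \} &= \max \{ \log f(x), \log f(y) \} - \log f(x,y), \\
\max \{ \log 1/p(x), \log 1/p(y) \} &= \log M - \min \{ \log f(x), \log f(y) \}.
\end{align*}
The quotient of the last two lines is \eqref{eq.ngd}.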

467: With the Google hit numbers above, we can now compute

468: \[

469: \NGD(horse,rider)

470: \approx 0.443.

471: \]

472: We did the same calculation when Google indexed only one-half

473: of the current number of pages: 4,285,199,774. It is instructive that the

474: probabilities of the used search terms didn't change significantly over

475: this doubling of pages, with the number of hits for ``horse''

476: equal to 23,700,000, for ``rider'' equal to 6,270,000, and

477: for ``horse, rider'' equal to 1,180,000.

478:  The $\NGD(horse,rider)$ we computed

479: in that situation was 0.460. This is in line with our contention

480: that the relative frequencies of web pages containing

481: search terms give objective information about the semantic

482: relations between the search terms. If this is the case, then with

483: the vastness of the information accessed by Google, the

484: Google probabilities of search terms, and the computed \NGD's

485: should stabilize (be scale invariant) with a growing Google database.

486:
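The computation of \eqref{eq.ngd} from raw page counts, including the infinite and undefined cases listed above, can be sketched as follows. This is an illustrative sketch only: the function name and argument order are our own assumptions, the counts are passed in by hand rather than fetched from a search engine, and we assume $f(x), f(y) < M$ so that the denominator is nonzero.

\begin{verbatim}
```python
import math


def ngd(fx: int, fy: int, fxy: int, m: int) -> float:
    """Normalized Google distance from page counts f(x), f(y), f(x,y) and M."""
    if fxy > min(fx, fy):
        raise ValueError("f(x,y) cannot exceed f(x) or f(y)")
    if fx == 0 and fy == 0:
        return math.nan  # undefined: neither term occurs at all
    if fxy == 0:
        return math.inf  # the terms never occur together
    log_fx, log_fy = math.log(fx), math.log(fy)
    numerator = max(log_fx, log_fy) - math.log(fxy)
    denominator = math.log(m) - min(log_fx, log_fy)
    return numerator / denominator


if __name__ == "__main__":
    # Hit counts for "horse", "rider", and both, as quoted in the text.
    print(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651))
    print(ngd(23_700_000, 6_270_000, 1_180_000, 4_285_199_774))
```
\end{verbatim}

The base of the logarithm cancels in the quotient, so natural logarithms are used here for convenience. The two calls should come out close to the values 0.443 and 0.460 reported above.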

487: The \NGD formula itself \eqref{eq.ngd} is {\em scale-invariant}. It is very important that, if

488: the number $M$ of pages indexed by Google grows sufficiently large,

489: the number of pages containing given search terms

490: goes to a fixed fraction of $M$, and so does the number of pages

491: containing conjunctions of search terms. This means that if $M$   doubles,

492: then so do the $f$-frequencies. For the \NGD to give us an objective

493: semantic relation between search terms,

494: it needs to become stable when the number $M$ of indexed pages grows.

495: Some evidence that this actually happens

496: is given  in the remark about the \NGD scaling properly.

497:

498:

499:

500: \itwsection{From NCD to NGD}

501: {\bf The Google Distribution:}

502: \label{sect.google}

503: Let the set of singleton {\em Google search terms}

504: be denoted by ${\cal S}$. In the sequel we use both singleton

505: search terms and doubleton search terms $\{\{x,y\}: x,y \in {\cal S} \}$.

506: Let the set of web pages indexed (capable of being returned)

507: by Google be $\Omega$. The cardinality of $\Omega$ is denoted

508: by $M=|\Omega|$, and currently $8\cdot 10^9 \leq M \leq 9 \cdot 10^9$.

509: Assume that a priori all web pages are equi-probable, with the probability

510: of being returned by Google being $1/M$.  A subset of $\Omega$

511: is called an {\em event}. Every {\em  search term} $x$ usable by Google

512: defines a {\em singleton Google event} ${\bf x} \subseteq \Omega$ of web pages

513: that contain an occurrence of $x$ and are returned by Google

514: if we do a search for $x$.

515: Let $L: \Omega \rightarrow [0,1]$ be the uniform probability mass

516: function.

517: The probability of

518: such an event ${\bf x}$ is $L({\bf x})=|{\bf x}|/M$.

519:  Similarly, the {\em doubleton Google event} ${\bf x} \bigcap {\bf y}

520: \subseteq \Omega$ is the set of web pages returned by Google

521: if we do a search for pages containing both search term $x$ and

522: search term $y$.

523: The probability of this event is $L({\bf x} \bigcap {\bf y})

524: = |{\bf x} \bigcap {\bf y}|/M$.

525: We can also define the other Boolean combinations: $\neg {\bf x}=

526: \Omega \backslash {\bf x}$ and ${\bf x} \bigcup {\bf y} =

527: \Omega \backslash ( \neg {\bf x} \bigcap \neg {\bf y})$, each such event

528: having a probability equal to its cardinality divided by $M$.

529: If ${\bf e}$ is an event obtained from the basic events ${\bf x}, {\bf y},

530: \ldots$, corresponding to basic search terms $x,y, \ldots$,

531: by finitely many applications of the Boolean operations,

532: then the probability $L({\bf e}) = |{\bf e}|/M$.

533:

534: %A {\em pseudo-probability} is a

535: %function $p: {\cal S}

536: %\rightarrow [0,1]$ such that $ 1 < \sum_{s \in {\cal S}} p(s) < \infty$.

537: Google events capture in a particular sense

538: all background knowledge about the search terms concerned available

539: (to Google) on the web. Therefore, it is natural

540: to consider code words for those events

541: as coding this background knowledge. However,

542: we cannot use the probability of the events directly to determine

543: a prefix code such as the Shannon-Fano code \cite{liminvit:kolmbook}.

544: The reason is that

545: the events overlap and hence the summed probability exceeds 1.

546: By the Kraft inequality \cite{liminvit:kolmbook} this prevents a

547: corresponding Shannon-Fano code.

548: The solution is to normalize:

549: We use the probability of the Google events to define a probability

550: mass function over the set $\{\{x,y\}: x,y \in {\cal S}\}$

551: of  Google search terms, both singleton and doubleton.

552: Define

553: \[

554:  N= \sum_{\{x,y\} \subseteq {\cal S}} |{\bf x} \bigcap

555: {\bf y}|,

556: \]

557: counting each singleton set and each doubleton set (by definition

558: unordered) once in the summation.

559: Since every web page that is indexed by Google contains at least

560: one occurrence of a search term, we have $N \geq M$. On the other hand,

561: web pages contain on average not more than a certain constant $\alpha$

562: search terms. Therefore, $N \leq \alpha M$.

563: Define

564: \begin{align}\label{eq.gpmf}

565: &g(x) = L({\bf x}) M/N =|{\bf x}|/N

566: \\&

567: \nonumber

568: g(x,y) =  L({\bf x} \bigcap {\bf y}) M/N =|{\bf x} \bigcap {\bf y}|/N.

569: \end{align}

570: Then, $\sum_{x \in {\cal S}} g(x)+ \sum_{x,y \in {\cal S}} g(x,y) = 1$.

571: Note that $g(x,y)$ is not a conventional joint distribution

572: since possibly $g(x) \neq \sum_{y \in {\cal S}} g(x,y)$.

573: Rather, we consider $g$ to be a probability mass

574: function over the sample space $\{ \{x,y\}: x,y \in {\cal S} \}$.

575: This $g$-distribution changes over time,

576: and between different samplings

577: from the distribution. But let us imagine that $g$ holds

578: in the sense of an instantaneous snapshot. The real situation

579: will be an approximation of this.

580: Given the Google machinery, these are absolute probabilities

581: which allow us to define the associated Shannon-Fano code for

582: both the singletons and the doubletons.

583:
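To make the normalization concrete, the following sketch builds $N$ and the function $g$ of \eqref{eq.gpmf} for a tiny, entirely hypothetical ``web'' of four pages, each represented as the set of search terms it contains. The corpus and the helper name are assumptions made for illustration; exact fractions are used so that the sum can be checked without rounding.

\begin{verbatim}
```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy corpus: each page is the set of search terms it contains.
PAGES = [
    {"horse", "rider"},
    {"horse"},
    {"rider", "saddle"},
    {"horse", "rider", "saddle"},
]


def google_distribution(pages):
    """Return (N, g), where g maps each singleton or doubleton to its mass."""
    terms = sorted(set().union(*pages))
    counts = {}
    for x in terms:  # singleton events: pages containing x
        counts[frozenset([x])] = sum(1 for w in pages if x in w)
    for x, y in combinations(terms, 2):  # unordered doubleton events
        counts[frozenset([x, y])] = sum(1 for w in pages if x in w and y in w)
    n = sum(counts.values())  # each singleton and doubleton counted once
    g = {key: Fraction(c, n) for key, c in counts.items()}
    return n, g


if __name__ == "__main__":
    n, g = google_distribution(PAGES)
    print("N =", n, " M =", len(PAGES), " total mass =", sum(g.values()))
```
\end{verbatim}

In this toy corpus the raw probabilities $|{\bf x}|/M$ of the three singleton events already sum to $8/4 = 2$, which is why they cannot serve directly as code-word probabilities, whereas the values of $g$ sum to 1 by construction.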

584: {\bf Normalized Google Distance:}

585: The {\em Google code} length $G$

586: is defined by

587: \begin{align}\label{eq.gcc}

588: &G(x)= \log 1/g(x)

589: \\&

590: \nonumber

591: G(x,y)= \log 1/g(x,y) .

592: \end{align}

593: In contrast to strings $x$ where the complexity $C(x)$ represents

594: the length of the compressed version of $x$ using compressor $C$, for a search

595: term $x$ (just the name for an object rather than the object itself),

596: the Google code of length $G(x)$ represents the shortest expected

597: prefix-code word length of the associated Google event ${\bf x}$.

598: The expectation

599: is taken over the Google distribution $g$.

600: In this sense we can use the Google distribution as a compressor

601: for Google ``meaning'' associated with the search terms.

602: The associated \NCD, now called the

603: {\em normalized Google distance (\NGD)} is then defined

604: by \eqref{eq.ngd} with $N$ substituted for $M$, rewritten as

605: \begin{equation}\label{eq.NGD}

606:  \NGD(x,y)=\frac{G(x,y) - \min(G(x),G(y))}{\max(G(x),G(y))}.

607: \end{equation}

608: This $\NGD$ is an approximation to the $\NID$

609: using the Shannon-Fano code (Google code)

610: generated by the Google distribution as defining a compressor

611: approximating the length of the Kolmogorov code, using

612: the background knowledge on the web as viewed by Google

613: as conditional information. In experimental practice,

614: we consider $N$ (or $M$) as a normalization constant

615: that can be adjusted.

616:
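To see why \eqref{eq.NGD} coincides with \eqref{eq.ngd} with $N$ in place of $M$, substitute $g(x)=f(x)/N$ and $g(x,y)=f(x,y)/N$, where $f(x)=|{\bf x}|$ and $f(x,y)=|{\bf x} \bigcap {\bf y}|$, into \eqref{eq.gcc}:
\begin{align*}
G(x) &= \log N - \log f(x), \qquad G(x,y) = \log N - \log f(x,y), \\
G(x,y) - \min(G(x),G(y)) &= \max \{ \log f(x), \log f(y) \} - \log f(x,y), \\
\max(G(x),G(y)) &= \log N - \min \{ \log f(x), \log f(y) \}.
\end{align*}
The $\log N$ terms cancel in the numerator, so $N$ enters only through the denominator.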

617: {\bf Universality of NGD:} In the full paper \cite{CV04} we

618: show that \eqref{eq.ngd} and \eqref{eq.NGD} are

619: close in typical situations.

620: Our experimental results suggest that every reasonable

621: (greater than any $f(x)$) value can be used for the normalizing factor  $N$,

622: and our

623: results seem  in general insensitive to this choice.  In our software, this

624: parameter $N$ can be adjusted as appropriate, and we often use $M$ for $N$.

625: In the full paper we analyze the mathematical properties of \NGD,

626: and  prove the universality of the Google distribution among web author based

627: distributions, as well as the universality of the \NGD with respect to

628: the family of the individual web authors' \NGD's, that is, their

629: individual semantic relations (with high probability); these results are omitted here

630: for space reasons.

631:

632:

633:

634:

635: \itwsection{Applications}

636: \label{sect.exp}

637: {\bf Applications of NCD:}

638: We developed the CompLearn Toolkit, \cite{Ci03}, and performed

639: experiments in vastly different

640: application fields to test the quality and universality of the method.

641: The success of the method as reported below depends strongly on the

642: judicious use of encoding of the objects compared. Here one should

643: use common sense on what a real world compressor can do. There are

644: situations where our approach fails if applied in a

645: straightforward way.

646: For example: comparing text files by the same authors

647: in different encodings (say, Unicode and an 8-bit encoding) is bound to fail.

648: For the ideal similarity metric  based on

649: Kolmogorov complexity as defined in \cite{malivitch:simmet}

650: this does not matter at all, but for

651: practical compressors used in the experiments it will be fatal.

652: Similarly, in the music experiments below we use symbolic MIDI

653: music file  format rather than wave format music files. The reason is that

654: the strings resulting from a straightforward

655: discretization of the wave form files may be too sensitive to how we discretize.

656: Further research may overcome this problem.
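
The encoding problem can be reproduced in a few lines. The sketch below is only an illustration under stated assumptions: it uses Python's \texttt{zlib} as the real-world compressor, takes the \NCD to have the form of \eqref{eq.NGD} with compressed lengths $C$ in place of $G$, and uses a made-up sample text.

```python
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes, standing in for C(x)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample text; any moderately long text would do.
text = " ".join(
    f"sentence {i}: the compressor sees bytes, not characters" for i in range(40)
)

# The same text compared with itself, once in one encoding and once across two.
same_encoding = ncd(text.encode("latin-1"), text.encode("latin-1"))
mixed_encoding = ncd(text.encode("latin-1"), text.encode("utf-16-le"))

print(f"same 8-bit encoding: {same_encoding:.3f}")
print(f"8-bit versus UTF-16: {mixed_encoding:.3f}")
```

Both byte strings carry the same text, yet the mixed-encoding distance should come out clearly larger: the zero bytes that UTF-16 interleaves with the characters prevent the compressor from matching one version against the other.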

657:

658: The \NCD is

659: not restricted to a specific application area, and

660: works across application area boundaries.

661: To extract a hierarchy of clusters

662: from the distance matrix,

663: we determine a dendrogram (binary tree)

664: by a new quartet

665: method and a fast heuristic to implement it.

666: The method is implemented and available as public software \cite{Ci03}, and is

667: robust under choice of different compressors.
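
The quartet method itself is not spelled out in this section. As a rough illustration of its basic building block, the sketch below assumes that the cost of a quartet topology $uv|wx$ is $d(u,v)+d(w,x)$ and picks the cheapest of the three possible pairings of four objects. The function names and the distance matrix are hypothetical; the tree search heuristic is not shown.

```python
from itertools import combinations


def best_quartet_topology(d, u, v, w, x):
    """Return (cost, pairing) for the cheapest of the three pairings of {u, v, w, x}."""
    pairings = [((u, v), (w, x)), ((u, w), (v, x)), ((u, x), (v, w))]
    return min(
        ((d[a][b] + d[c][e], ((a, b), (c, e))) for (a, b), (c, e) in pairings),
        key=lambda item: item[0],
    )


def minimal_total_cost(d):
    """Sum of the cheapest topology cost over every quartet of objects."""
    return sum(
        best_quartet_topology(d, *q)[0] for q in combinations(range(len(d)), 4)
    )


# Hypothetical symmetric distance matrix: objects 0 and 1 are close, as are 2 and 3.
d = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
cost, pairing = best_quartet_topology(d, 0, 1, 2, 3)
print(cost, pairing)
```

Here the pairing that groups objects 0 and 1 against 2 and 3 is selected. Under the assumed cost function, the sum of these per-quartet minima is a lower bound on the total cost of any tree over the same objects.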

668: This approach gives

669: the first completely automatic construction

670: of the phylogeny tree based on whole mitochondrial genomes,

671: \cite{LBCKKZ01,malivitch:simmet},

672: a completely automatic construction of a language tree for over 50

673: Euro-Asian languages \cite{malivitch:simmet},

674: detects plagiarism in student programming assignments

675: \cite{SID}, gives phylogeny of chain letters \cite{BLM03}, and clusters

676: music \cite{cidervit:mus}.

677: Moreover, the method turns out to be robust under change of the underlying

678: compressor-types: statistical (PPMZ), Lempel-Ziv based  dictionary (gzip),

679: block based (bzip2), or special purpose (Gencompress).
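
A small experiment in this spirit can be sketched with the compressors in the Python standard library. The assumptions are ours: \texttt{zlib} stands in for gzip and \texttt{bz2} for bzip2, \texttt{lzma} is an extra compressor not mentioned above, PPMZ and Gencompress are not covered, and the toy data are generated rather than taken from our experiments.

```python
import bz2
import hashlib
import lzma
import zlib

# Standard-library stand-ins; lzma is an extra that the text does not mention.
COMPRESSORS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def ncd(x: bytes, y: bytes, compress) -> float:
    """(C(xy) - min(C(x), C(y))) / max(C(x), C(y)) for the given compressor."""
    cx, cy, cxy = (len(compress(d)) for d in (x, y, x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def lcg(seed: int):
    """Tiny deterministic pseudo-random generator, only for building toy data."""
    state = seed
    while True:
        state = (1103515245 * state + 12345) % 2**31
        yield state >> 16


def toy_tokens(n_tokens: int, seed: int) -> list:
    vocabulary = "tree genome music virus digit star novel language".split()
    numbers = lcg(seed)
    return [
        f"{vocabulary[next(numbers) % 8]}{next(numbers) % 100}"
        for _ in range(n_tokens)
    ]


def pseudo_random_bytes(n: int) -> bytes:
    """Deterministic, essentially incompressible bytes from a SHA-256 chain."""
    out, block = bytearray(), b"seed"
    while len(out) < n:
        block = hashlib.sha256(block).digest()
        out += block
    return bytes(out[:n])


tokens = toy_tokens(400, seed=1)
original = " ".join(tokens).encode()
edited_tokens = list(tokens)
edited_tokens[::50] = ["opera0"] * len(edited_tokens[::50])  # change every 50th token
edited = " ".join(edited_tokens).encode()
unrelated = pseudo_random_bytes(len(original))

results = {
    name: (ncd(original, edited, comp), ncd(original, unrelated, comp))
    for name, comp in COMPRESSORS.items()
}
for name, (near, far) in results.items():
    print(f"{name:5s} lightly edited copy: {near:.3f}   unrelated bytes: {far:.3f}")
```

The absolute values differ from compressor to compressor, but each one should place the lightly edited copy much closer to the original than the unrelated byte string. This ordering, rather than the exact numbers, is what the clustering relies on.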

680:

681: To substantiate our claims of universality and robustness, in \cite{civit:cbc}

682: we report evidence of successful application in areas as diverse as

683: genomics, virology, languages, literature, music, handwritten digits,

684: astronomy, and

685: combinations of objects from completely different

686: domains, using statistical, dictionary, and block sorting compressors.

687: In genomics we presented new evidence for major questions

688: in Mammalian evolution, based on whole-mitochondrial genomic

689: analysis: the Eutherian orders and the Marsupionta hypothesis

690: against the Theria hypothesis.

691: Apart from the experiments reported in \cite{civit:cbc}, the clustering by

692: compression method reported

693: in this paper  has recently been used in many different areas all over

694: the world. One project in our group was

695: to analyze network traffic and cluster computer worms and viruses \cite{We04}.

696: Finally, recent  work \cite{Ke04} reports experiments with our method

697: on all time sequence data used in all the major data-mining

698: conferences in the last decade. Comparing the compression method

699: with all major methods used in those conferences they established

700: clear superiority of the compression method for clustering heterogeneous

701: data, and for anomaly detection.

702:

703:

704: {\bf Applications of NGD:}

705:  This new method is  proposed in  \cite{CV04} to extract semantic

706: knowledge from the world-wide-web for both

707: supervised and unsupervised learning using the Google search engine

708: in an unconventional manner.  The approach is

709: novel in its unrestricted problem domain, simplicity of implementation,

710: and manifestly ontological underpinnings.  We give evidence of

711: elementary learning of the semantics of concepts, in

712: contrast to most prior approaches (outside of Knowledge Representation

713: research) that have neither the appearance nor the aim of dealing with ideas,

714: instead using abstract symbols that remain permanently ungrounded throughout

715: the machine learning application.

716: The world-wide-web is the largest database on earth,

717: and it induces a

718:  probability mass function, the Google

719: distribution, via page counts for combinations of search queries.

720: This distribution allows us to tap the latent semantic knowledge

721: on the web.

722: While in the \NCD compression-based method

723:  one deals with the objects themselves,

724: in the current work we deal with just names for the objects.

725: In \cite{CV04}, as proof of principle, we demonstrate

726: positive correlations, evidencing an

727: underlying semantic structure, in both numerical symbol notations

728: and number-name words in a variety of natural languages

729: and contexts.

730: Next, we give applications in

731: (i) unsupervised hierarchical clustering, demonstrating the ability

732: to distinguish between colors and numbers, and

733: to distinguish between 17th century

734: Dutch painters;

735: (ii)

736:  supervised

737: concept-learning by example, using Support Vector Machines,

738: demonstrating the ability to understand

739: electrical terms, religious terms,

740: and emergency incidents, and conducting

741:  a massive experiment in understanding

742: WordNet categories \cite{Ci04};

743: and (iii) matching of meaning, in an example of

744: automatic English-Spanish translation.

745:

746:

747:

748: \begin{itwreferences}

749:

750:

751: \bibitem{BGLVZ}

752: C.H. Bennett, P. G\'acs, M. Li, P.M.B. Vit\'anyi, W. Zurek,

753: Information Distance, {\em IEEE Trans. Information Theory},

754: 44:4(1998), 1407--1423.

755:

756:

757: \bibitem{BLM03}

758: C.H. Bennett, M. Li, B. Ma, Chain letters and evolutionary histories,

759: {\em Scientific American}, June 2003, 76--81.

760:

761:

762:

763: \bibitem{burges:svmtut}

764: C.J.C. Burges.

765: A tutorial on support vector machines for pattern recognition,

766: {\em Data Mining and Knowledge Discovery}, 2:2(1998),121--167.

767:

768:

769: \bibitem{SID}

770: X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker,

771: Shared information and program plagiarism detection,

772: {\em IEEE Trans. Inform. Th.}, 50:7(2004), 1545--1551.

773:

774:

775:

776: \bibitem{Ci03}

777: R. Cilibrasi, The CompLearn Toolkit, CWI, 2003,

778:  http://complearn.sourceforge.net/

779:

780:

781: \bibitem{CVfolk}

782: W.~Chai and B.~Vercoe.

783: Folk music classification using hidden Markov models.

784: {\em Proc.~of International Conference on Artificial Intelligence}, 2001.

785:

786: \bibitem{Ci04}

787: R. Cilibrasi, P. Vit\'anyi,

788: Automatic Meaning Discovery Using Google: 100 Experiments in Learning

789: WordNet Categories, 2004,

790: {\tt http://www.cwi.nl/$\sim$cilibrar/googlepaper/appendix.pdf}

791:

792: \bibitem{cidervit:mus}

793: R.~Cilibrasi, R.~de~Wolf, P.~Vit\'anyi.

794: Algorithmic clustering of music based on string compression,

795: {\em Computer Music J.}, 28:4(2004), 49--67.

796:

797: \bibitem{civit:cbc}

798: R. Cilibrasi, P.M.B. Vit\'anyi, Clustering by compression,

799: {\em IEEE Trans. Information Theory}, 51:4(2005), 1523--1545. Also:

800: (preliminary version) http://www.arxiv.org/abs/cs.CV/0312044

801:

802:

803: \bibitem{CV04}

804: R.~Cilibrasi, P.~Vit\'anyi,

805: Automatic meaning discovery using Google,

806: Manuscript, CWI, 2004;

807: http://arxiv.org/abs/cs.CL/0412098

808:

809:

810: %\bibitem{CPSV00}

811: %G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin.

812: %Communication complexity of document exchange.

813: %In {\em Proc. 11th ACM--SIAM Symp. on Discrete Algorithms}, 2000,

814: %197--206.

815:

816:

817:

818: \bibitem{DTWml}

819: R.~Dannenberg, B.~Thom, and D.~Watson.

820: A machine learning approach to musical style recognition,

821: {\em Proc.~International Computer Music Conference}, pp. 344--347, 1997.

822:

823:

824:

825: \bibitem{google}

826: The basics of Google search,

827:  http://www.google.com/help/basics.html.

828:

829:

830: \bibitem{GKCwavelet}

831: M.~Grimaldi, A.~Kokaram, and P.~Cunningham.

832: Classifying music by genre using the wavelet packet transform

833: and a round-robin ensemble.

834: Technical report TCD-CS-2002-64, Trinity College Dublin, 2002.

835: http://www.cs.tcd.ie/publications/tech-reports/reports.02/TCD-CS-2002-64.pdf

836:

837:

838: %\bibitem{Kr49}

839: %L.G. Kraft,

840: %A device for quantizing, grouping and coding amplitude modulated

841:   %pulses.

842: %Master's thesis, Dept. of Electrical Engineering, M.I.T., Cambridge,

843:   %Mass., 1949.

844:

845: \bibitem{Ke04}

846: E. Keogh, S. Lonardi, and C.A. Ratanamahatana, Towards parameter-free

847: data mining, In: {\em Proc. 10th ACM SIGKDD Int'l Conf. Knowledge

848: Discovery and Data Mining}, Seattle, Washington, USA, August 22--25, 2004,

849: 206--215.

850:

851:

852: \bibitem{Ko65}

853: A.N. Kolmogorov.

854: Three approaches to the quantitative definition of information,

855: {\em Problems Inform. Transmission}, 1:1(1965), 1--7.

856:

857: \bibitem{Ko83}

858: A.N. Kolmogorov.

859: Combinatorial foundations of information theory and the calculus of

860:   probabilities,

861: {\em Russian Math. Surveys}, 38:4(1983), 29--40.

862:

863: \bibitem{cyc:intro}

864: D.~B. Lenat.

865: Cyc: A large-scale investment in knowledge infrastructure,

866: {\em Comm. ACM}, 38:11(1995),33--38.

867:

868: \bibitem{LBCKKZ01}

869: M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang,

870: An information-based sequence distance and its application

871: to whole mitochondrial genome phylogeny,

872: {\em Bioinformatics}, 17:2(2001), 149--154.

873:

874:

875: \bibitem{malivitch:simmet}

876: M.~Li, X.~Chen, X.~Li, B.~Ma, P.~Vit\'anyi.

877: The similarity metric,

878: {\em IEEE Trans. Information Theory}, 50:12(2004), 3250--3264.

879:

880: \bibitem{liminvit:kolmbook}

881: M. Li, P.M.B. Vit\'anyi.

882: {\em An Introduction to Kolmogorov Complexity and Its Applications},

883: 2nd Ed.,

884: Springer-Verlag, New York, 1997.

885:

886: \bibitem{cyc:onto}

887: S.~L. Reed, D.~B. Lenat.

888: Mapping ontologies into cyc.

889: {\em Proc. AAAI Conference 2002 Workshop on Ontologies for the Semantic Web},

890: Edmonton, Canada. http://citeseer.nj.nec.com/509238.html

891:

892: \bibitem{Ru01}

893: D.H. Rumsfeld, The digital revolution,

894: originally published June 9, 2001, following a European trip.

895: In: H. Seely, The Poetry of D.H. Rumsfeld, 2003,

896: http://slate.msn.com/id/2081042/

897:

898:

899: \bibitem{Sneural}

900: P.~Scott.

901: Music classification using neural networks, 2001.\\

902: http://www.stanford.edu/class/ee373a/musicclassification.pdf

903:

904:

905:

906:

907: \bibitem{wordnet}

908: {G.A. Miller et al., WordNet,

909: A Lexical Database for the English Language,

910: Cognitive Science Lab, Princeton University.

911: \\http://www.cogsci.princeton.edu/$\sim$wn

912: }

913:

914:

915: \bibitem{TC02}

916: G.~Tzanetakis and P.~Cook, Music genre classification of audio signals,

917: {\em IEEE Transactions on Speech and Audio Processing},

918: 10(5):293--302, 2002.

919:

920: \bibitem{We04}

921: S. Wehner, Analyzing network traffic and worms using compression,

922: http://arxiv.org/abs/cs.CR/0504045

923:

924:

925: \end{itwreferences}

926:

927: \end{itwpaper}

928: \end{document}

929:

930:

931: