1: %\documentclass{acm_proc_article-sp}
2: %\documentclass{sig-alternate}
3: \documentclass{article}
4: \usepackage{itw2005}
5: \usepackage{amsmath,amstext,amsthm,amssymb}
6: \usepackage{latex8}
7: \usepackage{times}
8:
9:
10:
11:
12: %\usepackage{amsmath,amstext,amsthm,amssymb,epsf}
13: \usepackage{amsmath,amstext,amssymb,epsf}
14: %\usepackage{fullpage,latexsym}
15: \usepackage{epsfig}
16: \usepackage{verbatim}
17: \usepackage{pslatex}
18:
19:
20: %\bibliographystyle{plain}
21:
22: \newcommand{\gzip}{ \texttt {gzip} }
23: \newcommand{\bzip}{ \texttt {bzip2} }
24: \newcommand{\NID}{ \textsc {NID} }
25: \newcommand{\NGD}{ \textsc {NGD} }
26: \newcommand{\NDD}{ \textsc {NDD} }
27: \newcommand{\NCD}{ \textsc {NCD} }
28: \newcommand{\NCDf}[2]{ \NCD(#1,#2) }
29: \newcommand{\SVM}{ \textsc {SVM} }
30:
31: \newtheorem{theorem}{\sc Theorem}
32: \newtheorem{lemma}{\sc Lemma}
33: \newtheorem{coro}{\sc Corollary}
34: \newtheorem{nota}{\sc Notation}
35: \newtheorem{defin}{\sc Definition}
36: \newtheorem{rem}{\sc Remark}
37: \newtheorem{cla}{\sc Claim}
38: \newtheorem{ex}{\sc Example}
39: \newenvironment{remark}{\begin{rem}}{\hspace*{\fill}$\diamondsuit$\end{rem}}
40: %\newenvironment{proof}{\par \sc Proof.\rm}{\hspace*{\fill}$\Box$\vspace{1ex}}
41: \newenvironment{example}{\begin{ex}}{\hspace*{\fill}$\Diamond$\end{ex}}
42: \newenvironment{claim}{\begin{cla}}{\end{cla}}
43: \newenvironment{corollary}{\begin{coro}}{\end{coro}}
44: \newenvironment{definition}{\begin{defin}}{\end{defin}}
45: %\newenvironment{remark}{\begin{rem}}{\end{rem}}
46: \newenvironment{notation}{\begin{nota}}{\end{nota}}
47:
48:
49: \itwtitle{Universal Similarity}
50:
51: %\numberofauthors{2}
52: %\author{
53: %\alignauthor Rudi Cilibrasi\titlenote{Supported in part by the Netherlands
54: %BSIK/BRICKS project,
55: %and by NWO project 612.55.002. Address: CWI, Kruislaan 413, 1098 SJ
56: %Amsterdam, The Netherlands. Email: Rudi.Cilibrasi@cwi.nl}\\
57: %\affaddr{CWI}
58: %\affaddr{Kruislaan 413}\\
59: %\affaddr{1098 SJ Amsterdam, The Netherlands}\\
60: %\email{Rudi.Cilibrasi@cwi.nl}
61: %\alignauthor Paul Vitanyi\titlenote{Part of this work was done while the author was on sabbatical leave
62: %at National ICT of Australia, Sydney Laboratory at UNSW.
63: %Supported in part
64: %by the EU EU Project RESQ IST-2001-37559,
65: %the ESF QiT Programmme,
66: %the EU NoE PASCAL, and the Netherlands BSIK/BRICKS project.
67: %Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands.
68: %Email: Paul.Vitanyi@cwi.nl}\\
69: %\affaddr{CWI,}\\
70: %\affaddr{University of Amsterdam, and}\\
71: %\affaddr{National ICT of Australia}\\
72: %%\email{Paul.Vitanyi@cwi.nl}
73: %}
74:
75:
76: %\itwauthor{Rudi Cilibrasi}{CWI, Amsterdam, The Netherlands.
77: %{\tt Rudi.Cilibrasi@cwi.nl}}
78: %\itwsecondauthor{Paul Vitanyi}{CWI, Amsterdam, the Netherlands.
79: %{\tt paulv@cwi.nl}}
80: \itwauthor{Paul Vitanyi\thanks{Part of this work was done while the author was on sabbatical leave
81: at National ICT of Australia, Sydney Laboratory at UNSW.
82: Supported in part
by the EU Project RESQ IST-2001-37559,
the ESF QiT Programme,
85: the EU NoE PASCAL, and the Netherlands BSIK/BRICKS project.
86: Address: CWI, Kruislaan 413, 1098SJ Amsterdam, The Netherlands.
87: {\tt paulv@cwi.nl}
88: }}{CWI, University of Amsterdam, National ICT of Australia}
89:
90:
91: \begin{document}
92: \itwmaketitle
93:
94:
95:
96: \begin{itwabstract}
97: We survey a new area of parameter-free similarity distance measures
98: useful in data-mining,
99: pattern recognition, learning and automatic semantics extraction.
100: Given a family of distances on a set of objects,
101: a distance is universal up to a certain precision for that family if it
102: minorizes every distance in the family between every two objects
103: in the set, up to the stated precision (we do not require the universal
104: distance to be an element of the family).
105: We consider similarity distances
106: for two types of objects: literal objects that as such contain all of their
107: meaning, like genomes or books, and names for objects.
108: The latter may have
literal embodiments like the first type, but may also
110: be abstract like ``red'' or ``christianity.'' For the first type
111: we consider
112: a family of computable distance measures
113: corresponding to parameters expressing similarity according to
114: particular features
115: between
116: pairs of literal objects. For the second type we consider similarity
117: distances generated by web users corresponding to particular semantic
relations between (the names for) the designated objects.
119: For both families we give universal similarity
120: distance measures, incorporating all particular distance measures
121: in the family. In the first case the universal
122: distance is based on compression and in the second
123: case it is based on Google page counts related to search terms.
124: In both cases experiments on a massive scale give evidence of the
125: viability of the approaches.
126: \end{itwabstract}
127:
128: \begin{itwpaper}
129:
130: \itwsection{Introduction}
131: Objects can be given literally, like the literal
132: four-letter genome of a mouse,
133: or the literal text of {\em War and Peace} by Tolstoy. For
134: simplicity we take it that all meaning of the object
135: is represented by the literal object itself. Objects can also be
136: given by name, like ``the four-letter genome of a mouse,''
137: or ``the text of {\em War and Peace} by Tolstoy.'' There are
138: also objects that cannot be given literally, but only by name
139: and acquire their meaning from their contexts in background common
140: knowledge in humankind, like ``home'' or ``red.''
141: In the literal setting, objective similarity of objects can be established
142: by feature analysis, one type of similarity per feature.
143: In the abstract ``name'' setting, all similarity must depend on
144: background knowledge and common semantics relations,
145: which is inherently subjective and ``in the mind of the beholder.''
146:
147: \itwsection{Compression Based Similarity}
148: All data are created equal but some data are more alike than others.
149: We have recently proposed methods expressing this alikeness,
150: using a new similarity metric based on compression.
151: It is parameter-free in that it
152: doesn't use any features or background knowledge about the data, and can without
153: changes be applied to different areas and across area boundaries.
154: It is universal in that it approximates the parameter
155: expressing similarity of the dominant feature in all pairwise
156: comparisons.
157: It is robust in the sense that its success appears independent
158: from the type of compressor used.
159: The clustering we use is hierarchical clustering in dendrograms
160: based on a new fast heuristic for the quartet method.
161: The method is available as an open-source software tool, \cite{Ci03}.
162:
163: {\bf Feature-Based Similarities:}
164: We are presented with unknown data and
165: the question is to determine the similarities among them
166: and group like with like together. Commonly, the data are
167: of a certain type: music files, transaction records of ATM machines,
168: credit card applications, genomic data. In these data there are
169: hidden relations that we would like to get out in the open.
170: For example, from genomic data one can extract
171: letter- or block frequencies (the blocks are over the four-letter alphabet);
172: from music files one can extract
173: various specific numerical features,
174: related to pitch, rhythm, harmony etc.
175: One can extract such features using for instance
176: Fourier transforms~\cite{TC02} or wavelet transforms~\cite{GKCwavelet},
177: to quantify parameters expressing similarity.
178: The resulting vectors corresponding to the various files are then
179: classified or clustered using existing classification software, based on
180: various standard statistical pattern recognition classifiers~\cite{TC02},
181: Bayesian classifiers~\cite{DTWml},
182: hidden Markov models~\cite{CVfolk},
183: ensembles of nearest-neighbor classifiers~\cite{GKCwavelet}
184: or neural networks~\cite{DTWml,Sneural}.
185: For example, in music one feature would be to look for rhythm in the sense
186: of beats per minute. One can make a histogram where each histogram
187: bin corresponds to a particular tempo in beats-per-minute and
188: the associated peak shows how frequent and strong that
189: particular periodicity was over the entire piece. In \cite{TC02}
190: we see a gradual change from a few high peaks to many low and spread-out
ones going from hip-hop, rock, jazz, to classical. One can use this
192: similarity type to try to cluster pieces in these categories.
193: However, such a method requires specific and detailed knowledge of
194: the problem area, since one needs to know what features to look for.
195:
196: {\bf Non-Feature Similarities:}
197: Our aim
198: is to capture, in a single similarity metric,
199: {\em every effective distance\/}:
200: effective versions of Hamming distance, Euclidean distance,
201: edit distances, alignment distance, Lempel-Ziv distance,
202: and so on.
203: This metric should be so general that it works in every
204: domain: music, text, literature, programs, genomes, executables,
205: natural language determination,
206: equally and simultaneously.
207: It would be able to simultaneously detect {\em all\/}
208: similarities between pieces that other effective distances can detect
separately.
210:
211: Such a ``universal'' metric
212: was co-developed by us in \cite{LBCKKZ01,malivitch:simmet}, as a normalized
213: version of the ``information metric'' of \cite{liminvit:kolmbook,BGLVZ}.
214: Roughly speaking, two objects are deemed close if
215: we can significantly ``compress'' one given the information
216: in the other, the idea being that if two pieces are more similar,
217: then we can more succinctly describe one given the other.
218: The mathematics used is based on Kolmogorov complexity theory \cite{liminvit:kolmbook}.
219: In \cite{malivitch:simmet} we defined a
220: new class of (possibly non-metric) distances, taking values in $[0,1]$ and
221: appropriate for measuring effective
222: similarity relations between sequences, say one type of similarity
223: per distance, and {\em vice versa}. It was shown that an appropriately
224: ``normalized'' information distance
225: minorizes every distance
226: in the class.
227: It discovers all effective similarities in the sense that if two
228: objects are close according to some effective similarity, then
229: they are also close according to the normalized information distance.
230: Put differently, the normalized information distance represents
231: similarity according to the dominating shared feature between
232: the two objects being compared.
233: In comparisons of more than two objects,
234: different pairs may have different dominating features.
235: For every two objects,
236: this universal metric distance zooms in on the dominant
237: similarity between those two objects
238: out of a wide class of admissible similarity
239: features. In \cite{malivitch:simmet} we proved its optimality
240: and universality.
241: The normalized information distance also satisfies the metric
242: (in)equalities, and takes values in $[0,1]$;
243: hence it may be called {\em ``the'' similarity metric}.
244:
245: {\bf Normalized Compression Distance:}
246: Unfortunately, the universality of the normalized information distance
247: comes at the price of noncomputability, since it is based on the uncomputable
248: notion of Kolmogorov complexity.
249: But since the Kolmogorov
250: complexity of a string or file is the length
251: of the ultimate compressed version of that
252: file,
253: we can use real data compression programs to approximate the Kolmogorov
254: complexity.
255: Therefore, to apply this ideal precise mathematical theory in real life,
256: we have to replace the use of the noncomputable
257: Kolmogorov complexity by an approximation
258: using a standard real-world compressor.
259: Thus, if $C$ is a compressor and we use $C(x)$
260: to denote the length of the compressed version of a string $x$,
261: then we arrive at the {\em Normalized Compression Distance}:
262: \begin{equation}\label{eq.ncd}
263: \NCD(x,y) = \frac{C(xy) - \min(C(x),C(y))}{\max(C(x),C(y))},
264: \end{equation}
265: where for convenience we have replaced the pair $(x,y)$ in the formula
266: by the concatenation $xy$,
see \cite{malivitch:simmet,civit:cbc}.
In \cite{civit:cbc} we propose axioms to capture the real-world setting,
269: and show that \eqref{eq.ncd}
270: approximates optimality.
271: Actually, the
272: \NCD is a family of compression functions parameterized
273: by the given data
274: compressor $C$.
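As a hedged illustration of \eqref{eq.ncd}, suppose (purely hypothetical numbers, not measurements from any experiment) that a compressor yields $C(x)=1000$, $C(y)=1200$ and $C(xy)=1500$ bytes. Then
\[
\NCD(x,y) = \frac{1500 - \min(1000,1200)}{\max(1000,1200)} = \frac{500}{1200} \approx 0.42 .
\]
Two limiting cases indicate the intended range, assuming an idealized compressor: if $y=x$ and $C(xx) \approx C(x)$, then $\NCD(x,x) \approx 0$; if $x$ and $y$ share nothing that the compressor can exploit, so that $C(xy) \approx C(x)+C(y)$, then the numerator is approximately $\max(C(x),C(y))$ and $\NCD(x,y) \approx 1$.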
275:
276: {\bf Universality of NCD:} In \cite{civit:cbc} we prove that the
277: \NCD is universal with respect to the family of all
278: admissible normalized distances---a special class that
279: is argued to contain all parameters and features of
280: similarity that are effective.
The compression-based \NCD method
establishes a universal similarity metric \eqref{eq.ncd} among objects
given as finite binary strings
\cite{BGLVZ,LBCKKZ01,malivitch:simmet,civit:cbc,Ke04}, and has been applied to
objects like genomes, music pieces in MIDI format, computer programs
in Ruby or C, pictures in simple bitmap formats, and time sequences such as
heart rhythm data, as well as to heterogeneous data and anomaly detection.
288: This method is feature-free in the sense
289: that it doesn't analyze the files looking for particular
290: features; rather it analyzes all features simultaneously
291: and determines the similarity between every pair of objects
292: according to the most dominant shared feature. The crucial
293: point is that the method analyzes the objects themselves.
294: This precludes comparison of abstract notions or other objects
295: that don't lend themselves to direct analysis, like
296: emotions, colors, Socrates, Plato, Mike Bonanno and Albert Einstein.
297:
298:
299: \itwsection{Google-Based Similarity}
300: To make computers more intelligent one would like
to represent meaning in computer-digestible form.
302: Long-term and labor-intensive efforts like
303: the {\em Cyc} project \cite{cyc:intro} and the {\em WordNet}
304: project \cite{wordnet} try to establish semantic relations
305: between common objects, or, more precisely, {\em names} for those
306: objects. The idea is to create
307: a semantic web of such vast proportions that rudimentary intelligence
308: and knowledge about the real world spontaneously emerges.
309: This comes at the great cost of designing structures capable
310: of manipulating knowledge, and entering high
311: quality contents in these structures
312: by knowledgeable human experts. While the efforts are long-running
313: and large scale, the overall information entered is minute compared
314: to what is available on the world-wide-web.
315:
316: The rise of the world-wide-web has enticed millions of users
317: to type in trillions of characters to create billions of web pages of
318: on average low quality contents. The sheer mass of the information
319: available about almost every conceivable topic makes it likely
320: that extremes will cancel and the majority or average is meaningful
321: in a low-quality approximate sense. We devise a general
322: method to tap the amorphous low-grade knowledge available for free
323: on the world-wide-web, typed in by local users aiming at personal
324: gratification of diverse objectives, and yet globally achieving
325: what is effectively the largest semantic electronic database in the world.
326: Moreover, this database is available for all by using any search engine
327: that can return aggregate page-count estimates like Google for a large
328: range of search-queries.
329:
330: While the previous \NCD method that compares the objects themselves using
331: \eqref{eq.ncd} is
332: particularly suited to obtain knowledge about the similarity of
333: objects themselves, irrespective of common beliefs about such
334: similarities, we now develop a method that uses only the name
335: of an object and obtains knowledge about the similarity of objects
336: by tapping available information generated by multitudes of
337: web users.
Here we are reminded of the words of D.H. Rumsfeld \cite{Ru01}:
339: ``A trained ape can know an awful lot/
340: Of what is going on in this world,/
341: Just by punching on his mouse/
342: For a relatively modest cost!''
343: The new method is useful to extract knowledge from a given corpus of
344: knowledge, in this case the Google database, but not to
345: obtain true facts that are not common knowledge in that database.
346: For example, common viewpoints on the creation myths in different
347: religions
348: may be extracted by the Googling method, but contentious questions
349: of fact concerning the phylogeny of species can be better approached
350: by using the genomes of these species, rather than by opinion.
351:
352:
353: {\bf Googling for Knowledge:}
Let us start with a simple intuitive justification (not to be mistaken
for a substitute for the underlying mathematics)
356: of the approach we propose in \cite{CV04}.
357: The Google search engine indexes
358: around ten billion pages on the web today. Each such page can be
359: viewed as a set of index terms. A search for a particular index term,
360: say ``horse'', returns a certain number of hits (web pages where
361: this term occurred), say 46,700,000. The number of hits for the
362: search term ``rider'' is, say, 12,200,000. It is also possible to search
363: for the pages where both ``horse'' and ``rider'' occur. This gives,
364: say, 2,630,000 hits.
365: This can be easily put in the standard probabilistic framework.
366: If $w$ is a web page and $x$ a search term, then we write $x \in w$
367: to mean that Google returns web page $w$ when presented with search
368: term $x$.
369: An {\em event} is a set of web pages
370: returned by Google after
it has been presented with a search term.
372: We can view the event as the collection of all contexts of
373: the search term, background knowledge, as induced by the
374: accessible web pages for the Google search engine.
375: If the search term is $x$, then we denote the event by ${\bf x}$,
376: and define ${\bf x} = \{w: x \in w \}$.
377: The {\em probability} $p(x)$ of an event ${\bf x }$ is
378: the number of web pages
379: in the event divided by the overall number $M$ of web pages possibly
380: returned by Google. Thus, $p( x)= |{\bf x}|/M$.
381: At the time of writing, Google searches 8,058,044,651 web pages.
382: Define the joint event ${\bf x} \bigcap {\bf y} = \{ w : x,y \in w\}$
383: as the set of web pages returned by Google,
384: containing both the search term $x$ and
385: the search term $y$. The joint probability
386: $p(x, y) = |\{ w : x,y \in w\}|/M $ is the number of
387: web pages in the joint event divided by the
388: overall number $M$ of web pages possibly
389: returned by Google.
390: This notation also allows us to define the probability $p(x|y)$
391: of {\em conditional} events ${\bf x}|{\bf y}
392: = ({\bf x} \bigcap {\bf y})/{\bf y}$ defined by
393: $p(x| y) = p( x,y)/p(y)$.
394:
395:
396: In the above example we have therefore $p(horse) \approx 0.0058$,
397: $p(rider)$ $ \approx 0.0015$, $p(horse,rider) \approx 0.0003$.
398: We conclude that the probability $p(horse|rider)$
399: of ``horse'' accompanying ``rider''
400: is $\approx 1/5$ and the probability $p(rider|horse)$ of ``rider'' accompanying
401: ``horse'' is $\approx 1/19$. The probabilities are asymmetric, and it is the
402: least probability that is the significant one. A very general search term
403: like ``the'' occurs in virtually all (English language) web pages.
404: Hence $p(the|rider) \approx 1$, and for almost all search
405: terms $x$ we have $p(the|x) \approx 1$. But $p(rider|the) \ll 1$,
406: say about equal to $p(rider)$, and gives the relevant information
407: about the association of the two terms.
408:
409: Our first attempt therefore could be the distance
410: \[ D_1 (x,y) = \min \{ p(x|y),p(y|x) \}.
411: \]
412: Experimenting with this distance gives bad results. One reason
is that the differences among small probabilities have increasing
414: significance the smaller the probabilities involved are. Another
415: reason is that we deal with absolute probabilities: two notions
416: that have very small probabilities each and have $D_1$-distance
417: $\epsilon$ are much less similar than two notions that have
418: much larger probabilities and have the same $D_1$-distance.
419: To resolve the first problem we take the negative logarithm
420: of the items being minimized, resulting in
421: \[
422: D_2 (x,y) = \max \{ \log 1/p(x|y), \log 1/p(y|x) \}.
423: \]
424: To resolve the second problem we normalize $D_2(x,y)$ by dividing
425: by the maximum of $\log 1/p(x), \log 1/p(y)$.
426: Altogether, we obtain
427: the following normalized distance
428: \[
429: D_3 (x,y) = \frac{ \max \{ \log 1/p(x|y), \log 1/p(y|x) \}}
430: { \max \{ \log 1/p(x) , \log 1/p(y) \}},
431: \]
432: for $p(x|y) > 0$ (and hence $p(y|x)>0$),
433: and $D_3 (x,y) = \infty $ for $p(x|y)=0$ (and hence $p(y|x)=0$). Note that
$p(x|y) = p(x,y)/p(y)=0$ means that the search terms
``$x$'' and ``$y$'' never occur together.
The two conditional probabilities are either both 0 or
437: they are both strictly positive. Moreover, if either of $p(x), p(y)$
438: is 0, then so are the conditional probabilities, but not necessarily
439: vice versa.
440:
441: We note that in the conditional probabilities the total number $M$,
442: of web pages indexed by Google, is divided out. Therefore, the
443: conditional probabilities are independent of $M$, and can be
444: replaced by the number of pages, the {\em frequency}, returned by Google.
445: Define the {\em frequency} $f(x)$ of search term $x$ as the
446: number of pages a Google search for $x$ returns:
447: $f(x)= Mp(x)$, $f(x,y)=Mp(x,y)$, and $p(x|y) = f(x,y)/f(y)$.
448: Rewriting $D_3$ results in
449: our final notion, the {\em normalized
450: Google distance (\NGD)}, defined by
451: \begin{equation}\label{eq.ngd}
\NGD(x,y) = \frac{ \max \{\log f(x), \log f(y)\} - \log f(x,y) }{
453: \log M - \min\{\log f(x), \log f(y) \}},
454: \end{equation}
455: and if $f(x),f(y)>0$ and $f(x,y)=0$ then $\NGD(x,y)= \infty$.
456: From \eqref{eq.ngd} we see that
457: \begin{enumerate}
458: \item
459: $\NGD(x,y)$ is undefined for $f(x)=f(y)=0$;
460: \item
461: $\NGD(x,y) = \infty$ for $f(x,y)=0$ and either or both $f(x)>0$
462: and $f(y)>0$; and
463: \item
464: $ \NGD(x,y) \geq 0$ otherwise.
465: \end{enumerate}
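The rewriting of $D_3$ into \eqref{eq.ngd} can be spelled out as follows; this is a sketch that uses only the definitions above and assumes $f(x), f(y), f(x,y) > 0$. Since $p(x|y)=f(x,y)/f(y)$ and $p(y|x)=f(x,y)/f(x)$,
\begin{align*}
\max \{ \log 1/p(x|y), \log 1/p(y|x) \}
 &= \max \{ \log f(y) - \log f(x,y), \; \log f(x) - \log f(x,y) \} \\
 &= \max \{ \log f(x), \log f(y) \} - \log f(x,y) ,
\end{align*}
and, since $p(x) = f(x)/M$ and $p(y) = f(y)/M$,
\begin{align*}
\max \{ \log 1/p(x), \log 1/p(y) \}
 &= \max \{ \log M - \log f(x), \; \log M - \log f(y) \} \\
 &= \log M - \min \{ \log f(x), \log f(y) \} .
\end{align*}
Dividing the first expression by the second gives \eqref{eq.ngd}.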
466:
467: With the Google hit numbers above, we can now compute
468: \[
469: \NGD(horse,rider)
470: \approx 0.443.
471: \]
472: We did the same calculation when Google indexed only one-half
473: of the current number of pages: 4,285,199,774. It is instructive that the
474: probabilities of the used search terms didn't change significantly over
475: this doubling of pages, with number of hits for ``horse''
equal to 23,700,000, for ``rider'' equal to 6,270,000, and
477: for ``horse, rider'' equal to 1,180,000.
478: The $\NGD(horse,rider)$ we computed
479: in that situation was 0.460. This is in line with our contention
480: that the relative frequencies of web pages containing
search terms give objective information about the semantic
482: relations between the search terms. If this is the case, then with
483: the vastness of the information accessed by Google, the
484: Google probabilities of search terms, and the computed \NGD's
485: should stabilize (be scale invariant) with a growing Google database.
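The two values can be checked directly from \eqref{eq.ngd} and the hit counts quoted above. The sketch below uses natural logarithms, an arbitrary choice since the base cancels in the ratio, and takes $M$ to be the page count stated for each snapshot. For the larger index,
\[
\NGD(horse,rider) = \frac{\ln 46{,}700{,}000 - \ln 2{,}630{,}000}{\ln 8{,}058{,}044{,}651 - \ln 12{,}200{,}000}
\approx \frac{2.877}{6.493} \approx 0.443 ,
\]
and for the smaller index,
\[
\NGD(horse,rider) = \frac{\ln 23{,}700{,}000 - \ln 1{,}180{,}000}{\ln 4{,}285{,}199{,}774 - \ln 6{,}270{,}000}
\approx \frac{3.000}{6.527} \approx 0.460 .
\]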
486:
487: The \NGD formula itself \eqref{eq.ngd} is {\em scale-invariant}. It is very important that, if
488: the number $M$ of pages indexed by Google grows sufficiently large,
489: the number of pages containing given search terms
490: goes to a fixed fraction of $M$, and so does the number of pages
491: containing conjunctions of search terms. This means that if $M$ doubles,
492: then so do the $f$-frequencies. For the \NGD to give us an objective
493: semantic relation between search terms,
494: it needs to become stable when the number $M$ of indexed pages grows.
Some evidence that this actually happens
is given by the ``horse''/``rider'' computation above, where the \NGD changed only from 0.460 to 0.443 while $M$ roughly doubled.
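The scale invariance of \eqref{eq.ngd} can be verified in one line. As a sketch, assume that the index grows by a factor $c>0$ (a symbol introduced here only for illustration) and that every frequency grows proportionally, so that $M$, $f(x)$, $f(y)$ and $f(x,y)$ become $cM$, $cf(x)$, $cf(y)$ and $cf(x,y)$. Each logarithm in \eqref{eq.ngd} then increases by $\log c$, and
\[
\frac{ (\max \{\log f(x), \log f(y)\} + \log c) - (\log f(x,y) + \log c) }{ (\log M + \log c) - (\min\{\log f(x), \log f(y) \} + \log c) }
= \NGD(x,y) ,
\]
because the terms $\log c$ cancel in both the numerator and the denominator.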
497:
498:
499:
500: \itwsection{From NCD to NGD}
501: {\bf The Google Distribution:}
502: \label{sect.google}
503: Let the set of singleton {\em Google search terms}
504: be denoted by ${\cal S}$. In the sequel we use both singleton
505: search terms and doubleton search terms $\{\{x,y\}: x,y \in {\cal S} \}$.
Let the set of web pages indexed (capable of being returned)
507: by Google be $\Omega$. The cardinality of $\Omega$ is denoted
508: by $M=|\Omega|$, and currently $8\cdot 10^9 \leq M \leq 9 \cdot 10^9$.
509: Assume that a priori all web pages are equi-probable, with the probability
510: of being returned by Google being $1/M$. A subset of $\Omega$
511: is called an {\em event}. Every {\em search term} $x$ usable by Google
512: defines a {\em singleton Google event} ${\bf x} \subseteq \Omega$ of web pages
513: that contain an occurrence of $x$ and are returned by Google
514: if we do a search for $x$.
Let $L: \Omega \rightarrow [0,1]$ be the uniform probability mass
function.
517: The probability of
518: such an event ${\bf x}$ is $L({\bf x})=|{\bf x}|/M$.
519: Similarly, the {\em doubleton Google event} ${\bf x} \bigcap {\bf y}
520: \subseteq \Omega$ is the set of web pages returned by Google
521: if we do a search for pages containing both search term $x$ and
522: search term $y$.
523: The probability of this event is $L({\bf x} \bigcap {\bf y})
524: = |{\bf x} \bigcap {\bf y}|/M$.
525: We can also define the other Boolean combinations: $\neg {\bf x}=
526: \Omega \backslash {\bf x}$ and ${\bf x} \bigcup {\bf y} =
527: \Omega \backslash ( \neg {\bf x} \bigcap \neg {\bf y})$, each such event
528: having a probability equal to its cardinality divided by $M$.
529: If ${\bf e}$ is an event obtained from the basic events ${\bf x}, {\bf y},
530: \ldots$, corresponding to basic search terms $x,y, \ldots$,
531: by finitely many applications of the Boolean operations,
532: then the probability $L({\bf e}) = |{\bf e}|/M$.
533:
534: %A {\em pseudo-probability} is a
535: %function $p: {\cal S}
536: %\rightarrow [0,1]$ such that $ 1 < \sum_{s \in {\cal S}} p(s) < \infty$.
537: Google events capture in a particular sense
538: all background knowledge about the search terms concerned available
539: (to Google) on the web. Therefore, it is natural
540: to consider code words for those events
541: as coding this background knowledge. However,
542: we cannot use the probability of the events directly to determine
543: a prefix code such as the Shannon-Fano code \cite{liminvit:kolmbook}.
544: The reason is that
545: the events overlap and hence the summed probability exceeds 1.
546: By the Kraft inequality \cite{liminvit:kolmbook} this prevents a
547: corresponding Shannon-Fano code.
548: The solution is to normalize:
549: We use the probability of the Google events to define a probability
550: mass function over the set $\{\{x,y\}: x,y \in {\cal S}\}$
551: of Google search terms, both singleton and doubleton.
552: Define
553: \[
554: N= \sum_{\{x,y\} \subseteq {\cal S}} |{\bf x} \bigcap
555: {\bf y}|,
556: \]
557: counting each singleton set and each doubleton set (by definition
558: unordered) once in the summation.
559: Since every web page that is indexed by Google contains at least
560: one occurrence of a search term, we have $N \geq M$. On the other hand,
561: web pages contain on average not more than a certain constant $\alpha$
562: search terms. Therefore, $N \leq \alpha M$.
563: Define
564: \begin{align}\label{eq.gpmf}
565: &g(x) = L({\bf x}) M/N =|{\bf x}|/N
566: \\&
567: \nonumber
568: g(x,y) = L({\bf x} \bigcap {\bf y}) M/N =|{\bf x} \bigcap {\bf y}|/N.
569: \end{align}
570: Then, $\sum_{x \in {\cal S}} g(x)+ \sum_{x,y \in {\cal S}} g(x,y) = 1$.
571: Note that $g(x,y)$ is not a conventional joint distribution
572: since possibly $g(x) \neq \sum_{y \in {\cal S}} g(x,y)$.
573: Rather, we consider $g$ to be a probability mass
574: function over the sample space $\{ \{x,y\}: x,y \in {\cal S} \}$.
575: This $g$-distribution changes over time,
576: and between different samplings
577: from the distribution. But let us imagine that $g$ holds
578: in the sense of an instantaneous snapshot. The real situation
579: will be an approximation of this.
580: Given the Google machinery, these are absolute probabilities
581: which allow us to define the associated Shannon-Fano code for
582: both the singletons and the doubletons.
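A toy example, constructed here for illustration and not taken from Google data, may make the normalization concrete. Let ${\cal S} = \{a,b\}$ and let $\Omega$ consist of $M=3$ pages: one containing only $a$, one containing only $b$, and one containing both. Then $|{\bf a}| = |{\bf b}| = 2$ and $|{\bf a} \bigcap {\bf b}| = 1$, so that $L({\bf a}) + L({\bf b}) = 4/3 > 1$ and the event probabilities cannot serve directly as a probability mass function. With $N = 2 + 2 + 1 = 5$ we obtain from \eqref{eq.gpmf}
\[
g(a) = g(b) = 2/5, \qquad g(a,b) = 1/5, \qquad g(a)+g(b)+g(a,b) = 1 ,
\]
and indeed $N \geq M$ in this example.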
583:
{\bf Normalized Google Distance:}
585: The {\em Google code} length $G$
586: is defined by
587: \begin{align}\label{eq.gcc}
588: &G(x)= \log 1/g(x)
589: \\&
590: \nonumber
591: G(x,y)= \log 1/g(x,y) .
592: \end{align}
593: In contrast to strings $x$ where the complexity $C(x)$ represents
594: the length of the compressed version of $x$ using compressor $C$, for a search
595: term $x$ (just the name for an object rather than the object itself),
596: the Google code of length $G(x)$ represents the shortest expected
597: prefix-code word length of the associated Google event ${\bf x}$.
598: The expectation
is taken over the Google distribution $g$.
600: In this sense we can use the Google distribution as a compressor
601: for Google ``meaning'' associated with the search terms.
602: The associated \NCD, now called the
603: {\em normalized Google distance (\NGD)} is then defined
604: by \eqref{eq.ngd} with $N$ substituted for $M$, rewritten as
605: \begin{equation}\label{eq.NGD}
606: \NGD(x,y)=\frac{G(x,y) - \min(G(x),G(y))}{\max(G(x),G(y))}.
607: \end{equation}
608: This $\NGD$ is an approximation to the $\NID$
609: using the Shannon-Fano code (Google code)
610: generated by the Google distribution as defining a compressor
611: approximating the length of the Kolmogorov code, using
612: the background knowledge on the web as viewed by Google
613: as conditional information. In experimental practice,
614: we consider $N$ (or $M$) as a normalization constant
615: that can be adjusted.
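That \eqref{eq.NGD} coincides with \eqref{eq.ngd} when $N$ is substituted for $M$ follows from the definitions; the following sketch assumes $f(x), f(y), f(x,y) > 0$ and writes $f(x) = |{\bf x}|$ and $f(x,y) = |{\bf x} \bigcap {\bf y}|$. By \eqref{eq.gpmf} and \eqref{eq.gcc}, $G(x) = \log N - \log f(x)$ and $G(x,y) = \log N - \log f(x,y)$, so
\begin{align*}
G(x,y) - \min(G(x),G(y)) &= \max \{\log f(x), \log f(y)\} - \log f(x,y) , \\
\max(G(x),G(y)) &= \log N - \min \{\log f(x), \log f(y)\} .
\end{align*}
For instance, with the hypothetical counts $f(x)=f(y)=2$, $f(x,y)=1$ and $N=5$, both forms give $\log 2 / \log (5/2) \approx 0.76$.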
616:
617: {\bf Universality of NGD:} In the full paper \cite{CV04} we
618: show that \eqref{eq.ngd} and \eqref{eq.NGD} are
619: close in typical situations.
620: Our experimental results suggest that every reasonable
621: (greater than any $f(x)$) value can be used for the normalizing factor $N$,
622: and our
623: results seem in general insensitive to this choice. In our software, this
624: parameter $N$ can be adjusted as appropriate, and we often use $M$ for $N$.
625: In the full paper we analyze the mathematical properties of \NGD,
626: and prove the universality of the Google distribution among web author based
627: distributions, as well as the universality of the \NGD with respect to
628: the family of the individual web author's \NGD's, that is, their
629: individual semantics relations, (with high probability)---not included here
630: for space reasons.
631:
632:
633:
634:
635: \itwsection{Applications}
636: \label{sect.exp}
637: {\bf Applications of NCD:}
638: We developed the CompLearn Toolkit, \cite{Ci03}, and performed
639: experiments in vastly different
640: application fields to test the quality and universality of the method.
641: The success of the method as reported below depends strongly on the
642: judicious use of encoding of the objects compared. Here one should
643: use common sense on what a real world compressor can do. There are
644: situations where our approach fails if applied in a
645: straightforward way.
646: For example: comparing text files by the same authors
647: in different encodings (say, Unicode and 8-bit version) is bound to fail.
648: For the ideal similarity metric based on
649: Kolmogorov complexity as defined in \cite{malivitch:simmet}
650: this does not matter at all, but for
651: practical compressors used in the experiments it will be fatal.
Similarly, in the music experiments below we use the symbolic MIDI
music file format rather than wave format music files. The reason is that
the strings resulting from straightforwardly
discretizing the waveform files may be too sensitive to how we discretize.
Further research may overcome this problem.
657:
658: The \NCD is
659: not restricted to a specific application area, and
660: works across application area boundaries.
661: To extract a hierarchy of clusters
662: from the distance matrix,
663: we determine a dendrogram (binary tree)
664: by a new quartet
665: method and a fast heuristic to implement it.
666: The method is implemented and available as public software \cite{Ci03}, and is
667: robust under choice of different compressors.
668: This approach gives
669: the first completely automatic construction
670: of the phylogeny tree based on whole mitochondrial genomes,
671: \cite{LBCKKZ01,malivitch:simmet},
672: a completely automatic construction of a language tree for over 50
673: Euro-Asian languages \cite{malivitch:simmet},
674: detects plagiarism in student programming assignments
675: \cite{SID}, gives phylogeny of chain letters \cite{BLM03}, and clusters
676: music \cite{cidervit:mus}.
677: Moreover, the method turns out to be robust under change of the underlying
678: compressor-types: statistical (PPMZ), Lempel-Ziv based dictionary (gzip),
679: block based (bzip2), or special purpose (Gencompress).
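
{\bf Worked example (illustrative):} As a sketch of how a single entry of
the distance matrix arises, assume the \NCD has the same form as
\eqref{eq.NGD}, with the compressed length $C(\cdot)$ produced by a
real-world compressor in place of $G(\cdot)$, and with $xy$ denoting the
concatenation of $x$ and $y$:
\begin{equation*}
\NCD(x,y)=\frac{C(xy)-\min(C(x),C(y))}{\max(C(x),C(y))}.
\end{equation*}
The sizes below are hypothetical, not measurements.
If $C(x)=1000$, $C(y)=1200$, and $C(xy)=1500$ bytes, then
\begin{equation*}
\NCD(x,y)=\frac{1500-1000}{1200}\approx 0.42 .
\end{equation*}
Two limiting cases indicate the range. If $y=x$ and the compressor
recognizes the repetition, then $C(xy)\approx C(x)$ and the distance is
close to $0$. If the compressor finds no shared regularity, then
$C(xy)\approx C(x)+C(y)$ and the distance is close to $1$.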
680:
681: To substantiate our claims of universality and robustness, in \cite{civit:cbc}
682: we report evidence of successful application in areas as diverse as
683: genomics, virology, languages, literature, music, handwritten digits,
684: astronomy, and
685: combinations of objects from completely different
686: domains, using statistical, dictionary, and block sorting compressors.
687: In genomics we presented new evidence for major questions
688: in Mammalian evolution, based on whole-mitochondrial genomic
689: analysis: the Eutherian orders and the Marsupionta hypothesis
690: against the Theria hypothesis.
691: Apart from the experiments reported in \cite{civit:cbc}, the clustering by
692: compression method reported
693: in this paper has recently been used in many different areas all over
the world. One project in our group
analyzed network traffic and clustered computer worms and viruses \cite{We04}.
Finally, recent work \cite{Ke04} reports experiments with our method
on all time sequence data used in all the major data-mining
conferences in the last decade. Comparing the compression method
with all major methods used in those conferences, the authors established
clear superiority of the compression method for clustering heterogeneous
data, and for anomaly detection.
702:
703:
704: {\bf Applications of NGD:}
705: This new method is proposed in \cite{CV04} to extract semantic
706: knowledge from the world-wide-web for both
707: supervised and unsupervised learning using the Google search engine
708: in an unconventional manner. The approach is
709: novel in its unrestricted problem domain, simplicity of implementation,
710: and manifestly ontological underpinnings. We give evidence of
711: elementary learning of the semantics of concepts, in
712: contrast to most prior approaches (outside of Knowledge Representation
713: research) that have neither the appearance nor the aim of dealing with ideas,
714: instead using abstract symbols that remain permanently ungrounded throughout
715: the machine learning application.
716: The world-wide-web is the largest database on earth,
717: and it induces a
718: probability mass function, the Google
719: distribution, via page counts for combinations of search queries.
720: This distribution allows us to tap the latent semantic knowledge
721: on the web.
While in the \NCD compression-based method
one deals with the objects themselves,
in the current work we deal with just names for the objects.
725: In \cite{CV04}, as proof of principle, we demonstrate
726: positive correlations, evidencing an
727: underlying semantic structure, in both numerical symbol notations
728: and number-name words in a variety of natural languages
729: and contexts.
730: Next, we give applications in
731: (i) unsupervised hierarchical clustering, demonstrating the ability
732: to distinguish between colors and numbers, and
733: to distinguish between 17th century
734: Dutch painters;
735: (ii)
736: supervised
737: concept-learning by example, using Support Vector Machines,
738: demonstrating the ability to understand
739: electrical terms, religious terms,
740: emergency incidents, and by conducting
741: a massive experiment in understanding
742: WordNet categories \cite{Ci04};
743: and (iii) matching of meaning, in an example of
744: automatic English-Spanish translation.
745:
746:
747:
748: \begin{itwreferences}
749:
750:
751: \bibitem{BGLVZ}
752: C.H. Bennett, P. G\'acs, M. Li, P.M.B. Vit\'anyi, W. Zurek,
753: Information Distance, {\em IEEE Trans. Information Theory},
754: 44:4(1998), 1407--1423.
755:
756:
757: \bibitem{BLM03}
758: C.H. Bennett, M. Li, B. Ma, Chain letters and evolutionary histories,
759: {\em Scientific American}, June 2003, 76--81.
760:
761:
762:
763: \bibitem{burges:svmtut}
764: C.J.C. Burges.
765: A tutorial on support vector machines for pattern recognition,
{\em Data Mining and Knowledge Discovery}, 2:2(1998), 121--167.
767:
768:
769: \bibitem{SID}
770: X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker,
771: Shared information and program plagiarism detection,
772: {\em IEEE Trans. Inform. Th.}, 50:7(2004), 1545--1551.
773:
774:
775:
776: \bibitem{Ci03}
777: R. Cilibrasi, The CompLearn Toolkit, CWI, 2003,
778: http://complearn.sourceforge.net/
779:
780:
781: \bibitem{CVfolk}
782: W.~Chai and B.~Vercoe.
783: Folk music classification using hidden Markov models.
784: {\em Proc.~of International Conference on Artificial Intelligence}, 2001.
785:
786: \bibitem{Ci04}
787: R. Cilibrasi, P. Vitanyi,
788: Automatic Meaning Discovery Using Google: 100 Experiments in Learning
789: WordNet Categories, 2004,
790: {\tt http://www.cwi.nl/$\sim$cilibrar/googlepaper/appendix.pdf}
791:
792: \bibitem{cidervit:mus}
793: R.~Cilibrasi, R.~de~Wolf, P.~Vitanyi.
794: Algorithmic clustering of music based on string compression,
{\em Computer Music J.}, 28:4(2004), 49--67.
796:
797: \bibitem{civit:cbc}
798: R. Cilibrasi, P.M.B. Vitanyi, Clustering by compression,
{\em IEEE Trans. Information Theory}, 51:4(2005), 1523--1545. Also:
800: (preliminary version) http://www.archiv.org/abs/cs.CV/0312044
801:
802:
803: \bibitem{CV04}
804: R.~Cilibrasi, P.~Vitanyi,
805: Automatic meaning discovery using Google,
806: Manuscript, CWI, 2004;
807: http://arxiv.org/abs/cs.CL/0412098
808:
809:
810: %\bibitem{CPSV00}
811: %G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin.
812: %Communication complexity of document exchange.
813: %In {\em Proc. 11th ACM--SIAM Symp. on Discrete Algorithms}, 2000,
814: %197--206.
815:
816:
817:
818: \bibitem{DTWml}
819: R.~Dannenberg, B.~Thom, and D.~Watson.
820: A machine learning approach to musical style recognition,
{\em Proc.~International Computer Music Conference}, pp. 344--347, 1997.
822:
823:
824:
825: \bibitem{google}
826: The basics of Google search,
827: http://www.google.com/help/basics.html.
828:
829:
830: \bibitem{GKCwavelet}
831: M.~Grimaldi, A.~Kokaram, and P.~Cunningham.
832: Classifying music by genre using the wavelet packet transform
833: and a round-robin ensemble.
834: Technical report TCD-CS-2002-64, Trinity College Dublin, 2002.
835: http://www.cs.tcd.ie/publications/tech-reports/reports.02/TCD-CS-2002-64.pdf
836:
837:
838: %\bibitem{Kr49}
839: %L.G. Kraft,
840: %A device for quantizing, grouping and coding amplitude modulated
841: %pulses.
842: %Master's thesis, Dept. of Electrical Engineering, M.I.T., Cambridge,
843: %Mass., 1949.
844:
845: \bibitem{Ke04}
E. Keogh, S. Lonardi, and C.A. Ratanamahatana, Toward parameter-free
data mining, In: {\em Proc. 10th ACM SIGKDD Int'l Conf. Knowledge
Discovery and Data Mining}, Seattle, Washington, USA, August 22--25, 2004,
849: 206--215.
850:
851:
852: \bibitem{Ko65}
853: A.N. Kolmogorov.
854: Three approaches to the quantitative definition of information,
855: {\em Problems Inform. Transmission}, 1:1(1965), 1--7.
856:
857: \bibitem{Ko83}
858: A.N. Kolmogorov.
859: Combinatorial foundations of information theory and the calculus of
860: probabilities,
861: {\em Russian Math. Surveys}, 38:4(1983), 29--40.
862:
863: \bibitem{cyc:intro}
864: D.~B. Lenat.
865: Cyc: A large-scale investment in knowledge infrastructure,
{\em Comm. ACM}, 38:11(1995), 33--38.
867:
868: \bibitem{LBCKKZ01}
869: M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang,
870: An information-based sequence distance and its application
871: to whole mitochondrial genome phylogeny,
872: {\em Bioinformatics}, 17:2(2001), 149--154.
873:
874:
875: \bibitem{malivitch:simmet}
876: M.~Li, X.~Chen, X.~Li, B.~Ma, P.~Vitanyi.
877: The similarity metric,
{\em IEEE Trans. Information Theory}, 50:12(2004), 3250--3264.
879:
880: \bibitem{liminvit:kolmbook}
881: M. Li, P. M.~B. Vitanyi.
882: {\em An Introduction to Kolmogorov Complexity and Its Applications},
883: 2nd Ed.,
884: Springer-Verlag, New York, 1997.
885:
886: \bibitem{cyc:onto}
887: S.~L. Reed, D.~B. Lenat.
888: Mapping ontologies into cyc.
889: {\em Proc. AAAI Conference 2002 Workshop on Ontologies for the Semantic Web},
890: Edmonton, Canada. http://citeseer.nj.nec.com/509238.html
891:
892: \bibitem{Ru01}
893: D.H. Rumsfeld, The digital revolution,
894: originally published June 9, 2001, following a European trip.
895: In: H. Seely, The Poetry of D.H. Rumsfeld, 2003,
896: http://slate.msn.com/id/2081042/
897:
898:
899: \bibitem{Sneural}
900: P.~Scott.
901: Music classification using neural networks, 2001.\\
902: http://www.stanford.edu/class/ee373a/musicclassification.pdf
903:
904:
905:
906:
907: \bibitem{wordnet}
{G.A. Miller et al., WordNet,
909: A Lexical Database for the English Language,
910: Cognitive Science Lab, Princeton University.
911: \\http://www.cogsci.princeton.edu/$\sim$wn
912: }
913:
914:
915: \bibitem{TC02}
916: G.~Tzanetakis and P.~Cook, Music genre classification of audio signals,
917: {\em IEEE Transactions on Speech and Audio Processing},
918: 10(5):293--302, 2002.
919:
920: \bibitem{We04}
921: S. Wehner, Analyzing network traffic and worms using compression,
922: http://arxiv.org/abs/cs.CR/0504045
923:
924:
925: \end{itwreferences}
926:
927: \end{itwpaper}
928: \end{document}
929:
930:
931: