% cond-mat0205521/cmc.tex
% Group addresses by affiliation; use superscriptaddress for long
% author lists, or if there are many overlapping affiliations.
% For Phys. Rev. appearance, change preprint to twocolumn.
% Choose pra, prb, prc, prd, pre, prl, prstab, or rmp for journal
%  Add 'draft' option to mark overfull boxes with black boxes
%  Add 'showpacs' option to make PACS codes appear
%  Add 'showkeys' option to make keywords appear
\documentclass[aps,prl,twocolumn,showpacs,superscriptaddress]{revtex4}
%\documentclass[aps,prl,preprint,superscriptaddress]{revtex4}
%\documentclass[aps,prl,twocolumn,groupedaddress]{revtex4}

% You should use BibTeX and apsrev.bst for references
% Choosing a journal automatically selects the correct APS
% BibTeX style file (bst file), so only uncomment the line
% below if necessary.
%\bibliographystyle{apsrev}

\begin{document}

% Use the \preprint command to place your local institutional report
% number in the upper righthand corner of the title page in preprint mode.
% Multiple \preprint commands are allowed.
% Use the 'preprintnumbers' class option to override journal defaults
% to display numbers if necessary
%\preprint{}

%Title of paper
\title{On an Application of Relative Entropy}

% repeat the \author .. \affiliation  etc. as needed
% \email, \thanks, \homepage, \altaffiliation all apply to the current
% author. Explanatory text should go in the []'s, actual e-mail
% address or url should go in the {}'s for \email and \homepage.
% Please use the appropriate macro for each type of information

% \affiliation command applies to all authors since the last
% \affiliation command. The \affiliation command should follow the
% other information
% \affiliation can be followed by \email, \homepage, \thanks as well.
\author{Dmitry V. Khmelev}
\email{D.Khmelev@newton.cam.ac.uk}
%\homepage[]{Your web page}
%\thanks{}
\affiliation{Isaac Newton Institute for Mathematical Sciences, 20
  Clarkson Rd, Cambridge, CB3 0EH, U.K.}
\affiliation{Heriot-Watt
  University, Edinburgh, U.K. and Moscow State University, Russia}

\author{William J. Teahan}
\email{wjt@informatics.bangor.ac.uk}
%\homepage[]{Your web page}
%\thanks{}
\affiliation{University of Wales, Bangor, Dean Street, Bangor, LL57 1UT, U.K.}

\date{\today}

\begin{abstract}
  We show that in problems of authorship attribution and other
  linguistic applications, a Markov chain approach is a more attractive
  technique than Lempel-Ziv-based compression.
\end{abstract}

% insert suggested PACS numbers in braces on next line
\pacs{89.70.+c, 01.20.+x, 05.20.-y, 05.45.Tp}
% insert suggested keywords - APS authors don't need to do this
%\keywords{}

%\maketitle must follow title, authors, abstract, \pacs, and \keywords
\maketitle

We wish to point out a number of inaccurate and misleading statements
that Benedetto {\em et al.} make in their paper ``Language Trees and
Zipping''~\cite{Bene:2002}. First, they claim that the technique they
used to construct a language tree does not make use of any a priori
information about the alphabet, but it does, both through the alphabet
chosen (Unicode) and through the set of languages they chose to
experiment with. Second, they propound Lempel-Ziv (LZ, {\em gzip})
compression as being applicable to DNA analysis, where the usefulness
of LZ is quite doubtful. Third, in practice their definition of
relative entropy and distance can yield negative values. Fourth, the
classification performance of the method they use is significantly
worse than that of other entropy-based methods, as has been noted in
prior work. Fifth, the classification speed is significantly worse as
well, which calls the claimed ``potentiality'' of the method into
question. We elaborate on each of these points in the subsequent
paragraphs.

Notice that the ``Language Tree'' (LT) diagram~\cite{Bene:2002} does not
include the Russian language (Slavic branch of the Indo-European family
of languages; 288 million speakers). Our computations show that once
Russian is included, it does not cluster with the other members of
the Slavic group. Evidently, certain languages based on the Cyrillic
alphabet were left out of the study~\cite{Bene:2002}. This omission
``improves'' the results significantly and shows that a priori
information about the alphabet is being exploited to achieve the
results reported in~\cite{Bene:2002}.

The LZ compressor makes few assumptions about the input string, but in
practice we do have a priori information that we can take advantage
of. Biologists widely use an amino acid {\em substitution matrix} (PAM250
or BLOSUM62) when searching for {\em similar} biological
sequences~\cite{gusfield}. It is not at all clear how a substitution
matrix could be incorporated into the LZ algorithm. That is why
compression is not widely used for DNA analysis, although the first
attempts to apply it date back to 1990~\cite{gusfield}.

The quantity $S_{\mathcal A\mathcal B}$~\cite{Bene:2002}, defined as
``relative entropy'' in (1) and redefined as ``distance'' in (2), can
take negative values. Negative values did indeed appear in our study,
which showed that the ``LT''~\cite{Bene:2002} largely reflects the
structure of Unicode, or vice versa, and that its relevance to
language classification requires additional support.
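
To make the sign issue concrete, here is a sketch under our reading
of~\cite{Bene:2002}; the notation is our paraphrase and should be
checked against Eq.~(1) there. Let $L_X$ be the compressed length of a
file $X$, let $b$ be a short fragment of the text $\mathcal B$, and let
$\Delta_{Xb}=L_{X+b}-L_X$ be the extra compressed length incurred by
appending $b$ to $X$. The estimate then has the form
\[
  S_{\mathcal A\mathcal B} \approx \frac{\Delta_{Ab}-\Delta_{Bb}}{|b|},
\]
where $|b|$ is the length of $b$. A true relative entropy is
non-negative, but nothing forces a practical compressor to satisfy
$\Delta_{Ab}\ge\Delta_{Bb}$. Whenever the compressor happens to encode
$b$ more cheaply after $A$ than after $B$, for instance because of its
finite window, the numerator is negative, and so is the estimate.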

A traditional definition of (relative) entropy via an $n$th-order
Markov chain {\em on letters}, and the corresponding
estimates~\cite{Shannon,Yaglom,Kukush:2001}, always yield a proper
non-negative number. Markov chains are also traditional in text
entropy analysis~\cite{Shannon,Yaglom}, compression~\cite{PPM}, and
authorship and subject attribution~\cite{Teahan:2000,Khmelev:2000}.
In~\cite{Kukush:2001}, the classification performance of compression
programs was compared with that of the Markov chain
approach~\cite{Khmelev:2000}. A total of 82 authors with sufficiently
large texts ($\ge 10^5$ characters) were chosen. One text per author,
82 in total, was then held out and used for control purposes. The
classification algorithm~\cite{Kukush:2001} had to determine the
author of each control text among the 82 alternatives. The numbers of
correct guesses for 15 compression programs and for Markov chains are
presented in the following list~\cite{Kukush:2001}:

Program (number of correct guesses): 7zip (39), arj (46), bsa (44),
bzip2 (38), compress (12), dmc (36), gzip (50), ha (47), huff (10),
lzari (17), lzss (14), ppmd5 (46), rar (58), rarw ({\bf 71}), rk (52);
Markov chain approach (see~\cite{Khmelev:2000}): {\bf 69} correct
guesses.

%\begin{center}
%  \footnotesize
%  \begin{tabular}{lcccccccc}
%    %\hline
%    Program   & 7zip& arj& bsa & bzip2 & compress & dmc& gzip &ha\\%[-0.5em]% \hline
%    Guesses   & 39  & 46 & 44  & 38    & 12       & 36 & 50   &47\\%[-0.5em]% \hline
%  \end{tabular}

%  \begin{tabular}{lccccccc}
%    %\hline
%    Program   & huff& lzari & lzss & ppmd5 & rar & rarw &rk\\%[-0.5em] %\hline
%    Guesses   & 10  & 17    & 14   & 46    & 58  & {\bf 71}   &52\\%[0.5em]%\hline
%    \multicolumn{8}{c}{Markov Chain approach (see~\cite{Khmelev:2000}):
%{\bf 69} guesses}%\\[-0.5em]
%  \end{tabular}
%\end{center}

Clearly, {\em gzip} is significantly outperformed by some other
compression programs (rar, rarw, rk) and by the first-order Markov
chain model~\cite{Khmelev:2000}. Notice also that in practical
implementations, the {\em gzip}-based approach~\cite{Bene:2002} is
significantly slower than the first-order Markov chain
method~\cite{Khmelev:2000}.
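
The first-order Markov chain estimate referred to above can be sketched
as follows. The notation is ours, introduced for illustration, and the
exact estimator of~\cite{Khmelev:2000} may differ in details such as
the treatment of unseen letter pairs. Let $n_{ij}$ count the
occurrences of letter $j$ immediately after letter $i$ in the known
texts of author $a$, let $n_i=\sum_j n_{ij}$, and let $m_{ij}$ be the
same counts in the control text $x$. The cost of $x$ under the chain
fitted to $a$ is
\[
  \Lambda_a(x) = -\sum_{i,j} m_{ij}\,\ln\frac{n_{ij}}{n_i},
\]
and $x$ is attributed to the author with the smallest $\Lambda_a(x)$.
Every term is non-negative because $n_{ij}\le n_i$. The corresponding
relative entropy estimate,
\[
  D(x\,\|\,a) = \sum_{i,j}\frac{m_{ij}}{m}\,
  \ln\frac{m_{ij}/m_i}{n_{ij}/n_i},
  \qquad m_i=\sum_j m_{ij},\quad m=\sum_{i,j} m_{ij},
\]
is likewise non-negative by Gibbs' inequality applied to each row $i$,
provided $n_{ij}>0$ whenever $m_{ij}>0$; terms with $m_{ij}=0$ are
taken as zero. Once the counts $n_{ij}$ are tabulated, evaluating
$\Lambda_a(x)$ takes one pass over $x$ per candidate author, whereas a
{\em gzip}-based comparison, at least when run with the standard
command-line tool, recompresses the concatenated reference text for
each candidate.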

To sum up, in natural language processing (and, perhaps, in other
fields), $n$th-order Markov chain
models~\cite{Khmelev:2000,Teahan:2000} are more appropriate than
the LZ approach~\cite{Bene:2002}.

% Create the reference section using BibTeX:
\bibliography{cmc}

\end{document}