% 0205:cond-mat0205521/cmc.tex

% Group addresses by affiliation; use superscriptaddress for long
% author lists, or if there are many overlapping affiliations.
% For Phys. Rev. appearance, change preprint to twocolumn.
% Choose pra, prb, prc, prd, pre, prl, prstab, or rmp for journal
%  Add 'draft' option to mark overfull boxes with black boxes
%  Add 'showpacs' option to make PACS codes appear
%  Add 'showkeys' option to make keywords appear
\documentclass[aps,prl,twocolumn,showpacs,superscriptaddress]{revtex4}
%\documentclass[aps,prl,preprint,superscriptaddress]{revtex4}
%\documentclass[aps,prl,twocolumn,groupedaddress]{revtex4}

% You should use BibTeX and apsrev.bst for references
% Choosing a journal automatically selects the correct APS
% BibTeX style file (bst file), so only uncomment the line
% below if necessary.
%\bibliographystyle{apsrev}

\begin{document}

% Use the \preprint command to place your local institutional report
% number in the upper righthand corner of the title page in preprint mode.
% Multiple \preprint commands are allowed.
% Use the 'preprintnumbers' class option to override journal defaults
% to display numbers if necessary
%\preprint{}

%Title of paper
\title{On an Application of Relative Entropy}

% repeat the \author .. \affiliation  etc. as needed
% \email, \thanks, \homepage, \altaffiliation all apply to the current
% author. Explanatory text should go in the []'s, actual e-mail
% address or url should go in the {}'s for \email and \homepage.
% Please use the appropriate macro for each type of information

% \affiliation command applies to all authors since the last
% \affiliation command. The \affiliation command should follow the
% other information
% \affiliation can be followed by \email, \homepage, \thanks as well.

\author{Dmitry V. Khmelev}
\email{D.Khmelev@newton.cam.ac.uk}
%\homepage[]{Your web page}
%\thanks{}
\affiliation{Isaac Newton Institute for Mathematical Sciences, 20
  Clarkson Rd, Cambridge, CB3 0EH, U.K.}
\affiliation{Heriot-Watt
  University, Edinburgh, U.K. and Moscow State University, Russia}

\author{William J. Teahan}
\email{wjt@informatics.bangor.ac.uk}
%\homepage[]{Your web page}
%\thanks{}
\affiliation{University of Wales, Bangor, Dean Street, Bangor, LL57 1UT, U.K.}

\date{\today}

\begin{abstract}
  We show that in problems of authorship attribution and other
  linguistic applications, a Markov chain approach is a more attractive
  technique than Lempel-Ziv based compression.
\end{abstract}

% insert suggested PACS numbers in braces on next line
\pacs{89.70.+c, 01.20.+x, 05.20.-y, 05.45.Tp}
% insert suggested keywords - APS authors don't need to do this
%\keywords{}

%\maketitle must follow title, authors, abstract, \pacs, and \keywords
\maketitle

We wish to point out a number of inaccurate and misleading statements
that Benedetto {\em et al.} make in their paper ``Language
Trees and Zipping''~\cite{Bene:2002}. First, they claim that the technique
they use to construct a language tree does not rely on any
a priori information about the alphabet, but it does, both in the
alphabet chosen (Unicode) and in the set of languages they chose to
experiment with. Second, they propose Lempel-Ziv (LZ, {\em gzip})
compression as being applicable to DNA analysis, where the usefulness
of LZ is quite doubtful. Third, in practice their definitions of
relative entropy and distance can yield negative values. Fourth, the
classification performance of the method they use is significantly
worse than that of other entropy-based methods, as has been noted in
prior work. Fifth, its classification speed is significantly worse as
well, which calls its ``potentiality'' into question. We
elaborate on each of these points in the following
paragraphs.

Notice that the ``Language Tree'' (LT) diagram~\cite{Bene:2002} does not
include the Russian language (Slavic branch of the Indo-European family
of languages; 288 million speakers). Our computations show that once
Russian is included, it does not cluster with the other members of
the Slavic group. Evidently, certain languages written in the Cyrillic
alphabet were left out of the study~\cite{Bene:2002}; this ``improves''
the results significantly and shows that a priori information about
the alphabet is being exploited to achieve the results reported
in~\cite{Bene:2002}.

The LZ compressor makes few assumptions about the input string, but in
practice we do have a priori information that we can take advantage
of. Biologists widely use an amino acid {\em substitution matrix} (PAM250
or BLOSUM62) when searching for {\em similar} biological
sequences~\cite{gusfield}. It is not at all clear how a substitution
matrix could be incorporated into the LZ algorithm. That is why
compression is not widely used for DNA analysis, although the first
attempts to apply it date back to 1990~\cite{gusfield}.

The quantity $S_{\mathcal A\mathcal B}$~\cite{Bene:2002}, defined as
``relative entropy'' in their Eq.~(1) and redefined as ``distance'' in
their Eq.~(2), can take negative values. Negative values did appear in
our study, which also showed that the ``LT''~\cite{Bene:2002}
significantly reflects the structure of Unicode (or vice versa), so its
relevance to language classification requires additional support.
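
To make the sign problem concrete, the following is only a sketch of the
construction as we understand it from~\cite{Bene:2002}; the symbols
$L_X$, $\Delta_{Ab}$ and $|b|$ are notation assumed here for
illustration and should be checked against the original. Let $L_X$
denote the length of the compressed file $X$, let $A$ and $B$ be long
texts from the sources $\mathcal A$ and $\mathcal B$, let $b$ be a short
text from $\mathcal B$, and let $A+b$ denote concatenation. Then
\[
  \Delta_{Ab} = L_{A+b} - L_A, \qquad
  S_{\mathcal A\mathcal B} = \frac{\Delta_{Ab} - \Delta_{Bb}}{|b|},
\]
where $|b|$ is the length of $b$ in characters. Nothing forces a
practical compressor such as {\em gzip}, with its finite sliding window,
to satisfy $\Delta_{Ab} \ge \Delta_{Bb}$, so the estimate can come out
negative. By contrast, the relative entropy of two distributions $p$ and
$q$ on the same alphabet,
\[
  D(p\|q) = \sum_x p(x) \log\frac{p(x)}{q(x)} \ge 0,
\]
is non-negative by Gibbs' inequality, with equality only when $p = q$.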

The traditional definition and estimates of (relative) entropy via an
$n$th order Markov chain {\em on letters}~\cite{Shannon,Yaglom,Kukush:2001}
always yield a proper non-negative number. Markov chains are also
traditional in text entropy analysis~\cite{Shannon,Yaglom},
compression~\cite{PPM}, and authorship and subject
attribution~\cite{Teahan:2000,Khmelev:2000}. In~\cite{Kukush:2001},
the classification performance of compression programs was compared
with that of the Markov chain approach~\cite{Khmelev:2000}. A total of
82 authors with sufficiently large bodies of text ($\ge 10^5$
characters) were chosen, and one text per author (82 in total) was held
out for control purposes. The
classification algorithm~\cite{Kukush:2001} had to
determine the author of each control text among the 82 alternatives. The
numbers of correct guesses for 15 compression programs and
for Markov chains are as follows~\cite{Kukush:2001}:

Program (number of correct guesses): 7zip (39), arj (46), bsa (44),
bzip2 (38), compress (12), dmc (36), gzip (50), ha (47), huff (10),
lzari (17), lzss (14), ppmd5 (46), rar (58), rarw ({\bf 71}), rk (52);
Markov chain approach (see~\cite{Khmelev:2000}): {\bf 69}.

%\begin{center}
%  \footnotesize
%  \begin{tabular}{lcccccccc}
%    %\hline
%    Program   & 7zip& arj& bsa & bzip2 & compress & dmc& gzip &ha\\%[-0.5em]% \hline
%    Guesses   & 39  & 46 & 44  & 38    & 12       & 36 & 50   &47\\%[-0.5em]% \hline
%  \end{tabular}

%  \begin{tabular}{lccccccc}
%    %\hline
%    Program   & huff& lzari & lzss & ppmd5 & rar & rarw &rk\\%[-0.5em] %\hline
%    Guesses   & 10  & 17    & 14   & 46    & 58  & {\bf 71}   &52\\%[0.5em]%\hline
%    \multicolumn{8}{c}{Markov Chain approach (see~\cite{Khmelev:2000}):
%{\bf 69} guesses}%\\[-0.5em]
%  \end{tabular}
%\end{center}

Clearly, {\em gzip} is significantly outperformed by other compression
programs, most notably {\em rar} and {\em rarw}, and by the first-order
Markov chain model~\cite{Khmelev:2000}.
Notice also that in practical implementations the {\em gzip}-based
approach~\cite{Bene:2002} is significantly slower than the first-order
Markov chain method~\cite{Khmelev:2000}.
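
As a rough sketch of what a first-order classifier of this kind
involves (the notation below is introduced here for illustration and is
not quoted from~\cite{Khmelev:2000}): for each candidate author $a$, let
$n^{a}_{ij}$ be the number of times letter $j$ immediately follows
letter $i$ in that author's training texts, with
$n^{a}_{i} = \sum_j n^{a}_{ij}$, and let $m_{ij}$ be the same counts for
the control text. The estimated transition probabilities and the
resulting score are
\[
  \hat p_a(j\,|\,i) = \frac{n^{a}_{ij}}{n^{a}_{i}}, \qquad
  \Lambda_a = -\sum_{i,j} m_{ij} \log \hat p_a(j\,|\,i),
\]
and the control text is attributed to the author with the smallest
$\Lambda_a$. Every term in the sum is non-negative because
$0 < \hat p_a(j\,|\,i) \le 1$. In this sketch, transitions with
$n^{a}_{ij} = 0$ need separate treatment (for example some form of
smoothing, which is an assumption here). The counts $m_{ij}$ are
accumulated in a single pass over the text, so the cost of scoring is
linear in the text length plus one table lookup per observed letter pair.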

To sum up, in natural language processing (and perhaps in other
fields), $n$th order Markov chain
models~\cite{Khmelev:2000,Teahan:2000} are more appropriate than
the LZ approach~\cite{Bene:2002}.

% Create the reference section using BibTeX:
\bibliography{cmc}

\end{document}