
1: \documentclass[aps,twocolumn,floatfix,showpacs]{revtex4}

2: \usepackage{epsfig}

3: \newcommand{\be}{\begin{equation}}

4: \newcommand{\ee}{\end{equation}}

5: \newcommand{\ftnt}{\footnote}

6:

7: \begin{document}

8:

9: \title{Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution}

10:

11: \author{Peter Grassberger}

12:

13: \affiliation{John-von-Neumann Institute for Computing, Forschungszentrum J\"ulich,

14: D-52425 J\"ulich, Germany}

15:

16: \date{\today}

17:

18: \begin{abstract}

19: We argue that Non-sequential Recursive Pair Substitution (NSRPS) as suggested by

20: Jim\'enez-Monta\~no and Ebeling can indeed be used as a basis for an optimal data

21: compression algorithm. In particular, we prove for Markov sequences that NSRPS together

22: with suitable codings of the substitutions and of the substitute series does not lead

23: to a code length increase, in the limit of infinite sequence length. When applied

24: to written English, NSRPS gives entropy estimates which are very close to those obtained

25: by other methods. Using ca. 150 MB of input data from Project Gutenberg, we estimate

26: the effective entropy to be $\approx 1.82$ bit/character. Extrapolating to infinitely

27: long input, the true value of the entropy is estimated as $\approx 0.8$ bit/character.

28: \end{abstract}

29:

30: \pacs{02.50.-r, 05.10.-a, 05.45.Tp}

31:

32: \maketitle

33:

34: \section{Introduction}

35:

36: The discovery that the amount of information in a message (or in any other structure)

37: can be objectively measured was certainly one of the major scientific achievements of

38: the 20th century. On the theoretical side, this quantity -- the information theoretic

39: entropy -- is of interest mainly because of its close relationship to thermodynamic

40: entropy, its importance for chaotic systems, and its role in Bayesian

41: inference (maximum entropy principle). Practically, estimating

42: the entropy of a message (text document, picture, piece of music, etc.) is important

43: because it measures its compressibility, i.e. the optimal achievement for any

44: possible compression algorithm. In the following, we shall always deal with sequences

45: $(s_0,s_1,\ldots)$ built from the characters of a finite alphabet $A = \{a_0,\ldots,a_{m-1}\}$

46: of size $m$. In the simplest case the alphabet consists just of 2 characters, in

47: which case the maximum entropy is 1 bit per character.

48:

49: Information entropy as introduced by Shannon \cite{shannon} is a probabilistic

50: concept. It requires a measure (probability distribution) to be defined on the set

51: of all possible sequences. In particular, the probability for $s_t$

52: to be given by $a_k$, given all characters $s_0,s_1,\ldots, s_{t-1}$, is given by

53: \begin{eqnarray}

54:    p_t(k|k',k'',\ldots) = &&\\{\rm prob}(s_t = a_k&|&s_{t-1}=a_{k'}, s_{t-2}=a_{k''}, \ldots

55:      ) . \nonumber

56: \end{eqnarray}

57: In case of a stationary measure with finite range correlations, $p_t(k|k',k'',\ldots)$

58: becomes independent of $t$ for $t\to\infty$. Then Shannon's famous formula,

59: \be

60:    h = \lim_{i\to \infty} h^{(i)}

61: \ee

62: with

63: \be

64:    h^{(i)} = - \sum_{k_1\ldots k_i} p(k_1\ldots k_i) \log_2 p(k_1|k_2\ldots k_i)\;,

65: \ee

66: gives the {\it average} information per character. The generalization to non-stationary

67: measures is straightforward but will not be discussed here.

68:

69: In contrast to this approach are attempts to define the {\it exact} information content of a single

70: finite sequence. Theoretically, the basic concept here is the {\it algorithmic complexity}

71: AC (or algorithmic {\it randomness}) \cite{kolmogorov,chaitin}. For any given universal

72: computer $U$, the AC of a sequence $S$ relative to $U$ is given by

73: the length of the shortest program which, when input to $U$, prints $S$ and then makes

74: $U$ stop, so that the next sequence can be read. If $S$ is randomly drawn from

75: a stationary ensemble with entropy $h$, then one can show that the AC

76: per character tends towards $h$, for almost all $S$ and all $U$, as the length of $S$

77: tends towards infinity \cite{li-vitanyi}. Thus, except for rare sequences which do not

78: contribute to averages, $h$ sets the limit for the compressibility.

79:

80: Practically, the usefulness of AC is limited by the fact that there cannot exist any

81: algorithm which finds for each $S$ its shortest code (such an algorithm could be used to

82: solve Turing's halting problem, which is known to be impossible) \cite{li-vitanyi}.

83: But one can give algorithms which are often quite efficient. Huffman,

84: arithmetic, and Lempel-Ziv coding are just three well known examples \cite{cover}.

85: Any such algorithm can be used to give an upper bound to $h$ (modulo fluctuations from

86: the finite sequence length) while, inversely, knowledge of $h$ sets a lower limit to

87: the average code lengths possible with these codes.

88:

89: A data compression scheme is called {\it optimal}, if it does not do much worse than the

90: best possible for typical random strings. More precisely, let $\{S\}$ be a set of sequences

91: with entropy $h(S)$, and let the code string $C(S)$ be built from an alphabet of $m_C$

92: characters. Then we call the coding scheme $C: S\to C(S)$ optimal, if

93: \be

94:    {{\rm length}[C(S)] \over {\rm length}[S]} \to {h \over \log_2 m_C }

95:                             \quad {\rm for} \;\; {\rm length}[S] \to \infty

96: \ee

97: and for nearly all $S$.

98: While Huffman coding is not optimal, arithmetic and Lempel-Ziv codings are \cite{cover}.

99:

100: In several papers, Jim\'enez-Monta\~no, Ebeling, and others \cite{jimenes,poeschl} have

101: suggested coding schemes by non-sequential recursive pair substitution (NSRPS) \cite{footnote0}.

102: Call the original sequence $S_0$. We count the numbers $n_{jk}$ of non-overlapping successive

103: pairs of characters in $S_0$ where $s_t = a_j$ and $s_{t+1} = a_k$, and find their maximum,

104: $n_{\rm max} = \max_{j,k< m} n_{jk}$. The corresponding index pair is $(j_0,k_0)$.

105: Then we introduce a new character by concatenation

106: \be

107:    a_m = (a_{j_0}a_{k_0})

108: \ee

109: and form the sequence $S_1$ by replacing everywhere the pair $a_{j_0}a_{k_0}$ by $a_m$. For

110: the special case of $j_0 = k_0$, any string of $2r+1$ characters $a_{j_0}$ is replaced by $r$

111: characters $a_m$, followed by one $a_{j_0}$.

112:

113: This is then repeated recursively: The sequence $S_{i+1}$ is obtained from $S_i$ by replacing

114: the most frequent pair $a_{j_i}a_{k_i}$ by a new character $a_{m+i}$. The procedure stops

115: if one can argue that further replacements would not possibly be of any use. Typically this

116: will happen if the code length consisting of both a description of $S_{i+1}$ and a description

117: of the pair $(j_i,k_i)$ is definitely longer than a description of $S_i$, for the present

118: and all subsequent $i$.

119:

120: Thus one sees that efficient encodings (which must also be uniquely decodable!) of the

121: sequences $S_i$ and of the type of substituted pairs become crucial for the analysis of NSRPS.

122: Unfortunately, the ``codings" given in \cite{jimenes,poeschl} are neither efficient nor

123: uniquely decodable \cite{footnote}. Thus their ``complexities" have no direct relationship

124: to $h$ or to algorithmic complexity (in contrast to their claim), and it is not clear from

125: their work whether NSRPS can be made into an optimal coding scheme at all.

126:

127: It is the purpose of the present paper to give at least partial answers to this.

128: More precisely, we shall only be concerned with the limit of infinitely long

129: strings, where the information encoded in the pairs $(j_i,k_i)$ can be neglected

130: in comparison with the information stored in $S_i$, at least for any finite $i$.

131: We will first show analytically that a coding scheme for $S_i$ exists which

132: satisfies a necessary condition for optimality (Sec.2). We then apply this to written

133: English (Sec.3), where we shall also compare our estimates of $h$ to those obtained

134: with other methods.

135:

136: \section{NSRPS for Markov sequences}

137:

138: Let us for the moment assume that $S_0$ is binary (the two characters are ``0" and ``1"),

139: and that it is completely random, i.e. identically and independently distributed (iid)

140: with the same probability for each character. Thus $p(0|\ldots) = p(1|\ldots) = 1/2$, and

141: $h=1$ bit. The length of $S_0$ is $N_0$, thus the total average information

142: stored in $S_0$ is $N_0$ bits.

143:

144: No coding scheme can reduce the length of $C(S_0)$ to less than $N_0$ bits

145: on average. Indeed, all schemes will have ${\rm length}[C(S_0)] > N_0$ bits (strict

146: inequality!), unless the ``coding" is a verbatim copy. For a coding scheme to be

147: optimal, a necessary (but not sufficient) condition is that

148: \be

149:    {\rm length}[C(S_0)] / N_0 \to 1 \;{\rm bit}

150: \ee

151: for $N_0\to\infty$, i.e. the overhead in the code must be less than extensive

152: in the sequence length. This is what we want to show here, together with its

153: generalization to arbitrary (first order) Markov sequences.

154:

155: For this, we need two lemmata:

156:

157: {\bf Lemma 1}: {\it For any Markov sequence $S_0$ (not necessarily binary, and not

158: necessarily iid) built from $m$ letters, the sequence $S_1$ is again Markov.}

159:

160: {\bf Lemma 2}: {\it If a word $w = (k,k',k'',\ldots)$ appears several times in $S_0$,

161: and if one of these instances is substituted in $S_i$ by a string of characters

162: not straddling its boundaries, then all other instances of $w$ in $S_0$ are also

163: substituted in $S_i$ by the same string.}

164:

165: Lemma 1 tells us that NSRPS might make the structure of $S_i$ more complex than

166: that of $S_0$, but not much so. Being a Markov chain, its entropy can be estimated

167: if the transition probabilities $p(k|k')$ are known. Thus estimating the entropy

168: of $S_i$ reduces to estimating di-block entropies $h^{(2)}$, which is straightforward (at

169: least in the limit $N_0\to\infty$).

170:

171: Lemma 2 tells us that there cannot be any ambiguity in $S_i$. In particular,

172: it cannot happen that more information is needed to specify $S_i$ than there

173: is needed to specify $S_0$, since the mapping $S_0\to S_i$ is bijective, once

174: the substitution rules are fixed.

175:

176: The proofs of the lemmata are easy. Let us denote by $p_j(\ldots)$ the probability

177: distributions after $j$ pair substitutions. For lemma 1 we just have to show that

178: $p_1(k|k',k'')$ is independent of $k''$ for each pair $(k,k')$, provided the

179: same holds also for $p_0$. This follows basically from the fact that any

180: substitution makes the sequence shorter. But the detailed proof is somewhat

181: tedious, because $p_1(k|k',k'')\neq p_0(k|k',k'')$, even if all $k$'s are less than

182: $m$, $k\neq k_0$, $k''\neq j_0$, and neither $(k,k')$ nor $(k',k'')$ are equal to

183: the pair $(j_0,k_0)$. In that case,

184: $(N_0-n_{\rm max}) p_1(k|k',k'') = N_0 p_0(k|k',k'')$, and independence of $k''$ follows

185: immediately. All other cases have to be dealt with similarly. For instance, if

186: either $(k,k')$ or $(k',k'')$ is the pair $(j_0,k_0)$, then $p_1(k,k',k'')=0$. Else,

187: if $k''=m\neq k,k'$, then $p_1(k|k',k'') = N_0/ (N_0-n_{\rm max}) p_0(k|k',j_0,k_0) =

188: N_0/ (N_0-n_{\rm max}) p_0(k|k')$. We leave the other cases as exercises to the reader.

189:

190: For proving lemma 2 we proceed indirectly. We assume that there is a word in

191: $S_0$ which is encoded differently in different locations. Let us assume

192: that this difference happened for the first time after $i$ substitutions.

193: Since only one type of pair is exchanged in each step, this means that a

194: substitution is skipped in one of the locations, at this step. But this is

195: impossible, since {\it all} possible substitutions are made at each step.

196:

197: From the two lemmata we obtain immediately our central

198:

199: {\bf Theorem:} {\it If $S_0$ is drawn from a (first order) Markov process with length $N_0$

200: and entropy $h_0 =  - \sum_{k,k'} p_0(k,k') \log_2 p_0(k|k')$, then every $S_i$

201: is also Markovian in the limit $N_0\to\infty$, with entropy

202: \be

203:    h_i = h_i^{(2)} =  - \sum_{k,k'} p_i(k,k') \log_2 p_i(k|k')                \label{h2}

204: \ee

205: and with length $N_i$ satisfying $N_i/N_0 = h_0/h_i$.}

206:

207: Thus the total amount of information needed to specify $S_i$ is the same as that for

208: $S_0$, for infinitely long sequences. Since the overhead needed to specify the pairs

209: $(j_i,k_i)$ can be neglected in this limit, we see that we do not lose code length

210: efficiency by pair substitution, provided we take pair probabilities correctly

211: into account during the coding. The actual encoding can be done by means of an arithmetic

212: code based on the probabilities $p_i(k|k')$ \cite{cover}, but we shall not work out

213: the details. It is enough to know that the code length then becomes equal to the

214: information (both measured in bits), for $N_0\to\infty$.

215:

216: Let us see in detail how all this works for completely random iid binary

217: sequences. The original sequence $S_0 = 00101001111010011011\ldots$

218: has $p_0(00)=p_0(01)=p_0(10)=p_0(11)=1/4$ and therefore $h_0 = 1$ bit. Thus

219: we can, without loss of generality, assume that the new character is

220: $2 = (01)$, so that $S_1 = 02202111202121\ldots$. The 3 characters are

221: now equiprobable, $p_1(0)=p_1(1)=p_1(2)=1/3$, but they are not independent

222: since of course $p_1(01)=0$. Indeed, one finds $p_1(00)=p_1(02)=p_1(11)=p_1(21)=1/6,

223: \;p_1(10)=p_1(12)=p_1(20)=p_1(22)=1/12$. The order-2 entropy of $S_1$ is easily

224: calculated as $h_1^{(2)} = 4/3 \log_2 2$. On the other hand, since $N_0/4$ pairs

225: have been replaced by single characters, the length of $S_1$ is $N_1=3N_0/4$. Thus,

226: if $S_1$ is Markov, then the total information needed to specify it is

227: $N_1 h_1^{(2)} = N_0$ bits, the same as for $S_0$. If it were not Markov, its

228: information would be smaller. But this cannot be, because the map $S_0\to S_1$

229: was invertible. Thus $S_1$ must indeed be Markov, as can also be checked explicitly.
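As a check of the value quoted above, $h_1^{(2)}$ can be evaluated directly from the listed pair probabilities. This is only a sketch, and it assumes that $p_1(kk')$ denotes the probability that $k$ is immediately followed by $k'$ (our reading of the notation). Dividing by $p_1(k)=1/3$ gives the transition probabilities: after a ``0'' the next character is ``0'' or ``2'' with probability $1/2$ each (conditional entropy 1 bit), while after a ``1'' or a ``2'' it is ``1'' with probability $1/2$ and ``0'' or ``2'' with probability $1/4$ each (conditional entropy $3/2$ bits). Hence
\begin{eqnarray}
   h_1^{(2)} &=& {1\over 3}\left[ 1 + {3\over 2} + {3\over 2}\right] {\rm bits} = {4\over 3}\;{\rm bits}, \nonumber \\
   N_1 h_1^{(2)} &=& {3N_0\over 4}\cdot {4\over 3}\;{\rm bits} = N_0 \;{\rm bits}. \nonumber
\end{eqnarray}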

230:

231: In the next step, we can either replace $(21) \to 3$ or $(02) \to 3$, since both

232: have the same probability. If we do the former, the sequence becomes

233: $S_2 = 02203112033\ldots$. Now the letters are no longer equiprobable,

234: $p_2(1)=p_2(2)=p_2(3)=1/5$, $p_2(0)=2/5$. Calculating $N_2, p_2(kk')$, and

235: $h_2^{(2)}$ is straightforward, and one finds again $N_2 h_2^{(2)} = N_0$ bits.

236: Thus one concludes that $S_2$ must also be Markov. For the next

237: few steps one can still verify

238: \be

239:    N_i h_i^{(2)} = N_0 \; {\rm bits},           \label{same}

240: \ee

241: by hand, but this becomes increasingly tedious as $i$ increases.
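The step $i=2$ can be checked in the same way. The following is only a sketch for the choice $(21)\to 3$; the transition probabilities are our own evaluation for the uniform iid case, not values taken from the simulations. Since $p_1(21)=1/6$, the number of substituted pairs is $N_1/6 = N_0/8$, so that $N_2 = 3N_0/4 - N_0/8 = 5N_0/8$. In $S_2$, a ``0'' or a ``2'' is always followed by an original ``0'', so the next character is ``0'', ``2'' or ``3'' with probabilities $1/2, 1/4, 1/4$ (conditional entropy $3/2$ bits). After a ``1'' or a ``3'' the next original bit is unconstrained, and the next character is ``1'', ``0'', ``2'' or ``3'' with probabilities $1/2, 1/4, 1/8, 1/8$ (conditional entropy $7/4$ bits). Weighting with $p_2(0)+p_2(2)=3/5$ and $p_2(1)+p_2(3)=2/5$ gives
\begin{eqnarray}
   h_2^{(2)} &=& {3\over 5}\cdot{3\over 2} + {2\over 5}\cdot{7\over 4} = {8\over 5}\;{\rm bits}, \nonumber \\
   N_2 h_2^{(2)} &=& {5N_0\over 8}\cdot{8\over 5}\;{\rm bits} = N_0\;{\rm bits}. \nonumber
\end{eqnarray}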

242:

243: \begin{figure}

244: %Fig 1

245: \psfig{file=fig1.ps,width=5.8cm,angle=270}

246: \caption{Results for a completely random (iid, uniformly distributed) binary

247:   initial sequence of $N_0 = 8\times 10^8$ bits, plotted against the size of

248:   the extended alphabet. Uppermost curve: code length needed to encode $S_i$,

249:   divided by $N_0$, if $\log_2 (i+2)$ bits are used for each character. Middle

250:   curve: code length based on $h_i^{(1)}$, i.e. the single-character distributions

251:   $p_i(k)$ are used in the encoding. Lowest curve, indistinguishable on this

252:   scale from a horizontal straight line:

253:   code length based on $h_i^{(2)}$, using the two-character distributions $p_i(k,k')$.}

254: \label{fig1.ps}

255: \end{figure}

256:

257: We have therefore verified Eq.(\ref{same}) by extensive simulations, where we found

258: that it is exact, within the expected fluctuations, up to several thousand

259: substitutions (Fig.1). The distribution of the probabilities $p_i(k)$ becomes very

260: wide for large $i$, i.e. the sequences $S_i$ are far from uniform for large $i$,

261: but they are Markov and their entropies $h_i^{(2)}$ are exactly (within

262: the expected systematic finite sample corrections \cite{herzel,grass-fsc})

263: equal to $N_0/N_i$ bits. Notice that if we encoded the last $S_i$ without

264: taking the correlations into account (as seems suggested in

265: \cite{jimenes,poeschl}), then the code length for it would be larger and the

266: coding scheme would not be optimal.
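To illustrate the difference between the three codings compared in Fig.1, consider the first two substitution steps for the uniform binary case discussed above. This is a rough numerical sketch; the rounded values are our own evaluation from the probabilities given in the text, with the choice $(21)\to 3$ at the second step. For $i=1$ the three characters are equiprobable, so the fixed-length code with $\log_2 (i+2)$ bits per character and the code based on $h_1^{(1)}$ both need
\be
   N_1 \log_2 3 \approx 0.75 \times 1.585 \, N_0 \approx 1.19 \, N_0 \;{\rm bits}, \nonumber
\ee
while the code based on pair probabilities needs $N_1 h_1^{(2)} = N_0$ bits. For $i=2$ we have $N_2 = 5N_0/8$ (since $N_1/6 = N_0/8$ pairs are replaced), and with $p_2(0)=2/5$, $p_2(1)=p_2(2)=p_2(3)=1/5$ one finds
\begin{eqnarray}
   N_2 \log_2 4 &=& 1.25 \, N_0 \;{\rm bits}, \nonumber \\
   N_2 h_2^{(1)} &=& {5N_0\over 8}\left[ {2\over 5}\log_2 {5\over 2} + {3\over 5}\log_2 5 \right]
                 \approx 1.20 \, N_0 \;{\rm bits}, \nonumber
\end{eqnarray}
again compared to $N_2 h_2^{(2)} = N_0$ bits when the pair correlations are used.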

267:

268: We have also made some simulations where we started with non-trivial Markov

269: processes for $S_0$, or even with non-Markov sequences with known entropy.

270: The latter were generated by creating initially a binary iid sequence with

271: $p(0) \neq p(1)$, and then using this as an input configuration for a few

272: iterations of the bijective cellular automaton R150 (in Wolfram's notation)

273: \cite{sg}.

274:

275: \begin{figure}

276: %Fig 2

277: \psfig{file=fig2.ps,width=5.8cm,angle=270}

278: \caption{Ranked single character probability distributions $p_i(k)$ of strings after

279:   $i=2298$ pair substitutions. The different curves are for a completely random iid

280:   initial string $S_0$ (solid line), iid string $S_0$ with $p_0(0)=0.29$ (long dashed),

281:   $S_0$ obtained by applying two times CA rule 150 to an iid sequence with

282:   $p(0)=0.09$ (dashed), and to written English with a reduced (46 character)

283:   alphabet (dotted).}

284: \label{fig2.ps}

285: \end{figure}

286:

287: From these simulations it seems that $N_i h_i^{(2)}$ always tends towards $N_0 h$, with $h$ the entropy of $S_0$.

288: Also, the probability distributions $p_i(k)$ seem to tend (very slowly, see

289: Fig.2) to the same scaling limit as for iid and uniform $S_0$. This suggests

290: that indeed $S_i$ tends to a Markov process for arbitrary $S_0$. In this

291: case an optimal coding would be obtained if one used, e.g., an

292: arithmetic code to encode $S_i$ by using approximate values of the observed

293: $p_i(k|k')$ for large $i$.

294:

295: Thus we have given strong (but still incomplete) arguments that NSRPS combined

296: with efficient coding of $S_i$ gives indeed an optimal coding scheme. In

297: practice, it would of course be extremely inefficient in terms of speed, and

298: thus of no practical relevance. But it could well be that it might lead to

299: more stringent entropy estimates than other methods. To test this we shall

300: now turn to one of the most complex and interesting systems, written natural

301: language.

302:

303: \section{The entropy of written English}

304:

305: The data used for the application of NSRPS to entropy estimation of written

306: English consisted of ca. 150 MB of text taken from the Project Gutenberg

307: homepage \cite{gutenberg}. It includes mainly English and American novels

308: from the 19th and early 20th century (Austen, Dickens, Galsworthy, Melville,

309: Stevenson, etc.), but also some technical reports (e.g. Darwin, historical

310: and sociological texts, etc.), Shakespeare's collected works, the King James

311: Bible, and some novels translated from French and Russian (Verne, Tolstoy,

312: Dostoevsky, etc.).

313:

314: From these texts we removed first editorial and legal remarks added by the

315: editors of Project Gutenberg. We also removed end-of-line, end-of-page, and

316: carriage return characters. All runs of consecutive blanks were replaced by

317: a single blank. Finally, we also removed all characters not in

318: the 7-bit ASCII alphabet (ca. 4200 in total). These cleaned texts were then

319: concatenated to form one big input string of 148,214,028 characters.

320:

321: Entropies were estimated both from this string (which still contained upper

322: and lower case letters, numbers, all kinds of brackets and punctuation marks,

323: 95 different characters in total), and from a version with reduced alphabet.

324: In the latter, we changed all letters to upper case; all brackets to either

325: ( or ); the symbols \$,\#,\&,*,\%, @ to one single symbol; colons, exclamation and

326: question marks to points; quotation marks to apostrophes; and semicolons to commas.

327: This reduced alphabet had then 46 letters (including, of course, the blank

328: ``$_\sqcup$").

329:

330: The most frequent pair of letters in English is ``e$_\sqcup$". After replacing it

331: by a new ``letter", the next pair to substitute is ``$_\sqcup$t", then ``$_\sqcup$a",

332: ``$_\sqcup$th", etc. Very soon also longer strings are substituted, e.g. after

333: 92 steps appears the first two-word combination, ``of$_\sqcup$the$_\sqcup$".

334:

335: As long as the number of new symbols is still small, it is easy to estimate the

336: pair probabilities, and from this an upper bound $\hat{h}_i = h_i^{(2)}N_i/N_0$

337: on the entropy.  This becomes more and more difficult

338: as the alphabet size increases, as the sampling becomes insufficient even with

339: our very long input file, and we can no longer approximate the $p_i(k,k')$ by the

340: observed relative frequencies. As long as the number of different subsequent pairs is

341: much smaller than the sequence length (i.e., most pairs are observed many times),

342: we can still get reliable estimates of $\hat{h}_i$ by using the leading correction

343: term discussed in \cite{grass-fsc,footnote2}. But finally, when many pairs are seen only

344: once in the entire text, we have to stop since any estimate of $h_i^{(2)}$ becomes

345: unreliable.

346:

347: We went up to 6000 substitutions. The longest substrings substituted by a single

348: new symbol had length 13 in the original (95 letter) alphabet, and length 16 in the

349: reduced (46 letter) one (the latter was ``would$_\sqcup$have$_\sqcup$been$_\sqcup$").

350: The entropies $\hat{h}$ per (original) character are plotted

351: in Fig.3. We see that they are very similar for both alphabets.

352: We find $\hat{h}\approx 1.8$ bits/character after 6000 substitutions. This number

353: is very close to the value obtained from most other methods (with the exception of

354: \cite{teahan-cleary}, where $\approx 1.5$ bits/character were obtained), if one uses

355: $10 - 100$ MB of input text \cite{bell,sg}. This is surprising in view of two facts.

356: First of all, the methods applied in \cite{bell,sg} are very different, and one

357: might have thought a priori that they are able to use different structures of the

358: language to achieve high compression rates. Apparently they do not.

359:

360: \begin{figure}

361: %Fig 3

362: \psfig{file=fig3.ps,width=5.8cm,angle=270}

363: \caption{Entropy estimates $\hat{h}$ from pair probabilities plotted against

364:    the size of the extended alphabet. Upper curve is for the initial 7 bit

365:    alphabet, including upper and lower case letters. The lower curve is for the

366:    reduced (46 letter) initial alphabet. The smooth dotted line passing

367:    through the lower data set is a fit with Eq.(\ref{fit}).}

368: \label{fig3.ps}

369: \end{figure}

370:

371: Secondly, it is clear that $\hat{h}\approx 1.8$ bits/character is not a realistic

372: estimate of the true entropy of written English. Even though we can not, with our

373: present text lengths and our computational resources, go to much larger alphabet sizes

374: (i.e. to more substitutions), it is clear from Fig.3 that both curves would continue

375: to decrease. Let us denote by $i$ the number of substitutions. Then empirical

376: fits to both curves in Fig.3 are given by

377: \be

378:    \hat{h}_i = h + {c\over (i+i_0)^\alpha } \;.                \label{fit}

379: \ee

380: Such a fit to the 46 letter data, with $h=0.7, i_0=34, c=4.99,$ and $\alpha = 0.1745$,

381: is also shown in Fig.3. One should of course not take it too seriously in view of the

382: very slow convergence with $i$ and the very long extrapolation, but it suggests that

383: the true entropy of written English is $0.7\pm 0.2$ bits/character.
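As a rough consistency check of the quoted parameters (our own arithmetic, with rounded intermediate values), evaluating Eq.(\ref{fit}) at $i=6000$ gives
\be
   \hat{h}_{6000} = 0.7 + {4.99 \over 6034^{0.1745}} \approx 0.7 + {4.99\over 4.57}
                  \approx 1.79 \;{\rm bits/character}, \nonumber
\ee
in agreement with the value $\approx 1.8$ bits/character found after 6000 substitutions. The slow convergence can also be made explicit: within this fit, reaching $\hat{h}_i = 1.0$ bits/character would require $(i+i_0)^\alpha = 4.99/0.3 \approx 16.6$, i.e. $i$ of order $10^7$ substitutions, assuming the fitted form remained valid that far.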

384:

385: This estimate is somewhat lower than the estimate of \cite{cov-king} and the

386: extrapolations given in \cite{sg}. It is

387: comparable with that of \cite{grass-ieee} and with Shannon's original estimate

388: \cite{shannon2}. It seems definitely to exclude the possibility $h=0$ which was

389: proposed in \cite{hilberg,ebel-posch}.

390:

391: \section{Conclusions}

392:

393: We have shown how a strategy of non-sequential replacements of pairs of characters

394: can yield efficient data compression and entropy estimates. A similar

395: strategy was first proposed by Jim\'enez-Monta\~no and others, but details and the

396: actual coding done in the present paper are quite different from those proposed in

397: \cite{jimenes,poeschl}. Indeed, this strategy was never used in \cite{jimenes,poeschl}

398: for actual codings, and it was also not used for realistic entropy estimates.

399:

400: Compared to conventional sequential codes (such as Lempel-Ziv or arithmetic

401: codes \cite{cover}, just to mention two), the present method would be much

402: slower. Instead of a single pass through the data as in sequential coding

403: schemes, we went through the data file up to 6000 times in order to

404: achieve a high compression rate. We could of course make do with far fewer passes

405: if we were content with compression rates comparable to those of commercial

406: packages such as ``zip" or ``compress". For written English these achieve typically

407: compression factors $\approx 2.6$, i.e. ca. 3 bits/character. As seen from Fig.3,

408: this can be achieved by NSRPS very easily with very few passes, but even then the

409: overhead and the computational complexity of NSRPS is much too high to make it

410: a practical alternative.

411:

412: NSRPS can be seen as a greedy and extremely simple version of off-line textual

413: substitution \cite{storer}. In combination with other sophisticated techniques,

414: similar substitutions can give excellent results \cite{teahan-cleary}. But without

415: these techniques, it is in general believed that only much more sophisticated

416: versions of off-line textual substitution are of any interest \cite{storer}.

417: Again this is presumably true as far as practical coding schemes are concerned.

418: But things seem to be different if one is interested in entropy estimation. Here the

419: present method is much simpler (even though computationally more demanding) than

420: the tree-based gambling algorithms \cite{sg,bell} that had given the best results

421: up to now. Without extrapolation, it gives the same (upper bound) estimates

422: as these methods. But it seems that it allows a more reliable extrapolation to

423: infinite text length and infinite substitution depth, and thus a more reliable

424: estimate of the true asymptotic entropy.

425:

426: From the mathematical point of view, we should however stress that we have only

427: partial results. While we have proven that the Markov structure is a fixed point

428: of the substitution, we have not proven that it is {\it attractive}. We thus

429: cannot prove that the present strategy is indeed universally

430: optimal, although we believe that our numerical results strongly support this

431: conjecture. A rigorous proof would of course be extremely welcome.

432:

433: I thank Ralf Andrzejak, Hsiao-Ping Hsu, and Walter Nadler for carefully reading

434: the manuscript and for useful discussions.

435:

436:

437: \begin{thebibliography}{99}

438: \bibitem{shannon} C.E. Shannon and W. Weaver, {\it The Mathematical

439:    Theory of Communication} (Univ. of Illinois Press, Urbana 1949).

440: \bibitem{kolmogorov} A.N. Kolmogorov, IEEE Trans. Inf. Theory

441:    {\bf IT-14}, 662 (1968).

442: \bibitem{chaitin} G.J. Chaitin, {\it Algorithmic Information Theory}

443:    (Cambridge Univ. Press, New York 1987).

444: \bibitem{li-vitanyi} M. Li and P. Vit\'anyi, {\it An Introduction to

445:    Kolmogorov Complexity and its Applications} (Springer, New York 1997).

446: \bibitem{cover} T.M. Cover and J.A. Thomas, {\it Elements of Information Theory}

447:    (Wiley Interscience, 1991).

448: \bibitem{jimenes} W. Ebeling and M.A. Jim\'enez-Monta\~no, Math. Biosc.

449:    {\bf 52}, 53 (1980);

450:    M.A. Jim\'enez-Monta\~no, Bull. Math. Biol. {\bf 46}, 641 (1984);

451:    P.E. Rapp, I.D. Zimmermann, E.P. Vining, N. Cohen, A.M. Albano, and

452:    M.A. Jim\'enez-Monta\~no, Phys. Lett. A {\bf 192}, 27 (1994).

453: \bibitem{poeschl} M.A. Jim\'enez-Monta\~no, W. Ebeling, and T. P\"oschel,

454:    preprint arXiv:cond-mat/0204134 (2002).

455: \bibitem{footnote0} Actually, Jim\'enez-Monta\~no {\it et al.} use somewhat

456:    different schemes. Also, we found the names given in

457:    \cite{jimenes,poeschl} to their algorithms somewhat misleading,

458:    since they refer to grammatical categories, while we are dealing with

459:    probability measures.

460: \bibitem{footnote} In \cite{poeschl}, e.g., it is assumed that a character from

461:    a two-letter alphabet can still be encoded by one bit, after the first pair

462:    has been replaced by a ``non-terminal node", in their notation. This is

463:    not true, since encoding this character now must fix a choice between

464:    {\it three} (instead of two) possibilities.

465: \bibitem{gutenberg} http://promo.net/pg/.

466: \bibitem{herzel} B. Harris, Colloquia Mathematica Societatis J\'anos Bolyai, 1975,

467:    p. 323; H. Herzel, Syst. Anal. Model Sim. {\bf 5}, 435 (1988).

468: \bibitem{grass-fsc}  P. Grassberger, Phys. Lett. A {\bf 128}, 369 (1988).

469: \bibitem{footnote2} We use Eq.(13) of \cite{grass-fsc}, but with a misprint

470:    corrected: The denominator of the last term should be $(n_i+1)n_i$ instead

471:    of $n_i+1$.

472: \bibitem{teahan-cleary} W.J. Teahan and J.G. Cleary, {\it The entropy of English

473:    using PPM-based models}, Proc. of Data Compression Conf., Los Alamos (1996).

474: \bibitem{bell} T.C. Bell, J.G. Cleary, and I.H. Witten, {\it Text Compression}

475:    (Prentice-Hall, Englewood Cliffs, NJ, 1990).

476: \bibitem{cov-king} T. Cover and R. King, IEEE Trans. Inf. Theory {\bf IT-24}, 413

477:    (1978).

478: \bibitem{sg} T. Sch\"urmann and P. Grassberger, CHAOS {\bf 6}, 414 (1996).

479: \bibitem{grass-ieee} P. Grassberger, IEEE Trans. Inf. Theory {\bf IT-35}, 669

480:    (1989).

481: \bibitem{shannon2} C.E. Shannon, Bell Syst. Tech. J. {\bf 30}, 50 (1951).

482: \bibitem{hilberg} W. Hilberg, Frequenz {\bf 44}, 243 (1990).

483: \bibitem{storer} J.A. Storer, {\it Data Compression} (Computer Science Press,

484:    Rockville, MD, 1988).

485: \bibitem{ebel-posch} W. Ebeling and T. P\"oschel, Europhys. Lett. {\bf 26}, 241

486:    (1994).

487:

488: \end{thebibliography}

489:

490: \end{document}

491: