physics0207023/m.tex
1: \documentclass[aps,twocolumn,floatfix,showpacs]{revtex4}
2: \usepackage{epsfig}
3: \newcommand{\be}{\begin{equation}}
4: \newcommand{\ee}{\end{equation}}
5: \newcommand{\ftnt}{\footnote}
6: 
7: \begin{document}
8: 
9: \title{Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution}
10: 
11: \author{Peter Grassberger}
12: 
13: \affiliation{John-von-Neumann Institute for Computing, Forschungszentrum J\"ulich,
14: D-52425 J\"ulich, Germany}
15: 
16: \date{\today}
17: 
18: \begin{abstract}
19: We argue that Non-sequential Recursive Pair Substitution (NSRPS) as suggested by 
20: Jim\'enez-Monta\~no and Ebeling can indeed be used as a basis for an optimal data 
21: compression algorithm. In particular, we prove for Markov sequences that NSRPS together 
22: with suitable codings of the substitutions and of the substitute series does not lead 
23: to a code length increase, in the limit of infinite sequence length. When applied 
24: to written English, NSRPS gives entropy estimates which are very close to those obtained 
by other methods. Using ca. 150 MB of input data from Project Gutenberg, we estimate
26: the effective entropy to be $\approx 1.82$ bit/character. Extrapolating to infinitely 
long input, the true value of the entropy is estimated as $\approx 0.7$ bit/character.
28: \end{abstract}
29: 
30: \pacs{02.50.-r, 05.10.-a, 05.45.Tp}
31: 
32: \maketitle
33: 
34: \section{Introduction}
35: 
36: The discovery that the amount of information in a message (or in any other structure)
37: can be objectively measured was certainly one of the major scientific achievements of 
38: the 20th century. On the theoretical side, this quantity -- the information theoretic 
39: entropy -- is of interest mainly because of its close relationship to thermodynamic
40: entropy, its importance for chaotic systems, and its role in Bayesian 
41: inference (maximum entropy principle). Practically, estimating 
42: the entropy of a message (text document, picture, piece of music, etc.) is important 
43: because it measures its compressibility, i.e. the optimal achievement for any 
44: possible compression algorithm. In the following, we shall always deal with sequences
45: $(s_0,s_1,\ldots)$ built from the characters of a finite alphabet $A = \{a_0,\ldots,a_{m-1}\}$
46: of size $m$. In the simplest case the alphabet consists just of 2 characters, in 
47: which case the maximum entropy is 1 bit per character.
48: 
49: Indeed, information entropy as introduced by Shannon \cite{shannon} is a probabilistic 
50: concept. It requires a measure (probability distribution) to be defined on the set 
of all possible sequences. In particular, the probability that $s_t$ equals $a_k$,
conditioned on all preceding characters $s_0,s_1,\ldots, s_{t-1}$, is
\begin{eqnarray}
   p_t(k|k',k'',\ldots) &=& {\rm prob}(s_t = a_k \,|\, s_{t-1}=a_{k'}, \nonumber \\
     & & \qquad s_{t-2}=a_{k''}, \ldots) .
\end{eqnarray}
57: In case of a stationary measure with finite range correlations, $p_t(k|k',k'',\ldots)$ 
58: becomes independent of $t$ for $t\to\infty$. Then Shannon's famous formula,
59: \be
60:    h = \lim_{i\to \infty} h^{(i)}
61: \ee
62: with 
63: \be
64:    h^{(i)} = - \sum_{k_1\ldots k_i} p(k_1\ldots k_i) \log_2 p(k_1|k_2\ldots k_i)\;,
65: \ee
66: gives the {\it average} information per character. The generalization to non-stationary 
67: measures is straightforward but will not be discussed here.
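
As a minimal illustration of these definitions, the first two block estimates read
\[
   h^{(1)} = -\sum_{k} p(k)\log_2 p(k), \qquad
   h^{(2)} = -\sum_{k,k'} p(k,k')\log_2 p(k|k') .
\]
Since conditioning on more preceding characters cannot increase the uncertainty,
$h^{(i+1)}\le h^{(i)}$, so that every $h^{(i)}$ is an upper bound on $h$. For an iid 
sequence one has $h=h^{(1)}$, while for a first order Markov sequence $h=h^{(2)}$.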
68: 
69: In contrast to this approach are attempts to define the {\it exact} information content of a single 
70: finite sequence. Theoretically, the basic concept here is the {\it algorithmic complexity} 
71: AC (or algorithmic {\it randomness}) \cite{kolmogorov,chaitin}. For any given universal 
72: computer $U$, the AC of a sequence $S$ relative to $U$ is given by
the length of the shortest program which, when input to $U$, prints $S$ and then makes 
$U$ stop, so that the next sequence can be read. If $S$ is randomly drawn from 
75: a stationary ensemble with entropy $h$, then one can show that the AC
76: per character tends towards $h$, for almost all $S$ and all $U$, as the length of $S$
77: tends towards infinity \cite{li-vitanyi}. Thus, except for rare sequences which do not 
78: contribute to averages, $h$ sets the limit for the compressibility.
79: 
80: Practically, the usefulness of AC is limited by the fact that there cannot exist any 
81: algorithm which finds for each $S$ its shortest code (such an algorithm could be used to 
82: solve Turing's halting problem, which is known to be impossible) \cite{li-vitanyi}. 
83: But one can give algorithms which are often quite efficient. Huffman,
84: arithmetic, and Lempel-Ziv coding are just three well known examples \cite{cover}.
85: Any such algorithm can be used to give an upper bound to $h$ (modulo fluctuations from 
86: the finite sequence length) while, inversely, knowledge of $h$ sets a lower limit to 
87: the average code lengths possible with these codes.
88: 
89: A data compression scheme is called {\it optimal}, if it does not do much worse than the 
90: best possible for typical random strings. More precisely, let $\{S\}$ be a set of sequences
91: with entropy $h(S)$, and let the code string $C(S)$ be built from an alphabet of $m_C$
92: characters. Then we call the coding scheme $C: S\to C(S)$ optimal, if 
93: \be
94:    {{\rm length}[C(S)] \over {\rm length}[S]} \to {h \over \log_2 m_C }
95:                             \quad {\rm for} \;\; {\rm length}[S] \to \infty
96: \ee
97: and for nearly all $S$.
98: While Huffman coding is not optimal, arithmetic and Lempel-Ziv codings are \cite{cover}.
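
For a concrete number, consider (as an assumed example) an iid binary source with 
$p(0)=0.29$ and $p(1)=0.71$, encoded with a binary code alphabet ($m_C=2$). Its entropy is
\[
   h = -0.29\log_2 0.29 - 0.71\log_2 0.71 \approx 0.869 \;{\rm bit},
\]
so an optimal scheme must compress a typical sequence of $N$ characters to 
$\approx 0.869\,N$ code bits for large $N$, and no uniquely decodable scheme can 
do better on average.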
99: 
100: In several papers, Jim\'enez-Monta\~no, Ebeling, and others \cite{jimenes,poeschl} have 
101: suggested coding schemes by non-sequential recursive pair substitution (NSRPS) \cite{footnote0}. 
102: Call the original sequence $S_0$. We count the numbers $n_{jk}$ of non-overlapping successive 
103: pairs of characters in $S_0$ where $s_t = a_j$ and $s_{t+1} = a_k$, and find their maximum,
104: $n_{\rm max} = \max_{j,k< m} n_{jk}$. The corresponding index pair is $(j_0,k_0)$. 
105: Then we introduce a new character by concatenation
106: \be
107:    a_m = (a_{j_0}a_{k_0})
108: \ee
109: and form the sequence $S_1$ by replacing everywhere the pair $a_{j_0}a_{k_0}$ by $a_m$. For 
110: the special case of $j_0 = k_0$, any string of $2r+1$ characters $a_{j_0}$ is replaced by $r$ 
111: characters $a_m$, followed by one $a_{j_0}$.
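
To illustrate the non-overlapping counting with a toy string (chosen here only as 
an example), take the binary sequence $S_0 = 0000100$ with $m=2$. Reading from left 
to right, the pair $00$ occurs at three non-overlapping positions, while $01$ and $10$ 
occur once each. Hence $(j_0,k_0)=(0,0)$, the new character is $a_2 = (00)$, and
\[
   S_0 = 00\,00\,1\,00 \;\longrightarrow\; S_1 = 2\,2\,1\,2 \;.
\]
A run of odd length such as $000$ would become $20$, in line with the rule for 
$2r+1$ equal characters (here $r=1$).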
112: 
113: This is then repeated recursively: The sequence $S_{i+1}$ is obtained from $S_i$ by replacing 
114: the most frequent pair $a_{j_i}a_{k_i}$ by a new character $a_{m+i}$. The procedure stops
115: if one can argue that further replacements would not possibly be of any use. Typically this 
116: will happen if the code length consisting of both a description of $S_{i+1}$ and a description
117: of the pair $(j_i,k_i)$ is definitely longer than a description of $S_i$, for the present 
118: and all subsequent $i$.
119: 
120: Thus one sees that efficient encodings (which must also be uniquely decodable!) of the 
121: sequences $S_i$ and of the type of substituted pairs become crucial for the analysis of NSRPS. 
122: Unfortunately, the ``codings" given in \cite{jimenes,poeschl} are neither efficient nor 
123: uniquely decodable \cite{footnote}. Thus their ``complexities" have no direct relationship 
124: to $h$ or to algorithmic complexity (in contrast to their claim), and it is not clear from 
125: their work whether NSRPS can be made into an optimal coding scheme at all.
126: 
127: It is the purpose of the present paper to give at least partial answers to this.
128: More precisely, we shall only be concerned with the limit of infinitely long 
129: strings, where the information encoded in the pairs $(j_i,k_i)$ can be neglected
130: in comparison with the information stored in $S_i$, at least for any finite $i$.
131: We will first show analytically that a coding scheme for $S_i$ exists which 
132: satisfies a necessary condition for optimality (Sec.2). We then apply this to written 
133: English (Sec.3), where we shall also compare our estimates of $h$ to those obtained 
134: with other methods.
135: 
136: \section{NSRPS for Markov sequences}
137: 
138: Let us for the moment assume that $S_0$ is binary (the two characters are ``0" and ``1"),
139: and that it is completely random, i.e. identically and independently distributed (iid) 
140: with the same probability for each character. Thus $p(0|\ldots) = p(1|\ldots) = 1/2$, and 
141: $h=1$ bit. The length of $S_0$ is $N_0$, thus the total average information 
142: stored in $S_0$ is $N_0$ bits.
143: 
144: No coding scheme can reduce the length of $C(S_0)$ to less than $N_0$ bits
145: on average. Indeed, all schemes will have ${\rm length}[C(S_0)] > N_0$ bits (strict
146: inequality!), unless the ``coding" is a verbatim copy. For a coding scheme to be 
147: optimal, a necessary (but not sufficient) condition is that 
148: \be 
149:    {\rm length}[C(S_0)] / N_0 \to 1 \;{\rm bit}
150: \ee
151: for $N_0\to\infty$, i.e. the overhead in the code must be less than extensive
152: in the sequence length. This is what we want to show here, together with its 
153: generalization to arbitrary (first order) Markov sequences.
154: 
155: For this, we need two lemmata:
156: 
157: {\bf Lemma 1}: {\it For any Markov sequence $S_0$ (not necessarily binary, and not 
158: necessarily iid) built from $m$ letters, the sequence $S_1$ is again Markov.}
159: 
160: {\bf Lemma 2}: {\it If a word $w = (k,k',k'',\ldots)$ appears several times in $S_0$,
161: and if one of these instances is substituted in $S_i$ by a string of characters
162: not straddling its boundaries, then all other instances of $w$ in $S_0$ are also
163: substituted in $S_i$ by the same string.}
164: 
Lemma 1 tells us that NSRPS might make the structure of $S_i$ more complex than 
that of $S_0$, but not much so. Since $S_i$ is a Markov chain, its entropy can be estimated
if the transition probabilities $p(k|k')$ are known. Thus estimating the entropy 
168: of $S_i$ reduces to estimating di-block entropies $h^{(2)}$, which is straightforward (at 
169: least in the limit $N_0\to\infty$).
170: 
171: Lemma 2 tells us that there cannot be any ambiguity in $S_i$. In particular, 
172: it cannot happen that more information is needed to specify $S_i$ than there 
173: is needed to specify $S_0$, since the mapping $S_0\to S_i$ is bijective, once 
174: the substitution rules are fixed.
175: 
176: The proofs of the lemmata are easy. Let us denote by $p_j(\ldots)$ the probability 
177: distributions after $j$ pair substitutions. For lemma 1 we just have to show that 
178: $p_1(k|k',k'')$ is independent of $k''$ for each pair $(k,k')$, provided the 
179: same holds also for $p_0$. This follows basically from the fact that any 
180: substitution makes the sequence shorter. But the detailed proof is somewhat 
181: tedious, because $p_1(k|k',k'')\neq p_0(k|k',k'')$, even if all $k$'s are less than 
182: $m$, $k\neq k_0$, $k''\neq j_0$, and neither $(k,k')$ nor $(k',k'')$ are equal to 
183: the pair $(j_0,k_0)$. In that case,
184: $(N_0-n_{\rm max}) p_1(k|k',k'') = N_0 p_0(k|k',k'')$, and independence of $k''$ follows
185: immediately. All other cases have to be dealt with similarly. For instance, if 
186: either $(k,k')$ or $(k',k'')$ is the pair $(j_0,k_0)$, then $p_1(k,k',k'')=0$. Else,
187: if $k''=m\neq k,k'$, then $p_1(k|k',k'') = N_0/ (N_0-n_{\rm max}) p_0(k|k',j_0,k_0) = 
188: N_0/ (N_0-n_{\rm max}) p_0(k|k')$. We leave the other cases as exercises to the reader.
189: 
190: For proving lemma 2 we proceed indirectly. We assume that there is a word in 
191: $S_0$ which is encoded differently in different locations. Let us assume 
192: that this difference happened for the first time after $i$ substitutions.
193: Since only one type of pair is exchanged in each step, this means that a 
194: substitution is skipped in one of the locations, at this step. But this is 
195: impossible, since {\it all} possible substitutions are made at each step.
196: 
197: From the two lemmata we obtain immediately our central 
198: 
199: {\bf Theorem:} {\it If $S_0$ is drawn from a (first order) Markov process with length $N_0$ 
200: and entropy $h_0 =  - \sum_{k,k'} p_0(k,k') \log_2 p_0(k|k')$, then every $S_i$ 
201: is also Markovian in the limit $N_0\to\infty$, with entropy 
202: \be
203:    h_i = h_i^{(2)} =  - \sum_{k,k'} p_i(k,k') \log_2 p_i(k|k')                \label{h2}
204: \ee
205: and with length $N_i$ satisfying $N_i/N_0 = h_0/h_i$.}
206: 
207: Thus the total amount of information needed to specify $S_i$ is the same as that for 
208: $S_0$, for infinitely long sequences. Since the overhead needed to specify the pairs 
$(j_i,k_i)$ can be neglected in this limit, we see that we do not lose code length 
210: efficiency by pair substitution, provided we take pair probabilities correctly 
211: into account during the coding. The actual encoding can be done by means of an arithmetic
212: code based on the probabilities $p_i(k|k')$ \cite{cover}, but we shall not work out 
213: the details. It is enough to know that the code length then becomes equal to the 
214: information (both measured in bits), for $N_0\to\infty$.
215: 
216: Let us see in detail how all this works for completely random iid binary 
217: sequences. The original sequence $S_0 = 00101001111010011011\ldots$ 
218: has $p_0(00)=p_0(01)=p_0(10)=p_0(11)=1/4$ and therefore $h_0 = 1$ bit. Thus 
219: we can, without loss of generality, assume that the new character is 
220: $2 = (01)$, so that $S_1 = 02202111202121\ldots$. The 3 characters are 
221: now equiprobable, $p_1(0)=p_1(1)=p_1(2)=1/3$, but they are not independent
222: since of course $p_1(01)=0$. Indeed, one finds $p_1(00)=p_1(02)=p_1(11)=p_1(21)=1/6,
223: \;p_1(10)=p_1(12)=p_1(20)=p_1(22)=1/12$. The order-2 entropy of $S_1$ is easily
224: calculated as $h_1^{(2)} = 4/3 \log_2 2$. On the other hand, since $N_0/4$ pairs 
225: have been replaced by single characters, the length of $S_1$ is $N_1=3N_0/4$. Thus,
226: if $S_1$ is Markov, then the total information needed to specify it is 
227: $N_1 h_1^{(2)} = N_0$ bits, the same as for $S_0$. If it were not Markov, its
228: information would be smaller. But this cannot be, because the map $S_0\to S_1$ 
229: was invertible. Thus $S_1$ must indeed be Markov, as can also be checked explicitly.
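
The value of $h_1^{(2)}$ quoted above can be obtained explicitly. Writing $p_1(ab)$ 
for the probability that $a$ is immediately followed by $b$ in $S_1$, the probability 
of $b$ following a given $a$ is $p_1(ab)/p_1(a)$ with $p_1(a)=1/3$. After a $0$ the 
next character is thus $0$ or $2$ with probability $1/2$ each (conditional entropy 
1 bit), while after a $1$ or a $2$ it is $1$ with probability $1/2$ and $0$ or $2$ with 
probability $1/4$ each (conditional entropy $3/2$ bits). Hence
\begin{eqnarray*}
   h_1^{(2)} &=& \frac{1}{3}\cdot 1 + \frac{1}{3}\cdot\frac{3}{2} 
                + \frac{1}{3}\cdot\frac{3}{2} = \frac{4}{3}\;{\rm bits}, \\
   N_1 h_1^{(2)} &=& \frac{3N_0}{4}\cdot\frac{4}{3} = N_0 \;{\rm bits}.
\end{eqnarray*}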
230: 
231: In the next step, we can either replace $(21) \to 3$ or $(02) \to 3$, since both 
232: have the same probability. If we do the former, the sequence becomes 
233: $S_2 = 02203112033\ldots$. Now the letters are no longer equiprobable,
234: $p_2(1)=p_2(2)=p_2(3)=1/5$, $p_2(0)=2/5$. Calculating $N_2, p_2(kk')$, and 
235: $h_2^{(2)}$ is straightforward, and one finds again $N_2 h_2^{(2)} = N_0$ bits. 
236: Thus one concludes that $S_2$ must also be Markov. For the next 
237: few steps one can still verify 
238: \be
   N_i h_i^{(2)} = \ldots = N_0 \; {\rm bits},           \label{same}
240: \ee
241: by hand, but this becomes increasingly tedious as $i$ increases.
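
The step for $S_2$ can be made explicit, assuming that $S_1$ is Markov with the 
transition probabilities implied by the pair probabilities given above (after $0$: 
$0$ or $2$ with probability $1/2$ each; after $1$ or $2$: $1$ with probability $1/2$, 
$0$ or $2$ with probability $1/4$ each). Replacing $(21)\to 3$ removes 
$N_1 p_1(21) = N_0/8$ characters, so that $N_2 = 5N_0/8$. Since a $2$ in $S_1$ is 
followed by $1$ with probability $1/2$, the character following a $0$ or a $2$ in $S_2$ 
is $0$ with probability $1/2$ and $2$ or $3$ with probability $1/4$ each, while the 
character following a $1$ or a $3$ is $1$ with probability $1/2$, $0$ with probability 
$1/4$, and $2$ or $3$ with probability $1/8$ each. The corresponding conditional 
entropies are $3/2$ and $7/4$ bits, so that
\begin{eqnarray*}
   h_2^{(2)} &=& \frac{3}{5}\cdot\frac{3}{2} + \frac{2}{5}\cdot\frac{7}{4} 
              = \frac{8}{5}\;{\rm bits}, \\
   N_2 h_2^{(2)} &=& \frac{5N_0}{8}\cdot\frac{8}{5} = N_0 \;{\rm bits}.
\end{eqnarray*}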
242: 
243: \begin{figure}
244: %Fig 1
245: \psfig{file=fig1.ps,width=5.8cm,angle=270}
246: \caption{Results for a completely random (iid, uniformly distributed) binary
247:   initial sequence of $N_0 = 8\times 10^8$ bits, plotted against the size of 
248:   the extended alphabet. Uppermost curve: code length needed to encode $S_i$, 
249:   divided by $N_0$, if $\log_2 (i+2)$ bits are used for each character. Middle 
250:   curve: code length based on $h_i^{(1)}$, i.e. the single-character distributions 
251:   $p_i(k)$ are used in the encoding. Lowest curve, indistinguishable on this
252:   scale from a horizontal straight line: 
253:   code length based on $h_i^{(2)}$, using the two-character distributions $p_i(k,k')$.}
254: \label{fig1.ps}
255: \end{figure}
256: 
257: Thus we have verified Eq.(\ref{same}) by extensive simulations, where we found 
258: that it is exact, within the expected fluctuations, up to several thousand 
259: substitutions (Fig.1). The distribution of the probabilities $p_i(k)$ becomes very 
260: wide for large $i$, i.e. the sequences $S_i$ are far from uniform for large $i$, 
261: but they are Markov and their entropies $h_i^{(2)}$ are exactly (within 
262: the expected systematic finite sample corrections \cite{herzel,grass-fsc})
equal to $N_0/N_i$ bits. Notice that if we encoded the last $S_i$ without
taking the correlations into account (as seems to be suggested in
\cite{jimenes,poeschl}), the code length for it would be larger and the 
coding scheme would not be optimal.
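
The difference between the three code lengths in Fig.1 can already be seen by hand 
for the first two substitutions of the iid binary example. For $i=1$ the three characters 
are equiprobable and $N_1 = 3N_0/4$, so that the first two code lengths coincide,
\[
   \frac{N_1}{N_0}\log_2 3 = \frac{N_1}{N_0}\, h_1^{(1)} 
       = \frac{3}{4}\log_2 3 \approx 1.189 \;.
\]
For $i=2$, with $N_2 = 5N_0/8$ (since $N_0/8$ pairs are replaced in the second step) 
and $p_2(0)=2/5$, $p_2(1)=p_2(2)=p_2(3)=1/5$, one finds
\begin{eqnarray*}
   \frac{N_2}{N_0}\log_2 4 &=& 1.25 \;, \\
   \frac{N_2}{N_0}\, h_2^{(1)} &=& \frac{5}{8}\left(\frac{2}{5}\log_2\frac{5}{2} 
       + \frac{3}{5}\log_2 5\right) \approx 1.201 \;,
\end{eqnarray*}
whereas $N_i h_i^{(2)}/N_0 = 1$ in both cases.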
267: 
268: We have also made some simulations where we started with non-trivial Markov 
269: processes for $S_0$, or even with non-Markov sequences with known entropy. 
270: The latter were generated by creating initially a binary iid sequence with 
271: $p(0) \neq p(1)$, and then using this as an input configuration for a few 
272: iterations of the bijective cellular automaton R150 (in Wolfram's notation)
273: \cite{sg}.
274: 
275: \begin{figure}
276: %Fig 2
277: \psfig{file=fig2.ps,width=5.8cm,angle=270}
278: \caption{Ranked single character probability distributions $p_i(k)$ of strings after 
279:   $i=2298$ pair substitutions. The different curves are for a completely random iid
280:   initial string $S_0$ (solid line), iid string $S_0$ with $p_0(0)=0.29$ (long dashed), 
281:   $S_0$ obtained by applying two times CA rule 150 to an iid sequence with 
282:   $p(0)=0.09$ (dashed), and to written English with a reduced (46 character) 
283:   alphabet (dotted).}
284: \label{fig2.ps}
285: \end{figure}
286: 
From these simulations it seems that $N_i h_i^{(2)}$ always tends towards $N_0 h$, where $h$ is the entropy of $S_0$.
288: Also, the probability distributions $p_i(k)$ seem to tend (very slowly, see 
289: Fig.2) to the same scaling limit as for iid and uniform $S_0$. This suggests
that indeed $S_i$ tends to a Markov process for arbitrary $S_0$. In this
case an optimal coding would be obtained if one used, e.g., an 
arithmetic code to encode $S_i$, based on approximate values of the observed 
$p_i(k|k')$ for large $i$.
294: 
295: Thus we have given strong (but still incomplete) arguments that NSRPS combined 
296: with efficient coding of $S_i$ gives indeed an optimal coding scheme. In 
297: practice, it would of course be extremely inefficient in terms of speed, and 
298: thus of no practical relevance. But it could well be that it might lead to 
299: more stringent entropy estimates than other methods. To test this we shall
now turn to one of the most complex and interesting systems, written natural
301: language.
302: 
303: \section{The entropy of written English}
304: 
305: The data used for the application of NSRPS to entropy estimation of written 
306: English consisted of ca. 150 MB of text taken from the Project Gutenberg 
307: homepage \cite{gutenberg}. It includes mainly English and American novels
308: from the 19th and early 20th century (Austen, Dickens, Galsworthy, Melville, 
309: Stevenson, etc.), but also some technical reports (e.g. Darwin, historical 
and sociological texts, etc.), Shakespeare's collected works, the King James 
311: Bible, and some novels translated from French and Russian (Verne, Tolstoy,
312: Dostoevsky, etc.). 
313: 
314: From these texts we removed first editorial and legal remarks added by the 
315: editors of Project Gutenberg. We also removed end-of-line, end-of-page, and 
316: carriage return characters. All runs of consecutive blanks were replaced by 
317: a single blank. Finally, we also removed all characters not in 
318: the 7-bit ASCII alphabet (ca. 4200 in total). These cleaned texts were then 
319: concatenated to form one big input string of 148,214,028 characters. 
320: 
321: Entropies were estimated both from this string (which still contained upper 
and lower case letters, numbers, all kinds of brackets and punctuation marks,
323: 95 different characters in total), and from a version with reduced alphabet.
324: In the latter, we changed all letters to upper case; all brackets to either 
325: ( or ); the symbols \$,\#,\&,*,\%, @ to one single symbol; colons, exclamation and 
326: question marks to points; quotation marks to apostrophes; and semicolons to commas.
327: This reduced alphabet had then 46 letters (including, of course, the blank
328: ``$_\sqcup$").
329: 
330: The most frequent pair of letters in English is ``e$_\sqcup$". After replacing it 
331: by a new ``letter", the next pair to substitute is ``$_\sqcup$t", then ``$_\sqcup$a",
332: ``$_\sqcup$th", etc. Very soon also longer strings are substituted, e.g. after 
333: 92 steps appears the first two-word combination, ``of$_\sqcup$the$_\sqcup$".
334: 
335: As long as the number of new symbols is still small, it is easy to estimate the 
336: pair probabilities, and from this an upper bound $\hat{h}_i = h_i^{(2)}N_i/N_0$ 
337: on the entropy.  This becomes more and more difficult
338: as the alphabet size increases, as the sampling becomes insufficient even with 
339: our very long input file, and we can no longer approximate the $p_i(k,k')$ by the
340: observed relative frequencies. As long as the number of different subsequent pairs is 
341: much smaller than the sequence length (i.e., most pairs are observed many times), 
342: we can still get reliable estimates of $\hat{h}_i$ by using the leading correction 
343: term discussed in \cite{grass-fsc,footnote2}. But finally, when many pairs are seen only 
344: once in the entire text, we have to stop since any estimate of $h_i^{(2)}$ becomes 
345: unreliable.
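
In its naive (plug-in) form, this estimate is obtained directly from counts. As a 
sketch (the symbols $n_{k'k}$ and $n_{k'}$ are introduced here only for illustration), 
let $n_{k'k}$ be the number of times character $k$ directly follows $k'$ in $S_i$, and 
$n_{k'}=\sum_k n_{k'k}$. Approximating $p_i(k,k')\approx n_{k'k}/N_i$ and 
$p_i(k|k')\approx n_{k'k}/n_{k'}$ gives
\[
   \hat{h}_i \approx -\frac{1}{N_0}\sum_{k,k'} n_{k'k}\,\log_2\frac{n_{k'k}}{n_{k'}} \;.
\]
This plug-in form tends to underestimate the entropy when many pairs are rare, 
which is why a finite sample correction is needed for large alphabets.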
346: 
347: We went up to 6000 substitutions. The longest substrings substituted by a single 
348: new symbol had length 13 in the original (95 letter) alphabet, and length 16 in the 
349: reduced (46 letter) one (the latter was ``would$_\sqcup$have$_\sqcup$been$_\sqcup$").
350: The entropies $\hat{h}$ per (original) character are plotted 
351: in Fig.3. We see that they are very similar for both alphabets.
352: We find $\hat{h}\approx 1.8$ bits/character after 6000 substitutions. This number 
353: is very close to the value obtained from most other methods (with the exception of 
354: \cite{teahan-cleary}, where $\approx 1.5$ bits/character were obtained), if one uses 
355: $10 - 100$ MB of input text \cite{bell,sg}. This is surprising in view of two facts.
356: First of all, the methods applied in \cite{bell,sg} are very different, and one 
357: might have thought a priori that they are able to use different structures of the 
358: language to achieve high compression rates. Apparently they do not. 
359: 
360: \begin{figure}
361: %Fig 3
362: \psfig{file=fig3.ps,width=5.8cm,angle=270}
363: \caption{Entropy estimates $\hat{h}$ from pair probabilities plotted against 
364:    the size of the extended alphabet. Upper curve is for the initial 7 bit
365:    alphabet, including upper and lower case letters. The lower curve is for the 
366:    reduced (46 letter) initial alphabet. The smooth dotted line passing 
367:    through the lower data set is a fit with Eq.(\ref{fit}).}
368: \label{fig3.ps}
369: \end{figure}
370: 
371: Secondly, it is clear that $\hat{h}\approx 1.8$ bits/character is not a realistic
372: estimate of the true entropy of written English. Even though we can not, with our 
373: present text lengths and our computational resources, go to much larger alphabet sizes
374: (i.e. to more substitutions), it is clear from Fig.3 that both curves would continue
375: to decrease. Let us denote by $i$ the number of substitutions. Then empirical 
376: fits to both curves in Fig.3 are given by
377: \be
378:    \hat{h}_i = h + {c\over (i+i_0)^\alpha } \;.                \label{fit}
379: \ee
380: Such a fit to the 46 letter data, with $h=0.7, i_0=34, c=4.99,$ and $\alpha = 0.1745$, 
is also shown in Fig.3. One should of course not take it too seriously in view of the 
382: very slow convergence with $i$ and the very long extrapolation, but it suggests that 
383: the true entropy of written English is $0.7\pm 0.2$ bits/character.
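
To get a feeling for the slowness of the convergence, one can evaluate 
Eq.(\ref{fit}) with the quoted parameters. At $i=6000$,
\[
   \hat{h}_{6000} = 0.7 + \frac{4.99}{6034^{\,0.1745}} \approx 1.79 
       \;{\rm bits/character},
\]
consistent with the measured value. Taking the fit at face value far outside the 
fitted range (an assumption that should be treated with caution), reaching 
$\hat{h}_i = 1.0$ bit/character would require $(i+i_0)^\alpha = 4.99/0.3$, i.e. 
$i\approx 10^7$ substitutions.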
384: 
This estimate is somewhat lower than the estimate of \cite{cov-king} and the 
386: extrapolations given in \cite{sg}. It is 
387: comparable with that of \cite{grass-ieee} and with Shannon's original estimate 
388: \cite{shannon2}. It seems definitely to exclude the possibility $h=0$ which was 
389: proposed in \cite{hilberg,ebel-posch}.
390: 
391: \section{Conclusions}
392: 
393: We have shown how a strategy of non-sequential replacements of pairs of characters
394: can yield efficient data compression and entropy estimates. A similar
395: strategy was first proposed by Jim\'enez-Monta\~no and others, but details and the 
396: actual coding done in the present paper are quite different from those proposed in 
397: \cite{jimenes,poeschl}. Indeed, this strategy was never used in \cite{jimenes,poeschl}
398: for actual codings, and it was also not used for realistic entropy estimates.
399: 
400: Compared to conventional sequential codes (such as Lempel-Ziv or arithmetic
401: codes \cite{cover}, just to mention two), the present method would be much 
slower. Instead of a single pass through the data as in sequential coding 
schemes, we went through the data file up to 6000 times, in order to 
achieve a high compression rate. We could of course do with far fewer passes
if we were content with compression rates comparable to those of commercial
406: packages such as ``zip" or ``compress". For written English these achieve typically
compression factors $\approx 2.6$, i.e. ca. 3 bits/character. As seen from Fig.3,
this can be achieved by NSRPS very easily with very few passes, but even then the 
overhead and the computational complexity of NSRPS are much too high to make it 
410: a practical alternative.
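
The numbers quoted here follow from simple arithmetic, assuming that each 
character of the uncompressed text is stored in one 8-bit byte. A compression 
factor of $2.6$ then corresponds to
\[
   \frac{8}{2.6} \approx 3.1 \;{\rm bits/character},
\]
while $\hat{h}\approx 1.8$ bits/character after 6000 substitutions would correspond 
to a factor $8/1.8\approx 4.4$. If the fit of Eq.(\ref{fit}) for the 46 letter alphabet 
is used as a rough guide, $\hat{h}_i = 3$ bits/character is reached at 
$i+i_0 = (4.99/2.3)^{1/\alpha}\approx 85$, i.e. after about 50 substitutions.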
411: 
412: NSRPS can be seen as a greedy and extremely simple version of off-line textual 
413: substitution \cite{storer}. In combination with other sophisticated techniques, 
414: similar substitutions can give excellent results \cite{teahan-cleary}. But without
415: these techniques, it is in general believed that only much more sophisticated 
416: versions of off-line textual substitution are of any interest \cite{storer}.
417: Again this is presumably true as far as practical coding schemes are concerned.
418: But things seem to be different if one is interested in entropy estimation. Here the 
419: present method is much simpler (even though computationally more demanding) than 
420: the tree-based gambling algorithms \cite{sg,bell} that had given the best results
421: up to now. Without extrapolation, it gives the same (upper bound) estimates 
422: as these methods. But it seems that it allows a more reliable extrapolation to 
423: infinite text length and infinite substitution depth, and thus a more reliable
424: estimate of the true asymptotic entropy. 
425: 
426: From the mathematical point of view, we should however stress that we have only 
427: partial results. While we have proven that the Markov structure is a fixed point 
428: of the substitution, we have not proven that it is {\it attractive}. We thus 
429: cannot prove that the present strategy is indeed universally
430: optimal, although we believe that our numerical results strongly support this 
431: conjecture. A rigorous proof would of course be extremely welcome.
432: 
433: I thank Ralf Andrzejak, Hsiao-Ping Hsu, and Walter Nadler for carefully reading 
434: the manuscript and for useful discussions.
435: 
436: 
437: \begin{thebibliography}{99}
\bibitem{shannon} C.E. Shannon and W. Weaver, {\it The Mathematical 
   Theory of Communication} (Univ. of Illinois Press, Urbana 1949).
\bibitem{kolmogorov} A.N. Kolmogorov, IEEE Trans. Inf. Theory 
   {\bf IT-14}, 662 (1968).
442: \bibitem{chaitin} G.J. Chaitin, {\it Algorithmic Information Theory} 
443:    (Cambridge Univ. Press, New York 1987).
444: \bibitem{li-vitanyi} M. Li and P. Vit\'anyi, {\it An Introduction to 
445:    Kolmogorov Complexity and its Applications} (Springer, New York 1997).
446: \bibitem{cover} T.M. Cover and J.A. Thomas, {\it Elements of Information Theory}
447:    (Wiley Interscience, 1991).
448: \bibitem{jimenes} W. Ebeling and M.A. Jim\'enez-Monta\~no, Math. Biosc. 
449:    {\bf 52}, 53 (1980);
450:    M.A. Jim\'enez-Monta\~no, Bull. Math. Biol. {\bf 46}, 641 (1984);
451:    P.E. Rapp, I.D. Zimmermann, E.P. Vining, N. Cohen, A.M. Albano, and 
   M.A. Jim\'enez-Monta\~no, Phys. Lett. A {\bf 192}, 27 (1994).
453: \bibitem{poeschl} M.A. Jim\'enez-Monta\~no, W. Ebeling, and T. P\"oschel,
454:    preprint arXiv:cond-mat/0204134 (2002).
455: \bibitem{footnote0} Actually, Jim\'enez-Monta\~no {\it et al.} use somewhat
456:    different schemes. Also, we found the names given in 
457:    \cite{jimenes,poeschl} to their algorithms somewhat misleading,
458:    since they refer to grammatical categories, while we are dealing with
459:    probability measures.
460: \bibitem{footnote} In \cite{poeschl}, e.g., it is assumed that a character from 
461:    a two-letter alphabet can still be encoded by one bit, after the first pair
462:    has been replaced by a ``non-terminal node", in their notation. This is 
463:    not true, since encoding this character now must fix a choice between
464:    {\it three} (instead of two) possibilities. 
465: \bibitem{gutenberg} http://promo.net/pg/.
\bibitem{herzel} B. Harris, Colloquia Mathematica Societatis J\'anos Bolyai, 1975,
467:    p. 323; H. Herzel, Syst. Anal. Model Sim. {\bf 5}, 435 (1988).
468: \bibitem{grass-fsc}  P. Grassberger, Phys. Lett. A {\bf 128}, 369 (1988).
469: \bibitem{footnote2} We use Eq.(13) of \cite{grass-fsc}, but with a misprint 
470:    corrected: The denominator of the last term should be $(n_i+1)n_i$ instead 
471:    of $n_i+1$.
472: \bibitem{teahan-cleary} W.J. Teahan and J.G. Cleary, {\it The entropy of English
   using PPM-based models}, Proc. of Data Compression Conf., Los Alamos (1996).
474: \bibitem{bell} T.C. Bell, J.G. Cleary, and I.H. Witten, {\it Text Compression}
475:    (Prentice-Hall, Englewood Cliffs, NJ, 1990).
476: \bibitem{cov-king} T. Cover and R. King, IEEE Trans. Inf. Theory {\bf IT-24}, 413
   (1978).
478: \bibitem{sg} T. Sch\"urmann and P. Grassberger, CHAOS {\bf 6}, 414 (1996).
479: \bibitem{grass-ieee} P. Grassberger, IEEE Trans. Inf. Theory {\bf IT-35}, 669
480:    (1989).
\bibitem{shannon2} C.E. Shannon, Bell Syst. Tech. J. {\bf 30}, 50 (1951).
482: \bibitem{hilberg} W. Hilberg, Frequenz {\bf 44}, 243 (1990).
483: \bibitem{storer} J.A. Storer, {\it Data Compression} (Computer Science Press, 
484:    Rockville, MD, 1988).
485: \bibitem{ebel-posch} W. Ebeling and T. P\"oschel, Europhys. Lett. {\bf 26}, 241 
486:    (1994).
487: 
488: \end{thebibliography}
489: 
490: \end{document}
491: