1: \documentclass[aps,twocolumn,floatfix,showpacs]{revtex4}
2: \usepackage{epsfig}
3: \newcommand{\be}{\begin{equation}}
4: \newcommand{\ee}{\end{equation}}
5: \newcommand{\ftnt}{\footnote}
6:
7: \begin{document}
8:
9: \title{Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution}
10:
11: \author{Peter Grassberger}
12:
13: \affiliation{John-von-Neumann Institute for Computing, Forschungszentrum J\"ulich,
14: D-52425 J\"ulich, Germany}
15:
16: \date{\today}
17:
18: \begin{abstract}
19: We argue that Non-sequential Recursive Pair Substitution (NSRPS) as suggested by
20: Jim\'enez-Monta\~no and Ebeling can indeed be used as a basis for an optimal data
21: compression algorithm. In particular, we prove for Markov sequences that NSRPS together
22: with suitable codings of the substitutions and of the substitute series does not lead
23: to a code length increase, in the limit of infinite sequence length. When applied
24: to written English, NSRPS gives entropy estimates which are very close to those obtained
by other methods. Using ca. 150 MB of input data from Project Gutenberg, we estimate
26: the effective entropy to be $\approx 1.82$ bit/character. Extrapolating to infinitely
27: long input, the true value of the entropy is estimated as $\approx 0.8$ bit/character.
28: \end{abstract}
29:
30: \pacs{02.50.-r, 05.10.-a, 05.45.Tp}
31:
32: \maketitle
33:
34: \section{Introduction}
35:
36: The discovery that the amount of information in a message (or in any other structure)
37: can be objectively measured was certainly one of the major scientific achievements of
38: the 20th century. On the theoretical side, this quantity -- the information theoretic
39: entropy -- is of interest mainly because of its close relationship to thermodynamic
40: entropy, its importance for chaotic systems, and its role in Bayesian
41: inference (maximum entropy principle). Practically, estimating
42: the entropy of a message (text document, picture, piece of music, etc.) is important
because it measures its compressibility, i.e. the best that any possible
compression algorithm can achieve. In the following, we shall always deal with sequences
45: $(s_0,s_1,\ldots)$ built from the characters of a finite alphabet $A = \{a_0,\ldots,a_{m-1}\}$
46: of size $m$. In the simplest case the alphabet consists just of 2 characters, in
47: which case the maximum entropy is 1 bit per character.
48:
49: Indeed, information entropy as introduced by Shannon \cite{shannon} is a probabilistic
50: concept. It requires a measure (probability distribution) to be defined on the set
51: of all possible sequences. In particular, the probability for $s_t$
52: to be given by $a_k$, given all characters $s_0,s_1,\ldots, s_{t-1}$, is given by
53: \begin{eqnarray}
54: p_t(k|k',k'',\ldots) = &&\\{\rm prob}(s_t = a_k&|&s_{t-1}=a_{k'}, s_{t-2}=a_{k''}, \ldots
55: ) . \nonumber
56: \end{eqnarray}
57: In case of a stationary measure with finite range correlations, $p_t(k|k',k'',\ldots)$
58: becomes independent of $t$ for $t\to\infty$. Then Shannon's famous formula,
59: \be
60: h = \lim_{i\to \infty} h^{(i)}
61: \ee
62: with
63: \be
64: h^{(i)} = - \sum_{k_1\ldots k_i} p(k_1\ldots k_i) \log_2 p(k_1|k_2\ldots k_i)\;,
65: \ee
66: gives the {\it average} information per character. The generalization to non-stationary
67: measures is straightforward but will not be discussed here.
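
As a simple illustration (our own example, not needed for what follows), consider
an iid binary source with $p(0)=p$. All conditional probabilities then reduce to
single-character probabilities, so that $h^{(i)} = h^{(1)}$ for all $i$ and
\be
   h = -p\log_2 p - (1-p)\log_2 (1-p) \;. \nonumber
\ee
This gives the maximal value $h=1$ bit for $p=1/2$, and e.g. $h\approx 0.87$ bits
for $p=0.29$. For a first order Markov source the conditioning stops after one
character, so that $h = h^{(2)}$.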
68:
In contrast to this approach, there are attempts to define the {\it exact} information content of a single
70: finite sequence. Theoretically, the basic concept here is the {\it algorithmic complexity}
71: AC (or algorithmic {\it randomness}) \cite{kolmogorov,chaitin}. For any given universal
72: computer $U$, the AC of a sequence $S$ relative to $U$ is given by
the length of the shortest program which, when input to $U$, prints $S$ and then makes
$U$ stop, so that the next sequence can be read. If $S$ is randomly drawn from
75: a stationary ensemble with entropy $h$, then one can show that the AC
76: per character tends towards $h$, for almost all $S$ and all $U$, as the length of $S$
77: tends towards infinity \cite{li-vitanyi}. Thus, except for rare sequences which do not
78: contribute to averages, $h$ sets the limit for the compressibility.
79:
80: Practically, the usefulness of AC is limited by the fact that there cannot exist any
81: algorithm which finds for each $S$ its shortest code (such an algorithm could be used to
82: solve Turing's halting problem, which is known to be impossible) \cite{li-vitanyi}.
83: But one can give algorithms which are often quite efficient. Huffman,
84: arithmetic, and Lempel-Ziv coding are just three well known examples \cite{cover}.
85: Any such algorithm can be used to give an upper bound to $h$ (modulo fluctuations from
86: the finite sequence length) while, inversely, knowledge of $h$ sets a lower limit to
87: the average code lengths possible with these codes.
88:
89: A data compression scheme is called {\it optimal}, if it does not do much worse than the
90: best possible for typical random strings. More precisely, let $\{S\}$ be a set of sequences
91: with entropy $h(S)$, and let the code string $C(S)$ be built from an alphabet of $m_C$
92: characters. Then we call the coding scheme $C: S\to C(S)$ optimal, if
93: \be
94: {{\rm length}[C(S)] \over {\rm length}[S]} \to {h \over \log_2 m_C }
95: \quad {\rm for} \;\; {\rm length}[S] \to \infty
96: \ee
97: and for nearly all $S$.
98: While Huffman coding is not optimal, arithmetic and Lempel-Ziv codings are \cite{cover}.
99:
100: In several papers, Jim\'enez-Monta\~no, Ebeling, and others \cite{jimenes,poeschl} have
101: suggested coding schemes by non-sequential recursive pair substitution (NSRPS) \cite{footnote0}.
102: Call the original sequence $S_0$. We count the numbers $n_{jk}$ of non-overlapping successive
103: pairs of characters in $S_0$ where $s_t = a_j$ and $s_{t+1} = a_k$, and find their maximum,
104: $n_{\rm max} = \max_{j,k< m} n_{jk}$. The corresponding index pair is $(j_0,k_0)$.
105: Then we introduce a new character by concatenation
106: \be
107: a_m = (a_{j_0}a_{k_0})
108: \ee
109: and form the sequence $S_1$ by replacing everywhere the pair $a_{j_0}a_{k_0}$ by $a_m$. For
110: the special case of $j_0 = k_0$, any string of $2r+1$ characters $a_{j_0}$ is replaced by $r$
111: characters $a_m$, followed by one $a_{j_0}$.
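
A small constructed example (the string is chosen by us purely for illustration)
may make this rule concrete. For the binary string $S_0 = 0000010001$ the
non-overlapping pair counts are $n_{00}=3$, $n_{01}=2$, $n_{10}=1$, and $n_{11}=0$,
so that $(j_0,k_0) = (0,0)$ and the new character is $2 = (00)$. The runs of five
and of three zeros are replaced by ``220'' and ``20'', respectively, giving
\be
   S_0 = 0000010001 \;\; \longrightarrow \;\; S_1 = 2201201 \;, \nonumber
\ee
which is shorter than $S_0$ by $n_{\rm max}=3$ characters.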
112:
113: This is then repeated recursively: The sequence $S_{i+1}$ is obtained from $S_i$ by replacing
the most frequent pair $a_{j_i}a_{k_i}$ by a new character $a_{m+i}$. The procedure stops
once one can argue that further replacements cannot be of any use. Typically this
116: will happen if the code length consisting of both a description of $S_{i+1}$ and a description
117: of the pair $(j_i,k_i)$ is definitely longer than a description of $S_i$, for the present
118: and all subsequent $i$.
119:
120: Thus one sees that efficient encodings (which must also be uniquely decodable!) of the
121: sequences $S_i$ and of the type of substituted pairs become crucial for the analysis of NSRPS.
122: Unfortunately, the ``codings" given in \cite{jimenes,poeschl} are neither efficient nor
123: uniquely decodable \cite{footnote}. Thus their ``complexities" have no direct relationship
124: to $h$ or to algorithmic complexity (in contrast to their claim), and it is not clear from
125: their work whether NSRPS can be made into an optimal coding scheme at all.
126:
127: It is the purpose of the present paper to give at least partial answers to this.
128: More precisely, we shall only be concerned with the limit of infinitely long
129: strings, where the information encoded in the pairs $(j_i,k_i)$ can be neglected
130: in comparison with the information stored in $S_i$, at least for any finite $i$.
131: We will first show analytically that a coding scheme for $S_i$ exists which
132: satisfies a necessary condition for optimality (Sec.2). We then apply this to written
133: English (Sec.3), where we shall also compare our estimates of $h$ to those obtained
134: with other methods.
135:
136: \section{NSRPS for Markov sequences}
137:
138: Let us for the moment assume that $S_0$ is binary (the two characters are ``0" and ``1"),
139: and that it is completely random, i.e. identically and independently distributed (iid)
140: with the same probability for each character. Thus $p(0|\ldots) = p(1|\ldots) = 1/2$, and
141: $h=1$ bit. The length of $S_0$ is $N_0$, thus the total average information
142: stored in $S_0$ is $N_0$ bits.
143:
144: No coding scheme can reduce the length of $C(S_0)$ to less than $N_0$ bits
145: on average. Indeed, all schemes will have ${\rm length}[C(S_0)] > N_0$ bits (strict
146: inequality!), unless the ``coding" is a verbatim copy. For a coding scheme to be
147: optimal, a necessary (but not sufficient) condition is that
148: \be
149: {\rm length}[C(S_0)] / N_0 \to 1 \;{\rm bit}
150: \ee
151: for $N_0\to\infty$, i.e. the overhead in the code must be less than extensive
152: in the sequence length. This is what we want to show here, together with its
153: generalization to arbitrary (first order) Markov sequences.
154:
155: For this, we need two lemmata:
156:
157: {\bf Lemma 1}: {\it For any Markov sequence $S_0$ (not necessarily binary, and not
158: necessarily iid) built from $m$ letters, the sequence $S_1$ is again Markov.}
159:
160: {\bf Lemma 2}: {\it If a word $w = (k,k',k'',\ldots)$ appears several times in $S_0$,
161: and if one of these instances is substituted in $S_i$ by a string of characters
162: not straddling its boundaries, then all other instances of $w$ in $S_0$ are also
163: substituted in $S_i$ by the same string.}
164:
165: Lemma 1 tells us that NSRPS might make the structure of $S_i$ more complex than
166: that of $S_0$, but not much so. Being a Markov chain, its entropy can be estimated
if the transition probabilities $p(k|k')$ are known. Thus estimating the entropy
168: of $S_i$ reduces to estimating di-block entropies $h^{(2)}$, which is straightforward (at
169: least in the limit $N_0\to\infty$).
170:
171: Lemma 2 tells us that there cannot be any ambiguity in $S_i$. In particular,
172: it cannot happen that more information is needed to specify $S_i$ than there
173: is needed to specify $S_0$, since the mapping $S_0\to S_i$ is bijective, once
174: the substitution rules are fixed.
175:
176: The proofs of the lemmata are easy. Let us denote by $p_j(\ldots)$ the probability
177: distributions after $j$ pair substitutions. For lemma 1 we just have to show that
178: $p_1(k|k',k'')$ is independent of $k''$ for each pair $(k,k')$, provided the
179: same holds also for $p_0$. This follows basically from the fact that any
180: substitution makes the sequence shorter. But the detailed proof is somewhat
181: tedious, because $p_1(k|k',k'')\neq p_0(k|k',k'')$, even if all $k$'s are less than
182: $m$, $k\neq k_0$, $k''\neq j_0$, and neither $(k,k')$ nor $(k',k'')$ are equal to
183: the pair $(j_0,k_0)$. In that case,
184: $(N_0-n_{\rm max}) p_1(k|k',k'') = N_0 p_0(k|k',k'')$, and independence of $k''$ follows
185: immediately. All other cases have to be dealt with similarly. For instance, if
186: either $(k,k')$ or $(k',k'')$ is the pair $(j_0,k_0)$, then $p_1(k,k',k'')=0$. Else,
187: if $k''=m\neq k,k'$, then $p_1(k|k',k'') = N_0/ (N_0-n_{\rm max}) p_0(k|k',j_0,k_0) =
188: N_0/ (N_0-n_{\rm max}) p_0(k|k')$. We leave the other cases as exercises to the reader.
189:
190: For proving lemma 2 we proceed indirectly. We assume that there is a word in
191: $S_0$ which is encoded differently in different locations. Let us assume
192: that this difference happened for the first time after $i$ substitutions.
193: Since only one type of pair is exchanged in each step, this means that a
194: substitution is skipped in one of the locations, at this step. But this is
195: impossible, since {\it all} possible substitutions are made at each step.
196:
197: From the two lemmata we obtain immediately our central
198:
199: {\bf Theorem:} {\it If $S_0$ is drawn from a (first order) Markov process with length $N_0$
200: and entropy $h_0 = - \sum_{k,k'} p_0(k,k') \log_2 p_0(k|k')$, then every $S_i$
201: is also Markovian in the limit $N_0\to\infty$, with entropy
202: \be
203: h_i = h_i^{(2)} = - \sum_{k,k'} p_i(k,k') \log_2 p_i(k|k') \label{h2}
204: \ee
205: and with length $N_i$ satisfying $N_i/N_0 = h_0/h_i$.}
206:
207: Thus the total amount of information needed to specify $S_i$ is the same as that for
208: $S_0$, for infinitely long sequences. Since the overhead needed to specify the pairs
$(j_i,k_i)$ can be neglected in this limit, we see that we do not lose code length
210: efficiency by pair substitution, provided we take pair probabilities correctly
211: into account during the coding. The actual encoding can be done by means of an arithmetic
212: code based on the probabilities $p_i(k|k')$ \cite{cover}, but we shall not work out
213: the details. It is enough to know that the code length then becomes equal to the
214: information (both measured in bits), for $N_0\to\infty$.
215:
216: Let us see in detail how all this works for completely random iid binary
217: sequences. The original sequence $S_0 = 00101001111010011011\ldots$
218: has $p_0(00)=p_0(01)=p_0(10)=p_0(11)=1/4$ and therefore $h_0 = 1$ bit. Thus
219: we can, without loss of generality, assume that the new character is
220: $2 = (01)$, so that $S_1 = 02202111202121\ldots$. The 3 characters are
221: now equiprobable, $p_1(0)=p_1(1)=p_1(2)=1/3$, but they are not independent
222: since of course $p_1(01)=0$. Indeed, one finds $p_1(00)=p_1(02)=p_1(11)=p_1(21)=1/6,
223: \;p_1(10)=p_1(12)=p_1(20)=p_1(22)=1/12$. The order-2 entropy of $S_1$ is easily
224: calculated as $h_1^{(2)} = 4/3 \log_2 2$. On the other hand, since $N_0/4$ pairs
225: have been replaced by single characters, the length of $S_1$ is $N_1=3N_0/4$. Thus,
226: if $S_1$ is Markov, then the total information needed to specify it is
227: $N_1 h_1^{(2)} = N_0$ bits, the same as for $S_0$. If it were not Markov, its
228: information would be smaller. But this cannot be, because the map $S_0\to S_1$
229: was invertible. Thus $S_1$ must indeed be Markov, as can also be checked explicitly.
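
The value of $h_1^{(2)}$ quoted above can be checked explicitly. Let us write
$p_1(b|a) = p_1(ab)/p_1(a)$ for the probability that $b$ follows $a$ (a notation
introduced only for this check). The pair probabilities then give
$p_1(0|0)=p_1(2|0)=1/2$, $p_1(1|1)=p_1(1|2)=1/2$, and
$p_1(0|1)=p_1(2|1)=p_1(0|2)=p_1(2|2)=1/4$. The uncertainty about the next character
is thus 1 bit after a ``0'' and $3/2$ bits after a ``1'' or a ``2'', so that
\begin{eqnarray}
   h_1^{(2)} & = & {1\over 3}\left(1 + {3\over 2} + {3\over 2}\right)\;{\rm bits}
                 = {4\over 3}\;{\rm bits} \;, \nonumber \\
   N_1 h_1^{(2)} & = & {3N_0\over 4}\cdot {4\over 3}\;{\rm bits} = N_0\;{\rm bits} \;. \nonumber
\end{eqnarray}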
230:
231: In the next step, we can either replace $(21) \to 3$ or $(02) \to 3$, since both
232: have the same probability. If we do the former, the sequence becomes
233: $S_2 = 02203112033\ldots$. Now the letters are no longer equiprobable,
234: $p_2(1)=p_2(2)=p_2(3)=1/5$, $p_2(0)=2/5$. Calculating $N_2, p_2(kk')$, and
235: $h_2^{(2)}$ is straightforward, and one finds again $N_2 h_2^{(2)} = N_0$ bits.
236: Thus one concludes that $S_2$ must also be Markov. For the next
237: few steps one can still verify
238: \be
   N_i h_i^{(2)} = \ldots = N_0 \; {\rm bits},    \label{same}
240: \ee
241: by hand, but this becomes increasingly tedious as $i$ increases.
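
The second step can be spelled out in the same way. The following is a sketch
which assumes that $S_1$ is the Markov chain with the pair probabilities given
above, and that $(21)\to 3$ is substituted. Since $p_1(21)=1/6$, one has
$N_2 = N_1 - N_1/6 = 5N_0/8$. In $S_2$, a ``0'' or a ``2'' is followed by ``0''
with probability $1/2$ and by ``2'' or ``3'' with probability $1/4$ each, which
corresponds to $3/2$ bits. A ``1'' or a ``3'' is followed by ``0'' with probability
$1/4$, by ``1'' with probability $1/2$, and by ``2'' or ``3'' with probability
$1/8$ each, which corresponds to $7/4$ bits. With the single-character
probabilities $p_2(k)$ quoted above this gives
\begin{eqnarray}
   h_2^{(2)} & = & \left({2\over 5}\cdot{3\over 2} + {1\over 5}\cdot{7\over 4}
                  + {1\over 5}\cdot{3\over 2} + {1\over 5}\cdot{7\over 4}\right){\rm bits}
                 = {8\over 5}\;{\rm bits} \;, \nonumber \\
   N_2 h_2^{(2)} & = & {5N_0\over 8}\cdot {8\over 5}\;{\rm bits} = N_0\;{\rm bits} \;. \nonumber
\end{eqnarray}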
242:
243: \begin{figure}
244: %Fig 1
245: \psfig{file=fig1.ps,width=5.8cm,angle=270}
246: \caption{Results for a completely random (iid, uniformly distributed) binary
247: initial sequence of $N_0 = 8\times 10^8$ bits, plotted against the size of
248: the extended alphabet. Uppermost curve: code length needed to encode $S_i$,
249: divided by $N_0$, if $\log_2 (i+2)$ bits are used for each character. Middle
250: curve: code length based on $h_i^{(1)}$, i.e. the single-character distributions
251: $p_i(k)$ are used in the encoding. Lowest curve, indistinguishable on this
252: scale from a horizontal straight line:
253: code length based on $h_i^{(2)}$, using the two-character distributions $p_i(k,k')$.}
254: \label{fig1.ps}
255: \end{figure}
256:
257: Thus we have verified Eq.(\ref{same}) by extensive simulations, where we found
258: that it is exact, within the expected fluctuations, up to several thousand
259: substitutions (Fig.1). The distribution of the probabilities $p_i(k)$ becomes very
260: wide for large $i$, i.e. the sequences $S_i$ are far from uniform for large $i$,
261: but they are Markov and their entropies $h_i^{(2)}$ are exactly (within
262: the expected systematic finite sample corrections \cite{herzel,grass-fsc})
263: equal to $N_0/N_i$ bits. Notice that if we would encode the last $S_i$ without
264: taking the correlations into account (as seems suggested in
265: \cite{jimenes,poeschl}), then the code length for it would be larger and the
266: coding scheme would not be optimal.
267:
268: We have also made some simulations where we started with non-trivial Markov
269: processes for $S_0$, or even with non-Markov sequences with known entropy.
270: The latter were generated by creating initially a binary iid sequence with
271: $p(0) \neq p(1)$, and then using this as an input configuration for a few
272: iterations of the bijective cellular automaton R150 (in Wolfram's notation)
273: \cite{sg}.
274:
275: \begin{figure}
276: %Fig 2
277: \psfig{file=fig2.ps,width=5.8cm,angle=270}
278: \caption{Ranked single character probability distributions $p_i(k)$ of strings after
279: $i=2298$ pair substitutions. The different curves are for a completely random iid
280: initial string $S_0$ (solid line), iid string $S_0$ with $p_0(0)=0.29$ (long dashed),
281: $S_0$ obtained by applying two times CA rule 150 to an iid sequence with
282: $p(0)=0.09$ (dashed), and to written English with a reduced (46 character)
283: alphabet (dotted).}
284: \label{fig2.ps}
285: \end{figure}
286:
From these simulations it seems that $N_i h_i^{(2)}$ always tends towards $N_0 h_0$.
288: Also, the probability distributions $p_i(k)$ seem to tend (very slowly, see
289: Fig.2) to the same scaling limit as for iid and uniform $S_0$. This suggests
290: that indeed $S_i$ tends to a Markov process for arbitrary $S_0$. In this
291: case an optimal coding would be obtained if one would use, e.g., an
292: arithmetic code to encode $S_i$ by using approximate values of the observed
293: $p_i(k|k')$ for large $i$.
294:
295: Thus we have given strong (but still incomplete) arguments that NSRPS combined
296: with efficient coding of $S_i$ gives indeed an optimal coding scheme. In
297: practice, it would of course be extremely inefficient in terms of speed, and
298: thus of no practical relevance. But it could well be that it might lead to
299: more stringent entropy estimates than other methods. To test this we shall
now turn to one of the most complex and interesting systems, written natural
301: language.
302:
303: \section{The entropy of written English}
304:
305: The data used for the application of NSRPS to entropy estimation of written
306: English consisted of ca. 150 MB of text taken from the Project Gutenberg
307: homepage \cite{gutenberg}. It includes mainly English and American novels
from the 19th and early 20th centuries (Austen, Dickens, Galsworthy, Melville,
Stevenson, etc.), but also some technical reports (e.g. Darwin, historical
and sociological texts, etc.), Shakespeare's collected works, the King James
311: Bible, and some novels translated from French and Russian (Verne, Tolstoy,
312: Dostoevsky, etc.).
313:
314: From these texts we removed first editorial and legal remarks added by the
315: editors of Project Gutenberg. We also removed end-of-line, end-of-page, and
316: carriage return characters. All runs of consecutive blanks were replaced by
317: a single blank. Finally, we also removed all characters not in
318: the 7-bit ASCII alphabet (ca. 4200 in total). These cleaned texts were then
319: concatenated to form one big input string of 148,214,028 characters.
320:
321: Entropies were estimated both from this string (which still contained upper
and lower case letters, numbers, all kinds of brackets and punctuation marks,
323: 95 different characters in total), and from a version with reduced alphabet.
324: In the latter, we changed all letters to upper case; all brackets to either
325: ( or ); the symbols \$,\#,\&,*,\%, @ to one single symbol; colons, exclamation and
326: question marks to points; quotation marks to apostrophes; and semicolons to commas.
327: This reduced alphabet had then 46 letters (including, of course, the blank
328: ``$_\sqcup$").
329:
330: The most frequent pair of letters in English is ``e$_\sqcup$". After replacing it
331: by a new ``letter", the next pair to substitute is ``$_\sqcup$t", then ``$_\sqcup$a",
332: ``$_\sqcup$th", etc. Very soon also longer strings are substituted, e.g. after
333: 92 steps appears the first two-word combination, ``of$_\sqcup$the$_\sqcup$".
334:
335: As long as the number of new symbols is still small, it is easy to estimate the
336: pair probabilities, and from this an upper bound $\hat{h}_i = h_i^{(2)}N_i/N_0$
337: on the entropy. This becomes more and more difficult
338: as the alphabet size increases, as the sampling becomes insufficient even with
339: our very long input file, and we can no longer approximate the $p_i(k,k')$ by the
340: observed relative frequencies. As long as the number of different subsequent pairs is
341: much smaller than the sequence length (i.e., most pairs are observed many times),
342: we can still get reliable estimates of $\hat{h}_i$ by using the leading correction
343: term discussed in \cite{grass-fsc,footnote2}. But finally, when many pairs are seen only
344: once in the entire text, we have to stop since any estimate of $h_i^{(2)}$ becomes
345: unreliable.
346:
347: We went up to 6000 substitutions. The longest substrings substituted by a single
348: new symbol had length 13 in the original (95 letter) alphabet, and length 16 in the
349: reduced (46 letter) one (the latter was ``would$_\sqcup$have$_\sqcup$been$_\sqcup$").
350: The entropies $\hat{h}$ per (original) character are plotted
351: in Fig.3. We see that they are very similar for both alphabets.
352: We find $\hat{h}\approx 1.8$ bits/character after 6000 substitutions. This number
353: is very close to the value obtained from most other methods (with the exception of
354: \cite{teahan-cleary}, where $\approx 1.5$ bits/character were obtained), if one uses
$10$--$100$ MB of input text \cite{bell,sg}. This is surprising in view of two facts.
356: First of all, the methods applied in \cite{bell,sg} are very different, and one
357: might have thought a priori that they are able to use different structures of the
358: language to achieve high compression rates. Apparently they do not.
359:
360: \begin{figure}
361: %Fig 3
362: \psfig{file=fig3.ps,width=5.8cm,angle=270}
363: \caption{Entropy estimates $\hat{h}$ from pair probabilities plotted against
364: the size of the extended alphabet. Upper curve is for the initial 7 bit
365: alphabet, including upper and lower case letters. The lower curve is for the
366: reduced (46 letter) initial alphabet. The smooth dotted line passing
367: through the lower data set is a fit with Eq.(\ref{fit}).}
368: \label{fig3.ps}
369: \end{figure}
370:
371: Secondly, it is clear that $\hat{h}\approx 1.8$ bits/character is not a realistic
estimate of the true entropy of written English. Even though we cannot, with our
373: present text lengths and our computational resources, go to much larger alphabet sizes
374: (i.e. to more substitutions), it is clear from Fig.3 that both curves would continue
375: to decrease. Let us denote by $i$ the number of substitutions. Then empirical
376: fits to both curves in Fig.3 are given by
377: \be
378: \hat{h}_i = h + {c\over (i+i_0)^\alpha } \;. \label{fit}
379: \ee
380: Such a fit to the 46 letter data, with $h=0.7, i_0=34, c=4.99,$ and $\alpha = 0.1745$,
is also shown in Fig.3. One should of course not take it too seriously in view of the
382: very slow convergence with $i$ and the very long extrapolation, but it suggests that
383: the true entropy of written English is $0.7\pm 0.2$ bits/character.
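
To get a feeling for this fit, Eq.(\ref{fit}) can be evaluated with the parameters
quoted above (this is simple arithmetic on the fitted curve, not additional data).
At the largest number of substitutions reached,
\be
   \hat{h}_{6000} \approx 0.7 + {4.99 \over 6034^{\,0.1745}} \approx 1.79 \;, \nonumber
\ee
in agreement with $\hat{h}\approx 1.8$ bits/character. Within the same fit, the
difference $\hat{h}_i - h$ would drop below $0.1$ bits/character only for
$i+i_0 > (c/0.1)^{1/\alpha} \approx 5\times 10^9$, which illustrates how long the
extrapolation is.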
384:
This estimate is somewhat lower than the estimate of \cite{cov-king} and the
386: extrapolations given in \cite{sg}. It is
387: comparable with that of \cite{grass-ieee} and with Shannon's original estimate
388: \cite{shannon2}. It seems definitely to exclude the possibility $h=0$ which was
389: proposed in \cite{hilberg,ebel-posch}.
390:
391: \section{Conclusions}
392:
393: We have shown how a strategy of non-sequential replacements of pairs of characters
394: can yield efficient data compression and entropy estimates. A similar
395: strategy was first proposed by Jim\'enez-Monta\~no and others, but details and the
396: actual coding done in the present paper are quite different from those proposed in
397: \cite{jimenes,poeschl}. Indeed, this strategy was never used in \cite{jimenes,poeschl}
398: for actual codings, and it was also not used for realistic entropy estimates.
399:
400: Compared to conventional sequential codes (such as Lempel-Ziv or arithmetic
401: codes \cite{cover}, just to mention two), the present method would be much
slower. Instead of a single pass through the data as in sequential coding
schemes, we went through the data file up to 6000 times in order to
achieve a high compression rate. We could of course do with far fewer passes
if we were content with compression rates comparable to those of commercial
406: packages such as ``zip" or ``compress". For written English these achieve typically
compression factors $\approx 2.6$, i.e. ca. 3 bits/character. As seen from Fig.3,
this can be achieved by NSRPS very easily with very few passes, but even then the
overhead and the computational complexity of NSRPS are much too high to make it
410: a practical alternative.
411:
412: NSRPS can be seen as a greedy and extremely simple version of off-line textual
413: substitution \cite{storer}. In combination with other sophisticated techniques,
414: similar substitutions can give excellent results \cite{teahan-cleary}. But without
415: these techniques, it is in general believed that only much more sophisticated
416: versions of off-line textual substitution are of any interest \cite{storer}.
417: Again this is presumably true as far as practical coding schemes are concerned.
418: But things seem to be different if one is interested in entropy estimation. Here the
419: present method is much simpler (even though computationally more demanding) than
420: the tree-based gambling algorithms \cite{sg,bell} that had given the best results
421: up to now. Without extrapolation, it gives the same (upper bound) estimates
422: as these methods. But it seems that it allows a more reliable extrapolation to
423: infinite text length and infinite substitution depth, and thus a more reliable
424: estimate of the true asymptotic entropy.
425:
426: From the mathematical point of view, we should however stress that we have only
427: partial results. While we have proven that the Markov structure is a fixed point
428: of the substitution, we have not proven that it is {\it attractive}. We thus
429: cannot prove that the present strategy is indeed universally
430: optimal, although we believe that our numerical results strongly support this
431: conjecture. A rigorous proof would of course be extremely welcome.
432:
433: I thank Ralf Andrzejak, Hsiao-Ping Hsu, and Walter Nadler for carefully reading
434: the manuscript and for useful discussions.
435:
436:
437: \begin{thebibliography}{99}
\bibitem{shannon} C.E. Shannon and W. Weaver, {\it The Mathematical
           Theory of Communication} (Univ. of Illinois Press, Urbana 1949).
\bibitem{kolmogorov} A.N. Kolmogorov, IEEE Trans. Inf. Theory
           {\bf IT-14}, 662 (1968).
442: \bibitem{chaitin} G.J. Chaitin, {\it Algorithmic Information Theory}
443: (Cambridge Univ. Press, New York 1987).
444: \bibitem{li-vitanyi} M. Li and P. Vit\'anyi, {\it An Introduction to
445: Kolmogorov Complexity and its Applications} (Springer, New York 1997).
446: \bibitem{cover} T.M. Cover and J.A. Thomas, {\it Elements of Information Theory}
447: (Wiley Interscience, 1991).
448: \bibitem{jimenes} W. Ebeling and M.A. Jim\'enez-Monta\~no, Math. Biosc.
449: {\bf 52}, 53 (1980);
450: M.A. Jim\'enez-Monta\~no, Bull. Math. Biol. {\bf 46}, 641 (1984);
451: P.E. Rapp, I.D. Zimmermann, E.P. Vining, N. Cohen, A.M. Albano, and
           M.A. Jim\'enez-Monta\~no, Phys. Lett. A {\bf 192}, 27 (1994).
453: \bibitem{poeschl} M.A. Jim\'enez-Monta\~no, W. Ebeling, and T. P\"oschel,
454: preprint arXiv:cond-mat/0204134 (2002).
455: \bibitem{footnote0} Actually, Jim\'enez-Monta\~no {\it et al.} use somewhat
456: different schemes. Also, we found the names given in
457: \cite{jimenes,poeschl} to their algorithms somewhat misleading,
458: since they refer to grammatical categories, while we are dealing with
459: probability measures.
460: \bibitem{footnote} In \cite{poeschl}, e.g., it is assumed that a character from
461: a two-letter alphabet can still be encoded by one bit, after the first pair
462: has been replaced by a ``non-terminal node", in their notation. This is
463: not true, since encoding this character now must fix a choice between
464: {\it three} (instead of two) possibilities.
465: \bibitem{gutenberg} http://promo.net/pg/.
\bibitem{herzel} B. Harris, Colloquia Mathematica Societatis J\'anos Bolyai, 1975,
467: p. 323; H. Herzel, Syst. Anal. Model Sim. {\bf 5}, 435 (1988).
468: \bibitem{grass-fsc} P. Grassberger, Phys. Lett. A {\bf 128}, 369 (1988).
469: \bibitem{footnote2} We use Eq.(13) of \cite{grass-fsc}, but with a misprint
470: corrected: The denominator of the last term should be $(n_i+1)n_i$ instead
471: of $n_i+1$.
472: \bibitem{teahan-cleary} W.J. Teahan and J.G. Cleary, {\it The entropy of English
           using PPM-based models}, Proc. of Data Compression Conf., Los Alamos (1996).
474: \bibitem{bell} T.C. Bell, J.G. Cleary, and I.H. Witten, {\it Text Compression}
475: (Prentice-Hall, Englewood Cliffs, NJ, 1990).
476: \bibitem{cov-king} T. Cover and R. King, IEEE Trans. Inf. Theory {\bf IT-24}, 413
           (1978).
478: \bibitem{sg} T. Sch\"urmann and P. Grassberger, CHAOS {\bf 6}, 414 (1996).
479: \bibitem{grass-ieee} P. Grassberger, IEEE Trans. Inf. Theory {\bf IT-35}, 669
480: (1989).
\bibitem{shannon2} C.E. Shannon, Bell Syst. Tech. J. {\bf 30}, 50 (1951).
482: \bibitem{hilberg} W. Hilberg, Frequenz {\bf 44}, 243 (1990).
483: \bibitem{storer} J.A. Storer, {\it Data Compression} (Computer Science Press,
484: Rockville, MD, 1988).
485: \bibitem{ebel-posch} W. Ebeling and T. P\"oschel, Europhys. Lett. {\bf 26}, 241
486: (1994).
487:
488: \end{thebibliography}
489:
490: \end{document}
491: