1: \documentclass[12pt]{article}
2:
3: \begin{document}
4:
5: \title{ Fast Codes for Large Alphabets.
6: \footnote {Supported by the
7: INTAS under the Grant no. 00-738. } }
8:
9:
10:
11: \author{ Boris Ryabko, Jaakko Astola, Karen Egiazarian. }
12: \date{}
13: \maketitle
14:
15:
16: \begin{abstract}
17: We address the problem of constructing a fast lossless code in the
18: case when the source alphabet is large. The main idea of the new
19: scheme may be described as follows. We group letters with small
20: probabilities in subsets (acting as super letters) and use time
21: consuming coding for these subsets only, whereas letters in the
22: subsets have the same code length and therefore can be coded fast.
23: %This procedure makes the code more redundant but
24: %faster. If we denote the extra redundancy by $\delta $ and apply
25: %the proposed scheme to the arithmetic code and $N$- symbol
26: %alphabet, we obtain the time of encoding and decoding $ c( \log
27: %\log N + \log (1/ \delta ))+ c_1 $ instead of $ c \log N + c_2 $
28: %for a usual arithmetic code, $N \rightarrow \infty $.
29: The
30: described scheme can be applied to sources with known and unknown
31: statistics.
32:
33:
34: \end{abstract}
35:
36: \textbf{Keywords.} { \it fast algorithms, source coding, adaptive
37: algorithm, cumulative probabilities, arithmetic coding, data
38: compression, grouped alphabet.}
39: %\end{keywords}
40:
41:
42:
43:
44: \section{Introduction.}
45:
The computational efficiency of lossless data compression for
large alphabets has long attracted the attention of researchers
because of its great practical importance. An alphabet
of $2^8 = 256$ symbols, which is commonly used in compressing
computer files, may already be treated as a large one, and with
the adoption of Unicode the alphabet size will grow to $2^{16}=
65536$.
Moreover, in many data compression methods
the input data
are first transformed by some algorithm, and the resulting sequence
is then compressed by a lossless code. It turns out that
the alphabet of this sequence is very often large or even
infinite. For instance, the run-length code, many implementations
of Lempel--Ziv codes, grammar-based codes \cite{Ki1,Ki2} and
many methods of image compression can be described in this way.
That is why the problem of constructing high-speed codes for large
alphabets has attracted so much attention from researchers. Important
results have been obtained by Moffat, Turpin
\cite{Moffat90,Moffat94,Moffat99,M-T1,M-T,T-M} and others
\cite{Jo,RyabkoDAN,Ryabko,Fenwick, R-Ri}.
66:
For many adaptive lossless codes the speed of coding depends substantially
on the alphabet size, because of the need to maintain
cumulative probabilities. The running time of the obvious (naive)
method of updating the cumulative probabilities is proportional to
the alphabet size $N$. Jones \cite{Jo} and Ryabko \cite{RyabkoDAN}
independently suggested two different algorithms that
perform all the necessary transitions between individual and
cumulative probabilities in $O(\log N)$ operations on $(\log N
+ \tau)$-bit words, where $\tau$ is a constant depending on
the required redundancy and $N$ is the alphabet size. Later, many such
algorithms were developed and investigated in numerous papers
\cite{Moffat90,Ryabko,Fenwick,Moffat94,Moffat99}.
79:
In this paper we suggest a method for speeding up codes
based on the following main idea. The letters of the alphabet are
ordered according to their probabilities (or frequencies of
occurrence), and letters with small probabilities that are close to
each other are grouped
into subsets (acting as new super letters). The key
point is the following: equal probability is ascribed
to all letters in one subset and, consequently, their codewords
have the same length. This makes it possible to encode and
decode them much faster than if their codeword lengths were
different. Each subset of grouped letters
is treated as one letter of a new alphabet, whose size is much
smaller than that of the original alphabet.
94: Such a grouping can increase the redundancy of the code. It
95: turns out, however, that a large decrease in the alphabet size may cause a
relatively small increase in the redundancy. More exactly, we
suggest a method of grouping for which the number of
groups, as a function of the redundancy $\delta$, grows as
$c \log N / \delta + c_1$, where $N$ is the alphabet size and
$c, c_1$ are constants (cf.\ (\ref{co})).
101: %It should be noted that, in fact, the
102: %number of the groups can be considered as the size of the new
103: %alphabet and the last formula shows that the alphabet reduction
104: %can be quite large.
105:
In order to explain the main idea we consider the following
example. Let a source generate letters $ \{ a_0,\ldots , a_4 \}$
with probabilities $ p(a_0) = 1/16,\, p(a_1) = 1/16, \,p(a_2) =
1/8, \,p(a_3) = 1/4, \,p(a_4) = 1/2, $ respectively. It is easy
to see that the code $$ code(a_0) = 0000, code(a_1) =
0001, code(a_2) = 001, code(a_3) = 01, code(a_4) = 1 $$ has the
minimal average codeword length. It seems that the decoder needs
to look at one bit to decode $a_4$, two bits to
decode $a_3$, three bits for $a_2$ and four bits for $a_1$ and $a_0$.
However,
consider another code $$ \widetilde{code}(a_4) = 1,
\widetilde{code}(a_0) = 000, \widetilde{code}(a_1) = 001,
\widetilde{code}(a_2) = 010, \widetilde{code}(a_3) = 011. $$
On the one hand, its average codeword length is a
little larger than that of the first code (2 bits instead of 1.875
bits); on the other hand, the decoding is simpler. In fact,
the decoding can be carried out as follows. If the first bit is 1,
the letter is $a_4$. Otherwise, read the next two bits and treat
them as a binary integer giving the
index of the letter (i.e., 00 corresponds to $a_0$, 01 corresponds to
$a_1$, etc.). This simple observation can be
generalized and extended to construct a new coding scheme with the
property that the larger the alphabet size is,
the greater the speed-up.
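
The decoding rule for $\widetilde{code}$ can be written down in a few lines.
The following Python sketch is only an illustration of that rule; the
function name \texttt{decode\_grouped} is ours, and letters are represented by
their indices $0,\ldots,4$.

\begin{verbatim}
def decode_grouped(bits):
    """Decode a string of '0'/'1' characters written with the second code.

    A leading '1' stands for a_4; a leading '0' is followed by two bits
    that are read as the binary index of the letter (0..3).
    """
    letters = []
    pos = 0
    while pos < len(bits):
        if bits[pos] == "1":
            letters.append(4)
            pos += 1
        else:
            letters.append(int(bits[pos + 1:pos + 3], 2))
            pos += 3
    return letters
\end{verbatim}

For instance, \texttt{decode\_grouped("1000011010")} returns
\texttt{[4, 0, 3, 2]}: no bit-by-bit tree walk is needed for the four rare
letters.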
130:
131: In principle, the proposed method can be applied to the Huffman
132: code, arithmetic code, and other lossless codes for speeding them
133: up, but for the sake of simplicity, we will consider the
134: arithmetic code in the main part of the paper, whereas the Huffman
135: code and some others will be mentioned only briefly, because, on
136: the one hand, the arithmetic code is widely used in practice and,
137: on the other hand, generalizations are obvious.
138:
The suggested scheme can be applied to sources with unknown
statistics. As mentioned above, the alphabet letters should be
ordered according to their frequencies of occurrence when the
encoding and decoding are carried out. Since the frequencies
change after each message letter is coded, the order must be
updated, and the time of this updating must be taken into
account when the coding speed is estimated. It turns out
that there exist an algorithm and a data structure that make it
possible to carry out the updating with a few operations per
message letter, and the number of these operations depends neither
on the alphabet size nor on the probability distribution.
150:
The rest of the paper is organized as follows. The second part
contains estimates of the redundancy caused by the grouping of
letters, together with examples for several values of the
redundancy. A fast adaptive arithmetic code for the
grouped alphabet, as well as the data structure and algorithm for
maintaining the alphabet ordered by frequency
of occurrence, are given in the third and
the fourth parts. The Appendix contains all the proofs.
159:
160: \section{The redundancy due to grouping. }
161:
162: First we give some definitions. Let $A = \{ a_1, a_2,\ldots, a_N
163: \}$ be an alphabet with a probability distribution $\bar{p} = \{
164: p_1, p_2,\ldots, p_N \}$ where $ p_1 \geq p_2 \geq \ldots \geq
p_N, N \geq 1 $. The distribution can be either known a priori or
estimated from the occurrence counts. In the latter case
the order of the probabilities must be updated after encoding
each letter, and this must be taken into account when the speed of
coding is estimated. A simple data structure and algorithm for
maintaining the order of the probabilities are described in
the fourth part, whereas here we discuss the estimation of the
redundancy.
173:
Let the letters of the alphabet $A$ be grouped as follows: $A_1 =
\{ a_1, a_2, \ldots, a_{n_1} \},$ $A_2 = \{
a_{n_1+1},a_{n_1+2},\ldots, a_{n_2} \},\ldots, A_s = \{
a_{n_{s-1}+1},a_{n_{s-1}+2},\ldots, a_{n_{s}} \} $, where $n_s =
N, s \geq 1 $. We define the probability distribution $\pi$ and
the vector $\bar{m}= (m_1, m_2, \ldots, m_s)$ by
\begin{equation}\label{pi}\pi_i = \sum
_{a_j \in A_i} p_j
\end{equation}
and $m_i = n_i - n_{i-1}, n_0 =0, i
= 1, 2, \ldots,s $, respectively. In fact, the grouping is
defined by the vector $\bar{m}$. We intend to encode all
letters from one subset $A_i$ by codewords of equal length.
For this purpose we ascribe equal probabilities to the letters
from $A_i$ by
189: \begin{equation}\label{code}
190: \hat{p}_j = \pi_i / m_i
191: \end{equation}
192: if $a_j \in A_i, i = 1, 2,
193: \ldots,s.$ Such encoding causes redundancy, defined by
194: \begin{equation}\label{red}
195: r(\bar{p}, \bar{m}) = \sum_{i=1}^N p_i \log ( p_i / \hat{p}_i ).
196: \end{equation}
197: (Here and below $\log(\:)= \log_2(\:).$)
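
As a small illustration of (\ref{red}) (our own, using the probabilities of
the introductory example written in decreasing order), let $N=5$,
$\bar{p} = \{ 1/2, 1/4, 1/8, 1/16, 1/16 \}$ and $\bar{m} = (1,4)$, i.e.
$A_1 = \{ a_1 \}$ and $A_2 = \{ a_2, \ldots, a_5 \}$. Then
$\pi_1 = \pi_2 = 1/2$, $\hat{p}_1 = 1/2$, $\hat{p}_j = 1/8$ for
$j = 2,\ldots,5$, and
$$ r(\bar{p}, \bar{m}) = \frac{1}{2} \log 1 + \frac{1}{4} \log 2 +
\frac{1}{8} \log 1 + 2 \cdot \frac{1}{16} \log \frac{1}{2}
= \frac{1}{4} - \frac{1}{8} = \frac{1}{8} . $$
This is exactly the difference between the average codeword lengths of the
two codes of the Introduction (direct computation gives $15/8$ bits for the
first code and $2$ bits for the second).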
198:
199: The suggested method of grouping is based on information about the
200: order of probabilities (or their estimations). We are
201: interested in an upper bound for the redundancy (\ref{red})
202: defined by
\begin{equation}\label{Red}
R( \bar{m})= \sup_{ \bar{p} \in \bar{P}_N} r(\bar{p}, \bar{m}),
\qquad \bar{P}_N = \{ \bar{p} = \{ p_1, p_2,\ldots, p_N \} : p_1 \geq p_2 \geq
\ldots \geq p_N \}.
\end{equation}
208: The following theorem gives the redundancy estimate.
209:
210: \textbf{Theorem 1.}
211:
{\it The following equality for the redundancy (\ref{Red}) is
valid:
\begin{equation}\label{th}
R( \bar{m})= \max_{i=1,\ldots,s}\; \max_{l=1,\ldots,m_i}
\frac{l\, \log (m_i /l)}{n_{i-1} + l},
\end{equation}
where, as before, $\bar{m}= (m_1, m_2,\ldots,m_s)$, $n_i = \sum_{j=1}^i
m_j$, $i=1, \ldots,s$, and $n_0 = 0$. }
220:
\emph{The proof} is given in the Appendix.
222:
The practically interesting question is how to find a grouping
that minimizes the number of groups for a given upper bound
$\delta$ on the redundancy. Theorem 1 can be used as the basis
for such an algorithm. This algorithm
%is given in Appendix (it is also
is implemented as a Java program, which has been used to prepare
all the examples given below. The program can be found on the
internet and used for practical needs, see

\texttt{http://www.ict.nsc.ru/\~{}ryabko/GroupYourAlphabet.html}.
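
A minimal sketch of one way to turn Theorem 1 into such a search is given
below. It is not the authors' Java program: the function names are ours, the
bound for a group is evaluated with the number of letters preceding the group
plus $l$ in the denominator (the reading that reproduces the groupings listed
in the examples of this section), and we assume that greedily taking the
largest admissible group at every step is sufficient, which relies on the
bound being non-decreasing in the group size.

\begin{verbatim}
from math import log2


def group_bound(n_prev, m):
    """Worst-case redundancy of a group of m letters preceded by n_prev letters."""
    return max(l * log2(m / l) / (n_prev + l) for l in range(1, m + 1))


def greedy_grouping(alphabet_size, delta):
    """Group sizes chosen greedily so that every group satisfies the bound."""
    sizes = []
    n_prev = 0
    while n_prev < alphabet_size:
        m = 1
        remaining = alphabet_size - n_prev
        # Enlarge the group while the next size still respects the bound.
        while m < remaining and group_bound(n_prev, m + 1) <= delta:
            m += 1
        sizes.append(m)
        n_prev += m
    return sizes
\end{verbatim}

Under these assumptions \texttt{greedy\_grouping(256, 0.08)} returns 35
groups, the last one being cut to the 5 letters that remain.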
233:
234: Let us consider some examples of such grouping carried
235: out by the program mentioned.
236:
First we consider the Huffman code. It should be noted
that in the case of the Huffman code the size of each group should
be a power of 2, whereas it can be any integer
in the case of an arithmetic code. This is because the lengths of
Huffman codewords must be integers, whereas this limitation is
absent in arithmetic coding.
243:
For example, let the alphabet have 256 letters and let the additional
redundancy (\ref{Red}) not exceed 0.08 per letter.
(This choice of parameters is appropriate, because an alphabet of $2^8 =
256$ symbols is commonly used in compressing computer files, and
a redundancy of 0.08 per letter corresponds to 0.01 per bit.) In this case the
following grouping
%$m[1], m[2], ..., m[s])$
gives the minimal number of groups $s$: $$A_1= \{ a_1 \} ,
252: A_2= \{ a_2 \} , \ldots , A_{12}= \{ a_{12}\}, $$ $$ A_{13}= \{
253: a_{13}, a_{14}\}, A_{14}= \{ a_{15}, a_{16}\}, \ldots,A_{19}= \{
254: a_{25}, a_{26}\}, $$ $$A_{20}= \{ a_{27}, a_{28}, a_{29}, a_{30}
255: \}, \ldots, A_{26}= \{ a_{51}, a_{52}, a_{53}, a_{54} \}, $$ $$
256: A_{27}= \{ a_{55}, a_{56},\ldots, a_{62} \},\ldots, A_{32}= \{
257: a_{95},\ldots, a_{102} \}, $$ $$ A_{33}= \{ a_{103},
258: a_{104},\ldots, a_{118} \},\ldots, A_{39}= \{ a_{199},\ldots,
259: a_{214} \}, $$ $$ A_{40}= \{ a_{215}, a_{216},\ldots, a_{246} \},
A_{41}= \{ a_{247},\ldots, a_{278} \}. $$ We see that each
of the first 12 subsets contains one letter, each of the subsets
$A_{13}, \ldots, A_{19}$ contains two letters, etc., and the total
number of subsets $s$ is 41. Although the alphabet ends at $a_{256}$,
we may formally let the last
subset $A_{41}$ contain the 32 letters $\{ a_{247},\ldots, a_{278}
\}$ rather than the ten letters $ \{ a_{247},\ldots, a_{256} \}$, since each
letter of this subset is encoded \emph{inside} the subset
by a 5-bit word (because $\log 32 = 5$).
268:
Let us proceed with this example in order to show how such a
grouping can be used to simplify the encoding and
decoding of the Huffman code. If the
letter probabilities are known, one can calculate the probability
distribution $\pi$ by (\ref{pi}) and construct the Huffman code for the new
alphabet $\hat{A} = \{ A_{1}, \ldots, A_{41} \}$ with the distribution
$\pi$. If we denote the codeword of $A_i$ by $code (A_i)$ and
enumerate all letters in each subset $A_i$ from 0 to $|A_i| -1 $,
then the code of a letter $a_j \in A_i $ can be presented as the
pair of words $$code (A_i)\: \{number\, of \, a_j \, \in A_i
\},$$ where $ \{number\, of \, a_j \, \in A_i \} $ is the $\log
|A_i|$-bit binary notation of the number of $a_j$ inside $A_i$. For
instance, the letter $a_{103}$ is the first in the 16-letter
subset $A_{33}$ and $a_{246}$ is the last in the 32-letter subset
$A_{40}$. They will be encoded by $code( A_{33})\,0000$ and
$code(A_{40})\,11111$, respectively. It is worth noting that
$code (A_i)$, $i=1,\ldots, s,$ depends on the probability
distribution, whereas the second part of the codeword, $\{number\,
of \, a_j \, \in A_i \}$, does not. So, in fact, the
Huffman code should be constructed for the 41-letter alphabet
instead of the 256-letter one, whereas the encoding and decoding inside
the subsets can be implemented with a few operations. Of course,
this scheme can be applied to a Shannon code, an alphabetical code,
an arithmetic code and many others. It is also important that the
reduction of the alphabet size becomes larger as the alphabet size
grows.
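
The second part of the codeword can be computed without any knowledge of the
probabilities. The Python sketch below illustrates this for the 41 subsets
listed above; the names \texttt{GROUP\_SIZES} and \texttt{split\_letter} are
ours, the construction of the Huffman code for $code(A_i)$ is deliberately
left out, and the linear scan over the subsets is used only for brevity.

\begin{verbatim}
# Subset sizes of the 41-group example: 12 x 1, 7 x 2, 7 x 4, 6 x 8, 7 x 16, 2 x 32.
GROUP_SIZES = [1] * 12 + [2] * 7 + [4] * 7 + [8] * 6 + [16] * 7 + [32] * 2


def split_letter(j, sizes=GROUP_SIZES):
    """Return (i, bits): the subset index of a_j and its number inside A_i."""
    first = 1                       # index of the first letter of the subset
    for i, size in enumerate(sizes, start=1):
        if j < first + size:
            width = size.bit_length() - 1    # log2(size) for a power of two
            offset = j - first               # number of a_j inside A_i, from 0
            bits = format(offset, "b").zfill(width) if width else ""
            return i, bits
        first += size
    raise ValueError("letter index outside the grouped alphabet")
\end{verbatim}

Here \texttt{split\_letter(103)} gives \texttt{(33, "0000")} and
\texttt{split\_letter(246)} gives \texttt{(40, "11111")}, in agreement with
the two codewords written out above.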
295:
Let us consider one more example of grouping, where the subset
sizes need not be powers of two. Let, as before, the
alphabet have 256 letters and let the additional redundancy
(\ref{Red}) not exceed 0.08 per letter. In this case the
optimal grouping is as follows. $$ |A_1| = |A_2| = \ldots =
|A_{12}| = 1, |A_{13}| = |A_{14}| = \ldots= |A_{16}|= 2, |A_{17}|=
|A_{18}| = 3,
$$
304: $$|A_{19}|= |A_{20}| = 4, |A_{21}| =5 , |A_{22}| = 6,|A_{23}| = 7,
305: |A_{24}| = 8, |A_{25}| =9,$$ $$ |A_{26}| = 11,|A_{27}| =
306: 12,|A_{28}| = 14, |A_{29}| = 16, |A_{30}| = 19, $$ $$|A_{31}| =
22, |A_{32}| = 25, |A_{33}| = 29,|A_{34}| = 34,|A_{35}| = 39.$$ We
see that the total number of subsets (i.e., the size of the
new alphabet) is smaller than in the previous example (35 instead of
41), because in the first example the subset sizes must be
powers of two, whereas there is no such limitation in the
second case. So, if an additional redundancy of
0.01 per bit is acceptable, one can use the new alphabet $ \hat{A} = \{ A_{1},
\ldots, A_{35} \} $ instead of the 256-letter alphabet and implement
the arithmetic coding in the same manner as described for
the Huffman code. (The exact description of the method is
given in the next part.) We will not consider further examples in
detail, but note again that the reduction in the number of
letters is greater when the alphabet size is larger. Thus, if the
alphabet size is $2^{16}$ and the redundancy upper bound is 0.16
(0.01 per bit), the number of groups $s$ is 39, and if the size is
$2^{20}$ then $s= 40 $, with the same redundancy per bit.
(Such calculations can easily be carried out by the above-mentioned
program.)
325:
The use of grouping for decreasing the
alphabet size is justified by the simple Theorem 2, for which
we need some definitions that are standard in source
coding.
330:
331: Let $\gamma$ be a certain method of source coding which can be
332: applied to letters from a certain alphabet $A$. If $p$ is a
333: probability distribution on $A$, then the redundancy of $\gamma$
334: and its upper bound are defined by
\begin{equation}\label{red1}
\rho(\gamma, p) = \sum_{a \in A} p(a)( |\gamma (a) |+ \log p(a)),
\quad \hat{\rho}(\gamma ) = \sup_{p} \:\rho(\gamma, p),
\end{equation}
where the supremum is taken over all distributions $p$, and $|\gamma
(a) |$ and $p(a)$ are the length of the codeword and the
probability of $a \in A$, respectively. For example,
$\hat{\rho} $ equals 1 for the Huffman and the Shannon codes,
whereas for the arithmetic code $\hat{\rho}$ can be made as small
as required by choosing some parameters, see, e.g.,
\cite{Ryabko-Fionov}.
346: %There are such codes, that their redundancy
347: %depend on the alphabet size, that is why we will use the notation
348: %$\hat{\rho}(|A|)$.
349:
350: The following theorem gives a formal justification for applying
351: the above described grouping for source coding.
352:
353: \textbf{Theorem 2.} {\it Let the redundancy of a certain code
354: $\gamma$ be not more than some $\Delta$ for all probability
355: distributions. Then, if the alphabet is divided into subsets $A_i,
356: i=1,\ldots, s ,$ in such a way that the additional redundancy
357: (\ref{red}) equals $\delta$, and the code $\gamma$ is applied to
358: the probability distribution $\hat{p}$ defined by (\ref{code}),
359: then the total redundancy of this new code $\gamma_{gr}$ is upper
360: bounded by $\Delta+\delta$.}
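
The statement can be checked directly. The following sketch is ours and
assumes, as in the two-part construction described above, that all letters of
one subset $A_i$ receive codewords of the same length under $\gamma_{gr}$.
Then $|\gamma_{gr}(a)| + \log \hat{p}(a)$ is constant on every $A_i$, and
since $p$ and $\hat{p}$ give every $A_i$ the same total probability $\pi_i$,
$$ \rho(\gamma_{gr}, p) = \sum_{a \in A} p(a) \bigl( |\gamma_{gr}(a)| + \log
\hat{p}(a) \bigr) + \sum_{a \in A} p(a) \log \frac{p(a)}{\hat{p}(a)} =
\rho(\gamma, \hat{p}) + r(\bar{p}, \bar{m}) \leq \Delta + \delta . $$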
361:
362: Theorem 1 gives a simple algorithm for finding the grouping
363: which gives the minimal number of the groups $s$ when the upper
364: bound for the admissible redundancy (\ref{Red}) is given. On the
365: other hand, the simple asymptotic estimate of the number of
366: such groups and the group sizes can be interesting when the
367: number of the alphabet letters is large. The following theorem can
368: be used for this purpose.
369:
370: \textbf{Theorem 3.}
371:
372: {\it Let $\delta > 0 $ be an admissible redundancy (\ref{Red}) of
373: a grouping.
374: %Let the admissible redundancy (\ref{Red}) of a grouping $m=
375: %(m_1, \ldots, m_s ) $ should not exceed some $\delta, \delta >0 $.
376:
377: i) If
378: \begin{equation}\label{co1}
379: \quad m_i \,\leq \,\lfloor \,\delta\, n_{i-1}\,\, e \,/ (\log e -
380: \delta \,e)\,\rfloor,
381: \end{equation}
then the redundancy of the grouping $(m_1, m_2, \ldots )$ does not
exceed $\delta$, where $n_i = \sum_{j=1}^i\, m_j$ and $e\approx
2.718\ldots$\,.


ii) The minimal number of groups $\:s\:$ as a function of the
388: redundancy $\delta$ is upper bounded by
389: \begin{equation}\label{co}
390: c \log N / \delta + c_1,
391: \end{equation}
392: where $c$ and $c_1$ are constants and $N$ is the alphabet
393: size,$\:N \rightarrow \infty.$ }
394:
\emph{The proof} is given in the Appendix.
396:
\textbf{Comment 1.} {\it The first statement of Theorem 3
gives a
construction of a $\delta$-redundant grouping $(m_1,
m_2, \ldots)$ for an infinite alphabet, because $m_i$ in (\ref{co1})
depends only on the previous $m_1, m_2, \ldots, m_{i-1}$.}

\textbf{Comment 2.} {\it Theorem 3 is also valid for groupings in which the
subset sizes $(m_1, m_2, \ldots )$ must be powers of 2. }
405:
406: \section{The arithmetic code for grouped alphabets. }
407:
408:
409: Arithmetic coding was introduced by Rissanen \cite{Riss76} in 1976
410: and now it is one of the most popular methods of source coding,
411: see, e.g., \cite{Moffat94}, \cite{Ryabko-Fionov}. The advantage of
412: arithmetic coding over other coding techniques is that it achieves
413: arbitrarily small coding redundancy per source symbol at less
414: computational effort than any other method.
415:
416:
We first give a brief description of an arithmetic code, paying
attention to the features that determine the speed of encoding and
decoding. As before, consider a memoryless source generating
letters from the alphabet $A= \{ a_1, \ldots, a_{N} \}$ with unknown
probabilities. Let the source generate a message $x_1\ldots
x_{t-1}x_t\ldots $, $x_i\in A$ for all $i$, and let $ \nu^t(a)$
denote the occurrence count of the letter $a$ in the word $x_1\ldots
x_{t-1}x_t $. After
the first $t$ letters $x_1,\ldots, x_{t-1},x_t$ have been processed,
the next letter $ x_{t+1}$ needs to be encoded. In the most
popular version of the arithmetic code the current estimated
probability distribution is taken as
429: \begin{equation}\label{piti}
430: p^t(a)= (\nu^t(a)+c)/(t+Nc) , a \in A ,
431: \end{equation}
432: where $c$ is a constant (as a rule $c$ is 1 or 1/2). Let $x_{t+1}=
433: a_i$, and let the interval $[\alpha, \beta )$ represent the word
434: $x_1 \ldots x_{t-1} x_t $. Then the word $x_1 \ldots x_{t-1} x_t
435: x_{t+1}$, $x_{t+1}= a_i $ will be encoded by the interval
436: \begin{equation}\label{int}
437: [\alpha + ( \beta - \alpha)\: q^t_i,\quad \alpha + ( \beta -
438: \alpha)\: q^t_{i+1}\: )\,,
439: \end{equation}
440: where
441: \begin{equation}\label{qu}
442: q^t_i = \sum _{j=1}^{i-1} p^t(a_j).
443: \end{equation}
When the size of the alphabet $N$ is large, the calculation of
$q^t_i$ is the most time-consuming part of the encoding process.
As mentioned in the Introduction, there are fast algorithms
that calculate $q^t_i$ in
\begin{equation}\label{time}
T= c_1 \log N + c_2
\end{equation}
operations on $ (\log N + \tau)$-bit words, where $\tau$ is
the constant determining the redundancy of the arithmetic code.
(As a rule, this length is proportional to the length of the
computer word: 16 bits, 32 bits, etc.)
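
To make the cost of $q^t_i$ visible, the following Python sketch evaluates
(\ref{piti}), (\ref{qu}) and (\ref{int}) in the naive way, with a summation
that is linear in $N$. It is an illustration only: the function names are
ours, exact rational arithmetic (\texttt{fractions.Fraction}) replaces the
fixed-length words of a real coder, and $c$ is assumed to be an integer or a
\texttt{Fraction}.

\begin{verbatim}
from fractions import Fraction


def estimated_probs(counts, c=1):
    """p^t(a_i) = (nu^t(a_i) + c) / (t + N c) for i = 1..N."""
    t, n = sum(counts), len(counts)
    return [(Fraction(v) + c) / (t + n * c) for v in counts]


def next_interval(alpha, beta, counts, i, c=1):
    """New coding interval after the letter a_i (1-based i) is encoded."""
    p = estimated_probs(counts, c)
    q_low = sum(p[:i - 1], Fraction(0))      # q_i: naive O(N) summation
    q_high = q_low + p[i - 1]                # q_{i+1}
    width = beta - alpha
    return alpha + width * q_low, alpha + width * q_high
\end{verbatim}

With the counts $(0,1,2,1)$ and $c=1$ the estimated probabilities are
$1/8, 2/8, 3/8, 2/8$, so encoding $a_3$ from the interval $[0,1)$ gives
$[3/8, 3/4)$.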
455:
456: We describe a new algorithm for the alphabet whose letters are
457: divided into subsets $ A_1^t,\ldots, A_s^t, $ and the same
458: probability is ascribed to all letters in the subset. Such a
459: separation of the alphabet $A$ can depend on $t$ which is why the
460: notation $A_i^t$ is used. But, on the other hand, the number of
461: the letters in each subset $A_i^t$ will not depend on $t$ which is
462: why it is denoted as $|A_i^t| = m_i$.
463:
In principle, the scheme for arithmetic coding is the same as
in the case of the Huffman code considered above: the
codeword of the letter $ x_{t+1}= a_i $ consists of two parts,
where the first part encodes the set $A^t_k$ that contains $a_i$,
and the second part encodes the ordinal of $a_i$ in the
set $A^t_k$. It turns out that it is easy to encode and decode
letters inside the sets $A^t_k$, and the time-consuming
operations are needed only to encode the sets $A^t_k$ themselves.
472:
We proceed with the formal description of the algorithm. Since the
probabilities of the letters in $A$ can depend on $t$, we define, in
analogy with (\ref{pi}) and (\ref{code}),
\begin{equation}\label{PQ1}
\pi_i^t = \sum _{a_j \in A_i^t} p^t(a_j),\quad \hat{p}_i^{\,t} = \pi_i^t
/ m_i
\end{equation}
480: and let
481: \begin{equation}\label{PQ}
482: Q^t_i= \sum _{j=1}^{i-1} \pi_j^t\:.
483: \end{equation}
484:
The arithmetic encoding and decoding are implemented for the
probability distribution (\ref{PQ1}), where the probability
$\hat{p}_k^{\,t}$ is ascribed to all letters from the subset
$A^t_k$. More precisely, assume that the letters in each $A^t_k$ are
enumerated from 1 to $m_k$, and that the encoder and the decoder
know this enumeration. Let, as before, $ x_{t+1}= a_i $, and let
$a_i$ belong to $A^t_k$ for some $k$. Then the coding interval for
the word $x_1\ldots x_{t-1}x_t x_{t+1}$ is calculated as follows:
\begin{equation}\label{newint}
[\alpha + ( \beta - \alpha)( Q^t_k + (\delta (a_i)-1)\,
\hat{p}_k^{\,t}\,)\, ,\quad
\alpha + ( \beta - \alpha) ( Q^t_k + \delta (a_i)\,\hat{p}_k^{\,t})\; ),
\end{equation}
where $ \delta(a_i)$ is the ordinal of $a_i$ in the subset
$A^t_k$. It can easily be seen that this definition is equivalent
to (\ref{int}) when the probability of each letter from $A^t_k$
equals $ \hat{p}_k^{\,t}$.
Indeed, let us order the letters of $A$
according to their counts of
occurrence in the word $x_1\ldots x_{t-1}x_t, $ and let the letters
in $A^t_k$, $k=1,2,\ldots,s$, be ordered according to the
enumeration mentioned above. We then immediately obtain
(\ref{newint}) from (\ref{int}) and (\ref{PQ1}). The additional redundancy
caused by replacing the distribution (\ref{piti}) by
$ \hat{p}_k^{\,t}$ can be estimated using (\ref{red}) and Theorems 1--3,
which is why
we may
concentrate on the encoding and decoding speed
and the storage space needed.
514:
First we compare the time needed for the calculations in
(\ref{int}) and (\ref{newint}). If we ignore the expressions
$(\delta (a_i)-1) \hat{p}_k^{\,t}$ and $ \delta (a_i)
\hat{p}_k^{\,t}$ for a while, we see that (\ref{newint}) can be
considered as the arithmetic encoding of the new alphabet $ \{
A^t_1, A^t_2, \ldots, A^t_s \} $. Therefore, the number of
operations for encoding by (\ref{newint}) is the same as
for arithmetic coding of an $s$-letter alphabet, which by
(\ref{time}) equals $c_1 \log s + c_2 $. The expressions $(\delta
(a_i)-1)\hat{p}_k^{\,t}$ and $ \delta (a_i) \hat{p}_k^{\,t}$
require two multiplications, and two additions are needed to
obtain the bounds of the interval in (\ref{newint}). Hence, the number
of operations $T$ for encoding by (\ref{newint}) is given by
\begin{equation}\label{newtime}
T= c_1^* \log s + c_2^* ,
\end{equation}
where $c_1^*, c_2^*$ are constants and all operations are carried
out on words of length $ (\log N + \tau)$ bits, as
was required for the usual arithmetic code. If $s$ is much
less than $N$, the encoding time of the new method is less than
that of the usual arithmetic code, see (\ref{newtime}) and
(\ref{time}).
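
The interval (\ref{newint}) can be illustrated by the following Python
sketch. The function name and argument layout are ours, exact rational
arithmetic stands in for fixed-length words, and $Q^t_k$ is obtained here by
a plain summation over the subsets, whereas the estimate (\ref{newtime})
presupposes one of the fast $O(\log s)$ structures mentioned above.

\begin{verbatim}
from fractions import Fraction


def grouped_interval(alpha, beta, group_probs, group_sizes, k, ordinal):
    """Coding interval for the letter with the given ordinal in subset k.

    group_probs[k-1] is pi_k and group_sizes[k-1] is m_k;
    k and ordinal are counted from 1.
    """
    q_k = sum(group_probs[:k - 1], Fraction(0))       # Q_k over the subsets
    p_hat = group_probs[k - 1] / group_sizes[k - 1]   # common letter probability
    width = beta - alpha
    low = alpha + width * (q_k + (ordinal - 1) * p_hat)
    high = alpha + width * (q_k + ordinal * p_hat)
    return low, high
\end{verbatim}

For three subsets with probabilities $1/2, 1/4, 1/4$ and sizes $1, 1, 2$, the
second letter of the third subset is mapped from $[0,1)$ to $[7/8, 1)$: only
the three subset probabilities enter the cumulative sum, not the four letter
probabilities.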
537:
We briefly describe decoding with the new method. Suppose that the
letters $x_1 \ldots x_{t-1} x_t $ have been decoded and the letter
$x_{t+1}$ is to be decoded.
Two steps are required.
First, the algorithm uses the usual
arithmetic decoding to find the set $A^t_k$ that contains the (unknown)
letter $a_i$. Second, the
ordinal of the letter $a_i$ is calculated as follows:
\begin{equation}\label{decode}
\delta (a_i) = \lfloor(code (x_{t+1}\ldots) - Q^t_k )/
\hat{p}_k^{\,t}\rfloor + 1,
\end{equation}
where $ code (x_{t+1}\ldots)$ is the number that encodes the word
$x_{t+1}x_{t+2}\ldots$. It can be seen that (\ref{decode}) is the
inverse of (\ref{newint}); the final $+1$ reflects the enumeration of
the letters of $A^t_k$ from 1. In order to calculate (\ref{decode})
the decoder carries out one subtraction, one division and one increment.
That is why the total number of decoding operations is given by
the same formula as for the encoding, see (\ref{newtime}).
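
The two decoding steps can be sketched in Python as follows. This is an
illustration under our own assumptions: the code value is taken to be already
rescaled to the current interval $[0,1)$, the subset is found by a linear
scan rather than by a fast structure, and, because the letters of a subset
are enumerated from 1, the ordinal is the integer part of the quotient plus
one. The function name is ours.

\begin{verbatim}
from fractions import Fraction


def decode_letter(value, group_probs, group_sizes):
    """Return (k, ordinal), both counted from 1, for a code value in [0, 1)."""
    q_k = Fraction(0)
    for k, (pi_k, m_k) in enumerate(zip(group_probs, group_sizes), start=1):
        if value < q_k + pi_k:                  # the value falls into subset k
            p_hat = pi_k / m_k
            ordinal = int((value - q_k) / p_hat) + 1   # integer part, then +1
            return k, ordinal
        q_k += pi_k
    raise ValueError("value must lie in [0, 1)")
\end{verbatim}

With subset probabilities $1/2, 1/4, 1/4$ and sizes $1, 1, 2$, the value
$15/16$ lies in the third subset and $(15/16 - 3/4)/(1/8) = 3/2$, so the
sketch returns the pair $(3, 2)$.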
555:
556: It is worth noting that multiplications and divisions in
557: (\ref{newint}) and (\ref{decode}) could be carried out faster if
the subset sizes are powers of two. On the other hand, in
this case the number of subsets is larger, which is why both
versions could be useful.
561:
We have not yet estimated the time needed for maintaining the order
of the letters of $A$ according to their frequencies (\ref{piti}).
The point is that the order must be updated by the encoder and
the decoder after encoding and decoding each letter $x_t$. It
turns out that it is possible to update the order using a fixed
number of operations. Such a method is described in the next
section. Besides, we should take into account that, when $x_t$ is
encoded (or decoded), one frequency (\ref{piti}) changes
and at most two $\pi_i$ (\ref{PQ1}) must be recalculated. It is
easy to see that all these transformations can be done with no
more than two additions and two subtractions. Therefore, the
total number of operations for encoding and decoding is given by
(\ref{newtime}) with a new constant $c_2^*$.
575:
576: So we can see that
577: %\section{Conclusion and discussion} The main result of the paper can be formulated as follows.
578: %\begin{theorem}
579: %
if the arithmetic code can be applied to an $N$-letter source so
that the number of coding operations (on words of a certain length)
is $$ T= c_1 \log N + c_2,$$ then there exists a coding algorithm
that can be applied to the grouped alphabet $
A_1^t,\ldots, A_s^t $ in such a way that, first, at each moment
$t$ the letters are ordered by decreasing
frequencies and, second, the number of coding operations is $$
T= c_1 \log s + c_2^* $$ on words of the same length, where $
c_1, c_2, c_2^* $ are constants.
589:
590: %\section{Conclusion and discussion}
591: \section{ A fast algorithm for keeping the alphabet letters ordered.}
592: In this section we describe a data structure and an algorithm, which allow
593: one to carry out all the operations for maintaining the alphabet letters
594: ordered by their frequencies, in such a way that the
595: number of such operations is constant, independently of the
596: probability distribution, the size of the alphabet, and other
597: characteristics.
598:
The suggested data structure is based on five arrays $Fr [1 :
N]$, $SortedAlphabet [1:N]$, $InverseSort[1:N]$, $SetBegin [0:
MAX ]$ and $SetEnd [0: MAX ]$, where, as before, $N$ is the size of the
alphabet and $MAX$
is an upper bound for the maximal occurrence count. (For
example, if the code uses a sliding window to adapt to the
source, $MAX$ is upper bounded by the length of the window.)
We also denote by $\Lambda^t_k$ the set of letters from $A$ whose
occurrence count equals $k$ at the moment $t$. At
each moment $t$ the array $Fr$ contains the
occurrence counts of the letters from $A$ in the word $
x_1 \ldots x_{t-1} x_t$, so that $Fr[i]= \nu^t(a_i)$. The array
$SortedAlphabet [1:N]$ consists of the letters from $A$ ordered by
occurrence count. More precisely, the following property is
satisfied: if $i \leq j$, $ SortedAlphabet [i]= b $ and $
SortedAlphabet [j]= c$, then $ \nu^t(b) \leq \nu^t(c)$. In
particular, this means that all letters from a subset $\Lambda^t_k$,
$k=0,1,\ldots$, occupy consecutive positions in $SortedAlphabet [1:N]$,
forming a string. $SetBegin [k]$ and $SetEnd [k]$ contain
the positions of the beginning and the end of this string.
Finally, by definition, $InverseSort[i]$
contains the integer $j$ such that $SortedAlphabet [j]= a_i$.
620:
Let us consider a small example. Let $N = 4$, $t = 4$ and let the
frequencies be
$\nu^t(a_1)=0$, $\nu^t(a_2)=1$, $\nu^t(a_3)=2$ and $\nu^t(a_4)=1$.
Then $Fr = [0,1,2,1]$, $SortedAlphabet = [a_1,a_4,a_2,a_3]$,
$InverseSort = [1,3,4,2]$, $SetBegin = [1,2,4]$ and
$SetEnd = [1,3,4]$ is one possible configuration
of the contents of the relevant arrays.
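
The defining properties of the five arrays can be checked
mechanically. The following Python sketch is an illustration only:
the function name \verb"is_consistent", the representation of the
letter $a_i$ by the integer $i$, and the unused entry at index $0$
(which keeps the arrays $1$-based, as in the text) are assumptions
made for this example. It tests the configuration given above.

\begin{verbatim}
def is_consistent(fr, sorted_alphabet, inverse_sort,
                  set_begin, set_end):
    """Check the invariants of the five arrays.

    fr, sorted_alphabet and inverse_sort are 1-based
    (index 0 is unused); set_begin and set_end are indexed
    by the count k = 0..MAX.
    """
    n = len(fr) - 1
    # InverseSort[i] must be the position of a_i.
    for i in range(1, n + 1):
        if sorted_alphabet[inverse_sort[i]] != i:
            return False
    # Counts must be non-decreasing along SortedAlphabet.
    for j in range(1, n):
        if fr[sorted_alphabet[j]] > fr[sorted_alphabet[j + 1]]:
            return False
    # SetBegin[k]..SetEnd[k] must hold exactly the letters
    # whose count equals k.
    for k in range(len(set_begin)):
        run = sorted_alphabet[set_begin[k]:set_end[k] + 1]
        wanted = [a for a in sorted_alphabet[1:] if fr[a] == k]
        if run != wanted:
            return False
    return True


# Configuration from the example: N = 4, counts 0, 1, 2, 1.
fr = [None, 0, 1, 2, 1]
sorted_alphabet = [None, 1, 4, 2, 3]   # a_1, a_4, a_2, a_3
inverse_sort = [None, 1, 3, 4, 2]
set_begin = [1, 2, 4]                  # k = 0, 1, 2
set_end = [1, 3, 4]
ok = is_consistent(fr, sorted_alphabet, inverse_sort,
                   set_begin, set_end)
print(ok)
\end{verbatim}

The check accepts the configuration of the example and rejects it
if, say, $SetEnd[1]$ is changed from $3$ to $2$ without moving any
letter.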
628:
Consider next how the arrays are updated. This should
be done by the encoder (and the decoder) after encoding (and decoding)
each letter, in such a way that only a constant number of
operations is needed. Suppose we encode the letter $a_4$ and
increment its occurrence count, and let $k$ be its count before the
increment. The arrays should be changed as
follows: the processed letter ($a_4$) should be exchanged with
the last letter from $\Lambda^t_k$ ($\Lambda^t_1$ in our case)
and the relevant modifications should be made in $SortedAlphabet$
and $InverseSort$. Then the processed letter should be included
in the set $\Lambda^t_{k+1}$ and excluded from the set
$\Lambda^t_k$. In fact, it is enough to change two elements in
$SetBegin$ and $SetEnd$, namely, $SetBegin[k+1]=
SetBegin[k+1]-1$ and $SetEnd[k]= SetEnd[k]-1$, and to increment the
entry of $Fr$ for the processed letter. (In our
example, $a_4$ should be moved from $\Lambda^t_1$ into
$\Lambda^t_2$. When we carry out these calculations the result is
$Fr= [0,1,2,2]$, $SortedAlphabet = [a_1,a_2,a_4,a_3]$,
$InverseSort=[1,2,4,3]$, $SetBegin =[1,2,3]$ and $SetEnd
=[1,2,4]$.)
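
The increment step can be sketched in Python as follows. This is a
minimal illustration under stated assumptions, not a definitive
implementation: the function name \verb"increment", the encoding of
the letter $a_i$ by the integer $i$, and the unused entry at index
$0$ are choices made here. It is also assumed that an empty set
$\Lambda_k$ is stored with $SetBegin[k] = SetEnd[k]+1$, a convention
the text does not specify but which keeps the two boundary updates
valid when $\Lambda_{k+1}$ is empty.

\begin{verbatim}
def increment(fr, sorted_alphabet, inverse_sort,
              set_begin, set_end, i):
    """Raise the count of letter a_i by one in O(1) time."""
    k = fr[i]
    if k + 1 >= len(set_begin):
        raise ValueError("count would exceed MAX")
    pos = inverse_sort[i]          # position of a_i
    last = set_end[k]              # last position of Lambda_k
    other = sorted_alphabet[last]  # letter stored there
    # Exchange a_i with the last letter of Lambda_k.
    sorted_alphabet[pos], sorted_alphabet[last] = other, i
    inverse_sort[other], inverse_sort[i] = pos, last
    # Position `last` now belongs to Lambda_{k+1}.
    set_begin[k + 1] -= 1
    set_end[k] -= 1
    fr[i] = k + 1


# The example from the text (index 0 is unused padding).
fr = [None, 0, 1, 2, 1]
sorted_alphabet = [None, 1, 4, 2, 3]   # a_1, a_4, a_2, a_3
inverse_sort = [None, 1, 3, 4, 2]
set_begin = [1, 2, 4]                  # k = 0, 1, 2
set_end = [1, 3, 4]

increment(fr, sorted_alphabet, inverse_sort,
          set_begin, set_end, 4)
print(fr[1:], sorted_alphabet[1:], inverse_sort[1:],
      set_begin, set_end)
\end{verbatim}

The printed arrays coincide with the result stated in the example.
Every statement in the function is a single array access or
assignment, so the cost does not depend on $N$ or on the counts.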
647:
648: We have considered the case when the occurrence count should be
649: incremented. Decrementing, which is used in certain schemes of the
650: adaptive arithmetic code, can be carried out in a similar manner.
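
One way to carry out the decrement, mirroring the increment, is to
exchange the processed letter with the \emph{first} letter of
$\Lambda^t_k$ and then to set $SetBegin[k]= SetBegin[k]+1$ and
$SetEnd[k-1]= SetEnd[k-1]+1$. The Python sketch below illustrates
this under the same assumptions as before (the letter $a_i$ is
represented by the integer $i$, index $0$ is unused, and an empty set
is stored with $SetBegin[k] = SetEnd[k]+1$); the function name
\verb"decrement" is hypothetical and the text does not spell out
these details.

\begin{verbatim}
def decrement(fr, sorted_alphabet, inverse_sort,
              set_begin, set_end, i):
    """Lower the count of letter a_i by one in O(1) time."""
    k = fr[i]
    if k == 0:
        raise ValueError("count is already zero")
    pos = inverse_sort[i]           # position of a_i
    first = set_begin[k]            # first position of Lambda_k
    other = sorted_alphabet[first]  # letter stored there
    # Exchange a_i with the first letter of Lambda_k.
    sorted_alphabet[pos], sorted_alphabet[first] = other, i
    inverse_sort[other], inverse_sort[i] = pos, first
    # Position `first` now belongs to Lambda_{k-1}.
    set_begin[k] += 1
    set_end[k - 1] += 1
    fr[i] = k - 1


# State with counts 0, 1, 2, 2 (index 0 is unused padding).
fr = [None, 0, 1, 2, 2]
sorted_alphabet = [None, 1, 2, 4, 3]   # a_1, a_2, a_4, a_3
inverse_sort = [None, 1, 2, 4, 3]
set_begin = [1, 2, 3]                  # k = 0, 1, 2
set_end = [1, 2, 4]

decrement(fr, sorted_alphabet, inverse_sort,
          set_begin, set_end, 3)
print(fr[1:], sorted_alphabet[1:], inverse_sort[1:],
      set_begin, set_end)
\end{verbatim}

After decrementing $a_3$ the arrays are $Fr=[0,1,1,2]$,
$SortedAlphabet=[a_1,a_2,a_3,a_4]$, $InverseSort=[1,2,3,4]$,
$SetBegin=[1,2,4]$ and $SetEnd=[1,3,4]$.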
651:
652:
653: \section{Appendix. }
654:
\textbf{The proof of Theorem 1.} It is easy to see that the set
$\bar{P}_N$ of all distributions ordered by decreasing
probability is convex. Moreover, each $ \bar{p} = \{
p_1, p_2,\ldots, p_N \} \in \bar{P}_N$ may be presented as a
convex combination of vectors from the set
\begin{equation}\label{q}
Q_N = \{q_1 = (1,0,\ldots,0), q_2= (1/2,1/2,0,\ldots,0),\ldots,
q_N = ( 1/N, \ldots, 1/N) \}
\end{equation}
as follows:
$$ \bar{p} = \sum_{i=1}^N i\,(p_i - p_{i+1 } )\, q_i , $$
where $p_{N+1}= 0$; the weights $i\,(p_i - p_{i+1})$ are nonnegative
and sum to $\sum_{i=1}^N p_i = 1$.
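As an illustration, take the weight of $q_i$ to be $i\,(p_i -
p_{i+1})$, so that the weights are nonnegative and sum to one. For
$N=3$ and $\bar{p} = (1/2, 1/3, 1/6)$ the weights are
$1\cdot(1/2-1/3)=1/6$, $2\cdot(1/3-1/6)=1/3$ and
$3\cdot(1/6-0)=1/2$, and indeed
$$ \frac{1}{6}\,(1,0,0) + \frac{1}{3}\,(1/2,1/2,0) +
\frac{1}{2}\,(1/3,1/3,1/3) = (1/2,\,1/3,\,1/6) . $$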
667:
668: On the other hand, the redundancy (\ref{red}) is a convex function,
669: because the direct calculation shows that its second partial
670: derivatives are nonnegative. Indeed, the redundancy (\ref{red})
671: can be represented as follows. $$ r(\bar{p}, \bar{m}) = \sum_{i=1
672: }^N p_i \log ( p_i ) \: - \,\sum_{j=1}^s \pi_j (\log \pi_j - \log
673: m_j) = $$
674:
675: $$ \sum_{i=2 }^N p_i \log ( p_i ) \:- \,
676: \sum_{j=2}^s \pi_j (\log \pi_j - \log m_j)\, +$$
677: $$(1-\sum_{k=2}^N p_k ) \log (1-\sum_{k=2}^N p_k )\,
678: -\,(1-\sum_{l=2}^s \pi_l )
( \log (1-\sum_{l=2}^s \pi_l ) - \log m_1). $$ If $a_i$ is a
certain letter from $A$ and $A_j$ is the subset such that $a_i \in
A_j$, then direct calculation shows that
$$ \partial r / \partial p_i = \log_2 e \,(\,\ln p_i - \ln \pi_j-
\ln (1 - \sum_{k=2}^N p_k) + \ln (1 - \sum_{l=2}^s \pi_l)\,) +
constant ,
$$
$$\partial^2 r /
\partial p_i^2 = \log_2 e \,(( 1/ p_i - 1/\pi_j ) +
( 1/ p_1 - 1/\pi_1 )) ,$$
where $p_1 = 1 - \sum_{k=2}^N p_k$ and $\pi_1 = 1 - \sum_{l=2}^s
\pi_l$. The last value is nonnegative because,
by definition, $\pi_j = \sum _{k= n_j }^{n_{j+1}-1}p_k$ and $p_i$
is one of the summands, just as $p_1$ is one of the summands of
$\pi_1$.
693:
Thus, the redundancy is a convex function defined on a
convex set whose extreme points are the vectors of $Q_N$ from
(\ref{q}). So
$$\sup_{ \bar{p} \in \bar{P }_N} r(\bar{p}, \bar{m}) = \max_{ q
\,\in \;Q_N} r(q, \bar{m}) .$$ Each $q \in Q_N$ can be presented
as a vector $ q= (1/(n_i + l), \ldots, 1/(n_i + l), 0, \ldots, 0 )
$ where $ 1 \leq l \leq m_{i+1}$, $i=0, \ldots, s-1$. This
representation, the last equality and the definitions (\ref{q}),
(\ref{red}) and (\ref{Red}) give (\ref{th}).
702:
\textbf{The proof of Theorem 2.} Obviously,
\begin{eqnarray}
\lefteqn{\sum_{a \in A} p(a)( |\gamma_{gr} (a) |+ \log p(a)) =}
\nonumber \\
& & \sum_{a \in A} p(a)( |\gamma_{gr}(a) |+ \log \hat{p}(a)) +
\sum_{a \in A} p(a) \log (p(a)/ \hat{p}(a)) , \label{obv}
\end{eqnarray}
and, from (\ref{pi}), (\ref{code}), we obtain $$ \sum_{a \in A}
p(a)( |\gamma_{gr}(a) |+ \log \hat{p}(a))= \sum_{i=1}^s (
|\gamma_{gr}(a) |+ \log \hat{p}(a)) \sum_{a \in A_i} p(a) = $$ $$
\sum_{i=1}^s ( |\gamma_{gr}(a) |+ \log \hat{p}(a)) \sum_{a \in
A_i} \hat{p}(a) = \sum_{a \in A} \hat{p}(a)( |\gamma_{gr}(a) |+
\log \hat{p}(a)) , $$ where in the $i$th term the factor
$|\gamma_{gr}(a)| + \log \hat{p}(a)$ is evaluated at any $a \in
A_i$, since it takes the same value for all letters of $A_i$. This
equality and (\ref{obv}) give
$$
\sum_{a \in A} p(a)( |\gamma_{gr} (a) |+ \log p(a)) =$$ $$ \sum_{a
\in A} \hat{p}(a)( |\gamma_{gr}(a) |+ \log \hat{p}(a)) + \sum_{a
\in A} p(a)
\log (p(a)/ \hat{p}(a)).$$ From this equality, the statement of the theorem and
the definitions (\ref{red}) and (\ref{red1}) we obtain
$$
\sum_{a \in A} p(a)( |\gamma_{gr} (a) |+ \log p(a)) \leq \Delta +
\delta.
$$
Theorem 2 is proved.
727:
\textbf{The proof of Theorem 3.} The proof is based on
Theorem 1. From (\ref{th}) we obtain the following obvious
inequality:
\begin{equation}\label{cr}
R( \bar{m})\leq \max_{i=1,...,s} \max_{l=1,...,m_i} l\, \log (m_i
/l)/ n_i .
\end{equation} Direct calculation shows that
$$
\partial (l\, \log (m_i /l)/n_i)/\partial l = \log_2 e \,(\ln
(m_i/l) - 1 )/n_i ,$$
$$ \partial^2(l\, \log (m_i /l)/n_i)/\partial l^2 = - \log_2e/(l \,n_i)
<0
$$ and, consequently, the maximum of the function
$l\, \log (m_i /l)/n_i$ is equal to $ m_i\log e / (e \,n_i)$ and is
attained when $l= m_i/e $. So,
$$ \max_{l=1,...,m_i} l\, \log (m_i/l)/ n_i \leq m_i\log e / (e\,
n_i) $$ and from (\ref{cr}) we obtain
\begin{equation}\label{cr1}
R( \bar{m})\leq \max_{i=1,...,s} m_i\log e / (e \,n_i).
\end{equation}
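As a numerical check of the maximisation step (the value $m_i = 8$
is chosen only for illustration, and logarithms are taken to base
$2$), the maximum over integer $l$ is attained at $l=3$, the integer
nearest to $8/e \approx 2.94$:
$$ \max_{l=1,\ldots,8} l \log_2 (8/l) = 3 \log_2 (8/3) \approx
4.2451 \leq 8 \log_2 e / e \approx 4.2459 ; $$
dividing both sides by $n_i$ gives the corresponding term of
(\ref{cr1}).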
It follows from (\ref{cr1}) that if
\begin{equation}\label{cr2}
m_i \leq \delta \,e \,n_i/ \log e
\end{equation}
for all $i$, then $R( \bar{m})\leq \delta $. By
definition (see the statement of the theorem), $n_i = n_{i-1} +
m_i$ and we obtain from (\ref{cr2}) the first claim of the
theorem. Taking into account that $n_{s-1} < N \leq n_s $ and
(\ref{cr1}), (\ref{cr2}), we can see that if
$$ N = \acute{c}_1 (1+\delta e/ \log e)^s + \acute{c}_2, $$ then
$R( \bar{m})\leq \delta$, where $ \acute{c}_1$ and $\acute{c}_2$
are constants and $N \rightarrow \infty$. Taking the logarithm
and applying the well-known approximation $\ln (1+\varepsilon )
\approx \varepsilon$ for $ \varepsilon \approx 0$, we obtain
(\ref{co}). The theorem is proved.
763:
764:
765: %\section{Appendix 2. The program for grouping.}
766: %\noindent {\bf begin\ }
767:
768:
769: %{\bf end}
770:
771: %\vskip .1in
772:
773:
774:
775:
776:
777:
778:
779:
780:
781: %**********************************************************
782: %* Bibliography *
783: %**********************************************************
784: \newpage
785: \begin{thebibliography}{5}
786:
787: \bibitem{Aho}
A. V. Aho, J. E. Hopcroft, J. D. Ullman. {\it The design and analysis of
computer algorithms}, Reading, MA: Addison-Wesley, 1976.
790:
791: \bibitem{Fenwick}
792: P. Fenwick, ``A new data structure for cumulative probability
793: tables,'' {\it Software -- Practice and Experience,} vol. 24, no.
794: 3, pp. 327--336, March 1994. Errata published in vol. 24, no. 7,
795: p. 667, July 1994.
796:
797: \bibitem{Jo}
D. W. Jones, ``Application of splay trees to data compression,''
{\it Communications of the ACM}, vol. 31, no. 8,
pp. 996--1007, 1988.
801:
802:
803: \bibitem{Ki1}
804: Kieffer, J.C.; Yang, E.H. Grammar-based codes: a new class of
805: universal lossless source codes. {\it IEEE Trans. Inform. Theory},
806: v.46 (2000), no. 3, 737--754.
807:
808: \bibitem{Ki2}
809: Kieffer, J.C.; Yang, E.H.; Nelson, G.J.; Cosman, P. Universal
810: lossless compression via multilevel pattern matching.{\it IEEE
811: Trans. Inform. Theory,} v.46 (2000), no. 4, 1227--1245.
812:
813: \bibitem{Moffat90}
A. Moffat, ``Linear time adaptive arithmetic coding,'' {\it IEEE
Transactions on Information Theory},
vol. 36, no. 2, pp. 401--406, 1990.
817:
818:
819:
820: \bibitem{Moffat99}
A. Moffat, ``An improved data structure for cumulative probability
tables,'' {\it Software -- Practice and Experience},
vol. 29, no. 7, pp. 647--659, 1999.
826:
827:
828: \bibitem{Moffat94}
A. Moffat, R. Neal, I. Witten, ``Arithmetic Coding Revisited,'' {\it ACM
Transactions on Information Systems,} vol. 16, no. 3, pp. 256--294, July 1998.
831:
832: \bibitem{T-M}
833: Moffat, A.; Turpin, A. On the implementation of minimum redundancy
834: prefix codes, {\it IEEE Transactions on Communications,} v.45,
835: no. 10, pp. 1200 - 1207, 1997.
836:
837: \bibitem{M-T1}
838: A.Moffat,A.Turpin, Efficient Construction of Minimum-Redundancy
839: Codes for Large Alphabets. {\it IEEE Trans. Inform. Theory,} vol.
840: IT-44, no. 4, pp. 1650--1657, July 1998.
841:
842:
843: \bibitem{Riss76}
844: J.Rissanen, ``Generalized Kraft inequality and arithmetic
845: coding,'' {\it IBM J. Res. Dev.,} vol. 20, pp. 198--203, May 1976.
846:
847:
848:
849: \bibitem{RyabkoDAN}
850: Ryabko, B. Ya. A fast sequential code. {\it Dokl. Akad. Nauk SSSR
851: } v.306 (1989), no. 3, pp.548--552 (Russian); translation in {\it
852: Soviet Math. Dokl.}, v. 39 (1989), no. 3, pp. 533--537.
853:
854:
855:
856: \bibitem{Ryabko}
857: B.Ryabko, A fast on-line adaptive code, {\it IEEE Trans. Inform.
858: Theory,} vol. IT-38, no. 4, pp. 1400--1404, 1992.
859:
860: %\bibitem{R-R}
861: %D.B. Ryabko, B.Ya. Ryabko. ''The program for grouping
862: %letters'',2002.
863: %In: http://www.ict.nsc.ru/$\sim$ryabko /
864: %GroupYourAlphabet.html
865:
866:
867: \bibitem{Ryabko-Fionov}
B. Ryabko, A. Fionov, ``Fast and Space-Efficient Adaptive Arithmetic
Coding,'' in: {\it Cryptography and Coding, 7th IMA International
Conference, Cirencester, UK, December 1999. Proceedings}, LNCS
1746, pp. 270--279.
872:
873: \bibitem{R-Ri}
B. Ryabko, J. Rissanen, ``Fast Adaptive Arithmetic Code for Large
Alphabet Sources
with Asymmetrical Distributions,'' {\it IEEE Communications Letters,}
2002 (accepted for publication).

See also B. Ryabko, J. Rissanen, ``Fast Adaptive Arithmetic Code for
Large Alphabet Sources with Asymmetrical Distributions,'' {\it
Proceedings of the IEEE International Symposium on Information
Theory, 2002, Lausanne, Switzerland,} p. 319.
883:
884:
885: \bibitem{M-T}
Turpin, A.; Moffat, A., ``On-line adaptive canonical prefix coding
with bounded compression loss,'' {\it IEEE Trans. Inform. Theory,}
vol. IT-47, no. 1, pp. 88--98, 2001.
889:
890:
891: \end{thebibliography}
892: \newpage
893:
894: \noindent{\bf Authors:}
895:
896: \noindent B.Ya. Ryabko\\
897: Professor.\\
898: Siberian State University of Telecommunication and Computer Science\\
899: Kirov Street, 86\\
900: 630102 Novosibirsk, Russia.
901: \vskip .05in
902: \noindent e-mail: \verb"ryabko@neic.nsk.su" \\
903: URL: \verb"www.ict.nsc.ru/~ryabko"
904:
905: \vskip .1in
906:
907:
908: \noindent J. Astola \\
909: Professor.\\
910: Tampere University of Technology\\
911: P.O.B. 553, FIN- 33101 Tampere,\\ Finland.
912: \vskip .05in \noindent
913: e-mail: \verb"jta@cs.tut.fi"
914:
915: \vskip .1in
916:
917:
\noindent K. Egiazarian \\
919: Professor.\\
920: Tampere University of Technology\\
921: P.O.B. 553, FIN- 33101 Tampere, \\ Finland.
922: \vskip .05in
923: \noindent e-mail: \verb"karen@cs.tut.fi"
924:
925:
926:
927: \vskip .1in
928: \noindent{\bf Address for Correspondence:}\\
929: \noindent prof. B. Ryabko \\
930: Siberian State University of Telecommunication and
931: Computer
932: Science\\
933: Kirov Street, 86\\
934: 630102 Novosibirsk, Russia.\\
935: \noindent e-mail: \verb"ryabko@neic.nsk.su" \\
936:
937:
938: %\fi
939: \end{document}
940: