1: %\documentclass[doublecolumn,doublesided]{IEEEtran}
2: \documentclass{article}
3:
4: \usepackage{amsmath,amstext,amssymb,epsf}
5:
6:
7: \sloppy
8:
9: %\usepackage{amsmath,amssymb,latexsym}
10: %\usepackage{ltexpprt}
11: \setlength{\textwidth}{6.5in}
12: \setlength{\textheight}{9.3in}
13: \setlength{\oddsidemargin}{0in}
14: \setlength{\evensidemargin}{0in}
15: \setlength{\topmargin}{-0.1in}
16: \setlength{\headheight}{0in}
17: \setlength{\headsep}{0in}
18: \setlength{\footskip}{0.5in}
19:
20:
21: \newtheorem{lemma}{Lemma}[section]
22: \newtheorem{theorem}[lemma]{Theorem}
23: \newtheorem{proposition}[lemma]{Proposition}
24: \newtheorem{fact}[lemma]{Fact}
25: \newtheorem{claim}[lemma]{Claim}
26: \newtheorem{corollary}[lemma]{Corollary}
27: \newtheorem{conjecture}[lemma]{Conjecture}
28: \newtheorem{notation}[lemma]{Notation}
29: \newtheorem{definition}[lemma]{Definition}
30: \newtheorem{rem}[lemma]{Remark}
31:
32: \numberwithin{equation}{section} % in amsmath
33: \newenvironment{comment}{\begin{small}\begin{quotation}\hspace{-0.23in}\rm}{\end{quotation}\end{small}}
34:
35: \newenvironment{proof}{\par \sf Proof.\rm}{\hspace*{\fill}$\Box$\vspace{1ex}}
36:
37: \newenvironment{remark}{\begin{rem}}{\hspace*{\fill}$\diamondsuit$\end{rem}}
38: \newtheorem{ex}[lemma]{Example}
39: \newenvironment{example}{\begin{ex}}{\hspace*{\fill}$\diamondsuit$\end{ex}}
40:
41: \newcommand{\len}[2]{l_{#1}(#2)}
42:
43: \newcommand{\m}{{\bf m}}
44: \newcommand{\lea}{\stackrel{{}_+}{<}}
45: \newcommand{\gea}{\stackrel{{}_+}{>}}
46: \newcommand{\eqa}{\stackrel{{}_+}{=}}
47: \newcommand{\soph}{\mbox{\rm soph}}
48: \newcommand{\Lint}{L_{{\mathcal N}}}
49: \newcommand{\eps}{\epsilon}
50:
51: \newcommand{\commentout}[1]{}
52:
53: \begin{document}
54: \title{Shannon Information and Kolmogorov Complexity}
55: \author{Peter Gr\"unwald and
56: Paul Vit\'anyi\thanks{
57: Manuscript received xxx, 2004;
58: revised yyy 200?.
59: This work supported in part
60: by the EU fifth framework project QAIP, IST--1999--11234,
61: the NoE QUIPROCONE IST--1999--29064,
the ESF QiT Programme, and the EU Fourth Framework BRA
63: NeuroCOLT II Working Group
64: EP 27150, the EU NoE PASCAL, and by the Netherlands Organization for
65: Scientific Research (NWO) under Grant 612.052.004.
66: Address: CWI, Kruislaan 413,
67: 1098 SJ Amsterdam, The Netherlands.
68: Email: {\tt Peter.Grunwald@cwi.nl, Paul.Vitanyi@cwi.nl}.}
69: }
70:
71: %\markboth{IEEE Transactions on Information Theory, VOL. XX, NO Y, MONTH 2004}{P.D. Gr\"unwald and P.M.B. Vit\'anyi: Shannon Information and Kolmogorov Complexity}
72:
73: \maketitle
74: \begin{abstract}
75: We compare the
76: elementary theories of Shannon information and Kolmogorov
77: complexity, the extent to which they have a common purpose, and where
78: they are fundamentally different. We discuss and relate the basic
79: notions of both theories:
80: Shannon entropy versus Kolmogorov complexity, the relation of both
81: to universal coding, Shannon mutual information
82: versus Kolmogorov (`algorithmic') mutual information,
83: probabilistic sufficient statistic versus algorithmic sufficient
84: statistic (related to lossy compression in
85: the Shannon theory versus
86: meaningful
87: information in the Kolmogorov theory),
88: and
89: rate distortion theory versus Kolmogorov's structure function.
90: Part of the material has appeared in print before, scattered
91: through various publications, but this is the first comprehensive
92: systematic comparison. The last mentioned relations are new.
93:
94: \end{abstract}
95: \tableofcontents
96: \section{Introduction}
97: %How should we measure the amount of information about a phenomenon
98: %that is given to us by a particular observation concerning the
99: %phenomenon?
100: %
{\em Shannon information} theory, usually called just `information'
theory, was introduced in 1948, \cite{Sh48}, by C.E. Shannon (1916--2001). {\em
103: Kolmogorov complexity} theory, also known as `algorithmic
104: information' theory,
105: was introduced with different
106: motivations (among which Shannon's probabilistic notion
107: of information), independently by R.J. Solomonoff
108: (born 1926), A.N. Kolmogorov (1903--1987) and G. Chaitin (born 1943)
109: in 1960/1964, \cite{So64}, 1965, \cite{Ko65}, and 1969 \cite{Ch69},
110: respectively. Both theories
111: aim at providing a means for measuring `information'. They
112: use the same unit to do this: the {\em bit}. In both cases, the amount
113: of information in an object may be interpreted as the length of a
114: description of the object. In the Shannon approach, however, the
115: method of encoding objects is based on the presupposition that the
116: objects to be encoded are outcomes of a known random source---it is
117: only the characteristics of that random source that determine the
118: encoding, not the characteristics of the objects that are its
119: outcomes. In the Kolmogorov complexity approach we consider the
120: individual objects themselves, in isolation so-to-speak, and the
121: encoding of an object is a short computer program
122: (compressed version of the object) that
123: generates it and then halts. In the Shannon approach we are
124: interested in the minimum expected number of bits to transmit a
125: message from a random source of known characteristics
126: through an error-free channel. Says Shannon \cite{Sh48}:
127: \begin{quote}
128: ``The fundamental problem
129: of communication is that of reproducing at one point
130: either exactly or approximately a message selected at another point.
131: Frequently the messages have {\em meaning}; that is they refer to or are
132: correlated according to some system with certain physical or conceptual
133: entities. These semantic aspects of communication are irrelevant to the
134: engineering problem. The significant aspect is that the actual message
135: is one {\em selected from a set} of possible messages. The system must
136: be designed to operate for each possible selection, not just the one which
137: will actually be chosen since this is unknown at the time of design.''
138: \end{quote}
139: In
140: Kolmogorov complexity we are interested in the minimum number of bits
141: from which a particular message or file
142: can effectively be reconstructed: the minimum
143: number of bits that suffice to store the file in reproducible format.
144: This is the basic question
145: of the ultimate compression
146: of given individual files. A
147: little reflection reveals that this is a great difference: for {\em
148: every} source emitting but two messages the Shannon information (entropy) is
149: at most 1 bit, but we can choose both messages concerned of
150: arbitrarily high Kolmogorov complexity. Shannon stresses in his
151: founding article that his notion is only concerned with {\em
152: communication}, while Kolmogorov stresses in his founding article
that his notion aims at filling the gap left by Shannon theory
154: concerning the information in individual objects.
155: Kolmogorov
156: %\cite{Ko65}:
157: \commentout{\begin{quote}
158: ``The probabilistic approach is natural in
159: the theory of information transmission over communication channels
160: carrying `bulk' information consisting of a large number of unrelated or
161: weakly related messages obeying definite probabilistic laws. $\dots$
162: But what real meaning is there, for example, in asking how much information
163: is contained in `War and Peace'? Is it reasonable to include this
164: novel in the set of `possible novels,' or even to postulate
165: some probability distribution for this set? Or, on the other hand, must
166: we assume that the individual scenes in this book form a random
167: sequence with `stochastic relations' that damp out quite rapidly over a
168: distance of several pages?''
169: \end{quote}
170: And in
171: }
172: \cite{Ko83}:
173: \begin{quote}
174: ``Our definition of the
175: quantity of information has the advantage that it refers to individual
176: objects and not to objects treated as members of a set of objects
177: with a probability distribution given on it. The probabilistic
178: definition can be convincingly applied to the information contained,
179: for example, in a stream of congratulatory telegrams. But it would
180: not be clear how to apply it, for example, to an estimate of the quantity
181: of information contained in a novel or in the translation of a novel
182: into another language relative to the original. I think that the
183: new definition is capable of introducing in similar applications
184: of the theory at least clarity of principle.''
185: \end{quote}
186: To be sure, both notions are natural: Shannon ignores the object itself
187: but considers only the characteristics of the random source of which the
188: object is one of the possible outcomes, while Kolmogorov considers
189: only the object itself to determine the number of bits in the ultimate
190: compressed version irrespective of the manner in which the object arose.
191: In this paper, we introduce, compare and contrast the Shannon and Kolmogorov
192: approaches.
193: An early comparison between Shannon entropy and Kolmogorov
194: complexity is \cite{ChCo78}.
195: \paragraph{How to read this paper:}
196: We switch back and forth between the two
197: theories concerned according to the following pattern: we first discuss a
198: concept of Shannon's theory, discuss its properties as well as some
199: questions it leaves open. We then provide Kolmogorov's analogue of
200: the concept and show how it answers the question left open by
201: Shannon's theory.
To ease understanding of the two theories and
how they relate, we supply the overview below,
followed by Sections~\ref{sec:coding} and~\ref{sec:basic},
which discuss preliminaries, fix
notation and introduce the basic notions. The other sections are
largely independent of one another.
208: Throughout the text,
209: we assume some basic familiarity with elementary notions of
210: probability theory and computation, but we have kept the treatment
211: elementary. This may provoke scorn in the information theorist, who sees
212: an elementary treatment of basic matters in his discipline, and likewise
213: from the computation theorist concerning the treatment
214: of aspects of the elementary theory of computation. But experience has shown
215: that what one expert views as child's play is an insurmountable
216: mountain for his opposite number. Thus, we decided to
217: ignore background knowledge and
218: cover both areas from first principles onwards, so that
219: the opposite expert can easily access the unknown discipline, possibly
helped along by the familiar analogues within his own ken.
221: \subsection{Overview and Summary}
222: A summary of the basic ideas is given
223: below. In the paper, these notions are discussed in the same order.
224: \begin{description}
225: \item[1. Coding: Prefix codes, Kraft inequality]
226: (Section~\ref{sec:coding}) Since descriptions or {\em encodings\/} of objects are
227: fundamental to both theories, we first review some elementary facts
228: about coding. The most important of these is the {\em Kraft
229: inequality}. This inequality gives the
fundamental relationship between {\em probability mass functions and
231: prefix codes}, which are the type of codes we are interested in.
Prefix codes and the Kraft inequality underlie most of Shannon's, and a
233: large part of Kolmogorov's theory.
234: \item[2. Shannon's Fundamental Concept: Entropy]
235: (Section~\ref{sec:shannon}) Entropy is defined by a functional that maps
236: {\em probability distributions\/} or,
237: equivalently, {\em random variables},
238: to {\em real numbers}. This notion is derived from first
239: principles as the only `reasonable' way to measure the
240: %`uncertainty
241: %inherent in a probabilisty distribution', or (equivalently), as the
242: `average amount of information conveyed when an outcome of the random
243: variable is observed'. The notion is then related to
244: encoding and communicating messages by Shannon's famous `coding theorem'.
245: \item[3. Kolmogorov's Fundamental Concept: Kolmogorov Complexity]
246: (Section~\ref{sec:kolmogorov})
247: Kolmogorov complexity is defined by a function that maps {\em
248: objects\/} (to be thought of as natural numbers or sequences of
249: symbols, for example outcomes of the random variables
250: figuring in the Shannon theory) to the {\em natural numbers\/}. Intuitively, the Kolmogorov
251: complexity of a sequence is the length (in bits) of the shortest computer
252: program that prints the sequence and then halts.
253: \item[4. Relating entropy and
254: Kolmogorov complexity ]
255: (Section~\ref{sec:KCSE} and Appendix~\ref{sec:universal})
256: Although their primary aim is quite different, and they are functions
257: defined on different spaces, there are close relations
258: between entropy and Kolmogorov complexity. The formal relation
259: ``entropy = expected Kolmogorov complexity'' is discussed in
260: Section~\ref{sec:KCSE}. The relation is further illustrated
261: by explaining `universal coding' (also introduced by Kolmogorov in 1965)
262: which combines elements from both
263: Shannon's and Kolmogorov's theory, and which lies at the basis of most
264: practical data compression methods. While related to the main theme
265: of this paper, universal coding plays no direct role in the later
266: sections, and therefore we delegated it to Appendix~\ref{sec:universal}.
267: \end{description}
268: Entropy and Kolmogorov Complexity are the basic
269: notions of the two theories. They serve as building blocks for all
270: other important notions in the respective theories. Arguably the most
271: important of these notions is {\em mutual information\/}:
272: \begin{description}
273: \item[5. Mutual Information---Shannon and Kolmogorov Style]
274: (Section~\ref{sec:mutual})
275: Entropy and Kolmogorov complexity are
276: concerned with information in a single object: a random variable
277: (Shannon)
278: or an individual sequence (Kolmogorov). Both theories provide
279: a (distinct) notion of {\em mutual information\/} that
280: measures the information that {\em one
281: object gives about another object}. In Shannon's theory, this is the
282: information that one random variable carries about another; in
283: Kolmogorov's theory (`algorithmic mutual information'),
284: it is the information one sequence gives about another.
285: In an appropriate setting the former notion can be shown to
286: be the expectation of the latter notion.
287: \item[6. Mutual Information Non-Increase]
288: (Section~\ref{sect.mini})
289: In the probabilistic setting the mutual information between two
290: random variables cannot be increased by processing the outcomes.
291: That stands to reason, since the mutual information is expressed
292: in probabilities of the random variables involved. But in the algorithmic
293: setting, where we talk about mutual information between two
294: strings this is not evident at all. Nonetheless, up to some precision,
the same non-increase law holds. This result was used recently to
refine and extend G\"odel's celebrated incompleteness theorem.
297: \item[7. Sufficient Statistic] (Section~\ref{sect.sufstat}) Although
298: its roots are in the statistical literature, the notion of
299: probabilistic ``sufficient statistic'' has a natural formalization
300: in terms of mutual Shannon information, and can thus also be
301: considered a part of Shannon theory. The probabilistic sufficient
302: statistic extracts the information in the data about a model class.
303: In the algorithmic setting, a sufficient statistic extracts the
304: meaningful information from the data, leaving the remainder as
305: accidental random ``noise''. In a certain sense the probabilistic version of
306: sufficient statistic is the expectation of the algorithmic version.
307: These ideas are generalized significantly in the next item.
308: \item[8. Rate Distortion Theory versus Structure Function]
309: (Section~\ref{sect.rdsf}) Entropy, Kolmogorov complexity and mutual
310: information are concerned with {\em lossless\/} description or
311: compression: messages must be described in such a way that from the
312: description, the original message can be completely reconstructed.
313: Extending the theories to {\em lossy\/} description or compression
314: leads to rate-distortion theory in the Shannon setting, and the
Kolmogorov structure function in the Kolmogorov setting. The basic
316: ingredients of the lossless theory (entropy and Kolmogorov
317: complexity) remain the building blocks for such extensions. The
318: Kolmogorov structure function significantly extends the idea of
319: ``meaningful information'' related to the algorithmic sufficient
320: statistic, and can be used to provide a foundation for inductive
321: inference principles such as Minimum Description Length (MDL). Once again, the Kolmogorov
322: structure function can be related to Shannon's rate-distortion
323: function by taking expectations in an appropriate manner.
324: \end{description}
325:
326: \subsection{Preliminaries}
327: \label{sec:preliminaries}
328: \paragraph{Strings:}
329: Let ${\cal B}$ be some finite or countable set. We use the notation
330: ${\cal B}^*$ to denote the set of finite
{\em strings\/} or {\em sequences\/} over ${\cal B}$. For example,
332: $$\{0,1\}^* = \{ \epsilon,0,1,00,01,10,11,000,\ldots \},$$
333: with $\epsilon$ denoting the {\em empty word} `' with no letters.
Let ${\cal N}$ denote the natural
336: numbers. We identify
337: ${\cal N}$ and $\{0,1\}^*$ according to the
338: correspondence
339: \begin{equation}
340: \label{eq:correspondence}
341: (0, \epsilon ), (1,0), (2,1), (3,00), (4,01), \ldots
342: \end{equation}
343: The {\em length} $l(x)$ of $x$ is the number of bits
344: in the binary string $x$. For example,
345: $l(010)=3$ and $l(\epsilon)=0$.
346: If $x$ is interpreted as an integer, we get $ l(x) = \lfloor \log
347: (x+1) \rfloor$ and, for $x \geq 2$,
348: \begin{equation}
349: \label{eq:intlength}
350: \lfloor \log x \rfloor
351: \leq l(x) \leq \lceil \log x \rceil.
352: \end{equation}
353: Here, as in the sequel, $\lceil x \rceil$ is the smallest integer larger than or equal to
354: $x$, $\lfloor x \rfloor$ is the largest integer smaller than or equal
355: to $x$ and $\log$ denotes logarithm to base two.
356: We shall typically be concerned with
357: encoding finite-length binary strings by other finite-length binary strings.
358: The emphasis is on binary strings only for convenience;
359: observations in any alphabet can be so encoded in a way
360: that is `theory neutral'.
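
\begin{example}
\rm
As a small illustration of the correspondence (\ref{eq:correspondence}),
with numbers chosen here for illustration only and assuming that the
enumeration continues in length-increasing lexicographical order:
the string paired with the natural number $x$ is then the binary expansion
of $x+1$ with its leading `1' removed. For $x=5$ we have $x+1 = 6$,
with binary expansion $110$, so that $5$ corresponds to the string $10$.
Accordingly,
\[
l(5) = \lfloor \log 6 \rfloor = 2, \qquad
\lfloor \log 5 \rfloor = 2 \leq l(5) \leq 3 = \lceil \log 5 \rceil ,
\]
in agreement with (\ref{eq:intlength}).
\end{example}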
361:
362: \paragraph{Precision and $\lea, \eqa$ notation:}
363: It is customary in the area of Kolmogorov complexity
364: to use ``additive constant $c$'' or
365: equivalently ``additive $O(1)$ term'' to mean a constant,
366: accounting for the length of a fixed binary program,
367: independent from every variable or parameter in the expression
368: in which it occurs. In this paper we use the prefix complexity
369: variant of Kolmogorov complexity for convenience. Since
370: (in)equalities in the Kolmogorov complexity setting
371: typically hold up to an additive constant, we use a special notation.
372:
373: We will denote by $\lea$ an
374: inequality to within an additive constant. More precisely, let $f,g$
375: be functions from $\{0,1\}^*$ to ${\cal R}$,
376: the {\em real numbers}. Then by `$f(x) \lea g(x)$'
377: we mean that there exists a $c$ such that for all $x \in \{0,1\}^*$,
378: $f(x) < g(x) + c$. We denote by $\eqa$ the situation when both $\lea$
379: and $\gea$ hold.
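
\begin{example}
\rm
A minimal illustration of this notation, with functions chosen here
purely for the sake of the example: let $f(x) = l(x) + 5$ and
$g(x) = l(x)$. Then
\[
f(x) < g(x) + 6 \quad \mbox{and} \quad g(x) < f(x) + 1
\quad \mbox{for all } x \in \{0,1\}^*,
\]
so $f(x) \lea g(x)$ and $f(x) \gea g(x)$, that is, $f(x) \eqa g(x)$.
By contrast, for $h(x) = 2 l(x)$ we have $g(x) \lea h(x)$ but not
$h(x) \lea g(x)$, since $h(x) - g(x) = l(x)$ is unbounded.
\end{example}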
380:
381: \paragraph{Probabilistic Notions:}
382: Let ${\cal X}$ be a finite or countable set. A function $f: {\cal X}
383: \rightarrow [0,1]$ is a {\em probability mass function} if $\sum_{x
384: \in {\cal X}} f(x) = 1$. We call $f$ a {\em sub-probability mass
385: function} if $\sum_{x \in {\cal X}} f(x) \leq 1$. Such sub-probability
386: mass functions will sometimes be used for technical convenience. We
387: can think of them as ordinary probability mass functions by
388: considering the surplus probability to be concentrated on an undefined
389: element $u \not\in {\cal X}$.
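
\begin{example}
\rm
As an illustration (the function below is an assumption made here for
the sake of the example), take ${\cal X} = \{0,1\}^*$ and
$f(x) = 2^{-2 l(x) - 2}$. Since there are $2^n$ strings of length $n$,
\[
\sum_{x \in \{0,1\}^*} f(x) = \sum_{n=0}^{\infty} 2^n \cdot 2^{-2n-2}
= \sum_{n=0}^{\infty} 2^{-n-2} = \frac{1}{2} ,
\]
so $f$ is a sub-probability mass function but not a probability mass
function. Assigning the surplus probability $1/2$ to the undefined
element $u$ turns it into an ordinary probability mass function on
${\cal X} \cup \{u\}$.
\end{example}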
390:
391: In the context of (sub-) probability mass functions, ${\cal
392: X}$ is called the {\em sample space}. Associated with mass function $f$ and
393: sample space ${\cal X}$ is the {\em random variable\/} $X$ and the
394: probability distribution $P$ such that $X$ takes value $x \in {\cal
395: X}$ with probability $P(X=x) = f(x)$. A subset of ${\cal X}$ is
396: called an {\em event}. We extend the probability of individual
397: outcomes to events. With this terminology, $P(X= x) = f(x)$ is the
398: probability that the singleton event $\{x\}$ occurs, and $P(X \in
399: {\cal A}) = \sum_{x \in {\cal A}} f(x)$. In some cases (where the use
400: of $f(x)$ would be confusing) we write $p_x$ as an abbreviation of
401: $P(X= x)$. In the sequel, we often refer to probability distributions
402: in terms of their mass functions, i.e. we freely employ phrases like
403: `Let $X$ be distributed according to $f$'.
404:
Whenever we refer to probability mass functions without explicitly
mentioning the sample space, ${\cal X}$ is assumed to
be ${\cal N}$ or, equivalently, $\{ 0,1\}^*$.
408:
409: For a given probability mass function $f(x,y)$ on sample space ${\cal
410: X} \times {\cal Y}$ with random variable $(X,Y)$, we define the {\em
411: conditional probability mass function\/} $f(y \mid x)$ of outcome
412: $Y=y$ given outcome $X=x$ as
413: $$
f (y \mid x) := {f(x,y) \over \sum_{y'} f(x,y')}.
415: $$
416: Note that $X$ and $Y$ are not necessarily independent.
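
\begin{example}
\rm
A small numerical illustration, with a joint mass function chosen here
only for the purpose of the example: let
${\cal X} = {\cal Y} = \{0,1\}$ and
\[
f(0,0) = \tfrac{1}{2}, \quad f(0,1) = \tfrac{1}{4}, \quad
f(1,0) = \tfrac{1}{8}, \quad f(1,1) = \tfrac{1}{8} ,
\]
where the first argument is the outcome of $X$ and the second that of $Y$.
Then the conditional probabilities of $Y=1$ given $X=0$ and given $X=1$ are
\[
f(1 \mid 0) = \frac{1/4}{1/2 + 1/4} = \frac{1}{3}, \qquad
f(1 \mid 1) = \frac{1/8}{1/8 + 1/8} = \frac{1}{2} .
\]
Since these two conditional probabilities differ, $X$ and $Y$ are not
independent in this example.
\end{example}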
417:
418: In some cases (esp. Section~\ref{sec:relpa} and
419: Appendix~\ref{sec:universal}), the notion of {\em sequential
420: information source\/} will be needed. This may be thought of as a
421: probability distribution over arbitrarily long binary sequences, of
422: which an observer gets to see longer and longer initial segments.
423: Formally, a sequential information source $P$ is a probability
424: distribution on the set $\{0,1\}^{\infty}$ of one-way infinite
425: sequences. It is characterized by a {\em sequence of probability mass
426: functions\/} $(f^{(1)},f^{(2)}, \ldots)$ where
427: $f^{(n)}$ is a probability mass function on $\{0,1\}^n$ that
denotes the {\em marginal\/} distribution of
$P$ on the initial segments of length $n$. By definition, the sequence
430: $f \equiv (f^{(1)}, f^{(2)},
431: \ldots)$ represents a sequential information source if for all $n >
432: 0$, $f^{(n)}$ is related to $f^{(n+1)}$ as follows: for all $x \in
433: \{0,1\}^n$, $\sum_{y \in \{0,1\}} f^{(n+1)}(xy) = f^{(n)}(x)$ and
$f^{(0)}(\epsilon)=1$. This is also called Kolmogorov's {\em compatibility
435: condition\/} \cite{Ri89}.
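
\begin{example}
\rm
Perhaps the simplest illustration is the fair-coin source (our own
choice of example): $f^{(n)}(x) = 2^{-n}$ for all $x \in \{0,1\}^n$.
For every $n$ and every $x \in \{0,1\}^n$,
\[
\sum_{y \in \{0,1\}} f^{(n+1)}(xy) = 2 \cdot 2^{-(n+1)} = 2^{-n} = f^{(n)}(x) ,
\]
so the compatibility condition holds.
\end{example}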
436:
437: Some (by no means all!) probability mass functions on $\{ 0,1\}^*$ can
438: be thought of as information sources. Namely, given a probability mass
439: function $g$ on $\{0,1 \}^*$, we can define $g^{(n)}$ as the
440: conditional distribution of $x$ given that the length of $x$ is $n$,
441: with domain restricted to $x$ of length $n$. That is, $g^{(n)}:
442: \{0,1\}^n \rightarrow [0,1]$ is defined, for $x \in \{0,1\}^n$, as
443: $g^{(n)}(x) = g(x) / \sum_{y \in \{0,1\}^n} g(y)$. Then $g$ can be
444: thought of as an information source if and only if the sequence
445: $(g^{(1)}, g^{(2)}, \ldots)$ represents an information source.
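
\begin{example}
\rm
To illustrate, with mass functions assumed here only for the example,
let $g(x) = 2^{-2 l(x) - 1}$, which sums to $1$ over $\{0,1\}^*$. Then
\[
g^{(n)}(x) = \frac{2^{-2n-1}}{2^n \cdot 2^{-2n-1}} = 2^{-n}
\quad \mbox{for all } x \in \{0,1\}^n ,
\]
and $\sum_{y \in \{0,1\}} g^{(n+1)}(xy) = 2^{-n} = g^{(n)}(x)$, so $g$
can be thought of as an information source. By contrast, let $h$ assign
probability $2^{-n-1}$ to the single string of length $n$ that consists
of $n$ zeroes if $n$ is odd and of $n$ ones if $n$ is even. Then
$h^{(1)}(0) = 1$ while $h^{(2)}(00) + h^{(2)}(01) = 0$, so $h$ cannot be
thought of as an information source.
\end{example}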
446: \paragraph{Computable Functions:}
447: %{\em Integer-valued functions\/}
448: %When dealing with computability issues, it is convenient to consider
449: Partial functions on the natural numbers ${\cal N}$ are
450: functions $f$ such that $f(x)$ can be `undefined' for some $x$. We
451: abbreviate `undefined' to `$\uparrow$'. A
452: central notion in the theory of computation is that of the {\em
453: partial recursive functions}. Formally, a function $f: {\cal N}
454: \rightarrow {\cal N} \cup \{ \uparrow \}$ is called {\em partial
455: recursive\/} or {\em computable\/} if there exists a Turing Machine
456: $T$ that implements $f$. This means that for all $x$
457: \begin{enumerate}
458: \item
459: If $f(x) \in {\cal N}$, then $T$,
460: when run with input $x$ outputs $f(x)$ and then halts.
461: \item
462: If $f(x) = \uparrow$ (`$f(x)$ is undefined'), then $T$ with input $x$ never halts.
463: \end{enumerate}
464: Readers not familiar with computation theory may think of a Turing
465: Machine as a computer program written in a general-purpose language such as
466: C or Java.
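
\begin{example}
\rm
As a simple illustration (the function is chosen here for the example),
define $f(x) = x/2$ if $x$ is even and $f(x) = \uparrow$ if $x$ is
odd. This $f$ is partial recursive: a program can test whether $x$ is
even, output $x/2$ and halt if so, and enter an infinite loop
otherwise. In particular, $f(4) = 2$ and $f(1) = \uparrow$.
\end{example}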
467:
468: A function $f: {\cal N} \rightarrow {\cal N} \cup \{ \uparrow \}$ is
469: called {\em total\/} if it is defined for all $x$ (i.e. for all $x$,
470: $f(x) \in {\cal N}$). A {\em total recursive\/} function is thus a
471: function that is implementable on a Turing Machine that halts on all
472: inputs. These definitions are extended to several arguments as
473: follows: we fix, once and for all, some standard invertible pairing
474: function $\langle \cdot, \cdot \rangle: {\cal N} \times {\cal N}
475: \rightarrow {\cal N}$ and we say that $f: {\cal N} \times {\cal N}
476: \rightarrow {\cal N} \cup \{ \uparrow \}$ is computable if there
477: exists a Turing Machine $T$ such that for all $x_1, x_2$, $T$ with
478: input $\langle x_1, x_2 \rangle$ outputs $f(x_1,x_2)$ and halts if
479: $f(x_1,x_2) \in {\cal N}$ and otherwise $T$ does not halt. By
480: repeating this construction, functions with arbitrarily many arguments
481: can be considered.
482:
{\em Real-valued Functions:} We call a
function $f: {\cal N} \rightarrow {\cal R}$ {\em recursive\/} or
{\em computable\/} if there exists a Turing machine that, when input
$\langle x, y\rangle$ with $x \in \{0,1\}^*$ and $y \in {\cal N}$,
outputs $f(x)$ to precision $1/y$; more precisely, it outputs a pair
$\langle p, q \rangle$ such that $| p/q - |f(x)| | < 1/y $ and an
additional bit to indicate whether $f(x)$ is larger or smaller than $0$.
490: Here $\langle \cdot, \cdot \rangle$ is the standard pairing function.
491: In this paper all real-valued functions we consider are by definition
total. Therefore, in line with the above definitions, for a
real-valued function, `computable' (equivalently, `recursive') means
494: that there is a Turing Machine which for {\em all\/} $x$, computes
495: $f(x)$ to arbitrary accuracy; `partial' recursive real-valued
496: functions are not considered.
497:
498: It is convenient to distinguish between {\em upper\/} and {\em lower
499: semi-computability}. For this purpose we consider both the argument
500: of an auxiliary function $\phi$ and the value of $\phi$ as a pair of
natural numbers according to the standard pairing function $\langle
\cdot, \cdot \rangle$. We define a function from ${\cal N}$ to the reals
${\cal R}$ by a Turing machine $T$ computing a function $\phi$ as
follows. Interpret the computation $\phi(\langle x,t \rangle ) =
\langle p,q \rangle$ to mean that the quotient $p/q$ is the
rational-valued $t$th approximation of $f(x)$.
507: \begin{definition}\label{def.enum.funct}
508: \rm
509: \label{def.semi}
510: A function $f: {\cal N} \rightarrow {\cal R}$ is
511: {\em lower semi-computable} if there is a Turing machine $T$ computing a
512: total function $\phi$
513: such that $\phi (x,t+1) \geq \phi (x,t)$ and
514: $\lim_{t \rightarrow \infty} \phi (x,t)=f(x)$. This means
515: that $f$ can be computably approximated from below.
516: A function $f$ is {\em upper semi-computable} if
$-f$ is lower semi-computable.
518: Note that, if $f$ is both upper- and lower semi-computable, then
519: $f$ is computable.
520: \end{definition}
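
\begin{example}
\rm
A standard illustration, stated here under the assumption of a fixed
effective enumeration $T_0, T_1, \ldots$ of Turing machines (which is
not introduced above): let $f(x) = 1$ if $T_x$ halts on input $x$ and
$f(x) = 0$ otherwise. Put $\phi(x,t) = 1$ if $T_x$ halts on input $x$
within $t$ steps and $\phi(x,t) = 0$ otherwise. Then $\phi$ is total
and computable, and
\[
\phi(x,t+1) \geq \phi(x,t) \quad \mbox{and} \quad
\lim_{t \rightarrow \infty} \phi(x,t) = f(x) ,
\]
so $f$ is lower semi-computable. It is not computable, since computing
$f(x)$ to precision $1/3$ would decide whether $T_x$ halts on input
$x$, which is undecidable.
\end{example}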
521: %For example, $K(x)$ is upper semi-computable, but not computable.
522:
523: {\em (Sub-) Probability mass functions\/:} Probability mass
524: functions on $\{0,1\}^*$ may be thought of as real-valued functions on
525: ${\cal N}$. Therefore, the definitions of `computable' and
526: `recursive' carry over unchanged from the real-valued function case.
527: \subsection{Codes}
528: \label{sec:coding}
529: We repeatedly consider the following scenario: a {\em
530: sender\/} (say, A) wants to communicate or transmit some information
531: to a {\em receiver\/} (say, B). The information to be transmitted is
an element from some set ${\cal X}$ (this set may or may not consist
of binary strings).
534: It will be communicated by sending a
535: binary string, called the {\em message}.
536: When B receives the message, he can decode it again and (hopefully)
537: reconstruct the element of ${\cal X}$ that was sent.
538: To achieve this, A and B need to agree
539: on a {\em code\/} or {\em description method\/} before
540: communicating. Intuitively, this is a binary relation between {\em
541: source words} and associated {\em code words}. The relation is fully
542: characterized by the {\em decoding function}. Such a decoding function
543: $D$ can be any function $D: \{ 0, 1 \}^* \rightarrow {\cal X}$.
The domain of $D$ is the set of {\em code words\/}
and the range of $D$ is the set of {\em source words}.
$D(y) = x$ is interpreted as ``$y$ is a code
word for the source word $x$''.
549: The set of all code words
550: for source word $x$ is the set $D^{-1} (x) = \{ y: D(y) = x \}$.
Hence, $E=D^{-1}$ can be called the {\em encoding\/} substitution
554: ($E$ is not necessarily a function). With each code $D$ we can
555: associate a {\em length function\/} $L_D: {\cal X} \rightarrow {\cal N}$
556: such that, for each source
word $x$, $L_D(x)$ is the length of the shortest encoding of $x$:
558: $$
559: L_D(x) = \min \{ l(y): D(y) = x \}.
560: $$
561: We denote by $x^*$ the shortest $y$ such that $D(y) = x$; if there is
562: more than one such $y$, then $x^*$ is defined to be the
563: first such $y$ in some agreed-upon order---for example,
564: the lexicographical order.
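
\begin{example}
\rm
A toy illustration, with a code chosen here only for the example and
with $D$ taken to be undefined outside the listed code words: let
${\cal X} = \{a,b,c\}$ and
\[
D(0) = a, \quad D(00) = a, \quad D(1) = b, \quad D(01) = c .
\]
Then $D^{-1}(a) = \{0, 00\}$, so $E = D^{-1}$ is not a function, and
\[
L_D(a) = 1, \quad L_D(b) = 1, \quad L_D(c) = 2, \qquad
a^* = 0, \quad b^* = 1, \quad c^* = 01 .
\]
Here only $a$ has more than one code word, and $a^* = 0$ is the
shorter of the two.
\end{example}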
565:
566: In coding theory attention is often restricted to
567: the case where the source word set is finite, say
568: ${\cal X} = \{ 1, 2, \ldots , N \} $. If there is a constant $l_0$
such that $l(y) = l_0$ for all code words $y$ (which implies $L_D(x) =
l_0$ for all source words $x$),
then we call $D$ a {\em fixed-length\/} code. It is
574: easy to see that $l_0 \geq \log N$.
575: For instance, in teletype transmissions the source
576: has an alphabet of $N = 32$ letters, consisting
577: of the 26 letters in the Latin alphabet plus
578: 6 special characters. Hence, we need $l_0 = 5$
579: binary digits per source letter. In electronic computers
580: we often use the fixed-length ASCII code\index{code!ASCII}
581: with $l_0=8$.
582: \paragraph{Prefix code:}
It is immediately clear that in general
we cannot uniquely recover $x$ and $y$ from the concatenation $E(x)E(y)$.
For example, let $E$ be
the identity mapping.
Then we have $E(00)E(00) = 0000 = E(0)E(000)$.
588: We now introduce {\em prefix codes}, which do not suffer from this defect.
589: A binary string $x$
590: is a {\em proper prefix} of a binary string $y$
591: if we can write $y=xz$ for $z \neq \epsilon$.
592: A set $\{x,y, \ldots \} \subseteq \{0,1\}^*$
593: is {\em prefix-free} if for any pair of distinct
594: elements in the set neither is a proper prefix of the other.
595: A function $D: \{ 0, 1 \}^* \rightarrow {\cal N}$
596: defines a {\it prefix-code}\index{code!prefix-}
597: if its domain is prefix-free.
598: In order to decode a code sequence of a prefix-code,
599: we simply start at the beginning and decode one
600: code word at a time. When we come to the end of
601: a code word, we know it is the end, since no
602: code word is the prefix of any other code word
603: in a prefix-code.
604:
605: Suppose we encode each binary string $x=x_1 x_2 \ldots x_n$ as
606: \[ \bar x = \underbrace{11 \ldots 1}
607: _{n \mbox{{\scriptsize \ times}}}0x_1x_2 \ldots x_n .\]
The resulting code is a prefix code because we can determine where the
code word $\bar x$ ends by reading it from left to right without
backing up. Note that $l(\bar{x}) = 2n+1$; thus, we have encoded strings in
611: $\{0,1\}^*$ in a prefix manner at the price of doubling their
612: length. We can get a much more efficient code by applying the
613: construction above to the length $l(x)$ of $x$ rather than $x$ itself:
614: define $x'=\overline{l(x)}x$, where $l(x)$ is interpreted as a binary
615: string according to the correspondence (\ref{eq:correspondence}). Then the code $D'$ with
616: %, for all $x \in \{0,1\}^*$,
$D'(x') = x$ is a prefix code satisfying, for all $x \in
\{0,1\}^*$ with $n = l(x)$, $l(x') = n+2 \log n+1$ (here we ignore the `rounding error'
in \eqref{eq:intlength}). $D'$ is used throughout this paper as
a standard code to encode natural numbers in a prefix-free manner; we call it
621: the {\em standard prefix-code for the natural numbers}. We use
622: $\Lint(x)$ as notation for $l(x')$. When $x$ is
623: interpreted as an integer (using the correspondence
624: (\ref{eq:correspondence}) and (\ref{eq:intlength})), we see that,
625: up to rounding,
626: $\Lint(x) = \log x +
627: 2 \log \log x+1$.
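
As an illustration (a sketch, assuming that the correspondence
(\ref{eq:correspondence}) is the length-increasing lexicographic
enumeration $(\epsilon,0), (0,1), (1,2), (00,3), (01,4), (10,5), \ldots$),
take $x = 10110$, so that $n = l(x) = 5$. The first construction gives
\[ \bar{x} = 11111\,0\,10110, \qquad l(\bar{x}) = 11 = 2n+1 . \]
Under the assumed correspondence the number $5$ is the string $10$, so
$\overline{l(x)} = 11\,0\,10$ and
\[ x' = \overline{l(x)}\,x = 11010\,10110, \qquad l(x') = 10 . \]
A decoder reading $x'$ from left to right counts two $1$'s before the
first $0$, reads the next two bits $10$ as the length $5$, and then knows
that exactly five more bits belong to $x$.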
628:
629: \paragraph{Prefix codes and the Kraft inequality:}
630: %It requires little reflection to realize that
631: %prefix-codes waste potential code words since
632: %the internal nodes of the representation tree
633: %cannot be used, and in fact neither
634: %are the potential descendants of the external nodes
635: %used. Hence, we can expect that the code-word length
636: %exceeds the (binary) source-word length in prefix-codes.
637: Let ${\cal X}$ be the set of natural numbers and
638: consider the straightforward non-prefix representation
639: (\ref{eq:correspondence}).
640: There are two elements of ${\cal X}$ with
641: a description of length $1$, four with a description of
642: length $2$ and so on. However, for a prefix code $D$ for the natural numbers
there are fewer binary prefix code words of each length:
644: if $x$ is a prefix code word
645: then no $y = xz$ with $z \neq \epsilon$ is a prefix code word.
Asymptotically there are fewer prefix code words of length $n$
647: than the $2^n$ source words of length $n$.
648: Quantification of this intuition for countable ${\cal X}$ and
649: arbitrary prefix-codes leads to
650: a precise constraint on the number of code-words of given lengths.
651: This important relation is known as the
652: {\em Kraft Inequality\index{Kraft Inequality|bold}}
653: and is due to L.G. Kraft\index{Kraft, L.G.} \cite{Kr49}.
654: \begin{theorem}
655: \label{kraft}
656: Let
657: $l_1 , l_2 , \ldots $
658: be a finite or infinite sequence
659: of natural numbers.
660: There is a prefix-code with this sequence as
661: lengths of its binary code words iff
662: $$
663: \sum_n 2^{{-} l_n } \leq 1.
664: $$
665: \end{theorem}
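The following small check illustrates the theorem (the particular
lengths are chosen here for illustration only). The lengths $1, 2, 3, 3$ give
\[ 2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 \leq 1 , \]
and a prefix-code with these lengths exists, for example
$\{0, 10, 110, 111\}$. By contrast, the lengths $1, 1, 2$ give
\[ 2^{-1} + 2^{-1} + 2^{-2} = 5/4 > 1 , \]
so no prefix-code has them: the two code words of length $1$ must be
$0$ and $1$, and every longer binary string has one of them as a proper prefix.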
666: \paragraph{Uniquely Decodable Codes:}
667: We want to code elements of ${\cal X}$ in a way that they can be
668: uniquely reconstructed from the encoding. Such codes are called
669: `uniquely decodable'.
670: Every prefix-code is a uniquely decodable code. For example, if
671: $E(1) = 0$, $E(2) = 10$, $E(3) = 110$, $E(4) = 111$
672: %peter1 left out
673: %as in Figure~\ref{prefix.tree.picture},
674: %
675: then
676: $1421$ is encoded as $0111100$, which can be
677: easily decoded from left to right in
678: a unique way.
679:
680: On the other hand, not every uniquely decodable code satisfies the prefix
681: condition.
682: %For example, if $E(1) = 0$,
683: %$E(2) = 01$, $E(3) = 011$, $E(4) = 0111$, then
684: %every code word is a prefix of every
685: %longer code word
686: %peter1 left out
687: %as in Figure~\ref{non.prefix.tree.picture}
688: %. But unique decoding is trivial,
689: %since the beginning of a new code word is
690: %always indicated by a zero.
691: Prefix-codes are
692: distinguished from other uniquely decodable codes
693: by the property that the end of a code word is always
694: recognizable as such. This means that decoding
695: can be accomplished without the delay of observing
696: subsequent code words, which is why prefix-codes
697: are also called instantaneous codes.
698:
699: There is
700: good reason for our emphasis on prefix-codes.
701: Namely, it turns out that
702: Theorem~\ref{kraft} stays valid if we replace
703: ``prefix-code'' by ``uniquely decodable code.''
704: %This follows directly from the observation
705: %that if a code has code-word lengths $l_1 , l_2 , \ldots $
706: %and it is uniquely decodable, then the Kraft Inequality
707: %\index{Kraft Inequality}
708: %must be satisfied; see \cite{CT91} for details.
709: This important fact means that every
710: uniquely decodable code can be replaced
711: by a prefix-code without changing the set of
712: code-word lengths.
713: %Hence, all propositions concerning
714: %code-word lengths apply to uniquely decodable
715: %codes and to the subclass of prefix-codes.
716: In Shannon's and Kolmogorov's theories, we are only interested in code
717: word {\em lengths\/} of uniquely decodable codes rather than actual
718: encodings. By the previous
719: argument, we may restrict the set of codes we work with to prefix
720: codes, which are much easier to handle.
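
As an illustration, consider the code $E(1) = 0$, $E(2) = 01$,
$E(3) = 011$, $E(4) = 0111$. It is not a prefix-code, since every code
word is a proper prefix of every longer one. It is nevertheless uniquely
decodable: each code word begins with its only $0$, so the $0$'s mark
the code word boundaries, although the decoder must wait for the next
$0$ before it can tell where the current code word ends. The lengths
$1,2,3,4$ satisfy
\[ 2^{-1}+2^{-2}+2^{-3}+2^{-4} = 15/16 \leq 1 , \]
so by Theorem~\ref{kraft} there is a prefix-code with the same lengths,
for example $\{0, 10, 110, 1110\}$.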
721: %Accordingly, in looking for uniquely decodable
722: %codes with minimal average code-word length
723: %we can restrict ourselves to prefix-codes.
724: \paragraph{Probability distributions and complete prefix codes:}
725: A uniquely decodable code is %
726: \it complete\index{code!uniquely decodable} \rm if the addition of any
727: new code word to its code word set results in a non-uniquely decodable
728: code. It is easy to see that a code is complete iff equality holds in
the associated Kraft Inequality. Let $l_1, l_2, \ldots$ be the
code word lengths of some complete uniquely decodable code. Let us define
$q_x = 2^{- l_x}$. By completeness, we have $\sum_x q_x
= 1$. Thus, the $q_x$ can be thought of as the {\em probability mass
function\/} of some probability distribution $Q$. We
say $Q$ is the distribution {\em corresponding\/} to $l_1,l_2,\ldots$.
735: In this way, each complete uniquely decodable code is mapped to a
736: unique probability distribution. Of course, this is nothing more than
737: a formal correspondence: we may choose to encode outcomes of $X$ using
738: a code corresponding to a distribution $q$, whereas the outcomes are
739: actually distributed according to some $p \neq q$. But, as we show
740: below, if $X$ is distributed according to $p$, then the code to which
741: $p$ corresponds is, in an average sense, the code that achieves
742: optimal compression of $X$.
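
For example (an illustration with lengths chosen for convenience), a code
with code-word lengths $1,2,3,3$, such as $\{0,10,110,111\}$, satisfies
the Kraft Inequality with equality and corresponds to the distribution
\[ q_1 = \frac{1}{2}, \quad q_2 = \frac{1}{4}, \quad q_3 = q_4 = \frac{1}{8},
\qquad \sum_x q_x = 1 . \]
A code with lengths $1,2,3,4$ is not complete, since
$2^{-1}+2^{-2}+2^{-3}+2^{-4} = 15/16 < 1$, and the numbers $2^{-l_x}$
then do not sum to one.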
743: \section{Shannon Entropy versus Kolmogorov Complexity}
744: \label{sec:basic}
745: %A special but very important case occurs if the sample
746: %space ${\cal X}$ is discrete and the outcomes $x$ in our space
747: %are generated by a random source $X$ (are distributed
748: %according to some probability distribution $P(X = x)$), or if Observer is
749: %willing to bet on outcomes as if they were\footnote{In that case we
750: %may say $P$ is the Agent's `subjective distribution'}. This is
751: %situation for which C.E. Shannon has developed his famous information
752: %theory \cite{Shannon48}. More generally,
753: %let $\Pi = \{\pi_1, \ldots, \pi_m\}$ be some
754: %partition of ${\cal X}$. Let $Y$ be another random variable that can take on
755: %values in $\Pi$, for example we partition the sample
756: %space in subsets of outcomes sharing a common property.
757: %Then the statement ``$Y = \pi_j$'' indicates
758: %that the event $\pi_j$ has obtained: the world is in some state
759: %$x \in \pi_j$. As before, we let $X = x$ represent the event that the
760: %world is in state $x \in {\cal X}$. Shannon defines the amount of
761: %information that observing the value of $Y$ gives about the value of
762: %$X$ (the state of the world) in terms of the {\em mutual
763: %information} between $X$ and $Y$. This is a quantity that is defined
764: %in terms of the {\em entropy}. The entropy in turn is the fundamental
765: %quantity of Shannon's theory. Very roughly speaking, the `entropy' of
766: %random variable $X$ can be interpreted as the `expected amount of
767: %surprise' in observing an outcome of the random variable $X$.
768: %The conditional entropy of
769: %$X$ given $Y$ is the `expected amount of surprise' in observing the
770: %outcome of $X$ {\em after\/} having observed the outcome of $Y$. The
771: %`information that observing the outcome of $Y$ gives about the outcome of
772: %$X$' is called the `mutual information between $X$ and $Y$' and is
773: %defined as the entropy of $X$ minus the entropy of $X$ given $Y$.
774: %
775: \subsection{Shannon Entropy}
776: \label{sec:shannon}
777: It seldom happens that a detailed mathematical theory springs forth in
778: essentially final form from a single publication. Such was the case
779: with Shannon information theory, which properly started only with the
appearance of C.E. Shannon's paper ``A mathematical theory of
communication'' \cite{Sh48}.
782: In this paper, Shannon proposed a measure of
783: information in a distribution, which he
784: called the `entropy'. The
entropy $H(P)$ of a distribution $P$ measures
`the inherent uncertainty in $P$', or (in fact
787: equivalently), `how much information is gained when an outcome of $P$
788: is observed'. To make this a bit more precise, let us imagine an
789: observer who knows that $X$ is distributed
790: according to $P$. The observer then observes $X=x$. The entropy of $P$
791: stands for the `uncertainty of the observer about the outcome $x$
792: {\em before\/} he observes it'. Now think of the observer as a
793: `receiver' who receives the message conveying the value of $X$. From this dual point of
794: view, the entropy stands for
795: \begin{quote}
796: the average amount of information that the observer has gained {\em after\/}
797: receiving a realized outcome $x$ of the random variable $X$. $(*)$
798: \end{quote}
799: Below, we first give Shannon's mathematical definition of entropy, and
800: we then connect it to its intuitive meaning $(*)$.
801: \begin{definition} \rm Let ${\cal X}$ be a finite or countable
802: \label{def.entropy}
803: set, let $X$ be a random variable taking values in ${\cal X}$ with
804: distribution $P(X=x)=p_x$. Then
805: the (Shannon-)
806: \it entropy\index{entropy|bold}\index{$H$: entropy stochastic source} %
807: \rm of random variable $X$
808: is given by
809: \begin{equation}
810: \label{eq:entropy}
H(X) = \sum_{x \in {\cal X}} p_x \log 1/p_x .
812: \end{equation}
813: Entropy is defined here as a functional mapping random
814: variables to real numbers. In many texts, entropy is, essentially
815: equivalently, defined as a map from {\em distributions\/} of random variables to
816: the real numbers. Thus, by definition:
817: $
818: H(P) := H(X) = \sum_{x \in {\cal X}} p_x \log 1/ p_x
819: $.
820: \end{definition}
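As a worked instance of (\ref{eq:entropy}) (an illustration, taking
$\log$ to be the binary logarithm so that entropy is measured in bits),
let ${\cal X} = \{1,2,3,4\}$ with $p_1 = 1/2$, $p_2 = 1/4$ and
$p_3 = p_4 = 1/8$. Then
\[ H(X) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3
+ \frac{1}{8} \cdot 3 = \frac{7}{4} . \]
For comparison, the uniform distribution on the same four outcomes has
entropy $4 \cdot \frac{1}{4} \log 4 = 2$.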
821: \paragraph{Motivation:} The entropy function \eqref{eq:entropy}
822: can be motivated in different ways. The two most
823: important ones are the {\em axiomatic\/} approach and the {\em coding
824: interpretation}. In this paper we concentrate on the latter, but we
825: first briefly sketch the former. The idea of the axiomatic approach is
826: to postulate a
827: small set of self-evident axioms that
828: any measure of information relative to a distribution should
829: satisfy. One then shows that the only measure satisfying all the
830: postulates is the Shannon entropy. We
831: outline this approach for
832: finite sources ${\cal X} = \{1,\ldots, N\}$. We look for a function
833: $H$ that maps probability distributions on ${\cal X}$ to real
834: numbers. For given distribution $P$, $H(P)$ should measure
835: `how much information is gained on average when an outcome is made
836: available'. We can write $H(P) = H(p_1,\ldots,p_N)$ where
837: $p_i$ stands for the
838: probability of $i$.
839: Suppose we require that
840: \begin{enumerate}
841: \item $H(p_1,\ldots,p_N)$ is continuous in $p_1,\ldots,p_N$.
\item If all the $p_i$ are equal, $p_i = 1/N$, then $H$ should be a
  monotonically increasing function of $N$. With equally likely events
844: there is more choice, or uncertainty, when there are more possible
845: events.
846: \item If a choice is broken down into two successive choices, the
847: original $H$ should be the weighted sum of the individual values of
848: $H$. Rather than formalizing this condition, we will give a specific
849: example. Suppose that ${\cal X} = \{ 1,2,3\}$, and $p_1 = \frac{1}{2}, p_2 =
850: 1/3, p_3 = 1/6$. We can think of $x \in {\cal X}$ as being generated
851: in a two-stage process. First, an outcome in ${\cal X'} =\{0,1\}$ is
852: generated according to a distribution $P'$ with
853: $p'_0 = p'_1 = \frac{1}{2}$. If $x'=1$, we set $x=1$ and the process
854: stops. If $x'= 0$, then outcome `$2$' is generated with probability
855: $2/3$ and outcome `$3$' with probability $1/3$, and the process
856: stops. The final results
857: have the same probabilities as before. In this particular case we
858: require that
859: $$H(\frac{1}{2},\frac{1}{3},\frac{1}{6}) = H(\frac{1}{2},\frac{1}{2}) + \frac{1}{2} H(\frac{2}{3},\frac{1}{3}) + \frac{1}{2} H(1).$$
Thus, the entropy of $P$ must be equal to the entropy of the first
861: step in the generation process, plus the weighted sum (weighted
862: according to the probabilities in the first step) of the entropies of the
863: second step in the generation process.
864:
865: As a special case, if ${\cal X}$
866: is the $n$-fold product space of another space ${\cal Y}$, $X =
867: (Y_1,\ldots, Y_n)$ and the $Y_i$ are all independently distributed
868: according to $P_Y$, then $H(P_X) = n H(P_Y)$. For example, the total
869: entropy of $n$ independent tosses of a coin with bias $p$ is $n
870: H(p,1-p)$.
871: \end{enumerate}
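One can check that the entropy (\ref{eq:entropy}) satisfies the
instance of requirement 3 given above (a sketch, taking $\log$ to be the
binary logarithm). On the left-hand side,
\[ H(1/2,1/3,1/6) = \frac{1}{2} \log 2 + \frac{1}{3} \log 3 + \frac{1}{6} \log 6
= \frac{2}{3} + \frac{1}{2} \log 3 \approx 1.459 . \]
On the right-hand side, $H(1/2,1/2) = 1$, $H(1) = 0$ and
\[ H(2/3,1/3) = \frac{2}{3} \log \frac{3}{2} + \frac{1}{3} \log 3
= \log 3 - \frac{2}{3} , \]
so that
\[ H(1/2,1/2) + \frac{1}{2} H(2/3,1/3) + \frac{1}{2} H(1)
= 1 + \frac{1}{2} \log 3 - \frac{1}{3} = \frac{2}{3} + \frac{1}{2} \log 3 , \]
in agreement with the left-hand side.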
872: %Remarkably, Shannon \cite{Sh48} proved that
873: \begin{theorem}
874: \label{thm:axiomatic}
The only $H$ satisfying the three assumptions above is of the form
876: $$
877: H = K \sum_{i=1}^N p_i \log 1/p_i,
878: $$
879: with $K$ a constant.
880: \end{theorem}
881: Thus, requirements (1)--(3) lead us to the definition of entropy
882: (\ref{eq:entropy}) given above up to an (unimportant) scaling
883: factor. We shall give a concrete interpretation of this factor later
884: on. Besides the defining characteristics (1)--(3), the function $H$ has a few other
885: properties that make it attractive as a measure of information.
886: We mention:
887: \begin{description}
888: \item[\rm 4.] $H(p_1,\ldots,p_N)$ is a concave function of the $p_i$.
889: \item[\rm 5.] For each $N$, $H$ achieves its unique maximum for the uniform distribution $p_i =
890: 1/N$.
891: \item[\rm 6.] $H(p_1,\ldots,p_N)$ is zero iff one of the $p_i$ has value $1$.
892: Thus, $H$ is zero if and only if we do not gain any information at
893: all if we are told that the outcome is $i$ (since we already knew
894: $i$ would take place with certainty).
895: \end{description}
896: \paragraph{The Coding Interpretation:}
897: Immediately after
898: stating Theorem~\ref{thm:axiomatic}, Shannon \cite{Sh48} continues, ``this theorem, and the
899: assumptions required for its proof, are in no way necessary for the
900: present theory. It is given chiefly to provide a certain plausibility
901: to some of our later definitions. The {\em real justification\/} of these
902: definitions, however, will reside in their implications''.
903: %Thus, in the spirit of Shannon, we will henceforth concentrate on a
904: %very concrete
905: Following this injunction, we emphasize the main practical
906: interpretation of entropy as the length
907: (number of bits) needed to encode outcomes in ${\cal X}$. This
908: provides much clearer intuitions,
909: it lies at the root of the many practical applications
910: of information theory, and, most importantly for us,
911: it simplifies the comparison to Kolmogorov complexity.
912:
913: %Very briefly,
914: %the {\em
915: %entropy\/} of $Y$ is the expected number of bits needed to encode
916: %outcomes in $\Pi$ when the (in some sense) most efficient code to
917: %encode outcomes in $\Pi$ is used. The {\em mutual information between
918: %$Y$ and $X$}, or, equivalently {\em the information that observing the
919: %value of $Y$ gives about $X$\/} is the {\em reduction\/} in the
920: %expected number of bits needed to encode an outcome in $\cal X$ if one
921: %has already observed the value of $Y$.
922: \begin{example}
923: \rm
924: The entropy
925: of a random variable $X$ with equally likely outcomes
926: in a finite sample space ${\cal X}$ is given by
927: $H(X) = \log |{\cal X}|$.
928: %This is a measure of the uncertainty in
929: %choice before we have selected a particular value for $X$,
930: %and of the information
931: %produced from the set if we assign a specific value to $X$.
932: By choosing a particular message $x$ from ${\cal X}$,
933: we remove the entropy from $X$ by the
934: assignment $X := x$ and produce
935: or transmit {\em information}\index{information}
936: $I = \log |{\cal X}|$ by our selection of $x$. We show below
937: that $I = \log |{\cal X}|$ (or, to be more precise, the integer
938: $I' = \lceil
939: \log |{\cal X}| \rceil $) can be interpreted as the number of bits
940: needed to be transmitted from an (imagined) sender
941: to an (imagined) receiver.
942: \end{example}
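For instance (an illustrative choice of sample space, with $\log$ the
binary logarithm), if ${\cal X}$ is the set of the $26$ letters of the
Latin alphabet with equally likely outcomes, then
\[ I = \log 26 \approx 4.70, \qquad I' = \lceil \log 26 \rceil = 5 , \]
so five bits suffice to transmit any single letter.
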
943: We now connect entropy to minimum average code lengths. These are
944: defined as follows:
945: %Given a source that produces source words from ${\cal N}$ according
946: %to probability distribution $P$, it is possible
947: %to assign code words to source words in such a way
948: %that any code word sequence is uniquely decodable,
949: %and moreover the
950: %average code-word length is minimal.
951: \begin{definition}
952: \rm
953: Let source words $x \in \{0,1\}^*$ be
954: produced by a random variable $X$ with probability
955: $P(X=x)=p_x$ for the event $X=x$. The characteristics of
956: $X$ are fixed. Now consider prefix codes
957: $D: \{0,1\}^* \rightarrow {\cal N}$
958: with one code word per source word,
959: and denote the length of the code word for $x$ by $l_x$.
960: We want to minimize the expected number of bits
961: we have to transmit for the given
962: source $X$ and choose a prefix code $D$ that achieves this.
963: In order to do so, we must minimize the
964: %
965: \it average code-word length\index{code!average word length|bold}
966: \rm
967: $\bar{L}_{D} = \sum_x p_x l_x$%
968: \rm .
969: %peter1: changed notation here to get consistency with what follows.
970: %Idea is as follows: l(y) = length of y, l_x is length
971: %of codeword of x, L is codelength function, so that L(x) = l_x
972: % (analogously to p_x, P),
973: %\bar{L} is its expectation. Need this because need to speak about
974: %LIST L_1,L_2 of codelength functions later
975: We define the %
976: \it minimal average code word
977: length
978: \rm as $\bar{L} = \min \{ \bar{L}_{D}: D \mbox{ is a prefix-code}\} $.
979: A prefix-code $D$ such that $\bar{L}_{D} = \bar{L}$ is called
980: an %
981: \it optimal prefix-code\index{code!optimal prefix-}
982: \rm with respect to prior
983: probability $P$ of the source words.
984: \end{definition}
985: The (minimal) average code length of an
986: (optimal) code does not depend on the details of the set
987: of code words, but only on the set of code-word lengths.
988: It is just the expected code-word length
989: with respect to the given distribution.
990: Shannon\index{Shannon, C.E.} discovered that the
991: minimal average code word
992: length is about equal to the entropy of
993: the source word set. This is known as the
994: {\it Noiseless Coding Theorem}.\index{Theorem!Noiseless Coding|bold}
995: The adjective ``noiseless'' emphasizes that we ignore the possibility
996: of errors.
997: \begin{theorem}
998: \label{thm:noiseless}
999: Let $\bar{L}$ and $P$ be as above.
1000: If $H(P) = \sum_x p_x \log 1/ p_x$
1001: is the entropy\index{entropy}, then
1002: \begin{equation}
1003: \label{eq:entopt}
1004: H(P) \leq \bar{L} \leq H(P) + 1.
1005: \end{equation}
1006: \end{theorem}
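Two small cases illustrate (\ref{eq:entopt}) (the numbers are chosen
here for illustration, with $\log$ the binary logarithm). For
$p = (1/2, 1/4, 1/8, 1/8)$ the prefix-code $\{0, 10, 110, 111\}$ has
average code-word length
\[ \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3
= \frac{7}{4} = H(P) , \]
so the lower bound is attained and $\bar{L} = H(P)$. For $p = (0.9, 0.1)$ we have
\[ H(P) = 0.9 \log \frac{1}{0.9} + 0.1 \log \frac{1}{0.1} \approx 0.469 , \]
whereas every prefix-code for two source words uses at least one bit per
source word, so $\bar{L} = 1$, which lies between $H(P)$ and $H(P)+1$.
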
1007: We are typically interested in encoding a binary string
1008: of length $n$ with entropy proportional to $n$
1009: (Example~\ref{ex:universal}). The essence of
1010: (\ref{eq:entopt}) is that,
1011: for all but the smallest $n$, the difference between
1012: entropy and minimal expected
1013: code length is completely negligible.
1014:
It turns out that an average code-word length within the bounds of (\ref{eq:entopt}) is relatively easy to achieve,
with the Shannon-Fano code.
1017: Let there be $N$ symbols
1018: (also called basic messages or source words).
1019: Order these symbols
1020: according to decreasing probability,
1021: say ${\cal X} = \{ 1,2, \ldots ,N \}$ with probabilities $p_1 ,p_2 , \ldots ,p_N$.
1022: Let $P_r = \sum_{i=1}^{r-1} p_i$, for $r = 1, \ldots ,N$.
1023: The binary code $E: {\cal X} \rightarrow \{0,1\}^*$ is obtained
1024: by coding $r$ as a binary number $E(r)$, obtained by
1025: truncating the binary expansion of $P_r$ at length
1026: $l(E(r))$ such that
1027: $$
1028: \log 1/ p_r \leq l(E(r)) < 1 + \log 1/ p_r .
1029: $$
1030: This code is the {\em Shannon-Fano code}.
1031: It has the property that highly probable symbols
1032: are mapped to short code words and symbols with low
probability are mapped to longer code words (as is done, in a less optimal
and non-prefix-free setting, in the Morse code).
1035: Moreover,
1036: $$
1037: 2^{-l(E(r))} \leq p_r < 2^{-l(E(r))+1} .
1038: $$
1039: Note that the code for symbol $r$ differs from all
1040: codes of symbols $r+1$ through $N$ in one or more
1041: bit positions, since for all $i$ with $ r+1 \leq i \leq N$,
1042: \[ P_i \geq P_r + 2^{-l(E(r))}.\]
1043: Therefore the binary
1044: expansions of $P_r$ and $P_i$ differ in the first $l(E(r))$
1045: positions. This means that $E$ is one-to-one,
1046: and it has an inverse: the decoding mapping $E^{-1}$.
1047: Even better,
1048: since
1049: no value of $E$ is a prefix of any other value of $E$,
1050: the set of code words is
1051: a prefix-code\index{code!prefix-}. This means we
1052: can recover the source message
1053: from the code message
1054: by scanning it from left to right
1055: without look-ahead.
1056: If $H_1$ is the average
1057: number of bits used per symbol of an original
1058: message, then $H_1 = \sum_r p_r l(E(r))$.
Combining this with the previous inequality shows that $H_1$ satisfies the bounds of (\ref{eq:entopt}):
1060: $$
1061: \sum_r p_r \log 1/ p_r \leq
1062: H_1 <
1063: \sum_r (1+ \log 1/ p_r )p_r = 1 + \sum_r p_r \log 1/ p_r .
1064: $$
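As a worked example of the construction (illustrative probabilities,
with $\log$ the binary logarithm), take $N = 4$ and
$p = (0.4, 0.3, 0.2, 0.1)$. The code-word lengths are
$\lceil \log 1/p_r \rceil = 2, 2, 3, 4$, and truncating the binary
expansions of the $P_r$ gives
\begin{center}
\begin{tabular}{cccccc}
$r$ & $p_r$ & $P_r$ & binary expansion of $P_r$ & $l(E(r))$ & $E(r)$ \\
\hline
$1$ & $0.4$ & $0$ & $0.0000\ldots$ & $2$ & $00$ \\
$2$ & $0.3$ & $0.4$ & $0.0110\ldots$ & $2$ & $01$ \\
$3$ & $0.2$ & $0.7$ & $0.1011\ldots$ & $3$ & $101$ \\
$4$ & $0.1$ & $0.9$ & $0.1110\ldots$ & $4$ & $1110$
\end{tabular}
\end{center}
No code word is a prefix of another. The average code-word length is
\[ H_1 = 0.4 \cdot 2 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 4 = 2.4 , \]
while $\sum_r p_r \log 1/p_r \approx 1.846$, so that
$1.846 \leq 2.4 < 2.846$ as required. In this instance a shorter
prefix-code exists (the lengths $1,2,3,3$ give average length $1.9$),
so the construction guarantees the bounds rather than the minimum itself.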
1065: %From this it follows that $H_1 \sim H(X)$ for large $n$,
1066: %with $H(X)$ the entropy per symbol of the source.
1067: \commentout{
1068: \begin{example}
1069: \label{ex:00}
1070: \rm
1071: %Application of these notions to the exchange of information $x$
1072: %is as follows:
1073: Assuming that $x$ is emitted by a random source $X$
1074: with probability $P(X=x)$, we can transmit $x$ using the Shannon-Fano
1075: code. This uses (up to rounding) $ \log 1/ P(X=x)$ bits.
1076: By Shannon's noiseless coding theorem this is optimal {\em on average},
1077: the average taken over the probability distribution of outcomes
1078: from the source. Thus, if $x = 00 \ldots 0$ ($n$ zeros), and the
1079: random source emits $n$-bit messages with equal probability $1/2^n$
1080: each, then we require $n$ bits to transmit $x$ (the same as
1081: transmitting $x$ literally). However, we can transmit $x$
1082: in about $\log n$ bits if we ignore probabilities and
1083: just describe $x$ individually. Thus, the optimality with
1084: respect to the average may be very sub-optimal in individual cases.
1085: \end{example}
1086: }
1087:
\paragraph{Problem and Lacuna:}
1090: Shannon observes, ``Messages have %
1091: \index{Shannon, C.E.}
1092: \it meaning %
1093: \rm [ $\ldots$ however $\ldots$ ]
1094: the semantic aspects of communication are irrelevant
to the engineering problem.'' This raises questions. Can we answer a
question like ``what is the information in this book''
by viewing it as an element of a set of possible books
with a probability distribution on it? Or by assuming that the
individual sections in this book form
a random sequence with stochastic relations that
damp out rapidly over a distance of several pages?
1102: And how to measure the quantity of hereditary information in
1103: biological organisms, as encoded in DNA? Again there is the
1104: possibility of seeing a particular form of animal as one of a set of
1105: possible forms with a probability distribution on it. This seems
1106: to be contradicted by the fact that the calculation of
1107: all possible lifeforms in existence at any one time on earth
1108: would give a ridiculously low figure like
1109: %peter1 I don't understand this number!
1110: $2^{100}$.
1111:
1112: Shannon's classical
1113: information theory\index{information theory}\index{Shannon, C.E.}
1114: assigns a quantity of information to an ensemble of
possible messages. If all messages in the ensemble are equally probable,
1116: this quantity is the number of bits needed to
1117: count all possibilities.
1118: This expresses the fact that
1119: each message in the ensemble can be communicated
1120: using this number of bits.
1121: However, it does not say
1122: anything about the number of bits needed to convey any
1123: individual message in the ensemble. To illustrate this,
1124: consider the ensemble consisting of all binary strings
1125: of length 9999999999999999.
1126:
1127: By Shannon's measure, we require
1128: 9999999999999999 bits
1129: on the average to encode a string in such an ensemble. However, the
1130: string consisting of 9999999999999999 1's can be encoded in about
1131: 55 bits by expressing 9999999999999999 in binary and adding the
1132: repeated pattern ``1.'' A requirement for this to work is
1133: that we have agreed on an algorithm that decodes the encoded
1134: string. We can compress the string still further when we note that
1135: 9999999999999999 equals $3^2 \times 1111111111111111$, and that
1136: 1111111111111111 consists of $2^4$ 1's.
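
The figure of about $55$ bits can be checked directly (a rough count,
ignoring the few extra bits needed to delimit the number and the
pattern). Since
\[ 2^{53} \approx 9.0 \times 10^{15} < 9999999999999999 < 2^{54} \approx 1.8 \times 10^{16} , \]
the binary expansion of 9999999999999999 has $54$ bits, and one more
bit specifies the repeated pattern.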
1137:
1138: Thus, we
1139: have discovered an interesting phenomenon: the description of
1140: some strings can be compressed considerably,
1141: provided they exhibit enough regularity.
1142: %This observation, of
1143: %course, is the basis of all systems to express very large
1144: %numbers and was exploited early on by Archimedes in
1145: %\index{Archimedes}
1146: %his treatise %
1147: %\it The Sand Reckoner%
1148: %\rm , in which he proposes
1149: %a system to name very large numbers:
1150: %``There are some, King Golon, who think that the number
1151: %of sand is infinite in multitude [$\ldots$ or] that no number
1152: %has been named which is great enough to exceed its multitude.[$\ldots$]
1153: %But I will try to show you, by geometrical proofs,
1154: %which you will be able to follow, that, of the numbers named
1155: %by me [...] some exceed not only the density of sand equal in
1156: %magnitude to the earth filled up in the way described,
1157: %but also that of a density equal in magnitude to the universe.''
1158: However, if regularity is lacking, it becomes more cumbersome
1159: to express large numbers. For instance, it seems easier to
1160: compress the number ``one billion,'' than the number
1161: ``one billion seven hundred thirty-five million two hundred
1162: sixty-eight thousand and three hundred ninety-four,'' even though they
1163: are of the same order of magnitude.
1164:
1165: We are interested in a measure of information that, unlike Shannon's,
1166: does not rely on (often untenable) probabilistic assumptions,
1167: and that takes into account the phenomenon that
1168: `regular' strings are compressible. Thus, we aim for a measure of information
1169: content of an %
1170: \it individual finite object%
1171: \rm ,
and of the information conveyed about an individual finite
1173: object by another individual finite object. Here, we want
1174: the information content of an object $x$ to be
1175: an attribute of $x$ %
1176: \it alone%
1177: \rm , and not to depend
1178: on, for instance, the means chosen to describe this information
1179: content. Surprisingly, this turns
1180: out to be possible, at least to a large extent. The resulting theory
1181: of information is based on Kolmogorov complexity, a
1182: notion independently proposed by Solomonoff (1964), Kolmogorov (1965)
1183: and Chaitin (1969); Li and Vit\'anyi (1997) describe the history of the
1184: subject.
1185: \subsection{Kolmogorov Complexity}
1186: \label{sec:kolmogorov}
1187: Suppose we want to describe a given object by a
1188: finite binary string. We do not care whether the object
1189: has many descriptions; however, each description
1190: should describe but one object.
1191: From among all descriptions
1192: of an object we
1193: can take the length of the shortest description as
1194: a measure of the object's complexity.
1195: It is natural to call an object ``simple'' if it has
1196: at least one short description, and to call it ``complex''
1197: if all of its descriptions are long.
1198:
1199: %But now we are in danger of falling into the trap
1200: %so eloquently described in the Richard-Berry paradox,
1201: %\index{Paradox!Richard-Berry|bold} where
1202: %we define a natural number as
1203: %``the least natural number that cannot be described in less
1204: %than twenty words.'' If this number does exist, we have just described
1205: %it in thirteen words, contradicting its definitional
1206: %statement. If such a number does not exist, then
1207: %all natural numbers can be described in fewer than twenty
1208: %words.
1209: %We need to look very carefully at what kind
1210: %of descriptions (codes) we may allow.
1211: As in Section~\ref{sec:coding}, consider a description method
1212: $D$, to be used to transmit messages from a sender to a receiver.
1213: %Assume that each description
1214: %describes at most one object. That is, there is a
1215: %specification method $D$ that associates at most one
1216: %object $x$ with a description $y$.
1217: %This means that $D$ is a function from the set of descriptions,
1218: %say $Y$, into the set of objects, say $X$.
1219: %It seems also reasonable to require that
1220: %for each object $x$ in $X$, there is a description $y$ in $Y$
1221: %such that $D(y) = x$.
1222: %(Each object has a description.)
1223: %To make descriptions useful we like them to be finite.
1224: %This means that there are only countably many descriptions. Since there
1225: %is a description for each object, there are also only
1226: %countably many
1227: %describable objects.
1228: %How do we measure the complexity
1229: %of descriptions?
1230: %\index{complexity!algorithmic}
1231: %
1232: %
1233: %Taking our cue from the theory of computation,
1234: %we express descriptions as finite sequences of 0's and 1's.
1235: %In communication technology, if the specification method
1236: If $D$ is known
1237: to both a sender and receiver, then a message $x$ can be transmitted
1238: from sender to receiver by transmitting the description $y$ with
1239: $D(y)=x$. The cost of this transmission is measured by $l(y)$,
1240: the length of $y$. The least cost of transmission of $x$ is determined
1241: by the length function $L(x)$: recall that $L(x)$ is the length of
1242: the shortest $y$ such that $D(y)=x$.
1243: We choose this length function
1244: as the descriptional complexity of $x$ under specification
1245: method $D$.
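
For illustration, consider a toy description method (an assumption made
here purely as an example, not a method used elsewhere in this paper)
with only four descriptions:
\[ D(0) = 0000, \quad D(10) = 1, \quad D(110) = 0000, \quad D(111) = 01 . \]
The string $0000$ has the two descriptions $0$ and $110$, so
$L(0000) = \min \{ l(0), l(110) \} = 1$, while $L(1) = 2$ and $L(01) = 3$.
Every other string has no description under this $D$, so its
descriptional complexity under $D$ is undefined.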
1246:
1247: Obviously, this descriptional complexity of
1248: $x$ depends crucially
1249: on $D$.
1250: The general principle involved is that the syntactic
1251: framework of the description language
1252: determines the succinctness of description.
1253:
1254: In order to objectively compare descriptional complexities
1255: of objects, to be able to say ``$x$ is more complex than $z$,''
1256: the descriptional complexity of $x$
1257: should depend on $x$ alone. This complexity can be viewed as related to
1258: a universal description method that is a priori
1259: assumed by all senders and receivers.
1260: This complexity is optimal if no other description method
1261: assigns a lower complexity to any object.
1262:
1263: We are not really interested in optimality with respect to
1264: all description methods.
1265: For specifications to be useful at all it is
1266: necessary that the mapping from $y$ to $D(y)$
1267: can be executed in an effective manner. That is,
1268: it can at least in principle be performed by humans or machines.
1269: This notion has been
1270: formalized as that of ``partial recursive functions'',
1271: also known simply as ``computable functions'', which are
1272: formally defined later.
1273: According to
1274: generally accepted mathematical viewpoints it coincides
1275: with the intuitive notion of effective computation.
1276:
The set of partial recursive functions
contains an optimal function whose description lengths are,
up to an additive constant, no larger than those of any other such function. We denote
this function by $D_0$.
Namely, for any other partial recursive function $D$
there is a constant $c_D$, independent of $x$, such that
for all objects $x$ and every description $z$ of $x$ under $D$,
there is a description $y$ of $x$ under $D_0$ with
$l(y) \leq l(z) + c_D$.
1287: Complexity with respect to $D_0$ minorizes
1288: the complexities with respect
1289: to all partial recursive functions.
1290:
1291: We identify the
1292: length of the description of $x$ with respect
1293: to a fixed specification function $D_0$ with
1294: the ``algorithmic (descriptional) complexity'' of $x$.
1295: The optimality of $D_0$ in the sense above
1296: means that the complexity of an object $x$
1297: is invariant (up to an additive constant
1298: independent of $x$) under transition
1299: from one optimal specification function to another.
1300: Its complexity is an objective attribute
1301: of the described object alone: it is an
1302: intrinsic property of that object, and it does
1303: not depend on the description formalism.
1304: This complexity can be viewed as ``absolute information content'':
1305: the amount of information that needs to be transmitted
1306: between all senders and receivers when they communicate the
1307: message in absence of any other a priori knowledge
1308: that restricts the domain of the message.
1309: %
1310: %Broadly
1311: %speaking, this means that all description
1312: %syntaxes that are powerful enough to express the partial
1313: %recursive functions are approximately equally succinct.
1314: %In contrast
1315: %to the suggestion implicit in
1316: %the LISP versus FORTRAN example,
1317: %All algorithms can be expressed in each such programming
1318: %language equally succinctly, up to a fixed additive constant term.
1319: %
1320: %The restriction to formally effective descriptions
1321: %covers all intuitively effective descriptions
1322: %by general mathematical consensus.
1323: %While the idea of a theory of short descriptions
1324: %in itself has been proposed before,
1325: %The remarkable
1326: %usefulness and inherent rightness of the theory
1327: %of Kolmogorov complexity stems from this
1328: %independence of the description method. %\begin{comment}
1329: %As an aside, a too narrow restriction of admissible
1330: %functions is not good either. For instance,
1331: %the class of primitive recursive functions,
1332: %\index{function!primitive recursive}
1333: %Exercise~\ref{ex.p.r.function}
1334: %in Section~\ref{sect.recursion}, is a proper subset
1335: %of the partial recursive functions, but
1336: %it contains no (universal) primitive recursive
1337: %function such that the associated complexity
1338: %minorizes the complexities of all other
1339: %primitive recursive functions.
1340: %\end{comment}
1341: %
1342: Thus, we have outlined the program for
1343: a general theory of algorithmic complexity.
1344: The three
1345: %four
1346: major innovations are as follows:
1347: \begin{enumerate}
1348: \item
1349: In restricting
1350: ourselves to formally effective descriptions,
1351: our definition covers every form of description
1352: that is intuitively acceptable as being effective
1353: according to general viewpoints in mathematics and logic.
1354: \item
1355: The restriction to effective descriptions
1356: entails that there is a universal description
1357: method that minorizes the description length or complexity
1358: with respect to any other effective description
1359: method.
1360: Significantly, this implies Item 3.
1361: \item
1362: The description length or complexity of an object
1363: is an intrinsic attribute of the object independent
1364: of the particular description method or formalizations
1365: thereof.
1366: %\item
1367: %The disturbing Richard-Berry paradox above does not disappear,
1368: %but resurfaces in the form of an alternative
1369: %approach to proving Kurt G\"odel's (1906--1978) famous
1370: %\index{G\"odel, K.}
1371: %result that not every true mathematical statement
1372: %is provable in mathematics.
1373: \end{enumerate}
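
To sketch why Item 2 implies Item 3, write $L_D(x)$ for the length of
the shortest description of $x$ under $D$ (the subscript is notation
introduced only for this sketch). If $D_0$ and $D_0'$ are both optimal,
then there are constants $c$ and $c'$, independent of $x$, such that for
all $x$
\[ L_{D_0}(x) \leq L_{D_0'}(x) + c \quad \mbox{and} \quad
L_{D_0'}(x) \leq L_{D_0}(x) + c' , \]
and therefore
\[ | L_{D_0}(x) - L_{D_0'}(x) | \leq \max \{ c, c' \} . \]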
1374:
1375: \subsubsection{Formal Details}
1376: The Kolmogorov complexity $K(x)$ of a finite object $x$
1377: will be defined as the length of the
1378: shortest effective binary description of $x$. Broadly speaking, $K(x)$
1379: may be thought of as the length of the shortest computer program that
1380: prints $x$ and then halts. This computer program may be written in
1381: C, Java, LISP or any other universal language: we shall see that,
1382: for any two universal languages,
1383: the resulting program lengths differ at most by a constant not
1384: depending on $x$.
1385:
1386: To make this precise,
1387: let $T_1 ,T_2 , \ldots$ be a standard enumeration \cite{LiVi97}
1388: of all Turing machines, and let $\phi_1 , \phi_2 , \ldots$
1389: be the enumeration of corresponding functions
1390: which are computed by the respective Turing machines.
1391: That is, $T_i$ computes $\phi_i$.
1392: These functions are the {\em partial recursive} functions
1393: or {\em computable} functions, Section~\ref{sec:preliminaries}. For technical reasons we are interested in the
1394: so-called prefix complexity, which is associated with Turing machines
1395: for which the set of programs (inputs) resulting in a halting computation
1396: is prefix free\footnote{There exists a version of Kolmogorov
1397: complexity corresponding to programs that are not necessarily
1398: prefix-free, but we will not go into it here.}. We can realize this by equipping the Turing
1399: machine with a one-way input tape, a separate work tape,
1400: and a one-way output tape. Such Turing
1401: machines are called prefix machines
1402: since the halting programs for any one of them form a prefix free set.
1403: %Taking the universal prefix machine $U$ we can define
1404: %the prefix complexity analogously with the plain Kolmogorov complexity.
1405: %peter1: we have not defined `plain KC' yet, so this must be changed
1406: %
1407:
1408: We first define $K_{T_i}(x)$, the prefix Kolmogorov complexity of $x$ relative to a
1409: given prefix machine $T_i$, where $T_i$ is the $i$-th prefix machine
1410: in a standard enumeration of them. $K_{T_i}(x)$ is defined as the length of the shortest
1411: input sequence $y$ such that $T_i(y) = \phi_i(y) = x$. If no such
1412: input sequence exists, $K_{T_i}(x)$ remains undefined. Of course, this
1413: preliminary definition is still highly sensitive to the particular
1414: prefix machine $T_i$ that we use. But now the `universal
prefix machine' comes to our rescue. Just as there exist universal ordinary
Turing machines, there also exist universal prefix machines. These
have the remarkable property that they can simulate every other prefix
machine. More specifically, there exists a prefix machine $U$ that,
on input $\langle i, y\rangle$, outputs $\phi_i(y)$
and then halts. We now fix, once and for all,
1421: a prefix machine $U$ with this property and call $U$ the {\em reference
1422: machine}. The Kolmogorov complexity $K(x)$ of $x$ is defined as $K_U(x)$.
1423:
1424: Let us formalize this definition.
1425: Let $\langle \cdot \rangle$ be a standard invertible
1426: effective one-one encoding from ${\cal N} \times {\cal N}$
1427: to a prefix-free subset of ${\cal N}$. $\langle \cdot \rangle$ may be
1428: thought of as the encoding function of a prefix code.
1429: For example, we can set $\langle x,y \rangle = x'y'$.
In contrast to the definition of $\langle \cdot \rangle$ in
Section~\ref{sec:preliminaries}, we require from now on that
$\langle \cdot \rangle$ maps to a prefix-free set.
1433: We insist on prefix-freeness and
1434: effectiveness because we want a universal Turing
1435: machine to be able to read an image under $\langle \cdot \rangle$
1436: from left to right and
1437: determine where it ends.
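
To see how prefix-freeness permits left-to-right parsing, consider the
simple self-delimiting code $x \mapsto 1^{l(x)} 0 x$. This code is chosen
here only for illustration and need not coincide with the
encoding $x'$ used above. For $x = 101$ and $y = 01$ the code words are
$111\,0\,101$ and $11\,0\,01$, and their concatenation is
\[ 1110101\,11001 . \]
A machine reading from the left counts three $1$'s before the first $0$,
so it knows that the next three bits, $101$, form $x$. It then counts two
$1$'s before the next $0$ and reads the following two bits, $01$, as $y$.
No end marker is needed, at the price of a code word length of $2l(x)+1$.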
1438: \begin{definition}\label{def.KolmK}
1439: \rm
1440: Let $U$ be our reference prefix machine satisfying for all $i \in {\cal N},
1441: y \in \{0,1\}^*$,
1442: $U(\langle i,y \rangle) = \phi_i(y)$. The {\em prefix Kolmogorov complexity} of $x$ is
1443: \begin{eqnarray}
1444: K(x) & = &
\min_{z} \{ l(z) : U(z) = x , z \in \{0,1\}^*\} \nonumber \\
1446: & = & \min_{i,y}\{l(\langle i, y \rangle): \phi_i (y )=x , y \in \{0,1\}^*, i
1447: \in {\cal N} \}.
1448: \end{eqnarray}
1449: \end{definition}
1450: We can alternatively think of $z$ as a program that prints $x$ and
1451: then halts, or as $z = \langle i,y \rangle$ where $y$ is a program such
that $T_i$, when given $y$ as input, prints $x$ and then halts.
1453:
1454: Thus, by definition $K(x)=l(x^*)$, where $x^*$ is the
1455: lexicographically first shortest
1456: self-delimiting (prefix) program for $x$ with respect to the
1457: reference prefix machine. Consider the mapping $E^*$ defined by $E^*(x)=x^*$.
1458: This may be viewed as the encoding function of a prefix-code (decoding
1459: function) $D^*$ with $D^*(x^*) = x$. By its definition, $D^*$ is a
1460: very parsimonious code. The reason for working with prefix rather than standard
1461: Turing machines is that, for many of the subsequent developments,
1462: we need $D^*$ to be prefix.
1463: %
1464: %
1465: %If $x^*$ is the (lexicographically)
1466: %first shortest program for $x$ then the set
1467: %$\{x^* : U(x^*)=x, x \in \{0,1\}^*\}$ is a {\em prefix code}.
1468: %That is, each $x^*$ is a code word for some $x$, and if $x^*$
1469: %and $y^*$ are code words for $x$ and $y$ with $x \neq y$ then $x^*$ is not
1470: %a prefix of $x$.
1471:
1472: Though defined in terms of a
1473: particular machine model, the Kolmogorov complexity
1474: is machine-independent up to an additive
1475: constant
1476: and acquires an asymptotically universal and absolute character
1477: through Church's thesis, from the ability of universal machines to
1478: simulate one another and execute any effective process.
1479: The Kolmogorov complexity of an object can be viewed as an absolute
1480: and objective quantification of the amount of information in it.
1481: %peter1: more explanation needed here
1482: %the following is already said at various places in the text, so I
1483: %commented it out
1484: %
1485: %This leads to a theory of {\em absolute} information {\em contents}
1486: %of {\em individual} objects in contrast to classic information theory
1487: %which deals with {\em average} information {\em to communicate}
1488: %objects produced by a {\em random source} \cite{LiVi97}.
1489:
1490: %peter1 the following about m(x) is not needed except perhaps for the
1491: %Kolmogorov structure function. Better to deal with it there it seems.
1492: %\commentout{
1493: %\begin{example}
1494: % \rm
1495: \subsubsection{Intuition}
1496: To develop some intuitions, it is useful to think of $K(x)$ as
1497: the shortest program for $x$
1498: in some standard programming language such as
1499: LISP or Java. Consider the lexicographical enumeration
1500: of all syntactically correct LISP programs $ \lambda_1, \lambda_2,
1501: \ldots$, and the lexicographical enumeration of all syntactically
1502: correct Java programs $ \pi_1, \pi_2, \ldots$. We assume that both
1503: these programs are encoded in some standard prefix-free manner. With
1504: proper definitions we can view the programs in both enumerations as
1505: computing partial recursive functions from their inputs to their
1506: outputs. Choosing reference machines in both enumerations we can
1507: define complexities $K_{\mbox{\scriptsize LISP}}(x)$ and
1508: $K_{\mbox{\scriptsize Java}}(x)$
1509: completely analogous to $K(x)$. All of these measures of the
1510: descriptional complexities of $x$ coincide up to a fixed additive
1511: constant. Let us show this directly for $K_{\mbox{\scriptsize LISP}}(x)$ and
1512: $K_{\mbox{\scriptsize Java}}(x)$. Since LISP is universal, there exists a LISP
1513: program $\lambda_P$ implementing a Java-to-LISP compiler.
1514: $\lambda_P$ translates each Java program to an equivalent LISP
1515: program. Consequently, for all $x$, $K_{\mbox{\scriptsize LISP}}(x) \leq
1516: K_{\mbox{\scriptsize Java}}(x) + 2l(P)$. Similarly, there is a Java program
1517: $\pi_L$ that is a LISP-to-Java compiler, so that for all $x$,
1518: $K_{\mbox{\scriptsize Java}}(x) \leq K_{\mbox{\scriptsize LISP}}(x) + 2l(L)$. It follows
1519: that $|K_{\mbox{\scriptsize Java}}(x) - K_{\mbox{\scriptsize LISP}}(x)| \leq 2l(P) + 2 l(L)$
1520: for all $x$!
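
The first inequality can be unpacked as follows, under the assumption
(made here for illustration) that the compiler is prepended in a
self-delimiting form costing about $2 l(P)$ bits, for example by
doubling each of its bits. Let $\pi$ be a shortest Java program for $x$,
so that $l(\pi) = K_{\mbox{\scriptsize Java}}(x)$. The LISP program
consisting of the self-delimiting copy of $\lambda_P$ followed by $\pi$
first recovers $\lambda_P$, then compiles and runs $\pi$, and hence
prints $x$. Its length is about
\[ 2 l(P) + l(\pi) = K_{\mbox{\scriptsize Java}}(x) + 2 l(P) , \]
which upper-bounds $K_{\mbox{\scriptsize LISP}}(x)$. The additive term
depends only on the compiler and not on $x$.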
1521:
1522: The programming language view immediately tells us that $K(x)$ must be
1523: small for `simple' or `regular' objects $x$. For example,
1524: there exists a fixed-size program that, when input
1525: $n$, outputs the first $n$ bits of $
1526: \pi$ and then halts. Specification of $n$ takes at most $L_{\cal N}(n)
1527: = \log n + 2 \log \log n + 1 $ bits. Thus, if $x$
1528: consists of the first $n$ binary digits of $\pi$, then $K(x) \lea \log
1529: n + 2 \log \log n$. Similarly, if $0^n$ denotes the string
1530: consisting of $n$ $0$'s, then $K(0^n) \lea \log n + 2 \log \log n$.
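
To get a feeling for the magnitudes, take logarithms to base $2$ and
let $n = 2^{20}$. Then, with rounded values,
\[ \log n + 2 \log \log n + 1 = 20 + 2 \log 20 + 1 \approx 20 + 8.64 + 1
= 29.64 , \]
so about $30$ bits suffice to specify $n$, whereas the string $0^n$
itself is $1{,}048{,}576$ bits long.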
1531:
1532: On the other hand, for all $x$, there exists a program `print $x$;
halt'. Since the literal copy of $x$ inside this program must be encoded in a
self-delimiting way, this shows that for all $x$, $K(x) \lea l(x) + 2 \log l(x)$. As was previously noted, for any prefix code,
there are no more than $2^m$ strings $x$ which can be described by
$m$ or fewer bits. In particular, this holds for the prefix code $E^*$
1536: whose length function is $K(x)$. Thus, the fraction of strings $x$ of
1537: length $n$ with $K(x) \leq m$ is at most $2^{m-n}$: the overwhelming majority
1538: of sequences cannot be compressed by more than a
1539: constant. Specifically, if $x$ is determined by $n$ independent
1540: tosses of a fair coin, then with overwhelming probability, $K(x) \approx
1541: l(x)$. Thus, while for very regular strings, the Kolmogorov complexity is
1542: small (sub-linear in the length of the string),
1543: {\em most\/} strings are `random' and have Kolmogorov
1544: complexity about equal to their own length.
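
Spelling out the counting step: at most $2^m$ strings have
$K(x) \leq m$, and there are $2^n$ strings of length $n$, so the fraction
of strings of length $n$ with $K(x) \leq m$ is at most
\[ 2^m / 2^n = 2^{m-n} . \]
Taking $m = n - 10$, for instance, at most a fraction
$2^{-10} = 1/1024 < 0.001$ of the strings of length $n$ can be compressed
by $10$ or more bits.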
1545: %\end{example}
1546: \subsubsection{Kolmogorov complexity of sets, functions and
1547: probability distributions}
1548: \paragraph{Finite sets:}
1549: The class of {\em finite sets} consists of the set
1550: of finite subsets $S \subseteq \{0,1\}^*$. The {\em complexity
1551: of the finite set} $S$ is
1552: $K(S)$---the length (number of bits) of the
1553: shortest binary program $p$ from which the reference universal
1554: prefix machine $U$
1555: computes a listing of the elements of $S$ and then
1556: halts.
1557: That is, if $S=\{x_1 , \ldots , x_{n} \}$, then
1558: $U(p)= \langle x_1,\langle x_2, \ldots, \langle x_{n-1},x_n\rangle \ldots\rangle \rangle $.
1559: The {\em conditional complexity} $K(x \mid S)$ of $x$ given $S$,
is the length (number of bits) of the
1561: shortest binary program $p$ from which the reference universal
1562: prefix machine $U$, given $S$ literally as auxiliary information,
1563: computes $x$.
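
As a simple illustration, suppose $x \in S$. Then $x$ can be specified
by its index in the listing of $S$, written with
$\lceil \log |S| \rceil$ bits. Since $|S|$ can be read off from the
auxiliary information $S$, the machine knows how many index bits to
expect, so this fixed-length index is self-delimiting and
\[ K(x \mid S) \leq \log |S| + O(1) . \]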
1564: %he class of {\em partial recursive
1565: %unctions} consists of the set
1566: %f functions $f: \{0,1\}^* \rightarrow \{0,1\}^*$ such that
1567: %here is a Turing machine $T$ such that
1568: %nd $f(i) = T(i)$, for every $i \in \{0,1\}^*$.
1569: \paragraph{Integer-valued functions:}
1570: The (prefix-) complexity $K(f)$ of a
1571: partial recursive function $f$ is defined by
1572: $
1573: K(f) = \min_i \{K(i): \mbox{\rm Turing machine } T_i
1574: \; \; \mbox{\rm computes }
1575: f \}.
1576: $
1577: If $f^*$ is a shortest program for computing the function $f$
1578: (if there is more than one of them then $f^*$ is the first one in
1579: enumeration order), then $K(f)=l(f^*)$.
1580: \begin{remark}
1581: \rm
1582: In the above definition of $K(f)$, the objects being
1583: described are functions instead of finite binary strings.
1584: To unify the approaches, we can
1585: consider a finite binary string $x$ as corresponding
1586: to a function having value $x$ for argument 0.
1587: Note that we can upper semi-compute (Section~\ref{sec:preliminaries})
1588: $x^*$ given $x$,
1589: but we cannot upper semi-compute $f^*$ given $f$ (as an oracle),
1590: since we should be able to
1591: verify agreement of a program for a function and an oracle for the
1592: target function, on all infinitely many arguments.
1593: \end{remark}
1594: \paragraph{Probability Distributions:}
1595: In this text we identify
1596: probability distributions on finite and countable sets ${\cal
1597: X}$ with their corresponding mass functions
1598: (Section~\ref{sec:preliminaries}). Since any
1599: (sub-) probability mass function $f$ is a total real-valued function, $K(f)$
1600: is defined in the same way as above.
1601: \subsubsection{Kolmogorov Complexity and the Universal Distribution}
1602: \label{sec:m}
1603: Following the definitions
1604: above we now consider lower semi-computable and computable probability
1605: mass functions (Section~\ref{sec:preliminaries}).
By the fundamental
Kraft inequality, Theorem~\ref{kraft}, we know that
1608: if $l_1 , l_2 , \ldots$ are the code-word lengths of a prefix code,
1609: then $\sum_x 2^{-l_x} \leq 1$. Therefore,
1610: since $K(x)$ is the length of
1611: a prefix-free program for $x$,
1612: we can interpret $2^{-K(x)}$
1613: as a sub-probability mass function, and
1614: we define ${\bf m}(x)=2^{-K(x)}$.
1615: This is the so-called
1616: universal distribution---a rigorous form of Occam's razor.
1617: The following two theorems are to be considered as major achievements
1618: in the theory of Kolmogorov complexity, and will be used
1619: again and again in the sequel. For the proofs we refer to
1620: \cite{LiVi97}.
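
As a small check of the Kraft inequality, consider the prefix code with
code words $0$, $10$, $110$, $111$ (a toy example chosen for
illustration). Its code-word lengths are $1,2,3,3$, and
\[ 2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 . \]
Dropping the code word $111$ leaves a prefix code whose sum is
$7/8 < 1$, showing that the sum may fall strictly below $1$; this is the
case covered by the term sub-probability mass function.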
1621:
1622:
1623: \begin{theorem}\label{PR1}
1624: Let $f$ represent a
1625: lower semi-computable (sub-) probability distribution on the
1626: natural numbers (equivalently, finite binary strings).
1627: (This implies $K(f) < \infty$.)
1628: Then, $2^{c_f} {\bf m}(x) > f(x)$ for all $x$, where $c_f =K(f)+O(1)$.
1629: We call ${\bf m}$ a {\em universal distribution}.
1630: \end{theorem}
1631:
1632: The family of lower semi-computable sub-probability mass functions
1633: contains all distributions with computable parameters which have a
1634: name, or in which we could conceivably be interested, or which have
1635: ever been considered\footnote{To be sure, in statistical applications,
1636: one often works with model classes containing distributions that are
1637: neither upper- nor lower semi-computable. An example is the
1638: Bernoulli model class, containing the distributions with $P(X=1) =
1639: \theta$ for all $\theta \in [0,1]$. However, every concrete {\em
1640: parameter estimate\/} or {\em predictive distribution\/} based on
1641: the Bernoulli model class that has ever been considered or in which we
1642: could be conceivably interested, is in fact computable; typically,
1643: $\theta$ is then rational-valued. See also Example~\ref{ex:appy} in
1644: Appendix~\ref{sec:universal}.}. In particular, it contains the
1645: computable distributions. We call $\hbox{\bf m}$ ``universal'' since
1646: it assigns at least as much probability to each object as any other
1647: lower semi-computable distribution (up to a multiplicative factor),
1648: and is itself lower semi-computable.
1649:
1650: \begin{theorem}\label{PR2}
1651: \begin{equation}\label{eq.m}
1652: \log 1/\hbox{\bf m} (x)=K(x) \pm O( 1).
1653: \end{equation}
1654: \end{theorem}
1655: That means that $\hbox{\bf m}$ assigns high probability to simple
1656: objects
1657: and low probability to complex or random objects.
1658: For example, for $x=00 \ldots 0$ ($n$ 0's) we have
1659: $K(x) = K(n) \pm O(1) \leq \log n + 2 \log \log n +O(1) $ since the program
1660: \[ \mbox{\tt print } n \mbox{\tt \_times a ``0''} \]
1661: prints $x$. (The additional $2 \log \log n$ term
1662: is the penalty term for a prefix encoding.)
1663: Then, $1/ (n \log^2 n ) = O( \hbox{\bf m}(x))$.
1664: But if we flip a coin to obtain a string $y$ of $n$ bits,
then with overwhelming probability $K(y) \geq n - O(1) $
1666: (because $y$ does not contain effective regularities
1667: which allow compression),
1668: and hence $\hbox{\bf m}(y) = O( 1/2^n)$.
1669:
1670:
1671: \paragraph*{Problem and Lacuna:} Unfortunately $K(x)$ is not a recursive
1672: function: the Kolmogorov complexity is
1673: not computable in general. This means that
1674: there exists no computer program that, when input an arbitrary string,
1675: outputs the Kolmogorov complexity of that string and then halts.
1676: While Kolmogorov complexity is upper semi-computable
1677: (Section~\ref{sec:preliminaries}), it cannot be approximated in
1678: general in a
1679: practically useful sense; and even though
1680: there
1681: exist `feasible', resource-bounded forms of Kolmogorov
complexity \cite{LiVi97}, these lack some of the elegant
1683: properties of the original, uncomputable notion.
1684:
1685:
1686: Now suppose we are interested in efficient storage and transmission of
1687: long sequences of data. According to Kolmogorov, we can compress such
1688: sequences in an essentially optimal way by storing or transmitting the
1689: shortest program that generates them. Unfortunately, as we have just
1690: seen, we cannot find such a program in general. According to Shannon,
1691: we can compress such sequences optimally in an average sense (and
1692: therefore, it turns out, also with high probability) if they are
1693: distributed according to some $P$ and we know $P$. Unfortunately, in
1694: practice, $P$ is often unknown, it may not be computable---bringing us
into the same conundrum as with the Kolmogorov complexity approach---or
1696: worse, it may be nonexistent. In Appendix~\ref{sec:universal}, we
1697: consider {\em universal coding}, which can be considered a sort of
1698: middle ground between Shannon information and Kolmogorov complexity.
1699: In contrast to both these approaches, universal codes can be directly
1700: applied for practical data compression. Some basic knowledge of
1701: universal codes will be very helpful in providing intuition for the
1702: next section, in which we relate Kolmogorov complexity and Shannon
1703: entropy. Nevertheless, universal codes are not directly needed in any
1704: of the statements and proofs of the next section or, in fact, anywhere
else in the paper, which is why we delegated their treatment to an
1706: appendix.
1707: \subsection{Expected Kolmogorov Complexity Equals Shannon Entropy}
1708: \label{sec:KCSE}
1709: %Shannon's entropy measures
1710: %the uncertainty in a statistical ensemble
1711: %of messages, while Kolmogorov complexity measures
1712: %the algorithmic information in an individual
1713: %message.
1714:
1715: Suppose the source words $x$ are distributed as a random variable
1716: $X$ with probability $P(X=x) = f(x)$.
1717: %The expected code word length of source words
1718: %with respect to probability distribution
1719: %$P$ is $\sum_x f(x)K(x)$.
1720: %
1721: %What we would like to know is the following:
1722: While $K(x)$ is
1723: fixed for each $x$ and gives the shortest code word length
1724: (but only up to a fixed constant) and is {\em independent} of the
1725: probability distribution $P$, we may wonder whether
1726: $K$ is also universal in the following sense:
1727: If we weigh each individual code word length for
1728: $x$ with its probability $f(x)$, does the resulting $f$-expected
1729: code word length $\sum_x f(x)K(x)$
1730: achieve the minimal average code word
1731: length $ H(X)= \sum_x f(x) \log 1/ f(x)$?
1732: Here we sum over the entire support of $f$; restricting summation
1733: to a small set, for example the singleton set $\{x\}$, can give
1734: a different result.
1735: %This universality requirement contrasts
1736: %with the Shannon-Fano code
1737: %%\index{code!Shannon-Fano}
1738: %%of Example~\ref{shannon-fano} on page~\pageref{shannon-fano},
1739: %which does achieve the $H(P)$ bound at the cost of setting
1740: %the code word length equal to the negative logarithm of
1741: %the specific source word probability.
1742: The reasoning above implies that, under some mild restrictions on the
1743: distributions $f$,
1744: the answer is yes.
1745: %We can view the $K(x)$'s as the code word length set
1746: %of a ``universal'' Shannon-Fano
1747: %code based on the universal probability, Theorem~\ref{thm:noiseless}
1748: %on page~\pageref{thm:noiseless}.
1749: %The expectation of
1750: %$K(x)$ differs from $H(P)$ by a constant
1751: %depending on $P$.
1752: %Namely, $H(P) = \sum_x P(x)K(x) + c_P$,
1753: %where the constant $c_P$ depends on
1754: %the length of the program to compute the distribution $P$.
1755: This is expressed in the following theorem, where, instead of the quotient
1756: we look at the difference of
1757: $\sum_x f(x) K(x)$ and $ H(X)$.
1758: This allows
1759: us to express really small distinctions.
1760: %We call an information source $X$ recursive if its marginals
1761: %$f^{(1)} (X=x | l(x)=1), f^{(2)} (X=x | l(x)=2), \ldots$ are all recursive.
1762: %In Exercise~\ref{exer.caves} this dependence is removed.
1763: %
1764: %If the set
1765: %of outcomes is infinite, then
1766: %it is possible that $H(P)$ is infinite.
1767: %For example, with $x \in {\cal N}$ and
1768: %$P(x)= 1/(x \log x)$ we have $H(P) > \sum_x 1/x$ which diverges.
1769: %If the expected $K(x)$ is close to $H(P)$ then it diverges as well.
1770: %Two diverging quantities are compared by looking at their quotient
1771: %or difference. The latter allows us to express really small
1772: %distinctions.
1773:
1774: \begin{theorem}\label{theo.eq.entropy}
1775: Let $f$ be a computable probability mass function (Section~\ref{sec:preliminaries}) $f(x)=P(X=x)$ on
1776: sample space ${\cal X} = \{0,1\}^*$
1777: associated with a random source $X$ and
1778: entropy
1779: $H(X)=\sum_x f(x) \log 1/f(x)$. Then,
1780: \[ 0 \leq \left( \sum_x f(x) K(x) - H(X) \right) \leq K(f) + O(1). \]
1781: \end{theorem}
1782:
1783: \begin{proof}
1784: Since $K(x)$ is the code word length of a prefix-code for $x$,
1785: the first inequality of the Noiseless Coding Theorem~\ref{thm:noiseless}
1786: states that
1787: \[H(X) \leq \sum_x f(x) K(x).\]
1788: %Moreover, by the Kraft inequality \eqref{kraft} we have
1789: %$\sum_x 2^{-K(x)} \leq 1$. Hence we can define ${\bf m}(x) = 2^{-K(x)}$,
1790: %which can be considered as a probability distribution since it sums
1791: %to at most 1. (We can concentrate the deficit on a special undefined
1792: %element $u \not\in \{0,1\}^*$.) One of the main achievements of the theory
1793: %is that ${\bf m}$ is a {\em universal} distribution in the sense that
1794: %for every lower semi-computable probability mass function
1795: %$P$ on $\{0,1\}^*$ we have
1796: Since $f(x) \leq 2^{K(f)+O(1)} {\bf m}(x)$ (Theorem~\ref{PR1})
and $\log 1/{\bf m} (x) = K(x)+O(1)$ (Theorem~\ref{PR2}), we have
1798: $ \log 1/ f(x) \geq K(x) - K(f) - O(1)$.
1799: It follows that
1800: \[\sum_x f(x) K(x) \leq H(X) + K(f)+ O(1).\]
1801: Set the constant $c_f$ to
1802: \[c_f := K(f)+O(1), \]
1803: and the theorem is proved.
1804: As an aside, the constant implied in the $O(1)$ term
1805: depends on the lengths of the programs occurring in the proof of the cited
1806: Theorems~\ref{PR1}, \ref{PR2} (Theorems 4.3.1 and 4.3.2 in \cite{LiVi97}).
1807: These depend only
1808: on the reference universal prefix machine.
1809: \end{proof}
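
The step from the pointwise bound to the bound on expectations in the
proof can be spelled out as follows. Multiplying
$K(x) \leq \log 1/f(x) + K(f) + O(1)$ by $f(x)$ and summing over $x$
gives
\begin{eqnarray*}
\sum_x f(x) K(x) & \leq & \sum_x f(x) \log 1/f(x)
 + \left( K(f) + O(1) \right) \sum_x f(x) \\
& = & H(X) + K(f) + O(1) ,
\end{eqnarray*}
where the last line uses $\sum_x f(x) = 1$ and the fact that the $O(1)$
term does not depend on $x$.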
1810:
1811: The theorem shows that for simple (low complexity)
1812: distributions the expected Kolmogorov complexity is close to
1813: the entropy, but these two quantities may be wide apart for distributions
1814: of high complexity. This explains the apparent problem arising
1815: in considering a distribution $f$ that concentrates all probability
on an element $x$ of length $n$. Suppose we choose $x$ such that $K(x)>n$.
1817: Then $f(x)=1$ and hence the entropy $H(f)=0$. On the other hand
1818: the term $ \sum_{x \in \{0,1\}^* } f(x) K(x) = K(x)$. Therefore,
1819: the discrepancy between the expected Kolmogorov complexity and the entropy
1820: exceeds the length $n$ of $x$. One may think this contradicts
1821: the theorem, but that is not the case: The complexity of the distribution
1822: is at least that of $x$, since we can reconstruct $x$ given $f$
1823: (just compute $f(y)$ for all $y$ of length $n$ in lexicographical
1824: order until we meet one that has probability 1). Thus, $c_f = K(f)+O(1)
1825: \geq K(x)+O(1) \geq n+O(1)$. Thus, if we pick a probability distribution
1826: with a complex support, or a trickily skewed probability distribution,
then this is reflected in the complexity of that distribution, and
as a consequence in the closeness between the entropy and the expected
1829: Kolmogorov complexity.
1830:
1831: For example, bringing the discussion in line with the universal coding
1832: counterpart of Appendix~\ref{sec:universal} by considering $f$'s that
1833: can be interpreted as sequential information sources and denoting the
1834: conditional version of $f$ restricted to strings of length $n$ by
1835: $f^{(n)}$ as in Section~\ref{sec:preliminaries}, we find by the same
1836: proof as the theorem that for all $n$,
1837: \[ 0 \leq \sum_{x \in \{0,1\}^n } f^{(n)}(x) K(x) - H(f^{(n)}) \leq
1838: c_{f^{(n)}}, \]
1839: where $c_{f^{(n)}} = K(f^{(n)})+O(1) \leq K(f) + K(n)+O(1)$ is now a constant
1840: depending on both $f$ and $n$.
1841: On the other hand, we can eliminate
1842: the complexity of the distribution, or its recursivity for that matter,
1843: and / or restrictions to a conditional version of $f$
1844: restricted to a finite support $A$
1845: (for example $A = \{0,1\}^n$), denoted by $f^A$,
1846: in the following conditional formulation (this involves
1847: a peek in the future since
1848: the precise meaning of the ``$K(\cdot \mid \cdot)$'' notation
1849: is only provided in Definition~\ref{def.KolmKb}):
1850: \begin{equation}\label{eq.condentropy}
1851: 0 \leq \sum_{x \in A } f^A (x) K(x \mid f,A) - H(f^A) = O(1) .
1852: \end{equation}
1853:
1854: The Shannon-Fano code for a computable distribution is
1855: itself computable. Therefore, for every computable
1856: distribution $f$, the universal code $D^*$
1857: whose length function is the Kolmogorov complexity compresses
1858: on average at least as much as the Shannon-Fano code for $f$.
1859: This is the intuitive reason
1860: why, no matter what computable distribution $f$ we take, its expected
1861: Kolmogorov complexity is close to its entropy.
1862:
1863:
1864: %To define Kolmogorov complexity, we must first represent our space
1865: %$\cal X$ by a sequence of binary variables $x_1, x_2, \ldots$ (this
1866: %can be done without any essential loss of generality). The idea is
1867: %that the outcomes (actual values) of these variables will be revealed
1868: %to us sequentially, either one at a time (first we see $x_1$, then
1869: %$x_2$, etc.) or in `blocks'. Broadly speaking, we now proceed as
1870: %follows. We fix some universal programming language $L$ (by universal
1871: %we mean that a universal Turing Machine can be programmed in it). We
1872: %now define the Kolmogorov complexity $K(x^n)$ of a sequence $x^n =
1873: %x_1, \ldots, x_n$ as follows: $K(x^n)$ is the length of the shortest
1874: %program (written in language $L$) that prints the sequence $x^n$ and
1875: %then halts.
1876: %
1877: %The quantity $K(x^n)$ might be called the `information inherent in the
1878: %object $x^n$'. While for finite sequences it depends on the
1879: %programming language $L$ that is used, one can show that for infinite
1880: %sequences $x^{\infty} = x_1, x_2, \ldots$ it becomes, in some sense,
1881: %invariant:
1882:
1883: %More importantly, the {\em algorithmic\/} information that observation
1884: %$y^n$ gives about sequence $x^n$ is defined as the {\em conditional
1885: %Kolmogorov complexity\/} $K(x^n | y^n)$; this is the length of the
1886: %shortest program $p$ such that, when the pair $<p,y^n>$ (suitably
1887: %encoded) is fed to the UTM $U$, $U$ outputs $x^n$ and then
1888: %halts. Several concrete examples and properties of this new definition
1889: %of mutual information will be given in the paper.
1890:
1891: \section{Mutual Information}
1892: \label{sec:mutual}
1893: \subsection{Probabilistic Mutual Information}
1894: \label{sec:probmutual}
1895: How much information can a random variable $X$ convey about a
1896: random variable $Y$?
1897: Taking a purely combinatorial approach,
1898: this notion is captured as follows:
1899: If $X$ ranges over ${\cal X}$
1900: and $Y$ ranges over ${\cal Y}$, then we look at the set $U$
1901: of possible events $(X=x,Y=y)$ consisting of
1902: joint occurrences of event $X=x$ and event $Y=y$.
1903: If $U$ does not equal the Cartesian product ${\cal X} \times {\cal Y}$,
1904: then this means there is some dependency between $X$ and $Y$.
1905: Considering the set $U_x = \{ (x,u):(x,u) \in U \}$ for $x \in {\cal X} $,
1906: it is natural to define the
1907: \it conditional entropy
1908: \rm of $Y$
1909: \rm given $X = x$ as $H(Y|X=x) = \log d(U_x )$. This suggests
1910: immediately that the information given by $X=x$ about $Y$ is
1911: $$
1912: I(X=x: Y) = H(Y) - H(Y| X=x).
1913: $$
1914: For example, if $U = \{ (1,1), (1,2),(2,3) \} $, $U \subseteq {\cal X}
1915: \times {\cal Y}$
1916: with ${\cal X} = \{ 1,2 \} $ and ${\cal Y} = \{ 1,2,3,4 \} $,
1917: then $I(X=1: Y) = 1$ and $I(X=2: Y) = 2$.
1918:
1919: In this formulation it is obvious that $H(X|X=x) = 0$,
1920: and that $I(X=x : X) = H(X)$.
1921: This approach amounts
1922: to the assumption of a
1923: {\em uniform distribution}\index{distribution!uniform}
1924: of the probabilities concerned.
1925:
1926: We can generalize this approach,
1927: taking into account
1928: the frequencies or probabilities of the occurrences of the different
1929: values $X$ and $Y$ can assume.
1930: Let the {\em joint probability}\index{probability!joint}
1931: $f(x,y)$ be the ``probability of
1932: the joint occurrence of event $X=x$ and event $Y=y$.''
1933: The {\em marginal probabilities} $f_1 (x)$ and $f_2(y)$ are
1934: defined by $f_1 (x)= \sum_y f(x,y)$ and $f_2 (y)= \sum_x f(x,y)$ and
1935: are ``the probability of the occurrence of the event $X=x$''
1936: and the ``probability of the occurrence of the event $Y=y$'',
1937: respectively.
1938: This leads to the self-evident formulas for joint variables $X,Y$:
\begin{eqnarray*}
&& H(X,Y) = \sum_{x,y} f(x,y) \log 1/ f(x,y), \\
&& H(X) = \sum_{x} f_1(x) \log 1/f_1(x), \\
&& H(Y) = \sum_{y} f_2(y) \log 1/ f_2(y) ,
\end{eqnarray*}
1944: where summation over $x$ is taken over all outcomes of the random variable
1945: $X$ and summation over $y$ is taken over all outcomes of random variable $Y$.
1946: One can show that
1947: \begin{equation}
1948: H(X,Y) \leq H(X) + H(Y) ,
1949: \label{I4}
1950: \end{equation}
1951: with equality only in the case that $X$ and $Y$ are independent.
1952: In all of these equations the
1953: entropy quantity on the left-hand side increases if
1954: we choose the probabilities on the right-hand side
1955: more equally.
1956:
1957: \paragraph{Conditional entropy:}
1958: We start
1959: the analysis of
1960: the information in $X$ about $Y$ by first considering
1961: the %
1962: \it conditional
1963: entropy \index{entropy!conditional|bold}%
1964: \rm of $Y$ %
1965: \rm given $X$ as the average of the
1966: entropy for $Y$ for each value of $X$ %
1967: \rm weighted
1968: by the probability of getting that particular value:
1969: \begin{eqnarray*}
1970: H(Y| X)
1971: & = & \sum_x f_1(x) H(Y|X=x) \\
1972: & = & \sum_x f_1(x) \sum_{y} f(y|x) \log 1/ f(y|x) \\
1973: & = & \sum_{x,y} f(x,y) \log 1/f(y| x) .
1974: \end{eqnarray*}
1975: Here $f(y|x)$ is the conditional probability mass function as defined
1976: in Section~\ref{sec:preliminaries}.
1977:
1978: The quantity on the left-hand side tells us
1979: how uncertain we are on average about the outcome of $Y$
1980: when we know an outcome of $X$. With
1981: \begin{eqnarray*}
H(X) & = & \sum_x f_1(x) \log 1/ f_1(x) \\
& = & \sum_{x} \left(\sum_y f(x,y) \right)
\log \frac{1}{\sum_{y'} f(x,y')} \\
& = & \sum_{x,y} f(x,y)
\log \frac{1}{\sum_{y'} f(x,y')} ,
1987: \end{eqnarray*}
1988: and substituting the formula for $f(y|x)$, we find
1989: $H(Y| X) = H(X,Y) - H(X)$. Rewrite this expression as
1990: the Entropy Equality
1991: \begin{equation}
1992: H(X,Y) = H (X) + H(Y| X).
1993: \label{I5}
1994: \end{equation}
1995: This can be interpreted as, ``the uncertainty of
1996: the joint event $(X,Y)$ is the uncertainty of $X$
1997: plus the uncertainty of $Y$ given $X$.''
1998: Combining Equations~\ref{I4}, \ref{I5} gives
1999: $H(Y) \geq H(Y| X)$, which can be taken to imply
2000: that, on average, knowledge of $X$ can never increase uncertainty
2001: of $Y$. In fact, uncertainty in $Y$ will be decreased
2002: unless $X$ and $Y$ are independent.
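\begin{example}
\rm
As a small numerical check of (\ref{I5}), consider the following joint
distribution, which is our own illustrative choice and is not used
elsewhere in the text: ${\cal X} = {\cal Y} = \{0,1\}$ with
$f(0,0)=1/2$, $f(0,1)=1/4$, $f(1,0)=0$ and $f(1,1)=1/4$. We take
logarithms to base 2 and use the convention $0 \log 1/0 = 0$.
The marginals are $f_1(0)=3/4$, $f_1(1)=1/4$ and $f_2(0)=f_2(1)=1/2$. Then
\begin{eqnarray*}
H(X,Y) & = & \frac{1}{2} \log 2 + \frac{1}{4}\log 4 + \frac{1}{4} \log 4
= \frac{3}{2}, \\
H(X) & = & \frac{3}{4} \log \frac{4}{3} + \frac{1}{4} \log 4
= 2 - \frac{3}{4} \log 3 \approx 0.811, \\
H(Y|X) & = & \frac{3}{4} \left( \frac{2}{3} \log \frac{3}{2}
+ \frac{1}{3} \log 3 \right) + \frac{1}{4} \cdot 0
= \frac{3}{4} \log 3 - \frac{1}{2} \approx 0.689 ,
\end{eqnarray*}
so that indeed $H(X)+H(Y|X) = 3/2 = H(X,Y)$. Moreover,
$H(Y) = 1 > H(Y|X)$, as expected since $X$ and $Y$ are dependent here.
\end{example}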
2003: \paragraph{Information:}
2004: The
2005: \it information %
2006: \rm in the outcome $X=x$ %
2007: \rm about $Y$ is defined as
2008: \begin{equation}
2009: I(X=x: Y) = H(Y) - H(Y| X=x) .
2010: \label{I6}
2011: \end{equation}
2012: Here the quantities $H(Y)$ and $H(Y| X=x)$ on the right-hand side
2013: of the equations are always equal to or less than the
2014: corresponding quantities under the uniform distribution
2015: we analyzed first. The values of the quantities
2016: $I(X=x: Y)$ under the assumption of uniform distribution
2017: of $Y$ and $Y|X=x$ versus any other distribution are not
2018: related by inequality in a particular direction.
2019: The equalities $H(X|X=x) = 0$ and $I(X=x: X) = H(X)$
2020: hold under any distribution of the variables. Since
2021: $I(X=x: Y)$ is a function of outcomes of $X$, while $I(Y=y: X)$
2022: is a function of outcomes of $Y$, we do not compare them directly.
2023: However, forming the expectation defined as
2024: \begin{eqnarray*}
2025: {\bf E} (I(X=x: Y)) & = & \sum_x f_1 (x)I(X=x: Y), \\
2026: {\bf E} (I(Y=y: X)) & = & \sum_y f_2(y)I(Y=y: X),
2027: \end{eqnarray*}
2028: and combining Equations~\ref{I5}, \ref{I6},
2029: we see that the resulting quantities are equal. Denoting
2030: this quantity by $I(X;Y)$ and calling
2031: it the %
2032: \it mutual information\index{information!mutual|bold} %
2033: \rm in $X$ and $Y$,
2034: we see that this information is
2035: {\it symmetric}:\index{information!symmetry of|see{symmetry of information}}\index{symmetry of information!stochastic|bold}
2036: \begin{equation}
2037: I(X; Y) = {\bf E} (I(X=x: Y)) = {\bf E} (I(Y=y: X)).
2038: \end{equation}
2039: Writing this out we find
2040: that the
2041: {\em mutual information} $I(X;Y)$
2042: is defined by:
2043: \begin{equation}\label{eq.mutinfprob}
2044: I(X;Y) = \sum_x \sum_y f(x,y) \log \frac{f(x,y)}{f_1(x)f_2(y)} .
2045: \end{equation}
2046: Another way to express this is as follows: a well-known criterion for
2047: the difference between a given distribution $f(x)$ and a distribution
2048: $g(x)$ it is compared with is the so-called
2049: {\em Kullback-Leibler divergence}
2050: \begin{equation}\label{eq.kl}
2051: D(f \parallel g ) = \sum_x f(x) \log f(x)/g(x).
2052: \end{equation}
2053: It has the important property that
2054: \begin{equation}\label{eq.ii}
2055: D (f \parallel g ) \geq 0
2056: \end{equation}
with equality iff $f(x)=g(x)$ for all $x$. This is called the
2058: {\em information inequality} in \cite{CT91}, p. 26.
2059: Thus, the mutual information is the Kullback-Leibler divergence between
2060: the joint distribution and the product
2061: $f_1(x)f_2(y)$ of the two marginal distributions. If this quantity is 0
2062: then $f(x,y)=f_1(x)f_2(y)$ for every pair $x,y$, which is the same as
2063: saying that $X$ and $Y$ are independent random variables.
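\begin{example}
\rm
To illustrate (\ref{eq.mutinfprob}) on a concrete case (a binary
symmetric channel, chosen by us purely for illustration), let $X$ be a
uniform bit and let $Y$ equal $X$ with probability $3/4$ and $1-X$
with probability $1/4$. Then $f(0,0)=f(1,1)=3/8$,
$f(0,1)=f(1,0)=1/8$ and $f_1(x)=f_2(y)=1/2$, so with logarithms to
base 2,
\begin{eqnarray*}
I(X;Y) & = & 2 \cdot \frac{3}{8} \log \frac{3/8}{1/4}
+ 2 \cdot \frac{1}{8} \log \frac{1/8}{1/4} \\
& = & \frac{3}{4} \log \frac{3}{2} - \frac{1}{4}
= \frac{3}{4} \log 3 - 1 \approx 0.189 .
\end{eqnarray*}
The same value results from $H(Y) - H(Y|X) = 1 - H(1/4,3/4)$.
If instead $Y$ is a uniform bit independent of $X$, then every ratio
$f(x,y)/(f_1(x)f_2(y))$ equals $1$ and $I(X;Y)=0$.
\end{example}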
2064: \begin{example}
2065: \rm
2066: \label{ex:mutual}
2067: Suppose we want to exchange the information about the outcome $X=x$
2068: and it is known already that outcome $Y=y$ is the case,
2069: that is, $x$ has property $y$.
2070: Then we require (using the Shannon-Fano code) about
2071: $ \log 1/ P(X=x | Y=y)$ bits to communicate $x$. On average, over the
2072: joint distribution $P(X=x, Y=y)$ we use $H(X|Y)$ bits,
2073: which is optimal by Shannon's noiseless coding theorem.
2074: In fact, exploiting the mutual information paradigm,
2075: the expected information $I(Y ; X)$
2076: that outcome $Y=y$ gives about outcome $X=x$
2077: is the same as the expected information that $X=x$ gives about $Y=y$,
2078: and is never negative. Yet
2079: there may certainly exist
2080: {\em individual\/} $y$ such that $I(Y=y : X)$ is negative. For example, we may
2081: have ${\cal X } = \{0,1\}$, ${\cal Y} = \{0,1\}$, $P(X=1 | Y=0) = 1$,
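$P(X=1 | Y = 1) = 1/2$, $P(Y=1) = \epsilon$. Then $P(X=0) = \epsilon/2$,
so $H(X) = H(\epsilon/2,1-\epsilon/2)$, while $H(X | Y=0) = 0$ and
$H(X | Y=1) = 1$. Hence $I(Y; X) = H(X) - H(X|Y) =
H(\epsilon/2,1- \epsilon/2) - \epsilon$, whereas
$I(Y=1 : X) = H(X) - H(X | Y=1) = H(\epsilon/2,1-\epsilon/2) - 1$. For small
$\epsilon$, this latter quantity is smaller than $0$.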
2086: %However, in terms of individual relation between $x$ and $y$
2087: %this can be very high or very low.
2088: \end{example}
2089:
2090: \
2091: \\
2092: {\bf Problem and Lacuna:}
2093: The quantity $I(Y; X)$
2094: symmetrically characterizes to what extent random
2095: variables $X$ and $Y$ are correlated. An inherent problem
2096: with probabilistic definitions
2097: is that --- as we have just seen --- although $I(Y; X) = {\bf E} (I(Y
2098: = y: X))$
is always nonnegative, for some probability
2100: distributions
2101: %peter1 changed
2102: %$I(X: Y)$
2103: %into
2104: and some $y$, $I(Y=y: X)$
2105: %
2106: can turn out to be negative---which
2107: definitely contradicts our naive notion of information content.
2108: \commentout{
2109: %peter1 changed 3 lines
2110: %The development of this theory immediately gave rise to
2111: %at least two different questions. The first observation is that
2112: %the
2113: How is this possible? The
2114: %
2115: concept of information as used in the theory of communication
2116: is a probabilistic notion, which is natural for
2117: information transmission over communication channels.
2118: Nonetheless,
2119: %as we
2120: %have seen from the discussion,
2121: we tend to
2122: identify
2123: %
2124: \it probabilities %
2125: \rm of messages with
2126: %
2127: \it frequencies %
2128: \rm of messages in a sufficiently
2129: long sequence, which under some conditions on the stochastic
2130: source can be rigorously justified.
2131: For instance, Morse code\index{code!Morse} transmissions of English
2132: telegrams over a communication channel
2133: can be validly treated by probabilistic
2134: methods even if we (as is usual) use empirical
2135: frequencies for probabilities. The great probabilist,
2136: Kolmogorov, remarks, ``If something goes wrong here,
2137: \index{Kolmogorov, A.N.}
2138: the problem lies in the vagueness of our ideas of
2139: the relation between mathematical probability theory
2140: and real random events in general.''
2141: %peter1 added
2142: }
2143: The {\em algorithmic\/} mutual information we introduce below can {\em
2144: never\/} be negative, and in this sense is closer to the intuitive
2145: notion of information content.
2146:
2147: \subsection{Algorithmic Mutual Information}
2148: \label{sec:algmi}
2149: For individual objects the information about one another
2150: is possibly even more fundamental than for random sources.
2151: Kolmogorov \cite{Ko65}:
2152: \begin{quote}
2153: Actually, it is most fruitful to discuss the quantity of information
2154: ``conveyed by an object'' ($x$) ``about an object'' ($y$). It is not
2155: an accident that in the probabilistic approach this has led
2156: to a generalization to the case of continuous variables, for which
the entropy is infinite but, in a large number of cases,
2158: \[
2159: I_W(x,y) = \int \int P_{xy}(dx \; dy)\log_2
2160: \frac{P_{xy}(dx \; dy)}{P_x(dx) P_y(dy)}
2161: \]
2162: is finite.
2163: The real objects that we study are very (infinitely) complex,
2164: but the relationships between two separate objects diminish as the
2165: schemes used to describe them become simpler.
2166: While a map yields a considerable amount of information about a region
2167: of the earth's surface, the microstructure of the paper and the ink
2168: on the paper have no relation to the microstructure
of the area shown on the map.
2170: \end{quote}
2171: In the discussions on Shannon mutual information, we first
2172: needed to introduce a conditional version of entropy. Analogously, to
2173: prepare for the definition of algorithmic mutual information, we need
2174: a notion of conditional Kolmogorov complexity.
2175:
2176: Intuitively, the
conditional prefix Kolmogorov complexity $K(x|y)$ of $x$ given $y$ can
be interpreted as the length of the shortest prefix program $p$ such that, when $y$
2179: is given to the program $p$ as input, the program prints $x$ and then
2180: halts. The idea of providing $p$ with an input $y$ is realized by putting
2181: $\langle p,y \rangle$ rather than just $p$ on the input tape of the
2182: universal prefix machine $U$.
2183: \begin{definition}\label{def.KolmKb}
2184: \rm
2185: The {\em conditional prefix Kolmogorov complexity} of $x$ given $y$ (for
2186: free) is
2187: \[K(x|y) = \min_{p}\{l(p): U(\langle p,y \rangle )=x , p \in \{0,1\}^*\}. \]
2188: We define
2189: \begin{equation}
2190: \label{eq:redefine}
2191: K(x)=K(x|\epsilon).
2192: \end{equation}
2193: \end{definition}
2194: Note that we just redefined $K(x)$ so that the unconditional
2195: Kolmogorov complexity is {\em exactly\/} equal to the conditional
2196: Kolmogorov complexity with empty input. This does not contradict our
2197: earlier definition: we can choose a reference prefix machine $U$ such
2198: that $U(\langle p,\epsilon \rangle) = U(p)$. Then (\ref{eq:redefine})
2199: holds automatically.
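
Two simple consequences may help to fix the definition; we state them
here only as an illustrative sketch, for all strings $x$ and $y$:
\begin{eqnarray*}
K(x \mid x) & = & O(1), \\
K(x \mid y) & \lea & K(x) .
\end{eqnarray*}
For the first, a program of constant length can copy the conditional
input to the output. For the second, a program can ignore the
conditional input $y$ and simply simulate a shortest program for $x$,
at the cost of a constant number of extra bits.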
2200:
2201:
2202: We now have the technical apparatus to express the relation between
2203: entropy inequalities and Kolmogorov complexity inequalities.
2204: Recall that the entropy expresses the expected information
2205: to transmit an outcome of a known random source,
2206: while the Kolmogorov complexity
2207: of every such outcome expresses the specific information
contained in that outcome. This makes us wonder to what extent the
2209: entropy-(in)equalities hold for the corresponding Kolmogorov
2210: complexity situation. In the latter case the corresponding (in)equality
2211: is a far stronger statement, implying the same (in)equality in the
2212: entropy setting. It is remarkable, therefore, that similar inequalities
2213: hold for both cases, where the entropy ones hold exactly while the
2214: Kolmogorov complexity ones hold up to a logarithmic, and in some cases
2215: $O(1)$,
2216: additive precision.
2217:
2218: \paragraph{Additivity:}
2219: %Recall the notation $\eqa, \gea$ (Section~\ref{sec:coding}).
2220: By definition, $K(x,y) = K(\langle x,y \rangle)$.
2221: Trivially, the symmetry property holds: $K(x,y) \eqa K(y,x)$.
2222: Another interesting property is the ``Additivity of Complexity''
2223: property that, as we explain further below,
2224: is equivalent to the ``Symmetry of
2225: Algorithmic Mutual Information'' property. Recall that
2226: $x^*$ denotes the first (in a standard enumeration order)
2227: shortest prefix program that
2228: generates $x$ and then halts.
2229: \begin{theorem}[Additivity of Complexity/Symmetry of Mutual Information]
2230: \label{thm:additive}
2231: \begin{equation}\label{eq.soi}
2232: K(x, y) \eqa K(x) + K(y \mid x^*) \eqa K(y) + K(x \mid y^*).
2233: \end{equation}
2234: \end{theorem}
This is the Kolmogorov complexity equivalent of the entropy equality
(\ref{I5}). The latter equality follows
by simply rewriting both sides of the equation according to the
definitions of averages of joint and marginal probabilities.
2239: In fact, potential individual differences are averaged out.
2240: But in the Kolmogorov complexity case we do nothing like that:
2241: it is truly remarkable that additivity of algorithmic information
2242: holds for individual objects. It was first proven by Kolmogorov
2243: and Leonid A. Levin for the plain (non-prefix) version
2244: of Kolmogorov complexity, where it holds up to an additive logarithmic
2245: term, and reported in \cite{ZvLe70}.
The prefix version (\ref{eq.soi}), which holds up to an $O(1)$
additive term, is due to \cite{Ga74}; it can be found
as Theorem 3.9.1 in~\cite{LiVi97} and has a difficult proof.
2249: \paragraph{Symmetry:}
2250: To define the algorithmic mutual information between
2251: two individual objects $x$ and $y$ with no
2252: probabilities involved, it is instructive to first recall
2253: the probabilistic notion (\ref{eq.mutinfprob}).
2254: Rewriting (\ref{eq.mutinfprob})
2255: as
\[ \sum_x \sum_y f(x,y) [ \log 1/f_1(x) + \log 1/ f_2(y) - \log 1/f(x,y) ] , \]
2257: and noting that $ \log 1/ f ( s )$ is
2258: very close to the length of the
2259: prefix-free Shannon-Fano code for $s$, we are led to the following
2260: definition.
2261: %\footnote{The Shannon-Fano code has nearly optimal expected
2262: %code length equal to the entropy with
2263: %respect to the distribution of the source \cite{CT91}. However,
2264: %the prefix-free code with code word length $K(s)$ has both
2265: %about expected optimal code word length and individual optimal
2266: %effective code word length, \cite{LiVi97}.}
2267: The
2268: {\em information in $y$ about $x$}
2269: is defined as
2270: \begin{equation}\label{def.mutinf}
2271: I(y : x) = K(x) - K(x \mid y^*) \eqa K(x) + K(y) - K(x, y),
2272: \end{equation}
2273: where the second equality is a consequence of~(\ref{eq.soi})
2274: and states that this information is symmetrical,
2275: $I(x:y) \eqa I(y:x)$, and therefore we can talk about
2276: {\em mutual information}.\footnote{The notation of the
2277: algorithmic (individual)
2278: notion $I(x:y)$ distinguishes it from the probabilistic
2279: (average) notion
2280: $I(X; Y)$. We deviate slightly from~\cite{LiVi97}
2281: where $I(y : x)$ is defined as $K(x) - K(x \mid y)$.}
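
For orientation, two extreme cases of (\ref{def.mutinf}) can be worked
out directly; this is an illustrative sketch that uses only the
definitions above. Since $x$ can be computed from $x^*$ by a
constant-length program, $K(x \mid x^*) = O(1)$. Since the empty
string $\epsilon$ has a fixed shortest program $\epsilon^*$,
conditioning on it changes the complexity by at most a constant.
Therefore
\begin{eqnarray*}
I(x:x) & = & K(x) - K(x \mid x^*) \eqa K(x), \\
I(\epsilon : x) & = & K(x) - K(x \mid \epsilon^*) \eqa 0 .
\end{eqnarray*}
That is, an object contains all of its own information, while the
empty string contains essentially no information about any object.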
2282: % \begin{remark}\label{rem.cami}
2283: %The conditional mutual information is
2284: % \begin{align*}
2285: % I(x : y \mid z) & = K(x \mid z) - K(x \mid y, K(y \mid z), z)
2286: % \\ & \eqa K(x \mid z) + K(y \mid z) - K(x, y \mid z).
2287: % \end{align*}
2288: % \end{remark}
2289: \paragraph{Precision -- $O(1)$ vs. $O(\log n)$:}
The version of (\ref{eq.soi}) with just $x$ and $y$ in the
conditionals does not
hold with $\eqa$, but holds up to additive logarithmic terms
that cannot be eliminated. To gain some further insight into this
matter, first consider the following lemma:
2295: \begin{lemma}
2296: $x^*$ has the same information as the
2297: pair $x,K(x)$, that is, $K(x^* \mid x,K(x)),K(x,K(x) \mid x^*)=O(1)$.
2298: \end{lemma}
2299: \begin{proof}
Given $x,K(x)$ we can run all programs simultaneously in
dovetailed fashion and select the first program of length $K(x)$
that halts with output $x$ as $x^*$. (Dovetailed fashion means that
in phase $k$ of the process we run all programs $i$ for $j$ steps
such that $i+j=k$, $k=1,2, \ldots$)
Conversely, given $x^*$ we obtain $x$ by running $x^*$ on the
reference machine, and $K(x)$ as the length $l(x^*)$ of $x^*$.
2305: \end{proof}
2306:
2307: \noindent
2308: Thus, $x^*$ provides more information than $x$. Therefore, we have to
2309: be very careful when extending Theorem~\ref{thm:additive}.
2310: For example, the conditional version of (\ref{eq.soi}) is:
2311: \begin{equation}\label{eq.soi-cond}
2312: K(x, y \mid z) \eqa K(x \mid z) + K(y \mid x, K(x \mid z), z).
2313: \end{equation}
2314: Note that a naive version
2315: \[
2316: K(x, y \mid z) \eqa K(x \mid z) + K(y \mid x^{*}, z)
2317: \]
2318: is incorrect: taking $z = x$, $y = K(x)$,
2319: the left-hand side equals $K(x^{*} \mid x)$ which can be as large as
2320: $\log n - \log \log n + O(1)$, and the right-hand side
2321: equals $K(x \mid x) + K(K(x) \mid x^{*}, x) \eqa 0$.
2322:
2323: But up to logarithmic precision we do not need to
2324: be that careful. In fact, it turns out that {\em every}
2325: linear entropy inequality holds for the corresponding Kolmogorov
2326: complexities within a logarithmic additive error, \cite{HRSV00}:
2327: \begin{theorem}
2328: All linear (in)equalities that are valid for Kolmogorov complexity
2329: are also valid for Shannon entropy and vice versa---provided
2330: we require the
2331: Kolmogorov complexity (in)equalities to hold up to additive
2332: logarithmic precision only.
2333: \end{theorem}
2334: \subsection{Expected Algorithmic Mutual Information Equals Probabilistic Mutual Information}
2335: Theorem~\ref{theo.eq.entropy}
2336: gave the relationship between entropy and ordinary Kolmogorov
2337: complexity; it showed that the entropy of distribution $P$ is
2338: approximately equal to the expected (under $P$) Kolmogorov
2339: complexity. Theorem~\ref{thm:mutinf} gives the analogous result for
2340: the mutual information (to facilitate comparison to
2341: Theorem~\ref{theo.eq.entropy}, note that $x$ and $y$ in
2342: (\ref{eq.eqamipmi}) below may stand for strings of arbitrary length $n$).
2343: \begin{theorem}
2344: \label{thm:mutinf}
2345: Given a computable probability distribution $f(x,y)$ over $(x,y)$
2346: we have
2347: \begin{align}\label{eq.eqamipmi}
2348: I(X; Y) - K(f) & \lea \sum_x \sum_y f(x,y) I(x:y)
2349: \\& \lea I(X;Y) + 2 K(f) ,
2350: \nonumber
2351: \end{align}
2352: %where $c_f$ is a constant that depends only on $f$ (it is the length of the shortest prefix-free program that computes
2353: %$f(x,y)$ from input $(x,y)$).
2354: \end{theorem}
2355: \begin{proof}
2356: Rewrite the expectation
2357: \begin{align*}
2358: \sum_x \sum_y f(x,y) I(x:y) \eqa
2359: \sum_x \sum_y & f(x,y) [K(x)
2360: \\& + K(y) - K(x, y)].
2361: \end{align*}
2362: Define
2363: $\sum_y f(x,y) = f_1 (x)$
2364: and $\sum_x f(x,y) = f_2(y)$
2365: to obtain
2366: \begin{align*}
2367: \sum_x \sum_y f(x,y) I(x:y) \eqa
2368: \sum_x & f_1 (x) K(x)
2369: + \sum_y f_2 (y) K(y)
2370: \\& - \sum_{x,y} f(x,y) K(x, y).
2371: \end{align*}
2372: Given the program that computes $f$, we can approximate $f_1 (x)$
2373: by $q_1 (x,y_0) = \sum_{y \leq y_0} f(x,y)$, and
2374: similarly for $f_2$. That is, the
2375: distributions $f_i$ ($i=1,2$) are lower semicomputable.
2376: Because they sum to 1 it can be shown
2377: they must also be computable.
2378: By Theorem~\ref{theo.eq.entropy},
2379: we have $H(g) \lea \sum_x g(x) K(x) \lea H(g) + K(g)$
2380: for every computable probability mass function $g$.
2381:
2382: Hence, $H(f_i) \lea \sum_x f_i (x) K(x) \lea H(f_i) + K(f_i)$
2383: ($i=1,2$), and $H(f) \lea \sum_{x,y} f (x,y) K(x,y) \lea H(f) + K(f)$.
2384: On the other hand, the probabilistic mutual information
2385: (\ref{eq.mutinfprob}) is expressed in the entropies by
2386: $I(X;Y) = H(f_1) + H(f_2) - H(f)$.
2387: By construction of the $f_i$'s above,
2388: we have $K(f_1), K(f_2) \lea K(f)$. Since the complexities
2389: are positive, substitution
establishes the theorem.
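Spelled out, the substitution runs as follows (a sketch that only
combines the bounds displayed above). For the lower bound, use the lower
bounds on the first two sums and the upper bound on the third:
\begin{align*}
\sum_x \sum_y f(x,y) I(x:y) & \gea H(f_1) + H(f_2) - H(f) - K(f)
\\& = I(X;Y) - K(f) .
\end{align*}
For the upper bound, use the upper bounds on the first two sums and the
lower bound on the third, together with $K(f_1), K(f_2) \lea K(f)$:
\begin{align*}
\sum_x \sum_y f(x,y) I(x:y) & \lea H(f_1) + K(f_1) + H(f_2) + K(f_2) - H(f)
\\& \lea I(X;Y) + 2 K(f) .
\end{align*}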
2391: \end{proof}
2392:
2393: Can we get rid of the $K(f)$ error term? The answer is affirmative;
2394: by putting $f(\cdot)$ in the conditional, and
2395: applying \eqref{eq.condentropy}, we can even get rid of
2396: the computability requirement.
2397:
2398: \begin{lemma}
2399: Given a joint probability distribution $f(x,y)$ over $(x,y)$
2400: (not necessarily computable) we have
2401: \[ I(X;Y) \eqa \sum_x \sum_y f(x,y) I(x:y \mid f) , \]
2402: where the auxiliary $f$ means that we can directly access the
2403: values $f(x,y)$ on the
2404: auxiliary conditional information tape of the reference
2405: universal prefix machine.
2406: \end{lemma}
2407:
2408: \begin{proof}
2409: The lemma follows from the definition of conditional
2410: algorithmic mutual information,
2411: if we show that $\sum_{x}
2412: f(x) K(x \mid f) \eqa H(f)$,
2413: where the $O(1)$ term implicit in the $\eqa$ sign
2414: is independent of $f$.
2415:
Equip the reference universal prefix machine
2417: with an $O(1)$ length
2418: program to compute a Shannon-Fano code from the auxiliary table
2419: of probabilities.
2420: Then, given an input $r$, it can determine
2421: whether $r$ is the Shannon-Fano code word for some $x$.
2422: Such a code word
2423: has length $\eqa \log 1/f(x)$.
2424: If this is the case, then the machine
2425: outputs $x$, otherwise it halts without output. Therefore,
2426: $K(x \mid f) \lea \log 1/ f(x)$.
2427: This shows
2428: the upper bound on the expected prefix complexity.
2429: The lower bound follows as usual
2430: from the Noiseless Coding Theorem.
2431: \end{proof}
2432:
2433: Thus, we see that the expectation of the algorithmic mutual
2434: information $I(x:y)$ is close to the probabilistic mutual information
2435: $I(X; Y)$ --- which is important: if
2436: this were not the case then the algorithmic notion would not
2437: be a sharpening of the probabilistic notion to individual objects,
2438: but something else.
2439:
2440:
2441: \section{Mutual Information Non-Increase}
2442: \label{sect.mini}
2443: \subsection{Probabilistic Version}
2444: Is it possible to increase the mutual information between
2445: two random variables, by processing the outcomes in some deterministic
2446: manner? The answer is negative:
2447: For every function $T$
2448: we have
2449: \begin{equation}\label{eq.infnonincrprob}
2450: I(X ; Y) \geq I( X ; T(Y)),
2451: \end{equation}
2452: that is, mutual information between two random variables
2453: cannot be increased by processing their outcomes in any deterministic way.
2454: The same holds in an appropriate sense for randomized processing
2455: of the outcomes of the random variables.
2456: This fact is called the {\em data processing inequality} \cite{CT91},
2457: Theorem 2.8.1. The reason why it holds is that \eqref{eq.mutinfprob}
2458: is expressed in terms of probabilities $f(a,b), f_1(a), f_2(b)$,
2459: rather than in terms of the arguments.
2460: Processing the arguments $a,b$ will not increase the value
2461: of the expression in the right-hand side. If the processing of the arguments
2462: just renames them in a one-to-one manner then the expression
2463: keeps the same value. If the processing eliminates or merges arguments
2464: then it is easy to check from the formula
2465: that the expression value doesn't increase.
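\begin{example}
\rm
A minimal instance of \eqref{eq.infnonincrprob}, with a source and a
map $T$ that are our own illustrative choices: let $Y$ be uniformly
distributed over $\{0,1\}^2$, let $X=Y$, and let $T$ map a two-bit
string to its first bit. Then $I(X;Y) = H(Y) - H(Y|X) = 2 - 0 = 2$,
while $T(Y)$ is a uniform bit that is determined by $X$, so
$I(X;T(Y)) = H(T(Y)) - H(T(Y)|X) = 1 - 0 = 1$. Merging the four
outcomes of $Y$ into two has thus halved the mutual information.
If $T$ is instead a one-to-one renaming of $\{0,1\}^2$, then
$H(T(Y)) = 2$ and $I(X;T(Y)) = 2 = I(X;Y)$.
\end{example}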
2466:
2467: \subsection{Algorithmic Version}
2468: \label{sect:minialg}
2469: In the algorithmic version of mutual information, the notion
2470: is expressed in terms of the individual arguments instead of
2471: solely in terms of the probabilities as in the probabilistic version.
2472: Therefore, the reason for \eqref{eq.infnonincrprob} to hold is
2473: not valid in the algorithmic case. Yet it turns out that the
2474: data processing inequality also holds between individual objects,
2475: by far more subtle arguments and not precisely but with a small
2476: tolerance. The first to observe this fact was Leonid A. Levin
2477: who proved his ``information non-growth,'' and ``information
2478: conservation inequalities'' for both finite and infinite sequences
2479: under both deterministic and randomized data processing,
2480: \cite{Le74,Le84}.
2481:
2482:
2483: \subsubsection{A Triangle Inequality}
2484: We first discuss some useful technical lemmas.
2485: The additivity of complexity (symmetry of information)
2486: \eqref{eq.soi} can be used to
2487: derive a ``directed triangle inequality'' from \cite{GTV01},
2488: that is needed later.
2489: \begin{theorem}\label{lem.magic}
2490: For all $x,y,z$,
2491: \[
2492: K(x \mid y^*) \lea K(x, z \mid y^{*}) \lea K(z \mid y^*) + K(x \mid z^*).
2493: \]
2494: \end{theorem}
2495:
2496: \begin{proof}
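The first inequality is immediate, since $x$ can be extracted from the
pair $(x,z)$ by a program of constant length. For the second inequality,
use~(\ref{eq.soi}), then an evident inequality introducing
an auxiliary object $z$, and then (\ref{eq.soi}) twice more: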
2499: \begin{align*}
2500: K(x, z \mid y^*) &\eqa
2501: K(x,y,z) - K(y)
2502: \\ & \lea K(z) + K(x \mid z^*) + K(y \mid z^*) - K(y)
2503: \\ &\eqa K(y,z) - K(y) + K(x \mid z^*)
2504: \\ & \eqa K(x \mid z^*) + K(z \mid y^*).
2505: \end{align*}
2506:
2507: \end{proof}
2508:
2509: \begin{remark}
2510: \rm
2511: This theorem has bizarre consequences. These consequences are not
2512: simple unexpected artifacts of our definitions, but, to the contrary,
2513: they show the power and the genuine contribution to our understanding
2514: represented by the deep and important mathematical relation
2515: (\ref{eq.soi}).
2516:
2517: Denote $k=K(y)$ and substitute $k=z$ and $K(k)=x$
2518: to find the following counterintuitive corollary: To determine the complexity
2519: of the complexity of an object $y$ it suffices to give both $y$ and
2520: the complexity of $y$. This is counterintuitive since in general
2521: we cannot compute the complexity of an object from the object itself;
2522: if we could this would also solve the
2523: so-called ``halting problem'', \cite{LiVi97}. This noncomputability
2524: can be quantified in terms of $K(K(y) \mid y )$ which can rise to
2525: almost $K(K(y))$ for some $y$.
2526: But in the
2527: seemingly similar, but subtly different, setting below it is possible.
2528:
2529: \begin{corollary}
2530: As above, let $k$ denote $K(y)$. Then,
2531: $K(K(k) \mid y,k) \eqa K(K(k) \mid y^*) \lea K(K(k) \mid k^*)+K(k \mid y,k) \eqa 0$.
2532: \end{corollary}
2533: \end{remark}
2534:
2535:
Now back to the question of whether the mutual information in one object
about another one can be increased by processing. In the probabilistic
setting, non-increase was shown to hold for random variables. But does
it also hold for individual outcomes?
2540: In \cite{Le74,Le84} it was shown that
2541: the information in one individual string about another
2542: cannot be increased by any deterministic algorithmic method
2543: by more than a constant. With added randomization this holds
2544: with overwhelming probability.
2545: Here, we follow the proof method
2546: of \cite{GTV01} and
use the triangle inequality of Theorem~\ref{lem.magic} to state
and prove this information non-increase.
2549:
2550: %We need the following technical concepts.
2551: %Let us call a nonnegative
2552: %real function $f(x)$ defined on strings a {\em semimeasure} if
2553: %$\sum_{x} f(x) \le 1$, and a {\em measure} (a probability distribution)
2554: %if the sum is 1.
2555: %A function $f(x)$ is called {\em lower semicomputable} if there is a
2556: %rational valued computable function $g(n,x)$ such that
2557: %$g(n+1,x) \geq g(n,x)$ and $\lim_{n \rightarrow \infty} g(n,x) = f(x)$.
2558: %For an {\em upper semicomputable} function $f$ we require
2559: %that $-f$ is lower semicomputable.
2560: %It is computable when it is both lower and upper semicomputable.
2561: %(A lower semicomputable measure is also computable.)
2562:
2563: \subsubsection{Deterministic Data Processing:}
Recall Definition~\ref{def.mutinf} and Theorem~\ref{eq.eqamipmi}.
2565: We prove a strong version of the information non-increase law
2566: under deterministic processing (later we need the attached corollary):
2567:
2568: \begin{theorem}
2569: Given $x$ and $z$, let $q$ be a program
2570: computing $z$ from $x^*$.
2571: Then
2572: \begin{equation}\label{eq.nonincrease2}
2573: I(z : y) \lea I(x : y) + K(q).
2574: \end{equation}
2575: \end{theorem}
2576:
2577: \begin{proof}
2578: By the triangle inequality,
2579: \begin{align*}
K(y \mid x^{*}) & \lea K(y \mid z^{*}) + K(z \mid x^{*})
\\& \lea K(y \mid z^{*})+ K(q).
2582: \end{align*}
2583: Thus,
2584: \begin{align*}
2585: I(x : y) & = K(y) - K(y \mid x^{*})
2586: \\ & \gea K(y) - K(y \mid z^{*}) - K(q)
2587: \\ & = I(z : y) - K(q).
2588: \end{align*}
2589: \end{proof}
2590:
2591: This also implies the slightly weaker but intuitively
2592: more appealing statement that the mutual information between strings
2593: $x$ and $y$ cannot be increased by processing $x$ and $y$ separately by
2594: deterministic computations.
2595: \begin{corollary} Let $f, g$ be recursive functions.
2596: Then
2597: \begin{equation}\label{eq.nonincrease}
2598: I(f(x) : g(y)) \lea I(x : y) + K(f)+K(g).
2599: \end{equation}
2600: \end{corollary}
2601: \begin{proof}
2602: It suffices to prove the case $g(y) = y$ and apply it twice.
2603: The proof is by replacing the program $q$ that computes
2604: a particular string $z$
2605: from a particular $x^*$ in (\ref{eq.nonincrease2}). There, $q$
2606: possibly depends on $x^*$ and $z$. Replace it by a program $q_f$ that first
2607: computes $x$ from $x^*$, followed by computing a
2608: recursive function
2609: $f$, that is, $q_f$ is independent of $x$.
2610: Since we only require an $O(1)$-length program to compute
2611: $x$ from $x^*$ we can choose $l(q_f) \eqa K(f)$.
2612:
2613: By the triangle inequality,
2614: \begin{align*}
K(y \mid x^{*}) & \lea K(y \mid f(x)^{*}) + K(f(x) \mid x^{*})
\\& \lea K(y \mid f(x)^{*})+ K(f).
2617: \end{align*}
2618: Thus,
2619: \begin{align*}
2620: I(x : y) & = K(y) - K(y \mid x^{*})
2621: \\ & \gea K(y) - K(y \mid f(x)^{*}) - K(f)
2622: \\ & = I(f(x) : y) - K(f).
2623: \end{align*}
2624: \end{proof}
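
The phrase ``apply it twice'' in the proof above can be spelled out as
follows. This is only a sketch: it assumes the symmetry
$I(x : y) \eqa I(y : x)$ of algorithmic mutual information up to an
additive constant, and the symbol $u = g(y)$ is shorthand introduced
here for readability.
\begin{align*}
I(f(x) : g(y)) & \lea I(x : u) + K(f)
\\ & \eqa I(u : x) + K(f)
\\ & \lea I(y : x) + K(g) + K(f)
\\ & \eqa I(x : y) + K(f) + K(g).
\end{align*}
The first and third steps use the case proved above (once with $f$ acting
on $x$, once with $g$ acting on $y$); the second and fourth steps use the
assumed symmetry.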
2625:
2626:
2627: \subsubsection{Randomized Data Processing:}
It turns out, furthermore, that randomized computation can increase
information only with negligible probability.
2630: Recall from Section~\ref{sec:m} that the
2631: {\em universal probability} $\m(x) = 2^{-K(x)}$ is
2632: maximal within a multiplicative constant among lower semicomputable
2633: semimeasures.
2634: So, in particular, for each computable measure $f(x)$ we have
2635: $f(x) \leq c_1 \m(x)$, where the constant factor $c_1$ depends on $f$.
2636: This property also holds when we have an extra parameter, like $y^*$,
2637: in the condition.
2638:
2639:
2640: Suppose that $z$ is obtained from $x$ by some randomized computation.
2641: We assume that
2642: the probability $f(z \mid x)$ of obtaining $z$ from $x$ is a semicomputable
2643: distribution over the $z$'s.
Therefore it is upper bounded, up to a multiplicative constant, by
$\m(z \mid x)$, where $\m(z \mid x) \leq c_2 \m(z \mid x^{*})$ and $\m(z \mid x^{*}) = 2^{-K(z \mid x^{*})}$.
2646: The information increase $I(z : y) - I(x : y)$ satisfies the theorem below.
2647:
2648: \begin{theorem} There is a constant $c_3$ such that
2649: for all $x,y,z$ we have
2650: \[
2651: \m(z \mid x^{*}) 2^{I(z : y) - I(x : y)}
2652: \leq c_3 \m(z \mid x^{*}, y, K(y \mid x^{*})).
2653: \]
2654: \end{theorem}
2655:
2656: \begin{remark}
2657: \rm
2658: For example, the probability of an increase of mutual information
2659: by the amount $d$ is $O( 2^{-d})$.
The theorem
implies $\sum_{z} \m(z \mid x^{*}) 2^{I(z : y) - I(x : y)} =O(1)$;
that is, the $\m(\cdot \mid x^{*})$-expectation of the exponential of the increase
is bounded by a constant.
2664: \end{remark}
2665:
2666: \begin{proof}
2667: We have
2668: \begin{align*}
2669: I(z : y) - I(x : y) & = K(y) - K(y \mid z^{*}) - (K(y) - K(y \mid x^{*}))
2670: \\& = K(y \mid x^{*}) - K(y \mid z^{*}).
2671: \end{align*}
2672: The negative logarithm of the left-hand side in the theorem is therefore
2673: \[
2674: K(z \mid x^{*}) + K(y \mid z^{*}) - K(y \mid x^{*}).
2675: \]
2676: Using Theorem~\ref{lem.magic}, and the conditional
2677: additivity (\ref{eq.soi-cond}), this is
2678: \[
2679: \gea K(y, z \mid x^{*}) - K(y \mid x^{*}) \eqa
2680: K(z \mid x^{*}, y, K(y \mid x^{*})).
2681: \]
2682: \end{proof}
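
The two claims in the remark preceding the proof can be derived from the
theorem in a few lines. The following is a sketch under the assumption
that $\m(\cdot \mid x^{*}, y, K(y \mid x^{*}))$ is a semimeasure, so that
its values sum to at most $1$; the symbol
$\Delta(z) = I(z : y) - I(x : y)$ is shorthand introduced only for this
illustration. Summing the theorem over $z$ gives
\[
\sum_{z} \m(z \mid x^{*}) 2^{\Delta(z)}
\leq c_3 \sum_{z} \m(z \mid x^{*}, y, K(y \mid x^{*})) \leq c_3 .
\]
Since $2^{\Delta(z) - d} \geq 1$ whenever $\Delta(z) \geq d$, a
Markov-type argument then yields
\[
\sum_{z : \Delta(z) \geq d} \m(z \mid x^{*})
\leq 2^{-d} \sum_{z} \m(z \mid x^{*}) 2^{\Delta(z)}
\leq c_3 2^{-d} .
\]
Because $f(z \mid x)$ is bounded by a constant multiple of
$\m(z \mid x^{*})$, the probability under $f$ of an increase of at least
$d$ is $O(2^{-d})$ as well.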
2683:
2684: \begin{remark}
2685: \rm An example of the use of algorithmic mutual information is as
2686: follows \cite{Le02}. A celebrated result of K. G\"odel states that
2687: Peano Arithmetic is incomplete in the sense that it cannot be
2688: consistently extended to a complete theory using recursively
2689: enumerable axiom sets. (Here `complete' means that every sentence of
2690: Peano Arithmetic is decidable within the theory; for further
2691: details on the terminology used in this example, we refer to
\cite{LiVi97}.) The essence is the non-existence of total recursive
2693: extensions of a universal partial recursive predicate. This is
2694: usually taken to mean that mathematics is undecidable.
2695: Non-existence of an algorithmic solution need not be a problem when
2696: the requirements do not imply unique solutions. A perfect example is
2697: the generation of strings of high Kolmogorov complexity, say of half
2698: the length of the strings. There is no deterministic effective process that can
2699: produce such a string; but repeatedly flipping a fair coin we
2700: generate a desired string with overwhelming probability. Therefore,
2701: the question arises whether randomized means allow us to bypass
2702: G\"odel's result. The notion of mutual information between two
2703: finite strings can be refined and extended to infinite sequences, so
2704: that, again, it cannot be increased by either deterministic or
2705: randomized processing. In \cite{Le02} the existence of an infinite
2706: sequence is shown that has infinite mutual information with all
2707: total extensions of a universal partial recursive predicate. As
2708: Levin states ``it plays the role of password: no substantial
2709: information about it can be guessed, no matter what methods are
allowed.'' This ``forbidden information'' is used to extend
G\"odel's incompleteness result to also hold for consistent
2712: extensions to a complete theory by randomized means with
2713: non-vanishing probability.
2714: \end{remark}
2715:
2716:
2717:
2718: \
2719: \\
2720: \paragraph{Problem and Lacuna:}
2721: Entropy, Kolmogorov complexity and mutual
2722: (algorithmic) information are concepts that do not distinguish
2723: between different {\em kinds\/} of information (such as `meaningful' and `meaningless'
2724: information). In the remainder of this paper, we show how these more
2725: intricate notions
2726: can be arrived at, typically by {\em constraining\/} the description
2727: methods with which strings are allowed to be encoded
2728: (Section~\ref{sec:algsuf}) and by considering {\em lossy\/} rather
2729: than lossless compression (Section~\ref{sect.rdsf}). Nevertheless, the basic
notions of entropy, Kolmogorov complexity and mutual information continue
2731: to play a fundamental r\^ole.
2732: %This leads to Distortion Theory
2733: %\cite{CT91} in the Information Theory setting and
2734: %to meaningful information in the sense of algorithmic sufficient
2735: %statistics in the Kolmogorov complexity setting
2736: %that also has applications to the question-answer game. But that is
2737: %another story that will be told elsewhere because current space has
2738: %run out.
2739: \section{Sufficient Statistic}
2740: \label{sect.sufstat}
2741: In introducing the notion of sufficiency in classical
2742: statistics, Fisher~\cite{Fi22} stated:
2743: \begin{quote}
2744: ``The statistic chosen should summarize the whole of the relevant
2745: information supplied by the sample. This may be called
2746: the Criterion of Sufficiency $\ldots$
2747: In the case of the normal curve
2748: of distribution it is evident that the second moment is a
2749: sufficient statistic for estimating the standard deviation.''
2750: \end{quote}
2751: A ``sufficient'' statistic of the data
2752: contains all information in the data about the model class.
2753: Below we first discuss the standard notion of (probabilistic)
2754: sufficient statistic as employed in the statistical literature.
2755: We show that this notion has a natural interpretation in terms of
2756: Shannon mutual information, so that we may just as well
2757: think of a probabilistic sufficient
2758: statistic as a concept in Shannon information theory. Just as in the
2759: other sections of this paper, there is a corresponding notion in the
2760: Kolmogorov complexity literature: the algorithmic sufficient statistic
2761: which we introduce in Section~\ref{sec:algsuf}. Finally,
2762: in Section~\ref{sec:relpa} we connect the statistical/Shannon
2763: and the algorithmic notions of sufficiency.
2764: \subsection{Probabilistic Sufficient Statistic}
2765: \label{sec:probstat}
2766: Let $\{ P_\theta \}$ be a family of distributions,
2767: also called a {\em model class}, of a
2768: random variable $X$ that takes values in a finite or countable
2769: {\em set of data} ${\cal X}$.
2770: Let ${\mathbf \Theta}$
2771: be the set of parameters $\theta$ parameterizing the family
2772: $\{ P_\theta \}$. Any function $S: {\cal X} \rightarrow {\cal S}$
2773: taking values in some set ${\cal S}$ is said to be a {\em statistic}
2774: of the data in ${\cal X}$. A {\em statistic}
2775: $S$ is said to be {\em sufficient} for the family $\{
2776: P_{\theta} \}$ if,
2777: for every $s \in {\cal S}$,
2778: the conditional distribution
2779: \begin{equation}
2780: \label{eq:cond}
2781: P_{\theta}(X= \cdot \mid S(x) = s)
2782: \end{equation}
2783: is invariant under changes of $\theta$. This is the
2784: standard definition
2785: in the statistical literature, see
2786: for example \cite{CoxH74}.
2787: Intuitively, (\ref{eq:cond}) means that all information about $\theta$
2788: in the observation $x$ is present in the (coarser) observation $S(x)$,
2789: in line with Fisher's quote above.
2790:
2791:
2792: The notion of `sufficient statistic'
2793: can be equivalently expressed in terms
2794: of probability mass functions. Let $f_{\theta}(x) = P_{\theta} (X=x)$
2795: denote the
2796: probability mass of $x$ according to $P_{\theta}$.
2797: We identify distributions $P_{\theta}$ with their
2798: mass functions $f_{\theta}$ and denote the model class $\{P_{\theta} \}$ by
2799: $\{f_{\theta}\}$. Let
2800: $f_{\theta}(x | s)$ denote the
2801: probability mass function of the conditional distribution (\ref{eq:cond}), defined as
2802: in Section~\ref{sec:preliminaries}. That is,
2803: % .
2804: %Then,
2805: $$
2806: f_{\theta}(x|s)
2807: =
2808: \begin{cases}
f_{\theta}(x) / \sum_{y \in{\cal X}: S(y) = s} f_{\theta}(y)
2810: & \text{\ if\ } S(x) = s \\
2811: 0 & \text{\ if\ } S(x) \neq s .
2812: \end{cases}
2813: $$
2814: The requirement of $S$ to be sufficient is equivalent to the existence
2815: of a function
2816: $g: {\cal X} \times {\cal S} \rightarrow {\cal R}$ such
2817: that
2818: \begin{equation}
2819: \label{eq:standarddef}
2820: g(x \mid s) = f_{\theta}(x \mid s),
2821: \end{equation}
2822: for every $\theta \in {\mathbf \Theta}$, $s \in {\cal S}$,
2823: $x \in {\cal X}$. (Here we change the common
2824: notation `$g(x,s)$' to `$g(x \mid s)$' which is more expressive for
2825: our purpose.)
2826:
2827: \begin{example}
2828: \rm
2829: Let ${\cal X} = \{ 0,1\}^{n}$, let $X = (X_1, \ldots, X_n)$. Let
2830: $\{P_{\theta} : \theta \in (0,1 ) \}$ be the set of $n$-fold
2831: Bernoulli
2832: distributions on
2833: ${\cal X}$ with parameter $\theta$. That is,
2834: $$
2835: f_{\theta}(x) = f_{\theta}(x_1 \ldots x_n) = \theta^{S(x)}(1- \theta)^{n-S(x)}$$
2836: where $S(x)$ is the number of $1$'s in $x$. Then $S(x)$ is a
2837: sufficient statistic for $\{ P_{\theta} \}$. Namely, fix an arbitrary
2838: $P_{\theta}$ with $\theta \in (0,1)$ and an arbitrary $s$ with $0 < s<
2839: n$.
2840: Then all $x$'s with $s$ ones and $n-s$ zeroes are equally probable.
2841: The number of such $x$'s is $\binom{n}{s}$. Therefore, the
2842: probability $P_\theta(X=x \mid S(x) = s)$ is equal to
2843: $1/ \binom{n}{s}$, and this does not depend on the parameter
2844: $\theta$.
2845: Equivalently, for all $\theta \in (0,1)$,
2846: \begin{equation}
2847: \label{eq:berndef}
2848: f_{\theta}(x \mid s)
2849: = \begin{cases} 1/ \binom{n}{s} & \text{\ if\ } S(x) = s
2850: \\
2851: 0 & \text{\ otherwise.}
2852: \end{cases}
2853: \end{equation}
2854: Since (\ref{eq:berndef}) satisfies
2855: (\ref{eq:standarddef}) (with $g(x|s)$ the uniform distribution on all
2856: $x$ with exactly $s$ ones), $S(x)$
2857: is a sufficient statistic relative to the model class $\{P_{\theta}\}$.
2858: In the Bernoulli case,
2859: $g(x|s)$ can be
2860: obtained by starting from the {\em uniform\/}
2861: distribution on ${\cal X}$ ($\theta = \frac{1}{2}$),
2862: and conditioning on $S(x)=s$. But
2863: $g$ is not necessarily uniform. For example, for the
2864: Poisson model class, where $\{f_{\theta}\}$
2865: represents the set of Poisson distributions on $n$ observations,
2866: the observed mean is a sufficient statistic and the corresponding $g$
2867: is far from uniform.
2868: All information
2869: about the parameter $\theta$ in
2870: the observation $x$ is already contained in $S(x)$.
2871: In the Bernoulli case, once we know
2872: the number $S(x)$ of $1$'s in $x$, all further details
2873: of $x$ (such as the order of $0$s and $1$s) are irrelevant
2874: for determination of the Bernoulli parameter $\theta$.
2875:
To give an example of a
statistic that is not sufficient for the Bernoulli model class, consider
2878: the statistic $T(x)$ which counts the number of 1s in $x$ that are
2879: followed by a $1$. On the other hand, for every statistic $U$, the
2880: combined statistic $V(x) := (S(x),
2881: U(x))$ with $S(x)$ as before, is sufficient, since it contains all
2882: information in $S(x)$. But in contrast to $S(x)$, a
2883: statistic such as $V(x)$ is typically not
2884: {\em minimal}, as explained further below.
2885: \end{example}
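
As a small worked instance of the example above (the particular values
$n=3$, $s=2$ and $n=2$ are chosen here purely for illustration), take
$n=3$ and $s=2$. The strings $110$, $101$ and $011$ each have probability
$\theta^2(1-\theta)$, so
\[
f_{\theta}(110 \mid 2)
= \frac{\theta^2(1-\theta)}{3\,\theta^2(1-\theta)}
= \frac{1}{3} = 1 / \binom{3}{2},
\]
which does not involve $\theta$. By contrast, for the statistic $T$ with
$n=2$ we have $T(11)=1$ and $T(00)=T(01)=T(10)=0$, and
\[
P_{\theta}(X = 00 \mid T(x) = 0)
= \frac{(1-\theta)^2}{(1-\theta)^2 + 2\theta(1-\theta)}
= \frac{1-\theta}{1+\theta},
\]
which does depend on $\theta$. This is one way to see that $T$ is not
sufficient for the Bernoulli model class.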
2886:
2887: It will be useful to rewrite
2888: (\ref{eq:standarddef}) as
2889: \begin{equation}
2890: \label{eq:indivdef}
2891: \log 1/f_{\theta}(x \mid s) = \log 1/ g(x| s).
2892: \end{equation}
2893: \begin{definition}\label{def.wss}
2894: \rm
2895: A function $S: {\cal X} \rightarrow {\cal S}$ is
2896: a {\em probabilistic sufficient statistic} for $\{f_{\theta}\}$ if
2897: there exists a function $g : {\cal X} \times {\cal
2898: S} \rightarrow {\cal R}$ such that (\ref{eq:indivdef}) holds for every
$\theta \in {\mathbf \Theta}$, every $x \in {\cal X}$, every $s \in
{\cal S}$. (Here we use the convention $\log 1/0 = \infty$.)
2901: \end{definition}
2902: \paragraph{Expectation-version of definition:}
2903: The standard definition of probabilistic
2904: sufficient statistics is ostensibly of the
2905: `individual-sequence'-type: for $S$ to be sufficient,
2906: (\ref{eq:indivdef}) has to hold for
2907: {\em every} $x$, rather than merely in expectation or with high probability.
2908: However,
2909: %because (\ref{eq:indivdef}) has to hold for every $\theta$ as well,
2910: the definition turns out to be equivalent to an expectation-oriented
2911: form, as shown in Proposition~\ref{prop:suff}. We first introduce an a priori distribution
over ${\mathbf \Theta}$, the parameter set for our model class $\{ f_\theta\}$. We
2913: denote the probability density of this distribution by $p_1$.
2914: This way we can define a joint distribution
2915: $p(\theta,x) = p_1 (\theta) f_{\theta}(x)$.
2916: \begin{proposition}
2917: \label{prop:suff}
2918: \rm
2919: The following two statements are equivalent to
2920: Definition~\ref{def.wss}: (1)
2921: For every $\theta \in {\mathbf \Theta}$,
2922: \begin{equation}
2923: \label{eq:expdef}
2924: \sum_x f_\theta(x) \log 1/f_\theta(x \mid S(x)) =
2925: \sum_{x} f_\theta(x) \log 1/g(x \mid S(x)) \; .
2926: %{\bf E}_{X \sim f_\theta} [ - \log f_{\theta}(X|S(X))] =
2927: %{\bf E}_{X \sim f_\theta} [- \log g(X| S(X))].
2928: \end{equation}
2929: %the expectation taken over $f_\theta$.
2930: (2) For {\em every} prior $p_1(\theta)$ on ${\mathbf \Theta}$,
2931: \begin{equation}
2932: \label{eq:expdef2}
2933: \sum_{\theta,x} p(\theta,x) \log 1/f_\theta(x \mid S(x))
2934: = \sum_{\theta,x} p(\theta,x) \log 1/g(x \mid S(x)) \; .
2935: %{\bf E}_{(X,\Theta) \sim p} [ - \log f_{\theta}(X|S(X))] =
2936: %{\bf E}_{(X,\Theta) \sim p} [- \log g(X| S(X))],
2937: \end{equation}
2938: %the expectation taken over the joint distribution $p(\theta,x) =
2939: %p_1(\theta) f_\theta(x)$.
2940: \end{proposition}
2941: %Note that (\ref{eq:expdef}) is shorthand for
2942: %$$
2943: %$$
2944: %while (\ref{eq:expdef2}) is shorthand for
2945: %\begin{equation}
2946: %\end{equation}
2947: \begin{proof}
2948: {\em Definition~\ref{def.wss} $\Rightarrow$ \eqref{eq:expdef}:}
2949: Suppose (\ref{eq:indivdef}) holds for every $\theta
2950: \in {\mathbf \Theta}$, every $x \in {\cal X}$, every $s \in {\cal S}$.
2951: Then it also holds in expectation for every $\theta \in {\mathbf
2952: \Theta}$:
2953: \begin{equation}
2954: \label{eq:propsuff1}
2955: \sum_{x} f_\theta(x) \log 1/f_{\theta}(x|S(x)) = \sum_{x}
f_\theta(x) \log 1/ g(x| S(x)).
2957: \end{equation}
2958:
2959: {\em \eqref{eq:expdef}$\Rightarrow$ Definition~\ref{def.wss}:}
2960: Suppose that for every $\theta \in {\mathbf \Theta}$,
2961: (\ref{eq:propsuff1}) holds.
2962: Denote
2963: \begin{equation}\label{eq:overload}
2964: f_{\theta}(s) = \sum_{y \in {\cal X}: S(y)=s} f_{\theta} (y).
2965: \end{equation}
2966: By adding
2967: $\sum_x f_{\theta}(x) \log 1/f_\theta(S(x))$ to both
2968: sides of the equation, (\ref{eq:propsuff1}) can be rewritten as
2969: \begin{equation}
2970: \label{eq:suffent}
2971: \sum_x f_{\theta} (x) \log 1/ f_{\theta}(x) =
2972: \sum_x f_{\theta}(x) \log1/ g_{\theta}(x),
2973: \end{equation}
2974: with
$g_{\theta}(x) = f_{\theta}(S(x)) \cdot g(x| S(x)).$
2976: By the information inequality \eqref{eq.ii}, the equality
2977: (\ref{eq:suffent}) can
2978: only hold if $g_{\theta}(x) =
2979: f_{\theta}(x)$ for every $x \in {\cal X}$.
2980: Hence, we have established \eqref{eq:indivdef}.
2981:
2982: {\em \eqref{eq:expdef} $\Leftrightarrow$ \eqref{eq:expdef2}:}
2983: follows by linearity of expectation.
2984: \end{proof}
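
The two algebraic steps in the second part of the proof can be made
explicit. The sketch below assumes, as the argument implicitly does, that
for each $s$ the function $g(\cdot \mid s)$ is a probability mass function
concentrated on $\{x : S(x) = s\}$, so that $g_{\theta}$ sums to $1$, and
it considers only $x$ with $f_{\theta}(x) > 0$. First, since
$f_{\theta}(x) = f_{\theta}(S(x)) \, f_{\theta}(x \mid S(x))$,
\[
\log 1/f_{\theta}(x \mid S(x)) + \log 1/f_{\theta}(S(x))
= \log 1/f_{\theta}(x),
\]
which turns the left-hand side of (\ref{eq:propsuff1}) into the left-hand
side of (\ref{eq:suffent}); the right-hand sides are related in the same
way through $g_{\theta}$. Second, once $g_{\theta}(x) = f_{\theta}(x)$ is
known, dividing by $f_{\theta}(S(x))$ gives
\[
g(x \mid S(x)) = \frac{f_{\theta}(x)}{f_{\theta}(S(x))}
= f_{\theta}(x \mid S(x)),
\]
which is the statement of (\ref{eq:standarddef}) at $s = S(x)$.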
2985:
2986: \paragraph{Mutual information-version of definition:}
2987: After some rearranging of terms, the characterization
2988: (\ref{eq:expdef2}) gives rise to the intuitively appealing definition
2989: of probabilistic sufficient statistic in terms of mutual information
2990: \eqref{eq.mutinfprob}. The resulting formulation of sufficiency
2991: is as follows \cite{CT91}:
$S$ is sufficient for $\{ f_{\theta} \}$ iff for all
priors $p_1$ on ${\mathbf \Theta}$,
\begin{equation}\label{eq.suffstatprob}
I(\Theta ; X) = I( \Theta ; S(X)) .
\end{equation}
2999:
3000: Thus, a statistic $S(x)$ is
3001: sufficient if the probabilistic mutual
3002: information is invariant under taking the statistic \eqref{eq.suffstatprob}.
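
One direction of the ``rearranging of terms'' can be sketched as follows.
The sketch assumes that all quantities involved are finite and, following
the notation of (\ref{eq:expdef2}), lets sums over $\theta$ stand for
integrals when the prior has a density. The symbol $p(x \mid s)$, denoting
the conditional mass function of $X$ given $S(X) = s$ under the joint
distribution $p$, is introduced only for this illustration. Because $S(X)$
is a function of $X$, the chain rule for mutual information gives
\begin{align*}
I(\Theta ; X) - I(\Theta ; S(X)) & = I(\Theta ; X \mid S(X))
\\ & = \sum_{\theta,x} p(\theta,x) \log 1/p(x \mid S(x))
 - \sum_{\theta,x} p(\theta,x) \log 1/f_{\theta}(x \mid S(x)) .
\end{align*}
If $S$ is sufficient, then $f_{\theta}(x \mid s) = g(x \mid s)$ for every
$\theta$, hence also $p(x \mid s) = g(x \mid s)$, and the two sums
coincide, so that the difference is $0$.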
3003: \paragraph{Minimal Probabilistic Sufficient Statistic:}
3004: A sufficient statistic may contain information
3005: that is not relevant: for a normal distribution the sample mean
3006: is a sufficient statistic, but the pair of functions
3007: which give the
3008: mean of the even-numbered samples and the odd-numbered samples
3009: respectively, is also a sufficient statistic.
3010: A statistic $S(x)$ is a {\em minimal} sufficient statistic
3011: with respect to an indexed
3012: model class $\{f_{\theta}\}$, if it is a
3013: function of all other sufficient statistics: it contains no
3014: irrelevant information and maximally compresses the information
3015: in the data about
3016: the model class.
3017: For the family of normal distributions
3018: the sample mean is a minimal sufficient statistic, but the
3019: sufficient statistic consisting of the mean of the even samples
3020: in combination with the mean of the odd samples is not minimal.
3021: Note that one cannot improve on sufficiency:
3022: The data processing inequality \eqref{eq.infnonincrprob} states that
3023: $
3024: I(\Theta ; X) \geq I( \Theta ; S(X)),
3025: $
3026: for every function $S$, and that
3027: for randomized functions $S$ an appropriate related expression holds.
3028: That is, mutual information between data random variable and model random
3029: variable
3030: cannot be increased by processing the data sample in any way.
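
The relation between the two sufficient statistics for the normal family
mentioned above can be illustrated concretely. In this sketch the number
of observations is assumed to be even, $n = 2m$, and the symbols
$\bar{x}_{\rm odd}$ and $\bar{x}_{\rm even}$ for the means of the
odd-numbered and even-numbered samples are notation introduced here:
\[
\bar{x} = \frac{1}{2m} \sum_{i=1}^{2m} x_i
= \frac{1}{2} \left( \bar{x}_{\rm odd} + \bar{x}_{\rm even} \right) .
\]
So the sample mean is a function of the pair
$(\bar{x}_{\rm odd}, \bar{x}_{\rm even})$. The converse fails: for $n=2$
the samples $(0,2)$ and $(1,1)$ have the same mean $1$ but different
pairs, $(0,2)$ and $(1,1)$. The pair therefore retains information that
the mean discards, which is why it cannot be minimal.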
3031: \paragraph{Problem and Lacuna:}
3032: We can think of the probabilistic sufficient statistic as extracting
3033: those patterns in the data that are relevant in determining the
3034: parameters of a statistical model class. But what if we do not want to
3035: commit ourselves to a simple finite-dimensional parametric model
3036: class? In the most general context, we may consider the model class of all computable
3037: distributions, or all computable sets of which the observed data is an
3038: element. Does there exist an analogue
3039: of the sufficient statistic that automatically summarizes {\em all\/}
3040: information in the sample $x$ that is relevant for determining the
3041: ``best'' (appropriately defined)
3042: model for $x$ within this enormous class of models? Of course,
3043: we may consider the
3044: literal data $x$ as a statistic of $x$, but that would not be
3045: satisfactory: we would still like our generalized statistic, at least
3046: in many cases, to be
3047: considerably coarser, and much more concise, than the data $x$ itself.
3048: It turns
3049: out that, to some extent, this is achieved by
3050: the {\em algorithmic\/} sufficient statistic
3051: of the data: it
3052: summarizes {\em all\/} conceivably relevant information in the
3053: data $x$; at the same time, many types of data $x$ admit an algorithmic
3054: sufficient statistic that is concise in the sense that it has very small
3055: Kolmogorov complexity.
3056: \subsection{Algorithmic Sufficient Statistic}
3057: \label{sec:algsuf}
3058: %While previous authors have used the name ``Kolmogorov sufficient statistic''
3059: %because the model appears to summarize the relevant information in the data
3060: %in analogy of what the classic sufficient statistic
3061: %does in a probabilistic sense, a formal justification has been lacking.
3062: \subsubsection{Meaningful Information}\index{meaningful information}
3063: \label{sect.meaning}
3064: The information contained in an individual
3065: finite object (like a finite binary string) is measured
3066: by its Kolmogorov complexity---the length of the shortest binary program
3067: that computes the object. Such a shortest program contains no redundancy:
3068: every bit is information; but is it meaningful information?
3069: If we flip a fair coin to obtain a finite binary string, then with overwhelming
3070: probability that string constitutes its own shortest program. However,
3071: also with overwhelming probability all the bits in the string are meaningless
3072: information, random noise. On the other hand, let an object
3073: $x$ be a sequence of observations of heavenly bodies. Then $x$
3074: can be described by the binary
3075: string $pd$, where $p$ is the description of
3076: the laws of gravity and the observational
3077: parameter setting, while $d$ accounts for the measurement errors:
3078: we can divide the information in $x$ into
3079: meaningful information $p$ and accidental information $d$.
3080: The main task for statistical inference and learning theory is to
3081: distill the meaningful information present in the data. The question
3082: arises whether it is possible to separate meaningful
3083: information from accidental information, and if so, how.
3084: The essence of the solution to this problem is revealed when we
3085: write Definition~\ref{def.KolmK}
3086: as follows:
3087: %(use the universality of the fixed reference universal prefix Turing machine
3088: %$U=T_u$ with $|u| = O(1)$ to obtain the last equality):
3089: \begin{equation}\label{eq.kcmdl}
3090: K(x) =
3091: \min_{p,i} \{K(i)+l(p):T_i(p) =x\}+O(1),
3092: \end{equation}
3093: where the minimum is taken over
3094: $p \in \{0,1\}^*$ and $i \in \{1,2, \ldots\}$.
3095: The justification is that for the fixed reference
3096: universal prefix Turing machine
3097: $U(\langle i,p \rangle)=T_i(p)$ for all $i$ and $p$. Since $i^*$
3098: denotes the shortest self-delimiting program for $i$, we have
3099: $|i^*|=K(i)$.
3100: The expression \eqref{eq.kcmdl}
3101: emphasizes the two-part code nature of Kolmogorov complexity.
3102: In a randomly truncated initial segment of a time series
3103: $$x = 10101010101010101010101010,$$
3104: we can encode $x$ by a small Turing machine printing a specified
number of copies of the pattern ``10.''
3106: %which computes
3107: %$x$ from the program ``13.''
3108: %The minimal-length
3109: %two-part code squeezes out regularity only insofar as
3110: %the reduction in the length of the description of random aspects
3111: %is greater than the increase in the regularity description.
3112: This way, $K(x)$ is viewed as the shortest length of
3113: a two-part code for $x$, one part describing a Turing machine $T$,
3114: or {\em model}, for the {\em regular} aspects of $x$,
3115: and the second part describing
3116: the {\em irregular} aspects of $x$ in the form
3117: of a program $p$ to be interpreted by $T$.
3118: The regular, or ``valuable,'' information in $x$ is constituted
3119: by the bits in the ``model'' while the random or ``useless''
3120: information of $x$ constitutes the remainder.
3121: This leaves open the crucial question: How to choose
3122: $T$ and $p$ that together describe $x$? In general, many
3123: combinations of $T$ and $p$ are possible, but we want to find
3124: a $T$ that describes the meaningful aspects of $x$.
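
As a rough illustration of (\ref{eq.kcmdl}) on the $26$-bit string $x$
displayed above, which consists of $13$ copies of the block $10$, compare
two candidate two-part descriptions. The machine indices $a$ and $b$ and
the bit counts are hypothetical, chosen for illustration only; additive
constants and the overhead of self-delimiting encodings are ignored. Let
$T_a$ print as many copies of the block $10$ as its input specifies in
binary, and let $T_b$ copy its input to its output. Then
\begin{align*}
T_a(1101) = x & : \quad K(a) + l(1101) = K(a) + 4,
\\ T_b(x) = x & : \quad K(b) + l(x) = K(b) + 26 .
\end{align*}
Both $K(a)$ and $K(b)$ are small and do not depend on the length of the
data. For longer strings of the same form the second part of the first
description grows only logarithmically in the length, while that of the
second grows linearly.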
3125:
3126: \subsubsection{Data and Model}
3127: \index{data}\index{model} We consider only finite binary data strings
3128: $x$. Our model class consists of Turing machines $T$ that enumerate a
3129: finite set, say $S$, such that on input $p \leq |S|$ we have $T(p)=x$
3130: with $x$ the $p$th element of $T$'s enumeration of $S$, and $T(p)$ is
3131: a special {\em undefined} value if $p>|S|$. The ``best fitting''
3132: model for $x$ is a Turing machine $T$ that reaches the minimum
3133: description length in (\ref{eq.kcmdl}). There may be many such $T$,
3134: but, as we will see, if chosen properly, such a machine $T$ embodies
3135: the amount of useful information contained in $x$. Thus, we have
3136: divided a shortest program $x^*$ for $x$ into parts $x^*=T^*(p)$ such
3137: that $T^*$ is a shortest self-delimiting program for $T$. Now suppose
3138: we consider only low complexity finite-set models, and under these
3139: constraints the shortest two-part description happens to be longer
3140: than the shortest one-part description. For example, this can happen
3141: if the data is generated by a model that is too complex to be in the
3142: contemplated model class. Does the model minimizing the two-part
3143: description still capture all (or as much as possible) meaningful
3144: information? Such considerations require study of the relation between
3145: the complexity limit on the contemplated model classes, the shortest
3146: two-part code length, and the amount of meaningful information
3147: captured.
3148:
3149:
3150: In the following we will distinguish between ``models'' that are
3151: finite sets, and the ``shortest programs'' to compute those models
3152: that are finite strings. The latter will be called `algorithmic statistics'.
3153: %Such a shortest program is in the proper
3154: %sense a statistic of the data sample as defined before.
3155: In a way the distinction between ``model'' and ``statistic'' is
3156: artificial, but for now we prefer clarity and unambiguousness in the
3157: discussion. Moreover, the terminology is customary in the literature
3158: on algorithmic statistics. Note that strictly speaking, neither an
3159: algorithmic statistic nor the set it defines is a statistic in the
3160: probabilistic sense: the latter was defined as a {\em function\/} on
3161: the set of possible data samples of given length. Both notions are
3162: unified in Section~\ref{sec:relpa}.
3163: \subsubsection{Typical Elements}
3164: \index{typical data}\index{random data}
3165: Consider a string $x$
3166: of length $n$ and prefix complexity $K(x)=k$.
3167: For every finite set $S \subseteq \{0,1\}^*$ containing
3168: $x$ we have $K(x | S)\le\log|S|+O(1)$.
Indeed, consider the prefix code for $x$
consisting of the $\lceil\log|S|\rceil$-bit index
of $x$ in the lexicographical ordering of $S$.
This code is called the
\emph{data-to-model code}.
We identify the {\em structure} or {\em regularity} in $x$ that is
to be summarized with a set $S$
3176: of which $x$ is a {\em random} or {\em typical} member:
3177: given $S$ containing $x$, %(or rather,
3178: %shortest program $S^*$ for $S$),
the element $x$ cannot
be described by a code significantly shorter than its maximal-length index in $S$,
that is, $K(x \mid S) \geq \log |S| - O(1)$.
3182:
3183: \begin{definition}
3184: \rm
Let $\beta \ge 0$ be an agreed-upon, fixed constant.
3186: A finite binary string $x$
3187: is a {\em typical} or {\em random} element of a set $S$ of finite binary
3188: strings, if $x \in S$ and
3189: \begin{equation}\label{eq.deftyp}
3190: K(x \mid S) \ge \log |S| - \beta.
3191: \end{equation}
3192: %where $S^*$ is a shortest program for $S$.
3193: We will not indicate the dependence on $\beta$ explicitly, but the
3194: constants in all our inequalities ($O(1)$) will be allowed to be functions
3195: of this $\beta$.
3196: \end{definition}
3197:
3198: This definition requires a finite $S$.
3199: In fact, since
3200: $K(x \mid S) \leq K(x)+O(1) $, it limits the size of $S$ to $O(2^k)$.
3201: %The shortest program $S^*$ from
3202: %which $S$ can be computed is an {\em algorithmic statistic} for $x$ if
3203: %\index{algorithmic statistic}
3204: %\begin{equation}\label{eq.typ}
3205: %K(x \mid S) \geq \log |S| +O(1).
3206: %\end{equation}
3207: Note that the notion of typicality is not absolute
3208: but depends on fixing the constant implicit in the $O$-notation.
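
The bound on the size of $S$ follows in one line from
(\ref{eq.deftyp}). In this sketch, $c$ denotes the constant hidden in
$K(x \mid S) \leq K(x) + O(1)$, a name introduced here:
\[
\log |S| - \beta \leq K(x \mid S) \leq K(x) + c = k + c,
\qquad \mbox{so} \qquad |S| \leq 2^{k + \beta + c} = O(2^k) .
\]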
3209:
3210: \begin{example}\label{xmp.typical}
3211: \rm
3212: Consider the set $S$ of binary strings of length $n$
3213: whose every odd position is 0.
3214: Let $x$ be an element of this set in which the subsequence of bits in
3215: even positions is an incompressible string.
Then $x$ is a typical element of $S$ (or, with some abuse
of language, we can say that $S$ is typical for $x$).
3218: But $x$ is also a typical element of the set $\{x\}$.
3219: \end{example}
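
The claims of Example~\ref{xmp.typical} can be checked as follows. This
is a sketch under two assumptions that are ours: positions are numbered
from $1$, so that the subsequence $w$ of bits in even positions has length
$\lfloor n/2 \rfloor$, and ``incompressible'' is read as
$K(w \mid n) \geq \lfloor n/2 \rfloor - O(1)$. Then
$|S| = 2^{\lfloor n/2 \rfloor}$. Since $S$ and $n$ can be computed from
each other, and, given $n$, so can $x$ and $w$, we obtain
\[
K(x \mid S) \eqa K(w \mid n) \geq \lfloor n/2 \rfloor - O(1)
= \log |S| - O(1),
\]
which is (\ref{eq.deftyp}) for a suitable constant $\beta$. For the
singleton set we have $\log |\{x\}| = 0 \leq K(x \mid \{x\})$, so
(\ref{eq.deftyp}) holds trivially for every $\beta \geq 0$.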
3220:
3221:
3222: \subsubsection{Optimal Sets}
3223: \index{optimal model}
3224: Let $x$ be a binary data string of length $n$.
3225: For every finite set $S \ni x$, we have
3226: $K(x) \leq K(S) + \log |S| + O(1)$,
3227: since we can describe $x$ by giving $S$ and the index of $x$
3228: in a standard enumeration of $S$. Clearly this can be implemented
3229: by a Turing machine computing the finite set $S$ and a program
3230: $p$ giving the index of $x$ in $S$.
3231: The size of a set containing $x$ measures intuitively the number of
3232: properties of $x$ that are represented:
3233: The largest set is $\{0,1\}^{n}$ and represents only one property
3234: of $x$, namely, being of length $n$. It clearly ``underfits''
3235: as explanation or model for $x$. The smallest set containing $x$
3236: is the singleton set $\{x\}$ and represents all conceivable properties
3237: of $x$. It clearly ``overfits'' as explanation or model for $x$.
3238:
3239: There are two natural measures of suitability of such a set as
3240: a model for $x$.
3241: We might prefer either the simplest set, or the smallest set, as
3242: corresponding to the most likely structure `explaining' $x$.
3243: Both the largest set $\{0,1\}^n$ (having low complexity of about $K(n)$)
3244: and the singleton set $\{x\}$ (having high complexity of about $K(x)$),
3245: while certainly statistics for $x$,
3246: would indeed be considered poor explanations.
3247: \index{two-stage description}
3248: We would like to balance simplicity of model versus size of model.
3249: Both measures relate to the optimality of a two-stage description of
3250: $x$ using a finite set $S$ that contains it. Elaborating on
3251: the two-part code:
3252: \begin{align}\label{eq.twostage}
3253: K(x) \leq K(x,S) & \leq K(S) + K(x \mid S) +O(1)
3254: \\ & \leq K(S) + \log |S| +O(1),
3255: \nonumber
3256: \end{align}
3257: where only the final substitution of $K(x \mid S)$ by $\log |S|+O(1)$
3258: uses the fact that $x$ is an element of $S$.
3259: The closer the right-hand side of \eqref{eq.twostage} gets
3260: to the left-hand side, the better the description of $x$ is in terms
3261: of the set $S$.
3262: This implies a trade-off between meaningful model information, $K(S)$,
3263: and meaningless ``noise'' $\log |S|$.
3264: A set $S$ (containing $x$)
3265: for which \eqref{eq.twostage} holds with equality
3266: \begin{equation}\label{eq.optim}
3267: K(x) = K(S) + \log |S| +O(1),
3268: \end{equation}
3269: is called {\em optimal}.
3270: A data string $x$ can be typical for a set $S$ without that set $S$
3271: being optimal for $x$. This is the case precisely when $x$ is
typical for $S$ (that is, $K(x \mid S)=\log |S| +O(1)$)
3273: while $K(x,S)>K(x)$.
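
For illustration, the two extreme models mentioned above can be plugged into the right-hand side of \eqref{eq.twostage} (a rough calculation, with all terms understood up to $O(1)$):
\begin{align*}
S = \{0,1\}^n: & \quad K(S) + \log |S| = K(n) + n + O(1), \\
S = \{x\}: & \quad K(S) + \log |S| = K(x) + 0 + O(1).
\end{align*}
Thus the singleton always satisfies \eqref{eq.optim}, whereas $\{0,1\}^n$ satisfies it only for those $x$ whose complexity $K(x)$ is within $O(1)$ of $n + K(n)$, that is, for strings with essentially no regularity beyond their length.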
3274:
3275: %\begin{example}
3276: %\rm
3277: %Combining \eqref{eq.twostage} and
3278: %\eqref{eq.optim}, we see that if $S$ is an optimal set for $x$
3279: %then $K(x,S)=K(x)+O(1)$ which implies that $K(S\mid x)=O(1)$.
3280: %Going from $x$ to $S$ requires but an $O(1)$ length program,
3281: %which implies that there are ony $O(1)$ optimal sets for $x$,
3282: %however large $x$ may be.
3283: %\end{example}
3284:
3285: \subsubsection{Sufficient Statistic}
3286: \label{sect.ss}
3287: Intuitively, a model expresses the essence
3288: of the data if the two-part code describing the data consisting of the
3289: model and the data-to-model code is as concise as
3290: the best one-part description.
3291:
3292: Mindful of our distinction between a finite set $S$ and a
3293: program that describes $S$ in a required representation format,
3294: we call a shortest program for an optimal set with respect to $x$
3295: an {\em algorithmic sufficient statistic} for $x$.
3296: Furthermore, among optimal sets,
3297: there is a direct trade-off between complexity and log-size, which together
3298: sum to $ K(x)+O(1)$.
3299:
3300:
3301: \begin{example}\label{xmp.optimal}
3302: \rm
3303: It can be shown that the set $S$ of Example~\ref{xmp.typical} is also
3304: optimal, and so is $\{x\}$.
3305: Sets for which $x$ is typical form a much wider class than optimal
3306: sets for $x$: the set
3307: $\{x,y\}$ is still typical for
3308: $x$ but with most $y$, it will be too complex to be optimal for $x$.
3309:
3310: For a perhaps less artificial example, consider complexities conditional
3311: on the length $n$ of strings.
Let $y$ be a random string of length $n$, let
$S_{y}$ be the set of strings of length $n$ which have 0's wherever
$y$ has 0's (the remaining positions being arbitrary), and let $x$ be a random element of $S_{y}$.
3315: Then $x$ has about 25\%
3316: 1's,
3317: so its complexity is much less than $n$.
3318: The set $S_{y}$ has $x$ as a typical element,
3319: but is too complex to be optimal,
3320: since its complexity (even conditional on $n$) is still $n$.
3321: \end{example}
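
A back-of-the-envelope version of the second part of Example~\ref{xmp.optimal}, ignoring lower-order terms (the numerical values are our own rough estimates): a random $y$ has about $n/2$ ones, so $\log |S_y| \approx n/2$, while $K(S_y \mid n) \approx K(y \mid n) \approx n$. A string with about $n/4$ ones can be described, given $n$, by its number of ones and its index among the strings with that many ones, so
\[
K(x \mid n) \lesssim \log {n \choose n/4} + O(\log n) \approx H(1/4)\, n \approx 0.81\, n ,
\]
where $H(p) = -p \log p - (1-p) \log (1-p)$ denotes the binary entropy. Hence $K(S_y \mid n) + \log |S_y| \approx 1.5\, n$ exceeds $K(x \mid n)$ by a term linear in $n$, and \eqref{eq.optim} fails for $S_y$.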
3322:
3323: %Optimal sets (or rather, descriptions of them) are statistics.
3324: %Equality (\ref{eq.optim}) expresses the conditions on the algorithmic
3325: %individual relation between the data and the sufficient statistic.
3326: %Later we
3327: %demonstrate that this relation implies that the probabilistic
3328: %optimality of mutual information holds
3329: %for the algorithmic version in the expected sense.
3330:
3331: An algorithmic sufficient statistic
3332: \index{sufficient statistic, algorithmic}
3333: \index{sufficient statistic, algorithmic minimal}
3334: \index{sufficient statistic, probabilistic}
3335: \index{sufficient statistic, probabilistic minimal}
3336: is a sharper individual notion than a probabilistic sufficient
3337: statistic. An optimal set $S$ associated with $x$ (the shortest
3338: program computing $S$ is the corresponding
3339: sufficient statistic associated with $x$) is chosen such that
3340: $x$ is maximally random with respect to it. That is, the
3341: information in $x$ is divided in a relevant structure expressed
3342: by the set $S$, and the remaining randomness with respect
3343: to that structure, expressed by $x$'s index in $S$ of $\log |S|$
3344: bits. The shortest program for $S$ is itself alone an algorithmic
3345: definition of structure, without a probabilistic interpretation.
3346:
3347:
3348: %One can also consider notions of
3349: %{\em near}-typical and {\em near}-optimal that arise from replacing
3350: %the $\beta$ in (\ref{eq.deftyp})
3351: %by some slowly growing functions, such as $O(\log l(x))$ or
3352: %$O(\log k)$.
3353: %In~\cite{CT91}, only
3354: Those optimal sets that admit the shortest possible program
3355: are called {\em algorithmic minimal sufficient statistics\/} of
3356: $x$. They will play a major role in the next section on the Kolmogorov
3357: structure function. Summarizing:
3358: %$with the shortest program
3359: %(or rather that shortest program) is the
3360: \begin{definition}[Algorithmic sufficient statistic, algorithmic
3361: minimal sufficient statistic]
3362: \label{def:algsufstat}
3363: An {\em algorithmic sufficient statistic\/} of $x$ is a shortest program for
3364: a set $S$ containing $x$ that is optimal, i.e. it satisfies (\ref{eq.optim}).
3365: An algorithmic sufficient statistic with optimal set $S$ is {\em
3366: minimal\/} if there exists no optimal set $S'$ with $K(S') < K(S)$.
3367: \end{definition}
3368: \begin{example}
3369: \rm
3370: Let $k$ be a number in the range $0,1,\dots,n$
3371: of complexity $\log n+ O(1)$ given $n$ and let $x$ be a string of length
3372: $n$ having $k$ ones of complexity $K(x \mid n,k) \geq \log {n \choose k}$
3373: given $n,k$. This $x$ can be viewed as a typical result of
3374: tossing a coin with a bias about $p=k/n$.
3375: A two-part description
3376: of $x$ is given by
3377: the number $k$ of 1's in $x$ first, followed by the index
$j \leq |S|$ of $x$
3379: in the set $S$ of strings of length $n$ with $k$ 1's.
3380: This set is optimal, since
3381: $K(x \mid n)=K(x,k \mid n)=K(k \mid n)+K(x \mid k,n)= K(S)+ \log|S|$.
3382:
3383: Note that $S$ encodes the number of $1$s in $x$. The shortest program
3384: for $S$ is an
3385: algorithmic minimal sufficient statistic for {\em most\/} $x$ of
3386: length $n$ with $k$ $1$'s, since only a fraction of at most $2^{-m}$
3387: $x$'s of length $n$ with $k$ $1$s can have $K(x) < \log | S| - m$
3388: (Section~\ref{sec:kolmogorov}). But of course there exist $x$'s with
3389: $k$ ones which have much more regularity. An example is the string
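starting with $k$ $1$'s followed by $n-k$ $0$'s. For such a string,
$K(x \mid n) \leq K(k \mid n) + O(1)$, which, unless $k$ is close to $0$ or $n$,
falls short of $K(S \mid n) + \log |S|$ by about $\log {n \choose k}$ bits. Hence $S$ is
not optimal for such a string, and the shortest program for $S$ is not an algorithmic
sufficient statistic for it.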
3393: \end{example}
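
The $2^{-m}$ bound quoted in the example comes from a simple counting argument, sketched here for convenience. There are fewer than $2^{\ell}$ binary programs of length less than $\ell$, hence fewer than $2^{\ell}$ strings $x$ with $K(x) < \ell$. Taking $\ell = \log |S| - m$ gives
\[
\bigl| \{ x \in S : K(x) < \log |S| - m \} \bigr| < 2^{\log |S| - m} = 2^{-m} |S| .
\]
For example, with $m = 10$, fewer than one in a thousand ($2^{-10} = 1/1024$) of the strings with $k$ ones have complexity more than $10$ bits below $\log {n \choose k}$.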
3394:
3395:
3396: \commentout{
3397:
3398: \subsection{Expected Algorithmic Sufficient Statistic is Probabilistic Sufficient Statistic}
3399: \label{sect.formanal}
3400: Algorithmic sufficient statistic, a function of the data,
3401: is so named because intuitively
3402: it expresses an individual summarizing of the relevant information
3403: in the individual data, reminiscent of
3404: the probabilistic sufficient statistic that summarizes the
3405: relevant information in a data random variable about a model
3406: random variable. Formally, however, previous authors have
3407: not established any relation. Other algorithmic notions
3408: have been successfully related to their probabilistic
3409: counterparts. The most significant one is that for every computable
3410: probability distribution, the expected prefix complexity of the
3411: objects equals the entropy of the distribution up to an additive
3412: constant term, related to the complexity of the distribution in
3413: question. We have used this property in (\ref{eq.eqamipmi})
3414: to establish a similar relation between the expected
3415: algorithmic mutual information and the probabilistic mutual information.
3416: We use this in turn to show that
3417: there is a close relation between the algorithmic version and
3418: the probabilistic version of sufficient
3419: statistic: A probabilistic sufficient statistic is
3420: with high probability a natural conditional form
3421: of algorithmic sufficient statistic
3422: for individual data, and, conversely, that with
3423: high probability a natural conditional
3424: form of algorithmic sufficient statistic is also a probabilistic
3425: sufficient statistic.
3426:
3427: Recall the terminology of probabilistic mutual information
3428: (\ref{eq.mutinfprob})
3429: and probabilistic sufficient statistic (\ref{eq.suffstatprob}).
3430: Consider a probabilistic ensemble of models,
3431: a family of computable probability mass functions $\{f_{\theta} \}$
3432: indexed by a discrete parameter $\theta$, together with a computable
3433: distribution $f_1$ over $\theta$.
3434: (The finite set model case is the restriction where
3435: the $f_{\theta}$'s are restricted to uniform distributions
3436: with finite supports.)
3437: This way we have a random variable $\Theta$ with outcomes in $\{f_{\theta} \}$
3438: and a random variable $X$ with outcomes
3439: in the union of domains of $f_{\theta}$, and
3440: $f(\theta,x) = f_1 (\theta) f_{\theta}(x)$ is computable.
3441:
3442: \begin{notation}
3443: \rm
3444: To compare the algorithmic sufficient statistic
3445: with the probabilistic sufficient statistic it is
3446: convenient to denote the sufficient statistic
3447: as a function $S(\cdot)$ of the data in both cases.
3448: Let a statistic
3449: $S(x)$ of data $x$ be the more general form of probability distribution
3450: as in Remark~\ref{s.prob}. That is, $S$ maps the data $x$ to the
3451: parameter $\rho$ that determines
3452: a probability mass function $f_{\rho}$ (possibly not an element
3453: of $\{f_{\theta} \}$). Note that ``$f_{\rho} (\cdot)$'' corresponds
3454: to ``$P(\cdot)$''
3455: in Remark~\ref{s.prob}.
3456: If $f_{\rho}$ is computable, then this can be the
3457: Turing machine $T_{\rho}$ that computes
3458: $f_{\rho}$.
3459: Hence, in the current section,
3460: ``$S(x)$'' denotes a probability distribution, say $f_{\rho}$,
3461: and ``$f_{\rho}(x)$'' is the probability $f_{\rho}$ concentrates on data $x$.
3462: \end{notation}
3463: \begin{remark}
3464: \rm
In the probabilistic statistics setting,
every function $T(x)$ is a statistic of $x$, but only some
3467: of them are a sufficient statistic. In the algorithmic statistic
3468: setting we have a quite similar situation. In the finite set statistic
3469: case $S(x)$ is a finite set, and in the computable probability
3470: mass function case $S(x)$ is a computable probability mass function.
3471: In both algorithmic cases we have shown $K(S(x) \mid x^*) \eqa 0$
3472: for $S(x)$ is an implicitly or explicitly described sufficient statistic.
3473: This means that the number of such sufficient statistics for $x$
3474: is bounded by a universal constant, and that there is a universal program
3475: to compute all of them from $x^*$---and hence to compute
3476: the minimal sufficient statistic from $x^*$.
3477: \end{remark}
3478: \begin{lemma}\label{theo.eqpral}
3479: Let $f(\theta,x) = f_1 (\theta) f_{\theta} (x)$ be a computable joint
3480: probability mass function, and let
3481: $S$ be a function. Then all three conditions below are equivalent
3482: and imply each other:
3483:
3484: (i) $S$ is a probabilistic sufficient statistic
3485: (in the form $I(\Theta; X) \eqa I(\Theta ; S(X))$).
3486:
3487: (ii) $S$ satisfies
3488: \begin{equation}\label{eq.eqami}
3489: \sum_{\theta,x} f(\theta,x) I(\theta:x)
3490: \eqa
3491: \sum_{\theta,x} f(\theta,x) I(\theta: S(x))
3492: \end{equation}
3493:
3494: (iii) $S$ satisfies
3495: \begin{align*}
3496: I(\Theta ; X) \eqa I(\Theta ; S(X)) & \eqa
3497: \sum_{\theta,x} f(\theta,x) I(\theta:x)
3498: \\& \eqa
3499: \sum_{\theta,x} f(\theta,x) I(\theta: S(x)).
3500: \end{align*}
3501:
All $\eqa$ signs hold up to a $\pm 2K(f)$ constant additive term.
3503:
3504: \end{lemma}
3505:
3506: \begin{proof}
3507: Clearly, (iii) implies (i) and (ii).
3508:
3509: We show that both (i) implies (iii) and (ii) implies (iii):
3510: By (\ref{eq.eqamipmi}) we have
3511: \begin{align}\label{eq.asseq}
3512: I(\Theta ; X) & \eqa \sum_{\theta,x} f(\theta,x) I(\theta:x),
3513: \\ I(\Theta ; S(X)) & \eqa \sum_{\theta,x} f(\theta,x) I(\theta: S(x)),
3514: \nonumber
3515: \end{align}
3516: where we absorb a $\pm 2K(f)$ additive term in the $\eqa$ sign.
3517: Together with (\ref{eq.eqami}),
3518: (\ref{eq.asseq}) implies
3519: \begin{equation}\label{eq.eqpmi}
3520: I(\Theta ; X) \eqa I(\Theta ; S(X)) ;
3521: \end{equation}
3522: and {\em vice versa} (\ref{eq.eqpmi}) together with (\ref{eq.asseq})
3523: implies (\ref{eq.eqami}).
3524:
3525: \end{proof}
3526:
3527: \begin{remark}
3528: \rm
It may be worth stressing that $S$ in Lemma~\ref{theo.eqpral} can
3530: be any function, without restriction.
3531: \end{remark}
3532:
3533: \begin{remark}
3534: \rm
3535: Note that (\ref{eq.eqpmi}) involves equality $\eqa$
3536: rather than precise equality as in the
3537: definition of the probabilistic sufficient
3538: statistic (\ref{eq.suffstatprob}).
3539: \end{remark}
3540:
3541: \begin{definition}\label{def.thetaI}
3542: \rm
3543: Assume the terminology and notation above.
3544: A statistic $S$ for data $x$
3545: is {\em $\theta$-sufficient with deficiency $\delta$}
3546: if
3547: $I(\theta : x) \eqa I(\theta : S(x)) + \delta$.
3548: If $\delta \eqa 0$ then $S(x)$ is simply a {\em $\theta$-sufficient
3549: statistic}.
3550: \end{definition}
3551:
3552: The following lemma shows that $\theta$-sufficiency is a type
3553: of conditional sufficiency:
3554:
3555: \begin{lemma}\label{claim.1}
3556: Let $S(x)$ be a sufficient statistic for $x$. Then,
3557: \begin{equation}\label{eq.theta}
3558: K(x \mid \theta^*) + \delta \eqa K(S(x) \mid \theta^* ) - \log S(x).
3559: \end{equation}
3560: iff $I(\theta : x) \eqa I(\theta : S(x)) + \delta$.
3561: \end{lemma}
3562:
3563: \begin{proof}
3564: (If) By assumption,
3565: $K(S(x)) - K(S(x) \mid \theta^*) + \delta \eqa K(x) - K(x \mid \theta^*)$.
3566: %that is, $I(\theta: S(x)) + \delta \eqa I(\theta:x)$.
3567: %Since $S$ is a sufficient statistic for $x$, the term
3568: Rearrange and add
3569: $-K(x \mid S(x)^*)- \log S(x) \eqa 0$ (by typicality)
3570: to the right-hand side to obtain
3571: $K(x \mid \theta^*) +K(S(x)) \eqa K(S(x) \mid \theta^*) + K(x)
3572: - K(x \mid S(x)^*) - \log S(x) - \delta$.
3573: Substitute according to $K(x) \eqa K(S(x))+K(x \mid S(x)^*)$
3574: (by sufficiency) in the
3575: right-hand side, and subsequently subtract
3576: $K(S(x))$ from both sides, to obtain
3577: (\ref{eq.theta}).
3578:
3579: (Only If) Reverse the proof of the (If) case.
3580:
3581: \end{proof}
3582:
3583: The following theorems state that $S(X)$ is a probabilistic sufficient
3584: statistic iff $S(x)$ is an algorithmic $\theta$-sufficient statistic,
3585: up to small deficiency, with high probability.
3586:
3587:
3588: \begin{theorem}
3589: Let $f(\theta,x) = f_1 (\theta) f_{\theta} (x)$ be a computable joint
3590: probability mass function, and let
3591: $S$ be a function.
3592: If $S$ is
3593: a recursive probabilistic sufficient statistic, then
3594: $S$ is
3595: a $\theta$-sufficient statistic with deficiency $O(k)$,
3596: with $f$-probability at least $1 - \frac{1}{k}$.
3597: \end{theorem}
3598:
3599: \begin{proof}
3600: If $S$ is a probabilistic sufficient statistic,
3601: then, by Lemma~\ref{theo.eqpral}, equality of $f$-expectations (\ref{eq.eqami})
3602: holds. However, it is still consistent with this to have
3603: large positive and negative differences
3604: $I(\theta: x) -I(\theta:S(x))$
3605: for different $(\theta,x)$ arguments, such that these
3606: differences cancel each other.
3607: This problem is resolved by appeal to
3608: the algorithmic mutual information non-increase
3609: law (\ref{eq.nonincrease}) which shows that all differences are
3610: essentially positive:
3611: $I(\theta : x) - I(\theta : S(x)) \gea -K(S)$.
3612: Altogether, let $c_1,c_2$ be least positive constants such that
3613: $I(\theta : x) - I(\theta : S(x))+c_1$ is always nonnegative
3614: and its $f$-expectation is $c_2$.
3615: Then, by Markov's inequality,
3616: \[
3617: f ( I( \theta : x) - I(\theta : S(x)) \geq kc_2 - c_1 ) \leq \frac{1}{k},
3618: \]
3619: that is,
3620: \[ f ( I( \theta : x) - I(\theta : S(x)) < kc_2 - c_1 )
3621: > 1 - \frac{1}{k}.
3622: \]
3623: \end{proof}
3624:
3625: \begin{theorem}
3626: For each $n$, consider the set of data $x$ of length $n$.
3627: Let $f(\theta,x) = f_1 (\theta) f_{\theta} (x)$ be a computable joint
3628: probability mass function, and let
3629: $S$ be a function.
3630: If $S$ is an algorithmic $\theta$-sufficient statistic for
3631: $x$, with $f$-probability
3632: at least $1-\epsilon$ ($1/\epsilon \eqa n + 2 \log n$), then
3633: $S$ is a probabilistic sufficient statistic.
3634: \end{theorem}
3635:
3636: \begin{proof}
3637: By assumption, using Definition~\ref{def.thetaI},
3638: there is a positive constant $c_1$, such that,
3639: \[
3640: f ( | I(\theta : x) - I(\theta : S(x))| \leq c_1) \geq 1- \epsilon.
3641: \]
3642: Therefore,
3643: \begin{align*}
3644: 0 \leq \sum_{| I(\theta : x ) - I(\theta : S(x))| \leq c_1 } f(\theta ,x)
3645: & |I(\theta : x ) - I(\theta : S(x))|
3646: \\ & \lea (1-\epsilon)c_1 \eqa 0.
3647: \end{align*}
3648: On the other hand, since
3649: \[
3650: 1/\epsilon \gea n + 2 \log n \gea K(x) \gea \max_{\theta , x} I(\theta ; x),
3651: \]
3652: we obtain
3653: \begin{align*}
3654: 0 \leq \sum_{| I(\theta : x ) - I(\theta : S(x))| > c_1 } f(\theta ,x)
3655: & |I(\theta : x ) - I(\theta : S(x))|
3656: \\ & \lea \epsilon (n+2 \log n) \lea 0.
3657: \end{align*}
3658: Altogether, this implies (\ref{eq.eqami}), and by
3659: Lemma~\ref{theo.eqpral}, the theorem.
3660: \end{proof}
3661: }
3662:
3663: \subsection{Relating Probabilistic and Algorithmic Sufficiency}
3664: \label{sec:relpa}
3665: We want to relate `algorithmic sufficient statistics' (defined
3666: independently of any model class $\{f_\theta\}$) to probabilistic sufficient
3667: statistics (defined relative to some model class
3668: $\{f_\theta\}$ as in Section~\ref{sec:probstat}). We will show that,
3669: essentially, algorithmic sufficient statistics are probabilistic
3670: nearly-sufficient statistics with respect to {\em all\/} model families $\{
3671: f_{\theta} \}$. Since the notion of
3672: algorithmic sufficiency is only defined to within additive constants,
3673: we cannot expect algorithmic sufficient statistics to satisfy
3674: the requirements (\ref{eq:indivdef}) or (\ref{eq:expdef}) for probabilistic sufficiency {\em
3675: exactly}, but only `nearly\footnote{We use `nearly' rather than
3676: `almost' since `almost' suggests things like `almost
3677: everywhere/almost surely/with probability 1'. Instead, `nearly' means, roughly speaking, `to within $O(1)$'.}'.
3678: \paragraph{Nearly Sufficient Statistics:}
3679: Intuitively, we may consider a probabilistic statistic $S$ to be
3680: nearly sufficient if (\ref{eq:indivdef}) or (\ref{eq:expdef}) holds to
3681: within some constant. For long sequences $x$, this constant will then
3682: be negligible compared to the two terms in
3683: (\ref{eq:indivdef}) or (\ref{eq:expdef}) which, for most practically
3684: interesting statistical model classes, typically grow linearly in the
3685: sequence length. But now we encounter a difficulty:
3686: \begin{quote}
3687: whereas
3688: (\ref{eq:indivdef}) and (\ref{eq:expdef}) are equivalent if they are
3689: required to hold exactly, they express something substantially
3690: different if they are only required to hold within a constant.
3691: \end{quote}
3692: Because of our observation
3693: above, when relating probabilistic and algorithmic statistics
3694: we have to be very careful about what
3695: happens if $n$ is allowed to change. Thus, we need to extend probabilistic and algorithmic statistics to strings of arbitrary length. This
3696: leads to
3697: the following generalized definition of a statistic:
3698: \begin{definition}
3699: \rm
3700: \label{def:seqstat}
3701: A {\em sequential statistic\/} is a function $S: \{0,1\}^* \rightarrow
3702: 2^{\{0,1 \}^*}$, such that for all $n$, all $x \in \{0,1\}^n$,
3703: (1) $S(x) \subseteq \{0,1\}^n$, and (2) $x \in S(x)$, and (3) for all $n$, the set
3704: $$
3705: \{ s \; | \; \text{There exists $x \in \{0,1\}^n$ with
3706: $S(x) = s $} \ \}
3707: $$
3708: is a partition of $\{0,1\}^n$.
3709: \end{definition}
3710: Algorithmic statistics are defined relative to
3711: individual $x$ of some length $n$.
3712: Probabilistic statistics are defined as functions, hence for all $x$
3713: of given length, but still relative to given length $n$. Such
3714: algorithmic and probabilistic statistics can be
3715: extended to each $n$ and each $x \in \{0,1\}^n$ in a
3716: variety of ways; the three conditions in Definition~\ref{def:seqstat}
3717: ensure that the extension is done in a reasonable way.
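
As a concrete instance (chosen by us for illustration), let $S(x)$ be the set of all strings of the same length as $x$ that contain the same number of $1$s as $x$. For $x \in \{0,1\}^n$ with $k$ ones,
\[
S(x) = \{ x' \in \{0,1\}^n : x' \text{ has exactly } k \text{ ones} \}, \qquad |S(x)| = {n \choose k}.
\]
Conditions (1) and (2) of Definition~\ref{def:seqstat} hold by construction, and (3) holds because the $n+1$ sets obtained for $k = 0, 1, \ldots, n$ are disjoint and cover $\{0,1\}^n$. By contrast, the map $x \mapsto \{x, 0^n\}$ satisfies (1) and (2) but violates (3) for $n \geq 1$, since the sets $\{x, 0^n\}$ for different $x$ all overlap in $0^n$.
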
3718: %
3719: %As usual, let $X$ be a random variable having outcomes, the data, in
3720: %${\cal X} = \{0,1\}^n$ with probability $P_{\theta} (X=x) = f_{\theta}
3721: %(x)$ out of a model (family of probability mass functions)
3722: %$\{f_\theta\}$. A function $S: {\cal X} \rightarrow {\cal S}$, for
3723: %some range ${\cal S}$ to be specified later, is statistic of the data.
3724: %\begin{definition}
3725: Now let $\{ f_{\theta} \}$ be a model class of sequential
3726: information sources (Section~\ref{sec:preliminaries}), i.e. a
3727: statistical model class defined for sequences of arbitrary length rather
3728: than just fixed $n$. As before, $f^{(n)}_{\theta}$ denotes the
3729: marginal distribution of $f_\theta$ on $\{0,1\}^n$.
3730: \begin{definition}
3731: \label{def:nearsuff}
3732: \rm
We call a sequential statistic $S$
3734: {\em nearly-sufficient for
3735: $\{f_\theta\}$ in the probabilistic-individual sense} if
3736: there exist functions $g^{(1)}, g^{(2)}, \ldots$ and a constant $c$ such that
3737: for all $\theta$,
3738: all $n$, every $x \in \{0,1\}^n$,
3739: \begin{equation}
3740: \label{eq:nindivdef}
\biggl| \log 1/f^{(n)}_{\theta}(x \mid S(x))
- \log 1/ g^{(n)}(x \mid S(x))
\biggr|
\leq c.
3746: \end{equation}
3747: We say $S$ is {\em nearly-sufficient for
3748: $\{f_\theta\}$ in the probabilistic-expectation sense\/} if
there exist functions $g^{(1)}, g^{(2)}, \ldots$ and a constant $c'$ such that
3750: for all $\theta$, all $n$,
3751: \begin{equation}
3752: \label{eq:nexpdef}
3753: \biggl| \sum_{x \in \{0,1\}^n}
3754: f^{(n)}_{\theta}(x) \bigl[ \log 1/ f^{(n)}_{\theta}(x \mid
3755: S(x)) - \log 1/ g^{(n)}(x| S(x))
3756: \bigr]\;
3757: \biggr|
3758: \leq c'.
3759: \end{equation}
3760: \end{definition}
3761: Inequality \eqref{eq:nindivdef} may
3762: be read as `(\ref{eq:indivdef}) holds within a constant', whereas
3763: (\ref{eq:nexpdef}) may be read as `(\ref{eq:expdef}) holds within a
3764: constant'.
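
To make Definition~\ref{def:nearsuff} concrete, consider (as an illustrative assumption) the Bernoulli model class, where $f^{(n)}_\theta(x) = \theta^{k}(1-\theta)^{n-k}$ for $x \in \{0,1\}^n$ with $k$ ones and $0 < \theta < 1$, together with the statistic $S(x)$ that collects all strings of length $n$ with the same number of ones as $x$. All ${n \choose k}$ elements of $S(x)$ have the same probability, so
\[
f^{(n)}_\theta(x \mid S(x)) = \frac{\theta^{k}(1-\theta)^{n-k}}{{n \choose k}\,\theta^{k}(1-\theta)^{n-k}} = \frac{1}{{n \choose k}}
\]
for every $\theta$. Taking $g^{(n)}(x \mid S(x)) = 1/{n \choose k}$, both (\ref{eq:nindivdef}) and (\ref{eq:nexpdef}) hold with $c = c' = 0$. The definition is of interest mainly in cases where the difference is bounded but nonzero.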
3765: %Note that we have to include $K(f_{\theta})$ and $K(p_1)$
3766: %in the definition - if, for each $n$, $f_{\theta}$ puts all its mass
3767: %on a sequence $x$ of length $n$ with $K(x) \approx n$, so that
3768: %$K(f_{\theta}) \approx n$, we cannot expect the algorithmic sufficient
3769: %statistic to be a probabilistic sufficient statistic.
3770:
3771: \begin{remark}
3772: \rm
3773: Whereas the
3774: individual-sequence definition (\ref{eq:indivdef}) and the
3775: expectation-definition (\ref{eq:expdef}) are equivalent if we
3776: require exact equality, they become quite different if we allow
3777: equality to within a constant as in Definition~\ref{def:nearsuff}. To
3778: see this, let $S$ be some sequential
3779: statistic such that for all large $n$, for some $\theta_1,
3780: \theta_2$, for some $x \in \{0,1\}^n$,
3781: $$f^{(n)}_{\theta_1}(x \mid S(x)) \gg
3782: f^{(n)}_{\theta_2}(x \mid S(x)),
3783: $$
while for all $x' \neq x$ of length $n$,
$f^{(n)}_{\theta_1}(x' \mid S(x')) \approx f^{(n)}_{\theta_2}(x' \mid S(x'))$. If
3786: $x$ has very small but nonzero probability according to some $\theta
3787: \in \Theta$, then with very small $f_{\theta}$-probability, the
3788: difference between the left-hand and right-hand side of
3789: (\ref{eq:indivdef}) is very large, and with large
3790: $f_{\theta}$-probability, the difference between the left-hand and
3791: right-hand side of (\ref{eq:indivdef}) is about $0$. Then $S$ will be
3792: nearly sufficient in expectation, but not in the individual sense.
3793: \end{remark}
3794: In the theorem below we focus on probabilistic statistics that are
3795: `nearly sufficient in an expected sense'. We connect these to
3796: algorithmic sequential statistics, defined as follows:
3797: \begin{definition}
3798: \rm
3799: %Let $\{f_\theta\}$ be a model of sequential information sources.
3800: % A sequential statistic $S$ is {\em sufficient in the probabilistic
3801: % sense relative to model $\{f_\theta \}$\/} if for all $n$, $S$
3802: % restricted to $\{0,1\}^n$ is a sufficient statistic for $\{ f^{(n)}
3803: % \}$.
3804: %
3805: A sequential statistic $S$ is {\em
3806: sufficient in the algorithmic sense\/}
3807: if there is a constant $c$
3808: such that for all $n$, all $x \in \{0,1\}^n$, the program generating
3809: $S(x)$ is an algorithmic sufficient statistic for $x$ (relative
3810: to constant $c$), i.e.
3811: \begin{equation}
3812: \label{eq:seqsuf}
3813: K(S(x)) + \log |S(x)| \leq K(x) + c.
3814: \end{equation}
3815: \end{definition}
3816: In Theorem~\ref{thm:wiske} we relate
3817: algorithmic to probabilistic sufficiency.
3818: %The meaning of the theorem
3819: %is made clear in Corollary~\ref{cor:wiske} further below.
3820: In the theorem, $S$
3821: represents a sequential statistic, $\{f_{\theta}\}$ is a model class of
3822: sequential information sources and $g^{(n)}$ is the conditional
3823: probability mass function arising from the uniform distribution:
3824: $$
g^{(n)}(x \mid s) = \begin{cases}
1/|\{ x' \in { \{0,1\}^{n}} :
S(x') = s \} | &
3828: \text{\ if $S(x) = s$\ } \\
3829: 0 & \text{\ otherwise.}
3830: \end{cases}
3831: $$
3832: %Note that in the statement of the theorem, $K(S)$ is the Kolmogorov
3833: %complexity of the {\em function\/} $S$, whereas
3834: %(in (\ref{eq:thmalg})), $K(S(x))$ is the
3835: %Kolmogorov complexity of the {\em set\/} $S(x)$. It is certainly
3836: %possible that $K(S)$ is finite yet $K(S(x))$ increases with $n$. This
3837: %can be the case, if, for example, $S$ counts the number of $1$s in
3838: %$x$. Then $K(S(x))$ can be as large as $O(\log n)$.
3839: \begin{theorem}[algorithmic sufficient statistic is probabilistic
3840: sufficient statistic]
3841: \label{thm:wiske}
3842: \rm
3843: Let $S$ be a sequential statistic that is sufficient in the
3844: algorithmic sense. Then for every $\theta$ with $K(f_\theta) < \infty$,
3845: there exists a constant $c$, such that for all $n$, inequality
3846: \eqref{eq:nexpdef} holds with $g^{(n)}$ the uniform distribution.
3847: Thus, if $\sup_{\theta \in
3848: {\mathbf \Theta}} K(f_\theta) < \infty$, then $S$ is a nearly-sufficient
3849: statistic for $\{ f_\theta \}$ in the probabilistic-expectation sense,
3850: with $g$ equal to the uniform distribution.
3851: %\begin{equation}
3852: %\label{eq:thmalg}
3853: %{\bf E}_{\theta} \bigl[ K(S(Y_1, \ldots, Y_n) \mid n) + \log
3854: %|S(Y_1, \ldots, Y_n)| \bigr] \leq {\bf E}_\theta
3855: %\bigl[ K(Y_1, \ldots, Y_n \mid n) \bigr]
3856: %+ C
3857: %\end{equation}
3858: %\begin{equation}
3859: %\label{eq:thmprob}
3860: %\biggl| \; \sum_{x \in \{0,1\}^n}
3861: %f^{(n)}_{\theta}(x) \bigl[ \log 1/ f^{(n)}_{\theta}(x \mid
3862: %S(x)) - \log 1/ g^{(n)}(x| S(x))
3863: %\bigr]\;
3864: %\biggr|
3865: %\leq c'.
3866: %\end{equation}
3867: \end{theorem}
3868: %\paragraph{Remark} It is straightforward to see that
3869: %\begin{enumerate}
3870: %\item Every probabilistic nearly-sufficient statistic in the
3871: % individual sense is a probabilistic nearly-sufficient statistic in
3872: % the expected sense, but not vice versa.
3873: %\item Every algorithmic sufficient statistic
3874: %is an algorithmic statistic sufficient `in
3875: %expectation' relative to {\em every\/} conceivable
3876: %model $\{ f_\theta \}$, but not vice versa.
3877: %\end{enumerate}
3878: \noindent
3879: \begin{proof}
The definition of algorithmic sufficiency (\ref{eq:seqsuf}) directly
implies that
3882: there exists a
3883: constant $c$ such that for all $\theta$, all $n$,
3884: \begin{equation}
3885: \label{eq:thmalg}
3886: \sum_{x \in \{0,1\}^n }
3887: f^{(n)}_{\theta}(x) \bigl[ K(S(x)) + \log
3888: |S(x)| \bigr] \leq \sum_{x \in \{0,1\}^n } f^{(n)}_{\theta}(x)
3889: K(x)
3890: + c.
3891: \end{equation}
3892: Now fix any $\theta$ with $K(f_\theta) < \infty$. It follows (by the
3893: same reasoning as in Theorem~\ref{theo.eq.entropy}) that
3894: for some $c_{\theta} \approx K(f_\theta)$,
3895: for all $n$,
3896: \begin{equation}
3897: \label{eq:dochter}
3898: 0 \leq \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x) K(x) -
3899: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)
3900: \log 1/ f_{\theta}(x) \leq c_{\theta}.
3901: \end{equation}
3902: Essentially, the left inequality follows by the information inequality
3903: (\ref{eq.ii}): no code can be more
3904: efficient in expectation under $f_\theta$ than the Shannon-Fano code
3905: with lengths $\log 1 /f_\theta(x)$; the right inequality follows
3906: because, since $K(f_\theta) < \infty$,
the Shannon-Fano code can be implemented by a
computer program of fixed size, independent of $n$.
3909: By (\ref{eq:dochter}),
3910: (\ref{eq:thmalg}) becomes: for all $n$,
3911: \begin{equation}
3912: \label{eq:thmalgb}
3913: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta}(x) \log 1/ f_{\theta}(x) \leq
3914: \sum_{x \in \{0,1\}^n }
3915: f^{(n)}_{\theta}(x) \bigl[ K(S(x)) + \log
3916: |S(x)| \bigr] \leq \sum_x f^{(n)}_{\theta}(x) \log 1/ f_{\theta}(x)
3917: + c_\theta.
3918: \end{equation}
3919: For $s \subseteq \{0,1\}^n$, we use the notation
3920: $f^{(n)}_\theta(s)$ according to \eqref{eq:overload}.
3921: Note that,
3922: by requirement (3) in the definition of sequential statistic,
3923: $$\sum_{s: \exists x \in \{0,1\}^n : S(x) = s} f^{(n)}_\theta(s) =
3924: 1,
3925: $$
3926: whence $f^{(n)}_\theta(s)$ is a probability mass function on
3927: ${\cal S}$, the set of values the statistic $S$ can take on sequences
3928: of length $n$. Thus, we get, once again by the information inequality (\ref{eq.ii}),
3929: \begin{equation}
3930: \label{eq:zoon}
3931: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x) K(S(x)) \geq
3932: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)
3933: \log 1/ f^{(n)}_{\theta}(S(x)).
3934: \end{equation}
3935: Now note that for all $n$,
3936: \begin{equation}
3937: \label{eq:extra}
3938: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)
3939: \bigl[ \log 1/ f^{(n)}_{\theta}(S(x)) +
3940: %\sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)
3941: \log 1/ f^{(n)}_{\theta}(x \mid S(x))
3942: \bigr]
3943: = \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta}(x) \log 1/ f_{\theta}(x).
3944: \end{equation}
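Indeed (a one-line justification, assuming as in \eqref{eq:overload} that
$f^{(n)}_\theta(s)$ denotes the total probability of the set $s$, that
$x \in S(x)$, and that $f^{(n)}_\theta(x \mid S(x))$ is the corresponding
conditional probability), for every $x$ we have
$$
f^{(n)}_\theta(x) = f^{(n)}_\theta(S(x)) \cdot f^{(n)}_\theta(x \mid S(x)),
$$
and taking $\log 1/(\cdot)$ of both sides and averaging over $x$ with
weights $f^{(n)}_\theta(x)$ gives (\ref{eq:extra}).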
3945: Consider the two-part code which encodes $x$ by first encoding $S(x)$
3946: using $\log 1 /f^{(n)}_\theta(S(x))$ bits, and then encoding $x$ using
$\log |S(x) |$ bits. By the
information inequality (\ref{eq.ii}), this code cannot be more efficient
in expectation than the Shannon-Fano code with lengths $\log 1/ f_{\theta}(x)$, so
that it follows from (\ref{eq:extra}) that, for all $n$,
3951: \begin{equation}
3952: \label{eq:kind}
3953: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x) \log | S(x) | \geq
3954: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)
3955: \log 1/ f^{(n)}_{\theta}(x \mid S(x)).
3956: \end{equation}
3957: Now defining
3958: \begin{eqnarray}
3959: u & = & \sum_{x \in \{0,1\}^n }
3960: f^{(n)}_{\theta}(x) K(S(x)) \nonumber \\
3961: v & = & \sum_{x \in \{0,1\}^n }
3962: f^{(n)}_{\theta}(x) \log |S(x)| \nonumber \\
3963: u' & = & \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)
3964: \log 1/ f^{(n)}_{\theta}(S(x))
3965: \nonumber \\
3966: v' & = & \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)
3967: \log 1/ f^{(n)}_{\theta}(x \mid S(x)) \nonumber \\
3968: w & = & \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta}(x) \log 1/
3969: f_{\theta}(x),
3970: \nonumber
3971: \end{eqnarray}
3972: we
3973: find that (\ref{eq:thmalgb}), (\ref{eq:extra}),
3974: (\ref{eq:zoon}) and
3975: (\ref{eq:kind}) express, respectively,
3976: that $u + v \eqa w$, $u' + v' = w$, $u \geq u'$,
3977: $v \geq v'$. It follows that $v \eqa v'$, so that
3978: (\ref{eq:kind}) must actually hold with equality up to a
constant. That is, there exists a $c'$ such that for all $n$,
3980: \begin{equation}
3981: \label{eq:kindb}
3982: \bigl|
3983: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x) \log | S(x) | -
3984: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)
3985: \log 1/ f^{(n)}_{\theta}(x \mid S(x)) \bigr| \leq c'.
3986: \end{equation}
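To spell out this step in the notation $u, v, u', v', w$ introduced above:
by (\ref{eq:thmalgb}) we have $u + v \leq w + c_\theta$, and by
(\ref{eq:extra}) $w = u' + v'$, so that
$$
0 \leq v - v' = (u + v) - w - (u - u') \leq c_\theta - (u - u') \leq c_\theta,
$$
where the first inequality is (\ref{eq:kind}) and the last one uses
$u \geq u'$ from (\ref{eq:zoon}). Hence one may take $c' = c_\theta$,
a constant that depends on $\theta$ but not on $n$.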
3987: The result now follows upon noting that (\ref{eq:kindb}) is just
3988: (\ref{eq:nexpdef}) with $g^{(n)}$ the uniform distribution.
3989: \end{proof}
3990: \section{Rate Distortion and Structure Function}
3991: \label{sect.rdsf}
3992: We continue the discussion about meaningful information of
3993: Section~\ref{sect.meaning}. This time we a priori restrict the number
3994: of bits allowed for conveying the essence of the information. In the
3995: probabilistic situation this takes the form of allowing only a
3996: ``rate'' of $R$ bits to communicate as well as possible, on average,
3997: the outcome of a random variable $X$, while the set ${\cal X}$ of
3998: outcomes has cardinality possibly exceeding $2^R$. Clearly, not all
3999: outcomes can be communicated without information loss, the average of
4000: which is expressed by the ``distortion''. This leads to the so-called
4001: ``rate--distortion'' theory. In the algorithmic setting the
4002: corresponding idea is to consider a set of models from which to choose
4003: a single model that expresses the ``meaning'' of the given individual
4004: data $x$ as well as possible. If we allow only $R$ bits to express the
4005: model, while possibly the Kolmogorov complexity $K(x) > R$, we suffer
4006: information loss---a situation that arises for example with ``lossy''
4007: compression. In the latter situation, the data cannot be perfectly
reconstructed from the model, and the question arises to what extent the
4009: model can capture the meaning present in the specific data $x$. This
4010: leads to the so-called ``structure function'' theory.
4011:
4012: The limit of $R$ bits to express
4013: a model to capture the most meaningful information
4014: in the data is an individual version of the average notion
4015: of ``rate''. The remaining less meaningful information in the data
4016: is the individual version of the average-case notion of ``distortion''.
4017: If the $R$ bits are sufficient to express all meaning in the data
4018: then the resulting model is called a ``sufficient statistic'',
4019: in the sense introduced above. The remaining information in the data
4020: is then purely accidental, random, noise.
4021: For example, a sequence of
4022: outcomes of $n$ tosses of a coin with computable bias $p$, typically
4023: has a sufficient statistic
4024: of $K(p)$ bits, while the remaining random information is typically
4025: at least about $pn-K(p)$ bits (up to an $O(\sqrt{n})$ additive term).
4026: %One connection with Rate-Distortion theory with $R < K(p)$ to set
4027: %the distortion at $K(p)-R$, that is, the number of meaningful bits
4028: %that cannot be included. The meaningless bits are now discounted
4029: %altogether.
4030: \subsection{Rate Distortion}
4031: \label{sec:ratedistortion}
4032: %\label{sec:basics}
4033: Initially, Shannon \cite{Sh48} introduced rate-distortion
4034: as follows: ``Practically, we are not interested in exact transmission
4035: when we have a continuous source, but only in transmission to
4036: within a given tolerance. The question is, can we assign a definite
4037: rate to a continuous source when we require only a certain fidelity
4038: of recovery, measured in a suitable way.'' Later, in \cite{Sh59}
4039: he applied this idea to lossy data compression
4040: of discrete memoryless sources---our topic below.
4041: As before, we consider a situation in which sender $A$ wants to
4042: communicate the outcome of random variable $X$ to receiver $B$.
4043: Let $X$ take values in some set ${\cal X}$, and the
4044: distribution $P$ of $X$ be known to both $A$ and $B$. The change
4045: is that now $A$ is only
4046: allowed to use a finite number, say $R$ bits, to communicate, so that
4047: $A$ can only send $2^R$ different messages. Let us denote by ${Y}$ the
4048: encoding function used by $A$. This ${Y}$ maps ${\cal X}$
4049: onto some set ${\cal Y}$.
4050: %Let us denote by $\ddot{Y}$ the range of ${\cal Y}$.
4051: %We require that $\ddot{\cal Y}$ satisfies $|\ddot{\cal Y}| \leq 2^R$.
4052: We require that $|{\cal Y}| \leq 2^R$.
4053: %$D$ has to map ${\cal Y}$ back to ${\cal X}$.
4054: If $|{\cal X}| > 2^R$ or if ${\cal X}$ is continuous-valued,
4055: then necessarily some information
4056: is lost during the communication. There is no decoding function
4057: $D: {\cal Y} \rightarrow {\cal X}$ such that
4058: $D({Y}(x)) = x$ for all $x$. Thus, $A$ and $B$ cannot ensure that
4059: $x$ can always be reconstructed. As the next best thing, they may
4060: agree on a code such that for all $x$, the value ${Y}(x)$ contains as much
4061: useful information about $x$ as is possible---what exactly `useful'
4062: means depends on the situation at hand; examples are provided below.
4063: An easy example would be that ${Y}(x)$ is a finite list of elements,
4064: one of which is $x$.
4065: We assume that the `goodness'
of ${Y}(x)$ is gauged by a {\em distortion function\/} $d:
4067: {\cal X} \times {\cal Y} \rightarrow [0, \infty]$. This distortion
4068: function may be any nonnegative function
4069: that is appropriate to the situation at hand.
4070: In the example above it could be the logarithm of the number
4071: of elements in the list ${Y}(x)$.
4072: Examples of some common distortion functions are the
4073: Hamming distance and the squared Euclidean distance.
We can view $Y$ as
a random variable on the space ${\cal Y}$,
4076: a coarse version of the random variable $X$, defined as taking
4077: value $Y=y$ if $X=x$ with $Y(x)=y$.
4078: Write $f(x) = P(X=x)$ and $g(y)=\sum_{x: Y(x)=y} P(X= x)$.
4079: Once the distortion function
4080: $d$ is fixed, we define the {\em expected \/} distortion by
4081: \begin{align}
4082: \label{eq:distortion}
4083: {\bf E} [ d(X,{Y}) ] & = \sum_{x \in {\cal X}} f(x) d(x,{Y}(x)) \\
4084: \nonumber
& = \sum_{y \in {\cal Y}} g(y) \sum_{x: Y(x)=y} (f(x)/g(y)) \, d(x,y) .
4086: \end{align}
4087: If $X$ is a continuous random variable, the sum should be
4088: replaced by an integral.
4089: %This motivates the abbreviation
4090: %``${\bf E} [ d(X,{Y}) ]$'' for ``${\bf E} [ d(X,{Y}(X)) ]$''.
4091: \begin{example}
4092: \label{ex:classicrd}
4093: \rm
4094: In most standard applications of rate distortion theory,
4095: the goal is to compress $x$ in a `lossy' way,
4096: such that $x$ can be reconstructed `as well as possible' from
4097: ${Y}(x)$. In that case, ${\cal Y} \subseteq {\cal X}$ and
4098: %$d$ is a really a function $d: {\cal X} \times {\cal X}
4099: %\rightarrow {\mathbb R}$.
4100: writing $\hat{x} = Y(x)$,
4101: the value $d(x,\hat{x})$ measures the similarity
between $x$ and $\hat{x}$. For example, if ${\cal X}$
is the set of real numbers and ${\cal Y}$ is the set
4104: of integers, the squared difference
4105: $d(x,\hat{x}) = (x- \hat{x})^2$ is a
4106: viable distortion function.
4107: We may interpret $\hat{x}$ as an
4108: estimate of $x$, and ${\cal Y}$ as the set of values it can take. The
4109: reason we use the notation ${Y}$ rather than $\hat{X}$ (as in,
4110: for example, \cite{CT91}) is that further below,
4111: we mostly concentrate on slightly non-standard
4112: applications where ${\cal Y}$ should {\em not\/}
4113: be interpreted as a subset of ${\cal X}$.
4114: \end{example}
We want to determine the optimal code $Y$ for communication between $A$ and
$B$ under the constraint that there are no more than $2^R$ messages.
4117: That is, we look for the encoding function ${Y}$ that
4118: minimizes the expected distortion, under the constraint that
4119: $|{\cal Y}| \leq 2^R$.
4120: %We call the function mapping $R$ to the
4121: %minimimum expected distortion that can be achieved with $R$ bits the
4122: %{\em distortion-rate function\/} $D(R)$. Formally,
4123: %\begin{equation}
4124: %\label{eq:rd}
4125: %D(R) = \inf_{\hat{X}: {\cal X} \rightarrow \hat{\cal X} \; ; \;
4126: %|\hat{\cal X}| \leq 2^R }
4127: %{\bf E} [ d(X,\hat{X})].
4128: %\end{equation}
4129: Usually, the minimum
4130: achievable expected distortion
4131: is nonincreasing as a function of increasing $R$.
4132: \begin{example}
4133: \label{ex:gauss}
4134: \rm Suppose $X$ is a real-valued, normally (Gaussian) distributed
random variable with mean ${\bf E}[X] = 0$ and variance ${\bf E} [ (X
- {\bf E} [X])^2 ] = \sigma^2$. Let us use the squared Euclidean
4137: distance $d(x,y) = (x - y)^2$ as a distortion measure.
4138: If $A$ is allowed to use $R$ bits, then ${\cal Y}$ can have no
4139: more than $2^R$ elements, in contrast to ${\cal X}$ that is
4140: uncountably infinite. We should choose ${\cal Y}$ and the
4141: function ${Y}$ such that (\ref{eq:distortion}) is minimized.
4142: Suppose first $R=1$. Then the optimal ${Y}$ turns out to be
4143: $$
4144: {Y}(x) =
\begin{cases} \sqrt{\frac{2}{\pi}} \sigma & \mbox{\ if $x \geq 0 $} \\
- \sqrt{\frac{2}{\pi}} \sigma & \mbox{\ if $x < 0 $}.
4147: \end{cases}
4148: $$
4149: Thus, the domain ${\cal X}$ is partitioned into two regions, one
4150: corresponding to $x \geq 0$, and one to $x < 0$. By the symmetry of
4151: the Gaussian distribution around $0$, it should be clear that this is
the best one can do. Within each of the two regions, one picks a
4153: `representative point' so as to minimize (\ref{eq:distortion}).
4154: This mapping allows $B$ to estimate $x$ as well as possible.
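As a short check of where the representative points come from (writing
$\phi_\sigma$ for the density of $X$, a symbol introduced only for this
remark): under squared error the best single representative of a region
is the conditional mean of $X$ given that region, and for the region
$x \geq 0$ this is
$$
{\bf E}[X \mid X \geq 0] = 2 \int_0^\infty x \, \phi_\sigma(x) \, dx
= \frac{2}{\sigma \sqrt{2 \pi}} \int_0^\infty x \, e^{-x^2/(2\sigma^2)} \, dx
= \sqrt{\frac{2}{\pi}} \, \sigma .
$$
The resulting expected distortion is
${\bf E}[X^2] - ({\bf E}[X \mid X \geq 0])^2 = (1 - 2/\pi) \sigma^2
\approx 0.36 \, \sigma^2$.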
4155:
4156: Similarly, if $R=2$, then ${\cal X}$ should be partitioned into 4
regions, each of which is to be represented by a single point such
4158: that (\ref{eq:distortion}) is minimized. An extreme case is $R= 0$:
4159: how can $B$ estimate $X$ if it is always given the same information?
4160: This means that ${Y}(x)$ must take the same value for
4161: all $x$. The expected distortion (\ref{eq:distortion}) is then
4162: minimized if $Y(x) \equiv 0$, the mean of $X$, giving
4163: distortion equal to $\sigma^2$.
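To verify this directly, take any constant estimate $Y(x) \equiv c$
(the symbol $c$ is used only in this remark). Since ${\bf E}[X] = 0$,
$$
{\bf E} [ (X - c)^2 ] = {\bf E}[X^2] - 2 c \, {\bf E}[X] + c^2
= \sigma^2 + c^2 ,
$$
which is smallest at $c = 0$, where it equals $\sigma^2$.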
4164: \end{example}
4165: In general, there is no need for the space of estimates ${\cal Y}$
4166: to be a subset of ${\cal X}$. We may, for example, also lossily encode
4167: or `estimate' the actual value of $x$ by specifying a set in which $x$
4168: must lie (Section~\ref{sec:meaningful})
4169: or a probability distribution (see below) on ${\cal X}$.
4170: % Thus, we assume
4171: %there is some set ${\cal Y}$ of possible messages. Sender maps outcome
4172: %$x \in {\cal X}$ to some $h \in {\cal H}$ based on an `encoding'
4173: %function ${H}: {\cal X} \rightarrow {\cal H}$. Receiver than
4174: %incurs some `loss' given by a distortion function $d: {\cal X} \times
4175: %{\cal H} \rightarrow {\mathbb R}$. The generalized definition of
4176: %distortion-rate function now becomes
4177: %\begin{equation}
4178: %\label{eq:rdb}
4179: %D(R) := \inf_{{H}: {\cal X} \rightarrow {\cal H} \; ; \;
4180: %|{\cal H}| \leq 2^R }
4181: %{\bf E} [ d(X,{H})].
4182: %\end{equation}
4183: %In Example~\ref{ex:gauss}, we had ${\cal H} = {\cal X}$. In
4184: %Example~\ref{ex:reconcile}, ${\cal H}$ will be the set of
4185: %distributions on ${\cal X}$.
4186: \begin{example}
4187: \label{ex:reconcile}
4188: \rm Suppose receiver $B$ wants to estimate the actual $x$ by a probability
4189: distribution $P$ on ${\cal X}$. Thus, if $R$ bits are allowed to be
4190: used, one of $2^{R}$ different distributions on ${\cal X}$ can be sent to
4191: receiver. The most accurate that can be done is to partition ${\cal
4192: X}$ into $2^R$ subsets ${\cal A}_1, \ldots, {\cal A}_{2^R}$.
4193: Relative to any such partition, we introduce a new random variable ${Y}$
4194: and abbreviate the event $x \in {\cal A}_y$ to ${Y}=y$.
4195: Sender observes that ${Y} = y$ for some $y \in
4196: {\cal Y} = \{1, \ldots, 2^R\}$
4197: and passes this
4198: information on to receiver. The information $y$ actually means that $X$
4199: is now distributed according to the conditional distribution $P(X=x
4200: \mid x \in {\cal A}_y) = P(X= x \mid {Y} = y)$.
4201: %Thus, ${\cal Y} = \{1, \ldots, 2^{R} \}$.
4202:
4203: It is now natural to measure the quality of the transmitted distribution
4204: $P(X=x \mid Y = y)$ by its conditional
4205: entropy, i.e. the expected
4206: additional number of bits that sender has to transmit before
4207: receiver knows the value of $x$ with certainty. This can be achieved
4208: by taking
4209: \begin{equation}
4210: d(x,y) =
4211: \log 1/ P(X=x \mid {Y}= y),
4212: \end{equation}
4213: which we abbreviate to $d(x,y) = \log 1/ f(x|y)$.
4214: In words,
4215: the distortion function is the Shannon-Fano code length for the
4216: communicated distribution. The expected distortion then
4217: becomes equal to the conditional entropy $H(X \mid Y)$ as defined in
4218: Section~\ref{sec:probmutual} (rewrite according to
4219: \eqref{eq:distortion}, $f(x|y)=f(x)/g(y)$ for
4220: $P(X=x| Y(x)=y)$ and $g(y)$ defined earlier,
4221: and the definition of conditional probability):
4222: \begin{align}
4223: \label{eq:entdist}
4224: {\bf E}
4225: [d(X,{Y})]
4226: %= \sum_{x \in {\cal X}}
4227: %p(x) [- \log p(x|{Y}(x))] = \sum_{x \in {\cal X}}
4228: %p(x) [- \log P(X=x \mid X \in {\cal A}_y)]
4229: & = \sum_{y \in {\cal Y}} g(y) \sum_{x: Y(x)=y} (f(x)/g(y)) d(x,y)\\
4230: \nonumber
4231: &= \sum_{y \in {\cal Y}} g(y) \sum_{x: Y(x)=y} f(x|y) \log 1/f(x|y)\\
4232: \nonumber
4233: &= H(X|{Y}).
4234: \end{align}
4235: How is this related to lossless compression? Suppose
4236: for example that $R= 1$. Then
4237: the optimal distortion is achieved by
4238: partitioning ${\cal X}$ into two sets ${\cal A}_1, {\cal A}_2$ in the
4239: most `informative' possible way, so that the conditional entropy
4240: $$H(X|Y) = \sum_{y=1,2} P(Y=y) H(X| Y=y)$$
4241: is minimized. If
4242: $Y$ itself is encoded with the Shannon-Fano code, then $H(Y)$ bits are
4243: needed to communicate $Y$. Rewriting
4244: $H(X|Y)= \sum_{y \in {\cal Y}} P(Y=y) H(X|Y=y)$ and
4245: $H(X|Y=y)= \sum_{x:Y(x)=y} f(x|y) \log 1/f(x|y)$ with $f(x|y)=P(X=x)/P(Y=y)$
4246: and rearranging, shows that for all such partitions of ${\cal
4247: X}$ into $|{\cal Y}|$ subsets defined by $Y:{\cal X} \rightarrow {\cal Y}$
4248: we have
4249: \begin{equation}
4250: \label{eq:snavel}
4251: H(X|Y) + H(Y) = H(X).
4252: \end{equation}
The minimum distortion at a given rate
is obtained by choosing the function
$Y$ that minimizes $H(X|Y)$.
4256: By (\ref{eq:snavel}) this is also the $Y$ maximizing $H(Y)$. Thus, the
4257: average total number of bits we need to send our message in this way
4258: is still equal to $H(X)$---the more we save in the second part, the
4259: more we pay in the first part.
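In more detail (a worked check that uses only the definitions of $f$, $g$
and $f(x|y)$ given above): since $f(x|y) = f(x)/g(y)$ whenever $Y(x) = y$,
\begin{align*}
H(X|Y) & = \sum_{y \in {\cal Y}} \sum_{x: Y(x)=y} f(x)
\bigl[ \log 1/f(x) - \log 1/g(y) \bigr] \\
& = \sum_{x \in {\cal X}} f(x) \log 1/f(x)
- \sum_{y \in {\cal Y}} g(y) \log 1/g(y) = H(X) - H(Y).
\end{align*}
As a small illustrative instance (chosen here, not taken from the
discussion above), let $X$ be uniform on $\{1,2,3,4\}$ and $R = 1$.
The partition $\{1,2\}, \{3,4\}$ gives $H(Y) = 1$ and $H(X|Y) = 1$,
whereas the partition $\{1\}, \{2,3,4\}$ gives
$H(Y) = H(\frac{1}{4},\frac{3}{4}) \approx 0.81$ and
$H(X|Y) = \frac{3}{4} \log 3 \approx 1.19$; in both cases the two parts
sum to $H(X) = 2$.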
4260: \end{example}
4261: %The minimum achieveable distortion $D(r)$ for $R=r$
4262: %is given by
4263: %\begin{equation}
4264: %\label{eq:drbasic}
4265: %D(R) = \min_{Y :
4266: %{\cal X} \rightarrow {\cal Y}, | {\cal Y} | < 2^R} H(X|Y).
4267: %\end{equation}
4268: \paragraph{Rate Distortion and
4269: Mutual Information:}
4270: Already in his 1948 paper, Shannon established a
4271: deep relation between mutual information and minimum achievable
4272: distortion for (essentially) {\em arbitrary\/} distortion functions.
4273: The relation is summarized in Theorem~\ref{thm:rd} below. To prepare
4274: for the theorem, we need to slightly extend our setting by considering
4275: {\em independent repetitions of the same scenario}. This can be
4276: motivated in various ways such as (a) it often corresponds to the
4277: situation we are trying to model; (b) it allows us to consider non-integer
4278: rates $R$, and (c) it greatly simplifies the mathematical analysis.
4279:
4280: \begin{definition}
4281: \rm
4282: Let ${\cal X}, {\cal Y}$ be two sample spaces.
4283: The distortion of $y \in {\cal Y}$ with respect
4284: to $x \in {\cal X}$ is defined
4285: by a nonnegative real-valued function $d(x,y)$ as above.
4286: %The idea is that the
4287: %distortion measures the loss of information in an encoding $y$
4288: %of the original $x$ with respect to $x$.
4289: We extend the definition to sequences:
4290: the distortion of $(y_1, \ldots , y_n)$
4291: with respect to $(x_1, \ldots , x_n)$ is
4292: \begin{equation}
4293: \label{eq:avdist}
4294: d((x_1,\ldots,x_n),(y_1, \ldots, y_n)) := \frac{1}{n}
4295: \sum_{i=1}^n d(x_i,y_i).
4296: \end{equation}
4297: \end{definition}
4298: Let $X_1, \ldots, X_n$ be $n$ independent identically
4299: distributed random variables on outcome space ${\cal X}$.
4300: Let ${\cal Y}$ be a set of code words.
4301: We want to find a sequence of functions $Y_1, \ldots , Y_n:{\cal X}
4302: \rightarrow {\cal Y}$ so that the message $(Y_1(x_1), \ldots,
4303: Y_n (x_n)) \in {\cal Y}^n$ gives as much expected
4304: information about the sequence of outcomes $(X_1=x_1,
4305: \ldots, X_n=x_n)$ as is possible, under the constraint that the message
4306: takes at most $R \cdot n$ bits (so that $R$ bits are allowed on
4307: average per outcome of $X_i$).
4308: Instead of $Y_1, \ldots , Y_n$ above write
4309: $Z_n: {\cal X}^n \rightarrow {\cal Y}^n$.
4310: The {\em expected distortion} ${\bf E}[d(X^n,Z_n)]$ for $Z_n$ is
4311: \begin{equation}
4312: {\bf E}[d(X^n,Z_n)] = \sum_{(x_1, \ldots , x_n) \in {\cal X}^n}
4313: P(X^n = (x_1, \ldots , x_n)) \cdot \frac{1}{n}
4314: \sum_{i=1}^n d(x_i,Y_i(x_i)).
4315: \end{equation}
4316: Consider functions $Z_n$
4317: with range ${\cal Z}_n \subseteq {\cal Y}^n$
4318: satisfying $|{\cal Z}_n| \leq 2^{nR}$.
For $n \geq 1$ random variables, let
a choice $Y_1, \ldots , Y_n$
minimize the expected distortion
4322: under these constraints, and let the corresponding value $D^*_n (R)$ of
4323: the expected distortion be defined by
4324: \begin{equation}
D^*_n (R) = \min_{Z_n: |{\cal Z}_n| \leq 2^{nR}} {\bf E}[d(X^n,Z_n)] .
4326: \end{equation}
4327:
4328: \begin{lemma}
4329: For every distortion measure, and all $R,n,m \geq 1$,
4330: $(n+m)D^*_{n+m} (R) \leq n D^*_{n}(R)+mD^*_m(R)$.
4331: \end{lemma}
4332: \begin{proof}
4333: Let $Y_1, \ldots , Y_{n}$ achieve $D^*_{n}(R)$
4334: and $Y'_1, \ldots , Y'_{m}$ achieve $D^*_{m}(R)$.
4335: Then, $Y_1, \ldots , Y_{n},Y'_1, \ldots , Y'_m$
4336: achieves $(nD^*_n(R) + mD^*_m(R))/(n+m)$. This is an upper
4337: bound on the minimal possible value $D^*_{n+m} (R)$ for $n+m$ random variables.
4338: \end{proof}
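The two ingredients of this argument can be made explicit as follows
(writing ${\cal Z}_n$ and ${\cal Z}'_m$ for the ranges of the two codes;
the primed symbol is introduced only for this remark). First, the
concatenated code is admissible at rate $R$, because its range has size at most
$$
|{\cal Z}_n| \cdot |{\cal Z}'_m| \leq 2^{nR} \cdot 2^{mR} = 2^{(n+m)R} .
$$
Second, by (\ref{eq:avdist}) the distortion of a sequence of length $n+m$
is $1/(n+m)$ times the sum of the per-symbol distortions, and the
expectation of that sum splits into $n D^*_n(R)$ for the first $n$ symbols
and $m D^*_m(R)$ for the last $m$.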
4339:
4340: It follows that for all $R,n \geq 1$ we have $D^*_{2n}(R) \leq
4341: D^*_n(R)$.
4342: The inequality is
4343: typically strict; \cite{CT91} gives an
4344: intuitive explanation of this phenomenon.
4345: For fixed $R$ the value of $D^*_1(R)$ is fixed and it is finite.
Since also $D^*_n(R)$ is necessarily
nonnegative for all $n$, we have established the existence
4348: of the limit
4349: \begin{equation}
4350: \label{eq:dr}
D^*(R) = \liminf_{n \rightarrow \infty} D^*_n (R).
4352: \end{equation}
The value of $D^*(R)$ is the minimum achievable distortion
at rate (number of bits/outcome) $R$. Therefore, $D^*(\cdot)$
is called the
{\em distortion-rate function}.
4357: In our Gaussian
4358: Example~\ref{ex:gauss}, $D^*(R)$ quickly converges to $0$ with
4359: increasing $R$. It turns out that for general $d$, when
4360: we view $D^*(R)$ as a
4361: function of $R\in [0,\infty)$, it is {\em convex and nonincreasing}.
4362: \begin{example}
4363: \label{ex:ber}
4364: \rm
4365: Let ${\cal X} = \{0,1\}$, and let $P(X= 1) = p$. Let ${\cal Y} =
4366: \{0,1\}$ and take the Shannon-Fano distortion
4367: function $d(x,y) = \log 1/ f(x \mid y)$ with notation as in Example~\ref{ex:reconcile}.
4368: Let $Y$ be a function that
4369: achieves the minimum expected Shannon-Fano
4370: distortion $D^*_1(R)$. As usual we write $Y$ for the random variable $Y(x)$
4371: induced by $X$. Then, $D^*_1(R)={\bf E}[d(X,Y)] =
4372: {\bf E} [ \log 1/ f(X|Y)] = H(X|Y)$.
4373: At rate $R = 1$, we
4374: can set $Y= X$ and the minimum achievable distortion is
4375: given by $D^*_1(1)=H(X|X) =
4376: 0$. Now consider some rate $R$ with $0 < R < 1$, say $R= \frac{1}{2}$.
Since we are now
forced to use no more than $2^R < 2$
messages in communicating, only a fixed message can be sent, no
4380: matter what outcome of the random variable $X$ is realized.
4381: This means that no communication is
4382: possible at all and the minimum achievable distortion is
4383: $D^*_1(\frac{1}{2}) =H(X) =
4384: H(p,1-p)$. But clearly, if we consider $n$ repetitions of the same
4385: scenario and are allowed to send a message out of
4386: $\lfloor 2^{nR} \rfloor$
4387: candidates, then some useful information can be communicated after
4388: all, even if $R < 1$.
4389: In Example~\ref{ex:berb} we will show that
4390: if $R > H(p,1-p)$, then $D^*(R) = 0$; if $R \leq
4391: H(p,1-p)$, then $D^*(R) = H(p,1-p) - R$.
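For a concrete feel, take $p = \frac{1}{2}$ (a value chosen here purely as
an illustration), so that $H(p,1-p) = 1$. At rate $R = \frac{1}{2}$ the
two quantities discussed above are then
$$
D^*_1(\frac{1}{2}) = H(\frac{1}{2},\frac{1}{2}) = 1,
\qquad
D^*(\frac{1}{2}) = H(\frac{1}{2},\frac{1}{2}) - \frac{1}{2} = \frac{1}{2},
$$
so coding long blocks of outcomes jointly halves the distortion per
outcome at this rate.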
4392: \end{example}
4393: Up to now we studied the minimum achievable distortion $D$ as a function of
4394: the rate $R$.
4395: For technical reasons, it is often more convenient to consider the
4396: minimum achievable rate $R$ as a function of the distortion $D$.
4397: This is the more celebrated
4398: version, the {\em rate-distortion function} $R^*(D)$.
4399: Because $D^*(R)$ is convex
4400: and nonincreasing, $R^*(D): [ 0, \infty) \rightarrow [0,\infty]$ is
4401: just the {\em inverse\/} of the function $D^*(R)$.
4402:
4403: %We say that a function
4404: %${X}^n: {\cal X}^n \rightarrow {\cal X}^n$
4405: %{\em achieves rate $R$ and distortion $d^*$\/} iff
4406: %$|{\cal X}^n| \leq 2^{nR}$ and
4407: %$ {\bf E} [d((x_1,\ldots,
4408: %x_n),{X}^n(x_1,\ldots,x_n))] \leq d^*$. By definition,
4409: %\begin{equation}
4410: %R(d^*) = \lim_{n \rightarrow \infty} \inf \{ R: \ \text{there exists
4411: % ${X}^n$ achieving rate $R$ and distortion $d^*$} \ \}
4412: %\end{equation}
4413: It turns out to be possible to relate distortion to the Shannon mutual
4414: information.
4415: This remarkable fact, which Shannon proved already in
4416: \cite{Sh48,Sh59},
4417: illustrates the fundamental nature of Shannon's concepts.
4418: %The setting is more generalized than in the previous simplified discussion
4419: %of the basics of distortion theory.
4420: Up till now, we only considered
4421: {\em deterministic\/} encodings $Y: {\cal X} \rightarrow {\cal Y}$.
4422: But it is hard to analyze the rate-distortion, and distortion-rate,
4423: functions in this setting. It turns out to be advantageous to follow
4424: an indirect route by bringing
4425: information-theoretic techniques into play.
4426: To this end, we generalize the setting to {\em
4427: randomized\/} encodings. That is, upon observing $X=x$ with probability
4428: $f(x)$, the
4429: sender may use a randomizing device (e.g. a coin) to decide which
code word $y \in {\cal Y}$ he is going to send to the receiver. A
4431: randomized encoding $Y$ thus maps each $x \in {\cal X}$
4432: to $y \in {\cal Y}$
4433: with probability
4434: $g_x(y)$, denoted in conditional probability format
4435: as $g(y|x)$. Altogether we deal
4436: with a joint distribution $g(x,y)=f(x)g(y|x)$ on
4437: the joint sample
4438: space ${\cal X} \times {\cal Y}$. (In the deterministic case we have
4439: $g(Y(x) \mid x)=1$ for the given function $Y: {\cal X} \rightarrow {\cal Y}$.)
4440:
4441: %The natural question arising is:
4442: %What is the minimum possible rate that can be achieved
4443: %by a randomized code achieving distortion at most $D$? The latter
4444: %condition is formally expressed as follows. Again, let $X$ be
4445: %a random variable over outcome space ${\cal X}$, with probability $p(x)$
4446: %that $X=x$. Let $Y$ be a random variable over the set of code words
4447: %${\cal Y}$, with probability $r(y)$ that $Y=y$. There is (possibly or rather,
4448: %commonly) a dependence between the random variables $X$ and $Y$,
4449: %given by the joint probability $q(x,y)$ with
4450: %$p(x)= \sum_{y \in {\cal Y}} q(x,y)$ and
4451: %$r(x)= \sum_{x \in {\cal X}} q(x,y)$.
4452: %The marginal probability $q(y \mid x)$ is that of encoding outcome $X=x$
4453: %by code word $y$.
4454: \begin{definition}
4455: \rm
4456: Let $X$ and $Y$ be joint random variables as above, and let $d(x,y)$
4457: be a distortion measure.
4458: The {\em expected distortion} $D(X,Y)$ of $Y$ with respect to $X$ is
4459: defined by
4460: \begin{equation}\label{eq.DXY}
4461: D(X,Y)= %{\bf E}_{g} [d(X,Y)] =
4462: \sum_{x \in {\cal X}, y \in {\cal Y}} g(x,y) d(x,y).
4463: \end{equation}
4464: \end{definition}
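As a consistency check, consider a deterministic encoding
$Y: {\cal X} \rightarrow {\cal Y}$. Then $g(x,y) = f(x)$ if $y = Y(x)$
and $g(x,y) = 0$ otherwise, so that (\ref{eq.DXY}) reduces to
$$
D(X,Y) = \sum_{x \in {\cal X}} f(x) \, d(x,Y(x)),
$$
which is the expected distortion (\ref{eq:distortion}) of the
deterministic setting.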
4465: Note that for a given problem
4466: the source probability $f(x)$ of outcome $X=x$ is fixed,
4467: but the randomized encoding $Y$, that is the conditional probability
4468: $g(y|x)$ of encoding source word $x$ by code word $y$,
4469: can be chosen to advantage.
4470: We define the auxiliary notion
4471: of {\em information rate distortion function} $R^{(I)}(D)$ by
4472: \begin{equation}
4473: \label{eq:ird}
4474: R^{(I)}(D) = \inf_{Y : D(X,Y) \leq D}
4475: I(X; Y).
4476: \end{equation}
4477: That is, for random variable $X$,
4478: among {\em all\/} joint random variables $Y$ with expected distortion
4479: to $X$ less
4480: than or equal to $D$, the information rate $R^{(I)}(D)$
4481: equals the minimal mutual information with $X$.
4482: \begin{theorem}[Shannon]
4483: \label{thm:rd}
4484: For every random source $X$ and distortion measure $d$:
4485: \begin{equation}
4486: \label{eq:rd}
4487: R^*(D) = R^{(I)}(D)
4488: \end{equation}
4489: \end{theorem}
4490: This remarkable theorem
4491: states that the best deterministic code achieves a rate-distortion
4492: that equals the minimal information rate possible for a randomized code,
4493: that is, the minimal
4494: mutual information between the random source and a
4495: randomized code.
4496: Note that this does not mean that $R^*(D)$
4497: is independent of the distortion measure.
4498: In fact, the source random variable $X$,
4499: together with the distortion measure $d$, determines a random
4500: code $Y$ for which the joint random variables $X$ and $Y$
4501: reach the infimum in \eqref{eq:ird}.
4502: The proof of this theorem is given in
4503: \cite{CT91}. It is illuminating to see how it goes:
4504: It is shown first that, for a random source $X$ and distortion measure $d$,
4505: every deterministic code $Y$ with distortion $\leq D$ has rate
4506: $R \geq R^{(I)} (D)$. Subsequently, it is shown that there
4507: exists
4508: a deterministic code that, with distortion $ \leq D$,
4509: achieves rate $R^*(D)=R^{(I)} (D)$.
4510: To analyze deterministic $R^*(D)$ therefore,
4511: we can determine the best randomized
4512: code $Y$ for random source $X$ under distortion constraint $D$,
4513: and then we know that simply $R^*(D)=I(X;Y)$.
4514: \begin{example} (Example~\ref{ex:ber}, continued)
4515: \label{ex:berb}
4516: \rm Suppose we want to compute $R^*(D)$ for some $D$ between $0$ and
4517: $1$. If we only allow encodings $Y$ that are deterministic functions
4518: of $X$, then either $Y(x) \equiv x$ or $Y(x) \equiv |1- x|$.
4519: In both cases ${\bf E }
4520: [d(X,Y)] = H(X| Y) = 0$, so $Y$ satisfies the constraint in
(\ref{eq:ird}). In both cases, $I(X; Y) = H(Y) = H(X)$. With
4522: (\ref{eq:rd}) this
4523: shows that $R^*(D) \leq H(X)$. However, $R^*(D)$ is actually smaller:
4524: by allowing randomized codes, we can define $Y_{\alpha}$ as
4525: $Y_{\alpha} (x) = x$ with probability $\alpha$ and $Y_{\alpha} (x) = |1- x|$
4526: with probability $1- \alpha$. For $0 \leq \alpha \leq \frac{1}{2}$, ${\bf E }
4527: [d(X,Y_{\alpha})] = H(X| Y_{\alpha})$ increases with $\alpha$, while
4528: $I(X;Y_{\alpha})$ decreases with $\alpha$. Thus, by choosing the
4529: $\alpha^*$ for which the constraint ${\bf E } [d(X,Y_{\alpha})] \leq
4530: D$ holds with equality, we find $R^*(D) = I(X; Y_{\alpha^*})$. Let us
4531: now calculate $R^*(D)$ and $D^*(R)$ explicitly.
4532:
Since $I(X;Y) = H(X) - H(X|Y)$, we can rewrite $R^*(D)$ as
4534: $$
4535: R^*(D) = H(X) - \sup_{Y: D(X,Y) \leq D}
4536: H(X|Y).
4537: $$
In the special case where $d$ is the
Shannon-Fano distortion, this can in turn be rewritten as
4540: $$
4541: R^*(D) = H(X) - \sup_{Y: H(X|Y) \leq D} H(X \mid Y)
4542: = H(X) - D.
4543: $$
4544: Since $D^*(R)$ is the inverse of $R^*(D)$,
4545: we find $D^*(R) = H(X) - R$, as announced in Example~\ref{ex:ber}.
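As a numerical illustration, assume $p = \frac{1}{2}$ (a choice made here
only to keep the arithmetic simple). Then $Y_\alpha$ is uniformly
distributed and differs from $X$ with probability $1 - \alpha$,
independently of the value of $X$, so that
$$
H(X \mid Y_\alpha) = H(\alpha, 1 - \alpha), \qquad
I(X; Y_\alpha) = 1 - H(\alpha, 1-\alpha).
$$
For $D = \frac{1}{2}$ the distortion constraint holds with equality at
$\alpha^* \approx 0.11$, since $H(0.11, 0.89) \approx 0.50$, and the
resulting rate is $R^*(\frac{1}{2}) = 1 - \frac{1}{2} = \frac{1}{2}$, in
agreement with $R^*(D) = H(X) - D$.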
4546: \end{example}
4547:
4548: \paragraph{Problem and Lacuna:}
4549: In the Rate-Distortion setting we allow (on average) a rate of
4550: $R$ bits to express the data as well as possible in some way,
4551: and measure the average of loss by some distortion function.
4552: But in many cases, like lossy compression of images,
4553: one is interested in the individual cases. The average over all
4554: possible images may be irrelevant for the individual cases one meets.
4555: Moreover, one is not particularly interested in bit-loss,
4556: but rather in preserving the essence of the image as well as possible.
4557: As another example, suppose
4558: the distortion function is simply to supply the remaining
4559: bits of the data. But this can be unsatisfactory: we are given
4560: an outcome of a measurement as a real number of $n$ significant bits. Then
4561: the $R$ most significant bits carry most of the meaning
4562: of the data, while the remaining $n-R$ bits may be irrelevant.
Thus, we are led to the elusive notion
4564: of a distortion function that captures
4565: the amount of ``meaning'' that is not included in the $R$ rate bits.
4566: These issues are taken up by Kolmogorov's proposal of the structure function.
4567: This cluster of ideas puts the notion
4568: of Rate--Distortion in an individual algorithmic (Kolmogorov
4569: complexity) setting, and focuses on the meaningful information
4570: in the data. In the end we can recycle the new insights and
4571: connect them to Rate-Distortion notions to provide new foundations
for statistical inference notions such as maximum likelihood (ML)
4573: \cite{Fi22},
4574: minimum
4575: message length (MML) \cite{WallaceF87},
4576: and minimum description length (MDL) \cite{Ri89}.
4577: \subsection{Structure Function}
4578: \label{sec:structure}
4579: There is a close relation between
4580: functions describing
4581: three, a priori seemingly unrelated, aspects of modeling individual
4582: data, depicted in Figure~\ref{figure.estimator}.
4583: \begin{figure}
4584: \begin{center}
\epsfxsize=8cm \epsfbox{estimator.eps}
4587: \end{center}
4588: \caption{Structure functions $h_x(i), \beta_x(\alpha), \lambda_x(\alpha)$,
4589: and minimal sufficient statistic.}
4590: \label{figure.estimator}
4591: \end{figure}
4592: \label{sec:meaningful}
4593: One of these was introduced by
Kolmogorov at a conference in Tallinn in 1974 (no written version)
4595: and in a talk at the Moscow Mathematical Society in the same year
4596: of which the abstract \cite{Ko74}
4597: is as follows (this is the only writing by Kolmogorov about
4598: this circle of ideas):
4599: \begin{quote}
4600: ``To each constructive object corresponds a function $\Phi_x(k)$ of a
4601: natural number $k$---the log of minimal cardinality of $x$-containing
4602: sets that allow definitions of complexity at most $k$.
4603: If the element $x$ itself allows a simple definition,
4604: then the function $\Phi$ drops to $1$ %[presumably, $0 = \log 1$ is meant]
4605: even for small $k$.
4606: Lacking such definition, the element is ``random'' in a negative sense.
4607: But it is positively ``probabilistically random'' only when function
4608: $\Phi$ having taken the value $\Phi_0$ at a relatively small
4609: $k=k_0$, then changes approximately as $\Phi(k)=\Phi_0-(k-k_0)$.''
4610: \end{quote}
4611: Kolmogorov's $\Phi_x$ is commonly called the ``structure function''
4612: and is here denoted as $h_x$ and defined in \eqref{eq2}. The
4613: structure function notion entails a proposal for a non-probabilistic
4614: approach to statistics, an individual combinatorial relation between
4615: the data and its model, expressed in terms of Kolmogorov complexity.
4616: It turns out that the structure function determines all stochastic
4617: properties of the data in the sense of determining the best-fitting
4618: model at every model-complexity level, the equivalent notion to
4619: ``rate'' in the Shannon theory. A consequence is this: minimizing the
4620: data-to-model code length (finding the ML estimator or MDL estimator),
4621: in a class of contemplated models of prescribed maximal (Kolmogorov)
4622: complexity, {\em always} results in a model of best fit, irrespective
4623: of whether the source producing the data is in the model class
4624: considered. In this setting, code length minimization {\em always}
4625: separates optimal model information from the remaining accidental
4626: information, and not only with high probability. The function that
4627: maps the maximal allowed model complexity to the goodness-of-fit
4628: (expressed as minimal ``randomness deficiency'') of the best model
4629: cannot itself be monotonically approximated. However, the shortest
4630: one-part or two-part code above can---implicitly optimizing this
4631: elusive goodness-of-fit.
4632:
4633:
4634: In probabilistic statistics the goodness of the selection process is
4635: measured in terms of expectations over probabilistic ensembles. For
4636: current applications, average relations are often irrelevant, since
4637: the part of the support of the probability mass function that will
4638: ever be observed has about zero measure. This may be the case in, for
4639: example, complex video and sound analysis. There arises the problem
4640: that for individual cases the selection performance may be bad
4641: although the performance is good on average, or vice versa. There is
4642: also the problem of what probability means, whether it is subjective,
4643: objective, or exists at all. Kolmogorov's proposal strives for the
4644: firmer and less contentious ground of finite combinatorics and
4645: effective computation.
4646:
4647: \paragraph{Model Selection:}
4648: It is technically convenient to initially consider the simple model
4649: class of finite sets to obtain our results, just as in
4650: Section~\ref{sec:algsuf}. It then turns out that it is relatively easy
4651: to generalize everything to the model class of computable probability
4652: distributions (Section~\ref{s.prob}). That class is very large
4653: indeed: perhaps it contains every distribution that has ever been
4654: considered in statistics and probability theory, as long as the
4655: parameters are computable numbers---for example rational numbers. Thus
4656: the results are of great generality; indeed, they are so general that
4657: further development of the theory must be aimed at restrictions on
4658: this model class.
4659:
4660: Below we will consider various model
4661: selection procedures. These are approaches for finding a model $S$
4662: (containing $x$) for arbitrary data $x$. The goal is to find a model
that captures all meaningful information in the data $x$. All
4664: approaches we consider are at some level based on coding $x$ by giving
4665: its index in the set $S$, taking $ \log |S|$ bits. This
4666: codelength may be thought of as a particular distortion function, and
4667: here lies the first connection to Shannon's rate-distortion:
4668:
4669: \begin{example}\label{rem.rd-ksf1}
4670: \rm
4671: %This approach can be straightforwardly translated into the Rate-Distortion
4672: %setting:
4673: A model selection procedure is a function $Z_n$ mapping binary data of
4674: length $n$ to finite sets of strings of length $n$, containing the
4675: mapped data, $Z_n(x)=S$ ($x \in S$). The range of $Z_n$ satisfies
${\cal Z}_n \subseteq 2^{\{0,1\}^n}$. The distortion function $d$ is
defined to be $d(x,Z_n(x))= \frac{1}{n} \log |S|$. To define the
4678: rate--distortion function we need that $x$ is the outcome of a random
4679: variable $X$. Here we treat the simple case that $X$ represents $n$
4680: flips of a fair coin; this is substantially generalized in
4681: Section~\ref{sec:esf}. Since each outcome of a fair coin can be
4682: described by one bit, we set the rate $R$ at $0 < R < 1$. Then,
4683: $D_n^*(R) = \min_{Z_n: |{\cal Z}_n| \leq 2^{nR}} \sum_{|x|=n}2^{-n}
\frac{1}{n} \log |Z_n(x)|$. For the minimum of the right-hand side we
4685: can assume that if $y \in Z_n(x)$ then $Z_n(y)=Z_n(x)$ (the distinct
4686: $Z_n(x)$'s are disjoint). Denote the distinct $Z_n(x)$'s by $Z_{n,i}$
4687: with $i=1,\ldots , k$ for some $k \leq 2^{nR}$. Then, $D_n^*(R) = \min
_{Z_n: |{\cal Z}_n| \leq 2^{nR}} \sum_{i=1}^k |Z_{n,i}|2^{-n}
4689: \frac{1}{n} \log |Z_{n,i}|$. The right-hand side reaches its minimum
4690: for all $Z_{n,i}$'s having the same cardinality and $k=2^{nR}$. Then,
4691: $D_n^*(R) = 2^{nR} 2^{(1-R)n} 2^{-n} \frac{1}{n} \log 2^{(1-R)n} =
4692: 1-R$. Therefore, $D^*(R)= 1-R$ and therefore $R^*(D) = 1-D$.
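
The claim that equal cardinalities are optimal can be checked
directly. The following short derivation is our own sketch; the
abbreviation $p_i = |Z_{n,i}| 2^{-n}$ is auxiliary notation introduced
only here, and $H(p_1, \ldots , p_k)$ denotes the Shannon entropy of
the distribution $(p_1, \ldots , p_k)$. Since the $Z_{n,i}$ partition
$\{0,1\}^n$, the $p_i$ sum to $1$, and
$$
\sum_{i=1}^k |Z_{n,i}| 2^{-n} \frac{1}{n} \log |Z_{n,i}|
= \frac{1}{n} \sum_{i=1}^k p_i (n + \log p_i)
= 1 - \frac{1}{n} H(p_1, \ldots , p_k)
\geq 1 - \frac{1}{n} \log k \geq 1 - R,
$$
with equality precisely when all $p_i$ are equal and $k = 2^{nR}$.
For instance, with $n=4$ and $R=\frac{1}{2}$, partitioning $\{0,1\}^4$
according to the first two bits gives $k=4$ sets of $4$ strings each,
with distortion $\frac{1}{4} \log 4 = \frac{1}{2} = 1-R$.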
4693:
4694: Alternatively, and more in line with the structure-function
4695: approach below, one may consider repetitions of a random variable $X$
4696: with outcomes in $\{0,1\}^n$. Then,
4697: a model selection procedure is a function $Y$ mapping
4698: binary data of length $n$ to finite sets of strings of length $n$,
4699: containing the mapped data,
4700: $Y(x)=S$ ($x \in S$). The range of $Y$ satisfies
${\cal Y} \subseteq 2^{\{0,1\}^n}$. The distortion function $d$ is defined
4702: by $d(x,Y(x))= \log |S|$. To define the rate--distortion function
4703: we need that $x$ is the outcome of a random variable $X$, say
4704: a toss of a fair $2^n$-sided coin. Since each outcome of a fair
4705: coin can be described by $n$ bits, we set the rate $R$
4706: at $0 < R < n$. Then, for outcomes $\overline{x}=x_1 \ldots x_m$
4707: ($|x_i|=n$), resulting from $m$ i.i.d. random variables $X_1, \ldots , X_m$,
4708: we have $d(\overline{x}, Z_m (\overline{x})) =
4709: \frac{1}{m} \sum_{i=1}^m \log |Y_i (x_i)| =
4710: \frac{1}{m} \log | Y_1(x_1) \times \cdots \times Y_m(x_m)|$. Then,
4711: $D_m^*(R) = \min_{Z_m: |{\cal Z}_m| \leq 2^{mR}}
4712: \sum_{\overline{x}}2^{-mn} d(\overline{x}, Z_m (\overline{x}))$.
Assume that $Z_m(\overline{y}) = Z_m(\overline{x})$ whenever
$\overline{y} \in Z_m(\overline{x})$: the distinct
4715: $Z_m(\overline{x})$'s are disjoint and partition $\{0,1\}^{mn}$
4716: into disjoint subsets $Z_{m,i}$, with $i=1, \ldots, k$ for
4717: some $k \leq 2^{mR}$.
4718: Then,
4719: $D_m^*(R) = \min_{Z_m: |{\cal Z}_m| \leq 2^{mR}}
4720: \sum_{i=1,\ldots, k} |Z_{m,i}|2^{-mn} \frac{1}{m}
4721: \log |Z_{m,i}|$.
4722: The right-hand side reaches its minimum for all $Z_{m,i}$'s having
4723: the same cardinality and $k=2^{mR}$, so that
4724: $D_m^*(R) = 2^{(n-R)m} 2^{mR} 2^{-mn} \frac{1}{m} \log 2^{(n-R)m} = n-R$.
4725: Therefore, $D^*(R)= n-R$ and $R^*(D) = n-D$.
4726: In Example~\ref{ex.rd=str} we relate these numbers to the structure
4727: function approach described below.
4728: \end{example}
4729:
4730: \paragraph{Model Fitness:}
4731: A distinguishing feature of the structure function approach is that
4732: we want to formalize what it means for an element to be ``typical''
4733: for a set that contains it. For example, if we flip a fair coin $n$
4734: times, then the sequence of $n$ outcomes, denoted by $x$,
4735: will be an element of the set $\{0,1\}^n$. In fact,
4736: most likely it will be a ``typical'' element in the sense that
4737: it has all properties that hold on average for an element of that set.
4738: For example, $x$ will have $\frac{n}{2} \pm O(\sqrt{n})$ frequency
4739: of 1's, it will have a run of about $\log n$ consecutive 0's,
4740: and so on for many properties. Note that the sequence $x=0 \ldots 01\ldots1$,
consisting of one half 0's followed by one half 1's, is very untypical,
4742: even though it satisfies the two properties described explicitly.
4743: The question arises how to formally define ``typicality''. We do
4744: this as follows:
4745: The lack of typicality
4746: of $x$ with respect to a finite set $S$ (the model) containing it,
4747: is the amount by which $K(x|S)$
4748: falls short of the length $\log |S|$ of the data-to-model code (Section~\ref{sec:algsuf}).
4749: Thus, the {\em randomness deficiency} of $x$ in $S$ is defined by
4750: \begin{equation}\label{eq:randomness-deficiency}
4751: \delta (x | S) = \log |S| - K(x | S),
4752: \end{equation}
4753: for $x \in S$, and $\infty$ otherwise. Clearly, $x$ can be typical for
4754: vastly different sets. For example, every $x$ is typical for the singleton
4755: set $\{x\}$, since $\log |\{x\}|=0$ and $K(x \mid \{x\})=O(1)$.
4756: Yet the many $x$'s that have $K(x) \geq n$ are also typical for
4757: $\{0,1\}^n$, but in another way. In the first example, the set is about
4758: as complex as $x$ itself. In the second example, the set is vastly
4759: less complex than $x$: the set has complexity about
4760: $K(n) \leq \log n + 2 \log \log n$ while $K(x)\geq n$.
4761: Thus, very high complexity data may have simple
4762: sets for which they are typical. As we shall see,
4763: this is certainly not the case for all high complexity data.
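To see definition \eqref{eq:randomness-deficiency} at work on a string
that is not typical for a simple set (a sketch of ours; the $O(1)$
constants depend on the reference machine), take $S=\{0,1\}^n$ and
$y = 00 \ldots 0$ of length $n$. Given $S$ we can compute $n$, and
hence $y$, by a program of constant length, so that
$$
\delta (y \mid \{0,1\}^n) = n - K(y \mid \{0,1\}^n) \geq n - O(1),
$$
and $y$ is about as untypical for $\{0,1\}^n$ as possible. In
contrast, every $x$ of length $n$ with
$K(x \mid \{0,1\}^n) \geq n-c$ has $\delta (x \mid \{0,1\}^n) \leq c$.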
4764: %certain
4765: %$y$ with $|y|=n$ and $K(y)=n/2$.
The question arises how typical
data $x$ of length $n$ can be, in the best case,
for a finite set of complexity at most $R$,
when $R$ ranges from 0 to $n$. Here the complexity ``rate'' $R$
($0 \leq R \leq n$) is the maximal number of bits we can spend
to describe a finite set containing $x$, and optimal typicality
is measured by the randomness deficiency.
The function describing this dependency
is defined as follows:
4775:
4776: %From the definition we see that $x$ can only be typical for a
4777: %large cardinality set if the complexity of $x$ is large.
4778: %But $y = 00 \ldots 0$ is not typical for $\{0,1\}^n$;
4779: %in fact, $\delta (y \mid \{0,1\}^n) \geq n - O(1)$.
4780:
4781: %This definition allows us to consider how typical data $x$ can be for
4782: %a model $S$ of certain maximal complexity:
4783: The {\em minimal randomness deficiency} function is
4784: \begin{equation}
4785: \label{eq1}
4786: \beta_x( R) =
4787: \min_{S} \{ \delta(x| S): S \ni x, \; K(S) \leq R \},
4788: \end{equation}
4789: where we set $\min \emptyset = \infty$. If $\delta(x |
4790: S)$ is small, then $x$ may be considered as a {\em
4791: typical} member of $S$. This means that $S$ is a
4792: ``best'' model for $x$---a most likely explanation. There
4793: are no simple special properties that single it out from
4794: the majority of elements in $S$. We therefore like to
4795: call $\beta_x(R)$ the {\em best-fit estimator}. This
4796: is not just terminology: If $\delta (x | S)$ is small,
4797: then $x$ satisfies {\em all} properties of low Kolmogorov
4798: complexity that hold with high probability (under the
4799: uniform distribution) for the elements of $S$. To be
4800: precise \cite{VV02}: Consider strings of length $n$ and
4801: let $S$ be a subset of such strings. We view a {\em
4802: property} of elements in $S$ as a function $f_P: S
4803: \rightarrow \{0,1\}$. If $f_P(x)=1$ then $x$ has the
4804: property represented by $f_P$ and if $f_P(x)=0$ then $x$
4805: does not have the property. Then: (i) If $f_P$ is a
4806: property satisfied by all $x$ with $\delta(x | S) \le
4807: \delta (n)$, then $f_P$ holds with probability at least
4808: $1-1/2^{\delta(n)}$ for the elements of $S$.
4809:
4810: (ii) Let
4811: $f_P$ be any
4812: property
4813: that holds with probability at least
4814: $1-1/2^{\delta (n)}$ for the
4815: elements of $S$. Then, every such $f_P$ holds
4816: simultaneously for every $x \in S$
4817: with $\delta (x | S)\le\delta (n)-K(f_P|S)-O(1)$.
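
Statement (i) rests on a simple counting argument, which we sketch
here under the simplifying assumption (ours) that $\log |S|$ and
$d = \delta (n)$ are integers with $d < \log |S|$. An element
$x \in S$ has $\delta (x | S) > d$ exactly when
$K(x | S) \leq \log |S| - d - 1$. There are at most
$$
\sum_{j=0}^{\log |S| - d - 1} 2^j = 2^{\log |S| - d} - 1 < |S| \, 2^{-d}
$$
programs of such length, and each of them describes at most one $x$.
Hence fewer than a fraction $2^{-d}$ of the elements of $S$ have
randomness deficiency exceeding $d$, so a property satisfied by all
remaining elements holds with probability at least $1-1/2^{d}$ under
the uniform distribution on $S$.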
4818:
4819:
4820: \begin{example}
4821: \rm {\bf Lossy Compression:} \index{compression, lossy} The function
4822: $\beta_x( R)$ is relevant to lossy compression (used, for instance,
4823: to compress images) -- see also Remark~\ref{rem:lossy}. Assume we
4824: need to compress $x$ to $R$ bits where $R \ll K(x)$. Of course this
4825: implies some loss of information present in $x$. One way to select
4826: redundant information to discard is as follows: Find a set $S\ni x$
4827: with $K(S)\le R$ and with small $\delta(x | S)$, and consider a
4828: compressed version $S'$ of $S$. To reconstruct an $x'$, a
decompressor uncompresses $S'$ to $S$ and selects at random an
4830: element $x'$ of $S$. Since with high probability the randomness
4831: deficiency of $x'$ in $S$ is small, $x'$ serves the purpose of the
4832: message $x$ as well as does $x$ itself. Let us look at an example.
4833: To transmit a picture of ``rain'' through a channel with limited
4834: capacity $R$, one can transmit the indication that this is a picture
4835: of the rain and the particular drops may be chosen by the receiver
4836: at random. In this interpretation, $\beta_x(R)$ indicates how
4837: ``random'' or ``typical'' $x$ is with respect to the best model at
4838: complexity level $R$---and hence how ``indistinguishable'' from the
4839: original $x$ the randomly reconstructed $x'$ can be expected to be.
4840: \end{example}
4841:
4842:
4843: \begin{remark}
4844: \rm
4845: This randomness deficiency function quantifies
4846: the goodness of fit of the best model at complexity $R$
4847: for given data $x$. As far as we know no direct counterpart of this
4848: notion exists in Rate--Distortion theory, or, indeed,
4849: can be expressed in classical theories like Information Theory.
But the situation is different for the next function we define,
which can be tied to the minimum randomness deficiency function
and yet, seemingly at odds with the previous statement,
does have a counterpart in Rate--Distortion theory after all, as will be
seen in Example~\ref{ex.rd=str} and Section~\ref{sec:esf}.
4855: \end{remark}
4856:
4857:
4858: \paragraph{Maximum Likelihood estimator:}
4859: The {\em Kolmogorov structure} function $h_x$ of given data $x$ is defined by
4860: \begin{equation}\label{eq2}
4861: h_{x}(R) = \min_{S} \{\log | S| : S \ni x,\; K(S) \leq R\},
4862: \end{equation}
4863: where $S \ni x$ is
4864: a contemplated model for $x$, and $R$ is a nonnegative
4865: integer value bounding the complexity of the contemplated $S$'s.
4866: The structure function uses models that are finite sets and
4867: the value of the structure function is the log-cardinality of the
4868: smallest such set containing the data. Equivalently, we can
4869: use uniform probability mass functions over finite supports (the former
4870: finite set models). The smallest set containing the data then becomes
4871: the uniform probability mass assigning the highest probability
4872: to the data---with the value of the structure function
4873: the corresponding negative
4874: log-probability. This motivates us to call $h_x$ the {\em maximum likelihood
4875: estimator}. The treatment can be extended from uniform probability
4876: mass functions with finite supports, to probability models that
4877: are arbitrary computable probability mass functions, keeping
4878: all relevant notions and results essentially unchanged, Section~\ref{s.prob},
4879: justifying the maximum likelihood identification even more.
4880:
4881: Clearly, the Kolmogorov structure function is
4882: non-increasing and reaches $\log |\{x\}| = 0$
4883: for the ``rate'' $R = K(x)+c_1$ where $c_1$ is the number of bits required
4884: to change $x$ into $\{x\}$.
It is also easy to see that for argument $K(|x|)+c_2$, where $c_2$
is the number of bits required to compute the
set of all strings of length $|x|$ from $|x|$,
the value of the structure function is at most $|x|$; see Figure~\ref{figure.estimator}.
4889: \begin{example}\label{ex.rd=str}
4890: \rm
4891: Clearly the structure function measures for individual outcome $x$
4892: a distortion that is related to the
4893: one measured by $D_1^*(R)$ in Example~\ref{rem.rd-ksf1}
4894: for the uniform average of outcomes $x$.
4895: Note that all strings $x$ of length $n$ satisfy $h_x(K(n)+O(1)) \leq n$
4896: (since $x \in S_n=\{0,1\}^n$ and $K(S_n)=K(n)+O(1)$).
4897: For every $R$ ($0 \leq R \leq n$),
4898: we can describe every $x = x_1x_2 \ldots x_n$ as an element
4899: of the set $A_R = \{x_1 \ldots x_R y_{R+1} \ldots y_n:
4900: y_i \in \{0,1\}, R < i \leq n \}$. Then, $|A_R|=2^{n-R}$
4901: and $K(A_R) \leq R+K(n,R)+O(1) \leq R + O(\log n)$.
4902: This shows that $h_x (R) \leq n-R+O(\log n)$ for every $x$
4903: and every $R$ with $0 \leq R \leq n$; see Figure~\ref{figure.estimator}.
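To make the construction concrete (the particular string is our
arbitrary choice): for $n=8$, $x=10110010$ and $R=3$ the set is
$$
A_3 = \{ 101 y_4 \ldots y_8 : y_i \in \{0,1\}, \; 3 < i \leq 8 \},
\qquad |A_3| = 2^{5}, \qquad \log |A_3| = 5 = n-R.
$$
Describing $A_3$ requires the prefix $101$ together with $n$ and $R$,
which accounts for the bound $K(A_R) \leq R+K(n,R)+O(1)$.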
4904:
4905: For all $x$'s and $R$'s we can describe $x$ in a two-part code by the set
4906: $S$ witnessing $h_x(R)$ and $x$'s index in that set. The first part
describing $S$ in $K(S) \leq R$ bits allows us to generate
4908: $S$, and given $S$ we know $\log |S|$. Then,
4909: we can parse the second part of $\log |S|=h_x(R)$ bits that gives $x$'s
4910: index in $S$. We also need a fixed $O(1)$ bit program to produce $x$
4911: from these descriptions. Since $K(x)$ is the lower bound on
4912: the length of effective descriptions of $x$, we have $h_x(R)+R \geq K(x)-O(1)$.
4913: There are $2^n - 2^{n-K(n)+O(1)}$ strings $x$ of complexity $K(x)\geq n$,
4914: \cite{LiVi97}. For all these strings $h_x(R) + R \geq n-O(1)$.
Hence, the expected value of $h_x(R)$ equals
$2^{-n} \{ (2^n-2^{n-K(n)+O(1)}) [n-R+O(\log n)]
+ 2^{n-K(n)+O(1)} O(n-R+O(\log n)) \} = n-R + O((n-R)/2^{K(n)})
= n-R + o(n-R)$
4919: (since $K(n) \rightarrow \infty$ for $n \rightarrow \infty$).
4920: That is, the expectation of $h_x(R)$ equals $(1+o(1))D^*_1(R)
4921: =(1+o(1))D^*(R)$, the Distortion-Rate function, where the
4922: $o(1)$ term goes to 0 with the length $n$ of $x$. In
4923: Section~\ref{sec:esf} we extend this idea to non-uniform distributions
4924: on $X$.
4925: \end{example}
4926:
4927:
4928: For every $S\ni x$ we have
4929: \begin{equation}\label{eq.descr}
4930: K(x)\leq K(S)+ \log |S| + O(1).
4931: \end{equation}
4932: Indeed,
4933: consider the following \emph{two-part code}
4934: for $x$: the first part is
4935: a shortest self-delimiting program $p$ of $S$ and the second
4936: part is
the $\lceil\log|S|\rceil$-bit index of $x$
4938: in the lexicographical ordering of $S$.
4939: Since $S$ determines $\log |S|$ this code is self-delimiting
4940: and we obtain \eqref{eq.descr}
4941: where the constant $O(1)$ is
4942: the length of the program to reconstruct
4943: $x$ from its two-part code.
4944: We thus conclude that $K(x)\leq R+h_x(R)+O(1)$, that is, the
4945: function $h_x(R)$
4946: never decreases
4947: more than a fixed independent constant below
4948: the diagonal \emph{sufficiency line} $L$ defined by
4949: $L(R)+R = K(x)$,
4950: which is a lower bound on $h_x (R)$
4951: and is approached to within a constant distance by
4952: the graph of $h_x$ for certain $R$'s
4953: (for instance, for $R = K(x)+c_1$).
4954: For these $R$'s we
4955: thus have
4956: $R + h_x (R) = K(x)+O(1)$.
4957: In the terminology we have introduced in Section~\ref{sect.ss} and Definition~\ref{def:algsufstat},
4958: a model corresponding to such an $R$ (witness for
4959: $h_x(R)$) is an optimal set for $x$
4960: and a shortest program to compute this model
4961: is a sufficient statistic. It is
4962: {\em minimal} for the least such $R$ for which the above equality holds.
4963:
4964:
4965: \paragraph{MDL Estimator:}
4966: The length of the minimal two-part code for $x$ consisting
4967: of the model cost $K(S)$ and the
4968: length of the index of $x$ in
4969: $S$,
4970: the complexity of $S$ upper bounded by $R$, is given by
4971: the {\em MDL (minimum description length) function}:
4972: \begin{equation}\label{eq.3}
4973: \lambda_{x}(R) =
4974: \min_{S} \{\Lambda(S): S \ni x,\; K(S) \leq R\},
4975: \end{equation}
4976: where $\Lambda(S)=\log|S|+K(S) \ge K(x)-O(1)$ is
the total length of the two-part code of $x$
with the help of model $S$.
4979: Clearly,
4980: $\lambda_x (R) \leq h_x(R)+ R +O(1)$,
4981: but a priori it is still possible that $ h_x(R')+ R'
4982: < h_x(R)+R$ for $R' < R$.
4983: In that case $\lambda_x(R) \leq
4984: h_x(R')+ R'
4985: < h_x(R)+R$. However, in \cite{VV02} it is shown
4986: that $\lambda_x (R) = h_x(R)+ R + O(\log n)$
4987: for all $x$ of length $n$. Even so, this doesn't mean that a set
4988: $S$ that witnesses $\lambda_x (R)$ in the sense that $x \in S$,
4989: $K(S) \leq R$, and $K(S)+\log |S|= \lambda_x (R)$,
4990: also witnesses $h_x(R)$. It can in fact be the case that $K(S) \leq R-r$,
4991: and $\log |S|= h_x (R)+r$ for arbitrarily large $r \leq n$.
4992:
4993:
4994: Apart from being convenient for the technical analysis
4995: in this work, $\lambda_x (R)$ is the
4996: celebrated two-part Minimum Description Length code
4997: length \cite{Ri89} with the
4998: model-code length restricted to at most $R$.
4999: When $R$ is large enough so that $\lambda_{x}(R) = K(x)$,
5000: then there is a set $S$ that is a sufficient statistic, and
5001: the smallest such $R$ has an associated witness set $S$ that
5002: is a minimal sufficient statistic.
5003:
5004: The most fundamental result in \cite{VV02}
5005: is the equality
5006: \begin{equation}\label{eq.eq}
5007: \beta_x (R ) = h_x (R) + R - K(x) = \lambda_x (R)
5008: - K(x)
5009: \end{equation}
5010: which holds within logarithmic additive terms in argument and value.
5011: Additionally, every set $S$ that witnesses the value $h_x (R )$
5012: (or $\lambda_x(R)$),
5013: also witnesses the value $\beta_x (R)$ (but not vice versa).
5014: It is easy to see that $h_x (R)$ and $\lambda_x(R)$
5015: are
5016: upper semi-computable (Definition~\ref{def.semi});
5017: but we have shown \cite{VV02}
5018: that $\beta_x (R)$ is neither upper nor lower semi-computable
5019: (not even within a great tolerance).
5020: A priori
5021: there is no reason to suppose that
5022: a set that witnesses $h_x (R)$
5023: (or $\lambda_x(R)$) also witnesses $\beta_x (R)$,
5024: for {\em every} $R$.
5025: But the fact that they do, vindicates
5026: Kolmogorov's original proposal
5027: and establishes $h_x$'s pre-eminence over $\beta_x$ -- the
5028: pre-eminence of $h_x$ over $\lambda_x$ is discussed below.
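
One half of \eqref{eq.eq} is elementary, and we sketch it here as an
illustration (this is only the easy direction, not the argument of
\cite{VV02}). For every finite set $S \ni x$ we have
$K(x) \leq K(S) + K(x | S) + O(1)$, and therefore
$$
\delta (x | S) = \log |S| - K(x | S)
\leq \log |S| + K(S) - K(x) + O(1)
= \Lambda (S) - K(x) + O(1).
$$
Applying this to a set $S$ with $K(S) \leq R$ and
$\Lambda (S) = \lambda_x (R)$ gives
$\beta_x (R) \leq \lambda_x (R) - K(x) + O(1)$.
The converse inequality, that no set of complexity at most $R$ has
substantially smaller randomness deficiency, is the nontrivial
content of \eqref{eq.eq}.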
5029:
5030: \begin{remark}\label{rem.MLvsMDL}
5031: \rm
What we call `maximum likelihood' in the form of $h_x$ is really `maximum
likelihood' under a complexity constraint $R$ on the models, as in
$h_x (R)$. In
5035: statistics, it is a well-known fact that maximum likelihood often
5036: fails (dramatically overfits) when the models under consideration are
5037: of unrestricted complexity (for example, with polynomial regression with
5038: Gaussian noise, or with Markov chain model learning, maximum
5039: likelihood will always select a model with $n$ parameters, where $n$ is
5040: the size of the sample---and thus typically, maximum likelihood will
5041: dramatically overfit, whereas for example MDL typically performs
5042: well). The equivalent, in our setting, is that allowing models of unconstrained
5043: complexity for data $x$, say complexity $K(x)$,
5044: will result in the ML-estimator $h_x (K(x)+O(1))=0$---the witness model
5045: being the trivial, maximally overfitting, set $\{x\}$.
5046: In the MDL case, on the other hand, there may be a long constant
5047: interval with the MDL estimator
5048: $\lambda_x (R) = K(x)$ ($R \in [R_1 , K(x)]$)
5049: where the length of the two-part code doesn't decrease anymore.
5050: Selecting the least complexity model witnessing this function value
5051: we obtain the, very significant, algorithmic {\em minimal} sufficient
5052: statistic, Definition~\ref{def:algsufstat}.
5053: In this sense, MDL augmented with a bias for the least complex explanation,
5054: which we may call the `Occam's Razor MDL',
5055: is superior to maximum likelihood and resilient to overfitting.
5056: If we don't apply bias in the direction of simple explanations,
5057: then -- at least in our setting --
5058: MDL may be just as prone to overfitting as is ML. For example,
5059: if $x$ is a typical random element of $\{0,1\}^n$, then
5060: $\lambda_x (R) = K(x)+O(1)$ for the entire interval
5061: $K(n)+O(1) \leq R \leq K(x)+O(1) \approx n$.
5062: Choosing the model on the left side, of simplest complexity,
5063: of complexity $K(n)$
5064: gives us the best fit with the correct model $\{0,1\}^n$.
5065: But choosing a model on the right side, of high complexity, gives us
5066: a model $\{x\}$ of complexity $K(x)+O(1)$ that completely
5067: overfits the data by modeling all random noise in $x$
5068: (which in fact in this example almost completely consists of random noise).
5069: \index{overfit}
5070:
Thus, it should be emphasized that `ML =
5072: MDL' really only holds if complexities are constrained to a value
5073: $R$ (that remains fixed as the sample size grows---note that in the
5074: Markov chain example above, the complexity grows linearly with
5075: the sample size); it certainly
5076: does not hold in an unrestricted sense (not even in the algorithmic setting).
5077: \end{remark}
5078: \begin{remark}
5079: \rm
5080: In a sense, $h_x$ is more strict than $\lambda_x$:
5081: A set that witnesses $h_x(R)$ also witnesses
5082: $\lambda_x(R)$ but not necessarily vice versa. However,
5083: at those complexities $R$ where $\lambda_x (R)$ drops
5084: (a little bit of added complexity in the model allows a
5085: shorter description), the witness set of $\lambda_x$ is
5086: also a witness set of $h_x$. But if $\lambda_x$ stays
5087: constant in an interval $[R_1, R_2]$, then we
can trade off complexity of a witness set versus its cardinality,
keeping the description length constant. This is of course not possible
with $h_x$ where the log-cardinality of the witness set at complexity $R$
is fixed at $h_x(R)$.
5092: \end{remark}
5093:
5094: The main result can be taken as a foundation and justification
5095: of common statistical principles in model
5096: selection such as maximum likelihood
5097: or MDL.
5098: The structure functions $\lambda_x,h_x$ and $\beta_x$ can assume all
5099: possible shapes over their full domain of definition (up to
5100: additive logarithmic precision in both argument and value), see \cite{VV02}.
5101: (This establishes
5102: the significance of \eqref{eq.eq}, since it shows that $\lambda_x (R)
5103: \gg K(x)$ is common for $(x, R)$ pairs---in which case the more
5104: or less
5105: easy fact that $\beta_x(R)=0$ for $\lambda_x(R)=K(x)$ is
5106: not
5107: applicable, and it is a priori unlikely that \eqref{eq.eq} holds:
5108: Why should minimizing a set containing
5109: $x$ also minimize its randomness deficiency? Surprisingly, it does!)
5110: We have exhibited a---to our knowledge first---natural example,
5111: $\beta_x$, of a function that
5112: is not semi-computable but computable with an oracle for the halting problem.
5113:
5114:
5115: \begin{example}\label{ex.prnr}
5116: \index{randomness, positive}
5117: \index{randomness, negative}
5118: \rm
5119: {\bf ``Positive'' and ``Negative'' Individual Randomness:}
5120: In \cite{GTV01} we showed the existence
5121: of strings for which essentially
5122: the singleton set consisting of the string itself is a minimal
5123: sufficient statistic. While a sufficient
5124: statistic of an object yields a two-part code that is as short as the shortest
5125: one part code, restricting the complexity of the allowed statistic
5126: may yield two-part codes that are considerably longer than the best one-part
5127: code (so the statistic is insufficient).
5128: In fact,
5129: for every object there is a complexity bound below which this happens---but
5130: if that bound is small (logarithmic) we call the object ``stochastic''
5131: since it has a simple satisfactory explanation (sufficient statistic).
5132: Thus, Kolmogorov in \cite{Ko74}
5133: makes the important distinction of
5134: an object being random in the ``negative'' sense by having this bound
5135: high (it has high complexity and is not a typical element of
5136: a low-complexity model),
5137: and an object being random in the ``positive,
5138: probabilistic'' sense by both having this bound small and itself
5139: having complexity considerably exceeding this bound
5140: (like a string $x$ of length $n$ with $K(x) \geq n$,
5141: being typical for the
5142: set $\{0,1\}^n$, or the uniform probability distribution over that
5143: set,
5144: while this set or probability distribution
5145: has complexity $K(n)+O(1) = O(\log n)$).
5146: We depict the distinction in Figure~\ref{figure.pos_negrandom}.
5147: In simple terms: High Kolmogorov complexity of a data string
5148: just means that it is random in a {\em negative sense};
5149: but a data string of high Kolmogorov
5150: complexity is {\em positively random} if the simplest satisfactory explanation
5151: (sufficient statistic) has low complexity,
5152: and it therefore is the typical outcome
5153: of a simple random process.
5154:
5155: \begin{figure}
5156: \begin{center}
\epsfxsize=8cm \epsfbox{pos_negrandom.eps}
5159: \end{center}
5160: \caption{Data string $x$ is ``positive random'' or ``stochastic''
5161: and data string $y$
5162: is just ``negative random'' or ``non-stochastic''.}
5163: \label{figure.pos_negrandom}
5164: \end{figure}
5165:
5166:
5167: In \cite{VV02} it is shown that for every length $n$ and
5168: every complexity $k \leq n+K(n) + O(1)$ (the maximal complexity
5169: of $x$ of length $n$) and every $R \in [0,k]$,
5170: there are $x$'s of length $n$ and complexity $k$ such that
5171: the minimal randomness deficiency $\beta_x (i) \geq n-k\pm O(\log
5172: n)$
for every $i \leq R \pm O(\log n)$ and $\beta_x (i) = O(\log n)$
5174: for every $i > R \pm O(\log n)$. Therefore, the set of $n$-length
strings of every complexity $k$ can be partitioned into subsets of strings that
5176: have a Kolmogorov minimal sufficient statistic of complexity
5177: $\Theta (i \log n)$ for $i = 1, \ldots , k/ \Theta (\log n)$.
5178: For instance, there are $n$-length non-stochastic
5179: strings of almost maximal complexity $n - \sqrt{n}$
5180: having significant $\sqrt{n}\pm O(\log n)$ randomness deficiency with
5181: respect to $\{0,1\}^n$ or, in fact, every other finite set
5182: of complexity less than $n - O(\log n)$!
5183: \end{example}
5184:
5185:
5186:
5187: \subsubsection{Probability Models}
5188: \label{s.prob}
5189: The structure function (and of course the sufficient statistic) use
5190: properties of data strings modeled by finite sets, which amounts to
5191: modeling data by uniform distributions. As already
5192: observed by Kolmogorov himself, it turns out
5193: that this is no real restriction.
5194: Everything holds also for computable probability mass functions
5195: (probability models), up to additive logarithmic precision. Another
5196: version of $h_x$ uses probability models $f$ rather than finite set
5197: models. It is defined as $h'_x(R) = \min_{f} \{\log 1/f(x): f(x)>0,
5198: K(f) \leq R\}$. Since $h'_x(R)$ and $h_x(R)$ are close by
5199: Proposition~\ref{prop.1} below, Theorem~\ref{thm.dresf} and
5200: Corollary~\ref{cor.esf} also apply to $h'_x$ and the distortion-rate
5201: function $D^*(R)$ based on a variation of the
5202: Shannon-Fano distortion measure
5203: defined by using encodings $Y(x)=f$ with $f$ a computable
5204: probability distribution. In this context,
5205: the Shannon-Fano distortion measure
5206: is defined by
5207: \begin{equation}\label{eq.sfdf}
5208: d'(x,f)= \log 1/f(x).
5209: \end{equation}
5210: It remains to show that probability models are essentially the same as
5211: finite set models. We restrict ourselves to the model class of {\em
5212: computable probability distributions}. Within the present section,
5213: we assume these are defined on strings of arbitrary length; so they
5214: are represented by mass functions $f: \{0,1\}^* \rightarrow [0,1]$
5215: with $\sum f(x) = 1$ being computable according to
5216: Definition~\ref{def.enum.funct}. A string $x$ is typical for a
5217: distribution $f$ if the randomness deficiency $ \delta (x \mid f) =
5218: \log 1/ f(x) - K(x \mid f) $ is small. The conditional complexity $K(x
5219: \mid f)$ is defined as follows. Say that a function $A$ approximates
5220: $f$ if $|A(y,\eps)-f(y)|<\eps$ for every $y$ and every positive
5221: rational $\eps$. Then $K(x \mid f)$ is the minimum length of a program
5222: that given every function $A$ approximating $f$ as an oracle prints
5223: $x$. Similarly, $f$ is $c$-optimal for $x$ if $ K(f) + \log 1/ f(x)
5224: \leq K(x)+c $. Thus, instead of the data-to-model code length
5225: $\log|S|$ for finite set models, we consider the data-to-model code
5226: length $\log 1/ f(x)$ (the Shannon-Fano code). The value $\log 1/f(x)$
5227: measures also how likely $x$ is under the hypothesis $f$. The
5228: mapping $x\mapsto f_{\min}$ where $f_{\min}$ minimizes $\log 1/f(x)$
5229: over $f$ with $K(f)\le R$ is a \emph{maximum likelihood
estimator}, see Figure~\ref{figure.MLestimator}. Our results thus
imply that this maximum likelihood estimator always returns a
5232: hypothesis with minimum randomness deficiency.
5233:
5234: \begin{figure}
5235: \begin{center}
5237: \epsfxsize=8cm \epsfbox{MLestimator.eps}
5238: \end{center}
5239: \caption{Structure function $h_x(i)= \min_f \{ \log 1/ f(x): f(x)>0, \; K(f) \leq
5240: i\}$ with $f$ a computable
5241: probability mass function, with values according to the left
5242: vertical coordinate, and the maximum likelihood estimator $2^{-h_x(i)}=
\max \{f(x): f(x)>0 , \; K(f) \leq i\}$,
5244: with values according to the right-hand side vertical coordinate.}
5245: \label{figure.MLestimator}
5246: \end{figure}
5247:
5248:
5249:
5250: It is easy to show that for every data string $x$
5251: and a contemplated finite set model for it, there
5252: is an almost equivalent computable probability model.
5253: The converse is slightly harder:
5254: for every data string $x$ and a contemplated
5255: computable probability model for it,
5256: there is a finite set model for $x$ that has no worse complexity,
5257: randomness deficiency, and worst-case data-to-model code for $x$,
5258: up to additive logarithmic precision:
5259:
5260:
5261: \begin{proposition}\label{prop.1}
5262: (a) For every $x$ and every finite set $S \ni x$ there is
5263: a computable probability
5264: mass function $f$ with $\log 1/f(x) =\log|S|$,
5265: $\delta(x \mid f)=\delta(x \mid S)+O(1)$
5266: and $K(f) = K(S)+ O(1)$.
5267:
5268: (b)
5269: There are constants $c,C$, such that
5270: for every string $x$, the following holds:
5271: For every computable probability
5272: mass function $f$
5273: there is a finite set $S \ni x$
5274: such that $\log |S| < \log 1/ f(x)+1$, $\delta(x \mid S)
5275: \le \delta(x \mid f)+ 2\log K(f)+K(\lfloor \log 1/
5276: f(x)\rfloor)+2\log K(\lfloor \log 1/ f(x)\rfloor)+C$
5277: and
5278: $K(S) \leq K(f) + K(\lfloor \log 1/f(x)\rfloor)+C$.
5279:
5280:
5281: \end{proposition}
5282:
5283: \begin{proof}
5284: (a) Define $f(y)= 1/|S|$ for $y \in S$
5285: and 0 otherwise.
5286:
5287: (b) Let $m=\lfloor \log 1/f(x)\rfloor$, that is,
5288: $2^{-m-1}<f(x)\le 2^{-m}$.
5289: Define $S = \{y: f(y)
5290: > 2^{-m-1}\}$. Then,
5291: $|S|<2^{m+1} \leq 2/f(x)$,
5292: which implies the claimed value for $\log |S|$.
5293: To list $S$ it suffices to compute all consecutive values of $f(y)$ to
5294: sufficient precision
5295: until the combined probabilities exceed $1-2^{-m-1}$.
5296: That is, $K(S) \leq
5297: K(f)+ K(m)+O(1)$.
5298: Finally,
5299: $\delta(x \mid S)=\log|S|-K(x|S^*)< \log 1/f(x)-K(x \mid S^*)+1=
5300: \delta(x \mid f)+K(x \mid f)-K(x \mid S^*)+1\le \delta(x \mid f)+K(S^* \mid f)+O(1)$.
5301: The term $K(S^* \mid f)$ can be upper bounded
5302: as $K(K(S))+K(m)+O(1)\le 2\log K(S)+K(m)+O(1)
5303: \le 2\log (K(f)+K(m))+K(m)+O(1)
5304: \le 2\log K(f)+2\log K(m)+K(m)+O(1)$, which implies the claimed bound for
5305: $\delta(x \mid S)$.
5306:
5307: \end{proof}
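
\begin{example}
\rm
As a small illustration of the construction in item (b) of
Proposition~\ref{prop.1}, consider a hypothetical mass function $f$,
chosen here only for concreteness, with
$f(y_1)=1/2$, $f(x)=1/3$, $f(y_2)=1/6$, and $f(y)=0$ for all other $y$.
Then $m=\lfloor \log 3 \rfloor = 1$, and indeed
$2^{-2} < f(x) \le 2^{-1}$.
The set $S=\{y: f(y)>2^{-2}\}$ equals $\{y_1,x\}$, so that
\[
|S| = 2 < 2^{m+1} = 4 \leq 2/f(x) = 6 ,
\qquad
\log |S| = 1 < \log 1/f(x) + 1 = \log 3 + 1 .
\]
In general $|S| < 2^{m+1}$ holds because every element of $S$ carries
probability more than $2^{-m-1}$ while the total probability is at most 1.
\end{example}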
5308:
5309: How large are the nonconstant additive complexity terms in
5310: Proposition~\ref{prop.1} for strings $x$ of length $n$? In item (b),
5311: we are commonly only interested
5312: in $f$ such that $K(f)\le n+O(\log n)$ and
5313: $\log 1/f(x)\le n+O(1)$.
5314: Indeed, for every $f$ there is $f'$ such that
5315: $K(f')\le \min\{K(f),n\}+O(\log n)$,
5316: $\delta(x \mid f')\le \delta(x \mid f)+O(\log n)$,
5317: $\log 1/f'(x)\le\min\{\log1/ f(x),n\}+1$.
5318: Such $f'$ is defined as follows: If
5319: $K(f)>n$ then $f'(x)=1$ and $f'(y)=0$ for every $y\ne x$;
5320: otherwise $f'=(f+U_n)/2$ where $U_n$ stands for
5321: the uniform distribution
5322: on $\{0,1\}^n$.
5323: Then the additive terms in item (b) are $O(\log n)$.
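
The bound on $\log 1/f'(x)$ can be checked directly; the following is
only a sketch, using nothing beyond the definition of $f'$ and $l(x)=n$.
If $K(f)>n$, then $f'(x)=1$ and $\log 1/f'(x)=0$. Otherwise
\[
f'(x) = \frac{f(x)+2^{-n}}{2} \geq \frac{1}{2}\max\{f(x),2^{-n}\},
\qquad \mbox{hence} \qquad
\log 1/f'(x) \leq \min\{\log 1/f(x),\, n\}+1 .
\]
In this second case $f'$ is computed from $f$ and $n$, so that
$K(f') \leq K(f)+K(n)+O(1) \leq K(f)+O(\log n)$.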
5324:
5325: \subsection{Expected Structure Function Equals Distortion--Rate Function}
5326: \label{sec:esf}
5327: In this section we treat the general relation between the expected
5328: value of $h_x(R)$, the expectation taken on a distribution
5329: $f(x)=P(X=x)$ of the random variable $X$ having outcome $x$, and
5330: $D^*(R)$. This involves the development of a rate-distortion theory
5331: for individual sequences and arbitrary computable distortion measures.
5332: Following \cite{VereshchaginV04}, we outline such a theory in
Sections~\ref{sec:spheres}--\ref{sec:ssrev}. Based on this theory, we
5334: present in Section~\ref{sec:esfb} a general theorem
5335: (Theorem~\ref{thm.dresf}) relating Shannon's $D^*(R)$ to the expected
5336: value of $h_x(R)$, for arbitrary random sources and computable
5337: distortion measures. This generalizes Example~\ref{ex.rd=str} above,
5338: where we analyzed the case of the distortion function
5339: \begin{equation}\label{eq.lcfs1}
5340: d(x,Y(x))
5341: = \log |Y(x)|,
5342: \end{equation}
5343: where $Y(x)$ is an $x$-containing finite set,
5344: for the uniform distribution. Below we first extend this example to
5345: arbitrary generating distributions, keeping the distortion function
still fixed to (\ref{eq.lcfs1}). This will prepare us for the general
development in Sections~\ref{sec:spheres}--\ref{sec:ssrev}.
5348: \begin{example}
5349: In Example~\ref{ex.rd=str}
5350: it transpired that
5351: the distortion-rate function is the expected structure function,
5352: the expectation taken over the distribution on the $x$'s.
5353: %Part of the required treatment was already introduced in
5354: %Example~\ref{rem.rd-ksf1}.
5355: If, instead of using the uniform
5356: distribution on $\{0,1\}^n$ we use an arbitrary distribution $f(x)$,
5357: it is not difficult to compute the rate-distortion
5358: function $R^*(D)= H(X)- \sup_{Y:d(X,Y) \leq D} H(X|Y)$ where
$Y$ is a random variable with outcomes that are finite sets. Since $d$
is a special type of Shannon-Fano distortion, with
$d(x,y) = \log 1/P(X=x \mid Y=y) = \log |y|$ if $x \in y$
(that is, $P(X=x \mid Y=y) = 1/|y|$ if $x \in y$, and 0 otherwise),
5362: we have already met
5363: $D^*(R)$ for the distortion measure \eqref{eq.lcfs1} in another guise.
5364: By the conclusion of Example~\ref{ex:berb}, generalized to the random
5365: variable $X$ having outcomes in $\{0,1\}^n$, and $R$ being a rate
5366: in between 0 and $n$, we know that
5367: \begin{equation}\label{eq.DRE}
5368: D^*(R) = H(X)-R.
5369: \end{equation}
\end{example}
In the particular case analyzed above, the code word for a source word
is a finite set containing the source word, and the distortion is the
log-cardinality of the finite set. Considering the set of source words
of length $n$, the distortion-rate function is the diagonal line from
$n$ on the distortion axis to $n$ on the rate axis. The structure
functions of the individual data $x$ of
length $n$, on the other hand, always start at $n$, decrease at a
slope of at least $-1$ until they hit the diagonal from $K(x)$ on the
distortion axis to $K(x)$ on the rate axis, which they must do, and
follow that diagonal henceforth. Above
we proved that the average of the structure function is simply the
straight line, the diagonal, between $n$ and $n$. This is the case,
since the strings $x$ with $K(x) \geq n$ are the overwhelming
majority. All of them have a minimal sufficient statistic (the point
where the structure function hits the diagonal from $K(x)$ to $K(x)$).
This point has complexity at most $K(n)$. The structure function for
all these $x$'s follows the diagonal from about $n$ to $n$, giving
overall an expectation of the structure function close to this
diagonal, that is, the probabilistic distortion-rate function for this
code and distortion measure.
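
The shape just described can be made explicit by a standard argument,
given here only as a sketch; the prefix model $S_R$ is introduced
purely for illustration. For $x$ of length $n$ and $0 \leq R \leq n$,
let $S_R$ be the set of strings of length $n$ that agree with $x$ on
the first $R$ bits. Then $\log |S_R| = n-R$ and
$K(S_R) \leq R + O(\log n)$, so that $h_x(R+O(\log n)) \leq n-R$.
Conversely, every finite set $S \ni x$ satisfies
$K(x) \leq K(S)+\log|S|+O(1)$, hence $h_x(R) \geq K(x)-R-O(1)$.
For $x$ with $K(x) \geq n$ this gives
\[
n-R-O(1) \leq h_x(R), \qquad h_x(R+O(\log n)) \leq n-R ,
\]
that is, the structure function follows the diagonal up to
logarithmic terms.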
5388: \subsubsection{Distortion Spheres}
5389: \label{sec:spheres}
5390: Modeling the data can be viewed as
5391: encoding the data by a model: the data are source words
5392: to be coded, and models are
5393: code words for the data. As before, the set of possible data is
5394: ${\cal X} = \{0,1\}^n$. Let ${\cal R}^+$ denote the set
5395: of non-negative real numbers.
5396: For every model class ${\cal Y}$ (particular set of code words)
5397: we choose an appropriate
5398: recursive function
5399: $d: {\cal X} \times {\cal Y} \rightarrow {\cal R}^+$ defining
5400: the {\em distortion} $d(x,y)$ between data $x \in {\cal X}$ and model $y \in {\cal Y}$.
5401: \begin{remark}[Lossy Compression]
5402: \label{rem:lossy}
5403: \rm
5404: The choice of distortion
5405: function is a selection of which aspects of the data are relevant,
5406: or meaningful, and
5407: which aspects are irrelevant (noise).
5408: We can think of the distortion-rate function as measuring how far the model at
5409: each bit-rate
5410: falls short in representing the data. Distortion-rate theory
5411: underpins the practice of lossy compression.
5412: For example, lossy compression of a sound file gives as ``model''
5413: the compressed file where, among others, the very high and
5414: very low inaudible frequencies have been suppressed. Thus,
the distortion function will penalize the deletion of the inaudible
5416: frequencies but lightly because they are not relevant for the auditory
5417: experience.
5418:
5419: But in the traditional distortion-rate approach, we average twice:
5420: once because we consider
5421: a sequence of outcomes of $m$ instantiations of the same random variable,
5422: and once because we take the expectation
5423: over the sequences. Essentially, the results deal with typical ``random'' data
5424: of certain simple distributions. This assumes that the data to a certain extent
5425: satisfy the behavior of repeated outcomes of a random source.
5426: Kolmogorov \cite{Ko65}:
5427: \begin{quote}
5428: The probabilistic approach is natural in the theory of information
5429: transmission over communication channels carrying ``bulk'' information
5430: consisting of a large number of unrelated or
5431: weakly related messages obeying
5432: definite probabilistic laws. In this type of problem there is
5433: a harmless and (in applied work) deep-rooted tendency to mix up
5434: probabilities and frequencies within sufficiently long time sequence
5435: (which is rigorously satisfied if it is assumed that ``mixing''
5436: is sufficiently rapid). In practice,
5437: for example, it can be assumed that finding the ``entropy''
5438: of a flow of congratulatory telegrams and the channel ``capacity'' required
5439: for timely and undistorted transmission is validly represented by a
5440: probabilistic treatment even with the usual substitution of empirical
5441: frequencies for probabilities. If something goes wrong here,
5442: the problem lies with the vagueness of our ideas of the relationship
5443: between mathematical probabilities and real random events in general.
5444:
5445: But what real meaning is there, for example, in asking how much
5446: information is contained in ``War and Peace''?
5447: Is it reasonable to include the novel in the set of ``possible novels'',
5448: or even to postulate some probability distribution for this set?
5449: Or, on the other hand, must we assume that the individual scenes in
this book form a random sequence with ``stochastic relations'' that damp out
5451: quite rapidly over a distance of several pages?
5452: \end{quote}
5453: Currently, individual data arising in practice are submitted to
5454: analysis, for example sound or video files, where the assumption that
5455: they either consist of a large number of weakly related messages, or
are elements of a set of possible messages that is susceptible to
5457: analysis, is clearly wrong. It is precisely the global related aspects
5458: of the data which we want to preserve under lossy compression. The
5459: rich versatility of the structure functions, that is, many different
5460: distortion-rate functions for different individual data, is all but
5461: obliterated in the averaging that goes on in the traditional
5462: distortion-rate function. In the structure function approach one
5463: focuses entirely on the stochastic properties of one data item.
5464: %Analyzing the situation, we see that (i) the structure functions of
5465: %all typical data items are about the same; (ii) the probability mass
5466: %concentrated on the typical data items is almost one; (iii) a sequence
5467: %of outcomes of i.i.d. distributed random variables (or for that
5468: %matter, ergodic stationary sources) consists primarily of typical data
5469: %items. This is enough to ensure that the distortion-rate function is
5470: %approximately the structure function of a typical data item, which is
5471: %the essence of Theorem~\ref{thm.dresf} below.
5472: \end{remark}
5473: Below we follow \cite{VereshchaginV04}, where we developed a
5474: rate-distortion theory for individual data for general computable
5475: distortion measures, with as specific examples the `Kolmogorov'
5476: distortion below, but also Hamming distortion and Euclidean
5477: distortion. This individual rate-distortion theory is summarized in
5478: Sections~\ref{sec:rdrev} and~\ref{sec:ssrev}. In
Section~\ref{sec:esfb}, Theorem~\ref{thm.dresf},
we connect this individual rate-distortion theory to Shannon's. We
emphasize that the typical data items of i.i.d.\ simple
random variables, or simple ergodic stationary sources, which are the
subject of Theorem~\ref{thm.dresf}, are generally unrelated to the
highly globally structured data we want to analyze using our new
rate-distortion theory for individual data. From the perspective of
5486: lossy compression, the typical data have the characteristics of random
5487: noise, and there is no significant ``meaning'' to be preserved under
5488: the lossy compression. Rather, Theorem~\ref{thm.dresf} serves as a
5489: `sanity check' showing that in the special, simple case of repetitive
5490: probabilistic data, the new theory behaves essentially like Shannon's
5491: probabilistic rate-distortion theory.
5492: \begin{example}\label{ex.11}
5493: \rm
5494: Let us look at various model classes and distortion measures:
5495:
5496: (i) The set of models are the finite sets of finite binary strings.
5497: Let $S \subseteq \{0,1\}^*$ and $|S| < \infty$.
5498: We define $d(x,S) = \log |S|$ if $x \in S$, and $\infty$ otherwise.
5499:
5500: (ii) The set of models are the computable probability density functions $f$
5501: mapping $\{0,1\}^*$ to $[0,1]$.
We define $d(x,f) = \log 1/f(x)$ if $f(x) > 0$, and $\infty$ otherwise.
5503:
5504: (iii) The set of models are the total recursive functions $f$
5505: mapping $\{0,1\}^*$ to ${\cal N}$.
5506: We define $d(x,f) = \min \{ l(d): f(d)=x\}$, and $\infty$ if
5507: no such $d$ exists.
5508:
5509: All of these model classes and accompanying
5510: distortions \cite{VV02}, together with the ``communication exchange'' models
5511: in \cite{BKVV03}, are loosely called {\em Kolmogorov} models
5512: and distortion, since the graphs of their structure functions (individual
5513: distortion-rate functions) are all within a strip---of width
5514: logarithmic in the binary length of the data---of one another.
5515: \end{example}
5516: If ${\cal Y}$ is a model class, then
5517: we consider {\em distortion spheres} of given
5518: radius $r$ centered on $y \in {\cal Y}$:
5519: \[
5520: B_y(r)= \{x: d(x,y) = r\}.
5521: \]
5522: This way, every model class and distortion measure can be treated
5523: similarly to the canonical finite set case, which, however, is
especially simple in that the radius is not variable.
5525: That is, there is only one distortion sphere centered on a given finite set,
5526: namely the one with radius equal to the log-cardinality of that finite set.
5527: In fact, that distortion sphere equals the finite set on which it is
5528: centered.
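\begin{example}
\rm
To illustrate a model class with variable radius, take Hamming
distortion, which is among the distortion measures treated in
\cite{VereshchaginV04}; the unnormalized form used here is an
assumption made for concreteness. Let
${\cal Y} = {\cal X} = \{0,1\}^n$ and let $d(x,y)$ be the number of
positions in which $x$ and $y$ differ. Then, for $r = 0,1, \ldots , n$,
\[
B_y(r) = \{x: d(x,y)=r\}, \qquad |B_y(r)| = {n \choose r} .
\]
For $n=4$, $y=0000$ and $r=1$ we obtain
$B_y(1)=\{1000,0100,0010,0001\}$ and $\log |B_y(1)| = 2$.
In contrast to the finite set case, a single model $y$ is here the
center of $n+1$ different distortion spheres.
\end{example}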
5529:
5530: \subsubsection{Randomness Deficiency---Revisited}
5531: \label{sec:rdrev}
5532: Let ${\cal Y}$ be a model class and $d$ a distortion measure.
5533: Since in our definition the distortion is recursive,
given a model $y \in {\cal Y}$ and radius $r$,
the elements in the distortion sphere
of radius $r$ can be recursively enumerated from the distortion function.
5537: Giving the index of any element $x$ in that enumeration we can find the
5538: element. Hence, $K(x|y,r) \lea \log |B_y(r)|$. On the other hand,
5539: the vast majority of elements $x$ in the distortion sphere have
5540: complexity $K(x|y,r) \gea \log |B_y(r)|$ since, for every constant $c$,
5541: there are only
5542: $2^{\log |B_y(r)|-c} - 1$ binary programs of length $ < \log |B_y(r)|-c$
5543: available, and there are $|B_y(r)|$ elements to be described.
5544: We can now reason as in the similar case of finite set models.
5545: With data $x$ and $r=d(x,y)$,
5546: if $K(x|y,d(x,y))
5547: \gea \log |B_y(d(x,y))|$, then $x$ belongs to every large majority of elements
5548: (has the property represented by that majority)
5549: of the distortion sphere $B_y(d(x,y))$,
5550: provided that property is simple in the
5551: sense of having a description of low Kolmogorov complexity.
5552: \begin{definition}
5553: \rm
5554: The {\em randomness
5555: deficiency} of $x$ with respect to model $y$ under distortion $d$
5556: is defined as
5557: \[
5558: \delta (x \mid y) = \log |B_y (d(x,y))| - K(x|y,d(x,y)).
5559: \]
5560: Data $x$ is {\em typical} for model $y \in {\cal Y}$ (and that model
5561: ``typical'' or ``best fitting'' for $x$) if
5562: \begin{equation}\label{eq.typical}
5563: \delta (x \mid y) \eqa 0.
5564: \end{equation}
5565: \end{definition}
5566: If $x$ is typical for a model $y$, then the shortest way to effectively
5567: describe $x$, given $y$, takes about as many bits as the
5568: descriptions of the great
5569: majority of elements in
5570: a recursive enumeration of the distortion sphere.
5571: So there are no special simple properties that distinguish $x$
5572: from the great majority of elements
5573: in the distortion sphere: they are all typical or random elements
5574: in the distortion sphere (that is, with respect to the contemplated model).
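
The statement that typical elements form the great majority can be
quantified by the counting argument used above; the following is a
sketch. Fix $y$, $r$ and a constant $c \geq 0$. An element
$x \in B_y(r)$ has $\delta (x \mid y) \geq c$ only if
$K(x|y,r) \leq \log |B_y(r)| - c$. There are fewer than
$2^{\log |B_y(r)| - c + 1}$ binary programs of length at most
$\log |B_y(r)| - c$, and therefore
\[
|\{x \in B_y(r): \delta (x \mid y) \geq c \}| < 2^{1-c} \, |B_y(r)| .
\]
So all but a fraction of less than $2^{1-c}$ of the elements of the
distortion sphere have randomness deficiency below $c$.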
5575: \begin{example}
5576: \rm
5577: Continuing Example~\ref{ex.11} by applying \eqref{eq.typical}
5578: to different model classes:
5579:
5580: (i) {\em Finite sets:}
5581: For finite set models $S$, clearly $K(x|S) \lea \log |S|$.
5582: Together with \eqref{eq.typical} we have that $x$ is typical for $S$,
5583: and $S$ best fits $x$, if the randomness deficiency
5584: according to \eqref{eq:randomness-deficiency} satisfies
5585: $\delta(x|S) \eqa 0$.
5586:
5587: (ii) {\em Computable probability density functions:}
5588: Instead of the data-to-model code length $\log|S|$ for
5589: finite set models, we consider the data-to-model code length
5590: $\log 1/f(x)$ (the Shannon-Fano code). The value $\log 1/f(x)$
5591: measures how likely $x$ is under the hypothesis $f$.
5592: For probability models $f$,
5593: define the conditional complexity
5594: $K(x \mid f, \lceil \log 1/f(x) \rceil )$ as follows.
5595: Say that a function
5596: $A$ approximates $f$ if $|A(x,\eps)-f(x)|<\eps$
5597: for every $x$ and every positive rational
5598: $\eps$. Then $K(x \mid f , \lceil \log 1/f(x) \rceil)$ is defined as
5599: the minimum length
5600: of a program that, given $\lceil \log 1/f(x) \rceil$
5601: and any function $A$ approximating $f$
5602: as an oracle, prints $x$.
5603:
5604: Clearly
5605: $K(x|f, \lceil \log 1/f(x) \rceil ) \lea \log 1/f(x)$.
5606: Together with \eqref{eq.typical}, we have that $x$ is typical for $f$,
5607: and $f$ best fits $x$, if
5608: $K(x|f, \lceil \log 1/f(x) \rceil) \gea \log |\{z: \log 1/f(z) \leq
5609: \log 1/f(x)\}|$. The right-hand side set condition is the same
5610: as $f(z) \geq f(x)$, and there can be only $\leq 1/f(x)$ such $z$,
5611: since otherwise the total probability exceeds 1. Therefore,
5612: the requirement, and hence typicality,
5613: is implied by $K(x|f, \lceil \log 1/f(x) \rceil ) \gea \log 1/f(x)$.
5614: Define the randomness
5615: deficiency by
5616: $
5617: \delta (x \mid f) = \log 1/f(x) - K(x \mid f, \lceil \log 1/f(x) \rceil).
5618: $
5619: Altogether, a string $x$ is {\em typical for a distribution} $f$,
5620: or $f$ is the {\em best fitting model} for $x$,
5621: if $\delta (x \mid f) \eqa 0$.
5623:
5624: (iii) {\em Total Recursive Functions:}
5625: In place of $\log|S|$ for finite set models
5626: we consider the data-to-model code length (actually, the distortion
5627: $d(x,f)$ above)
5628: $$\len xf=\min\{l(d):f(d)=x\}.$$
5629: Define the conditional complexity
5630: $K(x \mid f, \len xf )$ as
5631: the minimum length
5632: of a program that, given $\len xf$ and an oracle for $f$,
5633: prints $x$.
5634:
5635: Clearly, $K(x|f, \len xf ) \lea \len xf$.
5636: Together with \eqref{eq.typical}, we have that $x$ is typical for $f$,
and $f$ best fits $x$, if $K(x|f, \len xf ) \gea \log |\{z: \len zf
\leq \len xf \}|$. There are at most $2^{\len xf +1} - 1$
many $z$ satisfying the set condition, since every such $z$ equals $f(d)$
for some $d \in \{0,1\}^*$ with $l(d) \leq \len xf$. Therefore,
5641: the requirement, and hence typicality,
5642: is implied by $K(x|f, \len xf ) \gea \len xf$.
5643: Define the randomness
5644: deficiency by
5645: $
5646: \delta (x \mid f) = \len xf - K(x \mid f, \len xf ).
5647: $
5648: Altogether, a string $x$ is {\em typical for a total recursive
5649: function} $f$, and $f$ is the {\em best fitting recursive function model}
5650: for $x$
5651: if $\delta (x \mid f) \eqa 0$, or written differently,
5652: \begin{equation}\label{eq.typp}
5653: K(x|f, \len xf ) \eqa \len xf.
5654: \end{equation}
5655: Note that since $\len xf$ is given as conditional information,
5656: with $\len xf = l(d)$ and $f(d)=x$, the quantity $K(x|f, \len xf )$
5657: represents the number of bits in a shortest
5658: {\em self-delimiting} description of $d$.
5659: \end{example}
5660:
5661:
5662: \begin{remark}
5663: \rm
5664: We required $\len xf$ in the conditional in \eqref{eq.typp}.
5665: This is the information about
5666: the radius of the distortion sphere centered on the model concerned.
5667: Note that in the canonical finite set model case, as treated
5668: in \cite{Ko74,GTV01,VV02}, every model has a fixed radius which
5669: is explicitly provided by the model itself. But in the
5670: more general model
5671: classes of computable probability density functions, or
5672: total recursive functions, models can have a variable radius.
5673: There are subclasses of the more general models that
5674: have fixed radiuses (like the finite set models).
5675:
5676: (i) In the computable probability density functions one can think of the
5677: probabilities with a finite support, for example $f_n (x) = 1/2^n$
for $l(x)=n$, and $f_n(x)=0$ otherwise.
5679:
5680: (ii) In the total recursive function case one can similarly think
5681: of functions with finite support, for example $f_n (x) = \sum_{i=1}^n x_i$
5682: for $x=x_1 \ldots x_n$, and $f_n(x)=0$ for $l(x) \neq n$.
5683:
5684: The incorporation of the radius in the model will increase the
5685: complexity of the model, and hence of the minimal sufficient statistic
5686: below.
5687: \end{remark}
5688:
5689: \subsubsection{Sufficient Statistic---Revisited}
5690: \label{sec:ssrev}
5691: As with the probabilistic sufficient statistic
5692: (Section~\ref{sec:probstat}), a statistic is a function mapping the
5693: data to an element (model) in the contemplated model class. With some
5694: sloppiness of terminology we often call the function value (the model)
5695: also a statistic of the data. A statistic is called sufficient if the
5696: two-part description of the data by way of the model and the
5697: data-to-model code is as concise as the shortest one-part description
5698: of $x$. Consider a model class ${\cal Y}$.
5699: \begin{definition}
5700: A model $y \in {\cal Y}$ is a {\em sufficient statistic} for $x$ if
5701: \begin{equation}\label{eq.ssm}
5702: K(y, d(x,y))+ \log |B_y(d(x,y))| \eqa K(x).
5703: \end{equation}
5704: \end{definition}
5705:
5706: \begin{lemma}\label{lem.V2}
5707: If $y$ is a sufficient statistic for $x$, then
$K(x \mid y, d(x,y)) \eqa \log |B_y(d(x,y))|$, that is,
5709: $x$ is typical for $y$.
5710: \end{lemma}
5711: \begin{proof}
5712: We can rewrite
5713: $K(x) \lea K(x,y,d(x,y)) \lea K(y,d(x,y))+K(x|y,d(x,y))
5714: \lea K(y, d(x,y))+ \log |B_y(d(x,y))| \eqa K(x)$.
5715: The first three inequalities are straightforward and
5716: the last equality is by the assumption of sufficiency.
5717: Altogether, the first sum equals the second sum, which implies the lemma.
5718: \end{proof}
5719:
5720: Thus, if $y$ is a sufficient statistic for $x$, then $x$ is a typical element
5721: for $y$, and $y$ is the best fitting model for $x$.
5722: Note that the converse implication, ``typicality'' implies
5723: ``sufficiency,'' is not valid. Sufficiency is a special type
5724: of typicality, where the model does not add significant
5725: information to the data, since the preceding proof shows
5726: $K(x) \eqa K(x,y,d(x,y))$. Using the symmetry of information \eqref{eq.soi}
5727: this shows that
5728: \begin{equation}\label{eq.pcondx}
5729: K(y,d(x,y) \mid x ) \eqa K(y \mid x) \eqa 0.
5730: \end{equation}
5731: This means that:
5732:
5733: (i) A sufficient statistic $y$ is determined by the data in the sense
5734: that we need only an $O(1)$-bit program, possibly depending on
5735: the data itself, to compute the model
5736: from the data.
5737:
5738: (ii) For each model class and distortion there is a universal constant $c$
5739: such that for every data item $x$ there are at most $c$ sufficient
5740: statistics.
5741:
5742: \begin{example}
5743: \rm
5744: {\em Finite sets:}
5745: For the model class of finite sets, a set $S$ is a sufficient statistic
5746: for data $x$ if
5747: \[
5748: K(S)+ \log |S| \eqa K(x).
5749: \]
5750:
5751: {\em Computable probability density functions:}
5752: For the model class of computable probability density functions,
5753: a function $f$ is a sufficient statistic
5754: for data $x$ if
5755: \[
5756: K(f) + \log 1/f(x) \eqa K(x).
5757: \]
5758: For the model class of
5759: {\em total recursive functions}, a function $f$ is a
5760: {\em sufficient statistic} for data $x$
5761: if
5762: \begin{equation}\label{eq.ss}
5763: K(x) \eqa K(f) + \len xf .
5764: \end{equation}
5765: Following the above discussion, the meaningful information in $x$
5766: is represented by $f$ (the model) in $K(f)$ bits, and the
5767: meaningless information in $x$ is represented by $d$ (the noise in
5768: the data) with $f(d)=x$ in $l(d) = \len xf$ bits. Note that
5769: $l(d) \eqa K(d) \eqa K(d|f^*)$,
5770: since the two-part
5771: code $(f^*,d)$ for $x$
5772: cannot be shorter than the shortest one-part code of $K(x)$ bits,
5773: and therefore the $d$-part must already be maximally compressed.
5774: By Lemma~\ref{lem.V2}, $\len xf \eqa K(x \mid f^* , \len xf)$,
5775: $x$ is typical for $f$,
5776: and hence $K(x) \eqa K(f) + K(x \mid f^* , \len xf)$.
5777: \end{example}
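
\begin{example}
\rm
As a simple illustration for the finite set case (a sketch, using the
bounds on $K(x)$ for strings of length $n$ quoted earlier), let $x$ be
a string of length $n$ with $K(x) \geq n$, and take $S=\{0,1\}^n$.
Then $K(S) = K(n)+O(1)$ and $\log |S| = n$, so that
\[
K(x) - O(1) \leq K(S)+\log |S| = n + K(n) + O(1) \leq K(x) + K(n) + O(1),
\]
where the first inequality uses $K(x) \leq n+K(n)+O(1)$. Hence $S$ is a
sufficient statistic for $x$ up to an additive term $K(n) = O(\log n)$,
and it satisfies $K(S)+ \log |S| \eqa K(x)$ exactly when
$K(x) \eqa n+K(n)$.
\end{example}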
5778:
5779:
5780:
5781: \subsubsection{Expected Structure Function}
5782: \label{sec:esfb}
5783: We treat the relation between the expected value
5784: of $h_x(R)$, the expectation taken on a
5785: distribution $f(x)=P(X=x)$ of the random variable $X$ having outcome $x$,
5786: and $D^*(R)$, for arbitrary random sources provided the probability mass
5787: function $f(x)$ is recursive.
5788:
5789:
5790: \begin{theorem}\label{thm.dresf}
5791: Let $d$ be a recursive distortion
5792: measure.
5793: Given $m$ repetitions of a random variable $X$ with outcomes
5794: $x \in {\cal X}$ (typically, ${\cal X}= \{0,1\}^n$)
5795: with probability $f(x)$, where $f$ is a total
5796: recursive function, we have
5797: $$
5798: {\bf E} \frac{1}{m} h_{\overline{x}} (mR+K(f,d,m,R)+O(\log n))
5799: \leq D^*_m(R)
5800: \leq {\bf E} \frac{1}{m} h_{\overline{x}} (mR),
5801: $$
where the expectations are taken over $\overline{x}
5803: = x_1 \ldots x_m$ where $x_i$ is the outcome of the $i$th repetition
5804: of $X$.
5805: \end{theorem}
5806: \begin{proof}
5807: As before, let $X_1, \ldots, X_m$ be $m$ independent identically
5808: distributed random variables on outcome space ${\cal X}$.
5809: Let ${\cal Y}$ be a set of code words.
5810: We want to find a sequence of functions $Y_1, \ldots , Y_m:{\cal X}
5811: \rightarrow {\cal Y}$ so that the message $(Y_1(x_1), \ldots,
5812: Y_m (x_m)) \in {\cal Y}^m$ gives as much expected
5813: information about the sequence of outcomes $(X_1=x_1,
5814: \ldots, X_m=x_m)$ as is possible, under the constraint that the message
5815: takes at most $R \cdot m$ bits (so that $R$ bits are allowed on
5816: average per outcome of $X_i$).
5817: Instead of $Y_1, \ldots , Y_m$ above write
5818: $\overline{Y}: {\cal X}^m \rightarrow {\cal Y}^m$.
5819: Denote the cardinality of the range of $\overline{Y}$
5820: by $\rho (\overline{Y})= | \{\overline{Y}(\overline{x}):
5821: \overline{x} \in {\cal X}^m\}|$.
5822: Consider distortion spheres
5823: \begin{equation}\label{eq.lcfs}
5824: B_{\overline{y}}(d) = \{\overline{x}: d(\overline{x},\overline{y}) = d \},
5825: \end{equation}
5826: with $\overline{x} = x_1 \ldots x_m \in {\cal X}^m$
5827: and $\overline{y} \in {\cal Y}^m$.
5828:
5829:
5830: {\em Left Inequality:}
5831: Keeping the earlier notation, for $m$ i.i.d.
5832: random variables $X_1, \ldots ,X_m$, and extending $f$ to
5833: the $m$-fold Cartesian product of $\{0,1\}^n$, we obtain
5834: $D_m^*(R) = \frac{1}{m} \min_{ \overline{Y}: \rho (\overline{Y}) \leq 2^{mR}}
5835: \sum_{\overline{x}}f(\overline{x})
5836: d(\overline{x}, \overline{Y} (\overline{x}))$.
5837: %Assume that $\overline{y} \in Z_m(\overline{x})$ iff
5838: %$Z_m(\overline{y}) = Z_m(\overline{x})$: the distinct
5839: %$Z_m(\overline{x})$'s are disjoint and partition $\{0,1\}^{mn}$
5840: %into disjoint subsets $Z_{m,i}$, with $i=1, \ldots, k$ for
5841: %some $k \leq 2^{mR}$. Denote the elements of this partition
5842: %by $ Z_{(1)} , \ldots , Z_{(k)}$.
5843: By definition of $D_m^*(R)$ it equals the following expression in terms
5844: of a minimal canonical covering of $\{0,1\}^{nm}$ by
5845: disjoint nonempty spheres $B'_{\overline{y}_i}(d_i)$
5846: ($1 \leq i \leq k$) obtained from the possibly overlapping
5847: distortion spheres $B_{\overline{y}_i}(d_i)$ as follows.
5848: Every element $\overline{x}$ in the overlap between two or more spheres
5849: is assigned to the sphere with the smallest radius and removed
5850: from the other spheres. If there is more than
5851: one sphere of smallest radius, then
5852: we take the sphere of least index in the canonical covering.
5853: Empty $B'$-spheres are removed from the $B'$-covering.
5854: If $S \subseteq \{0,1\}^{nm}$, then $f(S)$ denotes $\sum_{x \in S} f(x)$. Now,
5855: we can rewrite
5856: \begin{equation}\label{eq.distpart}
5857: D^*_m(R) =
5858: \min_{\overline{y}_1, \ldots , \overline{y}_k; d_1, \ldots , d_k; k \leq 2^{mR}}
5859: \frac{1}{m}
5860: \sum_{i=1}^k f(B'_{\overline{y}_i}(d_i)) d_i.
5861: \end{equation}
5862: In the structure function setting we consider some individual
5863: data $\overline{x}$ residing
5864: in one of the covering spheres.
Given $m,n,R$ and a program to compute $f$ and $d$, we can compute the
centers $\overline{y}_1, \ldots, \overline{y}_k$ of the covering spheres
and their radii $d_1, \ldots , d_k$,
and hence the $B'$-sphere canonical covering. In this
covering we can identify every pair $(\overline{y}_i, d_i)$ by
its index $i \leq 2^{mR}$. Therefore,
$K(\overline{y}_i, d_i) \leq mR + K(f,d,m,R)+O(\log n)$ ($1 \leq i \leq k$).
5872: For $\overline{x} \in B'_{\overline{y}_{i}}(d_i)$
5873: we have $h_{\overline{x}}(mR + K(f,d,m,R)+O(\log n)) \leq d_i$.
5874: Therefore,
5875: ${\bf E} \frac{1}{m}h_{\overline{x}}(mR + K(f,d,m,R)+O(\log n)) \leq D^*_m(R)$,
5876: the expectation taken over
5877: $f(\overline{x})$ for $\overline{x} \in \{0,1\}^{mn}$.
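Spelled out, and writing $R' = mR + K(f,d,m,R)+O(\log n)$ as a shorthand
(our notation, introduced only for this display), the last step sums the
pointwise bound $h_{\overline{x}}(R') \leq d_i$ over the cells of a canonical
covering that achieves the minimum in \eqref{eq.distpart}:
$$
{\bf E} \frac{1}{m} h_{\overline{x}}(R')
= \frac{1}{m} \sum_{i=1}^k \;
\sum_{\overline{x} \in B'_{\overline{y}_i}(d_i)}
f(\overline{x}) h_{\overline{x}}(R')
\leq \frac{1}{m} \sum_{i=1}^k f(B'_{\overline{y}_i}(d_i)) d_i
= D^*_m(R).
$$
The first equality uses only that the $B'$-spheres are disjoint and
together cover $\{0,1\}^{nm}$.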
5878:
5879:
5880: {\em Right Inequality:}
5881: Consider a covering of $\{0,1\}^{nm}$
5882: by the (possibly overlapping)
5883: distortion spheres $B_{\overline{y}_i}(d_i)$
satisfying $K(B_{\overline{y}_i}(d_i) | mR) < mR-c$, with $c$ an
appropriate constant chosen so that the remainder of the argument
goes through.
If more than one sphere, with different (center, radius)-pairs,
represents the same subset of $\{0,1\}^{nm}$, then
we eliminate all of them except the one with the smallest radius.
If there is more than one such sphere, then we keep only the one
with the lexicographically least center. From this covering we obtain
5892: a canonical covering
5893: by nonempty disjoint spheres $B'_{\overline{y}_i} (d_i)$
5894: similar to that in the previous paragraph,
5895: ($1 \leq i \leq k$).
5896:
5897: For every $\overline{x} \in \{0,1\}^{nm}$
5898: there is a unique
5899: sphere $B'_{\overline{y}_i}(d_i) \ni \overline{x}$ ($1 \leq i \leq k$).
5900: Choose the constant $c$ above so that
5901: $K(B'_{\overline{y}_i}(d_i) |mR ) < mR$. Then,
5902: $k \leq 2^{mR}$.
5903: Moreover, by construction, if $B'_{\overline{y}_i} (d_i)$
5904: is the sphere containing $\overline{x}$, then
5905: $h_{\overline{x}} (mR)= d_i$.
Define functions $\gamma: \{0,1\}^{nm} \rightarrow {\cal Y}^m$ and
$\delta: \{0,1\}^{nm} \rightarrow {\cal R}^+$ by
5908: $\gamma(\overline{x}) = \overline{y}_i$ and $\delta (\overline{x}) = d_i$
5909: for $\overline{x}$ in the sphere $B'_{\overline{y}_i}(d_i)$.
5910: Then,
5911: \begin{equation}\label{eq.lb2}
5912: {\bf E} \frac{1}{m} h_{\overline{x}} (mR) =
5913: \frac{1}{m} \sum_{\overline{x} \in \{0,1\}^{mn}}
5914: f(\overline{x}) d(\overline{x}, \gamma(\overline{x}))
= \frac{1}{m} \sum_{i=1}^{k}
%\exists_{\overline{x}} [y= \gamma(\overline{x}),d = \delta(\overline{x})]}
f(B'_{\overline{y}_i}(d_i)) d_i .
5918: \end{equation}
The distortion $D^*_m (R)$ achieves the minimum of the expression in
the right-hand side of \eqref{eq.distpart}.
Since $K(B'_{\gamma( \overline{x})} (\delta(\overline{x}))|mR) < mR$,
the cover in the right-hand side of \eqref{eq.lb2}
is a possible partition satisfying the constraints of the expression being
minimized in the right-hand side of
\eqref{eq.distpart}, and hence majorizes the minimum $D^*_m(R)$. Therefore,
5926: ${\bf E} \frac{1}{m} h_{\overline{x}} (mR) \geq D^*_m(R)$.
5927: \end{proof}
5928:
5929: \begin{remark}
5930: \rm
A sphere
is a subset of $\{0,1\}^{nm}$. The same subset may correspond
to more than one sphere, with different centers and radii:
$B_{\overline{y}_0}(d_0) = B_{\overline{y}_1}(d_1)$ with
$(\overline{y}_0,d_0) \neq (\overline{y}_1,d_1)$.
Hence, $K(B_{\overline{y}} (d))
\leq K(\overline{y},d) + O(1)$, but possibly
$K(\overline{y},d) > K(B_{\overline{y}} (d))+O(1)$.
5939: However, in the proof we constructed the ordered sequence of $B'$
5940: spheres such that every sphere uniquely corresponds to a
5941: (center, radius)-pair. Therefore, $K(B'_{\overline{y}_i}(d_i)|mR)
5942: \eqa K(\overline{y}_i, d_i | mR)$.
5943: \end{remark}
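
To illustrate the remark with a toy instance (the parameters are our own
choice and are not taken from the text above), take $nm=2$ and let $d$ be
Hamming distance, assuming the code words are also binary strings of
length $2$. With spheres as defined in \eqref{eq.lcfs} we have
$$
B_{00}(2) = \{ \overline{x} : d(\overline{x},00) = 2 \} = \{11\}
= \{ \overline{x} : d(\overline{x},11) = 0 \} = B_{11}(0),
$$
so the two distinct pairs $(00,2)$ and $(11,0)$ describe the same subset.
The elimination rule in the proof of Theorem~\ref{thm.dresf} keeps only the
pair with the smaller radius, here $(11,0)$.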
5944:
5945:
5946: \begin{corollary}\label{cor.esf}
5947: It follows from the above theorem that, for
5948: a recursive distortion function $d$:
5949: (i) $
5950: {\bf E} h_{x} (R+K(f,d,R)+O(\log n))
5951: \leq D^*_1 (R)
5952: \leq {\bf E} h_x (R)
5953: $,
5954: for outcomes of a single repetition of random variable $X =x$
5955: with $x \in \{0,1\}^n$,
5956: the expectation taken over $f(x)=P(X =x)$; and
5957:
5958: (ii) $\lim_{m \rightarrow \infty} {\bf E} \frac{1}{m} h_{\overline{x}} (mR)
5959: = D^*(R)$
5960: for outcomes $\overline{x} = x_1 \ldots x_m$
5961: of i.i.d. random variables $X_i =x_i$ with $x_i \in \{0,1\}^n$ for
5962: $1 \leq i \leq m$,
5963: the expectation taken over $f(\overline{x})=P(X_i=x_i, i=1, \ldots, m)$
5964: (the extension of
5965: $f$ to $m$ repetitions of $X$).
5966: \end{corollary}
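
Item (i) is the case $m=1$ of Theorem~\ref{thm.dresf}.
For Item (ii), one possible route is sketched here; it is only an outline,
and it rests on assumptions that are ours rather than part of the statement
above: $n$, $f$ and $d$ are fixed as $m$ grows; $\eps_m > 0$ is a sequence
with $\eps_m \rightarrow 0$ and
$m \eps_m \geq K(f,d,m,R-\eps_m)+O(\log n)$; and both $D^*_m(R)$ and
$D^*_m(R-\eps_m)$ tend to $D^*(R)$.
Since $h_{\overline{x}}$ is non-increasing in its argument, the left
inequality of the theorem at rate $R-\eps_m$, combined with the right
inequality at rate $R$, gives
$$
D^*_m(R)
\leq {\bf E} \frac{1}{m} h_{\overline{x}} (mR)
\leq {\bf E} \frac{1}{m} h_{\overline{x}}
(m(R-\eps_m)+K(f,d,m,R-\eps_m)+O(\log n))
\leq D^*_m(R-\eps_m),
$$
so that the middle expectation is squeezed between two quantities that both
converge to $D^*(R)$.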
5967:
5968: This is the sense in which the expected value of the structure function
5969: is asymptotically equal to the value of the distortion-rate function,
5970: for arbitrary computable distortion measures.
5971: In the structure function approach we dealt with only two
5972: model classes, finite sets and computable probability density functions,
5973: and the associated quantities to be minimized, the log-cardinality
5974: and the negative log-probability, respectively. Translated into
5975: the distortion-rate setting, the models are code words
and the quantities to be minimized are distortion measures.
5977: In \cite{VV02}
5978: we also investigate the model class of total recursive functions,
5979: and in
5980: \cite{BKVV03} the model class of communication protocols. The associated
5981: quantities to be minimized are then function arguments and communicated
5982: bits, respectively. All these models are equivalent up to logarithmic
5983: precision in argument and value of the corresponding structure functions,
5984: and hence their expectations are asymptotic to the distortion-rate
5985: functions of the related code-word set and distortion measure.
5986:
5987:
5988: \commentout{
5989: \begin{remark}
5990: \rm
5991: Suppose we extend the structure function from $h_x(R)=S \ni x$
5992: to $h_x(R) = p$, where $p$ is a distribution on $x$-containing finite
5993: $S \subseteq \{0,1\}^*$. The classic case treated above
5994: is equivalent to $p(S)=1$ for some $x$-containing finite set
5995: of least cardinality with $K(S) \leq R$. Then, given a random
variable $X$ we have a joint probability $q(X,{\bf S})$ where
5997: ${\bf S}$ denotes the set of finite subsets of $\{0,1\}^*$.
5998: It may be possible to repeat the analysis above in this setting,
5999: and then combine the equivalent of Corollary~\ref{cor.esf} Item (ii)
6000: with Theorem~\ref{thm:rd} to express
6001: the expected complexity $R$ as a function of maximal
6002: allowed expected distortion of $X$ in terms of ${\bf S}$
6003: as the infimum of the mutual information between $X$ and ${\bf S}$
6004: subject to this constraint.
6005: \end{remark}
6006: }
6007:
6008:
6009: \commentout{
6010: \section{Rate Distortion---Continued}
6011: \subsection{Deterministic Rate Distortion}
6012: To obtain the beautiful Theorem~\ref{thm:rd}, we
6013: needed to consider (a) the limit of the average outcome of
repetitions of the same i.i.d. probabilistic scenario, the limit taken
as the number of repetitions grows unboundedly
6016: (in the definition of $D^*(R)$), and (b) randomization of the coding process
6017: (in the minimization (\ref{eq:rd})).
6018: From both perspectives, $D^*(R)$ and $R^*(D)$ are hard to compute.
6019: There exist clever algorithms
6020: %such as the {\em Blahut-Arimoto
6021: % algorithm\/} \cite{CT91}
6022: to compute $R^*(D)$, but these are not
6023: always practical. It therefore seems useful to simplify matters.
6024:
Repetition and
averaging are unavoidable
in, say, the Law of Large Numbers, which cannot be expressed otherwise.
6028: However, the minimal rate at which messages can be sent under
6029: distortion constraints makes perfect sense for the individual unrepeated
6030: event and deterministic coding processes.
6031: It turns out that if the distortion function is
6032: `regular' (in a sense to be defined below),
6033: then it becomes meaningful to consider `unrandomized' bounds on the
6034: rate distortion, which are computationally easier to
6035: handle and---to us---also easier to interpret.
6036:
6037: \begin{definition}
6038: \rm
6039: A distortion function
6040: $d: {\cal X} \times {\cal Y} \rightarrow [0,\infty]$ is
6041: {\em regular\/} if
6042: \begin{enumerate}
6043: \item ${\cal Y}$ is a convex space;
6044: \item For each fixed $x$, the function $h_x(y)=d(x,y)$ is convex.
6045: \end{enumerate}
6046: \end{definition}
6047: The set of code words ${\cal Y}$ can be a convex subset of
6048: the real numbers, but also the family of all
probability distributions on some domain. But we are commonly
in one of the following two situations:
6051: (a) ${\cal Y}$ is finite or countable; or
6052: (b) ${\cal Y}$ is uncountably infinite and $d$ is regular.
6053:
6054: \begin{definition}
6055: \rm
6056: Let $X$ be a random variable with outcomes in ${\cal X}$
and let $Y$ be a function $Y: {\cal X} \rightarrow {\cal Y}$.
6058: We abuse notation by denoting the random variable
6059: $Y(x)$ induced by the random variable $X$ by ``$Y$''. Then it makes
6060: sense to talk about the entropy $H(Y)$. We define
6061: \begin{enumerate}
6062: \item The {\em deterministic distortion-rate function\/}:
6063: \begin{equation}
6064: \label{eq:udr}
6065: D^\circ(R) :=
6066: \inf_{Y: H(Y) \leq R} {\bf E}[d(X,Y)].
6067: \end{equation}
6068: \item The {\em deterministic rate-distortion function\/}:
6069: \begin{equation}
6070: \label{eq:urd}
6071: R^\circ(D) := \inf_{Y: {\bf E}[d(X,Y)] \leq D}
6072: H(Y).
6073: \end{equation}
6074: \end{enumerate}
6075: \end{definition}
6076: It is easy to see that $D^\circ(R)$ must be convex and non-increasing.
6077: Therefore, $R^\circ(D)$ must be the
6078: inverse of $D^\circ(R)$, itself also convex and non-increasing.
6079:
6080: \paragraph{Relating $R^*(D)$ and $R^\circ(D)$:}
6081: Using the definition $I(X;Y) = H(Y) - H(Y|X)$ (Section~\ref{sec:mutual}) we can rewrite
6082: (\ref{eq:rd}) as
6083: $$
6084: R^*(D) = \inf_{Y: {\bf E}[d(X,Y)] \leq D}
6085: H(Y) - H(Y|X),
6086: $$
6087: the infimum taken over randomized $Y$.
6088: If $Y$ is a deterministic function of $X$,
6089: then $H(Y|X) = 0$, and we obtain (\ref{eq:urd}). By
6090: \eqref{eq:rd} and \eqref{eq:ird} $R^*(D)=R^{(I)}(D)$ where the latter
6091: is defined as the right-hand side of the above
6092: equality, but using randomized codes.
6093: Since $R^\circ(D)$ is restricted to
6094: deterministic codes, we have
6095: $R^*(D) \leq R^\circ(D)$.
6096:
6097: \paragraph{Relating $D^*(R)$ and $D^\circ(R)$:}
6098: In our formulation of the basic rate-distortion problem,
6099: before we turned to independent
6100: repetitions, we wanted to minimize distortion under the constraint
6101: that only $2^R$ messages are to be used in a one-shot
6102: setting. It is equivalent to using
6103: the best code under the constraint that only {\em fixed length\/}
6104: codes (using $R$ bits per message) are used. In that case, no matter
6105: what message is sent, the actual number of bits will also be $R$.
6106: Comparing this to
6107: (\ref{eq:udr}), and using the
6108: noiseless-coding interpretation of entropy
6109: (Theorem~\ref{thm:noiseless}), we see that the only difference is that
6110: in the definition of $D^\circ(R)$ we are
6111: allowed to use any code with {\em expected\/} (rather than
6112: actual) code length not larger than $R$ bits.
6113:
6114: But, if we consider repeated scenarios, we can also think
6115: of $D^\circ(R)$ in terms of actual rather than expected code lengths.
6116: By the law of large numbers, we know that if we consider independent
6117: repetitions of the same scenario and
6118: we encode the vector of realized $n$ values,
we can achieve an actual codelength within $o(n)$ of the expected
6120: code-length $nH(Y)$ with probability arbitrarily close to $1$.
6121: Using a code $Y$ satisfying $H(Y) \leq R$ and range ${\cal Y}$, we
6122: map the outcomes of $n$ repetitions of $X$,
6123: mapping $(x_1, \ldots, x_n)$
6124: to $(Y(x_1), \ldots, Y(x_n))$. Given a tolerance $\delta >0$,
6125: we only reserve codewords for the
$2^{n (H(Y)+ \delta)}$ most probable vectors $(y_1, \ldots, y_n)$.
The code length for each of these vectors will be $n (H(Y) + \delta)$. Then, with probability
approaching $1$ as $n$ increases, the realized sequence
of outcomes $(x_1, \ldots, x_n)$ is such
that $(Y(x_1), \ldots, Y(x_n))$ has a code word of length
$n (H(Y) + \delta)$. This
code uses at most $R + \delta$ bits per $X_i$, and it is easy to see that
it achieves distortion ${\bf E}[d(X,Y)]$.
6134:
6135: This means that $D^\circ(R)$ can be interpreted in two ways: (a) we look
6136: at codes that use at most $R$ bits per message; or (b)
6137: we consider i.i.d. repetitions of the same random variable
6138: as in the definition of $D^*(R)$, but we restrict ourselves to
6139: using the same deterministic coding function $Y$ for each repetition.
6140: Therefore,
6141: $D^*(R) \leq D^\circ(R)$.
6142: \commentout{
6143: Let $D_{\min} = \inf_{R} D^*(R)$ and $D_{\max} = D^*(0)$.
6144: We call ${\cal Y}$ {\em distortion-continuous} relative to $D$
6145: if for all $D \in (D_{\min}, D_{\max})$, there exists a
6146: random variable $Y: {\cal X} \rightarrow {\cal Y}$ with ${\bf E}
6147: [d(X,Y)]= D$.
6148: NOTE PAUL: THE FOLLOWING RESULT SEEMS NEW (ALTHOUGH NOT AT ALL HARD TO PROVE)
6149: \begin{theorem}
6150: \label{thm:simplerd}
6151: Suppose ${\cal Y}$ is distortion-continuous and $d$
6152: is the Shannon-Fano distortion $d(x,y) = \log 1/ p(x\mid y)$. Then:
6153: \begin{enumerate}
6154: \item For all $D \geq 0$,
6155: \begin{equation}
6156: \label{eq:drentropy}
6157: R^*(D)
6158: = \inf_{{Y}: {\cal X} \rightarrow {\cal Y} \; ; \; H(X \mid{Y}) \leq D } H(Y),
6159: \end{equation}
6160: so that only non-randomized estimates $Y$ have to be considered;
6161: \item For all $R \geq 0$,
6162: \begin{equation}
6163: \label{eq:drentropy}
6164: D^*(R) = \inf_{{Y}: {\cal X} \rightarrow {\cal Y} \; ; \; H({Y}) \leq R } H(X \mid Y),
6165: \end{equation}
6166: so that $D^*(R)$ can be directly computed from $R$, without taking the
6167: large $n$ limit as in (\ref{eq:dr}).
6168: \end{enumerate}
6169: \end{theorem}
6170: \begin{proof}
6171: The theorem follows easily from the following lemma.
6172: \begin{lemma}
6173: Suppose there exists a deterministic function $Y: {\cal X} \rightarrow {\cal Y}$ with $H( X \mid Y) = D$. Then
6174: \begin{equation}
6175: \label{eq:lem}
6176: \inf_{f'(y'|x) : \sum_{x \in {\cal X}, y' \in {\cal Y}}
6177: f(x) f'(y'|x) [ \log 1/ p(x|y')] \leq D}I(X; Y') = I(X; Y) = H(Y).
6178: \end{equation}
6179: \end{lemma}
6180: Here the expression over which the minimum is taken should be read as
6181: in Theorem~\ref{thm:rd}, i.e. the minimum is over all conditional
6182: distributions $P'(Y' = \cdot \mid X = \cdot)$ satisfying
6183: $$
6184: {\bf E}_{X \sim P} {\bf E}_{Y'|X \sim P'} [ \log 1/ P(X|Y')] = H(X| Y') \leq D.
6185: $$
6186: \begin{proof}
6187: Note that $I(X;Y) = H(Y) - H(Y | X)$ (Section~\ref{sec:mutual}).
6188: Since $Y$ is a deterministic function of $X$, $H(Y|X) = 0$; this
6189: shows the second equality in (\ref{eq:lem}). For the first equality,
6190: consider the space
6191: ${\cal X} \times {\cal Y}$, in which $Y'$ is a random variable. We
6192: have $I(X;Y') = H(X) - H(X| Y')$. Since $H(X)$ does not depend on
6193: $p(y' \mid x)$, we have
6194: $$\inf_{p(y' \mid x): H(X| Y') \leq D} I(X; Y') =
6195: H(X) + \inf_{p(y' \mid x): H(X| Y') \leq D} \{ - H(X|Y') \}
= H(X) - D = I(X;Y).
6197: $$
6198: \end{proof}
6199: \end{proof}
6200: }
6201: \subsection{Rate Distortion and Estimators}
6202: Let $X$ be a random variable with set of outcomes ${\cal X}$.
6203: Let $\Theta$ be a {\em parameter space}.
6204: Suppose we observe a sample $(x_1, \ldots, x_n) \in {\cal X}^n$.
6205: A statistical {\em model family\/} ${\cal M}$ is defined
6206: by ${\cal M} = \{ p(\cdot
6207: , \theta) \mid \theta \in \Theta\}$, where
6208: $p$ is a joint distribution over ${\cal X}^n$ and $\Theta$.
6209: For every parameter
6210: $\theta$, the function $p_{\theta} (x_1, \ldots , x_n)=p((x_1, \ldots , x_n)
6211: \mid \theta)$ is a possibly different distribution on ${\cal X}^n$.
6212: For example, $\theta \in [0,1]$ represents the bias of a coin
6213: with outcomes in ${\cal X}=\{0,1\}$ per trial. Then the model family
6214: is that of the Bernoulli distributions.
6215: A statistical {\em estimator\/} $\hat{\theta}$
6216: is a function
6217: $\hat{\theta}: {\cal X}^n \rightarrow \Theta$,
6218: mapping each possible sample of $n$ outcomes
6219: into a value in $\Theta$.
6220: The name `estimator' comes from the statistical literature, in which
6221: $\hat{\theta}(x_1 , \ldots , x_n)$ is interpreted as an `estimate' of the data
6222: generating mechanism $\theta$. A typical example is
6223: the maximum likelihood estimator; see below.
For convenience, let $\overline{X} = X^n$ denote the random variable
with outcomes $\overline{x}$
in the sample space $\overline{\cal X} = {\cal X}^n$.
6227: We may now consider the distortion function:
6228: $
6229: d: \overline{\cal X} \times
6230: \Theta \rightarrow [0,\infty],
6231: $
6232: defined by
6233: \begin{equation}
6234: \label{eq:absdist}
6235: d(\overline{x}, \theta) = \log 1/ p(\overline{x} \mid \theta)
6236: \end{equation}
6237: Note however that it is {\em not\/}
6238: identical to the `Shannon-Fano distortion' as
6239: in Example~\ref{ex:reconcile}. We explain the difference
6240: below in (\ref{eq:datacode}).
6241:
6242: The expected distortion ${\bf E} [d(\overline{X},
6243: \Theta)]$ requires the distribution $p(\overline{x} \mid \theta)$.
6244: This distribution can arise in different ways.
6245: We first consider a
6246: Bayesian analysis, in which we assume that the statistician employs
6247: some prior distribution $W$ on $\Theta$. This $W$
6248: indicates the statistician's prior `degree of belief' in the various
6249: elements of $\Theta$. Assumption of $W$ induces a unique distribution
6250: ${\Pr}_{\text{Bayes}}$ on $\overline{\cal X}$, the so-called `Bayesian
6251: marginal likelihood' distribution:
6252: \begin{equation}
6253: \label{eq:bayesmarg}
6254: {\Pr}_{\text{Bayes}}(\overline{x} ) = \int_{\theta \in \Theta} p(\overline{x} \mid \theta)
6255: d W(\theta),
6256: \end{equation}
6257: where, in case $\Theta$ is discrete, the integral is replaced by a sum.
6258: \begin{example}
6259: \label{ex:markov}
6260: \rm
6261: A simple example of a statistical model with continuous $\Theta$
6262: is the {\em Bernoulli process}
6263: ${\cal M}_0 = \{ p(\overline{x} \mid \theta) : \overline{x}
6264: \in \overline{\cal X}, \theta \in \Theta \}$,
6265: where ${\cal X}=\{0,1\}$, $ \Theta = [0,1]$,
6266: $\overline{x}=(x_1, \ldots , x_n)$,
6267: $p(\overline{x} \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta)$, and the
6268: joint probability $p(x,\theta)$ is induced by
6269: the uniform
6270: prior $W(\theta) = \theta$.
6271: Then, $p(x_i \mid \theta)$ is the conditional probability that the random
6272: variable $X_i$ has outcome $x_i$ when the model has parameter $\theta$ and
6273: we have by definition of the Bernoulli process that
6274: $p(1 \mid \theta) = \theta$ and $p(0 \mid \theta)= 1- \theta$.
6275: This family has a single parameter, that is, $\theta$.
6276: In general, we do not restrict ourselves to finitely parameterizable
6277: families.
6278:
6279: An example of a statistical model with both continuous $\Theta$
6280: and unbounded number of parameters is the model family of
6281: {\em Markov chains} ${\cal M}$ defined as follows:
6282: ${\cal M} = \bigcup_k {\cal M}_k$, $\Theta =
6283: \bigcup_k \Theta_k$ where $\Theta_k = [0,1]^{2^k}$, and
6284: $${\cal M}_k = \{ p(\cdot \mid \theta) ; \theta \in \Theta_k \}$$
6285: consists of the family of $k$-th order Markov chains for alphabet
6286: ${\cal X} = \{0,1\}$ ($k=0,1, \ldots$).
6287: The subfamily ${\cal M}_0$ is the Bernoulli family
6288: introduced before. A prior on ${\cal M}$ typically takes a hierarchical
6289: form: we first specify a prior $W$ (with probability density function
6290: $w$) on parameter space $\Theta$. This induces
6291: a prior $W_k$
6292: with associated probability density $w_k$ on
6293: every fixed-number parameter space
6294: $\Theta_k$ ($k=0,1, \ldots$). It also induces a probability
density $w_{\Theta}(k) = \int_{\theta \in \Theta_k} w (\theta) d \theta$.
6296: Then, (\ref{eq:bayesmarg}) can be rewritten as
6297: \begin{equation}
6298: {\Pr}_{\text{Bayes}}(\overline{x} ) =
6299: \sum_k w_{\Theta}(k) \int_{\theta \in \Theta_k}
6300: w_k(\theta) p(\overline{x} \mid \theta) d \theta.
6301: \end{equation}
6302: \end{example}
6303:
6304: With the given prior distribution, the
6305: random variable $\overline{X}$ is distributed according
6306: to ${\Pr}_{\text{Bayes}}$,
6307: and the expected distortion of an estimator $\hat{\theta}$
6308: becomes well-defined and equal to
6309: \begin{align}
6310: \label{eq:datacode}
6311: {\bf E}_{{\Pr}_{\text{Bayes}}} [d(\overline{X},
6312: \hat{\theta}( \overline{X}))] & = {\bf E}_{{\Pr}_{\text{Bayes}}}
6313: [ - \log p(\overline{X} \mid \hat{\theta}(\overline{X}))]
6314: \\ & =
6315: \nonumber
\sum_{\overline{x}} {\Pr}_{\text{Bayes}} (\overline{x})
[- \log p(\overline{x} \mid \hat{\theta}(\overline{x})) ].
6319: \end{align}
6320: Below we use $H$ to refer to entropy with respect to
6321: ${\Pr}_{\text{Bayes}}$, and we use $\hat{\Theta}$ to refer to the
6322: range of the estimator $\hat{\theta}$.
6323: \begin{remark}
6324: \rm
6325: It is important to realize
6326: that (\ref{eq:datacode}) is {\em not\/} equal to
6327: \begin{align*}
6328: H(\overline{X} \mid \hat{\theta}(\overline{X})) = &
6329: {\bf E}_{{\Pr}_{\text{Bayes}}} [- \log
6330: {\Pr}_{\text{Bayes}}(\overline{X} \mid \hat{\theta}(\overline{X}))]
6331: \\ = &
\sum_{\theta \in \hat{\Theta}}
{\Pr}_{\text{Bayes}} ( \{ \overline{y} : \hat{\theta}(\overline{y})
= \theta\} )
\\& \; \; \; \left(\sum_{\overline{x} : \hat{\theta}(\overline{x}) = \theta}
{\Pr}_{\text{Bayes}} (
\overline{x} \mid \theta )
[- \log {\Pr}_{\text{Bayes}} (
\overline{x} \mid \theta ) ] \right).
6340: \end{align*}
6341:
6342: %Note that $P(\cdot ; \theta)$ appearing inside
6343: %the expectation (\ref{eq:datacode}) is not $\Pr_{\text{Bayes}(\cdot
6344: % \mid \hat{\theta}(x^n) = \theta)}$.
6345: Thus, the expected distortion (\ref{eq:datacode})
6346: is the expected code-length of $\overline{x}$,
6347: where the Shannon-Fano code for distribution $p(\overline{x} \mid
6348: \hat{\theta}(\overline{x}))$, rather than the Shannon-Fano code for the actual
6349: conditional distribution ${\Pr}_{\text{Bayes}}(\overline{x} \mid
6350: \hat{\theta}(\overline{x}))$ is used. Therefore, the present development is
6351: quite different from Example~\ref{ex:reconcile}.
6352: \end{remark}
6353: From (\ref{eq:udr}) we see that the deterministic
6354: distortion-rate function for (\ref{eq:absdist}) is given by
6355: \begin{equation}
6356: \label{eq:drentropyb}
6357: D^\circ(R) = \inf_{{\hat{\theta}}:
6358: H(\hat{\theta}(\overline{X})) \leq R } {\bf E}_{{\Pr}_{\text{Bayes}}}
6359: [ - \log p(\overline{X} \mid \hat{\theta}(\overline{X}))].
6360: \end{equation}
6361: Clearly, $D^\circ(R)$ is non-increasing in $R$.
6362: We may conjecture that $D^\circ(R) =
6363: D^*(R)$ but this is not true:
6364: using results in \cite{CT91}, Section 13.7, it is not hard to show
6365: that always $D^*(R) \leq D^\circ(R)$ and typically, $D^*(R) < D^\circ(R)$.
6366: This means that randomized estimators can typically
6367: achieve a lower distortion than deterministic estimators,
6368: for any given rate.
6369: Deterministic estimators are appealing since they allow for a
6370: clear `two-part code' interpretation of the distortion process.
6371:
6372: \paragraph{Bayes Mean Structure function and MML:}
6373: Let $\hat{\theta}_R:
6374: {\cal X}^n \rightarrow {\bf \Theta}$ denote an estimator
6375: $\hat{\theta}$ achieving $D^\circ(R)$ for given $R$.
6376: Note that
6377: $$
6378: D^\circ(R) = {\bf E}_{{\Pr}_{\text{Bayes}}}
6379: [ - \log p(\overline{X} \mid \hat{\theta}_R(\overline{X}))].
6380: $$
6381: We call
6382: $D^\circ(R)$ the {\em Bayes mean structure function}, in analogy to the
6383: Kolmogorov structure function to be introduced in the next section.
6384: We can interpret
6385: \begin{equation}
6386: \label{eq:mml}
6387: H(\hat{\theta}_R(\overline{X})) + D^\circ(R) = {\bf E}_{{\Pr}_{\text{Bayes}}}
6388: [ - \log {\Pr}_{\text{Bayes}}(\hat{\theta}_R(\overline{X}))] + D^\circ(R)
6389: \end{equation}
6390: as the total number of bits it takes, on average, to encode the
6391: outcomes of the random variable $\overline{X}$
6392: using the cleverest possible two-part code under the constraint that
6393: $H(\hat{\theta}(\overline{X})) \leq R$. The first part of this code
6394: is the estimator $\hat{\theta}_R (\overline{x})$, which we interpret
6395: as corresponding to the proposed {\em model} for the data $\overline{x}$,
6396: at a Bayes mean cost of $H(\hat{\theta}_R(\overline{X})) \leq R$ bits.
The second part of this code has length equal to the distortion
$- \log p(\overline{x} \mid \hat{\theta}_R(\overline{x}))$,
and corresponds to the {\em data-to-model} code,
the Shannon-Fano code for the data $\overline{x}$ conditional on
the estimate of the model $\hat{\theta}_R(\overline{x})$,
at a Bayes mean cost of $D^\circ(R)$ bits.
6403:
6404: There exist estimators $\hat{\theta}_R$ for every $R$. A natural question is
6405: to consider the rate $R^*$ for which
6406: an estimator $\hat{\theta}_{R^*}$ minimizes
6407: the value of (\ref{eq:mml}) over all $R$:
6408: $$
H(\hat{\theta}_{R^*} (\overline{X})) + D^\circ(R^*) =
\min_R \{ H(\hat{\theta}_R(\overline{X})) + D^\circ(R) \}.
$$
The estimator $\hat{\theta}_{R^*}$ minimizes,
over {\em all\/} possible estimators,
the expected two-part code-length.
6415: It turns out that the estimator
6416: $\hat{\theta}_{R^*}$ is well-known, albeit not in terms of rate-distortion:
6417: \begin{proposition}
6418: \label{prop:mml}
6419: $\hat{\theta}_{R^*}$ is identical to
6420: the {\em strict MML estimator\/} of Wallace \& Boulton
6421: \cite{WallaceB75,WallaceF87}.
6422: \end{proposition}
6423: \begin{proof}
6424: Immediate from the definition of strict MML.
6425: \end{proof}
Wallace and Freeman \cite{WallaceF87} do not give instructions
for the case that the $\hat{\theta}_{R^*}$
minimizing (\ref{eq:mml}) is not unique.
If there is more than one $R$
for which the minimum of (\ref{eq:mml}) is achieved, then it is natural,
also for reasons that will become clear below, to
define $R^*$ to be
the {\em least\/} such $R$.
It has been argued \cite{WallaceB75,WallaceF87} that the strict MML
6435: estimator $\hat{\theta}_{R^*}$ may be interpreted as an estimator that
6436: `trades off complexity and goodness-of-fit'. Below we shall explain this
6437: idea in a novel manner.
6438:
6439: \paragraph{Bayes Mean Randomness Deficiency Function:}
6440: Let us fix some rate $R$. Using the two-part code \eqref{eq:mml},
6441: achieving $D^\circ(R)$, we need on average
6442: \begin{equation}
6443: \label{eq:mmlb}
6444: H(\hat{\theta}_R(\overline{X})) + D^\circ(R) = H(\hat{\theta}_R(\overline{X})) +
6445: {\bf E}_{{\Pr}_{\text{Bayes}}}
6446: [ - \log p(\overline{X} \mid \hat{\theta}_R(\overline{X}))]
6447: \end{equation}
6448: bits to encode our data. We may compare this with the optimal
6449: (on average) code for $\overline{X}$: the
6450: Shannon-Fano code for ${\Pr}_{\text{Bayes}}$, with lengths $L(\overline{x}) =
6451: - \log {\Pr}_{\text{Bayes}} (\overline{x})$,
6452: and expected length $H(\overline{X})$. Since the two-part code at
6453: rate $R$ can never be better than this overall optimum code, the
6454: difference $\beta(R)$ defined by
6455: \begin{multline}
6456: \label{eq:redundancy}
6457: \beta(R) = H(\hat{\theta}_R(\overline{X})) + {\bf E}_{{\Pr}_{\text{Bayes}}}
6458: [ - \log p(\overline{X} \mid \hat{\theta}_R(\overline{X}))]- H(\overline{X}) = \\
6459: {\bf E}_{{\Pr}_{\text{Bayes}}}
6460: [ - \log p(\overline{X} \mid \hat{\theta}_R(\overline{X}))] -
6461: H(\overline{X} \mid \hat{\theta}_R(\overline{X}))
6462: \end{multline}
6463: is always nonnegative. Information theorists call
6464: (\ref{eq:redundancy}) the {\em redundancy\/} of the given 2-part code:
6465: it is the average {\em additional\/} number of bits needed to encode
6466: $\overline{X}$ compared to the optimal code for $\overline{X}$.
6467: In analogy with Kolmogorov's minimum randomness
6468: deficiency function $\beta_x (R)$, our new function $\beta(R)$ may be called
6469: the {\em Bayes mean randomness deficiency function}.
6470:
6471: Typically, as $R$ increases, the function $\beta$ will behave
6472: as follows: first, it will be much larger than $0$ (and in fact, for
6473: fixed $R$, it will be linear in $n$---the number of i.i.d. random
6474: variables denoted by $\overline{X}$). As $R$ grows ($n$
fixed), the function decreases and reaches a first minimum at $R =
6476: R^*$, the rate for the strict MML estimator minimizing
6477: (\ref{eq:mml}). At this point, the difference $\beta(R)$
6478: is bounded by a constant (independent of $n$).
6479:
6480: This means that at the minimum at $R^*$,
6481: the two-part code is essentially (within a constant) as good as the
overall best one-part (Shannon-Fano) code. The estimator
6483: $\hat{\theta}_R (\overline{x})$, with $R \geq R^*$,
6484: thus {\em on average\/} behaves
6485: as the `algorithmic sufficient statistic',
6486: capturing essentially all regularity in
6487: the data (since even if the `true' distribution ${\Pr}_{\text{Bayes}}$
6488: were known, the data could not be compressed more).
6489: The optimum $\hat{\theta}_{R^*} (\overline{x})$ is
6490: a {\em minimum\/} sufficient statistic since all other
6491: sufficient statistics are attained for $R > R^*$, and need on
6492: average more bits to be described.
6493:
6494: TODO IN SOME CASES I CAN ACTUALLY FORMALLY PROVE ALL THIS
6495:
We may thus think of the strict MML estimator as a `minimum
6497: sufficient statistic'. Historically, the strict MML estimator has been
6498: introduced and interpreted from a lossless coding point of view. The
6499: variation of MML based on the `mean structure function' introduced
6500: above may also be understood from a lossy coding point of view: the
6501: MML estimator implements the two-part code that restricts ${\cal M}$
6502: to the smallest possible subset ${\cal M}' \subset {\cal M}$
6503: containing an element $p(\cdot \mid
6504: \theta) \in {\cal M}'$ so that $p(\cdot \mid \theta)$ captures all
6505: relevant information in $\overline{X}$, which means that the data must look
6506: like a typical outcome of $p(\cdot \mid \theta) \in {\cal M}$.
6507:
6508: \commentout{
6509: \begin{quote}
6510: {\bf Caution\ }
6511: We stress that the encoded value of ${\theta}$ is a {\em
6512: function \/} of the sample $(x_1, \ldots, x_n)$. It is {\em not\/}
6513: necessarily equal to the $\theta$ `generated' by $W$: since
6514: $$
6515: H({\Theta}) = \sum_{{\Theta} \in {\mathbf \Theta}}
6516: \Pr_{\text{Bayes}}(\{x^n : {\Theta}(x^n) = \theta\}) [
6517: - \log \Pr_{\text{Bayes}}(\{x^n : {\Theta}(x^n) = \theta\}) ]
6518: $$
6519: whereas, if ${\bf \Theta}$ is finite, then
6520: $$
H(\Theta) = \sum_{\theta \in {\mathbf \Theta}} W(\theta) [ - \log W(\theta) ],
6522: $$
6523: and if ${\bf \Theta}$ is infinite, $H(\Theta)$ is not defined.
Thus, we have $H({\Theta}) \neq H(\Theta)$.
6525: \end{quote}
6526: }
6527: \begin{remark}
6528: \rm
6529: The previous analysis opens up the intriguing
6530: possibility to define a {\em randomized MML estimator\/} as the
6531: estimator which minimizes expected two-part code length over all
6532: randomized, rather than just unrandomized functions from the data to
6533: the parameters. As seen, this will in general lead to smaller expected
6534: two-part code lengths. This could therefore somewhat change the
6535: distortion-rate curve, and therefore also somewhat change the inferred
6536: distribution for any given particular set of data. At this time it is
6537: unclear however whether this would lead to any substantial changes.
6538: \end{remark}
6539: \paragraph{Problem and Lacuna}
6540: The `strict MML method' provides a code book that achieves the minimum
6541: two-part code length (and, through the structure function
6542: interpretation, something close to the `optimal separation between
6543: data and noise') {\em on average when applied several times}, and
6544: according to the prior. In practice, a statistician who uses MML
6545: observes a data sample $\overline{x}$ and then infers
6546: that $\hat{\theta}(\overline{x})$ is a good explanation
6547: for the data. There are two potential problems here: (a) for the
6548: individual sequence $\overline{x}$ that actually arises, the MML estimator
6549: $\hat{\theta}(\overline{x})$
6550: may {\em not\/} achieve the optimal data-noise separation;
6551: (b) the statistician may not be able to come up with a reasonable
6552: prior $W$. These concerns are addressed, to some extent, by MML's
6553: close cousin: Rissanen's Minimum Description Length Principle.
6554: \subsection{MDL Parameter Estimates}
6555: DISCUSS CONNECTION BETWEEN MML AND UNIVERSAL MODELS; HOW MDL GETS AWAY
6556: WITHOUT PRIOR - PROBABLY BEST PUT *AFTER* DISCUSSION OF KOLMOGOROV
6557: SUFFICIENT STATISTIC. LETS FIRST WAIT UNTIL WE HAVE KOLMOGOROV TEXT!
6558:
6559: THE FOLLOWING IS PROBABLY SUPERFLUOUS To end this section, we consider
6560: one last distortion function that will play an important r\^ole in the
6561: next section.
6562: \begin{example}[structure function]
6563: \label{ex:uniformcode}
6564: \rm Consider the distortion function $d: {\cal X} \times {\cal S}
6565: \rightarrow {\cal R}$ where ${\cal S} = 2^{\cal X}$ is the power set
of ${\cal X}$. We define $d(x,S) = \log |S|$ if $x \in S$ and
6567: $d(x,S) = \infty$ if $x \not \in S$.
6568: This distortion function has both
6569: a lossy and a lossless coding interpretation. From the lossy point of
6570: view, $x$ is encoded as a set which contains it, and the quality of
6571: the encoding is given by the (log of the) size of the set. From the
6572: lossless point of view, this distortion corresponds to a scenario
where a sender has to send the value $x$ to a receiver but is not allowed
6574: to use any arbitrary code he likes. Instead, he must do the encoding
6575: in two stages: he must first specify a set $S$, using at most $R$
6576: bits. He then has to specify $x$ by giving its index in the set
$S$. That is, in the second stage of the description, he is not
6578: allowed to use any probabilistic knowledge about $x$ at all, but must
6579: describe it in a trivial, fixed length manner. This rate distortion
6580: function has a number of interesting properties.
6581: It is closely related to the Kolmogorov structure function which
6582: we discuss in the next section.
6583: \end{example}
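
As a small illustration (the specific sets below are our own choice and
serve only to make the definition concrete), take ${\cal X} = \{0,1\}^4$
and $x = 0100$. Then
$$
d(x, {\cal X}) = \log 16 = 4, \qquad
d(x, S_1) = \log 4 = 2, \qquad
d(x, \{ x \}) = \log 1 = 0, \qquad
d(x, \{ 0000 \}) = \infty,
$$
where $S_1 = \{1000, 0100, 0010, 0001\}$ is the set of strings of length
$4$ containing exactly one $1$. The smaller the set containing $x$, the
fewer bits are needed for the index in the second stage, but the more
specific the set that has to be described in the first stage.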
6584:
6585: \paragraph{A Rate Distortion Theory for individual sequences?} TO BE PUT AFTER CONSIDERING KOLMOGOROVS STRUCTURE FUNCTION: We see
6586: that rate distortion theory leads to trade-offs between number of bits
needed to send a message and achievable distortion in an average
6588: sense. When also the distortion is measured in terms of bits, this
6589: leads to two-part codes that are optimal on average, as considered in
6590: MML. In MDL and with the Kolmogorov structure function, we consider
6591: two-part codes that are optimal not in average, but in an individual
6592: sequence sense (even though the sense of optimality and the codes that are
6593: used in MDL and Kolmogorov's setting are different, they are both
6594: concerned with individual sequences). This suggests that we might just
6595: as well replace the second part of the code by some non-logarithmic
distortions and consider non-logarithmic distortions in an individual
6597: sequence sense, using the tools developed in the MDL and Kolmogorov structure
6598: function theory. Ideally, this would lead to an {\em individual
6599: sequence-based rate-distortion theory}. The very first steps in this
6600: promising new direction have been recently taken by
6601: \cite{RissanenT03}.
6602: FOR DIFFERENT NOVEL APPROACHES: \cite{KontoyiannisZ02,SowE03}
6603: CHANGE PART ABOUT AVERAGE SUFFICIENT STATISTIC IN SECTION 7
6604: \cite{SowE03}
6605:
6606:
6607:
6608:
6609: \section{Resource-Bounded Information}
6610: The area of computational resource-bounded information transmission
6611: seems rather underdeveloped in the Shannon-Information case.
This would have to deal with the speed of encoding/decoding parsimonious
prefix codes (SPIELMAN, SIPSER???).
This will depend on the size of the message domain.
6615: In general we can say that the resource-bounded information transmission
6616: rate will depend primarily on the probability characteristics of the
6617: random source.
6618: In contrast, in the algorithmic (Kolmogorov complexity)
6619: case, the resource-bounded information depends on the
6620: individual object concerned. The theory is partially developed.
6621: One may consider a book on number theory
6622: difficult, or ``deep.'' The book will list
6623: a number of difficult theorems of number theory. However,
6624: it has very low Kolmogorov complexity since all
6625: theorems are derivable from the initial few definitions.
6626: Our estimate of the difficulty, or ``depth,'' of the book is based
6627: on the fact that it takes a long time to reproduce the book
6628: from part of the information in it.
6629: The existence of a ``deep'' book is itself evidence of some long
6630: evolution preceding it.
6631: Currently, the sequence of primes is being broadcast
6632: to outer space since it is deemed deep enough to prove
6633: to aliens that it arose as a result of a long
6634: evolution.
6635: From the point of view of an investigator, a sequence is deep if
6636: it yields its secrets only slowly: one will be able to discover
6637: all significant regularities in it
6638: only if one analyzes it long enough.
6639:
6640: A suggestive example is provided by
6641: DNA sequences. Such a sequence is quite regular and
6642: has some 90\% redundancy, possibly due to evolutionary history.
6643: A DNA sequence over an alphabet of four letters $\{ A,C,G,T \}$
6644: \index{sequence!DNA}
6645: looks like nothing but a super-long
6646: ($3 \times 10^9$ characters for humans) computer program.
6647: A particular three-letter combination
6648: literally signifies ``begin'' of the encoding of a protein.
6649: Following the ``begin'' command, every next block of three consecutive
6650: letters encodes one of the 20 amino acids. At the end
6651: another three-letter combination signifies the
6652: ``end'' of the program for this protein. Such a sequence is
6653: not Kolmogorov random, and it encodes the structure of a living being.
6654: DNA is much less random than, say, a typical
6655: configuration of gas in a container.
6656: On the other hand, DNA is more random than a crystal.
6657: Both gases and crystals are structurally trivial;
6658: the former is in complete chaos and the latter is in total order.
6659: Intuitively, DNA contains more useful information than both.
6660: A ``deep'' object, such as DNA, is something really simple but
6661: ``disguised'' by complicated manipulations of nature
6662: or computation by computer.
6663:
6664: Logical depth is the necessary number of
6665: steps in the deductive or causal path connecting an object with its
6666: plausible origin. Formally, it is
6667: the time required by a universal computer to compute the
6668: object from its compressed original description.
6669:
6670: It turns out that it is quite subtle to give a formal
6671: definition of ``depth'' that satisfies our intuitive notion
6672: of it. After some attempts at a definition,
6673: we will settle for
6674: Definition~\ref{def.depth}.
6675: As usual, we write $x^*$ to denote the shortest
6676: self-delimiting program (of the reference universal prefix machine
6677: $U$) for $x$. If there is more than
6678: one of the same length, then $x^*$ is
6679: the first such program in a fixed enumeration.
6680: \begin{description}
6681: \item[Attempt 1]
6682: The number of steps required to compute $x$ from
6683: $x^*$ is not a stable quantity since
6684: there might be a program of
6685: just a few more bits using substantially less time to generate
6686: $x$. That this can happen
6687: is shown by the hierarchy theorems in \cite{LiVi97}.
6688: Therefore, a proper definition of
6689: depth probably should ``compromise''
6690: between the program size and computation time.
6691: \item[Attempt 2]
6692: Relax the strict requirement of
6693: minimum program to {\em almost minimum} programs.
6694: Define that a string $x$
6695: has depth $d$ within error $2^{-b}$ if $x$ can be
6696: computed in $d$ steps by a program $p$ of
6697: no more than $b$ bits in excess of $x^*$.
6698: That is, $2^{-l(p)}/2^{-K(x)} \geq 2^{-b}$.
6699:
6700: This definition is stable but is unsatisfactory because
6701: of the way it treats multiple programs of the same length.
6702: If $2^b$ distinct programs of length $m+b$ all compute $x$,
6703: then together they account for the same
6704: algorithmic probability
6705: \begin{equation}
6706: \nonumber
6707: \sum \{2^{-l(p)}: U(p)=x, l(p) = m+b \},
6708: \end{equation}
6709: as one program of length $m$ printing $x$ does.
6710: That is, they are as likely to produce $x$ as output
6711: of the universal reference prefix machine when
6712: its input is provided by fair coin tosses.
6713: But with the proposed definition,
6714: $2^b$ programs of length $m+b$
make the emergence of $x$ no more
6716: probable than one program of length $m+b$.
6717: \end{description}
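The arithmetic behind the objection to Attempt 2 is short: the total
weight of $2^b$ programs of length $m+b$ is
$$
2^b \cdot 2^{-(m+b)} = 2^{-m},
$$
which is exactly the weight of a single program of length $m$. For
instance, with the purely illustrative values $m = 10$ and $b = 3$, eight
programs of length $13$ together contribute $8 \cdot 2^{-13} = 2^{-10}$.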
6718: We shall explicitly take the algorithmic probability into account.
6719: The universal prior probability of a string $x$ is
6720: \index{probability!universal prior}
6721: \[
6722: Q_U (x) = \sum_{U(p)=x} 2^{-l(p)},
6723: \]
6724: where $U$ is the reference universal
6725: prefix machine.
6726: This is the probability
6727: that $U$ would print $x$ if its input were provided by random tosses
6728: of a fair coin.
By one of the main results in Kolmogorov complexity theory,
6730: \begin{equation}\label{eq.depth.PR2}
6731: - \log Q_U (x) + O(1) = -\log {\bf m} (x) = K(x) +O(1).
6732: \end{equation}
6733: It shows that $2^{-K(x)}$ is
6734: a universal discrete semimeasure. This means
6735: that we are free to choose the reference universal semimeasure
6736: ${\bf m}$ exactly equal to $2^{-K(x)}$.
6737:
6738: Thus, weighing all possible causes of emergence of $x$
6739: appropriately, we are led to the following definition:
6740: \begin{definition}\label{def.gacs.depth}\label{def.depth}
6741: \rm
6742: The {\em depth} of a string $x$
6743: at {\em significance level} $\epsilon = 2^{-b}$ is
\[ \mbox{\rm depth}_{\epsilon} (x) = \min
 \{ t : Q_U^t (x)/ Q_U (x) \geq \epsilon \},
\]
where $Q_U^t (x) = \sum_{U^t (p)=x} 2^{-l(p)}$
and \index{logical depth!$(d,b)$-deep|bold}
$U^t (p) =x$ means that $U$ computes $x$ within $t$
steps and halts. A string $x$ is {\em $(d,b)$-deep} if
$d=\mbox{\rm depth}_{\epsilon} (x)$ and $\epsilon = 2^{-b}$.
6752: \end{definition}
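To get a feeling for the definition, consider the following toy
calculation. It is only a sketch under a simplifying assumption of ours:
we pretend that exactly three programs print $x$, of lengths $10$, $12$
and $13$, halting after $1000$, $50$ and $20$ steps respectively (for a
real universal machine infinitely many programs print $x$). Then
$$
Q_U (x) = 2^{-10} + 2^{-12} + 2^{-13} = \frac{11}{2^{13}}, \qquad
\frac{Q_U^{20} (x)}{Q_U (x)} = \frac{1}{11}, \qquad
\frac{Q_U^{50} (x)}{Q_U (x)} = \frac{3}{11}, \qquad
\frac{Q_U^{1000} (x)}{Q_U (x)} = 1.
$$
Since $1/11 \geq 2^{-4}$, the depth of $x$ at significance level
$2^{-4}$ is $20$; since $1/11 < 2^{-2} \leq 3/11$, the depth at
significance level $2^{-2}$ is $50$; and at significance level $2^{-1}$
the depth is $1000$. Demanding a larger share of the algorithmic
probability thus forces us to wait for the shorter, slower programs.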
6753: If $x$ is $(d,b)$-deep,
6754: then $x$ receives an approximately $1/2^{b \pm \delta }$ fraction
6755: of its algorithmic probability (for some small $\delta$) from
6756: programs running in $d$ steps.
6757: Below we formalize this statement and make $\delta$ precise.
6758: A binary string $x$ is $b$-compressible
6759: if $l(x^* ) \leq l(x) - b$.
6760: Otherwise, $x$ is $b$-incompressible.
6761: \begin{theorem}\label{theorem.depth}
6762: A string $x$ is
6763: {\em $(d,b)$-deep} {\rm (}$b$ up to precision $K(d)+O(1)${\rm )}
6764: if and only if $d$ is the least time
6765: needed by a $b$-incompressible program
6766: to print $x$.
6767: \end{theorem}
6768:
6769:
6770: %\section{Kolmogorov Minimum Sufficient Statistic}
6771: %A shortcoming of both the Shannon and the Algorithmic Information
6772: %Theory is that they do not distinguish between `useful' and `useless'
6773: %information in an event (initial sequence of outcomes). It turns out
6774: %though that, by using Kolmogorov complexity as a `building block', one
6775: %can arrive at natural definitions of these concepts. These definitions
6776: %are too involved to cite in this abstract. In many cases, `useful'
6777: %information in a sequence corresponds to information that can be used
6778: %to predict future outcomes of the process better than by random
6779: %guessing.
6780: %
6781: %In the paper we will formally define `useful' and `useless'
6782: %information using the theory of the {\em Kolmogorov Minimum Sufficient
6783: %Statistic\/} and we will show the importance of these concepts.
6784: %\section{Properties and Comparisons}
6785: %The form of information theory based on Kolmogorov complexity is
6786: %usually called {\em algorithmic information theory}. Although
6787: %algorithmic rather than probabilistic, algorithmic information theory
6788: %is closely related to the Shannon theory. We will discuss this and
6789: %other relations in detail and give intuitive interpretations of them.
6790: %
6791: %We also point out the important fact that the Kolmogorov
6792: %complexity/minimum sufficient statistic approach cannot be applied
6793: %without modification in practical situations, the reason being that
6794: %there is no algorithm that, for arbitrary inputs $x$ outputs the
6795: %length of the shortest program that prints $x$. We briefly indicate
6796: %modifications of the theory such as the {\em minimum description
6797: %length principle\/} that {\em can\/} be used in practical
6798: %settings. Such modifications have found important applications in
6799: %statistics and machine learning.
6800: \section{Conclusion}
6801: We have compared Shannon's and Kolmogorov's theories of
6802: information, highlighting the various similarities and differences. We
6803: end by suggesting further topics and reading for the interested reader.
6804: \subsection{Further Topics}
6805: We have only treated those aspects of Shannon's theory
6806: that have a clear analogue in Kolmogorov's theory, and vice versa.
6807: Among the many aspects of Shannon theory we have not discussed, one
6808: cannot go unmentioned:
6809: \begin{description}
6810: \item{\bf The Channel Coding Theorem} Of the three (arguably) most important
6811: developments in Shannon's original
6812: paper, we only discussed two: first, the {\em noiseless coding theorem\/}
6813: (Theorem~\ref{thm:noiseless}), related to lossless compression or,
6814: equivalently, lossless communication over a {\em noiseless\/} channel.
6815: Second, the fundamental theorem of {\em rate-distortion}, which deals with lossy
6816: compression.
6817: We did not discuss the {\em channel coding theorem},
6818: which is related to {\em lossless\/}
6819: communication over a {\em noisy\/} channel.
6820: \end{description}
6821: Among the many aspects of Kolmogorov complexity that
6822: we have not discussed, some
6823: cannot go unmentioned:
6824: \begin{description}
6825: \item{\bf Algorithmic Randomness; The Universal Distribution}
6826: TODO
6827: \item{\bf Inductive Inference}
6828: TODO
6829: \item{\bf Kolmogorov complexity as a proof technique}
6830: TODO Goedel
6831: \end{description}
6832: \subsection{Further Reading}
6833: The standard reference for Shannon information theory is
6834: \cite{CT91}. Also, Shannon's original \cite{Sh48} is still
well worth reading. The 50-year anniversary issue of the {\em IEEE
6836: Transactions on Information Theory\/} in 1998
6837: contains overview articles on some of
6838: the most important topics in Shannon
6839: information theory. The standard reference for Kolmogorov Complexity
6840: is \cite{LiVi97}; \cite{Ch87b} is a monograph written by G. Chaitin,
6841: one of the founders of Kolmogorov complexity. It concentrates on the
6842: application of Kolmogorov complexity to proving metamathematical statements.
6843: References
6844: \cite{LiVi97} and \cite{CT91} provide an extensive treatment of all the notions
6845: discussed in this article, as well as many
6846: others we could not touch upon here. Recently, there have been many
6847: exciting new results in `meaningful information' and the Kolmogorov
6848: structure function which are not yet mentioned in \cite{LiVi97}. We
6849: refer to \cite{VV02}. Both universal coding and the Kolmogorov structure
6850: function are closely related to Rissanen's `minimum description length
6851: principle' for inductive inference; see \cite{Grunwald03} and
6852: \cite{Rissanen89}.
6853: }
6854: \section{Conclusion}
6855: We have compared Shannon's and Kolmogorov's theories of information,
6856: highlighting the various similarities and differences. Some of this
6857: material can also be found in \cite{CT91}, the standard reference for
6858: Shannon information theory, as well as \cite{LiVi97}, the standard
6859: reference for Kolmogorov complexity theory. These books predate much
6860: of the recent material on the Kolmogorov theory discussed in the
6861: present paper, such as \cite{HRSV00} (Section~\ref{sec:algmi}),
6862: \cite{Le02} (Section~\ref{sect:minialg}), \cite{GTV01}
6863: (Section~\ref{sec:algsuf}), \cite{VV02, VereshchaginV04}
6864: (Section~\ref{sec:structure}). The material in Sections~\ref{sec:relpa}
6865: and \ref{sec:esf}
6866: has not been published before. The present paper summarizes these
6867: recent contributions and systematically compares
6868: them to the corresponding notions in Shannon's theory.
6869:
6870: \paragraph{Related Developments:} There are two major practical theories
6871: which have their roots in both Shannon's and Kolmogorov's notions of
6872: information: first, {\em universal coding}, briefly introduced in
6873: Appendix~\ref{sec:universal} below, is a remarkably successful theory for
6874: practical lossless data compression. Second, Rissanen's {\em
6875: Minimum Description Length (MDL) Principle\/}
6876: \cite{Ri89,Grunwald04} is a theory of inductive inference that
6877: is both practical and successful. Note that direct practical
6878: application of Shannon's theory is hampered by the typically
6879: untenable assumption of a true and known distribution generating the
6880: data. Direct application of Kolmogorov's theory is hampered by the
6881: noncomputability of Kolmogorov complexity and the strictly asymptotic
6882: nature of the results. Both universal coding (of the individual
6883: sequence type, Appendix~\ref{sec:universal}) and MDL seek to
6884: overcome both problems by restricting the description methods used
6885: to those corresponding to a set of probabilistic predictors (thus
6886: making encodings and their lengths computable and nonasymptotic);
6887: yet when applying these predictors, the assumption that any one of
6888: them generates the data is never actually made. Interestingly, while
6889: in its current form MDL bases inference on universal codes, in
6890: recent work Rissanen and co-workers have sought to found the
6891: principle on a restricted form of the algorithmic sufficient
6892: statistic and Kolmogorov's structure function as discussed in
6893: Section~\ref{sec:structure} \cite{RissanenT04}.
6894:
6895: By looking at general types of prediction errors, of which
6896: codelengths are merely a special case, one achieves a generalization
6897: of the Kolmogorov theory that goes by the name of {\em predictive
6898: complexity}, pioneered by Vovk, Vyugin, Kalnishkan and others\footnote{See {\tt www.vovk.net} for an overview.} \cite{Vovk01}. Finally, the
6899: notions of `randomness deficiency' and `typical set' that are
6900: central to the algorithmic sufficient statistic
6901: (Section~\ref{sec:algsuf}) are intimately related to
6902: the celebrated Martin-L\"of-Kolmogorov theory of {\em randomness in
6903: individual sequences}, an overview of which is given in
6904: \cite{LiVi97}.
6905: \appendix
6906: \section{Appendix: Universal Codes}
6907: \label{sec:universal}
Shannon's and Kolmogorov's ideas are not directly applicable to
6909: most actual data compression problems. Shannon's theory is hampered
6910: by the typically
6911: untenable assumption of a true and known distribution generating the
6912: data. Kolmogorov's theory is hampered by the
6913: noncomputability of Kolmogorov complexity and the strictly asymptotic
6914: nature of the results. Yet there is
6915: a middle ground that is feasible: {\em
universal codes\/} that may be viewed as both a
generalized version of Shannon's, and a feasible
6918: approximation to Kolmogorov's theory. In introducing
6919: the notion of universal coding Kolmogorov says \cite{Ko65}:
6920: \begin{quote}
6921: ``A universal coding method that permits the transmission of
6922: any sufficiently long message [of length $n$] in an alphabet of $s$ letters
with no more than $nh$ [$h$ is the empirical entropy] binary digits is
6924: not necessarily excessively complex; in particular, it is not
6925: essential to begin by determining the frequencies $p_r$ for the entire
6926: message.''
6927: \end{quote}
6928:
6929: Below we repeatedly use the coding concepts introduced in
6930: Section~\ref{sec:coding}.
6931: Suppose we are given a recursive enumeration
6932: of prefix codes $D_1, D_2, \ldots$. Let $L_1, L_2, \ldots$ be the
6933: length functions associated with these codes. That is, $L_i(x) = \min_y
6934: \{ l(y) : D_i(y) = x \}$; if there exists no $y$ with $D_i(y) = x$,
then $L_i(x) = \infty$. We may encode $x$ by first
6936: encoding a natural number $k$ using the standard prefix code
6937: for the natural numbers.
6938: We then encode $x$ itself using the code $D_k$. This leads to a
6939: so-called {\em two-part code\/} $\tilde{D}$
6940: with lengths $\tilde{L}$. By construction, this code is prefix and its lengths satisfy
6941: \begin{equation}
\tilde{L}(x) := \min_{k \in {\cal N}} \ \{ \Lint(k) + L_k(x) \}.
6943: \end{equation}
6944: Let ${\bf x}$ be an infinite binary sequence and let $x_{[1:n]} \in
6945: \{0,1\}^n$ be the initial $n$-bit segment of this sequence.
6946: Since $L_{\cal N}(k) = O (\log k)$,
6947: we have for all $k$, all $n$:
6948: $$
6949: \tilde{L}(x_{[1:n]}) \leq L_k(x_{[1:n]}) + O(\log k).
6950: $$
6951: Recall that for
6952: each fixed $L_k$, the fraction of sequences of length $n$ that can be
6953: compressed by more than $m$ bits is less than $2^{-m}$. Thus,
6954: typically, the codes $L_k$ and the strings $x_{[1:n]}$ will be such
6955: that $L_k(x_{[1:n]})$ grows {\em linearly\/} with $n$.
6956: This implies that for every ${\bf x}$,
6957: the newly constructed $\tilde{L}$ is `almost as good'
6958: as whatever code $D_k$ in the list is best for that particular ${\bf x}$: the
6959: difference in code lengths is bounded by a constant depending on $k$ but not on
6960: $n$. In particular, for each
6961: infinite sequence ${\bf x}$, for each fixed $k$,
6962: \begin{equation}
6963: \label{eq:universal}
6964: \lim_{n \rightarrow \infty}
6965: \frac{\tilde{L}(x_{[1:n]})}{L_k(x_{[1:n]})} \leq 1.
6966: \end{equation}
6967: A code satisfying (\ref{eq:universal}) is called a {\em universal
6968: code\/} relative to the {\em comparison class\/} of codes
6969: $\{ D_1, D_2, \ldots \}$.
6970: It is `universal' in the sense that it compresses every
6971: sequence essentially as well as the $D_k$ that compresses that particular
6972: sequence the most.
6973: % This terminology is slightly non-standard; see below.
6974: In general, there exist many types of codes that
6975: are universal: the 2-part universal code defined above is just
6976: one means of achieving (\ref{eq:universal}).
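
To see in detail why the two-part code satisfies (\ref{eq:universal}),
fix $k$ and divide the bound $\tilde{L}(x_{[1:n]}) \leq L_k(x_{[1:n]}) +
O(\log k)$ by $L_k(x_{[1:n]})$. Writing $c_k$ (our notation) for the
overhead of encoding $k$, which does not depend on $n$, this gives
$$
\frac{\tilde{L}(x_{[1:n]})}{L_k(x_{[1:n]})} \leq
1 + \frac{c_k}{L_k(x_{[1:n]})},
$$
and the second term vanishes whenever $L_k(x_{[1:n]}) \rightarrow \infty$
as $n$ grows, in particular in the typical case of linear growth. As a
purely hypothetical illustration, if $c_k = 20$ bits and
$L_k(x_{[1:n]}) = n/2$, then the ratio is at most $1 + 40/n$, which is
within one percent of $1$ as soon as $n \geq 4000$.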
6977:
6978: \paragraph{Universal codes and Kolmogorov:}
6979: %Let us now reinterpret the definition of (prefix) Kolmogorov complexity
6980: %in terms of universal codes.
6981: %%
6982: %
6983: %From Definition~\ref{def.KolmK} we see
6984: %that the Kolmogorov complexity is just the length function of the
6985: %universal two-part code that is defined relative to the list of
6986: %reference codes $D_1,D_2, \ldots$ with $D_i$ defined by $D_i(p) =
6987: %\phi_i(\langle p,\epsilon \rangle)$.
6988: %Note that, for large $n$, the Kolmogorov complexity
6989: %$K(x_{[1:n]})$ must be smaller or equal (up to a constant)
6990: %than the universal code length $\tilde{L}(x_{[1:n]})$
6991: In most practically interesting cases we may assume that
6992: for all $k$, the decoding function $D_k$ is computable,
6993: i.e. there exists a prefix Turing machine which
6994: for all $y \in \{0,1\}^*$, when input $y'$ (the prefix-free version of
6995: $y$), outputs $D_k(y)$ and then
6996: halts. Since such a program has finite length, we must have for all $k$,
6997: $$
6998: %l(E^*(x_{[1:n]})) = K(x_{[1:n]}) \leq^+ \tilde{L}_k(x_{[1:n]})
6999: l(E^*(x_{[1:n]})) = K(x_{[1:n]}) \leq^+ L_k(x_{[1:n]})
7000: $$
7001: where $E^*$ is the encoding function defined in Section~\ref{sec:kolmogorov},
7002: with $l(E^*(x))
7003: = K(x)$. Comparing with (\ref{eq:universal}) shows that
7004: the code $D^*$ with encoding function $E^*$ is a universal code relative to $D_1, D_2,
7005: \ldots$. Thus, we see that the Kolmogorov complexity $K$ is just the length function
7006: of the universal code $D^*$. Note that $D^*$ is an example of a universal
7007: code that is not (explicitly) two-part.
7008: \begin{example}
7009: \label{ex:universal}
7010: \rm Let us create a universal two-part code that allows us to significantly
7011: compress all binary strings with frequency of 0's deviating significantly
from $\frac{1}{2}$. For $n_0 \leq n$, let $D_{\langle n,n_0 \rangle }$ be the code that assigns
7013: code words of equal (minimum) length
7014: to all strings of length $n$ with $n_0$ zeroes, and no code words to
7015: any other strings. Then $D_{\langle n,n_0 \rangle}$
7016: is a prefix-code and $L_{\langle n,n_0 \rangle} (x) =
7017: \lceil \log \binom{n}{n_0} \rceil$. The universal two part code
7018: $\tilde{D}$ relative to the set of codes $\{
7019: D_{\langle i,j\rangle} \; : \; i,j \in {\cal N} \}$ then achieves
7020: the following lengths (to within 1 bit): for all $n$, all $n_0 \in \{0,\ldots,n\}$, all
7021: $x_{[1:n]}$ with $n_0$ zeroes,
7022: $$
7023: \tilde{L}(x_{[1:n]}) = \log n + \log n_0 + 2 \log \log n + 2 \log \log
n_0 + \log \binom{n}{n_0} = \log \binom{n}{n_0} + O(\log n).
7025: $$
7026: Using Stirling's approximation of the factorial, $n! \sim
7027: n^{n}e^{-n}\sqrt{2\pi n}$, we find that
7028: \begin{multline}
7029: \label{eq:stirling}
7030: \log \binom{n}{n_0} =
\log n! - \log n_0! - \log (n- n_0)! = \\
n \log n - n_0 \log n_0 - (n-n_0) \log (n- n_0) + O(\log n) = n
H(n_0/n) + O(\log n).
7034: \end{multline}
Note that $H(n_0/n) \leq 1$, with equality iff $n_0 = n/2$. Therefore, if
the frequency deviates significantly from $\frac{1}{2}$, $\tilde{D}$
compresses $x_{[1:n]}$ by a number of bits linear in $n$. In all such cases,
$D^*$ compresses the data by at least the same amount, up to an additive constant.
7039: Note that (a) each individual code $D_{\langle n,n_0 \rangle}$ is
7040: capable of exploiting a particular type of
7041: regularity in a sequence to compress that
7042: sequence,
7043: (b) the universal code $\tilde{D}$ may exploit
7044: {\em many\/} different types of
7045: regularities to compress a sequence, and (c)
7046: the code $D^*$ with lengths given by
7047: the Kolmogorov complexity asymptotically exploits {\em all\/}
7048: computable regularities so as to maximally compress a sequence.
7049: \end{example}
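To make the orders of magnitude concrete, here is a rough numerical
instance (the values $n = 1000$ and $n_0 = 100$ are our own choice, and
all figures are rounded). With logarithms to base $2$,
$$
H(0.1) = - 0.1 \log 0.1 - 0.9 \log 0.9 \approx 0.469, \qquad
n H(n_0/n) \approx 469 \mbox{\rm \ bits},
$$
while the cost of describing $n$ and $n_0$ is about
$$
\log n + \log n_0 + 2 \log \log n + 2 \log \log n_0
\approx 10.0 + 6.6 + 6.6 + 5.5 \approx 29 \mbox{\rm \ bits}.
$$
Ignoring the $O(\log n)$ correction in Stirling's approximation, the
two-part code therefore needs roughly $500$ bits for any such string,
about half of the $1000$ bits of the literal encoding. For
$n_0 = 500$ we have $n H(n_0/n) = 1000$ and no compression is achieved.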
7050: \paragraph{Universal codes and Shannon:}
7051: If a random variable
7052: $X$ is distributed according to some known probability
7053: mass function $f(x)=P(X=x)$,
7054: then the optimal (in the average sense) code to use is the
7055: Shannon-Fano code. But now suppose it is only known that
7056: $f \in \{f \}$, where $\{ f \}$ is some given (possibly very large,
7057: or even uncountable) set of candidate distributions. Now it is not clear
7058: what code is optimal. We may try the Shannon-Fano code for a particular $f
7059: \in \{ f \}$, but such a code will typically lead to very large
7060: expected code lengths if $X$ turns out to be distributed according to
7061: some $g \in \{ f \}, g \neq f$.
We may ask whether there exists another
code that is `almost' as good as the Shannon-Fano code for $f$, no
matter what $f \in \{ f \}$ actually generates the sequence.
We now show that, provided $ \{ f \}$ is finite or countable,
the answer is (perhaps surprisingly) yes. To see this,
we need the notion of a {\em sequential information source\/} from
Section~\ref{sec:preliminaries}.
7069:
7070: Suppose then that $\{ f \}$ represents a finite or countable set of
7071: sequential information sources. Thus,
7072: $\{ f \} = \{ f_1, f_2, \ldots \}$ and $f_k \equiv (f_k^{(1)},
7073: f_k^{(2)}, \ldots)$ represents a sequential information source, abbreviated to
7074: $f_k$. To each marginal distribution $f^{(n)}_k$, there corresponds a
7075: unique Shannon-Fano code defined on the set $\{0,1\}^n$ with lengths
7076: $L_{\langle n, k \rangle}(x) := \lceil \log 1/ f^{(n)}_k(x) \rceil$
7077: and decoding function $D_{\langle n, k \rangle}$.
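As a small illustration (the distribution is a hypothetical choice of
ours), take $n=2$ and suppose that $f^{(2)}_k$ assigns probabilities
$1/2$, $1/4$, $1/8$, $1/8$ to $00$, $01$, $10$, $11$, respectively. With
logarithms to base $2$, the Shannon-Fano lengths are
$$
L_{\langle 2,k \rangle}(00) = 1, \quad
L_{\langle 2,k \rangle}(01) = 2, \quad
L_{\langle 2,k \rangle}(10) = L_{\langle 2,k \rangle}(11) = 3,
$$
and $2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1$, so the Kraft inequality
holds with equality. One prefix code with these lengths uses the code
words $0$, $10$, $110$, $111$.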
7078:
7079: For given $f \in \{ f \}$,
7080: we define $H(f^{(n)}) := \sum_{x \in \{0,1\}^n} f^{(n)}(x) [ \log 1/
7081: f^{(n)}(x)]$ as the entropy of the distribution of the first $n$
7082: outcomes.
7083:
7084: Let $E$ be a prefix-code assigning codeword $E(x)$ to source word $x
7085: \in \{0,1\}^n$. The Noiseless Coding Theorem~\ref{thm:noiseless}
7086: asserts that the minimal average codeword length
7087: $\bar{L}(f^{(n)})
7088: = \sum_{x \in \{0,1\}^n} f^{(n)}(x) l(E(x))$ among all such
7089: prefix-codes $E$ satisfies
$$H(f^{(n)}) \leq \bar{L}(f^{(n)}) \leq H(f^{(n)}) + 1.$$
The entropy $H(f^{(n)})$
can therefore be interpreted, up to one bit, as the expected code length of
encoding the first $n$ bits generated by the source $f$, when the
optimal (Shannon-Fano) code is used.
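To illustrate the bounds with a hypothetical one-bit source (our own
example, not from the text), let $n=1$, $f^{(1)}(0) = 0.9$ and
$f^{(1)}(1) = 0.1$. Then, with logarithms to base $2$,
$$
H(f^{(1)}) = 0.9 \log \frac{1}{0.9} + 0.1 \log \frac{1}{0.1} \approx 0.469 .
$$
Every prefix code must spend at least one bit on each of the two
outcomes, so the minimal average code word length is $1$, which indeed
lies between $0.469$ and $1.469$. The code with lengths
$\lceil \log 1/f^{(1)}(x) \rceil$, that is, $1$ and $4$, has average
length $0.9 \cdot 1 + 0.1 \cdot 4 = 1.3$, which also lies within one bit
of the entropy.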
7095:
7096: We look for a prefix code $\tilde{D}$ with length function $\tilde{L}$
7097: that satisfies, for all fixed $f \in
7098: \{ f \}$:
7099: \begin{equation}
7100: \label{eq:universalb}
7101: \lim_{n \rightarrow \infty}
\frac{{\bf E}_f \tilde{L}(X_{[1:n]})}{H(f^{(n)})} \leq 1,
\end{equation}
where ${\bf E}_f \tilde{L}(X_{[1:n]}) = \sum_{x \in \{0,1\}^n}
f^{(n)}(x)\tilde{L}(x)$.
Define $\tilde{D}$ as the following two-part code: first, $n$ is
encoded using the standard prefix code for natural numbers. Then, among
all codes $D_{\langle n, k \rangle}$, the $k$ that minimizes
$L_{\langle n, k \rangle}(x) + L_{\cal N}(k)$ is encoded (again using the standard
prefix code); finally, $x$ is encoded in $L_{\langle n, k \rangle}(x)$
bits. Then for all $n$, for all $k$, for {\em every\/} sequence
$x_{[1:n]}$,
7113: \begin{equation}
7114: \label{eq:probuni}
\tilde{L}(x_{[1:n]}) \leq L_{\langle n,k \rangle}(x_{[1:n]}) +L_{\cal
N}(k) + L_{\cal N}(n).
7117: \end{equation}
7118: Since (\ref{eq:probuni}) holds for all strings of length $n$, it must
7119: also hold in expectation for all possible distributions on strings
7120: of length $n$. In particular, this gives, for all $k \in {\cal N}$,
7121: $$
7122: {\bf E}_{f_k} \tilde{L}(X_{[1:n]}) \leq {\bf E}_{f_k} L_{\langle n, k
7123: \rangle}(X_{[1:n]}) + O(\log n) = H(f^{(n)}_k) + O(\log n),
7124: $$
7125: from which (\ref{eq:universalb}) follows.
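In more detail, and under the additional assumption (not stated
explicitly above) that $H(f^{(n)}_k)$ grows faster than $\log n$, for
instance linearly in $n$: dividing both sides by $H(f^{(n)}_k)$ gives
$$
\frac{{\bf E}_{f_k} \tilde{L}(X_{[1:n]})}{H(f^{(n)}_k)}
\leq 1 + \frac{O(\log n)}{H(f^{(n)}_k)} \rightarrow 1
\quad (n \rightarrow \infty),
$$
where the $O(\log n)$ term collects $L_{\cal N}(n)$, the rounding loss
of at most one bit in $L_{\langle n,k \rangle}$, and $L_{\cal N}(k)$,
which is a constant for fixed $k$.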
7126:
7127: Historically, codes satisfying (\ref{eq:universalb}) have been called
7128: {\em universal codes\/} relative to $\{ f \}$; codes satisfying
7129: (\ref{eq:universal}) have been considered in the literature only much
7130: more recently and are usually called `universal codes for individual
7131: sequences' \cite{MerhavF98}. The two-part code $\tilde{D}$ that we
just defined is universal both in an individual-sequence and in an
average sense: $\tilde{D}$ achieves code lengths within $O(\log n)$ bits of
those achieved by $D_{\langle n,k \rangle}$ for {\em every individual
sequence}, for {\em every\/} $k \in {\cal N}$; but $\tilde{D}$ also
achieves expected code lengths within $O(\log n)$ bits of those of the Shannon-Fano
code for $f$, for {\em every\/} $f \in \{ f \}$. Note once again that
7138: the $D^*$ based on Kolmogorov complexity does at least as well as
7139: $\tilde{D}$.
7140:
7141:
7142: \begin{example}
7143: \rm
7144: \label{ex:appy}
Suppose our sequence is generated by independent tosses of a coin with
bias $p$ of tossing ``heads'', where $p \in (0,1)$.
Identifying ``heads'' with $1$, the probability of a particular initial
segment $x_{[1:n]}$ with $n-n_0$ outcomes ``1'' and $n_0$ outcomes ``0''
is then $(1-p)^{n_0} p^{n- n_0}$.
7150: Let $\{ f \}$ be the set of corresponding information sources,
7151: containing one element for each $p \in (0,1)$.
7152: $\{ f \}$ is an uncountable set; nevertheless, a universal code for
$\{ f \}$ exists. In fact, it can be shown that
the code $\tilde{D}$ with lengths (\ref{eq:stirling})
in Example~\ref{ex:universal} is universal for $\{ f \}$, i.e., it
satisfies (\ref{eq:universalb}). The reason for this is (roughly) as
follows: if data are generated by a coin with bias $p$, then with
probability $1$, the frequency $n_0/n$ of zeros converges to $1-p$, so that, by
(\ref{eq:stirling}) and the symmetry $H(1-p) = H(p)$,
$n^{-1} \tilde{L}(x_{[1:n]})$ tends to
$n^{-1} H(f^{(n)}) = H(p,1-p)$.
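For a concrete instance (with a value of $p$ chosen by us purely for
illustration), take $p = 1/4$. With logarithms to base $2$,
$$
H(p, 1-p) = \frac{1}{4} \log 4 + \frac{3}{4} \log \frac{4}{3}
= 2 - \frac{3}{4} \log 3 \approx 0.811 ,
$$
so that $n^{-1} \tilde{L}(x_{[1:n]})$ tends to about $0.811$ bits per
outcome, compared to $1$ bit per outcome for the literal encoding.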
7161:
If we are interested in practical data compression, then the
assumption that the data are generated by a biased-coin source is very
restrictive. But there are much richer classes of distributions
7165: $\{ f \}$ for which we can formulate universal codes. For example, we
7166: can take $\{ f \}$ to be the class of all Markov sources of each
7167: order; here the probability that $X_i = 1$ may depend on arbitrarily
7168: many earlier
7169: outcomes. Such ideas form the basis of most data compression schemes
7170: used in practice. Codes which are universal for the class of
7171: all Markov sources of each order and which encode and decode in real-time
7172: can easily be implemented. Thus, while we cannot find the
7173: shortest program that generates a particular sequence, it is often
7174: possible to effectively find the shortest encoding within a
7175: quite sophisticated class of codes.
7176: \end{example}
7177: %From the point of view of Shannon's theory, this means that there
7178: %exists a code which achieves average code length `essentially' as well
7179: %as the optimal Shannon-Fano code for the unknown source that generates
7180: %the sequence. The small price we pay is that our codelengths may be
7181: %longer by some constant not depending on $n$; this may be relevant if
7182: %the string we want to encode is short.
7183: \bibliographystyle{plain}
7184: %\bibliography{jolli,info,book}
7185:
7186: \begin{thebibliography}{10}
7187:
7188: \bibitem{BKVV03}
7189: H.~Buhrman, H.~Klauck, N.K. Vereshchagin, and P.M.B. Vit\'anyi.
7190: \newblock Individual communication complexity.
7191: \newblock In {\em Proc. STACS}, LNCS, pages~19--30, Springer-Verlag, 2004.
7192:
7193: \bibitem{CoxH74}
D.R.~Cox and D.V.~Hinkley.
7195: \newblock {\em Theoretical Statistics}.
7196: \newblock Chapman and Hall, 1974.
7197:
7198:
7199: \bibitem{Ch69}
7200: G.J. Chaitin.
7201: \newblock On the length of programs for computing finite binary sequences:
7202: statistical considerations.
7203: \newblock {\em J. Assoc. Comput. Mach.}, 16:145--159, 1969.
7204:
7205: \bibitem{CT91}
7206: T.M. Cover and J.A. Thomas.
7207: \newblock {\em Elements of Information Theory}.
7208: \newblock Wiley \& Sons, 1991.
7209:
7210:
7211: \bibitem{Fi22}
7212: R.A. Fisher.
7213: \newblock On the mathematical foundations of theoretical statistics.
7214: \newblock {\em Philos. Trans. Royal Soc. London, Ser. A}, 222:309--368, 1922.
7215:
7216: \bibitem{Ga74}
7217: P.~G\'acs.
7218: \newblock On the symmetry of algorithmic information.
7219: \newblock {\em Soviet Math. Dokl.}, 15:1477--1480, 1974.
7220: \newblock Correction, Ibid., 15:1480, 1974.
7221:
7222:
7223: \bibitem{Grunwald04}
7224: P. D. Gr\"unwald.
7225: \newblock {MDL Tutorial}.
7226: \newblock In P.~D. Gr\"unwald, I.~J. Myung, and M.~A. Pitt (Eds.), {\em
7227: Advances in Minimum Description Length: Theory and Applications}. MIT Press, 2004.
7228:
7229: \bibitem{GTV01}
7230: P.~G\'acs, J.~Tromp, and P.M.B. Vit\'anyi.
7231: \newblock Algorithmic statistics.
7232: \newblock {\em IEEE Trans. Inform. Theory}, 47(6):2443--2463, 2001.
7233:
7234: \bibitem{HRSV00}
7235: D.~Hammer, A.~Romashchenko, A.~Shen, and N.~Vereshchagin.
7236: \newblock Inequalities for {S}hannon entropies and {K}olmogorov complexities.
7237: \newblock {\em J. Comput. Syst. Sci.}, 60:442--464, 2000.
7238:
7239: \bibitem{Ko65}
7240: A.N. Kolmogorov.
7241: \newblock Three approaches to the quantitative definition of information.
7242: \newblock {\em Problems Inform. Transmission}, 1(1):1--7, 1965.
7243:
7244: \bibitem{Ko74}
7245: A.N. Kolmogorov.
7246: \newblock Complexity of algorithms and objective definition of randomness.
7247: \newblock {\em Uspekhi Mat. Nauk}, 29(4):155, 1974.
7248: \newblock Abstract of a talk at the Moscow Math. Soc. meeting 4/16/1974. In
7249: Russian.
7250:
7251: \bibitem{Ko83}
7252: A.N. Kolmogorov.
7253: \newblock Combinatorial foundations of information theory and the calculus of
7254: probabilities.
7255: \newblock {\em Russian Math. Surveys}, 38(4):29--40, 1983.
7256:
7257: \bibitem{Kr49}
7258: L.G. Kraft.
7259: \newblock A device for quantizing, grouping and coding amplitude modulated
7260: pulses.
7261: \newblock Master's thesis, Dept. of Electrical Engineering, M.I.T., Cambridge,
7262: Mass., 1949.
7263:
7264: \bibitem{ChCo78}
7265: S.K. Leung-Yan-Cheong and T.M. Cover.
7266: \newblock Some equivalences between {Shannon} entropy and {K}olmogorov
7267: complexity.
7268: \newblock {\em IEEE Transactions on Information Theory}, 24:331--339, 1978.
7269:
7270: \bibitem{Le74}
7271: L.A. Levin.
7272: \newblock Laws of information conservation (non-growth) and aspects of the
7273: foundation of probability theory.
7274: \newblock {\em Problems Inform. Transmission}, 10:206--210, 1974.
7275:
7276: \bibitem{Le84}
7277: L.A. Levin.
7278: \newblock Randomness conservation inequalities; information and independence in
7279: mathematical theories.
7280: \newblock {\em Inform. Contr.}, 61:15--37, 1984.
7281:
7282: \bibitem{Le02}
7283: L.A. Levin.
7284: \newblock Forbidden information.
\newblock In {\em Proc. 43rd IEEE Symp. Found. Comput. Sci.}, pages 761--768,
  2002.
7287:
7288: \bibitem{LiVi97}
7289: M.~Li and P.M.B. Vit\'anyi.
7290: \newblock {\em An {I}ntroduction to {K}olmogorov {C}omplexity and {I}ts
7291: {A}pplications}.
7292: \newblock Springer-Verlag, 1997.
7293: \newblock 2nd Edition.
7294:
7295: \bibitem{MerhavF98}
7296: N.~Merhav and M.~Feder.
7297: \newblock Universal prediction.
\newblock {\em IEEE Transactions on Information Theory}, 44(6):2124--2147,
  1998.
\newblock Invited paper for the 1948--1998 commemorative special issue.
7301:
7302: \bibitem{Ri89}
7303: J.J. Rissanen.
\newblock {\em Stochastic Complexity in Statistical Inquiry}.
7305: \newblock World Scientific, 1989.
7306:
7307: \bibitem{RissanenT04}
7308: J. Rissanen and I.~Tabus.
7309: \newblock {K}olmogorov's structure function in {MDL} theory and lossy data
7310: compression.
7311: \newblock In P.~D. Gr\"unwald, I.~J. Myung, and M.~A. Pitt (Eds.), {\em
7312: Advances in Minimum Description Length: Theory and Applications}. MIT Press, 2004.
7313:
7314: \bibitem{Sh48}
7315: C.E. Shannon.
\newblock A mathematical theory of communication.
7317: \newblock {\em Bell System Tech. J.}, 27:379--423, 623--656, 1948.
7318:
7319: \bibitem{Sh59}
7320: C.E. Shannon.
7321: \newblock Coding theorems for a discrete source with a fidelity criterion.
7322: \newblock In {\em IRE National Convention Record, Part 4}, pages 142--163,
7323: 1959.
7324:
7325: \bibitem{So64}
7326: R.J. Solomonoff.
7327: \newblock A formal theory of inductive inference, part 1 and part 2.
7328: \newblock {\em Inform. Contr.}, 7:1--22, 224--254, 1964.
7329:
7330: \bibitem{Vovk01}
7331: V. Vovk.
\newblock Competitive on-line statistics.
7333: \newblock {\em Intern. Stat. Rev.}, 69:213--248, 2001.
7334:
7335: \bibitem{VV02}
7336: N.K. Vereshchagin and P.M.B. Vit\'anyi.
7337: \newblock Kolmogorov's structure functions and model selection.
\newblock {\em IEEE Trans. Inform. Theory}.
7339: \newblock To appear.
7340:
7341: \bibitem{VereshchaginV04}
7342: N.K. Vereshchagin and P.M.B. Vit\'anyi.
7343: \newblock Rate-distortion theory for individual data.
7344: \newblock Manuscript, CWI, 2004.
7345:
7346: \bibitem{WallaceF87}
C.S.~Wallace and P.R.~Freeman.
\newblock Estimation and inference by compact coding.
\newblock {\em Journal of the Royal Statistical Society, Series {B}},
  49:240--251, 1987.
7351: \newblock Discussion: pages 252--265.
7352:
7353: \bibitem{ZvLe70}
7354: A.K. Zvonkin and L.A. Levin.
7355: \newblock The complexity of finite objects and the development of the concepts
7356: of information and randomness by means of the theory of algorithms.
7357: \newblock {\em Russian Math. Surveys}, 25(6):83--124, 1970.
7358:
7359: \end{thebibliography}
7360:
7361: \end{document}
7362:
7363:
7364:
7365: