0410:cs0410002/info.tex

1: %\documentclass[doublecolumn,doublesided]{IEEEtran}

2: \documentclass{article}

3:

4: \usepackage{amsmath,amstext,amssymb,epsf}

5:

6:

7: \sloppy

8:

9: %\usepackage{amsmath,amssymb,latexsym}

10: %\usepackage{ltexpprt}

11: \setlength{\textwidth}{6.5in}

12: \setlength{\textheight}{9.3in}

13: \setlength{\oddsidemargin}{0in}

14: \setlength{\evensidemargin}{0in}

15: \setlength{\topmargin}{-0.1in}

16: \setlength{\headheight}{0in}

17: \setlength{\headsep}{0in}

18: \setlength{\footskip}{0.5in}

19:

20:

21:  \newtheorem{lemma}{Lemma}[section]

22:  \newtheorem{theorem}[lemma]{Theorem}

23:  \newtheorem{proposition}[lemma]{Proposition}

24:  \newtheorem{fact}[lemma]{Fact}

25:  \newtheorem{claim}[lemma]{Claim}

26:  \newtheorem{corollary}[lemma]{Corollary}

27:  \newtheorem{conjecture}[lemma]{Conjecture}

28: \newtheorem{notation}[lemma]{Notation}

29:  \newtheorem{definition}[lemma]{Definition}

30: \newtheorem{rem}[lemma]{Remark}

31:

32: \numberwithin{equation}{section} % in amsmath

33: \newenvironment{comment}{\begin{small}\begin{quotation}\hspace{-0.23in}\rm}{\end{quotation}\end{small}}

34:

35: \newenvironment{proof}{\par \sf Proof.\rm}{\hspace*{\fill}$\Box$\vspace{1ex}}

36:

37: \newenvironment{remark}{\begin{rem}}{\hspace*{\fill}$\diamondsuit$\end{rem}}

38:  \newtheorem{ex}[lemma]{Example}

39: \newenvironment{example}{\begin{ex}}{\hspace*{\fill}$\diamondsuit$\end{ex}}

40:

41: \newcommand{\len}[2]{l_{#1}(#2)}

42:

43: \newcommand{\m}{{\bf m}}

44: \newcommand{\lea}{\stackrel{{}_+}{<}}

45: \newcommand{\gea}{\stackrel{{}_+}{>}}

46: \newcommand{\eqa}{\stackrel{{}_+}{=}}

47: \newcommand{\soph}{\mbox{\rm soph}}

48: \newcommand{\Lint}{L_{{\mathcal N}}}

49: \newcommand{\eps}{\epsilon}

50:

51: \newcommand{\commentout}[1]{}

52:

53: \begin{document}

54: \title{Shannon Information and Kolmogorov Complexity}

55: \author{Peter Gr\"unwald and

56: Paul Vit\'anyi\thanks{

57: Manuscript received xxx, 2004;

58: revised yyy 200?.

59: This work supported in part

60: by the EU fifth framework project QAIP, IST--1999--11234,

61: the NoE QUIPROCONE IST--1999--29064,

62: the ESF QiT Programme, and the EU Fourth Framework BRA

63: NeuroCOLT II Working Group

64: EP 27150, the EU NoE PASCAL, and by the Netherlands Organization for

65: Scientific Research (NWO) under Grant 612.052.004.

66: Address: CWI, Kruislaan 413,

67: 1098 SJ Amsterdam, The Netherlands.

68: Email: {\tt Peter.Grunwald@cwi.nl, Paul.Vitanyi@cwi.nl}.}

69: }

70:

71: %\markboth{IEEE Transactions on Information Theory, VOL. XX, NO Y, MONTH 2004}{P.D. Gr\"unwald and P.M.B. Vit\'anyi: Shannon Information and Kolmogorov Complexity}

72:

73: \maketitle

74: \begin{abstract}

75: We compare the

76: elementary theories of Shannon information and Kolmogorov

77: complexity, the extent to which they have a common purpose, and where

78: they are fundamentally different. We discuss and relate the basic

79: notions  of both theories:

80: Shannon entropy versus Kolmogorov complexity, the relation of both

81: to universal coding, Shannon mutual information

82: versus Kolmogorov (`algorithmic') mutual information,

83: probabilistic sufficient statistic versus algorithmic sufficient

84: statistic (related to lossy compression in

85: the Shannon theory versus

86: meaningful

87: information in the Kolmogorov theory),

88: and

89: rate distortion theory versus Kolmogorov's structure function.

90: Part of the material has appeared in print before, scattered

91: through various publications, but this is the first comprehensive

92: systematic comparison. The last mentioned relations are new.

93:

94: \end{abstract}

95: \tableofcontents

96: \section{Introduction}

97: %How should we measure the amount of information about a phenomenon

98: %that is given to us by a particular observation concerning the

99: %phenomenon?

100: %

101: {\em Shannon information} theory, usually called just `information'

102: theory, was introduced in 1948 \cite{Sh48} by C.E. Shannon (1916--2001). {\em

103:   Kolmogorov complexity} theory, also known as `algorithmic

104: information' theory,

105: was introduced with different

106: motivations (among which Shannon's probabilistic notion

107: of information), independently by R.J. Solomonoff

108: (born 1926), A.N. Kolmogorov (1903--1987) and G. Chaitin (born 1943)

109: in 1960/1964, \cite{So64}, 1965, \cite{Ko65}, and 1969 \cite{Ch69},

110:  respectively. Both theories

111: aim at providing a means for measuring `information'.  They

112: use the same unit to do this: the {\em bit}. In both cases, the amount

113: of information in an object may be interpreted as the length of a

114: description of the object.  In the Shannon approach, however, the

115: method of encoding objects is based on the presupposition that the

116: objects to be encoded are outcomes of a known random source---it is

117: only the characteristics of that random source that determine the

118: encoding, not the characteristics of the objects that are its

119: outcomes.  In the Kolmogorov complexity approach we consider the

120: individual objects themselves, in isolation so-to-speak, and the

121: encoding of an object is a short computer program

122: (compressed version of the object) that

123: generates it and then halts.  In the Shannon approach we are

124: interested in the minimum expected number of bits to transmit a

125: message from a random source of known characteristics

126: through an error-free channel.  Says Shannon \cite{Sh48}:

127: \begin{quote}

128:  ``The fundamental problem

129: of communication is that of reproducing at one point

130: either exactly or approximately a message selected at another point.

131: Frequently the messages have {\em meaning}; that is they refer to or are

132: correlated according to some system with certain physical or conceptual

133: entities. These semantic aspects of communication are irrelevant to the

134: engineering problem. The significant aspect is that the actual message

135: is one {\em selected from a set} of possible messages. The system must

136: be designed to operate for each possible selection, not just the one which

137: will actually be chosen since this is unknown at the time of design.''

138: \end{quote}

139: In

140: Kolmogorov complexity we are interested in the minimum number of bits

141: from which a particular message or file

142: can effectively be reconstructed: the minimum

143: number of bits that suffice to store the file in reproducible format.

144: This is the basic question

145: of the ultimate compression

146: of given individual files.  A

147: little reflection reveals that this is a great difference: for {\em

148:   every} source emitting but two messages the Shannon information (entropy) is

149: at most 1 bit, but we can choose both messages concerned of

150: arbitrarily high Kolmogorov complexity. Shannon stresses in his

151: founding article that his notion is only concerned with {\em

152:   communication}, while Kolmogorov stresses in his founding article

153: that his notion aims at filling the gap left by Shannon's theory

154: concerning the information in individual objects.

155: Kolmogorov

156: %\cite{Ko65}:

157: \commentout{\begin{quote}

158: ``The probabilistic approach is natural in

159: the theory of information transmission over communication channels

160: carrying `bulk' information consisting of a large number of unrelated or

161: weakly related messages obeying definite probabilistic laws. $\dots$

162: But what real meaning is there, for example, in asking how much information

163: is contained in `War and Peace'? Is it reasonable to include this

164: novel in the set of `possible novels,' or even to postulate

165: some probability distribution for this set? Or, on the other hand, must

166: we assume that the individual scenes in this book form a random

167: sequence with `stochastic relations' that damp out quite rapidly over a

168: distance of several pages?''

169: \end{quote}

170: And in

171: }

172: \cite{Ko83}:

173: \begin{quote}

174: ``Our definition of the

175: quantity of information has the advantage that it refers to individual

176: objects and not to objects treated as members of a set of objects

177: with a probability distribution given on it. The probabilistic

178: definition can be convincingly applied to the information contained,

179: for example, in a stream of congratulatory telegrams. But it would

180: not be clear how to apply it, for example, to an estimate of the quantity

181: of information contained in a novel or in the translation of a novel

182: into another language relative to the original. I think that the

183: new definition is capable of introducing in similar applications

184: of the theory at least clarity of principle.''

185: \end{quote}

186: To be sure, both notions are natural: Shannon ignores the object itself

187: but considers only the characteristics of the random source of which the

188: object is one of the possible outcomes, while Kolmogorov considers

189: only the object itself to determine the number of bits in the ultimate

190: compressed version irrespective of the manner in which the object arose.

191: In this paper, we introduce, compare and contrast the Shannon and Kolmogorov

192: approaches.

193: An early comparison between Shannon entropy and Kolmogorov

194: complexity is \cite{ChCo78}.

195: \paragraph{How to read this paper:}

196: We switch back and forth between the two

197: theories concerned according to the following pattern: we first discuss a

198: concept of Shannon's theory, discuss its properties as well as some

199: questions it leaves open.  We then provide Kolmogorov's analogue of

200: the concept and show how it answers the question left open by

201: Shannon's theory.

202: To ease understanding of the two theories and

203: how they relate, we supply the overview below,

204: followed by Sections~\ref{sec:coding}

205: and~\ref{sec:basic}, which discuss preliminaries, fix

206: notation and introduce the basic notions. The other sections are

207: largely independent from one another.

208: Throughout the text,

209: we assume some basic familiarity with elementary notions of

210: probability theory and computation, but we have kept the treatment

211: elementary. This may provoke scorn in the information theorist, who sees

212: an elementary treatment of basic matters in his discipline, and likewise

213: from the computation theorist concerning the treatment

214: of aspects of the elementary theory of computation. But experience has shown

215: that what one expert views as child's play is an insurmountable

216: mountain for his opposite number. Thus, we decided to

217: ignore background knowledge and

218: cover both areas from first principles onwards, so that

219: the opposite expert can easily access the unknown discipline, possibly

220: helped along by the familiar analogues in his own ken of knowledge.

221: \subsection{Overview and Summary}

222: A summary of the basic ideas is given

223: below. In the paper, these notions are discussed in the same order.

224: \begin{description}

225: \item[1. Coding: Prefix codes, Kraft inequality]

226: (Section~\ref{sec:coding}) Since descriptions or {\em encodings\/} of objects are

227: fundamental to both theories, we first review some elementary facts

228: about coding. The most important of these is the {\em Kraft

229:   inequality}. This inequality gives the

230: fundamental relationship between {\em probability density functions and

231:   prefix codes}, which are the type of codes we are interested in.

232: Prefix codes and the Kraft inequality underlie most of Shannon's, and a

233: large part of Kolmogorov's theory.

234: \item[2. Shannon's Fundamental Concept: Entropy]

235: (Section~\ref{sec:shannon}) Entropy is defined by a functional that maps

236: {\em probability distributions\/} or,

237: equivalently, {\em random variables},

238: to {\em real numbers}. This notion is derived from first

239: principles as the only `reasonable' way to measure the

240: %`uncertainty

241: %inherent in a probabilisty distribution', or (equivalently), as the

242: `average amount of information conveyed when an outcome of the random

243: variable is observed'. The notion is then related to

244: encoding and communicating messages by Shannon's famous `coding theorem'.

245: \item[3. Kolmogorov's Fundamental Concept: Kolmogorov Complexity]

246: (Section~\ref{sec:kolmogorov})

247: Kolmogorov complexity is defined by a function that maps {\em

248:   objects\/} (to be thought of as natural numbers or sequences of

249: symbols, for example outcomes of the random variables

250: figuring in the Shannon theory) to the {\em natural numbers\/}. Intuitively, the Kolmogorov

251: complexity of a sequence is the length (in bits) of the shortest computer

252: program that prints the sequence and then halts.

253: \item[4. Relating entropy and

254:   Kolmogorov complexity ]

255: (Section~\ref{sec:KCSE} and Appendix~\ref{sec:universal})

256: Although their primary aim is quite different, and they are functions

257: defined on different spaces,  there are close relations

258: between entropy and Kolmogorov complexity. The formal relation

259: ``entropy = expected Kolmogorov complexity'' is discussed in

260: Section~\ref{sec:KCSE}. The relation is further illustrated

261: by explaining `universal coding' (also introduced by Kolmogorov in 1965)

262: which combines elements from both

263: Shannon's and Kolmogorov's theory, and which lies at the basis of most

264: practical data compression methods. While related to the main theme

265: of this paper, universal coding plays no direct role in the later

266: sections, and therefore we delegated it to Appendix~\ref{sec:universal}.

267: \end{description}

268: Entropy and Kolmogorov Complexity are the basic

269: notions of the two theories. They serve as building blocks for all

270: other important notions in the respective theories. Arguably the most

271: important of these notions is {\em mutual information\/}:

272: \begin{description}

273: \item[5. Mutual Information---Shannon and Kolmogorov Style]

274: (Section~\ref{sec:mutual})

275: Entropy and Kolmogorov complexity are

276: concerned with information in a single object: a random variable

277: (Shannon)

278: or an individual sequence (Kolmogorov). Both theories provide

279: a (distinct) notion of  {\em mutual information\/} that

280: measures the information that {\em one

281: object gives about another object}. In Shannon's theory, this is the

282: information that one random variable carries about another; in

283: Kolmogorov's theory (`algorithmic mutual information'),

284: it is the information one sequence gives about another.

285: In an appropriate setting the former notion can be shown to

286: be the expectation of the latter notion.

287: \item[6. Mutual Information Non-Increase]

288: (Section~\ref{sect.mini})

289: In the probabilistic setting the mutual information between two

290: random variables cannot be increased by processing the outcomes.

291: That stands to reason, since the mutual information is expressed

292: in probabilities of the random variables involved. But in the algorithmic

293: setting, where we talk about mutual information between two

294: strings, this is not evident at all. Nonetheless, up to some precision,

295: the same non-increase law holds. This result was used recently to

296: refine and extend G\"odel's celebrated incompleteness theorem.

297: \item[7. Sufficient Statistic] (Section~\ref{sect.sufstat}) Although

298:   its roots are in the statistical literature, the notion of

299:   probabilistic ``sufficient statistic'' has a natural formalization

300:   in terms of mutual Shannon information, and can thus also be

301:   considered a part of Shannon theory. The probabilistic sufficient

302:   statistic extracts the information in the data about a model class.

303:   In the algorithmic setting, a sufficient statistic extracts the

304:   meaningful information from the data, leaving the remainder as

305:   accidental random ``noise''.  In a certain sense the probabilistic version of

306:   sufficient statistic is the expectation of the algorithmic version.

307:   These ideas are generalized significantly in the next item.

308: \item[8. Rate Distortion Theory versus Structure Function]

309:   (Section~\ref{sect.rdsf}) Entropy, Kolmogorov complexity and mutual

310:   information are concerned with {\em lossless\/} description or

311:   compression: messages must be described in such a way that from the

312:   description, the original message can be completely reconstructed.

313:   Extending the theories to {\em lossy\/} description or compression

314:   leads to rate-distortion theory in the Shannon setting, and the

315:   Kolmogorov structure function in the Kolmogorov setting. The basic

316:   ingredients of the lossless theory (entropy and Kolmogorov

317:   complexity) remain the building blocks for such extensions.  The

318:   Kolmogorov structure function significantly extends the idea of

319:   ``meaningful information'' related to the algorithmic sufficient

320:   statistic, and can be used to provide a foundation for inductive

321:   inference principles such as Minimum Description Length (MDL). Once again, the Kolmogorov

322:   structure function can be related to Shannon's rate-distortion

323:   function by taking expectations in an appropriate manner.

324: \end{description}

325:

326: \subsection{Preliminaries}

327: \label{sec:preliminaries}

328: \paragraph{Strings:}

329: Let ${\cal B}$ be some finite or countable set. We use the notation

330: ${\cal B}^*$ to denote the set of finite

331: {\em strings\/} or {\em sequences\/} over ${\cal B}$. For example,

332: $$\{0,1\}^* = \{ \epsilon,0,1,00,01,10,11,000,\ldots \},$$

333: with $\epsilon$ denoting the {\em empty word} `' with no letters.

334: Let

335: ${\cal N}$ denote the natural

336: numbers. We identify

337: ${\cal N}$ and $\{0,1\}^*$ according to the

338: correspondence

339: \begin{equation}

340: \label{eq:correspondence}

341: (0, \epsilon ), (1,0), (2,1), (3,00), (4,01), \ldots

342: \end{equation}

343: The {\em length} $l(x)$ of $x$ is the number of bits

344: in the binary string $x$. For example,

345: $l(010)=3$ and $l(\epsilon)=0$.

346: If $x$ is interpreted as an integer, we get $ l(x) =  \lfloor \log

347: (x+1) \rfloor$ and, for $x \geq 2$,

348: \begin{equation}

349: \label{eq:intlength}

350: \lfloor \log x \rfloor

351: \leq l(x) \leq \lceil \log x \rceil.

352: \end{equation}

353: Here, as in the sequel, $\lceil x \rceil$ is the smallest integer larger than or equal to

354: $x$, $\lfloor x \rfloor$ is the largest integer smaller than or equal

355: to $x$ and $\log$ denotes logarithm  to base two.

356: We shall typically be concerned with

357: encoding finite-length binary strings by other finite-length binary strings.

358: The emphasis is on binary strings only for convenience;

359: observations in any alphabet can be so encoded in a way

360: that is `theory neutral'.
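The following worked example, with numbers chosen here purely for illustration, may help fix the correspondence between integers and strings.
\begin{example}
\rm
Under the correspondence (\ref{eq:correspondence}), the integer $x=4$ is
identified with the string $01$, so that $l(4) = 2 = \lfloor \log 5 \rfloor$,
in agreement with $l(x) = \lfloor \log (x+1) \rfloor$. Likewise, $x=7$
is identified with $000$, so that $l(7) = 3 = \lfloor \log 8 \rfloor$, and
\eqref{eq:intlength} reads $\lfloor \log 7 \rfloor = 2 \leq 3 \leq 3 = \lceil \log 7 \rceil$.
\end{example}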

361:

362: \paragraph{Precision and $\lea, \eqa$ notation:}

363: It is customary in the area of Kolmogorov complexity

364:  to use ``additive constant $c$'' or

365: equivalently ``additive $O(1)$ term'' to mean a constant,

366: accounting for the length of a fixed binary program,

367: independent from every variable or parameter in the expression

368: in which it occurs. In this paper we use the prefix complexity

369: variant of Kolmogorov complexity for convenience. Since

370: (in)equalities in the Kolmogorov complexity setting

371: typically hold up to an additive constant, we use a special notation.

372:

373: We will denote by $\lea$ an

374: inequality to within an additive constant. More precisely, let $f,g$

375: be functions from $\{0,1\}^*$ to ${\cal R}$,

376: the {\em real numbers}. Then by `$f(x) \lea g(x)$'

377: we mean that there exists a $c$ such that for all $x \in \{0,1\}^*$,

378: $f(x) < g(x) + c$.  We denote by $\eqa$ the situation when both $\lea$

379: and $\gea$ hold.
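A small example may clarify the notation; the functions $f$, $g$, $h$ below are hypothetical and serve only as illustration.
\begin{example}
\rm
Let $f(x) = l(x) + 100$, $g(x) = l(x)$ and $h(x) = 2 l(x)$.
Then $f(x) \lea g(x)$, since $f(x) < g(x) + c$ holds for all $x$ with $c = 101$,
and $g(x) \lea f(x)$, since $g(x) < f(x) + 1$ for all $x$. Hence $f(x) \eqa g(x)$.
In contrast, $h(x) \lea g(x)$ does {\em not\/} hold: it would require
$l(x) < c$ for all $x \in \{0,1\}^*$, which fails for every fixed constant $c$.
\end{example}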

380:

381: \paragraph{Probabilistic Notions:}

382: Let ${\cal X}$ be a finite or countable set. A function $f: {\cal X}

383: \rightarrow [0,1]$ is a {\em probability mass function} if $\sum_{x

384:   \in {\cal X}} f(x) = 1$. We call $f$ a {\em sub-probability mass

385:   function} if $\sum_{x \in {\cal X}} f(x) \leq 1$. Such sub-probability

386: mass functions will sometimes be used for technical convenience. We

387: can think of them as ordinary probability mass functions by

388: considering the surplus probability to be concentrated on an undefined

389: element $u \not\in {\cal X}$.

390:

391: In the context of (sub-) probability mass functions, ${\cal

392:   X}$ is called the {\em sample space}. Associated with mass function $f$ and

393: sample space ${\cal X}$ is the {\em random variable\/} $X$ and the

394: probability distribution $P$ such that $X$ takes value $x \in {\cal

395:   X}$ with probability $P(X=x) = f(x)$. A subset of ${\cal X}$ is

396: called an {\em event}. We extend the probability of individual

397: outcomes to events.  With this terminology, $P(X= x) = f(x)$ is the

398: probability that the singleton event $\{x\}$ occurs, and $P(X \in

399: {\cal A}) = \sum_{x \in {\cal A}} f(x)$. In some cases (where the use

400: of $f(x)$ would be confusing) we write $p_x$ as an abbreviation of

401: $P(X= x)$.  In the sequel, we often refer to probability distributions

402: in terms of their mass functions, i.e. we freely employ phrases like

403: `Let $X$ be distributed according to $f$'.

404:

405: Whenever we refer to probability mass functions without explicitly

406: mentioning the sample space, ${\cal X}$ is assumed to

407: be ${\cal N}$ or, equivalently, $\{ 0,1\}^*$.

408:

409: For a given probability mass function $f(x,y)$ on sample space ${\cal

410:   X} \times {\cal Y}$ with random variable $(X,Y)$, we define the {\em

411:   conditional probability mass function\/} $f(y \mid x)$ of outcome

412: $Y=y$ given outcome $X=x$ as

413: $$

414: f(y \mid x) :=  {f(x,y)  \over \sum_{y'}  f(x,y')}.

415: $$

416: Note that $X$ and $Y$ are not necessarily independent.
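As a sanity check on the definition, consider the following small joint distribution; the numbers are our own and are chosen only for illustration.
\begin{example}
\rm
Let ${\cal X} = {\cal Y} = \{0,1\}$ and let
$f(0,0) = 1/2$, $f(0,1) = 1/4$, $f(1,0) = 1/8$, $f(1,1) = 1/8$.
Then
$$
f(1 \mid 0) = {1/4 \over 1/2 + 1/4} = {1 \over 3},
\qquad
f(1 \mid 1) = {1/8 \over 1/8 + 1/8} = {1 \over 2}.
$$
Since the conditional probability of $Y=1$ depends on the value of $X$,
the random variables $X$ and $Y$ are dependent in this example.
\end{example}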

417:

418: In some cases (esp. Section~\ref{sec:relpa} and

419: Appendix~\ref{sec:universal}), the notion of {\em sequential

420:   information source\/} will be needed. This may be thought of as a

421: probability distribution over arbitrarily long binary sequences, of

422: which an observer gets to see longer and longer initial segments.

423: Formally, a sequential information source $P$ is a probability

424: distribution on the set $\{0,1\}^{\infty}$ of one-way infinite

425: sequences. It is characterized by a {\em sequence of probability mass

426:   functions\/} $(f^{(1)},f^{(2)}, \ldots)$ where

427: $f^{(n)}$ is a probability mass function on $\{0,1\}^n$ that

428: denotes the {\em marginal\/} distribution of

429: $P$ on the first $n$ bits. By definition, the sequence

430: $f \equiv (f^{(1)}, f^{(2)},

431: \ldots)$ represents a sequential information source if for all $n \geq

432: 0$, $f^{(n)}$ is related to $f^{(n+1)}$ as follows: for all $x \in

433: \{0,1\}^n$, $\sum_{y \in \{0,1\}} f^{(n+1)}(xy) = f^{(n)}(x)$, with

434: $f^{(0)}(\epsilon)=1$. This is also called Kolmogorov's {\em compatibility

435:   condition\/} \cite{Ri89}.
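A familiar source satisfying the compatibility condition is sketched below; the parameter $\theta$ and the notation $n_1(x)$ are introduced here only for this illustration.
\begin{example}
\rm
Fix $\theta \in [0,1]$ and, for $x \in \{0,1\}^n$, let $n_1(x)$ denote the
number of ones in $x$. Define
$$
f^{(n)}(x) = \theta^{n_1(x)} (1-\theta)^{n - n_1(x)}.
$$
Then, for all $x \in \{0,1\}^n$,
$$
f^{(n+1)}(x0) + f^{(n+1)}(x1) = f^{(n)}(x) (1-\theta) + f^{(n)}(x) \theta = f^{(n)}(x),
$$
so that $(f^{(1)}, f^{(2)}, \ldots)$ represents a sequential information
source: the source of independent coin flips with bias $\theta$.
\end{example}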

436:

437: Some (by no means all!) probability mass functions on $\{ 0,1\}^*$ can

438: be thought of as information sources. Namely, given a probability mass

439: function $g$ on $\{0,1 \}^*$, we can define $g^{(n)}$ as the

440: conditional distribution of $x$ given that the length of $x$ is $n$,

441: with domain restricted to $x$ of length $n$.  That is, $g^{(n)}:

442: \{0,1\}^n \rightarrow [0,1]$ is defined, for $x \in \{0,1\}^n$, as

443: $g^{(n)}(x) = g(x) / \sum_{y \in \{0,1\}^n} g(y)$. Then $g$ can be

444: thought of as an information source if and only if the sequence

445: $(g^{(1)}, g^{(2)},  \ldots)$ represents an information source.
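The following example, with mass functions $g$ and $g'$ of our own choosing, illustrates both a case where this works and a case where it fails.
\begin{example}
\rm
Let $g(x) = 2^{-2 l(x) - 1}$ for $x \in \{0,1\}^*$. The $2^n$ strings of
length $n$ together receive mass $2^n \cdot 2^{-2n-1} = 2^{-n-1}$, and
$\sum_{n \geq 0} 2^{-n-1} = 1$, so $g$ is a probability mass function.
We find $g^{(n)}(x) = 2^{-2n-1} / 2^{-n-1} = 2^{-n}$, and
$\sum_{y \in \{0,1\}} g^{(n+1)}(xy) = 2 \cdot 2^{-n-1} = g^{(n)}(x)$.
Hence $g$ can be thought of as an information source (the fair coin).
Now let $g'$ be equal to $g$, except that $g'(0) = 3/16$ and $g'(1) = 1/16$.
Then $g'^{(1)}(0) = 3/4$, whereas $g'^{(2)}(00) + g'^{(2)}(01) = 1/2$, so the
compatibility condition fails and $g'$ cannot be thought of as an
information source.
\end{example}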

446: \paragraph{Computable Functions:}

447: %{\em Integer-valued functions\/}

448: %When dealing with computability issues, it is convenient to consider

449: Partial functions on the natural numbers ${\cal N}$ are

450: functions $f$ such that $f(x)$ can be `undefined' for some $x$. We

451: abbreviate `undefined' to `$\uparrow$'. A

452: central notion in the theory of computation is that of the {\em

453:   partial recursive functions}. Formally, a function $f: {\cal N}

454: \rightarrow {\cal N} \cup \{ \uparrow \}$ is called {\em partial

455:   recursive\/} or {\em computable\/} if there exists a Turing Machine

456: $T$ that implements $f$. This means that for all $x$

457: \begin{enumerate}

458: \item

459: If $f(x) \in {\cal N}$, then $T$,

460: when run with input $x$ outputs $f(x)$ and then halts.

461: \item

462: If $f(x) = \uparrow$ (`$f(x)$ is undefined'), then $T$ with input $x$ never halts.

463: \end{enumerate}

464: Readers not familiar with computation theory may think of a Turing

465: Machine as a computer program written in a general-purpose language such as

466: C or Java.
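
To make the definition concrete, here is a simple partial recursive function; the particular function is a hypothetical example of our own.
\begin{example}
\rm
Let $f(x) = x/2$ if $x$ is even, and $f(x) = \uparrow$ if $x$ is odd.
A Turing Machine $T$ implementing $f$ first checks the parity of its
input $x$. If $x$ is even, $T$ outputs $x/2$ and halts; if $x$ is odd,
$T$ enters an infinite loop and never halts. Thus $f$ is partial
recursive, although it is undefined on infinitely many inputs.
\end{example}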

467:

468: A function $f: {\cal N} \rightarrow {\cal N} \cup \{ \uparrow \}$ is

469: called {\em total\/} if it is defined for all $x$ (i.e. for all $x$,

470: $f(x) \in {\cal N}$). A {\em total recursive\/} function is thus a

471: function that is implementable on a Turing Machine that halts on all

472: inputs. These definitions are extended to several arguments as

473: follows: we fix, once and for all, some standard invertible pairing

474: function $\langle \cdot, \cdot \rangle: {\cal N} \times {\cal N}

475: \rightarrow {\cal N}$ and we say that $f: {\cal N} \times {\cal N}

476: \rightarrow {\cal N} \cup \{ \uparrow \}$ is computable if there

477: exists a Turing Machine $T$ such that for all $x_1, x_2$, $T$ with

478: input $\langle x_1, x_2 \rangle$ outputs $f(x_1,x_2)$ and halts if

479: $f(x_1,x_2) \in {\cal N}$ and otherwise $T$ does not halt. By

480: repeating this construction, functions with arbitrarily many arguments

481: can be considered.
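
The paper does not commit to a particular pairing function. One common choice, given here only as an assumption for illustration, is the Cantor pairing function.
\begin{example}
\rm
Define
$$
\langle x_1, x_2 \rangle = {(x_1 + x_2)(x_1 + x_2 + 1) \over 2} + x_2 .
$$
For instance, $\langle 0,0 \rangle = 0$, $\langle 1,0 \rangle = 1$,
$\langle 0,1 \rangle = 2$ and $\langle 2,1 \rangle = 3 \cdot 4 / 2 + 1 = 7$.
This map is a bijection from ${\cal N} \times {\cal N}$ to ${\cal N}$, and
both the map and its inverse are computable. A function of three
arguments can then be handled through
$\langle x_1, \langle x_2, x_3 \rangle \rangle$.
\end{example}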

482:

483: {\em Real-valued Functions:} We call a

484: function $f: {\cal N} \rightarrow {\cal R}$ {\em recursive\/} or

485: {\em computable\/} if there exists a Turing machine that, when input

486: $\langle x, y\rangle$ with $x \in \{0,1\}^*$ and $y \in {\cal N}$,

487: outputs $f(x)$ to precision $1/y$; more precisely, it outputs a pair

488: $\langle p, q \rangle$ such that $| p/q - |f(x)| | < 1/y $ and an

489: additional bit to indicate whether $f(x)$ is larger or smaller than $0$.

490: Here $\langle \cdot, \cdot \rangle$ is the standard pairing function.

491: In this paper all real-valued functions we consider are by definition

492: total. Therefore, in line with the above definitions, for a

493: real-valued function `computable' (equivalently, recursive), means

494: that there is a Turing Machine which for {\em all\/} $x$, computes

495: $f(x)$ to arbitrary accuracy; `partial' recursive real-valued

496: functions are not considered.

497:

498: It is convenient to distinguish between {\em upper\/} and {\em lower

499:   semi-computability}.  For this purpose we consider both the argument

500: of an auxiliary function $\phi$ and the value of $\phi$ as a pair of

501: natural numbers according to the standard pairing function $\langle

502: \cdot, \cdot \rangle$. We define a function from ${\cal N}$ to the reals

503: ${\cal R}$ by a Turing machine $T$ computing a function $\phi$ as

504: follows. Interpret the computation $\phi(\langle x,t \rangle ) =

505: \langle p,q \rangle$ to mean that the quotient $p/q$ is the rational

506: valued $t$th approximation of $f(x)$.

507: \begin{definition}\label{def.enum.funct}

508: \rm

509: \label{def.semi}

510: A function $f: {\cal N} \rightarrow {\cal R}$ is

511: {\em lower semi-computable} if there is a Turing machine $T$ computing a

512: total function $\phi$

513: such that $\phi (x,t+1) \geq \phi (x,t)$ and

514: $\lim_{t \rightarrow \infty} \phi (x,t)=f(x)$. This means

515: that $f$ can be computably approximated from below.

516: A function $f$ is {\em upper semi-computable} if

517: $-f$ is lower semi-computable.

518: Note that, if $f$ is both upper- and lower semi-computable, then

519: $f$ is computable.

520: \end{definition}
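A standard illustration of a function that is lower semi-computable but not computable is sketched next. It assumes a fixed effective enumeration $T_0, T_1, \ldots$ of all Turing machines, which is not introduced in this section.
\begin{example}
\rm
Let $f(x) = 1$ if $T_x$ halts on input $x$, and $f(x) = 0$ otherwise.
Define $\phi(x,t) = 1$ if $T_x$ halts on input $x$ within $t$ steps, and
$\phi(x,t) = 0$ otherwise. Then $\phi$ is total and computable (simulate
$T_x$ for $t$ steps), $\phi(x,t+1) \geq \phi(x,t)$, and
$\lim_{t \rightarrow \infty} \phi(x,t) = f(x)$, so $f$ is lower
semi-computable. However, $f$ is not computable: an approximation of
$f(x)$ to precision $1/2$ would tell us whether $T_x$ halts on $x$,
which is undecidable. By the last remark in
Definition~\ref{def.semi}, $f$ is therefore not upper semi-computable.
\end{example}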

521: %For example, $K(x)$ is upper semi-computable, but not computable.

522:

523: {\em (Sub-) Probability mass functions\/:} Probability mass

524: functions on $\{0,1\}^*$ may be thought of as real-valued functions on

525: ${\cal N}$. Therefore, the definitions of `computable' and

526: `recursive' carry over unchanged from the real-valued function case.

527: \subsection{Codes}

528: \label{sec:coding}

529: We repeatedly consider the following scenario: a {\em

530:   sender\/} (say, A) wants to communicate or transmit some information

531:   to a {\em receiver\/} (say, B). The information to be transmitted is

532:   an element from some set ${\cal X}$ (this set may or may not consist

533:   of binary strings).

534: It will be communicated by sending a

535: binary string, called the {\em message}.

536: When B receives the message, he can decode it again and (hopefully)

537:   reconstruct the element of ${\cal X}$ that was sent.

538: To achieve this, A and B need to agree

539:   on a {\em code\/} or {\em description method\/} before

540:   communicating. Intuitively, this is a binary relation between {\em

541:   source words} and associated {\em code words}. The relation is fully

542:   characterized by the {\em decoding function}. Such a decoding function

543: $D$ can be any function $D: \{ 0, 1 \}^* \rightarrow {\cal X}$.

544: The domain of $D$ is the set of %

545: \it code words

546: \rm and the range of $D$ is the set of %

547: \it source words. \rm $D(y) = x$ is interpreted as ``$y$ is a code

548: word for the source word $x$''.

549: The set of all code words

550: for source word $x$ is the set $D^{-1} (x) = \{ y: D(y) = x \}$.

551: Hence, $E=D^{-1}$ can be called the %

552: \it encoding %

553: \rm substitution

554: ($E$ is not necessarily a function). With each code $D$ we can

555: associate a {\em length function\/} $L_D: {\cal X} \rightarrow {\cal N}$

556: such that, for each source

557: word $x$, $L_D(x)$ is the length of the shortest encoding of $x$:

558: $$

559: L_D(x) = \min \{ l(y): D(y) = x  \}.

560: $$

561: We denote by $x^*$ the shortest $y$ such that $D(y) = x$; if there is

562: more than one such $y$, then $x^*$ is defined to be the

563: first such $y$ in some agreed-upon order---for example,

564: the lexicographical order.
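
The notions above can be traced on a toy code; the source words $a,b$ and the code words below are hypothetical and serve only as illustration.
\begin{example}
\rm
Let ${\cal X} = \{a, b\}$ and let $D$ be defined on the code words
$0$, $10$ and $110$ by $D(0) = a$, $D(10) = b$ and $D(110) = a$.
Then $D^{-1}(a) = \{0, 110\}$ and $D^{-1}(b) = \{10\}$, so the encoding
substitution $E = D^{-1}$ is not a function. The length function is
$L_D(a) = \min \{ l(0), l(110) \} = 1$ and $L_D(b) = 2$, and the shortest
code words are $a^* = 0$ and $b^* = 10$.
\end{example}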

565:

566: In coding theory attention is often restricted to

567: the case where the source word set is finite, say

568: ${\cal X} =  \{  1, 2,  \ldots , N  \}  $. If there is a constant $l_0$

569: such that $l(y) = l_0$ for all code words $y$ (which implies $L_D(x) =

570: l_0$ for all source words $x$),

571: then we call $D$ a %

572: \it fixed-length

573: \rm code. It is

574: easy to see that $l_0   \geq   \log N$.

575: For instance, in teletype transmissions the source

576: has an alphabet of $N = 32$ letters, consisting

577: of the 26 letters in the Latin alphabet plus

578: 6 special characters. Hence, we need $l_0 = 5$

579: binary digits per source letter. In electronic computers

580: we often use the fixed-length ASCII code\index{code!ASCII}

581: with $l_0=8$.

582: \paragraph{Prefix code:}

583: It is immediately clear that in general

584: we cannot uniquely recover $x$ and $y$ from $E(xy)$.

585: Let $E$ be

586: the identity mapping.

587: Then we have $E(00)E(00) = 0000 = E(0)E(000)$.

588: We now introduce {\em prefix codes}, which do not suffer from this defect.

589: A binary string $x$

590: is a {\em proper prefix} of a binary string $y$

591: if we can write $y=xz$ for $z \neq \epsilon$.

592:  A set $\{x,y, \ldots \} \subseteq \{0,1\}^*$

593: is {\em prefix-free} if for any pair of distinct

594: elements in the set neither is a proper prefix of the other.

595: A function $D: \{ 0, 1 \}^*  \rightarrow  {\cal N}$

596: defines a {\it prefix-code}\index{code!prefix-}

597: if its domain is prefix-free.

598: In order to decode a code sequence of a prefix-code,

599: we simply start at the beginning and decode one

600: code word at a time. When we come to the end of

601: a code word, we know it is the end, since no

602: code word is the prefix of any other code word

603: in a prefix-code.

604:

605: Suppose we encode each binary string $x=x_1 x_2 \ldots x_n$ as

606: \[ \bar x = \underbrace{11 \ldots 1}

607: _{n \mbox{{\scriptsize  \ times}}}0x_1x_2 \ldots x_n .\]

608: The resulting code is prefix because we can determine where the

609: code word $\bar x$ ends by reading it from left to right without

610: backing up. Note $l(\bar{x}) = 2n+1$; thus, we have encoded strings in

611: $\{0,1\}^*$ in a prefix  manner at the price of doubling their

612: length. We can  get a much more efficient code by applying the

613: construction above to the length $l(x)$ of $x$ rather than $x$ itself:

614: define $x'=\overline{l(x)}x$, where $l(x)$ is interpreted as a binary

615: string according to the correspondence (\ref{eq:correspondence}). Then the code $D'$ with

616: %, for all $x \in \{0,1\}^*$,

617: $D'(x') = x$ is a prefix  code satisfying, for all $x \in

618: \{0,1\}^*$, $l(x') = n+2 \log n+1$ (here we ignore the `rounding error'

619: in \eqref{eq:intlength}). $D'$ is used throughout this paper as

620: a standard code to encode natural numbers in a prefix-free manner; we call it

621: the {\em standard prefix-code for the natural numbers}. We use

622: $\Lint(x)$ as notation for $l(x')$. When $x$ is

623: interpreted as an integer (using the correspondence

624: (\ref{eq:correspondence}) and (\ref{eq:intlength})), we see that,

625: up to rounding,

626: $\Lint(x) = \log x +

627: 2 \log \log x+1$.
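To make the two constructions concrete, here is a small worked instance. The particular strings are our own illustration, and we assume (since the correspondence (\ref{eq:correspondence}) is not restated here) that it represents the number $n$ by a binary string of length $\lfloor \log (n+1) \rfloor$. For $x = 0110$ we have $n = 4$ and
\[ \bar{x} = 1111\,0\,0110, \qquad l(\bar{x}) = 2 \cdot 4 + 1 = 9 . \]
For a string $x$ of length $n = 100$, the first code gives $l(\bar{x}) = 201$, whereas under the stated assumption $l(x)$ is represented by a string of $\lfloor \log 101 \rfloor = 6$ bits, so that
\[ l(x') = l(\overline{l(x)}) + l(x) = (2 \cdot 6 + 1) + 100 = 113 , \]
in agreement, up to rounding, with $n + 2 \log n + 1 \approx 114.3$.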

628:

629: \paragraph{Prefix codes and the Kraft inequality:}

630: %It requires little reflection to realize that

631: %prefix-codes waste potential code words since

632: %the internal nodes of the representation tree

633: %cannot be used, and in fact neither

634: %are the potential descendants of the external nodes

635: %used. Hence, we can expect that the code-word length

636: %exceeds the (binary) source-word length in prefix-codes.

637: Let ${\cal X}$ be the set of natural numbers and

638: consider the straightforward non-prefix representation

639: (\ref{eq:correspondence}).

640: There are two elements of ${\cal X}$ with

641: a description of length $1$, four with a description of

642: length $2$ and so on. However, for a prefix code $D$ for the natural numbers

643: there are fewer binary prefix code words of each length:

644: if $x$ is a prefix code word

645: then no $y = xz$ with $z \neq \epsilon$ is a prefix code word.

646: Asymptotically there are fewer prefix code words of length $n$

647: than the $2^n$ source words of length $n$.

648: Quantification of this intuition for countable ${\cal X}$ and

649: arbitrary prefix-codes leads to

650: a precise constraint on the number of code-words of given lengths.

651: This important relation is known as the

652: {\em Kraft Inequality\index{Kraft Inequality|bold}}

653: and is due to L.G. Kraft\index{Kraft, L.G.} \cite{Kr49}.

654: \begin{theorem}

655: \label{kraft}

656: Let

657: $l_1 , l_2 , \ldots $

658: be a finite or infinite sequence

659: of natural numbers.

660: There is a prefix-code with this sequence as

661: lengths of its binary code words iff

662: $$

663: \sum_n  2^{{-}  l_n }  \leq 1.

664: $$

665: \end{theorem}
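As a quick check of Theorem~\ref{kraft}, with lengths chosen by us purely for illustration, the lengths $1,2,3,3$ give
\[ 2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 \leq 1 , \]
and indeed $\{0, 10, 110, 111\}$ is a prefix-free set with these lengths. The lengths $1,1,2$ give $2^{-1}+2^{-1}+2^{-2} = 5/4 > 1$, so no prefix-code has them: once $0$ and $1$ are both code words, every longer string has one of them as a proper prefix. For an infinite example, the code $x \mapsto \bar{x}$ introduced above has $2^n$ code words of length $2n+1$ for each $n \geq 0$, and
\[ \sum_{n \geq 0} 2^{n} \, 2^{-(2n+1)} = \sum_{n \geq 0} 2^{-(n+1)} = 1 . \]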

666: \paragraph{Uniquely Decodable Codes:}

667: We want to code elements of ${\cal X}$ in such a way that they can be

668: uniquely reconstructed from the encoding. Such codes are called

669: `uniquely decodable'.

670: Every prefix-code is a uniquely decodable code. For example, if

671: $E(1) = 0$, $E(2) = 10$, $E(3) = 110$, $E(4) = 111$

672: %peter1 left out

673: %as in Figure~\ref{prefix.tree.picture},

674: %

675: then

676: $1421$ is encoded as $0111100$, which can be

677: easily decoded from left to right in

678: a unique way.

679:

680: On the other hand, not every uniquely decodable code satisfies the prefix

681: condition.

682: %For example, if $E(1) = 0$,

683: %$E(2) = 01$, $E(3) = 011$, $E(4) = 0111$, then

684: %every code word is a prefix of every

685: %longer code word

686: %peter1 left out

687: %as in Figure~\ref{non.prefix.tree.picture}

688: %. But unique decoding is trivial,

689: %since the beginning of a new code word is

690: %always indicated by a zero.

691: Prefix-codes are

692: distinguished from other uniquely decodable codes

693: by the property that the end of a code word is always

694: recognizable as such. This means that decoding

695: can be accomplished without the delay of observing

696: subsequent code words, which is why prefix-codes

697: are also called instantaneous codes.
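A small illustration, sketched here with code words of our own choosing: let $E(1) = 0$, $E(2) = 01$, $E(3) = 011$, $E(4) = 0111$. Every code word is a proper prefix of every longer one, so the code is not prefix-free. It is nevertheless uniquely decodable, since each code word contains exactly one $0$, namely its first symbol: the sequence $0011101$ can only be split as
\[ 0 \mid 0111 \mid 01 , \]
that is, as the encoding of the source sequence $142$. The decoder, however, recognizes the end of a code word only upon seeing the $0$ that starts the next one (or the end of the sequence), so decoding is not instantaneous.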

698:

699: There is

700: good reason for our emphasis on prefix-codes.

701: Namely, it turns out that

702: Theorem~\ref{kraft} stays valid if we replace

703: ``prefix-code'' by ``uniquely decodable code.''

704: %This follows directly from the observation

705: %that if a code has code-word lengths $l_1 , l_2 , \ldots $

706: %and it is uniquely decodable, then the Kraft Inequality

707: %\index{Kraft Inequality}

708: %must be satisfied; see \cite{CT91} for details.

709: This important fact means that every

710: uniquely decodable code can be replaced

711: by a prefix-code without changing the set of

712: code-word lengths.

713: %Hence, all propositions concerning

714: %code-word lengths apply to uniquely decodable

715: %codes and to the subclass of prefix-codes.

716: In Shannon's and Kolmogorov's theories, we are only interested in code

717: word {\em lengths\/} of uniquely decodable codes rather than actual

718: encodings. By the previous

719: argument, we may restrict the set of codes we work with to prefix

720: codes, which are much easier to handle.

721: %Accordingly, in looking for uniquely decodable

722: %codes with minimal average code-word length

723: %we can restrict ourselves to prefix-codes.

724: \paragraph{Probability distributions and complete prefix codes:}

725: A uniquely decodable code is %

726: \it complete\index{code!uniquely decodable} \rm if the addition of any

727: new code word to its code word set results in a non-uniquely decodable

728: code.  It is easy to see that a code is complete iff equality holds in

729: the associated Kraft Inequality.  Let $l_1, l_2, \ldots$ be the

730: code-word lengths of some complete uniquely decodable code. Let us define

731: $q_x = 2^{- l_x}$.  By definition of completeness, we have $\sum_x q_x

732: = 1$. Thus, the $q_x$ can be thought of as the values of a {\em probability mass

733:   function\/} corresponding to some probability distribution $Q$. We

734: say $Q$ is the distribution {\em corresponding\/} to $l_1,l_2,\ldots$.

735: In this way, each complete uniquely decodable code is mapped to a

736: unique probability distribution. Of course, this is nothing more than

737: a formal correspondence: we may choose to encode outcomes of $X$ using

738: a code corresponding to a distribution $q$, whereas the outcomes are

739: actually distributed according to some $p \neq q$. But, as we show

740: below, if $X$ is distributed according to $p$, then the code to which

741: $p$ corresponds is, in an average sense, the code that achieves

742: optimal compression of $X$.
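For instance (a small example of our own), the complete prefix-code with code words $0, 10, 110, 111$ has lengths $l_1,\ldots,l_4 = 1,2,3,3$ and therefore corresponds to
\[ (q_1, q_2, q_3, q_4) = (2^{-1}, 2^{-2}, 2^{-3}, 2^{-3}) = \left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\right), \qquad \sum_x q_x = 1 . \]
Conversely, under this correspondence an outcome with probability $q_x$ receives a code word of length $l_x = \log 1/q_x$.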

743: \section{Shannon Entropy versus Kolmogorov Complexity}

744: \label{sec:basic}

745: %A special but very important case occurs if the sample

746: %space ${\cal X}$ is discrete and the outcomes $x$ in our space

747: %are generated by a random source $X$ (are distributed

748: %according to some probability distribution $P(X = x)$), or if Observer is

749: %willing to bet on outcomes as if they were\footnote{In that case we

750: %may say $P$ is the Agent's `subjective distribution'}. This is

751: %situation for which C.E. Shannon has developed his famous information

752: %theory \cite{Shannon48}. More generally,

753: %let $\Pi = \{\pi_1, \ldots, \pi_m\}$ be some

754: %partition of ${\cal X}$. Let $Y$ be another random variable that can take on

755: %values in $\Pi$, for example we partition the sample

756: %space in subsets of outcomes sharing a common property.

757: %Then the statement ``$Y = \pi_j$'' indicates

758: %that the event $\pi_j$ has obtained: the world is in some state

759: %$x \in \pi_j$. As before, we let $X = x$ represent the event that the

760: %world is in state $x \in {\cal X}$.  Shannon defines the amount of

761: %information that observing the value of $Y$ gives about the value of

762: %$X$ (the state of the world) in terms of the {\em mutual

763: %information} between $X$ and $Y$. This is a quantity that is defined

764: %in terms of the {\em entropy}. The entropy in turn is the fundamental

765: %quantity of Shannon's theory. Very roughly speaking, the `entropy' of

766: %random variable $X$ can be interpreted as the `expected amount of

767: %surprise' in observing an outcome of the random variable $X$.

768: %The conditional entropy of

769: %$X$ given $Y$ is the `expected amount of surprise' in observing the

770: %outcome of $X$ {\em after\/} having observed the outcome of $Y$. The

771: %`information that observing the outcome of $Y$ gives about the outcome of

772: %$X$' is called the `mutual information between $X$ and $Y$' and is

773: %defined as the entropy of $X$ minus the entropy of $X$ given $Y$.

774: %

775: \subsection{Shannon Entropy}

776: \label{sec:shannon}

777: It seldom happens that a detailed mathematical theory springs forth in

778: essentially final form from a single publication. Such was the case

779: with Shannon information theory, which properly started only with the

780: appearance of C.E. Shannon's paper ``A mathematical theory of

781: communication'' \cite{Sh48}.

782: In this paper, Shannon proposed a measure of

783: information in a distribution, which he

784: called the `entropy'. The

785: entropy $H(P)$ of a distribution $P$ measures

786: `the inherent  uncertainty in $P$', or (in fact

787: equivalently), `how much information is gained when an outcome of $P$

788: is observed'. To make this a bit more precise, let us imagine an

789: observer  who knows that $X$ is distributed

790: according to $P$. The observer then observes $X=x$. The entropy of $P$

791: stands for the `uncertainty of the observer about the outcome $x$

792: {\em before\/} he observes it'. Now think of the observer as a

793: `receiver' who receives the message conveying the value of $X$. From this dual point of

794: view, the entropy stands for

795: \begin{quote}

796: the average amount of information that the observer has gained {\em after\/}

797: receiving a realized outcome $x$ of the random variable $X$. $(*)$

798: \end{quote}

799: Below, we first give Shannon's mathematical definition of entropy, and

800: we then connect it to its intuitive meaning $(*)$.

801: \begin{definition} \rm Let ${\cal X}$ be a finite or countable

802: \label{def.entropy}

803: set, let $X$ be a random variable taking values in ${\cal X}$ with

804:   distribution $P(X=x)=p_x$. Then

805: the  (Shannon-)

806: \it entropy\index{entropy|bold}\index{$H$: entropy stochastic source} %

807: \rm of random variable $X$

808: is given by

809: \begin{equation}

810: \label{eq:entropy}

811: H(X)  =   \sum_{x \in {\cal X}} p_x \log 1/p_x .

812: \end{equation}

813: Entropy is defined here as a functional mapping random

814: variables to real numbers. In many texts, entropy is, essentially

815: equivalently, defined as a map from {\em distributions\/} of random variables to

816: the real numbers. Thus, by definition:

817: $

818: H(P) := H(X) =  \sum_{x \in {\cal X}} p_x \log 1/ p_x

819: $.

820: \end{definition}
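As a small numerical illustration (the distribution is chosen by us, and we take $\log$ to be the binary logarithm, as in the code-length computations of this section), consider a coin with $P(X=1) = 1/4$ and $P(X=0) = 3/4$. Then
\[ H(X) = \tfrac{1}{4} \log 4 + \tfrac{3}{4} \log \tfrac{4}{3} = \tfrac{1}{2} + \tfrac{3}{4}\,(2 - \log 3) \approx 0.811 , \]
which is less than the value $H(X) = \log 2 = 1$ obtained for a fair coin.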

821: \paragraph{Motivation:} The entropy function \eqref{eq:entropy}

822: can be motivated in different ways. The two most

823: important ones are the {\em axiomatic\/} approach and the {\em coding

824:   interpretation}.  In this paper we concentrate on the latter, but we

825: first briefly sketch the former. The idea of the axiomatic approach is

826: to postulate a

827: small set of self-evident axioms that

828: any measure of information relative to a distribution should

829: satisfy. One then shows that the only measure satisfying all the

830: postulates is the Shannon entropy. We

831: outline this approach for

832: finite sources ${\cal X} = \{1,\ldots, N\}$. We look for a function

833: $H$ that maps probability distributions on ${\cal X}$ to real

834: numbers. For given distribution $P$, $H(P)$ should measure

835: `how much information is gained on average  when an outcome is made

836: available'. We can write $H(P) = H(p_1,\ldots,p_N)$ where

837: $p_i$ stands for the

838:   probability of $i$.

839: Suppose we require that

840: \begin{enumerate}

841: \item $H(p_1,\ldots,p_N)$ is continuous in $p_1,\ldots,p_N$.

842: \item If all the $p_i$ are equal, $p_i = 1/N$, then $H$ should be a

843:   monotonically increasing function of $N$. With equally likely events

844: there is more choice, or uncertainty, when there are more possible

845: events.

846: \item If a choice is broken down into two successive choices, the

847:   original $H$ should be the weighted sum of the individual values of

848:   $H$. Rather than formalizing this condition, we will give a specific

849:   example. Suppose that ${\cal X} = \{ 1,2,3\}$, and $p_1 = \frac{1}{2}, p_2 =

850:   1/3, p_3 = 1/6$. We can think of $x \in {\cal X}$ as being generated

851:   in a two-stage process. First, an outcome in ${\cal X'} =\{0,1\}$ is

852:   generated according to a distribution $P'$ with

853:  $p'_0 = p'_1 = \frac{1}{2}$. If $x'=1$, we set $x=1$ and the process

854:  stops. If $x'= 0$, then  outcome `$2$' is generated with probability

855:  $2/3$ and outcome `$3$' with probability $1/3$, and the process

856:  stops. The final results

857:  have the same probabilities as before. In this particular case we

858:  require that

859: $$H(\frac{1}{2},\frac{1}{3},\frac{1}{6}) = H(\frac{1}{2},\frac{1}{2}) + \frac{1}{2} H(\frac{2}{3},\frac{1}{3}) + \frac{1}{2} H(1).$$

860: Thus, the entropy of $P$ must be equal to the entropy of the first

861:  step in the generation process, plus the weighted sum (weighted

862:  according to the probabilities in the first step) of the entropies of the

863:  second step in the generation process.

864:

865: As a special case, if ${\cal X}$

866:  is the $n$-fold product space of another space ${\cal Y}$, $X =

867:  (Y_1,\ldots, Y_n)$ and the $Y_i$ are all independently distributed

868:  according to $P_Y$, then $H(P_X) = n H(P_Y)$. For example, the total

869:  entropy of $n$ independent tosses of a coin with bias $p$ is $n

870:  H(p,1-p)$.

871: \end{enumerate}

872: %Remarkably, Shannon \cite{Sh48} proved that

873: \begin{theorem}

874: \label{thm:axiomatic}

875: The only $H$ satisfying the three above assumptions is of the form

876: $$

877: H =  K \sum_{i=1}^N p_i \log 1/p_i,

878: $$

879: with $K$ a positive constant.

880: \end{theorem}

881: Thus, requirements (1)--(3) lead us to the definition of entropy

882: (\ref{eq:entropy}) given above up to an (unimportant) scaling

883: factor. We shall give a concrete interpretation of this factor later

884: on. Besides  the defining characteristics (1)--(3), the function $H$ has a few other

885: properties that make it attractive as a measure of information.

886: We mention:

887: \begin{description}

888: \item[\rm 4.]  $H(p_1,\ldots,p_N)$ is a concave function of the $p_i$.

889: \item[\rm 5.] For each $N$, $H$ achieves its unique maximum for the uniform distribution $p_i =

890:   1/N$.

891: \item[\rm 6.] $H(p_1,\ldots,p_N)$ is zero iff one of the $p_i$ has value $1$.

892:   Thus, $H$ is zero if and only if we do not gain any information at

893:   all if we are told that the outcome is $i$ (since we already knew

894:   $i$ would take place with certainty).

895: \end{description}
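One can verify directly that the entropy (\ref{eq:entropy}) meets requirement (3) in the example given there. The arithmetic below is our own check, with $\log$ taken to base $2$. We have $H(\frac{1}{2},\frac{1}{2}) = 1$, $H(1) = 0$ and $H(\frac{2}{3},\frac{1}{3}) = \frac{2}{3} \log \frac{3}{2} + \frac{1}{3} \log 3 = \log 3 - \frac{2}{3}$, so the right-hand side equals
\[ 1 + \tfrac{1}{2} \left( \log 3 - \tfrac{2}{3} \right) = \tfrac{2}{3} + \tfrac{1}{2} \log 3 \approx 1.459 , \]
while the left-hand side is
\[ H(\tfrac{1}{2},\tfrac{1}{3},\tfrac{1}{6}) = \tfrac{1}{2} \log 2 + \tfrac{1}{3} \log 3 + \tfrac{1}{6} \log 6 = \tfrac{2}{3} + \tfrac{1}{2} \log 3 . \]
In line with property 5, this is below the maximum $\log 3 \approx 1.585$ attained by the uniform distribution on three outcomes.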

896: \paragraph{The Coding Interpretation:}

897: Immediately after

898: stating Theorem~\ref{thm:axiomatic}, Shannon \cite{Sh48} continues, ``this theorem, and the

899: assumptions required for its proof, are in no way necessary for the

900: present theory. It is given chiefly to provide a certain plausibility

901: to some of our later definitions. The {\em real justification\/} of these

902: definitions, however, will reside in their implications''.

903: %Thus, in the spirit of Shannon, we will henceforth concentrate on a

904: %very concrete

905: Following this injunction, we emphasize the main practical

906: interpretation of entropy as the length

907: (number of bits) needed to encode outcomes in ${\cal X}$. This

908: provides much clearer intuitions,

909: it lies at the root of the many practical applications

910: of information theory, and, most importantly for us,

911: it simplifies the comparison to Kolmogorov complexity.

912:

913: %Very briefly,

914: %the {\em

915: %entropy\/} of $Y$ is the expected number of bits needed to encode

916: %outcomes in $\Pi$ when the (in some sense) most efficient code to

917: %encode outcomes in $\Pi$ is used. The {\em mutual information between

918: %$Y$ and $X$}, or, equivalently {\em the information that observing the

919: %value of $Y$ gives about $X$\/} is the {\em reduction\/} in the

920: %expected number of bits needed to encode an outcome in $\cal X$ if one

921: %has already observed the value of $Y$.

922: \begin{example}

923: \rm

924: The entropy

925: of a random variable $X$ with equally likely outcomes

926: in a finite sample space ${\cal X}$ is given by

927: $H(X) = \log |{\cal X}|$.

928: %This is a measure of the uncertainty in

929: %choice before we have selected a particular value for $X$,

930: %and of the information

931: %produced from the set if we assign a specific value to $X$.

932: By choosing a particular message $x$ from ${\cal X}$,

933: we remove the entropy from $X$ by the

934: assignment $X := x$ and produce

935: or transmit {\em information}\index{information}

936: $I = \log |{\cal X}|$ by our selection of $x$. We  show below

937: that $I = \log |{\cal X}|$ (or, to be more precise, the integer

938: $I' =  \lceil

939: \log |{\cal X}|  \rceil $) can be interpreted as the number of bits

940: that must be transmitted from an (imagined) sender

941: to an (imagined) receiver.

942: \end{example}

943: We now connect entropy to minimum average code lengths. These are

944: defined as follows:

945: %Given a source that produces source words from ${\cal N}$ according

946: %to probability distribution $P$, it is possible

947: %to assign code words to source words in such a way

948: %that any code word sequence is uniquely decodable,

949: %and moreover the

950: %average code-word length is minimal.

951: \begin{definition}

952: \rm

953: Let source words $x \in \{0,1\}^*$ be

954: produced by a random variable $X$ with probability

955: $P(X=x)=p_x$ for the event $X=x$. The characteristics of

956: $X$ are fixed. Now consider prefix codes

957: $D: \{0,1\}^* \rightarrow {\cal N}$

958: with one code word per source word,

959: and denote the length of the code word for $x$ by $l_x$.

960: We want to minimize the expected number of bits

961: we have to transmit for the given

962: source $X$ and choose a prefix code $D$ that achieves this.

963: In order to do so, we must minimize the

964: %

965: \it average code-word length\index{code!average word length|bold}

966: \rm

967: $\bar{L}_{D} = \sum_x p_x l_x$%

968: \rm .

969: %peter1: changed notation here to get consistency with what follows.

970: %Idea is as follows: l(y) = length of y, l_x is length

971: %of codeword of x, L is codelength function, so that L(x) = l_x

972: % (analogously to p_x, P),

973: %\bar{L} is its expectation. Need this because need to speak about

974: %LIST L_1,L_2 of codelength functions later

975: We define the %

976: \it minimal average code word

977: length

978: \rm as $\bar{L} = \min   \{  \bar{L}_{D}: D \mbox{ is a prefix-code}\}  $.

979: A prefix-code $D$ such that $\bar{L}_{D} = \bar{L}$ is called

980: an %

981: \it optimal prefix-code\index{code!optimal prefix-}

982: \rm with respect to prior

983: probability $P$ of the source words.

984: \end{definition}

985: The (minimal) average code length of an

986: (optimal) code does not depend on the details of the set

987: of code words, but only on the set of code-word lengths.

988: It is just the expected code-word length

989: with respect to the given distribution.

990: Shannon\index{Shannon, C.E.} discovered that the

991: minimal average code word

992: length is about equal to the entropy of

993: the source word set. This is known as the

994: {\it Noiseless Coding Theorem}.\index{Theorem!Noiseless Coding|bold}

995: The adjective ``noiseless'' emphasizes that we ignore the possibility

996: of errors.

997: \begin{theorem}

998: \label{thm:noiseless}

999: Let $\bar{L}$ and $P$ be as above.

1000: If $H(P) =  \sum_x p_x  \log 1/ p_x$

1001: is the entropy\index{entropy}, then

1002: \begin{equation}

1003: \label{eq:entopt}

1004: H(P)  \leq \bar{L}  \leq H(P) + 1.

1005: \end{equation}

1006: \end{theorem}

1007: We are typically interested in encoding a binary string

1008: of length $n$ with  entropy proportional to $n$

1009: (Example~\ref{ex:universal}). The essence of

1010: (\ref{eq:entopt})  is that,

1011: for all but the smallest $n$, the difference between

1012: entropy and minimal expected

1013: code length is completely negligible.

1014:

1015: It turns out that the bounds in (\ref{eq:entopt}) are relatively easy to achieve,

1016: with the Shannon-Fano code.

1017: Let there be $N$ symbols

1018: (also called basic messages or source words).

1019: Order these symbols

1020: according to decreasing probability,

1021: say ${\cal X} = \{ 1,2, \ldots ,N \}$ with probabilities $p_1 ,p_2 , \ldots ,p_N$.

1022: Let $P_r = \sum_{i=1}^{r-1} p_i$, for $r = 1, \ldots ,N$.

1023: The binary code $E: {\cal X} \rightarrow \{0,1\}^*$ is obtained

1024: by coding $r$ as a binary number $E(r)$, obtained by

1025: truncating the binary expansion of $P_r$ at length

1026: $l(E(r))$ such that

1027: $$

1028:  \log 1/ p_r   \leq l(E(r))  <  1 + \log 1/ p_r .

1029: $$

1030: This code is the {\em Shannon-Fano code}.

1031: It has the property that highly probable symbols

1032: are mapped to short code words and symbols with low

1033: probability are mapped to longer code words (just as is done, in a less optimal,

1034: non-prefix-free setting, in the Morse code).

1035: Moreover,

1036: $$

1037: 2^{-l(E(r))}  \leq p_r  <  2^{-l(E(r))+1} .

1038: $$

1039: Note that the code for symbol $r$ differs from all

1040: codes of symbols $r+1$ through $N$ in one or more

1041: bit positions, since for all $i$  with $ r+1  \leq i  \leq N$,

1042: \[ P_i \geq P_r + 2^{-l(E(r))}.\]

1043: Therefore the binary

1044: expansions of $P_r$ and $P_i$ differ in the first $l(E(r))$

1045: positions.  This means that $E$ is one-to-one,

1046: and it has an inverse: the decoding mapping $E^{-1}$.

1047: Even better,

1048: since

1049: no value of $E$ is a prefix of any other value of $E$,

1050: the set of code words is

1051: a prefix-code\index{code!prefix-}. This means we

1052: can recover the source message

1053: from the code message

1054: by scanning it from left to right

1055: without look-ahead.

1056: If $H_1$ is the average

1057: number of bits used per symbol of an original

1058: message, then $H_1 = \sum_r  p_r l(E(r))$.

1059: Combining this with the previous inequality we obtain (\ref{eq:entopt}):

1060: $$

1061:  \sum_r  p_r \log 1/ p_r    \leq

1062: H_1  <

1063: \sum_r  (1+ \log 1/ p_r )p_r  = 1 + \sum_r  p_r \log 1/ p_r .

1064: $$
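To illustrate the construction (with distributions chosen by us), take $N=4$ and $(p_1,\ldots,p_4) = (\frac{1}{2},\frac{1}{4},\frac{1}{8},\frac{1}{8})$. Then $(P_1,\ldots,P_4) = (0, \frac{1}{2}, \frac{3}{4}, \frac{7}{8})$, with binary expansions $0.000\ldots$, $0.100\ldots$, $0.110\ldots$ and $0.111\ldots$, and the lengths are $l(E(r)) = \log 1/p_r = 1,2,3,3$. Truncating gives
\[ E(1) = 0, \quad E(2) = 10, \quad E(3) = 110, \quad E(4) = 111, \]
and $H_1 = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = \frac{7}{4} = H(P)$, so the lower bound is met with equality. For $(p_1,p_2,p_3) = (\frac{1}{2},\frac{1}{3},\frac{1}{6})$ the lengths are $\lceil \log 2 \rceil, \lceil \log 3 \rceil, \lceil \log 6 \rceil = 1,2,3$ and $(P_1,P_2,P_3) = (0,\frac{1}{2},\frac{5}{6})$ with $\frac{5}{6} = 0.1101\ldots$ in binary, so $E(1) = 0$, $E(2) = 10$, $E(3) = 110$ and
\[ H(P) \approx 1.459 \;\leq\; H_1 = \tfrac{1}{2} + \tfrac{2}{3} + \tfrac{1}{2} = \tfrac{5}{3} \approx 1.667 \;<\; H(P) + 1 . \]
In this second case the prefix-code with code words $0, 10, 11$ has the smaller average length $\frac{3}{2}$, so the Shannon-Fano code need not attain the minimum $\bar{L}$ itself; what it guarantees are the bounds in (\ref{eq:entopt}).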

1065: %From this it follows that $H_1  \sim  H(X)$ for large $n$,

1066: %with $H(X)$ the entropy per symbol of the source.

1067: \commentout{

1068: \begin{example}

1069: \label{ex:00}

1070: \rm

1071: %Application of these notions to the exchange of information $x$

1072: %is as follows:

1073: Assuming that $x$ is emitted by a random source $X$

1074: with probability $P(X=x)$, we can transmit $x$ using the Shannon-Fano

1075: code. This uses (up to rounding) $ \log 1/ P(X=x)$ bits.

1076: By Shannon's noiseless coding theorem this is optimal {\em on average},

1077: the average taken over the probability distribution of outcomes

1078: from the source. Thus, if $x = 00 \ldots 0$ ($n$ zeros), and the

1079: random source emits $n$-bit messages with equal probability $1/2^n$

1080: each, then we require $n$ bits to transmit $x$ (the same as

1081: transmitting $x$ literally). However, we can transmit $x$

1082: in about $\log n$ bits if we ignore probabilities and

1083: just describe $x$ individually. Thus, the optimality with

1084: respect to the average may be very sub-optimal in individual cases.

1085: \end{example}

1086: }

1087:

1088: \ \\

1089: {\bf Problem and Lacuna:}

1090: Shannon observes, ``Messages have %

1091: \index{Shannon, C.E.}

1092: \it meaning %

1093: \rm [ $\ldots$ however $\ldots$ ]

1094: the semantic aspects of communication are irrelevant

1095: to the engineering problem.'' In other words, can we answer a

1096: question like ``what is the information in this book''

1097: by viewing it as an element of a set of possible books

1098: with a probability distribution on it? Or by assuming that the

1099: individual sections in this book form

1100: a random sequence with stochastic relations that

1101: damp out rapidly over a distance of several pages?

1102: And how to measure the quantity of hereditary information in

1103: biological organisms, as encoded in DNA? Again there is the

1104: possibility of seeing a particular form of animal as one of a set of

1105: possible forms with a probability distribution on it. This seems

1106: to be contradicted by the fact that the calculation of

1107: all possible lifeforms in existence at any one time on earth

1108: would give a ridiculously low figure like

1109: %peter1 I don't understand this number!

1110: $2^{100}$.

1111:

1112: Shannon's classical

1113: information theory\index{information theory}\index{Shannon, C.E.}

1114: assigns a quantity of information to an ensemble of

1115: possible messages. If all messages in the ensemble are equally probable,

1116: this quantity is the number of bits needed to

1117: count all possibilities.

1118: This expresses the fact that

1119: each message in the ensemble can be communicated

1120: using this number of bits.

1121: However, it does not say

1122: anything about the number of bits needed to convey any

1123: individual message in the ensemble. To illustrate this,

1124: consider the ensemble consisting of all binary strings

1125: of length 9999999999999999.

1126:

1127: By Shannon's measure, we require

1128: 9999999999999999 bits

1129: on the average to encode a string in such an ensemble. However, the

1130: string consisting of 9999999999999999 1's can be encoded in about

1131: 55 bits by expressing 9999999999999999 in binary and adding the

1132: repeated pattern ``1.'' A requirement for this to work is

1133: that we have agreed on an algorithm that decodes the encoded

1134: string. We can compress the string still further when we note that

1135: 9999999999999999 equals $3^2 \times 1111111111111111$, and that

1136: 1111111111111111 consists of $2^4$ 1's.
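The bit count can be checked as follows (our own arithmetic, with $\log$ taken to base $2$): since $2^{53} < 9999999999999999 < 2^{54}$, or equivalently
\[ \log 10^{16} = 16 \log 10 \approx 53.15 , \]
the binary expansion of 9999999999999999 has $54$ bits, to which the description of the repeated pattern adds only a constant number of bits.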

1137:

1138: Thus, we

1139: have discovered an interesting phenomenon: the description of

1140: some strings can be compressed considerably,

1141: provided they exhibit enough regularity.

1142: %This observation, of

1143: %course, is the basis of all systems to express very large

1144: %numbers and was exploited early on by Archimedes in

1145: %\index{Archimedes}

1146: %his treatise %

1147: %\it The Sand Reckoner%

1148: %\rm , in which he proposes

1149: %a system to name very large numbers:

1150: %``There are some, King Golon, who think that the number

1151: %of sand is infinite in multitude [$\ldots$ or] that no number

1152: %has been named which is great enough to exceed its multitude.[$\ldots$]

1153: %But I will try to show you, by geometrical proofs,

1154: %which you will be able to follow, that, of the numbers named

1155: %by me [...] some exceed not only the density of sand equal in

1156: %magnitude to the earth filled up in the way described,

1157: %but also that of a density equal in magnitude to the universe.''

1158: However, if regularity is lacking, it becomes more cumbersome

1159: to express large numbers. For instance, it seems easier to

1160: compress the number ``one billion,'' than the number

1161: ``one billion seven hundred thirty-five million two hundred

1162: sixty-eight thousand and three hundred ninety-four,'' even though they

1163: are of the same order of magnitude.

1164:

1165: We are interested in a measure of information that, unlike Shannon's,

1166: does not rely on (often untenable) probabilistic assumptions,

1167: and that takes into account the phenomenon that

1168: `regular' strings are compressible. Thus, we aim for a measure of information

1169: content of an %

1170: \it individual finite object%

1171: \rm ,

1172: and of the information conveyed about an individual finite

1173: object by another individual finite object. Here, we want

1174: the information content of an object $x$ to be

1175: an attribute of $x$ %

1176: \it alone%

1177: \rm , and not to depend

1178: on, for instance, the means chosen to describe this information

1179: content.  Surprisingly, this turns

1180: out to be possible, at least to a large extent. The resulting theory

1181: of information is based on Kolmogorov complexity, a

1182: notion independently proposed by Solomonoff (1964), Kolmogorov (1965)

1183: and Chaitin (1969); Li and Vit\'anyi (1997) describe the history of the

1184: subject.

1185: \subsection{Kolmogorov Complexity}

1186: \label{sec:kolmogorov}

1187: Suppose we want to describe a given object by a

1188: finite binary string. We do not care whether the object

1189: has many descriptions; however, each description

1190: should describe but one object.

1191: From among all descriptions

1192: of an object we

1193: can take the length of the shortest description as

1194: a measure of the object's complexity.

1195: It is natural to call an object ``simple'' if it has

1196: at least one short description, and to call it ``complex''

1197: if all of its descriptions are long.

1198:

1199: %But now we are in danger of falling into the trap

1200: %so eloquently described in the Richard-Berry paradox,

1201: %\index{Paradox!Richard-Berry|bold} where

1202: %we define a natural number as

1203: %``the least natural number that cannot be described in less

1204: %than twenty words.'' If this number does exist, we have just described

1205: %it in thirteen words, contradicting its definitional

1206: %statement. If such a number does not exist, then

1207: %all natural numbers can be described in fewer than twenty

1208: %words.

1209: %We need to look very carefully at what kind

1210: %of descriptions (codes) we may allow.

1211: As  in Section~\ref{sec:coding}, consider a description method

1212: $D$, to be used to transmit messages from a sender to a receiver.

1213: %Assume that each description

1214: %describes at most one object. That is, there is a

1215: %specification method $D$ that associates at most one

1216: %object $x$ with a description $y$.

1217: %This means that $D$ is a function from the set of descriptions,

1218: %say $Y$, into the set of objects, say $X$.

1219: %It seems also reasonable to require that

1220: %for each object $x$ in $X$, there is a description $y$ in $Y$

1221: %such that $D(y) = x$.

1222: %(Each object has a description.)

1223: %To make descriptions useful we like them to be finite.

1224: %This means that there are only countably many descriptions. Since there

1225: %is a description for each object, there are also only

1226: %countably many

1227: %describable objects.

1228: %How do we measure the complexity

1229: %of descriptions?

1230: %\index{complexity!algorithmic}

1231: %

1232: %

1233: %Taking our cue from the theory of computation,

1234: %we express descriptions as finite sequences of 0's and 1's.

1235: %In communication technology, if the specification method

1236: If $D$ is known

1237: to both a sender and receiver, then a message $x$ can be transmitted

1238: from sender to receiver by transmitting the description $y$ with

1239: $D(y)=x$. The cost of this transmission is measured by $l(y)$,

1240: the length of $y$. The least cost of transmission of $x$ is determined

1241: by the length function $L(x)$: recall that $L(x)$ is the length of

1242: the shortest $y$ such that $D(y)=x$.

1243: We choose this length function

1244: as the descriptional complexity of $x$ under specification

1245: method $D$.

1246:

1247: Obviously, this descriptional complexity of

1248: $x$ depends crucially

1249: on $D$.

1250: The general principle involved is that the syntactic

1251: framework of the description language

1252: determines the succinctness of description.

1253:

1254: In order to objectively compare descriptional complexities

1255: of objects, to be able to say ``$x$ is more complex than $z$,''

1256: the descriptional complexity of $x$

1257: should depend on $x$ alone. This complexity can be viewed as related to

1258: a universal description method that is a priori

1259: assumed by all senders and receivers.

1260: This complexity is optimal if no other description method

1261: assigns a lower complexity to any object.

1262:

1263: We are not really interested in optimality with respect to

1264: all description methods.

1265: For specifications to be useful at all it is

1266: necessary that the mapping from $y$ to $D(y)$

1267: can be executed in an effective manner. That is,

1268: it can at least in principle be performed by humans or machines.

1269: This notion has been

1270: formalized as that of ``partial recursive functions'',

1271: also known simply as ``computable functions'', which are

1272: formally defined later.

1273: According to

1274: generally accepted mathematical viewpoints it coincides

1275: with the intuitive notion of effective computation.

1276:

1277: The set of partial recursive functions

1278: contains an optimal function that minimizes

1279: description length of every other such function. We denote

1280: this function by $D_0$.

1281: Namely, for any other partial recursive function $D$,

1282: for all objects $x$,

1283: there is a description $y$ of $x$ under $D_0$ that is

1284: shorter than any description $z$ of $x$ under $D$. (That is,

1285: shorter up to an

1286: additive constant that is independent of $x$.)

1287: Complexity with respect to $D_0$ minorizes

1288: the complexities with respect

1289: to all partial recursive functions.

1290:

1291: We identify the

1292: length of the description of $x$ with respect

1293: to a fixed specification function $D_0$ with

1294: the ``algorithmic (descriptional) complexity'' of $x$.

1295: The optimality of $D_0$ in the sense above

1296: means that the complexity of an object $x$

1297: is invariant (up to an additive constant

1298: independent of $x$) under transition

1299: from one optimal specification function to another.

1300: Its complexity is an objective attribute

1301: of the described object alone: it is an

1302: intrinsic property of that object, and it does

1303: not depend on the description formalism.

1304: This complexity can be viewed as ``absolute information content'':

1305: the amount of information that needs to be transmitted

1306: between all senders and receivers when they communicate the

1307: message in absence of any other a priori knowledge

1308: that restricts the domain of the message.

1309: %

1310: %Broadly

1311: %speaking, this means that all description

1312: %syntaxes that are powerful enough to express the partial

1313: %recursive functions are approximately equally succinct.

1314: %In contrast

1315: %to the suggestion implicit in

1316: %the LISP versus FORTRAN example,

1317: %All algorithms can be expressed in each such programming

1318: %language equally succinctly, up to a fixed additive constant term.

1319: %

1320: %The restriction to formally effective descriptions

1321: %covers all intuitively effective descriptions

1322: %by general mathematical consensus.

1323: %While the idea of a theory of short descriptions

1324: %in itself has been proposed before,

1325: %The remarkable

1326: %usefulness and inherent rightness of the theory

1327: %of Kolmogorov complexity stems from this

1328: %independence of the description method.  %\begin{comment}

1329: %As an aside, a too narrow restriction of admissible

1330: %functions is not good either. For instance,

1331: %the class of primitive recursive functions,

1332: %\index{function!primitive recursive}

1333: %Exercise~\ref{ex.p.r.function}

1334: %in Section~\ref{sect.recursion}, is a proper subset

1335: %of the partial recursive functions, but

1336: %it contains no (universal) primitive recursive

1337: %function such that the associated complexity

1338: %minorizes the complexities of all other

1339: %primitive recursive functions.

1340: %\end{comment}

1341: %

1342: Thus, we have outlined the program for

1343: a general theory of algorithmic complexity.

1344: The three

1345: %four

1346: major innovations are as follows:

1347: \begin{enumerate}

1348: \item

1349: In restricting

1350: ourselves to formally effective descriptions,

1351: our definition covers every form of description

1352: that is intuitively acceptable as being effective

1353: according to general viewpoints in mathematics and logic.

1354: \item

1355: The restriction to effective descriptions

1356: entails that there is a universal description

1357: method that minorizes the description length or complexity

1358: with respect to any other effective description

1359: method.

1360: Significantly, this implies Item 3.

1361: \item

1362: The description length or complexity of an object

1363: is an intrinsic attribute of the object independent

1364: of the particular description method or formalizations

1365: thereof.

1366: %\item

1367: %The disturbing Richard-Berry paradox above does not disappear,

1368: %but resurfaces in the form of an alternative

1369: %approach to proving Kurt G\"odel's (1906--1978) famous

1370: %\index{G\"odel, K.}

1371: %result that not every true mathematical statement

1372: %is provable in mathematics.

1373: \end{enumerate}

1374:

1375: \subsubsection{Formal Details}

1376: The Kolmogorov complexity $K(x)$ of a finite object $x$

1377: will be defined as the length of the

1378: shortest effective binary description of $x$. Broadly speaking, $K(x)$

1379: may be thought of as the length of the shortest computer program that

1380: prints $x$ and then halts. This computer program may be written in

1381: C, Java, LISP or any other universal language: we shall see that,

1382: for any two universal languages,

1383: the resulting program lengths differ at most by a constant not

1384: depending on $x$.

1385:

1386: To make this precise,

1387: let $T_1 ,T_2 , \ldots$ be a standard enumeration \cite{LiVi97}

1388: of all Turing machines, and let $\phi_1 , \phi_2 , \ldots$

1389: be the enumeration of corresponding functions

1390: which are computed by the respective Turing machines.

1391: That is, $T_i$ computes $\phi_i$.

1392: These functions are the {\em partial recursive} functions

1393: or {\em computable} functions, Section~\ref{sec:preliminaries}. For technical reasons we are interested in the

1394: so-called prefix complexity, which is associated with Turing machines

1395: for which the set of programs (inputs) resulting in a halting computation

1396: is prefix free\footnote{There exists a version of Kolmogorov

1397:   complexity corresponding to programs that are not necessarily

1398:   prefix-free, but we will not go into it here.}. We can realize this by equipping the Turing

1399: machine with a one-way input tape, a separate work tape,

1400: and a one-way output tape. Such Turing

1401: machines are called prefix machines

1402: since the halting programs for any one of them form a prefix free set.

1403: %Taking the universal prefix machine $U$ we can define

1404: %the prefix complexity analogously with the plain Kolmogorov complexity.

1405: %peter1: we have not defined `plain KC' yet, so this must be changed

1406: %

1407:

1408: We first define $K_{T_i}(x)$, the prefix Kolmogorov complexity of $x$ relative to a

1409: given prefix machine $T_i$, where $T_i$ is the $i$-th prefix machine

1410: in a standard enumeration of them. $K_{T_i}(x)$  is defined as the length of the shortest

1411: input sequence $y$ such that $T_i(y) = \phi_i(y) = x$. If no such

1412: input sequence exists, $K_{T_i}(x)$ remains undefined. Of course, this

1413: preliminary definition is still highly sensitive to the particular

1414: prefix machine $T_i$ that we use. But now the  `universal

1415: prefix machine' comes to our rescue. Just as there exist universal ordinary

1416: Turing machines, there also exist universal prefix machines. These

1417: have the remarkable property that they can simulate every other prefix

1418: machine. More specifically, there exists a prefix machine $U$ such

1419: that, given the pair $\langle i, y\rangle$ as input, it outputs $\phi_i(y)$

1420: and then halts. We now fix, once and for all,

1421: a prefix machine $U$ with this property and call $U$ the {\em reference

1422:   machine}. The Kolmogorov complexity $K(x)$ of $x$ is defined as $K_U(x)$.

1423:

1424: Let us formalize this definition.

1425: Let $\langle \cdot \rangle$ be a standard invertible

1426: effective one-one encoding from ${\cal N} \times {\cal N}$

1427: to a prefix-free  subset of ${\cal N}$. $\langle \cdot \rangle$ may be

1428: thought of as the encoding function of a prefix code.

1429: For example, we can set $\langle x,y \rangle = x'y'$.

1430: Comparing to the definition of $\langle \cdot \rangle$ in

1431: Section~\ref{sec:preliminaries}, we note that from now on, we require

1432: $\langle \cdot \rangle$ to map to a prefix-free set.

1433: We insist on prefix-freeness and

1434: effectiveness because we want a universal Turing

1435: machine to be able to read an image under $\langle \cdot \rangle$

1436: from left to right and

1437: determine where it ends.

1438: \begin{definition}\label{def.KolmK}

1439: \rm

1440: Let $U$ be our reference prefix machine satisfying for all $i \in {\cal N},

1441: y \in \{0,1\}^*$,

1442: $U(\langle i,y \rangle) = \phi_i(y)$. The {\em prefix Kolmogorov complexity} of $x$ is

1443: \begin{eqnarray}

1444: K(x) & = &

1445: \min_{z} \{ l(z) : U(z) = x , z \in \{0,1\}^*\} \nonumber \\

1446: & = &  \min_{i,y}\{l(\langle i, y \rangle): \phi_i (y )=x , y \in \{0,1\}^*, i

1447: \in {\cal N} \}.

1448: \end{eqnarray}

1449: \end{definition}

1450: We can alternatively think of $z$ as a program that prints $x$ and

1451: then halts, or as $z = \langle i,y \rangle$ where $y$ is a program such

1452: that, when $T_i$ is given program $y$ as input, it prints $x$ and then halts.

1453:

1454: Thus, by definition $K(x)=l(x^*)$, where $x^*$ is the

1455: lexicographically first shortest

1456: self-delimiting (prefix) program for $x$ with respect to the

1457: reference prefix machine. Consider the mapping $E^*$ defined by $E^*(x)=x^*$.

1458: This may be viewed as the encoding function of a prefix-code (decoding

1459: function) $D^*$ with $D^*(x^*) = x$. By its definition, $D^*$ is  a

1460: very parsimonious code. The reason for working with prefix rather than standard

1461: Turing machines is that, for many of the subsequent developments,

1462: we need $D^*$ to be prefix.

1463: %

1464: %

1465: %If $x^*$ is the (lexicographically)

1466: %first shortest program for $x$ then the set

1467: %$\{x^* : U(x^*)=x, x \in \{0,1\}^*\}$ is a {\em prefix code}.

1468: %That is, each $x^*$ is a code word for some $x$, and if $x^*$

1469: %and $y^*$ are code words for $x$ and $y$ with $x \neq y$ then $x^*$ is not

1470: %a prefix of $x$.

1471:

1472: Though defined in terms of a

1473: particular machine model, the Kolmogorov complexity

1474: is machine-independent up to an additive

1475: constant

1476:  and acquires an asymptotically universal and absolute character

1477: through Church's thesis, from the ability of universal machines to

1478: simulate one another and execute any effective process.

1479:   The Kolmogorov complexity of an object can be viewed as an absolute

1480: and objective quantification of the amount of information in it.

1481: %peter1: more explanation needed here

1482: %the following is already said at various places in the text, so I

1483: %commented it out

1484: %

1485: %This leads to a theory of {\em absolute} information {\em contents}

1486: %of {\em individual} objects in contrast to classic information theory

1487: %which deals with {\em average} information {\em to communicate}

1488: %objects produced by a {\em random source} \cite{LiVi97}.

1489:

1490: %peter1 the following about m(x) is not needed except perhaps for the

1491: %Kolmogorov structure function. Better to deal with it there it seems.

1492: %\commentout{

1493: %\begin{example}

1494: %  \rm

1495: \subsubsection{Intuition}

1496: To develop some intuitions, it is useful to think of $K(x)$ as

1497:   the shortest program for $x$

1498: in some standard programming language such as

1499:   LISP or Java. Consider the lexicographical enumeration

1500:   of all syntactically correct LISP programs $ \lambda_1, \lambda_2,

1501:   \ldots$, and the lexicographical enumeration of all syntactically

1502:   correct Java programs $ \pi_1, \pi_2, \ldots$. We assume that both

1503:   these sets of programs are encoded in some standard prefix-free manner. With

1504:   proper definitions we can view the programs in both enumerations as

1505:   computing partial recursive functions from their inputs to their

1506:   outputs. Choosing reference machines in both enumerations we can

1507:   define complexities $K_{\mbox{\scriptsize LISP}}(x)$ and

1508:   $K_{\mbox{\scriptsize  Java}}(x)$

1509:   completely analogous to $K(x)$.  All of these measures of the

1510:   descriptional complexities of $x$ coincide up to a fixed additive

1511:   constant. Let us show this directly for $K_{\mbox{\scriptsize LISP}}(x)$ and

1512:   $K_{\mbox{\scriptsize Java}}(x)$. Since LISP is universal, there exists a LISP

1513:   program $\lambda_P$ implementing a Java-to-LISP compiler.

1514:   $\lambda_P$ translates each Java program to an equivalent LISP

1515:   program. Consequently, for all $x$, $K_{\mbox{\scriptsize LISP}}(x) \leq

1516:   K_{\mbox{\scriptsize Java}}(x) + 2l(P)$. Similarly, there is a Java program

1517:   $\pi_L$ that is a LISP-to-Java compiler, so that for all $x$,

1518:   $K_{\mbox{\scriptsize Java}}(x) \leq K_{\mbox{\scriptsize LISP}}(x) + 2l(L)$. It follows

1519:   that $|K_{\mbox{\scriptsize Java}}(x) - K_{\mbox{\scriptsize LISP}}(x)| \leq 2l(P) + 2 l(L)$

1520:   for all $x$!
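
To make the first inequality concrete, here is a sketch under illustrative assumptions. Suppose $\pi^*$ is a shortest Java program for $x$, so that $l(\pi^*) = K_{\mbox{\scriptsize Java}}(x)$, and suppose the compiler $\lambda_P$ can be prepended in a self-delimiting form costing about $2l(P)$ bits, as in the bound above. Running $\lambda_P$ on $\pi^*$ then yields a LISP program for $x$, so
\[ K_{\mbox{\scriptsize LISP}}(x) \leq 2l(P) + l(\pi^*) = K_{\mbox{\scriptsize Java}}(x) + 2l(P) . \]
For hypothetical sizes $l(P) = 10^6$ and $l(L) = 2 \cdot 10^6$ bits (numbers invented purely for illustration), the two complexities would differ by at most $2 \cdot 10^6 + 4 \cdot 10^6 = 6 \cdot 10^6$ bits for every $x$, whatever the length of $x$.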

1521:

1522: The programming  language view immediately tells us that $K(x)$ must be

1523: small for `simple' or `regular' objects $x$. For example,

1524: there exists a fixed-size program that, when input

1525: $n$, outputs the first $n$ bits of $

1526: \pi$ and then halts. Specification of $n$ takes at most $L_{\cal N}(n)

1527: = \log n + 2 \log \log n + 1 $ bits. Thus, if $x$

1528: consists of the first $n$ binary digits of $\pi$, then $K(x) \lea \log

1529: n + 2 \log \log n$. Similarly, if $0^n$ denotes the string

1530: consisting of $n$ $0$'s, then $K(0^n) \lea \log n + 2 \log \log n$.
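
As a worked instance of this bound (taking logarithms to base 2; the value of $n$ is our choice, for illustration only), let $n = 2^{20} = 1048576$. Then $\log n = 20$ and $\log \log n = \log 20 \approx 4.32$, so
\[ L_{\cal N}(n) = \log n + 2 \log \log n + 1 \approx 20 + 8.64 + 1 = 29.64 . \]
Hence about 30 bits suffice to specify $n$, and $K(0^n)$ exceeds this by at most a constant not depending on $n$, whereas the literal string $0^n$ has length $l(0^n) = 1048576$ bits.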

1531:

1532: On the other hand, for all $x$, there exists a program `print $x$;

1533:  halt'. Since $x$ must be given in self-delimiting form, this shows that for all $x$, $K(x) \lea l(x) + 2 \log l(x)$. As was previously noted, for any prefix code,

1534:  there are no more than $2^m$ strings $x$ which can be described by

1535:  $m$ or less bits. In particular, this holds for the prefix code $E^*$

1536:  whose length function is $K(x)$. Thus, the fraction of strings $x$ of

1537:  length $n$ with $K(x) \leq m$ is at most  $2^{m-n}$: the overwhelming majority

1538:  of sequences cannot be compressed by more than a

1539:  constant. Specifically, if $x$ is determined by $n$ independent

1540:  tosses of a fair coin, then with overwhelming probability,  $K(x) \approx

1541:  l(x)$. Thus, while for very regular strings, the Kolmogorov complexity is

1542: small (sub-linear in the length of the string),

1543: {\em most\/} strings are `random' and have Kolmogorov

1544: complexity about equal to their own length.
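
To see the strength of this counting bound, write $m = n-k$, so that the fraction of strings of length $n$ with $K(x) \leq n-k$ is at most $2^{-k}$. For illustration (the values are our choice), take $n = 1000$ and $k = 20$:
\[ \frac{|\{x \in \{0,1\}^n : K(x) \leq n-k \}|}{2^n} \leq \frac{2^{n-k}}{2^n} = 2^{-20} \approx 9.5 \cdot 10^{-7} . \]
So fewer than one in a million strings of length $1000$ can be compressed by $20$ bits or more.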

1545: %\end{example}

1546: \subsubsection{Kolmogorov complexity of sets, functions and

1547:   probability distributions}

1548: \paragraph{Finite sets:}

1549: The  class of {\em finite sets} consists of the set

1550: of finite subsets $S \subseteq \{0,1\}^*$. The  {\em complexity

1551: of the finite set} $S$ is

1552: $K(S)$---the length (number of bits) of the

1553: shortest binary program $p$ from which the reference universal

1554: prefix machine $U$

1555: computes a listing of the elements of $S$ and then

1556: halts.

1557: That is, if $S=\{x_1 , \ldots , x_{n} \}$, then

1558: $U(p)= \langle x_1,\langle x_2, \ldots, \langle x_{n-1},x_n\rangle \ldots\rangle \rangle $.

1559: The {\em conditional complexity} $K(x \mid S)$ of $x$ given $S$,

1560: is the length (number of bits) of the

1561: shortest binary program $p$ from which the reference universal

1562: prefix machine $U$, given $S$ literally as auxiliary information,

1563: computes $x$.
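
As a simple illustration of conditional complexity (a standard observation, sketched here rather than proved), let $x \in S$ and write $|S|$ for the number of elements of $S$. Then $x$ can be specified by its index in the listing of $S$, which takes at most $\lceil \log |S| \rceil$ bits; since $S$ is given, the length of this index is known in advance. Therefore
\[ K(x \mid S) \lea \log |S| . \]
For instance, for $S = \{0,1\}^n$ this gives $K(x \mid S) \lea n$.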

1564: %he class of {\em partial recursive

1565: %unctions} consists of the set

1566: %f functions $f: \{0,1\}^* \rightarrow \{0,1\}^*$ such that

1567: %here is a Turing machine $T$ such that

1568: %nd $f(i) = T(i)$, for every $i \in \{0,1\}^*$.

1569: \paragraph{Integer-valued functions:}

1570: The (prefix-) complexity $K(f)$ of a

1571: partial recursive function $f$ is defined by

1572: $

1573: K(f) = \min_i \{K(i): \mbox{\rm Turing machine } T_i

1574: \; \; \mbox{\rm computes }

1575: f \}.

1576: $

1577: If $f^*$ is a shortest program for computing the function $f$

1578: (if there is more than one of them then $f^*$ is the first one in

1579: enumeration order), then $K(f)=l(f^*)$.

1580: \begin{remark}

1581: \rm

1582: In the above definition of $K(f)$, the objects being

1583: described are functions instead of finite binary strings.

1584: To unify the approaches, we can

1585: consider a finite binary string $x$ as corresponding

1586: to a function having value $x$ for argument 0.

1587: Note that we can upper semi-compute (Section~\ref{sec:preliminaries})

1588: $x^*$ given $x$,

1589: but we cannot upper semi-compute $f^*$ given $f$ (as an oracle),

1590: since we should be able to

1591: verify agreement of a program for a function and an oracle for the

1592: target function, on all infinitely many arguments.

1593: \end{remark}

1594: \paragraph{Probability Distributions:}

1595: In this text we identify

1596: probability distributions on finite and countable sets ${\cal

1597:   X}$ with their corresponding mass functions

1598: (Section~\ref{sec:preliminaries}). Since any

1599: (sub-) probability mass function $f$ is a total real-valued function, $K(f)$

1600: is defined in the same way as above.

1601: \subsubsection{Kolmogorov Complexity and the Universal Distribution}

1602: \label{sec:m}

1603: Following the definitions

1604: above we now consider lower semi-computable and computable probability

1605: mass functions (Section~\ref{sec:preliminaries}).

1606: By the fundamental

1607: Kraft's inequality, Theorem~\ref{kraft}, we know that

1608: if $l_1 , l_2 , \ldots$ are the code-word lengths of a  prefix code,

1609: then $\sum_x 2^{-l_x} \leq 1$. Therefore,

1610: since $K(x)$ is the length of

1611: a prefix-free program for $x$,

1612: we can interpret $2^{-K(x)}$

1613: as a sub-probability mass function, and

1614:  we define ${\bf m}(x)=2^{-K(x)}$.

1615: This is the so-called

1616: universal distribution---a rigorous form of Occam's razor.
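
As a small check of Kraft's inequality (the code here is a toy example of ours, not one used elsewhere in this text), consider the prefix code with code words $0$, $10$, $110$, $111$, of lengths $1, 2, 3, 3$:
\[ 2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1 . \]
Dropping the code word $111$ leaves a prefix code with sum $7/8 < 1$. This illustrates why the code-word lengths of a prefix code define, in general, only a sub-probability mass function.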

1617: The following two theorems are to be considered as major achievements

1618: in the theory of Kolmogorov complexity, and will be used

1619: again and again in the sequel. For the proofs we refer to

1620: \cite{LiVi97}.

1621:

1622:

1623: \begin{theorem}\label{PR1}

1624: Let $f$ represent a

1625: lower semi-computable (sub-) probability distribution on the

1626: natural numbers (equivalently, finite binary strings).

1627: (This implies $K(f) < \infty$.)

1628: Then, $2^{c_f} {\bf m}(x) > f(x)$ for all $x$, where $c_f =K(f)+O(1)$.

1629: We call ${\bf m}$ a {\em universal distribution}.

1630: \end{theorem}

1631:

1632: The family of lower semi-computable sub-probability mass functions

1633: contains all distributions with computable parameters which have a

1634: name, or in which we could conceivably be interested, or which have

1635: ever been considered\footnote{To be sure, in statistical applications,

1636:   one often works with model classes containing distributions that are

1637:   neither upper- nor lower semi-computable. An example is the

1638:   Bernoulli model class, containing the distributions with $P(X=1) =

1639:   \theta$ for all $\theta \in [0,1]$. However, every concrete {\em

1640:     parameter estimate\/} or {\em predictive distribution\/} based on

1641:   the Bernoulli model class that has ever been considered or in which we

1642:   could be conceivably interested, is in fact computable; typically,

1643:   $\theta$ is then rational-valued. See also Example~\ref{ex:appy} in

1644:   Appendix~\ref{sec:universal}.}.  In particular, it contains the

1645: computable distributions.  We call $\hbox{\bf m}$ ``universal'' since

1646: it assigns at least as much probability to each object as any other

1647: lower semi-computable distribution (up to a multiplicative factor),

1648: and is itself lower semi-computable.

1649:

1650: \begin{theorem}\label{PR2}

1651: \begin{equation}\label{eq.m}

1652:   \log 1/\hbox{\bf m} (x)=K(x) \pm  O( 1).

1653: \end{equation}

1654: \end{theorem}

1655: That means that $\hbox{\bf m}$ assigns high probability to simple

1656: objects

1657: and low probability to complex or random objects.

1658: For example, for $x=00 \ldots 0$ ($n$ 0's) we have

1659: $K(x) = K(n) \pm O(1) \leq \log n + 2 \log \log n +O(1) $ since the program

1660: \[ \mbox{\tt print } n \mbox{\tt \_times a ``0''} \]

1661: prints $x$. (The additional $2 \log \log n$ term

1662: is the penalty term for a prefix encoding.)

1663: Then, $1/ (n \log^2 n ) = O( \hbox{\bf m}(x))$.

1664: But if we flip a coin to obtain a string $y$ of $n$ bits,

1665: then with overwhelming probability $K(y) \geq n \pm O(1) $

1666: (because $y$ does not contain effective regularities

1667: which allow compression),

1668: and hence $\hbox{\bf m}(y) = O( 1/2^n)$.
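
For a sense of scale (the value $n = 1024$ is our choice, with logarithms to base 2), we have $\log n = 10$, so $n \log^2 n = 1024 \cdot 100 = 102400$ and
\[ \hbox{\bf m}(0^n) \geq \frac{c}{102400} \approx c \cdot 9.8 \cdot 10^{-6} \]
for some constant $c > 0$ independent of $n$, while a typical coin-flip sequence $y$ of the same length has $\hbox{\bf m}(y) = O(2^{-1024})$.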

1669:

1670:

1671: \paragraph*{Problem and Lacuna:} Unfortunately $K(x)$ is not a recursive

1672: function: the Kolmogorov complexity is

1673: not computable in general. This means that

1674: there exists no computer program that, when input an arbitrary string,

1675: outputs the Kolmogorov complexity of that string and then halts.

1676: While Kolmogorov complexity is upper semi-computable

1677: (Section~\ref{sec:preliminaries}), it cannot be approximated in

1678: general in a

1679: practically useful sense; and even though

1680: there

1681: exist `feasible', resource-bounded forms of Kolmogorov

1682: complexity (Li and Vit\'anyi 1997), these lack some of the elegant

1683: properties of the original, uncomputable notion.

1684:

1685:

1686: Now suppose we are interested in efficient storage and transmission of

1687: long sequences of data. According to Kolmogorov, we can compress such

1688: sequences in an essentially optimal way by storing or transmitting the

1689: shortest program that generates them. Unfortunately, as we have just

1690: seen, we cannot find such a program in general. According to Shannon,

1691: we can compress such sequences optimally in an average sense (and

1692: therefore, it turns out, also with high probability) if they are

1693: distributed according to some $P$ and we know $P$. Unfortunately, in

1694: practice, $P$ is often unknown, it may not be computable---bringing us

1695: into the same conundrum as with the Kolmogorov complexity approach---or

1696: worse, it may be nonexistent. In Appendix~\ref{sec:universal}, we

1697: consider {\em universal coding}, which can be considered a sort of

1698: middle ground between Shannon information and Kolmogorov complexity.

1699: In contrast to both these approaches, universal codes can be directly

1700: applied for practical data compression. Some basic knowledge of

1701: universal codes will be very helpful in providing intuition for the

1702: next section, in which we relate Kolmogorov complexity and Shannon

1703: entropy. Nevertheless, universal codes are not directly needed in any

1704: of the statements and proofs of the next section or, in fact, anywhere

1705: else in the paper, which is why we have relegated their treatment to an

1706: appendix.

1707: \subsection{Expected Kolmogorov Complexity Equals Shannon Entropy}

1708: \label{sec:KCSE}

1709: %Shannon's entropy measures

1710: %the uncertainty in a statistical ensemble

1711: %of messages, while Kolmogorov complexity measures

1712: %the algorithmic information in an individual

1713: %message.

1714:

1715: Suppose the source words $x$ are distributed as a random variable

1716: $X$ with probability $P(X=x) = f(x)$.

1717: %The expected code word length of source words

1718: %with respect to probability distribution

1719: %$P$ is $\sum_x f(x)K(x)$.

1720: %

1721: %What we would like to know is the following:

1722: While $K(x)$ is

1723: fixed for each $x$ and gives the shortest code word length

1724: (but only up to a fixed constant) and is {\em independent} of the

1725: probability distribution $P$, we may wonder whether

1726: $K$ is also universal in the following sense:

1727: If we weigh each individual code word length for

1728: $x$ with its probability $f(x)$, does the resulting $f$-expected

1729: code word length $\sum_x f(x)K(x)$

1730: achieve the minimal average code word

1731: length $ H(X)=  \sum_x f(x) \log 1/ f(x)$?

1732: Here we sum over the entire support of $f$; restricting summation

1733: to a small set, for example the singleton set $\{x\}$, can give

1734: a different result.

1735: %This universality requirement contrasts

1736: %with the Shannon-Fano code

1737: %%\index{code!Shannon-Fano}

1738: %%of Example~\ref{shannon-fano} on page~\pageref{shannon-fano},

1739: %which does achieve the $H(P)$ bound at the cost of setting

1740: %the code word length equal to the negative logarithm of

1741: %the specific source word probability.

1742: The reasoning above implies that, under some mild restrictions on the

1743: distributions $f$,

1744: the answer is yes.

1745: %We can view the $K(x)$'s as the code word length set

1746: %of a ``universal'' Shannon-Fano

1747: %code based on the universal probability, Theorem~\ref{thm:noiseless}

1748: %on page~\pageref{thm:noiseless}.

1749: %The expectation of

1750: %$K(x)$ differs from $H(P)$ by a constant

1751: %depending on $P$.

1752: %Namely, $H(P) = \sum_x P(x)K(x) + c_P$,

1753: %where the constant $c_P$ depends on

1754: %the length of the program to compute the distribution $P$.

1755: This is expressed in the following theorem, where, instead of the quotient,

1756: we look at the difference of

1757: $\sum_x f(x) K(x)$ and $ H(X)$.

1758: This allows

1759: us to express really small distinctions.

1760: %We call an information source $X$ recursive if its marginals

1761: %$f^{(1)} (X=x | l(x)=1), f^{(2)} (X=x | l(x)=2), \ldots$ are all recursive.

1762: %In Exercise~\ref{exer.caves} this dependence is removed.

1763: %

1764: %If the set

1765: %of outcomes is infinite, then

1766: %it is possible that $H(P)$ is infinite.

1767: %For example, with $x \in {\cal N}$ and

1768: %$P(x)= 1/(x \log x)$ we have $H(P) > \sum_x 1/x$ which diverges.

1769: %If the expected $K(x)$ is close to $H(P)$ then it diverges as well.

1770: %Two diverging quantities are compared by looking at their quotient

1771: %or difference. The latter allows us to express really small

1772: %distinctions.

1773:

1774: \begin{theorem}\label{theo.eq.entropy}

1775: Let $f$ be a computable probability mass function (Section~\ref{sec:preliminaries}) $f(x)=P(X=x)$ on

1776: sample space ${\cal X} = \{0,1\}^*$

1777: associated with a random source $X$ and

1778: entropy

1779: $H(X)=\sum_x f(x) \log 1/f(x)$. Then,

1780: \[ 0 \leq \left( \sum_x f(x) K(x) - H(X) \right) \leq  K(f) + O(1). \]

1781: \end{theorem}

1782:

1783: \begin{proof}

1784: Since $K(x)$ is the code word length of a prefix-code for $x$,

1785: the first inequality of the Noiseless Coding Theorem~\ref{thm:noiseless}

1786: states that

1787: \[H(X) \leq \sum_x f(x) K(x).\]

1788: %Moreover, by the Kraft inequality \eqref{kraft} we have

1789: %$\sum_x 2^{-K(x)} \leq 1$. Hence we can define ${\bf m}(x) = 2^{-K(x)}$,

1790: %which can be considered as a probability distribution since it sums

1791: %to at most 1. (We can concentrate the deficit on a special undefined

1792: %element $u \not\in \{0,1\}^*$.) One of the main achievements of the theory

1793: %is that ${\bf m}$ is a {\em universal} distribution in the sense that

1794: %for every lower semi-computable probability mass function

1795: %$P$ on $\{0,1\}^*$ we have

1796: Since $f(x) \leq 2^{K(f)+O(1)} {\bf m}(x)$ (Theorem~\ref{PR1})

1797: and $\log 1/{\bf m} (x) = K(x)+O(1)$ (Theorem~\ref{PR2}), we have

1798: $ \log 1/ f(x) \geq K(x) - K(f) - O(1)$.

1799: It follows that

1800: \[\sum_x  f(x) K(x) \leq H(X)  + K(f)+ O(1).\]

1801: Set the constant $c_f$ to

1802: \[c_f := K(f)+O(1), \]

1803: and the theorem is proved.

1804: As an aside, the constant implied in the $O(1)$ term

1805: depends on the lengths of the programs occurring in the proof of the cited

1806: Theorems~\ref{PR1}, \ref{PR2} (Theorems 4.3.1 and 4.3.2 in \cite{LiVi97}).

1807: These depend only

1808: on the reference universal prefix machine.

1809: \end{proof}
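
The step from the pointwise bound to the bound on the expectation in the proof above can be written out as follows; this only unpacks the argument, writing $c$ for the constant hidden in the $O(1)$ term. Multiplying $K(x) \leq \log 1/f(x) + K(f) + c$ by $f(x)$ and summing over $x$ gives
\begin{eqnarray*}
\sum_x f(x) K(x) & \leq & \sum_x f(x) \log 1/f(x) + (K(f) + c) \sum_x f(x) \\
 & = & H(X) + K(f) + c ,
\end{eqnarray*}
since $\sum_x f(x) = 1$.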

1810:

1811: The theorem shows that for simple (low complexity)

1812: distributions the expected Kolmogorov complexity is close to

1813: the entropy, but these two quantities may be wide apart for distributions

1814: of high complexity. This explains the apparent problem arising

1815: in considering a distribution $f$ that concentrates all probability

1816: on an element $x$ of length $n$.  Suppose we choose $x$ such that $K(x)>n$.

1817: Then $f(x)=1$ and hence the entropy $H(f)=0$. On the other hand

1818: the term $ \sum_{x \in \{0,1\}^* } f(x) K(x) = K(x)$. Therefore,

1819: the discrepancy between the expected Kolmogorov complexity and the entropy

1820: exceeds the length $n$ of $x$. One may think this contradicts

1821: the theorem, but that is not the case: The complexity of the distribution

1822: is at least that of $x$, since we can reconstruct $x$ given $f$

1823: (just compute $f(y)$ for all $y$ of length $n$ in lexicographical

1824:  order until we meet one that has probability 1). Thus, $c_f = K(f)+O(1)

1825: \geq K(x)+O(1) \geq n+O(1)$. Thus, if we pick a probability distribution

1826: with a complex support, or a trickily skewed probability distribution,

1827: then this is reflected in the complexity of that distribution, and

1828: as a consequence in the closeness between the entropy and the expected

1829: Kolmogorov complexity.

1830:

1831: For example, bringing the discussion in line with the universal coding

1832: counterpart of Appendix~\ref{sec:universal} by considering $f$'s that

1833: can be interpreted as sequential information sources and denoting the

1834: conditional version of $f$ restricted to strings of length $n$ by

1835: $f^{(n)}$ as in Section~\ref{sec:preliminaries}, we find by the same

1836: proof as the theorem that for all $n$,

1837: \[ 0 \leq \sum_{x \in \{0,1\}^n } f^{(n)}(x) K(x) - H(f^{(n)}) \leq

1838: c_{f^{(n)}}, \]

1839: where $c_{f^{(n)}} = K(f^{(n)})+O(1) \leq  K(f) + K(n)+O(1)$ is now a constant

1840: depending on both $f$ and $n$.

1841: On the other hand, we can eliminate

1842: the complexity of the distribution, or its recursivity for that matter,

1843: and / or restrictions to a conditional version of $f$

1844: restricted to a finite support $A$

1845: (for example $A = \{0,1\}^n$), denoted by $f^A$,

1846: in the following conditional formulation (this involves

1847: a peek in the future since

1848: the precise meaning of the ``$K(\cdot \mid \cdot)$'' notation

1849: is only provided in Definition~\ref{def.KolmKb}):

1850: \begin{equation}\label{eq.condentropy}

1851:  0 \leq \sum_{x \in A } f^A (x) K(x \mid f,A) - H(f^A) = O(1) .

1852: \end{equation}

1853:

1854: The Shannon-Fano code for a computable distribution is

1855: itself computable. Therefore, for every computable

1856: distribution $f$, the universal code $D^*$

1857: whose length function is the Kolmogorov complexity compresses

1858: on average at least as much as the Shannon-Fano code for $f$.

1859: This is the intuitive reason

1860: why, no matter what computable distribution $f$ we take, its expected

1861: Kolmogorov complexity is close to its entropy.

1862:

1863:

1864: %To define Kolmogorov complexity, we must first represent our space

1865: %$\cal X$ by a sequence of binary variables $x_1, x_2, \ldots$ (this

1866: %can be done without any essential loss of generality). The idea is

1867: %that the outcomes (actual values) of these variables will be revealed

1868: %to us sequentially, either one at a time (first we see $x_1$, then

1869: %$x_2$, etc.) or in `blocks'. Broadly speaking, we now proceed as

1870: %follows. We fix some universal programming language $L$ (by universal

1871: %we mean that a universal Turing Machine can be programmed in it).  We

1872: %now define the Kolmogorov complexity $K(x^n)$ of a sequence $x^n =

1873: %x_1, \ldots, x_n$ as follows: $K(x^n)$ is the length of the shortest

1874: %program (written in language $L$) that prints the sequence $x^n$ and

1875: %then halts.

1876: %

1877: %The quantity $K(x^n)$ might be called the `information inherent in the

1878: %object $x^n$'. While for finite sequences it depends on the

1879: %programming language $L$ that is used, one can show that for infinite

1880: %sequences $x^{\infty} = x_1, x_2, \ldots$ it becomes, in some sense,

1881: %invariant:

1882:

1883: %More importantly, the {\em algorithmic\/} information that observation

1884: %$y^n$ gives about sequence $x^n$ is defined as the {\em conditional

1885: %Kolmogorov complexity\/} $K(x^n | y^n)$; this is the length of the

1886: %shortest program $p$ such that, when the pair $<p,y^n>$ (suitably

1887: %encoded) is fed to the UTM $U$, $U$ outputs $x^n$ and then

1888: %halts. Several concrete examples and properties of this new definition

1889: %of mutual information will be given in the paper.

1890:

1891: \section{Mutual Information}

1892: \label{sec:mutual}

1893: \subsection{Probabilistic Mutual Information}

1894: \label{sec:probmutual}

1895: How much information can a random variable $X$ convey about a

1896: random variable $Y$?

1897: Taking a purely combinatorial approach,

1898: this notion is captured as follows:

1899: If $X$ ranges over ${\cal X}$

1900: and $Y$ ranges over ${\cal Y}$, then we look at the set $U$

1901: of possible events $(X=x,Y=y)$ consisting of

1902: joint occurrences of event $X=x$ and event $Y=y$.

1903: If $U$ does not equal the Cartesian product ${\cal X} \times {\cal Y}$,

1904: then this means there is some dependency between $X$ and $Y$.

1905: Considering the set $U_x =  \{ (x,u):(x,u) \in U \}$ for $x \in {\cal X} $,

1906: it is natural to define the

1907: \it conditional entropy

1908: \rm of $Y$

1909: \rm given $X = x$ as $H(Y|X=x) = \log d(U_x )$. This suggests

1910: immediately that the information given by $X=x$ about $Y$ is

1911: $$

1912: I(X=x: Y) = H(Y) - H(Y| X=x).

1913: $$

1914: For example, if $U =  \{ (1,1), (1,2),(2,3) \} $, $U  \subseteq {\cal X}

1915: \times {\cal Y}$

1916: with ${\cal X} =  \{ 1,2 \} $ and ${\cal Y} =  \{ 1,2,3,4 \} $,

1917: then $I(X=1: Y) = 1$ and $I(X=2: Y) = 2$.
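The arithmetic behind these two values can be checked directly. This sketch
assumes, as in the combinatorial definition of entropy, that
$H(Y) = \log d({\cal Y})$ with $d(\cdot)$ the number of elements and
logarithms taken to base 2:
\begin{align*}
H(Y) & = \log d({\cal Y}) = \log 4 = 2, \\
U_1 = \{(1,1),(1,2)\}: \quad H(Y|X=1) & = \log 2 = 1, \quad I(X=1:Y) = 2 - 1 = 1, \\
U_2 = \{(2,3)\}: \quad H(Y|X=2) & = \log 1 = 0, \quad I(X=2:Y) = 2 - 0 = 2.
\end{align*}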

1918:

1919: In this formulation it is obvious that $H(X|X=x) = 0$,

1920: and that $I(X=x : X) = H(X)$.

1921: This approach amounts

1922: to the assumption of a

1923: {\em uniform distribution}\index{distribution!uniform}

1924: of the probabilities concerned.

1925:

1926: We can generalize this approach,

1927: taking into account

1928: the frequencies or probabilities of the occurrences of the different

1929: values $X$ and $Y$ can assume.

1930: Let the {\em joint probability}\index{probability!joint}

1931: $f(x,y)$ be the ``probability of

1932: the joint occurrence of event $X=x$ and event $Y=y$.''

1933: The {\em marginal probabilities} $f_1 (x)$ and $f_2(y)$ are

1934: defined by $f_1 (x)= \sum_y f(x,y)$ and $f_2 (y)= \sum_x f(x,y)$ and

1935: are ``the probability of the occurrence of the event $X=x$''

1936: and the ``probability of the occurrence of the event $Y=y$'',

1937: respectively.

1938: This leads to the self-evident formulas for joint variables $X,Y$:

1939: \begin{eqnarray*}

1940:  && H(X,Y)  =   \sum_{x,y}  f(x,y) \log 1/ f(x,y), \\

1941:  && H(X)  =   \sum_{x}  f_1(x) \log  1/f_1(x), \\

1942:  && H(Y)  =   \sum_{y}  f_2(y) \log 1/ f_2(y) ,

1943: \end{eqnarray*}

1944: where summation over $x$ is taken over all outcomes of the random variable

1945: $X$ and summation over $y$ is taken over all outcomes of random variable $Y$.

1946: One can show that

1947: \begin{equation}

1948: H(X,Y)  \leq H(X) + H(Y) ,

1949: \label{I4}

1950: \end{equation}

1951: with equality only in the case that $X$ and $Y$ are independent.

1952: In all of these equations the

1953: entropy quantity on the left-hand side increases if

1954: we choose the probabilities on the right-hand side

1955: more equally.

1956:

1957: \paragraph{Conditional entropy:}

1958: We start

1959: the analysis of

1960: the information in $X$ about $Y$ by first considering

1961: the %

1962: \it conditional

1963: entropy \index{entropy!conditional|bold}%

1964: \rm of $Y$ %

1965: \rm given $X$ as the average of the

1966: entropy for $Y$ for each value of $X$ %

1967: \rm weighted

1968: by the probability of getting that particular value:

1969: \begin{eqnarray*}

1970: H(Y| X)

1971:  & = &   \sum_x  f_1(x) H(Y|X=x) \\

1972:  & = &   \sum_x  f_1(x) \sum_{y} f(y|x) \log 1/ f(y|x) \\

1973:  & = &  \sum_{x,y} f(x,y) \log 1/f(y| x) .

1974: \end{eqnarray*}

1975: Here $f(y|x)$ is the conditional probability mass function as defined

1976: in Section~\ref{sec:preliminaries}.

1977:

1978: The quantity on the left-hand side tells us

1979: how uncertain we are on average about the outcome of $Y$

1980: when we know an outcome of $X$. With

1981: \begin{eqnarray*}

1982: H(X) & = &  \sum_x  f_1(x) \log 1/ f_1(x) \\

1983: &  = &  \sum_{x} \left(\sum_y f(x,y) \right)

1984: \log  1/ \sum_y f(x,y)  \\

1985: &  = &  \sum_{x,y} f(x,y)

1986: \log  1/ \sum_{y'} f(x,y') ,

1987: \end{eqnarray*}

1988: and substituting the formula for $f(y|x)$, we find

1989: $H(Y| X)  =  H(X,Y) - H(X)$. Rewrite this expression as

1990: the Entropy Equality

1991: \begin{equation}

1992: H(X,Y)  =  H (X) + H(Y| X).

1993: \label{I5}

1994: \end{equation}

1995: This can be interpreted as, ``the uncertainty of

1996: the joint event $(X,Y)$ is the uncertainty of $X$

1997: plus the uncertainty of $Y$ given $X$.''

1998: Combining Equations~\ref{I4}, \ref{I5} gives

1999: $H(Y)  \geq H(Y| X)$, which can be taken to imply

2000: that, on average, knowledge of $X$ can never increase uncertainty

2001: of $Y$. In fact, uncertainty in $Y$ will be decreased

2002: unless $X$ and $Y$ are independent.
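The substitution step leading to the Entropy Equality can be written out as
follows, assuming the conditional probability mass function of
Section~\ref{sec:preliminaries} has the usual form $f(y|x) = f(x,y)/f_1(x)$:
\begin{align*}
H(Y|X) & = \sum_{x,y} f(x,y) \log \frac{f_1(x)}{f(x,y)} \\
 & = \sum_{x,y} f(x,y) \log 1/f(x,y) - \sum_{x,y} f(x,y) \log 1/f_1(x) \\
 & = H(X,Y) - \sum_x f_1(x) \log 1/f_1(x) = H(X,Y) - H(X).
\end{align*}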

2003: \paragraph{Information:}

2004: The

2005: \it information %

2006: \rm in the outcome $X=x$ %

2007: \rm about $Y$ is defined as

2008: \begin{equation}

2009:  I(X=x: Y) = H(Y) - H(Y| X=x) .

2010: \label{I6}

2011: \end{equation}

2012: Here the quantities $H(Y)$ and $H(Y| X=x)$ on the right-hand side

2013: of the equations are always equal to or less than the

2014: corresponding quantities under the uniform distribution

2015: we analyzed first. The values of the quantities

2016: $I(X=x: Y)$ under the assumption of uniform distribution

2017: of $Y$ and $Y|X=x$ versus any other distribution are not

2018: related by inequality in a particular direction.

2019: The equalities $H(X|X=x) = 0$ and $I(X=x: X) = H(X)$

2020: hold under any distribution of the variables. Since

2021: $I(X=x: Y)$ is a function of outcomes of $X$, while $I(Y=y: X)$

2022: is a function of outcomes of $Y$, we do not compare them directly.

2023: However, forming the expectation defined as

2024: \begin{eqnarray*}

2025: {\bf E} (I(X=x: Y)) & = & \sum_x f_1 (x)I(X=x: Y), \\

2026: {\bf E} (I(Y=y: X)) & = & \sum_y f_2(y)I(Y=y: X),

2027: \end{eqnarray*}

2028: and combining Equations~\ref{I5}, \ref{I6},

2029: we see that the resulting quantities are equal. Denoting

2030: this quantity by $I(X;Y)$ and calling

2031: it the %

2032: \it mutual information\index{information!mutual|bold} %

2033: \rm in $X$ and $Y$,

2034: we see that this information is

2035: {\it symmetric}:\index{information!symmetry of|see{symmetry of information}}\index{symmetry of information!stochastic|bold}

2036: \begin{equation}

2037: I(X; Y) = {\bf E} (I(X=x: Y)) = {\bf E} (I(Y=y: X)).

2038: \end{equation}

2039: Writing this out we find

2040: that the

2041:  {\em mutual information} $I(X;Y)$

2042: is defined by:

2043: \begin{equation}\label{eq.mutinfprob}

2044:  I(X;Y) = \sum_x \sum_y f(x,y) \log \frac{f(x,y)}{f_1(x)f_2(y)} .

2045: \end{equation}

2046: Another way to express this is as follows: a well-known criterion for

2047: the difference between a given distribution $f(x)$ and a distribution

2048:  $g(x)$ it is compared with is the so-called

2049: {\em Kullback-Leibler divergence}

2050: \begin{equation}\label{eq.kl}

2051: D(f \parallel g ) = \sum_x f(x) \log f(x)/g(x).

2052: \end{equation}

2053: It has the important property that

2054: \begin{equation}\label{eq.ii}

2055: D (f \parallel g ) \geq 0

2056: \end{equation}

2057: with equality iff $f(x)=g(x)$ for all $x$. This is called the

2058: {\em information inequality} in \cite{CT91}, p. 26.

2059: Thus, the mutual information is the Kullback-Leibler divergence between

2060: the joint distribution and the product

2061: $f_1(x)f_2(y)$ of the two marginal distributions. If this quantity is 0

2062: then $f(x,y)=f_1(x)f_2(y)$ for every pair $x,y$, which is the same as

2063: saying that $X$ and $Y$ are independent random variables.
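For completeness, the step from the expectations to (\ref{eq.mutinfprob})
can be sketched as follows, using only (\ref{I5}), (\ref{I6}) and the
definitions of the marginals $f_1, f_2$:
\begin{align*}
{\bf E}(I(X=x:Y)) & = H(Y) - \sum_x f_1(x) H(Y|X=x) = H(Y) - H(Y|X) \\
 & = H(X) + H(Y) - H(X,Y) \\
 & = \sum_{x,y} f(x,y) [ \log 1/f_1(x) + \log 1/f_2(y) - \log 1/f(x,y) ] \\
 & = \sum_{x,y} f(x,y) \log \frac{f(x,y)}{f_1(x) f_2(y)} .
\end{align*}
The second line is symmetric in $X$ and $Y$, so the same computation
starting from ${\bf E}(I(Y=y:X))$ yields the same value.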

2064: \begin{example}

2065: \rm

2066: \label{ex:mutual}

2067: Suppose we want to exchange the information about the outcome $X=x$

2068: and it is known already that outcome $Y=y$ is the case,

2069: that is, $x$ has property $y$.

2070: Then we require (using the Shannon-Fano code) about

2071: $ \log 1/ P(X=x | Y=y)$ bits to communicate $x$. On average, over the

2072: joint distribution $P(X=x, Y=y)$ we use $H(X|Y)$ bits,

2073: which is optimal by Shannon's noiseless coding theorem.

2074: In fact, exploiting the mutual information paradigm,

2075: the expected information $I(Y ; X)$

2076: that outcome $Y=y$ gives about outcome $X=x$

2077: is the same as the expected information that $X=x$ gives about $Y=y$,

2078: and is never negative. Yet

2079: there may certainly exist

2080: {\em individual\/} $y$ such that $I(Y=y : X)$ is negative. For example, we may

2081: have ${\cal X } = \{0,1\}$, ${\cal Y} = \{0,1\}$, $P(X=1 | Y=0) = 1$,

2082: $P(X=1 | Y = 1) = 1/2$, $P(Y=1) = \epsilon$. Then $P(X=1) = 1 - \epsilon/2$, so $I(Y; X) =

2083: H(\epsilon/2,1- \epsilon/2) - \epsilon$ whereas

2084: $I(Y=1 : X) = H(\epsilon/2,1-\epsilon/2) - 1$. For small

2085: $\epsilon$, this quantity is smaller than $0$.

2086: %However, in terms of individual relation between $x$ and $y$

2087: %this can be very high or very low.

2088: \end{example}
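As a numerical illustration, take $\epsilon = 0.1$ (an arbitrary value
chosen here only for concreteness) and work directly from the definitions.
We have $P(X=1) = (1-\epsilon) \cdot 1 + \epsilon \cdot 1/2 = 0.95$,
$H(X|Y=0) = 0$ and $H(X|Y=1) = 1$, so
\begin{align*}
H(X) & = H(0.05, 0.95) \approx 0.286, \\
I(Y=0:X) & = H(X) - H(X|Y=0) \approx 0.286, \\
I(Y=1:X) & = H(X) - H(X|Y=1) \approx -0.714, \\
I(Y;X) & = 0.9 \cdot I(Y=0:X) + 0.1 \cdot I(Y=1:X) \approx 0.186 .
\end{align*}
The individual value $I(Y=1:X)$ is negative although the expectation is not.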

2089:

2090: \

2091: \\

2092: {\bf Problem and Lacuna:}

2093: The quantity $I(Y; X)$

2094: symmetrically characterizes to what extent random

2095: variables $X$ and $Y$ are correlated. An inherent problem

2096: with probabilistic definitions

2097: is that --- as we have just seen --- although $I(Y; X) = {\bf E} (I(Y

2098: = y: X))$

2099: is always nonnegative, for some probability

2100: distributions

2101: %peter1 changed

2102: %$I(X: Y)$

2103: %into

2104: and some $y$, $I(Y=y: X)$

2105: %

2106: can turn out to be negative---which

2107: definitely contradicts our naive notion of information content.

2108: \commentout{

2109: %peter1 changed 3 lines

2110: %The development of this theory immediately gave rise to

2111: %at least two different questions. The first observation is that

2112: %the

2113: How is this possible? The

2114: %

2115: concept of information as used in the theory of communication

2116: is a probabilistic notion, which is natural for

2117: information transmission over communication channels.

2118: Nonetheless,

2119: %as we

2120: %have seen from the discussion,

2121: we tend to

2122: identify

2123: %

2124: \it probabilities %

2125: \rm of messages with

2126: %

2127: \it frequencies %

2128: \rm of messages in a sufficiently

2129: long sequence, which under some conditions on the stochastic

2130: source can be rigorously justified.

2131: For instance, Morse code\index{code!Morse} transmissions of English

2132: telegrams over a communication channel

2133: can be validly treated by probabilistic

2134: methods even if we (as is usual) use empirical

2135: frequencies for probabilities. The great probabilist,

2136: Kolmogorov, remarks, ``If something goes wrong here,

2137: \index{Kolmogorov, A.N.}

2138: the problem lies in the vagueness of our ideas of

2139: the relation between mathematical probability theory

2140: and real random events in general.''

2141: %peter1 added

2142: }

2143: The {\em algorithmic\/} mutual information we introduce below can {\em

2144:   never\/} be negative, and in this sense is closer to the intuitive

2145: notion of information content.

2146:

2147: \subsection{Algorithmic Mutual Information}

2148: \label{sec:algmi}

2149: For individual objects the information about one another

2150: is possibly even more fundamental than for random sources.

2151: Kolmogorov \cite{Ko65}:

2152: \begin{quote}

2153: Actually, it is most fruitful to discuss the quantity of information

2154: ``conveyed by an object'' ($x$) ``about an object'' ($y$). It is not

2155: an accident that in the probabilistic approach this has led

2156: to a generalization to the case of continuous variables, for which

2157: the entropy is infinite but, in a large number of cases,

2158: \[

2159: I_W(x,y) = \int \int P_{xy}(dx \; dy)\log_2

2160: \frac{P_{xy}(dx \; dy)}{P_x(dx) P_y(dy)}

2161: \]

2162: is finite.

2163: The real objects that we study are very (infinitely) complex,

2164: but the relationships between two separate objects diminish as the

2165: schemes used to describe them become simpler.

2166: While a map yields a considerable amount of information about a region

2167: of the earth's surface, the microstructure of the paper and the ink

2168: on the paper have no relation to the microstructure

2169: of the area shown on the map.

2170: \end{quote}

2171: In the discussions on Shannon mutual information, we first

2172: needed to introduce a conditional version of entropy. Analogously, to

2173: prepare for the definition of algorithmic mutual information, we need

2174: a notion of conditional Kolmogorov complexity.

2175:

2176: Intuitively, the

2177: conditional prefix Kolmogorov complexity $K(x|y)$ of $x$ given $y$ can

2178: be interpreted as the length of the shortest prefix program $p$ such that, when $y$

2179: is given to the program $p$ as input, the program prints $x$ and then

2180: halts. The idea of providing $p$ with an input $y$ is realized by putting

2181: $\langle p,y \rangle$ rather than just $p$ on the input tape of the

2182: universal prefix machine $U$.

2183: \begin{definition}\label{def.KolmKb}

2184: \rm

2185: The {\em conditional prefix Kolmogorov complexity} of $x$ given $y$ (for

2186: free) is

2187: \[K(x|y) = \min_{p}\{l(p): U(\langle p,y \rangle )=x , p \in \{0,1\}^*\}. \]

2188: We define

2189: \begin{equation}

2190: \label{eq:redefine}

2191: K(x)=K(x|\epsilon).

2192: \end{equation}

2193: \end{definition}

2194: Note that we just redefined $K(x)$ so that the unconditional

2195: Kolmogorov complexity is {\em exactly\/} equal to the conditional

2196: Kolmogorov complexity with empty input. This does not contradict our

2197: earlier definition: we can choose a reference prefix machine $U$ such

2198: that $U(\langle p,\epsilon \rangle) = U(p)$. Then (\ref{eq:redefine})

2199: holds automatically.

2200:

2201:

2202: We now have the technical apparatus to express the relation between

2203: entropy inequalities and Kolmogorov complexity inequalities.

2204: Recall that the entropy expresses the expected information

2205: to transmit an outcome of a known random source,

2206: while the Kolmogorov complexity

2207: of every such outcome expresses the specific information

2208: contained in that outcome. This makes us wonder to what extent the

2209: entropy-(in)equalities hold for the corresponding Kolmogorov

2210: complexity situation. In the latter case the corresponding (in)equality

2211: is a far stronger statement, implying the same (in)equality in the

2212: entropy setting. It is remarkable, therefore, that similar inequalities

2213: hold for both cases, where the entropy ones hold exactly while the

2214: Kolmogorov complexity ones hold up to a logarithmic, and in some cases

2215: $O(1)$,

2216: additive precision.

2217:

2218: \paragraph{Additivity:}

2219: %Recall the notation $\eqa, \gea$ (Section~\ref{sec:coding}).

2220: By definition, $K(x,y) = K(\langle x,y \rangle)$.

2221: Trivially, the symmetry property holds: $K(x,y) \eqa K(y,x)$.

2222: Another interesting property is the ``Additivity of Complexity''

2223: property that, as we explain further below,

2224: is equivalent to the ``Symmetry of

2225: Algorithmic Mutual Information'' property. Recall that

2226: $x^*$ denotes the first (in a standard enumeration order)

2227: shortest prefix program that

2228: generates $x$ and then halts.

2229: \begin{theorem}[Additivity of Complexity/Symmetry of Mutual Information]

2230: \label{thm:additive}

2231: \begin{equation}\label{eq.soi}

2232:   K(x, y) \eqa K(x) + K(y \mid x^*) \eqa K(y) + K(x \mid y^*).

2233:  \end{equation}

2234: \end{theorem}

2235: This is the Kolmogorov complexity equivalent of the entropy equality

2236: (\ref{I5}). That this latter equality holds is true

2237: by simply rewriting both sides of the equation according to the

2238: definitions of averages of joint and marginal probabilities.

2239: In fact, potential individual differences are averaged out.

2240: But in the Kolmogorov complexity case we do nothing like that:

2241: it is truly remarkable that additivity of algorithmic information

2242: holds for individual objects. It was first proven by Kolmogorov

2243: and Leonid A. Levin for the plain (non-prefix) version

2244: of Kolmogorov complexity, where it holds up to an additive logarithmic

2245: term, and reported in \cite{ZvLe70}.

2246: The prefix-version (\ref{eq.soi}), holding up to an $O(1)$

2247: additive term, is due to \cite{Ga74}, can be found

2248: as Theorem 3.9.1 in~\cite{LiVi97}, and has a difficult proof.

2249: \paragraph{Symmetry:}

2250: To define the algorithmic mutual information between

2251: two individual objects $x$ and $y$ with no

2252: probabilities involved, it is instructive to first recall

2253: the probabilistic notion (\ref{eq.mutinfprob}).

2254: Rewriting (\ref{eq.mutinfprob})

2255: as

2256: \[ \sum_x \sum_y f(x,y) [  \log 1/f(x) + \log 1/ f(y) - \log 1/f(x,y) ] , \]

2257: and noting that $ \log 1/ f ( s )$ is

2258: very close to the length of the

2259: prefix-free Shannon-Fano code for $s$, we are led to the following

2260: definition.

2261: %\footnote{The Shannon-Fano code has nearly optimal expected

2262: %code length equal to the entropy with

2263: %respect to the distribution of the source \cite{CT91}. However,

2264: %the prefix-free code with code word length $K(s)$ has both

2265: %about expected optimal code word length and individual optimal

2266: %effective code word length, \cite{LiVi97}.}

2267: The

2268: {\em information in  $y$ about $x$}

2269:  is defined as

2270:  \begin{equation}\label{def.mutinf}

2271:    I(y : x) = K(x) - K(x  \mid  y^*) \eqa K(x) + K(y) - K(x, y),

2272:  \end{equation}

2273: where the second equality is a consequence of~(\ref{eq.soi})

2274: and states that this information is symmetrical,

2275: $I(x:y) \eqa I(y:x)$, and therefore we can talk about

2276: {\em mutual information}.\footnote{The notation of the

2277: algorithmic (individual)

2278:  notion $I(x:y)$ distinguishes it from the probabilistic

2279: (average) notion

2280: $I(X; Y)$.  We deviate slightly from~\cite{LiVi97}

2281: where $I(y : x)$ is defined as $K(x) - K(x \mid y)$.}
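The second equality in (\ref{def.mutinf}) can be unpacked in one line.
By (\ref{eq.soi}), $K(x \mid y^*) \eqa K(x,y) - K(y)$, hence
\[ I(y:x) = K(x) - K(x \mid y^*) \eqa K(x) + K(y) - K(x,y) . \]
The right-hand side is unchanged up to $O(1)$ when $x$ and $y$ are
swapped, since $K(x,y) \eqa K(y,x)$, and so $I(x:y) \eqa I(y:x)$.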

2282: % \begin{remark}\label{rem.cami}

2283: %The conditional mutual information is

2284: % \begin{align*}

2285: %   I(x : y \mid z) & = K(x \mid z) - K(x \mid y, K(y \mid z), z)

2286: % \\ & \eqa K(x \mid z) + K(y \mid z) - K(x, y \mid z).

2287: % \end{align*}

2288: % \end{remark}

2289: \paragraph{Precision -- $O(1)$ vs. $O(\log n)$:}

2290: The version of (\ref{eq.soi})  with just $x$ and $y$ in the

2291: conditionals doesn't

2292: hold with $\eqa$, but holds up to additive logarithmic terms

2293: that cannot be eliminated. To gain some further insight in this

2294: matter, first consider the following lemma:

2295: \begin{lemma}

2296: $x^*$ has the same information as the

2297: pair $x,K(x)$, that is, $K(x^* \mid x,K(x)),K(x,K(x) \mid x^*)=O(1)$.

2298: \end{lemma}

2299: \begin{proof}

2300: Given $x^*$ we compute $x = U(x^*)$ and $K(x) = l(x^*)$. Conversely, given $x,K(x)$ we can run all programs simultaneously in

2301: dovetailed fashion and select the first program of length $K(x)$

2302: that halts with output $x$ as $x^*$. (Dovetailed fashion means that

2303: in phase $k$ of the process we run all programs $i$ for $j$ steps

2304: such that $i+j=k$, $k=1,2, \ldots$)

2305: \end{proof}

2306:

2307: \noindent

2308: Thus, $x^*$ provides more information than $x$. Therefore, we have to

2309: be very careful when extending Theorem~\ref{thm:additive}.

2310: For example, the conditional version of (\ref{eq.soi}) is:

2311:  \begin{equation}\label{eq.soi-cond}

2312:   K(x, y \mid z) \eqa K(x \mid z) + K(y \mid x, K(x \mid z), z).

2313:  \end{equation}

2314: Note that a naive version

2315:  \[

2316:   K(x, y \mid z) \eqa K(x \mid z) + K(y \mid x^{*}, z)

2317:  \]

2318: is incorrect: taking $z = x$, $y = K(x)$,

2319: the left-hand side equals $K(x^{*} \mid x)$ which can be as large as

2320: $\log n - \log \log n + O(1)$, and the right-hand side

2321: equals $K(x \mid x) + K(K(x) \mid x^{*}, x) \eqa 0$.
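In more detail, with $z=x$ and $y=K(x)$ the two sides evaluate as
\begin{align*}
K(x, K(x) \mid x) & \eqa K(K(x) \mid x) \eqa K(x^* \mid x), \\
K(x \mid x) + K(K(x) \mid x^*, x) & \eqa 0 + 0 .
\end{align*}
The first line uses the lemma above: $x^*$ and the pair $x,K(x)$ carry the
same information. The second line uses $K(x) = l(x^*)$, so that $K(x)$ can
be read off from $x^*$ by an $O(1)$-length program.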

2322:

2323: But up to logarithmic precision we do not need to

2324: be that careful. In fact, it turns out that {\em every}

2325: linear entropy inequality holds for the corresponding Kolmogorov

2326: complexities within a logarithmic additive error, \cite{HRSV00}:

2327: \begin{theorem}

2328: All linear (in)equalities that are valid for Kolmogorov complexity

2329: are also valid for Shannon entropy and vice versa---provided

2330: we require the

2331: Kolmogorov complexity (in)equalities to hold up to additive

2332: logarithmic precision only.

2333: \end{theorem}

2334: \subsection{Expected Algorithmic Mutual Information Equals Probabilistic Mutual Information}

2335: Theorem~\ref{theo.eq.entropy}

2336: gave the relationship between entropy and ordinary Kolmogorov

2337: complexity; it showed that the entropy of distribution $P$ is

2338: approximately equal to the expected (under $P$) Kolmogorov

2339: complexity. Theorem~\ref{thm:mutinf} gives the analogous result for

2340: the mutual information (to facilitate comparison to

2341: Theorem~\ref{theo.eq.entropy}, note that $x$ and $y$ in

2342: (\ref{eq.eqamipmi}) below may stand for strings of arbitrary length $n$).

2343: \begin{theorem}

2344: \label{thm:mutinf}

2345: Given a computable probability distribution $f(x,y)$ over $(x,y)$

2346: we have

2347: \begin{align}\label{eq.eqamipmi}

2348: I(X; Y) - K(f) & \lea  \sum_x \sum_y f(x,y) I(x:y)

2349: \\& \lea I(X;Y) + 2 K(f) ,

2350: \nonumber

2351: \end{align}

2352: %where $c_f$ is a constant that depends only on $f$ (it is the length of the shortest prefix-free program that computes

2353: %$f(x,y)$ from input $(x,y)$).

2354: \end{theorem}

2355: \begin{proof}

2356: Rewrite the expectation

2357: \begin{align*}

2358: \sum_x \sum_y f(x,y) I(x:y) \eqa

2359: \sum_x \sum_y & f(x,y)  [K(x)

2360: \\& + K(y) - K(x, y)].

2361: \end{align*}

2362:  Define

2363: $\sum_y f(x,y) = f_1 (x)$

2364: and $\sum_x f(x,y) = f_2(y)$

2365: to obtain

2366: \begin{align*}

2367: \sum_x \sum_y f(x,y) I(x:y) \eqa

2368:  \sum_x & f_1 (x) K(x)

2369:  + \sum_y f_2 (y) K(y)

2370: \\& - \sum_{x,y} f(x,y) K(x, y).

2371: \end{align*}

2372: Given the program that computes $f$, we can approximate $f_1 (x)$

2373: by  $q_1 (x,y_0) = \sum_{y \leq y_0} f(x,y)$, and

2374: similarly for $f_2$. That is, the

2375: distributions $f_i$ ($i=1,2$) are lower semicomputable.

2376: Because they sum to 1 it can be shown

2377: they must also be computable.

2378: By Theorem~\ref{theo.eq.entropy},

2379: we have $H(g) \lea \sum_x g(x) K(x) \lea H(g) + K(g)$

2380: for every computable probability mass function $g$.

2381:

2382: Hence, $H(f_i) \lea \sum_x f_i (x) K(x) \lea H(f_i) + K(f_i)$

2383: ($i=1,2$), and $H(f) \lea \sum_{x,y} f (x,y) K(x,y) \lea H(f) + K(f)$.

2384: On the other hand, the probabilistic mutual information

2385:  (\ref{eq.mutinfprob}) is expressed in the entropies by

2386: $I(X;Y) = H(f_1) + H(f_2) - H(f)$.

2387: By construction of the $f_i$'s above,

2388: we have $K(f_1), K(f_2) \lea K(f)$. Since the complexities

2389: are positive, substitution

2390: establishes the theorem.

2391: \end{proof}
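The substitution in the last step of the proof can be made explicit.
For the upper bound, use the upper bounds on the two marginal terms and
the lower bound on the joint term:
\begin{align*}
\sum_x \sum_y f(x,y) I(x:y)
 & \lea [ H(f_1) + K(f_1) ] + [ H(f_2) + K(f_2) ] - H(f) \\
 & \lea I(X;Y) + 2 K(f) .
\end{align*}
For the lower bound, use the lower bounds on the marginal terms and the
upper bound on the joint term:
\begin{align*}
\sum_x \sum_y f(x,y) I(x:y)
 & \gea H(f_1) + H(f_2) - [ H(f) + K(f) ] \\
 & = I(X;Y) - K(f) .
\end{align*}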

2392:

2393: Can we get rid of the $K(f)$ error term? The answer is affirmative;

2394: by putting $f(\cdot)$ in the conditional, and

2395: applying \eqref{eq.condentropy}, we can even get rid of

2396: the computability requirement.

2397:

2398: \begin{lemma}

2399: Given a joint probability distribution $f(x,y)$ over $(x,y)$

2400: (not necessarily computable) we have

2401: \[ I(X;Y)  \eqa  \sum_x \sum_y f(x,y) I(x:y \mid f) , \]

2402: where the auxiliary $f$ means that we can directly access the

2403: values $f(x,y)$ on the

2404: auxiliary conditional information tape of the reference

2405: universal prefix machine.

2406: \end{lemma}

2407:

2408: \begin{proof}

2409: The lemma follows from the definition of conditional

2410: algorithmic mutual information,

2411: if we show that $\sum_{x}

2412: f(x) K(x \mid f) \eqa H(f)$,

2413: where the $O(1)$ term implicit in the $\eqa$ sign

2414: is independent of $f$.

2415:

2416: Equip the reference universal prefix machine,

2417: with an $O(1)$ length

2418: program to compute a Shannon-Fano code from the auxiliary table

2419: of probabilities.

2420: Then, given an input $r$, it can determine

2421: whether $r$ is the Shannon-Fano code word for some $x$.

2422: Such a code word

2423: has length $\eqa  \log 1/f(x)$.

2424: If this is the case, then the machine

2425: outputs $x$, otherwise it halts without output. Therefore,

2426: $K(x   \mid f) \lea  \log 1/ f(x)$.

2427: This shows

2428: the upper bound on the expected prefix complexity.

2429: The lower bound follows as usual

2430: from the Noiseless Coding Theorem.

2431: \end{proof}
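A sketch of why such a code exists, assuming the standard Shannon-Fano
construction with code word lengths $l_x = \lceil \log 1/f(x) \rceil$ for
all $x$ with $f(x) > 0$ (the symbol $l_x$ is introduced here only for
illustration). These lengths satisfy the Kraft inequality,
\[ \sum_x 2^{-l_x} \leq \sum_x 2^{-\log 1/f(x)} = \sum_x f(x) = 1, \]
so a prefix code with these lengths exists, and its expected length
satisfies
\[ H(f) \leq \sum_x f(x) l_x < \sum_x f(x) ( \log 1/f(x) + 1 ) = H(f) + 1 . \]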

2432:

2433: Thus, we see that the expectation of the algorithmic mutual

2434: information $I(x:y)$ is close to the probabilistic mutual information

2435: $I(X; Y)$ --- which is important: if

2436: this were not the case then the algorithmic notion would not

2437: be a sharpening of the probabilistic notion to individual objects,

2438: but something else.

2439:

2440:

2441: \section{Mutual Information Non-Increase}

2442: \label{sect.mini}

2443: \subsection{Probabilistic Version}

2444: Is it possible to increase the mutual information between

2445: two random variables, by processing the outcomes in some deterministic

2446: manner? The answer is negative:

2447: For every function $T$

2448: we have

2449: \begin{equation}\label{eq.infnonincrprob}

2450: I(X ; Y) \geq I( X ; T(Y)),

2451: \end{equation}

2452: that is, mutual information between two random variables

2453: cannot be increased by processing their outcomes in any deterministic way.

2454: The same holds in an appropriate sense for randomized processing

2455: of the outcomes of the random variables.

2456: This fact is called the {\em data processing inequality} \cite{CT91},

2457: Theorem 2.8.1. The reason why it holds is that \eqref{eq.mutinfprob}

2458: is expressed in terms of probabilities $f(a,b), f_1(a), f_2(b)$,

2459: rather than in terms of the arguments.

2460: Processing the arguments $a,b$ will not increase the value

2461: of the expression in the right-hand side. If the processing of the arguments

2462: just renames them in a one-to-one manner then the expression

2463: keeps the same value. If the processing eliminates or merges arguments

2464: then it is easy to check from the formula

2465: that the expression value doesn't increase.
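The merging case can be checked with the log sum inequality: for
nonnegative numbers $a_1, a_2$ and positive $b_1, b_2$,
\[ a_1 \log \frac{a_1}{b_1} + a_2 \log \frac{a_2}{b_2}
 \geq (a_1 + a_2) \log \frac{a_1 + a_2}{b_1 + b_2} . \]
As an illustration, suppose $T$ maps two distinct values $b, b'$ to the
same value $c$ (the names $b'$ and $c$ are used here only for this sketch).
For fixed $a$, taking $a_1 = f(a,b)$, $a_2 = f(a,b')$,
$b_1 = f_1(a) f_2(b)$ and $b_2 = f_1(a) f_2(b')$ gives
\begin{align*}
 & f(a,b) \log \frac{f(a,b)}{f_1(a) f_2(b)}
 + f(a,b') \log \frac{f(a,b')}{f_1(a) f_2(b')} \\
 & \qquad \geq ( f(a,b) + f(a,b') )
 \log \frac{f(a,b) + f(a,b')}{f_1(a) ( f_2(b) + f_2(b') )} .
\end{align*}
The right-hand side is the term for the pair $(a,c)$ in
\eqref{eq.mutinfprob} after merging, so summing over $a$ shows that the
merged expression is no larger than the original one.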

2466:

2467: \subsection{Algorithmic Version}

2468: \label{sect:minialg}

2469: In the algorithmic version of mutual information, the notion

2470: is expressed in terms of the individual arguments instead of

2471: solely in terms of the probabilities as in the probabilistic version.

2472: Therefore, the reason for \eqref{eq.infnonincrprob} to hold is

2473: not valid in the algorithmic case. Yet it turns out that the

2474: data processing inequality also holds between individual objects,

2475: by far more subtle arguments and not precisely but with a small

2476: tolerance. The first to observe this fact was Leonid A. Levin

2477: who proved his ``information non-growth,'' and ``information

2478: conservation inequalities'' for both finite and infinite sequences

2479: under both deterministic and randomized data processing,

2480: \cite{Le74,Le84}.

2481:

2482:

2483: \subsubsection{A Triangle Inequality}

2484: We first discuss some useful technical lemmas.

2485: The  additivity of complexity (symmetry of information)

2486: \eqref{eq.soi} can be used to

2487: derive a ``directed triangle inequality'' from \cite{GTV01},

2488: that is needed later.

2489:  \begin{theorem}\label{lem.magic}

2490: For all $x,y,z$,

2491:  \[

2492:   K(x \mid y^*) \lea K(x, z \mid y^{*}) \lea K(z \mid y^*) + K(x \mid z^*).

2493:  \]

2494:  \end{theorem}

2495:

2496: \begin{proof}

2497: Using~(\ref{eq.soi}), an evident inequality introducing

2498: an auxiliary object $z$, and then (\ref{eq.soi}) twice more:

2499:  \begin{align*}

2500:   K(x, z \mid y^*) &\eqa

2501:     K(x,y,z) - K(y)

2502: \\ & \lea K(z) + K(x \mid z^*) + K(y \mid z^*) - K(y)

2503: \\ &\eqa K(y,z) - K(y) + K(x \mid z^*)

2504: \\ & \eqa K(x \mid z^*) + K(z \mid y^*).

2505:  \end{align*}

2506:

2507: \end{proof}
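
The four lines of the proof can be annotated as follows. This is only a reading aid, and it assumes that \eqref{eq.soi} is the symmetry of information in the form $K(u,v) \eqa K(u) + K(v \mid u^*)$; the names $u,v$ are placeholders introduced here.
\begin{align*}
K(x,y,z) - K(y) &\eqa K(x,z \mid y^*)
   && \text{\eqref{eq.soi} with $u=y$, $v=(x,z)$,}\\
K(x,y,z) &\lea K(z) + K(x \mid z^*) + K(y \mid z^*)
   && \text{describe $z$, then $x$ and $y$ separately from $z^*$,}\\
K(z) + K(y \mid z^*) &\eqa K(y,z)
   && \text{\eqref{eq.soi} with $u=z$, $v=y$,}\\
K(y,z) - K(y) &\eqa K(z \mid y^*)
   && \text{\eqref{eq.soi} with $u=y$, $v=z$.}
\end{align*}
The left inequality of the theorem, $K(x \mid y^*) \lea K(x,z \mid y^*)$, holds because $x$ can be extracted from the pair $(x,z)$ by a program of length $O(1)$.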

2508:

2509: \begin{remark}

2510: \rm

2511: This theorem has bizarre consequences. These  consequences are not

2512: simple unexpected artifacts of our definitions, but, to the contrary,

2513: they show the power and the genuine contribution to our understanding

2514: represented by the deep and important mathematical relation

2515: (\ref{eq.soi}).

2516:

2517: Denote $k=K(y)$ and substitute $z=k$ and $x=K(k)$ in Theorem~\ref{lem.magic}

2518: to find the following counterintuitive corollary: To determine the complexity

2519: of the complexity of an object $y$ it suffices to give both $y$ and

2520: the complexity of $y$. This is counterintuitive since in general

2521: we cannot compute the complexity of an object from the object itself;

2522: if we could, this would also solve the

2523: so-called ``halting problem''~\cite{LiVi97}. This noncomputability

2524: can be quantified in terms of $K(K(y) \mid y )$ which can rise to

2525: almost $K(K(y))$ for some $y$.

2526: But in the

2527: seemingly similar, but subtly different, setting below it is possible.

2528:

2529: \begin{corollary}

2530: As above, let $k$ denote $K(y)$. Then,

2531: $K(K(k) \mid  y,k) \eqa K(K(k) \mid y^*)  \lea K(K(k) \mid k^*)+K(k \mid y,k) \eqa 0$.

2532: \end{corollary}

2533: \end{remark}
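
The corollary can be checked step by step. The following sketch uses only Theorem~\ref{lem.magic} and the fact that a shortest program $u^*$ has length $K(u)$; it also assumes, as is customary, that $u^*$ denotes the first shortest program for $u$ found in a fixed enumeration. Substituting $x := K(k)$ and $z := k$ in the theorem gives
\[
 K(K(k) \mid y^*) \lea K(k \mid y^*) + K(K(k) \mid k^*).
\]
Both terms on the right are $\eqa 0$: from $y^*$ one reads off $k = l(y^*) = K(y)$, and from $k^*$ one reads off $K(k) = l(k^*)$, each by a program of length $O(1)$. Finally, conditioning on $y^*$ and conditioning on the pair $(y,k)$ are equivalent up to an additive constant: $y^*$ determines $(y,K(y))$, and conversely $y^*$ can be found from $(y,K(y))$ by running all programs of length $K(y)$ in parallel until the first one outputs $y$.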

2534:

2535:

2536: We now return to the question whether the mutual information in one object

2537: about another can be increased by processing. In the probabilistic

2538: setting this was shown to hold for random variables. But does

2539: it also hold for individual outcomes?

2540: In \cite{Le74,Le84} it was shown that

2541: the information in one individual string about another

2542: cannot be increased by any deterministic algorithmic method

2543: by more than a constant. With added randomization this holds

2544: with overwhelming probability.

2545: Here, we follow the proof method

2546: of \cite{GTV01} and

2547: use the triangle inequality of Theorem~\ref{lem.magic} to restate

2548: and prove this information non-increase.

2549:

2550: %We need the following technical concepts.

2551: %Let us call a nonnegative

2552: %real function $f(x)$ defined on strings a {\em semimeasure} if

2553: %$\sum_{x} f(x) \le 1$, and a {\em measure} (a probability distribution)

2554: %if the sum is 1.

2555: %A function $f(x)$ is called {\em lower semicomputable} if there is a

2556: %rational valued computable function $g(n,x)$ such that

2557: %$g(n+1,x) \geq g(n,x)$ and $\lim_{n \rightarrow \infty} g(n,x) = f(x)$.

2558: %For an {\em upper semicomputable} function $f$ we require

2559: %that $-f$ is lower semicomputable.

2560: %It is computable when it is both lower and upper semicomputable.

2561: %(A lower semicomputable measure is also computable.)

2562:

2563: \subsubsection{Deterministic Data Processing:}

2564: Recall Definition~\ref{def.mutinf} and Theorem~\ref{eq.eqamipmi}.

2565: We prove a strong version of the information non-increase law

2566: under deterministic processing (later we need the attached corollary):

2567:

2568: \begin{theorem}

2569: Given $x$ and $z$, let $q$ be a program

2570: computing $z$ from $x^*$.

2571: Then

2572:  \begin{equation}\label{eq.nonincrease2}

2573:    I(z : y) \lea I(x : y) + K(q).

2574:  \end{equation}

2575: \end{theorem}

2576:

2577: \begin{proof}

2578: By the triangle inequality,

2579:  \begin{align*}

2580:    K(y \mid x^{*}) & \lea K(y \mid z^{*}) + K(z \mid x^{*})

2581:  \\&  \lea K(y \mid z^{*})+ K(q).

2582:  \end{align*}

2583: Thus,

2584:  \begin{align*}

2585:    I(x : y) & = K(y) - K(y \mid x^{*})

2586: \\ & \gea K(y) - K(y \mid z^{*}) - K(q)

2587:  \\ &  = I(z : y) - K(q).

2588:  \end{align*}

2589: \end{proof}

2590:

2591: This also implies the slightly weaker but intuitively

2592: more appealing statement that the mutual information between strings

2593: $x$ and $y$ cannot be increased by processing $x$ and $y$ separately by

2594: deterministic computations.

2595:  \begin{corollary} Let $f, g$ be recursive functions.

2596: Then

2597:  \begin{equation}\label{eq.nonincrease}

2598:    I(f(x) : g(y)) \lea I(x : y) + K(f)+K(g).

2599:  \end{equation}

2600:  \end{corollary}

2601:  \begin{proof}

2602: It suffices to prove the case $g(y) = y$ and apply it twice.

2603: The proof is by replacing the program $q$ that computes

2604: a particular string $z$

2605: from a particular $x^*$ in (\ref{eq.nonincrease2}). There, $q$

2606: possibly depends on $x^*$ and $z$. Replace it by a program $q_f$ that first

2607: computes $x$ from $x^*$, followed  by computing a

2608: recursive function

2609: $f$, that is,  $q_f$ is independent of $x$.

2610: Since we only require an $O(1)$-length program to compute

2611: $x$ from $x^*$ we can choose $l(q_f) \eqa K(f)$.

2612:

2613: By the triangle inequality,

2614:  \begin{align*}

2615:    K(y \mid x^{*}) & \lea K(y \mid f(x)^{*}) + K(f(x) \mid x^{*})

2616:  \\&  \lea K(y \mid f(x)^{*})+ K(f).

2617:  \end{align*}

2618: Thus,

2619:  \begin{align*}

2620:    I(x : y) & = K(y) - K(y \mid x^{*})

2621: \\ & \gea K(y) - K(y \mid f(x)^{*}) - K(f)

2622:  \\ &  = I(f(x) : y) - K(f).

2623:  \end{align*}

2624:  \end{proof}
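
Two illustrative instances of \eqref{eq.nonincrease} follow; the functions named here are hypothetical choices and are not taken from the source. First, let $f$ map a string to its first half and let $g$ be the identity. Both are computed by fixed programs, so $K(f)+K(g) = O(1)$ and \eqref{eq.nonincrease} reads
\[
  I(f(x) : y) \lea I(x : y),
\]
that is, discarding half of $x$ cannot create more than a constant number of bits of information about $y$. Second, let $f$ be the constant function $f(x) = z_0$ for a fixed string $z_0$, so that $K(f) \eqa K(z_0)$. Then \eqref{eq.nonincrease} only yields
\[
  I(z_0 : y) \lea I(x : y) + K(z_0),
\]
which is consistent with the trivial bound $I(z_0 : y) \lea K(z_0)$. In this case the information in $f(x)$ about $y$ may exceed that in $x$, but only by what is already contained in the description of $f$.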

2625:

2626:

2627: \subsubsection{Randomized Data Processing:}

2628: It turns out, furthermore, that randomized computation can increase

2629: information only with negligible probability.

2630: Recall from Section~\ref{sec:m} that the

2631: {\em universal probability} $\m(x) = 2^{-K(x)}$ is

2632: maximal within a multiplicative constant among lower semicomputable

2633: semimeasures.

2634: So, in particular, for each computable measure $f(x)$ we have

2635: $f(x) \leq c_1 \m(x)$, where the constant factor $c_1$ depends on $f$.

2636: This property also holds when we have an extra parameter, like $y^*$,

2637: in the condition.

2638:

2639:

2640: Suppose that $z$ is obtained from $x$ by some randomized computation.

2641: We assume that

2642: the probability $f(z \mid x)$ of obtaining $z$ from $x$ is a semicomputable

2643: distribution over the $z$'s.

2644: Therefore it is upper bounded, up to a multiplicative constant, by

2645: $\m(z \mid x) \leq c_2 \m(z \mid x^{*}) = c_2 2^{-K(z \mid x^{*})}$.

2646: The information increase $I(z : y) - I(x : y)$ satisfies the theorem below.

2647:

2648:  \begin{theorem} There is a constant $c_3$ such that

2649: for all $x,y,z$ we have

2650:  \[

2651:   \m(z \mid x^{*}) 2^{I(z : y) - I(x : y)}

2652:   \leq c_3 \m(z \mid x^{*}, y, K(y \mid x^{*})).

2653:  \]

2654:  \end{theorem}

2655:

2656: \begin{remark}

2657: \rm

2658: For example, the probability of an increase of mutual information

2659: by the amount $d$ is $O( 2^{-d})$.

2660: The theorem

2661: implies $\sum_{z} \m(z \mid x^{*}) 2^{I(z : y) - I(x : y)} =O(1)$, that is,

2662: the $\m(\cdot \mid x^{*})$-expectation of the exponential of the increase

2663: is bounded by a constant.

2664: \end{remark}

2665:

2666:  \begin{proof}

2667: We have

2668:  \begin{align*}

2669:  I(z : y) - I(x : y) & = K(y) - K(y \mid z^{*}) - (K(y) - K(y \mid x^{*}))

2670:  \\&  = K(y \mid x^{*}) - K(y \mid z^{*}).

2671:  \end{align*}

2672:  The negative logarithm of the left-hand side in the theorem is therefore

2673:  \[

2674:   K(z \mid x^{*}) + K(y \mid z^{*}) - K(y \mid x^{*}).

2675:  \]

2676: Using Theorem~\ref{lem.magic}, and the conditional

2677: additivity (\ref{eq.soi-cond}), this is

2678:  \[

2679:    \gea K(y, z \mid x^{*}) - K(y \mid x^{*}) \eqa

2680:         K(z \mid x^{*}, y, K(y \mid x^{*})).

2681:  \]

2682:  \end{proof}
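
The two claims in the remark that precedes the proof follow from the theorem by a short calculation, sketched here. Since $\m(\cdot \mid x^{*}, y, K(y \mid x^{*}))$ is a semimeasure, summing the inequality of the theorem over $z$ gives
\[
 \sum_{z} \m(z \mid x^{*}) 2^{I(z : y) - I(x : y)}
 \leq c_3 \sum_{z} \m(z \mid x^{*}, y, K(y \mid x^{*})) \leq c_3 .
\]
Markov's inequality then bounds the $\m(\cdot \mid x^{*})$-mass of the outcomes with a large increase: for every $d$,
\[
 \sum_{z \,:\, I(z : y) - I(x : y) \geq d} \m(z \mid x^{*})
 \leq 2^{-d} \sum_{z} \m(z \mid x^{*}) 2^{I(z : y) - I(x : y)}
 \leq c_3 2^{-d} .
\]
Because $f(z \mid x)$ is bounded by a constant multiple of $\m(z \mid x^{*})$, the same bound holds, with a larger constant depending on $f$, for the probability that the randomized computation itself increases the information by $d$ or more.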

2683:

2684: \begin{remark}

2685:   \rm An example of the use of algorithmic mutual information is as

2686:   follows \cite{Le02}.  A celebrated result of K. G\"odel states that

2687:   Peano Arithmetic is incomplete in the sense that it cannot be

2688:   consistently extended to a complete theory using recursively

2689:   enumerable axiom sets.  (Here `complete' means that every sentence of

2690:   Peano Arithmetic is decidable within the theory; for further

2691:   details on the terminology used in this example, we refer to

2692:   \cite{LiVi97}). The essence is the non-existence of total recursive

2693:   extensions of a universal partial recursive predicate. This is

2694:   usually taken to mean that mathematics is undecidable.

2695:   Non-existence of an algorithmic solution need not be a problem when

2696:   the requirements do not imply unique solutions. A perfect example is

2697:   the generation of strings of high Kolmogorov complexity, say of half

2698:   the length of the strings. There is no deterministic effective process that can

2699:   produce such a string; but repeatedly flipping a fair coin we

2700:   generate a desired string with overwhelming probability. Therefore,

2701:   the question arises whether randomized means allow us to bypass

2702:   G\"odel's result.  The notion of mutual information between two

2703:   finite strings can be refined and extended to infinite sequences, so

2704:   that, again, it cannot be increased by either deterministic or

2705:   randomized processing.  In \cite{Le02} the existence of an infinite

2706:   sequence is shown that has infinite mutual information with all

2707:   total extensions of a universal partial recursive predicate. As

2708:   Levin states ``it plays the role of password: no substantial

2709:   information about it can be guessed, no matter what methods are

2710:   allowed.''  This ``forbidden information'' is used to extend

2711:   G\"odel's incompleteness result to also hold for consistent

2712:   extensions to a complete theory by randomized means with

2713:   non-vanishing probability.

2714: \end{remark}

2715:

2716:

2717:

2718: \

2719: \\

2720: \paragraph{Problem and Lacuna:}

2721: Entropy, Kolmogorov complexity and mutual

2722: (algorithmic) information are concepts that do not distinguish

2723: between different {\em   kinds\/} of information (such as `meaningful' and `meaningless'

2724: information). In the remainder of this paper, we  show how these more

2725: intricate notions

2726: can be arrived at, typically by {\em constraining\/} the description

2727: methods with which strings are allowed to be encoded

2728: (Section~\ref{sec:algsuf}) and by considering {\em lossy\/} rather

2729: than lossless compression (Section~\ref{sect.rdsf}). Nevertheless, the basic

2730: notions entropy, Kolmogorov complexity and mutual information continue

2731: to play a fundamental r\^ole.

2732: %This leads to Distortion Theory

2733: %\cite{CT91} in the Information Theory setting and

2734: %to meaningful information in the sense of algorithmic sufficient

2735: %statistics in the Kolmogorov complexity setting

2736: %that also has applications to the question-answer game. But that is

2737: %another story that will be told elsewhere because current space has

2738: %run out.

2739: \section{Sufficient Statistic}

2740: \label{sect.sufstat}

2741: In introducing the notion of sufficiency in classical

2742: statistics,  Fisher~\cite{Fi22} stated:

2743: \begin{quote}

2744:        ``The statistic chosen should summarize the whole of the relevant

2745: information supplied by the sample. This may be called

2746:        the Criterion of Sufficiency $\ldots$

2747: In the case of the normal curve

2748: of distribution it is evident that the second moment is a

2749:        sufficient statistic for estimating the standard deviation.''

2750: \end{quote}

2751: A ``sufficient'' statistic of the data

2752: contains all information in the data about the model class.

2753: Below we first discuss the standard notion of (probabilistic)

2754: sufficient statistic as employed in the statistical literature.

2755: We show that this notion has a natural interpretation in terms of

2756: Shannon mutual information, so that we may just as well

2757: think of a probabilistic sufficient

2758: statistic as a concept in Shannon information theory. Just as in the

2759: other sections of this paper, there is a corresponding notion in the

2760: Kolmogorov complexity literature: the algorithmic sufficient statistic

2761: which we introduce in Section~\ref{sec:algsuf}. Finally,

2762: in Section~\ref{sec:relpa} we connect the statistical/Shannon

2763: and the algorithmic notions of sufficiency.

2764: \subsection{Probabilistic Sufficient Statistic}

2765: \label{sec:probstat}

2766: Let $\{ P_\theta \}$ be a family of distributions,

2767: also called a {\em model class}, of a

2768: random variable $X$ that takes values in a finite or countable

2769: {\em set of data} ${\cal X}$.

2770: Let ${\mathbf \Theta}$

2771: be the set of parameters $\theta$ parameterizing the family

2772: $\{ P_\theta \}$. Any function $S: {\cal X} \rightarrow {\cal S}$

2773: taking values in some set  ${\cal S}$ is said to be a {\em statistic}

2774: of the data in ${\cal X}$. A {\em statistic}

2775: $S$ is said to be {\em sufficient} for the family $\{

2776: P_{\theta} \}$ if,

2777: for every $s \in {\cal S}$,

2778: the conditional distribution

2779: \begin{equation}

2780: \label{eq:cond}

2781: P_{\theta}(X= \cdot \mid S(x) = s)

2782: \end{equation}

2783: is invariant under changes of $\theta$. This is the

2784: standard definition

2785: in the statistical literature, see

2786: for example \cite{CoxH74}.

2787: Intuitively, (\ref{eq:cond}) means that all information about $\theta$

2788: in the observation $x$ is present in the (coarser) observation $S(x)$,

2789: in line with Fisher's quote above.

2790:

2791:

2792: The notion of `sufficient statistic'

2793: can be equivalently expressed in terms

2794: of probability mass functions. Let $f_{\theta}(x) = P_{\theta} (X=x)$

2795: denote the

2796: probability mass of $x$ according to $P_{\theta}$.

2797: We identify distributions $P_{\theta}$ with their

2798: mass functions $f_{\theta}$ and denote the model class $\{P_{\theta} \}$ by

2799: $\{f_{\theta}\}$. Let

2800: $f_{\theta}(x | s)$ denote the

2801: probability mass function of the conditional distribution (\ref{eq:cond}), defined as

2802: in Section~\ref{sec:preliminaries}. That is,

2803: % .

2804: %Then,

2805: $$

2806: f_{\theta}(x|s)

2807: =

2808: \begin{cases}

2809: f_{\theta}(x) / \sum_{x \in{\cal X}: S(x) = s} f_{\theta}(x)

2810: & \text{\ if\ } S(x) = s \\

2811: 0 & \text{\ if\ } S(x) \neq s .

2812: \end{cases}

2813: $$

2814: The requirement of $S$ to be sufficient is equivalent to the existence

2815: of a function

2816: $g: {\cal X} \times {\cal S} \rightarrow {\cal R}$ such

2817: that

2818: \begin{equation}

2819: \label{eq:standarddef}

2820: g(x \mid s) = f_{\theta}(x \mid s),

2821: \end{equation}

2822: for every $\theta \in {\mathbf \Theta}$, $s \in {\cal S}$,

2823: $x \in {\cal X}$. (Here we change the common

2824: notation `$g(x,s)$' to  `$g(x \mid s)$' which is more expressive for

2825: our purpose.)

2826:

2827: \begin{example}

2828: \rm

2829: Let ${\cal X} = \{ 0,1\}^{n}$, let $X = (X_1, \ldots, X_n)$. Let

2830: $\{P_{\theta} : \theta \in (0,1 ) \}$ be the set of $n$-fold

2831: Bernoulli

2832: distributions on

2833: ${\cal X}$ with parameter $\theta$. That is,

2834: $$

2835: f_{\theta}(x) = f_{\theta}(x_1 \ldots x_n) = \theta^{S(x)}(1- \theta)^{n-S(x)}$$

2836: where $S(x)$ is the number of $1$'s in $x$. Then $S(x)$ is a

2837: sufficient statistic for $\{ P_{\theta} \}$. Namely, fix an arbitrary

2838: $P_{\theta}$ with $\theta \in (0,1)$ and an arbitrary $s$ with $0 < s<

2839: n$.

2840: Then all $x$'s with $s$ ones and $n-s$ zeroes are equally probable.

2841: The number of such $x$'s is $\binom{n}{s}$. Therefore, the

2842: probability $P_\theta(X=x \mid S(x) = s)$ is equal to

2843: $1/ \binom{n}{s}$, and this does not depend on the parameter

2844: $\theta$.

2845: Equivalently, for all $\theta \in (0,1)$,

2846: \begin{equation}

2847: \label{eq:berndef}

2848: f_{\theta}(x \mid s)

2849: = \begin{cases}  1/ \binom{n}{s} & \text{\ if\ } S(x) = s

2850:   \\

2851: 0 & \text{\ otherwise.}

2852: \end{cases}

2853: \end{equation}

2854: Since (\ref{eq:berndef}) satisfies

2855: (\ref{eq:standarddef}) (with $g(x|s)$ the uniform distribution on all

2856: $x$ with exactly $s$ ones), $S(x)$

2857: is a sufficient statistic relative to the model class $\{P_{\theta}\}$.

2858: In the Bernoulli case,

2859: $g(x|s)$ can be

2860: obtained by starting from the {\em uniform\/}

2861: distribution on ${\cal X}$ ($\theta = \frac{1}{2}$),

2862: and conditioning on $S(x)=s$. But

2863: $g$ is not necessarily uniform. For example, for the

2864: Poisson model class, where $\{f_{\theta}\}$

2865: represents the set of Poisson distributions on $n$ observations,

2866: the observed mean is a sufficient statistic and the corresponding $g$

2867: is far from uniform.

2868: All information

2869: about the parameter $\theta$ in

2870: the observation $x$ is already contained in $S(x)$.

2871: In the Bernoulli case, once we know

2872: the number $S(x)$ of $1$'s in $x$, all further details

2873: of $x$ (such as the order of $0$s and $1$s) are irrelevant

2874: for determination of the  Bernoulli parameter $\theta$.

2875:

2876: To give an example of a

2877: statistic that is not sufficient for the Bernoulli model class, consider

2878: the statistic $T(x)$ which counts the number of 1s in $x$ that are

2879: followed by a $1$. On the other hand, for every statistic $U$, the

2880: combined statistic $V(x) := (S(x),

2881: U(x))$ with $S(x)$ as before, is sufficient, since it contains all

2882: information in $S(x)$. But in contrast to $S(x)$, a

2883: statistic such as $V(x)$ is typically not

2884: {\em minimal}, as explained further below.

2885: \end{example}
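
A small numerical check of both claims in the example, with the values of $n$ chosen here only for illustration. For $n=3$ and $s=1$ the strings with $S(x)=1$ are $100$, $010$ and $001$, each of probability $\theta(1-\theta)^2$, so
\[
 f_{\theta}(100 \mid 1) = \frac{\theta(1-\theta)^2}{3\,\theta(1-\theta)^2}
 = \frac{1}{3} = 1/\binom{3}{1},
\]
independent of $\theta$. For the statistic $T$ take $n=2$: then $T(11)=1$ and $T(00)=T(01)=T(10)=0$, hence
\[
 P_{\theta}(X = 00 \mid T(X) = 0) = \frac{(1-\theta)^2}{1-\theta^2}
 = \frac{1-\theta}{1+\theta},
\]
which does depend on $\theta$, so $T$ is not sufficient.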

2886:

2887: It will be useful to rewrite

2888: (\ref{eq:standarddef}) as

2889: \begin{equation}

2890: \label{eq:indivdef}

2891:  \log 1/f_{\theta}(x \mid s)  =  \log 1/ g(x| s).

2892: \end{equation}

2893: \begin{definition}\label{def.wss}

2894: \rm

2895: A function  $S: {\cal X} \rightarrow {\cal S}$ is

2896:   a {\em probabilistic sufficient statistic} for $\{f_{\theta}\}$ if

2897:   there exists a function $g : {\cal X} \times {\cal

2898:     S} \rightarrow {\cal R}$ such that (\ref{eq:indivdef}) holds for every

2899:   $\theta \in {\mathbf \Theta}$, every $x \in {\cal X}$, every $s \in

2900:   {\cal S}$. (Here we use the convention $\log 1/0 = \infty$.)

2901: \end{definition}

2902: \paragraph{Expectation-version of definition:}

2903: The standard definition of probabilistic

2904: sufficient statistics is ostensibly of the

2905: `individual-sequence'-type: for $S$ to be sufficient,

2906: (\ref{eq:indivdef}) has to hold for

2907: {\em every} $x$, rather than merely in expectation or with high probability.

2908: However,

2909: %because (\ref{eq:indivdef}) has to hold for every $\theta$ as well,

2910: the definition turns out to be equivalent to an expectation-oriented

2911: form, as shown in Proposition~\ref{prop:suff}. We first introduce an a priori distribution

2912: over $\Theta$, the parameter set for our model class $\{ f_\theta\}$. We

2913: denote the probability density of this distribution by $p_1$.

2914: This way we can define a joint distribution

2915: $p(\theta,x) = p_1 (\theta) f_{\theta}(x)$.

2916: \begin{proposition}

2917: \label{prop:suff}

2918: \rm

2919: The following two statements are equivalent to

2920: Definition~\ref{def.wss}: (1)

2921: For every $\theta \in  {\mathbf \Theta}$,

2922: \begin{equation}

2923: \label{eq:expdef}

2924: \sum_x f_\theta(x)  \log 1/f_\theta(x \mid S(x)) =

2925:  \sum_{x} f_\theta(x) \log 1/g(x \mid S(x)) \; .

2926: %{\bf E}_{X \sim f_\theta} [ - \log f_{\theta}(X|S(X))] =

2927: %{\bf E}_{X \sim f_\theta} [- \log g(X| S(X))].

2928: \end{equation}

2929: %the expectation taken over $f_\theta$.

2930: (2) For {\em every} prior $p_1(\theta)$ on ${\mathbf \Theta}$,

2931: \begin{equation}

2932: \label{eq:expdef2}

2933: \sum_{\theta,x} p(\theta,x)  \log 1/f_\theta(x \mid S(x))

2934: = \sum_{\theta,x} p(\theta,x) \log 1/g(x \mid S(x)) \; .

2935: %{\bf E}_{(X,\Theta) \sim p} [ - \log f_{\theta}(X|S(X))] =

2936: %{\bf E}_{(X,\Theta) \sim p} [- \log g(X| S(X))],

2937: \end{equation}

2938: %the expectation taken over the joint distribution $p(\theta,x) =

2939: %p_1(\theta) f_\theta(x)$.

2940: \end{proposition}

2941: %Note that (\ref{eq:expdef}) is shorthand for

2942: %$$

2943: %$$

2944: %while (\ref{eq:expdef2}) is shorthand for

2945: %\begin{equation}

2946: %\end{equation}

2947: \begin{proof}

2948:   {\em Definition~\ref{def.wss} $\Rightarrow$ \eqref{eq:expdef}:}

2949: Suppose (\ref{eq:indivdef}) holds for every $\theta

2950:   \in {\mathbf \Theta}$, every $x \in {\cal X}$, every $s \in {\cal S}$.

2951:   Then it also holds in expectation for every $\theta \in {\mathbf

2952:     \Theta}$:

2953: \begin{equation}

2954: \label{eq:propsuff1}

2955: \sum_{x} f_\theta(x) \log 1/f_{\theta}(x|S(x)) = \sum_{x}

2956: f_\theta(x) \log 1/ g(x| S(x)).

2957: \end{equation}

2958:

2959:   {\em \eqref{eq:expdef}$\Rightarrow$   Definition~\ref{def.wss}:}

2960: Suppose that for every $\theta \in {\mathbf \Theta}$,

2961: (\ref{eq:propsuff1}) holds.

2962: Denote

2963: \begin{equation}\label{eq:overload}

2964: f_{\theta}(s) = \sum_{y \in {\cal X}: S(y)=s} f_{\theta} (y).

2965: \end{equation}

2966: By adding

2967: $\sum_x f_{\theta}(x) \log 1/f_\theta(S(x))$ to both

2968: sides of the equation, (\ref{eq:propsuff1}) can be rewritten as

2969: \begin{equation}

2970: \label{eq:suffent}

2971: \sum_x f_{\theta} (x) \log 1/ f_{\theta}(x) =

2972: \sum_x f_{\theta}(x)  \log1/ g_{\theta}(x),

2973: \end{equation}

2974: with

2975: $g_{\theta}(x) = f_{\theta}(S(x)) \cdot g(x| S(x)).$

2976: By the information inequality \eqref{eq.ii}, the equality

2977: (\ref{eq:suffent}) can

2978: only hold if $g_{\theta}(x) =

2979: f_{\theta}(x)$ for every $x \in {\cal X}$.

2980: Hence, we have established \eqref{eq:indivdef}.

2981:

2982: {\em  \eqref{eq:expdef} $\Leftrightarrow$ \eqref{eq:expdef2}:}

2983: follows by linearity of expectation.

2984: \end{proof}
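
Two steps of the proof are compressed and can be expanded as follows. This is a sketch; it assumes, as the proof does implicitly, that for each $s$ the function $g(\cdot \mid s)$ sums to at most $1$ over $\{x : S(x)=s\}$. First, for every $x$ with $f_{\theta}(x) > 0$ we have $f_{\theta}(x) = f_{\theta}(S(x)) \, f_{\theta}(x \mid S(x))$, so
\[
 \log 1/f_{\theta}(x \mid S(x)) + \log 1/f_{\theta}(S(x)) = \log 1/f_{\theta}(x),
\]
which turns the left-hand side of \eqref{eq:propsuff1} into that of \eqref{eq:suffent}; the right-hand sides are related in the same way by the definition of $g_{\theta}$. Moreover,
\[
 \sum_{x} g_{\theta}(x) = \sum_{s} f_{\theta}(s) \sum_{x : S(x)=s} g(x \mid s) \leq 1,
\]
so the information inequality \eqref{eq.ii} applies to $g_{\theta}$. Second, \eqref{eq:expdef2} is the $p_1$-weighted average of \eqref{eq:expdef} over $\theta$; conversely, choosing for $p_1$ a degenerate prior concentrated on a single $\theta$ recovers \eqref{eq:expdef} from \eqref{eq:expdef2}.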

2985:

2986: \paragraph{Mutual information-version of definition:}

2987: After some rearranging of terms, the characterization

2988: (\ref{eq:expdef2}) gives rise to the intuitively appealing definition

2989: of probabilistic sufficient statistic in terms of mutual information

2990: \eqref{eq.mutinfprob}. The resulting formulation of sufficiency

2991: is as follows \cite{CT91}:

2992:  $S$ is sufficient for $\{ f_{\theta} \}$ iff, for all

2993: priors $p_1$ on ${\mathbf \Theta}$,

2994: \begin{equation}\label{eq.suffstatprob}

2995: I(\Theta ; X) = I( \Theta ; S(X)) .

2996: \end{equation}

2999:

3000: Thus, a statistic $S(x)$ is

3001: sufficient if the probabilistic mutual

3002: information is invariant under taking the statistic \eqref{eq.suffstatprob}.
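
The rearrangement can be sketched with the chain rule for mutual information. The conditional mutual information $I(\Theta ; X \mid S(X))$ used here is the standard notion from \cite{CT91}; it is brought in only for this illustration. Since $S(X)$ is a function of $X$,
\[
 I(\Theta ; X) = I(\Theta ; X, S(X)) = I(\Theta ; S(X)) + I(\Theta ; X \mid S(X)),
\]
so \eqref{eq.suffstatprob} holds iff $I(\Theta ; X \mid S(X)) = 0$, that is, iff $\Theta$ and $X$ are conditionally independent given $S(X)$. Requiring this for all priors is the statement that the conditional distribution \eqref{eq:cond} does not depend on $\theta$.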

3003: \paragraph{Minimal Probabilistic Sufficient Statistic:}

3004: A sufficient statistic may contain information

3005: that is not relevant: for a normal distribution the sample mean

3006: is a sufficient statistic, but the pair of functions

3007: which give the

3008: mean of the even-numbered samples and the odd-numbered samples

3009: respectively, is also a sufficient statistic.

3010: A statistic $S(x)$ is a {\em minimal} sufficient statistic

3011: with respect to an indexed

3012: model class $\{f_{\theta}\}$, if it is a

3013: function of all other sufficient statistics: it contains no

3014: irrelevant information and maximally compresses the information

3015: in the data about

3016: the model class.

3017: For the family of normal distributions

3018: the sample mean is a minimal sufficient statistic, but the

3019: sufficient statistic consisting of the mean of the even samples

3020: in combination with the mean of the odd samples is not minimal.

3021: Note that one cannot improve on sufficiency:

3022: The data processing inequality \eqref{eq.infnonincrprob} states that

3023: $

3024: I(\Theta ; X) \geq I( \Theta ; S(X)),

3025: $

3026: for every function $S$, and that

3027: for randomized functions $S$ an appropriate related expression holds.

3028: That is, mutual information between data random variable and model random

3029: variable

3030: cannot be increased by processing the data sample in any way.

3031: \paragraph{Problem and Lacuna:}

3032: We can think of the probabilistic sufficient statistic as extracting

3033: those patterns in the data that are relevant in determining the

3034: parameters of a statistical model class. But what if we do not want to

3035: commit ourselves to a simple finite-dimensional parametric model

3036: class? In the most general context, we may consider the model class of all computable

3037: distributions, or all computable sets of which the observed data is an

3038: element. Does there exist an analogue

3039: of the sufficient statistic that  automatically summarizes {\em all\/}

3040: information in the sample $x$ that is relevant for determining the

3041: ``best'' (appropriately defined)

3042: model for $x$ within this enormous class of models? Of course,

3043: we may consider the

3044: literal data $x$ as a statistic of $x$, but that would not be

3045: satisfactory: we would still like our generalized statistic, at least

3046: in many cases, to be

3047: considerably coarser, and much more concise, than the data $x$ itself.

3048: It turns

3049: out that, to some extent, this is achieved by

3050: the {\em algorithmic\/} sufficient statistic

3051: of the data: it

3052: summarizes {\em all\/} conceivably relevant information in the

3053: data $x$; at the same time, many types of data $x$ admit an algorithmic

3054: sufficient statistic that is concise in the sense that it has very small

3055: Kolmogorov complexity.

3056: \subsection{Algorithmic Sufficient Statistic}

3057: \label{sec:algsuf}

3058: %While previous authors have used the name ``Kolmogorov sufficient statistic''

3059: %because the model appears to summarize the relevant information in the data

3060: %in analogy of what the classic sufficient statistic

3061: %does in a probabilistic sense, a formal justification has been lacking.

3062: \subsubsection{Meaningful Information}\index{meaningful information}

3063: \label{sect.meaning}

3064: The information contained in an individual

3065: finite object (like a finite binary string) is measured

3066: by its Kolmogorov complexity---the length of the shortest binary program

3067: that computes the object. Such a shortest program contains no redundancy:

3068: every bit is information; but is it meaningful information?

3069: If we flip a fair coin to obtain a finite binary string, then with overwhelming

3070: probability that string constitutes its own shortest program. However,

3071: also with overwhelming probability all the bits in the string are meaningless

3072: information, random noise. On the other hand, let an object

3073: $x$ be a sequence of observations of heavenly bodies. Then $x$

3074: can be described by the binary

3075: string $pd$, where $p$ is the description of

3076: the laws of gravity and the observational

3077: parameter setting, while $d$ accounts for the measurement errors:

3078: we can divide the information in $x$ into

3079: meaningful information $p$ and accidental information $d$.

3080: The main task for statistical inference and learning theory is to

3081: distill the meaningful information present in the data. The question

3082: arises whether it is possible to separate meaningful

3083: information from accidental information, and if so, how.

3084: The essence of the solution to this problem is revealed when we

3085: write Definition~\ref{def.KolmK}

3086: as follows:

3087: %(use the universality of the fixed reference universal prefix Turing machine

3088: %$U=T_u$ with $|u| = O(1)$ to obtain the last equality):

3089:              \begin{equation}\label{eq.kcmdl}

3090: K(x)  =

3091: \min_{p,i} \{K(i)+l(p):T_i(p) =x\}+O(1),

3092: \end{equation}

3093: where the minimum is taken over

3094: $p \in \{0,1\}^*$ and $i \in \{1,2, \ldots\}$.

3095: The justification is that the fixed reference

3096: universal prefix Turing machine $U$ satisfies

3097: $U(\langle i,p \rangle)=T_i(p)$ for all $i$ and $p$. Since $i^*$

3098: denotes the shortest self-delimiting program for $i$, we have

3099: $|i^*|=K(i)$.

3100: The expression \eqref{eq.kcmdl}

3101:  emphasizes the two-part code nature of Kolmogorov complexity.

3102: In a randomly truncated initial segment of a time series

3103: $$x = 10101010101010101010101010,$$

3104: we can encode $x$ by a small Turing machine printing a specified

3105: number of copies of the pattern ``10.''
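
To make the two-part reading concrete, here is a rough count for this string. It is a sketch: the machine index $i$ and the bit counts are illustrative assumptions, not values from the source. The string $x$ consists of $13$ copies of ``10'' and has length $26$. Let $T_i$ be a machine that on input $p$, read as a number, prints $p$ copies of ``10''. Then $T_i(p) = x$ for $p=13$, and \eqref{eq.kcmdl} gives
\[
 K(x) \lea K(i) + l(p),
\]
where $K(i)$ is a small constant independent of the length of $x$, and $l(p)$ is roughly $\log 13 < 4$ bits plus the overhead needed to make $p$ self-delimiting. By contrast, a machine that simply copies its input would need about $26$ bits in the second part.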

3106: %which computes

3107: %$x$ from the program ``13.''

3108: %The minimal-length

3109: %two-part code squeezes out regularity only insofar as

3110: %the reduction in the length of the description of random aspects

3111: %is greater than the increase in the regularity description.

3112: This way, $K(x)$ is viewed  as the shortest length of

3113: a two-part code for $x$, one part describing a Turing machine $T$,

3114: or {\em model}, for the {\em regular} aspects of $x$,

3115: and the second part describing

3116: the {\em irregular} aspects of $x$ in the form

3117: of a program $p$ to be interpreted by $T$.

3118: The regular, or ``valuable,'' information in $x$ is constituted

3119: by the bits in the ``model'' while the random or ``useless''

3120: information of $x$ constitutes the remainder.

3121: This leaves open the crucial question: How to choose

3122: $T$ and $p$ that together describe $x$? In general, many

3123: combinations of $T$ and $p$ are possible, but we want to find

3124: a $T$ that describes the meaningful aspects of $x$.

3125:

3126: \subsubsection{Data and Model}

3127: \index{data}\index{model} We consider only finite binary data strings

3128: $x$.  Our model class consists of Turing machines $T$ that enumerate a

3129: finite set, say $S$, such that on input $p \leq |S|$ we have $T(p)=x$

3130: with $x$ the $p$th element of $T$'s enumeration of $S$, and $T(p)$ is

3131: a special {\em undefined} value if $p>|S|$.  The ``best fitting''

3132: model for $x$ is a Turing machine $T$ that reaches the minimum

3133: description length in (\ref{eq.kcmdl}).  There may be many such $T$,

3134: but, as we will see, if chosen properly, such a machine $T$ embodies

3135: the amount of useful information contained in $x$. Thus, we have

3136: divided a shortest program $x^*$ for $x$ into parts $x^*=T^*(p)$ such

3137: that $T^*$ is a shortest self-delimiting program for $T$.  Now suppose

3138: we consider only low complexity finite-set models, and under these

3139: constraints the shortest two-part description happens to be longer

3140: than the shortest one-part description.  For example, this can happen

3141: if the data is generated by a model that is too complex to be in the

3142: contemplated model class.  Does the model minimizing the two-part

3143: description still capture all (or as much as possible) meaningful

3144: information? Such considerations require study of the relation between

3145: the complexity limit on the contemplated model classes, the shortest

3146: two-part code length, and the amount of meaningful information

3147: captured.

3148:

3149:

3150: In the following we will distinguish between ``models'' that are

3151: finite sets, and the ``shortest programs'' to compute those models

3152: that are finite strings. The latter will be called `algorithmic statistics'.

3153: %Such a shortest program is in the proper

3154: %sense a statistic of the data sample as defined before.

3155: In a way the distinction between ``model'' and ``statistic'' is

3156: artificial, but for now we prefer clarity and unambiguousness in the

3157: discussion.  Moreover, the terminology is customary in the literature

3158: on algorithmic statistics.  Note that strictly speaking, neither an

3159: algorithmic statistic nor the set it defines is a statistic in the

3160: probabilistic sense: the latter was defined as a {\em function\/} on

3161: the set of possible data samples of given length. Both notions are

3162: unified in Section~\ref{sec:relpa}.

3163: \subsubsection{Typical Elements}

3164: \index{typical data}\index{random data}

3165: Consider a string $x$

3166: of length $n$ and prefix complexity $K(x)=k$.

3167: For every finite set $S  \subseteq \{0,1\}^*$ containing

3168: $x$ we have $K(x | S)\le\log|S|+O(1)$.

3169: Indeed, consider the prefix code of $x$

3170: consisting of its $\lceil\log|S|\rceil$-bit index

3171: in the lexicographical ordering of $S$.

3172: This code is called the

3173: \emph{data-to-model code}.

3174: We identify the {\em structure} or {\em regularity} in $x$ that is

3175: to be summarized with a set $S$

3176: of which $x$ is a {\em random} or  {\em typical} member:

3177: given $S$ containing $x$, %(or rather,

3178: %shortest program $S^*$ for $S$),

3179: the element $x$ cannot

3180: be described significantly shorter than by its maximal length index in $S$,

3181: that is, $ K(x \mid S) \geq \log |S| +O(1) $.

3182:

3183: \begin{definition}

3184: \rm

3185: Let $\beta \ge 0$ be an agreed-upon, fixed constant.

3186: A finite binary string $x$

3187: is a {\em typical} or {\em random} element of a set $S$ of finite binary

3188: strings, if $x \in S$ and

3189: \begin{equation}\label{eq.deftyp}

3190:  K(x \mid S) \ge \log |S| - \beta.

3191: \end{equation}

3192: %where $S^*$ is a shortest program for $S$.

3193: We will not indicate the dependence on $\beta$ explicitly, but the

3194: constants in all our inequalities ($O(1)$) will be allowed to be functions

3195: of this $\beta$.

3196: \end{definition}

3197:

3198: This definition requires a finite $S$.

3199: In fact, since

3200: $K(x \mid S) \leq K(x)+O(1) $, it limits the size of $S$ to $O(2^k)$.

3201: %The shortest program $S^*$ from

3202: %which $S$ can be computed is an {\em algorithmic statistic} for $x$ if

3203: %\index{algorithmic statistic}

3204: %\begin{equation}\label{eq.typ}

3205:  %K(x \mid S) \geq \log |S| +O(1).

3206: %\end{equation}

3207: Note that the notion of typicality is not absolute

3208: but depends on fixing the constant implicit in the $O$-notation.
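To see why typical elements are the rule rather than the exception, the following standard counting sketch may help; it is an illustration added here, not part of the formal development. The programs of a prefix machine form a prefix-free set, so by the Kraft inequality fewer than $2^{m}$ of them have length less than $m$. Hence, for a finite set $S$ and the constant $\beta$ of (\ref{eq.deftyp}),
\[
\bigl|\{x \in S : K(x \mid S) < \log |S| - \beta\}\bigr| \; < \; 2^{\log |S| - \beta} \; = \; 2^{-\beta}|S| ,
\]
so more than a fraction $1-2^{-\beta}$ of the elements of $S$ are typical. Conversely, if $x$ is typical for $S$ then
\[
\log|S| - \beta \le K(x \mid S) \le K(x)+O(1) = k+O(1),
\]
which is the size bound $|S| = O(2^k)$ noted above, with a constant depending on $\beta$.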

3209:

3210: \begin{example}\label{xmp.typical}

3211: \rm

3212: Consider the set $S$ of binary strings of length $n$

3213: whose every odd position is 0.

3214: Let $x$ be an element of this set in which the subsequence of bits in

3215: even positions is an incompressible string.

3216: Then $x$ is a typical element of $S$ (or, with some abuse

3217: of language, we can say $S$ is typical for $x$).

3218: But $x$ is also a typical element of the set $\{x\}$.

3219: \end{example}
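The claim in Example~\ref{xmp.typical} can be checked by a short calculation. It is sketched here under two assumptions that the example does not state: positions are numbered $1,\dots,n$, and `incompressible' is read as $K(y \mid n) \ge \lfloor n/2 \rfloor - O(1)$ for the subsequence $y$ of bits in even positions. There are $\lfloor n/2 \rfloor$ even positions, so $|S| = 2^{\lfloor n/2 \rfloor}$ and $\log |S| = \lfloor n/2 \rfloor$. Given $S$, the strings $x$ and $y$ can be computed from each other, and $S$ and $n$ can be computed from each other, in both cases by programs of length $O(1)$. Therefore
\[
K(x \mid S) = K(y \mid n) + O(1) \ge \lfloor n/2 \rfloor - O(1) = \log |S| - O(1),
\]
which is (\ref{eq.deftyp}) for a suitable $\beta$. For the singleton, $\log |\{x\}| = 0 \le K(x \mid \{x\})$, so typicality holds trivially.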

3220:

3221:

3222: \subsubsection{Optimal Sets}

3223: \index{optimal model}

3224: Let $x$ be a binary data string of length $n$.

3225: For every finite set $S \ni x$, we have

3226: $K(x) \leq K(S) + \log |S| + O(1)$,

3227: since we can describe $x$ by giving $S$ and the index of $x$

3228: in a standard enumeration of $S$. Clearly this can be implemented

3229: by a Turing machine computing the finite set $S$ and a program

3230: $p$ giving the index of $x$ in $S$.

3231: The size of a set containing $x$ measures intuitively the number of

3232: properties of $x$ that are represented:

3233: The largest set is $\{0,1\}^{n}$ and represents only one property

3234: of $x$, namely, being of length $n$. It clearly ``underfits''

3235: as explanation or model for $x$. The smallest set containing $x$

3236: is the singleton set $\{x\}$ and represents all conceivable properties

3237: of $x$. It clearly ``overfits'' as explanation or model for $x$.

3238:

3239: There are two natural measures of suitability of such a set as

3240: a model for $x$.

3241: We might prefer either the simplest set, or the smallest set, as

3242: corresponding to the most likely structure `explaining' $x$.

3243: Both the largest set $\{0,1\}^n$ (having low complexity of about $K(n)$)

3244: and the  singleton set $\{x\}$ (having high complexity of about $K(x)$),

3245: while certainly statistics for $x$,

3246: would indeed be considered poor explanations.

3247: \index{two-stage description}

3248: We would like to balance simplicity of model versus size of model.

3249: Both measures relate to the optimality of a two-stage description of

3250: $x$ using a finite set $S$ that contains it. Elaborating on

3251: the two-part code:

3252: \begin{align}\label{eq.twostage}

3253:  K(x) \leq K(x,S)  & \leq  K(S) + K(x \mid S) +O(1)

3254: \\ & \leq K(S) + \log |S| +O(1),

3255: \nonumber

3256: \end{align}

3257: where only the final substitution of $K(x \mid S)$ by $\log |S|+O(1)$

3258: uses the fact that $x$ is an element of $S$.

3259: The closer the right-hand side of \eqref{eq.twostage} gets

3260: to the left-hand side, the better the description of $x$ is in terms

3261: of the set $S$.

3262: This implies a trade-off between meaningful model information, $K(S)$,

3263: and meaningless ``noise'' $\log |S|$.

3264: A set $S$ (containing $x$)

3265:  for which \eqref{eq.twostage} holds with equality

3266: \begin{equation}\label{eq.optim}

3267: K(x) = K(S) + \log |S| +O(1),

3268: \end{equation}

3269: is called {\em optimal}.

3270: A data string $x$ can be typical for a set $S$ without that set $S$

3271: being optimal for $x$. This is the case precisely when $x$ is

3272: typical for $S$ (that is, $K(x \mid S)=\log |S| +O(1)$)

3273: while $K(x,S)>K(x)$.
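The gap in \eqref{eq.twostage} can be split into two parts, which makes the relation between typicality and optimality explicit. The following is a sketch that uses only the inequalities above. Identically,
\[
K(S)+\log|S| - K(x) = \bigl[K(S)+K(x \mid S)-K(x)\bigr] + \bigl[\log|S| - K(x\mid S)\bigr],
\]
and by \eqref{eq.twostage} both bracketed terms are bounded below by $-O(1)$. Hence $S$ is optimal exactly when both terms are $O(1)$. The second term being $O(1)$ says that $x$ is typical for $S$. The first says that describing $S$ and then $x$ given $S$ wastes no more than $O(1)$ bits compared with a shortest description of $x$. For instance, for $S=\{0,1\}^n$ we have $K(S)=K(n)+O(1)$ and $\log|S|=n$, so this set is optimal for those $x$ of length $n$ whose complexity is close to the maximum, that is, $K(x) \ge n+K(n)-O(1)$.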

3274:

3275: %\begin{example}

3276: %\rm

3277: %Combining \eqref{eq.twostage} and

3278: %\eqref{eq.optim}, we see that if $S$ is an optimal set for $x$

3279: %then $K(x,S)=K(x)+O(1)$ which implies that $K(S\mid x)=O(1)$.

3280: %Going from $x$ to $S$ requires but an $O(1)$ length program,

3281: %which implies that there are ony $O(1)$ optimal sets for $x$,

3282: %however large $x$ may be.

3283: %\end{example}

3284:

3285: \subsubsection{Sufficient Statistic}

3286: \label{sect.ss}

3287:  Intuitively, a model expresses the essence

3288: of the data if the two-part code describing the data consisting of the

3289: model and the data-to-model code is as concise as

3290: the best one-part description.

3291:

3292: Mindful of our distinction between a finite set $S$ and a

3293: program that describes $S$ in a required representation format,

3294: we call a shortest program for an optimal set with respect to $x$

3295: an {\em algorithmic sufficient statistic} for $x$.

3296: Furthermore, among optimal sets,

3297: there is a direct trade-off between complexity and log-size, which together

3298: sum to $ K(x)+O(1)$.

3299:

3300:

3301: \begin{example}\label{xmp.optimal}

3302: \rm

3303: It can be shown that the set $S$ of Example~\ref{xmp.typical} is also

3304: optimal, and so is $\{x\}$.

3305: Sets for which $x$ is typical form a much wider class than optimal

3306: sets for $x$: the set

3307: $\{x,y\}$ is still typical for

3308: $x$ but with most $y$, it will be too complex to be optimal for $x$.

3309:

3310: For a perhaps less artificial example, consider complexities conditional

3311: on the length $n$ of strings.

3312: Let $y$ be a random string of length $n$, let

3313: $S_{y}$ be the set of strings of length $n$ which have 0's wherever

3314: $y$ has 0's, and let $x$ be a random element of $S_{y}$.

3315: Then $x$ has about 25\%

3316: 1's,

3317: so its complexity is much less than $n$.

3318: The set $S_{y}$ has $x$ as a typical element,

3319: but is too complex to be optimal,

3320: since its  complexity (even conditional on $n$) is still $n$.

3321: \end{example}
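A rough calculation makes the second part of Example~\ref{xmp.optimal} concrete. It is a leading-order sketch only: lower-order terms are ignored, $S_y$ is read as the set of strings of length $n$ that have 0's wherever $y$ has 0's, and $H(p) = -p\log p-(1-p)\log(1-p)$ denotes the binary entropy function. A random $y$ has about $n/2$ ones, so $\log |S_y| \approx n/2$ and $K(S_y \mid n) \approx K(y \mid n) \approx n$. A random element $x$ of $S_y$ has about $n/4$ ones, giving
\[
K(x \mid n) \approx n H(1/4) \approx 0.81\, n, \qquad
K(S_y \mid n) + \log |S_y| \approx n + n/2 = 1.5\, n .
\]
The two-part description via $S_y$ is therefore about $0.69\,n$ bits longer than the shortest description of $x$, so $S_y$ is far from optimal, even though $K(x \mid S_y) \approx n/2 \approx \log |S_y|$ makes $x$ typical for it.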

3322:

3323: %Optimal sets (or rather, descriptions of them) are statistics.

3324: %Equality (\ref{eq.optim}) expresses the conditions on the algorithmic

3325: %individual relation between the data and the sufficient statistic.

3326: %Later we

3327: %demonstrate that this relation implies that the probabilistic

3328: %optimality of mutual information  holds

3329: %for the algorithmic version in the expected sense.

3330:

3331: An algorithmic sufficient statistic

3332: \index{sufficient statistic, algorithmic}

3333: \index{sufficient statistic, algorithmic minimal}

3334: \index{sufficient statistic, probabilistic}

3335: \index{sufficient statistic, probabilistic minimal}

3336: is a sharper individual notion than a probabilistic sufficient

3337: statistic. An optimal set $S$ associated with $x$ (the shortest

3338: program computing $S$ is the corresponding

3339: sufficient statistic associated with $x$) is chosen such that

3340: $x$ is maximally random with respect to it. That is, the

3341: information in $x$ is divided in a relevant structure expressed

3342: by the set $S$, and the remaining randomness with respect

3343: to that structure, expressed by $x$'s index in $S$ of $\log |S|$

3344: bits. The shortest program for $S$ is itself alone an algorithmic

3345: definition of structure, without a probabilistic interpretation.

3346:

3347:

3348: %One can also consider notions of

3349: %{\em near}-typical and {\em near}-optimal that arise from replacing

3350: %the $\beta$  in (\ref{eq.deftyp})

3351: %by some slowly growing functions, such as $O(\log l(x))$ or

3352: %$O(\log k)$.

3353: %In~\cite{CT91}, only

3354: Those optimal sets that admit the shortest possible program

3355: are called {\em algorithmic minimal sufficient statistics\/} of

3356: $x$. They will play a major role in the next section on the Kolmogorov

3357: structure function. Summarizing:

3358: %$with the shortest program

3359: %(or rather that shortest program) is  the

3360: \begin{definition}[Algorithmic sufficient statistic, algorithmic

3361:   minimal sufficient statistic]

3362: \label{def:algsufstat}

3363: An {\em algorithmic sufficient statistic\/} of $x$ is a shortest program for

3364: a set $S$ containing $x$ that is optimal, i.e. it satisfies (\ref{eq.optim}).

3365: An algorithmic sufficient statistic with optimal set $S$ is {\em

3366:   minimal\/} if there exists no optimal set $S'$ with $K(S') < K(S)$.

3367: \end{definition}

3368: \begin{example}

3369: \rm

3370: Let $k$ be a number in the range $0,1,\dots,n$

3371: of complexity $\log n+ O(1)$ given $n$ and let $x$ be a string of length

3372: $n$ having $k$ ones of complexity $K(x \mid n,k) \geq \log {n \choose k}$

3373: given $n,k$. This $x$ can be viewed as a typical result of

3374: tossing a coin with a bias about $p=k/n$.

3375: A two-part description

3376: of $x$ is given by

3377: the number $k$ of 1's in $x$ first, followed by the index

3378: $j \leq |S|$ (of at most $\lceil \log |S| \rceil$ bits) of $x$

3379: in the set $S$ of strings of length $n$ with $k$ 1's.

3380: This set is optimal, since

3381: $K(x \mid n)=K(x,k \mid n)=K(k \mid n)+K(x \mid k,n)= K(S)+ \log|S|$, where all equalities hold up to an additive constant and $K(S)$ is taken conditional on $n$.

3382:

3383: Note that $S$ encodes the number of $1$s in $x$. The shortest program

3384: for $S$  is an

3385: algorithmic minimal sufficient statistic for {\em most\/} $x$ of

3386: length $n$ with $k$ $1$'s, since only a fraction of at most $2^{-m}$

3387: $x$'s of length $n$ with $k$ $1$s can have $K(x) < \log | S| - m$

3388: (Section~\ref{sec:kolmogorov}). But of course there exist $x$'s with

3389: $k$ ones which have much more regularity. An example is the string

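3390: starting with $k$ $1$'s followed by $n-k$ $0$'s. For such strings, $S$ is

3391: no longer optimal: $K(x \mid n) \leq K(k \mid n)+O(1)$, whereas $K(S \mid n)+\log|S|$ exceeds this by about $\log {n \choose k}$. The shortest program for $S$ is therefore not an algorithmic

3392: sufficient statistic for them.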

3393: \end{example}
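As a rough numerical illustration of the two-part code in this example, consider the following sketch. It uses the standard bound $\log {n \choose k} \le n H(k/n)$, with $H(p) = -p\log p-(1-p)\log(1-p)$ the binary entropy function, and the specific numbers are chosen here for illustration only. The first part costs $K(k \mid n) \le \log (n+1) + O(1)$ bits and the second part $\log {n \choose k}$ bits, so
\[
K(x \mid n) \le \log (n+1) + n H(k/n) + O(1).
\]
For $n = 1000$ and $k=250$ this is roughly $10 + 811$ bits, compared with $1000$ bits for the literal encoding of $x$.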

3394:

3395:

3396: \commentout{

3397:

3398: \subsection{Expected Algorithmic Sufficient Statistic is Probabilistic Sufficient Statistic}

3399: \label{sect.formanal}

3400: Algorithmic sufficient statistic, a function of the data,

3401: is so named because intuitively

3402: it expresses an individual summarizing of the relevant information

3403: in the individual data, reminiscent of

3404: the probabilistic sufficient statistic that summarizes the

3405: relevant information in a data random variable about a model

3406: random variable. Formally, however, previous authors have

3407: not established any relation. Other algorithmic notions

3408: have been successfully related to their probabilistic

3409: counterparts. The most significant one is that for every computable

3410: probability distribution, the expected prefix complexity of the

3411: objects equals the entropy of the distribution up to an additive

3412: constant term, related to the complexity of the distribution in

3413: question. We have used this property in (\ref{eq.eqamipmi})

3414: to establish a similar relation between the expected

3415: algorithmic mutual information and the probabilistic mutual information.

3416: We use this in turn to show that

3417: there is a close relation between the algorithmic version and

3418: the probabilistic version of sufficient

3419: statistic: A probabilistic sufficient statistic is

3420: with high probability a natural conditional form

3421: of algorithmic sufficient statistic

3422: for individual data, and, conversely, that with

3423: high probability a natural conditional

3424: form of algorithmic sufficient statistic is  also a probabilistic

3425: sufficient statistic.

3426:

3427: Recall the terminology of probabilistic mutual information

3428: (\ref{eq.mutinfprob})

3429: and probabilistic sufficient statistic (\ref{eq.suffstatprob}).

3430: Consider a probabilistic ensemble of models,

3431: a family of computable probability mass functions $\{f_{\theta} \}$

3432: indexed by a discrete parameter $\theta$, together with a computable

3433: distribution $f_1$ over $\theta$.

3434: (The finite set model case is the  restriction where

3435: the $f_{\theta}$'s are restricted to uniform distributions

3436: with finite supports.)

3437: This way we have a random variable $\Theta$ with outcomes in $\{f_{\theta} \}$

3438: and a random variable $X$ with outcomes

3439: in the union of domains of $f_{\theta}$, and

3440: $f(\theta,x) = f_1 (\theta) f_{\theta}(x)$ is computable.

3441:

3442: \begin{notation}

3443: \rm

3444: To compare the algorithmic sufficient statistic

3445: with the probabilistic sufficient statistic it is

3446: convenient to denote the sufficient statistic

3447:  as a function $S(\cdot)$ of the data in both cases.

3448: Let a statistic

3449: $S(x)$ of data $x$ be the more general form of probability distribution

3450: as in Remark~\ref{s.prob}. That is, $S$ maps the data $x$ to the

3451: parameter $\rho$ that determines

3452: a probability mass function $f_{\rho}$ (possibly not an element

3453: of $\{f_{\theta} \}$). Note that ``$f_{\rho} (\cdot)$'' corresponds

3454: to ``$P(\cdot)$''

3455: in Remark~\ref{s.prob}.

3456: If $f_{\rho}$ is computable, then this can be the

3457: Turing machine $T_{\rho}$ that computes

3458: $f_{\rho}$.

3459: Hence, in the current section,

3460: ``$S(x)$'' denotes a probability distribution, say $f_{\rho}$,

3461: and ``$f_{\rho}(x)$'' is the probability $f_{\rho}$ concentrates on data $x$.

3462: \end{notation}

3463: \begin{remark}

3464: \rm

3465: In the probabilistic statistics setting,

3466: every function $T(x)$ is a statistic of $x$, but only some

3467: of them are sufficient statistics. In the algorithmic statistic

3468: setting we have a quite similar situation. In the finite set statistic

3469: case $S(x)$ is a finite set, and in the computable probability

3470: mass function case $S(x)$ is a computable probability mass function.

3471: In both algorithmic cases we have shown $K(S(x) \mid x^*) \eqa 0$

3472: when $S(x)$ is an implicitly or explicitly described sufficient statistic.

3473: This means that the number of such sufficient statistics for $x$

3474: is bounded by a universal constant, and that there is a universal program

3475: to compute all of them from $x^*$---and hence to compute

3476: the minimal sufficient statistic from $x^*$.

3477: \end{remark}

3478: \begin{lemma}\label{theo.eqpral}

3479: Let $f(\theta,x) = f_1 (\theta) f_{\theta} (x)$ be a computable joint

3480: probability mass function, and let

3481: $S$ be a function. Then the three conditions below are

3482: equivalent:

3483:

3484: (i) $S$ is a probabilistic sufficient statistic

3485: (in the form $I(\Theta; X) \eqa I(\Theta ; S(X))$).

3486:

3487: (ii) $S$ satisfies

3488: \begin{equation}\label{eq.eqami}

3489: \sum_{\theta,x} f(\theta,x) I(\theta:x)

3490: \eqa

3491: \sum_{\theta,x} f(\theta,x) I(\theta: S(x))

3492: \end{equation}

3493:

3494: (iii) $S$ satisfies

3495: \begin{align*}

3496: I(\Theta ; X) \eqa I(\Theta ; S(X)) & \eqa

3497: \sum_{\theta,x} f(\theta,x) I(\theta:x)

3498: \\& \eqa

3499: \sum_{\theta,x} f(\theta,x) I(\theta: S(x)).

3500: \end{align*}

3501:

3502: All $\eqa$ signs hold up to a $\pm 2K(f)$ constant additive term.

3503:

3504: \end{lemma}

3505:

3506: \begin{proof}

3507: Clearly, (iii) implies (i) and (ii).

3508:

3509: We show that both (i) implies (iii) and (ii) implies (iii):

3510: By (\ref{eq.eqamipmi}) we have

3511: \begin{align}\label{eq.asseq}

3512: I(\Theta ; X) & \eqa \sum_{\theta,x} f(\theta,x) I(\theta:x),

3513: \\ I(\Theta ; S(X)) & \eqa \sum_{\theta,x} f(\theta,x) I(\theta: S(x)),

3514: \nonumber

3515: \end{align}

3516: where  we absorb a $\pm 2K(f)$ additive term in the $\eqa$ sign.

3517: Together with (\ref{eq.eqami}),

3518: (\ref{eq.asseq}) implies

3519: \begin{equation}\label{eq.eqpmi}

3520:  I(\Theta ; X) \eqa I(\Theta ; S(X)) ;

3521: \end{equation}

3522: and {\em vice versa} (\ref{eq.eqpmi}) together with (\ref{eq.asseq})

3523: implies (\ref{eq.eqami}).

3524:

3525: \end{proof}

3526:

3527: \begin{remark}

3528: \rm

3529: It may be worth stressing that $S$ in Lemma~\ref{theo.eqpral} can

3530: be any function, without restriction.

3531: \end{remark}

3532:

3533: \begin{remark}

3534: \rm

3535: Note that (\ref{eq.eqpmi}) involves equality $\eqa$

3536: rather than precise equality as in the

3537: definition of the probabilistic sufficient

3538: statistic (\ref{eq.suffstatprob}).

3539: \end{remark}

3540:

3541: \begin{definition}\label{def.thetaI}

3542: \rm

3543: Assume the terminology and notation above.

3544: A statistic $S$ for data $x$

3545: is {\em $\theta$-sufficient with deficiency $\delta$}

3546: if

3547: $I(\theta : x) \eqa I(\theta : S(x)) + \delta$.

3548: If $\delta \eqa 0$ then $S(x)$ is simply a {\em $\theta$-sufficient

3549: statistic}.

3550: \end{definition}

3551:

3552: The following lemma shows that $\theta$-sufficiency is a type

3553: of conditional sufficiency:

3554:

3555: \begin{lemma}\label{claim.1}

3556: Let $S(x)$ be a sufficient statistic for $x$. Then,

3557: \begin{equation}\label{eq.theta}

3558:  K(x \mid \theta^*) + \delta \eqa  K(S(x) \mid \theta^* ) - \log S(x).

3559: \end{equation}

3560: iff $I(\theta : x) \eqa I(\theta : S(x)) + \delta$.

3561: \end{lemma}

3562:

3563: \begin{proof}

3564: (If) By assumption,

3565:  $K(S(x)) - K(S(x) \mid  \theta^*) + \delta \eqa K(x) - K(x \mid \theta^*)$.

3566: %that is, $I(\theta: S(x)) + \delta \eqa I(\theta:x)$.

3567: %Since $S$ is a sufficient statistic for $x$, the term

3568: Rearrange and add

3569: $-K(x \mid S(x)^*)- \log S(x) \eqa 0$ (by typicality)

3570:  to the right-hand side to obtain

3571: $K(x \mid \theta^*) +K(S(x)) \eqa K(S(x) \mid \theta^*) + K(x)

3572: - K(x \mid S(x)^*) - \log S(x) - \delta$.

3573: Substitute according to $K(x) \eqa K(S(x))+K(x \mid S(x)^*)$

3574: (by sufficiency) in the

3575: right-hand side, and subsequently subtract

3576: $K(S(x))$ from both sides, to obtain

3577: (\ref{eq.theta}).

3578:

3579: (Only If) Reverse the proof of the (If) case.

3580:

3581: \end{proof}

3582:

3583: The following theorems state that $S(X)$ is a probabilistic sufficient

3584: statistic iff $S(x)$ is an algorithmic $\theta$-sufficient statistic,

3585: up to small deficiency, with high probability.

3586:

3587:

3588: \begin{theorem}

3589: Let $f(\theta,x) = f_1 (\theta) f_{\theta} (x)$ be a computable joint

3590: probability mass function, and let

3591: $S$ be a function.

3592: If $S$ is

3593: a recursive probabilistic sufficient statistic, then

3594: $S$ is

3595: a $\theta$-sufficient statistic with deficiency $O(k)$,

3596: with $f$-probability at least $1 - \frac{1}{k}$.

3597: \end{theorem}

3598:

3599: \begin{proof}

3600: If $S$ is a probabilistic sufficient statistic,

3601: then, by Lemma~\ref{theo.eqpral}, equality of $f$-expectations (\ref{eq.eqami})

3602: holds. However, it is still consistent with this to have

3603: large positive and negative differences

3604: $I(\theta: x) -I(\theta:S(x))$

3605: for different $(\theta,x)$ arguments, such that these

3606: differences cancel each other.

3607: This problem is resolved by appeal to

3608: the algorithmic mutual information non-increase

3609: law (\ref{eq.nonincrease}) which shows that all differences are

3610: essentially positive:

3611: $I(\theta : x) - I(\theta : S(x)) \gea -K(S)$.

3612: Altogether, let $c_1,c_2$ be least positive constants such that

3613: $I(\theta : x) - I(\theta : S(x))+c_1$ is always nonnegative

3614: and its $f$-expectation is $c_2$.

3615: Then, by Markov's inequality,

3616: \[

3617: f ( I( \theta : x) - I(\theta : S(x)) \geq kc_2 - c_1 ) \leq \frac{1}{k},

3618: \]

3619: that is,

3620: \[ f ( I( \theta : x) - I(\theta : S(x)) < kc_2 - c_1 )

3621: > 1 - \frac{1}{k}.

3622: \]

3623: \end{proof}

3624:

3625: \begin{theorem}

3626: For each $n$, consider the set of data $x$ of length $n$.

3627: Let $f(\theta,x) = f_1 (\theta) f_{\theta} (x)$ be a computable joint

3628: probability mass function, and let

3629: $S$ be a function.

3630: If $S$ is an algorithmic $\theta$-sufficient statistic for

3631: $x$, with $f$-probability

3632: at least $1-\epsilon$ ($1/\epsilon \eqa n + 2 \log n$), then

3633: $S$ is a probabilistic sufficient statistic.

3634: \end{theorem}

3635:

3636: \begin{proof}

3637: By assumption, using Definition~\ref{def.thetaI},

3638: there is a positive constant $c_1$, such that,

3639: \[

3640: f ( | I(\theta : x) - I(\theta : S(x))| \leq c_1) \geq 1- \epsilon.

3641: \]

3642: Therefore,

3643: \begin{align*}

3644: 0 \leq \sum_{| I(\theta : x )  - I(\theta : S(x))| \leq  c_1 } f(\theta ,x)

3645: & |I(\theta : x )  - I(\theta : S(x))|

3646: \\ &  \lea  (1-\epsilon)c_1 \eqa  0.

3647: \end{align*}

3648: On the other hand, since

3649:  \[

3650: 1/\epsilon \gea n + 2 \log n \gea K(x) \gea  \max_{\theta , x} I(\theta : x),

3651: \]

3652: we obtain

3653: \begin{align*}

3654: 0 \leq \sum_{| I(\theta : x )  - I(\theta : S(x))| >  c_1 } f(\theta ,x)

3655: & |I(\theta : x )  - I(\theta : S(x))|

3656: \\ &  \lea  \epsilon (n+2 \log n) \lea  0.

3657: \end{align*}

3658: Altogether, this implies (\ref{eq.eqami}), and by

3659: Lemma~\ref{theo.eqpral}, the theorem.

3660: \end{proof}

3661: }

3662:

3663: \subsection{Relating Probabilistic and Algorithmic Sufficiency}

3664: \label{sec:relpa}

3665: We want to relate `algorithmic sufficient statistics' (defined

3666: independently of any model class $\{f_\theta\}$) to probabilistic sufficient

3667: statistics (defined relative to some model class

3668: $\{f_\theta\}$ as in Section~\ref{sec:probstat}). We will show that,

3669: essentially, algorithmic sufficient statistics are probabilistic

3670: nearly-sufficient statistics with respect to {\em all\/} model families $\{

3671: f_{\theta} \}$. Since the notion of

3672: algorithmic sufficiency is only defined to within additive constants,

3673: we cannot expect algorithmic sufficient statistics to satisfy

3674: the requirements (\ref{eq:indivdef}) or (\ref{eq:expdef}) for probabilistic sufficiency {\em

3675:   exactly}, but only `nearly\footnote{We use `nearly' rather than

3676:   `almost' since `almost' suggests things like `almost

3677:   everywhere/almost surely/with probability 1'. Instead, `nearly' means, roughly speaking,  `to within $O(1)$'.}'.

3678: \paragraph{Nearly Sufficient Statistics:}

3679: Intuitively, we may consider a probabilistic statistic $S$ to be

3680: nearly sufficient if (\ref{eq:indivdef}) or (\ref{eq:expdef}) holds to

3681: within some constant. For long sequences $x$, this constant will then

3682: be negligible compared to the two terms in

3683: (\ref{eq:indivdef}) or (\ref{eq:expdef}) which, for most practically

3684: interesting statistical model classes, typically grow linearly in the

3685: sequence length. But now we encounter a difficulty:

3686: \begin{quote}

3687: whereas

3688: (\ref{eq:indivdef}) and (\ref{eq:expdef}) are equivalent if they are

3689: required to hold exactly, they express something substantially

3690: different if they are only required to hold within a constant.

3691: \end{quote}

3692: Because of our observation

3693: above, when relating probabilistic and algorithmic statistics

3694: we have to be very careful about what

3695: happens if $n$ is allowed to change. Thus, we need to extend probabilistic and algorithmic statistics to strings of arbitrary length. This

3696: leads to

3697: the following generalized definition of a statistic:

3698: \begin{definition}

3699: \rm

3700: \label{def:seqstat}

3701: A {\em sequential statistic\/} is a function $S: \{0,1\}^* \rightarrow

3702: 2^{\{0,1 \}^*}$, such that for all $n$, all $x \in \{0,1\}^n$,

3703: (1) $S(x) \subseteq \{0,1\}^n$, and (2) $x \in S(x)$, and (3) for all $n$, the set

3704: $$

3705: \{ s \; | \; \text{There exists $x \in \{0,1\}^n$ with

3706: $S(x) = s $} \ \}

3707: $$

3708: is a partition of $\{0,1\}^n$.

3709: \end{definition}
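A simple instance of Definition~\ref{def:seqstat}, given here only as an illustration, is the statistic that groups strings by their number of $1$'s: for $x \in \{0,1\}^n$ containing $k$ ones, let
\[
S(x) = \{ x' \in \{0,1\}^n : x' \text{ contains exactly } k \text{ ones} \}.
\]
Conditions (1) and (2) hold by construction. For each $n$, the range of $S$ on $\{0,1\}^n$ consists of the $n+1$ disjoint classes for $k=0,\dots,n$, of sizes ${n \choose k}$ summing to $2^n$, so condition (3) holds as well.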

3710: Algorithmic statistics are defined relative to

3711: individual $x$ of some length $n$.

3712: Probabilistic statistics are defined as functions, hence for all $x$

3713: of given length, but still relative to given length $n$. Such

3714: algorithmic and probabilistic statistics can be

3715: extended to  each $n$ and each $x \in \{0,1\}^n$ in a

3716: variety of ways; the three conditions in Definition~\ref{def:seqstat}

3717: ensure that the extension is done in a reasonable way.

3718: %

3719: %As usual, let $X$ be a random variable having outcomes, the data, in

3720: %${\cal X} = \{0,1\}^n$ with probability $P_{\theta} (X=x) = f_{\theta}

3721: %(x)$ out of a model (family of probability mass functions)

3722: %$\{f_\theta\}$. A function $S: {\cal X} \rightarrow {\cal S}$, for

3723: %some range ${\cal S}$ to be specified later, is statistic of the data.

3724: %\begin{definition}

3725: Now let $\{ f_{\theta} \}$ be a model class of sequential

3726: information sources (Section~\ref{sec:preliminaries}), i.e. a

3727: statistical model class defined for sequences of arbitrary length rather

3728: than just fixed $n$.  As before, $f^{(n)}_{\theta}$ denotes the

3729: marginal distribution of $f_\theta$ on $\{0,1\}^n$.

3730: \begin{definition}

3731: \label{def:nearsuff}

3732: \rm

3733: We call a sequential statistic $S$

3734: {\em nearly-sufficient for

3735:   $\{f_\theta\}$ in the probabilistic-individual sense} if

3736: there exist functions $g^{(1)}, g^{(2)}, \ldots$ and a constant $c$   such that

3737: for all $\theta$,

3738: all $n$,  every $x \in \{0,1\}^n$,

3739: \begin{equation}

3740: \label{eq:nindivdef}

3741: \biggl|  \log 1/f^{(n)}_{\theta}(x \mid S(x))

3742: - \log 1/ g^{(n)}(x

3743: | S(x))

3744: \biggr|

3745: \leq c.

3746: \end{equation}

3747: We say $S$ is {\em nearly-sufficient  for

3748:  $\{f_\theta\}$ in the probabilistic-expectation sense\/} if

3749: there exist functions $g^{(1)}, g^{(2)}, \ldots$ and a constant $c'$ such that

3750: for all $\theta$, all $n$,

3751: \begin{equation}

3752: \label{eq:nexpdef}

3753: \biggl| \sum_{x \in \{0,1\}^n}

3754: f^{(n)}_{\theta}(x) \bigl[ \log 1/ f^{(n)}_{\theta}(x \mid

3755: S(x)) - \log 1/ g^{(n)}(x| S(x))

3756: \bigr]\;

3757: \biggr|

3758: \leq c'.

3759: \end{equation}

3760: \end{definition}

3761: Inequality \eqref{eq:nindivdef} may

3762: be read as `(\ref{eq:indivdef}) holds within a constant', whereas

3763: (\ref{eq:nexpdef}) may be read as `(\ref{eq:expdef}) holds within a

3764: constant'.

3765: %Note that we have to include $K(f_{\theta})$ and $K(p_1)$

3766: %in the definition - if, for each $n$, $f_{\theta}$ puts all its mass

3767: %on a sequence $x$ of length $n$ with $K(x) \approx n$, so that

3768: %$K(f_{\theta}) \approx n$, we cannot expect the algorithmic sufficient

3769: %statistic to be a probabilistic sufficient statistic.

3770:

3771: \begin{remark}

3772: \rm

3773: Whereas the

3774: individual-sequence definition (\ref{eq:indivdef}) and the

3775: expectation-definition (\ref{eq:expdef}) are equivalent if we

3776: require exact equality, they become quite different if we allow

3777: equality to within a constant as in Definition~\ref{def:nearsuff}.  To

3778: see this, let $S$ be some sequential

3779: statistic such that for all large $n$, for some $\theta_1,

3780: \theta_2$, for some $x \in \{0,1\}^n$,

3781: $$f^{(n)}_{\theta_1}(x \mid S(x)) \gg

3782: f^{(n)}_{\theta_2}(x \mid S(x)),

3783: $$

3784: while for all $x' \neq x$ of length $n$,

3785: $f^{(n)}_{\theta_1}(x' \mid S(x')) \approx f^{(n)}_{\theta_2}(x' \mid S(x'))$. If

3786: $x$ has very small but nonzero probability according to some $\theta

3787: \in \Theta$, then with very small $f_{\theta}$-probability, the

3788: difference between the left-hand and right-hand side of

3789: (\ref{eq:indivdef}) is very large, and with large

3790: $f_{\theta}$-probability, the difference between the left-hand and

3791: right-hand side of (\ref{eq:indivdef}) is about $0$.  Then $S$ will be

3792: nearly sufficient in expectation, but not in the individual sense.

3793: \end{remark}

3794: In the theorem below we focus on probabilistic statistics that are

3795: `nearly sufficient in an expected sense'. We connect these to

3796: algorithmic sequential statistics, defined as follows:

3797: \begin{definition}

3798: \rm

3799: %Let $\{f_\theta\}$ be a model of sequential information sources.

3800: %  A sequential statistic $S$ is {\em sufficient in the probabilistic

3801: %    sense relative to model $\{f_\theta \}$\/} if for all $n$, $S$

3802: %  restricted to $\{0,1\}^n$ is a sufficient statistic for $\{ f^{(n)}

3803: %  \}$.

3804: %

3805:   A sequential statistic $S$ is {\em

3806:     sufficient in the algorithmic sense\/}

3807: if there is a constant $c$

3808:   such that for all $n$, all $x \in \{0,1\}^n$, the program generating

3809:   $S(x)$ is an algorithmic sufficient statistic for $x$ (relative

3810:   to constant $c$), i.e.

3811: \begin{equation}

3812: \label{eq:seqsuf}

3813: K(S(x)) + \log |S(x)|  \leq K(x) + c.

3814: \end{equation}

3815: \end{definition}

3816: In Theorem~\ref{thm:wiske} we relate

3817: algorithmic to probabilistic sufficiency.

3818: %The meaning of the theorem

3819: %is made clear in Corollary~\ref{cor:wiske} further below.

3820: In the theorem, $S$

3821: represents a sequential statistic, $\{f_{\theta}\}$ is a model class of

3822: sequential information sources and $g^{(n)}$ is the conditional

3823: probability mass function arising from the uniform distribution:

3824: $$

3825: g^{(n)}(x |s) = \begin{cases}

3826: 1/|\{ x' \in { \{0,1\}^{n}} :

3827: S(x') = s \} |  &

3828: \text{\ if $S(x) = s$\ } \\

3829: 0 & \text{\ otherwise.}

3830: \end{cases}

3831: $$
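To illustrate the definition of $g^{(n)}$, consider the following sketch. It assumes, purely for illustration, that $S(x)$ is the set of all sequences of length $n$ with the same number of ones as $x$, and that $\{f_\theta\}$ is the Bernoulli model with $f^{(n)}_\theta(x) = \theta^{k}(1-\theta)^{n-k}$ for $x$ containing $k$ ones. Then $|S(x)| = \binom{n}{k}$ and
$$
g^{(n)}(x \mid S(x)) = \binom{n}{k}^{-1}, \qquad
f^{(n)}_\theta(x \mid S(x)) = \frac{\theta^k (1-\theta)^{n-k}}{\binom{n}{k} \theta^k (1-\theta)^{n-k}} = \binom{n}{k}^{-1},
$$
so that in this special case the conditional distribution given the statistic coincides with the uniform one for every $\theta \in (0,1)$.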

3832: %Note that in the statement of the theorem, $K(S)$ is the Kolmogorov

3833: %complexity of the {\em function\/} $S$, whereas

3834: %(in (\ref{eq:thmalg})), $K(S(x))$ is the

3835: %Kolmogorov complexity of the {\em set\/} $S(x)$. It is certainly

3836: %possible that $K(S)$ is finite yet $K(S(x))$ increases with $n$. This

3837: %can be the case, if, for example, $S$ counts the number of $1$s in

3838: %$x$. Then $K(S(x))$ can be as large as $O(\log n)$.

3839: \begin{theorem}[algorithmic sufficient statistic is probabilistic

3840:   sufficient statistic]

3841: \label{thm:wiske}

3842: \rm

3843: Let $S$ be a sequential statistic that is sufficient in the

3844: algorithmic sense. Then for every $\theta$ with $K(f_\theta) < \infty$,

3845: there exists a constant $c$, such that for all $n$, inequality

3846: \eqref{eq:nexpdef} holds with $g^{(n)}$ the uniform distribution.

3847: Thus, if $\sup_{\theta \in

3848:   {\mathbf \Theta}} K(f_\theta) < \infty$, then $S$ is a nearly-sufficient

3849: statistic for $\{ f_\theta \}$ in the probabilistic-expectation sense,

3850: with $g$ equal to the uniform distribution.

3851: %\begin{equation}

3852: %\label{eq:thmalg}

3853: %{\bf E}_{\theta} \bigl[ K(S(Y_1, \ldots, Y_n) \mid n) + \log

3854: %|S(Y_1, \ldots, Y_n)| \bigr] \leq {\bf E}_\theta

3855: %\bigl[ K(Y_1, \ldots, Y_n \mid n) \bigr]

3856: %+ C

3857: %\end{equation}

3858: %\begin{equation}

3859: %\label{eq:thmprob}

3860: %\biggl| \; \sum_{x \in \{0,1\}^n}

3861: %f^{(n)}_{\theta}(x) \bigl[ \log 1/ f^{(n)}_{\theta}(x \mid

3862: %S(x)) - \log 1/ g^{(n)}(x| S(x))

3863: %\bigr]\;

3864: %\biggr|

3865: %\leq c'.

3866: %\end{equation}

3867: \end{theorem}

3868: %\paragraph{Remark} It is straightforward to see that

3869: %\begin{enumerate}

3870: %\item Every probabilistic nearly-sufficient statistic in the

3871: %  individual sense is a probabilistic nearly-sufficient statistic in

3872: %  the expected sense, but not vice versa.

3873: %\item Every algorithmic sufficient statistic

3874: %is an algorithmic statistic sufficient `in

3875: %expectation' relative to {\em every\/} conceivable

3876: %model $\{ f_\theta \}$, but not vice versa.

3877: %\end{enumerate}

3878: \noindent

3879: \begin{proof}

3880: The definition of algorithmic sufficiency (\ref{eq:seqsuf}) directly

3881: implies that

3882: there exists a

3883: constant $c$ such that for all $\theta$, all $n$,

3884: \begin{equation}

3885: \label{eq:thmalg}

3886:  \sum_{x \in \{0,1\}^n }

3887: f^{(n)}_{\theta}(x) \bigl[ K(S(x)) + \log

3888: |S(x)| \bigr] \leq  \sum_{x \in \{0,1\}^n } f^{(n)}_{\theta}(x)

3889: K(x)

3890: + c.

3891: \end{equation}

3892: Now fix any $\theta$ with $K(f_\theta) < \infty$. It follows (by the

3893: same reasoning as in Theorem~\ref{theo.eq.entropy}) that

3894: for some $c_{\theta} \approx K(f_\theta)$,

3895: for all $n$,

3896: \begin{equation}

3897: \label{eq:dochter}

3898: 0 \leq \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x) K(x) -

3899: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)

3900:  \log 1/ f_{\theta}(x) \leq c_{\theta}.

3901: \end{equation}

3902: Essentially, the left inequality follows by the information inequality

3903: (\ref{eq.ii}): no code can be more

3904: efficient in expectation under $f_\theta$ than the Shannon-Fano code

3905: with lengths $\log 1 /f_\theta(x)$; the right inequality follows

3906: because, since $K(f_\theta) < \infty$,

3907: the Shannon-Fano code can be implemented by a

3908: computer program of fixed size, independent of $n$.

3909: By (\ref{eq:dochter}),

3910: (\ref{eq:thmalg}) becomes: for all $n$,

3911: \begin{equation}

3912: \label{eq:thmalgb}

3913: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta}(x) \log 1/ f_{\theta}(x) \leq

3914: \sum_{x \in \{0,1\}^n }

3915: f^{(n)}_{\theta}(x) \bigl[ K(S(x)) + \log

3916: |S(x)| \bigr] \leq  \sum_x f^{(n)}_{\theta}(x) \log 1/ f_{\theta}(x)

3917: + c_\theta.

3918: \end{equation}

3919: For $s \subseteq \{0,1\}^n$, we use the notation

3920: $f^{(n)}_\theta(s)$ according to \eqref{eq:overload}.

3921: Note that,

3922: by requirement (3) in the definition of sequential statistic,

3923: $$\sum_{s: \exists x \in \{0,1\}^n : S(x) = s} f^{(n)}_\theta(s) =

3924: 1,

3925: $$

3926: whence $f^{(n)}_\theta(s)$ is a probability mass function on

3927: ${\cal S}$, the set of values the statistic $S$ can take on sequences

3928: of length $n$. Thus, we get, once again by the information inequality (\ref{eq.ii}),

3929: \begin{equation}

3930: \label{eq:zoon}

3931: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x) K(S(x)) \geq

3932: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)

3933: \log 1/ f^{(n)}_{\theta}(S(x)).

3934: \end{equation}

3935: Now note that for all $n$,

3936: \begin{equation}

3937: \label{eq:extra}

3938: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)

3939: \bigl[ \log 1/ f^{(n)}_{\theta}(S(x)) +

3940: %\sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)

3941: \log 1/ f^{(n)}_{\theta}(x \mid S(x))

3942: \bigr]

3943: = \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta}(x) \log 1/ f_{\theta}(x).

3944: \end{equation}
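The identity \eqref{eq:extra} is the chain rule applied term by term. As a sketch, assuming that \eqref{eq:overload} defines $f^{(n)}_\theta(s)$ as the total probability of all $x'$ of length $n$ with $S(x') = s$: since $S(x)$ is determined by $x$,
$$
f^{(n)}_\theta(x) = f^{(n)}_\theta(S(x)) \cdot f^{(n)}_\theta(x \mid S(x)),
\qquad \mbox{hence} \qquad
\log \frac{1}{f^{(n)}_\theta(x)} = \log \frac{1}{f^{(n)}_\theta(S(x))} + \log \frac{1}{f^{(n)}_\theta(x \mid S(x))},
$$
and taking the expectation under $f^{(n)}_\theta$ on both sides gives \eqref{eq:extra}.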

3945: Consider the two-part code which encodes $x$ by first encoding $S(x)$

3946: using $\log 1 /f^{(n)}_\theta(S(x))$ bits, and then encoding $x$ using

3947: $\log |S(x) |$ bits. By the

3948: information inequality (\ref{eq.ii}), this code cannot be more efficient in expectation

3949: than the Shannon-Fano code with lengths $\log 1/ f_{\theta}(x)$, so

3950: that it follows from (\ref{eq:extra}) that, for all $n$,

3951: \begin{equation}

3952: \label{eq:kind}

3953: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x) \log | S(x) | \geq

3954: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)

3955: \log 1/ f^{(n)}_{\theta}(x \mid S(x)).

3956: \end{equation}

3957: Now defining

3958: \begin{eqnarray}

3959: u & = & \sum_{x \in \{0,1\}^n }

3960: f^{(n)}_{\theta}(x) K(S(x))   \nonumber \\

3961: v & = &  \sum_{x \in \{0,1\}^n }

3962: f^{(n)}_{\theta}(x) \log |S(x)| \nonumber \\

3963: u' & = & \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)

3964: \log 1/ f^{(n)}_{\theta}(S(x))

3965:  \nonumber \\

3966: v' & = & \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)

3967: \log 1/ f^{(n)}_{\theta}(x \mid S(x)) \nonumber \\

3968: w & = & \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta}(x) \log 1/

3969: f_{\theta}(x),

3970: \nonumber

3971: \end{eqnarray}

3972: we

3973: find that (\ref{eq:thmalgb}), (\ref{eq:extra}),

3974: (\ref{eq:zoon}) and

3975: (\ref{eq:kind}) express, respectively,

3976: that $u + v \eqa w$, $u' + v' = w$, $u \geq u'$,

3977: $v \geq v'$. It follows that $v \eqa v'$, so that

3978: (\ref{eq:kind}) must actually hold with equality up to a

3979: constant. That is, there exists a $c'$ such that for all $n$,

3980: \begin{equation}

3981: \label{eq:kindb}

3982: \bigl|

3983: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x) \log | S(x) | -

3984: \sum_{x \in \{0,1\}^n} f^{(n)}_{\theta} (x)

3985: \log 1/ f^{(n)}_{\theta}(x \mid S(x)) \bigr| \leq c'.

3986: \end{equation}

3987: The result now follows upon noting that (\ref{eq:kindb}) is just

3988: (\ref{eq:nexpdef}) with $g^{(n)}$  the uniform distribution.

3989: \end{proof}
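The last step of the proof can be made explicit. Using the abbreviations $u, v, u', v', w$ from the proof and the constant $c_\theta$ of \eqref{eq:thmalgb}, a short calculation (a sketch that merely chains the four relations) gives
$$
0 \leq v - v' = (u + v) - (u' + v') - (u - u') \leq (u+v) - w \leq c_\theta,
$$
where the first inequality is \eqref{eq:kind}, the second uses $u' + v' = w$ together with $u \geq u'$, and the third is the right-hand inequality in \eqref{eq:thmalgb}. Hence $c' = c_\theta$ suffices in \eqref{eq:kindb}.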

3990: \section{Rate Distortion and Structure Function}

3991: \label{sect.rdsf}

3992: We continue the discussion about meaningful information of

3993: Section~\ref{sect.meaning}. This time we a priori restrict the number

3994: of bits allowed for conveying the essence of the information. In the

3995: probabilistic situation this takes the form of allowing only a

3996: ``rate'' of $R$ bits to communicate as well as possible, on average,

3997: the outcome of a random variable $X$, while the set ${\cal X}$ of

3998: outcomes has cardinality possibly exceeding $2^R$.  Clearly, not all

3999: outcomes can be communicated without information loss, the average of

4000: which is expressed by the ``distortion''.  This leads to the so-called

4001: ``rate--distortion'' theory.  In the algorithmic setting the

4002: corresponding idea is to consider a set of models from which to choose

4003: a single model that expresses the ``meaning'' of the given individual

4004: data $x$ as well as possible. If we allow only $R$ bits to express the

4005: model, while possibly the Kolmogorov complexity $K(x) > R$, we suffer

4006: information loss---a situation that arises for example with ``lossy''

4007: compression.  In the latter situation, the data cannot be perfectly

4008: reconstructed from the model, and the question arises to what extent the

4009: model can capture the meaning present in the specific data $x$.  This

4010: leads to the so-called ``structure function'' theory.

4011:

4012: The limit of $R$ bits to express

4013: a model to capture the most meaningful information

4014: in the data is an individual version of the average notion

4015: of ``rate''. The remaining less meaningful information in the data

4016: is the individual version of the average-case notion of ``distortion''.

4017: If the $R$ bits are sufficient to express all meaning in the data

4018: then the resulting model is called a ``sufficient statistic'',

4019: in the sense introduced above. The remaining information in the data

4020: is then purely accidental, random, noise.

4021: For example, a sequence of

4022: outcomes of $n$ tosses of a coin with computable bias $p$,  typically

4023: has a sufficient statistic

4024: of $K(p)$ bits, while the remaining random information is typically

4025: at least about $pn-K(p)$ bits (up to an $O(\sqrt{n})$ additive term).

4026: %One connection with Rate-Distortion theory with $R < K(p)$ to set

4027: %the distortion at $K(p)-R$, that is, the number of meaningful bits

4028: %that cannot be included. The meaningless bits are now discounted

4029: %altogether.

4030: \subsection{Rate Distortion}

4031: \label{sec:ratedistortion}

4032: %\label{sec:basics}

4033: Initially, Shannon \cite{Sh48} introduced rate-distortion

4034: as follows: ``Practically, we are not interested in exact transmission

4035: when we have a continuous source, but only in transmission to

4036: within a given tolerance. The question is, can we assign a definite

4037: rate to a continuous source when we require only a certain fidelity

4038: of recovery, measured in a suitable way.'' Later, in \cite{Sh59}

4039: he applied this idea to lossy data compression

4040: of discrete memoryless sources---our topic below.

4041: As before, we consider a situation in which sender $A$ wants to

4042: communicate the outcome of random variable $X$ to receiver $B$.

4043: Let $X$ take values in some set ${\cal X}$, and the

4044: distribution $P$ of $X$ be known to both $A$ and $B$. The change

4045: is that now $A$ is only

4046: allowed to use a finite number, say $R$ bits, to communicate, so that

4047: $A$ can only send $2^R$ different messages. Let us denote by ${Y}$ the

4048: encoding function used by $A$. This ${Y}$ maps ${\cal X}$

4049: onto some set ${\cal Y}$.

4050: %Let us denote by $\ddot{Y}$ the range of ${\cal Y}$.

4051: %We require that $\ddot{\cal Y}$ satisfies $|\ddot{\cal Y}| \leq 2^R$.

4052: We require that  $|{\cal Y}| \leq 2^R$.

4053: %$D$ has to map ${\cal Y}$ back to ${\cal X}$.

4054: If $|{\cal X}| > 2^R$ or if ${\cal X}$ is continuous-valued,

4055: then necessarily some information

4056: is lost during the communication. There is no decoding function

4057: $D: {\cal Y} \rightarrow {\cal X}$ such that

4058: $D({Y}(x)) = x$ for all $x$. Thus, $A$ and $B$ cannot ensure that

4059: $x$ can always be reconstructed. As the next best thing, they may

4060: agree on a code such that for all $x$, the value ${Y}(x)$ contains as much

4061: useful information about $x$ as is possible---what exactly `useful'

4062: means depends on the situation at hand; examples are provided below.

4063: An easy example would be that ${Y}(x)$ is a finite list of elements,

4064: one of which is $x$.

4065: We assume that the `goodness'

4066: of ${Y}(x)$ is gauged by a {\em distortion function\/} $d:

4067: {\cal X} \times {\cal Y} \rightarrow [0, \infty]$. This distortion

4068: function may be any nonnegative function

4069: that is appropriate to the situation at hand.

4070: In the example above it could be the logarithm of the number

4071: of elements in the list ${Y}(x)$.

4072: Examples of some common distortion functions are the

4073: Hamming distance and the squared Euclidean distance.

4074: We can view $Y$ as

4075: a random variable on the space ${\cal Y}$,

4076: a coarse version of the random variable $X$, defined as taking

4077: value $Y=y$ if $X=x$ with $Y(x)=y$.

4078: Write $f(x) = P(X=x)$ and $g(y)=\sum_{x: Y(x)=y} P(X= x)$.

4079: Once the distortion function

4080: $d$ is fixed, we define the {\em expected \/} distortion by

4081: \begin{align}

4082: \label{eq:distortion}

4083: {\bf E} [ d(X,{Y}) ] & = \sum_{x \in {\cal X}} f(x) d(x,{Y}(x)) \\

4084: \nonumber

4085: & = \sum_{y \in {\cal Y}} g(y) \sum_{x: Y(x)=y} (f(x)/g(y)) \, d(x,y) .

4086: \end{align}

4087: If $X$ is a continuous random variable, the sum should be

4088: replaced by an integral.
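As a small worked instance of \eqref{eq:distortion} (the numbers are chosen purely for illustration), let ${\cal X} = \{1,2,3,4\}$ with $f(x) = 1/4$, let $R = 1$, and take the squared difference $d(x,y) = (x-y)^2$. With ${\cal Y} = \{3/2, 7/2\}$ and $Y(1)=Y(2)=3/2$, $Y(3)=Y(4)=7/2$, we get $g(3/2) = g(7/2) = 1/2$ and $d(x, Y(x)) = 1/4$ for every $x$, so both lines of \eqref{eq:distortion} give
$$
{\bf E}[d(X,Y)] = 4 \cdot \frac{1}{4} \cdot \frac{1}{4}
= 2 \cdot \frac{1}{2} \left( \frac{1}{2} \cdot \frac{1}{4} + \frac{1}{2} \cdot \frac{1}{4} \right)
= \frac{1}{4}.
$$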

4089: %This motivates the abbreviation

4090: %``${\bf E} [ d(X,{Y}) ]$'' for  ``${\bf E} [ d(X,{Y}(X)) ]$''.

4091: \begin{example}

4092: \label{ex:classicrd}

4093: \rm

4094: In most standard applications of rate distortion theory,

4095: the goal is to compress $x$ in a `lossy' way,

4096: such that $x$ can be reconstructed `as well as possible' from

4097: ${Y}(x)$. In that case, ${\cal Y} \subseteq {\cal X}$ and

4098: %$d$ is a really a function $d: {\cal X} \times {\cal X}

4099: %\rightarrow {\mathbb R}$.

4100: writing $\hat{x} = Y(x)$,

4101: the value $d(x,\hat{x})$ measures the dissimilarity

4102: between $x$ and $\hat{x}$.  For example, if ${\cal X}$

4103: is the set of real numbers and ${\cal Y}$ is the set

4104: of integers, the squared difference

4105: $d(x,\hat{x}) = (x- \hat{x})^2$ is a

4106: viable distortion function.

4107: We may interpret $\hat{x}$ as an

4108: estimate of $x$, and ${\cal Y}$ as the set of values it can take. The

4109: reason we use the notation ${Y}$ rather than $\hat{X}$ (as in,

4110: for example, \cite{CT91}) is that further below,

4111: we mostly concentrate on slightly non-standard

4112: applications where ${\cal Y}$ should {\em not\/}

4113: be interpreted as a subset of ${\cal X}$.

4114: \end{example}

4115: We want to determine the optimal code $Y$ for communication between $A$ and

4116: $B$ under the constraint that there are no more than $2^R$ messages.

4117: That is, we look for the encoding function ${Y}$ that

4118: minimizes the expected distortion, under the constraint that

4119: $|{\cal Y}| \leq 2^R$.

4120: %We call the function mapping $R$ to the

4121: %minimimum expected distortion that can be achieved with $R$ bits the

4122: %{\em distortion-rate function\/} $D(R)$. Formally,

4123: %\begin{equation}

4124: %\label{eq:rd}

4125: %D(R) = \inf_{\hat{X}: {\cal X} \rightarrow \hat{\cal X} \; ; \;

4126: %|\hat{\cal X}| \leq 2^R }

4127: %{\bf E} [ d(X,\hat{X})].

4128: %\end{equation}

4129: Usually, the minimum

4130: achievable expected distortion

4131: is nonincreasing as a function of increasing  $R$.

4132: \begin{example}

4133: \label{ex:gauss}

4134:   \rm Suppose $X$ is a real-valued, normally (Gaussian) distributed

4135:   random variable with mean ${\bf E}[X] = 0$ and variance ${\bf E} [ (X

4136:   - {\bf E} [X])^2 ] = \sigma^2$. Let us use the squared Euclidean

4137:   distance $d(x,y) = (x - y)^2$ as a distortion measure.

4138:   If $A$ is allowed to use $R$ bits, then ${\cal Y}$ can have no

4139:   more than $2^R$ elements, in contrast to ${\cal X}$ that is

4140:   uncountably infinite. We should choose ${\cal Y}$ and the

4141:   function ${Y}$ such that (\ref{eq:distortion}) is minimized.

4142:   Suppose first $R=1$. Then the optimal ${Y}$ turns out to be

4143: $$

4144: {Y}(x) =

4145: \begin{cases} \sqrt{\frac{2}{\pi}} \sigma & \mbox{\ if $x \geq 0 $} \\

4146:   - \sqrt{\frac{2}{\pi}} \sigma & \mbox{\ if $x < 0 $}.

4147: \end{cases}

4148: $$

4149: Thus, the domain ${\cal X}$ is partitioned into two regions, one

4150: corresponding to $x \geq 0$, and one to $x < 0$. By the symmetry of

4151: the Gaussian distribution around $0$, it should be clear that this is

4152: the best one can do. Within each of the two regions, one picks a

4153: `representative point' so as to minimize (\ref{eq:distortion}).

4154: This mapping allows $B$ to estimate $x$ as well as possible.

4155:

4156: Similarly, if $R=2$, then ${\cal X}$ should be partitioned into 4

4157: regions, each of which is to be represented by a single point such

4158: that (\ref{eq:distortion}) is minimized. An extreme case is $R= 0$:

4159: how can $B$ estimate $X$ if it is always given the same information?

4160: This means that ${Y}(x)$ must take the same value for

4161: all $x$.  The expected distortion (\ref{eq:distortion}) is then

4162: minimized if $Y(x) \equiv 0$, the mean of $X$, giving

4163: distortion equal to $\sigma^2$.

4164: \end{example}
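The representative point for $R=1$ in Example~\ref{ex:gauss} can be checked by a direct calculation; the following is a sketch for the squared-error distortion only. For a fixed region the expected squared error is minimized by the conditional mean, and for $X$ normal with mean $0$ and variance $\sigma^2$,
$$
{\bf E}[X \mid X \geq 0] = 2 \int_0^\infty x \, \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-x^2/(2\sigma^2)} \, dx = \sqrt{\frac{2}{\pi}} \, \sigma .
$$
By symmetry the same holds with opposite sign for $x < 0$, and the resulting expected distortion is the conditional variance,
$$
{\bf E}[X^2 \mid X \geq 0] - \left( {\bf E}[X \mid X \geq 0] \right)^2 = \sigma^2 - \frac{2}{\pi} \sigma^2 = \frac{\pi - 2}{\pi} \sigma^2 \approx 0.36 \, \sigma^2,
$$
compared to $\sigma^2$ for $R = 0$.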

4165: In general, there is no need for the space of estimates ${\cal Y}$

4166: to be a subset of ${\cal X}$. We may, for example, also lossily encode

4167: or `estimate' the actual value of $x$ by specifying a set in which $x$

4168: must lie (Section~\ref{sec:meaningful})

4169: or a probability distribution (see below) on ${\cal X}$.

4170: % Thus, we assume

4171: %there is some set ${\cal Y}$ of possible messages. Sender maps outcome

4172: %$x \in {\cal X}$ to some $h \in {\cal H}$ based on an `encoding'

4173: %function ${H}: {\cal X} \rightarrow {\cal H}$. Receiver than

4174: %incurs some `loss' given by a distortion function $d: {\cal X} \times

4175: %{\cal H} \rightarrow {\mathbb R}$. The generalized definition of

4176: %distortion-rate function now becomes

4177: %\begin{equation}

4178: %\label{eq:rdb}

4179: %D(R) := \inf_{{H}: {\cal X} \rightarrow {\cal H} \; ; \;

4180: %|{\cal H}| \leq 2^R }

4181: %{\bf E} [ d(X,{H})].

4182: %\end{equation}

4183: %In Example~\ref{ex:gauss}, we had ${\cal H} = {\cal X}$. In

4184: %Example~\ref{ex:reconcile}, ${\cal H}$ will be the set of

4185: %distributions on ${\cal X}$.

4186: \begin{example}

4187: \label{ex:reconcile}

4188: \rm Suppose receiver $B$ wants to estimate the actual $x$ by a probability

4189: distribution $P$ on ${\cal X}$. Thus, if $R$ bits are allowed to be

4190: used, one of $2^{R}$ different distributions on ${\cal X}$ can be sent to

4191: receiver. The best that can be done is to partition ${\cal

4192:   X}$ into $2^R$ subsets ${\cal A}_1, \ldots, {\cal A}_{2^R}$.

4193: Relative to any such partition, we introduce a new random variable ${Y}$

4194: and  abbreviate the event $x \in {\cal A}_y$ to ${Y}=y$.

4195: Sender observes that ${Y} = y$ for some $y \in

4196: {\cal Y} = \{1, \ldots, 2^R\}$

4197: and passes this

4198: information on to receiver. The information $y$ actually means that $X$

4199: is now distributed according to the conditional distribution $P(X=x

4200: \mid x \in {\cal A}_y) = P(X= x \mid {Y} = y)$.

4201: %Thus, ${\cal Y} = \{1, \ldots, 2^{R} \}$.

4202:

4203: It is now natural to measure the quality of the transmitted  distribution

4204: $P(X=x \mid Y = y)$ by its conditional

4205: entropy,  i.e. the expected

4206: additional number of bits that sender has to transmit before

4207: receiver knows the value of $x$ with certainty. This can be achieved

4208: by taking

4209: \begin{equation}

4210: d(x,y) =

4211:  \log 1/ P(X=x \mid {Y}= y),

4212: \end{equation}

4213: which we abbreviate to $d(x,y) =  \log 1/ f(x|y)$.

4214: In words,

4215: the distortion function is the Shannon-Fano code length for the

4216: communicated distribution. The expected distortion then

4217: becomes equal to the conditional entropy $H(X \mid Y)$ as defined in

4218: Section~\ref{sec:probmutual}. To see this, rewrite the expectation using

4219: \eqref{eq:distortion}, writing $f(x|y)=f(x)/g(y)$ for

4220: $P(X=x \mid Y(x)=y)$, with $g(y)$ as defined earlier,

4221: and apply the definition of conditional probability:

4222: \begin{align}

4223: \label{eq:entdist}

4224: {\bf E}

4225: [d(X,{Y})]

4226: %= \sum_{x \in {\cal X}}

4227: %p(x) [- \log p(x|{Y}(x))] = \sum_{x \in {\cal X}}

4228: %p(x) [- \log P(X=x \mid X \in {\cal A}_y)]

4229: & = \sum_{y \in {\cal Y}} g(y) \sum_{x: Y(x)=y} (f(x)/g(y)) d(x,y)\\

4230: \nonumber

4231: &=  \sum_{y \in {\cal Y}} g(y) \sum_{x: Y(x)=y} f(x|y) \log 1/f(x|y)\\

4232: \nonumber

4233: &=  H(X|{Y}).

4234: \end{align}

4235: How is this related to lossless compression? Suppose

4236: for example that $R= 1$. Then

4237: the optimal distortion is achieved by

4238: partitioning ${\cal X}$ into two sets ${\cal A}_1, {\cal A}_2$ in the

4239: most `informative' possible way, so that the conditional entropy

4240: $$H(X|Y) = \sum_{y=1,2} P(Y=y) H(X| Y=y)$$

4241: is minimized.  If

4242: $Y$ itself is encoded with the Shannon-Fano code, then $H(Y)$ bits are

4243: needed to communicate $Y$. Rewriting

4244: $H(X|Y)= \sum_{y \in {\cal Y}} P(Y=y) H(X|Y=y)$ and

4245: $H(X|Y=y)= \sum_{x:Y(x)=y} f(x|y) \log 1/f(x|y)$ with $f(x|y)=P(X=x)/P(Y=y)$

4246: and rearranging, shows  that for all such partitions of ${\cal

4247:   X}$ into $|{\cal Y}|$  subsets defined by $Y:{\cal X} \rightarrow {\cal Y}$

4248:  we have

4249: \begin{equation}

4250: \label{eq:snavel}

4251: H(X|Y) + H(Y) = H(X).

4252: \end{equation}

4253: The minimum

4254: distortion is obtained by choosing the function

4255: $Y$ that minimizes $H(X|Y)$.

4256: By (\ref{eq:snavel}) this is also the $Y$ maximizing $H(Y)$. Thus, the

4257: average total number of bits we need to send our message in this way

4258: is still equal to $H(X)$---the more we save in the second part, the

4259: more we pay in the first part.

4260: \end{example}
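The identity \eqref{eq:snavel} can be verified directly. Since $Y$ is a function of $X$, we have $f(x|y) = f(x)/g(y)$ whenever $Y(x) = y$, so that, as a sketch of the rearrangement,
$$
H(X|Y) = \sum_{y \in {\cal Y}} \sum_{x: Y(x)=y} f(x) \left[ \log \frac{1}{f(x)} - \log \frac{1}{g(y)} \right] = H(X) - H(Y).
$$
For a numerical illustration (with values chosen only for this purpose and logarithms to base $2$), let $X$ be uniform on $\{1,2,3,4\}$, so $H(X) = 2$ bits, and let $R=1$. The partition $\{1,2\},\{3,4\}$ gives $H(Y) = 1$ and $H(X|Y) = 1$, whereas the partition $\{1\},\{2,3,4\}$ gives $H(Y) = \frac{1}{4} \log 4 + \frac{3}{4} \log \frac{4}{3} \approx 0.81$ and $H(X|Y) = \frac{3}{4} \log 3 \approx 1.19$. In both cases the sum is $2$, and the balanced partition, which maximizes $H(Y)$, has the smaller distortion.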

4261: %The minimum achieveable distortion $D(r)$ for $R=r$

4262: %is given by

4263: %\begin{equation}

4264: %\label{eq:drbasic}

4265: %D(R) = \min_{Y :

4266: %{\cal X} \rightarrow {\cal Y}, | {\cal Y} | < 2^R} H(X|Y).

4267: %\end{equation}

4268: \paragraph{Rate Distortion and

4269:   Mutual Information:}

4270: Already in his 1948 paper, Shannon established a

4271: deep relation between mutual information and minimum achievable

4272: distortion for (essentially) {\em arbitrary\/} distortion functions.

4273: The relation is summarized in Theorem~\ref{thm:rd} below. To prepare

4274: for the theorem, we need to slightly extend our setting by considering

4275: {\em independent repetitions of the same scenario}. This can be

4276: motivated in various ways such as (a) it often corresponds to the

4277: situation we are trying to model; (b) it allows us to consider non-integer

4278: rates $R$, and (c) it greatly simplifies the mathematical analysis.

4279:

4280: \begin{definition}

4281: \rm

4282: Let ${\cal X}, {\cal Y}$ be two sample spaces.

4283: The  distortion of $y \in {\cal Y}$ with respect

4284: to $x \in {\cal X}$ is defined

4285: by a nonnegative real-valued function $d(x,y)$ as above.

4286: %The idea is that the

4287: %distortion measures the loss of information in an encoding $y$

4288: %of the original $x$ with respect to $x$.

4289: We extend the definition to sequences:

4290: the distortion of $(y_1, \ldots , y_n)$

4291: with respect to $(x_1, \ldots , x_n)$ is

4292: \begin{equation}

4293: \label{eq:avdist}

4294: d((x_1,\ldots,x_n),(y_1, \ldots, y_n)) := \frac{1}{n}

4295: \sum_{i=1}^n d(x_i,y_i).

4296: \end{equation}

4297: \end{definition}
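For instance (an illustration with the Hamming distortion, $d(x,y) = 0$ if $x = y$ and $d(x,y)=1$ otherwise, on ${\cal X} = {\cal Y} = \{0,1\}$), \eqref{eq:avdist} gives
$$
d((0,1,1,0),(0,1,0,0)) = \frac{1}{4}(0 + 0 + 1 + 0) = \frac{1}{4},
$$
the fraction of positions in which the two sequences differ.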

4298: Let $X_1, \ldots, X_n$ be $n$ independent identically

4299: distributed random variables on outcome space ${\cal X}$.

4300: Let ${\cal Y}$ be a set of code words.

4301: We want to find a sequence of functions $Y_1, \ldots , Y_n:{\cal X}

4302: \rightarrow {\cal Y}$ so that the message $(Y_1(x_1), \ldots,

4303: Y_n (x_n)) \in {\cal Y}^n$ gives as much expected

4304: information about the sequence of outcomes $(X_1=x_1,

4305: \ldots, X_n=x_n)$ as is possible, under the constraint that the message

4306: takes at most $R \cdot n$ bits (so that $R$ bits are allowed on

4307: average per outcome of $X_i$).

4308: Instead of $Y_1, \ldots , Y_n$ above write

4309: $Z_n: {\cal X}^n \rightarrow  {\cal Y}^n$.

4310: The {\em expected distortion} ${\bf E}[d(X^n,Z_n)]$ for $Z_n$ is

4311: \begin{equation}

4312:  {\bf E}[d(X^n,Z_n)] = \sum_{(x_1, \ldots , x_n) \in {\cal X}^n}

4313: P(X^n = (x_1, \ldots , x_n)) \cdot \frac{1}{n}

4314: \sum_{i=1}^n d(x_i,Y_i(x_i)).

4315: \end{equation}

4316: Consider functions $Z_n$

4317: with range ${\cal Z}_n \subseteq {\cal Y}^n$

4318: satisfying $|{\cal Z}_n| \leq 2^{nR}$.

4319: For $n \geq 1$ random variables, let

4320: a choice $Y_1, \ldots , Y_n$

4321: minimize the expected distortion

4322: under these constraints, and let the corresponding value $D^*_n (R)$ of

4323: the expected distortion be defined by

4324: \begin{equation}

4325: D^*_n (R) = \min_{Z_n: |{\cal Z}_n| \leq 2^{nR}} {\bf E}[d(X^n,Z_n)] .

4326: \end{equation}

4327:

4328: \begin{lemma}

4329: For every distortion measure, and all  $R,n,m \geq 1$,

4330: $(n+m)D^*_{n+m} (R) \leq n D^*_{n}(R)+mD^*_m(R)$.

4331: \end{lemma}

4332: \begin{proof}

4333: Let $Y_1, \ldots , Y_{n}$ achieve $D^*_{n}(R)$

4334: and $Y'_1, \ldots , Y'_{m}$ achieve $D^*_{m}(R)$.

4335: Then, $Y_1, \ldots , Y_{n},Y'_1, \ldots , Y'_m$

4336: achieves $(nD^*_n(R) + mD^*_m(R))/(n+m)$. This is an upper

4337: bound on the minimal possible value $D^*_{n+m} (R)$ for $n+m$ random variables.

4338: \end{proof}

4339:

4340: It follows that for all $R,n \geq 1$ we have $D^*_{2n}(R) \leq

4341: D^*_n(R)$.

4342: The inequality is

4343: typically strict; \cite{CT91} gives an

4344: intuitive explanation of this phenomenon.

4345: For fixed $R$ the value of $D^*_1(R)$ is fixed and it is finite.

4346: Since also $D^*_n(R)$ is necessarily

4347: nonnegative for all $n$, we have established the existence

4348: of the limit

4349: \begin{equation}

4350: \label{eq:dr}

4351: D^*(R) = \liminf_{n \rightarrow \infty} D^*_n (R).

4352: \end{equation}
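The existence of the limit can be made precise with a standard subadditivity argument (Fekete's lemma), which is sketched here and is not spelled out in the text. Write $a_n = n D^*_n(R)$. The lemma above states $a_{n+m} \leq a_n + a_m$, and $0 \leq a_n \leq n a_1 < \infty$, so
$$
\lim_{n \rightarrow \infty} \frac{a_n}{n} = \inf_{n \geq 1} \frac{a_n}{n},
\qquad \mbox{that is,} \qquad
\lim_{n \rightarrow \infty} D^*_n(R) = \inf_{n \geq 1} D^*_n(R) \leq D^*_1(R).
$$
In particular the limit inferior in \eqref{eq:dr} is an ordinary limit. The special case $m = n$ of the lemma reads $2n D^*_{2n}(R) \leq 2n D^*_n(R)$, which is the inequality $D^*_{2n}(R) \leq D^*_n(R)$.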

4353: The value of $D^*(R)$ is the minimum achievable distortion

4354: at rate (number of bits/outcome) $R$. Therefore, $D^*(\cdot)$

4355: is called the

4356: {\em distortion-rate function}.

4357: In our Gaussian

4358: Example~\ref{ex:gauss}, $D^*(R)$ quickly converges to $0$ with

4359: increasing $R$. It turns out that for general $d$, when

4360: we view $D^*(R)$ as a

4361: function of $R\in [0,\infty)$, it is {\em convex and nonincreasing}.

4362: \begin{example}

4363: \label{ex:ber}

4364: \rm

4365:   Let ${\cal X} = \{0,1\}$, and let $P(X= 1) = p$.  Let ${\cal Y} =

4366:   \{0,1\}$  and take the Shannon-Fano distortion

4367: function $d(x,y) =  \log 1/ f(x \mid y)$ with notation as in Example~\ref{ex:reconcile}.

4368: Let $Y$ be a function that

4369: achieves the minimum expected Shannon-Fano

4370: distortion $D^*_1(R)$. As usual we write $Y$ for the random variable $Y(x)$

4371: induced by $X$. Then, $D^*_1(R)={\bf E}[d(X,Y)] =

4372: {\bf E} [ \log 1/ f(X|Y)] = H(X|Y)$.

4373: At rate $R = 1$, we

4374:   can set $Y= X$ and the minimum achievable distortion is

4375: given by $D^*_1(1)=H(X|X) =

4376:   0$. Now consider some rate $R$ with $0 < R < 1$, say $R= \frac{1}{2}$.

4377:  Since we are now

4378: forced to use at most $2^R < 2$

4379:   messages in communicating, only a fixed message can be sent, no

4380:   matter what outcome of the random variable $X$ is realized.

4381: This means that no communication is

4382:   possible at all and the minimum achievable distortion is

4383: $D^*_1(\frac{1}{2}) =H(X) =

4384:   H(p,1-p)$. But clearly, if we consider $n$ repetitions of the same

4385:   scenario and are allowed to send a message out of

4386: $\lfloor 2^{nR} \rfloor$

4387:   candidates, then some useful information can be communicated after

4388:   all, even if $R < 1$.

4389: In Example~\ref{ex:berb} we will show that

4390: if $R > H(p,1-p)$, then $D^*(R) = 0$; if $R \leq

4391:   H(p,1-p)$, then $D^*(R) = H(p,1-p) - R$.

4392: \end{example}
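As a numerical illustration of the gap between single-outcome and block coding, take $p = 1/2$ in Example~\ref{ex:ber}, so that $H(p,1-p) = 1$ bit, and accept the closed form for $D^*(R)$ quoted there (it is derived only later). At rate $R = 1/2$,
$$
D^*_1(\tfrac{1}{2}) = H(\tfrac{1}{2},\tfrac{1}{2}) = 1,
\qquad
D^*(\tfrac{1}{2}) = H(\tfrac{1}{2},\tfrac{1}{2}) - \tfrac{1}{2} = \tfrac{1}{2},
$$
so coding long blocks halves the distortion per outcome compared to coding each outcome separately.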

4393: Up to now we studied the minimum achievable distortion $D$ as a function of

4394: the rate $R$.

4395: For technical reasons, it is often more convenient to consider the

4396: minimum achievable rate $R$ as a function of the distortion $D$.

4397: This is the more  celebrated

4398: version, the {\em rate-distortion function} $R^*(D)$.

4399: Because $D^*(R)$ is convex

4400: and nonincreasing, $R^*(D): [ 0, \infty) \rightarrow [0,\infty]$ is

4401: just the {\em inverse\/} of the function $D^*(R)$.
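For a concrete instance of this inversion, consider the source of Example~\ref{ex:ber} and take the closed form stated there as given, that is, $D^*(R) = \max \{ H(p,1-p) - R, 0 \}$. Solving $D = H(p,1-p) - R$ for $R$ yields, as a sketch,
$$
R^*(D) = \begin{cases}
H(p,1-p) - D & \mbox{if $0 \leq D \leq H(p,1-p)$} \\
0 & \mbox{if $D > H(p,1-p)$,}
\end{cases}
$$
where $R^*(D)$ is read as the smallest rate at which distortion $D$ is achievable.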

4402:

4403: %We say that a function

4404: %${X}^n: {\cal X}^n \rightarrow {\cal X}^n$

4405: %{\em achieves rate $R$ and distortion $d^*$\/} iff

4406: %$|{\cal X}^n| \leq 2^{nR}$ and

4407: %$ {\bf E} [d((x_1,\ldots,

4408: %x_n),{X}^n(x_1,\ldots,x_n))] \leq d^*$. By definition,

4409: %\begin{equation}

4410: %R(d^*) = \lim_{n \rightarrow \infty} \inf \{ R: \ \text{there exists

4411: %  ${X}^n$ achieving rate $R$ and distortion $d^*$} \ \}

4412: %\end{equation}

4413: It turns out to be possible to relate distortion to the Shannon mutual

4414: information.

4415: This remarkable fact, which Shannon proved already in

4416: \cite{Sh48,Sh59},

4417: illustrates the fundamental nature of Shannon's concepts.

4418: %The setting is more generalized than in the previous simplified discussion

4419: %of the basics of distortion theory.

4420: Up till now, we only considered

4421: {\em deterministic\/} encodings $Y: {\cal X} \rightarrow {\cal Y}$.

4422: But it is hard to analyze the rate-distortion,  and distortion-rate,

4423: functions in this setting. It turns out to be advantageous to follow

4424: an indirect route by bringing

4425: information-theoretic techniques into play.

4426: To this end, we generalize the setting to {\em

4427: randomized\/} encodings.  That is, upon observing $X=x$ with probability

4428: $f(x)$, the

4429: sender may use a randomizing device (e.g. a coin) to decide which

4430: code word $y \in {\cal Y}$ he is going to send to the receiver. A

4431: randomized encoding $Y$ thus maps each $x \in {\cal X}$

4432: to $y \in {\cal Y}$

4433: with probability

4434: $g_x(y)$, denoted in conditional probability format

4435: as $g(y|x)$. Altogether we deal

4436: with a joint distribution $g(x,y)=f(x)g(y|x)$ on

4437: the joint sample

4438: space ${\cal X} \times {\cal Y}$. (In the deterministic case we have

4439: $g(Y(x) \mid x)=1$ for the given function $Y: {\cal X} \rightarrow {\cal Y}$.)

4440:

4441: %The natural question arising is:

4442: %What is the minimum possible rate  that can be achieved

4443: %by a randomized code achieving distortion at most $D$? The latter

4444: %condition is formally expressed as follows. Again, let $X$ be

4445: %a random variable over outcome space ${\cal X}$, with probability $p(x)$

4446: %that $X=x$. Let $Y$ be a random variable over the set of code words

4447: %${\cal Y}$, with probability $r(y)$ that $Y=y$. There is (possibly or rather,

4448: %commonly) a dependence between the random variables $X$ and $Y$,

4449: %given by the joint probability $q(x,y)$ with

4450: %$p(x)= \sum_{y \in {\cal Y}} q(x,y)$ and

4451: %$r(x)= \sum_{x \in {\cal X}} q(x,y)$.

4452: %The marginal probability $q(y \mid x)$ is that of encoding outcome $X=x$

4453: %by code word $y$.

4454: \begin{definition}

4455: \rm

4456: Let $X$ and $Y$ be joint random variables as above, and let $d(x,y)$

4457: be a distortion measure.

4458: The {\em  expected distortion} $D(X,Y)$ of $Y$ with respect to $X$ is

4459: defined by

4460: \begin{equation}\label{eq.DXY}

4461: D(X,Y)= %{\bf E}_{g} [d(X,Y)] =

4462: \sum_{x \in {\cal X}, y \in {\cal Y}} g(x,y) d(x,y).

4463: \end{equation}

4464: \end{definition}
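As a small worked instance of \eqref{eq.DXY}, consider the following sketch. The binary alphabets, the fair source, and the Hamming distortion used here are illustrative assumptions and are not the setting of the surrounding text. Let ${\cal X}={\cal Y}=\{0,1\}$ and $f(0)=f(1)=\frac{1}{2}$, let $d(x,y)=0$ if $x=y$ and $d(x,y)=1$ otherwise, and let the randomized encoding copy the source bit with probability $\alpha$, that is, $g(y|x)=\alpha$ if $y=x$ and $g(y|x)=1-\alpha$ if $y \neq x$. Then
$$
D(X,Y) = \sum_{x \in \{0,1\}} f(x) \left[ \alpha \cdot 0 + (1-\alpha) \cdot 1 \right] = 1-\alpha .
$$
The deterministic encoding $Y(x)=x$ is the case $\alpha=1$ with $D(X,Y)=0$; lowering $\alpha$ towards $\frac{1}{2}$ trades a larger expected distortion for less information about $X$ in $Y$.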

4465: Note that for a given problem

4466: the source probability $f(x)$ of outcome $X=x$ is fixed,

4467: but the randomized encoding $Y$, that is the conditional probability

4468: $g(y|x)$ of encoding source word $x$ by code word $y$,

4469: can be chosen to advantage.

4470: We define the auxiliary notion

4471: of {\em information rate distortion function} $R^{(I)}(D)$ by

4472: \begin{equation}

4473: \label{eq:ird}

4474: R^{(I)}(D) = \inf_{Y : D(X,Y) \leq D}

4475: I(X; Y).

4476: \end{equation}

4477:  That is, for random variable $X$,

4478: among {\em all\/} joint random variables $Y$ with expected distortion

4479: to $X$ less

4480:   than or equal to $D$, the information rate $R^{(I)}(D)$

4481: equals the minimal mutual information with $X$.

4482: \begin{theorem}[Shannon]

4483: \label{thm:rd}

4484: For every random source $X$ and distortion measure $d$:

4485: \begin{equation}

4486: \label{eq:rd}

4487: R^*(D) = R^{(I)}(D)

4488: \end{equation}

4489: \end{theorem}

4490: This remarkable theorem

4491: states that the best deterministic code achieves a rate-distortion

4492: that equals the minimal information rate possible for a randomized code,

4493: that is, the minimal

4494: mutual information between the random source and a

4495: randomized code.

4496: Note that this does not mean that $R^*(D)$

4497: is independent of the distortion measure.

4498: In fact, the source random variable $X$,

4499: together with the distortion measure $d$, determines a random

4500: code $Y$ for which the joint random variables $X$ and $Y$

4501: reach the infimum in \eqref{eq:ird}.

4502: The proof of this theorem is given in

4503: \cite{CT91}. It is illuminating to see how it goes:

4504: It is shown first that, for a random source $X$ and distortion measure $d$,

4505: every deterministic code $Y$ with distortion $\leq D$ has rate

4506: $R \geq R^{(I)} (D)$. Subsequently, it is shown that there

4507: exists

4508: a deterministic code that, with distortion $ \leq D$,

4509: achieves rate $R^*(D)=R^{(I)} (D)$.

4510: To analyze deterministic $R^*(D)$ therefore,

4511: we can determine the best randomized

4512: code $Y$ for random source $X$ under distortion constraint $D$,

4513: and then we know that simply $R^*(D)=I(X;Y)$.

4514: \begin{example} (Example~\ref{ex:ber}, continued)

4515: \label{ex:berb}

4516: \rm Suppose we want to compute $R^*(D)$ for some $D$ between $0$ and

4517: $1$.  If we only allow encodings $Y$ that are deterministic functions

4518: of $X$, then either $Y(x) \equiv x$ or $Y(x) \equiv |1- x|$.

4519: In both cases ${\bf E }

4520: [d(X,Y)] = H(X| Y) = 0$, so $Y$ satisfies the constraint in

(\ref{eq:ird}). In both cases, $I(X; Y) = H(Y) = H(X)$. With

4522: (\ref{eq:rd}) this

4523: shows that $R^*(D) \leq H(X)$. However, $R^*(D)$ is actually smaller:

4524: by allowing randomized codes, we can define $Y_{\alpha}$ as

4525: $Y_{\alpha} (x) = x$ with probability $\alpha$ and $Y_{\alpha} (x) = |1- x|$

4526: with probability $1- \alpha$. For $0 \leq \alpha \leq \frac{1}{2}$, ${\bf E }

4527: [d(X,Y_{\alpha})] = H(X| Y_{\alpha})$ increases with $\alpha$, while

4528: $I(X;Y_{\alpha})$ decreases with $\alpha$.  Thus, by choosing the

4529: $\alpha^*$ for which the constraint ${\bf E } [d(X,Y_{\alpha})] \leq

4530: D$ holds with equality, we find $R^*(D) = I(X; Y_{\alpha^*})$. Let us

4531: now calculate $R^*(D)$ and $D^*(R)$ explicitly.

4532:

Since $I(X;Y) = H(X) - H(X|Y)$, we can rewrite $R^*(D)$ as

4534: $$

4535: R^*(D) = H(X)  - \sup_{Y: D(X,Y) \leq D}

4536: H(X|Y).

4537: $$

4538: In the special case where $D$ is itself the

4539: Shannon-Fano distortion, this can in turn be rewritten as

4540: $$

4541: R^*(D) = H(X)  - \sup_{Y: H(X|Y) \leq D} H(X \mid Y)

4542: = H(X) - D.

4543: $$

4544: Since $D^*(R)$ is the inverse of $R^*(D)$,

4545: we find $D^*(R) = H(X) - R$, as announced in Example~\ref{ex:ber}.

4546: \end{example}
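To make the computation in Example~\ref{ex:berb} concrete, here is a sketch under the additional assumption (made here only for illustration) that $X$ is a fair coin flip, so that $H(X)=1$ bit. By symmetry $Y_{\alpha}$ is then also uniform, and given $Y_{\alpha}=y$ the source bit equals $y$ with probability $\alpha$. Writing $H_2(\alpha) = -\alpha \log \alpha - (1-\alpha) \log (1-\alpha)$ for the binary entropy (the symbol $H_2$ is local to this sketch), we get
$$
H(X|Y_{\alpha}) = H_2(\alpha), \qquad I(X;Y_{\alpha}) = H(X) - H(X|Y_{\alpha}) = 1 - H_2(\alpha).
$$
For the Shannon-Fano distortion the constraint holds with equality at the $\alpha^* \in [0,\frac{1}{2}]$ with $H_2(\alpha^*)=D$, which gives $R^*(D) = 1 - D$. The endpoints serve as a quick check: $\alpha^*=0$ gives $D=0$ and $R^*(0)=1=H(X)$, while $\alpha^*=\frac{1}{2}$ gives $D=1$ and $R^*(1)=0$.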

4547:

4548: \paragraph{Problem and Lacuna:}

4549: In the Rate-Distortion setting we allow (on average) a rate of

4550: $R$ bits to express the data as well as possible in some way,

and measure the average loss by some distortion function.

4552: But in many cases, like lossy compression of images,

4553: one is interested in the individual cases. The average over all

4554: possible images may be irrelevant for the individual cases one meets.

4555: Moreover, one is not particularly interested in bit-loss,

4556: but rather in preserving the essence of the image as well as possible.

4557: As another example, suppose

4558: the distortion function is simply to supply the remaining

4559: bits of the data. But this can be unsatisfactory: we are given

4560: an outcome of a measurement as a real number of $n$ significant bits. Then

4561: the $R$ most significant bits carry most of the meaning

4562: of the data, while the remaining $n-R$ bits may be irrelevant.

Thus, we are led to the elusive notion

4564: of a distortion function that captures

4565: the amount of ``meaning'' that is not included in the $R$ rate bits.

4566: These issues are taken up by Kolmogorov's proposal of the structure function.

4567: This cluster  of ideas puts the notion

4568: of Rate--Distortion in an individual algorithmic (Kolmogorov

4569: complexity) setting, and focuses on the meaningful information

4570: in the data. In the end we can recycle the new insights and

4571: connect them to Rate-Distortion notions to provide new foundations

for statistical inference notions such as maximum likelihood (ML)

4573: \cite{Fi22},

4574: minimum

4575: message length (MML) \cite{WallaceF87},

4576: and minimum description length (MDL) \cite{Ri89}.

4577: \subsection{Structure Function}

4578: \label{sec:structure}

4579: There is a close relation between

4580: functions describing

4581: three, a priori seemingly unrelated, aspects of modeling individual

4582: data, depicted in Figure~\ref{figure.estimator}.

4583: \begin{figure}

4584: \begin{center}


4586: \epsfxsize=8cm \epsfbox{estimator.eps}

4587: \end{center}

4588: \caption{Structure functions $h_x(i), \beta_x(\alpha), \lambda_x(\alpha)$,

4589: and minimal sufficient statistic.}

4590: \label{figure.estimator}

4591: \end{figure}

4592: \label{sec:meaningful}

4593: One of these was introduced by

Kolmogorov at a conference in Tallinn in 1974 (no written version)

4595: and in a talk at the Moscow Mathematical Society in the same year

4596: of which the abstract \cite{Ko74}

4597: is as follows (this is the only writing by Kolmogorov about

4598: this circle of ideas):

4599: \begin{quote}

4600: ``To each constructive object corresponds a function $\Phi_x(k)$ of a

4601:  natural number $k$---the log of minimal cardinality of $x$-containing

4602:  sets that allow definitions of complexity at most $k$.

4603:  If the element $x$ itself allows a simple definition,

4604:  then the function $\Phi$ drops to $1$ %[presumably, $0 = \log 1$ is meant]

4605:  even for small $k$.

4606:  Lacking such definition, the element is ``random'' in a negative sense.

4607:  But it is positively ``probabilistically random'' only when function

4608:  $\Phi$ having taken the value $\Phi_0$ at a relatively small

4609:  $k=k_0$, then changes approximately as $\Phi(k)=\Phi_0-(k-k_0)$.''

4610: \end{quote}

4611: Kolmogorov's $\Phi_x$ is commonly called the ``structure function''

4612: and is here denoted as $h_x$ and defined in \eqref{eq2}.  The

4613: structure function notion entails a proposal for a non-probabilistic

4614: approach to statistics, an individual combinatorial relation between

4615: the data and its model, expressed in terms of Kolmogorov complexity.

4616: It turns out that the structure function determines all stochastic

4617: properties of the data in the sense of determining the best-fitting

4618: model at every model-complexity level, the equivalent notion to

4619: ``rate'' in the Shannon theory. A consequence is this: minimizing the

4620: data-to-model code length (finding the ML estimator or MDL estimator),

4621: in a class of contemplated models of prescribed maximal (Kolmogorov)

4622: complexity, {\em always} results in a model of best fit, irrespective

4623: of whether the source producing the data is in the model class

4624: considered.  In this setting, code length minimization {\em always}

4625: separates optimal model information from the remaining accidental

4626: information, and not only with high probability.  The function that

4627: maps the maximal allowed model complexity to the goodness-of-fit

4628: (expressed as minimal ``randomness deficiency'') of the best model

4629: cannot itself be monotonically approximated. However, the shortest

4630: one-part or two-part code above can---implicitly optimizing this

4631: elusive goodness-of-fit.

4632:

4633:

4634: In probabilistic statistics the goodness of the selection process is

4635: measured in terms of expectations over probabilistic ensembles.  For

4636: current applications, average relations are often irrelevant, since

4637: the part of the support of the probability mass function that will

4638: ever be observed has about zero measure. This may be the case in, for

4639: example, complex video and sound analysis.  There arises the problem

4640: that for individual cases the selection performance may be bad

4641: although the performance is good on average, or vice versa. There is

4642: also the problem of what probability means, whether it is subjective,

4643: objective, or exists at all.  Kolmogorov's proposal strives for the

4644: firmer and less contentious ground of finite combinatorics and

4645: effective computation.

4646:

4647: \paragraph{Model Selection:}

4648: It is technically convenient to initially consider the simple model

4649: class of finite sets to obtain our results, just as in

4650: Section~\ref{sec:algsuf}. It then turns out that it is relatively easy

4651: to generalize everything to the model class of computable probability

4652: distributions (Section~\ref{s.prob}). That class is very large

4653: indeed: perhaps it contains every distribution that has ever been

4654: considered in statistics and probability theory, as long as the

4655: parameters are computable numbers---for example rational numbers. Thus

4656: the results are of great generality; indeed, they are so general that

4657: further development of the theory must be aimed at restrictions on

4658: this model class.

4659:

4660: Below we will consider various model

4661: selection procedures. These are approaches for finding a model $S$

4662: (containing $x$) for arbitrary data $x$. The goal is to find a model

that captures all meaningful information in the data $x$. All

4664: approaches we consider are at some level based on coding $x$ by giving

4665: its index in the set $S$, taking $ \log |S|$ bits. This

4666: codelength may be thought of as a particular distortion function, and

4667: here lies the first connection to Shannon's rate-distortion:

4668:

4669: \begin{example}\label{rem.rd-ksf1}

4670: \rm

4671: %This approach can be straightforwardly translated into the Rate-Distortion

4672: %setting:

4673: A model selection procedure is a function $Z_n$ mapping binary data of

4674: length $n$ to finite sets of strings of length $n$, containing the

4675: mapped data, $Z_n(x)=S$ ($x \in S$). The range of $Z_n$ satisfies

${\cal Z}_n \subseteq 2^{\{0,1\}^n}$. The distortion function $d$ is
defined to be $d(x,Z_n(x))= \frac{1}{n} \log |S|$. To define the

4678: rate--distortion function we need that $x$ is the outcome of a random

4679: variable $X$. Here we treat the simple case that $X$ represents $n$

4680: flips of a fair coin; this is substantially generalized in

4681: Section~\ref{sec:esf}. Since each outcome of a fair coin can be

4682: described by one bit, we set the rate $R$ at $0 < R < 1$. Then,

4683: $D_n^*(R) = \min_{Z_n: |{\cal Z}_n| \leq 2^{nR}} \sum_{|x|=n}2^{-n}

\frac{1}{n} \log |Z_n(x)|$. For the minimum of the right-hand side we

4685: can assume that if $y \in Z_n(x)$ then $Z_n(y)=Z_n(x)$ (the distinct

4686: $Z_n(x)$'s are disjoint). Denote the distinct $Z_n(x)$'s by $Z_{n,i}$

4687: with $i=1,\ldots , k$ for some $k \leq 2^{nR}$. Then, $D_n^*(R) = \min

_{Z_n: |{\cal Z}_n| \leq 2^{nR}} \sum_{i=1}^k |Z_{n,i}|2^{-n}

4689: \frac{1}{n} \log |Z_{n,i}|$.  The right-hand side reaches its minimum

4690: for all $Z_{n,i}$'s having the same cardinality and $k=2^{nR}$. Then,

4691: $D_n^*(R) = 2^{nR} 2^{(1-R)n} 2^{-n} \frac{1}{n} \log 2^{(1-R)n} =

4692: 1-R$.  Therefore, $D^*(R)= 1-R$ and therefore $R^*(D) = 1-D$.

4693:

4694: Alternatively, and more in line with the structure-function

4695: approach below, one may consider repetitions of a random variable $X$

4696: with outcomes in $\{0,1\}^n$. Then,

4697: a model selection procedure is a function $Y$ mapping

4698: binary data of length $n$ to finite sets of strings of length $n$,

4699: containing the mapped data,

4700: $Y(x)=S$ ($x \in S$). The range of $Y$ satisfies

${\cal Y} \subseteq 2^{\{0,1\}^n}$. The distortion function $d$ is defined

4702: by $d(x,Y(x))= \log |S|$. To define the rate--distortion function

4703: we need that $x$ is the outcome of a random variable $X$, say

4704: a toss of a fair $2^n$-sided coin. Since each outcome of a fair

4705: coin can be described by $n$ bits, we set the rate $R$

4706: at $0 < R < n$. Then, for outcomes $\overline{x}=x_1 \ldots x_m$

4707: ($|x_i|=n$), resulting from $m$  i.i.d. random variables $X_1, \ldots , X_m$,

4708: we have $d(\overline{x}, Z_m (\overline{x})) =

4709: \frac{1}{m} \sum_{i=1}^m \log |Y_i (x_i)| =

4710: \frac{1}{m} \log | Y_1(x_1) \times \cdots \times Y_m(x_m)|$. Then,

4711: $D_m^*(R) = \min_{Z_m: |{\cal Z}_m| \leq 2^{mR}}

4712: \sum_{\overline{x}}2^{-mn} d(\overline{x}, Z_m (\overline{x}))$.

Assume that if $\overline{y} \in Z_m(\overline{x})$ then
$Z_m(\overline{y}) = Z_m(\overline{x})$: the distinct

4715: $Z_m(\overline{x})$'s are disjoint and partition $\{0,1\}^{mn}$

4716: into disjoint subsets $Z_{m,i}$, with $i=1, \ldots, k$ for

4717: some $k \leq 2^{mR}$.

4718: Then,

4719: $D_m^*(R) = \min_{Z_m: |{\cal Z}_m| \leq 2^{mR}}

4720: \sum_{i=1,\ldots, k} |Z_{m,i}|2^{-mn} \frac{1}{m}

4721: \log |Z_{m,i}|$.

4722: The right-hand side reaches its minimum for all $Z_{m,i}$'s having

4723: the same cardinality and $k=2^{mR}$, so that

4724: $D_m^*(R) = 2^{(n-R)m} 2^{mR} 2^{-mn} \frac{1}{m} \log 2^{(n-R)m} = n-R$.

4725: Therefore, $D^*(R)= n-R$ and $R^*(D) = n-D$.

4726: In Example~\ref{ex.rd=str} we relate these numbers to the structure

4727: function approach described below.

4728: \end{example}
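A small numerical instance may help to see why equal-sized sets are optimal in Example~\ref{rem.rd-ksf1}. The values $n=4$ and $R=\frac{1}{2}$, and the particular partitions, are chosen here purely for illustration. At most $2^{nR}=4$ sets are then allowed. Partitioning $\{0,1\}^4$ by the first two bits gives four sets of $4$ strings each, and
$$
\sum_{i=1}^{4} |Z_{n,i}| 2^{-n} \frac{1}{n} \log |Z_{n,i}| = 4 \cdot 4 \cdot \frac{1}{16} \cdot \frac{1}{4} \cdot 2 = \frac{1}{2} = 1-R .
$$
An unbalanced partition into sets of sizes $8,4,2,2$ does worse:
$$
\frac{1}{16} \cdot \frac{1}{4} \left( 8 \cdot 3 + 4 \cdot 2 + 2 \cdot 1 + 2 \cdot 1 \right) = \frac{36}{64} = \frac{9}{16} > \frac{1}{2} ,
$$
in line with the claim that the minimum is attained when all sets have the same cardinality.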

4729:

4730: \paragraph{Model Fitness:}

4731: A distinguishing feature of the structure function approach is that

4732: we want to formalize what it means for an element to  be ``typical''

4733: for a set that contains it. For example, if we flip a fair coin $n$

4734: times, then the sequence of $n$ outcomes, denoted by $x$,

4735:  will be an element of the set $\{0,1\}^n$. In fact,

4736: most likely it will be a ``typical'' element in the sense that

4737: it has all properties that hold on average for an element of that set.

4738: For example, $x$ will have $\frac{n}{2} \pm O(\sqrt{n})$ frequency

4739: of 1's, it will have a run of about $\log n$ consecutive 0's,

4740: and so on for many properties. Note that the sequence $x=0 \ldots 01\ldots1$,

consisting of one half 0's followed by one half 1's, is highly atypical,

4742: even though it satisfies the two properties described explicitly.

4743: The question arises how to formally define ``typicality''. We do

4744: this as follows:

4745: The lack of typicality

4746: of $x$ with respect to a finite set $S$ (the model) containing it,

4747: is the amount by which $K(x|S)$

4748: falls short of the length $\log |S|$ of the data-to-model code (Section~\ref{sec:algsuf}).

4749: Thus, the {\em randomness deficiency} of $x$ in $S$ is defined by

4750:       \begin{equation}\label{eq:randomness-deficiency}

4751: \delta (x | S) = \log |S| - K(x | S),

4752:       \end{equation}

4753: for $x \in S$, and $\infty$ otherwise. Clearly, $x$ can be typical for

4754: vastly different sets. For example, every $x$ is typical for the singleton

4755: set $\{x\}$, since $\log |\{x\}|=0$ and $K(x \mid \{x\})=O(1)$.

4756: Yet the many $x$'s that have $K(x) \geq n$ are also typical for

4757: $\{0,1\}^n$, but in another way. In the first example, the set is about

4758: as complex as $x$ itself. In the second example, the set is vastly

4759: less complex than $x$: the set has complexity about

4760: $K(n) \leq \log n + 2 \log \log n$ while $K(x)\geq n$.

4761: Thus, very high complexity data may have simple

4762: sets for which they are typical. As we shall see,

4763: this is certainly not the case for all high complexity data.
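To illustrate \eqref{eq:randomness-deficiency} with the model $S_n=\{0,1\}^n$ (a sketch; the constants hidden in the $O(1)$ terms depend on the reference machine): the all-zero string $0^n$ can be computed from $S_n$ by a program of constant length, so $K(0^n | S_n)=O(1)$ and
$$
\delta(0^n | S_n) = n - K(0^n | S_n) \geq n - O(1),
$$
that is, $0^n$ is about as atypical for $S_n$ as possible. In contrast, if $K(x) \geq n$ then, since $K(x) \leq K(S_n) + K(x | S_n) + O(1)$ and $K(S_n) = K(n)+O(1)$,
$$
\delta(x | S_n) = n - K(x | S_n) \leq K(n) + O(1),
$$
so the deficiency is at most logarithmic in $n$.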

4764: %certain

4765: %$y$ with $|y|=n$ and $K(y)=n/2$.

4766: The question arises how typical

4767: data $x$ of length $n$ can be in the best case

4768: for a finite set of complexity $R$

4769: when $R$ ranges from 0 to $n$. The function describing this dependency,

4770: expressed in terms of randomness deficiency to measure the optimal

4771: typicality, as a function of the complexity ``rate'' $R$ ($0 \leq R \leq n$)

4772: of the number of bits we can maximally spend to describe a finite

4773: set containing $x$,

4774: is defined as follows:

4775:

4776: %From the definition we see that $x$ can only be typical for a

4777: %large cardinality set if the complexity of $x$ is large.

4778: %But $y = 00 \ldots 0$ is not typical for $\{0,1\}^n$;

4779: %in fact, $\delta (y \mid \{0,1\}^n) \geq n - O(1)$.

4780:

4781: %This definition allows us to consider how typical data $x$ can be for

4782: %a model $S$ of certain maximal complexity:

4783: The {\em minimal randomness deficiency} function is

4784:            \begin{equation}

4785: \label{eq1}

4786: \beta_x( R) =

4787: \min_{S} \{ \delta(x| S): S \ni x, \;  K(S) \leq R \},

4788:             \end{equation}

4789:             where we set $\min \emptyset = \infty$.  If $\delta(x |

4790:             S)$ is small, then $x$ may be considered as a {\em

4791:               typical} member of $S$. This means that $S$ is a

4792:             ``best'' model for $x$---a most likely explanation.  There

4793:             are no simple special properties that single it out from

4794:             the majority of elements in $S$.  We therefore like to

4795:             call $\beta_x(R)$ the {\em best-fit estimator}.  This

4796:             is not just terminology: If $\delta (x | S)$ is small,

4797:             then $x$ satisfies {\em all} properties of low Kolmogorov

4798:             complexity that hold with high probability (under the

4799:             uniform distribution) for the elements of $S$. To be

4800:             precise \cite{VV02}: Consider strings of length $n$ and

4801:             let $S$ be a subset of such strings. We view a {\em

4802:               property} of elements in $S$ as a function $f_P: S

4803:             \rightarrow \{0,1\}$. If $f_P(x)=1$ then $x$ has the

4804:             property represented by $f_P$ and if $f_P(x)=0$ then $x$

4805:             does not have the property.  Then: (i) If $f_P$ is a

4806:             property satisfied by all $x$ with $\delta(x | S) \le

4807:             \delta (n)$, then $f_P$ holds with probability at least

4808:             $1-1/2^{\delta(n)}$ for the elements of $S$.

4809:

4810: (ii) Let

4811: $f_P$ be any

4812: property

4813: that holds with probability at least

4814: $1-1/2^{\delta (n)}$ for the

4815: elements of $S$. Then, every such $f_P$ holds

4816: simultaneously for every $x \in S$

4817: with $\delta (x | S)\le\delta (n)-K(f_P|S)-O(1)$.

4818:

4819:

4820: \begin{example}

4821:   \rm {\bf Lossy Compression:} \index{compression, lossy} The function

4822:   $\beta_x( R)$ is relevant to lossy compression (used, for instance,

4823:   to compress images) -- see also Remark~\ref{rem:lossy}.  Assume we

4824:   need to compress $x$ to $R$ bits where $R \ll K(x)$.  Of course this

4825:   implies some loss of information present in $x$.  One way to select

4826:   redundant information to discard is as follows: Find a set $S\ni x$

4827:   with $K(S)\le R$ and with small $\delta(x | S)$, and consider a

4828:   compressed version $S'$ of $S$.  To reconstruct an $x'$, a

  decompressor uncompresses $S'$ to $S$ and selects at random an

4830:   element $x'$ of $S$.  Since with high probability the randomness

4831:   deficiency of $x'$ in $S$ is small, $x'$ serves the purpose of the

4832:   message $x$ as well as does $x$ itself.  Let us look at an example.

4833:   To transmit a picture of ``rain'' through a channel with limited

4834:   capacity $R$, one can transmit the indication that this is a picture

4835:   of the rain and the particular drops may be chosen by the receiver

4836:   at random.  In this interpretation, $\beta_x(R)$ indicates how

4837:   ``random'' or ``typical'' $x$ is with respect to the best model at

4838:   complexity level $R$---and hence how ``indistinguishable'' from the

4839:   original $x$ the randomly reconstructed $x'$ can be expected to be.

4840: \end{example}

4841:

4842:

4843: \begin{remark}

4844: \rm

4845: This randomness deficiency function quantifies

4846: the goodness of fit of the best model at complexity $R$

4847: for given data $x$. As far as we know no direct counterpart of this

4848: notion exists in Rate--Distortion theory, or, indeed,

4849: can be expressed in classical theories like Information Theory.

4850: But the situation is different for the next function we define,

4851: which, in almost contradiction to the previous statement, can

4852: be tied to the minimum randomness deficiency function, yet, as will be

4853: seen in Example~\ref{ex.rd=str} and Section~\ref{sec:esf},

4854: does have a counterpart in Rate--Distortion theory after all.

4855: \end{remark}

4856:

4857:

4858: \paragraph{Maximum Likelihood estimator:}

4859: The {\em Kolmogorov structure} function $h_x$ of given data $x$ is defined by

4860:  \begin{equation}\label{eq2}

4861:    h_{x}(R) = \min_{S} \{\log | S| : S \ni x,\; K(S) \leq R\},

4862: \end{equation}

4863: where $S \ni x$ is

4864: a contemplated model for $x$, and $R$ is a nonnegative

4865: integer value bounding the complexity of the contemplated $S$'s.

4866: The structure function uses models that are finite sets and

4867: the value of the structure function is the log-cardinality of the

4868: smallest such set containing the data. Equivalently, we can

4869: use uniform probability mass functions over finite supports (the former

4870: finite set models). The smallest set containing the data then becomes

4871: the uniform probability mass assigning the highest probability

4872: to the data---with the value of the structure function

4873: the corresponding negative

4874: log-probability. This motivates us to call $h_x$ the {\em maximum likelihood

4875: estimator}. The treatment can be extended from uniform probability

4876: mass functions with finite supports, to  probability models that

4877: are arbitrary computable probability mass functions, keeping

4878: all relevant notions and results essentially unchanged, Section~\ref{s.prob},

4879: justifying the maximum likelihood identification even more.

4880:

4881: Clearly, the Kolmogorov structure function is

4882: non-increasing and reaches $\log |\{x\}| = 0$

4883: for the ``rate'' $R = K(x)+c_1$ where $c_1$ is the number of bits required

4884: to change $x$ into $\{x\}$.

4885: It is also easy to see that for argument $K(|x|)+c_2$, where $c_2$

4886: is the number of bits required to compute the

set of all strings of length $|x|$ from $|x|$,
the value of the structure function is at most $|x|$; see Figure~\ref{figure.estimator}.

4889: \begin{example}\label{ex.rd=str}

4890: \rm

4891: Clearly the structure function measures for individual outcome $x$

4892: a distortion that is related to the

4893: one measured by $D_1^*(R)$ in Example~\ref{rem.rd-ksf1}

4894: for the uniform average of outcomes $x$.

4895: Note that all  strings $x$ of length $n$ satisfy $h_x(K(n)+O(1)) \leq n$

4896: (since $x \in S_n=\{0,1\}^n$ and $K(S_n)=K(n)+O(1)$).

4897: For every $R$ ($0 \leq R \leq n$),

4898: we can describe every $x = x_1x_2 \ldots x_n$ as an element

4899: of the set $A_R = \{x_1 \ldots x_R y_{R+1} \ldots y_n:

4900: y_i \in \{0,1\}, R < i \leq n \}$. Then, $|A_R|=2^{n-R}$

4901: and $K(A_R) \leq R+K(n,R)+O(1) \leq R + O(\log n)$.

4902: This shows that $h_x (R) \leq n-R+O(\log n)$ for every $x$

4903: and every $R$ with $0 \leq R \leq n$; see Figure~\ref{figure.estimator}.

4904:

4905: For all $x$'s and $R$'s we can describe $x$ in a two-part code by the set

4906: $S$ witnessing $h_x(R)$ and $x$'s index in that set. The first part

describing $S$ in $K(S) \leq R$ bits allows us to generate

4908: $S$, and given $S$ we know $\log |S|$. Then,

4909: we can parse the second part of $\log |S|=h_x(R)$ bits that gives $x$'s

4910: index in $S$. We also need a fixed $O(1)$ bit program to produce $x$

4911: from these descriptions. Since $K(x)$ is the lower bound on

4912: the length of effective descriptions of $x$, we have $h_x(R)+R \geq K(x)-O(1)$.

4913: There are $2^n - 2^{n-K(n)+O(1)}$ strings $x$ of complexity $K(x)\geq n$,

4914: \cite{LiVi97}. For all these strings $h_x(R) + R \geq n-O(1)$.

Hence, the expected value of $h_x(R)$ equals
$2^{-n} \{ (2^n-2^{n-K(n)+O(1)}) [n-R+O(\log n)]
+ 2^{n-K(n)+O(1)} O(n-R+O(\log n)) \} = n-R + O(\log n) + O((n-R)/2^{K(n)})
=  n-R + o(n-R)$ whenever $\log n = o(n-R)$

4919: (since $K(n) \rightarrow \infty$ for $n \rightarrow \infty$).

4920: That is, the expectation of $h_x(R)$ equals $(1+o(1))D^*_1(R)

4921: =(1+o(1))D^*(R)$, the Distortion-Rate function, where the

4922: $o(1)$ term goes to 0 with the length $n$ of $x$. In

4923: Section~\ref{sec:esf} we extend this idea to non-uniform distributions

4924: on $X$.

4925: \end{example}

4926:

4927:

4928: For every $S\ni x$ we have

4929:          \begin{equation}\label{eq.descr}

4930: K(x)\leq K(S)+ \log |S| + O(1).

4931:           \end{equation}

4932: Indeed,

4933: consider the following \emph{two-part code}

4934: for $x$: the first part is

4935: a shortest  self-delimiting program $p$ of $S$ and the second

4936: part is

4937: $\lceil\log|S|\rceil$ bit long index of $x$

4938: in the lexicographical ordering of $S$.

4939: Since $S$ determines $\log |S|$ this code is self-delimiting

4940: and we obtain \eqref{eq.descr}

4941: where the constant $O(1)$ is

4942: the length of the program to reconstruct

4943: $x$ from its two-part code.

4944: We thus conclude that $K(x)\leq R+h_x(R)+O(1)$, that is, the

4945: function $h_x(R)$

4946: never decreases

4947: more than a fixed independent constant below

4948: the diagonal \emph{sufficiency line} $L$ defined by

4949: $L(R)+R = K(x)$,

4950: which is a lower bound on $h_x (R)$

4951: and is approached to within a constant distance by

4952: the graph of $h_x$ for certain $R$'s

4953: (for instance, for $R = K(x)+c_1$).

4954: For these $R$'s we

4955: thus have

4956: $R + h_x (R) = K(x)+O(1)$.

4957: In the terminology we have introduced in Section~\ref{sect.ss} and Definition~\ref{def:algsufstat},

4958: a model corresponding to such an $R$ (witness for

4959: $h_x(R)$) is an optimal set for $x$

4960: and a shortest program to compute this model

4961: is a sufficient statistic. It is

4962: {\em minimal} for the least such $R$ for which the above equality holds.
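As a sketch of how \eqref{eq.descr} is used, take the set $A_R$ of Example~\ref{ex.rd=str}, consisting of all strings of length $n$ that share their first $R$ bits with $x$. There $\log |A_R| = n-R$ and $K(A_R) \leq R + O(\log n)$, so the two-part code through $A_R$ has length
$$
K(A_R) + \log |A_R| \leq R + O(\log n) + (n-R) = n + O(\log n).
$$
For a string with $K(x) \geq n$ this is within $O(\log n)$ of $K(x)$ for every $0 \leq R \leq n$, so up to logarithmic terms the graph of $h_x$ follows the sufficiency line throughout. Whether the $O(\log n)$ slack can be reduced to $O(1)$ for a particular $R$ is not settled by this calculation.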

4963:

4964:

4965: \paragraph{MDL Estimator:}

4966: The length of the minimal two-part code for $x$ consisting

4967: of the model cost $K(S)$ and the

4968: length of the index of $x$ in

4969: $S$,

4970: the complexity of $S$ upper bounded by $R$, is given by

4971: the {\em MDL (minimum description length) function}:

4972:   \begin{equation}\label{eq.3}

4973:    \lambda_{x}(R) =

4974: \min_{S} \{\Lambda(S): S \ni x,\; K(S) \leq R\},

4975:   \end{equation}

4976: where $\Lambda(S)=\log|S|+K(S) \ge K(x)-O(1)$ is

the total length of the two-part code of $x$
with the help of model $S$.

4979:  Clearly,

4980: $\lambda_x (R) \leq  h_x(R)+ R +O(1)$,

4981: but a priori it is still possible that $ h_x(R')+ R'

4982: < h_x(R)+R$ for $R' < R$.

4983: In that case $\lambda_x(R) \leq

4984:  h_x(R')+ R'

4985: < h_x(R)+R$. However, in \cite{VV02} it is shown

4986: that $\lambda_x (R) =  h_x(R)+ R + O(\log n)$

4987: for all $x$ of length $n$. Even so, this doesn't mean that a set

4988: $S$ that witnesses $\lambda_x (R)$ in the sense that $x \in S$,

4989: $K(S) \leq R$, and $K(S)+\log |S|= \lambda_x (R)$,

4990: also witnesses $h_x(R)$. It can in fact be the case that $K(S) \leq R-r$,

4991: and $\log |S|= h_x (R)+r$ for arbitrarily large $r \leq n$.

4992:

4993:

4994: Apart from being convenient for the technical analysis

4995: in this work, $\lambda_x (R)$ is the

4996: celebrated two-part Minimum Description Length code

4997: length \cite{Ri89} with the

4998: model-code length restricted to at most $R$.

4999: When $R$ is large enough so that $\lambda_{x}(R) = K(x)$,

5000: then there is a set $S$ that is a sufficient statistic, and

5001: the smallest such $R$ has an associated witness set $S$ that

5002: is a minimal sufficient statistic.

5003:

5004: The most fundamental result in \cite{VV02}

5005: is the equality

5006:          \begin{equation}\label{eq.eq}

5007: \beta_x (R )  = h_x (R) + R - K(x) = \lambda_x (R)

5008: - K(x)

5009:          \end{equation}

5010: which holds within logarithmic additive terms in argument and value.

5011: Additionally, every set $S$ that witnesses the value $h_x (R )$

5012: (or $\lambda_x(R)$),

5013: also witnesses the value $\beta_x (R)$ (but not vice versa).

5014: It is easy to see that $h_x (R)$ and $\lambda_x(R)$

5015: are

5016: upper semi-computable (Definition~\ref{def.semi});

5017: but we have shown \cite{VV02}

5018: that $\beta_x (R)$ is neither upper nor lower semi-computable

5019: (not even within a great tolerance).

5020: A priori

5021: there is no reason to suppose that

5022: a set that witnesses $h_x (R)$

5023: (or $\lambda_x(R)$) also witnesses $\beta_x (R)$,

5024: for {\em every} $R$.

5025: But the fact that they do, vindicates

5026: Kolmogorov's original proposal

5027: and establishes $h_x$'s pre-eminence over $\beta_x$ -- the

5028: pre-eminence of $h_x$ over $\lambda_x$ is discussed below.
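As a sanity check on \eqref{eq.eq}, here is a sketch for a string $x$ of length $n$ with $K(x) \geq n$, ignoring $O(\log n)$ additive terms in argument and value. Example~\ref{ex.rd=str} gives $n-R-O(1) \leq h_x(R) \leq n-R+O(\log n)$, and the standard upper bound on prefix complexity gives $n \leq K(x) \leq n+O(\log n)$. Hence, for $K(n)+O(1) \leq R \leq n$,
$$
h_x(R) + R - K(x) = O(\log n), \qquad \lambda_x(R) - K(x) = O(\log n),
$$
which by \eqref{eq.eq} corresponds to $\beta_x(R)$ being $0$ up to logarithmic terms: such an $x$ is typical for the best model at every complexity level in this range.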

5029:

5030: \begin{remark}\label{rem.MLvsMDL}

5031: \rm

What we call `maximum likelihood' in the form of $h_x$ is really `maximum
likelihood under a complexity constraint $R$ on the models', as in
$h_x (R)$. In

5035: statistics, it is a well-known fact that maximum likelihood often

5036: fails (dramatically overfits) when the models under consideration are

5037: of unrestricted complexity (for example, with polynomial regression with

5038: Gaussian noise, or with Markov chain model learning, maximum

5039: likelihood will always select a model with $n$ parameters, where $n$ is

5040: the size of the sample---and thus typically, maximum likelihood will

5041: dramatically overfit, whereas for example MDL typically performs

5042: well). The equivalent, in our setting, is that allowing models of unconstrained

5043: complexity  for data $x$, say complexity $K(x)$,

5044: will result in the ML-estimator $h_x (K(x)+O(1))=0$---the witness model

5045: being the trivial, maximally overfitting, set $\{x\}$.

5046: In the MDL case, on the other hand, there may be a long constant

5047: interval with the MDL estimator

5048:  $\lambda_x (R) = K(x)$ ($R \in [R_1 , K(x)]$)

5049: where the length of the two-part code doesn't decrease anymore.

5050: Selecting the least complexity model witnessing this function value

5051: we obtain the, very significant, algorithmic {\em minimal} sufficient

5052: statistic, Definition~\ref{def:algsufstat}.

5053: In this sense, MDL augmented with a bias for the least complex explanation,

5054: which we may call the `Occam's Razor MDL',

5055: is superior to maximum likelihood and resilient to overfitting.

5056: If we don't apply bias in the direction of simple explanations,

5057: then -- at least in our setting --

5058: MDL may be just as prone to overfitting as is ML. For example,

5059: if $x$ is a typical random element of $\{0,1\}^n$, then

5060:  $\lambda_x (R) = K(x)+O(1)$ for the entire interval

5061: $K(n)+O(1) \leq R \leq K(x)+O(1) \approx n$.

5062: Choosing the model on the left side, of lowest complexity,

5063:  namely $K(n)$,

5064: gives us the best fit with the correct model $\{0,1\}^n$.

5065: But choosing a model on the right side, of high complexity, gives us

5066: a model $\{x\}$ of complexity $K(x)+O(1)$ that completely

5067: overfits the data by modeling all random noise in $x$

5068: (which in fact in this example almost completely consists of random noise).

5069: \index{overfit}

5070:

5071: Thus, it should be emphasized that `ML =

5072: MDL' really only holds if complexities are constrained to a value

5073: $R$ (that remains fixed as the sample size grows---note that in the

5074: Markov chain example above, the complexity grows linearly with

5075: the sample size); it certainly

5076: does not hold in an unrestricted sense (not even in the algorithmic setting).

5077: \end{remark}
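To make the last example of Remark~\ref{rem.MLvsMDL} concrete, the two-part code lengths at the two ends of the constant interval can be sketched as follows. The names $S_1, S_2$ are ours, and we assume for illustration that $x \in \{0,1\}^n$ has maximal complexity $K(x) = n + K(n) + O(1)$:
\begin{align*}
S_1 = \{0,1\}^n: &\quad K(S_1) + \log |S_1| = K(n) + O(1) + n = K(x) + O(1), \\
S_2 = \{x\}: &\quad K(S_2) + \log |S_2| = K(x) + O(1) + 0 = K(x) + O(1).
\end{align*}
Both sets achieve the same total code length $K(x)+O(1)$, but only $S_1$ keeps the model part down to $K(n)+O(1)$ bits; $S_2$ spends the entire code on the model.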

5078: \begin{remark}

5079: \rm

5080: In a sense, $h_x$ is more strict than $\lambda_x$:

5081: A set that witnesses $h_x(R)$ also witnesses

5082: $\lambda_x(R)$ but not necessarily vice versa. However,

5083: at those complexities $R$ where $\lambda_x (R)$ drops

5084: (a little bit of added complexity in the model allows a

5085: shorter description), the witness set of $\lambda_x$ is

5086: also a witness set of $h_x$. But if $\lambda_x$ stays

5087: constant in an interval $[R_1, R_2]$, then we

5088: can trade off the complexity of a witness set against its cardinality,

5089: keeping the description length constant. This is of course not possible

5090: with $h_x$ where the cardinality of the witness set at complexity $R$

5091: is fixed at $h_x(R)$.

5092: \end{remark}

5093:

5094: The main result  can be taken as a foundation and justification

5095: of common statistical principles in model

5096: selection such as maximum likelihood

5097: or MDL.

5098: The structure functions $\lambda_x,h_x$ and $\beta_x$ can assume all

5099: possible shapes over their full domain of definition (up to

5100: additive logarithmic precision in both argument and value), see \cite{VV02}.

5101: (This establishes

5102: the significance of \eqref{eq.eq}, since it shows that $\lambda_x (R)

5103: \gg K(x)$ is common for $(x, R)$ pairs---in which case the more

5104: or less

5105: easy fact that $\beta_x(R)=0$ for $\lambda_x(R)=K(x)$ is

5106: not

5107: applicable, and it is a priori unlikely that \eqref{eq.eq} holds:

5108: Why should minimizing a set containing

5109: $x$ also minimize its randomness deficiency? Surprisingly, it does!)

5110:  We have exhibited a---to our knowledge first---natural example,

5111: $\beta_x$, of a function that

5112: is not semi-computable but computable with an oracle for the halting problem.

5113:

5114:

5115: \begin{example}\label{ex.prnr}

5116: \index{randomness, positive}

5117: \index{randomness, negative}

5118: \rm

5119: {\bf ``Positive'' and ``Negative'' Individual Randomness:}

5120: In \cite{GTV01} we showed the existence

5121: of strings for which essentially

5122: the singleton set consisting of the string itself is a minimal

5123: sufficient statistic. While a sufficient

5124: statistic of an object yields a two-part code that is as short as the shortest

5125: one part code, restricting the complexity of the allowed statistic

5126: may yield two-part codes that are considerably longer than the best one-part

5127: code (so the statistic is insufficient).

5128: In fact,

5129: for every object there is a complexity bound below which this happens---but

5130: if that bound is small (logarithmic) we call the object ``stochastic''

5131: since it has a simple satisfactory explanation (sufficient statistic).

5132: Thus, Kolmogorov in \cite{Ko74}

5133:  makes the important distinction of

5134: an object being random in the ``negative'' sense by having this bound

5135: high (it has high complexity and is not a typical element of

5136: a low-complexity model),

5137: and an object being random in the ``positive,

5138: probabilistic'' sense by both having this bound small and itself

5139: having complexity considerably exceeding this bound

5140: (like a string $x$ of length $n$ with $K(x) \geq n$,

5141: being typical for the

5142: set $\{0,1\}^n$, or the uniform probability distribution over that

5143: set,

5144: while this set or probability distribution

5145: has complexity $K(n)+O(1) = O(\log n)$).

5146: We depict the distinction in Figure~\ref{figure.pos_negrandom}.

5147: In simple terms: High Kolmogorov complexity of a data string

5148: just means that it is random in a {\em negative sense};

5149: but a data string of high Kolmogorov

5150: complexity is {\em positively random} if the simplest satisfactory explanation

5151: (sufficient statistic) has low complexity,

5152: and it therefore is the typical outcome

5153: of a simple random process.

5154:

5155: \begin{figure}

5156: \begin{center}


5158: \epsfxsize=8cm \epsfbox{pos_negrandom.eps}

5159: \end{center}

5160: \caption{Data string $x$ is ``positive random'' or ``stochastic''

5161:  and data string $y$

5162: is just ``negative random'' or ``non-stochastic''.}

5163: \label{figure.pos_negrandom}

5164: \end{figure}

5165:

5166:

5167: In \cite{VV02} it is shown that for every length $n$ and

5168: every complexity $k \leq n+K(n) + O(1)$ (the maximal complexity

5169: of $x$ of length $n$) and every $R \in [0,k]$,

5170: there are $x$'s of length $n$ and complexity $k$ such that

5171: the minimal randomness deficiency $\beta_x (i) \geq  n-k\pm O(\log

5172: n)$

5173: for every $i \leq R \pm O(\log n)$ and $\beta_x (i) = 0 \pm O(\log n)$

5174: for every $i > R \pm O(\log n)$. Therefore, the set of  $n$-length

5175: strings of every complexity $k$ can be partitioned in subsets of strings that

5176: have a Kolmogorov minimal sufficient statistic of complexity

5177: $\Theta (i \log n)$ for $i = 1, \ldots , k/ \Theta (\log n)$.

5178: For instance, there are $n$-length non-stochastic

5179: strings of almost maximal complexity $n -  \sqrt{n}$

5180: having significant $\sqrt{n}\pm O(\log n)$ randomness deficiency with

5181: respect to $\{0,1\}^n$ or, in fact, every other finite set

5182: of complexity less than $n - O(\log n)$!

5183: \end{example}

5184:

5185:

5186:

5187: \subsubsection{Probability Models}

5188: \label{s.prob}

5189: The structure function (and of course the sufficient statistic) use

5190: properties of data strings modeled by finite sets, which amounts to

5191: modeling data by uniform distributions. As already

5192: observed by Kolmogorov himself, it turns out

5193: that this is no real restriction.

5194: Everything holds also for computable probability mass functions

5195: (probability models), up to additive logarithmic precision.  Another

5196: version of $h_x$ uses probability models $f$ rather than finite set

5197: models. It is defined as $h'_x(R) = \min_{f} \{\log 1/f(x): f(x)>0,

5198: K(f) \leq R\}$.  Since $h'_x(R)$ and $h_x(R)$ are close by

5199: Proposition~\ref{prop.1} below, Theorem~\ref{thm.dresf} and

5200: Corollary~\ref{cor.esf} also apply to $h'_x$ and the distortion-rate

5201: function $D^*(R)$ based on a variation of the

5202: Shannon-Fano distortion measure

5203: defined by using encodings $Y(x)=f$ with $f$ a computable

5204: probability distribution. In this context,

5205: the Shannon-Fano distortion measure

5206: is defined by

5207: \begin{equation}\label{eq.sfdf}

5208: d'(x,f)=  \log 1/f(x).

5209: \end{equation}

5210: It remains to show that probability models are essentially the same as

5211: finite set models.  We restrict ourselves to the model class of {\em

5212:   computable probability distributions}. Within the present section,

5213: we assume these are defined on strings of arbitrary length; so they

5214: are represented by mass functions $f: \{0,1\}^* \rightarrow [0,1]$

5215: with $\sum f(x) = 1$ being computable according to

5216: Definition~\ref{def.enum.funct}.  A string $x$ is typical for a

5217: distribution $f$ if the randomness deficiency $ \delta (x \mid f) =

5218: \log 1/ f(x) - K(x \mid f) $ is small. The conditional complexity $K(x

5219: \mid f)$ is defined as follows. Say that a function $A$ approximates

5220: $f$ if $|A(y,\eps)-f(y)|<\eps$ for every $y$ and every positive

5221: rational $\eps$. Then $K(x \mid f)$ is the minimum length of a program

5222: that given every function $A$ approximating $f$ as an oracle prints

5223: $x$.  Similarly, $f$ is $c$-optimal for $x$ if $ K(f) + \log 1/ f(x)

5224: \leq K(x)+c $.  Thus, instead of the data-to-model code length

5225: $\log|S|$ for finite set models, we consider the data-to-model code

5226: length $\log 1/ f(x)$ (the Shannon-Fano code). The value $\log 1/f(x)$

5227: measures also how likely $x$ is under the hypothesis $f$. The

5228: mapping $x\mapsto f_{\min}$ where $f_{\min}$ minimizes $\log 1/f(x)$

5229: over $f$ with $K(f)\le R$ is a \emph{maximum likelihood

5230:   estimator}, see Figure~\ref{figure.MLestimator}. Our results thus

5231: imply that the maximum likelihood estimator always returns a

5232: hypothesis with minimum randomness deficiency.

5233:

5234: \begin{figure}

5235: \begin{center}


5237: \epsfxsize=8cm \epsfbox{MLestimator.eps}

5238: \end{center}

5239: \caption{Structure function $h_x(i)= \min_f \{ \log 1/ f(x): f(x)>0, \; K(f) \leq

5240: i\}$ with $f$ a computable

5241: probability mass function, with values according to the left

5242: vertical coordinate, and the maximum likelihood estimator $2^{-h_x(i)}=

5243: \max \{f(x): f(x)>0 , \; K(f) \leq i\}$,

5244: with values according to the right-hand side vertical coordinate.}

5245: \label{figure.MLestimator}

5246: \end{figure}

5247:

5248:

5249:

5250: It is easy to show that for every data string $x$

5251: and a contemplated finite set model for it, there

5252: is an almost equivalent computable probability model.

5253: The converse is slightly harder:

5254: for every data string $x$ and a contemplated

5255: computable probability  model for it,

5256: there is a finite set model for $x$ that has no worse complexity,

5257: randomness deficiency, and worst-case data-to-model code for $x$,

5258: up to additive logarithmic precision:

5259:

5260:

5261: \begin{proposition}\label{prop.1}

5262: (a) For every $x$ and every finite set $S \ni x$ there is

5263: a computable probability

5264: mass function $f$ with $\log 1/f(x) =\log|S|$,

5265: $\delta(x \mid f)=\delta(x \mid S)+O(1)$

5266: and $K(f) = K(S)+ O(1)$.

5267:

5268: (b)

5269:    There are constants $c,C$, such that

5270:     for every string $x$, the following holds:

5271:     For every computable probability

5272:     mass function $f$

5273:     there is a finite set $S \ni x$

5274:     such that $\log |S| <  \log 1/ f(x)+1$, $\delta(x \mid S)

5275: \le \delta(x \mid f)+    2\log K(f)+K(\lfloor \log 1/

5276: f(x)\rfloor)+2\log K(\lfloor \log 1/ f(x)\rfloor)+C$

5277:     and

5278:     $K(S) \leq  K(f) + K(\lfloor \log 1/f(x)\rfloor)+C$.

5279:

5280:

5281: \end{proposition}

5282:

5283: \begin{proof}

5284: (a) Define $f(y)= 1/|S|$ for $y \in S$

5285: and 0 otherwise.

5286:

5287: (b) Let $m=\lfloor \log 1/f(x)\rfloor$, that is,

5288: $2^{-m-1}<f(x)\le 2^{-m}$.

5289: Define $S = \{y:  f(y)

5290: > 2^{-m-1}\}$. Then,

5291: $|S|<2^{m+1} \leq 2/f(x)$,

5292: which implies the claimed value for $\log |S|$.

5293: To list $S$ it suffices to compute all consecutive values of $f(y)$ to

5294: sufficient precision

5295: until the combined probabilities exceed $1-2^{-m-1}$.

5296: That is, $K(S) \leq

5297: K(f)+ K(m)+O(1)$.

5298: Finally,

5299: $\delta(x \mid S)=\log|S|-K(x|S^*)< \log 1/f(x)-K(x \mid S^*)+1=

5300: \delta(x \mid f)+K(x \mid f)-K(x \mid S^*)+1\le \delta(x \mid f)+K(S^* \mid f)+O(1)$.

5301: The term $K(S^* \mid f)$ can be upper bounded

5302: as $K(K(S))+K(m)+O(1)\le 2\log K(S)+K(m)+O(1)

5303: \le 2\log (K(f)+K(m))+K(m)+O(1)

5304: \le 2\log K(f)+2\log K(m)+K(m)+O(1)$, which implies the claimed bound for

5305:  $\delta(x \mid S)$.

5306:

5307: \end{proof}
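As a small numerical check of the construction in part (b), take the purely illustrative value $f(x)=3/16$ (our assumption, not a value from the text). Then
\[
m=\lfloor \log (16/3) \rfloor = 2, \qquad 2^{-3} < 3/16 \le 2^{-2}, \qquad S=\{y: f(y)>1/8\}.
\]
Since the values of $f$ sum to at most $1$, fewer than $8$ strings can have probability exceeding $1/8$, so $|S| \le 7$ and
\[
\log |S| \le \log 7 < 3 < \log (16/3) + 1 \approx 3.415,
\]
in agreement with the claimed bound $\log |S| < \log 1/f(x) + 1$.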

5308:

5309: How large are the nonconstant additive complexity terms in

5310: Proposition~\ref{prop.1} for strings $x$ of length $n$? In item (b),

5311: we are commonly only interested

5312: in $f$ such that $K(f)\le n+O(\log n)$ and

5313: $\log 1/f(x)\le n+O(1)$.

5314: Indeed, for every $f$ there is $f'$ such that

5315: $K(f')\le \min\{K(f),n\}+O(\log n)$,

5316: $\delta(x \mid f')\le \delta(x \mid f)+O(\log n)$,

5317: $\log 1/f'(x)\le\min\{\log1/ f(x),n\}+1$.

5318: Such $f'$ is defined as follows: If

5319: $K(f)>n$ then $f'(x)=1$ and $f'(y)=0$ for every $y\ne x$;

5320: otherwise $f'=(f+U_n)/2$ where $U_n$ stands for

5321: the uniform distribution

5322: on $\{0,1\}^n$.

5323: Then the additive terms in item (b) are $O(\log n)$.
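The bound on $\log 1/f'(x)$ in the second case can be checked directly. A short sketch, assuming $l(x)=n$ so that $U_n(x)=2^{-n}$:
\[
f'(x) = \frac{f(x)+2^{-n}}{2} \ge \frac{\max\{f(x),2^{-n}\}}{2},
\qquad\mbox{hence}\qquad
\log \frac{1}{f'(x)} \le \min\{\log 1/f(x),\, n\} + 1 .
\]
In the first case $f'(x)=1$, so $\log 1/f'(x)=0$ and the bound holds trivially.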

5324:

5325: \subsection{Expected Structure Function Equals Distortion--Rate Function}

5326: \label{sec:esf}

5327: In this section we treat the general relation between the expected

5328: value of $h_x(R)$, the expectation taken on a distribution

5329: $f(x)=P(X=x)$ of the random variable $X$ having outcome $x$, and

5330: $D^*(R)$. This involves the development of a rate-distortion theory

5331: for individual sequences and arbitrary computable distortion measures.

5332: Following \cite{VereshchaginV04}, we outline such a theory in

5333: Sections~\ref{sec:spheres}--\ref{sec:ssrev}. Based on this theory, we

5334: present in Section~\ref{sec:esfb} a general theorem

5335: (Theorem~\ref{thm.dresf}) relating Shannon's $D^*(R)$ to the expected

5336: value of $h_x(R)$, for arbitrary random sources and computable

5337: distortion measures. This generalizes Example~\ref{ex.rd=str} above,

5338: where we analyzed the case of the distortion function

5339: \begin{equation}\label{eq.lcfs1}

5340: d(x,Y(x))

5341: = \log |Y(x)|,

5342: \end{equation}

5343: where $Y(x)$ is an $x$-containing finite set,

5344: for the uniform distribution. Below we first extend this example to

5345: arbitrary generating distributions, keeping the distortion function

5346: still fixed to \eqref{eq.lcfs1}. This will prepare us for the general

5347: development in Sections~\ref{sec:spheres}--\ref{sec:ssrev}.

5348: \begin{example}

5349: In Example~\ref{ex.rd=str}

5350: it transpired that

5351: the distortion-rate function is the expected structure function,

5352: the expectation taken over the distribution on the $x$'s.

5353: %Part of the required treatment was already introduced in

5354: %Example~\ref{rem.rd-ksf1}.

5355: If, instead of using the uniform

5356: distribution on $\{0,1\}^n$ we use an arbitrary distribution $f(x)$,

5357: it is not difficult to compute the rate-distortion

5358: function $R^*(D)= H(X)- \sup_{Y:d(X,Y) \leq D} H(X|Y)$ where

5359: $Y$ is a random variable with outcomes that are finite sets. Since $d$

5360: is a special type of Shannon-Fano distortion, with

5361: $d(x,y) = \log 1/P(X=x \mid Y=y) = \log |y|$ if $x \in y$, and $\infty$ otherwise,

5362: we have already met

5363: $D^*(R)$ for the distortion measure \eqref{eq.lcfs1} in another guise.

5364: By the conclusion of Example~\ref{ex:berb}, generalized to the random

5365: variable $X$ having outcomes in $\{0,1\}^n$, and $R$ being a rate

5366: in between 0 and $n$, we know that

5367: \begin{equation}\label{eq.DRE}

5368: D^*(R) = H(X)-R.

5369: \end{equation}

5370: \end{example}

In the particular case analyzed above, the code word for a source word

5371: is a finite set containing the source word, and the distortion is the

5372: log-cardinality of the finite set. Considering the set of source words

5373: of length $n$, the distortion-rate function is the diagonal line $D^*(R)=n-R$ from

5374: distortion $n$ at rate $0$ to distortion $0$ at rate $n$.  The structure functions of the individual data $x$ of

5375: length $n$, on the other hand, always start at $n$, decrease at a

5376: slope of at least $-1$ until they hit the diagonal from $K(x)$ to

5377: $K(x)$, which they must do, and follow the diagonal henceforth. Above

5378: we proved that the average of the structure function is simply the

5379: straight line, the diagonal, between $n$ and $n$. This is the case,

5380: since the strings $x$ with $K(x) \geq n$ are the overwhelming

5381: majority. All of them have a minimal sufficient statistic (the point

5382: where the structure function hits the diagonal from $K(x)$ to $K(x)$).

5383: This point has complexity at most $K(n)$. The structure function for

5384: all these $x$'s follows the diagonal from about $n$ to $n$, giving

5385: overall an expectation of the structure function close to this

5386: diagonal, that is, the probabilistic distortion-rate function for this

5387: code and distortion measure.

5388: \subsubsection{Distortion Spheres}

5389: \label{sec:spheres}

5390: Modeling the data can be viewed as

5391: encoding the data by a model: the data are source words

5392: to be coded, and models are

5393: code words for the data. As before, the set of possible data is

5394: ${\cal X} = \{0,1\}^n$. Let ${\cal R}^+$ denote the set

5395: of non-negative real numbers.

5396: For every model class ${\cal Y}$  (particular set of code words)

5397: we choose an appropriate

5398: recursive function

5399: $d: {\cal X} \times {\cal Y} \rightarrow {\cal R}^+$ defining

5400: the {\em distortion} $d(x,y)$ between data $x \in {\cal X}$ and model $y \in {\cal Y}$.

5401: \begin{remark}[Lossy Compression]

5402: \label{rem:lossy}

5403: \rm

5404: The choice of distortion

5405: function is a selection of which aspects of the data are relevant,

5406: or meaningful, and

5407: which aspects are irrelevant (noise).

5408: We can think of the distortion-rate function as measuring how far the model at

5409: each bit-rate

5410: falls short in representing the data. Distortion-rate theory

5411: underpins the practice of lossy compression.

5412: For example, lossy compression of a sound file gives as ``model''

5413: the compressed file where, among others, the very high and

5414: very low inaudible frequencies have been suppressed. Thus,

5415: the rate-distortion function will penalize the deletion of the inaudible

5416: frequencies but lightly because they are not relevant for the auditory

5417: experience.

5418:

5419: But in the traditional distortion-rate approach, we average twice:

5420: once because we consider

5421: a sequence of outcomes of $m$ instantiations of the same random variable,

5422: and once because we take the expectation

5423: over the sequences. Essentially, the results deal with typical ``random'' data

5424: of certain simple distributions. This assumes that the data to a certain extent

5425: satisfy the behavior of repeated outcomes of a random source.

5426: Kolmogorov \cite{Ko65}:

5427: \begin{quote}

5428: The probabilistic approach is natural in the theory of information

5429: transmission over communication channels carrying ``bulk'' information

5430: consisting of a large number of unrelated or

5431: weakly related messages obeying

5432: definite probabilistic laws. In this type of problem there is

5433: a harmless and (in applied work) deep-rooted tendency to mix up

5434: probabilities and frequencies within sufficiently long time sequence

5435: (which is rigorously satisfied if it is assumed that ``mixing''

5436: is sufficiently rapid). In practice,

5437: for example, it can be assumed that finding the ``entropy''

5438: of a flow of congratulatory telegrams and the channel ``capacity'' required

5439: for timely and undistorted transmission is validly represented by a

5440: probabilistic treatment even with the usual substitution of empirical

5441: frequencies for probabilities. If something goes wrong here,

5442: the problem lies with the vagueness of our ideas of the relationship

5443: between mathematical probabilities and real random events in general.

5444:

5445: But what real meaning is there, for example, in asking how much

5446: information is contained in ``War and Peace''?

5447: Is it reasonable to include the novel in the set of ``possible novels'',

5448: or even to postulate some probability distribution for this set?

5449: Or, on the other hand, must we assume that the individual scenes in

5450: this book form a random sequence with ``stochastic relations'' that damp out

5451: quite rapidly over a distance of several pages?

5452: \end{quote}

5453: Currently, individual data arising in practice are submitted to

5454: analysis, for example sound or video files, where the assumption that

5455: they either consist of a large number of weakly related messages, or

5456: are elements of a set of possible messages that is susceptible to

5457: analysis, is clearly wrong.  It is precisely the globally related aspects

5458: of the data which we want to preserve under lossy compression.  The

5459: rich versatility of the structure functions, that is, many different

5460: distortion-rate functions for different individual data, is all but

5461: obliterated in the averaging that goes on in the traditional

5462: distortion-rate function.  In the structure function approach one

5463: focuses entirely on the stochastic properties of one data item.

5464: %Analyzing the situation, we see that (i) the structure functions of

5465: %all typical data items are about the same; (ii) the probability mass

5466: %concentrated on the typical data items is almost one; (iii) a sequence

5467: %of outcomes of i.i.d. distributed random variables (or for that

5468: %matter, ergodic stationary sources) consists primarily of typical data

5469: %items. This is enough to ensure that the distortion-rate function is

5470: %approximately the structure function of a typical data item, which is

5471: %the essence of Theorem~\ref{thm.dresf} below.

5472: \end{remark}

5473: Below we follow \cite{VereshchaginV04}, where we developed a

5474: rate-distortion theory for individual data for general computable

5475: distortion measures, with as specific examples the `Kolmogorov'

5476: distortion below, but also Hamming distortion and Euclidean

5477: distortion. This individual rate-distortion theory is summarized in

5478: Sections~\ref{sec:rdrev} and~\ref{sec:ssrev}. In

5479: Section~\ref{sec:esfb}, Theorem~\ref{thm.dresf},

5480: we connect this individual rate-distortion theory to Shannon's. We

5481: emphasize that the typical data items of i.i.d. distributed simple

5482: random variables, or simple ergodic stationary sources, which are the

5483: subject of Theorem~\ref{thm.dresf}, are generally unrelated to the

5484: highly globally structured data we want to analyze using our new

5485: rate-distortion theory for individual data. From the perspective of

5486: lossy compression, the typical data have the characteristics of random

5487: noise, and there is no significant ``meaning'' to be preserved under

5488: the lossy compression.  Rather, Theorem~\ref{thm.dresf} serves as a

5489: `sanity check' showing that in the special, simple case of repetitive

5490: probabilistic data, the new theory behaves essentially like Shannon's

5491: probabilistic rate-distortion theory.

5492: \begin{example}\label{ex.11}

5493: \rm

5494: Let us look at various model classes and distortion measures:

5495:

5496: (i) The set of models are the finite sets of finite binary strings.

5497: Let $S \subseteq \{0,1\}^*$ and $|S| < \infty$.

5498: We define $d(x,S) = \log |S|$ if $x \in S$, and $\infty$ otherwise.

5499:

5500: (ii) The set of models are the computable probability density functions $f$

5501: mapping $\{0,1\}^*$ to $[0,1]$.

5502: We define $d(x,f) =  \log 1/f(x)$ if $f(x) > 0$, and $\infty$ otherwise.

5503:

5504: (iii) The set of models are the total recursive functions  $f$

5505: mapping $\{0,1\}^*$ to ${\cal N}$.

5506: We define $d(x,f) = \min \{ l(d): f(d)=x\}$, and $\infty$ if

5507: no such $d$ exists.

5508:

5509: All of these model classes and accompanying

5510: distortions \cite{VV02}, together with the ``communication exchange'' models

5511: in \cite{BKVV03}, are loosely called {\em Kolmogorov} models

5512: and distortion, since the graphs of their structure functions (individual

5513: distortion-rate functions) are all within a strip---of width

5514: logarithmic in the binary length of the data---of one another.

5515: \end{example}

5516: If ${\cal Y}$ is a model class, then

5517: we consider {\em distortion spheres} of given

5518: radius $r$ centered on $y \in {\cal Y}$:

5519: \[

5520: B_y(r)= \{x: d(x,y) = r\}.

5521: \]

5522: This way, every model class and distortion measure can be treated

5523: similarly to the canonical finite set case, which, however, is

5524: especially simple in that the radius is not variable.

5525: That is, there is only one distortion sphere  centered on a given finite set,

5526: namely the one with radius equal to the log-cardinality of that finite set.

5527: In fact, that distortion sphere equals the finite set on which it is

5528: centered.
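As an illustration of a model class with variable radius, consider Hamming distortion, which was mentioned above as one of the examples treated in \cite{VereshchaginV04}. We assume here, for illustration only, the unnormalized form in which data and models are both strings in $\{0,1\}^n$ and $d(x,y)$ counts the positions in which $x$ and $y$ differ; the cited work may use a different normalization. Then
\[
|B_y(r)| = \binom{n}{r}, \qquad r = 0, 1, \ldots , n .
\]
For instance, with $n=4$ and $y=0000$ we get $B_y(1) = \{1000, 0100, 0010, 0001\}$ and $\log |B_y(1)| = 2$. In contrast with the finite set case, every model $y$ is here the center of $n+1$ different distortion spheres.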

5529:

5530: \subsubsection{Randomness Deficiency---Revisited}

5531: \label{sec:rdrev}

5532: Let ${\cal Y}$ be a model class and $d$ a distortion measure.

5533: Since in our definition the distortion is recursive,

5534: given a model $y \in {\cal Y}$ and radius $r$,

5535: the elements in the distortion sphere

5536: of radius $r$ can be recursively enumerated from the distortion function.

5537: Giving the index of any element $x$ in that enumeration we can find the

5538: element. Hence, $K(x|y,r) \lea \log |B_y(r)|$. On the other hand,

5539: the vast majority of elements $x$ in the distortion sphere have

5540: complexity $K(x|y,r) \gea  \log |B_y(r)|$ since, for every constant $c$,

5541:  there are only

5542: $2^{\log |B_y(r)|-c} - 1$ binary programs of length $ < \log |B_y(r)|-c$

5543: available, and there are $|B_y(r)|$ elements to be described.

5544: We can now reason as in the similar case of finite set models.

5545: With data $x$ and $r=d(x,y)$,

5546: if $K(x|y,d(x,y))

5547: \gea \log |B_y(d(x,y))|$, then $x$ belongs to every large majority of elements

5548: (has the property represented by that majority)

5549: of the distortion sphere $B_y(d(x,y))$,

5550: provided that property is simple in the

5551: sense of having a description of low Kolmogorov complexity.
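The counting argument above can be made quantitative. Writing $N=|B_y(r)|$ (our shorthand), the number of $x \in B_y(r)$ with $K(x|y,r) < \log N - c$ is at most the number of programs of that length:
\[
|\{x \in B_y(r): K(x|y,r) < \log N - c\}| \le 2^{\log N - c}-1 < 2^{-c} N .
\]
For instance, with $c=10$, fewer than a fraction $2^{-10} = 1/1024$ of the elements of the sphere have a description more than $10$ bits shorter than $\log N$.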

5552: \begin{definition}

5553: \rm

5554: The {\em randomness

5555: deficiency} of $x$ with respect to model $y$ under distortion $d$

5556: is defined as

5557: \[

5558: \delta (x \mid y) = \log |B_y (d(x,y))| - K(x|y,d(x,y)).

5559: \]

5560: Data $x$ is {\em typical} for model $y \in {\cal Y}$ (and that model

5561: ``typical'' or ``best fitting'' for $x$) if

5562: \begin{equation}\label{eq.typical}

5563: \delta (x \mid y)  \eqa 0.

5564: \end{equation}

5565: \end{definition}

5566: If $x$ is typical for a model $y$, then the shortest way to effectively

5567: describe $x$, given $y$, takes about as many bits as the

5568: descriptions of the great

5569: majority of elements in

5570: a recursive enumeration of the distortion sphere.

5571: So there are no special simple properties that distinguish $x$

5572: from the great majority of elements

5573: in the distortion sphere: they are all typical or random elements

5574: in the distortion sphere (that is, with respect to the contemplated model).

5575: \begin{example}

5576: \rm

5577: Continuing Example~\ref{ex.11} by applying \eqref{eq.typical}

5578: to different model classes:

5579:

5580: (i) {\em Finite sets:}

5581:  For finite set models $S$, clearly $K(x|S) \lea \log |S|$.

5582: Together with \eqref{eq.typical} we have that $x$ is typical for $S$,

5583: and $S$ best fits $x$, if the randomness deficiency

5584: according to \eqref{eq:randomness-deficiency} satisfies

5585: $\delta(x|S) \eqa 0$.

5586:

5587: (ii) {\em Computable probability density functions:}

5588: Instead of the data-to-model code length $\log|S|$ for

5589: finite set models, we consider the data-to-model code length

5590: $\log 1/f(x)$ (the Shannon-Fano code). The value $\log 1/f(x)$

5591: measures how likely $x$ is under the hypothesis $f$.

5592:  For probability models $f$,

5593: define the conditional complexity

5594: $K(x \mid f, \lceil  \log 1/f(x) \rceil )$ as follows.

5595: Say that a function

5596: $A$ approximates $f$ if $|A(x,\eps)-f(x)|<\eps$

5597: for every $x$ and every positive rational

5598: $\eps$. Then $K(x \mid f , \lceil \log 1/f(x) \rceil)$ is defined as

5599: the minimum length

5600: of a program that, given $\lceil \log 1/f(x) \rceil$

5601: and any function $A$ approximating $f$

5602: as an oracle, prints $x$.

5603:

5604: Clearly

5605: $K(x|f, \lceil \log 1/f(x) \rceil ) \lea \log 1/f(x)$.

5606: Together with \eqref{eq.typical}, we have that $x$ is typical for $f$,

5607: and $f$ best fits $x$, if

5608: $K(x|f, \lceil \log 1/f(x) \rceil) \gea \log |\{z:  \log 1/f(z) \leq

5609:  \log 1/f(x)\}|$. The right-hand side set condition is the same

5610: as $f(z) \geq f(x)$, and there can be only $\leq 1/f(x)$ such $z$,

5611: since otherwise the total probability exceeds 1. Therefore,

5612: the requirement, and hence typicality,

5613: is implied by $K(x|f, \lceil \log 1/f(x) \rceil ) \gea \log 1/f(x)$.

5614: Define  the randomness

5615: deficiency by

5616: $

5617: \delta (x \mid f) =   \log 1/f(x) - K(x \mid f, \lceil \log 1/f(x) \rceil).

5618: $

5619: Altogether, a string $x$ is {\em typical for a distribution} $f$,

5620: or $f$ is the {\em best fitting model} for $x$,

5621: if $\delta (x \mid f) \eqa 0$.


5623:

5624: (iii) {\em Total Recursive Functions:}

5625: In place of $\log|S|$ for finite set models

5626: we consider the data-to-model code length (actually, the distortion

5627: $d(x,f)$ above)

5628: $$\len xf=\min\{l(d):f(d)=x\}.$$

5629: Define the conditional complexity

5630: $K(x \mid f, \len xf )$ as

5631: the minimum length

5632: of a program that, given $\len xf$ and an oracle for  $f$,

5633: prints $x$.

5634:

5635: Clearly, $K(x|f, \len xf ) \lea \len xf$.

5636: Together with \eqref{eq.typical}, we have that $x$ is typical for $f$,

5637: and $f$ best fits $x$, if $K(x|f, \len xf ) \gea \log |\{z: \len zf

5638:  \leq  \len xf \}|$. There are at most $(2^{\len xf +1} - 1)$-

5639: many $z$ satisfying the set condition since

5640: every such $z$ equals $f(d)$ for some $d \in \{0,1\}^*$ with $l(d) \leq \len xf$.  Therefore,

5641: the requirement, and hence typicality,

5642: is implied by $K(x|f, \len xf ) \gea \len xf$.

5643: Define  the  randomness

5644: deficiency by

5645: $

5646: \delta (x \mid f) =   \len xf - K(x \mid f, \len xf ).

5647: $

5648: Altogether, a string $x$ is {\em typical for a total recursive

5649: function} $f$, and $f$ is the {\em best fitting recursive function model}

5650: for $x$

5651: if $\delta (x \mid f) \eqa 0$, or written differently,

5652: \begin{equation}\label{eq.typp}

5653: K(x|f, \len xf ) \eqa \len xf.

5654: \end{equation}

5655: Note that since $\len xf$ is given as conditional information,

5656: with $\len xf = l(d)$ and $f(d)=x$, the quantity $K(x|f, \len xf )$

5657: represents the number of bits in a shortest

5658: {\em self-delimiting} description of $d$.

5659: \end{example}

5660:

5661:

5662: \begin{remark}

5663: \rm

5664: We required $\len xf$ in the conditional in \eqref{eq.typp}.

5665: This is the information about

5666: the radius of the distortion sphere centered on the model concerned.

5667: Note that in the canonical finite set model case, as treated

5668: in \cite{Ko74,GTV01,VV02}, every model has a fixed radius which

5669: is explicitly provided by the model itself. But in the

5670: more general model

5671: classes of computable probability density functions, or

5672: total recursive functions, models can have a variable radius.

5673: There are subclasses of the more general models that

5674: have fixed radiuses (like the finite set models).

5675:

5676: (i) In the computable probability density functions one can think of the

5677: probabilities with a finite support, for example $f_n (x) = 1/2^n$

5678: for $l(x)=n$, and $f_n(x)=0$ otherwise.

5679:

5680: (ii) In the total recursive function case one can similarly think

5681: of functions with finite support, for example $f_n (x) = \sum_{i=1}^n x_i$

5682: for $x=x_1 \ldots x_n$, and $f_n(x)=0$ for $l(x) \neq n$.

5683:

5684: The incorporation of the radius in the model will increase the

5685: complexity of the model, and hence of the minimal sufficient statistic

5686: below.

5687: \end{remark}
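
\begin{example}
\rm
As an illustration of item (i) of the preceding remark (a sketch added
for concreteness, not a new result), take the uniform distribution
$f_n$ with $f_n(x)=2^{-n}$ for $l(x)=n$. Then $\log 1/f_n(x) = n$ for
every $x$ of length $n$, so the randomness deficiency becomes
\[
\delta (x \mid f_n) = n - K(x \mid f_n, n).
\]
Hence $x$ is typical for $f_n$ precisely when
$K(x \mid f_n, n) \eqa n$, that is, when $x$ is incompressible given
its length. The radius $\lceil \log 1/f_n(x) \rceil = n$ is the same
for all $x$ in the support, which is the fixed-radius situation of the
finite set models.
\end{example}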

5688:

5689: \subsubsection{Sufficient Statistic---Revisited}

5690: \label{sec:ssrev}

5691: As with the probabilistic sufficient statistic

5692: (Section~\ref{sec:probstat}), a statistic is a function mapping the

5693: data to an element (model) in the contemplated model class. With some

5694: sloppiness of terminology we often call the function value (the model)

5695: also a statistic of the data.  A statistic is called sufficient if the

5696: two-part description of the data by way of the model and the

5697: data-to-model code is as concise as the shortest one-part description

5698: of $x$.  Consider a model class ${\cal Y}$.

5699: \begin{definition}

5700: A model $y \in {\cal Y}$ is a {\em sufficient statistic} for $x$ if

5701: \begin{equation}\label{eq.ssm}

5702: K(y, d(x,y))+ \log |B_y(d(x,y))| \eqa K(x).

5703: \end{equation}

5704: \end{definition}

5705:

5706: \begin{lemma}\label{lem.V2}

5707: If $y$ is a sufficient statistic for $x$, then

5708: $K(x \mid y, d(x,y)) \eqa  \log |B_y(d(x,y))|$, that is,

5709: $x$ is typical for $y$.

5710: \end{lemma}

5711: \begin{proof}

5712: We can rewrite

5713: $K(x) \lea K(x,y,d(x,y)) \lea K(y,d(x,y))+K(x|y,d(x,y))

5714: \lea K(y, d(x,y))+ \log |B_y(d(x,y))| \eqa K(x)$.

5715: The first three inequalities are straightforward and

5716: the last equality is by the assumption of sufficiency.

5717: Altogether, the first sum equals the second sum, which implies the lemma.

5718: \end{proof}
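
The last step of the proof can be spelled out as follows (a worked
restatement of the argument above, adding no new assumption). The chain
begins and ends with $K(x)$, so every $\lea$ in it must hold with
equality up to an additive constant. In particular,
\[
K(y,d(x,y))+K(x \mid y,d(x,y)) \eqa K(y, d(x,y))+ \log |B_y(d(x,y))| ,
\]
and subtracting $K(y,d(x,y))$ from both sides gives
$K(x \mid y,d(x,y)) \eqa \log |B_y(d(x,y))|$.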

5719:

5720: Thus, if $y$ is a sufficient statistic for $x$, then $x$ is a typical element

5721: for $y$, and $y$ is the best fitting model for $x$.

5722: Note that the converse implication,  ``typicality'' implies

5723: ``sufficiency,'' is not valid. Sufficiency is a special type

5724: of typicality, where the model does not add significant

5725: information to the data, since the preceding proof shows

5726: $K(x) \eqa K(x,y,d(x,y))$. Using the symmetry of information \eqref{eq.soi}

5727: this shows that

5728: \begin{equation}\label{eq.pcondx}

5729: K(y,d(x,y) \mid x ) \eqa K(y \mid x) \eqa 0.

5730: \end{equation}

5731: This means that:

5732:

5733: (i) A sufficient statistic  $y$ is determined by the data in the sense

5734: that we need only an $O(1)$-bit program, possibly depending on

5735: the data itself, to compute the model

5736: from the data.

5737:

5738: (ii) For each model class and distortion there is a universal constant $c$

5739: such that for every data item $x$ there are at most $c$ sufficient

5740: statistics.

5741:

5742: \begin{example}

5743: \rm

5744: {\em Finite sets:}

5745: For the model class of finite sets, a set $S$ is a sufficient statistic

5746: for data $x$ if

5747: \[

5748: K(S)+ \log |S| \eqa K(x).

5749: \]

5750:

5751: {\em Computable probability density functions:}

5752: For the model class of computable probability density functions,

5753: a function $f$ is a sufficient statistic

5754: for data $x$ if

5755: \[

5756: K(f)  + \log 1/f(x) \eqa K(x).

5757: \]

5758: For the model class of

5759: {\em total recursive functions}, a function $f$ is a

5760: {\em sufficient statistic} for data $x$

5761: if

5762: \begin{equation}\label{eq.ss}

5763: K(x) \eqa K(f)  + \len xf .

5764: \end{equation}

5765: Following the above discussion, the meaningful information in $x$

5766: is represented by $f$ (the model) in $K(f)$ bits, and the

5767: meaningless information in $x$ is represented by $d$ (the noise in

5768: the data) with $f(d)=x$ in $l(d) = \len xf$ bits. Note that

5769: $l(d) \eqa K(d) \eqa K(d|f^*)$,

5770: since the two-part

5771: code $(f^*,d)$ for $x$

5772: cannot be shorter than the shortest one-part code of $K(x)$ bits,

5773: and therefore the $d$-part must already be maximally compressed.

5774: By Lemma~\ref{lem.V2},  $\len xf \eqa   K(x \mid f^* , \len xf)$,

5775: $x$ is typical for $f$,

5776: and hence $K(x) \eqa K(f)  + K(x \mid f^* , \len xf)$.

5777: \end{example}
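
\begin{example}
\rm
A simple instance of the finite set case may help (a standard
illustration, added here as a sketch). Let $x$ be a string of length
$n$ and take $S=\{0,1\}^n$. The set $S$ and the number $n$ determine
each other, so $K(S) \eqa K(n)$, while $\log |S| = n$. The sufficiency
condition therefore reads
\[
K(x) \eqa K(n) + n .
\]
Since $K(x) \lea K(n)+n$ holds for every $x$ of length $n$, the set
$\{0,1\}^n$ is a sufficient statistic exactly for those strings of
length $n$ whose complexity is maximal. For such $x$ the model carries
only the length $n$, and all remaining $n$ bits are noise.
\end{example}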

5778:

5779:

5780:

5781: \subsubsection{Expected Structure Function}

5782: \label{sec:esfb}

5783: We treat the relation between the expected value

5784: of $h_x(R)$, the expectation taken on a

5785: distribution $f(x)=P(X=x)$ of the random variable $X$ having outcome $x$,

5786: and $D^*(R)$, for arbitrary random sources provided the probability mass

5787: function $f(x)$ is recursive.

5788:

5789:

5790: \begin{theorem}\label{thm.dresf}

5791: Let $d$ be a recursive distortion

5792: measure.

5793: Given $m$ repetitions of a random variable $X$ with outcomes

5794: $x \in {\cal X}$ (typically, ${\cal X}= \{0,1\}^n$)

5795: with probability $f(x)$, where $f$ is a total

5796: recursive function, we have

5797: $$

5798: {\bf E} \frac{1}{m} h_{\overline{x}} (mR+K(f,d,m,R)+O(\log n))

5799: \leq D^*_m(R)

5800: \leq {\bf E} \frac{1}{m} h_{\overline{x}} (mR),

5801: $$

5802: where the expectations are taken over $\overline{x}

5803: = x_1 \ldots x_m$ where $x_i$ is the outcome of the $i$th repetition

5804: of $X$.

5805: \end{theorem}

5806: \begin{proof}

5807: As before, let $X_1, \ldots, X_m$ be $m$ independent identically

5808: distributed random variables on outcome space ${\cal X}$.

5809: Let ${\cal Y}$ be a set of code words.

5810: We want to find a sequence of functions $Y_1, \ldots , Y_m:{\cal X}

5811: \rightarrow {\cal Y}$ so that the message $(Y_1(x_1), \ldots,

5812: Y_m (x_m)) \in {\cal Y}^m$ gives as much expected

5813: information about the sequence of outcomes $(X_1=x_1,

5814: \ldots, X_m=x_m)$ as is possible, under the constraint that the message

5815: takes at most $R \cdot m$ bits (so that $R$ bits are allowed on

5816: average per outcome of $X_i$).

5817: Instead of $Y_1, \ldots , Y_m$ above write

5818: $\overline{Y}: {\cal X}^m \rightarrow  {\cal Y}^m$.

5819: Denote the cardinality of the range of $\overline{Y}$

5820: by $\rho (\overline{Y})= | \{\overline{Y}(\overline{x}):

5821: \overline{x} \in {\cal X}^m\}|$.

5822: Consider distortion spheres

5823: \begin{equation}\label{eq.lcfs}

5824: B_{\overline{y}}(d) = \{\overline{x}: d(\overline{x},\overline{y}) = d \},

5825: \end{equation}

5826: with $\overline{x} = x_1 \ldots x_m \in {\cal X}^m$

5827: and $\overline{y} \in {\cal Y}^m$.

5828:

5829:

5830: {\em Left Inequality:}

5831: Keeping the earlier notation, for $m$ i.i.d.

5832: random variables $X_1, \ldots ,X_m$, and extending $f$ to

5833: the $m$-fold Cartesian product of $\{0,1\}^n$, we obtain

5834: $D_m^*(R) = \frac{1}{m} \min_{ \overline{Y}: \rho (\overline{Y}) \leq 2^{mR}}

5835: \sum_{\overline{x}}f(\overline{x})

5836: d(\overline{x}, \overline{Y} (\overline{x}))$.

5837: %Assume that $\overline{y} \in Z_m(\overline{x})$ iff

5838: %$Z_m(\overline{y}) = Z_m(\overline{x})$: the distinct

5839: %$Z_m(\overline{x})$'s are disjoint and partition $\{0,1\}^{mn}$

5840: %into disjoint subsets $Z_{m,i}$, with $i=1, \ldots, k$ for

5841: %some $k \leq 2^{mR}$. Denote the elements of this partition

5842: %by $ Z_{(1)} , \ldots , Z_{(k)}$.

5843: By definition of $D_m^*(R)$ it equals the following expression in terms

5844: of a minimal canonical covering of $\{0,1\}^{nm}$ by

5845: disjoint nonempty spheres $B'_{\overline{y}_i}(d_i)$

5846: ($1 \leq i \leq k$) obtained from the possibly overlapping

5847: distortion spheres $B_{\overline{y}_i}(d_i)$ as follows.

5848: Every element $\overline{x}$ in the overlap between two or more spheres

5849: is assigned to the sphere with the smallest radius and removed

5850: from the other spheres. If there is more than

5851: one sphere of smallest radius, then

5852: we take the sphere of least index in the canonical covering.

5853: Empty $B'$-spheres are removed from the $B'$-covering.

5854: If $S \subseteq \{0,1\}^{nm}$, then $f(S)$ denotes  $\sum_{x \in S} f(x)$. Now,

5855: we can rewrite

5856: \begin{equation}\label{eq.distpart}

5857: D^*_m(R) =

5858: \min_{\overline{y}_1, \ldots , \overline{y}_k; d_1, \ldots , d_k; k \leq 2^{mR}}

5859:  \frac{1}{m}

5860: \sum_{i=1}^k f(B'_{\overline{y}_i}(d_i)) d_i.

5861: \end{equation}

5862: In the structure function setting we consider some individual

5863:  data $\overline{x}$ residing

5864: in one of the covering spheres.

5865: Given $m,n,R$ and a program to compute $f$ and $d$, we can compute the

5866: covering spheres centers $\overline{y}_1, \ldots, \overline{y}_k$,

5867: and radiuses $d_1, \ldots , d_k$,

5868: and hence the $B'$-sphere canonical covering. In this

5869: covering we can identify every pair $(\overline{y}_i, d_i)$ by

5870: its index $i \leq 2^{mR}$. Therefore,

5871: $K(\overline{y}_i, d_i) \leq mR + K(f,d,m,R)+O(\log n)$ ($1 \leq i \leq k$).

5872: For $\overline{x} \in B'_{\overline{y}_{i}}(d_i)$

5873: we have $h_{\overline{x}}(mR + K(f,d,m,R)+O(\log n)) \leq d_i$.

5874: Therefore,

5875: ${\bf E} \frac{1}{m}h_{\overline{x}}(mR + K(f,d,m,R)+O(\log n)) \leq  D^*_m(R)$,

5876: the expectation taken over

5877: $f(\overline{x})$ for $\overline{x} \in \{0,1\}^{mn}$.

5878:

5879:

5880: {\em Right Inequality:}

5881: Consider a covering of $\{0,1\}^{nm}$

5882: by the  (possibly overlapping)

5883: distortion spheres $B_{\overline{y}_i}(d_i)$

5884: satisfying $K(B_{\overline{y}_i}(d_i) | mR) < mR-c$, with $c$ an

5885: appropriate constant chosen so that the remainder of the argument

5886: goes through.

5887: If there is more than one sphere with different (center, radius)-pairs

5888: representing the same subset of  $\{0,1\}^{nm}$, then

5889: we eliminate all of them except the one with the smallest radius.

5890: If there is more than one such sphere, then we only keep the one

5891: with the lexicographically least center. From this covering we obtain

5892: a canonical covering

5893: by nonempty disjoint spheres $B'_{\overline{y}_i} (d_i)$

5894: similar to that in the previous paragraph,

5895: ($1 \leq i \leq k$).

5896:

5897: For every $\overline{x} \in \{0,1\}^{nm}$

5898: there is a unique

5899: sphere $B'_{\overline{y}_i}(d_i) \ni \overline{x}$ ($1 \leq i \leq k$).

5900: Choose the constant $c$ above so that

5901: $K(B'_{\overline{y}_i}(d_i) |mR )  < mR$. Then,

5902: $k \leq 2^{mR}$.

5903: Moreover, by construction, if $B'_{\overline{y}_i} (d_i)$

5904: is the sphere containing $\overline{x}$, then

5905: $h_{\overline{x}} (mR)= d_i$.

5906: Define functions $\gamma: \{0,1\}^{nm} \rightarrow {\cal Y}^m$,

5907: $\delta: \{0,1\}^{nm} \rightarrow {\cal R}^+$ by

5908: $\gamma(\overline{x}) = \overline{y}_i$ and $\delta (\overline{x}) = d_i$

5909: for $\overline{x}$ in the sphere $B'_{\overline{y}_i}(d_i)$.

5910: Then,

5911: \begin{equation}\label{eq.lb2}

5912:  {\bf E} \frac{1}{m}  h_{\overline{x}} (mR) =

5913: \frac{1}{m} \sum_{\overline{x} \in \{0,1\}^{mn}}

5914: f(\overline{x}) d(\overline{x}, \gamma(\overline{x}))

5915: = \frac{1}{m} \sum_{i=1}^k

5916: %\exists_{\overline{x}} [y= \gamma(\overline{x}),d = \delta(\overline{x})]}

5917: f(B'_{\overline{y}_i}(d_i)) d_i .

5918: \end{equation}

5919: The distortion  $D^*_m (R)$ achieves the minimum of the expression in the

5920: right-hand side of \eqref{eq.distpart}.

5921: Since $K(B'_{\gamma( \overline{x})} (\delta(\overline{x}))|mR) < mR$,

5922: the cover in the right-hand side of \eqref{eq.lb2}

5923: is a possible partition satisfying the expression being

5924: minimized in the right-hand side of

5925: \eqref{eq.distpart}, and hence majorizes the minimum $D^*_m(R)$. Therefore,

5926: ${\bf E} \frac{1}{m} h_{\overline{x}} (mR) \geq D^*_m(R)$.

5927: \end{proof}
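
\begin{remark}
\rm
The bound $k \leq 2^{mR}$ used for the right inequality is a counting
argument, which we sketch here under the simplifying assumption that
$mR$ is an integer. The number of binary programs of length less than
$mR$ is at most
\[
\sum_{j=0}^{mR-1} 2^j = 2^{mR}-1 .
\]
Given $mR$, each such program describes at most one sphere, and
distinct spheres $B'_{\overline{y}_i}(d_i)$ with
$K(B'_{\overline{y}_i}(d_i) \mid mR) < mR$ require distinct programs.
Hence $k \leq 2^{mR}-1 < 2^{mR}$.
\end{remark}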

5928:

5929: \begin{remark}

5930: \rm

5931: A sphere

5932: is a subset of $\{0,1\}^{nm}$. The same subset may correspond

5933: to more than one sphere with different centers and radiuses:

5934:  $B_{\overline{y}_0}(d_0) = B_{\overline{y}_1}(d_1)$ with

5935: $(\overline{y}_0,d_0) \neq (\overline{y}_1,d_1)$.

5936: Hence, $K(B_{\overline{y}} (d))

5937: \leq K(\overline{y},d) + O(1)$, but possibly

5938: $K(\overline{y},d) > K(B_{\overline{y}} (d))+O(1)$.

5939: However, in the proof we constructed the ordered sequence of $B'$

5940: spheres such that every sphere uniquely corresponds to a

5941: (center, radius)-pair. Therefore, $K(B'_{\overline{y}_i}(d_i)|mR)

5942: \eqa K(\overline{y}_i, d_i | mR)$.

5943: \end{remark}

5944:

5945:

5946: \begin{corollary}\label{cor.esf}

5947: It follows from the above theorem that, for

5948: a recursive distortion function $d$:

5949: (i) $

5950: {\bf E} h_{x} (R+K(f,d,R)+O(\log n))

5951: \leq D^*_1 (R)

5952: \leq {\bf E} h_x (R)

5953: $,

5954: for outcomes of a single repetition of random variable $X =x$

5955: with $x \in \{0,1\}^n$,

5956: the expectation taken over $f(x)=P(X =x)$; and

5957:

5958: (ii) $\lim_{m \rightarrow \infty} {\bf E} \frac{1}{m} h_{\overline{x}} (mR)

5959: = D^*(R)$

5960: for outcomes $\overline{x} = x_1 \ldots x_m$

5961: of i.i.d. random variables $X_i =x_i$ with $x_i \in \{0,1\}^n$ for

5962: $1 \leq i \leq m$,

5963: the expectation taken over $f(\overline{x})=P(X_i=x_i, i=1, \ldots, m)$

5964: (the extension of

5965: $f$ to $m$ repetitions of $X$).

5966: \end{corollary}
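
To see how the corollary relates to Theorem~\ref{thm.dresf}, the
following sketch may be useful; it is our reading of the argument and
not a full proof. Item (i) is the case $m=1$ of the theorem: with
$\overline{x}=x$ we obtain
\[
{\bf E} h_{x} (R+K(f,d,1,R)+O(\log n))
\leq D^*_1 (R)
\leq {\bf E} h_x (R),
\]
and $K(f,d,1,R) \eqa K(f,d,R)$ since the constant $1$ can be built into
the program. For item (ii), with $f$, $d$, $R$ and $n$ fixed, we have
\[
K(f,d,m,R) \lea K(f,d,R) + K(m) = O(\log m),
\]
so the additive term in the argument of $h_{\overline{x}}$ on the left
is $o(m)$, that is, a vanishing change in the rate per outcome. We take
it that this, combined with $D^*(R) = \lim_{m \rightarrow \infty} D^*_m(R)$
and suitable continuity of $D^*$ at $R$ (an assumption on our part),
is what yields the limit.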

5967:

5968: This is the sense in which the expected value of the structure function

5969: is asymptotically equal to the value of the distortion-rate function,

5970: for arbitrary computable distortion measures.

5971: In the structure function approach we dealt with only two

5972: model classes, finite sets and computable probability density functions,

5973: and the associated quantities to be minimized, the log-cardinality

5974: and the negative log-probability, respectively. Translated into

5975: the distortion-rate setting, the models are code words

5976: and the minimalizable quantities are distortion measures.

5977:  In \cite{VV02}

5978: we also investigate the model class of total recursive functions,

5979: and in

5980: \cite{BKVV03} the model class of communication protocols. The associated

5981: quantities to be minimized are then function arguments and communicated

5982: bits, respectively. All these models are equivalent up to logarithmic

5983: precision in argument and value of the corresponding structure functions,

5984: and hence their expectations are asymptotic to the distortion-rate

5985: functions of the related code-word set and distortion measure.

5986:

5987:

5988: \commentout{

5989: \begin{remark}

5990: \rm

5991: Suppose we extend the structure function from $h_x(R)=S \ni x$

5992: to $h_x(R) = p$, where $p$ is a distribution on  $x$-containing finite

5993: $S \subseteq \{0,1\}^*$. The classic case treated above

5994: is equivalent to $p(S)=1$ for some $x$-containing finite set

5995: of least cardinality with $K(S) \leq R$. Then, given a random

5996: variable $X$ we have a joint probability $q(X,{\bf S})$ where

5997: ${\bf S}$ denotes the set of finite subsets of $\{0,1\}^*$.

5998: It may be possible to repeat the analysis above in this setting,

5999: and then combine the equivalent of Corollary~\ref{cor.esf} Item (ii)

6000: with Theorem~\ref{thm:rd} to express

6001: the expected complexity $R$ as a function of maximal

6002: allowed expected distortion of $X$ in terms of ${\bf S}$

6003: as the infimum of the mutual information between $X$ and ${\bf S}$

6004: subject to this constraint.

6005: \end{remark}

6006: }

6007:

6008:

6009: \commentout{

6010: \section{Rate Distortion---Continued}

6011: \subsection{Deterministic Rate Distortion}

6012: To obtain the beautiful Theorem~\ref{thm:rd}, we

6013: needed to consider (a) the limit of the average outcome of

6014: repetitions of the same i.i.d. probabilistic scenario, the limit taken

6015: as the number of repetitions grows unboundedly

6016: (in the definition of $D^*(R)$), and (b) randomization of the coding process

6017: (in the minimization (\ref{eq:rd})).

6018: From both perspectives, $D^*(R)$ and $R^*(D)$ are hard to compute.

6019: There exist clever algorithms

6020: %such as the {\em Blahut-Arimoto

6021: %  algorithm\/} \cite{CT91}

6022: to compute $R^*(D)$, but these are not

6023: always practical. It therefore seems useful to simplify matters.

6024:

6025: Repetition and

6026: averaging are unavoidable

6027: in, say, the Law of Large Numbers, which cannot be expressed otherwise.

6028: However, the minimal rate at which messages can be sent under

6029: distortion constraints makes perfect sense for the individual unrepeated

6030: event and deterministic coding processes.

6031: It turns out that if the distortion function is

6032: `regular' (in a sense to be defined below),

6033: then it becomes meaningful to consider `unrandomized' bounds on the

6034: rate distortion, which are computationally easier to

6035: handle and---to us---also easier to interpret.

6036:

6037: \begin{definition}

6038: \rm

6039: A  distortion function

6040: $d: {\cal X} \times {\cal Y} \rightarrow [0,\infty]$ is

6041: {\em regular\/} if

6042: \begin{enumerate}

6043: \item ${\cal Y}$ is a convex space;

6044: \item For each fixed $x$, the function $h_x(y)=d(x,y)$ is convex.

6045: \end{enumerate}

6046: \end{definition}

6047: The set of code words ${\cal Y}$ can be a convex subset of

6048: the real numbers, but also the family of all

6049: probability distributions on some domain. But we are commonly

6050: in one of the following two situations

6051: (a) ${\cal Y}$ is finite or countable; or

6052: (b) ${\cal Y}$ is uncountably infinite and $d$ is regular.

6053:

6054: \begin{definition}

6055: \rm

6056: Let $X$ be a random variable with outcomes in ${\cal X}$

6057: and let $Y$ be a function $Y: {\cal X} \rightarrow {\cal Y}$.

6058: We abuse notation by denoting the random variable

6059: $Y(x)$ induced by the random variable $X$ by ``$Y$''. Then it makes

6060: sense to talk about the entropy $H(Y)$. We define

6061: \begin{enumerate}

6062: \item The {\em deterministic distortion-rate function\/}:

6063: \begin{equation}

6064: \label{eq:udr}

6065: D^\circ(R) :=

6066: \inf_{Y: H(Y) \leq R} {\bf E}[d(X,Y)].

6067: \end{equation}

6068: \item The {\em deterministic rate-distortion function\/}:

6069: \begin{equation}

6070: \label{eq:urd}

6071: R^\circ(D) := \inf_{Y: {\bf E}[d(X,Y)] \leq D}

6072: H(Y).

6073: \end{equation}

6074: \end{enumerate}

6075: \end{definition}

6076: It is easy to see that $D^\circ(R)$ must be convex and non-increasing.

6077: Therefore, $R^\circ(D)$ must be the

6078: inverse of $D^\circ(R)$, itself also convex and non-increasing.

6079:

6080: \paragraph{Relating $R^*(D)$ and $R^\circ(D)$:}

6081: Using the definition $I(X;Y) = H(Y) - H(Y|X)$ (Section~\ref{sec:mutual}) we can rewrite

6082: (\ref{eq:rd}) as

6083: $$

6084: R^*(D) = \inf_{Y: {\bf E}[d(X,Y)] \leq D}

6085: H(Y) - H(Y|X),

6086: $$

6087: the infimum taken over randomized $Y$.

6088: If $Y$ is a deterministic function of $X$,

6089: then $H(Y|X) = 0$, and we obtain (\ref{eq:urd}). By

6090: \eqref{eq:rd} and \eqref{eq:ird} $R^*(D)=R^{(I)}(D)$ where the latter

6091: is defined as the right-hand side of the above

6092: equality, but using randomized codes.

6093: Since $R^\circ(D)$ is restricted to

6094: deterministic codes, we have

6095: $R^*(D) \leq R^\circ(D)$.

6096:

6097: \paragraph{Relating $D^*(R)$ and $D^\circ(R)$:}

6098: In our formulation of the basic rate-distortion problem,

6099: before we turned to independent

6100: repetitions, we wanted to minimize distortion under the constraint

6101: that only $2^R$ messages are to be used in a one-shot

6102: setting. It is equivalent to using

6103: the best code under the constraint that only {\em fixed length\/}

6104: codes (using $R$ bits per message) are used. In that case, no matter

6105: what message is sent, the actual number of bits will also be $R$.

6106: Comparing this to

6107: (\ref{eq:udr}), and using the

6108: noiseless-coding interpretation of entropy

6109: (Theorem~\ref{thm:noiseless}), we see that the only difference is that

6110: in the definition of $D^\circ(R)$ we are

6111: allowed to use any code with {\em expected\/} (rather than

6112: actual) code length not larger than $R$ bits.

6113:

6114: But, if we consider repeated scenarios, we can also think

6115: of $D^\circ(R)$ in terms of actual rather than expected code lengths.

6116: By the law of large numbers, we know that if we consider independent

6117: repetitions of the same scenario and

6118: we encode the vector of realized $n$ values,

6119: we can achieve an actual codelength within $o(n)$ of the expected

6120: code-length $nH(Y)$ with probability arbitrarily close to $1$.

6121: Using a code $Y$ satisfying $H(Y) \leq R$ and range ${\cal Y}$, we

6122: map the outcomes of $n$ repetitions of $X$,

6123: mapping $(x_1, \ldots, x_n)$

6124: to $(Y(x_1), \ldots, Y(x_n))$. Given a tolerance $\delta >0$,

6125: we only reserve codewords for the

6126: $2^{n (H(Y)+ \delta)}$ most probable vectors $(y_1, \ldots, y_n)$.

6127: The code length for each of these vectors will be $n (H(Y) + \delta)$.  Then with probability

6128: approaching $1$ as $n$ increases, the realized sequence

6129: of outcomes $(x_1, \ldots, x_n)$ is such

6130: that $(Y(x_1), \ldots, Y(x_n))$ has a code word of length

6131:  $n (H(Y) + \delta)$. This

6132: code uses at most $R + \delta$ bits per $X_i$, and it is easy to see that it

6133: achieves distortion ${\bf E}[d(X,Y)]$.

6134:

6135: This means that $D^\circ(R)$ can be interpreted in two ways: (a) we look

6136: at codes that use at most $R$ bits per message; or (b)

6137: we consider i.i.d. repetitions of the same random variable

6138: as in the definition of $D^*(R)$, but we restrict ourselves to

6139: using the same deterministic coding function $Y$ for each repetition.

6140: Therefore,

6141: $D^*(R) \leq D^\circ(R)$.

6142: \commentout{

6143: Let $D_{\min} = \inf_{R} D^*(R)$ and $D_{\max} = D^*(0)$.

6144: We call ${\cal Y}$ {\em distortion-continuous} relative to $D$

6145:   if for all $D \in (D_{\min}, D_{\max})$, there exists a

6146:   random variable $Y: {\cal X} \rightarrow {\cal Y}$ with ${\bf E}

6147:   [d(X,Y)]= D$.

6148: NOTE PAUL: THE FOLLOWING RESULT SEEMS NEW (ALTHOUGH NOT AT ALL HARD TO PROVE)

6149: \begin{theorem}

6150: \label{thm:simplerd}

6151: Suppose ${\cal Y}$ is distortion-continuous and $d$

6152: is the Shannon-Fano distortion $d(x,y) =  \log 1/ p(x\mid y)$. Then:

6153: \begin{enumerate}

6154: \item For all $D \geq 0$,

6155: \begin{equation}

6156: \label{eq:drentropy}

6157: R^*(D)

6158: = \inf_{{Y}: {\cal X} \rightarrow {\cal Y} \; ; \;   H(X \mid{Y}) \leq D } H(Y),

6159: \end{equation}

6160: so that only non-randomized estimates $Y$ have to be considered;

6161: \item For all $R \geq 0$,

6162: \begin{equation}

6163: \label{eq:drentropy}

6164: D^*(R) = \inf_{{Y}: {\cal X} \rightarrow {\cal Y} \; ; \;   H({Y}) \leq R } H(X \mid Y),

6165: \end{equation}

6166: so that $D^*(R)$ can be directly computed from $R$, without taking the

6167: large $n$ limit as in (\ref{eq:dr}).

6168: \end{enumerate}

6169: \end{theorem}

6170: \begin{proof}

6171: The theorem follows easily from the following lemma.

6172: \begin{lemma}

6173: Suppose there exists a deterministic function $Y: {\cal X} \rightarrow {\cal Y}$ with $H( X \mid Y) = D$. Then

6174: \begin{equation}

6175: \label{eq:lem}

6176: \inf_{f'(y'|x) : \sum_{x \in {\cal X}, y' \in {\cal Y}}

6177: f(x) f'(y'|x) [ \log 1/ p(x|y')] \leq D}I(X; Y') = I(X; Y) = H(Y).

6178: \end{equation}

6179: \end{lemma}

6180: Here the expression over which the minimum is taken should be read as

6181: in Theorem~\ref{thm:rd}, i.e. the minimum is over all conditional

6182: distributions $P'(Y' = \cdot \mid X = \cdot)$ satisfying

6183: $$

6184: {\bf E}_{X \sim P} {\bf E}_{Y'|X \sim P'} [  \log 1/ P(X|Y')] = H(X| Y') \leq D.

6185: $$

6186: \begin{proof}

6187:   Note that $I(X;Y) = H(Y) - H(Y | X)$ (Section~\ref{sec:mutual}).

6188:   Since $Y$ is a deterministic function of $X$, $H(Y|X) = 0$; this

6189:   shows the second equality in (\ref{eq:lem}). For the first equality,

6190: consider the space

6191: ${\cal X} \times {\cal Y}$, in which $Y'$ is a random variable. We

6192:   have $I(X;Y') = H(X) - H(X| Y')$. Since $H(X)$ does not depend on

6193:   $p(y' \mid x)$, we have

6194: $$\inf_{p(y' \mid x): H(X| Y') \leq D} I(X; Y') =

6195: H(X) +  \inf_{p(y' \mid x): H(X| Y') \leq D}  \{ - H(X|Y') \}

6196: = H(X) - D = I(X;Y).

6197: $$

6198: \end{proof}

6199: \end{proof}

6200: }

6201: \subsection{Rate Distortion and Estimators}

6202: Let $X$ be a random variable with set of outcomes ${\cal X}$.

6203: Let $\Theta$ be a {\em parameter space}.

6204: Suppose we observe a sample $(x_1, \ldots, x_n) \in {\cal X}^n$.

6205: A statistical {\em model family\/} ${\cal M}$ is defined

6206: by ${\cal M} = \{ p(\cdot

6207: , \theta) \mid \theta \in  \Theta\}$, where

6208: $p$ is a joint distribution over ${\cal X}^n$ and $\Theta$.

6209: For every parameter

6210: $\theta$, the function  $p_{\theta} (x_1, \ldots , x_n)=p((x_1, \ldots , x_n)

6211: \mid \theta)$ is a possibly different distribution on ${\cal X}^n$.

6212: For example, $\theta \in [0,1]$ represents the bias of a coin

6213: with outcomes in ${\cal X}=\{0,1\}$ per trial. Then the model family

6214: is that of the Bernoulli distributions.

6215: A statistical {\em estimator\/} $\hat{\theta}$

6216:   is a function

6217: $\hat{\theta}:  {\cal X}^n \rightarrow \Theta$,

6218:   mapping each possible sample of $n$ outcomes

6219:  into a value in $\Theta$.

6220:   The name `estimator' comes from the statistical literature, in which

6221:   $\hat{\theta}(x_1 , \ldots , x_n)$ is interpreted as an `estimate' of the data

6222:   generating mechanism $\theta$. A typical example is

6223: the maximum likelihood estimator; see below.

6224: For convenience, denote $\overline{X} = X^n$ as the random variable

6225: with outcomes $\overline{x}$

6226: in the sample space  $\overline{\cal X} = {\cal X}^n$.

6227: We may now consider the distortion function:

6228: $

6229: d: \overline{\cal X} \times

6230: \Theta \rightarrow [0,\infty],

6231: $

6232: defined by

6233: \begin{equation}

6234: \label{eq:absdist}

6235: d(\overline{x}, \theta) =  \log 1/ p(\overline{x} \mid \theta).

6236: \end{equation}

6237: Note however that it is {\em not\/}

6238: identical to the `Shannon-Fano distortion' as

6239: in Example~\ref{ex:reconcile}. We explain the difference

6240: below in (\ref{eq:datacode}).

6241:

6242: The expected distortion ${\bf E} [d(\overline{X},

6243: \Theta)]$ requires the distribution $p(\overline{x} \mid \theta)$.

6244: This distribution can arise in different ways.

6245: We first consider a

6246: Bayesian analysis, in which we assume that the statistician employs

6247: some prior distribution $W$ on $\Theta$. This $W$

6248: indicates the statistician's prior `degree of belief' in the various

6249: elements of $\Theta$.  Assumption of $W$ induces a unique distribution

6250: ${\Pr}_{\text{Bayes}}$ on $\overline{\cal X}$, the so-called `Bayesian

6251: marginal likelihood' distribution:

6252: \begin{equation}

6253: \label{eq:bayesmarg}

6254: {\Pr}_{\text{Bayes}}(\overline{x} ) = \int_{\theta \in \Theta} p(\overline{x} \mid \theta)

6255: d W(\theta),

6256: \end{equation}

6257: where, in case $\Theta$ is discrete, the integral is replaced by a sum.

6258: \begin{example}

6259: \label{ex:markov}

6260: \rm

6261: A simple example of a statistical model with continuous $\Theta$

6262: is the {\em Bernoulli process}

6263: ${\cal M}_0 = \{ p(\overline{x} \mid \theta) : \overline{x}

6264: \in \overline{\cal X}, \theta \in  \Theta \}$,

6265: where ${\cal X}=\{0,1\}$, $ \Theta = [0,1]$,

6266: $\overline{x}=(x_1, \ldots , x_n)$,

6267: $p(\overline{x} \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta)$, and the

6268: joint probability $p(x,\theta)$ is induced by

6269: the uniform

6270: prior $W(\theta) = \theta$.

6271: Then, $p(x_i \mid \theta)$ is the conditional probability that the random

6272: variable $X_i$ has outcome $x_i$ when the model has parameter $\theta$ and

6273: we have by definition of the Bernoulli process that

6274: $p(1 \mid \theta) = \theta$ and $p(0 \mid \theta)= 1- \theta$.

6275: This family has a single parameter, that is, $\theta$.

6276: In general, we do not restrict ourselves to finitely parameterizable

6277: families.

6278:

6279: An example of a statistical model with both continuous $\Theta$

6280: and unbounded number of parameters is the model family of

6281: {\em Markov chains} ${\cal M}$ defined as follows:

6282: ${\cal M} = \bigcup_k {\cal M}_k$, $\Theta =

6283: \bigcup_k \Theta_k$ where $\Theta_k = [0,1]^{2^k}$, and

6284: $${\cal M}_k = \{ p(\cdot \mid \theta) ; \theta \in \Theta_k \}$$

6285: consists of the family of $k$-th order Markov chains for alphabet

6286: ${\cal X} = \{0,1\}$ ($k=0,1, \ldots$).

6287: The subfamily ${\cal M}_0$ is the Bernoulli family

6288: introduced before.  A prior on ${\cal M}$ typically takes a hierarchical

6289: form: we first specify a prior $W$ (with probability density function

6290: $w$) on parameter space $\Theta$. This induces

6291: a prior $W_k$

6292: with associated probability density $w_k$ on

6293: every fixed-number parameter space

6294:  $\Theta_k$ ($k=0,1, \ldots$). It also induces a probability

6295: density $w_{\Theta}(k) = \int_{\theta \in \Theta_k} w (\theta) d \theta$

6296: Then, (\ref{eq:bayesmarg}) can be rewritten as

6297: \begin{equation}

6298: {\Pr}_{\text{Bayes}}(\overline{x} ) =

6299: \sum_k w_{\Theta}(k) \int_{\theta \in \Theta_k}

6300:  w_k(\theta) p(\overline{x} \mid \theta) d \theta.

6301: \end{equation}

6302: \end{example}
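As a quick illustration of the Bayesian marginal (a worked instance of our
own, assuming the uniform prior density $w(\theta) = 1$ on $[0,1]$ for the
Bernoulli family ${\cal M}_0$, and writing $n_1$ for the number of ones in
$\overline{x}$ and $n_0 = n - n_1$ for the number of zeros; this notation is ours),
the integral over $\theta$ is a Beta integral:
\[
{\Pr}_{\text{Bayes}}(\overline{x}) = \int_0^1 \theta^{n_1} (1-\theta)^{n_0} \, d\theta
= \frac{n_1! \, n_0!}{(n+1)!} .
\]
For $n=2$ this gives ${\Pr}_{\text{Bayes}}(00) = {\Pr}_{\text{Bayes}}(11) = 1/3$
and ${\Pr}_{\text{Bayes}}(01) = {\Pr}_{\text{Bayes}}(10) = 1/6$, which sum to one as they should.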

6303:

6304: With the given prior distribution, the

6305: random variable $\overline{X}$ is distributed according

6306: to ${\Pr}_{\text{Bayes}}$,

6307: and the expected distortion of an estimator $\hat{\theta}$

6308: becomes well-defined and equal to

6309: \begin{align}

6310: \label{eq:datacode}

6311: {\bf E}_{{\Pr}_{\text{Bayes}}} [d(\overline{X},

6312: \hat{\theta}( \overline{X}))] & = {\bf E}_{{\Pr}_{\text{Bayes}}}

6313: [ - \log p(\overline{X} \mid \hat{\theta}(\overline{X}))]

6314: \\ & =

6315: \nonumber

\sum_{\overline{x}} {\Pr}_{\text{Bayes}} (\overline{x})
 [- \log p(\overline{x} \mid \hat{\theta}(\overline{x}))  ].

6319: \end{align}

6320: Below we use $H$ to refer to entropy with respect to

6321: ${\Pr}_{\text{Bayes}}$, and we use $\hat{\Theta}$ to refer to the

6322: range of the estimator $\hat{\theta}$.

6323: \begin{remark}

6324: \rm

6325: It is important to realize

6326: that (\ref{eq:datacode})  is {\em not\/} equal to

6327: \begin{align*}

6328: H(\overline{X}  \mid \hat{\theta}(\overline{X}))   = &

6329: {\bf E}_{{\Pr}_{\text{Bayes}}} [- \log

6330: {\Pr}_{\text{Bayes}}(\overline{X} \mid \hat{\theta}(\overline{X}))]

6331: \\ = &

\sum_{\theta \in \hat{\Theta}}
{\Pr}_{\text{Bayes}} ( \{ \overline{y}  : \hat{\theta}(\overline{y})
= \theta\} )
\\& \; \; \; \left(\sum_{\overline{x} : \hat{\theta}(\overline{x}) = \theta}
{\Pr}_{\text{Bayes}} (
\overline{x} \mid \hat{\theta}(\overline{X}) = \theta )
 [- \log {\Pr}_{\text{Bayes}} (
\overline{x} \mid \hat{\theta}(\overline{X}) = \theta ) ] \right).

6340: \end{align*}

6341:

6342: %Note that  $P(\cdot ; \theta)$ appearing inside

6343: %the expectation (\ref{eq:datacode}) is not $\Pr_{\text{Bayes}(\cdot

6344: %  \mid \hat{\theta}(x^n) = \theta)}$.

6345: Thus, the expected distortion (\ref{eq:datacode})

6346: is the expected code-length of $\overline{x}$,

6347: where the Shannon-Fano code for distribution $p(\overline{x} \mid

6348: \hat{\theta}(\overline{x}))$, rather than the Shannon-Fano code for the actual

6349: conditional distribution ${\Pr}_{\text{Bayes}}(\overline{x} \mid

6350: \hat{\theta}(\overline{x}))$ is used.  Therefore, the present development is

6351: quite different from Example~\ref{ex:reconcile}.

6352: \end{remark}
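To make the distinction concrete, here is a small worked instance of our own
(not taken from the cited sources), assuming the Bernoulli family with uniform
prior density, $n=2$, logarithms to base $2$, and the maximum likelihood
estimator $\hat{\theta}(\overline{x}) = n_1/2$, where $n_1$ (our notation) is
the number of ones in $\overline{x}$. Evaluating the Beta integral gives
${\Pr}_{\text{Bayes}}(00) = {\Pr}_{\text{Bayes}}(11) = 1/3$ and
${\Pr}_{\text{Bayes}}(01) = {\Pr}_{\text{Bayes}}(10) = 1/6$. The outcomes $00$
and $11$ receive probability one under their own estimates, while
$p(01 \mid 1/2) = p(10 \mid 1/2) = 1/4$, so that
\[
{\bf E}_{{\Pr}_{\text{Bayes}}} [ - \log p(\overline{X} \mid \hat{\theta}(\overline{X}))]
= \tfrac{1}{3} \cdot 0 + \tfrac{1}{3} \cdot 0 + 2 \cdot \tfrac{1}{6} \cdot 2 = \tfrac{2}{3} .
\]
In contrast, given $\hat{\theta}(\overline{X}) = 1/2$ (an event of probability
$1/3$), the outcomes $01$ and $10$ each have conditional probability $1/2$
under ${\Pr}_{\text{Bayes}}$, whereas the other two estimates determine
$\overline{X}$ completely. Hence
\[
H(\overline{X} \mid \hat{\theta}(\overline{X})) = \tfrac{1}{3} \cdot 1 = \tfrac{1}{3} ,
\]
which is strictly smaller than the expected distortion in this instance.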

6353: From (\ref{eq:udr}) we see that the deterministic

6354: distortion-rate function for (\ref{eq:absdist}) is given by

6355: \begin{equation}

6356: \label{eq:drentropyb}

6357: D^\circ(R) = \inf_{{\hat{\theta}}:

6358: H(\hat{\theta}(\overline{X})) \leq R } {\bf E}_{{\Pr}_{\text{Bayes}}}

6359: [ - \log p(\overline{X} \mid \hat{\theta}(\overline{X}))].

6360: \end{equation}

6361: Clearly, $D^\circ(R)$ is non-increasing in $R$.

6362: We may conjecture that $D^\circ(R) =

6363: D^*(R)$ but this is not true:

6364: using results in \cite{CT91}, Section 13.7, it is not hard to show

6365: that always $D^*(R) \leq D^\circ(R)$ and typically, $D^*(R) < D^\circ(R)$.

6366: This means that randomized estimators can typically

6367: achieve  a lower distortion than deterministic estimators,

6368: for any given rate.

6369: Deterministic estimators are appealing since they allow for a

6370: clear `two-part code' interpretation of the distortion process.

6371:

6372: \paragraph{Bayes Mean Structure function and MML:}

6373: Let $\hat{\theta}_R:

6374: {\cal X}^n \rightarrow {\bf \Theta}$ denote an estimator

6375: $\hat{\theta}$ achieving $D^\circ(R)$ for given $R$.

6376: Note that

6377: $$

6378: D^\circ(R) = {\bf E}_{{\Pr}_{\text{Bayes}}}

6379: [ - \log p(\overline{X} \mid  \hat{\theta}_R(\overline{X}))].

6380: $$

6381: We call

6382: $D^\circ(R)$ the {\em Bayes mean structure function}, in analogy to the

6383: Kolmogorov structure function to be introduced in the next section.

6384: We can interpret

6385: \begin{equation}

6386: \label{eq:mml}

6387: H(\hat{\theta}_R(\overline{X}))  + D^\circ(R) = {\bf E}_{{\Pr}_{\text{Bayes}}}

6388: [ - \log {\Pr}_{\text{Bayes}}(\hat{\theta}_R(\overline{X}))] + D^\circ(R)

6389: \end{equation}

6390: as the total number of bits it takes, on average, to encode the

6391: outcomes of the random variable $\overline{X}$

6392: using the cleverest possible two-part code under the constraint that

6393: $H(\hat{\theta}(\overline{X})) \leq R$. The first part of this code

6394: is the estimator $\hat{\theta}_R (\overline{x})$, which we interpret

6395: as corresponding to the proposed {\em model} for the data $\overline{x}$,

6396: at a Bayes mean cost of $H(\hat{\theta}_R(\overline{X})) \leq R$ bits.

The second part of this code is the distortion
$- \log p(\overline{x} \mid  \hat{\theta}_R(\overline{x}))$,
which corresponds to the {\em data-to-model} code,
the Shannon-Fano code for the data $\overline{x}$ conditional on
the estimated model $\hat{\theta}_R(\overline{x})$,
at a Bayes mean cost of $D^\circ(R)$ bits.

6403:

6404: There exist estimators $\hat{\theta}_R$ for every $R$. A natural question is

6405: to consider the rate $R^*$ for which

6406: an estimator $\hat{\theta}_{R^*}$  minimizes

6407: the value of (\ref{eq:mml}) over all $R$:

6408: $$

H(\hat{\theta}_{R^*} (\overline{X}))  + D^\circ(R^*) =
\min_R \left[ H(\hat{\theta}_R(\overline{X}))  + D^\circ(R) \right].

6411: $$

The estimator $\hat{\theta}_{R^*}$ minimizes
the expected two-part code-length
over {\em all\/} possible estimators.

6415: It turns out that the estimator

6416: $\hat{\theta}_{R^*}$ is well-known, albeit not in terms of rate-distortion:

6417: \begin{proposition}

6418: \label{prop:mml}

6419: $\hat{\theta}_{R^*}$ is identical to

6420: the {\em strict MML estimator\/} of Wallace \& Boulton

6421: \cite{WallaceB75,WallaceF87}.

6422: \end{proposition}

6423: \begin{proof}

6424:   Immediate from the definition of strict MML.

6425: \end{proof}

6426: Wallace and Freeman \cite{WallaceF87} do not give instructions

6427: for the case that the $\hat{\theta}_{R^*}$

minimizing (\ref{eq:mml}) is not unique. If there is more than one $R$
for which the minimum of (\ref{eq:mml}) is achieved, it is natural
(also for reasons that will become clear below) to
define $R^*$ to be
the {\em least\/} such $R$.

6434: It is well argued, \cite{WallaceB75,WallaceF87}, that the strict MML

6435: estimator $\hat{\theta}_{R^*}$ may be interpreted as an estimator that

6436: `trades off complexity and goodness-of-fit'. Below we shall explain this

6437: idea in a novel manner.

6438:

6439: \paragraph{Bayes Mean Randomness Deficiency Function:}

6440: Let us fix some rate $R$. Using the two-part code \eqref{eq:mml},

6441: achieving $D^\circ(R)$, we need on average

6442: \begin{equation}

6443: \label{eq:mmlb}

6444: H(\hat{\theta}_R(\overline{X})) + D^\circ(R)  = H(\hat{\theta}_R(\overline{X})) +

6445: {\bf E}_{{\Pr}_{\text{Bayes}}}

6446: [ - \log p(\overline{X} \mid \hat{\theta}_R(\overline{X}))]

6447: \end{equation}

6448: bits to encode our data. We may compare this with the optimal

6449:   (on average) code for $\overline{X}$: the

6450:   Shannon-Fano code for ${\Pr}_{\text{Bayes}}$, with lengths $L(\overline{x}) =

6451:     - \log {\Pr}_{\text{Bayes}} (\overline{x})$,

6452:  and expected length $H(\overline{X})$. Since the two-part code at

6453:     rate $R$ can never be better than this overall optimum code, the

6454:     difference $\beta(R)$ defined by

6455: \begin{multline}

6456: \label{eq:redundancy}

6457: \beta(R) = H(\hat{\theta}_R(\overline{X})) +  {\bf E}_{{\Pr}_{\text{Bayes}}}

6458: [ - \log p(\overline{X} \mid \hat{\theta}_R(\overline{X}))]- H(\overline{X}) = \\

6459: {\bf E}_{{\Pr}_{\text{Bayes}}}

6460: [ - \log p(\overline{X} \mid \hat{\theta}_R(\overline{X}))] -

6461: H(\overline{X} \mid \hat{\theta}_R(\overline{X}))

6462: \end{multline}

6463: is always nonnegative. Information theorists call

6464: (\ref{eq:redundancy}) the {\em redundancy\/} of the given 2-part code:

6465: it is the average {\em additional\/} number of bits needed to encode

6466: $\overline{X}$ compared to the optimal code for $\overline{X}$.

6467: In analogy with Kolmogorov's minimum randomness

6468: deficiency function $\beta_x (R)$, our new function $\beta(R)$ may be called

6469: the {\em Bayes mean randomness deficiency function}.
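
For a sense of scale, consider a toy instance of our own (assuming the
Bernoulli family with uniform prior density, $n = 2$, and logarithms to base
$2$; none of these numbers come from the cited sources). The Bayesian marginal
is then ${\Pr}_{\text{Bayes}}(00) = {\Pr}_{\text{Bayes}}(11) = 1/3$ and
${\Pr}_{\text{Bayes}}(01) = {\Pr}_{\text{Bayes}}(10) = 1/6$, so that
\[
H(\overline{X}) = 2 \cdot \tfrac{1}{3} \log 3 + 2 \cdot \tfrac{1}{6} \log 6
= \log 3 + \tfrac{1}{3} \approx 1.918 .
\]
At rate $R = 0$ the estimator must be constant, say equal to $\theta$. Since
the expected numbers of ones and zeros under ${\Pr}_{\text{Bayes}}$ are both
$1$, the expected distortion is $- \log \theta - \log (1 - \theta)$, which is
minimized at $\theta = 1/2$ and gives $D^\circ(0) = 2$. Therefore
\[
\beta(0) = 0 + 2 - \left( \log 3 + \tfrac{1}{3} \right) = \tfrac{5}{3} - \log 3 \approx 0.082 ,
\]
that is, in this tiny instance the constant estimator already costs less than a
tenth of a bit more, on average, than the optimal one-part code.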

6470:

6471: Typically, as $R$ increases, the function $\beta$ will behave

6472: as follows: first, it will be much larger than $0$ (and in fact, for

6473: fixed $R$, it will be linear in $n$---the number of i.i.d. random

6474: variables denoted by $\overline{X}$). As $R$ grows ($n$

fixed), the function decreases and reaches a first minimum at $R =

6476: R^*$, the rate for the strict MML estimator minimizing

6477: (\ref{eq:mml}). At this point, the difference $\beta(R)$

6478: is bounded by a constant (independent of $n$).

6479:

6480: This means that at the minimum at $R^*$,

6481: the two-part code is essentially (within a constant) as good as the

overall best one-part (Shannon-Fano) code. The estimator

6483: $\hat{\theta}_R (\overline{x})$, with $R \geq R^*$,

6484: thus {\em on average\/} behaves

6485: as the `algorithmic sufficient statistic',

6486: capturing essentially all regularity in

6487: the data (since even if the `true' distribution ${\Pr}_{\text{Bayes}}$

6488: were known, the data could not be compressed more).

6489: The optimum $\hat{\theta}_{R^*} (\overline{x})$ is

6490: a {\em minimum\/} sufficient statistic since all other

6491: sufficient statistics are attained for $R > R^*$, and need on

6492: average more bits to be described.

6493:

% TODO: in some cases all of this can actually be proved formally.

6495:

We may thus think of the strict MML estimator as a `minimum

6497: sufficient statistic'. Historically, the strict MML estimator has been

6498: introduced and interpreted from a lossless coding point of view.  The

6499: variation of MML based on the `mean structure function' introduced

6500: above may also be understood from a lossy coding point of view: the

6501: MML estimator implements the two-part code that restricts ${\cal M}$

6502: to the smallest possible subset ${\cal M}' \subset {\cal M}$

6503: containing an element $p(\cdot \mid

6504: \theta) \in {\cal M}'$ so that $p(\cdot \mid \theta)$ captures all

6505: relevant information in $\overline{X}$, which means that the data must look

6506: like a typical outcome of $p(\cdot \mid  \theta) \in {\cal M}$.

6507:

6508: \commentout{

6509: \begin{quote}

6510: {\bf Caution\ }

6511: We stress that the encoded value of ${\theta}$ is a {\em

6512:   function \/} of the sample $(x_1, \ldots, x_n)$. It is {\em not\/}

6513: necessarily equal to the $\theta$ `generated' by $W$: since

6514: $$

6515: H({\Theta}) = \sum_{{\Theta} \in {\mathbf \Theta}}

6516: \Pr_{\text{Bayes}}(\{x^n : {\Theta}(x^n) = \theta\}) [

6517: - \log \Pr_{\text{Bayes}}(\{x^n : {\Theta}(x^n) = \theta\}) ]

6518: $$

6519: whereas, if ${\bf \Theta}$ is finite, then

6520: $$

6521: H(\Theta) = \int_{\theta \in {\mathbf \Theta}} (\theta) - \log W(\theta),

6522: $$

6523: and if ${\bf \Theta}$ is infinite, $H(\Theta)$ is not defined.

6524: Thus, we have $H({\Theta}) \neq H(\Theta))$.

6525: \end{quote}

6526: }

6527: \begin{remark}

6528: \rm

6529: The previous analysis opens up the intriguing

6530: possibility to define a {\em randomized MML estimator\/} as the

6531: estimator which minimizes expected two-part code length over all

6532: randomized, rather than just unrandomized functions from the data to

6533: the parameters. As seen, this will in general lead to smaller expected

6534: two-part code lengths. This could therefore somewhat change the

6535: distortion-rate curve, and therefore also somewhat change the inferred

6536: distribution for any given particular set of data. At this time it is

6537: unclear however whether this would lead to any substantial changes.

6538: \end{remark}

6539: \paragraph{Problem and Lacuna}

6540: The `strict MML method' provides a code book that achieves the minimum

6541: two-part code length (and, through the structure function

6542: interpretation, something close to the `optimal separation between

6543: data and noise') {\em on average when applied several times}, and

6544: according to the prior. In practice, a statistician who uses MML

6545: observes a data sample $\overline{x}$ and then infers

6546: that $\hat{\theta}(\overline{x})$ is a good explanation

6547: for the data. There are two potential problems here: (a) for the

6548: individual sequence $\overline{x}$ that actually arises, the MML estimator

6549: $\hat{\theta}(\overline{x})$

6550: may {\em not\/} achieve the optimal data-noise separation;

6551: (b) the statistician may not be able to come up with a reasonable

6552: prior $W$.  These concerns are addressed, to some extent, by MML's

6553: close cousin: Rissanen's Minimum Description Length Principle.

6554: \subsection{MDL Parameter Estimates}

% TODO: Discuss the connection between MML and universal models, and how MDL
% gets away without a prior. This is probably best put *after* the discussion
% of the Kolmogorov sufficient statistic; wait until the Kolmogorov text is available.

% NOTE: the following is probably superfluous.
To end this section, we consider

6560: one last distortion function that will play an important r\^ole in the

6561: next section.

6562: \begin{example}[structure function]

6563: \label{ex:uniformcode}

6564: \rm Consider the distortion function $d: {\cal X} \times {\cal S}

6565: \rightarrow {\cal R}$ where ${\cal S} = 2^{\cal X}$ is the power set

of ${\cal X}$. We define $d(x,S) = \log |S|$ if $x \in S$ and

6567: $d(x,S) = \infty$ if $x \not \in S$.

6568: This distortion function has both

6569: a lossy and a lossless coding interpretation. From the lossy point of

6570: view, $x$ is encoded as a set which contains it, and the quality of

6571: the encoding is given by the (log of the) size of the set. From the

6572: lossless point of view, this distortion corresponds to a scenario

6573: where sender has to send the value $x$ to receiver but is not allowed

6574: to use any arbitrary code he likes. Instead, he must do the encoding

6575: in two stages: he must first specify a set $S$, using at most $R$

6576: bits. He then has to specify $x$ by giving its index in the set

$S$. That is, in the second stage of the description, he is not

6578: allowed to use any probabilistic knowledge about $x$ at all, but must

6579: describe it in a trivial, fixed length manner. This rate distortion

6580: function has a number of interesting properties.

6581: It is closely related to the Kolmogorov structure function which

6582: we discuss in the next section.

6583: \end{example}
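
To see the distortion function of Example~\ref{ex:uniformcode} at work, take
the following hypothetical instance (the alphabet and the sets are our own
choices for illustration, with logarithms to base $2$): let
${\cal X} = \{0,1\}^4$ and $x = 0110$. If $S$ is the set of all strings in
${\cal X}$ with exactly two ones, then $x \in S$, $|S| = \binom{4}{2} = 6$, and
\[
d(x,S) = \log 6 \approx 2.585 ,
\]
the number of bits needed to give the index of $x$ in $S$. The trivial choice
$S = {\cal X}$ gives $d(x,S) = \log 16 = 4$, the singleton $S = \{x\}$ gives
$d(x,S) = 0$, and any set not containing $x$, such as the strings with exactly
three ones, gives $d(x,S) = \infty$. Smaller sets containing $x$ thus lower the
distortion, but in general more bits are needed to specify them in the first stage.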

6584:

\paragraph{A Rate Distortion Theory for individual sequences?}
% NOTE: to be placed after the discussion of Kolmogorov's structure function.
We see
that rate distortion theory leads to trade-offs between the number of bits
needed to send a message and achievable distortion in an average

6588: sense. When also the distortion is measured in terms of bits, this

6589: leads to two-part codes that are optimal on average, as considered in

6590: MML. In MDL and with the Kolmogorov structure function, we consider

two-part codes that are optimal not on average, but in an individual

6592: sequence sense (even though the sense of optimality and the codes that are

6593: used in MDL and Kolmogorov's setting are different, they are both

6594: concerned with individual sequences). This suggests that we might just

6595: as well replace the second part of the code by some non-logarithmic

distortions and consider non-logarithmic distortions in an individual

6597: sequence sense, using the tools developed in the MDL and Kolmogorov structure

6598: function theory. Ideally, this would lead to an {\em individual

6599:   sequence-based rate-distortion theory}. The very first steps in this

6600: promising new direction have been recently taken by

6601: \cite{RissanenT03}.

For different novel approaches, see \cite{KontoyiannisZ02,SowE03}.
% TODO: change the part about the average sufficient statistic in Section 7; see SowE03.

6605:

6606:

6607:

6608:

6609: \section{Resource-Bounded Information}

The area of computational resource-bounded information transmission
seems rather underdeveloped in the Shannon information case.
% TODO: Spielman, Sipser?
Such a theory would
have to deal with the speed of encoding and decoding parsimonious
prefix codes. This will depend on the size of the message domain.

6615: In general we can say that the resource-bounded information transmission

6616: rate will depend primarily on the probability characteristics of the

6617: random source.

6618: In contrast, in the algorithmic (Kolmogorov complexity)

6619: case, the resource-bounded information depends on the

6620: individual object concerned.  The theory is partially developed.

6621: One may consider a book on number theory

6622: difficult, or ``deep.'' The book will list

6623: a number of difficult theorems of number theory. However,

6624: it has very low Kolmogorov complexity since all

6625: theorems are derivable from the initial few definitions.

6626: Our estimate of the difficulty, or ``depth,'' of the book is based

6627: on the fact that it takes a long time to reproduce the book

6628: from part of the information in it.

6629: The existence of a ``deep'' book is itself evidence of some long

6630: evolution preceding it.

6631: Currently, the sequence of primes is being broadcast

6632: to outer space since it is deemed deep enough to prove

6633: to aliens that it arose as a result of a long

6634: evolution.

6635: From the point of view of an investigator, a sequence is deep if

6636: it yields its secrets only slowly: one will be able to discover

6637: all significant regularities in it

6638: only if one analyzes it long enough.

6639:

6640: A suggestive example is provided by

6641: DNA sequences. Such a sequence is quite regular and

6642: has some 90\% redundancy, possibly due to evolutionary history.

6643: A DNA sequence over an alphabet of four letters $\{ A,C,G,T \}$

6644: \index{sequence!DNA}

6645: looks like nothing but a super-long

6646: ($3 \times 10^9$ characters for humans) computer program.

6647: A particular three-letter combination

6648: literally signifies ``begin'' of the encoding of a protein.

6649: Following the ``begin'' command, every next block of three consecutive

6650: letters encodes one of the 20 amino acids. At the end

6651: another three-letter combination signifies the

6652: ``end'' of the program for this protein. Such a sequence is

6653: not Kolmogorov random, and it encodes the structure of a living being.

6654: DNA is much less random than, say, a typical

6655: configuration of gas in a container.

6656: On the other hand, DNA is more random than a crystal.

6657: Both gases and crystals are structurally trivial;

6658: the former is in complete chaos and the latter is in total order.

6659: Intuitively, DNA contains more useful information than both.

6660: A ``deep'' object, such as DNA, is something really simple but

6661: ``disguised'' by complicated manipulations of nature

6662: or computation by computer.

6663:

6664: Logical depth is the necessary number of

6665: steps in the deductive or causal path connecting an object with its

6666: plausible origin. Formally, it is

6667: the time required by a universal computer to compute the

6668: object from its compressed original description.

6669:

6670: It turns out that it is quite subtle to give a formal

6671: definition of ``depth'' that satisfies our intuitive notion

6672: of it. After some attempts at a definition,

6673: we will settle for

6674: Definition~\ref{def.depth}.

6675: As usual, we write $x^*$ to denote the shortest

6676: self-delimiting program (of the reference universal prefix machine

6677: $U$) for $x$.  If there is more than

6678: one of the same length, then $x^*$ is

6679: the first such program in a fixed enumeration.

6680: \begin{description}

6681: \item[Attempt 1]

6682: The number of steps required to compute $x$ from

6683: $x^*$ is not a stable quantity since

6684: there might be a program of

6685: just a few more bits using substantially less time to generate

6686: $x$. That this can happen

6687: is shown by the hierarchy theorems in \cite{LiVi97}.

6688: Therefore, a proper definition of

6689: depth probably should ``compromise''

6690: between the program size and computation time.

6691: \item[Attempt 2]

6692: Relax the strict requirement of

6693: minimum program to {\em almost minimum} programs.

6694: Define that a string $x$

6695: has depth $d$ within error $2^{-b}$ if $x$ can be

6696: computed in $d$ steps by a program $p$ of

6697: no more than $b$ bits in excess of $x^*$.

6698: That is, $2^{-l(p)}/2^{-K(x)} \geq 2^{-b}$.

6699:

6700: This definition is stable but is unsatisfactory because

6701: of the way it treats multiple programs of the same length.

6702: If $2^b$ distinct programs of length $m+b$ all compute $x$,

6703: then together they account for the same

6704: algorithmic probability

6705: \begin{equation}

6706: \nonumber

6707: \sum \{2^{-l(p)}: U(p)=x, l(p) = m+b \},

6708: \end{equation}

6709: as one program of length $m$ printing $x$ does.

6710: That is, they are as likely to produce $x$ as output

6711: of the universal reference prefix machine when

6712: its input is provided by fair coin tosses.

6713: But with the proposed definition,

6714: $2^b$ programs of length $m+b$

make the emergence of $x$ no more

6716: probable than one program of length $m+b$.

6717: \end{description}
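The arithmetic behind the objection to Attempt 2 is a one-line computation,
spelled out here as an illustration (the concrete values $m = 10$, $b = 3$ are
ours). If exactly $2^b$ programs of length $m+b$ compute $x$, then
\[
\sum \{2^{-l(p)}: U(p)=x, l(p) = m+b \} = 2^b \cdot 2^{-(m+b)} = 2^{-m} .
\]
For instance, eight programs of length $13$ contribute
$8 \cdot 2^{-13} = 2^{-10}$, exactly the contribution of a single program of
length $10$, although Attempt 2 treats the eight programs no differently from
a single program of length $13$.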

6718: We shall explicitly take the algorithmic probability into account.

6719: The universal prior probability of a string $x$ is

6720: \index{probability!universal prior}

6721: \[

6722: Q_U (x) = \sum_{U(p)=x} 2^{-l(p)},

6723: \]

6724: where $U$ is the reference universal

6725: prefix machine.

6726: This is the probability

6727: that $U$ would print $x$ if its input were provided by random tosses

6728: of a fair coin.

By one of the main results in Kolmogorov complexity theory,

6730: \begin{equation}\label{eq.depth.PR2}

6731: - \log Q_U (x) + O(1) = -\log {\bf m} (x) = K(x) +O(1).

6732: \end{equation}

6733: It shows that $2^{-K(x)}$ is

6734: a universal discrete semimeasure. This means

6735: that we are free to choose the reference universal semimeasure

6736: ${\bf m}$  exactly equal to $2^{-K(x)}$.

6737:

6738: Thus, weighing all possible causes of emergence of $x$

6739: appropriately, we are led to the following definition:

6740: \begin{definition}\label{def.gacs.depth}\label{def.depth}

6741: \rm

6742: The {\em depth} of a string $x$

6743: at {\em significance level} $\epsilon = 2^{-b}$ is

6744: \[ depth_{\epsilon} (x) = \min

6745: \{ t : Q_U^t (x)/ Q_U (x) \geq \epsilon \},

6746: \]

6747: where $Q_U^t (x) = \sum_{U^t (p)=x} 2^{-l(p)}$

6748: and \index{logical depth!$(d,b)$-deep|bold}

6749: $U^t (p) =x$ means that $U$ computes $x$ within $t$

6750: steps and halts. A string $x$ is {\em $(d,b)$-deep} if

6751: $d=depth_{\epsilon} (x)$ and $\epsilon = 2^{-b}$.

6752: \end{definition}
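As a toy illustration of Definition~\ref{def.depth} (purely hypothetical: a
genuine universal machine has infinitely many programs for every $x$, and the
lengths and running times below are invented for the sake of the arithmetic),
suppose that only two programs contributed to $Q_U (x)$: a program $p_1$ with
$l(p_1) = 10$ that halts after $10^6$ steps, and a program $p_2$ with
$l(p_2) = 13$ that halts after $100$ steps. Then
$Q_U (x) = 2^{-10} + 2^{-13} = 9 \cdot 2^{-13}$ and
\[
\frac{Q_U^t (x)}{Q_U (x)} =
\begin{cases}
0 & \text{if } t < 100, \\
1/9 & \text{if } 100 \leq t < 10^6, \\
1 & \text{if } t \geq 10^6 .
\end{cases}
\]
Since $1/16 \leq 1/9 < 1/8$, we get $depth_{\epsilon} (x) = 100$ at
significance level $\epsilon = 2^{-4}$, but $depth_{\epsilon} (x) = 10^6$ at
$\epsilon = 2^{-3}$. In this sketch $x$ would be both $(100,4)$-deep and
$(10^6,3)$-deep: the fast program alone carries too small a fraction of the
algorithmic probability to count at the stricter level.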

6753: If $x$ is $(d,b)$-deep,

6754: then $x$ receives an approximately $1/2^{b \pm \delta }$ fraction

6755: of its algorithmic probability (for some small $\delta$) from

6756: programs running in $d$ steps.

6757: Below we formalize this statement and make $\delta$ precise.

6758: A binary string $x$ is $b$-compressible

6759: if $l(x^* ) \leq l(x) - b$.

6760: Otherwise, $x$ is $b$-incompressible.

6761: \begin{theorem}\label{theorem.depth}

6762: A string $x$ is

6763: {\em $(d,b)$-deep} {\rm (}$b$ up to precision $K(d)+O(1)${\rm )}

6764: if and only if $d$ is the least time

6765: needed by a $b$-incompressible program

6766: to print $x$.

6767: \end{theorem}

6768:

6769:

6770: %\section{Kolmogorov Minimum Sufficient Statistic}

6771: %A shortcoming of both the Shannon and the Algorithmic Information

6772: %Theory is that they do not distinguish between `useful' and `useless'

6773: %information in an event (initial sequence of outcomes). It turns out

6774: %though that, by using Kolmogorov complexity as a `building block', one

6775: %can arrive at natural definitions of these concepts. These definitions

6776: %are too involved to cite in this abstract. In many cases, `useful'

6777: %information in a sequence corresponds to information that can be used

6778: %to predict future outcomes of the process better than by random

6779: %guessing.

6780: %

6781: %In the paper we will formally define `useful' and `useless'

6782: %information using the theory of the {\em Kolmogorov Minimum Sufficient

6783: %Statistic\/} and we will show the importance of these concepts.

6784: %\section{Properties and Comparisons}

6785: %The form of information theory based on Kolmogorov complexity is

6786: %usually called {\em algorithmic information theory}. Although

6787: %algorithmic rather than probabilistic, algorithmic information theory

6788: %is closely related to the Shannon theory. We will discuss this and

6789: %other relations in detail and give intuitive interpretations of them.

6790: %

6791: %We also point out the important fact that the Kolmogorov

6792: %complexity/minimum sufficient statistic approach cannot be applied

6793: %without modification in practical situations, the reason being that

6794: %there is no algorithm that, for arbitrary inputs $x$ outputs the

6795: %length of the shortest program that prints $x$. We briefly indicate

6796: %modifications of the theory such as the {\em minimum description

6797: %length principle\/} that {\em can\/} be used in practical

6798: %settings. Such modifications have found important applications in

6799: %statistics and machine learning.

6800: \section{Conclusion}

6801: We have compared Shannon's and Kolmogorov's theories of

6802: information, highlighting the various similarities and differences. We

6803: end by suggesting further topics and reading for the interested reader.

6804: \subsection{Further Topics}

6805: We have only treated those aspects of Shannon's theory

6806: that have a clear analogue in Kolmogorov's theory, and vice versa.

6807: Among the many aspects of Shannon theory we have not discussed, one

6808: cannot go unmentioned:

6809: \begin{description}

6810: \item{\bf The Channel Coding Theorem} Of the three (arguably) most important

6811:   developments  in Shannon's original

6812: paper, we only discussed two: first, the {\em noiseless coding theorem\/}

6813: (Theorem~\ref{thm:noiseless}), related to lossless compression or,

6814: equivalently, lossless communication over a {\em noiseless\/} channel.

6815: Second, the fundamental theorem of {\em rate-distortion}, which deals with lossy

6816: compression.

6817: We did not discuss the {\em channel coding theorem},

6818: which is related to {\em lossless\/}

6819: communication over a {\em noisy\/} channel.

6820: \end{description}

6821: Among the many aspects of Kolmogorov complexity that

6822: we have not discussed, some

6823: cannot go unmentioned:

6824: \begin{description}

6825: \item{\bf Algorithmic Randomness; The Universal Distribution}

6826: TODO

6827: \item{\bf Inductive Inference}

6828: TODO

6829: \item{\bf Kolmogorov complexity as a proof technique}

6830: TODO Goedel

6831: \end{description}

6832: \subsection{Further Reading}

6833: The standard reference for Shannon information theory is

6834: \cite{CT91}. Also, Shannon's original \cite{Sh48} is still

well worth reading. The 50-year anniversary issue of the {\em IEEE

6836:   Transactions on Information Theory\/} in 1998

6837: contains overview articles on some of

6838: the most important topics in Shannon

6839: information theory. The standard reference for Kolmogorov Complexity

6840: is \cite{LiVi97}; \cite{Ch87b} is a monograph written by G. Chaitin,

6841: one of the founders of Kolmogorov complexity. It concentrates on the

6842: application of Kolmogorov complexity to proving metamathematical statements.

6843: References

6844: \cite{LiVi97} and \cite{CT91} provide an extensive treatment of all the notions

6845: discussed in this article, as well as many

6846: others we could not touch upon here. Recently, there have been many

6847: exciting new results in `meaningful information' and the Kolmogorov

6848: structure function which are not yet mentioned in \cite{LiVi97}. We

6849: refer to \cite{VV02}. Both universal coding and the Kolmogorov structure

6850: function are closely related to Rissanen's `minimum description length

6851: principle' for inductive inference; see \cite{Grunwald03} and

6852: \cite{Rissanen89}.

6853: }

6854: \section{Conclusion}

6855: We have compared Shannon's and Kolmogorov's theories of information,

6856: highlighting the various similarities and differences. Some of this

6857: material can also be found in \cite{CT91}, the standard reference for

6858: Shannon information theory, as well as \cite{LiVi97}, the standard

6859: reference for Kolmogorov complexity theory. These books predate much

6860: of the recent material on the Kolmogorov theory discussed in the

6861: present paper, such as \cite{HRSV00} (Section~\ref{sec:algmi}),

6862: \cite{Le02} (Section~\ref{sect:minialg}), \cite{GTV01}

6863: (Section~\ref{sec:algsuf}), \cite{VV02, VereshchaginV04}

6864: (Section~\ref{sec:structure}). The material in Sections~\ref{sec:relpa}

6865: and \ref{sec:esf}

6866: has not been published before. The present paper summarizes these

6867: recent contributions and systematically compares

6868: them to the corresponding notions in Shannon's theory.

6869:

6870:   \paragraph{Related Developments:} There are two major practical theories

6871:   which have their roots in both Shannon's and Kolmogorov's notions of

6872:   information: first, {\em universal coding}, briefly introduced in

6873:   Appendix~\ref{sec:universal} below, is a remarkably successful theory for

6874:   practical lossless data compression.  Second, Rissanen's {\em

6875:     Minimum Description Length (MDL) Principle\/}

6876:   \cite{Ri89,Grunwald04} is a theory of inductive inference that

6877:   is both practical and successful. Note that direct practical

6878:   application of Shannon's theory is hampered by the typically

6879:   untenable assumption of a true and known distribution generating the

6880:   data. Direct application of Kolmogorov's theory is hampered by the

6881:   noncomputability of Kolmogorov complexity and the strictly asymptotic

6882:   nature of the results.  Both universal coding (of the individual

6883:   sequence type, Appendix~\ref{sec:universal}) and MDL seek to

6884:   overcome both problems by restricting the description methods used

6885:   to those corresponding to a set of probabilistic predictors (thus

6886:   making encodings and their lengths computable and nonasymptotic);

6887:   yet when applying these predictors, the assumption that any one of

6888:   them generates the data is never actually made. Interestingly, while

6889:   in its current form MDL bases inference on universal codes, in

6890:   recent work Rissanen and co-workers have sought to found the

6891:   principle on a restricted form of the algorithmic sufficient

6892:   statistic and Kolmogorov's structure function as discussed in

6893:   Section~\ref{sec:structure} \cite{RissanenT04}.

6894:

6895:   By looking at general types of prediction errors, of which

6896:   codelengths are merely a special case, one achieves a generalization

6897:   of the Kolmogorov theory that goes by the name of {\em predictive

6898:     complexity}, pioneered by Vovk, Vyugin, Kalnishkan and others\footnote{See {\tt www.vovk.net} for an overview.} \cite{Vovk01}.  Finally, the

6899:   notions of `randomness deficiency' and `typical set' that are

6900:   central to the algorithmic sufficient statistic

6901:   (Section~\ref{sec:algsuf}) are intimately related to

6902:  the celebrated Martin-L\"of-Kolmogorov theory of {\em randomness in

6903:     individual sequences}, an overview of which is given in

6904:   \cite{LiVi97}.

6905: \appendix

6906: \section{Appendix: Universal Codes}

6907: \label{sec:universal}

Shannon's and Kolmogorov's ideas are not directly applicable to

6909: most actual data compression problems. Shannon's theory is hampered

6910: by the typically

6911:   untenable assumption of a true and known distribution generating the

6912:   data. Kolmogorov's theory is hampered by the

6913:   noncomputability of Kolmogorov complexity and the strictly asymptotic

6914:   nature of the results. Yet there is

a feasible middle ground: {\em
universal codes\/}, which may be viewed both as a
generalized version of Shannon's theory and as a feasible
approximation to Kolmogorov's theory. In introducing
the notion of universal coding, Kolmogorov says \cite{Ko65}:

6920: \begin{quote}

6921: ``A universal coding method that permits the transmission of

6922: any sufficiently long message [of length $n$] in an alphabet of $s$ letters

with no more than $nh$ [$h$ is the empirical entropy] binary digits is

6924: not necessarily excessively complex; in particular, it is not

6925: essential to begin by determining the frequencies $p_r$ for the entire

6926: message.''

6927: \end{quote}

6928:

6929: Below we repeatedly use the coding concepts introduced in

6930: Section~\ref{sec:coding}.

6931: Suppose we are given a recursive enumeration

6932: of prefix codes $D_1, D_2, \ldots$. Let $L_1, L_2, \ldots$ be the

6933: length functions associated with these codes. That is, $L_i(x) = \min_y

6934: \{ l(y) : D_i(y) = x \}$; if there exists no $y$ with $D_i(y) = x$,

then $L_i(x) = \infty$. We may encode $x$ by first

6936: encoding a natural number $k$ using the standard prefix code

6937: for the natural numbers.

6938: We then encode $x$ itself using the code $D_k$. This leads to a

6939: so-called {\em two-part code\/} $\tilde{D}$

6940: with lengths $\tilde{L}$. By construction, this code is prefix and its lengths satisfy

6941: \begin{equation}

\tilde{L}(x) := \min_{k \in {\cal N}} \bigl\{ \Lint(k) + L_k(x) \bigr\}.

6943: \end{equation}

6944: Let ${\bf x}$ be an infinite binary sequence and let $x_{[1:n]} \in

6945: \{0,1\}^n$ be the initial $n$-bit segment of this sequence.

6946: Since  $L_{\cal N}(k) = O (\log k)$,

6947: we have for all $k$, all $n$:

6948: $$

6949: \tilde{L}(x_{[1:n]}) \leq  L_k(x_{[1:n]}) + O(\log k).

6950: $$

6951: Recall that for

6952: each fixed $L_k$, the fraction of sequences of length $n$ that can be

6953: compressed by more than $m$ bits is less than $2^{-m}$. Thus,

6954: typically, the codes $L_k$ and the strings $x_{[1:n]}$ will be such

6955: that $L_k(x_{[1:n]})$ grows {\em linearly\/} with $n$.

6956: This implies that for every ${\bf x}$,

6957: the newly constructed $\tilde{L}$ is `almost as good'

6958: as whatever code $D_k$ in the list is best for that particular ${\bf x}$: the

6959: difference in code lengths is bounded by a constant depending on $k$ but not on

6960: $n$. In particular, for each

6961: infinite sequence ${\bf x}$, for each fixed $k$,

6962: \begin{equation}

6963: \label{eq:universal}

6964: \lim_{n \rightarrow \infty}

6965: \frac{\tilde{L}(x_{[1:n]})}{L_k(x_{[1:n]})} \leq 1.

6966: \end{equation}

6967: A code satisfying (\ref{eq:universal}) is called a {\em universal

6968:   code\/} relative to the {\em comparison class\/} of codes

6969: $\{ D_1, D_2, \ldots \}$.

6970: It is `universal' in the sense that it compresses every

6971: sequence essentially as well as the $D_k$ that compresses that particular

6972: sequence the most.

6973: % This terminology is slightly non-standard; see below.

6974: In general, there exist many types of codes that

6975: are universal: the 2-part universal code defined above is just

6976: one means of achieving (\ref{eq:universal}).
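
As a rough numerical illustration of the overhead in the two-part code
(a sketch only, assuming that the standard prefix code for the natural
numbers has lengths $L_{\cal N}(k) \approx \log k + 2 \log \log k$, the
form used in Example~\ref{ex:universal}, and taking logarithms to base
$2$): if the best code in the list has index $k = 1000$, then
$$
L_{\cal N}(1000) \approx 9.97 + 2 \cdot 3.32 \approx 16.6 \ \mbox{bits},
$$
whatever the value of $n$. If, hypothetically, $L_k(x_{[1:n]}) = n/2$
for a sequence with $n = 10^6$, then the ratio in (\ref{eq:universal}) is
at most about $1 + 17/500000 < 1.00004$.
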

6977:

6978: \paragraph{Universal codes and Kolmogorov:}

6979: %Let us now reinterpret the definition of (prefix) Kolmogorov complexity

6980: %in terms of universal codes.

6981: %%

6982: %

6983: %From Definition~\ref{def.KolmK} we see

6984: %that the Kolmogorov complexity is just the length function of the

6985: %universal two-part code that is defined relative to the list of

6986: %reference codes $D_1,D_2, \ldots$ with $D_i$ defined by $D_i(p) =

6987: %\phi_i(\langle p,\epsilon \rangle)$.

6988: %Note that, for large $n$, the Kolmogorov complexity

6989: %$K(x_{[1:n]})$ must be smaller or equal (up to a constant)

6990: %than the universal code length $\tilde{L}(x_{[1:n]})$

6991: In most practically interesting cases we may assume that

6992: for all $k$, the decoding function $D_k$ is computable,

6993: i.e. there exists a prefix Turing machine which

6994: for all $y \in \{0,1\}^*$, when input $y'$ (the prefix-free version of

6995: $y$), outputs $D_k(y)$ and then

6996: halts. Since such a program has finite length, we must have for all $k$,

6997: $$

6998: %l(E^*(x_{[1:n]})) = K(x_{[1:n]}) \leq^+ \tilde{L}_k(x_{[1:n]})

6999: l(E^*(x_{[1:n]})) = K(x_{[1:n]}) \leq^+ L_k(x_{[1:n]})

7000: $$

7001: where $E^*$ is the encoding function defined in Section~\ref{sec:kolmogorov},

7002: with $l(E^*(x))

7003: = K(x)$. Comparing with (\ref{eq:universal}) shows that

7004: the code $D^*$  with encoding function $E^*$ is a universal code relative to $D_1, D_2,

7005: \ldots$. Thus, we see that the Kolmogorov complexity $K$ is just the length function

7006: of the universal code $D^*$. Note that $D^*$ is an example of a universal

7007: code that is not (explicitly) two-part.

7008: \begin{example}

7009: \label{ex:universal}

7010: \rm Let us create a universal two-part code that allows us to significantly

7011: compress all  binary strings with frequency of 0's deviating significantly

from $\frac{1}{2}$. For $n_0 \leq n$, let $D_{\langle n,n_0 \rangle }$ be the code that assigns

7013: code words of equal (minimum)  length

7014: to all strings of length $n$ with $n_0$ zeroes, and no code words to

7015: any other strings. Then $D_{\langle n,n_0 \rangle}$

is a prefix-code and $L_{\langle n,n_0 \rangle} (x) =
\lceil \log \binom{n}{n_0} \rceil$ for every string $x$ of length $n$
with $n_0$ zeroes. The universal two-part code

7018: $\tilde{D}$ relative to the set of codes $\{

7019: D_{\langle i,j\rangle} \; : \; i,j \in {\cal N} \}$ then achieves

7020: the following lengths (to within 1 bit): for all $n$, all $n_0 \in \{0,\ldots,n\}$, all

7021: $x_{[1:n]}$ with $n_0$ zeroes,

7022: $$

7023: \tilde{L}(x_{[1:n]}) = \log n + \log n_0 + 2 \log \log n + 2 \log \log

n_0 + \log \binom{n}{n_0} = \log  \binom{n}{n_0} + O(\log n).

7025: $$

7026: Using Stirling's approximation of the factorial, $n! \sim

7027: n^{n}e^{-n}\sqrt{2\pi n}$,  we find that

7028: \begin{multline}

7029: \label{eq:stirling}

\log \binom{n}{n_0} =
\log n! - \log n_0! - \log (n- n_0)! = \\
n \log n - n_0 \log n_0 - (n-n_0) \log (n- n_0) + O(\log n) = n
H(n_0/n) + O(\log n).

7034: \end{multline}

Note that  $H(n_0/n) \leq 1$, with equality iff $n_0 = n/2$. Therefore, if
the frequency deviates significantly from $\frac{1}{2}$, $\tilde{D}$
saves a number of bits on $x_{[1:n]}$ that is linear in $n$. In all such cases,
$D^*$ saves at least as many bits, up to an additive constant.

7039: Note that (a) each individual code $D_{\langle n,n_0 \rangle}$ is

7040: capable of exploiting a particular type of

7041: regularity in a sequence to compress that

7042: sequence,

7043: (b) the universal code $\tilde{D}$ may exploit

7044: {\em many\/} different types of

7045: regularities to compress a sequence, and (c)

7046: the code $D^*$ with lengths given by

7047: the Kolmogorov complexity asymptotically exploits {\em all\/}

7048: computable regularities so as to maximally compress a sequence.

7049: \end{example}
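
To make the lengths in Example~\ref{ex:universal} concrete, here is a
small worked instance; the values $n = 16$ and $n_0 = 4$ are chosen
purely for illustration, and logarithms are to base $2$. There are
$\binom{16}{4} = 1820$ strings of length $16$ with four zeroes, so
$$
L_{\langle 16,4 \rangle}(x) = \lceil \log 1820 \rceil = \lceil 10.83
\rceil = 11, \qquad
16 \cdot H(1/4) \approx 16 \cdot 0.811 \approx 12.98 .
$$
The code $D_{\langle 16,4 \rangle}$ thus spends $11$ bits instead of the
$16$ bits of the literal encoding, and the gap of about $2.2$ bits
between $\log \binom{16}{4}$ and $16 H(1/4)$ is an instance of the
$O(\log n)$ term in (\ref{eq:stirling}). For such a short string the
cost of describing $n$ and $n_0$ in the two-part code $\tilde{D}$ is not
negligible: by the length formula of the example it is roughly
$4 + 2 + 2 \cdot 2 + 2 \cdot 1 = 12$ bits, more than the $5$ bits saved.
The savings dominate only as $n$ grows.
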

7050: \paragraph{Universal codes and Shannon:}

7051: If a random variable

7052: $X$ is distributed according to some known probability

7053: mass function $f(x)=P(X=x)$,

7054: then the optimal (in the average sense) code to use is the

7055: Shannon-Fano code. But now suppose it is only known that

7056: $f \in \{f \}$, where $\{ f \}$ is some given (possibly very large,

7057: or even uncountable) set of candidate distributions. Now it is not clear

7058: what code is optimal. We may try the Shannon-Fano code for a particular $f

7059: \in \{ f \}$, but such a code will typically lead to very large

7060: expected code lengths if $X$ turns out to be distributed according to

7061: some $g \in \{ f \}, g \neq f$.

We may ask whether there exists another
code that is `almost' as good as the Shannon-Fano code for $f$, no
matter which $f \in \{ f \}$ actually generates the sequence.
We now show that, provided $ \{ f \}$ is finite or countable,
the answer is (perhaps surprisingly) yes. To see this,
we need the notion of a {\em sequential information source\/}
(Section~\ref{sec:preliminaries}).

7069:

7070: Suppose then that $\{ f \}$ represents a finite or countable set of

7071: sequential information sources. Thus,

7072: $\{ f \} = \{ f_1, f_2, \ldots \}$ and $f_k \equiv (f_k^{(1)},

7073: f_k^{(2)}, \ldots)$ represents a sequential information source, abbreviated to

7074: $f_k$. To each marginal distribution $f^{(n)}_k$, there corresponds a

7075: unique Shannon-Fano code defined on the set $\{0,1\}^n$ with lengths

7076: $L_{\langle n, k \rangle}(x) := \lceil \log 1/ f^{(n)}_k(x) \rceil$

7077: and decoding function $D_{\langle n, k \rangle}$.

7078:

7079: For given $f \in \{ f \}$,

7080: we define $H(f^{(n)}) := \sum_{x \in \{0,1\}^n} f^{(n)}(x) [  \log 1/

7081: f^{(n)}(x)]$ as the entropy of the distribution of the first $n$

7082: outcomes.

7083:

7084: Let $E$ be a prefix-code assigning codeword $E(x)$ to source word $x

7085: \in \{0,1\}^n$.  The Noiseless Coding Theorem~\ref{thm:noiseless}

7086: asserts that the minimal average codeword length

7087: $\bar{L}(f^{(n)})

7088: = \sum_{x \in \{0,1\}^n} f^{(n)}(x) l(E(x))$ among all such

7089: prefix-codes $E$ satisfies

$$H(f^{(n)}) \leq \bar{L}(f^{(n)}) \leq H(f^{(n)}) + 1.$$

The entropy $H(f^{(n)})$
can therefore be interpreted, to within one bit, as the expected code length of
encoding the first $n$ bits generated by the source $f$, when the
optimal (Shannon-Fano) code is used.

7095:
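
As a small worked check of these bounds (an illustrative choice of
source, not taken from the text above): let $f$ generate independent
bits with probability $1/4$ of outcome $1$, and take $n = 2$, with
logarithms to base $2$. The four strings $00, 01, 10, 11$ have
probabilities $9/16, 3/16, 3/16, 1/16$, so the Shannon-Fano lengths
$\lceil \log 1/f^{(2)}(x) \rceil$ are $1, 3, 3, 4$. Then
$$
\sum_{x \in \{0,1\}^2} f^{(2)}(x) \lceil \log 1/f^{(2)}(x) \rceil
= \frac{9 \cdot 1 + 3 \cdot 3 + 3 \cdot 3 + 1 \cdot 4}{16}
= \frac{31}{16} \approx 1.94,
\qquad H(f^{(2)}) = 2 H(1/4) \approx 1.62,
$$
so the expected Shannon-Fano length indeed lies between $H(f^{(2)})$ and
$H(f^{(2)}) + 1$. The lengths satisfy the Kraft inequality,
$2^{-1} + 2^{-3} + 2^{-3} + 2^{-4} = 13/16 \leq 1$, so a prefix code
with these lengths exists.
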

7096: We look for a prefix code $\tilde{D}$ with length function $\tilde{L}$

7097: that satisfies, for all fixed $f \in

7098: \{ f \}$:

7099: \begin{equation}

7100: \label{eq:universalb}

7101: \lim_{n \rightarrow \infty}

\frac{{\bf E}_f \tilde{L}(X_{[1:n]})}{H(f^{(n)})} \leq 1,
\end{equation}
where ${\bf E}_f \tilde{L}(X_{[1:n]}) = \sum_{x \in \{0,1\}^n}
f^{(n)}(x)\tilde{L}(x)$.

7106: Define $\tilde{D}$ as the following two-part code: first, $n$ is

7107: encoded using the standard prefix code for  natural numbers. Then, among

7108: all codes $D_{\langle n, k \rangle}$, the $k$ that minimizes

7109: $L_{\langle n, k \rangle}(x)$  is encoded (again using the standard

7110: prefix code); finally, $x$ is encoded in $L_{\langle n, k \rangle}(x)$

7111: bits. Then for all $n$, for all $k$, for {\em every\/} sequence

7112: $x_{[1:n]}$,

7113: \begin{equation}

7114: \label{eq:probuni}

\tilde{L}(x_{[1:n]}) \leq L_{\langle n,k \rangle}(x_{[1:n]}) +L_{\cal
  N}(k) + L_{\cal N}(n).

7117: \end{equation}

7118: Since (\ref{eq:probuni}) holds for all strings of length $n$, it must

7119: also hold in expectation for all possible distributions on strings

7120: of length $n$. In particular, this gives, for all $k \in {\cal N}$,

7121: $$

7122: {\bf E}_{f_k} \tilde{L}(X_{[1:n]}) \leq {\bf E}_{f_k} L_{\langle n, k

7123:   \rangle}(X_{[1:n]}) + O(\log n) = H(f^{(n)}_k) + O(\log n),

7124: $$

7125: from which (\ref{eq:universalb}) follows.
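
The last step can be spelled out as follows; this is a sketch under the
additional assumption, not stated explicitly above, that
$H(f^{(n)}_k)$ grows faster than $\log n$ (for instance linearly in $n$,
as for an i.i.d.\ source with positive entropy per outcome). Dividing
the displayed inequality by $H(f^{(n)}_k)$ gives
$$
\frac{{\bf E}_{f_k} \tilde{L}(X_{[1:n]})}{H(f^{(n)}_k)} \leq 1 +
\frac{O(\log n)}{H(f^{(n)}_k)},
$$
and the right-hand side tends to $1$ as $n \rightarrow \infty$. The
constant hidden in $O(\log n)$ may depend on $k$ but not on $n$.
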

7126:

7127: Historically, codes satisfying (\ref{eq:universalb}) have been called

7128: {\em universal codes\/} relative to $\{ f \}$; codes satisfying

7129: (\ref{eq:universal}) have been considered in the literature only much

7130: more recently and are usually called `universal codes for individual

sequences' \cite{MerhavF98}.  The two-part code $\tilde{D}$ that we
just defined is universal both in an individual sequence and in an
average sense: by (\ref{eq:probuni}), $\tilde{D}$ achieves code lengths within $O(\log n)$ bits of
those achieved by $D_{\langle n,k \rangle}$ for {\em every individual
  sequence}, for {\em every\/} $k \in {\cal N}$; but $\tilde{D}$ also
achieves expected code lengths within $O(\log n)$ bits of the Shannon-Fano
code for $f$, for {\em every\/} $f \in \{ f \}$.  Note once again that

7138: the $D^*$ based on Kolmogorov complexity does at least as well as

7139: $\tilde{D}$.

7140:

7141:

7142: \begin{example}

7143: \rm

7144: \label{ex:appy}

Suppose our sequence is generated by independent tosses of a coin with
bias $p$ of tossing ``heads'', where $p \in (0,1)$.
Identifying `heads' with $1$, the probability of any particular initial
segment $x_{[1:n]}$ containing $n-n_0$ outcomes
``1'' is then $(1-p)^{n_0} p^{n- n_0}$.

7150: Let $\{ f \}$ be the set of corresponding information sources,

7151: containing one element for each $p \in (0,1)$.

7152: $\{ f \}$ is an uncountable set; nevertheless, a universal code for

7153: $\{ f \}$ exists. In fact, it can be shown that

7154: the code $\tilde{D}$ with lengths (\ref{eq:stirling})

7155: in Example~\ref{ex:universal} is universal for $\{ f \}$, i.e. it

7156: satisfies (\ref{eq:universalb}). The reason for this is (roughly) as

7157: follows: if data are generated by a coin with bias $p$, then with

7158: probability $1$, the frequency $n_0/n$ converges to $p$, so that, by

7159: (\ref{eq:stirling}),  $n^{-1} \tilde{L}(x_{[1:n]})$ tends to

7160: $n^{-1} H(f^{(n)}) = H(p,1-p)$.
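
For a rough sense of the numbers (an illustrative calculation with
values chosen here, using logarithms to base $2$ and the length formula
of Example~\ref{ex:universal}): take $p = 0.9$, so that
$H(p,1-p) \approx 0.469$, and suppose that $n = 1000$ tosses produce
exactly $n_0 = 100$ zeroes. Then
$$
\log \binom{1000}{100} \approx 464.4, \qquad
\log n + \log n_0 + 2 \log \log n + 2 \log \log n_0 \approx 28.7,
$$
so $\tilde{L}(x_{[1:n]}) \approx 493$ bits, or about $0.49$ bits per
outcome, against an entropy of about $0.47$ bits per outcome.
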

7161:

7162: If we are interested in practical data-compression, then the

assumption that the data are generated by a biased-coin source is very
restrictive. But there are much richer classes of distributions

7165: $\{ f \}$ for which we can formulate universal codes. For example, we

7166: can take $\{ f \}$ to be the class of all Markov sources of each

7167: order; here the probability that $X_i = 1$ may depend on arbitrarily

7168: many earlier

7169: outcomes. Such ideas form the basis of most data compression schemes

7170: used in practice. Codes which are universal for the class of

7171: all Markov sources of each order and which encode and decode in real-time

7172: can easily be implemented. Thus, while we cannot find the

7173: shortest program that generates a particular sequence, it is often

7174: possible to effectively find the shortest encoding within a

7175: quite sophisticated class of codes.

7176: \end{example}

7177: %From the point of view of Shannon's theory, this means that there

7178: %exists a code which achieves average code length `essentially' as well

7179: %as the optimal Shannon-Fano code for the unknown source that generates

7180: %the sequence. The small price we pay is that our codelengths may be

7181: %longer by some constant not depending on $n$; this may be relevant if

7182: %the string we want to encode is short.

7183: \bibliographystyle{plain}

7184: %\bibliography{jolli,info,book}

7185:

7186: \begin{thebibliography}{10}

7187:

7188: \bibitem{BKVV03}

7189: H.~Buhrman, H.~Klauck, N.K. Vereshchagin, and P.M.B. Vit\'anyi.

7190: \newblock Individual communication complexity.

7191: \newblock In {\em Proc. STACS}, LNCS, pages~19--30, Springer-Verlag, 2004.

7192:

7193: \bibitem{CoxH74}

D.R.~Cox and D.V.~Hinkley.

7195: \newblock {\em Theoretical Statistics}.

7196: \newblock Chapman and Hall, 1974.

7197:

7198:

7199: \bibitem{Ch69}

7200: G.J. Chaitin.

7201: \newblock On the length of programs for computing finite binary sequences:

7202:   statistical considerations.

7203: \newblock {\em J. Assoc. Comput. Mach.}, 16:145--159, 1969.

7204:

7205: \bibitem{CT91}

7206: T.M. Cover and J.A. Thomas.

7207: \newblock {\em Elements of Information Theory}.

7208: \newblock Wiley \& Sons, 1991.

7209:

7210:

7211: \bibitem{Fi22}

7212: R.A. Fisher.

7213: \newblock On the mathematical foundations of theoretical statistics.

7214: \newblock {\em Philos. Trans. Royal Soc. London, Ser. A}, 222:309--368, 1922.

7215:

7216: \bibitem{Ga74}

7217: P.~G\'acs.

7218: \newblock On the symmetry of algorithmic information.

7219: \newblock {\em Soviet Math. Dokl.}, 15:1477--1480, 1974.

7220: \newblock Correction, Ibid., 15:1480, 1974.

7221:

7222:

7223: \bibitem{Grunwald04}

7224: P. D. Gr\"unwald.

7225: \newblock {MDL Tutorial}.

7226: \newblock In P.~D. Gr\"unwald, I.~J. Myung, and M.~A. Pitt (Eds.), {\em

7227:   Advances in Minimum Description Length: Theory and Applications}. MIT Press, 2004.

7228:

7229: \bibitem{GTV01}

7230: P.~G\'acs, J.~Tromp, and P.M.B. Vit\'anyi.

7231: \newblock Algorithmic statistics.

7232: \newblock {\em IEEE Trans. Inform. Theory}, 47(6):2443--2463, 2001.

7233:

7234: \bibitem{HRSV00}

7235: D.~Hammer, A.~Romashchenko, A.~Shen, and N.~Vereshchagin.

7236: \newblock Inequalities for {S}hannon entropies and {K}olmogorov complexities.

7237: \newblock {\em J. Comput. Syst. Sci.}, 60:442--464, 2000.

7238:

7239: \bibitem{Ko65}

7240: A.N. Kolmogorov.

7241: \newblock Three approaches to the quantitative definition of information.

7242: \newblock {\em Problems Inform. Transmission}, 1(1):1--7, 1965.

7243:

7244: \bibitem{Ko74}

7245: A.N. Kolmogorov.

7246: \newblock Complexity of algorithms and objective definition of randomness.

7247: \newblock {\em Uspekhi Mat. Nauk}, 29(4):155, 1974.

7248: \newblock Abstract of a talk at the Moscow Math. Soc. meeting 4/16/1974. In

7249:   Russian.

7250:

7251: \bibitem{Ko83}

7252: A.N. Kolmogorov.

7253: \newblock Combinatorial foundations of information theory and the calculus of

7254:   probabilities.

7255: \newblock {\em Russian Math. Surveys}, 38(4):29--40, 1983.

7256:

7257: \bibitem{Kr49}

7258: L.G. Kraft.

7259: \newblock A device for quantizing, grouping and coding amplitude modulated

7260:   pulses.

7261: \newblock Master's thesis, Dept. of Electrical Engineering, M.I.T., Cambridge,

7262:   Mass., 1949.

7263:

7264: \bibitem{ChCo78}

7265: S.K. Leung-Yan-Cheong and T.M. Cover.

7266: \newblock Some equivalences between {Shannon} entropy and {K}olmogorov

7267:   complexity.

7268: \newblock {\em IEEE Transactions on Information Theory}, 24:331--339, 1978.

7269:

7270: \bibitem{Le74}

7271: L.A. Levin.

7272: \newblock Laws of information conservation (non-growth) and aspects of the

7273:   foundation of probability theory.

7274: \newblock {\em Problems Inform. Transmission}, 10:206--210, 1974.

7275:

7276: \bibitem{Le84}

7277: L.A. Levin.

7278: \newblock Randomness conservation inequalities; information and independence in

7279:   mathematical theories.

7280: \newblock {\em Inform. Contr.}, 61:15--37, 1984.

7281:

7282: \bibitem{Le02}

7283: L.A. Levin.

7284: \newblock Forbidden information.

\newblock In {\em Proc. 43rd IEEE Symp. Found. Comput. Sci.}, pages 761--768,

7286:   2002.

7287:

7288: \bibitem{LiVi97}

7289: M.~Li and P.M.B. Vit\'anyi.

7290: \newblock {\em An {I}ntroduction to {K}olmogorov {C}omplexity and {I}ts

7291:   {A}pplications}.

7292: \newblock Springer-Verlag, 1997.

7293: \newblock 2nd Edition.

7294:

7295: \bibitem{MerhavF98}

7296: N.~Merhav and M.~Feder.

7297: \newblock Universal prediction.

7298: \newblock {\em IEEE Transactions on Information Theory}, IT-44(6):2124--2147,

7299:   1998.

7300: \newblock invited paper for the 1948-1998 commemorative special issue.

7301:

7302: \bibitem{Ri89}

7303: J.J. Rissanen.

\newblock {\em Stochastic Complexity in Statistical Inquiry}.

7305: \newblock World Scientific, 1989.

7306:

7307: \bibitem{RissanenT04}

7308: J. Rissanen and I.~Tabus.

7309: \newblock {K}olmogorov's structure function in {MDL} theory and lossy data

7310:   compression.

7311: \newblock In P.~D. Gr\"unwald, I.~J. Myung, and M.~A. Pitt (Eds.), {\em

7312:   Advances in Minimum Description Length: Theory and Applications}. MIT Press, 2004.

7313:

7314: \bibitem{Sh48}

7315: C.E. Shannon.

\newblock A mathematical theory of communication.

7317: \newblock {\em Bell System Tech. J.}, 27:379--423, 623--656, 1948.

7318:

7319: \bibitem{Sh59}

7320: C.E. Shannon.

7321: \newblock Coding theorems for a discrete source with a fidelity criterion.

7322: \newblock In {\em IRE National Convention Record, Part 4}, pages 142--163,

7323:   1959.

7324:

7325: \bibitem{So64}

7326: R.J. Solomonoff.

7327: \newblock A formal theory of inductive inference, part 1 and part 2.

7328: \newblock {\em Inform. Contr.}, 7:1--22, 224--254, 1964.

7329:

7330: \bibitem{Vovk01}

7331: V. Vovk.

7332: \newblock Competitive on-line statistics,

7333: \newblock {\em Intern. Stat. Rev.}, 69:213--248, 2001.

7334:

7335: \bibitem{VV02}

7336: N.K. Vereshchagin and P.M.B. Vit\'anyi.

7337: \newblock Kolmogorov's structure functions and model selection.

7338: \newblock {\em IEEE Trans. Informat. Theory}.

7339: \newblock To appear.

7340:

7341: \bibitem{VereshchaginV04}

7342: N.K. Vereshchagin and P.M.B. Vit\'anyi.

7343: \newblock Rate-distortion theory for individual data.

7344: \newblock Manuscript, CWI, 2004.

7345:

7346: \bibitem{WallaceF87}

C.S.~Wallace and P.R.~Freeman.
\newblock Estimation and inference by compact coding.
\newblock {\em J. Royal Stat. Soc. Ser. B}, 49:240--251, 1987.

7351: \newblock Discussion: pages 252--265.

7352:

7353: \bibitem{ZvLe70}

7354: A.K. Zvonkin and L.A. Levin.

7355: \newblock The complexity of finite objects and the development of the concepts

7356:   of information and randomness by means of the theory of algorithms.

7357: \newblock {\em Russian Math. Surveys}, 25(6):83--124, 1970.

7358:

7359: \end{thebibliography}

7360:

7361: \end{document}
