
1: \documentclass{article}

2:

3: \usepackage{latexsym,amssymb,textcomp}

4:

5:

6: % short hand for mathcal characters

7: \def\AA{\mathcal A} \def\BB{\mathcal B} \def\CC{\mathcal C}

8: \def\DD{\mathcal D} \def\FF{\mathcal F} \def\LL{\mathcal L}

9: \def\MM{\mathcal M} \def\NN{\mathcal N} \def\OO{\mathcal O}

10: \def\RR{\mathcal R} \def\SS{\mathcal S} \def\UU{\mathcal U}

11: \def\WW{\mathcal W} \def\XX{\mathcal X} \def\YY{\mathcal Y}

12: \def\ZZ{\mathcal Z} \def\PP{\mathcal P}

13:

14: % short hand for mathbb characters

15: \def\NNN{\mathbb N} \def\RRR{\mathbb R} \def\QQQ{\mathbb Q}

16: \def\CCC{\mathbb C} \def\ZZZ{\mathbb Z} \def\BBB{\mathbb B}

17:

18: \def\plt{<^{{}^{\!\!\!\!\!\!\!\;+}}}   % dot less than

19: \def\pgt{>^{{}^{\!\!\!\!\!\!\!\;+}}}    % dot greater than

20: \def\peq{\stackrel{\scriptscriptstyle{+}}{=}}     % dot equal

21:

22: \newtheorem{nummer}{\hspace*{-0.33em}}[section]

23: \newenvironment{definition} {\begin{nummer}{\bf Definition.} \begin{rm}}{\end{rm} \end{nummer}}

24: \newenvironment{lemma}      {\begin{nummer} {\bf Lemma.}}       {\end{nummer}}

25: \newenvironment{theorem}    {\begin{nummer} {\bf Theorem.}}     {\end{nummer}}

26: \newenvironment{corollary}  {\begin{nummer} {\bf Corollary.}}   {\end{nummer}}

27: \newenvironment{proof}      {\noindent \bf Proof. \rm} {\ \nolinebreak \hfill $\Box$ \vspace{2ex}}

28:

29: % expected value

30: \def\E{\mathbf{E}}

31:

32: \sloppy

33:

34: \bibliographystyle{plain}

35:

36:

37: \title{Is there an Elegant Universal\\ Theory of Prediction?

38: \thanks{This work was

39: supported by SNF grant 200020-107616.}}

40:

41: \author{Shane Legg\thanks{\tt shane@idsia.ch}}

42:

43:

44: \begin{document}

45:

46: \maketitle

47:

48: \begin{abstract}

49: Solomonoff's inductive learning model is a powerful, universal and

50: highly elegant theory of sequence prediction.  Its critical flaw is

51: that it is incomputable and thus cannot be used in practice.  It is

52: sometimes suggested that it may still be useful to help guide the

53: development of very general and powerful theories of prediction which

54: are computable.  In this paper it is shown that although powerful

55: algorithms exist, they are necessarily highly complex.  This alone

56: makes their theoretical analysis problematic; however, it is further

57: shown that beyond a moderate level of complexity the analysis runs

58: into the deeper problem of G\"{o}del incompleteness.  This limits the

59: power of mathematics to analyse and study prediction algorithms, and

60: indeed intelligent systems in general.

61: \end{abstract}

62:

63:

64: \section{Introduction}

65:

66: Could there exist an elegant and universal theory of sequence

67: prediction?  Solomonoff's model of induction rapidly learns to make

68: optimal predictions for any computable sequence, including

69: probabilistic ones \cite{Solomonoff:64,Solomonoff:78}.  Indeed the

70: problem of sequence prediction could well be considered solved

71: \cite{hutter:06usp,hutter:04uaibook}, if it were not for the fact that

72: Solomonoff's theoretical model is incomputable.

73:

74: Among computable theories there exist powerful general predictors,

75: such as the Lempel-Ziv algorithm \cite{Feder:92} and Context Tree

76: Weighting \cite{Willems:95}, that can learn to predict some complex

77: sequences, but not others.  Some prediction methods, such as the

78: Minimum Description Length principle \cite{Rissanen:96} and the

79: Minimum Message Length principle \cite{Wallace:68}, can even be viewed

80: as computable approximations to Solomonoff induction~\cite{Li:97}.

81: However, in practice their power and generality are limited by the

82: power of the compression and coding methods employed, and they have

83: significantly reduced data efficiency compared to Solomonoff

84: induction \cite{Poland:04mdl2p}.

85:

86: Could there exist elegant computable prediction algorithms that are in

87: some sense universal, or at least universal over large sets of simple

88: sequences?  In this paper we explore this fundamental question from

89: the perspective of Kolmogorov complexity theory and uncover some

90: surprising implications.

91:

92:

93:

94:

95: \section{Preliminaries}

96:

97: An \emph{alphabet} $\AA$ is a finite set of 2 or more elements which

98: are called \emph{symbols}.  In this paper we will assume a binary

99: alphabet $\BBB := \{ 0, 1 \}$, though all the results can easily be

100: generalised to other alphabets.  A \emph{string} is a finite ordered

101: $n$-tuple of symbols denoted $x := x_1 x_2 \ldots x_n$ where $\forall

102: i \in \{ 1, \ldots, n \}$, $x_i \in \BBB$, or more succinctly, $x \in

103: \BBB^n$.  The 0-tuple is denoted $\lambda$ and is called the

104: \emph{null string}.  The expression $\BBB^{\leq n}$ has the obvious

105: interpretation, and $\BBB^* := \bigcup_{n \in \NNN} \BBB^n$.  The

106: \emph{length-lexicographical} ordering is a total order on $\BBB^*$

107: defined as $\lambda < 0 < 1 < 00 < 01 < 10 < 11 < 000 < 001 < \cdots$.

108: A \emph{substring} of $x$ is defined $x_{j:k} := x_j x_{j+1} \ldots

109: x_k$ where $1 \leq j \leq k \leq n$.  By $|x|$ we mean the length of

110: the string $x$, for example, $|x_{j:k}| = k - j +1$.  We will

111: sometimes need to encode a natural number as a string.  Using simple

112: encoding techniques it can be shown that there exists a computable

113: injective function $f : \NNN \to \BBB^*$ where no string in the range

114: of $f$ is a prefix of any other, and $\forall n \in \NNN : |f(n)| \leq

115: \log_2 n + 2 \log_2 \log_2 n + 1$.
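
To make the encoding concrete, the following is a minimal sketch of one standard prefix-free code for positive integers, the Elias delta code.  This choice is an assumption made for illustration: the text above does not say which code is intended, and the sketch is only claimed to have code length $\log_2 n + O(\log_2 \log_2 n)$, not to meet the exact bound stated for $f$ at every small $n$.  The function names \verb|encode| and \verb|decode| are hypothetical.

```python
def encode(n: int) -> str:
    """Elias delta code word for an integer n >= 1, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("this sketch only encodes n >= 1")
    body = format(n, "b")            # binary expansion of n; its leading bit is 1
    length = format(len(body), "b")  # binary expansion of the bit length of n
    # A unary run of zeros announces how long `length` is; the leading 1 of
    # `body` is implied by the code and therefore dropped.
    return "0" * (len(length) - 1) + length + body[1:]


def decode(bits: str):
    """Read one code word from the front of `bits`; return (n, remaining bits)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    end_len = 2 * zeros + 1                 # end of the length field
    nbits = int(bits[zeros:end_len], 2)     # bit length of n
    end = end_len + nbits - 1               # end of the whole code word
    return int("1" + bits[end_len:end], 2), bits[end:]
```

Because no code word is a prefix of another, code words can be concatenated and read back without separators.  To cover $0$ as well, one could encode $n+1$ in place of $n$.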

116:

117: Unlike strings which always have finite length, a \emph{sequence}

118: $\omega$ is an infinite list of symbols $x_1 x_2 x_3 \ldots \in

119: \BBB^\infty$.  Of particular interest to us will be the class of

120: sequences which can be generated by an algorithm executed on a

121: universal Turing machine:

122:

123: \begin{definition}

124: A {\bf monotone universal Turing machine} $\UU$ is defined as a

125: universal Turing machine with one unidirectional input tape, one

126: unidirectional output tape, and some bidirectional work tapes.  Input

127: tapes are read only, output tapes are write only, and unidirectional tapes

128: are those where the head can only move from left to right.  All tapes

129: are binary (no blank symbol) and the work tapes are initially filled

130: with zeros.  We say that $\UU$ outputs/computes a sequence $\omega$ on

131: input $p$, and write $\UU(p) = \omega$, if $\UU$ reads all of $p$ but

132: no more as it continues to write $\omega$ to the output tape.

133: \end{definition}

134:

135: We fix $\UU$ and define $\UU( p, x )$ by simply using a standard

136: coding technique to encode a program $p$ along with a string $x \in

137: \BBB^*$ as a single input string for $\UU$.

138:

139: \begin{definition}

140: A sequence $\omega \in \BBB^\infty$ is a {\bf computable binary

141: sequence} if there exists a program $q \in \BBB^*$ that writes

142: $\omega$ to a one-way output tape when run on a monotone universal

143: Turing machine $\mathcal{U}$, that is, $\exists q \in \BBB^* : \UU(q)

144: = \omega$. We denote the set of all computable sequences by $\CC$.

145: \end{definition}

146:

147: A similar definition for strings is not necessary as all strings have

148: finite length and are therefore trivially computable.

149:

150: \begin{definition}

151: A {\bf computable binary predictor} is a program $p \in \BBB^*$ that

152: on a universal Turing machine $\UU$ computes a total function $\BBB^*

153: \to \BBB$.

154: \end{definition}

155:

156: For simplicity of notation we will often write $p(x)$ to mean the

157: function computed by the program $p$ when executed on $\UU$ along with

158: the input string $x$, that is, $p(x)$ is short hand for $\UU( p, x )$.

159: Having $x_{1:n}$ as input, the objective of a predictor is for its

160: output, called its \emph{prediction}, to match the next symbol in the

161: sequence.  Formally we express this by writing $p(x_{1:n}) = x_{n+1}$.

162:

163: As the algorithmic prediction of incomputable sequences, such as the

164: halting sequence, is impossible by definition, we only consider the

165: problem of predicting computable sequences.  To simplify things we

166: will assume that the predictor has an unlimited supply of computation

167: time and storage.  We will also make the assumption that the predictor

168: has unlimited data to learn from, that is, we are only concerned with

169: whether or not a predictor can learn to predict in the following

170: sense:

171:

172: \begin{definition}

173: We say that a predictor $p$ can {\bf learn to predict} a sequence

174: $\omega := x_1 x_2 \ldots \in \BBB^\infty$ if there exists $m \in \NNN$

175: such that $\forall n \geq m : p(x_{1:n}) = x_{n+1}$.

176: \end{definition}

177:

178: The existence of $m$ in the above definition need not be constructive,

179: that is, we might not know when the predictor will stop making

180: prediction errors for a given sequence, just that this will occur

181: eventually.  This is essentially ``next value'' prediction as

182: characterised by Barzdin~\cite{Barzdin:72}, which follows from Gold's

183: notion of identifiability in the limit for languages~\cite{Gold:67}.

184:

185: \begin{definition}

186: Let $P(\omega)$ be the set of all predictors able to learn to predict

187: $\omega$.  Similarly for sets of sequences $S \subset

188: \BBB^\infty$, define $P(S) := \bigcap_{\omega \in S} P( \omega )$.

189: \end{definition}

190:

191: A standard measure of complexity for sequences is the length of the

192: shortest program which generates the sequence:

193: \begin{definition}

194: For any sequence $\omega \in \BBB^\infty$ the monotone {\bf Kolmogorov

195: complexity} of the sequence is,

196: \[

197: K( \omega ) := \min_{q \in \BBB^*} \{ |q| : \UU(q) = \omega \},

198: \]

199: where $\UU$ is a monotone universal Turing machine.  If no such $q$

200: exists, we define $K(\omega) := \infty$.

201: \end{definition}

202:

203: It can be shown that this measure of complexity depends on our choice

204: of universal Turing machine $\UU$, but only up to an additive constant

205: that is independent of $\omega$.  This is due to the fact that a

206: universal Turing machine can simulate any other universal Turing

207: machine with a fixed length program.

208:

209: In essentially the same way as the definition above we can define the

210: Kolmogorov complexity of a string $x \in \BBB^n$, written $K(x)$, by

211: requiring that $\UU(q)$ halts after generating $x$ on the output tape.

212: For an extensive treatment of Kolmogorov complexity and some of its

213: applications see \cite{Li:97} or \cite{Calude:02}.

214:

215: As many of our results will have the above property of holding within

216: an additive constant that is independent of the variables in the

217: expression, we will indicate this by placing a small plus above the

218: equality or inequality symbol.  For example, $f(x) \plt g(x)$ means

219: that $\exists c \in \RRR, \forall x : f(x) < g(x) + c$.  When

220: using standard ``Big O'' notation this is unnecessary as expressions

221: are already understood to hold within an independent constant; however,

222: for consistency of notation we will use it in these cases also.
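
As an illustration of this notation, the invariance property stated earlier can be written out as follows.  Here $\UU'$ is a hypothetical second monotone universal Turing machine, and the subscripts on $K$ are introduced only for this example; they are omitted elsewhere.
\[
K_{\UU}(\omega) \plt K_{\UU'}(\omega)
\quad \mbox{abbreviates} \quad
\exists c \in \RRR, \forall \omega \in \CC : K_{\UU}(\omega) < K_{\UU'}(\omega) + c .
\]
The constant $c$ may depend on the pair $\UU, \UU'$ but not on $\omega$.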

223:

224:

225:

226:

227:

228: \section{Prediction of computable sequences}

229:

230: The most elementary result is that every computable sequence can be

231: predicted by at least one predictor, and that this predictor need not

232: be significantly more complex than the sequence to be predicted.

233:

234: \begin{lemma}\label{lem:bound1}

235: $\forall \omega \in \CC, \exists p \in P( \omega ) : K( p )

236: \plt K( \omega )$.

237: \end{lemma}

238:

239: \begin{proof}

240: As the sequence $\omega$ is computable, there must exist at least one

241: algorithm that generates $\omega$.  Let $q$ be the shortest such

242: algorithm and construct an algorithm $p$ that ``predicts'' $\omega$ as

243: follows: Firstly the algorithm $p$ reads $x_{1:n}$ to find the value

244: of $n$, then it runs $q$ to generate $x_{1:n+1}$ and returns $x_{n+1}$

245: as its prediction.  Clearly $p$ perfectly predicts $\omega$ and $|p| <

246: |q| + c$, for some small constant $c$ that is independent of $\omega$

247: and $q$.

248: \end{proof}

249:

250: Not only can any computable sequence be predicted, there also exist

251: very simple predictors able to predict arbitrarily complex sequences:

252:

253: \begin{lemma}\label{lem:predofcomplex}

254: There exists a predictor $p$ such that $\forall n \in \NNN, \exists \,

255: \omega \in \CC : p \in P(\omega)$ and $K(\omega) > n$.

256: \end{lemma}

257:

258: \begin{proof}

259: Take a string $x$ such that $K(x) = |x| \geq 2n$, and from this

260: define a sequence $\omega := x 0 0 0 0 \ldots$.  Clearly $K(\omega) >

261: n$ and yet a simple predictor $p$ that always predicts 0 can learn to

262: predict $\omega$.

263: \end{proof}
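
The construction in this proof can be sketched in a few lines.  This is an illustration only: the string \verb|x| below is an arbitrary stand-in, since a string with $K(x) = |x|$ cannot be exhibited by a short program, and the helper \verb|last_error| is a hypothetical name introduced here.

```python
def always_zero(history: str) -> int:
    """The trivial predictor from the proof: ignore the input, predict 0."""
    return 0


def last_error(predictor, prefix: str) -> int:
    """Largest n with predictor(x_{1:n}) != x_{n+1} inside `prefix`, or -1 if none."""
    last = -1
    for n in range(len(prefix)):
        if predictor(prefix[:n]) != int(prefix[n]):
            last = n
    return last


x = "1101001110"             # stand-in for an incompressible string x
omega_prefix = x + "0" * 50  # finite prefix of omega = x000...

# Every error falls inside x; from then on the predictor is always right.
assert last_error(always_zero, omega_prefix) < len(x)
```

The index $m$ after which no errors occur depends on $|x|$, which the predictor never needs to know.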

264:

265: The predictor used in the above proof is very simple and can only

266: learn sequences that end with all 0's, albeit where the initial string

267: can have an arbitrarily high Kolmogorov complexity.  It is not hard to

268: see that more sophisticated predictors can learn to predict many other

269: more subtle types of patterns which are more complex than the

270: predictor, such as arbitrary repeating strings, regular or primitive

271: recursive sequences.
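
For the case of repeating strings, one possible predictor is sketched below.  It is an assumption about how such a predictor might look, not a construction taken from the text: it assumes the smallest period that is consistent with the observed string and predicts accordingly.  The name \verb|predict_periodic| is hypothetical.

```python
def predict_periodic(history: str) -> int:
    """Predict the next bit assuming the smallest period consistent with `history`."""
    k = len(history)
    for d in range(1, k + 1):
        # Period d is consistent if every symbol equals the one d places earlier.
        if all(history[i] == history[i - d] for i in range(d, k)):
            return int(history[k - d])
    return 0  # empty history: nothing to go on, so output a fixed symbol


y = "110100"        # the repeated block; any binary string could be used here
omega = y * 20      # finite prefix of the sequence yyy...
errors = [n for n in range(len(omega))
          if predict_periodic(omega[:n]) != int(omega[n])]
```

Once every period shorter than the true one has been contradicted by the observed string, the sketch predicts correctly from then on, regardless of how complex the repeated block is.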

272:

273: As each computable sequence can be predicted, and simple predictors

274: exist which can predict arbitrarily complex sequences, we might wonder

275: whether there exists a computable predictor able to learn to predict

276: all computable sequences.  Unfortunately, no universal predictor

277: exists; indeed, for every predictor there exists a sequence which it

278: cannot predict at all:

279:

280: \begin{lemma}\label{lem:adv}

281: For any predictor $p$ there constructively exists a sequence $\omega

282: := x_1 x_2 \ldots \in \CC$ such that $\forall n \in \NNN : p(x_{1:n})

283: \neq x_{n+1}$ and $K(\omega) \plt K(p)$.

284: \end{lemma}

285:

286: \begin{proof}

287: For any computable predictor $p$ there constructively exists a

288: computable sequence $\omega = x_1 x_2 x_3 \ldots$ computed by an

289: algorithm $q$ defined as follows: Set $x_1 = 1 - p(\lambda)$, then

290: $x_2 = 1 - p( x_1 )$, then $x_3 = 1 - p( x_{1:2} )$ and so on.

291: Clearly $\omega \in \CC$ and $\forall n \in \NNN : p(x_{1:n}) = 1 -

292: x_{n+1}$.

293:

294: Let $p^*$ be the shortest program that computes the same function as

295: $p$ and define a sequence generation algorithm $q^*$ based on $p^*$

296: using the procedure above.  By construction, $|q^*| = |p^*| + c$ for

297: some constant $c$ that is independent of $p^*$.  Because $q^*$

298: generates $\omega$, it follows that $K(\omega) \leq |q^*|$.  By

299: definition $K(p) = |p^*|$ and so $K(\omega) \plt K(p)$.

300: \end{proof}
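
The diagonal construction used for $q$ can be sketched as follows.  The predictor \verb|majority| is a hypothetical example chosen for illustration; any total function from binary strings to $\{0,1\}$ could be passed in its place.

```python
def adversarial_prefix(predictor, length: int) -> str:
    """First `length` symbols of the sequence on which `predictor` always errs."""
    x = ""
    for _ in range(length):
        x += str(1 - predictor(x))  # emit the opposite of the prediction
    return x


def majority(history: str) -> int:
    """Hypothetical example predictor: guess the more frequent symbol so far."""
    return int(history.count("1") > history.count("0"))


w = adversarial_prefix(majority, 8)
# By construction the predictor is wrong at every position.
assert all(majority(w[:n]) != int(w[n]) for n in range(len(w)))
```

The generator only wraps the predictor in a short loop, which is why its description length exceeds that of the predictor by no more than a constant.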

301:

302: Allowing the predictor to be probabilistic does not fundamentally

303: avoid the problem of Lemma~\ref{lem:adv}.  In each step, rather than

304: generating the opposite to what will be predicted by $p$, instead $q$

305: attempts to generate the symbol which $p$ is least likely to predict

306: given $x_{1:n}$.  To do this $q$ must simulate $p$ in order to

307: estimate $\Pr \! \big( p(x_{1:n}) = 1 \big| x_{1:n} \big)$.  With

308: sufficient simulation effort, $q$ can estimate this probability to any

309: desired accuracy for any $x_{1:n}$.  This produces a computable

310: sequence $\omega$ such that $\forall n \in \NNN : \Pr \!  \big(

311: p(x_{1:n}) = x_{n+1} \big| x_{1:n} \big)$ is not significantly greater

312: than $\frac{1}{2}$, that is, the performance of $p$ is no better than

313: a predictor that makes completely random predictions.

314:

315: The impossibility of prediction in this more general probabilistic

316: setting has been pointed out before by Dawid~\cite{Dawid:85}.

317: Specifically, Dawid notes that for any statistical forecasting system

318: there exist sequences which are not calibrated.  Dawid also notes that

319: a forecasting system for a family of distributions is necessarily more

320: complex than any forecasting system generated from a single

321: distribution in the family.  However, he does not deal with the

322: complexity of the sequences themselves, nor does he make a precise

323: statement in terms of a specific measure of complexity, such as

324: Kolmogorov complexity.  The impossibility of forecasting has since

325: been developed in considerably more depth by V'yugin~\cite{Vyugin:98},

326: in particular, it is proven that there is an efficient randomised

327: procedure producing sequences that cannot be predicted (with high

328: probability) by computable forecasting systems.

329:

330: As probabilistic prediction complicates things without avoiding this

331: fundamental problem, in the remainder of this paper we will consider

332: only deterministic predictors.  This will also allow us to see the

333: roots of this problem as clearly as possible.  With the preliminaries

334: covered, we now move on to the central problem considered in this

335: paper: Predicting sequences of limited Kolmogorov complexity.

336:

337:

338:

339:

340:

341: \section{Prediction of simple computable sequences}

342:

343: As no single computable predictor can learn to predict every computable

344: sequence, a weaker goal is to be able to predict all ``simple''

345: computable sequences.

346:

347: \begin{definition}

348: For $n \in \NNN$, let $\CC_n := \{ \omega \in \CC: K(\omega) \leq n

349: \}$.  Further, let $P_n := P( \CC_n )$ be the set of predictors able

350: to learn to predict all sequences in $\CC_n$.

351: \end{definition}

352:

353: Firstly we establish that prediction algorithms exist that can learn

354: to predict all sequences up to a given complexity, and that these

355: predictors need not be significantly more complex than the sequences

356: they can predict:

357:

358: \begin{lemma} \label{lem:infpredictors}

359: $\forall n \in \NNN, \exists p \in P_n : K( p ) \plt n + O( \log_2 n )$.

360: \end{lemma}

361:

362: \begin{proof}

363: Let $h \in \NNN$ be the number of programs of length $n$ or less which

364: generate infinite sequences.  Build the value of $h$ into a prediction

365: algorithm $p$ constructed as follows:

366:

367: In the $k^{th}$ prediction cycle run in parallel all programs of

368: length $n$ or less until $h$ of these programs have each produced

369: $k+1$ symbols of output.  Next predict according to the $(k+1)^{th}$

370: symbol of a generated string whose first $k$ symbols are consistent

371: with the observed string.  If two distinct generated strings are consistent

372: with the observed string (there cannot be more than two as the

373: strings are binary and have length $k+1$), pick the one which was

374: generated by the program that occurs first in a lexicographical

375: ordering of the programs.  If no generated output is consistent, give

376: up and output a fixed symbol.

377:

378: For sufficiently large $k$, only the $h$ programs which produce

379: infinite sequences will produce output strings of length $k$.  As this

380: set of sequences is finite, they can be uniquely identified by finite

381: initial strings.  Thus for sufficiently large $k$ the predictor $p$

382: will correctly predict any computable sequence $\omega$ for which $K(

383: \omega ) \leq n$, that is, $p \in P_n$.

384:

385: As there are $2^{n+1} -1$ possible strings of length $n$ or less, $h <

386: 2^{n+1}$ and thus we can encode $h$ with at most $\log_2 h + 2 \log_2 \log_2 h

387: + 1 < n + 2 + 2\log_2 (n+1)$ bits.  Thus, $K( p ) < n + 2 + 2 \log_2 (n+1)

388: + c$ for some constant $c$ that is independent of $n$.

389: \end{proof}
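
The prediction step of this proof can be illustrated with a toy analogue.  The sketch below makes a strong simplifying assumption: in place of all programs of length $n$ or less, it uses a short hypothetical list of total Python functions mapping a position to a bit.  Since these always produce output, the count $h$, which the real construction needs in order to know when to stop waiting for programs that never produce enough output, plays no role here.

```python
import math


def make_predictor(hypotheses):
    """Predict with the first hypothesis (in list order) that matches the history."""
    def predict(history: str) -> int:
        k = len(history)
        for g in hypotheses:
            if all(g(i) == int(history[i]) for i in range(k)):
                return g(k)
        return 0  # no hypothesis is consistent: give up, output a fixed symbol
    return predict


# Hypothetical stand-ins for the programs that generate infinite sequences;
# g(i) is the symbol at position i, counting from 0.
hypotheses = [
    lambda i: 0,                              # 000000...
    lambda i: i % 2,                          # 010101...
    lambda i: int(i % 3 == 0),                # 100100...
    lambda i: int(math.isqrt(i) ** 2 == i),   # 1 exactly at the perfect squares
]
predictor = make_predictor(hypotheses)
```

Because the list is finite, each hypothesis is separated from the others by a finite initial string, so on any sequence in the list the predictor makes only finitely many errors.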

390:

391: Can we do better than this?  Lemma~\ref{lem:predofcomplex} shows us

392: that there exist predictors able to predict at least some sequences

393: vastly more complex than themselves.  This suggests that there might

394: exist simple predictors able to predict arbitrary sequences up to a

395: high complexity.  Formally, could there exist $p \in P_n$ where $n \gg

396: K(p)$?  Unfortunately, these simple but powerful predictors are not

397: possible:

398:

399: \begin{theorem}\label{thm:simplepred}

400: $\forall n \in \NNN: p \in P_n \Rightarrow K(p) \pgt n$.

401: \end{theorem}

402:

403: \begin{proof}

404: For any $n \in \NNN$ let $p \in P_n$, that is, $\forall \omega \in

405: \CC_n: p \in P(\omega)$.  By Lemma~\ref{lem:adv} we know that $\exists

406: \, \omega' \in \CC : p \notin P(\omega')$ .  As $p \notin P(\omega')$

407: it must be the case that $\omega' \notin \CC_n$, that is, $K(\omega')

408: > n$.  From Lemma~\ref{lem:adv} we also know that $K(p) \pgt

409: K(\omega')$ and so the result follows.

410: \end{proof}
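
Written out with an explicit constant, the argument of this proof is the following chain.  The name $c$ for the constant supplied by Lemma~\ref{lem:adv} is introduced here only for illustration.
\[
n < K(\omega') < K(p) + c .
\]
The first inequality holds because $\omega' \notin \CC_n$, and the second is the complexity bound of Lemma~\ref{lem:adv}.  Hence $K(p) > n - c$ with $c$ independent of both $n$ and $p$, which is what $K(p) \pgt n$ abbreviates.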

411:

412: Intuitively the reason for this is as follows: Lemma~\ref{lem:adv}

413: guarantees that every simple predictor fails for at least one simple

414: sequence.  Thus if we want a predictor that can learn to predict all

415: sequences up to a moderate level of complexity, then clearly the

416: predictor cannot be simple.  Likewise, if we want a predictor that can

417: predict all sequences up to a high level of complexity, then the

418: predictor itself must be very complex.  Thus, even though we have made

419: the generous assumption of unlimited computational resources and data

420: to learn from, only very complex algorithms can be truly powerful

421: predictors.

422:

423: These results easily generalise to notions of complexity that take

424: computation time into consideration.  As sequences are infinite, the

425: appropriate measure of time is the time needed to generate or predict

426: the next symbol in the sequence.  Under any reasonable measure of time

427: complexity, the operation of inverting a single output from a binary

428: valued function can be performed with little cost.  If $C$ is any

429: complexity measure with this property, it is trivial to see that the

430: proof of Lemma~\ref{lem:adv} still holds for $C$.  From this, an

431: analogue of Theorem~\ref{thm:simplepred} for $C$ easily follows.  With

432: similar arguments these results also generalise in a straightforward

433: way to complexity measures that take space or other computational

434: resources into account.  Thus, the fact that extremely powerful

435: predictors must be very complex holds under any measure of complexity

436: for which inverting a single bit is inexpensive.

437:

438:

439:

440: \section{Complexity of prediction}

441:

442: Another way of viewing these results is in terms of an alternate

443: notion of sequence complexity defined as the size of the smallest

444: predictor able to learn to predict the sequence.  This allows us to

445: express the results of the previous sections more concisely.

446: Formally, for any sequence $\omega$ define the complexity measure,

447: \[

448: \dot{K} ( \omega ) := \min_{p \in \BBB^*} \{ |p| : p \in P( \omega ) \},

449: \]

450: and $\dot{K}(\omega) := \infty$ if $P( \omega ) = \varnothing$.  Thus,

451: if $\dot{K} ( \omega )$ is high then the sequence $\omega$ is complex

452: in the sense that only complex prediction algorithms are able to learn

453: to predict it.  It can easily be seen that this notion of complexity

454: has the same invariance to the choice of reference universal Turing

455: machine as the standard Kolmogorov complexity measure.

456:

457: It may be tempting to conjecture that this definition simply describes

458: what might be called the ``tail end complexity'' of a sequence, that

459: is, $\dot{K}(\omega) = \lim_{i \to \infty} K(\omega_{i:\infty})$.

460: This is not the case.  Consider the repeating strings mentioned after

461: Lemma~\ref{lem:predofcomplex}.  For any $n \in \NNN$, let $y_{1:n}$ be a random

462: string, that is, $K(y_{1:n}) \peq n$.  From this define a

463: computable sequence that is a repetition of this string, $\omega :=

464: (y_{1:n})^*$.  As noted there, a single predictor

465: $p$ can learn to predict any sequence of this form, with no restriction

466: on how high $K(\omega)$ can be.  From our definition of $\dot{K}$

467: above it is thus clear that $\dot{K}(\omega) \peq 0$ for any such

468: $\omega$.  Consider now the tail complexity of $\omega$.  As

469: $K(y_{1:n}) \peq n$, whenever $i \bmod n = 0$ we have

470: $K(\omega_{i:\infty}) \pgt n - O(\log n)$ (the $O(\log n)$ term comes

471: from potentially saving bits due to not having to encode $|y_{1:n}|$).

472: Thus even if the limit $\lim_{i \to \infty} K(\omega_{i:\infty})$

473: exists (it may oscillate), it cannot be equal to $\dot{K}(\omega)$ in

474: general.

475:

476: Using $\dot{K}$ we can now rewrite a number of our previous results

477: more succinctly in terms of the new complexity measure.  From

478: Lemma~\ref{lem:bound1} it immediately follows that,

479: \[

480: \forall

481: \omega: 0 \leq \dot{K}( \omega ) \plt K( \omega ).

482: \]

483: From Lemma~\ref{lem:predofcomplex} we know that $\exists c \in \NNN,

484: \forall n \in \NNN, \exists \, \omega \in \CC$ such that $\dot{K}(

485: \omega ) <c$ and $K( \omega ) > n$, that is, $\dot{K}$ can attain the

486: lower bound above within a small constant, no matter how large the

487: value of $K$ is.  The sequences for which the upper bound on $\dot{K}$

488: is tight are interesting as they are the ones which demand complex

489: predictors.  We prove the existence of these sequences and look at

490: some of their properties in the next section.

491:

492: The complexity measure $\dot{K}$ can also be generalised to sets of

493: sequences, for $S \subset \BBB^\infty$ define $\dot{K}( S ) := \min_p

494: \{ |p|: p \in P(S) \}$.  This allows us to rewrite

495: Lemma~\ref{lem:infpredictors} and Theorem~\ref{thm:simplepred} as

496: simply,

497: \[

498: \forall n \in \NNN : n

499: \plt \dot{K} ( \CC_n ) \plt n + O( \log_2 n ).

500: \]

501: This is just a restatement of the fact that the simplest predictor

502: capable of predicting all sequences up to a Kolmogorov complexity of

503: $n$ has itself a Kolmogorov complexity of roughly $n$.

504:

505:

506:

507: \section{Hard to predict sequences}\label{sec:hard}

508:

509: We have already seen that some individual sequences, such as the

510: sequence $x 0 0 0 \ldots$ used in the proof of Lemma~\ref{lem:predofcomplex},

511: can have arbitrarily high Kolmogorov complexity but nevertheless can

512: be predicted by trivial algorithms.  Thus, although these sequences

513: contain a lot of information in the Kolmogorov sense, in a deeper

514: sense their structure is very simple and easily learnt.

515:

516: What interests us in this section is the other extreme; individual

517: sequences which can only be predicted by complex predictors.  As we

518: are only concerned with prediction in the limit, this extra complexity

519: in the predictor must be some kind of special information which

520: cannot be learnt just through observing the sequence.  Our first task

521: is to show that these difficult to predict sequences exist.

522:

523: \begin{theorem}\label{thm:uninf}

524: $\forall n \in \NNN, \exists \, \omega \in \CC : n \plt \dot{K}(

525: \omega ) \plt K(\omega) \plt n + O( \log_2 n )$.

526: \end{theorem}

527:

528: \begin{proof}

529: For any $n \in \NNN$, let $Q_n \subset \BBB^{<n}$ be the set of

530: programs shorter than $n$ that are predictors, and let $x_{1:k} \in

531: \BBB^k$ be the observed initial string from the sequence $\omega$

532: which is to be predicted.  Now construct a meta-predictor $\hat{p}$:

533:

534: By dovetailing the computations, run in parallel every program of

535: length less than $n$ on every string in $\BBB^{\leq k}$.  Each time a

536: program is found to halt on all of these input strings, add the

537: program to a set of ``candidate prediction algorithms'', called

538: $\tilde{Q}^k_n$.  As each element of $Q_n$ is a valid predictor and

539: thus will halt for all input strings for any $k$, for every $n$ and

540: $k$ it eventually will be the case that $|\tilde{Q}^k_n| = |Q_n|$.  At

541: this point the simulation to approximate $Q_n$ terminates.  It is

542: clear that for sufficiently large values of $k$ all of the valid

543: predictors, and only the valid predictors, will halt with a single

544: symbol of output on all tested input strings.  That is, $\exists r \in

545: \NNN, \forall k > r : \tilde{Q}^k_n = Q_n$.

546:

547: The second part of the $\hat{p}$ algorithm uses these candidate

548: prediction algorithms to make a prediction.  For $p \in \tilde{Q}^k_n$

549: define $d^k(p) := \sum_{i=1}^{k-1} |p(x_{1:i})-x_{i+1}|$.  Informally,

550: $d^k(p)$ is the number of prediction errors made by $p$ so far.

551: Compute this for all $p \in \tilde{Q}^k_n$ and then let $p^*_k \in

552: \tilde{Q}^k_n$ be the program with minimal $d^k(p)$.  If there is more

553: than one such program, break the tie by letting $p^*_k$ be the

554: lexicographically first of these.  Finally, $\hat{p}$ computes the

555: value of $p^*_k(x_{1:k})$ and then returns this as its prediction and

556: halts.

557:

558: By Lemma~\ref{lem:adv}, there exists $\omega' \in \CC$ such that

559: $\hat{p}$ makes a prediction error for every $k$ when trying to

560: predict $\omega'$.  Thus, in each cycle at least one of the finitely

561: many predictors with minimal $d^k$ makes a prediction error and so

562: $\forall p \in Q_n: d^k(p) \to \infty$ as $k \to \infty$.  Therefore,

563: $\nexists p \in Q_n : p \in P(\omega')$, that is, no program of length

564: less than $n$ can learn to predict $\omega'$ and so $n \leq

565: \dot{K}(\omega')$.  Further, from Lemma~\ref{lem:bound1} we know that

566: $\dot{K}( \omega' ) \plt K(\omega')$, and from Lemma~\ref{lem:adv}

567: again, $K(\omega') \plt K(\hat{p})$.

568:

569: Examining the algorithm for $\hat{p}$, we see that it contains some

570: fixed length program code and an encoding of $|Q_n|$, where $|Q_n| <

571: 2^n-1$.  Thus, using a standard encoding method for integers,

572: $K(\hat{p}) \plt n + O( \log_2 n )$.

573:

574: Chaining these together we get, $n \plt \dot{K}( \omega' ) \plt

575: K(\omega') \plt K(\hat{p}) \plt n + O( \log_2 n )$, which proves the

576: theorem.

577: \end{proof}
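
The interplay between the meta-predictor and the adversarial sequence can be seen in a toy analogue.  The sketch assumes a small hypothetical pool of total Python predictors in place of $Q_n$, so the dovetailing step that identifies $Q_n$ is omitted.  It also counts the prediction made on the empty string as a possible error, a slight departure from the definition of $d^k$ above.

```python
def errors_on(p, prefix: str) -> int:
    """Number of positions in `prefix` at which predictor p guessed wrongly."""
    return sum(p(prefix[:i]) != int(prefix[i]) for i in range(len(prefix)))


def make_meta_predictor(pool):
    """Follow whichever pool member has made the fewest errors so far."""
    def predict(history: str) -> int:
        # min() keeps the earliest pool member when several tie.
        best = min(pool, key=lambda p: errors_on(p, history))
        return best(history)
    return predict


pool = [
    lambda h: 0,                       # always predict 0
    lambda h: 1,                       # always predict 1
    lambda h: int(h[-1]) if h else 0,  # repeat the last symbol
]
meta = make_meta_predictor(pool)

# Adversarial sequence against the meta-predictor, built as in the earlier lemma.
omega = ""
for _ in range(60):
    omega += str(1 - meta(omega))
```

At every step the pool member currently followed makes an error, so the smallest error count in the pool must rise at least once every \verb|len(pool)| steps.  This is the finite counterpart of $d^k(p) \to \infty$ for every $p \in Q_n$.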

578:

579: This establishes the existence of sequences with arbitrarily high

580: $\dot{K}$ complexity which also have a similar level of Kolmogorov

581: complexity.  Next we establish a fundamental property of high

582: $\dot{K}$ complexity sequences: they are extremely difficult to

583: compute.

584:

585: For an algorithm $q$ that generates $\omega \in \CC$, define $t_q(n)$

586: to be the number of computation steps performed by $q$ before the

587: $n^{th}$ symbol of $\omega$ is written to the output tape.  For

588: example, if $q$ is a simple algorithm that outputs the sequence

589: $010101\ldots$, then clearly $t_q(n) = O(n)$ and so $\omega$ can be

590: computed quickly.  The following lemma shows that if a sequence can

591: be computed in a reasonable amount of time, then the sequence must

592: have a low $\dot{K}$ complexity:

593:

594: \begin{lemma}\label{lem:slow}

595: $\forall \omega \in \CC$, if $\exists q : \UU(q) = \omega$ and

596: $\exists r \in \NNN , \forall n > r : t_q(n) < 2^n$, then

597: $\dot{K}(\omega) \peq 0$.

598: \end{lemma}

599:

600: \begin{proof}

601: Construct a prediction algorithm $\tilde{p}$ as follows:

602:

603: On input $x_{1:n}$, run all programs of length $n$ or less, each for

604: $2^{n+1}$ steps.  In a set $W_n$ collect together all generated

605: strings which are at least $n+1$ symbols long and where the first $n$

606: symbols match the observed string $x_{1:n}$.  Now order the strings in

607: $W_n$ according to a lexicographical ordering of their generating

608: programs.  If $W_n = \varnothing$, then just return a prediction of 1

609: and halt.  If $|W_n| \geq 1$ then return the $(n+1)^{th}$ symbol from the

610: first sequence in the above ordering.

611:

612: Assume that $\exists q : \UU(q) = \omega$ such that $\exists r \in

613: \NNN , \forall n > r : t_q(n) < 2^n$.  If $q$ is not unique, take $q$

614: to be the lexicographically first of these.  Clearly $\forall n > \max(r, |q|)$

615: the initial string from $\omega$ generated by $q$ will be in the set

616: $W_n$.  As there is no lexicographically lower program which can

617: generate $\omega$ within the time constraint $t_q (n) < 2^n$ for all

618: $n>r$, for sufficiently large $n$ the predictor $\tilde{p}$ must

619: converge on using $q$ for each prediction and thus $\tilde{p} \in

620: P(\omega)$.  As $|\tilde{p}|$ is clearly a fixed constant that is

621: independent of $\omega$, it follows then that $\dot{K}(\omega) \leq

622: |\tilde{p}| \peq 0$.

623: \end{proof}
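
To give a sense of the resources used by $\tilde{p}$, the following
rough count may help.  It assumes a binary program alphabet and counts
the empty program, neither of which is essential to the argument.  The
number of programs of length $n$ or less is $\sum_{i=0}^{n} 2^i =
2^{n+1} - 1$, and each is run for $2^{n+1}$ steps, so on input
$x_{1:n}$ the predictor simulates at most
\[
  \left( 2^{n+1} - 1 \right) 2^{n+1} < 2^{2n+2}
\]
steps in total.  For example, for $n = 3$ there are $15$ programs,
each run for $16$ steps, giving $240 < 256$ simulated steps.  Thus
$\tilde{p}$ always halts, but at a cost that grows exponentially in
the length of the observed string.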

624:

625: We could replace the $2^n$ bound in the above result with an even more

626: rapidly growing computable function, for example, $2^{2^n}$.  In any

627: case, this does not change the fundamental result that sequences which

628: have a high $\dot{K}$ complexity are practically impossible to

629: compute.  However from our theoretical perspective these sequences

630: present no problem as they can be predicted, albeit with immense

631: difficulty.

632:

633:

634: \section{The limits of mathematical analysis}

635:

636: One way to interpret the results of the previous sections is in terms

637: of constructive theories of prediction.  Essentially, a constructive

638: theory of prediction $\mathcal{T}$, expressed in some sufficiently

639: rich formal system $\mathcal{F}$, is in effect a description of a

640: prediction algorithm with respect to a universal Turing machine which

641: implements the required parts of $\mathcal{F}$.  Thus from

642: Theorems~\ref{thm:simplepred} and \ref{thm:uninf} it follows that if

643: we want to have a predictor that can learn to predict all sequences up

644: to a high level of Kolmogorov complexity, or even just predict

645: individual sequences which have high $\dot{K}$ complexity, the

646: constructive theory of prediction that we base our predictor on must

647: be very complex.  Elegant and highly general constructive theories of

648: prediction simply do not exist, even if we assume unlimited

649: computational resources.  This is in marked contrast to Solomonoff's

650: highly elegant but non-constructive theory of prediction.

651:

652: Naturally, highly complex theories of prediction will be very

653: difficult to mathematically analyse, if not practically impossible.

654: Thus at some point the development of very general prediction

655: algorithms must become mainly an experimental endeavour due to the

656: difficulty of working with the required theory.  Interestingly, an

657: even stronger result can be proven showing that beyond some point the

658: mathematical analysis is in fact impossible, even in theory:

659:

660: \begin{theorem}\label{thm:incomplete}

661: In any consistent formal axiomatic system $\FF$ that is sufficiently

662: rich to express statements of the form ``$p \in P_n$'', there exists

663: $m \in \NNN$ such that for all $n > m$ and for all predictors $p \in

664: P_n$ the true statement ``$p \in P_n$'' cannot be proven in $\FF$.

665: \end{theorem}

666:

667: In other words, even though we have proven that very powerful sequence

668: prediction algorithms exist, beyond a certain complexity it is

669: impossible to find any of these algorithms using mathematics.  The

670: proof has a similar structure to Chaitin's information theoretic proof

671: \cite{Chaitin:82} of G\"{o}del's incompleteness theorem for formal

672: axiomatic systems \cite{Goedel:31}.

673:

674: \begin{proof}

675: For each $n \in \NNN$ let $T_n$ be the set of statements expressed in

676: the formal system $\FF$ of the form ``$p \in P_n$'', where $p$ is

677: filled in with the complete description of some algorithm in each

678: case.  As the set of programs is denumerable, $T_n$ is also

679: denumerable and each element of $T_n$ has finite length.  From

680: Lemma~\ref{lem:infpredictors} and Theorem~\ref{thm:simplepred} it

681: follows that each $T_n$ contains infinitely many statements of the

682: form ``$p \in P_n$'' which are true.

683:

684: Fix $n$ and create a search algorithm $s$ that enumerates all proofs

685: in the formal system $\FF$ searching for a proof of a

686: statement in the set $T_n$.  As the set $T_n$ is recursive, $s$ can

687: always recognise a proof of a statement in $T_n$.  If $s$ finds any

688: such proof, it outputs the corresponding program $p$ and then halts.

689:

690: By way of contradiction, assume that $s$ halts, that is, a proof of a

691: theorem in $T_n$ is found and $p$ such that $p \in P_n$ is generated

692: as output.  The size of the algorithm $s$ is a constant (a description

693: of the formal system $\FF$ and some proof enumeration code) plus

694: an $O( \log_2 n )$ term needed to describe $n$.  It follows then that

695: $K(p) \plt O( \log_2 n )$.  However from Theorem~\ref{thm:simplepred}

696: we know that $K(p) \pgt n$.  Thus, for sufficiently large $n$, we have

697: a contradiction and so our assumption of the existence of a proof must

698: be false.  That is, for sufficiently large $n$ and for all $p \in

699: P_n$, the true statement ``$p \in P_n$'' cannot be proven within the

700: formal system~$\FF$.

701: \end{proof}
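
The contradiction in the proof can be made explicit as follows.  The
constants $c_1$, $c_2$ and $c_3$ are hypothetical names, introduced
here only for illustration, for the constants hidden in the $\pgt$,
$\plt$ and $O(\cdot)$ notation; they depend on $\FF$ and $\UU$ but not
on $n$.  Theorem~\ref{thm:simplepred} gives $K(p) \geq n - c_1$, while
the existence of the search algorithm $s$ gives $K(p) \leq c_2 + c_3
\log_2 n$.  Both can hold only if
\[
  n - c_3 \log_2 n \leq c_1 + c_2 .
\]
As the left hand side grows without bound, this inequality holds for
only finitely many $n$, and $m$ can be taken to be the largest of
these.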

702:

703: The exact value of $m$ depends on our choice of formal system $\FF$

704: and which reference machine $\UU$ we measure complexity with respect

705: to.  However for reasonable choices of $\FF$ and $\UU$ the value of

706: $m$ would be in the order of 1000.  That is, the bound $m$ is

707: certainly not so large as to be vacuous.

708:

709:

710:

711:

712: \section{Discussion}

713:

714: Solomonoff induction is an elegant and extremely general model of

715: inductive learning.  It neatly brings together the philosophical

716: principles of Occam's razor, Epicurus' principle of multiple

717: explanations, Bayes' theorem and Turing's model of universal

718: computation into a theoretical sequence predictor with astonishingly

719: powerful properties.  If theoretical models of prediction can have

720: such elegance and power, one cannot help but wonder whether similarly

721: beautiful and highly general computable theories of prediction are

722: also possible.

723:

724: What we have shown here is that there does not exist an elegant

725: constructive theory of prediction for computable sequences, even if we

726: assume unbounded computational resources, unbounded data and learning

727: time, and place moderate bounds on the Kolmogorov complexity of the

728: sequences to be predicted.  Very powerful computable predictors are

729: therefore necessarily complex.  We have further shown that the source

730: of this problem is computable sequences which are extremely expensive

731: to compute.  While we have proven that very powerful prediction

732: algorithms which can learn to predict these sequences exist, we have

733: also proven that, unfortunately, mathematical analysis cannot be used

734: to discover these algorithms due to problems of G\"{o}del

735: incompleteness.

736:

737: These results can be extended to more general settings, specifically

738: to those problems which are equivalent to, or depend on, sequence

739: prediction.  Consider, for example, a reinforcement learning agent

740: interacting with an environment \cite{Sutton:98,hutter:04uaibook}.  In

741: each interaction cycle the agent must choose its actions so as to

742: maximise the future rewards that it receives from the environment.  Of

743: course the agent cannot know for certain whether or not some action

744: will lead to rewards in the future, and thus it must predict these.

745: Clearly, at the heart of reinforcement learning lies a prediction

746: problem, and so the results for computable predictors presented in

747: this paper also apply to computable reinforcement learners.  More

748: specifically, from Theorem~\ref{thm:simplepred} it follows that very

749: powerful computable reinforcement learners are necessarily complex,

750: and from Theorem~\ref{thm:incomplete} it follows that it is impossible

751: to discover extremely powerful reinforcement learning algorithms

752: mathematically.

753:

754: It is reasonable to ask whether the assumptions we have made in our

755: model need to be changed.  If we increase the power of the predictors

756: further, for example by providing them with some kind of an oracle,

757: this would make the predictors even more unrealistic than they

758: currently are.  Clearly this goes against our goal of finding an

759: elegant, powerful and general prediction theory that is more realistic

760: in its assumptions than Solomonoff's incomputable model.  On the other

761: hand, if we weaken our assumptions about the predictors' resources to

762: make them more realistic, we are in effect taking a subset of our

763: current class of predictors.  As such, all the same limitations and

764: problems will still apply, as well as some new ones.

765:

766: It seems then that the way forward is to further restrict the problem

767: space.  One possibility would be to bound the amount of computation

768: time needed to generate the next symbol in the sequence.  However if

769: we do this without restricting the predictors' resources then the

770: simple predictor from Lemma~\ref{lem:slow} easily learns to predict

771: any such sequence and thus the problem of prediction in the limit has

772: become trivial.  Another possibility might be to bound the memory of

773: the machine used to generate the sequence; however, this makes the

774: generator a finite state machine and thus bounds its computation time,

775: again making the problem trivial.

776:

777: Perhaps the only reasonable solution would be to add additional

778: restrictions to both the algorithms which generate the sequences to be

779: predicted, and to the predictors.  We may also want to consider not

780: just learnability in the limit, but also how quickly the predictor is

781: able to learn.  Of course we are then facing a much more difficult

782: analysis problem.

783:

784:

785: \subsubsection*{Acknowledgements}

786:

787: I would like to thank Marcus Hutter, Alexey Chernov, Daniil Ryabko and

788: Laurent Orseau for useful discussions and advice during the

789: development of this paper.

790:

791:

792: \begin{thebibliography}{10}

793:

794: \bibitem{Barzdin:72}

795: J.~M. Barzdin.

796: \newblock Prognostication of automata and functions.

797: \newblock {\em Information Processing}, 71:81--84, 1972.

798:

799: \bibitem{Calude:02}

800: C.~S. Calude.

801: \newblock {\em Information and Randomness}.

802: \newblock Springer, Berlin, 2nd edition, 2002.

803:

804: \bibitem{Chaitin:82}

805: G.~J. Chaitin.

806: \newblock G{\"o}del's theorem and information.

807: \newblock {\em International Journal of Theoretical Physics}, 22:941--954,

808:   1982.

809:

810: \bibitem{Dawid:85}

811: A.~P. Dawid.

812: \newblock Comment on {T}he impossibility of inductive inference.

813: \newblock {\em Journal of the American Statistical Association},

814:   80(390):340--341, 1985.

815:

816: \bibitem{Feder:92}

817: M.~Feder, N.~Merhav, and M.~Gutman.

818: \newblock Universal prediction of individual sequences.

819: \newblock {\em {IEEE} Trans. on Information Theory}, 38:1258--1270, 1992.

820:

821: \bibitem{Goedel:31}

822: K.~G{\"o}del.

823: \newblock {\"U}ber formal unentscheidbare {S}{\"a}tze der principia mathematica

824:   und verwandter systeme {I}.

825: \newblock {\em Monatshefte f{\"u}r Mathematik und Physik}, 38:173--198, 1931.

826: \newblock [English translation by E. Mendelsohn: ``On undecidable propositions

827:   of formal mathematical systems''. In M. Davis, editor, {\it The undecidable},

828:   pages 39--71, New York, 1965. Raven Press, Hewlett].

829:

830: \bibitem{Gold:67}

831: E.~Mark Gold.

832: \newblock Language identification in the limit.

833: \newblock {\em Information and Control}, 10(5):447--474, 1967.

834:

835: \bibitem{hutter:04uaibook}

836: M.~Hutter.

837: \newblock {\em Universal Artificial Intelligence: Sequential Decisions based on

838:   Algorithmic Probability}.

839: \newblock Springer, Berlin, 2005.

840: \newblock 300 pages, http://www.idsia.ch/$_{^{\sim}}$marcus/ai/uaibook.htm.

841:

842: \bibitem{hutter:06usp}

843: M.~Hutter.

844: \newblock On the foundations of universal sequence prediction.

845: \newblock In {\em Proc. 3rd Annual Conference on Theory and Applications of

846:   Models of Computation ({TAMC'06})}, volume 3959 of {\em LNCS}, pages

847:   408--420. Springer, 2006.

848:

849: \bibitem{Li:97}

850: M.~Li and P.~M.~B. Vit\'anyi.

851: \newblock {\em An introduction to {Kolmogorov} complexity and its

852:   applications}.

853: \newblock Springer, 2nd edition, 1997.

854:

855: \bibitem{Poland:04mdl2p}

856: J.~Poland and M.~Hutter.

857: \newblock Convergence of discrete {MDL} for sequential prediction.

858: \newblock In {\em Proc. 17th Annual Conf. on Learning Theory ({COLT'04})},

859:   volume 3120 of {\em LNAI}, pages 300--314, Banff, 2004. Springer, Berlin.

860:

861: \bibitem{Rissanen:96}

862: J.~J. Rissanen.

863: \newblock Fisher {I}nformation and {S}tochastic {C}omplexity.

864: \newblock {\em IEEE Trans. on Information Theory}, 42(1):40--47, January 1996.

865:

866: \bibitem{Solomonoff:64}

867: R.~J. Solomonoff.

868: \newblock A formal theory of inductive inference: Part 1 and 2.

869: \newblock {\em Inform. Control}, 7:1--22, 224--254, 1964.

870:

871: \bibitem{Solomonoff:78}

872: R.~J. Solomonoff.

873: \newblock Complexity-based induction systems: comparisons and convergence

874:   theorems.

875: \newblock {\em IEEE Trans. Information Theory}, IT-24:422--432, 1978.

876:

877: \bibitem{Sutton:98}

878: R.~Sutton and A.~Barto.

879: \newblock {\em Reinforcement learning: An introduction}.

880: \newblock Cambridge, MA, MIT Press, 1998.

881:

882: \bibitem{Vyugin:98}

883: V.~V. V'yugin.

884: \newblock Non-stochastic infinite and finite sequences.

885: \newblock {\em Theoretical computer science}, 207:363--382, 1998.

886:

887: \bibitem{Wallace:68}

888: C.~S. Wallace and D.~M. Boulton.

889: \newblock An information measure for classification.

890: \newblock {\em Computer Jrnl.}, 11(2):185--194, August 1968.

891:

892: \bibitem{Willems:95}

893: F.M.J. Willems, Y.M. Shtarkov, and Tj.J. Tjalkens.

894: \newblock The context-tree weighting method: Basic properties.

895: \newblock {\em IEEE Transactions on Information Theory}, 41(3), 1995.

896:

897: \end{thebibliography}

898:

899:

900:

901:

902: \end{document}

903: