1: \documentclass{article}
2:
3: \usepackage{latexsym,amssymb,textcomp}
4:
5:
6: % short hand for mathcal characters
7: \def\AA{\mathcal A} \def\BB{\mathcal B} \def\CC{\mathcal C}
8: \def\DD{\mathcal D} \def\FF{\mathcal F} \def\LL{\mathcal L}
9: \def\MM{\mathcal M} \def\NN{\mathcal N} \def\OO{\mathcal O}
10: \def\RR{\mathcal R} \def\SS{\mathcal S} \def\UU{\mathcal U}
11: \def\WW{\mathcal W} \def\XX{\mathcal X} \def\YY{\mathcal Y}
12: \def\ZZ{\mathcal Z} \def\PP{\mathcal P}
13:
14: % short hand for mathbb characters
15: \def\NNN{\mathbb N} \def\RRR{\mathbb R} \def\QQQ{\mathbb Q}
16: \def\CCC{\mathbb C} \def\ZZZ{\mathbb Z} \def\BBB{\mathbb B}
17:
18: \def\plt{<^{{}^{\!\!\!\!\!\!\!\;+}}} % dot less than
19: \def\pgt{>^{{}^{\!\!\!\!\!\!\!\;+}}} % dot greater than
20: \def\peq{\stackrel{\scriptscriptstyle{+}}{=}} % dot equal
21:
22: \newtheorem{nummer}{\hspace*{-0.33em}}[section]
23: \newenvironment{definition} {\begin{nummer}{\bf Definition.} \begin{rm}}{\end{rm} \end{nummer}}
24: \newenvironment{lemma} {\begin{nummer} {\bf Lemma.}} {\end{nummer}}
25: \newenvironment{theorem} {\begin{nummer} {\bf Theorem.}} {\end{nummer}}
26: \newenvironment{corollary} {\begin{nummer} {\bf Corollary.}} {\end{nummer}}
27: \newenvironment{proof} {\noindent \bf Proof. \rm} {\ \nolinebreak \hfill $\Box$ \vspace{2ex}}
28:
29: % expected value
30: \def\E{\mathbf{E}}
31:
32: \sloppy
33:
34: \bibliographystyle{plain}
35:
36:
37: \title{Is there an Elegant Universal\\ Theory of Prediction?
38: \thanks{This work was
39: supported by SNF grant 200020-107616.}}
40:
41: \author{Shane Legg\thanks{\tt shane@idsia.ch}}
42:
43:
44: \begin{document}
45:
46: \maketitle
47:
48: \begin{abstract}
49: Solomonoff's inductive learning model is a powerful, universal and
50: highly elegant theory of sequence prediction. Its critical flaw is
51: that it is incomputable and thus cannot be used in practice. It is
52: sometimes suggested that it may still be useful to help guide the
53: development of very general and powerful theories of prediction which
54: are computable. In this paper it is shown that although powerful
55: algorithms exist, they are necessarily highly complex. This alone
makes their theoretical analysis problematic; however, it is further
57: shown that beyond a moderate level of complexity the analysis runs
58: into the deeper problem of G\"{o}del incompleteness. This limits the
59: power of mathematics to analyse and study prediction algorithms, and
60: indeed intelligent systems in general.
61: \end{abstract}
62:
63:
64: \section{Introduction}
65:
66: Could there exist an elegant and universal theory of sequence
67: prediction? Solomonoff's model of induction rapidly learns to make
68: optimal predictions for any computable sequence, including
69: probabilistic ones \cite{Solomonoff:64,Solomonoff:78}. Indeed the
70: problem of sequence prediction could well be considered solved
71: \cite{hutter:06usp,hutter:04uaibook}, if it were not for the fact that
72: Solomonoff's theoretical model is incomputable.
73:
74: Among computable theories there exist powerful general predictors,
75: such as the Lempel-Ziv algorithm \cite{Feder:92} and Context Tree
76: Weighting \cite{Willems:95}, that can learn to predict some complex
77: sequences, but not others. Some prediction methods, such as the
78: Minimum Description Length principle \cite{Rissanen:96} and the
79: Minimum Message Length principle \cite{Wallace:68}, can even be viewed
80: as computable approximations to Solomonoff induction~\cite{Li:97}.
However, in practice their power and generality are limited by the
power of the compression and coding methods employed, and they have
significantly reduced data efficiency compared to Solomonoff
induction \cite{Poland:04mdl2p}.
85:
86: Could there exist elegant computable prediction algorithms that are in
87: some sense universal, or at least universal over large sets of simple
88: sequences? In this paper we explore this fundamental question from
89: the perspective of Kolmogorov complexity theory and uncover some
90: surprising implications.
91:
92:
93:
94:
95: \section{Preliminaries}
96:
97: An \emph{alphabet} $\AA$ is a finite set of 2 or more elements which
98: are called \emph{symbols}. In this paper we will assume a binary
99: alphabet $\BBB := \{ 0, 1 \}$, though all the results can easily be
100: generalised to other alphabets. A \emph{string} is a finite ordered
101: $n$-tuple of symbols denoted $x := x_1 x_2 \ldots x_n$ where $\forall
102: i \in \{ 1, \ldots, n \}$, $x_i \in \BBB$, or more succinctly, $x \in
103: \BBB^n$. The 0-tuple is denoted $\lambda$ and is called the
104: \emph{null string}. The expression $\BBB^{\leq n}$ has the obvious
interpretation, and $\BBB^* := \bigcup_{n \in \NNN} \BBB^n$. The
\emph{length-lexicographical} ordering is a total order on $\BBB^*$
defined as $\lambda < 0 < 1 < 00 < 01 < 10 < 11 < 000 < 001 < \cdots$.
A \emph{substring} of $x$ is defined as $x_{j:k} := x_j x_{j+1} \ldots
109: x_k$ where $1 \leq j \leq k \leq n$. By $|x|$ we mean the length of
110: the string $x$, for example, $|x_{j:k}| = k - j +1$. We will
111: sometimes need to encode a natural number as a string. Using simple
112: encoding techniques it can be shown that there exists a computable
113: injective function $f : \NNN \to \BBB^*$ where no string in the range
114: of $f$ is a prefix of any other, and $\forall n \in \NNN : |f(n)| \leq
115: \log_2 n + 2 \log_2 \log_2 n + 1$.
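
As an illustration only (no particular $f$ is fixed here), one code of
this kind is the Elias delta code. The notation $b(n)$ for the usual
binary expansion of $n \geq 1$ and $\ell(n) := \lfloor \log_2 n \rfloor
+ 1$ for its length is introduced just for this example. To encode
$n$, write $\ell(\ell(n)) - 1$ zeros, then $b(\ell(n))$, then $b(n)$
with its leading $1$ removed. For $n = 9$ we have $b(9) = 1001$,
$\ell(9) = 4$ and $b(4) = 100$, so
\[
f(9) = 00 \, 100 \, 001 .
\]
The length of this code is $\lfloor \log_2 n \rfloor + 2 \lfloor \log_2
\ell(n) \rfloor + 1$, which equals $8$ for $n = 9$. For $n \geq 2$ this
is at most $\log_2 n + 2 \log_2 \log_2 n + 3$, that is, it agrees with
the bound stated above up to a small additive constant.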
116:
117: Unlike strings which always have finite length, a \emph{sequence}
118: $\omega$ is an infinite list of symbols $x_1 x_2 x_3 \ldots \in
119: \BBB^\infty$. Of particular interest to us will be the class of
120: sequences which can be generated by an algorithm executed on a
121: universal Turing machine:
122:
123: \begin{definition}
124: A {\bf monotone universal Turing machine} $\UU$ is defined as a
125: universal Turing machine with one unidirectional input tape, one
126: unidirectional output tape, and some bidirectional work tapes. Input
tapes are read only, output tapes are write only, and unidirectional tapes
are those where the head can only move from left to right. All tapes
129: are binary (no blank symbol) and the work tapes are initially filled
130: with zeros. We say that $\UU$ outputs/computes a sequence $\omega$ on
131: input $p$, and write $\UU(p) = \omega$, if $\UU$ reads all of $p$ but
132: no more as it continues to write $\omega$ to the output tape.
133: \end{definition}
134:
135: We fix $\UU$ and define $\UU( p, x )$ by simply using a standard
136: coding technique to encode a program $p$ along with a string $x \in
137: \BBB^*$ as a single input string for $\UU$.
138:
139: \begin{definition}
140: A sequence $\omega \in \BBB^\infty$ is a {\bf computable binary
141: sequence} if there exists a program $q \in \BBB^*$ that writes
142: $\omega$ to a one-way output tape when run on a monotone universal
143: Turing machine $\mathcal{U}$, that is, $\exists q \in \BBB^* : \UU(q)
144: = \omega$. We denote the set of all computable sequences by $\CC$.
145: \end{definition}
146:
147: A similar definition for strings is not necessary as all strings have
148: finite length and are therefore trivially computable.
149:
150: \begin{definition}
151: A {\bf computable binary predictor} is a program $p \in \BBB^*$ that
152: on a universal Turing machine $\UU$ computes a total function $\BBB^*
153: \to \BBB$.
154: \end{definition}
155:
156: For simplicity of notation we will often write $p(x)$ to mean the
157: function computed by the program $p$ when executed on $\UU$ along with
the input string $x$, that is, $p(x)$ is shorthand for $\UU( p, x )$.
159: Having $x_{1:n}$ as input, the objective of a predictor is for its
160: output, called its \emph{prediction}, to match the next symbol in the
161: sequence. Formally we express this by writing $p(x_{1:n}) = x_{n+1}$.
162:
163: As the algorithmic prediction of incomputable sequences, such as the
164: halting sequence, is impossible by definition, we only consider the
165: problem of predicting computable sequences. To simplify things we
166: will assume that the predictor has an unlimited supply of computation
167: time and storage. We will also make the assumption that the predictor
168: has unlimited data to learn from, that is, we are only concerned with
169: whether or not a predictor can learn to predict in the following
170: sense:
171:
172: \begin{definition}
173: We say that a predictor $p$ can {\bf learn to predict} a sequence
174: $\omega := x_1 x_2 \ldots \in \BBB^\infty$ if there exists $m \in \NNN$
175: such that $\forall n \geq m : p(x_{1:n}) = x_{n+1}$.
176: \end{definition}
177:
178: The existence of $m$ in the above definition need not be constructive,
179: that is, we might not know when the predictor will stop making
180: prediction errors for a given sequence, just that this will occur
181: eventually. This is essentially ``next value'' prediction as
182: characterised by Barzdin~\cite{Barzdin:72}, which follows from Gold's
183: notion of identifiability in the limit for languages~\cite{Gold:67}.
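
For instance, consider a hypothetical predictor $p$, introduced here
only for illustration, that outputs the last symbol of its input and
$0$ on the null string, that is, $p(\lambda) := 0$ and $p(x_{1:n}) :=
x_n$. On the sequence $\omega := 0111\ldots$ we get
\[
p(x_{1:1}) = 0 \neq 1 = x_2, \qquad
p(x_{1:n}) = x_n = 1 = x_{n+1} \quad \mbox{for all } n \geq 2,
\]
so $p$ learns to predict $\omega$ with $m = 2$. The same $p$ never
learns to predict $0101\ldots$, as there $p(x_{1:n}) = x_n \neq
x_{n+1}$ for every $n$.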
184:
185: \begin{definition}
186: Let $P(\omega)$ be the set of all predictors able to learn to predict
187: $\omega$. Similarly for sets of sequences $S \subset
188: \BBB^\infty$, define $P(S) := \bigcap_{\omega \in S} P( \omega )$.
189: \end{definition}
190:
191: A standard measure of complexity for sequences is the length of the
192: shortest program which generates the sequence:
193: \begin{definition}
194: For any sequence $\omega \in \BBB^\infty$ the monotone {\bf Kolmogorov
195: complexity} of the sequence is,
196: \[
197: K( \omega ) := \min_{q \in \BBB^*} \{ |q| : \UU(q) = \omega \},
198: \]
199: where $\UU$ is a monotone universal Turing machine. If no such $q$
200: exists, we define $K(\omega) := \infty$.
201: \end{definition}
202:
203: It can be shown that this measure of complexity depends on our choice
204: of universal Turing machine $\UU$, but only up to an additive constant
205: that is independent of $\omega$. This is due to the fact that a
206: universal Turing machine can simulate any other universal Turing
207: machine with a fixed length program.
208:
209: In essentially the same way as the definition above we can define the
210: Kolmogorov complexity of a string $x \in \BBB^n$, written $K(x)$, by
211: requiring that $\UU(q)$ halts after generating $x$ on the output tape.
212: For an extensive treatment of Kolmogorov complexity and some of its
213: applications see \cite{Li:97} or \cite{Calude:02}.
214:
215: As many of our results will have the above property of holding within
216: an additive constant that is independent of the variables in the
217: expression, we will indicate this by placing a small plus above the
218: equality or inequality symbol. For example, $f(x) \plt g(x)$ means
that $\exists c \in \RRR, \forall x : f(x) < g(x) + c$. When
using standard ``Big O'' notation this is unnecessary as expressions
are already understood to hold within an independent constant; however,
for consistency of notation we will use it in these cases also.
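
As a small worked example, with functions chosen here purely for
illustration, take $f(x) := |x| + 5$ and $g(x) := |x|$. Then $f(x)
\plt g(x)$, since $c = 6$ gives
\[
|x| + 5 < |x| + 6 \quad \mbox{for every } x \in \BBB^* .
\]
In contrast, $f'(x) := 2|x|$ does not satisfy $f'(x) \plt g(x)$: for
any fixed $c$ the inequality $2|x| < |x| + c$ fails as soon as $|x|
\geq c$.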
223:
224:
225:
226:
227:
228: \section{Prediction of computable sequences}
229:
230: The most elementary result is that every computable sequence can be
231: predicted by at least one predictor, and that this predictor need not
232: be significantly more complex than the sequence to be predicted.
233:
234: \begin{lemma}\label{lem:bound1}
235: $\forall \omega \in \CC, \exists p \in P( \omega ) : K( p )
236: \plt K( \omega )$.
237: \end{lemma}
238:
239: \begin{proof}
240: As the sequence $\omega$ is computable, there must exist at least one
241: algorithm that generates $\omega$. Let $q$ be the shortest such
242: algorithm and construct an algorithm $p$ that ``predicts'' $\omega$ as
243: follows: Firstly the algorithm $p$ reads $x_{1:n}$ to find the value
244: of $n$, then it runs $q$ to generate $x_{1:n+1}$ and returns $x_{n+1}$
245: as its prediction. Clearly $p$ perfectly predicts $\omega$ and $|p| <
246: |q| + c$, for some small constant $c$ that is independent of $\omega$
247: and $q$.
248: \end{proof}
249:
250: Not only can any computable sequence be predicted, there also exist
251: very simple predictors able to predict arbitrarily complex sequences:
252:
253: \begin{lemma}\label{lem:predofcomplex}
There exists a predictor $p$ such that $\forall n \in \NNN, \exists \,
255: \omega \in \CC : p \in P(\omega)$ and $K(\omega) > n$.
256: \end{lemma}
257:
258: \begin{proof}
259: Take a string $x$ such that $K(x) = |x| \geq 2n$, and from this
260: define a sequence $\omega := x 0 0 0 0 \ldots$. Clearly $K(\omega) >
261: n$ and yet a simple predictor $p$ that always predicts 0 can learn to
262: predict $\omega$.
263: \end{proof}
264:
265: The predictor used in the above proof is very simple and can only
266: learn sequences that end with all 0's, albeit where the initial string
267: can have an arbitrarily high Kolmogorov complexity. It is not hard to
268: see that more sophisticated predictors can learn to predict many other
269: more subtle types of patterns which are more complex than the
270: predictor, such as arbitrary repeating strings, regular or primitive
271: recursive sequences.
272:
273: As each computable sequence can be predicted, and simple predictors
274: exist which can predict arbitrarily complex sequences, we might wonder
275: whether there exists a computable predictor able to learn to predict
276: all computable sequences. Unfortunately, no universal predictor
277: exists, indeed for every predictor there exists a sequence which it
278: cannot predict at all:
279:
280: \begin{lemma}\label{lem:adv}
281: For any predictor $p$ there constructively exists a sequence $\omega
282: := x_1 x_2 \ldots \in \CC$ such that $\forall n \in \NNN : p(x_{1:n})
283: \neq x_{n+1}$ and $K(\omega) \plt K(p)$.
284: \end{lemma}
285:
286: \begin{proof}
287: For any computable predictor $p$ there constructively exists a
288: computable sequence $\omega = x_1 x_2 x_3 \ldots$ computed by an
289: algorithm $q$ defined as follows: Set $x_1 = 1 - p(\lambda)$, then
290: $x_2 = 1 - p( x_1 )$, then $x_3 = 1 - p( x_{1:2} )$ and so on.
291: Clearly $\omega \in \CC$ and $\forall n \in \NNN : p(x_{1:n}) = 1 -
292: x_{n+1}$.
293:
294: Let $p^*$ be the shortest program that computes the same function as
295: $p$ and define a sequence generation algorithm $q^*$ based on $p^*$
296: using the procedure above. By construction, $|q^*| = |p^*| + c$ for
297: some constant $c$ that is independent of $p^*$. Because $q^*$
298: generates $\omega$, it follows that $K(\omega) \leq |q^*|$. By
299: definition $K(p) = |p^*|$ and so $K(\omega) \plt K(p)$.
300: \end{proof}
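
To make the construction concrete, here is a sketch for one specific
predictor, assumed only for illustration: let $p$ output the last
symbol seen, with $p(\lambda) := 0$. The procedure in the proof gives
\[
x_1 = 1 - p(\lambda) = 1, \quad
x_2 = 1 - p(1) = 0, \quad
x_3 = 1 - p(10) = 1, \quad
x_4 = 1 - p(101) = 0,
\]
and so on, so that $\omega = 101010\ldots$ Against the predictor that
always outputs $0$, the same procedure yields $\omega = 1111\ldots$ In
both cases the generating algorithm only needs to simulate $p$ and
flip each output, which is why its length exceeds that of $p$ by at
most a constant.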
301:
302: Allowing the predictor to be probabilistic does not fundamentally
303: avoid the problem of Lemma~\ref{lem:adv}. In each step, rather than
304: generating the opposite to what will be predicted by $p$, instead $q$
305: attempts to generate the symbol which $p$ is least likely to predict
306: given $x_{1:n}$. To do this $q$ must simulate $p$ in order to
307: estimate $\Pr \! \big( p(x_{1:n}) = 1 \big| x_{1:n} \big)$. With
308: sufficient simulation effort, $q$ can estimate this probability to any
309: desired accuracy for any $x_{1:n}$. This produces a computable
310: sequence $\omega$ such that $\forall n \in \NNN : \Pr \! \big(
311: p(x_{1:n}) = x_{n+1} \big| x_{1:n} \big)$ is not significantly greater
312: than $\frac{1}{2}$, that is, the performance of $p$ is no better than
313: a predictor that makes completely random predictions.
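
One way to make ``not significantly greater'' precise is sketched
below. The symbols $\pi_n$, $\hat{\pi}_n$ and $\varepsilon$ are
assumptions introduced for this sketch and are not part of the
argument above. Let $\pi_n := \Pr \! \big( p(x_{1:n}) = 1 \big|
x_{1:n} \big)$ and suppose $q$ obtains an estimate $\hat{\pi}_n$ with
$|\hat{\pi}_n - \pi_n| \leq \varepsilon$. If $q$ sets $x_{n+1} := 0$
when $\hat{\pi}_n \geq \frac{1}{2}$ and $x_{n+1} := 1$ otherwise, then
\[
\Pr \! \big( p(x_{1:n}) = x_{n+1} \big| x_{1:n} \big) =
\left\{ \begin{array}{ll}
1 - \pi_n \leq 1 - \hat{\pi}_n + \varepsilon \leq \frac{1}{2} + \varepsilon
  & \mbox{if } \hat{\pi}_n \geq \frac{1}{2}, \\
\pi_n \leq \hat{\pi}_n + \varepsilon < \frac{1}{2} + \varepsilon
  & \mbox{otherwise,}
\end{array} \right.
\]
so the success probability of $p$ exceeds $\frac{1}{2}$ by at most the
estimation error $\varepsilon$.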
314:
315: The impossibility of prediction in this more general probabilistic
316: setting has been pointed out before by Dawid~\cite{Dawid:85}.
317: Specifically, Dawid notes that for any statistical forecasting system
318: there exist sequences which are not calibrated. Dawid also notes that
319: a forecasting system for a family of distributions is necessarily more
320: complex than any forecasting system generated from a single
321: distribution in the family. However, he does not deal with the
322: complexity of the sequences themselves, nor does he make a precise
323: statement in terms of a specific measure of complexity, such as
324: Kolmogorov complexity. The impossibility of forecasting has since
been developed in considerably more depth by V'yugin~\cite{Vyugin:98};
in particular, it is proven that there is an efficient randomised
327: procedure producing sequences that cannot be predicted (with high
328: probability) by computable forecasting systems.
329:
330: As probabilistic prediction complicates things without avoiding this
331: fundamental problem, in the remainder of this paper we will consider
332: only deterministic predictors. This will also allow us to see the
333: roots of this problem as clearly as possible. With the preliminaries
334: covered, we now move on to the central problem considered in this
335: paper: Predicting sequences of limited Kolmogorov complexity.
336:
337:
338:
339:
340:
341: \section{Prediction of simple computable sequences}
342:
As no computable predictor can learn to predict every computable
sequence, a weaker goal is to be able to predict all ``simple''
computable sequences.
346:
347: \begin{definition}
348: For $n \in \NNN$, let $\CC_n := \{ \omega \in \CC: K(\omega) \leq n
349: \}$. Further, let $P_n := P( \CC_n )$ be the set of predictors able
350: to learn to predict all sequences in $\CC_n$.
351: \end{definition}
352:
353: Firstly we establish that prediction algorithms exist that can learn
354: to predict all sequences up to a given complexity, and that these
355: predictors need not be significantly more complex than the sequences
356: they can predict:
357:
358: \begin{lemma} \label{lem:infpredictors}
359: $\forall n \in \NNN, \exists p \in P_n : K( p ) \plt n + O( \log_2 n )$.
360: \end{lemma}
361:
362: \begin{proof}
363: Let $h \in \NNN$ be the number of programs of length $n$ or less which
364: generate infinite sequences. Build the value of $h$ into a prediction
365: algorithm $p$ constructed as follows:
366:
367: In the $k^{th}$ prediction cycle run in parallel all programs of
368: length $n$ or less until $h$ of these programs have each produced
$k+1$ symbols of output. Next predict according to the $(k+1)^{th}$
symbol of the generated string whose first $k$ symbols are consistent
with the observed string. If two generated strings are consistent
372: with the observed sequence (there cannot be more than two as the
373: strings are binary and have length $k+1$), pick the one which was
374: generated by the program that occurs first in a lexicographical
375: ordering of the programs. If no generated output is consistent, give
376: up and output a fixed symbol.
377:
378: For sufficiently large $k$, only the $h$ programs which produce
379: infinite sequences will produce output strings of length $k$. As this
380: set of sequences is finite, they can be uniquely identified by finite
381: initial strings. Thus for sufficiently large $k$ the predictor $p$
382: will correctly predict any computable sequence $\omega$ for which $K(
383: \omega ) \leq n$, that is, $p \in P_n$.
384:
As there are $2^{n+1} -1$ possible strings of length $n$ or less, $h <
2^{n+1}$ and thus we can encode $h$ with at most $\log_2 h + 2 \log_2
\log_2 h + 1 < n + 2 + 2\log_2 (n+1)$ bits. Thus, $K( p ) < n + 2 \log_2 (n+1)
+ c$ for some constant $c$ that is independent of $n$.
389: \end{proof}
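
As a numerical illustration of the counting step in this proof, take
$n = 3$. There are
\[
2^0 + 2^1 + 2^2 + 2^3 = 15 = 2^{4} - 1
\]
binary strings of length $3$ or less, so $h \leq 15 < 2^{4}$ and $h$
can be written with $n + 1 = 4$ binary digits. Making this value
self-delimiting, so that $p$ can tell where the encoding of $h$ ends,
accounts for the additional $O(\log_2 n)$ term in the bound.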
390:
391: Can we do better than this? Lemma~\ref{lem:predofcomplex} shows us
392: that there exist predictors able to predict at least some sequences
393: vastly more complex than themselves. This suggests that there might
394: exist simple predictors able to predict arbitrary sequences up to a
395: high complexity. Formally, could there exist $p \in P_n$ where $n \gg
396: K(p)$? Unfortunately, these simple but powerful predictors are not
397: possible:
398:
399: \begin{theorem}\label{thm:simplepred}
400: $\forall n \in \NNN: p \in P_n \Rightarrow K(p) \pgt n$.
401: \end{theorem}
402:
403: \begin{proof}
404: For any $n \in \NNN$ let $p \in P_n$, that is, $\forall \omega \in
405: \CC_n: p \in P(\omega)$. By Lemma~\ref{lem:adv} we know that $\exists
\, \omega' \in \CC : p \notin P(\omega')$. As $p \notin P(\omega')$
it must be the case that $\omega' \notin \CC_n$, that is, $K(\omega')
> n$. From Lemma~\ref{lem:adv} we also know that $K(p) \pgt
409: K(\omega')$ and so the result follows.
410: \end{proof}
411:
412: Intuitively the reason for this is as follows: Lemma~\ref{lem:adv}
413: guarantees that every simple predictor fails for at least one simple
414: sequence. Thus if we want a predictor that can learn to predict all
415: sequences up to a moderate level of complexity, then clearly the
416: predictor cannot be simple. Likewise, if we want a predictor that can
417: predict all sequences up to a high level of complexity, then the
418: predictor itself must be very complex. Thus, even though we have made
419: the generous assumption of unlimited computational resources and data
420: to learn from, only very complex algorithms can be truly powerful
421: predictors.
422:
423: These results easily generalise to notions of complexity that take
424: computation time into consideration. As sequences are infinite, the
425: appropriate measure of time is the time needed to generate or predict
426: the next symbol in the sequence. Under any reasonable measure of time
427: complexity, the operation of inverting a single output from a binary
428: valued function can be performed with little cost. If $C$ is any
429: complexity measure with this property, it is trivial to see that the
430: proof of Lemma~\ref{lem:adv} still holds for $C$. From this, an
431: analogue of Theorem~\ref{thm:simplepred} for $C$ easily follows. With
432: similar arguments these results also generalise in a straightforward
433: way to complexity measures that take space or other computational
resources into account. Thus the fact that extremely powerful
predictors must be very complex holds under any measure of complexity
436: for which inverting a single bit is inexpensive.
437:
438:
439:
440: \section{Complexity of prediction}
441:
442: Another way of viewing these results is in terms of an alternate
443: notion of sequence complexity defined as the size of the smallest
444: predictor able to learn to predict the sequence. This allows us to
445: express the results of the previous sections more concisely.
446: Formally, for any sequence $\omega$ define the complexity measure,
447: \[
448: \dot{K} ( \omega ) := \min_{p \in \BBB^*} \{ |p| : p \in P( \omega ) \},
449: \]
450: and $\dot{K}(\omega) := \infty$ if $P( \omega ) = \varnothing$. Thus,
451: if $\dot{K} ( \omega )$ is high then the sequence $\omega$ is complex
452: in the sense that only complex prediction algorithms are able to learn
453: to predict it. It can easily be seen that this notion of complexity
454: has the same invariance to the choice of reference universal Turing
455: machine as the standard Kolmogorov complexity measure.
456:
457: It may be tempting to conjecture that this definition simply describes
458: what might be called the ``tail end complexity'' of a sequence, that
459: is, $\dot{K}(\omega) = \lim_{i \to \infty} K(\omega_{i:\infty})$.
This is not the case. Consider a variation on the proof of
Lemma~\ref{lem:predofcomplex}. For any $n \in \NNN$, let $y_{1:n}$ be
a random string, that is, $K(y_{1:n}) \peq n$, and define a computable
sequence that is a repetition of this string, $\omega :=
(y_{1:n})^*$. As remarked after that lemma, a single predictor $p$
can learn to predict any sequence of this form, with no restriction
on how high $K(\omega)$ can be. From our definition of $\dot{K}$
above it is thus clear that $\dot{K}(\omega) \peq 0$ for any such
$\omega$. Consider now the tail complexity of $\omega$. As
469: $K(y_{1:n}) \peq n$, whenever $i \bmod n = 0$ we have
470: $K(\omega_{i:\infty}) \pgt n - O(\log n)$ (the $O(\log n)$ term comes
471: from potentially saving bits due to not having to encode $|y_{1:n}|$).
472: Thus even if the limit $\lim_{i \to \infty} K(\omega_{i:\infty})$
473: exists (it may oscillate), it cannot be equal to $\dot{K}(\omega)$ in
474: general.
475:
476: Using $\dot{K}$ we can now rewrite a number of our previous results
477: more succinctly in terms of the new complexity measure. From
478: Lemma~\ref{lem:bound1} it immediately follows that,
479: \[
480: \forall
481: \omega: 0 \leq \dot{K}( \omega ) \plt K( \omega ).
482: \]
483: From Lemma~\ref{lem:predofcomplex} we know that $\exists c \in \NNN,
484: \forall n \in \NNN, \exists \, \omega \in \CC$ such that $\dot{K}(
485: \omega ) <c$ and $K( \omega ) > n$, that is, $\dot{K}$ can attain the
486: lower bound above within a small constant, no matter how large the
487: value of $K$ is. The sequences for which the upper bound on $\dot{K}$
488: is tight are interesting as they are the ones which demand complex
489: predictors. We prove the existence of these sequences and look at
490: some of their properties in the next section.
491:
492: The complexity measure $\dot{K}$ can also be generalised to sets of
493: sequences, for $S \subset \BBB^\infty$ define $\dot{K}( S ) := \min_p
494: \{ |p|: p \in P(S) \}$. This allows us to rewrite
495: Lemma~\ref{lem:infpredictors} and Theorem~\ref{thm:simplepred} as
496: simply,
497: \[
498: \forall n \in \NNN : n
499: \plt \dot{K} ( \CC_n ) \plt n + O( \log_2 n ).
500: \]
501: This is just a restatement of the fact that the simplest predictor
capable of predicting all sequences up to a Kolmogorov complexity of
$n$ has itself a Kolmogorov complexity of roughly $n$.
504:
505:
506:
507: \section{Hard to predict sequences}\label{sec:hard}
508:
509: We have already seen that some individual sequences, such as the
sequence $x0000\ldots$ used in the proof of Lemma~\ref{lem:predofcomplex},
511: can have arbitrarily high Kolmogorov complexity but nevertheless can
512: be predicted by trivial algorithms. Thus, although these sequences
513: contain a lot of information in the Kolmogorov sense, in a deeper
514: sense their structure is very simple and easily learnt.
515:
What interests us in this section is the other extreme: individual
517: sequences which can only be predicted by complex predictors. As we
518: are only concerned with prediction in the limit, this extra complexity
519: in the predictor must be some kind of special information which
520: cannot be learnt just through observing the sequence. Our first task
521: is to show that these difficult to predict sequences exist.
522:
523: \begin{theorem}\label{thm:uninf}
524: $\forall n \in \NNN, \exists \, \omega \in \CC : n \plt \dot{K}(
525: \omega ) \plt K(\omega) \plt n + O( \log_2 n )$.
526: \end{theorem}
527:
528: \begin{proof}
529: For any $n \in \NNN$, let $Q_n \subset \BBB^{<n}$ be the set of
530: programs shorter than $n$ that are predictors, and let $x_{1:k} \in
531: \BBB^k$ be the observed initial string from the sequence $\omega$
532: which is to be predicted. Now construct a meta-predictor $\hat{p}$:
533:
534: By dovetailing the computations, run in parallel every program of
535: length less than $n$ on every string in $\BBB^{\leq k}$. Each time a
536: program is found to halt on all of these input strings, add the
537: program to a set of ``candidate prediction algorithms'', called
538: $\tilde{Q}^k_n$. As each element of $Q_n$ is a valid predictor and
539: thus will halt for all input strings for any $k$, for every $n$ and
540: $k$ it eventually will be the case that $|\tilde{Q}^k_n| = |Q_n|$. At
541: this point the simulation to approximate $Q_n$ terminates. It is
542: clear that for sufficiently large values of $k$ all of the valid
543: predictors, and only the valid predictors, will halt with a single
544: symbol of output on all tested input strings. That is, $\exists r \in
545: \NNN, \forall k > r : \tilde{Q}^k_n = Q_n$.
546:
547: The second part of the $\hat{p}$ algorithm uses these candidate
548: prediction algorithms to make a prediction. For $p \in \tilde{Q}^k_n$
549: define $d^k(p) := \sum_{i=1}^{k-1} |p(x_{1:i})-x_{i+1}|$. Informally,
550: $d^k(p)$ is the number of prediction errors made by $p$ so far.
551: Compute this for all $p \in \tilde{Q}^k_n$ and then let $p^*_k \in
552: \tilde{Q}^k_n$ be the program with minimal $d^k(p)$. If there is more
553: than one such program, break the tie by letting $p^*_k$ be the
554: lexicographically first of these. Finally, $\hat{p}$ computes the
555: value of $p^*_k(x_{1:k})$ and then returns this as its prediction and
556: halts.
557:
558: By Lemma~\ref{lem:adv}, there exists $\omega' \in \CC$ such that
559: $\hat{p}$ makes a prediction error for every $k$ when trying to
560: predict $\omega'$. Thus, in each cycle at least one of the finitely
561: many predictors with minimal $d^k$ makes a prediction error and so
562: $\forall p \in Q_n: d^k(p) \to \infty$ as $k \to \infty$. Therefore,
563: $\nexists p \in Q_n : p \in P(\omega')$, that is, no program of length
564: less than $n$ can learn to predict $\omega'$ and so $n \leq
565: \dot{K}(\omega')$. Further, from Lemma~\ref{lem:bound1} we know that
566: $\dot{K}( \omega' ) \plt K(\omega')$, and from Lemma~\ref{lem:adv}
567: again, $K(\omega') \plt K(\hat{p})$.
568:
569: Examining the algorithm for $\hat{p}$, we see that it contains some
570: fixed length program code and an encoding of $|Q_n|$, where $|Q_n| <
571: 2^n-1$. Thus, using a standard encoding method for integers,
572: $K(\hat{p}) \plt n + O( \log_2 n )$.
573:
574: Chaining these together we get, $n \plt \dot{K}( \omega' ) \plt
575: K(\omega') \plt K(\hat{p}) \plt n + O( \log_2 n )$, which proves the
576: theorem.
577: \end{proof}
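
The error count $d^k(p)$ used by $\hat{p}$ can be illustrated with a
toy calculation. The two candidate predictors below are assumptions
made for this example and are not claimed to be elements of any
particular $Q_n$: $p_a$ always outputs $0$, and $p_b$ outputs the last
symbol of its input. For the observed string $x_{1:4} = 0111$,
\[
d^4(p_a) = |0-1| + |0-1| + |0-1| = 3, \qquad
d^4(p_b) = |0-1| + |1-1| + |1-1| = 1 .
\]
If these were the only candidates, then $p^*_4 = p_b$ and the
meta-predictor would output $p_b(0111) = 1$.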
578:
579: This establishes the existence of sequences with arbitrarily high
580: $\dot{K}$ complexity which also have a similar level of Kolmogorov
581: complexity. Next we establish a fundamental property of high
582: $\dot{K}$ complexity sequences: they are extremely difficult to
583: compute.
584:
585: For an algorithm $q$ that generates $\omega \in \CC$, define $t_q(n)$
586: to be the number of computation steps performed by $q$ before the
587: $n^{th}$ symbol of $\omega$ is written to the output tape. For
588: example, if $q$ is a simple algorithm that outputs the sequence
589: $010101\ldots$, then clearly $t_q(n) = O(n)$ and so $\omega$ can be
computed quickly. The following lemma shows that if a sequence can
591: be computed in a reasonable amount of time, then the sequence must
592: have a low $\dot{K}$ complexity:
593:
594: \begin{lemma}\label{lem:slow}
595: $\forall \omega \in \CC$, if $\exists q : \UU(q) = \omega$ and
596: $\exists r \in \NNN , \forall n > r : t_q(n) < 2^n$, then
597: $\dot{K}(\omega) \peq 0$.
598: \end{lemma}
599:
600: \begin{proof}
601: Construct a prediction algorithm $\tilde{p}$ as follows:
602:
603: On input $x_{1:n}$, run all programs of length $n$ or less, each for
604: $2^{n+1}$ steps. In a set $W_n$ collect together all generated
605: strings which are at least $n+1$ symbols long and where the first $n$
606: symbols match the observed string $x_{1:n}$. Now order the strings in
607: $W_n$ according to a lexicographical ordering of their generating
programs. If $W_n = \varnothing$, then just return a prediction of 1
and halt. Otherwise return the $(n+1)^{th}$ symbol from the
first string in the above ordering.
611:
612: Assume that $\exists q : \UU(q) = \omega$ such that $\exists r \in
613: \NNN , \forall n > r : t_q(n) < 2^n$. If $q$ is not unique, take $q$
614: to be the lexicographically first of these. Clearly $\forall n > r$
615: the initial string from $\omega$ generated by $q$ will be in the set
616: $W_n$. As there is no lexicographically lower program which can
617: generate $\omega$ within the time constraint $t_q (n) < 2^n$ for all
618: $n>r$, for sufficiently large $n$ the predictor $\tilde{p}$ must
619: converge on using $q$ for each prediction and thus $\tilde{p} \in
620: P(\omega)$. As $|\tilde{p}|$ is clearly a fixed constant that is
independent of $\omega$, it follows then that $\dot{K}(\omega) \leq
|\tilde{p}| \peq 0$.
623: \end{proof}
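
To make the resource bounds in this construction explicit, the
following is a rough sketch under the assumption that programs are
binary strings, ignoring any overhead for simulating programs on
$\UU$. For $n \geq |q|$ the program $q$ is among those run on input
$x_{1:n}$, and for $n \geq r$ we have
\[
  t_q(n+1) < 2^{n+1} ,
\]
so $q$ writes its $(n+1)^{th}$ symbol within the allotted $2^{n+1}$
steps. The string it has generated, a prefix of $\omega$ of length at
least $n+1$, therefore belongs to $W_n$. The total work done by
$\tilde{p}$ on an input of length $n$ is at most
\[
  \Big( \sum_{k=0}^{n} 2^k \Big) \cdot 2^{n+1}
  \; = \; \big( 2^{n+1} - 1 \big) \, 2^{n+1}
  \; < \; 2^{2n+2}
\]
simulated steps. The predictor is thus simple to describe but
exponentially expensive to run.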
624:
625: We could replace the $2^n$ bound in the above result with an even more
626: rapidly growing computable function, for example, $2^{2^n}$. In any
627: case, this does not change the fundamental result that sequences which
628: have a high $\dot{K}$ complexity are practically impossible to
629: compute. However from our theoretical perspective these sequences
630: present no problem as they can be predicted, albeit with immense
631: difficulty.
632:
633:
634: \section{The limits of mathematical analysis}
635:
636: One way to interpret the results of the previous sections is in terms
637: of constructive theories of prediction. Essentially, a constructive
638: theory of prediction $\mathcal{T}$, expressed in some sufficiently
639: rich formal system $\mathcal{F}$, is in effect a description of a
640: prediction algorithm with respect to a universal Turing machine which
641: implements the required parts of $\mathcal{F}$. Thus from
642: Theorems~\ref{thm:simplepred} and \ref{thm:uninf} it follows that if
643: we want to have a predictor that can learn to predict all sequences up
644: to a high level of Kolmogorov complexity, or even just predict
645: individual sequences which have high $\dot{K}$ complexity, the
646: constructive theory of prediction that we base our predictor on must
647: be very complex. Elegant and highly general constructive theories of
648: prediction simply do not exist, even if we assume unlimited
649: computational resources. This is in marked contrast to Solomonoff's
650: highly elegant but non-constructive theory of prediction.
651:
652: Naturally, highly complex theories of prediction will be very
653: difficult to mathematically analyse, if not practically impossible.
654: Thus at some point the development of very general prediction
655: algorithms must become mainly an experimental endeavour due to the
656: difficulty of working with the required theory. Interestingly, an
657: even stronger result can be proven showing that beyond some point the
658: mathematical analysis is in fact impossible, even in theory:
659:
660: \begin{theorem}\label{thm:incomplete}
661: In any consistent formal axiomatic system $\FF$ that is sufficiently
662: rich to express statements of the form ``$p \in P_n$'', there exists
663: $m \in \NNN$ such that for all $n > m$ and for all predictors $p \in
664: P_n$ the true statement ``$p \in P_n$'' cannot be proven in $\FF$.
665: \end{theorem}
666:
667: In other words, even though we have proven that very powerful sequence
668: prediction algorithms exist, beyond a certain complexity it is
669: impossible to find any of these algorithms using mathematics. The
670: proof has a similar structure to Chaitin's information theoretic proof
\cite{Chaitin:82} of G\"{o}del's incompleteness theorem for formal
672: axiomatic systems \cite{Goedel:31}.
673:
674: \begin{proof}
675: For each $n \in \NNN$ let $T_n$ be the set of statements expressed in
676: the formal system $\FF$ of the form ``$p \in P_n$'', where $p$ is
677: filled in with the complete description of some algorithm in each
678: case. As the set of programs is denumerable, $T_n$ is also
679: denumerable and each element of $T_n$ has finite length. From
680: Lemma~\ref{lem:infpredictors} and Theorem~\ref{thm:simplepred} it
681: follows that each $T_n$ contains infinitely many statements of the
682: form ``$p \in P_n$'' which are true.
683:
684: Fix $n$ and create a search algorithm $s$ that enumerates all proofs
685: in the formal system $\FF$ searching for a proof of a
686: statement in the set $T_n$. As the set $T_n$ is recursive, $s$ can
687: always recognise a proof of a statement in $T_n$. If $s$ finds any
688: such proof, it outputs the corresponding program $p$ and then halts.
689:
690: By way of contradiction, assume that $s$ halts, that is, a proof of a
691: theorem in $T_n$ is found and $p$ such that $p \in P_n$ is generated
as output. The size of the algorithm $s$ is a constant (a description
of the formal system $\FF$ and some proof enumeration code) plus
an $O( \log_2 n )$ term needed to describe $n$. It follows then that
695: $K(p) \plt O( \log_2 n )$. However from Theorem~\ref{thm:simplepred}
696: we know that $K(p) \pgt n$. Thus, for sufficiently large $n$, we have
697: a contradiction and so our assumption of the existence of a proof must
698: be false. That is, for sufficiently large $n$ and for all $p \in
699: P_n$, the true statement ``$p \in P_n$'' cannot be proven within the
700: formal system~$\FF$.
701: \end{proof}
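
The final step of the proof can be spelled out with explicit
constants. As a sketch, suppose $a$, $b$ and $c$ are constants (names
introduced here for illustration only) that depend on $\FF$ and $\UU$
but not on $n$, such that the two bounds used above read, for
$n \geq 2$,
\[
  K(p) \leq a + b \log_2 n
  \qquad \mbox{and} \qquad
  K(p) \geq n - c .
\]
If $s$ halts, both bounds hold for its output $p$ and so
\[
  n \leq a + c + b \log_2 n .
\]
The left-hand side grows linearly in $n$ while the right-hand side
grows only logarithmically, so this inequality can hold for at most
finitely many $n$. Any $m$ at least as large as the largest such $n$
then satisfies the theorem.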
702:
703: The exact value of $m$ depends on our choice of formal system $\FF$
704: and which reference machine $\UU$ we measure complexity with respect
705: to. However for reasonable choices of $\FF$ and $\UU$ the value of
706: $m$ would be in the order of 1000. That is, the bound $m$ is
707: certainly not so large as to be vacuous.
708:
709:
710:
711:
712: \section{Discussion}
713:
714: Solomonoff induction is an elegant and extremely general model of
715: inductive learning. It neatly brings together the philosophical
716: principles of Occam's razor, Epicurus' principle of multiple
explanations, Bayes' theorem and Turing's model of universal
718: computation into a theoretical sequence predictor with astonishingly
719: powerful properties. If theoretical models of prediction can have
720: such elegance and power, one cannot help but wonder whether similarly
721: beautiful and highly general computable theories of prediction are
722: also possible.
723:
724: What we have shown here is that there does not exist an elegant
725: constructive theory of prediction for computable sequences, even if we
726: assume unbounded computational resources, unbounded data and learning
727: time, and place moderate bounds on the Kolmogorov complexity of the
728: sequences to be predicted. Very powerful computable predictors are
729: therefore necessarily complex. We have further shown that the source
730: of this problem is computable sequences which are extremely expensive
731: to compute. While we have proven that very powerful prediction
732: algorithms which can learn to predict these sequences exist, we have
733: also proven that, unfortunately, mathematical analysis cannot be used
734: to discover these algorithms due to problems of G\"{o}del
735: incompleteness.
736:
737: These results can be extended to more general settings, specifically
738: to those problems which are equivalent to, or depend on, sequence
739: prediction. Consider, for example, a reinforcement learning agent
740: interacting with an environment \cite{Sutton:98,hutter:04uaibook}. In
741: each interaction cycle the agent must choose its actions so as to
742: maximise the future rewards that it receives from the environment. Of
743: course the agent cannot know for certain whether or not some action
will lead to rewards in the future, and so it must predict them.
745: Clearly, at the heart of reinforcement learning lies a prediction
746: problem, and so the results for computable predictors presented in
747: this paper also apply to computable reinforcement learners. More
748: specifically, from Theorem~\ref{thm:simplepred} it follows that very
749: powerful computable reinforcement learners are necessarily complex,
750: and from Theorem~\ref{thm:incomplete} it follows that it is impossible
751: to discover extremely powerful reinforcement learning algorithms
752: mathematically.
753:
754: It is reasonable to ask whether the assumptions we have made in our
755: model need to be changed. If we increase the power of the predictors
756: further, for example by providing them with some kind of an oracle,
757: this would make the predictors even more unrealistic than they
758: currently are. Clearly this goes against our goal of finding an
759: elegant, powerful and general prediction theory that is more realistic
760: in its assumptions than Solomonoff's incomputable model. On the other
761: hand, if we weaken our assumptions about the predictors' resources to
762: make them more realistic, we are in effect taking a subset of our
763: current class of predictors. As such, all the same limitations and
764: problems will still apply, as well as some new ones.
765:
766: It seems then that the way forward is to further restrict the problem
767: space. One possibility would be to bound the amount of computation
768: time needed to generate the next symbol in the sequence. However if
769: we do this without restricting the predictors' resources then the
770: simple predictor from Lemma~\ref{lem:slow} easily learns to predict
771: any such sequence and thus the problem of prediction in the limit has
772: become trivial. Another possibility might be to bound the memory of
the machine used to generate the sequence. However, this makes the
774: generator a finite state machine and thus bounds its computation time,
775: again making the problem trivial.
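
The last claim can be made concrete as follows. This is only a
sketch, under the assumption that the memory-bounded generator $q$ is
deterministic and has at most $k$ distinct internal configurations
(the symbol $k$ is introduced here for illustration). If $q$ ran for
more than $k$ steps without writing an output symbol, some
configuration would repeat and $q$ would loop forever without writing
anything further, contradicting $\UU(q) = \omega$. Hence at most $k$
steps separate successive output symbols and
\[
  t_q(n) \leq k \, n < 2^n
  \qquad \mbox{for all sufficiently large } n ,
\]
so Lemma~\ref{lem:slow} applies.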
776:
777: Perhaps the only reasonable solution would be to add additional
778: restrictions to both the algorithms which generate the sequences to be
779: predicted, and to the predictors. We may also want to consider not
780: just learnability in the limit, but also how quickly the predictor is
781: able to learn. Of course we are then facing a much more difficult
782: analysis problem.
783:
784:
785: \subsubsection*{Acknowledgements}
786:
787: I would like to thank Marcus Hutter, Alexey Chernov, Daniil Ryabko and
788: Laurent Orseau for useful discussions and advice during the
789: development of this paper.
790:
791:
792: \begin{thebibliography}{10}
793:
794: \bibitem{Barzdin:72}
795: J.~M. Barzdin.
796: \newblock Prognostication of automata and functions.
797: \newblock {\em Information Processing}, 71:81--84, 1972.
798:
799: \bibitem{Calude:02}
800: C.~S. Calude.
801: \newblock {\em Information and Randomness}.
802: \newblock Springer, Berlin, 2nd edition, 2002.
803:
804: \bibitem{Chaitin:82}
805: G.~J. Chaitin.
806: \newblock G{\"o}del's theorem and information.
\newblock {\em International Journal of Theoretical Physics}, 21:941--954,
808: 1982.
809:
810: \bibitem{Dawid:85}
811: A.~P. Dawid.
812: \newblock Comment on {T}he impossibility of inductive inference.
813: \newblock {\em Journal of the American Statistical Association},
814: 80(390):340--341, 1985.
815:
816: \bibitem{Feder:92}
817: M.~Feder, N.~Merhav, and M.~Gutman.
818: \newblock Universal prediction of individual sequences.
819: \newblock {\em {IEEE} Trans. on Information Theory}, 38:1258--1270, 1992.
820:
821: \bibitem{Goedel:31}
822: K.~G{\"o}del.
\newblock {\"U}ber formal unentscheidbare {S}{\"a}tze der {P}rincipia {M}athematica
  und verwandter {S}ysteme {I}.
\newblock {\em Monatshefte f{\"u}r Mathematik und Physik}, 38:173--198, 1931.
826: \newblock [English translation by E. Mendelsohn: ``On undecidable propositions
827: of formal mathematical systems''. In M. Davis, editor, {\it The undecidable},
  pages 39--71, New York, 1965. Raven Press, Hewlett].
829:
830: \bibitem{Gold:67}
831: E.~Mark Gold.
832: \newblock Language identification in the limit.
833: \newblock {\em Information and Control}, 10(5):447--474, 1967.
834:
835: \bibitem{hutter:04uaibook}
836: M.~Hutter.
837: \newblock {\em Universal Artificial Intelligence: Sequential Decisions based on
838: Algorithmic Probability}.
839: \newblock Springer, Berlin, 2005.
840: \newblock 300 pages, http://www.idsia.ch/$_{^{\sim}}$marcus/ai/uaibook.htm.
841:
842: \bibitem{hutter:06usp}
843: M.~Hutter.
844: \newblock On the foundations of universal sequence prediction.
845: \newblock In {\em Proc. 3rd Annual Conference on Theory and Applications of
846: Models of Computation ({TAMC'06})}, volume 3959 of {\em LNCS}, pages
847: 408--420. Springer, 2006.
848:
849: \bibitem{Li:97}
850: M.~Li and P.~M.~B. Vit\'anyi.
851: \newblock {\em An introduction to {Kolmogorov} complexity and its
852: applications}.
853: \newblock Springer, 2nd edition, 1997.
854:
855: \bibitem{Poland:04mdl2p}
856: J.~Poland and M.~Hutter.
857: \newblock Convergence of discrete {MDL} for sequential prediction.
858: \newblock In {\em Proc. 17th Annual Conf. on Learning Theory ({COLT'04})},
859: volume 3120 of {\em LNAI}, pages 300--314, Banff, 2004. Springer, Berlin.
860:
861: \bibitem{Rissanen:96}
862: J.~J. Rissanen.
863: \newblock Fisher {I}nformation and {S}tochastic {C}omplexity.
864: \newblock {\em IEEE Trans. on Information Theory}, 42(1):40--47, January 1996.
865:
866: \bibitem{Solomonoff:64}
867: R.~J. Solomonoff.
868: \newblock A formal theory of inductive inference: Part 1 and 2.
869: \newblock {\em Inform. Control}, 7:1--22, 224--254, 1964.
870:
871: \bibitem{Solomonoff:78}
872: R.~J. Solomonoff.
873: \newblock Complexity-based induction systems: comparisons and convergence
874: theorems.
875: \newblock {\em IEEE Trans. Information Theory}, IT-24:422--432, 1978.
876:
877: \bibitem{Sutton:98}
878: R.~Sutton and A.~Barto.
879: \newblock {\em Reinforcement learning: An introduction}.
880: \newblock Cambridge, MA, MIT Press, 1998.
881:
882: \bibitem{Vyugin:98}
883: V.~V. V'yugin.
884: \newblock Non-stochastic infinite and finite sequences.
885: \newblock {\em Theoretical computer science}, 207:363--382, 1998.
886:
887: \bibitem{Wallace:68}
888: C.~S. Wallace and D.~M. Boulton.
889: \newblock An information measure for classification.
890: \newblock {\em Computer Jrnl.}, 11(2):185--194, August 1968.
891:
892: \bibitem{Willems:95}
F.~M.~J. Willems, Y.~M. Shtarkov, and Tj.~J. Tjalkens.
894: \newblock The context-tree weighting method: Basic properties.
895: \newblock {\em IEEE Transactions on Information Theory}, 41(3), 1995.
896:
897: \end{thebibliography}
898:
899:
900:
901:
902: \end{document}
903: