1:
2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3: %% Towards a Universal Theory of Artificial Intelligence %%
4: %% based on %%
5: %% Algorithmic Probability and Sequential Decision Theory %%
6: %% %%
7: %% Marcus Hutter: Start: 09.12.00 LastEdit: 16.12.00 %%
8: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
9:
10: \newif\ifijcai\ijcaifalse % TechReport version
11:
12: %-------------------------------%
13: % My Document-Style %
14: %-------------------------------%
15: \documentclass[10pt,twocolumn]{article}
16:
17: \setlength\headheight{0pt} \setlength\headsep{0pt}
18: \topmargin=0cm \oddsidemargin=-1cm \evensidemargin=-1cm
19: \textwidth=18cm \textheight=23cm %\unitlength=1mm \sloppy
20:
21: %-------------------------------%
22: % Macro-Definitions %
23: %-------------------------------%
24:
25: \renewenvironment{abstract}{\centerline{\bf
26: Abstract}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
27: \newenvironment{keywords}{\centerline{\bf
28: Key Words}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip 1ex}
29: \def\eqd{\stackrel{\bullet}{=}}
30: \def\ff{\Longrightarrow}
31: \def\gdw{\Longleftrightarrow}
32: \def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}
33: \def\gtapprox{\buildrel{\lower.7ex\hbox{$>$}}\over
34: {\lower.7ex\hbox{$\sim$}}}
35: \def\nq{\hspace{-1em}}
36: \def\look{\(\uparrow\)}
37: \def\ignore#1{}
38: \def\deltabar{{\delta\!\!\!^{-}}}
39: \def\qed{\sqcap\!\!\!\!\sqcup}
40: \def\odt{{\textstyle{1\over 2}}}
41: \def\odf{{\textstyle{1\over 4}}}
42: \def\odA{{\textstyle{1\over A}}}
43: \def\hbar{h\!\!\!\!^{-}\,}
44: \def\dbar{d\!\!^{-}\!}
45: \def\eps{\varepsilon}
46: \def\beq{\begin{equation}}
47: \def\eeq{\end{equation}}
48: \def\beqn{\begin{displaymath}}
49: \def\eeqn{\end{displaymath}}
50: \def\bqa{\begin{equation}\begin{array}{c}}
51: \def\eqa{\end{array}\end{equation}}
52: \def\bqan{\begin{displaymath}\begin{array}{c}}
53: \def\eqan{\end{array}\end{displaymath}}
54: \def\pb{\underline} % probability notation
55: \def\pb#1{\underline{#1}} % probability notation
56: \def\blank{{\,_\sqcup\,}} % blank position
57: \def\maxarg{\mathop{\rm maxarg}} % maxarg
58: \def\minarg{\mathop{\rm minarg}} % minarg
59: \def\hh#1{{\dot{#1}}} % historic I/O
60: \def\best{*} % or {best}
61: \def\vec#1{{\bf #1}}
62: \def\length{{l}}
63: \ifijcai\def\paragraph#1{{\bf #1}}\fi
64:
65: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
66: % T i t l e - P a g e %
67: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
68:
69: \title{\bf \Large Towards a Universal Theory of Artificial Intelligence
70: based on \\ Algorithmic Probability and Sequential Decision Theory}
71:
72: {\author{ Marcus Hutter \\[2mm]
73: {\small IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland} \\
74: {\small marcus@idsia.ch \qquad http://www.idsia.ch} \\
75: {\small Technical Report IDSIA-14-00, 16. December 2000}
76: }
77:
78: \date{}
79:
80: \begin{document}
81:
82: \maketitle
83:
84: \begin{abstract}
85: Decision theory formally solves the problem of rational agents in
86: uncertain worlds if the true environmental probability
87: distribution is known. Solomonoff's theory of universal induction
88: formally solves the problem of sequence prediction for an unknown
89: distribution. We unify both theories and give strong arguments
90: that the resulting universal AI$\xi$ model behaves optimally in any
91: computable environment. The major drawback of the AI$\xi$ model is
92: that it is uncomputable. To overcome this problem, we construct a
93: modified algorithm AI$\xi^{tl}$, which is still superior to any
94: other time $t$ and space $l$ bounded agent. The computation time
95: of AI$\xi^{tl}$ is of the order $t\!\cdot\!2^l$.\\
96: \end{abstract}
97:
98: \ifijcai\else
99: \begin{keywords}
100: Rational agents,
101: sequential decision theory, universal Solomonoff induction,
102: algorithmic probability, reinforcement learning, computational
103: complexity, theorem proving, probabilistic reasoning, Kolmogorov
104: complexity, Levin search.
105: \end{keywords}
106: \fi
107:
108: % ACM Classification
109: %I.2; I.2.3; I.2.6; I.2.8; F.1.3; F.2
110: %I.2. Artificial Intelligence,
111: %I.2.3. Deduction and Theorem Proving
112: %I.2.6. Learning
113: %I.2.8. Problem Solving, Control Methods and Search
114: %F.1.3. Complexity Classes
115: %F.2. Analysis of Algorithms and Problem Complexity
116:
117: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
118: \section{Introduction}\label{int}
119: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
120:
121: The most general framework for Artificial Intelligence is the
122: picture of an {\em agent} interacting with an environment
123: \cite{Russell:95}. If the goal is not pre-specified, the agent has
124: to learn by occasional reinforcement feedback \cite{Sutton:98}. If
125: the agent shall be universal, no assumption about the environment
126: may be made, besides that there {\it exists} some exploitable
127: structure at all. We may ask for the most intelligent way an agent
128: could behave, or about the optimal way of learning in terms of
129: real world interaction cycles. {\em Decision theory}
130: formally\footnote{By a formal solution we mean a rigorous
131: mathematical definition, uniquely specifying the solution. For
132: problems considered here this always implies the existence of an
133: algorithm which asymptotically converges to the correct solution.}
134: solves this problem only if the true environmental probability
135: distribution is known (e.g.\ Backgammon)
136: \cite{Bellman:57,Bertsekas:96}. \cite{Solomonoff:64,Solomonoff:78}
137: formally solves the problem of {\em induction} if the true
138: distribution is unknown but only if the agent cannot influence the
139: environment (e.g.\ weather forecasts) \cite{Li:97}. We combine
140: both ideas and get {\em a parameterless model AI$\xi$ of an acting
141: agent which we claim to behave optimally in any computable
142: environment} (e.g.\ prisoner or auction problems, poker, car
143: driving). To get an effective solution, a modification
144: AI$\xi^{tl}$, superior to any other time $t$ and space $l$ bounded
145: agent, is constructed. The computation time of AI$\xi^{tl}$ is of
146: the order $t\!\cdot\!2^l$. The main goal of this work is to derive
147: and discuss the AI$\xi$ and the AI$\xi^{tl}$ model, and to clarify
148: the meaning of {\it universal}, {\it optimal}, {\it superior},
149: {\it etc}. Details can be found in \cite{Hutter:00f}.
150:
151: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
152: \section{Rational Agents \& Sequential Decisions}\label{secAImurec}
153: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
154:
155: %------------------------------%
156: \paragraph{Agents in probabilistic environments:}
157: %------------------------------%
158: A very general framework for intelligent systems is that of
159: rational agents \cite{Russell:95}. In cycle $k$, an agent performs
160: {\em action} $y_k\!\in\!Y$ (output word) which results in a {\em
161: perception} $x_k\!\in\!X$ (input word), followed by cycle
162: $k\!+\!1$ and so on. If agent and environment are deterministic
163: and computable, the entanglement of both can be modeled by two
164: Turing machines with two common tapes (and some private tapes)
165: containing the action stream $y_1y_2y_3...$ and the perception
166: stream $x_1x_2x_3...$ (The meaning of $x_k\!\equiv\!x'_kr_k$ is
167: explained in the next paragraph):
168:
169: \begin{center}\label{cyberpic}
170: \small\unitlength=0.8mm
171: \special{em:linewidth 0.4pt}
172: \linethickness{0.4pt}
173: \begin{picture}(106,47)
174: \thinlines
175: \put(1,41){\framebox(10,6)[cc]{$x'_1$}}
176: \put(11,41){\framebox(6,6)[cc]{$r_1$}}
177: \put(17,41){\framebox(10,6)[cc]{$x'_2$}}
178: \put(27,41){\framebox(6,6)[cc]{$r_2$}}
179: \put(33,41){\framebox(10,6)[cc]{$x'_3$}}
180: \put(43,41){\framebox(6,6)[cc]{$r_3$}}
181: \put(49,41){\framebox(10,6)[cc]{$x'_4$}}
182: \put(59,41){\framebox(6,6)[cc]{$r_4$}}
183: \put(65,41){\framebox(10,6)[cc]{$x'_5$}}
184: \put(75,41){\framebox(6,6)[cc]{$r_5$}}
185: \put(81,41){\framebox(10,6)[cc]{$x'_6$}}
186: \put(91,41){\framebox(6,6)[cc]{$r_6$}}
187: \put(102,44){\makebox(0,0)[cc]{...}}
188: \put(1,1){\framebox(16,6)[cc]{$y_1$}}
189: \put(17,1){\framebox(16,6)[cc]{$y_2$}}
190: \put(33,1){\framebox(16,6)[cc]{$y_3$}}
191: \put(49,1){\framebox(16,6)[cc]{$y_4$}}
192: \put(65,1){\framebox(16,6)[cc]{$y_5$}}
193: \put(81,1){\framebox(16,6)[cc]{$y_6$}}
194: \put(102,4){\makebox(0,0)[cc]{...}}
195: \put(97,47){\line(1,0){9}}
196: \put(97,41){\line(1,0){9}}
197: \put(97,7){\line(1,0){9}}
198: \put(97,1){\line(0,0){0}}
199: \put(97,1){\line(1,0){9}}
200: \put(1,21){\framebox(16,6)[cc]{working}}
201: \thicklines
202: \put(17,17){\framebox(20,14)[cc]{$\displaystyle{Agent\atop\bf p}$}}
203: \thinlines
204: \put(37,27){\line(1,0){14}}
205: \put(37,21){\line(1,0){14}}
206: \put(39,24){\makebox(0,0)[lc]{tape ...}}
207: \put(56,21){\framebox(16,6)[cc]{working}}
208: \thicklines
209: \put(72,17){\framebox(20,14)[cc]{$\displaystyle{Environ-\atop ment\quad\bf q}$}}
210: \thinlines
211: \put(92,27){\line(1,0){14}}
212: \put(92,21){\line(1,0){14}}
213: \put(94,24){\makebox(0,0)[lc]{tape ...}}
214: \thicklines
215: \put(54,41){\vector(-3,-1){29}}
216: \put(84,31){\vector(-3,1){30}}
217: \put(54,7){\vector(3,1){30}}
218: \put(25,17){\vector(3,-1){29}}
219: \end{picture}
220: \end{center}
221:
222: $p$ is the {\em policy} of the agent interacting with environment
223: $q$. We write $p(x_{<k})\!=\!y_{1:k}$ to denote the output
224: $y_{1:k}\!\equiv\!y_1...y_k$ of the agent $p$ on input
225: $x_{<k}\!\equiv\!x_1...x_{k-1}$ and similarly $q(y_{1:k})\!=\!x_{1:k}$
226: for the environment $q$. We call Turing machines
227: $p$ and $q$ behaving in this way {\it chronological}. In the more
228: general case of a {\em probabilistic environment}, given the
229: history $y\!x_{<k}y_k\!\equiv\!y_1x_1...y_{k-1}x_{k-1}y_k$, the
230: probability that the environment leads to perception $x_k$ in
231: cycle $k$ is (by definition) $\mu(y\!x_{<k}y\!\pb x_k)$. The
232: underlined argument $\pb x_k$ in $\mu$ is a probability variable
233: and the other non-underlined arguments $y\!x_{<k}y_k$ represent
234: conditions. We call probability distributions like $\mu$ {\it
235: chronological}.
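
As a small illustration of this notation (a sketch, assuming the usual
chain rule for conditional probabilities), the probability of a whole
perception sequence $x_{1:n}$ under actions $y_{1:n}$ factorizes into
the one-cycle probabilities defined above:
\beqn
  \mu(y\!\pb x_{1:n}) \;=\; \prod_{k=1}^n \mu(y\!x_{<k}y\!\pb x_k).
\eeqn
A deterministic environment $q$ is the special case in which, on
histories consistent with $q$, the $k$-th factor is $1$ if
$q(y_{1:k})\!=\!x_{1:k}$ and $0$ otherwise.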
236:
237: %------------------------------%
238: \paragraph{The AI$\mu$ Model:}
239: %------------------------------%
240: The goal of the agent is to maximize future {\em rewards}, which are
241: provided by the environment through the inputs $x_k$. The inputs
242: $x_k\!\equiv\!x'_kr_k$ are divided into a regular part $x'_k$ and
243: some (possibly empty or delayed) reward $r_k$. The $\mu$-expected
244: reward sum of future cycles $k$ to $m$ with outputs
245: $y_{k:m}\!=\!y_{k:m}^p$ generated by the agent's policy $p$
246: can be written compactly as
247: \begin{equation}\label{vpdef}
248: V_\mu^p(\hh y\!\hh x_{<k}) \!:=\!\!
249: \!\!\sum_{x_k...x_m}\!\!
250: (r_k\!+...+\!r_m)
251: \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m}),
252: \end{equation}
253: where $m$ is the {\em lifespan} of the agent,
254: and the dots above $\hh y\!\hh x_{<k}$
255: indicate the actual action and perception history.
256: The $\mu$-expected reward sum of future cycles $k$ to $m$
257: with outputs $y_i$ generated by the {\em ideal agent}, which
258: maximizes the expected future rewards, is
259: \begin{equation}\label{voptdef}
260: V_\mu^\best(\hh y\!\hh x_{<k}) :=
261: \max_{y_k}\!\sum_{x_k}...
262: \max_{y_{m}}\!\sum_{x_{m}}
263: (r_k\!+...+\!r_m)
264: \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m}),
265: \end{equation}
266: i.e.\ the best expected reward
267: is obtained by averaging over the $x_i$ and
268: maximizing over the $y_i$. This has to be done in chronological
269: order to correctly incorporate the dependency of $x_i$ and $y_i$
270: on the history. The output $\hh y_k$ that achieves the maximal value
271: defines {\em the AI$\mu$ model}:
272: \begin{equation}\label{ydotrec}
273: \hh y_k :=
274: \maxarg_{y_k}\!\sum_{x_k}...
275: \max_{y_{m}}\!\sum_{x_{m}}
276: (r_k\!+...+\!r_m)
277: \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m}).
278: \end{equation}
279: The AI$\mu$ model is optimal in the sense that no other policy
280: leads to higher $\mu$-expected reward. A detailed derivation and
281: other recursive and functional versions can be found in
282: \cite{Hutter:00f}.
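
For orientation, the nested max/sum in (\ref{ydotrec}) can also be
read recursively. The following is only a sketch of the standard
expectimax recursion, with the terminal convention
$V_\mu^\best(y\!x_{1:m})\!:=\!0$ assumed here:
\beqn
  V_\mu^\best(y\!x_{<k}) =
  \max_{y_k}\sum_{x_k}
  \big[r_k + V_\mu^\best(y\!x_{1:k})\big]
  \mu(y\!x_{<k}y\!\pb x_k).
\eeqn
As a hypothetical one-cycle example ($k\!=\!m$), let $Y\!=\!\{a,b\}$
and $r_k\!\in\!\{0,1\}$, and suppose $\mu$ yields $r_k\!=\!1$ with
probability $0.7$ after action $a$ and $0.4$ after action $b$. Then
the expected rewards are $0.7$ and $0.4$, so $\hh y_k\!=\!a$ and
$V_\mu^\best\!=\!0.7$.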
283:
284: %------------------------------%
285: \paragraph{Sequential decision theory:}
286: %------------------------------%
287: Eq.\ (\ref{ydotrec}) is essentially an Expectimax algorithm/sequence.
288: One can relate (\ref{ydotrec}) to the Bellman equations
289: \cite{Bellman:57} of sequential decision theory by identifying
290: complete histories $y\!x_{<k}$ with states, $\mu(y\!x_{<k}y\!\pb
291: x_k)$ with the state transition matrix, $V_\mu^\best(y\!x_{<k})$
292: with the value of history/state $y\!x_{<k}$, and $y_k$ with the
293: action in cycle $k$ \cite{Russell:95,Hutter:00f}.
294: Due to the use of complete histories as state space, the AI$\mu$
295: model neither assumes stationarity, nor the Markov property, nor
296: complete accessibility of the environment. Every state occurs at
297: most once in the lifetime of the system.
298: As we have in mind a universal system with complex interactions,
299: the action and perception spaces $Y$ and $X$ are huge (e.g.\ video
300: images), and every action or perception itself occurs usually only
301: once in the lifespan $m$ of the agent. As there is no (obvious)
302: universal similarity relation on the state space, an effective
303: reduction of its size is impossible, but there is no problem in
304: principle in determining $\hh y_k$ as long as $\mu$ is known and
305: computable and $X$, $Y$ and $m$ are finite.
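
To make the cost of this brute-force evaluation concrete (a rough
count, not a statement from the cited works): the expression
(\ref{ydotrec}) ranges over all continuations $y\!x_{k:m}$, of which
there are
\beqn
  (|Y|\!\cdot\!|X|)^{m-k+1}.
\eeqn
Even for binary actions and perceptions, $|X|\!=\!|Y|\!=\!2$, and only
$m\!-\!k\!+\!1\!=\!10$ remaining cycles this gives
$4^{10}\!=\!1\,048\,576$ continuations, so the computation is finite
but grows exponentially with the horizon.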
306:
307: %------------------------------%
308: \paragraph{Reinforcement learning:}
309: %------------------------------%
310: Things dramatically change if $\mu$ is unknown. Reinforcement
311: learning algorithms \cite{Kaelbling:96,Sutton:98,Bertsekas:96} are
312: commonly used in this case to learn the unknown $\mu$. They
313: succeed if the state space is either small or has effectively been
314: made small by generalization or function approximation techniques.
315: In any case, the solutions are either ad hoc, work in restricted
316: domains only, have serious problems with state space exploration
317: versus exploitation, or have a non-optimal learning rate. There is
318: no universal and optimal solution to this problem so far. In
319: Section \ref{secAIxi} we present a new model and argue that it
320: formally solves all these problems in an optimal way. The true
321: probability distribution $\mu$ will not be learned directly, but
322: will be replaced by a universal prior $\xi$, which is shown to
323: converge to $\mu$ in a sense.
324:
325:
326: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
327: \section{Algorithmic Complexity and Universal Induction}\label{secAIsp}
328: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
329:
330: %------------------------------%
331: \paragraph{The problem of the unknown environment:}
332: %------------------------------%
333: We have argued that currently there is no universal and optimal solution to
334: reinforcement learning problems. On the other hand,
335: \cite{Solomonoff:64} defined a universal scheme of inductive
336: inference, based on Epicurus' principle of multiple explanations,
337: Ockham's razor, and Bayes' rule for
338: conditional probabilities. For an excellent introduction
339: one should consult the book of \cite{Li:97}. In
340: the following we outline the theory and the basic results.
341:
342: %------------------------------%
343: \paragraph{Kolmogorov complexity and universal probability:}
344: %------------------------------%
345: Let us choose some universal prefix Turing machine $U$ with
346: unidirectional binary input and output tapes and a bidirectional
347: working tape. We can then define the (conditional) prefix
348: Kolmogorov complexity
349: \cite{Chaitin:75,Gacs:74,Kolmogorov:65,Levin:74} as the
350: length $l$ of the shortest
351: program $p$, for which $U$ outputs the binary string
352: $x\!=\!x_{1:n}$ with $x_i\in\!\{0,1\}$:
353: %
354: $$
355: K(x) \;:=\; \min_p\{l(p): U(p)=x\},
356: $$
357: and given $y$
358: $$
359: K(x|y) \;:=\; \min_p\{l(p): U(p,y)=x\}.
360: $$
361: %
362: The {\em universal semimeasure} $\xi(\pb x)$ is defined as the
363: probability that the output of $U$
364: starts with $x$ when provided with fair coin flips on the input
365: tape \cite{Solomonoff:64,Solomonoff:78}. It is easy to see that
366: this is equivalent to the formal definition
367: %
368: \beq\label{xidef}
369: \xi(\pb x)\;:=\;\sum_{p\;:\;\exists\omega:U(p)=x\omega}\nq 2^{-l(p)}
370: \eeq
371: where the sum is over minimal programs $p$ for which $U$
372: outputs a string starting with $x$. $U$ might be non-terminating.
373: As the short programs dominate the sum, $\xi$ is closely related
374: to $K(x)$ as $\xi(\pb x)=2^{-K(x)+O(K(l(x)))}$. $\xi$ has the
375: important universality property \cite{Solomonoff:64} that it
376: dominates every computable probability
377: distribution $\rho$ up to a multiplicative factor depending only
378: on $\rho$ but not on $x$:
379: \beq\label{uni}
380: \xi(\pb x) \;\geq\; 2^{-K(\rho)-O(1)}\!\cdot\!\rho(\pb x).
381: \eeq
382: %
383: The Kolmogorov complexity of a function like $\rho$ is defined as
384: the length of the shortest self-delimiting coding of a Turing
385: machine computing this function.
386: $\xi$ itself is {\it not} a probability
387: distribution\footnote{It is possible to normalize $\xi$ to a
388: probability distribution as has been done in
389: \cite{Solomonoff:78,Hutter:99} by giving up the enumerability of $\xi$.
390: Bounds (\ref{eukdist}) and (\ref{spebound}) hold for both
391: definitions.}.
392: We have $\xi(\pb{x0})\!+\!\xi(\pb{x1})\!<\!\xi(\pb
393: x)$ because there are programs $p$, which output just $x$, neither
394: followed by $0$ nor $1$. They just stop after printing $x$ or
395: continue forever without any further output. We will call a
396: function $\rho\!\geq 0$ with the properties
397: $\rho(\epsilon)\!\leq\!1$ and $\sum_{x_n}\rho(\pb
398: x_{1:n})\!\leq\!\rho(\pb x_{<n})$ a {\it semimeasure}. $\xi$ is a
399: semimeasure and (\ref{uni}) actually holds for all enumerable
400: semimeasures $\rho$.
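
A direct consequence of (\ref{uni}), obtained by taking the negative
binary logarithm of both sides, may help to interpret the
multiplicative constant:
\beqn
  -\log_2\xi(\pb x) \;\leq\; -\log_2\rho(\pb x) + K(\rho) + O(1).
\eeqn
In words, the code length assigned to $x$ by $\xi$ exceeds the code
length assigned by any computable $\rho$ by at most about $K(\rho)$
bits, independently of the length of $x$. For instance, if
$K(\rho)\!=\!100$, then $\xi$ loses at most $100\!+\!O(1)$ bits
against $\rho$ on every string.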
401:
402: %------------------------------%
403: \paragraph{Universal sequence prediction:}
404: %------------------------------%
405: (Binary) sequence prediction algorithms try to predict the
406: continuation $x_n$ of a given sequence $x_1...x_{n-1}$. In the
407: following we will assume that the sequences are drawn from
408: a probability distribution and that the true probability of a
409: string starting with $x_1...x_n$ is $\mu(\pb x_{1:n})$. The
410: probability of $x_n$ given $x_{<n}$ hence is $\mu(x_{<n}\pb x_n)$.
411: If we measure prediction quality as the number of correct
412: predictions, the best possible system predicts the $x_n$ with the
413: highest probability. Usually $\mu$ is unknown and the system can
414: only have some belief $\rho$ about the true distribution $\mu$.
415: Now the universal probability $\xi$
416: comes into play: \cite{Solomonoff:78} has proved
417: that the total $\mu$-expected squared difference
418: between $\xi$ and $\mu$ is finite for computable $\mu$:
419: \beq\label{eukdist}
420: \sum_{k=1}^\infty\sum_{x_{1:k}}\mu(\pb x_{<k})
421: (\xi(x_{<k}\pb x_k)-\mu(x_{<k}\pb x_k))^2
422: \eeq
423: $$
424: <\; \ln 2\!\cdot\!K(\mu)+O(1).
425: $$
426: A simplified proof can be found in \cite{Hutter:99}. So the
427: difference between $\xi(x_{<n}\pb x_n)$ and $\mu(x_{<n}\pb x_n)$ tends
428: to zero with $\mu$ probability $1$ for {\it any} computable
429: probability distribution $\mu$. The reason for the astonishing
430: property of a single (universal) function to converge to {\it any}
431: computable probability distribution lies in the fact that the sets
432: of $\mu$-random sequences differ for different $\mu$. The
433: universality property (\ref{uni}) is the central ingredient for
434: proving (\ref{eukdist}).
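
The role of (\ref{uni}) can be sketched as follows. This outline
assumes, for simplicity, that $\xi$ is normalized to a probability
distribution, and it is not the full proof of the cited works. Write
$a\!=\!\mu(x_{<k}\pb 1)$ and $b\!=\!\xi(x_{<k}\pb 1)$. In the binary
case the inner sum over $x_k$ in (\ref{eukdist}) equals $2(a-b)^2$,
which is bounded by the relative entropy,
\beqn
  2(a-b)^2 \;\leq\; a\ln{a\over b}+(1\!-\!a)\ln{1-a\over 1-b}.
\eeqn
Summing the $\mu$-expected relative entropies over $k\!=\!1...n$,
the chain rule collapses them into a single term, which
(\ref{uni}) with $\rho\!=\!\mu$ bounds independently of $n$:
\beqn
  \sum_{x_{1:n}}\mu(\pb x_{1:n})
  \ln{\mu(\pb x_{1:n})\over\xi(\pb x_{1:n})}
  \;\leq\; \ln 2\!\cdot\!K(\mu)+O(1).
\eeqn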
435:
436: %------------------------------%
437: \paragraph{Error bounds:}
438: %------------------------------%
439: Let SP$\rho$ be a probabilistic
440: sequence predictor, predicting $x_n$ with probability
441: $\rho(x_{<n}\pb x_n)$. If $\rho$ is only a semimeasure the
442: SP$\rho$ system might refuse any output in some cycles $n$.
443: Further, we define a deterministic sequence predictor
444: SP$\Theta_\rho$ predicting the $x_n$ with highest $\rho$
445: probability. $\Theta_\rho(x_{<n}\pb x_n)\!:=\!1$ for one $x_n$
446: with $\rho(x_{<n}\pb x_n)\!\geq\!\rho(x_{<n}\pb x'_n)\,\forall
447: x'_n$ and $\Theta_\rho(x_{<n}\pb x_n)\!:=\!0$ otherwise.
448: SP$\Theta_\mu$ is the best prediction scheme when $\mu$ is known.
449: If $\rho(x_{<n}\pb x_n)$ converges quickly to $\mu(x_{<n}\pb x_n)$ the
450: number of additional prediction errors introduced by using
451: $\Theta_\rho$ instead of $\Theta_\mu$ for prediction should be
452: small in some sense.
453: Let us define the total number of expected erroneous predictions
454: the SP$\rho$ system makes for the first $n$ bits:
455: \beq\label{esp}
456: E_{n\rho} \;:=\; \sum_{k=1}^n\sum_{x_{1:k}}\mu(\pb x_{1:k})
457: (1\!-\!\rho(x_{<k}\pb x_k)).
458: \eeq
459: The SP$\Theta_\mu$ system is best in the sense that
460: $E_{n\Theta_\mu}\!\leq\!E_{n\rho}$
461: for any $\rho$. In \cite{Hutter:99} it has been shown that
462: SP$\Theta_\xi$ is not much worse
463: \beq\label{spebound}
464: E_{n\Theta_\xi}\!-\!E_{n\rho} \;\leq\;
465: H+\sqrt{4E_{n\rho}H+H^2} \;=\;
466: O(\sqrt{E_{n\rho}})
467: \eeq
468: $$
469: \mbox{with}\quad H\;<\;\ln 2\!\cdot\!K(\mu)+O(1)
470: $$
471: and the tightest bound for $\rho\!=\!\Theta_\mu$. For finite
472: $E_{\infty\Theta_\mu}$, $E_{\infty\Theta_\xi}$ is finite too. For
473: infinite $E_{\infty\Theta_\mu}$,
474: $E_{n\Theta_\xi}/E_{n\Theta_\mu}\toinfty{n}1$ with rapid
475: convergence. One can hardly imagine any better prediction
476: algorithm than SP$\Theta_\xi$ without extra knowledge about the
477: environment. In \cite{Hutter:00e}, (\ref{eukdist}) and
478: (\ref{spebound}) have been generalized from binary to arbitrary
479: alphabet and to general loss functions. Apart from computational
480: aspects, which are of course very important, the problem of
481: sequence prediction could be viewed as essentially solved.
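
Two special cases illustrate the size of the bound
(\ref{spebound}); the numbers are chosen purely for illustration. If
the best informed predictor makes no errors,
$E_{n\Theta_\mu}\!=\!0$ (e.g.\ for a deterministic $\mu$), then with
$\rho\!=\!\Theta_\mu$
\beqn
  E_{n\Theta_\xi} \;\leq\; H+\sqrt{H^2} \;=\; 2H,
\eeqn
so SP$\Theta_\xi$ makes only a bounded number of errors, of the
order of $K(\mu)$. If instead $E_{n\rho}\!=\!100$ and $H\!=\!1$, then
$E_{n\Theta_\xi}\!-\!E_{n\rho}\leq 1+\sqrt{401}\approx 21$.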
482:
483: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
484: \section{The Universal AI$\xi$ Model}\label{secAIxi}
485: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
486:
487: %------------------------------%
488: \paragraph{Definition of the AI$\xi$ Model:}
489: %------------------------------%
490: We have developed enough formalism to suggest our universal
491: AI$\xi$ model. All we have to do is to suitably generalize the
492: universal semimeasure $\xi$ from the last section and to replace
493: the true but unknown probability $\mu$ in the AI$\mu$ model by
494: this generalized $\xi$. In what sense this AI$\xi$ model is
495: universal and optimal will be discussed thereafter.
496:
497: We define the generalized universal probability $\xi^{AI}$ as the
498: $2^{-l(q)}$ weighted sum over all chronological programs
499: (environments) $q$ which output $x_{1:k}$, similar to
500: (\ref{xidef}) but with $y_{1:k}$ provided on the ``input''
501: tape:
502: \beq\label{uniMAI}
503: \xi(y\!\pb x_{1:k}) \;:=\;
504: \nq\sum_{q:q(y_{1:k})=x_{1:k}}\nq 2^{-l(q)}.
505: \eeq
506: %
507: Replacing $\mu$ by $\xi$ in (\ref{ydotrec}) the
508: iterative AI$\xi$ system outputs
509: \beq\label{ydotxi}
510: \hh y_k :=
511: \maxarg_{y_k}\!\sum_{x_k}...
512: \max_{y_m}\!\sum_{x_m}
513: (r_k\!+...+\!r_m)
514: \xi(\hh y\!\hh x_{<k}y\!\pb x_{k:m})
515: \eeq
516: in cycle $k$ given the history $\hh y\!\hh x_{<k}$.
517:
518: %------------------------------%
519: \paragraph{(Non)parameters of AI$\xi$:}
520: %------------------------------%
521: The AI$\xi$ model and its behaviour are completely defined by
522: (\ref{uniMAI}) and (\ref{ydotxi}). It (slightly) depends on the
523: choice of the universal Turing machine. The AI$\xi$ model also
524: depends on the choice of $X$ and $Y$, but we do
525: not expect any bias when the spaces are chosen sufficiently large
526: and simple, e.g.\ all strings of length $2^{16}$. Choosing $I\!\!N$
527: as word space would be ideal, but whether the maxima (or suprema)
528: exist in this case has to be shown beforehand. The only
529: non-trivial dependence is on the horizon $m$. Ideally we would
530: like to choose $m\!=\!\infty$, but there are several subtleties
531: \ifijcai{discussed in \cite{Hutter:00f},}
532: \else{to be discussed later,}
533: \fi
534: which prevent at least a naive limit
535: $m\!\to\!\infty$. So apart from $m$ and unimportant details, the
536: AI$\xi$ system is uniquely defined by (\ref{ydotxi}) and
537: (\ref{uniMAI}) without adjustable parameters. It does not depend on
538: any assumption about the environment apart from being generated by
539: some computable (but unknown!) probability distribution as we will see.
540:
541: \ifijcai\else
542: %------------------------------%
543: \paragraph{$\xi$ is only a semimeasure:}
544: %------------------------------%
545: One subtlety should be mentioned.
546: Like in the SP case, $\xi$ is
547: not a probability distribution but still satisfies the weaker
548: inequalities
549: \beq\label{chrf}
550: \sum_{x_n}\xi(y\!\pb x_{1:n}) \;\leq\; \xi(y\!\pb x_{<n})
551: \quad,\quad
552: \xi(\epsilon) \;\leq\; 1
553: \eeq
554: Note that the sum on the l.h.s.\ is {\it not} independent of
555: $y_n$ unlike for the chronological probability distribution $\mu$.
556: Nevertheless, it is bounded by something (the r.h.s.) which is
557: independent of $y_n$. The reason is that the sum in (\ref{uniMAI})
558: runs over (partial recursive) chronological functions only and the
559: functions $q$ which satisfy $q(y_{1:n})=x_{<n}x'_n$ for some
560: $x'_n\!\in\!X$ are a subset of the functions satisfying
561: $q(y_{<n})=x_{<n}$. We will in general call functions satisfying
562: (\ref{chrf}) {\it chronological semimeasures}. The important point
563: is that the conditional probabilities (\ref{uniMAI}) are $\leq\!1$
564: like for true probability distributions.
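
To spell out the last remark (assuming, as for ordinary
probabilities, that the conditional $\xi$ is defined as a ratio of
the joint quantities (\ref{uniMAI})):
\beqn
  \xi(y\!x_{<n}y\!\pb x_n) \;:=\;
  {\xi(y\!\pb x_{1:n}) \over \xi(y\!\pb x_{<n})}.
\eeqn
Dividing the first inequality in (\ref{chrf}) by
$\xi(y\!\pb x_{<n})$ then gives
$\sum_{x_n}\xi(y\!x_{<n}y\!\pb x_n)\leq 1$, so each conditional
value lies in $[0,1]$, although the values need not sum to one.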
565: \fi %extended
566:
567: %------------------------------%
568: \paragraph{Universality of $\xi^{AI}$:}
569: %------------------------------%
570: It can be shown that $\xi^{AI}$ defined in
571: (\ref{uniMAI}) is universal and converges to $\mu^{AI}$
572: analogously to the SP case (\ref{uni}) and (\ref{eukdist}). The
573: proofs are generalizations from the SP case. The actions $y$ are pure
574: spectators and cause no difficulties in the generalization. This
575: will change when we analyze error/value bounds analogously to
576: (\ref{spebound}). The major difference when incorporating $y$ is
577: that in (\ref{xidef}), $U(p)=x\omega$ produces strings starting with $x$,
578: whereas in (\ref{uniMAI}) we can demand $q$ to output exactly $n$
579: words $x_{1:n}$ as $q$ knows $n$ from the number of input words
580: $y_1...y_n$.
581: $\xi^{AI}$ dominates all {\em chronological enumerable
582: semimeasures}
583: \beq\label{uniaixi}
584: \xi(y\!\pb x_{1:n}) \;\geq\;
585: 2^{-K(\rho)-O(1)}\rho(y\!\pb x_{1:n}).
586: \eeq
587: $\xi$ is a universal element in the sense of (\ref{uniaixi})
588: in the set of all enumerable chronological semimeasures. This can
589: be proved even for infinite (countable) alphabet
590: \cite{Hutter:00f}.
591:
592: %------------------------------%
593: \paragraph{Convergence of $\xi^{AI}$ to $\mu^{AI}$:}
594: %------------------------------%
595: From (\ref{uniaixi}) one can show
596: $$
597: \sum_{k=1}^n\sum_{x_{1:k}}\mu(y\!\pb x_{<k})
598: \Big(\mu(y\!x_{<k}y\!\pb x_k)-\xi(y\!x_{<k}y\!\pb x_k)\Big)^2
599: $$
600: \beq\label{eukdistxi}
601: \;<\; \ln 2\!\cdot\!K(\mu)+O(1)
602: \eeq
603: for computable chronological measures $\mu$. The main
604: complication in generalizing (\ref{eukdist}) to (\ref{eukdistxi})
605: is the generalization to non-binary alphabet \cite{Hutter:00e}.
606: The $y$ are, again, pure spectators.
607: (\ref{eukdistxi}) shows that the $\mu$-expected
608: squared difference of $\mu$ and $\xi$ is finite for computable
609: $\mu$. This, in turn, shows that $\xi(y\!x_{<k}y\!\pb x_k)$
610: converges to $\mu(y\!x_{<k}y\!\pb x_k)$ for $k\!\to\!\infty$ with $\mu$
611: probability 1. If we take a finite product of $\xi$'s and use
612: Bayes' rule, we see that also $\xi(y\!x_{<k}y\!\pb x_{k:k+r})$
613: converges to $\mu(y\!x_{<k}y\!\pb x_{k:k+r})$. More generally, in case of
614: a bounded horizon $h_k\equiv m_k\!-\!k\!+\!1 \leq h_{max}\!<\!\infty$, it follows that
615: \beq\label{aixitomu}
616: \xi(y\!x_{<k}y\!\pb x_{k:m_k}) \toinfty{k} \mu(y\!x_{<k}y\!\pb x_{k:m_k})
617: \eeq
618: Convergence is only guaranteed for one (e.g.\ the true) i/o
619: sequence $\hh y\!\hh x_{<k}\hh y\!\hh x_{k:m_k}$ but not for
620: alternate sequences $\hh y\!\hh x_{<k}y\!x_{k:m_k}$. Since
621: (\ref{ydotxi}) takes an average over all possible future actions
622: and perceptions $y\!x_{k:m_k}$, not only the one which will
623: finally occur, (\ref{aixitomu}) does not guarantee $\hh y_k^\xi\!\to\!\hh
624: y_k^\mu$. This
625: gap is already present in the SP$\Theta_\rho$ models, but
626: nevertheless good error bounds could be proved. This gives
627: confidence that the outputs $\hh y_k$ of the AI$\xi$ model
628: (\ref{ydotxi}) could converge to the outputs $\hh y_k$ of the
629: AI$\mu$ model (\ref{ydotrec}), at least for a bounded horizon
630: $h_k$. The problems with a fixed horizon $m_k\!=\!m$ and especially
631: $m\!\to\!\infty$
632: \ifijcai{are discussed in \cite{Hutter:00f}.}
633: \else{will be discussed later.}
634: \fi
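
The product step used for (\ref{aixitomu}) can be written out for
$r\!=\!1$ (a sketch; the general case iterates the same identity):
\beqn
  \xi(y\!x_{<k}y\!\pb x_{k:k+1}) =
  \xi(y\!x_{<k}y\!\pb x_k)\cdot
  \xi(y\!x_{1:k}y\!\pb x_{k+1}).
\eeqn
Both factors on the right converge to the corresponding $\mu$
factors along the true sequence, and since all factors lie in
$[0,1]$, the product converges as well. With at most $h_{max}$
factors the same argument gives (\ref{aixitomu}).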
635:
636: %------------------------------%
637: \paragraph{Universally optimal AI systems:}
638: %------------------------------%
639: We call an AI model {\it universal} if it is
640: $\mu$-independent (unbiased, model-free) and is able to solve any
641: solvable problem and learn any learnable task. Further, we call a
642: universal model {\it universally optimal} if there is no
643: program that can solve or learn significantly faster (in terms
644: of interaction cycles). As the AI$\xi$ model is parameterless,
645: $\xi$ converges to $\mu$ in the sense of
646: (\ref{eukdistxi},\ref{aixitomu}), the AI$\mu$ model is itself
647: optimal, and we expect no other model to converge faster to
648: AI$\mu$ by analogy to SP (\ref{spebound}),
649: %we risk the following conjecture:
650: \beqn
651: \mbox{\it we expect AI$\xi$ to be universally optimal.}
652: \eeqn
653: This is our main claim. Further support is given in
654: \cite{Hutter:00f} by a detailed analysis of the behaviour of
655: AI$\xi$ for various problem classes, including prediction,
656: optimization, games, and supervised learning.
657:
658: \ifijcai\else
659: %------------------------------%
660: \paragraph{The choice of the horizon:}
661: %------------------------------%
662: The only significant arbitrariness in the AI$\xi$ model lies in
663: the choice of the lifespan $m$ or the horizon
664: $h_k\!\equiv\!m_k\!-\!k\!+\!1$ if we allow a cycle dependent $m$.
665: We will not discuss ad hoc choices of $h_k$ for specific problems.
666: We are interested in universal choices. The book of
667: \cite{Bertsekas:95b} thoroughly discusses the mathematical
668: problems regarding infinite horizon systems.
669:
670: In many cases the time we are willing to run a system depends on
671: the quality of its actions. Hence, the lifetime, if finite at all,
is not known in advance. Exponential discounting
$r_k\!\to\!r_k\!\cdot\!\gamma^k$ solves the mathematical problem
of $m\!\to\!\infty$ but is no real solution, since an effective
horizon $h\sim 1/\ln{1\over\gamma}$ has been introduced. The scale
676: invariant discounting $r_k\!\to\!r_k\!\cdot\!k^{-\alpha}$ has a
677: dynamic horizon $h\sim\!k$. This choice has some appeal, as it
678: seems that humans of age $k$ years usually do not plan their lives
679: for more than the next $\sim k$ years. From a practical point of
680: view this model might serve all needs, but from a theoretical
681: point we feel uncomfortable with such a limitation in the horizon
682: from the very beginning. A possible way of taking the limit
683: $m\!\to\!\infty$ without discounting and its problems can be found
684: in \cite{Hutter:00f}.
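
To make the two horizons concrete, the following short calculation
is a sketch under assumptions not stated above: we {\it define} the
effective horizon $h$ as the number of cycles that carry half of
the remaining total discount weight, we take $0\!<\!\gamma\!<\!1$
and $\alpha\!>\!1$ so that the weights are summable, and we
approximate the sum over $j^{-\alpha}$ by an integral. For
exponential discounting,
\beqn
  {\sum_{j=k}^{k+h-1}\gamma^j \over \sum_{j=k}^{\infty}\gamma^j}
  \;=\; 1-\gamma^h \;=\; \odt
  \quad\ff\quad
  h \;=\; {\ln 2\over\ln(1/\gamma)},
\eeqn
which is independent of $k$. For scale invariant discounting,
\beqn
  {\int_k^{k+h}t^{-\alpha}dt \over \int_k^{\infty}t^{-\alpha}dt}
  \;=\; 1-\Big(1+{h\over k}\Big)^{1-\alpha} \;=\; \odt
  \quad\ff\quad
  h \;=\; \big(2^{1/(\alpha-1)}-1\big)\!\cdot\!k,
\eeqn
which grows linearly with $k$; for $\alpha\!=\!2$ this gives
$h\!=\!k$.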
685:
686: Another objection against too large choices of $m_k$
687: is that $\xi(y\!x_{<k}y\!\pb x_{k:m_k})$ has been proved to be a
688: good approximation of $\mu(y\!x_{<k}y\!\pb x_{k:m_k})$ only for
689: $k\!\gg\!h_k$, which is never satisfied for
690: $m_k\!=\!m\!\to\!\infty$.
691: On the other hand it may turn out that the rewards
692: $r_{k'}$ for $k'\!\gg\!k$, where $\xi$ may no longer be trusted as
693: a good approximation of $\mu$, are in a sense randomly
694: disturbed with decreasing influence on the choice of $\hh y_k$.
695: This claim is supported by the forgetfulness property of $\xi$
696: \ifijcai\else{(see next section)}\fi
697: and can be proved when restricting to
698: factorizable environments \cite{Hutter:00f}.
699:
700: We are not sure whether the choice of $m_k$ is of marginal
701: importance, as long as $m_k$ is chosen sufficiently large and of
702: low complexity, $m_k=2^{2^{16}}$ for instance, or whether the
703: choice of $m_k$ will turn out to be a central topic for the
704: AI$\xi$ model or for the planning aspect of any universal AI
705: system in general. Most if not all problems in agent design of
706: balancing exploration and exploitation vanish by a sufficiently
707: large choice of the (effective) horizon and/or a sufficiently
708: general prior. We suppose that the limit $m_k\!\to\!\infty$ for
709: the AI$\xi$ model results in correct behaviour for weakly
710: separable (defined in the next section) $\mu$, and that even the
711: naive limit $m\!\to\!\infty$ may exist.
712: \fi
713:
714: %------------------------------%
715: \paragraph{Value bounds and separability concepts:}
716: %------------------------------%
717: The values $V_\rho^\best$ associated with the AI$\rho$ systems
718: correspond roughly to the negative error measure $-E_{n\rho}$ of
719: the SP$\rho$ systems. In the SP case we were interested in small
720: bounds for the error excess $E_{n\Theta_\xi}\!-\!E_{n\rho}$.
Unfortunately, simple value bounds for AI$\xi$ or any other AI system in terms of
$V^\best$, analogous to the error bound (\ref{spebound}), cannot
hold \cite{Hutter:00f}. We even have difficulties in specifying
724: what we can expect to hold for AI$\xi$ or any AI system which
725: claims to be universally optimal. In SP, the only important
726: property of $\mu$ for proving error bounds was its complexity
727: $K(\mu)$. In the AI case, there are no useful bounds in terms of
728: $K(\mu)$ only. We either have to study restricted problem classes
729: or consider bounds depending on other properties of $\mu$, rather
than on its complexity only. In \cite{Hutter:00f} the difficulties
are exhibited by two examples. Several concepts that might be
useful for proving value bounds are introduced and discussed. They
include forgetful, relevant, asymptotically learnable, farsighted,
uniform, (generalized) Markovian, factorizable and (pseudo)
passive $\mu$. They are sorted approximately in order of
decreasing generality and are called {\it separability concepts}.
A first weak bound for passive $\mu$ is proved.
738:
739: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
740: \section{Time Bounds and Effectiveness}\label{secTime}
741: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
742:
743: %------------------------------%
744: \paragraph{Non-effectiveness of AI$\xi$:}
745: %------------------------------%
746: $\xi$ is not a computable but
747: only an enumerable semimeasure. Hence, the output $\hh y_k$ of the
748: AI$\xi$ model is only asymptotically computable. AI$\xi$ yields an
749: algorithm that produces a sequence of trial outputs eventually
750: converging to the correct output $\hh y_k$, but one can never be sure
751: whether one has already reached it. Besides this, convergence
752: is extremely slow, so this type of asymptotic computability is of
753: no direct (practical) use. Furthermore, the replacement
754: of $\xi$ by time-limited versions \cite{Li:91,Li:97}, which is
755: suitable for sequence prediction, has been shown to fail for the
756: AI$\xi$ model \cite{Hutter:00f}.
757: This leads to the issues addressed next.
758:
759: %------------------------------%
760: \paragraph{Time bounds and effectiveness:}
761: %------------------------------%
762: Let $\tilde p$ be a policy which calculates an acceptable output
763: within a reasonable time $\tilde t$ per cycle. This sort of
764: computability assumption, namely, that a general purpose computer
765: of sufficient power and appropriate program is able to behave in
an intelligent way, is the very basis of AI research. Here it is
not necessary to discuss what exactly is meant by
`reasonable time/intelligence' and `sufficient power'. What we are
769: interested in is whether there is a computable version
770: AI$\xi^{\tilde t}$ of the AI$\xi$ system which is superior or
771: equal to any program $p$ with computation time per cycle of at
772: most $\tilde t$.
773:
774: What one can realistically hope to construct is an AI$\xi^{\tilde
775: t\tilde l}$ system of computation time $c\!\cdot\!\tilde t$ per
776: cycle for some constant $c$. The idea is to run all programs $p$
777: of length $\leq\!\tilde l\!:=\!l(\tilde p)$ and time $\leq\!\tilde
778: t$ per cycle and pick the best output in the sense of maximizing
779: the {\em universal value} $V_\xi^\best$. The total computation time is
$c\!\cdot\!\tilde t$ with $c\!\approx\!2^{\tilde l}$. Unfortunately,
$V_\xi^\best$ cannot be used directly, since this measure is also
only semi-computable and the approximation obtained from
computable versions of $\xi$ within a time of order
$c\!\cdot\!\tilde t$ is crude \cite{Li:97,Hutter:00f}. On the
other hand, we {\it have} to use a measure which converges to
$V_\xi^\best$ for $\tilde t,\tilde l\!\to\!\infty$, since the
AI$\xi^{\tilde t\tilde l}$ model should converge to the AI$\xi$ model
in that case.
789:
790: %------------------------------%
791: \paragraph{Valid approximations:}
792: %------------------------------%
793: A solution satisfying the above conditions is suggested in
794: \cite{Hutter:00f}. The main idea is to consider {\em extended
795: chronological incremental policies} $p$, which in addition to the
796: regular output $y_k^p$ {\em rate} their own output with $w_k^p$. The
797: AI$\xi^{\tilde t\tilde l}$ model selects the output $\hh y_k\!=\!y_k^p$
798: of the policy $p$ with highest rating $w_k^p$. $p$ might suggest
799: any output $y_k^p$ but it is not allowed to rate itself with an
800: arbitrarily high $w_k^p$ if one wants $w_k^p$ to be a reliable
801: criterion for selecting the best $p$. One must demand that no
802: policy $p$ is allowed to claim that it is better than it actually
803: is. In \cite{Hutter:00f} a (logical) predicate VA($p$), called
804: {\it valid approximation}, is defined, which is true if, and only
805: if, $p$ {\it always} satisfies $w_k^p\!\leq\!V_\xi^p(y\!x_{<k})$, i.e. never
806: overrates itself. $V_\xi^p(y\!x_{<k})$ is the $\xi$ expected
807: future reward under policy $p$. Valid policies $p$ can then be
808: (partially) ordered w.r.t.\ their rating $w_k^p$.
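
As a reading aid, the requirement can be written compactly. The
quantification over all cycles and all histories is our
interpretation of `always'; the precise definition is the one
given in \cite{Hutter:00f}.
\beqn
  \mbox{VA}(p) \quad:\gdw\quad
  w_k^p\;\leq\;V_\xi^p(y\!x_{<k})
  \quad\mbox{for all $k$ and all histories $y\!x_{<k}$}.
\eeqn
For illustration, assume that rewards are non-negative. A policy
that always rates itself with $w_k^p\!=\!0$ is then trivially
valid, but it is never preferred over a valid policy with a
positive rating.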
809:
810: %------------------------------%
811: \paragraph{The universal time bounded AI$\xi^{\tilde t\tilde l}$ system:}
812: %------------------------------%
813: In the following, we describe the algorithm $p^\best$ underlying
814: the universal time bounded AI$\xi^{\tilde t\tilde l}$ system. It
815: is essentially based on the selection of the best algorithms
816: $p_k^\best$ out of the time ${\tilde t}$ and length ${\tilde l}$
817: bounded policies $p$, for which there exists a proof $P$ of
818: VA($p$) with length $\leq\!l_P$.
819:
820: \begin{enumerate}\parskip=0ex\parsep=0ex\itemsep=0ex
821: \item Create all binary strings of length $l_P$ and interpret each
822: as a coding of a mathematical proof in the same formal logic system in
823: which VA($\cdot$) has been formulated. Take those strings
824: which are proofs of VA($p$) for some $p$ and keep the
825: corresponding programs $p$.
826: \item Eliminate all $p$ of length $>\!\tilde l$.
\item Modify all $p$ in the following way: all output $w_k^py_k^p$
is temporarily written on an auxiliary tape. If $p$ stops within $\tilde t$
steps, the internal `output' is copied to the output tape. If $p$
does not stop after $\tilde t$ steps, a stop is forced, and $w_k^p\!=\!0$
and some arbitrary $y_k^p$ are written on the output tape. Let ${\cal P}$ be
the set of all those modified programs.
833: \item Start first cycle: $k\!:=\!1$.
834: \item\label{pbestloop} Run every $p\!\in\!{\cal P}$ on extended input
835: $\hh y\!\hh x_{<k}$, where all outputs are redirected to some auxiliary
836: tape:
837: $p(\hh y\!\hh x_{<k})\!=\!w_1^py_1^p...w_k^py_k^p$. This step is
838: performed incrementally by adding $\hh y\!\hh x_{k-1}$ for $k\!>\!1$ to
839: the input tape and continuing the computation of the previous
840: cycle.
841: \item Select the program $p$ with highest rating $w_k^p$:
842: $p_k^\best\!:=\!\maxarg_pw_k^p$.
843: \item Write $\hh y_k\!:=\!y_k^{p_k^\best}$ to the output tape.
844: \item Receive input $\hh x_k$ from the environment.
845: \item Begin next cycle: $k\!:=\!k\!+\!1$, goto step
846: \ref{pbestloop}.
847: \end{enumerate}
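
Steps 6 and 7 can be summarized in one line. The following is a
sketch that only combines the selection rule with the validity
requirement $w_k^p\!\leq\!V_\xi^p(y\!x_{<k})$; it assumes that
every $p\!\in\!{\cal P}$ is still valid after the modification in
step 3, which is plausible for non-negative values because a
forced stop sets $w_k^p\!=\!0$, but is not shown here.
\beqn
  \hh y_k = y_k^{p_k^\best},\quad
  p_k^\best = \maxarg_{p\in{\cal P}}w_k^p
\eeqn
\beqn
  \ff\quad
  V_\xi^{p_k^\best}(\hh y\!\hh x_{<k})
  \;\geq\; w_k^{p_k^\best}
  \;\geq\; w_k^p
  \quad\mbox{for all } p\!\in\!{\cal P}.
\eeqn
In words, the rating of the selected policy is a lower bound on
its $\xi$-expected future reward and is at least as high as the
rating claimed by any competing policy in ${\cal P}$.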
848:
849: %------------------------------%
850: \paragraph{Properties of the $p^\best$ algorithm:}
851: %------------------------------%
852: Let $p$ be any extended chronological (incremental) policy of
853: length $l(p)\!\leq\!\tilde l$ and computation time per cycle
854: $t(p)\!\leq\!\tilde t$, for which there exists a proof of VA($p$)
of length $\leq\!l_P$. The algorithm $p^\best$, depending on
$\tilde l$, $\tilde t$ and $l_P$ but not on $p$, always has a
rating at least as high as that of any such $p$. The setup time of $p^\best$ is
858: $t_{setup}(p^\best)\!=\!O(l_P^2\!\cdot\!2^{l_P})$ and the
859: computation time per cycle is $t_{cycle}(p^\best)\!=\!O(2^{\tilde
860: l}\!\cdot\!\tilde t)$. Furthermore, for $\tilde t,\tilde
861: l\!\to\!\infty$, $p^\best$ converges to the behavior of the AI$\xi$
862: model.
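
The two time bounds can be made plausible by simple counting. The
sketch below assumes that checking whether one string of length
$l_P$ encodes a proof of VA($p$) costs $O(l_P^2)$ steps; this
per-string cost is our assumption, chosen to match the stated
setup time, and is not derived here. The overhead of simulating
the programs and comparing their ratings is ignored.
\beqn
  t_{setup}(p^\best) \;=\; 2^{l_P}\!\cdot\!O(l_P^2)
  \;=\; O(l_P^2\!\cdot\!2^{l_P}),
\eeqn
\beqn
  t_{cycle}(p^\best) \;\leq\;
  \Big(\sum_{l=0}^{\tilde l}2^l\Big)\!\cdot\!\tilde t
  \;=\; (2^{\tilde l+1}-1)\!\cdot\!\tilde t
  \;=\; O(2^{\tilde l}\!\cdot\!\tilde t).
\eeqn
The second bound uses that there are fewer than $2^{\tilde l+1}$
binary programs of length $\leq\!\tilde l$, each of which is
stopped after at most $\tilde t$ steps per cycle.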
863:
864: Roughly speaking, this means that if there exists a computable
865: solution to some AI problem at all, then the explicitly
constructed algorithm $p^\best$ is such a solution. Although this
claim is quite general, there are some limitations and open
questions regarding the setup time, the necessity that
the policies must rate their own output, true but not
efficiently provable VA($p$), and ``inconsistent''
policies \cite{Hutter:00f}.
872:
873:
874: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
875: \section{Outlook \& Discussion}\label{secOutlook}
876: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
877: This section contains some discussion and remarks on otherwise
878: unmentioned topics.
879:
880: %------------------------------%
881: \paragraph{Value bounds:}
882: %------------------------------%
883: Rigorous proofs of value bounds for the AI$\xi$ theory are the
884: major theoretical challenge -- general ones as well as tighter
bounds for special environments $\mu$. Of special importance are
suitable (and acceptable) conditions on $\mu$, under which $\hh
y_k$ and finite value bounds exist for infinite $Y$, $X$ and $m$.
888:
889: %------------------------------%
890: \paragraph{Scaling AI$\xi$ down:}
891: %------------------------------%
892: \cite{Hutter:00f} shows for several examples how to integrate
893: problem classes into the AI$\xi$ model. Conversely, one can
894: downscale the AI$\xi$ model by using more restricted forms of
895: $\xi$. This could be done in a similar way as the theory of
896: universal induction has been downscaled with many insights to the
897: Minimum Description Length principle \cite{Li:92b,Rissanen:89} or
898: to the domain of finite automata \cite{Feder:92}. The AI$\xi$
899: model might similarly serve as a super model or as the very
900: definition of (universal unbiased) intelligence, from which
901: specialized models could be derived.
902:
903: %------------------------------%
904: \paragraph{Applications:}
905: %------------------------------%
906: \cite{Hutter:00f} shows how a number of AI problem classes,
907: including {\em sequence prediction}, {\em strategic games}, {\em
908: function minimization} and {\em supervised learning} fit into
909: the general AI$\xi$ model. All problems are claimed to be formally
910: solved by the AI$\xi$ model. The solution is, however, only
911: formal, because the AI$\xi$ model is uncomputable or, at best,
912: approximable. First, each problem class is formulated in its
913: natural way (when $\mu^{\mbox{\tiny problem}}$ is known) and then
914: a formulation within the AI$\mu$ model is constructed and their
915: equivalence is proven. Then, the consequences of replacing $\mu$
916: by $\xi$ are considered. The main goal is to understand
917: how the problems are solved by AI$\xi$. For more details see
918: \cite{Hutter:00f}.
919:
920: %------------------------------%
921: \paragraph{Implementation and approximation:}
922: %------------------------------%
923: The AI$\xi^{\tilde t\tilde l}$ model suffers from the same large
924: factor $2^{\tilde l}$ in computation time as Levin search for
925: inversion problems
926: \ifijcai\cite{Levin:73}.
927: \else\cite{Levin:73,Levin:84}.
928: \fi
929: Nevertheless, Levin
930: search has been implemented and successfully applied to a variety
931: of problems \cite{Schmidhuber:97nn,Schmidhuber:97bias}. Hence, a direct
932: implementation of the AI$\xi^{\tilde t\tilde l}$ model may also be
933: successful, at least in toy environments, e.g.\ prisoner problems.
934: The AI$\xi^{\tilde t\tilde l}$ algorithm should be regarded only
935: as the first step toward a {\em computable universal AI model}.
936: Elimination of the factor $2^{\tilde l}$ without giving up
937: universality will probably be a very difficult task. One could try
938: to select programs $p$ and prove VA($p$) in a more clever way than
by mere enumeration. All kinds of ideas, such as heuristic search,
genetic algorithms, advanced theorem provers, and many more, could
be incorporated. But now we have a problem.
942:
943: %------------------------------%
944: \paragraph{Computability:}
945: %------------------------------%
We seem to have merely transferred the AI problem to a different
level. This shift has some advantages (and also some
disadvantages) but in no way presents a solution. Nevertheless,
we want to stress that we have reduced the AI problem to (mere)
computational questions. Even the most general other systems the
author is aware of depend on some (more than complexity)
assumptions about the environment, or it is far from clear whether
they are, indeed, universally optimal. Although computational
954: questions are themselves highly complicated, this reduction is a
955: non-trivial result. A formal theory of something, even if not
956: computable, is often a great step toward solving a problem and has
957: also merits of its own (see previous paragraphs).
958:
959: %------------------------------%
960: \paragraph{Elegance:}
961: %------------------------------%
Many researchers in AI believe that intelligence is something
complicated and cannot be condensed into a few formulas. They
believe it is more a matter of combining enough {\em methods} and much
explicit {\em knowledge} in the right way. From a theoretical
point of view, we disagree, as the AI$\xi$ model is simple and
seems to serve all needs. From a practical point of view, we agree
to the following extent: to reduce the computational burden, one
should provide special purpose algorithms ({\em methods}) from the
very beginning, probably many of them related to reducing the
complexity of the input and output spaces $X$ and $Y$ by
appropriate pre/post-processing methods.
973:
974: %------------------------------%
975: \paragraph{Extra knowledge:}
976: %------------------------------%
977: There is no need to incorporate extra {\em knowledge} from the
978: very beginning. It can be presented in the first few cycles in
979: {\it any} format. As long as the algorithm that interprets the
data is of size $O(1)$, the AI$\xi$ system will `understand' the
data after a few cycles (see \cite{Hutter:00f}). If the
982: environment $\mu$ is complicated but extra knowledge $z$ makes
983: $K(\mu|z)$ small, one can show that the bound (\ref{eukdistxi})
984: reduces to $\ln 2\!\cdot\!K(\mu|z)$ when $x_1\!\equiv\!z$, i.e.\
985: when $z$ is presented in the first cycle. Special purpose
986: algorithms could also be presented in $x_1$, but it would be
987: cheating to say that no special purpose algorithms have been
988: implemented in AI$\xi$. The boundary between implementation and
989: training is blurred in the AI$\xi$ model.
990:
991: %------------------------------%
992: \paragraph{Training:}
993: %------------------------------%
We have not said much about the training process itself, as it is
not specific to the AI$\xi$ model and has been discussed in
the literature in various forms and disciplines. A serious discussion
would be out of place here. To repeat a truism, it is, of course,
998: important to present enough knowledge $x'_k$ and evaluate the
999: system output $y_k$ with $r_k$ in a reasonable way. To maximize
1000: the information content in the reward, one should start with
1001: simple tasks and give positive reward to approximately
1002: the better half of the outputs $y_k$, for instance.
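
The remark about the better half can be motivated by a short
calculation. As a sketch, assume a binary reward
$r_k\!\in\!\{0,1\}$ and let $q$ be the fraction of outputs that
receive $r_k\!=\!1$; both the binary reward and the symbol $q$ are
our assumptions for illustration. The information conveyed per
reward is then the binary entropy
\beqn
  H(q) \;=\; -q\log_2q-(1-q)\log_2(1-q),
\eeqn
\beqn
  {dH\over dq} \;=\; \log_2{1-q\over q} \;=\; 0
  \quad\gdw\quad q=\odt,
\eeqn
so that $H$ attains its maximum $H(\odt)\!=\!1$ bit when half of
the outputs are rewarded, while $H(q)\!\to\!0$ for $q\!\to\!0$ or
$q\!\to\!1$.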
1003:
1004: %------------------------------%
1005: \paragraph{The big questions:}
1006: %------------------------------%
1007: \cite{Hutter:00f} contains a discussion of the ``big'' questions
1008: concerning the mere existence of any computable, fast, and elegant
1009: universal theory of intelligence, related to non-computable $\mu$
1010: \cite{Penrose:94} and the `number of wisdom' $\Omega$
1011: \cite{Chaitin:75,Chaitin:91}.
1012:
1013: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1014: % Bibliography %
1015: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1016: \begin{thebibliography}{FMG92}
1017:
1018: \bibitem[Bel57]{Bellman:57}
1019: R.~Bellman.
1020: \newblock {\em Dynamic Programming}.
1021: \newblock Princeton University Press, New Jersey, 1957.
1022:
1023: \bibitem[Ber95]{Bertsekas:95b}
1024: D.~P. Bertsekas.
1025: \newblock {\em Dynamic Programming and Optimal Control, Vol. (II)}.
1026: \newblock Athena Scientific, Belmont, Massachusetts, 1995.
1027:
1028: \bibitem[BT96]{Bertsekas:96}
1029: D.~P. Bertsekas and J.~N. Tsitsiklis.
1030: \newblock {\em Neuro-Dynamic Programming}.
1031: \newblock Athena Scientific, Belmont, MA, 1996.
1032:
1033: \bibitem[Cha75]{Chaitin:75}
1034: G.~J. Chaitin.
1035: \newblock A theory of program size formally identical to information theory.
1036: \newblock {\em Journal of the ACM}, 22(3):329--340, 1975.
1037:
1038: \bibitem[Cha91]{Chaitin:91}
1039: G.~J. Chaitin.
1040: \newblock Algorithmic information and evolution.
1041: \newblock {\em in O.T. Solbrig and G. Nicolis, Perspectives on Biological
1042: Complexity, IUBS Press}, pages 51--60, 1991.
1043:
1044: \bibitem[FMG92]{Feder:92}
1045: M.~Feder, N.~Merhav, and M.~Gutman.
1046: \newblock Universal prediction of individual sequences.
1047: \newblock {\em {IEEE} Transactions on Information Theory}, 38:1258--1270, 1992.
1048:
1049: \bibitem[G\'74]{Gacs:74}
1050: P.~G\'acs.
1051: \newblock On the symmetry of algorithmic information.
1052: \newblock {\em Russian Academy of Sciences Doklady. Mathematics (formerly
1053: Soviet Mathematics--Doklady)}, 15:1477--1480, 1974.
1054:
1055: \bibitem[Hut99]{Hutter:99}
1056: M.~Hutter.
1057: \newblock New error bounds for {Solomonoff} prediction.
\newblock {\em Journal of Computer and System Sciences, in press},
1059: (IDSIA-11-00):1--13, 1999.
1060: \newblock ftp://ftp.idsia.ch/pub/techrep/IDSIA-11-00.ps.gz.
1061:
1062: \bibitem[Hut00a]{Hutter:00e}
1063: M.~Hutter.
1064: \newblock Optimality of universal prediction for general loss and alphabet.
1065: \newblock Technical Report IDSIA-15-00, Istituto Dalle Molle di Studi
1066: sull'Intelligenza Artificiale, Manno(Lugano), Switzerland, 2000.
1067: \newblock In progress.
1068:
1069: \bibitem[Hut00b]{Hutter:00f}
1070: M.~Hutter.
1071: \newblock A theory of universal artificial intelligence based on algorithmic
1072: complexity.
1073: \newblock Technical report, 2000.
1074: \newblock 62 pages, http://xxx.lanl.gov/abs/cs.AI/0004001.
1075:
1076: \bibitem[Kol65]{Kolmogorov:65}
1077: A.~N. Kolmogorov.
1078: \newblock Three approaches to the quantitative definition of information.
\newblock {\em Problems of Information Transmission}, 1(1):1--7, 1965.
1080:
1081: \bibitem[Lev73]{Levin:73}
1082: L.~A. Levin.
1083: \newblock Universal sequential search problems.
1084: \newblock {\em Problems of Information Transmission}, 9:265--266, 1973.
1085:
1086: \bibitem[Lev74]{Levin:74}
1087: L.~A. Levin.
1088: \newblock Laws of information conservation (non-growth) and aspects of the
1089: foundation of probability theory.
1090: \newblock {\em Problems of Information Transmission}, 10:206--210, 1974.
1091:
1092: \bibitem[Lev84]{Levin:84}
1093: L.~A. Levin.
1094: \newblock Randomness conservation inequalities: Information and independence in
1095: mathematical theories.
1096: \newblock {\em Information and Control}, 61:15--37, 1984.
1097:
1098: \bibitem[LK96]{Kaelbling:96}
L.~P. Kaelbling, M.~L. Littman, and A.~W. Moore.
\newblock Reinforcement learning: a survey.
\newblock {\em Journal of Artificial Intelligence Research}, 4:237--285, 1996.
1102:
1103: \bibitem[LV91]{Li:91}
1104: M.~Li and P.~M.~B. Vit\'anyi.
1105: \newblock Learning simple concepts under simple distributions.
1106: \newblock {\em SIAM Journal on Computing}, 20(5):911--935, 1991.
1107:
1108: \bibitem[LV92]{Li:92b}
1109: M.~Li and P.~M.~B. Vit\'anyi.
1110: \newblock Inductive reasoning and {Kolmogorov} complexity.
1111: \newblock {\em Journal of Computer and System Sciences}, 44:343--384, 1992.
1112:
1113: \bibitem[LV97]{Li:97}
1114: M.~Li and P.~M.~B. Vit\'anyi.
1115: \newblock {\em An introduction to {Kolmogorov} complexity and its
1116: applications}.
1117: \newblock Springer, 2nd edition, 1997.
1118:
1119: \bibitem[Pen94]{Penrose:94}
1120: R.~Penrose.
1121: \newblock {\em Shadows of the mind, {A} search for the missing science of
1122: consciousness}.
1123: \newblock Oxford Univ. Press, 1994.
1124:
1125: \bibitem[Ris89]{Rissanen:89}
1126: J.~Rissanen.
1127: \newblock {\em Stochastic Complexity in Statistical Inquiry}.
1128: \newblock World Scientific Publ. Co., 1989.
1129:
1130: \bibitem[RN95]{Russell:95}
1131: S.~J. Russell and P.~Norvig.
1132: \newblock {\em Artificial Intelligence. {A} Modern Approach}.
1133: \newblock Prentice-Hall, Englewood Cliffs, 1995.
1134:
1135: \bibitem[SB98]{Sutton:98}
1136: R.~Sutton and A.~Barto.
1137: \newblock {\em Reinforcement learning: An introduction}.
1138: \newblock Cambridge, MA, MIT Press, 1998.
1139:
1140: \bibitem[Sch97]{Schmidhuber:97nn}
1141: J.~Schmidhuber.
1142: \newblock Discovering neural nets with low {Kolmogorov} complexity and high
1143: generalization capability.
1144: \newblock {\em Neural Networks}, 10(5):857--873, 1997.
1145:
1146: \bibitem[Sol64]{Solomonoff:64}
1147: R.~J. Solomonoff.
1148: \newblock A formal theory of inductive inference: Part 1 and 2.
1149: \newblock {\em Inform. Control}, 7:1--22, 224--254, 1964.
1150:
1151: \bibitem[Sol78]{Solomonoff:78}
1152: R.~J. Solomonoff.
1153: \newblock Complexity-based induction systems: comparisons and convergence
1154: theorems.
1155: \newblock {\em IEEE Trans. Inform. Theory}, IT-24:422--432, 1978.
1156:
1157: \bibitem[SZW97]{Schmidhuber:97bias}
1158: J.~Schmidhuber, J.~Zhao, and M.~Wiering.
1159: \newblock Shifting inductive bias with success-story algorithm, adaptive
1160: {Levin} search, and incremental self-improvement.
1161: \newblock {\em Machine Learning}, 28:105--130, 1997.
1162:
1163: \end{thebibliography}
1164:
1165: \end{document}
1166:
1167: %---------------------------------------------------------------
1168: