1: 
2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3: %%      A Theory of Universal Artificial Intelligence        %%
4: %%            based on Algorithmic Complexity                %%
5: %%     Marcus Hutter: Start: 13.11.99  LastEdit: 31.03.00    %%
6: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
7: 
8: %-------------------------------%
9: %   Document-Style              %
10: %-------------------------------%
11: \documentclass[12pt]{article}
12: %\documentstyle[10pt,german,epsf,axodraw,twoside]{report}
13: % 'epsf.sty'to include post-script pictures         %
14: % 'axodraw.sty' for feynman pictures                %
15: %\epsfverbosetrue
16: \parskip=1.5ex plus 1ex minus 1ex \parindent=0ex
17: \pagestyle{headings}
18: \setcounter{tocdepth}{4}
19: \setcounter{secnumdepth}{1}
20: \topmargin=0cm  \oddsidemargin=0cm \evensidemargin=0cm
21: \textwidth=16cm \textheight=23cm
22: \unitlength=1mm \sloppy
23: %\makeindex
24: 
25: %-------------------------------%
26: %   Compiler-Switches           %
27: %-------------------------------%
28: \newif\ifall\alltrue %\allfalse              % compile only parts
29: \newif\ifexpaper\expapertrue %\expaperfalse  % compile only parts
30: %\newif\ifprivate\privatetrue %\privatefalse  % compile only parts
31: \newif\ifprivate\privatefalse                % compile only parts
32: 
33: %\def\private#1{{\it private: #1}}  % print private comments
34: \def\private#1{}            % not print private "
35: %\nofiles               % no .aux .toc ... files
36: 
37: %-------------------------------%
38: %   Macro-Definitions           %
39: %-------------------------------%
40: %\def\keywords#1{\centerline{\parbox{14cm}{{\it Key Words:} #1}}}
41: \def\keywords#1{\small\centerline{\bf Key Words}\vspace{5mm}\centerline{\parbox{14cm}{#1}}}
42: \def\eqd{\stackrel{\bullet}{=}}
43: \def\ff{\Longrightarrow}
44: \def\gdw{\Longleftrightarrow}
45: \def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}
46: \def\gtapprox{\buildrel{\lower.7ex\hbox{$>$}}\over
47:                        {\lower.7ex\hbox{$\sim$}}}
48: \def\nq{\hspace{-1em}}
49: \def\look{\(\uparrow\)}
50: \def\ignore#1{}
51: \def\deltabar{{\delta\!\!\!^{-}}}
52: \def\qed{\sqcap\!\!\!\!\sqcup}
53: \def\1d2{{\textstyle{1\over 2}}}
54: \def\hbar{h\!\!\!\!^{-}\,}
55: \def\dbar{d\!\!^{-}\!}
56: \def\eps{\varepsilon}
57: \def\beq{\begin{equation}}
58: \def\eeq{\end{equation}}
59: \def\beqn{\begin{displaymath}}
60: \def\eeqn{\end{displaymath}}
61: \def\bqa{\begin{equation}\begin{array}{c}}
62: \def\eqa{\end{array}\end{equation}}
63: \def\bqan{\begin{displaymath}\begin{array}{c}}
64: \def\eqan{\end{array}\end{displaymath}}
65: \def\pb{\underline}                       % probability notation
66: \def\pb#1{\underline{#1}}                 % probability notation
67: \def\blank{{\,_\sqcup\,}}                 % blank position
68: \def\maxarg{\mathop{\rm maxarg}}          % maxarg
69: \def\minarg{\mathop{\rm minarg}}          % minarg
70: \def\hh#1{{\dot{#1}}}                     % historic I/O
71: \def\best{*}                              % or {best}
72: \begin{document}
73: 
74: %------------------------------%
75: %      T i t l e- P a g e      %
76: %------------------------------%
77: \begin{titlepage}
78: %{\tt http://xxx.lanl.gov/abs/cs.AI/000400?}
79: \hfill Munich, 31.03.2000
80: 
81: %\hspace*{10cm}{\Large Preliminary Version 10}
82: 
83: \begin{center}       \vspace*{2cm}
84:   {\LARGE\bf A Theory of Universal Artificial Intelligence} \\[0.5cm]
85:   {\LARGE\bf based on Algorithmic Complexity}               \\[2cm]
86:   {\bf Marcus Hutter\footnotemark}                 \\[1cm]
87:   {\it Bayerstr. 21, 80335 Munich, Germany} \\[1.5cm]
88: \end{center}
89: \footnotetext{Any response to {\tt marcus@hutter1.de} is welcome.}
90: 
91: \keywords{Artificial intelligence; algorithmic complexity;
92: sequential decision theory; induction; Solomonoff; Kolmogorov;
93: Bayes; reinforcement learning; universal sequence prediction;
94: strategic games; function minimization; supervised learning.}
95: 
96: \begin{abstract}
97: Decision theory formally solves the problem of rational agents in
98: uncertain worlds if the true environmental prior probability
99: distribution is known. Solomonoff's theory of universal induction
100: formally solves the problem of sequence prediction for unknown
101: prior distribution. We combine both ideas and get a parameterless
102: theory of universal Artificial Intelligence. We give strong
103: arguments that the resulting AI$\xi$ model is the most intelligent
104: unbiased agent possible. We outline for a number of problem
105: classes, including sequence prediction, strategic games, function
106: minimization, reinforcement and supervised learning, how the
107: AI$\xi$ model can formally solve them. The major drawback of the
108: AI$\xi$ model is that it is uncomputable. To overcome this
109: problem, we construct a modified algorithm AI$\xi^{tl}$, which is
110: still effectively more intelligent than any other time $t$ and
111: space $l$ bounded agent. The computation time of AI$\xi^{tl}$
112: is of the order $t\!\cdot\!2^l$. Other discussed topics are formal
113: definitions of intelligence order relations, the horizon problem
114: and relations of the AI$\xi$ theory to other AI approaches.
115: \end{abstract}
116: 
117: \end{titlepage}
118: 
119: %------------------------------%
120: %      Table of Contents       %
121: %------------------------------%
122: {\parskip=0ex\tableofcontents}
123: 
124: \newpage
125: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
126: \section{Introduction}\label{int}
127: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
128: 
129: %------------------------------%
130: \paragraph{Artificial Intelligence:}
131: %------------------------------%
132: The science of Artificial Intelligence (AI) might be defined as
133: the construction of intelligent systems and their analysis. A
134: natural definition of {\it systems} is anything which has an
135: input and an output stream. Intelligence is more complicated. It
136: can have many faces like creativity, solving problems, pattern
137: recognition, classification, learning, induction, deduction,
138: building analogies, optimization, surviving in an environment,
139: language processing, knowledge and many more. A formal definition
140: incorporating every aspect of intelligence, however, seems difficult.
141: Further, intelligence is graded: there is a smooth transition
142: between systems that everyone would agree are not intelligent
143: and truly intelligent systems. One simply has to look at nature,
144: starting with, for instance, inanimate crystals, then come amino-acids,
145: then some RNA fragments, then viruses, bacteria, plants, animals,
146: apes, followed by the truly intelligent homo sapiens, and possibly
147: continued by AI systems or ET's. So the best we can expect to find
148: is a partial or total order relation on the set of systems, which
149: orders them w.r.t.\ their degree of intelligence (like
150: intelligence tests do for human systems, but for a limited class of
151: problems). Having this order we are, of course, interested in large
152: elements, i.e.\ highly intelligent systems. If a largest element
153: exists, it would correspond to the most intelligent system which
154: could exist.
155: 
156: Most, if not all, known facets of intelligence can be formulated
157: as goal driven or, more precisely, as maximizing some utility
158: function. It is, therefore, sufficient to study goal driven AI.
159: E.g.\ the (biological) goal of animals and humans is to survive and spread.
160: The goal of AI systems should be to be useful to humans. The
161: problem is that, except for special cases, we know neither
162: the utility function, nor the environment in which the
163: system will operate, in advance.
164: 
165: %------------------------------%
166: \paragraph{Main idea:}
167: %------------------------------%
168: We propose a theory which formally\footnote{By a formal solution
169: we mean a rigorous mathematical definition, uniquely specifying the solution.
170: In the following, a solution is
171: always meant in this formal sense.} solves the problem of unknown
172: goal and environment. It might be viewed as a unification of the ideas of
173: universal induction, probabilistic planning and reinforcement
174: learning or as a unification of sequential decision theory with algorithmic
175: information theory.
176: We apply this model to some of the facets of intelligence,
177: including induction, game playing, optimization, reinforcement and supervised
178: learning, and show how it solves these problem classes. This,
179: together with general convergence theorems, motivates us to
180: believe that the constructed universal AI system is the best one
181: in a sense to be clarified in the sequel, i.e.\ that it is the most
182: intelligent environment-independent system possible.
183: The intention of this work is to introduce the universal AI model
184: and give an in breadth analysis. Most arguments and proofs are
185: succinct and require slow reading or some additional pencil
186: work.
187: %Several topics would deserve an in depth analysis,
188: %but is deferred to future publications.
189: 
190: %------------------------------%
191: \paragraph{Contents:}
192: %------------------------------%
193: {\it Section \ref{secAIfunc}:} The general framework for AI might
194: be viewed as the design and study of intelligent agents
195: \cite{Rus95}. An agent is a cybernetic system with some internal
196: state, which acts with output $y_k$ to some environment in cycle $k$,
197: perceives some input $x_k$ from the environment and updates its
198: internal state. Then the next cycle follows. It operates according
199: to some function $p$. We split the input $x_k$ into a regular part
200: $x'_k$ and a credit $c_k$, often called reinforcement feedback.
201: From time to time the environment provides non-zero credit to the
202: system. The task of the system is to maximize its utility, defined
203: as the sum of future credits. A probabilistic environment is a
204: probability distribution $\mu(q)$ over deterministic environments
205: $q$. Most, if not all environments are of this type. We give a
206: formal expression for the function $p^\best$, which maximizes in
207: every cycle the total $\mu$ expected future credit. This model is
208: called the AI$\mu$ model. As every AI problem can be brought into
209: this form, the problem of maximizing utility is hence
210: formally solved if $\mu$ is known. There is nothing remarkable or
211: new here; it is the essence of sequential decision theory
212: \cite{Che85,Pea88,Neu44}. Notation and formulas needed in
213: later sections are simply developed. There are two major remaining
214: problems. The problem of the unknown true prior probability $\mu$
215: is solved in section \ref{secAIxi}. Computational aspects are
216: addressed in section \ref{secTime}.
217: 
218: {\it Section \ref{secAImurec}:} Instead of talking about
219: probability distributions $\mu(q)$ over functions, one could
220: describe the environment by the conditional probability of
221: providing inputs $x_1...x_n$ to the system under the condition
222: that the system outputs $y_1...y_n$. The definition of the optimal
223: $p^\best$ system in this iterative form is shown to be equivalent
224: to the previous functional form. The functional form is more
225: elegant and will be used to define an intelligence order relation
226: and the time-bounded model in section \ref{secTime}. The iterative
227: form is more index intensive but more suitable for explicit
228: calculations and is used in most of the other sections. Further,
229: we introduce factorizable probability distributions.
230: 
231: {\it Section \ref{secAIxi}:} A special topic is the theory of
232: induction. In which sense prediction of the future is possible at
233: all is best summarized by the theory of Solomonoff. Given the
234: initial binary sequence $x_1...x_k$, what is the probability of
235: the next bit being $1$? It can be fairly well predicted by using a
236: universal probability distribution $\xi$ invented and shown to
237: converge to the true prior probability $\mu$ by Solomonoff
238: \cite{Sol64,Sol78} as long as $\mu$ (which need not be known!) is
239: computable. The problem of unknown $\mu$ is hence solved for
240: induction problems. All AI problems where the system's output does
241: not influence the environment, i.e.\ all passive systems, are of
242: this inductive form. Besides sequence prediction (SP),
243: classification (CF)
244: is also of this type. Active systems, like game playing (SG) and
245: optimization (FM), cannot be reduced to induction systems. The {\bf
246: main idea of this work} is to generalize universal induction to
247: the general cybernetic model described in sections \ref{secAIfunc}
248: and \ref{secAImurec}. For this, we generalize $\xi$ to include
249: conditions and replace $\mu$ by $\xi$ in the rational agent model. In this
250: way the problem that the true prior probability $\mu$ is usually
251: unknown is solved. Universality of $\xi$ and convergence of
252: $\xi\!\to\!\mu$ will be shown. These are strong arguments for the
253: optimality of the resulting AI$\xi$ model. There are certain
254: difficulties in proving rigorously that and in which sense it is
255: optimal, i.e. the most intelligent system. Further, we introduce a
256: universal order relation for intelligence.
257: 
258: {\it Sections \ref{secSP}--\ref{secOther}} show how a number of
259: AI problem classes fit into the general AI$\xi$ model.
260: All these problems are formally solved by the AI$\xi$ model.
261: The solution is, however, only formal because
262: the AI$\xi$ model developed thus far is
263: uncomputable or, at best, approximable. These sections should support
264: the claim that every AI problem can be formulated (and hence
265: solved) within the AI$\xi$ model. For some classes we give
266: concrete examples to illuminate the
267: scope of the problem class. We first formulate each problem class
268: in its natural way (when $\mu^{\mbox{\tiny problem}}$ is known) and
269: then construct a formulation within the AI$\mu$ model and prove
270: its equivalence. We then consider the consequences of
271: replacing $\mu$ by $\xi$. The main goal is to understand why and
272: how the problems are solved by AI$\xi$. We only highlight special
273: aspects of each problem class. Sections
274: \ref{secSP}--\ref{secOther} together should give a better picture
275: of the AI$\xi$ model. We do not study every aspect for every
276: problem class. The sections might be read selectively. They are
277: not necessary to understand the remaining sections.
278: 
279: {\it Section \ref{secSP}:} Using the AI$\mu$ model for sequence
280: prediction (SP) is identical to Bayesian sequence prediction
281: SP$\Theta_\mu$. One might expect, when using the AI$\xi$ model for
282: sequence prediction, one would recover exactly the universal
283: sequence prediction scheme SP$\Theta_\xi$, as AI$\xi$ was a unification of the
284: AI$\mu$ model and the idea of universal probability $\xi$. Unfortunately
285: this is not the case. One reason is that $\xi$ is only a
286: probability distribution in the inputs $x$ and not in the outputs
287: $y$. This is also one of the origins of the difficulty of proving error/credit
288: bounds for AI$\xi$. Nevertheless, we argue that AI$\xi$ is
289: equally well suited for sequence prediction as SP$\Theta_\xi$ is.
290: In a very limited setting we prove a (weak) error bound for
291: AI$\xi$ which gives hope that a general proof is attainable.
292: 
293: {\it Section \ref{secSG}:} A very important class of problems are
294: strategic games (SG). We restrict ourselves to deterministic strictly
295: competitive strategic games like chess. If the environment is a
296: minimax player, the AI$\mu$ model itself reduces to a minimax
297: strategy. Repeated games of fixed lengths are a special case for
298: factorizable $\mu$. The consequences of variable game length are
299: sketched. The AI$\xi$ model has to learn the rules of the game
300: under consideration, as it has no prior information about these
301: rules. We describe how AI$\xi$ actually learns these rules.
302: 
303: {\it Section \ref{secFM}:} There are many problems that fall into
304: the category `resource bounded function minimization' (FM). They
305: include the Traveling Salesman Problem, minimizing production
306: costs, inventing new materials or even producing, e.g., nice
307: paintings, which are (subjectively) judged by a human. The task is to
308: (approximately) minimize some function $f\!:\!Y\!\to\!Z$ within
309: a minimal number of function calls. We will see that a greedy model
310: trying to minimize $f$ in every cycle fails. Although the greedy
311: model has nothing to do with downhill or gradient techniques
312: (there is nothing like a gradient or direction for functions over
313: $Y$) which are known to fail, we discover the same difficulties.
314: FM already has nearly the full complexity of
315: general AI. The reason is that FM can actively influence the
316: information gathering process by its trials $y_k$ (whereas SP and
317: CF cannot). We discuss in detail the optimal FM$\mu$ model and
318: its inventiveness in choosing the $y\!\in\!Y$. A discussion of the subtleties when
319: using AI$\xi$ for function minimization follows.
320: 
321: {\it Section \ref{secEX}:} Reinforcement learning, as performed by the
322: AI$\xi$ model, is an important learning technique but not the only one.
323: To improve the speed of learning, supervised learning, i.e.
324: learning by acquiring knowledge, or learning from a constructive
325: teacher, is necessary. We show how AI$\xi$ learns to learn
326: in a supervised manner. It actually establishes supervised learning very
327: quickly, within $O(1)$ cycles.
328: 
329: {\it Section \ref{secOther}} gives a brief survey of other general
330: aspects, ideas and methods in AI, and their connection to the
331: AI$\xi$ model. Some aspects are directly included, others are or
332: should be emergent.
333: 
334: {\it Section \ref{secTime}:} Up to now we have shown the universal
335: character of the AI$\xi$ model but have completely ignored
336: computational aspects. Let us assume that there exists some
337: algorithm $\tilde p$ of size $\tilde l$ with computation time per
338: cycle $\tilde t$, which behaves in a sufficiently intelligent way
339: (this assumption is the very basis of AI). The
340: algorithm $p^\best$ should run all algorithms of length
341: $\leq\!\tilde l$ for $\tilde t$ time steps in every cycle and select the best
342: output among them. So we have an algorithm which runs in time
343: $\tilde t\!\cdot\!2^{\tilde l}$ and is at least as good as $\tilde
344: p$, i.e.\ it also serves our needs apart from the (very large
345: but) constant multiplicative factor in computation time. This idea
346: of the `typing monkeys', one of them eventually producing `Shakespeare', is
347: well known and widely used in theoretical computer science. The
348: difficult part is the selection of the algorithm with the best
349: output. A further complication is that the selection process
350: itself must have only limited computation time. We present a
351: suitable modification of the AI$\xi$ model which solves these
352: difficult problems. The solution is somewhat involved from an
353: implementational aspect. An implementation would include first
354: order logic, the definition of a Universal Turing machine within
355: it and proof theory. The assumptions behind this construction are
356: discussed at the end.
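
The running time of executing all short programs can be checked by a simple count. As a rough sketch, assume (for illustration only) that programs are binary strings. There are then $\sum_{i=0}^{\tilde l}2^i=2^{\tilde l+1}-1$ programs of length $\leq\!\tilde l$, and running each of them for $\tilde t$ steps takes at most
\beqn
  \tilde t\cdot(2^{\tilde l+1}-1) \;<\; 2\,\tilde t\cdot 2^{\tilde l}
\eeqn
steps per cycle, i.e.\ time of order $\tilde t\!\cdot\!2^{\tilde l}$. This count does not include the cost of selecting the best output.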
357: 
358: {\it Section \ref{secOutlook}} contains some discussion of
359: otherwise unmentioned topics and some (personal) remarks. It also
360: serves as an outlook to further research.
361: 
362: {\it Section \ref{secCon}} contains the conclusions.
363: 
364: %------------------------------%
365: \paragraph{History \& References:}
366: %------------------------------%
367: Kolmogorov65 \cite{Kol65} suggested defining the information
368: content of an object as the length of the shortest program
369: computing a representation of it. Solomonoff64 \cite{Sol64}
370: invented the closely related universal prior probability
371: distribution and used it for binary sequence prediction
372: \cite{Sol64,Sol78} and function inversion and minimization
373: \cite{Sol86}. Together with Chaitin66\&75 \cite{Cha66,Cha75} this
374: was the invention of what is now called Algorithmic Information
375: theory. For further literature and many applications see
376: \cite{LiVi93}. Other interesting 'applications' can be found in
377: \cite{Cha91,Sch99,Vov98}. Related topics are the Weighted Majority
378: Algorithm invented by Littlestone and Warmuth89 \cite{LiWa89},
379: universal forecasting by Vovk92 \cite{Vov92}, Levin search73
380: \cite{Lev73}, pac-learning introduced by Valiant84 \cite{Val84}
381: and Minimum Description Length \cite{LiVi92,Ris89}. Resource
382: bounded complexity is discussed in \cite{Dal73,Fed92,Ko86,Pin97},
383: resource bounded universal probability in \cite{LiVi91,LiVi93}.
384: Implementations are rare \cite{Con97,Sch95,Sch96}. Excellent
385: reviews with a philosophical touch are \cite{LiVi92a,Sol97}. For
386: an older, but general review of inductive inference see Angluin83
387: \cite{Ang83}. For an excellent introduction into algorithmic
388: information theory, further literature and many applications one
389: should consult the book of Li and Vit\'anyi97 \cite{LiVi93}. The
390: survey \cite{LiVi92} or the chapters 4 and 5 of \cite{LiVi93}
391: should be sufficient to follow the arguments and proofs
392: in this paper. %%%%%%%%%%%%%%%
393: The other ingredient in our AI$\xi$ model is sequential decision theory. We
394: do not need much more than the maximum expected utility principle
395: and the expectimax algorithm \cite{Mic66,Rus95}. The book of von Neumann and
396: Morgenstern44 \cite{Neu44} might be seen as the initiation of
397: game theory, which already contains the expectimax algorithm
398: as a special case. The literature on decision theory is
399: vast and we only give two possibly interesting references with
400: regard to this paper. Cheeseman85\&88 \cite{Che85} is a defense
401: of the use of probability theory in AI. Pearl88 \cite{Pea88} is a
402: good introduction and overview of probabilistic reasoning.
403: 
404: \newpage
405: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
406: \section{The AI$\mu$ Model in Functional Form}\label{secAIfunc}
407: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
408: 
409: %------------------------------%
410: \paragraph{The cybernetic or agent model:}
411: %------------------------------%
412: A good way to start thinking about intelligent systems is to
413: consider more generally cybernetic systems, in AI usually called
414: agents. This avoids having to struggle with the meaning of
415: intelligence from the very beginning. A cybernetic system is a
416: control circuit with input $x$, output $y$ and an internal
417: state. From an external input and the internal state the system
418: calculates deterministically or stochastically an output. This
419: output (action) modifies the environment and leads to a new input
420: (reception). This continues ad infinitum or for a finite number of
421: cycles. As explained in the last section, we need some credit
422: assignment to the cybernetic system. The input $x$ is divided into
423: two parts, the standard input $x'$ and some credit input $c$. If
424: input and output are represented by strings, a deterministic
425: cybernetic system can be modeled by a Turing machine $p$. $p$ is
426: called the policy of the agent, which determines the action in response to a
427: reception. If the environment is also computable, it might be modeled
428: by a Turing machine $q$ as well. The interaction of the agent
429: with the environment can be illustrated as follows:
430: 
431: \begin{center}\label{cyberpic}
432: %\input KCUnAI.pic
433: \special{em:linewidth 0.4pt}
434: \linethickness{0.4pt}
435: \begin{picture}(106,47)
436: \thinlines
437: \put(1,41){\framebox(10,6)[cc]{$c_1$}}
438: \put(11,41){\framebox(6,6)[cc]{$x'_1$}}
439: \put(17,41){\framebox(10,6)[cc]{$c_2$}}
440: \put(27,41){\framebox(6,6)[cc]{$x'_2$}}
441: \put(33,41){\framebox(10,6)[cc]{$c_3$}}
442: \put(43,41){\framebox(6,6)[cc]{$x'_3$}}
443: \put(49,41){\framebox(10,6)[cc]{$c_4$}}
444: \put(59,41){\framebox(6,6)[cc]{$x'_4$}}
445: \put(65,41){\framebox(10,6)[cc]{$c_5$}}
446: \put(75,41){\framebox(6,6)[cc]{$x'_5$}}
447: \put(81,41){\framebox(10,6)[cc]{$c_6$}}
448: \put(91,41){\framebox(6,6)[cc]{$x'_6$}}
449: \put(102,44){\makebox(0,0)[cc]{...}}
450: \put(1,1){\framebox(16,6)[cc]{$y_1$}}
451: \put(17,1){\framebox(16,6)[cc]{$y_2$}}
452: \put(33,1){\framebox(16,6)[cc]{$y_3$}}
453: \put(49,1){\framebox(16,6)[cc]{$y_4$}}
454: \put(65,1){\framebox(16,6)[cc]{$y_5$}}
455: \put(81,1){\framebox(16,6)[cc]{$y_6$}}
456: \put(102,4){\makebox(0,0)[cc]{...}}
457: \put(97,47){\line(1,0){9}}
458: \put(97,41){\line(1,0){9}}
459: \put(97,7){\line(1,0){9}}
460: \put(97,1){\line(0,0){0}}
461: \put(97,1){\line(1,0){9}}
462: \put(1,21){\framebox(16,6)[cc]{working}}
463: \thicklines
464: \put(17,17){\framebox(20,14)[cc]{$\displaystyle{System\atop\bf p}$}}
465: \thinlines
466: \put(37,27){\line(1,0){14}}
467: \put(37,21){\line(1,0){14}}
468: \put(39,24){\makebox(0,0)[lc]{tape ...}}
469: \put(56,21){\framebox(16,6)[cc]{working}}
470: \thicklines
471: \put(72,17){\framebox(20,14)[cc]{$\displaystyle{Environ-\atop ment\quad\bf q}$}}
472: \thinlines
473: \put(92,27){\line(1,0){14}}
474: \put(92,21){\line(1,0){14}}
475: \put(94,24){\makebox(0,0)[lc]{tape ...}}
476: \thicklines
477: \put(54,41){\vector(-3,-1){29}}
478: \put(84,31){\vector(-3,1){30}}
479: \put(54,7){\vector(3,1){30}}
480: \put(25,17){\vector(3,-1){29}}
481: \end{picture}
482: \end{center}
483: 
484: $p$ as well as $q$ have unidirectional input and output tapes and
485: bidirectional working tapes. What entangles the agent with the
486: environment, is the fact that the upper tape serves as input tape
487: for $p$, as well as output tape for $q$, and that the lower tape
488: serves as output tape for $p$ as well as input tape for $q$.
489: Further, the reading head must always be left of the writing head,
490: i.e. the symbols must first be written, before they are read. $p$
491: and $q$ have their own mutually inaccessible working tapes
492: containing their own 'secrets'. The heads move in the following
493: way. In the $k^{th}$ cycle $p$ writes $y_k$, $q$ reads $y_k$, $q$
494: writes $x_k\!\equiv\!c_kx_k'$, $p$ reads $x_k\!\equiv\!c_kx_k'$,
495: followed by the $(k+1)^{th}$ cycle and so on. The whole process
496: starts with the first cycle, with all heads at the start of their tapes and all working
497: tapes empty. We call Turing machines behaving in
498: this way {\it chronological Turing machines}, for obvious
499: reasons. Before continuing, some notation for strings is
500: appropriate.
501: 
502: %------------------------------%
503: \paragraph{Strings:}
504: %------------------------------%
505: We will denote strings over the alphabet $X$ by
506: $s\!=\!x_1x_2...x_n$, with $x_k\!\in\!X$, where $X$ is
507: alternatively interpreted as a non-empty subset of $I\!\!N$ or
508: itself as a prefix free set of binary strings.
509: $l(s)=l(x_1)\!+...+\!l(x_n)$ is the length of $s$. Analogous
510: definitions hold for $y_k\!\in\!Y$. We call $x_k$ the $k^{th}$
511: input word and $y_k$ the $k^{th}$ output word (rather than
512: letter). The string $s=y_1x_1...y_nx_n$ represents the
513: input/output in chronological order. Due to the prefix property of
514: the $x_k$ and $y_k$, $s$ can be uniquely separated into its words.
515: The words appearing in strings are always in chronological order.
516: We further introduce the following abbreviations: $\epsilon$ is the
517: empty string, $x_{n:m}:=x_nx_{n+1}...x_{m-1}x_m$ for $n\leq m$ and
518: $\epsilon$ for $n>m$. $x_{<n}:=x_1... x_{n-1}$. Analogously for $y$.
519: Further, $y\!x_n\!:=y_nx_n$, $y\!x_{n:m}\!:=\!y_nx_n...y_mx_m$,
520: and so on.
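
To fix the notation, here is a small example; the concrete indices and word lengths are chosen only for illustration. We have
\beqn
  x_{2:4}=x_2x_3x_4, \quad
  x_{3:2}=\epsilon, \quad
  x_{<3}=x_1x_2, \quad
  y\!x_{1:2}=y_1x_1y_2x_2 .
\eeqn
If, say, every input word has length $l(x_k)=3$, then $l(x_{2:4})=l(x_2)+l(x_3)+l(x_4)=9$.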
521: 
522: %------------------------------%
523: \paragraph{AI model for known deterministic environment:}
524: %------------------------------%
525: Let us define for the chronological Turing machine $p$ a partial
526: function also named $p\!:\!X^*\!\rightarrow\!Y^*$ with
527: $y_{1:k}=p(x_{<k})$ where $y_{1:k}$ is the output of Turing
528: machine $p$ on input $x_{<k}$ in cycle $k$, i.e.\ where $p$ has read
529: up to $x_{k-1}$ but no further. In an analogous way, we define
530: $q\!:\!Y^*\!\rightarrow\!X^*$ with $x_{1:k}=q(y_{1:k})$.
531: Conversely, for every partial recursive chronological function we
532: can define a corresponding chronological Turing machine. Each
533: (system, environment) pair $(p,q)$ produces a unique I/O sequence
534: $\omega(p,q):=y_1^{pq}x_1^{pq}y_2^{pq}x_2^{pq}...$. When we look
535: at the definition of $p$ and $q$ we see a nice symmetry between
536: the cybernetic system and the environment. Until now, not much
537: intelligence is in our system. Now the credit assignment comes
538: into the game and removes the symmetry somewhat. We split the
539: input $x_k\!\in\!X\!:=\!C\!\times\!X'$ into a regular part
540: $x_k'\!\in\!X'$ and a credit $c_k\!\in\!C\!\subset\!I\!\!R$. We
541: define $x_k\!\equiv\!c_kx_k'$ and $c_k\equiv c(x_k)$. The goal of
542: the system should be to maximize received credits. This is called
543: reinforcement learning. The reason for the asymmetry is, that
544: eventually we (humans) will be the environment with which the
545: system will communicate and {\it we} want to dictate what is good
546: and what is wrong, not the other way round. This one way learning,
547: the system learns from the environment, and not conversely,
548: neither prevents the system from becoming more intelligent than the
549: environment, nor does it prevent the environment learning from
550: the system because the environment can itself interpret the
551: outputs $y_k$ as a regular and a credit part. The environment is
552: just not forced to learn, whereas the system is. In cases where we
553: restrict the credit to two values
554: $c\!\in\!C\!=\!I\!\!B\!:=\!\{0,1\}$, $c\!=\!1$ is interpreted as a
555: positive feedback, called {\it good} or {\it correct} and
556: $c\!=\!0$ a negative feedback, called {\it bad} or {\it error} in
557: the following. Further, let us restrict for a while the lifetime
558: (number of cycles) $T$ of the system to a large, but finite value.
559: Let $C_{km}(p,q)\!:=\!\sum_{i=k}^mc(x_i^{pq})$ be the total credit the
560: system $p$ receives from the environment $q$ in the cycles $k$ to
561: $m$. It is now natural to call the system, which maximizes the
562: total credit $C_{1T}$, called utility, the {\it best} or {\it most intelligent}
563: one\footnote {$\maxarg_p C(p)$ is the $p$ which maximizes
564: $C(\cdot)$. If there is more than one maximum we might choose the
565: lexicographically smallest one for definiteness.}.
566: \beqn
567:  p^{\best,T,q}=\maxarg_p C_{1T}(p,q) \quad\Rightarrow\quad
568:  C_{kT}(p^{\best,T,q},q) \geq C_{kT}(p,q) \quad \forall p
569: \eeqn
570: For $k\!=\!1$ this is obvious and for $k\!>\!1$ easy to see.
571: If $T$, $Y$ and $X$ are finite, the number of different behaviours
572: of the system, i.e. the search space is finite. Therefore, because
573: we have assumed that $q$ is known, $p^{\best,T,q}$ can effectively
574: be determined (by pre-analyzing all behaviours). The main reason
575: for restricting to finite $T$ was not to ensure computability of
576: $p^{\best,T,q}$ but that the limit $T\!\to\infty$ might not exist.
577: This is nothing special, the (unrealistic) assumption of a
578: completely known deterministic environment $q$ has simply trivialized
579: everything.
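
As a rough illustration of the size of this search space (a sketch with
sizes chosen only for illustration, not taken from the text above): a
chronological system has to fix an output $y_k$ for every possible input
history $x_{<k}\!\in\!X^{k-1}$, so the number of different behaviours up to
cycle $T$ is
$$
  \prod_{k=1}^{T}|Y|^{|X|^{k-1}} \;=\; |Y|^{\,1+|X|+...+|X|^{T-1}} .
$$
For $|X|\!=\!|Y|\!=\!2$ and $T\!=\!3$ this gives $2^{1+2+4}\!=\!128$ behaviours.
Against one fixed deterministic $q$ only the output sequence actually
produced matters, so at most $|Y|^T\!=\!8$ candidates have to be compared.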
580: %------------------------------%
581: \paragraph{AI model for known prior probability:}
582: %------------------------------%
583: Let us now weaken our assumptions by replacing the environment $q$
584: with a probability distribution $\mu(q)$ over chronological functions.
$\mu$ might be interpreted
in two ways. Either the environment itself behaves in a
probabilistic way defined by $\mu$, or the true environment is
deterministic, but we only have probabilistic information about which
environment is the true one. Combinations of
both cases are also possible. The interpretation does not matter in the
following. We just assume that we know $\mu$ but nothing more
about the environment, whatever the interpretation may be.
593: 
594: Let us assume we are in cycle $k$ with history
595: $\hh y\!\hh x_1...\hh y\!\hh x_{k-1}$
596: and ask for the {\it best} output $y_k$.
597: Further, let
598: $\hh Q_k\!:=\!\{q:q(\hh y_{<k})=\hh x_{<k}\}$
599: be the set of all environments producing the above history.
600: The expected credit
601: for the next $m\!-\!k\!+\!1$ cycles (given the above history) is
602: given by a conditional probability:
603: \beq\label{eefunc}
604:   C^\mu_{km}(p|\hh y\!\hh x_{<k}) \;:=\;
605:   { \sum_{q\in \hh Q_k} \mu(q)C_{km}(p,q) \over
606:     \sum_{q\in \hh Q_k} \mu(q) }.
607: \eeq
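
For illustration, consider a minimal example with made-up numbers (the
environments $q_1,q_2$ and all values are assumptions, not part of the
model). Suppose only two environments are consistent with the history,
$\hh Q_k\!=\!\{q_1,q_2\}$, with $\mu(q_1)\!=\!0.3$ and $\mu(q_2)\!=\!0.1$,
and that some system $p$ would collect $C_{km}(p,q_1)\!=\!2$ and
$C_{km}(p,q_2)\!=\!0$. Then (\ref{eefunc}) gives
$$
  C^\mu_{km}(p|\hh y\!\hh x_{<k}) \;=\;
  { 0.3\cdot 2 + 0.1\cdot 0 \over 0.3+0.1 } \;=\; 1.5 .
$$
The remaining probability mass $0.6$ belongs to environments that
have already been ruled out by the history and does not enter.
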
608: We cannot simply determine $\maxarg_p(C_{1T})$ unlike the
609: deterministic case because the history is no longer
610: deterministically determined by $p$ and $q$, but depends on $p$
611: and $\mu$ {\it and} on the outcome of a stochastic process.
612: Every new cycle adds new information ($\hh x_i$) to the
613: system. This is indicated by the dots over the symbols.
614: In cycle $k$ we have to maximize the expected future
615: credit, taking into account the information in the history $\hh
616: y\!\hh x_{<k}$. This information is not already present
617: in $p$ and $q/\mu$ at the system's start unlike in the deterministic
618: case.
619: 
620: Further, we want to generalize the finite lifetime $T$ to a
621: dynamical (computable) farsightedness
622: $h_k\!\equiv\!m_k\!-\!k\!+\!1\!\geq\!1$, called horizon in the
623: following. For $m_k\!=\!T$ we have our original finite lifetime,
624: for $m_k\!=\!k\!+\!m\!-\!1$ the system maximizes in every cycle the next
625: $m$ expected credits. A discussion of the choices $m_k$ is delayed
626: to section \ref{secAIxi}.
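
As a small numerical illustration (the numbers are chosen arbitrarily):
for a fixed lifetime $m_k\!=\!T\!=\!10$ the horizon shrinks as the end
approaches, e.g. $h_7\!=\!10\!-\!7\!+\!1\!=\!4$ and $h_{10}\!=\!1$, whereas for
a moving horizon $m_k\!=\!k\!+\!m\!-\!1$ with $m\!=\!3$ one gets
$h_k\!=\!(k\!+\!3\!-\!1)\!-\!k\!+\!1\!=\!3$ in every cycle.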
627: 
628: The next $h_k$ credits are maximized by
629: $$
630:   p_k^\best \;:=\; \maxarg_{p\in \hh P_k} C^\mu_{km_k}(p|\hh y\!\hh
631:   x_{<k}),
632: $$
633: where $\hh P_k\!:=\!\{p:p(\hh x_{<k})=\hh y_{<k}*\}$ is the set of
634: systems consistent with the current history.
635: $p_k^\best$ depends on $k$ and is used only in step $k$ to
636: determine $\hh y_k$ by
637: $ p_k^\best(\hh x_{<k};\hh y_{<k})\!=\!\hh y_{<k}\hh y_k$.
638: After writing $\hh y_k$ the environment replies with $\hh x_k$
639: with (conditional) probability $\mu(\hh Q_{k+1})/\mu(\hh Q_k)$. This
640: probabilistic outcome provides new information to the system.
641: The cycle $k\!+\!1$ starts with determining $\hh y_{k+1}$ from
$p_{k+1}^\best$ (which differs from $p_k^\best$ as $\hh x_k$ is
643: now fixed) and so on. Note that $p_k^\best$ depends also on
644: $\hh y_{<k}$ because $\hh P_k$ and $\hh Q_k$ do so.
645: But recursively inserting $p_{k-1}^\best$ and
646: so on, we can define
647: \beq\label{pbestfunc}
648:   p^\best(\hh x_{<k}) \;:=\;
  p_k^\best(\hh x_{<k};p_{k-1}^\best(\hh x_{<k-1}...p_1^\best))
650: \eeq
651: It is a chronological function and computable if $X$, $Y$ and $m_k$ are
652: finite. The policy $p^\best$ defines our AI$\mu$ model.
653: For deterministic\footnote{We call a probability distribution deterministic
654: if it is 1 for exactly one argument and 0 for all others.}
655: $\mu$ this model reduces to the deterministic case.
656: 
657: It is important to maximize the sum of future credits and not, for instance,
658: to be greedy and only maximize the next credit, as is done e.g. in
659: sequence prediction. For example, let the environment be a
660: sequence of chess games and each cycle corresponds to one move.
661: Only at the end of each game a positive credit $c\!=\!1$ is given
662: to the system if it won the game (and made no illegal move).
For the system, maximizing all future credits means trying to win as
many games as possible in as short a time as possible (and avoiding illegal
moves). The same performance is reached if we choose
$m_k\!=\!k\!+\!m$ with $m$ much larger than the typical game
length. A system maximizing only the next credit would play chess
very badly. Even if we made our credit $c$ finer,
e.g. by evaluating the number of chessmen, the system would still play
very bad chess for $m\!=\!1$.
671: 
672: The AI$\mu$ model still depends on $\mu$ and $m_k$. $m_k$ is addressed
673: in section \ref{secAIxi}. To get our
674: final universal AI model the idea is to replace $\mu$ by the
675: universal probability $\xi$, defined later. This is motivated
676: by the fact that $\xi\!\to\!\mu$ in a certain sense for any $\mu$.
677: With $\xi$ instead of $\mu$ our model no longer depends on any
678: parameters, so it is truly universal. It remains to show that it
679: produces intelligent outputs. But let us continue step by step. In
680: the next section we develop an alternative but equivalent
681: formulation of the AI model given above. Whereas the functional
682: form is more suitable for theoretical considerations, especially
683: for the development of a timebounded version in section
684: \ref{secTime}, the iterative formulation of the next section will
685: be more appropriate for the explicit calculations in most of the
686: other sections.
687: 
688: \newpage
689: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
690: \section{The AI$\mu$ Model in Recursive and Iterative Form}\label{secAImurec}
691: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
692: 
693: %------------------------------%
694: \paragraph{Probability distributions:}
695: %------------------------------%
696: Throughout the paper we deal with sequences/strings and
697: conditional probability distributions on strings. Some
698: notations are therefore appropriate.
699: 
700: We use Greek letters for probability distributions and underline their
701: arguments to indicate that they are probability arguments. Let
702: $\rho_n(\pb x_1...\pb x_n)$ be the probability that a string starts with
703: $x_1...x_n$. We only consider sufficiently long strings, so the
704: $\rho_n$ are normalized to 1. Moreover, we drop the index on $\rho$
705: if it is clear from its arguments:
706: \beq\label{prop}
707:   \sum_{x_n\in X}\rho(\pb x_{1:n}) \equiv
708:   \sum_{x_n}\rho_n(\pb x_{1:n}) =
709:   \rho_{n-1}(\pb x_{<n}) \equiv
710:   \rho(\pb x_{<n})
711:   ,\quad
712:   \rho(\epsilon) \equiv \rho_0(\epsilon)=1.
713: \eeq
714: We also need conditional probabilities derived from Bayes' rule.
715: We prefer a notation which preserves the chronological order of the words, in
716: contrast to the standard notation $\rho(\cdot|\cdot)$ which flips it. We extend the
717: definition of $\rho$ to the conditional case with
718: the following convention for its arguments: An underlined argument
719: $\pb x_k$ is a probability variable and other non-underlined
720: arguments $x_k$ represent conditions. With this convention, Bayes'
721: rule has the form $\rho(x_{<n}\pb x_n)\!=\!\rho(\pb x_{1:n})/\rho(\pb
722: x_{<n})$.
723: The equation states that the probability that a string
724: $x_1...x_{n-1}$ is followed by $x_n$ is equal to the probability
725: of $x_1...x_n*$ divided by the probability of
726: $x_1...x_{n-1}*$. We use $x*$ as a shortcut for 'strings
727: starting with $x$'.
728: 
729: The introduced notation is also suitable for defining the
730: conditional probability $\rho(y_1\pb x_1...y_n\pb x_n)$ that the
731: environment reacts with $x_1...x_n$ under the condition that the
732: output of the system is $y_1...y_n$.
733: The environment is chronological, i.e. input $x_i$ depends on
734: $y\!x_{<i}y_i$ only. In the probabilistic case this means that
735: $\rho(y\!\pb x_{<k}y_k)\!:=\!\sum_{x_k}\rho(y\!\pb x_{1:k})$
is independent of $y_k$, hence a trailing $y_k$ in the arguments of $\rho$
737: can be dropped. Probability distributions with this
738: property will be called {\it chronological}.
739: The $y$ are always
740: conditions, i.e.\ never underlined, whereas additional
741: conditioning for the $x$ can be obtained with Bayes' rule
742: \bqa\label{bayes2}
743:   \rho(y\!x_{<n}y\!\pb x_n) =
744:   \rho(y\!\pb x_{1:n})/\rho(y\!\pb x_{<n}) \quad\mbox{and}
745:   \\[4mm]
746:   \rho(y\!\pb x_{1:n}) \;=\;
747:   \rho(y\!\pb x_1)\!\cdot\!\rho(y\!x_1y\!\pb x_2)\!\cdot...\cdot\!
748:   \rho(y\!x_{<n}y\!\pb x_n)
749: \eqa
750: The second equation is the first equation applied $n$ times.
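
A quick numerical check of (\ref{bayes2}) for $n\!=\!2$, with
probabilities invented purely for illustration: if
$\rho(y\!\pb x_1)\!=\!{1\over 2}$ and $\rho(y\!\pb x_{1:2})\!=\!{1\over 8}$
for some fixed $y_{1:2}$ and $x_{1:2}$, then
$$
  \rho(y\!x_1y\!\pb x_2) \;=\;
  {\rho(y\!\pb x_{1:2}) \over \rho(y\!\pb x_1)} \;=\; {1/8 \over 1/2}
  \;=\; {1\over 4}
  \quad\mbox{and}\quad
  \rho(y\!\pb x_1)\cdot\rho(y\!x_1y\!\pb x_2) \;=\;
  {1\over 2}\cdot{1\over 4} \;=\; {1\over 8} \;=\; \rho(y\!\pb x_{1:2}).
$$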
751: 
752: %------------------------------%
753: \paragraph{Alternative Formulation of the AI$\mu$ Model:}
754: %------------------------------%
755: Let us define the AI$\mu$ model $p^\best$ in a different way. In the
756: next subsection we will show that the $p^\best$ model defined here
757: is identical to the functional definition of $p^\best$ given
758: in the last section.
759: 
760: Let $\mu(y\!\pb x_{1:k})$ be the true chronological prior probability
761: that the environment reacts with $x_{1:k}$ if provided with
762: actions $y_{1:k}$ from the system. We assume the cybernetic model depicted on page
763: \pageref{cyberpic} to be valid.
764: Next we define $C_{k+1,m}^\best(y\!x_{1:k})$ to be the $\mu$
765: expected credit sum in cycles $k\!+\!1$ to $m$ with outputs $y_i$
766: generated by system $p^\best$ and past responses $x_i$ from the
767: environment. Adding $c(x_k)$ we get the credit including cycle
768: $k$. The probability of $x_k$,
given $y\!x_{<k}y_k$, is given by the conditional probability
770: $\mu(y\!x_{<k}y\!\pb x_k)$. So the expected credit sum
771: in cycles $k$ to $m$ given $y\!x_{<k}y_k$ is
772: \beq\label{ebesty}
773:   C_{km}^\best(y\!x_{<k}y_k) \;:=\;
774:   \sum_{x_k}[c(x_k)+C_{k+1,m}^\best(y\!x_{1:k})] \!\cdot\!
775:   \mu(y\!x_{<k}y\!\pb x_k)
776: \eeq
Now we ask how $p^\best$ chooses
$y_k$. It should choose $y_k$ so as to maximize the future credit.
So the expected credit in cycles $k$ to $m$ given
$y\!x_{<k}$ and $y_k$ chosen by $p^\best$ is
$ C_{km}^\best(y\!x_{<k})
\!:=\!\max_{y_k}C_{km}^\best(y\!x_{<k}y_k)$.
783: Together with the induction start
784: \beq\label{ee0}
785:   C_{m+1,m}^\best(y\!x_{1:m}) \;:=\; 0
786: \eeq
$C_{km}^\best$ is completely defined.
788: We might summarize one cycle into the formula
789: \beq\label{airec2}
790:   C_{km}^\best(y\!x_{<k}) \;=\;
791:   \max_{y_k}\sum_{x_k}
792:   [c(x_k)+C_{k+1,m}^\best(y\!x_{1:k})] \!\cdot\!
793:   \mu(y\!x_{<k}y\!\pb x_k)
794: \eeq
795: If $m_k$ is our horizon function of $p^\best$ and
796: $\hh y\!\hh x_{<k}$ is the actual history in cycle
797: $k$, the output $\hh y_k$ of the system is explicitly given by
798: \beq\label{pbestrec}
799:   \hh y_k \;=\; \maxarg_{y_k}C_{km_k}^\best
800:   (\hh y\!\hh x_{<k}y_k) \;=:\;
801:   p^\best(\hh y\!\hh x_{<k})
802: \eeq
803: Then the environment responds $\hh x_k$ with
804: probability $\mu(\hh y\!\hh x_{<k}\hh y\!\pb{\hh
805: x}_k)$. Then cycle $k\!+\!1$ starts. We might
unfold the recursion (\ref{airec2}) further and give $\hh y_k$
non-recursively as
808: \beq\label{ydotrec}
809:   \hh y_k \;=\;
810:   \maxarg_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}\;...\;
811:   \max_{y_{m_k}}\sum_{x_{m_k}}
812:   (c(x_k)\!+...+\!c(x_{m_k})) \!\cdot\!
813:   \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m_k})
814: \eeq
This has a direct interpretation: the probability of inputs
$x_{k:m_k}$ in cycles $k$ to $m_k$ when the system outputs $y_{k:m_k}$ and
the actual history is $\hh y\!\hh x_{<k}$ is $\mu(\hh y\!\hh
x_{<k}y\!\pb x_{k:m_k})$. The future credit in this case is
$c(x_k)\!+...+\!c(x_{m_k})$. The best expected credit is obtained
by averaging over the $x_i$ ($\sum_{x_i}$) and maximizing over the $y_i$
($\max_{y_i}$).
This has to be done in chronological order to correctly
incorporate the dependency of $x_i$ and $y_i$ on the history.
823: This is essentially the expectimax algorithm/sequence
824: \cite{Mic66,Rus95}. The AI$\mu$ model is {\it optimal} in the
825: sense that no other policy leads to higher expected credit.
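
The difference between a greedy and a farsighted system can be seen in a
minimal two-cycle instance of (\ref{airec2}). All numbers below are
assumptions made up for illustration. Let $Y\!=\!\{a,b\}$, $X\!=\!I\!\!B$,
$c(x)\!=\!x$, $k\!=\!1$ and $m\!=\!2$. In cycle 1 let the probability of
$x_1\!=\!1$ be $0.6$ for $y_1\!=\!a$ and $0.5$ for $y_1\!=\!b$. In cycle 2
let the probability of $x_2\!=\!1$ be independent of $x_1$: it is $0.1$
after $y_1\!=\!a$ (for either $y_2$), while after $y_1\!=\!b$ it is $0.9$ for
$y_2\!=\!a$ and $0.2$ for $y_2\!=\!b$. Starting from (\ref{ee0}), and
using that the expected value of $c(x_1)$ is the probability of
$x_1\!=\!1$, the recursion gives
$$
\begin{array}{lcl}
  C_{22}^\best(y\!x_1) & = & \max\{0.1,\,0.1\}=0.1 \quad\mbox{for } y_1\!=\!a,
  \qquad \max\{0.9,\,0.2\}=0.9 \quad\mbox{for } y_1\!=\!b, \\[2mm]
  C_{12}^\best(y_1\!=\!a) & = & 0.6+0.1 \;=\; 0.7, \\[2mm]
  C_{12}^\best(y_1\!=\!b) & = & 0.5+0.9 \;=\; 1.4 .
\end{array}
$$
Hence $\hh y_1\!=\!b$ for $m_1\!=\!2$, whereas a greedy system with
$m_1\!=\!1$ compares only $0.6$ with $0.5$ and outputs $a$.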
826: 
The explicit and the recursive definitions of the AI$\mu$ model
are more index-intensive than the functional form but
more suitable for explicit calculations.
830: 
831: %------------------------------%
832: \paragraph{Equivalence of Functional and Iterative AI model:}
833: %------------------------------%
834: The iterative environmental probability $\mu$ is given by the
835: functional form in the following way,
836: \beq\label{mufr}
837:   \mu(y\!\pb x_{1:k}) \;=\;
838:   \nq\sum_{q:q(y_{1:k})=x_{1:k}}\nq \mu(q)
839: \eeq
840: as is easy to see. We will prove the equivalence of
841: (\ref{pbestfunc}) and (\ref{pbestrec}) only for $k\!=\!2$ and
$m_2\!=\!3$. The proof of the general case is completely analogous, except
that the notation becomes quite messy.
844: 
845: Let us first evaluate (\ref{eefunc}) for fixed $\hh
846: y_1\hh x_1$ and some $p\!\in\!\hh P_2$, i.e. $p(\hh
847: x_1)=\hh y_1y_2$ for some $y_2$. If the next input to the
848: system is $x_2$, $p$ will respond with $p(\hh x_1
849: x_2)=\hh y_1y_2y_3$ for some $y_3$ depending on $x_2$. We
850: write $y_3(x_2)$ in the following\footnote{Dependency on dotted
851: words like $\hh x_1$ is not shown as the dotted words are fixed.}.
852: The numerator of (\ref{eefunc}) simplifies to
853: \beqn
854:   \sum_{q\in \hh Q_2} \mu(q)C_{23}(p,q) \;=\;
855:   \nq\sum_{q:q(\hh y_1)=\hh x_1}\nq \mu(q)C_{23}(p,q)
856:   \;=\; \sum_{x_2x_3}(c(x_2)\!+\!c(x_3))
857:   \nq\nq\sum_{q:q(\hh y_1y_2y_3(x_2))=\hh x_1x_2x_3}\nq\nq
858:   \mu(q) \;=\;
859: \eeqn
860: \beqn
861:   \;=\; \sum_{x_2x_3}(c(x_2)\!+\!c(x_3)) \!\cdot\!
862:   \mu(\hh y_1\pb{\hh x}_1y_2\pb x_2y_3(x_2)\pb x_3)
863: \eeqn
864: In the first equality we inserted the definition of $\hh Q_2$. In
865: the second equality we split the sum over $q$ by first summing
866: over $q$ with fixed $x_2x_3$. This allows us to pull
867: $C_{23}\!=c(x_2)\!+\!c(x_3)$ out of the inner sum. Then we sum
868: over $x_2x_3$. Further, we have inserted $p$, i.e. replaced $p$
869: by $y_2$ and $y_3(\cdot)$. In the last equality we used
870: (\ref{mufr}). The denominator reduces to
871: \beqn
872:   \sum_{q\in \hh Q_2} \mu(q) \;=\;
873:   \nq\sum_{q:q(\hh y_1)=\hh x_1}\nq \mu(q)
874:   \;=\; \mu(\hh y_1\pb{\hh x}_1).
875: \eeqn
876: For the quotient we get
877: $$
878:   C_{23}(p|\hh y_1\hh x_1) \;=\;
879:   \sum_{x_2x_3}(c(x_2)\!+\!c(x_3))\!\cdot\!
880:   \mu(\hh y_1\hh x_1
881:       y_2\pb x_2y_3(x_2)\pb x_3)
882: $$
883: We have seen that the relevant behaviour of $p\!\in\!\hh P_2$ in cycle 2 and 3
884: is completely determined by $y_2$ and the function $y_3(\cdot)$
885: $$
886:   \max_{p\in\hh P_2}C_{23}(p|\hh y_1\hh x_1) \;=\;
887:   \max_{y_2}\max_{y_3(\cdot)}\sum_{x_2x_3}(c(x_2)\!+\!c(x_3))\!\cdot\!
  \mu(\hh y_1\hh x_1y_2\pb x_2y_3(x_2)\pb x_3) \;=\;
889: $$
890: $$
891:   \;=\;
892:   \max_{y_2}\sum_{x_2}\max_{y_3}\sum_{x_3}(c(x_2)\!+\!c(x_3))\!\cdot\!
893:   \mu(\hh y_1\hh x_1y_2\pb x_2y_3\pb x_3)
894: $$
In the last equality we have used the fact that the functional
maximization over $y_3(\cdot)$ reduces to a simple maximization
over the word $y_3$ when interchanging with the sum over its
argument
$(\max_{y_3(\cdot)}\sum_{x_2}\equiv\sum_{x_2}\max_{y_3})$.
900: In the functional case $\hh y_2$ is therefore determined by
901: $$
902:   \hh y_2 \;=\;
903:   \maxarg_{y_2}\sum_{x_2}\max_{y_3}\sum_{x_3}(c(x_2)\!+\!c(x_3))\!\cdot\!
904:   \mu(\hh y_1\hh x_1y_2\pb x_2y_3\pb x_3)
905: $$
906: This is identical to the iterative definition (\ref{ydotrec}) with
907: $k\!=\!2$ and $m_2\!=\!3$ $\qed$.
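
The interchange of the functional maximization with the sum, used in the
last step, can be checked on a small example (the table of values is an
arbitrary assumption). Let $f(x_2,y_3)$ stand for the inner sum over $x_3$,
with $x_2\!\in\!\{0,1\}$, $y_3\!\in\!\{a,b\}$ and
$f(0,a)\!=\!1$, $f(0,b)\!=\!3$, $f(1,a)\!=\!4$, $f(1,b)\!=\!2$.
The four functions $y_3(\cdot)$ yield the sums
$f(0,y_3(0))\!+\!f(1,y_3(1))\in\{1\!+\!4,\;1\!+\!2,\;3\!+\!4,\;3\!+\!2\}$, so
$$
  \max_{y_3(\cdot)}\sum_{x_2}f(x_2,y_3(x_2)) \;=\; 7 \;=\; 3+4 \;=\;
  \sum_{x_2}\max_{y_3}f(x_2,y_3).
$$
The maximum is reached because $y_3(x_2)$ can be chosen separately for
each $x_2$.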
908: 
909: %------------------------------%
910: \paragraph{Factorizable $\mu$:}
911: %------------------------------%
912: Up to now we have made no restrictions on the form of the prior
913: probability $\mu$ apart from being a chronological probability
914: distribution. On the other hand, we will see that, in order to
915: prove rigorous credit bounds, the prior probability must satisfy
916: some separability condition to be defined later. Here we introduce
917: some very strong form of separability, when $\mu$ factorizes into
918: products. We start with a
919: factorization into two factors. Let us assume that $\mu$ is of the
920: form
921: \beq\label{fac12}
922:   \mu(y\!\pb x_{1:n}) \;=\;
923:   \mu_1(y\!\pb x_{<l}) \cdot
924:   \mu_2(y\!\pb x_{l:n})
925: \eeq
926: for some fixed $l$ and sufficiently large $n\!\geq\!m_k$.
927: For this $\mu$ the output $\hh y_k$ in cycle
928: $k$ of the AI$\mu$ system (\ref{ydotrec}) for $k\!\geq\!l$ depends on
929: $\hh y\!\hh x_{l:k-1}$ and $\mu_2$ only and
930: is independent of $\hh y\!\hh x_{<l}$
931: and $\mu_1$. This is easily seen when inserting
932: \beq\label{fac11}
933:   \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m_k}) =
934:   \underbrace{\mu_1(\hh y\!\hh x_{<l})}_{\equiv 1}
935:   \cdot
936:   \mu_2(\hh y\!\hh x_{l:k-1}y\!\pb x_{k:m_k})
937: \eeq
938: into (\ref{ydotrec}). For $k\!<\!l$ the output $\hh y_k$ depends
939: on $\hh y\!\hh x_{<k}$ (this is trivial) and $\mu_1$
940: only (trivial if $m_k\!<\!l$) and is independent of $\mu_2$.
941: The non-trivial case, where the horizon $m_k\!\geq\!l$ reaches
942: into the region $\mu_2$, can be proved as follows (we abbreviate
943: $m\!:=\!m_k$ in the following). Inserting (\ref{fac12}) into the
944: definition of $C_{lm}^\best(y\!x_{<l})$ the factor
945: $\mu_1$ is $1$ as in (\ref{fac11}). We abbreviate
946: $C_{lm}^\best\!:=\!C_{lm}^\best(y\!x_{<l})$ as
947: it is independent of its arguments. One can
948: decompose
949: \beq\label{decompE}
950:   C_{km}^\best(y\!x_{<k}) \;=\;
951:   C_{k,l-1}^\best(y\!x_{<k}) \;+\; C_{lm}^\best
952: \eeq
953: For $k\!=\!l$ this is true because the first term on the r.h.s.\ is
954: zero.
955: For $k\!<\!l$ we prove the decomposition by induction from $k\!+\!1$ to $k$.
956: \beqn
957:   C_{km}^\best(y\!x_{<k}) \;=\;
958:   \max_{y_k}\sum_{x_k}
959:   [c(x_k)+C_{k+1,l-1}^\best(y\!x_{1:k})+C_{lm}^\best] \!\cdot\!
960:   \mu_1(y\!x_{<k}y\!\pb x_k) \;=\;
961: \eeqn
962: \beqn
963:   \;=\; \max_{y_k}\bigg[\sum_{x_k}
  (c(x_k)+C_{k+1,l-1}^\best(y\!x_{1:k})) \!\cdot\!
965:   \mu_1(y\!x_{<k}y\!\pb x_k) + C_{lm}^\best\bigg]
966:    \;=\;
967: \eeqn
968: \beqn
969:   \;=\; C_{k,l-1}^\best(y\!x_{<k}) + C_{lm}^\best
970: \eeqn
Inserting (\ref{decompE}), valid for $k\!+\!1$ by the induction hypothesis,
972: into (\ref{airec2}) gives the first equality. In the second
973: equality we have performed the $x_k$ sum for the
974: $C_{lm}^\best\!\cdot\!\mu_1$ term which is now independent of $y_k$. It can
975: therefore be pulled out of $\max_{y_k}$. In the last
976: equality we used again the definition (\ref{airec2}). This completes
977: the induction step and proves
978: (\ref{decompE}) for $k\!<\!l$. $\hh y_k$ can now be represented
979: as
980: \beq
981:   \hh y_k \;=\; \maxarg_{y_k}C_{km}^\best
982:   (\hh y\!\hh x_{<k}y_k) \;=\;
983:   \maxarg_{y_k}C_{k,l-1}^\best(\hh y\!\hh x_{<k}y_k)
984: \eeq
where (\ref{pbestrec}), (\ref{decompE}) and the fact that
an additive constant $C_{lm}^\best$ does not change
$\maxarg_{y_k}$ have been used. $C_{k,l-1}^\best(\hh y\!\hh x_{<k}y_k)$ and
hence $\hh y_k$ are independent of $\mu_2$ for $k\!<\!l$. Note
that $\hh y_k$ is also independent of the choice of $m$, as
long as $m\!\geq\!l$.
991: 
In the general case the cycles are grouped into
independent episodes $r\!=\!0,1,2,...$, where each episode $r$
consists of the cycles $k\!=\!n_r\!+\!1,...,n_{r+1}$ for some
$0=n_0<n_1<...<n_s=n$:
996: \beq\label{facmu}
997:   \mu(y\!\pb x_{1:n}) \;=\;
998:   \prod_{r=0}^{s-1} \mu_r(y\!\pb x_{n_r+1:n_{r+1}})
999: \eeq
In the simplest case, when all episodes have the
same length $l$, we have $n_r=r\!\cdot\!l$. $\hh y_k$ depends on
$\mu_r$ and on the $x$ and $y$ of episode $r$ only, with $r$ such
that $n_r\!<\!k\!\leq\!n_{r+1}$.
1004: \beq\label{facydot}
1005:   \hh y_k =
1006:   \maxarg_{y_k}\sum_{x_k}...
1007:   \max_{y_t}\sum_{x_t}
1008:   (c(x_k)\!+...+\!c(x_t)) \!\cdot\!
1009:   \mu_r(\hh y\!\hh x_{n_r+1:k-1}y\!\pb x_{k:n_{r+1}}) \\[-3mm]
1010: \eeq
1011: with $t\!:=\!\min\{m_k,n_{r+1}\}$. The different episodes are
1012: completely independent in the following sense. The inputs $x_k$
1013: of different episodes are statistically independent and
1014: depend only on $y_k$ of the same episode. The outputs $y_k$ depend on the
1015: $x$ and $y$ of the corresponding episode $r$ only, and are
1016: independent of the actual I/O of the other episodes.
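
As a concrete illustration (numbers chosen arbitrarily), take episodes of
equal length $l\!=\!3$, i.e. $n_r\!=\!3r$. Cycle $k\!=\!5$ then belongs to
episode $r\!=\!1$, because $n_1\!=\!3<5\leq 6\!=\!n_2$. With $m_5\!=\!10$ we
get $t\!=\!\min\{10,6\}\!=\!6$, so (\ref{facydot}) reduces to
$$
  \hh y_5 \;=\; \maxarg_{y_5}\sum_{x_5}\max_{y_6}\sum_{x_6}
  (c(x_5)\!+\!c(x_6)) \!\cdot\!
  \mu_1(\hh y\!\hh x_4y\!\pb x_{5:6}),
$$
which involves only cycle 4 of the history and the factor $\mu_1$.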
1017: 
1018: If all episodes have a length of at most $l$, i.e.
1019: $n_{r+1}\!-\!n_r\!\leq\!l$ and if we choose the horizon
1020: $h_k$ to be at least $l$, then
1021: $m_k\!\geq\!k\!+\!l\!-\!1\!\geq\!n_r\!+\!l\!\geq\!n_{r+1}$ and
1022: hence $t=n_{r+1}$ independent of $m_k$. This means that for
1023: factorizable $\mu$ there is no problem in taking the limit
1024: $m_k\!\to\!\infty$. Maybe this limit can also be performed in the
1025: more general case of a separable $\mu$. The (problem of the)
1026: choice of $m_k$ will be discussed in more detail later.
1027: 
Although factorizable $\mu$ are too restrictive to cover all AI
problems, they often occur in practice in the form of repeated
problem solving and hence are worth studying. For example, if
1031: the system has to play games like chess repeatedly, or has to
1032: minimize different functions, the different games/functions might
1033: be completely independent, i.e. the environmental probability
1034: factorizes, where each factor corresponds to a game/function
1035: minimization. For details, see the appropriate sections on
1036: strategic games and function minimization.
1037: 
Further, for factorizable $\mu$ it is probably easier to derive
suitable credit bounds for the universal AI$\xi$ model defined in
the next section than for the general separable case, which will be
introduced later. One goal of this paragraph was to show that the
notion of a factorizable $\mu$ could be a first step toward a
definition and analysis of the general case of separable $\mu$.
1046: 
1047: %------------------------------%
1048: \paragraph{Constants and Limits:}
1049: %------------------------------%
We have in mind a universal system with complex
interactions that is at least as intelligent and complex as a human
being. One might think of a system whose input $x_k$ comes from a
digital video camera and whose output $y_k$ is some image sent to a
monitor\footnote{Humans can only simulate a screen as
output device by drawing pictures.}; only for the valuation we
might restrict ourselves to the most primitive binary one, i.e. $c_k\!\in I\!\!B$. So we think of the
following constant sizes:
1058: $$
1059: \begin{array}{ccccccccc}
1060:   1 & \ll & \langle l(y_kx_k)\rangle & \ll & k & \leq & T & \ll & |Y\times X| \\
1061:   1 & \ll & 2^{16} & \ll & 2^{24} & \le & 2^{32} & \ll & 2^{65536}
1062: \end{array}
1063: $$
1064: The first two limits say that the actual number $k$ of
1065: inputs/outputs should be reasonably large, compared to the typical
1066: size $\langle l\rangle$ of the input/output words, which itself
1067: should be rather sizeable. The last limit expresses the fact that
1068: the total lifetime $T$ (number of I/O cycles) of the system is far
1069: too small to allow every possible input to occur, or to try every
1070: possible output, or to make use of identically repeated
1071: inputs or outputs. We do not expect any useful outputs for
1072: $k\le\langle l\rangle$. More interesting than the lengths of the
1073: inputs is the complexity $K(x_1...x_k)$ of all inputs until now,
to be defined later. The environment is usually not ``perfect''. The
system could either interact with a non-perfect human or tackle a
non-deterministic world (due to quantum mechanics or
chaos)\footnote{Whether there exist stochastic processes at all is
a difficult question. At least the quantum indeterminacy comes
very close to it.}. In either case, the sequence contains some
1080: noise, leading to $K\sim \langle l\rangle\!\cdot\!k$. The
1081: complexity of the probability distribution of the input sequence
1082: is something different. We assume that this noisy world operates
1083: according to some simple computable, though not finite rules.
1084: $K(\mu_k)\ll \langle l\rangle\!\cdot\!k$, i.e. the rules of the
1085: world can be highly compressed. On the other hand, there may
1086: appear new aspects of the environment for $k\!\to\!\infty$ causing
1087: a non-bounded $K(\mu_k)$.
1088: 
1089: In the following we never use these limits, except when explicitly
1090: stated. In some simpler models and examples the size of the
1091: constants will even violate these limits (e.g. $l(x_k)=l(y_k)=1$),
1092: but it is the limits above that the reader should bear in mind. We are
1093: only interested in theorems which do not degenerate under the
1094: above limits.
1095: 
1096: %------------------------------%
1097: \paragraph{Sequential decision theory:}
1098: %------------------------------%
1099: In the following we clarify the connection of (\ref{airec2}) and
1100: (\ref{pbestrec}) to sequential decision theory and discuss similarities and
differences. Let $M^a_{ij}$ be the probability that the system under
consideration reaches (environmental) state $j\!\in\!S$ when
taking action $a\!\in\!A$ in the current state
$i\!\in\!S$. If the system receives reward $R(i)$ in state $i$,
1105: the optimal policy $p^*$, maximizing expected utility (defined as
1106: sum of future rewards), and the utility $U(i)$ of policy
1107: $p^*$ are
1108: \beq\label{dt}
1109:   p^*(i)=\maxarg_a\sum_j M^a_{ij}U(j) \quad,\quad
1110:   U(i)=R(i)+\max_a\sum_j M^a_{ij}U(j)
1111: \eeq
1112: See \cite{Rus95} for details and further references. Let us identify
1113: \bqan
1114:   S=(Y\!\times\!X)^*,\quad A=Y,\quad
1115:   a=y_k, \quad M^a_{ij}=\mu(y\!x_{<k}y\!\pb x_k), \\[4pt]
1116:   i=y\!x_{<k}, \quad R(i)=c(x_{k-1}), \quad
1117:   U(i)=C^*_{k-1,m}(y\!x_{<k})=c(x_{k-1})+C^*_{km}(y\!x_{<k}), \\[4pt]
1118:   j=y\!x_{1:k}, \quad R(j)=c(x_k), \quad
1119:   U(j)=C^*_{km}(y\!x_{1:k})=c(x_k)+C^*_{k+1,m}(y\!x_{1:k}),
1120: \eqan
1121: where we further set $M^a_{ij}\!=\!0$ if $i$ is not a starting
1122: substring of $j$ or if $a\!\neq\!y_k$. This ensures the sum over
1123: $j$ in (\ref{dt}) to reduce to a sum over $x_k$. If we set
1124: $m_k\!=\!m$ and use
1125: $C^*_{km}(y\!x_{<k}y_k)\!=\!\sum_{x_k}C^*_{km}(y\!x_{1:k})$ in
1126: (\ref{pbestrec}), it is easy to see that (\ref{dt}) coincides with
1127: (\ref{airec2}) and (\ref{pbestrec}).
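
A one-step numerical instance of (\ref{dt}), with invented values: let
the current state $i$ have $R(i)\!=\!0$ and two possible successors
$j_1,j_2$ with $U(j_1)\!=\!1$ and $U(j_2)\!=\!0$. If action $a$ leads to
$j_1$ with probability $0.5$ and action $b$ leads to $j_1$ with
probability $0.8$ (and to $j_2$ otherwise), then
$$
  U(i) \;=\; 0+\max\{0.5\cdot 1+0.5\cdot 0,\; 0.8\cdot 1+0.2\cdot 0\}
  \;=\; 0.8, \qquad p^*(i)=b.
$$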
1128: 
Note that despite this formal equivalence, we were forced to use
the complete history $y\!x_{<k}$ as environmental state $i$. The
AI$\mu$ model assumes neither stationarity, nor the Markov property,
nor complete accessibility of the environment, as any such assumption
would restrict the applicability of AI$\mu$. The consequence is
1134: that every state occurs at most once in the lifetime of the
1135: system. Every moment in the universe is unique! Even if the state
1136: space could be identified with the input space $X$, inputs would
1137: usually not occur twice by assumption $k\!\ll\!|X|$, made in the
1138: last subsection. Further, there is no (obvious) universal
1139: similarity relation on $(X\!\times\!Y)^*$ allowing an effective
1140: reduction of the size of the state space. Although many algorithms
1141: (e.g. value and policy iteration) have problems in solving
1142: (\ref{dt}) for huge or infinite state spaces in practice,
there is no problem in principle in determining $p^*$ and $U$, as
1144: long as $\mu$ is known and $|X|$, $|Y|$ and $m$ are finite.
1145: 
1146: Things dramatically change if $\mu$ is unknown. Reinforcement
1147: learning algorithms \cite{Kae96} are commonly used in this case to
1148: learn the unknown $\mu$. They succeed if the state space is either
small or has effectively been made small by so-called generalization
1150: techniques. In any case, the solutions are either ad hoc, or work
1151: in restricted domains only, or have serious problems with state
1152: space exploration versus exploitation, or have non-optimal
1153: learning rate. There is no universal and optimal solution to this
1154: problem so far. In the next section we present a new model and
1155: argue that it formally solves all these problems in an optimal
way. It is not concerned with learning $\mu$ directly. All we
1157: do is to replace the true prior probability $\mu$ by a universal
1158: probability $\xi$, which is shown to converge to $\mu$ in a sense.
1159: 
1160: \newpage
1161: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1162: \section{The Universal AI$\xi$ Model}\label{secAIxi}
1163: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1164: 
1165: %------------------------------%
1166: \paragraph{Induction and Algorithmic Information theory:}
1167: %------------------------------%
1168: One very important and highly non-trivial aspect of intelligence is
1169: inductive inference. Before formulating the AI$\xi$ model,
a short introduction to the history of induction is given, culminating
in the sequence prediction theory of Solomonoff. We emphasize
1172: only those aspects which will be of importance for the development
1173: of our universal AI$\xi$ model.
1174: 
1175: Simply speaking, induction is the process of
1176: predicting the future from the past or, more precisely, it is the
1177: process of finding rules in (past) data and using these rules to
1178: guess future data. On the one hand, induction seems to happen in
everyday life by finding regularities in past observations and
1180: using them to predict the future. On the other hand, this procedure
1181: seems to add knowledge about the future from past observations.
1182: But how can we know something about the future? This dilemma and
1183: the induction principle in general have a long philosophical
1184: history
1185: %
1186: \begin{itemize}\parskip=0ex\parsep=0ex\itemsep=0ex
  \item Hume's negation of induction (1711-1776) \cite{Hume},
  \item Epicurus' principle of multiple explanations (342?-270?
  BC),
  \item Occam's razor (simplicity) principle (1290?-1349?),
  \item Bayes' rule for conditional probabilities \cite{Bay63}
1192: \end{itemize}
1193: %
1194: and a short but important mathematical history: a clever
1195: unification of all these aspects into one formal theory of
1196: inductive inference has been done by Solomonoff \cite{Sol64} based
1197: on Kolmogorov's \cite{Kol65} definition of complexity. For an
excellent introduction to Kolmogorov complexity and Solomonoff
1199: induction one should consult the book of Li and Vit\'anyi
1200: \cite{LiVi93}. In the rest of this subsection we state all results
1201: which are needed or generalized later.
1202: 
1203: Let us choose some universal prefix Turing machine $U$ with
1204: unidirectional binary input and output tapes and a bidirectional
working tape. We can then define the (prefix) Kolmogorov complexity
\cite{Cha75,Gac74,Kol65,Lev74} as the length of the shortest prefix program $p$ for which $U$
outputs $x\!=\!x_{1:n}$ with $x_i\in\!I\!\!B$:
1208: %
1209: $$
1210:   K(x) \;:=\; \min_p\{l(p): U(p)=x\}
1211: $$
1212: The universal semimeasure $\xi(\pb x)$ is defined as the probability
1213: that the output of the universal Turing machine $U$ starts with
1214: $x$ when provided with fair coin flips on the input tape \cite{Sol64,Sol78}. It is
1215: easy to see that this is equivalent to the formal definition
1216: \beq\label{xidef}
1217:   \xi(\pb x)\;:=\;\sum_{p\;:\;U(p)=x*}\nq 2^{-l(p)}
1218: \eeq
1219: where the sum is over minimal programs $p$ for which $U$
1220: outputs a string starting with $x$. $U$ might be non-terminating.
1221: As the shortest programs dominate the sum, $\xi$ is closely
related to $K(x)$ ($\xi(\pb x)=2^{-K(x)+O(K(l(x)))}$).
1223: $\xi$ has the important universality property \cite{Sol64}, that it
1224: majorizes every computable probability distribution $\rho$ up
1225: to a multiplicative factor
1226: depending only on $\rho$ but not on $x$:
1227: \beq\label{uni}
1228:   \xi(\pb x) \;\stackrel{\times}{\geq}\; 2^{-K(\rho)}\!\cdot\!\rho(\pb x).
1229: \eeq
1230: %
1231: A '$\times$' above an (in)equality denotes (in)equality within a
1232: universal multiplicative constant,
1233: a '$+$' above an (in)equality denotes (in)equality within a
1234: universal additive constant, both depending only on the choice of the
1235: universal reference machine $U$.
1236: $\xi$ itself is {\it not} a probability
1237: distribution\footnote{It is possible to normalize $\xi$ to a
1238: probability distribution as has been done in
1239: \cite{Wil70,Sol78,Hut99} by giving up the enumerability of $\xi$.
1240: Error bounds (\ref{eukdist}) and (\ref{spebound}) hold for both
1241: definitions.}.
1242: We have $\xi(\pb{x0})\!+\!\xi(\pb{x1})\!<\!\xi(\pb
1243: x)$ because there are programs $p$, which output just $x$, neither
1244: followed by $0$ nor $1$. They just stop after printing $x$ or
1245: continue forever without any further output. We will call a
1246: function $\rho\!\geq 0$ with the properties
1247: $\rho(\epsilon)\!\leq\!1$ and $\sum_{x_n}\rho(\pb
1248: x_{1:n})\!\leq\!\rho(\pb x_{<n})$ a {\it semimeasure}. $\xi$ is a
1249: semimeasure and (\ref{uni}) actually holds for all enumerable
1250: semimeasures $\rho$.
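
As an illustration of the semimeasure property, consider a
hypothetical toy machine $U'$ (not universal, and introduced here only
for illustration) whose only valid programs form the prefix-free set
$\{0,10,11\}$: program $0$ outputs $000...$ forever, program $10$
outputs $0$ and then stops, and program $11$ outputs $0$ followed by
$111...$ forever. Evaluating the sum in (\ref{xidef}) for $U'$
instead of $U$ gives
$$
  \xi'(\pb 0) \;=\; 2^{-1}+2^{-2}+2^{-2} \;=\; 1, \quad
  \xi'(\pb{00}) \;=\; 2^{-1}, \quad
  \xi'(\pb{01}) \;=\; 2^{-2},
$$
so that $\xi'(\pb{00})\!+\!\xi'(\pb{01})\!=\!{3\over 4}\!<\!1\!=\!\xi'(\pb 0)$.
The missing weight $2^{-2}$ belongs to program $10$, which prints
$0$ but never a second bit.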
1251: 
1252: (Binary) sequence prediction algorithms try to predict the
1253: continuation $x_n$ of a given sequence $x_1...x_{n-1}$. In the
1254: following we will assume that the sequences are drawn according to
1255: a probability distribution and that the true prior probability of
1256: $x_{1:n}$ is $\mu(\pb{x_1...x_n})$. The probability of $x_n$ given
1257: $x_{<n}$ hence is $\mu(x_{<n}\pb x_n)$. The best possible system
1258: predicts the $x_n$ with higher probability. Usually $\mu$ is
1259: unknown and the system can only have some belief $\rho$ about the
1260: true prior probability $\mu$. Let SP$\rho$ be a probabilistic
1261: sequence predictor, predicting $x_n$ with probability
1262: $\rho(x_{<n}\pb x_n)$. Further we define a deterministic sequence
1263: predictor SP$\Theta_\rho$ predicting the $x_n$ with higher $\rho$
1264: probability. $\Theta_\rho(x_{<n}\pb x_n)\!:=\!1$ if
1265: $\rho(x_{<n}\pb x_n)\!>\!{1\over 2}$ and $\Theta_\rho(x_{<n}\pb x_n)\!:=\!0$
1266: otherwise.  If $\rho$ is only a semimeasure the SP$\rho$ and
1267: SP$\Theta_\rho$ systems might refuse any output in some cycles
1268: $n$. The SP$\Theta_\mu$ is the best prediction scheme when $\mu$
1269: is known.
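
For a concrete one-step illustration (the numbers are hypothetical),
let $\mu(x_{<n}\pb 1)\!=\!0.7$ and let the belief be
$\rho(x_{<n}\pb 1)\!=\!0.6$, $\rho(x_{<n}\pb 0)\!=\!0.4$. The
probabilistic predictor SP$\rho$ errs in this cycle with probability
$$
  \sum_{x_n}\mu(x_{<n}\pb x_n)\,(1\!-\!\rho(x_{<n}\pb x_n))
  \;=\; 0.7\cdot 0.4+0.3\cdot 0.6 \;=\; 0.46,
$$
whereas SP$\Theta_\rho$ always predicts $1$ (since $0.6\!>\!{1\over 2}$)
and errs with probability $0.3$, the same as SP$\Theta_\mu$.
Thresholding an inaccurate belief can thus already match the best
predictor, as long as $\rho$ and $\mu$ favor the same symbol.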
1270: 
1271: If $\rho(x_{<n}\pb x_n)$ converges quickly to $\mu(x_{<n}\pb x_n)$ the
1272: number of additional prediction errors introduced by using
1273: $\Theta_\rho$ instead of $\Theta_\mu$ for prediction should be
1274: small in some sense. Now the universal probability $\xi$
1275: comes into play as it has been proved
1276: by Solomonoff \cite{Sol78} that the $\mu$ expected Euclidean
distance between $\xi$ and $\mu$ is finite
1278: \beq\label{eukdist}
1279:   \sum_{k=1}^\infty\sum_{x_{1:k}}\mu(\pb x_{1:k})
1280:   (\xi(x_{<k}\pb x_k)-\mu(x_{<k}\pb x_k))^2 \;\stackrel{+}{<}\;
1281:   {\1d2}\ln 2\!\cdot\!K(\mu)
1282: \eeq
1283: The '$+$' atop '$<$' means up to additive terms of order 1.
1284: So indeed the difference does tend to zero, i.e.
1285: $\xi(x_{<n}\pb x_n)\toinfty{n}\mu(x_{<n}\pb x_n)$ with $\mu$ probability
1286: $1$ for {\it any} computable probability distribution $\mu$. The reason for the
1287: astonishing property of a single (universal) function to
1288: converge to {\it any} computable probability distribution lies in the fact that the
sets of $\mu$ random sequences differ for different $\mu$.
1290: The universality property (\ref{uni}) is the central ingredient for
1291: proving (\ref{eukdist}).
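
The step from (\ref{eukdist}) to convergence can be sketched as
follows. Abbreviate
$s_k:=(\xi(x_{<k}\pb x_k)-\mu(x_{<k}\pb x_k))^2$, a symbol used
only in this remark. (\ref{eukdist}) states that the $\mu$
expectation of $\sum_k s_k$ is finite, so $\sum_k s_k$ is finite
with $\mu$ probability $1$, which forces $s_k\!\to\!0$. Markov's
inequality gives in addition, for every $\varepsilon\!>\!0$,
$$
  \sum_{k=1}^\infty \mbox{Prob}_\mu[s_k\!>\!\varepsilon^2]
  \;\leq\;
  {1\over\varepsilon^2}\sum_{k=1}^\infty\sum_{x_{1:k}}\mu(\pb x_{1:k})\,s_k
  \;\leq\;
  {1\over\varepsilon^2}\Big({\textstyle{1\over 2}}\ln 2\!\cdot\!K(\mu)+O(1)\Big),
$$
i.e.\ the expected number of cycles in which $\xi$ deviates from
$\mu$ by more than $\varepsilon$ is finite.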
1292: 
1293: Let us define the total number of expected erroneous predictions
1294: the SP$\rho$ system makes for the first $n$ bits
1295: \beq\label{esp}
1296:   E_{n\rho} \;:=\; \sum_{k=1}^n\sum_{x_{1:k}}\mu(\pb x_{1:k})
1297:   (1\!-\!\rho(x_{<k}\pb x_k))
1298: \eeq
1299: The SP$\Theta_\mu$ system is best in the sense that
1300: $E_{n\Theta_\mu}\!\leq\!E_{n\rho}$
1301: for any $\rho$. In \cite{Hut99} it has been shown that
1302: SP$\Theta_\xi$ is not much worse
1303: \beq\label{spebound}
1304:   E_{n\Theta_\xi}\!-\!E_{n\rho} \;\leq\;
1305:   H+\sqrt{4E_{n\rho}H+H^2} \;=\;
1306:   O(\sqrt{E_{n\rho}})\quad,\quad
1307:   H\;\stackrel{+}{<}\;\ln 2\!\cdot\!K(\mu)
1308: \eeq
1309: with the tightest bound for $\rho\!=\!\Theta_\mu$. For finite
1310: $E_{\infty\Theta_\mu}$, $E_{\infty\Theta_\xi}$ is finite too. For
1311: infinite $E_{\infty\Theta_\mu}$,
1312: $E_{n\Theta_\xi}/E_{n\Theta_\mu}\toinfty{n}1$ with rapid
1313: convergence. One can hardly imagine any better prediction
1314: algorithm without extra knowledge about the environment. In
1315: \cite{Hut00e}, (\ref{eukdist}) and (\ref{spebound}) have been
1316: generalized from binary to arbitrary alphabet. Apart from
1317: computational aspects, which are of course very important, the
1318: problem of sequence prediction could be viewed as essentially
1319: solved.
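
To get a feeling for (\ref{spebound}), take the hypothetical values
$H\!=\!10$ and a source for which the best predictor errs with
probability $0.3$ in every cycle, so that
$E_{n\Theta_\mu}\!=\!0.3\,n$. With $\rho\!=\!\Theta_\mu$ the bound
reads
$$
  E_{n\Theta_\xi}-0.3\,n \;\leq\; 10+\sqrt{12\,n+100}.
$$
For $n\!=\!10^4$ this allows at most about $357$ additional errors
on top of $3000$ (roughly $12\%$), for $n\!=\!10^6$ at most about
$3474$ on top of $300\,000$ (roughly $1.2\%$). The relative excess
decays like $1/\sqrt n$, in line with
$E_{n\Theta_\xi}/E_{n\Theta_\mu}\toinfty{n}1$.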
1320: 
1321: %------------------------------%
1322: \paragraph{Definition of the AI$\xi$ Model:}
1323: %------------------------------%
1324: We have developed enough formalism to suggest our universal
AI$\xi$ model\footnote{Pronounced 'aixi' and written AIXI without Greek letters.}.
1326: All we have to do is to suitably generalize the universal
1327: semimeasure $\xi$ from the last subsection and replace the true
1328: but unknown prior probability $\mu^{AI}$ in the AI$\mu$ model by this
1329: generalized $\xi^{AI}$. In what sense this AI$\xi$ model is universal
1330: will be discussed later.
1331: 
1332: In the functional formulation we define the universal probability
1333: $\xi^{AI}$ of an environment $q$ just as $2^{-l(q)}$
1334: \beqn
1335:   \xi(q) \;:=\; 2^{-l(q)}
1336: \eeqn
1337: The definition could not be easier\footnote{It is not necessary
1338: to use $2^{-K(q)}$ or something similar as some reader may expect
1339: at this point. The reason is that for every program $q$ there
1340: exists a functionally equivalent program $q'$ with
1341: $K(q')=l(q')$.}!\footnote{Here and later we identify objects with
1342: their coding relative to some fixed Turing machine $U$. For example, if $q$ is
a function, $K(q):=K(\lceil q\rceil)$ with $\lceil q\rceil$ being a
1344: binary coding of $q$ such that $U(\lceil q\rceil,y):=q(y)$. On the
1345: other hand, if $q$ already is a binary string we define $q(y)\!:=U(q,y)$.}
1346: Collecting the formulas of section \ref{secAIfunc}
1347: and replacing $\mu(q)$ by $\xi(q)$
1348: we get the definition of the AI$\xi$ system in
1349: functional form. Given the history $\hh y\!\hh x_{<k}$ the
1350: functional AI$\xi$ system outputs
1351: \beq\label{eefuncxi}
1352:   \hh y_k \;:=\;
1353:   \maxarg_{y_k}\max_{p:p(\hh x_{<k})=\hh y_{<k}y_k}
1354:   \sum_{q:q(\hh y_{<k})=\hh x_{<k}}
1355:   \nq 2^{-l(q)}\cdot C_{km_k}(p,q)
1356: \eeq
1357: in cycle $k$, where $C_{km_k}(p,q)$ is the total credit of cycles $k$ to $m_k$ when
1358: system $p$ interacts with environment $q$. We have dropped the
1359: denominator $\sum_q\mu(q)$ from (\ref{eefunc}) as it is
1360: independent of the $p\!\in\!\hh P_k$ and a constant multiplicative
1361: factor does not change $\maxarg$.
1362: 
1363: For the iterative formulation the universal probability
1364: $\xi$ can be obtained by inserting the functional $\xi(q)$ into
1365: (\ref{mufr})
1366: \beq\label{uniMAI}
1367:   \xi(y\!\pb x_{1:k}) \;=\;
1368:   \nq\sum_{q:q(y_{1:k})=x_{1:k}}\nq 2^{-l(q)}
1369: \eeq
1370: Replacing $\mu$ by $\xi$ in (\ref{ydotrec}) the
1371: iterative AI$\xi$ system outputs
1372: \beq\label{ydotxi}
1373:   \hh y_k \;=\;
1374:   \maxarg_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}\;...\;
1375:   \max_{y_{m_k}}\sum_{x_{m_k}}
1376:   (c(x_k)\!+...+\!c(x_{m_k})) \!\cdot\!
1377:   \xi(\hh y\!\hh x_{<k}y\!\pb x_{k:m_k})
1378: \eeq
1379: in cycle $k$ given the history $\hh y\!\hh x_{<k}$.
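
Two special cases may help in reading (\ref{ydotxi}); they follow by
inserting a particular horizon and are meant as illustrations only.
For the greedy horizon $m_k\!=\!k$ the expression reduces to
\beqn
  \hh y_k \;=\;
  \maxarg_{y_k}\sum_{x_k} c(x_k)\!\cdot\!
  \xi(\hh y\!\hh x_{<k}y\!\pb x_k)
\eeqn
i.e.\ the system maximizes the $\xi$ expected credit of the current
cycle only. For $m_k\!=\!k\!+\!1$ one further maximization and sum
appear,
\beqn
  \hh y_k \;=\;
  \maxarg_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}
  (c(x_k)\!+\!c(x_{k+1})) \!\cdot\!
  \xi(\hh y\!\hh x_{<k}y\!\pb x_{k:k+1})
\eeqn
so the choice of $y_k$ already accounts for the best reaction
$y_{k+1}$ to each possible input $x_k$.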
1380: 
1381: One subtlety has been passed over. Like in the
1382: SP case, $\xi$ is not a probability distribution but satisfies only the weaker inequalities
1383: \beq\label{chrf}
1384:   \sum_{x_n}\xi(y\!\pb x_{1:n}) \;\leq\; \xi(y\!\pb x_{<n})
1385:   \quad,\quad
1386:   \xi(\epsilon) \;\leq\; 1
1387: \eeq
Note that the sum on the l.h.s.\ is {\it not}
independent of $y_n$ unlike for chronological probability
distributions. Nevertheless, it is bounded by something (the r.h.s.)
1391: which is independent of $y_n$. The reason is that the sum in
1392: (\ref{uniMAI}) runs over (partial recursive) chronological
1393: functions only and the functions $q$ which satisfy
1394: $q(y_{1:n})=x_{<n}*$ are a subset of the functions satisfying
1395: $q(y_{<n})=x_{<n}$. Therefore we will in general call functions satisfying
1396: (\ref{chrf}) {\it chronological semimeasures}. The important point
1397: is that the conditional probabilities (\ref{bayes2}) are $\leq\!1$
1398: like for true probability distributions.
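
To make the last point explicit, assume that (\ref{bayes2}) defines
the conditional probability as the usual quotient
$\xi(y\!x_{<n}y\!\pb x_n)=\xi(y\!\pb x_{1:n})/\xi(y\!\pb x_{<n})$.
Dividing the first inequality in (\ref{chrf}) by
$\xi(y\!\pb x_{<n})$, whenever this is positive, gives
\beqn
  \sum_{x_n}\xi(y\!x_{<n}y\!\pb x_n) \;=\;
  {\sum_{x_n}\xi(y\!\pb x_{1:n}) \over \xi(y\!\pb x_{<n})}
  \;\leq\; 1
\eeqn
Each conditional probability is therefore at most $1$. The sum may
be strictly smaller than $1$; the missing weight belongs to
environments which are consistent with the history but produce no
$n$-th output.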
1399: 
1400: The equivalence of the functional and iterative AI model proven in
1401: section \ref{secAImurec} is true for every chronological
1402: semimeasure $\rho$, esp.\ for $\xi$, hence we can talk about {\it
1403: the} AI$\xi$ model in this respect. It (slightly) depends on the
choice of universal Turing machine, as $l(q)$ is defined only up to
1405: an additive constant. It also depends on the choice of
1406: $X\!=\!C\!\times\!X'$ and $Y$, but we do not expect any bias when
1407: the spaces are chosen sufficiently simple, e.g. all strings of
1408: length $2^{16}$. Choosing $I\!\!N$ as word space would be optimal,
1409: but whether the maxima (suprema) exist in this case, has to be
1410: shown beforehand. The only non-trivial dependence is on the
1411: horizon function $m_k$ which will be discussed later. So apart
1412: from $m_k$ and unimportant details the AI$\xi$ system is uniquely
1413: defined by (\ref{eefuncxi}) or (\ref{ydotxi}).
1414: It doesn't depend
1415: on assumptions about the environment apart from being generated
1416: from some computable (but unknown!) probability distribution.
1417: 
1418: %------------------------------%
1419: \paragraph{Universality of $\xi^{AI}$:}
1420: %------------------------------%
1421: In which sense the AI$\xi$ model is optimal will be clarified
1422: later. In this and the next two subsections we show that $\xi^{AI}$
defined in (\ref{uniMAI}) is universal and converges to $\mu^{AI}$ analogously to the
1424: SP case (\ref{uni}) and (\ref{eukdist}). The proofs are
1425: generalizations from the SP case. The $y$ are pure spectators and
1426: cause no difficulties in the generalization. The replacement of
1427: the binary alphabet $I\!\!B$ used in SP by the (possibly infinite)
1428: alphabet $X$ is possible, but needs to be done with care. In
1429: (\ref{uni}) $U(p)=x*$ produces strings starting with $x$, whereas
1430: in (\ref{uniMAI}) we can demand $q$ to output exactly $n$ words $x_{1:n}$ as
1431: $q$ knows $n$ from the number of input words $y_1...y_n$.
1432: For proofs of (\ref{uni}) and (\ref{eukdist}) see \cite{Sol78} and
1433: \cite{LiVi92}.
1434: 
1435: There is an alternative
1436: definition of $\xi$ which coincides with (\ref{uniMAI}) within a
1437: multiplicative constant of $O(1)$,
1438: \beq\label{xirhodef}
1439:   \xi(y\!\pb x_{1:n}) \;\stackrel{\times}{=}\; \sum_\rho 2^{-K(\rho)}\rho(y\!\pb
1440:   x_{1:n})
1441: \eeq
1442: where the sum runs over all enumerable chronological semimeasures.
1443: The $2^{-K(\rho)}$ weighted sum over probabilistic environments
1444: $\rho$, coincides with the sum over $2^{-l(q)}$ weighted
1445: deterministic environments $q$, as will be proved below.
1446: In the next subsection we show that an enumeration of all
1447: enumerable functions can be converted into an enumeration of
1448: enumerable chronological semimeasures $\rho$. $K(\rho)$ is co-enumerable,
1449: therefore $\xi$ defined in (\ref{xirhodef}) is itself enumerable.
1450: The representation (\ref{uniMAI}) is also enumerable. As
$\sum_\rho2^{-K(\rho)}\!\leq\!1$ and the $\rho$'s satisfy (\ref{chrf}), $\xi$
1452: is a chronological semimeasure as well. If we pick one $\rho$ in
1453: (\ref{xirhodef}) we get the universality property ''for free''
1454: \beq\label{uniaixi}
1455:   \xi(y\!\pb x_{1:n}) \;\stackrel{\times}{\geq}\; 2^{-K(\rho)}\rho(y\!\pb x_{1:n})
1456: \eeq
1457: $\xi$ is a universal element in the sense of (\ref{uniaixi}) in
1458: the set of all enumerable chronological semimeasures.
1459: 
1460: To prove universality of $\xi$ in the form (\ref{uniMAI}) we have
1461: to show that for every  enumerable chronological semimeasure
1462: $\rho$ there exists a Turing machine $T$ with
1463: \beq\label{reprho}
1464:   \rho(y\!\pb x_{1:n}) \;=\; \sum_{q:T(qy_{1:n})=x_{1:n}}\nq 2^{-l(q)}
1465:   \quad\mbox{and}\quad l(T)\stackrel{+}{=}K(\rho).
1466: \eeq
1467: 
1468: This will not be done here. Given $T$ the universality of
1469: $\xi$
1470: follows from
1471: \beqn
1472:   \xi(y\!\pb x_{1:n}) \;=\;
1473:   \nq\nq\sum_{\quad\quad q:U(qy_{1:n})=x_{1:n}}\nq\nq 2^{-l(q)}
1474:   \;\geq\;
  \nq\nq\sum_{\quad\quad q':U(Tq'y_{1:n})=x_{1:n}}\nq\;\nq\nq 2^{-l(Tq')}
  \;=\;
  2^{-l(T)}\nq\nq\sum_{q':T(q'y_{1:n})=x_{1:n}}\nq\nq 2^{-l(q')}
1478:   \stackrel{\times}{\;=\;}
1479:   2^{-K(\rho)}\rho(y\!\pb x_{1:n})
1480: \eeqn
1481: The first equality and (\ref{uniMAI}) are identical by definition.
1482: In the inequality we have restricted the sum over all $q$ to $q$
1483: of the form $q\!=\!Tq'$. The third relation is true as running $U$
1484: on $Tz$ is a simulation of $T$ on $z$. The last equality follows
1485: from (\ref{reprho}). All enumerable, universal, chronological
1486: semimeasures coincide up to a multiplicative constant, as they
1487: mutually dominate each other. Hence, definitions (\ref{uniMAI}) and
1488: (\ref{xirhodef}) are, indeed, equivalent.
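
The mutual dominance argument can be spelled out in one line.
Suppose $\xi_1$ and $\xi_2$ are two universal enumerable
chronological semimeasures (the indices are introduced only for
this remark). Universality of each with respect to the other gives
constants $c_1,c_2\!>\!0$, independent of the argument, with
$\xi_1\!\geq\!c_1\xi_2$ and $\xi_2\!\geq\!c_2\xi_1$. Hence
\beqn
  c_1\,\xi_2(y\!\pb x_{1:n}) \;\leq\;
  \xi_1(y\!\pb x_{1:n}) \;\leq\;
  {1\over c_2}\,\xi_2(y\!\pb x_{1:n})
\eeqn
which is the meaning of $\xi_1\stackrel\times=\xi_2$. By
(\ref{uniaixi}) one may take $c_1\stackrel\times=2^{-K(\xi_2)}$ and
$c_2\stackrel\times=2^{-K(\xi_1)}$.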
1489: 
1490: %------------------------------%
\paragraph{Converting general functions into chronological semimeasures:}
1492: %------------------------------%
1493: To complete the proof of the universality (\ref{uniaixi}) of $\xi$
1494: we need to convert enumerable functions
1495: $\psi:I\!\!B^*\!\to\!I\!\!R^+$ into enumerable chronological
semimeasures $\rho:(Y\!\times\!X)^*\!\to\!I\!\!R^+$ with certain
1497: additional properties. Every enumerable function like $\psi$ and
1498: $\rho$ can be approximated from below by definition\footnote{Defining
1499: enumerability as the supremum of total primitive recursive
1500: functions is more suitable for our purpose than the equivalent
1501: definition as a limit of monotone increasing partial
1502: recursive functions. In terms of Turing machines, the recursion
1503: parameter is the time after which a computation is terminated.} by
1504: primitive recursive functions
1505: $\varphi:I\!\!B^*\!\times\!I\!\!N\!\to\!I\!\!\!Q^+$ and
1506: $\phi:(Y\!\times\!X)^*\!\times\!I\!\!N\!\to\!I\!\!\!Q^+$ with
1507: $\psi(s)=\sup_t\varphi(s,t)$ and $\rho(s)=\sup_t\phi(s,t)$ and
1508: recursion parameter $t$. For arguments of the form
1509: $s\!=\!y\!x_{1:n}$ we recursively (in $n$) construct $\phi$ from
1510: $\varphi$ as follows:
1511: \begin{eqnarray}\label{ccsm1}
1512:   \varphi'(y\!x_{1:n},t) &\!:=\!&
1513:   \left\{
1514:   \begin{array}{c@{\quad\mbox{for}\quad}l}
1515:     \varphi(y\!x_{1:n},t) & x_n<t     \\
1516:     0                   & x_n\geq t
1517:   \end{array} \right.
1518:   \quad,\quad \varphi'(\epsilon,t) \;:=\; \varphi(\epsilon,t)
1519: \\ \label{ccsm2}
1520:   \phi(\epsilon,t) &\!:=\!& \max_{0\leq i\leq t}
1521:   \Big\{\varphi'(\epsilon,i):\varphi'(\epsilon,i)\leq 1 \Big\}
1522: \\ \label{ccsm3}
1523:   \phi(y\!\pb x_{1:n},t) &\!:=\!& \max_{0\leq i\leq t}
1524:   \Big\{ \varphi'(y\!x_{1:n},i):{\textstyle\sum_{x_n}}\varphi'(y\!x_{1:n},i)\leq
1525:      \phi(y\!\pb x_{<n},t) \Big\}
1526: \end{eqnarray}
By $x_n\!<\!t$ we mean that the natural number associated with
string $x_n$ is smaller than $t$.
According to (\ref{ccsm1}), since $\varphi$ is primitive recursive, so are $\varphi'$ and
$\sum_{x_n}\varphi'$; the sum is finite because $\varphi'$ vanishes for $x_n\!\geq\!t$. Further, if we
1531: allow $t\!=\!0$ we have $\varphi'(s,0)=0$. This ensures that
1532: $\phi$ is a total function.
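
A small numerical illustration of (\ref{ccsm2}), with hypothetical
values: suppose $\varphi'(\epsilon,i)$ takes the values
$0,\;0.5,\;1.2,\;0.9$ for $i\!=\!0,1,2,3$. The maximum in
(\ref{ccsm2}) only ranges over values not exceeding $1$, so
$$
  \phi(\epsilon,0)=0, \quad
  \phi(\epsilon,1)=0.5, \quad
  \phi(\epsilon,2)=0.5, \quad
  \phi(\epsilon,3)=0.9.
$$
The inadmissible value $1.2$ is skipped, and $\phi(\epsilon,t)$ is
monotone increasing in $t$ and bounded by $1$ by construction.
(\ref{ccsm3}) applies the same filter at depth $n$, with the bound
$1$ replaced by $\phi(y\!\pb x_{<n},t)$.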
1533: 
1534: In the following we prove by induction over $n$ that $\phi$ is a
1535: primitive recursive chronological semimeasure
1536: monotone increasing in $t$. All necessary properties hold for
1537: $n\!=\!0$ ($y\!x_{1:0}\!=\!\epsilon$) according to (\ref{ccsm2}).
1538: For general $n$ assume that the induction hypothesis is true for
1539: $\phi(y\!\pb x_{<n},t)$. We can see from (\ref{ccsm3}) that
1540: $\phi(y\!\pb x_{1:n},t)$ is monotone  increasing in $t$. $\phi$ is
1541: total as $\varphi'(y\!x_{1:n},i\!=\!0)\!=\!0$ satisfies the
1542: inequality. By assumption $\phi(y\!x_{<n},t)$ is
1543: primitive recursive, hence with $\sum_{x_n}\varphi'$ also the order relation
1544: $\sum\varphi'\!\leq\!\phi$ is primitive recursive. This ensures
1545: that the non-empty finite set
1546: $\{\varphi'\!:\!\sum\varphi'\!\leq\!\phi\}_i$ and its maximum
1547: $\phi(y\!\pb x_{1:n},t)$ are primitive recursive. Further,
1548: $\phi(y\!\pb x_{1:n},t)\!=\!\varphi'(y\!x_{1:n},i)$ for some $i$ with
1549: $i\!\leq\!t$ independent of $x_n$. Thus,
1550: $\sum_{x_n}\phi(y\!\pb x_{1:n},t)$ $=$ $\sum_{x_n}\varphi'(y\!x_{1:n},i)$
1551: $\leq$ $\phi(y\!\pb x_{<n},t)$ which is the condition for $\phi$ being a
1552: chronological semimeasure. Inductively we have proved that $\phi$ is
1553: indeed a primitive recursive chronological semimeasure
1554: monotone increasing in $t$.
1555: 
1556: In the following we show that every (total)\footnote{Semimeasures
1557: are, by definition, total functions.} enumerable chronological
1558: semimeasure $\rho$ can be enumerated by some $\phi$. By definition
1559: of enumerability there exist primitive recursive functions
1560: $\tilde\varphi$ with $\rho(s)\!=\!\sup_t\tilde\varphi(s,t)$. The
1561: function $\varphi(s,t)\!:=\!(1\!-\! ^1\!/_t)\!\cdot\!
1562: \max_{i<t}\tilde\varphi(s,i)$ also enumerates $\rho$ but has
1563: the additional advantage of being strictly monotone increasing in $t$.
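
A brief check of this claim, for $t\!\geq\!1$: for $t\!<\!t'$ we
have $1\!-\!{1\over t}<1\!-\!{1\over t'}$ and
$\max_{i<t}\tilde\varphi(s,i)\leq\max_{i<t'}\tilde\varphi(s,i)$,
hence $\varphi(s,t)\!<\!\varphi(s,t')$ whenever the latter maximum
is positive. Moreover
$$
  \sup_t\varphi(s,t) \;=\;
  \lim_{t\to\infty}\Big(1\!-\!{1\over t}\Big)\max_{i<t}\tilde\varphi(s,i)
  \;=\; \sup_i\tilde\varphi(s,i) \;=\; \rho(s),
$$
so nothing is lost in the limit. For example, for a constant
approximation $\tilde\varphi(s,i)\!\equiv\!c\!>\!0$ (a hypothetical
case) one gets $\varphi(s,t)=(1\!-\!{1\over t})\,c$, which increases
strictly to $c$.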
1564: 
1565: $\varphi'(y\!x_{1:n},\infty)\!=
1566: \!\varphi(y\!x_{1:n},\infty)\!=\!\rho(y\!x_{1:n})$ by definition
1567: (\ref{ccsm1}). $\phi(\epsilon,t)\!=\!\varphi'(\epsilon,t)$ by
1568: (\ref{ccsm2}) and the fact that
1569: $\varphi'(\epsilon,i\!-\!1)<\varphi'(\epsilon,i)\!\leq\!
1570: \varphi(\epsilon,i)\!\leq\!\rho(\epsilon)\!\leq\!1$, hence
1571: $\phi(\epsilon,\infty)\!=\!\rho(\epsilon)$. $\phi(y\!\pb
1572: x_{1:n},t)\!\leq\!\varphi'(y\!x_{1:n},t)$ by (\ref{ccsm3}), hence
1573: $\phi(y\!\pb x_{1:n},\infty)\!\leq\!\rho(y\!\pb x_{1:n})$. We prove
1574: the opposite direction $\phi(y\!\pb
1575: x_{1:n},\infty)\!\geq\!\rho(y\!x_{1:n})$ by induction over $n$. We
1576: have
1577: \beq\label{upineq}
1578:   \sum_{x_n}\varphi'(y\!x_{1:n},i) \;\leq\;
1579:   \sum_{x_n}\varphi(y\!x_{1:n},i)  \;<\;
1580:   \sum_{x_n}\varphi(y\!x_{1:n},\infty) \;=\;
1581:   \sum_{x_n}\rho(y\!x_{1:n}) \;\leq\; \rho(y\!\pb x_{<n})
1582: \eeq
The strict monotonicity of $\varphi$ and the semimeasure
1584: property of $\rho$ have been used. By induction hypothesis
1585: $\lim_{t\to\infty}\phi(y\!\pb x_{<n},t)\!\geq\!\rho(y\!\pb x_{<n})$ and
1586: (\ref{upineq}) for sufficiently large $t$ we have
1587: $\phi(y\!\pb x_{<n},t)\!>\!\sum_{x_n}\varphi'(y\!x_{1:n},i)$. The
1588: condition in (\ref{ccsm3}) is, hence, satisfied and therefore
1589: $\phi(y\!\pb x_{1:n},t)\!\geq\!\varphi'(y\!x_{1:n},i)$ for sufficiently
1590: large $t$, especially
1591: $\phi(y\!\pb x_{1:n},\infty)\!\geq\!\varphi'(y\!x_{1:n},i)$ for all $i$.
1592: Taking the limit $i\!\to\!\infty$ we get
1593: $\phi(y\!\pb x_{1:n},\infty)\!\geq\!\varphi'(y\!x_{1:n},\infty)\!=\!\rho(y\!\pb x_{1:n})$.
1594: 
1595: Combining all results, we have shown that the constructed
1596: $\phi(\cdot,t)$ are primitive recursive chronological semimeasures
1597: monotone increasing in $t$, which converge to the enumerable
1598: chronological semimeasure $\rho$. This finally proves the
1599: enumerability of the set of enumerable chronological
1600: semimeasures.
1601: 
1602: %------------------------------%
1603: \paragraph{Convergence of $\xi^{AI}$ to $\mu^{AI}$:}
1604: %------------------------------%
1605: In \cite{Hut00e} the following inequality is proved
1606: \beq\label{entro2}
1607:   2\sum_{i=1}^{|X|} y_i(y_i\!-\!z_i)^2 \;\leq\!
1608:   \sum_{i=1}^{|X|} y_i\ln{y_i\over z_i} \quad\mbox{with}\quad
1609:   \sum_{i=1}^{|X|} y_i=1, \quad \sum_{i=1}^{|X|} z_i\leq 1
1610: \eeq
1611: If we identify $i\!=\!x_k$ and $y_i\!=\!\mu(y\!x_{<k}y\!\pb x_k)$ and
1612: $z_i\!=\!\xi(y\!x_{<k}y\!\pb x_k)$, multiply both sides with
1613: $\mu(y\!\pb x_{<k})$, take the sum over $x_{<k}$, then the sum
1614: over $k$ and use Bayes' rule $\mu(y\!\pb x_{<k})\!\cdot\!\mu(y\!x_{<k}y\!\pb
1615: x_k)=\mu(y\!\pb x_{1:k})$ we get
1616: \beq\label{eukdistxi}
1617:   2\sum_{k=1}^n\sum_{x_{1:k}}\mu(y\!\pb x_{1:k})
  \Big(\mu(y\!x_{<k}y\!\pb x_k)-\xi(y\!x_{<k}y\!\pb x_k)\Big)^2 \;\leq\;
  \sum_{k=1}^n\sum_{x_{1:k}}\mu(y\!\pb x_{1:k})
  \ln{\mu(y\!x_{<k}y\!\pb x_k)\over\xi(y\!x_{<k}y\!\pb x_k)}
1621:   =\; ...
1622: \eeq
1623: In the r.h.s.\ we can replace $\sum_{x_{1:k}}\mu(y\!\pb
1624: x_{1:k})$ by $\sum_{x_{1:n}}\mu(y\!\pb x_{1:n})$ as the argument
1625: of the logarithm is independent of $x_{k+1:n}$. The $k$ sum can now be
1626: brought into the logarithm and converts to a product. Using Bayes'
1627: rule (\ref{bayes2}) for $\mu$ and $\xi$ we get
1628: \beq\label{eukdistxi2}
1629:   ...\;=\;
1630:   \sum_{x_{1:n}}\mu(y\!\pb x_{1:n})
  \ln\prod_{k=1}^n{\mu(y\!x_{<k}y\!\pb x_k)\over\xi(y\!x_{<k}y\!\pb x_k)}
1632:   \;=\;
1633:   \sum_{x_{1:n}}\mu(y\!\pb x_{1:n})
1634:   \ln{\mu(y\!\pb x_{1:n})\over\xi(y\!\pb x_{1:n})}
1635:   \;\stackrel{+}{<}\; \ln 2\!\cdot\!K(\mu)
1636: \eeq
1637: where we have used the universality property (\ref{uniaixi})
1638: of $\xi$ in the last step. The main complication for generalizing
1639: (\ref{eukdist}) to (\ref{eukdistxi},\ref{eukdistxi2}) was the
1640: generalization of (\ref{entro2}) from $|X|\!=\!2$ to a general
1641: alphabet, the $y$ are, again, pure spectators. This will change when
we analyze error/credit bounds analogous to (\ref{spebound}).
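
The last step of (\ref{eukdistxi2}) can be made explicit. A
computable $\mu$ is in particular an enumerable chronological
semimeasure, so (\ref{uniaixi}) applies with $\rho\!=\!\mu$.
Writing $c\!>\!0$ for the multiplicative constant hidden in
$\stackrel\times\geq$ (a symbol used only here), we have
$\mu(y\!\pb x_{1:n})/\xi(y\!\pb x_{1:n})\leq c^{-1}2^{K(\mu)}$ for
all $x_{1:n}$, hence
\beqn
  \sum_{x_{1:n}}\mu(y\!\pb x_{1:n})
  \ln{\mu(y\!\pb x_{1:n})\over\xi(y\!\pb x_{1:n})}
  \;\leq\;
  (\ln 2\!\cdot\!K(\mu)-\ln c)\sum_{x_{1:n}}\mu(y\!\pb x_{1:n})
  \;=\; \ln 2\!\cdot\!K(\mu)+O(1)
\eeqn
since $\mu$ is a probability distribution and sums to $1$ over
$x_{1:n}$. The bound is independent of $n$, which is why the sum
over $k$ in (\ref{eukdistxi}) stays finite for $n\!\to\!\infty$.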
1643: 
(\ref{eukdistxi},\ref{eukdistxi2}) show that the $\mu$ expected
1645: squared difference of $\mu$ and $\xi$ is finite for computable
1646: $\mu$. This, in turn, shows that $\xi(y\!x_{<k}y\!\pb x_k)$
1647: converges to $\mu(y\!x_{<k}y\!\pb x_k)$ for $k\!\to\!\infty$ with $\mu$
probability 1. If we take a finite product of $\xi$'s and use
1649: Bayes' rule, we see that also $\xi(y\!x_{<k}y\!\pb x_{k:k+r})$
1650: converges to $\mu(y\!x_{<k}y\!\pb x_{k:k+r})$. More generally, in case of
1651: a bounded horizon $h_k$, it follows that
1652: \beq\label{aixitomu}
1653:   \xi(y\!x_{<k}y\!\pb x_{k:m_k}) \toinfty{k} \mu(y\!x_{<k}y\!\pb x_{k:m_k})
1654:   \quad\mbox{if}\quad h_k\equiv m_k\!-\!k\!+\!1 \leq h_{max} < \infty
1655: \eeq
This makes us confident that the outputs $\hh y_k$
1657: of the AI$\xi$ model (\ref{ydotxi}) could converge to the outputs $\hh
1658: y_k$ from the AI$\mu$ model (\ref{ydotrec}), at least for bounded
1659: horizon.
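
The product argument behind (\ref{aixitomu}) reads for $r\!=\!1$,
again assuming the quotient form of the conditional probabilities,
\beqn
  \xi(y\!x_{<k}y\!\pb x_{k:k+1}) \;=\;
  \xi(y\!x_{<k}y\!\pb x_k)\cdot\xi(y\!x_{1:k}y\!\pb x_{k+1})
\eeqn
and likewise for $\mu$. Both factors on the right converge to the
corresponding $\mu$ factors by the previous result (the second one
with $k$ replaced by $k\!+\!1$), and all factors are bounded by
$1$, so the products converge as well. Repeating this step at most
$h_{max}$ times gives (\ref{aixitomu}); the bound on $h_k$ is
needed because the argument only handles a fixed finite number of
factors.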
1660: 
1661: We want to call an AI model {\it universal}, if it is $\mu$
1662: independent (unbiased, model-free) and is able
1663: to solve any solvable problem and learn any learnable task.
1664: Further, we call a universal model, {\it universally optimal}, if
1665: there is no program, which can solve or learn significantly faster
1666: (in terms of interaction cycles). As the AI$\xi$ model is
1667: parameterless, $\xi$ converges to $\mu$ (\ref{aixitomu}), the
1668: AI$\mu$ model is itself optimal, and we expect no other model to
1669: converge faster to AI$\mu$ by analogy to SP (\ref{spebound}),
1670: \beqn
1671:   \mbox{\it we expect AI$\xi$ to be universally optimal.}
1672: \eeqn
1673: This is our main claim. In a sense, the intention of the remaining
1674: (sub)sections is to define this statement more rigorously and
1675: to give further support.
1676: 
1677: %------------------------------%
1678: \paragraph{Intelligence order relation:}
1679: %------------------------------%
1680: We define the $\xi$ expected credit in cycles $k$ to $m$ of a
1681: policy $p$ similar to (\ref{eefunc}) and (\ref{eefuncxi}).
1682: We extend the definition to programs $p\!\not\in\!\hh P_k$ which
1683: are not consistent with the current history.
1684: \beq\label{cxi}
1685:   C^\xi_{km}(p|\hh y\!\hh x_{<k}) \;:=\;
1686:   {1\over\cal N}
1687:   \sum_{q:q(\hh y_{<k})=\hh x_{<k}}
1688:   \nq 2^{-l(q)}\cdot C_{km}(\tilde p,q)
1689: \eeq
1690: The normalization $\cal N$ is again only necessary for
1691: interpreting $C_{km}$ as the expected credit but otherwise
1692: unneeded. For consistent policies $p\!\in\!\hh P_k$ we define
1693: $\tilde p\!:=\!p$. For $p\!\not\in\!\hh P_k$, $\tilde p$ is a
1694: modification of $p$ in such a way that its output is consistent
1695: with the current history $\hh y\!\hh x_{<k}$, hence $\tilde
1696: p\!\in\!\hh P_k$, but unaltered for the current and future cycles
1697: $\geq\!k$. Using this definition of $C_{km}$ we could take the
maximum over all systems $p$ in (\ref{eefuncxi}), rather than only the
1699: consistent ones.
1700: 
1701: We call $p$ {\it more or equally intelligent} than $p'$ if
1702: \beq\label{aiorder}
1703:   p\succeq p' \;:\Leftrightarrow
1704:   \forall k\forall\hh y\!\hh x_{<k}:
1705:   C^\xi_{km_k}(p|\hh y\!\hh x_{<k}) \geq
1706:   C^\xi_{km_k}(p'|\hh y\!\hh x_{<k})
1707: \eeq
i.e.\ if $p$ yields in every circumstance at least as much $\xi$ expected
credit as $p'$. As the algorithm $p^\best$ behind the AI$\xi$
1710: system maximizes $C^\xi_{km_k}$ we have $p^\best\!\succeq\!p$ for all
1711: $p$. The AI$\xi$ model is hence the most intelligent system
1712: w.r.t.\ $\succeq$. $\succeq$ is a universal order relation in the
1713: sense that it is free of any parameters (except $m_k$) or specific
1714: assumptions about the environment. A proof, that $\succeq$ is a
1715: reliable intelligence order (what we believe to be true), would
1716: prove that AI$\xi$ is universally optimal. We could further ask:
1717: how useful is $\succeq$ for ordering policies of practical
1718: interest with intermediate intelligence, or how can $\succeq$ help
1719: to guide toward constructing more intelligent systems with
1720: reasonable computation time. An effective intelligence order
1721: relation $\succeq^c$ will be defined in section \ref{secTime},
1722: which is more useful from a practical point of view.
1723: 
1724: %------------------------------%
1725: \paragraph{Credit bounds and separability concepts:}
1726: %------------------------------%
1727: The credits $C_{km}$ associated with the AI systems correspond
1728: roughly to the negative error measure $-E_{n\rho}$ of the SP
1729: systems. In SP, we were interested in small bounds for the error
1730: excess $E_{n\Theta_\xi}\!-\!E_{n\rho}$. Unfortunately, simple
credit bounds for AI$\xi$ in terms of $C_{km}$ analogous to the error
1732: bound (\ref{spebound}) do not hold. We even have difficulties in
1733: specifying what we can expect to hold for AI$\xi$ or any AI system
1734: which claims to be universally optimal. Consequently, we cannot
1735: have a proof if we don't know what to prove. In SP, the only
1736: important property of $\mu$ for proving error bounds was its
1737: complexity $K(\mu)$. We will see that in the AI case, there are no
1738: useful bounds in terms of $K(\mu)$ only. We either have to study
1739: restricted problem classes or consider bounds depending on other
1740: properties of $\mu$, rather than on its complexity only. In the
1741: following, we will exhibit the difficulties by two examples and
1742: introduce concepts which may be useful for proving credit bounds.
1743: Despite the difficulties in even claiming useful credit bounds, we
nevertheless firmly believe that the order relation
1745: (\ref{aiorder}) correctly formalizes the intuitive meaning of
1746: intelligence and, hence, that the AI$\xi$ system is universally optimal.
1747: 
1748: %------------------------------%
1749: %\paragraph{(Pseudo) passive $\mu$ and the heaven/hell example:}
1750: %------------------------------%
1751: In the following, we choose $m_k\!=\!T$. We want to compare the
1752: true, i.e. $\mu$ expected credit $C^\mu_{1T}$ of a $\mu$
1753: independent universal policy $p^{best}$ with any other policy $p$.
1754: Naively, we might expect the existence of a policy $p^{best}$ which
1755: maximizes $C^\mu_{1T}$, apart from additive
1756: corrections of lower order for $T\!\to\!\infty$
1757: \beq\label{cximu}
1758:   C^\mu_{1T}(p^{best}) \;\geq\; C^\mu_{1T}(p) - o(...)
1759:   \quad \forall\mu,p
1760: \eeq
Note that the policy $p^{*\xi}$ of the AI$\xi$ system
1762: maximizes $C^\xi_{1T}$ by definition ($p^{*\xi}\succeq p$). As
1763: $C^\xi_{1T}$ is thought to be a guess of $C^\mu_{1T}$, we might
1764: expect $p^{best}\!=\!p^{*\xi}$ to approximately maximize
1765: $C^\mu_{1T}$, i.e. (\ref{cximu}) to hold. Let us consider the
1766: problem class (set of environments) $\{\mu_0,\mu_1\}$ with
1767: $Y\!=\!C\!=\{0,1\}$ and $c_k\!=\delta_{iy_1}$ in environment
1768: $\mu_i$. The first output $y_1$ decides whether you go to heaven
1769: with all future credits $c_k$ being $1$ (good) or to hell with all
1770: future credits being $0$ (bad). It is clear, that if
1771: $\mu_i$, i.e. $i$ is known, the optimal policy $p^{*\mu_i}$
1772: is to output $y_1\!=\!i$ in the first cycle with
1773: $C^\mu_{1T}(p^{*\mu_i})\!=\!T$. On the other hand, any unbiased
1774: policy $p^{best}$ independent of the actual $\mu$ either outputs
1775: $y_1\!=\!1$ or $y_1\!=\!0$. Independent of the actual choice
1776: $y_1$, there is always an environment ($\mu\!=\!\mu_{1-y_1}$)
1777: for which this choice is catastrophic
1778: ($C^\mu_{1T}(p^{best})\!=\!0$). No single system can perform well in both
1779: environments $\mu_0$ {\it and} $\mu_1$. The r.h.s.\ of
1780: (\ref{cximu}) equals $T\!-\!o(T)$ for $p\!=\!p^{*\mu}$. For all
1781: $p^{best}$ there is a $\mu$ for which the l.h.s.\ is zero. We have
1782: shown that no $p^{best}$ can satisfy (\ref{cximu}) for all $\mu$
1783: and $p$, so we cannot expect $p^{*\xi}$ to do so. Nevertheless,
1784: there are problem classes for which (\ref{cximu}) holds, for
1785: instance SP and CF. For SP, (\ref{cximu}) is just a reformulation
1786: of (\ref{spebound}) with an appropriate choice for $p^{best}$
1787: (which differs from $p^{*\xi}$, see next section). We expect
1788: (\ref{cximu}) to hold for all inductive problems in which the
1789: environment is not influenced\footnote{Of course, the credit
1790: feedback $c_k$ depends on the system's output. What we have in mind
1791: is, like in sequence prediction, that the true sequence is not
1792: influenced by the system} by the output of the system. We want to
1793: call these $\mu$, {\it passive} or {\it inductive} environments.
1794: Further, we want to call $\mu$ satisfying (\ref{cximu}) with
1795: $p^{best}\!=\!p^{*\xi}$ {\it pseudo passive}. So we expect
1796: inductive $\mu$ to be pseudo passive.
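
To make the failure of (\ref{cximu}) concrete, the Heaven\&Hell
example can be written out for a policy $p^{best}$ that starts with
$y_1\!=\!0$ (the case $y_1\!=\!1$ is symmetric). This merely restates
the argument above using its own definitions:
$$
  C^{\mu_0}_{1T}(p^{best}) \;=\; T, \qquad
  C^{\mu_1}_{1T}(p^{best}) \;=\; 0, \qquad
  C^{\mu_1}_{1T}(p^{*\mu_1}) - C^{\mu_1}_{1T}(p^{best}) \;=\; T
$$
The loss in $\mu_1$ grows linearly in $T$, so no correction term of
lower order than $T$ can absorb it.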
1797: 
1798: %------------------------------%
1799: %\paragraph{The OnlyOne example:}
1800: %------------------------------%
1801: Let us give a further example to demonstrate the difficulties in
1802: establishing credit bounds. Let $C\!=\{0,1\}$ and $|Y|$ be large. We
1803: consider all (deterministic) environments in which a single complex output
1804: $y^*$ is correct ($c\!=\!1$) and all others are wrong ($c\!=\!0$).
1805: The problem class $M$ is defined by
1806: $$
1807:   M:=\{\mu:\mu(y\!x_{<k}y_k\pb 1)=
1808:        \delta_{y_ky^*},\; y^*\!\in\!Y,\; K(y^*)\!=\!_\lfloor\log_2|Y|_\rfloor\}
1809: $$
1810: There are $N\stackrel\times=|Y|$ such $y^*$. The only way a
$\mu$ independent policy $p$ can find the correct $y^*$ is
1812: by trying one $y$ after the other in a certain order. In the first
1813: $N\!-\!1$ cycles at most, $N\!-\!1$ different $y$ are tested. As
1814: there are $N$ different possible $y^*$, there is always a
1815: $\mu\!\in\!M$ for which $p$ gives erroneous outputs in the first
$N\!-\!1$ cycles. The number of errors is $E_{\infty
p}\!\geq\!N\!-\!1\!\stackrel\times=|Y|\stackrel\times=2^{K(y^*)}\stackrel\times=2^{K(\mu)}$
1818: for this $\mu$. As this is true for any $p$, it is also true
1819: for the AI$\xi$ model, hence $E_{k\xi}\!\leq\!2^{K(\mu)}$ is the
1820: best possible error bound we can expect, which depends on $K(\mu)$
1821: only. Actually, we will derive such a bound in section
1822: \ref{secSP} for SP. Unfortunately, as we are mainly interested in
1823: the cycle region $k\ll|Y|\stackrel\times=2^{K(\mu)}$ (see section
1824: \ref{secAImurec}) this bound is trivial.
1825: There are no interesting bounds depending on $K(\mu)$
1826: only, unlike the SP case for deterministic $\mu$. Bounds must
1827: either depend on additional properties of $\mu$ or we have to
1828: consider specialized bounds for restricted problem classes. The
1829: case of probabilistic $\mu$ is similar. Whereas for SP there are
1830: useful bounds in terms of $E_{k\mu}$ and $K(\mu)$, there are no
1831: such bounds for AI$\xi$. Again, this is not a drawback of AI$\xi$
since for no unbiased AI system could the errors/credits be bounded in
1833: terms of $K(\mu)$ and the errors/credits of AI$\mu$ only.
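
As a purely illustrative instance (the size $|Y|\!=\!2^{20}$ is an
assumption chosen for this example, not a value from the text), the
quantities of the OnlyOne example become
$$
  K(y^*) \;=\; 20, \qquad
  N \;\stackrel\times=\; |Y| \;=\; 2^{20} \;\approx\; 10^6, \qquad
  E_{\infty p} \;\geq\; N\!-\!1 \;\stackrel\times=\; 2^{20}
$$
So an environment that can be described with roughly $20$ bits can
already force on the order of $10^6$ errors on every
$\mu$ independent policy.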
1834: 
1835: There is a way to make use of gross (e.g. $2^{K(\mu)}$) bounds.
1836: Assume that after a reasonable number of cycles $k$, the
1837: information $\hh x_{<k}$ perceived by the AI$\xi$ system contains
1838: a lot of information about the true environment $\mu$. The
1839: information in $\hh x_{<k}$ might be coded in any form. Let us
1840: assume that the complexity $K(\mu|\hh x_{<k})$ of $\mu$ under the
1841: condition that $\hh x_{<k}$ is known, is of order 1. Consider a
1842: theorem, bounding the sum of credits or of other quantities over
1843: cycles $1...\infty$ in terms of $f(K(\mu))$ for a function $f$
1844: with $f(O(1))\!=\!O(1)$, like $f(n)\!=\!2^n$. Then, there will be
1845: a bound for cycles $k...\infty$ in terms of $f(K(\mu|\hh
1846: x_{<k}))\!=\!O(1)$. Hence, a bound like $2^{K(\mu)}$ can be
replaced by a small bound $2^{K(\mu|\hh x_{<k})}\!=\!O(1)$ after
1848: a reasonable number of cycles. All one has to
1849: show/ensure/assume is that enough information about $\mu$ is
1850: presented (in any form) in the first $k$ cycles. In this way, even
1851: a gross bound could become useful. In section \ref{secEX} we use a
similar argument to prove that AI$\xi$ is capable of supervised
learning.
1854: 
1855: %------------------------------%
1856: %\paragraph{Asymptotic learnability:}
1857: %------------------------------%
1858: In the following, we weaken (\ref{cximu}) in the hope of getting a
1859: bound applicable to wider problem classes than the passive one.
1860: Consider the I/O sequence $\hh y_1\hh x_1...\hh y_n\hh x_n$ caused
1861: by AI$\xi$. On history $\hh y\!\hh x_{<k}$, AI$\xi$ will output
1862: $\hh y_k\!\equiv\hh y^\xi_k$ in cycle $k$. Let us compare this to
$\hh y^\mu_k$, which AI$\mu$ would output, still on the same history
1864: $\hh y\!\hh x_{<k}$ produced by AI$\xi$. As AI$\mu$ maximizes the
1865: $\mu$ expected credit, AI$\xi$ causes lower (or at best equal)
1866: $C^\mu_{km_k}$, if $\hh y^\xi_k$ differs from $\hh y^\mu_k$. Let
1867: $D_{n\mu\xi}\!:=\!\langle\sum_{k=1}^n 1\!-\!\delta_{\hh
1868: y^\mu_k,\hh y^\xi_k}\rangle_\mu$ be the $\mu$ expected number of
1869: suboptimal choices of AI$\xi$, i.e. outputs different from AI$\mu$
1870: in the first $n$ cycles. One might weigh the deviating cases by
1871: their severity. Especially when the $\mu$ expected credits
1872: $C^\mu_{km_k}$ for $\hh y^\xi_k$ and $\hh y^\mu_k$ are equal or
1873: close to each other, this should be taken into account in the
1874: definition of $D_{n\mu\xi}$. These details do not matter in the
1875: following qualitative discussion. The important difference to
(\ref{cximu}) is that here we stick to the history produced by
1877: AI$\xi$ and count a wrong decision as, at most, one error. The
1878: wrong decision in the Heaven\&Hell example in the first cycle no
1879: longer counts as losing $T$ credits, but counts as one wrong
decision. In a sense, this is fairer. One should not blame somebody
too much for a single wrong decision that was made with too little
information available to decide correctly.
The AI$\xi$ model would deserve to be called
1884: asymptotically optimal, if the probability of making a wrong
1885: decision tends to zero, i.e.\ if
1886: \beq\label{Doon}
1887:   D_{n\mu\xi}/n\to 0 \quad\mbox{for}\quad n\to\infty, \quad\mbox{i.e.}\quad
1888:   D_{n\mu\xi} \;=\; o(n).
1889: \eeq
1890: We say that $\mu$ can be {\it asymptotically learned} (by AI$\xi$)
1891: if (\ref{Doon}) is satisfied. We claim that AI$\xi$ (for
1892: $m_k\!\to\!\infty$) can asymptotically learn every problem $\mu$
1893: of relevance, i.e. AI$\xi$ is asymptotically optimal. We included
1894: the qualifier {\it of relevance}, as we are not sure whether there
1895: could be strange $\mu$ spoiling (\ref{Doon}) but we expect those
1896: $\mu$ to be irrelevant from the perspective of AI. In the field of
1897: Learning, there are many asymptotic learnability theorems, often
1898: not too difficult to prove. So a proof of (\ref{Doon}) might also
1899: be accessible. Unfortunately, asymptotic learnability theorems are
often too weak to be useful from a practical point of view. Nevertheless,
1901: they point in the right direction.
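
To illustrate what (\ref{Doon}) does and does not demand, consider
two hypothetical growth rates for $D_{n\mu\xi}$ (both are assumptions
made only for this illustration, with constants $a\!>\!0$ and
$0\!<\!\varepsilon\!\leq\!1$):
$$
  D_{n\mu\xi} = a\sqrt n \;\;\Longrightarrow\;\;
  D_{n\mu\xi}/n = a/\sqrt n \to 0,
  \qquad\quad
  D_{n\mu\xi} = \varepsilon n \;\;\Longrightarrow\;\;
  D_{n\mu\xi}/n = \varepsilon \not\to 0
$$
The first $\mu$ would be asymptotically learnable although the
number of suboptimal choices is unbounded, the second would not. In
the Heaven\&Hell example the wrong first decision contributes only a
single unit to $D_{n\mu\xi}$, i.e. a share $1/n$ to the ratio in
(\ref{Doon}).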
1902: 
1903: %------------------------------%
1904: %\paragraph{Uniform $\mu$:}
1905: %------------------------------%
From the convergence (\ref{aixitomu}) of $\xi\!\to\!\mu$ we might
1907: expect $C^\xi_{km_k}\!\to\!C^\mu_{km_k}$ and hence, $\hh y^\xi_k$
1908: defined in (\ref{ydotxi}) to converge to $\hh y^\mu_k$ defined in
1909: (\ref{ydotrec}) with $\mu$ probability 1 for $k\!\to\!\infty$.
The first problem is that if the $C_{km_k}$ for
1911: the different choices of $y_k$ are nearly equal, then even if
1912: $C^\xi_{km_k}\!\approx\!C^\mu_{km_k}$, $\hh y^\xi_k\!\neq\!\hh
1913: y^\mu_k$ is possible due to the non-continuity of $\maxarg_{y_k}$. This
1914: can be cured by a weighted $D_{n\mu\xi}$ as described above. More
1915: serious is the second problem we explain for $h_k\!=\!1$ and
1916: $X\!=\!C\!=\!\{0,1\}$. For $\hh
1917: y^\xi_k\!\equiv\!\maxarg_{y_k}\xi(\hh y\!\hh c_{<k}y_k\pb 1)$ to
1918: converge to $\hh y^\mu_k\!\equiv\!\maxarg_{y_k}\mu(\hh y\!\hh
1919: c_{<k}y_k\pb 1)$, it is not sufficient to know that $\xi(\hh
1920: y\!\hh c_{<k}\hh y\!\hh{\pb c}_k)\!\to\!\mu(\hh y\!\hh c_{<k}\hh
1921: y\!\hh{\pb c}_k)$ as has been proved in (\ref{aixitomu}). We need
1922: convergence not only for the true output $\hh y_k$ and credit $\hh
1923: c_k$, but also for alternate outputs $y_k$ and credit 1.
1924: $\hh y^\xi_k$ converges to $\hh y^\mu_k$
1925: if $\xi$ converges uniformly to $\mu$, i.e. if in addition to
1926: (\ref{aixitomu})
1927: \beq\label{uniform}
1928:   \big|\mu(y\!x_{<k}y'_k\pb x'_k)-\xi(y\!x_{<k}y'_k\pb x'_k)\big|
1929:   \;<\; c\!\cdot\!
1930:   \big|\mu(y\!x_{<k}y\!\pb x_k)-\xi(y\!x_{<k}y\!\pb x_k)\big|
1931:   \quad\forall y'_kx'_k
1932: \eeq
1933: holds for some constant $c$ (at least in some $\mu$ expected sense).
1934: We call $\mu$ satisfying (\ref{uniform}) {\it uniform}. For
1935: uniform $\mu$ one can show (\ref{Doon}) with appropriately weighted
1936: $D_{n\mu\xi}$ and bounded horizon $h_k\!<\!h_{max}$. Unfortunately
1937: there are relevant $\mu$ which are not uniform.
1938: Details will be given elsewhere.
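
The first problem, the non-continuity of $\maxarg_{y_k}$, can be seen
in a small numerical illustration. The credit values below are
hypothetical and chosen only to exhibit the effect for binary $y_k$:
$$
\begin{array}{c|cc}
         & y_k\!=\!0 & y_k\!=\!1 \\ \hline
  C^\mu_{km_k} & 0.50 & 0.51 \\
  C^\xi_{km_k} & 0.51 & 0.50
\end{array}
  \qquad\Longrightarrow\qquad
  \hh y^\mu_k = 1 \;\neq\; 0 = \hh y^\xi_k
$$
Although $|C^\xi_{km_k}\!-\!C^\mu_{km_k}|\!=\!0.01$ for both
outputs, the chosen outputs differ. The loss in $\mu$ expected credit
is only $0.01$, which is why weighting the deviations in
$D_{n\mu\xi}$ by their severity removes this problem.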
1939: 
1940: %------------------------------%
1941: %\paragraph{Other concepts:}
1942: %------------------------------%
1943: In the following, we briefly mention some further
1944: concepts. A {\it Markovian} $\mu$ is defined as depending only on the
1945: last output, i.e. $\mu(y\!x_{<k}y\!\pb x_k)\!=\!\mu_k(y\!\pb x_k)$. We
1946: say $\mu$ is {\it generalized Markovian}, if $\mu(y\!x_{<k}y\!\pb
1947: x_k)\!=\!\mu_k(y\!x_{k-l:k-1}y\!\pb x_k)$ for fixed $l$. This
1948: property has some similarities to {\it factorizable} $\mu$ defined
1949: in (\ref{facmu}). If further $\mu_k\!\equiv\!\mu_1\forall k$,
1950: $\mu$ is called {\it stationary}. Further, for all enumerable
1951: $\mu$, $\mu(y\!x_{<k}y\!\pb x_k)$ and $\xi(y\!x_{<k}y\!\pb x_k)$
become independent of $y\!x_{<l}$ for fixed $l$ and $k\!\to\!\infty$
1953: with $\mu$ probability 1. This property, which we want to call
1954: {\it forgetfulness}, will be proved elsewhere.
1955: Further, we say $\mu$ is {\it farsighted}, if
1956: $\lim_{m_k\to\infty}\hh y_k^{(m_k)}$ exists. More details will be given in
1957: the next subsection, where we also give an example of a
1958: possibly relevant $\mu$, which is not farsighted.
1959: 
1960: %------------------------------%
1961: %\paragraph{Concepts:}
1962: %------------------------------%
1963: We have introduced several concepts, which might be useful for
1964: proving credit bounds, including forgetful, relevant, asymptotically
1965: learnable, farsighted, uniform, (generalized) Markovian, factorizable
1966: and (pseudo) passive $\mu$. We have sorted them here, approximately in
1967: the order of decreasing generality. We want to call them {\it
1968: separability concepts}. The more general (like relevant,
1969: asymptotically learnable and farsighted) $\mu$ will be called
1970: weakly separable, the more restrictive (like (pseudo) passive and
1971: factorizable) $\mu$ will be called strongly separable, but we will
1972: use these qualifiers in a more qualitative, rather than rigid
1973: sense. Other (non-separability) concepts are deterministic $\mu$
1974: and, of course, the class of all chronological $\mu$.
1975: 
1976: %------------------------------%
1977: \paragraph{The choice of the horizon:}
1978: %------------------------------%
1979: The only significant arbitrariness in the AI$\xi$ model lies in
1980: the choice of the horizon function
1981: $h_k\!\equiv\!m_k\!-\!k\!+\!1$. We discuss some choices which seem
1982: to be natural and give preliminary conclusions at the end.
1983: We will not discuss ad hoc choices of $h_k$ for
1984: specific problems (like the discussion in section \ref{secSG} in
1985: the context of finite games). We are interested in universal
1986: choices of $m_k$.
1987: 
1988: If the lifetime of the system is known to be $T$, which is in
1989: practice always large but finite, then the choice $m_k\!=\!T$
1990: maximizes correctly the expected future credit. $T$ is usually not
1991: known in advance, as in many cases the time we are willing to run
1992: a system depends on the quality of its outputs. For this reason,
1993: it is often desirable that good outputs are not delayed too much,
1994: if this results in a marginal credit increase only. This can be
1995: incorporated by damping the future credits. If, for instance, we
1996: assume that the survival of the system in each cycle is
proportional to the past credit, an exponential damping
1998: $c_k\!:=\!c'_k\!\cdot\!e^{-\lambda k}$ is appropriate, where
1999: $c'_k$ are bounded, e.g. $c'_k\!\in\![0,1]$. The expression
2000: (\ref{ydotxi}) converges for $m_k\!\to\!\infty$ in this case. But
2001: this does not solve the problem, as we introduced a new arbitrary
2002: time-scale $^1\!/_\lambda$. Every damping introduces a time-scale.
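
The convergence and the appearance of the time-scale can be made
explicit by a short estimate, given here only as a sketch for
$c'_i\!\in\![0,1]$ and $\lambda\!>\!0$:
$$
  \sum_{i=k}^\infty c'_i\,e^{-\lambda i} \;\leq\;
  \sum_{i=k}^\infty e^{-\lambda i} \;=\;
  {e^{-\lambda k}\over 1-e^{-\lambda}} \;\approx\;
  {e^{-\lambda k}\over\lambda}
  \quad\mbox{for}\quad \lambda\ll 1
$$
The geometric sum is finite, and the cycles beyond $k\!+\!h$
contribute only a fraction $e^{-\lambda h}$ of this bound, so the
system effectively looks ahead about $^1\!/_\lambda$ cycles.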
2003: 
2004: Even the time-scale invariant damping factor $k^{-\alpha}$
2005: introduces a dynamic time-scale. In cycle $k$ the contribution of
2006: cycle $2^{1/\alpha}\!\cdot\!k$ is damped by a factor $\1d2$. The
2007: effective horizon $h_k$ in this case is $\sim k$. The choice
2008: $h_k\!=\!\beta\!\cdot\!k$ with $\beta\!\sim\!2^{1/\alpha}$
2009: qualitatively models the same behaviour. We have not introduced an
2010: arbitrary time-scale $T$, but limited the farsightedness to some
2011: multiple (or fraction) of the length of the current history. This
2012: avoids the pre-selection of a global time-scale $T$ or
2013: $^1\!/_\lambda$. This choice has some appeal, as it seems that
2014: humans of age $k$ years usually do not plan their lives for more
2015: than, perhaps, the next $k$ years ($\beta_{human}\!=\!1$). From a
2016: practical point of view this model might serve all needs, but from
a theoretical point of view we feel uncomfortable with such a limitation
in the horizon from the very beginning. Note that we have to
2019: choose $\beta\!=\!O(1)$ because otherwise we would again introduce
2020: a number $\beta$, which has to be justified.
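
The factor quoted for the power damping follows from a one-line
computation, written out here for convenience:
$$
  \big(2^{1/\alpha}\!\cdot\!k\big)^{-\alpha} \;=\;
  2^{-1}\!\cdot\!k^{-\alpha} \;=\; {1\over 2}\,k^{-\alpha}
$$
For example, for $\alpha\!=\!1$ the credit of cycle $2k$ is damped
to half the weight of cycle $k$, and for $\alpha\!=\!2$ this already
happens at cycle $\sqrt 2\!\cdot\!k\approx 1.41\,k$. In both cases
the distance $(2^{1/\alpha}\!-\!1)\!\cdot\!k$ to the half-damped
cycle grows linearly with $k$, which is the dynamic time-scale
mentioned above.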
2021: 
2022: The naive limit $m_k\!\to\!\infty$ in
2023: (\ref{ydotxi}) may turn out to be well defined and the previous discussion
2024: superfluous. In the following, we define a limit which is always
2025: well defined (for finite $|Y|$). Let $\hh y_k^{(m)}$ be defined as
in (\ref{ydotxi}) with $m_k$ replaced by $m$. Further, let $\hh
Y_k^{(m)}\!:=\!\{\,\hh y_k^{(m')}\!:\!m'\!\geq\!m\}$ be the set of
2028: outputs in cycle $k$ for the choices $m_k\!=\!m,m+1,m+2,...$.
2029: Because $\hh Y_k^{(m)}\!\supseteq\!\hh Y_k^{(m+1)}\!\neq\!\{\}$, we
2030: have $\hh Y_k^{(\infty)}\!:=\!\bigcap_{m=k}^\infty\hh
2031: Y_k^{(m)}\!\neq\!\{\}$. We define the $m_k\!=\!\infty$ model to
2032: output any $\hh y_k^{(\infty)}\!\in\!\hh Y_k^{(\infty)}$. This is
2033: the best output consistent with any choice of $m_k$, esp.
2034: $m_k\!\to\!\infty$. Choosing the lexicographically smallest $\hh
2035: y_k^{(\infty)}\!\in\!\hh Y_k^{(\infty)}$ would correspond to the
2036: limes inferior $\underline\lim_{m\to\infty}\hh y_k^{(m)}$. $\hh
2037: y_k^{(\infty)}$ is unique, i.e. $|\hh Y_k^{(\infty)}|\!=\!1$ iff
the naive limit $\lim_{m\to\infty}\hh y_k^{(m)}$ exists. Note
that the limit $\lim_{m\to\infty}C_{km}^\best(y\!x_{<k})$ need
not exist for this construction.
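
Two hypothetical cases illustrate the construction for
$Y\!=\!\{0,1\}$. The behaviour of $\hh y_k^{(m)}$ assumed in each
case is invented for this illustration only:
$$
\begin{array}{l@{\quad\Longrightarrow\quad}l}
  \hh y_k^{(m)} = m \bmod 2 \;\;\forall m &
  \hh Y_k^{(m)} = \{0,1\} \;\;\forall m, \quad
  \hh Y_k^{(\infty)} = \{0,1\} \\[1ex]
  \hh y_k^{(m)} = 1 \;\;\forall m\!\geq\!m_0 &
  \hh Y_k^{(m)} = \{1\} \;\;\forall m\!\geq\!m_0, \quad
  \hh Y_k^{(\infty)} = \{1\}
\end{array}
$$
In the first case the naive limit does not exist, $|\hh
Y_k^{(\infty)}|\!=\!2$, and the lexicographically smallest element is
$\hh y_k^{(\infty)}\!=\!0$. In the second case the naive limit exists
and coincides with the unique element of $\hh Y_k^{(\infty)}$.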
2041: 
2042: The construction above leads to a mathematically elegant,
2043: no-parameter AI$\xi$ model. Unfortunately this is not the end of
2044: the story. The limit $m_k\!\to\!\infty$ can cause undesirable
results in the AI$\mu$ model for special $\mu$, which might also happen
in the AI$\xi$ model however we define $m_k\!\to\!\infty$.
2047: Consider $Y\!=\!C\!=\!\{0,1\}$ and $X'\!=\!\{\}$. Output $y_k\!=\!0$ shall give credit
2048: $c_k\!=\!0$, output $y_k\!=\!1$ shall give $c_k\!=\!1$ iff $\hh
2049: y_{k-l-\sqrt l}...\hh y_{k-l}\!=\!0...0$ for some $l$. I.e. the system can
2050: achieve $l$ consecutive positive credits if there was a sequence
2051: of length at least $\sqrt l$ with $y_k\!=\!c_k\!=\!0$. If the lifetime of the
2052: AI$\mu$ system is $T$, it outputs $\hh y_k\!=\!0$ in the first $r$ cycles
2053: and then $\hh y_k\!=\!1$ for the remaining $r^2$ cycles with
2054: $r$ such that $r+r^2=T$. This will lead to the highest possible
total credit $C_{1T}\!=\!r^2\!=\!T\!+\!^1\!\!/_2\!-\!\sqrt{T+^1\!\!/_4}$. Any fragmentation of the
2056: $0$ and $1$ sequences would reduce this. For $T\!\to\!\infty$ the
2057: AI$\mu$ system can and will delay the point $r$ of switching to
2058: $\hh y_k\!=\!1$ indefinitely and always output $0$ with total
2059: credit $0$, obviously the worst possible behaviour. The AI$\xi$
system will discover the above rule after a while of trying
$y_k\!=\!0/1$ and then apply the same behaviour as the AI$\mu$
2062: system, since the simplest rules covering past data dominate $\xi$.
2063: For finite $T$ this is exactly what we want, but for infinite $T$
2064: the AI$\xi$ model fails just as the AI$\mu$ model does. The good point
2065: is, that this is not a weakness of the AI$\xi$ model, as AI$\mu$
2066: fails too and no system can be better than AI$\mu$. The bad point
2067: is that $m_k\!\to\!\infty$ has far reaching consequences, even when
starting from an already very large $m_k\!=\!T$. The reason is that
2069: the $\mu$ of this example is highly non-local in time, i.e. it
2070: may violate one of our weak separability conditions.
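
The numbers in this example can be checked by solving the quadratic
equation for the switching point $r$. The following lines are a
sketch that ignores boundary effects of order $1$:
$$
  r^2+r-T=0 \quad\Longrightarrow\quad
  r \;=\; \sqrt{T+{1\over 4}}-{1\over 2}, \qquad
  C_{1T} \;=\; r^2 \;=\; T-r
$$
since each of the $r^2$ cycles with output $1$ earns credit $1$. For
instance, $T\!=\!110$ gives $r\!=\!10$ and $C_{1T}\!=\!100$. The
fraction $C_{1T}/T\!=\!1\!-\!r/T$ tends to $1$ for growing finite
$T$, whereas the system that delays the switch indefinitely collects
credit $0$. This discontinuity between large finite $T$ and
$T\!=\!\infty$ is what makes the limit problematic.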
2071: 
2072: In the last paragraph we have considered the consequences of
2073: $m_k\!\to\!\infty$ in the AI$\mu$ model. We now consider
2074: whether the AI$\xi$ model is a good approximation of the
2075: AI$\mu$ model for large $m_k$. Another objection against too large
2076: choices of $m_k$ is that $\xi(y\!x_{<k}y\!\pb x_{k:m_k})$ has been proved to be a
2077: good approximation of $\mu(y\!x_{<k}y\!\pb x_{k:m_k})$ only for
2078: $k\!\gg\!h_k$, which is never satisfied for $m_k\!=\!T$ or
2079: $m_k\!=\!\infty$. We have seen that, for factorizable
2080: $\mu$, the limit $h_k\!\to\!\infty$ causes
2081: no problem, as from a certain $h_k$ on the output $\hh y_k$ is
2082: independent of $h_k$. As $\xi\!\to\!\mu$ for bounded $h_k$, $\xi$
2083: will develop this separability property too. So, from a
2084: certain $k_0$ on the limit $h_k\!\to\!\infty$ might also be safe
for $\xi$. Therefore, taking the limit from the very beginning may worsen
the behaviour of AI$\xi$ only for finitely many cycles
2087: $k\!\leq k_0$, which would be acceptable. We suppose that the
2088: valuations $c_{k'}$ for $k'\!\gg\!k$, where $\xi$ can no longer
2089: be trusted as a good approximation to $\mu$, are in some sense
2090: randomly disturbed with decreasing influence on the choice of $\hh
2091: y_k$. This claim is supported by the forgetfulness property of $\xi$.
2092: 
2093: We are not sure whether the choice of $m_k$ is of marginal
2094: importance, as long as $m_k$ is chosen sufficiently large and of
2095: low complexity, $m_k=2^{2^{16}}$ for instance, or whether the choice of
2096: $m_k$ will turn out to be a central topic for the AI$\xi$ model or
2097: for the planning aspect of any AI system in general. We suppose
2098: that the limit $m_k\!\to\!\infty$ for the AI$\xi$ model results in
2099: correct behaviour for weakly separable $\mu$, and that even the naive
limit exists, but proving this would probably give interesting
2101: insights.
2102: 
2103: \newpage
2104: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2105: \section{Sequence Prediction (SP)}\label{secSP}
2106: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2107: We have introduced the
2108: AI$\xi$ model as a unification of the ideas of decision theory and
2109: universal probability distribution. We might expect AI$\xi$ to
2110: behave identically to SP$\Theta_\xi$, when faced with a sequence
2111: prediction problem, but things are not that simple, as we will see.
2112: 
2113: %------------------------------%
2114: \paragraph{Using the AI$\mu$ Model for Sequence Prediction:}
2115: %------------------------------%
2116: % 9910(15) 9911(7)
2117: We have seen in the last section how to predict sequences for
2118: known and unknown prior distribution $\mu^{SP}$. Here we consider binary
2119: sequences\footnote{We use $z_k$ to avoid notational conflicts with
the system's inputs $x_k$.} $z_1z_2z_3...\in I\!\!B^\infty$ with known prior
2121: probability $\mu^{SP}(\pb{z_1z_2z_3...})$.
2122: 
2123: We want to show
2124: how the AI$\mu$ model can be used for sequence prediction.
2125: We will see that it gives the same prediction as the SP$\Theta_\mu$ system.
2126: First, we have to specify {\it how} the AI$\mu$ model should be used
2127: for sequence prediction. The following choice is natural:
2128: 
The system's output $y_k$ is interpreted as a prediction for the
2130: $k^{th}$ bit $z_k$ of the string, which has to be predicted. This
2131: means that $y_k$ is binary ($y_k\!\in\!I\!\!B\!=:\!Y$). As a
2132: reaction of the environment, the system receives credit $c_k\!=\!1$
2133: if the prediction was correct ($y_k\!=\!z_k$), or $c_k\!=\!0$ if
2134: the prediction was erroneous ($y_k\!\neq\!z_k$). The question is
2135: what the input $x'_k$ of the next cycle should be. One choice
2136: would be to inform the system about the correct $k^{th}$ bit of
the last cycle of the string and set $x'_k=z_k$. But the true
bit $z_k=\delta_{y_kc_k}$ can already be inferred from
the credit $c_k$ in conjunction with the prediction $y_k$, so this
information is redundant. Here $\delta$ is the Kronecker symbol, defined as
$\delta_{ab}\!=\!1$ for $a\!=\!b$ and $0$ otherwise. There is no
need for this additional feedback. So we set
$x'_k\!=\!\epsilon\!\in\!X\!=\!\{\epsilon\}$ thus having $x_k\!\equiv\!c_k$. The
system's performance would not change if we included this
redundant information; it would merely complicate the notation. The prior
2146: probability $\mu^{AI}$ of the AI$\mu$ model is
2147: \beq\label{muaisp}
2148:   \mu^{AI}(y_1\pb x_1 ...y_k\pb x_k) \;=\;
2149:   \mu^{AI}(y_1\pb c_1...y_k\pb c_k) \;=\;
2150:   \mu^{SP}(\pb{\delta_{y_1 c_1}...\delta_{y_k c_k}}) \;=\;
2151:   \mu^{SP}(\pb{z_1...z_k})
2152: \eeq
2153: In the following, we will drop the superscripts of $\mu$ because they
are clear from the arguments of $\mu$ and the $\mu$ are equal in any case.
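
The decoding $z_k\!=\!\delta_{y_kc_k}$ used in (\ref{muaisp}) can be
verified case by case for binary $y_k$ and $c_k$. The table is only a
check of the definition:
$$
\begin{array}{cc|c|l}
  y_k & c_k & z_k=\delta_{y_kc_k} & \\ \hline
  0 & 1 & 0 & \mbox{prediction $0$ was correct} \\
  1 & 1 & 1 & \mbox{prediction $1$ was correct} \\
  0 & 0 & 1 & \mbox{prediction $0$ was wrong} \\
  1 & 0 & 0 & \mbox{prediction $1$ was wrong}
\end{array}
$$
As an example, the history $y_1c_1y_2c_2\!=\!1100$ (a correct
prediction followed by a wrong one) corresponds to the string
$z_1z_2\!=\!11$, so that (\ref{muaisp}) reads
$\mu^{AI}(1\pb 1\,0\pb 0)\!=\!\mu^{SP}(\pb{11})$.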
2155: 
2156: The formula (\ref{airec2}) for the expected credit reduces to
2157: \beq\label{eerecsp}
2158:   C_{km}^\best(y\!x_{<k}) \;=\;
2159:   \max_{y_k}\sum_{c_k}
2160:   [c_k+C_{k+1,m}^\best(y\!x_{1:k})] \!\cdot\!
2161:   \mu(\delta_{y_1c_1}...\delta_{y_{k-1}c_{k-1}}\pb{\delta_{y_kc_k}})
2162: \eeq
The first observation we can make is that for this special
2164: $\mu$, $C_{km}^\best$ only depends on $\delta_{y_ic_i}$, i.e.
2165: replacing $y_i$ and $c_i$ simultaneously with their complements
2166: does not change the value of $C_{km}^\best$. We have a symmetry in
2167: $y_ic_i$. For $k\!=\!m\!+\!1$ this is definitely true as
2168: $C_{m+1,m}^\best\!=\!0$ in this case (see (\ref{ee0})). For
2169: $k\!\leq\!m$ we prove it by induction. The r.h.s.\ of
2170: (\ref{eerecsp}) is symmetric in $y_ic_i$ for $i\!<\!k$ because
2171: $\mu$ possesses this symmetry and $C_{k+1,m}^\best$ possesses it by induction
2172: hypothesis, so the symmetry holds for the l.h.s., which completes
2173: the proof. The prediction $\hh y_k$ is
2174: \beq\label{ebestysp}
2175:   \hh y_k \;=\; \maxarg_{y_k}
2176:   C_{km_k}^\best(\hh y\!\hh x_{<k}y_k) \;=\;
2177:   \maxarg_{y_k}\sum_{c_k}[c_k+C_{k+1,m_k}^\best(y\!x_{1:k})]
2178:   \!\cdot\!\mu(...\pb{\delta_{y_kc_k}}) \;=\;
2179: \eeq
2180: $$
2181:   \;=\; \maxarg_{y_k}\sum_{c_k}c_k
2182:   \!\cdot\!\mu(\delta_{\hh y_1\hh c_1}...\pb{\delta_{y_kc_k}}) \;=\;
2183:   \maxarg_{y_k}\mu(\hh z_1...\hh z_{k-1}\pb y_k) \;=\;
2184:   \maxarg_{z_k}\mu(\hh z_1...\hh z_{k-1}\pb z_k)
2185: $$
2186: The first equation is the definition of the system's prediction
2187: (\ref{pbestrec}). In the second equation, we have inserted
2188: (\ref{ebesty}) which gives the r.h.s.\ of (\ref{eerecsp}) with
2189: $\max_{y_k}$ replaced by $\maxarg_{y_k}$. $\sum_c
f(...\delta_{yc}...)$ is independent of $y$ for any function $f$
depending on the combination $\delta_{yc}$ only. Therefore, the
2192: $\sum_cC^\best\mu$ term is independent of $y_k$ because
2193: $C_{k+1,m}^\best$ as well as $\mu$ depend on $\delta_{y_kc_k}$ only. In
2194: the third equation, we can therefore drop this term, as adding a
2195: constant to the argument of $\maxarg_{y_k}$ does not change the
2196: location of the maximum. In the second last equation we evaluated
2197: the $\sum_{c_k}$. Further, if the true credit to $\hh y_i$ is $\hh
2198: c_i$ the true $i^{th}$ bit of the string must be $\hh
2199: z_i\!=\!\delta_{\hh y_i\hh c_i}$. The last equation is just a renaming.
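
The two elementary facts used in this derivation can be written out
explicitly for binary $y$ and $c$, with $f$ denoting an arbitrary
function of $\delta_{yc}$:
$$
  \sum_{c\in\{0,1\}}f(\delta_{yc}) \;=\;
  f(\delta_{y0})+f(\delta_{y1}) \;=\; f(0)+f(1)
  \quad\mbox{for}\quad y\!=\!0 \quad\mbox{and}\quad y\!=\!1
$$
because $\{\delta_{y0},\delta_{y1}\}\!=\!\{0,1\}$ in both cases.
This is why the $\sum_cC^\best\mu$ term drops out. For the remaining
term only $c_k\!=\!1$ contributes, and $\delta_{y_k1}\!=\!y_k$ for
binary $y_k$, hence
$$
  \sum_{c_k}c_k\!\cdot\!\mu(...\pb{\delta_{y_kc_k}}) \;=\;
  \mu(...\pb{\delta_{y_k1}}) \;=\; \mu(...\pb y_k)
$$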
2200: 
So, the AI$\mu$ model predicts the $z_k$ that has maximal $\mu$
2202: probability, given $\hh z_1...\hh z_{k-1}$. This prediction is
2203: independent of the choice of $m_k$. It is exactly the prediction
2204: scheme of the deterministic sequence prediction with known prior
2205: SP$\Theta_\mu$ described in the last section. As this model was
2206: optimal, AI$\mu$ is optimal, too, i.e. has minimal number of
2207: expected errors (maximal expected credit) as compared to any other
2208: sequence prediction scheme.
2209: 
2210: From this, it is already clear that the total expected credit
2211: $C_{km}$ must be related to the expected sequence prediction error
2212: $E_{m\Theta_\mu}$ (\ref{esp}). Let us prove directly that
2213: $C_{1m}(\epsilon)\!+\!E_{m\Theta_\mu\!}=m$.
2214: We rewrite $C_{km}^\best$ in (\ref{eerecsp})
2215: as a function of $z_i$ instead of $y_ic_i$ as it
is symmetric in $y_ic_i$. Further, we can pull the $C_{k+1,m}^\best$ term out of
the maximization, as it is independent of $y_k$, similar to
(\ref{ebestysp}). Renaming the bound variables $y_k$ and $c_k$
2219: we get
2220: \beq\label{ebr2}
2221:   C_{km}^\best(z_{<k}) \;=\;
2222:   \max_{z_k}\mu(z_{<k}\pb z_k) +
2223:   \sum_{z_k}C_{k+1,m}^\best(z_{1:k})
2224:   \!\cdot\!\mu(z_{<k}\pb z_k)
2225: \eeq
2226: Recursively inserting the l.h.s.\ into the r.h.s.\ we get
2227: \beq\label{ebi2}
2228:   C_{km}^\best(z_{<k}) \;=\;
2229:   \sum_{i=k}^m\nq\;\sum_{\quad z_{k:i-1}}\nq\max_{z_i}
2230:   \mu(z_{<k}\pb{z_{k:i}})
2231: \eeq
2232: This is most easily proven by induction. For $k\!=\!m$
2233: we have $C_{mm}^\best(z_{<m})\!=\!\max_{z_m}\mu(z_{<m}\pb
2234: z_m)$ from (\ref{ebr2}) and (\ref{ee0}), which equals (\ref{ebi2}). By induction
2235: hypothesis, we assume that
(\ref{ebi2}) is true for $k\!+\!1$. Inserting this into
2237: (\ref{ebr2}) we get
2238: $$
2239:   C_{km}^\best(z_{<k})
2240:   \;=\;
2241:   \max_{z_k}\mu(z_{<k}\pb z_k) +
2242:   \sum_{z_k}\left[
2243:   \sum_{i=k+1}^m\nq\;\sum_{\quad z_{k+1:i-1}}\max_{z_i}
2244:   \mu(z_{1:k}\pb z_{k+1:i})
2245:   \right]\mu(z_{<k}\pb z_k) \;=\;
2246: $$
2247: $$
2248:   \;=\; \max_{z_k}\mu(z_{<k}\pb z_k) +
2249:   \sum_{i=k+1}^m\nq\;\sum_{\quad z_{k:i-1}}\max_{z_i}
2250:   \mu(z_{<k}\pb z_{k:i})
2251: $$
2252: which equals (\ref{ebi2}). This was the induction step and hence
2253: (\ref{ebi2}) is proven.
2254: 
By setting $k\!=\!1$ and slightly reformulating (\ref{ebi2}),
2256: we get the total expected credit in the first $m$ cycles
2257: $$
  C_{1m}^\best(\epsilon) \;=\;
2259:   \sum_{i=1}^m\;\sum_{z_{<i}}\mu(\pb z_{<i})
2260:   \max\{\mu(z_{<i}\pb 0),\mu(z_{<i}\pb 1)\} \;=\;
2261:   m-E_{m\Theta_\mu}
2262: $$
2263: with $E_{m\Theta_\mu}$ defined in (\ref{esp}).
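
As a simple check of this relation, consider a hypothetical
independent and identically distributed bit source with
$\mu(z_{<i}\pb 1)\!=\!0.7$ for all $i$ and all $z_{<i}$. The value
$0.7$ is an assumption made only for this example. Then the maximum
in every term equals $0.7$ and the probabilities $\mu(\pb z_{<i})$
sum to $1$, so
$$
  C_{1m}^\best(\epsilon) \;=\;
  \sum_{i=1}^m\;\sum_{z_{<i}}\mu(\pb z_{<i})\!\cdot\!0.7 \;=\;
  0.7\!\cdot\!m
  \qquad\Longrightarrow\qquad
  E_{m\Theta_\mu} \;=\; m-0.7\!\cdot\!m \;=\; 0.3\!\cdot\!m
$$
The system always predicts $1$, is right with probability $0.7$ in
each cycle, and errs with probability $0.3$, as expected.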
2264: 
2265: %------------------------------%
2266: \paragraph{Using the AI$\xi$ Model for Sequence Prediction:}
2267: %------------------------------%
2268: Now we want to use the universal AI$\xi$ model instead of
2269: AI$\mu$ for sequence prediction and try to derive error bounds
analogous to (\ref{spebound}).
As in the AI$\mu$ case, the system's output $y_k$ in cycle $k$ is
interpreted as a prediction for the $k^{th}$ bit $z_k$ of the
2273: string, which has to be predicted. The credit is
2274: $c_k=\delta_{y_kz_k}$ and there are no other inputs
$x'_k=\epsilon$. What makes the analysis more difficult is that $\xi$ is not
2276: symmetric in $y_ic_i\leftrightarrow(1-y_i)(1-c_i)$ and
2277: (\ref{muaisp}) does not hold for $\xi$. On the other hand,
2278: $\xi^{AI}$ converges to $\mu^{AI}$ in the limit (\ref{aixitomu}), and
2279: (\ref{muaisp}) should hold asymptotically for $\xi$ in some sense.
2280: So we expect that everything proven for AI$\mu$ holds
2281: approximately for AI$\xi$. The AI$\xi$ model should behave
2282: similarly to SP$\Theta_\xi$, the deterministic variant of Solomonoff prediction.
In particular, we expect error bounds similar to (\ref{spebound}). Making
2284: this rigorous seems difficult. Some general remarks have been made
2285: in the last section.
2286: 
2287: Here we concentrate on the special case of a deterministic
2288: computable environment, i.e. the environment is a sequence
2289: $\hh z\!=\!\hh z_1\hh z_2...$, $K(\hh z_1...\hh z_n*)\!\leq\!K(\hh
2290: z)\!<\!\infty$. Furthermore, we only consider the simplest
2291: horizon model $m_k\!=\!k$, i.e. maximize only the next
2292: credit. This is sufficient for sequence prediction, as the credit
2293: of cycle $k$ only depends on output $y_k$ and not on earlier
2294: decisions. This choice is in no way sufficient and satisfactory
2295: for the full AI$\xi$ model, as {\it one} single choice of $m_k$ should
2296: serve for {\it all} AI problem classes. So AI$\xi$ should allow
2297: good sequence prediction for some universal choice of $m_k$ and not
2298: only for $m_k\!=\!k$, which definitely does not suffice for more
2299: complicated AI problems. The analysis of this general case is a challenge for the future.
2300: For $m_k\!=\!k$ the AI$\xi$ model
2301: (\ref{ydotxi}) with $x'_i\!=\!\epsilon$ reduces to
2302: \beq\label{ydotxisp}
2303:   \hh y_k \;=\; \maxarg_{y_k}\sum_{c_k}c_k\!\cdot\!
2304:   \xi(\hh y\!\hh c_{<k}y\!\pb c_k) \;=\;
2305:   \maxarg_{y_k}\xi(\hh y\!\hh c_{<k}y_k\pb 1) \;=\;
2306:   \maxarg_{y_k}\xi(\hh y\!\hh{\pb c}_{<k}y_k\pb 1)
2307: \eeq
2308: The environmental response $\hh c_k$ is given by $\delta_{\hh y_k\hh
2309: z_k}$; it is 1 for a correct prediction $(\hh y_k\!=\!\hh z_k)$ and 0
2310: otherwise. In the following, we want to bound the number of errors
2311: this prediction scheme makes. We need the following inequality
2312: \beq\label{spineq}
2313:   \xi(y\!\pb c_1...y\!\pb c_k) \;>\;
2314:   2^{-K(\delta_{y_1c_1}...\delta_{y_kc_k}*)-O(1)}
2315: \eeq
2316: We have to find a short program in the sum
2317: (\ref{uniMAI}) calculating $c_1...c_k$ from $y_1...y_k$. If we
2318: knew $z_i:=\delta_{y_ic_i}$ for $1\!\leq\!i\!\leq\!k$ a program of
2319: size $O(1)$ could calculate
2320: $c_1...c_k=\delta_{y_1z_1}...\delta_{y_kz_k}$. So combining this program with
2321: a shortest coding of $z_1...z_k$ leads to a program of size
2322: $K(z_1...z_k*)\!+\!O(1)$, which proves (\ref{spineq}).
2323: 
2324: Let us now assume that we make a wrong prediction in cycle $k$,
2325: i.e. $\hh c_k\!=\!0$, $\hh y_k\neq \hh z_k$. The goal is to
show that $\hh\xi_k$ defined by
2327: \beqn
2328:   \hh\xi_k \;:=\; \xi(\hh y\!\pb{\hh c}_{1:k}) \;=\;
2329:   \xi(\hh y\pb{\hh c}_{<k}\hh y_k\pb 0) \;\leq\;
2330:   \xi(\hh y\pb{\hh c}_{<k}) -
2331:   \xi(\hh y\pb{\hh c}_{<k}\hh y_k\pb 1) \;<\;
2332:   \hh\xi_{k-1}-\alpha
2333: \eeqn
2334: decreases for every wrong prediction, at least by some $\alpha$.
2335: The $\leq$ arises from the fact that $\xi$ is only a semimeasure.
2336: \beqn
2337:   \xi(\hh y\!\pb{\hh c}_1...\hh y\pb 1) \;>\;
2338:   \xi(\hh y_1\pb{\hh c}_1...(1\!-\!\hh y_k)\pb 1) \;\stackrel{\times}{>}\;
2339:   2^{-K(\delta_{\hh y_1\hh c_1}...\delta_{(1-\hh y_k)1}*)}
2340:   \;=\;
2341: \eeqn
2342: \beqn
2343:   \;=\; 2^{-K(\hh z_1...\hh z_k*)} \;>\;
2344:   2^{-K(\hh z)-O(1)} \;=:\; \alpha
2345: \eeqn
2346: In the first inequality we have used the fact that $\hh y_k$
2347: maximizes by definition (\ref{ydotxisp}) the argument, i.e.
2348: $1\!-\!\hh y_k$ has lower probability than $\hh y_k$. (\ref{spineq}) has been
2349: applied in the second inequality. The equality holds, because
2350: $\hh z_i\!=\!\delta_{\hh y_i\hh c_i}$ and
2351: $\delta_{(1-\hh y_k)1}\!=\!\delta_{\hh y_k0}\!=\!\delta_{\hh y_k\hh
2352: c_k}\!=\!\hh z_k$. The last inequality follows from the
2353: definition of $\hh z$.
2354: 
2355: We have shown that each erroneous prediction reduces $\hh\xi$ by at
2356: least the $\alpha$ defined above. Together with $\hh\xi_0\!=\!1$ and
2357: $\hh\xi_k\!>\!0$ for all $k$ this shows that the system can make
2358: at most $1/\alpha$ errors, since otherwise $\hh\xi_k$ would become
2359: negative. So the number of wrong predictions $E_{n\xi}^{AI}$ of system
2360: (\ref{ydotxisp}) is bounded by
2361: \beq\label{Ebndsp}
2362:   E_{n\xi}^{AI} \;<\; {\textstyle{1\over\alpha}} \;=\;
2363:   2^{K(\hh z)+O(1)} \;<\; \infty
2364: \eeq
2365: for a computable deterministic environment string $\hh z_1\hh
2366: z_2...$. The intuitive interpretation is that each wrong
2367: prediction eliminates at least one program $p$ of size
2368: $l(p)\!\stackrel+<\!K(\hh z)$. The size is smaller than $K(\hh z)$, as
2369: larger programs could not mislead the system into a wrong
2370: prediction, since there is a program of size $K(\hh z)$ making a correct
2371: prediction. There are at most $2^{K(\hh z)+O(1)}$ such programs,
2372: which bounds the total number of errors.
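
The counting step behind (\ref{Ebndsp}) can be written out explicitly. The following is only a sketch under the assumptions already used above: $\hh\xi_k$ never increases with $k$ (since $\xi$ is a semimeasure) and drops by more than $\alpha$ in every cycle with a wrong prediction. If the system has made $E_{n\xi}^{AI}\!\geq\!1$ errors in the first $n$ cycles, then
\beqn
  0 \;<\; \hh\xi_n \;<\; \hh\xi_0 - E_{n\xi}^{AI}\!\cdot\!\alpha \;=\;
  1 - E_{n\xi}^{AI}\!\cdot\!\alpha
  \quad\ff\quad E_{n\xi}^{AI} \;<\; {\textstyle{1\over\alpha}}
\eeqn
For $E_{n\xi}^{AI}\!=\!0$ the bound holds trivially.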
2373: 
2374: We have derived a finite bound for $E_{n\xi}^{AI}$ but, unfortunately, a
2375: rather weak one compared to (\ref{spebound}). The reason for the
2376: strong bound in the SP case was that every error at least halves
2377: $\hh\xi$ because the sum of the $\maxarg_{x_k}$ arguments was 1.
2378: Here we have
2379: \bqan
2380:   \xi(\hh y_1\hh c_1...\hh y_{k-1}\hh c_{k-1}0\pb 0) +
2381:   \xi(\hh y_1\hh c_1...\hh y_{k-1}\hh c_{k-1}0\pb 1) = 1 \\
2382:   \xi(\hh y_1\hh c_1...\hh y_{k-1}\hh c_{k-1}1\pb 0) +
2383:   \xi(\hh y_1\hh c_1...\hh y_{k-1}\hh c_{k-1}1\pb 1) = 1
2384: \eqan
2385: but $\maxarg_{y_k}$ runs over the right top and right bottom
2386: $\xi$, for which no sum criterion holds.
2387: 
2388: The AI$\xi$ model would not be sufficient for
2389: realistic applications if the bound (\ref{Ebndsp}) were sharp,
2390: but we have the strong feeling (though only weak
2391: arguments) that better bounds proportional to $K(\hh z)$,
2392: analogous to (\ref{spebound}), exist. The technique used above may not
2393: be appropriate for achieving this. One argument for a better bound is
2394: the formal similarity between $\maxarg_{z_k}\xi(\hh z_{<k}z_k)$ and (\ref{ydotxisp});
2395: the other is that we were unable to construct an example sequence
2396: for which (\ref{ydotxisp}) makes more than $O(K(\hh z))$ errors.
2397: 
2398: \newpage
2399: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2400: \section{Strategic Games (SG)}\label{secSG}
2401: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2402: 
2403: %------------------------------%
2404: \paragraph{Introduction:}
2405: %------------------------------%
2406: Strategic games, like chess, form a very important class of problems.
2407: In fact, what is subsumed under game theory nowadays is so
2408: general that it includes a huge variety of games, from simple
2409: games of chance like roulette, through games combining chance with strategy like
2410: backgammon, up to purely strategic games like chess, checkers, or
2411: Go. Game theory can also describe political and economic competition and
2412: coalitions; even Darwinism and much more has been modeled within game theory.
2413: It seems that nearly every AI problem could be brought into
2414: the form of a game. Nevertheless,
2415: the intention of a game is that several players perform
2416: some actions with (partially) observable consequences.
2417: The goal of each player is to maximize some utility
2418: function (e.g.\ to win the game). The players are assumed to be
2419: rational, taking into account all information they possess. The
2420: different goals of the players are usually in conflict.
2421: For an introduction to game theory, see \cite{Fud91,Osb94,Rus95,Neu44}.
2422: 
2423: If we interpret the AI system as one player and let the environment
2424: model the other rational player {\it and} provide
2425: the reinforcement feedback $c_k$, we see that the system-environment
2426: configuration satisfies all criteria of a game. On the other hand,
2427: we know that the AI system can handle more general situations,
2428: since it interacts optimally with an environment, even if the environment
2429: is not a rational player with conflicting goals.
2430: 
2431: %------------------------------%
2432: \paragraph{Strictly competitive strategic games:}
2433: %------------------------------%
2434: In the following, we restrict ourselves to deterministic, strictly
2435: competitive strategic\footnote{In game theory, games like chess
2436: are often called `extensive', whereas `strategic' is reserved for a
2437: different kind of game.} games with alternating moves. Player 1
2438: makes move $y_k'$ in round $k$, followed by the move $x_k'$ of player
2439: 2. So a game with $n$ rounds consists of a sequence of alternating
2440: moves $y'_1x'_1y'_2x'_2...y'_nx'_n$. At the end of the game in cycle $n$
2441: the game or final board state is evaluated with
2442: $C(y'_1x'_1...y'_nx'_n)$. Player 1 tries to maximize $C$, whereas player 2
2443: tries to minimize $C$. In the simplest case, $C$ is $1$ if player 1
2444: won the game, $C\!=\!-1$ if player 2 won and $C\!=\!0$ for a draw. We
2445: assume a fixed game length $n$ independent of the actual move
2446: sequence. For games with variable length but maximal possible number of
2447: moves $n$, we could add dummy moves
2448: and pad the length to $n$. The optimal strategy (Nash equilibrium)
2449: of both players is a minimax strategy
2450: \beq\label{sgxdot}
2451:   \hh x'_k=\minarg_{x'_k}\max_{y'_{k+1}}\min_{x'_{k+1}}...\max_{y'_n}\min_{x'_n}
2452:   C(\hh y'_1\hh x'_1...\hh y'_kx'_k...y'_nx'_n)
2453: \eeq
2454: \beq\label{sgydot}
2455:   \hh y'_k=\maxarg_{y'_k}\min_{x'_k}...\max_{y'_n}\min_{x'_n}
2456:   C(\hh y'_1\hh x'_1...\hh y'_{k-1}\hh x'_{k-1}y'_kx'_k...y'_nx'_n)
2457: \eeq
2458: Note, however, that the minimax strategy is only optimal if both players
2459: behave rationally. If, for instance, player 2 has limited capabilities or makes
2460: errors, and player 1 is able to discover these (through past moves), he
2461: could exploit them and improve his performance
2462: by deviating from the minimax strategy. The classical
2463: game theory of Nash equilibria, at least, does not take into account limited
2464: rationality, whereas the AI$\xi$ system should.
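
As a small illustration of (\ref{sgxdot}) and (\ref{sgydot}), consider a hypothetical game with $n\!=\!1$ and binary moves. The payoff values below are assumptions chosen only for this example:
\beqn
  C(00)=1,\quad C(01)=-1,\quad C(10)=C(11)=0
\eeqn
\beqn
  \min_{x'_1}C(0x'_1)=-1,\quad \min_{x'_1}C(1x'_1)=0
  \quad\ff\quad \hh y'_1=\maxarg_{y'_1}\min_{x'_1}C(y'_1x'_1)=1
\eeqn
Against a rational player 2 the minimax move $\hh y'_1\!=\!1$ secures a draw. If, however, player 1 had learned from past games that player 2 always answers with $x'_1\!=\!0$, the move $y'_1\!=\!0$ would yield $C(00)\!=\!1$, which is better than the minimax value $0$.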
2465: 
2466: %------------------------------%
2467: \paragraph{Using the AI$\mu$ model for game playing:}
2468: %------------------------------%
2469: In the following, we demonstrate the applicability of the AI model
2470: to games. The AI system takes the position of player 1. The
2471: environment provides the evaluation $C$. For a symmetric situation
2472: we could take a second AI system as player 2, but for simplicity we
2473: take the environment as the second player and assume that this
2474: environmental player behaves according to the minimax strategy (\ref{sgxdot}).
2475: The environment serves as a perfect player {\it and} as a teacher, albeit a
2476: very crude one, as it tells the system only at the end of the game
2477: whether it has won or lost.
2478: 
2479: The minimax behaviour of player 2 can be expressed by a
2480: (deterministic) probability distribution $\mu^{SG}$ as
2481: follows
2482: \beq\label{defmusg}
2483:   \mu^{SG}(y'_1\pb x'_1...y'_n\pb x'_n) \;:=\;
2484:   \left\{
2485:   \begin{array}{l}
2486:     \displaystyle
2487:     1 \quad\mbox{if}\quad
2488:     x'_k=\minarg_{x''_k}...\max_{y''_n}\min_{x''_n}
2489:     C(y'_1x'_1...y'_kx''_k...y''_nx''_n)
2490:     \;\;\forall\; 1\!\leq\!k\!\leq\!n
2491:     \\
2492:     0 \quad\mbox{otherwise}
2493:   \end{array} \right.
2494: \eeq
2495: The probability that player 2 makes move $x'_k$ is
2496: $\mu^{SG}(\hh y'_1\!\hh x'_1...\hh y'_k\pb x'_k)$ which is 1 for
2497: $x'_k\!=\!\hh x'_k$ as defined in (\ref{sgxdot}) and 0 otherwise.
2498: 
2499: Clearly, the AI system receives no feedback, i.e.
2500: $c_1\!=...=\!c_{n-1}\!=\!0$, until the end of the game, where it should
2501: receive positive/negative/neutral feedback on a win/loss/draw, i.e.
2502: $c_n=C(...)$. The environmental prior probability is therefore
2503: \beq\label{muaisg}
2504:   \mu^{AI}(y_1\pb x_1...y_n\pb x_n) \;=\;
2505:   \left\{
2506:   \begin{array}{cl}
2507:     \displaystyle
2508:     \mu^{SG}(y'_1\pb x'_1...y'_n\pb x'_n) & \mbox{if}\quad
2509:     c_1\!=...=\!c_{n-1}\!=\!0 \;\mbox{and}\; c_n=C(y'_1x'_1...y'_nx'_n)
2510:     \\
2511:     0 & \mbox{otherwise}
2512:   \end{array} \right.
2513: \eeq
2514: where $y_i\!=\!y'_i$ and $x_i\!=\!c_ix'_i$.
2515: If the environment is a minimax player (\ref{sgxdot}) plus a crude
2516: teacher $C$, i.e. if $\mu^{AI}$ is the true prior probability, the
2517: question now is what the behaviour $\hh y_k^{AI}$ of the AI$\mu$
2518: system is. It turns out that if we set $m_k\!=\!n$, the AI$\mu$ system
2519: is also a minimax player (\ref{sgydot}) and hence optimal:
2520: \beqn
2521:   \hh y_k^{AI} \;=\;
2522:   \maxarg_{y_k}\sum_{x'_k}...\max_{y_n}\sum_{x'_n}
2523:   C(\hh y\!\hh x'_{<k}y\!x'_{k:n})\!\cdot\!
2524:   \mu^{SG}(\hh y\!\hh x'_{<k}y\!\pb x'_{k:n}) \;=
2525: \eeqn
2526: \beq\label{yaisg2}
2527:   =\; \maxarg_{y_k}\sum_{x'_k}...\max_{y_{n-1}}\sum_{x'_{n-1}}\max_{y_n}\min_{x'_n}
2528:   C(\hh y\!\hh x'_{<k}y\!x'_{k:n})\!\cdot\!
2529:   \mu^{SG}(\hh y\!\hh x'_{<k}y\!\pb x'_{k:n-1}) \;=
2530: \eeq
2531: \beqn
2532:  =\;...\;=\; \maxarg_{y_k}\min_{x'_k}...\max_{y_n}\min_{x'_n}
2533:      C(\hh y\!\hh x'_{<k}y\!x'_{k:n}) \;=\;
2534:      \hh y_k^{SG}
2535: \eeqn
2536: In the first line we inserted $m_k\!=\!n$ and (\ref{muaisg}) into
2537: the definition (\ref{ydotrec}) of $\hh y_k^{AI}$. This removes all
2538: sums over the $c_k$. Further, the sum over $x'_n$ contributes
2539: only for $x'_n\!=\!\minarg_{x'_n}C(\hh y'_1\hh
2540: x'_1...y'_nx'_n)$ by definition (\ref{defmusg}) of $\mu^{SG}$.
2541: Inserting this $x'_n$ gives the second line. $\mu^{SG}$ is
2542: effectively reduced to a lower number of arguments and the sum
2543: over $x'_n$ is replaced by $\min_{x'_n}$.  Repeating this procedure
2544: for $x'_{n-1},...,x'_k$ leads to the last line, which is just
2545: the minimax strategy of player 1 defined in (\ref{sgydot}).
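
For the simplest case $n\!=\!k\!=\!1$ the reduction can be checked in a single step. By (\ref{defmusg}), $\mu^{SG}(y_1\pb x'_1)$ is $1$ for $x'_1\!=\!\minarg_{x''_1}C(y_1x''_1)$ and $0$ otherwise, so the sum over $x'_1$ collapses to one term:
\beqn
  \hh y_1^{AI} \;=\; \maxarg_{y_1}\sum_{x'_1} C(y_1x'_1)\!\cdot\!\mu^{SG}(y_1\pb x'_1)
  \;=\; \maxarg_{y_1}\min_{x'_1} C(y_1x'_1) \;=\; \hh y_1^{SG}
\eeqn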
2546: 
2547: Let us now assume that the game under consideration is played $s$
2548: times. The prior probability then is
2549: \beq\label{sgrep}
2550:   \mu^{AI}(y\!\pb x_1...y\!\pb x_{sn}) \;=\;
2551:   \prod_{r=0}^{s-1} \mu_1^{AI}(y\!\pb x_{rn+1}...
2552:   y\!\pb x_{(r+1)n})
2553: \eeq
2554: where we have renamed the prior probability (\ref{muaisg}) for
2555: one game to $\mu_1^{AI}$. (\ref{sgrep}) is a special case of a
2556: factorizable $\mu$ (\ref{facmu}) with identical factors
2557: $\mu_r\!=\mu_1^{AI}$ for all $r$ and equal episode lengths
2558: $n_{r+1}\!-\!n_r\!=\!n$. The AI$\mu$ system (\ref{sgrep}) for repeated
2559: game playing also implements the minimax strategy,
2560: \beq\label{yaisgrep}
2561:   \hh y_k^{AI} \;=\;
2562:   \maxarg_{y_k}\min_{x'_k}...
2563:      \max_{y_{(r+1)n}}\min_{\;x'_{(r+1)n}}
2564:      C(\hh y\!\hh x'_{rn+1:k-1}...y\!x'_{k:(r+1)n})
2565: \eeq
2566: with $r$ such that $rn\!<\!k\!\leq\!(r\!+\!1)n$ and for any choice of $m_k$
2567: as long as the horizon $h_k\!\geq\!n$. This can be
2568: proved by using (\ref{facydot}) and (\ref{yaisg2}).
2569: See section \ref{secAIxi} for a discussion of separable and
2570: factorizable $\mu$.
2571: 
2572: %------------------------------%
2573: \paragraph{Games of variable length:}
2574: %------------------------------%
2575: In the unrepeated case we have argued that games of variable but
2576: bounded length can be padded to a fixed length without effect. We
2577: now analyze, for a sequence of games, the effect of replacing games of fixed
2578: length by games of variable length.
2579: The sequence $y'_1x'_1...y'_nx'_n$ can still be grouped into episodes
2580: corresponding to the moves of separated consecutive games, but now
2581: the length and total number of games that fit into the $n$
2582: moves depend on the actual moves taken\footnote{If the sum of
2583: game lengths does not fit exactly into $n$ moves, we pad the last
2584: game appropriately.}. $C(y'_1x'_1...y'_nx'_n)$
2585: equals the number of games where the
2586: system wins, minus the number of games where the environment wins.
2587: Whenever a loss, win or draw has been achieved by the
2588: system or the environment, a new game starts. The player whose turn would be
2589: next begins the next game. The games are still separated in
2590: the sense that the behaviour and credit of the current game do
2591: not influence the next game. On the other hand, they are
2592: slightly entangled, because the length of the current
2593: game determines the time of start of the next. As the rules of the
2594: game are time invariant, this does not influence the next game
2595: directly. If we play a fixed number of games, the games are
2596: completely independent, but if we play a fixed number of total moves
2597: $n$, the number of games depends on their lengths. This has the
2598: following consequences: the better player tries to keep the games
2599: short, to win more games in the given time $n$. The poorer player
2600: tries to draw the games out, in order to lose fewer games. The better
2601: player might even prefer a quick draw to winning a long game.
2602: Formally, this entanglement is represented by the fact that the
2603: prior probability $\mu$ no longer factorizes. The reduced
2604: form (\ref{yaisgrep}) of $\hh y_k^{AI}$ to one episode is no
2605: longer valid. Also, the behaviour $\hh y_k^{AI}$ of the system
2606: depends on $m_k$, even if the horizon $h_k$ is
2607: chosen larger than the longest possible game (unless $m_k\!\geq\!n$).
2608: The important point is that the system realizes that
2609: keeping games short/long can lead to increased credit. In
2610: practice, a horizon much larger than the average game length
2611: should be sufficient to incorporate this effect. The details of
2612: games in the distant future do not affect the current game and can,
2613: therefore, be ignored. A more quantitative analysis could be interesting, but
2614: would lead us too far astray.
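
A rough illustration of this effect, with numbers that are purely hypothetical: suppose the stronger player wins every game and each game lasts $\ell$ rounds (a symbol introduced only for this sketch). Within $n$ rounds the credit is then
\beqn
  C \;\approx\; \lfloor n/\ell \rfloor,
\eeqn
which grows as $\ell$ shrinks. Similarly, with $10$ rounds left, winning one game that lasts all $10$ rounds gives credit $1$, whereas accepting a draw after $2$ rounds and then winning two games of $4$ rounds each gives $0\!+\!1\!+\!1\!=\!2$.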
2615: 
2616: %------------------------------%
2617: \paragraph{Using the AI$\xi$ model for game playing:}
2618: %------------------------------%
2619: When going from the specific AI$\mu$ model, where the rules of the
2620: game have been explicitly modeled into the prior probability
2621: $\mu^{AI}$, to the universal model AI$\xi$ we have to ask whether
2622: these rules can be learned from the assigned credits $c_k$. Here,
2623: another (actually the main) reason for studying the case of
2624: repeated games, rather than just one game, arises. For a single game
2625: there is only one cycle of non-trivial feedback, namely the end of
2626: the game, which is too late to be useful unless further
2627: games follow.
2628: 
2629: Even in the case of repeated games, there is only very limited
2630: feedback, at most $\log_2 3$ bits of information per game if the 3
2631: outcomes win/loss/draw have the same frequency. So at
2632: least $O(K(game))$ games are necessary to learn a game of
2633: complexity $K(game)$. Apart from extremely simple games, even this
2634: estimate is far too optimistic. As the AI$\xi$ system has no
2635: information about the game to begin with, its moves will be more
2636: or less random and it can win the first few games only by pure luck.
2637: So the probability that the system loses is close to one and
2638: hence the information content $I$ in the feedback $c_k$ at the end
2639: of the game is much less than $\log_2 3$ bits. This situation persists
2640: for a very large number of games. On the other hand, in principle,
2641: every game should be learnable after a very long sequence of games
2642: even with this minimal feedback only, as long as $I\not\equiv 0$.
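
The information content per game can be quantified by the Shannon entropy of the outcome distribution. This is a standard formula; the probabilities $(p_1,p_2,p_3)$ for loss/win/draw in the second line are assumed values chosen only for illustration:
\beqn
  I \;=\; -\sum_{i=1}^{3} p_i\log_2 p_i, \qquad
  p_i={\textstyle{1\over 3}} \;\;\ff\;\; I=\log_2 3\approx 1.58
\eeqn
\beqn
  (p_1,p_2,p_3)=(0.98,\,0.01,\,0.01) \;\;\ff\;\; I \approx 0.16
\eeqn
So a system that loses $98\%$ of its games receives only about a tenth of the maximal $\log_2 3$ bits per game.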
2643: 
2644: The important point is that no other learning scheme with no extra
2645: information can learn the game more quickly. We expect this to be
2646: true as $\mu^{AI}$ factorizes in the case of games of fixed
2647: length, i.e. $\mu^{AI}$ satisfies a strong separability condition.
2648: In the case of variable game length the entanglement is also low.
2649: $\mu^{AI}$ should still be sufficiently separable to allow
2650: good credit bounds for AI$\xi$ to be formulated and proved.
2651: 
2652: To learn realistic games like tic-tac-toe (noughts and crosses) in
2653: realistic time one has to provide more feedback. This could be
2654: achieved by intermediate help during the game. The environment
2655: could give positive (negative) feedback for every good (bad) move
2656: the system makes. The threshold for valuing a move as
2657: good should be adapted to the experience gained by the system in
2658: such a way that approximately half of the moves are evaluated as
2659: good and the other half as bad, in order to maximize the
2660: information content of the feedback.
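
The choice of one half can be made plausible with the binary entropy, a standard result stated here only as a sketch. If a fraction $p$ of the moves is evaluated as good, the feedback per move carries
\beqn
  H(p) \;=\; -p\log_2 p-(1\!-\!p)\log_2(1\!-\!p)
\eeqn
bits, which is maximal for $p\!=\!{1\over 2}$ with $H({1\over 2})\!=\!1$, whereas e.g.\ $H(0.9)\approx 0.47$.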
2661: 
2662: For more complicated games like chess, even more feedback is
2663: necessary from a practical point of view. One way to increase the
2664: feedback far beyond a few bits per cycle is to train the system by
2665: teaching it good moves. This is called supervised learning.
2666: Despite the fact that the AI model has only a credit feedback
2667: $c_k$, it is able to learn by teaching, as will be shown in section
2668: \ref{secEX}. Another way would be to start with simpler games
2669: containing certain aspects of the true game and to switch to the true
2670: game when the system has learned the simple game.
2671: 
2672: No other difficulties are expected when going from
2673: $\mu$ to $\xi$. Eventually $\xi^{AI}$ will converge to the
2674: minimax strategy $\mu^{AI}$. In the more realistic case, where the
2675: environment is not a perfect minimax player, AI$\xi$ can
2676: detect and exploit the weakness of the opponent.
2677: 
2678: Finally, we want to comment on the input/output space $X$/$Y$ of
2679: the AI system. In practical applications, $Y$ will possibly also
2680: include illegal moves. If $Y$ is the set of moves of e.g. a robotic
2681: arm, the system could move a wrong piece or even knock over the
2682: pieces. A simple way to handle illegal moves $y_k$ is to
2683: interpret them as losing moves, which terminate the game.
2684: Further, if e.g. the input $x_k$ is the image of a video camera
2685: which makes one shot per move, $X$ is not the set of moves by the
2686: environment but includes the set of states of the game board. The
2687: discussion in this section handles this case as well. There is no
2688: need to explicitly design the system's I/O space $X/Y$ for a
2689: specific game.
2690: 
2691: The discussion above on the AI$\xi$ system was rather informal for
2692: the following reason: game playing (the SG$\xi$ system) has
2693: (nearly) the same complexity as fully general AI, and quantitative
2694: results for the AI$\xi$ system are difficult (but not impossible)
2695: to obtain.
2696: 
2697: \newpage
2698: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2699: \section{Function Minimization (FM)}\label{secFM}
2700: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2701: 
2702: %------------------------------%
2703: \paragraph{Applications/Examples:}
2704: %------------------------------%
2705: There are many problems that can be reduced to a minimization
2706: problem (FM). The minimum of a (real valued) function
2707: $f\!:\!Y\!\to\!I\!\!R$ over some domain $Y$ or a good approximation
2708: of it has to be found, usually with some limited resources.
2709: 
2710: One popular example is the traveling salesman problem (TSP). $Y$
2711: is the set of different routes between towns and $f(y)$ the length
2712: of route $y\!\in\!Y$. The task is to find a route of minimal
2713: length visiting all cities. This problem is NP-hard. Getting good
2714: approximations in limited time is of great importance in various
2715: applications. %%%%%%
2716: Another example is the minimization of production costs (MPC),
2717: e.g.\ of a car, under several constraints. $Y$ is the set of all
2718: alternative car designs and production methods compatible with the
2719: specifications and $f(y)$ the overall cost of alternative
2720: $y\!\in\!Y$. %%%%%%
2721: A related example is finding materials or (bio)molecules with
2722: certain properties (MAT), e.g.\ solids with minimal electrical
2723: resistance, maximally efficient chlorophyll modifications, or
2724: aromatic molecules that taste as close as possible to strawberry. %%%%%%
2725: We can also ask for nice paintings (NPT). $Y$ is the set of all
2726: existing or imaginable paintings and $f(y)$ characterizes how much
2727: person $A$ likes painting $y$. The system should present
2728: paintings that $A$ likes.
2729: 
2730: For now, these are enough examples. The TSP is very rigorous from a
2731: mathematical point of view, as $f$, i.e. an algorithm for $f$, is
2732: usually known. In principle, the minimum could be found by
2733: exhaustive search, were it not for computational resource
2734: limitations. For MPC, $f$ can often be modeled in a reliable and
2735: sufficiently accurate way. For MAT you need very accurate physical
2736: models, which might be unavailable or too difficult to solve or
2737: implement. For NPT the most we have is the judgement of person $A$ on
2738: every presented painting. The evaluation function $f$ cannot be
2739: implemented without scanning $A$'s brain, which is not possible with
2740: today's technology.
2741: 
2742: So there are different limitations, some depending on the
2743: application we have in mind. An implementation of $f$ might not be
2744: available; $f$ can then only be tested at some arguments $y$, and $f(y)$
2745: is determined by the environment. We want to (approximately)
2746: minimize $f$ with as few function calls as possible or, conversely,
2747: find the closest possible approximation to the
2748: minimum within a fixed number of function evaluations. If $f$ is
2749: available or can quickly be inferred by the system and evaluation
2750: is quick, it is more important to minimize the total time needed to
2751: imagine new trial minimum candidates plus the evaluation time for
2752: $f$. As we do not consider computational aspects of AI$\xi$ till
2753: section \ref{secTime} we concentrate on the first
2754: case, where $f$ is not available or dominates the computational
2755: requirements.
2756: 
2757: %------------------------------%
2758: \paragraph{The Greedy Model FMG$\mu$:}
2759: %------------------------------%
2760: The FM model consists of a sequence $\hh y_1\hh z_1\hh y_2\hh
2761: z_2...$ where $\hh y_k$ is a trial of the FM system for a minimum
2762: of $f$ and $\hh z_k=f(\hh y_k)$ is the true function value
2763: returned by the environment. We randomize the model by assuming a
2764: probability distribution $\mu(f)$ over the functions. There are
2765: several reasons for doing this. We might really not know the exact
2766: function $f$, as in the NPT example, and model our uncertainty by
2767: the probability distribution $\mu$. More importantly, we want to
2768: parallel the other AI classes, as in the SP$\mu$ model, where we
2769: always started with a probability distribution $\mu$ that was finally
2770: replaced by $\xi$ to get the universal Solomonoff prediction
2771: SP$\xi$. We want to do the same thing here. Further, the probabilistic
2772: case includes the deterministic case by choosing
2773: $\mu(f)\!=\!\delta_{ff_0}$, where $f_0$ is the true function. A
2774: final reason is that the deterministic case is trivial when $\mu$
2775: and hence $f_0$ is known, as the system can internally (virtually)
2776: check all function arguments and output the correct minimum from the very
2777: beginning.
2778: 
2779: We will assume that $Y$ is countable or finite and that $\mu$ is a
2780: discrete measure, e.g. by taking only computable functions. The
2781: probability that the function values of $y_1,...,y_n$ are
2782: $z_1,...,z_n$ is then given by
2783: \beq\label{fmmudef}
2784:   \mu^{FM}(y_1\pb z_1...y_n\pb z_n) \;:=\;
2785:   \sum_{f:f(y_i)=z_i\;\forall 1\leq i\leq n} \nq\mu(f)
2786: \eeq
2787: We start with a model that minimizes the expectation
2788: of the function value $z_k\!=\!f(y_k)$ for the next output
2789: $y_k$, taking into account previous information:
2790: \beqn
2791:   \hh y_k \;:=\; \minarg_{y_k}\sum_{z_k} z_k\!\cdot\!
2792:   \mu(\hh y_1\hh z_1...\hh y_{k-1}\hh z_{k-1}y_k\pb z_k)
2793: \eeqn
2794: This type of greedy algorithm, just minimizing the next
2795: feedback, was sufficient for sequence prediction (SP) and is also
2796: sufficient for classification (CF). It is, however, not sufficient for
2797: function minimization as the following example demonstrates.
2798: 
2799: Take $f:\{0,1\}\!\to\!\{1,2,3,4\}$. There are 16 different
2800: functions which shall be equiprobable, $\mu(f)\!=\!{1\over 16}$.
2801: The function expectation in the first cycle
2802: \beqn
2803:   \langle z_1\rangle \;:=\; \sum_{z_1} z_1\!\cdot\!\mu(y_1\pb z_1) \;=\;
2804:   {\textstyle{1\over 4}}\sum_{z_1}z_1 \;=\;
2805:   {\textstyle{1\over 4}}(1\!+\!2\!+\!3\!+\!4) \;=\; 2.5
2806: \eeqn
2807: is just the arithmetic average of the possible function values and
2808: is independent of $y_1$. Therefore, $\hh y_1\!=\!0$, as $\minarg$
2809: is defined to take the lexicographically first minimum in an
2810: ambiguous case. Let us assume that $f_0(0)\!=\!2$, where $f_0$ is the
2811: true environment function, i.e. $\hh z_1\!=\!2$. The expectation of $z_2$ is then
2812: \beqn
2813:   \langle z_2\rangle \;:=\; \sum_{z_2} z_2\!\cdot\!\mu(02y_2\pb z_2)
2814:   \;=\; \left\{
2815:   \begin{array}{c@{\quad\mbox{for}\quad}l}
2816:     2                      & y_2=0 \\
2817:     2.5                    & y_2=1
2818:   \end{array} \right.
2819: \eeqn
2820: For $y_2\!=\!0$ the system already knows $f(0)\!=\!2$; for
2821: $y_2\!=\!1$ the expectation is, again, the arithmetic average. The
2822: system will again output $\hh y_2\!=\!0$ with feedback $\hh
2823: z_2\!=\!2$. This will continue forever. The system is not
2824: motivated to explore other $y$'s, as $f(0)$ is already smaller than the
2825: expectation of $f(1)$. This is obviously not what we
2826: want. The greedy model fails. The system ought to be inventive and
2827: try other outputs when given enough time.
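
The two values of $\langle z_2\rangle$ can be verified by counting functions in (\ref{fmmudef}). Of the 16 equiprobable functions, 4 satisfy $f(0)\!=\!2$, and exactly one of these has $f(1)\!=\!z_2$ for each $z_2$. Reading the conditional probability as a ratio of joint probabilities, one gets
\beqn
  \mu(02\,1\pb z_2) \;=\; {\mu^{FM}(0\pb 2\,1\pb z_2)\over\mu^{FM}(0\pb 2)}
  \;=\; {1/16\over 4/16} \;=\; {\textstyle{1\over 4}},
  \qquad \mu(02\,0\pb z_2) \;=\; \delta_{z_2 2}
\eeqn
so that $\langle z_2\rangle\!=\!{1\over 4}(1\!+\!2\!+\!3\!+\!4)\!=\!2.5$ for $y_2\!=\!1$ and $\langle z_2\rangle\!=\!2$ for $y_2\!=\!0$.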
2828: 
2829: The general reason for the failure of the greedy approach is that
2830: the information contained in the feedback $z_k$ depends on the
2831: output $y_k$. An FM system can actively influence the knowledge it
2832: receives from the environment by the choice of $y_k$. It may be
2833: more advantageous to first collect certain knowledge about $f$ by
2834: an (in the greedy sense) non-optimal choice for $y_k$, rather than to
2835: minimize the $z_k$ expectation immediately. The non-minimality of
2836: $z_k$ might be over-compensated in the long run by
2837: exploiting this knowledge. In SP, the received information is
2838: always the current bit of the sequence, independent of what SP
2839: predicts for this bit. This is the reason why a greedy
2840: strategy in the SP case is already optimal.
2841: 
2842: %------------------------------%
2843: \paragraph{The general FM$\mu/\xi$ Model:}
2844: %------------------------------%
2845: To get a useful model we have to think more carefully about what we
2846: really want. Should the FM system produce a good minimum as its last output
2847: within a limited number of
2848: cycles $T$, or should the average of the $z_1,...,z_T$ values be minimal, or
2849: does it suffice that just one of the $z_i$ is as small as possible?
2850: Let us define the FM$\mu$ model so as to minimize the $\mu$-averaged weighted
2851: sum $\alpha_1 z_1\!+...+\!\alpha_T z_T$ for some given
2852: $\alpha_k\!\geq\!0$. Building the $\mu$ average by summation over
2853: the $z_i$ and minimizing w.r.t.\ the $y_i$ has to be performed in
2854: the correct chronological order. With a similar reasoning as in
2855: (\ref{ebesty}) to (\ref{ydotrec}) we get
2856: \beq\label{fmydot}
2857:   \hh y_k^{FM} \;=\; \minarg_{y_k}\sum_{z_k}...\min_{y_T}\sum_{z_T}
2858:   (\alpha_1 z_1\!+...+\!\alpha_T z_T)\!\cdot\!
2859:   \mu(\hh y_1\hh z_1...\hh y_{k-1}\hh z_{k-1}y_k\pb z_k...y_T\pb z_T)
2860: \eeq
2861: If we want the final output $\hh y_T$ to be optimal we should
2862: choose $\alpha_k\!=\!0$ for $k\!<\!T$ and $\alpha_T\!=\!1$ (final
2863: model FMF$\mu$). If we want to already have a good
2864: approximation during intermediate cycles, we should demand that the
2865: outputs of all cycles together are optimal in some average sense,
2866: so we should choose $\alpha_k\!=\!1$ for all $k$ (sum model
2867: FMS$\mu$). If we want something in between, for instance to gradually increase
2868: the pressure to produce good outputs, we could choose
2869: $\alpha_k\!=\!e^{\gamma(k-T)}$, exponentially increasing in $k$, for some
2870: $\gamma\!>\!0$ (exponential model FME$\mu$). For
2871: $\gamma\!\to\!\infty$ we get the FMF$\mu$, for $\gamma\!\to\!0$
2872: the FMS$\mu$ model. If we want to demand that the best of the
2873: outputs $y_1...y_T$ is optimal, we must replace the $\alpha$
2874: weighted $z$-sum by $\min\{z_1,...,z_T\}$ (minimum model
2875: FMM$\mu$). We expect the behaviour to be very similar to the
2876: FMF$\mu$ model, and do not consider it further.
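
To see how (\ref{fmydot}) escapes the trap of the greedy model, the following sketch applies the final model FMF$\mu$ to the 16 equiprobable functions $f:\{0,1\}\!\to\!\{1,2,3,4\}$ after a first cycle $\hh y_1\hh z_1\!=\!02$. The horizon $T\!=\!3$ is an assumption made only for this illustration. Only $z_3$ counts. If the system explores with $y_2\!=\!1$, it knows both function values in cycle 3 and outputs the argument with the smaller one; if it repeats $y_2\!=\!0$, it learns nothing new:
\beqn
  y_2=1:\quad \langle z_3\rangle \;=\; {\textstyle{1\over 4}}\sum_{z_2}\min\{2,z_2\}
  \;=\; {\textstyle{1\over 4}}(1\!+\!2\!+\!2\!+\!2) \;=\; 1.75
\eeqn
\beqn
  y_2=0:\quad \langle z_3\rangle \;=\; \min\{2,\,2.5\} \;=\; 2
\eeqn
Hence $\hh y_2^{FM}\!=\!1$. In contrast to the greedy model, FMF$\mu$ explores.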
2877: 
2878: By construction, the FM$\mu$ models guarantee optimal results in
2879: the usual sense that no other model knowing only $\mu$
2880: can be expected to produce better results. The variety of FM
2881: variants is not a fault of the theory. It just reflects the fact
2882: that there is some interpretational freedom in what is meant by
2883: minimization within $T$ function calls. In most applications, FMF is probably
2884: appropriate. In the NPT application one might prefer the FMS model.
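
How the choice of variant changes the behaviour can be sketched on a toy case: 16 equiprobable functions $f:\{0,1\}\!\to\!\{1,2,3,4\}$ and history $\hh y_1\hh z_1\!=\!02$, now with the sum model FMS$\mu$. Exploiting $y\!=\!0$ in all remaining cycles $2,...,T$ costs $2$ per cycle. Exploring once with $y_2\!=\!1$ costs $2.5$ in expectation and afterwards ${1\over 4}\sum_{z_2}\min\{2,z_2\}\!=\!1.75$ per cycle:
\beqn
  \mbox{exploit:}\quad 2(T\!-\!1), \qquad
  \mbox{explore:}\quad 2.5+1.75(T\!-\!2)
\eeqn
Exploration pays off iff $2.5\!+\!1.75(T\!-\!2)<2(T\!-\!1)$, i.e. for $T\!>\!4$. So FMS$\mu$ explores only if enough cycles remain, whereas for FMF$\mu$ intermediate outputs cost nothing and only the final expectation ($1.75$ versus $2$) matters, so that exploring is already preferred for $T\!=\!3$.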
2885: 
2886: The interesting case (in AI) is when $\mu$ is unknown. We
2887: define, for this case, the FM$\xi$ model by replacing $\mu(f)$
2888: with some $\xi(f)$, which should assign high probability to
2889: functions $f$ of low complexity. So we might define\footnote
2890: {$\xi^{FM}(f)$ is a true
2891: probability distribution if we include partial functions in the
2892: domain. So normalization is not necessary.}
2893: $\xi(f)\!=\!\sum_{q:\forall x[U(qx)=f(x)]}2^{-l(q)}$.
2894: The problem with this definition is that it is, in general,
2895: undecidable whether a TM $q$ is an implementation of a function
2896: $f$. $\xi(f)$ defined in this way is uncomputable,
2897: not even approximable. As we only need a $\xi$ analogous to the
2898: l.h.s.\ of (\ref{fmmudef}), the following definition is natural
2899: \beq\label{fmxidef}
2900:   \xi^{FM}(y_1\pb z_1...y_n\pb z_n) \;:=\;
2901:   \sum_{q:q(y_i)=z_i\;\forall 1\leq i\leq n} \nq 2^{-l(q)}
2902: \eeq
2903: $\xi^{FM}$ is
2904: actually equivalent to inserting the uncomputable $\xi(f)$ into
2905: (\ref{fmmudef}). $\xi^{FM}$ is an enumerable semi-measure and
2906: universal, relative to all probability distributions of the form
2907: (\ref{fmmudef}). We will not prove this here.
2908: 
2909: Alternatively, we could have constrained the sum in (\ref{fmxidef})
by $q(y_1...y_n)\!=\!z_1...z_n$ analogous to (\ref{uniMAI}), but these
2911: two definitions are not equivalent. Definition (\ref{fmxidef})
2912: ensures the symmetry\footnote{See \cite{Sol99} for a discussion
2913: on symmetric universal distributions on unordered data.} in its
2914: arguments and $\xi^{FM}(...y\pb z...y\pb z'...)\!=\!0$ for $z\neq z'$.
2915: It incorporates all general knowledge we have about function
2916: minimization, whereas (\ref{uniMAI}) does not. But this extra
2917: knowledge has only low information content (complexity of $O(1)$),
2918: so we do not expect FM$\xi$ to perform much worse when using
2919: (\ref{uniMAI}) instead of (\ref{fmxidef}). But there is no reason
2920: to deviate from (\ref{fmxidef}) at this point.
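
Both properties can be read off directly from (\ref{fmxidef}). The
following short check is only a sketch; the permutation $\pi$ of
$\{1,...,n\}$ is notation introduced here for illustration and is not used
elsewhere in this paper. The set of programs
$\{q:q(y_i)\!=\!z_i\;\forall 1\!\leq\!i\!\leq\!n\}$ does not depend on the
order in which the pairs are listed, hence
\beqn
  \xi^{FM}(y_{\pi(1)}\pb z_{\pi(1)}...y_{\pi(n)}\pb z_{\pi(n)}) \;=\;
  \xi^{FM}(y_1\pb z_1...y_n\pb z_n)
\eeqn
If $y_i\!=\!y_j$ but $z_i\!\neq\!z_j$ for some $i,j$, no program $q$ can
satisfy both $q(y_i)\!=\!z_i$ and $q(y_j)\!=\!z_j$, so the sum in
(\ref{fmxidef}) is empty and $\xi^{FM}$ vanishes.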
2921: 
We can now define an ``error'' measure $E_{T\mu}^{FM}$ as
2923: (\ref{fmydot}) with $k\!=\!1$ and $\minarg_{y_1}$ replaced by
2924: $\min_{y_1}$ and, additionally, $\mu$ replaced by $\xi$ for
2925: $E_{T\xi}^{FM}$. We expect $|E_{T\xi}^{FM}\!-\!E_{T\mu}^{FM}|$ to
2926: be bounded in a way that justifies the use of $\xi$ instead of
$\mu$ for computable $\mu$, i.e.\ computable $f_0$ in the
2928: deterministic case. The arguments are the same as for the AI$\xi$
2929: model.
2930: 
2931: %------------------------------%
2932: \paragraph{Is the general model inventive?}
2933: %------------------------------%
2934: In the following we will show that FM$\xi$ will never cease
2935: searching for minima, but will test an infinite set of different
$y$'s for $T\!\to\!\infty$.
2937: 
2938: Let us assume that the system tests only a finite number of
2939: $y_i\!\in\!A\!\subset Y$, $|A|\!<\!\infty$. Let $t\!-\!1$ be the
2940: cycle in which the last new $y\!\in\!A$ is selected (or some later
cycle). If $y$'s are selected a second time in cycles $k\!\geq\!t$, the
feedback $z$ provides no new information, i.e.\ it does not
modify the probability $\xi^{FM}$. The system can
2944: minimize $E_{T\xi}^{FM}$ by outputting in cycles $k\geq t$ the
2945: best $y\!\in\!A$ found so far (in the case $\alpha_k\!=\!0$, the output
2946: does not matter).
2947: Let us fix $f$ for a moment. Then we have
2948: \beqn
2949:   E^a \;:=\; \alpha_1 z_1\!+...+\!\alpha_T z_T \;=\;
2950:   \sum_{k=1}^{t-1}\alpha_kf(y_k)+f_1\!\cdot\!\sum_{k=t}^T\alpha_k
2951:   \quad,\quad f_1:=\min_{1\leq k<t}f(y_k)
2952: \eeqn
2953: Let us now assume that the system tests one additional
2954: $y_t\!\not\in\!A$ in cycle $t$, but no other $y\!\not\in\!A$.
2955: Again, it will keep to the best output for $k\!>\!t$, which is
2956: either the one of the previous system or $y_t$.
2957: \beqn
2958:   E^b \;=\;
2959:   \sum_{k=1}^t\alpha_kf(y_k) +
2960:   \min\{f_1,f(y_t)\}\!\cdot\nq\;\sum_{k=t+1}^T\alpha_k
2961: \eeqn
2962: The difference can be represented in the form
2963: \beqn
2964:   E^a-E^b \;=\; \left(\sum_{k=t}^T\alpha_k\right)\!\cdot\!f^+ -
2965:   \alpha_t\!\cdot\!f^- \quad,\quad
2966:   f^\pm \;:=\; \max\{0,\pm(f_1\!-\!f(y_t))\} \;\geq\; 0
2967: \eeqn
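This form can be verified term by term from the two expressions for
$E^a$ and $E^b$; the following is only a consistency check and introduces
no new assumptions. The first $t\!-\!1$ terms cancel, and splitting off the
cycle $t$ term from the second sum in $E^a$ gives
\beqn
  E^a-E^b \;=\; \alpha_t\!\cdot\!(f_1\!-\!f(y_t)) +
  \left(\sum_{k=t+1}^T\alpha_k\right)\!\cdot\!
  (f_1\!-\!\min\{f_1,f(y_t)\})
\eeqn
Using $f_1\!-\!f(y_t)=f^+\!-\!f^-$ and
$f_1\!-\!\min\{f_1,f(y_t)\}=f^+$, the term $\alpha_t f^+$ combines with the
remaining sum to $(\sum_{k=t}^T\alpha_k)\!\cdot\!f^+$, which is the stated
result.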
2968: As the true FM$\xi$ strategy is the one which minimizes $E$, assumption
2969: $a$ is ruled out if $E^a>E^b$. We will say that $b$ is favored over $a$,
2970: which does not mean that $b$ is the correct strategy, only that
2971: $a$ is not the true one. For probability distributed $f$, $b$ is
2972: favored over $a$ when
2973: \beqn
2974:   E^a-E^b \;=\; \left(\sum_{k=t}^T\alpha_k\right)\!\cdot\!\langle f^+\rangle -
2975:   \alpha_t\!\cdot\!\langle f^-\rangle \;>\; 0
2976:   \quad\Leftrightarrow\quad
2977:   \sum_{k=t}^T\alpha_k > \alpha_t{\langle f^-\rangle\over\langle
2978:   f^+\rangle}
2979: \eeqn
2980: where $\langle f^\pm\rangle$ is the $\xi$ expectation of $\pm f_1\mp f(y_t)$
under the condition that $\pm f_1\!\geq\!\pm f(y_t)$ and under the constraints
2982: imposed in cycles $1...t\!-\!1$. As $\xi$ assigns a strictly
2983: positive probability to every non-empty event, $\langle
2984: f^+\rangle\!\neq\!0$.
2985: Inserting $\alpha_k\!=\!e^{\gamma(k-T)}$, assumption $a$ is ruled
2986: out in model FME$\xi$ if
2987: \beqn
2988:   T-t \;>\; {1\over\gamma}\ln\left[1+
2989:   {\langle f^-\rangle\over\langle f^+\rangle}(e^\gamma-1)\right]-1
2990:   \;\to\; \left\{
2991:   \begin{array}{c@{\quad\mbox{for}\quad}l}
2992:     0 & \gamma\to\infty\mbox{ (FMF$\xi$ model)} \\
2993:     \langle f^-\rangle/\langle f^+\rangle-1
2994:     & \gamma\to 0\;\;\mbox{ (FMS$\xi$ model)}
2995:   \end{array} \right.
2996: \eeqn
We see that if the condition is not satisfied for some $t$, it will
remain unsatisfied for all $t'\!>\!t$. So the FMF$\xi$ system will test each $y$
only once up to a point after which it always outputs the best
$y$ found. Further, for $T\!\to\!\infty$ the condition always gets
satisfied. As this is true for any finite $A$, the assumption of a
finite $A$ is wrong. For $T\!\to\!\infty$ the system
tests an increasing number of different $y$'s, provided $Y$ is
infinite. The FMF$\xi$ model will never repeat any $y$ except in
the last cycle $T$ where it chooses the best $y$ found. The
FMS$\xi$ model will test a new $y_t$ for fixed $T$, only if the
expected value of $f(y_t)$ is not too large.
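
The threshold on $T\!-\!t$ can be checked as follows. This is a sketch of
the intermediate steps; the abbreviations
$r\!:=\!\langle f^-\rangle/\langle f^+\rangle$ and $n\!:=\!T\!-\!t$ are
introduced here only for readability, and $\gamma\!>\!0$ is assumed. With
$\alpha_k\!=\!e^{\gamma(k-T)}$, dividing the condition
$\sum_{k=t}^T\alpha_k>\alpha_t r$ by $\alpha_t\!=\!e^{-\gamma n}$ and
summing the geometric series gives
\beqn
  \sum_{j=0}^{n}e^{\gamma j} \;=\;
  {e^{\gamma(n+1)}-1\over e^\gamma-1} \;>\; r
  \quad\Leftrightarrow\quad
  n \;>\; {1\over\gamma}\ln\left[1+r(e^\gamma-1)\right]-1
\eeqn
For $\gamma\!\to\!\infty$ and $r\!>\!0$ the logarithm behaves like
$\gamma\!+\!\ln r$, so the right hand side tends to $0$. For
$\gamma\!\to\!0$ the logarithm behaves like $r\gamma$, so the right hand
side tends to $r\!-\!1$.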
3008: 
3009: The above does not necessarily hold for different choices of
3010: $\alpha_k$. The above also holds for the FMF$\mu$ system if
3011: $\langle f^+\rangle\!\neq\!0$. $\langle f^+\rangle\!=\!0$ if the
3012: system can already exclude that $y_t$ is a better guess, so there
3013: is no reason to test it explicitly.
3014: 
3015: Nothing has been said about the quality of the guesses, but for
3016: the FM$\mu$ system they are optimal by definition.
3017: If $K(\mu)$ for the true distribution $\mu$ is finite, we expect
the FM$\xi$ system to solve the ``exploration versus
exploitation'' problem in a universally optimal way, as $\xi$
3020: converges to $\mu$.
3021: 
3022: %------------------------------%
\paragraph{Using the AI models for Function Minimization:}
3024: %------------------------------%
3025: The AI model can be used for function minimization in the
3026: following way. The output $y_k$ of cycle $k$ is a guess for a
minimum of $f$, as in the FM model. The credit $c_k$ should
3028: be high for small function values $z_k\!=\!f(y_k)$.
3029: The credit should also be weighted with $\alpha_k$ to reflect the
3030: same strategy as in the FM case. The choice of $c_k\!=\!-\alpha_k z_k$
3031: is natural. Here, the feedback is not binary but
3032: $c_k\!\in\!C\!\subset\!I\!\!R$, with $C$ being a countable subset of
$I\!\!R$, e.g.\ the computable reals or all rational numbers. The
feedback $x'_k$ should be the function value $f(y_k)$.
So we set $x'_k\!=\!z_k$. Note that there is a redundancy
if $\alpha_{()}$ is a computable function with no zeros, as
$c_k\!=-\alpha_kx'_k$. So, for small $K(\alpha_{()})$, as in
the FMS model, one might set $x'_k\equiv\epsilon$. If we keep $x'_k$,
the AI prior probability is
3040: \beq\label{muAIfm}
3041:   \mu^{AI}(y_1\pb x_1...y_n\pb x_n)
3042:   \;=\; \left\{
3043:   \begin{array}{cl}
3044:     \mu^{FM}(y_1\pb z_1...y_n\pb z_n)
3045:     & \mbox{for } c_k=-\alpha_kz_k,\; x'_k=z_k,\; x_k=c_kx_k' \\
3046:     0 & \mbox{else}.
3047:   \end{array} \right.
3048: \eeq
3049: Inserting this into (\ref{ydotrec}) with $m_k=T$ we get
3050: \beqn
3051:   \hh y_k^{AI} \;=\;
3052:   \maxarg_{y_k}\sum_{x_k}...\max_{y_T}\sum_{x_T}
3053:   (c_k\!+...+\!c_T)\!\cdot\!
3054:   \mu^{AI}(\hh y_1\hh x_1...y_k\pb x_k...y_T\pb x_T)
3055:   \;=\;
3056: \eeqn
3057: \beqn
3058:   \;=\; \minarg_{y_k}\sum_{z_k}...\min_{y_T}\sum_{z_T}
3059:   (\alpha_kz_k\!+...+\!\alpha_Tz_T)\!\cdot\!
3060:   \mu^{FM}(\hh y_1\hh z_1...y_k\pb z_k...y_T\pb z_T) \;=\; \hh y_k^{FM}
3061: \eeqn
3062: where $\hh y_k^{FM}$ has been defined in (\ref{fmydot}).
The proof of equivalence was so simple because the FM model already has a
3064: rather general structure, which is similar to the full AI model.
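
The step between the two displayed lines can be spelled out as follows;
this only unpacks the substitution and is not an additional result. On the
support of $\mu^{AI}$ in (\ref{muAIfm}) we have $c_i\!=\!-\alpha_iz_i$, so
\beqn
  c_k\!+...+\!c_T \;=\; -(\alpha_kz_k\!+...+\!\alpha_Tz_T)
\eeqn
The sums over $x_i$ collapse to sums over $z_i$, since all other $x_i$
have probability zero. The overall minus sign passes through the
probability weighted sums, and since $\max_y[-g(y)]=-\min_y g(y)$ for
any function $g$, every $\max_{y_i}$ turns into a $\min_{y_i}$ and
$\maxarg_{y_k}$ into $\minarg_{y_k}$.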
3065: 
3066: One might expect no problems when going from the already very
3067: general FM$\xi$ model to the universal AI$\xi$ model (with
3068: $m_k=T$), but there is a pitfall in the case of the FMF model. All
credits $c_k$ are zero in this case, except for the last one, $c_T$.
Although there is a feedback $z_k$ in every cycle, the AI$\xi$
system cannot learn from this feedback as it is not told that in
the final cycle $c_T$ will equal $-z_T$. There is no problem in
3073: the FM$\xi$ model because in this case this knowledge is hardcoded into
3074: $\xi^{FM}$. The AI$\xi$ model must first learn that it
3075: has to minimize a function but it can only learn if there is a
3076: non-trivial credit assignment $c_k$. FMF works for repeated
3077: minimization of (different) functions, such as minimizing $N$
3078: functions in $N\!\cdot\!T$ cycles. In this case there are $N$ non-trivial
3079: feedbacks and AI$\xi$ has time to learn that there is a relation
between $c_{k\!\cdot\!T}$ and $x'_{k\!\cdot\!T}$ every $T^{th}$
3081: cycle. This situation is similar to the case of strategic games
3082: discussed in section \ref{secSG}.
3083: 
3084: There is no problem in applying AI$\xi$ to FMS because the $c$
3085: feedback provides enough information in this case. The only thing
3086: the AI$\xi$ model has to learn, is to ignore the $x$ feedbacks as
all information is already contained in $c$. Interestingly, the
same argument holds for the FME model if $K(\gamma)$ and $K(T)$
are small\footnote{If we set $\alpha_k=e^{\gamma k}$ the condition
on $K(T)$ can be dropped.}. The AI$\xi$ model additionally only has to learn
the relation $c_k\!=\!-e^{\gamma(k-T)}x'_k$. This
3092: task is simple as every cycle provides one data point for a simple
3093: function to learn. This argument is no longer valid for
3094: $\gamma\!\to\!\infty$ as $K(\gamma)\!\to\!\infty$ in this case.
3095: 
3096: %------------------------------%
3097: \paragraph{Remark:}
3098: %------------------------------%
TSP seems to be trivial in the AI$\mu$ model but non-trivial in
the AI$\xi$ model. The reason is that (\ref{fmydot}) just
implements an internal complete search, as
$\mu(f)\!=\!\delta_{ff^{TSP}}$ contains all necessary information.
AI$\mu$ outputs, from the very beginning, the exact minimum of $f^{TSP}$. This
``solution'' is, of course, unacceptable from a performance
perspective. As long as we give no efficient approximation $\xi^c$
of $\xi$, we have not contributed anything to a solution of the
TSP by using AI$\xi^c$. The same is true for any other problem
where $f$ is computable and easily accessible. Therefore, TSP is not (yet)
a good example because all we have done is to replace an NP-complete
problem with the uncomputable AI$\xi$ model or with a
computable AI$\xi^c$ model, for which we have said nothing about
computation time yet. It is simply overkill to reduce simple
problems to AI$\xi$. TSP is a simple problem in this respect, until
we consider the AI$\xi^c$ model seriously. For the other examples,
where $f$ is inaccessible or complicated, AI$\xi^c$ provides a
true solution to the minimization problem, as an explicit
definition of $f$ is not needed for AI$\xi$ and AI$\xi^c$.
3118: 
3119: \newpage
3120: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3121: \section{Supervised Learning by Examples (EX)}\label{secEX}
3122: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3123: 
3124: %------------------------------%
3125: %\paragraph{Introduction (reinforcement versus supervised learning:}
3126: %------------------------------%
The AI models provide a framework for reinforcement learning. The
3128: environment provides a feedback $c$, informing the system about the
3129: quality of its last output $y$; it assigns credit $c$ to output
3130: $y$. In this sense, reinforcement learning is explicitly integrated
3131: into the AI$\rho$ model. For $\rho\!=\!\mu$ it maximizes the true
3132: expected credit, whereas the AI$\xi$ model is a universal,
3133: environment independent, reinforcement learning algorithm.
3134: 
3135: There is another type of learning method: Supervised learning by
3136: presentation of examples (EX). Many problems learned by this
3137: method are association problems of the following type. Given some
3138: examples $x\!\in\!R\subset\!X$, the system should reconstruct, from
3139: a partially given $x'$, the missing or corrupted parts, i.e.
3140: complete $x'$ to $x$ such that relation $R$ contains $x$. In many
3141: cases, $X$ consists of pairs $(z,v)$, where $v$ is the possibly
3142: missing part.
3143: 
3144: %------------------------------%
3145: \paragraph{Applications/Examples:}
3146: %------------------------------%
3147: Learning functions by presenting $(z,f(z))$ pairs and asking for
3148: the function value of $z$ by presenting $(z,?)$ also falls into
3149: this category.
3150: 
3151: A basic example is learning properties of geometrical objects
3152: coded in some way. E.g.\ if there are 18 different objects
3153: characterized by their size (small or big), their colors (red,
green or blue) and their shapes (square, triangle or circle), then
3155: $(object,property)\!\in\!\!R$ if the $object$ possesses the
3156: $property$. Here, $R$ is a relation which is not the graph of a
3157: single valued function.
3158: 
When teaching a child, by pointing to objects and saying ``this is
a tree'' or ``look how green'' or ``how beautiful'', one
establishes a relation of $(object,property)$ pairs in $R$.
Pointing to a (possibly different) tree later and asking ``what is
this?'' corresponds to a partially given pair $(object,?)$, where
the missing part ``?'' should be completed by the
child saying ``tree''.
3166: 
3167: A final example we want to give is chess. We have seen that, in
3168: principle, chess can be learned by reinforcement learning. In the
3169: extreme case the environment only provides credit $c\!=\!1$ when
the system wins. The learning rate is completely unacceptable from
3171: a practical point of view. The reason is the very low amount of
3172: information feedback. A more practical method of teaching chess is
3173: to present example games in the form of sensible
3174: $(board\mbox{-}state,move)$
3175: sequences. They contain information about legal and good moves
3176: (but without any explanation). After several games have been presented, the
3177: teacher could ask the system to make its own move by presenting
3178: $(board\mbox{-}state,?)$ and then evaluate the answer of the system.
3179: 
3180: %------------------------------%
3181: \paragraph{Supervised learning with the AI$\mu/\xi$ model:}
3182: %------------------------------%
3183: Let us define the EX model as follows: The environment presents
3184: inputs
3185: $x'_k = z_kv_k \equiv (z_k,v_k) \in R\!\cup\!(Z\!\times\!\{?\}) \subset
3186:  Z\!\times\!(Y\!\cup\!\{?\}) = X'$
3187: to the system in cycle $k$. The system is expected to output $y_{k+1}$
3188: in the next cycle, which is evaluated with $c_{k+1}\!=\!1$ if $(z_k,y_{k+1})\!\in\!R$ and 0
3189: otherwise. To simplify the discussion, an output $y_k$ is expected
3190: and evaluated even when $v_k(\neq?)$ is given. To complete the
3191: description of the environment, the probability distribution
3192: $\mu_R(\pb{x'_1...x'_n})$ of the examples $x'_i$ (depending on $R$)
has to be given. Wrong examples should not occur, i.e.\ $\mu_R$
should be 0 if $x_i'\!\not\in\!R\!\cup\!(Z\!\times\!\{?\})$ for some $1\!\leq\!i\!\leq\!n$.
3195: The relations $R$ might also be probability distributed with
3196: $\sigma(\pb R)$. The example prior probability in this case is
3197: \beq\label{exmudef}
3198:   \mu(\pb{x'_1...x'_n}) \;=\;
3199:   \sum_R \mu_R(\pb{x'_1...x'_n})\!\cdot\!\sigma(\pb R)
3200: \eeq
3201: The knowledge of the valuation $c_k$ on output $y_k$
restricts the possible relations $R$ to those consistent with
3203: $R(z_k,y_{k+1})\!=\!c_{k+1}$, where $R(z,y)\!:=\!1$ if $(z,y)\!\in\!R$ and 0
3204: otherwise. The prior probability for the input sequence
3205: $x_1...x_n$ if the output sequence is $y_1...y_n$, is
3206: therefore
3207: \beqn
3208:   \mu^{AI}(y_1\pb x_1...y_n\pb x_n) \;=\;
3209:   \sum_{R:\forall 1\leq i< n[R(z_i,y_{i+1})=c_{i+1}]}
3210:   \mu_R(\pb{x'_1...x'_n})\!\cdot\!\sigma(\pb R)
3211: \eeqn
where $x_i\!=\!c_ix'_i$ and $x'_i\!=\!z_iv_i$ with $v_i\!\in\!Y\!\cup\!\{?\}$.
In the I/O sequence $y_1x_1y_2x_2...=y_1c_1z_1v_1y_2c_2z_2v_2...$
the $y_1c_1$ are dummies, after which regular behaviour starts.
3215: 
3216: The AI$\mu$ model is optimal by construction of $\mu^{AI}$. For
computable prior $\mu_R$ and $\sigma$, we expect near-optimal
behaviour of the universal AI$\xi$ model if $\mu_R$ additionally satisfies some
3219: separability property. In the following, we give some motivation
3220: why the AI$\xi$ model takes into account the supervisor
3221: information contained in the examples and why it learns faster than by
3222: reinforcement.
3223: 
3224: %------------------------------%
3225: %\paragraph{Reason why AI$\xi$ can learn supervised:}
3226: %------------------------------%
3227: We keep $R$ fixed and assume
3228: $\mu_R(x'_1...x'_n)\!=\!\mu_R(x'_1)\!\cdot...\cdot\!\mu_R(x'_n)\!\neq\!0
3229: \Leftrightarrow x'_i\!\in\!R\!\cup\!(Z\!\times\!\{?\})\;\forall i$
to simplify the discussion. Short codes $q$ contribute most to
$\xi^{AI}(y_1\pb x_1...y_n\pb x_n)$. As $x'_1...x'_n$ is
distributed according to the computable probability distribution
$\mu_R$, a short code of $x'_1...x'_n$ for large enough $n$ is a
Huffman coding w.r.t.\ the distribution $\mu_R$. So we expect
$\mu_R$ and hence $R$ to be coded in the dominant contributions to
3236: $\xi^{AI}$ in some way, where the plausible assumption was made
3237: that the $y$ on the input tape do not matter. Much more than one
3238: bit per cycle will usually be learned, hence, relation $R$ can be
3239: learned in $n\!\ll\!K(R)$ cycles by appropriate examples. This
3240: coding of $R$ in $q$ evolves independently of the feedbacks $c$.
3241: To maximize the feedback $c_k$, the system has to learn to output
3242: a $y_{k+1}$ with $(z_k,y_{k+1})\!\in\!R$. The system has to invent
3243: a program extension $q'$ to $q$, which extracts $z_k$ from
$x'_k\!=\!z_kv_k$ and searches for and outputs a $y_{k+1}$ with
3245: $(z_k,y_{k+1})\!\in\!R$. As $R$ is already coded in $q$, $q'$ can
3246: re-use this coding of $R$ in $q$. The size of the extension $q'$
3247: is, therefore, of $O(1)$. To learn this $q'$, the system requires
3248: feedback $c$ with information content of $O(1)\!=\!K(q')$ only.
3249: 
3250: Let us compare this with reinforcement learning, where only $x'_k\!=\!(z_k,?)$
3251: pairs are presented. A coding of $R$ in a short code $q$ for
3252: $x'_1...x'_n$ is of no use and will therefore be absent. Only the
3253: credits $c$ force the system to learn $R$. $q'$ is therefore
3254: expected to be of size $K(R)$. The information content in the
$c$'s must be of the order $K(R)$. In practice, there are often only very few
$c_k\!=\!1$ at the beginning of the learning phase and the
information content in $c_1...c_n$ is much less than $n$ bits. The
required number of cycles to learn $R$ by reinforcement is,
therefore, at least $K(R)$, and in many cases much larger.
3260: 
3261: Although AI$\xi$ was never designed or told to learn
3262: supervised, it learns how to take advantage of the examples from
the supervisor. $\mu_R$ and $R$ are learned from the examples; the
3264: credits $c$ are not necessary for this process. The remaining task
3265: of learning how to learn supervised is then a simple task of
3266: complexity $O(1)$, for which the credits $c$ are necessary.
3267: 
3268: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3269: \section{Other AI Classes}\label{secOther}
3270: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3271: \ifprivate
3272: \begin{itemize}\parskip=0ex\parsep=0ex\itemsep=0ex
3273: \item Function Inversion
3274: \item Building analogies
3275: \item Delayed SP
3276: \item Artificial Life
3277: \end{itemize}
3278: \fi
3279: 
3280: %------------------------------%
3281: \paragraph{Other aspects of intelligence:}
3282: %------------------------------%
3283: In AI, a variety of general ideas and methods have been developed.
3284: In the last sections, we have seen how several problem classes can
3285: be formulated within AI$\xi$. As we claim universality of the
AI$\xi$ model, we want to shed light on which of the other AI
methods are incorporated in the AI$\xi$ model, and how, by looking at its
3288: structure. Some methods are directly included,
3289: others are or should be emergent. We do not claim the following
3290: list to be complete.
3291: 
3292: {\it Probability theory} and {\it utility theory} are the heart of
3293: the AI$\mu/\xi$ models. The probabilities are the true/universal
3294: behaviours of the environment. The utility function is what we
3295: called total credit, which should be maximized. Maximization of an
3296: expected utility function in a probabilistic environment is
3297: usually called {\it sequential decision theory}, and is explicitly integrated
3298: in full generality in our model. This includes probabilistic (a
3299: generalization of deterministic) {\it reasoning}, where the
objects of reasoning are not true or false statements, but the
3301: prediction of the environmental behaviour. {\it Reinforcement
3302: Learning}
3303: is explicitly built in, due to the credits. Supervised learning is
3304: an emergent phenomenon (section \ref{secEX}). {\it Algorithmic
3305: information theory} leads us to use $\xi$ as a universal estimate
3306: for the prior probability $\mu$.
3307: 
For horizon $>\!1$, the alternating expectimax series
3309: in (\ref{facydot}) and the process of selecting maximal
3310: values can be interpreted as abstract {\it planning}. This expectimax
3311: series also includes {\it informed search}, in the case of AI$\mu$, and {\it
3312: heuristic search}, for AI$\xi$, where $\xi$ could be interpreted as
3313: a heuristic for $\mu$. The minimax strategy of {\it game playing}
3314: in case of AI$\mu$ is also subsumed. The AI$\xi$ model converges
3315: to the minimax strategy if the environment is a minimax player but
3316: it can also take advantage of environmental players with limited
3317: rationality. {\it Problem solving} occurs (only) in the form of
3318: how to maximize the expected future credit.
3319: 
3320: {\it Knowledge} is accumulated by AI$\xi$ and is stored in some
3321: form not specified further on the working tape. Any kind of
information in any representation on the inputs $x$ is
3323: exploited. The problem of {\it knowledge engineering} and
3324: representation appears in the form of how to train the AI$\xi$
3325: model. More practical aspects, like {\it language or image
3326: processing} have to be learned by AI$\xi$ from scratch.
3327: 
3328: Other theories, like {\it fuzzy logic}, {\it possibility theory},
3329: {\it Dempster-Shafer theory}, ... are partly outdated and partly
3330: reducible to Bayesian probability theory \cite{Che85}. The
3331: interpretation and effects of the evidence gap
3332: $g\!:=\!1\!-\!\sum_{x_k}\xi(y\!x_{<k}y\!\pb x_k)\!>\!0$ in $\xi$ may
3333: be similar to those in Dempster-Shafer theory. Boolean logical
3334: reasoning about the external world plays, at best, an emergent
3335: role in the AI$\xi$ model.
3336: 
Other methods, which do not seem to be contained in the AI$\xi$ model,
might also be emergent phenomena. The AI$\xi$ model has to
construct short codes of the environmental behaviour, the
AI$\xi^{\tilde t\tilde l}$ (see next section) has to construct
short action programs. If we were to analyze and interpret these
programs for realistic environments, we might find some of the
unmentioned or unused or new AI methods at work in these
algorithms. This is, however, pure speculation at this point. More
importantly, when trying to make AI$\xi$ practically usable,
3346: some other AI methods, like genetic algorithms or neural nets,
3347: may be useful.
3348: 
3349: The main thing we wanted to point out is that the AI$\xi$ model
3350: does not lack any important known property of intelligence or
3351: known AI methodology. What {\it is} missing, however, are computational
aspects, which are addressed in the next section.
3353: 
3354: \newpage
3355: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3356: \section{Time Bounds and Effectiveness}\label{secTime}
3357: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3358: 
3359: %------------------------------%
3360: \paragraph{Introduction:}
3361: %------------------------------%
3362: Until now, we have not bothered with the non-computability of the
3363: universal probability distribution $\xi$. As all universal models
3364: in this paper are based on $\xi$, they are not effective in this
form. In this section, we will outline how the previous models and
results can be modified/generalized to the time-bounded case.
Indeed, the situation is not as bad as it could be. $\xi$ and $C$
are enumerable and $\hh y_k$ is still approximable or computable
in the limit. There exists an algorithm that will produce a
sequence of outputs eventually converging to the exact output $\hh
y_k$, but we can never be sure whether we have already reached it.
Besides this, the convergence is extremely slow, so this type of
asymptotic computability is of no direct (practical) use, but will,
nevertheless, be important later.
3375: 
3376: Let $\tilde p$ be a program which calculates within a reasonable
time $\tilde t$ per cycle, a reasonably intelligent output,
i.e.\ $\tilde p(\hh x_{<k})\!=\!\hh y_{1:k}$. This sort
3379: of computability assumption, that a general purpose computer of
3380: sufficient power is able to behave in an intelligent way, is
3381: the very basis of AI, justifying the
3382: hope to be able to construct systems which eventually reach and outperform
3383: human intelligence. For a contrary viewpoint see \cite{Pen89}. It
3384: is not necessary to discuss here, what is meant by 'reasonable
3385: time/intelligence' and 'sufficient power'. What we are interested
3386: in, in this section, is whether there is a computable version
3387: AI$\xi^{\tilde t}$ of the AI$\xi$ system which is superior or equal to any
3388: $p$ with computation time per cycle of at most $\tilde t$.
By `superior' we mean `more intelligent', so what we
need is an order relation like (\ref{aiorder}) for intelligence.
3391: 
3392: The best result we could think of would be an AI$\xi^{\tilde t}$
3393: with computation time $\leq\!\tilde t$ at least as intelligent as
3394: any $p$ with computation time $\leq\!\tilde t$. If AI is possible
3395: at all, we would have reached the final goal, the construction of
the most intelligent algorithm with computation time $\leq\!\tilde t$.
Just as there is no universal measure in the set of computable
measures (within time $\tilde t$), such an AI$\xi^{\tilde t}$ may
not exist either.
3400: 
What we can realistically hope to construct is an AI$\xi^{\tilde
t}$ system of computation time $c\!\cdot\!\tilde t$ per cycle for
some constant $c$. The idea is to run all programs $p$ of length
$\leq\!\tilde l\!:=\!l(\tilde p)$ and time $\leq\!\tilde t$ per
cycle and pick the best output. The total computation time is
$2^{\tilde l}\!\cdot\!\tilde t$, hence $c=2^{\tilde l}$. This sort
of idea of `typing monkeys', with one of them eventually writing
Shakespeare, has been applied in various forms and contexts in
theoretical computer science. The realization of this {\it best
vote} idea, in our case, is not straightforward and will be
outlined in this section. A related idea is that of basing the
decision on the majority of algorithms. This `democratic vote'
idea has been used in \cite{LiWa89,Vov92} for sequence prediction,
and is referred to as `weighted majority' there.
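
The factor $2^{\tilde l}$ can be checked by a simple count. The sketch
below assumes binary programs and is meant only as an order of magnitude
check; the summation index $i$ is introduced here for illustration. The
number of binary strings of length at most $\tilde l$ is
\beqn
  \sum_{i=0}^{\tilde l}2^i \;=\; 2^{\tilde l+1}-1 \;<\;
  2\!\cdot\!2^{\tilde l}
\eeqn
so running each of them for time $\tilde t$ costs less than
$2\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t$ per cycle. This agrees with
$c\!=\!2^{\tilde l}$ up to a factor of 2, if the overhead for simulating
the programs and for selecting the best output is ignored.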
3415: 
3416: %------------------------------%
3417: \paragraph{Time limited probability distributions:}
3418: %------------------------------%
3419: In the literature one can find time limited versions of Kolmogorov
3420: complexity \cite{Dal73,Ko86} and the time limited universal
3421: semimeasure \cite{LiVi91,LiVi93}. In the following, we
3422: utilize and adapt the latter and see how far we get. One way to define a
3423: time-limited universal chronological semimeasure is
3424: as a sum over all enumerable chronological semimeasures
3425: computable within time $\tilde t$ and of size at most $\tilde l$
3426: similar to the unbounded case (\ref{xirhodef}).
3427: \beq\label{aixitl}
3428:   \xi^{\tilde t\tilde l}(y\!\pb x_{1:n})
3429:   \;:=\; \nq\sum_{\quad\rho\;:\;l(\rho)\leq\tilde l\;\wedge\;t(\rho)\leq\tilde t}
3430:   \nq\nq 2^{-l(\rho)}\rho(y\!\pb x_{1:n})
3431: \eeq
3432: Let us assume that the true environmental prior probability $\mu^{AI}$
3433: is equal to or sufficiently accurately approximated by a $\rho$ with
3434: $l(\rho)\!\leq\!\tilde l$ and $t(\rho)\!\leq\!\tilde t$ with $\tilde
3435: t$ and $\tilde l$ of reasonable size. There are several AI
3436: problems that fall into this class. In function minimization of
section \ref{secFM}, the computation of $f$ and $\mu^{FM}$ is
3438: usually feasible. In many cases, the sequences of section \ref{secSP}
3439: which should be predicted, can be easily calculated when $\mu^{SP}$
3440: is known. In a classifier problem, the
3441: probability distribution $\mu^{CF}$, according to which examples
3442: are presented, is, in many cases, also elementary. But not all AI
problems are of this `easy' type. For the strategic games of section
\ref{secSG}, the environment is usually, itself, a highly
complex strategic player with a $\mu^{SG}$
that is difficult to calculate,
although one might argue that the environmental player may have
limited capabilities too. But it is easy to think of a
difficult-to-calculate physical (probabilistic) environment like the
chemistry of biomolecules.
3451: 
3452: The number of interesting applications makes this restricted class
3453: of AI problems, with time and space bounded environment
$\mu^{\tilde t\tilde l}$, worth studying. Superscripts to a
probability distribution, except for $\xi^{\tilde t\tilde l}$,
indicate its length and maximal computation time. $\xi^{\tilde
3457: t\tilde l}$ defined in (\ref{aixitl}), with a yet to be determined
3458: computation time, multiplicatively dominates all $\mu^{\tilde
3459: t\tilde l}$ of this type. Hence, an AI$\xi^{\tilde t\tilde l}$
3460: model, where we use $\xi^{\tilde t\tilde l}$ as prior probability,
3461: is universal, relative to all AI$\mu^{\tilde t\tilde l}$ models in
3462: the same way as AI$\xi$ is universal to AI$\mu$ for all enumerable
3463: chronological semimeasures $\mu$. The $\maxarg_{y_k}$ in
3464: (\ref{ydotxi}) selects a $y_k$ for which $\xi^{\tilde t\tilde l}$
3465: has the highest expected utility $C_{km_k}$, where $\xi^{\tilde
3466: t\tilde l}$ is the weighted average over the $\rho^{\tilde t\tilde
3467: l}$. $\hh y_k^{AI\xi^{\tilde t\tilde l}}$ is determined by a
weighted majority. We expect AI$\xi^{\tilde t\tilde l}$ to
outperform all (bounded) AI$\rho^{\tilde t\tilde l}$, analogous to the
unrestricted case.
3471: 
3472: In the following we analyze the computability properties of
3473: $\xi^{\tilde t\tilde l}$ and AI$\xi^{\tilde t\tilde l}$,
3474: i.e.\ of $\hh y_k^{AI\xi^{\tilde t\tilde l}}$. To compute
3475: $\xi^{\tilde t\tilde l}$ according to the definition
3476: (\ref{aixitl}) we have to enumerate all chronological enumerable semimeasures
3477: $\rho^{\tilde t\tilde l}$ of length $\leq\!\tilde l$
3478: and computation time $\leq\!\tilde t$. This can be done similarly to
3479: the unbounded case (\ref{ccsm1}-\ref{ccsm3}). All $2^{\tilde l}$
3480: enumerable functions of length $\leq\!\tilde l$, computable within time
3481: $\tilde t$ have to be converted to chronological probability
3482: distributions. For this, one has to evaluate each function for
3483: $|X|\!\cdot\!k$ different arguments. Hence,
3484: $\xi^{\tilde t\tilde l}$ is computable within time\footnote{We
3485: assume that a TM can be simulated by another in linear time.}
3486: $
3487:   t(\xi^{\tilde t\tilde l}(y\!\pb x_{1:k})) \!=\!
3488:   O(|X|\!\cdot\!k\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t)
3489: $.
3490: The computation time of $\hh y_k^{AI\xi^{\tilde t\tilde l}}$
3491: depends on the size of $X$, $Y$ and $m_k$.
3492: $\xi^{\tilde t\tilde l}$ has to be
3493: evaluated $|Y|^{h_k}|X|^{h_k}$ times in (\ref{ydotxi}).
3494: It is possible to
3495: optimize the algorithm and perform the computation within time
3496: \beq\label{tyaixi}
3497:   t(\hh y_k^{AI\xi^{\tilde t\tilde l}}) \;=\;
3498:   O(|Y|^{h_k}|X|^{h_k}\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t)
3499: \eeq
3500: per cycle. If we assume that the computation time of $\mu^{\tilde
3501: t\tilde l}$ is exactly $\tilde t$ for all arguments, the brute
3502: force time $\bar t$ for calculating the sums and maxs in
3503: (\ref{ydotrec}) is $\bar t(\hh y_k^{AI\mu^{\tilde t\tilde
3504: l}})\!\geq\!|Y|^{h_k}|X|^{h_k}\!\cdot\!\tilde t$. Combining this
3505: with (\ref{tyaixi}), we get
3506: \beqn
3507:   t(\hh y_k^{AI\xi^{\tilde t\tilde l}}) \;=\;
3508:   O(2^{\tilde l}\!\cdot\!
3509:   \bar t(\hh y_k^{AI\mu^{\tilde t\tilde l}}))
3510: \eeqn
3511: This result has the proposed structure, that there is a universal
3512: AI$\xi^{\tilde t\tilde l}$ system with computation time
3513: $2^{\tilde l}$ times the computation time of a special
3514: AI$\mu^{\tilde t\tilde l}$ system.
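
To get a feeling for the magnitudes involved, the following minimal Python
sketch evaluates the expression inside the $O(\cdot)$ of (\ref{tyaixi}) and
compares it with the brute force lower bound for a single known environment.
The function names and the toy sizes are illustrative assumptions and not part
of the model; the constant hidden in $O(\cdot)$ is simply set to one.

\begin{verbatim}
```python
def aixi_tl_cycle_cost(size_y, size_x, horizon, l_bound, t_bound):
    """Evaluate |Y|^h * |X|^h * 2^l * t, the expression inside O(.) of the
    per-cycle time bound. The hidden constant is taken to be 1 (assumption)."""
    evaluations = (size_y ** horizon) * (size_x ** horizon)
    return evaluations * (2 ** l_bound) * t_bound


def brute_force_mu_cost(size_y, size_x, horizon, t_bound):
    """Lower bound |Y|^h * |X|^h * t on the brute force time for one known
    environment that needs exactly t steps per evaluation."""
    return (size_y ** horizon) * (size_x ** horizon) * t_bound


if __name__ == "__main__":
    # Hypothetical toy sizes: binary outputs and inputs, horizon 3,
    # programs of at most 10 bits, 1000 steps per evaluation.
    xi_cost = aixi_tl_cycle_cost(2, 2, 3, 10, 1000)
    mu_cost = brute_force_mu_cost(2, 2, 3, 1000)
    print(xi_cost, mu_cost, xi_cost // mu_cost)
```
\end{verbatim}

In this toy setting the ratio of the two costs is exactly $2^{\tilde l}$, as
the combined bound states.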
3515: 
3516: Unfortunately, the class of AI$\mu^{\tilde t\tilde l}$ systems
3517: with brute force evaluation of $\hh y_k$, according to
3518: (\ref{ydotrec}) is completely uninteresting from a practical point
of view. E.g.\ in the context of chess, the above result says that
the AI$\xi^{\tilde t\tilde l}$ is superior within time $2^{\tilde
l}\!\cdot\!\tilde t$ to any brute force minimax strategy of computation time
$\tilde t$. Even if the factor of $2^{\tilde l}$ in computation
time did not matter, the AI$\xi^{\tilde t\tilde l}$ system would
nevertheless be practically useless, as a brute force minimax chess
player with reasonable time $\tilde t$ is a very poor player.
3526: 
Note that in the case of sequence prediction ($h_k\!=\!1$,
$|Y|\!=\!|X|\!=\!2$) the computation time of $\rho$ coincides with
that of $\hh y_k^{AI\rho}$ within a factor of 2. The class
AI$\rho^{\tilde t\tilde l}$ includes {\it all} non-incremental
sequence prediction algorithms of size $\leq\!\tilde l$ and
computation time $\leq\!\tilde t/2$. By non-incremental, we mean
that no information from previous cycles is taken into account in
the computation of $\hh y_k$ in the current cycle.
3535: 
3536: The shortcomings (mentioned and unmentioned ones) of this
3537: approach are cured in the next subsection, by deviating from the
3538: standard way of defining a timebounded $\xi$ as a sum over functions or
3539: programs.
3540: 
3541: %------------------------------%
3542: \paragraph{The idea of the best vote algorithm:}
3543: %------------------------------%
3544: A general cybernetic or AI system is a chronological program
3545: $p(x_{<k})=y_{1:k}$. This form, introduced in section
3546: \ref{secAIfunc}, is general enough to include any AI system (and
3547: also less intelligent systems).
3548: In the following, we are interested in programs $p$ of length
3549: $\leq\!\tilde l$ and computation time $\leq\!\tilde t$ per cycle.
3550: One important point in the time-limited setting is that $p$ should be
3551: incremental, i.e. when computing $y_k$ in cycle $k$, the
3552: information of the previous cycles stored on the working tape can
3553: be re-used. Indeed, there is probably no practically interesting,
3554: non-incremental AI system at all.
3555: 
3556: In the following, we construct a policy $p^\best$, or more
3557: precisely, policies $p_k^\best$ for every cycle $k$ that
outperform all time and length limited AI systems $p$. In cycle $k$,
3559: $p_k^\best$ runs all $2^{\tilde l}$ programs $p$ and selects the
3560: one with the best output $y_k$. This is a 'best vote' type of
3561: algorithm, as compared to the 'weighted majority' like algorithm of the
3562: last subsection. The ideal measure for the quality of the output
3563: would be the $\xi$ expected credit
3564: \beq
3565:  C_{km}(p|\hh y\!\hh x_{<k}) \;:=\; \sum_{q\in\hh Q_k}2^{-l(q)}C_{km}(p,q)
3566:  \quad,\quad
3567:   C_{km}(p,q) \;:=\; c(x_k^{pq})+...+c(x_m^{pq})
3568: \eeq
3569: The program $p$ which maximizes $C_{km_k}$ should be
3570: selected. We have dropped the normalization $\cal N$ unlike in
3571: (\ref{cxi}), as it is independent of $p$ and
3572: does not change the order relation which we are solely interested
3573: in here. Furthermore, without normalization, $C_{km}$ is enumerable,
3574: which will be important later.
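
The unnormalized credit defined above can be illustrated by a small Python
sketch. It rests on strong simplifying assumptions: the set $\hh Q_k$ of
consistent environments is replaced by a short hand-written list of pairs
(length, function), policies and environments are plain Python functions
rather than chronological Turing machines, and an input $x_i$ is modelled
directly as its credit. All names are hypothetical.

\begin{verbatim}
```python
from fractions import Fraction


def credit(policy, env, history, k, m):
    """C_km(p,q): sum of the credits of cycles k..m obtained when policy p
    interacts with environment q, continuing the given history of (y, x)
    pairs. An input x doubles as its own credit here (assumption)."""
    history = list(history)
    total = 0
    for _ in range(k, m + 1):
        y = policy(history)       # output of the current cycle
        x = env(history, y)       # input of the current cycle
        history.append((y, x))
        total += x
    return total


def xi_expected_credit(policy, weighted_envs, history, k, m):
    """Unnormalized xi-expected credit: sum over q of 2^(-l(q)) * C_km(p,q)."""
    return sum(Fraction(1, 2 ** length) * credit(policy, env, history, k, m)
               for length, env in weighted_envs)


# Two toy environments with made-up program lengths l(q).
def rewards_output_one(history, y):
    return 1 if y == 1 else 0


def rewards_alternation(history, y):
    last_y = history[-1][0] if history else 0
    return 1 if y != last_y else 0


ENVS = [(2, rewards_output_one), (3, rewards_alternation)]


def always_one(history):
    return 1


def alternate(history):
    return 1 - history[-1][0] if history else 1


if __name__ == "__main__":
    print(xi_expected_credit(always_one, ENVS, [], 1, 4))
    print(xi_expected_credit(alternate, ENVS, [], 1, 4))
```
\end{verbatim}

Since only the order of the credits matters for the selection, the missing
normalization does not affect which of the two toy policies is preferred.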
3575: 
3576: %------------------------------%
3577: \paragraph{Extended chronological programs:}
3578: %------------------------------%
3579: In the (functional form of the) AI$\xi$ model it was convenient to
3580: maximize $C_{km_k}$ over all $p\!\in\!\hh P_k$,
3581: i.e. all $p$ consistent with the current history $\hh y\!\hh x_{<k}$.
3582: This was no restriction, because for every
3583: possibly inconsistent program $p$ there exists a program $p'\!\in\!\hh P_k$ consistent
3584: with the current history and identical to $p$ for all future
3585: cycles $\geq\!k$. For the time limited best vote algorithm
3586: $p^\best$ it would be too restrictive to demand $p\!\in\!\hh
3587: P_k$. To prove universality, one has to compare {\it all} $2^{\tilde l}$
3588: algorithms in every cycle, not just the consistent ones. An
3589: inconsistent algorithm may become the best one in later cycles.
3590: For inconsistent programs we have to include the $\hh y_k$ into the
3591: input, i.e. $p(\hh y\!\hh x_{<k})\!=\!y_{1:k}^p$
3592: with $\hh y_i\!\neq\!y_i^p$ possible. For $p\!\in\!\hh P_k$ this
3593: was not necessary, as $p$ knows the output $\hh y_k\equiv y_k^p$ in
this case. The $c(x_i^{pq})$ in the definition of $C_{km}$ are the
3595: valuations emerging in the I/O sequence, starting with $\hh
3596: y\!\hh x_{<k}$ (emerging from $p^\best$) and then continued
3597: by applying $p$ and $q$ with $\hh y_i\!:=\!y_i^p$ for
3598: $i\!\geq\!k$.
3599: 
3600: Another problem is that we need $C_{km_k}$ to select the best
3601: policy, but unfortunately $C_{km_k}$ is uncomputable. Indeed, the
3602: structure of the definition of $C_{km_k}$ is very similar to that
of $\hh y_k$, hence a brute force approach to approximate
$C_{km_k}$ requires too much computation time, just as for $\hh y_k$. We
solve this problem in a similar way, by supplementing each $p$ with
3606: a program that estimates $C_{km_k}$ by $w_k^p$ within time
3607: $\tilde t$. We combine the calculation of $y_k^p$ and $w_k^p$ and
3608: extend the notion of a chronological program once again to
3609: \beq\label{extprog}
3610:   p(\hh y\!\hh x_{<k}) \;=\; w_1^py_1^p...w_k^py_k^p
3611: \eeq
3612: with chronological order $w_1^py_1^p\hh y_1\hh x_1
3613: w_2^py_2^p\hh y_2\hh x_2...$.
3614: 
3615: %------------------------------%
3616: \paragraph{Valid approximations:}
3617: %------------------------------%
3618: $p$ might suggest any output $y_k^p$ but it is not allowed to rate
3619: it with an arbitrarily high $w_k^p$ if we want $w_k^p$ to be a reliable
3620: criterion for selecting the best $p$. We demand that no policy is
3621: allowed to claim that it is better than it actually is. We define
3622: a (logical) predicate VA($p$) called {\it valid approximation}, which
3623: is true if, and only if, $p$ always satisfies
3624: $w_k^p\!\leq\!C_{km_k}(p)$, i.e. never overrates itself.
3625: \beq\label{vadef}
3626:   \mbox{VA}(p) \;\equiv\;
3627:   \forall k\forall w_1^py_1^p\hh y_1\hh x_1...w_k^py_k^p :
3628:   p(\hh y\!\hh x_{<k}) \!=\! w_1^py_1^p...w_k^py_k^p
3629:   \Rightarrow
3630:   w_k^p\!\leq\!C_{km_k}(p|\hh y\!\hh x_{<k})
3631: \eeq
3632: In the following, we restrict our attention to programs $p$, for which
3633: VA($p$) can be proved in some formal axiomatic system.
3634: A very important point is that $C_{km_k}$ is enumerable.
This ensures the existence of sequences of
programs $p_1, p_2, p_3, ...$ for which VA($p_i$) can be proved and
3637: $\lim_{i\to\infty}w_k^{p_i}\!=\!C_{km_k}(p)$
3638: for all $k$ and all I/O sequences. The approximation is not
3639: uniform in $k$, but this does not matter as the selected $p$ is allowed to change
3640: from cycle to cycle.
3641: 
Another possibility would be to consider only those $p$ which check
$w_k^p\!\leq\!C_{km_k}(p)$ online in every cycle, instead of
the pre-check VA($p$), either by constructing a proof (on the working
tape) for this special case, or because it is already evident from the
construction of $w_k^p$. In cases where $p$ cannot guarantee
$w_k^p\!\leq\!C_{km_k}(p)$ it sets $w_k^p\!=\!0$ and, hence, trivially
satisfies $w_k^p\!\leq\!C_{km_k}(p)$. On the other hand, for these
3649: $p$ it is also no problem to prove VA($p$) as one has simply to
3650: analyze the internal structure of $p$ and recognize that $p$ shows
3651: the validity internally itself, cycle by cycle, which is easy by
3652: assumption on $p$. The cycle by cycle check is, therefore, a special
3653: case of the pre-proof of VA($p$).
3654: 
3655: %------------------------------%
3656: \paragraph{Effective intelligence order relation:}
3657: %------------------------------%
3658: In section \ref{secAIxi} we have introduced an intelligence order
3659: relation $\succeq$ on AI systems, based on the expected credit
3660: $C_{km_k}(p)$. In the following we need an order relation
3661: $\succeq^c$ based on the claimed credit $w_k^p$ which might
3662: be interpreted as an approximation to $\succeq$. We call $p$
3663: {\it effectively more or equally intelligent} than $p'$ if
3664: \bqa\label{effaiord}
3665:   p\succeq^c\!p' \;:\Leftrightarrow\;
3666:   \forall k\forall \hh y\!\hh x_{<k}
  \exists w_{1:k}w'_{1:k} : \\
3668:   p(\hh y\!\hh x_{<k}) \!=\! w_1\!*...w_k\!* \;\wedge\;
3669:   p'(\hh y\!\hh x_{<k}) \!=\! w_1'\!*...w_k'\!* \;\wedge\;
3670:   w_k\!\geq\!w_k'
3671: \eqa
i.e.\ if $p$ always claims a credit estimate $w$ at least as high as that of $p'$.
3673: $\succeq^c$ is a co-enumerable partial order relation on extended
3674: chronological programs. Restricted to valid approximations
3675: it orders the policies w.r.t.\ the quality of their outputs {\it
3676: and} their ability to justify their outputs with high $w_k$.
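
Because $\succeq^c$ quantifies over all cycles and all histories, a finite
computation can refute $p\succeq^c\!p'$ by exhibiting a counterexample, but it
cannot confirm the relation. The following Python sketch illustrates this
reading of co-enumerability on a finite list of histories. It is only a toy:
an extended program is modelled as a hypothetical function mapping a history
of $(y,x)$ pairs to the list of its pairs $(w_i,y_i)$, and all names are
assumptions made for illustration.

\begin{verbatim}
```python
def find_counterexample(p, p_prime, histories):
    """Search a finite collection of histories for one on which p claims a
    strictly lower credit w_k than p_prime in the current cycle. Returns
    that history, or None if no counterexample is found among those tested."""
    for history in histories:
        w_k = p(history)[-1][0]
        w_k_prime = p_prime(history)[-1][0]
        if w_k < w_k_prime:
            return history
    return None


def cautious(history):
    """Toy extended program: always claims credit 0 and outputs 0."""
    k = len(history) + 1
    return [(0, 0)] * k


def confident(history):
    """Toy extended program: always claims credit 1 and outputs 1."""
    k = len(history) + 1
    return [(1, 1)] * k


HISTORIES = [[], [(1, 0)], [(1, 0), (0, 1)]]

if __name__ == "__main__":
    print(find_counterexample(confident, cautious, HISTORIES))
    print(find_counterexample(cautious, confident, HISTORIES))
```
\end{verbatim}

A result of \verb|None| only means that no violation was found on the tested
histories; it is not a proof of the relation.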
3677: 
3678: %------------------------------%
3679: \paragraph{The universal time bounded AI$\xi^{\tilde t\tilde l}$ system:}
3680: %------------------------------%
In the following, we describe the algorithm $p^\best$ underlying
3682: the universal time bounded AI$\xi^{\tilde t\tilde l}$ system. It
3683: is essentially based on the selection of the best algorithms
3684: $p_k^\best$ out of the time ${\tilde t}$ and length ${\tilde l}$
3685: bounded $p$, for which there exists a proof of VA($p$) with length
3686: $\leq\!l_P$.
3687: 
3688: \begin{enumerate}\parskip=0ex\parsep=0ex\itemsep=0ex
3689: \item Create all binary strings of length $l_P$ and interpret each
3690: as a coding of a mathematical proof in the same formal logic system in
3691: which VA($\cdot$) has been formulated. Take those strings
3692: which are proofs of VA($p$) for some $p$ and keep the
3693: corresponding programs $p$.
3694: \item Eliminate all $p$ of length $>\!\tilde l$.
3695: \item Modify all $p$ in the following way: all output $w_k^py_k^p$
is temporarily written on an auxiliary tape. If $p$ stops within $\tilde t$
steps, the internal 'output' is copied to the output tape. If $p$
does not stop after $\tilde t$ steps, a stop is forced, and $w_k\!=\!0$
and some arbitrary $y_k$ are written on the output tape. Let $P$ be
3700: the set of all those modified programs.
3701: \item Start first cycle: $k\!:=\!1$.
3702: \item\label{pbestloop} Run every $p\!\in\!P$ on extended input
3703: $\hh y\!\hh x_{<k}$, where all outputs are redirected to some auxiliary
3704: tape:
3705: $p(\hh y\!\hh x_{<k})\!=\!w_1^py_1^p...w_k^py_k^p$.
3706: \item Select the program $p$ with highest claimed credit $w_k^p$:
3707: $p_k^\best\!:=\!\maxarg_pw_k^p$.
3708: \item Write $\hh y_k\!:=\!y_k^{p_k^\best}$ to the output tape.
3709: \item Receive input $\hh x_k$ from the environment.
3710: \item Begin next cycle: $k\!:=\!k\!+\!1$, goto step
3711: \ref{pbestloop}.
3712: \end{enumerate}
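
Steps 4 to 9 of the algorithm can be sketched in a few lines of Python. This
is only a toy illustration under strong assumptions: the proof search of
steps 1 and 2 is skipped and the set $P$ is simply given as a list; a program
is modelled as a hypothetical Python function that receives the history and a
step budget and returns a pair $(w_k,y_k)$, or \verb|None| if it would exceed
the budget; and the validity of the claimed credits is taken for granted
rather than proved.

\begin{verbatim}
```python
def run_with_timeout(program, history, t_bound):
    """Step 3: run the program under a step budget. On timeout, force
    w_k = 0 and an arbitrary output (here 0)."""
    result = program(history, t_bound)
    return (0, 0) if result is None else result


def p_best_cycle(programs, history, t_bound):
    """Steps 5 to 7: run every p in P and return the output y_k of the
    program with the highest claimed credit w_k (first one wins ties)."""
    claims = [run_with_timeout(p, history, t_bound) for p in programs]
    _, best_y = max(claims, key=lambda claim: claim[0])
    return best_y


def p_best(programs, environment, t_bound, cycles):
    """Steps 4 to 9: the interaction loop, cut off after a fixed number of
    cycles so that the sketch terminates."""
    history = []
    for _ in range(cycles):
        y = p_best_cycle(programs, history, t_bound)
        x = environment(history, y)    # step 8: receive input
        history.append((y, x))
    return history


# Hypothetical toy programs and environment.
def modest(history, t_bound):
    return (1, 0)                      # claims credit 1, outputs 0


def ambitious(history, t_bound):
    # Pretends to need 5 steps; returns None when the budget is too small.
    return (2, 1) if t_bound >= 5 else None


def echo_environment(history, y):
    return y


if __name__ == "__main__":
    print(p_best([modest, ambitious], echo_environment, 10, 3))
    print(p_best([modest, ambitious], echo_environment, 3, 3))
```
\end{verbatim}

With a budget of 10 steps the sketch follows the more confident program in
every cycle; with a budget of 3 that program is cut off, its claim is reset
to zero, and the other one is selected.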
3713: 
3714: It is easy to see that the following theorem holds.
3715: 
3716: %------------------------------%
3717: \paragraph{Main theorem:}
3718: %------------------------------%
3719: Let $p$ be any extended chronological (incremental) program like
3720: (\ref{extprog}) of length $l(p)\!\leq\!\tilde l$ and computation
3721: time per cycle $t(p)\!\leq\!\tilde t$, for which there exists a
3722: proof of VA($p$) defined in (\ref{vadef}) of length $\leq\!l_P$.
The algorithm $p^\best$ constructed in the last subsection,
depending on $\tilde l$, $\tilde t$ and $l_P$ but not on $p$, is
effectively more or equally intelligent, according to $\succeq^c$
defined in (\ref{effaiord}), than any such $p$. The size of
3727: $p^\best$ is $l(p^\best)\!=\!O(\ln(\tilde l\!\cdot\!\tilde
3728: t\!\cdot\! l_P))$, the setup-time is
3729: $t_{setup}(p^\best)\!=\!O(l_P\!\cdot\!2^{l_P})$, the computation
3730: time per cycle is $t_{cycle}(p^\best)\!=\!O(2^{\tilde
3731: l}\!\cdot\!\tilde t)$.
3732: 
Roughly speaking, the theorem says that if there exists a
3734: computable solution to some AI problem at all, the explicitly
3735: constructed algorithm $p^\best$ is such a solution. Although this
3736: theorem is quite general, there are some limitations and open
3737: questions which we discuss in the following.
3738: 
3739: %------------------------------%
3740: \paragraph{Limitations and open questions:}
3741: %------------------------------%
3742: \begin{itemize}\parskip=0ex\parsep=0ex%\itemsep=0ex
3743: \item Formally, the total computation time of $p^\best$ for cycles
3744: $1...k$ increases linearly with $k$, i.e. is of order $O(k)$ with
3745: a coefficient $2^{\tilde l}\!\cdot\!\tilde t$. The unreasonably
large factor $2^{\tilde l}$ is a well known drawback in
best/democratic vote models and will be accepted without further comment, whereas the
3748: factor ${\tilde t}$ can be assumed to be of reasonable size. If we
3749: don't take the limit $k\!\to\!\infty$ but consider reasonable $k$,
3750: the practical usefulness of the timebound on $p^\best$ is somewhat
3751: limited, due to the additional additive constant
3752: $O(l_P\!\cdot\!2^{l_P})$. It is much larger than
3753: $k\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t$ as typically
3754: $l_P\!\gg\!l($VA$(p))\!\geq\!l(p)\!\equiv\!\tilde l$.
3755: \item $p^\best$ is superior only to those $p$ which justify their
3756: outputs (by large $w_k^p$). It might be possible that there are
3757: $p$ which produce good outputs $y_k^p$ within reasonable time, but
3758: it takes an unreasonably long time to justify their outputs by
3759: sufficiently high $w_k^p$. We do not think that (from a certain
3760: complexity level onwards) there are policies where the process of
3761: constructing a good output is completely separated from some sort
3762: of justification process. But this justification might not be
3763: translatable (at least within reasonable time) into a reasonable
3764: estimate of $C_{km_k}(p)$.
3765: \item The (inconsistent) programs $p$ must be able to continue
3766: strategies started by other policies. It might happen that a
3767: policy $p$ steers the environment to a direction for which it is
3768: specialized. A 'foreign' policy might be able to displace $p$
3769: only between loosely bounded episodes. There is probably no
3770: problem for factorizable $\mu$. Think of a chess game, where it is
3771: usually very difficult to continue the game/strategy of a
3772: different player. When the game is over, it is usually advantageous
3773: to replace a player by a better one for the next game. There might
3774: also be no problem for sufficiently separable $\mu$.
3775: \item There might be (efficient) valid approximations $p$ for which
3776: VA($p$) is true but not provable, or for which only a very long
3777: ($>\!l_P$) proof exists.
3778: \end{itemize}
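
The first limitation can be made concrete with a small Python sketch that
counts how many cycles are needed before the accumulated per-cycle time
$k\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t$ reaches the setup time
$l_P\!\cdot\!2^{l_P}$. The constants hidden in the $O(\cdot)$ expressions are
set to one, and the function name and sample values are assumptions chosen
only for illustration.

\begin{verbatim}
```python
def cycles_to_match_setup(l_bound, t_bound, proof_length):
    """Smallest k with k * 2^l * t >= l_P * 2^(l_P), taking all hidden
    constants in the O(.) bounds to be 1 (assumption)."""
    setup_time = proof_length * 2 ** proof_length
    time_per_cycle = 2 ** l_bound * t_bound
    return -(-setup_time // time_per_cycle)    # ceiling division


if __name__ == "__main__":
    # Hypothetical values: 10-bit programs, 1000 steps per cycle,
    # proofs of length 30 bits.
    print(cycles_to_match_setup(10, 1000, 30))
```
\end{verbatim}

Even for these tiny values the setup phase corresponds to tens of thousands
of cycles, and the gap grows exponentially with $l_P\!-\!\tilde l$.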
3779: 
3780: %------------------------------%
3781: \paragraph{Remarks:}
3782: %------------------------------%
3783: \begin{itemize}\parskip=0ex\parsep=0ex%\itemsep=0ex
3784: \item The idea of suggesting outputs and justifying them by proving
3785: credit bounds implements one aspect of human thinking. There are
3786: several possible reactions to an input. Each reaction possibly has
3787: far reaching consequences. Within a limited time one tries to estimate the
3788: consequences as well as possible. Finally,
each reaction is valued and the best one is selected. What
is inferior to human thinking is that the estimates $w_k^p$ must
be rigorously proved and the proofs are constructed by blind
extensive search and, further, that {\it all} behaviours $p$ of length
3793: $\leq\!\tilde l$ are checked. It is inferior 'only' in the sense of
3794: necessary computation time but not in the sense of the quality of
3795: the outputs.
3796: \item In practical applications there are often cases with
3797: short and slow programs $p_s$ performing some task $T$, e.g.
the computation of the digits of $\pi$, for which there also exist
long and quick programs $p_l$. If it is not too difficult to
3800: prove that this long program is equivalent to the short one, then it is
3801: possible to prove $K(T)\!\leq\!l(p_s)$ within time $t(p_l)$.
3802: Similarly, the method of proving bounds $w_k$ for $C_{km_k}$ can
3803: give high lower bounds without explicitly executing these short
3804: and slow programs, which mainly contribute to $C_{km_k}$.
3805: \item Dovetailing all length and time-limited programs is a well
known elementary idea (typing monkeys). The crucial part
that has been developed here is the selection criterion for the
3808: most intelligent system.
\item By construction of AI$\xi^{\tilde t\tilde l}$ and due to the enumerability
of $C_{km_k}$, which ensures arbitrarily close approximations of
$C_{km_k}$, we expect that the behaviour of AI$\xi^{\tilde t\tilde l}$
converges, in some sense, to the behaviour of AI$\xi$ in the limit $\tilde
t,\tilde l\!\to\!\infty$.
3814: \item Depending on what you know/assume that a program $p$ of size
3815: $\tilde l$ and computation time per cycle $\tilde t$ is able to
3816: achieve, the computable AI$\xi^{\tilde t\tilde l}$ model will have the
same capabilities. For the strongest assumption of the existence of a Turing
machine which outperforms human intelligence, the AI$\xi^{\tilde
t\tilde l}$ will do so too, within the same time frame up to an (unfortunately
very large) constant factor.
3821: \end{itemize}
3822: 
3823: \newpage
3824: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3825: \section{Outlook \& Discussion}\label{secOutlook}
3826: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3827: This section contains some discussion of otherwise unmentioned
3828: topics and some (more personal) remarks. It also serves as an outlook
3829: to further research.
3830: 
3831: %------------------------------%
3832: \paragraph{Miscellaneous:}
3833: %------------------------------%
3834: \begin{itemize}
\item In game theory \cite{Osb94} one often wants to model the situation of
      simultaneous actions, whereas the AI$\xi$ model has
      serial I/O. Simultaneity can be simulated by withholding the
      current system's output $y_k$ from the environment until
      $x_k$ has been received by the system. Formally, this means
3840:       that $\xi(y\!x_{<k}y\!\pb x_k)$ is independent of $y_k$.
3841:       The AI$\xi$ system is already of simultaneous type in an
3842:       abstract view if the behaviour $p$ is interpreted as the action.
3843:       In this sense, AI$\xi$ is the action $p^\best$ which maximizes
3844:       the utility function (credit), under the assumption that the environment
3845:       acts according to $\xi$. The situation is different from
3846:       game theory as the environment is not modeled to be a second
3847:       'player' that tries to optimize his own utility although it might
3848:       actually be a rational player (see section \ref{secSG}).
3849: \item In various examples we have chosen differently specialized
3850:       input and output spaces $X$ and $Y$. It should be clear
3851:       that, in principle, this is unnecessary, as large enough spaces $X$
3852:       and $Y$, e.g. $2^{32}$ bit, serve every need and can always
3853:       be Turing reduced to the specific presentation needed internally by the
      AI$\xi$ system itself. But it is clear that using a generic
      interface, such as camera and monitor, for learning
      tic-tac-toe, for example, adds the task of learning vision and drawing.
3857: \end{itemize}
3858: 
3859: %------------------------------%
3860: \paragraph{Outlook:}
3861: %------------------------------%
3862: \begin{itemize}
\item Rigorous proofs for credit bounds are the major theoretical challenge
      -- general ones as well as tighter bounds for
      special environments $\mu$. Of special importance are suitable (and
      acceptable) conditions on $\mu$, under which $\hh y_k$ and
3867:       finite credit bounds exist for infinite $Y$, $X$ and $m_k$.
\item A direct implementation of the
      AI$\xi^{\tilde t\tilde l}$ model is, at best, possible for toy
3870:       environments due to the large factor $2^{\tilde l}$ in
3871:       computation time. But there are other applications of the AI$\xi$ theory.
3872:       We have seen in several examples how to integrate problem classes
3873:       into the AI$\xi$ model. Conversely, one can downscale the
3874:       AI$\xi$ model by using more restricted forms of $\xi$.
3875:       This could be done in the same way as the theory of universal
3876:       induction has been downscaled with many insights
3877:       to the Minimum Description Length principle
3878:       \cite{LiVi92,Ris89} or to the domain of finite automata \cite{Fed92}.
3879:       The AI$\xi$ model might similarly serve as a super model or as the
3880:       very definition of (universal unbiased) intelligence, from
3881:       which specialized models could be derived.
3882: \item With a reasonable computation time, the AI$\xi$ model
3883:       would be a solution of AI (see next point if you disagree).
3884:       The AI$\xi^{\tilde t\tilde l}$ model was the first step,
3885:       but the elimination of the factor $2^{\tilde l}$ without giving up
3886:       universality will (almost certainly) be a very difficult task.
3887:       One could try to select programs $p$ and prove VA($p$) in a
3888:       more clever way than by mere enumeration, to improve performance
3889:       without destroying
      universality. All kinds of ideas, like genetic algorithms,
      advanced theorem provers and many more, could be incorporated. But now we
      are in trouble. We seem to have transferred the AI
      problem just to a different level. This shift has some
      advantages (and also some disadvantages) but in no way presents a
      solution.
      Nevertheless, we want to stress that we have reduced the AI
      problem to (mere) computational questions.
      Even the most general other systems the author is aware of depend on some
      (more than computational) assumptions about the
      environment, or it is far from clear whether they are indeed universal and optimal.
3901:       Although computational
3902:       questions are themselves highly complicated, this reduction is a
3903:       non-trivial result. A formal theory of something, even if
3904:       not computable, is often a great step toward solving a
3905:       problem and has also merits of its own, and AI should not be different (see previous item).
3906: \item Many researchers in AI believe that intelligence is something
3907:       complicated and cannot be condensed into a few formulas.
3908:       It is more a combining of enough {\it methods} and much explicit
3909:       {\it knowledge} in the right way. From a theoretical point of
3910:       view, we disagree as the AI$\xi$ model is simple and seems to serve all
3911:       needs. From a practical point of view we agree to the following extent.
3912:       To reduce the computational burden one should
3913:       provide special purpose algorithms ({\it methods}) from the
      very beginning, probably many of them related to reducing
      the complexity of the input and output spaces $X$ and $Y$ by
3916:       appropriate preprocessing {\it methods}.
3917: \item There is no need to incorporate extra {\it knowledge} from the very
3918:       beginning. It can be presented in the first few cycles in
3919:       {\it any} format. As long as the algorithm to interpret the data
3920:       is of size $O(1)$, the AI$\xi$ system will 'understand' the data
3921:       after a few cycles (see section \ref{secEX}). If the
3922:       environment $\mu$ is complicated but extra knowledge
3923:       $z$ makes $K(\mu|z)$ small, one can show that the bound
3924:       (\ref{eukdist}) reduces to $\1d2\ln 2\!\cdot\!K(\mu|z)$
3925:       when $x_1\!\equiv\!z$, i.e.\
3926:       when $z$ is presented in the first cycle. The special
3927:       purpose algorithms could be presented in $x_1$, too, but it
3928:       would be cheating to say that no special purpose algorithms
3929:       had been implemented in AI$\xi$. The boundary between
3930:       implementation and training is unsharp in the AI$\xi$ model.
3931: \item We have not said much about the training
3932:       process itself, as it is not specific to the AI$\xi$ model
      and has been discussed in the literature in various forms and
3934:       disciplines. A serious discussion would be out of place.
3935:       To repeat a truism, it is, of course,
3936:       important to present enough knowledge $x'_k$ and evaluate
3937:       the system output $y_k$ with $c_k$ in a reasonable way.
3938:       To maximize the information content in the credit, one should
3939:       start with simple tasks and give positive reward
3940:       $c_k\!=\!1$ to approximately half of the outputs $y_k$.
3941: \end{itemize}
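
The last recommendation can be read in terms of Shannon entropy: if the
binary reward $c_k$ is $1$ with probability $\pi$ and the rewards are treated
as independent (an assumption made only for this illustration), the
information per reward is the binary entropy of $\pi$, which is maximal at
$\pi\!=\!1/2$. A minimal Python sketch, with a hypothetical function name:

\begin{verbatim}
```python
import math


def reward_entropy_bits(p_reward):
    """Shannon entropy, in bits, of a binary reward that equals 1 with
    probability p_reward."""
    if p_reward in (0.0, 1.0):
        return 0.0                     # a certain outcome carries no information
    q = 1.0 - p_reward
    return -(p_reward * math.log2(p_reward) + q * math.log2(q))


if __name__ == "__main__":
    for p in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(p, round(reward_entropy_bits(p), 3))
```
\end{verbatim}

Rewarding almost every output, or almost none, therefore conveys little
information per cycle under this reading.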
3942: 
3943: %------------------------------%
3944: \paragraph{The big questions:}
3945: %------------------------------%
3946: This subsection is devoted to the {\it big} questions of AI in
3947: general and the AI$\xi$ model in particular with a personal touch.
3948: 
3949: \begin{itemize}
\item There are two possible objections to AI in general and,
      therefore, also to AI$\xi$ in particular, on which we want
      to comment briefly. Non-computable physics (which is not too
3953:       weird) could make Turing computable AI impossible. As at least the
3954:       world that is relevant for humans seems mainly to be computable
3955:       we do not believe that it is necessary to integrate non-computable
3956:       devices into an AI system. The (clever and nearly convincing) 'G\"odel'
3957:       argument by Penrose \cite{Pen89} that non-computational physics
3958:       {\it must} exist and {\it is} relevant to the brain, has (in our opinion convincing)
3959:       loopholes.
3960: \item A more serious problem is the evolutionary information
3961:       gathering process. It has been shown that the
3962:       'number of wisdom' $\Omega$ contains a very compact
3963:       tabulation of $2^n$ undecidable problems in its very first
      $n$ binary digits \cite{Cha91}. $\Omega$ is only enumerable,
      with computation time increasing more rapidly with $n$ than any
      recursive function.
3967:       The enormous computational power of evolution could
3968:       have developed and coded something like $\Omega$ into
3969:       our genes, which significantly guides human reasoning.
      In short: Intelligence could be something complicated,
      and evolution toward it, even from a cleverly designed
      algorithm of size $O(1)$, could be too slow. As evolution has
      already taken place, we could add the information from our
      genes or brain structure to any/our AI system, but this means that
      the important part is still missing and a simple formal definition
      of AI is impossible in principle.
3977: \item For the probably {\it biggest question} about {\it consciousness}
3978:       we want to give a physical analogy. Quantum (field) theory is
3979:       the most accurate and universal physical theory ever
      invented. Although it was already developed in the 1930s, the {\it
      big} question regarding the interpretation of the wave function collapse
3982:       is still open. Although extremely interesting from a
3983:       philosophical point of view, it is completely irrelevant from
3984:       a practical point of view\footnote{In the theory of everything, the
3985:       collapse might become of 'practical' importance and must or will be
3986:       solved.}.
3987:       We believe the same to be true
3988:       for {\it consciousness} in the field of Artificial
3989:       Intelligence. Philosophically highly interesting but
3990:       practically unimportant. Whether consciousness {\it will} be
3991:       explained some day is another question.
3992: \end{itemize}
3993: 
3994: \newpage
3995: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3996: \section{Conclusions}\label{secCon}
3997: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
3998: All tasks which require intelligence to be solved can naturally be
3999: formulated as a maximization of some expected utility in the
4000: framework of agents. We gave a functional (\ref{pbestfunc}) and an
4001: iterative (\ref{ydotrec}) formulation of such a decision theoretic
4002: agent, which is general enough to cover all AI problem classes,
4003: as has been demonstrated by several examples. The main remaining
4004: problem is the unknown prior probability distribution $\mu^{AI}$
4005: of the environment(s). Conventional learning algorithms are
4006: unsuitable, because they can neither handle large (unstructured)
4007: state spaces, nor do they converge in the theoretically minimal
4008: number of cycles, nor can they handle non-stationary environments
4009: appropriately. On the other hand, the universal semimeasure $\xi$
4010: (\ref{xidef}), based on ideas from algorithmic information theory,
4011: solves the problem of the unknown prior distribution for induction
4012: problems. No explicit learning procedure is necessary, as $\xi$
4013: automatically converges to $\mu$. We unified the theory of
4014: universal sequence prediction with the decision-theoretic agent by
4015: replacing the unknown true prior $\mu^{AI}$ by an appropriately
4016: generalized universal semimeasure $\xi^{AI}$. We gave strong
4017: arguments that the resulting AI$\xi$ model is the most
4018: intelligent, parameterless, environment- and application-independent model
4019: possible. We defined an intelligence order relation
4020: (\ref{aiorder}) to give a rigorous meaning to this claim.
4021: Furthermore, possible solutions to the horizon problem have been
4022: discussed. In sections \ref{secSP}--\ref{secEX} we outlined, for a
4023: number of problem classes, how the AI$\xi$ model can solve them.
4024: They include sequence prediction, strategic games, function
4025: minimization and, especially, supervised learning, which AI$\xi$
4026: learns to perform. The list could easily be extended to other problem
4027: classes like classification, function inversion and many others.
4028: The major drawback of the AI$\xi$ model is that it is
4029: uncomputable, or more precisely, only asymptotically computable,
4030: which makes an implementation impossible. To overcome this
4031: problem, we constructed a modified model AI$\xi^{\tilde t\tilde
4032: l}$, which is still effectively more intelligent than any other
4033: time $\tilde t$ and space $\tilde l$ bounded algorithm. The
4034: computation time of AI$\xi^{\tilde t\tilde l}$ is of the order
4035: $\tilde t\!\cdot\!2^{\tilde l}$. Possible further research has
4036: been discussed. The main directions could be to prove general
4037: and special credit bounds, to use AI$\xi$ as a super model and
4038: explore its relation to other specialized models, and finally to
4039: improve performance with or without giving up universality.
4040: 
4041: \newpage
4042: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
4043: %%                B i b l i o g r a p h y                    %%
4044: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
4045: \addcontentsline{toc}{section}{Literature}
4046: \parskip=0ex plus 1ex minus 1ex
4047: \begin{thebibliography}{99}\parskip=0ex\parsep=0ex\itemsep=0ex
4048: \bibitem{Ang83}
4049:   {\bf D. Angluin, C. H. Smith}:
4050:   {\it Inductive inference: Theory and methods};
4051:   {\rm Comput. Surveys, 15:3, (1983) 237--269}.
4052: \bibitem{Bay63}
4053:   {\bf T. Bayes}:
4054:   {\it An essay towards solving a problem in the doctrine of chances};
4055:   {\rm Philos. Trans. Royal Soc., 53 (1763) 376--398}.
4056: \bibitem{Cha66}
4057:   {\bf G.J. Chaitin}:
4058:   {\it On the length of programs for computing finite binary sequences};
4059:   {\rm J. Assoc. Comput. Mach. 13:4 (1966) 547--569 and 16 (1969) 145--159}.
4060: \bibitem{Cha75}
4061:   {\bf G.J. Chaitin}:
4062:   {\it A theory of program size formally identical to information theory};
4063:   {\rm J. Assoc. Comput. Mach. 22 (1975) 329--340}.
4064: \bibitem{Cha91}
4065:   {\bf G.J. Chaitin}:
4066:   {\it Algorithmic information and evolution};
4067:   {\rm in O.T. Solbrig and G. Nicolis, Perspectives on
4068:        Biological Complexity, IUBS Press (1991) 51--60}.
4069: \bibitem{Che85}
4070:   {\bf P. Cheeseman}:
4071:   {\it In defense of probability theory};
4072:   {\rm Proc. 9th int. joint conference on AI, IJCAI-85 (1985) 1002--1009}.
4073:   {\it An inquiry into computer understanding};
4074:   {\rm Comp. intelligence 4:1 (1988) 58--66}.
4075: \bibitem{Con97}
4076:   {\bf M. Conte et al.}:
4077:   {\it Genetic programming estimates of Kolmogorov complexity};
4078:   {\rm Proc. 7th Int. Conf. on GA (1997) 743--750}.
4079: \bibitem{Dal73}
4080:   {\bf R.P. Daley}:
4081:   {\it Minimal-program complexity of sequences with restricted resources};
4082:   {\rm Inform. Contr. 23 (1973) 301--312}.
4083:   {\it On the inference of optimal descriptions};
4084:   {\rm Theoret. Comput. Sci. 4 (1977) 301--319}.
4085: \bibitem{Fed92}
4086:   {\bf M. Feder, N. Merhav, M. Gutman}:
4087:   {\it Universal prediction of individual sequences};
4088:   {\rm IEEE Trans. Inform. Theory, 38:4 (1992) 1258--1270}.
4089: \bibitem{Fud91}
4090:   {\bf D. Fudenberg, J. Tirole}:
4091:   {\it Game Theory};
4092:   {\rm The MIT Press (1991)}.
4093: \bibitem{Gac74}
4094:   {\bf P. G\'acs}:
4095:   {\it On the symmetry of algorithmic information};
4096:   {\rm Soviet Math. Dokl. 15 (1974) 1477--1480}.
4097: \bibitem{Hume}
4098:   {\bf D. Hume}:
4099:   {\it Treatise of Human Nature};
4100:   {\rm Book I (1739)}.
4101: \bibitem{Hut99}
4102:   {\bf M. Hutter}:
4103:   {\it New Error Bounds for Solomonoff Sequence Prediction};
4104:   {\rm Submitted to J. Comput. System Sci. (2000)},
4105:   {\rm http://xxx.lanl.gov/abs/cs.AI/9912008}.
4106: \bibitem{Hut00e}
4107:   {\bf M. Hutter}:
4108:   {\it Optimality of non-binary universal Solomonoff sequence prediction};
4109:   {\rm In progress}.
4110: \bibitem{Kae96}
4111:   {\bf L.P. Kaelbling, M.L. Littman, A.W. Moore}:
4112:   {\it Reinforcement learning: a survey};
4113:   {\rm Journal of AI research 4 (1996) 237--285}.
4114: \bibitem{Ko86}
4115:   {\bf K. Ko}:
4116:   {\it On the definition of infinite pseudo-random sequences};
4117:   {\rm Theoret. Comput. Sci. 48 (1986) 9--34}.
4118: \bibitem{Kol65}
4119:   {\bf A.N. Kolmogorov}:
4120:   {\it Three approaches to the quantitative definition of information};
4121:   {\rm Problems Inform. Transmission, 1:1 (1965) 1--7}.
4122: \bibitem{Lev73}
4123:   {\bf L.A. Levin}:
4124:   {\it Universal sequential search problems};
4125:   {\rm Problems of Inform. Transmission, 9:3 (1973) 265--266}.
4126: \bibitem{Lev74}
4127:   {\bf L.A. Levin}:
4128:   {\it Laws of information conservation (non-growth) and
4129:        aspects of the foundation of probability theory};
4130:   {\rm Problems Inform. Transmission, 10 (1974), 206--210}.
4131: \bibitem{LiWa89}
4132:   {\bf N. Littlestone, M.K. Warmuth}:
4133:   {\it The weighted majority algorithm};
4134:   {\rm Proc. 30th IEEE Symp. on Found. of Comp. Science (1989) 256--261}.
4135: \bibitem{LiVi91}
4136:   {\bf M. Li and P.M.B. Vit\'anyi}:
4137:   {\it Learning simple concepts under simple distributions};
4138:   {\rm SIAM J. Comput., 20:5 (1991), 915--935}.
4139: \bibitem{LiVi92}
4140:   {\bf M. Li and P.M.B. Vit\'anyi}:
4141:   {\it Inductive reasoning and Kolmogorov complexity};
4142:   {\rm J. Comput. System Sci., 44:2 (1992), 343--384}.
4143: \bibitem{LiVi92a}
4144:   {\bf M. Li and P.M.B. Vit\'anyi}:
4145:   {\it Philosophical issues in Kolmogorov complexity};
4146:   {\rm Lecture Notes Comput. Sci. 623 (1992), 1--15}.
4147: \bibitem{LiVi93}
4148:   {\bf M. Li and P.M.B. Vit\'anyi}:
4149:   {\it An Introduction to Kolmogorov Complexity and its Applications};
4150:   {\rm Springer-Verlag, New York, 2nd Edition, 1997}.
4151: \bibitem{Mic66}
4152:   {\bf D. Michie}:
4153:   {\it Game-playing and game-learning automata};
4154:   {\rm In Fox, L., editor, Adv. in Prog. and Non-Numerical Comp.,
4155:        183--200 (1966) Pergamon, NY}.
4156: \bibitem{Osb94}
4157:   {\bf M.J. Osborne, A. Rubinstein}:
4158:   {\it A course in game theory};
4159:   {\rm MIT Press (1994)}.
4160: \bibitem{Pea88}
4161:   {\bf J. Pearl}:
4162:   {\it Probabilistic reasoning in intelligent systems:
4163:        Networks of plausible inference};
4164:   {\rm Morgan Kaufmann, San Mateo, California (1988)}.
4165: \bibitem{Pen89}
4166:   {\bf R. Penrose}:
4167:   {\it The emperor's new mind};
4168:   {\rm Oxford Univ. Press (1989)}.
4169:   {\it Shadows of the mind};
4170:   {\rm Oxford Univ. Press (1994)}.
4171: \bibitem{Pin97}
4172:   {\bf X. Pintaro, E. Fuentes}:
4173:   {\it A forecasting algorithm based on information theory};
4174:   {\rm Technical report, Centre Univ. d'Informatique, University of Geneva (1997)}.
4175: \bibitem{Ris89}
4176:   {\bf J.J. Rissanen}:
4177:   {\it Stochastic Complexity and Statistical Inquiry};
4178:   {\rm World Scientific Publishers (1989)}.
4179: \bibitem{Rus95}
4180:   {\bf S. Russell, P. Norvig}:
4181:   {\it Artificial Intelligence: A modern approach};
4182:   {\rm Prentice Hall (1995)}.
4183: \bibitem{Sch95}
4184:   {\bf J. Schmidhuber}:
4185:   {\it Discovering solutions with low Kolmogorov complexity and
4186:        high generalization capability};
4187:   {\rm Proc. 12th Int. Conf. on Machine Learning (1995) 488--496}.
4188: \bibitem{Sch96}
4189:   {\bf J. Schmidhuber, M. Wiering}:
4190:   {\it Solving POMDPs with Levin search and EIRA};
4191:   {\rm Proc. 13th Int. Conf. on Machine Learning (1996) 534--542}.
4192: \bibitem{Sch99}
4193:   {\bf M. Schmidt}:
4194:   {\it Time-Bounded Kolmogorov Complexity May Help in Search
4195:        for Extra Terrestrial Intelligence (SETI)};
4196:   {\rm Bulletin of the European Association for Theor. Comp. Sci. 67 (1999) 176--180}.
4197: \bibitem{Sol64}
4198:   {\bf R.J. Solomonoff}:
4199:   {\it A formal theory of inductive inference, Part 1 and 2};
4200:   {\rm Inform. Contr., 7 (1964), 1--22, 224--254}.
4201: \bibitem{Sol78}
4202:   {\bf R.J. Solomonoff}:
4203:   {\it Complexity-based induction systems: comparisons and convergence theorems};
4204:   {\rm IEEE Trans. Inform. Theory, IT-24:4, (1978), 422--432}.
4205: \bibitem{Sol86}
4206:   {\bf R.J. Solomonoff}:
4207:   {\it An application of algorithmic probability to problems in artificial intelligence};
4208:   {\rm In L.N. Kanal and J.F. Lemmer, editors, Uncertainty in Artificial Intelligence,
4209:        North-Holland, (1986), 473--491}.
4210: \bibitem{Sol97}
4211:   {\bf R.J. Solomonoff}:
4212:   {\it The discovery of algorithmic probability};
4213:   {\rm J. Comput. System Sci. 55 (1997), 73--88}.
4214: \bibitem{Sol99}
4215:   {\bf R.J. Solomonoff}:
4216:   {\it Two kinds of probabilistic induction};
4217:   {\rm Comput. Journal 42:4 (1999), 256--259}.
4218: \bibitem{Neu44}
4219:   {\bf J. von Neumann, O. Morgenstern}:
4220:   {\it The theory of games and economic behavior};
4221:   {\rm Princeton Univ. Press (1944)}.
4222: \bibitem{Val84}
4223:   {\bf L.G. Valiant}:
4224:   {\it A theory of the learnable};
4225:   {\rm Comm. Assoc. Comput. Mach., 27 (1984) 1134--1142}.
4226: \bibitem{Vov92}
4227:   {\bf V. G. Vovk}:
4228:   {\it Universal forecasting algorithms};
4229:   {\rm Inform. and Comput., 96, (1992), 245--277}.
4230: \bibitem{Vov98}
4231:   {\bf V. Vovk, C. Watkins}:
4232:   {\it Universal portfolio selection};
4233:   {\rm Proceedings 11th Ann. Conf. on Comp. Learning Theory (1998) 12--23}.
4234: \bibitem{Wil70}
4235:   {\bf D.G. Willis}:
4236:   {\it Computational complexity and probability constructions};
4237:   {\rm J. Assoc. Comput. Mach., 17 (1970), 241--259}.
4238: \end{thebibliography}
4239: 
4240: \end{document}
4241: 
4242: %---------------------------------------------------------------
4243: