
1:

2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3: %%      A Theory of Universal Artificial Intelligence        %%

4: %%            based on Algorithmic Complexity                %%

5: %%     Marcus Hutter: Start: 13.11.99  LastEdit: 31.03.00    %%

6: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

7:

8: %-------------------------------%

9: %   Document-Style              %

10: %-------------------------------%

11: \documentclass[12pt]{article}

12: %\documentstyle[10pt,german,epsf,axodraw,twoside]{report}

13: % 'epsf.sty' to include post-script pictures        %

14: % 'axodraw.sty' for feynman pictures                %

15: %\epsfverbosetrue

16: \parskip=1.5ex plus 1ex minus 1ex \parindent=0ex

17: \pagestyle{headings}

18: \setcounter{tocdepth}{4}

19: \setcounter{secnumdepth}{1}

20: \topmargin=0cm  \oddsidemargin=0cm \evensidemargin=0cm

21: \textwidth=16cm \textheight=23cm

22: \unitlength=1mm \sloppy

23: %\makeindex

24:

25: %-------------------------------%

26: %   Compiler-Switches           %

27: %-------------------------------%

28: \newif\ifall\alltrue %\allfalse              % compile only parts

29: \newif\ifexpaper\expapertrue %\expaperfalse  % compile only parts

30: %\newif\ifprivate\privatetrue %\privatefalse  % compile only parts

31: \newif\ifprivate\privatefalse                % compile only parts

32:

33: %\def\private#1{{\it private: #1}}  % print private comments

34: \def\private#1{}            % not print private "

35: %\nofiles               % no .aux .toc ... files

36:

37: %-------------------------------%

38: %   Macro-Definitions           %

39: %-------------------------------%

40: %\def\keywords#1{\centerline{\parbox{14cm}{{\it Key Words:} #1}}}

41: \def\keywords#1{\small\centerline{\bf Key Words}\vspace{5mm}\centerline{\parbox{14cm}{#1}}}

42: \def\eqd{\stackrel{\bullet}{=}}

43: \def\ff{\Longrightarrow}

44: \def\gdw{\Longleftrightarrow}

45: \def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}

46: \def\gtapprox{\buildrel{\lower.7ex\hbox{$>$}}\over

47:                        {\lower.7ex\hbox{$\sim$}}}

48: \def\nq{\hspace{-1em}}

49: \def\look{\(\uparrow\)}

50: \def\ignore#1{}

51: \def\deltabar{{\delta\!\!\!^{-}}}

52: \def\qed{\sqcap\!\!\!\!\sqcup}

53: \def\1d2{{\textstyle{1\over 2}}}

54: \def\hbar{h\!\!\!\!^{-}\,}

55: \def\dbar{d\!\!^{-}\!}

56: \def\eps{\varepsilon}

57: \def\beq{\begin{equation}}

58: \def\eeq{\end{equation}}

59: \def\beqn{\begin{displaymath}}

60: \def\eeqn{\end{displaymath}}

61: \def\bqa{\begin{equation}\begin{array}{c}}

62: \def\eqa{\end{array}\end{equation}}

63: \def\bqan{\begin{displaymath}\begin{array}{c}}

64: \def\eqan{\end{array}\end{displaymath}}

65: \def\pb{\underline}                       % probability notation

66: \def\pb#1{\underline{#1}}                 % probability notation

67: \def\blank{{\,_\sqcup\,}}                 % blank position

68: \def\maxarg{\mathop{\rm maxarg}}          % maxarg

69: \def\minarg{\mathop{\rm minarg}}          % minarg

70: \def\hh#1{{\dot{#1}}}                     % historic I/O

71: \def\best{*}                              % or {best}

72: \begin{document}

73:

74: %------------------------------%

75: %      T i t l e- P a g e      %

76: %------------------------------%

77: \begin{titlepage}

78: %{\tt http://xxx.lanl.gov/abs/cs.AI/000400?}

79: \hfill Munich, 31.03.2000

80:

81: %\hspace*{10cm}{\Large Preliminary Version 10}

82:

83: \begin{center}       \vspace*{2cm}

84:   {\LARGE\bf A Theory of Universal Artificial Intelligence} \\[0.5cm]

85:   {\LARGE\bf based on Algorithmic Complexity}               \\[2cm]

86:   {\bf Marcus Hutter\footnotemark}                 \\[1cm]

87:   {\it Bayerstr. 21, 80335 Munich, Germany} \\[1.5cm]

88: \end{center}

89: \footnotetext{Any response to {\tt marcus@hutter1.de} is welcome.}

90:

91: \keywords{Artificial intelligence; algorithmic complexity;

92: sequential decision theory; induction; Solomonoff; Kolmogorov;

93: Bayes; reinforcement learning; universal sequence prediction;

94: strategic games; function minimization; supervised learning.}

95:

96: \begin{abstract}

97: Decision theory formally solves the problem of rational agents in

98: uncertain worlds if the true environmental prior probability

99: distribution is known. Solomonoff's theory of universal induction

100: formally solves the problem of sequence prediction for unknown

101: prior distribution. We combine both ideas and get a parameterless

102: theory of universal Artificial Intelligence. We give strong

103: arguments that the resulting AI$\xi$ model is the most intelligent

104: unbiased agent possible. We outline for a number of problem

105: classes, including sequence prediction, strategic games, function

106: minimization, reinforcement and supervised learning, how the

107: AI$\xi$ model can formally solve them. The major drawback of the

108: AI$\xi$ model is that it is uncomputable. To overcome this

109: problem, we construct a modified algorithm AI$\xi^{tl}$, which is

110: still effectively more intelligent than any other time $t$ and

111: space $l$ bounded agent. The computation time of AI$\xi^{tl}$

112: is of the order $t\!\cdot\!2^l$. Other discussed topics are formal

113: definitions of intelligence order relations, the horizon problem

114: and relations of the AI$\xi$ theory to other AI approaches.

115: \end{abstract}

116:

117: \end{titlepage}

118:

119: %------------------------------%

120: %      Table of Contents       %

121: %------------------------------%

122: {\parskip=0ex\tableofcontents}

123:

124: \newpage

125: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

126: \section{Introduction}\label{int}

127: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

128:

129: %------------------------------%

130: \paragraph{Artificial Intelligence:}

131: %------------------------------%

132: The science of Artificial Intelligence (AI) might be defined as

133: the construction of intelligent systems and their analysis. A

134: natural definition of {\it systems} is anything which has an

135: input and an output stream. Intelligence is more complicated. It

136: can have many faces like creativity, solving problems, pattern

137: recognition, classification, learning, induction, deduction,

138: building analogies, optimization, surviving in an environment,

139: language processing, knowledge and many more. A formal definition

140: incorporating every aspect of intelligence, however, seems difficult.

141: Further, intelligence is graded: there is a smooth transition

142: between systems which everyone would agree are not intelligent

143: and truly intelligent systems. One simply has to look in nature,

144: starting with, for instance, inanimate crystals, then come amino acids,

145: then some RNA fragments, then viruses, bacteria, plants, animals,

146: apes, followed by the truly intelligent homo sapiens, and possibly

147: continued by AI systems or ETs. So the best we can expect to find

148: is a partial or total order relation on the set of systems, which

149: orders them w.r.t.\ their degree of intelligence (like

150: intelligence tests do for human systems, but for a limited class of

151: problems). Having this order we are, of course, interested in large

152: elements, i.e.\ highly intelligent systems. If a largest element

153: exists, it would correspond to the most intelligent system which

154: could exist.

155:

156: Most, if not all, known facets of intelligence can be formulated

157: as goal driven or, more precisely, as maximizing some utility

158: function. It is, therefore, sufficient to study goal driven AI.

159: E.g.\ the (biological) goal of animals and humans is to survive and spread.

160: The goal of AI systems should be to be useful to humans. The

161: problem is that, except for special cases, we know neither

162: the utility function, nor the environment in which the

163: system will operate, in advance.

164:

165: %------------------------------%

166: \paragraph{Main idea:}

167: %------------------------------%

168: We propose a theory which formally\footnote{By a formal solution

169: we mean a rigorous mathematical definition, uniquely specifying the solution.

170: In the following, a solution is

171: always meant in this formal sense.} solves the problem of unknown

172: goal and environment. It might be viewed as a unification of the ideas of

173: universal induction, probabilistic planning and reinforcement

174: learning or as a unification of sequential decision theory with algorithmic

175: information theory.

176: We apply this model to some of the facets of intelligence,

177: including induction, game playing, optimization, reinforcement and supervised

178: learning, and show how it solves these problem classes. This,

179: together with general convergence theorems, motivates us to

180: believe that the constructed universal AI system is the best one

181: in a sense to be clarified in the sequel, i.e.\ that it is the most

182: intelligent environment-independent system possible.

183: The intention of this work is to introduce the universal AI model

184: and give an in-breadth analysis. Most arguments and proofs are

185: succinct and require slow reading or some additional pencil

186: work.

187: %Several topics would deserve an in depth analysis,

188: %but is deferred to future publications.

189:

190: %------------------------------%

191: \paragraph{Contents:}

192: %------------------------------%

193: {\it Section \ref{secAIfunc}:} The general framework for AI might

194: be viewed as the design and study of intelligent agents

195: \cite{Rus95}. An agent is a cybernetic system with some internal

196: state, which acts with output $y_k$ to some environment in cycle $k$,

197: perceives some input $x_k$ from the environment and updates its

198: internal state. Then the next cycle follows. It operates according

199: to some function $p$. We split the input $x_k$ into a regular part

200: $x'_k$ and a credit $c_k$, often called reinforcement feedback.

201: From time to time the environment provides non-zero credit to the

202: system. The task of the system is to maximize its utility, defined

203: as the sum of future credits. A probabilistic environment is a

204: probability distribution $\mu(q)$ over deterministic environments

205: $q$. Most, if not all, environments are of this type. We give a

206: formal expression for the function $p^\best$, which maximizes in

207: every cycle the total $\mu$ expected future credit. This model is

208: called the AI$\mu$ model. As every AI problem can be brought into

209: this form, the problem of maximizing utility is hence

210: formally solved, if $\mu$ is known. There is nothing remarkable or

211: new here; it is the essence of sequential decision theory

212: \cite{Che85,Pea88,Neu44}. Notation and formulas needed in

213: later sections are simply developed. There are two major remaining

214: problems. The problem of the unknown true prior probability $\mu$

215: is solved in section \ref{secAIxi}. Computational aspects are

216: addressed in section \ref{secTime}.

217:

218: {\it Section \ref{secAImurec}:} Instead of talking about

219: probability distributions $\mu(q)$ over functions, one could

220: describe the environment by the conditional probability of

221: providing inputs $x_1...x_n$ to the system under the condition

222: that the system outputs $y_1...y_n$. The definition of the optimal

223: $p^\best$ system in this iterative form is shown to be equivalent

224: to the previous functional form. The functional form is more

225: elegant and will be used to define an intelligence order relation

226: and the time-bounded model in section \ref{secTime}. The iterative

227: form is more index intensive but more suitable for explicit

228: calculations and is used in most of the other sections. Further,

229: we introduce factorizable probability distributions.

230:

231: {\it Section \ref{secAIxi}:} A special topic is the theory of

232: induction. The sense in which prediction of the future is possible at

233: all is best summarized by the theory of Solomonoff. Given the

234: initial binary sequence $x_1...x_k$, what is the probability of

235: the next bit being $1$? It can be fairly well predicted by using a

236: universal probability distribution $\xi$ invented and shown to

237: converge to the true prior probability $\mu$ by Solomonoff

238: \cite{Sol64,Sol78} as long as $\mu$ (which need not be known!) is

239: computable. The problem of unknown $\mu$ is hence solved for

240: induction problems. All AI problems where the system's output does

241: not influence the environment, i.e.\ all passive systems, are of

242: this inductive form. Besides sequence prediction (SP),

243: classification (CF)

244: is also of this type. Active systems, like game playing (SG) and

245: optimization (FM), cannot be reduced to induction systems. The {\bf

246: main idea of this work} is to generalize universal induction to

247: the general cybernetic model described in sections \ref{secAIfunc}

248: and \ref{secAImurec}. For this, we generalize $\xi$ to include

249: conditions and replace $\mu$ by $\xi$ in the rational agent model. In this

250: way the problem that the true prior probability $\mu$ is usually

251: unknown is solved. Universality of $\xi$ and convergence of

252: $\xi\!\to\!\mu$ will be shown. These are strong arguments for the

253: optimality of the resulting AI$\xi$ model. There are certain

254: difficulties in proving rigorously that and in which sense it is

255: optimal, i.e.\ the most intelligent system. Further, we introduce a

256: universal order relation for intelligence.

257:

258: {\it Sections \ref{secSP}--\ref{secOther}} show how a number of

259: AI problem classes fit into the general AI$\xi$ model.

260: All these problems are formally solved by the AI$\xi$ model.

261: The solution is, however, only formal because

262: the AI$\xi$ model developed thus far is

263: uncomputable or, at best, approximable. These sections should support

264: the claim that every AI problem can be formulated (and hence

265: solved) within the AI$\xi$ model. For some classes we give

266: concrete examples to illuminate the

267: scope of the problem class. We first formulate each problem class

268: in its natural way (when $\mu^{\mbox{\tiny problem}}$ is known) and

269: then construct a formulation within the AI$\mu$ model and prove

270: its equivalence. We then consider the consequences of

271: replacing $\mu$ by $\xi$. The main goal is to understand why and

272: how the problems are solved by AI$\xi$. We only highlight special

273: aspects of each problem class. Sections

274: \ref{secSP}--\ref{secOther} together should give a better picture

275: of the AI$\xi$ model. We do not study every aspect for every

276: problem class. The sections might be read selectively. They are

277: not necessary to understand the remaining sections.

278:

279: {\it Section \ref{secSP}:} Using the AI$\mu$ model for sequence

280: prediction (SP) is identical to Bayesian sequence prediction

281: SP$\Theta_\mu$. One might expect, when using the AI$\xi$ model for

282: sequence prediction, one would recover exactly the universal

283: sequence prediction scheme SP$\Theta_\xi$, as AI$\xi$ is a unification of the

284: AI$\mu$ model and the idea of universal probability $\xi$. Unfortunately

285: this is not the case. One reason is that $\xi$ is only a

286: probability distribution in the inputs $x$ and not in the outputs

287: $y$. This is also one of the origins of the difficulty of proving error/credit

288: bounds for AI$\xi$. Nevertheless, we argue that AI$\xi$ is

289: equally well suited for sequence prediction as SP$\Theta_\xi$ is.

290: In a very limited setting we prove a (weak) error bound for

291: AI$\xi$ which gives hope that a general proof is attainable.

292:

293: {\it Section \ref{secSG}:} A very important class of problems are

294: strategic games (SG). We restrict ourselves to deterministic strictly

295: competitive strategic games like chess. If the environment is a

296: minimax player, the AI$\mu$ model itself reduces to a minimax

297: strategy. Repeated games of fixed lengths are a special case for

298: factorizable $\mu$. The consequences of variable game length are

299: sketched. The AI$\xi$ model has to learn the rules of the game

300: under consideration, as it has no prior information about these

301: rules. We describe how AI$\xi$ actually learns these rules.

302:

303: {\it Section \ref{secFM}:} There are many problems that fall into

304: the category 'resource bounded function minimization' (FM). They

305: include the Traveling Salesman Problem, minimizing production

306: costs, inventing new materials or even producing, e.g.\ nice

307: paintings, which are (subjectively) judged by a human. The task is to

308: (approximately) minimize some function $f\!:\!Y\!\to\!Z$ within

309: a minimal number of function calls. We will see that a greedy model

310: trying to minimize $f$ in every cycle fails. Although the greedy

311: model has nothing to do with downhill or gradient techniques

312: (there is nothing like a gradient or direction for functions over

313: $Y$) which are known to fail, we discover the same difficulties.

314: FM has already nearly the full complexity of

315: general AI. The reason is that FM can actively influence the

316: information gathering process by its trials $y_k$ (whereas SP and

317: CF cannot). We discuss in detail the optimal FM$\mu$ model and

318: its inventiveness in choosing the $y\!\in\!Y$. A discussion of the subtleties when

319: using AI$\xi$ for function minimization follows.

320:

321: {\it Section \ref{secEX}:} Reinforcement learning, as performed by the

322: AI$\xi$ model, is an important learning technique, but not the only one.

323: To improve the speed of learning, supervised learning, i.e.

324: learning by acquiring knowledge, or learning from a constructive

325: teacher is necessary. We show how AI$\xi$ learns to learn

326: supervised. It actually establishes supervised learning very

327: quickly within $O(1)$ cycles.

328:

329: {\it Section \ref{secOther}} gives a brief survey of other general

330: aspects, ideas and methods in AI, and their connection to the

331: AI$\xi$ model. Some aspects are directly included, others are or

332: should be emergent.

333:

334: {\it Section \ref{secTime}:} Up to now we have shown the universal

335: character of the AI$\xi$ model but have completely ignored

336: computational aspects. Let us assume that there exists some

337: algorithm $\tilde p$ of size $\tilde l$ with computation time per

338: cycle $\tilde t$, which behaves in a sufficiently intelligent way

339: (this assumption is the very basis of AI). The

340: algorithm $p^\best$ should run all algorithms of length

341: $\leq\!\tilde l$ for $\tilde t$ time steps in every cycle and select the best

342: output among them. So we have an algorithm which runs in time

343: $\tilde t\!\cdot\!2^{\tilde l}$ and is at least as good as $\tilde

344: p$, i.e.\ it also serves our needs apart from the (very large

345: but) constant multiplicative factor in computation time. This idea

346: of the 'typing monkeys', one of them eventually producing 'Shakespeare', is

347: well known and widely used in theoretical computer science. The

348: difficult part is the selection of the algorithm with the best

349: output. A further complication is that the selection process

350: itself must have only limited computation time. We present a

351: suitable modification of the AI$\xi$ model which solves these

352: difficult problems. The solution is somewhat involved from an

353: implementational aspect. An implementation would include first

354: order logic, the definition of a Universal Turing machine within

355: it and proof theory. The assumptions behind this construction are

356: discussed at the end.
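
The running time of this brute-force scheme can be checked by a simple count, sketched here under the assumption that algorithms are coded as binary strings (the symbol $N$ is introduced only for this illustration). There are $2^i$ binary strings of length $i$, hence
\beqn
  N \;=\; \sum_{i=0}^{\tilde l} 2^i \;=\; 2^{\tilde l+1}-1
\eeqn
candidate algorithms of length $\leq\!\tilde l$. Running each of them for $\tilde t$ steps costs at most
$\tilde t\!\cdot\!N<2\,\tilde t\!\cdot\!2^{\tilde l}$ steps per cycle, not counting the selection of the best output.
This is consistent with the order $t\!\cdot\!2^l$ stated in the abstract.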

357:

358: {\it Section \ref{secOutlook}} contains some discussion of

359: otherwise unmentioned topics and some (personal) remarks. It also

360: serves as an outlook to further research.

361:

362: {\it Section \ref{secCon}} contains the conclusions.

363:

364: %------------------------------%

365: \paragraph{History \& References:}

366: %------------------------------%

367: Kolmogorov65 \cite{Kol65} suggested defining the information

368: content of an object as the length of the shortest program

369: computing a representation of it. Solomonoff64 \cite{Sol64}

370: invented the closely related universal prior probability

371: distribution and used it for binary sequence prediction

372: \cite{Sol64,Sol78} and function inversion and minimization

373: \cite{Sol86}. Together with Chaitin66\&75 \cite{Cha66,Cha75} this

374: was the invention of what is now called Algorithmic Information

375: theory. For further literature and many applications see

376: \cite{LiVi93}. Other interesting 'applications' can be found in

377: \cite{Cha91,Sch99,Vov98}. Related topics are the Weighted Majority

378: Algorithm invented by Littlestone and Warmuth89 \cite{LiWa89},

379: universal forecasting by Vovk92 \cite{Vov92}, Levin search73

380: \cite{Lev73}, pac-learning introduced by Valiant84 \cite{Val84}

381: and Minimum Description Length \cite{LiVi92,Ris89}. Resource

382: bounded complexity is discussed in \cite{Dal73,Fed92,Ko86,Pin97},

383: resource bounded universal probability in \cite{LiVi91,LiVi93}.

384: Implementations are rare \cite{Con97,Sch95,Sch96}. Excellent

385: reviews with a philosophical touch are \cite{LiVi92a,Sol97}. For

386: an older, but general review of inductive inference see Angluin83

387: \cite{Ang83}. For an excellent introduction to algorithmic

388: information theory, further literature and many applications one

389: should consult the book of Li and Vit\'anyi97 \cite{LiVi93}. The

390: survey \cite{LiVi92} or chapters 4 and 5 of \cite{LiVi93}

391: should be sufficient to follow the arguments and proofs

392: in this paper. %%%%%%%%%%%%%%%

393: The other ingredient in our AI$\xi$ model is sequential decision theory. We

394: do not need much more than the maximum expected utility principle

395: and the expectimax algorithm \cite{Mic66,Rus95}. The book of von Neumann and

396: Morgenstern44 \cite{Neu44} might be seen as the initiation of

397: game theory, which already contains the expectimax algorithm

398: as a special case. The literature on decision theory is

399: vast and we only give two possibly interesting references with

400: regard to this paper. Cheeseman85\&88 \cite{Che85} is a defense

401: of the use of probability theory in AI. Pearl88 \cite{Pea88} is a

402: good introduction and overview of probabilistic reasoning.

403:

404: \newpage

405: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

406: \section{The AI$\mu$ Model in Functional Form}\label{secAIfunc}

407: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

408:

409: %------------------------------%

410: \paragraph{The cybernetic or agent model:}

411: %------------------------------%

412: A good way to start thinking about intelligent systems is to

413: consider more generally cybernetic systems, in AI usually called

414: agents. This avoids having to struggle with the meaning of

415: intelligence from the very beginning. A cybernetic system is a

416: control circuit with input $x$ and output $y$ and an internal

417: state. From an external input and the internal state the system

418: calculates deterministically or stochastically an output. This

419: output (action) modifies the environment and leads to a new input

420: (reception). This continues ad infinitum or for a finite number of

421: cycles. As explained in the last section, we need some credit

422: assignment to the cybernetic system. The input $x$ is divided into

423: two parts, the standard input $x'$ and some credit input $c$. If

424: input and output are represented by strings, a deterministic

425: cybernetic system can be modeled by a Turing machine $p$. $p$ is

426: called the policy of the agent, which determines the action taken in

427: response to an input. If the environment is also computable it might be modeled

428: by a Turing machine $q$ as well. The interaction of the agent

429: with the environment can be illustrated as follows:

430:

431: \begin{center}\label{cyberpic}

432: %\input KCUnAI.pic

433: \special{em:linewidth 0.4pt}

434: \linethickness{0.4pt}

435: \begin{picture}(106,47)

436: \thinlines

437: \put(1,41){\framebox(10,6)[cc]{$c_1$}}

438: \put(11,41){\framebox(6,6)[cc]{$x'_1$}}

439: \put(17,41){\framebox(10,6)[cc]{$c_2$}}

440: \put(27,41){\framebox(6,6)[cc]{$x'_2$}}

441: \put(33,41){\framebox(10,6)[cc]{$c_3$}}

442: \put(43,41){\framebox(6,6)[cc]{$x'_3$}}

443: \put(49,41){\framebox(10,6)[cc]{$c_4$}}

444: \put(59,41){\framebox(6,6)[cc]{$x'_4$}}

445: \put(65,41){\framebox(10,6)[cc]{$c_5$}}

446: \put(75,41){\framebox(6,6)[cc]{$x'_5$}}

447: \put(81,41){\framebox(10,6)[cc]{$c_6$}}

448: \put(91,41){\framebox(6,6)[cc]{$x'_6$}}

449: \put(102,44){\makebox(0,0)[cc]{...}}

450: \put(1,1){\framebox(16,6)[cc]{$y_1$}}

451: \put(17,1){\framebox(16,6)[cc]{$y_2$}}

452: \put(33,1){\framebox(16,6)[cc]{$y_3$}}

453: \put(49,1){\framebox(16,6)[cc]{$y_4$}}

454: \put(65,1){\framebox(16,6)[cc]{$y_5$}}

455: \put(81,1){\framebox(16,6)[cc]{$y_6$}}

456: \put(102,4){\makebox(0,0)[cc]{...}}

457: \put(97,47){\line(1,0){9}}

458: \put(97,41){\line(1,0){9}}

459: \put(97,7){\line(1,0){9}}

460: %\put(97,1){\line(0,0){0}} % zero-length line with invalid slope (0,0); disabled

461: \put(97,1){\line(1,0){9}}

462: \put(1,21){\framebox(16,6)[cc]{working}}

463: \thicklines

464: \put(17,17){\framebox(20,14)[cc]{$\displaystyle{System\atop\bf p}$}}

465: \thinlines

466: \put(37,27){\line(1,0){14}}

467: \put(37,21){\line(1,0){14}}

468: \put(39,24){\makebox(0,0)[lc]{tape ...}}

469: \put(56,21){\framebox(16,6)[cc]{working}}

470: \thicklines

471: \put(72,17){\framebox(20,14)[cc]{$\displaystyle{Environ-\atop ment\quad\bf q}$}}

472: \thinlines

473: \put(92,27){\line(1,0){14}}

474: \put(92,21){\line(1,0){14}}

475: \put(94,24){\makebox(0,0)[lc]{tape ...}}

476: \thicklines

477: \put(54,41){\vector(-3,-1){29}}

478: \put(84,31){\vector(-3,1){30}}

479: \put(54,7){\vector(3,1){30}}

480: \put(25,17){\vector(3,-1){29}}

481: \end{picture}

482: \end{center}

483:

484: $p$ as well as $q$ have unidirectional input and output tapes and

485: bidirectional working tapes. What entangles the agent with the

486: environment is the fact that the upper tape serves as input tape

487: for $p$, as well as output tape for $q$, and that the lower tape

488: serves as output tape for $p$ as well as input tape for $q$.

489: Further, the reading head must always be left of the writing head,

490: i.e.\ the symbols must first be written before they are read. $p$

491: and $q$ have their own mutually inaccessible working tapes

492: containing their own 'secrets'. The heads move in the following

493: way. In the $k^{th}$ cycle $p$ writes $y_k$, $q$ reads $y_k$, $q$

494: writes $x_k\!\equiv\!c_kx_k'$, $p$ reads $x_k\!\equiv\!c_kx_k'$,

495: followed by the $(k+1)^{th}$ cycle and so on. The whole process

496: starts with the first cycle, with all heads at the start of their tapes and all working

497: tapes empty. We call Turing machines behaving in

498: this way {\it chronological Turing machines}, for obvious

499: reasons. Before continuing, some notation for strings is

500: appropriate.

501:

502: %------------------------------%

503: \paragraph{Strings:}

504: %------------------------------%

505: We will denote strings over the alphabet $X$ by

506: $s\!=\!x_1x_2...x_n$, with $x_k\!\in\!X$, where $X$ is

507: alternatively interpreted as a non-empty subset of $I\!\!N$ or

508: itself as a prefix free set of binary strings.

509: $l(s)=l(x_1)\!+...+\!l(x_n)$ is the length of $s$. Analogous

510: definitions hold for $y_k\!\in\!Y$. We call $x_k$ the $k^{th}$

511: input word and $y_k$ the $k^{th}$ output word (rather than

512: letter). The string $s=y_1x_1...y_nx_n$ represents the

513: input/output in chronological order. Due to the prefix property of

514: the $x_k$ and $y_k$, $s$ can be uniquely separated into its words.

515: The words appearing in strings are always in chronological order.

516: We further introduce the following abbreviations: $\epsilon$ is the

517: empty string, $x_{n:m}:=x_nx_{n+1}...x_{m-1}x_m$ for $n\leq m$ and

518: $\epsilon$ for $n>m$. $x_{<n}:=x_1... x_{n-1}$. Analogously for $y$.

519: Further, $y\!x_n\!:=y_nx_n$, $y\!x_{n:m}\!:=\!y_nx_n...y_mx_m$,

520: and so on.
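
As a small illustration of this notation (the choice $X\!=\!Y\!=\!\{0,1\}$, so that every word has length one, and the indices used are made here purely as an example), the abbreviations expand as follows:
\bqan
  x_{1:3}=x_1x_2x_3, \qquad x_{<3}=x_{1:2}=x_1x_2, \qquad x_{3:2}=\epsilon, \\[1ex]
  y\!x_{1:2}=y_1x_1y_2x_2, \qquad l(y\!x_{1:2})=l(y_1)+l(x_1)+l(y_2)+l(x_2)=4 .
\eqan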

521:

522: %------------------------------%

523: \paragraph{AI model for known deterministic environment:}

524: %------------------------------%

525: Let us define for the chronological Turing machine $p$ a partial

526: function also named $p\!:\!X^*\!\rightarrow\!Y^*$ with

527: $y_{1:k}=p(x_{<k})$ where $y_{1:k}$ is the output of Turing

528: machine $p$ on input $x_{<k}$ in cycle $k$, i.e.\ where $p$ has read

529: up to $x_{k-1}$ but no further. In an analogous way, we define

530: $q\!:\!Y^*\!\rightarrow\!X^*$ with $x_{1:k}=q(y_{1:k})$.

531: Conversely, for every partial recursive chronological function we

532: can define a corresponding chronological Turing machine. Each

533: (system,environment) pair $(p,q)$ produces a unique I/O sequence

534: $\omega(p,q):=y_1^{pq}x_1^{pq}y_2^{pq}x_2^{pq}...$. When we look

535: at the definition of $p$ and $q$ we see a nice symmetry between

536: the cybernetic system and the environment. Until now, not much

537: intelligence is in our system. Now the credit assignment comes

538: into the game and removes the symmetry somewhat. We split the

539: input $x_k\!\in\!X\!:=\!C\!\times\!X'$ into a regular part

540: $x_k'\!\in\!X'$ and a credit $c_k\!\in\!C\!\subset\!I\!\!R$. We

541: define $x_k\!\equiv\!c_kx_k'$ and $c_k\equiv c(x_k)$. The goal of

542: the system should be to maximize received credits. This is called

543: reinforcement learning. The reason for the asymmetry is that

544: eventually we (humans) will be the environment with which the

545: system will communicate and {\it we} want to dictate what is good

546: and what is wrong, not the other way round. This one-way learning

547: (the system learns from the environment, and not conversely)

548: neither prevents the system from becoming more intelligent than the

549: environment, nor does it prevent the environment from learning from

550: the system because the environment can itself interpret the

551: outputs $y_k$ as a regular and a credit part. The environment is

552: just not forced to learn, whereas the system is. In cases where we

553: restrict the credit to two values

554: $c\!\in\!C\!=\!I\!\!B\!:=\!\{0,1\}$, $c\!=\!1$ is interpreted as a

555: positive feedback, called {\it good} or {\it correct} and

556: $c\!=\!0$ a negative feedback, called {\it bad} or {\it error} in

557: the following. Further, let us restrict for a while the lifetime

558: (number of cycles) $T$ of the system to a large, but finite value.

559: Let $C_{km}(p,q)\!:=\!\sum_{i=k}^mc(x_i)$ be the total credit the

560: system $p$ receives from the environment $q$ in the cycles $k$ to

561: $m$. It is now natural to call the system, which maximizes the

562: total credit $C_{1T}$, called utility, the {\it best} or {\it most intelligent}

563: one\footnote {$\maxarg_p C(p)$ is the $p$ which maximizes

564: $C(\cdot)$. If there is more than one maximum we might choose the

565: lexicographically smallest one for definiteness.}.

566: \beqn

567:  p^{\best,T,q}=\maxarg_p C_{1T}(p,q) \quad\Rightarrow\quad

568:  C_{kT}(p^{\best,T,q},q) \geq C_{kT}(p,q) \quad \forall p

569: \eeqn

570: For $k\!=\!1$ this is obvious and for $k\!>\!1$ easy to see.

If $T$, $Y$ and $X$ are finite, the number of different behaviours
of the system, i.e.\ the size of the search space, is finite. Therefore, because
we have assumed that $q$ is known, $p^{\best,T,q}$ can effectively
be determined (by pre-analyzing all behaviours). The main reason
for restricting to finite $T$ was not to ensure computability of
$p^{\best,T,q}$ but that the limit $T\!\to\infty$ might not exist.
None of this is deep: the (unrealistic) assumption of a
completely known deterministic environment $q$ has simply trivialized
everything.
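
As a rough illustration of the size of this search space (a sketch,
assuming that two systems count as the same behaviour whenever they
produce the same outputs on every input history of length less than
$T$): in cycle $k$ the system sees one of $|X|^{k-1}$ possible input
histories $x_{<k}$ and answers with one of $|Y|$ outputs, so
\beqn
  \#\{\mbox{behaviours}\} \;=\; \prod_{k=1}^T |Y|^{\,|X|^{k-1}}
  \;=\; |Y|^{\,(|X|^T-1)/(|X|-1)} \qquad (|X|\geq 2).
\eeqn
Against a single known deterministic $q$ only the realized output
sequence $y_{1:T}$ matters, so it suffices to compare at most $|Y|^T$
candidates. For $|X|\!=\!|Y|\!=\!2$ and $T\!=\!3$ this gives
$2^{1+2+4}\!=\!128$ behaviours, but only $2^3\!=\!8$ distinct output
sequences.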

580: %------------------------------%

581: \paragraph{AI model for known prior probability:}

582: %------------------------------%

583: Let us now weaken our assumptions by replacing the environment $q$

584: with a probability distribution $\mu(q)$ over chronological functions.

585: $\mu$ might be interpreted

586: in two ways. Either the environment itself behaves in a

587: probabilistic way defined by $\mu$ or the true environment is

deterministic, but we only have probabilistic information about which
environment is the true one. Combinations of
both cases are also possible. The interpretation does not matter in the
following. We just assume that we know $\mu$ but nothing more
about the environment, whatever the interpretation may be.

593:

594: Let us assume we are in cycle $k$ with history

595: $\hh y\!\hh x_1...\hh y\!\hh x_{k-1}$

596: and ask for the {\it best} output $y_k$.

597: Further, let

598: $\hh Q_k\!:=\!\{q:q(\hh y_{<k})=\hh x_{<k}\}$

599: be the set of all environments producing the above history.

600: The expected credit

601: for the next $m\!-\!k\!+\!1$ cycles (given the above history) is

given by a conditional expectation:

603: \beq\label{eefunc}

604:   C^\mu_{km}(p|\hh y\!\hh x_{<k}) \;:=\;

605:   { \sum_{q\in \hh Q_k} \mu(q)C_{km}(p,q) \over

606:     \sum_{q\in \hh Q_k} \mu(q) }.

607: \eeq

608: We cannot simply determine $\maxarg_p(C_{1T})$ unlike the

609: deterministic case because the history is no longer

610: deterministically determined by $p$ and $q$, but depends on $p$

611: and $\mu$ {\it and} on the outcome of a stochastic process.

612: Every new cycle adds new information ($\hh x_i$) to the

613: system. This is indicated by the dots over the symbols.

614: In cycle $k$ we have to maximize the expected future

615: credit, taking into account the information in the history $\hh

616: y\!\hh x_{<k}$. This information is not already present

617: in $p$ and $q/\mu$ at the system's start unlike in the deterministic

618: case.
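
A small numerical illustration of (\ref{eefunc}) may help (a sketch:
the environments $q_1,q_2,q_3$ and all numbers are hypothetical).
Suppose $\mu(q_1)\!=\!0.5$, $\mu(q_2)\!=\!0.3$, $\mu(q_3)\!=\!0.2$,
that only $q_1$ and $q_2$ reproduce the history, i.e.\
$\hh Q_k\!=\!\{q_1,q_2\}$, and that some fixed $p$ would earn
$C_{km}(p,q_1)\!=\!3$ and $C_{km}(p,q_2)\!=\!1$. Then
\beqn
  C^\mu_{km}(p|\hh y\!\hh x_{<k}) \;=\;
  {0.5\cdot 3 + 0.3\cdot 1 \over 0.5+0.3} \;=\;
  {1.8\over 0.8} \;=\; 2.25 .
\eeqn
The environment $q_3$ is ruled out by the history and contributes
nothing. The denominator renormalizes the remaining weights to
$0.625$ and $0.375$.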

619:

620: Further, we want to generalize the finite lifetime $T$ to a

621: dynamical (computable) farsightedness

622: $h_k\!\equiv\!m_k\!-\!k\!+\!1\!\geq\!1$, called horizon in the

623: following. For $m_k\!=\!T$ we have our original finite lifetime,

624: for $m_k\!=\!k\!+\!m\!-\!1$ the system maximizes in every cycle the next

625: $m$ expected credits. A discussion of the choices $m_k$ is delayed

626: to section \ref{secAIxi}.

627:

628: The next $h_k$ credits are maximized by

629: $$

630:   p_k^\best \;:=\; \maxarg_{p\in \hh P_k} C^\mu_{km_k}(p|\hh y\!\hh

631:   x_{<k}),

632: $$

633: where $\hh P_k\!:=\!\{p:p(\hh x_{<k})=\hh y_{<k}*\}$ is the set of

634: systems consistent with the current history.

635: $p_k^\best$ depends on $k$ and is used only in step $k$ to

636: determine $\hh y_k$ by

637: $ p_k^\best(\hh x_{<k};\hh y_{<k})\!=\!\hh y_{<k}\hh y_k$.

638: After writing $\hh y_k$ the environment replies with $\hh x_k$

639: with (conditional) probability $\mu(\hh Q_{k+1})/\mu(\hh Q_k)$. This

640: probabilistic outcome provides new information to the system.

641: The cycle $k\!+\!1$ starts with determining $\hh y_{k+1}$ from

$p_{k+1}^\best$ (which differs from $p_k^\best$ as $\hh x_k$ is

643: now fixed) and so on. Note that $p_k^\best$ depends also on

644: $\hh y_{<k}$ because $\hh P_k$ and $\hh Q_k$ do so.

645: But recursively inserting $p_{k-1}^\best$ and

646: so on, we can define

647: \beq\label{pbestfunc}

648:   p^\best(\hh x_{<k}) \;:=\;

  p_k^\best(\hh x_{<k};p_{k-1}^\best(\hh x_{<k-1}...p_1^\best))

650: \eeq

651: It is a chronological function and computable if $X$, $Y$ and $m_k$ are

652: finite. The policy $p^\best$ defines our AI$\mu$ model.

653: For deterministic\footnote{We call a probability distribution deterministic

654: if it is 1 for exactly one argument and 0 for all others.}

655: $\mu$ this model reduces to the deterministic case.

656:

657: It is important to maximize the sum of future credits and not, for instance,

658: to be greedy and only maximize the next credit, as is done e.g. in

659: sequence prediction. For example, let the environment be a

sequence of chess games, where each cycle corresponds to one move.
Only at the end of each game is a positive credit $c\!=\!1$ given
to the system, if it won the game (and made no illegal move).
For the system, maximizing all future credits means trying to win as
many games as possible in as short a time as possible (and avoiding illegal
moves). The same performance is reached if we choose
$m_k\!=\!k\!+\!m\!-\!1$ with $m$ much larger than the typical game
length. Maximizing only the next credit would result in a very bad
chess-playing system. Even if we made our credit $c$ finer,
e.g.\ by evaluating the number of chessmen, the system would still play
very bad chess for $m\!=\!1$.

671:

672: The AI$\mu$ model still depends on $\mu$ and $m_k$. $m_k$ is addressed

673: in section \ref{secAIxi}. To get our

674: final universal AI model the idea is to replace $\mu$ by the

675: universal probability $\xi$, defined later. This is motivated

676: by the fact that $\xi\!\to\!\mu$ in a certain sense for any $\mu$.

677: With $\xi$ instead of $\mu$ our model no longer depends on any

678: parameters, so it is truly universal. It remains to show that it

679: produces intelligent outputs. But let us continue step by step. In

680: the next section we develop an alternative but equivalent

681: formulation of the AI model given above. Whereas the functional

682: form is more suitable for theoretical considerations, especially

for the development of a time-bounded version in section

684: \ref{secTime}, the iterative formulation of the next section will

685: be more appropriate for the explicit calculations in most of the

686: other sections.

687:

688: \newpage

689: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

690: \section{The AI$\mu$ Model in Recursive and Iterative Form}\label{secAImurec}

691: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

692:

693: %------------------------------%

694: \paragraph{Probability distributions:}

695: %------------------------------%

696: Throughout the paper we deal with sequences/strings and

conditional probability distributions on strings. Some
notation is therefore in order.

699:

700: We use Greek letters for probability distributions and underline their

701: arguments to indicate that they are probability arguments. Let

702: $\rho_n(\pb x_1...\pb x_n)$ be the probability that a string starts with

703: $x_1...x_n$. We only consider sufficiently long strings, so the

704: $\rho_n$ are normalized to 1. Moreover, we drop the index on $\rho$

705: if it is clear from its arguments:

706: \beq\label{prop}

707:   \sum_{x_n\in X}\rho(\pb x_{1:n}) \equiv

708:   \sum_{x_n}\rho_n(\pb x_{1:n}) =

709:   \rho_{n-1}(\pb x_{<n}) \equiv

710:   \rho(\pb x_{<n})

711:   ,\quad

712:   \rho(\epsilon) \equiv \rho_0(\epsilon)=1.

713: \eeq

714: We also need conditional probabilities derived from Bayes' rule.

715: We prefer a notation which preserves the chronological order of the words, in

716: contrast to the standard notation $\rho(\cdot|\cdot)$ which flips it. We extend the

717: definition of $\rho$ to the conditional case with

718: the following convention for its arguments: An underlined argument

719: $\pb x_k$ is a probability variable and other non-underlined

720: arguments $x_k$ represent conditions. With this convention, Bayes'

721: rule has the form $\rho(x_{<n}\pb x_n)\!=\!\rho(\pb x_{1:n})/\rho(\pb

722: x_{<n})$.

723: The equation states that the probability that a string

724: $x_1...x_{n-1}$ is followed by $x_n$ is equal to the probability

725: of $x_1...x_n*$ divided by the probability of

726: $x_1...x_{n-1}*$. We use $x*$ as a shortcut for 'strings

727: starting with $x$'.
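
As a simple illustration of this notation (a sketch using a
hypothetical Bernoulli source with parameter $0\!<\!\theta\!<\!1$,
which is not otherwise used in this paper): for $X\!=\!I\!\!B$ and
independent bits, each equal to 1 with probability $\theta$,
\beqn
  \rho(\pb x_{1:n}) \;=\; \theta^{n_1}(1-\theta)^{n-n_1},
  \qquad n_1:=\sum_{i=1}^n x_i .
\eeqn
This satisfies (\ref{prop}), since summing over $x_n\!\in\!\{0,1\}$
contributes the factor $\theta+(1-\theta)=1$. Bayes' rule then gives
$\rho(x_{<n}\pb x_n)\!=\!\theta$ for $x_n\!=\!1$ and $1\!-\!\theta$
for $x_n\!=\!0$, independent of the condition $x_{<n}$, as expected
for independent bits.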

728:

729: The introduced notation is also suitable for defining the

730: conditional probability $\rho(y_1\pb x_1...y_n\pb x_n)$ that the

731: environment reacts with $x_1...x_n$ under the condition that the

732: output of the system is $y_1...y_n$.

733: The environment is chronological, i.e. input $x_i$ depends on

734: $y\!x_{<i}y_i$ only. In the probabilistic case this means that

735: $\rho(y\!\pb x_{<k}y_k)\!:=\!\sum_{x_k}\rho(y\!\pb x_{1:k})$

is independent of $y_k$, hence a trailing $y_k$ in the arguments of $\rho$

737: can be dropped. Probability distributions with this

738: property will be called {\it chronological}.

739: The $y$ are always

740: conditions, i.e.\ never underlined, whereas additional

741: conditioning for the $x$ can be obtained with Bayes' rule

742: \bqa\label{bayes2}

743:   \rho(y\!x_{<n}y\!\pb x_n) =

744:   \rho(y\!\pb x_{1:n})/\rho(y\!\pb x_{<n}) \quad\mbox{and}

745:   \\[4mm]

746:   \rho(y\!\pb x_{1:n}) \;=\;

747:   \rho(y\!\pb x_1)\!\cdot\!\rho(y\!x_1y\!\pb x_2)\!\cdot...\cdot\!

748:   \rho(y\!x_{<n}y\!\pb x_n)

749: \eqa

750: The second equation is the first equation applied $n$ times.

751:

752: %------------------------------%

753: \paragraph{Alternative Formulation of the AI$\mu$ Model:}

754: %------------------------------%

755: Let us define the AI$\mu$ model $p^\best$ in a different way. In the

756: next subsection we will show that the $p^\best$ model defined here

757: is identical to the functional definition of $p^\best$ given

758: in the last section.

759:

760: Let $\mu(y\!\pb x_{1:k})$ be the true chronological prior probability

761: that the environment reacts with $x_{1:k}$ if provided with

762: actions $y_{1:k}$ from the system. We assume the cybernetic model depicted on page

763: \pageref{cyberpic} to be valid.

764: Next we define $C_{k+1,m}^\best(y\!x_{1:k})$ to be the $\mu$

765: expected credit sum in cycles $k\!+\!1$ to $m$ with outputs $y_i$

766: generated by system $p^\best$ and past responses $x_i$ from the

767: environment. Adding $c(x_k)$ we get the credit including cycle

768: $k$. The probability of $x_k$,

given $y\!x_{<k}y_k$, is given by the conditional probability

770: $\mu(y\!x_{<k}y\!\pb x_k)$. So the expected credit sum

771: in cycles $k$ to $m$ given $y\!x_{<k}y_k$ is

772: \beq\label{ebesty}

773:   C_{km}^\best(y\!x_{<k}y_k) \;:=\;

774:   \sum_{x_k}[c(x_k)+C_{k+1,m}^\best(y\!x_{1:k})] \!\cdot\!

775:   \mu(y\!x_{<k}y\!\pb x_k)

776: \eeq

Now we ask how $p^\best$ chooses
$y_k$. It should choose $y_k$ so as to maximize the future credit.
So the expected credit in cycles $k$ to $m$ given
$y\!x_{<k}$ and $y_k$ chosen by $p^\best$ is
$ C_{km}^\best(y\!x_{<k})
\!:=\!\max_{y_k}C_{km}^\best(y\!x_{<k}y_k)$.

783: Together with the induction start

784: \beq\label{ee0}

785:   C_{m+1,m}^\best(y\!x_{1:m}) \;:=\; 0

786: \eeq

$C_{km}^\best$ is completely defined.

788: We might summarize one cycle into the formula

789: \beq\label{airec2}

790:   C_{km}^\best(y\!x_{<k}) \;=\;

791:   \max_{y_k}\sum_{x_k}

792:   [c(x_k)+C_{k+1,m}^\best(y\!x_{1:k})] \!\cdot\!

793:   \mu(y\!x_{<k}y\!\pb x_k)

794: \eeq

795: If $m_k$ is our horizon function of $p^\best$ and

796: $\hh y\!\hh x_{<k}$ is the actual history in cycle

797: $k$, the output $\hh y_k$ of the system is explicitly given by

798: \beq\label{pbestrec}

799:   \hh y_k \;=\; \maxarg_{y_k}C_{km_k}^\best

800:   (\hh y\!\hh x_{<k}y_k) \;=:\;

801:   p^\best(\hh y\!\hh x_{<k})

802: \eeq

803: Then the environment responds $\hh x_k$ with

804: probability $\mu(\hh y\!\hh x_{<k}\hh y\!\pb{\hh

805: x}_k)$. Then cycle $k\!+\!1$ starts. We might

806: unfold the recursion (\ref{airec2}) further and give $\hh y_k$

non-recursively as

808: \beq\label{ydotrec}

809:   \hh y_k \;=\;

810:   \maxarg_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}\;...\;

811:   \max_{y_{m_k}}\sum_{x_{m_k}}

812:   (c(x_k)\!+...+\!c(x_{m_k})) \!\cdot\!

813:   \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m_k})

814: \eeq

815: This has a direct interpretation: the probability of inputs

816: $x_{k:m_k}$ in cycle $k$ when the system outputs $y_{k:m_k}$ and

817: the actual history is $\hh y\!\hh x_{<k}$ is $\mu(\hh y\!\hh

818: x_{<k}y\!\pb x_{k:m_k})$. The future credit in this case is

819: $c(x_k)\!+...+\!c(x_{m_k})$. The best expected credit is obtained

by averaging over the $x_i$ ($\sum_{x_i}$) and maximizing over the $y_i$.

821: This has to be done in chronological order to correctly

822: incorporate the dependency of $x_i$ and $y_i$ on the history.

823: This is essentially the expectimax algorithm/sequence

824: \cite{Mic66,Rus95}. The AI$\mu$ model is {\it optimal} in the

825: sense that no other policy leads to higher expected credit.
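
The difference between a greedy and a farsighted choice in
(\ref{ydotrec}) can be seen in a toy example (a sketch only: the
environment and all numbers below are hypothetical). Let
$Y\!=\!X\!=\!I\!\!B$ with credit $c(x)\!=\!x$, let $k\!=\!1$ with empty
history, and let the probabilities of receiving $x_i\!=\!1$ be
$$
  \mu(y_1\pb x_1)\big|_{x_1=1} = \left\{\begin{array}{ll}
    0.6 & \mbox{if } y_1=0 \\
    0.5 & \mbox{if } y_1=1
  \end{array}\right.
  ,\qquad
  \mu(y_1x_1y_2\pb x_2)\big|_{x_2=1} = \left\{\begin{array}{ll}
    0.2 & \mbox{if } y_1=0 \\
    0.1 & \mbox{if } y_1=1,\ y_2=0 \\
    0.9 & \mbox{if } y_1=1,\ y_2=1
  \end{array}\right.
$$
For $m_1\!=\!1$ the system compares the expected next credits $0.6$
and $0.5$ and outputs $\hh y_1\!=\!0$. For $m_1\!=\!2$, the recursion
(\ref{airec2}) with the induction start (\ref{ee0}) first gives
$C_{22}^\best(y\!x_1)=\max_{y_2}\sum_{x_2}c(x_2)\mu(y_1x_1y_2\pb x_2)$,
which is $0.2$ for $y_1\!=\!0$ and $0.9$ for $y_1\!=\!1$, independent
of $x_1$. Hence
$$
  C_{12}^\best(y_1\!=\!0) = 0.6+0.2 = 0.8, \qquad
  C_{12}^\best(y_1\!=\!1) = 0.5+0.9 = 1.4,
$$
and $\hh y_1\!=\!1$: giving up $0.1$ expected credit in cycle 1 buys
an additional $0.7$ in cycle 2.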

826:

827: These explicit as well as recursive definitions of the AI$\mu$ model

828: are more index intensive as compared to the functional form but

829: are more suitable for explicit calculations.

830:

831: %------------------------------%

832: \paragraph{Equivalence of Functional and Iterative AI model:}

833: %------------------------------%

834: The iterative environmental probability $\mu$ is given by the

835: functional form in the following way,

836: \beq\label{mufr}

837:   \mu(y\!\pb x_{1:k}) \;=\;

838:   \nq\sum_{q:q(y_{1:k})=x_{1:k}}\nq \mu(q)

839: \eeq

840: as is easy to see. We will prove the equivalence of

841: (\ref{pbestfunc}) and (\ref{pbestrec}) only for $k\!=\!2$ and

$m_2\!=\!3$. The proof of the general case is completely analogous, except

843: that the notation becomes quite messy.

844:

845: Let us first evaluate (\ref{eefunc}) for fixed $\hh

846: y_1\hh x_1$ and some $p\!\in\!\hh P_2$, i.e. $p(\hh

847: x_1)=\hh y_1y_2$ for some $y_2$. If the next input to the

848: system is $x_2$, $p$ will respond with $p(\hh x_1

849: x_2)=\hh y_1y_2y_3$ for some $y_3$ depending on $x_2$. We

850: write $y_3(x_2)$ in the following\footnote{Dependency on dotted

851: words like $\hh x_1$ is not shown as the dotted words are fixed.}.

852: The numerator of (\ref{eefunc}) simplifies to

853: \beqn

854:   \sum_{q\in \hh Q_2} \mu(q)C_{23}(p,q) \;=\;

855:   \nq\sum_{q:q(\hh y_1)=\hh x_1}\nq \mu(q)C_{23}(p,q)

856:   \;=\; \sum_{x_2x_3}(c(x_2)\!+\!c(x_3))

857:   \nq\nq\sum_{q:q(\hh y_1y_2y_3(x_2))=\hh x_1x_2x_3}\nq\nq

858:   \mu(q) \;=\;

859: \eeqn

860: \beqn

861:   \;=\; \sum_{x_2x_3}(c(x_2)\!+\!c(x_3)) \!\cdot\!

862:   \mu(\hh y_1\pb{\hh x}_1y_2\pb x_2y_3(x_2)\pb x_3)

863: \eeqn

864: In the first equality we inserted the definition of $\hh Q_2$. In

865: the second equality we split the sum over $q$ by first summing

866: over $q$ with fixed $x_2x_3$. This allows us to pull

867: $C_{23}\!=c(x_2)\!+\!c(x_3)$ out of the inner sum. Then we sum

868: over $x_2x_3$. Further, we have inserted $p$, i.e. replaced $p$

869: by $y_2$ and $y_3(\cdot)$. In the last equality we used

870: (\ref{mufr}). The denominator reduces to

871: \beqn

872:   \sum_{q\in \hh Q_2} \mu(q) \;=\;

873:   \nq\sum_{q:q(\hh y_1)=\hh x_1}\nq \mu(q)

874:   \;=\; \mu(\hh y_1\pb{\hh x}_1).

875: \eeqn

876: For the quotient we get

877: $$

  C^\mu_{23}(p|\hh y_1\hh x_1) \;=\;
  \sum_{x_2x_3}(c(x_2)\!+\!c(x_3))\!\cdot\!
  \mu(\hh y_1\hh x_1
      y_2\pb x_2y_3(x_2)\pb x_3)
$$
We have seen that the relevant behaviour of $p\!\in\!\hh P_2$ in cycles 2 and 3
is completely determined by $y_2$ and the function $y_3(\cdot)$, so
$$
  \max_{p\in\hh P_2}C^\mu_{23}(p|\hh y_1\hh x_1) \;=\;
  \max_{y_2}\max_{y_3(\cdot)}\sum_{x_2x_3}(c(x_2)\!+\!c(x_3))\!\cdot\!
  \mu(\hh y_1\hh x_1y_2\pb x_2y_3(x_2)\pb x_3) \;=\;

889: $$

890: $$

891:   \;=\;

892:   \max_{y_2}\sum_{x_2}\max_{y_3}\sum_{x_3}(c(x_2)\!+\!c(x_3))\!\cdot\!

893:   \mu(\hh y_1\hh x_1y_2\pb x_2y_3\pb x_3)

894: $$

895: In the last equality we have used the fact that the functional

maximization over $y_3(\cdot)$ reduces to a simple maximization

897: over the word $y_3$ when interchanging with the sum over its

898: arguments

899: $(\max_{y_3(\cdot)}\sum_{x_2}\equiv\sum_{x_2}\max_{y_3})$.

900: In the functional case $\hh y_2$ is therefore determined by

901: $$

902:   \hh y_2 \;=\;

903:   \maxarg_{y_2}\sum_{x_2}\max_{y_3}\sum_{x_3}(c(x_2)\!+\!c(x_3))\!\cdot\!

904:   \mu(\hh y_1\hh x_1y_2\pb x_2y_3\pb x_3)

905: $$

906: This is identical to the iterative definition (\ref{ydotrec}) with

907: $k\!=\!2$ and $m_2\!=\!3$ $\qed$.
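
The interchange of $\max_{y_3(\cdot)}$ and $\sum_{x_2}$ used above can
be checked on a small table (a sketch: the values $f(x_2,y_3)$ are
hypothetical and stand for the inner sum over $x_3$). For
$x_2,y_3\!\in\!I\!\!B$ let
$$
  f(0,0)=1,\quad f(0,1)=3,\quad f(1,0)=4,\quad f(1,1)=2 .
$$
A function $y_3(\cdot)$ selects one value of $y_3$ for each $x_2$, so
$$
  \max_{y_3(\cdot)}\sum_{x_2}f(x_2,y_3(x_2)) \;=\;
  \max\{1\!+\!4,\;1\!+\!2,\;3\!+\!4,\;3\!+\!2\} \;=\; 7 \;=\;
  3+4 \;=\; \sum_{x_2}\max_{y_3}f(x_2,y_3),
$$
attained for $y_3(0)\!=\!1$, $y_3(1)\!=\!0$. In general, each term of
the sum depends on $y_3(\cdot)$ only through its value at a single
argument, so the terms can be maximized separately.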

908:

909: %------------------------------%

910: \paragraph{Factorizable $\mu$:}

911: %------------------------------%

912: Up to now we have made no restrictions on the form of the prior

913: probability $\mu$ apart from being a chronological probability

914: distribution. On the other hand, we will see that, in order to

915: prove rigorous credit bounds, the prior probability must satisfy

916: some separability condition to be defined later. Here we introduce

917: some very strong form of separability, when $\mu$ factorizes into

918: products. We start with a

919: factorization into two factors. Let us assume that $\mu$ is of the

920: form

921: \beq\label{fac12}

922:   \mu(y\!\pb x_{1:n}) \;=\;

923:   \mu_1(y\!\pb x_{<l}) \cdot

924:   \mu_2(y\!\pb x_{l:n})

925: \eeq

926: for some fixed $l$ and sufficiently large $n\!\geq\!m_k$.

927: For this $\mu$ the output $\hh y_k$ in cycle

928: $k$ of the AI$\mu$ system (\ref{ydotrec}) for $k\!\geq\!l$ depends on

929: $\hh y\!\hh x_{l:k-1}$ and $\mu_2$ only and

930: is independent of $\hh y\!\hh x_{<l}$

931: and $\mu_1$. This is easily seen when inserting

932: \beq\label{fac11}

933:   \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m_k}) =

934:   \underbrace{\mu_1(\hh y\!\hh x_{<l})}_{\equiv 1}

935:   \cdot

936:   \mu_2(\hh y\!\hh x_{l:k-1}y\!\pb x_{k:m_k})

937: \eeq

938: into (\ref{ydotrec}). For $k\!<\!l$ the output $\hh y_k$ depends

939: on $\hh y\!\hh x_{<k}$ (this is trivial) and $\mu_1$

940: only (trivial if $m_k\!<\!l$) and is independent of $\mu_2$.

941: The non-trivial case, where the horizon $m_k\!\geq\!l$ reaches

942: into the region $\mu_2$, can be proved as follows (we abbreviate

943: $m\!:=\!m_k$ in the following). Inserting (\ref{fac12}) into the

944: definition of $C_{lm}^\best(y\!x_{<l})$ the factor

945: $\mu_1$ is $1$ as in (\ref{fac11}). We abbreviate

946: $C_{lm}^\best\!:=\!C_{lm}^\best(y\!x_{<l})$ as

947: it is independent of its arguments. One can

948: decompose

949: \beq\label{decompE}

950:   C_{km}^\best(y\!x_{<k}) \;=\;

951:   C_{k,l-1}^\best(y\!x_{<k}) \;+\; C_{lm}^\best

952: \eeq

953: For $k\!=\!l$ this is true because the first term on the r.h.s.\ is

954: zero.

955: For $k\!<\!l$ we prove the decomposition by induction from $k\!+\!1$ to $k$.

956: \beqn

957:   C_{km}^\best(y\!x_{<k}) \;=\;

958:   \max_{y_k}\sum_{x_k}

959:   [c(x_k)+C_{k+1,l-1}^\best(y\!x_{1:k})+C_{lm}^\best] \!\cdot\!

960:   \mu_1(y\!x_{<k}y\!\pb x_k) \;=\;

961: \eeqn

962: \beqn

963:   \;=\; \max_{y_k}\bigg[\sum_{x_k}

  (c(x_k)+C_{k+1,l-1}^\best(y\!x_{1:k})) \!\cdot\!

965:   \mu_1(y\!x_{<k}y\!\pb x_k) + C_{lm}^\best\bigg]

966:    \;=\;

967: \eeqn

968: \beqn

969:   \;=\; C_{k,l-1}^\best(y\!x_{<k}) + C_{lm}^\best

970: \eeqn

Inserting (\ref{decompE}), valid for $k\!+\!1$ by induction hypothesis,

972: into (\ref{airec2}) gives the first equality. In the second

973: equality we have performed the $x_k$ sum for the

974: $C_{lm}^\best\!\cdot\!\mu_1$ term which is now independent of $y_k$. It can

975: therefore be pulled out of $\max_{y_k}$. In the last

976: equality we used again the definition (\ref{airec2}). This completes

977: the induction step and proves

978: (\ref{decompE}) for $k\!<\!l$. $\hh y_k$ can now be represented

979: as

980: \beq

981:   \hh y_k \;=\; \maxarg_{y_k}C_{km}^\best

982:   (\hh y\!\hh x_{<k}y_k) \;=\;

983:   \maxarg_{y_k}C_{k,l-1}^\best(\hh y\!\hh x_{<k}y_k)

984: \eeq

where (\ref{pbestrec}), (\ref{decompE}) and the fact that
an additive constant $C_{lm}^\best$ does not change
$\maxarg_{y_k}$ have been used. $C_{k,l-1}^\best(\hh y\!\hh x_{<k}y_k)$, and
hence $\hh y_k$, is independent of $\mu_2$ for $k\!<\!l$. Note
that $\hh y_k$ is also independent of the choice of $m$, as
long as $m\!\geq\!l$.

991:

992: In the general case the cycles are grouped into

independent episodes $r\!=\!0,1,2,...$, where each episode $r$

994: consists of the cycles $k\!=\!n_r\!+\!1,...,n_{r+1}$ for some

995: $0=n_0<n_1<...<n_s=n$:

996: \beq\label{facmu}

997:   \mu(y\!\pb x_{1:n}) \;=\;

998:   \prod_{r=0}^{s-1} \mu_r(y\!\pb x_{n_r+1:n_{r+1}})

999: \eeq

1000: In the simplest case, when all episodes have the

same length $l$, we have $n_r=r\!\cdot\!l$. $\hh y_k$ depends on

1002: $\mu_r$ and $x$ and $y$ of episode $r$ only, with $r$ such

1003: that $n_r\!<\!k\!\leq\!n_{r+1}$.

1004: \beq\label{facydot}

1005:   \hh y_k =

1006:   \maxarg_{y_k}\sum_{x_k}...

1007:   \max_{y_t}\sum_{x_t}

1008:   (c(x_k)\!+...+\!c(x_t)) \!\cdot\!

  \mu_r(\hh y\!\hh x_{n_r+1:k-1}y\!\pb x_{k:t}) \\[-3mm]

1010: \eeq

1011: with $t\!:=\!\min\{m_k,n_{r+1}\}$. The different episodes are

1012: completely independent in the following sense. The inputs $x_k$

1013: of different episodes are statistically independent and

1014: depend only on $y_k$ of the same episode. The outputs $y_k$ depend on the

1015: $x$ and $y$ of the corresponding episode $r$ only, and are

1016: independent of the actual I/O of the other episodes.
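
For concreteness (a sketch with hypothetical values of $l$, $k$ and
$m_k$): if all episodes have length $l\!=\!3$, then $n_r\!=\!3r$ and
cycle $k\!=\!7$ lies in episode $r\!=\!2$, since
$n_2\!=\!6\!<\!7\!\leq\!9\!=\!n_3$. With a horizon reaching to
$m_7\!=\!20$ we get $t\!=\!\min\{20,9\}\!=\!9$, and the history part
$\hh y\!\hh x_{7:6}$ of the episode is still empty, so
(\ref{facydot}) reads
\beqn
  \hh y_7 \;=\;
  \maxarg_{y_7}\sum_{x_7}\max_{y_8}\sum_{x_8}\max_{y_9}\sum_{x_9}
  (c(x_7)\!+\!c(x_8)\!+\!c(x_9)) \!\cdot\!
  \mu_2(y\!\pb x_{7:9}) .
\eeqn
Neither the history of cycles 1 to 6 nor anything beyond cycle 9
enters this expression.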

1017:

1018: If all episodes have a length of at most $l$, i.e.

1019: $n_{r+1}\!-\!n_r\!\leq\!l$ and if we choose the horizon

1020: $h_k$ to be at least $l$, then

1021: $m_k\!\geq\!k\!+\!l\!-\!1\!\geq\!n_r\!+\!l\!\geq\!n_{r+1}$ and

1022: hence $t=n_{r+1}$ independent of $m_k$. This means that for

1023: factorizable $\mu$ there is no problem in taking the limit

1024: $m_k\!\to\!\infty$. Maybe this limit can also be performed in the

1025: more general case of a separable $\mu$. The (problem of the)

1026: choice of $m_k$ will be discussed in more detail later.

1027:

Although factorizable $\mu$ are too restrictive to cover all AI
problems, they often occur in practice in the form of repeated
problem solving, and hence are worth studying. For example, if

1031: the system has to play games like chess repeatedly, or has to

1032: minimize different functions, the different games/functions might

1033: be completely independent, i.e. the environmental probability

1034: factorizes, where each factor corresponds to a game/function

1035: minimization. For details, see the appropriate sections on

1036: strategic games and function minimization.

1037:

Further, for factorizable $\mu$ it is probably easier to derive
suitable credit bounds for the universal AI$\xi$ model defined in
the next section than for the general separable case, which will be
introduced later. One goal of
this paragraph was to show that the notion of a factorizable
$\mu$ could be a first step toward a definition and analysis of
the general case of separable $\mu$.

1046:

1047: %------------------------------%

1048: \paragraph{Constants and Limits:}

1049: %------------------------------%

We have in mind a universal system with complex
interactions that is at least as intelligent and complex as a human
being. One might think of a system whose input $x_k$ comes from a
digital video camera and whose output $y_k$ is some image sent to a
monitor\footnote{Humans can only simulate a screen as
output device by drawing pictures.}. Only for the valuation
might we restrict ourselves to the most primitive binary one, i.e.\ $c_k\!\in I\!\!B$. So we think of the
following constant sizes:

1058: $$

1059: \begin{array}{ccccccccc}

1060:   1 & \ll & \langle l(y_kx_k)\rangle & \ll & k & \leq & T & \ll & |Y\times X| \\

1061:   1 & \ll & 2^{16} & \ll & 2^{24} & \le & 2^{32} & \ll & 2^{65536}

1062: \end{array}

1063: $$

1064: The first two limits say that the actual number $k$ of

1065: inputs/outputs should be reasonably large, compared to the typical

1066: size $\langle l\rangle$ of the input/output words, which itself

1067: should be rather sizeable. The last limit expresses the fact that

1068: the total lifetime $T$ (number of I/O cycles) of the system is far

1069: too small to allow every possible input to occur, or to try every

1070: possible output, or to make use of identically repeated

1071: inputs or outputs. We do not expect any useful outputs for

1072: $k\le\langle l\rangle$. More interesting than the lengths of the

1073: inputs is the complexity $K(x_1...x_k)$ of all inputs until now,

to be defined later. The environment is usually not ``perfect''. The
system could either interact with a non-perfect human or tackle a
non-deterministic world (due to quantum mechanics or
chaos)\footnote{Whether there exist stochastic processes at all is

1078: a difficult question. At least the quantum indeterminacy comes

1079: very close to it.}. In either case, the sequence contains some

1080: noise, leading to $K\sim \langle l\rangle\!\cdot\!k$. The

1081: complexity of the probability distribution of the input sequence

1082: is something different. We assume that this noisy world operates

1083: according to some simple computable, though not finite rules.

1084: $K(\mu_k)\ll \langle l\rangle\!\cdot\!k$, i.e. the rules of the

1085: world can be highly compressed. On the other hand, there may

1086: appear new aspects of the environment for $k\!\to\!\infty$ causing

1087: a non-bounded $K(\mu_k)$.

1088:

1089: In the following we never use these limits, except when explicitly

1090: stated. In some simpler models and examples the size of the

1091: constants will even violate these limits (e.g. $l(x_k)=l(y_k)=1$),

1092: but it is the limits above that the reader should bear in mind. We are

1093: only interested in theorems which do not degenerate under the

1094: above limits.

1095:

1096: %------------------------------%

1097: \paragraph{Sequential decision theory:}

1098: %------------------------------%

1099: In the following we clarify the connection of (\ref{airec2}) and

1100: (\ref{pbestrec}) to sequential decision theory and discuss similarities and

differences. With probability $M^a_{ij}$, the system under
consideration reaches (environmental) state $j\!\in\!S$ when
taking action $a\!\in\!A$ in the current state
$i\!\in\!S$. If the system receives reward $R(i)$ in state $i$,
the optimal policy $p^*$, maximizing expected utility (defined as the
sum of future rewards), and the utility $U(i)$ of policy
$p^*$ are

1108: \beq\label{dt}

1109:   p^*(i)=\maxarg_a\sum_j M^a_{ij}U(j) \quad,\quad

1110:   U(i)=R(i)+\max_a\sum_j M^a_{ij}U(j)

1111: \eeq

1112: See \cite{Rus95} for details and further references. Let us identify

1113: \bqan

1114:   S=(Y\!\times\!X)^*,\quad A=Y,\quad

1115:   a=y_k, \quad M^a_{ij}=\mu(y\!x_{<k}y\!\pb x_k), \\[4pt]

1116:   i=y\!x_{<k}, \quad R(i)=c(x_{k-1}), \quad

1117:   U(i)=C^*_{k-1,m}(y\!x_{<k})=c(x_{k-1})+C^*_{km}(y\!x_{<k}), \\[4pt]

1118:   j=y\!x_{1:k}, \quad R(j)=c(x_k), \quad

1119:   U(j)=C^*_{km}(y\!x_{1:k})=c(x_k)+C^*_{k+1,m}(y\!x_{1:k}),

1120: \eqan

1121: where we further set $M^a_{ij}\!=\!0$ if $i$ is not a starting

1122: substring of $j$ or if $a\!\neq\!y_k$. This ensures the sum over

1123: $j$ in (\ref{dt}) to reduce to a sum over $x_k$. If we set

1124: $m_k\!=\!m$ and use

$C^*_{km}(y\!x_{<k}y_k)\!=\!\sum_{x_k}\mu(y\!x_{<k}y\!\pb x_k)\,C^*_{km}(y\!x_{1:k})$ in

1126: (\ref{pbestrec}), it is easy to see that (\ref{dt}) coincides with

1127: (\ref{airec2}) and (\ref{pbestrec}).
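
The substitution can be spelled out as follows (a sketch that uses
only the identifications listed above). Inserting them into the
second equation of (\ref{dt}) gives
\beqn
  c(x_{k-1})+C^*_{km}(y\!x_{<k}) \;=\; U(i) \;=\;
  R(i)+\max_a\sum_j M^a_{ij}U(j) \;=\;
\eeqn
\beqn
  \;=\; c(x_{k-1})+\max_{y_k}\sum_{x_k}
  [c(x_k)+C^*_{k+1,m}(y\!x_{1:k})] \!\cdot\!
  \mu(y\!x_{<k}y\!\pb x_k) .
\eeqn
Cancelling $c(x_{k-1})$ on both sides leaves the recursion
(\ref{airec2}), and the maximizing action $a\!=\!y_k$ in the first
equation of (\ref{dt}) is the output (\ref{pbestrec}).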

1128:

Note that despite this formal equivalence, we were forced to use
the complete history $y\!x_{<k}$ as environmental state $i$. The
AI$\mu$ model assumes neither stationarity, nor the Markov property,
nor complete accessibility of the environment, as any such assumption

1133: would restrict the applicability of AI$\mu$. The consequence is

1134: that every state occurs at most once in the lifetime of the

1135: system. Every moment in the universe is unique! Even if the state

1136: space could be identified with the input space $X$, inputs would

usually not occur twice by the assumption $k\!\ll\!|X|$ made in the
last subsection. Further, there is no (obvious) universal
similarity relation on $(Y\!\times\!X)^*$ allowing an effective
reduction of the size of the state space. Although many algorithms
(e.g.\ value and policy iteration) have problems in solving
(\ref{dt}) for huge or infinite state spaces in practice,
there is no problem in principle in determining $p^*$ and $U$, as
long as $\mu$ is known and $|X|$, $|Y|$ and $m$ are finite.

1145:

1146: Things dramatically change if $\mu$ is unknown. Reinforcement

1147: learning algorithms \cite{Kae96} are commonly used in this case to

1148: learn the unknown $\mu$. They succeed if the state space is either

small or has effectively been made small by so-called generalization
techniques. In any case, the solutions are either ad hoc, or work
in restricted domains only, or have serious problems with state
space exploration versus exploitation, or have a non-optimal
learning rate. There is no universal and optimal solution to this
problem so far. In the next section we present a new model and
argue that it formally solves all these problems in an optimal
way. It is not concerned with learning $\mu$ directly. All we
do is replace the true prior probability $\mu$ by a universal
probability $\xi$, which is shown to converge to $\mu$ in a sense.

1159:

1160: \newpage

1161: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1162: \section{The Universal AI$\xi$ Model}\label{secAIxi}

1163: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1164:

1165: %------------------------------%

1166: \paragraph{Induction and Algorithmic Information theory:}

1167: %------------------------------%

1168: One very important and highly non-trivial aspect of intelligence is

1169: inductive inference. Before formulating the AI$\xi$ model,

a short introduction to the history of induction is given, culminating
in the sequence prediction theory of Solomonoff. We emphasize

1172: only those aspects which will be of importance for the development

1173: of our universal AI$\xi$ model.

1174:

1175: Simply speaking, induction is the process of

1176: predicting the future from the past or, more precisely, it is the

1177: process of finding rules in (past) data and using these rules to

1178: guess future data. On the one hand, induction seems to happen in

everyday life by finding regularities in past observations and

1180: using them to predict the future. On the other hand, this procedure

1181: seems to add knowledge about the future from past observations.

1182: But how can we know something about the future? This dilemma and

1183: the induction principle in general have a long philosophical

1184: history

1185: %

1186: \begin{itemize}\parskip=0ex\parsep=0ex\itemsep=0ex

1187:   \item Hume's negation of Induction (1711-1776) \cite{Hume},

1188:   \item Epicurus' principle of multiple explanations (342?-270?

1189:   BC),

1190:   \item Occam's razor (simplicity) principle (1290?-1349?),

1191:   \item Bayes' rule for conditional probabilities \cite{Bay63}

1192: \end{itemize}

1193: %

1194: and a short but important mathematical history: a clever

1195: unification of all these aspects into one formal theory of

1196: inductive inference has been done by Solomonoff \cite{Sol64} based

1197: on Kolmogorov's \cite{Kol65} definition of complexity. For an

1198: excellent introduction to Kolmogorov complexity and Solomonoff

1199: induction one should consult the book of Li and Vit\'anyi

1200: \cite{LiVi93}. In the rest of this subsection we state all results

1201: which are needed or generalized later.

1202:

1203: Let us choose some universal prefix Turing machine $U$ with

1204: unidirectional binary input and output tapes and a bidirectional

1205: working tape. We can then define the (prefix) Kolmogorov complexity

1206: \cite{Cha75,Gac74,Kol65,Lev74} as the length of the shortest prefix program $p$ for which $U$

1207: outputs $x\!=\!x_{1:n}$ with $x_i\in\!I\!\!B$:

1208: %

1209: $$

1210:   K(x) \;:=\; \min_p\{l(p): U(p)=x\}

1211: $$

1212: The universal semimeasure $\xi(\pb x)$ is defined as the probability

1213: that the output of the universal Turing machine $U$ starts with

1214: $x$ when provided with fair coin flips on the input tape \cite{Sol64,Sol78}. It is

1215: easy to see that this is equivalent to the formal definition

1216: \beq\label{xidef}

1217:   \xi(\pb x)\;:=\;\sum_{p\;:\;U(p)=x*}\nq 2^{-l(p)}

1218: \eeq

1219: where the sum is over minimal programs $p$ for which $U$

1220: outputs a string starting with $x$. $U$ might be non-terminating.

1221: As the shortest programs dominate the sum, $\xi$ is closely

1222: related to $K(x)$ ($\xi(\pb x)=2^{-K(x)+O(K(l(x)))}$).

1223: $\xi$ has the important universality property \cite{Sol64}, that it

1224: majorizes every computable probability distribution $\rho$ up

1225: to a multiplicative factor

1226: depending only on $\rho$ but not on $x$:

1227: \beq\label{uni}

1228:   \xi(\pb x) \;\stackrel{\times}{\geq}\; 2^{-K(\rho)}\!\cdot\!\rho(\pb x).

1229: \eeq

1230: %

1231: A '$\times$' above an (in)equality denotes (in)equality within a

1232: universal multiplicative constant,

1233: a '$+$' above an (in)equality denotes (in)equality within a

1234: universal additive constant, both depending only on the choice of the

1235: universal reference machine $U$.

1236: $\xi$ itself is {\it not} a probability

1237: distribution\footnote{It is possible to normalize $\xi$ to a

1238: probability distribution as has been done in

1239: \cite{Wil70,Sol78,Hut99} by giving up the enumerability of $\xi$.

1240: Error bounds (\ref{eukdist}) and (\ref{spebound}) hold for both

1241: definitions.}.

1242: We have $\xi(\pb{x0})\!+\!\xi(\pb{x1})\!<\!\xi(\pb

1243: x)$ because there are programs $p$ that output just $x$, followed

1244: by neither $0$ nor $1$. They just stop after printing $x$ or

1245: continue forever without any further output. We will call a

1246: function $\rho\!\geq 0$ with the properties

1247: $\rho(\epsilon)\!\leq\!1$ and $\sum_{x_n}\rho(\pb

1248: x_{1:n})\!\leq\!\rho(\pb x_{<n})$ a {\it semimeasure}. $\xi$ is a

1249: semimeasure and (\ref{uni}) actually holds for all enumerable

1250: semimeasures $\rho$.

1251:
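
The universality property (\ref{uni}) can be restated in terms of code
lengths. The following is only an illustrative rewriting, in which
$c_U\!>\!0$ is our own notation for the multiplicative constant hidden
in $\stackrel{\times}{\geq}$. Taking the negative binary logarithm of
$\xi(\pb x)\geq c_U\!\cdot\!2^{-K(\rho)}\!\cdot\!\rho(\pb x)$ for
$\rho(\pb x)\!>\!0$ gives
\beqn
  -\log_2\xi(\pb x) \;\leq\; -\log_2\rho(\pb x) \;+\; K(\rho) \;+\; \log_2(1/c_U)
\eeqn
i.e.\ coding $x$ with $\xi$ costs at most $K(\rho)+O(1)$ bits more than
coding it with $\rho$, uniformly in $x$.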

1252: (Binary) sequence prediction algorithms try to predict the

1253: continuation $x_n$ of a given sequence $x_1...x_{n-1}$. In the

1254: following we will assume that the sequences are drawn according to

1255: a probability distribution and that the true prior probability of

1256: $x_{1:n}$ is $\mu(\pb{x_1...x_n})$. The probability of $x_n$ given

1257: $x_{<n}$ hence is $\mu(x_{<n}\pb x_n)$. The best possible system

1258: predicts the $x_n$ with the higher probability. Usually $\mu$ is

1259: unknown and the system can only have some belief $\rho$ about the

1260: true prior probability $\mu$. Let SP$\rho$ be a probabilistic

1261: sequence predictor, predicting $x_n$ with probability

1262: $\rho(x_{<n}\pb x_n)$. Further we define a deterministic sequence

1263: predictor SP$\Theta_\rho$ predicting the $x_n$ with higher $\rho$

1264: probability. $\Theta_\rho(x_{<n}\pb x_n)\!:=\!1$ if

1265: $\rho(x_{<n}\pb x_n)\!>\!{1\over 2}$ and $\Theta_\rho(x_{<n}\pb x_n)\!:=\!0$

1266: otherwise.  If $\rho$ is only a semimeasure the SP$\rho$ and

1267: SP$\Theta_\rho$ systems might refuse any output in some cycles

1268: $n$. The SP$\Theta_\mu$ is the best prediction scheme when $\mu$

1269: is known.

1270:
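
A small numerical illustration may help here; the value $0.7$ is a
hypothetical choice of ours, not taken from the text. Suppose
$\mu(x_{<n}\pb 1)\!=\!0.7$ and $\mu(x_{<n}\pb 0)\!=\!0.3$. The
deterministic predictor SP$\Theta_\mu$ always predicts $1$ and errs
with probability $0.3$. The probabilistic predictor SP$\mu$ errs with
probability
\beqn
  \sum_{x_n}\mu(x_{<n}\pb x_n)\,(1\!-\!\mu(x_{<n}\pb x_n))
  \;=\; 0.7\cdot 0.3+0.3\cdot 0.7 \;=\; 0.42 \;>\; 0.3
\eeqn
More generally, a predictor assigning probabilities $\rho_1$ and
$1\!-\!\rho_1$ to the outcomes $1$ and $0$ errs in this cycle with
probability $0.7(1\!-\!\rho_1)+0.3\rho_1=0.7-0.4\rho_1\geq 0.3$, with
equality only for $\rho_1\!=\!1$, which is the choice made by
$\Theta_\mu$.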

1271: If $\rho(x_{<n}\pb x_n)$ converges quickly to $\mu(x_{<n}\pb x_n)$ the

1272: number of additional prediction errors introduced by using

1273: $\Theta_\rho$ instead of $\Theta_\mu$ for prediction should be

1274: small in some sense. Now the universal probability $\xi$

1275: comes into play as it has been proved

1276: by Solomonoff \cite{Sol78} that the $\mu$ expected Euclidean

1277: distance between $\xi$ and $\mu$ is finite

1278: \beq\label{eukdist}

1279:   \sum_{k=1}^\infty\sum_{x_{1:k}}\mu(\pb x_{1:k})

1280:   (\xi(x_{<k}\pb x_k)-\mu(x_{<k}\pb x_k))^2 \;\stackrel{+}{<}\;

1281:   {\1d2}\ln 2\!\cdot\!K(\mu)

1282: \eeq

1283: The '$+$' atop '$<$' means up to additive terms of order 1.

1284: So indeed the difference does tend to zero, i.e.

1285: $\xi(x_{<n}\pb x_n)\toinfty{n}\mu(x_{<n}\pb x_n)$ with $\mu$ probability

1286: $1$ for {\it any} computable probability distribution $\mu$. The reason for the

1287: astonishing property of a single (universal) function to

1288: converge to {\it any} computable probability distribution lies in the fact that the

1289: sets of $\mu$ random sequences differ for different $\mu$.

1290: The universality property (\ref{uni}) is the central ingredient for

1291: proving (\ref{eukdist}).

1292:
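
One way to read (\ref{eukdist}) is as a bound on how often $\xi$ can be
far from $\mu$. As an illustrative consequence (the abbreviations $s_k$
and $B$ are ours), let
$s_k:=\sum_{x_{1:k}}\mu(\pb x_{1:k})(\xi(x_{<k}\pb x_k)-\mu(x_{<k}\pb x_k))^2\geq 0$
be the expected squared deviation in cycle $k$ and let $B$ be any finite
upper bound on $\sum_k s_k$. Then for every $\varepsilon\!>\!0$
\beqn
  \#\{k : s_k\geq\varepsilon\} \;\leq\;
  {1\over\varepsilon}\sum_{k=1}^\infty s_k \;\leq\; {B\over\varepsilon}
\eeqn
so only finitely many cycles can have an expected squared deviation
above any fixed threshold, and $s_k\to 0$ for $k\to\infty$.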

1293: Let us define the total number of expected erroneous predictions

1294: the SP$\rho$ system makes for the first $n$ bits

1295: \beq\label{esp}

1296:   E_{n\rho} \;:=\; \sum_{k=1}^n\sum_{x_{1:k}}\mu(\pb x_{1:k})

1297:   (1\!-\!\rho(x_{<k}\pb x_k))

1298: \eeq

1299: The SP$\Theta_\mu$ system is best in the sense that

1300: $E_{n\Theta_\mu}\!\leq\!E_{n\rho}$

1301: for any $\rho$. In \cite{Hut99} it has been shown that

1302: SP$\Theta_\xi$ is not much worse

1303: \beq\label{spebound}

1304:   E_{n\Theta_\xi}\!-\!E_{n\rho} \;\leq\;

1305:   H+\sqrt{4E_{n\rho}H+H^2} \;=\;

1306:   O(\sqrt{E_{n\rho}})\quad,\quad

1307:   H\;\stackrel{+}{<}\;\ln 2\!\cdot\!K(\mu)

1308: \eeq

1309: with the tightest bound for $\rho\!=\!\Theta_\mu$. For finite

1310: $E_{\infty\Theta_\mu}$, $E_{\infty\Theta_\xi}$ is finite too. For

1311: infinite $E_{\infty\Theta_\mu}$,

1312: $E_{n\Theta_\xi}/E_{n\Theta_\mu}\toinfty{n}1$ with rapid

1313: convergence. One can hardly imagine any better prediction

1314: algorithm without extra knowledge about the environment. In

1315: \cite{Hut00e}, (\ref{eukdist}) and (\ref{spebound}) have been

1316: generalized from a binary to an arbitrary alphabet. Apart from

1317: computational aspects, which are of course very important, the

1318: problem of sequence prediction could be viewed as essentially

1319: solved.

1320:
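
To get a feeling for the size of the bound (\ref{spebound}), consider
the hypothetical values $E_{n\rho}\!=\!100$ and $H\!=\!1$, chosen by us
purely for illustration:
\beqn
  H+\sqrt{4E_{n\rho}H+H^2} \;=\; 1+\sqrt{401} \;\approx\; 21.0
\eeqn
Quadrupling the number of errors to $E_{n\rho}\!=\!400$ gives
$1+\sqrt{1601}\approx 41.0$, i.e.\ the bound on the excess roughly
doubles, in line with the $O(\sqrt{E_{n\rho}})$ scaling.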

1321: %------------------------------%

1322: \paragraph{Definition of the AI$\xi$ Model:}

1323: %------------------------------%

1324: We have developed enough formalism to suggest our universal

1325: AI$\xi$ model\footnote{Speak 'aixi' and write AIXI without Greek letters.}.

1326: All we have to do is to suitably generalize the universal

1327: semimeasure $\xi$ from the last subsection and replace the true

1328: but unknown prior probability $\mu^{AI}$ in the AI$\mu$ model by this

1329: generalized $\xi^{AI}$. In what sense this AI$\xi$ model is universal

1330: will be discussed later.

1331:

1332: In the functional formulation we define the universal probability

1333: $\xi^{AI}$ of an environment $q$ just as $2^{-l(q)}$

1334: \beqn

1335:   \xi(q) \;:=\; 2^{-l(q)}

1336: \eeqn

1337: The definition could not be easier\footnote{It is not necessary

1338: to use $2^{-K(q)}$ or something similar as some readers may expect

1339: at this point. The reason is that for every program $q$ there

1340: exists a functionally equivalent program $q'$ with

1341: $K(q')=l(q')$.}!\footnote{Here and later we identify objects with

1342: their coding relative to some fixed Turing machine $U$. For example, if $q$ is

1343: a function, $K(q):=K(\lceil q\rceil)$ with $\lceil q\rceil$ being a

1344: binary coding of $q$ such that $U(\lceil q\rceil,y):=q(y)$. On the

1345: other hand, if $q$ already is a binary string we define $q(y)\!:=U(q,y)$.}

1346: Collecting the formulas of section \ref{secAIfunc}

1347: and replacing $\mu(q)$ by $\xi(q)$

1348: we get the definition of the AI$\xi$ system in

1349: functional form. Given the history $\hh y\!\hh x_{<k}$ the

1350: functional AI$\xi$ system outputs

1351: \beq\label{eefuncxi}

1352:   \hh y_k \;:=\;

1353:   \maxarg_{y_k}\max_{p:p(\hh x_{<k})=\hh y_{<k}y_k}

1354:   \sum_{q:q(\hh y_{<k})=\hh x_{<k}}

1355:   \nq 2^{-l(q)}\cdot C_{km_k}(p,q)

1356: \eeq

1357: in cycle $k$, where $C_{km_k}(p,q)$ is the total credit of cycles $k$ to $m_k$ when

1358: system $p$ interacts with environment $q$. We have dropped the

1359: denominator $\sum_q\mu(q)$ from (\ref{eefunc}) as it is

1360: independent of the $p\!\in\!\hh P_k$ and a constant multiplicative

1361: factor does not change $\maxarg$.

1362:

1363: For the iterative formulation the universal probability

1364: $\xi$ can be obtained by inserting the functional $\xi(q)$ into

1365: (\ref{mufr})

1366: \beq\label{uniMAI}

1367:   \xi(y\!\pb x_{1:k}) \;=\;

1368:   \nq\sum_{q:q(y_{1:k})=x_{1:k}}\nq 2^{-l(q)}

1369: \eeq

1370: Replacing $\mu$ by $\xi$ in (\ref{ydotrec}) the

1371: iterative AI$\xi$ system outputs

1372: \beq\label{ydotxi}

1373:   \hh y_k \;=\;

1374:   \maxarg_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}\;...\;

1375:   \max_{y_{m_k}}\sum_{x_{m_k}}

1376:   (c(x_k)\!+...+\!c(x_{m_k})) \!\cdot\!

1377:   \xi(\hh y\!\hh x_{<k}y\!\pb x_{k:m_k})

1378: \eeq

1379: in cycle $k$ given the history $\hh y\!\hh x_{<k}$.

1380:
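
As a sketch of how (\ref{ydotxi}) unfolds, consider the two shortest
horizons. Both formulas are direct specializations of (\ref{ydotxi}),
written out only for illustration. For $m_k\!=\!k$ the system is greedy,
\beqn
  \hh y_k \;=\;
  \maxarg_{y_k}\sum_{x_k} c(x_k)\!\cdot\!
  \xi(\hh y\!\hh x_{<k}y\!\pb x_k)
\eeqn
and for $m_k\!=\!k\!+\!1$ one further max-sum pair appears,
\beqn
  \hh y_k \;=\;
  \maxarg_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}
  (c(x_k)\!+\!c(x_{k+1})) \!\cdot\!
  \xi(\hh y\!\hh x_{<k}y\!\pb x_{k:k+1})
\eeqn
Each additional cycle of look-ahead adds one maximization over an
output and one summation over an input.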

1381: One subtlety has been passed over. Like in the

1382: SP case, $\xi$ is not a probability distribution but satisfies only the weaker inequalities

1383: \beq\label{chrf}

1384:   \sum_{x_n}\xi(y\!\pb x_{1:n}) \;\leq\; \xi(y\!\pb x_{<n})

1385:   \quad,\quad

1386:   \xi(\epsilon) \;\leq\; 1

1387: \eeq

1388: Note that the sum on the l.h.s.\ is {\it not}

1389: independent of $y_n$, unlike for chronological probability

1390: distributions. Nevertheless, it is bounded by something (the r.h.s.)

1391: which is independent of $y_n$. The reason is that the sum in

1392: (\ref{uniMAI}) runs over (partial recursive) chronological

1393: functions only and the functions $q$ which satisfy

1394: $q(y_{1:n})=x_{<n}*$ are a subset of the functions satisfying

1395: $q(y_{<n})=x_{<n}$. Therefore we will in general call functions satisfying

1396: (\ref{chrf}) {\it chronological semimeasures}. The important point

1397: is that the conditional probabilities (\ref{bayes2}) are $\leq\!1$

1398: like for true probability distributions.

1399:
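
The last claim can be checked in one line. We assume here that
(\ref{bayes2}) has the usual quotient form
$\xi(y\!x_{<n}y\!\pb x_n)=\xi(y\!\pb x_{1:n})/\xi(y\!\pb x_{<n})$ for
$\xi(y\!\pb x_{<n})\!>\!0$. Dividing the first inequality in
(\ref{chrf}) by $\xi(y\!\pb x_{<n})$ then gives
\beqn
  \sum_{x_n}\xi(y\!x_{<n}y\!\pb x_n) \;=\;
  {\sum_{x_n}\xi(y\!\pb x_{1:n})\over\xi(y\!\pb x_{<n})} \;\leq\; 1
\eeqn
so each individual conditional probability is $\leq\!1$ as well.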

1400: The equivalence of the functional and iterative AI models proven in

1401: section \ref{secAImurec} is true for every chronological

1402: semimeasure $\rho$, esp.\ for $\xi$, hence we can talk about {\it

1403: the} AI$\xi$ model in this respect. It (slightly) depends on the

1404: choice of the universal Turing machine, since $l(q)$ is defined only up to

1405: an additive constant. It also depends on the choice of

1406: $X\!=\!C\!\times\!X'$ and $Y$, but we do not expect any bias when

1407: the spaces are chosen sufficiently simple, e.g. all strings of

1408: length $2^{16}$. Choosing $I\!\!N$ as word space would be optimal,

1409: but whether the maxima (suprema) exist in this case, has to be

1410: shown beforehand. The only non-trivial dependence is on the

1411: horizon function $m_k$ which will be discussed later. So apart

1412: from $m_k$ and unimportant details the AI$\xi$ system is uniquely

1413: defined by (\ref{eefuncxi}) or (\ref{ydotxi}).

1414: It doesn't depend

1415: on assumptions about the environment apart from being generated

1416: from some computable (but unknown!) probability distribution.

1417:

1418: %------------------------------%

1419: \paragraph{Universality of $\xi^{AI}$:}

1420: %------------------------------%

1421: In which sense the AI$\xi$ model is optimal will be clarified

1422: later. In this and the next two subsections we show that $\xi^{AI}$

1423: defined in (\ref{uniMAI}) is universal and converges to $\mu^{AI}$ analogously to the

1424: SP case (\ref{uni}) and (\ref{eukdist}). The proofs are

1425: generalizations from the SP case. The $y$ are pure spectators and

1426: cause no difficulties in the generalization. The replacement of

1427: the binary alphabet $I\!\!B$ used in SP by the (possibly infinite)

1428: alphabet $X$ is possible, but needs to be done with care. In

1429: (\ref{xidef}) $U(p)=x*$ produces strings starting with $x$, whereas

1430: in (\ref{uniMAI}) we can demand $q$ to output exactly $n$ words $x_{1:n}$ as

1431: $q$ knows $n$ from the number of input words $y_1...y_n$.

1432: For proofs of (\ref{uni}) and (\ref{eukdist}) see \cite{Sol78} and

1433: \cite{LiVi92}.

1434:

1435: There is an alternative

1436: definition of $\xi$ which coincides with (\ref{uniMAI}) within a

1437: multiplicative constant of $O(1)$,

1438: \beq\label{xirhodef}

1439:   \xi(y\!\pb x_{1:n}) \;\stackrel{\times}{=}\; \sum_\rho 2^{-K(\rho)}\rho(y\!\pb

1440:   x_{1:n})

1441: \eeq

1442: where the sum runs over all enumerable chronological semimeasures.

1443: The $2^{-K(\rho)}$ weighted sum over probabilistic environments

1444: $\rho$, coincides with the sum over $2^{-l(q)}$ weighted

1445: deterministic environments $q$, as will be proved below.

1446: In the next subsection we show that an enumeration of all

1447: enumerable functions can be converted into an enumeration of

1448: enumerable chronological semimeasures $\rho$. $K(\rho)$ is co-enumerable,

1449: therefore $\xi$ defined in (\ref{xirhodef}) is itself enumerable.

1450: The representation (\ref{uniMAI}) is also enumerable. As

1451: $\sum_\rho2^{-K(\rho)}\!\leq\!1$ and the $\rho$'s satisfy (\ref{chrf}), $\xi$

1452: is a chronological semimeasure as well. If we pick one $\rho$ in

1453: (\ref{xirhodef}) we get the universality property ``for free''

1454: \beq\label{uniaixi}

1455:   \xi(y\!\pb x_{1:n}) \;\stackrel{\times}{\geq}\; 2^{-K(\rho)}\rho(y\!\pb x_{1:n})

1456: \eeq

1457: $\xi$ is a universal element in the sense of (\ref{uniaixi}) in

1458: the set of all enumerable chronological semimeasures.

1459:
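
The step from (\ref{xirhodef}) to (\ref{uniaixi}) merely drops
non-negative terms. Spelled out for a fixed enumerable chronological
semimeasure $\rho'$ (our notation for the element that is picked):
\beqn
  \xi(y\!\pb x_{1:n}) \;\stackrel{\times}{=}\;
  \sum_\rho 2^{-K(\rho)}\rho(y\!\pb x_{1:n}) \;\geq\;
  2^{-K(\rho')}\rho'(y\!\pb x_{1:n})
\eeqn
since every summand is $\geq\!0$.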

1460: To prove universality of $\xi$ in the form (\ref{uniMAI}) we have

1461: to show that for every  enumerable chronological semimeasure

1462: $\rho$ there exists a Turing machine $T$ with

1463: \beq\label{reprho}

1464:   \rho(y\!\pb x_{1:n}) \;=\; \sum_{q:T(qy_{1:n})=x_{1:n}}\nq 2^{-l(q)}

1465:   \quad\mbox{and}\quad l(T)\stackrel{+}{=}K(\rho).

1466: \eeq

1467:

1468: This will not be done here. Given $T$ the universality of

1469: $\xi$

1470: follows from

1471: \beqn

1472:   \xi(y\!\pb x_{1:n}) \;=\;

1473:   \nq\nq\sum_{\quad\quad q:U(qy_{1:n})=x_{1:n}}\nq\nq 2^{-l(q)}

1474:   \;\geq\;

1475:   \nq\nq\sum_{\quad\quad q':U(Tq'y_{1:n})=x_{1:n}}\nq\;\nq\nq 2^{-l(Tq')}

1476:   \;=\;

1477:   2^{-l(T)}\nq\nq\sum_{q':T(q'y_{1:n})=x_{1:n}}\nq\nq 2^{-l(q')}

1478:   \stackrel{\times}{\;=\;}

1479:   2^{-K(\rho)}\rho(y\!\pb x_{1:n})

1480: \eeqn

1481: The first equality and (\ref{uniMAI}) are identical by definition.

1482: In the inequality we have restricted the sum over all $q$ to $q$

1483: of the form $q\!=\!Tq'$. The third relation is true as running $U$

1484: on $Tz$ is a simulation of $T$ on $z$. The last equality follows

1485: from (\ref{reprho}). All enumerable, universal, chronological

1486: semimeasures coincide up to a multiplicative constant, as they

1487: mutually dominate each other. Hence, definitions (\ref{uniMAI}) and

1488: (\ref{xirhodef}) are, indeed, equivalent.

1489:
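
The mutual-dominance argument can be made explicit. As a sketch, let
$\xi_1$ and $\xi_2$ be two enumerable universal chronological
semimeasures and let $c_1,c_2\!>\!0$ be hypothetical constants (our
notation) with $\xi_1\geq c_1\xi_2$ and $\xi_2\geq c_2\xi_1$ for all
arguments. Then
\beqn
  c_1\!\cdot\!\xi_2(y\!\pb x_{1:n}) \;\leq\; \xi_1(y\!\pb x_{1:n})
  \;\leq\; {1\over c_2}\!\cdot\!\xi_2(y\!\pb x_{1:n})
\eeqn
so $\xi_1\stackrel{\times}{=}\xi_2$ with constants independent of
$y\!x_{1:n}$.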

1490: %------------------------------%

1491: \paragraph{Converting general functions into chronological semimeasures:}

1492: %------------------------------%

1493: To complete the proof of the universality (\ref{uniaixi}) of $\xi$

1494: we need to convert enumerable functions

1495: $\psi:I\!\!B^*\!\to\!I\!\!R^+$ into enumerable chronological

1496: semimeasures $\rho:(Y\!\times\!X)^*\!\to\!I\!\!R^+$ with certain

1497: additional properties. Every enumerable function like $\psi$ and

1498: $\rho$ can be approximated from below by definition\footnote{Defining

1499: enumerability as the supremum of total primitive recursive

1500: functions is more suitable for our purpose than the equivalent

1501: definition as a limit of monotone increasing partial

1502: recursive functions. In terms of Turing machines, the recursion

1503: parameter is the time after which a computation is terminated.} by

1504: primitive recursive functions

1505: $\varphi:I\!\!B^*\!\times\!I\!\!N\!\to\!I\!\!\!Q^+$ and

1506: $\phi:(Y\!\times\!X)^*\!\times\!I\!\!N\!\to\!I\!\!\!Q^+$ with

1507: $\psi(s)=\sup_t\varphi(s,t)$ and $\rho(s)=\sup_t\phi(s,t)$ and

1508: recursion parameter $t$. For arguments of the form

1509: $s\!=\!y\!x_{1:n}$ we recursively (in $n$) construct $\phi$ from

1510: $\varphi$ as follows:

1511: \begin{eqnarray}\label{ccsm1}

1512:   \varphi'(y\!x_{1:n},t) &\!:=\!&

1513:   \left\{

1514:   \begin{array}{c@{\quad\mbox{for}\quad}l}

1515:     \varphi(y\!x_{1:n},t) & x_n<t     \\

1516:     0                   & x_n\geq t

1517:   \end{array} \right.

1518:   \quad,\quad \varphi'(\epsilon,t) \;:=\; \varphi(\epsilon,t)

1519: \\ \label{ccsm2}

1520:   \phi(\epsilon,t) &\!:=\!& \max_{0\leq i\leq t}

1521:   \Big\{\varphi'(\epsilon,i):\varphi'(\epsilon,i)\leq 1 \Big\}

1522: \\ \label{ccsm3}

1523:   \phi(y\!\pb x_{1:n},t) &\!:=\!& \max_{0\leq i\leq t}

1524:   \Big\{ \varphi'(y\!x_{1:n},i):{\textstyle\sum_{x_n}}\varphi'(y\!x_{1:n},i)\leq

1525:      \phi(y\!\pb x_{<n},t) \Big\}

1526: \end{eqnarray}

1527: With $x_n\!<\!t$ we mean that the natural number associated with

1528: string $x_n$ is smaller than $t$.

1529: According to (\ref{ccsm1}), $\varphi'$ is primitive recursive because $\varphi$ is, and so is

1530: $\sum_{x_n}\varphi'$, a finite sum since $\varphi'$ vanishes for $x_n\!\geq\!t$. Further, if we

1531: allow $t\!=\!0$ we have $\varphi'(s,0)=0$. This ensures that

1532: $\phi$ is a total function.

1533:
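
A toy illustration of (\ref{ccsm2}), with hypothetical values chosen by
us: suppose $\varphi'(\epsilon,i)$ takes the values $0$, $0.4$, $1.2$,
$0.9$ for $i=0,1,2,3$. Then
\beqn
  \phi(\epsilon,1)=0.4 \quad,\quad
  \phi(\epsilon,2)=\max\{0,\;0.4\}=0.4 \quad,\quad
  \phi(\epsilon,3)=\max\{0,\;0.4,\;0.9\}=0.9
\eeqn
The value $1.2$ is discarded because it violates
$\varphi'(\epsilon,i)\!\leq\!1$. The resulting $\phi(\epsilon,t)$ is
bounded by $1$ and monotone increasing in $t$, even though
$\varphi'(\epsilon,t)$ itself is neither.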

1534: In the following we prove by induction over $n$ that $\phi$ is a

1535: primitive recursive chronological semimeasure

1536: monotone increasing in $t$. All necessary properties hold for

1537: $n\!=\!0$ ($y\!x_{1:0}\!=\!\epsilon$) according to (\ref{ccsm2}).

1538: For general $n$ assume that the induction hypothesis is true for

1539: $\phi(y\!\pb x_{<n},t)$. We can see from (\ref{ccsm3}) that

1540: $\phi(y\!\pb x_{1:n},t)$ is monotone  increasing in $t$. $\phi$ is

1541: total as $\varphi'(y\!x_{1:n},i\!=\!0)\!=\!0$ satisfies the

1542: inequality. By assumption $\phi(y\!\pb x_{<n},t)$ is

1543: primitive recursive, hence with $\sum_{x_n}\varphi'$ also the order relation

1544: $\sum\varphi'\!\leq\!\phi$ is primitive recursive. This ensures

1545: that the non-empty finite set

1546: $\{\varphi'\!:\!\sum\varphi'\!\leq\!\phi\}_i$ and its maximum

1547: $\phi(y\!\pb x_{1:n},t)$ are primitive recursive. Further,

1548: $\phi(y\!\pb x_{1:n},t)\!=\!\varphi'(y\!x_{1:n},i)$ for some $i$ with

1549: $i\!\leq\!t$ independent of $x_n$. Thus,

1550: $\sum_{x_n}\phi(y\!\pb x_{1:n},t)$ $=$ $\sum_{x_n}\varphi'(y\!x_{1:n},i)$

1551: $\leq$ $\phi(y\!\pb x_{<n},t)$ which is the condition for $\phi$ being a

1552: chronological semimeasure. Inductively we have proved that $\phi$ is

1553: indeed a primitive recursive chronological semimeasure

1554: monotone increasing in $t$.

1555:

1556: In the following we show that every (total)\footnote{Semimeasures

1557: are, by definition, total functions.} enumerable chronological

1558: semimeasure $\rho$ can be enumerated by some $\phi$. By definition

1559: of enumerability there exist primitive recursive functions

1560: $\tilde\varphi$ with $\rho(s)\!=\!\sup_t\tilde\varphi(s,t)$. The

1561: function $\varphi(s,t)\!:=\!(1\!-\! ^1\!/_t)\!\cdot\!

1562: \max_{i<t}\tilde\varphi(s,i)$ also enumerates $\rho$ but has

1563: the additional advantage of being strictly monotone increasing in $t$.

1564:
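
For instance, as a hypothetical case chosen only for illustration, let
$\tilde\varphi(s,i)\!=\!{1\over 2}$ for all $i$. Then for $t\geq 1$
\beqn
  \varphi(s,t) \;=\; {1\over 2}\Big(1-{1\over t}\Big)
  \;=\; 0,\;{1\over 4},\;{1\over 3},\;{3\over 8},\;...
  \quad\mbox{for}\quad t=1,2,3,4,...
\eeqn
which is strictly increasing in $t$ with supremum ${1\over 2}$,
although $\tilde\varphi$ itself is constant.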

1565: $\varphi'(y\!x_{1:n},\infty)\!=

1566: \!\varphi(y\!x_{1:n},\infty)\!=\!\rho(y\!x_{1:n})$ by definition

1567: (\ref{ccsm1}). $\phi(\epsilon,t)\!=\!\varphi'(\epsilon,t)$ by

1568: (\ref{ccsm2}) and the fact that

1569: $\varphi'(\epsilon,i\!-\!1)<\varphi'(\epsilon,i)\!\leq\!

1570: \varphi(\epsilon,i)\!\leq\!\rho(\epsilon)\!\leq\!1$, hence

1571: $\phi(\epsilon,\infty)\!=\!\rho(\epsilon)$. $\phi(y\!\pb

1572: x_{1:n},t)\!\leq\!\varphi'(y\!x_{1:n},t)$ by (\ref{ccsm3}), hence

1573: $\phi(y\!\pb x_{1:n},\infty)\!\leq\!\rho(y\!\pb x_{1:n})$. We prove

1574: the opposite direction $\phi(y\!\pb

1575: x_{1:n},\infty)\!\geq\!\rho(y\!x_{1:n})$ by induction over $n$. We

1576: have

1577: \beq\label{upineq}

1578:   \sum_{x_n}\varphi'(y\!x_{1:n},i) \;\leq\;

1579:   \sum_{x_n}\varphi(y\!x_{1:n},i)  \;<\;

1580:   \sum_{x_n}\varphi(y\!x_{1:n},\infty) \;=\;

1581:   \sum_{x_n}\rho(y\!x_{1:n}) \;\leq\; \rho(y\!\pb x_{<n})

1582: \eeq

1583: The strict monotonicity of $\varphi$ and the semimeasure

1584: property of $\rho$ have been used. By the induction hypothesis

1585: $\lim_{t\to\infty}\phi(y\!\pb x_{<n},t)\!\geq\!\rho(y\!\pb x_{<n})$ and

1586: (\ref{upineq}), for sufficiently large $t$ we have

1587: $\phi(y\!\pb x_{<n},t)\!>\!\sum_{x_n}\varphi'(y\!x_{1:n},i)$. The

1588: condition in (\ref{ccsm3}) is, hence, satisfied and therefore

1589: $\phi(y\!\pb x_{1:n},t)\!\geq\!\varphi'(y\!x_{1:n},i)$ for sufficiently

1590: large $t$, especially

1591: $\phi(y\!\pb x_{1:n},\infty)\!\geq\!\varphi'(y\!x_{1:n},i)$ for all $i$.

1592: Taking the limit $i\!\to\!\infty$ we get

1593: $\phi(y\!\pb x_{1:n},\infty)\!\geq\!\varphi'(y\!x_{1:n},\infty)\!=\!\rho(y\!\pb x_{1:n})$.

1594:

1595: Combining all results, we have shown that the constructed

1596: $\phi(\cdot,t)$ are primitive recursive chronological semimeasures

1597: monotone increasing in $t$, which converge to the enumerable

1598: chronological semimeasure $\rho$. This finally proves the

1599: enumerability of the set of enumerable chronological

1600: semimeasures.

1601:

1602: %------------------------------%

1603: \paragraph{Convergence of $\xi^{AI}$ to $\mu^{AI}$:}

1604: %------------------------------%

1605: In \cite{Hut00e} the following inequality is proved

1606: \beq\label{entro2}

1607:   2\sum_{i=1}^{|X|} y_i(y_i\!-\!z_i)^2 \;\leq\!

1608:   \sum_{i=1}^{|X|} y_i\ln{y_i\over z_i} \quad\mbox{with}\quad

1609:   \sum_{i=1}^{|X|} y_i=1, \quad \sum_{i=1}^{|X|} z_i\leq 1

1610: \eeq

1611: If we identify $i\!=\!x_k$ and $y_i\!=\!\mu(y\!x_{<k}y\!\pb x_k)$ and

1612: $z_i\!=\!\xi(y\!x_{<k}y\!\pb x_k)$, multiply both sides by

1613: $\mu(y\!\pb x_{<k})$, take the sum over $x_{<k}$, then the sum

1614: over $k$ and use Bayes' rule $\mu(y\!\pb x_{<k})\!\cdot\!\mu(y\!x_{<k}y\!\pb

1615: x_k)=\mu(y\!\pb x_{1:k})$ we get

1616: \beq\label{eukdistxi}

1617:   2\sum_{k=1}^n\sum_{x_{1:k}}\mu(y\!\pb x_{1:k})

1618:   \Big(\mu(y\!x_{<k}y\!\pb x_k)-\xi(y\!x_{<k}y\!\pb x_k)\Big)^2 \;\leq\;

1619:   \sum_{k=1}^n\sum_{x_{1:k}}\mu(y\!\pb x_{1:k})

1620:   \ln{\mu(y\!x_{<k}y\!\pb x_k)\over\xi(y\!x_{<k}y\!\pb x_k)}

1621:   =\; ...

1622: \eeq

1623: On the r.h.s.\ we can replace $\sum_{x_{1:k}}\mu(y\!\pb

1624: x_{1:k})$ by $\sum_{x_{1:n}}\mu(y\!\pb x_{1:n})$ as the argument

1625: of the logarithm is independent of $x_{k+1:n}$. The $k$ sum can now be

1626: brought into the logarithm and converts to a product. Using Bayes'

1627: rule (\ref{bayes2}) for $\mu$ and $\xi$ we get

1628: \beq\label{eukdistxi2}

1629:   ...\;=\;

1630:   \sum_{x_{1:n}}\mu(y\!\pb x_{1:n})

1631:   \ln\prod_{k=1}^n{\mu(y\!x_{<k}y\!\pb x_k)\over\xi(y\!x_{<k}y\!\pb x_k)}

1632:   \;=\;

1633:   \sum_{x_{1:n}}\mu(y\!\pb x_{1:n})

1634:   \ln{\mu(y\!\pb x_{1:n})\over\xi(y\!\pb x_{1:n})}

1635:   \;\stackrel{+}{<}\; \ln 2\!\cdot\!K(\mu)

1636: \eeq

1637: where we have used the universality property (\ref{uniaixi})

1638: of $\xi$ in the last step. The main complication for generalizing

1639: (\ref{eukdist}) to (\ref{eukdistxi},\ref{eukdistxi2}) was the

1640: generalization of (\ref{entro2}) from $|X|\!=\!2$ to a general

1641: alphabet; the $y$ are, again, pure spectators. This will change when

1642: we analyze error/credit bounds analogous to (\ref{spebound}).

1643:
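
The last step of (\ref{eukdistxi2}) can be spelled out as follows.
This is only a sketch, in which $c\!>\!0$ is our notation for the
multiplicative constant implicit in (\ref{uniaixi}). With
$\rho\!=\!\mu$, universality reads
$\xi(y\!\pb x_{1:n})\geq c\!\cdot\!2^{-K(\mu)}\mu(y\!\pb x_{1:n})$, hence
\beqn
  \ln{\mu(y\!\pb x_{1:n})\over\xi(y\!\pb x_{1:n})} \;\leq\;
  \ln 2\!\cdot\!K(\mu) \;+\; \ln{1\over c}
\eeqn
Averaging with the weights $\mu(y\!\pb x_{1:n})$, which sum to $1$ over
$x_{1:n}$, preserves the bound, and $\ln(1/c)$ is the additive term of
order 1 hidden in $\stackrel{+}{<}$.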

1644: Equations (\ref{eukdistxi},\ref{eukdistxi2}) show that the $\mu$ expected

1645: squared difference of $\mu$ and $\xi$ is finite for computable

1646: $\mu$. This, in turn, shows that $\xi(y\!x_{<k}y\!\pb x_k)$

1647: converges to $\mu(y\!x_{<k}y\!\pb x_k)$ for $k\!\to\!\infty$ with $\mu$

1648: probability 1. If we take a finite product of $\xi$'s and use

1649: Bayes' rule, we see that also $\xi(y\!x_{<k}y\!\pb x_{k:k+r})$

1650: converges to $\mu(y\!x_{<k}y\!\pb x_{k:k+r})$. More generally, in case of

1651: a bounded horizon $h_k$, it follows that

1652: \beq\label{aixitomu}

1653:   \xi(y\!x_{<k}y\!\pb x_{k:m_k}) \toinfty{k} \mu(y\!x_{<k}y\!\pb x_{k:m_k})

1654:   \quad\mbox{if}\quad h_k\equiv m_k\!-\!k\!+\!1 \leq h_{max} < \infty

1655: \eeq

1656: This makes us confident that the outputs $\hh y_k$

1657: of the AI$\xi$ model (\ref{ydotxi}) could converge to the outputs $\hh

1658: y_k$ from the AI$\mu$ model (\ref{ydotrec}), at least for bounded

1659: horizon.

1660:
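
The product argument behind (\ref{aixitomu}) can be illustrated for
$r\!=\!1$. Applying Bayes' rule twice (a sketch; the same identity
holds with $\mu$ in place of $\xi$) gives
\beqn
  \xi(y\!x_{<k}y\!\pb x_{k:k+1}) \;=\;
  \xi(y\!x_{<k}y\!\pb x_k)\cdot\xi(y\!x_{1:k}y\!\pb x_{k+1})
\eeqn
Each of the two factors converges to the corresponding $\mu$ factor, so
the product does too. For a horizon bounded by $h_{max}$ the same holds
for a product of at most $h_{max}$ factors.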

1661: We want to call an AI model {\it universal}, if it is $\mu$

1662: independent (unbiased, model-free) and is able

1663: to solve any solvable problem and learn any learnable task.

1664: Further, we call a universal model {\it universally optimal} if

1665: there is no program, which can solve or learn significantly faster

1666: (in terms of interaction cycles). As the AI$\xi$ model is

1667: parameterless, $\xi$ converges to $\mu$ (\ref{aixitomu}), the

1668: AI$\mu$ model is itself optimal, and we expect no other model to

1669: converge faster to AI$\mu$ by analogy to SP (\ref{spebound}),

1670: \beqn

1671:   \mbox{\it we expect AI$\xi$ to be universally optimal.}

1672: \eeqn

1673: This is our main claim. In a sense, the intention of the remaining

1674: (sub)sections is to define this statement more rigorously and

1675: to give further support.

1676:

1677: %------------------------------%

1678: \paragraph{Intelligence order relation:}

1679: %------------------------------%

1680: We define the $\xi$ expected credit in cycles $k$ to $m$ of a

1681: policy $p$ similar to (\ref{eefunc}) and (\ref{eefuncxi}).

1682: We extend the definition to programs $p\!\not\in\!\hh P_k$ which

1683: are not consistent with the current history.

1684: \beq\label{cxi}

1685:   C^\xi_{km}(p|\hh y\!\hh x_{<k}) \;:=\;

1686:   {1\over\cal N}

1687:   \sum_{q:q(\hh y_{<k})=\hh x_{<k}}

1688:   \nq 2^{-l(q)}\cdot C_{km}(\tilde p,q)

1689: \eeq

1690: The normalization $\cal N$ is again only necessary for

1691: interpreting $C_{km}$ as the expected credit but otherwise

1692: unneeded. For consistent policies $p\!\in\!\hh P_k$ we define

1693: $\tilde p\!:=\!p$. For $p\!\not\in\!\hh P_k$, $\tilde p$ is a

1694: modification of $p$ in such a way that its output is consistent

1695: with the current history $\hh y\!\hh x_{<k}$, hence $\tilde

1696: p\!\in\!\hh P_k$, but unaltered for the current and future cycles

1697: $\geq\!k$. Using this definition of $C_{km}$ we could take the

1698: maximum over all systems $p$ in (\ref{eefuncxi}), rather than only the

1699: consistent ones.

1700:

1701: We call $p$ {\it more or equally intelligent} than $p'$ if

1702: \beq\label{aiorder}

1703:   p\succeq p' \;:\Leftrightarrow

1704:   \forall k\forall\hh y\!\hh x_{<k}:

1705:   C^\xi_{km_k}(p|\hh y\!\hh x_{<k}) \geq

1706:   C^\xi_{km_k}(p'|\hh y\!\hh x_{<k})

1707: \eeq

1708: i.e.\ if $p$ yields in any circumstance at least as high a $\xi$ expected

1709: credit as $p'$. As the algorithm $p^\best$ behind the AI$\xi$

1710: system maximizes $C^\xi_{km_k}$ we have $p^\best\!\succeq\!p$ for all

1711: $p$. The AI$\xi$ model is hence the most intelligent system

1712: w.r.t.\ $\succeq$. $\succeq$ is a universal order relation in the

1713: sense that it is free of any parameters (except $m_k$) or specific

1714: assumptions about the environment. A proof that $\succeq$ is a

1715: reliable intelligence order (which we believe to be true) would

1716: prove that AI$\xi$ is universally optimal. We could further ask:

1717: how useful is $\succeq$ for ordering policies of practical

1718: interest with intermediate intelligence, or how can $\succeq$ help

1719: to guide toward constructing more intelligent systems with

1720: reasonable computation time. An effective intelligence order

1721: relation $\succeq^c$ will be defined in section \ref{secTime},

1722: which is more useful from a practical point of view.

1723:
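
Two elementary properties of (\ref{aiorder}) follow directly from the
corresponding properties of $\geq$ on the real numbers. They are stated
here only as a sketch, and no further properties are claimed. The
relation is reflexive, $p\succeq p$, and transitive,
\beqn
  p\succeq p' \;\wedge\; p'\succeq p'' \quad\ff\quad p\succeq p''
\eeqn
since both hold for $C^\xi_{km_k}(\cdot|\hh y\!\hh x_{<k})$ separately
for every $k$ and every history $\hh y\!\hh x_{<k}$. Nothing in the
definition forces two arbitrary policies to be comparable.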

1724: %------------------------------%

1725: \paragraph{Credit bounds and separability concepts:}

1726: %------------------------------%

1727: The credits $C_{km}$ associated with the AI systems correspond

1728: roughly to the negative error measure $-E_{n\rho}$ of the SP

1729: systems. In SP, we were interested in small bounds for the error

1730: excess $E_{n\Theta_\xi}\!-\!E_{n\rho}$. Unfortunately, simple

1731: credit bounds for AI$\xi$ in terms of $C_{km}$ analogous to the error

1732: bound (\ref{spebound}) do not hold. We even have difficulties in

1733: specifying what we can expect to hold for AI$\xi$ or any AI system

1734: which claims to be universally optimal. Consequently, we cannot

1735: have a proof if we don't know what to prove. In SP, the only

1736: important property of $\mu$ for proving error bounds was its

1737: complexity $K(\mu)$. We will see that in the AI case, there are no

1738: useful bounds in terms of $K(\mu)$ only. We either have to study

1739: restricted problem classes or consider bounds depending on other

1740: properties of $\mu$, rather than on its complexity only. In the

1741: following, we will exhibit the difficulties by two examples and

1742: introduce concepts which may be useful for proving credit bounds.

1743: Despite the difficulties in even claiming useful credit bounds, we

1744: nevertheless firmly believe that the order relation

1745: (\ref{aiorder}) correctly formalizes the intuitive meaning of

1746: intelligence and, hence, that the AI$\xi$ system is universally optimal.

1747:

1748: %------------------------------%

1749: %\paragraph{(Pseudo) passive $\mu$ and the heaven/hell example:}

1750: %------------------------------%

1751: In the following, we choose $m_k\!=\!T$. We want to compare the

1752: true, i.e.\ $\mu$ expected credit $C^\mu_{1T}$ of a $\mu$

1753: independent universal policy $p^{best}$ with any other policy $p$.

1754: Naively, we might expect the existence of a policy $p^{best}$ which

1755: maximizes $C^\mu_{1T}$, apart from additive

1756: corrections of lower order for $T\!\to\!\infty$

1757: \beq\label{cximu}

1758:   C^\mu_{1T}(p^{best}) \;\geq\; C^\mu_{1T}(p) - o(...)

1759:   \quad \forall\mu,p

1760: \eeq

1761: Note that the policy $p^{*\xi}$ of the AI$\xi$ system

1762: maximizes $C^\xi_{1T}$ by definition ($p^{*\xi}\succeq p$). As

1763: $C^\xi_{1T}$ is thought to be a guess of $C^\mu_{1T}$, we might

1764: expect $p^{best}\!=\!p^{*\xi}$ to approximately maximize

1765: $C^\mu_{1T}$, i.e. (\ref{cximu}) to hold. Let us consider the

1766: problem class (set of environments) $\{\mu_0,\mu_1\}$ with

1767: $Y\!=\!C\!=\{0,1\}$ and $c_k\!=\delta_{iy_1}$ in environment

1768: $\mu_i$. The first output $y_1$ decides whether you go to heaven

1769: with all future credits $c_k$ being $1$ (good) or to hell with all

1770: future credits being $0$ (bad). It is clear that if

1771: $\mu_i$, i.e. $i$ is known, the optimal policy $p^{*\mu_i}$

1772: is to output $y_1\!=\!i$ in the first cycle with

1773: $C^\mu_{1T}(p^{*\mu_i})\!=\!T$. On the other hand, any unbiased

1774: policy $p^{best}$ independent of the actual $\mu$ either outputs

1775: $y_1\!=\!1$ or $y_1\!=\!0$. Independent of the actual choice

1776: $y_1$, there is always an environment ($\mu\!=\!\mu_{1-y_1}$)

1777: for which this choice is catastrophic

1778: ($C^\mu_{1T}(p^{best})\!=\!0$). No single system can perform well in both

1779: environments $\mu_0$ {\it and} $\mu_1$. The r.h.s.\ of

1780: (\ref{cximu}) equals $T\!-\!o(T)$ for $p\!=\!p^{*\mu}$. For all

1781: $p^{best}$ there is a $\mu$ for which the l.h.s.\ is zero. We have

1782: shown that no $p^{best}$ can satisfy (\ref{cximu}) for all $\mu$

1783: and $p$, so we cannot expect $p^{*\xi}$ to do so. Nevertheless,

1784: there are problem classes for which (\ref{cximu}) holds, for

1785: instance SP and CF. For SP, (\ref{cximu}) is just a reformulation

1786: of (\ref{spebound}) with an appropriate choice for $p^{best}$

1787: (which differs from $p^{*\xi}$, see next section). We expect

1788: (\ref{cximu}) to hold for all inductive problems in which the

1789: environment is not influenced\footnote{Of course, the credit

1790: feedback $c_k$ depends on the system's output. What we have in mind

1791: is, like in sequence prediction, that the true sequence is not

1792: influenced by the system} by the output of the system. We want to

1793: call these $\mu$, {\it passive} or {\it inductive} environments.

1794: Further, we want to call $\mu$ satisfying (\ref{cximu}) with

1795: $p^{best}\!=\!p^{*\xi}$ {\it pseudo passive}. So we expect

1796: inductive $\mu$ to be pseudo passive.
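
The Heaven\&Hell argument above can be written out explicitly (a worked
restatement only; the symbol $\Delta_T$ is introduced here for
illustration). Let $p$ be any deterministic $\mu$ independent policy and
$y_1$ its first output. Then
$$
  C^{\mu_{y_1}}_{1T}(p)=T, \qquad
  C^{\mu_{1-y_1}}_{1T}(p)=0, \qquad
  C^{\mu_i}_{1T}(p^{*\mu_i})=T \quad\mbox{for}\quad i\in\{0,1\},
$$
so the worst-case loss relative to the informed policy is
$$
  \Delta_T \;:=\; \max_{i\in\{0,1\}}
  \big[C^{\mu_i}_{1T}(p^{*\mu_i})-C^{\mu_i}_{1T}(p)\big] \;=\; T.
$$
Since $\Delta_T/T\!=\!1$ for every $T$, no correction of lower order
than $T$ in (\ref{cximu}) can absorb this loss.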

1797:

1798: %------------------------------%

1799: %\paragraph{The OnlyOne example:}

1800: %------------------------------%

1801: Let us give a further example to demonstrate the difficulties in

1802: establishing credit bounds. Let $C\!=\{0,1\}$ and $|Y|$ be large. We

1803: consider all (deterministic) environments in which a single complex output

1804: $y^*$ is correct ($c\!=\!1$) and all others are wrong ($c\!=\!0$).

1805: The problem class $M$ is defined by

1806: $$

1807:   M:=\{\mu:\mu(y\!x_{<k}y_k\pb 1)=

1808:        \delta_{y_ky^*},\; y^*\!\in\!Y,\; K(y^*)\!=\!_\lfloor\log_2|Y|_\rfloor\}

1809: $$

1810: There are $N\stackrel\times=|Y|$ such $y^*$. The only way a

1811: $\mu$ independent policy $p$ can find the correct $y^*$ is

1812: by trying one $y$ after the other in a certain order. In the first

1813: $N\!-\!1$ cycles, at most $N\!-\!1$ different $y$ are tested. As

1814: there are $N$ different possible $y^*$, there is always a

1815: $\mu\!\in\!M$ for which $p$ gives erroneous outputs in the first

1816: $N\!-\!1$ cycles. The number of errors is $E_{\infty

1817: p}\!\geq\!N\!-\!1\!\stackrel\times=|Y|\stackrel\times=2^{K(y^*)}\stackrel\times=2^{K(\mu)}$

1818: for this $\mu$. As this is true for any $p$, it is also true

1819: for the AI$\xi$ model, hence $E_{k\xi}\!\leq\!2^{K(\mu)}$ is the

1820: best possible error bound we can expect, which depends on $K(\mu)$

1821: only. Actually, we will derive such a bound in section

1822: \ref{secSP} for SP. Unfortunately, as we are mainly interested in

1823: the cycle region $k\ll|Y|\stackrel\times=2^{K(\mu)}$ (see section

1824: \ref{secAImurec}) this bound is trivial.

1825: There are no interesting bounds depending on $K(\mu)$

1826: only, unlike the SP case for deterministic $\mu$. Bounds must

1827: either depend on additional properties of $\mu$ or we have to

1828: consider specialized bounds for restricted problem classes. The

1829: case of probabilistic $\mu$ is similar. Whereas for SP there are

1830: useful bounds in terms of $E_{k\mu}$ and $K(\mu)$, there are no

1831: such bounds for AI$\xi$. Again, this is not a drawback of AI$\xi$

1832: since for no unbiased AI system the errors/credits could be bounded in

1833: terms of $K(\mu)$ and the errors/credits of AI$\mu$ only.
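
The counting argument of the OnlyOne example can be made concrete as
follows (a sketch; the enumeration $y^{(1)},...,y^{(N)}$ and the uniform
average over environments are assumptions made here for illustration).
Suppose a deterministic $\mu$ independent policy tries the outputs in the
order $y^{(1)},...,y^{(N)}$ and keeps the correct one once it has been
found. If $y^*\!=\!y^{(j)}$, it makes exactly $j\!-\!1$ errors, hence
$$
  \max_{1\leq j\leq N}(j-1) \;=\; N-1, \qquad
  {1\over N}\sum_{j=1}^N(j-1) \;=\; {N-1\over 2}.
$$
So not only in the worst case, but also on average over the $N$
environments, the number of errors is of order
$N\stackrel\times=2^{K(\mu)}$.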

1834:

1835: There is a way to make use of gross (e.g. $2^{K(\mu)}$) bounds.

1836: Assume that after a reasonable number of cycles $k$, the

1837: information $\hh x_{<k}$ perceived by the AI$\xi$ system contains

1838: a lot of information about the true environment $\mu$. The

1839: information in $\hh x_{<k}$ might be coded in any form. Let us

1840: assume that the complexity $K(\mu|\hh x_{<k})$ of $\mu$ under the

1841: condition that $\hh x_{<k}$ is known, is of order 1. Consider a

1842: theorem, bounding the sum of credits or of other quantities over

1843: cycles $1...\infty$ in terms of $f(K(\mu))$ for a function $f$

1844: with $f(O(1))\!=\!O(1)$, like $f(n)\!=\!2^n$. Then, there will be

1845: a bound for cycles $k...\infty$ in terms of $f(K(\mu|\hh

1846: x_{<k}))\!=\!O(1)$. Hence, a bound like $2^{K(\mu)}$ can be

1847: replaced by a small bound $2^{K(\mu|\hh x_{<k})}\!=\!O(1)$ after

1848: a reasonable number of cycles. All one has to

1849: show/ensure/assume is that enough information about $\mu$ is

1850: presented (in any form) in the first $k$ cycles. In this way, even

1851: a gross bound could become useful. In section \ref{secEX} we use a

1852: similar argument to prove that AI$\xi$ is able to learn

1853: supervised.

1854:

1855: %------------------------------%

1856: %\paragraph{Asymptotic learnability:}

1857: %------------------------------%

1858: In the following, we weaken (\ref{cximu}) in the hope of getting a

1859: bound applicable to wider problem classes than the passive one.

1860: Consider the I/O sequence $\hh y_1\hh x_1...\hh y_n\hh x_n$ caused

1861: by AI$\xi$. On history $\hh y\!\hh x_{<k}$, AI$\xi$ will output

1862: $\hh y_k\!\equiv\hh y^\xi_k$ in cycle $k$. Let us compare this to

1863: $\hh y^\mu_k$, the output AI$\mu$ would give, still on the same history

1864: $\hh y\!\hh x_{<k}$ produced by AI$\xi$. As AI$\mu$ maximizes the

1865: $\mu$ expected credit, AI$\xi$ causes lower (or at best equal)

1866: $C^\mu_{km_k}$, if $\hh y^\xi_k$ differs from $\hh y^\mu_k$. Let

1867: $D_{n\mu\xi}\!:=\!\langle\sum_{k=1}^n 1\!-\!\delta_{\hh

1868: y^\mu_k,\hh y^\xi_k}\rangle_\mu$ be the $\mu$ expected number of

1869: suboptimal choices of AI$\xi$, i.e. outputs different from AI$\mu$

1870: in the first $n$ cycles. One might weigh the deviating cases by

1871: their severity. Especially when the $\mu$ expected credits

1872: $C^\mu_{km_k}$ for $\hh y^\xi_k$ and $\hh y^\mu_k$ are equal or

1873: close to each other, this should be taken into account in the

1874: definition of $D_{n\mu\xi}$. These details do not matter in the

1875: following qualitative discussion. The important difference to

1876: (\ref{cximu}) is that here we stick to the history produced by

1877: AI$\xi$ and count a wrong decision as, at most, one error. The

1878: wrong decision in the Heaven\&Hell example in the first cycle no

1879: longer counts as losing $T$ credits, but counts as one wrong

1880: decision. In a sense, this is fairer. One shouldn't blame somebody

1881: too much for a single wrong decision made when there was

1882: too little information available to make a correct

1883: decision. The AI$\xi$ model would deserve to be called

1884: asymptotically optimal, if the probability of making a wrong

1885: decision tends to zero, i.e.\ if

1886: \beq\label{Doon}

1887:   D_{n\mu\xi}/n\to 0 \quad\mbox{for}\quad n\to\infty, \quad\mbox{i.e.}\quad

1888:   D_{n\mu\xi} \;=\; o(n).

1889: \eeq

1890: We say that $\mu$ can be {\it asymptotically learned} (by AI$\xi$)

1891: if (\ref{Doon}) is satisfied. We claim that AI$\xi$ (for

1892: $m_k\!\to\!\infty$) can asymptotically learn every problem $\mu$

1893: of relevance, i.e. AI$\xi$ is asymptotically optimal. We included

1894: the qualifier {\it of relevance}, as we are not sure whether there

1895: could be strange $\mu$ spoiling (\ref{Doon}) but we expect those

1896: $\mu$ to be irrelevant from the perspective of AI. In the field of

1897: Learning, there are many asymptotic learnability theorems, often

1898: not too difficult to prove. So a proof of (\ref{Doon}) might also

1899: be accessible. Unfortunately, asymptotic learnability theorems are

1900: often too weak to be useful from a practical point of view. Nevertheless,

1901: they point in the right direction.
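
As an illustration of (\ref{Doon}), consider again the Heaven\&Hell
example (a sketch, assuming that outputs with equal $\mu$ expected credit
are not counted as deviations). After the first cycle all credits are
fixed, so every later output is optimal and only the first decision can
be suboptimal. Hence
$$
  D_{n\mu\xi} \;\leq\; 1, \qquad
  D_{n\mu\xi}/n \;\leq\; 1/n \;\to\; 0 \quad\mbox{for}\quad n\to\infty,
$$
i.e. under this assumption $\mu_0$ and $\mu_1$ can be asymptotically
learned in the sense of (\ref{Doon}), although (\ref{cximu}) fails for them.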

1902:

1903: %------------------------------%

1904: %\paragraph{Uniform $\mu$:}

1905: %------------------------------%

1906: From the convergence (\ref{aixitomu}) of $\xi\!\to\!\mu$ we might

1907: expect $C^\xi_{km_k}\!\to\!C^\mu_{km_k}$ and hence, $\hh y^\xi_k$

1908: defined in (\ref{ydotxi}) to converge to $\hh y^\mu_k$ defined in

1909: (\ref{ydotrec}) with $\mu$ probability 1 for $k\!\to\!\infty$.

1910: The first problem is that if the $C_{km_k}$ for

1911: the different choices of $y_k$ are nearly equal, then even if

1912: $C^\xi_{km_k}\!\approx\!C^\mu_{km_k}$, $\hh y^\xi_k\!\neq\!\hh

1913: y^\mu_k$ is possible due to the non-continuity of $\maxarg_{y_k}$. This

1914: can be cured by a weighted $D_{n\mu\xi}$ as described above. More

1915: serious is the second problem we explain for $h_k\!=\!1$ and

1916: $X\!=\!C\!=\!\{0,1\}$. For $\hh

1917: y^\xi_k\!\equiv\!\maxarg_{y_k}\xi(\hh y\!\hh c_{<k}y_k\pb 1)$ to

1918: converge to $\hh y^\mu_k\!\equiv\!\maxarg_{y_k}\mu(\hh y\!\hh

1919: c_{<k}y_k\pb 1)$, it is not sufficient to know that $\xi(\hh

1920: y\!\hh c_{<k}\hh y\!\hh{\pb c}_k)\!\to\!\mu(\hh y\!\hh c_{<k}\hh

1921: y\!\hh{\pb c}_k)$ as has been proved in (\ref{aixitomu}). We need

1922: convergence not only for the true output $\hh y_k$ and credit $\hh

1923: c_k$, but also for alternate outputs $y_k$ and credit 1.

1924: $\hh y^\xi_k$ converges to $\hh y^\mu_k$

1925: if $\xi$ converges uniformly to $\mu$, i.e. if in addition to

1926: (\ref{aixitomu})

1927: \beq\label{uniform}

1928:   \big|\mu(y\!x_{<k}y'_k\pb x'_k)-\xi(y\!x_{<k}y'_k\pb x'_k)\big|

1929:   \;<\; c\!\cdot\!

1930:   \big|\mu(y\!x_{<k}y\!\pb x_k)-\xi(y\!x_{<k}y\!\pb x_k)\big|

1931:   \quad\forall y'_kx'_k

1932: \eeq

1933: holds for some constant $c$ (at least in some $\mu$ expected sense).

1934: We call $\mu$ satisfying (\ref{uniform}) {\it uniform}. For

1935: uniform $\mu$ one can show (\ref{Doon}) with appropriately weighted

1936: $D_{n\mu\xi}$ and bounded horizon $h_k\!<\!h_{max}$. Unfortunately

1937: there are relevant $\mu$ which are not uniform.

1938: Details will be given elsewhere.

1939:

1940: %------------------------------%

1941: %\paragraph{Other concepts:}

1942: %------------------------------%

1943: In the following, we briefly mention some further

1944: concepts. A {\it Markovian} $\mu$ is defined as depending only on the

1945: last output, i.e. $\mu(y\!x_{<k}y\!\pb x_k)\!=\!\mu_k(y\!\pb x_k)$. We

1946: say $\mu$ is {\it generalized Markovian}, if $\mu(y\!x_{<k}y\!\pb

1947: x_k)\!=\!\mu_k(y\!x_{k-l:k-1}y\!\pb x_k)$ for fixed $l$. This

1948: property has some similarities to {\it factorizable} $\mu$ defined

1949: in (\ref{facmu}). If further $\mu_k\!\equiv\!\mu_1\forall k$,

1950: $\mu$ is called {\it stationary}. Further, for all enumerable

1951: $\mu$, $\mu(y\!x_{<k}y\!\pb x_k)$ and $\xi(y\!x_{<k}y\!\pb x_k)$

1952: become independent of $y\!x_{<l}$ for fixed $l$ and $k\!\to\!\infty$

1953: with $\mu$ probability 1. This property, which we want to call

1954: {\it forgetfulness}, will be proved elsewhere.

1955: Further, we say $\mu$ is {\it farsighted}, if

1956: $\lim_{m_k\to\infty}\hh y_k^{(m_k)}$ exists. More details will be given in

1957: the next subsection, where we also give an example of a

1958: possibly relevant $\mu$, which is not farsighted.

1959:

1960: %------------------------------%

1961: %\paragraph{Concepts:}

1962: %------------------------------%

1963: We have introduced several concepts, which might be useful for

1964: proving credit bounds, including forgetful, relevant, asymptotically

1965: learnable, farsighted, uniform, (generalized) Markovian, factorizable

1966: and (pseudo) passive $\mu$. We have sorted them here, approximately in

1967: the order of decreasing generality. We want to call them {\it

1968: separability concepts}. The more general (like relevant,

1969: asymptotically learnable and farsighted) $\mu$ will be called

1970: weakly separable, the more restrictive (like (pseudo) passive and

1971: factorizable) $\mu$ will be called strongly separable, but we will

1972: use these qualifiers in a more qualitative, rather than rigid

1973: sense. Other (non-separability) concepts are deterministic $\mu$

1974: and, of course, the class of all chronological $\mu$.

1975:

1976: %------------------------------%

1977: \paragraph{The choice of the horizon:}

1978: %------------------------------%

1979: The only significant arbitrariness in the AI$\xi$ model lies in

1980: the choice of the horizon function

1981: $h_k\!\equiv\!m_k\!-\!k\!+\!1$. We discuss some choices which seem

1982: to be natural and give preliminary conclusions at the end.

1983: We will not discuss ad hoc choices of $h_k$ for

1984: specific problems (like the discussion in section \ref{secSG} in

1985: the context of finite games). We are interested in universal

1986: choices of $m_k$.

1987:

1988: If the lifetime of the system is known to be $T$, which is in

1989: practice always large but finite, then the choice $m_k\!=\!T$

1990: correctly maximizes the expected future credit. $T$ is usually not

1991: known in advance, as in many cases the time we are willing to run

1992: a system depends on the quality of its outputs. For this reason,

1993: it is often desirable that good outputs are not delayed too much,

1994: if this results in a marginal credit increase only. This can be

1995: incorporated by damping the future credits. If, for instance, we

1996: assume that the survival of the system in each cycle is

1997: proportional to the past credit an exponential damping

1998: $c_k\!:=\!c'_k\!\cdot\!e^{-\lambda k}$ is appropriate, where

1999: $c'_k$ are bounded, e.g. $c'_k\!\in\![0,1]$. The expression

2000: (\ref{ydotxi}) converges for $m_k\!\to\!\infty$ in this case. But

2001: this does not solve the problem, as we introduced a new arbitrary

2002: time-scale $^1\!/_\lambda$. Every damping introduces a time-scale.
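
The role of the time-scale can be seen from a short calculation (a
sketch under the assumptions stated above, $c'_k\!\in\![0,1]$ and
$\lambda\!>\!0$). The total credit beyond cycle $m$ is bounded by a
geometric series,
$$
  \sum_{i=m+1}^\infty c_i \;\leq\; \sum_{i=m+1}^\infty e^{-\lambda i}
  \;=\; {e^{-\lambda(m+1)}\over 1-e^{-\lambda}}
  \;\to\; 0 \quad\mbox{for}\quad m\to\infty,
$$
so the credit sums have uniformly vanishing tails. Relative to cycle $k$,
the credit of cycle $k\!+\!h$ is damped by the factor $e^{-\lambda h}$,
which drops to $e^{-1}$ at $h\!=\!^1\!/_\lambda$ independently of $k$.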

2003:

2004: Even the time-scale invariant damping factor $k^{-\alpha}$

2005: introduces a dynamic time-scale. In cycle $k$ the contribution of

2006: cycle $2^{1/\alpha}\!\cdot\!k$ is damped by a factor $\1d2$. The

2007: effective horizon $h_k$ in this case is $\sim k$. The choice

2008: $h_k\!=\!\beta\!\cdot\!k$ with $\beta\!\sim\!2^{1/\alpha}$

2009: qualitatively models the same behaviour. We have not introduced an

2010: arbitrary time-scale $T$, but limited the farsightedness to some

2011: multiple (or fraction) of the length of the current history. This

2012: avoids the pre-selection of a global time-scale $T$ or

2013: $^1\!/_\lambda$. This choice has some appeal, as it seems that

2014: humans of age $k$ years usually do not plan their lives for more

2015: than, perhaps, the next $k$ years ($\beta_{human}\!=\!1$). From a

2016: practical point of view this model might serve all needs, but from

2017: a theoretical point of view we feel uncomfortable with such a limitation

2018: in the horizon from the very beginning. Note that we have to

2019: choose $\beta\!=\!O(1)$ because otherwise we would again introduce

2020: a number $\beta$, which has to be justified.
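
The dynamic time-scale of the damping $k^{-\alpha}$ follows from a
one-line calculation (spelled out here for convenience):
$$
  {(2^{1/\alpha}\!\cdot\!k)^{-\alpha}\over k^{-\alpha}}
  \;=\; \big(2^{1/\alpha}\big)^{-\alpha} \;=\; {1\over 2}
  \quad\mbox{for all } k .
$$
The distance $(2^{1/\alpha}\!-\!1)\!\cdot\!k$ at which the damping has
halved thus grows linearly with $k$. For $\alpha\!=\!1$, for instance,
the contribution of cycle $2k$ is half that of cycle $k$.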

2021:

2022: The naive limit $m_k\!\to\!\infty$ in

2023: (\ref{ydotxi}) may turn out to be well defined and the previous discussion

2024: superfluous. In the following, we define a limit which is always

2025: well defined (for finite $|Y|$). Let $\hh y_k^{(m)}$ be defined as

2026: in (\ref{ydotxi}) with $m_k$ replaced by $m$. Further, let $\hh

2027: Y_k^{(m)}\!:=\!\{\,\hh y_k^{(m)}\!:\!m_k\!\geq\!m\}$ be the set of

2028: outputs in cycle $k$ for the choices $m_k\!=\!m,m+1,m+2,...$.

2029: Because $\hh Y_k^{(m)}\!\supseteq\!\hh Y_k^{(m+1)}\!\neq\!\{\}$, we

2030: have $\hh Y_k^{(\infty)}\!:=\!\bigcap_{m=k}^\infty\hh

2031: Y_k^{(m)}\!\neq\!\{\}$. We define the $m_k\!=\!\infty$ model to

2032: output any $\hh y_k^{(\infty)}\!\in\!\hh Y_k^{(\infty)}$. This is

2033: the best output consistent with any choice of $m_k$, esp.

2034: $m_k\!\to\!\infty$. Choosing the lexicographically smallest $\hh

2035: y_k^{(\infty)}\!\in\!\hh Y_k^{(\infty)}$ would correspond to the

2036: limes inferior $\underline\lim_{m\to\infty}\hh y_k^{(m)}$. $\hh

2037: y_k^{(\infty)}$ is unique, i.e. $|\hh Y_k^{(\infty)}|\!=\!1$ iff

2038: the naive limit $\lim_{m\to\infty}\hh y_k^{(m)}$ exists. Note

2039: that the limit $\lim_{m\to\infty}C_{km}^\best(y\!x_{<k})$ need

2040: not exist for this construction.
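
Two hypothetical output sequences illustrate the construction (both are
assumptions chosen for illustration, not derived from (\ref{ydotxi})).
If $\hh y_k^{(m)}\!=\!1$ for all $m\!\geq\!m_0$, then
$\hh Y_k^{(m)}\!=\!\{1\}$ for $m\!\geq\!m_0$ and
$\hh Y_k^{(\infty)}\!=\!\{1\}$, in agreement with the naive limit. If
instead $\hh y_k^{(m)}\!=\!m \bmod 2$, then
$$
  \hh Y_k^{(m)} \;=\; \{0,1\} \quad\forall m, \qquad
  \hh Y_k^{(\infty)} \;=\; \{0,1\},
$$
the naive limit does not exist, and the lexicographically smallest
choice gives $\hh y_k^{(\infty)}\!=\!0$.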

2041:

2042: The construction above leads to a mathematically elegant,

2043: no-parameter AI$\xi$ model. Unfortunately this is not the end of

2044: the story. The limit $m_k\!\to\!\infty$ can cause undesirable

2045: results in the AI$\mu$ model for special $\mu$ which might also happen

2046: in the AI$\xi$ model, however we define $m_k\!\to\!\infty$.

2047: Consider $Y\!=\!C\!=\!\{0,1\}$ and $X'\!=\!\{\}$. Output $y_k\!=\!0$ shall give credit

2048: $c_k\!=\!0$, output $y_k\!=\!1$ shall give $c_k\!=\!1$ iff $\hh

2049: y_{k-l-\sqrt l}...\hh y_{k-l}\!=\!0...0$ for some $l$. I.e. the system can

2050: achieve $l$ consecutive positive credits if there was a sequence

2051: of length at least $\sqrt l$ with $y_k\!=\!c_k\!=\!0$. If the lifetime of the

2052: AI$\mu$ system is $T$, it outputs $\hh y_k\!=\!0$ in the first $r$ cycles

2053: and then $\hh y_k\!=\!1$ for the remaining $r^2$ cycles with

2054: $r$ such that $r+r^2=T$. This will lead to the highest possible

2055: total credit $C_{1T}\!=\!r^2\!=\!T\!-\!\sqrt{T+^1\!\!/_4}+^1\!\!/_2$. Any fragmentation of the

2056: $0$ and $1$ sequences would reduce this. For $T\!\to\!\infty$ the

2057: AI$\mu$ system can and will delay the point $r$ of switching to

2058: $\hh y_k\!=\!1$ indefinitely and always output $0$ with total

2059: credit $0$, obviously the worst possible behaviour. The AI$\xi$

2060: system will explore the above rule after a while of trying

2061: $y_k\!=\!0/1$ and then applies the same behaviour as the AI$\mu$

2062: system, since the simplest rules covering past data dominate $\xi$.

2063: For finite $T$ this is exactly what we want, but for infinite $T$

2064: the AI$\xi$ model fails just as the AI$\mu$ model does. The good point

2065: is that this is not a weakness of the AI$\xi$ model, as AI$\mu$

2066: fails too and no system can be better than AI$\mu$. The bad point

2067: is that $m_k\!\to\!\infty$ has far reaching consequences, even when

2068: starting from an already very large $m_k\!=\!T$. The reason is that

2069: the $\mu$ of this example is highly non-local in time, i.e. it

2070: may violate one of our weak separability conditions.
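
The switching point in this example follows from the quadratic equation
$r^2\!+\!r\!-\!T\!=\!0$ (elementary algebra, spelled out here for
convenience):
$$
  r \;=\; \sqrt{T+^1\!\!/_4}-^1\!\!/_2 \;\approx\; \sqrt T
  \quad\mbox{for large } T .
$$
For instance, $T\!=\!110$ gives $r\!=\!10$, i.e. ten cycles with output
$0$ followed by $r^2\!=\!100$ cycles with output $1$. Since $r$ grows
without bound as $T\!\to\!\infty$, the switch to $\hh y_k\!=\!1$ is
pushed beyond every finite cycle.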

2071:

2072: In the last paragraph we have considered the consequences of

2073: $m_k\!\to\!\infty$ in the AI$\mu$ model. We now consider

2074: whether the AI$\xi$ model is a good approximation of the

2075: AI$\mu$ model for large $m_k$. Another objection against too large

2076: choices of $m_k$ is that $\xi(y\!x_{<k}y\!\pb x_{k:m_k})$ has been proved to be a

2077: good approximation of $\mu(y\!x_{<k}y\!\pb x_{k:m_k})$ only for

2078: $k\!\gg\!h_k$, which is never satisfied for $m_k\!=\!T$ or

2079: $m_k\!=\!\infty$. We have seen that, for factorizable

2080: $\mu$, the limit $h_k\!\to\!\infty$ causes

2081: no problem, as from a certain $h_k$ on the output $\hh y_k$ is

2082: independent of $h_k$. As $\xi\!\to\!\mu$ for bounded $h_k$, $\xi$

2083: will develop this separability property too. So, from a

2084: certain $k_0$ on the limit $h_k\!\to\!\infty$ might also be safe

2085: for $\xi$. Therefore, taking the limit from the very beginning may worsen

2086: the behaviour of AI$\xi$ only for finitely many cycles

2087: $k\!\leq k_0$, which would be acceptable. We suppose that the

2088: valuations $c_{k'}$ for $k'\!\gg\!k$, where $\xi$ can no longer

2089: be trusted as a good approximation to $\mu$, are in some sense

2090: randomly disturbed with decreasing influence on the choice of $\hh

2091: y_k$. This claim is supported by the forgetfulness property of $\xi$.

2092:

2093: We are not sure whether the choice of $m_k$ is of marginal

2094: importance, as long as $m_k$ is chosen sufficiently large and of

2095: low complexity, $m_k=2^{2^{16}}$ for instance, or whether the choice of

2096: $m_k$ will turn out to be a central topic for the AI$\xi$ model or

2097: for the planning aspect of any AI system in general. We suppose

2098: that the limit $m_k\!\to\!\infty$ for the AI$\xi$ model results in

2099: correct behaviour for weakly separable $\mu$, and that even the naive

2100: limit exists, but to prove this would probably give interesting

2101: insights.

2102:

2103: \newpage

2104: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2105: \section{Sequence Prediction (SP)}\label{secSP}

2106: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2107: We have introduced the

2108: AI$\xi$ model as a unification of the ideas of decision theory and

2109: universal probability distribution. We might expect AI$\xi$ to

2110: behave identically to SP$\Theta_\xi$, when faced with a sequence

2111: prediction problem, but things are not that simple, as we will see.

2112:

2113: %------------------------------%

2114: \paragraph{Using the AI$\mu$ Model for Sequence Prediction:}

2115: %------------------------------%

2116: % 9910(15) 9911(7)

2117: We have seen in the last section how to predict sequences for

2118: known and unknown prior distribution $\mu^{SP}$. Here we consider binary

2119: sequences\footnote{We use $z_k$ to avoid notational conflicts with

2120: the system's inputs $x_k$.} $z_1z_2z_3...\in I\!\!B^\infty$ with known prior

2121: probability $\mu^{SP}(\pb{z_1z_2z_3...})$.

2122:

2123: We want to show

2124: how the AI$\mu$ model can be used for sequence prediction.

2125: We will see that it gives the same prediction as the SP$\Theta_\mu$ system.

2126: First, we have to specify {\it how} the AI$\mu$ model should be used

2127: for sequence prediction. The following choice is natural:

2128:

2129: The system's output $y_k$ is interpreted as a prediction for the

2130: $k^{th}$ bit $z_k$ of the string, which has to be predicted. This

2131: means that $y_k$ is binary ($y_k\!\in\!I\!\!B\!=:\!Y$). As a

2132: reaction of the environment, the system receives credit $c_k\!=\!1$

2133: if the prediction was correct ($y_k\!=\!z_k$), or $c_k\!=\!0$ if

2134: the prediction was erroneous ($y_k\!\neq\!z_k$). The question is

2135: what the input $x'_k$ of the next cycle should be. One choice

2136: would be to inform the system about the correct $k^{th}$ bit of

2137: the last cycle of the string and set $x'_k=z_k$. But as from

2138: the credit $c_k$ in conjunction with the prediction $y_k$, the true

2139: bit $z_k=\delta_{y_kc_k}$ can be inferred, this information is

2140: redundant. $\delta$ is the Kronecker symbol, defined as

2141: $\delta_{ab}\!=\!1$ for $a\!=\!b$ and $0$ otherwise. There is no

2142: need for this additional feedback. So we set

2143: $x'_k\!=\!\epsilon\!\in\!X'\!=\!\{\epsilon\}$, thus having $x_k\!\equiv\!c_k$. The

2144: system's performance does not change when we include this

2145: redundant information, it merely complicates the notation. The prior

2146: probability $\mu^{AI}$ of the AI$\mu$ model is

2147: \beq\label{muaisp}

2148:   \mu^{AI}(y_1\pb x_1 ...y_k\pb x_k) \;=\;

2149:   \mu^{AI}(y_1\pb c_1...y_k\pb c_k) \;=\;

2150:   \mu^{SP}(\pb{\delta_{y_1 c_1}...\delta_{y_k c_k}}) \;=\;

2151:   \mu^{SP}(\pb{z_1...z_k})

2152: \eeq

2153: In the following, we will drop the superscripts of $\mu$ because they

2154: are clear from the arguments of $\mu$ and the $\mu$ are equal in any case.

2155:

2156: The formula (\ref{airec2}) for the expected credit reduces to

2157: \beq\label{eerecsp}

2158:   C_{km}^\best(y\!x_{<k}) \;=\;

2159:   \max_{y_k}\sum_{c_k}

2160:   [c_k+C_{k+1,m}^\best(y\!x_{1:k})] \!\cdot\!

2161:   \mu(\delta_{y_1c_1}...\delta_{y_{k-1}c_{k-1}}\pb{\delta_{y_kc_k}})

2162: \eeq

2163: The first observation we can make is that for this special

2164: $\mu$, $C_{km}^\best$ only depends on $\delta_{y_ic_i}$, i.e.

2165: replacing $y_i$ and $c_i$ simultaneously with their complements

2166: does not change the value of $C_{km}^\best$. We have a symmetry in

2167: $y_ic_i$. For $k\!=\!m\!+\!1$ this is definitely true as

2168: $C_{m+1,m}^\best\!=\!0$ in this case (see (\ref{ee0})). For

2169: $k\!\leq\!m$ we prove it by induction. The r.h.s.\ of

2170: (\ref{eerecsp}) is symmetric in $y_ic_i$ for $i\!<\!k$ because

2171: $\mu$ possesses this symmetry and $C_{k+1,m}^\best$ possesses it by induction

2172: hypothesis, so the symmetry holds for the l.h.s., which completes

2173: the proof. The prediction $\hh y_k$ is

2174: \beq\label{ebestysp}

2175:   \hh y_k \;=\; \maxarg_{y_k}

2176:   C_{km_k}^\best(\hh y\!\hh x_{<k}y_k) \;=\;

2177:   \maxarg_{y_k}\sum_{c_k}[c_k+C_{k+1,m_k}^\best(y\!x_{1:k})]

2178:   \!\cdot\!\mu(...\pb{\delta_{y_kc_k}}) \;=\;

2179: \eeq

2180: $$

2181:   \;=\; \maxarg_{y_k}\sum_{c_k}c_k

2182:   \!\cdot\!\mu(\delta_{\hh y_1\hh c_1}...\pb{\delta_{y_kc_k}}) \;=\;

2183:   \maxarg_{y_k}\mu(\hh z_1...\hh z_{k-1}\pb y_k) \;=\;

2184:   \maxarg_{z_k}\mu(\hh z_1...\hh z_{k-1}\pb z_k)

2185: $$

2186: The first equation is the definition of the system's prediction

2187: (\ref{pbestrec}). In the second equation, we have inserted

2188: (\ref{ebesty}) which gives the r.h.s.\ of (\ref{eerecsp}) with

2189: $\max_{y_k}$ replaced by $\maxarg_{y_k}$. $\sum_c

2190: f(...\delta_{yc}...)$ is independent of $y$ for any function,

2191: depending on the combination $\delta_{yc}$ only. Therefore, the

2192: $\sum_cC^\best\mu$ term is independent of $y_k$ because

2193: $C_{k+1,m}^\best$ as well as $\mu$ depend on $\delta_{y_kc_k}$ only. In

2194: the third equation, we can therefore drop this term, as adding a

2195: constant to the argument of $\maxarg_{y_k}$ does not change the

2196: location of the maximum. In the second last equation we evaluated

2197: the $\sum_{c_k}$. Further, if the true credit to $\hh y_i$ is $\hh

2198: c_i$ the true $i^{th}$ bit of the string must be $\hh

2199: z_i\!=\!\delta_{\hh y_i\hh c_i}$. The last equation is just a renaming.
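
The two facts used in this derivation can be checked explicitly for
binary $y$ and $c$ (a verification sketch; $f$ denotes an arbitrary
function as above). Since $\{\delta_{y0},\delta_{y1}\}\!=\!\{0,1\}$ for
either value of $y$,
$$
  \sum_{c\in\{0,1\}}f(\delta_{yc}) \;=\;
  f(\delta_{y0})+f(\delta_{y1}) \;=\; f(0)+f(1)
  \quad\mbox{for } y=0 \mbox{ and } y=1 .
$$
Similarly, because $\delta_{y1}\!=\!y$ for binary $y$, only the term
$c_k\!=\!1$ survives in the credit-weighted sum:
$$
  \sum_{c_k}c_k\!\cdot\!\mu(...\pb{\delta_{y_kc_k}}) \;=\;
  \mu(...\pb{\delta_{y_k1}}) \;=\; \mu(...\pb y_k) .
$$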

2200:

2201: So, the AI$\mu$ model predicts that $z_k$ that has maximal $\mu$

2202: probability, given $\hh z_1...\hh z_{k-1}$. This prediction is

2203: independent of the choice of $m_k$. It is exactly the prediction

2204: scheme of the deterministic sequence prediction with known prior

2205: SP$\Theta_\mu$ described in the last section. As this model was

2206: optimal, AI$\mu$ is optimal, too, i.e. has minimal number of

2207: expected errors (maximal expected credit) as compared to any other

2208: sequence prediction scheme.

2209:

2210: From this, it is already clear that the total expected credit

2211: $C_{km}$ must be related to the expected sequence prediction error

2212: $E_{m\Theta_\mu}$ (\ref{esp}). Let us prove directly that

2213: $C_{1m}(\epsilon)\!+\!E_{m\Theta_\mu\!}=m$.

2214: We rewrite $C_{km}^\best$ in (\ref{eerecsp})

2215: as a function of $z_i$ instead of $y_ic_i$ as it

2216: is symmetric in $y_ic_i$. Further, we can pull $C_{km}^\best$ out of

2217: the maximization, as it is independent of $y_k$ similar to

2218: (\ref{ebestysp}). Renaming the bound variables $y_k$ and $c_k$

2219: we get

2220: \beq\label{ebr2}

2221:   C_{km}^\best(z_{<k}) \;=\;

2222:   \max_{z_k}\mu(z_{<k}\pb z_k) +

2223:   \sum_{z_k}C_{k+1,m}^\best(z_{1:k})

2224:   \!\cdot\!\mu(z_{<k}\pb z_k)

2225: \eeq

2226: Recursively inserting the l.h.s.\ into the r.h.s.\ we get

2227: \beq\label{ebi2}

2228:   C_{km}^\best(z_{<k}) \;=\;

2229:   \sum_{i=k}^m\nq\;\sum_{\quad z_{k:i-1}}\nq\max_{z_i}

2230:   \mu(z_{<k}\pb{z_{k:i}})

2231: \eeq

2232: This is most easily proven by induction. For $k\!=\!m$

2233: we have $C_{mm}^\best(z_{<m})\!=\!\max_{z_m}\mu(z_{<m}\pb

2234: z_m)$ from (\ref{ebr2}) and (\ref{ee0}), which equals (\ref{ebi2}). By induction

2235: hypothesis, we assume that

2236: (\ref{ebi2}) is true for $k\!+\!1$. Inserting this into

2237: (\ref{ebr2}) we get

2238: $$

2239:   C_{km}^\best(z_{<k})

2240:   \;=\;

2241:   \max_{z_k}\mu(z_{<k}\pb z_k) +

2242:   \sum_{z_k}\left[

2243:   \sum_{i=k+1}^m\nq\;\sum_{\quad z_{k+1:i-1}}\max_{z_i}

2244:   \mu(z_{1:k}\pb z_{k+1:i})

2245:   \right]\mu(z_{<k}\pb z_k) \;=\;

2246: $$

2247: $$

2248:   \;=\; \max_{z_k}\mu(z_{<k}\pb z_k) +

2249:   \sum_{i=k+1}^m\nq\;\sum_{\quad z_{k:i-1}}\max_{z_i}

2250:   \mu(z_{<k}\pb z_{k:i})

2251: $$

2252: which equals (\ref{ebi2}). This was the induction step and hence

2253: (\ref{ebi2}) is proven.

2254:

2255: By setting $k\!=\!1$ and slightly reformulating (\ref{ebi2}),

2256: we get the total expected credit in the first $m$ cycles

2257: $$

2258:   C_{1m}^\best(\epsilon) \;=\;

2259:   \sum_{i=1}^m\;\sum_{z_{<i}}\mu(\pb z_{<i})

2260:   \max\{\mu(z_{<i}\pb 0),\mu(z_{<i}\pb 1)\} \;=\;

2261:   m-E_{m\Theta_\mu}

2262: $$

2263: with $E_{m\Theta_\mu}$ defined in (\ref{esp}).
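
As a sanity check, consider a special case that is not taken from the
text above and is assumed here only for illustration: independent bits
with $\mu(z_{<i}\pb 1)\!=\!\theta\!\geq\!^1\!\!/_2$ for all $i$ and all
$z_{<i}$. Then the maximum equals $\theta$ in every term, and since
$\sum_{z_{<i}}\mu(\pb z_{<i})\!=\!1$ for a probability measure $\mu$,
$$
  C_{1m}^\best(\epsilon) \;=\; m\theta, \qquad
  E_{m\Theta_\mu} \;=\; m(1-\theta), \qquad
  C_{1m}^\best(\epsilon)+E_{m\Theta_\mu} \;=\; m .
$$
This matches the direct picture: the scheme always predicts $1$ and errs
with probability $1\!-\!\theta$ in each cycle.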

2264:

2265: %------------------------------%

2266: \paragraph{Using the AI$\xi$ Model for Sequence Prediction:}

2267: %------------------------------%

2268: Now we want to use the universal AI$\xi$ model instead of

2269: AI$\mu$ for sequence prediction and try to derive error bounds

2270: analogous to (\ref{spebound}).

2271: As in the AI$\mu$ case, the system's output $y_k$ in cycle $k$ is

2272: interpreted as a prediction for the $k^{th}$ bit $z_k$ of the

2273: string, which has to be predicted. The credit is

2274: $c_k=\delta_{y_kz_k}$ and there are no other inputs

2275: $x_k=\epsilon$. What makes the analysis more difficult is that $\xi$ is not

2276: symmetric in $y_ic_i\leftrightarrow(1-y_i)(1-c_i)$ and

2277: (\ref{muaisp}) does not hold for $\xi$. On the other hand,

2278: $\xi^{AI}$ converges to $\mu^{AI}$ in the limit (\ref{aixitomu}), and

2279: (\ref{muaisp}) should hold asymptotically for $\xi$ in some sense.

2280: So we expect that everything proven for AI$\mu$ holds

2281: approximately for AI$\xi$. The AI$\xi$ model should behave

2282: similarly to SP$\Theta_\xi$, the deterministic variant of Solomonoff prediction.

2283: Especially we expect error bounds similar to (\ref{spebound}). Making

2284: this rigorous seems difficult. Some general remarks have been made

2285: in the last section.

2286:

2287: Here we concentrate on the special case of a deterministic

2288: computable environment, i.e. the environment is a sequence

2289: $\hh z\!=\!\hh z_1\hh z_2...$, $K(\hh z_1...\hh z_n*)\!\leq\!K(\hh

2290: z)\!<\!\infty$. Furthermore, we only consider the simplest

2291: horizon model $m_k\!=\!k$, i.e. maximize only the next

2292: credit. This is sufficient for sequence prediction, as the credit

2293: of cycle $k$ only depends on output $y_k$ and not on earlier

2294: decisions. This choice is in no way sufficient and satisfactory

2295: for the full AI$\xi$ model, as {\it one} single choice of $m_k$ should

2296: serve for {\it all} AI problem classes. So AI$\xi$ should allow

2297: good sequence prediction for some universal choice of $m_k$ and not

2298: only for $m_k\!=\!k$, which definitely does not suffice for more

2299: complicated AI problems. The analysis of this general case is a challenge for the future.

2300: For $m_k\!=\!k$ the AI$\xi$ model

2301: (\ref{ydotxi}) with $x'_i\!=\!\epsilon$ reduces to

2302: \beq\label{ydotxisp}

2303:   \hh y_k \;=\; \maxarg_{y_k}\sum_{c_k}c_k\!\cdot\!

2304:   \xi(\hh y\!\hh c_{<k}y\!\pb c_k) \;=\;

2305:   \maxarg_{y_k}\xi(\hh y\!\hh c_{<k}y_k\pb 1) \;=\;

2306:   \maxarg_{y_k}\xi(\hh y\!\hh{\pb c}_{<k}y_k\pb 1)

2307: \eeq

2308: The environmental response $\hh c_k$ is given by $\delta_{\hh y_k\hh

2309: z_k}$; it is 1 for a correct prediction $(\hh y_k\!=\!\hh z_k)$ and 0

2310: otherwise. In the following, we want to bound the number of errors

2311: this prediction scheme makes. We need the following inequality

2312: \beq\label{spineq}

2313:   \xi(y\!\pb c_1...y\!\pb c_k) \;>\;

2314:   2^{-K(\delta_{y_1c_1}...\delta_{y_kc_k}*)-O(1)}

2315: \eeq

2316: We have to find a short program in the sum

2317: (\ref{uniMAI}) calculating $c_1...c_k$ from $y_1...y_k$. If we

2318: knew $z_i:=\delta_{y_ic_i}$ for $1\!\leq\!i\!\leq\!k$ a program of

2319: size $O(1)$ could calculate

2320: $c_1...c_k=\delta_{y_1z_1}...\delta_{y_kz_k}$. So combining this program with

2321: a shortest coding of $z_1...z_k$ leads to a program of size

2322: $K(z_1...z_k*)\!+\!O(1)$, which proves (\ref{spineq}).

2323:

2324: Let us now assume that we make a wrong prediction in cycle $k$,

2325: i.e. $\hh c_k\!=\!0$, $\hh y_k\neq \hh z_k$. The goal is to

2326: show that $\hh\xi$ defined by

2327: \beqn

2328:   \hh\xi_k \;:=\; \xi(\hh y\!\pb{\hh c}_{1:k}) \;=\;

2329:   \xi(\hh y\pb{\hh c}_{<k}\hh y_k\pb 0) \;\leq\;

2330:   \xi(\hh y\pb{\hh c}_{<k}) -

2331:   \xi(\hh y\pb{\hh c}_{<k}\hh y_k\pb 1) \;<\;

2332:   \hh\xi_{k-1}-\alpha

2333: \eeqn

2334: decreases for every wrong prediction, at least by some $\alpha$.

2335: The $\leq$ arises from the fact that $\xi$ is only a semimeasure.

2336: \beqn

2337:   \xi(\hh y_1\pb{\hh c}_1...\hh y_k\pb 1) \;>\;

2338:   \xi(\hh y_1\pb{\hh c}_1...(1\!-\!\hh y_k)\pb 1) \;\stackrel{\times}{>}\;

2339:   2^{-K(\delta_{\hh y_1\hh c_1}...\delta_{(1-\hh y_k)1}*)}

2340:   \;=\;

2341: \eeqn

2342: \beqn

2343:   \;=\; 2^{-K(\hh z_1...\hh z_k*)} \;>\;

2344:   2^{-K(\hh z)-O(1)} \;=:\; \alpha

2345: \eeqn

2346: In the first inequality we have used the fact that $\hh y_k$

2347: maximizes by definition (\ref{ydotxisp}) the argument, i.e.

2348: $1\!-\!\hh y_k$ has lower probability than $\hh y_k$. (\ref{spineq}) has been

2349: applied in the second inequality. The equality holds, because

2350: $\hh z_i\!=\!\delta_{\hh y_i\hh c_i}$ and

2351: $\delta_{(1-\hh y_k)1}\!=\!\delta_{\hh y_k0}\!=\!\delta_{\hh y_k\hh

2352: c_k}\!=\!\hh z_k$. The last inequality follows from the

2353: definition of $\hh z$.

2354:

2355: We have shown that each erroneous prediction reduces $\hh\xi$ by at

2356: least the $\alpha$ defined above. Together with $\hh\xi_0\!=\!1$ and

2357: $\hh\xi_k\!>\!0$ for all $k$ this shows that the system can make

2358: at most $1/\alpha$ errors, since otherwise $\hh\xi_k$ would become

2359: negative. So the number of wrong predictions $E_{n\xi}^{AI}$ of system

2360: (\ref{ydotxisp}) is bounded by

2361: \beq\label{Ebndsp}

2362:   E_{n\xi}^{AI} \;<\; {\textstyle{1\over\alpha}} \;=\;

2363:   2^{K(\hh z)+O(1)} \;<\; \infty

2364: \eeq

2365: for a computable deterministic environment string $\hh z_1\hh

2366: z_2...$. The intuitive interpretation is that each wrong

2367: prediction eliminates at least one program $p$ of size

2368: $l(p)\!\stackrel+<\!K(\hh z)$. The size is smaller than $K(\hh z)$, as

2369: larger programs could not mislead the system to a wrong

2370: prediction, since there is a program of size $K(\hh z)$ making a correct

2371: prediction. There are at most $2^{K(\hh z)+O(1)}$ such programs,

2372: which bounds the total number of errors.
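
To make the counting step explicit: $\hh\xi_k$ never increases, since for a correct prediction the semimeasure property gives $\hh\xi_k=\xi(\hh y\pb{\hh c}_{<k}\hh y_k\pb 1)\leq\hh\xi_{k-1}$, and by the above it drops by more than $\alpha$ for every wrong prediction. If $E$ denotes the number of errors made in the first $n$ cycles (a symbol introduced only for this remark), telescoping yields
\beqn
  0 \;<\; \hh\xi_n \;\leq\; \hh\xi_0 - E\!\cdot\!\alpha \;=\; 1 - E\!\cdot\!\alpha
  \quad\ff\quad E \;<\; {\textstyle{1\over\alpha}}
\eeqn
independent of $n$.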

2373:

2374: We have derived a finite bound for $E_{n\xi}^{AI}$, but unfortunately, a

2375: rather weak one as compared to (\ref{spebound}). The reason for the

2376: strong bound in the SP case was that every error at least halves

2377: $\hh\xi$ because the sum of the $\maxarg_{x_k}$ arguments was 1.

2378: Here we have

2379: \bqan

2380:   \xi(\hh y_1\hh c_1...\hh y_{k-1}\hh c_{k-1}0\pb 0) +

2381:   \xi(\hh y_1\hh c_1...\hh y_{k-1}\hh c_{k-1}0\pb 1) = 1 \\

2382:   \xi(\hh y_1\hh c_1...\hh y_{k-1}\hh c_{k-1}1\pb 0) +

2383:   \xi(\hh y_1\hh c_1...\hh y_{k-1}\hh c_{k-1}1\pb 1) = 1

2384: \eqan

2385: but $\maxarg_{y_k}$ runs over the right top and right bottom

2386: $\xi$, for which no sum criterion holds.

2387:

2388: The AI$\xi$ model would not be sufficient for

2389: realistic applications if the bound (\ref{Ebndsp}) were sharp,

2390: but we have the strong feeling (but only weak

2391: arguments) that better bounds proportional to $K(\hh z)$

2392: analogous to (\ref{spebound}) exist. The technique used above may not

2393: be appropriate for achieving this. One argument for a better bound is

2394: the formal similarity between $\maxarg_{z_k}\xi(\hh z_{<k}z_k)$ and (\ref{ydotxisp}),

2395: the other is that we were unable to construct an example sequence

2396: for which (\ref{ydotxisp}) makes more than $O(K(\hh z))$ errors.

2397:

2398: \newpage

2399: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2400: \section{Strategic Games (SG)}\label{secSG}

2401: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2402:

2403: %------------------------------%

2404: \paragraph{Introduction:}

2405: %------------------------------%

2406: A very important class of problems are strategic games, like chess.

2407: In fact, what is subsumed under game theory nowadays is so

2408: general that it includes not only a huge variety of games, from simple

2409: games of chance like roulette, through games combining chance with strategy like

2410: backgammon, up to purely strategic games like chess, checkers or

2411: Go, but also political and economic competitions and

2412: coalitions; even Darwinism and many other phenomena have been modeled within game theory.

2413: It seems that nearly every AI problem could be brought into

2414: the form of a game. Nevertheless,

2415: the intention of a game is that several players perform

2416: some actions with (partially) observable consequences.

2417: The goal of each player is to maximize some utility

2418: function (e.g.\ to win the game). The players are assumed to be

2419: rational, taking into account all information they possess. The

2420: different goals of the players are usually in conflict.

2421: For an introduction to game theory, see \cite{Fud91,Osb94,Rus95,Neu44}.

2422:

2423: If we interpret the AI system as one player and let the environment

2424: model the other rational player {\it and} provide

2425: the reinforcement feedback $c_k$, we see that the system-environment

2426: configuration satisfies all criteria of a game. On the other hand,

2427: we know that the AI system can handle more general situations,

2428: since it interacts optimally with an environment, even if the environment

2429: is not a rational player with conflicting goals.

2430:

2431: %------------------------------%

2432: \paragraph{Strictly competitive strategic games:}

2433: %------------------------------%

2434: In the following, we restrict ourselves to deterministic, strictly

2435: competitive strategic\footnote{In game theory, games like chess

2436: are often called 'extensive', whereas 'strategic' is reserved for a

2437: different kind of game.} games with alternating moves. Player 1

2438: makes move $y_k'$ in round $k$, followed by the move $x_k'$ of player

2439: 2. So a game with $n$ rounds consists of a sequence of alternating

2440: moves $y'_1x'_1y'_2x'_2...y'_nx'_n$. At the end of the game in cycle $n$

2441: the game or final board state is evaluated with

2442: $C(y'_1x'_1...y'_nx'_n)$. Player 1 tries to maximize $C$, whereas player 2

2443: tries to minimize $C$. In the simplest case, $C$ is $1$ if player 1

2444: won the game, $C\!=\!-1$ if player 2 won and $C\!=\!0$ for a draw. We

2445: assume a fixed game length $n$ independent of the actual move

2446: sequence. For games with variable length but maximal possible number of

2447: moves $n$, we could add dummy moves

2448: and pad the length to $n$. The optimal strategy (Nash equilibrium)

2449: of both players is a minimax strategy

2450: \beq\label{sgxdot}

2451:   \hh x'_k=\minarg_{x'_k}\max_{y'_{k+1}}\min_{x'_{k+1}}...\max_{y'_n}\min_{x'_n}

2452:   C(\hh y'_1\hh x'_1...\hh y'_kx'_k...y'_nx'_n)

2453: \eeq

2454: \beq\label{sgydot}

2455:   \hh y'_k=\maxarg_{y'_k}\min_{x'_k}...\max_{y'_n}\min_{x'_n}

2456:   C(\hh y'_1\hh x'_1...\hh y'_{k-1}\hh x'_{k-1}y'_kx'_k...y'_nx'_n)

2457: \eeq

2458: Note, however, that the minimax strategy is only optimal if both players

2459: behave rationally. If, for instance, player 2 has limited capabilities or makes

2460: errors and player 1 is able to discover these (through past moves) he

2461: could exploit these and improve his performance

2462: by deviating from the minimax strategy. At least, the classical

2463: game theory of Nash equilibria does not take into account limited

2464: rationality, whereas the AI$\xi$ system should.
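
As a minimal illustration (a hypothetical toy game, not taken from any reference), let $n\!=\!1$ with moves $y'_1,x'_1\!\in\!\{0,1\}$ and evaluation
\beqn
  C(00)=1, \quad C(01)=-1, \quad C(10)=0, \quad C(11)=1 .
\eeqn
Then $\min_{x'_1}C(0x'_1)=-1$ and $\min_{x'_1}C(1x'_1)=0$, so (\ref{sgydot}) gives $\hh y'_1\!=\!1$, (\ref{sgxdot}) gives $\hh x'_1=\minarg_{x'_1}C(1x'_1)=0$, and the game ends in a draw, $C(10)=0$. If, however, player 1 has learned that player 2 erroneously answers $x'_1\!=\!0$ to every move, then the non-minimax move $y'_1\!=\!0$ wins, since $C(00)=1>C(10)$.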

2465:

2466: %------------------------------%

2467: \paragraph{Using the AI$\mu$ model for game playing:}

2468: %------------------------------%

2469: In the following, we demonstrate the applicability of the AI model

2470: to games. The AI system takes the position of player 1. The

2471: environment provides the evaluation $C$. For a symmetric situation

2472: we could take a second AI system as player 2, but for simplicity we

2473: take the environment as the second player and assume that this

2474: environmental player behaves according to the minimax strategy (\ref{sgxdot}).

2475: The environment serves as a perfect player {\it and} as a teacher, albeit a

2476: very crude one as it tells the system at the end of the game,

2477: only whether it won or lost.

2478:

2479: The minimax behaviour of player 2 can be expressed by a

2480: (deterministic) probability distribution $\mu^{SG}$ as

2481: follows

2482: \beq\label{defmusg}

2483:   \mu^{SG}(y'_1\pb x'_1...y'_n\pb x'_n) \;:=\;

2484:   \left\{

2485:   \begin{array}{l}

2486:     \displaystyle

2487:     1 \quad\mbox{if}\quad

2488:     x'_k=\minarg_{x''_k}...\max_{y''_n}\min_{x''_n}

2489:     C(y'_1...x'_{k-1}y'_kx''_k...y''_nx''_n)

2490:     \;\;\forall\; 1\!\leq\!k\!\leq\!n

2491:     \\

2492:     0 \quad\mbox{otherwise}

2493:   \end{array} \right.

2494: \eeq

2495: The probability that player 2 makes move $x'_k$ is

2496: $\mu^{SG}(\hh y'_1\!\hh x'_1...\hh y'_k\pb x'_k)$ which is 1 for

2497: $x'_k\!=\!\hh x'_k$ as defined in (\ref{sgxdot}) and 0 otherwise.

2498:

2499: Clearly, the AI system receives no feedback, i.e.

2500: $c_1\!=...=\!c_{n-1}\!=\!0$, until the end of the game, where it should

2501: receive positive/negative/neutral feedback on a win/loss/draw, i.e.

2502: $c_n=C(...)$. The environmental prior probability is therefore

2503: \beq\label{muaisg}

2504:   \mu^{AI}(y_1\pb x_1...y_n\pb x_n) \;=\;

2505:   \left\{

2506:   \begin{array}{cl}

2507:     \displaystyle

2508:     \mu^{SG}(y'_1\pb x'_1...y'_n\pb x'_n) & \mbox{if}\quad

2509:     c_1\!=...=\!c_{n-1}\!=\!0 \;\mbox{and}\; c_n=C(y'_1x'_1...y'_nx'_n)

2510:     \\

2511:     0 & \mbox{otherwise}

2512:   \end{array} \right.

2513: \eeq

2514: where $y_i\!=\!y'_i$ and $x_i\!=\!c_ix'_i$.

2515: If the environment is a minimax player (\ref{sgxdot}) plus a crude

2516: teacher $C$, i.e. if $\mu^{AI}$ is the true prior probability, the

2517: question now is, what is the behaviour $\hh y_k^{AI}$ of the AI$\mu$

2518: system. It turns out that if we set $m_k\!=\!n$ the AI$\mu$ system

2519: is also a minimax player (\ref{sgydot}) and hence optimal

2520: \beqn

2521:   \hh y_k^{AI} \;=\;

2522:   \maxarg_{y_k}\sum_{x'_k}...\max_{y_n}\sum_{x'_n}

2523:   C(\hh y\!\hh x'_{<k}y\!x'_{k:n})\!\cdot\!

2524:   \mu^{SG}(\hh y\!\hh x'_{<k}y\!\pb x'_{k:n}) \;=

2525: \eeqn

2526: \beq\label{yaisg2}

2527:   =\; \maxarg_{y_k}\sum_{x'_k}...\max_{y_{n-1}}\sum_{x'_{n-1}}\max_{y_n}\min_{x'_n}

2528:   C(\hh y\!\hh x'_{<k}y\!x'_{k:n})\!\cdot\!

2529:   \mu^{SG}(\hh y\!\hh x'_{<k}y\!\pb x'_{k:n-1}) \;=

2530: \eeq

2531: \beqn

2532:  =\;...\;=\; \maxarg_{y_k}\min_{x'_k}...\max_{y_n}\min_{x'_n}

2533:      C(\hh y\!\hh x'_{<k}y\!x'_{k:n}) \;=\;

2534:      \hh y_k^{SG}

2535: \eeqn

2536: In the first line we inserted $m_k\!=\!n$ and (\ref{muaisg}) into

2537: the definition (\ref{ydotrec}) of $\hh y_k^{AI}$. This removes all

2538: sums over the $c_k$. Further, the sum over $x'_n$ gives only a

2539: contribution for $x'_n\!=\!\minarg_{x'_n}C(\hh y'_1\hh

2540: x'_1...y'_nx'_n)$ by definition (\ref{defmusg}) of $\mu^{SG}$.

2541: Inserting this $x'_n$ gives the second line. $\mu^{SG}$ is

2542: effectively reduced to a lower number of arguments and the sum

2543: over $x'_n$ replaced by $\min_{x'_n}$.  Repeating this procedure

2544: for $x'_{n-1},...,x'_k$ leads to the last line, which is just

2545: the minimax strategy of player 1 defined in (\ref{sgydot}).
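
For a single-round game ($n\!=\!k\!=\!1$) the reduction can be written out in one step, as a sketch of the general argument. By (\ref{defmusg}), $\mu^{SG}(y_1\pb x'_1)$ is 1 for $x'_1=\minarg_{x''_1}C(y_1x''_1)$ and 0 otherwise, so the sum over $x'_1$ picks out exactly the minimizing term:
\beqn
  \hh y_1^{AI} \;=\; \maxarg_{y_1}\sum_{x'_1} C(y_1x'_1)\!\cdot\!\mu^{SG}(y_1\pb x'_1)
  \;=\; \maxarg_{y_1}\min_{x'_1} C(y_1x'_1) \;=\; \hh y_1^{SG}
\eeqn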

2546:

2547: Let us now assume that the game under consideration is played $s$

2548: times. The prior probability then is

2549: \beq\label{sgrep}

2550:   \mu^{AI}(y\!\pb x_1...y\!\pb x_{sn}) \;=\;

2551:   \prod_{r=0}^{s-1} \mu_1^{AI}(y\!\pb x_{rn+1}...

2552:   y\!\pb x_{(r+1)n})

2553: \eeq

2554: where we have renamed the prior probability (\ref{muaisg}) for

2555: one game to $\mu_1^{AI}$. (\ref{sgrep}) is a special case of a

2556: factorizable $\mu$ (\ref{facmu}) with identical factors

2557: $\mu_r\!=\mu_1^{AI}$ for all $r$ and equal episode lengths

2558: $n_{r+1}\!-\!n_r\!=\!n$. The AI$\mu$ system (\ref{sgrep}) for repeated

2559: game playing also implements the minimax strategy,

2560: \beq\label{yaisgrep}

2561:   \hh y_k^{AI} \;=\;

2562:   \maxarg_{y_k}\min_{x'_k}...

2563:      \max_{y_{(r+1)n}}\min_{\;x'_{(r+1)n}}

2564:      C(\hh y\!\hh x'_{rn+1:k-1}...y\!x'_{k:(r+1)n})

2565: \eeq

2566: with $r$ such that $rn\!<\!k\!\leq\!(r\!+\!1)n$ and for any choice of $m_k$

2567: as long as the horizon $h_k\!\geq\!n$. This can be

2568: proved by using (\ref{facydot}) and (\ref{yaisg2}).

2569: See section \ref{secAIxi} for a discussion on separable and

2570: factorizable $\mu$.

2571:

2572: %------------------------------%

2573: \paragraph{Games of variable length:}

2574: %------------------------------%

2575: In the unrepeated case we have argued that games of variable but

2576: bounded length can be padded to a fixed length without effect. We

2577: now analyze in a sequence of games the effect of replacing the games with fixed

2578: length by games of variable length.

2579: The sequence $y'_1x'_1...y'_nx'_n$ can still be grouped into episodes

2580: corresponding to the moves of separated consecutive games, but now

2581: the length and total number of games that fit into the $n$

2582: moves depend on the actual moves taken\footnote{If the sum of

2583: game lengths does not fit exactly into $n$ moves, we pad the last

2584: game appropriately.}. $C(y'_1x'_1...y'_nx'_n)$

2585: equals the number of games where the

2586: system wins, minus the number of games where the environment wins.

2587: Whenever a loss, win or draw has been achieved by the

2588: system or the environment, a new game starts. The player whose turn it would be

2589: next begins the next game. The games are still separated in

2590: the sense that the behaviour and credit of the current game do

2591: not influence the next game. On the other hand, they are

2592: slightly entangled, because the length of the current

2593: game determines the time of start of the next. As the rules of the

2594: game are time invariant, this does not influence the next game

2595: directly. If we play a fixed number of games, the games are

2596: completely independent, but if we play a fixed number of total moves

2597: $n$, the number of games depends on their lengths. This has the

2598: following consequences: the better player tries to keep the games

2599: short, to win more games in the given time $n$. The poorer player

2600: tries to draw the games out, in order to lose fewer games. The better

2601: player might further prefer a quick draw, rather than to win a long game.

2602: Formally, this entanglement is represented by the fact that the

2603: prior probability $\mu$ no longer factorizes. The reduced

2604: form (\ref{yaisgrep}) of $\hh y_k^{AI}$ to one episode is no

2605: longer valid. Also, the behaviour $\hh y_k^{AI}$ of the system

2606: depends on $m_k$, even if the horizon $h_k$ is

2607: chosen larger than the longest possible game (unless $m_k\!\geq\!n$).

2608: The important point is that the system realizes that

2609: keeping games short/long can lead to increased credit. In

2610: practice, a horizon much larger than the average game length

2611: should be sufficient to incorporate this effect. The details of

2612: games in the distant future do not affect the current game and can,

2613: therefore, be ignored. A more quantitative analysis could be interesting, but

2614: would lead us too far astray.

2615:

2616: %------------------------------%

2617: \paragraph{Using the AI$\xi$ model for game playing:}

2618: %------------------------------%

2619: When going from the specific AI$\mu$ model, where the rules of the

2620: game have been explicitly modeled into the prior probability

2621: $\mu^{AI}$, to the universal model AI$\xi$ we have to ask whether

2622: these rules can be learned from the assigned credits $c_k$. Here,

2623: another (actually the main) reason for studying the case of

2624: repeated games, rather than just one game arises. For a single game

2625: there is only one cycle of non-trivial feedback namely the end of

2626: the game - too late to be useful except when there are further

2627: games following.

2628:

2629: Even in the case of repeated games, there is only very limited

2630: feedback, at most $\log_2 3$ bits of information per game if the 3

2631: outcomes win/loss/draw have the same frequency. So there are at

2632: least $O(K(game))$ games necessary to learn a game of

2633: complexity $K(game)$. Apart from extremely simple games, even this

2634: estimate is far too optimistic. As the AI$\xi$ system has no

2635: information about the game to begin with, its moves will be more

2636: or less random and it can win the first few games merely by pure luck.

2637: So the probability that the system loses is close to one and

2638: hence the information content $I$ in the feedback $c_k$ at the end

2639: of the game is much less than $\log_2 3$. This situation remains

2640: for a very large number of games. On the other hand, in principle,

2641: every game should be learnable after a very long sequence of games

2642: even with this minimal feedback only, as long as $I\not\equiv 0$.
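
To get a feeling for the size of $I$, take purely illustrative outcome probabilities (assumed here, not derived from any particular game) $p_{loss}\!=\!0.99$ and $p_{win}\!=\!p_{draw}\!=\!0.005$. The entropy of the feedback is then
\beqn
  I \;=\; -p_{loss}\log_2 p_{loss} - p_{win}\log_2 p_{win} - p_{draw}\log_2 p_{draw}
  \;\approx\; 0.014+0.038+0.038 \;\approx\; 0.09
\eeqn
bits per game, compared with the maximum $\log_2 3\approx 1.585$ bits for equiprobable outcomes.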

2643:

2644: The important point is that no other learning scheme with no extra

2645: information can learn the game more quickly. We expect this to be

2646: true as $\mu^{AI}$ factorizes in the case of games of fixed

2647: length, i.e. $\mu^{AI}$ satisfies a strong separability condition.

2648: In the case of variable game length the entanglement is also low.

2649: $\mu^{AI}$ should still be sufficiently separable allowing

2650: to formulate and prove good credit bounds for AI$\xi$.

2651:

2652: To learn realistic games like tic-tac-toe (noughts and crosses) in

2653: realistic time one has to provide more feedback. This could be

2654: achieved by intermediate help during the game. The environment

2655: could give positive (negative) feedback for every good (bad) move

2656: the system makes. The criterion for whether a move is to be valued as

2657: good should be adapted to the gained experience of the system in

2658: such a way that approximately half of the moves are valued as

2659: good and the other half as bad, in order to maximize the

2660: information content of the feedback.

2661:

2662: For more complicated games like chess, even more feedback is

2663: necessary from a practical point of view. One way to increase the

2664: feedback far beyond a few bits per cycle is to train the system by

2665: teaching it good moves. This is called supervised learning.

2666: Despite the fact that the AI model has only a credit feedback

2667: $c_k$, it is able to learn by teaching, as will be shown in section

2668: \ref{secEX}. Another way would be to start with simpler games

2669: containing certain aspects of the true game and to switch to the true

2670: game when the system has learned the simple game.

2671:

2672: No other difficulties are expected when going from

2673: $\mu$ to $\xi$. Eventually $\xi^{AI}$ will converge to the

2674: minimax strategy $\mu^{AI}$. In the more realistic case, where the

2675: environment is not a perfect minimax player, AI$\xi$ can

2676: detect and exploit the weakness of the opponent.

2677:

2678: Finally, we want to comment on the input/output space $X$/$Y$ of

2679: the AI system. In practical applications, $Y$ will possibly include

2680: also illegal moves. If $Y$ is the set of moves of e.g. a robotic

2681: arm, the system could move a wrong figure or even knock over the

2682: figures. A simple way to handle illegal moves $y_k$ is by

2683: interpreting them as losing moves, which terminate the game.

2684: Further, if e.g. the input $x_k$ is the image of a video camera

2685: which makes one shot per move, $X$ is not the set of moves by the

2686: environment but includes the set of states of the game board. The

2687: discussion in this section handles this case as well. There is no

2688: need to explicitly design the system's I/O space $X/Y$ for a

2689: specific game.

2690:

2691: The discussion above on the AI$\xi$ system was rather informal for

2692: the following reason: game playing (the SG$\xi$ system) has

2693: (nearly) the same complexity as fully general AI, and quantitative

2694: results for the AI$\xi$ system are difficult (but not impossible)

2695: to obtain.

2696:

2697: \newpage

2698: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2699: \section{Function Minimization (FM)}\label{secFM}

2700: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2701:

2702: %------------------------------%

2703: \paragraph{Applications/Examples:}

2704: %------------------------------%

2705: There are many problems that can be reduced to a minimization

2706: problem (FM). The minimum of a (real valued) function

2707: $f\!:\!Y\!\to\!I\!\!R$ over some domain $Y$ or a good approximation

2708: of it has to be found, usually with some limited resources.

2709:

2710: One popular example is the traveling salesman problem (TSP). $Y$

2711: is the set of different routes between towns and $f(y)$ the length

2712: of route $y\!\in\!Y$. The task is to find a route of minimal

2713: length visiting all cities. This problem is NP-hard. Getting good

2714: approximations in limited time is of great importance in various

2715: applications. %%%%%%

2716: Another example is the minimization of production costs (MPC),

2717: e.g.\ of a car, under several constraints. $Y$ is the set of all

2718: alternative car designs and production methods compatible with the

2719: specifications and $f(y)$ the overall cost of alternative

2720: $y\!\in\!Y$. %%%%%%

2721: A related example is finding materials or (bio)molecules with

2722: certain properties (MAT). E.g. solids with minimal electrical

2723: resistance or maximally efficient chlorophyll modifications or

2724: aromatic molecules that taste as close as possible to strawberry.%%%%%%

2725: We can also ask for nice paintings (NPT). $Y$ is the set of all

2726: existing or imaginable paintings and $f(y)$ characterizes how much

2727: person $A$ likes painting $y$. The system should present

2728: paintings, which $A$ likes.

2729:

2730: For now, these are enough examples. The TSP is very rigorous from a

2731: mathematical point of view, as $f$, i.e. an algorithm of $f$, is

2732: usually known. In principle, the minimum could be found by

2733: extensive search, were it not for computational resource

2734: limitations. For MPC, $f$ can often be modeled in a reliable and

2735: sufficiently accurate way. For MAT you need very accurate physical

2736: models, which might be unavailable or too difficult to solve or

2737: implement. For NPT the most we have is the judgement of person $A$ on

2738: every presented painting. The evaluation function $f$ cannot be

2739: implemented without scanning $A$'s brain, which is not possible with

2740: today's technology.

2741:

2742: So there are different limitations, some depending on the

2743: application we have in mind. An implementation of $f$ might not be

2744: available, $f$ can only be tested at some arguments $y$ and $f(y)$

2745: is determined by the environment. We want to (approximately)

2746: minimize $f$ with as few function calls as possible or, conversely,

2747: find as close an approximation as possible to the

2748: minimum within a fixed number of function evaluations. If $f$ is

2749: available or can quickly be inferred by the system and evaluation

2750: is quick, it is more important to minimize the total time needed to

2751: imagine new trial minimum candidates plus the evaluation time for

2752: $f$. As we do not consider computational aspects of AI$\xi$ until

2753: section \ref{secTime} we concentrate on the first

2754: case, where $f$ is not available or dominates the computational

2755: requirements.

2756:

2757: %------------------------------%

2758: \paragraph{The Greedy Model FMG$\mu$:}

2759: %------------------------------%

2760: The FM model consists of a sequence $\hh y_1\hh z_1\hh y_2\hh

2761: z_2...$ where $\hh y_k$ is a trial of the FM system for a minimum

2762: of $f$ and $\hh z_k=f(\hh y_k)$ is the true function value

2763: returned by the environment. We randomize the model by assuming a

2764: probability distribution $\mu(f)$ over the functions. There are

2765: several reasons for doing this. We might really not know the exact

2766: function $f$, as in the NPT example, and model our uncertainty by

2767: the probability distribution $\mu$. More importantly, we want to

2768: parallel the other AI classes, like in the SP$\mu$ model, where we

2769: always started with a probability distribution $\mu$ that was finally

2770: replaced by $\xi$ to get the universal Solomonoff prediction

2771: SP$\xi$. We want to do the same thing here. Further, the probabilistic

2772: case includes the deterministic case by choosing

2773: $\mu(f)\!=\!\delta_{ff_0}$, where $f_0$ is the true function. A

2774: final reason is that the deterministic case is trivial when $\mu$

2775: and hence $f_0$ is known, as the system can internally (virtually)

2776: check all function arguments and output the correct minimum from the very

2777: beginning.

2778:

2779: We will assume that $Y$ is countable or finite and that $\mu$ is a

2780: discrete measure, e.g. by taking only computable functions. The

2781: probability that the function values of $y_1,...,y_n$ are

2782: $z_1,...,z_n$ is then given by

2783: \beq\label{fmmudef}

2784:   \mu^{FM}(y_1\pb z_1...y_n\pb z_n) \;:=\;

2785:   \sum_{f:f(y_i)=z_i\;\forall 1\leq i\leq n} \nq\mu(f)

2786: \eeq

2787: We start with a model that minimizes the expectation

2788: $z_k$ of the function value $f$ for the next output

2789: $y_k$, taking into account previous information:

2790: \beqn

2791:   \hh y_k \;:=\; \minarg_{y_k}\sum_{z_k} z_k\!\cdot\!

2792:   \mu(\hh y_1\hh z_1...\hh y_{k-1}\hh z_{k-1}y_k\pb z_k)

2793: \eeqn

2794: This type of greedy algorithm, just minimizing the next

2795: feedback, was sufficient for sequence prediction (SP) and is also

2796: sufficient for classification (CF). It is, however, not sufficient for

2797: function minimization as the following example demonstrates.

2798:

2799: Take $f:\{0,1\}\!\to\!\{1,2,3,4\}$. There are 16 different

2800: functions which shall be equiprobable, $\mu(f)\!=\!{1\over 16}$.

2801: The function expectation in the first cycle

2802: \beqn

2803:   \langle z_1\rangle \;:=\; \sum_{z_1} z_1\!\cdot\!\mu(y_1\pb z_1) \;=\;

2804:   {\textstyle{1\over 4}}\sum_{z_1}z_1 \;=\;

2805:   {\textstyle{1\over 4}}(1\!+\!2\!+\!3\!+\!4) \;=\; 2.5

2806: \eeqn

2807: is just the arithmetic average of the possible function values and

2808: is independent of $y_1$. Therefore, $\hh y_1\!=\!0$, as $\minarg$

2809: is defined to take the lexicographically first minimum in an

2810: ambiguous case. Let us assume that $f_0(0)\!=\!2$, where $f_0$ is the

2811: true environment function, i.e. $\hh z_1\!=\!2$. The expectation of $z_2$ is then

2812: \beqn

2813:   \langle z_2\rangle \;:=\; \sum_{z_2} z_2\!\cdot\!\mu(02y_2\pb z_2)

2814:   \;=\; \left\{

2815:   \begin{array}{c@{\quad\mbox{for}\quad}l}

2816:     2                      & y_2=0 \\

2817:     2.5                    & y_2=1

2818:   \end{array} \right.

2819: \eeqn

2820: For $y_2\!=\!0$ the system already knows $f(0)\!=\!2$, for

2821: $y_2\!=\!1$ the expectation is, again, the arithmetic average. The

2822: system will again output $\hh y_2\!=\!0$ with feedback $\hh

2823: z_2\!=\!2$. This will continue forever. The system is not

2824: motivated to explore other $y$'s as $f(0)$ is already smaller than the

2825: expectation of $f(1)$. This is obviously not what we

2826: want. The greedy model fails. The system ought to be inventive and

2827: try other outputs when given enough time.
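
The same computation for a general first observation $\hh z_1$ shows when the greedy model gets stuck. Since $\langle z_2\rangle=\hh z_1$ for $y_2\!=\!0$ and $\langle z_2\rangle=2.5$ for $y_2\!=\!1$, we get
\beqn
  \hh y_2 \;=\; \left\{
  \begin{array}{c@{\quad\mbox{for}\quad}l}
    0 & \hh z_1\in\{1,2\} \\
    1 & \hh z_1\in\{3,4\}
  \end{array} \right.
\eeqn
So the greedy system explores $y\!=\!1$ only if the first value was worse than average. For $\hh z_1\!=\!2$ it never tests $y\!=\!1$, although under the equiprobable $\mu$ the value $f(1)\!=\!1$ would be strictly better with probability ${1\over 4}$.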

2828:

2829: The general reason for the failure of the greedy approach is that

2830: the information contained in the feedback $z_k$ depends on the

2831: output $y_k$. An FM system can actively influence the knowledge it

2832: receives from the environment by the choice in $y_k$. It may be

2833: more advantageous to first collect certain knowledge about $f$ by

2834: an (in the greedy sense) non-optimal choice for $y_k$, rather than to

2835: minimize the $z_k$ expectation immediately. The non-minimality of

2836: $z_k$ might be over-compensated in the long run by

2837: exploiting this knowledge. In SP, the received information is

2838: always the current bit of the sequence, independent of what SP

2839: predicts for this bit. This is the reason why a greedy

2840: strategy in the SP case is already optimal.

2841:

2842: %------------------------------%

2843: \paragraph{The general FM$\mu/\xi$ Model:}

2844: %------------------------------%

2845: To get a useful model we have to think more carefully about what we

2846: really want. Should the FM system output a good minimum in the last output

2847: in a limited number of

2848: cycles $T$, or should the average of the $z_1,...,z_T$ values be minimal, or

2849: does it suffice that just one of the $z$ is as small as possible?

2850: Let us define the FM$\mu$ model as to minimize the $\mu$ averaged weighted

2851: sum $\alpha_1 z_1\!+...+\!\alpha_T z_T$ for some given

2852: $\alpha_k\!\geq\!0$. Building the $\mu$ average by summation over

2853: the $z_i$ and minimizing w.r.t.\ the $y_i$ has to be performed in

2854: the correct chronological order. With a similar reasoning as in

2855: (\ref{ebesty}) to (\ref{ydotrec}) we get

2856: \beq\label{fmydot}

2857:   \hh y_k^{FM} \;=\; \minarg_{y_k}\sum_{z_k}...\min_{y_T}\sum_{z_T}

2858:   (\alpha_1 z_1\!+...+\!\alpha_T z_T)\!\cdot\!

2859:   \mu(\hh y_1\hh z_1...\hh y_{k-1}\hh z_{k-1}y_k\pb z_k...y_T\pb z_T)

2860: \eeq

2861: If we want the final output $\hh y_T$ to be optimal we should

2862: choose $\alpha_k\!=\!0$ for $k\!<\!T$ and $\alpha_T\!=\!1$ (final

2863: model FMF$\mu$). If we want to already have a good

2864: approximation during intermediate cycles, we should demand that the

2865: outputs of all cycles together are optimal in some average sense,

2866: so we should choose $\alpha_k\!=\!1$ for all $k$ (sum model

2867: FMS$\mu$). If we want to have something in between, for instance, increase

2868: the pressure to produce good outputs, we could choose the

2869: $\alpha_k\!=\!e^{\gamma(k-T)}$ exponentially increasing for some

2870: $\gamma\!>\!0$ (exponential model FME$\mu$). For

2871: $\gamma\!\to\!\infty$ we get the FMF$\mu$, for $\gamma\!\to\!0$

2872: the FMS$\mu$ model. If we want to demand that the best of the

2873: outputs $y_1...y_T$ is optimal, we must replace the $\alpha$

2874: weighted $z$-sum by $\min\{z_1,...,z_T\}$ (minimum model

2875: FMM$\mu$). We expect the behaviour to be very similar to the

2876: FMF$\mu$ model, and do not consider it further.
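As a quick check of the two limits quoted for the exponential model (a sketch using only the definition $\alpha_k\!=\!e^{\gamma(k-T)}$ and $1\!\leq\!k\!\leq\!T$):
\beqn
  \lim_{\gamma\to\infty}e^{\gamma(k-T)} \;=\; \left\{
  \begin{array}{c@{\quad\mbox{for}\quad}l}
    1 & k=T \\
    0 & k<T
  \end{array} \right.
  \quad,\quad
  \lim_{\gamma\to 0}e^{\gamma(k-T)} \;=\; 1 \quad\mbox{for all } k
\eeqn
The first limit reproduces the weights of FMF$\mu$, the second those of FMS$\mu$.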

2877:

2878: By construction, the FM$\mu$ models guarantee optimal results in

2879: the usual sense that no other model knowing only $\mu$

2880: can be expected to produce better results. The variety of FM

2881: variants is not a fault of the theory. They just reflect the fact

2882: that there is some interpretational freedom of what is meant by

2883: minimization within $T$ function calls. In most applications, probably FMF is

2884: appropriate. In the NPT application one might prefer the FMS model.

2885:

2886: The interesting case (in AI) is when $\mu$ is unknown. For

2887: this case, we define the FM$\xi$ model by replacing $\mu(f)$

2888: with some $\xi(f)$, which should assign high probability to

2889: functions $f$ of low complexity. So we might define\footnote

2890: {$\xi^{FM}(f)$ is a true

2891: probability distribution if we include partial functions in the

2892: domain. So normalization is not necessary.}

2893: $\xi(f)\!=\!\sum_{q:\forall x[U(qx)=f(x)]}2^{-l(q)}$.

2894: The problem with this definition is that it is, in general,

2895: undecidable whether a TM $q$ is an implementation of a function

2896: $f$. $\xi(f)$ defined in this way is uncomputable,

2897: not even approximable. As we only need a $\xi$ analog to the

2898: l.h.s.\ of (\ref{fmmudef}), the following definition is natural

2899: \beq\label{fmxidef}

2900:   \xi^{FM}(y_1\pb z_1...y_n\pb z_n) \;:=\;

2901:   \sum_{q:q(y_i)=z_i\;\forall 1\leq i\leq n} \nq 2^{-l(q)}

2902: \eeq

2903: $\xi^{FM}$ is

2904: actually equivalent to inserting the incomputable $\xi(f)$ into

2905: (\ref{fmmudef}). $\xi^{FM}$ is an enumerable semi-measure and

2906: universal, relative to all probability distributions of the form

2907: (\ref{fmmudef}). We will not prove this here.

2908:

2909: Alternatively, we could have constrained the sum in (\ref{fmxidef})

2910: by $q(y_1...y_n)\!=\!z_1...z_n$ analogous to (\ref{uniMAI}), but these

2911: two definitions are not equivalent. Definition (\ref{fmxidef})

2912: ensures the symmetry\footnote{See \cite{Sol99} for a discussion

2913: on symmetric universal distributions on unordered data.} in its

2914: arguments and $\xi^{FM}(...y\pb z...y\pb z'...)\!=\!0$ for $z\neq z'$.

2915: It incorporates all general knowledge we have about function

2916: minimization, whereas (\ref{uniMAI}) does not. But this extra

2917: knowledge has only low information content (complexity of $O(1)$),

2918: so we do not expect FM$\xi$ to perform much worse when using

2919: (\ref{uniMAI}) instead of (\ref{fmxidef}). But there is no reason

2920: to deviate from (\ref{fmxidef}) at this point.

2921:

2922: We can now define an ``error'' measure $E_{T\mu}^{FM}$ as

2923: (\ref{fmydot}) with $k\!=\!1$ and $\minarg_{y_1}$ replaced by

2924: $\min_{y_1}$ and, additionally, $\mu$ replaced by $\xi$ for

2925: $E_{T\xi}^{FM}$. We expect $|E_{T\xi}^{FM}\!-\!E_{T\mu}^{FM}|$ to

2926: be bounded in a way that justifies the use of $\xi$ instead of

2927: $\mu$ for computable $\mu$, i.e. computable $f_0$ in the

2928: deterministic case. The arguments are the same as for the AI$\xi$

2929: model.

2930:

2931: %------------------------------%

2932: \paragraph{Is the general model inventive?}

2933: %------------------------------%

2934: In the following we will show that FM$\xi$ will never cease

2935: searching for minima, but will test an infinite set of different

2936: $y$'s for $T\!\to\!\infty$.

2937:

2938: Let us assume that the system tests only a finite number of

2939: $y_i\!\in\!A\!\subset Y$, $|A|\!<\!\infty$. Let $t\!-\!1$ be the

2940: cycle in which the last new $y\!\in\!A$ is selected (or some later

2941: cycle). Selecting $y$'s in cycles $k\!\geq\!t$ a second time, the

2942: feedback $z$ does not provide any new information, i.e. does not

2943: modify the probability $\xi^{FM}$. The system can

2944: minimize $E_{T\xi}^{FM}$ by outputting in cycles $k\geq t$ the

2945: best $y\!\in\!A$ found so far (in the case $\alpha_k\!=\!0$, the output

2946: does not matter).

2947: Let us fix $f$ for a moment. Then we have

2948: \beqn

2949:   E^a \;:=\; \alpha_1 z_1\!+...+\!\alpha_T z_T \;=\;

2950:   \sum_{k=1}^{t-1}\alpha_kf(y_k)+f_1\!\cdot\!\sum_{k=t}^T\alpha_k

2951:   \quad,\quad f_1:=\min_{1\leq k<t}f(y_k)

2952: \eeqn

2953: Let us now assume that the system tests one additional

2954: $y_t\!\not\in\!A$ in cycle $t$, but no other $y\!\not\in\!A$.

2955: Again, it will keep to the best output for $k\!>\!t$, which is

2956: either the one of the previous system or $y_t$.

2957: \beqn

2958:   E^b \;=\;

2959:   \sum_{k=1}^t\alpha_kf(y_k) +

2960:   \min\{f_1,f(y_t)\}\!\cdot\nq\;\sum_{k=t+1}^T\alpha_k

2961: \eeqn

2962: The difference can be represented in the form

2963: \beqn

2964:   E^a-E^b \;=\; \left(\sum_{k=t}^T\alpha_k\right)\!\cdot\!f^+ -

2965:   \alpha_t\!\cdot\!f^- \quad,\quad

2966:   f^\pm \;:=\; \max\{0,\pm(f_1\!-\!f(y_t))\} \;\geq\; 0

2967: \eeqn
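This form can be verified term by term. The following sketch assumes that both scenarios share the outputs $y_1...y_{t-1}$, so that the first $t\!-\!1$ summands cancel, and uses $f_1\!-\!f(y_t)=f^+\!-\!f^-$ together with $f_1\!-\!\min\{f_1,f(y_t)\}=f^+$:
\beqn
  E^a-E^b \;=\; \alpha_t\!\cdot\!(f_1\!-\!f(y_t)) +
  (f_1\!-\!\min\{f_1,f(y_t)\})\!\cdot\!\sum_{k=t+1}^T\alpha_k
  \;=\; \alpha_t\!\cdot\!(f^+\!-\!f^-) + f^+\!\cdot\!\sum_{k=t+1}^T\alpha_k
\eeqn
Collecting the $f^+$ terms gives the expression stated above.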

2968: As the true FM$\xi$ strategy is the one which minimizes $E$, assumption

2969: $a$ is ruled out if $E^a>E^b$. We will say that $b$ is favored over $a$,

2970: which does not mean that $b$ is the correct strategy, only that

2971: $a$ is not the true one. For probability distributed $f$, $b$ is

2972: favored over $a$ when

2973: \beqn

2974:   E^a-E^b \;=\; \left(\sum_{k=t}^T\alpha_k\right)\!\cdot\!\langle f^+\rangle -

2975:   \alpha_t\!\cdot\!\langle f^-\rangle \;>\; 0

2976:   \quad\Leftrightarrow\quad

2977:   \sum_{k=t}^T\alpha_k > \alpha_t{\langle f^-\rangle\over\langle

2978:   f^+\rangle}

2979: \eeqn

2980: where $\langle f^\pm\rangle$ is the $\xi$ expectation of $\pm f_1\mp f(y_t)$

2981: under the condition that $\pm f_1\!\geq\!\pm f(y_t)$ and under the constraints

2982: imposed in cycles $1...t\!-\!1$. As $\xi$ assigns a strictly

2983: positive probability to every non-empty event, $\langle

2984: f^+\rangle\!\neq\!0$.

2985: Inserting $\alpha_k\!=\!e^{\gamma(k-T)}$, assumption $a$ is ruled

2986: out in model FME$\xi$ if

2987: \beqn

2988:   T-t \;>\; {1\over\gamma}\ln\left[1+

2989:   {\langle f^-\rangle\over\langle f^+\rangle}(e^\gamma-1)\right]-1

2990:   \;\to\; \left\{

2991:   \begin{array}{c@{\quad\mbox{for}\quad}l}

2992:     0 & \gamma\to\infty\mbox{ (FMF$\xi$ model)} \\

2993:     \langle f^-\rangle/\langle f^+\rangle-1

2994:     & \gamma\to 0\;\;\mbox{ (FMS$\xi$ model)}

2995:   \end{array} \right.

2996: \eeqn

2997: We see that if the condition is not satisfied for some $t$, it will

2998: remain violated for all $t'\!>\!t$. So the FMF$\xi$ system will test each $y$

2999: only once up to a point from which on it always outputs the best

3000: found $y$. Further, for $T\!\to\!\infty$ the condition always gets

3001: satisfied. As this is true for any finite $A$, the assumption of a

3002: finite $A$ is wrong. For $T\!\to\!\infty$ the system

3003: tests an increasing number of different $y$'s, provided $Y$ is

3004: infinite. The FMF$\xi$ model will never repeat any $y$ except in

3005: the last cycle $T$ where it chooses the best found $y$. The

3006: FMS$\xi$ model will test a new $y_t$ for fixed $T$, only if the

3007: expected value of $f(y_t)$ is not too large.
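The FME$\xi$ condition above can be retraced in a few lines. As a sketch, abbreviate $r:=\langle f^-\rangle/\langle f^+\rangle$ (a shorthand introduced only for this check) and assume $\gamma\!>\!0$. Dividing the general condition $\sum_{k=t}^T\alpha_k>\alpha_t r$ by $\alpha_t\!=\!e^{\gamma(t-T)}$ and summing the geometric series gives
\beqn
  \sum_{j=0}^{T-t}e^{\gamma j} \;=\; {e^{\gamma(T-t+1)}-1\over e^\gamma-1} \;>\; r
  \quad\Leftrightarrow\quad
  T-t+1 \;>\; {1\over\gamma}\ln\left[1+r(e^\gamma-1)\right]
\eeqn
For $\gamma\!\to\!0$ the logarithm behaves like $r\gamma$, which yields the bound $r\!-\!1$. For $\gamma\!\to\!\infty$ and $r\!>\!0$ it behaves like $\gamma\!+\!\ln r$, which yields the bound $0$.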

3008:

3009: The above does not necessarily hold for different choices of

3010: $\alpha_k$. The above also holds for the FMF$\mu$ system if

3011: $\langle f^+\rangle\!\neq\!0$. $\langle f^+\rangle\!=\!0$ if the

3012: system can already exclude that $y_t$ is a better guess, so there

3013: is no reason to test it explicitly.

3014:

3015: Nothing has been said about the quality of the guesses, but for

3016: the FM$\mu$ system they are optimal by definition.

3017: If $K(\mu)$ for the true distribution $\mu$ is finite, we expect

3018: the FM$\xi$ system to solve the ``exploration versus

3019: exploitation'' problem in a universally optimal way, as $\xi$

3020: converges to $\mu$.

3021:

3022: %------------------------------%

3023: \paragraph{Using the AI models for Function Minimization:}

3024: %------------------------------%

3025: The AI model can be used for function minimization in the

3026: following way. The output $y_k$ of cycle $k$ is a guess for a

3027: minimum of $f$, like in the FM model. The credit $c_k$ should

3028: be high for small function values $z_k\!=\!f(y_k)$.

3029: The credit should also be weighted with $\alpha_k$ to reflect the

3030: same strategy as in the FM case. The choice of $c_k\!=\!-\alpha_k z_k$

3031: is natural. Here, the feedback is not binary but

3032: $c_k\!\in\!C\!\subset\!I\!\!R$, with $C$ being a countable subset of

3033: $I\!\!R$, e.g. the computable reals or all rational numbers. The

3034: feedback $x'_k$ should be the function value $f(y_k)$.

3035: So we set $x'_k\!=\!z_k$. Note that there is a redundancy

3036: if $\alpha_{()}$ is a computable function with no zeros, as

3037: $c_k\!=-\alpha_kx'_k$. So, for small $K(\alpha_{()})$ like in

3038: the FMS model, one might set $x'_k\equiv\epsilon$. If we keep $x'_k$

3039: the AI prior probability is

3040: \beq\label{muAIfm}

3041:   \mu^{AI}(y_1\pb x_1...y_n\pb x_n)

3042:   \;=\; \left\{

3043:   \begin{array}{cl}

3044:     \mu^{FM}(y_1\pb z_1...y_n\pb z_n)

3045:     & \mbox{for } c_k=-\alpha_kz_k,\; x'_k=z_k,\; x_k=c_kx_k' \\

3046:     0 & \mbox{else}.

3047:   \end{array} \right.

3048: \eeq

3049: Inserting this into (\ref{ydotrec}) with $m_k=T$ we get

3050: \beqn

3051:   \hh y_k^{AI} \;=\;

3052:   \maxarg_{y_k}\sum_{x_k}...\max_{y_T}\sum_{x_T}

3053:   (c_k\!+...+\!c_T)\!\cdot\!

3054:   \mu^{AI}(\hh y_1\hh x_1...y_k\pb x_k...y_T\pb x_T)

3055:   \;=\;

3056: \eeqn

3057: \beqn

3058:   \;=\; \minarg_{y_k}\sum_{z_k}...\min_{y_T}\sum_{z_T}

3059:   (\alpha_kz_k\!+...+\!\alpha_Tz_T)\!\cdot\!

3060:   \mu^{FM}(\hh y_1\hh z_1...y_k\pb z_k...y_T\pb z_T) \;=\; \hh y_k^{FM}

3061: \eeqn

3062: where $\hh y_k^{FM}$ has been defined in (\ref{fmydot}).

3063: The proof of equivalence was so simple because the FM model already has a

3064: rather general structure, which is similar to the full AI model.
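The equality of the two displayed lines can be traced to two elementary facts, sketched here with a generic placeholder function $g$ that is not part of the original notation. On the support of $\mu^{AI}$ the credit sum is the negated weighted $z$-sum, and maximizing a negated quantity is the same as minimizing the quantity itself:
\beqn
  c_k\!+...+\!c_T \;=\; -(\alpha_kz_k\!+...+\!\alpha_Tz_T)
  \quad,\quad
  \maxarg_{y}\,[-g(y)] \;=\; \minarg_{y}\,g(y)
  \quad,\quad
  \max_{y}\,[-g(y)] \;=\; -\min_{y}\,g(y)
\eeqn
The identification with (\ref{fmydot}) additionally uses that the terms $\alpha_1z_1\!+...+\!\alpha_{k-1}z_{k-1}$ are fixed by the history. Assuming $\mu^{FM}$ is a proper measure, they contribute only a $y_k$-independent constant and do not affect the minimizing argument.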

3065:

3066: One might expect no problems when going from the already very

3067: general FM$\xi$ model to the universal AI$\xi$ model (with

3068: $m_k=T$), but there is a pitfall in the case of the FMF model. All

3069: credits $c_k$ are zero in this case, except for the last one, $c_T$.

3070: Although there is a feedback $z_k$ in every cycle, the AI$\xi$

3071: system cannot learn from this feedback as it is not told that in

3072: the final cycle $c_T$ will be equal to $-z_T$. There is no problem in

3073: the FM$\xi$ model because in this case this knowledge is hardcoded into

3074: $\xi^{FM}$. The AI$\xi$ model must first learn that it

3075: has to minimize a function but it can only learn if there is a

3076: non-trivial credit assignment $c_k$. FMF works for repeated

3077: minimization of (different) functions, such as minimizing $N$

3078: functions in $N\!\cdot\!T$ cycles. In this case there are $N$ non-trivial

3079: feedbacks and AI$\xi$ has time to learn that there is a relation

3080: between $c_{k\!\cdot\!T}$ and $x'_{k\!\cdot\!T}$ every $T^{th}$

3081: cycle. This situation is similar to the case of strategic games

3082: discussed in section \ref{secSG}.

3083:

3084: There is no problem in applying AI$\xi$ to FMS because the $c$

3085: feedback provides enough information in this case. The only thing

3086: the AI$\xi$ model has to learn, is to ignore the $x$ feedbacks as

3087: all information is already contained in $c$. Interestingly the

3088: same argument holds for the FME model if $K(\gamma)$ and $K(T)$

3089: are small\footnote{If we set $\alpha_k=e^{\gamma k}$ the condition

3090: on $K(T)$ can be dropped.}. In addition, the AI$\xi$ model only has to learn

3091: the relation $c_k\!=\!-e^{\gamma(k-T)}x'_k$. This

3092: task is simple as every cycle provides one data point for a simple

3093: function to learn. This argument is no longer valid for

3094: $\gamma\!\to\!\infty$ as $K(\gamma)\!\to\!\infty$ in this case.

3095:

3096: %------------------------------%

3097: \paragraph{Remark:}

3098: %------------------------------%

3099: TSP seems to be trivial in the AI$\mu$ model but non-trivial in

3100: the AI$\xi$ model. The reason is that (\ref{fmydot}) just

3101: implements an internal complete search as

3102: $\mu(f)\!=\!\delta_{ff^{TSP}}$ contains all necessary information.

3103: AI$\mu$ outputs, from the very beginning, the exact minimum of $f^{TSP}$. This

3104: ``solution'' is, of course, unacceptable from a performance

3105: perspective. As long as we give no efficient approximation $\xi^c$

3106: of $\xi$, we have not contributed anything to a solution of the

3107: TSP by using AI$\xi^c$. The same is true for any other problem

3108: where $f$ is computable and easily accessible. Therefore, TSP is not (yet)

3109: a good example because all we have done is to replace an NP-complete

3110: problem with the uncomputable AI$\xi$ model or by a

3111: computable AI$\xi^c$ model, for which we have said nothing about

3112: computation time yet. It is simply overkill to reduce simple

3113: problems to AI$\xi$. TSP is a simple problem in this respect, until

3114: we consider the AI$\xi^c$ model seriously. For the other examples,

3115: where $f$ is inaccessible or complicated, AI$\xi^c$ provides a

3116: true solution to the minimization problem as an explicit

3117: definition of $f$ is not needed for AI$\xi$ and AI$\xi^c$.

3118:

3119: \newpage

3120: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3121: \section{Supervised Learning by Examples (EX)}\label{secEX}

3122: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3123:

3124: %------------------------------%

3125: %\paragraph{Introduction (reinforcement versus supervised learning:}

3126: %------------------------------%

3127: The AI models provide a framework for reinforcement learning. The

3128: environment provides a feedback $c$, informing the system about the

3129: quality of its last output $y$; it assigns credit $c$ to output

3130: $y$. In this sense, reinforcement learning is explicitly integrated

3131: into the AI$\rho$ model. For $\rho\!=\!\mu$ it maximizes the true

3132: expected credit, whereas the AI$\xi$ model is a universal,

3133: environment independent, reinforcement learning algorithm.

3134:

3135: There is another type of learning method: Supervised learning by

3136: presentation of examples (EX). Many problems learned by this

3137: method are association problems of the following type. Given some

3138: examples $x\!\in\!R\subset\!X$, the system should reconstruct, from

3139: a partially given $x'$, the missing or corrupted parts, i.e.

3140: complete $x'$ to $x$ such that relation $R$ contains $x$. In many

3141: cases, $X$ consists of pairs $(z,v)$, where $v$ is the possibly

3142: missing part.

3143:

3144: %------------------------------%

3145: \paragraph{Applications/Examples:}

3146: %------------------------------%

3147: Learning functions by presenting $(z,f(z))$ pairs and asking for

3148: the function value of $z$ by presenting $(z,?)$ also falls into

3149: this category.

3150:

3151: A basic example is learning properties of geometrical objects

3152: coded in some way. E.g.\ if there are 18 different objects

3153: characterized by their size (small or big), their colors (red,

3154: green or blue) and their shapes (square, triangle, circle), then

3155: $(object,property)\!\in\!\!R$ if the $object$ possesses the

3156: $property$. Here, $R$ is a relation which is not the graph of a

3157: single valued function.

3158:

3159: When teaching a child, by pointing to objects and saying ``this is

3160: a tree'' or ``look how green'' or ``how beautiful'', one

3161: establishes a relation of $(object,property)$ pairs in $R$.

3162: Pointing to a (possibly different) tree later and asking ``what is

3163: this?'' corresponds to a partially given pair $(object,?)$, where

3164: the missing part ``?'' should be completed by the

3165: child saying ``tree''.

3166:

3167: A final example we want to give is chess. We have seen that, in

3168: principle, chess can be learned by reinforcement learning. In the

3169: extreme case the environment only provides credit $c\!=\!1$ when

3170: the system wins. The learning rate is completely unacceptable from

3171: a practical point of view. The reason is the very low amount of

3172: information feedback. A more practical method of teaching chess is

3173: to present example games in the form of sensible

3174: $(board\mbox{-}state,move)$

3175: sequences. They contain information about legal and good moves

3176: (but without any explanation). After several games have been presented, the

3177: teacher could ask the system to make its own move by presenting

3178: $(board\mbox{-}state,?)$ and then evaluate the answer of the system.

3179:

3180: %------------------------------%

3181: \paragraph{Supervised learning with the AI$\mu/\xi$ model:}

3182: %------------------------------%

3183: Let us define the EX model as follows: The environment presents

3184: inputs

3185: $x'_k = z_kv_k \equiv (z_k,v_k) \in R\!\cup\!(Z\!\times\!\{?\}) \subset

3186:  Z\!\times\!(Y\!\cup\!\{?\}) = X'$

3187: to the system in cycle $k$. The system is expected to output $y_{k+1}$

3188: in the next cycle, which is evaluated with $c_{k+1}\!=\!1$ if $(z_k,y_{k+1})\!\in\!R$ and 0

3189: otherwise. To simplify the discussion, an output $y_k$ is expected

3190: and evaluated even when $v_k(\neq?)$ is given. To complete the

3191: description of the environment, the probability distribution

3192: $\mu_R(\pb{x'_1...x'_n})$ of the examples $x'_i$ (depending on $R$)

3193: has to be given. Wrong examples should not occur, i.e.\ $\mu_R$

3194: should be 0 if $x_i'\!\not\in\!R\!\cup\!(Z\!\times\!\{?\})$ for some $1\!\leq\!i\!\leq\!n$.

3195: The relations $R$ might also be probability distributed with

3196: $\sigma(\pb R)$. The example prior probability in this case is

3197: \beq\label{exmudef}

3198:   \mu(\pb{x'_1...x'_n}) \;=\;

3199:   \sum_R \mu_R(\pb{x'_1...x'_n})\!\cdot\!\sigma(\pb R)

3200: \eeq

3201: The knowledge of the valuation $c_k$ on output $y_k$

3202: restricts the possible relations $R$, consistent with

3203: $R(z_k,y_{k+1})\!=\!c_{k+1}$, where $R(z,y)\!:=\!1$ if $(z,y)\!\in\!R$ and 0

3204: otherwise. The prior probability for the input sequence

3205: $x_1...x_n$ if the output sequence is $y_1...y_n$, is

3206: therefore

3207: \beqn

3208:   \mu^{AI}(y_1\pb x_1...y_n\pb x_n) \;=\;

3209:   \sum_{R:\forall 1\leq i< n[R(z_i,y_{i+1})=c_{i+1}]}

3210:   \mu_R(\pb{x'_1...x'_n})\!\cdot\!\sigma(\pb R)

3211: \eeqn

3212: where $x_i\!=\!c_ix'_i$ and $x'_i\!=\!z_iv_i$ with $v_i\!\in\!Y\!\cup\!\{?\}$.

3213: In the I/O sequence $y_1x_1y_2x_2...=y_1c_1z_1v_1y_2c_2z_2v_2...$

3214: the $c_1y_1$ are dummies, after which regular behaviour starts.

3215:

3216: The AI$\mu$ model is optimal by construction of $\mu^{AI}$. For

3217: computable prior $\mu_R$ and $\sigma$, we expect a near optimal

3218: behavior of the universal AI$\xi$ model if $\mu_R$ additionally satisfies some

3219: separability property. In the following, we give some motivation

3220: why the AI$\xi$ model takes into account the supervisor

3221: information contained in the examples and why it learns faster than by

3222: reinforcement.

3223:

3224: %------------------------------%

3225: %\paragraph{Reason why AI$\xi$ can learn supervised:}

3226: %------------------------------%

3227: We keep $R$ fixed and assume

3228: $\mu_R(x'_1...x'_n)\!=\!\mu_R(x'_1)\!\cdot...\cdot\!\mu_R(x'_n)\!\neq\!0

3229: \Leftrightarrow x'_i\!\in\!R\!\cup\!(Z\!\times\!\{?\})\;\forall i$

3230: to simplify the discussion. Short codes $q$ contribute mostly to

3231: $\xi^{AI}(y_1\pb x_1...y_n\pb x_n)$. As $x'_1...x'_n$ is

3232: distributed according to the computable probability distribution

3233: $\mu_R$, a short code of $x'_1...x'_n$ for large enough $n$ is a

3234: Huffman coding w.r.t.\ the distribution $\mu_R$. So we expect

3235: $\mu_R$ and hence $R$ coded in the dominant contributions to

3236: $\xi^{AI}$ in some way, where the plausible assumption was made

3237: that the $y$ on the input tape do not matter. Much more than one

3238: bit per cycle will usually be learned, hence, relation $R$ can be

3239: learned in $n\!\ll\!K(R)$ cycles by appropriate examples. This

3240: coding of $R$ in $q$ evolves independently of the feedbacks $c$.

3241: To maximize the feedback $c_k$, the system has to learn to output

3242: a $y_{k+1}$ with $(z_k,y_{k+1})\!\in\!R$. The system has to invent

3243: a program extension $q'$ to $q$, which extracts $z_k$ from

3244: $x'_k\!=\!z_kv_k$ and searches for and outputs a $y_{k+1}$ with

3245: $(z_k,y_{k+1})\!\in\!R$. As $R$ is already coded in $q$, $q'$ can

3246: re-use this coding of $R$ in $q$. The size of the extension $q'$

3247: is, therefore, of $O(1)$. To learn this $q'$, the system requires

3248: feedback $c$ with information content of $O(1)\!=\!K(q')$ only.

3249:

3250: Let us compare this with reinforcement learning, where only $x'_k\!=\!(z_k,?)$

3251: pairs are presented. A coding of $R$ in a short code $q$ for

3252: $x'_1...x'_n$ is of no use and will therefore be absent. Only the

3253: credits $c$ force the system to learn $R$. $q'$ is therefore

3254: expected to be of size $K(R)$. The information content in the

3255: $c$'s must be of the order $K(R)$. In practice, there are often only very few

3256: $c_k\!=\!1$ at the beginning of the learning phase and the

3257: information content in $c_1...c_n$ is much less than $n$ bits. The

3258: required number of cycles to learn $R$ by reinforcement is,

3259: therefore, at least $K(R)$, and in many cases much larger.

3260:

3261: Although AI$\xi$ was never designed or told to learn

3262: supervised, it learns how to take advantage of the examples from

3263: the supervisor.  $\mu_R$ and $R$ are learned from the examples, the

3264: credits $c$ are not necessary for this process. The remaining task

3265: of learning how to learn supervised is then a simple task of

3266: complexity $O(1)$, for which the credits $c$ are necessary.

3267:

3268: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3269: \section{Other AI Classes}\label{secOther}

3270: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3271: \ifprivate

3272: \begin{itemize}\parskip=0ex\parsep=0ex\itemsep=0ex

3273: \item Function Inversion

3274: \item Building analogies

3275: \item Delayed SP

3276: \item Artificial Life

3277: \end{itemize}

3278: \fi

3279:

3280: %------------------------------%

3281: \paragraph{Other aspects of intelligence:}

3282: %------------------------------%

3283: In AI, a variety of general ideas and methods have been developed.

3284: In the last sections, we have seen how several problem classes can

3285: be formulated within AI$\xi$. As we claim universality of the

3286: AI$\xi$ model, we want to shed light on which of the other AI

3287: methods are incorporated in the AI$\xi$ model, and how, by looking at its

3288: structure. Some methods are directly included,

3289: others are or should be emergent. We do not claim the following

3290: list to be complete.

3291:

3292: {\it Probability theory} and {\it utility theory} are the heart of

3293: the AI$\mu/\xi$ models. The probabilities are the true/universal

3294: behaviours of the environment. The utility function is what we

3295: called total credit, which should be maximized. Maximization of an

3296: expected utility function in a probabilistic environment is

3297: usually called {\it sequential decision theory}, and is explicitly integrated

3298: in full generality in our model. This includes probabilistic (a

3299: generalization of deterministic) {\it reasoning}, where the

3300: object of reasoning are not true or false statements, but the

3301: prediction of the environmental behaviour. {\it Reinforcement

3302: Learning}

3303: is explicitly built in, due to the credits. Supervised learning is

3304: an emergent phenomenon (section \ref{secEX}). {\it Algorithmic

3305: information theory} leads us to use $\xi$ as a universal estimate

3306: for the prior probability $\mu$.

3307:

3308: For horizon $>\!1$, the alternating expectimax series

3309: in (\ref{facydot}) and the process of selecting maximal

3310: values can be interpreted as abstract {\it planning}. This expectimax

3311: series also includes {\it informed search}, in the case of AI$\mu$, and {\it

3312: heuristic search}, for AI$\xi$, where $\xi$ could be interpreted as

3313: a heuristic for $\mu$. The minimax strategy of {\it game playing}

3314: in case of AI$\mu$ is also subsumed. The AI$\xi$ model converges

3315: to the minimax strategy if the environment is a minimax player but

3316: it can also take advantage of environmental players with limited

3317: rationality. {\it Problem solving} occurs (only) in the form of

3318: how to maximize the expected future credit.

3319:

3320: {\it Knowledge} is accumulated by AI$\xi$ and is stored in some

3321: form not specified further on the working tape. Any kind of

3322: information in any representation on the inputs $x$ is

3323: exploited. The problem of {\it knowledge engineering} and

3324: representation appears in the form of how to train the AI$\xi$

3325: model. More practical aspects, like {\it language or image

3326: processing} have to be learned by AI$\xi$ from scratch.

3327:

3328: Other theories, like {\it fuzzy logic}, {\it possibility theory},

3329: {\it Dempster-Shafer theory}, ... are partly outdated and partly

3330: reducible to Bayesian probability theory \cite{Che85}. The

3331: interpretation and effects of the evidence gap

3332: $g\!:=\!1\!-\!\sum_{x_k}\xi(y\!x_{<k}y\!\pb x_k)\!>\!0$ in $\xi$ may

3333: be similar to those in Dempster-Shafer theory. Boolean logical

3334: reasoning about the external world plays, at best, an emergent

3335: role in the AI$\xi$ model.

3336:

3337: Other methods, which do not seem to be contained in the AI$\xi$ model,

3338: might also be emergent phenomena. The AI$\xi$ model has to

3339: construct short codes of the environmental behaviour, the

3340: AI$\xi^{\tilde t\tilde l}$ (see next section) has to construct

3341: short action programs. If we analyzed and interpreted these

3342: programs for realistic environments, we might find some of the

3343: unmentioned or unused or new AI methods at work in these

3344: algorithms. This is, however, pure speculation at this point. More

3345: important: when trying to make AI$\xi$ practically usable,

3346: some other AI methods, like genetic algorithms or neural nets,

3347: may be useful.

3348:

3349: The main thing we wanted to point out is that the AI$\xi$ model

3350: does not lack any important known property of intelligence or

3351: known AI methodology. What {\it is} missing, however, are computational

3352: aspects, which are addressed in the next section.

3353:

3354: \newpage

3355: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3356: \section{Time Bounds and Effectiveness}\label{secTime}

3357: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3358:

3359: %------------------------------%

3360: \paragraph{Introduction:}

3361: %------------------------------%

3362: Until now, we have not bothered with the non-computability of the

3363: universal probability distribution $\xi$. As all universal models

3364: in this paper are based on $\xi$, they are not effective in this

3365: form. In this section, we will outline how the previous models and

3366: results can be modified/generalized to the time-bounded case.

3367: Indeed, the situation is not as bad as it could be. $\xi$ and $C$

3368: are enumerable and $\hh y_k$ is still approximable or computable

3369: in the limit. There exists an algorithm, that will produce a

3370: sequence of outputs eventually converging to the exact output $\hh

3371: y_k$, but we can never be sure whether we have already reached it.

3372: Besides this, the convergence is extremely slow, so this type of

3373: asymptotic computability is of no direct (practical) use, but will,

3374: nevertheless, be important later.

3375:

3376: Let $\tilde p$ be a program which calculates within a reasonable

3377: time $\tilde t$ per cycle, a reasonably intelligent output, i.e.

3378: $\tilde p(\hh x_{<k})\!=\!\hh y_{1:k}$. This sort

3379: of computability assumption, that a general purpose computer of

3380: sufficient power is able to behave in an intelligent way, is

3381: the very basis of AI, justifying the

3382: hope to be able to construct systems which eventually reach and outperform

3383: human intelligence. For a contrary viewpoint see \cite{Pen89}. It

3384: is not necessary to discuss here what is meant by `reasonable

3385: time/intelligence' and `sufficient power'. What we are interested

3386: in, in this section, is whether there is a computable version

3387: AI$\xi^{\tilde t}$ of the AI$\xi$ system which is superior or equal to any

3388: $p$ with computation time per cycle of at most $\tilde t$.

3389: With `superior', we mean `more intelligent', so what we

3390: need is an order relation (like) (\ref{aiorder}) for intelligence.

3391:

3392: The best result we could think of would be an AI$\xi^{\tilde t}$

3393: with computation time $\leq\!\tilde t$ at least as intelligent as

3394: any $p$ with computation time $\leq\!\tilde t$. If AI is possible

3395: at all, we would have reached the final goal, the construction of

3396: the most intelligent algorithm with computation time $\leq\!\tilde t$.

3397: Just as there is no universal measure in the set of computable

3398: measures (within time $\tilde t$), such an AI$\xi^{\tilde t}$ may

3399: not exist either.

3400:

3401: What we can realistically hope to construct, is an AI$\xi^{\tilde

3402: t}$ system of computation time $c\!\cdot\!\tilde t$ per cycle for

3403: some constant $c$. The idea is to run all programs $p$ of length

3404: $\leq\!\tilde l\!:=\!l(\tilde p)$ and time $\leq\!\tilde t$ per

3405: cycle and pick the best output. The total computation time is

3406: $2^{\tilde l}\!\cdot\!\tilde t$, hence $c=2^{\tilde l}$. This sort

3407: of idea of `typing monkeys' with one of them eventually writing

3408: Shakespeare, has been applied in various forms and contexts in

3409: theoretical computer science. The realization of this {\it best

3410: vote} idea, in our case, is not straightforward and will be

3411: outlined in this section. An idea related to this, is that of basing the

3412: decision on the majority of algorithms. This `democratic vote'

3413: idea has been used in \cite{LiWa89,Vov92} for sequence prediction,

3414: and is referred to as `weighted majority' there.
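To get a feeling for the size of the constant $c$, here is a rough illustration with hypothetical numbers that are not taken from the text. Assuming binary programs, the number of programs of length at most $\tilde l$ is
\beqn
  \sum_{l=0}^{\tilde l}2^l \;=\; 2^{\tilde l+1}-1 \;<\; 2\!\cdot\!2^{\tilde l}
\eeqn
which agrees with $c\!=\!2^{\tilde l}$ up to a factor of 2. Already for $\tilde l\!=\!100$ bits one has $2^{100}\!\approx\!1.3\!\cdot\!10^{30}$, so the constant is astronomically large even for very short programs.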

3415:

3416: %------------------------------%

3417: \paragraph{Time limited probability distributions:}

3418: %------------------------------%

3419: In the literature one can find time limited versions of Kolmogorov

3420: complexity \cite{Dal73,Ko86} and the time limited universal

3421: semimeasure \cite{LiVi91,LiVi93}. In the following, we

3422: utilize and adapt the latter and see how far we get. One way to define a

3423: time-limited universal chronological semimeasure is

3424: as a sum over all enumerable chronological semimeasures

3425: computable within time $\tilde t$ and of size at most $\tilde l$

3426: similar to the unbounded case (\ref{xirhodef}).

3427: \beq\label{aixitl}

3428:   \xi^{\tilde t\tilde l}(y\!\pb x_{1:n})

3429:   \;:=\; \nq\sum_{\quad\rho\;:\;l(\rho)\leq\tilde l\;\wedge\;t(\rho)\leq\tilde t}

3430:   \nq\nq 2^{-l(\rho)}\rho(y\!\pb x_{1:n})

3431: \eeq

3432: Let us assume that the true environmental prior probability $\mu^{AI}$

3433: is equal to or sufficiently accurately approximated by a $\rho$ with

3434: $l(\rho)\!\leq\!\tilde l$ and $t(\rho)\!\leq\!\tilde t$ with $\tilde

3435: t$ and $\tilde l$ of reasonable size. There are several AI

3436: problems that fall into this class. In function minimization of

3437: section \ref{secFM}, the computation of $f$ and $\mu^{FM}$ are

3438: usually feasible. In many cases, the sequences of section \ref{secSP}

3439: which should be predicted, can be easily calculated when $\mu^{SP}$

3440: is known. In a classifier problem, the

3441: probability distribution $\mu^{CF}$, according to which examples

3442: are presented, is, in many cases, also elementary. But not all AI

3443: problems are of this 'easy' type. For the strategic games of section

3444: \ref{secSG}, the environment itself is usually a highly

3445: complex strategic player with a $\mu^{SG}$

3446: that is difficult to calculate,

3447: although one might argue that the environmental player may have

3448: limited capabilities too. But it is easy to think of a difficult

3449: to calculate physical (probabilistic) environment like the

3450: chemistry of biomolecules.

3451:

3452: The number of interesting applications makes this restricted class

3453: of AI problems, with time and space bounded environment

3454: $\mu^{\tilde t\tilde l}$, worth being studied. Superscripts to a

3455: probability distribution except for $\xi^{\tilde t\tilde l}$

3456: indicate their length and maximal computation time. $\xi^{\tilde

3457: t\tilde l}$ defined in (\ref{aixitl}), with a yet to be determined

3458: computation time, multiplicatively dominates all $\mu^{\tilde

3459: t\tilde l}$ of this type. Hence, an AI$\xi^{\tilde t\tilde l}$

3460: model, where we use $\xi^{\tilde t\tilde l}$ as prior probability,

3461: is universal, relative to all AI$\mu^{\tilde t\tilde l}$ models in

3462: the same way as AI$\xi$ is universal to AI$\mu$ for all enumerable

3463: chronological semimeasures $\mu$. The $\maxarg_{y_k}$ in

3464: (\ref{ydotxi}) selects a $y_k$ for which $\xi^{\tilde t\tilde l}$

3465: has the highest expected utility $C_{km_k}$, where $\xi^{\tilde

3466: t\tilde l}$ is the weighted average over the $\rho^{\tilde t\tilde

3467: l}$. $\hh y_k^{AI\xi^{\tilde t\tilde l}}$ is determined by a

3468: weighted majority. We expect AI$\xi^{\tilde t\tilde l}$ to

3469: outperform all (bounded) AI$\rho^{\tilde t\tilde l}$, analogously to the

3470: unrestricted case.
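
The domination property stated above can be checked directly from
definition (\ref{aixitl}). As a short sketch, assume that
$\mu^{\tilde t\tilde l}$ is itself one of the $\rho$ in the sum, i.e.\
$l(\mu^{\tilde t\tilde l})\!\leq\!\tilde l$ and
$t(\mu^{\tilde t\tilde l})\!\leq\!\tilde t$. Since all terms of the sum
are non-negative, dropping all but this one term gives
\beqn
  \xi^{\tilde t\tilde l}(y\!\pb x_{1:n})
  \;\geq\; 2^{-l(\mu^{\tilde t\tilde l})}\mu^{\tilde t\tilde l}(y\!\pb x_{1:n})
  \;\geq\; 2^{-\tilde l}\mu^{\tilde t\tilde l}(y\!\pb x_{1:n})
\eeqn
so that, under this assumption, the domination constant is at least
$2^{-\tilde l}$, independent of $n$.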

3471:

3472: In the following we analyze the computability properties of

3473: $\xi^{\tilde t\tilde l}$ and AI$\xi^{\tilde t\tilde l}$,

3474: i.e.\ of $\hh y_k^{AI\xi^{\tilde t\tilde l}}$. To compute

3475: $\xi^{\tilde t\tilde l}$ according to the definition

3476: (\ref{aixitl}) we have to enumerate all chronological enumerable semimeasures

3477: $\rho^{\tilde t\tilde l}$ of length $\leq\!\tilde l$

3478: and computation time $\leq\!\tilde t$. This can be done similarly to

3479: the unbounded case (\ref{ccsm1}-\ref{ccsm3}). All $2^{\tilde l}$

3480: enumerable functions of length $\leq\!\tilde l$, computable within time

3481: $\tilde t$ have to be converted to chronological probability

3482: distributions. For this, one has to evaluate each function for

3483: $|X|\!\cdot\!k$ different arguments. Hence,

3484: $\xi^{\tilde t\tilde l}$ is computable within time\footnote{We

3485: assume that a TM can be simulated by another in linear time.}

3486: $

3487:   t(\xi^{\tilde t\tilde l}(y\!\pb x_{1:k})) \!=\!

3488:   O(|X|\!\cdot\!k\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t)

3489: $.

3490: The computation time of $\hh y_k^{AI\xi^{\tilde t\tilde l}}$

3491: depends on the size of $X$, $Y$ and $m_k$.

3492: $\xi^{\tilde t\tilde l}$ has to be

3493: evaluated $|Y|^{h_k}|X|^{h_k}$ times in (\ref{ydotxi}).

3494: It is possible to

3495: optimize the algorithm and perform the computation within time

3496: \beq\label{tyaixi}

3497:   t(\hh y_k^{AI\xi^{\tilde t\tilde l}}) \;=\;

3498:   O(|Y|^{h_k}|X|^{h_k}\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t)

3499: \eeq

3500: per cycle. If we assume that the computation time of $\mu^{\tilde

3501: t\tilde l}$ is exactly $\tilde t$ for all arguments, the brute

3502: force time $\bar t$ for calculating the sums and maxs in

3503: (\ref{ydotrec}) is $\bar t(\hh y_k^{AI\mu^{\tilde t\tilde

3504: l}})\!\geq\!|Y|^{h_k}|X|^{h_k}\!\cdot\!\tilde t$. Combining this

3505: with (\ref{tyaixi}), we get

3506: \beqn

3507:   t(\hh y_k^{AI\xi^{\tilde t\tilde l}}) \;=\;

3508:   O(2^{\tilde l}\!\cdot\!

3509:   \bar t(\hh y_k^{AI\mu^{\tilde t\tilde l}}))

3510: \eeqn

3511: This result has the proposed structure, that there is a universal

3512: AI$\xi^{\tilde t\tilde l}$ system with computation time

3513: $2^{\tilde l}$ times the computation time of a special

3514: AI$\mu^{\tilde t\tilde l}$ system.
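
The step from (\ref{tyaixi}) to the unnumbered bound above can be spelled
out as follows. The brute force estimate
$\bar t(\hh y_k^{AI\mu^{\tilde t\tilde l}})\!\geq\!|Y|^{h_k}|X|^{h_k}\!\cdot\!\tilde t$
bounds the product of the first and last factor in (\ref{tyaixi}), hence
\beqn
  |Y|^{h_k}|X|^{h_k}\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t
  \;=\; 2^{\tilde l}\!\cdot\!\big(|Y|^{h_k}|X|^{h_k}\!\cdot\!\tilde t\big)
  \;\leq\; 2^{\tilde l}\!\cdot\!\bar t(\hh y_k^{AI\mu^{\tilde t\tilde l}})
\eeqn
As a purely illustrative example (the numbers are assumptions and are not
taken from any application), consider sequence prediction with
$h_k\!=\!1$ and $|Y|\!=\!|X|\!=\!2$, so that
$|Y|^{h_k}|X|^{h_k}\!=\!4$, and a length bound of $\tilde l\!=\!20$ bits.
The overhead factor is then $2^{20}\!=\!1048576\!\approx\!10^6$, which
already indicates how quickly $2^{\tilde l}$ becomes prohibitive.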

3515:

3516: Unfortunately, the class of AI$\mu^{\tilde t\tilde l}$ systems

3517: with brute force evaluation of $\hh y_k$, according to

3518: (\ref{ydotrec}) is completely uninteresting from a practical point

3519: of view. E.g. in the context of chess, the above result says that

3520: the AI$\xi^{\tilde t\tilde l}$ is superior within time $2^{\tilde

3521: l}\!\cdot\!\tilde t$ to any brute force minimax strategy of computation time

3522: $\tilde t$. Even if the factor of $2^{\tilde l}$ in computation

3523: time did not matter, the AI$\xi^{\tilde t\tilde l}$ system would

3524: nevertheless be practically useless, as a brute force minimax chess

3525: player with reasonable time $\tilde t$ is a very poor player.

3526:

3527: Note that in the case of sequence prediction ($h_k\!=\!1$,

3528: $|Y|\!=\!|X|\!=\!2$) the computation time of $\rho$ coincides with

3529: that of $\hh y_k^{AI\rho}$ within a factor of 2. The class

3530: AI$\rho^{\tilde t\tilde l}$ includes {\it all} non-incremental

3531: sequence prediction algorithms of size $\leq\!\tilde l$ and

3532: computation time $\leq\!\tilde t/2$. By non-incremental we mean

3533: that no information of previous cycles is taken into account for

3534: the computation of $\hh y_k$ of the current cycle.

3535:

3536: The shortcomings (mentioned and unmentioned ones) of this

3537: approach are cured in the next subsection, by deviating from the

3538: standard way of defining a time-bounded $\xi$ as a sum over functions or

3539: programs.

3540:

3541: %------------------------------%

3542: \paragraph{The idea of the best vote algorithm:}

3543: %------------------------------%

3544: A general cybernetic or AI system is a chronological program

3545: $p(x_{<k})=y_{1:k}$. This form, introduced in section

3546: \ref{secAIfunc}, is general enough to include any AI system (and

3547: also less intelligent systems).

3548: In the following, we are interested in programs $p$ of length

3549: $\leq\!\tilde l$ and computation time $\leq\!\tilde t$ per cycle.

3550: One important point in the time-limited setting is that $p$ should be

3551: incremental, i.e. when computing $y_k$ in cycle $k$, the

3552: information of the previous cycles stored on the working tape can

3553: be re-used. Indeed, there is probably no practically interesting,

3554: non-incremental AI system at all.

3555:

3556: In the following, we construct a policy $p^\best$, or more

3557: precisely, policies $p_k^\best$ for every cycle $k$ that

3558: outperform all time and length limited AI systems $p$. In cycle $k$,

3559: $p_k^\best$ runs all $2^{\tilde l}$ programs $p$ and selects the

3560: one with the best output $y_k$. This is a 'best vote' type of

3561: algorithm, as compared to the 'weighted majority' like algorithm of the

3562: last subsection. The ideal measure for the quality of the output

3563: would be the $\xi$ expected credit

3564: \beq

3565:  C_{km}(p|\hh y\!\hh x_{<k}) \;:=\; \sum_{q\in\hh Q_k}2^{-l(q)}C_{km}(p,q)

3566:  \quad,\quad

3567:   C_{km}(p,q) \;:=\; c(x_k^{pq})+...+c(x_m^{pq})

3568: \eeq

3569: The program $p$ which maximizes $C_{km_k}$ should be

3570: selected. We have dropped the normalization $\cal N$ unlike in

3571: (\ref{cxi}), as it is independent of $p$ and

3572: does not change the order relation which we are solely interested

3573: in here. Furthermore, without normalization, $C_{km}$ is enumerable,

3574: which will be important later.

3575:

3576: %------------------------------%

3577: \paragraph{Extended chronological programs:}

3578: %------------------------------%

3579: In the (functional form of the) AI$\xi$ model it was convenient to

3580: maximize $C_{km_k}$ over all $p\!\in\!\hh P_k$,

3581: i.e. all $p$ consistent with the current history $\hh y\!\hh x_{<k}$.

3582: This was no restriction, because for every

3583: possibly inconsistent program $p$ there exists a program $p'\!\in\!\hh P_k$ consistent

3584: with the current history and identical to $p$ for all future

3585: cycles $\geq\!k$. For the time limited best vote algorithm

3586: $p^\best$ it would be too restrictive to demand $p\!\in\!\hh

3587: P_k$. To prove universality, one has to compare {\it all} $2^{\tilde l}$

3588: algorithms in every cycle, not just the consistent ones. An

3589: inconsistent algorithm may become the best one in later cycles.

3590: For inconsistent programs we have to include the $\hh y_k$ into the

3591: input, i.e. $p(\hh y\!\hh x_{<k})\!=\!y_{1:k}^p$

3592: with $\hh y_i\!\neq\!y_i^p$ possible. For $p\!\in\!\hh P_k$ this

3593: was not necessary, as $p$ knows the output $\hh y_k\equiv y_k^p$ in

3594: this case. The $c(x_i^{pq})$ in the definition of $C_{km}$ are the

3595: valuations emerging in the I/O sequence, starting with $\hh

3596: y\!\hh x_{<k}$ (emerging from $p^\best$) and then continued

3597: by applying $p$ and $q$ with $\hh y_i\!:=\!y_i^p$ for

3598: $i\!\geq\!k$.

3599:

3600: Another problem is that we need $C_{km_k}$ to select the best

3601: policy, but unfortunately $C_{km_k}$ is uncomputable. Indeed, the

3602: structure of the definition of $C_{km_k}$ is very similar to that

3603: of $\hh y_k$, hence a brute force approach to approximate

3604: $C_{km_k}$ requires too much computation time, just as for $\hh y_k$. We

3605: solve this problem in a similar way, by supplementing each $p$ with

3606: a program that estimates $C_{km_k}$ by $w_k^p$ within time

3607: $\tilde t$. We combine the calculation of $y_k^p$ and $w_k^p$ and

3608: extend the notion of a chronological program once again to

3609: \beq\label{extprog}

3610:   p(\hh y\!\hh x_{<k}) \;=\; w_1^py_1^p...w_k^py_k^p

3611: \eeq

3612: with chronological order $w_1^py_1^p\hh y_1\hh x_1

3613: w_2^py_2^p\hh y_2\hh x_2...$.

3614:

3615: %------------------------------%

3616: \paragraph{Valid approximations:}

3617: %------------------------------%

3618: $p$ might suggest any output $y_k^p$ but it is not allowed to rate

3619: it with an arbitrarily high $w_k^p$ if we want $w_k^p$ to be a reliable

3620: criterion for selecting the best $p$. We demand that no policy is

3621: allowed to claim that it is better than it actually is. We define

3622: a (logical) predicate VA($p$) called {\it valid approximation}, which

3623: is true if, and only if, $p$ always satisfies

3624: $w_k^p\!\leq\!C_{km_k}(p)$, i.e. never overrates itself.

3625: \beq\label{vadef}

3626:   \mbox{VA}(p) \;\equiv\;

3627:   \forall k\forall w_1^py_1^p\hh y_1\hh x_1...w_k^py_k^p :

3628:   p(\hh y\!\hh x_{<k}) \!=\! w_1^py_1^p...w_k^py_k^p

3629:   \Rightarrow

3630:   w_k^p\!\leq\!C_{km_k}(p|\hh y\!\hh x_{<k})

3631: \eeq

3632: In the following, we restrict our attention to programs $p$, for which

3633: VA($p$) can be proved in some formal axiomatic system.

3634: A very important point is that $C_{km_k}$ is enumerable.

3635: This ensures the existence of sequences of

3636: programs $p_1, p_2, p_3, ...$ for which VA($p_i$) can be proved and

3637: $\lim_{i\to\infty}w_k^{p_i}\!=\!C_{km_k}(p)$

3638: for all $k$ and all I/O sequences. The approximation is not

3639: uniform in $k$, but this does not matter as the selected $p$ is allowed to change

3640: from cycle to cycle.

3641:

3642: Another possibility would be to consider only those $p$ which check

3643: $w_k^p\!\leq\!C_{km_k}(p)$ online in every cycle, instead of

3644: the pre-check VA($p$), either by constructing a proof (on the working

3645: tape) for this special case, or it is already evident by the

3646: construction of $w_k^p$. In cases where $p$ cannot guarantee

3647: $w_k^p\!\leq\!C_{km_k}(p)$ it sets $w_k^p\!=\!0$ and, hence, trivially

3648: satisfies $w_k^p\!\leq\!C_{km_k}(p)$. On the other hand, for these

3649: $p$ it is also no problem to prove VA($p$) as one has simply to

3650: analyze the internal structure of $p$ and recognize that $p$ shows

3651: the validity internally itself, cycle by cycle, which is easy by

3652: assumption on $p$. The cycle by cycle check is, therefore, a special

3653: case of the pre-proof of VA($p$).

3654:

3655: %------------------------------%

3656: \paragraph{Effective intelligence order relation:}

3657: %------------------------------%

3658: In section \ref{secAIxi} we have introduced an intelligence order

3659: relation $\succeq$ on AI systems, based on the expected credit

3660: $C_{km_k}(p)$. In the following we need an order relation

3661: $\succeq^c$ based on the claimed credit $w_k^p$ which might

3662: be interpreted as an approximation to $\succeq$. We call $p$

3663: {\it effectively more or equally intelligent} than $p'$ if

3664: \bqa\label{effaiord}

3665:   p\succeq^c\!p' \;:\Leftrightarrow\;

3666:   \forall k\forall \hh y\!\hh x_{<k}

3667:   \exists w_{1:k}w'_{1:k} : \\

3668:   p(\hh y\!\hh x_{<k}) \!=\! w_1\!*...w_k\!* \;\wedge\;

3669:   p'(\hh y\!\hh x_{<k}) \!=\! w_1'\!*...w_k'\!* \;\wedge\;

3670:   w_k\!\geq\!w_k'

3671: \eqa

3672: i.e.\ if $p$ always claims a credit estimate $w$ at least as high as that of $p'$.

3673: $\succeq^c$ is a co-enumerable partial order relation on extended

3674: chronological programs. Restricted to valid approximations

3675: it orders the policies w.r.t.\ the quality of their outputs {\it

3676: and} their ability to justify their outputs with high $w_k$.

3677:

3678: %------------------------------%

3679: \paragraph{The universal time bounded AI$\xi^{\tilde t\tilde l}$ system:}

3680: %------------------------------%

3681: In the following, we describe the algorithm $p^\best$ underlying

3682: the universal time bounded AI$\xi^{\tilde t\tilde l}$ system. It

3683: is essentially based on the selection of the best algorithms

3684: $p_k^\best$ out of the time ${\tilde t}$ and length ${\tilde l}$

3685: bounded $p$, for which there exists a proof of VA($p$) with length

3686: $\leq\!l_P$.

3687:

3688: \begin{enumerate}\parskip=0ex\parsep=0ex\itemsep=0ex

3689: \item Create all binary strings of length $l_P$ and interpret each

3690: as a coding of a mathematical proof in the same formal logic system in

3691: which VA($\cdot$) has been formulated. Take those strings

3692: which are proofs of VA($p$) for some $p$ and keep the

3693: corresponding programs $p$.

3694: \item Eliminate all $p$ of length $>\!\tilde l$.

3695: \item Modify all $p$ in the following way: all output $w_k^py_k^p$

3696: is temporarily written on an auxiliary tape. If $p$ stops in $\tilde t$

3697: steps the internal 'output' is copied to the output tape. If $p$

3698: does not stop after $\tilde t$ steps a stop is forced and $w_k\!=\!0$

3699: and some arbitrary $y_k$ is written on the output tape. Let $P$ be

3700: the set of all those modified programs.

3701: \item Start first cycle: $k\!:=\!1$.

3702: \item\label{pbestloop} Run every $p\!\in\!P$ on extended input

3703: $\hh y\!\hh x_{<k}$, where all outputs are redirected to some auxiliary

3704: tape:

3705: $p(\hh y\!\hh x_{<k})\!=\!w_1^py_1^p...w_k^py_k^p$.

3706: \item Select the program $p$ with highest claimed credit $w_k^p$:

3707: $p_k^\best\!:=\!\maxarg_pw_k^p$.

3708: \item Write $\hh y_k\!:=\!y_k^{p_k^\best}$ to the output tape.

3709: \item Receive input $\hh x_k$ from the environment.

3710: \item Begin next cycle: $k\!:=\!k\!+\!1$, goto step

3711: \ref{pbestloop}.

3712: \end{enumerate}

3713:

3714: It is easy to see that the following theorem holds.

3715:

3716: %------------------------------%

3717: \paragraph{Main theorem:}

3718: %------------------------------%

3719: Let $p$ be any extended chronological (incremental) program like

3720: (\ref{extprog}) of length $l(p)\!\leq\!\tilde l$ and computation

3721: time per cycle $t(p)\!\leq\!\tilde t$, for which there exists a

3722: proof of VA($p$) defined in (\ref{vadef}) of length $\leq\!l_P$.

3723: The algorithm $p^\best$ constructed in the last subsection,

3724: depending on $\tilde l$, $\tilde t$ and $l_P$ but not on $p$, is

3725: effectively more or equally intelligent, according to $\succeq^c$

3726: defined in (\ref{effaiord}) than any such $p$. The size of

3727: $p^\best$ is $l(p^\best)\!=\!O(\ln(\tilde l\!\cdot\!\tilde

3728: t\!\cdot\! l_P))$, the setup-time is

3729: $t_{setup}(p^\best)\!=\!O(l_P\!\cdot\!2^{l_P})$, the computation

3730: time per cycle is $t_{cycle}(p^\best)\!=\!O(2^{\tilde

3731: l}\!\cdot\!\tilde t)$.
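
A sketch of why the theorem holds may be helpful. It rests on three
assumptions that the construction leaves implicit: proofs shorter than
$l_P$ are padded to length $l_P$, a candidate proof string can be checked
in time $O(l_P)$, and the credit claimed by $p^\best$ in cycle $k$ is
taken to be $w_k^{p_k^\best}$. Any $p$ satisfying the premises of the
theorem survives steps 1 and 2, and its output is not altered by step 3,
since it halts within $\tilde t$ steps in every cycle; hence
$p\!\in\!P$. Step 6 then gives, for every $k$ and every history
$\hh y\!\hh x_{<k}$,
\beqn
  w_k^{p_k^\best} \;=\; \max_{p'\in P}w_k^{p'} \;\geq\; w_k^p
\eeqn
which is the condition required in (\ref{effaiord}). For the time
bounds, step 1 checks $2^{l_P}$ strings once, and every cycle runs at
most $|P|\!=\!O(2^{\tilde l})$ programs for at most $\tilde t$ steps
each:
\beqn
  t_{setup}(p^\best) \;=\; O(l_P)\!\cdot\!2^{l_P}
  \quad,\quad
  t_{cycle}(p^\best) \;=\; O(2^{\tilde l})\!\cdot\!\tilde t
\eeqn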

3732:

3733: Roughly speaking, the theorem says, that if there exists a

3734: computable solution to some AI problem at all, the explicitly

3735: constructed algorithm $p^\best$ is such a solution. Although this

3736: theorem is quite general, there are some limitations and open

3737: questions which we discuss in the following.

3738:

3739: %------------------------------%

3740: \paragraph{Limitations and open questions:}

3741: %------------------------------%

3742: \begin{itemize}\parskip=0ex\parsep=0ex%\itemsep=0ex

3743: \item Formally, the total computation time of $p^\best$ for cycles

3744: $1...k$ increases linearly with $k$, i.e. is of order $O(k)$ with

3745: a coefficient $2^{\tilde l}\!\cdot\!\tilde t$. The unreasonably

3746: large factor $2^{\tilde l}$ is a well known drawback in

3747: best/democratic vote models and will be accepted without further comment, whereas the

3748: factor ${\tilde t}$ can be assumed to be of reasonable size. If we

3749: don't take the limit $k\!\to\!\infty$ but consider reasonable $k$,

3750: the practical usefulness of the timebound on $p^\best$ is somewhat

3751: limited, due to the additional additive constant

3752: $O(l_P\!\cdot\!2^{l_P})$. It is much larger than

3753: $k\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t$ as typically

3754: $l_P\!\gg\!l($VA$(p))\!\geq\!l(p)\!\equiv\!\tilde l$.

3755: \item $p^\best$ is superior only to those $p$ which justify their

3756: outputs (by large $w_k^p$). It might be possible that there are

3757: $p$ which produce good outputs $y_k^p$ within reasonable time, but

3758: it takes an unreasonably long time to justify their outputs by

3759: sufficiently high $w_k^p$. We do not think that (from a certain

3760: complexity level onwards) there are policies where the process of

3761: constructing a good output is completely separated from some sort

3762: of justification process. But this justification might not be

3763: translatable (at least within reasonable time) into a reasonable

3764: estimate of $C_{km_k}(p)$.

3765: \item The (inconsistent) programs $p$ must be able to continue

3766: strategies started by other policies. It might happen that a

3767: policy $p$ steers the environment to a direction for which it is

3768: specialized. A 'foreign' policy might be able to displace $p$

3769: only between loosely bounded episodes. There is probably no

3770: problem for factorizable $\mu$. Think of a chess game, where it is

3771: usually very difficult to continue the game/strategy of a

3772: different player. When the game is over, it is usually advantageous

3773: to replace a player by a better one for the next game. There might

3774: also be no problem for sufficiently separable $\mu$.

3775: \item There might be (efficient) valid approximations $p$ for which

3776: VA($p$) is true but not provable, or for which only a very long

3777: ($>\!l_P$) proof exists.

3778: \end{itemize}
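
To make the first point concrete, one can compare the setup cost with the
accumulated per-cycle cost. Ignoring the constants hidden in the
$O(\cdot)$ notation (an assumption made only for this illustration), the
setup time dominates the total computation time as long as
\beqn
  k\!\cdot\!2^{\tilde l}\!\cdot\!\tilde t \;<\; l_P\!\cdot\!2^{l_P}
  \quad\Longleftrightarrow\quad
  k \;<\; \frac{l_P}{\tilde t}\cdot 2^{\,l_P-\tilde l}
\eeqn
The right-hand side grows exponentially in $l_P\!-\!\tilde l$. For the
hypothetical choice $l_P\!=\!2\tilde l$ and $\tilde t\!=\!l_P$, for
instance, the condition reads $k\!<\!2^{\tilde l}$, so the setup phase
would outweigh the work of the first $2^{\tilde l}$ cycles.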

3779:

3780: %------------------------------%

3781: \paragraph{Remarks:}

3782: %------------------------------%

3783: \begin{itemize}\parskip=0ex\parsep=0ex%\itemsep=0ex

3784: \item The idea of suggesting outputs and justifying them by proving

3785: credit bounds implements one aspect of human thinking. There are

3786: several possible reactions to an input. Each reaction possibly has

3787: far reaching consequences. Within a limited time one tries to estimate the

3788: consequences as well as possible. Finally,

3789: each reaction is valued and the best one is selected. What

3790: is inferior to human thinking is that the estimates $w_k^p$ must

3791: be rigorously proved, that the proofs are constructed by blind

3792: extensive search and, further, that {\it all} behaviours $p$ of length

3793: $\leq\!\tilde l$ are checked. It is inferior 'only' in the sense of

3794: necessary computation time but not in the sense of the quality of

3795: the outputs.

3796: \item In practical applications there are often cases with

3797: short and slow programs $p_s$ performing some task $T$, e.g.

3798: the computation of the digits of $\pi$, for which there also exist

3799: long and quick programs $p_l$ too. If it is not too difficult to

3800: prove that this long program is equivalent to the short one, then it is

3801: possible to prove $K(T)\!\leq\!l(p_s)$ within time $t(p_l)$.

3802: Similarly, the method of proving bounds $w_k$ for $C_{km_k}$ can

3803: give high lower bounds without explicitly executing these short

3804: and slow programs, which mainly contribute to $C_{km_k}$.

3805: \item Dovetailing all length and time-limited programs is a well

3806: known elementary idea (typing monkeys). The crucial part

3807: which has been developed here, is the selection criterion for the

3808: most intelligent system.

3809: \item By construction of AI$\xi^{\tilde t\tilde l}$ and due to the enumerability

3810: of $C_{km_k}$, ensuring arbitrarily close approximations of

3811: $C_{km_k}$ we expect that the behaviour of AI$\xi^{\tilde t\tilde l}$

3812: converges to the behaviour of AI$\xi$ in the limit $\tilde

3813: t,\tilde l\!\to\!\infty$ in a sense.

3814: \item Depending on what you know/assume that a program $p$ of size

3815: $\tilde l$ and computation time per cycle $\tilde t$ is able to

3816: achieve, the computable AI$\xi^{\tilde t\tilde l}$ model will have the

3817: same capabilities. For the strongest assumption of the existence of a Turing

3818: machine, which outperforms human intelligence, the AI$\xi^{\tilde

3819: t\tilde l}$ will do too, within the same time frame up to a (unfortunately

3820: very large) constant factor.

3821: \end{itemize}

3822:

3823: \newpage

3824: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3825: \section{Outlook \& Discussion}\label{secOutlook}

3826: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3827: This section contains some discussion of otherwise unmentioned

3828: topics and some (more personal) remarks. It also serves as an outlook

3829: to further research.

3830:

3831: %------------------------------%

3832: \paragraph{Miscellaneous:}

3833: %------------------------------%

3834: \begin{itemize}

3835: \item In game theory \cite{Osb94} one often wants to model the situation of

3836:       simultaneous actions, whereas the AI$\xi$ model has

3837:       serial I/O. Simultaneity can be simulated by withholding the

3838:       environment from the current system's output $y_k$, until

3839:       $x_k$ has been received by the system. Formally, this means

3840:       that $\xi(y\!x_{<k}y\!\pb x_k)$ is independent of $y_k$.

3841:       The AI$\xi$ system is already of simultaneous type in an

3842:       abstract view if the behaviour $p$ is interpreted as the action.

3843:       In this sense, AI$\xi$ is the action $p^\best$ which maximizes

3844:       the utility function (credit), under the assumption that the environment

3845:       acts according to $\xi$. The situation is different from

3846:       game theory as the environment is not modeled to be a second

3847:       'player' that tries to optimize his own utility although it might

3848:       actually be a rational player (see section \ref{secSG}).

3849: \item In various examples we have chosen differently specialized

3850:       input and output spaces $X$ and $Y$. It should be clear

3851:       that, in principle, this is unnecessary, as large enough spaces $X$

3852:       and $Y$, e.g. $2^{32}$ bit, serve every need and can always

3853:       be Turing reduced to the specific presentation needed internally by the

3854:       AI$\xi$ system itself. But it is clear that using a generic

3855:       interface, such as camera and monitor, for learning

3856:       tic-tac-toe, for example, adds the task of learning vision and drawing.

3857: \end{itemize}

3858:

3859: %------------------------------%

3860: \paragraph{Outlook:}

3861: %------------------------------%

3862: \begin{itemize}

3863: \item Rigorous proofs for credit bounds are the major theoretical challenge

3864:       -- general ones as well as tighter bounds for

3865:       special environments $\mu$. Of special importance are suitable (and

3866:       acceptable) conditions to $\mu$, under which $\hh y_k$ and

3867:       finite credit bounds exist for infinite $Y$, $X$ and $m_k$.

3868: \item A direct implementation of the

3869:       AI$\xi^{\tilde t\tilde l}$ model is, at best, possible for toy

3870:       environments due to the large factor $2^{\tilde l}$ in

3871:       computation time. But there are other applications of the AI$\xi$ theory.

3872:       We have seen in several examples how to integrate problem classes

3873:       into the AI$\xi$ model. Conversely, one can downscale the

3874:       AI$\xi$ model by using more restricted forms of $\xi$.

3875:       This could be done in the same way as the theory of universal

3876:       induction has been downscaled with many insights

3877:       to the Minimum Description Length principle

3878:       \cite{LiVi92,Ris89} or to the domain of finite automata \cite{Fed92}.

3879:       The AI$\xi$ model might similarly serve as a super model or as the

3880:       very definition of (universal unbiased) intelligence, from

3881:       which specialized models could be derived.

3882: \item With a reasonable computation time, the AI$\xi$ model

3883:       would be a solution of AI (see next point if you disagree).

3884:       The AI$\xi^{\tilde t\tilde l}$ model was the first step,

3885:       but the elimination of the factor $2^{\tilde l}$ without giving up

3886:       universality will (almost certainly) be a very difficult task.

3887:       One could try to select programs $p$ and prove VA($p$) in a

3888:       more clever way than by mere enumeration, to improve performance

3889:       without destroying

3890:       universality. All kinds of ideas, like genetic algorithms,

3891:       advanced theorem provers and many more could be incorporated. But now we

3892:       are in trouble. We seem to have transferred the AI

3893:       problem just to a different level. This shift has some

3894:       advantages (and also some disadvantages) but presents, in no way, a

3895:       solution.

3896:       Nevertheless, we want to stress that we have reduced the AI

3897:       problem to (mere) computational questions.

3898:       Even the most general other systems the author is aware of either depend on some

3899:       (more than computational) assumptions about the

3900:       environment, or it is far from clear whether they are, indeed, universal and optimal.

3901:       Although computational

3902:       questions are themselves highly complicated, this reduction is a

3903:       non-trivial result. A formal theory of something, even if

3904:       not computable, is often a great step toward solving a

3905:       problem and has also merits of its own, and AI should not be different (see previous item).

3906: \item Many researchers in AI believe that intelligence is something

3907:       complicated and cannot be condensed into a few formulas.

3908:       It is more a combining of enough {\it methods} and much explicit

3909:       {\it knowledge} in the right way. From a theoretical point of

3910:       view, we disagree as the AI$\xi$ model is simple and seems to serve all

3911:       needs. From a practical point of view we agree to the following extent.

3912:       To reduce the computational burden one should

3913:       provide special purpose algorithms ({\it methods}) from the

3914:       very beginning, probably many of them related to reducing

3915:       the complexity of the input and output spaces $X$ and $Y$ by

3916:       appropriate preprocessing {\it methods}.

3917: \item There is no need to incorporate extra {\it knowledge} from the very

3918:       beginning. It can be presented in the first few cycles in

3919:       {\it any} format. As long as the algorithm to interpret the data

3920:       is of size $O(1)$, the AI$\xi$ system will 'understand' the data

3921:       after a few cycles (see section \ref{secEX}). If the

3922:       environment $\mu$ is complicated but extra knowledge

3923:       $z$ makes $K(\mu|z)$ small, one can show that the bound

3924:       (\ref{eukdist}) reduces to $\1d2\ln 2\!\cdot\!K(\mu|z)$

3925:       when $x_1\!\equiv\!z$, i.e.\

3926:       when $z$ is presented in the first cycle. The special

3927:       purpose algorithms could be presented in $x_1$, too, but it

3928:       would be cheating to say that no special purpose algorithms

3929:       had been implemented in AI$\xi$. The boundary between

3930:       implementation and training is unsharp in the AI$\xi$ model.

3931: \item We have not said much about the training

3932:       process itself, as it is not specific to the AI$\xi$ model

3933:       and has been discussed in the literature in various forms and

3934:       disciplines. A serious discussion would be out of place.

3935:       To repeat a truism, it is, of course,

3936:       important to present enough knowledge $x'_k$ and evaluate

3937:       the system output $y_k$ with $c_k$ in a reasonable way.

3938:       To maximize the information content in the credit, one should

3939:       start with simple tasks and give positive reward

3940:       $c_k\!=\!1$ to approximately half of the outputs $y_k$.

3941: \end{itemize}
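
The last recommendation can be motivated by a standard entropy argument.
As a sketch, assume binary credits $c_k\!\in\!\{0,1\}$ and let $\theta$
denote the fraction of outputs rewarded with $c_k\!=\!1$ (the symbol
$\theta$ is introduced here only for illustration). The information
conveyed per credit is then at most the binary entropy
\beqn
  H(\theta) \;=\; -\theta\log_2\theta-(1\!-\!\theta)\log_2(1\!-\!\theta)
\eeqn
which attains its maximum $H(\frac{1}{2})\!=\!1$ bit at
$\theta\!=\!\frac{1}{2}$ and tends to zero for $\theta\!\to\!0$ or
$\theta\!\to\!1$. A system that is rewarded almost always or almost never
therefore learns very little from each individual credit.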

3942:

3943: %------------------------------%

3944: \paragraph{The big questions:}

3945: %------------------------------%

3946: This subsection is devoted to the {\it big} questions of AI in

3947: general and the AI$\xi$ model in particular with a personal touch.

3948:

3949: \begin{itemize}

3950: \item There are two possible objections to AI in general and,

3951:       therefore, also against AI$\xi$ in particular, that we want

3952:       to comment on briefly. Non-computable physics (which is not too

3953:       weird) could make Turing computable AI impossible. As at least the

3954:       world that is relevant for humans seems mainly to be computable,

3955:       we do not believe that it is necessary to integrate non-computable

3956:       devices into an AI system. The (clever and nearly convincing) 'G\"odel'

3957:       argument by Penrose \cite{Pen89} that non-computational physics

3958:       {\it must} exist and {\it is} relevant to the brain has (in our opinion convincing)

3959:       loopholes.

3960: \item A more serious problem is the evolutionary information

3961:       gathering process. It has been shown that the

3962:       'number of wisdom' $\Omega$ contains a very compact

3963:       tabulation of $2^n$ undecidable problems in its very first

3964:       $n$ binary digits \cite{Cha91}. $\Omega$ is only enumerable

3965:       with computation time increasing more rapidly with $n$ than any

3966:       recursive function.

3967:       The enormous computational power of evolution could

3968:       have developed and coded something like $\Omega$ into

3969:       our genes, which significantly guides human reasoning.

3970:       In short: Intelligence could be something complicated

3971:       and evolution toward it from even a cleverly designed

3972:       algorithm of size $O(1)$ could be too slow. As evolution has

3973:       already taken place, we could add the information from our

3974:       genes or brain structure to any/our AI system, but this means that

3975:       the important part is still missing and a simple formal definition

3976:       of AI is impossible in principle.

3977: \item For the probably {\it biggest question} about {\it consciousness}

3978:       we want to give a physical analogy. Quantum (field) theory is

3979:       the most accurate and universal physical theory ever

3980:       invented. Although it was already developed in the 1930s, the {\it

3981:       big} question regarding the interpretation of the wave function collapse

3982:       is still open. Although extremely interesting from a

3983:       philosophical point of view, it is completely irrelevant from

3984:       a practical point of view\footnote{In the theory of everything, the

3985:       collapse might become of 'practical' importance and must or will be

3986:       solved.}.

3987:       We believe the same to be true

3988:       for {\it consciousness} in the field of Artificial

3989:       Intelligence. Philosophically highly interesting but

3990:       practically unimportant. Whether consciousness {\it will} be

3991:       explained some day is another question.

3992: \end{itemize}
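
For reference, the `number of wisdom' of the second item is Chaitin's
halting probability. A sketch of the standard definition, where the
universal prefix Turing machine $U$ and the length $l(p)$ of a program
$p$ are notation assumed here:
\begin{displaymath}
  \Omega \;=\; \sum_{p\,:\,U(p)\;\mathrm{halts}} 2^{-l(p)},
  \qquad 0<\Omega<1 .
\end{displaymath}
Given the first $n$ binary digits $\Omega_{1:n}$, one can run all
programs in parallel and accumulate $2^{-l(p)}$ for every program that
has halted, until the sum is at least $0.\Omega_{1:n}$. Any program of
length $\leq n$ that has not halted by then never halts, since it would
otherwise add at least $2^{-n}$ and push the sum beyond
$\Omega<0.\Omega_{1:n}+2^{-n}$. This is the sense in which $n$ digits
tabulate the halting problem for all programs of length up to $n$, and
the waiting time in this procedure is what grows faster than any
recursive function of $n$.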

3993:

3994: \newpage

3995: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3996: \section{Conclusions}\label{secCon}

3997: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3998: All tasks which require intelligence to be solved can naturally be

3999: formulated as a maximization of some expected utility in the

4000: framework of agents. We gave a functional (\ref{pbestfunc}) and an

4001: iterative (\ref{ydotrec}) formulation of such a decision theoretic

4002: agent, which is general enough to cover all AI problem classes,

4003: as has been demonstrated by several examples. The main remaining

4004: problem is the unknown prior probability distribution $\mu^{AI}$

4005: of the environment(s). Conventional learning algorithms are

4006: unsuitable, because they can neither handle large (unstructured)

4007: state spaces, nor do they converge in the theoretically minimal

4008: number of cycles, nor can they handle non-stationary environments

4009: appropriately. On the other hand, the universal semimeasure $\xi$

4010: (\ref{xidef}), based on ideas from algorithmic information theory,

4011: solves the problem of the unknown prior distribution for induction

4012: problems. No explicit learning procedure is necessary, as $\xi$

4013: automatically converges to $\mu$. We unified the theory of

4014: universal sequence prediction with the decision theoretic agent by

4015: replacing the unknown true prior $\mu^{AI}$ by an appropriately

4016: generalized universal semimeasure $\xi^{AI}$. We gave strong

4017: arguments that the resulting AI$\xi$ model is the most

4018: intelligent, parameterless and environment/application-independent model

4019: possible. We defined an intelligence order relation

4020: (\ref{aiorder}) to give a rigorous meaning to this claim.

4021: Furthermore, possible solutions to the horizon problem have been

4022: discussed. For a number of problem classes, we outlined in sections

4023: \ref{secSP}--\ref{secEX} how the AI$\xi$ model can solve them.

4024: They include sequence prediction, strategic games, function

4025: minimization and, especially, how AI$\xi$ learns to learn

4026: supervised. The list could easily be extended to other problem

4027: classes like classification, function inversion and many others.

4028: The major drawback of the AI$\xi$ model is that it is

4029: uncomputable, or more precisely, only asymptotically computable,

4030: which makes an implementation impossible. To overcome this

4031: problem, we constructed a modified model AI$\xi^{\tilde t\tilde

4032: l}$, which is still effectively more intelligent than any other

4033: time $\tilde t$ and space $\tilde l$ bounded algorithm. The

4034: computation time of AI$\xi^{\tilde t\tilde l}$ is of the order

4035: $\tilde t\!\cdot\!2^{\tilde l}$. Possible further research has

4036: been discussed. The main directions could be to prove general

4037: and special credit bounds, to use AI$\xi$ as a super model and

4038: explore its relation to other specialized models, and finally to

4039: improve performance with or without giving up universality.
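
To give a feeling for the order $\tilde t\!\cdot\!2^{\tilde l}$, consider
a purely illustrative calculation with hypothetical values that are not
taken from the analysis above, namely $\tilde l\!=\!100$ bits and
$\tilde t\!=\!10^6$ steps:
\begin{displaymath}
  \tilde t\cdot 2^{\tilde l} \;=\; 10^{6}\cdot 2^{100}
  \;\approx\; 10^{6}\cdot 1.27\cdot 10^{30}
  \;\approx\; 1.3\cdot 10^{36} .
\end{displaymath}
Every additional bit of $\tilde l$ doubles this number, whereas the
dependence on $\tilde t$ is only linear.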

4040:

4041: \newpage

4042: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

4043: %%                B i b l i o g r a p h y                    %%

4044: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

4045: \addcontentsline{toc}{section}{Literature}

4046: \parskip=0ex plus 1ex minus 1ex

4047: \begin{thebibliography}{9}\parskip=0ex\parsep=0ex\itemsep=0ex

4048: \bibitem{Ang83}

4049:   {\bf D. Angluin, C. H. Smith}:

4050:   {\it Inductive inference: Theory and methods};

4051:   {\rm Comput. Surveys, 15:3, (1983) 237--269 }.

4052: \bibitem{Bay63}

4053:   {\bf T. Bayes}:

4054:   {\it An essay towards solving a problem in the doctrine of chances};

4055:   {\rm Philos. Trans. Royal Soc., 53 (1763) 376--398}.

4056: \bibitem{Cha66}

4057:   {\bf G.J. Chaitin}:

4058:   {\it On the length of programs for computing finite binary sequences};

4059:   {\rm Journal A.C.M. 13:4 (1966) 547--569 and J. Assoc. Comput. Mach., 16 (1969) 145--159}.

4060: \bibitem{Cha75}

4061:   {\bf G.J. Chaitin}:

4062:   {\it A theory of program size formally identical to information theory};

4063:   {\rm J. Assoc. Comput. Mach. 22 (1975) 329--340}.

4064: \bibitem{Cha91}

4065:   {\bf G.J. Chaitin}:

4066:   {\it Algorithmic information and evolution};

4067:   {\rm in O.T. Solbrig and G. Nicolis, Perspectives on

4068:        Biological Complexity, IUBS Press (1991) 51--60}.

4069: \bibitem{Che85}

4070:   {\bf P. Cheeseman}:

4071:   {\it In defense of probability theory};

4072:   {\rm Proc. 9th int. joint conference on AI, IJCAI-85 (1985) 1002--1009}.

4073:   {\it An inquiry into computer understanding};

4074:   {\rm Comp. intelligence 4:1 (1988) 58--66}.

4075: \bibitem{Con97}

4076:   {\bf M. Conte et al.}:

4077:   {\it Genetic programming estimates of Kolmogorov complexity};

4078:   {\rm Proc. 7th Int. Conf. on GA (1997) 743--750}.

4079: \bibitem{Dal73}

4080:   {\bf R.P. Daley}:

4081:   {\it Minimal-program complexity of sequences with restricted resources};

4082:   {\rm Inform. Contr. 23 (1973) 301--312 }.

4083:   {\it On the inference of optimal descriptions};

4084:   {\rm Theoret. Comput. Sci. 4 (1977) 301--319}.

4085: \bibitem{Fed92}

4086:   {\bf M. Feder, N. Merhav, M. Gutman}:

4087:   {\it Universal prediction of individual sequences};

4088:   {\rm IEEE Trans. Inform. Theory, 38:4, (1992), 1258--1270}.

4089: \bibitem{Fud91}

4090:   {\bf D. Fudenberg, J. Tirole}:

4091:   {\it Game Theory};

4092:   {\rm The MIT Press (1991)}.

4093: \bibitem{Gac74}

4094:   {\bf P. G\'acs}:

4095:   {\it On the symmetry of algorithmic information};

4096:   {\rm Soviet Math. Dokl. 15 (1974) 1477--1480}.

4097: \bibitem{Hume}

4098:   {\bf D. Hume}:

4099:   {\it Treatise of Human Nature};

4100:   {\rm Book I (1739)}.

4101: \bibitem{Hut99}

4102:   {\bf M. Hutter}:

4103:   {\it New Error Bounds for Solomonoff Sequence Prediction};

4104:   {\rm Submitted to J. Comput. System Sci. (2000)},

4105:   {\rm http://xxx.lanl.gov/abs/cs.AI/9912008}.

4106: \bibitem{Hut00e}

4107:   {\bf M. Hutter}:

4108:   {\it Optimality of non-binary universal Solomonoff sequence prediction};

4109:   {\rm In progress}.

4110: \bibitem{Kae96}

4111:   {\bf L.P. Kaelbling, M.L. Littman, A.W. Moore}:

4112:   {\it Reinforcement learning: a survey};

4113:   {\rm Journal of AI research 4 (1996) 237--285}.

4114: \bibitem{Ko86}

4115:   {\bf K. Ko}:

4116:   {\it On the definition of infinite pseudo-random sequences};

4117:   {\rm Theoret. Comput. Sci 48 (1986) 9--34}.

4118: \bibitem{Kol65}

4119:   {\bf A.N. Kolmogorov}:

4120:   {\it Three approaches to the quantitative definition of information};

4121:   {\rm Problems Inform. Transmission, 1:1 (1965) 1--7}.

4122: \bibitem{Lev73}

4123:   {\bf L.A. Levin}:

4124:   {\it Universal sequential search problems};

4125:   {\rm Problems of Inform. Transmission, 9:3 (1973) 265--266}.

4126: \bibitem{Lev74}

4127:   {\bf L.A. Levin}:

4128:   {\it Laws of information conservation (non-growth) and

4129:        aspects of the foundation of probability theory};

4130:   {\rm Problems Inform. Transmission, 10 (1974), 206--210}.

4131: \bibitem{LiWa89}

4132:   {\bf N. Littlestone, M.K. Warmuth}:

4133:   {\it The weighted majority algorithm};

4134:   {\rm Proc. 30th IEEE Symp. on Found. of Comp. Science (1989) 256--261}.

4135: \bibitem{LiVi91}

4136:   {\bf M. Li and P.M.B. Vit\'anyi}:

4137:   {\it Learning simple concepts under simple distributions};

4138:   {\rm SIAM J. Comput., 20:5 (1991), 915--935}.

4139: \bibitem{LiVi92}

4140:   {\bf M. Li and P.M.B. Vit\'anyi}:

4141:   {\it Inductive reasoning and Kolmogorov complexity};

4142:   {\rm J. Comput. System Sci., 44:2 (1992), 343--384}.

4143: \bibitem{LiVi92a}

4144:   {\bf M. Li and P.M.B. Vit\'anyi}:

4145:   {\it Philosophical issues in Kolmogorov complexity};

4146:   {\rm Lecture Notes Comput. Sci. 623 (1992), 1--15}.

4147: \bibitem{LiVi93}

4148:   {\bf M. Li and P.M.B. Vit\'anyi}:

4149:   {\it An Introduction to Kolmogorov Complexity and its Applications};

4150:   {\rm Springer-Verlag, New York, 2nd Edition, 1997}.

4151: \bibitem{Mic66}

4152:   {\bf D. Michie}:

4153:   {\it Game Playing and game-learning automata};

4154:   {\rm In Fox, L., editor, Adv. in Prog. and Non-Numerical Comp.,

4155:        183--200 (1966) Pergamon, NY}.

4156: \bibitem{Osb94}

4157:   {\bf M.J. Osborne, A. Rubinstein}:

4158:   {\it A course in game theory};

4159:   {\rm MIT Press (1994)}.

4160: \bibitem{Pea88}

4161:   {\bf J. Pearl}:

4162:   {\it Probabilistic reasoning in intelligent systems:

4163:        Networks of plausible inference};

4164:   {\rm Morgan Kaufmann, San Mateo, California (1988)}.

4165: \bibitem{Pen89}

4166:   {\bf R. Penrose}:

4167:   {\it The emperor's new mind};

4168:   {\rm Oxford Univ. Press (1989)}.

4169:   {\it Shadows of the mind};

4170:   {\rm Oxford Univ. Press (1994)}.

4171: \bibitem{Pin97}

4172:   {\bf X. Pintaro, E. Fuentes}:

4173:   {\it A forecasting algorithm based on information theory};

4174:   {\rm Technical report, Centre Univ. d'Informatique, University of Geneva (1997)}.

4175: \bibitem{Ris89}

4176:   {\bf J.J. Rissanen}:

4177:   {\it Stochastic Complexity and Statistical Inquiry};

4178:   {\rm World Scientific Publishers (1989)}.

4179: \bibitem{Rus95}

4180:   {\bf S. Russell, P. Norvig}:

4181:   {\it Artificial Intelligence: A modern approach};

4182:   {\rm Prentice Hall (1995)}.

4183: \bibitem{Sch95}

4184:   {\bf J. Schmidhuber}:

4185:   {\it Discovering solutions with low Kolmogorov complexity and

4186:        high generalization capability};

4187:   {\rm Proc. 12th Int. Conf. on Machine Learning (1995) 488--496}.

4188: \bibitem{Sch96}

4189:   {\bf J. Schmidhuber, M. Wiering}:

4190:   {\it Solving POMDP's with Levin search and EIRA};

4191:   {\rm Proc. 13th Int. Conf. on Machine Learning (1996) 534--542}.

4192: \bibitem{Sch99}

4193:   {\bf M. Schmidt}:

4194:   {\it Time-Bounded Kolmogorov Complexity May Help in Search

4195:        for Extra Terrestrial Intelligence (SETI) };

4196:   {\rm Bulletin of the European Association for Theor. Comp. Sci. 67 (1999) 176--180}.

4197: \bibitem{Sol64}

4198:   {\bf R.J. Solomonoff}:

4199:   {\it A formal theory of inductive inference, Part 1 and 2};

4200:   {\rm Inform. Contr., 7 (1964), 1--22, 224--254}.

4201: \bibitem{Sol78}

4202:   {\bf R.J. Solomonoff}:

4203:   {\it Complexity-based induction systems: comparisons and convergence theorems};

4204:   {\rm IEEE Trans. Inform. Theory, IT-24:4, (1978), 422--432}.

4205: \bibitem{Sol86}

4206:   {\bf R.J. Solomonoff}:

4207:   {\it An application of algorithmic probability to problems in artificial intelligence};

4208:   {\rm In L.N. Kanal and J.F. Lemmer, editors, Uncertainty in Artificial Intelligence,

4209:        North-Holland, (1986), 473--491}.

4210: \bibitem{Sol97}

4211:   {\bf R.J. Solomonoff}:

4212:   {\it The discovery of algorithmic probability};

4213:   {\rm J. Comput. System Sci. 55 (1997), 73--88}.

4214: \bibitem{Sol99}

4215:   {\bf R.J. Solomonoff}:

4216:   {\it Two kinds of probabilistic induction};

4217:   {\rm Comput. Journal 42:4 (1999), 256--259}.

4218: \bibitem{Neu44}

4219:   {\bf J. von Neumann, O. Morgenstern}:

4220:   {\it The theory of games and economic behaviour};

4221:   {\rm Princeton Univ. Press (1944)}.

4222: \bibitem{Val84}

4223:   {\bf L.G. Valiant}:

4224:   {\it A theory of the learnable};

4225:   {\rm Comm. Assoc. Comput. Mach., 27 (1984) 1134--1142}.

4226: \bibitem{Vov92}

4227:   {\bf V. G. Vovk}:

4228:   {\it Universal forecasting algorithms};

4229:   {\rm Inform. and Comput., 96, (1992), 245--277}.

4230: \bibitem{Vov98}

4231:   {\bf V. Vovk, C. Watkins}:

4232:   {\it Universal portfolio selection};

4233:   {\rm Proceedings 11th Ann. Conf. on Comp. Learning Theory (1998) 12--23}.

4234: \bibitem{Wil70}

4235:   {\bf D.G. Willis}:

4236:   {\it Computational complexity and probability constructions};

4237:   {\rm J. Ass. Comput. Mach., 4 (1970), 241--259}.

4238: \end{thebibliography}

4239:

4240: \end{document}

4241:

4242: %---------------------------------------------------------------

4243: