0012:cs0012011/cs0012011

1:

2: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3: %%   Towards a Universal Theory of Artificial Intelligence   %%

4: %%                        based on                           %%

5: %%  Algorithmic Probability and Sequential Decision Theory   %%

6: %%                                                           %%

7: %%     Marcus Hutter: Start: 09.12.00  LastEdit: 16.12.00    %%

8: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

9:

10: \newif\ifijcai\ijcaifalse % TechReport version

11:

12: %-------------------------------%

13: %  My Document-Style            %

14: %-------------------------------%

15: \documentclass[10pt,twocolumn]{article}

16:

17: \setlength\headheight{0pt}   \setlength\headsep{0pt}

18: \topmargin=0cm  \oddsidemargin=-1cm \evensidemargin=-1cm

19: \textwidth=18cm \textheight=23cm %\unitlength=1mm \sloppy

20:

21: %-------------------------------%

22: %   Macro-Definitions           %

23: %-------------------------------%

24:

25: \renewenvironment{abstract}{\centerline{\bf

26: Abstract}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip 1ex}

27: \newenvironment{keywords}{\centerline{\bf

28: Key Words}\vspace{0.5ex}\begin{quote}\small}{\par\end{quote}\vskip 1ex}

29: \def\eqd{\stackrel{\bullet}{=}}

30: \def\ff{\Longrightarrow}

31: \def\gdw{\Longleftrightarrow}

32: \def\toinfty#1{\stackrel{#1\to\infty}{\longrightarrow}}

33: \def\gtapprox{\buildrel{\lower.7ex\hbox{$>$}}\over

34:                        {\lower.7ex\hbox{$\sim$}}}

35: \def\nq{\hspace{-1em}}

36: \def\look{\(\uparrow\)}

37: \def\ignore#1{}

38: \def\deltabar{{\delta\!\!\!^{-}}}

39: \def\qed{\sqcap\!\!\!\!\sqcup}

40: \def\odt{{\textstyle{1\over 2}}}

41: \def\odf{{\textstyle{1\over 4}}}

42: \def\odA{{\textstyle{1\over A}}}

43: \def\hbar{h\!\!\!\!^{-}\,}

44: \def\dbar{d\!\!^{-}\!}

45: \def\eps{\varepsilon}

46: \def\beq{\begin{equation}}

47: \def\eeq{\end{equation}}

48: \def\beqn{\begin{displaymath}}

49: \def\eeqn{\end{displaymath}}

50: \def\bqa{\begin{equation}\begin{array}{c}}

51: \def\eqa{\end{array}\end{equation}}

52: \def\bqan{\begin{displaymath}\begin{array}{c}}

53: \def\eqan{\end{array}\end{displaymath}}

54: \def\pb{\underline}                       % probability notation

55: \def\pb#1{\underline{#1}}                 % probability notation

56: \def\blank{{\,_\sqcup\,}}                 % blank position

57: \def\maxarg{\mathop{\rm maxarg}}          % maxarg

58: \def\minarg{\mathop{\rm minarg}}          % minarg

59: \def\hh#1{{\dot{#1}}}                     % historic I/O

60: \def\best{*}                              % or {best}

61: \def\vec#1{{\bf #1}}

62: \def\length{{l}}

63: \ifijcai\def\paragraph#1{{\bf #1}}\fi

64:

65: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

66: %                      T i t l e - P a g e                      %

67: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

68:

69: \title{\bf \Large Towards a Universal Theory of Artificial Intelligence

70: based on \\ Algorithmic Probability and Sequential Decision Theory}

71:

72: {\author{ Marcus Hutter \\[2mm]

73:   {\small IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland}  \\

74:   {\small marcus@idsia.ch \qquad http://www.idsia.ch} \\

75:   {\small Technical Report IDSIA-14-00, 16. December 2000}

76: }

77:

78: \date{}

79:

80: \begin{document}

81:

82: \maketitle

83:

84: \begin{abstract}

85: Decision theory formally solves the problem of rational agents in

86: uncertain worlds if the true environmental probability

87: distribution is known. Solomonoff's theory of universal induction

88: formally solves the problem of sequence prediction for unknown

89: distribution. We unify both theories and give strong arguments

90: that the resulting universal AI$\xi$ model behaves optimally in any

91: computable environment. The major drawback of the AI$\xi$ model is

92: that it is uncomputable. To overcome this problem, we construct a

93: modified algorithm AI$\xi^{tl}$, which is still superior to any

94: other time $t$ and space $l$ bounded agent. The computation time

95: of AI$\xi^{tl}$ is of the order $t\!\cdot\!2^l$.\\

96: \end{abstract}

97:

98: \ifijcai\else

99: \begin{keywords}

100: Rational agents,

101: sequential decision theory, universal Solomonoff induction,

102: algorithmic probability, reinforcement learning, computational

103: complexity, theorem proving, probabilistic reasoning, Kolmogorov

104: complexity, Levin search.

105: \end{keywords}

106: \fi

107:

108: % ACM Classification

109: %I.2; I.2.3; I.2.6; I.2.8; F.1.3; F.2

110: %I.2. Artificial Intelligence,

111: %I.2.3. Deduction and Theorem Proving

112: %I.2.6. Learning

113: %I.2.8. Problem Solving, Control Methods and Search

114: %F.1.3. Complexity Classes

115: %F.2. Analysis of Algorithms and Problem Complexity

116:

117: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

118: \section{Introduction}\label{int}

119: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

120:

121: The most general framework for Artificial Intelligence is the

122: picture of an {\em agent} interacting with an environment

123: \cite{Russell:95}. If the goal is not pre-specified, the agent has

124: to learn by occasional reinforcement feedback \cite{Sutton:98}. If

125: the agent shall be universal, no assumption about the environment

126: may be made, besides that there {\it exists} some exploitable

127: structure at all. We may ask for the most intelligent way an agent

128: could behave, or about the optimal way of learning in terms of

129: real world interaction cycles. {\em Decision theory}

130: formally\footnote{By a formal solution we mean a rigorous

131: mathematical definition, uniquely specifying the solution. For

132: problems considered here this always implies the existence of an

133: algorithm which asymptotically converges to the correct solution.}

134: solves this problem only if the true environmental probability

135: distribution is known (e.g. Backgammon)

136: \cite{Bellman:57,Bertsekas:96}. \cite{Solomonoff:64,Solomonoff:78}

137: formally solves the problem of {\em induction} if the true

138: distribution is unknown but only if the agent cannot influence the

139: environment (e.g.\ weather forecasts) \cite{Li:97}. We combine

140: both ideas and get {\em a parameterless model AI$\xi$ of an acting

141: agent which we claim to behave optimally in any computable

142: environment} (e.g.\ prisoner or auction problems, poker, car

143: driving). To get an effective solution, a modification

144: AI$\xi^{tl}$, superior to any other time $t$ and space $l$ bounded

145: agent, is constructed. The computation time of AI$\xi^{tl}$ is of

146: the order $t\!\cdot\!2^l$. The main goal of this work is to derive

147: and discuss the AI$\xi$ and the AI$\xi^{tl}$ model, and to clarify

148: the meaning of {\it universal}, {\it optimal}, {\it superior},

149: {\it etc}. Details can be found in \cite{Hutter:00f}.

150:

151: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

152: \section{Rational Agents \& Sequential Decisions}\label{secAImurec}

153: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

154:

155: %------------------------------%

156: \paragraph{Agents in probabilistic environments:}

157: %------------------------------%

158: A very general framework for intelligent systems is that of

159: rational agents \cite{Russell:95}. In cycle $k$, an agent performs

160: {\em action} $y_k\!\in\!Y$ (output word) which results in a {\em

161: perception} $x_k\!\in\!X$ (input word), followed by cycle

162: $k\!+\!1$ and so on. If agent and environment are deterministic

163: and computable, the entanglement of both can be modeled by two

164: Turing machines with two common tapes (and some private tapes)

165: containing the action stream $y_1y_2y_3...$ and the perception

166: stream $x_1x_2x_3...$ (The meaning of $x_k\!\equiv\!x'_kr_k$ is

167: explained in the next paragraph):

168:

169: \begin{center}\label{cyberpic}

170: \small\unitlength=0.8mm

171: \special{em:linewidth 0.4pt}

172: \linethickness{0.4pt}

173: \begin{picture}(106,47)

174: \thinlines

175: \put(1,41){\framebox(10,6)[cc]{$x'_1$}}

176: \put(11,41){\framebox(6,6)[cc]{$r_1$}}

177: \put(17,41){\framebox(10,6)[cc]{$x'_2$}}

178: \put(27,41){\framebox(6,6)[cc]{$r_2$}}

179: \put(33,41){\framebox(10,6)[cc]{$x'_3$}}

180: \put(43,41){\framebox(6,6)[cc]{$r_3$}}

181: \put(49,41){\framebox(10,6)[cc]{$x'_4$}}

182: \put(59,41){\framebox(6,6)[cc]{$r_4$}}

183: \put(65,41){\framebox(10,6)[cc]{$x'_5$}}

184: \put(75,41){\framebox(6,6)[cc]{$r_5$}}

185: \put(81,41){\framebox(10,6)[cc]{$x'_6$}}

186: \put(91,41){\framebox(6,6)[cc]{$r_6$}}

187: \put(102,44){\makebox(0,0)[cc]{...}}

188: \put(1,1){\framebox(16,6)[cc]{$y_1$}}

189: \put(17,1){\framebox(16,6)[cc]{$y_2$}}

190: \put(33,1){\framebox(16,6)[cc]{$y_3$}}

191: \put(49,1){\framebox(16,6)[cc]{$y_4$}}

192: \put(65,1){\framebox(16,6)[cc]{$y_5$}}

193: \put(81,1){\framebox(16,6)[cc]{$y_6$}}

194: \put(102,4){\makebox(0,0)[cc]{...}}

195: \put(97,47){\line(1,0){9}}

196: \put(97,41){\line(1,0){9}}

197: \put(97,7){\line(1,0){9}}

198: \put(97,1){\line(0,0){0}}

199: \put(97,1){\line(1,0){9}}

200: \put(1,21){\framebox(16,6)[cc]{working}}

201: \thicklines

202: \put(17,17){\framebox(20,14)[cc]{$\displaystyle{Agent\atop\bf p}$}}

203: \thinlines

204: \put(37,27){\line(1,0){14}}

205: \put(37,21){\line(1,0){14}}

206: \put(39,24){\makebox(0,0)[lc]{tape ...}}

207: \put(56,21){\framebox(16,6)[cc]{working}}

208: \thicklines

209: \put(72,17){\framebox(20,14)[cc]{$\displaystyle{Environ-\atop ment\quad\bf q}$}}

210: \thinlines

211: \put(92,27){\line(1,0){14}}

212: \put(92,21){\line(1,0){14}}

213: \put(94,24){\makebox(0,0)[lc]{tape ...}}

214: \thicklines

215: \put(54,41){\vector(-3,-1){29}}

216: \put(84,31){\vector(-3,1){30}}

217: \put(54,7){\vector(3,1){30}}

218: \put(25,17){\vector(3,-1){29}}

219: \end{picture}

220: \end{center}

221:

222: $p$ is the {\em policy} of the agent interacting with environment

223: $q$. We write $p(x_{<k})\!=\!y_{1:k}$ to denote the output

224: $y_{1:k}\!\equiv\!y_1...y_k$ of the agent $p$ on input

225: $x_{<k}\!\equiv\!x_1...x_{k-1}$ and similarly $q(y_{1:k})\!=\!x_{1:k}$

226: for the environment $q$. We call Turing machines

227: $p$ and $q$ behaving in this way {\it chronological}. In the more

228: general case of a {\em probabilistic environment}, given the

229: history $y\!x_{<k}y_k\!\equiv\!y_1x_1...y_{k-1}x_{k-1}y_k$, the

230: probability that the environment leads to perception $x_k$ in

231: cycle $k$ is (by definition) $\mu(y\!x_{<k}y\!\pb x_k)$. The

232: underlined argument $\pb x_k$ in $\mu$ is a probability variable

233: and the other non-underlined arguments $y\!x_{<k}y_k$ represent

234: conditions. We call probability distributions like $\mu$ {\it

235: chronological}.
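
As a sketch of what the chronological property amounts to (the product
form is our reading of the notation above, added for illustration), the
conditional probabilities chain to the probability of a whole
perception sequence,
\beqn
  \mu(y\!\pb x_{1:n}) \;=\; \prod_{k=1}^n \mu(y\!x_{<k}y\!\pb x_k),
\eeqn
and, since $x_k$ cannot depend on actions later than $y_k$, the marginal
$\sum_{x_n}\mu(y\!\pb x_{1:n})=\mu(y\!\pb x_{<n})$ is independent of $y_n$.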

236:

237: %------------------------------%

238: \paragraph{The AI$\mu$ Model:}

239: %------------------------------%

240: The goal of the agent is to maximize future {\em rewards}, which are

241: provided by the environment through the inputs $x_k$. The inputs

242: $x_k\!\equiv\!x'_kr_k$ are divided into a regular part $x'_k$ and

243: some (possibly empty or delayed) reward $r_k$. The $\mu$-expected

244: reward sum of future cycles $k$ to $m$ with outputs

245: $y_{k:m}\!=\!y_{k:m}^p$ generated by the agent's policy $p$

246: can be written compactly as

247: \begin{equation}\label{vpdef}

248:   V_\mu^p(\hh y\!\hh x_{<k}) \!:=\!\!

249:   \!\!\sum_{x_k...x_m}\!\!

250:   (r_k\!+...+\!r_m)

251:   \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m}),

252: \end{equation}

253: where $m$ is the {\em lifespan} of the agent,

254: and the dots above $\hh y\!\hh x_{<k}$

255: indicate the actual action and perception history.

256: The $\mu$-expected reward sum of future cycles $k$ to $m$

257: with outputs $y_i$ generated by the {\em ideal agent}, which

258: maximizes the expected future rewards, is

259: \begin{equation}\label{voptdef}

260:   V_\mu^\best(\hh y\!\hh x_{<k}) :=

261:   \max_{y_k}\!\sum_{x_k}...

262:   \max_{y_{m}}\!\sum_{x_{m}}

263:   (r_k\!+...+\!r_m)

264:   \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m}),

265: \end{equation}

266: i.e.\ the best expected reward

267: is obtained by averaging over the $x_i$ and

268: maximizing over the $y_i$. This has to be done in chronological

269: order to correctly incorporate the dependency of $x_i$ and $y_i$

270: on the history. The output $\hh y_k$, which achieves the maximal value,

271: defines {\em the AI$\mu$ model}:

272: \begin{equation}\label{ydotrec}

273:   \hh y_k :=

274:   \maxarg_{y_k}\!\sum_{x_k}...

275:   \max_{y_{m}}\!\sum_{x_{m}}

276:   (r_k\!+...+\!r_m)

277:   \mu(\hh y\!\hh x_{<k}y\!\pb x_{k:m}).

278: \end{equation}

279: The AI$\mu$ model is optimal in the sense that no other policy

280: leads to higher $\mu$-expected reward. A detailed derivation and

281: other recursive and functional versions can be found in

282: \cite{Hutter:00f}.
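
To illustrate (\ref{ydotrec}) in the simplest case, consider a
hypothetical last cycle $k\!=\!m$ with two actions $Y\!=\!\{a,b\}$, an
empty regular part $x'_k$, and a binary reward $r_k\!\in\!\{0,1\}$. The
numbers are invented for illustration only. If, given the actual
history, $\mu(\hh y\!\hh x_{<k}a\pb 1)\!=\!0.3$ and
$\mu(\hh y\!\hh x_{<k}b\pb 1)\!=\!0.6$, then
\beqn
  \sum_{x_k} r_k\,\mu(\hh y\!\hh x_{<k}y_k\pb x_k) \;=\;
  \left\{ {0.3 \;\;\mbox{for}\;\; y_k=a \atop
           0.6 \;\;\mbox{for}\;\; y_k=b} \right.
\eeqn
so that $\hh y_k\!=\!b$ and $V_\mu^\best(\hh y\!\hh x_{<k})\!=\!0.6$. For
$k\!<\!m$ the same comparison is made after the inner maximizations and
expectations over later cycles have been evaluated.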

283:

284: %------------------------------%

285: \paragraph{Sequential decision theory:}

286: %------------------------------%

287: Eq.\ (\ref{ydotrec}) is essentially an Expectimax algorithm/sequence.

288: One can relate (\ref{ydotrec}) to the Bellman equations

289: \cite{Bellman:57} of sequential decision theory by identifying

290: complete histories $y\!x_{<k}$ with states, $\mu(y\!x_{<k}y\!\pb

291: x_k)$ with the state transition matrix, $V_\mu^\best(y\!x_{<k})$

292: with the value of history/state $y\!x_{<k}$, and $y_k$ with the

293: action in cycle $k$ \cite{Russell:95,Hutter:00f}.

294: Due to the use of complete histories as state space, the AI$\mu$

295: model neither assumes stationarity, nor the Markov property, nor

296: complete accessibility of the environment. Every state occurs at

297: most once in the lifetime of the system.

298: As we have in mind a universal system with complex interactions,

299: the action and perception spaces $Y$ and $X$ are huge (e.g.\ video

300: images), and every action or perception itself occurs usually only

301: once in the lifespan $m$ of the agent. As there is no (obvious)

302: universal similarity relation on the state space, an effective

303: reduction of its size is impossible, but there is no problem

304: in principle in determining $\hh y_k$ as long as $\mu$ is known and

305: computable and $X$, $Y$ and $m$ are finite.
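
The identification with the Bellman equations becomes explicit in a
recursive form of (\ref{voptdef}). The following is a sketch, assuming
that $\mu$ is a proper chronological probability distribution so that
the chain rule applies; it may differ in detail from the recursive
versions in \cite{Hutter:00f}:
\bqan
  V_\mu^\best(y\!x_{<k}) \;=\; \\[1ex]
  \displaystyle \max_{y_k}\sum_{x_k}
  \mu(y\!x_{<k}y\!\pb x_k)\,
  [\,r_k + V_\mu^\best(y\!x_{1:k})\,]
\eqan
with $V_\mu^\best(y\!x_{1:m})\!:=\!0$. Unrolling this recursion from cycle
$k$ to $m$ recovers the alternating maximizations and expectations in
(\ref{voptdef}).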

306:

307: %------------------------------%

308: \paragraph{Reinforcement learning:}

309: %------------------------------%

310: Things dramatically change if $\mu$ is unknown. Reinforcement

311: learning algorithms \cite{Kaelbling:96,Sutton:98,Bertsekas:96} are

312: commonly used in this case to learn the unknown $\mu$. They

313: succeed if the state space is either small or has effectively been

314: made small by generalization or function approximation techniques.

315: In any case, the solutions are either ad hoc, work in restricted

316: domains only, have serious problems with state space exploration

317: versus exploitation, or have non-optimal learning rate. There is

318: no universal and optimal solution to this problem so far. In

319: Section \ref{secAIxi} we present a new model and argue that it

320: formally solves all these problems in an optimal way. The true

321: probability distribution $\mu$ will not be learned directly, but

322: will be replaced by a universal prior $\xi$, which is shown to

323: converge to $\mu$ in a sense.

324:

325:

326: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

327: \section{Algorithmic Complexity and Universal Induction}\label{secAIsp}

328: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

329:

330: %------------------------------%

331: \paragraph{The problem of the unknown environment:}

332: %------------------------------%

333: We have argued that currently there is no universal and optimal solution to

334: reinforcement learning problems. On the other hand,

335: \cite{Solomonoff:64} defined a universal scheme of inductive

336: inference, based on Epicurus' principle of multiple explanations,

337: Ockham's razor, and Bayes' rule for

338: conditional probabilities. For an excellent introduction

339: one should consult the book of \cite{Li:97}. In

340: the following we outline the theory and the basic results.

341:

342: %------------------------------%

343: \paragraph{Kolmogorov complexity and universal probability:}

344: %------------------------------%

345: Let us choose some universal prefix Turing machine $U$ with

346: unidirectional binary input and output tapes and a bidirectional

347: working tape. We can then define the (conditional) prefix

348: Kolmogorov complexity

349: \cite{Chaitin:75,Gacs:74,Kolmogorov:65,Levin:74} as the

350: length $l$ of the shortest

351: program $p$, for which $U$ outputs the binary string

352: $x\!=\!x_{1:n}$ with $x_i\in\!\{0,1\}$:

353: %

354: $$

355:   K(x) \;:=\; \min_p\{l(p): U(p)=x\},

356: $$

357: and given $y$

358: $$

359:   K(x|y) \;:=\; \min_p\{l(p): U(p,y)=x\}.

360: $$

361: %

362: The {\em universal semimeasure} $\xi(\pb x)$ is defined as the

363: probability that the output of $U$

364: starts with $x$ when provided with fair coin flips on the input

365: tape \cite{Solomonoff:64,Solomonoff:78}. It is easy to see that

366: this is equivalent to the formal definition

367: %

368: \beq\label{xidef}

369:   \xi(\pb x)\;:=\;\sum_{p\;:\;\exists\omega:U(p)=x\omega}\nq 2^{-l(p)}

370: \eeq

371: where the sum is over minimal programs $p$ for which $U$

372: outputs a string starting with $x$. $U$ might be non-terminating.

373: As the short programs dominate the sum, $\xi$ is closely related

374: to $K(x)$ as $\xi(\pb x)=2^{-K(x)+O(K(l(x)))}$. $\xi$ has the

375: important universality property \cite{Solomonoff:64} that it

376: dominates every computable probability

377: distribution $\rho$ up to a multiplicative factor depending only

378: on $\rho$ but not on $x$:

379: \beq\label{uni}

380:   \xi(\pb x) \;\geq\; 2^{-K(\rho)-O(1)}\!\cdot\!\rho(\pb x).

381: \eeq

382: %

383: The Kolmogorov complexity of a function like $\rho$ is defined as

384: the length of the shortest self-delimiting coding of a Turing

385: machine computing this function.

386: $\xi$ itself is {\it not} a probability

387: distribution\footnote{It is possible to normalize $\xi$ to a

388: probability distribution as has been done in

389: \cite{Solomonoff:78,Hutter:99} by giving up the enumerability of $\xi$.

390: Bounds (\ref{eukdist}) and (\ref{spebound}) hold for both

391: definitions.}.

392: We have $\xi(\pb{x0})\!+\!\xi(\pb{x1})\!<\!\xi(\pb

393: x)$ because there are programs $p$, which output just $x$, neither

394: followed by $0$ nor $1$. They just stop after printing $x$ or

395: continue forever without any further output. We will call a

396: function $\rho\!\geq 0$ with the properties

397: $\rho(\epsilon)\!\leq\!1$ and $\sum_{x_n}\rho(\pb

398: x_{1:n})\!\leq\!\rho(\pb x_{<n})$ a {\it semimeasure}. $\xi$ is a

399: semimeasure and (\ref{uni}) actually holds for all enumerable

400: semimeasures $\rho$.
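
Taking the negative binary logarithm of (\ref{uni}) gives an equivalent
reading in terms of code lengths, shown here as a short worked step
(for $\rho(\pb x)\!>\!0$):
\beqn
  -\log_2\xi(\pb x) \;\leq\; -\log_2\rho(\pb x) + K(\rho) + O(1).
\eeqn
That is, coding $x$ with $\xi$ costs at most $K(\rho)\!+\!O(1)$ bits more
than coding it with $\rho$, uniformly in $x$.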

401:

402: %------------------------------%

403: \paragraph{Universal sequence prediction:}

404: %------------------------------%

405: (Binary) sequence prediction algorithms try to predict the

406: continuation $x_n$ of a given sequence $x_1...x_{n-1}$. In the

407: following we will assume that the sequences are drawn from

408: a probability distribution and that the true probability of a

409: string starting with $x_1...x_n$ is $\mu(\pb x_{1:n})$. The

410: probability of $x_n$ given $x_{<n}$ hence is $\mu(x_{<n}\pb x_n)$.

411: If we measure prediction quality as the number of correct

412: predictions, the best possible system predicts the $x_n$ with the

413: highest probability. Usually $\mu$ is unknown and the system can

414: only have some belief $\rho$ about the true distribution $\mu$.

415: Now the universal probability $\xi$

416: comes into play: \cite{Solomonoff:78} has proved

417: that the mean squared difference

418: between $\xi$ and $\mu$ is finite for computable $\mu$:

419: \beq\label{eukdist}

420:   \sum_{k=1}^\infty\sum_{x_{1:k}}\mu(\pb x_{<k})

421:   (\xi(x_{<k}\pb x_k)-\mu(x_{<k}\pb x_k))^2

422: \eeq

423: $$

424:   <\; \ln 2\!\cdot\!K(\mu)+O(1).

425: $$

426: A simplified proof can be found in \cite{Hutter:99}. So the

427: difference between $\xi(x_{<n}\pb x_n)$ and $\mu(x_{<n}\pb x_n)$ tends

428: to zero with $\mu$ probability $1$ for {\it any} computable

429: probability distribution $\mu$. The reason for the astonishing

430: property of a single (universal) function to converge to {\it any}

431: computable probability distribution lies in the fact that the sets

432: of $\mu$-random sequences differ for different $\mu$. The

433: universality property (\ref{uni}) is the central ingredient for

434: proving (\ref{eukdist}).
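
One way to read (\ref{eukdist}), given as a sketch that uses only
Markov's inequality: abbreviate
$s_k:=\sum_{x_k}(\xi(x_{<k}\pb x_k)-\mu(x_{<k}\pb x_k))^2$, which depends
on $x_{<k}$, and let ${\bf E}$ (our notation) denote the $\mu$-expectation.
Then (\ref{eukdist}) states $\sum_k{\bf E}[s_k]<\ln 2\!\cdot\!K(\mu)+O(1)$,
and hence for every $\eps\!>\!0$
\beqn
  \sum_{k=1}^\infty \mu[\,s_k\!>\!\eps^2\,] \;<\;
  {\ln 2\!\cdot\!K(\mu)+O(1) \over \eps^2}.
\eeqn
So the $\mu$-expected number of cycles in which $\xi(x_{<k}\pb x_k)$
deviates from $\mu(x_{<k}\pb x_k)$ by more than $\eps$ is finite and
bounded in terms of $K(\mu)$.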

435:

436: %------------------------------%

437: \paragraph{Error bounds:}

438: %------------------------------%

439: Let SP$\rho$ be a probabilistic

440: sequence predictor, predicting $x_n$ with probability

441: $\rho(x_{<n}\pb x_n)$. If $\rho$ is only a semimeasure the

442: SP$\rho$ system might refuse any output in some cycles $n$.

443: Further, we define a deterministic sequence predictor

444: SP$\Theta_\rho$ predicting the $x_n$ with highest $\rho$

445: probability. $\Theta_\rho(x_{<n}\pb x_n)\!:=\!1$ for one $x_n$

446: with $\rho(x_{<n}\pb x_n)\!\geq\!\rho(x_{<n}\pb x'_n)\,\forall

447: x'_n$ and $\Theta_\rho(x_{<n}\pb x_n)\!:=\!0$ otherwise.

448: SP$\Theta_\mu$ is the best prediction scheme when $\mu$ is known.

449: If $\rho(x_{<n}\pb x_n)$ converges quickly to $\mu(x_{<n}\pb x_n)$ the

450: number of additional prediction errors introduced by using

451: $\Theta_\rho$ instead of $\Theta_\mu$ for prediction should be

452: small in some sense.

453: Let us define the total number of expected erroneous predictions

454: the SP$\rho$ system makes for the first $n$ bits:

455: \beq\label{esp}

456:   E_{n\rho} \;:=\; \sum_{k=1}^n\sum_{x_{1:k}}\mu(\pb x_{1:k})

457:   (1\!-\!\rho(x_{<k}\pb x_k)).

458: \eeq

459: The SP$\Theta_\mu$ system is best in the sense that

460: $E_{n\Theta_\mu}\!\leq\!E_{n\rho}$

461: for any $\rho$. In \cite{Hutter:99} it has been shown that

462: SP$\Theta_\xi$ is not much worse

463: \beq\label{spebound}

464:   E_{n\Theta_\xi}\!-\!E_{n\rho} \;\leq\;

465:   H+\sqrt{4E_{n\rho}H+H^2} \;=\;

466:   O(\sqrt{E_{n\rho}})

467: \eeq

468: $$

469:   \mbox{with}\quad H\;<\;\ln 2\!\cdot\!K(\mu)+O(1)

470: $$

471: and the tightest bound for $\rho\!=\!\Theta_\mu$. For finite

472: $E_{\infty\Theta_\mu}$, $E_{\infty\Theta_\xi}$ is finite too. For

473: infinite $E_{\infty\Theta_\mu}$,

474: $E_{n\Theta_\xi}/E_{n\Theta_\mu}\toinfty{n}1$ with rapid

475: convergence. One can hardly imagine any better prediction

476: algorithm as SP$\Theta_\xi$ without extra knowledge about the

477: environment. In \cite{Hutter:00e}, (\ref{eukdist}) and

478: (\ref{spebound}) have been generalized from binary to arbitrary

479: alphabet and to general loss functions. Apart from computational

480: aspects, which are of course very important, the problem of

481: sequence prediction could be viewed as essentially solved.
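
The $O(\sqrt{E_{n\rho}})$ behaviour in (\ref{spebound}) can be made
explicit with the elementary inequality
$\sqrt{a+b}\leq\sqrt a+\sqrt b$ (a worked step added for illustration):
\beqn
  H+\sqrt{4E_{n\rho}H+H^2} \;\leq\; 2H+2\sqrt{E_{n\rho}H}.
\eeqn
For $\rho\!=\!\Theta_\mu$ with $E_{n\Theta_\mu}\!>\!0$, dividing
(\ref{spebound}) by $E_{n\Theta_\mu}$ gives
\beqn
  1 \;\leq\; {E_{n\Theta_\xi}\over E_{n\Theta_\mu}} \;\leq\;
  1+{2H\over E_{n\Theta_\mu}}+2\sqrt{H\over E_{n\Theta_\mu}},
\eeqn
where the left inequality is the optimality of SP$\Theta_\mu$. Since $H$
is bounded independently of $n$, the ratio tends to $1$ whenever
$E_{n\Theta_\mu}\!\to\!\infty$.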

482:

483: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

484: \section{The Universal AI$\xi$ Model}\label{secAIxi}

485: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

486:

487: %------------------------------%

488: \paragraph{Definition of the AI$\xi$ Model:}

489: %------------------------------%

490: We have developed enough formalism to suggest our universal

491: AI$\xi$ model. All we have to do is to suitably generalize the

492: universal semimeasure $\xi$ from the last section and to replace

493: the true but unknown probability $\mu$ in the AI$\mu$ model by

494: this generalized $\xi$. In what sense this AI$\xi$ model is

495: universal and optimal will be discussed thereafter.

496:

497: We define the generalized universal probability $\xi^{AI}$ as the

498: $2^{-l(q)}$ weighted sum over all chronological programs

499: (environments) $q$ which output $x_{1:k}$, similar to

500: (\ref{xidef}) but with $y_{1:k}$ provided on the ``input''

501: tape:

502: \beq\label{uniMAI}

503:   \xi(y\!\pb x_{1:k}) \;:=\;

504:   \nq\sum_{q:q(y_{1:k})=x_{1:k}}\nq 2^{-l(q)}.

505: \eeq

506: %

507: Replacing $\mu$ by $\xi$ in (\ref{ydotrec}) the

508: iterative AI$\xi$ system outputs

509: \beq\label{ydotxi}

510:   \hh y_k :=

511:   \maxarg_{y_k}\!\sum_{x_k}...

512:   \max_{y_m}\!\sum_{x_m}

513:   (r_k\!+...+\!r_m)

514:   \xi(\hh y\!\hh x_{<k}y\!\pb x_{k:m})

515: \eeq

516: in cycle $k$ given the history $\hh y\!\hh x_{<k}$.
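
The conditional $\xi$ appearing in (\ref{ydotxi}) is not spelled out
above. A natural reading, stated here as an assumption in analogy to
Bayes' rule for $\mu$, is
\beqn
  \xi(y\!x_{<k}y\!\pb x_{k:m}) \;:=\;
  {\xi(y\!\pb x_{1:m}) \over \xi(y\!\pb x_{<k})}.
\eeqn
Under this reading the denominator, evaluated at the actual history
$\hh y\!\hh x_{<k}$, does not depend on $y_{k:m}$ or $x_{k:m}$. For
histories of non-zero $\xi$-probability it is a positive constant in
cycle $k$, so replacing the conditional in (\ref{ydotxi}) by the joint
$\xi(y\!\pb x_{1:m})$ leaves the maximizing action $\hh y_k$ unchanged.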

517:

518: %------------------------------%

519: \paragraph{(Non)parameters of AI$\xi$:}

520: %------------------------------%

521: The AI$\xi$ model and its behaviour are completely defined by

522: (\ref{uniMAI}) and (\ref{ydotxi}). It (slightly) depends on the

523: choice of the universal Turing machine. The AI$\xi$ model also

524: depends on the choice of $X$ and $Y$, but we do

525: not expect any bias when the spaces are chosen sufficiently large

526: and simple, e.g.\ all strings of length $2^{16}$. Choosing $I\!\!N$

527: as word space would be ideal, but whether the maxima (or suprema)

528: exist in this case, has to be shown beforehand. The only

529: non-trivial dependence is on the horizon $m$. Ideally we would

530: like to choose $m\!=\!\infty$, but there are several subtleties

531: \ifijcai{discussed in \cite{Hutter:00f},}

532: \else{to be discussed later,}

533: \fi

534: which prevent at least a naive limit

535: $m\!\to\!\infty$. So apart from $m$ and unimportant details, the

536: AI$\xi$ system is uniquely defined by (\ref{ydotxi}) and

537: (\ref{uniMAI}) without adjustable parameters. It does not depend on

538: any assumption about the environment apart from being generated by

539: some computable (but unknown!) probability distribution as we will see.

540:

541: \ifijcai\else

542: %------------------------------%

543: \paragraph{$\xi$ is only a semimeasure:}

544: %------------------------------%

545: One subtlety should be mentioned.

546: As in the SP case, $\xi$ is

547: not a probability distribution but still satisfies the weaker

548: inequalities

549: \beq\label{chrf}

550:   \sum_{x_n}\xi(y\!\pb x_{1:n}) \;\leq\; \xi(y\!\pb x_{<n})

551:   \quad,\quad

552:   \xi(\epsilon) \;\leq\; 1

553: \eeq

554: Note that the sum on the l.h.s.\ is {\it not} independent of

555: $y_n$ unlike for the chronological probability distribution $\mu$.

556: Nevertheless, it is bounded by something (the r.h.s.) which is

557: independent of $y_n$. The reason is that the sum in (\ref{uniMAI})

558: runs over (partial recursive) chronological functions only and the

559: functions $q$ which satisfy $q(y_{1:n})=x_{<n}x'_n$ for some

560: $x'_n\!\in\!X$ are a subset of the functions satisfying

561: $q(y_{<n})=x_{<n}$. We will in general call functions satisfying

562: (\ref{chrf}) {\it chronological semimeasures}. The important point

563: is that the conditional probabilities (\ref{uniMAI}) are $\leq\!1$

564: like for true probability distributions.

565: \fi %extended

566:

567: %------------------------------%

568: \paragraph{Universality of $\xi^{AI}$:}

569: %------------------------------%

570: It can be shown that $\xi^{AI}$ defined in

571: (\ref{uniMAI}) is universal and converges to $\mu^{AI}$

572: analogously to the SP case (\ref{uni}) and (\ref{eukdist}). The

573: proofs are generalizations from the SP case. The actions $y$ are pure

574: spectators and cause no difficulties in the generalization. This

575: will change when we analyze error/value bounds analogously to

576: (\ref{spebound}). The major difference when incorporating $y$ is

577: that in (\ref{xidef}), $U(p)=x\omega$ produces strings starting with $x$,

578: whereas in (\ref{uniMAI}) we can demand $q$ to output exactly $n$

579: words $x_{1:n}$ as $q$ knows $n$ from the number of input words

580: $y_1...y_n$.

581: $\xi^{AI}$ dominates all {\em chronological enumerable

582: semimeasures}

583: \beq\label{uniaixi}

584:   \xi(y\!\pb x_{1:n}) \;\geq\;

585:   2^{-K(\rho)-O(1)}\rho(y\!\pb x_{1:n}).

586: \eeq

587: $\xi$ is a universal element in the sense of (\ref{uniaixi})

588: in the set of all enumerable chronological semimeasures. This can

589: be proved even for infinite (countable) alphabet

590: \cite{Hutter:00f}.

591:

592: %------------------------------%

593: \paragraph{Convergence of $\xi^{AI}$ to $\mu^{AI}$:}

594: %------------------------------%

595: From (\ref{uniaixi}) one can show

596: $$

597:   \sum_{k=1}^n\sum_{x_{1:k}}\mu(y\!\pb x_{<k})

598:   \Big(\mu(y\!x_{<k}y\!\pb x_k)-\xi(y\!x_{<k}y\!\pb x_k)\Big)^2

599: $$

600: \beq\label{eukdistxi}

601:   \;<\; \ln 2\!\cdot\!K(\mu)+O(1)

602: \eeq

603: for computable chronological measures $\mu$. The main

604: complication in generalizing (\ref{eukdist}) to (\ref{eukdistxi})

605: is the generalization to non-binary alphabet \cite{Hutter:00e}.

606: The $y$ are, again, pure spectators.

607: (\ref{eukdistxi}) shows that the $\mu$-expected

608: squared difference of $\mu$ and $\xi$ is finite for computable

609: $\mu$. This, in turn, shows that $\xi(y\!x_{<k}y\!\pb x_k)$

610: converges to $\mu(y\!x_{<k}y\!\pb x_k)$ for $k\!\to\!\infty$ with $\mu$

611: probability 1. If we take a finite product of $\xi$'s and use

612: Bayes' rule, we see that also $\xi(y\!x_{<k}y\!\pb x_{k:k+r})$

613: converges to $\mu(y\!x_{<k}y\!\pb x_{k:k+r})$. More generally, in case of

614: a bounded horizon $h_k\equiv m_k\!-\!k\!+\!1 \leq h_{max}\!<\!\infty$, it follows that

615: \beq\label{aixitomu}

616:   \xi(y\!x_{<k}y\!\pb x_{k:m_k}) \toinfty{k} \mu(y\!x_{<k}y\!\pb x_{k:m_k}).

617: \eeq

618: Convergence is only guaranteed for one (e.g.\ the true) i/o

619: sequence $\hh y\!\hh x_{<k}\hh y\!\hh x_{k:m_k}$ but not for

620: alternate sequences $\hh y\!\hh x_{<k}y\!x_{k:m_k}$. Since

621: (\ref{ydotxi}) takes an average over all possible future actions

622: and perceptions $y\!x_{k:m_k}$, not only the one which will

623: finally occur, (\ref{aixitomu}) does not guarantee $\hh y_k^\xi\!\to\!\hh

624: y_k^\mu$. This

625: gap is already present in the SP$\Theta_\rho$ models, but

626: nevertheless good error bounds could be proved. This gives

627: confidence that the outputs $\hh y_k$ of the AI$\xi$ model

628: (\ref{ydotxi}) could converge to the outputs $\hh y_k$ of the

629: AI$\mu$ model (\ref{ydotrec}), at least for a bounded horizon

630: $h_k$. The problems with a fixed horizon $m_k\!=\!m$ and especially

631: $m\!\to\!\infty$

632: \ifijcai{are discussed in \cite{Hutter:00f}.}

633: \else{will be discussed later.}

634: \fi
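
The finite-product step behind (\ref{aixitomu}) reads, for
$r\!=\!1$ and as a sketch in the notation of (\ref{eukdistxi}),
\bqan
  \xi(y\!x_{<k}y\!\pb x_{k:k+1}) \;=\; \\[1ex]
  \xi(y\!x_{<k}y\!\pb x_k)\cdot\xi(y\!x_{1:k}y\!\pb x_{k+1}),
\eqan
and likewise for $\mu$. Along the actual i/o sequence each factor on the
right converges to the corresponding $\mu$ factor, so the product of a
fixed finite number of factors converges as well. For general $r$
there are $r\!+\!1$ factors, which is why a bound
$h_k\!\leq\!h_{max}\!<\!\infty$ on the horizon is needed in this argument.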

635:

636: %------------------------------%

637: \paragraph{Universally optimal AI systems:}

638: %------------------------------%

639: We want to call an AI model {\it universal}, if it is

640: $\mu$-independent (unbiased, model-free) and is able to solve any

641: solvable problem and learn any learnable task. Further, we call a

642: universal model, {\it universally optimal}, if there is no

643: program, which can solve or learn significantly faster (in terms

644: of interaction cycles). As the AI$\xi$ model is parameterless,

645: $\xi$ converges to $\mu$ in the sense of

646: (\ref{eukdistxi},\ref{aixitomu}), the AI$\mu$ model is itself

647: optimal, and we expect no other model to converge faster to

648: AI$\mu$ by analogy to SP (\ref{spebound}),

649: %we risk the following conjecture:

650: \beqn

651:   \mbox{\it we expect AI$\xi$ to be universally optimal.}

652: \eeqn

653: This is our main claim. Further support is given in

654: \cite{Hutter:00f} by a detailed analysis of the behaviour of

655: AI$\xi$ for various problem classes, including prediction,

656: optimization, games, and supervised learning.

657:

658: \ifijcai\else

659: %------------------------------%

660: \paragraph{The choice of the horizon:}

661: %------------------------------%

662: The only significant arbitrariness in the AI$\xi$ model lies in

the choice of the lifespan $m$ or the horizon
$h_k\!\equiv\!m_k\!-\!k\!+\!1$ if we allow a cycle-dependent $m$.

665: We will not discuss ad hoc choices of $h_k$ for specific problems.

We are interested in universal choices. The book
\cite{Bertsekas:95b} thoroughly discusses the mathematical

668: problems regarding infinite horizon systems.

669:

670: In many cases the time we are willing to run a system depends on

671: the quality of its actions. Hence, the lifetime, if finite at all,

672: is not known in advance. Exponential discounting

673: $r_k\!\to\!r_k\!\cdot\!\gamma^k$ solves the mathematical problem

674: of $m\!\to\!\infty$ but is no real solution, since an effective

horizon $h\sim 1/\ln{1\over\gamma}$ has been introduced. The
scale-invariant discounting $r_k\!\to\!r_k\!\cdot\!k^{-\alpha}$ has a

677: dynamic horizon $h\sim\!k$. This choice has some appeal, as it

678: seems that humans of age $k$ years usually do not plan their lives

679: for more than the next $\sim k$ years. From a practical point of

680: view this model might serve all needs, but from a theoretical

681: point we feel uncomfortable with such a limitation in the horizon

682: from the very beginning. A possible way of taking the limit

683: $m\!\to\!\infty$ without discounting and its problems can be found

684: in \cite{Hutter:00f}.
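
The two horizon scalings quoted above can be checked by a short
calculation. What follows is only a sketch under assumptions that are
not made in the text: the effective horizon $h$ at cycle $k$ is
{\it defined} here as the number of cycles after which the remaining
total discount weight has dropped to half of the weight from cycle $k$
onwards, $0\!<\!\gamma\!<\!1$ and $\alpha\!>\!1$ are assumed so that
all sums converge, and the sum in the second case is approximated by
an integral. For exponential discounting the remaining weight is
\beqn
  \sum_{i\geq k+h}\gamma^i \;=\; \gamma^h\sum_{i\geq k}\gamma^i .
\eeqn
Requiring $\gamma^h\!=\!{1\over 2}$ gives
\beqn
  h \;=\; {\ln 2\over\ln{1\over\gamma}},
\eeqn
which is independent of $k$ and grows like ${1\over 1-\gamma}$ for
$\gamma\!\to\!1$. For the power discounting,
\beqn
  \sum_{i\geq k}i^{-\alpha} \;\approx\; \int_k^\infty\! x^{-\alpha}dx
  \;=\; {k^{1-\alpha}\over\alpha-1} .
\eeqn
Requiring $(k+h)^{1-\alpha}\!=\!{1\over 2}k^{1-\alpha}$ gives
\beqn
  h \;=\; \big(2^{1/(\alpha-1)}-1\big)\cdot k ,
\eeqn
which is proportional to $k$, in line with the dynamic horizon
mentioned in the text.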

685:

686: Another objection against too large choices of $m_k$

687: is that $\xi(y\!x_{<k}y\!\pb x_{k:m_k})$ has been proved to be a

688: good approximation of $\mu(y\!x_{<k}y\!\pb x_{k:m_k})$ only for

689: $k\!\gg\!h_k$, which is never satisfied for

690: $m_k\!=\!m\!\to\!\infty$.

691: On the other hand it may turn out that the rewards

692: $r_{k'}$ for $k'\!\gg\!k$, where $\xi$ may no longer be trusted as

693: a good approximation of $\mu$, are in a sense randomly

694: disturbed with decreasing influence on the choice of $\hh y_k$.

695: This claim is supported by the forgetfulness property of $\xi$

696: \ifijcai\else{(see next section)}\fi

697: and can be proved when restricting to

698: factorizable environments \cite{Hutter:00f}.

699:

700: We are not sure whether the choice of $m_k$ is of marginal

701: importance, as long as $m_k$ is chosen sufficiently large and of

702: low complexity, $m_k=2^{2^{16}}$ for instance, or whether the

703: choice of $m_k$ will turn out to be a central topic for the

704: AI$\xi$ model or for the planning aspect of any universal AI

705: system in general. Most if not all problems in agent design of

706: balancing exploration and exploitation vanish by a sufficiently

707: large choice of the (effective) horizon and/or a sufficiently

708: general prior. We suppose that the limit $m_k\!\to\!\infty$ for

709: the AI$\xi$ model results in correct behaviour for weakly

710: separable (defined in the next section) $\mu$, and that even the

711: naive limit $m\!\to\!\infty$ may exist.

712: \fi

713:

714: %------------------------------%

715: \paragraph{Value bounds and separability concepts:}

716: %------------------------------%

717: The values  $V_\rho^\best$ associated with the AI$\rho$ systems

718: correspond roughly to the negative error measure $-E_{n\rho}$ of

719: the SP$\rho$ systems. In the SP case we were interested in small

720: bounds for the error excess $E_{n\Theta_\xi}\!-\!E_{n\rho}$.

721: Unfortunately, simple value bounds for AI$\xi$ or any other AI system in terms of

$V^\best$ analogous to the error bound (\ref{spebound}) cannot

723: hold \cite{Hutter:00f}. We even have difficulties in specifying

724: what we can expect to hold for AI$\xi$ or any AI system which

725: claims to be universally optimal. In SP, the only important

726: property of $\mu$ for proving error bounds was its complexity

727: $K(\mu)$. In the AI case, there are no useful bounds in terms of

728: $K(\mu)$ only. We either have to study restricted problem classes

729: or consider bounds depending on other properties of $\mu$, rather

730: than on its complexity only. In \cite{Hutter:00f} the difficulties

are exhibited by two examples. Several concepts that might be
useful for proving value bounds are introduced and discussed. They

733: include forgetful, relevant, asymptotically learnable, farsighted,

734: uniform, (generalized) Markovian, factorizable and (pseudo)

735: passive $\mu$. They are approximately sorted in the order of

736: decreasing generality and are called {\it separability concepts}.

737: A first weak bound for passive $\mu$ is proved.

738:

739: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

740: \section{Time Bounds and Effectiveness}\label{secTime}

741: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

742:

743: %------------------------------%

744: \paragraph{Non-effectiveness of AI$\xi$:}

745: %------------------------------%

746: $\xi$ is not a computable but

747: only an enumerable semimeasure. Hence, the output $\hh y_k$ of the

748: AI$\xi$ model is only asymptotically computable. AI$\xi$ yields an

749: algorithm that produces a sequence of trial outputs eventually

750: converging to the correct output $\hh y_k$, but one can never be sure

751: whether one has already reached it. Besides this, convergence

752: is extremely slow, so this type of asymptotic computability is of

753: no direct (practical) use. Furthermore, the replacement

754: of $\xi$ by time-limited versions \cite{Li:91,Li:97}, which is

755: suitable for sequence prediction, has been shown to fail for the

756: AI$\xi$ model \cite{Hutter:00f}.

757: This leads to the issues addressed next.

758:

759: %------------------------------%

760: \paragraph{Time bounds and effectiveness:}

761: %------------------------------%

762: Let $\tilde p$ be a policy which calculates an acceptable output

763: within a reasonable time $\tilde t$ per cycle. This sort of

764: computability assumption, namely, that a general purpose computer

765: of sufficient power and appropriate program is able to behave in

766: an intelligent way, is the very basis of AI research. Here it is

767: not necessary to discuss what exactly is meant by

`reasonable time/intelligence' and `sufficient power'. What we are

769: interested in is whether there is a computable version

770: AI$\xi^{\tilde t}$ of the AI$\xi$ system which is superior or

771: equal to any program $p$ with computation time per cycle of at

772: most $\tilde t$.

773:

774: What one can realistically hope to construct is an AI$\xi^{\tilde

775: t\tilde l}$ system of computation time $c\!\cdot\!\tilde t$ per

776: cycle for some constant $c$. The idea is to run all programs $p$

777: of length $\leq\!\tilde l\!:=\!l(\tilde p)$ and time $\leq\!\tilde

778: t$ per cycle and pick the best output in the sense of maximizing

779: the {\em universal value} $V_\xi^\best$. The total computation time is

780: $c\!\cdot\!\tilde t$ with $c\!\approx\!2^{\tilde l}$. Unfortunately

781: $V_\xi^\best$ can not be used directly since this measure is also

782: only semi-computable and the approximation quality by using

783: computable versions of $\xi$ given a time of order

784: $c\!\cdot\!\tilde t$ is crude \cite{Li:97,Hutter:00f}. On the

other hand, we {\it have} to use a measure which converges to
$V_\xi^\best$ for $\tilde t,\tilde l\!\to\!\infty$, since the

787: AI$\xi^{\tilde t\tilde l}$ model should converge to the AI$\xi$ model

788: in that case.
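
The constant $c$ can be made explicit by a simple count. This is a
sketch under assumptions not stated in the text: programs are binary
strings, each one is simulated independently for at most $\tilde t$
steps per cycle, and simulation overhead is ignored. The number of
binary strings of length $1,\ldots,\tilde l$ is
\beqn
  \sum_{l=1}^{\tilde l}2^l \;=\; 2^{\tilde l+1}-2 ,
\eeqn
so one cycle costs at most $(2^{\tilde l+1}\!-\!2)\!\cdot\!\tilde t$
steps, i.e.\ $c\!<\!2^{\tilde l+1}$, which is of order $2^{\tilde l}$.
For illustration, a length bound of only $\tilde l\!=\!20$ bits
already gives $2^{21}\!-\!2\!=\!2\,097\,150$ candidate programs per
cycle.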

789:

790: %------------------------------%

791: \paragraph{Valid approximations:}

792: %------------------------------%

793: A solution satisfying the above conditions is suggested in

794: \cite{Hutter:00f}. The main idea is to consider {\em extended

795: chronological incremental policies} $p$, which in addition to the

796: regular output $y_k^p$ {\em rate} their own output with $w_k^p$. The

797: AI$\xi^{\tilde t\tilde l}$ model selects the output $\hh y_k\!=\!y_k^p$

798: of the policy $p$ with highest rating $w_k^p$. $p$ might suggest

799: any output $y_k^p$ but it is not allowed to rate itself with an

800: arbitrarily high $w_k^p$ if one wants $w_k^p$ to be a reliable

801: criterion for selecting the best $p$. One must demand that no

802: policy $p$ is allowed to claim that it is better than it actually

803: is. In \cite{Hutter:00f} a (logical) predicate VA($p$), called

804: {\it valid approximation}, is defined, which is true if, and only

if, $p$ {\it always} satisfies $w_k^p\!\leq\!V_\xi^p(y\!x_{<k})$, i.e.\ never

806: overrates itself. $V_\xi^p(y\!x_{<k})$ is the $\xi$ expected

807: future reward under policy $p$. Valid policies $p$ can then be

808: (partially) ordered w.r.t.\ their rating $w_k^p$.
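
In symbols, the predicate can be sketched as follows. Reading
{\it always} as a quantification over all cycles $k$ and all
histories $y\!x_{<k}$ is an interpretation made here; the precise
definition is the one in \cite{Hutter:00f}.
\beqn
  \mbox{VA}(p) \;\gdw\;
  \forall k\;\forall y\!x_{<k}:\;
  w_k^p \leq V_\xi^p(y\!x_{<k})
\eeqn
Under the additional assumption that rewards are non-negative, so
that $V_\xi^p\!\geq\!0$, a policy which always rates itself with
$w_k^p\!=\!0$ satisfies this condition trivially. Validity alone
therefore says nothing about quality; it only makes a high rating
trustworthy.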

809:

810: %------------------------------%

811: \paragraph{The universal time bounded AI$\xi^{\tilde t\tilde l}$ system:}

812: %------------------------------%

813: In the following, we describe the algorithm $p^\best$ underlying

814: the universal time bounded AI$\xi^{\tilde t\tilde l}$ system. It

815: is essentially based on the selection of the best algorithms

816: $p_k^\best$ out of the time ${\tilde t}$ and length ${\tilde l}$

817: bounded policies $p$, for which there exists a proof $P$ of

818: VA($p$) with length $\leq\!l_P$.

819:

820: \begin{enumerate}\parskip=0ex\parsep=0ex\itemsep=0ex

821: \item Create all binary strings of length $l_P$ and interpret each

822: as a coding of a mathematical proof in the same formal logic system in

823: which VA($\cdot$) has been formulated. Take those strings

824: which are proofs of VA($p$) for some $p$ and keep the

825: corresponding programs $p$.

826: \item Eliminate all $p$ of length $>\!\tilde l$.

\item Modify all $p$ in the following way: all output $w_k^py_k^p$
is temporarily written on an auxiliary tape. If $p$ stops within $\tilde t$
steps, the internal `output' is copied to the output tape. If $p$
does not stop after $\tilde t$ steps, a stop is forced, and $w_k^p\!=\!0$
and some arbitrary $y_k^p$ are written on the output tape. Let ${\cal P}$ be
the set of all those modified programs.

833: \item Start first cycle: $k\!:=\!1$.

834: \item\label{pbestloop} Run every $p\!\in\!{\cal P}$ on extended input

835: $\hh y\!\hh x_{<k}$, where all outputs are redirected to some auxiliary

836: tape:

837: $p(\hh y\!\hh x_{<k})\!=\!w_1^py_1^p...w_k^py_k^p$. This step is

838: performed incrementally by adding $\hh y\!\hh x_{k-1}$ for $k\!>\!1$ to

839: the input tape and continuing the computation of the previous

840: cycle.

841: \item Select the program $p$ with highest rating $w_k^p$:

842: $p_k^\best\!:=\!\maxarg_pw_k^p$.

843: \item Write $\hh y_k\!:=\!y_k^{p_k^\best}$ to the output tape.

844: \item Receive input $\hh x_k$ from the environment.

845: \item Begin next cycle: $k\!:=\!k\!+\!1$, goto step

846: \ref{pbestloop}.

847: \end{enumerate}
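
As a purely illustrative instance of the selection in steps 5 to 7
(the three policies and their ratings are invented for this example
and do not come from the text), suppose
${\cal P}\!=\!\{p_1,p_2,p_3\}$ and that in cycle $k$ the modified
programs report
\beqn
  w_k^{p_1}\!=\!0.3,\quad w_k^{p_2}\!=\!0.7,\quad w_k^{p_3}\!=\!0,
\eeqn
where $p_3$ did not stop within $\tilde t$ steps, so that its rating
was forced to $0$ in step 3. Then $p_k^\best\!=\!p_2$ and
$\hh y_k\!=\!y_k^{p_2}$, whatever outputs $p_1$ and $p_3$ suggested.
The selected program may differ from cycle to cycle, since the
maximization in step 6 is repeated for every $k$.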

848:

849: %------------------------------%

850: \paragraph{Properties of the $p^\best$ algorithm:}

851: %------------------------------%

852: Let $p$ be any extended chronological (incremental) policy of

853: length $l(p)\!\leq\!\tilde l$ and computation time per cycle

854: $t(p)\!\leq\!\tilde t$, for which there exists a proof of VA($p$)

855: of length $\leq\!l_P$. The algorithm $p^\best$, depending on

$\tilde l$, $\tilde t$ and $l_P$ but not on $p$, always has a rating at least as high
as that of any such $p$. The setup time of $p^\best$ is

858: $t_{setup}(p^\best)\!=\!O(l_P^2\!\cdot\!2^{l_P})$ and the

859: computation time per cycle is $t_{cycle}(p^\best)\!=\!O(2^{\tilde

860: l}\!\cdot\!\tilde t)$. Furthermore, for $\tilde t,\tilde

861: l\!\to\!\infty$, $p^\best$ converges to the behavior of the AI$\xi$

862: model.
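
The setup time and the rating property can be made plausible as
follows. This is a sketch under assumptions not spelled out in the
text: verifying one candidate proof string of length $l_P$ costs
$O(l_P^2)$ steps, and every policy $p$ satisfying the conditions
above is retained in ${\cal P}$ (e.g.\ because shorter proofs can be
padded to length $l_P$). Step 1 inspects $2^{l_P}$ strings, hence
\beqn
  t_{setup}(p^\best) \;=\; 2^{l_P}\cdot O(l_P^2)
  \;=\; O(l_P^2\!\cdot\!2^{l_P}) .
\eeqn
Since such a $p$ stops within $\tilde t$ steps, the modification in
step 3 leaves its outputs unchanged, and the maximization in step 6
gives, in every cycle $k$,
\beqn
  w_k^{p_k^\best} \;=\; \max_{p'\in{\cal P}}w_k^{p'} \;\geq\; w_k^p .
\eeqn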

863:

864: Roughly speaking, this means that if there exists a computable

865: solution to some AI problem at all, then the explicitly

866: constructed algorithm $p^\best$ is such a solution. Although this

867: claim is quite general, there are some limitations and open

questions regarding the setup time, regarding the necessity that

869: the policies must rate their own output, regarding true but not

870: efficiently provable VA($p$), and regarding ``inconsistent''

871: policies \cite{Hutter:00f}.

872:

873:

874: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

875: \section{Outlook \& Discussion}\label{secOutlook}

876: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

877: This section contains some discussion and remarks on otherwise

878: unmentioned topics.

879:

880: %------------------------------%

881: \paragraph{Value bounds:}

882: %------------------------------%

883: Rigorous proofs of value bounds for the AI$\xi$ theory are the

884: major theoretical challenge -- general ones as well as tighter

885: bounds for special environments $\mu$. Of special importance are

suitable (and acceptable) conditions on $\mu$, under which $\hh

887: y_k$ and finite value bounds exist for infinite $Y$, $X$ and $m$.

888:

889: %------------------------------%

890: \paragraph{Scaling AI$\xi$ down:}

891: %------------------------------%

892: \cite{Hutter:00f} shows for several examples how to integrate

893: problem classes into the AI$\xi$ model. Conversely, one can

894: downscale the AI$\xi$ model by using more restricted forms of

895: $\xi$. This could be done in a similar way as the theory of

896: universal induction has been downscaled with many insights to the

897: Minimum Description Length principle \cite{Li:92b,Rissanen:89} or

898: to the domain of finite automata \cite{Feder:92}. The AI$\xi$

899: model might similarly serve as a super model or as the very

900: definition of (universal unbiased) intelligence, from which

901: specialized models could be derived.

902:

903: %------------------------------%

904: \paragraph{Applications:}

905: %------------------------------%

906: \cite{Hutter:00f} shows how a number of AI problem classes,

907: including {\em sequence prediction}, {\em strategic games}, {\em

908: function minimization} and {\em supervised learning} fit into

909: the general AI$\xi$ model. All problems are claimed to be formally

910: solved by the AI$\xi$ model. The solution is, however, only

911: formal, because the AI$\xi$ model is uncomputable or, at best,

912: approximable. First, each problem class is formulated in its

913: natural way (when $\mu^{\mbox{\tiny problem}}$ is known) and then

914: a formulation within the AI$\mu$ model is constructed and their

915: equivalence is proven. Then, the consequences of replacing $\mu$

916: by $\xi$ are considered. The main goal is to understand

917: how the problems are solved by AI$\xi$. For more details see

918: \cite{Hutter:00f}.

919:

920: %------------------------------%

921: \paragraph{Implementation and approximation:}

922: %------------------------------%

923: The AI$\xi^{\tilde t\tilde l}$ model suffers from the same large

924: factor $2^{\tilde l}$ in computation time as Levin search for

925: inversion problems

926: \ifijcai\cite{Levin:73}.

927: \else\cite{Levin:73,Levin:84}.

928: \fi

929: Nevertheless, Levin

930: search has been implemented and successfully applied to a variety

931: of problems \cite{Schmidhuber:97nn,Schmidhuber:97bias}. Hence, a direct

932: implementation of the AI$\xi^{\tilde t\tilde l}$ model may also be

933: successful, at least in toy environments, e.g.\ prisoner problems.

934: The AI$\xi^{\tilde t\tilde l}$ algorithm should be regarded only

935: as the first step toward a {\em computable universal AI model}.

936: Elimination of the factor $2^{\tilde l}$ without giving up

937: universality will probably be a very difficult task. One could try

938: to select programs $p$ and prove VA($p$) in a more clever way than

by mere enumeration. All kinds of ideas, like heuristic search,

940: genetic algorithms, advanced theorem provers, and many more could

941: be incorporated. But now we have a problem.

942:

943: %------------------------------%

944: \paragraph{Computability:}

945: %------------------------------%

946: We seem to have transferred the AI problem just to a different

947: level. This shift has some advantages (and also some

948: disadvantages) but presents, in no way, a solution. Nevertheless,

949: we want to stress that we have reduced the AI problem to (mere)

950: computational questions. Even the most general other systems the

author is aware of depend on some (more than complexity)

952: assumptions about the environment, or it is far from clear whether

953: they are, indeed, universally optimal. Although computational

954: questions are themselves highly complicated, this reduction is a

955: non-trivial result. A formal theory of something, even if not

956: computable, is often a great step toward solving a problem and has

957: also merits of its own (see previous paragraphs).

958:

959: %------------------------------%

960: \paragraph{Elegance:}

961: %------------------------------%

962: Many researchers in AI believe that intelligence is something

963: complicated and cannot be condensed into a few formulas. They

964: believe it is more a combining of enough {\em methods} and much

965: explicit {\em knowledge} in the right way. From a theoretical

966: point of view, we disagree as the AI$\xi$ model is simple and

967: seems to serve all needs. From a practical point of view we agree

968: to the following extent. To reduce the computational burden one

969: should provide special purpose algorithms ({\em methods}) from the

very beginning, probably many of them aimed at reducing the

971: complexity of the input and output spaces $X$ and $Y$ by

972: appropriate pre/post-processing methods.

973:

974: %------------------------------%

975: \paragraph{Extra knowledge:}

976: %------------------------------%

977: There is no need to incorporate extra {\em knowledge} from the

978: very beginning. It can be presented in the first few cycles in

979: {\it any} format. As long as the algorithm that interprets the

data is of size $O(1)$, the AI$\xi$ system will `understand' the

981: data after a few cycles (see \cite{Hutter:00f}). If the

982: environment $\mu$ is complicated but extra knowledge $z$ makes

983: $K(\mu|z)$ small, one can show that the bound (\ref{eukdistxi})

984: reduces to $\ln 2\!\cdot\!K(\mu|z)$ when $x_1\!\equiv\!z$, i.e.\

985: when $z$ is presented in the first cycle. Special purpose

986: algorithms could also be presented in $x_1$, but it would be

987: cheating to say that no special purpose algorithms have been

988: implemented in AI$\xi$. The boundary between implementation and

989: training is blurred in the AI$\xi$ model.

990:

991: %------------------------------%

992: \paragraph{Training:}

993: %------------------------------%

994: We have not said much about the training process itself, as it is

995: not specific to the AI$\xi$ model and has been discussed in

996: literature in various forms and disciplines. A serious discussion

997: would be out of place. To repeat a truism, it is, of course,

998: important to present enough knowledge $x'_k$ and evaluate the

999: system output $y_k$ with $r_k$ in a reasonable way. To maximize

1000: the information content in the reward, one should start with

1001: simple tasks and give positive reward to approximately

1002: the better half of the outputs $y_k$, for instance.
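
The `better half' rule can be motivated by an entropy argument. The
following is a sketch which assumes, unlike the general setting of
the text, a binary reward $r_k\!\in\!\{0,1\}$ that is positive with
probability $q$. The information conveyed by one reward is then at
most
\beqn
  H(q) \;=\; -q\log_2q-(1\!-\!q)\log_2(1\!-\!q) ,
\eeqn
and
\beqn
  H'(q) \;=\; \log_2{1-q\over q} \;=\; 0
  \quad\Longrightarrow\quad q={1\over 2},\quad H({\textstyle{1\over 2}})=1\mbox{ bit} .
\eeqn
Rewarding $90\%$ of the outputs, for comparison, yields only
$H(0.9)\!\approx\!0.47$ bits per cycle.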

1003:

1004: %------------------------------%

1005: \paragraph{The big questions:}

1006: %------------------------------%

1007: \cite{Hutter:00f} contains a discussion of the ``big'' questions

1008: concerning the mere existence of any computable, fast, and elegant

1009: universal theory of intelligence, related to non-computable $\mu$

1010: \cite{Penrose:94} and the `number of wisdom' $\Omega$

1011: \cite{Chaitin:75,Chaitin:91}.

1012:

1013: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1014: %         Bibliography                                        %

1015: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1016: \begin{thebibliography}{FMG92}

1017:

1018: \bibitem[Bel57]{Bellman:57}

1019: R.~Bellman.

1020: \newblock {\em Dynamic Programming}.

1021: \newblock Princeton University Press, New Jersey, 1957.

1022:

1023: \bibitem[Ber95]{Bertsekas:95b}

1024: D.~P. Bertsekas.

1025: \newblock {\em Dynamic Programming and Optimal Control, Vol. (II)}.

1026: \newblock Athena Scientific, Belmont, Massachusetts, 1995.

1027:

1028: \bibitem[BT96]{Bertsekas:96}

1029: D.~P. Bertsekas and J.~N. Tsitsiklis.

1030: \newblock {\em Neuro-Dynamic Programming}.

1031: \newblock Athena Scientific, Belmont, MA, 1996.

1032:

1033: \bibitem[Cha75]{Chaitin:75}

1034: G.~J. Chaitin.

1035: \newblock A theory of program size formally identical to information theory.

1036: \newblock {\em Journal of the ACM}, 22(3):329--340, 1975.

1037:

1038: \bibitem[Cha91]{Chaitin:91}

1039: G.~J. Chaitin.

1040: \newblock Algorithmic information and evolution.

1041: \newblock {\em in O.T. Solbrig and G. Nicolis, Perspectives on Biological

1042:   Complexity, IUBS Press}, pages 51--60, 1991.

1043:

1044: \bibitem[FMG92]{Feder:92}

1045: M.~Feder, N.~Merhav, and M.~Gutman.

1046: \newblock Universal prediction of individual sequences.

1047: \newblock {\em {IEEE} Transactions on Information Theory}, 38:1258--1270, 1992.

1048:

1049: \bibitem[G\'74]{Gacs:74}

1050: P.~G\'acs.

1051: \newblock On the symmetry of algorithmic information.

1052: \newblock {\em Russian Academy of Sciences Doklady. Mathematics (formerly

1053:   Soviet Mathematics--Doklady)}, 15:1477--1480, 1974.

1054:

1055: \bibitem[Hut99]{Hutter:99}

1056: M.~Hutter.

1057: \newblock New error bounds for {Solomonoff} prediction.

\newblock {\em Journal of Computer and System Sciences, in press},

1059:   (IDSIA-11-00):1--13, 1999.

1060: \newblock ftp://ftp.idsia.ch/pub/techrep/IDSIA-11-00.ps.gz.

1061:

1062: \bibitem[Hut00a]{Hutter:00e}

1063: M.~Hutter.

1064: \newblock Optimality of universal prediction for general loss and alphabet.

1065: \newblock Technical Report IDSIA-15-00, Istituto Dalle Molle di Studi

1066:   sull'Intelligenza Artificiale, Manno(Lugano), Switzerland, 2000.

1067: \newblock In progress.

1068:

1069: \bibitem[Hut00b]{Hutter:00f}

1070: M.~Hutter.

1071: \newblock A theory of universal artificial intelligence based on algorithmic

1072:   complexity.

1073: \newblock Technical report, 2000.

1074: \newblock 62 pages, http://xxx.lanl.gov/abs/cs.AI/0004001.

1075:

1076: \bibitem[Kol65]{Kolmogorov:65}

1077: A.~N. Kolmogorov.

1078: \newblock Three approaches to the quantitative definition of information.

\newblock {\em Problems of Information Transmission}, 1(1):1--7, 1965.

1080:

1081: \bibitem[Lev73]{Levin:73}

1082: L.~A. Levin.

1083: \newblock Universal sequential search problems.

1084: \newblock {\em Problems of Information Transmission}, 9:265--266, 1973.

1085:

1086: \bibitem[Lev74]{Levin:74}

1087: L.~A. Levin.

1088: \newblock Laws of information conservation (non-growth) and aspects of the

1089:   foundation of probability theory.

1090: \newblock {\em Problems of Information Transmission}, 10:206--210, 1974.

1091:

1092: \bibitem[Lev84]{Levin:84}

1093: L.~A. Levin.

1094: \newblock Randomness conservation inequalities: Information and independence in

1095:   mathematical theories.

1096: \newblock {\em Information and Control}, 61:15--37, 1984.

1097:

1098: \bibitem[LK96]{Kaelbling:96}

L.~P. Kaelbling, M.~L. Littman, and A.~W. Moore.
\newblock Reinforcement learning: a survey.
\newblock {\em Journal of Artificial Intelligence Research}, 4:237--285, 1996.

1102:

1103: \bibitem[LV91]{Li:91}

1104: M.~Li and P.~M.~B. Vit\'anyi.

1105: \newblock Learning simple concepts under simple distributions.

1106: \newblock {\em SIAM Journal on Computing}, 20(5):911--935, 1991.

1107:

1108: \bibitem[LV92]{Li:92b}

1109: M.~Li and P.~M.~B. Vit\'anyi.

1110: \newblock Inductive reasoning and {Kolmogorov} complexity.

1111: \newblock {\em Journal of Computer and System Sciences}, 44:343--384, 1992.

1112:

1113: \bibitem[LV97]{Li:97}

1114: M.~Li and P.~M.~B. Vit\'anyi.

1115: \newblock {\em An introduction to {Kolmogorov} complexity and its

1116:   applications}.

1117: \newblock Springer, 2nd edition, 1997.

1118:

1119: \bibitem[Pen94]{Penrose:94}

1120: R.~Penrose.

1121: \newblock {\em Shadows of the mind, {A} search for the missing science of

1122:   consciousness}.

1123: \newblock Oxford Univ. Press, 1994.

1124:

1125: \bibitem[Ris89]{Rissanen:89}

1126: J.~Rissanen.

1127: \newblock {\em Stochastic Complexity in Statistical Inquiry}.

1128: \newblock World Scientific Publ. Co., 1989.

1129:

1130: \bibitem[RN95]{Russell:95}

1131: S.~J. Russell and P.~Norvig.

1132: \newblock {\em Artificial Intelligence. {A} Modern Approach}.

1133: \newblock Prentice-Hall, Englewood Cliffs, 1995.

1134:

1135: \bibitem[SB98]{Sutton:98}

1136: R.~Sutton and A.~Barto.

1137: \newblock {\em Reinforcement learning: An introduction}.

1138: \newblock Cambridge, MA, MIT Press, 1998.

1139:

1140: \bibitem[Sch97]{Schmidhuber:97nn}

1141: J.~Schmidhuber.

1142: \newblock Discovering neural nets with low {Kolmogorov} complexity and high

1143:   generalization capability.

1144: \newblock {\em Neural Networks}, 10(5):857--873, 1997.

1145:

1146: \bibitem[Sol64]{Solomonoff:64}

1147: R.~J. Solomonoff.

1148: \newblock A formal theory of inductive inference: Part 1 and 2.

1149: \newblock {\em Inform. Control}, 7:1--22, 224--254, 1964.

1150:

1151: \bibitem[Sol78]{Solomonoff:78}

1152: R.~J. Solomonoff.

1153: \newblock Complexity-based induction systems: comparisons and convergence

1154:   theorems.

1155: \newblock {\em IEEE Trans. Inform. Theory}, IT-24:422--432, 1978.

1156:

1157: \bibitem[SZW97]{Schmidhuber:97bias}

1158: J.~Schmidhuber, J.~Zhao, and M.~Wiering.

1159: \newblock Shifting inductive bias with success-story algorithm, adaptive

1160:   {Levin} search, and incremental self-improvement.

1161: \newblock {\em Machine Learning}, 28:105--130, 1997.

1162:

1163: \end{thebibliography}

1164:

1165: \end{document}

1166:

1167: %---------------------------------------------------------------

1168: