0309:cs0309048/gm6.tex

1: \documentstyle[12pt,epsfig,times]{article}

2: %\renewcommand{\baselinestretch}{1.33}

3: %\documentstyle[a4wide]{article}

4: %\renewcommand\topfraction{1.0}

5: %\renewcommand\bottomfraction{1.0}

6: %\renewcommand\textfraction{0.0}

7: \renewcommand\floatpagefraction{1.0}

8:

9: \textwidth=6.5 truein \textheight=9.0truein \voffset=-1.0truein

10: \hoffset=-.5truein \parskip=1mm

11: \input epsf

12:

13: \begin{document}

14:

15: \def\ap{\textrm{'}}

16: \def\Biops{{\sc Biops} }

17: \def\Biopsn{{\sc Biops}}

18: \def\Oops{{\sc Oops} }

19: \def\Oopsn{{\sc Oops}}

20: \def\Aixi{{\sc Aixi} }

21: \def\Aixin{{\sc Aixi}}

22: \def\tl{{\sc Aixi}{\em (t,l)} }

23: \def\tln{{\sc Aixi}{\em (t,l)}}

24: \def\hs{{\sc Hsearch} }

25: \def\hsn{{\sc Hsearch}}

26: \def\GM{G\"{o}del Machine }

27: \def\gm{G\"{o}del machine }

28: \def\GMn{G\"{o}del Machine}

29: \def\gmn{G\"{o}del machine}

30: \newtheorem{theorem}{Theorem}[section]

31: \newtheorem{lemma}{Lemma}[section]

32: \newtheorem{corollary}{Corollary}[section]

33: \newtheorem{conjecture}{Conjecture}[section]

34: \newtheorem{example}{Example}[section]

35: \newtheorem{definition}{Definition}[section]

36: \newtheorem{postulate}{Postulate}[section]

37: \newtheorem{method}{Method}[section]

38: \newtheorem{procedure}{Procedure}[section]

39:

40: \def\odt{{\textstyle{1\over 2}}}

41: \def\maxarg{\mathop{\rm maxarg}}          % maxarg

42: \def\minarg{\mathop{\rm minarg}}          % minarg

43:

44:

45: \begin{titlepage}

46:

47: \begin{center}

48:

49:

50: \vspace{3cm}

51: \vspace{0.6cm}

52:

53: {\em TR IDSIA-19-03, Version 5 \\

54: December 2006, arXiv:cs.LO/0309048 v5 \\

55: (v1: 25 September 2003)

56: }

57: \vspace{0.9cm}

58:

59: {\LARGE \GMn s: Self-Referential \\

60: %\vspace{0.4cm}

61: Universal Problem Solvers Making \\

62: \vspace{0.2cm}

63: Provably Optimal Self-Improvements}

64:

65: \vspace{0.2cm}

66:

67: \end{center}

68:

69: \begin{center}

70:

71: J\"{u}rgen Schmidhuber

72:

73: {\it IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland \& \\

74: TU M\"{u}nchen, Boltzmannstr. 3,  85748 Garching, M\"{u}nchen, Germany}

75:

76: {\tt juergen@idsia.ch - http://www.idsia.ch/\~{ }juergen}

77:

78:

79: \end{center}

80:

81: \vspace{0.2cm}

82:

83:

84:

85: \begin{abstract}

86:

87: We present the first class of mathematically rigorous, general, fully

88: self-referential, self-improving, optimally efficient problem solvers.

89: Inspired by Kurt G\"{o}del's celebrated self-referential formulas (1931),

90: such a problem solver rewrites any part of its own code as soon

91: as it has found a proof that the rewrite is {\em useful,} where the

92: problem-dependent utility function and the hardware and the entire

93: initial code are described by axioms encoded in an initial proof

94: searcher which is also part of the initial code.  The searcher

95: systematically and in an asymptotically optimally efficient way tests

96: computable {\em proof techniques} (programs whose outputs are proofs)

97: until it finds a provably useful, computable self-rewrite.  We show

98: that such a self-rewrite is globally optimal---no local maxima!---since

99: the code first had to prove that it is not useful to continue the

100: proof search for alternative self-rewrites.  Unlike Hutter's previous {\em

101: non}-self-referential methods based on hardwired proof searchers, ours

102: not only boasts an optimal {\em order} of complexity but can optimally

103: reduce any slowdowns hidden by the $O()$-notation, provided the utility

104: of such speed-ups is provable at all.

105:

106: \vspace{0.5cm}

107: \noindent

108: {\bf Keywords:} self-reference,

109: reinforcement learning,

110: problem solving,

111: proof techniques, optimal

112: universal search, self-improvement

113:

114: \vspace{0.5cm}

115: \noindent

116: {\bf \gm publications} up to 2006

117: (100 years after Kurt G\"{o}del's birth, and 75 years

118: after his landmark paper

119: \cite{Goedel:31}

120: laying the foundations of

121: theoretical computer science):

122:  \cite{Schmidhuber:03gm,Schmidhuber:04gmhtml,Schmidhuber:04oops,Schmidhuber:05icann,Schmidhuber:05gmconscious,Schmidhuber:05gmai}.

123:

124: \end{abstract}

125:

126: \end{titlepage}

127:

128: \tableofcontents

129: \newpage

130:

131: \section{Introduction and Outline}

132: \label{intro}

133:

134:

135: All traditional algorithms for problem solving /

136: machine learning /

137: reinforcement learning \cite{Kaelbling:96}

138: are hardwired. Some are designed to improve some limited

139: type of policy through

140: experience, but are not part of the modifiable policy,

141: and cannot improve themselves in a theoretically sound way.

142: Humans are needed to create

143: new / better problem solving algorithms and to prove their

144: usefulness under appropriate assumptions.

145:

146: Here we will eliminate the restrictive need for human

147: effort in the most general way possible,

148: leaving all the work

149: including the proof search to a system

150: that can rewrite and improve itself in arbitrary

151: computable ways and in a most efficient fashion.

152: To attack this {\em ``Grand Problem of Artificial Intelligence''}

153: \cite{Schmidhuber:03grandai},

154: we introduce a novel class of optimal,

155: fully self-referential \cite{Goedel:31}

156: general problem solvers called

157: {\em \gmn s} \cite{Schmidhuber:03gm,Schmidhuber:04gmhtml,Schmidhuber:05icann,Schmidhuber:05gmai,Schmidhuber:05gmconscious}.\footnote{Or {\em `Goedel machine'}, to

158: avoid the {\em Umlaut}. But

159: {\em `Godel machine'} would not be quite correct.

160: Not to be confused with what Penrose

161: calls, in a different context,

162: {\em `G\"{o}del's putative theorem-proving machine'} \cite{Penrose:94}!}

163: They are universal problem solving systems that

164: interact with some (partially observable) environment and

165: can in principle modify themselves without essential limits

166: apart from the limits of computability.

167: Their initial algorithm is not hardwired;

168: it can completely rewrite itself,

169: but only if a proof searcher embedded within the initial algorithm

170: can first prove that the rewrite is useful,

171: given a formalized utility function reflecting

172: computation time and

173: expected future success (e.g., rewards).

174: We will see that self-rewrites due to this approach

175: are actually {\em globally optimal} (Theorem \ref{globopt},

176: Section \ref{secglobopt}), relative to G\"{o}del's

177: well-known fundamental restrictions of provability \cite{Goedel:31}.

178: These restrictions should not worry us;

179: if there is no proof of some self-rewrite's utility, then humans

180: cannot do much either.

181:

182: The initial proof searcher is $O()$-optimal

183: (has an optimal order of complexity)

184: in the sense of Theorem \ref{asopt},  Section \ref{biops}.

185: Unlike hardwired systems such as Hutter's

186: \cite{Hutter:01aixi+,Hutter:01fast+}

187: and Levin's \cite{Levin:73,Levin:84}

188: (Section \ref{previous}),

189: however, a \gm can in principle speed up any part of its

190: initial software, including its proof searcher, to meet

191: {\em arbitrary} formalizable notions of optimality beyond those

192: expressible in the $O()$-notation.  Our approach yields the first

193: theoretically sound, fully self-referential,

194: optimal, general problem solvers.

195:

196: \noindent {\bf Outline.} Section \ref{basics} presents basic concepts

197: and fundamental limitations, Section \ref{details}

198: the essential details of a self-referential axiomatic system,

199: Section \ref{secglobopt} the Global Optimality Theorem \ref{globopt},

200: and Section \ref{biops} the $O()$-optimal

201: (Theorem \ref{asopt}) initial proof searcher.

202: Section \ref{discussion} provides examples

203: and relations to previous work,

204: briefly discusses issues such as

205: a {\em technical} justification of consciousness,

206: and lists answers to several frequently asked questions

207: about \gmn s.

208:

209: \section{Overview / Basic Ideas /  Limitations}

210: \label{basics}

211:

212: Many traditional problems of computer science require just

213: one problem-defining input at the beginning of the problem solving

214: process.  For example, the initial input may be a large integer,

215: and the goal may be to factorize it.

216: In what follows, however, we will

217: also consider the {\em more general case}

218: where the problem solution requires interaction with a dynamic,

219: initially unknown environment that produces a continual stream of

220: inputs and feedback

221: signals, such as in autonomous robot control tasks,

222: where the goal may be to maximize expected cumulative

223: future reward \cite{Kaelbling:96}.

224: This may require the solution of essentially arbitrary problems

225: (examples in Section \ref{examples} formulate traditional

226: problems as special cases).

227:

228: \subsection{Set-up and Formal Goal}

229: \label{notation}

230: Our hardware (e.g., a universal or space-bounded

231: Turing machine

232: \cite{Turing:36}

233: or the abstract model

234: of a personal computer)

235: has a single life which

236: consists of discrete cycles or time steps $t=1, 2, \ldots$.

237: Its total lifetime $T$ may or may not be known in advance.

238: In what follows, the value of any time-varying variable $Q$

239: at time $t$ will be denoted by $Q(t)$.

240:

241: During each cycle our hardware

242: executes an elementary operation which affects its

243: variable state $s \in \cal S \subset B^*$

244: (without loss of generality, $B^*$ is

245: the set of possible bitstrings over

246: the binary alphabet $B=\{ 0, 1 \}$)

247: and possibly also the variable environmental state $Env \in \cal E$

248: (here we need not yet specify the problem-dependent set $\cal E$).

249: There is a hardwired

250: state transition function $F: \cal S \times \cal E \rightarrow \cal S$.

251: For $t > 1$, $s(t)=F(s(t-1), Env(t-1))$ is the state at a point where

252: the hardware operation of cycle $t-1$ is finished, but the one of

253: $t$ has not started yet.

254: $Env(t)$ may depend on past output actions encoded in $s(t-1)$ and

255: is simultaneously updated or (probabilistically) computed by

256: the possibly reactive environment.

257:

258: In order to talk conveniently about programs and data,

259: we will often attach names to certain string variables encoded

260: as components or substrings of $s$.

261: Of particular interest are the three variables called

262: {\em time}, {\em x}, {\em y}, and {\em p}:

263: \begin{enumerate}

264: \item At time $t$, variable $time$

265: holds a unique binary representation of $t$.

266: We initialize $time(1)=`1\ap$, the bitstring consisting only of a one.

267: The hardware increments $time$ from one cycle to the next.

268: This requires at most $O(log~t)$ and on average only $O(1)$

269: computational steps.

270:

271: \item

272: Variable {\em x} holds the inputs from the environment to the \gmn.

273: For $t>1$, $x(t)$  may

274: differ from $x(t-1)$ only if

275: a program running on the \gm has executed

276: a special input-requesting instruction at time $t-1$.

277: Generally speaking, the delays between successive inputs

278: should be sufficiently large so that

279: programs can perform

280: certain elementary computations on an input, such as copying

281: it into internal storage (a reserved part of $s$)

282: before the next input arrives.

283:

284:   \item

285: Variable {\em y} holds the outputs of the \gmn.

286: $y(t)$ is an output bitstring which may

287: subsequently influence the

288: environment, where $y(1)=`0\ap$ by default.

289: For example, $y(t)$ could be interpreted as a control

290: signal for an environment-manipulating

291: robot whose actions may have an effect on

292: future inputs.

293:

294: \item

295: $p(1)$ is the initial software: a program

296: implementing the original (sub-optimal) policy

297: for interacting with the environment,

298: represented as a substring $e(1)$ of $p(1)$,

299: plus the original policy for searching proofs.

300: Details will be discussed below.

301: \end{enumerate}

302:

303: At any given time $t$ ($1 \leq t \leq T$) the goal

304: is to maximize future success or {\em utility}.

305: A typical {\em ``value to go''} utility function

306: is of the form $u(s, Env): \cal S \times \cal E \rightarrow R$, where

307: $\cal R$ is the set of real numbers:

308: \begin{equation}

309: \label{u}

310: u(s, Env) =

311: E_{\mu} \left [ \sum_{\tau=time}^T  r(\tau)~~ \Bigg| ~~s, Env \right ],

312: \end{equation}

313: where $r(t)$ is a real-valued reward input (encoded within $s(t)$) at time $t$,

314: $E_{\mu}(\cdot \mid \cdot)$ denotes the conditional expectation operator

315: with respect to some possibly unknown distribution $\mu$ from a set $M$

316: of possible distributions ($M$ reflects

317: whatever is known about the possibly probabilistic reactions

318: of the environment), and the above-mentioned $time=time(s)$ is a function

319: of state $s$ which uniquely identifies the

320: current cycle.

321: Note that we take into account the possibility of extending

322: the expected lifespan

323: through appropriate actions.

324:

325: Alternative formalizable utility functions could favor

326: improvement of {\em worst case} instead

327: of {\em expected} future performance,

328: or higher reward intake {\em per time interval} etc.

329: Clearly, most classic problems of computer science

330: can be formulated in this framework---see examples

331: in Section \ref{examples}.

332:

333:

334: \subsection{Basic Idea of \GM}

335: Our machine becomes a self-referential

336: \cite{Goedel:31}

337: {\em \gmn}

338: by loading it

339: with a

340: particular form of

341: machine-dependent,

342: self-modifying code $p$. The initial code

343: $p(1)$ at time step 1

344: includes a (typically sub-optimal)

345: problem solving subroutine $e(1)$ for interacting with

346: the environment, such as any traditional reinforcement learning

347: algorithm \cite{Kaelbling:96},

348: and a general proof searcher subroutine

349: (Section \ref{biops})

350: that systematically makes pairs

351: {\em (switchprog, proof)} (variable substrings of $s$)

352: until it finds a {\em proof} of

353: a target theorem which essentially states: {\em `the

354: immediate rewrite of {\em p} through current program {\em switchprog}

355: on the given machine

356: implies higher utility than leaving {\em p} as is'.} Then it executes

357: {\em switchprog}, which may completely rewrite $p$, including

358: the proof searcher. Section \ref{details} will explain

359: details of the necessary

360: initial axiomatic system $\cal A$

361: encoded in $p(1)$.

362: Compare Fig. \ref{storage}.

363: \begin{figure}[hbt]

364: \centerline{\epsfig{figure=figure1.eps,angle=0,height=12cm}}

365: \caption{{\small

366: Storage snapshot of a not yet self-improved example \gmn,

367: with the initial software still intact. See text for details.

368: }} \label{storage}

369: \end{figure}

370:

371:

372: \noindent

373: The {\bf Global Optimality Theorem} (Theorem \ref{globopt},

374: Section \ref{secglobopt}) shows

375: this self-im\-prove\-ment strategy is not greedy: since the

376: utility of {\em `leaving $p$ as is'} implicitly evaluates all possible

377: alternative {\em switchprog}s which an unmodified $p$ might find later,

378: we obtain a globally optimal self-change---the {\em current switchprog}

379: represents the best of all possible relevant self-changes, relative

380: to the given resource limitations and initial proof search strategy.

381:

382: \subsection{Proof Techniques and an $O()$-Optimal Initial Proof Searcher}

383: Section \ref{biops} will present an $O()$-optimal

384: initialization of the proof searcher,

385: that is, one with an optimal {\em order} of complexity

386: (Theorem \ref{asopt}).  Still, there will remain a lot of

387: room for self-improvement hidden by the $O()$-notation.

388: The searcher

389: uses an online extension of {\em Universal Search}

390: \cite{Levin:73,Levin:84}

391: to systematically test {\em online

392: proof techniques}, which are proof-generating programs that

393: may read parts of state $s$

394: (similarly, mathematicians are often more interested in

395: proof techniques than in theorems).

396: To prove target theorems as above,

397: proof techniques may invoke special instructions

398: for generating axioms and applying inference rules to prolong the

399: current {\em proof} by theorems. Here

400: an axiomatic system $\cal A$

401: encoded in $p(1)$

402: includes axioms describing {\bf (a)} how any instruction invoked

403: by a program running on the given hardware will change

404: the machine's state $s$

405: (including instruction pointers etc.)

406: from one step to the next (such that proof techniques can reason

407: about the effects of any program including the proof searcher),

408: {\bf (b)} the initial program $p(1)$ itself (Section \ref{details} will show

409: that this is possible without introducing circularity),

410: {\bf (c)} stochastic environmental properties,

411: {\bf (d)} the formal utility function $u$,

412: e.g., equation (\ref{u}),

413: which takes into account

414: computational costs of all actions including proof search.

415:

416:

417:

418: \subsection{Limitations of \GMn s}

419: \label{limits}

420: The fundamental limitations are closely related to

421: those first identified by G\"{o}del's celebrated paper on

422: self-referential formulae \cite{Goedel:31}.

423: Any formal system that encompasses arithmetics (or ZFC etc)

424: is either flawed or allows for unprovable but true statements.

425: Hence even a \gm with unlimited computational

426: resources must ignore those self\--im\-prove\-ments

427: whose effectiveness it cannot prove,

428: e.g., for lack of sufficiently powerful axioms in $\cal A$.

429: In particular, one can construct pathological

430: examples of environments and

431: utility functions that make it impossible for the machine

432: to ever prove a target theorem.

433: Compare Blum's speed-up theorem

434: \cite{Blum:67,Blum:71}

435: based on certain incomputable predicates.

436: Similarly, a realistic \gm with limited resources

437: cannot profit from self\--im\-prove\-ments

438: whose usefulness it cannot prove within

439: its time and space constraints.

440:

441: Nevertheless, unlike previous methods, it can

442: in principle exploit at least the {\em provably} good speed-ups

443: of {\em any} part of its initial software, including those

444: parts responsible for huge (but problem class-independent) slowdowns

445: ignored by the earlier approaches \cite{Hutter:01aixi+,Hutter:01fast+}

446: (Section \ref{previous}).

447:

448: \section{Essential Details of One Representative \GM}

449: \label{details}

450: {\bf Notation.} Unless stated otherwise or obvious,

451: throughout the paper newly introduced variables and functions

452: are assumed to cover the range implicit in the context.

453: $l(q)$ denotes the number of bits in a bitstring $q$;

454: $q_n$ the $n$-th bit of $q$;

455: $\lambda$ the empty string (where $l(\lambda)=0$);

456: $q_{m:n}= \lambda$ if $m>n$ and $q_m q_{m+1} \ldots q_n$

457: otherwise (where $q_0 := q_{0:0} := \lambda$).

458:

459: Theorem proving requires an

460: axiom scheme yielding an

461: enumerable set of

462: axioms of a formal logic

463: system $\cal A$ whose formulas and theorems are symbol

464: strings over some finite alphabet that may include traditional

465: symbols of logic (such as

466: $\rightarrow,\wedge,=,(,), \forall, \exists, \ldots$,

467: $c_1, c_2, \ldots,$  $f_1, f_2, \ldots$),

468: probability theory (such as $E(\cdot)$, the expectation operator),

469: arithmetics ($+, -, /, = , \sum, <, \ldots$),

470: string manipulation (in particular, symbols for

471: representing any part of state $s$ at any time,

472: such as $s_{7:88}(5555)$).

473: A proof is a sequence of theorems,

474: each either an axiom or inferred from previous

475: theorems by applying one of the inference rules such

476: as {\em modus ponens} combined with {\em unification}, e.g.,

477: \cite{Fitting:96}.

478:

479: The remainder of this paper will omit standard knowledge to be found

480: in any proof theory textbook.

481: Instead of listing {\em all} axioms of a particular $\cal A$ in

482: a tedious fashion,

483: we will focus on the novel and critical details:

484: how to overcome potential problems with self-reference

485: and how to deal with the potentially delicate online generation of proofs

486: that talk about and affect the currently running proof generator itself.

487:

488: \subsection{Proof Techniques}

489: \label{prooftech}

490: Brute force proof searchers

491: (used in Hutter's work \cite{Hutter:01aixi+,Hutter:01fast+};

492: see Section \ref{previous} for a review)

493: systematically generate all proofs

494: in order of their sizes.  To produce a certain proof,

495: this takes time exponential in proof size.

496: Instead our $O()$-optimal $p(1)$ will produce many proofs

497: with low algorithmic complexity

498: \cite{Solomonoff:64,Kolmogorov:65,LiVitanyi:97}

499: much more quickly. It systematically tests (see Section \ref{biops})

500: programs called

501: {\em proof techniques} written in universal language

502: $\cal L$ implemented within $p(1)$.

503: For example, $\cal L$  may be a variant of PROLOG \cite{Prolog:87}

504: or the universal {\sc Forth}\cite{Forth:70}-inspired

505: programming language used in recent work on optimal search

506: \cite{Schmidhuber:04oops}.

507: A proof technique is composed

508: of instructions that allow any part of $s$ to be read,

509: such as inputs encoded in variable $x$ (a substring of $s$) or

510: the code of $p(1)$. It may write on $s^p$, a  part of $s$ reserved for

511: temporary results.  It also may rewrite {\em switchprog},

512: and produce an incrementally growing proof placed in the

513: string variable {\em proof} stored somewhere in $s$.

514: {\em proof} and $s^p$ are reset to the empty string at

515: the beginning of each new proof technique test.

516: Apart from standard arithmetic and function-defining instructions

517: \cite{Schmidhuber:04oops} that modify $s^p$,

518: the programming language $\cal L$ includes special instructions

519: (details in Section \ref{instructions})

520: for prolonging the current {\em proof} by correct theorems,

521: for setting {\em switchprog},

522: and for checking whether a provably optimal $p$-modifying

523: program was found and should be executed now.

524: Certain long proofs can be produced by short proof techniques.

525:

526: \subsection{Important Instructions Used by Proof Techniques}

527: \label{instructions}

528:

529: The nature of the six {\em proof}-modifying instructions below

530: (there are no others)

531: makes it impossible to insert an

532: incorrect theorem into {\em proof},

533: thus trivializing proof verification.

534: Let us first provide a brief overview of the most

535: important instructions: {\bf get-axiom(n)}

536: appends the $n$-th possible axiom

537: to the current {\em proof},

538: {\bf apply-rule(k, m, n)} applies the $k$-th

539: inference rule to the $m$-th and $n$-th theorem

540: in the current {\em proof} (appending the result),

541: {\bf set-switchprog(m,n)}

542: sets $switchprog:= s^p_{m:n}$,

543: and {\bf check()}

544: tests whether the last theorem in {\em proof}

545: is a {\bf target theorem} showing that a self-rewrite

546: through {\em switchprog} would be useful.

547: The details are as follows.

548:

549:

550: \begin{enumerate}

551:

552: \item {\bf get-axiom(n)}

553: takes as argument an integer $n$ computed

554: by a prefix of the currently tested proof technique

555: with the help of arithmetic instructions

556: such as those used in previous work

557: \cite{Schmidhuber:04oops}.

558: Then it appends

559: the $n$-th axiom (if it exists, according to the axiom scheme below)

560: as a theorem to the current theorem

561: sequence in {\em proof}.  The initial axiom scheme encodes:

562:

563: \begin{enumerate}

564: \item

565: \label{hardwareaxioms}

566: {\bf Hardware axioms}

567: describing the hardware,

568: formally specifying how

569: certain components of $s$ (other than

570: environmental inputs $x$) may

571: change from one cycle to the next.

572:

573: For example, if the hardware is a

574: Turing machine\footnote{

575: Turing reformulated G\"{o}del's unprovability results in terms

576: of Turing machines (TMs) \cite{Turing:36} which

577: subsequently became the most widely used abstract model of

578: computation.  It is well-known that there are {\em universal}

579: TMs that in a certain sense can emulate any other TM or any

580: other known computer.  G\"{o}del's integer-based formal language

581: can be used to describe any universal TM, and vice versa.}

582: (TM) \cite{Turing:36},

583: then $s(t)$ is a bitstring that encodes the current

584: contents of all tapes of the TM, the positions of its

585: scanning heads, and the current {\em internal state}

586: of the TM's finite state automaton, while $F$ specifies

587: the TM's look-up table which maps any possible combination

588: of internal state and bits above scanning heads

589: to a new internal state and an action such as:

590: replace some head's current bit by 1/0, increment

591: (right shift) or decrement (left shift) some

592: scanning head, read and copy next input bit to cell above

593: input tape's scanning head,

594: etc.

595:

596: Alternatively, if the hardware is given by the abstract model

597: of a modern microprocessor with limited storage, $s(t)$ will

598: encode the current storage contents,

599: register values,

600: instruction pointers etc.

601:

602: For instance, the following axiom could

603: describe how some 64-bit hardware's instruction pointer

604: stored in $s_{1:64}$ is continually incremented as long as

605: there is no overflow and the value

606: of $s_{65}$ does not indicate that a jump to some other address

607: should take place:

608: \[

609: (\forall t \forall n :

610: [(n < 2^{64}-1) \wedge  (n > 0) \wedge  (t > 1) \wedge  (t < T)

611: \]

612: \[

613: \wedge

614: (string2num(s_{1:64}(t))=n)

615: \wedge (s_{65}(t)=`0\ap) ]

616: \]

617: \[

618: \rightarrow

619: (string2num(s_{1:64}(t+1))=n+1))

620: \]

621: Here the semantics of used symbols such

622: as `('  and `$>$'  and `$\rightarrow$' (implies)

623: are the traditional ones, while `$string2num$'

624: symbolizes a function

625: translating bitstrings into numbers.

626: It is clear that any abstract hardware model can be

627: fully axiomatized in a similar way.

628:

629:

630: \item

631: {\bf Reward axioms}

632: defining the computational costs of any

633: hardware instruction,

634: and physical costs of output

635: actions, such as control signals $y(t)$

636: encoded in $s(t)$.

637: Related axioms assign values to certain input events

638: (encoded in variable $x$, a substring of $s$)

639: representing reward or punishment (e.g.,

640: when a \gmn-controlled robot bumps into an obstacle).

641: Additional axioms define the total value of the \gmn 's life as a

642: scalar-valued function of all

643: rewards (e.g., their sum) and

644: costs experienced between cycles $1$ and $T$, etc.

645: For example, assume that

646: $s_{17:18}$ can be changed only through

647: external inputs; the following example axiom

648: says that the total reward increases by 3 whenever

649: such an input equals `11'

650: (unexplained symbols carry the obvious meaning):

651: \[

652: (\forall t_1 \forall t_2:

653: [(t_1 < t_2) \wedge  (t_1 \geq 1)

654: \wedge  (t_2 \leq T)

655: \wedge  (s_{17:18}(t_2)=`11\ap)  ]

656: \]

657: \[

658: \rightarrow

659: [R(t_1,t_2)= R(t_1,t_2-1)+3]),

660: \]

661: where $R(t_1,t_2)$ is interpreted as

662: the cumulative reward between times $t_1$ and $t_2$.

663: It is clear that any formal scheme for producing

664: rewards can be fully axiomatized in a similar way.

665:

666: \item

667: \label{envaxioms}

668: {\bf Environment axioms}

669: restricting the way the environment will produce

670: new inputs (encoded within certain substrings of $s$) in reaction to

671: sequences of outputs $y$ encoded in $s$.

672: For example, it may be known

673: in advance that the environment is sampled from an unknown probability

674: distribution $\mu$ contained in a given set $M$

675: of possible distributions (compare equation \ref{u}).

676: E.g., $M$ may contain all distributions

677: that are computable, given the previous history

678: \cite{Solomonoff:64,Solomonoff:78,Hutter:01aixi+},

679: or at least limit-computable \cite{Schmidhuber:00v2,Schmidhuber:02ijfcs}.

680: Or, more restrictively, the environment

681: may be some unknown but deterministic computer

682: program \cite{Zuse:69,Schmidhuber:97brauer}

683: sampled from the Speed Prior \cite{Schmidhuber:02colt} which assigns

684: low probability to environments that are hard to compute by any method.

685: Or the interface to the environment is Markovian \cite{Schmidhuber:91nips},

686: that is, the current

687: input always uniquely identifies the environmental state---a lot

688: of work has already been done

689: on this special case \cite{Samuel:59,Bellman:61,Sutton:98}.

690: Even more restrictively, the environment may evolve in completely

691: predictable fashion known in advance.

692: All such prior assumptions

693: are perfectly formalizable in an appropriate $\cal A$

694: (otherwise we could not write scientific papers about them).

695:

696: \item

697: {\bf Uncertainty axioms; string manipulation axioms:}

698: \label{probaxioms}

699: Standard axioms for arithmetics and calculus

700: and probability theory \cite{Kolmogorov:33}

701: and statistics

702: and string manipulation that (in conjunction

703: with the hardware axioms and environment axioms) allow for constructing proofs

704: concerning (possibly uncertain) properties of future values of

705: $s(t)$ as well as bounds on expected remaining lifetime /

706: costs / rewards,

707: given some time $\tau$ and certain

708: hypothetical values for components of $s(\tau)$ etc.

709: An example theorem saying something about

710: expected properties of future inputs $x$ might look like this:

711: \[

712: (\forall t_1 \forall \mu \in M:

713: [(1 \leq t_1) \wedge

714: (t_1 + 15597 < T) \wedge

715: (s_{5:9}(t_1) = `01011\ap)

716: \]

717: \[

718: \wedge (x_{40:44}(t_1)=`00000\ap) ]

719: \rightarrow

720: (\exists t: [(t_1 < t < t_1 + 15597)

721: \]

722: \[

723: \wedge

724: (P_{\mu}(x_{17:22}(t) = `011011\ap \mid s(t_1)) > \frac{998}{1000} )])),

725: \]

726: where $P_{\mu}(. \mid . )$ represents a conditional probability

727: with respect to an axiomatized prior distribution $\mu$ from

728: a set of distributions $M$ described by

729: the environment axioms (Item \ref{envaxioms}).

730:

731: Given a particular

732: formalizable hardware

733: (Item \ref{hardwareaxioms})

734: and formalizable assumptions about the

735: possibly probabilistic environment

736: (Item \ref{envaxioms}),

737: obviously one can fully axiomatize

738: everything that is needed for

739: proof-based reasoning about past and future machine states.

740:

741: \item

742: \label{initaxioms}

743: {\bf Initial state axioms:}

744: Information about how

745: to reconstruct the initial state $s(1)$ or parts thereof,

746: such that the proof searcher

747: can build proofs including

748: axioms of the type

749: \[

750: (s_{{\bf m}:{\bf n}}(1)={\bf z}),

751: ~e.g.:~ (s_{7:9}(1)=`010\ap).

752: \]

753: Here and in the remainder of the paper

754: we use bold font in formulas to indicate

755: syntactic place holders (such as {\bf m,n,z})

756: for symbol strings representing

757: variables (such as {\em m,n,z})

758: whose semantics are explained in

759: the text---in the present context $z$ is the

760: bitstring $s_{m:n}(1)$.

761:

762: Note that it is {\bf no fundamental problem} to fully encode

763: both the hardware description {\em and} the initial

764: hardware-describing $p$ within $p$

765: itself. To see this, observe that some software may

766: include a program that can print the software.

767:

768:

769: \item

770: {\bf Utility axioms}

771: \label{itemu}

772: describing the overall goal

773: in the form of utility function $u$; e.g., equation (\ref{u}) in

774: Section \ref{notation}.

775: \end{enumerate}

776:

777: \item

778: \label{applyrule}

779: {\bf apply-rule(k, m, n)}

780: takes as arguments the index $k$ (if it exists) of

781: an inference rule

782: such as {\em modus ponens}

783: (stored in a list of possible inference rules

784: encoded within $p(1)$) and the indices $m, n$

785: of two previously proven theorems (numbered in order of

786: their creation) in the current {\em proof}.

787: If applicable, the corresponding inference rule is

788: applied to the addressed theorems

789: and the resulting theorem appended to {\em proof}. Otherwise

790: the currently tested proof technique is interrupted.  This

791: ensures that {\em proof} is never fed with invalid proofs.

792:

793: \item

794: {\bf delete-theorem(m)}

795: deletes the $m$-th theorem in the currently stored

796: {\em proof}, thus freeing storage such that proof-storing parts of $s$

797: can be reused and the maximal proof size is not necessarily limited by

798: storage constraints.

799: Theorems deleted from {\em proof}, however,  cannot be addressed any more by

800: {\em apply-rule} to produce further prolongations of {\em proof}.

801:

802: \item

803: {\bf set-switchprog(m,n)}

804: replaces $switchprog$ by $s^p_{m:n}$, provided

805: that $s^p_{m:n}$ is indeed a non-empty substring of

806: $s^p$, the storage writable by proof techniques.

807:

808: \item

809: \label{check}

810: {\bf check()}

811: verifies whether the goal of the proof search has been reached.

812: First it tests whether the last theorem (if any) in {\em proof}

813: has the form of a {\bf target theorem}. A target theorem states

814: that given the {\em current} axiomatized utility

815: function $u$ (Item \ref{itemu}), the utility of a

816: switch from $p$ to the current {\em switchprog}

817: would be higher than the utility of continuing the execution

818: of $p$ (which would keep searching for alternative {\em switchprog}s).

819: Target theorems are

820: symbol strings

821: (encoded in $s$)

822: of the (decoded) form

823: \begin{equation}

824: \label{goal}

825: (u[s({\bf t_1}) \oplus (switchbit({\bf t_1})=`1\ap), Env({\bf t_1})] >

826: \]

827: \[

828: u[s({\bf t_1}) \oplus (switchbit({\bf t_1})=`0\ap), Env({\bf t_1})] )

829: \end{equation}

830: where the variable $t_1$ (represented by syntactic place holder ${\bf t_1}$)

831: stands for a time step,

832: while all other symbols belong to the alphabet of

833: the theorem-proving calculus, that is,

834: the set of possible target theorems is parameterized only by $t_1$.

835: Here the calculus should permit the notation

836: $s({\bf t_1}) \oplus (switchbit({\bf t_1})=`b\ap)$ as

837: a shortcut for the state obtained when we replace

838: {\em switchbit}($t_1$),

839: the true value of the variable bit $switchbit$

840: (encoded in $s$)

841: at time $t_1$, by $b \in \{0, 1\}$. This will

842: facilitate the formulation of theorems

843: that compare values conditioned on various alternative

844: hypothetical properties of $s(t_1)$. (Note that $s(t_1)$ may

845: be only partially known by the current proof technique

846: even in environments where

847: $s(t_1)$ and {\em switchbit}($t_1$) are

848: fully predetermined for all valid $t_1$.)

849:

850: The purpose of introducing $t_1$ is to deal with hardware-specific

851: temporal delays that may be involved in checking and switching---it may

852: take a significant amount of

853: time to match abstract symbol strings found during proof search to the

854: \gmn's real current state.

855: If a target theorem has been found, {\em check()} uses a

856: simple prewired subroutine (also encoded in $p(1)$, of course)

857: to check whether there is enough time left

858: to set variable {\em switchbit} (originally 0) to 1 before

859: the continually increasing $time$ will equal $t_1$.

860: If this subroutine

861: returns a negative result, {\em check()} exits.

862: Otherwise it sets {\em switchbit} $:=1$

863: (there is no other way of changing {\em switchbit}).

864: Then it repeatedly tests $time$ until $time > t_1$, to make sure

865: the condition of formula (\ref{goal})

866: was fulfilled at $t_1$.

867: Then it transfers control to {\em switchprog}

868: (there is no other way of calling {\em switchprog}).

869: The {\em switchprog}

870: may subsequently rewrite all parts of $s$, excluding hardware-reserved

871: parts such as $time$ and $x$, but including $p$.

872:

873: \item

874: \label{state2theorem}

875: {\bf state2theorem(m, n)}

876: takes two integer arguments $m, n$

877: and tries to transform the current contents of $s_{m:n}$

878: into a theorem of the form

879: \[

880: (s_{{\bf m}:{\bf n}}({\bf t_1})={\bf z}),

881: ~e.g.:~ (s_{6:9}(7775555)=`1001\ap),

882: \]

883: where $t_1$ represents a time measured (by checking {\em time})

884: shortly after {\em state2theorem} was invoked,

885: and $z$ the bistring $s_{m:n}(t_1)$ (recall the special

886: case $t_1=1$ of Item \ref{initaxioms}).

887: So we accept the time-labeled current

888: observable contents of any part of $s$ as a theorem that does not have

889: to be proven in an alternative way from, say, the

890: initial state $s(1)$, because the computation so far

891: has already demonstrated that the theorem is true.

892: Thus we may exploit information conveyed

893: by environmental inputs, and the fact that sometimes

894: (but not always) the fastest way

895: to determine the output of a program is to run it.

896:

897: %\begin{small}

898: {\em

899: This non-traditional online interface between syntax and semantics

900: requires special care though.

901: We must avoid inconsistent results through

902: parts of $s$ that change while being read.

903: For example, the present value of a quickly changing

904: instruction pointer {\em IP} (continually updated by the hardware)

905: may be essentially unreadable in the

906: sense that the execution of the reading subroutine

907: itself will already modify {\em IP} many times.

908: For convenience, the (typically limited) hardware could be set up

909: such that it stores the contents of

910: fast hardware variables every $c$ cycles in

911: a reserved part of $s$, such that

912: an appropriate  variant of {\em state2theorem()} could at least

913: translate certain recent values of fast variables into theorems.

914: This, however, will not abolish {\em all} problems associated

915: with self-observations.

916: For example, the $s_{m:n}$ to be read might

917: also contain the reading procedure's

918: own, temporary, constantly changing string pointer variables,

919: etc.\footnote{We see that certain parts of the

920: current $s$ may not be directly observable without changing

921: the observable itself.

922: Sometimes, however, axioms and previous observations will allow

923: the \gm to {\em deduce} time-dependent storage contents that

924: are not directly observable.

925: For instance, by analyzing the code being executed through

926: instruction

927: pointer {\em IP} in the example above,  the value of {\em IP} at

928: certain times may be predictable (or postdictable, after the

929: fact).  The values of other variables at given times,

930: however, may not be deducible at all.

931: Such limits of self-observability

932: are reminiscent of Heisenberg's celebrated

933: uncertainty principle \cite{Heisenberg:25},

934: which states that certain physical measurements are necessarily

935: imprecise, since the measuring process affects the measured

936: quantity.}

937: To address such problems on computers with limited

938: memory, {\em state2theorem}

939: first uses some fixed protocol

940: (encoded in $p(1)$, of course)

941: to check whether the current $s_{m:n}$ is readable

942: at all or whether it might change if it

943: were read by the remaining code of {\em state2theorem}.

944: If so, or if $m,n,$ are not in the proper range,

945: then the instruction has no further effect. Otherwise it

946: appends an {\em observed} theorem of the form

947: $(s_{{\bf m}:{\bf n}}({\bf t_1})={\bf z})$

948: to {\em proof}.

949: For example, if the current time is 7770000, then the invocation

950: of {\em state2theorem(6,9)} might return the theorem $(s_{6:9}(7775555)=`1001\ap)$,

951: where

952: $7775555-7770000=5555$ reflects the time needed by {\em state2theorem}

953: to perform the initial check

954: and to read leading bits off the continually increasing $time$

955: (reading $time$ also costs time) such that

956: it can be sure that $7775555$ is a recent proper time label following

957: the start of {\em state2theorem}.

958: }

959: %\end{small}

960:

961: \end{enumerate}

962:

963: The axiomatic system $\cal A$ is a defining

964: parameter of a given \gmn.  Clearly,  $\cal A$ must be

965: strong enough to permit proofs of target theorems.

966: In particular, the theory of uncertainty axioms

967: (Item \ref{probaxioms}) must be sufficiently rich.

968: This is no fundamental problem: we simply

969: insert all traditional axioms of probability theory \cite{Kolmogorov:33}.

970:

971:

972: \section{\bf Global Optimality Theorem}

973: \label{secglobopt}

974:

975:  Intuitively, at any given time $p$ should

976:  execute some self-modification algorithm

977:  (via instruction {\em check()}---Item \ref{check} above)

978:  only if it is

979:  the `best' of all possible self-modifications,

980:  given the utility function, which typically

981:  depends on available resources, such as storage size and

982:  remaining lifetime.

983:  At first glance, however, target theorem (\ref{goal})

984:  seems to implicitly talk about just

985:  one single modification algorithm, namely, {\em switchprog}$(t_1)$

986:  as set by the systematic proof searcher at time $t_1$.

987:  Isn't this type of local search greedy? Couldn't

988:  it lead to a local optimum instead of a global one?

989:  No, it cannot, according to the following global optimality theorem.

990:

991: \subsection{Globally Optimal Self-Changes, Given $u$ and $\cal A$ Encoded in $p$}

992: \label{sectheorem}

993:

994:   \begin{theorem}

995:   \label{globopt}

996:   Given any formalizable utility function $u$ (Item \ref{itemu}),

997:   and assuming consistency of the underlying formal system $\cal A$,

998:   any self-change of $p$ obtained through execution of

999:   some program {\em switchprog} identified

1000:   through the proof of a target theorem  (\ref{goal})

1001:   is globally optimal in the following sense:

1002:   the utility of starting the execution of the present

1003:   {\em switchprog} is higher than the utility of

1004:   waiting for the proof searcher

1005:   to produce an alternative {\em switchprog} later.

1006:   \end{theorem}

1007:

1008:

1009: \noindent

1010: {\bf Proof.} Target theorem  (\ref{goal})

1011: implicitly talks about all the other

1012: {\em switchprog}s that the proof searcher

1013: could produce in the future. To see this, consider

1014: the two alternatives of the binary decision:

1015: (1) either execute the current {\em switchprog} (set {\em switchbit} $=1$), or

1016: (2) keep searching for {\em proof}s and {\em switchprog}s (set {\em switchbit} $=0$)

1017: until the systematic

1018: searcher comes up with an even better {\em switchprog}.

1019: Obviously the second alternative concerns all (possibly infinitely

1020: many) potential {\em switchprog}s to be considered later.  That is,

1021: if  the current {\em switchprog}  were not the `best', then

1022: the proof searcher would not be able to prove that

1023: setting {\em switchbit} and

1024: executing  {\em switchprog} will cause higher expected reward

1025: than discarding {\em switchprog}, assuming consistency of $\cal A$.

1026: {\em Q.E.D.}

1027:

1028: The initial proof searcher of Section \ref{biops}

1029: already generates all possible proofs and {\em switchprogs} in $O()$-optimal fashion.

1030: Nevertheless, since it is part of $p(1)$, its proofs can speak about the

1031: proof searcher itself, possibly triggering proof searcher rewrites

1032: resulting in better than merely $O()$-optimal performance.

1033:

1034: \subsection{Alternative Relaxed Target Theorem}

1035: \label{alternative}

1036: We may replace the target theorem  (\ref{goal}) (Item \ref{check})

1037: by the following

1038: alternative target theorem:

1039: \[

1040: (u[s({\bf t_1}) \oplus (switchbit({\bf t_1})=`1\ap), Env({\bf t_1})] \geq

1041: \]

1042: \begin{equation}

1043: \label{goal2}

1044: u[s({\bf t_1}) \oplus (switchbit({\bf t_1})=`0\ap), Env({\bf t_1})] )

1045: \end{equation}

1046: The only difference to the original target theorem  (\ref{goal})

1047: is that the ``$>$'' sign became a  ``$\geq$'' sign. That is,

1048: the \gm will change itself as soon as it has found a proof

1049: that the change will not make things worse.

1050: A Global Optimality Theorem similar to Theorem \ref{globopt}

1051: holds; simply replace the last phrase

1052: in Theorem \ref{globopt} by:

1053:   {\em the utility of starting the execution of the present

1054:   {\em switchprog} is at least as high as the

1055:   utility of waiting for the proof searcher

1056:   to produce an alternative {\em switchprog} later.}

1057:

1058: \subsection{Global Optimality and Recursive Meta-Levels}

1059: \label{comment}

1060: One of the most important

1061: aspects of our fully self-referential set-up is the following.

1062: Any proof of a target theorem automatically proves

1063: that the corresponding self-modification is good for all

1064: further self-modifications affected by the present one,

1065: in recursive fashion.

1066: In that sense all possible ``meta-levels'' of the self-referential

1067: system are collapsed into one.

1068:

1069: \subsection{How Difficult is it to Prove Target Theorems?}

1070: \label{difficult}

1071: This depends on the tasks and the initial axioms $\cal A$, of course.

1072: It is straight-forward to devise simple tasks and

1073: corresponding consistent $\cal A$

1074: such that there are short and trivial proofs of target theorems.

1075:

1076: Even when we initialize the initial problem solver $e(1)$ by an

1077: asymptotically optimal, rather general method such as Hutter's \tl

1078: \cite{Hutter:01aixi+,Hutter:04book+},

1079: it may be straight-forward to prove that switching to

1080: another strategy is useful,

1081: especially when $\cal A$ contains additional prior

1082: knowledge in form of axiomatic assumptions

1083: beyond those made by \tln .  The latter needs a very time-consuming but constant set-up

1084: phase whose costs disappear in the $O()$-notation but not in

1085: a utility function such as the $u$ of equation (\ref{u}).

1086: For example, simply construct an environment where

1087: maximal reward is achieved by performing a never-ending sequence

1088: of simple but rewarding actions, say, repeatedly pressing a lever,

1089: plus a very simple axiomatic system $\cal A$ that permits a short proof

1090: showing that it is useful to interrupt the non-rewarding set-up phase and

1091: start pressing the lever.

1092:

1093: On the other hand, it is possible to construct

1094: situations where it is impossible to prove target theorems,

1095: for example, by using results of undecidability theory,

1096: e.g., \cite{Goedel:31,Rice:53,Blum:67,Blum:71}.

1097: In particular, adopting the extreme notion of triviality

1098: embodied by Rice's theorem \cite{Rice:53}

1099: ({\em any nontrivial property over general

1100: functions is undecidable}), only {\em trivial}

1101: improvements of a given strategy may be provably useful.

1102: This notion of triviality, however, clearly does not reflect

1103: what is {\em intuitively} regarded as trivial by scientists.

1104: Although many theorems of the machine learning

1105: literature in particular,  and the computer science

1106: literature in general, are limited to functional properties that

1107: are trivial in the sense of Rice, they are widely regarded

1108: as non-trivial in an {\em intuitive} sense.

1109: In fact, the infinite domains of function classes

1110: addressed by Rice's theorem

1111: are irrelevant not only for most scientists dealing

1112: with real world problems

1113: but also for a typical \gm dealing

1114: with a limited number of events that

1115: may occur within its limited life time.

1116: Generally speaking, in between the {\em obviously} trivial and the {\em obviously}

1117: non-trivial cases there are many less obvious ones.

1118: The point is: usually we do not know in advance whether it is

1119: possible or not to change a given initial problem solver

1120: in a provably good way.  The traditional approach is to

1121: invest human research effort into finding out. A \gmn, however, can

1122: do this by itself, without essential limits

1123: apart from those of computability and provability.

1124:

1125: Note that to prove a target theorem,

1126: a proof technique does not necessarily have to compute

1127: the true expected utilities of switching and not

1128: switching---it just needs to determine which is higher.

1129: For example, it may be easy to prove that

1130: speeding up a subroutine of the proof searcher

1131: by a factor of 2 will certainly be worth

1132: the negligible (compared to lifetime $T$) time needed to

1133: execute the subroutine-changing algorithm, no matter

1134: what is the precise utility of the switch.

1135:

1136:

1137:

1138: \section{Bias-Optimal Proof Search (BIOPS)}

1139: \label{biops}

1140:

1141: Here we construct an initial  $p(1)$ that is $O()$-optimal in a certain

1142: limited sense to be described below, but still might be improved

1143: as it is not necessarily optimal in the sense of the given $u$

1144: (for example, the $u$ of equation (\ref{u}) neither mentions

1145: nor cares for $O()$-optimality).

1146: Our Bias-Optimal Proof Search (BIOPS)

1147: is essentially an application of

1148: {\em Universal Search}

1149: \cite{Levin:73,Levin:84}

1150: to proof search.

1151: One novelty, however, is this:

1152: Previous practical variants and extensions of {\em Universal

1153: Search} have been applied

1154: \cite{Schmidhuber:95kol,Schmidhuber:97nn,Schmidhuber:97bias,Schmidhuber:04oops}

1155: to {\em offline}

1156: program search tasks where the program inputs are fixed

1157: such that the same program always produces the same results.

1158: In our {\em online} setting, however, BIOPS has to take

1159: into account that the same proof technique

1160: started at different times may yield different proofs,

1161: as it may read parts of $s$ (e.g., inputs)

1162: that change as the machine's life proceeds.

1163:

1164: \subsection{Online Universal Search in Proof Space}

1165: \label{online}

1166:

1167: BIOPS starts with a probability distribution $P$

1168: (the initial bias) on the proof techniques $w$ that

1169: one can write in $\cal L$,

1170: e.g., $P(w)=K^{-l(w)}$ for programs composed from $K$

1171: possible instructions  \cite{Levin:84}.

1172: BIOPS is {\em near-bias-optimal}

1173: \cite{Schmidhuber:04oops}

1174: in the sense that it will not spend

1175: much more time on any proof technique than it deserves,

1176: according to its probabilistic bias,

1177: namely, not much more than its probability times the total search time:

1178: \begin{definition} [Bias-Optimal

1179: Searchers \cite{Schmidhuber:04oops}]

1180: \label{bias-optimal}

1181: {\em Let $\cal R$ be a problem class,

1182: $\cal C$ be a search space of solution candidates

1183: (where  any problem $r \in \cal R$ should have a solution in $\cal C$),

1184: $P(q \mid r)$ be a task-dependent bias in the form of conditional probability

1185: distributions on the candidates $q \in \cal C$. Suppose that we also have

1186: a predefined procedure that creates and tests any given $q$

1187: on any $r \in \cal R$ within time $t(q,r)$ (typically unknown in advance).

1188: Then} a searcher is $n$-bias-optimal ($n \geq 1$) if

1189: for any maximal total search time $T_{total} > 0$

1190: it is guaranteed to solve any problem $r \in \cal R$

1191: if it has a solution $p \in \cal C$

1192: satisfying $t(p,r) \leq P(p \mid r)~T_{total}/n$.

1193: It is bias-optimal if  $n=1$.

1194: \end{definition}

1195:

1196: \begin{method}[BIOPS]

1197: \label{lsearch}

1198: {\em

1199: In phase $(i=1,2,3, \ldots)$ {\sc Do}:

1200: {\sc For} all self-delimiting \cite{Levin:84}

1201: proof techniques $w \in \cal L$ satisfying $P(w) \geq 2^{-i}$ {\sc Do}:

1202: \begin{enumerate}

1203: \item

1204: Run $w$ until halt or error (such as

1205: division by zero) or $2^iP(w)$ steps consumed.

1206: \item

1207: Undo effects of $w$ on $s^p$

1208: (does not cost significantly more time

1209: than executing $w$).

1210: \end{enumerate}

1211: }

1212: \end{method}

1213: A proof technique $w$ can interrupt Method \ref{lsearch}

1214: only by invoking instruction {\em check()} (Item \ref{check}),

1215: which may transfer

1216: control to {\em switchprog} (which possibly

1217: even will delete or rewrite Method  \ref{lsearch}).

1218: Since the initial $p$

1219: runs on the formalized hardware, and since proof techniques

1220: tested by $p$ can read $p$ and other parts of $s$, they

1221: can produce proofs concerning the (expected)

1222: performance of $p$ and BIOPS itself.

1223: %\subsection{Asymptotically Optimal Proof Search}

1224: %\label{asymptotic}

1225: Method \ref{lsearch} at least has the

1226: optimal {\em order} of computational complexity in the following

1227: sense.

1228: \begin{theorem}

1229: \label{asopt}

1230: If independently of variable {\em time(s)} some unknown

1231: fast proof technique $w$

1232: would require at most $f(k)$ steps to produce

1233: a proof of difficulty measure $k$ (an integer depending on

1234: the nature of the task to be solved), then

1235: Method \ref{lsearch}

1236: will need at most $O(f(k))$ steps.

1237: \end{theorem}

1238: {\bf Proof.}

1239: It is easy to see that

1240: Method \ref{lsearch} will need at most $O(f(k)/P(w)) = O(f(k))$

1241: steps---the constant factor  $1/P(w)$ does not depend on $k$.

1242: {\em Q.E.D.}

1243:

1244: The initial proof search itself is merely $O()$-optimal.

1245: Note again, however,  that

1246: the proofs themselves may concern quite

1247: different, arbitrary formalizable notions of optimality

1248: (stronger than those expressible in the $O()$-notation)

1249: embodied by the given, problem-specific, formalized

1250: utility function $u$, in particular, the maximum future

1251: reward in the sense of equation (\ref{u}).

1252: This may provoke useful, constant-affecting rewrites of

1253: the initial proof searcher despite its limited (yet popular

1254: and widely used) notion of $O()$-optimality.

1255: Once a useful rewrite has been found and executed

1256: after some initial fraction of the \gmn 's total lifetime, the restrictions

1257: of $O()$-optimality need not be an issue any more.

1258:

1259:

1260: \subsection{How a Surviving Proof Searcher May

1261: Use the Optimal Ordered Problem Solver to Solve Remaining Proof Search Tasks}

1262: \label{remaining}

1263:

1264: The following is not essential for this paper.

1265: Let us assume that the execution of the {\em switchprog}

1266: corresponding to the first found target theorem has not rewritten the code of

1267: $p$ itself---the current $p$ is still equal to $p(1)$---and has

1268: reset {\em switchbit} and returned

1269: control to $p$ such that it can continue where it was interrupted.

1270: In that case the \Biops subroutine of $p(1)$ can use the Optimal

1271: Ordered Problem Solver \Oops

1272: \cite{Schmidhuber:04oops}

1273: to accelerate the search for the

1274: $n$-th target theorem ($n>1$) by reusing proof techniques for earlier found

1275: target theorems where possible.  The basic ideas are as

1276: follows (details: \cite{Schmidhuber:04oops}).

1277:

1278: Whenever a target theorem has been proven, $p(1)$ {\em freezes}

1279: the corresponding proof technique: it becomes non-writable

1280: by proof techniques to be tested in later proof search tasks,

1281: but remains readable,

1282: such that it can be copy-edited and/or invoked as a subprogram by future proof

1283: techniques.  We also allow prefixes of proof

1284: techniques to temporarily rewrite the probability distribution

1285: on their suffixes

1286: \cite{Schmidhuber:04oops},

1287: thus essentially rewriting

1288: the probability-based search

1289: procedure (an incremental extension of Method \ref{lsearch})

1290: based on previous experience. As a side-effect we metasearch

1291: for faster search procedures, which can greatly accelerate the learning of

1292: new tasks \cite{Schmidhuber:04oops}.

1293:

1294: Given a new proof search task, \Biops performs \Oops

1295: by spending half the total search time on a

1296: variant of

1297: Method \ref{lsearch}

1298: that searches only among self-delimiting

1299: \cite{Levin:74,Chaitin:75} proof techniques starting with the most recently

1300: frozen proof technique.  The rest of the

1301: time is spent on fresh proof techniques with arbitrary prefixes

1302: (which may reuse previously frozen proof techniques though)

1303: \cite{Schmidhuber:04oops}.

1304: (We could also search for a {\em generalizing} proof technique

1305: solving all proof search tasks so far. In the first half of the search

1306: we would not have to test proof techniques on tasks other than

1307: the most recent one, since we already know that their prefixes

1308: solve the previous tasks

1309: \cite{Schmidhuber:04oops}.)

1310:

1311: It can be shown that \Oops is

1312: essentially {\em 8-bias-optimal} (see Def. \ref{bias-optimal}),

1313: given either the initial bias or intermediate biases due to

1314: frozen solutions to previous tasks

1315: \cite{Schmidhuber:04oops}.

1316: This result immediately carries over to \Biopsn.

1317: To summarize, \Biops essentially allocates part of the total search

1318: time for a new task to proof techniques that exploit previous successful

1319: proof techniques in computable ways.  If the new task can be solved faster

1320: by copy-editing / invoking previously frozen

1321: proof techniques than by solving the new proof search task from scratch,

1322: then  \Biops will discover this and profit thereof. If not, then

1323: at least it will not be significantly slowed down by the previous

1324: solutions---\Biops will remain 8-bias-optimal.

1325:

1326:  Recall, however, that \Biops is not the only

1327:  possible way of initializing the \gmn's proof searcher.

1328: The Global Optimality Theorem \ref{globopt}

1329: (Section \ref{secglobopt})

1330: expresses optimality with respect to whichever

1331: initial proof searcher we choose.

1332:

1333:

1334: \section{Discussion \& Previous Work}

1335: \label{discussion}

1336:

1337:

1338: Here we list a few examples of

1339: possible types of self-improvements

1340: (Section \ref{possible}),

1341: \gm applicability to various tasks defined by various utility functions and

1342: environments

1343: (Section \ref{examples}),

1344: probabilistic hardware

1345: (Section \ref{prob}),

1346: and relations to previous work

1347: (Section \ref{previous}).

1348: We also briefly discuss self-reference

1349: and consciousness

1350: (Section \ref{conscious}),

1351: and provide a list

1352: of answers to frequently asked questions

1353: (Section \ref{faq}).

1354:

1355: \subsection{Possible Types of \GM Self-Improvements}

1356: \label{possible}

1357:

1358:

1359:   Which provably useful self-modifications are possible?

1360:   There are few limits to what a \gm  might do.

1361:

1362: \begin{enumerate}

1363: \item

1364:    In one of the simplest cases

1365:    it might leave its basic proof searcher intact

1366:    and just change the ratio of time-sharing between

1367:    the proof searching subroutine and the

1368:    subpolicy $e$---those parts of $p$

1369:    responsible for interaction with the environment.

1370:

1371: \item

1372:     Or the \gm might modify $e$ only.

1373:     For example, the initial $e(1)$ may be a program

1374:     that regularly stores limited memories

1375:     of past events somewhere in $s$; this might allow $p$ to derive that

1376:     it would be useful to modify $e$ such that $e$ will conduct certain

1377:     experiments to increase the knowledge about

1378:     the environment, and use the resulting information

1379:     to increase reward intake.  In this sense the \gm embodies

1380:     a principled way of dealing with the

1381:     exploration vs exploitation problem \cite{Kaelbling:96}.

1382:     Note that the {\em expected} utility (equation (\ref{u}))

1383:     of conducting some experiment may  exceed

1384:     the one of not conducting it,

1385:     even when the experimental outcome later suggests to

1386:     keep acting in line with the previous $e$.

1387:

1388:

1389:

1390: \item

1391: The \gm might also modify its very axioms

1392: to speed things up. For example,

1393: it might find a proof that the

1394: original axioms should be replaced or

1395: augmented by theorems derivable

1396: from the original axioms.

1397:

1398: \item

1399:  The \gm might even change

1400:  its own utility function and target theorem,

1401:  but can do so only if their {\em new} values

1402:  are provably better according to the {\em old} ones.

1403:

1404: \item

1405:   In many cases we do not expect

1406:   the \gm to replace its proof searcher by

1407:   code that completely abandons the search for proofs.

1408:   Instead we expect that only certain subroutines

1409:   of the proof searcher will be sped up---compare the

1410:   example in Section \ref{difficult}---or that perhaps just

1411:   the order of generated proofs will be

1412:   modified in problem-specific fashion. This could be done

1413:   by modifying

1414:   the probability distribution on the proof techniques of

1415:   the initial bias-optimal proof searcher

1416:   from Section \ref{biops}.

1417:

1418: \item

1419: \label{supermarket}

1420: Generally speaking,

1421: the utility of limited rewrites may often be

1422: easier to prove than the one of total rewrites.

1423: For example, suppose it is 8.00pm and our \gmn-controlled

1424: agent's permanent goal is to

1425: maximize future expected reward, using the (alternative)

1426: target theorem (\ref{goal2}).  Part thereof is to avoid hunger. There

1427: is nothing in its fridge, and shops close down at 8.30pm.  It does not

1428: have time to optimize its way to the supermarket in  every little detail,

1429: but if it does not get going right now it will stay hungry tonight (in

1430: principle such near-future consequences of actions should be easily

1431: provable, possibly even in a way related to

1432: how humans prove advantages of potential actions to themselves). That

1433: is, if the agent's previous policy did not already include, say, an automatic

1434: daily evening trip to the supermarket, the policy provably should be

1435: rewritten at least in a very limited and simple way right now, while there

1436: is still time, such that the agent will surely get some food tonight,

1437: without affecting less urgent future behavior that  can be optimized

1438: / decided later, such as details of the route to the food, or

1439: of tomorrow's actions.

1440:

1441: \item

1442:  In certain uninteresting environments reward

1443:  is maximized by becoming dumb. For example,

1444:  a given task may require to repeatedly

1445:  and forever execute the same pleasure center-activating action,

1446:  as quickly as possible.  In such cases the \gm may delete

1447:  most of its more time-consuming initial software

1448:  including the proof searcher.

1449:

1450: \item

1451:   Note that there is no reason why a \gm should not

1452:   augment its own hardware.

1453:   Suppose its lifetime is known to be 100 years.

1454:   Given a hard problem and

1455:   axioms restricting the possible

1456:   behaviors of the environment,

1457:   the \gm might find a proof that its

1458:   expected cumulative reward will increase if

1459:   it invests 10 years into building faster computational

1460:   hardware, by exploiting the physical resources of

1461:   its environment.

1462: \end{enumerate}

1463:

1464: \subsection{Example Applications}

1465: \label{examples}

1466:

1467: Traditional examples that

1468: do not involve significant interaction with a probabilistic

1469: environment are easily dealt with in our reward-based framework:

1470:

1471:

1472: \begin{example}[Time-limited NP-hard optimization]

1473: \label{tsp}

1474: The initial

1475: input to the \gm is the representation of

1476: a connected graph with

1477: a large number of nodes linked by

1478: edges of various lengths.

1479: Within given time $T$ it should

1480: find a cyclic path connecting all nodes.

1481: The only real-valued reward will occur

1482: at time $T$. It equals

1483: 1 divided by the length

1484: of the best path found so far (0 if none was found).

1485: There are no other inputs.

1486: The by-product of  maximizing

1487: expected reward is to find the shortest

1488: path findable within the limited time,

1489: given the initial bias.

1490: \end{example}

1491:

1492:  \begin{example}[Fast theorem proving]

1493:  \label{goldbach}

1494:  Prove or disprove as quickly as possible

1495:  that all even integers  $>2$ are the sum of

1496:  two primes (Goldbach's conjecture).

1497:  The reward is $1/t$,

1498:  where $t$ is the time required to produce

1499:  and verify the first such proof.

1500:  \end{example}

1501: More general cases are:

1502:

1503: \begin{example}[Maximizing expected reward with bounded resources]

1504: \label{robot}

1505: A robot that needs at least

1506: 1 liter of gasoline per hour

1507: interacts with a partially unknown environment,

1508: trying to find hidden, limited gasoline depots

1509: to occasionally refuel its tank.

1510: It is rewarded in proportion to its lifetime,

1511: and dies after at most 100 years or as

1512: soon as its tank is empty or

1513: it falls off a cliff etc.

1514: The probabilistic environmental reactions are initially

1515: unknown but assumed to be sampled from

1516: the axiomatized Speed Prior \cite{Schmidhuber:02colt},

1517: according to which

1518: hard-to-compute environmental reactions are unlikely.

1519: This permits a computable strategy for making near-optimal

1520: predictions  \cite{Schmidhuber:02colt}.

1521: One by-product of maximizing  expected reward

1522: is to maximize expected lifetime.

1523: \end{example}

1524: \begin{example}[Optimize any suboptimal problem solver]

1525: \label{optimize}

1526: Given any formalizable problem, implement

1527: a suboptimal but known problem solver

1528: as software on the \gm hardware,

1529: and let the proof searcher of Section \ref{biops} run in parallel.

1530: \end{example}

1531:

1532: \subsection{Probabilistic \GM Hardware}

1533: \label{prob}

1534: Above we have focused on an example deterministic machine

1535: living in a possibly probabilistic environment.

1536: It is straight-forward to extend this

1537: to computers whose actions are computed in

1538: probabilistic fashion, given the current state.

1539: Then the expectation calculus

1540: used for probabilistic aspects of the environment

1541: simply has to be extended to the hardware itself,

1542: and the mechanism for verifying proofs has to

1543: take into account that there is no such thing as

1544: a certain theorem---at best there are formal statements

1545: which are true with such and such probability.

1546: In fact, this may be the most realistic approach

1547: as any physical hardware is error-prone,

1548: which should be taken into account by

1549: realistic probabilistic \gmn s.

1550:

1551:  Probabilistic settings also automatically avoid certain

1552:  issues of axiomatic consistency. For example, predictions

1553:  proven to come true with

1554:  probability less than 1.0 do not necessarily cause contradictions

1555:  even when they do not match the observations.

1556:

1557:

1558: \subsection{Relations to Previous Work}

1559: \label{previous}

1560:

1561: Despite (or maybe because of) the ambitiousness and

1562: potential power of self-improving machines,

1563: there has been little work in this vein outside our own

1564: labs at IDSIA and TU M\"{u}nchen.

1565: Here we will list essential differences

1566: between the \gm and our previous approaches to `learning to learn,'

1567: `metalearning,' self-improvement, self-optimization,  etc.

1568:

1569: The most closely related approaches are Hutter's

1570: \hs and \tl (Item \ref{hutter} below).

1571: For historical reasons, however, we will first discuss

1572: Levin's {\em Universal Search} and  Hutter's \Aixin.

1573: \begin{enumerate}

1574:

1575: \item{\bf \GM vs Universal Search}

1576: \label{gmvsls}

1577:

1578: Unlike the fully self-referential \gmn,

1579: Levin's {\em Universal Search} \cite{Levin:73,Levin:84}

1580: has a hardwired, unmodifiable meta-algorithm that cannot improve itself. It

1581: is asymptotically optimal for inversion problems whose solutions can be quickly

1582: verified in $O(n)$ time (where $n$ is the solution size), but it will always

1583: suffer from the same huge constant slowdown factors (typically $>> 10^{1000}$)

1584: buried in the $O()$-notation. The self-improvements of a \gmn,

1585: however, can be more than merely $O()$-optimal, since its utility function

1586: may formalize a stonger type of optimality that does not ignore huge

1587: constants just because they are constant---compare the utility

1588: function of equation (\ref{u}).

1589:

1590: Furthermore, the \gm is applicable to

1591: general lifelong reinforcement learning (RL) tasks \cite{Kaelbling:96}

1592: where {\em Universal Search} is {\em not} asymptotically optimal,

1593: and not even applicable, since

1594: in RL the evaluation of

1595: some behavior's value in principle consumes the learner's entire life!

1596: So the naive test of whether a program is good or not

1597: would consume the entire life. That is, we could test

1598: only one program; afterwards life would be over.

1599:

1600: Therefore, to achieve their objective, general RL machines

1601: must do things that {\em Universal Search} does

1602: not do, such as predicting {\em future} tasks and rewards.

1603: This partly motivates Hutter's universal RL machine AIXI, to be

1604: discussed next.

1605: \\

1606:

1607: \item{\bf \GM vs \Aixi}

1608: \label{aixi}

1609:

1610: Unlike \gmn s,

1611: Hutter's recent {\em \Aixi model} \cite{Hutter:01aixi+,Hutter:04book+}

1612: generally needs {\em unlimited} computational resources per input update.

1613: It combines Solomonoff's universal prediction

1614: scheme \cite{Solomonoff:64,Solomonoff:78} with an {\em expectimax}

1615: computation.  In discrete cycle $k=1,2,3, \ldots$,

1616: action $y(k)$ results

1617: in perception $x(k)$ and reward $r(k)$, both

1618: sampled from the unknown

1619: (reactive) environmental probability distribution $\mu$.  \Aixi defines

1620: a mixture distribution $\xi$ as a weighted sum of distributions $\nu

1621: \in \cal M$, where $\cal M$ is any class of distributions that includes the

1622: true environment $\mu$.  For example, $\cal M$ may

1623: be a sum of all computable

1624: distributions \cite{Solomonoff:64,Solomonoff:78}, where the sum of

1625: the weights does not exceed 1.  In cycle $k+1$, \Aixi

1626: selects as next action the first in an action sequence maximizing

1627: $\xi$-predicted reward up to some given horizon.

1628: Recent work \cite{Hutter:02selfopt+} demonstrated \Aixi's optimal

1629: use of observations as follows.  The Bayes-optimal policy $p^\xi$ based on

1630: the mixture $\xi$ is self-optimizing in the sense that its average

1631: utility value converges asymptotically for all $\mu \in \cal M$ to the

1632: optimal value achieved by the (infeasible) Bayes-optimal policy $p^\mu$

1633: which knows $\mu$ in advance.  The necessary condition that $\cal M$

1634: admits self-optimizing policies is also sufficient.

1635: Furthermore, $p^\xi$ is Pareto-optimal

1636: in the sense that there is no other policy yielding higher or equal

1637: value in {\em all} environments $\nu \in \cal M$ and a strictly higher

1638: value in at least one \cite{Hutter:02selfopt+}.

1639:

1640: While \Aixi clarifies certain

1641: theoretical limits of machine learning, it

1642: is computationally intractable, especially when $\cal M$

1643: includes all computable

1644: distributions.  This drawback

1645: motivated work on the time-bounded, asymptotically optimal

1646: \tl system \cite{Hutter:01aixi+}

1647: and the related \hs  \cite{Hutter:01fast+}, both

1648: to be discussed next.

1649: \\

1650:

1651: \item{\bf \GM vs \hs and \tl}

1652: \label{hutter}

1653:

1654: Now we come to the most closely related previous work;

1655: so we will go an extra length to

1656: point out the main novelties of the \gmn.

1657:

1658: Hutter's non-self-referential but still $O()$-optimal

1659: {\em `fastest' algorithm for all well-defined problems}

1660: \hs \cite{Hutter:01fast+}

1661: uses a {\em hardwired} brute force proof searcher  and

1662: ignores the costs of proof search.

1663: Assume discrete input/output domains $X/Y$, a formal problem

1664: specification $f: X \rightarrow Y$

1665: (say, a functional description of how integers are decomposed

1666: into their prime factors),

1667: and a particular $x \in X$ (say,

1668: an integer to be factorized). \hs

1669: orders all proofs of an appropriate axiomatic system

1670: by size to find programs $q$ that

1671: for all $z \in X$ provably compute $f(z)$

1672: within time bound $t_q(z)$. Simultaneously it

1673: spends most of its time on executing the $q$ with the

1674: best currently proven time bound $t_q(x)$.

1675: It turns out that \hs

1676: is as fast as the {\em fastest} algorithm that provably

1677: computes $f(z)$ for all $z \in X$, save for a constant factor

1678: smaller than $1 + \epsilon$ (arbitrary $\epsilon > 0$)

1679: and an $f$-specific but $x$-independent

1680: additive constant  \cite{Hutter:01fast+}.

1681: This constant may be enormous though.

1682:

1683: Hutter's \tl \cite{Hutter:01aixi+}

1684: is related.

1685: In discrete cycle $k=1,2,3, \ldots$ of \tln's lifetime,

1686: action $y(k)$ results in perception $x(k)$ and reward $r(k)$,

1687: where all quantities may depend on the complete history.

1688: Using a universal computer such as a Turing machine,

1689: \tl needs an initial offline

1690: setup phase (prior to interaction with the environment) where

1691: it uses a {\em hardwired} brute force proof searcher to

1692: examine all proofs of length at

1693: most $L$, filtering out those that identify programs (of maximal

1694: size $l$ and maximal runtime $t$ per cycle) which not only

1695: could interact with the environment but which for

1696: all possible interaction histories

1697: also correctly predict a lower bound of their own expected future reward.

1698: In cycle $k$, \tl then runs all programs

1699: identified in the

1700: setup phase (at most $2^l$), finds the one with highest self-rating,

1701: and executes its corresponding action.

1702: The problem-independent setup time (where almost all of the work is done)

1703: is $O(L \cdot 2^{L})$.  The online time per cycle is $O(t \cdot 2^l)$.

1704: Both are constant but typically huge.

1705:

1706: \noindent

1707: {\bf Advantages and Novelty of the \GMn.}

1708: There are major differences between the \gm and Hutter's

1709: \hs \cite{Hutter:01fast+} and \tl \cite{Hutter:01aixi+}, including:

1710:

1711: \begin{enumerate}

1712: \item

1713: The theorem provers  of \hs and \tl

1714: are hardwired, non-self-\-refe\-ren\-tial, unmodifiable meta-algorithms

1715: that cannot improve

1716: themselves. That is, they will always suffer from the same

1717: huge constant slowdowns (typically $\gg 10^{1000}$)

1718: buried in the $O()$-notation.

1719: But there is nothing in principle that prevents the

1720: truly self-referential code

1721: of a \gm

1722: from proving and exploiting drastic

1723: reductions of such constants, in the best possible

1724: way that provably constitutes an improvement, if there is any.

1725: \item

1726: The demonstration of the $O()$-optimality of \hs and \tl depends on

1727: a clever allocation of computation time to some of their unmodifiable meta-algorithms.

1728: Our Global Optimality Theorem (Theorem \ref{globopt},

1729: Section \ref{secglobopt}), however, is justified through

1730: a quite different type of reasoning which indeed exploits and crucially depends

1731: on the fact that there is no unmodifiable software at all,

1732: and that the proof searcher itself is readable, modifiable,

1733: and can be improved.

1734: This is also the reason why its self-improvements

1735: can be more than merely $O()$-optimal.

1736: \item

1737: \hs uses a ``trick'' of proving more than is necessary

1738: which also disappears in the sometimes quite misleading $O()$-notation: it

1739: wastes time on finding programs that provably compute $f(z)$

1740: for all $z \in X$ even when the current $f(x) (x \in X)$ is the only

1741: object of interest. A \gmn, however, needs to prove only what is

1742: relevant to its goal formalized by $u$. For example, the

1743: general $u$ of eq. (\ref{u})

1744: completely ignores the limited concept of $O()$-optimality,

1745: but instead formalizes a stronger type of optimality that does not ignore huge

1746: constants just because they are constant.

1747: \item

1748: Both the \gm and \tl can maximize expected reward

1749: (\hs cannot). But the \gm is more flexible

1750: as we may plug in {\em any} type of formalizable utility function

1751: (e.g., {\em worst case} reward),  and unlike \tl it does not

1752: require an enumerable environmental distribution.

1753: \end{enumerate}

1754: Nevertheless, we may

1755: use \tl or \hs or other less general methods

1756: to initialize the

1757: substring $e$ of $p$ which is

1758: responsible for interaction with the environment.

1759: The \gm will replace $e(1)$ as soon

1760: as it finds a provably better strategy.

1761:

1762: It is the {\em self-referential} aspects

1763: of the \gm that relieve us of much of the burden of careful

1764: algorithm design required for \tl and \hsn. They make the \gm both

1765: conceptually simpler {\em and} more general.

1766: \\

1767:

1768: \item{\bf \GM vs \Oops}

1769: \label{oopsrl}

1770:

1771: The Optimal  Ordered Problem Solver

1772: \Oops \cite{Schmidhuber:04oops,Schmidhuber:03nips}

1773: (used by \Biops in Section \ref{remaining}) extends

1774: {\em Universal Search} (Item \ref{gmvsls}). It is a bias-optimal (see Def. \ref{bias-optimal})

1775: way of searching for a

1776: program that solves each problem in an ordered  sequence

1777: of problems of a rather general type,

1778: continually organizing and managing and reusing earlier acquired knowledge.

1779: Solomonoff recently also proposed related

1780: ideas for a {\em scientist's assistant}

1781: \cite{Solomonoff:03} that modifies the probability

1782: distribution of {\em Universal Search} \cite{Levin:73} based

1783: on experience.

1784:

1785: Like {\em Universal Search} (Item \ref{gmvsls}),

1786: \Oops is not directly applicable to

1787: RL problems.  A provably optimal RL machine

1788: must somehow {\em prove} properties of otherwise un-testable behaviors

1789: (such as: what is the expected reward of this behavior which

1790: one cannot naively test as there is not enough time).

1791: That is part of what the \gm does: it tries

1792: to greatly cut testing time, replacing naive time-consuming

1793: tests by much faster proofs of predictable test outcomes

1794: whenever this is possible.

1795:

1796: Proof verification

1797: itself can be performed very quickly. In particular,

1798: verifying the correctness of

1799: a found proof typically does not consume the remaining life.

1800: Hence the \gm may use \Oops as a bias-optimal proof-searching

1801: submodule (Section \ref{remaining}). Since the proofs themselves may concern quite

1802: different, {\em arbitrary} notions of optimality (not just

1803: bias-optimality), the \gm is more

1804: general than plain \Oopsn.

1805: But it is not just an extension of \Oopsn.  Instead of \Oops it may

1806: as well use non-bias-optimal alternative methods to initialize

1807: its proof searcher.

1808: On the other hand, \Oops is not just a precursor of the \gmn.

1809: It is a stand-alone, incremental, bias-optimal way of allocating

1810: runtime to programs that reuse previously successful

1811: programs, and is applicable to many traditional

1812: problems, including but not limited to proof search.

1813: \\

1814:

1815: \item{\bf \GM vs Success-Story Algorithm and Other Metalearners}

1816: \label{ssa}

1817:

1818: A learner's modifiable components are called its policy. An algorithm

1819: that modifies the policy is a learning algorithm. If the learning

1820: algorithm has modifiable components represented as part of the policy,

1821: then we speak of a self-modifying policy (SMP) \cite{Schmidhuber:97ssa}.

1822: SMPs can modify the way they modify themselves etc.

1823: The \gm has an SMP.

1824:

1825: In previous practical work we used the {\em success-story algorithm} (SSA)

1826: to force some (stochastic) SMP to trigger better and better

1827: self-modifications

1828: \cite{Schmidhuber:94self,Schmidhuber:96meta,Schmidhuber:97ssa,Schmidhuber:97bias}.

1829: During the learner's life-time, SSA is occasionally called at times computed

1830: according to SMP itself. SSA uses backtracking to undo those SMP-generated

1831: SMP-modifications that have not been empirically observed to trigger

1832: lifelong reward accelerations (measured up until the current SSA

1833: call---this evaluates the long-term effects of SMP-modifications setting the

1834: stage for later SMP-modifications). SMP-mo\-di\-fi\-ca\-tions that survive SSA

1835: represent a lifelong success history. Until the next SSA call, they build

1836: the basis for additional SMP-modifications. Solely by self-modifications

1837: our SMP/SSA-based learners solved a complex task in a partially observable

1838: environment whose state space is far bigger than most found in

1839: the literature \cite{Schmidhuber:97ssa}.

1840:

1841:  The \gmn's training algorithm is theoretically

1842:  much more powerful than SSA though.

1843:  SSA empirically measures the usefulness of previous

1844:  self-modifications, and does not necessarily encourage

1845:  provably optimal ones.

1846:  Similar drawbacks hold for

1847:  Lenat's human-assisted, non-autonomous,

1848:  self-modifying learner \cite{Lenat:83},

1849:  our Meta-Genetic Programming \cite{Schmidhuber:87} extending

1850:  Cramer's Genetic Programming \cite{Cramer:85,Banzhaf:98},

1851:  our metalearning economies \cite{Schmidhuber:87}

1852:  extending Holland's machine learning eco\-no\-mies \cite{Holland:85},

1853:  and gradient-based metalearners

1854:  for continuous program spaces of differentiable

1855:  recurrent neural networks

1856:  \cite{Schmidhuber:93selfreficann,Hochreiter:01meta}.

1857:  All these methods, however, could be used to seed

1858:  $p(1)$ with an initial policy.

1859:  \\

1860: \end{enumerate}

1861:

1862: \subsection{Are Humans Probabilistic \GMn s?}

1863: \label{human}

1864: We do not know. We think they better be.

1865: Their initial underlying formal system for dealing with uncertainty

1866: seems to differ substantially from those of

1867: traditional expectation calculus and logic though---compare Items

1868: \ref{envaxioms} and \ref{probaxioms} in Section \ref{instructions} as

1869: well as the supermarket example

1870: (Item \ref{supermarket} in Section \ref{possible}).

1871:

1872: \subsection{\GMn s and Consciousness}

1873: \label{conscious}

1874:

1875: In recent years the topic of consciousness

1876: has gained some credibility as a serious research issue,

1877: at least in philosophy and neuroscience, e.g., \cite{Crick:98}.

1878: However, there is a lack

1879: of {\em technical} justifications of consciousness:

1880: so far nobody has shown that

1881: consciousness is really useful for solving problems,

1882: although problem solving is considered

1883: of central importance in philosophy \cite{Popper:99}.

1884:

1885: The fully self-referential \gm may be viewed as providing

1886: just such a technical justification \cite{Schmidhuber:05gmconscious}.

1887: It is ``conscious'' or ``self-aware'' in the sense that

1888: its entire behavior is open to self-introspection,

1889: and modifiable.

1890: It may `step outside of itself' \cite{Hofstadter:79}

1891: by  executing self-changes that are provably good,

1892: where the proof searcher itself is subject

1893: to analysis and change through the proof techniques it tests.

1894: And this type of total self-reference is precisely

1895: the reason for its optimality as a problem solver, in the

1896: sense of Theorem \ref{globopt}.

1897:

1898: \subsection{Frequently Asked Questions}

1899: \label{faq}

1900:

1901: In the past year the author frequently

1902: fielded questions about the \gmn.

1903: Here a list of answers to typical ones.

1904:

1905: \begin{enumerate}

1906:

1907: \item

1908: {\bf Q:} {\em

1909: Does the exact business of formal proof search

1910: really make sense in the uncertain real world?

1911: }

1912:

1913: {\bf A:}

1914: Yes, it does. We just need to insert into $p(1)$ the standard axioms

1915: for representing uncertainty and for dealing with

1916: probabilistic settings and expected rewards etc.

1917: Compare items

1918: \ref{probaxioms} and \ref{envaxioms}

1919: in Section \ref{instructions}, and the definition

1920: of utility as an {\em expected} value in equation (\ref{u}).

1921: Also note that the machine learning literature is

1922: full of {\em human}-generated proofs of properties

1923: of methods for dealing with stochastic environments.

1924:

1925: \item

1926: {\bf Q:} {\em

1927: The target theorem (\ref{goal}) seems to refer only

1928: to the {\em very first} self-change, which  may completely

1929: rewrite the proof-search subroutine---doesn't this

1930: make the proof of Theorem \ref{globopt} invalid?

1931: What prevents later self-changes from being destructive?

1932: }

1933:

1934: {\bf A:}

1935: This is fully taken care of.

1936: Please have a look once more at the

1937: proof of Theorem \ref{globopt},

1938: and note that the first self-change will

1939: be executed only if it is provably useful

1940: (in the sense of the present untility function $u$)

1941: for all

1942: future self-changes (for which the present self-change is setting

1943: the stage). This is actually one of the main points of

1944: the whole self-referential set-up.

1945:

1946: \item

1947: {\bf Q} (related to the previous item): {\em

1948: The \gm implements a meta-learning behavior: what about a

1949: meta-meta, and a meta-meta-meta level?

1950: }

1951:

1952: {\bf A:}

1953: The beautiful thing is that all meta-levels are

1954: automatically collapsed into one: any proof

1955: of a target theorem automatically proves

1956: that the corresponding self-modification is good for all

1957: further self-modifications affected by the present one,

1958: in recursive fashion. Recall Section \ref{comment}.

1959:

1960: \item

1961: {\bf Q:} {\em

1962: The \gm software can produce only computable mappings

1963: from input sequences to output sequences.

1964: What if the environment is non-computable?

1965: }

1966:

1967: {\bf A:}

1968: Many physicists and other scientists

1969: (exceptions: \cite{Zuse:69,Schmidhuber:97brauer})

1970: actually seem to assume the real world

1971: makes use of all the real numbers, most

1972: of which are incomputable.

1973: Nevertheless,

1974: theorems and proofs are just finite symbol strings,

1975: and all treatises of physics contain only computable

1976: axioms and theorems, even when some of the theorems

1977: can be interpreted as making statements about uncountably

1978: many objects, such as all the real numbers. (Note though that

1979: the L\"{o}wenheim-Skolem Theorem \cite{Loewenheim:15,Skolem:19}

1980: implies that any first order theory with an uncountable model

1981: such as the real numbers also has a countable model.)

1982: Generally speaking, formal descriptions of non-computable objects

1983: do {\em not at all} present a fundamental problem---they

1984: may still allow for finding a strategy

1985: that provably maximizes utility. If so, a \gm

1986: can exploit this. If not, then humans will not

1987: have a fundamental advantage over \gmn s.

1988:

1989: \item

1990: {\bf Q:} {\em

1991: Isn't automated theorem-proving

1992: very hard?  Current AI systems cannot prove

1993: nontrivial theorems without human intervention

1994: at crucial decision points.

1995: }

1996:

1997: {\bf A:}

1998: More and more important mathematical proofs (four color

1999: theorem etc) heavily depend on automated proof search.

2000: And traditional theorem provers do not even make use of

2001: our novel notions of proof techniques and $O()$-optimal

2002: proof search.  Of course, some proofs are indeed hard to find,

2003: but here humans and \gmn s face the

2004: same fundamental limitations.

2005:

2006:

2007: \item

2008: {\bf Q:} {\em Don't the

2009: ``no free lunch theorems'' \cite{Wolpert:97}

2010: say that it is

2011: impossible to construct universal problem solvers?

2012: }

2013:

2014: {\bf A:}

2015: No, they do not. They refer to the very special

2016: case of problems sampled from {\em i.i.d.} uniform

2017: distributions on {\em finite} problem spaces.

2018: See the discussion of no free lunch theorems

2019: in an earlier paper \cite{Schmidhuber:04oops}.

2020:

2021: \item

2022: {\bf Q:} {\em

2023: Can't the \gm

2024: switch to a program {\em switchprog} that rewrites

2025: the utility function to a ``bogus'' utility function that

2026: makes unfounded promises of big rewards in the near future?

2027: }

2028:

2029: {\bf A:}

2030: No, it cannot. It should be obvious that rewrites

2031: of the utility function can happen only if

2032: the \gm first can prove that the rewrite

2033: is useful according to the {\em present} utility function.

2034:

2035: \item

2036: {\bf Q:} {\em

2037: Aren't there problems with undecidability? For example,

2038: doesn't Rice's theorem \cite{Rice:53}

2039: or Blum's speed-up theorem \cite{Blum:67,Blum:71}

2040: pose problems?

2041: }

2042:

2043: {\bf A:} Not at all.  Of course, the \gm cannot profit

2044: from a hypothetical useful self-improvement whose utility is undecidable,

2045: and will therefore simply ignore it. Compare Section

2046: \ref{limits} on fundamental

2047: limitations of \gmn s (and humans, for that matter).

2048: Nevertheless, unlike previous methods, a \gm can in principle

2049: exploit at least the provably good improvements and speed-ups of

2050: {\em any} part of its initial software.

2051:

2052: \end{enumerate}

2053:

2054:

2055: \section{Conclusion}

2056: \label{conclusion}

2057:

2058: In 1931, Kurt G\"{o}del laid the foundations of theoretical

2059: computer science, using

2060: elementary arithmetics to build a universal programming language

2061: for encoding arbitrary proofs, given an arbitrary enumerable set of

2062: axioms.  He went on to construct {\em

2063: self-referential} formal statements that claim their own unprovability,

2064: using Cantor's diagonalization trick \cite{Cantor:1874}

2065: to demonstrate that formal systems such as

2066: traditional mathematics are either flawed in a certain

2067: sense or contain unprovable but true statements \cite{Goedel:31}.

2068: Since G\"{o}del's exhibition of the fundamental limits of proof and

2069: computation, and Konrad Zuse's subsequent construction of the first working

2070: programmable computer (1935-1941),

2071: there has been a lot of work on specialized algorithms

2072: solving problems taken from more or less general problem classes.

2073: Apparently, however,

2074: one remarkable fact has so far escaped the attention of computer

2075: scientists: it is possible to use self-referential proof systems to

2076: build optimally efficient yet conceptually very simple

2077: universal problem solvers.

2078:

2079: The initial software $p(1)$ of our \gm

2080: runs an initial, typically sub-optimal problem solver, e.g., one of

2081: Hutter's approaches

2082: \cite{Hutter:01aixi+,Hutter:01fast+}

2083: which have at least an optimal {\em order} of complexity,

2084: or some less general method  \cite{Kaelbling:96}.

2085: Simultaneously, it runs an $O()$-optimal initial proof

2086: searcher using an online variant of {\em Universal Search}

2087: to test {\em proof techniques}, which are

2088: programs able to compute proofs

2089: concerning the system's own future performance,

2090: based on an axiomatic system $\cal A$

2091: encoded in $p(1)$,

2092: describing a formal {\em utility} function $u$,

2093: the hardware and $p(1)$ itself.

2094: If there is no provably good, globally optimal

2095: way of rewriting $p(1)$ at all, then humans

2096: will not find one either. But if there is one,

2097: then $p(1)$ itself can find

2098: and exploit it. This approach yields the first

2099: class of theoretically sound, fully self-referential,

2100: optimally efficient, general problem solvers.

2101:

2102: After the theoretical discussion in Sections \ref{intro}-\ref{biops},

2103: one practical question remains: to build a particular,

2104: especially practical \gm with small initial constant overhead,

2105: which generally useful

2106: theorems should one add as axioms to $\cal A$

2107: (as initial bias) such that the initial searcher

2108: does not have to prove them from scratch?

2109:

2110: \section{Acknowledgments}

2111: Thanks to

2112: Alexey Chernov,

2113: Marcus Hutter,

2114: Jan Poland,

2115: Ray Solomonoff,

2116: Sepp Hochreiter,

2117: Shane Legg,

2118: Leonid Levin,

2119: Alex Graves,

2120: Matteo Gagliolo,

2121: Viktor Zhumatiy,

2122: Ben Goertzel,

2123: Will Pearson,

2124: and Faustino Gomez,

2125: for useful comments on

2126: drafts or summaries or

2127: earlier versions of this paper.

2128: I am also grateful to many others who

2129: asked questions during \gm talks or

2130: sent comments by email. Their input helped to shape

2131: Section \ref{faq} on frequently asked questions.

2132:

2133: \newpage

2134: \bibliography{bib}

2135: \bibliographystyle{plain}

2136: \end{document}

2137:

2138:

2139: