cs0309048/gm6.tex
1: \documentstyle[12pt,epsfig,times]{article}
2: %\renewcommand{\baselinestretch}{1.33}
3: %\documentstyle[a4wide]{article}
4: %\renewcommand\topfraction{1.0}
5: %\renewcommand\bottomfraction{1.0}
6: %\renewcommand\textfraction{0.0}
7: \renewcommand\floatpagefraction{1.0}
8: 
9: \textwidth=6.5 truein \textheight=9.0truein \voffset=-1.0truein
10: \hoffset=-.5truein \parskip=1mm
11: \input epsf
12: 
13: \begin{document}
14: 
15: \def\ap{\textrm{'}}
16: \def\Biops{{\sc Biops} }
17: \def\Biopsn{{\sc Biops}}
18: \def\Oops{{\sc Oops} }
19: \def\Oopsn{{\sc Oops}}
20: \def\Aixi{{\sc Aixi} }
21: \def\Aixin{{\sc Aixi}}
22: \def\tl{{\sc Aixi}{\em (t,l)} }
23: \def\tln{{\sc Aixi}{\em (t,l)}}
24: \def\hs{{\sc Hsearch} }
25: \def\hsn{{\sc Hsearch}}
26: \def\GM{G\"{o}del Machine }
27: \def\gm{G\"{o}del machine }
28: \def\GMn{G\"{o}del Machine}
29: \def\gmn{G\"{o}del machine}
30: \newtheorem{theorem}{Theorem}[section]
31: \newtheorem{lemma}{Lemma}[section]
32: \newtheorem{corollary}{Corollary}[section]
33: \newtheorem{conjecture}{Conjecture}[section]
34: \newtheorem{example}{Example}[section]
35: \newtheorem{definition}{Definition}[section]
36: \newtheorem{postulate}{Postulate}[section]
37: \newtheorem{method}{Method}[section]
38: \newtheorem{procedure}{Procedure}[section]
39: 
40: \def\odt{{\textstyle{1\over 2}}}
41: \def\maxarg{\mathop{\rm maxarg}}          % maxarg
42: \def\minarg{\mathop{\rm minarg}}          % minarg
43: 
44: 
45: \begin{titlepage}
46: 
47: \begin{center}
48: 
49: 
50: \vspace{3cm}
51: \vspace{0.6cm}
52: 
53: {\em TR IDSIA-19-03, Version 5 \\
54: December 2006, arXiv:cs.LO/0309048 v5 \\
55: (v1: 25 September 2003)
56: }
57: \vspace{0.9cm}
58: 
59: {\LARGE \GMn s: Self-Referential \\
60: %\vspace{0.4cm}
61: Universal Problem Solvers Making \\
62: \vspace{0.2cm}
63: Provably Optimal Self-Improvements}
64: 
65: \vspace{0.2cm}
66: 
67: \end{center}
68: 
69: \begin{center}
70: 
71: J\"{u}rgen Schmidhuber
72: 
73: {\it IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland \& \\
74: TU M\"{u}nchen, Boltzmannstr. 3,  85748 Garching, M\"{u}nchen, Germany}
75: 
76: {\tt juergen@idsia.ch - http://www.idsia.ch/\~{ }juergen}
77: 
78: 
79: \end{center}
80: 
81: \vspace{0.2cm}
82: 
83: 
84: 
85: \begin{abstract}
86: 
87: We present the first class of mathematically rigorous, general, fully
88: self-referential, self-improving, optimally efficient problem solvers.
89: Inspired by Kurt G\"{o}del's celebrated self-referential formulas (1931),
90: such a problem solver rewrites any part of its own code as soon
91: as it has found a proof that the rewrite is {\em useful,} where the
92: problem-dependent utility function and the hardware and the entire
93: initial code are described by axioms encoded in an initial proof
94: searcher which is also part of the initial code.  The searcher
95: systematically and in an asymptotically optimally efficient way tests
96: computable {\em proof techniques} (programs whose outputs are proofs)
97: until it finds a provably useful, computable self-rewrite.  We show
98: that such a self-rewrite is globally optimal---no local maxima!---since
99: the code first had to prove that it is not useful to continue the
100: proof search for alternative self-rewrites.  Unlike Hutter's previous {\em
101: non}-self-referential methods based on hardwired proof searchers, ours
102: not only boasts an optimal {\em order} of complexity but can optimally
103: reduce any slowdowns hidden by the $O()$-notation, provided the utility
104: of such speed-ups is provable at all.
105: 
106: \vspace{0.5cm}
107: \noindent
108: {\bf Keywords:} self-reference, 
109: reinforcement learning, 
110: problem solving, 
111: proof techniques, optimal
112: universal search, self-improvement
113: 
114: \vspace{0.5cm}
115: \noindent
116: {\bf \gm publications} up to 2006
117: (100 years after Kurt G\"{o}del's birth, and 75 years 
118: after his landmark paper 
119: \cite{Goedel:31}
120: laying the foundations of
121: theoretical computer science):
122:  \cite{Schmidhuber:03gm,Schmidhuber:04gmhtml,Schmidhuber:04oops,Schmidhuber:05icann,Schmidhuber:05gmconscious,Schmidhuber:05gmai}.
123: 
124: \end{abstract}
125: 
126: \end{titlepage}
127: 
128: \tableofcontents
129: \newpage
130: 
131: \section{Introduction and Outline}
132: \label{intro}
133: 
134: 
135: All traditional algorithms for problem solving /
136: machine learning /
137: reinforcement learning \cite{Kaelbling:96}
138: are hardwired. Some are designed to improve some limited
139: type of policy through
140: experience, but are not part of the modifiable policy, 
141: and cannot improve themselves in a theoretically sound way.
142: Humans are needed to create 
143: new / better problem solving algorithms and to prove their 
144: usefulness under appropriate assumptions.
145: 
146: Here we will eliminate the restrictive need for human 
147: effort in the most general way possible, 
148: leaving all the work
149: including the proof search to a system
150: that can rewrite and improve itself in arbitrary 
151: computable ways and in a most efficient fashion.
152: To attack this {\em ``Grand Problem of Artificial Intelligence''}
153: \cite{Schmidhuber:03grandai},
154: we introduce a novel class of optimal,
155: fully self-referential \cite{Goedel:31}
156: general problem solvers called 
157: {\em \gmn s} \cite{Schmidhuber:03gm,Schmidhuber:04gmhtml,Schmidhuber:05icann,Schmidhuber:05gmai,Schmidhuber:05gmconscious}.\footnote{Or {\em `Goedel machine'}, to
158: avoid the {\em Umlaut}. But
159: {\em `Godel machine'} would not be quite correct.
160: Not to be confused with what Penrose
161: calls, in a different context,
162: {\em `G\"{o}del's putative theorem-proving machine'} \cite{Penrose:94}!}
163: They are universal problem solving systems that
164: interact with some (partially observable) environment and 
165: can in principle modify themselves without essential limits
166: apart from the limits of computability.  
167: Their initial algorithm is not hardwired;
168: it can completely rewrite itself,
169: but only if a proof searcher embedded within the initial algorithm 
170: can first prove that the rewrite is useful, 
171: given a formalized utility function reflecting 
172: computation time and
173: expected future success (e.g., rewards).
174: We will see that self-rewrites due to this approach 
175: are actually {\em globally optimal} (Theorem \ref{globopt}, 
176: Section \ref{secglobopt}), relative to G\"{o}del's 
177: well-known fundamental restrictions of provability \cite{Goedel:31}.
178: These restrictions should not worry us;
179: if there is no proof of some self-rewrite's utility, then humans
180: cannot do much either.
181: 
182: The initial proof searcher is $O()$-optimal 
183: (has an optimal order of complexity)
184: in the sense of Theorem \ref{asopt},  Section \ref{biops}. 
185: Unlike hardwired systems such as Hutter's 
186: \cite{Hutter:01aixi+,Hutter:01fast+}
187: and Levin's \cite{Levin:73,Levin:84}
188: (Section \ref{previous}),
189: however, a \gm can in principle speed up any part of its
190: initial software, including its proof searcher, to meet 
191: {\em arbitrary} formalizable notions of optimality beyond those
192: expressible in the $O()$-notation.  Our approach yields the first
193: theoretically sound, fully self-referential,
194: optimal, general problem solvers.
195: 
196: \noindent {\bf Outline.} Section \ref{basics} presents basic concepts 
197: and fundamental limitations, Section \ref{details} 
198: the essential details of a self-referential axiomatic system,
199: Section \ref{secglobopt} the Global Optimality Theorem \ref{globopt}, 
200: and Section \ref{biops} the $O()$-optimal 
201: (Theorem \ref{asopt}) initial proof searcher. 
202: Section \ref{discussion} provides examples
203: and relations to previous work,
204: briefly discusses issues such as 
205: a {\em technical} justification of consciousness,
206: and lists answers to several frequently asked questions
207: about \gmn s.
208: 
209: \section{Overview / Basic Ideas /  Limitations}
210: \label{basics}
211: 
212: Many traditional problems of computer science require just
213: one problem-defining input at the beginning of the problem solving
214: process.  For example, the initial input may be a large integer,
215: and the goal may be to factorize it.
216: In what follows, however, we will
217: also consider the {\em more general case}
218: where the problem solution requires interaction with a dynamic,
219: initially unknown environment that produces a continual stream of
220: inputs and feedback
221: signals, such as in autonomous robot control tasks,
222: where the goal may be to maximize expected cumulative
223: future reward \cite{Kaelbling:96}.
224: This may require the solution of essentially arbitrary problems
225: (examples in Section \ref{examples} formulate traditional
226: problems as special cases).
227: 
228: \subsection{Set-up and Formal Goal}
229: \label{notation}
230: Our hardware (e.g., a universal or space-bounded
231: Turing machine 
232: \cite{Turing:36} 
233: or the abstract model
234: of a personal computer) 
235: has a single life which
236: consists of discrete cycles or time steps $t=1, 2, \ldots$.
237: Its total lifetime $T$ may or may not be known in advance.
238: In what follows, the value of any time-varying variable $Q$
239: at time $t$ will be denoted by $Q(t)$.
240: 
241: During each cycle our hardware 
242: executes an elementary operation which affects its 
243: variable state $s \in \cal S \subset B^*$
244: (without loss of generality, $B^*$ is
245: the set of possible bitstrings over 
246: the binary alphabet $B=\{ 0, 1 \}$)
247: and possibly also the variable environmental state $Env \in \cal E$ 
248: (here we need not yet specify the problem-dependent set $\cal E$).
249: There is a hardwired
250: state transition function $F: \cal S \times \cal E \rightarrow \cal S$. 
251: For $t > 1$, $s(t)=F(s(t-1), Env(t-1))$ is the state at a point where
252: the hardware operation of cycle $t-1$ is finished, but the one of
253: $t$ has not started yet.
254: $Env(t)$ may depend on past output actions encoded in $s(t-1)$ and 
255: is simultaneously updated or (probabilistically) computed by 
256: the possibly reactive environment.
257: 
258: In order to talk conveniently about programs and data,
259: we will often attach names to certain string variables encoded
260: as components or substrings of $s$.
261: Of particular interest are the three variables called
262: {\em time}, {\em x}, {\em y}, and {\em p}:
263: \begin{enumerate}
264: \item At time $t$, variable $time$
265: holds a unique binary representation of $t$.
266: We initialize $time(1)=`1\ap$, the bitstring consisting only of a one.
267: The hardware increments $time$ from one cycle to the next.
268: This requires at most $O(log~t)$ and on average only $O(1)$
269: computational steps.
270:  
271: \item
272: Variable {\em x} holds the inputs from the environment to the \gmn.
273: For $t>1$, $x(t)$  may
274: differ from $x(t-1)$ only if
275: a program running on the \gm has executed
276: a special input-requesting instruction at time $t-1$.
277: Generally speaking, the delays between successive inputs
278: should be sufficiently large so that
279: programs can perform
280: certain elementary computations on an input, such as copying
281: it into internal storage (a reserved part of $s$)
282: before the next input arrives.
283:   
284:   \item
285: Variable {\em y} holds the outputs of the \gmn.
286: $y(t)$ is an output bitstring which may 
287: subsequently influence the
288: environment, where $y(1)=`0\ap$ by default.
289: For example, $y(t)$ could be interpreted as a control
290: signal for an environment-manipulating
291: robot whose actions may have an effect on
292: future inputs.
293: 
294: \item
295: $p(1)$ is the initial software: a program
296: implementing the original (sub-optimal) policy 
297: for interacting with the environment,
298: represented as a substring $e(1)$ of $p(1)$,
299: plus the original policy for searching proofs.
300: Details will be discussed below.
301: \end{enumerate}
302: 
303: At any given time $t$ ($1 \leq t \leq T$) the goal 
304: is to maximize future success or {\em utility}.
305: A typical {\em ``value to go''} utility function
306: is of the form $u(s, Env): \cal S \times \cal E \rightarrow R$, where
307: $\cal R$ is the set of real numbers:
308: \begin{equation}
309: \label{u}
310: u(s, Env) =
311: E_{\mu} \left [ \sum_{\tau=time}^T  r(\tau)~~ \Bigg| ~~s, Env \right ],
312: \end{equation}
313: where $r(t)$ is a real-valued reward input (encoded within $s(t)$) at time $t$, 
314: $E_{\mu}(\cdot \mid \cdot)$ denotes the conditional expectation operator
315: with respect to some possibly unknown distribution $\mu$ from a set $M$
316: of possible distributions ($M$ reflects
317: whatever is known about the possibly probabilistic reactions 
318: of the environment), and the above-mentioned $time=time(s)$ is a function 
319: of state $s$ which uniquely identifies the 
320: current cycle.
321: Note that we take into account the possibility of extending
322: the expected lifespan
323: through appropriate actions.
324: 
325: Alternative formalizable utility functions could favor 
326: improvement of {\em worst case} instead 
327: of {\em expected} future performance, 
328: or higher reward intake {\em per time interval} etc.  
329: Clearly, most classic problems of computer science
330: can be formulated in this framework---see examples
331: in Section \ref{examples}.
332: 
333: 
334: \subsection{Basic Idea of \GM}
335: Our machine becomes a self-referential 
336: \cite{Goedel:31}
337: {\em \gmn}
338: by loading it
339: with a 
340: particular form of
341: machine-dependent, 
342: self-modifying code $p$. The initial code
343: $p(1)$ at time step 1
344: includes a (typically sub-optimal)
345: problem solving subroutine $e(1)$ for interacting with
346: the environment, such as any traditional reinforcement learning 
347: algorithm \cite{Kaelbling:96}, 
348: and a general proof searcher subroutine 
349: (Section \ref{biops}) 
350: that systematically makes pairs
351: {\em (switchprog, proof)} (variable substrings of $s$)
352: until it finds a {\em proof} of
353: a target theorem which essentially states: {\em `the
354: immediate rewrite of {\em p} through current program {\em switchprog}
355: on the given machine
356: implies higher utility than leaving {\em p} as is'.} Then it executes
357: {\em switchprog}, which may completely rewrite $p$, including
358: the proof searcher. Section \ref{details} will explain
359: details of the necessary 
360: initial axiomatic system $\cal A$ 
361: encoded in $p(1)$.
362: Compare Fig. \ref{storage}.
363: \begin{figure}[hbt]
364: \centerline{\epsfig{figure=figure1.eps,angle=0,height=12cm}}
365: \caption{{\small
366: Storage snapshot of a not yet self-improved example \gmn,
367: with the initial software still intact. See text for details.
368: }} \label{storage}
369: \end{figure}
370: 
371: 
372: \noindent
373: The {\bf Global Optimality Theorem} (Theorem \ref{globopt}, 
374: Section \ref{secglobopt}) shows 
375: this self-im\-prove\-ment strategy is not greedy: since the
376: utility of {\em `leaving $p$ as is'} implicitly evaluates all possible
377: alternative {\em switchprog}s which an unmodified $p$ might find later,
378: we obtain a globally optimal self-change---the {\em current switchprog}
379: represents the best of all possible relevant self-changes, relative
380: to the given resource limitations and initial proof search strategy.
381: 
382: \subsection{Proof Techniques and an $O()$-Optimal Initial Proof Searcher}
383: Section \ref{biops} will present an $O()$-optimal 
384: initialization of the proof searcher,
385: that is, one with an optimal {\em order} of complexity
386: (Theorem \ref{asopt}).  Still, there will remain a lot of 
387: room for self-improvement hidden by the $O()$-notation. 
388: The searcher
389: uses an online extension of {\em Universal Search} 
390: \cite{Levin:73,Levin:84}
391: to systematically test {\em online
392: proof techniques}, which are proof-generating programs that
393: may read parts of state $s$
394: (similarly, mathematicians are often more interested in 
395: proof techniques than in theorems).
396: To prove target theorems as above,
397: proof techniques may invoke special instructions
398: for generating axioms and applying inference rules to prolong the 
399: current {\em proof} by theorems. Here
400: an axiomatic system $\cal A$ 
401: encoded in $p(1)$
402: includes axioms describing {\bf (a)} how any instruction invoked
403: by a program running on the given hardware will change 
404: the machine's state $s$ 
405: (including instruction pointers etc.)
406: from one step to the next (such that proof techniques can reason
407: about the effects of any program including the proof searcher),
408: {\bf (b)} the initial program $p(1)$ itself (Section \ref{details} will show
409: that this is possible without introducing circularity),
410: {\bf (c)} stochastic environmental properties,
411: {\bf (d)} the formal utility function $u$, 
412: e.g., equation (\ref{u}), 
413: which takes into account
414: computational costs of all actions including proof search.
415: 
416: 
417: 
418: \subsection{Limitations of \GMn s}
419: \label{limits}
420: The fundamental limitations are closely related to
421: those first identified by G\"{o}del's celebrated paper on
422: self-referential formulae \cite{Goedel:31}.
423: Any formal system that encompasses arithmetics (or ZFC etc)
424: is either flawed or allows for unprovable but true statements.
425: Hence even a \gm with unlimited computational
426: resources must ignore those self\--im\-prove\-ments
427: whose effectiveness it cannot prove,
428: e.g., for lack of sufficiently powerful axioms in $\cal A$.
429: In particular, one can construct pathological
430: examples of environments and
431: utility functions that make it impossible for the machine
432: to ever prove a target theorem.
433: Compare Blum's speed-up theorem 
434: \cite{Blum:67,Blum:71}
435: based on certain incomputable predicates.
436: Similarly, a realistic \gm with limited resources
437: cannot profit from self\--im\-prove\-ments
438: whose usefulness it cannot prove within
439: its time and space constraints.
440: 
441: Nevertheless, unlike previous methods, it can 
442: in principle exploit at least the {\em provably} good speed-ups 
443: of {\em any} part of its initial software, including those 
444: parts responsible for huge (but problem class-independent) slowdowns 
445: ignored by the earlier approaches \cite{Hutter:01aixi+,Hutter:01fast+}
446: (Section \ref{previous}).
447: 
448: \section{Essential Details of One Representative \GM}
449: \label{details}
450: {\bf Notation.} Unless stated otherwise or obvious, 
451: throughout the paper newly introduced variables and functions
452: are assumed to cover the range implicit in the context.
453: $l(q)$ denotes the number of bits in a bitstring $q$; 
454: $q_n$ the $n$-th bit of $q$;
455: $\lambda$ the empty string (where $l(\lambda)=0$);
456: $q_{m:n}= \lambda$ if $m>n$ and $q_m q_{m+1} \ldots q_n$
457: otherwise (where $q_0 := q_{0:0} := \lambda$).
458: 
459: Theorem proving requires an 
460: axiom scheme yielding an 
461: enumerable set of 
462: axioms of a formal logic 
463: system $\cal A$ whose formulas and theorems are symbol 
464: strings over some finite alphabet that may include traditional
465: symbols of logic (such as
466: $\rightarrow,\wedge,=,(,), \forall, \exists, \ldots$,
467: $c_1, c_2, \ldots,$  $f_1, f_2, \ldots$), 
468: probability theory (such as $E(\cdot)$, the expectation operator),
469: arithmetics ($+, -, /, = , \sum, <, \ldots$),
470: string manipulation (in particular, symbols for 
471: representing any part of state $s$ at any time, 
472: such as $s_{7:88}(5555)$).
473: A proof is a sequence of theorems,
474: each either an axiom or inferred from previous
475: theorems by applying one of the inference rules such
476: as {\em modus ponens} combined with {\em unification}, e.g.,
477: \cite{Fitting:96}.  
478: 
479: The remainder of this paper will omit standard knowledge to be found 
480: in any proof theory textbook.
481: Instead of listing {\em all} axioms of a particular $\cal A$ in
482: a tedious fashion,
483: we will focus on the novel and critical details: 
484: how to overcome potential problems with self-reference 
485: and how to deal with the potentially delicate online generation of proofs 
486: that talk about and affect the currently running proof generator itself. 
487: 
488: \subsection{Proof Techniques}
489: \label{prooftech}
490: Brute force proof searchers
491: (used in Hutter's work \cite{Hutter:01aixi+,Hutter:01fast+}; 
492: see Section \ref{previous} for a review)
493: systematically generate all proofs
494: in order of their sizes.  To produce a certain proof,
495: this takes time exponential in proof size.
496: Instead our $O()$-optimal $p(1)$ will produce many proofs
497: with low algorithmic complexity 
498: \cite{Solomonoff:64,Kolmogorov:65,LiVitanyi:97}
499: much more quickly. It systematically tests (see Section \ref{biops})
500: programs called
501: {\em proof techniques} written in universal language
502: $\cal L$ implemented within $p(1)$.  
503: For example, $\cal L$  may be a variant of PROLOG \cite{Prolog:87}
504: or the universal {\sc Forth}\cite{Forth:70}-inspired
505: programming language used in recent work on optimal search
506: \cite{Schmidhuber:04oops}.
507: A proof technique is composed
508: of instructions that allow any part of $s$ to be read, 
509: such as inputs encoded in variable $x$ (a substring of $s$) or
510: the code of $p(1)$. It may write on $s^p$, a  part of $s$ reserved for 
511: temporary results.  It also may rewrite {\em switchprog}, 
512: and produce an incrementally growing proof placed in the
513: string variable {\em proof} stored somewhere in $s$.
514: {\em proof} and $s^p$ are reset to the empty string at 
515: the beginning of each new proof technique test.
516: Apart from standard arithmetic and function-defining instructions 
517: \cite{Schmidhuber:04oops} that modify $s^p$,
518: the programming language $\cal L$ includes special instructions
519: (details in Section \ref{instructions})
520: for prolonging the current {\em proof} by correct theorems,
521: for setting {\em switchprog},
522: and for checking whether a provably optimal $p$-modifying
523: program was found and should be executed now.
524: Certain long proofs can be produced by short proof techniques.
525: 
526: \subsection{Important Instructions Used by Proof Techniques}
527: \label{instructions}
528: 
529: The nature of the six {\em proof}-modifying instructions below
530: (there are no others)  
531: makes it impossible to insert an 
532: incorrect theorem into {\em proof},
533: thus trivializing proof verification.
534: Let us first provide a brief overview of the most
535: important instructions: {\bf get-axiom(n)}
536: appends the $n$-th possible axiom
537: to the current {\em proof}, 
538: {\bf apply-rule(k, m, n)} applies the $k$-th
539: inference rule to the $m$-th and $n$-th theorem
540: in the current {\em proof} (appending the result),
541: {\bf set-switchprog(m,n)}
542: sets $switchprog:= s^p_{m:n}$, 
543: and {\bf check()}
544: tests whether the last theorem in {\em proof}
545: is a {\bf target theorem} showing that a self-rewrite
546: through {\em switchprog} would be useful.
547: The details are as follows.
548: 
549: 
550: \begin{enumerate}
551: 
552: \item {\bf get-axiom(n)}
553: takes as argument an integer $n$ computed 
554: by a prefix of the currently tested proof technique
555: with the help of arithmetic instructions 
556: such as those used in previous work
557: \cite{Schmidhuber:04oops}. 
558: Then it appends
559: the $n$-th axiom (if it exists, according to the axiom scheme below) 
560: as a theorem to the current theorem
561: sequence in {\em proof}.  The initial axiom scheme encodes:
562: 
563: \begin{enumerate}
564: \item
565: \label{hardwareaxioms}
566: {\bf Hardware axioms}
567: describing the hardware,
568: formally specifying how 
569: certain components of $s$ (other than
570: environmental inputs $x$) may
571: change from one cycle to the next. 
572: 
573: For example, if the hardware is a
574: Turing machine\footnote{
575: Turing reformulated G\"{o}del's unprovability results in terms
576: of Turing machines (TMs) \cite{Turing:36} which
577: subsequently became the most widely used abstract model of
578: computation.  It is well-known that there are {\em universal}
579: TMs that in a certain sense can emulate any other TM or any
580: other known computer.  G\"{o}del's integer-based formal language
581: can be used to describe any universal TM, and vice versa.}
582: (TM) \cite{Turing:36},
583: then $s(t)$ is a bitstring that encodes the current
584: contents of all tapes of the TM, the positions of its
585: scanning heads, and the current {\em internal state}
586: of the TM's finite state automaton, while $F$ specifies
587: the TM's look-up table which maps any possible combination
588: of internal state and bits above scanning heads
589: to a new internal state and an action such as:
590: replace some head's current bit by 1/0, increment
591: (right shift) or decrement (left shift) some
592: scanning head, read and copy next input bit to cell above
593: input tape's scanning head,
594: etc.  
595: 
596: Alternatively, if the hardware is given by the abstract model
597: of a modern microprocessor with limited storage, $s(t)$ will
598: encode the current storage contents,
599: register values,
600: instruction pointers etc.
601: 
602: For instance, the following axiom could
603: describe how some 64-bit hardware's instruction pointer
604: stored in $s_{1:64}$ is continually incremented as long as 
605: there is no overflow and the value
606: of $s_{65}$ does not indicate that a jump to some other address
607: should take place:
608: \[
609: (\forall t \forall n :
610: [(n < 2^{64}-1) \wedge  (n > 0) \wedge  (t > 1) \wedge  (t < T) 
611: \]
612: \[
613: \wedge 
614: (string2num(s_{1:64}(t))=n) 
615: \wedge (s_{65}(t)=`0\ap) ]
616: \]
617: \[
618: \rightarrow
619: (string2num(s_{1:64}(t+1))=n+1))
620: \]
621: Here the semantics of used symbols such 
622: as `('  and `$>$'  and `$\rightarrow$' (implies)
623: are the traditional ones, while `$string2num$'
624: symbolizes a function
625: translating bitstrings into numbers.
626: It is clear that any abstract hardware model can be
627: fully axiomatized in a similar way.
628: 
629: 
630: \item
631: {\bf Reward axioms}
632: defining the computational costs of any
633: hardware instruction, 
634: and physical costs of output 
635: actions, such as control signals $y(t)$
636: encoded in $s(t)$.
637: Related axioms assign values to certain input events 
638: (encoded in variable $x$, a substring of $s$)
639: representing reward or punishment (e.g.,
640: when a \gmn-controlled robot bumps into an obstacle).
641: Additional axioms define the total value of the \gmn 's life as a 
642: scalar-valued function of all 
643: rewards (e.g., their sum) and 
644: costs experienced between cycles $1$ and $T$, etc.
645: For example, assume that
646: $s_{17:18}$ can be changed only through
647: external inputs; the following example axiom 
648: says that the total reward increases by 3 whenever
649: such an input equals `11'
650: (unexplained symbols carry the obvious meaning):
651: \[
652: (\forall t_1 \forall t_2:
653: [(t_1 < t_2) \wedge  (t_1 \geq 1) 
654: \wedge  (t_2 \leq T)  
655: \wedge  (s_{17:18}(t_2)=`11\ap)  ]
656: \]
657: \[
658: \rightarrow
659: [R(t_1,t_2)= R(t_1,t_2-1)+3]),
660: \]
661: where $R(t_1,t_2)$ is interpreted as 
662: the cumulative reward between times $t_1$ and $t_2$.
663: It is clear that any formal scheme for producing
664: rewards can be fully axiomatized in a similar way.
665: 
666: \item
667: \label{envaxioms}
668: {\bf Environment axioms}
669: restricting the way the environment will produce 
670: new inputs (encoded within certain substrings of $s$) in reaction to 
671: sequences of outputs $y$ encoded in $s$.
672: For example, it may be known
673: in advance that the environment is sampled from an unknown probability
674: distribution $\mu$ contained in a given set $M$
675: of possible distributions (compare equation \ref{u}).
676: E.g., $M$ may contain all distributions
677: that are computable, given the previous history
678: \cite{Solomonoff:64,Solomonoff:78,Hutter:01aixi+},
679: or at least limit-computable \cite{Schmidhuber:00v2,Schmidhuber:02ijfcs}.
680: Or, more restrictively, the environment
681: may be some unknown but deterministic computer
682: program \cite{Zuse:69,Schmidhuber:97brauer}
683: sampled from the Speed Prior \cite{Schmidhuber:02colt} which assigns
684: low probability to environments that are hard to compute by any method.
685: Or the interface to the environment is Markovian \cite{Schmidhuber:91nips},
686: that is, the current
687: input always uniquely identifies the environmental state---a lot
688: of work has already been done 
689: on this special case \cite{Samuel:59,Bellman:61,Sutton:98}.
690: Even more restrictively, the environment may evolve in completely
691: predictable fashion known in advance.
692: All such prior assumptions
693: are perfectly formalizable in an appropriate $\cal A$ 
694: (otherwise we could not write scientific papers about them).
695: 
696: \item
697: {\bf Uncertainty axioms; string manipulation axioms:}
698: \label{probaxioms}
699: Standard axioms for arithmetics and calculus 
700: and probability theory \cite{Kolmogorov:33} 
701: and statistics 
702: and string manipulation that (in conjunction
703: with the hardware axioms and environment axioms) allow for constructing proofs 
704: concerning (possibly uncertain) properties of future values of 
705: $s(t)$ as well as bounds on expected remaining lifetime /
706: costs / rewards,  
707: given some time $\tau$ and certain 
708: hypothetical values for components of $s(\tau)$ etc. 
709: An example theorem saying something about
710: expected properties of future inputs $x$ might look like this:
711: \[
712: (\forall t_1 \forall \mu \in M:
713: [(1 \leq t_1) \wedge
714: (t_1 + 15597 < T) \wedge
715: (s_{5:9}(t_1) = `01011\ap)
716: \]
717: \[
718: \wedge (x_{40:44}(t_1)=`00000\ap) ]
719: \rightarrow
720: (\exists t: [(t_1 < t < t_1 + 15597)  
721: \]
722: \[
723: \wedge
724: (P_{\mu}(x_{17:22}(t) = `011011\ap \mid s(t_1)) > \frac{998}{1000} )])),
725: \]
726: where $P_{\mu}(. \mid . )$ represents a conditional probability
727: with respect to an axiomatized prior distribution $\mu$ from
728: a set of distributions $M$ described by
729: the environment axioms (Item \ref{envaxioms}).
730: 
731: Given a particular 
732: formalizable hardware 
733: (Item \ref{hardwareaxioms})
734: and formalizable assumptions about the 
735: possibly probabilistic environment
736: (Item \ref{envaxioms}),
737: obviously one can fully axiomatize 
738: everything that is needed for 
739: proof-based reasoning about past and future machine states.
740: 
741: \item
742: \label{initaxioms}
743: {\bf Initial state axioms:}
744: Information about how
745: to reconstruct the initial state $s(1)$ or parts thereof,
746: such that the proof searcher 
747: can build proofs including
748: axioms of the type
749: \[
750: (s_{{\bf m}:{\bf n}}(1)={\bf z}),
751: ~e.g.:~ (s_{7:9}(1)=`010\ap).
752: \]
753: Here and in the remainder of the paper 
754: we use bold font in formulas to indicate 
755: syntactic place holders (such as {\bf m,n,z})
756: for symbol strings representing 
757: variables (such as {\em m,n,z})
758: whose semantics are explained in 
759: the text---in the present context $z$ is the 
760: bitstring $s_{m:n}(1)$.
761: 
762: Note that it is {\bf no fundamental problem} to fully encode
763: both the hardware description {\em and} the initial
764: hardware-describing $p$ within $p$
765: itself. To see this, observe that some software may 
766: include a program that can print the software.
767: 
768: 
769: \item
770: {\bf Utility axioms}
771: \label{itemu}
772: describing the overall goal
773: in the form of utility function $u$; e.g., equation (\ref{u}) in
774: Section \ref{notation}.
775: \end{enumerate}
776: 
777: \item
778: \label{applyrule} 
779: {\bf apply-rule(k, m, n)}
780: takes as arguments the index $k$ (if it exists) of 
781: an inference rule 
782: such as {\em modus ponens} 
783: (stored in a list of possible inference rules
784: encoded within $p(1)$) and the indices $m, n$ 
785: of two previously proven theorems (numbered in order of
786: their creation) in the current {\em proof}.
787: If applicable, the corresponding inference rule is 
788: applied to the addressed theorems
789: and the resulting theorem appended to {\em proof}. Otherwise
790: the currently tested proof technique is interrupted.  This 
791: ensures that {\em proof} is never fed with invalid proofs. 
792: 
793: \item
794: {\bf delete-theorem(m)}
795: deletes the $m$-th theorem in the currently stored
796: {\em proof}, thus freeing storage such that proof-storing parts of $s$
797: can be reused and the maximal proof size is not necessarily limited by
798: storage constraints.
799: Theorems deleted from {\em proof}, however,  cannot be addressed any more by
800: {\em apply-rule} to produce further prolongations of {\em proof}.
801: 
802: \item
803: {\bf set-switchprog(m,n)}
804: replaces $switchprog$ by $s^p_{m:n}$, provided
805: that $s^p_{m:n}$ is indeed a non-empty substring of
806: $s^p$, the storage writable by proof techniques.
807: 
808: \item 
809: \label{check}
810: {\bf check()}
811: verifies whether the goal of the proof search has been reached.
812: First it tests whether the last theorem (if any) in {\em proof}
813: has the form of a {\bf target theorem}. A target theorem states
814: that given the {\em current} axiomatized utility
815: function $u$ (Item \ref{itemu}), the utility of a
816: switch from $p$ to the current {\em switchprog}
817: would be higher than the utility of continuing the execution
818: of $p$ (which would keep searching for alternative {\em switchprog}s).
819: Target theorems are
820: symbol strings 
821: (encoded in $s$) 
822: of the (decoded) form
823: \begin{equation}
824: \label{goal}
825: (u[s({\bf t_1}) \oplus (switchbit({\bf t_1})=`1\ap), Env({\bf t_1})] >
826: \]
827: \[
828: u[s({\bf t_1}) \oplus (switchbit({\bf t_1})=`0\ap), Env({\bf t_1})] )
829: \end{equation}
830: where the variable $t_1$ (represented by syntactic place holder ${\bf t_1}$)
831: stands for a time step,
832: while all other symbols belong to the alphabet of
833: the theorem-proving calculus, that is,
834: the set of possible target theorems is parameterized only by $t_1$.
835: Here the calculus should permit the notation
836: $s({\bf t_1}) \oplus (switchbit({\bf t_1})=`b\ap)$ as
837: a shortcut for the state obtained when we replace
838: {\em switchbit}($t_1$), 
839: the true value of the variable bit $switchbit$
840: (encoded in $s$) 
841: at time $t_1$, by $b \in \{0, 1\}$. This will
842: facilitate the formulation of theorems
843: that compare values conditioned on various alternative
844: hypothetical properties of $s(t_1)$. (Note that $s(t_1)$ may
845: be only partially known by the current proof technique
846: even in environments where
847: $s(t_1)$ and {\em switchbit}($t_1$) are
848: fully predetermined for all valid $t_1$.)
849: 
850: The purpose of introducing $t_1$ is to deal with hardware-specific
851: temporal delays that may be involved in checking and switching---it may
852: take a significant amount of
853: time to match abstract symbol strings found during proof search to the
854: \gmn's real current state.
855: If a target theorem has been found, {\em check()} uses a
856: simple prewired subroutine (also encoded in $p(1)$, of course)
857: to check whether there is enough time left
858: to set variable {\em switchbit} (originally 0) to 1 before
859: the continually increasing $time$ will equal $t_1$.
860: If this subroutine
861: returns a negative result, {\em check()} exits.
862: Otherwise it sets {\em switchbit} $:=1$
863: (there is no other way of changing {\em switchbit}).
864: Then it repeatedly tests $time$ until $time > t_1$, to make sure
865: the condition of formula (\ref{goal})
866: was fulfilled at $t_1$.
867: Then it transfers control to {\em switchprog}
868: (there is no other way of calling {\em switchprog}).
869: The {\em switchprog}
870: may subsequently rewrite all parts of $s$, excluding hardware-reserved
871: parts such as $time$ and $x$, but including $p$.
872: 
873: \item 
874: \label{state2theorem} 
875: {\bf state2theorem(m, n)}
876: takes two integer arguments $m, n$
877: and tries to transform the current contents of $s_{m:n}$ 
878: into a theorem of the form 
879: \[
880: (s_{{\bf m}:{\bf n}}({\bf t_1})={\bf z}),
881: ~e.g.:~ (s_{6:9}(7775555)=`1001\ap),
882: \]
883: where $t_1$ represents a time measured (by checking {\em time})
884: shortly after {\em state2theorem} was invoked, 
885: and $z$ the bistring $s_{m:n}(t_1)$ (recall the special 
886: case $t_1=1$ of Item \ref{initaxioms}).
887: So we accept the time-labeled current 
888: observable contents of any part of $s$ as a theorem that does not have
889: to be proven in an alternative way from, say, the 
890: initial state $s(1)$, because the computation so far
891: has already demonstrated that the theorem is true.
892: Thus we may exploit information conveyed
893: by environmental inputs, and the fact that sometimes
894: (but not always) the fastest way
895: to determine the output of a program is to run it.
896: 
897: %\begin{small}
898: {\em
899: This non-traditional online interface between syntax and semantics
900: requires special care though.
901: We must avoid inconsistent results through
902: parts of $s$ that change while being read.
903: For example, the present value of a quickly changing
904: instruction pointer {\em IP} (continually updated by the hardware)
905: may be essentially unreadable in the
906: sense that the execution of the reading subroutine
907: itself will already modify {\em IP} many times.
908: For convenience, the (typically limited) hardware could be set up
909: such that it stores the contents of
910: fast hardware variables every $c$ cycles in
911: a reserved part of $s$, such that
912: an appropriate  variant of {\em state2theorem()} could at least
913: translate certain recent values of fast variables into theorems.
914: This, however, will not abolish {\em all} problems associated
915: with self-observations.
916: For example, the $s_{m:n}$ to be read might
917: also contain the reading procedure's
918: own, temporary, constantly changing string pointer variables,
919: etc.\footnote{We see that certain parts of the
920: current $s$ may not be directly observable without changing
921: the observable itself.
922: Sometimes, however, axioms and previous observations will allow
923: the \gm to {\em deduce} time-dependent storage contents that
924: are not directly observable.
925: For instance, by analyzing the code being executed through
926: instruction
927: pointer {\em IP} in the example above,  the value of {\em IP} at
928: certain times may be predictable (or postdictable, after the
929: fact).  The values of other variables at given times,
930: however, may not be deducible at all.
931: Such limits of self-observability
932: are reminiscent of Heisenberg's celebrated
933: uncertainty principle \cite{Heisenberg:25},
934: which states that certain physical measurements are necessarily
935: imprecise, since the measuring process affects the measured
936: quantity.}
937: To address such problems on computers with limited
938: memory, {\em state2theorem}
939: first uses some fixed protocol
940: (encoded in $p(1)$, of course)
941: to check whether the current $s_{m:n}$ is readable
942: at all or whether it might change if it
943: were read by the remaining code of {\em state2theorem}.
944: If so, or if $m,n,$ are not in the proper range,
945: then the instruction has no further effect. Otherwise it
946: appends an {\em observed} theorem of the form
947: $(s_{{\bf m}:{\bf n}}({\bf t_1})={\bf z})$
948: to {\em proof}.
949: For example, if the current time is 7770000, then the invocation
950: of {\em state2theorem(6,9)} might return the theorem $(s_{6:9}(7775555)=`1001\ap)$,
951: where
952: $7775555-7770000=5555$ reflects the time needed by {\em state2theorem}
953: to perform the initial check
954: and to read leading bits off the continually increasing $time$
955: (reading $time$ also costs time) such that
956: it can be sure that $7775555$ is a recent proper time label following
957: the start of {\em state2theorem}.
958: }
959: %\end{small}
960: 
961: \end{enumerate}
962: 
963: The axiomatic system $\cal A$ is a defining
964: parameter of a given \gmn.  Clearly,  $\cal A$ must be
965: strong enough to permit proofs of target theorems.
966: In particular, the theory of uncertainty axioms
967: (Item \ref{probaxioms}) must be sufficiently rich.
968: This is no fundamental problem: we simply
969: insert all traditional axioms of probability theory \cite{Kolmogorov:33}.
970: 
971: 
972: \section{\bf Global Optimality Theorem}
973: \label{secglobopt}
974:  
975:  Intuitively, at any given time $p$ should
976:  execute some self-modification algorithm 
977:  (via instruction {\em check()}---Item \ref{check} above) 
978:  only if it is
979:  the `best' of all possible self-modifications,
980:  given the utility function, which typically
981:  depends on available resources, such as storage size and
982:  remaining lifetime.
983:  At first glance, however, target theorem (\ref{goal})
984:  seems to implicitly talk about just
985:  one single modification algorithm, namely, {\em switchprog}$(t_1)$
986:  as set by the systematic proof searcher at time $t_1$.
987:  Isn't this type of local search greedy? Couldn't
988:  it lead to a local optimum instead of a global one?
989:  No, it cannot, according to the following global optimality theorem.
990: 
991: \subsection{Globally Optimal Self-Changes, Given $u$ and $\cal A$ Encoded in $p$}
992: \label{sectheorem}
993:   
994:   \begin{theorem}
995:   \label{globopt}
996:   Given any formalizable utility function $u$ (Item \ref{itemu}),
997:   and assuming consistency of the underlying formal system $\cal A$,
998:   any self-change of $p$ obtained through execution of
999:   some program {\em switchprog} identified
1000:   through the proof of a target theorem  (\ref{goal})
1001:   is globally optimal in the following sense:
1002:   the utility of starting the execution of the present
1003:   {\em switchprog} is higher than the utility of
1004:   waiting for the proof searcher
1005:   to produce an alternative {\em switchprog} later.
1006:   \end{theorem}
1007: 
1008: 
1009: \noindent
1010: {\bf Proof.} Target theorem  (\ref{goal})
1011: implicitly talks about all the other
1012: {\em switchprog}s that the proof searcher
1013: could produce in the future. To see this, consider
1014: the two alternatives of the binary decision:
1015: (1) either execute the current {\em switchprog} (set {\em switchbit} $=1$), or
1016: (2) keep searching for {\em proof}s and {\em switchprog}s (set {\em switchbit} $=0$)
1017: until the systematic
1018: searcher comes up with an even better {\em switchprog}.
1019: Obviously the second alternative concerns all (possibly infinitely
1020: many) potential {\em switchprog}s to be considered later.  That is,
1021: if  the current {\em switchprog}  were not the `best', then
1022: the proof searcher would not be able to prove that
1023: setting {\em switchbit} and
1024: executing  {\em switchprog} will cause higher expected reward
1025: than discarding {\em switchprog}, assuming consistency of $\cal A$.
1026: {\em Q.E.D.}
1027: 
1028: The initial proof searcher of Section \ref{biops}
1029: already generates all possible proofs and {\em switchprogs} in $O()$-optimal fashion. 
1030: Nevertheless, since it is part of $p(1)$, its proofs can speak about the
1031: proof searcher itself, possibly triggering proof searcher rewrites 
1032: resulting in better than merely $O()$-optimal performance.
1033: 
1034: \subsection{Alternative Relaxed Target Theorem}
1035: \label{alternative}
1036: We may replace the target theorem  (\ref{goal}) (Item \ref{check})
1037: by the following 
1038: alternative target theorem:
1039: \[
1040: (u[s({\bf t_1}) \oplus (switchbit({\bf t_1})=`1\ap), Env({\bf t_1})] \geq
1041: \]
1042: \begin{equation}
1043: \label{goal2}
1044: u[s({\bf t_1}) \oplus (switchbit({\bf t_1})=`0\ap), Env({\bf t_1})] )
1045: \end{equation}
1046: The only difference to the original target theorem  (\ref{goal})
1047: is that the ``$>$'' sign became a  ``$\geq$'' sign. That is,
1048: the \gm will change itself as soon as it has found a proof
1049: that the change will not make things worse.
1050: A Global Optimality Theorem similar to Theorem \ref{globopt}
1051: holds; simply replace the last phrase 
1052: in Theorem \ref{globopt} by:
1053:   {\em the utility of starting the execution of the present
1054:   {\em switchprog} is at least as high as the 
1055:   utility of waiting for the proof searcher
1056:   to produce an alternative {\em switchprog} later.}
1057: 
1058: \subsection{Global Optimality and Recursive Meta-Levels}
1059: \label{comment}
1060: One of the most important
1061: aspects of our fully self-referential set-up is the following.
1062: Any proof of a target theorem automatically proves 
1063: that the corresponding self-modification is good for all
1064: further self-modifications affected by the present one,
1065: in recursive fashion. 
1066: In that sense all possible ``meta-levels'' of the self-referential
1067: system are collapsed into one. 
1068: 
1069: \subsection{How Difficult is it to Prove Target Theorems?}
1070: \label{difficult}
1071: This depends on the tasks and the initial axioms $\cal A$, of course.
1072: It is straight-forward to devise simple tasks and 
1073: corresponding consistent $\cal A$
1074: such that there are short and trivial proofs of target theorems.
1075: 
1076: Even when we initialize the initial problem solver $e(1)$ by an
1077: asymptotically optimal, rather general method such as Hutter's \tl 
1078: \cite{Hutter:01aixi+,Hutter:04book+},
1079: it may be straight-forward to prove that switching to
1080: another strategy is useful,
1081: especially when $\cal A$ contains additional prior
1082: knowledge in form of axiomatic assumptions 
1083: beyond those made by \tln .  The latter needs a very time-consuming but constant set-up
1084: phase whose costs disappear in the $O()$-notation but not in 
1085: a utility function such as the $u$ of equation (\ref{u}). 
1086: For example, simply construct an environment where
1087: maximal reward is achieved by performing a never-ending sequence
1088: of simple but rewarding actions, say, repeatedly pressing a lever,
1089: plus a very simple axiomatic system $\cal A$ that permits a short proof
1090: showing that it is useful to interrupt the non-rewarding set-up phase and
1091: start pressing the lever. 
1092: 
1093: On the other hand, it is possible to construct
1094: situations where it is impossible to prove target theorems,
1095: for example, by using results of undecidability theory, 
1096: e.g., \cite{Goedel:31,Rice:53,Blum:67,Blum:71}.
1097: In particular, adopting the extreme notion of triviality
1098: embodied by Rice's theorem \cite{Rice:53}
1099: ({\em any nontrivial property over general
1100: functions is undecidable}), only {\em trivial}
1101: improvements of a given strategy may be provably useful.
1102: This notion of triviality, however, clearly does not reflect
1103: what is {\em intuitively} regarded as trivial by scientists.
1104: Although many theorems of the machine learning
1105: literature in particular,  and the computer science 
1106: literature in general, are limited to functional properties that
1107: are trivial in the sense of Rice, they are widely regarded 
1108: as non-trivial in an {\em intuitive} sense. 
1109: In fact, the infinite domains of function classes 
1110: addressed by Rice's theorem 
1111: are irrelevant not only for most scientists dealing
1112: with real world problems 
1113: but also for a typical \gm dealing
1114: with a limited number of events that
1115: may occur within its limited life time. 
1116: Generally speaking, in between the {\em obviously} trivial and the {\em obviously}
1117: non-trivial cases there are many less obvious ones.
1118: The point is: usually we do not know in advance whether it is
1119: possible or not to change a given initial problem solver
1120: in a provably good way.  The traditional approach is to 
1121: invest human research effort into finding out. A \gmn, however, can 
1122: do this by itself, without essential limits
1123: apart from those of computability and provability.  
1124: 
1125: Note that to prove a target theorem, 
1126: a proof technique does not necessarily have to compute
1127: the true expected utilities of switching and not
1128: switching---it just needs to determine which is higher.
1129: For example, it may be easy to prove that
1130: speeding up a subroutine of the proof searcher
1131: by a factor of 2 will certainly be worth
1132: the negligible (compared to lifetime $T$) time needed to
1133: execute the subroutine-changing algorithm, no matter
1134: what is the precise utility of the switch.
1135: 
1136: 
1137:  
1138: \section{Bias-Optimal Proof Search (BIOPS)}
1139: \label{biops}
1140: 
1141: Here we construct an initial  $p(1)$ that is $O()$-optimal in a certain
1142: limited sense to be described below, but still might be improved
1143: as it is not necessarily optimal in the sense of the given $u$
1144: (for example, the $u$ of equation (\ref{u}) neither mentions
1145: nor cares for $O()$-optimality).
1146: Our Bias-Optimal Proof Search (BIOPS)
1147: is essentially an application of 
1148: {\em Universal Search}
1149: \cite{Levin:73,Levin:84}
1150: to proof search. 
1151: One novelty, however, is this:
1152: Previous practical variants and extensions of {\em Universal
1153: Search} have been applied
1154: \cite{Schmidhuber:95kol,Schmidhuber:97nn,Schmidhuber:97bias,Schmidhuber:04oops}
1155: to {\em offline}
1156: program search tasks where the program inputs are fixed
1157: such that the same program always produces the same results.
1158: In our {\em online} setting, however, BIOPS has to take
1159: into account that the same proof technique 
1160: started at different times may yield different proofs,
1161: as it may read parts of $s$ (e.g., inputs)
1162: that change as the machine's life proceeds. 
1163: 
1164: \subsection{Online Universal Search in Proof Space}
1165: \label{online}
1166: 
1167: BIOPS starts with a probability distribution $P$ 
1168: (the initial bias) on the proof techniques $w$ that
1169: one can write in $\cal L$,
1170: e.g., $P(w)=K^{-l(w)}$ for programs composed from $K$
1171: possible instructions  \cite{Levin:84}.
1172: BIOPS is {\em near-bias-optimal} 
1173: \cite{Schmidhuber:04oops}
1174: in the sense that it will not spend 
1175: much more time on any proof technique than it deserves,
1176: according to its probabilistic bias,
1177: namely, not much more than its probability times the total search time:
1178: \begin{definition} [Bias-Optimal
1179: Searchers \cite{Schmidhuber:04oops}]
1180: \label{bias-optimal}
1181: {\em Let $\cal R$ be a problem class,
1182: $\cal C$ be a search space of solution candidates
1183: (where  any problem $r \in \cal R$ should have a solution in $\cal C$),
1184: $P(q \mid r)$ be a task-dependent bias in the form of conditional probability
1185: distributions on the candidates $q \in \cal C$. Suppose that we also have
1186: a predefined procedure that creates and tests any given $q$
1187: on any $r \in \cal R$ within time $t(q,r)$ (typically unknown in advance).
1188: Then} a searcher is $n$-bias-optimal ($n \geq 1$) if
1189: for any maximal total search time $T_{total} > 0$
1190: it is guaranteed to solve any problem $r \in \cal R$
1191: if it has a solution $p \in \cal C$
1192: satisfying $t(p,r) \leq P(p \mid r)~T_{total}/n$.
1193: It is bias-optimal if  $n=1$.
1194: \end{definition}
1195: 
1196: \begin{method}[BIOPS]
1197: \label{lsearch}
1198: {\em 
1199: In phase $(i=1,2,3, \ldots)$ {\sc Do}:
1200: {\sc For} all self-delimiting \cite{Levin:84} 
1201: proof techniques $w \in \cal L$ satisfying $P(w) \geq 2^{-i}$ {\sc Do}:
1202: \begin{enumerate}
1203: \item
1204: Run $w$ until halt or error (such as 
1205: division by zero) or $2^iP(w)$ steps consumed. 
1206: \item
1207: Undo effects of $w$ on $s^p$
1208: (does not cost significantly more time 
1209: than executing $w$).
1210: \end{enumerate}
1211: }
1212: \end{method}
1213: A proof technique $w$ can interrupt Method \ref{lsearch} 
1214: only by invoking instruction {\em check()} (Item \ref{check}), 
1215: which may transfer
1216: control to {\em switchprog} (which possibly
1217: even will delete or rewrite Method  \ref{lsearch}).
1218: Since the initial $p$ 
1219: runs on the formalized hardware, and since proof techniques
1220: tested by $p$ can read $p$ and other parts of $s$, they
1221: can produce proofs concerning the (expected) 
1222: performance of $p$ and BIOPS itself.
1223: %\subsection{Asymptotically Optimal Proof Search} 
1224: %\label{asymptotic}
1225: Method \ref{lsearch} at least has the 
1226: optimal {\em order} of computational complexity in the following
1227: sense. 
1228: \begin{theorem}
1229: \label{asopt}
1230: If independently of variable {\em time(s)} some unknown 
1231: fast proof technique $w$
1232: would require at most $f(k)$ steps to produce
1233: a proof of difficulty measure $k$ (an integer depending on
1234: the nature of the task to be solved), then
1235: Method \ref{lsearch}
1236: will need at most $O(f(k))$ steps.
1237: \end{theorem}
1238: {\bf Proof.}
1239: It is easy to see that
1240: Method \ref{lsearch} will need at most $O(f(k)/P(w)) = O(f(k))$
1241: steps---the constant factor  $1/P(w)$ does not depend on $k$.
1242: {\em Q.E.D.}
1243: 
1244: The initial proof search itself is merely $O()$-optimal.
1245: Note again, however,  that
1246: the proofs themselves may concern quite
1247: different, arbitrary formalizable notions of optimality
1248: (stronger than those expressible in the $O()$-notation)
1249: embodied by the given, problem-specific, formalized 
1250: utility function $u$, in particular, the maximum future
1251: reward in the sense of equation (\ref{u}).
1252: This may provoke useful, constant-affecting rewrites of 
1253: the initial proof searcher despite its limited (yet popular
1254: and widely used) notion of $O()$-optimality.
1255: Once a useful rewrite has been found and executed
1256: after some initial fraction of the \gmn 's total lifetime, the restrictions
1257: of $O()$-optimality need not be an issue any more. 
1258: 
1259: 
1260: \subsection{How a Surviving Proof Searcher May
1261: Use the Optimal Ordered Problem Solver to Solve Remaining Proof Search Tasks}
1262: \label{remaining}
1263: 
1264: The following is not essential for this paper.
1265: Let us assume that the execution of the {\em switchprog}
1266: corresponding to the first found target theorem has not rewritten the code of
1267: $p$ itself---the current $p$ is still equal to $p(1)$---and has
1268: reset {\em switchbit} and returned
1269: control to $p$ such that it can continue where it was interrupted.
1270: In that case the \Biops subroutine of $p(1)$ can use the Optimal
1271: Ordered Problem Solver \Oops
1272: \cite{Schmidhuber:04oops}
1273: to accelerate the search for the
1274: $n$-th target theorem ($n>1$) by reusing proof techniques for earlier found
1275: target theorems where possible.  The basic ideas are as
1276: follows (details: \cite{Schmidhuber:04oops}).
1277: 
1278: Whenever a target theorem has been proven, $p(1)$ {\em freezes}
1279: the corresponding proof technique: it becomes non-writable
1280: by proof techniques to be tested in later proof search tasks,
1281: but remains readable,
1282: such that it can be copy-edited and/or invoked as a subprogram by future proof
1283: techniques.  We also allow prefixes of proof
1284: techniques to temporarily rewrite the probability distribution
1285: on their suffixes
1286: \cite{Schmidhuber:04oops},
1287: thus essentially rewriting
1288: the probability-based search
1289: procedure (an incremental extension of Method \ref{lsearch})
1290: based on previous experience. As a side-effect we metasearch
1291: for faster search procedures, which can greatly accelerate the learning of
1292: new tasks \cite{Schmidhuber:04oops}.
1293: 
1294: Given a new proof search task, \Biops performs \Oops
1295: by spending half the total search time on a
1296: variant of
1297: Method \ref{lsearch}
1298: that searches only among self-delimiting
1299: \cite{Levin:74,Chaitin:75} proof techniques starting with the most recently
1300: frozen proof technique.  The rest of the
1301: time is spent on fresh proof techniques with arbitrary prefixes
1302: (which may reuse previously frozen proof techniques though)
1303: \cite{Schmidhuber:04oops}.
1304: (We could also search for a {\em generalizing} proof technique
1305: solving all proof search tasks so far. In the first half of the search
1306: we would not have to test proof techniques on tasks other than
1307: the most recent one, since we already know that their prefixes
1308: solve the previous tasks
1309: \cite{Schmidhuber:04oops}.)
1310: 
1311: It can be shown that \Oops is
1312: essentially {\em 8-bias-optimal} (see Def. \ref{bias-optimal}),
1313: given either the initial bias or intermediate biases due to
1314: frozen solutions to previous tasks
1315: \cite{Schmidhuber:04oops}.
1316: This result immediately carries over to \Biopsn.
1317: To summarize, \Biops essentially allocates part of the total search
1318: time for a new task to proof techniques that exploit previous successful
1319: proof techniques in computable ways.  If the new task can be solved faster
1320: by copy-editing / invoking previously frozen
1321: proof techniques than by solving the new proof search task from scratch,
1322: then  \Biops will discover this and profit thereof. If not, then
1323: at least it will not be significantly slowed down by the previous
1324: solutions---\Biops will remain 8-bias-optimal.
1325:  
1326:  Recall, however, that \Biops is not the only
1327:  possible way of initializing the \gmn's proof searcher.
1328: The Global Optimality Theorem \ref{globopt} 
1329: (Section \ref{secglobopt})
1330: expresses optimality with respect to whichever
1331: initial proof searcher we choose.
1332: 
1333: 
1334: \section{Discussion \& Previous Work}
1335: \label{discussion}
1336: 
1337: 
1338: Here we list a few examples of
1339: possible types of self-improvements
1340: (Section \ref{possible}),
1341: \gm applicability to various tasks defined by various utility functions and
1342: environments
1343: (Section \ref{examples}),
1344: probabilistic hardware 
1345: (Section \ref{prob}),
1346: and relations to previous work
1347: (Section \ref{previous}).
1348: We also briefly discuss self-reference 
1349: and consciousness 
1350: (Section \ref{conscious}),
1351: and provide a list
1352: of answers to frequently asked questions 
1353: (Section \ref{faq}).
1354: 
1355: \subsection{Possible Types of \GM Self-Improvements}
1356: \label{possible}
1357:  
1358:   
1359:   Which provably useful self-modifications are possible?
1360:   There are few limits to what a \gm  might do.
1361:    
1362: \begin{enumerate}
1363: \item
1364:    In one of the simplest cases
1365:    it might leave its basic proof searcher intact
1366:    and just change the ratio of time-sharing between
1367:    the proof searching subroutine and the
1368:    subpolicy $e$---those parts of $p$
1369:    responsible for interaction with the environment.
1370:     
1371: \item
1372:     Or the \gm might modify $e$ only.
1373:     For example, the initial $e(1)$ may be a program
1374:     that regularly stores limited memories
1375:     of past events somewhere in $s$; this might allow $p$ to derive that
1376:     it would be useful to modify $e$ such that $e$ will conduct certain
1377:     experiments to increase the knowledge about
1378:     the environment, and use the resulting information
1379:     to increase reward intake.  In this sense the \gm embodies
1380:     a principled way of dealing with the
1381:     exploration vs exploitation problem \cite{Kaelbling:96}.
1382:     Note that the {\em expected} utility (equation (\ref{u}))
1383:     of conducting some experiment may  exceed
1384:     the one of not conducting it,
1385:     even when the experimental outcome later suggests to
1386:     keep acting in line with the previous $e$.
1387: 
1388: 
1389: 
1390: \item
1391: The \gm might also modify its very axioms
1392: to speed things up. For example,
1393: it might find a proof that the
1394: original axioms should be replaced or
1395: augmented by theorems derivable
1396: from the original axioms.
1397:  
1398: \item
1399:  The \gm might even change
1400:  its own utility function and target theorem,
1401:  but can do so only if their {\em new} values
1402:  are provably better according to the {\em old} ones.
1403:   
1404: \item
1405:   In many cases we do not expect
1406:   the \gm to replace its proof searcher by
1407:   code that completely abandons the search for proofs.
1408:   Instead we expect that only certain subroutines
1409:   of the proof searcher will be sped up---compare the
1410:   example in Section \ref{difficult}---or that perhaps just
1411:   the order of generated proofs will be
1412:   modified in problem-specific fashion. This could be done
1413:   by modifying
1414:   the probability distribution on the proof techniques of
1415:   the initial bias-optimal proof searcher
1416:   from Section \ref{biops}.
1417: 
1418: \item
1419: \label{supermarket}
1420: Generally speaking,
1421: the utility of limited rewrites may often be
1422: easier to prove than the one of total rewrites.
1423: For example, suppose it is 8.00pm and our \gmn-controlled
1424: agent's permanent goal is to
1425: maximize future expected reward, using the (alternative)
1426: target theorem (\ref{goal2}).  Part thereof is to avoid hunger. There
1427: is nothing in its fridge, and shops close down at 8.30pm.  It does not
1428: have time to optimize its way to the supermarket in  every little detail,
1429: but if it does not get going right now it will stay hungry tonight (in
1430: principle such near-future consequences of actions should be easily
1431: provable, possibly even in a way related to
1432: how humans prove advantages of potential actions to themselves). That
1433: is, if the agent's previous policy did not already include, say, an automatic
1434: daily evening trip to the supermarket, the policy provably should be
1435: rewritten at least in a very limited and simple way right now, while there
1436: is still time, such that the agent will surely get some food tonight,
1437: without affecting less urgent future behavior that  can be optimized
1438: / decided later, such as details of the route to the food, or
1439: of tomorrow's actions.
1440:  
1441: \item
1442:  In certain uninteresting environments reward
1443:  is maximized by becoming dumb. For example,
1444:  a given task may require to repeatedly
1445:  and forever execute the same pleasure center-activating action,
1446:  as quickly as possible.  In such cases the \gm may delete
1447:  most of its more time-consuming initial software
1448:  including the proof searcher.
1449:   
1450: \item
1451:   Note that there is no reason why a \gm should not
1452:   augment its own hardware.
1453:   Suppose its lifetime is known to be 100 years.
1454:   Given a hard problem and
1455:   axioms restricting the possible
1456:   behaviors of the environment,
1457:   the \gm might find a proof that its
1458:   expected cumulative reward will increase if
1459:   it invests 10 years into building faster computational
1460:   hardware, by exploiting the physical resources of
1461:   its environment.
1462: \end{enumerate}
1463: 
1464: \subsection{Example Applications}
1465: \label{examples}
1466: 
1467: Traditional examples that
1468: do not involve significant interaction with a probabilistic
1469: environment are easily dealt with in our reward-based framework:
1470: 
1471: 
1472: \begin{example}[Time-limited NP-hard optimization]
1473: \label{tsp}
1474: The initial
1475: input to the \gm is the representation of
1476: a connected graph with
1477: a large number of nodes linked by
1478: edges of various lengths.
1479: Within given time $T$ it should
1480: find a cyclic path connecting all nodes.
1481: The only real-valued reward will occur
1482: at time $T$. It equals
1483: 1 divided by the length
1484: of the best path found so far (0 if none was found).
1485: There are no other inputs.
1486: The by-product of  maximizing
1487: expected reward is to find the shortest
1488: path findable within the limited time,
1489: given the initial bias.
1490: \end{example}
1491:  
1492:  \begin{example}[Fast theorem proving]
1493:  \label{goldbach}
1494:  Prove or disprove as quickly as possible
1495:  that all even integers  $>2$ are the sum of
1496:  two primes (Goldbach's conjecture).
1497:  The reward is $1/t$,
1498:  where $t$ is the time required to produce
1499:  and verify the first such proof.
1500:  \end{example}
1501: More general cases are:
1502: 
1503: \begin{example}[Maximizing expected reward with bounded resources]
1504: \label{robot}
1505: A robot that needs at least
1506: 1 liter of gasoline per hour
1507: interacts with a partially unknown environment,
1508: trying to find hidden, limited gasoline depots
1509: to occasionally refuel its tank.
1510: It is rewarded in proportion to its lifetime,
1511: and dies after at most 100 years or as
1512: soon as its tank is empty or
1513: it falls off a cliff etc.
1514: The probabilistic environmental reactions are initially
1515: unknown but assumed to be sampled from
1516: the axiomatized Speed Prior \cite{Schmidhuber:02colt},
1517: according to which
1518: hard-to-compute environmental reactions are unlikely.
1519: This permits a computable strategy for making near-optimal
1520: predictions  \cite{Schmidhuber:02colt}.
1521: One by-product of maximizing  expected reward
1522: is to maximize expected lifetime.
1523: \end{example}
1524: \begin{example}[Optimize any suboptimal problem solver]
1525: \label{optimize}
1526: Given any formalizable problem, implement
1527: a suboptimal but known problem solver
1528: as software on the \gm hardware,
1529: and let the proof searcher of Section \ref{biops} run in parallel.
1530: \end{example}
1531: 
1532: \subsection{Probabilistic \GM Hardware}
1533: \label{prob}
1534: Above we have focused on an example deterministic machine
1535: living in a possibly probabilistic environment.
1536: It is straight-forward to extend this
1537: to computers whose actions are computed in
1538: probabilistic fashion, given the current state.
1539: Then the expectation calculus
1540: used for probabilistic aspects of the environment
1541: simply has to be extended to the hardware itself,
1542: and the mechanism for verifying proofs has to
1543: take into account that there is no such thing as
1544: a certain theorem---at best there are formal statements
1545: which are true with such and such probability.
1546: In fact, this may be the most realistic approach
1547: as any physical hardware is error-prone,
1548: which should be taken into account by
1549: realistic probabilistic \gmn s.
1550:  
1551:  Probabilistic settings also automatically avoid certain
1552:  issues of axiomatic consistency. For example, predictions
1553:  proven to come true with
1554:  probability less than 1.0 do not necessarily cause contradictions
1555:  even when they do not match the observations.
1556: 
1557: 
1558: \subsection{Relations to Previous Work}
1559: \label{previous}
1560:  
1561: Despite (or maybe because of) the ambitiousness and
1562: potential power of self-improving machines,
1563: there has been little work in this vein outside our own
1564: labs at IDSIA and TU M\"{u}nchen.
1565: Here we will list essential differences
1566: between the \gm and our previous approaches to `learning to learn,'
1567: `metalearning,' self-improvement, self-optimization,  etc.
1568: 
1569: The most closely related approaches are Hutter's
1570: \hs and \tl (Item \ref{hutter} below).
1571: For historical reasons, however, we will first discuss
1572: Levin's {\em Universal Search} and  Hutter's \Aixin.
1573: \begin{enumerate}
1574: 
1575: \item{\bf \GM vs Universal Search}
1576: \label{gmvsls}
1577: 
1578: Unlike the fully self-referential \gmn, 
1579: Levin's {\em Universal Search} \cite{Levin:73,Levin:84}
1580: has a hardwired, unmodifiable meta-algorithm that cannot improve itself. It
1581: is asymptotically optimal for inversion problems whose solutions can be quickly
1582: verified in $O(n)$ time (where $n$ is the solution size), but it will always
1583: suffer from the same huge constant slowdown factors (typically $>> 10^{1000}$)
1584: buried in the $O()$-notation. The self-improvements of a \gmn,
1585: however, can be more than merely $O()$-optimal, since its utility function
1586: may formalize a stonger type of optimality that does not ignore huge
1587: constants just because they are constant---compare the utility
1588: function of equation (\ref{u}).
1589: 
1590: Furthermore, the \gm is applicable to 
1591: general lifelong reinforcement learning (RL) tasks \cite{Kaelbling:96}
1592: where {\em Universal Search} is {\em not} asymptotically optimal,
1593: and not even applicable, since
1594: in RL the evaluation of 
1595: some behavior's value in principle consumes the learner's entire life!
1596: So the naive test of whether a program is good or not
1597: would consume the entire life. That is, we could test
1598: only one program; afterwards life would be over.
1599: 
1600: Therefore, to achieve their objective, general RL machines 
1601: must do things that {\em Universal Search} does
1602: not do, such as predicting {\em future} tasks and rewards.
1603: This partly motivates Hutter's universal RL machine AIXI, to be
1604: discussed next.
1605: \\
1606: 
1607: \item{\bf \GM vs \Aixi}
1608: \label{aixi}
1609: 
1610: Unlike \gmn s,
1611: Hutter's recent {\em \Aixi model} \cite{Hutter:01aixi+,Hutter:04book+}
1612: generally needs {\em unlimited} computational resources per input update.
1613: It combines Solomonoff's universal prediction
1614: scheme \cite{Solomonoff:64,Solomonoff:78} with an {\em expectimax}
1615: computation.  In discrete cycle $k=1,2,3, \ldots$,
1616: action $y(k)$ results
1617: in perception $x(k)$ and reward $r(k)$, both
1618: sampled from the unknown
1619: (reactive) environmental probability distribution $\mu$.  \Aixi defines
1620: a mixture distribution $\xi$ as a weighted sum of distributions $\nu
1621: \in \cal M$, where $\cal M$ is any class of distributions that includes the
1622: true environment $\mu$.  For example, $\cal M$ may
1623: be a sum of all computable
1624: distributions \cite{Solomonoff:64,Solomonoff:78}, where the sum of
1625: the weights does not exceed 1.  In cycle $k+1$, \Aixi
1626: selects as next action the first in an action sequence maximizing
1627: $\xi$-predicted reward up to some given horizon.
1628: Recent work \cite{Hutter:02selfopt+} demonstrated \Aixi's optimal
1629: use of observations as follows.  The Bayes-optimal policy $p^\xi$ based on
1630: the mixture $\xi$ is self-optimizing in the sense that its average
1631: utility value converges asymptotically for all $\mu \in \cal M$ to the
1632: optimal value achieved by the (infeasible) Bayes-optimal policy $p^\mu$
1633: which knows $\mu$ in advance.  The necessary condition that $\cal M$
1634: admits self-optimizing policies is also sufficient.
1635: Furthermore, $p^\xi$ is Pareto-optimal
1636: in the sense that there is no other policy yielding higher or equal
1637: value in {\em all} environments $\nu \in \cal M$ and a strictly higher
1638: value in at least one \cite{Hutter:02selfopt+}.
1639: 
1640: While \Aixi clarifies certain
1641: theoretical limits of machine learning, it
1642: is computationally intractable, especially when $\cal M$
1643: includes all computable
1644: distributions.  This drawback
1645: motivated work on the time-bounded, asymptotically optimal
1646: \tl system \cite{Hutter:01aixi+}
1647: and the related \hs  \cite{Hutter:01fast+}, both
1648: to be discussed next.
1649: \\
1650: 
1651: \item{\bf \GM vs \hs and \tl}
1652: \label{hutter}
1653: 
1654: Now we come to the most closely related previous work;
1655: so we will go an extra length to 
1656: point out the main novelties of the \gmn.
1657: 
1658: Hutter's non-self-referential but still $O()$-optimal 
1659: {\em `fastest' algorithm for all well-defined problems}
1660: \hs \cite{Hutter:01fast+} 
1661: uses a {\em hardwired} brute force proof searcher  and
1662: ignores the costs of proof search.
1663: Assume discrete input/output domains $X/Y$, a formal problem
1664: specification $f: X \rightarrow Y$
1665: (say, a functional description of how integers are decomposed
1666: into their prime factors),
1667: and a particular $x \in X$ (say,
1668: an integer to be factorized). \hs
1669: orders all proofs of an appropriate axiomatic system
1670: by size to find programs $q$ that
1671: for all $z \in X$ provably compute $f(z)$
1672: within time bound $t_q(z)$. Simultaneously it
1673: spends most of its time on executing the $q$ with the
1674: best currently proven time bound $t_q(x)$.
1675: It turns out that \hs
1676: is as fast as the {\em fastest} algorithm that provably
1677: computes $f(z)$ for all $z \in X$, save for a constant factor
1678: smaller than $1 + \epsilon$ (arbitrary $\epsilon > 0$)
1679: and an $f$-specific but $x$-independent
1680: additive constant  \cite{Hutter:01fast+}.
1681: This constant may be enormous though.
1682: 
1683: Hutter's \tl \cite{Hutter:01aixi+}
1684: is related.
1685: In discrete cycle $k=1,2,3, \ldots$ of \tln's lifetime,
1686: action $y(k)$ results in perception $x(k)$ and reward $r(k)$,
1687: where all quantities may depend on the complete history.
1688: Using a universal computer such as a Turing machine,
1689: \tl needs an initial offline
1690: setup phase (prior to interaction with the environment) where
1691: it uses a {\em hardwired} brute force proof searcher to
1692: examine all proofs of length at
1693: most $L$, filtering out those that identify programs (of maximal
1694: size $l$ and maximal runtime $t$ per cycle) which not only
1695: could interact with the environment but which for
1696: all possible interaction histories
1697: also correctly predict a lower bound of their own expected future reward.
1698: In cycle $k$, \tl then runs all programs
1699: identified in the
1700: setup phase (at most $2^l$), finds the one with highest self-rating,
1701: and executes its corresponding action.
1702: The problem-independent setup time (where almost all of the work is done)
1703: is $O(L \cdot 2^{L})$.  The online time per cycle is $O(t \cdot 2^l)$.
1704: Both are constant but typically huge.
1705: 
1706: \noindent
1707: {\bf Advantages and Novelty of the \GMn.}
1708: There are major differences between the \gm and Hutter's
1709: \hs \cite{Hutter:01fast+} and \tl \cite{Hutter:01aixi+}, including:
1710: 
1711: \begin{enumerate}
1712: \item 
1713: The theorem provers  of \hs and \tl 
1714: are hardwired, non-self-\-refe\-ren\-tial, unmodifiable meta-algorithms 
1715: that cannot improve
1716: themselves. That is, they will always suffer from the same 
1717: huge constant slowdowns (typically $\gg 10^{1000}$)
1718: buried in the $O()$-notation.
1719: But there is nothing in principle that prevents the
1720: truly self-referential code
1721: of a \gm
1722: from proving and exploiting drastic
1723: reductions of such constants, in the best possible 
1724: way that provably constitutes an improvement, if there is any. 
1725: \item
1726: The demonstration of the $O()$-optimality of \hs and \tl depends on 
1727: a clever allocation of computation time to some of their unmodifiable meta-algorithms.
1728: Our Global Optimality Theorem (Theorem \ref{globopt}, 
1729: Section \ref{secglobopt}), however, is justified through
1730: a quite different type of reasoning which indeed exploits and crucially depends
1731: on the fact that there is no unmodifiable software at all,
1732: and that the proof searcher itself is readable, modifiable,
1733: and can be improved. 
1734: This is also the reason why its self-improvements
1735: can be more than merely $O()$-optimal.
1736: \item
1737: \hs uses a ``trick'' of proving more than is necessary
1738: which also disappears in the sometimes quite misleading $O()$-notation: it
1739: wastes time on finding programs that provably compute $f(z)$
1740: for all $z \in X$ even when the current $f(x) (x \in X)$ is the only
1741: object of interest. A \gmn, however, needs to prove only what is 
1742: relevant to its goal formalized by $u$. For example, the 
1743: general $u$ of eq. (\ref{u}) 
1744: completely ignores the limited concept of $O()$-optimality, 
1745: but instead formalizes a stronger type of optimality that does not ignore huge
1746: constants just because they are constant.
1747: \item
1748: Both the \gm and \tl can maximize expected reward
1749: (\hs cannot). But the \gm is more flexible
1750: as we may plug in {\em any} type of formalizable utility function
1751: (e.g., {\em worst case} reward),  and unlike \tl it does not
1752: require an enumerable environmental distribution. 
1753: \end{enumerate}
1754: Nevertheless, we may 
1755: use \tl or \hs or other less general methods
1756: to initialize the
1757: substring $e$ of $p$ which is
1758: responsible for interaction with the environment.
1759: The \gm will replace $e(1)$ as soon
1760: as it finds a provably better strategy.
1761: 
1762: It is the {\em self-referential} aspects
1763: of the \gm that relieve us of much of the burden of careful
1764: algorithm design required for \tl and \hsn. They make the \gm both
1765: conceptually simpler {\em and} more general.
1766: \\
1767: 
1768: \item{\bf \GM vs \Oops}
1769: \label{oopsrl}
1770:  
1771: The Optimal  Ordered Problem Solver
1772: \Oops \cite{Schmidhuber:04oops,Schmidhuber:03nips}
1773: (used by \Biops in Section \ref{remaining}) extends
1774: {\em Universal Search} (Item \ref{gmvsls}). It is a bias-optimal (see Def. \ref{bias-optimal})
1775: way of searching for a
1776: program that solves each problem in an ordered  sequence
1777: of problems of a rather general type,
1778: continually organizing and managing and reusing earlier acquired knowledge.
1779: Solomonoff recently also proposed related
1780: ideas for a {\em scientist's assistant}
1781: \cite{Solomonoff:03} that modifies the probability
1782: distribution of {\em Universal Search} \cite{Levin:73} based
1783: on experience.
1784: 
1785: Like {\em Universal Search} (Item \ref{gmvsls}), 
1786: \Oops is not directly applicable to
1787: RL problems.  A provably optimal RL machine
1788: must somehow {\em prove} properties of otherwise un-testable behaviors
1789: (such as: what is the expected reward of this behavior which
1790: one cannot naively test as there is not enough time).
1791: That is part of what the \gm does: it tries
1792: to greatly cut testing time, replacing naive time-consuming
1793: tests by much faster proofs of predictable test outcomes
1794: whenever this is possible.
1795: 
1796: Proof verification
1797: itself can be performed very quickly. In particular,
1798: verifying the correctness of
1799: a found proof typically does not consume the remaining life.
1800: Hence the \gm may use \Oops as a bias-optimal proof-searching
1801: submodule (Section \ref{remaining}). Since the proofs themselves may concern quite
1802: different, {\em arbitrary} notions of optimality (not just
1803: bias-optimality), the \gm is more
1804: general than plain \Oopsn.
1805: But it is not just an extension of \Oopsn.  Instead of \Oops it may
1806: as well use non-bias-optimal alternative methods to initialize
1807: its proof searcher.
1808: On the other hand, \Oops is not just a precursor of the \gmn.
1809: It is a stand-alone, incremental, bias-optimal way of allocating
1810: runtime to programs that reuse previously successful
1811: programs, and is applicable to many traditional
1812: problems, including but not limited to proof search.
1813: \\
1814: 
1815: \item{\bf \GM vs Success-Story Algorithm and Other Metalearners}
1816: \label{ssa}
1817: 
1818: A learner's modifiable components are called its policy. An algorithm
1819: that modifies the policy is a learning algorithm. If the learning
1820: algorithm has modifiable components represented as part of the policy,
1821: then we speak of a self-modifying policy (SMP) \cite{Schmidhuber:97ssa}.
1822: SMPs can modify the way they modify themselves etc.
1823: The \gm has an SMP.
1824: 
1825: In previous practical work we used the {\em success-story algorithm} (SSA)
1826: to force some (stochastic) SMP to trigger better and better
1827: self-modifications
1828: \cite{Schmidhuber:94self,Schmidhuber:96meta,Schmidhuber:97ssa,Schmidhuber:97bias}.
1829: During the learner's life-time, SSA is occasionally called at times computed
1830: according to SMP itself. SSA uses backtracking to undo those SMP-generated
1831: SMP-modifications that have not been empirically observed to trigger
1832: lifelong reward accelerations (measured up until the current SSA
1833: call---this evaluates the long-term effects of SMP-modifications setting the
1834: stage for later SMP-modifications). SMP-mo\-di\-fi\-ca\-tions that survive SSA
1835: represent a lifelong success history. Until the next SSA call, they build
1836: the basis for additional SMP-modifications. Solely by self-modifications
1837: our SMP/SSA-based learners solved a complex task in a partially observable
1838: environment whose state space is far bigger than most found in
1839: the literature \cite{Schmidhuber:97ssa}.
1840:  
1841:  The \gmn's training algorithm is theoretically
1842:  much more powerful than SSA though.
1843:  SSA empirically measures the usefulness of previous
1844:  self-modifications, and does not necessarily encourage
1845:  provably optimal ones.
1846:  Similar drawbacks hold for
1847:  Lenat's human-assisted, non-autonomous,
1848:  self-modifying learner \cite{Lenat:83},
1849:  our Meta-Genetic Programming \cite{Schmidhuber:87} extending
1850:  Cramer's Genetic Programming \cite{Cramer:85,Banzhaf:98},
1851:  our metalearning economies \cite{Schmidhuber:87}
1852:  extending Holland's machine learning eco\-no\-mies \cite{Holland:85},
1853:  and gradient-based metalearners
1854:  for continuous program spaces of differentiable
1855:  recurrent neural networks
1856:  \cite{Schmidhuber:93selfreficann,Hochreiter:01meta}.
1857:  All these methods, however, could be used to seed
1858:  $p(1)$ with an initial policy.
1859:  \\
1860: \end{enumerate}
1861: 
1862: \subsection{Are Humans Probabilistic \GMn s?}
1863: \label{human}
1864: We do not know. We think they better be.
1865: Their initial underlying formal system for dealing with uncertainty
1866: seems to differ substantially from those of
1867: traditional expectation calculus and logic though---compare Items
1868: \ref{envaxioms} and \ref{probaxioms} in Section \ref{instructions} as
1869: well as the supermarket example 
1870: (Item \ref{supermarket} in Section \ref{possible}).
1871: 
1872: \subsection{\GMn s and Consciousness}
1873: \label{conscious}
1874: 
1875: In recent years the topic of consciousness
1876: has gained some credibility as a serious research issue,
1877: at least in philosophy and neuroscience, e.g., \cite{Crick:98}.
1878: However, there is a lack 
1879: of {\em technical} justifications of consciousness:
1880: so far nobody has shown that 
1881: consciousness is really useful for solving problems,
1882: although problem solving is considered 
1883: of central importance in philosophy \cite{Popper:99}. 
1884: 
1885: The fully self-referential \gm may be viewed as providing 
1886: just such a technical justification \cite{Schmidhuber:05gmconscious}.
1887: It is ``conscious'' or ``self-aware'' in the sense that
1888: its entire behavior is open to self-introspection,
1889: and modifiable.  
1890: It may `step outside of itself' \cite{Hofstadter:79} 
1891: by  executing self-changes that are provably good,
1892: where the proof searcher itself is subject
1893: to analysis and change through the proof techniques it tests.
1894: And this type of total self-reference is precisely 
1895: the reason for its optimality as a problem solver, in the 
1896: sense of Theorem \ref{globopt}.
1897: 
1898: \subsection{Frequently Asked Questions}
1899: \label{faq}
1900: 
1901: In the past year the author frequently
1902: fielded questions about the \gmn. 
1903: Here a list of answers to typical ones.
1904: 
1905: \begin{enumerate}
1906: 
1907: \item
1908: {\bf Q:} {\em 
1909: Does the exact business of formal proof search 
1910: really make sense in the uncertain real world?
1911: }
1912: 
1913: {\bf A:} 
1914: Yes, it does. We just need to insert into $p(1)$ the standard axioms 
1915: for representing uncertainty and for dealing with
1916: probabilistic settings and expected rewards etc. 
1917: Compare items 
1918: \ref{probaxioms} and \ref{envaxioms}
1919: in Section \ref{instructions}, and the definition
1920: of utility as an {\em expected} value in equation (\ref{u}).
1921: Also note that the machine learning literature is
1922: full of {\em human}-generated proofs of properties 
1923: of methods for dealing with stochastic environments.
1924: 
1925: \item
1926: {\bf Q:} {\em 
1927: The target theorem (\ref{goal}) seems to refer only
1928: to the {\em very first} self-change, which  may completely 
1929: rewrite the proof-search subroutine---doesn't this
1930: make the proof of Theorem \ref{globopt} invalid?
1931: What prevents later self-changes from being destructive?
1932: }
1933: 
1934: {\bf A:} 
1935: This is fully taken care of.
1936: Please have a look once more at the 
1937: proof of Theorem \ref{globopt},
1938: and note that the first self-change will
1939: be executed only if it is provably useful 
1940: (in the sense of the present untility function $u$)
1941: for all
1942: future self-changes (for which the present self-change is setting
1943: the stage). This is actually one of the main points of
1944: the whole self-referential set-up.
1945: 
1946: \item
1947: {\bf Q} (related to the previous item): {\em 
1948: The \gm implements a meta-learning behavior: what about a 
1949: meta-meta, and a meta-meta-meta level? 
1950: }
1951: 
1952: {\bf A:} 
1953: The beautiful thing is that all meta-levels are 
1954: automatically collapsed into one: any proof
1955: of a target theorem automatically proves 
1956: that the corresponding self-modification is good for all
1957: further self-modifications affected by the present one,
1958: in recursive fashion. Recall Section \ref{comment}.
1959: 
1960: \item
1961: {\bf Q:} {\em 
1962: The \gm software can produce only computable mappings
1963: from input sequences to output sequences.
1964: What if the environment is non-computable?
1965: }
1966: 
1967: {\bf A:} 
1968: Many physicists and other scientists
1969: (exceptions: \cite{Zuse:69,Schmidhuber:97brauer})
1970: actually seem to assume the real world
1971: makes use of all the real numbers, most
1972: of which are incomputable.
1973: Nevertheless,
1974: theorems and proofs are just finite symbol strings,
1975: and all treatises of physics contain only computable 
1976: axioms and theorems, even when some of the theorems
1977: can be interpreted as making statements about uncountably 
1978: many objects, such as all the real numbers. (Note though that
1979: the L\"{o}wenheim-Skolem Theorem \cite{Loewenheim:15,Skolem:19}
1980: implies that any first order theory with an uncountable model
1981: such as the real numbers also has a countable model.) 
1982: Generally speaking, formal descriptions of non-computable objects
1983: do {\em not at all} present a fundamental problem---they 
1984: may still allow for finding a strategy
1985: that provably maximizes utility. If so, a \gm
1986: can exploit this. If not, then humans will not
1987: have a fundamental advantage over \gmn s.
1988: 
1989: \item
1990: {\bf Q:} {\em 
1991: Isn't automated theorem-proving 
1992: very hard?  Current AI systems cannot prove 
1993: nontrivial theorems without human intervention 
1994: at crucial decision points.
1995: }
1996: 
1997: {\bf A:} 
1998: More and more important mathematical proofs (four color 
1999: theorem etc) heavily depend on automated proof search.
2000: And traditional theorem provers do not even make use of
2001: our novel notions of proof techniques and $O()$-optimal
2002: proof search.  Of course, some proofs are indeed hard to find, 
2003: but here humans and \gmn s face the 
2004: same fundamental limitations.
2005: 
2006: 
2007: \item
2008: {\bf Q:} {\em Don't the
2009: ``no free lunch theorems'' \cite{Wolpert:97}
2010: say that it is 
2011: impossible to construct universal problem solvers?
2012: }
2013: 
2014: {\bf A:} 
2015: No, they do not. They refer to the very special 
2016: case of problems sampled from {\em i.i.d.} uniform 
2017: distributions on {\em finite} problem spaces. 
2018: See the discussion of no free lunch theorems
2019: in an earlier paper \cite{Schmidhuber:04oops}.
2020: 
2021: \item
2022: {\bf Q:} {\em 
2023: Can't the \gm
2024: switch to a program {\em switchprog} that rewrites 
2025: the utility function to a ``bogus'' utility function that 
2026: makes unfounded promises of big rewards in the near future?
2027: }
2028: 
2029: {\bf A:} 
2030: No, it cannot. It should be obvious that rewrites
2031: of the utility function can happen only if
2032: the \gm first can prove that the rewrite
2033: is useful according to the {\em present} utility function.
2034: 
2035: \item
2036: {\bf Q:} {\em 
2037: Aren't there problems with undecidability? For example,
2038: doesn't Rice's theorem \cite{Rice:53} 
2039: or Blum's speed-up theorem \cite{Blum:67,Blum:71}
2040: pose problems?
2041: }
2042: 
2043: {\bf A:} Not at all.  Of course, the \gm cannot profit
2044: from a hypothetical useful self-improvement whose utility is undecidable,
2045: and will therefore simply ignore it. Compare Section 
2046: \ref{limits} on fundamental
2047: limitations of \gmn s (and humans, for that matter).
2048: Nevertheless, unlike previous methods, a \gm can in principle
2049: exploit at least the provably good improvements and speed-ups of 
2050: {\em any} part of its initial software.
2051: 
2052: \end{enumerate}
2053: 
2054: 
2055: \section{Conclusion}
2056: \label{conclusion}
2057: 
2058: In 1931, Kurt G\"{o}del laid the foundations of theoretical
2059: computer science, using
2060: elementary arithmetics to build a universal programming language
2061: for encoding arbitrary proofs, given an arbitrary enumerable set of
2062: axioms.  He went on to construct {\em
2063: self-referential} formal statements that claim their own unprovability,
2064: using Cantor's diagonalization trick \cite{Cantor:1874}
2065: to demonstrate that formal systems such as
2066: traditional mathematics are either flawed in a certain
2067: sense or contain unprovable but true statements \cite{Goedel:31}.
2068: Since G\"{o}del's exhibition of the fundamental limits of proof and
2069: computation, and Konrad Zuse's subsequent construction of the first working
2070: programmable computer (1935-1941),
2071: there has been a lot of work on specialized algorithms
2072: solving problems taken from more or less general problem classes.
2073: Apparently, however,
2074: one remarkable fact has so far escaped the attention of computer
2075: scientists: it is possible to use self-referential proof systems to
2076: build optimally efficient yet conceptually very simple
2077: universal problem solvers.
2078: 
2079: The initial software $p(1)$ of our \gm
2080: runs an initial, typically sub-optimal problem solver, e.g., one of
2081: Hutter's approaches 
2082: \cite{Hutter:01aixi+,Hutter:01fast+}
2083: which have at least an optimal {\em order} of complexity,
2084: or some less general method  \cite{Kaelbling:96}.
2085: Simultaneously, it runs an $O()$-optimal initial proof 
2086: searcher using an online variant of {\em Universal Search}
2087: to test {\em proof techniques}, which are 
2088: programs able to compute proofs
2089: concerning the system's own future performance,
2090: based on an axiomatic system $\cal A$ 
2091: encoded in $p(1)$,
2092: describing a formal {\em utility} function $u$,
2093: the hardware and $p(1)$ itself. 
2094: If there is no provably good, globally optimal 
2095: way of rewriting $p(1)$ at all, then humans
2096: will not find one either. But if there is one,
2097: then $p(1)$ itself can find 
2098: and exploit it. This approach yields the first
2099: class of theoretically sound, fully self-referential,
2100: optimally efficient, general problem solvers.
2101: 
2102: After the theoretical discussion in Sections \ref{intro}-\ref{biops},
2103: one practical question remains: to build a particular,
2104: especially practical \gm with small initial constant overhead,
2105: which generally useful 
2106: theorems should one add as axioms to $\cal A$
2107: (as initial bias) such that the initial searcher  
2108: does not have to prove them from scratch? 
2109: 
2110: \section{Acknowledgments}
2111: Thanks to
2112: Alexey Chernov,
2113: Marcus Hutter,
2114: Jan Poland,
2115: Ray Solomonoff,
2116: Sepp Hochreiter,
2117: Shane Legg,
2118: Leonid Levin,
2119: Alex Graves,
2120: Matteo Gagliolo,
2121: Viktor Zhumatiy,
2122: Ben Goertzel,
2123: Will Pearson,
2124: and Faustino Gomez,
2125: for useful comments on
2126: drafts or summaries or
2127: earlier versions of this paper.
2128: I am also grateful to many others who 
2129: asked questions during \gm talks or
2130: sent comments by email. Their input helped to shape 
2131: Section \ref{faq} on frequently asked questions.
2132: 
2133: \newpage
2134: \bibliography{bib}
2135: \bibliographystyle{plain}
2136: \end{document}
2137: 
2138: 
2139: