1: \documentclass{elsart}
2:
3:
4: \usepackage{amssymb}
5: \newcommand{\maj}{\mbox{MAJ}}
6: \newtheorem{theo}{\bf Theorem}
7: \newtheorem{theorem}{\bf Theorem}
8: \newtheorem{lemma}{\bf Lemma}
9: \newtheorem{corollary}{\bf Corollary}
10: \newtheorem{notation}{\bf Notation}
11: \newtheorem{definition}{\bf Definition}
12: %\newtheorem{claim}{\bf Claim}
13: \newtheorem{remark}{\bf Remark}
14: \newenvironment{comment}{\begin{small} \begin{quotation}}{\end{quotation} \end{small}}
15: \newenvironment{proof}{\par \bf Proof. \rm}{$\Box$ \vspace{1ex}}
16:
17:
18: \begin{document}
19: \date{}
20:
21: \begin{frontmatter}
22:
23:
24: \title{Sharpening Occam's Razor%\thanksref{title}}
25: %\thanks[title]{A preliminary version was presented at
26: %the {\em 8th Intn'l Computing and Combinatorics Conference
27: %(COCOON)}, held in Singapore, August, 2002.
28: }
29:
30: \author{Ming Li, %\thanksref{ming}
31: }
32: %\address{Department of Computer Science, University of Waterloo, %Univ. California
33: %Santa Barbara, CA 93106, USA,
34: %E-mail: mli@cs.ucsb.edu
35: %}
36: \author{John Tromp, %\thanksref{tromp}
37: }
38: %\address{CWI, %Kruislaan 413, 1098 SJ Amsterdam, The Netherlands;
39: %Email: tromp@cwi.nl}
40: \author{Paul Vit\'{a}nyi%\thanksref{vitanyi}
41: }
42: %\address{CWI,% Kruislaan 413, 1098 SJ Amsterdam, The Netherlands;
43: %Email: paulv@cwi.nl}
44:
45: %\thanks[ming]{
46: %Supported in part by
47: %the NSERC Operating Grant OGP0046506, ITRC, and
48: %NSF-ITR Grant 0085801 at UCSB.
49: %}
50: %\thanks[tromp]{
51: %Partially supported by an
52: %NSERC International Fellowship and ITRC.
53: %}
54: %\thanks[vitanyi]{
55: %Affiliated with CWI and the University of Amsterdam.
56: %Supported in part by the
57: %EU fifth framework project QAIP, IST--1999--11234, the NoE QUIPROCONE IST--1999--29064,
58: %the ESF QiT Programmme, and the EU Fourth Framework BRA
59: %NeuroCOLT II Working Group
60: %EP 27150.
61: %}
62:
63: %\renewcommand{\baselinestretch}{1.2}
64: %\setlength{\topmargin}{-0.2in}
65: \setlength{\textwidth}{6in}
66: \setlength{\oddsidemargin}{0.0in}
67: \setlength{\evensidemargin}{0.0in}
68: \setlength{\textheight}{8in}
69: %\setlength{\footskip}{0.5in}
70: %\setlength{\parskip}{6 pt plus 2pt minus 1pt}
71:
72: %\newtheorem{theorem}{\sc Theorem}
73: %\newtheorem{lemma}{\sc Lemma}
74: %\newtheorem{coro}{\sc Corollary}
75: %\newtheorem{nota}{\sc Notation}
76: %\newtheorem{defin}{\sc Definition}
77: %\newtheorem{cla}{\sc Claim}
78: %\newtheorem{ex}{\sc Example}
79: %\newenvironment{proof}{\par \sc Proof.\rm}{\hspace*{\fill}$\qed$\vspace{1ex}}
80: %\newenvironment{example}{\begin{ex}}{\hspace*{\fill}$\Diamond$\end{ex}}
81: %\newenvironment{claim}{\begin{cla}}{\end{cla}}
82: %\newenvironment{corollary}{\begin{coro}}{\end{coro}}
83: %\newenvironment{definition}{\begin{defin}}{\end{defin}}
84: %\newenvironment{notation}{\begin{nota}}{\end{nota}}
85: %\newenvironment{comment}{\begin{small}\begin{quotation}\hspace{-0.23in}\rm}{\end{quotation}\end{small}}
86: %
87:
88: %\pagestyle{empty}
89:
90: %\normalsize
91: \begin{abstract}
92: We provide a new representation-independent
93: formulation of Occam's razor theorem, based on
94: Kolmogorov complexity. This new formulation allows us to:
95: (i) Obtain better sample complexity than both length-based \cite{blumer1}
96: and VC-based \cite{blumer} versions of Occam's razor theorem, in many
97: applications; and (ii)
98: Achieve a sharper reverse of Occam's razor theorem than that of
99: \cite{board}. Specifically, we weaken the assumptions
100: made in \cite{board} and extend the reverse to superpolynomial
101: running times.
102: \end{abstract}
103: \begin{keyword}
104: Analysis of algorithms \sep
105: pac-learning \sep Kolmogorov complexity \sep Occam's razor-style theorems
106: \end{keyword}
107:
108: \end{frontmatter}
109:
110: %\newcommand{\proof}{{\bf Proof. \enspace}}
111: %\newtheorem{theorem}{Theorem}
112: %\newtheorem{lemma}[theorem]{Lemma}
113: %\newtheorem{corollary}[theorem]{Corollary}
114: %\newtheorem{definition}{Definition}
115: %\newtheorem{claim}[theorem]{Claim}
116: %\newtheorem{conjecture}[theorem]{Conjecture}
117:
118: \section{Introduction} \label{introsec}
119: Occam's razor theorem as formulated
120: by \cite{blumer,blumer1} is arguably the substance of efficient pac learning.
121: Roughly speaking, it says that in order to (pac-)learn, it suffices to compress.
122: A partial reverse, showing the necessity of compression,
123: has been proved by Board and Pitt \cite{board}.
124: %%added [Paul]
125: Since the theorem is about the relation between effective
126: compression and pac learning, it is natural to assume that
127: a sharper version ensues by couching it in terms
128: of the {\em ultimate} limit to effective compression which is
129: the Kolmogorov complexity. We present results in that direction.
130: %%end addition [Paul]
131:
132: Despite abundant research generated by its importance,
133: several aspects of Occam's razor
134: theorem remain unclear. There are basically two versions.
135: The {\em VC dimension-based version} of Occam's razor theorem
136: (Theorem 3.1.1 of \cite{blumer})
137: gives the following upper bound on sample complexity:
138: For a hypothesis
139: space $H$ with $VCdim(H)=d$, $1 \leq d < \infty$,
140: \begin{equation}\label{vc-sample}
141: m(H,\delta , \epsilon ) \leq \frac{4}{\epsilon}
142: (d \log \frac{12}{\epsilon} + \log \frac{2}{\delta} ).
143: \end{equation}
The following lower bound was proved by Ehrenfeucht {\it et al.}\ \cite{ehren}.
145: \begin{equation}\label{vc-lowerbound}
146: m(H,\delta , \epsilon ) > \max (\frac{d-1}{32 \epsilon},
147: \frac{1}{\epsilon} \ln \frac{1}{\delta} ).
148: \end{equation}
149: The upper bound in (\ref{vc-sample}) and the lower
150: bound in (\ref{vc-lowerbound}) differ by a factor
151: $\Theta (\log \frac{1}{\epsilon} )$. It was shown in
152: \cite{haussler} that this factor is, in a sense, unavoidable.
153:
154: When $H$ is finite, one can directly obtain the following bound
155: on sample complexity for a consistent algorithm:
156: \begin{equation}\label{direct-sample}
157: m(H,\delta , \epsilon ) \leq \frac{1}{\epsilon} \ln \frac{|H|}{\delta}.
158: \end{equation}
159: For a graded boolean space $H_n$, we have the
160: following relationship between
161: the VC dimension $d$ of $H_n$ and the cardinality of $H_n$,
162: \begin{equation}
163: d \leq \log |H_n | \leq nd.
164: \end{equation}
165:
166: When $\log |H_n|=O(d)$ holds, then the sample complexity upper bound
167: given by (\ref{direct-sample}) can be seen to equal
168: $\frac{1}{\epsilon} (O(d)+\ln \frac{1}{\delta})$ which matches the lower bound
169: of (\ref{vc-lowerbound}) up to a constant factor,
170: and thus every consistent
171: algorithm achieves optimal sample complexity for such hypothesis spaces.
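
\noindent
{\it Illustration.} The following sketch may help to see where
(\ref{direct-sample}) comes from and how it specializes; the constant $c$
and the use of base-2 logarithms for $\log$ are assumptions made here
only for illustration.
A fixed hypothesis with error greater than $\epsilon$ is consistent with
$m$ independent examples with probability at most
$(1-\epsilon)^m \leq e^{-\epsilon m}$, so by the union bound some such
hypothesis in $H$ survives with probability at most
\[ |H| e^{-\epsilon m} \leq \delta \quad \mbox{whenever} \quad
m \geq \frac{1}{\epsilon} \ln \frac{|H|}{\delta}. \]
If moreover $\log |H_n| \leq cd$, then
\[ \frac{1}{\epsilon} \ln \frac{|H_n|}{\delta}
= \frac{1}{\epsilon}\left( (\ln 2) \log |H_n| + \ln \frac{1}{\delta}\right)
\leq \frac{1}{\epsilon}\left( (c \ln 2)\, d + \ln \frac{1}{\delta}\right). \]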
172:
173: The {\em length-based version} of Occam's razor theorem then
gives the following sample complexity $m$ to guarantee that
175: the algorithm pac-learns:
176: For given $\epsilon$ and $\delta$:
177: \begin{equation}\label{length-sample}
178: m = \max (\frac{2}{\epsilon} \ln \frac{1}{\delta} ,
(\frac{(2\ln 2)s^{\beta}}{\epsilon} )^{1/(1-\alpha)} ) .
180: \end{equation}
181: This bound is based on the {\em length-based}
182: Occam algorithm \cite{blumer}:
183: A {\em deterministic} algorithm that returns a consistent hypothesis of
184: length at most $m^\alpha s^\beta$, where $\alpha < 1$ and $s$ is the length
185: of the target concept.
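
\noindent
{\it Illustration.} As a hedged sanity check of (\ref{length-sample}),
take the exponents $\alpha=\frac{1}{2}$ and $\beta=1$; these values are
chosen here purely for illustration and are not taken from any
particular algorithm. Then $1/(1-\alpha)=2$ and the bound reads
\[ m = \max \left( \frac{2}{\epsilon} \ln \frac{1}{\delta} ,
\frac{4 (\ln 2)^2 s^2}{\epsilon^2} \right), \]
whereas for $\alpha=0$ (the output length does not grow with the sample)
the second term is only $(2 \ln 2) s^{\beta}/\epsilon$.
The closer $\alpha$ is to $1$, the weaker the compression and the larger
the exponent $1/(1-\alpha)$.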
186:
187: %In case of total example compression, when $\alpha=0$,
188: %this is competitive with \ref{direct-sample}, and
189: %(when $s \geq \log \frac{1}{\delta}$) Equation~\ref{length-sample} becomes
190: %\begin{equation}\label{length-simple}
191: %m = O( \frac{s^{\alpha}}{\epsilon} ).
192: %\end{equation}
193: %Here, if we replace $H$ in Equation~\ref{direct-sample}
194: %by the smallest $H'$ containing the learned hypothesis (of length
195: %$n^\alpha$ in Equation~\ref{length-simple}), then the sample complexities
196: %in Equations~\ref{direct-sample} and \ref{length-simple} are
197: %approximately the same, as it should be. Since the formula in
198: %Equation~\ref{direct-sample} is not easy to use.
199: %In practice,
200: %it is the Formula~\ref{length-sample}, or more often
201: %Formula~\ref{length-simple}, that are used.
202:
203: In summary, the VC dimension based Occam's razor theorem
204: may be hard to use and it sometimes does not give the best sample
205: complexity. The length-based Occam's razor is more convenient
206: to use and often gives better sample complexity in the discrete case.
207:
208: %%added rephrased paragraph[Paul]
However, as we demonstrate below, the length-based
Occam's razor theorem sometimes gives inferior sample
complexity, which can be due to a redundant representation format of the concept.
212: %%end rephrased paragraph [Paul]
213: We believe Occam's razor theorem should be
214: ``representation-independent''. That is, it should not be dependent
215: on accidents of ``representation format''. (See \cite{manfred} for
216: other representation-independence issues.) In fact, the sample
217: complexities given in (\ref{vc-sample}) and (\ref{vc-lowerbound})
are indeed representation-independent. However, they are not
219: easy to use and do not give optimal sample complexity.
220: Here, we give a Kolmogorov complexity based Occam's razor
221: theorem. We will demonstrate that our KC-based Occam's razor theorem
222: is convenient to use (as convenient as the length based
223: version), gives a better sample complexity than the
224: length based version, and is representation-independent.
225: In fact, the length based version
226: can be considered as a specific computable approximation
227: to the KC-based Occam's razor.
228:
229: As one of the examples, we will demonstrate that the standard trivial learning
230: algorithm for monomials actually often
231: has a {\it better sample complexity}
232: than the more sophisticated Haussler's greedy algorithm \cite{hauss}.
This is
contrary to the common, but mistaken,
belief that Haussler's algorithm is better
in all cases (to be sure, Haussler's method is superior
for target monomials of small length).
238: Another issue related to Occam's razor theorem is the
239: status of the reverse assertion.
240: Although a partial reverse of Occam's razor theorem has
241: been proved by \cite{board}, it applied only to the case of
242: polynomial running time and sample complexity.
243: They also required a property
244: of closure under exception list. This latter requirement, although
245: quite general, excludes some reasonable concept classes. Our new
246: formulation of Occam's razor theorem allows us to
247: prove a more general reverse of Occam's razor
theorem, allowing arbitrary running time and weakening
249: the requirement of exception list of \cite{board}.
250:
251: \footnote{A preliminary version was presented at
252: the {\em 8th Intn'l Computing and Combinatorics Conference
253: (COCOON)}, held in Singapore, August, 2002.
254: }
255:
256: {\bf Discussion of Result and Technique:}
257: In our approach we obtain better
258: bounds on the sample complexity to learn the representation of
259: a target concept in the given representation system.
260: These bounds, however, are representation-independent
261: and depend only on the Kolmogorov complexity of the target concept.
262: If we don't care about the representation of the hypothesis
263: (but that is not the case in this paper) then better ``iff Occam style''
characterizations of polynomial time learnability/predictability
265: can be given. They rely
266: on Schapire's result that ``weak learnability''
267: equals ``strong learnability'' in polynomial time \cite{Sch90}
268: exploited in \cite{HeWa95}. For a recent survey of
269: the important related ``boosting'' technique see \cite{Sch02}.
270:
271: The use of Kolmogorov complexity is to obtain a bound on the
size of the hypothesis class for a fixed (but arbitrary)
273: target concept.
274: Obviously, the results described
275: can be obtained using other proof methods---all true provable statements
276: must be provable from the axioms of mathematics by the inference methods
277: of mathematics. The question is whether
278: a particular proof method facilitates and guides the proving effort.
279: The message we want to convey is that thinking in terms of coding
and incompressibility suggests improvements to long-standing results.
281: A survey of the use of the Kolmogorov complexity
282: method in combinatorics, computational complexity, and
283: the analysis of algorithms is \cite{lv} Chapter 6.
284:
285:
286: \section{Occam's Razor}
287: Let us assume the usual definitions, say Anthony and Biggs \cite{anthony},
288: and notation of \cite{board}. For
289: Kolmogorov complexity we assume the basics of \cite{lv}.
290:
In the following $\Sigma$ and $\Gamma$ are finite {\em alphabets}: we
consider only discrete learning problems in this paper.
The set of finite strings over $\Sigma$ is denoted by $\Sigma^*$
and similarly for $\Gamma$.
An element of $\Sigma^*$ is an {\em example}, and a {\em concept}
is a set of examples (a language over $\Sigma$).
A {\em representation} is an element of $\Gamma^*$.
298:
299:
300: \begin{definition}
301: A {\em representation system} is a tuple $(R,\Gamma , c , \Sigma )$, where
302: $R \subset \Gamma^*$ is the set of representations, and
303: $c:R \rightarrow 2^{\Sigma^*}$ maps representations to concepts, the latter
304: being languages over $\Sigma$.
305: \end{definition}
306:
307: Hence, given $R$ the mapping $c$ determines a {\em concept class}.
For example, let $\Gamma$ be the alphabet to express Boolean formulas,
309: $\Sigma = \{0,1\}$, and let
310: $R$ be the subset of disjunctive normal form (DNF) formulas.
311: Let $c$ map each element $r \in R$, say a DNF formula over $n$
312: variables, to $c(r) \subseteq \{0,1\}^n$ such that every example
313: $e \in c(r)$ viewed as truth-value assignment makes $r$ ``true''.
That is, if $e=e_1 \ldots e_n$ and we assign ``true'' or ``false''
to the $i$th variable in $r$ according to whether $e_i$ equals ``1''
or ``0'' then $r$ becomes ``true''. Each concept in the thus defined
317: concept class is the set of truth assignments that make a particular
318: DNF formula ``true''.
319:
320: \begin{definition}
321: A {\em pac-algorithm} for a representation system
322: ${\bf R} = (R,\Gamma , c , \Sigma )$ is a randomized algorithm $L$
323: such that, for every $s,n\geq 1,\epsilon>0,\delta>0,r \in R^{\leq s}$,
324: and every probability distribution $D$ on $\Sigma^{\leq n}$,
325: if $L$ is given $s,n,\epsilon,\delta$ as input
326: and has access to an oracle providing examples of $c(r)$ (the concept
327: represented by $r$) according to $D$,
328: then $L$,
329: with probability at least $1-\delta$, outputs a representation $r' \in R$
330: approximating the target $r$ in the sense that
331: $D(c(r')\Delta c(r)) \leq \epsilon$.
332: Here, $\Delta$ denotes the symmetric set difference.
333: \end{definition}
334: The acronym ``pac'' coined by Dana Angluin
335: stands for ``probably approximately correct'' which aptly captures the
336: requirement the output representation must satisfy according to the definition.
The question of interest in pac-learning is how many examples
(and how much running time) a learning algorithm needs to qualify as a pac-algorithm.
The {\em running time} and number of examples ({\em sample complexity})
340: of the pac-algorithm are
341: expressed as functions $t(n,s,\epsilon,\delta)$ and
342: $m(n,s,\epsilon,\delta)$. The following definition generalizes the
343: notion of Occam algorithm in \cite{blumer}:
344:
345: \begin{definition}
346: \label{def.kcoccam}
347: An {\em Occam-algorithm} for a representation system
348: ${\bf R} = (R,\Gamma , c , \Sigma )$ is a randomized algorithm which
349: for every $s,n\geq 1, \gamma >0$,
350: on input of a sample consisting of
351: $m$ examples of a fixed target $r\in R^{\leq s}$,
352: with probability at least $1-\gamma$ outputs a representation $r' \in R$
353: consistent with the sample, such that $K(r' \mid r,n,s) < m/f(m,n,s,\gamma)$,
354: with $f(m,n,s,\gamma)$, the compression achieved, being an increasing
355: function of $m$.
356: \end{definition}
357: The {\em length-based version} of (possibly randomized)
358: Occam algorithm can be obtained
by replacing $K(r' \mid r,n,s)$ by $|r'|$ in this definition.
360: The {\em running time} of the Occam-algorithm is expressed as a function
361: $t(m,n,s,\gamma)$, where $n$ is the maximum length of the input examples.
362:
363: \begin{remark}\label{rem.kco}
364: \rm
365: An Occam algorithm satisfying a given $f$,
366: achieves a lower bound on the number $m$ of examples required
367: in terms of $K(r' \mid r,n,s)$, the Kolmogorov complexity of
368: the outputted representation conditioned on the target representation,
369: rather than the (maximal) length $s$ of $r$ as in the original Occam
370: algorithm \cite{blumer} and the length-based version above.
371: This improvement enables one to use information drawn from the hidden
372: target for reduction of the Kolmogorov complexity of the output representation,
373: and hence further reduction of the required sample complexity.
374: \end{remark}
375: We need to show that the main properties
376: of an Occam algorithm are preserved under this generalization.
377: Our first theorem is a Kolmogorov complexity based Occam's Razor.
378: We denote the minimum $m$ such that $f(m,n,s,\gamma) \geq x$ by
379: $f^{-1}(x,n,s,\gamma)$, where we set $f^{-1}(x,n,s,\gamma)=\infty$
380: if $f(m,n,s,\gamma) < x$
381: for every $m$.
382:
383: \begin{theorem}
384: \label{KCoccam}
385: Suppose we have an Occam-algorithm for
386: ${\bf R} = (R,\Gamma , c , \Sigma )$ with compression $f(m,n,s,\gamma)$.
387: %that, for $r \in R^{\leq s}$, produces representations satisfying
388: %\[ K(r'|r,n,s) \leq m/
389: %Write $f$ as $f(m,\gamma)$ with the other parameters implicit.
390: Then there is a pac-learning algorithm
391: for {\bf R} with sample complexity
392: \[ m(n,s,\epsilon,\delta) =
393: \max \left\{\frac{2}{\epsilon}\ln \frac{2}{\delta},
394: f^{-1}(\frac{2\ln 2}{\epsilon},n,s,\delta/2) \right\}, \]
395: and running time $t_{\mbox{pac}}(n,s,\epsilon,\delta) =
396: t_{\mbox{occam}}(m(n,s,\epsilon,\delta),n,s,\delta/2)$.
397: \end{theorem}
398:
399: \begin{proof}
400: On input of $\epsilon,\delta,s,n$, the learning algorithm will take a sample
401: of length $m=m(n,s,\epsilon,\delta)$ from the oracle, then
402: use the Occam algorithm with $\gamma=\delta/2$ to find a hypothesis
403: (with probability at least $1-\delta/2$) consistent with the sample and
404: with low Kolmogorov complexity.
405: In the proof we abbreviate $f(m,n,s,\gamma)$ to $f(m)$
406: with the other parameters implicit.
Learnability follows in the standard manner from bounding
(by the remaining $\delta/2$) the probability
that all $m$ examples of the target concept
fall outside the symmetric difference with a bad hypothesis,
a set of probability $\epsilon$ or greater.
412: Let $m = m(n,s,\epsilon,\delta)$. Then
413: $m \geq f^{-1} (\frac{2 \ln 2}{\epsilon} ,n,s, \frac{\delta}{2})$ gives
414: \[ \epsilon - \frac{\ln 2}{f(m)} \geq \frac{\epsilon}{2}, \]
415: and therefore $m \geq \frac{2}{\epsilon}\ln \frac{2}{\delta}$ gives
416: \[ m(\epsilon - \frac{\ln 2}{f(m)} ) \geq \ln \frac{2}{\delta} .\]
This implies (negating, exponentiating both sides, and
using $1-\epsilon<e^{-\epsilon }$)
419: \[ 2^{m/f(m)}(1-\epsilon)^{m} \leq \delta/2 .\]
420: The probability that some concept the Occam-algorithm can output
421: has all $m$ examples being bad is at most
422: the number of concepts of complexity less than $m/f(m)$, times
423: $(1-\epsilon)^m$, which by the above is at most $\delta/2$.
424: \end{proof}
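
\noindent
{\it Illustration.} The last two steps of the proof can be spelled out
as follows; this is only a sketch of the standard counting argument,
written in the notation of the proof.
Negating $m(\epsilon - \frac{\ln 2}{f(m)} ) \geq \ln \frac{2}{\delta}$
and exponentiating gives
\[ 2^{m/f(m)} e^{-\epsilon m}
= \exp \left( \frac{m \ln 2}{f(m)} - \epsilon m \right)
\leq \exp \left( - \ln \frac{2}{\delta} \right) = \frac{\delta}{2} , \]
and $(1-\epsilon)^m \leq e^{-\epsilon m}$.
For the count, every representation $r'$ with
$K(r' \mid r,n,s) < m/f(m)$ has a description (given $r,n,s$) of fewer
than $m/f(m)$ bits, and there are at most $2^{k}-1$ binary strings of
length less than an integer $k$. Taking $k=m/f(m)$ (assumed integral here
for simplicity) bounds the number of candidate outputs by $2^{m/f(m)}$,
and the union bound over these candidates yields the stated $\delta/2$.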
425:
426: \begin{corollary}
427: When the compression is of the form
428: \[f(m,n,s,\gamma) = \frac{m^{1-\alpha}}{p(n,s,\gamma)},\]
429: one can achieve a sample complexity of
430: \[\max\left\{\frac{2}{\epsilon}\ln \frac{2}{\delta},
431: \left( \frac{(2 \ln 2)p(n,s,\delta/2)}{\epsilon} \right)^{1/(1-\alpha)}\right\}.\]
432: In the special case of total compression, where $\alpha=0$, this
433: further reduces to
434: % Equation~\ref{length-sample}:
435: \begin{equation} \label{total-compression}
436: \frac{2}{\epsilon}\left\{\max(\ln \frac{2}{\delta},(\ln 2)
437: p(n,s,\delta/2))\right\}.
438: \end{equation}
439: For deterministic Occam-algorithms, we can furthermore replace
440: $2/\delta$ and $\delta/2$ in Theorem~\ref{KCoccam} by $1/\delta$ and
441: $\delta$ respectively.
442: \end{corollary}
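
\noindent
{\it Illustration.} The first bound of the corollary follows by computing
$f^{-1}$ directly; the following lines sketch that computation for
$0 \leq \alpha < 1$ and $p>0$, with $p$ abbreviating $p(n,s,\delta/2)$.
Since $m^{1-\alpha}$ increases with $m$,
\[ \frac{m^{1-\alpha}}{p} \geq \frac{2 \ln 2}{\epsilon}
\quad \Longleftrightarrow \quad
m \geq \left( \frac{(2 \ln 2) p}{\epsilon} \right)^{1/(1-\alpha)} , \]
so $f^{-1}(\frac{2\ln 2}{\epsilon},n,s,\delta/2)$ is this right-hand side
(rounded up to an integer), and substituting it into
Theorem~\ref{KCoccam} gives the stated maximum.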
443:
444: \begin{remark}
445: \rm
446: Essentially, our new
447: Kolmogorov complexity condition is a computationally
448: universal generalization of the length condition in the original
449: Occam's razor theorem of \cite{blumer1}. Here, in
450: Theorem~\ref{KCoccam}, we consider the
451: shortest description length over all effective representations,
452: given the target representation,
453: rather than in a specific (syntactical) representation system.
454: %This is representation-independent in the very strong sense
455: %of being an absolute and objective notion,
456: %which is recursively invariant by Church's thesis and the ability
457: %of universal machines to simulate each other.
458: This allows us to bound the required sample complexity
459: not by a function of the number of hypotheses
460: (returned representations)
461: of length at most the bound on
462: the length of the target representation, but by a similar
463: function of the number of hypotheses
464: that have a certain Kolmogorov complexity conditioned
465: on the target concept, see Remark~\ref{rem.kco}.
466: Nonetheless, like in the original Occam's razor Theorem of \cite{blumer1},
467: we return a representation of a concept approximating the target
468: concept in the given representation system, rather than
469: a representation outside the system like in Boosting approaches.
470: \end{remark}
471:
472: Suppose we have a concept $c$ and a mis-classified example $x$---an
473: {\em exception}. Then, the symmetric difference $c \Delta \{x\}$
474: classifies $x$ correctly: if $x \not\in c$ then
$c \Delta \{x\} = c \cup \{x\}$, and if $x \in c$ then
476: $c \Delta \{x\} = c \setminus \{x\}$.
477:
478: \begin{definition}
479: An {\em exception handler} for a representation system
480: ${\bf R} = (R,\Gamma , c , \Sigma )$ is an algorithm which
481: on input of a representation $r\in R$ of length $s$,
482: and an $x \in \Sigma^{\ast}$ of length $n$,
483: outputs a representation $r' \in R$ of the concept $c(r) \Delta \{x\}$,
484: of length at most $e(s,n)$, where $e$ is called the {\em exception expansion}
485: function.
486: The running time of the exception-handler is expressed as a function
487: $t(n,s)$ of the representation and exception lengths.
488: If $t(n,s)$ is polynomial in $n,s$, and furthermore $e(s,n)$ is of the form
489: $s+p(n)$ for some polynomial $p$, then we say ${\bf R}$ is {\em polynomially
490: closed under exceptions}.
491: \end{definition}
492:
493: %\begin{definition}
494: %A class ${\bf R} = (R,\Gamma , c , \Sigma )$ is {\em closed under
495: %exceptions} iff it has an exception handler.
496: %\end{definition}
497:
498: \begin{theorem}
499: \label{sex}
500: Let $L$ be a deterministic pac-algorithm
501: with $m(n,s,\frac{1}{2n},\gamma)$ the sample size,
502: and let $E$ be an exception handler for
503: a representation system ${\bf R}$.
504: Then there is an Occam algorithm for ${\bf R}$
505: that for $m$ examples achieves compression
506: $f(m,n,s,\gamma)= \frac{1}{2\epsilon n}$.
Here $m \geq 2nm(n,s,\frac{1}{2n},\gamma)$, and
$\epsilon$, depending on $m,n,s,\gamma$, is such that
$m(n,s,\epsilon,\gamma)=\epsilon m$ holds.
510: \end{theorem}
511:
512: \begin{proof}
513: The proof is obtained in a fashion similar to
514: \cite{board}.
515: Suppose we are given a sample of length $m$ and confidence parameter
516: $\gamma$.
517: Assume without loss of generality that the sample
518: contains $m$ different examples.
519: Define a uniform distribution on these examples with $\mu (x) = 1/m$
520: for each $x$ in the sample.
521: Let $\epsilon$ be as described.
522: The function $m(n,s,\epsilon,\gamma)$ decreases with
523: increasing $\epsilon$,
524: while the function $\epsilon m$ increases with $\epsilon$
525: so the two necessarily intersect, under the assumption in the theorem,
526: for some $\epsilon_0$, although it may yield an
527: $\epsilon_0 >\frac{1}{2n}$, giving no actual compression.
528: For example, if $m(n,s,\epsilon,\gamma)= (\frac{1}{\epsilon})^{b}$
529: for some constant $b$, then $\epsilon_0 = m^{-1/(b+1)}$.
530: %Let the pac-learning algorithm have sample complexity
531: %$m(n,s,\epsilon,\delta)$.
532: Apply $L$ with $\delta = \gamma$ and $\epsilon = \epsilon_0$.
533: %It will use at most $m^{b/(b+1)}$ of our $m$ examples.
534: With probability $1-\gamma$,
535: it produces a concept which is correct with error $\epsilon$,
536: giving up to $\epsilon m$ exceptions.
537: We can just add these one by one using
538: the exception handler.
539: This will expand the concept size, but not the Kolmogorov complexity.
540: The resulting representation can be described by the $\leq \epsilon m$
examples used plus the $\leq \epsilon m$ exceptions found.
542: Since $L$ is deterministic, this uniquely determines the required
543: consistent concept.
544: % plus some constant for the various algorithms involved.
545: The compression achieved is $\frac{m}{2\epsilon mn} = \frac{1}{2\epsilon n}$.
546: This is an increasing function of $m$, since increasing the slope of
547: the function $\epsilon m$ moves its intersection with
548: the function $m(n,s,\epsilon,\gamma)$ to the left, that is,
549: to smaller $\epsilon$.
550: \end{proof}
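
\noindent
{\it Illustration.} To make the compression concrete, consider the sample
complexity $m(n,s,\epsilon,\gamma)= (\frac{1}{\epsilon})^{b}$ used as an
example in the proof; the following is only a sketch that ignores
rounding to integers.
Solving $(\frac{1}{\epsilon})^{b} = \epsilon m$ gives
$\epsilon_0 = m^{-1/(b+1)}$, so
\[ f(m,n,s,\gamma) = \frac{1}{2 \epsilon_0 n} = \frac{m^{1/(b+1)}}{2n} ,
\qquad
\frac{m}{f(m,n,s,\gamma)} = 2n m^{b/(b+1)} . \]
Thus the output is described in at most $2n m^{b/(b+1)}$ bits, which is
sublinear in $m$, and $f \geq 1$ holds exactly when
$m \geq (2n)^{b+1}$, that is, when $\epsilon_0 \leq \frac{1}{2n}$.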
551:
552: \begin{definition}
553: Let ${\bf R} = (R,\Gamma , c , \Sigma )$ be a representation system.
554: The concept $\maj(r_1,r_2,r_3)$
555: is the set $\{x :$ $x$ belongs to at least two out of
556: the three concepts $c(r_1),c(r_2),c(r_3)\}$.
557: A {\em majority-of-three algorithm} for
558: ${\bf R}$ is an algorithm which
on input of three representations $r_1,r_2,r_3 \in R^{\leq s}$,
560: outputs a representation $r' \in R$ of the concept $\maj(r_1,r_2,r_3)$
561: of length at most $e(s)$, where $e$ is called
562: the {\em majority expansion}
563: function.
564: The running time of the algorithm is expressed as a function
565: $t(s)$ of the maximum representation length.
566: If $t(s)$ and $e(s)$ are polynomial in $s$
567: then we say ${\bf R}$ is {\em polynomially
568: closed under majority-of-three}.
569: \end{definition}
570:
571: \begin{theorem}
572: \label{maj}
573: Let $L$ be a deterministic pac-algorithm with sample complexity
574: $m(n,s,\epsilon,\delta) \in o(1/\epsilon^2)$, and let $M$
575: be a majority-of-three algorithm for
576: the representation system ${\bf R}$.
577: Then there is an Occam algorithm for ${\bf R}$ that for $m$ examples
has compression $f(m,n,s,\gamma)=m/(3n\,m(n,s,\frac{1}{2\sqrt{m}},\gamma/3))$.
579: \end{theorem}
580:
581: \begin{proof}
582: Let us be given a sample of length $m$.
583: Take $\delta = \gamma / 3$ and $\epsilon = \frac{1}{2\sqrt{m}}$.
584: %The reason we take $\epsilon$ to be $1/\sqrt{m}$ is that
585: %it turns out that the $\epsilon$s of the three stages are related
586: %as $\epsilon_1 \sim \epsilon_3$ and $2 \epsilon_1 \epsilon_2m = 1$,
587: %so this is the best way to balance them.
588:
589: {\it Stage 1:} Define a uniform distribution on the $m$ examples
590: with $\mu_1 (x) = 1/m$ for each $x$ in the sample.
591: Apply the learning algorithm.
592: It produces (with probability at least $1-\gamma/3$)
593: a hypothesis $r_1$ which has error less than $\epsilon$,
594: giving up to $\epsilon m = \sqrt{m}/2$ exceptions.
595: Denote this set of exceptions by $E_1$.
596:
597: {\it Stage 2:} Define a new distribution
598: %on the $m$ examples
599: $\mu_2(x) = \epsilon$ for each $x \in E_1$,
600: and $\mu_2(x) = (1-|E_1|/2\sqrt{m})/(m-|E_1|)$ for each $x \not\in E_1$.
601: Apply the learning algorithm.
602: It produces (with probability at least $1-\gamma/3$)
603: a hypothesis $r_2$ which is correct on all of $E_1$ and with error
604: less than $\epsilon$ on the remaining examples.
605: This gives up to $\epsilon (m-|E_1|) / (1-|E_1|/2\sqrt{m}) < \sqrt{m}$
606: exceptions. This set, denoted $E_2$, is disjoint from $E_1$.
607:
608: {\it Stage 3:} Define a new distribution on the $m$ examples
609: with $\mu(x) = 1/|E_1 \cup E_2| > \epsilon$ for each $x$ in $E_1\cup E_2$,
610: and $\mu(x) = 0$ elsewhere.
611: Apply the learning algorithm.
612: % with error bound $\epsilon_3 = 1/2\sqrt{m}$.
613: %Note that $|E_1| \leq \sqrt{m}$ and $E_2 < \sqrt{m}$ gives that for
614: %$x$ in $E_1\cup E_2$, $\mu(x) > \epsilon_3$.
615: The algorithm produces (with probability at least $1-\gamma/3$)
616: a hypothesis $r_3$ which is correct on all of $E_1$ and $E_2$.
617: %and which might be totally wrong elsewhere (we don't care).
618:
619: In total the number of examples consumed by the pac-algorithm
620: is at most $3m(n,s,\frac{1}{2\sqrt{m}},\gamma/3)$, each requiring
621: $n$ bits to describe.
622: The three representations are combined into one representation
623: by the majority-of-three algorithm $M$. This is necessarily correct on all
624: of the $m$ examples, since the three exception-sets are all disjoint.
625: Furthermore, it can be described in terms of the
626: examples fed to the deterministic pac-algorithm
627: and thus achieves compression
$f(m,n,s,\gamma) = m/(3n\,m(n,s,\frac{1}{2\sqrt{m}},\gamma/3))$.
629: This is an increasing function of $m$ given the assumed
630: subquadratic sample complexity.
631: \end{proof}
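
\noindent
{\it Illustration.} As a sketch of what Theorem~\ref{maj} yields, suppose
the sample complexity has the form
$m(n,s,\epsilon,\delta) = q(n,s,\delta)/\epsilon$, where the function $q$
is a hypothetical placeholder introduced here for illustration.
With $\epsilon = \frac{1}{2\sqrt{m}}$ each stage uses at most
$2 q(n,s,\gamma/3) \sqrt{m}$ examples, so
\[ f(m,n,s,\gamma) = \frac{m}{3n \cdot 2 q(n,s,\gamma/3) \sqrt{m}}
= \frac{\sqrt{m}}{6 n\, q(n,s,\gamma/3)} , \]
which increases with $m$.
The bound on $|E_2|$ in Stage 2 can be checked similarly:
$|E_1| \leq \sqrt{m}/2$ gives $1-|E_1|/(2\sqrt{m}) \geq \frac{3}{4}$,
and hence
\[ \frac{\epsilon (m-|E_1|)}{1-|E_1|/(2\sqrt{m})}
\leq \frac{\sqrt{m}/2}{3/4} = \frac{2}{3}\sqrt{m} < \sqrt{m} . \]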
632:
633: The following corollaries use the fact that if
634: a representation system is learnable,
635: it must have finite VC-dimension and hence,
according to (\ref{vc-sample}), it is learnable with sample
637: complexity subquadratic in $\frac{1}{\epsilon}$.
638: \begin{corollary}
639: Let a representation system ${\bf R}$ be
640: closed under either exceptions or majority-of-three, or both.
641: Then ${\bf R}$ is pac-learnable iff
642: there is an Occam algorithm for ${\bf R}$.
643: \end{corollary}
644:
645: \begin{corollary}
646: Let a representation system ${\bf R}$ be polynomially
647: closed under either exceptions or majority-of-three, or both.
648: Then ${\bf R}$ is deterministically polynomially pac-learnable iff
649: there is a polynomial time Occam algorithm for ${\bf R}$.
650: \end{corollary}
651:
652: \noindent
653: {\it Example.}
654: Consider threshold circuits,
655: acyclic circuits whose nodes compute threshold
656: functions of the form $a_1x_1 + a_2x_2 + \cdots +a_nx_n \geq \delta$,
657: $x_i \in \{0,1\}, a_i,\delta \in N$ (note that no expressive
658: power is gained by allowing rational weights and threshold).
659: A simple way of representing circuits
660: over the binary alphabet is to number each node and use
661: {\em prefix-free encodings} of these numbers. For instance, encode $i$
662: as $1^{|\mbox{bin}(i)|}0\mbox{bin}(i)$,
663: the binary representation of $i$ preceded by its length in unary.
664: A complete node encoding then consists of the encoded index, encoded
665: weights, threshold, encoded degree, and encoded indices of the nodes
666: corresponding to its inputs. A complete circuit can be encoded with
667: a node-count followed by a sequence of node-encodings.
668: For this representation, a majority-of-three
669: algorithm is easily constructed that renumbers two of its three input
670: representations, and combines the three by adding a
671: 3-input node computing the majority function
672: $x_1+x_2+x_3 \geq 2$.
It is clear that under this representation,
the system of threshold circuits
is polynomially closed under majority-of-three.
On the other hand, it is not closed under exceptions,
or under the exception lists of \cite{board}.
678:
679: \noindent
{\it Example.} Let $h_1, h_2, h_3$ be three $k$-DNF formulas.
681: Then $\maj (h_1,h_2,h_3) = (h_1 \wedge h_2) \vee (h_2 \wedge h_3) \vee
682: (h_3 \wedge h_1)$ which can be expanded into a $2k$-DNF formula.
683: This is not good enough for Theorem~\ref{maj}, but it allows us to conclude
684: that pac-learnability of $k$-DNF implies compression of $k$-DNF into
685: $2k$-DNF.
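To sketch why the expansion stays within $2k$-DNF, write
$h_1 = \bigvee_i t_i$ and $h_2 = \bigvee_j u_j$, where each term $t_i$, $u_j$
is a conjunction of at most $k$ literals (the names $t_i$, $u_j$ are
introduced here only for illustration). Distributing $\wedge$ over $\vee$
gives
\[
h_1 \wedge h_2 = \bigvee_{i,j} \, (t_i \wedge u_j) ,
\]
and each $t_i \wedge u_j$ is a conjunction of at most $2k$ literals.
The same holds for $h_2 \wedge h_3$ and $h_3 \wedge h_1$, and a disjunction
of three $2k$-DNF formulas is again a $2k$-DNF formula.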
686:
687: \section{Applications}
Our KC-based Occam's razor theorem,
Theorem~\ref{KCoccam}, gives better sample
complexity than the length-based version and is
just as easy to use, as demonstrated by the following
two examples.
696: While it is easy to construct an artificial system with
697: extremely bad representations such that our Theorem~\ref{KCoccam}
698: gives {\it arbitrarily} better sample complexity than
699: the length-based sample complexity given in
700: (\ref{length-sample}), we prefer to give natural examples.
701:
702: \noindent
703: {\bf Application 1: Learning a String.}
704:
705: The DNA sequencing process can be modeled as the problem
706: of learning a super-long string in the pac model \cite{jiang1,li}.
707: We are interested in learning a target string $t$ of length $s$,
708: say $s=3 \times 10^9$ (length of a human DNA sequence).
709: At each
710: step, we can obtain as an example a substring of this sequence
711: of length $n$, from a random location of $t$ (Sanger's Procedure).
712: At the time of writing, $n \approx 500$, and
713: sampling is very expensive.
714: Formally, the concepts we are learning are sets of possible length $n$
715: substrings of a superstring, and these are naturally
716: represented by the superstrings. We assume a minimal target representation
717: (which may not hold in practice).
718: Suppose we obtain a
719: sample of $m$ substrings (all positive examples). In biological
720: labs, a Greedy algorithm which repeatedly merges a pair of substrings
721: with maximum overlap is routinely used. It is conjectured
722: that Greedy produces a common superstring $t'$ of length at most $2s$,
723: where $s$ is the optimal length (NP-hard to find). In \cite{blum},
724: we have shown that $s \leq |t'| \leq 4s$.
725: Assume that $|t'| \approx 2s$.\footnote{Although only the
726: $4s$ upper bound was proved in \cite{blum}, which has since been improved,
727: it is widely believed
728: that $2s$ is the true bound.}
729: Using the length-based Occam's razor theorem, that is, Theorem~\ref{sex}
730: with $K(r' \mid r,s,n)$ in Definition~\ref{def.kcoccam} replaced
731: by $|r'|$,
732: this length of $2s$ would determine the sample complexity,
733: as in (\ref{total-compression}), with
734: %\begin{equation}\label{length-base}
735: $p(n,s,\delta/2)= 2 \cdot 2s$
736: (the extra factor 2 is the 2-logarithm of the size of the alphabet
737: $\{A,C,G,T\}$).
738: %m_{\rm len} \geq 2cs/\epsilon ,
739: %\end{equation}
740: %by Equation~\ref{length-simple}, where $c$ is the constant represented
741: %by the big-$O$ in \ref{length-simple}.
Is this the best we can do?
Since sampling in DNA sequencing is slow and costly,
any reduction in the number of samples is valuable.
We improve the sample complexity using our KC-based Occam's razor
theorem.
747: %Theorem~\ref{KCoccam}.
748:
749: \begin{lemma}
750: Let $t$ be the target string of length $s$ and $t'$ be the
751: superstring returned by Greedy of length at most $2s$. Then
752: \[
753: K(t' \mid t,s,n ) \leq 2s (2\log s + \log n) / n .
754: \]
755: \end{lemma}
756: \begin{proof}
757: We give $t'$ a short description using some information
758: from $t$. Let $S = \{ s_1 , \ldots , s_m \}$ be the set of
759: $m$ examples (substrings of $t$ of length $n$).
760: Align these substrings with the common superstring $t'$, from
761: left to right. Divide them into groups such that each group's
762: leftmost string overlaps with every string in the group but
763: does not overlap with the leftmost string of the previous group.
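The leftmost strings of the groups therefore occupy pairwise disjoint
length-$n$ segments of $t'$, and since $|t'| \leq 2s$
there are at most $2s/n$ such groups.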
765: To specify $t'$, we only need to specify these $2s/n$ groups.
766: After we obtain the superstring for each group, we re-construct $t'$
767: by optimally merging the superstrings of neighboring groups.
768: To specify each group, we only need to specify the first and the last
769: string of the group and how they are merged. This is because every
770: other string in the group is a substring of the string obtained by
771: properly merging the first and last strings. Specifying the first and
772: the last strings requires $2 \log s$ bits of information
773: to indicate their locations in $t$ and we need another
774: $\log n$ bits to indicate how they are merged.
775: Thus $K(t'\mid t,s,n) \leq 2s (2 \log s + \log n) / n$.
776: \end{proof}
777:
778: This lemma shows that (\ref{total-compression}) can also be
779: applied with
780: $p(n,s,\delta/2)= 2\cdot 2s (2 \log s + \log n) / n$, giving a factor
781: $n / (2\log s + \log n)$ improvement in sample-complexity.
782: %By Theorem~\ref{KCoccam}, the sample complexity is improved to
783: %\begin{equation}\label{kc-base}
784: %m_{\rm KC} = \frac{ 2cs (2\log s + \log n) }{n \epsilon } .
785: %\end{equation}
786: %Thus, combining Equations~\ref{length-base} and \ref{kc-base}, we have,
787: %\[
788: %m_{\rm KC} \leq m_{\rm len} \frac{2\log s + \log n}{n}.
789: %\]
790: Note that in (mammal) genome computation practice,
791: we have $n=500$ and $s=3 \times 10^9$.
792: The sample complexity using the Kolmogorov complexity-based
793: Occam's razor is reduced over the ``length based''
794: Occam's razor by a multiplicative factor of
795: $n / (2\log s + \log n) \approx \frac{500}{2 \times 31 + 9} \approx 7$.
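Spelling out this arithmetic, as a rough check with logarithms taken
base $2$ and rounded: $\log s \approx 31.5$ and $\log n \approx 9.0$, so
$2\log s + \log n \approx 72$, and
\[
2 \cdot \frac{2s\,(2\log s + \log n)}{n}
\approx 2 \cdot 1.2 \times 10^{7} \cdot 72
\approx 1.7 \times 10^{9} ,
\qquad
2 \cdot 2s = 1.2 \times 10^{10} ,
\]
a ratio of about $7$, in agreement with the factor above.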
796:
797: \noindent
798: {\bf Application 2: Learning a Monomial.}
799:
Consider the Boolean space $\{0,1\}^n$. There are two well-known algorithms
for learning monomials. One is the standard algorithm.
802:
803: \noindent
804: {\bf Standard Algorithm.}
805: \begin{enumerate}
806: \item
807: Initially set the concept representation
808: $M := x_1 \overline{x_1} \ldots x_n \overline{x_n}$
809: (a conjunction of all literals of $n$
810: variables---which contradicts every example).
811: \item
812: For each positive example, delete from the current $M$ the literals that
813: contradict the example.
814: \item
815: Return the resulting monomial $M$.
816: \end{enumerate}
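\noindent
As a small worked trace (the instance is made up for illustration), take
$n=3$, target monomial $x_1 \overline{x_2}$, and positive examples
$101$ and $100$:
\begin{eqnarray*}
\mbox{initially:} & & M = x_1 \overline{x_1} x_2 \overline{x_2} x_3 \overline{x_3} \\
\mbox{after } 101: & & M = x_1 \overline{x_2} x_3 \\
\mbox{after } 100: & & M = x_1 \overline{x_2}
\end{eqnarray*}
The example $101$ deletes $\overline{x_1}$, $x_2$ and $\overline{x_3}$;
the example $100$ then deletes $x_3$. At every stage $M$ still contains
all literals of the target.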
817:
818: Haussler \cite{hauss} proposed a more sophisticated algorithm based
819: on set-cover approximation as follows.
820: Let $k$ be the number of variables in the target monomial, and $m$
821: be the number of examples used.
822:
823: \noindent
824: {\bf Haussler's Algorithm.}
825: \begin{enumerate}
826: \item
827: Use only negative examples.
828: For each literal $x$, define $S_x$ to be the set of negative examples
829: such that $x$ falsifies these negative examples.
830: The sets associated with the literals in the target monomial form a
831: %minimum
832: % uhm, doesn't have to be minimal, e.g. xyz with examples 001,010,100
833: % gives 3 sets each falsifying 2 negative examples. -John
834: set cover of negative examples.
835: \item
Run the approximation algorithm for set cover; this will use at most
$k \log m$ sets or, equivalently, literals in our approximating
monomial.
839: \end{enumerate}
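\noindent
For a small illustration (a made-up instance), let the target be
$x_1 x_2 x_3$ with negative examples $001$, $010$, $100$. Then
\[
S_{x_1} = \{001, 010\}, \quad
S_{x_2} = \{001, 100\}, \quad
S_{x_3} = \{010, 100\},
\]
while $S_{\overline{x_1}} = \{100\}$, $S_{\overline{x_2}} = \{010\}$ and
$S_{\overline{x_3}} = \{001\}$.
The three sets of the target literals cover all negative examples, but
so do any two of them; for instance $\{S_{x_1}, S_{x_2}\}$ yields the
monomial $x_1 x_2$, which is consistent with all three negative examples.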
840:
It is commonly believed that Haussler's algorithm
has better sample complexity than the standard
algorithm.\footnote{In fact, Haussler's algorithm is specifically aimed
at reducing sample complexity for small target monomials, and it
does so.
}
847: We demonstrate that the opposite is sometimes true (in fact for
848: most cases), using our KC-based Occam's razor theorem,
849: Theorem~\ref{KCoccam}. Assume that our target monomial $M$ is of
850: length $n - \sqrt{n}$. Then the length-based Occam's razor theorem
851: gives sample complexity $n/\epsilon$ for both algorithms, by
(\ref{total-compression}). However,
853: $K(M' \mid M)\leq \sqrt{n}\log 3+O(1)$,
854: where $M'$ is the monomial returned by the standard algorithm. This is
855: true since the standard algorithm always produces a monomial
856: $M'$ that contains {\em all} literals of the target monomial $M$, and
for each of the $\sqrt{n}$ variables that do not occur in $M$
we need only specify whether it occurs in $M'$ positively,
negatively, or not at all, which takes at most
$\sqrt{n} \log 3 + O(1)$ bits in total.
Thus (\ref{total-compression}) gives
a sample complexity of $O(\sqrt{n}/\epsilon)$.
863: In fact, as long as $|M| > n/\log n$ (which is most likely
864: to be the case if every monomial has equal probability),
865: it makes sense to use the standard algorithm.
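For a rough sense of scale (the value $n=10^4$ is chosen only for
illustration, and the $O(1)$ term and the constants in
(\ref{total-compression}) are ignored), we have $\sqrt{n}=100$ and
$\log 3 \approx 1.585$, so
\[
\sqrt{n}\,\log 3 \approx 158.5
\qquad \mbox{versus} \qquad
n = 10^{4} ,
\]
a reduction by a factor of $\sqrt{n}/\log 3 \approx 63$.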
866:
867: \section{Conclusions}
868:
869: Several new problems are suggested by this work.
870: If we have an algorithm that, given a length-$m$ sample of a concept
871: in Euclidean space, produces a consistent hypothesis that can be described
with only $m^\alpha, \alpha<1$ symbols (including a symbol for every real
number; we are using an uncountable representation alphabet), then it seems
874: intuitively appealing that this implies some form of learning.
875: However, as noted in \cite{board},
876: the standard proof of Occam's Razor does not apply, since we cannot
877: enumerate these representations. The main open question is under
878: what conditions (specifically on the real number computation
879: model) such an implication would nevertheless hold.
880:
Can we replace the exception or majority-of-three requirement
by some weaker requirement? Or can we even eliminate such
closure requirements and obtain a complete converse of
the Occam's razor theorem?
Our current requirements do not even cover
$k$-DNF and some other reasonable representation systems.
887:
888: \section{Acknowledgements}
889: We wish to thank Tao Jiang for many stimulating discussions.
890:
891: \begin{thebibliography}{99}
892: \setlength{\baselineskip}{0.6\baselineskip}
893:
894: \bibitem{anthony}
895: M. Anthony and N. Biggs,
896: {\it Computational Learning Theory}, Cambridge University Press, 1992.
897: \bibitem{blum}
898: A. Blum, T. Jiang, M. Li, J. Tromp, M. Yannakakis,
899: Linear approximation of shortest common superstrings.
900: {\it Journal ACM}, 41:4 (1994), 630-647.
901: \bibitem{blumer}
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth,
Learnability and the Vapnik-Chervonenkis Dimension.
{\it J. Assoc. Comput. Mach.}, 36(1989), 929-965.
905: \bibitem{blumer1}
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth,
907: Occam's Razor.
908: {\it Inform.\ Process.\ Lett.}, 24(1987), 377-380.
909: \bibitem{board}
910: R. Board and L. Pitt,
911: On the necessity of Occam Algorithms.
912: 1990 {\it STOC}, pp. 54-63.
913: \bibitem{ehren}
914: A. Ehrenfeucht, D. Haussler, M. Kearns, L. Valiant.
915: A general lower bound on the number of examples needed for
916: learning. {\it Inform.\ Computation}, 82(1989), 247-261.
917: \bibitem{hauss}
918: D. Haussler.
919: Quantifying inductive bias: AI learning algorithms and
920: Valiant's learning framework. {\it Artificial Intelligence},
921: 36:2(1988), 177-222.
922: \bibitem{haussler}
D. Haussler, N. Littlestone, and M. Warmuth.
924: Predicting $\{0,1\}$-functions on randomly drawn points.
925: {\em Information and Computation}, 115:2(1994),
926: 248--292.
927: \bibitem{HeWa95}
928: D.P. Helmbold and M.K. Warmuth,
On weak learning, {\em J. Comput. Syst. Sci.}, 50:3(1995), 551-573.
930: \bibitem{jiang1}
931: T. Jiang and M. Li,
932: DNA sequencing and string learning,
933: {\em Math. Syst. Theory}, 29(1996), 387-405.
934: %\bibitem{kearns2}
935: % M. Kearns and M. Li.
936: % Learning in the Presence of Malicious Errors.
937: % {\it SIAM J. Comput.}, 22:4(1993), 807-837.
938: \bibitem{li}
939: M. Li. Towards a DNA sequencing theory.
940: {\it 31st IEEE Symp. on Foundations of Comp. Sci.}, 125-134, 1990.
941: \bibitem{lv}
942: M. Li and P. Vit\'anyi. {\it An Introduction to
943: Kolmogorov Complexity and Its Applications}.
944: 2nd Edition,
945: Springer-Verlag, 1997.
946: \bibitem{Sch90}
947: R. E. Schapire.
948: The strength of weak learnability.
{\it Machine Learning}, 5:2(1990), 197--227.
950: \bibitem{Sch02}
951: R.E. Schapire,
952: The boosting approach to machine learning: An overview.
953: In: {\em MSRI Workshop on Nonlinear Estimation and Classification}, 2002.
954: \bibitem{val}
955: L. G. Valiant.
956: A Theory of the Learnable.
957: {\it Comm. ACM}, 27(11), 1134-1142, 1984.
958: \bibitem{manfred}
959: M.K. Warmuth.
960: Towards representation independence in PAC-learning.
961: In {\it AII-89}, pp. 78-103, 1989.
962: \end{thebibliography}
963:
964: \end{document}
965:
966: