
1: \documentclass{elsart}

2:

3:

4: \usepackage{amssymb}

5: \newcommand{\maj}{\mbox{MAJ}}

6: \newtheorem{theo}{\bf Theorem}

7: \newtheorem{theorem}{\bf Theorem}

8: \newtheorem{lemma}{\bf Lemma}

9: \newtheorem{corollary}{\bf Corollary}

10: \newtheorem{notation}{\bf Notation}

11: \newtheorem{definition}{\bf Definition}

12: %\newtheorem{claim}{\bf Claim}

13: \newtheorem{remark}{\bf Remark}

14: \newenvironment{comment}{\begin{small} \begin{quotation}}{\end{quotation} \end{small}}

15: \newenvironment{proof}{\par \bf Proof. \rm}{$\Box$ \vspace{1ex}}

16:

17:

18: \begin{document}

19: \date{}

20:

21: \begin{frontmatter}

22:

23:

24: \title{Sharpening Occam's Razor%\thanksref{title}}

25: %\thanks[title]{A preliminary version was presented at

26: %the {\em 8th Intn'l Computing and Combinatorics Conference

27: %(COCOON)}, held in Singapore, August, 2002.

28: }

29:

30: \author{Ming Li, %\thanksref{ming}

31: }

32: %\address{Department of Computer Science, University of Waterloo, %Univ. California

33: %Santa Barbara, CA 93106, USA,

34: %E-mail: mli@cs.ucsb.edu

35: %}

36: \author{John Tromp, %\thanksref{tromp}

37: }

38: %\address{CWI, %Kruislaan 413, 1098 SJ Amsterdam, The Netherlands;

39: %Email: tromp@cwi.nl}

40: \author{Paul Vit\'{a}nyi%\thanksref{vitanyi}

41: }

42: %\address{CWI,% Kruislaan 413, 1098 SJ Amsterdam, The Netherlands;

43: %Email: paulv@cwi.nl}

44:

45: %\thanks[ming]{

46: %Supported in part by

47: %the NSERC Operating Grant OGP0046506, ITRC, and

48: %NSF-ITR Grant 0085801 at UCSB.

49: %}

50: %\thanks[tromp]{

51: %Partially supported by an

52: %NSERC International Fellowship and ITRC.

53: %}

54: %\thanks[vitanyi]{

55: %Affiliated with CWI and the University of Amsterdam.

56: %Supported in part by the

57: %EU fifth framework project QAIP, IST--1999--11234, the NoE QUIPROCONE IST--1999--29064,

58: %the ESF QiT Programmme, and the EU Fourth Framework BRA

59:  %NeuroCOLT II Working Group

60: %EP 27150.

61: %}

62:

63: %\renewcommand{\baselinestretch}{1.2}

64: %\setlength{\topmargin}{-0.2in}

65: \setlength{\textwidth}{6in}

66: \setlength{\oddsidemargin}{0.0in}

67: \setlength{\evensidemargin}{0.0in}

68: \setlength{\textheight}{8in}

69: %\setlength{\footskip}{0.5in}

70: %\setlength{\parskip}{6 pt plus 2pt minus 1pt}

71:

72: %\newtheorem{theorem}{\sc Theorem}

73: %\newtheorem{lemma}{\sc Lemma}

74: %\newtheorem{coro}{\sc Corollary}

75: %\newtheorem{nota}{\sc Notation}

76: %\newtheorem{defin}{\sc Definition}

77: %\newtheorem{cla}{\sc Claim}

78: %\newtheorem{ex}{\sc Example}

79: %\newenvironment{proof}{\par \sc Proof.\rm}{\hspace*{\fill}$\qed$\vspace{1ex}}

80: %\newenvironment{example}{\begin{ex}}{\hspace*{\fill}$\Diamond$\end{ex}}

81: %\newenvironment{claim}{\begin{cla}}{\end{cla}}

82: %\newenvironment{corollary}{\begin{coro}}{\end{coro}}

83: %\newenvironment{definition}{\begin{defin}}{\end{defin}}

84: %\newenvironment{notation}{\begin{nota}}{\end{nota}}

85: %\newenvironment{comment}{\begin{small}\begin{quotation}\hspace{-0.23in}\rm}{\end{quotation}\end{small}}

86: %

87:

88: %\pagestyle{empty}

89:

90: %\normalsize

91: \begin{abstract}

92: We provide a new representation-independent

93: formulation of Occam's razor theorem, based on

94: Kolmogorov complexity. This new formulation allows us to:

95: (i) Obtain better sample complexity than both length-based \cite{blumer1}

96: and VC-based \cite{blumer} versions of Occam's razor theorem, in many

97: applications; and (ii)

98: Achieve a sharper reverse of Occam's razor theorem than that of

99: \cite{board}. Specifically, we weaken the assumptions

100: made in \cite{board} and extend the reverse to superpolynomial

101: running times.

102: \end{abstract}

103: \begin{keyword}

104: Analysis of algorithms \sep

105: pac-learning \sep Kolmogorov complexity \sep Occam's razor-style theorems

106: \end{keyword}

107:

108: \end{frontmatter}

109:

110: %\newcommand{\proof}{{\bf Proof. \enspace}}

111: %\newtheorem{theorem}{Theorem}

112: %\newtheorem{lemma}[theorem]{Lemma}

113: %\newtheorem{corollary}[theorem]{Corollary}

114: %\newtheorem{definition}{Definition}

115: %\newtheorem{claim}[theorem]{Claim}

116: %\newtheorem{conjecture}[theorem]{Conjecture}

117:

118: \section{Introduction} \label{introsec}

119: Occam's razor theorem as formulated

120: by \cite{blumer,blumer1} is arguably the substance of efficient pac learning.

121: Roughly speaking, it says that in order to (pac-)learn, it suffices to compress.

122: A partial reverse, showing the necessity of compression,

123: has been proved by Board and Pitt \cite{board}.

124: %%added [Paul]

125: Since the theorem is about the relation between effective

126: compression and pac learning, it is natural to assume that

127: a sharper version ensues by couching it in terms

128: of the {\em ultimate} limit to effective compression which is

129: the Kolmogorov complexity. We present results in that direction.

130: %%end addition [Paul]

131:

132: Despite abundant research generated by its importance,

133: several aspects of Occam's razor

134: theorem remain unclear. There are basically two versions.

135: The {\em VC dimension-based version} of Occam's razor theorem

136: (Theorem 3.1.1 of \cite{blumer})

137: gives the following upper bound on sample complexity:

138: For a hypothesis

139: space $H$ with $VCdim(H)=d$, $1 \leq d < \infty$,

140: \begin{equation}\label{vc-sample}

141: m(H,\delta , \epsilon ) \leq  \frac{4}{\epsilon}

142: (d \log \frac{12}{\epsilon} + \log \frac{2}{\delta} ).

143: \end{equation}

144: The following lower bound was proved by Ehrenfeucht {\it et al.}\ \cite{ehren}:

145: \begin{equation}\label{vc-lowerbound}

146: m(H,\delta , \epsilon ) > \max (\frac{d-1}{32 \epsilon},

147: \frac{1}{\epsilon} \ln \frac{1}{\delta} ).

148: \end{equation}

149: The upper bound in (\ref{vc-sample}) and the lower

150: bound in (\ref{vc-lowerbound}) differ by a factor

151: $\Theta (\log \frac{1}{\epsilon} )$. It was shown in

152: \cite{haussler} that this factor is, in a sense, unavoidable.

153:

154: When $H$ is finite, one can directly obtain the following bound

155: on sample complexity for a consistent algorithm:

156: \begin{equation}\label{direct-sample}

157: m(H,\delta , \epsilon ) \leq \frac{1}{\epsilon} \ln \frac{|H|}{\delta}.

158: \end{equation}

159: For a graded boolean space $H_n$, we have the

160: following relationship between

161: the VC dimension $d$ of $H_n$ and the cardinality of $H_n$,

162: \begin{equation}

163: d \leq \log |H_n | \leq nd.

164: \end{equation}

165:

166: When $\log |H_n|=O(d)$ holds, then the sample complexity upper bound

167: given by (\ref{direct-sample}) can be seen to equal

168: $\frac{1}{\epsilon} (O(d)+\ln \frac{1}{\delta})$ which matches the lower bound

169: of (\ref{vc-lowerbound}) up to a constant factor,

170: and thus every consistent

171: algorithm achieves optimal sample complexity for such hypothesis spaces.

172:
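\noindent
{\it Illustration.}
To get a feel for the size of the bound (\ref{direct-sample}), assume,
purely for illustration, that $H$ is the class of monomials over $n$
Boolean variables, where each variable occurs positively, occurs negated,
or is absent, so that $|H| = 3^n$. Under this assumption
(\ref{direct-sample}) reads
\[ m(H,\delta,\epsilon) \leq \frac{1}{\epsilon}
\left( n \ln 3 + \ln \frac{1}{\delta} \right) . \]
With the hypothetical values $n=20$, $\epsilon=0.1$ and $\delta=0.05$
this gives
$\frac{1}{0.1} (20 \ln 3 + \ln 20) \approx 10\,(21.97 + 3.00) \approx 249.7$,
so about $250$ examples suffice for a consistent algorithm.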

173: The {\em length-based version} of Occam's razor theorem then

174: gives the following sample complexity $m$ to guarantee that

175: the algorithm pac-learns:

176: For given $\epsilon$ and $\delta$:

177: \begin{equation}\label{length-sample}

178: m = \max (\frac{2}{\epsilon} \ln \frac{1}{\delta} ,

179: (\frac{(2\ln 2)s^{\beta}}{\epsilon} )^{1/(1-\alpha)} ) .

180: \end{equation}

181: This bound is based on  the {\em length-based}

182: Occam algorithm \cite{blumer}:

183: A {\em deterministic} algorithm that returns a consistent hypothesis of

184: length at most $m^\alpha s^\beta$, where $\alpha < 1$ and $s$ is the length

185: of the target concept.

186:
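\noindent
{\it Illustration.}
The role of the exponent $\alpha$ in (\ref{length-sample}) can be seen
by substituting two sample values (chosen here only as an illustration).
For $\alpha = 0$ the hypothesis length $s^{\beta}$ does not grow with the
sample, and the bound becomes
\[ m = \max \left( \frac{2}{\epsilon} \ln \frac{1}{\delta} ,
\frac{(2 \ln 2) s^{\beta}}{\epsilon} \right) , \]
which is linear in $1/\epsilon$. For $\alpha = \frac{1}{2}$ the exponent
$1/(1-\alpha)$ equals $2$, and the second term becomes
\[ \left( \frac{(2 \ln 2) s^{\beta}}{\epsilon} \right)^{2} , \]
which is quadratic in both $s^{\beta}$ and $1/\epsilon$. As $\alpha$
approaches $1$ the exponent $1/(1-\alpha)$ grows without bound.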

187: %In case of total example compression,  when $\alpha=0$,

188: %this is competitive with \ref{direct-sample}, and

189: %(when $s \geq \log \frac{1}{\delta}$) Equation~\ref{length-sample} becomes

190: %\begin{equation}\label{length-simple}

191: %m = O( \frac{s^{\alpha}}{\epsilon} ).

192: %\end{equation}

193: %Here, if we replace $H$ in Equation~\ref{direct-sample}

194: %by the smallest $H'$ containing the learned hypothesis (of length

195: %$n^\alpha$ in Equation~\ref{length-simple}), then the sample complexities

196: %in Equations~\ref{direct-sample} and \ref{length-simple} are

197: %approximately the same, as it should be. Since the formula in

198: %Equation~\ref{direct-sample} is not easy to use.

199: %In practice,

200: %it is the Formula~\ref{length-sample}, or more often

201: %Formula~\ref{length-simple}, that are used.

202:

203: In summary, the VC dimension based Occam's razor theorem

204: may be hard to use and it sometimes does not give the best sample

205: complexity. The length-based Occam's razor is more convenient

206: to use and often gives better sample complexity in the discrete case.

207:

208: %%added rephrased paragraph[Paul]

209: However, as we demonstrate below, the fact that the length-based

210: Occam's razor theorem sometimes gives inferior sample

211: complexity, can be due to the redundant representation format of the concept.

212: %%end rephrased paragraph [Paul]

213: We believe Occam's razor theorem should be

214: ``representation-independent''. That is, it should not be dependent

215: on  accidents of ``representation format''.  (See \cite{manfred} for

216: other representation-independence issues.) In fact, the sample

217: complexities given in (\ref{vc-sample}) and (\ref{vc-lowerbound})

218: are indeed representation-independent. However they are not

219: easy to use and do not give optimal sample complexity.

220: Here, we give a Kolmogorov complexity based Occam's razor

221: theorem. We will demonstrate that our KC-based Occam's razor theorem

222: is convenient to use (as convenient as the length based

223: version), gives a better sample complexity than the

224: length based version, and is representation-independent.

225: In fact, the length based version

226: can be considered as a specific computable approximation

227: to the KC-based Occam's razor.

228:

229: As one of the examples, we will demonstrate that the standard trivial learning

230: algorithm for monomials actually often

231: has a {\it better sample complexity}

232: than the more sophisticated Haussler's greedy algorithm \cite{hauss}.

233: This is

234: contrary to the common, but mistaken,

235: belief that Haussler's algorithm is better

236: in all cases (to be sure, Haussler's method is superior

237: for target monomials of small length).

238: Another issue related to Occam's razor theorem is the

239: status of the reverse assertion.

240: Although a partial reverse of Occam's razor theorem has

241: been proved by \cite{board}, it applied only to the case of

242: polynomial running time and sample complexity.

243: They also required a property

244: of closure under exception list. This latter requirement, although

245: quite general, excludes some reasonable concept classes. Our new

246: formulation of Occam's razor theorem allows us to

247: prove a more general reverse of Occam's razor

248: theorem, allowing arbitrary running time and weakening

249: the requirement of exception list of \cite{board}.

250:

251: \footnote{A preliminary version was presented at

252: the {\em 8th Intn'l Computing and Combinatorics Conference

253: (COCOON)}, held in Singapore, August, 2002.

254: }

255:

256: {\bf Discussion of Result and Technique:}

257: In our approach we obtain better

258: bounds on the sample complexity to learn the representation of

259: a target concept in the given representation system.

260: These bounds, however, are representation-independent

261: and depend only on the Kolmogorov complexity of the target concept.

262: If we don't care about the representation of the hypothesis

263: (but that is not the case in this paper) then better ``iff Occam style''

264: characterizations of polynomial time learnability/predictability

265: can be given. They rely

266: on Schapire's result that ``weak learnability''

267: equals ``strong learnability'' in polynomial time \cite{Sch90}

268: exploited in \cite{HeWa95}. For a recent survey of

269: the important related ``boosting'' technique see \cite{Sch02}.

270:

271: The use of Kolmogorov complexity is to obtain a bound on the

272: size of the hypothesis class for a fixed (but arbitrary)

273: target concept.

274: Obviously, the results described

275: can be obtained using other proof methods---all true provable statements

276: must be provable from the axioms of mathematics by the inference methods

277: of mathematics. The question is whether

278: a particular proof method facilitates and guides the proving effort.

279: The message we want to convey is that thinking in terms of coding

280: and incompressibility suggests improvements to long-standing results.

281:  A survey of the use of the Kolmogorov complexity

282: method in combinatorics, computational complexity, and

283: the analysis of algorithms is \cite{lv} Chapter 6.

284:

285:

286: \section{Occam's Razor}

287: Let us assume the usual definitions, say Anthony and Biggs \cite{anthony},

288: and notation of \cite{board}. For

289: Kolmogorov complexity we assume the basics of \cite{lv}.

290:

291: In the following $\Sigma, \Gamma$ are finite {\em alphabets}: We

292: consider only discrete learning problems in this paper.

293: The set of finite strings over $\Sigma$ is denoted by $\Sigma^*$

294: and similarly for $\Gamma$.

295: An element of $\Sigma^*$ is an {\em example}, and a {\em concept}

296: is a set of examples (a language over $\Sigma$).

297: A {\em representation} is an element of $\Gamma^*$.

298:

299:

300: \begin{definition}

301: A {\em representation system} is a tuple $(R,\Gamma , c , \Sigma )$, where

302: $R \subset \Gamma^*$ is the set of representations, and

303: $c:R \rightarrow 2^{\Sigma^*}$ maps representations to concepts, the latter

304: being languages over $\Sigma$.

305: \end{definition}

306:

307: Hence, given $R$ the mapping $c$ determines a {\em concept class}.

308: For example, let $\Gamma$ be the alphabet to express Boolean formulas,

309: $\Sigma = \{0,1\}$, and let

310: $R$ be the subset of disjunctive normal form (DNF) formulas.

311: Let $c$ map each element $r \in R$, say a DNF formula over $n$

312: variables, to $c(r) \subseteq \{0,1\}^n$ such that every example

313: $e \in c(r)$ viewed as truth-value assignment makes $r$ ``true''.

314: That is, if $e=e_1 \ldots e_n$ and we assign ``true'' or ``false''

315: to the $i$th variable in $r$ according to whether $e_i$ equals ``1''

316: or ``0'' then $r$ becomes ``true''. Each concept in the thus defined

317: concept class is the set of truth assignments that make a particular

318: DNF formula ``true''.

319:
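\noindent
{\it Illustration.}
As a small instance of this representation system (the formula is chosen
here only for illustration), take $n=3$ and the DNF formula
$r = (x_1 \wedge \neg x_2) \vee x_3$. The term $x_1 \wedge \neg x_2$ is
satisfied by the assignments $100$ and $101$, and the term $x_3$ by
$001, 011, 101, 111$. Hence
\[ c(r) = \{ 001, 011, 100, 101, 111 \} \subseteq \{0,1\}^3 , \]
while the remaining assignments $000$, $010$ and $110$ are the negative
examples of the concept $c(r)$.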

320: \begin{definition}

321: A {\em pac-algorithm} for a representation system

322: ${\bf R} = (R,\Gamma , c , \Sigma )$ is a randomized algorithm $L$

323: such that, for every $s,n\geq 1,\epsilon>0,\delta>0,r \in R^{\leq s}$,

324: and every probability distribution $D$ on $\Sigma^{\leq n}$,

325: if $L$ is given $s,n,\epsilon,\delta$ as input

326: and has access to an oracle providing examples of $c(r)$ (the concept

327: represented by $r$) according to $D$,

328: then $L$,

329: with probability at least $1-\delta$, outputs a representation $r' \in R$

330: approximating the target $r$ in the sense that

331: $D(c(r')\Delta c(r)) \leq \epsilon$.

332: Here, $\Delta$ denotes the symmetric set difference.

333: \end{definition}

334: The acronym ``pac'' coined by Dana Angluin

335: stands for ``probably approximately correct'' which aptly captures the

336: requirement the output representation must satisfy according to the definition.

337: The question of interest in pac-learning is how many examples

338: (and how much running time) a learning algorithm needs to qualify as a pac-algorithm.

339: The {\em running time} and number of examples ({\em sample complexity})

340: of the pac-algorithm are

341: expressed as functions $t(n,s,\epsilon,\delta)$ and

342: $m(n,s,\epsilon,\delta)$. The following definition generalizes the

343: notion of Occam algorithm in \cite{blumer}:

344:

345: \begin{definition}

346: \label{def.kcoccam}

347: An {\em Occam-algorithm} for a representation system

348: ${\bf R} = (R,\Gamma , c , \Sigma )$ is a randomized algorithm which

349:  for every $s,n\geq 1,  \gamma >0$,

350: on input of a sample consisting of

351: $m$ examples of a fixed target $r\in R^{\leq s}$,

352: with probability at least $1-\gamma$ outputs a representation $r' \in R$

353: consistent with the sample, such that $K(r' \mid r,n,s) < m/f(m,n,s,\gamma)$,

354: with $f(m,n,s,\gamma)$, the compression achieved, being an increasing

355: function of $m$.

356: \end{definition}

357: The {\em length-based version} of (possibly randomized)

358:  Occam algorithm can be obtained

359: by replacing $K(r' \mid r,n,s)$ by $|r'|$ in this definition.

360: The {\em running time} of the Occam-algorithm is expressed as a function

361: $t(m,n,s,\gamma)$, where $n$ is the maximum length of the input examples.

362:

363: \begin{remark}\label{rem.kco}

364: \rm

365: An Occam algorithm satisfying a given $f$,

366: achieves a lower bound on the number $m$ of examples required

367: in terms of $K(r' \mid r,n,s)$, the Kolmogorov complexity of

368: the outputted representation conditioned on the target representation,

369: rather than the (maximal) length $s$ of $r$ as in the original Occam

370: algorithm \cite{blumer} and the length-based version above.

371: This improvement enables one to use information drawn from the hidden

372: target for reduction of the Kolmogorov complexity of the output representation,

373: and hence further reduction of the required sample complexity.

374: \end{remark}

375: We need to show that the main properties

376: of an Occam algorithm are preserved under this generalization.

377: Our first theorem is a Kolmogorov complexity based Occam's Razor.

378: We denote the minimum $m$ such that $f(m,n,s,\gamma) \geq x$ by

379: $f^{-1}(x,n,s,\gamma)$, where we set $f^{-1}(x,n,s,\gamma)=\infty$

380: if $f(m,n,s,\gamma) < x$

381: for every $m$.

382:

383: \begin{theorem}

384: \label{KCoccam}

385: Suppose we have an Occam-algorithm for

386: ${\bf R} = (R,\Gamma , c , \Sigma )$ with compression $f(m,n,s,\gamma)$.

387: %that, for $r \in R^{\leq s}$, produces representations satisfying

388: %\[ K(r'|r,n,s) \leq m/

389: %Write $f$ as $f(m,\gamma)$ with the other parameters implicit.

390: Then there is a pac-learning algorithm

391: for {\bf R} with sample complexity

392: \[ m(n,s,\epsilon,\delta) =

393:    \max \left\{\frac{2}{\epsilon}\ln \frac{2}{\delta},

394:         f^{-1}(\frac{2\ln 2}{\epsilon},n,s,\delta/2) \right\}, \]

395: and running time $t_{\mbox{pac}}(n,s,\epsilon,\delta) =

396: t_{\mbox{occam}}(m(n,s,\epsilon,\delta),n,s,\delta/2)$.

397: \end{theorem}

398:

399: \begin{proof}

400: On input of $\epsilon,\delta,s,n$, the learning algorithm will take a sample

401: of length $m=m(n,s,\epsilon,\delta)$ from the oracle, then

402: use the Occam algorithm with $\gamma=\delta/2$ to find a hypothesis

403: (with probability at least $1-\delta/2$) consistent with the sample and

404: with low Kolmogorov complexity.

405: In the proof we abbreviate $f(m,n,s,\gamma)$ to $f(m)$

406: with the other parameters implicit.

407: Learnability follows in the standard manner from bounding

408: (by the remaining $\delta/2$) the probability

409: that all $m$ examples of the target concept

410: fall outside the, probability $\epsilon$ or greater,

411: symmetric difference with a bad hypothesis.

412: Let $m =  m(n,s,\epsilon,\delta)$. Then

413: $m \geq f^{-1} (\frac{2 \ln 2}{\epsilon} ,n,s, \frac{\delta}{2})$ gives

414: \[ \epsilon - \frac{\ln 2}{f(m)} \geq \frac{\epsilon}{2}, \]

415: and therefore $m \geq \frac{2}{\epsilon}\ln \frac{2}{\delta}$ gives

416: \[ m(\epsilon - \frac{\ln 2}{f(m)} ) \geq \ln \frac{2}{\delta} .\]

417: This implies (taking the exponent on both sides and

418: using $1-\epsilon<e^{-\epsilon }$)

419: \[ 2^{m/f(m)}(1-\epsilon)^{m} \leq \delta/2 .\]

420: The probability that some concept the Occam-algorithm can output

421: has all $m$ examples being bad is at most

422: the number of concepts of complexity less than $m/f(m)$, times

423: $(1-\epsilon)^m$, which by the above is at most $\delta/2$.

424: \end{proof}

425:
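\noindent
{\it Illustration.}
The chain of inequalities in the proof of Theorem~\ref{KCoccam} can be
spelled out as follows. This is only a sketch, writing $f(m)$ as in the
proof and using that $f$ is increasing in $m$.
Since $m \geq f^{-1}(\frac{2 \ln 2}{\epsilon},n,s,\frac{\delta}{2})$,
we have $f(m) \geq \frac{2 \ln 2}{\epsilon}$, that is,
\[ \frac{\ln 2}{f(m)} \leq \frac{\epsilon}{2}
\quad \mbox{and hence} \quad
\epsilon - \frac{\ln 2}{f(m)} \geq \frac{\epsilon}{2} . \]
Multiplying by $m$ and using $m \geq \frac{2}{\epsilon} \ln \frac{2}{\delta}$,
\[ m \left( \epsilon - \frac{\ln 2}{f(m)} \right)
\geq \frac{\epsilon m}{2} \geq \ln \frac{2}{\delta} . \]
Rearranging gives
$\frac{m}{f(m)} \ln 2 - \epsilon m \leq \ln \frac{\delta}{2}$,
and exponentiating both sides yields
\[ 2^{m/f(m)} e^{-\epsilon m} \leq \frac{\delta}{2} . \]
Finally $(1-\epsilon)^m < e^{-\epsilon m}$ gives
$2^{m/f(m)} (1-\epsilon)^m \leq \delta/2$.
The factor $2^{m/f(m)}$ enters because fewer than $2^{m/f(m)}$
representations $r'$ can satisfy $K(r' \mid r,n,s) < m/f(m)$: there are
fewer than $2^{k}$ binary programs of length less than $k$.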

426: \begin{corollary}

427: When the compression is of the form

428: \[f(m,n,s,\gamma) = \frac{m^{1-\alpha}}{p(n,s,\gamma)},\]

429: one can achieve a sample complexity of

430: \[\max\left\{\frac{2}{\epsilon}\ln \frac{2}{\delta},

431: \left( \frac{(2 \ln 2)p(n,s,\delta/2)}{\epsilon} \right)^{1/(1-\alpha)}\right\}.\]

432: In the special case of total compression, where $\alpha=0$, this

433: further reduces to

434: % Equation~\ref{length-sample}:

435: \begin{equation} \label{total-compression}

436: \frac{2}{\epsilon}\left\{\max(\ln \frac{2}{\delta},(\ln 2)

437: p(n,s,\delta/2))\right\}.

438: \end{equation}

439: For deterministic Occam-algorithms, we can furthermore replace

440: $2/\delta$ and $\delta/2$ in Theorem~\ref{KCoccam} by $1/\delta$ and

441: $\delta$ respectively.

442: \end{corollary}

443:
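\noindent
{\it Illustration.}
The sample complexity in the corollary is obtained by inverting $f$
explicitly. As a sketch, write $p$ for $p(n,s,\delta/2)$ and let
$x = \frac{2 \ln 2}{\epsilon}$. For $\alpha < 1$,
\[ \frac{m^{1-\alpha}}{p} \geq x
\quad \Longleftrightarrow \quad
m \geq (x p)^{1/(1-\alpha)}
= \left( \frac{(2 \ln 2) p}{\epsilon} \right)^{1/(1-\alpha)} , \]
so this is the value of $f^{-1}(x,n,s,\delta/2)$ up to rounding to an
integer. For $\alpha = 0$ the two terms of the maximum are
$\frac{2}{\epsilon} \ln \frac{2}{\delta}$ and
$\frac{2}{\epsilon} (\ln 2) p$, and factoring out $\frac{2}{\epsilon}$
gives (\ref{total-compression}).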

444: \begin{remark}

445: \rm

446: Essentially, our new

447: Kolmogorov complexity condition is a computationally

448: universal generalization of the length condition in the original

449: Occam's razor theorem of \cite{blumer1}. Here, in

450: Theorem~\ref{KCoccam}, we consider the

451: shortest description length over all effective representations,

452: given the target representation,

453: rather than in a specific (syntactical) representation system.

454: %This is representation-independent in the very strong sense

455: %of being an absolute and objective notion,

456: %which is recursively invariant by Church's thesis and the ability

457: %of universal machines to simulate each other.

458: This allows us to bound the required sample complexity

459: not by a function of the number of hypotheses

460: (returned representations)

461: of length at most the bound on

462: the length of the target representation, but by a similar

463: function of the number of hypotheses

464: that have a certain Kolmogorov complexity conditioned

465: on the target concept, see Remark~\ref{rem.kco}.

466: Nonetheless, like in the original Occam's razor Theorem of \cite{blumer1},

467: we return a representation of a concept approximating the target

468: concept in the given representation  system, rather than

469: a representation outside the system like in Boosting approaches.

470: \end{remark}

471:

472: Suppose we have a concept $c$ and a mis-classified example $x$---an

473: {\em exception}. Then, the symmetric difference $c \Delta \{x\}$

474: classifies $x$ correctly: if $x \not\in c$ then

475: $c \Delta \{x\} = c \bigcup  \{x\}$, and if $x \in c$ then

476: $c \Delta \{x\} = c \setminus  \{x\}$.

477:

478: \begin{definition}

479: An {\em exception handler} for a representation system

480: ${\bf R} = (R,\Gamma , c , \Sigma )$ is an algorithm which

481: on input of a representation $r\in R$ of length $s$,

482: and an $x \in \Sigma^{\ast}$ of length $n$,

483: outputs a representation $r' \in R$ of the concept $c(r) \Delta \{x\}$,

484: of length at most $e(s,n)$, where $e$ is called the {\em exception expansion}

485: function.

486: The running time of the exception-handler is expressed as a function

487: $t(n,s)$ of the representation and exception lengths.

488: If $t(n,s)$ is polynomial in $n,s$, and furthermore $e(s,n)$ is of the form

489: $s+p(n)$ for some polynomial $p$, then we say ${\bf R}$ is {\em polynomially

490: closed under exceptions}.

491: \end{definition}

492:

493: %\begin{definition}

494: %A class ${\bf R} = (R,\Gamma , c , \Sigma )$ is {\em closed under

495: %exceptions} iff it has an exception handler.

496: %\end{definition}

497:

498: \begin{theorem}

499: \label{sex}

500: Let $L$ be a deterministic pac-algorithm

501: with $m(n,s,\frac{1}{2n},\gamma)$ the sample size,

502: and let $E$ be an exception handler for

503: a representation system ${\bf R}$.

504: Then there is an Occam algorithm for ${\bf R}$

505: that for $m$ examples achieves compression

506: $f(m,n,s,\gamma)= \frac{1}{2\epsilon n}$.

507: Here $m \geq 2nm(n,s,\frac{1}{2n},\gamma)$, and

508: $\epsilon$, depending on $m,n,s,\gamma$, is such that

509: $m(n,s,\epsilon,\gamma)=\epsilon m$ holds.

510: \end{theorem}

511:

512: \begin{proof}

513: The proof is obtained in a fashion similar to

514: \cite{board}.

515: Suppose we are given a sample of length $m$ and confidence parameter

516: $\gamma$.

517: Assume without loss of generality that the sample

518: contains $m$ different examples.

519: Define a uniform distribution on these examples with $\mu (x) = 1/m$

520: for each $x$ in the sample.

521: Let $\epsilon$ be as described.

522: The function $m(n,s,\epsilon,\gamma)$ decreases with

523: increasing $\epsilon$,

524: while the function $\epsilon m$ increases with $\epsilon$

525: so the two necessarily intersect, under the assumption in the theorem,

526: for some $\epsilon_0$, although it may yield an

527: $\epsilon_0 >\frac{1}{2n}$, giving no actual compression.

528: For example, if $m(n,s,\epsilon,\gamma)= (\frac{1}{\epsilon})^{b}$

529: for some constant $b$, then $\epsilon_0 = m^{-1/(b+1)}$.

530: %Let the pac-learning algorithm have sample complexity

531: %$m(n,s,\epsilon,\delta)$.

532: Apply $L$ with $\delta = \gamma$ and $\epsilon = \epsilon_0$.

533: %It will use at most $m^{b/(b+1)}$ of our $m$ examples.

534: With probability $1-\gamma$,

535: it produces a concept which is correct with error $\epsilon$,

536: giving up to $\epsilon m$ exceptions.

537: We can just add these one by one using

538: the exception handler.

539: This will expand the concept size, but not the Kolmogorov complexity.

540: The resulting representation can be described by the $\leq \epsilon m$

541: examples used plus the $\leq \epsilon m$ exceptions found.

542: Since $L$ is deterministic, this uniquely determines the required

543: consistent concept.

544: % plus some constant for the various algorithms involved.

545: The compression achieved is $\frac{m}{2\epsilon mn} = \frac{1}{2\epsilon n}$.

546: This is an increasing function of $m$, since increasing the slope of

547: the function $\epsilon m$ moves its intersection with

548: the function $m(n,s,\epsilon,\gamma)$ to the left, that is,

549: to smaller $\epsilon$.

550: \end{proof}

551:
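\noindent
{\it Illustration.}
The example in the proof of Theorem~\ref{sex} can be worked through
completely. Assume, as in that example, that
$m(n,s,\epsilon,\gamma) = (1/\epsilon)^{b}$ for a constant $b > 0$.
The intersection point $\epsilon_0$ solves
\[ \epsilon_0^{-b} = \epsilon_0 m
\quad \Longleftrightarrow \quad
\epsilon_0^{b+1} = \frac{1}{m}
\quad \Longleftrightarrow \quad
\epsilon_0 = m^{-1/(b+1)} , \]
so the compression is
\[ f(m,n,s,\gamma) = \frac{1}{2 \epsilon_0 n} = \frac{m^{1/(b+1)}}{2n} , \]
which is increasing in $m$. It exceeds $1$, that is,
$\epsilon_0 < \frac{1}{2n}$, exactly when $m > (2n)^{b+1}$. Under the
same assumption
$2n \, m(n,s,\frac{1}{2n},\gamma) = 2n (2n)^{b} = (2n)^{b+1}$,
which agrees with the condition on $m$ in the theorem.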

552: \begin{definition}

553: Let ${\bf R} = (R,\Gamma , c , \Sigma )$ be a representation system.

554: The concept $\maj(r_1,r_2,r_3)$

555: is the set $\{x :$ $x$ belongs to at least two out of

556: the three concepts $c(r_1),c(r_2),c(r_3)\}$.

557: A {\em majority-of-three algorithm} for

558: ${\bf R}$ is an algorithm which

559: on input of three representations $r_1,r_2,r_3 \in R^{\leq s}$,

560: outputs a representation $r' \in R$ of the concept $\maj(r_1,r_2,r_3)$

561: of length at most $e(s)$, where $e$ is called

562:  the {\em majority expansion}

563: function.

564: The running time of the algorithm is expressed as a function

565: $t(s)$ of the maximum representation length.

566: If $t(s)$ and $e(s)$ are polynomial in $s$

567: then we say ${\bf R}$ is {\em polynomially

568: closed under majority-of-three}.

569: \end{definition}

570:

571: \begin{theorem}

572: \label{maj}

573: Let $L$ be a deterministic pac-algorithm with sample complexity

574: $m(n,s,\epsilon,\delta) \in o(1/\epsilon^2)$, and let $M$

575: be a majority-of-three algorithm for

576: the representation system ${\bf R}$.

577: Then there is an Occam algorithm for ${\bf R}$ that for $m$ examples

578: has compression $f(m,n,s,\gamma)=m/3nm(n,s,\frac{1}{2\sqrt{m}},\gamma/3)$.

579: \end{theorem}

580:

581: \begin{proof}

582: Let us be given a sample of length $m$.

583: Take $\delta = \gamma / 3$ and $\epsilon = \frac{1}{2\sqrt{m}}$.

584: %The reason we take $\epsilon$ to be $1/\sqrt{m}$ is that

585: %it turns out that the $\epsilon$s of the three stages are related

586: %as $\epsilon_1 \sim \epsilon_3$ and $2 \epsilon_1 \epsilon_2m = 1$,

587: %so this is the best way to balance them.

588:

589: {\it Stage 1:} Define a uniform distribution on the $m$ examples

590: with $\mu_1 (x) = 1/m$ for each $x$ in the sample.

591: Apply the learning algorithm.

592: It produces (with probability at least $1-\gamma/3$)

593: a hypothesis $r_1$ which has error less than $\epsilon$,

594: giving up to $\epsilon m = \sqrt{m}/2$ exceptions.

595: Denote this set of exceptions by $E_1$.

596:

597: {\it Stage 2:} Define a new distribution

598: %on the $m$ examples

599: $\mu_2(x) = \epsilon$ for each $x \in E_1$,

600: and $\mu_2(x) = (1-|E_1|/2\sqrt{m})/(m-|E_1|)$ for each $x \not\in E_1$.

601: Apply the learning algorithm.

602: It produces (with probability at least $1-\gamma/3$)

603: a hypothesis $r_2$ which is correct on all of $E_1$ and with error

604: less than $\epsilon$ on the remaining examples.

605: This gives up to $\epsilon (m-|E_1|) / (1-|E_1|/2\sqrt{m}) < \sqrt{m}$

606: exceptions. This set, denoted $E_2$, is disjoint from $E_1$.

607:

608: {\it Stage 3:} Define a new distribution on the $m$ examples

609: with $\mu(x) = 1/|E_1 \cup E_2| > \epsilon$ for each $x$ in $E_1\cup E_2$,

610: and $\mu(x) = 0$ elsewhere.

611: Apply the learning algorithm.

612: % with error bound $\epsilon_3 = 1/2\sqrt{m}$.

613: %Note that $|E_1| \leq \sqrt{m}$ and $E_2 < \sqrt{m}$ gives that for

614: %$x$ in $E_1\cup E_2$, $\mu(x) > \epsilon_3$.

615: The algorithm produces (with probability at least $1-\gamma/3$)

616: a hypothesis $r_3$ which is correct on all of $E_1$ and $E_2$.

617: %and which might be totally wrong elsewhere (we don't care).

618:

619: In total the number of examples consumed by the pac-algorithm

620: is at most $3m(n,s,\frac{1}{2\sqrt{m}},\gamma/3)$, each requiring

621: $n$ bits to describe.

622: The three representations are combined into one representation

623: by the majority-of-three algorithm $M$. This is necessarily correct on all

624: of the $m$ examples, since the three exception-sets are all disjoint.

625: Furthermore, it can be described in terms of the

626: examples fed to the deterministic pac-algorithm

627: and thus achieves compression

628: $f(m,n,s,\gamma) = m/3nm(n,s,\frac{1}{2\sqrt{m}},\gamma/3)$.

629: This is an increasing function of $m$ given the assumed

630: subquadratic sample complexity.

631: \end{proof}

632:
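\noindent
{\it Illustration.}
Some of the arithmetic behind the three stages in the proof of
Theorem~\ref{maj} can be checked directly; the following is a sketch
with $\epsilon = \frac{1}{2\sqrt{m}}$ as in the proof.
The distribution $\mu_2$ is normalized, since
\[ |E_1| \, \epsilon + (m - |E_1|)
\frac{1 - |E_1|/2\sqrt{m}}{m - |E_1|}
= \frac{|E_1|}{2\sqrt{m}} + 1 - \frac{|E_1|}{2\sqrt{m}} = 1 . \]
Every $x \in E_1$ has weight $\epsilon$ under $\mu_2$, so a hypothesis
with error less than $\epsilon$ cannot err on any element of $E_1$.
With $|E_1| \leq \sqrt{m}/2$ we have
$1 - |E_1|/2\sqrt{m} \geq \frac{3}{4}$, so the number of exceptions in
Stage 2 is less than
$\frac{\sqrt{m}/2}{3/4} = \frac{2}{3} \sqrt{m} < \sqrt{m}$.
In Stage 3, $|E_1 \cup E_2| < \frac{3}{2} \sqrt{m}$ gives
$1/|E_1 \cup E_2| > \frac{2}{3\sqrt{m}} > \frac{1}{2\sqrt{m}} = \epsilon$.
For the compression, assume for illustration a sample complexity of the
form $m(n,s,\epsilon,\delta) = c/\epsilon$ with $c = c(n,s,\delta)$
independent of $\epsilon$. Then
$m(n,s,\frac{1}{2\sqrt{m}},\gamma/3) = 2c\sqrt{m}$ and
\[ f(m,n,s,\gamma) = \frac{m}{3n \cdot 2c\sqrt{m}} = \frac{\sqrt{m}}{6cn} , \]
which is indeed increasing in $m$.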

633: The following corollaries use the fact that if

634:  a representation system is learnable,

635: it must have finite VC-dimension and hence,

636: according to (\ref{vc-sample}), it is learnable with sample

637: complexity subquadratic in $\frac{1}{\epsilon}$.

638: \begin{corollary}

639: Let a representation system ${\bf R}$ be

640: closed under either exceptions or majority-of-three, or both.

641: Then ${\bf R}$ is pac-learnable iff

642: there is an Occam algorithm for ${\bf R}$.

643: \end{corollary}

644:

645: \begin{corollary}

646: Let a representation system ${\bf R}$ be polynomially

647: closed under either exceptions or majority-of-three, or both.

648: Then ${\bf R}$ is deterministically polynomially pac-learnable iff

649: there is a polynomial time Occam algorithm for ${\bf R}$.

650: \end{corollary}

651:

652: \noindent

653: {\it Example.}

654: Consider threshold circuits,

655: acyclic circuits whose nodes compute threshold

656: functions of the form $a_1x_1 + a_2x_2 + \cdots +a_nx_n \geq \delta$,

657: $x_i \in \{0,1\}, a_i,\delta \in N$ (note that no expressive

658: power is gained by allowing rational weights and threshold).

659: A simple way of representing circuits

660: over the binary alphabet is to number each node and use

661: {\em prefix-free encodings} of these numbers. For instance, encode $i$

662: as $1^{|\mbox{bin}(i)|}0\mbox{bin}(i)$,

663: the binary representation of $i$ preceded by its length in unary.

664: A complete node encoding then consists of the encoded index, encoded

665: weights, threshold, encoded degree, and encoded indices of the nodes

666: corresponding to its inputs. A complete circuit can be encoded with

667: a node-count followed by a sequence of node-encodings.

668: For this representation, a majority-of-three

669: algorithm is easily constructed that renumbers two of its three input

670: representations, and combines the three by adding a

671: 3-input node computing the majority function

672: $x_1+x_2+x_3 \geq 2$.

It is clear that under this representation,
the system of threshold circuits
is polynomially closed under majority-of-three.
On the other hand, it is not closed under exceptions,
or under the exception lists of \cite{board}.
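
The prefix-free encoding of node numbers used in this example can be sketched as follows. This is a minimal illustration, assuming well-formed input; the function names \texttt{encode} and \texttt{decode\_all} are introduced here and do not appear in the text.

```python
from typing import List


def encode(i: int) -> str:
    """Encode i as 1^{|bin(i)|} 0 bin(i): unary length, a 0, then binary digits."""
    b = format(i, "b")
    return "1" * len(b) + "0" + b


def decode_all(bits: str) -> List[int]:
    """Decode a concatenation of codewords; assumes the input is well formed."""
    out = []
    pos = 0
    while pos < len(bits):
        length = 0
        while bits[pos] == "1":  # read the unary length prefix
            length += 1
            pos += 1
        pos += 1  # skip the 0 that terminates the prefix
        out.append(int(bits[pos:pos + length], 2))
        pos += length
    return out


numbers = [3, 0, 12, 7]
stream = "".join(encode(i) for i in numbers)
decoded = decode_all(stream)
```

Because no codeword is a prefix of another, the concatenated stream decodes without separators, which is what allows node encodings to be placed back to back.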

678:

679: \noindent

{\it Example.} Let $h_1 , h_2, h_3$ be three $k$-DNF formulas.

681: Then $\maj (h_1,h_2,h_3) = (h_1 \wedge h_2) \vee (h_2 \wedge h_3) \vee

682: (h_3 \wedge h_1)$ which can be expanded into a $2k$-DNF formula.

683: This is not good enough for Theorem~\ref{maj}, but it allows us to conclude

684: that pac-learnability of $k$-DNF implies compression of $k$-DNF into

685: $2k$-DNF.
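
To see where the bound $2k$ comes from, one can expand a pairwise conjunction by distributivity. The term names $T_{i,a}$ below are notation introduced only for this illustration. Writing each formula as a disjunction of terms of at most $k$ literals,
\[
h_i = \bigvee_{a} T_{i,a}, \qquad
h_i \wedge h_j = \bigvee_{a,b} \left( T_{i,a} \wedge T_{j,b} \right),
\]
every term $T_{i,a} \wedge T_{j,b}$ has at most $2k$ literals. For instance, with $k=1$ and $h_1 = x_1$, $h_2 = x_2$, $h_3 = x_3$ we get
\[
\maj (h_1,h_2,h_3) = x_1 x_2 \vee x_2 x_3 \vee x_3 x_1 ,
\]
a $2$-DNF formula.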

686:

687: \section{Applications}

Our KC-based Occam's razor theorem,
Theorem~\ref{KCoccam}, gives better sample
complexity than the length-based version
and is just as easy to use, as demonstrated by the following
two examples.

696: While it is easy to construct an artificial system with

697: extremely bad representations such that our Theorem~\ref{KCoccam}

698: gives {\it arbitrarily} better sample complexity than

699: the length-based sample complexity given in

700: (\ref{length-sample}), we prefer to give natural examples.

701:

702: \noindent

703: {\bf Application 1: Learning a String.}

704:

705: The DNA sequencing process can be modeled as the problem

706: of learning a super-long string in the pac model \cite{jiang1,li}.

707: We are interested in learning a target string $t$ of length $s$,

708: say $s=3 \times 10^9$ (length of a human DNA sequence).

709: At each

710: step, we can obtain as an example a substring of this sequence

711: of length $n$, from a random location of $t$ (Sanger's Procedure).

712: At the time of writing, $n \approx 500$, and

713: sampling is very expensive.

714: Formally, the concepts we are learning are sets of possible length $n$

715: substrings of a superstring, and these are naturally

716: represented by the superstrings. We assume a minimal target representation

717: (which may not hold in practice).

718: Suppose we obtain a

719: sample of $m$ substrings (all positive examples). In biological

720: labs, a Greedy algorithm which repeatedly merges a pair of substrings

721: with maximum overlap is routinely used. It is conjectured

722: that Greedy produces a common superstring $t'$ of length at most $2s$,

723: where $s$ is the optimal length (NP-hard to find). In \cite{blum},

724: we have shown that $s \leq |t'| \leq 4s$.

725: Assume that $|t'| \approx 2s$.\footnote{Although only the

726: $4s$ upper bound was proved in \cite{blum}, which has since been improved,

727: it is widely believed

728: that $2s$ is the true bound.}

729: Using the length-based Occam's razor theorem, that is, Theorem~\ref{sex}

730: with $K(r' \mid r,s,n)$ in Definition~\ref{def.kcoccam} replaced

731: by $|r'|$,

732: this length of $2s$ would determine the sample complexity,

733: as in (\ref{total-compression}), with

734: %\begin{equation}\label{length-base}

735: $p(n,s,\delta/2)= 2 \cdot 2s$

(the extra factor 2 is the binary logarithm of the size of the alphabet
$\{A,C,G,T\}$).

738: %m_{\rm len} \geq 2cs/\epsilon ,

739: %\end{equation}

740: %by Equation~\ref{length-simple}, where $c$ is the constant represented

741: %by the big-$O$ in \ref{length-simple}.

742: Is this the best we can do?

It is well known that sampling in DNA sequencing is
costly and slow.

745: We improve the sample complexity using our KC-based Occam's razor

746: theorem.

747: %Theorem~\ref{KCoccam}.

748:

749: \begin{lemma}

750: Let $t$ be the target string of length $s$ and $t'$ be the

751: superstring returned by Greedy of length at most $2s$. Then

752: \[

753: K(t' \mid t,s,n ) \leq 2s (2\log s + \log n) / n .

754: \]

755: \end{lemma}

756: \begin{proof}

757: We give $t'$ a short description using some information

758: from $t$. Let $S = \{ s_1 , \ldots , s_m \}$ be the set of

759: $m$ examples (substrings of $t$ of length $n$).

760: Align these substrings with the common superstring $t'$, from

761: left to right. Divide them into groups such that each group's

762: leftmost string overlaps with every string in the group but

763: does not overlap with the leftmost string of the previous group.

764: Thus there are at most $2s/n$ such groups.

765: To specify $t'$, we only need to specify these $2s/n$ groups.

766: After we obtain the superstring for each group, we re-construct $t'$

767: by optimally merging the superstrings of neighboring groups.

768: To specify each group, we only need to specify the first and the last

769: string of the group and how they are merged. This is because every

770: other string in the group is a substring of the string obtained by

771: properly merging the first and last strings. Specifying the first and

772: the last strings requires $2 \log s$ bits of information

773: to indicate their locations in $t$ and we need another

774: $\log n$ bits to indicate how they are merged.

775: Thus $K(t'\mid t,s,n) \leq 2s (2 \log s + \log n) / n$.

776: \end{proof}

777:

778: This lemma shows that (\ref{total-compression}) can also be

779: applied with

780: $p(n,s,\delta/2)= 2\cdot 2s (2 \log s + \log n) / n$, giving a factor

781: $n / (2\log s + \log n)$ improvement in sample-complexity.

782: %By Theorem~\ref{KCoccam}, the sample complexity is improved to

783: %\begin{equation}\label{kc-base}

784: %m_{\rm KC} = \frac{ 2cs (2\log s + \log n) }{n \epsilon } .

785: %\end{equation}

786: %Thus, combining Equations~\ref{length-base} and \ref{kc-base}, we have,

787: %\[

788: %m_{\rm KC} \leq m_{\rm len} \frac{2\log s + \log n}{n}.

789: %\]

Note that in (mammalian) genome computation practice,

791: we have $n=500$ and $s=3 \times 10^9$.

792: The sample complexity using the Kolmogorov complexity-based

793: Occam's razor is reduced over the ``length based''

794: Occam's razor by a multiplicative factor of

795: $n / (2\log s + \log n) \approx \frac{500}{2 \times 31 + 9} \approx 7$.
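
The arithmetic behind this factor can be reproduced with a few lines of code. The sketch below simply evaluates the two choices of $p(n,s,\delta/2)$ stated above with base-2 logarithms; the function name \texttt{description\_bounds} is hypothetical, and the bound $|t'| \approx 2s$ is the same assumption made in the text.

```python
import math
from typing import Tuple


def description_bounds(s: float, n: float) -> Tuple[float, float, float]:
    """Return (length-based p, KC-based p, improvement factor), in bits.

    Assumes |t'| is about 2s and a 4-letter alphabet (2 bits per symbol).
    """
    log_term = 2 * math.log2(s) + math.log2(n)
    p_length = 2 * 2 * s              # 2 bits per symbol times length 2s
    p_kc = 2 * 2 * s * log_term / n   # the lemma's bound with the same factor 2
    return p_length, p_kc, n / log_term


p_len, p_kc, factor = description_bounds(3e9, 500)
print(f"length-based p: {p_len:.3e} bits")
print(f"KC-based p:     {p_kc:.3e} bits")
print(f"improvement:    {factor:.2f}")
```

Without rounding the logarithms, the factor comes out slightly below 7, in line with the rounded figure above.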

796:

797: \noindent

798: {\bf Application 2: Learning a Monomial.}

799:

Consider the Boolean space $\{0,1\}^n$. There are two well-known algorithms

801: for learning monomials. One is the standard algorithm.

802:

803: \noindent

804: {\bf Standard Algorithm.}

805: \begin{enumerate}

806: \item

807: Initially set the concept representation

808: $M := x_1 \overline{x_1} \ldots x_n \overline{x_n}$

809: (a conjunction of all literals of $n$

810: variables---which contradicts every example).

811: \item

812: For each positive example, delete from the current $M$ the literals that

813: contradict the example.

814: \item

815: Return the resulting monomial $M$.

816: \end{enumerate}
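
The three steps above can be sketched in code. This is an illustration under representation choices made here, not in the text: an example is a tuple of $0/1$ values, and a literal is a pair (index, polarity), with polarity \texttt{True} for $x_i$ and \texttt{False} for $\overline{x_i}$.

```python
from typing import Iterable, Set, Tuple

Literal = Tuple[int, bool]   # (variable index, True for x_i / False for its negation)
Example = Tuple[int, ...]    # assignment of 0/1 values to the n variables


def standard_algorithm(n: int, positives: Iterable[Example]) -> Set[Literal]:
    """Start from all 2n literals and delete those contradicted by a positive example."""
    monomial = {(i, pol) for i in range(n) for pol in (True, False)}
    for ex in positives:
        # A literal survives only if the example satisfies it.
        monomial = {(i, pol) for (i, pol) in monomial if bool(ex[i]) == pol}
    return monomial


# Toy target (an assumption of this sketch): x_0 AND NOT x_2 over n = 4 variables.
target = {(0, True), (2, False)}
after_one = standard_algorithm(4, [(1, 0, 0, 1)])
after_two = standard_algorithm(4, [(1, 0, 0, 1), (1, 1, 0, 0)])
```

After one positive example the hypothesis still contains extra literals; after the second it coincides with the toy target. At every stage it contains all literals of the target, which is the property used in the analysis below.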

817:

818: Haussler \cite{hauss} proposed a more sophisticated algorithm based

819: on set-cover approximation as follows.

820: Let $k$ be the number of variables in the target monomial, and $m$

821: be the number of examples used.

822:

823: \noindent

824: {\bf Haussler's Algorithm.}

825: \begin{enumerate}

826: \item

827: Use only negative examples.

For each literal $x$, define $S_x$ to be the set of negative examples
that $x$ falsifies.

830: The sets associated with the literals in the target monomial form a

831: %minimum

832: % uhm, doesn't have to be minimal, e.g. xyz with examples 001,010,100

833: % gives 3 sets each falsifying 2 negative examples. -John

834: set cover of negative examples.

835: \item

Run the approximation algorithm for set cover; this will use at most
$k \log m$ sets or, equivalently, literals in our approximating
monomial.

839: \end{enumerate}
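
The set-cover step can be sketched with the usual greedy heuristic, which repeatedly picks the literal falsifying the most still-uncovered negative examples. The code is an illustration only: the name \texttt{greedy\_cover} is hypothetical, literals are (index, polarity) pairs, and the list of candidate literals is supplied by the caller. Which literals are admissible as candidates is an assumption of this sketch (for instance, the literals of the target, or those consistent with any positive examples seen).

```python
from typing import List, Sequence, Tuple

Literal = Tuple[int, bool]   # (variable index, True for x_i / False for its negation)
Example = Tuple[int, ...]    # assignment of 0/1 values to the variables


def falsifies(literal: Literal, ex: Example) -> bool:
    """True if the example violates the literal."""
    i, pol = literal
    return bool(ex[i]) != pol


def greedy_cover(negatives: Sequence[Example],
                 candidates: Sequence[Literal]) -> List[Literal]:
    """Greedy set cover: S_x is the set of negative examples falsified by literal x."""
    covers = {lit: {j for j, ex in enumerate(negatives) if falsifies(lit, ex)}
              for lit in candidates}
    uncovered = set(range(len(negatives)))
    chosen: List[Literal] = []
    while uncovered:
        best = max(candidates, key=lambda lit: len(covers[lit] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            raise ValueError("candidate literals do not cover all negative examples")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy target (an assumption of this sketch): x_0 AND x_1 over three variables.
negatives = [(0, 1, 1), (1, 0, 0), (0, 0, 1)]
chosen = greedy_cover(negatives, [(0, True), (1, True)])
single = greedy_cover([(0, 1, 1), (0, 0, 1)], [(0, True), (1, True)])
```

The conjunction of the chosen literals rejects every negative example, and when one literal already falsifies all of them the returned monomial has length one.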

840:

It is commonly believed that Haussler's algorithm
has better sample complexity than the standard algorithm.\footnote{In
fact, Haussler's algorithm is specifically aimed
at reducing sample complexity for small target monomials, and it
achieves that.}

847: We demonstrate that the opposite is sometimes true (in fact for

848: most cases), using our KC-based Occam's razor theorem,

849: Theorem~\ref{KCoccam}. Assume that our target monomial $M$ is of

850: length $n - \sqrt{n}$. Then the length-based Occam's razor theorem

gives sample complexity $n/\epsilon$ for both algorithms, by
(\ref{total-compression}). However,

853: $K(M' \mid M)\leq \sqrt{n}\log 3+O(1)$,

854: where $M'$ is the monomial returned by the standard algorithm. This is

855: true since the standard algorithm always produces a monomial

856: $M'$ that contains {\em all} literals of the target monomial $M$, and

we need at most $\sqrt{n} \log 3 + O(1)$ bits to specify,
for each of the $\sqrt{n}$ variables that do not occur in $M$,
whether it occurs in $M'$ positively, negatively,
or not at all.
Thus (\ref{total-compression}) gives
a sample complexity of $O(\sqrt{n}/\epsilon)$.

863: In fact, as long as $|M| > n/\log n$ (which is most likely

864: to be the case if every monomial has equal probability),

865: it makes sense to use the standard algorithm.
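
As a purely illustrative instance (the value of $n$ is chosen here and is not taken from the text), let $n = 10^4$ and take logarithms to base 2. Then
\[
|M| = n - \sqrt{n} = 9900 > \frac{n}{\log n} \approx 753 ,
\qquad
\sqrt{n}\,\log 3 \approx 100 \times 1.585 \approx 159 ,
\]
so the KC-based description of $M'$ given $M$ takes roughly $159$ bits, whereas the length-based bound works with a description length of order $n = 10^4$.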

866:

867: \section{Conclusions}

868:

869: Several new problems are suggested by this work.

870: If we have an algorithm that, given a length-$m$ sample of a concept

871: in Euclidean space, produces a consistent hypothesis that can be described

with only $m^\alpha$, $\alpha<1$, symbols (including a symbol for every real
number; we are using an uncountable representation alphabet), then it seems

874: intuitively appealing that this implies some form of learning.

875: However, as noted in \cite{board},

876: the standard proof of Occam's Razor does not apply, since we cannot

877: enumerate these representations. The main open question is under

878: what conditions (specifically on the real number computation

879: model) such an implication would nevertheless hold.

880:

Can we replace the exception element or majority-of-three requirement
by some weaker requirement? Or can we even eliminate such a
closure requirement and obtain a complete reverse of the
Occam's razor theorem?
Our current requirements do not even include things
like $k$-DNF and some other reasonable representation systems.

887:

888: \section{Acknowledgements}

889: We wish to thank Tao Jiang for many stimulating discussions.

890:

891: \begin{thebibliography}{99}

892: \setlength{\baselineskip}{0.6\baselineskip}

893:

894: \bibitem{anthony}

895:         M. Anthony and N. Biggs,

896:         {\it Computational Learning Theory}, Cambridge University Press, 1992.

897: \bibitem{blum}

898:         A. Blum, T. Jiang, M. Li, J. Tromp, M. Yannakakis,

899:         Linear approximation of shortest common superstrings.

900:         {\it Journal ACM}, 41:4 (1994), 630-647.

901: \bibitem{blumer}

902:         A. Blumer and A. Ehrenfeucht and D. Haussler and M. Warmuth,

903:         Learnability and the Vapnik-Chervonenkis Dimension.

        {\it J. Assoc. Comput. Mach.}, 36(1989), 929-965.

905: \bibitem{blumer1}

906:         A. Blumer and A. Ehrenfeucht and D. Haussler and M. Warmuth,

907:         Occam's Razor.

908:         {\it Inform.\ Process.\ Lett.}, 24(1987), 377-380.

909: \bibitem{board}

910:         R. Board and L. Pitt,

911:         On the necessity of Occam Algorithms.

912:         1990 {\it STOC}, pp. 54-63.

913: \bibitem{ehren}

914:         A. Ehrenfeucht, D. Haussler, M. Kearns, L. Valiant.

915:         A general lower bound on the number of examples needed for

916:         learning. {\it Inform.\ Computation}, 82(1989), 247-261.

917: \bibitem{hauss}

918:         D. Haussler.

919:         Quantifying inductive bias: AI learning algorithms and

920:         Valiant's learning framework. {\it Artificial Intelligence},

921:         36:2(1988), 177-222.

922: \bibitem{haussler}

        D. Haussler, N. Littlestone, and M. Warmuth.

924:         Predicting $\{0,1\}$-functions on randomly drawn points.

925:         {\em Information and Computation}, 115:2(1994),

926:      248--292.

927: \bibitem{HeWa95}

928: 	D.P. Helmbold and M.K. Warmuth,

929: 	On weak learning, {\em J. Comput. Syst. Sci.}, 50:3(1995),551-573.

930: \bibitem{jiang1}

931:         T. Jiang and M. Li,

932:         DNA sequencing and string learning,

933: {\em Math. Syst. Theory}, 29(1996), 387-405.

934: %\bibitem{kearns2}

935: %	M. Kearns and M. Li.

936: %	Learning in the Presence of Malicious Errors.

937: %       {\it SIAM J. Comput.}, 22:4(1993), 807-837.

938: \bibitem{li}

939:         M. Li. Towards a DNA sequencing theory.

940:         {\it 31st IEEE Symp. on Foundations of Comp. Sci.}, 125-134, 1990.

941: \bibitem{lv}

942: 	M. Li and P. Vit\'anyi. {\it An Introduction to

943:         Kolmogorov Complexity and Its Applications}.

944:         2nd Edition,

945:         Springer-Verlag, 1997.

946: \bibitem{Sch90}

947: R. E. Schapire.

948:        The strength of weak learnability.

       {\em Machine Learning}, 5:2(1990), 197--227.

950: \bibitem{Sch02}

951: R.E. Schapire,

952: The boosting approach to machine learning: An overview.

953:        In: {\em MSRI Workshop on Nonlinear Estimation and Classification}, 2002.

954: \bibitem{val}

955: 	L. G. Valiant.

956: 	A Theory of the Learnable.

957: 	{\it Comm. ACM}, 27(11), 1134-1142, 1984.

958: \bibitem{manfred}

959:         M.K. Warmuth.

960:         Towards representation independence in PAC-learning.

961:         In {\it AII-89}, pp. 78-103, 1989.

962: \end{thebibliography}

963:

964: \end{document}

965:

966: