
1:

2: % d5.tex 3-27-03

3:

4:

5: \documentclass[11pt]{article}

6: \usepackage{amssymb}

7: \usepackage{amsfonts}

8: \usepackage{amsmath}

9: \usepackage{latexsym}

10: \usepackage{epsfig}

11:

12: \parindent=18pt

13: \oddsidemargin=0.15in

14: \evensidemargin=0.15in

15: \topmargin=-.5in

16: \textheight=9in

17: \textwidth=6.5in

18:

19: \newcommand{\la}{\langle}

20: \newcommand{\ra}{\rangle}

21: \newcommand{\poly}{\mathrm{poly}}

22: \newcommand{\size}{\mathrm{size}}

23: \newcommand{\fix}{\mathrm{fix}}

24: \newcommand{\bias}{\mathrm{bias}}

25: \newcommand{\R}{{\bf R}}

26: \newcommand{\E}{{\mathrm E}}

27: \newcommand{\F}{{{\bf F}_2}}

28: \newcommand{\s}{{\mathcal S}}

29: \newcommand{\K}{{\mathcal K}}

30: \newcommand{\A}{{\mathcal A}}

31: \newcommand{\B}{{\mathcal B}}

32: \newcommand{\true}{\textsc{T}}

33: \newcommand{\false}{\textsc{F}}

34: \newcommand{\bitsl}{\{\false, \true\}}

35: \newcommand{\bitsf}{\{0, 1\}}

36: \newcommand{\bitsr}{\{+1,-1\}}

37: \newcommand{\degr}{\deg_\R}

38: \newcommand{\degf}{\deg_\F}

39: \newcommand{\parity}{\mathsf{PARITY}}

40: \newcommand{\cz}{c_\emptyset}

41: \newcommand{\fin}{f_{\mathrm{in}}}

42: \newcommand{\fout}{f_{\mathrm{out}}}

43: \newcommand{\tin}{t_{\mathrm{in}}}

44: \newcommand{\tout}{t_{\mathrm{out}}}

45: \newcommand{\eps}{{\epsilon}}

46: \newcommand{\theconst}{\frac{\omega}{\omega+1}}

47: \newcommand{\ignore}[1]{}

48: \newcommand{\qed}{\hfill\rule{7pt}{7pt}}

49: \newcommand{\strutje}{\rule[-.25cm]{0cm}{.7cm}}

50: \newcommand{\omb}{ODDMAXBIT}

51: \newcommand{\PP}{\mathsf{PP}}

52: \newcommand{\PNP}{\mathsf{P^{NP}}}

53:

54:

55: \newtheorem{theorem}{Theorem}

56: \newtheorem{fact}[theorem]{Fact}

57: \newtheorem{observation}[theorem]{Observation}

58: \newtheorem{proposition}[theorem]{Proposition}

59: \newtheorem{claim}[theorem]{Claim}

60: \newtheorem{definition}[theorem]{Definition}

61: \newtheorem{corollary}[theorem]{Corollary}

62:

63: \newenvironment{proof}{\noindent \textbf{Proof:}}{\hfill{$\Box$}}

64:

65: \title{Toward Attribute Efficient Learning Algorithms}

66: \ignore{

67: OR \\

68:        Learning Decision Lists of Length $k$ using

69:        $2^{\tilde{O}(k^{1/3})}$ Examples OR \\

70:        On Learning Decision Lists Attribute Efficiently OR \\

71:        Learning Decision Lists Attribute Efficiently via Polynomial Threshold Functions OR \\

72:        Learning Decision Lists using $2^{\tilde{O}(k^{1/3})}$ Samples OR \\

73:        Learning Decision Lists via Polynomial Threshold Functions OR \\

74:        A Subexponential Algorithm for Learning Decision Lists Attribute Efficiently OR \\

75:        some other lame title

76: }

77:

78:

79:

80: \author{Adam R. Klivans\thanks{Supported by an NSF Mathematical

81: Sciences Postdoctoral Research Fellowship.}\\

82: Division of Engineering and Applied Sciences\\

83: Harvard University\\ Cambridge, MA 02138 \\{\tt klivans@eecs.harvard.edu}

84: \and Rocco A.\ Servedio\\

85: Department of Computer Science\\

86: Columbia University\\

87: New York, NY 10027\\ {\tt rocco@cs.columbia.edu} }

88:

89: \date{}

90:

91: \begin{document}

92:

93: \setcounter{page}{0}

94:

95: \maketitle

96:

97: \begin{abstract}

98:

99: We make progress on two important problems regarding attribute

100: efficient learnability.

101:

102: First, we give an algorithm for learning decision

103: lists of length $k$ over $n$ variables using $2^{\tilde{O}(k^{1/3})}

104: \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$. This is the first

105: algorithm for learning decision lists that has both subexponential

106: sample complexity and subexponential running time in the relevant

107: parameters.  Our approach establishes a relationship between attribute

108: efficient learning and polynomial threshold functions and is based on

109: a new construction of low degree, low weight polynomial threshold

110: functions for decision lists. For a wide range of parameters our

111: construction matches a 1994 lower bound due to Beigel for the

112: ODDMAXBIT predicate and gives an essentially optimal tradeoff between

113: polynomial threshold function degree and weight.

114:

115: Second, we give an

116: algorithm for learning an unknown parity function on $k$ out of $n$

117: variables using $O(n^{1-1/k})$ examples in time polynomial in $n$. For

118: $k=o(\log n)$ this yields a polynomial time algorithm with

119: sample complexity $o(n)$.  This is the first polynomial time algorithm

120: for learning parity on a superconstant number of variables with

121: sublinear sample complexity.

122:

123: \end{abstract}

124:

125:

126: %%%%%%%% SECOND ABS

127:

128: \ignore{

129: \begin{abstract}

130:

131: We give an algorithm for learning decision lists of length $k$ over $n$

132: variables using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time

133: $n^{\tilde{O}(k^{1/3})}$. This is the first algorithm for learning

134: decision lists that has both subexponential sample complexity (in the

135: relevant parameters $k$ and $\log n$)  and subexponential running time (in

136: the relevant parameter $k$;  any algorithm must take time $\Omega(n)$).

137: Our approach establishes a relationship between attribute efficient

138: learning and polynomial threshold functions, and is based on a new

139: construction of low degree, low weight polynomial threshold functions for

140: decision lists.  As a consequence of our construction we show that

141: Beigel's 1994 complexity theoretic lower bound for the ODDMAXBIT function

142: is asymptotically optimal. {\bf [[Another option for the last sentence:]]}

143: For a wide range of parameters our construction matches a 1994 lower bound due to

144: Beigel for the ODDMAXBIT predicate, and thus our construction

145: gives an optimal tradeoff between polynomial threshold function

146: degree and weight.  {\bf [[basically, do we want to say that his

147: stuff shows our stuff is optimal, or our stuff shows his stuff is

148: optimal?]]}

149:

150:

151: \end{abstract}

152: }

153:

154: %%%%%%%%%%%% END SECOND ABS

155:

156: %%%%%%%%%%% first abs:

157: \ignore{

158: \begin{abstract}

159: We give an online algorithm for learning decision lists.

160: The mistake bound of the algorithm, for learning a decision list of

161: length $k$ over $n$ Boolean variables, is

162: $2^{O(k^{1/3})}\log n$ and the running time of the algorithm is

163: $n^{O(k^{1/3})}.$  We thus achieve a tradeoff between

164: running time and sample complexity for learning decision lists.

165: Our approach combines known algorithms for attribute efficient

166: learning of linear threshold functions

167: with a new construction of polynomial threshold functions

168: which compute decision lists.  As a consequence of our

169: construction, we

170: show that Beigel's 1994 complexity theoretic

171: lower bound on the weight of any low-degree polynomial

172: threshold function for the ODDMAXBIT$_n$ predicate is asymptotically optimal.

173: \end{abstract}

174: }

175: %%%%%%%%%%% end first abs:

176:

177:

178: \thispagestyle{empty}

179:

180: \newpage

181:

182: \section{Introduction}

183:

184: \subsection{Attribute Efficient Learning}

185:

186: A central goal in machine learning is to design efficient, effective

187: algorithms for learning from small amounts of data.  An obstacle to

188: achieving this goal is that learning problems are often characterized by

189: an abundance of {\em irrelevant information}.  In many learning problems

190: each data point is naturally viewed as a high dimensional vector of

191: attribute values;  as a motivating example, in a natural language domain a

192: data point representing a text document may be a vector of word

193: frequencies over a lexicon of 100,000 words (attributes).  A newly

194: encountered word in a corpus may typically have a simple definition which

195: uses only a dozen or so words from the entire lexicon.  One would like to

196: be able to learn the meaning of such a word using a number of examples

197: which is closer to a dozen (the actual number of relevant attributes) than

198: to 100,000 (the total number of attributes).

199:

200: Towards this end, an important goal in machine learning theory is to

201: design {\em attribute efficient} algorithms for learning various classes

202: of Boolean functions.  A class ${\cal C}$ of Boolean functions over $n$

203: variables $x_1,\dots,x_n$ is said to be {\em attribute-efficiently

204: learnable} if there is a poly$(n)$ time algorithm which can learn any

205: function $f \in {\cal C}$ using a number of examples which is polynomial in the

206: ``size'' (description length) of the function $f$ to be learned, rather

207: than in $n$ (the number of features in the domain over which learning

208: takes place).  (Note that the running time of the learning algorithm must

209: in general be at least $n$ since each example is an $n$-bit vector.)

210: Thus an attribute efficient learning algorithm for, say, the class of

211: Boolean conjunctions must be able to learn any Boolean conjunction of $k$

212: literals over $x_1,\dots,x_n$ using poly$(k,\log n)$ examples, since $k

213: \log n$ bits are required to specify such a conjunction.
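
As a rough illustration of the target (the specific numbers below are our own and serve only as an example), take the lexicon setting above with $n = 100{,}000$ attributes and a conjunction of $k = 12$ literals. Naming one variable takes $\lceil \log_2 n \rceil = 17$ bits, so the conjunction has description length about
$$
k \lceil \log_2 n \rceil = 12 \cdot 17 = 204 \mbox{ bits}
$$
(plus $k$ further bits if negations are recorded), and an attribute efficient algorithm may use a number of examples polynomial in this quantity rather than in $n = 100{,}000$.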

214:

215:

216: \subsection{Decision Lists}

217:

218: A longstanding open problem in machine learning, posed first by Blum in

219: 1990 \cite{Blum:90,Blum:96,BHL:95,BlumLangley:97} and again by

220: Valiant in 1998

221: \cite{Valiant:99}, is to determine whether or not there exist attribute

222: efficient algorithms for learning {\em decision lists}.  A decision list

223: is essentially a nested ``if-then-else'' statement (we give a precise

224: definition in Section \ref{sec:prelims}).

225:

226: Attribute efficient learning of decision lists is of both theoretical and

227: practical interest. Blum's motivation for considering the problem came

228: from the {\em infinite attribute model} \cite{Blum:90}; in this model

229: there are infinitely many attributes but the concept to be learned depends

230: on only a small number of them, and each example consists of a finite list

231: of active attributes.  Blum {\em et al.}\ \cite{BHL:95} showed that for a

232: wide range of concept classes (including decision lists)  attribute

233: efficient learnability in the standard $n$-attribute model is equivalent

234: to learnability in the infinite attribute model.  Since simple classes

235: such as disjunctions and conjunctions are attribute efficiently learnable

236: (and hence learnable in the infinite attribute model), this motivated Blum

237: \cite{Blum:90} to ask whether the richer class of decision lists is thus

238: learnable as well.\footnote{ Additional motivation comes from the fact

239: that decision lists have such a simple algorithm in the PAC model.}

240: Several researchers have subsequently considered this problem; see e.g.

241: \cite{Blum:96,BlumLangley:97,DhagatHellerstein:94, NevoElYaniv:02,

242: Servedio:99stoc}; we summarize some of this previous work in Section

243: \ref{sec:prevdl}.

244:

245: From an applied perspective, Valiant \cite{Valiant:99} relates the

246: problem of learning decision lists attribute efficiently to the question

247: ``how can human beings learn from small amounts of data in the presence of

248: irrelevant information?'' He points out that since decision lists play an

249: important role in various models of cognition, a first step in

250: understanding this phenomenon would be to identify efficient algorithms

251: which learn decision lists from few examples. Due to the lack of progress

252: in developing such algorithms for decision lists, Valiant suggests that

253: models of cognition should perhaps focus on ``flatter'' classes of

254: functions such as projective DNF \cite{Valiant:99}.

255:

256: \subsection{Parity Functions}

257:

258: Another outstanding challenge in machine learning is to determine whether

259: there exist attribute efficient algorithms for learning {\em parity

260: functions}.  The parity function

261: on a set of 0/1-valued variables $x_{i_1},\ldots,x_{i_k}$ is equal to $x_{i_1} + \cdots

262: + x_{i_k}$ modulo 2.  As with the class of decision lists, a simple PAC learning

263: algorithm is known for the class of parity functions but no attribute efficient

264: PAC learning algorithm is known.

265: Learning parity

266: functions plays an important role in Fourier learning methods

267: \cite{MOS:03} and is closely related to  decoding random linear codes \cite{BKW:00}.

268: Both A. Blum \cite{Blum:96} and Y. Mansour \cite{Man:02} cite

269: attribute efficient learning of parity functions as an important open

270: problem.

271:

272: \ignore{

273: Given a set of examples labelled according to an unknown parity

274: function on $k$ out of $n$ variables, we wish to find an approximation

275: to the unknown parity in polynomial time using as few examples as

276: possible.  The well known solution to this problem views these

277: examples as a set of linear equations mod $2$ in $n$ variables and

278: solves the set of equations to come up with a consistent

279: hypothesis. Note, however, that we must take $\Omega(n)$ examples to

280: achieve a solution which has good generalization error, as a solution

281: to a system of $m$ equations over $n$ variables may contain

282: $\min(m,n)$ non-zero entries.  An attribute efficient algorithm for

283: learning parity should require a number of examples polynomially

284: related to $k$ and $\log n$ (information theoretically we should only

285: need $O(k \log n)$ examples).

286: }

287:

288: \subsection{Our Results: Decision Lists}

289:

290: We give the first learning algorithm for decision lists that is

291: subexponential in both sample complexity (in the relevant parameters $k$

292: and $\log n$) and running time (in the relevant parameter $k$).  Our

293: results demonstrate for the first time that it is possible to

294: simultaneously avoid the ``worst case'' in both sample complexity and

295: running time, and thus suggest that it may indeed be possible to learn

296: decision lists attribute efficiently. \ignore{We consider this to be the

297: first evidence that decision lists can be learned attribute efficiently.

298: \\}

299:

300: Our main learning result for decision lists is:

301:

302: \begin{theorem} \label{thm:main} There is an algorithm for learning

303: decision lists over $\{0,1\}^n$ which, when learning a decision list

304: of length $k$, has mistake bound\footnote{Throughout this

305: section we use ``sample complexity'' and ``mistake bound''

306: interchangeably; as described in Section \ref{sec:prelims}

307: these notions are essentially identical.}

308: $2^{\tilde{O}(k^{1/3})}\log n$ and runs  in time

309: $n^{\tilde{O}(k^{1/3})}$.

310: \end{theorem}

311:

312:

313: We prove Theorem \ref{thm:main} in two parts; first we generalize

314: Littlestone's well known Winnow algorithm \cite{Littlestone:88}

315: for learning

316: linear threshold functions to learn {\em polynomial

317: threshold functions.} In previous learning results, polynomial threshold

318: functions are learned by applying techniques from linear programming: a

319: Boolean function computed by a polynomial threshold function of degree $d$ can

320: be learned in time $n^{O(d)}$ by using polynomial time linear programming

321: algorithms such as the Ellipsoid algorithm

322: (see e.g. \cite{KlivansServedio:01}).

323: \ignore{via a linear programming solver, such as the

324: Ellipsoid algorithm.}

325: In contrast, we use the Winnow algorithm to learn polynomial threshold functions.

326: Winnow learns using few examples in a small amount of time

327: provided that the degree of the polynomial

328: is low and the integer coefficients of the polynomial are not too large:

329: \ignore{As opposed to general

330: linear programming solvers, Winnow can learn in an attribute efficient

331: manner:}

332:

333:

334: \begin{theorem} \label{thm:win}

335: Let ${\cal C}$ be a class of Boolean functions over

336: $\{0,1\}^n$ with the property that each $f \in {\cal C}$ has a polynomial

337: threshold function of degree at most $d$ and weight at most $W.$ Then

338: there is an online learning algorithm for ${\cal C}$ which runs in $n^d$

339: time per example and has mistake bound $O(W^{2} \cdot d \cdot \log n).$

340: \end{theorem}
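
To see where the $n^d$ per-example cost comes from, here is a small illustration under the assumption (made precise in Section \ref{sec:winnow}) that the algorithm keeps one feature per monomial of degree between $1$ and $d$. For $n=3$ and $d=2$ the features are
$$
x_1,\ x_2,\ x_3,\ x_1x_2,\ x_1x_3,\ x_2x_3,
$$
that is, ${3 \choose 1} + {3 \choose 2} = 6$ features, and the example $x=(1,0,1)$ expands to $(1,0,1,0,1,0)$. In general the number of such features is $\sum_{i=1}^{d} {n \choose i} = O(n^d)$, so its logarithm is $O(d \log n)$, which is consistent with the factor $d \cdot \log n$ in the mistake bound.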

341:

342: At this point we have reduced the problem of learning decision lists

343: attribute efficiently to the problem of representing decision lists with

344: polynomial threshold functions of low weight and low degree. To this end

345: we prove

346:

347: \begin{theorem} \label{thm:ptf} Let $L$ be a decision list of length $k$.

348: Then $L$ is computed by a polynomial threshold function of degree

349: $\tilde{O}(k^{1/3})$ and weight $2^{\tilde{O}(k^{1/3})}$.  \end{theorem}

350: Theorem \ref{thm:main} follows directly from Theorems \ref{thm:win}

351: and \ref{thm:ptf}.
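
For concreteness, the substitution behind this step can be written out as follows (a sketch of the arithmetic only, with $d = \tilde{O}(k^{1/3})$ and $W = 2^{\tilde{O}(k^{1/3})}$ taken from Theorem \ref{thm:ptf}):
$$
O(W^2 \cdot d \cdot \log n) = 2^{2\tilde{O}(k^{1/3})} \cdot \tilde{O}(k^{1/3}) \cdot \log n = 2^{\tilde{O}(k^{1/3})} \log n,
$$
since the factor $\tilde{O}(k^{1/3})$ is absorbed into the exponent. Multiplying this mistake bound by the $n^{d} = n^{\tilde{O}(k^{1/3})}$ time per example gives total running time
$$
2^{\tilde{O}(k^{1/3})} \log n \cdot n^{\tilde{O}(k^{1/3})} = n^{\tilde{O}(k^{1/3})}
$$
for $n \geq 2$.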

352:

353: Polynomial threshold function constructions have recently been used

354: to obtain the fastest known algorithms for a range

355: of important learning problems such as learning DNF formulas

356: \cite{KlivansServedio:01}, intersections of halfspaces \cite{KOS:02},

357: and Boolean formulas of superconstant depth \cite{OdonnellServedio:03a}.

358: For each of these learning problems the sole goal was to obtain

359: fast learning algorithms, and hence the only parameter of interest in

360: these polynomial threshold function constructions is their degree,

361: since degree bounds translate directly into running time bounds for

362: learning algorithms (see e.g. \cite{KlivansServedio:01}).

363: In contrast, for the decision list problem we are interested in

364: both the running time and the number of examples required for learning.

365: Thus we must bound both the degree and the {\em weight}

366: (magnitude of integer coefficients) of the polynomial threshold

367: functions which we use.

368:

369: Our polynomial threshold function construction is essentially optimal in

370: the tradeoff between degree and weight which it achieves.  In 1994 Beigel

371: gave a lower bound showing that any degree $d$ polynomial threshold

372: function for a particular decision list must have weight

373: $2^{\Omega(n/d^{2})}$. For $d = n^{1/3}$, Beigel's lower bound implies

374: that the construction stated in Theorem \ref{thm:ptf} is essentially

375: optimal.  Furthermore, for any decision list $L$ of length $n$ and any

376: $d \leq n^{1/3}$, we will in fact construct polynomial threshold functions

377: of degree $d$ and weight $2^{\tilde{O}(n/d^{2})}$ computing $L$.

378: Beigel's lower bound thus implies that our degree $d$ polynomial threshold

379: functions are of roughly optimal weight

380: for all $d \leq n^{1/3},$ and hence strongly suggests that our

381: analysis is the best possible for the algorithm we use.
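
The arithmetic behind the matching bounds is short (we ignore logarithmic factors, which is what the $\tilde{O}$ notation hides). At $d = n^{1/3}$ we have $n/d^{2} = n/n^{2/3} = n^{1/3}$, so Beigel's lower bound reads $2^{\Omega(n^{1/3})}$ while our construction has weight $2^{\tilde{O}(n^{1/3})}$. As a second instance, chosen by us purely for illustration, at $d = n^{1/4}$ we have $n/d^{2} = n^{1/2}$, so both bounds are of the form $2^{n^{1/2}}$ up to logarithmic factors in the exponent.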

382:

383: \subsection{Our Results: Parity Functions}

384:

385: For parity functions, we give an $O(n^3)$ time algorithm which can

386: learn an unknown parity on $k$ variables out of $n$ using $O(n^{1-1/k})$ examples.

387: For values of $k = o(\log n)$ the sample complexity of

388: this algorithm is $o(n)$. This is the first algorithm for learning

389: parity on a superconstant number of variables with sublinear sample

390: complexity.
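
To see why the sample complexity is sublinear in this range, write (with logarithms base 2)
$$
n^{1-1/k} = \frac{n}{n^{1/k}} = \frac{n}{2^{(\log n)/k}}.
$$
If $k = o(\log n)$ then $(\log n)/k \to \infty$, so the denominator grows without bound and $n^{1-1/k} = o(n)$. As a numerical illustration of our own, $n = 2^{20}$ and $k = 4$ give $n^{1-1/k} = 2^{15} = 32{,}768$, compared with $n = 1{,}048{,}576$.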

391:

392: The standard PAC learning algorithm for learning an unknown parity function

393: is based on viewing a set of $m$ labelled examples as a system of $m$ linear equations modulo 2.

394: Using Gaussian elimination it is possible to solve the system and find

395: a consistent parity function.  It can be shown that the solution thus

396: obtained is a ``good'' hypothesis if its weight (number of nonzero entries)

397: is small relative to $m$, the number of examples.  However, using Gaussian elimination

398: can result in a solution of weight as large as

399: $\min(m,n)$ even if $k$ (the number of variables in the target parity) is very small.

400: Thus in order for this approach to give a successful learning algorithm, it is necessary to

401: use $m = \Omega(n)$ examples regardless of the value of $k$.

402: In contrast, observe that an attribute efficient algorithm for

403: learning a parity of length $k$ should use only poly$(k,\log n)$ examples.

404:

405: Our algorithm works by finding a ``low weight'' solution to a system of

406: $m$ linear equations.  We prove that with high probability we can find a solution of weight

407: $O(n^{1-1/k})$ irrespective of $m$.  Thus by taking $m$ to be only slightly larger

408: than $n^{1 - 1/k}$ we have that our solution is a ``good'' hypothesis.

409:

410:

411: \subsection{Previous Results: Decision Lists} \label{sec:prevdl}

412:

413:

414: In previous work several algorithms with different performance bounds (in

415: terms of running time and number of examples used) have been given for

416: learning decision lists.

417:

418: \begin{itemize}

419:

420: \item Rivest \cite{Rivest:87} gave the first algorithm for learning

421: decision lists in Valiant's PAC model of learning from random examples.

422: Littlestone \cite{Blum:96} subsequently gave an analogue of Rivest's

423: algorithm in the online learning model. The algorithm can learn any

424: decision list of length $k$ in $O(kn^2)$ time using $O(kn)$ examples.

425:

426: \item A brute-force approach to learning decision lists of length $k$ is

427: to maintain a collection of all such lists which are consistent with the

428: examples seen so far, and to predict at each stage using majority vote

429: over the surviving hypotheses. This ``halving algorithm'' (proposed in

430: various forms by Barzdin and Freivald \cite{BarzdinFreivald:72}, Mitchell

431: \cite{Mitchell:82}, and Angluin \cite{Angluin:88}) can learn decision

432: lists of length $k$ using only $O(k \log n)$ examples, but the running

433: time is $n^{O(k)}.$

434:

435: \item Several researchers \cite{Blum:96,Valiant:99} have observed that

436: Littlestone's well-known Winnow algorithm \cite{Littlestone:88} can learn

437: decision lists of length $k$ from $2^{O(k)} \log n$ examples in time

438: $2^{O(k)} n \log n$. This follows from the observation that decision lists

439: of length $k$ can be viewed as linear threshold functions with integer

440: coefficients of magnitude $2^{\Theta(k)}$. We note that our algorithm in

441: this paper always has improved sample complexity over the basic Winnow

442: algorithm, and for $k \geq (\log n)^{3/2}$ our approach improves on the

443: time complexity of Winnow as well.

444:

445: \item Finally, several researchers have considered the special

446: case of learning a decision list of length $k$ over $n$ variables

447: in which the output bits of the decision list have at most $D$

448: alternations. Valiant \cite{Valiant:99}

449: and Nevo and El-Yaniv \cite{NevoElYaniv:02}

450: have given refined analyses of Winnow's performance for this

451: special case, and Dhagat and Hellerstein \cite{DhagatHellerstein:94}

452: have also studied this problem.  However, for the general case

453: in which $D$ can be as large as $k,$ the results thus obtained

454: do not improve on the straightforward Winnow analysis

455: described in the previous bullet.

456:

457: \end{itemize}

458: These previous algorithmic results are summarized in Table \ref{table:results}.  We observe

459: that all of these earlier algorithms have an exponential dependence on the

460: relevant parameter(s) ($k$ and $\log n$ for sample complexity, $k$ for

461: running time)  for either the running time or the sample complexity.

462:

463:

464: \begin{table}[h]

465: \centerline{

466: \begin{tabular}{|l|l|l|} \hline

467: \strutje Reference: & Number of examples: & Running time: \\

468: \hline\hline

469: \strutje Rivest / Littlestone

470: & $ O(kn)$

471: & $ O(kn^2)  $ \\ \hline

472: \strutje Halving algorithm

473: & $ O(k \log n)$

474: & $ n^{O(k)} $ \\ \hline

475: \strutje Winnow algorithm

476: & $2^{O(k)} \log n$

477: & $2^{O(k)}n \log n$  \\ \hline

478: \strutje This Paper

479: & $ 2^{\tilde{O}(k^{1/3})}\log n $

480: & $ n^{\tilde{O}(k^{1/3})} $  \\ \hline

481: \end{tabular}

482: }

483: \caption{Comparison of known algorithms for

484: learning decision lists of length $k$ on $n$ variables.

485: }

486: \label{table:results}

487: \end{table}

488:

489: \subsection{Previous Results: Parity Functions}

490:

491: Little previous work has been published on learning parity

492: functions attribute efficiently in the PAC model.  The standard PAC learning

493: algorithm for parity (based on solving a system of linear equations) is due

494: to Helmbold {\em et al.\@} \cite{HSW:92}; however as described above this

495: algorithm is not attribute efficient since it uses $\Omega(n)$ examples.

496:

497: Several authors have considered learning parity attribute efficiently in a model

498: where the learner is allowed to make membership queries.  Attribute efficient

499: learning is easier in this framework since membership queries can help identify relevant variables.

500: Blum {\em et al.}\ \cite{BHL:95} give a randomized polynomial time membership-query

501: algorithm for learning parity on $k$ variables using only $O(k \log

502: n)$ examples.  These results were later

503: refined by Uehara {\em et al.} \cite{UTW:97}.

504:

505:

506:

507: \subsection{Organization}

508:

509: In Section \ref{sec:prelims} we give the necessary background on

510: online learning and polynomial threshold functions. In Section

511: \ref{sec:winnow} we show how known results from learning theory enable

512: us to reduce the decision list learning problem to a problem of

513: finding suitable polynomial threshold function representations of

514: decision lists. In Sections \ref{subsec:outer} and \ref{subsec:inner}

515: we give two different proofs of a weak tradeoff between degree and

516: weight for polynomial threshold function representations of decision

517: lists, and in Section \ref{subsec:compose} we combine these techniques

518: to prove Theorem \ref{thm:ptf}. In Section \ref{sec:decisiontree} we

519: show how to apply our techniques to give a tradeoff between sample

520: complexity and running time for learning decision trees. In Section

521: \ref{sec:discuss} we discuss the connection with Beigel's ODDMAXBIT

522: lower bound and related issues.  In Section \ref{sec:parity} we give

523: our new algorithm for learning parity functions, and in Section

524: \ref{sec:future} we suggest directions for future work.

525:

526: \section{Preliminaries} \label{sec:prelims}

527:

528:

529: Attribute efficient learning has been chiefly studied in the {\em on-line

530: mistake-bound} model of concept learning which was introduced in

531: \cite{Littlestone:88,Littlestone:89}.  In this model learning proceeds in

532: a series of trials, where in each trial the learner is given an unlabelled

533: boolean example $x \in \{0,1\}^n$ and must predict the value $f(x)$ of the

534: unknown target function $f.$ After each prediction the learner is given

535: the true value of $f(x)$ and can update its hypothesis before the next

536: trial begins.  The {\em mistake bound} of a learning algorithm on a target

537: concept $c$ is measured by the worst-case number of mistakes that the

538: algorithm makes over all (possibly infinite) sequences of examples, and

539: the mistake bound of a learning algorithm on a concept class (class of

540: Boolean functions) $C$ is the worst-case mistake bound across all

541: functions $f \in C.$ The running time of a learning algorithm $A$ for a

542: concept class $C$ is defined as the product of the mistake bound of $A$ on

543: $C$ times the maximum running time required by $A$ to evaluate its

544: hypothesis and update its hypothesis in any trial.

545:

546:

547: Our main interests in this paper are the classes of {\em decision

548: lists} and {\em parity functions}.

549:

550: A decision list $L$ of length $k$ over the Boolean variables

551: $x_1,\dots,x_n$ is represented by a list of $k$ pairs and a bit

552: $$

553: (\ell_1,b_1),(\ell_2,b_2),\dots,(\ell_k,b_k),b_{k+1}

554: $$

555: where each $\ell_i$ is a literal and each $b_i$ is either $-1$ or $1.$

556: Given any $x \in \{0,1\}^n,$ the value of $L(x)$ is $b_i$ if $i$ is the

557: smallest index such that $\ell_i$ is made true by $x$; if no $\ell_i$ is

558: true then $L(x)=b_{k+1}.$
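
For example (an instance of our own choosing), consider the decision list of length $k=3$ over $x_1,x_2,x_3$ given by
$$
(x_1,+1),(\overline{x}_3,-1),(x_2,+1),-1 .
$$
On $x=(0,1,1)$ the literals $x_1$ and $\overline{x}_3$ are false and $x_2$ is true, so $L(x)=b_3=+1$. On $x=(0,0,0)$ the first true literal is $\overline{x}_3$, so $L(x)=b_2=-1$. On $x=(0,0,1)$ no literal is true, so $L(x)=b_4=-1$.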

559:

560: A parity function of length $k$ is defined by a set of variables $S

561: \subset \{x_{1},\ldots,x_{n}\}$ such that $|S| = k$. The

562: parity function $\chi_{S}(x)$ takes value $1$ on inputs which set

563: an even number of variables in $S$ to $1$ and takes value $-1$ on

564: inputs which set an odd number of variables in $S$ to $1.$
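
For example, with $S=\{x_1,x_3\}$ and $n=4$, the input $x=(1,0,1,0)$ sets two variables of $S$ to $1$ and so $\chi_S(x)=1$, while $x=(1,1,0,0)$ sets one variable of $S$ to $1$ and so $\chi_S(x)=-1$. Equivalently,
$$
\chi_S(x) = (-1)^{\sum_{x_i \in S} x_i},
$$
which is the $\{-1,1\}$-valued form of the sum modulo $2$ used in the Introduction.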

565:

566: Given a concept class $C$ over $\{0,1\}^n$ and a Boolean function $f \in

567: C,$ let size$(f)$ denote the description length of $f$ under some

568: reasonable encoding scheme.  (Note that if $f$ has $r$ relevant variables

569: then size$(f)$ will be at least $r \log n$ since this many bits are

570: required just to specify which variables are relevant).  We say that a

571: learning algorithm $A$ for $C$ in the mistake-bound model is {\em

572: attribute-efficient} if the mistake bound of $A$ on any concept $c \in C$

573: is polynomial in size$(c).$ In particular, the description length of a

574: length $k$ decision list (parity) is $O(k \log n)$, and thus we would ideally like

575: to have an algorithm which learns decision lists (parities) of length $k$ with a

576: mistake bound of poly$(k,\log n)$ and runs in time poly$(n).$

577:

578:

579: (We note here that attribute efficiency has also been studied in other

580: learning models, notably Valiant's Probably Approximately Correct (PAC)

581: model of learning from random examples.  Standard conversion techniques

582: are known \cite{Angluin:88,Haussler:88b,Littlestone:89b}

583: which can be used to

584: transform any mistake bound algorithm into a PAC learning algorithm.

585: This transformation essentially preserves the running time of the mistake

586: bound algorithm, and the sample size required by the PAC algorithm is

587: essentially the mistake bound. Thus, positive results for mistake bound

588: learning, such as those we give for decision lists in this paper, directly yield

589: corresponding positive results for the PAC model.)

590:

591: Finally, our results for decision lists are achieved by a careful

592: analysis of {\em polynomial threshold functions}.  Let $f$ be a

593: Boolean function $f:\{0,1\}^{n} \to \{-1,1\}$ and let $p$ be a

594: polynomial in $n$ variables with integer coefficients. Let $d$ denote

595: the degree of $p$ and let $W$ denote the sum of the absolute values of

596: $p$'s integer coefficients. If the sign of $p(x)$ equals $f(x)$ for

597: every $x \in \{0,1\}^n,$ then we say that $p$ is a {\em polynomial

598: threshold function} of degree $d$ and weight $W$ for $f.$
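
As a small example (ours, for illustration), let $f = \chi_S$ for $S=\{x_1,x_2\}$ and consider
$$
p(x) = (1-2x_1)(1-2x_2) = 1 - 2x_1 - 2x_2 + 4x_1x_2 .
$$
Then $p(0,0)=1$, $p(1,0)=p(0,1)=-1$ and $p(1,1)=1$, so the sign of $p(x)$ equals $f(x)$ on all of $\{0,1\}^2$. Thus $p$ is a polynomial threshold function for $f$ of degree $d=2$ and, counting the constant term among the coefficients (an assumption on our part about the convention), weight $W = 1+2+2+4 = 9$.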

599:

600:

601: \section{Expanded-Winnow: Learning Polynomial Threshold Functions} \label{sec:winnow}

602:

603: Littlestone introduced the online Winnow algorithm in 1988 and showed

604: that it can attribute efficiently learn Boolean conjunctions,

605: disjunctions, and low weight linear threshold functions.  Throughout

606: its execution Winnow maintains a linear threshold function as its

607: hypothesis; at the heart of the algorithm is a novel update rule which

608: makes a {\em multiplicative} update to each coefficient of the

609: hypothesis (rather than an additive update as in the Perceptron

610: algorithm) each time a mistake is made.  Since its introduction Winnow

611: has been intensively studied from both applied and theoretical

612: standpoints (see

613: e.g. \cite{Blum:97,GoldingRoth:99,KWA:97,Servedio:02sicomp}) and

614: multiplicative updates have become widespread in machine learning

615: algorithms.

616:

617: The following theorem (which, as noted in \cite{Valiant:99}, is implicit

618: in Littlestone's analysis in \cite{Littlestone:88}) gives a

619: mistake bound for Winnow when learning linear threshold functions:

620:

621: \begin{theorem} \label{thm:winbound}

622: Let $f(x)$ be the linear threshold function

623: sign$(\sum_{i=1}^{n} w_{i}x_{i} - \theta)$

624: where $\theta$ and $w_{1},\ldots,w_{n}$ are

625: integers. Let $W = \sum_{i=1}^{n} |w_{i}|$. Then

626: Winnow learns $f(x)$ with mistake bound $O(W^{2} \log n)$,

627: and uses $n$ time steps per example.

628: \end{theorem}

629:

630: We will use a generalization of the Winnow algorithm, called

631: Expanded-Winnow, to learn {\em polynomial} threshold functions of

632: degree at most $d.$ Our generalization introduces $\sum_{i=1}^{d} {n

633: \choose i}$ new variables (one for each monomial of degree up to $d$)

634: and runs Winnow to learn a linear threshold function over these new

635: variables.  More precisely, in each trial we convert the $n$-bit

636: received example $x=(x_1,\dots,x_n)$ into a $\sum_{i=1}^d {n \choose

637: i}$-bit expanded example (where the bits in the expanded example

638: correspond to monomials over $x_1,\dots,x_n$), and we give the

639: expanded example to Winnow.  Thus the hypothesis which Winnow

640: maintains -- a linear threshold function over the space of expanded

641: features -- is a polynomial threshold function of degree $d$ over the

642: original $n$ variables $x_1,\dots,x_n.$ Theorem \ref{thm:win}, which

643: follows directly from Theorem \ref{thm:winbound}, summarizes the

644: performance of Expanded-Winnow:

645:

646: \medskip

647:

648: \noindent {\bf Theorem \ref{thm:win}}

649: {\em Let ${\cal C}$ be a class of Boolean functions over

650: $\{0,1\}^n$ with the property that each $f \in {\cal C}$ has a polynomial

651: threshold function of degree at most $d$ and weight at most $W.$ Then

652: the Expanded-Winnow algorithm runs in $n^d$

653: time per example and has mistake bound $O(W^{2} \cdot d \cdot \log n)$ for

654: ${\cal C}.$

655: } \\

656:

657: Theorem \ref{thm:win} shows that the degree of a polynomial threshold

658: function corresponds to Expanded-Winnow's running time, and the weight of

659: a polynomial threshold function corresponds to its sample complexity.
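
\medskip \noindent {\bf Example.}  As a small illustration (the values $n=3$ and $d=2$ are chosen here only for concreteness), the monomials of degree between $1$ and $2$ over $x_1,x_2,x_3$ are
$$
x_1,\ x_2,\ x_3,\ x_1x_2,\ x_1x_3,\ x_2x_3,
$$
so there are ${3 \choose 1} + {3 \choose 2} = 6$ expanded features, and under this ordering the example $x=(1,0,1)$ is expanded to $(1,0,1,0,1,0)$.  The mistake bound can be sketched in the same way: writing $N = \sum_{i=1}^{d} {n \choose i}$ for the number of expanded features, each term satisfies ${n \choose i} \leq n^d$, so
$$
N \leq d \cdot n^d
\mbox{~~~~~and hence~~~~~}
\log N \leq d \log n + \log d = O(d \log n).
$$
Applying Theorem \ref{thm:winbound} over the $N$ expanded variables then gives a mistake bound of $O(W^2 \log N) = O(W^2 \cdot d \cdot \log n)$.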

660:

661: \ignore{

662:

663: \begin{figure*}[t] \label{fig:vw}

664: \begin{small}

665:

666: \noindent {\bf Algorithm V-Winnow:} \\

667:

668: \noindent {\bf Input: } A sequence of trials from a polynomial $p$ in $n$ variables $\{x_{1},\ldots,x_{n}\}$ of degree $d$ where each \mbox{~~~~~~~~~~~~~~}coefficient is at most $w$.

669:

670: \vskip.1in

671:

672: \noindent {\bf Output: } A polynomial $p'$ in $n$ variables of degree $d$

673: such that for every $x \in \{0,1\}^{n}$, $p'(x) = p(x)$.

674:

675: \medskip

676:

677: \begin{enumerate}

678:

679: \item Lexicographically order all $m = n^{d}$ monomials of degree at most

680: $d$ over the variables $\{x_{1},\ldots,x_{n}\}$.

681:

682: \item Introduce new variables $y_{1},\ldots,y_{m}$ such that $y_{i}$ is

683: equal to the $i$th monomial in Step 1.

684:

685: \item Run Winnow over the variables $y_{1},\ldots,y_{m}$ where on example

686: $(a,f(a))$, $y_{i}$ is equal to the $i$th monomial on assignment $a$.

687:

688: \item Let $h = \sum_{i=1}^{m} \alpha_{i}y_{i}$ be the output of Winnow.

689:

690: \item Return $h$ with each $y_{i}$ written as the $i$th monomial over

691: $\{x_{1},\ldots,x_{n}\}$.

692:

693: \end{enumerate}

694:

695: \end{small}

696: \caption{The V-Winnow algorithm.}

697: \end{figure*}

698:

699:

700: \begin{theorem} \label{thm:vwbound}

701: Let ${\cal C}$ be a class of Boolean functions over $\{0,1\}^n$

702: with the property that for each $f \in {\cal C}$,

703:

704: \begin{itemize}

705:

706: \item $f$ depends on at most $k$ variables

707:

708: \item $f$ is computed by a polynomial threshold function of degree at most

709: $d$ where each coefficient is an integer weight of at most $w$.

710:

711: \end{itemize}

712: Then {\tt V-Winnow} is an online learning algorithm for ${\cal C}$ which

713: uses $n^d$ time steps per example and has mistake bound $(w \cdot

714: k^{d})^{2} \cdot d \cdot \log n.$ The output hypothesis will be a

715: polynomial threshold function equivalent to $f$.

716:

717: \end{theorem}

718:

719: \begin{proof}

720: Let $f$ be a function of $k$ variables computed by a polynomial threshold

721: function $p$ of degree $d$ where each coefficient is of weight at most

722: $w$. We will now apply the algorithm {\tt V-Winnow} outlined in Figure

723: \ref{fig:vw}.  Fix a lexicographic ordering of all monomials of degree $d$

724: over $n$ variables and let $y_{i}$ be the $i$th monomial in this list.

725: Then $f$ can be written as a linear threshold function $h$ over the

726: variables $y_{i}$, i.e. $f = h = \sum_{i=1}^{m} a_{i}y_{i}$ for some

727: integer coefficients $a_{i} \leq w$. Since $f$ depends on only $k$

728: variables, at most $k^{d}$ of the variables in $h$ have nonzero

729: coefficients. Now run the standard Winnow algorithm to learn $h$ (for

730: every example $(a_{1},\ldots,a_{n}, f(a_{1},\ldots,a_{n}))$, set $y_{i}$

731: equal to the $i$th monomial on input $a_{1},\ldots,a_{n}$.)  Applying

732: Theorem \ref{thm:winbound}, the standard Winnow algorithm (and hence

733: V-Winnow) will make at most $(w \cdot k^{d})^{2} \cdot d \cdot \log n$

734: mistakes and output a linear threshold function over the $y_{i}$'s

735: equivalent to $h$. Replacing each $y_{i}$ with the $i$th monomial over

736: $\{x_{1},\ldots,x_{n}\}$ we obtain a polynomial threshold function

737: equivalent to $f$. The time bound also follows directly from Theorem

738: \ref{thm:winbound}.

739: \end{proof}

740:

741: }

742:

743: \section{Constructing Polynomial Threshold Functions for Decision Lists}

744:

745: In previous constructions of polynomial threshold functions for

746: computational learning theory applications

747: \cite{KlivansServedio:01,KOS:02,OdonnellServedio:03a} the sole goal has

748: been to minimize the {degree} of the polynomials regardless of the size of

749: the coefficients.  As an extreme example, the construction of

750: \cite{KlivansServedio:01} of $\tilde{O}(n^{1/3})$ degree polynomial

751: threshold functions for DNF formulae yields polynomials whose coefficients

752: can be {\em doubly exponential} in the degree. In contrast,

753: given Theorem \ref{thm:win} we must now

754: construct polynomial threshold functions that have low degree and low

755: weight.

756:

757: We give two constructions of polynomial threshold functions for decision lists, each of which

758: has relatively low degree \ignore{($k^{1/2}$)}

759: and relatively low weight.

760: \ignore{($2^{\tilde{O}(k^{1/2})}$).}

761: We then combine

762: these approaches to achieve an optimal construction with improved bounds on both

763: degree and weight.\ignore{with degree $k^{1/3}$

764: and weight $2^{\tilde{O}(k^{1/3})}.$}

765:

766: \subsection{Outer Construction} \label{subsec:outer}

767:

768: Let $L$ be a decision list of length $k$ over variables $x_1,\dots,x_k.$

769: We first give a simple construction of a degree $h$, weight $4 \cdot

770: 2^{k/h + h}$ polynomial threshold function for $L$ which is based on

771: breaking the list $L$ into sublists.  We call this construction the

772: ``outer construction'' since we will ultimately combine this construction

773: with a different construction for the ``inner'' sublists.

774:

775: We begin by showing that $L$ can be expressed as a threshold of {\em

776: modified decision lists} which we now define.  The set ${\cal B}_h$ of

777: modified decision lists is defined as follows:

778: each function in ${\cal B}_h$ is a decision list

779: $(\ell_1,b_1),(\ell_2,b_2),\dots, (\ell_h,b_h),0$ where each $\ell_i$ is

780: some literal over $x_1,\dots,x_n$ and each $b_i \in \{-1,1\}.$ Thus the

781: only difference between a modified decision list $f \in {\cal B}_h$ and a

782: normal decision list of length $h$ is that the final output value is

783: $0$ rather than $b_{h+1} \in \{-1,+1\}.$

784:

785: Without loss of generality we may suppose that the list $L$ is

786: $(x_1,b_1),\dots,(x_k,b_k),b_{k+1}.$ We break $L$ sequentially into $k/h$

787: blocks each of length $h$. Let $f_{i} \in {\cal B}_h$ be the modified

788: decision list which corresponds to the $i$-th block of $L,$ i.e. $f_i$ is

789: the list $(x_{(i-1) h + 1},b_{(i-1)h+1}),\ldots, (x_{ih},b_{ih}),0$.

790: Intuitively $f_{i}$ computes the $i$th block of $L$

791: and equals $0$ only if we ``fall off the edge'' of the $i$th block. We then

792: have the following straightforward claim:

793:

794: \begin{claim} \label{cla:outer}

795: The decision list $L$ is equivalent to

796: \begin{eqnarray}

797: \mbox{sign}\left(\sum_{i=1}^{k/h}

798: 2^{k/h - i + 1} f_{i}(x) \ + \  b_{k+1} \right). \label{eq:outer}

799: \end{eqnarray}

800: \end{claim}

801: \begin{proof}

802: Given an input $x \neq 0^k$ let $r=(i-1)h + c$ be the first index such that $x_r$ is satisfied.

803: It is easy to see that $f_j(x) = 0$ for $j<i$ and hence the value in

804: (\ref{eq:outer}) is $2^{k/h - i + 1}b_{r} + \sum_{j=i+1}^{k/h}

805: 2^{k/h - j + 1} f_{j}(x) \ + \  b_{k+1}$,

806: the sign of which is easily seen to be $b_r.$

807: Finally if $x=0^k$ then the argument to (\ref{eq:outer}) is $b_{k+1}$.

808: \end{proof}
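
\medskip \noindent {\bf Example.}  To illustrate Claim \ref{cla:outer}, take $k=4$ and $h=2$ (values chosen here only for concreteness), so that $L$ is $(x_1,b_1),(x_2,b_2),(x_3,b_3),(x_4,b_4),b_5$ with blocks $f_1 = (x_1,b_1),(x_2,b_2),0$ and $f_2 = (x_3,b_3),(x_4,b_4),0$.  Expression (\ref{eq:outer}) then reads
$$
\mbox{sign}\left(4 f_1(x) + 2 f_2(x) + b_5\right).
$$
On $x=(0,1,0,1)$ we have $f_1(x) = b_2$ and $f_2(x) = b_4$, so the argument is $4b_2 + 2b_4 + b_5$; since $|2b_4 + b_5| \leq 3 < 4$, its sign is $b_2 = L(x)$.  On $x=(0,0,0,1)$ we have $f_1(x) = 0$ and $f_2(x) = b_4$, so the argument is $2b_4 + b_5$, whose sign is $b_4 = L(x)$.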

809:

810: \medskip \noindent {\bf Note:}  It is easily seen that we can replace

811: the $2$ in formula (\ref{eq:outer}) by a 3; this will prove

812: useful later.

813:

814: \medskip

815:

816: As an aside, note that Claim \ref{cla:outer} can already be used to obtain a tradeoff

817: between running time and sample complexity for learning decision lists.

818: The class ${\cal B}_h$ contains at most $(4n)^h$ functions.

819: Thus as in Section \ref{sec:winnow}

820: it is possible to run the Winnow algorithm using the functions in ${\cal B}_h$ as the base features

821: for Winnow.  (So for each example $x$ which it receives, the algorithm would first compute

822: the value of $f(x)$ for each $f \in {\cal B}_h$, and would then use this vector of $(f(x))_{f \in {\cal B}_h}$

823: values as the example point for Winnow.)  A direct analogue of Theorem

824: \ref{thm:win} now implies

825: that Expanded-Winnow (run over this expanded feature space of functions from

826: ${\cal B}_h$) can be used to learn

827: $L_k$ in time $n^{O(h)}2^{O(k/h)}$ with mistake bound $2^{O(k/h)} h \log n$.

828:

829: However, it will be more useful for us to obtain a polynomial threshold function for $L$.  We

830: can do this from Claim \ref{cla:outer} as follows:

831:

832:

833: \begin{theorem} \label{thm:outer}

834: Let $L$ be a decision list of length $k$.  Then for any $h < k$

835: we have that $L$ is computed by a

836: polynomial threshold function of degree $h$

837: and weight $4 \cdot 2^{k/h + h}$.

838: \end{theorem}

839:

840: \begin{proof}

841: Consider the first modified decision list $f_1 = (\ell_1,b_1),(\ell_2,b_2),\dots,(\ell_h,b_h),0$

842: in the expression (\ref{eq:outer}).  For $\ell$ a literal let $\tilde{\ell}$ denote $x$

843: if $\ell$ is an unnegated variable $x$ and let $\tilde{\ell}$ denote $1-x$ if

844: $\ell$ is a negated variable $\overline{x}.$

845: We have that for all $x \in \{0,1\}^h$, $f_1(x)$ is computed exactly by

846: the polynomial

847: $$

848: f_1(x) = \tilde{\ell}_1b_1 + (1-\tilde{\ell}_1)\tilde{\ell}_2 b_2 +

849: (1-\tilde{\ell}_1)(1-\tilde{\ell}_2)\tilde{\ell}_3 b_3 + \cdots +

850: (1-\tilde{\ell}_1)\cdots(1-\tilde{\ell}_{h-1})\tilde{\ell}_h b_h.

851: $$

852: This polynomial has degree $h$ and has weight at most $2^{h+1}.$

853: Summing these polynomial representations for $f_1,\dots,f_{k/h}$

854: as in (\ref{eq:outer}) we see

855: that the resulting polynomial threshold function given by (\ref{eq:outer})

856: has degree $h$ and weight at most $2^{k/h + 1} \cdot 2^{h+1} =

857: 4 \cdot 2^{k/h + h}.$

858: \end{proof}
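
\medskip \noindent {\bf Example.}  As a sanity check of these bounds, consider the illustrative case $k=4$, $h=2$ with the list $(x_1,b_1),\dots,(x_4,b_4),b_5$ (this instance is chosen here only for concreteness).  The first block is computed by
$$
f_1(x) = b_1x_1 + b_2(1-x_1)x_2 = b_1x_1 + b_2x_2 - b_2x_1x_2,
$$
which has degree $2$ and weight $3 \leq 2^{h+1} = 8$, and similarly for $f_2$.  Substituting into (\ref{eq:outer}) gives
$$
4b_1x_1 + 4b_2x_2 - 4b_2x_1x_2 + 2b_3x_3 + 2b_4x_4 - 2b_4x_3x_4 + b_5,
$$
a polynomial of degree $2$ and weight $12 + 6 + 1 = 19$, comfortably below the bound $4 \cdot 2^{k/h + h} = 64$.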

859:

860: \medskip

861:

862: Specializing to the case $h=\sqrt{k}$ we obtain:

863:

864: \begin{corollary} \label{cor:outer}

865: Let $L$ be a decision list of length $k$.

866: Then $L$ is computed by a polynomial threshold function of

867: degree $k^{1/2}$ and weight $4 \cdot 2^{2k^{1/2}}.$

868: \end{corollary}

869:

870: We close this section by observing that an intermediate result

871: of \cite{KlivansServedio:01} can be used to give an alternate proof

872: of Corollary \ref{cor:outer} with slightly weaker parameters;

873: see Appendix \ref{ap:alt}.

874:

875: \subsection{Inner Approximator} \label{subsec:inner}

876:

877: In this section we construct low degree, low weight

878: polynomials which approximate (in the $L_\infty$ norm)

879: the modified decision lists from the previous subsection.  Moreover,

880: the polynomials we construct

881: are exactly correct on inputs which ``fall off the end'':

882: \ignore{

883: We refer to these modified decision lists as the ``inner'' decision lists.

884: The construction is stronger than a polynomial threshold function;

885: the polynomial we give for an inner decision list is actually

886: a good approximator with respect to the

887: $L_{\infty}$ norm (and is exactly right on the input $0^h$):

888: }

889:

890: \begin{theorem} \label{thm:inner}

891: Let $f \in {\cal B}_h$ be a modified decision list of length $h$

892: (without loss of generality we may assume that $f$ is

893: $(x_1,b_1),\dots,(x_h,b_h),0$).

894: Then there is a degree $2\sqrt{h}\log{h}$

895: polynomial $p$ such that

896: \begin{itemize}

897: \item for every input $x \in \{0,1\}^h$ we have $|p(x) - f(x)| \leq 1/h$.

898: \item $p(0^h) = f(0^h) = 0$.

899: \end{itemize}

900: \end{theorem}

901: \begin{proof}

902: As in the proof of Theorem \ref{thm:outer} we have that

903: \[ f(x) = b_{1}x_{1} + b_{2}(1-x_{1})x_{2} + \cdots +

904: b_{h}(1-x_{1})\cdots(1-x_{h-1})x_{h}.

905: \]

906: We will construct a lower (roughly $\sqrt{h}$) degree polynomial which

907: closely approximates $f$.  Let $T_{i}$ denote $(1-x_1)\dots(1-x_{i-1})x_i$,

908: so we can rewrite $f$ as

909: \[ f(x) = b_{1}T_{1} + b_{2}T_{2} + \cdots + b_{h}T_{h}. \]

910:

911: We approximate each $T_i$ separately as follows:

912: set $A_{i}(x) = h-i  + x_{i} + \sum_{j=1}^{i-1} (1 - x_{j})$.

913: Note that for $x \in \{0,1\}^h,$ we have

914: $T_i(x) = 1$ iff $A_i(x) = h$ and $T_i(x) = 0$

915: iff $0 \leq A_i(x) \leq h-1.$

916: Now define the polynomial

917: $$

918: Q_{i}(x) = q \left(A_{i}(x)/h \right)  \mbox{~~~~~where~~~~~}

919: q(y) = C_d\left(y \left(1 + 1/h \right) \right).

920: $$

921:

922: \noindent As in \cite{KlivansServedio:01},

923: here $C_{d}(x)$ is the $d$th Chebyshev polynomial of the

924: first kind (a univariate polynomial of degree $d$)

925: with $d$ set to $\lceil \sqrt{h} \rceil$.

926: We will need the following facts about Chebyshev polynomials

927: \cite{Cheney:66}:

928: \begin{itemize}

929: \item $|C_d(x)| \leq 1$ for $|x| \leq 1$ with $C_d(1) = 1;$

930: \item $C_d^\prime(x) \geq d^2$ for $x > 1$ with $C_d^\prime(1) = d^2.$

931: \item The coefficients of $C_{d}$ are integers each of whose

932: magnitude is at most $2^d$.

933: \end{itemize}

934: These first two facts imply that $q(1) \geq 2$ but $|q(y)| \leq 1$

935: for $y \in [0,1 - {\frac 1 h}].$  We

936: thus have that $Q_i(x) = q(1) \geq 2$ if $T_i(x) = 1$

937: and $|Q_i(x)| \leq 1$ if $T_i(x) = 0.$

938: Now define

939: $

940: P_i(x) = \left({\frac {Q_i(x)}{q(1)}}\right)^{2 \log h}.

941: $

942: This polynomial is easily seen to be a good approximator for $T_i$:

943: if $x \in \{0,1\}^h$ is such that $T_i(x) = 1$ then $P_i(x) = 1$,

944: and if $x \in \{0,1\}^h$ is such that $T_i(x) = 0$ then

945: $|P_i(x)| < \left({\frac 1 2}\right)^{2 \log h} < {\frac 1 {h^2}}.$

946:

947: Now define

948: $R(x) = \sum_{i=1}^{h} b_iP_{i}(x)$ and $p(x) = R(x) - R(0^h).$

949: \ignore{

950: We will see that $Q_{i}(x) > 2$ on assignments $x$ for which

951: $T_{i}(x)=0$, while $|Q_i(x)|\leq 1$ on assignments for which

952: $T_{i}(x)$ output $s_{i}$. To

953: strengthen this separation we define the following polynomial

954: $P_{i}(x) = (1/\ell^{2}) Q_{i}(x)^{2 \log \ell}$ and to approximate

955: all of $b$ we set $R(x) = \sum_{i=1}^{\ell} P_{i}(x)$.

956: }

957: It is clear that $p(0^h)=0.$

958: We will show that for every input $0^h \neq x \in \{0,1\}^h$ we have

959: $|p(x) - f(x)| \leq {1/h}$. Fix some such $x$; let $i$ be the first

960: index such that $x_i = 1.$  As shown above we have

961: $P_i(x) = 1.$  Moreover, by inspection of $T_j(x)$ we have that

962: $T_j(x) = 0$ for all $j \neq i,$

963: and hence $|P_j(x)| < {\frac 1 {h^2}}$.  Consequently

964: the value of $R(x)$ must lie in $[b_i - {\frac {h-1}{h^2}},

965: b_i + {\frac {h-1}{h^2}}]$.  Since $f(x) = b_i$ we have that

966: $p(x)$ is an $L_\infty$ approximator for $f(x)$ as desired.

967:

968: Finally, it is straightforward to verify that $p(x)$ has the claimed

969: bound on degree.

970: \end{proof}
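
\medskip \noindent {\bf Example.}  The separation used in the proof can be checked by hand in a small case; we take $h=4$ here purely for illustration, so that $d = \lceil \sqrt{h} \rceil = 2$, $2\log h = 4$, and $C_2(t) = 2t^2 - 1$.  Then
$$
q(y) = C_2\left({\frac {5y} 4}\right) = {\frac {25} 8}y^2 - 1,
\mbox{~~~~~so~~~~~}
q(1) = {\frac {17} 8} \geq 2,
$$
while on the remaining possible values $y = A_i(x)/h \in \{0, {\frac 1 4}, {\frac 1 2}, {\frac 3 4}\}$ we get
$$
q(0) = -1, \quad
q\left({\frac 1 4}\right) = -{\frac {103}{128}}, \quad
q\left({\frac 1 2}\right) = -{\frac {7}{32}}, \quad
q\left({\frac 3 4}\right) = {\frac {97}{128}},
$$
all of magnitude at most $1$.  Hence whenever $T_i(x) = 0$ we have $|P_i(x)| \leq (8/17)^4 = 4096/83521 < 1/16 = 1/h^2$.  Each $P_i$ here has degree $2 \cdot 4 = 8 = 2\sqrt{h}\log h$, which exceeds $h$ itself; the savings in degree only appear for larger $h$.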

971:

972: \ignore{

973: \noindent Now fix any nonzero assignment to the variables $x$ that

974: causes $b$ to output $1$.  From the definition of $b$ there exists a

975: unique term $T_{i}$ that is not set to zero by $x$. Then for the

976: corresponding arithmetization $A_{i}$ we have $A_{i}/i= 1$, so $2 \leq

977: Q_{i}(x) \leq 2.01 $ and hence $1 \leq P_{i}(x) \leq 1.1$. Similarly

978: if $x$ causes $b$ to output $-1$ then $-1 \leq P_{i}(x) \leq -.9$. \\

979:

980: \noindent Let $T_{j}$ be any term that is set to zero by x, and so

981: $A_{j}(x) \leq 1 - 1/\ell$. Then $|Q_{i}(x)| \leq 1$ and thus

982: $|P_{i}(x)| \leq 1/\ell^{2}$. Hence for any nonzero assignment $x$,

983: $|R(x) - b(x)| \leq \mbox{{\bf $\eps$ from cheby approx +

984: $1/\ell$}}$. Notice also that $|R(\overline{0})| \leq 1/\ell.$ Thus

985: for any nonzero assignment $x$, $|H(x) - b(x)| \leq 2/\ell$ and

986: clearly $H(\overline{0}) = 0$.

987: }

988:

989: \medskip

990:

991: Strictly speaking we cannot discuss the weight of the polynomial

992: $p$ since its coefficients are rational numbers but not

993: integers.  However, by multiplying $p$ by a suitable integer

994: (clearing denominators) we obtain an integer polynomial

995: with essentially the same properties.

996: Using the third fact about Chebyshev polynomials from our

997: proof above, we have that $q(1)$ is a rational number $N_1/N_2$ where

998: $N_1,N_2$ are each integers of magnitude $h^{O(\sqrt{h})}.$

999: Each $Q_i(x)$ for $i=1,\dots,h$ can be written as an integer

1000: polynomial (of weight $h^{O(\sqrt{h})}$) divided by $h^{\sqrt{h}}.$

1001: Thus each $P_i(x)$ can be written as

1002: $\tilde{P}_i(x)/(h^{\sqrt{h}}N_1)^{2 \log h}$ where $\tilde{P}_i(x)$

1003: is an integer polynomial of weight $h^{O(\sqrt{h} \log h)}$.

1004: It follows that $p(x)$ equals $\tilde{p}(x)/C,$ where $C$

1005: is an integer which is at most $2^{O(h^{1/2} \log^2 h)}$

1006: and $\tilde{p}$ is a polynomial with integer coefficients and weight

1007: $2^{O(h^{1/2} \log^2 h)}.$  We thus have

1008:

1009: \begin{corollary}

1010: \label{cor:inner}

1011: Let $f \in {\cal B}_h$ be a modified decision list of length $h$.

1012: Then there is an integer polynomial

1013: $p(x)$

1014: of degree $2\sqrt{h}\log{h}$

1015: and weight $2^{O(h^{1/2} \log^2{h})}$ and an integer $C =

1016: 2^{O(h^{1/2} \log^2 h)}$ such that

1017: \begin{itemize}

1018: \item for every input $x \in \{0,1\}^h$ we have $|p(x) - Cf(x)| \leq C/h$.

1019: \item $p(0^h) = f(0^h) = 0$.

1020: \end{itemize}

1021: \end{corollary}

1022:

1023: The fact that $p(0^h)$ is exactly 0

1024: will be important in the next subsection when we combine the

1025: inner approximator with the outer construction.

1026:

1027: \subsection{Composing the Constructions} \label{subsec:compose}

1028:

1029: In this section we combine the two constructions from the previous

1030: subsections to obtain our main polynomial threshold construction:

1031:

1032: \begin{theorem} \label{thm:mainptf}

1033: Let $L$ be a decision list of length $k$.  Then for any $h < k$,

1034: $L$ is computed by a polynomial threshold function of degree

1035: $O(h^{1/2} \log h)$

1036: and weight $2^{O(k/h + h^{1/2}\log^2 h)}.$

1037: \end{theorem}

1038: \begin{proof}

1039: We suppose without loss of generality that $L$ is the decision list

1040: $(x_1,b_1),\dots,(x_k,b_k),b_{k+1}.$

1041: We begin with the outer construction: from the note following

1042: Claim \ref{cla:outer} we have that

1043: $$L(x) =

1044: \mbox{sign}\left(C\left[\sum_{i=1}^{k/h}

1045: 3^{k/h - i + 1} f_{i}(x) \ + \  b_{k+1} \right]\right)

1046: $$

1047: where $C$ is the value from Corollary \ref{cor:inner} and

1048: each $f_{i}$ is a modified decision list of length $h$

1049: computing the restriction of $L$ to its $i$th block as defined in

1050: Subsection \ref{subsec:outer}.

1051: Now we use the inner approximator to replace each $Cf_i$ above

1052: by $p_i$, the approximating polynomial from Corollary

1053: \ref{cor:inner}, i.e. consider sign$(H(x))$ where

1054: $$

1055: H(x) = \sum_{i=1}^{k/h}

1056: (3^{k/h - i + 1} p_{i}(x)) \ + \  Cb_{k+1}.

1057: $$

1058: We will show that sign$(H(x))$

1059: is a polynomial threshold function which computes $L$ correctly

1060: and has the desired degree and weight.

1061:

1062: Fix any $x \in \{0,1\}^k.$  If $x=0^k$ then by Corollary

1063: \ref{cor:inner} each $p_i(x)$ is $0$ so $H(x) = C b_{k+1}$ has

1064: the right sign.

1065: Now suppose that $r=(i-1)h+c$ is the first index such that

1066: $x_r = 1.$  By Corollary \ref{cor:inner}, we have that

1067: \begin{itemize}

1068: \item $3^{k/h - j + 1}p_j(x) = 0$ for $j < i$;

1069: \item $3^{k/h - i + 1}p_i(x)$ differs from $3^{k/h - i + 1}Cb_r$ by at most

1070: $C3^{k/h - i + 1}\cdot {\frac 1 h}$;

1071: \item The magnitude of each value $3^{k/h - j + 1}p_j(x)$ is at most

1072: $C3^{k/h - j + 1}(1 + {\frac 1 h})$ for $j > i.$

1073: \end{itemize}

1074: Combining these bounds,

1075: the value of $H(x)$ differs from $3^{k/h - i + 1}Cb_r$ by at most

1076: $$

1077: C\left(

1078: {\frac {3^{k/h - i + 1}}{h}} +

1079: \left(1 + {\frac 1 h}\right)

1080: \left[3^{k/h - i} + 3^{k/h - i - 1} + \cdots + 3\right] + 1

1081: \right)

1082: $$

1083: which is easily seen to be less than $C3^{k/h - i + 1}$ in magnitude.

1084: Thus the sign of $H(x)$ equals $b_r$, and consequently sign$(H(x))$ is a

1085: valid polynomial threshold representation for $L(x).$  Finally,

1086: our degree and weight bounds from Corollary \ref{cor:inner}

1087: imply that

1088: the degree of $H(x)$ is $O(h^{1/2} \log h)$ and the weight

1089: of $H(x)$ is $2^{O(k/h) + O(h^{1/2}\log^2 h)}$, and the theorem

1090: is proved.

1091: \end{proof}
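
\medskip \noindent {\bf Remark.}  The final inequality in the proof above can be verified as follows; this is a sketch under the additional assumption $h \geq 3$, which is not stated explicitly in the theorem.  Writing $m = k/h - i + 1$, we have $3^{m-1} + 3^{m-2} + \cdots + 3 = (3^m - 3)/2$, so the quantity in parentheses equals
$$
{\frac {3^m} h} + \left(1 + {\frac 1 h}\right){\frac {3^m - 3} 2} + 1
= 3^m\left({\frac 1 2} + {\frac 3 {2h}}\right) - {\frac 1 2} - {\frac 3 {2h}}.
$$
For $h \geq 3$ we have ${\frac 1 2} + {\frac 3 {2h}} \leq 1$, so this quantity is strictly less than $3^m$, and the deviation of $H(x)$ from $3^{m}Cb_r$ is less than $C3^{m}$ as claimed.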

1092:

1093: \medskip

1094:

1095: Taking $h = k^{2/3} / \log^{4/3}k$ in the above theorem we obtain our

1096: main result on representing decision lists as polynomial threshold

1097: functions:

1098:

1099: \medskip

1100:

1101: \noindent {\bf Theorem \ref{thm:ptf}}

1102: {\em Let $L$ be a decision list of length $k$.  Then

1103: $L$ is computed by a polynomial threshold function

1104: of degree $k^{1/3} \log^{1/3} k$ and weight

1105: $2^{O(k^{1/3} \log^{4/3} k)}.$

1106: } \\

1107:

1108:

1109: Theorem \ref{thm:ptf} immediately implies that Expanded-Winnow can learn decision lists of length $k$ using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$.
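
\medskip \noindent {\bf Remark.}  For completeness, the parameter calculation behind this choice of $h$ can be sketched as follows.  With $h = k^{2/3}/\log^{4/3} k$ we have
$$
{\frac k h} = k^{1/3}\log^{4/3} k
\mbox{~~~~~and~~~~~}
h^{1/2} = {\frac {k^{1/3}}{\log^{2/3} k}}.
$$
Since $\log h \leq \log k$, this gives $h^{1/2}\log h \leq k^{1/3}\log^{1/3} k$ for the degree and $h^{1/2}\log^2 h \leq k^{1/3}\log^{4/3} k$ for the second term in the exponent of the weight, so both terms in the exponent of Theorem \ref{thm:mainptf} are $O(k^{1/3}\log^{4/3} k)$.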

1110:

1111: %\section{Discussion} \label{sec:discuss}

1112:

1113:

1114: \section{Application to Learning Decision Trees} \label{sec:decisiontree}

1115:

1116: In 1989 Ehrenfeucht and Haussler \cite{EhrenfeuchtHaussler:89} gave an

1117: $n^{O(\log s)}$-time algorithm for learning decision trees of size

1118: $s$ over $n$ variables. Their algorithm uses $n^{O(\log s)}$ examples,

1119: and they asked if the sample complexity could be reduced to

1120: $\poly(n,s)$.  We can apply our techniques here to give an algorithm

1121: using $2^{\tilde{O}(s^{1/3})} \log n$ examples, if we are willing to

1122: spend $n^{\tilde{O}(s^{1/3})}$ time.

1123:

1124: First we need to generalize Theorem \ref{thm:mainptf} for higher order

1125: decision lists. An $r$-decision list is like a standard decision list

1126: but each pair is now of the form $(C_i,b_i)$ where $C_i$ is a

1127: conjunction of at most $r$ literals and as before $b_i = \pm 1$.  The

1128: output of such an $r$-decision list on input $x$ is $b_i$ where $i$ is

1129: the smallest index such that $C_i(x)=1.$

1130:

1131: We have the following:

1132:

1133: \begin{corollary} \label{cor:gdl}

1134: Let $L$ be an $r$-decision list of length $k$. Then for any

1135: $h < k$, $L$ is computed by a polynomial threshold function

1136: of degree $O(rh^{1/2} \log h)$ and weight

1137: $2^{r + O(k/h + h^{1/2} \log^2 h)}$.

1138: \end{corollary}

1139:

1140: \begin{proof}

1141: Let $L$ be the $r$-decision list $(C_1,b_1),\dots,(C_k,b_k),b_{k+1}.$

1142: By Theorem \ref{thm:mainptf} there is a polynomial threshold function

1143: of degree $O(h^{1/2} \log h)$ and weight

1144: $2^{O(k/h + h^{1/2} \log^2 h)}$ over the variables $C_1,\dots,C_k.$

1145: Now replace each variable $C_{i}$ by the interpolating polynomial

1146: which computes it exactly as a function from $\{0,1\}^n$ to $\{0,1\}.$

1147: Each such interpolating polynomial has degree $r$ and integer

1148: coefficients of total magnitude at most $2^r$, and the corollary follows.

1149: \end{proof}

1150:

1151: \begin{corollary} \label{cor:learngdl}

1152: There is an algorithm for learning

1153: $r$-decision lists over $\{0,1\}^n$ which, when learning an $r$-decision list

1154: of length $k$, has mistake bound

1155: $2^{\tilde{O}(r + k^{1/3})}\log n$ and runs  in time

1156: $n^{\tilde{O}(rk^{1/3})}$.

1157: \end{corollary}

1158:

1159: Now we can apply Corollary \ref{cor:learngdl} to obtain a tradeoff

1160: between running time and sample complexity for learning decision

1161: trees:

1162:

1163: \begin{theorem}

1164: Let $D$ be a decision tree of size $s$ over $n$ variables. Then $D$ can be learned using $2^{\tilde{O}(s^{1/3})} \log n$ examples in time $n^{\tilde{O}(s^{1/3})}.$

1165: \end{theorem}

1166:

1167:

1168: \begin{proof}

1169: Blum \cite{Blum:92} has shown that any decision tree of size $s$ is

1170: computed by a $(\log s)$-decision list of length $s.$ Applying

1171: Corollary \ref{cor:learngdl} we thus see that Expanded-Winnow can be

1172: used to learn decision trees of size $s$ over $\{0,1\}^n$ with the

1173: claimed bounds on time and sample complexity.

1174: \end{proof}

1175:

1176:

1177:

1178:

1179: \section{Lower Bounds for Decision Lists} \label{sec:discuss}

1180:

1181: Here we observe that our construction from

1182: Theorem \ref{thm:mainptf} is essentially optimal in terms of the

1183: tradeoff it achieves between polynomial threshold function degree

1184: and weight.

1185:

1186: In \cite{Beigel:94}, Beigel constructs an oracle separating $\PP$ from

1187: $\PNP$. At the heart of his construction is a proof that any low

1188: degree polynomial threshold function for a particular

1189: decision list, called the $\mathrm{ODDMAXBIT}_{n}$ function,

1190: must have large weights:

1191:

1192: \begin{definition}

1193: The $\mathrm{ODDMAXBIT}_{n}$ function on input $x=x_{1},\ldots,x_{n}

1194: \in \{0,1\}^{n}$ equals $(-1)^{i}$ where $i$ is the index of the

1195: first nonzero bit in $x.$

1196: \end{definition}

1197:

1198: It is clear that the $\mathrm{ODDMAXBIT}_{n}$ function is

1199: equivalent to a decision list of length $n$:

1200: $$

1201: (x_1,-1),(x_2,1),(x_3,-1),\dots,(x_n,(-1)^{n}),(-1)^{n+1}.

1202: $$

1203: The main technical theorem which Beigel proves in \cite{Beigel:94}

1204: states that any polynomial threshold function of degree $d$ computing

1205: $\mathrm{ODDMAXBIT}_{n}$ must have weight $2^{\Omega(n/d^{2})}$:

1206:

1207: \begin{theorem} \label{thm:beigel}

1208: Let $p$ be a degree $d$ polynomial threshold function with integer

1209: coefficients computing

1210: $\mathrm{ODDMAXBIT}_{n}$. Then

1211: $w = 2^{\Omega(n/d^{2})}$ where $w$ is the weight of $p.$\footnote{Beigel actually proves something stronger, namely that there must exist a coefficient whose absolute value is at least $2^{\Omega(n/d^{2})}$.}

1212: \end{theorem}

1213: (As stated in \cite{Beigel:94} the bound is actually $w \geq

1214: {\frac 1 s}2^{\Omega(n/d^2)}$ where $s$ is the number of nonzero

1215: coefficients in $p$.  Since $s \leq w$ this implies the result

1216: as stated above.)

1217:

1218:

1219: A lower bound of $2^{\Omega(n)}$

1220: on the weight of any linear threshold function ($d=1$) for

1221: $\mathrm{ODDMAXBIT}_n$ has long been known \cite{MyhillKautz:61};

1222: Beigel's proof generalizes this

1223: lower bound to all $d = O(n^{1/2}).$  A matching upper bound

1224: of $2^{O(n)}$ on weight for $d=1$ has also long been known

1225: \cite{MyhillKautz:61}.

1226: Our Theorem \ref{thm:mainptf} gives an upper bound

1227: which matches Beigel's lower bound (up to

1228: logarithmic factors) for all $d = O(n^{1/3})$:

1229: \begin{observation}

1230: For any $d = O(n^{1/3})$ there is a polynomial threshold function of

1231: degree $d$ and weight $2^{\tilde{O}(n/d^{2})}$

1232: which computes $\mathrm{ODDMAXBIT}_{n}$.

1233: \end{observation}

1234: \begin{proof}

1235: Set $d = h^{1/2} \log h$ in Theorem~\ref{thm:mainptf}.

1236: The weight bound given by Theorem~\ref{thm:mainptf}

1237: is $2^{O({\frac {n \log^2 d}{d^2}} + d \log d)}$

1238: which is $2^{\tilde{O}(n/d^2)}$ for $d = O(n^{1/3}).$

1239: \end{proof}

1240:

1241: \medskip

1242:

1243: Note that since the

1244: $\mathrm{ODDMAXBIT}_{n}$ function has a polynomial size DNF

1245: (see Appendix \ref{ap:alt}), Beigel's lower bound gives a polynomial

1246: size DNF $f$ such that any degree $\tilde{O}(n^{1/3})$ polynomial

1247: threshold function for $f$ must have weight

1248: $2^{\tilde{\Omega}(n^{1/3})}$.

1249: This suggests that the Expanded-Winnow algorithm cannot learn polynomial size

1250: DNF in $2^{\tilde{O}(n^{1/3})}$ time from

1251: $2^{n^{1/3 - \eps}}$ examples for any

1252: $\eps > 0,$ and thus suggests that improving the sample complexity

1253: of the DNF learning algorithm from \cite{KlivansServedio:01} while

1254: maintaining its $2^{\tilde{O}(n^{1/3})}$ running time may be difficult.

1255:

1256: \section{Learning Parity Functions} \label{sec:parity}

1257:

1258: We first briefly review the standard

1259: algorithm for learning parity functions.

1260:

1261: The standard algorithm for learning parity functions works by viewing a

1262: set of $m$ labelled examples as a set of $m$ linear equations over GF(2).

1263: Each labelled example $(x,b)$ induces the equation

1264: $\sum_{i: x_i = 1} a_{i} = b \bmod 2.$

1265: Since the examples are labelled according to some parity function,

1266: this parity function will be a consistent solution to the

1267: system of equations.

1268: Using Gaussian elimination it is possible to efficiently find a

1269: solution to the linear system,

1270: which yields a parity function consistent with all $m$ examples.

1271: The following standard fact from learning theory

1272: (often referred to as ``Occam's Razor'') shows that finding

1273: a consistent hypothesis suffices to establish PAC learnability:

1274:

1275: \begin{fact} \label{fact:OC}

1276: Let $C$ be a concept class and $H$ a finite set of hypotheses. Set $m

1277: = (1/\epsilon)(\log |H| + \log (1/\delta))$ where $\epsilon$ and $\delta$

1278: are the usual accuracy and confidence parameters for PAC learning.

1279: Suppose that there

1280: is an algorithm $A$ running in time $t$ which takes as input $m$

1281: examples which are labelled according to some element of $C$ and outputs a

1282: hypothesis $h \in H$ consistent with these examples.

1283: Then $A$ is a PAC learning algorithm for $C$ with running time $t$

1284: and sample complexity $m.$

1285: \end{fact}
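
\medskip

\noindent As a small illustration of the reduction to linear equations described above
(the instance is ours and is chosen only for concreteness), take $n = 4$ and suppose
the unknown parity is $x_1 \oplus x_3.$ The three labelled examples
$(1011, 0),$ $(0110, 1)$ and $(1100, 1)$ induce the system
\[
\begin{array}{rcl}
a_1 + a_3 + a_4 &\equiv& 0 \pmod 2, \\
a_2 + a_3 &\equiv& 1 \pmod 2, \\
a_1 + a_2 &\equiv& 1 \pmod 2,
\end{array}
\]
where $a_i = 1$ means that $x_i$ appears in the hypothesis parity.
The target corresponds to $a = (1,0,1,0),$ but $a = (0,1,0,0)$ is also a solution,
so Gaussian elimination may return the parity $x_2,$ which is consistent with all
three examples without being the target. Fact \ref{fact:OC} is what guarantees that,
given enough examples, any such consistent hypothesis is nevertheless accurate.

\medskip
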

1286: Consider using the above algorithm to learn an unknown

1287: parity of length at most $k.$

1288: Even though there is a solution of weight at most $k$,

1289: Gaussian elimination (applied to a system of $m$ equations in $n$

1290: variables over GF(2)) may yield a solution of weight

1291: as large as $\min(m,n).$

1292: Using Fact \ref{fact:OC} we thus obtain a sample complexity bound of

1293: $O(n)$ examples for learning a parity of length at most $k.$
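
\medskip

\noindent To make the last step explicit (a sketch of the calculation; the names
$H_{\mathrm{all}}$ and $H_{\leq k}$ are our notation): the hypothesis returned by
Gaussian elimination can only be guaranteed to lie in the class $H_{\mathrm{all}}$
of all $2^n$ parity functions, so Fact \ref{fact:OC} gives
\[
m = \frac{1}{\epsilon}\left(\log |H_{\mathrm{all}}| + \log \frac{1}{\delta}\right)
  = \frac{1}{\epsilon}\left(n \log 2 + \log \frac{1}{\delta}\right) = O(n)
\]
for constant $\epsilon$ and $\delta.$ By contrast, the class $H_{\leq k}$ of
parities on at most $k$ variables satisfies $|H_{\leq k}| \leq (n+1)^k,$ so an
algorithm that always output a hypothesis in $H_{\leq k}$ would need only
\[
m = \frac{1}{\epsilon}\left(k \log (n+1) + \log \frac{1}{\delta}\right) = O(k \log n)
\]
examples; the difficulty is that the standard algorithm gives no such guarantee.

\medskip
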

1294:

We now present
a simple polynomial-time algorithm for learning an unknown parity
function on at most $k$ variables using $O(n^{1-1/k} \log n)$ examples.

1298: To the best of our knowledge this is the first improvement on the

1299: standard algorithm and analysis given above.

1300:

1301: \begin{theorem} \label{thm:mainparity}

1302:

1303: The class of all parity functions on at most $k$ variables is

1304: learnable in polynomial time using $O(n^{1-1/k} \log n)$

1305: examples. The hypothesis output by the learning algorithm

1306: is a parity function on $O(n^{1-1/k}\log n)$ variables.

1307:

1308: \end{theorem}

1309:

1310: \begin{proof}

If $k = \Omega(\log n)$ then
$n^{1-1/k} = n \cdot 2^{-(\log n)/k} = \Theta(n),$
so the standard algorithm suffices to
prove the claimed bound.  We thus assume that $k = o(\log n)$.

1313:

1314: Let $H$ be the set of all parity functions of size at most $n^{1 - 1/k}$.

1315: Note that $|H| \leq n^{n^{1 - 1/k}}$ so

1316: $\log|H| \leq n^{1 - 1/k} \log n.$

1317: Consider the following

1318: algorithm:

1319:

1320: \begin{enumerate}

1321:

\item Draw $m = \frac{1}{\epsilon}\left(\log |H| + \log (1/\delta)\right)$
examples. Express each example as a linear equation over $n$ variables
mod $2$ as described above.

1325:

\item Choose a uniformly random set of $n - n^{1-1/k}$ variables and assign
them the value $0$.

1328:

1329: \item Use Gaussian elimination to attempt to solve the resulting system

1330: of equations on the remaining $n^{1 - 1/k}$ variables.

1331: If the system has a solution, output the corresponding parity

1332: (of size at most $n^{1 - 1/k}$) as the hypothesis.

1333: If the system has no solution, output ``FAIL.''

1334:

1335: \end{enumerate}

1336:

1337: If the simplified system of equations has a solution,

1338: then by Fact \ref{fact:OC} this solution is a good hypothesis.

1339: We will show that the simplified system has a solution with probability

1340: $\Omega(1/n)$.  The theorem

1341: follows  by repeating steps 2 and 3 of the above algorithm until

1342: a solution is found (an expected $O(n)$ repetitions will suffice).
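
\medskip

\noindent In symbols (the notation $S,$ $x^{(j)},$ $b^{(j)}$ is ours): if
$S \subseteq \{1,\dots,n\}$ denotes the set of $n^{1-1/k}$ variables that were
\emph{not} set to $0$ in Step 2, and
$(x^{(1)},b^{(1)}),\dots,(x^{(m)},b^{(m)})$ are the examples, then Step 3 looks
for values $a_i \in \{0,1\},$ $i \in S,$ satisfying
\[
\sum_{i \in S:\, x^{(j)}_i = 1} a_i \equiv b^{(j)} \pmod 2
\qquad \mbox{for } j = 1,\dots,m.
\]
Any solution defines a parity over the variables $\{x_i : i \in S,\ a_i = 1\},$
which has size at most $|S| = n^{1-1/k}$ and hence belongs to $H,$ and which is
consistent with all $m$ examples.

\medskip
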

1343:

1344: Let $V$ be the set of $k$ relevant variables on which the unknown

1345: parity function depends. It is easy to see that as long as

1346: no variable in $V$ is assigned a 0,

1347: the resulting simplified system of equations will have a

1348: solution.

1349: Let $\ell = n^{1 - 1/k}.$

1350: The probability that in Step 2 the $n - \ell$ variables chosen

1351: do not include any variables in $V$ is exactly

1352: ${n - k \choose n - \ell} / {n \choose \ell}$

1353: which equals

1354: ${n - k \choose \ell - k} / {n \choose \ell}.$  Expanding

1355: binomial coefficients we have

\begin{equation} \label{eq:a}
{\frac {{n - k \choose \ell - k}}{{n \choose \ell}}} =
\prod_{i=1}^{k} {\frac {\ell - k + i}{n -k + i}}
> \left({\frac {\ell - k}{n - k}}\right)^k
=
\left({\frac \ell n}\right)^k
\left({\frac {1 - {\frac k \ell}}{1 - {\frac k n}}}\right)^k
\geq
{\frac 1 n} \cdot \left(1 - {\frac k \ell}\right)^k,
\end{equation}
where the last step uses $(\ell/n)^k = 1/n$ and $1 - k/n \leq 1.$
Since $(1-x)^k \geq 1 - kx$ for $0 \leq x \leq 1,$
(\ref{eq:a}) is at least ${\frac 1 n} \cdot \left(1 - {\frac {k^2} {\ell}}\right).$
For $k \geq 2$ we have $\ell \geq n^{1/2},$ so the bound $k = o(\log n)$ implies that
$k^2/\ell < 1/2$ for sufficiently large $n,$ and hence (\ref{eq:a}) is greater than
${\frac 1 {2n}}.$ For $k = 1$ the product in (\ref{eq:a}) equals $\ell/n = 1/n$ directly.
The theorem is proved.

1373: \end{proof}
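
\medskip

\noindent For a sense of scale, here is an illustrative instance (the numbers are
ours; logarithms are taken base 2 and $\epsilon, \delta$ are treated as constants).
Take $n = 10^6$ and $k = 3,$ so that $\ell = n^{1-1/k} = 10^4.$ The probability
that a single random restriction keeps all three relevant variables is
\[
\prod_{i=1}^{3} \frac{\ell - 3 + i}{n - 3 + i}
= \frac{9998 \cdot 9999 \cdot 10000}{999998 \cdot 999999 \cdot 1000000}
\approx 0.9997 \cdot 10^{-6},
\]
comfortably above the $1/(2n) = 5 \cdot 10^{-7}$ guaranteed by the proof, so
roughly $10^6$ repetitions of Steps 2 and 3 are expected. The sample size is
governed by
\[
\log |H| \leq n^{1-1/k} \log n \approx 10^4 \cdot 19.93 \approx 2 \cdot 10^5,
\]
compared with $\log 2^n = 10^6$ for the class of all parities that arises in the
analysis of the standard algorithm.

\medskip
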

1374:

1375:

1376:

1377:

1378:

1379:

1380:

1381: \section{Future Work} \label{sec:future}

1382:

1383: An obvious goal for future work is to improve our algorithmic results

1384: for learning decision lists.  The question still remains:  can

1385: decision lists of length $k$ be learned in poly$(n)$ time from

1386: poly$(k,\log n)$ examples?  As a first step, one might attempt to

1387: extend the tradeoffs we achieve:  is it possible to learn

1388: decision lists of length $k$ in $n^{k^{1/2}}$ time from

1389: poly$(k,\log n)$ examples?

1390:

1391: Another goal is to extend our results for decision lists to broader

1392: concept classes.  In particular, since decision lists are a special

1393: case of linear threshold functions, it would be interesting to obtain analogues

1394: of our algorithmic

1395: results for learning general linear threshold functions (independent of

1396: their weight).  We note here that

1397: Goldmann {\em et al.} \cite{GHR:92} have given

1398: a linear threshold function over $\{-1,1\}^n$ for

1399: which any polynomial threshold function must have weight

$2^{\Omega(n^{1/2})}$ regardless of its degree.  Moreover,
Krause and Pudl\'ak \cite{KrausePudlak:98} have shown that any Boolean

1402: function which has a polynomial threshold function over $\{0,1\}^n$ of weight

1403: $w$ has a polynomial threshold function over $\{-1,1\}^n$ of weight

1404: $n^2w^4.$  These results imply that {\em representational} results akin

1405: to Theorem \ref{thm:ptf} for general linear threshold functions

1406: must be quantitatively weaker than Theorem \ref{thm:ptf};

1407: in particular, there is a linear threshold function over

1408: $\{0,1\}^n$ with $k$ nonzero coefficients for which

1409: {any} polynomial threshold function, regardless of degree, must have

1410: weight $2^{\Omega(k^{1/2})}.$
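
\medskip

\noindent The implication can be spelled out as follows (a sketch; $f$ and $w$ are
our notation, and we use the fact that under the affine change of variables
$x_i \mapsto 1 - 2x_i$ a linear threshold function over one of the two domains
remains a linear threshold function over the other). Let $f$ be the function of
\cite{GHR:92} and suppose that $f$ has a polynomial threshold function over
$\{0,1\}^n$ of weight $w.$ By the result of \cite{KrausePudlak:98}, $f$ then has
a polynomial threshold function over $\{-1,1\}^n$ of weight $n^2 w^4,$ so
\[
n^2 w^4 \geq 2^{\Omega(n^{1/2})}
\quad \Longrightarrow \quad
w \geq \left( \frac{2^{\Omega(n^{1/2})}}{n^2} \right)^{1/4} = 2^{\Omega(n^{1/2})}.
\]
Since $f$ has at most $n$ nonzero coefficients, this yields the stated
$2^{\Omega(k^{1/2})}$ bound with $k \leq n.$

\medskip
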

1411:

1412: For parity functions, one challenge is to

1413: learn parity functions on $k = \Theta(\log n)$ variables in polynomial time

1414: using a sublinear number of examples.  Another challenge is to improve

the sample complexity of learning size $k$ parities from our
current bound of $O(n^{1 - 1/k} \log n).$

1417:

1418: \ignore{

1419:

1420: Decision lists can be viewed as a special case of linear threshold

1421: functions. For example, the alternating decision list (or

1422: $\mathrm{ODDMAXBIT}_{n}$ function) is equal to the sign of $h =

1423: \sum_{i=1}^{n} (-1)^{i} 2^{i}x_{i}$. The lower bound on the

1424: $\mathrm{ODDMAXBIT}_{n}$ function due to Beigel shows that for an

1425: arbitrary linear threshold function, we cannot construct polynomial

1426: threshold functions of degree $d$ and weight $2^{o(n/d^{2})}.$

1427:

1428: Here we observe that this lower bound on the weight and degree of

1429: polynomial threshold functions computing general linear threshold

functions can be strengthened due to a result by Goldmann, H{\aa}stad, and

1431: Razborov:

1432:

1433: \begin{theorem} \cite{GHR:92}

1434: There exists a linear threshold function $U$ defined on $4n^{2}$

1435: variables such that if $U$ is written as a threshold of monomials then

1436: the total weight of the threshold is $\Omega(2^{(n/2)} / \sqrt{n})$.

1437: \end{theorem}

1438:

1439: \noindent The linear threshold function $U$ is the so-called Universal

1440: Halfspace defined as follows:

1441:

1442: \[ U_{n,m} = \sum_{i=1}^{n} \sum_{j=1}^{m} 2^{i}x_{ij}. \]

1443:

1444: From this we conclude that to learn an arbitrary linear threshold

1445: function on $n$ variables, V-Winnow will require

1446: $\Omega(2^{\sqrt{n}})$ samples and time $\Omega(n^{\sqrt{n}})$. This

1447: stands in contrast to the sample complexity and time complexity bounds

1448: for learning decision lists.

1449: }

1450:

1451: \section{Acknowledgements} We thank Les Valiant for his observation

1452: that Claim \ref{cla:outer} can be reinterpreted in terms of polynomial

1453: threshold functions.

1454: We thank Jean Kwon for suggesting the Chebychev polynomial.

1455:

1456: \bibliographystyle{plain}

1457: \bibliography{allrefs}

1458:

1459: \appendix

1460:

1461: \section{Alternate Proof of Corollary \ref{cor:outer}} \label{ap:alt}

1462: The alternate proof of Corollary \ref{cor:outer} is based on the

1463: observation that any decision list $L =

1464: (\ell_1,b_1),\dots,$ $(\ell_k,b_k),b_{k+1}$ of length $k$ has a

1465: $k$-term DNF in which each term is a conjunction of at most

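$k$ literals.  To see this, note that we obtain a DNF
for $L$ simply by taking the OR of all terms
$\overline{\ell}_1\overline{\ell}_2 \dots \overline{\ell}_{i-1}\ell_i$
for each $i \leq k$ such that $b_i = 1,$ together with the term
$\overline{\ell}_1\overline{\ell}_2 \dots \overline{\ell}_k$ if $b_{k+1} = 1.$
(In the latter case, unless $L$ is identically 1, some $b_i$ with $i \leq k$
equals 0, so the DNF still has at most $k$ terms.)  Now we use the following result
from \cite{KlivansServedio:01}: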

1471: \begin{theorem} [Corollary 12 of \cite{KlivansServedio:01}]

1472: Let $f$ be a DNF formula of $s$ terms, each of length at most $t.$

1473: Then there is a polynomial threshold function for $f$ of degree

1474: $O(\sqrt{t}\log s)$ and weight $t^{O(\sqrt{t}\log s)}.$

1475: \end{theorem}

1476: Applying this result to the DNF representation for $L,$ we immediately

1477: obtain that there is a polynomial threshold function for $L$

1478: which has degree $O(k^{1/2} \log k)$ and weight

1479: $2^{O(k^{1/2} \log^2 k)}.$  (In Section \ref{subsec:inner}, though,

1480: we need the construction given in our original proof of

1481: Corollary \ref{cor:outer}.)
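
\medskip

\noindent As a small illustration of the DNF construction above (the example is
ours, with output bits written as $0/1$), consider the length-3 decision list
$L = (x_1,0),(\overline{x}_2,1),(x_3,1),0.$ The positions with $b_i = 1$ are
$i = 2$ and $i = 3,$ which contribute the terms
$\overline{\ell}_1\ell_2 = \overline{x}_1\overline{x}_2$ and
$\overline{\ell}_1\overline{\ell}_2\ell_3 = \overline{x}_1 x_2 x_3,$ so
\[
L = \overline{x}_1\overline{x}_2 \;\vee\; \overline{x}_1 x_2 x_3.
\]
The weight bound quoted above follows by substituting $s = t = k$ into the theorem:
\[
t^{O(\sqrt{t}\log s)} = k^{O(k^{1/2}\log k)} = 2^{O(k^{1/2}\log^2 k)}.
\]
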

1482:

1483: \end{document}

1484:

1485:

1486:

1487:

1488:

1489:

1490: