1:
2: % d5.tex 3-27-03
3:
4:
5: \documentclass[11pt]{article}
6: \usepackage{amssymb}
7: \usepackage{amsfonts}
8: \usepackage{amsmath}
9: \usepackage{latexsym}
10: \usepackage{epsfig}
11:
12: \parindent=18pt
13: \oddsidemargin=0.15in
14: \evensidemargin=0.15in
15: \topmargin=-.5in
16: \textheight=9in
17: \textwidth=6.5in
18:
19: \newcommand{\la}{\langle}
20: \newcommand{\ra}{\rangle}
21: \newcommand{\poly}{\mathrm{poly}}
22: \newcommand{\size}{\mathrm{size}}
23: \newcommand{\fix}{\mathrm{fix}}
24: \newcommand{\bias}{\mathrm{bias}}
25: \newcommand{\R}{{\bf R}}
26: \newcommand{\E}{{\mathrm E}}
27: \newcommand{\F}{{{\bf F}_2}}
28: \newcommand{\s}{{\mathcal S}}
29: \newcommand{\K}{{\mathcal K}}
30: \newcommand{\A}{{\mathcal A}}
31: \newcommand{\B}{{\mathcal B}}
32: \newcommand{\true}{\textsc{T}}
33: \newcommand{\false}{\textsc{F}}
34: \newcommand{\bitsl}{\{\false, \true\}}
35: \newcommand{\bitsf}{\{0, 1\}}
36: \newcommand{\bitsr}{\{+1,-1\}}
37: \newcommand{\degr}{\deg_\R}
38: \newcommand{\degf}{\deg_\F}
39: \newcommand{\parity}{\mathsf{PARITY}}
40: \newcommand{\cz}{c_\emptyset}
41: \newcommand{\fin}{f_{\mathrm{in}}}
42: \newcommand{\fout}{f_{\mathrm{out}}}
43: \newcommand{\tin}{t_{\mathrm{in}}}
44: \newcommand{\tout}{t_{\mathrm{out}}}
45: \newcommand{\eps}{{\epsilon}}
46: \newcommand{\theconst}{\frac{\omega}{\omega+1}}
47: \newcommand{\ignore}[1]{}
48: \newcommand{\qed}{\hfill\rule{7pt}{7pt}}
49: \newcommand{\strutje}{\rule[-.25cm]{0cm}{.7cm}}
50: \newcommand{\omb}{ODDMAXBIT}
51: \newcommand{\PP}{\mathsf{PP}}
52: \newcommand{\PNP}{\mathsf{P^{NP}}}
53:
54:
55: \newtheorem{theorem}{Theorem}
56: \newtheorem{fact}[theorem]{Fact}
57: \newtheorem{observation}[theorem]{Observation}
58: \newtheorem{proposition}[theorem]{Proposition}
59: \newtheorem{claim}[theorem]{Claim}
60: \newtheorem{definition}[theorem]{Definition}
61: \newtheorem{corollary}[theorem]{Corollary}
62:
63: \newenvironment{proof}{\noindent \textbf{Proof:}}{\hfill{$\Box$}}
64:
65: \title{Toward Attribute Efficient Learning Algorithms}
77:
78:
79:
80: \author{Adam R. Klivans\thanks{Supported by an NSF Mathematical
81: Sciences Postdoctoral Research Fellowship.}\\
Division of Engineering and Applied Sciences\\
83: Harvard University\\ Cambridge, MA 02138 \\{\tt klivans@eecs.harvard.edu}
84: \and Rocco A.\ Servedio\\
85: Department of Computer Science\\
86: Columbia University\\
87: New York, NY 10027\\ {\tt rocco@cs.columbia.edu} }
88:
89: \date{}
90:
91: \begin{document}
92:
93: \setcounter{page}{0}
94:
95: \maketitle
96:
97: \begin{abstract}
98:
99: We make progress on two important problems regarding attribute
100: efficient learnability.
101:
102: First, we give an algorithm for learning decision
103: lists of length $k$ over $n$ variables using $2^{\tilde{O}(k^{1/3})}
104: \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$. This is the first
105: algorithm for learning decision lists that has both subexponential
106: sample complexity and subexponential running time in the relevant
107: parameters. Our approach establishes a relationship between attribute
108: efficient learning and polynomial threshold functions and is based on
109: a new construction of low degree, low weight polynomial threshold
110: functions for decision lists. For a wide range of parameters our
111: construction matches a 1994 lower bound due to Beigel for the
112: ODDMAXBIT predicate and gives an essentially optimal tradeoff between
113: polynomial threshold function degree and weight.
114:
115: Second, we give an
116: algorithm for learning an unknown parity function on $k$ out of $n$
117: variables using $O(n^{1-1/k})$ examples in time polynomial in $n$. For
118: $k=o(\log n)$ this yields a polynomial time algorithm with
119: sample complexity $o(n)$. This is the first polynomial time algorithm
120: for learning parity on a superconstant number of variables with
121: sublinear sample complexity.
122:
123: \end{abstract}
124:
125:
176:
177:
178: \thispagestyle{empty}
179:
180: \newpage
181:
182: \section{Introduction}
183:
184: \subsection{Attribute Efficient Learning}
185:
186: A central goal in machine learning is to design efficient, effective
187: algorithms for learning from small amounts of data. An obstacle to
188: achieving this goal is that learning problems are often characterized by
189: an abundance of {\em irrelevant information}. In many learning problems
190: each data point is naturally viewed as a high dimensional vector of
191: attribute values; as a motivating example, in a natural language domain a
192: data point representing a text document may be a vector of word
193: frequencies over a lexicon of 100,000 words (attributes). A newly
194: encountered word in a corpus may typically have a simple definition which
195: uses only a dozen or so words from the entire lexicon. One would like to
196: be able to learn the meaning of such a word using a number of examples
197: which is closer to a dozen (the actual number of relevant attributes) than
198: to 100,000 (the total number of attributes).
199:
200: Towards this end, an important goal in machine learning theory is to
201: design {\em attribute efficient} algorithms for learning various classes
202: of Boolean functions. A class ${\cal C}$ of Boolean functions over $n$
203: variables $x_1,\dots,x_n$ is said to be {\em attribute-efficiently
204: learnable} if there is a poly$(n)$ time algorithm which can learn any
function $f \in {\cal C}$ using a number of examples which is polynomial in the
206: ``size'' (description length) of the function $f$ to be learned, rather
207: than in $n$ (the number of features in the domain over which learning
208: takes place). (Note that the running time of the learning algorithm must
209: in general be at least $n$ since each example is an $n$-bit vector.)
210: Thus an attribute efficient learning algorithm for, say, the class of
211: Boolean conjunctions must be able to learn any Boolean conjunction of $k$
212: literals over $x_1,\dots,x_n$ using poly$(k,\log n)$ examples, since $k
213: \log n$ bits are required to specify such a conjunction.
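
As a rough illustration (our own arithmetic, not one of the results proved
below), take the lexicon example above with $n = 100{,}000$ attributes and a
conjunction of $k = 12$ literals. Specifying the conjunction takes about
$$
k \log_2 n \approx 12 \cdot 16.6 \approx 200
$$
bits, so an attribute efficient learner would be expected to use a number of
examples polynomial in roughly $200$ rather than in $100{,}000$.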
214:
215:
216: \subsection{Decision Lists}
217:
218: A longstanding open problem in machine learning, posed first by Blum in
219: 1990 \cite{Blum:90,Blum:96,BHL:95,BlumLangley:97} and again by
220: Valiant in 1998
221: \cite{Valiant:99}, is to determine whether or not there exist attribute
222: efficient algorithms for learning {\em decision lists}. A decision list
223: is essentially a nested ``if-then-else'' statement (we give a precise
224: definition in Section \ref{sec:prelims}).
225:
226: Attribute efficient learning of decision lists is of both theoretical and
227: practical interest. Blum's motivation for considering the problem came
228: from the {\em infinite attribute model} \cite{Blum:90}; in this model
229: there are infinitely many attributes but the concept to be learned depends
230: on only a small number of them, and each example consists of a finite list
231: of active attributes. Blum {\em et al}. \cite{BHL:95} showed that for a
232: wide range of concept classes (including decision lists) attribute
233: efficient learnability in the standard $n$-attribute model is equivalent
234: to learnability in the infinite attribute model. Since simple classes
235: such as disjunctions and conjunctions are attribute efficiently learnable
236: (and hence learnable in the infinite attribute model), this motivated Blum
237: \cite{Blum:90} to ask whether the richer class of decision lists is thus
238: learnable as well.\footnote{ Additional motivation comes from the fact
239: that decision lists have such a simple algorithm in the PAC model.}
240: Several researchers have subsequently considered this problem, see e.g.
241: \cite{Blum:96,BlumLangley:97,DhagatHellerstein:94, NevoElYaniv:02,
242: Servedio:99stoc}; we summarize some of this previous work in Section
243: \ref{sec:prevdl}.
244:
245: From an applied perspective, Valiant \cite{Valiant:99} relates the
246: problem of learning decision lists attribute efficiently to the question
247: ``how can human beings learn from small amounts of data in the presence of
248: irrelevant information?'' He points out that since decision lists play an
249: important role in various models of cognition, a first step in
250: understanding this phenomenon would be to identify efficient algorithms
251: which learn decision lists from few examples. Due to the lack of progress
252: in developing such algorithms for decision lists, Valiant suggests that
models of cognition should perhaps focus on ``flatter'' classes of
254: functions such as projective DNF \cite{Valiant:99}.
255:
256: \subsection{Parity Functions}
257:
258: Another outstanding challenge in machine learning is to determine whether
259: there exist attribute efficient algorithms for learning {\em parity
260: functions}. The parity function
261: on a set of 0/1-valued variables $x_{i_1},\ldots,x_{i_k}$ is equal to $x_{i_1} + \cdots
262: + x_{i_k}$ modulo 2. As with the class of decision lists, a simple PAC learning
263: algorithm is known for the class of parity functions but no attribute efficient
264: PAC learning algorithm is known.
265: Learning parity
functions plays an important role in Fourier learning methods
267: \cite{MOS:03} and is closely related to decoding random linear codes \cite{BKW:00}.
268: Both A. Blum \cite{Blum:96} and Y. Mansour \cite{Man:02} cite
269: attribute efficient learning of parity functions as an important open
270: problem.
271:
272: \ignore{
273: Given a set of examples labelled according to an unknown parity
274: function on $k$ out of $n$ variables, we wish to find an approximation
275: to the unknown parity in polynomial time using as few examples as
276: possible. The well known solution to this problem views these
277: examples as a set of linear equations mod $2$ in $n$ variables and
278: solves the set of equations to come up with a consistent
279: hypothesis. Note, however, that we must take $\Omega(n)$ examples to
280: achieve a solution which has good generalization error, as a solution
281: to a system of $m$ equations over $n$ variables may contain
282: $\min(m,n)$ non-zero entries. An attribute efficient algorithm for
283: learning parity should require a number of examples polynomially
284: related to $k$ and $\log n$ (information theoretically we should only
285: need $O(k \log n)$ examples).
286: }
287:
288: \subsection{Our Results: Decision Lists}
289:
290: We give the first learning algorithm for decision lists that is
291: subexponential in both sample complexity (in the relevant parameters $k$
292: and $\log n$) and running time (in the relevant parameter $k$). Our
293: results demonstrate for the first time that it is possible to
294: simultaneously avoid the ``worst case'' in both sample complexity and
295: running time, and thus suggest that it may indeed be possible to learn
296: decision lists attribute efficiently. \ignore{We consider this to be the
297: first evidence that decision lists can be learned attribute efficiently.
298: \\}
299:
300: Our main learning result for decision lists is:
301:
302: \begin{theorem} \label{thm:main} There is an algorithm for learning
303: decision lists over $\{0,1\}^n$ which, when learning a decision list
304: of length $k$, has mistake bound\footnote{Throughout this
305: section we use ``sample complexity'' and ``mistake bound''
306: interchangeably; as described in Section \ref{sec:prelims}
307: these notions are essentially identical.}
308: $2^{\tilde{O}(k^{1/3})}\log n$ and runs in time
309: $n^{\tilde{O}(k^{1/3})}$.
310: \end{theorem}
311:
312:
313: We prove Theorem \ref{thm:main} in two parts; first we generalize
314: Littlestone's well known Winnow algorithm \cite{Littlestone:88}
315: for learning
316: linear threshold functions to learn {\em polynomial
317: threshold functions.} In previous learning results, polynomial threshold
318: functions are learned by applying techniques from linear programming: a
319: Boolean function computed by a polynomial threshold function of degree $d$ can
320: be learned in time $n^{O(d)}$ by using polynomial time linear programming
321: algorithms such as the Ellipsoid algorithm
322: (see e.g. \cite{KlivansServedio:01}).
323: \ignore{via a linear programming solver, such as the
324: Ellipsoid algorithm.}
325: In contrast, we use the Winnow algorithm to learn polynomial threshold functions.
326: Winnow learns using few examples in a small amount of time
327: provided that the degree of the polynomial
328: is low and the integer coefficients of the polynomial are not too large:
329: \ignore{As opposed to general
330: linear programming solvers, Winnow can learn in an attribute efficient
331: manner:}
332:
333:
334: \begin{theorem} \label{thm:win}
335: Let ${\cal C}$ be a class of Boolean functions over
336: $\{0,1\}^n$ with the property that each $f \in {\cal C}$ has a polynomial
337: threshold function of degree at most $d$ and weight at most $W.$ Then
338: there is an online learning algorithm for ${\cal C}$ which runs in $n^d$
339: time per example and has mistake bound $O(W^{2} \cdot d \cdot \log n).$
340: \end{theorem}
341:
342: At this point we have reduced the problem of learning decision lists
343: attribute efficiently to the problem of representing decision lists with
344: polynomial threshold functions of low weight and low degree. To this end
345: we prove
346:
347: \begin{theorem} \label{thm:ptf} Let $L$ be a decision list of length $k$.
348: Then $L$ is computed by a polynomial threshold function of degree
349: $\tilde{O}(k^{1/3})$ and weight $2^{\tilde{O}(k^{1/3})}$. \end{theorem}
350: Theorem \ref{thm:main} follows directly from Theorems \ref{thm:win}
351: and \ref{thm:ptf}.
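
To spell out the calculation (a sketch, suppressing the polylogarithmic
factors hidden in the $\tilde{O}$ notation): substituting
$d = \tilde{O}(k^{1/3})$ and $W = 2^{\tilde{O}(k^{1/3})}$ from Theorem
\ref{thm:ptf} into the mistake bound of Theorem \ref{thm:win} gives
$$
O(W^{2} \cdot d \cdot \log n)
= 2^{2 \cdot \tilde{O}(k^{1/3})} \cdot \tilde{O}(k^{1/3}) \cdot \log n
= 2^{\tilde{O}(k^{1/3})} \log n ,
$$
and the time per example is $n^{d} = n^{\tilde{O}(k^{1/3})}$. Since the total
running time is the mistake bound times the time per trial (see Section
\ref{sec:prelims}), and $2^{\tilde{O}(k^{1/3})} \log n \leq
n^{\tilde{O}(k^{1/3})}$ for $n \geq 2$, the overall running time is
$n^{\tilde{O}(k^{1/3})}$ as well.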
352:
353: Polynomial threshold function constructions have recently been used
354: to obtain the fastest known algorithms for a range
355: of important learning problems such as learning DNF formulas
356: \cite{KlivansServedio:01}, intersections of halfspaces \cite{KOS:02},
357: and Boolean formulas of superconstant depth \cite{OdonnellServedio:03a}.
358: For each of these learning problems the sole goal was to obtain
359: fast learning algorithms, and hence the only parameter of interest in
360: these polynomial threshold function constructions is their degree,
361: since degree bounds translate directly into running time bounds for
362: learning algorithms (see e.g. \cite{KlivansServedio:01}).
363: In contrast, for the decision list problem we are interested in
364: both the running time and the number of examples required for learning.
365: Thus we must bound both the degree and the {\em weight}
366: (magnitude of integer coefficients) of the polynomial threshold
367: functions which we use.
368:
369: Our polynomial threshold function construction is essentially optimal in
370: the tradeoff between degree and weight which it achieves. In 1994 Beigel
371: gave a lower bound showing that any degree $d$ polynomial threshold
372: function for a particular decision list must have weight
373: $2^{\Omega(n/d^{2})}$. For $d = n^{1/3}$, Beigel's lower bound implies
374: that the construction stated in Theorem \ref{thm:ptf} is essentially
375: optimal. Furthermore, for any decision list $L$ of length $n$ and any
376: $d \leq n^{1/3}$, we will in fact construct polynomial threshold functions
377: of degree $d$ and weight $2^{\tilde{O}(n/d^{2})}$ computing $L$.
378: Beigel's lower bound thus implies that our degree $d$ polynomial threshold
379: functions are of roughly optimal weight
380: for all $d \leq n^{1/3},$ and hence strongly suggests that our
381: analysis is the best possible for the algorithm we use.
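
Concretely (a back-of-the-envelope comparison of the two bounds just stated),
let $W_d$ denote the minimum weight of a degree $d$ polynomial threshold
function for the particular decision list in Beigel's bound; the symbol $W_d$
is introduced here only for this illustration. Then for $d \leq n^{1/3}$
$$
2^{\Omega(n/d^{2})} \;\leq\; W_d \;\leq\; 2^{\tilde{O}(n/d^{2})} .
$$
At $d = n^{1/3}$ we have $n/d^{2} = n^{1/3}$, so the two sides read
$2^{\Omega(n^{1/3})}$ and $2^{\tilde{O}(n^{1/3})}$, and the exponents differ
only by the polylogarithmic factors hidden in the $\tilde{O}$ notation.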
382:
383: \subsection{Our Results: Parity Functions}
384:
385: For parity functions, we give an $O(n^3)$ time algorithm which can
386: learn an unknown parity on $k$ variables out of $n$ using $O(n^{1-1/k})$ examples.
387: For values of $k = o(\log n)$ the sample complexity of
this algorithm is $o(n)$. This is the first polynomial time algorithm for
learning parity on a superconstant number of variables with sublinear
sample complexity.
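
To see why the bound is sublinear (a short calculation added for
illustration), write
$$
n^{1-1/k} = \frac{n}{n^{1/k}} = \frac{n}{2^{(\log_2 n)/k}} .
$$
If $k = o(\log n)$ then $(\log_2 n)/k \to \infty$, so the denominator is
unbounded and $n^{1-1/k} = o(n)$. For example, with $n = 2^{20}$ and $k = 4$
the quantity $n^{1-1/k}$ equals $2^{15} = 32{,}768$, compared with
$n = 1{,}048{,}576$.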
391:
392: The standard PAC learning algorithm for learning an unknown parity function
393: is based on viewing a set of $m$ labelled examples as a system of $m$ linear equations modulo 2.
394: Using Gaussian elimination it is possible to solve the system and find
395: a consistent parity function. It can be shown that the solution thus
396: obtained is a ``good'' hypothesis if its weight (number of nonzero entries)
397: is small relative to $m$, the number of examples. However, using Gaussian elimination
398: can result in a solution of weight as large as
399: $\min(m,n)$ even if $k$ (the number of variables in the target parity) is very small.
400: Thus in order for this approach to give a successful learning algorithm, it is necessary to
401: use $m = \Omega(n)$ examples regardless of the value of $k$.
402: In contrast, observe that an attribute efficient algorithm for
403: learning a parity of length $k$ should use only poly$(k,\log n)$ examples.
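
A toy instance (our own, with labels written in $\{0,1\}$) shows how a
consistent solution can be heavier than the target. Take $n = 3$, target
parity $x_1$ (so $k = 1$), and the $m = 2$ labelled examples $((1,1,0),1)$ and
$((0,1,1),0)$. Writing the unknown parity as a vector $a \in \{0,1\}^3$, the
examples give the system
$$
a_1 + a_2 = 1, \qquad a_2 + a_3 = 0 \pmod 2 ,
$$
whose solutions are $a = (1,0,0)$ and $a = (0,1,1)$. Both are consistent with
the data, but the second has weight $2$ rather than $1$, and Gaussian
elimination by itself gives no guarantee about which solution it returns.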
404:
405: Our algorithm works by finding a ``low weight'' solution to a system of
406: $m$ linear equations. We prove that with high probability we can find a solution of weight
407: $O(n^{1-1/k})$ irrespective of $m$. Thus by taking $m$ to be only slightly larger
408: than $n^{1 - 1/k}$ we have that our solution is a ``good'' hypothesis.
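
The sense in which a low weight solution is ``good'' can be sketched with a
standard counting argument (stated with accuracy and confidence parameters
$\epsilon$ and $\delta$, which are our notation and are assumptions of this
sketch). There are at most $\sum_{j=0}^{w} {n \choose j} \leq (n+1)^{w}$
parities of weight at most $w$, so by the usual bound for a finite hypothesis
class, a parity of weight at most $w$ which is consistent with
$$
m = O\left( \frac{1}{\epsilon} \left( w \log n + \log \frac{1}{\delta} \right) \right)
$$
random examples has error at most $\epsilon$ with probability at least
$1 - \delta$. With $w = O(n^{1-1/k})$ and constant $\epsilon, \delta$ this
simple argument asks for $m = O(n^{1-1/k} \log n)$ examples, which may lose a
logarithmic factor relative to the bound stated above.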
409:
410:
411: \subsection{Previous Results: Decision Lists} \label{sec:prevdl}
412:
413:
414: In previous work several algorithms with different performance bounds (in
415: terms of running time and number of examples used) have been given for
416: learning decision lists.
417:
418: \begin{itemize}
419:
420: \item Rivest \cite{Rivest:87} gave the first algorithm for learning
421: decision lists in Valiant's PAC model of learning from random examples.
422: Littlestone \cite{Blum:96} subsequently gave an analogue of Rivest's
423: algorithm in the online learning model. The algorithm can learn any
424: decision list of length $k$ in $O(kn^2)$ time using $O(kn)$ examples.
425:
426: \item A brute-force approach to learning decision lists of length $k$ is
427: to maintain a collection of all such lists which are consistent with the
428: examples seen so far, and to predict at each stage using majority vote
429: over the surviving hypotheses. This ``halving algorithm'' (proposed in
430: various forms by Barzdin and Freivald \cite{BarzdinFreivald:72}, Mitchell
431: \cite{Mitchell:82}, and Angluin \cite{Angluin:88}) can learn decision
432: lists of length $k$ using only $O(k \log n)$ examples, but the running
433: time is $n^{O(k)}.$
434:
435: \item Several researchers \cite{Blum:96,Valiant:99} have observed that
436: Littlestone's well-known Winnow algorithm \cite{Littlestone:88} can learn
437: decision lists of length $k$ from $2^{O(k)} \log n$ examples in time
438: $2^{O(k)} n \log n$. This follows from the observation that decision lists
439: of length $k$ can be viewed as linear threshold functions with integer
440: coefficients of magnitude $2^{\Theta(k)}$. We note that our algorithm in
441: this paper always has improved sample complexity over the basic Winnow
442: algorithm, and for $k \geq (\log n)^{3/2}$ our approach improves on the
443: time complexity of Winnow as well.
444:
445: \item Finally, several researchers have considered the special
446: case of learning a decision list of length $k$ over $n$ variables
447: in which the output bits of the decision list have at most $D$
448: alternations. Valiant \cite{Valiant:99}
449: and Nevo and El-Yaniv \cite{NevoElYaniv:02}
450: have given refined analyses of Winnow's performance for this
451: special case, and Dhagat and Hellerstein \cite{DhagatHellerstein:94}
452: have also studied this problem. However, for the general case
453: in which $D$ can be as large as $k,$ the results thus obtained
454: do not improve on the straightforward Winnow analysis
455: described in the previous bullet.
456:
457: \end{itemize}
These previous algorithmic results are summarized in Table \ref{table:results}. We observe
459: that all of these earlier algorithms have an exponential dependence on the
460: relevant parameter(s) ($k$ and $\log n$ for sample complexity, $k$ for
461: running time) for either the running time or the sample complexity.
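
The linear threshold representation behind the Winnow bullet above can be
written out explicitly (a standard construction, sketched here with
$\ell_i(x) \in \{0,1\}$ denoting the truth value of the $i$-th literal and
$b_i \in \{-1,1\}$ the $i$-th output bit, as in Section \ref{sec:prelims}):
$$
L(x) = \mathrm{sign}\left( \sum_{i=1}^{k} 2^{k+1-i}\, b_i\, \ell_i(x) + b_{k+1} \right).
$$
If $i$ is the smallest index with $\ell_i(x) = 1$, the term $2^{k+1-i} b_i$
outweighs all later terms, whose total magnitude is at most
$\sum_{j=i+1}^{k} 2^{k+1-j} + 1 = 2^{k+1-i} - 1$. The coefficients have
magnitude up to $2^{k}$, which is the source of the $2^{O(k)}$ factors in
Winnow's bounds.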
462:
463:
464: \begin{table}[h]
465: \centerline{
466: \begin{tabular}{|l|l|l|} \hline
467: \strutje Reference: & Number of examples: & Running time: \\
468: \hline\hline
469: \strutje Rivest / Littlestone
470: & $ O(kn)$
471: & $ O(kn^2) $ \\ \hline
472: \strutje Halving algorithm
473: & $ O(k \log n)$
474: & $ n^{O(k)} $ \\ \hline
475: \strutje Winnow algorithm
476: & $2^{O(k)} \log n$
477: & $2^{O(k)}n \log n$ \\ \hline
478: \strutje This Paper
479: & $ 2^{\tilde{O}(k^{1/3})}\log n $
480: & $ n^{\tilde{O}(k^{1/3})} $ \\ \hline
481: \end{tabular}
482: }
483: \caption{Comparison of known algorithms for
484: learning decision lists of length $k$ on $n$ variables.
485: }
486: \label{table:results}
487: \end{table}
488:
489: \subsection{Previous Results: Parity Functions}
490:
491: Little previous work has been published on learning parity
492: functions attribute efficiently in the PAC model. The standard PAC learning
493: algorithm for parity (based on solving a system of linear equations) is due
494: to Helmbold {\em et al.\@} \cite{HSW:92}; however as described above this
495: algorithm is not attribute efficient since it uses $\Omega(n)$ examples.
496:
497: Several authors have considered learning parity attribute efficiently in a model
498: where the learner is allowed to make membership queries. Attribute efficient
499: learning is easier in this framework since membership queries can help identify relevant variables.
Blum {\em et al.} \cite{BHL:95} give a randomized polynomial time membership-query
501: algorithm for learning parity on $k$ variables using only $O(k \log
502: n)$ examples. These results were later
503: refined by Uehara {\em et al.} \cite{UTW:97}.
504:
505:
506:
507: \subsection{Organization}
508:
509: In Section \ref{sec:prelims} we give the necessary background on
510: online learning and polynomial threshold functions. In Section
511: \ref{sec:winnow} we show how known results from learning theory enable
512: us to reduce the decision list learning problem to a problem of
513: finding suitable polynomial threshold function representations of
514: decision lists. In Sections \ref{subsec:outer} and \ref{subsec:inner}
515: we give two different proofs of a weak tradeoff between degree and
516: weight for polynomial threshold function representations of decision
517: lists, and in Section \ref{subsec:compose} we combine these techniques
518: to prove Theorem \ref{thm:ptf}. In Section \ref{sec:decisiontree} we
519: show how to apply our techniques to give a tradeoff between sample
520: complexity and running time for learning decision trees. In Section
521: \ref{sec:discuss} we discuss the connection with Beigel's ODDMAXBIT
522: lower bound and related issues. In Section \ref{sec:parity} we give
523: our new algorithm for learning parity functions, and in Section
524: \ref{sec:future} we suggest directions for future work.
525:
526: \section{Preliminaries} \label{sec:prelims}
527:
528:
529: Attribute efficient learning has been chiefly studied in the {\em on-line
530: mistake-bound} model of concept learning which was introduced in
531: \cite{Littlestone:88,Littlestone:89}. In this model learning proceeds in
532: a series of trials, where in each trial the learner is given an unlabelled
533: boolean example $x \in \{0,1\}^n$ and must predict the value $f(x)$ of the
534: unknown target function $f.$ After each prediction the learner is given
535: the true value of $f(x)$ and can update its hypothesis before the next
536: trial begins. The {\em mistake bound} of a learning algorithm on a target
concept $f$ is measured by the worst-case number of mistakes that the
538: algorithm makes over all (possibly infinite) sequences of examples, and
539: the mistake bound of a learning algorithm on a concept class (class of
540: Boolean functions) $C$ is the worst-case mistake bound across all
541: functions $f \in C.$ The running time of a learning algorithm $A$ for a
542: concept class $C$ is defined as the product of the mistake bound of $A$ on
543: $C$ times the maximum running time required by $A$ to evaluate its
544: hypothesis and update its hypothesis in any trial.
545:
546:
547: Our main interests in this paper are the classes of {\em decision
548: lists} and {\em parity functions}.
549:
550: A decision list $L$ of length $k$ over the Boolean variables
551: $x_1,\dots,x_n$ is represented by a list of $k$ pairs and a bit
552: $$
553: (\ell_1,b_1),(\ell_2,b_2),\dots,(\ell_k,b_k),b_{k+1}
554: $$
555: where each $\ell_i$ is a literal and each $b_i$ is either $-1$ or $1.$
556: Given any $x \in \{0,1\}^n,$ the value of $L(x)$ is $b_i$ if $i$ is the
557: smallest index such that $\ell_i$ is made true by $x$; if no $\ell_i$ is
558: true then $L(x)=b_{k+1}.$
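
For example (an illustrative list of our own), let $k = 3$ and
$$
L = (x_1, 1),\ (\overline{x}_3, -1),\ (x_2, 1),\ -1 .
$$
On $x = (0,1,1)$ the literals $x_1$ and $\overline{x}_3$ are false and $x_2$
is true, so $L(x) = b_3 = 1$. On $x = (0,0,0)$ the first true literal is
$\overline{x}_3$, so $L(x) = b_2 = -1$. On $x = (0,0,1)$ no literal is true,
so $L(x) = b_4 = -1$.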
559:
560: A parity function of length $k$ is defined by a set of variables $S
561: \subset \{x_{1},\ldots,x_{n}\}$ such that $|S| = k$. The
562: parity function $\chi_{S}(x)$ takes value $1$ on inputs which set
563: an even number of variables in $S$ to $1$ and takes value $-1$ on
564: inputs which set an odd number of variables in $S$ to $1.$
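
Equivalently,
$$
\chi_S(x) = (-1)^{\sum_{x_i \in S} x_i} ,
$$
which is the $\pm 1$ form of the sum modulo 2 used in the introduction. As a
small example of our own, for $S = \{x_1, x_3\}$ we have $\chi_S(1,0,1) = 1$
(two variables of $S$ are set to $1$) and $\chi_S(1,1,0) = -1$ (only $x_1$ is
set to $1$).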
565:
566: Given a concept class $C$ over $\{0,1\}^n$ and a Boolean function $f \in
567: C,$ let size$(f)$ denote the description length of $f$ under some
568: reasonable encoding scheme. (Note that if $f$ has $r$ relevant variables
569: then size$(f)$ will be at least $r \log n$ since this many bits are
570: required just to specify which variables are relevant). We say that a
571: learning algorithm $A$ for $C$ in the mistake-bound model is {\em
attribute-efficient} if the mistake bound of $A$ on any concept $f \in C$
573: is polynomial in size$(f).$ In particular, the description length of a
574: length $k$ decision list (parity) is $O(k \log n)$, and thus we would ideally like
575: to have an algorithm which learns decision lists (parities) of length $k$ with a
576: mistake bound of poly$(k,\log n)$ and runs in time poly$(n).$
577:
578:
579: (We note here that attribute efficiency has also been studied in other
580: learning models, namely Valiant's Probably Approximately Correct (PAC)
581: model of learning from random examples. Standard conversion techniques
582: are known \cite{Angluin:88,Haussler:88b,Littlestone:89b}
583: which can be used to
584: transform any mistake bound algorithm into a PAC learning algorithm.
585: This transformation essentially preserves the running time of the mistake
586: bound algorithm, and the sample size required by the PAC algorithm is
587: essentially the mistake bound. Thus, positive results for mistake bound
588: learning, such as those we give for decision lists in this paper, directly yield
589: corresponding positive results for the PAC model.)
590:
591: Finally, our results for decision lists are achieved by a careful
592: analysis of {\em polynomial threshold functions}. Let $f$ be a
593: Boolean function $f:\{0,1\}^{n} \to \{-1,1\}$ and let $p$ be a
594: polynomial in $n$ variables with integer coefficients. Let $d$ denote
595: the degree of $p$ and let $W$ denote the sum of the absolute values of
596: $p$'s integer coefficients. If the sign of $p(x)$ equals $f(x)$ for
597: every $x \in \{0,1\}^n,$ then we say that $p$ is a {\em polynomial
598: threshold function} of degree $d$ and weight $W$ for $f.$
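
As a small illustration (our own example, not one of the constructions
analyzed later), consider the length $3$ decision list
$L = (x_1,1),(\overline{x}_3,-1),(x_2,1),-1$. Writing the negated literal as
$1 - x_3$, the polynomial
$$
p(x) = 8x_1 - 4(1 - x_3) + 2x_2 - 1 = 8x_1 + 2x_2 + 4x_3 - 5
$$
satisfies $\mathrm{sign}(p(x)) = L(x)$ on all of $\{0,1\}^3$; for instance
$p(0,1,1) = 1 > 0$ and $L(0,1,1) = 1$, while $p(0,0,0) = -5 < 0$ and
$L(0,0,0) = -1$. Thus $p$ is a polynomial threshold function for $L$ of degree
$d = 1$ and weight $W = 8 + 2 + 4 + 5 = 19$, counting the constant term as a
coefficient.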
599:
600:
601: \section{Expanded-Winnow: Learning Polynomial Threshold Functions} \label{sec:winnow}
602:
603: Littlestone introduced the online Winnow algorithm in 1988 and showed
that it learns Boolean conjunctions, disjunctions, and low weight
linear threshold functions attribute efficiently. Throughout
606: its execution Winnow maintains a linear threshold function as its
607: hypothesis; at the heart of the algorithm is a novel update rule which
608: makes a {\em multiplicative} update to each coefficient of the
609: hypothesis (rather than an additive update as in the Perceptron
610: algorithm) each time a mistake is made. Since its introduction Winnow
611: has been intensively studied from both applied and theoretical
612: standpoints (see
613: e.g. \cite{Blum:97,GoldingRoth:99,KWA:97,Servedio:02sicomp}) and
614: multiplicative updates have become widespread in machine learning
615: algorithms.
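
In its simplest form the update can be sketched as follows (this is the
variant usually called Winnow2, for nonnegative weights; the promotion factor
$\alpha > 1$ and the threshold $\theta$ are parameters of the sketch, and the
details needed to handle negative coefficients are omitted). Winnow keeps
weights $w_1,\dots,w_n$, all initially $1$, predicts by comparing
$\sum_{i} w_i x_i$ with $\theta$, and on a mistake updates every coefficient by
$$
w_i \leftarrow \left\{ \begin{array}{ll}
w_i \cdot \alpha^{x_i} & \mbox{if the correct label is positive,} \\
w_i \cdot \alpha^{-x_i} & \mbox{if the correct label is negative.}
\end{array} \right.
$$
Only the coefficients of attributes with $x_i = 1$ change, and they change by
a constant factor rather than by an additive amount.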
616:
617: The following theorem (which, as noted in \cite{Valiant:99}, is implicit
618: in Littlestone's analysis in \cite{Littlestone:88}) gives a
619: mistake bound for Winnow when learning linear threshold functions:
620:
621: \begin{theorem} \label{thm:winbound}
622: Let $f(x)$ be the linear threshold function
623: sign$(\sum_{i=1}^{n} w_{i}x_{i} - \theta)$
624: where $\theta$ and $w_{1},\ldots,w_{n}$ are
625: integers. Let $W = \sum_{i=1}^{n} |w_{i}|$. Then
626: Winnow learns $f(x)$ with mistake bound $O(W^{2} \log n)$,
627: and uses $n$ time steps per example.
628: \end{theorem}
629:
630: We will use a generalization of the Winnow algorithm, called
631: Expanded-Winnow, to learn {\em polynomial} threshold functions of
degree at most $d.$ Our generalization introduces $\sum_{i=1}^{d} {n
\choose i}$ new variables (one for each monomial of degree up to $d$)
and runs Winnow to learn a linear threshold function over these new
variables. More precisely, in each trial we convert the $n$-bit
received example $x=(x_1,\dots,x_n)$ into a $\sum_{i=1}^d {n \choose
i}$ bit expanded example (where the bits in the expanded example
638: correspond to monomials over $x_1,\dots,x_n$), and we give the
639: expanded example to Winnow. Thus the hypothesis which Winnow
640: maintains -- a linear threshold function over the space of expanded
641: features -- is a polynomial threshold function of degree $d$ over the
642: original $n$ variables $x_1,\dots,x_n.$ Theorem \ref{thm:win}, which
643: follows directly from Theorem \ref{thm:winbound}, summarizes the
644: performance of Expanded-Winnow:
645:
646: \medskip
647:
648: \noindent {\bf Theorem \ref{thm:win}}
649: {\em Let ${\cal C}$ be a class of Boolean functions over
650: $\{0,1\}^n$ with the property that each $f \in {\cal C}$ has a polynomial
651: threshold function of degree at most $d$ and weight at most $W.$ Then
the Expanded-Winnow algorithm runs in $n^d$
653: time per example and has mistake bound $O(W^{2} \cdot d \cdot \log n)$ for
654: ${\cal C}.$
655: } \\
656:
657: Theorem \ref{thm:win} shows that the degree of a polynomial threshold
658: function corresponds to Expanded-Winnow's running time, and the weight of
659: a polynomial threshold function corresponds to its sample complexity.
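
\medskip

\noindent {\bf Example (feature count).} The following back-of-the-envelope
calculation, which is our own illustration rather than part of the formal
development, indicates why Theorem \ref{thm:win} follows from
Theorem \ref{thm:winbound}. Write $N$ for the number of expanded features
(the symbol $N$ is introduced only for this example). Since
${n \choose i} \leq n^i \leq n^d$ for $1 \leq i \leq d$,
$$
N = \sum_{i=1}^{d} {n \choose i} \leq d \cdot n^d,
\qquad \mbox{so} \qquad
\log N \leq d \log n + \log d = O(d \log n).
$$
Applying Theorem \ref{thm:winbound} with $N$ in place of $n$ thus gives a
mistake bound of $O(W^2 \log N) = O(W^2 \cdot d \cdot \log n)$.
For instance, $n = 100$ and $d = 2$ give $N = 100 + 4950 = 5050$
expanded features.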
660:
661: \ignore{
662:
663: \begin{figure*}[t] \label{fig:vw}
664: \begin{small}
665:
666: \noindent {\bf Algorithm V-Winnow:} \\
667:
668: \noindent {\bf Input: } A sequence of trials from a polynomial $p$ in $n$ variables $\{x_{1},\ldots,x_{n}\}$ of degree $d$ where each \mbox{~~~~~~~~~~~~~~}coefficient is at most $w$.
669:
670: \vskip.1in
671:
672: \noindent {\bf Output: } A polynomial $p'$ in $n$ variables of degree $d$
673: such that for every $x \in \{0,1\}^{n}$, $p'(x) = p(x)$.
674:
675: \medskip
676:
677: \begin{enumerate}
678:
679: \item Lexicographically order all $m = n^{d}$ monomials of degree at most
680: $d$ over the variables $\{x_{1},\ldots,x_{n}\}$.
681:
682: \item Introduce new variables $y_{1},\ldots,y_{m}$ such that $y_{i}$ is
683: equal to the $i$th monomial in Step 1.
684:
685: \item Run Winnow over the variables $y_{1},\ldots,y_{m}$ where on example
686: $(a,f(a))$, $y_{i}$ is equal to the $i$th monomial on assignment $a$.
687:
688: \item Let $h = \sum_{i=1}^{m} \alpha_{i}y_{i}$ be the output of Winnow.
689:
690: \item Return $h$ with each $y_{i}$ written as the $i$th monomial over
691: $\{x_{1},\ldots,x_{n}\}$.
692:
693: \end{enumerate}
694:
695: \end{small}
696: \caption{The V-Winnow algorithm.}
697: \end{figure*}
698:
699:
700: \begin{theorem} \label{thm:vwbound}
701: Let ${\cal C}$ be a class of Boolean functions over $\{0,1\}^n$
702: with the property that for each $f \in {\cal C}$,
703:
704: \begin{itemize}
705:
706: \item $f$ depends on at most $k$ variables
707:
708: \item $f$ is computed by a polynomial threshold function of degree at most
709: $d$ where each coefficient is an integer weight of at most $w$.
710:
711: \end{itemize}
712: Then {\tt V-Winnow} is an online learning algorithm for ${\cal C}$ which
713: uses $n^d$ time steps per example and has mistake bound $(w \cdot
714: k^{d})^{2} \cdot d \cdot \log n.$ The output hypothesis will be a
715: polynomial threshold function equivalent to $f$.
716:
717: \end{theorem}
718:
719: \begin{proof}
720: Let $f$ be a function of $k$ variables computed by a polynomial threshold
721: function $p$ of degree $d$ where each coefficient is of weight at most
722: $w$. We will now apply the algorithm {\tt V-Winnow} outlined in Figure
723: \ref{fig:vw}. Fix a lexicographic ordering of all monomials of degree $d$
724: over $n$ variables and let $y_{i}$ be the $i$th monomial in this list.
725: Then $f$ can be written as a linear threshold function $h$ over the
726: variables $y_{i}$, i.e. $f = h = \sum_{i=1}^{m} a_{i}y_{i}$ for some
727: integer coefficients $a_{i} \leq w$. Since $f$ depends on only $k$
728: variables, at most $k^{d}$ of the variables in $h$ have nonzero
729: coefficients. Now run the standard Winnow algorithm to learn $h$ (for
730: every example $(a_{1},\ldots,a_{n}, f(a_{1},\ldots,a_{n}))$, set $y_{i}$
731: equal to the $i$th monomial on input $a_{1},\ldots,a_{n}$.) Applying
732: Theorem \ref{thm:winbound}, the standard Winnow algorithm (and hence
733: V-Winnow) will make at most $(w \cdot k^{d})^{2} \cdot d \cdot \log n$
734: mistakes and output a linear threshold function over the $y_{i}$'s
735: equivalent to $h$. Replacing each $y_{i}$ with the $i$th monomial over
736: $\{x_{1},\ldots,x_{n}\}$ we obtain a polynomial threshold function
737: equivalent to $f$. The time bound also follows directly from Theorem
738: \ref{thm:winbound}.
739: \end{proof}
740:
741: }
742:
743: \section{Constructing Polynomial Threshold Functions for Decision Lists}
744:
745: In previous constructions of polynomial threshold functions for
746: computational learning theory applications
747: \cite{KlivansServedio:01,KOS:02,OdonnellServedio:03a} the sole goal has
748: been to minimize the {degree} of the polynomials regardless of the size of
749: the coefficients. As an extreme example, the construction of
750: \cite{KlivansServedio:01} of $\tilde{O}(n^{1/3})$ degree polynomial
751: threshold functions for DNF formulae yields polynomials whose coefficients
752: can be {\em doubly exponential} in the degree. In contrast,
753: given Theorem \ref{thm:win} we must now
754: construct polynomial threshold functions that have low degree and low
755: weight.
756:
757: We give two constructions of polynomial threshold functions for decision lists, each of which
758: has relatively low degree \ignore{($k^{1/2}$)}
759: and relatively low weight.
760: \ignore{($2^{\tilde{O}(k^{1/2})}$).}
761: We then combine
762: these approaches to achieve an optimal construction with improved bounds on both
763: degree and weight.\ignore{with degree $k^{1/3}$
764: and weight $2^{\tilde{O}(k^{1/3})}.$}
765:
766: \subsection{Outer Construction} \label{subsec:outer}
767:
768: Let $L$ be a decision list of length $k$ over variables $x_1,\dots,x_k.$
We first give a simple construction of a degree $h$, weight
$4 \cdot 2^{k/h + h}$ polynomial threshold function for $L$ which is based on
breaking the list $L$ into sublists. We call this construction the
``outer construction'' since we will ultimately combine it
with a different construction for the ``inner'' sublists.
774:
775: We begin by showing that $L$ can be expressed as a threshold of {\em
776: modified decision lists} which we now define. The set ${\cal B}_h$ of
777: modified decision lists is defined as follows:
778: each function in ${\cal B}_h$ is a decision list
779: $(\ell_1,b_1),(\ell_2,b_2),\dots, (\ell_h,b_h),0$ where each $\ell_i$ is
780: some literal over $x_1,\dots,x_n$ and each $b_i \in \{-1,1\}.$ Thus the
781: only difference between a modified decision list $f \in {\cal B}_h$ and a
782: normal decision list of length $h$ is that the final output value is
783: $0$ rather than $b_{h+1} \in \{-1,+1\}.$
784:
785: Without loss of generality we may suppose that the list $L$ is
786: $(x_1,b_1),\dots,(x_k,b_k),b_{k+1}.$ We break $L$ sequentially into $k/h$
787: blocks each of length $h$. Let $f_{i} \in {\cal B}_h$ be the modified
788: decision list which corresponds to the $i$-th block of $L,$ i.e. $f_i$ is
the list $(x_{(i-1) h + 1},b_{(i-1)h+1}),\ldots,
(x_{ih},b_{ih}),0$. Intuitively $f_{i}$ computes the $i$th block of $L$
and equals $0$ only if we ``fall off the edge'' of the $i$th block. We then
792: have the following straightforward claim:
793:
794: \begin{claim} \label{cla:outer}
The decision list $L$ is equivalent to
796: \begin{eqnarray}
797: \mbox{sign}\left(\sum_{i=1}^{k/h}
798: 2^{k/h - i + 1} f_{i}(x) \ + \ b_{k+1} \right). \label{eq:outer}
799: \end{eqnarray}
800: \end{claim}
801: \begin{proof}
Given an input $x \neq 0^k$ let $r=(i-1)h + c$ (with $1 \leq c \leq h$) be the first index such that $x_r = 1$.
803: It is easy to see that $f_j(x) = 0$ for $j<i$ and hence the value in
804: (\ref{eq:outer}) is $2^{k/h - i + 1}b_{r} + \sum_{j=i+1}^{k/h}
805: 2^{k/h - j + 1} f_{j}(x) \ + \ b_{k+1}$,
806: the sign of which is easily seen to be $b_r.$
807: Finally if $x=0^k$ then the argument to (\ref{eq:outer}) is $b_{k+1}$.
808: \end{proof}
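
\medskip

\noindent {\bf Example.} To make Claim \ref{cla:outer} concrete, consider
the following small instance (chosen by us purely for illustration):
$k = 4$ and $h = 2$, so that
$L = (x_1,b_1),(x_2,b_2),(x_3,b_3),(x_4,b_4),b_5$ splits into
$f_1 = (x_1,b_1),(x_2,b_2),0$ and $f_2 = (x_3,b_3),(x_4,b_4),0$.
Expression (\ref{eq:outer}) becomes
$$
\mbox{sign}\left(4 f_1(x) + 2 f_2(x) + b_5\right).
$$
On $x = 0010$ we have $f_1(x) = 0$ and $f_2(x) = b_3$, so the argument
is $2b_3 + b_5$, whose sign is $b_3 = L(x)$. On $x = 1011$ we have
$f_1(x) = b_1$ and $f_2(x) = b_3$, so the argument is
$4b_1 + 2b_3 + b_5$; since $|2b_3 + b_5| \leq 3 < 4$ its sign is
$b_1 = L(x)$.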
809:
810: \medskip \noindent {\bf Note:} It is easily seen that we can replace
811: the $2$ in formula (\ref{eq:outer}) by a 3; this will prove
812: useful later.
813:
814: \medskip
815:
816: As an aside, note that Claim \ref{cla:outer} can already be used to obtain a tradeoff
817: between running time and sample complexity for learning decision lists.
818: The class ${\cal B}_h$ contains at most $(4n)^h$ functions.
819: Thus as in Section \ref{sec:winnow}
820: it is possible to run the Winnow algorithm using the functions in ${\cal B}_h$ as the base features
821: for Winnow. (So for each example $x$ which it receives, the algorithm would first compute
822: the value of $f(x)$ for each $f \in {\cal B}_h$, and would then use this vector of $(f(x))_{f \in {\cal B}_h}$
823: values as the example point for Winnow.) A direct analogue of Theorem
824: \ref{thm:win} now implies
825: that Expanded-Winnow (run over this expanded feature space of functions from
826: ${\cal B}_h$) can be used to learn
decision lists of length $k$ in time $n^{O(h)}2^{O(k/h)}$ with mistake bound $2^{O(k/h)} h \log n$.
828:
829: However, it will be more useful for us to obtain a polynomial threshold function for $L$. We
830: can do this from Claim \ref{cla:outer} as follows:
831:
832:
833: \begin{theorem} \label{thm:outer}
834: Let $L$ be a decision list of length $k$. Then for any $h < k$
835: we have that $L$ is computed by a
836: polynomial threshold function of degree $h$
837: and weight $4 \cdot 2^{k/h + h}$.
838: \end{theorem}
839:
840: \begin{proof}
841: Consider the first modified decision list $f_1 = (\ell_1,b_1),(\ell_2,b_2),\dots,(\ell_h,b_h),0$
842: in the expression (\ref{eq:outer}). For $\ell$ a literal let $\tilde{\ell}$ denote $x$
if $\ell$ is an unnegated variable $x$ and let $\tilde{\ell}$ denote $1-x$
if $\ell$ is a negated variable $\overline{x}.$
845: We have that for all $x \in \{0,1\}^h$, $f_1(x)$ is computed exactly by
846: the polynomial
847: $$
848: f_1(x) = \tilde{\ell}_1b_1 + (1-\tilde{\ell}_1)\tilde{\ell}_2 b_2 +
849: (1-\tilde{\ell}_1)(1-\tilde{\ell}_2)\tilde{\ell}_3 b_3 + \cdots +
850: (1-\tilde{\ell}_1)\cdots(1-\tilde{\ell}_{h-1})\tilde{\ell}_h b_h.
851: $$
852: This polynomial has degree $h$ and has weight at most $2^{h+1}.$
853: Summing these polynomial representations for $f_1,\dots,f_{k/h}$
854: as in (\ref{eq:outer}) we see
855: that the resulting polynomial threshold function given by (\ref{eq:outer})
856: has degree $h$ and weight at most $2^{k/h + 1} \cdot 2^{h+1} =
857: 4 \cdot 2^{k/h + h}.$
858: \end{proof}
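
\medskip

\noindent {\bf Example.} As a small sanity check of the bounds in
Theorem \ref{thm:outer} (an illustrative instance of our own, with all
literals taken to be unnegated variables), let $k=4$ and $h=2$.
Expanding the two blocks gives
$$
f_1(x) = b_1x_1 + b_2x_2 - b_2x_1x_2, \qquad
f_2(x) = b_3x_3 + b_4x_4 - b_4x_3x_4,
$$
each of degree $2$ and weight $3 \leq 2^{h+1} = 8$. Substituting into
(\ref{eq:outer}) yields the polynomial $4f_1(x) + 2f_2(x) + b_5$, which
has degree $2$ and weight $4 \cdot 3 + 2 \cdot 3 + 1 = 19$, comfortably
below the bound $4 \cdot 2^{k/h + h} = 64$.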
859:
860: \medskip
861:
862: Specializing to the case $h=\sqrt{k}$ we obtain:
863:
864: \begin{corollary} \label{cor:outer}
865: Let $L$ be a decision list of length $k$.
866: Then $L$ is computed by a polynomial threshold function of
867: degree $k^{1/2}$ and weight $4 \cdot 2^{2k^{1/2}}.$
868: \end{corollary}
869:
870: We close this section by observing that an intermediate result
871: of \cite{KlivansServedio:01} can be used to give an alternate proof
872: of Corollary \ref{cor:outer} with slightly weaker parameters;
873: see Appendix \ref{ap:alt}.
874:
875: \subsection{Inner Approximator} \label{subsec:inner}
876:
877: In this section we construct low degree, low weight
878: polynomials which approximate (in the $L_\infty$ norm)
879: the modified decision lists from the previous subsection. Moreover,
880: the polynomials we construct
881: are exactly correct on inputs which ``fall off the end'':
882: \ignore{
883: We refer to these modified decision lists as the ``inner'' decision lists.
884: The construction is stronger than a polynomial threshold function;
885: the polynomial we give for an inner decision list is actually
886: a good approximator with respect to the
887: $L_{\infty}$ norm (and is exactly right on the input $0^h$):
888: }
889:
890: \begin{theorem} \label{thm:inner}
891: Let $f \in {\cal B}_h$ be a modified decision list of length $h$
892: (without loss of generality we may assume that $f$ is
893: $(x_1,b_1),\dots,(x_h,b_h),0$).
894: Then there is a degree $2\sqrt{h}\log{h}$
895: polynomial $p$ such that
896: \begin{itemize}
897: \item for every input $x \in \{0,1\}^h$ we have $|p(x) - f(x)| \leq 1/h$.
898: \item $p(0^h) = f(0^h) = 0$.
899: \end{itemize}
900: \end{theorem}
901: \begin{proof}
902: As in the proof of Theorem \ref{thm:outer} we have that
903: \[ f(x) = b_{1}x_{1} + b_{2}(1-x_{1})x_{2} + \cdots +
904: b_{h}(1-x_{1})\cdots(1-x_{h-1})x_{h}.
905: \]
906: We will construct a lower (roughly $\sqrt{h}$) degree polynomial which
907: closely approximates $f$. Let $T_{i}$ denote $(1-x_1)\dots(1-x_{i-1})x_i$,
908: so we can rewrite $f$ as
909: \[ f(x) = b_{1}T_{1} + b_{2}T_{2} + \cdots + b_{h}T_{h}. \]
910:
911: We approximate each $T_i$ separately as follows:
912: set $A_{i}(x) = h-i + x_{i} + \sum_{j=1}^{i-1} (1 - x_{j})$.
913: Note that for $x \in \{0,1\}^h,$ we have
914: $T_i(x) = 1$ iff $A_i(x) = h$ and $T_i(x) = 0$
915: iff $0 \leq A_i(x) \leq h-1.$
916: Now define the polynomial
917: $$
918: Q_{i}(x) = q \left(A_{i}(x)/h \right) \mbox{~~~~~where~~~~~}
919: q(y) = C_d\left(y \left(1 + 1/h \right) \right).
920: $$
921:
922: \noindent As in \cite{KlivansServedio:01},
923: here $C_{d}(x)$ is the $d$th Chebyshev polynomial of the
924: first kind (a univariate polynomial of degree $d$)
925: with $d$ set to $\lceil \sqrt{h} \rceil$.
926: We will need the following facts about Chebyshev polynomials
927: \cite{Cheney:66}:
928: \begin{itemize}
929: \item $|C_d(x)| \leq 1$ for $|x| \leq 1$ with $C_d(1) = 1;$
930: \item $C_d^\prime(x) \geq d^2$ for $x > 1$ with $C_d^\prime(1) = d^2.$
931: \item The coefficients of $C_{d}$ are integers each of whose
932: magnitude is at most $2^d$.
933: \end{itemize}
934: These first two facts imply that $q(1) \geq 2$ but $|q(y)| \leq 1$
935: for $y \in [0,1 - {\frac 1 h}].$ We
936: thus have that $Q_i(x) = q(1) \geq 2$ if $T_i(x) = 1$
937: and $|Q_i(x)| \leq 1$ if $T_i(x) = 0.$
938: Now define
939: $
940: P_i(x) = \left({\frac {Q_i(x)}{q(1)}}\right)^{2 \log h}.
941: $
942: This polynomial is easily seen to be a good approximator for $T_i$:
943: if $x \in \{0,1\}^h$ is such that $T_i(x) = 1$ then $P_i(x) = 1$,
944: and if $x \in \{0,1\}^h$ is such that $T_i(x) = 0$ then
$|P_i(x)| < \left({\frac 1 2}\right)^{2 \log h} = {\frac 1 {h^2}}.$
946:
947: Now define
$R(x) = \sum_{i=1}^{h} b_iP_{i}(x)$ and $p(x) = R(x) - R(0^h).$
949: \ignore{
950: We will see that $Q_{i}(x) > 2$ on assignments $x$ for which
951: $T_{i}(x)=0$, while $|Q_i(x)|\leq 1$ on assignments for which
952: $T_{i}(x)$ output $s_{i}$. To
953: strengthen this separation we define the following polynomial
954: $P_{i}(x) = (1/\ell^{2}) Q_{i}(x)^{2 \log \ell}$ and to approximate
955: all of $b$ we set $R(x) = \sum_{i=1}^{\ell} P_{i}(x)$.
956: }
957: It is clear that $p(0^h)=0.$
958: We will show that for every input $0^h \neq x \in \{0,1\}^h$ we have
959: $|p(x) - f(x)| \leq {1/h}$. Fix some such $x$; let $i$ be the first
960: index such that $x_i = 1.$ As shown above we have
961: $P_i(x) = 1.$ Moreover, by inspection of $T_j(x)$ we have that
962: $T_j(x) = 0$ for all $j \neq i,$
963: and hence $|P_j(x)| < {\frac 1 {h^2}}$. Consequently
964: the value of $R(x)$ must lie in $[b_i - {\frac {h-1}{h^2}},
965: b_i + {\frac {h-1}{h^2}}]$. Since $f(x) = b_i$ we have that
966: $p(x)$ is an $L_\infty$ approximator for $f(x)$ as desired.
967:
968: Finally, it is straightforward to verify that $p(x)$ has the claimed
969: bound on degree.
970: \end{proof}
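
\medskip

\noindent {\bf Example.} The following numerical instance, which is our
own illustration and assumes logarithms to base $2$, shows the separation
achieved by $P_i$. Take $h = 4$, so $d = \lceil \sqrt{h} \rceil = 2$ and
$C_2(x) = 2x^2 - 1$. Then
$$
q(y) = C_2\left({\frac {5y} 4}\right) = {\frac {25 y^2} 8} - 1,
\qquad q(1) = {\frac {17} 8} \geq 2,
$$
while on the remaining possible values
$y = A_i(x)/h \in \{0, {\frac 1 4}, {\frac 1 2}, {\frac 3 4}\}$ we get
$q(y) = -1, -{\frac {103}{128}}, -{\frac 7 {32}}, {\frac {97}{128}}$
respectively, all of magnitude at most $1$. Since $2 \log h = 4$,
whenever $T_i(x) = 0$ we have
$$
|P_i(x)| \leq \left({\frac 8 {17}}\right)^{4} = {\frac {4096}{83521}}
< {\frac 1 {16}} = {\frac 1 {h^2}}.
$$
For such a small $h$ the degree $2 \cdot 4 = 8$ of $P_i$ exceeds $h$, so
this instance only illustrates the arithmetic; the degree savings appear
for large $h$.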
971:
972: \ignore{
973: \noindent Now fix any nonzero assignment to the variables $x$ that
974: causes $b$ to output $1$. From the definition of $b$ there exists a
975: unique term $T_{i}$ that is not set to zero by $x$. Then for the
976: corresponding arithmetization $A_{i}$ we have $A_{i}/i= 1$, so $2 \leq
977: Q_{i}(x) \leq 2.01 $ and hence $1 \leq P_{i}(x) \leq 1.1$. Similarly
978: if $x$ causes $b$ to output $-1$ then $-1 \leq P_{i}(x) \leq -.9$. \\
979:
980: \noindent Let $T_{j}$ be any term that is set to zero by x, and so
981: $A_{j}(x) \leq 1 - 1/\ell$. Then $|Q_{i}(x)| \leq 1$ and thus
982: $|P_{i}(x)| \leq 1/\ell^{2}$. Hence for any nonzero assignment $x$,
983: $|R(x) - b(x)| \leq \mbox{{\bf $\eps$ from cheby approx +
984: $1/\ell$}}$. Notice also that $|R(\overline{0})| \leq 1/\ell.$ Thus
985: for any nonzero assignment $x$, $|H(x) - b(x)| \leq 2/\ell$ and
986: clearly $H(\overline{0}) = 0$.
987: }
988:
989: \medskip
990:
991: Strictly speaking we cannot discuss the weight of the polynomial
992: $p$ since its coefficients are rational numbers but not
993: integers. However, by multiplying $p$ by a suitable integer
994: (clearing denominators) we obtain an integer polynomial
995: with essentially the same properties.
996: Using the third fact about Chebyshev polynomials from our
997: proof above, we have that $q(1)$ is a rational number $N_1/N_2$ where
998: $N_1,N_2$ are each integers of magnitude $h^{O(\sqrt{h})}.$
999: Each $Q_i(x)$ for $i=1,\dots,h$ can be written as an integer
1000: polynomial (of weight $h^{O(\sqrt{h})}$) divided by $h^{\sqrt{h}}.$
1001: Thus each $P_i(x)$ can be written as
1002: $\tilde{P}_i(x)/(h^{\sqrt{h}}N_1)^{2 \log h}$ where $\tilde{P}_i(x)$
1003: is an integer polynomial of weight $h^{O(\sqrt{h} \log h)}$.
1004: It follows that $p(x)$ equals $\tilde{p}(x)/C,$ where $C$
1005: is an integer which is at most $2^{O(h^{1/2} \log^2 h)}$
1006: and $\tilde{p}$ is a polynomial with integer coefficients and weight
1007: $2^{O(h^{1/2} \log^2 h)}.$ We thus have
1008:
1009: \begin{corollary}
1010: \label{cor:inner}
1011: Let $f \in {\cal B}_h$ be a modified decision list of length $h$.
1012: Then there is an integer polynomial
1013: $p(x)$
1014: of degree $2\sqrt{h}\log{h}$
1015: and weight $2^{O(h^{1/2} \log^2{h})}$ and an integer $C =
1016: 2^{O(h^{1/2} \log^2 h)}$ such that
1017: \begin{itemize}
1018: \item for every input $x \in \{0,1\}^h$ we have $|p(x) - Cf(x)| \leq C/h$.
1019: \item $p(0^h) = f(0^h) = 0$.
1020: \end{itemize}
1021: \end{corollary}
1022:
1023: The fact that $p(0^h)$ is exactly 0
1024: will be important in the next subsection when we combine the
1025: inner approximator with the outer construction.
1026:
1027: \subsection{Composing the Constructions} \label{subsec:compose}
1028:
1029: In this section we combine the two constructions from the previous
1030: subsections to obtain our main polynomial threshold construction:
1031:
1032: \begin{theorem} \label{thm:mainptf}
1033: Let $L$ be a decision list of length $k$. Then for any $h < k$,
1034: $L$ is computed by a polynomial threshold function of degree
1035: $O(h^{1/2} \log h)$
1036: and weight $2^{O(k/h + h^{1/2}\log^2 h)}.$
1037: \end{theorem}
1038: \begin{proof}
1039: We suppose without loss of generality that $L$ is the decision list
1040: $(x_1,b_1),\dots,(x_k,b_k),b_{k+1}.$
1041: We begin with the outer construction: from the note following
1042: Claim \ref{cla:outer} we have that
1043: $$L(x) =
1044: \mbox{sign}\left(C\left[\sum_{i=1}^{k/h}
1045: 3^{k/h - i + 1} f_{i}(x) \ + \ b_{k+1} \right]\right)
1046: $$
1047: where $C$ is the value from Corollary \ref{cor:inner} and
1048: each $f_{i}$ is a modified decision list of length $h$
1049: computing the restriction of $L$ to its $i$th block as defined in
1050: Subsection \ref{subsec:outer}.
1051: Now we use the inner approximator to replace each $Cf_i$ above
1052: by $p_i$, the approximating polynomial from Corollary
1053: \ref{cor:inner}, i.e. consider sign$(H(x))$ where
1054: $$
1055: H(x) = \sum_{i=1}^{k/h}
1056: (3^{k/h - i + 1} p_{i}(x)) \ + \ Cb_{k+1}.
1057: $$
1058: We will show that sign$(H(x))$
1059: is a polynomial threshold function which computes $L$ correctly
1060: and has the desired degree and weight.
1061:
1062: Fix any $x \in \{0,1\}^k.$ If $x=0^k$ then by Corollary
1063: \ref{cor:inner} each $p_i(x)$ is $0$ so $H(x) = C b_{k+1}$ has
1064: the right sign.
1065: Now suppose that $r=(i-1)h+c$ is the first index such that
1066: $x_r = 1.$ By Corollary \ref{cor:inner}, we have that
1067: \begin{itemize}
1068: \item $3^{k/h - j + 1}p_j(x) = 0$ for $j < i$;
1069: \item $3^{k/h - i + 1}p_i(x)$ differs from $3^{k/h - i + 1}Cb_r$ by at most
1070: $C3^{k/h - i + 1}\cdot {\frac 1 h}$;
1071: \item The magnitude of each value $3^{k/h - j + 1}p_j(x)$ is at most
1072: $C3^{k/h - j + 1}(1 + {\frac 1 h})$ for $j > i.$
1073: \end{itemize}
1074: Combining these bounds,
1075: the value of $H(x)$ differs from $3^{k/h - i + 1}Cb_r$ by at most
1076: $$
1077: C\left(
1078: {\frac {3^{k/h - i + 1}}{h}} +
1079: \left(1 + {\frac 1 h}\right)
1080: \left[3^{k/h - i} + 3^{k/h - i - 1} + \cdots + 3\right] + 1
1081: \right)
1082: $$
1083: which is easily seen to be less than $C3^{k/h - i + 1}$ in magnitude.
1084: Thus the sign of $H(x)$ equals $b_r$, and consequently sign$(H(x))$ is a
1085: valid polynomial threshold representation for $L(x).$ Finally,
1086: our degree and weight bounds from Corollary \ref{cor:inner}
1087: imply that
1088: the degree of $H(x)$ is $O(h^{1/2} \log h)$ and the weight
1089: of $H(x)$ is $2^{O(k/h) + O(h^{1/2}\log^2 h)}$, and the theorem
1090: is proved.
1091: \end{proof}
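
\medskip

\noindent {\bf Remark.} For completeness we spell out one way to verify
the inequality used in the proof above; the shorthand $m = k/h - i + 1$
is introduced only for this remark, and we assume $h \geq 3$. Since
$3^{m-1} + \cdots + 3 = (3^m - 3)/2$, the error bound equals
$$
C\left({\frac {3^m} h} + \left(1 + {\frac 1 h}\right){\frac {3^m - 3} 2} + 1\right)
= C\left(3^m\left({\frac 1 2} + {\frac 3 {2h}}\right) - {\frac 1 2} - {\frac 3 {2h}}\right).
$$
For $h \geq 3$ we have ${\frac 1 2} + {\frac 3 {2h}} \leq 1$, so this
quantity is strictly less than $C3^{m}$.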
1092:
1093: \medskip
1094:
1095: Taking $h = k^{2/3} / \log^{4/3}k$ in the above theorem we obtain our
1096: main result on representing decision lists as polynomial threshold
1097: functions:
1098:
1099: \medskip
1100:
1101: \noindent {\bf Theorem \ref{thm:ptf}}
1102: {\em Let $L$ be a decision list of length $k$. Then
1103: $L$ is computed by a polynomial threshold function
1104: of degree $k^{1/3} \log^{1/3} k$ and weight
1105: $2^{O(k^{1/3} \log^{4/3} k)}.$
1106: } \\
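
\noindent To see where these parameters come from, one can substitute
$h = k^{2/3}/\log^{4/3} k$ into Theorem \ref{thm:mainptf}; the
calculation below is a sketch which uses only $\log h \leq \log k$:
\begin{eqnarray*}
h^{1/2} \log h & \leq & {\frac {k^{1/3}}{\log^{2/3} k}} \cdot \log k
\ = \ k^{1/3} \log^{1/3} k, \\
{\frac k h} & = & k^{1/3} \log^{4/3} k, \\
h^{1/2} \log^2 h & \leq & {\frac {k^{1/3}}{\log^{2/3} k}} \cdot \log^2 k
\ = \ k^{1/3} \log^{4/3} k.
\end{eqnarray*}
Thus the degree is $O(k^{1/3} \log^{1/3} k)$ and both terms in the
exponent of the weight bound are $O(k^{1/3} \log^{4/3} k)$.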
1107:
1108:
1109: Theorem \ref{thm:ptf} immediately implies that Expanded-Winnow can learn decision lists of length $k$ using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$.
1110:
1111: %\section{Discussion} \label{sec:discuss}
1112:
1113:
1114: \section{Application to Learning Decision Trees} \label{sec:decisiontree}
1115:
In 1989 Ehrenfeucht and Haussler \cite{EhrenfeuchtHaussler:89} gave a
time $n^{O(\log s)}$ algorithm for learning decision trees of size
1118: $s$ over $n$ variables. Their algorithm uses $n^{O(\log s)}$ examples,
1119: and they asked if the sample complexity could be reduced to
1120: $\poly(n,s)$. We can apply our techniques here to give an algorithm
1121: using $2^{\tilde{O}(s^{1/3})} \log n$ examples, if we are willing to
1122: spend $n^{\tilde{O}(s^{1/3})}$ time.
1123:
1124: First we need to generalize Theorem \ref{thm:mainptf} for higher order
1125: decision lists. An $r$-decision list is like a standard decision list
1126: but each pair is now of the form $(C_i,b_i)$ where $C_i$ is a
1127: conjunction of at most $r$ literals and as before $b_i = \pm 1$. The
1128: output of such an $r$-decision list on input $x$ is $b_i$ where $i$ is
1129: the smallest index such that $C_i(x)=1.$
1130:
1131: We have the following:
1132:
1133: \begin{corollary} \label{cor:gdl}
1134: Let $L$ be an $r$-decision list of length $k$. Then for any
1135: $h < k$, $L$ is computed by a polynomial threshold function
1136: of degree $O(rh^{1/2} \log h)$ and weight
1137: $2^{r + O(k/h + h^{1/2} \log^2 h)}$.
1138: \end{corollary}
1139:
1140: \begin{proof}
1141: Let $L$ be the $r$-decision list $(C_1,b_1),\dots,(C_k,b_k),b_{k+1}.$
1142: By Theorem \ref{thm:mainptf} there is a polynomial threshold function
1143: of degree $O(h^{1/2} \log h)$ and weight
1144: $2^{O(k/h + h^{1/2} \log^2 h)}$ over the variables $C_1,\dots,C_k.$
1145: Now replace each variable $C_{i}$ by the interpolating polynomial
1146: which computes it exactly as a function from $\{0,1\}^n$ to $\{0,1\}.$
1147: Each such interpolating polynomial has degree $r$ and integer
1148: coefficients of total magnitude at most $2^r$, and the corollary follows.
1149: \end{proof}
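
\medskip

\noindent {\bf Example.} As an illustration of the substitution step (the
particular conjunction is our own choice), the conjunction
$C = x_1 \wedge \overline{x}_2 \wedge x_3$ with $r = 3$ is computed
exactly on $\{0,1\}^n$ by
$$
x_1(1 - x_2)x_3 = x_1x_3 - x_1x_2x_3,
$$
a polynomial of degree $3$ whose coefficients have total magnitude
$2 \leq 2^r$.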
1150:
1151: \begin{corollary} \label{cor:learngdl}
1152: There is an algorithm for learning
1153: $r$-decision lists over $\{0,1\}^n$ which, when learning an $r$-decision list
1154: of length $k$, has mistake bound
1155: $2^{\tilde{O}(r + k^{1/3})}\log n$ and runs in time
1156: $n^{\tilde{O}(rk^{1/3})}$.
1157: \end{corollary}
1158:
1159: Now we can apply Corollary \ref{cor:learngdl} to obtain a tradeoff
1160: between running time and sample complexity for learning decision
1161: trees:
1162:
1163: \begin{theorem}
1164: Let $D$ be a decision tree of size $s$ over $n$ variables. Then $D$ can be learned using $2^{\tilde{O}(s^{1/3})} \log n$ examples in time $n^{\tilde{O}(s^{1/3})}.$
1165: \end{theorem}
1166:
1167:
1168: \begin{proof}
1169: Blum \cite{Blum:92} has shown that any decision tree of size $s$ is
1170: computed by a $(\log s)$-decision list of length $s.$ Applying
1171: Corollary \ref{cor:learngdl} we thus see that Expanded-Winnow can be
1172: used to learn decision trees of size $s$ over $\{0,1\}^n$ with the
1173: claimed bounds on time and sample complexity.
1174: \end{proof}
1175:
1176:
1177:
1178:
1179: \section{Lower Bounds for Decision Lists} \label{sec:discuss}
1180:
1181: Here we observe that our construction from
1182: Theorem \ref{thm:mainptf} is essentially optimal in terms of the
1183: tradeoff it achieves between polynomial threshold function degree
1184: and weight.
1185:
1186: In \cite{Beigel:94}, Beigel constructs an oracle separating $\PP$ from
1187: $\PNP$. At the heart of his construction is a proof that any low
1188: degree polynomial threshold function for a particular
decision list, called the $\mathrm{ODDMAXBIT}_{n}$ function,
1190: must have large weights:
1191:
1192: \begin{definition}
1193: The $\mathrm{ODDMAXBIT}_{n}$ function on input $x=x_{1},\ldots,x_{n}
1194: \in \{0,1\}^{n}$ equals $(-1)^{i}$ where $i$ is the index of the
1195: first nonzero bit in $x.$
1196: \end{definition}
1197:
1198: It is clear that the $\mathrm{ODDMAXBIT}_{n}$ function is
1199: equivalent to a decision list of length $n$:
1200: $$
1201: (x_1,-1),(x_2,1),(x_3,-1),\dots,(x_n,(-1)^{n}),(-1)^{n+1}.
1202: $$
1203: The main technical theorem which Beigel proves in \cite{Beigel:94}
1204: states that any polynomial threshold function of degree $d$ computing
1205: $\mathrm{ODDMAXBIT}_{n}$ must have weight $2^{\Omega(n/d^{2})}$:
1206:
1207: \begin{theorem} \label{thm:beigel}
1208: Let $p$ be a degree $d$ polynomial threshold function with integer
1209: coefficients computing
1210: $\mathrm{ODDMAXBIT}_{n}$. Then
$w = 2^{\Omega(n/d^{2})}$ where $w$ is the weight of $p.$\footnote{Beigel actually proves something stronger, namely that there must exist a coefficient whose absolute value is at least $2^{\Omega(n/d^{2})}$.}
1212: \end{theorem}
1213: (As stated in \cite{Beigel:94} the bound is actually $w \geq
1214: {\frac 1 s}2^{\Omega(n/d^2)}$ where $s$ is the number of nonzero
1215: coefficients in $p$. Since $s \leq w$ this implies the result
1216: as stated above.)
1217:
1218:
1219: A lower bound of $2^{\Omega(n)}$
1220: on the weight of any linear threshold function ($d=1$) for
1221: $\mathrm{ODDMAXBIT}_n$ has long been known \cite{MyhillKautz:61};
1222: Beigel's proof generalizes this
1223: lower bound to all $d = O(n^{1/2}).$ A matching upper bound
1224: of $2^{O(n)}$ on weight for $d=1$ has also long been known
1225: \cite{MyhillKautz:61}.
1226: Our Theorem \ref{thm:mainptf} gives an upper bound
1227: which matches Beigel's lower bound (up to
1228: logarithmic factors) for all $d = O(n^{1/3})$:
1229: \begin{observation}
1230: For any $d = O(n^{1/3})$ there is a polynomial threshold function of
1231: degree $d$ and weight $2^{\tilde{O}(n/d^{2})}$
1232: which computes $\mathrm{ODDMAXBIT}_{n}$.
1233: \end{observation}
1234: \begin{proof}
1235: Set $d = h^{1/2} \log h$ in Theorem~\ref{thm:mainptf}.
1236: The weight bound given by Theorem~\ref{thm:mainptf}
1237: is $2^{O({\frac {n \log^2 d}{d^2}} + d \log d)}$
which is $2^{\tilde{O}(n/d^2)}$ for $d = O(n^{1/3}).$
1239: \end{proof}
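
\medskip

\noindent In more detail, one possible choice (a sketch, assuming
$d \geq 2$ and logarithms to base $2$) is $h = d^2/\log^2 d$ with $k = n$
in Theorem~\ref{thm:mainptf}. Since $\log h \leq 2 \log d$,
$$
h^{1/2}\log h \leq 2d, \qquad
{\frac n h} = {\frac {n \log^2 d}{d^2}}, \qquad
h^{1/2}\log^2 h \leq 4 d \log d,
$$
so the degree is $O(d)$, and rescaling $d$ by a constant factor gives
degree $d$. Moreover $d = O(n^{1/3})$ gives $d^3 = O(n)$, i.e.
$d = O(n/d^2)$, so $d \log d = \tilde{O}(n/d^2)$ and the exponent of the
weight bound is $\tilde{O}(n/d^2)$.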
1240:
1241: \medskip
1242:
1243: Note that since the
1244: $\mathrm{ODDMAXBIT}_{n}$ function has a polynomial size DNF
1245: (see Appendix \ref{ap:alt}), Beigel's lower bound gives a polynomial
1246: size DNF $f$ such that any degree $\tilde{O}(n^{1/3})$ polynomial
1247: threshold function for $f$ must have weight
1248: $2^{\tilde{\Omega}(n^{1/3})}$.
1249: This suggests that the Expanded-Winnow algorithm cannot learn polynomial size
1250: DNF in $2^{\tilde{O}(n^{1/3})}$ time from
1251: $2^{n^{1/3 - \eps}}$ examples for any
1252: $\eps > 0,$ and thus suggests that improving the sample complexity
1253: of the DNF learning algorithm from \cite{KlivansServedio:01} while
1254: maintaining its $2^{\tilde{O}(n^{1/3})}$ running time may be difficult.
1255:
1256: \section{Learning Parity Functions} \label{sec:parity}
1257:
1258: We first briefly review the standard
1259: algorithm for learning parity functions.
1260:
1261: The standard algorithm for learning parity functions works by viewing a
1262: set of $m$ labelled examples as a set of $m$ linear equations over GF(2).
1263: Each labelled example $(x,b)$ induces the equation
1264: $\sum_{i: x_i = 1} a_{i} = b \bmod 2.$
1265: Since the examples are labelled according to some parity function,
1266: this parity function will be a consistent solution to the
1267: system of equations.
1268: Using Gaussian elimination it is possible to efficiently find a
1269: solution to the linear system,
1270: which yields a parity function consistent with all $m$ examples.
1271: The following standard fact from learning theory
1272: (often referred to as ``Occam's Razor'') shows that finding
1273: a consistent hypothesis suffices to establish PAC learnability:
1274:
1275: \begin{fact} \label{fact:OC}
Let $C$ be a concept class and $H$ a finite set of hypotheses. Set $m
= {\frac 1 \epsilon}(\log |H| + \log (1/\delta))$ where $\epsilon$ and $\delta$
1278: are the usual accuracy and confidence parameters for PAC learning.
1279: Suppose that there
1280: is an algorithm $A$ running in time $t$ which takes as input $m$
1281: examples which are labelled according to some element of $C$ and outputs a
1282: hypothesis $h \in H$ consistent with these examples.
1283: Then $A$ is a PAC learning algorithm for $C$ with running time $t$
1284: and sample complexity $m.$
1285: \end{fact}
1286: Consider using the above algorithm to learn an unknown
1287: parity of length at most $k.$
1288: Even though there is a solution of weight at most $k$,
1289: Gaussian elimination (applied to a system of $m$ equations in $n$
1290: variables over GF(2)) may yield a solution of weight
1291: as large as $\min(m,n).$
1292: Using Fact \ref{fact:OC} we thus obtain a sample complexity bound of
1293: $O(n)$ examples for learning a parity of length at most $k.$
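
\medskip

\noindent {\bf Example.} The following toy instance (our own, for
illustration only) shows how examples become equations. Let $n = 4$ and
suppose the target is the parity $x_1 \oplus x_3$. The labelled examples
$(1010,0)$, $(1100,1)$ and $(0111,1)$ induce the system
$$
a_1 + a_3 = 0, \qquad a_1 + a_2 = 1, \qquad a_2 + a_3 + a_4 = 1 \pmod 2.
$$
The target $a = (1,0,1,0)$ is a solution, but so is $a = (0,1,0,0)$,
which corresponds to the parity $x_2$. Gaussian elimination may return
either one, so a consistent solution need not coincide with the target.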
1294:
1295: We now present
1296: a simple polynomial-time algorithm for learning an unknown parity
function on $k$ variables using $O(n^{1-1/k} \log n)$ examples.
1298: To the best of our knowledge this is the first improvement on the
1299: standard algorithm and analysis given above.
1300:
1301: \begin{theorem} \label{thm:mainparity}
1302:
1303: The class of all parity functions on at most $k$ variables is
1304: learnable in polynomial time using $O(n^{1-1/k} \log n)$
1305: examples. The hypothesis output by the learning algorithm
1306: is a parity function on $O(n^{1-1/k}\log n)$ variables.
1307:
1308: \end{theorem}
1309:
1310: \begin{proof}
If $k = \Omega(\log n)$ then $n^{1-1/k} = \Theta(n)$, so the standard algorithm suffices to
prove the claimed bound. We thus assume that $k = o(\log n)$.
1313:
1314: Let $H$ be the set of all parity functions of size at most $n^{1 - 1/k}$.
1315: Note that $|H| \leq n^{n^{1 - 1/k}}$ so
1316: $\log|H| \leq n^{1 - 1/k} \log n.$
1317: Consider the following
1318: algorithm:
1319:
1320: \begin{enumerate}
1321:
\item Choose $m = \frac{1}{\epsilon} (\log |H| + \log (1/\delta))$
1323: examples. Express each example as a linear equation over $n$ variables
1324: mod $2$ as described above.
1325:
1326: \item Randomly choose a set of $n - n^{1-1/k}$ variables and assign
1327: them the value $0$.
1328:
1329: \item Use Gaussian elimination to attempt to solve the resulting system
1330: of equations on the remaining $n^{1 - 1/k}$ variables.
1331: If the system has a solution, output the corresponding parity
1332: (of size at most $n^{1 - 1/k}$) as the hypothesis.
1333: If the system has no solution, output ``FAIL.''
1334:
1335: \end{enumerate}
1336:
1337: If the simplified system of equations has a solution,
1338: then by Fact \ref{fact:OC} this solution is a good hypothesis.
1339: We will show that the simplified system has a solution with probability
1340: $\Omega(1/n)$. The theorem
1341: follows by repeating steps 2 and 3 of the above algorithm until
1342: a solution is found (an expected $O(n)$ repetitions will suffice).
1343:
1344: Let $V$ be the set of $k$ relevant variables on which the unknown
1345: parity function depends. It is easy to see that as long as
1346: no variable in $V$ is assigned a 0,
1347: the resulting simplified system of equations will have a
1348: solution.
1349: Let $\ell = n^{1 - 1/k}.$
1350: The probability that in Step 2 the $n - \ell$ variables chosen
1351: do not include any variables in $V$ is exactly
1352: ${n - k \choose n - \ell} / {n \choose \ell}$
1353: which equals
1354: ${n - k \choose \ell - k} / {n \choose \ell}.$ Expanding
1355: binomial coefficients we have
\begin{equation} \label{eq:a}
{\frac {{n - k \choose \ell - k}}{{n \choose \ell}}} =
\prod_{i=1}^{k} {\frac {\ell - k + i}{n -k + i}}
> \left({\frac {\ell - k}{n - k}}\right)^k
=
\left({\frac \ell n}\right)^k
\left({\frac {1 - {\frac k \ell}}{1 - {\frac k n}}}\right)^k
\geq
{\frac 1 n} \cdot
\left(1 - {\frac k \ell}\right)^k,
\end{equation}
where the last step uses $(\ell/n)^k = 1/n$ and $1 - k/n < 1$.
By Bernoulli's inequality,
$\left(1 - {\frac k \ell}\right)^k \geq 1 - {\frac {k^2} \ell}.$
The bound $k = o(\log n)$ implies that, for $k \geq 2$ and $n$ sufficiently large,
$k^2/\ell \leq k^2/n^{1/2} < 1/2$
(for $k = 1$ the probability is exactly $\ell/n = 1/n$). Consequently
(\ref{eq:a}) is greater than
${\frac 1 {2n}}$ and the theorem is proved.
1373: \end{proof}
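
As a sanity check on the quantities in the proof, consider the following numerical instance (the values $k=2$ and $n=10^4$ are ours, chosen only for illustration, and logarithms are taken to base $2$). Then $\ell = n^{1-1/k} = 100$ and
\[
\log |H| \leq \ell \log n \approx 100 \cdot 13.3 = 1330,
\]
compared with $\log 2^n = 10^4$ for the class of all parities. The probability that a single random restriction in Step 2 avoids both relevant variables is
\[
\frac{{n-2 \choose \ell-2}}{{n \choose \ell}}
= \frac{\ell(\ell-1)}{n(n-1)}
= \frac{100 \cdot 99}{10^4 \cdot 9999}
= \frac{1}{10100} \approx 9.9 \times 10^{-5},
\]
which lies between $1/(2n) = 5 \times 10^{-5}$ and $1/n = 10^{-4}$. In this instance at most about $10100$ repetitions of Steps 2 and 3 are needed in expectation, since the restricted system may also happen to be solvable when a relevant variable has been set to $0$.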
1374:
1375:
1376:
1377:
1378:
1379:
1380:
1381: \section{Future Work} \label{sec:future}
1382:
1383: An obvious goal for future work is to improve our algorithmic results
1384: for learning decision lists. The question still remains: can
1385: decision lists of length $k$ be learned in poly$(n)$ time from
1386: poly$(k,\log n)$ examples? As a first step, one might attempt to
1387: extend the tradeoffs we achieve: is it possible to learn
1388: decision lists of length $k$ in $n^{k^{1/2}}$ time from
1389: poly$(k,\log n)$ examples?
1390:
1391: Another goal is to extend our results for decision lists to broader
1392: concept classes. In particular, since decision lists are a special
1393: case of linear threshold functions, it would be interesting to obtain analogues
1394: of our algorithmic
1395: results for learning general linear threshold functions (independent of
1396: their weight). We note here that
1397: Goldmann {\em et al.} \cite{GHR:92} have given
1398: a linear threshold function over $\{-1,1\}^n$ for
1399: which any polynomial threshold function must have weight
1400: $2^{\Omega(n^{1/2})}$ regardless of its degree. Moreover
Krause and Pudl\'ak \cite{KrausePudlak:98} have shown that any Boolean
1402: function which has a polynomial threshold function over $\{0,1\}^n$ of weight
1403: $w$ has a polynomial threshold function over $\{-1,1\}^n$ of weight
1404: $n^2w^4.$ These results imply that {\em representational} results akin
1405: to Theorem \ref{thm:ptf} for general linear threshold functions
1406: must be quantitatively weaker than Theorem \ref{thm:ptf};
1407: in particular, there is a linear threshold function over
1408: $\{0,1\}^n$ with $k$ nonzero coefficients for which
1409: {any} polynomial threshold function, regardless of degree, must have
1410: weight $2^{\Omega(k^{1/2})}.$
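
The implication can be spelled out as follows; this is our reading of how the two cited results combine, with $c > 0$ standing for the unspecified constant hidden in the $\Omega(\cdot)$ of \cite{GHR:92} and logarithms taken to base $2$. Let $f$ be the linear threshold function of \cite{GHR:92} on $k$ variables, so that every polynomial threshold function for $f$ over $\{-1,1\}^k$ has weight at least $2^{c k^{1/2}}$. If $f$ had a polynomial threshold function over $\{0,1\}^k$ of weight $w$, then by \cite{KrausePudlak:98} it would have one over $\{-1,1\}^k$ of weight at most $k^2 w^4$, and hence
\[
k^2 w^4 \geq 2^{c k^{1/2}}
\quad\Longrightarrow\quad
w \geq 2^{(c k^{1/2} - 2\log k)/4} = 2^{\Omega(k^{1/2})}.
\]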
1411:
1412: For parity functions, one challenge is to
1413: learn parity functions on $k = \Theta(\log n)$ variables in polynomial time
1414: using a sublinear number of examples. Another challenge is to improve
the sample complexity of learning size-$k$ parities from our
current bound of $O(n^{1 - 1/k} \log n).$
1417:
1418: \ignore{
1419:
1420: Decision lists can be viewed as a special case of linear threshold
1421: functions. For example, the alternating decision list (or
1422: $\mathrm{ODDMAXBIT}_{n}$ function) is equal to the sign of $h =
1423: \sum_{i=1}^{n} (-1)^{i} 2^{i}x_{i}$. The lower bound on the
1424: $\mathrm{ODDMAXBIT}_{n}$ function due to Beigel shows that for an
1425: arbitrary linear threshold function, we cannot construct polynomial
1426: threshold functions of degree $d$ and weight $2^{o(n/d^{2})}.$
1427:
1428: Here we observe that this lower bound on the weight and degree of
1429: polynomial threshold functions computing general linear threshold
1430: functions can be strengthened due to a result by Goldmann, Hastad, and
1431: Razborov:
1432:
1433: \begin{theorem} \cite{GHR:92}
1434: There exists a linear threshold function $U$ defined on $4n^{2}$
1435: variables such that if $U$ is written as a threshold of monomials then
1436: the total weight of the threshold is $\Omega(2^{(n/2)} / \sqrt{n})$.
1437: \end{theorem}
1438:
1439: \noindent The linear threshold function $U$ is the so-called Universal
1440: Halfspace defined as follows:
1441:
1442: \[ U_{n,m} = \sum_{i=1}^{n} \sum_{j=1}^{m} 2^{i}x_{ij}. \]
1443:
1444: From this we conclude that to learn an arbitrary linear threshold
1445: function on $n$ variables, V-Winnow will require
1446: $\Omega(2^{\sqrt{n}})$ samples and time $\Omega(n^{\sqrt{n}})$. This
1447: stands in contrast to the sample complexity and time complexity bounds
1448: for learning decision lists.
1449: }
1450:
1451: \section{Acknowledgements} We thank Les Valiant for his observation
1452: that Claim \ref{cla:outer} can be reinterpreted in terms of polynomial
1453: threshold functions.
1454: We thank Jean Kwon for suggesting the Chebychev polynomial.
1455:
1456: \bibliographystyle{plain}
1457: \bibliography{allrefs}
1458:
1459: \appendix
1460:
1461: \section{Alternate Proof of Corollary \ref{cor:outer}} \label{ap:alt}
1462: The alternate proof of Corollary \ref{cor:outer} is based on the
1463: observation that any decision list $L =
1464: (\ell_1,b_1),\dots,$ $(\ell_k,b_k),b_{k+1}$ of length $k$ has a
1465: $k$-term DNF in which each term is a conjunction of at most
1466: $k$ literals. To see this, note that we obtain a DNF
1467: for $L$ simply by taking the OR of all terms
1468: $\overline{\ell}_1\overline{\ell}_2 \dots \overline{\ell}_{i-1}\ell_i$
1469: for each $i$ such that $b_i = 1.$ Now we use the following result
1470: from \cite{KlivansServedio:01}:
1471: \begin{theorem} [Corollary 12 of \cite{KlivansServedio:01}]
1472: Let $f$ be a DNF formula of $s$ terms, each of length at most $t.$
1473: Then there is a polynomial threshold function for $f$ of degree
1474: $O(\sqrt{t}\log s)$ and weight $t^{O(\sqrt{t}\log s)}.$
1475: \end{theorem}
1476: Applying this result to the DNF representation for $L,$ we immediately
1477: obtain that there is a polynomial threshold function for $L$
1478: which has degree $O(k^{1/2} \log k)$ and weight
1479: $2^{O(k^{1/2} \log^2 k)}.$ (In Section \ref{subsec:inner}, though,
1480: we need the construction given in our original proof of
1481: Corollary \ref{cor:outer}.)
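
For illustration, here is a small instance of the DNF construction and of the parameter substitution (the particular list is ours and purely illustrative). Take $k = 4$ and $L = (x_1,0),(x_2,1),(x_3,0),(x_4,1),0$. The indices with $b_i = 1$ are $i = 2$ and $i = 4$, so
\[
L = \overline{x}_1 x_2 \;\vee\; \overline{x}_1 \overline{x}_2 \overline{x}_3 x_4 ,
\]
a DNF with two terms, each of at most $4$ literals. In general, setting $s = t = k$ in the theorem above gives degree $O(k^{1/2}\log k)$ and weight
\[
k^{O(k^{1/2} \log k)} = 2^{O(k^{1/2} \log k) \cdot \log k} = 2^{O(k^{1/2} \log^2 k)}.
\]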
1482:
1483: \end{document}
1484:
1485:
1486:
1487:
1488:
1489:
1490: