cs0311042/d12.tex
1: 
2: % d5.tex 3-27-03
3: 
4: 
5: \documentclass[11pt]{article}
6: \usepackage{amssymb}
7: \usepackage{amsfonts}
8: \usepackage{amsmath}
9: \usepackage{latexsym}
10: \usepackage{epsfig}
11: 
12: \parindent=18pt
13: \oddsidemargin=0.15in
14: \evensidemargin=0.15in
15: \topmargin=-.5in
16: \textheight=9in
17: \textwidth=6.5in
18: 
19: \newcommand{\la}{\langle}
20: \newcommand{\ra}{\rangle}
21: \newcommand{\poly}{\mathrm{poly}}
22: \newcommand{\size}{\mathrm{size}}
23: \newcommand{\fix}{\mathrm{fix}}
24: \newcommand{\bias}{\mathrm{bias}}
25: \newcommand{\R}{{\bf R}}
26: \newcommand{\E}{{\mathrm E}}
27: \newcommand{\F}{{{\bf F}_2}}
28: \newcommand{\s}{{\mathcal S}}
29: \newcommand{\K}{{\mathcal K}}
30: \newcommand{\A}{{\mathcal A}}
31: \newcommand{\B}{{\mathcal B}}
32: \newcommand{\true}{\textsc{T}}
33: \newcommand{\false}{\textsc{F}}
34: \newcommand{\bitsl}{\{\false, \true\}}
35: \newcommand{\bitsf}{\{0, 1\}}
36: \newcommand{\bitsr}{\{+1,-1\}}
37: \newcommand{\degr}{\deg_\R}
38: \newcommand{\degf}{\deg_\F}
39: \newcommand{\parity}{\mathsf{PARITY}}
40: \newcommand{\cz}{c_\emptyset}
41: \newcommand{\fin}{f_{\mathrm{in}}}
42: \newcommand{\fout}{f_{\mathrm{out}}}
43: \newcommand{\tin}{t_{\mathrm{in}}}
44: \newcommand{\tout}{t_{\mathrm{out}}}
45: \newcommand{\eps}{{\epsilon}}
46: \newcommand{\theconst}{\frac{\omega}{\omega+1}}
47: \newcommand{\ignore}[1]{}
48: \newcommand{\qed}{\hfill\rule{7pt}{7pt}}
49: \newcommand{\strutje}{\rule[-.25cm]{0cm}{.7cm}}
50: \newcommand{\omb}{ODDMAXBIT}
51: \newcommand{\PP}{\mathsf{PP}}
52: \newcommand{\PNP}{\mathsf{P^{NP}}}
53: 
54: 
55: \newtheorem{theorem}{Theorem} 
56: \newtheorem{fact}[theorem]{Fact}
57: \newtheorem{observation}[theorem]{Observation}
58: \newtheorem{proposition}[theorem]{Proposition}
59: \newtheorem{claim}[theorem]{Claim}
60: \newtheorem{definition}[theorem]{Definition}
61: \newtheorem{corollary}[theorem]{Corollary}
62: 
63: \newenvironment{proof}{\noindent \textbf{Proof:}}{\hfill{$\Box$}}
64: 
65: \title{Toward Attribute Efficient Learning Algorithms}
66: \ignore{
67: OR \\
68:        Learning Decision Lists of Length $k$ using
69:        $2^{\tilde{O}(k^{1/3})}$ Examples OR \\ 
70:        On Learning Decision Lists Attribute Efficiently OR \\ 
71:        Learning Decision Lists Attribute Efficiently via Polynomial Threshold Functions OR \\
72:        Learning Decision Lists using $2^{\tilde{O}(k^{1/3})}$ Samples OR \\
73:        Learning Decision Lists via Polynomial Threshold Functions OR \\
74:        A Subexponential Algorithm for Learning Decision Lists Attribute Efficiently OR \\
75:        some other lame title
76: }
77: 
78: 
79: 
80: \author{Adam R. Klivans\thanks{Supported by an NSF Mathematical
81: Sciences Postdoctoral Research Fellowship.}\\
82: Division of Engineering and Applied Sciences\\ 
83: Harvard University\\ Cambridge, MA 02138 \\{\tt klivans@eecs.harvard.edu}
84: \and Rocco A.\ Servedio\\ 
85: Department of Computer Science\\
86: Columbia University\\ 
87: New York, NY 10027\\ {\tt rocco@cs.columbia.edu} }
88: 
89: \date{}
90: 
91: \begin{document}
92: 
93: \setcounter{page}{0}
94: 
95: \maketitle
96: 
97: \begin{abstract}
98: 
99: We make progress on two important problems regarding attribute
100: efficient learnability.  
101: 
102: First, we give an algorithm for learning decision
103: lists of length $k$ over $n$ variables using $2^{\tilde{O}(k^{1/3})}
104: \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$. This is the first
105: algorithm for learning decision lists that has both subexponential
106: sample complexity and subexponential running time in the relevant
107: parameters.  Our approach establishes a relationship between attribute
108: efficient learning and polynomial threshold functions and is based on
109: a new construction of low degree, low weight polynomial threshold
110: functions for decision lists. For a wide range of parameters our
111: construction matches a 1994 lower bound due to Beigel for the
112: ODDMAXBIT predicate and gives an essentially optimal tradeoff between
113: polynomial threshold function degree and weight.  
114: 
115: Second, we give an
116: algorithm for learning an unknown parity function on $k$ out of $n$
117: variables using $O(n^{1-1/k})$ examples in time polynomial in $n$. For
118: $k=o(\log n)$ this yields a polynomial time algorithm with
119: sample complexity $o(n)$.  This is the first polynomial time algorithm
120: for learning parity on a superconstant number of variables with
121: sublinear sample complexity.
122: 
123: \end{abstract}
124: 
125: 
126: %%%%%%%% SECOND ABS
127: 
128: \ignore{
129: \begin{abstract}
130: 
131: We give an algorithm for learning decision lists of length $k$ over $n$
132: variables using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time
133: $n^{\tilde{O}(k^{1/3})}$. This is the first algorithm for learning
134: decision lists that has both subexponential sample complexity (in the
135: relevant parameters $k$ and $\log n$)  and subexponential running time (in
136: the relevant parameter $k$;  any algorithm must take time $\Omega(n)$).
137: Our approach establishes a relationship between attribute efficient
138: learning and polynomial threshold functions, and is based on a new
139: construction of low degree, low weight polynomial threshold functions for
140: decision lists.  As a consequence of our construction we show that
141: Beigel's 1994 complexity theoretic lower bound for the ODDMAXBIT function
142: is asymptotically optimal. {\bf [[Another option for the last sentence:]]}
143: For a wide range of parameters our construction matches a 1994 lower bound due to
144: Beigel for the ODDMAXBIT predicate, and thus our construction
145: gives an optimal tradeoff between polynomial threshold function 
146: degree and weight.  {\bf [[basically, do we want to say that his
147: stuff shows our stuff is optimal, or our stuff shows his stuff is 
148: optimal?]]}
149: 
150: 
151: \end{abstract}
152: }
153: 
154: %%%%%%%%%%%% END SECOND ABS
155:  
156: %%%%%%%%%%% first abs:
157: \ignore{
158: \begin{abstract} 
159: We give an online algorithm for learning decision lists.
160: The mistake bound of the algorithm, for learning a decision list of
161: length $k$ over $n$ Boolean variables, is
162: $2^{O(k^{1/3})}\log n$ and the running time of the algorithm is
163: $n^{O(k^{1/3})}.$  We thus achieve a tradeoff between
164: running time and sample complexity for learning decision lists.
165: Our approach combines known algorithms for attribute efficient
166: learning of linear threshold functions 
167: with a new construction of polynomial threshold functions 
168: which compute decision lists.  As a consequence of our 
169: construction, we 
170: show that Beigel's 1994 complexity theoretic 
171: lower bound on the weight of any low-degree polynomial
172: threshold function for the ODDMAXBIT$_n$ predicate is asymptotically optimal.
173: \end{abstract}
174: }
175: %%%%%%%%%%% end first abs:
176: 
177: 
178: \thispagestyle{empty}
179: 
180: \newpage
181: 
182: \section{Introduction}
183: 
184: \subsection{Attribute Efficient Learning}
185: 
186: A central goal in machine learning is to design efficient, effective
187: algorithms for learning from small amounts of data.  An obstacle to
188: achieving this goal is that learning problems are often characterized by
189: an abundance of {\em irrelevant information}.  In many learning problems
190: each data point is naturally viewed as a high dimensional vector of
191: attribute values;  as a motivating example, in a natural language domain a
192: data point representing a text document may be a vector of word
193: frequencies over a lexicon of 100,000 words (attributes).  A newly
194: encountered word in a corpus may typically have a simple definition which
195: uses only a dozen or so words from the entire lexicon.  One would like to
196: be able to learn the meaning of such a word using a number of examples
197: which is closer to a dozen (the actual number of relevant attributes) than
198: to 100,000 (the total number of attributes).
199: 
200: Towards this end, an important goal in machine learning theory is to
201: design {\em attribute efficient} algorithms for learning various classes
202: of Boolean functions.  A class ${\cal C}$ of Boolean functions over $n$
203: variables $x_1,\dots,x_n$ is said to be {\em attribute-efficiently
204: learnable} if there is a poly$(n)$ time algorithm which can learn any
205: function $f \in {\cal C}$ using a number of examples which is polynomial in the
206: ``size'' (description length) of the function $f$ to be learned, rather
207: than in $n$ (the number of features in the domain over which learning
208: takes place).  (Note that the running time of the learning algorithm must
209: in general be at least $n$ since each example is an $n$-bit vector.)  
210: Thus an attribute efficient learning algorithm for, say, the class of
211: Boolean conjunctions must be able to learn any Boolean conjunction of $k$
212: literals over $x_1,\dots,x_n$ using poly$(k,\log n)$ examples, since $k
213: \log n$ bits are required to specify such a conjunction.
214: 
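\par\medskip\noindent
As a rough illustration of this count (a sketch, assuming a conjunction is encoded simply by naming its literals): there are $2n$ literals over $x_1,\dots,x_n$, so the number of conjunctions of $k$ distinct literals is at most ${2n \choose k} \leq (2n)^k$, and hence
$$
\log_2 {2n \choose k} \leq k \log_2 (2n) = k(1 + \log_2 n) = O(k \log n)
$$
bits suffice to name one.  For instance, with $n = 100{,}000$ and $k = 12$ this is at most $12 \cdot (1 + \log_2 100{,}000) < 12 \cdot 18 = 216$ bits, far fewer than $n$.
\par\medskip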
215: 
216: \subsection{Decision Lists}
217: 
218: A longstanding open problem in machine learning, posed first by Blum in
219: 1990 \cite{Blum:90,Blum:96,BHL:95,BlumLangley:97} and again by 
220: Valiant in 1998
221: \cite{Valiant:99}, is to determine whether or not there exist attribute
222: efficient algorithms for learning {\em decision lists}.  A decision list
223: is essentially a nested ``if-then-else'' statement (we give a precise
224: definition in Section \ref{sec:prelims}).
225: 
226: Attribute efficient learning of decision lists is of both theoretical and
227: practical interest. Blum's motivation for considering the problem came
228: from the {\em infinite attribute model} \cite{Blum:90}; in this model
229: there are infinitely many attributes but the concept to be learned depends
230: on only a small number of them, and each example consists of a finite list
231: of active attributes.  Blum {\em et al}. \cite{BHL:95} showed that for a
232: wide range of concept classes (including decision lists)  attribute
233: efficient learnability in the standard $n$-attribute model is equivalent
234: to learnability in the infinite attribute model.  Since simple classes
235: such as disjunctions and conjunctions are attribute efficiently learnable
236: (and hence learnable in the infinite attribute model), this motivated Blum
237: \cite{Blum:90} to ask whether the richer class of decision lists is thus
238: learnable as well.\footnote{ Additional motivation comes from the fact
239: that decision lists have such a simple algorithm in the PAC model.}
240: Several researchers have subsequently considered this problem, see e.g.
241: \cite{Blum:96,BlumLangley:97,DhagatHellerstein:94, NevoElYaniv:02,
242: Servedio:99stoc}; we summarize some of this previous work in Section
243: \ref{sec:prevdl}.
244:     
245: From an applied perspective, Valiant \cite{Valiant:99} relates the
246: problem of learning decision lists attribute efficiently to the question
247: ``how can human beings learn from small amounts of data in the presence of
248: irrelevant information?'' He points out that since decision lists play an
249: important role in various models of cognition, a first step in
250: understanding this phenomenon would be to identify efficient algorithms
251: which learn decision lists from few examples. Due to the lack of progress
252: in developing such algorithms for decision lists, Valiant suggests that
253: models of cognition should perhaps focus on ``flatter'' classes of
254: functions such as projective DNF \cite{Valiant:99}.
255: 
256: \subsection{Parity Functions}
257: 
258: Another outstanding challenge in machine learning is to determine whether 
259: there exist attribute efficient algorithms for learning {\em parity
260: functions}.  The parity function
261: on a set of 0/1-valued variables $x_{i_1},\ldots,x_{i_k}$ is equal to $x_{i_1} + \cdots
262: + x_{i_k}$ modulo 2.  As with the class of decision lists, a simple PAC learning
263: algorithm is known for the class of parity functions but no attribute efficient 
264: PAC learning algorithm is known.
265: Learning parity
266: functions plays an important role in Fourier learning methods
267: \cite{MOS:03} and is closely related to  decoding random linear codes \cite{BKW:00}.
268: Both A. Blum \cite{Blum:96} and Y. Mansour \cite{Man:02} cite
269: attribute efficient learning of parity functions as an important open
270: problem.
271: 
272: \ignore{
273: Given a set of examples labelled according to an unknown parity
274: function on $k$ out of $n$ variables, we wish to find an approximation
275: to the unknown parity in polynomial time using as few examples as
276: possible.  The well known solution to this problem views these
277: examples as a set of linear equations mod $2$ in $n$ variables and
278: solves the set of equations to come up with a consistent
279: hypothesis. Note, however, that we must take $\Omega(n)$ examples to
280: achieve a solution which has good generalization error, as a solution
281: to a system of $m$ equations over $n$ variables may contain
282: $\min(m,n)$ non-zero entries.  An attribute efficient algorithm for
283: learning parity should require a number of examples polynomially
284: related to $k$ and $\log n$ (information theoretically we should only
285: need $O(k \log n)$ examples).
286: }
287: 
288: \subsection{Our Results: Decision Lists}
289: 
290: We give the first learning algorithm for decision lists that is
291: subexponential in both sample complexity (in the relevant parameters $k$
292: and $\log n$) and running time (in the relevant parameter $k$).  Our
293: results demonstrate for the first time that it is possible to
294: simultaneously avoid the ``worst case'' in both sample complexity and
295: running time, and thus suggest that it may indeed be possible to learn
296: decision lists attribute efficiently. \ignore{We consider this to be the
297: first evidence that decision lists can be learned attribute efficiently.
298: \\}
299: 
300: Our main learning result for decision lists is:  
301: 
302: \begin{theorem} \label{thm:main} There is an algorithm for learning
303: decision lists over $\{0,1\}^n$ which, when learning a decision list
304: of length $k$, has mistake bound\footnote{Throughout this
305: section we use ``sample complexity'' and ``mistake bound''
306: interchangeably; as described in Section \ref{sec:prelims}
307: these notions are essentially identical.}
308: $2^{\tilde{O}(k^{1/3})}\log n$ and runs  in time
309: $n^{\tilde{O}(k^{1/3})}$.
310: \end{theorem}
311: 
312: 
313: We prove Theorem \ref{thm:main} in two parts; first we generalize
314: Littlestone's well known Winnow algorithm \cite{Littlestone:88}
315: for learning 
316: linear threshold functions to learn {\em polynomial
317: threshold functions.} In previous learning results, polynomial threshold
318: functions are learned by applying techniques from linear programming: a
319: Boolean function computed by a polynomial threshold function of degree $d$ can
320: be learned in time $n^{O(d)}$ by using polynomial time linear programming
321: algorithms such as the Ellipsoid algorithm 
322: (see e.g. \cite{KlivansServedio:01}).
323: \ignore{via a linear programming solver, such as the
324: Ellipsoid algorithm.}
325: In contrast, we use the Winnow algorithm to learn polynomial threshold functions.
326: Winnow learns using few examples in a small amount of time
327: provided that the degree of the polynomial
328: is low and the integer coefficients of the polynomial are not too large:
329: \ignore{As opposed to general
330: linear programming solvers, Winnow can learn in an attribute efficient
331: manner:}
332: 
333: 
334: \begin{theorem} \label{thm:win}
335: Let ${\cal C}$ be a class of Boolean functions over
336: $\{0,1\}^n$ with the property that each $f \in {\cal C}$ has a polynomial
337: threshold function of degree at most $d$ and weight at most $W.$ Then
338: there is an online learning algorithm for ${\cal C}$ which runs in $n^d$
339: time per example and has mistake bound $O(W^{2} \cdot d \cdot \log n).$
340: \end{theorem}
341: 
342: At this point we have reduced the problem of learning decision lists
343: attribute efficiently to the problem of representing decision lists with
344: polynomial threshold functions of low weight and low degree. To this end
345: we prove
346: 
347: \begin{theorem} \label{thm:ptf} Let $L$ be a decision list of length $k$.
348: Then $L$ is computed by a polynomial threshold function of degree
349: $\tilde{O}(k^{1/3})$ and weight $2^{\tilde{O}(k^{1/3})}$.  \end{theorem}
350: Theorem \ref{thm:main} follows directly from Theorems \ref{thm:win}
351: and \ref{thm:ptf}.
352: 
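\par\medskip\noindent
In a little more detail, the substitution can be sketched as follows (treating the $\tilde{O}(\cdot)$ bounds informally).  Taking $d = \tilde{O}(k^{1/3})$ and $W = 2^{\tilde{O}(k^{1/3})}$ from Theorem \ref{thm:ptf} in the mistake bound of Theorem \ref{thm:win} gives
$$
O(W^{2} \cdot d \cdot \log n) = 2^{2 \cdot \tilde{O}(k^{1/3})} \cdot \tilde{O}(k^{1/3}) \cdot \log n = 2^{\tilde{O}(k^{1/3})} \log n ,
$$
since the polynomial factor $\tilde{O}(k^{1/3})$ is absorbed into the exponent.  The time per example is $n^d = n^{\tilde{O}(k^{1/3})}$, and multiplying by the mistake bound gives total running time
$$
n^{\tilde{O}(k^{1/3})} \cdot 2^{\tilde{O}(k^{1/3})} \log n = n^{\tilde{O}(k^{1/3})}
$$
for $n \geq 2$.
\par\medskip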
353: Polynomial threshold function constructions have recently been used
354: to obtain the fastest known algorithms for a range
355: of important learning problems such as learning DNF formulas
356: \cite{KlivansServedio:01}, intersections of halfspaces \cite{KOS:02}, 
357: and Boolean formulas of superconstant depth \cite{OdonnellServedio:03a}.  
358: For each of these learning problems the sole goal was to obtain
359: fast learning algorithms, and hence the only parameter of interest in 
360: these polynomial threshold function constructions is their degree, 
361: since degree bounds translate directly into running time bounds for
362: learning algorithms (see e.g. \cite{KlivansServedio:01}).
363: In contrast, for the decision list problem we are interested in 
364: both the running time and the number of examples required for learning.
365: Thus we must bound both the degree and the {\em weight} 
366: (magnitude of integer coefficients) of the polynomial threshold 
367: functions which we use.
368: 
369: Our polynomial threshold function construction is essentially optimal in
370: the tradeoff between degree and weight which it achieves.  In 1994 Beigel
371: gave a lower bound showing that any degree $d$ polynomial threshold
372: function for a particular decision list must have weight
373: $2^{\Omega(n/d^{2})}$. For $d = n^{1/3}$, Beigel's lower bound implies
374: that the construction stated in Theorem \ref{thm:ptf} is essentially
375: optimal.  Furthermore, for any decision list $L$ of length $n$ and any
376: $d \leq n^{1/3}$, we will in fact construct polynomial threshold functions 
377: of degree $d$ and weight $2^{\tilde{O}(n/d^{2})}$ computing $L$. 
378: Beigel's lower bound thus implies that our degree $d$ polynomial threshold 
379: functions are of roughly optimal weight
380: for all $d \leq n^{1/3},$ and hence strongly suggests that our 
381: analysis is the best possible for the algorithm we use.
382: 
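\par\medskip\noindent
To see why the two bounds meet at $d = n^{1/3}$ (a back-of-the-envelope check which ignores the logarithmic factors hidden in $\tilde{O}$): at this degree
$$
n/d^{2} = n / n^{2/3} = n^{1/3},
$$
so the lower bound reads weight $2^{\Omega(n^{1/3})}$, while Theorem \ref{thm:ptf} with $k = n$ gives weight $2^{\tilde{O}(n^{1/3})}$ at degree $\tilde{O}(n^{1/3})$.  For a smaller degree such as $d = n^{1/4}$ we have $n/d^{2} = n^{1/2}$, so the lower bound is $2^{\Omega(n^{1/2})}$ and the construction has weight $2^{\tilde{O}(n^{1/2})}$.
\par\medskip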
383: \subsection{Our Results: Parity Functions}
384: 
385: For parity functions, we give an $O(n^3)$ time algorithm which can 
386: learn an unknown parity on $k$ variables out of $n$ using $O(n^{1-1/k})$ examples.
387: For values of $k = o(\log n)$ the sample complexity of
388: this algorithm is $o(n)$. This is the first algorithm for learning
389: parity on a superconstant number of variables with sublinear sample
390: complexity.
391: 
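\par\medskip\noindent
The $o(n)$ claim can be checked directly.  Writing
$$
n^{1-1/k} = n \cdot 2^{-(\log_2 n)/k},
$$
we see that if $k = o(\log n)$ then $(\log_2 n)/k \to \infty$, so the factor $2^{-(\log_2 n)/k}$ tends to $0$ and $n^{1-1/k} = o(n)$.  For example, $k = \sqrt{\log_2 n}$ gives $n \cdot 2^{-\sqrt{\log_2 n}}$ examples.  In contrast, for $k = \log_2 n$ the factor is exactly $1/2$ and the bound is only $n/2 = \Theta(n)$.
\par\medskip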
392: The standard PAC learning algorithm for learning an unknown parity function
393: is based on viewing a set of $m$ labelled examples as a system of $m$ linear equations modulo 2.
394: Using Gaussian elimination it is possible to solve the system and find 
395: a consistent parity function.  It can be shown that the solution thus
396: obtained is a ``good'' hypothesis if its weight (number of nonzero entries)
397: is small relative to $m$, the number of examples.  However, using Gaussian elimination
398: can result in a solution of weight as large as  
399: $\min(m,n)$ even if $k$ (the number of variables in the target parity) is very small.
400: Thus in order for this approach to give a successful learning algorithm, it is necessary to 
401: use $m = \Omega(n)$ examples regardless of the value of $k$. 
402: In contrast, observe that an attribute efficient algorithm for
403: learning a parity of length $k$ should use only poly$(k,\log n)$ examples.
404: 
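\par\medskip\noindent
As a toy illustration of why consistency alone does not control the weight (the numbers are chosen purely for illustration): let $n = 4$ and let the target be the parity $x_1 + x_2$ modulo 2, so $k = 2$.  Suppose $m = 2$ and the labelled examples are $((1,0,1,0),1)$ and $((0,1,1,0),1)$.  A candidate parity with indicator vector $a \in \{0,1\}^4$ is consistent with these examples if and only if
$$
a_1 + a_3 \equiv 1 \pmod 2, \qquad a_2 + a_3 \equiv 1 \pmod 2 .
$$
The solutions are $(1,1,0,0)$, $(0,0,1,0)$, $(1,1,0,1)$ and $(0,0,1,1)$, of weights $2$, $1$, $3$ and $2$ respectively.  Which of these Gaussian elimination returns depends on how pivots and free variables are chosen, and nothing forces it to be the target or a low weight solution.
\par\medskip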
405: Our algorithm works by finding a ``low weight'' solution to a system of
406: $m$ linear equations.  We prove that with high probability we can find a solution of weight
407: $O(n^{1-1/k})$ irrespective of $m$.  Thus by taking $m$ to be only slightly larger
408: than $n^{1 - 1/k}$ we have that our solution is a ``good'' hypothesis.
409: 
410: 
411: \subsection{Previous Results: Decision Lists} \label{sec:prevdl}
412: 
413: 
414: In previous work several algorithms with different performance bounds (in
415: terms of running time and number of examples used) have been given for
416: learning decision lists.
417: 
418: \begin{itemize}
419: 
420: \item Rivest \cite{Rivest:87} gave the first algorithm for learning
421: decision lists in Valiant's PAC model of learning from random examples.  
422: Littlestone \cite{Blum:96} subsequently gave an analogue of Rivest's
423: algorithm in the online learning model. The algorithm can learn any
424: decision list of length $k$ in $O(kn^2)$ time using $O(kn)$ examples.
425: 
426: \item A brute-force approach to learning decision lists of length $k$ is
427: to maintain a collection of all such lists which are consistent with the
428: examples seen so far, and to predict at each stage using majority vote
429: over the surviving hypotheses. This ``halving algorithm'' (proposed in
430: various forms by Barzdin and Freivald \cite{BarzdinFreivald:72}, Mitchell
431: \cite{Mitchell:82}, and Angluin \cite{Angluin:88}) can learn decision
432: lists of length $k$ using only $O(k \log n)$ examples, but the running
433: time is $n^{O(k)}.$
434: 
435: \item Several researchers \cite{Blum:96,Valiant:99} have observed that
436: Littlestone's well-known Winnow algorithm \cite{Littlestone:88} can learn
437: decision lists of length $k$ from $2^{O(k)} \log n$ examples in time
438: $2^{O(k)} n \log n$. This follows from the observation that decision lists
439: of length $k$ can be viewed as linear threshold functions with integer
440: coefficients of magnitude $2^{\Theta(k)}$. We note that our algorithm in
441: this paper always has improved sample complexity over the basic Winnow
442: algorithm, and for $k \geq (\log n)^{3/2}$ our approach improves on the
443: time complexity of Winnow as well.
444: 
445: \item Finally, several researchers have considered the special
446: case of learning a decision list of length $k$ over $n$ variables
447: in which the output bits of the decision list have at most $D$
448: alternations. Valiant \cite{Valiant:99}
449: and Nevo and El-Yaniv \cite{NevoElYaniv:02}
450: have given refined analyses of Winnow's performance for this
451: special case, and Dhagat and Hellerstein \cite{DhagatHellerstein:94} 
452: have also studied this problem.  However, for the general case
453: in which $D$ can be as large as $k,$ the results thus obtained
454: do not improve on the straightforward Winnow analysis 
455: described in the previous bullet.
456: 
457: \end{itemize}
458: These previous algorithmic results are summarized in Table \ref{table:results}.  We observe
459: that all of these earlier algorithms have an exponential dependence on the
460: relevant parameter(s) ($k$ and $\log n$ for sample complexity, $k$ for
461: running time)  for either the running time or the sample complexity.
462: 
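\par\medskip\noindent
The linear threshold representation behind the Winnow bound in the list above can be sketched as follows (a standard folklore construction, stated here for illustration, with each literal $\ell_i$ identified with its $0/1$ value).  The decision list $(\ell_1,b_1),\dots,(\ell_k,b_k),b_{k+1}$ agrees with
$$
\mathrm{sign}\left( \sum_{i=1}^{k} 2^{k+1-i} \, b_i \, \ell_i + b_{k+1} \right) .
$$
Indeed, if $i$ is the smallest index with $\ell_i = 1$, then the term $2^{k+1-i} b_i$ has magnitude strictly larger than the total magnitude of all later terms, which is at most
$$
\sum_{j=i+1}^{k} 2^{k+1-j} + 1 = 2^{k+1-i} - 1 ,
$$
so the sign is $b_i$; if no literal is true then the sum equals $b_{k+1}$.  The coefficients have magnitude up to $2^{k}$, in line with the $2^{\Theta(k)}$ bound quoted above.
\par\medskip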
463: 
464: \begin{table}[h]
465: \centerline{
466: \begin{tabular}{|l|l|l|} \hline
467: \strutje Reference: & Number of examples: & Running time: \\
468: \hline\hline
469: \strutje Rivest / Littlestone
470: & $ O(kn)$ 
471: & $ O(kn^2)  $ \\ \hline
472: \strutje Halving algorithm
473: & $ O(k \log n)$
474: & $ n^{O(k)} $ \\ \hline
475: \strutje Winnow algorithm
476: & $2^{O(k)} \log n$ 
477: & $2^{O(k)}n \log n$  \\ \hline
478: \strutje This Paper
479: & $ 2^{\tilde{O}(k^{1/3})}\log n $
480: & $ n^{\tilde{O}(k^{1/3})} $  \\ \hline
481: \end{tabular}
482: }
483: \caption{Comparison of known algorithms for 
484: learning decision lists of length $k$ on $n$ variables.
485: }
486: \label{table:results} 
487: \end{table}
488: 
489: \subsection{Previous Results: Parity Functions}
490: 
491: Little previous work has been published on learning parity
492: functions attribute efficiently in the PAC model.  The standard PAC learning
493: algorithm for parity (based on solving a system of linear equations) is due
494: to Helmbold {\em et al.\@} \cite{HSW:92}; however, as described above, this
495: algorithm is not attribute efficient since it uses $\Omega(n)$ examples.
496: 
497: Several authors have considered learning parity attribute efficiently in a model 
498: where the learner is allowed to make membership queries.  Attribute efficient
499: learning is easier in this framework since membership queries can help identify relevant variables.
500: Blum {\em et al.\@} \cite{BHL:95} give a randomized polynomial time membership-query
501: algorithm for learning parity on $k$ variables using only $O(k \log
502: n)$ examples.  These results were later
503: refined by Uehara {\em et al.} \cite{UTW:97}.
504: 
505: 
506: 
507: \subsection{Organization}
508: 
509: In Section \ref{sec:prelims} we give the necessary background on
510: online learning and polynomial threshold functions. In Section
511: \ref{sec:winnow} we show how known results from learning theory enable
512: us to reduce the decision list learning problem to a problem of
513: finding suitable polynomial threshold function representations of
514: decision lists. In Sections \ref{subsec:outer} and \ref{subsec:inner}
515: we give two different proofs of a weak tradeoff between degree and
516: weight for polynomial threshold function representations of decision
517: lists, and in Section \ref{subsec:compose} we combine these techniques
518: to prove Theorem \ref{thm:ptf}. In Section \ref{sec:decisiontree} we
519: show how to apply our techniques to give a tradeoff between sample
520: complexity and running time for learning decision trees. In Section
521: \ref{sec:discuss} we discuss the connection with Beigel's ODDMAXBIT
522: lower bound and related issues.  In Section \ref{sec:parity} we give
523: our new algorithm for learning parity functions, and in Section
524: \ref{sec:future} we suggest directions for future work.
525: 
526: \section{Preliminaries} \label{sec:prelims}
527: 
528: 
529: Attribute efficient learning has been chiefly studied in the {\em on-line
530: mistake-bound} model of concept learning which was introduced in
531: \cite{Littlestone:88,Littlestone:89}.  In this model learning proceeds in
532: a series of trials, where in each trial the learner is given an unlabelled
533: boolean example $x \in \{0,1\}^n$ and must predict the value $f(x)$ of the
534: unknown target function $f.$ After each prediction the learner is given
535: the true value of $f(x)$ and can update its hypothesis before the next
536: trial begins.  The {\em mistake bound} of a learning algorithm on a target
537: concept $c$ is measured by the worst-case number of mistakes that the
538: algorithm makes over all (possibly infinite) sequences of examples, and
539: the mistake bound of a learning algorithm on a concept class (class of
540: Boolean functions) $C$ is the worst-case mistake bound across all
541: functions $f \in C.$ The running time of a learning algorithm $A$ for a
542: concept class $C$ is defined as the product of the mistake bound of $A$ on
543: $C$ times the maximum running time required by $A$ to evaluate its
544: hypothesis and update its hypothesis in any trial.
545: 
546: 
547: Our main interests in this paper are the classes of {\em decision
548: lists} and {\em parity functions}.
549: 
550: A decision list $L$ of length $k$ over the Boolean variables
551: $x_1,\dots,x_n$ is represented by a list of $k$ pairs and a bit
552: $$
553: (\ell_1,b_1),(\ell_2,b_2),\dots,(\ell_k,b_k),b_{k+1}
554: $$
555: where each $\ell_i$ is a literal and each $b_i$ is either $-1$ or $1.$
556: Given any $x \in \{0,1\}^n,$ the value of $L(x)$ is $b_i$ if $i$ is the
557: smallest index such that $\ell_i$ is made true by $x$; if no $\ell_i$ is
558: true then $L(x)=b_{k+1}.$
559: 
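\par\medskip\noindent
To make the definition concrete, here is a small example (the particular list is chosen purely for illustration).  Take $n = 3$, $k = 3$ and
$$
L = (x_1, +1),\ (\overline{x}_3, -1),\ (x_2, +1),\ -1 .
$$
On input $x = (0,1,1)$ the literals $x_1$ and $\overline{x}_3$ are false and $x_2$ is true, so the smallest satisfied index is $i = 3$ and $L(x) = b_3 = +1$.  On input $x = (0,0,0)$ the literal $x_1$ is false but $\overline{x}_3$ is true, so $L(x) = b_2 = -1$.  On input $x = (0,0,1)$ no literal is true, so $L(x) = b_4 = -1$.
\par\medskip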
560: A parity function of length $k$ is defined by a set of variables $S
561: \subset \{x_{1},\ldots,x_{n}\}$ such that $|S| = k$. The 
562: parity function $\chi_{S}(x)$ takes value $1$ on inputs which set
563: an even number of variables in $S$ to $1$ and takes value $-1$ on
564: inputs which set an odd number of variables in $S$ to $1.$
565: 
566: Given a concept class $C$ over $\{0,1\}^n$ and a Boolean function $f \in
567: C,$ let size$(f)$ denote the description length of $f$ under some
568: reasonable encoding scheme.  (Note that if $f$ has $r$ relevant variables
569: then size$(f)$ will be at least $r \log n$ since this many bits are
570: required just to specify which variables are relevant).  We say that a
571: learning algorithm $A$ for $C$ in the mistake-bound model is {\em
572: attribute-efficient} if the mistake bound of $A$ on any concept $f \in C$
573: is polynomial in size$(f).$ In particular, the description length of a
574: length $k$ decision list (parity) is $O(k \log n)$, and thus we would ideally like
575: to have an algorithm which learns decision lists (parities) of length $k$ with a
576: mistake bound of poly$(k,\log n)$ and runs in time poly$(n).$
577: 
578: 
579: (We note here that attribute efficiency has also been studied in other
580: learning models, namely Valiant's Probably Approximately Correct (PAC)
581: model of learning from random examples.  Standard conversion techniques
582: are known \cite{Angluin:88,Haussler:88b,Littlestone:89b}
583: which can be used to
584: transform any mistake bound algorithm into a PAC learning algorithm.  
585: This transformation essentially preserves the running time of the mistake
586: bound algorithm, and the sample size required by the PAC algorithm is
587: essentially the mistake bound. Thus, positive results for mistake bound
588: learning, such as those we give for decision lists in this paper, directly yield
589: corresponding positive results for the PAC model.)
590: 
591: Finally, our results for decision lists are achieved by a careful
592: analysis of {\em polynomial threshold functions}.  Let $f$ be a
593: Boolean function $f:\{0,1\}^{n} \to \{-1,1\}$ and let $p$ be a
594: polynomial in $n$ variables with integer coefficients. Let $d$ denote
595: the degree of $p$ and let $W$ denote the sum of the absolute values of
596: $p$'s integer coefficients. If the sign of $p(x)$ equals $f(x)$ for
597: every $x \in \{0,1\}^n,$ then we say that $p$ is a {\em polynomial
598: threshold function} of degree $d$ and weight $W$ for $f.$
599: 
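\par\medskip\noindent
For example (an illustration only), let $n = 2$ and let $f$ be the parity function on $\{x_1,x_2\}$ in the $\pm 1$ convention above, so that $f(0,0) = f(1,1) = 1$ and $f(0,1) = f(1,0) = -1$.  Then
$$
p(x) = (1 - 2x_1)(1 - 2x_2) = 1 - 2x_1 - 2x_2 + 4x_1x_2
$$
satisfies $p(x) = f(x)$ on all four inputs, so $p$ is a polynomial threshold function for $f$ of degree $d = 2$ and weight $W = 1 + 2 + 2 + 4 = 9$.
\par\medskip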
600: 
601: \section{Expanded-Winnow: Learning Polynomial Threshold Functions} \label{sec:winnow}
602: 
603: Littlestone introduced the online Winnow algorithm in 1988 and showed
604: that it can learn Boolean conjunctions,
605: disjunctions, and low weight linear threshold functions attribute efficiently.  Throughout
606: its execution Winnow maintains a linear threshold function as its
607: hypothesis; at the heart of the algorithm is a novel update rule which
608: makes a {\em multiplicative} update to each coefficient of the
609: hypothesis (rather than an additive update as in the Perceptron
610: algorithm) each time a mistake is made.  Since its introduction Winnow
611: has been intensively studied from both applied and theoretical
612: standpoints (see
613: e.g. \cite{Blum:97,GoldingRoth:99,KWA:97,Servedio:02sicomp}) and
614: multiplicative updates have become widespread in machine learning
615: algorithms.
616: 
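\par\medskip\noindent
For concreteness, one common textbook form of the multiplicative update is sketched below; the parameter names $\alpha$ and $\theta$ and the exact conventions are assumptions made for illustration, and may differ in detail from the variant analysed in \cite{Littlestone:88}.  The hypothesis has nonnegative weights $w_1,\dots,w_n$, all initially $1$, and predicts positive on $x \in \{0,1\}^n$ if and only if $\sum_{i} w_i x_i \geq \theta$.  On a mistake, for each $i$ with $x_i = 1$,
$$
w_i \leftarrow \alpha w_i \ \ \mbox{(if the true label is positive)}, \qquad
w_i \leftarrow w_i / \alpha \ \ \mbox{(if the true label is negative)},
$$
where $\alpha > 1$ is a fixed constant; weights with $x_i = 0$ are unchanged.  By comparison, a Perceptron-style update would add $x_i$ to, or subtract $x_i$ from, each $w_i$.
\par\medskip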
617: The following theorem (which, as noted in \cite{Valiant:99}, is implicit
618: in Littlestone's analysis in \cite{Littlestone:88}) gives a 
619: mistake bound for Winnow when learning linear threshold functions:
620: 
621: \begin{theorem} \label{thm:winbound}
622: Let $f(x)$ be the linear threshold function 
623: sign$(\sum_{i=1}^{n} w_{i}x_{i} - \theta)$ 
624: where $\theta$ and $w_{1},\ldots,w_{n}$ are
625: integers. Let $W = \sum_{i=1}^{n} |w_{i}|$. Then 
626: Winnow learns $f(x)$ with mistake bound $O(W^{2} \log n)$,
627: and uses $n$ time steps per example.
628: \end{theorem}
629: 
630: We will use a generalization of the Winnow algorithm, called
631: Expanded-Winnow, to learn {\em polynomial} threshold functions of
degree at most $d.$ Our generalization introduces $\sum_{i=1}^{d} {n
\choose i}$ new variables (one for each monomial of degree up to $d$)
and runs Winnow to learn a linear threshold function over these new
variables.  More precisely, in each trial we convert the $n$-bit
received example $x=(x_1,\dots,x_n)$ into a $\sum_{i=1}^d {n \choose
i}$-bit expanded example (where the bits in the expanded example
638: correspond to monomials over $x_1,\dots,x_n$), and we give the
639: expanded example to Winnow.  Thus the hypothesis which Winnow
640: maintains -- a linear threshold function over the space of expanded
641: features -- is a polynomial threshold function of degree $d$ over the
642: original $n$ variables $x_1,\dots,x_n.$ Theorem \ref{thm:win}, which
643: follows directly from Theorem \ref{thm:winbound}, summarizes the
644: performance of Expanded-Winnow:
645: 
646: \medskip
647: 
648: \noindent {\bf Theorem \ref{thm:win}}
649: {\em Let ${\cal C}$ be a class of Boolean functions over
650: $\{0,1\}^n$ with the property that each $f \in {\cal C}$ has a polynomial
651: threshold function of degree at most $d$ and weight at most $W.$ Then
the Expanded-Winnow algorithm runs in $n^d$
653: time per example and has mistake bound $O(W^{2} \cdot d \cdot \log n)$ for
654: ${\cal C}.$
655: } \\
656: 
657: Theorem \ref{thm:win} shows that the degree of a polynomial threshold
658: function corresponds to Expanded-Winnow's running time, and the weight of
659: a polynomial threshold function corresponds to its sample complexity.
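
\medskip

\noindent To see where the factor $d \cdot \log n$ comes from, note that the
number of expanded features is
$$
N = \sum_{i=1}^{d} {n \choose i} \leq n^{d}, \qquad \mbox{so} \qquad
\log N \leq d \log n,
$$
and applying Theorem \ref{thm:winbound} over the $N$ expanded features gives
mistake bound $O(W^2 \log N) = O(W^2 \cdot d \cdot \log n).$  As a toy
illustration (our own example), take $n=3$ and $d=2$: the expanded features
are $x_1, x_2, x_3, x_1x_2, x_1x_3, x_2x_3$, so $N = 6$, and the example
$x = (1,0,1)$ is expanded to $(1,0,1,0,1,0).$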
660: 
661: \ignore{
662: 
663: \begin{figure*}[t] \label{fig:vw}
664: \begin{small}
665: 
666: \noindent {\bf Algorithm V-Winnow:} \\
667: 
668: \noindent {\bf Input: } A sequence of trials from a polynomial $p$ in $n$ variables $\{x_{1},\ldots,x_{n}\}$ of degree $d$ where each \mbox{~~~~~~~~~~~~~~}coefficient is at most $w$. 
669: 
670: \vskip.1in
671: 
672: \noindent {\bf Output: } A polynomial $p'$ in $n$ variables of degree $d$
673: such that for every $x \in \{0,1\}^{n}$, $p'(x) = p(x)$.
674: 
675: \medskip
676: 
677: \begin{enumerate}
678: 
679: \item Lexicographically order all $m = n^{d}$ monomials of degree at most
680: $d$ over the variables $\{x_{1},\ldots,x_{n}\}$.
681: 
682: \item Introduce new variables $y_{1},\ldots,y_{m}$ such that $y_{i}$ is
683: equal to the $i$th monomial in Step 1.
684: 
685: \item Run Winnow over the variables $y_{1},\ldots,y_{m}$ where on example
686: $(a,f(a))$, $y_{i}$ is equal to the $i$th monomial on assignment $a$.
687: 
688: \item Let $h = \sum_{i=1}^{m} \alpha_{i}y_{i}$ be the output of Winnow.
689: 
690: \item Return $h$ with each $y_{i}$ written as the $i$th monomial over
691: $\{x_{1},\ldots,x_{n}\}$.
692: 
693: \end{enumerate}
694: 
695: \end{small}
696: \caption{The V-Winnow algorithm.}
697: \end{figure*}
698: 
699: 
700: \begin{theorem} \label{thm:vwbound}
701: Let ${\cal C}$ be a class of Boolean functions over $\{0,1\}^n$
702: with the property that for each $f \in {\cal C}$,
703: 
704: \begin{itemize}
705: 
706: \item $f$ depends on at most $k$ variables
707: 
708: \item $f$ is computed by a polynomial threshold function of degree at most
709: $d$ where each coefficient is an integer weight of at most $w$.
710: 
711: \end{itemize}
712: Then {\tt V-Winnow} is an online learning algorithm for ${\cal C}$ which
713: uses $n^d$ time steps per example and has mistake bound $(w \cdot
714: k^{d})^{2} \cdot d \cdot \log n.$ The output hypothesis will be a
715: polynomial threshold function equivalent to $f$.
716: 
717: \end{theorem}
718: 
719: \begin{proof}
720: Let $f$ be a function of $k$ variables computed by a polynomial threshold
721: function $p$ of degree $d$ where each coefficient is of weight at most
722: $w$. We will now apply the algorithm {\tt V-Winnow} outlined in Figure
723: \ref{fig:vw}.  Fix a lexicographic ordering of all monomials of degree $d$
724: over $n$ variables and let $y_{i}$ be the $i$th monomial in this list.
725: Then $f$ can be written as a linear threshold function $h$ over the
726: variables $y_{i}$, i.e. $f = h = \sum_{i=1}^{m} a_{i}y_{i}$ for some
727: integer coefficients $a_{i} \leq w$. Since $f$ depends on only $k$
728: variables, at most $k^{d}$ of the variables in $h$ have nonzero
729: coefficients. Now run the standard Winnow algorithm to learn $h$ (for
730: every example $(a_{1},\ldots,a_{n}, f(a_{1},\ldots,a_{n}))$, set $y_{i}$
731: equal to the $i$th monomial on input $a_{1},\ldots,a_{n}$.)  Applying
732: Theorem \ref{thm:winbound}, the standard Winnow algorithm (and hence
733: V-Winnow) will make at most $(w \cdot k^{d})^{2} \cdot d \cdot \log n$
734: mistakes and output a linear threshold function over the $y_{i}$'s
735: equivalent to $h$. Replacing each $y_{i}$ with the $i$th monomial over
736: $\{x_{1},\ldots,x_{n}\}$ we obtain a polynomial threshold function
737: equivalent to $f$. The time bound also follows directly from Theorem
738: \ref{thm:winbound}.
739: \end{proof}
740: 
741: }
742: 
743: \section{Constructing Polynomial Threshold Functions for Decision Lists}
744: 
745: In previous constructions of polynomial threshold functions for
746: computational learning theory applications
747: \cite{KlivansServedio:01,KOS:02,OdonnellServedio:03a} the sole goal has
748: been to minimize the {degree} of the polynomials regardless of the size of
749: the coefficients.  As an extreme example, the construction of
750: \cite{KlivansServedio:01} of $\tilde{O}(n^{1/3})$ degree polynomial
751: threshold functions for DNF formulae yields polynomials whose coefficients
752: can be {\em doubly exponential} in the degree. In contrast, 
753: given Theorem \ref{thm:win} we must now
754: construct polynomial threshold functions that have low degree and low
755: weight.
756: 
757: We give two constructions of polynomial threshold functions for decision lists, each of which
758: has relatively low degree \ignore{($k^{1/2}$)} 
759: and relatively low weight. 
760: \ignore{($2^{\tilde{O}(k^{1/2})}$).}  
761: We then combine 
762: these approaches to achieve an optimal construction with improved bounds on both
763: degree and weight.\ignore{with degree $k^{1/3}$
764: and weight $2^{\tilde{O}(k^{1/3})}.$}
765: 
766: \subsection{Outer Construction} \label{subsec:outer}
767: 
768: Let $L$ be a decision list of length $k$ over variables $x_1,\dots,x_k.$
We first give a simple construction of a degree $h$, weight $4 \cdot
2^{k/h + h}$ polynomial threshold function for $L$ which is based on
771: breaking the list $L$ into sublists.  We call this construction the
``outer construction'' since we will ultimately combine this construction
773: with a different construction for the ``inner'' sublists.
774: 
775: We begin by showing that $L$ can be expressed as a threshold of {\em
776: modified decision lists} which we now define.  The set ${\cal B}_h$ of
777: modified decision lists is defined as follows:
778: each function in ${\cal B}_h$ is a decision list
779: $(\ell_1,b_1),(\ell_2,b_2),\dots, (\ell_h,b_h),0$ where each $\ell_i$ is
780: some literal over $x_1,\dots,x_n$ and each $b_i \in \{-1,1\}.$ Thus the
781: only difference between a modified decision list $f \in {\cal B}_h$ and a
782: normal decision list of length $h$ is that the final output value is
783: $0$ rather than $b_{h+1} \in \{-1,+1\}.$
784: 
785: Without loss of generality we may suppose that the list $L$ is
786: $(x_1,b_1),\dots,(x_k,b_k),b_{k+1}.$ We break $L$ sequentially into $k/h$
787: blocks each of length $h$. Let $f_{i} \in {\cal B}_h$ be the modified
788: decision list which corresponds to the $i$-th block of $L,$ i.e. $f_i$ is
the list $(x_{(i-1) h + 1},b_{(i-1)h+1}),\ldots, (x_{ih},b_{ih}),0$.
Intuitively $f_{i}$ computes the $i$th block of $L$
and equals $0$ only if we ``fall off the edge'' of the $i$th block. We then
792: have the following straightforward claim:
793: 
794: \begin{claim} \label{cla:outer}
The decision list $L$ is equivalent to 
796: \begin{eqnarray}
797: \mbox{sign}\left(\sum_{i=1}^{k/h}
798: 2^{k/h - i + 1} f_{i}(x) \ + \  b_{k+1} \right). \label{eq:outer}
799: \end{eqnarray}
800: \end{claim}
801: \begin{proof}
Given an input $x \neq 0^k$ let $r=(i-1)h + c$, with $1 \leq c \leq h$, be the first index such that $x_r = 1.$
It is easy to see that $f_j(x) = 0$ for $j<i$ and $f_i(x) = b_r$, and hence the value in 
(\ref{eq:outer}) is $2^{k/h - i + 1}b_{r} + \sum_{j=i+1}^{k/h}
2^{k/h - j + 1} f_{j}(x) \ + \  b_{k+1}$.
Since $|f_j(x)| \leq 1$ for every $j$, the terms after the first have total magnitude at most
$\sum_{j=i+1}^{k/h} 2^{k/h - j + 1} + 1 = 2^{k/h - i + 1} - 1 < 2^{k/h - i + 1}$,
so the sign of this value is $b_r.$
Finally if $x=0^k$ then the argument of the sign in (\ref{eq:outer}) is $b_{k+1}$.
808: \end{proof}
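
\medskip

\noindent As a sanity check on Claim \ref{cla:outer}, consider the following
small instance (our own illustration): $k = 4$ and $h = 2$, so that
$f_1 = (x_1,b_1),(x_2,b_2),0$ and $f_2 = (x_3,b_3),(x_4,b_4),0$, and
(\ref{eq:outer}) reads
$$
\mbox{sign}\left(4 f_1(x) + 2 f_2(x) + b_5\right).
$$
On input $x = (0,1,1,0)$ we have $f_1(x) = b_2$ and $f_2(x) = b_3$, so the
argument is $4b_2 + 2b_3 + b_5$, whose sign is $b_2$ since
$|2b_3 + b_5| \leq 3 < 4.$  On input $x = (0,0,0,1)$ we have $f_1(x) = 0$ and
$f_2(x) = b_4$, so the argument is $2b_4 + b_5$, whose sign is $b_4.$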
809: 
810: \medskip \noindent {\bf Note:}  It is easily seen that we can replace
811: the $2$ in formula (\ref{eq:outer}) by a 3; this will prove
812: useful later.
813: 
814: \medskip
815: 
816: As an aside, note that Claim \ref{cla:outer} can already be used to obtain a tradeoff
817: between running time and sample complexity for learning decision lists.
818: The class ${\cal B}_h$ contains at most $(4n)^h$ functions.  
819: Thus as in Section \ref{sec:winnow}
820: it is possible to run the Winnow algorithm using the functions in ${\cal B}_h$ as the base features
821: for Winnow.  (So for each example $x$ which it receives, the algorithm would first compute
822: the value of $f(x)$ for each $f \in {\cal B}_h$, and would then use this vector of $(f(x))_{f \in {\cal B}_h}$
823: values as the example point for Winnow.)  A direct analogue of Theorem 
824: \ref{thm:win} now implies
825: that Expanded-Winnow (run over this expanded feature space of functions from 
826: ${\cal B}_h$) can be used to learn 
827: $L_k$ in time $n^{O(h)}2^{O(k/h)}$ with mistake bound $2^{O(k/h)} h \log n$.
828: 
829: However, it will be more useful for us to obtain a polynomial threshold function for $L$.  We
830: can do this from Claim \ref{cla:outer} as follows:
831: 
832: 
833: \begin{theorem} \label{thm:outer}
834: Let $L$ be a decision list of length $k$.  Then for any $h < k$
835: we have that $L$ is computed by a
836: polynomial threshold function of degree $h$ 
837: and weight $4 \cdot 2^{k/h + h}$.
838: \end{theorem}
839: 
840: \begin{proof}
841: Consider the first modified decision list $f_1 = (\ell_1,b_1),(\ell_2,b_2),\dots,(\ell_h,b_h),0$ 
842: in the expression (\ref{eq:outer}).  For $\ell$ a literal let $\tilde{\ell}$ denote $x$
if $\ell$ is an unnegated variable $x$ and let $\tilde{\ell}$ denote $1-x$ 
if $\ell$ is a negated variable $\overline{x}.$ 
845: We have that for all $x \in \{0,1\}^h$, $f_1(x)$ is computed exactly by
846: the polynomial
847: $$
848: f_1(x) = \tilde{\ell}_1b_1 + (1-\tilde{\ell}_1)\tilde{\ell}_2 b_2 + 
849: (1-\tilde{\ell}_1)(1-\tilde{\ell}_2)\tilde{\ell}_3 b_3 + \cdots + 
850: (1-\tilde{\ell}_1)\cdots(1-\tilde{\ell}_{h-1})\tilde{\ell}_h b_h.
851: $$
852: This polynomial has degree $h$ and has weight at most $2^{h+1}.$
853: Summing these polynomial representations for $f_1,\dots,f_{k/h}$ 
854: as in (\ref{eq:outer}) we see
855: that the resulting polynomial threshold function given by (\ref{eq:outer})
856: has degree $h$ and weight at most $2^{k/h + 1} \cdot 2^{h+1} = 
857: 4 \cdot 2^{k/h + h}.$
858: \end{proof}
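
\medskip

\noindent For instance (a worked case of the formula in the proof, with
$h=2$ and unnegated literals), the block $f_1 = (x_1,b_1),(x_2,b_2),0$
expands to
$$
f_1(x) = b_1 x_1 + b_2(1-x_1)x_2 = b_1x_1 + b_2x_2 - b_2x_1x_2,
$$
which has degree $2$ and weight $3 \leq 2^{h+1} = 8.$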
859: 
860: \medskip
861: 
862: Specializing to the case $h=\sqrt{k}$ we obtain:
863: 
864: \begin{corollary} \label{cor:outer}
865: Let $L$ be a decision list of length $k$.
866: Then $L$ is computed by a polynomial threshold function of
867: degree $k^{1/2}$ and weight $4 \cdot 2^{2k^{1/2}}.$
868: \end{corollary}
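
\medskip

\noindent The choice $h = \sqrt{k}$ balances the two terms in the exponent of
Theorem \ref{thm:outer}: for every $h > 0$,
$$
{\frac k h} + h \geq 2\sqrt{k},
$$
with equality exactly when $h = \sqrt{k}$, so this choice gives the smallest
weight bound obtainable from Theorem \ref{thm:outer}, at the price of degree
$\sqrt{k}.$  For example, $k = 100$ and $h = 10$ give degree $10$ and weight
at most $4 \cdot 2^{20}.$
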
869: 
870: We close this section by observing that an intermediate result 
871: of \cite{KlivansServedio:01} can be used to give an alternate proof 
872: of Corollary \ref{cor:outer} with slightly weaker parameters; 
873: see Appendix \ref{ap:alt}.
874: 
875: \subsection{Inner Approximator} \label{subsec:inner}
876: 
877: In this section we construct low degree, low weight 
878: polynomials which approximate (in the $L_\infty$ norm)
879: the modified decision lists from the previous subsection.  Moreover,
880: the polynomials we construct 
881: are exactly correct on inputs which ``fall off the end'':
882: \ignore{
883: We refer to these modified decision lists as the ``inner'' decision lists.
884: The construction is stronger than a polynomial threshold function; 
885: the polynomial we give for an inner decision list is actually
886: a good approximator with respect to the
887: $L_{\infty}$ norm (and is exactly right on the input $0^h$):
888: }
889: 
890: \begin{theorem} \label{thm:inner}
891: Let $f \in {\cal B}_h$ be a modified decision list of length $h$ 
892: (without loss of generality we may assume that $f$ is
893: $(x_1,b_1),\dots,(x_h,b_h),0$).
894: Then there is a degree $2\sqrt{h}\log{h}$
895: polynomial $p$ such that 
896: \begin{itemize}
897: \item for every input $x \in \{0,1\}^h$ we have $|p(x) - f(x)| \leq 1/h$. 
898: \item $p(0^h) = f(0^h) = 0$.
899: \end{itemize}
900: \end{theorem}
901: \begin{proof}
902: As in the proof of Theorem \ref{thm:outer} we have that
903: \[ f(x) = b_{1}x_{1} + b_{2}(1-x_{1})x_{2} + \cdots + 
904: b_{h}(1-x_{1})\cdots(1-x_{h-1})x_{h}. 
905: \]
906: We will construct a lower (roughly $\sqrt{h}$) degree polynomial which 
907: closely approximates $f$.  Let $T_{i}$ denote $(1-x_1)\dots(1-x_{i-1})x_i$,
908: so we can rewrite $f$ as
909: \[ f(x) = b_{1}T_{1} + b_{2}T_{2} + \cdots + b_{h}T_{h}. \]
910: 
911: We approximate each $T_i$ separately as follows:
912: set $A_{i}(x) = h-i  + x_{i} + \sum_{j=1}^{i-1} (1 - x_{j})$.
913: Note that for $x \in \{0,1\}^h,$ we have
914: $T_i(x) = 1$ iff $A_i(x) = h$ and $T_i(x) = 0$
915: iff $0 \leq A_i(x) \leq h-1.$
916: Now define the polynomial 
917: $$
918: Q_{i}(x) = q \left(A_{i}(x)/h \right)  \mbox{~~~~~where~~~~~}
919: q(y) = C_d\left(y \left(1 + 1/h \right) \right).
920: $$
921: 
922: \noindent As in \cite{KlivansServedio:01},
923: here $C_{d}(x)$ is the $d$th Chebyshev polynomial of the
924: first kind (a univariate polynomial of degree $d$) 
925: with $d$ set to $\lceil \sqrt{h} \rceil$. 
926: We will need the following facts about Chebyshev polynomials 
927: \cite{Cheney:66}: 
928: \begin{itemize} 
929: \item $|C_d(x)| \leq 1$ for $|x| \leq 1$ with $C_d(1) = 1;$ 
930: \item $C_d^\prime(x) \geq d^2$ for $x > 1$ with $C_d^\prime(1) = d^2.$ 
\item The coefficients of $C_{d}$ are integers, each of
magnitude $2^{O(d)}$. 
933: \end{itemize}
The first two facts imply that $q(1) = C_d(1 + {\frac 1 h}) \geq 1 + {\frac {d^2} h} \geq 2$ but $|q(y)| \leq 1$ 
for $y \in [0,1 - {\frac 1 h}].$  We
936: thus have that $Q_i(x) = q(1) \geq 2$ if $T_i(x) = 1$
937: and $|Q_i(x)| \leq 1$ if $T_i(x) = 0.$
938: Now define 
939: $
940: P_i(x) = \left({\frac {Q_i(x)}{q(1)}}\right)^{2 \log h}.
941: $
942: This polynomial is easily seen to be a good approximator for $T_i$:
943: if $x \in \{0,1\}^h$ is such that $T_i(x) = 1$ then $P_i(x) = 1$,
944: and if $x \in \{0,1\}^h$ is such that $T_i(x) = 0$ then
$|P_i(x)| \leq \left({\frac 1 2}\right)^{2 \log h} = {\frac 1 {h^2}}.$
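
\medskip

\noindent To make the separation concrete, here is a small numerical instance
(our own illustration, taking logarithms base $2$).  Let $h = 4$, so $d = 2$
and $C_2(x) = 2x^2 - 1.$  Then
$$
q(y) = C_2\left({\frac {5y} 4}\right) = {\frac {25} 8}y^2 - 1,
\qquad q(1) = {\frac {17} 8} \geq 2.
$$
Every other possible value of $A_i(x)/h$ lies in
$\{0, {\frac 1 4}, {\frac 1 2}, {\frac 3 4}\}$, where $q$ takes the values
$-1,\ -{\frac {103}{128}},\ -{\frac 7 {32}},\ {\frac {97}{128}}$ respectively,
all of magnitude at most $1.$  With exponent $2 \log h = 4$, an input with
$T_i(x) = 0$ therefore satisfies
$|P_i(x)| \leq (8/17)^4 < 0.05 < {\frac 1 {16}} = {\frac 1 {h^2}}.$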
946: 
947: Now define
$R(x) = \sum_{i=1}^{h} b_iP_{i}(x)$ and $p(x) = R(x) - R(0^h).$
949: \ignore{
950: We will see that $Q_{i}(x) > 2$ on assignments $x$ for which 
951: $T_{i}(x)=0$, while $|Q_i(x)|\leq 1$ on assignments for which
952: $T_{i}(x)$ output $s_{i}$. To
953: strengthen this separation we define the following polynomial
954: $P_{i}(x) = (1/\ell^{2}) Q_{i}(x)^{2 \log \ell}$ and to approximate
955: all of $b$ we set $R(x) = \sum_{i=1}^{\ell} P_{i}(x)$.
956: }
957: It is clear that $p(0^h)=0.$ 
958: We will show that for every input $0^h \neq x \in \{0,1\}^h$ we have 
959: $|p(x) - f(x)| \leq {1/h}$. Fix some such $x$; let $i$ be the first
960: index such that $x_i = 1.$  As shown above we have
961: $P_i(x) = 1.$  Moreover, by inspection of $T_j(x)$ we have that
962: $T_j(x) = 0$ for all $j \neq i,$  
and hence $|P_j(x)| \leq {\frac 1 {h^2}}$ for each such $j$.  Consequently 
964: the value of $R(x)$ must lie in $[b_i - {\frac {h-1}{h^2}},
965: b_i + {\frac {h-1}{h^2}}]$.  Since $f(x) = b_i$ we have that
966: $p(x)$ is an $L_\infty$ approximator for $f(x)$ as desired.
967: 
968: Finally, it is straightforward to verify that $p(x)$ has the claimed
969: bound on degree.
970: \end{proof}
971: 
972: \ignore{
973: \noindent Now fix any nonzero assignment to the variables $x$ that
974: causes $b$ to output $1$.  From the definition of $b$ there exists a
975: unique term $T_{i}$ that is not set to zero by $x$. Then for the
976: corresponding arithmetization $A_{i}$ we have $A_{i}/i= 1$, so $2 \leq
977: Q_{i}(x) \leq 2.01 $ and hence $1 \leq P_{i}(x) \leq 1.1$. Similarly
978: if $x$ causes $b$ to output $-1$ then $-1 \leq P_{i}(x) \leq -.9$. \\
979: 
980: \noindent Let $T_{j}$ be any term that is set to zero by x, and so
981: $A_{j}(x) \leq 1 - 1/\ell$. Then $|Q_{i}(x)| \leq 1$ and thus
982: $|P_{i}(x)| \leq 1/\ell^{2}$. Hence for any nonzero assignment $x$,
983: $|R(x) - b(x)| \leq \mbox{{\bf $\eps$ from cheby approx +
984: $1/\ell$}}$. Notice also that $|R(\overline{0})| \leq 1/\ell.$ Thus
985: for any nonzero assignment $x$, $|H(x) - b(x)| \leq 2/\ell$ and
986: clearly $H(\overline{0}) = 0$. 
987: }
988: 
989: \medskip
990: 
991: Strictly speaking we cannot discuss the weight of the polynomial
992: $p$ since its coefficients are rational numbers but not
993: integers.  However, by multiplying $p$ by a suitable integer
994: (clearing denominators) we obtain an integer polynomial
995: with essentially the same properties.
996: Using the third fact about Chebyshev polynomials from our
997: proof above, we have that $q(1)$ is a rational number $N_1/N_2$ where
998: $N_1,N_2$ are each integers of magnitude $h^{O(\sqrt{h})}.$
999: Each $Q_i(x)$ for $i=1,\dots,h$ can be written as an integer
1000: polynomial (of weight $h^{O(\sqrt{h})}$) divided by $h^{\sqrt{h}}.$
1001: Thus each $P_i(x)$ can be written as 
1002: $\tilde{P}_i(x)/(h^{\sqrt{h}}N_1)^{2 \log h}$ where $\tilde{P}_i(x)$
1003: is an integer polynomial of weight $h^{O(\sqrt{h} \log h)}$.
1004: It follows that $p(x)$ equals $\tilde{p}(x)/C,$ where $C$
1005: is an integer which is at most $2^{O(h^{1/2} \log^2 h)}$
1006: and $\tilde{p}$ is a polynomial with integer coefficients and weight
1007: $2^{O(h^{1/2} \log^2 h)}.$  We thus have
1008: 
1009: \begin{corollary}
1010: \label{cor:inner}
1011: Let $f \in {\cal B}_h$ be a modified decision list of length $h$.
1012: Then there is an integer polynomial 
1013: $p(x)$ 
1014: of degree $2\sqrt{h}\log{h}$
1015: and weight $2^{O(h^{1/2} \log^2{h})}$ and an integer $C = 
1016: 2^{O(h^{1/2} \log^2 h)}$ such that
1017: \begin{itemize}
1018: \item for every input $x \in \{0,1\}^h$ we have $|p(x) - Cf(x)| \leq C/h$.
1019: \item $p(0^h) = f(0^h) = 0$.
1020: \end{itemize}
1021: \end{corollary}
1022: 
1023: The fact that $p(0^h)$ is exactly 0
1024: will be important in the next subsection when we combine the
1025: inner approximator with the outer construction.
1026: 
1027: \subsection{Composing the Constructions} \label{subsec:compose}
1028: 
1029: In this section we combine the two constructions from the previous
1030: subsections to obtain our main polynomial threshold construction:
1031: 
1032: \begin{theorem} \label{thm:mainptf}
1033: Let $L$ be a decision list of length $k$.  Then for any $h < k$,
1034: $L$ is computed by a polynomial threshold function of degree 
1035: $O(h^{1/2} \log h)$
1036: and weight $2^{O(k/h + h^{1/2}\log^2 h)}.$
1037: \end{theorem}
1038: \begin{proof}
1039: We suppose without loss of generality that $L$ is the decision list
1040: $(x_1,b_1),\dots,(x_k,b_k),b_{k+1}.$
1041: We begin with the outer construction: from the note following
1042: Claim \ref{cla:outer} we have that 
1043: $$L(x) = 
1044: \mbox{sign}\left(C\left[\sum_{i=1}^{k/h}
1045: 3^{k/h - i + 1} f_{i}(x) \ + \  b_{k+1} \right]\right)
1046: $$
1047: where $C$ is the value from Corollary \ref{cor:inner} and 
1048: each $f_{i}$ is a modified decision list of length $h$
1049: computing the restriction of $L$ to its $i$th block as defined in
1050: Subsection \ref{subsec:outer}.
1051: Now we use the inner approximator to replace each $Cf_i$ above
1052: by $p_i$, the approximating polynomial from Corollary
1053: \ref{cor:inner}, i.e. consider sign$(H(x))$ where 
1054: $$
1055: H(x) = \sum_{i=1}^{k/h}
1056: (3^{k/h - i + 1} p_{i}(x)) \ + \  Cb_{k+1}.
1057: $$
1058: We will show that sign$(H(x))$
1059: is a polynomial threshold function which computes $L$ correctly
1060: and has the desired degree and weight.
1061: 
1062: Fix any $x \in \{0,1\}^k.$  If $x=0^k$ then by Corollary
1063: \ref{cor:inner} each $p_i(x)$ is $0$ so $H(x) = C b_{k+1}$ has
1064: the right sign.  
1065: Now suppose that $r=(i-1)h+c$ is the first index such that
1066: $x_r = 1.$  By Corollary \ref{cor:inner}, we have that
1067: \begin{itemize}
1068: \item $3^{k/h - j + 1}p_j(x) = 0$ for $j < i$;
1069: \item $3^{k/h - i + 1}p_i(x)$ differs from $3^{k/h - i + 1}Cb_r$ by at most
1070: $C3^{k/h - i + 1}\cdot {\frac 1 h}$;
1071: \item The magnitude of each value $3^{k/h - j + 1}p_j(x)$ is at most
1072: $C3^{k/h - j + 1}(1 + {\frac 1 h})$ for $j > i.$
1073: \end{itemize}
1074: Combining these bounds,
1075: the value of $H(x)$ differs from $3^{k/h - i + 1}Cb_r$ by at most
1076: $$
1077: C\left(
1078: {\frac {3^{k/h - i + 1}}{h}} + 
1079: \left(1 + {\frac 1 h}\right)
1080: \left[3^{k/h - i} + 3^{k/h - i - 1} + \cdots + 3\right] + 1
1081: \right)
1082: $$
1083: which is easily seen to be less than $C3^{k/h - i + 1}$ in magnitude.
1084: Thus the sign of $H(x)$ equals $b_r$, and consequently sign$(H(x))$ is a
1085: valid polynomial threshold representation for $L(x).$  Finally,
1086: our degree and weight bounds from Corollary \ref{cor:inner}
1087: imply that
1088: the degree of $H(x)$ is $O(h^{1/2} \log h)$ and the weight
1089: of $H(x)$ is $2^{O(k/h) + O(h^{1/2}\log^2 h)}$, and the theorem
1090: is proved.
1091: \end{proof}
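
\medskip

\noindent The inequality used in the proof above can be checked directly; the
following calculation is a sketch that assumes $h \geq 3$ and writes
$m = k/h - i + 1 \geq 1.$  Summing the geometric series gives
$3^{m-1} + \cdots + 3 = (3^m - 3)/2$, so the error term divided by $C$ equals
$$
{\frac {3^m} h} + \left(1 + {\frac 1 h}\right){\frac {3^m - 3} 2} + 1
= 3^m\left({\frac 1 2} + {\frac 3 {2h}}\right) - {\frac 1 2} - {\frac 3 {2h}}
< 3^m\left({\frac 1 2} + {\frac 3 {2h}}\right) \leq 3^m,
$$
where the last step uses $h \geq 3.$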
1092: 
1093: \medskip
1094: 
1095: Taking $h = k^{2/3} / \log^{4/3}k$ in the above theorem we obtain our
1096: main result on representing decision lists as polynomial threshold
1097: functions:
1098: 
1099: \medskip
1100: 
1101: \noindent {\bf Theorem \ref{thm:ptf}}
1102: {\em Let $L$ be a decision list of length $k$.  Then 
1103: $L$ is computed by a polynomial threshold function
1104: of degree $k^{1/3} \log^{1/3} k$ and weight
1105: $2^{O(k^{1/3} \log^{4/3} k)}.$
1106: } \\
1107: 
1108: 
1109: Theorem \ref{thm:ptf} immediately implies that Expanded-Winnow can learn decision lists of length $k$ using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$.
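
\medskip

\noindent For completeness, here is the substitution behind these bounds (a
routine calculation, using $\log h = \Theta(\log k)$ for
$h = k^{2/3}/\log^{4/3} k$):
\begin{eqnarray*}
h^{1/2} \log h & = & \Theta\left({\frac {k^{1/3}}{\log^{2/3} k}} \cdot \log k\right)
 = \Theta(k^{1/3}\log^{1/3} k), \\
{\frac k h} & = & k^{1/3}\log^{4/3} k, \\
h^{1/2} \log^2 h & = & \Theta\left({\frac {k^{1/3}}{\log^{2/3} k}} \cdot \log^2 k\right)
 = \Theta(k^{1/3}\log^{4/3} k).
\end{eqnarray*}
Thus $d = O(k^{1/3}\log^{1/3}k)$ and $W = 2^{O(k^{1/3}\log^{4/3}k)}$, and
Theorem \ref{thm:win} gives mistake bound
$O(W^2 \cdot d \cdot \log n) = 2^{\tilde{O}(k^{1/3})}\log n$ and running time
$n^{O(d)} = n^{\tilde{O}(k^{1/3})}.$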
1110: 
1111: %\section{Discussion} \label{sec:discuss}
1112: 
1113: 
1114: \section{Application to Learning Decision Trees} \label{sec:decisiontree}
1115: 
In 1989 Ehrenfeucht and Haussler \cite{EhrenfeuchtHaussler:89} gave a
time $n^{O(\log s)}$ algorithm for learning decision trees of size
1118: $s$ over $n$ variables. Their algorithm uses $n^{O(\log s)}$ examples,
1119: and they asked if the sample complexity could be reduced to
1120: $\poly(n,s)$.  We can apply our techniques here to give an algorithm
1121: using $2^{\tilde{O}(s^{1/3})} \log n$ examples, if we are willing to
1122: spend $n^{\tilde{O}(s^{1/3})}$ time.
1123: 
1124: First we need to generalize Theorem \ref{thm:mainptf} for higher order
1125: decision lists. An $r$-decision list is like a standard decision list
1126: but each pair is now of the form $(C_i,b_i)$ where $C_i$ is a
1127: conjunction of at most $r$ literals and as before $b_i = \pm 1$.  The
1128: output of such an $r$-decision list on input $x$ is $b_i$ where $i$ is
1129: the smallest index such that $C_i(x)=1.$
1130: 
1131: We have the following:
1132:  
1133: \begin{corollary} \label{cor:gdl}
1134: Let $L$ be an $r$-decision list of length $k$. Then for any
1135: $h < k$, $L$ is computed by a polynomial threshold function 
1136: of degree $O(rh^{1/2} \log h)$ and weight 
1137: $2^{r + O(k/h + h^{1/2} \log^2 h)}$.             
1138: \end{corollary}
1139: 
1140: \begin{proof}
1141: Let $L$ be the $r$-decision list $(C_1,b_1),\dots,(C_k,b_k),b_{k+1}.$
1142: By Theorem \ref{thm:mainptf} there is a polynomial threshold function
1143: of degree $O(h^{1/2} \log h)$ and weight
1144: $2^{O(k/h + h^{1/2} \log^2 h)}$ over the variables $C_1,\dots,C_k.$
1145: Now replace each variable $C_{i}$ by the interpolating polynomial
1146: which computes it exactly as a function from $\{0,1\}^n$ to $\{0,1\}.$
1147: Each such interpolating polynomial has degree $r$ and integer
1148: coefficients of total magnitude at most $2^r$, and the corollary follows.
1149: \end{proof} 
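
\medskip

\noindent For example (our own illustration with $r = 3$), the conjunction
$C = x_1 \wedge \overline{x_2} \wedge x_3$ is computed exactly on
$\{0,1\}^n$ by
$$
x_1(1-x_2)x_3 = x_1x_3 - x_1x_2x_3,
$$
a polynomial of degree $3$ whose coefficients have total magnitude
$2 \leq 2^3.$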
1150: 
1151: \begin{corollary} \label{cor:learngdl}
1152: There is an algorithm for learning
1153: $r$-decision lists over $\{0,1\}^n$ which, when learning an $r$-decision list
1154: of length $k$, has mistake bound
1155: $2^{\tilde{O}(r + k^{1/3})}\log n$ and runs  in time
1156: $n^{\tilde{O}(rk^{1/3})}$.
1157: \end{corollary}
1158: 
1159: Now we can apply Corollary \ref{cor:learngdl} to obtain a tradeoff
1160: between running time and sample complexity for learning decision
1161: trees:
1162: 
1163: \begin{theorem}
1164: Let $D$ be a decision tree of size $s$ over $n$ variables. Then $D$ can be learned using $2^{\tilde{O}(s^{1/3})} \log n$ examples in time $n^{\tilde{O}(s^{1/3})}.$ 
1165: \end{theorem}
1166: 
1167: 
1168: \begin{proof}
1169: Blum \cite{Blum:92} has shown that any decision tree of size $s$ is
1170: computed by a $(\log s)$-decision list of length $s.$ Applying
1171: Corollary \ref{cor:learngdl} we thus see that Expanded-Winnow can be
1172: used to learn decision trees of size $s$ over $\{0,1\}^n$ with the
1173: claimed bounds on time and sample complexity.
1174: \end{proof}
1175: 
1176: 
1177: 
1178: 
1179: \section{Lower Bounds for Decision Lists} \label{sec:discuss}
1180: 
1181: Here we observe that our construction from
1182: Theorem \ref{thm:mainptf} is essentially optimal in terms of the
1183: tradeoff it achieves between polynomial threshold function degree
1184: and weight.
1185: 
1186: In \cite{Beigel:94}, Beigel constructs an oracle separating $\PP$ from
1187: $\PNP$. At the heart of his construction is a proof that any low
1188: degree polynomial threshold function for a particular
decision list, called the $\mathrm{ODDMAXBIT}_{n}$ function,
1190: must have large weights:
1191: 
1192: \begin{definition}
1193: The $\mathrm{ODDMAXBIT}_{n}$ function on input $x=x_{1},\ldots,x_{n}
1194: \in \{0,1\}^{n}$ equals $(-1)^{i}$ where $i$ is the index of the
1195: first nonzero bit in $x.$
1196: \end{definition}
1197: 
1198: It is clear that the $\mathrm{ODDMAXBIT}_{n}$ function is 
1199: equivalent to a decision list of length $n$:
1200: $$
1201: (x_1,-1),(x_2,1),(x_3,-1),\dots,(x_n,(-1)^{n}),(-1)^{n+1}.
1202: $$
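
\noindent For example, with $n = 3$ the inputs $x = (0,1,1)$ and
$x = (0,0,1)$ have their first nonzero bit at positions $2$ and $3$, so
$\mathrm{ODDMAXBIT}_3$ takes the values $+1$ and $-1$ on them respectively,
and on $x = 0^3$ the list above outputs its default value $(-1)^{4} = +1.$
Applying the expression of Claim \ref{cla:outer} with $h=1$ to this list (a
sketch of the degree $1$ case) gives the linear threshold representation
$$
\mathrm{ODDMAXBIT}_3(x) = \mbox{sign}\left(-8x_1 + 4x_2 - 2x_3 + 1\right),
$$
which evaluates to $\mbox{sign}(3) = +1$, $\mbox{sign}(-1) = -1$ and
$\mbox{sign}(1) = +1$ on these three inputs.
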
1203: The main technical theorem which Beigel proves in \cite{Beigel:94}
1204: states that any polynomial threshold function of degree $d$ computing
1205: $\mathrm{ODDMAXBIT}_{n}$ must have weight $2^{\Omega(n/d^{2})}$:
1206: 
1207: \begin{theorem} \label{thm:beigel}
1208: Let $p$ be a degree $d$ polynomial threshold function with integer
1209: coefficients computing
1210: $\mathrm{ODDMAXBIT}_{n}$. Then  
$w = 2^{\Omega(n/d^{2})}$ where $w$ is the weight of $p.$\footnote{Beigel actually proves something stronger, namely that there must exist a coefficient whose absolute value is at least $2^{\Omega(n/d^{2})}$.}
1212: \end{theorem}
1213: (As stated in \cite{Beigel:94} the bound is actually $w \geq
1214: {\frac 1 s}2^{\Omega(n/d^2)}$ where $s$ is the number of nonzero
1215: coefficients in $p$.  Since $s \leq w$ this implies the result
1216: as stated above.)
1217: 
1218: 
1219: A lower bound of $2^{\Omega(n)}$ 
1220: on the weight of any linear threshold function ($d=1$) for
1221: $\mathrm{ODDMAXBIT}_n$ has long been known \cite{MyhillKautz:61};
1222: Beigel's proof generalizes this
1223: lower bound to all $d = O(n^{1/2}).$  A matching upper bound
1224: of $2^{O(n)}$ on weight for $d=1$ has also long been known 
1225: \cite{MyhillKautz:61}.
1226: Our Theorem \ref{thm:mainptf} gives an upper bound 
1227: which matches Beigel's lower bound (up to
1228: logarithmic factors) for all $d = O(n^{1/3})$:
1229: \begin{observation}
1230: For any $d = O(n^{1/3})$ there is a polynomial threshold function of
1231: degree $d$ and weight $2^{\tilde{O}(n/d^{2})}$ 
1232: which computes $\mathrm{ODDMAXBIT}_{n}$. 
1233: \end{observation}
1234: \begin{proof}
1235: Set $d = h^{1/2} \log h$ in Theorem~\ref{thm:mainptf}.  
1236: The weight bound given by Theorem~\ref{thm:mainptf} 
1237: is $2^{O({\frac {n \log^2 d}{d^2}} + d \log d)}$
which is $2^{\tilde{O}(n/d^2)}$ for $d = O(n^{1/3}).$
1239: \end{proof} 
1240: 
1241: \medskip
1242: 
1243: Note that since the 
1244: $\mathrm{ODDMAXBIT}_{n}$ function has a polynomial size DNF
1245: (see Appendix \ref{ap:alt}), Beigel's lower bound gives a polynomial 
1246: size DNF $f$ such that any degree $\tilde{O}(n^{1/3})$ polynomial
1247: threshold function for $f$ must have weight
1248: $2^{\tilde{\Omega}(n^{1/3})}$.
1249: This suggests that the Expanded-Winnow algorithm cannot learn polynomial size
1250: DNF in $2^{\tilde{O}(n^{1/3})}$ time from
1251: $2^{n^{1/3 - \eps}}$ examples for any
1252: $\eps > 0,$ and thus suggests that improving the sample complexity
1253: of the DNF learning algorithm from \cite{KlivansServedio:01} while
1254: maintaining its $2^{\tilde{O}(n^{1/3})}$ running time may be difficult.
1255: 
1256: \section{Learning Parity Functions} \label{sec:parity}
1257: 
1258: We first briefly review the standard
1259: algorithm for learning parity functions.
1260: 
The standard algorithm works by viewing a
set of $m$ labelled examples as a system of $m$ linear equations over GF(2)
in unknowns $a_1,\dots,a_n \in \{0,1\}$, where $a_i = 1$ indicates that
$x_i$ appears in the parity.
Each labelled example $(x,b)$ induces the equation
$\sum_{i: x_i = 1} a_{i} = b \bmod 2.$
1265: Since the examples are labelled according to some parity function,
1266: this parity function will be a consistent solution to the 
1267: system of equations.  
1268: Using Gaussian elimination it is possible to efficiently find a 
1269: solution to the linear system, 
1270: which yields a parity function consistent with all $m$ examples.
1271: The following standard fact from learning theory
1272: (often referred to as ``Occam's Razor'') shows that finding
1273: a consistent hypothesis suffices to establish PAC learnability:
1274: 
1275: \begin{fact} \label{fact:OC}
Let $C$ be a concept class and $H$ a finite set of hypotheses. Set $m
= {\frac 1 \epsilon}(\log |H| + \log (1/\delta))$ where $\epsilon$ and $\delta$
1278: are the usual accuracy and confidence parameters for PAC learning.
1279: Suppose that there
1280: is an algorithm $A$ running in time $t$ which takes as input $m$
1281: examples which are labelled according to some element of $C$ and outputs a 
1282: hypothesis $h \in H$ consistent with these examples.  
1283: Then $A$ is a PAC learning algorithm for $C$ with running time $t$
1284: and sample complexity $m.$
1285: \end{fact}
1286: Consider using the above algorithm to learn an unknown 
1287: parity of length at most $k.$
1288: Even though there is a solution of weight at most $k$,
1289: Gaussian elimination (applied to a system of $m$ equations in $n$
1290: variables over GF(2)) may yield a solution of weight
1291: as large as $\min(m,n).$  
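Since the hypothesis may therefore be any of the $2^n$ parity functions,
applying Fact \ref{fact:OC} with $\log |H| = n$ yields only a sample
complexity bound of $O(n)$ examples (for constant $\epsilon$ and $\delta$)
for learning a parity of length at most $k.$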
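
To make the $O(n)$ bound concrete (a sketch with illustrative parameter
values of our own choosing, and assuming natural logarithms in
Fact \ref{fact:OC}): taking $H$ to be the set of all $2^n$ parity functions,
with $n = 1000$, $\epsilon = 0.1$ and $\delta = 0.01$, we get
\[
m = \frac{1}{\epsilon}\left(\log |H| + \log \frac{1}{\delta}\right)
= 10\,(1000 \ln 2 + \ln 100) \approx 10\,(693.1 + 4.6)
\approx 6.98 \times 10^{3},
\]
which is dominated by the $n \ln 2$ term and does not depend on the
length $k$ of the target parity.
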
1294: 
We now present
a simple polynomial-time algorithm for learning an unknown parity
function on at most $k$ variables using $O(n^{1-1/k} \log n)$ examples.
To the best of our knowledge this is the first improvement on the
standard algorithm and analysis given above.
1300: 
1301: \begin{theorem} \label{thm:mainparity}
1302: 
1303: The class of all parity functions on at most $k$ variables is
1304: learnable in polynomial time using $O(n^{1-1/k} \log n)$
1305: examples. The hypothesis output by the learning algorithm
1306: is a parity function on $O(n^{1-1/k}\log n)$ variables.
1307: 
1308: \end{theorem}
1309: 
1310: \begin{proof}
1311: If $k = \Omega(\log n)$ then the standard algorithm suffices to 
1312: prove the claimed bound.  We thus assume that $k = o(\log n)$.  
1313: 
1314: Let $H$ be the set of all parity functions of size at most $n^{1 - 1/k}$.
1315: Note that $|H| \leq n^{n^{1 - 1/k}}$ so
1316: $\log|H| \leq n^{1 - 1/k} \log n.$
1317: Consider the following
1318: algorithm:
1319: 
1320: \begin{enumerate}
1321: 
\item Choose $m = \frac{1}{\epsilon} \left(\log |H| + \log (1/\delta)\right)$
1323: examples. Express each example as a linear equation over $n$ variables
1324: mod $2$ as described above.
1325: 
1326: \item Randomly choose a set of $n - n^{1-1/k}$ variables and assign
1327: them the value $0$.
1328: 
1329: \item Use Gaussian elimination to attempt to solve the resulting system
1330: of equations on the remaining $n^{1 - 1/k}$ variables.
1331: If the system has a solution, output the corresponding parity
1332: (of size at most $n^{1 - 1/k}$) as the hypothesis.
1333: If the system has no solution, output ``FAIL.''
1334: 
1335: \end{enumerate}
1336: 
1337: If the simplified system of equations has a solution, 
1338: then by Fact \ref{fact:OC} this solution is a good hypothesis.  
1339: We will show that the simplified system has a solution with probability
1340: $\Omega(1/n)$.  The theorem
1341: follows  by repeating steps 2 and 3 of the above algorithm until
1342: a solution is found (an expected $O(n)$ repetitions will suffice).
1343: 
1344: Let $V$ be the set of $k$ relevant variables on which the unknown
1345: parity function depends. It is easy to see that as long as
1346: no variable in $V$ is assigned a 0,
1347: the resulting simplified system of equations will have a 
1348: solution.  
1349: Let $\ell = n^{1 - 1/k}.$
1350: The probability that in Step 2 the $n - \ell$ variables chosen
1351: do not include any variables in $V$ is exactly
1352: ${n - k \choose n - \ell} / {n \choose \ell}$
1353: which equals
1354: ${n - k \choose \ell - k} / {n \choose \ell}.$  Expanding
1355: binomial coefficients we have
\begin{equation} \label{eq:a}
{\frac {{n - k \choose \ell - k}}{{n \choose \ell}}} = 
\prod_{i=1}^{k} {\frac {\ell - k + i}{n -k + i}} 
> \left({\frac {\ell - k}{n - k}}\right)^k 
= 
\left({\frac \ell n}\right)^k 
\left({\frac {1 - {\frac k \ell}}{1 - {\frac k n}}}\right)^k
\geq
{\frac 1 n} \cdot 
\left[\left(1 - {\frac k \ell}\right)\left(1 + {\frac {k} n}\right)\right]^k,
\end{equation}
where the last step uses $(\ell/n)^k = 1/n$ and
$1/(1 - k/n) \geq 1 + k/n.$
Since $1 + {\frac k n} \geq 1$ we have
$\left(1 - {\frac k \ell}\right)\left(1 + {\frac {k} n}\right) > 
(1 - {\frac {3k} \ell}).$ Consequently, by Bernoulli's inequality,
(\ref{eq:a}) is at least
${\frac 1 n} \cdot \left(1 - {\frac {3k^2} {\ell}}\right).$
For $k \geq 2$ we have $\ell \geq n^{1/2}$, so the bound $k = o(\log n)$
implies $3k^2/\ell = o(1)$ and this quantity exceeds
${\frac 1 {2n}}$ for sufficiently large $n$; for $k = 1$ we have
$\ell = 1$ and the probability is exactly $1/n.$
In either case the theorem is proved.
1373: \end{proof}
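
As a numerical sanity check of the analysis above (the parameter values
are our own, chosen only for illustration, and logarithms are taken base
$2$): for $n = 10^6$ and $k = 3$ we have $\ell = n^{1-1/k} = 10^4$, so
$\log |H| \leq \ell \log n \approx 10^4 \cdot 19.93 \approx 2 \times 10^5$,
compared with $\log |H| = n = 10^6$ for the class of all parities.
The probability that a single execution of Step 2 leaves every variable
in $V$ unassigned is
\[
\prod_{i=1}^{3} \frac{\ell - 3 + i}{n - 3 + i}
= \frac{9998 \cdot 9999 \cdot 10000}{999998 \cdot 999999 \cdot 1000000}
\approx 0.9997 \times 10^{-6},
\]
which indeed exceeds $1/(2n) = 5 \times 10^{-7}$, so at most about
$10^6$ repetitions of Steps 2 and 3 are expected before a solution is
found.
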
1374: 
1375: 
1376: 
1377: 
1378: 
1379: 
1380: 
1381: \section{Future Work} \label{sec:future}
1382: 
1383: An obvious goal for future work is to improve our algorithmic results
1384: for learning decision lists.  The question still remains:  can
1385: decision lists of length $k$ be learned in poly$(n)$ time from
1386: poly$(k,\log n)$ examples?  As a first step, one might attempt to
1387: extend the tradeoffs we achieve:  is it possible to learn
1388: decision lists of length $k$ in $n^{k^{1/2}}$ time from
1389: poly$(k,\log n)$ examples?
1390: 
1391: Another goal is to extend our results for decision lists to broader
1392: concept classes.  In particular, since decision lists are a special
1393: case of linear threshold functions, it would be interesting to obtain analogues
1394: of our algorithmic 
1395: results for learning general linear threshold functions (independent of 
1396: their weight).  We note here that
1397: Goldmann {\em et al.} \cite{GHR:92} have given 
1398: a linear threshold function over $\{-1,1\}^n$ for
1399: which any polynomial threshold function must have weight
1400: $2^{\Omega(n^{1/2})}$ regardless of its degree.  Moreover
Krause and Pudl\'ak \cite{KrausePudlak:98} have shown that any Boolean
1402: function which has a polynomial threshold function over $\{0,1\}^n$ of weight 
1403: $w$ has a polynomial threshold function over $\{-1,1\}^n$ of weight
1404: $n^2w^4.$  These results imply that {\em representational} results akin
1405: to Theorem \ref{thm:ptf} for general linear threshold functions
1406: must be quantitatively weaker than Theorem \ref{thm:ptf};
1407: in particular, there is a linear threshold function over
1408: $\{0,1\}^n$ with $k$ nonzero coefficients for which
1409: {any} polynomial threshold function, regardless of degree, must have 
1410: weight $2^{\Omega(k^{1/2})}.$
1411: 
1412: For parity functions, one challenge is to
1413: learn parity functions on $k = \Theta(\log n)$ variables in polynomial time
1414: using a sublinear number of examples.  Another challenge is to improve
the sample complexity of learning size-$k$ parities from our
current bound of $O(n^{1 - 1/k} \log n).$
1417: 
1418: \ignore{
1419: 
1420: Decision lists can be viewed as a special case of linear threshold
1421: functions. For example, the alternating decision list (or
1422: $\mathrm{ODDMAXBIT}_{n}$ function) is equal to the sign of $h =
1423: \sum_{i=1}^{n} (-1)^{i} 2^{i}x_{i}$. The lower bound on the
1424: $\mathrm{ODDMAXBIT}_{n}$ function due to Beigel shows that for an
1425: arbitrary linear threshold function, we cannot construct polynomial
1426: threshold functions of degree $d$ and weight $2^{o(n/d^{2})}.$
1427: 
1428: Here we observe that this lower bound on the weight and degree of
1429: polynomial threshold functions computing general linear threshold
1430: functions can be strengthened due to a result by Goldmann, Hastad, and
1431: Razborov:
1432: 
1433: \begin{theorem} \cite{GHR:92}
1434: There exists a linear threshold function $U$ defined on $4n^{2}$
1435: variables such that if $U$ is written as a threshold of monomials then
1436: the total weight of the threshold is $\Omega(2^{(n/2)} / \sqrt{n})$.
1437: \end{theorem} 
1438: 
1439: \noindent The linear threshold function $U$ is the so-called Universal
1440: Halfspace defined as follows:
1441: 
1442: \[ U_{n,m} = \sum_{i=1}^{n} \sum_{j=1}^{m} 2^{i}x_{ij}. \]
1443: 
1444: From this we conclude that to learn an arbitrary linear threshold
1445: function on $n$ variables, V-Winnow will require
1446: $\Omega(2^{\sqrt{n}})$ samples and time $\Omega(n^{\sqrt{n}})$. This
1447: stands in contrast to the sample complexity and time complexity bounds
1448: for learning decision lists.
1449: }
1450: 
1451: \section{Acknowledgements} We thank Les Valiant for his observation
1452: that Claim \ref{cla:outer} can be reinterpreted in terms of polynomial
1453: threshold functions.  
1454: We thank Jean Kwon for suggesting the Chebychev polynomial.
1455: 
1456: \bibliographystyle{plain} 
1457: \bibliography{allrefs}
1458: 
1459: \appendix
1460: 
1461: \section{Alternate Proof of Corollary \ref{cor:outer}} \label{ap:alt}
1462: The alternate proof of Corollary \ref{cor:outer} is based on the
1463: observation that any decision list $L = 
(\ell_1,b_1),\dots,$ $(\ell_k,b_k),b_{k+1}$ of length $k$ has a
DNF with at most $k+1$ terms in which each term is a conjunction of at most
$k$ literals.  To see this, note that we obtain a DNF
for $L$ simply by taking the OR of all terms
$\overline{\ell}_1\overline{\ell}_2 \dots \overline{\ell}_{i-1}\ell_i$
for each $i \leq k$ such that $b_i = 1$, together with the term
$\overline{\ell}_1 \dots \overline{\ell}_k$ if $b_{k+1} = 1.$
Now we use the following result
1470: from \cite{KlivansServedio:01}:
1471: \begin{theorem} [Corollary 12 of \cite{KlivansServedio:01}]
1472: Let $f$ be a DNF formula of $s$ terms, each of length at most $t.$
1473: Then there is a polynomial threshold function for $f$ of degree
1474: $O(\sqrt{t}\log s)$ and weight $t^{O(\sqrt{t}\log s)}.$
1475: \end{theorem}
1476: Applying this result to the DNF representation for $L,$ we immediately
1477: obtain that there is a polynomial threshold function for $L$
1478: which has degree $O(k^{1/2} \log k)$ and weight
1479: $2^{O(k^{1/2} \log^2 k)}.$  (In Section \ref{subsec:inner}, though,
1480: we need the construction given in our original proof of
1481: Corollary \ref{cor:outer}.)
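
For a concrete instance of this construction (the particular list is our
own example), consider the decision list
$L = (x_1, 1), (\overline{x}_2, 0), (x_3, 1), 0$ of length $k = 3$.
Only the outputs $b_1$ and $b_3$ equal $1$, so the corresponding DNF is
\[
x_1 \;\vee\; \overline{x}_1 x_2 x_3 ,
\]
where the second term is $\overline{\ell}_1 \overline{\ell}_2 \ell_3$ with
$\ell_2 = \overline{x}_2$.  In general the DNF has $s = O(k)$ terms, each
of length $t \leq k$, so the bounds of the theorem above become degree
$O(\sqrt{k} \log k)$ and weight
$k^{O(\sqrt{k}\log k)} = 2^{O(\sqrt{k} \log^2 k)}.$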
1482: 
1483: \end{document}
1484: 
1485: 
1486: 
1487: 
1488: 
1489: 
1490: