1: \documentstyle[fullpage,11pt]{article}
2: \newtheorem{theorem}{Theorem}
3: \newtheorem{lemma}[theorem]{Lemma}
4: \newtheorem{exercise}[theorem]{Exercise}
5: \newtheorem{proposition}[theorem]{Proposition}
6: \newtheorem{claim}[theorem]{Claim}
7: \newtheorem{corollary}[theorem]{Corollary}
8: \newtheorem{observation}[theorem]{Observation}
9: \newtheorem{definition}{Definition}
10: \newcommand{\proofspace}{\vspace{.15in}}
11: \newcommand{\ital}[1]{{\/\em #1\/}}
12: \newcommand{\word}[1]{\mbox{\rm #1}}
13: \newcommand{\angles}[2]{\langle #1,#2 \rangle}
14: \newcommand{\andd}{\wedge}
15: \newcommand{\Ftwo}{\word{\bf F}_2}
16: \newcommand{\inv}[1]{\frac{1}{#1}}
17: \newcommand{\Fpow}[1]{\Ftwo^{#1}}
18: \newcommand{\onehalf}{{\textstyle \frac{1}{2}}}
19: \newcommand{\pr}[2]{\word{Pr}_{#1}\left[#2\right]}
20: \newcommand{\booln}{\{0,1\}^n}
21: \newcommand{\bool}{\{0,1\}}
22: \newcommand{\vv}{\vec{v}}
23: \newcommand{\vx}{\vec{x}}
24: \newcommand{\xx}{x}
25: \newcommand{\comment}[1]{}
26: \newcommand{\cc}{c}
27: \newcommand{\DD}{{\cal D}}
28: \newcommand{\qq}{Q}
29:
30: \begin{document}
31: \bibliographystyle{abbrv}
32: %\title{On Noise-Tolerant learning and the Statistical Query model}
33: \title{Noise-Tolerant Learning, the Parity Problem,\\and the
34: Statistical Query Model}
35: %\title{A new algorithm for noise-tolerant learning, and extensions of
36: %the statistical query model}
37: \author{Avrim Blum \and Adam Kalai \and Hal Wasserman}
38: \date{School of Computer Science\\Carnegie Mellon University}
39: \date{\today}
40: \maketitle
41:
42: \begin{abstract}
43: We describe a slightly sub-exponential time algorithm for learning
44: parity functions in the presence of random classification noise.
45: This results in a polynomial-time algorithm for the case
46: of parity functions that depend on only the first $O(\log n \log\log
47: n)$ bits of input. This is the first known instance of an efficient
48: noise-tolerant algorithm for a concept class that is provably
49: not learnable in the Statistical Query model of Kearns
50: \cite{Kearns93}. Thus, we demonstrate that the set of problems
51: learnable in the statistical query model is a strict subset of those
52: problems learnable in the presence of noise in the PAC model.
53:
54: In coding-theory terms, what we give is a poly$(n)$-time algorithm for
55: decoding linear $k\times n$ codes in the presence of random noise for
56: the case of $k = c\log n
57: \log\log n$ for some $c > 0$. (The case of $k = O(\log n)$ is trivial
58: since one can just individually check each of the $2^k$ possible
59: messages and choose the one that yields the closest codeword.)
60:
61: A natural extension of the statistical query model is to allow queries
62: about statistical properties that involve $t$-tuples of examples (as
63: opposed to single examples). The second result of this paper is
64: to show that any class of functions learnable (strongly or weakly)
65: with $t$-wise queries for $t = O(\log n)$ is also weakly learnable
66: with standard unary queries. Hence this natural
67: extension to the statistical query model does not increase the set of
68: weakly learnable functions.
69: \end{abstract}
70:
71: \section{Introduction}
72: An important question in the study of machine learning is:
73: ``What kinds of functions can be learned efficiently from
74: noisy, imperfect data?'' The statistical query (SQ) framework of
75: Kearns \cite{Kearns93} was designed as a useful, elegant model for
76: addressing this issue.
77: The SQ model provides a restricted interface between a
78: learning algorithm and its data, and has the property that any
79: algorithm for learning in the SQ model can automatically be converted
80: to an algorithm for learning in the presence of \ital{random
81: classification noise} in the standard PAC model. (This result has
82: been extended to more general forms of noise as well
83: \cite{Decatur93,Decatur96}.) The importance of the Statistical Query model is attested to by the fact
84: that before its introduction, there were only a few provably
85: noise-tolerant learning algorithms, whereas now it is recognized that
86: a large number of
87: learning algorithms can be formulated as SQ algorithms, and
88: hence can be made noise-tolerant.
89:
90: The importance of the SQ model has led to the open question of whether
91: examples exist of problems learnable with random classification noise
92: in the PAC model but not learnable by statistical queries. This is
93: especially interesting because one can characterize
94: information-theoretically (i.e., without complexity assumptions) what
95: kinds of problems can be learned in the SQ model
96: \cite{BFJKMR94}. For example, the class of parity functions, which
97: {\em can} be learned efficiently from {\em non}-noisy data in the PAC
98: model, provably cannot be learned efficiently in the SQ model under
99: the uniform distribution. Unfortunately, there is also no known efficient
100: non-SQ algorithm for learning them in the presence of noise
101: (this is closely related to the classic coding-theory problem of
102: decoding random linear codes).
103:
104: In this paper, we describe a polynomial-time algorithm for learning
105: the class of parity functions that depend on only the first $O(\log n
106: \log\log n)$ bits of input, in the presence of random
107: classification noise (of a constant noise rate). This class
108: provably cannot be learned in the SQ model, and thus is the first
109: known example of a concept class learnable with noise but not via
110: statistical queries. Our algorithm has recently been shown to have
111: applications to the problem of determining the shortest lattice vector
112: length \cite{KS01} and to various other analyses of statistical queries
113: \cite{Jackson00}.
114:
115: An equivalent way of stating this result is that we are given a random
116: $k \times n$ boolean matrix $A$, as well as an $n$-bit vector $\tilde{y}$
117: produced by multiplying $A$ by an (unknown) $k$-bit
118: message $x$, and then corrupting each bit of the resulting
119: codeword $y = xA$ with probability $\eta < 1/2$. Our goal is to
120: recover $y$ in time poly$(n)$. For this problem, the case of $k =
121: O(\log n)$ is trivial because one could simply try each of the
122: $2^k$ possible messages and output the nearest
123: codeword found. Our algorithm works for $k = c\log n\log\log n$ for some
124: $c > 0$. The algorithm does not actually need $A$ to be random, so
125: long as the noise is random and there is no other codeword within
126: distance $o(n)$ from the true codeword $y$.
127:
128: Our algorithm can also be viewed as a slightly sub-exponential time
129: algorithm for learning arbitrary parity functions in the presence of
130: noise. For this problem, the brute-force algorithm
131: would draw $O(n)$ labeled examples, and then search through all $2^n$
132: parity functions to find the one of least empirical error. (A
133: standard argument can be used to say that with high probability, the
134: correct function will have the lowest empirical error.) In contrast,
135: our algorithm runs in time $2^{O(n/\log n)}$, though it also requires
136: $2^{O(n/\log n)}$ labeled examples. This improvement is small but
137: nonetheless sufficient to achieve the desired separation result.
138:
139: The second result of this paper concerns a $k$-wise version of the
140: Statistical Query model. In the standard version, algorithms may only
141: ask about statistical properties of single examples. (E.g., what is
142: the probability that a random example is labeled positive and has its
143: first bit equal to 1?) In the $k$-wise version, algorithms may ask
144: about properties of $k$-tuples of examples. (E.g., what is the
145: probability that two random examples have an even dot-product and have
146: the same label?) Given the first result of this paper, it is natural
147: to ask whether allowing $k$-wise queries, for some small value of $k$,
148: might increase the set of SQ-learnable functions. What we show is
149: that for $k=O(\log n)$, any concept class learnable
150: from $k$-wise queries is also (weakly) learnable from unary queries.
151: Thus the seeming generalization of the SQ model to allow for $O(\log n)$-wise
152: queries does not close the gap we have demonstrated between what is
153: efficiently learnable in the SQ and noisy-PAC models. Note that this
154: result is the best possible with respect to $k$ because the
155: results of
156: \cite{BFJKMR94} imply that for $k = \omega(\log n)$, there are concept
157: classes learnable from $k$-wise queries but not unary queries. On the other
158: hand, $\omega(\log n)$-wise queries are in a sense less interesting because it
159: is not clear whether they can in general be simulated in the presence of noise.
160: %%(Though perhaps it might be possible to generalize the first result in
161: %%this paper to do so!)
162:
163: \subsection{Main ideas}
164: The standard way to learn parity functions without noise is based on
165: the fact that if an example can be written as a sum (mod 2) of
166: previously-seen examples, then its label must be the sum (mod 2) of
167: those examples' labels. So, once one has found a basis, one can
168: use that to deduce the label of {\em any} new example (or,
169: equivalently, use Gaussian elimination to produce the target function
170: itself).
171:
172: In the presence of noise, this method breaks down. If the original
173: data had noise rate $1/4$, say, then the sum of $s$ labels
174: has noise rate $1/2 - (1/2)^{s+1}$. This means we can add
175: together only $O(\log n)$ examples if we want the resulting sum to be
176: correct with probability $1/2 + 1/\word{poly}(n)$.  Thus, if we want to use
177: this kind of approach, we need some way to write
178: a new test example as a sum of only a {\em small
179: number} of training examples.
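
As a rough illustration of this constraint (a sketch using the formula of Lemma~\ref{sumOK} with $\eta = 1/4$; the constant $c$ is introduced here only for the example), requiring the sum of $s$ labels to retain advantage at least $n^{-c}$ over random guessing gives
$$\onehalf(1-2\eta)^s = 2^{-(s+1)} \geq n^{-c} \quad\Longleftrightarrow\quad s \leq c\lg n - 1.$$
For instance, with $n = 2^{20}$ and $c = 2$ one could afford to add together at most $39$ examples.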
180:
181: Let us now consider the case of parity functions that depend on only
182: the first $k = \log n \log\log n$ bits of input. Equivalently, we can
183: think of all examples as having the remaining $n-k$ bits equal to 0.
184: Gaussian elimination will in this case allow us to write our test
185: example as a sum of $k$ training examples, which is too many. Our
186: algorithm will instead write it as a sum of $k/\log k = O(\log n)$
187: examples, which gives us the desired noticeable bias (that can then be
188: amplified).
189:
190: Notice that if we have seen $\word{poly}(n)$ training examples (and, say,
191: each one was chosen uniformly at random), we can argue existentially
192: that for $k = \log n \log\log n$, one should be able to write any new
193: example as a sum of just
194: $O(\log\log n)$ training examples, since there are $n^{O(\log \log n)}
195: \gg 2^k$ subsets of this size (and the subsets are pairwise
196: independent). So, while our algorithm is finding a smaller subset
197: than Gaussian elimination, it is not doing the best possible.
198: If one {\em could} achieve, say, a constant-factor
199: approximation to the problem ``given a set of vectors, find the
200: smallest subset that sums to a given target vector'' then this would
201: yield an algorithm to efficiently learn the class of parity functions
202: that depend on the first $k = O(\log^2 n)$ bits of input.
203: Equivalently, this would allow one to learn parity functions over $n$ bits
204: in time $2^{O(\sqrt{n})}$, compared to the $2^{O(n/\log n)}$ time of
205: our algorithm.
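
The counting behind this existential argument can be sketched as follows. This is a back-of-the-envelope calculation rather than a proof; the symbols $m$, $r$, and $c$ are introduced here for illustration, with $m = n^c$ training examples, $k = \lg n \lg\lg n$, and rounding ignored. The number of subsets of size $r$ satisfies
$${m \choose r} \geq \left(\frac{m}{r}\right)^r = 2^{r(c\lg n - \lg r)},$$
and choosing $r = (2/c)\lg\lg n$ makes the exponent equal to $2\lg n\lg\lg n - r\lg r$, which exceeds $k = \lg n\lg\lg n$ for all sufficiently large $n$. Hence there are many more subsets of size $r$ than the $2^k$ possible values of their sum.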
206:
207: \comment{
208: Similarly, if $k =$ So,
209: if we could algorithmically {\em find} the
210: smallest subset of training examples that sums to our test
211: example, we would have a noticeable bias and (essentially) be done.
212:
213: Unfortunately, it seems difficult to efficiently find the smallest
214: subset so we cannot do quite this well. Instead, we give a weak
215: approximation. Specifically, for $k = \log n
216: \log\log n$, the existential argument tells us there should exist
217: a subset of size $O(\log \log n)$ that sums to our test example; our
218: algorithm will, in this case, find a subset of size $O(\log n)$. This
219: is a lot larger than optimal, but still better than Gaussian
220: elimination (which finds $O(\log n \log \log n)$ examples) and is
221: sufficient for our result.
222: }
223:
224: \section{Definitions and Preliminaries}
225:
226: A \ital{concept} is a boolean function on an \ital{input space}, which
227: in this paper will generally be $\booln$. A \ital{concept class} is a
228: set of concepts. We will be considering the problem of learning a
229: target concept in the presence of \ital{random classification noise}
230: \cite{AngluinLa88}. In this model, there is some fixed (known or
231: unknown) noise rate $\eta < 1/2$, a fixed (known or unknown)
232: probability distribution $\DD$ over $\booln$, and an unknown target
233: concept $c$. The learning algorithm may repeatedly ``press a button''
234: to request a labeled example. When it does so, it receives a pair
235: $(\xx, \ell)$, where $\xx$ is chosen from $\booln$ according to $\DD$
236: and $\ell$ is the value $c(\xx)$, but ``flipped'' with probability
237: $\eta$. (I.e., $\ell = c(\xx)$ with probability $1-\eta$, and
238: $\ell = 1-c(\xx)$ with probability $\eta$.) The goal of the learning
239: algorithm is to find an
240: \ital{$\epsilon$-approximation} of $c$: that is, a hypothesis
241: function $h$ such that $\Pr_{\xx \leftarrow \DD}[h(\xx) = c(\xx)] \geq
242: 1-\epsilon$.
243:
244: We say that a concept class $C$ is \ital{efficiently learnable in the
245: presence of random classification noise} under distribution $\DD$ if
246: there exists an algorithm ${\cal A}$ such that for any $\epsilon>0,
247: \delta>0, \eta < 1/2$, and any target concept $c \in C$, the algorithm
248: ${\cal A}$ with probability at least $1-\delta$ produces an
249: $\epsilon$-approximation of $c$ when given access to $\DD$-random examples
250: which have been labeled by $c$ and corrupted by noise of rate
251: $\eta$. Furthermore, ${\cal A}$ must run in time polynomial in $n$,
252: $1/\epsilon$, and $1/\delta$.\footnote{Normally, one would also
253: require polynomial dependence on $1/(1/2 -\eta)$ --- in part because
254: normally this is easy to achieve (e.g., it is achieved by any
255: statistical query algorithm). Our algorithms run in polynomial
256: time for any \ital{fixed} $\eta < 1/2$, but have a
257: super-polynomial dependence on $1/(1/2 - \eta)$.}
258:
259:
260:
261: A \ital{parity function} $c$ is defined by a corresponding vector $c
262: \in \booln$; the parity function is then given by the rule $c(x) = x
263: \cdot c \!\!\! \pmod{2}$. We say that $c$ \ital{depends on only the first
264: $k$ bits of input} if all nonzero components of $c$ lie in its
265: first $k$ bits. So, in particular, there are $2^k$ distinct parity
266: functions that depend on only the first $k$ bits of input. Parity
267: functions are especially interesting to consider under the uniform
268: distribution $\DD$, because under that distribution parity functions
269: are pairwise uncorrelated.
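
To see why, here is a short derivation (a standard calculation, written out in the notation above). For distinct $c_1, c_2 \in \booln$, the vector $c_1 + c_2$ (mod 2) is nonzero, so for uniformly random $x$ the bit $x \cdot (c_1 + c_2)$ is uniformly distributed, and
$$\Pr_{x}[c_1(x) = c_2(x)] = \Pr_{x}[x \cdot (c_1 + c_2) \equiv 0 \!\!\! \pmod{2}] = \onehalf.$$
Thus two distinct parity functions agree on exactly half of all inputs.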
270:
271: \subsection{The Statistical Query model}
272: The Statistical Query (SQ) model can be viewed as providing a
273: restricted interface between the learning algorithm and the source of
274: labeled examples. In this model, the learning algorithm may only
275: receive information about the target concept through \ital{statistical
276: queries}. A statistical query is a query about some property $\qq$
277: of labeled examples (e.g., that the first two bits are equal and the label is
278: positive), along with a tolerance parameter $\tau \in
279: [0,1]$. When the algorithm asks a statistical query $(\qq,\tau)$, it
280: is asking for the probability that predicate $\qq$ holds true for a
281: random correctly-labeled example, and it receives an approximation of this
282: probability up to $\pm \tau$. In other words, the algorithm receives a
283: response $\hat{P}_{\qq} \in [P_{\qq}-\tau, P_{\qq}+\tau]$, where
284: %
285: $P_{\qq} = \Pr_{x \leftarrow \DD}[\qq(x,c(x))]$.
286: %
287: We also require each
288: query $\qq$ to be polynomially evaluable (that is, given $(x,\ell)$, we
289: can compute $\qq(x,\ell)$ in polynomial time).
290:
291: Notice that a statistical query can be simulated by drawing a large
292: sample of data and computing an empirical average, where the size of
293: the sample would be roughly $O(1/\tau^2)$ if we wanted to assure an
294: accuracy of $\tau$ with high probability.
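
As a sketch of where the $O(1/\tau^2)$ estimate comes from (using Hoeffding's inequality; the sample size $m$, the failure probability $\delta$, and the symbol $\tilde{P}_{\qq}$ are introduced here), let $\tilde{P}_{\qq}$ be the fraction of $m$ independent correctly-labeled examples that satisfy $\qq$. Then
$$\Pr\left[|\tilde{P}_{\qq} - P_{\qq}| > \tau\right] \leq 2e^{-2m\tau^2},$$
which is at most $\delta$ once $m \geq \ln(2/\delta)/(2\tau^2)$.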
295:
296: A concept class $C$ is \ital{learnable from statistical queries} with
297: respect to distribution $\DD$ if there is a learning algorithm ${\cal
298: A}$ such that for any $c \in C$ and any $\epsilon>0$, ${\cal A}$ produces an
299: $\epsilon$-approximation of $c$ from statistical queries; furthermore,
300: the running time, the number of queries asked, and the inverse of the
301: smallest tolerance used must be polynomial in $n$ and $1/\epsilon$.
302:
303:
304: We will also want to talk about \ital{weak learning.} An algorithm
305: ${\cal A}$ weakly learns a concept class $C$ if for any $c \in C$ and
306: for \ital{some} $\epsilon < 1/2 - 1/\word{poly}(n)$, ${\cal A}$ produces an
307: $\epsilon$-approximation of $c$. That is, an algorithm weakly learns if
308: it can do noticeably better than guessing.
309:
310: The statistical query model is defined with respect to non-noisy data.
311: However, statistical queries can be simulated from data corrupted by
312: random classification noise \cite{Kearns93}. Thus, any concept class
313: learnable from statistical queries is also PAC-learnable in the
314: presence of random classification noise.
315: There are several variants of the formulation given above that
316: improve the efficiency of the simulation \cite{AslamDe93,AslamDe98},
317: but they are all polynomially related.
318:
319: One technical point: we have
320: defined statistical query learnability in the ``known distribution''
321: setting (algorithm ${\cal A}$ knows distribution $\DD$); in the
322: ``unknown distribution'' setting, ${\cal A}$ is allowed
323: to ask for random unlabeled examples from the distribution $\DD$\@.
324: This prevents certain trivial exclusions from what is learnable from
325: statistical queries.
326:
327:
328: \subsection{An information-theoretic characterization}\label{sec:info}
329:
330: Blum et al.~\cite{BFJKMR94} prove that any concept class containing more than
331: polynomially many pairwise uncorrelated functions
332: cannot be learned even weakly in the statistical query model.
333: Specifically, they show the following.
334:
335: \begin{definition} (Def.~2 of \cite{BFJKMR94})
336: For concept class $C$ and distribution $\DD$, the
337: \ital{statistical query dimension} SQ-DIM$(C,\DD)$ is the largest
338: number $d$ such that $C$ contains $d$ concepts $c_1, \ldots, c_d$ that
339: are nearly pairwise uncorrelated: specifically, for all $i\neq j$,
340: $$\left|\Pr_{x \leftarrow \DD}[c_i(x) = c_j(x)] - \Pr_{x \leftarrow
341: \DD}[c_i(x) \neq c_j(x)]\right| \leq 1/d^3.$$
342: \end{definition}
343:
344: \begin{theorem} (Thm.~12 of \cite{BFJKMR94}) In order to learn $C$ to error
345: less than $1/2 - 1/d^3$ in the SQ model, where $d = $ SQ-DIM$(C,\DD)$,
346: either the number of queries or $1/\tau$ must be at least $\frac{1}{2}d^{1/3}$.
347: \end{theorem}
348:
349: Note that the class of parity functions over $\booln$ that depend on
350: only the first $O(\log n\log\log n)$ bits of input contains
351: $n^{O(\log \log n)}$ functions, all pairs of which are uncorrelated
352: with respect to the uniform distribution. Thus, this class cannot be
353: learned (even weakly) in the SQ model with polynomially many queries
354: of $1/\word{poly}(n)$ tolerance. But we will now show that there
355: nevertheless exists a polynomial-time PAC-algorithm for learning this
356: class in the presence of random classification noise.
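
For a concrete sense of this lower bound (a worked instance, assuming $k = \lg n\lg\lg n$ exactly), the class consists of $d = 2^k = n^{\lg\lg n}$ pairwise uncorrelated parity functions, so the theorem above requires either the number of queries or $1/\tau$ to be at least
$$\frac{1}{2}d^{1/3} = \frac{1}{2}n^{(\lg\lg n)/3},$$
which grows faster than any fixed polynomial in $n$.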
357:
358: \section{Learning Parity with Noise}
359: \subsection{Learning over the uniform distribution}
360:
361: For ease of notation, we use the ``length-$k$ parity problem'' to
362: denote the problem of learning a parity function over $\bool^k$, under
363: the uniform distribution, in the presence of random classification
364: noise of rate $\eta$.
365:
366:
367: \begin{theorem}
368: \label{maintheorem}
369: The length-$k$ parity problem, for
370: noise rate $\eta$ equal to any constant less than $1/2$, can be solved
371: with number of samples and total computation-time $2^{O(k/\log k)}$.
372: \end{theorem}
373:
374: Thus, in the presence of noise we can learn parity functions over
375: $\{0,1\}^n$ in time and sample size $2^{O(n/\log n)}$, and we can
376: learn parity functions over $\{0,1\}^n$ that only depend on the first
377: $k = O(\log n\log\log n)$ bits of the input in time and sample size
378: $\word{poly}(n)$.
379:
380: We begin our proof of Theorem~\ref{maintheorem} with a simple lemma about
381: how noise becomes amplified when examples are added together. For
382: convenience, if $x_1$ and $x_2$ are examples, we let $x_1 + x_2$
383: denote the vector sum mod 2; similarly, if $\ell_1$ and $\ell_2$ are
384: labels, we let $\ell_1+\ell_2$ denote their sum mod 2.
385:
386: \begin{lemma}
387: \label{sumOK}
388: Let $(x_1, \ell_1), \ldots, (x_s, \ell_s)$ be examples labeled
389: by $c$ and corrupted by random noise of rate $\eta$. Then
390: $\ell_1 + \cdots + \ell_s$ is the correct value of $(x_1 +
391: \cdots + x_s) \cdot c$ with probability $\onehalf + \onehalf(1-2\eta)^s$.
392: \end{lemma}
393:
394: \smallskip
395: \noindent
396: {\bf Proof.}
397: The lemma is clearly true when $s=1$.  Now assume that it is true for
398: $s-1$. Then the probability that $\ell_1 + \cdots + \ell_s =
399: (x_1 + \cdots + x_s) \cdot c$ is
400: $$(1-\eta)(\onehalf + \onehalf(1-2\eta)^{s-1}) + \eta(\onehalf -
401: \onehalf(1-2\eta)^{s-1}) = \onehalf + \onehalf(1-2\eta)^s.$$
402: The lemma then follows by induction.
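
An equivalent way to see this, sketched here as a cross-check rather than as part of the proof above, is to let $e_i \in \bool$ indicate whether label $\ell_i$ was flipped (the symbols $e_i$ are introduced only for this calculation). Each $e_i$ satisfies ${\bf E}[(-1)^{e_i}] = (1-\eta) - \eta = 1-2\eta$, and the $e_i$ are independent, so
$${\bf E}\left[(-1)^{e_1 + \cdots + e_s}\right] = (1-2\eta)^s.$$
Since this expectation equals $\Pr[\mbox{even number of flips}] - \Pr[\mbox{odd number of flips}]$, and the two probabilities sum to 1, the sum of the labels is correct with probability $\onehalf + \onehalf(1-2\eta)^s$.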
403:
404:
405: The idea for the algorithm is that by drawing many more examples than
406: the minimum needed to learn information-theoretically, we will be able
407: to write basis vectors such as $(1,0,\ldots,0)$ as the sum of a
408: relatively small number of training examples --- substantially smaller
409: than the number that would result from straightforward Gaussian
410: elimination. In particular, for the length $O(\log n \log\log n)$
411: parity problem, we will be able to write $(1,0,\ldots,0)$ as the sum
412: of only $O(\log n)$ examples. By Lemma \ref{sumOK}, this means that,
413: for any constant noise rate $\eta < 1/2$, the corresponding sum of
414: labels will be polynomially distinguishable from random. Hence, by
415: repeating this process as needed to boost reliability, we may
416: determine the correct label for $(1,0,\ldots,0)$, which is
417: equivalently the first bit of the target vector $c$. This process can
418: be further repeated to determine the remaining bits of $c$, allowing
419: us to recover the entire target concept with high probability.
420:
421: To describe the algorithm for the length-$k$ parity problem, it will
422: be convenient to view each example as consisting of $a$
423: blocks, each $b$ bits long (so, $k = ab$) where $a$ and $b$ will be
424: chosen later. We then introduce the following notation.
425:
426: \begin{definition}
427: Let $V_i$ be the subspace of $\bool^{ab}$ consisting of those
428: vectors whose last $i$ blocks have all bits equal to zero. An
429: \ital{$i$-sample} of size $s$ is a set of $s$ vectors
430: independently and uniformly distributed over $V_i$.
431: \end{definition}
432: The goal of our algorithm will be to use labeled examples from
433: $\bool^{ab}$ (these form a $0$-sample) to create an $i$-sample such
434: that each vector in the $i$-sample can be written as a sum of at most
435: $2^i$ of the original examples, for all $i=1,2,\ldots, a-1$. We
436: attain this goal via the following lemma.
437:
438: \begin{lemma}
439: \label{sampling}
440: Assume we are given an $i$-sample of size $s$. We can in time
441: $O(s)$ construct an $(i+1)$-sample of size at least $s -
442: 2^b$ such that each vector in the $(i+1)$-sample is written as the sum
443: of two vectors in the given $i$-sample.
444: \end{lemma}
445:
446: \smallskip
447: \noindent
448: {\bf Proof.}
449: Let the $i$-sample be $x_1, \ldots, x_s$. In these vectors, blocks
450: $a-i+1, \ldots, a$ are all zero. Partition $x_1, \ldots, x_s$ based
451: on their values in block $a-i$. This results in a partition having at
452: most $2^b$ classes. From each nonempty class $p$, pick one vector
453: $x_{j_p}$ at random and add it to each of the other vectors in its
454: class; then discard $x_{j_p}$. The result is a collection of vectors
455: $u_1, \ldots, u_{s'}$, where $s' \geq s - 2^b$ (since we discard at most
456: one vector per class).
457:
458: What can we say about ${u}_1, \ldots, {u}_{s'}$? First of all, each ${u}_j$ is
459: formed by summing two vectors in $V_i$ which have
460: identical components throughout block $a-i$, ``zeroing out'' that
461: block. Therefore, ${u}_j$ is in $V_{i+1}$. Secondly, each
462: $u_j$ is formed by taking some $x_{j_p}$ and adding to it
463: a random vector in $V_i$, subject only to the condition that the random
464: vector agrees with $x_{j_p}$ on block $a-i$. Therefore, each $u_j$ is
465: an independent, uniform-random member of $V_{i+1}$. The vectors $u_1,
466: \ldots, u_{s'}$ thus form the desired $(i+1)$-sample.
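
A toy instance may help make the construction concrete. The following is an illustrative sketch with made-up vectors, taking $a = 2$ blocks of $b = 2$ bits each and $i = 0$, so that we partition on block $a - i = 2$ (a bar separates the two blocks):
$$x_1 = 10|01, \quad x_2 = 11|01, \quad x_3 = 00|10, \quad x_4 = 01|10, \quad x_5 = 11|11.$$
The classes are $\{x_1, x_2\}$, $\{x_3, x_4\}$, and $\{x_5\}$. Supposing that $x_1$, $x_3$, and $x_5$ are the vectors picked from their classes, we obtain
$$u_1 = x_1 + x_2 = 01|00, \qquad u_2 = x_3 + x_4 = 01|00,$$
and the three picked vectors are discarded. Both $u_1$ and $u_2$ lie in $V_1$, and $s' = 2 \geq s - 2^b = 1$.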
467:
468: Using this lemma, we can now prove our main theorem.
469:
470: \smallskip
471: \noindent
472: {\bf Proof of Theorem \ref{maintheorem}.}
473: Draw $a2^b$ labeled examples. Observe that these qualify
474: as a $0$-sample. Now apply Lemma~\ref{sampling}, $a-1$ times, to
475: construct an $(a-1)$-sample. This $(a-1)$-sample will have size at
476: least $2^b$. Recall that the vectors in an $(a-1)$-sample are
477: distributed independently and uniformly at random over $V_{a-1}$, and
478: notice that $V_{a-1}$ contains only $2^b$ distinct vectors, one of
479: which is $(1,0,\ldots,0)$. Hence there is an approximately $1-1/e$
480: chance that $(1,0,\ldots,0)$ appears in our $(a-1)$-sample. If this
481: does not occur, we repeat the above process with new labeled examples.
482: Note that the expected number of repetitions is only constant.
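
The $1 - 1/e$ estimate can be checked as follows (a sketch assuming the $(a-1)$-sample has exactly $2^b$ elements; a larger sample only helps). Each element equals $(1,0,\ldots,0)$ with probability $2^{-b}$, independently of the others, so
$$\Pr[(1,0,\ldots,0) \mbox{ does not appear}] = (1 - 2^{-b})^{2^b} \leq e^{-1},$$
and the expected number of repetitions is at most $1/(1 - e^{-1}) < 1.6$.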
483:
484: Now, unrolling our applications of Lemma \ref{sampling}, observe that we
485: have written the vector $(1,0,\ldots,0)$ as the sum of $2^{a-1}$
486: of our labeled examples --- and we have done so without examining
487: their labels. Thus the label noise is still random, and we can
488: apply Lemma~\ref{sumOK}. Hence the sum of the labels gives us the
489: correct value of $(1,0,\ldots,0) \cdot c$ with probability
490: $\onehalf + \onehalf(1-2\eta)^{2^{a-1}}$.
491:
492: This means that if we repeat the above process using new labeled
493: examples each time for poly$((\inv{1-2\eta})^{2^a}, b)$ times, we can
494: determine $(1,0,\ldots,0) \cdot c$ with probability of error
495: exponentially small in $ab$. In other words, we can determine the
496: first bit of $c$ with very high probability. And of course, by
497: cyclically shifting all examples, the same algorithm may be employed
498: to find each bit of $c$. Thus, with high probability we can determine
499: $c$ using a number of examples and total computation-time $
500: \word{poly}((\inv{1-2\eta})^{2^a}, 2^b)$.
501:
502: Plugging in $a = \frac{1}{2}\lg k$ and $b = 2k/\lg k$ yields the
503: desired $2^{O(k/\log k)}$ bound for constant noise rate $\eta$.
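
The arithmetic behind this choice can be sketched as follows (a routine check, with rounding of $a$ and $b$ ignored). With $a = \frac{1}{2}\lg k$ and $b = 2k/\lg k$ we have $ab = k$ and $2^a = \sqrt{k}$, so
$$\left(\inv{1-2\eta}\right)^{2^a} = 2^{\sqrt{k}\,\lg\inv{1-2\eta}} = 2^{O(\sqrt{k})}, \qquad 2^b = 2^{2k/\lg k},$$
and any fixed polynomial in these two quantities is $2^{O(k/\log k)}$ for constant $\eta$. In particular, for $k = \lg n\lg\lg n$ and $n \geq 4$ we have $\lg k \geq \lg\lg n$, and hence $2^b \leq n^2$.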
504:
505: \subsection{Extension to other distributions}
506: While the uniform distribution is in this case the most interesting,
507: we can extend our algorithm to work over any distribution. In fact,
508: it is perhaps easiest to think of this extension as an online learning
509: algorithm that is presented with an arbitrary sequence
510: of examples, one at a time. Given a new test example, the algorithm
511: will output either ``I don't know'', or else will give a prediction of
512: the label. In the former case, the algorithm is told the correct
513: label, flipped with probability $\eta$. The claim is that the
514: algorithm will, with high probability, be correct in all its
515: predictions, and furthermore will output ``I don't know'' only a
516: limited number of times. In the coding-theoretic view,
517: this corresponds to producing a $1 - o(1)$ fraction of the
518: desired codeword, where the remaining entries are left blank. This
519: allows us to recover the full codeword so long as no other codeword is
520: within relative distance $o(1)$.
521:
522: The algorithm is essentially a form of Gaussian elimination, but where
523: each entry in the matrix is an element of the vector space $\Fpow{b}$
524: rather than an element of the field $\Ftwo$. In particular, instead
525: of choosing a row that begins with a 1 and subtracting it from all
526: other such rows, what we do is choose one row for each initial $b$-bit
527: block observed: we then use these (at most $2^b-1$) rows to zero out
528: all the others. We then move on to the next $b$-bit block. If we
529: think of this as an online algorithm, then each new example seen
530: either gets captured as a new row in the matrix (and there are at most
531: $a(2^b-1)$ of them) or else it passes all the way through the matrix
532: and is given a prediction. We then do this with multiple matrices and
533: take a majority vote to drive down the probability of error.
534:
535: For concreteness, let us take the case of $n$ examples, each $k$ bits
536: long for $k = \frac{1}{4}\lg n (\lg\lg n - 2)$, and $\eta = 1/4$. We view each
537: example as consisting of $(\lg\lg n - 2)$ blocks, where each block has width
538: $\frac{1}{4}\lg n$. We now create a series of matrices $M_1,
539: M_2, \ldots$ as follows.
540: Initially, the matrices are all empty.
541: Given a new example, if its first block does not match the first block
542: of any row in $M_1$, we include it as a new row of $M_1$ (and output
543: ``I don't know''). If the
544: first block {\em does} match, then we subtract that row from it
545: (zeroing out the first block of our example) and consider the second
546: block. Again, if the second block does not match any row in $M_1$ we
547: include it as a new row (and output ``I don't know''); otherwise, we subtract that row and consider
548: the third block and so on. Notice that each example will either be
549: ``captured'' into the matrix $M_1$ or else be completely zeroed out
550: (i.e., written as a sum of rows of $M_1$). In the latter case, we
551: have written the example as a sum of at most $2^{\lg\lg n - 2}
552: = \frac{1}{4}\lg n$ previously-seen examples, and therefore the sum
553: of their labels is correct with probability at least $\frac{1}{2}(1 +
554: 1/n^{1/4})$. To amplify this probability, instead of making a
555: prediction we put the example into a new matrix $M_2$, and so on up to
556: matrix $M_{n^{2/3}}$. If an example passes through {\em all}
557: matrices, we can then state that the majority vote is correct with
558: high probability. Since each matrix has at most $2^{\frac{1}{4}\lg
559: n}(\lg\lg n - 2)$ rows, the total number of examples on which we fail
560: to make a prediction is at most $n^{11/12}\lg\lg n = o(n)$.
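
The two quantities in this example can be checked with a short calculation (a sketch that treats the $n^{2/3}$ votes as independent and uses Hoeffding's inequality; $\gamma$ and $m$ are symbols introduced here). Each vote has advantage $\gamma = \onehalf n^{-1/4}$, so with $m = n^{2/3}$ votes the majority errs with probability at most
$$e^{-2m\gamma^2} = e^{-n^{1/6}/2},$$
while the number of examples captured as rows is at most
$$n^{2/3} \cdot n^{1/4}(\lg\lg n - 2) \leq n^{11/12}\lg\lg n.$$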
561:
562: \comment{
563: \begin{theorem}
564: Over an arbitrary distribution, the length-$ab$ parity problem can also be solved by
565: an algorithm whose number of samples and total computation-time are
566: $\,\word{poly}\!\left(\left(\inv{1-2\eta}\right)^{2^a}, 2^b\right)$.
567: \end{theorem}
568:
569: \smallskip
570: \noindent
571: {\bf Proof.}
572: Over an
573: arbitrary distribution, there may be several parity functions with low error,
574: and our goal is to pick one of these.
575: Previously, we tried
576: to write $(1,0,\ldots,0)$ as the sum of $2^{a-1}$ examples. Over an arbitrary
577: distribution, this may not be possible. Instead,
578: we pick a random example and write it as the sum of $2^a-1$ examples. As
579: before, we repeat this process
580: so that we again have probability of error exponentially small in $ab$.
581: This enables us to correctly label test examples with high
probability. With this ability, we can correctly label
$ab/(\epsilon\delta)$ examples and then apply standard noiseless
584: learning techniques to find a low-error parity hypothesis.
585:
586: We use a similar technique to the uniform distribution case to write an arbitrary
587: example $x$ as the sum of
588: $2^a-1$ examples. We take $s$ random examples, and we add $x$ to this set.
589: We now use the procedure of Lemma~\ref{sampling}, $a$ times (rather than $a-1$ times), to write $(0,0,\ldots,0)$ as
590: the sum of $2^a$ examples. This is slightly easier than before, because all
591: the elements of our $a$-sample are $(0,0,\ldots,0)$, rather than
592: having to wait for an element of an $(a-1)$-sample which is
593: $(1,0,\ldots,0)$. Of course, we technically no longer have $i$-samples
594: in the sense that they are not uniformly distributed.
595:
596: Regardless, consider the $a$-sample generated at the end.
597: Each element of this sample is the sum of $2^a$ examples, and we can
598: uniquely identify an element of the sample by the first example used in this
599: sum, because our initial $0$-sample had one element for each example.
Furthermore, since we have at least $s+1-a2^b$ elements of this $a$-sample,
and $x$ is a random example treated as any other, with probability greater than
$1-a2^b/s$, we still have the element of the $a$-sample corresponding to $x$.
603: In this case, we can write $x$ as the sum of the remaining $2^a-1$ examples in
604: its sample. As before, we will repeat this process
605: $r=\word{poly}((\inv{1-2\eta})^{2^a}, b)$ times, using new labeled examples
606: each time, to determine $x \cdot c$ with probability
607: of error exponentially small in $ab$. The probability that we could not write
608: $x$ as the sum of $2^a-1$ examples in any of these $r$ repetitions is less
609: than $ra2^b/s$, which we can make sufficiently small by also choosing
610: $s=\word{poly}((\inv{1-2\eta})^{2^a}, b)$.
611: }
612:
613: \subsection{Discussion}
614:
615: Theorem~\ref{maintheorem} demonstrates that we can
616: solve the length-$n$ parity learning problem
617: in time $2^{o(n)}$. However, it must be emphasized that we accomplish
this by using $2^{O(n/\log n)}$ labeled examples. From the point of
view of coding theory, it would be useful to have an algorithm which takes time
$2^{o(n)}$ and uses only $\word{poly}(n)$ or even $O(n)$ examples. We
621: do not know if this can be done. Also of interest is the question of
622: whether our time-bound can be improved from $2^{O(n/\log n)}$ to, for
623: example, $2^{O(\sqrt{n}\,)}$.
624:
625: It would also be desirable to reduce our algorithm's dependence on
$\eta$. This dependence comes from Lemma~\ref{sumOK}, with $s = 2^{a-1}$.
627: For instance, consider the problem of learning parity functions
628: that depend on the first $k$ bits of input for $k = O(\log n\log \log
629: n)$. In this case, if we set $a=\lceil \frac{1}{2}\lg\lg n \rceil$ and
630: $b = O(\log n)$, the running time is polynomial in $n$, with
631: dependence on $\eta$ of $(\inv{1-2\eta})^{\sqrt{\log n}}$. This
632: allows us to handle $\eta$ as large
633: as $1/2 - 2^{-\sqrt{\log n}}$ and still have polynomial running time.
634: While this can be improved slightly,
635: we do not know how to solve
636: the length-$O(\log n \log \log n)$ parity problem in polynomial time
637: for $\eta$ as large as $1/2 - 1/n$ or even $1/2 - 1/n^\varepsilon$.
638: What makes this interesting is that it is an open question (Kearns,
639: personal communication) whether noise tolerance can in general be
640: boosted; this example suggests why such a result may be
641: nontrivial.
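
\smallskip
\noindent
To see where the threshold $1/2 - 2^{-\sqrt{\log n}}$ comes from, here
is the arithmetic as a sketch. It ignores the ceiling in $a$ and
constant factors, takes logarithms base 2, and so treats $2^a$ as
exactly $\sqrt{\lg n}$. If $\eta = 1/2 - 2^{-\sqrt{\lg n}}$, then
$1 - 2\eta = 2 \cdot 2^{-\sqrt{\lg n}}$, and so
$$\left(\inv{1-2\eta}\right)^{\sqrt{\lg n}} =
\left(2^{\sqrt{\lg n} - 1}\right)^{\sqrt{\lg n}} =
2^{\lg n - \sqrt{\lg n}} \leq n,$$
which is polynomial in $n$. By contrast, for $\eta = 1/2 - 1/n$ the
same expression becomes $(n/2)^{\sqrt{\lg n}} = 2^{(\lg n -
1)\sqrt{\lg n}}$, which is superpolynomial.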
642:
643: \comment{
644: \begin{corollary}
645: \label{highnoise}
646: Let $\varepsilon$ be any positive constant. Using $a =
647: \lceil\varepsilon\lg\lg n\rceil$, $b = O(\log n)$ in
648: Theorem~\ref{mainab}, we find that the parity problem of length
649: $O(\log n \log\log n)$, for noise rate $\eta \leq 1/2 - 2^{-(\lg
650: n)^{1-\varepsilon}}$, can be solved with number of samples and total
651: computation-time $\word{poly}(n)$.
652: \end{corollary}
653: }
654:
\section{Limits of $O(\log n)$-wise Queries}
656:
657: We return to the general problem of learning a target concept $\cc$
658: over a space of examples with a fixed distribution $\cal D$. A
659: limitation of the statistical query model is that it permits only what
660: may be called \ital{unary} queries. That is, an SQ algorithm can
661: access $\cc$ only by requesting approximations of probabilities of
662: form $\pr{x}{\qq(x,\cc(x))}$, where $x$ is $\cal D$-random and $\qq$
663: is a polynomially evaluable predicate. A natural question is whether
664: problems not learnable from such queries can be learned, for example,
665: from binary queries: i.e., from probabilities of form
666: $\pr{x_1,x_2}{\qq(x_1,x_2,\cc(x_1),\cc(x_2))}$. The following theorem
667: demonstrates that this is not possible, proving that $O(\log n)$-wise
668: queries are no better than unary queries, at least with respect to
669: weak-learning.
670:
671: We assume in the discussion below that all algorithms also have access
672: to individual \ital{unlabeled} examples from distribution $\DD$, as is
673: usual in the SQ model.
674:
675: \begin{theorem}
676: \label{lognogood}
677: Let $k = O(\log n)$, and assume that there exists a $\word{poly}(n)$-time
678: algorithm using $k$-wise statistical queries which weakly learns a concept
679: class $C$ under distribution $\cal D$. That is, this algorithm learns from
680: approximations of $\pr{\vec{x}}{\qq(\vec{x},\cc(\vec{x}))}$, where $\qq$ is a
polynomially evaluable predicate, and $\vec{x}$ is a $k$-tuple of examples.
682: Then there exists a $\word{poly}(n)$-time algorithm which weakly learns the
683: same class using only unary queries, under $\cal D$.
684: \end{theorem}
685:
686: \smallskip
687: \noindent
688: {\bf Proof.}
689: We are given a $k$-wise query
690: $\pr{\vec{x}}{\qq(\vec{x},\cc(\vec{x}))}$. The first thing our
691: algorithm will do is use $Q$ to construct several candidate weak
692: hypotheses. It then tests whether each of these hypotheses is in fact
693: noticeably correlated with the target
694: using unary statistical queries. If none of them appear to be good,
695: it uses this fact to
696: estimate the value of the $k$-wise query. We prove that for any
697: $k$-wise query, with high probability we either succeed in finding a
698: weak hypothesis or we output a good estimate of the $k$-wise query.
699:
700: For simplicity, let us assume that $\pr{x}{c(x) = 1} = 1/2$; i.e., a
701: random example is equally likely to be positive or negative. (If
702: $\pr{x}{c(x) = 1}$ is far from $1/2$ then weak-learning is easy by
703: just predicting all examples are positive or all examples are
704: negative.) This assumption implies that if a hypothesis $h$ satisfies
705: $|\pr{x}{h(x) = 1 \wedge c(x) = 1} - \frac{1}{2}\pr{x}{h(x) = 1}| \geq
706: \epsilon$, then either $h(x)$ or $1 - h(x)$ is a weak hypothesis.
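
\smallskip
\noindent
To spell out this implication (a short verification, with $p$ and $q$
introduced here only as shorthand), write $p = \pr{x}{h(x)=1}$ and
$q = \pr{x}{h(x) = 1 \wedge c(x) = 1}$. Since $\pr{x}{c(x)=0} = 1/2$,
\begin{eqnarray*}
\pr{x}{h(x) = c(x)} & = & \pr{x}{h(x)=1 \wedge c(x)=1} +
\pr{x}{h(x)=0 \wedge c(x)=0} \\
& = & q + \left(\onehalf - (p - q)\right) \\
& = & \onehalf + 2\left(q - \onehalf p\right),
\end{eqnarray*}
so $|q - \frac{1}{2}p| \geq \epsilon$ gives $|\pr{x}{h(x)=c(x)} -
\frac{1}{2}| \geq 2\epsilon$. Hence either $h$ or $1-h$ agrees with
$c$ with probability at least $\frac{1}{2} + 2\epsilon$.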
707:
708: We now generate a set of candidate hypotheses by choosing one random
709: $k$-tuple of
710: unlabeled examples $\vec{z}$. For each $1 \leq i \leq k$ and $\vec{\ell} \in
711: \{0,1\}^k$, we hypothesize
$$h_{\vec{z},i,\vec{\ell}}(x) =
\qq(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell}),$$
714: and then use a unary statistical query to
715: tell if $h_{\vec{z},i,\vec{\ell}}(x)$ or
716: $1-h_{\vec{z},i,\vec{\ell}}(x)$ is a weak hypothesis. As noted above,
717: we will have found a weak hypothesis if
718: %\setlength{\multlinegap}{0 in}
719: $$
720: \left|\pr{x}{\qq(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell})
721: \wedge \cc(x)=1} -
722: \frac{1}{2}\pr{x}{\qq(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell})}\right| \geq
723: \epsilon.
724: $$
725: We repeat this process for $O(1/\epsilon)$
726: randomly chosen $k$-tuples $\vec{z}$. We now consider two cases.
727:
728: {\bf Case I:} Suppose that the $i$th label matters to the $k$-wise
729: query $Q$ for some $i$ and
730: $\vec{\ell}$. By this we mean there is at least an $\epsilon$ chance of the
731: above inequality holding for random $\vec{z}$. Then with high probability we
732: will discover such a $\vec{z}$ and thus weak learn.
733:
{\bf Case II:} Suppose, on the contrary, that for no $i$ or $\vec{\ell}$ does
the $i$th label matter, i.e.\ the probability of a random $\vec{z}$
satisfying the above inequality is less than $\epsilon$. This means
737: that
738: \begin{eqnarray*}
739: {\bf E}_{\vec{z}}\left[\left|\pr{x}{\qq(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell})
740: \wedge \cc(x)=1} - \right. \right. \\
741: \left. \left.
742: \frac{1}{2}\pr{x}{\qq(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell})}\right|\right] <
743: 2\epsilon.
744: \end{eqnarray*}
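(A one-line justification, given here as a sketch: the quantity inside
the expectation never exceeds $\frac{1}{2}$, it is at least $\epsilon$
with probability less than $\epsilon$ over $\vec{z}$, and it is below
$\epsilon$ otherwise, so the expectation is less than
$$\epsilon \cdot \onehalf + 1 \cdot \epsilon = \frac{3}{2}\epsilon <
2\epsilon.)$$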
745: By bucketing the $\vec{z}$'s according to the values of $c(z_1)$,
746: $\ldots$, $c(z_{i-1})$ we see that the above implies that
747: for all $b_1, \ldots, b_{i-1}$ $\in$
748: $\{0,1\},$
749: \begin{eqnarray*}
750: \left|\pr{\vec{z}}{\qq(\vec{z},\vec{\ell}) \wedge c(z_1)=b_1 \wedge
751: \ldots \wedge c(z_{i-1})=b_{i-1} \wedge c(z_i)=1}
752: - \right. \\
753: \left. \frac{1}{2}\pr{\vec{z}}{\qq(\vec{z},\vec{\ell}) \wedge c(z_1)=b_1 \wedge \ldots \wedge
754: c(z_{i-1})=b_{i-1}} \right| <
755: 2\epsilon.
756: \end{eqnarray*}
By a straightforward inductive argument
on $i$, we conclude that for every $\vec{b} \in \{0,1\}^k$,
$$\left|\pr{\vec{z}}{\qq(\vec{z},\vec{\ell}) \wedge
c(\vec{z})=\vec{b}} -
\frac{1}{2^k}\pr{\vec{z}}{\qq(\vec{z},\vec{\ell})}\right| <
4\epsilon\left(1-\frac{1}{2^k}\right).$$
762: This fact now allows us to estimate our desired $k$-wise query
763: $\pr{\vec{z}}{\qq(\vec{z},\cc(\vec{z}))}$. In particular,
764: $$\pr{\vec{z}}{\qq(\vec{z},\cc(\vec{z}))} = \sum_{\vec{\ell} \in \{0,1\}^k}
765: \pr{\vec{z}}{\qq(\vec{z},\vec{\ell}) \wedge \cc(\vec{z})=\vec{\ell}}.$$
766: We approximate each of the $2^k= \word{poly}(n)$ terms corresponding to a
767: different $\vec{\ell}$ by using {\em unlabeled} data to estimate
768: $\frac{1}{2^k}\pr{\vec{z}}{Q(\vec{z},{\vec{\ell}})}$. Adding up
769: these terms gives us a good estimate of
770: $\pr{\vec{z}}{\qq(\vec{z},\cc(\vec{z}))}$ with high
771: probability.
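
\smallskip
\noindent
To make ``good estimate'' concrete, here is a sketch of the error
accounting; the tolerance $\tau$ is a symbol introduced here for
illustration and does not appear in the argument above. Replacing each
of the $2^k$ terms by
$\frac{1}{2^k}\pr{\vec{z}}{\qq(\vec{z},\vec{\ell})}$ changes the sum by
less than
$$2^k \cdot 4\epsilon\left(1 - \frac{1}{2^k}\right) =
4\epsilon\,(2^k - 1).$$
Hence, if the $k$-wise query must be answered to within additive error
$\tau$, it suffices to take $\epsilon \leq \tau/(8 \cdot 2^k)$, which
keeps this deviation below $\tau/2$, and to estimate the unlabeled
quantities to within total error $\tau/2$. Since $k = O(\log n)$, this
$\epsilon$ is $1/\word{poly}(n)$ whenever $\tau$ is.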
772:
773:
774: \subsection{Discussion}
775:
776: In the above proof, we saw that either the data is statistically
777: ``homogeneous'' in a way which allows us to simulate the original
778: learning algorithm with unary queries, or else we discover a
779: ``heterogeneous'' region which we can exploit with an alternative
780: learning algorithm using only unary queries. Thus any concept class
781: that can be learned from $O(\log n)$-wise
782: queries can also be weakly learned from unary queries. Note that Aslam and
783: Decatur \cite{AslamDe93} have shown that weak-learning statistical
784: query algorithms can be boosted to strong-learning algorithms, if they
785: weak-learn over \ital{every} distribution. Thus, any concept class
786: which can be
787: (weakly or strongly) learned from $O(\log n)$-wise queries over
788: \ital{every} distribution can be strongly learned over every
789: distribution from unary queries.
790:
791: It is worth noting here that $k$-wise queries can be used to
792: solve the length-$k$ parity problem. One simply asks, for each $i \in
793: \{1, \ldots, k\}$, the query:
794: ``what is the probability that $k$ random examples form a basis for
795: $\bool^k$ and,
796: upon performing Gaussian elimination, yield a target concept whose
797: $i$th bit is equal to 1?'' Thus, $k$-wise
798: queries cannot be reduced to unary queries for $k = \omega(\log n)$.
799: On the other hand, it is not at all clear how to simulate such queries
800: in general from noisy examples.
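
\smallskip
\noindent
As an illustration of why such a query is informative, consider the
following sketch, which assumes the uniform distribution on $\bool^k$
(the paragraph above does not fix a distribution). Then $k$ random
examples form a basis for $\bool^k$ over $\Ftwo$ with probability
$$\prod_{j=1}^{k}\left(1 - 2^{-j}\right) > 0.288,$$
since the $m$th example must avoid the span of the previous $m-1$,
which contains a $2^{m-1-k}$ fraction of $\bool^k$. Because a
statistical query refers to the true labels, Gaussian elimination on a
basis recovers the target exactly, so the answer to the $i$th query is
this product if the $i$th bit of the target is 1 and is 0 otherwise. A
constant tolerance therefore suffices to read off each bit.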
801:
802: \comment{
803: \section{Limits of O(log {n})-wise Queries}
804:
805: We return to the general problem of learning a target concept $\cc$
806: over a space of examples with a fixed distribution $\cal D$. A
807: limitation of the statistical query model is that it permits only what
808: may be called \ital{unary} queries. That is, an SQ algorithm can
809: access $\cc$ only by requesting approximations of probabilities of
810: form $\pr{x}{\qq(x,\cc(x))}$, where $x$ is chosen from $\cal D$ and $\qq$
811: is a polynomially evaluable predicate. A natural question is whether
812: problems not learnable from such queries can be learned, for example,
813: from binary queries: i.e., from probabilities of form
814: $\pr{x_1,x_2}{\qq(x_1,\cc(x_1),x_2,\cc(x_2))}$. The following theorem
815: demonstrates that this is not possible: $O(\log n)$-wise
816: queries are no better than unary queries, at least with respect to
817: weak-learning.
818:
819: We assume in the discussion below that all algorithms also have access
820: to individual \ital{unlabeled} examples from distribution $\DD$, as is
821: usual in the SQ model.
822:
823: \begin{theorem}
824: \label{lognogood}
825: Let $k = O(\log n)$, and assume that there exists a
826: $\word{poly}(n)$-time algorithm using $k$-wise statistical queries
827: which weakly learns a concept class $C$ under distribution $\cal D$.
828: That is, this algorithm learns from approximations of
829: probabilities of form $\pr{x_1,\ldots, x_k}{\qq(x_1,\cc(x_1),
830: \ldots, x_k,\cc(x_k))}$, where $\qq$ is a polynomially evaluable predicate.
831: Then there exists a
832: $\word{poly}(n)$-time algorithm which weakly learns the same class
833: using only unary queries.
834: \end{theorem}
835:
836:
837:
838: \smallskip
839: \noindent
840: {\bf Proof.}
841: The original algorithm has access to approximations, correct to
842: plus-or-minus any desired polynomial fraction, of probabilities of
843: form
844: $$\pr{x_1,\ldots,x_k}{\qq(x_1,\cc(x_1),\ldots,x_k,\cc(x_k))}$$
845: (where all probabilities are over the given distribution $\DD$). We
846: now consider the problem of simulating such a $k$-wise query using
847: only unary queries. What we will show is that either our simulation
848: succeeds, or else in failing it finds a unary query which distinguishes
849: the target function from random; in the latter case, the discovered query
850: can then be used directly for weak-learning.
851:
852: The above probability can be rewritten as:
853: \begin{equation}
854: \sum_{\ell_1,\ldots,\ell_k \in \bool}
855: \Pr_{x_1,\ldots,x_k}\big[\cc(x_1)=\ell_1 \ \andd\ldots\andd\
856: \cc(x_k)=\ell_k
857: \andd \qq(x_1,\ell_1,\ldots,x_k,\ell_k)\big].
858: \end{equation}
859: This is a sum of $2^k = \word{poly}(n)$ probabilities, so we can
860: approximate each constituent probability separately. Hence
861: let us fix $\ell_1,\ldots,\ell_k$ and focus hereafter on
862: approximating a probability of form:
863: \begin{equation}
864: \label{conjunction}
865: \pr{x_1,\ldots,x_k}{\cc(x_1)=\ell_1 \ \andd\ldots\andd\ \cc(x_k)=\ell_k \ \andd\
866: \qq(x_1,\ell_1,\ldots,x_k,\ell_k)}.
867: \end{equation}
868: This probability can in turn be rewritten as the product of $k+1$
869: constituent probabilities, namely:
870: \begin{eqnarray*}
871: \lefteqn{\pr{x_1,\ldots,x_k}{\qq(x_1,\ell_1,\ldots,x_k,\ell_k)}
872: \cdot} \\
873: &&
874: \prod_{i=1}^k
875: \pr{x_1,\ldots,x_k}{\cc(x_i) = \ell_i \,\mid\, \cc(x_1) =
876: \ell_1 \ \andd\cdots\andd\ \cc(x_{i-1}) = \ell_{i-1} \ \andd\
877: \qq(x_1,\ell_1,\ldots,x_k,\ell_k)}.
878: \end{eqnarray*}
879: Once again, we will approximate the constituent probabilities
880: individually. We start by approximating
881: $\pr{x_1,\ldots,x_k}{\qq(x_1,\ell_1,\ldots,x_k,\ell_k)}$ (this is
882: easy, as the probability does not depend on $\cc$), then proceed in
883: order from $i = 1$ up to $k$.
884: If at any point the product of the probabilities calculated so far is
885: very small, we halt, returning zero as our approximation
886: for~(\ref{conjunction}).
887:
888: Hereafter we fix $i$ and focus on approximating
889: a probability of form:
890: \begin{equation}
891: \label{conditional}
892: \pr{x_1,\ldots,x_k}{\cc(x_i) = \ell_i \,\mid\, \cc(x_1) =
893: \ell_1 \ \andd\cdots\andd\ \cc(x_{i-1}) = \ell_{i-1} \ \andd\
894: \qq(x_1,\ell_1,\ldots,x_k,\ell_k)}.
895: \end{equation}
896: To approximate this value, we sample from $\DD$ to generate
897: $\bar{z}^{(1)}, \ldots, \bar{z}^{(t)}$, a list of $(k-1)$-tuples of
898: unlabeled examples, where $t$ is of large polynomial size. Each
899: $\bar{z}^{(j)}$ is of the form $(z^{(j)}_1, \ldots, z^{(j)}_{i-1},
900: z^{(j)}_{i+1}, \linebreak[1] \ldots, z^{(j)}_k)$, where each
901: $z^{(j)}_m$ is a $\cal D$-random unlabeled example. (We think of each
902: $\bar{z}^{(j)}$ as specifying values for all of $x_1,\ldots,x_k$
903: except $x_i$.) Corresponding to each $\bar{z}^{(j)}$ we introduce
904: probabilities $S^{(j)}$ and $T^{(j)}$, defined as follows:
905: \begin{eqnarray*}
906: S^{(j)} &:=& \pr{x_i}{\qq(z^{(j)}_1,\ell_1, \ldots, x_i,\ell_i,
907: \ldots, z^{(j)}_k,\ell_k)},\\
908: T^{(j)} &:=& \pr{x_i}{\cc(x_i)=\ell_i \ \andd\
909: \qq(z^{(j)}_1,\ell_1, \ldots, x_i,\ell_i, \ldots, z^{(j)}_k,\ell_k)}.
910: \end{eqnarray*}
911: Note that we can efficiently approximate each of the probabilities
912: $S^{(j)}$ and $T^{(j)}$: indeed, $S^{(j)}$ does not depend on the
913: target concept, while $T^{(j)}$ requires only a unary query. Now
914: consider the fraction
915: \begin{equation}
916: \label{fraction}
917: \frac{\sum_{j \in {\cal R}} T^{(j)}}{\sum_{j \in {\cal R}} S^{(j)}}\ ,
918: \end{equation}
919: where the summation is over ${\cal R}
920: = \{ j \colon\:
921: \cc(z^{(j)}_m) = \ell_m \word{ for all $m < i$} \}$. Note that our
922: algorithm cannot tell which $j$ belong to ${\cal R}$ and which do not,
923: because we do not have direct access to $c$; nonetheless, we may
924: assume that $|{\cal R}|$ is not too small and indeed that
925: the denominator of this fraction is not close to zero. The reason is
926: that if this denominator were small, that would (with high
927: probability) imply that
928: $\pr{x_1,\ldots,x_k}{\cc(x_1)=\ell_1 \ \andd\ldots\andd\
929: \cc(x_{i-1})=\ell_{i-1} \ \andd\
930: \qq(x_1,\ell_1,\ldots,x_k,\ell_k)}$
931: is small, which would have caused our algorithm to halt before
932: reaching this point.
933: But observe that, if the denominator is not too small and $t$ is
934: of sufficiently large polynomial size,
935: (\ref{fraction}) will with high probability
936: be a good approximation for~(\ref{conditional}).
937: We now distinguish two cases:
938:
939: {\bf Case I:}\ \ we find that, for all $j \in \{1,\ldots,t\}$, $|2T^{(j)} -
940: S^{(j)}|$ is small. Then the value of~(\ref{fraction}) is
941: approximately $1/2$. (We know this to be true even though we
942: do not know which values of $j$ are in $\cal R$.) Hence we may
943: return $1/2$ as our approximation for~(\ref{conditional}).
944:
945: {\bf Case II:}\ \ we find some $j$ such that $|2T^{(j)} - S^{(j)}|$
946: is large. This means that
947: we have discovered a significantly large region of $\cal D$-random
948: instances, namely
949: $\{ x\colon\: \qq(z^{(j)}_1,\ell_1, \ldots, x,\ell_i, \ldots,
950: z^{(j)}_k,\ell_k)\}$, over which the probability that $\cc(x) = 1$ is
951: skewed away from $1/2$. But then we can abandon
952: our effort to simulate the original learning algorithm, and can
953: instead use this new information to directly predict the value of
954: $\cc(x)$
955: with probability significantly greater than $1/2$.
956:
984: }
985:
986: \section{Conclusion}
987:
988: In this paper we have addressed the classic problem of
989: learning parity functions in the presence of random noise. We have
990: shown that parity functions over $\booln$ can be learned in slightly
991: sub-exponential time, but only if many labeled examples are available.
992: It is to be hoped that future research may reduce both the time-bound
993: and the number of examples required.
994:
995: Our result also applies to the study of statistical query learning and
996: PAC-learning. We have given the first known noise-tolerant
997: PAC-learning algorithm which can learn a concept class not learnable
998: by any SQ algorithm. The separation we have established between the two
999: models is rather small: we have shown that a specific parity problem
1000: can be PAC-learned from noisy data in time $\word{poly}(n)$, as
1001: compared to time $n^{O(\log\log n)}$ for the best SQ algorithm. This
1002: separation may well prove capable of improvement and worthy of
1003: further examination. Perhaps more importantly, this suggests the
1004: possibility of interesting new noise-tolerant PAC-learning algorithms
1005: which go beyond the SQ model.
1006:
1007: We have also examined an extension to the SQ model in terms of
1008: allowing queries of arity $k$. We have shown that for $k=O(\log n)$,
1009: any concept class learnable in the SQ model with $k$-wise queries is
1010: also (weakly) learnable with unary queries. On the other hand, the
1011: results of \cite{BFJKMR94} imply this is not the case for $k =
1012: \omega(\log n)$. An interesting open question is whether every concept
1013: class learnable from $O(\log n \log\log n)$-wise queries is also
1014: PAC-learnable in the presence of classification noise. If so, then
1015: this would be a generalization of the first result of this paper.
1016:
1017: \newcommand{\etalchar}[1]{$^{#1}$}
1018: \begin{thebibliography}{BFJ{\etalchar{+}}94}
1019:
1020: \bibitem{AngluinLa88}
1021: D.~Angluin and P.~Laird.
1022: \newblock Learning from noisy examples.
1023: \newblock {\em Machine Learning}, 2(4):343--370, 1988.
1024:
1025: \bibitem{AslamDe93}
1026: J.~A. Aslam and S.~E. Decatur.
1027: \newblock General bounds on statistical query learning and {PAC} learning with
1028: noise via hypothesis boosting.
1029: \newblock In {\em Proceedings of the 34th Annual Symposium on Foundations of
1030: Computer Science}, pages 282--291, Nov. 1993.
1031:
1032: \bibitem{AslamDe98}
1033: J.~A. Aslam and S.~E. Decatur.
1034: \newblock Specification and simulation of statistical query algorithms for
1035: efficiency and noise tolerance.
1036: \newblock {\em J.~Comput. Syst. Sci.}, 56(2):191--208, April 1998.
1037:
1038: \bibitem{BFJKMR94}
1039: A.~Blum, M.~Furst, J.~Jackson, M.~Kearns, Y.~Mansour, and S.~Rudich.
\newblock Weakly learning {DNF} and characterizing statistical query learning
  using {F}ourier analysis.
1042: \newblock In {\em Proceedings of the 26th Annual ACM Symposium on Theory of
1043: Computing}, pages 253--262, May 1994.
1044:
1045: \bibitem{Decatur93}
1046: S.~E. Decatur.
1047: \newblock Statistical queries and faulty {PAC} oracles.
1048: \newblock In {\em Proceedings of the 6th Annual {ACM} Workshop on Computational
1049: Learning Theory}. {ACM} Press, 1993.
1050:
1051: \bibitem{Decatur96}
1052: S.~E. Decatur.
1053: \newblock Learning in hybrid noise environments using statistical queries.
1054: \newblock In D.~Fisher and H.-J. Lenz, editors, {\em Learning from Data:
1055: Artificial Intelligence and Statistics {V}.} Springer Verlag, 1996.
1056:
1057: \bibitem{Jackson00}
J.~Jackson.
\newblock On the efficiency of noise-tolerant {PAC} algorithms derived from
  statistical queries.
\newblock In {\em Proceedings of the 13th Annual Workshop on Computational Learning Theory}, 2000.
1062:
1063: \bibitem{Kearns93}
1064: M.~Kearns.
1065: \newblock Efficient noise-tolerant learning from statistical queries.
1066: \newblock In {\em Proceedings of the 25th Annual {ACM} Symposium on Theory of
1067: Computing}, pages 392--401, 1993.
1068:
1069: \bibitem{KS01}
R.~Kumar and D.~Sivakumar.
1071: \newblock On polynomial approximations to the shortest lattice vector
1072: length.
1073: \newblock To appear in {\em Proceedings of the 12th Annual Symposium on
1074: Discrete Algorithms}, 2001.
1075:
1076:
1077: \end{thebibliography}
1078:
1079: \end{document}
1080: