
1: \documentstyle[fullpage,11pt]{article}

2: \newtheorem{theorem}{Theorem}

3: \newtheorem{lemma}[theorem]{Lemma}

4: \newtheorem{exercise}[theorem]{Exercise}

5: \newtheorem{proposition}[theorem]{Proposition}

6: \newtheorem{claim}[theorem]{Claim}

7: \newtheorem{corollary}[theorem]{Corollary}

8: \newtheorem{observation}[theorem]{Observation}

9: \newtheorem{definition}{Definition}

10: \newcommand{\proofspace}{\vspace{.15in}}

11: \newcommand{\ital}[1]{{\/\em #1\/}}

12: \newcommand{\word}[1]{\mbox{\rm #1}}

13: \newcommand{\angles}[2]{\langle #1,#2 \rangle}

14: \newcommand{\andd}{\wedge}

15: \newcommand{\Ftwo}{\word{\bf F}_2}

16: \newcommand{\inv}[1]{\frac{1}{#1}}

17: \newcommand{\Fpow}[1]{\Ftwo^{#1}}

18: \newcommand{\onehalf}{{\textstyle \frac{1}{2}}}

19: \newcommand{\pr}[2]{\word{Pr}_{#1}\left[#2\right]}

20: \newcommand{\booln}{\{0,1\}^n}

21: \newcommand{\bool}{\{0,1\}}

22: \newcommand{\vv}{\vec{v}}

23: \newcommand{\vx}{\vec{x}}

24: \newcommand{\xx}{x}

25: \newcommand{\comment}[1]{}

26: \newcommand{\cc}{c}

27: \newcommand{\DD}{{\cal D}}

28: \newcommand{\qq}{Q}

29:

30: \begin{document}

31: \bibliographystyle{abbrv}

32: %\title{On Noise-Tolerant learning and the Statistical Query model}

33: \title{Noise-Tolerant Learning, the Parity Problem,\\and the

34: Statistical Query Model}

35: %\title{A new algorithm for noise-tolerant learning, and extensions of

36: %the statistical query model}

37: \author{Avrim Blum \and Adam Kalai \and Hal Wasserman}

38: \date{School of Computer Science\\Carnegie Mellon University}

39: \date{\today}

40: \maketitle

41:

42: \begin{abstract}

43: We describe a slightly sub-exponential time algorithm for learning

44: parity functions in the presence of random classification noise.

45: This results in a polynomial-time algorithm for the case

46: of parity functions that depend on only the first $O(\log n \log\log

47: n)$ bits of input.  This is the first known instance of an efficient

48: noise-tolerant algorithm for a concept class that is provably

49: not learnable in the Statistical Query model of Kearns

50: \cite{Kearns93}.  Thus, we demonstrate that the set of problems

51: learnable in the statistical query model is a strict subset of those

52: problems learnable in the presence of noise in the PAC model.

53:

54: In coding-theory terms, what we give is a poly$(n)$-time algorithm for

55: decoding linear $k\times n$ codes in the presence of random noise for

56: the case of $k = c\log n

57: \log\log n$ for some $c > 0$.  (The case of $k = O(\log n)$ is trivial

58: since one can just individually check each of the $2^k$ possible

59: messages and choose the one that yields the closest codeword.)

60:

61: A natural extension of the statistical query model is to allow queries

62: about statistical properties that involve $t$-tuples of examples (as

63: opposed to single examples).  The second result of this paper is

64: to show that any class of functions learnable (strongly or weakly)

65: with $t$-wise queries for $t = O(\log n)$ is also weakly learnable

66: with standard unary queries.  Hence this natural

67: extension to the statistical query model does not increase the set of

68: weakly learnable functions.

69: \end{abstract}

70:

71: \section{Introduction}

72: An important question in the study of machine learning is:

73: ``What kinds of functions can be learned efficiently from

74: noisy, imperfect data?''  The statistical query (SQ) framework of

75: Kearns \cite{Kearns93} was designed as a useful, elegant model for

76: addressing this issue.

77: The SQ model provides a restricted interface between a

78: learning algorithm and its data, and has the property that any

79: algorithm for learning in the SQ model can automatically be converted

80: to an algorithm for learning in the presence of \ital{random

81: classification noise} in the standard PAC model.  (This result has

82: been extended to more general forms of noise as well

83: \cite{Decatur93,Decatur96}.)  The importance of the Statistical Query model is attested to by the fact

84: that before its introduction, there were only a few provably

85: noise-tolerant learning algorithms, whereas now it is recognized that

86: a large number of

87: learning algorithms can be formulated as SQ algorithms, and

88: hence can be made noise-tolerant.

89:

90: The importance of the SQ model has led to the open question of whether

91: examples exist of problems learnable with random classification noise

92: in the PAC model but not learnable by statistical queries.  This is

93: especially interesting because one can characterize

94: information-theoretically (i.e., without complexity assumptions) what

95: kinds of problems can be learned in the SQ model

96: \cite{BFJKMR94}.  For example, the class of parity functions, which

97: {\em can} be learned efficiently from {\em non}-noisy data in the PAC

98: model, provably cannot be learned efficiently in the SQ model under

99: the uniform distribution.  Unfortunately, there is also no known efficient

100: non-SQ algorithm for learning them in the presence of noise

101: (this is closely related to the classic coding-theory problem of

102: decoding random linear codes).

103:

104: In this paper, we describe a polynomial-time algorithm for learning

105: the class of parity functions that depend on only the first $O(\log n

106: \log\log n)$ bits of input, in the presence of random

107: classification noise (of a constant noise rate).  This class

108: provably cannot be learned in the SQ model, and thus is the first

109: known example of a concept class learnable with noise but not via

110: statistical queries.  Our algorithm has recently been shown to have

111: applications to the problem of determining the shortest lattice vector

112: length \cite{KS01} and to various other analyses of statistical queries

113: \cite{Jackson00}.

114:

115: An equivalent way of stating this result is that we are given a random

116: $k \times n$ boolean matrix $A$, as well as an $n$-bit vector $\tilde{y}$

117: produced by multiplying $A$ by an (unknown) $k$-bit

118: message $x$, and then corrupting each bit of the resulting

119: codeword $y = xA$ with probability $\eta < 1/2$.  Our goal is to

120: recover $y$ in time poly$(n)$.  For this problem, the case of $k =

121: O(\log n)$ is trivial because one could simply try each of the

122: $2^k$ possible messages and output the nearest

123: codeword found.  Our algorithm works for  $k = c\log n\log\log n$ for some

124: $c > 0$.  The algorithm does not actually need $A$ to be random, so

125: long as the noise is random and there is no other codeword within

126: distance $o(n)$ from the true codeword $y$.

127:

128: Our algorithm can also be viewed as a slightly sub-exponential time

129: algorithm for learning arbitrary parity functions in the presence of

130: noise.  For this problem, the brute-force algorithm

131: would draw $O(n)$ labeled examples, and then search through all $2^n$

132: parity functions to find the one of least empirical error.  (A

133: standard argument can be used to say that with high probability, the

134: correct function will have the lowest empirical error.)  In contrast,

135: our algorithm runs in time $2^{O(n/\log n)}$, though it also requires

136: $2^{O(n/\log n)}$ labeled examples.  This improvement is small but

137: nonetheless sufficient to achieve the desired separation result.

138:

139: The second result of this paper concerns a $k$-wise version of the

140: Statistical Query model. In the standard version, algorithms may only

141: ask about statistical properties of single examples. (E.g., what is

142: the probability that a random example is labeled positive and has its

143: first bit equal to 1?)  In the $k$-wise version, algorithms may ask

144: about properties of $k$-tuples of examples.  (E.g., what is the

145: probability that two random examples have an even dot-product and have

146: the same label?)  Given the first result of this paper, it is natural

147: to ask whether allowing $k$-wise queries, for some small value of $k$,

148: might increase the set of SQ-learnable functions.  What we show is

149: that for $k=O(\log n)$, any concept class learnable

150: from $k$-wise queries is also (weakly) learnable from unary queries.

151: Thus the seeming generalization of the SQ model to allow for $O(\log n)$-wise

152: queries does not close the gap we have demonstrated between what is

153: efficiently learnable in the SQ and noisy-PAC models.  Note that this

154: result is the best possible with respect to $k$ because the

155: results of

156: \cite{BFJKMR94} imply that for $k = \omega(\log n)$, there are concept

157: classes learnable from $k$-wise queries but not unary queries.  On the other

158: hand, $\omega(\log n)$-wise queries are in a sense less interesting because it

159: is not clear whether they can in general be simulated in the presence of noise.

160: %%(Though perhaps it might be possible to generalize the first result in

161: %%this paper to do so!)

162:

163: \subsection{Main ideas}

164: The standard way to learn parity functions without noise is based on

165: the fact that if an example can be written as a sum (mod 2) of

166: previously-seen examples, then its label must be the sum (mod 2) of

167: those examples' labels.  So, once one has found a basis, one can

168: use that to deduce the label of {\em any} new example (or,

169: equivalently, use Gaussian elimination to produce the target function

170: itself).

171:
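As an illustration of the noiseless method just described, here is a minimal Python sketch of learning a parity function by Gaussian elimination over $\Ftwo$. The function name \verb|solve_parity| and the packing of each example into an integer are our own choices for illustration, and coordinates that the examples do not determine are arbitrarily set to 0.

```python
def solve_parity(examples, n):
    """Return a parity vector consistent with noiseless examples, or None.

    examples: iterable of (x, label), x a sequence of n bits, label = x . c mod 2.
    Coordinates not determined by the examples are set to 0 (an assumption).
    """
    # Pack each equation into an int: bits 0..n-1 hold x, bit n holds the label.
    rows = [sum(bit << i for i, bit in enumerate(x)) | (label << n)
            for x, label in examples]
    pivots = []  # list of (column, fully reduced row)
    for col in range(n):
        idx = next((i for i, r in enumerate(rows) if (r >> col) & 1), None)
        if idx is None:
            continue  # no remaining equation mentions this coordinate
        p = rows.pop(idx)
        # Eliminate this column from every other equation (addition mod 2 is XOR).
        rows = [r ^ p if (r >> col) & 1 else r for r in rows]
        pivots = [(c, r ^ p if (r >> col) & 1 else r) for c, r in pivots]
        pivots.append((col, p))
    if any(rows):
        return None  # some equation reduced to 0 = 1: labels are inconsistent
    target = [0] * n
    for col, r in pivots:
        target[col] = (r >> n) & 1
    return target
```

With a single flipped label the equations typically become inconsistent and the sketch returns \verb|None|, which is one way to see why this approach does not survive noise.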

172: In the presence of noise, this method breaks down.  If the original

173: data had noise rate $1/4$, say, then the sum of $s$ labels

174: has noise rate $1/2 - (1/2)^{s+1}$.  This means we can add

175: together only $O(\log n)$ examples if we want the resulting sum to be

176: correct with probability $1/2 + 1/\word{poly}(n)$.  Thus, if we want to use

177: this kind of approach, we need some way to write

178: a new test example as a sum of only a {\em small

179: number} of training examples.

180:

181: Let us now consider the case of parity functions that depend on only

182: the first $k = \log n \log\log n$ bits of input.  Equivalently, we can

183: think of all examples as having the remaining $n-k$ bits equal to 0.

184: Gaussian elimination will in this case allow us to write our test

185: example as a sum of $k$ training examples, which is too many.  Our

186: algorithm will instead write it as a sum of $k/\log k = O(\log n)$

187: examples, which gives us the desired noticeable bias (that can then be

188: amplified).

189:

190: Notice that if we have seen $\word{poly}(n)$ training examples (and, say,

191: each one was chosen uniformly at random), we can argue existentially

192: that for $k = \log n \log\log n$, one should be able to write any new

193: example as a sum of just

194: $O(\log\log n)$ training examples, since there are $n^{O(\log \log n)}

195: \gg 2^k$ subsets of this size (and the subsets are pairwise

196: independent).  So, while our algorithm is finding a smaller subset

197: than Gaussian elimination, it is not doing the best possible.

198: If one {\em could} achieve, say, a constant-factor

199: approximation to the problem ``given a set of vectors, find the

200: smallest subset that sums to a given target vector'' then this would

201: yield an algorithm to efficiently learn the class of parity functions

202: that depend on the first $k = O(\log^2 n)$ bits of input.

203: Equivalently, this would allow one to learn parity functions over $n$ bits

204: in time $2^{O(\sqrt{n})}$, compared to the $2^{O(n/\log n)}$ time of

205: our algorithm.

206:

207: \comment{

208: Similarly, if $k =$ So,

209: if we could algorithmically {\em find} the

210: smallest subset of training examples that sums to our test

211: example, we would have a noticeable bias and (essentially) be done.

212:

213: Unfortunately, it seems difficult to efficiently find the smallest

214: subset so we cannot do quite this well.  Instead, we give a weak

215: approximation.  Specifically, for $k = \log n

216: \log\log n$, the existential argument tells us there should exist

217: a subset of size $O(\log \log n)$ that sums to our test example; our

218: algorithm will, in this case, find a subset of size $O(\log n)$.  This

219: is a lot larger than optimal, but still better than Gaussian

220: elimination (which finds $O(\log n \log \log n)$ examples) and is

221: sufficient for our result.

222: }

223:

224: \section{Definitions and Preliminaries}

225:

226: A \ital{concept} is a boolean function on an \ital{input space}, which

227: in this paper will generally be $\booln$.   A \ital{concept class} is a

228: set of concepts.  We will be considering the problem of learning a

229: target concept in the presence of \ital{random classification noise}

230: \cite{AngluinLa88}.  In this model, there is some fixed (known or

231: unknown) noise rate $\eta < 1/2$, a fixed (known or unknown)

232: probability distribution $\DD$ over $\booln$, and an unknown target

233: concept $c$.  The learning algorithm may repeatedly ``press a button''

234: to request a labeled example.  When it does so, it receives a pair

235: $(\xx, \ell)$, where $\xx$ is chosen from $\booln$ according to $\DD$

236: and $\ell$ is the value $c(\xx)$, but ``flipped'' with probability

237: $\eta$.  (I.e., $\ell = c(\xx)$ with probability $1-\eta$, and

238: $\ell = 1-c(\xx)$ with probability $\eta$.)  The goal of the learning

239: algorithm is to find an

240: \ital{$\epsilon$-approximation} of $c$: that is, a hypothesis

241: function $h$ such that $\Pr_{\xx \leftarrow \DD}[h(\xx) = c(\xx)] \geq

242: 1-\epsilon$.

243:

244: We say that a concept class $C$ is \ital{efficiently learnable in the

245: presence of random classification noise} under distribution $\DD$ if

246: there exists an algorithm ${\cal A}$ such that for any $\epsilon>0,

247: \delta>0, \eta < 1/2$, and any target concept $c \in C$, the algorithm

248: ${\cal A}$ with probability at least $1-\delta$ produces an

249: $\epsilon$-approximation of $c$ when given access to $\DD$-random examples

250: which have been labeled by $c$ and corrupted by noise of rate

251: $\eta$.  Furthermore, ${\cal A}$ must run in time polynomial in $n$,

252: $1/\epsilon$, and $1/\delta$.\footnote{Normally, one would also

253: require polynomial dependence on $1/(1/2 -\eta)$ --- in part because

254: normally this is easy to achieve (e.g., it is achieved by any

255: statistical query algorithm).  Our algorithms run in polynomial

256: time for any \ital{fixed} $\eta < 1/2$, but have a

257: super-polynomial dependence on $1/(1/2 - \eta)$.}

258:

259:

260:

261: A \ital{parity function} $c$ is defined by a corresponding vector $c

262: \in \booln$; the parity function is then given by the rule $c(x) = x

263: \cdot c \!\!\! \pmod{2}$.  We say that $c$ \ital{depends on only the first

264: $k$ bits of input} if all nonzero components of $c$ lie in its

265: first $k$ bits.  So, in particular, there are $2^k$ distinct parity

266: functions that depend on only the first $k$ bits of input.  Parity

267: functions are especially interesting to consider under the uniform

268: distribution $\DD$, because under that distribution parity functions

269: are pairwise uncorrelated.

270:
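The uncorrelatedness claim can be verified in one line (a sketch using only the definitions above; the symbol $c'$ for a second parity vector is our notation). For distinct $c, c' \in \booln$, the vector $c + c'$ (sum mod 2) is nonzero, so for uniformly random $x$ the bit $x \cdot (c + c') \bmod 2$ is unbiased:
$$\Pr_{x \leftarrow \DD}[c(x) = c'(x)] = \Pr_{x \leftarrow \DD}[x \cdot (c + c') \equiv 0 \!\!\! \pmod{2}] = \onehalf.$$
Hence $\Pr_{x \leftarrow \DD}[c(x) = c'(x)] - \Pr_{x \leftarrow \DD}[c(x) \neq c'(x)] = 0$.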

271: \subsection{The Statistical Query model}

272: The Statistical Query (SQ) model can be viewed as providing a

273: restricted interface between the learning algorithm and the source of

274: labeled examples.  In this model, the learning algorithm may only

275: receive information about the target concept through \ital{statistical

276: queries}.  A statistical query is a query about some property $\qq$

277: of labeled examples (e.g., that the first two bits are equal and the label is

278: positive), along with a tolerance parameter $\tau \in

279: [0,1]$.  When the algorithm asks a statistical query $(\qq,\tau)$, it

280: is asking for the probability that predicate $\qq$ holds true for a

281: random correctly-labeled example, and it receives an approximation of this

282: probability up to $\pm \tau$. In other words, the algorithm receives a

283: response $\hat{P}_{\qq} \in [P_{\qq}-\tau, P_{\qq}+\tau]$, where

284: %

285: $P_{\qq} = \Pr_{x \leftarrow \DD}[\qq(x,c(x))]$.

286: %

287: We also require each

288: query $\qq$ to be polynomially evaluable (that is, given $(x,\ell)$, we

289: can compute $\qq(x,\ell)$ in polynomial time).

290:

291: Notice that a statistical query can be simulated by drawing a large

292: sample of data and computing an empirical average, where the size of

293: the sample would be roughly $O(1/\tau^2)$ if we wanted to assure an

294: accuracy of $\tau$ with high probability.

295:
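The simulation by sampling can be sketched in Python as follows. This is only an illustration under stated assumptions: the names \verb|sq_sample_size|, \verb|simulate_sq| and \verb|draw_example| are hypothetical, the oracle is assumed to return correctly labeled examples, and the sample size comes from Hoeffding's inequality, which is one standard way to obtain the $O(1/\tau^2)$ bound.

```python
import math

def sq_sample_size(tau, delta):
    """Samples sufficient (by Hoeffding's inequality) for error <= tau w.p. >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * tau ** 2))

def simulate_sq(query, draw_example, tau, delta=0.01):
    """Estimate Pr[query(x, label)] from noiseless labeled examples.

    draw_example() is assumed to return one pair (x, c(x)) with x drawn from D.
    """
    m = sq_sample_size(tau, delta)
    # Empirical fraction of the m examples on which the predicate holds.
    hits = sum(1 for _ in range(m) if query(*draw_example()))
    return hits / m
```

Halving $\tau$ roughly quadruples the number of examples drawn, which is why the inverse tolerance is required to be polynomially bounded.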

296: A concept class $C$ is \ital{learnable from statistical queries} with

297: respect to distribution $\DD$ if there is a learning algorithm ${\cal

298: A}$ such that for any $c \in C$ and any $\epsilon>0$, ${\cal A}$ produces an

299: $\epsilon$-approximation of $c$ from statistical queries; furthermore,

300: the running time, the number of queries asked, and the inverse of the

301: smallest tolerance used must be polynomial in $n$ and $1/\epsilon$.

302:

303:

304: We will also want to talk about \ital{weak learning.}  An algorithm

305: ${\cal A}$ weakly learns a concept class $C$ if for any $c \in C$ and

306: for \ital{some} $\epsilon < 1/2 - 1/\word{poly}(n)$,  ${\cal A}$ produces an

307: $\epsilon$-approximation of $c$.  That is, an algorithm weakly learns if

308: it can do noticeably better than guessing.

309:

310: The statistical query model is defined with respect to non-noisy data.

311: However, statistical queries can be simulated from data corrupted by

312: random classification noise \cite{Kearns93}.  Thus, any concept class

313: learnable from statistical queries is also PAC-learnable in the

314: presence of random classification noise.

315: There are several variants of the formulation given above that

316: improve the efficiency of the simulation \cite{AslamDe93,AslamDe98},

317: but they are all polynomially related.

318:

319: One technical point: we have

320: defined statistical query learnability in the ``known distribution''

321: setting (algorithm ${\cal A}$ knows distribution $\DD$); in the

322: ``unknown distribution'' setting, ${\cal A}$ is allowed

323: to ask for random unlabeled examples from the distribution $\DD$\@.

324: This prevents certain trivial exclusions from what is learnable from

325: statistical queries.

326:

327:

328: \subsection{An information-theoretic characterization}\label{sec:info}

329:

330: BFJKMR \cite{BFJKMR94} prove that any concept class containing more than

331: polynomially many pairwise uncorrelated functions

332: cannot be learned even weakly in the statistical query model.

333: Specifically, they show the following.

334:

335: \begin{definition} (Def.~2 of \cite{BFJKMR94})

336: For concept class $C$ and distribution $\DD$, the

337: \ital{statistical query dimension} SQ-DIM$(C,\DD)$ is the largest

338: number $d$ such that $C$ contains $d$ concepts $c_1, \ldots, c_d$ that

339: are nearly pairwise uncorrelated: specifically, for all $i\neq j$,

340: $$\left|\Pr_{x \leftarrow \DD}[c_i(x) = c_j(x)] - \Pr_{x \leftarrow

341: \DD}[c_i(x) \neq c_j(x)]\right| \leq 1/d^3.$$

342: \end{definition}

343:

344: \begin{theorem} (Thm.~12 of \cite{BFJKMR94}) In order to learn $C$ to error

345: less than $1/2 - 1/d^3$ in the SQ model, where $d = $ SQ-DIM$(C,\DD)$,

346: either the number of queries or $1/\tau$ must be at least $\frac{1}{2}d^{1/3}$.

347: \end{theorem}

348:

349: Note that the class of parity functions over $\booln$ that depend on

350: only the first $O(\log n\log\log n)$ bits of input contains

351: $n^{O(\log \log n)}$ functions, all pairs of which are uncorrelated

352: with respect to the uniform distribution.  Thus, this class cannot be

353: learned (even weakly) in the SQ model with polynomially many queries

354: of $1/\word{poly}(n)$ tolerance.  But we will now show that there

355: nevertheless exists a polynomial-time PAC-algorithm for learning this

356: class in the presence of random classification noise.

357:
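As a quick check of the counting behind this lower bound (a sketch; $\gamma > 0$ is our name for the constant hidden in $O(\log n\log\log n)$, and $\lg$ denotes $\log_2$): if $k = \gamma \lg n \lg\lg n$, the class contains $d = 2^k = n^{\gamma\lg\lg n}$ parity functions. Distinct parity functions have correlation exactly $0 \leq 1/d^3$ under the uniform distribution, so the statistical query dimension of the class is $d$, and the theorem above requires the number of queries or $1/\tau$ to be at least
$$\onehalf d^{1/3} = \onehalf \cdot 2^{k/3} = \onehalf\, n^{(\gamma/3)\lg\lg n},$$
which exceeds every fixed polynomial in $n$ once $n$ is large enough.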

358: \section{Learning Parity with Noise}

359: \subsection{Learning over the uniform distribution}

360:

361: For ease of notation, we use the ``length-$k$ parity problem'' to

362: denote the problem of learning a parity function over $\bool^k$, under

363: the uniform distribution, in the presence of random classification

364: noise of rate $\eta$.

365:

366:

367: \begin{theorem}

368: \label{maintheorem}

369: The length-$k$ parity problem, for

370: noise rate $\eta$ equal to any constant less than $1/2$, can be solved

371: with number of samples and total computation-time $2^{O(k/\log k)}$.

372: \end{theorem}

373:

374: Thus, in the presence of noise we can learn parity functions over

375: $\{0,1\}^n$ in time and sample size $2^{O(n/\log n)}$, and we can

376: learn parity functions over $\{0,1\}^n$ that only depend on the first

377: $k = O(\log n\log\log n)$ bits of the input in time and sample size

378: $\word{poly}(n)$.

379:

380: We begin our proof of Theorem~\ref{maintheorem} with a simple lemma about

381: how noise becomes amplified when examples are added together.  For

382: convenience, if $x_1$ and $x_2$ are examples, we let $x_1 + x_2$

383: denote the vector sum mod 2; similarly, if $\ell_1$ and $\ell_2$ are

384: labels, we let $\ell_1+\ell_2$ denote their sum mod 2.

385:

386: \begin{lemma}

387: \label{sumOK}

388: Let $(x_1, \ell_1), \ldots, (x_s, \ell_s)$ be examples labeled

389: by $c$ and corrupted by random noise of rate $\eta$.  Then

390: $\ell_1 + \cdots + \ell_s$ is the correct value of $(x_1 +

391: \cdots + x_s) \cdot c$ with probability $\onehalf + \onehalf(1-2\eta)^s$.

392: \end{lemma}

393:

394: \smallskip

395: \noindent

396: {\bf Proof.}

397: The claim is clearly true when $s=1$.  Now assume that the lemma is true for

398: $s-1$.  Then the probability that $\ell_1 + \cdots + \ell_s =

399: (x_1 + \cdots + x_s) \cdot c$ is

400: $$(1-\eta)(\onehalf + \onehalf(1-2\eta)^{s-1}) + \eta(\onehalf -

401: \onehalf(1-2\eta)^{s-1}) = \onehalf + \onehalf(1-2\eta)^s.$$

402: The lemma then follows by induction.

403:
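For readers who want the algebra of the inductive step spelled out, the following expansion may help (a sketch; $p_{s-1}$ is our shorthand for the probability that the first $s-1$ labels sum to the correct value). Writing $p_{s-1} = \onehalf + \onehalf(1-2\eta)^{s-1}$, the sum of all $s$ labels is correct either when the first $s-1$ labels sum correctly and $\ell_s$ is not flipped, or when they sum incorrectly and $\ell_s$ is flipped:
$$(1-\eta)\,p_{s-1} + \eta\,(1-p_{s-1}) = \eta + (1-2\eta)\,p_{s-1} = \eta + \onehalf(1-2\eta) + \onehalf(1-2\eta)^{s} = \onehalf + \onehalf(1-2\eta)^{s}.$$
For instance, with $\eta = 1/4$ and $s = 2$ this gives $\onehalf + \onehalf\cdot\frac{1}{4} = \frac{5}{8}$, matching the direct count $(3/4)^2 + (1/4)^2 = 10/16$.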

404:

405: The idea for the algorithm is that by drawing many more examples than

406: the minimum needed to learn information-theoretically, we will be able

407: to write basis vectors such as $(1,0,\ldots,0)$ as the sum of a

408: relatively small number of training examples --- substantially smaller

409: than the number that would result from straightforward Gaussian

410: elimination.  In particular, for the length $O(\log n \log\log n)$

411: parity problem, we will be able to write $(1,0,\ldots,0)$ as the sum

412: of only $O(\log n)$ examples.  By Lemma \ref{sumOK}, this means that,

413: for any constant noise rate $\eta < 1/2$, the corresponding sum of

414: labels will be polynomially distinguishable from random.  Hence, by

415: repeating this process as needed to boost reliability, we may

416: determine the correct label for $(1,0,\ldots,0)$, which is

417: equivalently the first bit of the target vector $c$.  This process can

418: be further repeated to determine the remaining bits of $c$, allowing

419: us to recover the entire target concept with high probability.

420:

421: To describe the algorithm for the length-$k$ parity problem, it will

422: be convenient to view each example as consisting of $a$

423: blocks, each $b$ bits long (so, $k = ab$) where $a$ and $b$ will be

424: chosen later.  We then introduce the following notation.

425:

426: \begin{definition}

427: Let $V_i$ be the subspace of $\bool^{ab}$ consisting of those

428: vectors whose last $i$ blocks have all bits equal to zero.  An

429: \ital{$i$-sample} of size $s$ is a set of $s$ vectors

430: independently and uniformly distributed over $V_i$.

431: \end{definition}

432: The goal of our algorithm will be to use labeled examples from

433: $\bool^{ab}$ (these form a $0$-sample) to create an $i$-sample such

434: that each vector in the $i$-sample can be written as a sum of at most

435: $2^i$ of the original examples, for all $i=1,2,\ldots, a-1$.  We

436: attain this goal via the following lemma.

437:

438: \begin{lemma}

439: \label{sampling}

440: Assume we are given an $i$-sample of size $s$.  We can in time

441: $O(s)$ construct an $(i+1)$-sample of size at least $s -

442: 2^b$ such that each vector in the $(i+1)$-sample is written as the sum

443: of two vectors in the given $i$-sample.

444: \end{lemma}

445:

446: \smallskip

447: \noindent

448: {\bf Proof.}

449: Let the $i$-sample be $x_1, \ldots, x_s$.  In these vectors, blocks

450: $a-i+1, \ldots, a$ are all zero.  Partition $x_1, \ldots, x_s$ based

451: on their values in block $a-i$.  This results in a partition having at

452: most $2^b$ classes.  From each nonempty class $p$, pick one vector

453: $x_{j_p}$ at random and add it to each of the other vectors in its

454: class; then discard $x_{j_p}$.  The result is a collection of vectors

455: $u_1, \ldots, u_{s'}$, where $s' \geq s - 2^b$ (since we discard at most

456: one vector per class).

457:

458: What can we say about ${u}_1, \ldots, {u}_{s'}$?  First of all, each ${u}_j$ is

459: formed by summing two vectors in $V_i$ which have

460: identical components throughout block $a-i$, ``zeroing out'' that

461: block.  Therefore, ${u}_j$ is in $V_{i+1}$.  Secondly, each

462: $u_j$ is formed by taking some $x_{j_p}$ and adding to it

463: a random vector in $V_i$, subject only to the condition that the random

464: vector agrees with $x_{j_p}$ on block $a-i$.  Therefore, each $u_j$ is

465: an independent, uniform-random member of $V_{i+1}$.  The vectors $u_1,

466: \ldots, u_{s'}$ thus form the desired $(i+1)$-sample.

467:
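The construction in this proof can be sketched in Python as follows. This is an illustration rather than the authors' code: the name \verb|zero_next_block| and the bookkeeping of source indices are hypothetical, vectors are tuples of $ab$ bits, and the representative of each class is taken to be the first vector seen rather than a randomly chosen one, which we assume is harmless because the sample is independent and identically distributed.

```python
def zero_next_block(sample, a, b, i):
    """Turn an i-sample into an (i+1)-sample by zeroing block a-i.

    sample: list of (x, sources); x is a tuple of a*b bits whose last i blocks
    are zero, and sources lists the original examples that sum to x.
    """
    lo, hi = (a - i - 1) * b, (a - i) * b  # bit range of block a-i (1-indexed)
    representative = {}
    reduced = []
    for x, sources in sample:
        key = x[lo:hi]
        if key not in representative:
            # First vector of its class: set aside, used for cancelling, then discarded.
            representative[key] = (x, sources)
            continue
        rx, rsources = representative[key]
        # Equal values in block a-i cancel under addition mod 2 (XOR).
        summed = tuple(u ^ v for u, v in zip(x, rx))
        reduced.append((summed, sources + rsources))
    return reduced
```

Each call discards at most $2^b$ vectors (one per class) and doubles the number of original examples contributing to each surviving vector, and it never inspects a label.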

468: Using this lemma, we can now prove our main theorem.

469:

470: \smallskip

471: \noindent

472: {\bf Proof of Theorem \ref{maintheorem}.}

473: Draw $a2^b$ labeled examples.  Observe that these qualify

474: as a $0$-sample.  Now apply Lemma~\ref{sampling},  $a-1$ times, to

475: construct an $(a-1)$-sample.  This $(a-1)$-sample will have size at

476: least $2^b$.  Recall that the vectors in an $(a-1)$-sample are

477: distributed independently and uniformly at random over $V_{a-1}$, and

478: notice that $V_{a-1}$ contains only $2^b$ distinct vectors, one of

479: which is $(1,0,\ldots,0)$.  Hence there is an approximately $1-1/e$

480: chance that $(1,0,\ldots,0)$ appears in our $(a-1)$-sample.  If this

481: does not occur, we repeat the above process with new labeled examples.

482: Note that the expected number of repetitions is only constant.

483:

484: Now, unrolling our applications of Lemma \ref{sampling}, observe that we

485: have written the vector $(1,0,\ldots,0)$ as the sum of $2^{a-1}$

486: of our labeled examples --- and we have done so without examining

487: their labels.  Thus the label noise is still random, and we can

488: apply Lemma~\ref{sumOK}.  Hence the sum of the labels gives us the

489: correct value of $(1,0,\ldots,0) \cdot c$ with probability

490: $\onehalf + \onehalf(1-2\eta)^{2^{a-1}}$.

491:

492: This means that if we repeat the above process using new labeled

493: examples each time for poly$((\inv{1-2\eta})^{2^a}, b)$ times, we can

494: determine $(1,0,\ldots,0) \cdot c$ with probability of error

495: exponentially small in $ab$.  In other words, we can determine the

496: first bit of $c$ with very high probability.  And of course, by

497: cyclically shifting all examples, the same algorithm may be employed

498: to find each bit of $c$.  Thus, with high probability we can determine

499: $c$ using a number of examples and total computation-time $

500: \word{poly}((\inv{1-2\eta})^{2^a}, 2^b)$.

501:
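Putting the pieces of this proof together, a hedged end-to-end Python sketch for recovering the first bit of $c$ might look as follows. The names \verb|first_bit_vote| and \verb|learn_first_bit| are hypothetical, the number of repetitions \verb|trials| is left to the caller rather than derived from $\eta$ and $a$, and labels are summed alongside the vectors so that every decision depends on the vectors alone.

```python
import random

def first_bit_vote(c, a, b, eta, rng):
    """One attempt: a noisy guess for c[0], or None if (1,0,...,0) was not produced."""
    k = a * b
    sample = []
    for _ in range(a * 2 ** b):  # draw a * 2^b labeled examples (a 0-sample)
        x = tuple(rng.randrange(2) for _ in range(k))
        label = sum(xi & ci for xi, ci in zip(x, c)) % 2
        if rng.random() < eta:  # random classification noise
            label ^= 1
        sample.append((x, label))
    for i in range(a - 1):  # zero out blocks a, a-1, ..., 2
        lo, hi = (a - i - 1) * b, (a - i) * b
        rep, nxt = {}, []
        for x, label in sample:
            key = x[lo:hi]  # classes are formed from the vectors only
            if key not in rep:
                rep[key] = (x, label)
            else:
                rx, rl = rep[key]
                nxt.append((tuple(u ^ v for u, v in zip(x, rx)), label ^ rl))
        sample = nxt
    target = (1,) + (0,) * (k - 1)
    for x, label in sample:
        if x == target:
            return label  # sum of the labels of 2^(a-1) original examples
    return None

def learn_first_bit(c, a, b, eta, trials, rng):
    """Majority vote over independent attempts, each using fresh examples."""
    votes = [v for v in (first_bit_vote(c, a, b, eta, rng) for _ in range(trials))
             if v is not None]
    return int(2 * sum(votes) > len(votes))
```

For example, \verb|learn_first_bit(c, 2, 3, 0.1, 200, random.Random())| would vote over up to 200 attempts on a length-6 parity problem; how many attempts are needed for a given confidence depends on $\eta$ and $a$ as stated in the proof.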

502: Plugging in $a = \frac{1}{2}\lg k$ and $b = 2k/\lg k$ yields the

503: desired $2^{O(k/\log k)}$ bound for constant noise rate $\eta$.

504:
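The arithmetic behind this choice of parameters can be checked as follows (a sketch, treating $\eta$ as a constant and writing $\lg$ for $\log_2$). With $a = \frac{1}{2}\lg k$ and $b = 2k/\lg k$ we have $ab = k$ and
$$2^a = \sqrt{k}, \qquad \left(\inv{1-2\eta}\right)^{2^a} = 2^{O(\sqrt{k})}, \qquad 2^b = 2^{2k/\lg k}.$$
Since $\sqrt{k} = o(k/\lg k)$, any fixed polynomial in these two quantities is $2^{O(k/\log k)}$. In particular, for $k = \Theta(\log n\log\log n)$ we have $\lg k = \Theta(\log\log n)$, so $k/\lg k = O(\log n)$ and the bound becomes $\word{poly}(n)$.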

505: \subsection{Extension to other distributions}

506: While the uniform distribution is in this case the most interesting,

507: we can extend our algorithm to work over any distribution.  In fact,

508: it is perhaps easiest to think of this extension as an online learning

509: algorithm that is presented with an arbitrary sequence

510: of examples, one at a time.  Given a new test example, the algorithm

511: will output either ``I don't know'', or else will give a prediction of

512: the label.  In the former case, the algorithm is told the correct

513: label, flipped with probability $\eta$.  The claim is that the

514: algorithm will, with high probability, be correct in all its

515: predictions, and furthermore will output ``I don't know'' only a

516: limited number of times.  In the coding-theoretic view,

517: this corresponds to producing a $1 - o(1)$ fraction of the

518: desired codeword, where the remaining entries are left blank.  This

519: allows us to recover the full codeword so long as no other codeword is

520: within relative distance $o(1)$.

521:

522: The algorithm is essentially a form of Gaussian elimination, but where

523: each entry in the matrix is an element of the vector space $\Fpow{b}$

524: rather than an element of the field $\Ftwo$.  In particular, instead

525: of choosing a row that begins with a 1 and subtracting it from all

526: other such rows, what we do is choose one row for each initial $b$-bit

527: block observed: we then use these (at most $2^b-1$) rows to zero out

528: all the others.  We then move on to the next $b$-bit block.  If we

529: think of this as an online algorithm, then each new example seen

530: either gets captured as a new row in the matrix (and there are at most

531: $a(2^b-1)$ of them) or else it passes all the way through the matrix

532: and is given a prediction.  We then do this with multiple matrices and

533: take a majority vote to drive down the probability of error.

534:

535: For concreteness, let us take the case of $n$ examples, each $k$ bits

536: long for $k = \frac{1}{4}\lg n (\lg\lg n - 2)$, and $\eta = 1/4$. We view each

537: example as consisting of $(\lg\lg n - 2)$ blocks, where each block has width

538: $\frac{1}{4}\lg n$.  We now create a series of matrices $M_1,

539: M_2, \ldots$ as follows.

540: Initially, the matrices are all empty.

541: Given a new example, if its first block does not match the first block

542: of any row in $M_1$, we include it as a new row of $M_1$ (and output

543: ``I don't know'').  If the

544: first block {\em does} match, then we subtract that row from it

545: (zeroing out the first block of our example) and consider the second

546: block.  Again, if the second block does not match any row in $M_1$ we

547: include it as a new row (and output ``I don't know''); otherwise, we subtract that row and consider

548: the third block and so on.  Notice that each example will either be

549: ``captured'' into the matrix $M_1$ or else gets completely zeroed out

550: (i.e., written as a sum of rows of $M_1$).  In the latter case, we

551: have written the example as a sum of at most $2^{\lg\lg n - 2}

552: = \frac{1}{4}\lg n$ previously-seen examples, and therefore the sum

553: of their labels is correct with probability at least $\frac{1}{2}(1 +

554: 1/n^{1/4})$.  To amplify this probability, instead of making a

555: prediction we put the example into a new matrix $M_2$, and so on up to

556: matrix $M_{n^{2/3}}$.   If an example passes through {\em all}

557: matrices, we can then state that the majority vote is correct with

558: high probability.  Since each matrix has at most $2^{\frac{1}{4}\lg

559: n}(\lg\lg n - 2)$ rows, the total number of examples on which we fail

560: to make a prediction is at most $n^{11/12}\lg\lg n = o(n)$.
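
As a sanity check on these parameters, the arithmetic can be written out
explicitly.  This is only a sketch: it takes Lemma~\ref{sumOK} to say that a
sum of $s$ noisy labels is correct with probability
$\frac{1}{2} + \frac{1}{2}(1-2\eta)^s$, and it treats the votes coming from
different matrices as independent so that a Hoeffding bound applies.  With
$\eta = 1/4$ and $s \leq 2^{\lg\lg n - 2} = \frac{1}{4}\lg n$,
$$\frac{1}{2} + \frac{1}{2}(1-2\eta)^{s} \;\geq\;
\frac{1}{2} + \frac{1}{2}\cdot 2^{-\frac{1}{4}\lg n} \;=\;
\frac{1}{2}\left(1 + \frac{1}{n^{1/4}}\right).$$
A majority vote over $m = n^{2/3}$ such predictions, each with advantage
$\gamma = \frac{1}{2}n^{-1/4}$, then errs with probability at most
$$e^{-2m\gamma^2} \;=\; e^{-n^{1/6}/2},$$
and the number of examples captured by some matrix is at most
$$n^{2/3}\cdot 2^{\frac{1}{4}\lg n}(\lg\lg n - 2) \;=\;
n^{11/12}(\lg\lg n - 2) \;\leq\; n^{11/12}\lg\lg n.$$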

561:

562: \comment{

563: \begin{theorem}

564: Over an arbitrary distribution, the length-$ab$ parity problem can also be solved by

565: an algorithm whose number of samples and total computation-time are

566: $\,\word{poly}\!\left(\left(\inv{1-2\eta}\right)^{2^a}, 2^b\right)$.

567: \end{theorem}

568:

569: \smallskip

570: \noindent

571: {\bf Proof.}

572: Over an

573: arbitrary distribution, there may be several parity functions with low error,

574: and our goal is to pick one of these.

575: Previously, we tried

576: to write $(1,0,\ldots,0)$ as the sum of $2^{a-1}$ examples.  Over an arbitrary

577: distribution, this may not be possible.  Instead,

578: we pick a random example and write it as the sum of $2^a-1$ examples.  As

579: before, we repeat this process

580: so that we again have probability of error exponentially small in $ab$.

581: This enables us to correctly label test examples with high

582: probability.  With this ability, we can correctly label

583: $ab/(\epsilon\delta)$ examples and then apply standard noiseless

584: learning techniques to find a low-error parity hypothesis.

585:

586: We use a similar technique to the uniform distribution case to write an arbitrary

587: example $x$ as the sum of

588: $2^a-1$ examples.  We take $s$ random examples, and we add $x$ to this set.

589: We now use the procedure of Lemma~\ref{sampling}, $a$ times (rather than $a-1$ times), to write $(0,0,\ldots,0)$ as

590: the sum of $2^a$ examples.  This is slightly easier than before, because all

591: the elements of our $a$-sample are $(0,0,\ldots,0)$, rather than

592: having to wait for an element of an $(a-1)$-sample which is

593: $(1,0,\ldots,0)$. Of course, we technically no longer have $i$-samples

594: in the sense that they are not uniformly distributed.

595:

596: Regardless, consider the $a$-sample generated at the end.

597: Each element of this sample is the sum of $2^a$ examples, and we can

598: uniquely identify an element of the sample by the first example used in this

599: sum, because our initial $0$-sample had one element for each example.

600: Furthermore, since we have at least $s+1-a2^b$ elements of this $a$-sample,

601: and $x$ is a random example treated as any other, with probability at least

602: $1 - a2^b/s$, we still have the element of the $a$-sample corresponding to $x$.

603: In this case, we can write $x$ as the sum of the remaining $2^a-1$ examples in

604: its sample.  As before, we will repeat this process

605: $r=\word{poly}((\inv{1-2\eta})^{2^a}, b)$ times, using new labeled examples

606: each time, to determine $x \cdot c$ with probability

607: of error exponentially small in $ab$.  The probability that we could not write

608: $x$ as the sum of $2^a-1$ examples in any of these $r$ repetitions is less

609: than $ra2^b/s$, which we can make sufficiently small by also choosing

610: $s=\word{poly}((\inv{1-2\eta})^{2^a}, b)$.

611: }

612:

613: \subsection{Discussion}

614:

615: Theorem~\ref{maintheorem} demonstrates that we can

616: solve the length-$n$ parity learning problem

617: in time $2^{o(n)}$.  However, it must be emphasized that we accomplish

618: this by using $2^{O(n/\log n)}$ labeled examples.  From the point of

619: view of coding theory, it would be useful to have an algorithm which takes time

620: $2^{o(n)}$ and number of examples $\word{poly}(n)$ or even $O(n)$.  We

621: do not know if this can be done.  Also of interest is the question of

622: whether our time-bound can be improved from $2^{O(n/\log n)}$ to, for

623: example, $2^{O(\sqrt{n}\,)}$.
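
To see where the bound $2^{O(n/\log n)}$ comes from, one can work through the
parameters under two assumptions that are not restated in this section: that
the cost of the algorithm has the form
$\word{poly}\!\left(\left(\inv{1-2\eta}\right)^{2^a}, 2^b\right)$ whenever
$ab \geq n$, and that we choose, for illustration, $a = \frac{1}{2}\lg n$ and
$b = 2n/\lg n$.  Then
$$ab = n, \qquad 2^a = \sqrt{n}, \qquad 2^b = 2^{2n/\lg n},$$
so for any constant $\eta < 1/2$ the first argument is $2^{O(\sqrt{n}\,)}$ and
the second argument dominates, giving time and number of examples
$2^{O(n/\log n)}$.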

624:

625: It would also be desirable to reduce our algorithm's dependence on

626: $\eta$.  This dependence comes from Lemma \ref{sumOK}, with $s = 2^{a-1}$.

627: For instance,  consider the problem of learning parity functions

628: that depend on the first $k$ bits of input for $k = O(\log n\log \log

629: n)$. In this case, if we set $a=\lceil \frac{1}{2}\lg\lg n \rceil$ and

630: $b = O(\log n)$, the running time is polynomial in $n$, with

631: dependence on $\eta$ of $(\inv{1-2\eta})^{\sqrt{\log n}}$.  This

632: allows us to handle $\eta$ as large

633: as $1/2 - 2^{-\sqrt{\log n}}$ and still have polynomial running time.

634: While this can be improved slightly,

635: we do not know how to solve

636: the length-$O(\log n \log \log n)$ parity problem in polynomial time

637: for $\eta$ as large as $1/2 - 1/n$ or even $1/2 - 1/n^\varepsilon$.

638: What makes this interesting is that it is an open question (Kearns,

639: personal communication) whether noise tolerance can in general be

640: boosted; this example suggests why such a result may be

641: nontrivial.
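
The threshold on $\eta$ quoted above can be checked directly.  The following
is a sketch, taking all logarithms base 2 and assuming that the relevant
quantity is the inverse of the advantage $(1-2\eta)^s$ from
Lemma~\ref{sumOK}.  Since
$a = \lceil \frac{1}{2}\lg\lg n \rceil < \frac{1}{2}\lg\lg n + 1$, we have
$s = 2^{a-1} < \sqrt{\lg n}$.  For $\eta = 1/2 - 2^{-\sqrt{\lg n}}$ we have
$1 - 2\eta = 2^{1-\sqrt{\lg n}}$, and so for $n \geq 2$,
$$\left(\inv{1-2\eta}\right)^{s} \;\leq\;
2^{(\sqrt{\lg n}-1)\sqrt{\lg n}} \;\leq\; 2^{\lg n} \;=\; n,$$
which is polynomial.  By contrast, for $\eta = 1/2 - 1/n$ the same quantity
is $(n/2)^{s} = 2^{s(\lg n - 1)}$, which is superpolynomial whenever $s$ grows
with $n$.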

642:

643: \comment{

644: \begin{corollary}

645: \label{highnoise}

646: Let $\varepsilon$ be any positive constant.  Using $a =

647: \lceil\varepsilon\lg\lg n\rceil$, $b = O(\log n)$ in

648: Theorem~\ref{mainab}, we find that the parity problem of length

649: $O(\log n \log\log n)$, for noise rate $\eta \leq 1/2 - 2^{-(\lg

650: n)^{1-\varepsilon}}$, can be solved with number of samples and total

651: computation-time $\word{poly}(n)$.

652: \end{corollary}

653: }

654:

655: \section{Limits of $O(\log n)$-wise Queries}

656:

657: We return to the general problem of learning a target concept $\cc$

658: over a space of examples with a fixed distribution $\cal D$.  A

659: limitation of the statistical query model is that it permits only what

660: may be called \ital{unary} queries.  That is, an SQ algorithm can

661: access $\cc$ only by requesting approximations of probabilities of

662: form $\pr{x}{\qq(x,\cc(x))}$, where $x$ is $\cal D$-random and $\qq$

663: is a polynomially evaluable predicate.  A natural question is whether

664: problems not learnable from such queries can be learned, for example,

665: from binary queries: i.e., from probabilities of form

666: $\pr{x_1,x_2}{\qq(x_1,x_2,\cc(x_1),\cc(x_2))}$.  The following theorem

667: demonstrates that this is not possible, proving that $O(\log n)$-wise

668: queries are no better than unary queries, at least with respect to

669: weak-learning.

670:

671: We assume in the discussion below that all algorithms also have access

672: to individual \ital{unlabeled} examples from distribution $\DD$, as is

673: usual in the SQ model.

674:

675: \begin{theorem}

676: \label{lognogood}

677: Let $k = O(\log n)$, and assume that there exists a $\word{poly}(n)$-time

678: algorithm using $k$-wise statistical queries which weakly learns a concept

679: class $C$ under distribution $\cal D$.  That is, this algorithm learns from

680: approximations of $\pr{\vec{x}}{\qq(\vec{x},\cc(\vec{x}))}$, where $\qq$ is a

681: polynomially evaluable predicate, and $\vec{x}$ is a $k$-tuple of examples.

682: Then there exists a $\word{poly}(n)$-time algorithm which weakly learns the

683: same class using only unary queries, under $\cal D$.

684: \end{theorem}

685:

686: \smallskip

687: \noindent

688: {\bf Proof.}

689: We are given a $k$-wise query

690: $\pr{\vec{x}}{\qq(\vec{x},\cc(\vec{x}))}$.  The first thing our

691: algorithm will do is use $Q$ to construct several candidate weak

692: hypotheses.  It then tests whether each of these hypotheses is in fact

693: noticeably correlated with the target

694: using unary statistical queries.  If none of them appear to be good,

695: it uses this fact to

696: estimate the value of the $k$-wise query.  We prove that for any

697: $k$-wise query, with high probability we either succeed in finding a

698: weak hypothesis or we output a good estimate of the $k$-wise query.

699:

700: For simplicity, let us assume that $\pr{x}{c(x) = 1} = 1/2$; i.e., a

701: random example is equally likely to be positive or negative.  (If

702: $\pr{x}{c(x) = 1}$ is far from $1/2$ then weak-learning is easy by

703: just predicting all examples are positive or all examples are

704: negative.)  This assumption implies that if a hypothesis $h$ satisfies

705: $|\pr{x}{h(x) = 1 \wedge c(x) = 1} - \frac{1}{2}\pr{x}{h(x) = 1}| \geq

706: \epsilon$, then either $h(x)$ or $1 - h(x)$ is a weak hypothesis.
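
This implication can be verified by a short calculation.  Write
$p = \pr{x}{h(x) = 1 \wedge c(x) = 1}$ and $q = \pr{x}{h(x) = 1}$ (symbols
introduced for this remark only).  Using $\pr{x}{c(x) = 1} = 1/2$,
\begin{eqnarray*}
\pr{x}{h(x) = c(x)} &=& p + \pr{x}{h(x) = 0 \wedge c(x) = 0} \\
&=& p + \left(\frac{1}{2} - (q - p)\right) \\
&=& \frac{1}{2} + 2\left(p - \frac{1}{2}q\right).
\end{eqnarray*}
Hence $|p - \frac{1}{2}q| \geq \epsilon$ gives
$|\pr{x}{h(x) = c(x)} - \frac{1}{2}| \geq 2\epsilon$, so one of $h(x)$ and
$1 - h(x)$ agrees with $c$ with probability at least $\frac{1}{2} + 2\epsilon$.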

707:

708: We now generate a set of candidate hypotheses by choosing one random

709: $k$-tuple of

710: unlabeled examples $\vec{z}$.  For each $1 \leq i \leq k$ and $\vec{\ell} \in

711: \{0,1\}^k$, we hypothesize

712: $$h_{\vec{z},i,\vec{\ell}}(x) =

713: Q(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell}),$$

714: and then use a unary statistical query to

715: tell if $h_{\vec{z},i,\vec{\ell}}(x)$ or

716: $1-h_{\vec{z},i,\vec{\ell}}(x)$ is a weak hypothesis.  As noted above,

717: we will have found a weak hypothesis if

718: %\setlength{\multlinegap}{0 in}

719: $$

720: \left|\pr{x}{\qq(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell})

721: \wedge \cc(x)=1} -

722: \frac{1}{2}\pr{x}{\qq(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell})}\right| \geq

723: \epsilon.

724: $$

725: We repeat this process for $O(1/\epsilon)$

726: randomly chosen $k$-tuples $\vec{z}$.  We now consider two cases.

727:

728: {\bf Case I:} Suppose that the $i$th label matters to the $k$-wise

729: query $Q$ for some $i$ and

730: $\vec{\ell}$.  By this we mean there is at least an $\epsilon$ chance of the

731: above inequality holding for random $\vec{z}$.  Then with high probability we

732: will discover such a $\vec{z}$ and thus weak learn.
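
To quantify ``with high probability'' here (a sketch; $m$ and $\delta$ are
symbols introduced for this remark): if a random $\vec{z}$ satisfies the
inequality with probability at least $\epsilon$, then $m$ independently chosen
$k$-tuples all fail to satisfy it with probability at most
$$(1-\epsilon)^m \;\leq\; e^{-\epsilon m},$$
which is at most $\delta$ once $m \geq \frac{1}{\epsilon}\ln\frac{1}{\delta}$.
This calculation ignores the tolerance of the unary queries used to test the
inequality.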

733:

734: {\bf Case II:} Suppose, on the contrary, that for no $i$ or $\vec{\ell}$ does

735: the $i$th label matter, i.e.\ the probability of a random $\vec{z}$

736: satisfying the above inequality is less than $\epsilon$.   This means

737: that

738: \begin{eqnarray*}

739: {\bf E}_{\vec{z}}\left[\left|\pr{x}{\qq(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell})

740: \wedge \cc(x)=1} -  \right. \right. \\

741: \left. \left.

742: \frac{1}{2}\pr{x}{\qq(z_1,\ldots,z_{i-1},x,z_{i+1},\ldots,z_k,\vec{\ell})}\right|\right] <

743: 2\epsilon.

744: \end{eqnarray*}
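
One way to obtain this bound is the following sketch.  Let $X_{\vec{z}}$
denote the absolute value inside the expectation (notation introduced here).
Since the first probability lies between $0$ and the second, which is at most
$1$, we have $0 \leq X_{\vec{z}} \leq \frac{1}{2}$.  Splitting on whether
$X_{\vec{z}} \geq \epsilon$, an event of probability less than $\epsilon$,
$${\bf E}_{\vec{z}}\left[X_{\vec{z}}\right] \;\leq\;
\epsilon \cdot \pr{\vec{z}}{X_{\vec{z}} < \epsilon} +
\frac{1}{2}\,\pr{\vec{z}}{X_{\vec{z}} \geq \epsilon}
\;<\; \epsilon + \frac{\epsilon}{2} \;<\; 2\epsilon.$$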

745: By bucketing the $\vec{z}$'s according to the values of $c(z_1)$,

746: $\ldots$, $c(z_{i-1})$ we see that the above implies that

747: for all $b_1, \ldots, b_{i-1}$ $\in$

748: $\{0,1\},$

749: \begin{eqnarray*}

750: \left|\pr{\vec{z}}{\qq(\vec{z},\vec{\ell}) \wedge c(z_1)=b_1 \wedge

751: \ldots \wedge c(z_{i-1})=b_{i-1} \wedge c(z_i)=1}

752:  - \right. \\

753: \left. \frac{1}{2}\pr{\vec{z}}{\qq(\vec{z},\vec{\ell}) \wedge c(z_1)=b_1 \wedge \ldots \wedge

754: c(z_{i-1})=b_{i-1}} \right| <

755: 2\epsilon.

756: \end{eqnarray*}

757: By a straightforward inductive argument

758: on $i$, we conclude that for every $\vec{b} \in \{0,1\}^k$, $$\left|\pr{\vec{z}}{\qq(\vec{z},\vec{\ell}) \wedge

759: c(\vec{z})=\vec{b}} -

760: \frac{1}{2^k}\pr{\vec{z}}{\qq(\vec{z},\vec{\ell})}\right| <

761: 4\epsilon(1-\frac{1}{2^k}).$$
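
The inductive argument can be sketched as follows.  Write
$P_i(b_1,\ldots,b_i)$ for
$\pr{\vec{z}}{\qq(\vec{z},\vec{\ell}) \wedge c(z_1)=b_1 \wedge \ldots \wedge c(z_i)=b_i}$
(notation introduced here), so that
$P_0 = \pr{\vec{z}}{\qq(\vec{z},\vec{\ell})}$.  The preceding bound states that
$|P_i(b_1,\ldots,b_{i-1},1) - \frac{1}{2}P_{i-1}| < 2\epsilon$, and the same
bound holds with $b_i = 0$ because
$P_i(b_1,\ldots,b_{i-1},0) = P_{i-1} - P_i(b_1,\ldots,b_{i-1},1)$.
If $|P_{i-1} - 2^{-(i-1)}P_0| \leq 4\epsilon(1 - 2^{-(i-1)})$, which holds
trivially for $i - 1 = 0$, then
\begin{eqnarray*}
\left|P_i - 2^{-i}P_0\right| &\leq&
\left|P_i - \frac{1}{2}P_{i-1}\right| +
\frac{1}{2}\left|P_{i-1} - 2^{-(i-1)}P_0\right| \\
&<& 2\epsilon + 2\epsilon\left(1 - 2^{-(i-1)}\right)
\;=\; 4\epsilon\left(1 - 2^{-i}\right).
\end{eqnarray*}
Taking $i = k$ gives the stated inequality.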

762: This fact now allows us to estimate our desired $k$-wise query

763: $\pr{\vec{z}}{\qq(\vec{z},\cc(\vec{z}))}$.  In particular,

764: $$\pr{\vec{z}}{\qq(\vec{z},\cc(\vec{z}))} = \sum_{\vec{\ell} \in \{0,1\}^k}

765: \pr{\vec{z}}{\qq(\vec{z},\vec{\ell}) \wedge \cc(\vec{z})=\vec{\ell}}.$$

766: We  approximate each of the $2^k= \word{poly}(n)$ terms corresponding to a

767: different $\vec{\ell}$ by using {\em unlabeled} data to estimate

768: $\frac{1}{2^k}\pr{\vec{z}}{Q(\vec{z},{\vec{\ell}})}$.   Adding up

769: these terms gives us a good estimate of

770: $\pr{\vec{z}}{\qq(\vec{z},\cc(\vec{z}))}$ with high

771: probability.
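
The accumulated error can be bounded as follows.  This is a sketch: $\tau$
denotes the tolerance required of the $k$-wise query (a symbol not used
above), and the sampling error in the unlabeled estimates is ignored.
Applying the previous inequality with $\vec{b} = \vec{\ell}$ and summing over
all $\vec{\ell} \in \{0,1\}^k$,
$$\left|\pr{\vec{z}}{\qq(\vec{z},\cc(\vec{z}))} -
\sum_{\vec{\ell} \in \{0,1\}^k}
\frac{1}{2^k}\pr{\vec{z}}{\qq(\vec{z},\vec{\ell})}\right|
\;<\; 2^k \cdot 4\epsilon\left(1 - \frac{1}{2^k}\right)
\;=\; 4\epsilon\,(2^k - 1).$$
Thus it suffices to take $\epsilon \leq \tau/(4 \cdot 2^k)$, which is still
inverse polynomial because $2^k = \word{poly}(n)$.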

772:

773:

774: \subsection{Discussion}

775:

776: In the above proof, we saw that either the data is statistically

777: ``homogeneous'' in a way which allows us to simulate the original

778: learning algorithm with unary queries, or else we discover a

779: ``heterogeneous'' region which we can exploit with an alternative

780: learning algorithm using only unary queries.  Thus any concept class

781: that can be learned from $O(\log n)$-wise

782: queries can also be weakly learned from unary queries.  Note that Aslam and

783: Decatur \cite{AslamDe93} have shown that weak-learning statistical

784: query algorithms can be boosted to strong-learning algorithms, if they

785: weak-learn over \ital{every} distribution.  Thus, any concept class

786: which can be

787: (weakly or strongly) learned from $O(\log n)$-wise queries over

788: \ital{every} distribution can be strongly learned over every

789: distribution from unary queries.

790:

791: It is worth noting here that $k$-wise queries can be used to

792: solve the length-$k$ parity problem.  One simply asks, for each $i \in

793: \{1, \ldots, k\}$, the query:

794: ``what is the probability that $k$ random examples form a basis for

795: $\bool^k$ and,

796: upon performing Gaussian elimination, yield a target concept whose

797: $i$th bit is equal to 1?''  Thus, $k$-wise

798: queries cannot be reduced to unary queries for $k = \omega(\log n)$.

799: On the other hand, it is not at all clear how to simulate such queries

800: in general from noisy examples.
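
For a sense of scale, suppose (as an assumption for this remark only) that
examples are uniform over $\bool^k$ and that labels are noise-free.  The
probability that $k$ random examples form a basis is
$$\prod_{j=1}^{k}\left(1 - 2^{-j}\right) \;>\; 0.288,$$
independently of the target concept.  Whenever the examples do form a basis,
Gaussian elimination recovers the target exactly, so the $i$th query has value
equal to this product if the $i$th bit of the target is 1 and value 0
otherwise.  Any query tolerance below about $0.14$ therefore suffices to
read off each bit.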

801:

802: \comment{

803: \section{Limits of O(log {n})-wise Queries}

804:

805: We return to the general problem of learning a target concept $\cc$

806: over a space of examples with a fixed distribution $\cal D$.  A

807: limitation of the statistical query model is that it permits only what

808: may be called \ital{unary} queries.  That is, an SQ algorithm can

809: access $\cc$ only by requesting approximations of probabilities of

810: form $\pr{x}{\qq(x,\cc(x))}$, where $x$ is chosen from $\cal D$ and $\qq$

811: is a polynomially evaluable predicate.  A natural question is whether

812: problems not learnable from such queries can be learned, for example,

813: from binary queries: i.e., from probabilities of form

814: $\pr{x_1,x_2}{\qq(x_1,\cc(x_1),x_2,\cc(x_2))}$.  The following theorem

815: demonstrates that this is not possible: $O(\log n)$-wise

816: queries are no better than unary queries, at least with respect to

817: weak-learning.

818:

819: We assume in the discussion below that all algorithms also have access

820: to individual \ital{unlabeled} examples from distribution $\DD$, as is

821: usual in the SQ model.

822:

823: \begin{theorem}

824: \label{lognogood}

825: Let $k = O(\log n)$, and assume that there exists a

826: $\word{poly}(n)$-time algorithm using $k$-wise statistical queries

827: which weakly learns a concept class $C$ under distribution $\cal D$.

828: That is, this algorithm learns from approximations of

829: probabilities of form $\pr{x_1,\ldots, x_k}{\qq(x_1,\cc(x_1),

830: \ldots, x_k,\cc(x_k))}$, where $\qq$ is a polynomially evaluable predicate.

831: Then there exists a

832: $\word{poly}(n)$-time algorithm which weakly learns the same class

833: using only unary queries.

834: \end{theorem}

835:

836:

837:

838: \smallskip

839: \noindent

840: {\bf Proof.}

841: The original algorithm has access to approximations, correct to

842: plus-or-minus any desired polynomial fraction, of probabilities of

843: form

844: $$\pr{x_1,\ldots,x_k}{\qq(x_1,\cc(x_1),\ldots,x_k,\cc(x_k))}$$

845: (where all probabilities are over the given distribution $\DD$).  We

846: now consider the problem of simulating such a $k$-wise query using

847: only unary queries.  What we will show is that either our simulation

848: succeeds, or else in failing it finds a unary query which distinguishes

849: the target function from random; in the latter case, the discovered query

850: can then be used directly for weak-learning.

851:

852: The above probability can be rewritten as:

853: \begin{equation}

854: \sum_{\ell_1,\ldots,\ell_k \in \bool}

855: \Pr_{x_1,\ldots,x_k}\big[\cc(x_1)=\ell_1 \ \andd\ldots\andd\

856: \cc(x_k)=\ell_k

857: \andd \qq(x_1,\ell_1,\ldots,x_k,\ell_k)\big].

858: \end{equation}

859: This is a sum of $2^k = \word{poly}(n)$ probabilities, so we can

860: approximate each constituent probability separately.  Hence

861: let us fix $\ell_1,\ldots,\ell_k$ and focus hereafter on

862: approximating a probability of form:

863: \begin{equation}

864: \label{conjunction}

865: \pr{x_1,\ldots,x_k}{\cc(x_1)=\ell_1 \ \andd\ldots\andd\ \cc(x_k)=\ell_k \ \andd\

866: \qq(x_1,\ell_1,\ldots,x_k,\ell_k)}.

867: \end{equation}

868: This probability can in turn be rewritten as the product of $k+1$

869: constituent probabilities, namely:

870: \begin{eqnarray*}

871: \lefteqn{\pr{x_1,\ldots,x_k}{\qq(x_1,\ell_1,\ldots,x_k,\ell_k)}

872: \cdot} \\

873: &&

874: \prod_{i=1}^k

875: \pr{x_1,\ldots,x_k}{\cc(x_i) = \ell_i \,\mid\, \cc(x_1) =

876: \ell_1 \ \andd\cdots\andd\ \cc(x_{i-1}) = \ell_{i-1} \ \andd\

877: \qq(x_1,\ell_1,\ldots,x_k,\ell_k)}.

878: \end{eqnarray*}

879: Once again, we will approximate the constituent probabilities

880: individually.  We start by approximating

881: $\pr{x_1,\ldots,x_k}{\qq(x_1,\ell_1,\ldots,x_k,\ell_k)}$  (this is

882: easy, as the probability does not depend on $\cc$), then proceed in

883: order from $i = 1$ up to $k$.

884: If at any point the product of the probabilities calculated so far is

885: very small, we halt, returning zero as our approximation

886: for~(\ref{conjunction}).

887:

888: Hereafter we fix $i$ and focus on approximating

889: a probability of form:

890: \begin{equation}

891: \label{conditional}

892: \pr{x_1,\ldots,x_k}{\cc(x_i) = \ell_i \,\mid\, \cc(x_1) =

893: \ell_1 \ \andd\cdots\andd\ \cc(x_{i-1}) = \ell_{i-1} \ \andd\

894: \qq(x_1,\ell_1,\ldots,x_k,\ell_k)}.

895: \end{equation}

896: To approximate this value, we sample from $\DD$ to generate

897: $\bar{z}^{(1)}, \ldots, \bar{z}^{(t)}$, a list of $(k-1)$-tuples of

898: unlabeled examples, where $t$ is of large polynomial size.  Each

899: $\bar{z}^{(j)}$ is of the form $(z^{(j)}_1, \ldots, z^{(j)}_{i-1},

900: z^{(j)}_{i+1}, \linebreak[1] \ldots, z^{(j)}_k)$, where each

901: $z^{(j)}_m$ is a $\cal D$-random unlabeled example.  (We think of each

902: $\bar{z}^{(j)}$ as specifying values for all of $x_1,\ldots,x_k$

903: except $x_i$.)  Corresponding to each $\bar{z}^{(j)}$ we introduce

904: probabilities $S^{(j)}$ and $T^{(j)}$, defined as follows:

905: \begin{eqnarray*}

906: S^{(j)} &:=& \pr{x_i}{\qq(z^{(j)}_1,\ell_1, \ldots, x_i,\ell_i,

907: \ldots, z^{(j)}_k,\ell_k)},\\

908: T^{(j)} &:=& \pr{x_i}{\cc(x_i)=\ell_i \ \andd\

909: \qq(z^{(j)}_1,\ell_1, \ldots, x_i,\ell_i, \ldots, z^{(j)}_k,\ell_k)}.

910: \end{eqnarray*}

911: Note that we can efficiently approximate each of the probabilities

912: $S^{(j)}$ and $T^{(j)}$: indeed, $S^{(j)}$ does not depend on the

913: target concept, while $T^{(j)}$ requires only a unary query.  Now

914: consider the fraction

915: \begin{equation}

916: \label{fraction}

917: \frac{\sum_{j \in {\cal R}} T^{(j)}}{\sum_{j \in {\cal R}} S^{(j)}}\ ,

918: \end{equation}

919: where the summation is over ${\cal R}

920: = \{ j \colon\:

921: \cc(z^{(j)}_m) = \ell_m \word{ for all $m < i$} \}$. Note that our

922: algorithm cannot tell which $j$ belong to ${\cal R}$ and which do not,

923: because we do not have direct access to $c$; nonetheless, we may

924: assume that $|{\cal R}|$ is not too small and indeed that

925: the denominator of this fraction is not close to zero.  The reason is

926: that if this denominator were small, that would (with high

927: probability) imply that

928: $\pr{x_1,\ldots,x_k}{\cc(x_1)=\ell_1 \ \andd\ldots\andd\

929: \cc(x_{i-1})=\ell_{i-1} \ \andd\

930: \qq(x_1,\ell_1,\ldots,x_k,\ell_k)}$

931: is small, which would have caused our algorithm to halt before

932: reaching this point.

933: But observe that, if the denominator is not too small and $t$ is

934: of sufficiently large polynomial size,

935: (\ref{fraction}) will with high probability

936: be a good approximation for~(\ref{conditional}).

937: We now distinguish two cases:

938:

939: {\bf Case I:}\ \ we find that, for all $j \in \{1,\ldots,t\}$, $|2T^{(j)} -

940: S^{(j)}|$ is small.  Then the value of~(\ref{fraction}) is

941: approximately $1/2$.  (We know this to be true even though we

942: do not know which values of $j$ are in $\cal R$.)  Hence we may

943: return $1/2$ as our approximation for~(\ref{conditional}).

944:

945: {\bf Case II:}\ \ we find some $j$ such that $|2T^{(j)} - S^{(j)}|$

946: is large.  This means that

947: we have discovered a significantly large region of $\cal D$-random

948: instances, namely

949: $\{ x\colon\: \qq(z^{(j)}_1,\ell_1, \ldots, x,\ell_i, \ldots,

950: z^{(j)}_k,\ell_k)\}$, over which the probability that $\cc(x) = 1$ is

951: skewed away from $1/2$.  But then we can abandon

952: our effort to simulate the original learning algorithm, and can

953: instead use this new information to directly predict the value of

954: $\cc(x)$

955: with probability significantly greater than $1/2$.

956:

957: \subsection{Discussion}

958:

959: In the above proof, we saw that either the data is statistically

960: ``homogeneous'' in a way which allows us to simulate the original

961: learning algorithm with unary queries, or else we discover a

962: ``heterogeneous'' region which we can exploit with an alternative

963: learning algorithm using only unary queries.  Thus any concept class

964: that can be learned from $O(\log n)$-wise

965: queries can also be weakly learned from unary queries.  Note that Aslam and

966: Decatur \cite{AslamDe93} have shown that weak-learning statistical

967: query algorithms can be boosted to strong-learning algorithms, if they

968: weak-learn over \ital{every} distribution.  Thus, any concept class

969: which can be

970: (weakly or strongly) learned from $O(\log n)$-wise queries over

971: \ital{every} distribution can be strongly learned over every

972: distribution from unary queries.

973:

974: It is worth noting here that $k$-wise queries can be used to

975: solve the length-$k$ parity problem.  One simply asks, for each $i \in

976: \{1, \ldots, k\}$, the query:

977: ``what is the probability that $k$ random examples form a basis for

978: $\bool^k$ and,

979: upon performing Gaussian elimination, yield a target concept whose

980: $i$th bit is equal to 1?''  Thus, $k$-wise

981: queries cannot be reduced to unary queries for $k = \omega(\log n)$.

982: On the other hand, it is not at all clear how to simulate such queries

983: in general from noisy examples.

984: }

985:

986: \section{Conclusion}

987:

988: In this paper we have addressed the classic problem of

989: learning parity functions in the presence of random noise.  We have

990: shown that parity functions over $\booln$ can be learned in slightly

991: sub-exponential time, but only if many labeled examples are available.

992: It is to be hoped that future research may reduce both the time-bound

993: and the number of examples required.

994:

995: Our result also applies to the study of statistical query learning and

996: PAC-learning.  We have given the first known noise-tolerant

997: PAC-learning algorithm which can learn a concept class not learnable

998: by any SQ algorithm.  The separation we have established between the two

999: models is rather small: we have shown that a specific parity problem

1000: can be PAC-learned from noisy data in time $\word{poly}(n)$, as

1001: compared to time $n^{O(\log\log n)}$ for the best SQ algorithm.  This

1002: separation may well prove capable of improvement and worthy of

1003: further examination.  Perhaps more importantly, this suggests the

1004: possibility of interesting new noise-tolerant PAC-learning algorithms

1005: which go beyond the SQ model.

1006:

1007: We have also examined an extension to the SQ model in terms of

1008: allowing queries of arity $k$.  We have shown that for $k=O(\log n)$,

1009: any concept class learnable in the SQ model with $k$-wise queries is

1010: also (weakly) learnable with unary queries.  On the other hand, the

1011: results of \cite{BFJKMR94} imply this is not the case for $k =

1012: \omega(\log n)$.  An interesting open question is whether every concept

1013: class learnable from $O(\log n \log\log n)$-wise queries is also

1014: PAC-learnable in the presence of classification noise.  If so, then

1015: this would be a generalization of the first result of this paper.

1016:

1017: \newcommand{\etalchar}[1]{$^{#1}$}

1018: \begin{thebibliography}{BFJ{\etalchar{+}}94}

1019:

1020: \bibitem{AngluinLa88}

1021: D.~Angluin and P.~Laird.

1022: \newblock Learning from noisy examples.

1023: \newblock {\em Machine Learning}, 2(4):343--370, 1988.

1024:

1025: \bibitem{AslamDe93}

1026: J.~A. Aslam and S.~E. Decatur.

1027: \newblock General bounds on statistical query learning and {PAC} learning with

1028:   noise via hypothesis boosting.

1029: \newblock In {\em Proceedings of the 34th Annual Symposium on Foundations of

1030:   Computer Science}, pages 282--291, Nov. 1993.

1031:

1032: \bibitem{AslamDe98}

1033: J.~A. Aslam and S.~E. Decatur.

1034: \newblock Specification and simulation of statistical query algorithms for

1035:   efficiency and noise tolerance.

1036: \newblock {\em J.~Comput. Syst. Sci.}, 56(2):191--208, April 1998.

1037:

1038: \bibitem{BFJKMR94}

1039: A.~Blum, M.~Furst, J.~Jackson, M.~Kearns, Y.~Mansour, and S.~Rudich.

1040: \newblock Weakly learning {DNF} and characterizing statistical query learning

1041:   using {F}ourier analysis.

1042: \newblock In {\em Proceedings of the 26th Annual ACM Symposium on Theory of

1043:   Computing}, pages 253--262, May 1994.

1044:

1045: \bibitem{Decatur93}

1046: S.~E. Decatur.

1047: \newblock Statistical queries and faulty {PAC} oracles.

1048: \newblock In {\em Proceedings of the 6th Annual {ACM} Workshop on Computational

1049:   Learning Theory}. {ACM} Press, 1993.

1050:

1051: \bibitem{Decatur96}

1052: S.~E. Decatur.

1053: \newblock Learning in hybrid noise environments using statistical queries.

1054: \newblock In D.~Fisher and H.-J. Lenz, editors, {\em Learning from Data:

1055:   Artificial Intelligence and Statistics {V}.} Springer Verlag, 1996.

1056:

1057: \bibitem{Jackson00}

1058: J.~Jackson.

1059: \newblock On the Efficiency of Noise-Tolerant PAC Algorithms Derived from

1060: Statistical Queries.

1061: \newblock {\em Proceedings of the 13th Annual Workshop on Computational Learning Theory}, 2000.

1062:

1063: \bibitem{Kearns93}

1064: M.~Kearns.

1065: \newblock Efficient noise-tolerant learning from statistical queries.

1066: \newblock In {\em Proceedings of the 25th Annual {ACM} Symposium on Theory of

1067:   Computing}, pages 392--401, 1993.

1068:

1069: \bibitem{KS01}

1070: R.~Kumar and D.~Sivakumar.

1071: \newblock On polynomial approximations to the shortest lattice vector

1072: length.

1073: \newblock To appear in {\em Proceedings of the 12th Annual Symposium on

1074: Discrete Algorithms}, 2001.

1075:

1076:

1077: \end{thebibliography}

1078:

1079: \end{document}

1080: