
1: %\documentclass[12pt]{gen-j-l}

2: \documentclass[12pt]{amsart}

3: \usepackage{amsmath}

4: \usepackage{amsfonts,amssymb,amsthm}

5: \usepackage{graphicx}

6:

7: %

8: \newcommand{\beq}{\begin{equation}}

9: \newcommand{\eeq}{\end{equation}}

10: \newcommand{\bbar}{\begin{eqnarray}}

11: \newcommand{\eear}{\end{eqnarray}}

12: %

13:

14:

15: \newcommand{\thm}[2]{\begin{#1} #2 \end{#1}}

16: \newcommand{\excess}{\mathrm{excess\:}}

17: \newcommand{\sgn}{\mathrm{sgn\:}}

18: \newcommand{\realpart}{\mathrm{Re\:}}

19: \newcommand{\imagpart}{\mathrm{Im\:}}

20:

21: \newcommand{\logdet}{\log \det \Delta}

22: \newcommand{\tr}{\mathrm{tr\:}}

23: \newcommand{\diameter}{\mathrm{diameter\:}}

24: \newcommand{\area}{\mathrm{area\:}}

25:

26: \newcommand{\Sim}{\mathrm{Sim\:}}

27: \newcommand{\num}{\mathcal{N\:}}

28: %\newcommand{\arg}{\mathrm{arg\:}}

29: \newcommand{\dilatation}{\mathrm{dilatation\:}}

30: \newtheorem{theorem}{Theorem}[section]

31: \newtheorem{itheorem}{Theorem}[section]

32: \newtheorem{lemma}[theorem]{Lemma}

33: \newtheorem{ilemma}[itheorem]{Lemma}

34: \newtheorem{corollary}[theorem]{Corollary}

35: \newtheorem{conjecture}[theorem]{Conjecture}

36: \newtheorem{question}[theorem]{Question}

37: \newtheorem{claim}[theorem]{Claim}

38: \newtheorem{observation}[theorem]{Observation}

39: \newtheorem{iobservation}[itheorem]{Observation}

40: \newtheorem{remark}[theorem]{Remark}

41: \newtheorem{condition}[theorem]{Condition}

42: \newtheorem{example}[theorem]{Example}

43: \newtheorem{definition}[theorem]{Definition}

44: \newtheorem{xca}[theorem]{Exercise}

45: \newtheorem{note}[theorem]{Note}

46:

47: %\input{montreref}

48:

49: \begin{document}

50:

51: %-------------- Author entries --------------------

52: \title{The performance of the batch learning algorithm}

53: %

54:

55:

56: \author{Igor Rivin}

57: \address{Mathematics department, University of Manchester,

58: Oxford Road, Manchester M13 9PL, UK}

59: \address{Mathematics Department, Temple University,

60: Philadelphia, PA 19122}

61: \address{Mathematics Department, Princeton University, Princeton,

62: NJ 08544}

63: %

64: \email{irivin@math.princeton.edu} \thanks{The author would like

to thank the EPSRC and the NSF for support, and Natalia Komarova

66: and  Ilan Vardi for useful conversations. }

67:

68: \subjclass{60E07, 60F15, 60J20, 91E40, 26C10} \keywords{ learning

69: theory, zeta functions, asymptotics}

70: %

71: \begin{abstract}

72: We analyze completely the convergence speed of the \emph{batch

73: learning algorithm}, and compare its speed to that of the

74: memoryless learning algorithm and of learning with memory (as

75: analyzed in \cite{kr2}). We show that the batch learning

76: algorithm is never worse than the memoryless learning algorithm

77: (at least asymptotically). Its performance \emph{vis-a-vis}

learning with full memory is less clear-cut, and depends on

79: certain probabilistic assumptions.

80: \end{abstract}

81: %

82: \maketitle

83:

84: \renewcommand{\theitheorem}{\Alph{itheorem}}

85: %

86: \section*{Introduction}

87: The original motivation for the work in this paper was provided

88: by  research in learning theory, specifically in various models

89: of language acquisition (see, for example, \cite{knn,nkn,kn}). In

the paper \cite{kr2}, we studied the speed of convergence of
the \emph{memoryless learner algorithm}, and also of
\emph{learning with full memory}. Since the \emph{batch learning
algorithm} is both widely known and believed by learning theorists
to be faster (at the cost of memory) than both of the above methods,
it seemed natural to analyze its behavior
under the same set of assumptions, in order to bring the analysis
in \cite{kr1} and \cite{kr2} to a sort of closure. It should be

98: noted that the detailed analysis of the batch learning algorithm

99: is performed under the assumption of \emph{independence}, which

100: was not explicitly present in our previous work. For the

101: impatient reader we state our main result (Theorem

102: \ref{batchthm}) immediately (the reader can compare it with the

103: results on the memoryless learning algorithm and learning with

104: full memory, as summarized in Theorem \ref{mainprev}):

105: %

106: \begin{itheorem}

107: Let $N_\Delta$ be the number of steps it takes for the student

108: to have probability $1 - \Delta$ of learning the

109: concept using the batch learner algorithm. Then we have the following estimates for $N_\Delta$:

110: %

111: \begin{itemize}

112: \item

113: if the distribution of overlaps is \emph{uniform}, or more

generally, the density function satisfies
$f(1-x) = c + O(x^\delta)$ as $x \rightarrow 0,$ with $\delta, c > 0,$ then there exist positive

116: constants $C_1, C_2$ such that

117: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(C_1 <

118: \frac{N_\Delta}{(1- \Delta)^2 n} < C_2\right) = 1$$

119: %

120: %

121: \item

122: if the probability density function $f(1-x)$ is asymptotic to $c x^\beta

123: + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as $x$ approaches

124: $0$, then

125: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(c_1 <

126: \frac{N_\Delta}{|\log \Delta|n^{\frac{1}{1+\beta}}} < c_2\right) = 1,$$

127: for some positive constants $c_1, c_2$;

128: %

129: \item

130: if the asymptotic behavior is as above, but $-1 < \beta < 0$, then

131: %

132: $$\lim_{x \rightarrow \infty}  \mathbf{P}\left(\frac{1}{x} < \frac{N_\Delta}{|\log

133: \Delta| n^{1/(1+\beta)}} < x\right) = 1$$

134: \end{itemize}

135: \end{itheorem}

136: % \begin{itheorem}

137: %  Let $N_\Delta$ be the number of steps it takes

138: % for the student (with probability $1$) to have probability $1 -

139: % \Delta$ of learning the concept using the batch learner

140: % algorithm. Then we have the following estimates for $N_\Delta$:

141: % %

142: % \begin{itemize}

143: % \item

144: % If the distribution of overlaps is \emph{uniform}, or more

145: % generally, the density function $f(1-x)$  at $0$ has the form

146: % $f(x) = c + O(x^\delta),$ $\delta, c > 0,$ then $N_\Delta=|\log

147: % \Delta|\Theta(n)$

148: % %

149: % \item

150: % If the probability density function $f(1-x)$ is asymptotic to

151: % $x^\beta + O(x^{\beta - \delta}), \quad \delta, \beta > 0$, as

152: % $x$ approaches $0$, then we have $N_\Delta=|\log

153: % \Delta|\Theta(n^{1/(1+\beta)})$;

154: % %

155: % \item

156: % If the asymptotic behavior is as above, but $-1/2 < \beta < 0$,

157: % then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$

158: % %

159: % \end{itemize}

160: % \end{itheorem}

161: The plan of the paper is as follows: in this Introduction we

162: recall the learning algorithms we study; in Section \ref{mathmod}

163: we define our mathematical model; in Section 2 we recall our

164: previous results, in Section 3 we begin the analysis of the batch

165: learning algorithm, and introduce some of the necessary

166: mathematical concepts; in Sections 4-6 we analyze the three cases

167: stated in Theorem A, and we summarize our findings in Section 7.

168: \subsection*{Memoryless Learning and Learning with Full Memory}

169: The general setup is as follows: There is a collection of

170: concepts $R_0, \dots, R_n$ and words which refer to these

171: concepts, sometimes ambiguously. The teacher generates a stream

172: of words, referring to the concept $R_0$. This is not known to

173: the student, but he must learn by, at each step, guessing some

174: concept $R_i$ and checking for consistency with the teacher's

175: input.  The \emph{memoryless learner algorithm} consists of

176: picking a concept $R_i$ at random, and sticking by this choice,

177: until it is proven wrong.  At this point another concept is

178: picked randomly, and the procedure repeats. \emph{Learning with

179: full memory} follows the same general process with the important

180: difference that once a concept is rejected, the student never

181: goes back to it. It is clear (for both algorithms) that once the

182: student hits on the right answer $R_0$, this will be his final

183: answer. We would like to estimate the probability of having

guessed the right answer after $k$ steps, and also the

185: expected number of steps before the student settles on the right

186: answer.

187:

188: \subsection*{Batch Learning} The batch learning situation is

189: similar to the above, but here the student records the words

190: $w_1, \dots, w_k, \dots$ he gets from the teacher. For each word

191: $w_i$ , we assume that the student can find (in his textbook, for

192: example) a list $L_i$ of concepts referred to by the word. If we

193: define

194: \begin{equation*}

195: \mathcal{L}_k = \bigcap_{i=1}^k L_i,

196: \end{equation*}

197: then we are interested in the smallest value of $k$ such that

198: $\mathcal{L}_k = \{R_0\}$. This value $k_0$ is the time it has

199: taken the student to learn the concept $R_0$. We think of $k_0$

200: as a random variable, and we wish to estimate its expectation.
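
\medskip\noindent
\textbf{Illustration.} The following toy instance is purely hypothetical (the concepts and word lists are invented for illustration only). Suppose there are three concepts $R_0, R_1, R_2$, and the teacher's first two words have lists $L_1 = \{R_0, R_1\}$ and $L_2 = \{R_0, R_2\}$. Then
\begin{equation*}
\mathcal{L}_1 = \{R_0, R_1\}, \qquad \mathcal{L}_2 = L_1 \cap L_2 = \{R_0\},
\end{equation*}
so that $k_0 = 2$ for this particular stream of words. Note that $R_0 \in L_i$ for every $i$, since every word the teacher generates refers to $R_0$; hence $\mathcal{L}_k$ always contains $R_0$ and can only shrink as $k$ grows.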

201: %

202: \section{The mathematical model}

203: \label{mathmod}

204:  We think of the words referring to the concept

205: $R_0$ as a probability space $\mathcal{P}$. The probability that

one of these words also refers to the concept $R_i$ shall be

207: denoted by $p_i$; the probability that a word refers to concepts

208: $R_{i_1}, \dots, R_{i_k}$ shall be denoted by $p_{i_1 \dots

209: i_k}$. All the results described below (obviously) depend in a

210: crucial way on the $p_1, \dots, p_n$ and (in the case of the

211: batch learning algorithm) also on the joint probabilities. Since

212: there is no \emph{a priori} reason to assume specific values for

213: the probabilities, we shall assume that all of the $p_i$ are

214: themselves \emph{independent, identically distributed random

215: variables}. We shall refer to their common distribution as

216: $\mathcal{F}$, and to the density as $f$. It turns out that the

217: convergence properties of the various learning algorithms depend

218: on the local analytic properties of the distribution

$\mathcal{F}$ at $1$ -- a moment's reflection will convince the

220: reader that this is not really so surprising.

221:

A sharper analysis of the batch learning algorithm
depends on the \emph{independence hypothesis}:

224: $$

225: p_{i_1 \dots i_k} = p_{i_1} \dots p_{i_k}.

226: $$

227: It is again not too surprising that some such assumption on

228: correlations ought to be required for precise asymptotic results,

229: though it is obviously the subject of a (non-mathematical) debate

230: as to whether assuming that the various concepts are truly

231: independent is reasonable from a cognitive science point of view.

232:

233: \section{Previous results}

234: In previous work \cite{kr1} and \cite{kr2} we obtained the

235: following result.

236: \thm{theorem}

237: {

238: \label{mainprev}

239: Let $N_\Delta$ be the number of steps it takes for the student

240: to have probability $1 - \Delta$ of learning the

241: concept. Then we have the following estimates for $N_\Delta$:

242: %

243: \begin{itemize}

244: \item

245: if the distribution of overlaps is \emph{uniform}, or more

generally, the density function satisfies
$f(1-x) = c + O(x^\delta)$ as $x \rightarrow 0,$ with $\delta, c > 0,$ then there exist positive

248: constants $C_1, C_2, C_1', C_2'$ such that

249: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(C_1 <

250: \frac{N_\Delta}{|\log \Delta|n \log n} < C_2\right) = 1$$

251: for

252: the memoryless algorithm and

253: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(C'_1 <

254: \frac{N_\Delta}{(1- \Delta)^2 n \log n} < C'_2\right) = 1$$

255: %

256: when learning with full memory;

257: %

258: \item

259: if the probability density function $f(1-x)$ is asymptotic to $c x^\beta

260: + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as $x$ approaches

261: $0$, then for the two algorithms we have respectively

262: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(c_1 <

263: \frac{N_\Delta}{|\log \Delta|n} < c_2\right) = 1,$$

264: and

265: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(c_1' <

266: \frac{N_\Delta}{(1- \Delta)^2 n } < c_2'\right) = 1$$

267: %

268: for some positive constants $c_1, c_2, c_1', c_2'$;

269: %

270: \item

271: if the asymptotic behavior is as above, but $-1 < \beta < 0$, then

272: %

273: $$\lim_{x \rightarrow \infty}  \mathbf{P}\left(\frac{1}{x} < \frac{N_\Delta}{|\log

274: \Delta| n^{1/(1+\beta)}} < x\right) = 1$$

275: %

276: for the memoryless learning algorithm, and similarly

277: %

278: $$\lim_{x \rightarrow \infty}  \mathbf{P}\left(\frac{1}{x} < \frac{N_\Delta}{(1-\Delta)^2 n^{1/(1+\beta)}} < x\right) = 1$$

279: %

280: for learning with full memory.

281: %

282: \end{itemize}}

283: %

284: \noindent Recall that $f(x) = \Theta(g(x))$ means that for

285: sufficiently large $x$, the ratio $f(x)/g(x)$ is bounded between

286: two strictly positive constants. The distribution of overlaps

287: referred to above is simply the distribution $\mathcal{F}$.

288: Notice that the theorem says nothing about the situation when

289: $\mathcal{F}$ is supported in some interval $[0, a]$, for $a<1$.

290: That case is (presumably) of scientific interest, but

291: mathematically it is relatively trivial: we replace the arguments

292: of all the $\Theta$s above by $1$, though, of course, we are

293: thereby hiding the dependence on $a$.

294:

295: \section{General bounds on the batch learner algorithm}

296:

297: Consider a set of words $w_1, \dots, w_k$. The probability that

they all refer to the concept $R_i$ is, obviously, $p_i^k$.

299: \begin{lemma}

300: \label{bounds}

301:  The probability $q_k$ that we still have not

302: learned the concept $R_0$ after $k$ steps is bounded above by

303: $\sum_{i=1}^n p_i^k$, and below by $\max_i p_i^k$.

304: \end{lemma}

305: \begin{proof}

306: Immediate.

307: \end{proof}
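
\medskip\noindent
\textbf{Illustration.} For the reader who wants the argument of Lemma \ref{bounds} spelled out, here is one way to see it (the events $A_i$ are notation introduced only for this illustration). Let $A_i$ be the event that all of $w_1, \dots, w_k$ refer to $R_i$, so that $\mathbf{P}(A_i) = p_i^k$. The concept $R_0$ has not been learned after $k$ steps exactly when some $R_i$ with $i \geq 1$ survives in $\mathcal{L}_k$, so
\begin{equation*}
\max_{1 \leq i \leq n} p_i^k = \max_{1\leq i \leq n} \mathbf{P}(A_i) \leq q_k = \mathbf{P}\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n \mathbf{P}(A_i) = \sum_{i=1}^n p_i^k .
\end{equation*}
Neither inequality uses any assumption on the joint probabilities.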

308: We will first use these upper and lower

309: bounds to get corresponding bounds on the convergence speed of

310: the batch learner algorithm, and then invoke the independence

311: hypothesis to sharpen these bounds in many cases.

312:

313: We begin with a trivial but useful lemma.

314: %

315: \begin{lemma}

316: \label{rearrange}

317:  Let $G$ be a game where the probability of

318: success (respectively failure) after at most $k$ steps is $s_k$

(respectively $f_k = 1-s_k$), with $s_0 = 0$. Then the expected number of steps
until success is
$$\sum_{k=1}^\infty k (s_k - s_{k-1}) = \sum_{k=0}^\infty (1 - s_k) =
\sum_{k=0}^\infty f_k,$$ if the corresponding sum converges.

323: \end{lemma}

324: \begin{proof}

325: The proof is immediate from the definition of expectation and the

possibility of rearranging the terms of a positive series.

327: \end{proof}
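
\medskip\noindent
\textbf{Illustration.} As a sanity check (under the simplifying assumption, made only here, of a single competing concept with overlap $p < 1$), the failure probability after $k$ steps is $p^k$, the number of steps until success is geometrically distributed, and
\begin{equation*}
\sum_{k=1}^\infty k (1-p) p^{k-1} = \frac{1}{1-p} = \sum_{k=0}^\infty p^k,
\end{equation*}
which is the quantity $1/(1-p_i)$ appearing in Theorem \ref{sumbounds}.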

328: We can combine Lemma \ref{rearrange} and Lemma \ref{bounds} to

329: obtain:

330: \begin{theorem}

331: \label{sumbounds} The expected time $T$ of convergence of the

332: batch learner algorithm is bounded as follows:

333: \begin{equation}

334: \label{trivest} \sum_{i=1}^n \frac{1}{1-p_i} \geq T \geq

335: \max_{1\leq i \leq n} \frac{1}{1-p_i}.

336: \end{equation}

337: \end{theorem}

338: The leftmost term in equation (\ref{trivest}) has been studied at

339: length in \cite{kr1}. We state a version of the results of

340: \cite{kr1} below:

341: \begin{theorem}

\label{allstab} Let $S=\sum_{i=1}^n \frac{1}{1-p_i},$ where the
$p_i$ are independent, identically distributed random variables
with values in $[0, 1]$, with probability density $f$, such that
$f(1-x) = x^\beta + O(x^{\beta + \delta}),\quad \delta > 0$ for
$x\rightarrow 0$. If $\beta > 0$, then there exists a mean
$m$, such that $\lim_{n \rightarrow \infty} \mathbb{P}(|S/n - m|
> \epsilon) = 0,$ for any $\epsilon > 0.$ If $\beta = 0$, then
$\lim_{n \rightarrow \infty} \mathbb{P}(|S/(n\log n) - 1|
> \epsilon) = 0.$ Finally, if
$-1 < \beta < 0,$ then $\lim_{n \rightarrow \infty}
\mathbb{P}(S/n^{1/(\beta+1)} - C
> a) = g(a),$ where $\lim_{a \rightarrow \infty} g(a)= 0,$ and $C$ is
an arbitrary (but fixed) constant, and likewise
$$\lim_{n \rightarrow \infty}\mathbb{P}(S/n^{1/(\beta + 1)} < b) = h(b),$$ where $\lim_{b \rightarrow 0}h(b) = 0.$

356: \end{theorem}

357: The right hand side of Eq. (\ref{trivest}) is easier to

358: understand. Indeed, let $p_1, \dots, p_n$ be distributed as usual

359: (and as in the statement of Theorem \ref{allstab}). Then

360: %

361: \begin{theorem}\label{expmin}

362: $$\lim_{n\rightarrow \infty}

363: n^{\frac{1}{1+\beta}} \mathbf{E}\left(1-\max_{1 \leq i \leq n}

364: p_i\right) = C,$$

365: for some positive constant $C$.

366: \end{theorem}

367: \begin{proof}

First, we change variables to $q_i = 1 - p_i$. Obviously, the
statement of the Theorem is equivalent to the statement that
$$E =
\mathbf{E}(\min_{1 \leq i \leq n} q_i) \sim C  n^{-1/(1+\beta)}.$$ We
also write $h(x) = f(1-x),$ and let $H$ be the distribution function
whose density is $h,$ so that $H(x) = 1 - F(1-x).$
Now, the probability that all of the $q_i$ are
greater than $t$ equals $(1-H(t))^n,$ so the distribution function of
$\min_i q_i$ is $1-(1-H(t))^n,$ and
$$E = \int_0^1 t~d\left[1-(1-H(t))^n\right] = \int_0^1 (1-H(t))^n d t.$$
We change variables $t = u/n^{1/(1+\beta)}$, to obtain
\begin{equation}
\label{firstint} E = \frac{1}{n^{\frac{1}{1+\beta}}}
\int_0^{n^{\frac{1}{1+\beta}}} \left(1-H\left(\frac{u}{n^{1/(1+\beta)}}\right)\right)^n du.
\end{equation}
Let us write $n^{\frac{1}{1+\beta}} E = E_1(n) + E_2(n),$ where

383: \begin{gather}

384: \label{secondint}

385: E_1(n) = \int_0^{n^{\frac{1}{3 (\beta + 1)}}}

386: \left[1-H\left(\frac{u}{n^{1/(1+\beta)}}\right)\right]^n du,\\

387: E_2(n) = \int_{n^{\frac{1}{3 (\beta + 1)}}}^{n^{\frac{1}{1 + \beta}}}

\left[1-H\left(\frac{u}{n^{1/(1+\beta)}}\right)\right]^n du.

389: \end{gather}

390: Recall that

391: \begin{equation}

392: \label{asest}

393: H(x) = c x^{\beta+1} + O(x^{\beta + \delta + 1}).

394: \end{equation}

395: Let

396: \begin{equation}

397: \label{eeint}

\mathcal{I} = \int_0^\infty \exp\left(-c x^{1+\beta}\right) d x.

399: \end{equation}

400: We now show:

401: \begin{equation}

402: \label{secondint1}

403: \lim_{n \rightarrow \infty} E_1(n) = \mathcal{I}.

404: \end{equation}

405: This is an immediate consequence of Lemma \ref{explem} and Eq. (\ref{asest}).

406: Also,

407: \begin{equation}

408: \label{secondint2}

409: \lim_{n \rightarrow \infty} E_2(n) = 0.

410: \end{equation}

411: Since $H$ is a monotonically increasing function, it is sufficient to

412: show that

413: $$\lim_{n\rightarrow \infty} n^{\frac{1}{1 + \beta}}

\left[1-H\left(n^{-\frac{2}{3 (1 + \beta)}}\right)\right]^n = 0.$$

415: This is immediate from Eq. (\ref{asest}) and Lemma \ref{explem}.

416: \end{proof}

417: \begin{remark}

418: The argument shows that $C = \mathcal{I},$ where $C$ is the constant

in the statement of Theorem \ref{expmin}, and $\mathcal{I}$ is the integral

420: introduced in Eq. (\ref{eeint}).

421: \end{remark}
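
\medskip\noindent
\textbf{Illustration.} The uniform distribution gives a case where everything can be computed by hand; we include it as a check, not as part of the proof. Here $\beta = 0$, $H(t) = t$, and
\begin{equation*}
\mathbf{E}\left(\min_{1 \leq i \leq n} q_i\right) = \int_0^1 (1-t)^n d t = \frac{1}{n+1},
\end{equation*}
so that $n\,\mathbf{E}(\min_i q_i) \rightarrow 1 = \int_0^\infty e^{-u} d u,$ in agreement with Theorem \ref{expmin} with $C = 1$.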

422: \begin{lemma}

423: \label{explem}

Let $f_n(x) = (1-x/n)^n,$ and let $0 \leq z < 1/2.$ Then, for $0 \leq x \leq n^z,$

425: $$f_n(x) = \exp(-x)\left[1-\frac{x^2}{2 n} + O\left(\frac{x^3}{n^2}\right)\right].$$

426: \end{lemma}

427: \begin{proof}

428: Note that

429: $$\log f_n(x) = n \log(1-x/n) = -x - \sum_{k=2}^\infty \frac{x^k}{kn^{k-1}}.$$

430: The assertion of the lemma follows by exponentiating the two sides of

431: the above equation.

432: \end{proof}

433: We need one final observation:

434: \begin{theorem}

435: The variable $n^{1/(1+\beta)} \min_{i=1}^n q_i$ has a limiting

436: distribution with distribution function $G(x) =

437: 1-\exp(-x^{1+\beta}).$

438: \end{theorem}

439: \begin{proof}

440: Immediate from the proof of Theorem \ref{expmin}.

441: \end{proof}

442:

443:

444: We can now put together all of the above results as follows.

445: \begin{theorem}

446: \label{allgen}

Let $p_1, \dots, p_n$ be independently distributed
with common density function $f$, such that $f(1-x) = c x^\beta +
O(x^{\beta + \delta}),$ $\delta > 0$. Let $T$ be the expected
time of convergence of the batch learning algorithm with
overlaps $p_1, \dots, p_n$. If $\beta > 0$, then there
exist $C_1, C_2$, such that  $C_1 n^{1/(1+\beta)} \leq T \leq C_2
n$, with probability tending to $1$ as $n$ tends to $\infty$. If
$\beta = 0$, then there exist $C_1, C_2$, such that $C_1 n \leq T
\leq C_2 n \log n$, with probability tending to $1$ as $n$ tends
to $\infty.$ If $-1 < \beta < 0$, then $C^{-1} n^{1/(\beta + 1)} \leq
T \leq C n^{1/(\beta + 1)}$ with probability tending to $1$ as
$C$ goes to infinity.

459: \end{theorem}

460:

The reader will remark that in the case $\beta < 0$, the

462: upper and lower bounds have the same order of magnitude as

463: functions of $n$.

464:

465: \section{Independent concepts}

466: We now invoke the independence hypothesis, whereby an application of the

467: inclusion-exclusion principle gives us:

468:

469: \thm{lemma}{\label{latmost} The probability $l_k$ that we have

470:  learned the concept $R_0$ after $k$ steps is given by

471: $$

472: l_k=\prod_{i=1}^n(1-p_i^k).

473: $$

474: }

475:

476: Note that the probability $s_k$ of winning the game \emph{on the

477: $k$-th step} is given by $s_k = l_k - l_{k-1}= (1-l_{k-1}) -

478: (1-l_k)$. Since the expected number of steps $T$ to learn the

479: concept is given by

480: $$T = \sum_{k=1}^\infty k s_k,$$

481: we immediately have  $$T = \sum_{k=1}^\infty (1-l_k)$$

482: %

483: \begin{lemma}

484: \label{letime} The expected time $T$ of learning the

485: concept $R_0$ is given by

486: \begin{equation}

487: \label{letimeeq}

488: T = \sum_{k=1}^\infty \left(1-\prod_{i=1}^n

489: \left(1-p_i^k\right)\right).

490: \end{equation}

491: \end{lemma}

492: %

493: Since the sum above is absolutely convergent, we can expand the

494: products and interchange the order of summation to get the

495: following formula for $T$:

496:

497: \medskip\noindent

498: \textbf{Notation.}

499: Below, we identify subsets of $\{1, \dots, n\}$ with

multi-indices (in the obvious way), and if $s = \{i_1, \dots, i_l\},$ then

501: $$p_s \stackrel{\mbox{def}}= p_{i_1} \cdots p_{i_l}.$$

502:

503: \begin{lemma}

The expression in Eq. (\ref{letimeeq}) can be rewritten as
\begin{equation}
\label{subsum} T = \sum_{\emptyset \neq s\subseteq \{1, \dots, n\}}
(-1)^{|s|-1} \left(\frac{1}{1-p_s} - 1\right).

508: \end{equation}

509: \end{lemma}

510:

511: \begin{proof}

512: With notation as above,

513: \begin{equation*}

\prod_{i=1}^n \left(1-p_i^k\right) =

515: \sum_{s \subseteq \{1, \dots, n\}} (-1)^{|s|} p_s^k,

516: \end{equation*}

517: so

518: \begin{equation*}

519: \begin{split}

520: T &= \sum_{k=1}^\infty \left(1 - \prod_{i=1}^n

521: \left(1-p_i^k\right)\right)\\

522: &= \sum_{k=1}^\infty \left(1-\sum_{s \subseteq \{1, \dots, n\}}

523: (-1)^{|s|} p_s^k\right)\\

&= \sum_{\emptyset \neq s\subseteq \{1, \dots, n\}} (-1)^{|s|-1}
\sum_{k=1}^\infty p_s^k \\
&= \sum_{\emptyset \neq s\subseteq \{1, \dots, n\}}

527: (-1)^{|s|-1} \left(\frac{1}{1-p_s} - 1\right),

528: \end{split}

529: \end{equation*}

530: where the change in the order of summation is permissible since all

531: sums converge absolutely.

532: \end{proof}
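
\medskip\noindent
\textbf{Illustration.} For $n = 2$, the formula can be checked directly (we write it out only as an example). The nonempty subsets are $\{1\}$, $\{2\}$ and $\{1, 2\}$, and since $\frac{1}{1-x} - 1 = \frac{x}{1-x}$,
\begin{equation*}
T = \frac{p_1}{1-p_1} + \frac{p_2}{1-p_2} - \frac{p_1 p_2}{1 - p_1 p_2},
\end{equation*}
which agrees with summing $1 - (1-p_1^k)(1-p_2^k) = p_1^k + p_2^k - (p_1 p_2)^k$ over $k \geq 1$.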

533: Formula (\ref{subsum}) is useful in and of itself, but we now

534: use it to analyse the statistical properties of the time of

535: success $T$ under our distribution and independence assumptions.

536: For this we shall need to study the \emph{moment zeta function} of a

537: probability distribution, introduced below. Its detailed properties

538: are investigated in my paper \cite{zeta}, where Theorems \ref{t1},

539: \ref{alpha1asymp} and \ref{alpha1asymp2}

540: below are proved. Below we summarize the definitions and the

541: results.

542: %

543: \subsection{Moment zeta function}

544: \begin{definition}

545: \label{zdef} Let $\mathcal{F}$ be a probability

546: distribution on a (possibly infinite) interval $I$, and let

547: $m_k(\mathcal{F}) =  \int_I x^k\mathcal{F}(d x)$ be the $k$-th moment

548: of  $\mathcal{F}$. Then the \emph{moment zeta function of

549: $\mathcal{F}$} is defined to be $$\zeta_{\mathcal{F}}(s) =

550: \sum_{k=1}^\infty m_k^s(\mathcal{F}),$$ whenever the sum is defined.

551: \end{definition}

552: %

553: The definition is, in a way, motivated by the following:

554:

555: \begin{lemma}

556: \label{zetalemma} Let $\mathcal{F}$ be a probability

557: distribution as above, and let $x_1, \dots, x_n$ be independent

558: random variables with common distribution $\mathcal{F}$. Then

559: \begin{equation}

560: \mathbb{E}\left(\frac{1}{1-x_1 \dots x_n}\right) =

561: \zeta_{\mathcal{F}}(n).

562: \end{equation}

563: In particular, the expectation is undefined whenever the zeta

564: function is undefined.

565: \end{lemma}

566: %

567: \begin{proof}

568: Expand the fraction in a geometric series and apply Fubini's

569: theorem.

570: \end{proof}

571: %

572: \begin{example}

573: For $\mathcal{F}$ the uniform distribution on

574: $[0, 1]$, $\zeta_{\mathcal{F}}$ is the familiar Riemann zeta

575: function.

576: \end{example}

577:

Using standard techniques of asymptotic analysis, the following can be

579: shown (see \cite{zeta}):

580: \begin{theorem}

581: \label{momasymp}

582: Let $\mathcal{F}$ be a continuous distribution supported in $[0, 1],$

583: let $f$ be the density of the distribution $\mathcal{F}$, and

584: suppose that $f(1-x) = c x^\beta + O(x^{\beta + \delta}),$ for some

585: $\delta > 0.$ Then the $k$-th moment of $\mathcal{F}$ is asymptotic to

$C k^{-(1+\beta)},$ for $C = c \Gamma(\beta + 1).$

587: \end{theorem}
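
\medskip\noindent
\textbf{Illustration.} A family for which the moments are available in closed form is the density $f(x) = (\beta + 1)(1-x)^\beta$ on $[0,1]$ (chosen here only as an example; it has $c = \beta + 1$ and no error term). Its moments are given by the Beta integral
\begin{equation*}
m_k = (\beta+1)\int_0^1 x^k (1-x)^\beta d x = \frac{\Gamma(\beta+2)\,\Gamma(k+1)}{\Gamma(k+\beta+2)} \sim \Gamma(\beta + 2)\, k^{-(1+\beta)} = c\, \Gamma(\beta+1)\, k^{-(1+\beta)},
\end{equation*}
where we used $\Gamma(k+1)/\Gamma(k+\beta+2) \sim k^{-(1+\beta)}$ as $k \rightarrow \infty$. For $\beta = 0$ this is the uniform distribution, with $m_k = 1/(k+1)$.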

588:

589: \begin{corollary}

590: Under the assumptions of Theorem \ref{momasymp},

591: $\zeta_{\mathcal{F}}(s)$ is defined for $s

592: >1/(1+\beta)$.

593: \end{corollary}

594:

The moment zeta function can be used to analyze two of the three situations
occurring in the study of the batch learner algorithm.
In the sequel, we set $\alpha = \beta + 1$.

598: \subsection{$\alpha > 1$}

599: \label{isdef}

600: In this case, we use our assumptions to rewrite Eq.

601: (\ref{subsum}) as

602: \begin{equation}

603: \label{subsum2}

604: %

605: \mathbb{E}(T) = - \sum_{k=1}^n \binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k).

606: \end{equation}

607: This, in turn, can be rewritten (by expanding the definition of

608: zeta) as

609: \begin{equation}

610: \label{subsum3} \mathbb{E}(T) = - \sum_{j=1}^\infty

611: \left[\left(1-m_j(\mathcal{F})\right)^n-1\right] =

612: \sum_{j=1}^\infty \left[1- \left(1-m_j(\mathcal{F})\right)^n\right]

613: \end{equation}
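
\medskip\noindent
\textbf{Illustration.} The two rewritings above can be unpacked as follows; this is a sketch of the intermediate steps, using only the independence of the $p_i$. For a subset $s$ with $|s| = k$, independence gives $\mathbb{E}(p_s^j) = m_j(\mathcal{F})^k$, so
\begin{equation*}
\mathbb{E}\left(\frac{1}{1-p_s} - 1\right) = \sum_{j=1}^\infty \mathbb{E}(p_s^j) = \sum_{j=1}^\infty m_j(\mathcal{F})^k = \zeta_{\mathcal{F}}(k).
\end{equation*}
Since there are $\binom{n}{k}$ subsets of size $k$, taking expectations in Eq. (\ref{subsum}) gives Eq. (\ref{subsum2}). Exchanging the sums over $k$ and $j$ and applying the binomial theorem for each fixed $j$,
\begin{equation*}
\sum_{k=1}^n \binom{n}{k} (-1)^k m_j(\mathcal{F})^k = \left(1 - m_j(\mathcal{F})\right)^n - 1,
\end{equation*}
which gives Eq. (\ref{subsum3}).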

614:

615: Using the moment zeta function we can show:

616: \begin{theorem}

617: \label{t1}

618: Let $\mathcal{F}$ be a continuous distribution supported on $[0, 1],$

619: and let $f$ be the density of $\mathcal{F}.$ Suppose further that

620: $$\lim_{x \rightarrow 1} \frac{f(x)}{(1-x)^{\beta}} = c,$$ for $\beta,

621: c > 0.$ Then,

622: \begin{equation*}

623: \begin{split}

624: \lim_{n\rightarrow \infty} n^{-\frac{1}{1+\beta}} \left[\sum_{k=1}^n

625: \binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k)\right] \\=

626: -\int_0^\infty

627: \frac{1-\exp\left(-c\Gamma(\beta+1)u^{1+\beta}\right)}{u^2} du\\

628: = - \left(c \Gamma(\beta + 1)\right)^{\frac{1}{\beta+1}}

629: \Gamma\left(\frac{\beta}{\beta + 1}\right).

630: \end{split}

631: \end{equation*}

632: \end{theorem}
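
\medskip\noindent
\textbf{Illustration.} The evaluation of the integral in Theorem \ref{t1} can be sketched as follows, writing $a = 1 + \beta > 1$ and $K = c\Gamma(\beta+1)$ (both abbreviations are introduced only here). Integrating by parts (the boundary terms vanish because $a > 1$) and then substituting $w = K u^a$,
\begin{equation*}
\int_0^\infty \frac{1 - \exp(-K u^a)}{u^2} d u = K a \int_0^\infty u^{a-2} \exp(-K u^a) d u = K^{1/a} \int_0^\infty w^{-1/a} e^{-w} d w = K^{1/a}\, \Gamma\left(1 - \frac{1}{a}\right),
\end{equation*}
and $1 - 1/a = \beta/(\beta+1)$.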

633: %

634: \subsection{$\alpha = 1$}

635: \label{medalpha} In this case,

636: \begin{equation}

637: \label{asest02}

638: f(x) = L + o(1)

639: \end{equation} as $x$

640: approaches $1,$ and so Theorem \ref{momasymp} tells us that

641: \begin{equation}

642: \label{asest2}

643: \lim_{j \rightarrow \infty} j m_j(\mathcal{F}) = L.

644: \end{equation}

645: It is not hard to see that

646: $\zeta_{\mathcal{F}}(n)$ is defined for $n \geq 2$. We break up

647: the expression in Eq. (\ref{subsum}) as

648: \begin{equation}

\label{subsumm} T = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right) +

650: \sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1}

651:  (-1)^{|s|-1}

652: \left(\frac{1}{1-p_s} - 1\right).

653: \end{equation}

654: Let

\begin{gather*} T_1 = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right),\\

656:  T_2 = \sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1}

657:  (-1)^{|s|-1}

658: \left(\frac{1}{1-p_s} - 1\right).

659: \end{gather*}

The first sum $T_1$ has
no expectation; however, $T_1/n$ does have a stable
distribution centered on $c \log n + c_2$. We will keep this in
mind, but now let us look at the second sum $T_2$. Its expectation can be
written as
\begin{equation}
\label{subsumm2} \mathbb{E}(T_2) = - \sum_{j=1}^\infty

667: \left[\left(1-m_j(\mathcal{F})\right)^n-1 + n

668: m_j(\mathcal{F})\right].

669: \end{equation}

670: We can again use the moment zeta function to analyse the properties of

671: $T_2,$ to get:

672: \begin{theorem}

673: \label{alpha1asymp}

674: Let $\mathcal{F}$ be a continuous distribution supported on $[0, 1],$

675: and let $f$ be the density of $\mathcal{F}.$ Suppose further that

$$\lim_{x \rightarrow 1} f(x) = c > 0.$$

677: Then,

678: $$\sum_{k=2}^n

679: \binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k) \sim c n \log n.

680: $$

681: \end{theorem}

To get error estimates, we need a stronger assumption on the function

683: $f$ than the weakest possible assumption made in Theorem

684: \ref{alpha1asymp}.

685:

686: \begin{theorem}

687: \label{alpha1asymp2}

688: Let $\mathcal{F}$ be a continuous distribution supported on $[0, 1],$

689: and let $f$ be the density of $\mathcal{F}.$ Suppose further that

$$f(x) = c + O\left((1-x)^\delta\right) \quad \mbox{as } x \rightarrow 1,$$ where $\delta > 0.$
Then,
$$\sum_{k=2}^n
\binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k) = c n \log n + O(n).

694: $$

695: \end{theorem}

696:

697: The conclusion differs somewhat from that of section

698: \ref{isdef} in that  we get an

699: additional term of $c n \log n$, where $c = \lim_{x \rightarrow

700: 1} f(x) = \lim_{j \rightarrow \infty} j m_j$. This term is equal

701: (with opposing sign) to the center of the stable law satisfied by

702: $T_1$, so in case $\alpha = 1$, we see that $T$ has no

expectation but satisfies a \emph{law of large numbers}, of the following form:

704: %

705: \begin{theorem}[Law of large numbers]

706: There exists a constant $C$ such that $\lim_{y \rightarrow

707: \infty} \mathbf{P}(|T/n - C| > y) = 0.$

708: \end{theorem}

709: \section{$\alpha <1$}

710: \label{smallalpha} In this case the analysis goes through as in

711: the preceding section when $\alpha > 1/2$, but then runs into

712: considerable difficulties. However, in this case we note that

713: Theorem \ref{allgen} actually gives us tight bounds.

714: \section{The inevitable comparison}

715: We are now in a position to compare the performance of the batch

716: learning algorithm with that of the memoryless learning algorithm

717: and of learning with full memory, as summarized in Theorem

718: \ref{mainprev}. We combine our computations above with the

719: observation that the batch learner algorithm converges

720: geometrically (Lemma \ref{latmost}), to get:

721: %

722: \thm{theorem}

723: {

724: \label{batchthm}

725: Let $N_\Delta$ be the number of steps it takes for the student

726: to have probability $1 - \Delta$ of learning the

727: concept using the batch learner algorithm. Then we have the following estimates for $N_\Delta$:

728: %

729: \begin{itemize}

730: \item

if the distribution of overlaps is \emph{uniform}, or, more
generally, the density function satisfies
$f(1-x) = c + O(x^\delta)$ as $x$ approaches $0$, with $\delta, c > 0,$ then there exist positive
constants $C_1, C_2$ such that

$$\lim_{n \rightarrow \infty} \mathbf{P}\left(C_1 <
\frac{N_\Delta}{|\log \Delta|\, n} < C_2\right) = 1;$$

737: %

738: %

739: \item

740: if the probability density function $f(1-x)$ is asymptotic to $c x^\beta

741: + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as $x$ approaches

742: $0$, then

743: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(c_1 <

744: \frac{N_\Delta}{|\log \Delta|n^{\frac{1}{1+\beta}}} < c_2\right) = 1,$$

745: for some positive constants $c_1, c_2$;

746: %

747: \item

748: if the asymptotic behavior is as above, but $-1 < \beta < 0$, then

749: %

750: $$\lim_{x \rightarrow \infty}  \mathbf{P}\left(\frac{1}{x} < \frac{N_\Delta}{|\log

\Delta| n^{1/(1+\beta)}} < x\right) = 1.$$

752: \end{itemize}}
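
To make the exponents in Theorem \ref{batchthm} concrete, the following
illustrative computation compares the $n$-dependence $n^{1/(1+\beta)}$ for a
few values of $\beta$. The sample values $n = 10^4$ and
$\beta \in \{1, 0, -1/2\}$ are chosen only for illustration, and the
constants and the dependence on $\Delta$ are suppressed, so the entries
indicate orders of magnitude rather than actual step counts:
\begin{equation*}
\begin{array}{c|c|c}
\beta & 1/(1+\beta) & n^{1/(1+\beta)} \text{ at } n = 10^4 \\ \hline
1 & 1/2 & 10^2 \\
0 & 1 & 10^4 \\
-1/2 & 2 & 10^8
\end{array}
\end{equation*}
Thus a density that vanishes linearly as the overlap approaches $1$ would
lower the scale from order $n$ to order $\sqrt{n}$, while a density that
blows up like $x^{-1/2}$ would raise it to order $n^2$.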

753:

754: %

755: % \thm{theorem} {\label{batchthm} Let $N_\Delta$ be the number of

756: % steps it takes for the student (with probability $1$) to have

757: % probability $1 - \Delta$ of learning the concept using the batch

758: % learner algorithm. Then we have the following estimates for

759: % $N_\Delta$:

760: % %

761: % \begin{itemize}

762: % \item

763: % If the distribution of overlaps is \emph{uniform}, or more

764: % generally, the density function $f(1-x)$  at $0$ has the form

765: % $f(x) = c + O(x^\delta),$ $\delta, c > 0,$ then $N_\Delta=|\log

766: % \Delta|\Theta(n)$

767: % %

768: % \item

769: % If the probability density function $f(1-x)$ is asymptotic to

770: % $x^\beta + O(x^{\beta - \delta}), \quad \delta, \beta > 0$, as

771: % $x$ approaches $0$, then we have $N_\Delta=|\log

772: % \Delta|\Theta(n^{1/(1+\beta)})$;

773: % %

774: % \item

775: % If the asymptotic behavior is as above, but $-1 < \beta < 0$,

776: % then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$

777: % %

778: % \end{itemize}}

779: % %

Comparing Theorems \ref{mainprev} and \ref{batchthm}, we see that
the batch learning algorithm is uniformly superior for $\beta \geq
0$, and is the only one of the three to achieve \emph{sublinear}
performance whenever $\beta > 0$ (the other two \emph{never} do
better than linearly, unless the distribution $\mathcal{F}$ is
supported away from $1$). On the other hand, for $\beta < 0$, the
batch learning algorithm performs comparably to the memoryless
learner algorithm, and worse than learning with full memory.

789: %\section{$\alpha <1$}

790: %The same method as in section \ref{isdef} under the assumption

791: %that the $k$-th moment is asymptotic to $k^\alpha$ (this time for

792: %$\alpha \leq 1$) can be used to write

793: %\begin{equation}

794: %\begin{split}

795: %T_2 &= n^{1/alpha} \int_0^{n^{1/\alpha}} \frac{\left[1-n

796: %m(n^{1/\alpha}/u) - (1-m(n^{1/\alpha}/u)^n \right]}{u^2} d u +

797: %O(1)\\ &= n^{1/\alpha}\left(\int_0^{n^{1/2\alpha}} +

798: %\int_{n^{1/2\alpha}}^{n^{1/\alpha}}\right) \frac{\left[1-

799: %m^\prime(u) u^\alpha - (1-m^\prime(u)u^\alpha/n)^n \right]}{u^2}

800: %d u + O(1).

801: %\end{split}

802: %\end{equation} If $1/2 < \alpha < 1$, the argument finishes in

803: %exactly the same way as in section \ref{isdef}, to give us $T_2

804: %\asymp C n^{1/\alpha}$. However, if $\alpha = 1$, we get an

805: %additional term of $C_2 n \log n$, where $C_2 = \lim_{j

806: %\rightarrow \infty} m_j$. This term is equal (with opposing sign)

807: %to the center of the stable law satisfied by $T_1$, so in case

808: %$\alpha = 1$, we see that $T$ has no expectation but satisfies a

809: %law of large numbers, with center linear in $n$. If $\alpha \leq

810: %1/2$, the integral diverges.

811: \begin{thebibliography}{xxxxxxxxxxxx}

812:

\bibitem[BenOrsz]{benorsz}
Bender, C.~M. and Orszag, S. (1999) \textit{Advanced mathematical
methods for scientists and engineers, I}, Springer-Verlag, New
York.

\bibitem[KNN2001]{knn}
Komarova, N.~L., Niyogi, P. and Nowak, M.~A. (2001) The evolutionary
dynamics of grammar acquisition, \textit{J.~Theor.~Biology}
\textbf{209}(1), pp.~43--59.

\bibitem[KN2001]{kn}
Komarova, N.~L. and Nowak, M.~A. (2001) Natural selection of the
critical period for grammar acquisition, \textit{Proc. Royal Soc.
B}, to appear.

\bibitem[KR2001a]{kr1}
Komarova, N.~L. and Rivin, I. (2001) Harmonic mean, random
polynomials and stochastic matrices, \emph{preprint}.

\bibitem[KR2001b]{kr2}
Komarova, N.~L. and Rivin, I. (2001) On the mathematics of
learning.

\bibitem[Niyogi1998]{niy}
Niyogi, P. (1998) \textit{The Informational Complexity of
Learning}, Kluwer, Boston.

\bibitem[NKN2001]{nkn}
Nowak, M.~A., Komarova, N.~L. and Niyogi, P. (2001) Evolution of
universal grammar, \textit{Science} \textbf{291}, pp.~114--118.

\bibitem[Rivin2002]{zeta}
Rivin, I. (2002) The moment zeta function and applications,
arxiv.org preprint NT/0201109.

847:

848: \end{thebibliography}

849:

850:

851: \end{document}

852:

853: