
1: %\documentclass[12pt]{gen-j-l}

2: \documentclass[12pt]{amsart}

3: \usepackage{amsmath}

4: \usepackage{amsfonts,amssymb,amsthm}

5: \usepackage{graphicx}

6:

7: %

8: \newcommand{\beq}{\begin{equation}}

9: \newcommand{\eeq}{\end{equation}}

10: \newcommand{\bbar}{\begin{eqnarray}}

11: \newcommand{\eear}{\end{eqnarray}}

12: %

13:

14:

15: \newcommand{\thm}[2]{\begin{#1} #2 \end{#1}}

16: \newcommand{\excess}{\mathrm{excess\:}}

17: \newcommand{\sgn}{\mathrm{sgn\:}}

18: \newcommand{\realpart}{\mathrm{Re\:}}

19: \newcommand{\imagpart}{\mathrm{Im\:}}

20:

21: \newcommand{\logdet}{\log \det \Delta}

22: \newcommand{\tr}{\mathrm{tr\:}}

23: \newcommand{\diameter}{\mathrm{diameter\:}}

24: \newcommand{\area}{\mathrm{area\:}}

25:

26: \newcommand{\Sim}{\mathrm{Sim\:}}

27: \newcommand{\num}{\mathcal{N\:}}

28: %\newcommand{\arg}{\mathrm{arg\:}}

29: \newcommand{\dilatation}{\mathrm{dilatation\:}}

30: \newtheorem{theorem}{Theorem}[section]

31: \newtheorem{itheorem}{Theorem}[section]

32: \newtheorem{lemma}[theorem]{Lemma}

33: \newtheorem{ilemma}[itheorem]{Lemma}

34: \newtheorem{corollary}[theorem]{Corollary}

35: \newtheorem{conjecture}[theorem]{Conjecture}

36: \newtheorem{question}[theorem]{Question}

37: \newtheorem{claim}[theorem]{Claim}

38: \newtheorem{observation}[theorem]{Observation}

39: \newtheorem{iobservation}[itheorem]{Observation}

40: \newtheorem{remark}[theorem]{Remark}

41: \newtheorem{condition}[theorem]{Condition}

42: \newtheorem{example}[theorem]{Example}

43: \newtheorem{definition}[theorem]{Definition}

44: \newtheorem{xca}[theorem]{Exercise}

45: \newtheorem{note}[theorem]{Note}

46:

47: %\input{montreref}

48:

49: \begin{document}

50:

51: %-------------- Author entries --------------------

52: \title{Yet another zeta function and learning}

53: %

54:

55:

56: \author{Igor Rivin}

\address{Mathematics Department, University of Manchester,
Oxford Road, Manchester M13 9PL, UK}

59: \address{Mathematics Department, Temple University,

60: Philadelphia, PA 19122}

61: \address{Mathematics Department, Princeton University, Princeton,

62: NJ 08544}

63: %

\email{irivin@math.princeton.edu} \thanks{The author would like
to thank the EPSRC and the NSF for support, and Natalia Komarova
and Ilan Vardi for useful conversations.}

67:

68: \subjclass{60E07, 60F15, 60J20, 91E40, 26C10} \keywords{ learning

69: theory, zeta functions, asymptotics}

70: %

71: \begin{abstract}

72: We analyze completely the convergence speed of the \emph{batch

73: learning algorithm}, and compare its speed to that of the

74: memoryless learning algorithm and of learning with memory (as

75: analyzed in \cite{kr2}). We show that the batch learning

76: algorithm is never worse than the memoryless learning algorithm

77: (at least asymptotically). Its performance \emph{vis-a-vis}

78: learning with full memory is less clearcut, and depends on

79: certain probabilistic assumptions. These results necessitate the

80: introduction of the \textit{moment zeta function} of a

81: probability distribution and the study of some of its properties.

82: \end{abstract}

83: %

84: \maketitle

85:

86: \renewcommand{\theitheorem}{\Alph{itheorem}}

87: %

88: \section*{Introduction}

89: The original motivation for the work in this paper was provided

90: by  research in learning theory, specifically in various models

91: of language acquisition (see, for example, \cite{knn,nkn,kn}). In

92: the paper \cite{kr2}, we had studied the speed of convergence of

93: the  \emph{memoryless learner algorithm}, and also of

94: \emph{learning with full memory}. Since the \emph{batch learning

95: algorithm} is both widely known, and believed to have superior

96: speed (at the cost of memory) to both of the above methods by

97: learning theorists, it seemed natural to analyze its behavior

98: under the same set of assumptions, in order to bring the analysis

99: in \cite{kr1} and \cite{kr2} to a sort of closure. It should be

100: noted that the detailed analysis of the batch learning algorithm

101: is performed under the assumption of \emph{independence}, which

102: was not explicitly present in our previous work. For the

103: impatient reader we state our main result (Theorem

104: \ref{batchthm}) immediately (the reader can compare it with the

105: results on the memoryless learning algorithm and learning with

106: full memory, as summarized in Theorem \ref{mainprev}):

107: %

108: \begin{itheorem}

109:  Let $N_\Delta$ be the number of steps it takes

110: for the student (with probability $1$) to have probability $1 -

111: \Delta$ of learning the concept using the batch learner

112: algorithm. Then we have the following estimates for $N_\Delta$:

113: %

114: \begin{itemize}

115: \item

If the distribution of overlaps is \emph{uniform}, or more
generally, the density function satisfies $f(1-x) = c +
O(x^\delta)$ as $x \rightarrow 0$, with $\delta, c > 0,$ then
$N_\Delta=|\log \Delta|\Theta(n)$;
%
\item
If the probability density function satisfies $f(1-x) = x^\beta +
O(x^{\beta + \delta})$ with $\delta, \beta > 0$, as
$x$ approaches $0$, then we have $N_\Delta=|\log
\Delta|\Theta(n^{1/(1+\beta)})$;
%
\item
If the asymptotic behavior is as above, but $-1 < \beta < 0$,
then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$

130: %

131: \end{itemize}

132: \end{itheorem}

133: The plan of the paper is as follows: in this Introduction we

134: recall the learning algorithms we study; in Section \ref{mathmod}

135: we define our mathematical model; in Section 2 we recall our

136: previous results, in Section 3 we begin the analysis of the batch

137: learning algorithm, and introduce some of the necessary

138: mathematical concepts; in Sections 4-6 we analyze the three cases

139: stated in Theorem A, and we summarize our findings in Section 7.

140: \subsection*{Memoryless Learning and Learning with Full Memory}

141: The general setup is as follows: There is a collection of

142: concepts $R_0, \dots, R_n$ and words which refer to these

143: concepts, sometimes ambiguously. The teacher generates a stream

144: of words, referring to the concept $R_0$. This is not known to

145: the student, but he must learn by, at each step, guessing some

146: concept $R_i$ and checking for consistency with the teacher's

147: input.  The \emph{memoryless learner algorithm} consists of

148: picking a concept $R_i$ at random, and sticking by this choice,

149: until it is proven wrong.  At this point another concept is

150: picked randomly, and the procedure repeats. \emph{Learning with

151: full memory} follows the same general process with the important

152: difference that once a concept is rejected, the student never

153: goes back to it. It is clear (for both algorithms) that once the

154: student hits on the right answer $R_0$, this will be his final

answer. We would like to estimate the probability of having
guessed the right answer after $k$ steps, and also the
expected number of steps before the student settles on the right
answer.

159:

160: \subsection*{Batch Learning} The batch learning situation is

161: similar to the above, but here the student records the words

162: $w_1, \dots, w_k, \dots$ he gets from the teacher. For each word

163: $w_i$ , we assume that the student can find (in his textbook, for

164: example) a list $L_i$ of concepts referred to by the word. If we

165: define

166: \begin{equation*}

167: \mathcal{L}_k = \bigcap_{i=1}^k L_i,

168: \end{equation*}

169: then we are interested in the smallest value of $k$ such that

170: $\mathcal{L}_k = \{R_0\}$. This value $k_0$ is the time it has

171: taken the student to learn the concept $R_0$. We think of $k_0$

172: as a random variable, and we wish to estimate its expectation.
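As a purely illustrative example (the particular lists below are
hypothetical and are not taken from any data), suppose that $n = 2$
and that the first three words have the lists
\begin{equation*}
L_1 = \{R_0, R_1, R_2\}, \qquad L_2 = \{R_0, R_2\}, \qquad
L_3 = \{R_0, R_1\}.
\end{equation*}
Then $\mathcal{L}_1 = \{R_0, R_1, R_2\}$, $\mathcal{L}_2 = \{R_0,
R_2\}$ and $\mathcal{L}_3 = \{R_0\}$, so in this particular run
$k_0 = 3$. Every list contains $R_0$, since every word produced by
the teacher refers to $R_0$.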

173: %

174: \section{The mathematical model}

175: \label{mathmod}

176:  We think of the words referring to the concept

$R_0$ as a probability space $\mathcal{P}$. The probability that
one of these words also refers to the concept $R_i$ shall be

179: denoted by $p_i$; the probability that a word refers to concepts

180: $R_{i_1}, \dots, R_{i_k}$ shall be denoted by $p_{i_1 \dots

181: i_k}$. All the results described below (obviously) depend in a

182: crucial way on the $p_1, \dots, p_n$ and (in the case of the

183: batch learning algorithm) also on the joint probabilities. Since

184: there is no \emph{a priori} reason to assume specific values for

185: the probabilities, we shall assume that all of the $p_i$ are

186: themselves \emph{independent, identically distributed random

187: variables}. We shall refer to their common distribution as

188: $\mathcal{F}$, and to the density as $f$. It turns out that the

189: convergence properties of the various learning algorithms depend

190: on the local analytic properties of the distribution

$\mathcal{F}$ at $1$ -- a moment's reflection will convince the
reader that this is not really so surprising.

193:

194: To carry out a precise analysis of the batch learning algorithm,

195: we will also need the \emph{independence hypothesis}:

196: $$

197: p_{i_1 \dots i_k} = p_{i_1} \dots p_{i_k}.

198: $$

199: It is again not too surprising that some such assumption on

200: correlations ought to be required for precise asymptotic results,

201: though it is obviously the subject of a (non-mathematical) debate

202: as to whether assuming that the various concepts are truly

203: independent is reasonable from a cognitive science point of view.

204:

205: \section{Previous results}

206: In previous work \cite{kr1} and \cite{kr2} we obtained the

207: following result.

208:  \thm{theorem} {\label{mainprev}Let $N_\Delta$ be the number of steps it

209: takes for the student (with probability $1$) to have probability

210: $1 - \Delta$ of learning the concept. Then we have the following

211: estimates for $N_\Delta$:

212: %

213: \begin{itemize}

214: \item

if the distribution of overlaps is \emph{uniform}, or more
generally, the density function satisfies $f(1-x) = c +
O(x^\delta)$ as $x \rightarrow 0$, with $\delta, c > 0,$ then
$N_\Delta=|\log \Delta|\Theta(n \log n)$ for
the memoryless algorithm and $N_\Delta=(1-\Delta)^2 \Theta(n \log
n)$ when learning with full memory;
%
\item
if the probability density function satisfies $f(1-x) = x^\beta +
O(x^{\beta + \delta})$ with $\delta, \beta > 0$, as
$x$ approaches $0$, then for the two algorithms we have
respectively $N_\Delta=|\log \Delta|\Theta(n)$ and
$N_\Delta=(1-\Delta)^2 \Theta(n)$;
%
\item
if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
$N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)})$ for the memoryless
learner and $N_\Delta=(1-\Delta)^2\Theta(n^{1/(1+\beta)})$ for
learning with full memory.

234: %

235: \end{itemize}}

236: %

237: \noindent Recall that $f(x) = \Theta(g(x))$ means that for

238: sufficiently large $x$, the ratio $f(x)/g(x)$ is bounded between

239: two strictly positive constants. The distribution of overlaps

240: referred to above is simply the distribution $\mathcal{F}$.

241: Notice that the theorem says nothing about the situation when

242: $\mathcal{F}$ is supported in some interval $[0, a]$, for $a<1$.

243: That case is (presumably) of scientific interest, but

244: mathematically it is relatively trivial: we replace the arguments

245: of all the $\Theta$s above by $1$, though, of course, we are

246: thereby hiding the dependence on $a$.

247:

248: \section{General bounds on the batch learner algorithm}

249:

250: Consider a set of words $w_1, \dots, w_k$. The probability that

they all refer to the concept $R_i$ is, obviously, $p_i^k$.

252: \begin{lemma}

253: \label{bounds}

254:  The probability $q_k$ that we still have not

255: learned the concept $R_0$ after $k$ steps is bounded above by

256: $\sum_{i=1}^n p_i^k$, and below by $\max_i p_i^k$.

257: \end{lemma}

258: \begin{proof}

259: Immediate.

260: \end{proof}

261: We will first use these upper and lower

262: bounds to get corresponding bounds on the convergence speed of

263: the batch learner algorithm, and then invoke the independence

264: hypothesis to sharpen these bounds in many cases.

265:

266: We begin with a trivial but useful lemma.

267: %

268: \begin{lemma}

269: \label{rearrange}

 Let $G$ be a game where the probability of
success (respectively failure) after at most $k$ steps is $s_k$
(respectively $f_k = 1-s_k $), with $s_0 = 0$. Then the expected
number of steps until success is
$$\sum_{k=1}^\infty k (s_k - s_{k-1}) = \sum_{k=0}^\infty (1-s_k) =
\sum_{k=0}^\infty f_k,$$ if the corresponding sum converges.

276: \end{lemma}

277: \begin{proof}

The proof is immediate from the definition of expectation and the
possibility of rearrangement of terms of positive series.

280: \end{proof}
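\begin{example}
As a sanity check (a sketch under the simplifying assumption, made
only for this illustration, that the probability of failure after
$k$ steps is exactly $p^k$ for a single fixed $0 \leq p < 1$), take
$s_k = 1 - p^k$. Then $s_k - s_{k-1} = p^{k-1}(1-p)$, and
\begin{equation*}
\sum_{k=1}^\infty k (s_k - s_{k-1}) = (1-p) \sum_{k=1}^\infty k
p^{k-1} = \frac{1-p}{(1-p)^2} = \frac{1}{1-p},
\end{equation*}
which is the mean of a geometric random variable. This is the
quantity that appears in the bounds of Eq. (\ref{trivest}).
\end{example}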

281: We can combine Lemma \ref{rearrange} and Lemma \ref{bounds} to

282: obtain:

283: \begin{theorem}

284: \label{sumbounds} The expected time $T$ of convergence of the

285: batch learner algorithm is bounded as follows:

286: \begin{equation}

287: \label{trivest} \sum_{i=1}^n \frac{1}{1-p_i} \geq T \geq

288: \max_{1\leq i \leq n} \frac{1}{1-p_i}.

289: \end{equation}

290: \end{theorem}

291: The leftmost term in equation (\ref{trivest}) has been studied at

292: length in \cite{kr1}. We state a version of the results of

293: \cite{kr1} below:

294: \begin{theorem}

\label{allstab} Let $S=\sum_{i=1}^n \frac{1}{1-p_i},$ where the
$p_i$ are independent, identically distributed random variables
with values in $[0, 1]$, with probability density $f$, such that
$f(1-x) = x^\beta + O(x^{\beta + \delta}),\quad \delta > 0$ for
$x\rightarrow 0$. If $\beta > 0$, then there exists a mean
$m$, such that $\lim_{n \rightarrow \infty} \mathbb{P}(|S/n - m|
> \epsilon) = 0,$ for any $\epsilon > 0.$ If $\beta = 0$, then
$\lim_{n \rightarrow \infty} \mathbb{P}(|S/(n\log n) - 1|
> \epsilon) = 0.$ Finally, if
$-1 < \beta < 0,$ then $\lim_{n \rightarrow \infty}
\mathbb{P}(S/n^{1/(\beta+1)} - C
> a) = g(a),$ where $\lim_{a \rightarrow \infty} g(a)= 0,$ and $C$ is
an arbitrary (but fixed) constant, and likewise
$$\lim_{n \rightarrow \infty}\mathbb{P}(S/n^{1/(\beta + 1)} < b) = h(b),$$ where $\lim_{b \rightarrow 0}h(b) = 0.$

309: \end{theorem}

The right hand side of Eq. (\ref{trivest}) is easier to
understand. Indeed, let $p_1, \dots, p_n$ be distributed as usual
(and as in the statement of Theorem \ref{allstab}). Then

313: %

314: \begin{theorem}\label{expmin}

The expected value of $\max_{1 \leq i \leq n} p_i$ equals $1 - C
n^{-1/(1+\beta)}(1+o(1))$ as $n \rightarrow \infty$, for some
positive constant $C$.

317: \end{theorem}

318: \begin{proof}

First, we change variables to $q_i = 1 - p_i$. Obviously, the
statement of the Theorem is equivalent to the statement that $E =
\mathbf{E}(\min_{1 \leq i \leq n} q_i) = C  n^{-1/(1+\beta)}(1+o(1))$. We
also write $h(x) = f(1-x),$ and similarly for the primitives $H$
and $F$. Now, the probability that all of the $q_i$ are
greater than some fixed $y$ equals $(1-H(y))^n,$ so the
distribution function of $\min_i q_i$ is $1-(1-H(y))^n,$ and
$$E = \int_0^1 t d\left[1-(1-H(t))^n\right] = \int_0^1 (1-H(t))^n d t.$$
Perform the change of variables $t = u/n^{1/(1+\beta)}$, to get
\begin{equation}
\label{firstint} E = \frac{1}{n^{1/(1+\beta)}}
\int_0^{n^{1/(1+\beta)}} (1-H(u/n^{1/(1+\beta)}))^n du.
\end{equation}
For $u \ll n^{1/(1+\beta)}$, we can write $H(u/n^{1/(1+\beta)})
\asymp H^\prime u^{\beta + 1}/n,$ where $H^\prime$ is a constant.

333: We also know that $H$ is a monotonic function so if we break up

334: the integral above as

335: \begin{equation}

336: \label{secondint} E = \frac{1}{n^{1/(1+\beta)}}

337: \left[\int_0^{n^{1/(2 (1 + \beta))}} + \int_{n^{1/(2 (1 +

338: \beta))}}^{n^{1/(1 + \beta)}}\right] (1-H(u/n^{1/(1+\beta)}))^n

339: du,

340: \end{equation}

we see that the first integral approaches $C = \int_0^\infty
\exp(-H^\prime u^{1+\beta}) d u,$ while the second integral goes to 0.
Note that the proof also evaluates $C$.

344: \end{proof}
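\begin{example}
For illustration, consider the uniform distribution on $[0,1]$, so
that $\beta = 0$ and $H(t) = t$. In this special case the integral
in the proof can be evaluated exactly:
\begin{equation*}
\mathbf{E}\left(\min_{1 \leq i \leq n} q_i\right) = \int_0^1
(1-t)^n d t = \frac{1}{n+1},
\end{equation*}
so that $\mathbf{E}(\max_{1 \leq i \leq n} p_i) = n/(n+1)$. This
agrees with the rate $n^{-1/(1+\beta)} = n^{-1}$, with constant $C
= 1$.
\end{example}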

345: We need one final observation:

346: \begin{theorem}

The variable $n^{1/(1+\beta)} \min_{i=1}^n q_i$ has a limiting
distribution with distribution function $G(x) =
1-\exp(-H^\prime x^{1+\beta}),$ where $H^\prime$ is the constant
appearing in the proof of Theorem \ref{expmin}.

350: \end{theorem}

351: \begin{proof}

352: Immediate from the proof of Theorem \ref{expmin}.

353: \end{proof}

354:

355: We can now put together all of the above results as follows.

356: \begin{theorem}

357: \label{allgen}

 Let $p_1, \dots, p_n$ be independently distributed
with common density function $f$, such that $f(1-x) = c x^\beta +
O(x^{\beta + \delta}),$ $\delta > 0$. Let $T$ be the expected
time of the convergence of the batch learning algorithm with
overlaps $p_1, \dots, p_n$. If $\beta > 0$, then there
exist $C_1, C_2$, such that  $C_1 n^{1/(1+\beta)} \leq T \leq C_2
n$, with probability tending to $1$ as $n$ tends to $\infty$. If
$\beta = 0$, then there exist $C_1, C_2$, such that $C_1 n \leq T
\leq C_2 n \log n$, with probability tending to one as $n$ tends
to $\infty.$ If $-1 < \beta < 0$, then $C^{-1} n^{1/(\beta + 1)} \leq
T \leq C n^{1/(\beta + 1)}$ with probability tending to $1$ as
$C$ goes to infinity.

370: \end{theorem}

371:

The reader will remark that in the case that $\beta < 0$, the
upper and lower bounds have the same order of magnitude as
functions of $n$.

375:

376: \section{Independent concepts}

We now invoke the independence hypothesis, whereby an application
of the inclusion-exclusion principle gives us:

379:

380: \thm{lemma}{\label{latmost} The probability $l_k$ that we have

381:  learned the concept $R_0$ after $k$ steps is given by

382: $$

383: l_k=\prod_{i=1}^n(1-p_i^k).

384: $$

385: }

386:

387: Note that the probability $s_k$ of winning the game \emph{on the

388: $k$-th step} is given by $s_k = l_k - l_{k-1}= (1-l_{k-1}) -

389: (1-l_k)$. Since the expected number of steps $T$ to learn the

390: concept is given by

391: $$T = \sum_{k=1}^\infty k s_k,$$

392: we immediately have  $$T = \sum_{k=1}^\infty (1-l_k)$$

393: %

394: \thm{lemma}{\label{letime} The expected time $T$ of learning the

395: concept $R_0$ is given by

396: $$

397: T = \sum_{k=1}^\infty \left(1-\prod_{i=1}^n

398: \left(1-p_i^k\right)\right).

399: $$

400: }

401: %

402: Since the sum above is absolutely convergent, we can expand the

403: products and interchange the order of summation to get the

404: following formula for $T$:

405:

\begin{equation}
\label{subsum} T = \sum_{\emptyset \neq s\subseteq \{1, \dots, n\}} (-1)^{|s|-1}
\sum_{k=1}^\infty p_s^k = \sum_{\emptyset \neq s\subseteq \{1, \dots, n\}}
(-1)^{|s|-1} \left(\frac{1}{1-p_s} - 1\right),
\end{equation}
where we have identified subsets of $\{1, \dots, n\}$ with the
corresponding multi-indices.
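\begin{example}
To illustrate the expansion in the smallest nontrivial case (a
check only, assuming $n = 2$ and $p_1, p_2 < 1$), note that
\begin{equation*}
1 - (1-p_1^k)(1-p_2^k) = p_1^k + p_2^k - (p_1 p_2)^k,
\end{equation*}
so that summing over $k \geq 1$ gives
\begin{equation*}
T = \frac{p_1}{1-p_1} + \frac{p_2}{1-p_2} - \frac{p_1 p_2}{1-p_1
p_2}.
\end{equation*}
The three terms correspond to the subsets $\{1\}$, $\{2\}$ and
$\{1, 2\}$, with $p_{\{1,2\}} = p_1 p_2$ by the independence
hypothesis.
\end{example}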

413:

Formula (\ref{subsum}) is useful in and of itself, but we now

415: use it to attempt to get the expectation of the expected time of

416: success $T$ under our distribution and independence assumption.

417: For this we shall need the following:

418: %

419: \thm{definition}{\label{zdef} Let $\mathcal{F}$ be a probability

420: distribution on an interval $I$, and let $m_k(\mathcal{F}) =

421: \int_I x^k\mathcal{F}(d x)$ be the $k$-th moment of

422: $\mathcal{F}$. Then the \emph{moment zeta function of

423: $\mathcal{F}$} is defined to be

424: $$\zeta_{\mathcal{F}}(s) = \sum_{k=1}^\infty m_k^s(\mathcal{F}),$$ whenever the sum is defined.

425: }

426: %

427: \thm{lemma}{\label{zetalemma} Let $\mathcal{F}$ be a probability

428: distribution as above, and let $x_1, \dots, x_n$ be independent

429: random variables with common distribution $\mathcal{F}$. Then

430: \begin{equation}

\mathbb{E}\left(\frac{1}{1-x_1 \dots x_n} - 1\right) =
\zeta_{\mathcal{F}}(n).

433: \end{equation}

434: In particular, the expectation is undefined whenever the zeta

435: function is undefined. }

436: %

437: \begin{proof}

438: Expand the fraction in a geometric series and apply Fubini's

439: theorem.

440: \end{proof}

441: %

\thm{example} { For $\mathcal{F}$ the uniform distribution on
$[0, 1]$, we have $m_k(\mathcal{F}) = 1/(k+1)$, so that
$\zeta_{\mathcal{F}}(s) = \zeta(s) - 1$, where $\zeta$ is the
familiar Riemann zeta function. Notice that this is \emph{not}
defined for $s=1$ -- this will be important in the sequel.}

446:

447: It should be noted that in the case we are interested in

448: (distributions supported in $[0, 1]$), the asymptotics of the

449: moments are determined by the local properties of the

450: distribution at $1$, up to exponentially decreasing error terms.

So, if $f(1-x) \asymp x^\beta$ (recall that $f$ is the density),
we see that the $k$-th moment of $\mathcal{F}$ is asymptotic to
$C k^{-(1+\beta)},$ for some constant $C$.  To show this, we
first define the \emph{Mellin transform} of $f$ to be
$$\mathcal{M}(f)(s) = \int_0^1 f(x) x^{s-1} d x.$$ We see that
$m_k(\mathcal{F}) = \mathcal{M}(f)(k+1).$ The Mellin transform is
very closely related to the Laplace transform. Indeed, making the
substitution $x = \exp(-u)$, we see that $$\mathcal{M}(f)(s) =
\int_0^\infty f(\exp(-u)) \exp(-s u) d u,$$ so the Mellin
transform of $f$ is equal to the Laplace transform of $u \mapsto
f(\exp(-u)).$ Now, the asymptotics of the Laplace transform are easily
computed by Laplace's method, and in the case we are interested
in, Watson's lemma (see, e.g., \cite{benorsz}) tells us that if
$f(x) \asymp c (1-x)^\beta$, then $\mathcal{M}(f)(s) \asymp c
\Gamma(\beta+1) s^{-(\beta + 1)}.$ In particular,
$\zeta_{\mathcal{F}}(s)$ is defined for $s
>1/(1+\beta)$. Below we shall analyze three cases (though the
analysis is almost the same in the three cases, there are some
important variations). In the sequel, we set $\alpha = \beta + 1$.
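\begin{example}
As an illustration of the moment asymptotics (the specific density
below is a hypothetical choice made for this example), take $f(x) =
(\beta+1)(1-x)^\beta$ on $[0,1]$ with $\beta > -1$. The moments are
then Beta integrals:
\begin{equation*}
m_k(\mathcal{F}) = (\beta+1)\int_0^1 x^k (1-x)^\beta d x =
\frac{\Gamma(\beta+2)\,\Gamma(k+1)}{\Gamma(k+\beta+2)} \sim
\Gamma(\beta+2)\, k^{-(\beta+1)}
\end{equation*}
as $k \rightarrow \infty$, by the standard asymptotics of a ratio
of Gamma functions. For $\beta = 0$ this reduces to the uniform
case $m_k(\mathcal{F}) = 1/(k+1)$.
\end{example}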

470: \section{$\alpha > 1$}

471: \label{isdef}

472: In this case, we use our assumptions to rewrite Eq.

473: (\ref{subsum}) as

474: \begin{equation}

475: \label{subsum2}

476: %

477: T = - \sum_{k=1}^n \binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k).

478: \end{equation}

479: This, in turn, can be rewritten (by expanding the definition of

480: zeta) as

481: \begin{equation}

482: \label{subsum3} T = - \sum_{j=1}^\infty

483: \left[\left(1-m_j(\mathcal{F})\right)^n-1\right]

484: \end{equation}

485: Since the term in the sum is monotonically decreasing, the sum in

486: Eq. (\ref{subsum3}) can be approximated by an integral (of

487: \emph{any} monotonic interpolation $m$ of the sequence

488: $m_j(\mathcal{F})$; however there is no reason not to set $m(x) =

489: \mathcal{M}(f)(x+1)$), with error bounded by the first term,

which is, in turn, bounded in absolute value by $2$, to get

491: \begin{equation}

492: \label{approx1} T = - \int_1^\infty \left[(1-m(x))^n -1\right] d

493: x + O(1),

494: \end{equation}

495: where the error term is bounded above by $2$.

496:

497: Now, let us assume that $m(x)$ is of order $x^{-\alpha}$ for some

$\alpha > 1$. We substitute $x = n^{1/\alpha}/u$, to get

499: \begin{equation}

500: \begin{split}

 T &=  n^{1/\alpha}\int_0^{n^{1/\alpha}}
\frac{\left[1-(1-m(n^{1/\alpha}/u))^n \right]}{u^2} d u + O(1)\\
&=
n^{1/\alpha}\int_0^{n^{1/\alpha}}\frac{\left[1-(1-m^\prime(u)u^\alpha/n)^n
\right]}{u^2} d u + O(1)\\ & =
n^{1/\alpha}\left(\int_0^{n^{1/(2\alpha)}} +
\int_{n^{1/(2\alpha)}}^{n^{1/\alpha}}\right)
\frac{\left[1-(1-m^\prime(u)u^\alpha/n)^n \right]}{u^2} d u +
O(1) ,

510: \end{split}

511: \end{equation}

512: where $m^\prime$ is a bounded (asymptotically constant) function.

513: In the second integral the integrand is bounded above by $1/u^2$,

514: so the contribution from that integral goes to $0$, while in the

515: first integral we can approximate $(1-m^\prime u^\alpha/n)^n$ by

516: $\exp(-m^\prime u^\alpha)$, and the contribution from that

517: integral goes to

518: \begin{equation}

519: \label{mainalpha} T = n^{1/\alpha}

520: \int_0^\infty\frac{1-\exp(-m^\prime(u) u^\alpha)}{u^2} d u + O(1)

521: \asymp C n^{1/\alpha}.

522: \end{equation}
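\begin{example}
As a sketch of how the constant can be made explicit, suppose (an
assumption made only for this illustration) that $m^\prime(u)$ is
replaced by its limiting constant value, which we denote $c_\infty
> 0$. Integrating by parts and then substituting $v = c_\infty
u^\alpha$ gives
\begin{equation*}
\int_0^\infty \frac{1-\exp(-c_\infty u^\alpha)}{u^2} d u =
\int_0^\infty \alpha c_\infty u^{\alpha-2} \exp(-c_\infty u^\alpha)
d u = c_\infty^{1/\alpha}\, \Gamma\left(1 -
\frac{1}{\alpha}\right).
\end{equation*}
The boundary terms vanish precisely because $\alpha > 1$. The
right hand side is finite for $\alpha > 1$ and blows up as $\alpha
\rightarrow 1$, which is consistent with the case $\alpha = 1$
requiring a separate treatment.
\end{example}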

523: %

524: \section{$\alpha = 1$}

525: \label{medalpha} In this case, $f(x) = c + o(1)$ as $x$

526: approaches $1$.  It is not hard to see that

527: $\zeta_{\mathcal{F}}(n)$ is defined for $n \geq 2$. We break up

528: the expression in Eq. (\ref{subsum}) as

529: \begin{equation}

\label{subsumm} T = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right) +
\sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1}
 (-1)^{|s|-1}
\left(\frac{1}{1-p_s} - 1\right).

534: \end{equation}

535: Let

\begin{gather*} T_1 = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right),\\
 T_2 = \sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1}
 (-1)^{|s|-1}
\left(\frac{1}{1-p_s} - 1\right).
\end{gather*}

 The first sum $T_1$ has
no expectation; however, $T_1/n$ does have a stable
distribution centered on $c \log n + c_2$. We will keep this in

544: mind, but now let us look at the second sum  $T_2$. It can be

545: rewritten as

546: \begin{equation}

547: \label{subsumm2} T_2 = - \sum_{j=1}^\infty

548: \left[\left(1-m_j(\mathcal{F})\right)^n-1 + n m_j\right].

549: \end{equation}

The same method as in Section \ref{isdef}, under the assumption
that the $k$-th moment is asymptotic to $k^{-\alpha}$ (this time for
$\alpha \leq 1$), can be used to write

553: \begin{equation}

554: \begin{split}

T_2 &= n \int_0^n \frac{\left[1-n m(n/u) - (1-m(n/u))^n
\right]}{u^2} d u + O(1)\\ &= n\left(\int_0^{n^{1/2}} +
\int_{n^{1/2}}^n\right) \frac{\left[1- m^\prime(u) u -
(1-m^\prime(u)u/n)^n \right]}{u^2} d u + O(1).

559: \end{split}

560: \end{equation} The conclusion differs somewhat from that of section \ref{isdef} in that  we get an

561: additional term of $c n \log n$, where $c = \lim_{x \rightarrow

562: 1} f(x) = \lim_{j \rightarrow \infty} j m_j$. This term is equal

563: (with opposing sign) to the center of the stable law satisfied by

564: $T_1$, so in case $\alpha = 1$, we see that $T$ has no

565: expectation but satisfies a \emph{law of large numbers}, of the

566: following form:

567: %

568: \begin{theorem}[Law of large numbers]

There exists a constant $C$ such that, for every $\epsilon > 0$,
$\lim_{n \rightarrow \infty} \mathbf{P}(|T/n - C| > \epsilon) = 0.$

571: \end{theorem}

572: \section{$\alpha <1$}

573: \label{smallalpha} In this case the analysis goes through as in

574: the preceding section when $\alpha > 1/2$, but then runs into

575: considerable difficulties. However, in this case we note that

576: Theorem \ref{allgen} actually gives us tight bounds.

577: \section{The inevitable comparison}

578: We are now in a position to compare the performance of the batch

579: learning algorithm with that of the memoryless learning algorithm

580: and of learning with full memory, as summarized in Theorem

581: \ref{mainprev}. We combine our computations above with the

582: observation that the batch learner algorithm converges

583: geometrically (Lemma \ref{latmost}), to get:

584: %

585: \thm{theorem} {\label{batchthm} Let $N_\Delta$ be the number of

586: steps it takes for the student (with probability $1$) to have

587: probability $1 - \Delta$ of learning the concept using the batch

588: learner algorithm. Then we have the following estimates for

589: $N_\Delta$:

590: %

591: \begin{itemize}

592: \item

If the distribution of overlaps is \emph{uniform}, or more
generally, the density function satisfies $f(1-x) = c +
O(x^\delta)$ as $x \rightarrow 0$, with $\delta, c > 0,$ then
$N_\Delta=|\log \Delta|\Theta(n)$;
%
\item
If the probability density function satisfies $f(1-x) = x^\beta +
O(x^{\beta + \delta})$ with $\delta, \beta > 0$, as
$x$ approaches $0$, then we have $N_\Delta=|\log
\Delta|\Theta(n^{1/(1+\beta)})$;
%
\item
If the asymptotic behavior is as above, but $-1 < \beta < 0$,
then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$

607: %

608: \end{itemize}}

Comparing Theorems \ref{mainprev} and \ref{batchthm}, we see that
the batch learning algorithm is uniformly superior for $\beta \geq

611: 0$, and the only one of the three to achieve \emph{sublinear}

612: performance whenever $\beta

613: > 0$ (the other two \emph{never} do better than linearly, unless

614: the distribution $\mathcal{F}$ is supported away from $1.$) On

615: the other hand, for $\beta < 0$, the batch learning algorithm

616: performs comparably to the memoryless learner algorithm, and

617: worse than learning with full memory.

618: %\section{$\alpha <1$}

619: %The same method as in section \ref{isdef} under the assumption

620: %that the $k$-th moment is asymptotic to $k^\alpha$ (this time for

621: %$\alpha \leq 1$) can be used to write

622: %\begin{equation}

623: %\begin{split}

624: %T_2 &= n^{1/alpha} \int_0^{n^{1/\alpha}} \frac{\left[1-n

625: %m(n^{1/\alpha}/u) - (1-m(n^{1/\alpha}/u)^n \right]}{u^2} d u +

626: %O(1)\\ &= n^{1/\alpha}\left(\int_0^{n^{1/2\alpha}} +

627: %\int_{n^{1/2\alpha}}^{n^{1/\alpha}}\right) \frac{\left[1-

628: %m^\prime(u) u^\alpha - (1-m^\prime(u)u^\alpha/n)^n \right]}{u^2}

629: %d u + O(1).

630: %\end{split}

631: %\end{equation} If $1/2 < \alpha < 1$, the argument finishes in

632: %exactly the same way as in section \ref{isdef}, to give us $T_2

633: %\asymp C n^{1/\alpha}$. However, if $\alpha = 1$, we get an

634: %additional term of $C_2 n \log n$, where $C_2 = \lim_{j

635: %\rightarrow \infty} m_j$. This term is equal (with opposing sign)

636: %to the center of the stable law satisfied by $T_1$, so in case

637: %$\alpha = 1$, we see that $T$ has no expectation but satisfies a

638: %law of large numbers, with center linear in $n$. If $\alpha \leq

639: %1/2$, the integral diverges.

640: \begin{thebibliography}{xxxxxxxxxxxx}

641:

642: \bibitem[BenOrsz]{benorsz}

643: C.~M.~Bender and S.~Orszag (1999) \textit{Advanced mathematical

644: methods for scientists and engineers, I,\/} Springer-Verlag, New

645: York.

646:

647: \bibitem[KNN2001]{knn}

648: Komarova, N.~L., Niyogi,~P. and Nowak,~M.~A. (2001) The evolutionary

649: dynamics of grammar acquisition, \textit{J.~Theor.~Biology}, {\bf

650: 209}(1), pp. 43-59.

651:

652: \bibitem[KN2001]{kn}

653: Komarova, N.~L. and Nowak, M.~A. (2001) Natural selection of the

654: critical period for grammar acquisition, {\it Proc. Royal Soc.

655: B}, to appear.

656:

657: \bibitem[KR2001a]{kr1}

658: Komarova, N.~L. and Rivin, I. (2001) Harmonic mean, random

659: polynomials and stochastic matrices, \emph{preprint}.

660:

661: \bibitem[KR2001b]{kr2}

662: Komarova, N.~L. and Rivin, I. (2001) On the mathematics of

663: learning.

664:

665: \bibitem[Niyogi1998]{niy}

666: Niyogi, P. (1998). {\it The Informational Complexity of

667: Learning}. Boston: Kluwer.

668:

669: \bibitem[NKN2001]{nkn}

670: Nowak, M.~A., Komarova,~N.~L., Niyogi,~P. (2001) Evolution of

671: universal grammar, \textit{Science} \textbf{291}, 114-118.

672:

673: \end{thebibliography}

674:

675:

676: \end{document}

677:

678: