1: %\documentclass[12pt]{gen-j-l}
2: \documentclass[12pt]{amsart}
3: \usepackage{amsmath}
4: \usepackage{amsfonts,amssymb,amsthm}
5: \usepackage{graphicx}
6:
7: %
8: \newcommand{\beq}{\begin{equation}}
9: \newcommand{\eeq}{\end{equation}}
10: \newcommand{\bbar}{\begin{eqnarray}}
11: \newcommand{\eear}{\end{eqnarray}}
12: %
13:
14:
15: \newcommand{\thm}[2]{\begin{#1} #2 \end{#1}}
16: \newcommand{\excess}{\mathrm{excess\:}}
17: \newcommand{\sgn}{\mathrm{sgn\:}}
18: \newcommand{\realpart}{\mathrm{Re\:}}
19: \newcommand{\imagpart}{\mathrm{Im\:}}
20:
21: \newcommand{\logdet}{\log \det \Delta}
22: \newcommand{\tr}{\mathrm{tr\:}}
23: \newcommand{\diameter}{\mathrm{diameter\:}}
24: \newcommand{\area}{\mathrm{area\:}}
25:
26: \newcommand{\Sim}{\mathrm{Sim\:}}
27: \newcommand{\num}{\mathcal{N\:}}
28: %\newcommand{\arg}{\mathrm{arg\:}}
29: \newcommand{\dilatation}{\mathrm{dilatation\:}}
30: \newtheorem{theorem}{Theorem}[section]
31: \newtheorem{itheorem}{Theorem}[section]
32: \newtheorem{lemma}[theorem]{Lemma}
33: \newtheorem{ilemma}[itheorem]{Lemma}
34: \newtheorem{corollary}[theorem]{Corollary}
35: \newtheorem{conjecture}[theorem]{Conjecture}
36: \newtheorem{question}[theorem]{Question}
37: \newtheorem{claim}[theorem]{Claim}
38: \newtheorem{observation}[theorem]{Observation}
39: \newtheorem{iobservation}[itheorem]{Observation}
40: \newtheorem{remark}[theorem]{Remark}
41: \newtheorem{condition}[theorem]{Condition}
42: \newtheorem{example}[theorem]{Example}
43: \newtheorem{definition}[theorem]{Definition}
44: \newtheorem{xca}[theorem]{Exercise}
45: \newtheorem{note}[theorem]{Note}
46:
47: %\input{montreref}
48:
49: \begin{document}
50:
51: %-------------- Author entries --------------------
52: \title{The performance of the batch learning algorithm}
53: %
54:
55:
56: \author{Igor Rivin}
57: \address{Mathematics department, University of Manchester,
58: Oxford Road, Manchester M13 9PL, UK}
59: \address{Mathematics Department, Temple University,
60: Philadelphia, PA 19122}
61: \address{Mathematics Department, Princeton University, Princeton,
62: NJ 08544}
63: %
64: \email{irivin@math.princeton.edu} \thanks{The author would like
to thank the EPSRC and the NSF for support, and Natalia Komarova
66: and Ilan Vardi for useful conversations. }
67:
68: \subjclass{60E07, 60F15, 60J20, 91E40, 26C10} \keywords{ learning
69: theory, zeta functions, asymptotics}
70: %
71: \begin{abstract}
72: We analyze completely the convergence speed of the \emph{batch
73: learning algorithm}, and compare its speed to that of the
74: memoryless learning algorithm and of learning with memory (as
75: analyzed in \cite{kr2}). We show that the batch learning
76: algorithm is never worse than the memoryless learning algorithm
77: (at least asymptotically). Its performance \emph{vis-a-vis}
learning with full memory is less clear-cut, and depends on
79: certain probabilistic assumptions.
80: \end{abstract}
81: %
82: \maketitle
83:
84: \renewcommand{\theitheorem}{\Alph{itheorem}}
85: %
86: \section*{Introduction}
87: The original motivation for the work in this paper was provided
88: by research in learning theory, specifically in various models
89: of language acquisition (see, for example, \cite{knn,nkn,kn}). In
the paper \cite{kr2}, we studied the speed of convergence of
the \emph{memoryless learner algorithm}, and also of
\emph{learning with full memory}. Since the \emph{batch learning
algorithm} is both widely known and believed by learning theorists
to be faster (at the cost of memory) than both of the above
methods, it seemed natural to analyze its behavior
96: under the same set of assumptions, in order to bring the analysis
97: in \cite{kr1} and \cite{kr2} to a sort of closure. It should be
98: noted that the detailed analysis of the batch learning algorithm
99: is performed under the assumption of \emph{independence}, which
100: was not explicitly present in our previous work. For the
101: impatient reader we state our main result (Theorem
102: \ref{batchthm}) immediately (the reader can compare it with the
103: results on the memoryless learning algorithm and learning with
104: full memory, as summarized in Theorem \ref{mainprev}):
105: %
106: \begin{itheorem}
107: Let $N_\Delta$ be the number of steps it takes for the student
108: to have probability $1 - \Delta$ of learning the
109: concept using the batch learner algorithm. Then we have the following estimates for $N_\Delta$:
110: %
111: \begin{itemize}
112: \item
113: if the distribution of overlaps is \emph{uniform}, or more
generally, the density function $f$ satisfies
$f(1-x) = c + O(x^\delta)$ as $x \rightarrow 0$, with $\delta, c > 0,$ then there exist positive
116: constants $C_1, C_2$ such that
117: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(C_1 <
118: \frac{N_\Delta}{(1- \Delta)^2 n} < C_2\right) = 1$$
119: %
120: %
121: \item
122: if the probability density function $f(1-x)$ is asymptotic to $c x^\beta
123: + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as $x$ approaches
124: $0$, then
125: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(c_1 <
126: \frac{N_\Delta}{|\log \Delta|n^{\frac{1}{1+\beta}}} < c_2\right) = 1,$$
127: for some positive constants $c_1, c_2$;
128: %
129: \item
130: if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
131: %
132: $$\lim_{x \rightarrow \infty} \mathbf{P}\left(\frac{1}{x} < \frac{N_\Delta}{|\log
133: \Delta| n^{1/(1+\beta)}} < x\right) = 1$$
134: \end{itemize}
135: \end{itheorem}
136: % \begin{itheorem}
137: % Let $N_\Delta$ be the number of steps it takes
138: % for the student (with probability $1$) to have probability $1 -
139: % \Delta$ of learning the concept using the batch learner
140: % algorithm. Then we have the following estimates for $N_\Delta$:
141: % %
142: % \begin{itemize}
143: % \item
144: % If the distribution of overlaps is \emph{uniform}, or more
145: % generally, the density function $f(1-x)$ at $0$ has the form
146: % $f(x) = c + O(x^\delta),$ $\delta, c > 0,$ then $N_\Delta=|\log
147: % \Delta|\Theta(n)$
148: % %
149: % \item
150: % If the probability density function $f(1-x)$ is asymptotic to
151: % $x^\beta + O(x^{\beta - \delta}), \quad \delta, \beta > 0$, as
152: % $x$ approaches $0$, then we have $N_\Delta=|\log
153: % \Delta|\Theta(n^{1/(1+\beta)})$;
154: % %
155: % \item
156: % If the asymptotic behavior is as above, but $-1/2 < \beta < 0$,
157: % then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$
158: % %
159: % \end{itemize}
160: % \end{itheorem}
161: The plan of the paper is as follows: in this Introduction we
162: recall the learning algorithms we study; in Section \ref{mathmod}
163: we define our mathematical model; in Section 2 we recall our
164: previous results, in Section 3 we begin the analysis of the batch
165: learning algorithm, and introduce some of the necessary
166: mathematical concepts; in Sections 4-6 we analyze the three cases
167: stated in Theorem A, and we summarize our findings in Section 7.
168: \subsection*{Memoryless Learning and Learning with Full Memory}
169: The general setup is as follows: There is a collection of
170: concepts $R_0, \dots, R_n$ and words which refer to these
171: concepts, sometimes ambiguously. The teacher generates a stream
172: of words, referring to the concept $R_0$. This is not known to
173: the student, but he must learn by, at each step, guessing some
174: concept $R_i$ and checking for consistency with the teacher's
175: input. The \emph{memoryless learner algorithm} consists of
176: picking a concept $R_i$ at random, and sticking by this choice,
177: until it is proven wrong. At this point another concept is
178: picked randomly, and the procedure repeats. \emph{Learning with
179: full memory} follows the same general process with the important
180: difference that once a concept is rejected, the student never
181: goes back to it. It is clear (for both algorithms) that once the
182: student hits on the right answer $R_0$, this will be his final
answer. We would like to estimate the probability of having
guessed the right answer after $k$ steps, and also the
185: expected number of steps before the student settles on the right
186: answer.
187:
188: \subsection*{Batch Learning} The batch learning situation is
189: similar to the above, but here the student records the words
190: $w_1, \dots, w_k, \dots$ he gets from the teacher. For each word
$w_i$, we assume that the student can find (in his textbook, for
192: example) a list $L_i$ of concepts referred to by the word. If we
193: define
194: \begin{equation*}
195: \mathcal{L}_k = \bigcap_{i=1}^k L_i,
196: \end{equation*}
197: then we are interested in the smallest value of $k$ such that
198: $\mathcal{L}_k = \{R_0\}$. This value $k_0$ is the time it has
199: taken the student to learn the concept $R_0$. We think of $k_0$
200: as a random variable, and we wish to estimate its expectation.
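
\medskip\noindent
\textbf{Illustration.}
As a toy instance (the particular lists below are hypothetical, chosen
only to fix ideas), take $n = 2$ and suppose that the first two words
received have lists $L_1 = \{R_0, R_1\}$ and $L_2 = \{R_0, R_2\}$. Then
\begin{equation*}
\mathcal{L}_1 = \{R_0, R_1\}, \qquad
\mathcal{L}_2 = L_1 \cap L_2 = \{R_0\},
\end{equation*}
so for this particular stream of words $k_0 = 2$.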
201: %
202: \section{The mathematical model}
203: \label{mathmod}
204: We think of the words referring to the concept
205: $R_0$ as a probability space $\mathcal{P}$. The probability that
one of these words also refers to the concept $R_i$ shall be
207: denoted by $p_i$; the probability that a word refers to concepts
208: $R_{i_1}, \dots, R_{i_k}$ shall be denoted by $p_{i_1 \dots
209: i_k}$. All the results described below (obviously) depend in a
210: crucial way on the $p_1, \dots, p_n$ and (in the case of the
211: batch learning algorithm) also on the joint probabilities. Since
212: there is no \emph{a priori} reason to assume specific values for
213: the probabilities, we shall assume that all of the $p_i$ are
214: themselves \emph{independent, identically distributed random
215: variables}. We shall refer to their common distribution as
216: $\mathcal{F}$, and to the density as $f$. It turns out that the
217: convergence properties of the various learning algorithms depend
218: on the local analytic properties of the distribution
$\mathcal{F}$ at $1$ -- a moment's reflection will convince the
220: reader that this is not really so surprising.
221:
Sharper analysis of the batch learning algorithm
depends on the \emph{independence hypothesis}:
224: $$
225: p_{i_1 \dots i_k} = p_{i_1} \dots p_{i_k}.
226: $$
227: It is again not too surprising that some such assumption on
228: correlations ought to be required for precise asymptotic results,
229: though it is obviously the subject of a (non-mathematical) debate
230: as to whether assuming that the various concepts are truly
231: independent is reasonable from a cognitive science point of view.
232:
233: \section{Previous results}
234: In previous work \cite{kr1} and \cite{kr2} we obtained the
235: following result.
236: \thm{theorem}
237: {
238: \label{mainprev}
239: Let $N_\Delta$ be the number of steps it takes for the student
240: to have probability $1 - \Delta$ of learning the
241: concept. Then we have the following estimates for $N_\Delta$:
242: %
243: \begin{itemize}
244: \item
245: if the distribution of overlaps is \emph{uniform}, or more
generally, the density function $f$ satisfies
$f(1-x) = c + O(x^\delta)$ as $x \rightarrow 0$, with $\delta, c > 0,$ then there exist positive
248: constants $C_1, C_2, C_1', C_2'$ such that
249: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(C_1 <
250: \frac{N_\Delta}{|\log \Delta|n \log n} < C_2\right) = 1$$
251: for
252: the memoryless algorithm and
253: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(C'_1 <
254: \frac{N_\Delta}{(1- \Delta)^2 n \log n} < C'_2\right) = 1$$
255: %
256: when learning with full memory;
257: %
258: \item
259: if the probability density function $f(1-x)$ is asymptotic to $c x^\beta
260: + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as $x$ approaches
261: $0$, then for the two algorithms we have respectively
262: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(c_1 <
263: \frac{N_\Delta}{|\log \Delta|n} < c_2\right) = 1,$$
264: and
265: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(c_1' <
266: \frac{N_\Delta}{(1- \Delta)^2 n } < c_2'\right) = 1$$
267: %
268: for some positive constants $c_1, c_2, c_1', c_2'$;
269: %
270: \item
271: if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
272: %
273: $$\lim_{x \rightarrow \infty} \mathbf{P}\left(\frac{1}{x} < \frac{N_\Delta}{|\log
274: \Delta| n^{1/(1+\beta)}} < x\right) = 1$$
275: %
276: for the memoryless learning algorithm, and similarly
277: %
278: $$\lim_{x \rightarrow \infty} \mathbf{P}\left(\frac{1}{x} < \frac{N_\Delta}{(1-\Delta)^2 n^{1/(1+\beta)}} < x\right) = 1$$
279: %
280: for learning with full memory.
281: %
282: \end{itemize}}
283: %
284: \noindent Recall that $f(x) = \Theta(g(x))$ means that for
285: sufficiently large $x$, the ratio $f(x)/g(x)$ is bounded between
286: two strictly positive constants. The distribution of overlaps
287: referred to above is simply the distribution $\mathcal{F}$.
288: Notice that the theorem says nothing about the situation when
289: $\mathcal{F}$ is supported in some interval $[0, a]$, for $a<1$.
290: That case is (presumably) of scientific interest, but
291: mathematically it is relatively trivial: we replace the arguments
292: of all the $\Theta$s above by $1$, though, of course, we are
293: thereby hiding the dependence on $a$.
294:
295: \section{General bounds on the batch learner algorithm}
296:
297: Consider a set of words $w_1, \dots, w_k$. The probability that
they all refer to the concept $R_i$ is, obviously, $p_i^k$.
299: \begin{lemma}
300: \label{bounds}
301: The probability $q_k$ that we still have not
302: learned the concept $R_0$ after $k$ steps is bounded above by
303: $\sum_{i=1}^n p_i^k$, and below by $\max_i p_i^k$.
304: \end{lemma}
305: \begin{proof}
306: Immediate.
307: \end{proof}
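
\medskip\noindent
\textbf{Sketch of the argument.}
In slightly more detail, and assuming (as in the computation of
$p_i^k$ above) that the words are drawn independently from
$\mathcal{P}$, write $A_i$ for the event that all of $w_1, \dots, w_k$
refer to $R_i$ (the notation $A_i$ is introduced only for this
sketch). The concept has not been learned after $k$ steps exactly when
some $A_i$ with $1 \leq i \leq n$ occurs, so
\begin{equation*}
\max_{1 \leq i \leq n} p_i^k = \max_{1 \leq i \leq n} \mathbf{P}(A_i)
\leq q_k = \mathbf{P}\left(\bigcup_{i=1}^n A_i\right)
\leq \sum_{i=1}^n \mathbf{P}(A_i) = \sum_{i=1}^n p_i^k.
\end{equation*}
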
308: We will first use these upper and lower
309: bounds to get corresponding bounds on the convergence speed of
310: the batch learner algorithm, and then invoke the independence
311: hypothesis to sharpen these bounds in many cases.
312:
313: We begin with a trivial but useful lemma.
314: %
315: \begin{lemma}
316: \label{rearrange}
317: Let $G$ be a game where the probability of
318: success (respectively failure) after at most $k$ steps is $s_k$
319: (respectively $f_k = 1-s_k $). Then the expected number of steps
320: until success is
$$\sum_{k=1}^\infty k (s_k - s_{k-1}) = \sum_{k=0}^\infty (1 - s_k) = 1 +
\sum_{k=1}^\infty f_k,$$ where $s_0 = 0$, if the corresponding sum converges.
323: \end{lemma}
324: \begin{proof}
325: The proof is immediate from the definition of expectation and the
possibility of rearrangement of terms of positive series.
327: \end{proof}
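
\medskip\noindent
\textbf{Illustration.}
As a quick sanity check, assume (for this example only) that each
step fails independently with probability $p < 1$. Then $f_k = p^k$
and $s_k = 1 - p^k$, the time to success is geometric, and
\begin{equation*}
\sum_{k=1}^\infty k (s_k - s_{k-1}) = (1-p)\sum_{k=1}^\infty k p^{k-1}
= \frac{1}{1-p} = \sum_{k=0}^\infty p^k = \sum_{k=0}^\infty f_k.
\end{equation*}
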
328: We can combine Lemma \ref{rearrange} and Lemma \ref{bounds} to
329: obtain:
330: \begin{theorem}
331: \label{sumbounds} The expected time $T$ of convergence of the
332: batch learner algorithm is bounded as follows:
333: \begin{equation}
334: \label{trivest} \sum_{i=1}^n \frac{1}{1-p_i} \geq T \geq
335: \max_{1\leq i \leq n} \frac{1}{1-p_i}.
336: \end{equation}
337: \end{theorem}
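
\medskip\noindent
\textbf{Illustration.}
The two bounds arise by summing the estimates of Lemma \ref{bounds}
over $k \geq 0$ as geometric series, using $\sum_{k=0}^\infty p_i^k =
1/(1-p_i)$. For a concrete instance (the numerical values are
hypothetical, chosen only for illustration), take $n = 2$, $p_1 =
1/2$ and $p_2 = 3/4$. Then
\begin{equation*}
\frac{1}{1-p_1} + \frac{1}{1-p_2} = 2 + 4 = 6 \geq T \geq 4 =
\max\left(\frac{1}{1-p_1}, \frac{1}{1-p_2}\right).
\end{equation*}
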
338: The leftmost term in equation (\ref{trivest}) has been studied at
339: length in \cite{kr1}. We state a version of the results of
340: \cite{kr1} below:
341: \begin{theorem}
342: \label{allstab} Let $S=\sum_{i=1}^n \frac{1}{1-p_i},$ where the
$p_i$ are independent, identically distributed random variables
with values in $[0, 1]$, with probability density $f$, such that
$f(1-x) = x^\beta + O(x^{\beta + \delta}),\quad \delta > 0$ for
$x\rightarrow 0$. If $\beta > 0$, then there exists a mean
$m$, such that $\lim_{n \rightarrow \infty} \mathbb{P}(|S/n - m|
> \epsilon) = 0,$ for any $\epsilon > 0.$ If $\beta = 0$, then
$\lim_{n \rightarrow \infty} \mathbb{P}(|S/(n\log n) - 1|
> \epsilon) = 0.$ Finally, if
$-1 \leq \beta < 0,$ then $\lim_{n \rightarrow \infty}
\mathbb{P}(S/n^{1/(\beta+1)} - C
> a) = g(a),$ where $\lim_{a \rightarrow \infty} g(a)= 0,$ and $C$ is
an arbitrary (but fixed) constant, and likewise
$$\mathbb{P}(S/n^{1/(\beta + 1)} < b) = h(b),$$ where $\lim_{b \rightarrow 0}h(b) = 0.$
356: \end{theorem}
357: The right hand side of Eq. (\ref{trivest}) is easier to
358: understand. Indeed, let $p_1, \dots, p_n$ be distributed as usual
359: (and as in the statement of Theorem \ref{allstab}). Then
360: %
361: \begin{theorem}\label{expmin}
362: $$\lim_{n\rightarrow \infty}
363: n^{\frac{1}{1+\beta}} \mathbf{E}\left(1-\max_{1 \leq i \leq n}
364: p_i\right) = C,$$
365: for some positive constant $C$.
366: \end{theorem}
367: \begin{proof}
368: First, we change variables to $q_i = 1 - p_i$. Obviously, the
statement of the Theorem is equivalent to the statement that
$$E =
\mathbf{E}(\min_{1 \leq i \leq n} q_i) \sim C n^{-1/(1+\beta)}.$$ We
also write $h(x) = f(1-x),$ and let $H$ be the distribution function
whose density is $h,$ so that $H(x) = 1 - F(1-x).$
Now, the probability that all of the $q_i$ are
greater than $t$ equals $(1-H(t))^n,$ so the distribution function of
$\min_i q_i$ is $1-(1-H(t))^n,$ and
$$E = \int_0^1 t~d\left[1-(1-H(t))^n\right] = \int_0^1 (1-H(t))^n d t.$$
We change variables $t = u/n^{1/(1+\beta)}$, to obtain
\begin{equation}
\label{firstint} E = \frac{1}{n^{\frac{1}{1+\beta}}}
\int_0^{n^{\frac{1}{{1+\beta}}}} \left(1-H\left(\frac{u}{n^{1/(1+\beta)}}\right)\right)^n du.
\end{equation}
Let us write $n^{\frac{1}{1+\beta}} E = E_1(n) + E_2(n),$ where
383: \begin{gather}
384: \label{secondint}
385: E_1(n) = \int_0^{n^{\frac{1}{3 (\beta + 1)}}}
386: \left[1-H\left(\frac{u}{n^{1/(1+\beta)}}\right)\right]^n du,\\
387: E_2(n) = \int_{n^{\frac{1}{3 (\beta + 1)}}}^{n^{\frac{1}{1 + \beta}}}
\left[1-H\left(\frac{u}{n^{1/(1+\beta)}}\right)\right]^n du.
389: \end{gather}
Since $h(x) = c x^\beta + O(x^{\beta + \delta}),$ integration gives
\begin{equation}
\label{asest}
H(x) = \frac{c}{1+\beta} x^{\beta+1} + O(x^{\beta + \delta + 1}).
\end{equation}
Let
\begin{equation}
\label{eeint}
\mathcal{I} = \int_0^\infty \exp\left(-\frac{c}{1+\beta} x^{1+\beta}\right) d x.
\end{equation}
400: We now show:
401: \begin{equation}
402: \label{secondint1}
403: \lim_{n \rightarrow \infty} E_1(n) = \mathcal{I}.
404: \end{equation}
405: This is an immediate consequence of Lemma \ref{explem} and Eq. (\ref{asest}).
406: Also,
407: \begin{equation}
408: \label{secondint2}
409: \lim_{n \rightarrow \infty} E_2(n) = 0.
410: \end{equation}
411: Since $H$ is a monotonically increasing function, it is sufficient to
412: show that
413: $$\lim_{n\rightarrow \infty} n^{\frac{1}{1 + \beta}}
\left[1-H\left(n^{-\frac{2}{3 (1 + \beta)}}\right)\right]^n = 0.$$
415: This is immediate from Eq. (\ref{asest}) and Lemma \ref{explem}.
416: \end{proof}
417: \begin{remark}
The argument shows that $C = \mathcal{I},$ where $C$ is the constant
in the statement of Theorem \ref{expmin}, and $\mathcal{I}$ is the integral
420: introduced in Eq. (\ref{eeint}).
421: \end{remark}
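
\medskip\noindent
\textbf{Illustration.}
The simplest check is the uniform distribution on $[0, 1]$ (so that
$\beta = 0$ and $H(t) = t$), where everything can be computed in
closed form:
\begin{equation*}
\mathbf{E}\left(\min_{1 \leq i \leq n} q_i\right) = \int_0^1 (1-t)^n d t
= \frac{1}{n+1}, \qquad
\lim_{n \rightarrow \infty} \frac{n}{n+1} = 1 = \int_0^\infty e^{-u} d u.
\end{equation*}
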
422: \begin{lemma}
423: \label{explem}
Let $f_n(x) = (1-x/n)^n,$ and let $z = x/n$ satisfy $0 \leq z < 1/2.$ Then
425: $$f_n(x) = \exp(-x)\left[1-\frac{x^2}{2 n} + O\left(\frac{x^3}{n^2}\right)\right].$$
426: \end{lemma}
427: \begin{proof}
428: Note that
429: $$\log f_n(x) = n \log(1-x/n) = -x - \sum_{k=2}^\infty \frac{x^k}{kn^{k-1}}.$$
430: The assertion of the lemma follows by exponentiating the two sides of
431: the above equation.
432: \end{proof}
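
\medskip\noindent
\textbf{Illustration.}
As a numerical sanity check (the values $n = 100$ and $x = 1$ are
chosen arbitrarily for illustration),
\begin{equation*}
\left(1 - \frac{1}{100}\right)^{100} \approx 0.36603, \qquad
e^{-1}\left(1 - \frac{1}{200}\right) \approx 0.36604,
\end{equation*}
and the discrepancy, less than $10^{-5}$, is well within a term of
order $x^3/n^2 = 10^{-4}$.
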
433: We need one final observation:
434: \begin{theorem}
435: The variable $n^{1/(1+\beta)} \min_{i=1}^n q_i$ has a limiting
436: distribution with distribution function $G(x) =
437: 1-\exp(-x^{1+\beta}).$
438: \end{theorem}
439: \begin{proof}
440: Immediate from the proof of Theorem \ref{expmin}.
441: \end{proof}
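
\medskip\noindent
\textbf{Sketch of the argument.}
In more detail, write $\kappa$ for the leading coefficient of $H$, so
that $H(t) = \kappa t^{1+\beta} + O(t^{1 + \beta + \delta})$ (the
symbol $\kappa$ is introduced only for this sketch; the statement
above corresponds to the normalization $\kappa = 1$). For fixed $x >
0$,
\begin{equation*}
\mathbf{P}\left(n^{\frac{1}{1+\beta}} \min_{1 \leq i \leq n} q_i > x\right)
= \left[1 - H\left(\frac{x}{n^{1/(1+\beta)}}\right)\right]^n
= \left[1 - \frac{\kappa x^{1+\beta}}{n} +
O\left(n^{-1-\frac{\delta}{1+\beta}}\right)\right]^n
\longrightarrow \exp\left(-\kappa x^{1+\beta}\right)
\end{equation*}
as $n \rightarrow \infty$.
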
442:
443:
444: We can now put together all of the above results as follows.
445: \begin{theorem}
446: \label{allgen}
Let $p_1, \dots, p_n$ be independently distributed
with common density function $f$, such that $f(1-x) = c x^\beta +
O(x^{\beta + \delta}),$ $\delta > 0$. Let $T$ be the expected
time of the convergence of the batch learning algorithm with
overlaps $p_1, \dots, p_n$. If $\beta > 0$, then there
exist $C_1, C_2$, such that $C_1 n^{1/(1+\beta)} \leq T \leq C_2
n$, with probability tending to $1$ as $n$ tends to $\infty$. If
$\beta = 0$, then there exist $C_1, C_2$, such that $C_1 n \leq T
\leq C_2 n \log n$, with probability tending to $1$ as $n$ tends
to $\infty.$ If $-1 < \beta < 0$, then $C^{-1} n^{1/(\beta + 1)} \leq
T \leq C n^{1/(\beta + 1)}$ with probability tending to $1$ as
$C$ goes to infinity.
459: \end{theorem}
460:
The reader will remark that in the case that $\beta < 0$, the
462: upper and lower bounds have the same order of magnitude as
463: functions of $n$.
464:
465: \section{Independent concepts}
466: We now invoke the independence hypothesis, whereby an application of the
467: inclusion-exclusion principle gives us:
468:
469: \thm{lemma}{\label{latmost} The probability $l_k$ that we have
470: learned the concept $R_0$ after $k$ steps is given by
471: $$
472: l_k=\prod_{i=1}^n(1-p_i^k).
473: $$
474: }
475:
476: Note that the probability $s_k$ of winning the game \emph{on the
477: $k$-th step} is given by $s_k = l_k - l_{k-1}= (1-l_{k-1}) -
478: (1-l_k)$. Since the expected number of steps $T$ to learn the
479: concept is given by
480: $$T = \sum_{k=1}^\infty k s_k,$$
481: we immediately have $$T = \sum_{k=1}^\infty (1-l_k)$$
482: %
483: \begin{lemma}
484: \label{letime} The expected time $T$ of learning the
485: concept $R_0$ is given by
486: \begin{equation}
487: \label{letimeeq}
488: T = \sum_{k=1}^\infty \left(1-\prod_{i=1}^n
489: \left(1-p_i^k\right)\right).
490: \end{equation}
491: \end{lemma}
492: %
493: Since the sum above is absolutely convergent, we can expand the
494: products and interchange the order of summation to get the
495: following formula for $T$:
496:
497: \medskip\noindent
498: \textbf{Notation.}
499: Below, we identify subsets of $\{1, \dots, n\}$ with
multi-indices (in the obvious way), and if $s = \{i_1, \dots, i_l\},$ then
501: $$p_s \stackrel{\mbox{def}}= p_{i_1} \cdots p_{i_l}.$$
502:
503: \begin{lemma}
504: The expression Eq. (\ref{letimeeq}) can be rewritten as:
505: \begin{equation}
\label{subsum} T = \sum_{\emptyset \neq s\subseteq \{1, \dots, n\}}
(-1)^{|s|-1} \left(\frac{1}{1-p_s} - 1\right).
508: \end{equation}
509: \end{lemma}
510:
511: \begin{proof}
512: With notation as above,
513: \begin{equation*}
\prod_{i=1}^n \left(1-p_i^k\right) =
515: \sum_{s \subseteq \{1, \dots, n\}} (-1)^{|s|} p_s^k,
516: \end{equation*}
517: so
518: \begin{equation*}
519: \begin{split}
520: T &= \sum_{k=1}^\infty \left(1 - \prod_{i=1}^n
521: \left(1-p_i^k\right)\right)\\
522: &= \sum_{k=1}^\infty \left(1-\sum_{s \subseteq \{1, \dots, n\}}
523: (-1)^{|s|} p_s^k\right)\\
&= \sum_{\emptyset \neq s\subseteq \{1, \dots, n\}} (-1)^{|s|-1}
\sum_{k=1}^\infty p_s^k \\
&= \sum_{\emptyset \neq s\subseteq \{1, \dots, n\}}
(-1)^{|s|-1} \left(\frac{1}{1-p_s} - 1\right),
528: \end{split}
529: \end{equation*}
530: where the change in the order of summation is permissible since all
531: sums converge absolutely.
532: \end{proof}
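
\medskip\noindent
\textbf{Illustration.}
For $n = 2$ the nonempty subsets are $\{1\}$, $\{2\}$ and $\{1, 2\}$,
and the formula reads
\begin{equation*}
\begin{split}
T &= \left(\frac{1}{1-p_1} - 1\right) + \left(\frac{1}{1-p_2} - 1\right)
- \left(\frac{1}{1-p_1 p_2} - 1\right)\\
&= \frac{p_1}{1-p_1} + \frac{p_2}{1-p_2} - \frac{p_1 p_2}{1-p_1 p_2},
\end{split}
\end{equation*}
which is what one obtains directly by summing $1 - (1-p_1^k)(1-p_2^k)
= p_1^k + p_2^k - (p_1 p_2)^k$ over $k \geq 1$.
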
533: Formula (\ref{subsum}) is useful in and of itself, but we now
534: use it to analyse the statistical properties of the time of
535: success $T$ under our distribution and independence assumptions.
536: For this we shall need to study the \emph{moment zeta function} of a
537: probability distribution, introduced below. Its detailed properties
538: are investigated in my paper \cite{zeta}, where Theorems \ref{t1},
539: \ref{alpha1asymp} and \ref{alpha1asymp2}
540: below are proved. Below we summarize the definitions and the
541: results.
542: %
543: \subsection{Moment zeta function}
544: \begin{definition}
545: \label{zdef} Let $\mathcal{F}$ be a probability
546: distribution on a (possibly infinite) interval $I$, and let
547: $m_k(\mathcal{F}) = \int_I x^k\mathcal{F}(d x)$ be the $k$-th moment
548: of $\mathcal{F}$. Then the \emph{moment zeta function of
549: $\mathcal{F}$} is defined to be $$\zeta_{\mathcal{F}}(s) =
550: \sum_{k=1}^\infty m_k^s(\mathcal{F}),$$ whenever the sum is defined.
551: \end{definition}
552: %
553: The definition is, in a way, motivated by the following:
554:
555: \begin{lemma}
556: \label{zetalemma} Let $\mathcal{F}$ be a probability
557: distribution as above, and let $x_1, \dots, x_n$ be independent
558: random variables with common distribution $\mathcal{F}$. Then
559: \begin{equation}
560: \mathbb{E}\left(\frac{1}{1-x_1 \dots x_n}\right) =
561: \zeta_{\mathcal{F}}(n).
562: \end{equation}
563: In particular, the expectation is undefined whenever the zeta
564: function is undefined.
565: \end{lemma}
566: %
567: \begin{proof}
568: Expand the fraction in a geometric series and apply Fubini's
569: theorem.
570: \end{proof}
571: %
572: \begin{example}
573: For $\mathcal{F}$ the uniform distribution on
574: $[0, 1]$, $\zeta_{\mathcal{F}}$ is the familiar Riemann zeta
575: function.
576: \end{example}
577:
Using standard techniques of asymptotic analysis, the following can be
579: shown (see \cite{zeta}):
580: \begin{theorem}
581: \label{momasymp}
582: Let $\mathcal{F}$ be a continuous distribution supported in $[0, 1],$
583: let $f$ be the density of the distribution $\mathcal{F}$, and
584: suppose that $f(1-x) = c x^\beta + O(x^{\beta + \delta}),$ for some
585: $\delta > 0.$ Then the $k$-th moment of $\mathcal{F}$ is asymptotic to
$C k^{-(1+\beta)},$ for $C = c \Gamma(\beta + 1).$
587: \end{theorem}
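
\medskip\noindent
\textbf{Heuristic sketch.}
The following computation is only a heuristic indication of where the
constant comes from (the rigorous argument is in \cite{zeta}).
Substituting $x = 1 - y$ and using $(1-y)^k \approx e^{-ky}$ for small
$y$,
\begin{equation*}
m_k(\mathcal{F}) = \int_0^1 (1-y)^k f(1-y) d y \approx
c \int_0^\infty e^{-k y} y^\beta d y = c \Gamma(\beta + 1) k^{-(1+\beta)}.
\end{equation*}
For the uniform distribution ($c = 1$, $\beta = 0$) this can be
checked exactly: $m_k = \int_0^1 x^k d x = 1/(k+1) \sim k^{-1}$.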
588:
589: \begin{corollary}
590: Under the assumptions of Theorem \ref{momasymp},
591: $\zeta_{\mathcal{F}}(s)$ is defined for $s
592: >1/(1+\beta)$.
593: \end{corollary}
594:
The moment zeta function can be used to analyze two of the three situations
occurring in the study of the batch learner algorithm.
In the sequel, we set $\alpha = \beta + 1$.
598: \subsection{$\alpha > 1$}
599: \label{isdef}
600: In this case, we use our assumptions to rewrite Eq.
601: (\ref{subsum}) as
602: \begin{equation}
603: \label{subsum2}
604: %
605: \mathbb{E}(T) = - \sum_{k=1}^n \binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k).
606: \end{equation}
607: This, in turn, can be rewritten (by expanding the definition of
608: zeta) as
609: \begin{equation}
610: \label{subsum3} \mathbb{E}(T) = - \sum_{j=1}^\infty
611: \left[\left(1-m_j(\mathcal{F})\right)^n-1\right] =
612: \sum_{j=1}^\infty \left[1- \left(1-m_j(\mathcal{F})\right)^n\right]
613: \end{equation}
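
\medskip\noindent
\textbf{Sketch of the computation.}
The passage from Eq. (\ref{subsum2}) to Eq. (\ref{subsum3}) can be
spelled out as follows: substituting $\zeta_{\mathcal{F}}(k) =
\sum_{j=1}^\infty m_j^k(\mathcal{F})$ and interchanging the finite sum
over $k$ with the sum over $j$, the binomial theorem gives, for each
$j$,
\begin{equation*}
\sum_{k=1}^n \binom{n}{k} (-1)^k m_j^k(\mathcal{F}) =
\left(1 - m_j(\mathcal{F})\right)^n - 1.
\end{equation*}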
614:
615: Using the moment zeta function we can show:
616: \begin{theorem}
617: \label{t1}
618: Let $\mathcal{F}$ be a continuous distribution supported on $[0, 1],$
619: and let $f$ be the density of $\mathcal{F}.$ Suppose further that
620: $$\lim_{x \rightarrow 1} \frac{f(x)}{(1-x)^{\beta}} = c,$$ for $\beta,
621: c > 0.$ Then,
622: \begin{equation*}
623: \begin{split}
624: \lim_{n\rightarrow \infty} n^{-\frac{1}{1+\beta}} \left[\sum_{k=1}^n
625: \binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k)\right] \\=
626: -\int_0^\infty
627: \frac{1-\exp\left(-c\Gamma(\beta+1)u^{1+\beta}\right)}{u^2} du\\
628: = - \left(c \Gamma(\beta + 1)\right)^{\frac{1}{\beta+1}}
629: \Gamma\left(\frac{\beta}{\beta + 1}\right).
630: \end{split}
631: \end{equation*}
632: \end{theorem}
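
\medskip\noindent
\textbf{Sketch of the computation.}
The evaluation of the integral above can be checked as follows, where
we abbreviate $A = c \Gamma(\beta + 1)$ (a symbol introduced only for
this sketch) and recall $\alpha = 1 + \beta > 1$. Substituting $t = A
u^\alpha$,
\begin{equation*}
\int_0^\infty \frac{1 - \exp\left(-A u^\alpha\right)}{u^2} d u =
\frac{A^{1/\alpha}}{\alpha} \int_0^\infty \left(1 - e^{-t}\right)
t^{-\frac{1}{\alpha} - 1} d t.
\end{equation*}
Integration by parts shows that $\int_0^\infty (1 - e^{-t}) t^{-s-1} d
t = \Gamma(1-s)/s$ for $0 < s < 1$; taking $s = 1/\alpha$ gives
\begin{equation*}
\frac{A^{1/\alpha}}{\alpha} \cdot \alpha\,
\Gamma\left(1 - \frac{1}{\alpha}\right) =
\left(c \Gamma(\beta + 1)\right)^{\frac{1}{\beta + 1}}
\Gamma\left(\frac{\beta}{\beta + 1}\right).
\end{equation*}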
633: %
634: \subsection{$\alpha = 1$}
635: \label{medalpha} In this case,
636: \begin{equation}
637: \label{asest02}
638: f(x) = L + o(1)
639: \end{equation} as $x$
640: approaches $1,$ and so Theorem \ref{momasymp} tells us that
641: \begin{equation}
642: \label{asest2}
643: \lim_{j \rightarrow \infty} j m_j(\mathcal{F}) = L.
644: \end{equation}
645: It is not hard to see that
646: $\zeta_{\mathcal{F}}(n)$ is defined for $n \geq 2$. We break up
647: the expression in Eq. (\ref{subsum}) as
648: \begin{equation}
\label{subsumm} T = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right) +
650: \sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1}
651: (-1)^{|s|-1}
652: \left(\frac{1}{1-p_s} - 1\right).
653: \end{equation}
654: Let
\begin{gather*} T_1 = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right),\\
656: T_2 = \sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1}
657: (-1)^{|s|-1}
658: \left(\frac{1}{1-p_s} - 1\right).
659: \end{gather*}
The first sum $T_1$ has
no expectation; however, $T_1/n$ does have a stable
distribution centered on $c \log n + c_2$. We will keep this in
663: mind, but now let us look at the second sum $T_2$. It can be
664: rewritten as
665: \begin{equation}
666: \label{subsumm2} T_2(n) = - \sum_{j=1}^\infty
667: \left[\left(1-m_j(\mathcal{F})\right)^n-1 + n
668: m_j(\mathcal{F})\right].
669: \end{equation}
670: We can again use the moment zeta function to analyse the properties of
671: $T_2,$ to get:
672: \begin{theorem}
673: \label{alpha1asymp}
674: Let $\mathcal{F}$ be a continuous distribution supported on $[0, 1],$
675: and let $f$ be the density of $\mathcal{F}.$ Suppose further that
$$\lim_{x \rightarrow 1} f(x) = c > 0.$$
677: Then,
678: $$\sum_{k=2}^n
679: \binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k) \sim c n \log n.
680: $$
681: \end{theorem}
To get error estimates, we need a stronger assumption on the function
683: $f$ than the weakest possible assumption made in Theorem
684: \ref{alpha1asymp}.
685:
686: \begin{theorem}
687: \label{alpha1asymp2}
688: Let $\mathcal{F}$ be a continuous distribution supported on $[0, 1],$
689: and let $f$ be the density of $\mathcal{F}.$ Suppose further that
$$f(x) = c + O\left((1-x)^\delta\right),$$ where $\delta > 0.$
Then,
$$\sum_{k=2}^n
\binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k) = c n \log n + O(n).
694: $$
695: \end{theorem}
696:
697: The conclusion differs somewhat from that of section
698: \ref{isdef} in that we get an
699: additional term of $c n \log n$, where $c = \lim_{x \rightarrow
700: 1} f(x) = \lim_{j \rightarrow \infty} j m_j$. This term is equal
701: (with opposing sign) to the center of the stable law satisfied by
702: $T_1$, so in case $\alpha = 1$, we see that $T$ has no
expectation but satisfies a \emph{law of large numbers}, of the following form:
704: %
705: \begin{theorem}[Law of large numbers]
706: There exists a constant $C$ such that $\lim_{y \rightarrow
707: \infty} \mathbf{P}(|T/n - C| > y) = 0.$
708: \end{theorem}
709: \section{$\alpha <1$}
710: \label{smallalpha} In this case the analysis goes through as in
711: the preceding section when $\alpha > 1/2$, but then runs into
712: considerable difficulties. However, in this case we note that
713: Theorem \ref{allgen} actually gives us tight bounds.
714: \section{The inevitable comparison}
715: We are now in a position to compare the performance of the batch
716: learning algorithm with that of the memoryless learning algorithm
717: and of learning with full memory, as summarized in Theorem
718: \ref{mainprev}. We combine our computations above with the
719: observation that the batch learner algorithm converges
720: geometrically (Lemma \ref{latmost}), to get:
721: %
722: \thm{theorem}
723: {
724: \label{batchthm}
725: Let $N_\Delta$ be the number of steps it takes for the student
726: to have probability $1 - \Delta$ of learning the
727: concept using the batch learner algorithm. Then we have the following estimates for $N_\Delta$:
728: %
729: \begin{itemize}
730: \item
731: if the distribution of overlaps is \emph{uniform}, or more
generally, the density function $f$ satisfies
$f(1-x) = c + O(x^\delta)$ as $x \rightarrow 0$, with $\delta, c > 0,$ then there exist positive
734: constants $C_1, C_2$ such that
735: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(C_1 <
736: \frac{N_\Delta}{(1- \Delta)^2 n} < C_2\right) = 1$$
737: %
738: %
739: \item
740: if the probability density function $f(1-x)$ is asymptotic to $c x^\beta
741: + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as $x$ approaches
742: $0$, then
743: $$\lim_{n \rightarrow \infty} \mathbf{P}\left(c_1 <
744: \frac{N_\Delta}{|\log \Delta|n^{\frac{1}{1+\beta}}} < c_2\right) = 1,$$
745: for some positive constants $c_1, c_2$;
746: %
747: \item
748: if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
749: %
$$\lim_{x \rightarrow \infty} \mathbf{P}\left(\frac{1}{x} < \frac{N_\Delta}{|\log
\Delta| n^{1/(1+\beta)}} < x\right) = 1.$$
752: \end{itemize}}
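
To see heuristically where the factor $|\log \Delta|$ comes from, here is a
minimal sketch. We assume, purely for illustration, that geometric convergence
means that the probability of not yet having learned the concept after $k$
rounds is at most $\rho^k$ for some $0 < \rho < 1$ (the symbols $k$ and
$\rho$ are our own shorthand, used only here; the precise statement is Lemma
\ref{latmost}). Then, for $0 < \Delta < 1$,
$$\rho^k \leq \Delta \iff k \log \rho \leq \log \Delta \iff k \geq
\frac{\log \Delta}{\log \rho} = \frac{|\log \Delta|}{|\log \rho|},$$
% the last inequality flips direction because $\log \rho < 0$
so under this assumption the dependence on $\Delta$ enters only through a
factor proportional to $|\log \Delta|$.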
753:
754: %
755: % \thm{theorem} {\label{batchthm} Let $N_\Delta$ be the number of
756: % steps it takes for the student (with probability $1$) to have
757: % probability $1 - \Delta$ of learning the concept using the batch
758: % learner algorithm. Then we have the following estimates for
759: % $N_\Delta$:
760: % %
761: % \begin{itemize}
762: % \item
763: % If the distribution of overlaps is \emph{uniform}, or more
764: % generally, the density function $f(1-x)$ at $0$ has the form
765: % $f(x) = c + O(x^\delta),$ $\delta, c > 0,$ then $N_\Delta=|\log
766: % \Delta|\Theta(n)$
767: % %
768: % \item
769: % If the probability density function $f(1-x)$ is asymptotic to
770: % $x^\beta + O(x^{\beta - \delta}), \quad \delta, \beta > 0$, as
771: % $x$ approaches $0$, then we have $N_\Delta=|\log
772: % \Delta|\Theta(n^{1/(1+\beta)})$;
773: % %
774: % \item
775: % If the asymptotic behavior is as above, but $-1 < \beta < 0$,
776: % then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$
777: % %
778: % \end{itemize}}
779: % %
Comparing Theorems \ref{mainprev} and \ref{batchthm}, we see that
the batch learning algorithm is uniformly superior for $\beta \geq
0$, and is the only one of the three to achieve \emph{sublinear}
performance whenever $\beta > 0$ (the other two \emph{never} do
better than linearly, unless the distribution $\mathcal{F}$ is
supported away from $1$). On
the other hand, for $\beta < 0$, the batch learning algorithm
performs comparably to the memoryless learner algorithm, and
worse than learning with full memory.
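
As a concrete illustration of the exponents in Theorem \ref{batchthm} (the
particular values of $\beta$ below are chosen only as examples, and
constants and the dependence on $\Delta$ are suppressed):
$$\beta = 1: \; n^{\frac{1}{1+\beta}} = n^{1/2}, \qquad
\beta = 3: \; n^{\frac{1}{1+\beta}} = n^{1/4}, \qquad
\beta = -\tfrac{1}{2}: \; n^{\frac{1}{1+\beta}} = n^{2}.$$
More generally, since $1 + \beta > 0$, we have $\frac{1}{1+\beta} < 1$
exactly when $\beta > 0$ and $\frac{1}{1+\beta} > 1$ exactly when
$-1 < \beta < 0$, which matches the sublinear and superlinear regimes
described above.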
789: %\section{$\alpha <1$}
790: %The same method as in section \ref{isdef} under the assumption
791: %that the $k$-th moment is asymptotic to $k^\alpha$ (this time for
792: %$\alpha \leq 1$) can be used to write
793: %\begin{equation}
794: %\begin{split}
795: %T_2 &= n^{1/alpha} \int_0^{n^{1/\alpha}} \frac{\left[1-n
796: %m(n^{1/\alpha}/u) - (1-m(n^{1/\alpha}/u)^n \right]}{u^2} d u +
797: %O(1)\\ &= n^{1/\alpha}\left(\int_0^{n^{1/2\alpha}} +
798: %\int_{n^{1/2\alpha}}^{n^{1/\alpha}}\right) \frac{\left[1-
799: %m^\prime(u) u^\alpha - (1-m^\prime(u)u^\alpha/n)^n \right]}{u^2}
800: %d u + O(1).
801: %\end{split}
802: %\end{equation} If $1/2 < \alpha < 1$, the argument finishes in
803: %exactly the same way as in section \ref{isdef}, to give us $T_2
804: %\asymp C n^{1/\alpha}$. However, if $\alpha = 1$, we get an
805: %additional term of $C_2 n \log n$, where $C_2 = \lim_{j
806: %\rightarrow \infty} m_j$. This term is equal (with opposing sign)
807: %to the center of the stable law satisfied by $T_1$, so in case
808: %$\alpha = 1$, we see that $T$ has no expectation but satisfies a
809: %law of large numbers, with center linear in $n$. If $\alpha \leq
810: %1/2$, the integral diverges.
811: \begin{thebibliography}{xxxxxxxxxxxx}
812:
\bibitem[BenOrsz]{benorsz}
Bender, C.~M. and Orszag, S. (1999) \textit{Advanced mathematical
methods for scientists and engineers, I,\/} Springer-Verlag, New
York.
817:
\bibitem[KNN2001]{knn}
Komarova, N.~L., Niyogi, P. and Nowak, M.~A. (2001) The evolutionary
dynamics of grammar acquisition, \textit{J.~Theor.~Biology}
\textbf{209}(1), 43--59.
822:
\bibitem[KN2001]{kn}
Komarova, N.~L. and Nowak, M.~A. (2001) Natural selection of the
critical period for grammar acquisition, \textit{Proc. Royal Soc.
B}, to appear.
827:
828: \bibitem[KR2001a]{kr1}
829: Komarova, N.~L. and Rivin, I. (2001) Harmonic mean, random
830: polynomials and stochastic matrices, \emph{preprint}.
831:
832: \bibitem[KR2001b]{kr2}
833: Komarova, N.~L. and Rivin, I. (2001) On the mathematics of
834: learning.
835:
\bibitem[Niyogi1998]{niy}
Niyogi, P. (1998) \textit{The Informational Complexity of
Learning,\/} Kluwer, Boston.
839:
\bibitem[NKN2001]{nkn}
Nowak, M.~A., Komarova, N.~L. and Niyogi, P. (2001) Evolution of
universal grammar, \textit{Science} \textbf{291}, 114--118.
843:
\bibitem[Rivin2002]{zeta}
Rivin, I. (2002) The moment zeta function and applications,
arxiv.org preprint NT/0201109.
847:
848: \end{thebibliography}
849:
850:
851: \end{document}
852:
853: