1: %\documentclass[12pt]{gen-j-l}
2: \documentclass[12pt]{amsart}
3: \usepackage{amsmath}
4: \usepackage{amsfonts,amssymb,amsthm}
5: \usepackage{graphicx}
6:
7: %
8: \newcommand{\beq}{\begin{equation}}
9: \newcommand{\eeq}{\end{equation}}
10: \newcommand{\bbar}{\begin{eqnarray}}
11: \newcommand{\eear}{\end{eqnarray}}
12: %
13:
14:
15: \newcommand{\thm}[2]{\begin{#1} #2 \end{#1}}
16: \newcommand{\excess}{\mathrm{excess\:}}
17: \newcommand{\sgn}{\mathrm{sgn\:}}
18: \newcommand{\realpart}{\mathrm{Re\:}}
19: \newcommand{\imagpart}{\mathrm{Im\:}}
20:
21: \newcommand{\logdet}{\log \det \Delta}
22: \newcommand{\tr}{\mathrm{tr\:}}
23: \newcommand{\diameter}{\mathrm{diameter\:}}
24: \newcommand{\area}{\mathrm{area\:}}
25:
26: \newcommand{\Sim}{\mathrm{Sim\:}}
27: \newcommand{\num}{\mathcal{N\:}}
28: %\newcommand{\arg}{\mathrm{arg\:}}
29: \newcommand{\dilatation}{\mathrm{dilatation\:}}
30: \newtheorem{theorem}{Theorem}[section]
31: \newtheorem{itheorem}{Theorem}[section]
32: \newtheorem{lemma}[theorem]{Lemma}
33: \newtheorem{ilemma}[itheorem]{Lemma}
34: \newtheorem{corollary}[theorem]{Corollary}
35: \newtheorem{conjecture}[theorem]{Conjecture}
36: \newtheorem{question}[theorem]{Question}
37: \newtheorem{claim}[theorem]{Claim}
38: \newtheorem{observation}[theorem]{Observation}
39: \newtheorem{iobservation}[itheorem]{Observation}
40: \newtheorem{remark}[theorem]{Remark}
41: \newtheorem{condition}[theorem]{Condition}
42: \newtheorem{example}[theorem]{Example}
43: \newtheorem{definition}[theorem]{Definition}
44: \newtheorem{xca}[theorem]{Exercise}
45: \newtheorem{note}[theorem]{Note}
46:
47: %\input{montreref}
48:
49: \begin{document}
50:
51: %-------------- Author entries --------------------
52: \title{Yet another zeta function and learning}
53: %
54:
55:
56: \author{Igor Rivin}
57: \address{Mathematics department, University of Manchester,
58: Oxford Road, Manchester M13 9PL, UK}
59: \address{Mathematics Department, Temple University,
60: Philadelphia, PA 19122}
61: \address{Mathematics Department, Princeton University, Princeton,
62: NJ 08544}
63: %
\email{irivin@math.princeton.edu} \thanks{The author would like
to thank the EPSRC and the NSF for support, and Natalia Komarova
and Ilan Vardi for useful conversations.}
67:
68: \subjclass{60E07, 60F15, 60J20, 91E40, 26C10} \keywords{ learning
69: theory, zeta functions, asymptotics}
70: %
71: \begin{abstract}
72: We analyze completely the convergence speed of the \emph{batch
73: learning algorithm}, and compare its speed to that of the
74: memoryless learning algorithm and of learning with memory (as
75: analyzed in \cite{kr2}). We show that the batch learning
76: algorithm is never worse than the memoryless learning algorithm
(at least asymptotically). Its performance \emph{vis-\`a-vis}
learning with full memory is less clear-cut, and depends on
79: certain probabilistic assumptions. These results necessitate the
80: introduction of the \textit{moment zeta function} of a
81: probability distribution and the study of some of its properties.
82: \end{abstract}
83: %
84: \maketitle
85:
86: \renewcommand{\theitheorem}{\Alph{itheorem}}
87: %
88: \section*{Introduction}
89: The original motivation for the work in this paper was provided
90: by research in learning theory, specifically in various models
91: of language acquisition (see, for example, \cite{knn,nkn,kn}). In
the paper \cite{kr2}, we studied the speed of convergence of
the \emph{memoryless learner algorithm}, and also of
\emph{learning with full memory}. Since the \emph{batch learning
algorithm} is both widely known and believed by learning
theorists to be faster (at the cost of memory) than both of the
above methods, it seemed natural to analyze its behavior
under the same set of assumptions, in order to bring the analysis
in \cite{kr1} and \cite{kr2} to a sort of closure. It should be
100: noted that the detailed analysis of the batch learning algorithm
101: is performed under the assumption of \emph{independence}, which
102: was not explicitly present in our previous work. For the
103: impatient reader we state our main result (Theorem
104: \ref{batchthm}) immediately (the reader can compare it with the
105: results on the memoryless learning algorithm and learning with
106: full memory, as summarized in Theorem \ref{mainprev}):
107: %
108: \begin{itheorem}
109: Let $N_\Delta$ be the number of steps it takes
110: for the student (with probability $1$) to have probability $1 -
111: \Delta$ of learning the concept using the batch learner
112: algorithm. Then we have the following estimates for $N_\Delta$:
113: %
114: \begin{itemize}
115: \item
If the distribution of overlaps is \emph{uniform}, or more
generally, the density function satisfies
$f(1-x) = c + O(x^\delta),$ $\delta, c > 0,$ as $x$ approaches $0$,
then $N_\Delta=|\log \Delta|\Theta(n)$;
120: %
121: \item
122: If the probability density function $f(1-x)$ is asymptotic to
$x^\beta + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as
124: $x$ approaches $0$, then we have $N_\Delta=|\log
125: \Delta|\Theta(n^{1/(1+\beta)})$;
126: %
127: \item
If the asymptotic behavior is as above, but $-1 < \beta < 0$,
129: then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$
130: %
131: \end{itemize}
132: \end{itheorem}
133: The plan of the paper is as follows: in this Introduction we
134: recall the learning algorithms we study; in Section \ref{mathmod}
we define our mathematical model; in Section 2 we recall our
previous results; in Section 3 we begin the analysis of the batch
learning algorithm; in Section 4 we invoke the independence
hypothesis and introduce some of the necessary mathematical
concepts; in Sections 5--7 we analyze the three cases
stated in Theorem A, and we summarize our findings in Section 8.
140: \subsection*{Memoryless Learning and Learning with Full Memory}
141: The general setup is as follows: There is a collection of
142: concepts $R_0, \dots, R_n$ and words which refer to these
143: concepts, sometimes ambiguously. The teacher generates a stream
144: of words, referring to the concept $R_0$. This is not known to
145: the student, but he must learn by, at each step, guessing some
146: concept $R_i$ and checking for consistency with the teacher's
147: input. The \emph{memoryless learner algorithm} consists of
148: picking a concept $R_i$ at random, and sticking by this choice,
149: until it is proven wrong. At this point another concept is
150: picked randomly, and the procedure repeats. \emph{Learning with
151: full memory} follows the same general process with the important
152: difference that once a concept is rejected, the student never
153: goes back to it. It is clear (for both algorithms) that once the
154: student hits on the right answer $R_0$, this will be his final
answer. We would like to estimate the probability of having
guessed the right answer after $k$ steps, and also the
157: expected number of steps before the student settles on the right
158: answer.
159:
160: \subsection*{Batch Learning} The batch learning situation is
161: similar to the above, but here the student records the words
162: $w_1, \dots, w_k, \dots$ he gets from the teacher. For each word
$w_i$, we assume that the student can find (in his textbook, for
164: example) a list $L_i$ of concepts referred to by the word. If we
165: define
166: \begin{equation*}
167: \mathcal{L}_k = \bigcap_{i=1}^k L_i,
168: \end{equation*}
169: then we are interested in the smallest value of $k$ such that
170: $\mathcal{L}_k = \{R_0\}$. This value $k_0$ is the time it has
171: taken the student to learn the concept $R_0$. We think of $k_0$
172: as a random variable, and we wish to estimate its expectation.
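
\noindent As a purely illustrative toy instance (the particular
lists below are hypothetical and not taken from any data), suppose
$n = 2$ and the first three words have lists
\begin{equation*}
L_1 = \{R_0, R_1, R_2\}, \quad L_2 = \{R_0, R_2\}, \quad L_3 = \{R_0, R_1\}.
\end{equation*}
Then $\mathcal{L}_1 = \{R_0, R_1, R_2\}$, $\mathcal{L}_2 = \{R_0,
R_2\}$ and $\mathcal{L}_3 = \{R_0\}$, so in this instance $k_0 = 3$.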
173: %
174: \section{The mathematical model}
175: \label{mathmod}
176: We think of the words referring to the concept
177: $R_0$ as a probability space $\mathcal{P}$. The probability that
one of these words also refers to the concept $R_i$ shall be
179: denoted by $p_i$; the probability that a word refers to concepts
180: $R_{i_1}, \dots, R_{i_k}$ shall be denoted by $p_{i_1 \dots
181: i_k}$. All the results described below (obviously) depend in a
182: crucial way on the $p_1, \dots, p_n$ and (in the case of the
183: batch learning algorithm) also on the joint probabilities. Since
184: there is no \emph{a priori} reason to assume specific values for
185: the probabilities, we shall assume that all of the $p_i$ are
186: themselves \emph{independent, identically distributed random
187: variables}. We shall refer to their common distribution as
188: $\mathcal{F}$, and to the density as $f$. It turns out that the
189: convergence properties of the various learning algorithms depend
190: on the local analytic properties of the distribution
$\mathcal{F}$ at $1$ -- a moment's reflection will convince the
192: reader that this is not really so surprising.
193:
194: To carry out a precise analysis of the batch learning algorithm,
195: we will also need the \emph{independence hypothesis}:
196: $$
197: p_{i_1 \dots i_k} = p_{i_1} \dots p_{i_k}.
198: $$
199: It is again not too surprising that some such assumption on
200: correlations ought to be required for precise asymptotic results,
201: though it is obviously the subject of a (non-mathematical) debate
202: as to whether assuming that the various concepts are truly
203: independent is reasonable from a cognitive science point of view.
204:
205: \section{Previous results}
206: In previous work \cite{kr1} and \cite{kr2} we obtained the
207: following result.
208: \thm{theorem} {\label{mainprev}Let $N_\Delta$ be the number of steps it
209: takes for the student (with probability $1$) to have probability
210: $1 - \Delta$ of learning the concept. Then we have the following
211: estimates for $N_\Delta$:
212: %
213: \begin{itemize}
214: \item
if the distribution of overlaps is \emph{uniform}, or more
generally, the density function satisfies
$f(1-x) = c + O(x^\delta),$ $\delta, c > 0,$ as $x$ approaches $0$, then
$N_\Delta=|\log \Delta|\Theta(n \log n)$ for
219: the memoryless algorithm and $N_\Delta=(1-\Delta)^2 \Theta(n \log
220: n)$ when learning with full memory;
221: %
222: \item
223: if the probability density function $f(1-x)$ is asymptotic to
$x^\beta + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as
225: $x$ approaches $0$, then for the two algorithms we have
226: respectively $N_\Delta=|\log \Delta|\Theta(n)$ and
227: $N_\Delta=(1-\Delta)^2 \Theta(n)$;
228: %
229: \item
230: if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
231: $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)})$ for the memoryless
learner and $N_\Delta=(1-\Delta)^2\Theta(n^{1/(1+\beta)})$ for
233: learning with full memory.
234: %
235: \end{itemize}}
236: %
237: \noindent Recall that $f(x) = \Theta(g(x))$ means that for
238: sufficiently large $x$, the ratio $f(x)/g(x)$ is bounded between
239: two strictly positive constants. The distribution of overlaps
240: referred to above is simply the distribution $\mathcal{F}$.
241: Notice that the theorem says nothing about the situation when
242: $\mathcal{F}$ is supported in some interval $[0, a]$, for $a<1$.
243: That case is (presumably) of scientific interest, but
244: mathematically it is relatively trivial: we replace the arguments
245: of all the $\Theta$s above by $1$, though, of course, we are
246: thereby hiding the dependence on $a$.
247:
248: \section{General bounds on the batch learner algorithm}
249:
250: Consider a set of words $w_1, \dots, w_k$. The probability that
they all refer to the concept $R_i$ is, obviously, $p_i^k$.
252: \begin{lemma}
253: \label{bounds}
254: The probability $q_k$ that we still have not
255: learned the concept $R_0$ after $k$ steps is bounded above by
256: $\sum_{i=1}^n p_i^k$, and below by $\max_i p_i^k$.
257: \end{lemma}
258: \begin{proof}
259: Immediate.
260: \end{proof}
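\noindent One way to spell this out (a sketch, in notation
introduced only for this remark): let $A_i$ be the event that all
of $w_1, \dots, w_k$ refer to $R_i$, so that $\mathbb{P}(A_i) =
p_i^k$. The concept $R_0$ has not been learned after $k$ steps
precisely when some $A_i$ with $i \geq 1$ occurs, so
\begin{equation*}
\max_{1 \leq i \leq n} p_i^k \leq q_k = \mathbb{P}\left(\bigcup_{i=1}^n A_i\right)
\leq \sum_{i=1}^n p_i^k.
\end{equation*}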
261: We will first use these upper and lower
262: bounds to get corresponding bounds on the convergence speed of
263: the batch learner algorithm, and then invoke the independence
264: hypothesis to sharpen these bounds in many cases.
265:
266: We begin with a trivial but useful lemma.
267: %
268: \begin{lemma}
269: \label{rearrange}
270: Let $G$ be a game where the probability of
271: success (respectively failure) after at most $k$ steps is $s_k$
272: (respectively $f_k = 1-s_k $). Then the expected number of steps
273: until success is
$$\sum_{k=1}^\infty k (s_k - s_{k-1}) = \sum_{k=0}^\infty (1 - s_k) =
\sum_{k=0}^\infty f_k,$$ where $s_0 = 0$, if the corresponding sum converges.
276: \end{lemma}
277: \begin{proof}
278: The proof is immediate from the definition of expectation and the
possibility of rearrangement of terms of positive series.
280: \end{proof}
281: We can combine Lemma \ref{rearrange} and Lemma \ref{bounds} to
282: obtain:
283: \begin{theorem}
284: \label{sumbounds} The expected time $T$ of convergence of the
285: batch learner algorithm is bounded as follows:
286: \begin{equation}
287: \label{trivest} \sum_{i=1}^n \frac{1}{1-p_i} \geq T \geq
288: \max_{1\leq i \leq n} \frac{1}{1-p_i}.
289: \end{equation}
290: \end{theorem}
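\noindent In more detail (a sketch; we count the $k=0$ term, at
which nothing has been learned yet, and assume all $p_i < 1$), the
expected time is the sum of the failure probabilities $q_k$, and
summing the bounds of Lemma \ref{bounds} over $k$ gives geometric
series:
\begin{equation*}
\sum_{k=0}^\infty \sum_{i=1}^n p_i^k = \sum_{i=1}^n \frac{1}{1-p_i}, \qquad
\sum_{k=0}^\infty \max_{1 \leq i \leq n} p_i^k = \max_{1 \leq i \leq n} \frac{1}{1-p_i}.
\end{equation*}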
291: The leftmost term in equation (\ref{trivest}) has been studied at
292: length in \cite{kr1}. We state a version of the results of
293: \cite{kr1} below:
294: \begin{theorem}
\label{allstab} Let $S=\sum_{i=1}^n \frac{1}{1-p_i},$ where the
$p_i$ are independent, identically distributed random variables
with values in $[0, 1]$, with probability density $f$, such that
$f(1-x) = x^\beta + O(x^{\beta + \delta}),\quad \delta > 0$ for
$x\rightarrow 0$. If $\beta > 0$, then there exists a mean
$m$, such that $\lim_{n \rightarrow \infty} \mathbb{P}(|S/n - m|
> \epsilon) = 0,$ for any $\epsilon > 0.$ If $\beta = 0$, then
$\lim_{n \rightarrow \infty} \mathbb{P}(|S/(n\log n) - 1|
> \epsilon) = 0.$ Finally, if
$-1 < \beta < 0,$ then $\lim_{n \rightarrow \infty}
\mathbb{P}(S/n^{1/(\beta+1)} - C
> a) = g(a),$ where $\lim_{a \rightarrow \infty} g(a)= 0,$ and $C$ is
an arbitrary (but fixed) constant, and likewise
$$\mathbb{P}(S/n^{1/(\beta + 1)} < b) = h(b),$$ where $\lim_{b \rightarrow 0}h(b) = 0.$
309: \end{theorem}
310: The right hand side of Eq. (\ref{trivest}) is easier to
311: understand. Indeed, let $p_1, \dots, p_n$ be distributed as usual
(and as in the statement of Theorem \ref{allstab}). Then
313: %
314: \begin{theorem}\label{expmin}
The expected value of $\max_{1 \leq i \leq n} p_i$ is asymptotic to $1 - C
n^{-1/(1+\beta)},$ for some positive constant $C$.
317: \end{theorem}
318: \begin{proof}
First, we change variables to $q_i = 1 - p_i$. Obviously, the
statement of the Theorem is equivalent to the statement that $E =
\mathbf{E}(\min_{1 \leq i \leq n} q_i) \asymp C n^{-1/(1+\beta)}$. We
also write $h(x) = f(1-x),$ and similarly for the primitives $H$
and $F$. Now, the probability that all of the $q_i$ are
greater than some fixed $y$ equals $(1-H(y))^n,$ so the
distribution function of $\min_i q_i$ is $1-(1-H(y))^n$, and
$$E = \int_0^1 t d\left[1-(1-H(t))^n\right] = \int_0^1 (1-H(t))^n d t.$$
Perform the change of variables $t = u/n^{1/(1+\beta)}$, to get
\begin{equation}
\label{firstint} E = \frac{1}{n^{1/(1+\beta)}}
\int_0^{n^{1/(1+\beta)}} (1-H(u/n^{1/(1+\beta)}))^n du.
\end{equation}
For $u \ll n^{1/(1+\beta)}$, we can write $H(u/n^{1/(1+\beta)})
\asymp H^\prime u^{\beta + 1}/n,$ where $H^\prime$ is a constant.
333: We also know that $H$ is a monotonic function so if we break up
334: the integral above as
335: \begin{equation}
336: \label{secondint} E = \frac{1}{n^{1/(1+\beta)}}
337: \left[\int_0^{n^{1/(2 (1 + \beta))}} + \int_{n^{1/(2 (1 +
338: \beta))}}^{n^{1/(1 + \beta)}}\right] (1-H(u/n^{1/(1+\beta)}))^n
339: du,
340: \end{equation}
we see that the first integral approaches $C = \int_0^\infty
\exp(-H^\prime u^{1+\beta}) d u,$ while the second integral goes to 0.
343: Note that the proof also evaluates $C$.
344: \end{proof}
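\noindent As a sanity check (a worked special case, not needed
for the argument), take the uniform distribution on $[0,1]$, so
that $\beta = 0$ and $H(t) = t$. Then the integral can be
evaluated exactly:
\begin{equation*}
\mathbf{E}\left(\min_{1 \leq i \leq n} q_i\right) = \int_0^1 (1-t)^n d t = \frac{1}{n+1},
\end{equation*}
which is asymptotic to $n^{-1}$, in agreement with the exponent $-1/(1+\beta)$.
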
345: We need one final observation:
346: \begin{theorem}
347: The variable $n^{1/(1+\beta)} \min_{i=1}^n q_i$ has a limiting
348: distribution with distribution function $G(x) =
349: 1-\exp(-x^{1+\beta}).$
350: \end{theorem}
351: \begin{proof}
352: Immediate from the proof of Theorem \ref{expmin}.
353: \end{proof}
354:
355: We can now put together all of the above results as follows.
356: \begin{theorem}
357: \label{allgen}
Let $p_1, \dots, p_n$ be independently distributed
with common density function $f$, such that $f(1-x) = c x^\beta +
O(x^{\beta + \delta}),$ $\delta > 0$. Let $T$ be the expected
time of the convergence of the batch learning algorithm with
overlaps $p_1, \dots, p_n$. If $\beta > 0$, then there
exist $C_1, C_2$, such that $C_1 n^{1/(1+\beta)} \leq T \leq C_2
n$, with probability tending to $1$ as $n$ tends to $\infty$. If
$\beta = 0$, then there exist $C_1, C_2$, such that $C_1 n \leq T
\leq C_2 n \log n$, with probability tending to $1$ as $n$ tends
to $\infty.$ If $-1 < \beta < 0$, then $C^{-1} n^{1/(\beta + 1)} \leq
T \leq C n^{1/(\beta + 1)}$ with probability tending to $1$ as
$C$ goes to infinity.
370: \end{theorem}
371:
The reader will remark that in the case that $\beta < 0$, the
373: upper and lower bounds have the same order of magnitude as
374: functions of $n$.
375:
376: \section{Independent concepts}
We now invoke the independence hypothesis, whereby an application of the
378: inclusion-exclusion principle gives us:
379:
380: \thm{lemma}{\label{latmost} The probability $l_k$ that we have
381: learned the concept $R_0$ after $k$ steps is given by
382: $$
383: l_k=\prod_{i=1}^n(1-p_i^k).
384: $$
385: }
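
\noindent To see where the product comes from (a sketch, in
notation introduced only for this remark): let $A_i$ be the event
that all of the first $k$ words refer to $R_i$, so that
$\mathbb{P}(A_i) = p_i^k$. Under the independence hypothesis the
events $A_1, \dots, A_n$ are independent, and $R_0$ has been
learned after $k$ steps exactly when none of them occurs, whence
\begin{equation*}
l_k = \prod_{i=1}^n \left(1 - \mathbb{P}(A_i)\right) = \prod_{i=1}^n (1 - p_i^k).
\end{equation*}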
386:
387: Note that the probability $s_k$ of winning the game \emph{on the
388: $k$-th step} is given by $s_k = l_k - l_{k-1}= (1-l_{k-1}) -
389: (1-l_k)$. Since the expected number of steps $T$ to learn the
390: concept is given by
391: $$T = \sum_{k=1}^\infty k s_k,$$
we immediately have, up to an additive constant $1$ (the $k=0$
term, which is immaterial for the asymptotics below),
$$T = \sum_{k=1}^\infty (1-l_k).$$
393: %
394: \thm{lemma}{\label{letime} The expected time $T$ of learning the
395: concept $R_0$ is given by
396: $$
397: T = \sum_{k=1}^\infty \left(1-\prod_{i=1}^n
398: \left(1-p_i^k\right)\right).
399: $$
400: }
401: %
402: Since the sum above is absolutely convergent, we can expand the
403: products and interchange the order of summation to get the
404: following formula for $T$:
405:
406: \begin{equation}
407: \label{subsum} T = \sum_{s\subseteq \{1, \dots, n\}} (-1)^{|s|-1}
408: \sum_{k=1}^\infty p_s^k = \sum_{s\subseteq \{1, \dots, n\}}
409: (-1)^{|s|-1} \left(\frac{1}{1-p_s} - 1\right),
410: \end{equation}
where the sums run over nonempty subsets $s$, which we have
identified with the corresponding multi-indices, so that, under
the independence hypothesis, $p_s = \prod_{i \in s} p_i$.
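
\noindent For instance (a worked special case), when $n = 2$ the
nonempty subsets are $\{1\}$, $\{2\}$ and $\{1,2\}$, and with
$p_{12} = p_1 p_2$ the formula reads
\begin{equation*}
T = \frac{p_1}{1-p_1} + \frac{p_2}{1-p_2} - \frac{p_1 p_2}{1 - p_1 p_2}.
\end{equation*}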
413:
Formula (\ref{subsum}) is useful in and of itself, but we now
use it to attempt to get the expectation of the expected time of
success $T$ under our distribution and independence assumptions.
417: For this we shall need the following:
418: %
419: \thm{definition}{\label{zdef} Let $\mathcal{F}$ be a probability
420: distribution on an interval $I$, and let $m_k(\mathcal{F}) =
421: \int_I x^k\mathcal{F}(d x)$ be the $k$-th moment of
422: $\mathcal{F}$. Then the \emph{moment zeta function of
423: $\mathcal{F}$} is defined to be
424: $$\zeta_{\mathcal{F}}(s) = \sum_{k=1}^\infty m_k^s(\mathcal{F}),$$ whenever the sum is defined.
425: }
426: %
427: \thm{lemma}{\label{zetalemma} Let $\mathcal{F}$ be a probability
428: distribution as above, and let $x_1, \dots, x_n$ be independent
429: random variables with common distribution $\mathcal{F}$. Then
430: \begin{equation}
\mathbb{E}\left(\frac{1}{1-x_1 \dots x_n}\right) = 1 +
\zeta_{\mathcal{F}}(n).
433: \end{equation}
434: In particular, the expectation is undefined whenever the zeta
435: function is undefined. }
436: %
437: \begin{proof}
438: Expand the fraction in a geometric series and apply Fubini's
439: theorem.
440: \end{proof}
441: %
\thm{example} { For $\mathcal{F}$ the uniform distribution on
$[0, 1]$, we have $m_k(\mathcal{F}) = 1/(k+1)$, so
$\zeta_{\mathcal{F}}(s) = \zeta(s) - 1$, where $\zeta$ is the
familiar Riemann zeta function. Notice that this is \emph{not}
defined for $s=1$ -- this will be important in the sequel.}
446:
447: It should be noted that in the case we are interested in
448: (distributions supported in $[0, 1]$), the asymptotics of the
449: moments are determined by the local properties of the
450: distribution at $1$, up to exponentially decreasing error terms.
451: So, if $f(1-x) \asymp x^\beta$ (recall that $f$ is the density),
452: we see that the $k$-th moment of $\mathcal{F}$ is asymptotic to
$C k^{-(1+\beta)},$ for some constant $C$. To show this, we
454: first define the \emph{Mellin transform} of $f$ to be
455: $$\mathcal{M}(f)(s) = \int_0^1 f(x) x^{s-1} d x.$$ We see that
$m_k(\mathcal{F}) = \mathcal{M}(f)(k+1).$ The Mellin transform is
very closely related to the Laplace transform. Indeed, making the
substitution $x = \exp(-u)$, we see that $$\mathcal{M}(f)(s) =
\int_0^\infty f(\exp(-u)) \exp(-s u) d u,$$ so the Mellin
transform of $f$ is equal to the Laplace transform of $u \mapsto
f(\exp(-u)).$ Now, the asymptotics of the Laplace transform are easily
computed by Laplace's method, and in the case we are interested
in, Watson's lemma (see, e.g., \cite{benorsz}) tells us that if
$f(x) \asymp c (1-x)^\beta$, then $\mathcal{M}(f)(s) \asymp c
\Gamma(\beta + 1) s^{-(\beta + 1)}.$ In particular,
466: $\zeta_{\mathcal{F}}(s)$ is defined for $s
467: >1/(1+\beta)$. Below we shall analyze three cases (though the
468: analysis is almost the same in the three cases, there are some
469: important variations). In the sequel, we set $\alpha = \beta + 1$.
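
\noindent As a check on the constant (a worked special case,
under the extra assumption that $f$ is exactly a power), take
$f(x) = (\beta + 1)(1-x)^\beta$ on $[0,1]$. The Mellin transform
is then a Beta integral:
\begin{equation*}
\mathcal{M}(f)(s) = (\beta+1) \int_0^1 (1-x)^\beta x^{s-1} d x
= (\beta + 1) \frac{\Gamma(s) \Gamma(\beta + 1)}{\Gamma(s + \beta + 1)}
\asymp (\beta+1)\Gamma(\beta+1) s^{-(\beta+1)},
\end{equation*}
as $s \rightarrow \infty$, using $\Gamma(s)/\Gamma(s + \beta + 1) \asymp s^{-(\beta+1)}$.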
470: \section{$\alpha > 1$}
471: \label{isdef}
472: In this case, we use our assumptions to rewrite Eq.
473: (\ref{subsum}) as
474: \begin{equation}
475: \label{subsum2}
476: %
477: T = - \sum_{k=1}^n \binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k).
478: \end{equation}
479: This, in turn, can be rewritten (by expanding the definition of
480: zeta) as
481: \begin{equation}
482: \label{subsum3} T = - \sum_{j=1}^\infty
483: \left[\left(1-m_j(\mathcal{F})\right)^n-1\right]
484: \end{equation}
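\noindent The step from Eq. (\ref{subsum2}) to Eq. (\ref{subsum3})
is the binomial theorem applied termwise: for each $j$,
\begin{equation*}
\sum_{k=1}^n \binom{n}{k} (-1)^k m_j(\mathcal{F})^k = \left(1 - m_j(\mathcal{F})\right)^n - 1,
\end{equation*}
and summing over $j \geq 1$ (using $\zeta_{\mathcal{F}}(k) =
\sum_{j=1}^\infty m_j(\mathcal{F})^k$) gives the stated formula.
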
485: Since the term in the sum is monotonically decreasing, the sum in
486: Eq. (\ref{subsum3}) can be approximated by an integral (of
487: \emph{any} monotonic interpolation $m$ of the sequence
488: $m_j(\mathcal{F})$; however there is no reason not to set $m(x) =
489: \mathcal{M}(f)(x+1)$), with error bounded by the first term,
which is, in turn, bounded in absolute value by $2$, to get
491: \begin{equation}
492: \label{approx1} T = - \int_1^\infty \left[(1-m(x))^n -1\right] d
493: x + O(1),
494: \end{equation}
495: where the error term is bounded above by $2$.
496:
497: Now, let us assume that $m(x)$ is of order $x^{-\alpha}$ for some
$\alpha > 1$. We substitute $x = n^{1/\alpha}/u$, to get
499: \begin{equation}
500: \begin{split}
T &= n^{1/\alpha}\int_0^{n^{1/\alpha}}
\frac{\left[1-(1-m(n^{1/\alpha}/u))^n \right]}{u^2} d u + O(1)\\
&=
n^{1/\alpha}\int_0^{n^{1/\alpha}}\frac{\left[1-(1-m^\prime(u)u^\alpha/n)^n
\right]}{u^2} d u + O(1)\\ & =
n^{1/\alpha}\left(\int_0^{n^{1/(2\alpha)}} +
\int_{n^{1/(2\alpha)}}^{n^{1/\alpha}}\right)
\frac{\left[1-(1-m^\prime(u)u^\alpha/n)^n \right]}{u^2} d u +
O(1) ,
510: \end{split}
511: \end{equation}
512: where $m^\prime$ is a bounded (asymptotically constant) function.
513: In the second integral the integrand is bounded above by $1/u^2$,
514: so the contribution from that integral goes to $0$, while in the
515: first integral we can approximate $(1-m^\prime u^\alpha/n)^n$ by
516: $\exp(-m^\prime u^\alpha)$, and the contribution from that
517: integral goes to
518: \begin{equation}
519: \label{mainalpha} T = n^{1/\alpha}
520: \int_0^\infty\frac{1-\exp(-m^\prime(u) u^\alpha)}{u^2} d u + O(1)
521: \asymp C n^{1/\alpha}.
522: \end{equation}
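\noindent To make the constant concrete (a sketch, under the
simplifying assumption that $m^\prime(u)$ is replaced by its
limiting value, a constant $a > 0$), integrate by parts and
substitute $v = a u^\alpha$:
\begin{equation*}
\int_0^\infty \frac{1 - \exp(-a u^\alpha)}{u^2} d u
= \int_0^\infty a \alpha u^{\alpha - 2} \exp(-a u^\alpha) d u
= a^{1/\alpha} \Gamma\left(1 - \frac{1}{\alpha}\right),
\end{equation*}
which is finite precisely because $\alpha > 1$.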
523: %
524: \section{$\alpha = 1$}
525: \label{medalpha} In this case, $f(x) = c + o(1)$ as $x$
526: approaches $1$. It is not hard to see that
527: $\zeta_{\mathcal{F}}(n)$ is defined for $n \geq 2$. We break up
528: the expression in Eq. (\ref{subsum}) as
529: \begin{equation}
\label{subsumm} T = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right) +
531: \sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1}
532: (-1)^{|s|-1}
533: \left(\frac{1}{1-p_s} - 1\right).
534: \end{equation}
535: Let
\begin{gather*} T_1 = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right),\\
537: T_2 = \sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1}
538: (-1)^{|s|-1}
539: \left(\frac{1}{1-p_s} - 1\right).
540: \end{gather*}
The first sum $T_1$ has
no expectation; however, $T_1/n$ does have a stable
distribution centered on $c \log n + c_2$. We will keep this in
544: mind, but now let us look at the second sum $T_2$. It can be
545: rewritten as
546: \begin{equation}
547: \label{subsumm2} T_2 = - \sum_{j=1}^\infty
548: \left[\left(1-m_j(\mathcal{F})\right)^n-1 + n m_j\right].
549: \end{equation}
550: The same method as in section \ref{isdef} under the assumption
that the $k$-th moment is asymptotic to $k^{-\alpha}$ (this time for
552: $\alpha \leq 1$) can be used to write
553: \begin{equation}
554: \begin{split}
T_2 &= n \int_0^n \frac{\left[1-n m(n/u) - (1-m(n/u))^n
556: \right]}{u^2} d u + O(1)\\ &= n\left(\int_0^{n^{1/2}} +
557: \int_{n^{1/2}}^n\right) \frac{\left[1- m^\prime(u) u -
558: (1-m^\prime(u)u/n)^n \right]}{u^2} d u + O(1).
559: \end{split}
560: \end{equation} The conclusion differs somewhat from that of section \ref{isdef} in that we get an
561: additional term of $c n \log n$, where $c = \lim_{x \rightarrow
562: 1} f(x) = \lim_{j \rightarrow \infty} j m_j$. This term is equal
563: (with opposing sign) to the center of the stable law satisfied by
564: $T_1$, so in case $\alpha = 1$, we see that $T$ has no
565: expectation but satisfies a \emph{law of large numbers}, of the
566: following form:
567: %
568: \begin{theorem}[Law of large numbers]
569: There exists a constant $C$ such that $\lim_{y \rightarrow
570: \infty} \mathbf{P}(|T/n - C| > y) = 0.$
571: \end{theorem}
572: \section{$\alpha <1$}
\label{smallalpha} In this case the analysis goes through as in
the preceding section when $\alpha > 1/2$, but runs into
considerable difficulties when $\alpha \leq 1/2$. However, in this
case we note that
Theorem \ref{allgen} actually gives us tight bounds.
577: \section{The inevitable comparison}
578: We are now in a position to compare the performance of the batch
579: learning algorithm with that of the memoryless learning algorithm
580: and of learning with full memory, as summarized in Theorem
581: \ref{mainprev}. We combine our computations above with the
582: observation that the batch learner algorithm converges
583: geometrically (Lemma \ref{latmost}), to get:
584: %
585: \thm{theorem} {\label{batchthm} Let $N_\Delta$ be the number of
586: steps it takes for the student (with probability $1$) to have
587: probability $1 - \Delta$ of learning the concept using the batch
588: learner algorithm. Then we have the following estimates for
589: $N_\Delta$:
590: %
591: \begin{itemize}
592: \item
If the distribution of overlaps is \emph{uniform}, or more
generally, the density function satisfies
$f(1-x) = c + O(x^\delta),$ $\delta, c > 0,$ as $x$ approaches $0$,
then $N_\Delta=|\log \Delta|\Theta(n)$;
597: %
598: \item
599: If the probability density function $f(1-x)$ is asymptotic to
$x^\beta + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as
601: $x$ approaches $0$, then we have $N_\Delta=|\log
602: \Delta|\Theta(n^{1/(1+\beta)})$;
603: %
604: \item
605: If the asymptotic behavior is as above, but $-1 < \beta < 0$,
606: then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$
607: %
608: \end{itemize}}
Comparing Theorems \ref{mainprev} and \ref{batchthm}, we see that
the batch learning algorithm is uniformly superior for $\beta \geq
611: 0$, and the only one of the three to achieve \emph{sublinear}
612: performance whenever $\beta
613: > 0$ (the other two \emph{never} do better than linearly, unless
614: the distribution $\mathcal{F}$ is supported away from $1.$) On
615: the other hand, for $\beta < 0$, the batch learning algorithm
616: performs comparably to the memoryless learner algorithm, and
617: worse than learning with full memory.
640: \begin{thebibliography}{xxxxxxxxxxxx}
641:
642: \bibitem[BenOrsz]{benorsz}
643: C.~M.~Bender and S.~Orszag (1999) \textit{Advanced mathematical
644: methods for scientists and engineers, I,\/} Springer-Verlag, New
645: York.
646:
647: \bibitem[KNN2001]{knn}
648: Komarova, N.~L., Niyogi,~P. and Nowak,~M.~A. (2001) The evolutionary
649: dynamics of grammar acquisition, \textit{J.~Theor.~Biology}, {\bf
650: 209}(1), pp. 43-59.
651:
652: \bibitem[KN2001]{kn}
653: Komarova, N.~L. and Nowak, M.~A. (2001) Natural selection of the
654: critical period for grammar acquisition, {\it Proc. Royal Soc.
655: B}, to appear.
656:
657: \bibitem[KR2001a]{kr1}
658: Komarova, N.~L. and Rivin, I. (2001) Harmonic mean, random
659: polynomials and stochastic matrices, \emph{preprint}.
660:
661: \bibitem[KR2001b]{kr2}
662: Komarova, N.~L. and Rivin, I. (2001) On the mathematics of
663: learning.
664:
665: \bibitem[Niyogi1998]{niy}
666: Niyogi, P. (1998). {\it The Informational Complexity of
667: Learning}. Boston: Kluwer.
668:
669: \bibitem[NKN2001]{nkn}
670: Nowak, M.~A., Komarova,~N.~L., Niyogi,~P. (2001) Evolution of
671: universal grammar, \textit{Science} \textbf{291}, 114-118.
672:
673: \end{thebibliography}
674:
675:
676: \end{document}
677:
678: