1: %\documentclass[12pt]{gen-j-l}
2: \documentclass[12pt]{amsart}
3: \usepackage{amsmath}
4: \usepackage{amsfonts,amssymb,amsthm}
5: \usepackage{graphicx}
6: 
7: %
8: \newcommand{\beq}{\begin{equation}}
9: \newcommand{\eeq}{\end{equation}}
10: \newcommand{\bbar}{\begin{eqnarray}}
11: \newcommand{\eear}{\end{eqnarray}}
12: %
13: 
14: 
15: \newcommand{\thm}[2]{\begin{#1} #2 \end{#1}}
16: \newcommand{\excess}{\mathrm{excess\:}}
17: \newcommand{\sgn}{\mathrm{sgn\:}}
18: \newcommand{\realpart}{\mathrm{Re\:}}
19: \newcommand{\imagpart}{\mathrm{Im\:}}
20: 
21: \newcommand{\logdet}{\log \det \Delta}
22: \newcommand{\tr}{\mathrm{tr\:}}
23: \newcommand{\diameter}{\mathrm{diameter\:}}
24: \newcommand{\area}{\mathrm{area\:}}
25: 
26: \newcommand{\Sim}{\mathrm{Sim\:}}
27: \newcommand{\num}{\mathcal{N\:}}
28: %\newcommand{\arg}{\mathrm{arg\:}}
29: \newcommand{\dilatation}{\mathrm{dilatation\:}}
30: \newtheorem{theorem}{Theorem}[section]
31: \newtheorem{itheorem}{Theorem}[section]
32: \newtheorem{lemma}[theorem]{Lemma}
33: \newtheorem{ilemma}[itheorem]{Lemma}
34: \newtheorem{corollary}[theorem]{Corollary}
35: \newtheorem{conjecture}[theorem]{Conjecture}
36: \newtheorem{question}[theorem]{Question}
37: \newtheorem{claim}[theorem]{Claim}
38: \newtheorem{observation}[theorem]{Observation}
39: \newtheorem{iobservation}[itheorem]{Observation}
40: \newtheorem{remark}[theorem]{Remark}
41: \newtheorem{condition}[theorem]{Condition}
42: \newtheorem{example}[theorem]{Example}
43: \newtheorem{definition}[theorem]{Definition}
44: \newtheorem{xca}[theorem]{Exercise}
45: \newtheorem{note}[theorem]{Note}
46: 
47: %\input{montreref}
48: 
49: \begin{document}
50: 
51: %-------------- Author entries --------------------
52: \title{Yet another zeta function and learning}
53: %
54: 
55: 
56: \author{Igor Rivin}
57: \address{Mathematics department, University of Manchester,
58: Oxford Road, Manchester M13 9PL, UK}
59: \address{Mathematics Department, Temple University,
60: Philadelphia, PA 19122}
61: \address{Mathematics Department, Princeton University, Princeton,
62: NJ 08544}
63: %
\email{irivin@math.princeton.edu} \thanks{The author would like 
to thank the EPSRC and the NSF for support, and Natalia Komarova 
and Ilan Vardi for useful conversations.}
67: 
68: \subjclass{60E07, 60F15, 60J20, 91E40, 26C10} \keywords{ learning 
69: theory, zeta functions, asymptotics}
70: %
71: \begin{abstract}
72: We analyze completely the convergence speed of the \emph{batch 
73: learning algorithm}, and compare its speed to that of the 
74: memoryless learning algorithm and of learning with memory (as 
75: analyzed in \cite{kr2}). We show that the batch learning 
76: algorithm is never worse than the memoryless learning algorithm 
(at least asymptotically). Its performance \emph{vis-\`a-vis} 
learning with full memory is less clear-cut, and depends on 
79: certain probabilistic assumptions. These results necessitate the 
80: introduction of the \textit{moment zeta function} of a 
81: probability distribution and the study of some of its properties. 
82: \end{abstract}
83: %
84: \maketitle
85: 
86: \renewcommand{\theitheorem}{\Alph{itheorem}}
87: %
88: \section*{Introduction}
89: The original motivation for the work in this paper was provided 
90: by  research in learning theory, specifically in various models 
91: of language acquisition (see, for example, \cite{knn,nkn,kn}). In 
the paper \cite{kr2}, we studied the speed of convergence of 
the \emph{memoryless learner algorithm}, and also of 
\emph{learning with full memory}. Since the \emph{batch learning 
algorithm} is both widely known and believed by learning 
theorists to be faster (at the cost of memory) than both of the 
above methods, it seemed natural to analyze its behavior 
under the same set of assumptions, in order to bring the analysis 
in \cite{kr1} and \cite{kr2} to a sort of closure. It should be 
100: noted that the detailed analysis of the batch learning algorithm 
101: is performed under the assumption of \emph{independence}, which 
102: was not explicitly present in our previous work. For the 
103: impatient reader we state our main result (Theorem 
104: \ref{batchthm}) immediately (the reader can compare it with the 
105: results on the memoryless learning algorithm and learning with 
106: full memory, as summarized in Theorem \ref{mainprev}): 
107: %
108: \begin{itheorem}
109:  Let $N_\Delta$ be the number of steps it takes 
110: for the student (with probability $1$) to have probability $1 - 
111: \Delta$ of learning the concept using the batch learner 
112: algorithm. Then we have the following estimates for $N_\Delta$:
113: %
114: \begin{itemize}
115: \item
If the distribution of overlaps is \emph{uniform}, or more 
generally, the density function $f$ satisfies $f(1-x) = c + 
O(x^\delta),$ $\delta, c > 0,$ as $x$ approaches $0$, then 
$N_\Delta=|\log \Delta|\Theta(n)$;
120: %
121: \item 
If the probability density function $f(1-x)$ is asymptotic to 
$x^\beta + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as 
$x$ approaches $0$, then we have $N_\Delta=|\log 
\Delta|\Theta(n^{1/(1+\beta)})$; 
126: %
127: \item 
If the asymptotic behavior is as above, but $-1 < \beta < 0$, 
then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$ 
130: %
131: \end{itemize}
132: \end{itheorem}
The plan of the paper is as follows: in this Introduction we 
recall the learning algorithms we study; in Section \ref{mathmod} 
we define our mathematical model; in Section 2 we recall our 
previous results; in Section 3 we begin the analysis of the batch 
learning algorithm; in Section 4 we invoke the independence 
hypothesis and introduce some of the necessary mathematical 
concepts; in Sections 5--7 we analyze the three cases 
stated in Theorem A, and we summarize our findings in Section 8.
140: \subsection*{Memoryless Learning and Learning with Full Memory} 
141: The general setup is as follows: There is a collection of 
142: concepts $R_0, \dots, R_n$ and words which refer to these 
143: concepts, sometimes ambiguously. The teacher generates a stream 
144: of words, referring to the concept $R_0$. This is not known to 
145: the student, but he must learn by, at each step, guessing some 
146: concept $R_i$ and checking for consistency with the teacher's 
147: input.  The \emph{memoryless learner algorithm} consists of 
148: picking a concept $R_i$ at random, and sticking by this choice, 
149: until it is proven wrong.  At this point another concept is 
150: picked randomly, and the procedure repeats. \emph{Learning with 
151: full memory} follows the same general process with the important 
152: difference that once a concept is rejected, the student never 
153: goes back to it. It is clear (for both algorithms) that once the 
154: student hits on the right answer $R_0$, this will be his final 
answer. We would like to estimate the probability of having 
guessed the right answer after $k$ steps, and also the 
157: expected number of steps before the student settles on the right 
158: answer.
159: 
160: \subsection*{Batch Learning} The batch learning situation is 
161: similar to the above, but here the student records the words 
$w_1, \dots, w_k, \dots$ he gets from the teacher. For each word 
$w_i$, we assume that the student can find (in his textbook, for 
164: example) a list $L_i$ of concepts referred to by the word. If we 
165: define 
166: \begin{equation*} 
167: \mathcal{L}_k = \bigcap_{i=1}^k L_i,
168: \end{equation*}
169: then we are interested in the smallest value of $k$ such that 
170: $\mathcal{L}_k = \{R_0\}$. This value $k_0$ is the time it has 
171: taken the student to learn the concept $R_0$. We think of $k_0$ 
172: as a random variable, and we wish to estimate its expectation.
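
To illustrate the definition (with hypothetical lists chosen only for this example), suppose $n = 2$ and the first three words come with the lists $L_1 = \{R_0, R_1, R_2\}$, $L_2 = \{R_0, R_2\}$ and $L_3 = \{R_0, R_1\}$. Each list contains $R_0$, since every word refers to $R_0$. Then
\begin{equation*}
\mathcal{L}_1 = \{R_0, R_1, R_2\}, \qquad \mathcal{L}_2 = \{R_0, R_2\}, \qquad \mathcal{L}_3 = \{R_0\},
\end{equation*}
so that $k_0 = 3$ in this instance: the concept $R_1$ is eliminated by the second word and $R_2$ by the third.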
173: %
174: \section{The mathematical model}
175: \label{mathmod}
176:  We think of the words referring to the concept 
177: $R_0$ as a probability space $\mathcal{P}$. The probability that 
one of these words also refers to the concept $R_i$ shall be 
179: denoted by $p_i$; the probability that a word refers to concepts 
180: $R_{i_1}, \dots, R_{i_k}$ shall be denoted by $p_{i_1 \dots 
181: i_k}$. All the results described below (obviously) depend in a 
182: crucial way on the $p_1, \dots, p_n$ and (in the case of the 
183: batch learning algorithm) also on the joint probabilities. Since 
184: there is no \emph{a priori} reason to assume specific values for 
185: the probabilities, we shall assume that all of the $p_i$ are 
186: themselves \emph{independent, identically distributed random 
187: variables}. We shall refer to their common distribution as 
188: $\mathcal{F}$, and to the density as $f$. It turns out that the 
189: convergence properties of the various learning algorithms depend 
190: on the local analytic properties of the distribution 
$\mathcal{F}$ at $1$ -- a moment's reflection will convince the 
192: reader that this is not really so surprising. 
193: 
194: To carry out a precise analysis of the batch learning algorithm, 
195: we will also need the \emph{independence hypothesis}:
196: $$
197: p_{i_1 \dots i_k} = p_{i_1} \dots p_{i_k}.
198: $$
199: It is again not too surprising that some such assumption on 
200: correlations ought to be required for precise asymptotic results, 
201: though it is obviously the subject of a (non-mathematical) debate 
202: as to whether assuming that the various concepts are truly 
203: independent is reasonable from a cognitive science point of view. 
204: 
205: \section{Previous results}
206: In previous work \cite{kr1} and \cite{kr2} we obtained the 
207: following result. 
208:  \thm{theorem} {\label{mainprev}Let $N_\Delta$ be the number of steps it 
209: takes for the student (with probability $1$) to have probability 
210: $1 - \Delta$ of learning the concept. Then we have the following 
211: estimates for $N_\Delta$:
212: %
213: \begin{itemize}
214: \item
if the distribution of overlaps is \emph{uniform}, or more 
generally, the density function $f$ satisfies $f(1-x) = c + 
O(x^\delta),$ $\delta, c > 0,$ as $x$ approaches $0$, then 
$N_\Delta=|\log \Delta|\Theta(n \log n)$ for 
the memoryless algorithm and $N_\Delta=(1-\Delta)^2 \Theta(n \log 
n)$ when learning with full memory; 
221: %
222: \item 
if the probability density function $f(1-x)$ is asymptotic to 
$x^\beta + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as 
225: $x$ approaches $0$, then for the two algorithms we have 
226: respectively $N_\Delta=|\log \Delta|\Theta(n)$ and 
227: $N_\Delta=(1-\Delta)^2 \Theta(n)$;
228: %
229: \item 
if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
$N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)})$ for the memoryless
learner and $N_\Delta=(1-\Delta)^2\Theta(n^{1/(1+\beta)})$ for
learning with full memory.
234: %
235: \end{itemize}}
236: %
237: \noindent Recall that $f(x) = \Theta(g(x))$ means that for 
238: sufficiently large $x$, the ratio $f(x)/g(x)$ is bounded between 
239: two strictly positive constants. The distribution of overlaps 
240: referred to above is simply the distribution $\mathcal{F}$. 
241: Notice that the theorem says nothing about the situation when 
242: $\mathcal{F}$ is supported in some interval $[0, a]$, for $a<1$. 
243: That case is (presumably) of scientific interest, but 
244: mathematically it is relatively trivial: we replace the arguments 
245: of all the $\Theta$s above by $1$, though, of course, we are 
246: thereby hiding the dependence on $a$.
247: 
248: \section{General bounds on the batch learner algorithm}
249: 
250: Consider a set of words $w_1, \dots, w_k$. The probability that 
they all refer to the concept $R_i$ is, obviously, $p_i^k$. 
252: \begin{lemma}
253: \label{bounds}
254:  The probability $q_k$ that we still have not 
255: learned the concept $R_0$ after $k$ steps is bounded above by 
256: $\sum_{i=1}^n p_i^k$, and below by $\max_i p_i^k$. 
257: \end{lemma} 
258: \begin{proof}
259: Immediate. 
260: \end{proof}
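
In more detail (a sketch of the omitted argument, in notation introduced here): let $E_i$ denote the event that all of $w_1, \dots, w_k$ refer to $R_i$, so that $\mathbb{P}(E_i) = p_i^k$. The concept $R_0$ has not been learned after $k$ steps exactly when $\mathcal{L}_k \neq \{R_0\}$, that is, when at least one of the events $E_i$ occurs. Hence
\begin{equation*}
\max_{1 \leq i \leq n} p_i^k = \max_{1 \leq i \leq n} \mathbb{P}(E_i) \leq q_k = \mathbb{P}\Bigl(\bigcup_{i=1}^n E_i\Bigr) \leq \sum_{i=1}^n \mathbb{P}(E_i) = \sum_{i=1}^n p_i^k.
\end{equation*}
Note that neither bound uses any assumption on the joint probabilities.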
261: We will first use these upper and lower 
262: bounds to get corresponding bounds on the convergence speed of 
263: the batch learner algorithm, and then invoke the independence 
264: hypothesis to sharpen these bounds in many cases.
265: 
266: We begin with a trivial but useful lemma.
267: %
268: \begin{lemma}
269: \label{rearrange}
 Let $G$ be a game where the probability of 
success (respectively failure) after at most $k$ steps is $s_k$ 
(respectively $f_k = 1-s_k $), with $s_0 = 0$. Then the expected 
number of steps until success is 
$$\sum_{k=1}^\infty k (s_k - s_{k-1}) = \sum_{k=0}^\infty f_k = 1 + 
\sum_{k=1}^\infty f_k,$$ if the corresponding sum converges.
276: \end{lemma}
277: \begin{proof}
The proof is immediate from the definition of expectation and the 
possibility of rearrangement of terms of positive series.
280: \end{proof}
281: We can combine Lemma \ref{rearrange} and Lemma \ref{bounds} to 
282: obtain:
283: \begin{theorem}
284: \label{sumbounds} The expected time $T$ of convergence of the 
285: batch learner algorithm is bounded as follows:
286: \begin{equation}
287: \label{trivest} \sum_{i=1}^n \frac{1}{1-p_i} \geq T \geq 
288: \max_{1\leq i \leq n} \frac{1}{1-p_i}.
289: \end{equation}
290: \end{theorem}
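
As a sanity check on (\ref{trivest}), consider the simplest case $n = 1$ of a single competing concept (a special case chosen here purely for illustration). The student has learned $R_0$ after $k$ steps unless all $k$ words refer to $R_1$, so the probability of failure after $k$ steps is $p_1^k$, and the probability of succeeding exactly at step $k$ is $p_1^{k-1} - p_1^k = p_1^{k-1}(1-p_1)$. Directly from the definition of expectation,
\begin{equation*}
T = \sum_{k=1}^\infty k\, p_1^{k-1} (1-p_1) = \frac{1-p_1}{(1-p_1)^2} = \frac{1}{1-p_1},
\end{equation*}
so for $n=1$ the upper and lower bounds in (\ref{trivest}) coincide and are attained. For instance, $p_1 = 9/10$ gives $T = 10$.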
291: The leftmost term in equation (\ref{trivest}) has been studied at 
292: length in \cite{kr1}. We state a version of the results of 
293: \cite{kr1} below:
294: \begin{theorem}
\label{allstab} Let $S=\sum_{i=1}^n \frac{1}{1-p_i},$ where the 
$p_i$ are independent, identically distributed random variables 
with values in $[0, 1]$, with probability density $f$, such that 
$f(1-x) = x^\beta + O(x^{\beta + \delta}),\quad \delta > 0$ for 
$x\rightarrow 0$. If $\beta > 0$, then there exists a mean 
$m$, such that $\lim_{n \rightarrow \infty} \mathbb{P}(|S/n - m| 
> \epsilon) = 0,$ for any $\epsilon > 0.$ If $\beta = 0$, then 
$\lim_{n \rightarrow \infty} \mathbb{P}(|S/(n\log n) - 1| 
> \epsilon) = 0.$ Finally, if 
$-1 < \beta < 0,$ then $\lim_{n \rightarrow \infty} 
\mathbb{P}(S/n^{1/(\beta+1)} - C
> a) = g(a),$ where $\lim_{a \rightarrow \infty} g(a)= 0,$ and $C$ is 
an arbitrary (but fixed) constant, and likewise 
$$\lim_{n \rightarrow \infty}\mathbb{P}(S/n^{1/(\beta + 1)} < b) = h(b),$$ where $\lim_{b \rightarrow 0}h(b) = 0.$
309: \end{theorem}
The right hand side of Eq. (\ref{trivest}) is easier to 
understand. Indeed, let $p_1, \dots, p_n$ be distributed as usual 
(and as in the statement of Theorem \ref{allstab}). Then 
313: %
314: \begin{theorem}\label{expmin}
The expected value of $\max_{1 \leq i \leq n} p_i$ equals $1 - C 
n^{-1/(1+\beta)}(1+o(1)),$ for some positive constant $C$.
317: \end{theorem}
318: \begin{proof}
First, we change variables to $q_i = 1 - p_i$. Obviously, the 
statement of the Theorem is equivalent to the statement that $E = 
\mathbf{E}(\min_{1 \leq i \leq n} q_i) \asymp C  n^{-1/(1+\beta)}$. We 
also write $h(x) = f(1-x),$ and similarly for the primitives $H$ 
and $F$. Now, the probability that all of the $q_i$ are 
greater than some fixed $y$ equals $(1-H(y))^n,$ so the 
distribution function of $\min_{1 \leq i \leq n} q_i$ is 
$1-(1-H(y))^n$, and integration by parts gives 
$$E = \int_0^1 t d\left[1-(1-H(t))^n\right] = \int_0^1 (1-H(t))^n d t.$$
Perform the change of variables $t = u/n^{1/(1+\beta)}$, to get 
\begin{equation}
\label{firstint} E = \frac{1}{n^{1/(1+\beta)}} 
\int_0^{n^{1/(1+\beta)}} (1-H(u/n^{1/(1+\beta)}))^n du. 
\end{equation}
For $u \ll n^{1/(1+\beta)}$, we can write $H(u/n^{1/(1+\beta)}) 
\asymp H^\prime u^{\beta + 1}/n,$ where $H^\prime$ is a constant. 
We also know that $H$ is a monotonic function so if we break up 
the integral above as 
\begin{equation}
\label{secondint} E = \frac{1}{n^{1/(1+\beta)}} 
\left[\int_0^{n^{1/(2 (1 + \beta))}} + \int_{n^{1/(2 (1 + 
\beta))}}^{n^{1/(1 + \beta)}}\right] (1-H(u/n^{1/(1+\beta)}))^n 
du,
\end{equation}
we see that the first integral approaches $C = \int_0^\infty 
\exp(-H^\prime u^{1+\beta}) d u,$ while the second integral goes to 0. 
Note that the proof also evaluates $C$.
344: \end{proof}
345: We need one final observation:
346: \begin{theorem}
The variable $n^{1/(1+\beta)} \min_{1 \leq i \leq n} q_i$ has a limiting 
distribution with distribution function $G(x) = 
1-\exp(-H^\prime x^{1+\beta}),$ where $H^\prime$ is the constant 
appearing in the proof of Theorem \ref{expmin}.
350: \end{theorem}
351: \begin{proof}
352: Immediate from the proof of Theorem \ref{expmin}.
353: \end{proof}
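
For the uniform distribution everything can be computed in closed form, which gives a check on the two statements above (the specialization is made here only for illustration). In that case $\beta = 0$, the $q_i$ are again uniform on $[0,1]$, and $H(t) = t$, so
\begin{equation*}
\mathbf{E}\Bigl(\min_{1 \leq i \leq n} q_i\Bigr) = \int_0^1 (1-t)^n\, d t = \frac{1}{n+1},
\qquad
\mathbb{P}\Bigl(n \min_{1 \leq i \leq n} q_i > x\Bigr) = \Bigl(1 - \frac{x}{n}\Bigr)^n \longrightarrow e^{-x}
\end{equation*}
for every fixed $x \geq 0$ as $n \rightarrow \infty$. Thus the expected minimum is asymptotic to $n^{-1}$, and $n \min_i q_i$ converges in distribution to an exponential random variable of mean $1$.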
354: 
355: We can now put together all of the above results as follows.
356: \begin{theorem}
357: \label{allgen}
 Let $p_1, \dots, p_n$ be independent, identically distributed 
with common density function $f$, such that $f(1-x) = c x^\beta + 
O(x^{\beta + \delta}),$ $\delta > 0$. Let $T$ be the expected 
time of the convergence of the batch learning algorithm with 
overlaps $p_1, \dots, p_n$. If $\beta > 0$, then there 
exist $C_1, C_2$, such that  $C_1 n^{1/(1+\beta)} \leq T \leq C_2 
n$, with probability tending to $1$ as $n$ tends to $\infty$. If 
$\beta = 0$, then there exist $C_1, C_2$, such that $C_1 n \leq T 
\leq C_2 n \log n$, with probability tending to $1$ as $n$ tends 
to $\infty.$ If $-1 < \beta < 0$, then $C^{-1} n^{1/(\beta + 1)} \leq 
T \leq C n^{1/(\beta + 1)}$ with probability tending to $1$ as 
$C$ goes to infinity.
370: \end{theorem}
371: 
The reader will remark that in the case that $\beta < 0$, the 
upper and lower bounds have the same order of magnitude as 
functions of $n$.
375: 
376: \section{Independent concepts}
We now invoke the independence hypothesis, whereby an application 
of the inclusion-exclusion principle gives us:
379: 
380: \thm{lemma}{\label{latmost} The probability $l_k$ that we have 
381:  learned the concept $R_0$ after $k$ steps is given by 
382: $$
383: l_k=\prod_{i=1}^n(1-p_i^k).
384: $$
385: }
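
A sketch of the reasoning behind this formula: as noted above, the concept $R_i$ is still consistent with the first $k$ words with probability $p_i^k$. Under the independence hypothesis these survival events are independent for distinct $i$, so the probability that none of $R_1, \dots, R_n$ survives is the product of the $1 - p_i^k$. Expanding the product,
\begin{equation*}
\prod_{i=1}^n \left(1-p_i^k\right) = \sum_{s \subseteq \{1, \dots, n\}} (-1)^{|s|} \prod_{i \in s} p_i^k,
\end{equation*}
recovers the inclusion-exclusion form, in which the term indexed by $s$ is (up to sign) the probability that all the concepts in $s$ survive simultaneously.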
386: 
Note that the probability $s_k$ of winning the game \emph{on the 
$k$-th step} is given by $s_k = l_k - l_{k-1}= (1-l_{k-1}) - 
(1-l_k)$ (with $l_0 = 0$). The expected number of steps to learn 
the concept is $\sum_{k=1}^\infty k s_k,$ and summation by parts 
gives 
$$\sum_{k=1}^\infty k s_k = \sum_{k=0}^\infty (1-l_k) = 1 + \sum_{k=1}^\infty (1-l_k).$$
The additive constant $1$ plays no role in the asymptotics, so 
from now on we drop it and write $T = \sum_{k=1}^\infty (1-l_k).$
%
\thm{lemma}{\label{letime} Up to an additive constant $1$, the 
expected time $T$ of learning the concept $R_0$ is given by
$$
T = \sum_{k=1}^\infty \left(1-\prod_{i=1}^n 
\left(1-p_i^k\right)\right).
$$
} 
401: %
402: Since the sum above is absolutely convergent, we can expand the 
403: products and interchange the order of summation to get the 
404: following formula for $T$:
405: 
406: \begin{equation}
407: \label{subsum} T = \sum_{s\subseteq \{1, \dots, n\}} (-1)^{|s|-1} 
408: \sum_{k=1}^\infty p_s^k = \sum_{s\subseteq \{1, \dots, n\}} 
409: (-1)^{|s|-1} \left(\frac{1}{1-p_s} - 1\right),
410: \end{equation}
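where the sums range over the nonempty subsets $s$ of $\{1, \dots, 
n\}$, which we have identified with the corresponding 
multi-indices (so that $p_s = \prod_{i \in s} p_i$ by the 
independence hypothesis).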
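
For instance, for $n = 2$ (a worked special case added for illustration) we have $1 - (1-p_1^k)(1-p_2^k) = p_1^k + p_2^k - (p_1 p_2)^k$, and summing the three geometric series over $k \geq 1$ gives
\begin{equation*}
\sum_{k=1}^\infty \left(1 - (1-p_1^k)(1-p_2^k)\right) = \frac{p_1}{1-p_1} + \frac{p_2}{1-p_2} - \frac{p_1 p_2}{1 - p_1 p_2},
\end{equation*}
which is the right-hand side of (\ref{subsum}) with the three nonempty subsets $\{1\}$, $\{2\}$ and $\{1, 2\}$. With $p_1 = p_2 = 1/2$ this evaluates to $1 + 1 - 1/3 = 5/3$.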
413: 
Formula (\ref{subsum}) is useful in and of itself, but we now 
use it to attempt to get the expectation of the expected time of 
success $T$ under our distribution and independence assumptions. 
417: For this we shall need the following:
418: %
419: \thm{definition}{\label{zdef} Let $\mathcal{F}$ be a probability 
420: distribution on an interval $I$, and let $m_k(\mathcal{F}) = 
421: \int_I x^k\mathcal{F}(d x)$ be the $k$-th moment of 
422: $\mathcal{F}$. Then the \emph{moment zeta function of 
423: $\mathcal{F}$} is defined to be 
424: $$\zeta_{\mathcal{F}}(s) = \sum_{k=1}^\infty m_k^s(\mathcal{F}),$$ whenever the sum is defined.
425: }
426: %
\thm{lemma}{\label{zetalemma} Let $\mathcal{F}$ be a probability 
distribution as above, and let $x_1, \dots, x_n$ be independent 
random variables with common distribution $\mathcal{F}$. Then
\begin{equation}
\mathbb{E}\left(\frac{1}{1-x_1 \cdots x_n} - 1\right) = 
\zeta_{\mathcal{F}}(n).
\end{equation}
In particular, the expectation is undefined whenever the zeta 
function is undefined. }
436: %
437: \begin{proof}
438: Expand the fraction in a geometric series and apply Fubini's 
439: theorem.
440: \end{proof}
441: %
\thm{example} { For $\mathcal{F}$ the uniform distribution on 
$[0, 1]$, we have $m_k(\mathcal{F}) = 1/(k+1)$, so 
$\zeta_{\mathcal{F}}(s) = \zeta(s) - 1$, where $\zeta$ is the 
familiar Riemann zeta function. Notice that this is \emph{not} 
defined for $s=1$ -- this will be important in the sequel.}
446: 
It should be noted that in the case we are interested in 
(distributions supported in $[0, 1]$), the asymptotics of the 
moments are determined by the local properties of the 
distribution at $1$, up to exponentially decreasing error terms. 
So, if $f(1-x) \asymp x^\beta$ (recall that $f$ is the density), 
we see that the $k$-th moment of $\mathcal{F}$ is asymptotic to 
$C k^{-(1+\beta)},$ for some constant $C$.  To show this, we 
first define the \emph{Mellin transform} of $f$ to be 
$$\mathcal{M}(f)(s) = \int_0^1 f(x) x^{s-1} d x.$$ We see that 
$m_k(\mathcal{F}) = \mathcal{M}(f)(k+1).$ The Mellin transform is 
very closely related to the Laplace transform. Indeed, making the 
substitution $x = \exp(-u)$, we see that $$\mathcal{M}(f)(s) = 
\int_0^\infty f(\exp(-u)) \exp(-s u) d u,$$ so the Mellin 
transform of $f$ is equal to the Laplace transform of $u \mapsto 
f(\exp(-u)).$ Now, the asymptotics of the Laplace transform are easily 
computed by Laplace's method, and in the case we are interested 
in, Watson's lemma (see, e.g., \cite{benorsz}) tells us that if 
$f(x) \asymp c (1-x)^\beta$ as $x \rightarrow 1$, then 
$\mathcal{M}(f)(s) \asymp c 
\Gamma(\beta+1) s^{-(\beta + 1)}$ as $s \rightarrow \infty.$ In particular, 
$\zeta_{\mathcal{F}}(s)$ is defined for $s 
>1/(1+\beta)$. Below we shall analyze three cases (though the 
analysis is almost the same in the three cases, there are some 
important variations). In the sequel, we set $\alpha = \beta + 1$.
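
To see the moment asymptotics in a concrete case (our choice of example; any density with the same behavior at $1$ would do), take $f(x) = (\beta+1)(1-x)^\beta$ on $[0,1]$ with $\beta > -1$. The moments are then given by a beta integral:
\begin{equation*}
m_k(\mathcal{F}) = (\beta+1)\int_0^1 x^k (1-x)^\beta\, d x = \frac{\Gamma(\beta+2)\,\Gamma(k+1)}{\Gamma(k+\beta+2)} \sim \Gamma(\beta+2)\, k^{-(\beta+1)}, \qquad k \rightarrow \infty,
\end{equation*}
in agreement with the $k^{-(\beta+1)}$ decay predicted by Watson's lemma. For $\beta = 0$ this is the uniform distribution, with $m_k(\mathcal{F}) = 1/(k+1)$ exactly.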
470: \section{$\alpha > 1$}
471: \label{isdef} 
472: In this case, we use our assumptions to rewrite Eq. 
473: (\ref{subsum}) as 
474: \begin{equation}
475: \label{subsum2} 
476: %
477: T = - \sum_{k=1}^n \binom{n}{k}(-1)^k \zeta_{\mathcal{F}}(k).
478: \end{equation}
479: This, in turn, can be rewritten (by expanding the definition of 
480: zeta) as
481: \begin{equation}
482: \label{subsum3} T = - \sum_{j=1}^\infty 
483: \left[\left(1-m_j(\mathcal{F})\right)^n-1\right]
484: \end{equation}
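
The step from (\ref{subsum2}) to (\ref{subsum3}) is the binomial theorem applied termwise. Writing $\zeta_{\mathcal{F}}(k) = \sum_{j=1}^\infty m_j^k$ with $m_j = m_j(\mathcal{F})$, and interchanging the two sums (which is legitimate here because $\sum_j m_j$ converges when $\alpha > 1$), we get
\begin{equation*}
- \sum_{k=1}^n \binom{n}{k}(-1)^k \sum_{j=1}^\infty m_j^k
= - \sum_{j=1}^\infty \sum_{k=1}^n \binom{n}{k} (-m_j)^k
= - \sum_{j=1}^\infty \left[\left(1-m_j\right)^n - 1\right].
\end{equation*}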
485: Since the term in the sum is monotonically decreasing, the sum in 
486: Eq. (\ref{subsum3}) can be approximated by an integral (of 
487: \emph{any} monotonic interpolation $m$ of the sequence 
488: $m_j(\mathcal{F})$; however there is no reason not to set $m(x) = 
489: \mathcal{M}(f)(x+1)$), with error bounded by the first term, 
which is, in turn, bounded in absolute value by $2$, to get 
491: \begin{equation}
492: \label{approx1} T = - \int_1^\infty \left[(1-m(x))^n -1\right] d 
493: x + O(1),
494: \end{equation}
495: where the error term is bounded above by $2$.
496: 
Now, let us assume that $m(x)$ is of order $x^{-\alpha}$ for some 
$\alpha > 1$. We substitute $x = n^{1/\alpha}/u$, to get
499: \begin{equation}
500: \begin{split}
 T &=  n^{1/\alpha}\int_0^{n^{1/\alpha}}
\frac{\left[1-(1-m(n^{1/\alpha}/u))^n \right]}{u^2} d u + O(1)\\ 
&= 
n^{1/\alpha}\int_0^{n^{1/\alpha}}\frac{\left[1-(1-m^\prime(u)u^\alpha/n)^n 
\right]}{u^2} d u + O(1)\\ & = 
n^{1/\alpha}\left(\int_0^{n^{1/(2\alpha)}} + 
\int_{n^{1/(2\alpha)}}^{n^{1/\alpha}}\right)
\frac{\left[1-(1-m^\prime(u)u^\alpha/n)^n \right]}{u^2} d u + 
O(1) ,
510: \end{split}
511: \end{equation}
512: where $m^\prime$ is a bounded (asymptotically constant) function. 
513: In the second integral the integrand is bounded above by $1/u^2$, 
514: so the contribution from that integral goes to $0$, while in the 
515: first integral we can approximate $(1-m^\prime u^\alpha/n)^n$ by  
516: $\exp(-m^\prime u^\alpha)$, and the contribution from that 
517: integral goes to 
518: \begin{equation}
519: \label{mainalpha} T = n^{1/\alpha} 
520: \int_0^\infty\frac{1-\exp(-m^\prime(u) u^\alpha)}{u^2} d u + O(1) 
521: \asymp C n^{1/\alpha}.
522: \end{equation}
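
If one treats $m^\prime(u)$ as exactly constant, say $m^\prime \equiv c$ (an idealization made here only to make the constant explicit), the integral in (\ref{mainalpha}) can be evaluated by integrating by parts and then substituting $v = c u^\alpha$:
\begin{equation*}
\int_0^\infty \frac{1-\exp(-c u^\alpha)}{u^2}\, d u
= c \alpha \int_0^\infty u^{\alpha - 2} \exp(-c u^\alpha)\, d u
= c^{1/\alpha}\, \Gamma\left(1 - \frac{1}{\alpha}\right).
\end{equation*}
The boundary terms vanish, and the result is finite, precisely because $\alpha > 1$. The constant blows up as $\alpha$ decreases to $1$, which is consistent with the different behavior at $\alpha = 1$ treated next.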
523: %
524: \section{$\alpha = 1$}
525: \label{medalpha} In this case, $f(x) = c + o(1)$ as $x$ 
526: approaches $1$.  It is not hard to see that 
527: $\zeta_{\mathcal{F}}(n)$ is defined for $n \geq 2$. We break up 
528: the expression in Eq. (\ref{subsum}) as 
529: \begin{equation}
\label{subsumm} T = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right) + 
\sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1} 
 (-1)^{|s|-1} 
\left(\frac{1}{1-p_s} - 1\right).
\end{equation}
Let 
\begin{gather*} T_1 = \sum_{j=1}^n \left(\frac{1}{1-p_j} - 1\right),\\
537:  T_2 = \sum_{s\subseteq \{1, \dots, n\}, \quad |s| > 1} 
538:  (-1)^{|s|-1} 
539: \left(\frac{1}{1-p_s} - 1\right).
540: \end{gather*}
 The first sum $T_1$ has 
no expectation; however, $T_1/n$ does have a stable 
distribution centered on $c \log n + c_2$. We will keep this in 
544: mind, but now let us look at the second sum  $T_2$. It can be 
545: rewritten as 
546: \begin{equation}
547: \label{subsumm2} T_2 = - \sum_{j=1}^\infty 
548: \left[\left(1-m_j(\mathcal{F})\right)^n-1 + n m_j\right].
549: \end{equation}
The same method as in section \ref{isdef}, under the assumption 
that the $k$-th moment is asymptotic to $k^{-\alpha}$ (this time 
with $\alpha = 1$), can be used to write 
553: \begin{equation}
554: \begin{split}
T_2 &= n \int_0^n \frac{\left[1-n m(n/u) - (1-m(n/u))^n 
\right]}{u^2} d u + O(1)\\ &= n\left(\int_0^{n^{1/2}} + 
\int_{n^{1/2}}^n\right) \frac{\left[1- m^\prime(u) u - 
(1-m^\prime(u)u/n)^n \right]}{u^2} d u + O(1).
559: \end{split}
560: \end{equation} The conclusion differs somewhat from that of section \ref{isdef} in that  we get an 
561: additional term of $c n \log n$, where $c = \lim_{x \rightarrow 
562: 1} f(x) = \lim_{j \rightarrow \infty} j m_j$. This term is equal 
563: (with opposing sign) to the center of the stable law satisfied by 
564: $T_1$, so in case $\alpha = 1$, we see that $T$ has no 
565: expectation but satisfies a \emph{law of large numbers}, of the 
566: following form: 
567: %
568: \begin{theorem}[Law of large numbers]
There exists a constant $C$ such that, for every $\epsilon > 0$, 
$\lim_{n \rightarrow \infty} \mathbf{P}(|T/n - C| > \epsilon) = 0.$
571: \end{theorem}
572: \section{$\alpha <1$}
\label{smallalpha} In this case the analysis goes through as in 
the preceding section when $\alpha > 1/2$, but runs into 
considerable difficulties when $\alpha \leq 1/2$. However, in this 
case Theorem \ref{allgen} already gives us tight bounds. 
577: \section{The inevitable comparison}
578: We are now in a position to compare the performance of the batch 
579: learning algorithm with that of the memoryless learning algorithm 
580: and of learning with full memory, as summarized in Theorem 
581: \ref{mainprev}. We combine our computations above with the 
582: observation that the batch learner algorithm converges 
583: geometrically (Lemma \ref{latmost}), to get: 
584: %
585: \thm{theorem} {\label{batchthm} Let $N_\Delta$ be the number of 
586: steps it takes for the student (with probability $1$) to have 
587: probability $1 - \Delta$ of learning the concept using the batch 
588: learner algorithm. Then we have the following estimates for 
589: $N_\Delta$:
590: %
591: \begin{itemize}
592: \item
If the distribution of overlaps is \emph{uniform}, or more 
generally, the density function $f$ satisfies $f(1-x) = c + 
O(x^\delta),$ $\delta, c > 0,$ as $x$ approaches $0$, then 
$N_\Delta=|\log \Delta|\Theta(n)$;
597: %
598: \item 
If the probability density function $f(1-x)$ is asymptotic to 
$x^\beta + O(x^{\beta + \delta}), \quad \delta, \beta > 0$, as 
601: $x$ approaches $0$, then we have $N_\Delta=|\log 
602: \Delta|\Theta(n^{1/(1+\beta)})$; 
603: %
604: \item 
605: If the asymptotic behavior is as above, but $-1 < \beta < 0$, 
606: then $N_\Delta=|\log \Delta|\Theta(n^{1/(1+\beta)}).$ 
607: %
608: \end{itemize}}
Comparing Theorems \ref{mainprev} and \ref{batchthm}, we see that 
the batch learning algorithm is uniformly superior for $\beta \geq 
0$, and the only one of the three to achieve \emph{sublinear} 
performance whenever $\beta 
> 0$ (the other two \emph{never} do better than linearly, unless 
the distribution $\mathcal{F}$ is supported away from $1$). On 
615: the other hand, for $\beta < 0$, the batch learning algorithm 
616: performs comparably to the memoryless learner algorithm, and 
617: worse than learning with full memory.
618: %\section{$\alpha <1$}
619: %The same method as in section \ref{isdef} under the assumption 
620: %that the $k$-th moment is asymptotic to $k^\alpha$ (this time for 
621: %$\alpha \leq 1$) can be used to write 
622: %\begin{equation}
623: %\begin{split}
624: %T_2 &= n^{1/alpha} \int_0^{n^{1/\alpha}} \frac{\left[1-n 
625: %m(n^{1/\alpha}/u) - (1-m(n^{1/\alpha}/u)^n \right]}{u^2} d u + 
626: %O(1)\\ &= n^{1/\alpha}\left(\int_0^{n^{1/2\alpha}} + 
627: %\int_{n^{1/2\alpha}}^{n^{1/\alpha}}\right) \frac{\left[1- 
628: %m^\prime(u) u^\alpha - (1-m^\prime(u)u^\alpha/n)^n \right]}{u^2} 
629: %d u + O(1).
630: %\end{split}
631: %\end{equation} If $1/2 < \alpha < 1$, the argument finishes in 
632: %exactly the same way as in section \ref{isdef}, to give us $T_2 
633: %\asymp C n^{1/\alpha}$. However, if $\alpha = 1$, we get an 
634: %additional term of $C_2 n \log n$, where $C_2 = \lim_{j 
635: %\rightarrow \infty} m_j$. This term is equal (with opposing sign) 
636: %to the center of the stable law satisfied by $T_1$, so in case 
637: %$\alpha = 1$, we see that $T$ has no expectation but satisfies a 
638: %law of large numbers, with center linear in $n$. If $\alpha \leq  
639: %1/2$, the integral diverges.
640: \begin{thebibliography}{xxxxxxxxxxxx}
641: 
\bibitem[BenOrsz]{benorsz}
Bender, C.~M. and Orszag, S. (1999) \textit{Advanced mathematical 
methods for scientists and engineers, I}, Springer-Verlag, New 
York.

\bibitem[KNN2001]{knn}
Komarova, N.~L., Niyogi, P. and Nowak, M.~A. (2001) The evolutionary 
dynamics of grammar acquisition, \textit{J.~Theor.~Biology} 
\textbf{209}(1), 43--59.

\bibitem[KN2001]{kn} 
Komarova, N.~L. and Nowak, M.~A. (2001) Natural selection of the 
critical period for grammar acquisition, \textit{Proc. Royal Soc. 
B}, to appear.

\bibitem[KR2001a]{kr1}
Komarova, N.~L. and Rivin, I. (2001) Harmonic mean, random 
polynomials and stochastic matrices, \textit{preprint}.

\bibitem[KR2001b]{kr2}
Komarova, N.~L. and Rivin, I. (2001) On the mathematics of 
learning. 

\bibitem[Niyogi1998]{niy}
Niyogi, P. (1998) \textit{The Informational Complexity of 
Learning}, Kluwer, Boston.

\bibitem[NKN2001]{nkn}
Nowak, M.~A., Komarova, N.~L. and Niyogi, P. (2001) Evolution of 
universal grammar, \textit{Science} \textbf{291}, 114--118.
672: 
673: \end{thebibliography}
674: 
675: 
676: \end{document}
677: 
678: