abstract:d08ab110760c9287.tex

1: \begin{abstract}

2:  %The ``rich get richer''. An author tends to use certain words over and over again so word frequencies represent a quick way to estimate the probability of the next word in streaming text by an author.

3:  %Luckily the Yule - Simon distribution provide a model for word frequencies.

4: %These are examples of ``preferential attachment'' processes and Yule-Simon provided a model for them.

5: In this paper we develop an Expectation-Maximization (EM) algorithm to estimate the parameter of a Yule-Simon distribution.

6: The Yule-Simon distribution exhibits the ``rich get richer'' effect, whereby an 80-20 type of rule tends to dominate.

7: Such distributions are ubiquitous in industrial settings.

8: The EM algorithm presented provides both frequentist and Bayesian estimates of the parameter $\lambda$.

9: By placing the estimation method within the EM framework, we are able to derive standard errors of the resulting estimate.

10: Additionally, we prove convergence of the Yule-Simon EM algorithm and give an explicit, closed-form expression for its rate of convergence.

11: %We compare the EM algorithm with a fixed point algorithm and with a Gibbs sampler for the posterior of the parameter.

12: %\bf{D: must be changed - more like how you wrote in the conclusion We also provide both empirical and theoretical estimates of the estimation error. We evaluate the methods' performance on both synthetic and text data. The EM estimate is exactly equivalent to the fixed point. Gibbs sampler comes close. The standard errors provided by Oakes, Louis and from the Gibbs posterior also compare. The estimation algorithm converges in 10-15 steps and a theoretical convergence argument is given.}

13: \end{abstract}

14: