1: \documentclass{elsart}
2: \usepackage{epsfig}
3: \usepackage{amssymb}
4:
5: \begin{document}
6:
7: \begin{frontmatter}
8:
9: \title{On the universality of rank distributions of website popularity}
10: \author{Serge A. Krashakov\corauthref{cor}},
11: \corauth[cor]{Corresponding author.}
12: \ead{sakr@itp.ac.ru}
13: \author{Anton B. Teslyuk\thanksref{now}},
\thanks[now]{Present address: Institute of Information Science,
15: RRC Kurchatov Institute, 1~Kurchatov sq., Moscow, 123182, Russia}
16: \author{Lev N. Shchur}
17:
18: \address{Landau Institute for Theoretical Physics, Chernogolovka, 142432 Russia}
19:
20: \begin{abstract}
21:
22: We present an extensive analysis of long-term statistics of the queries to
23: websites using logs collected on several web caches in Russian academic
24: networks and on US IRCache caches. We check the sensitivity of the statistics
25: to several parameters: (1)~duration of data collection, (2) geographical
26: location of the cache server collecting data, and (3) the year of data
27: collection. We propose a two-parameter modification of the Zipf law and
28: interpret the parameters. We find that the rank distribution of websites
is stable when approximated by the modified Zipf law. We suggest
that website popularity may be a universal property of the Internet.
31:
32: \end{abstract}
33:
34: \begin{keyword}
35: Internet, Web traffic, Rank Distribution, Zipf Law
\PACS 89.20.Hh World Wide Web, Internet - 89.75.Da Systems obeying scaling laws
37: \end{keyword}
38: \end{frontmatter}
39:
40: \section{Introduction}
41: \label{intro}
42:
43: It has been known for a decade that web-document popularity follows the
44: Zipf law~\cite{Glassman94}. Nevertheless, the exponent values reported by
45: different authors vary significantly, from 0.60 to
46: 1.03~\cite{Glassman94,Breslau99,Kelly02,Doyle02} (see
Table~\ref{Datasets}). We believe that the scatter of the reported
values is due to the small sample size in some cases and to the details of
the fitting procedure used to extract the exponent.
50:
51: In this paper, we propose that the rank distribution of the websites
52: follows the Zipf law and give arguments supporting our idea. We must
53: note that website statistics are more extensive than web-document
54: statistics, and the distribution parameters can be obtained with
55: higher accuracy.
56:
57: We address the following questions: Is the rank distribution of
58: websites Zipf-like? If yes, what are the conditions under which the
59: ``true'' exponent can be obtained? Does the exponent depend on the
60: duration of the observation? Or on the geographical position of the
61: observer? And does the exponent vary with time, as the Internet
62: develops?
63:
64: We report some answers to these questions. We have studied website
65: statistics, which are indeed more stable than web-document statistics.
66: We have analyzed log files accumulated on cache servers of Russian
67: academic networks (FREEnet, RASnet, and RSSI) for about six years. These
68: networks differ by their connectivity topology and bandwidth, both
69: national and international. These cache servers have different
70: geographical locations (Moscow, Moscow region, and Yaroslavl in Russia).
71: In addition, we analyzed some statistics collected during seven weeks in
72: the fall of 2004 at a number of IRCache servers in the United States (see
73: Table~\ref{table-setUS}).
74:
75: We found that the statistics studied become stable\footnote{The
76: accuracy of the exponent becomes a few percent, e.g., 5\%.} when the
77: number of queries for the given statistics exceeds $10^5$. It is
78: therefore meaningful to fit only those data for which the number of
79: queries exceeds this value. This simple criterion can be used to
80: estimate the critical window for the rank interval where the
81: distribution is stable and the power law can be observed.
82:
83: We found that the statistics are independent of the geographical
84: location of the cache server (observer) collecting the data, at least for
85: the analyzed data sets.
86:
87: We found that the distribution is independent of the different years
88: of data collection and is therefore stable over Internet history and
89: development.
90:
Nevertheless, we found that the Zipf-like law is a suitable
approximation only in a middle range of ranks spanning several orders of
magnitude. We propose a modification of the Zipf-like law with two
additional parameters and explain their possible meaning. We found that
when the modified law is fitted to the data, the estimated exponent of the
website popularity distribution becomes quite stable: the value of
$\alpha$ is $1.02\pm0.05$ for all datasets studied in this
paper. We may thus suggest that website popularity follows the Zipf law.
99:
100: We verified that the same modification also works perfectly for the
101: web-document ranked distribution.
102:
103: The paper is organized as follows. In section~\ref{nature}, we present
104: a brief history of the power laws observed in nature and society. We
105: describe the data collection and processing in section~\ref{datasets}.
106: We discuss the results in section~\ref{discussion} and present our
107: conclusions in section~\ref{conclusions}.
108:
109: \section{Power laws in nature and society}
110: \label{nature}
111:
More than 100 years ago, Pareto~\cite{Pareto} observed that the cumulative
distribution $F$ of income $f$ in all countries can be described by the relation
114:
115: \begin{equation}
116: \label{Pareto}
117: F(f)=1-(m/f)^{\alpha},
118: \end{equation}
119:
120: \noindent where the exponent $\alpha\simeq1.5$ and $m$ is some
121: constant. About 70 years ago, George Zipf~\cite{Zipf49} discovered a
122: striking regularity in English texts: the relative occurrence
123: frequency $f$ of the $r$th most popular word is inversely
124: proportional to the rank $r$:
125:
126: \begin{equation}
127: \label{Zipf-law}
128: f_r\sim\frac1r.
129: \end{equation}
130:
A more general form of the Zipf law~(\ref{Zipf-law}) with the exponent
132: $\alpha \ne 1$ is often encountered in the literature and is known as
133: a {\em Zipf-like law}:
134:
135: \begin{equation}
136: \label{Zipf-like}
137: f_r\sim\frac{1}{r^\alpha}.
138: \end{equation}
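
As a brief illustrative aside (the constant $C$ and the sample size $n$
are our notation, introduced only for this remark), taking logarithms
of~(\ref{Zipf-like}) with a proportionality constant $C$ gives
\[
\log f_r=\log C-\alpha\log r,
\]
so a Zipf-like law appears on a log-log plot as a straight line with
slope $-\alpha$. Relation~(\ref{Pareto}) can be brought to the same form:
if $n$ incomes are ranked in decreasing order, the rank of income $f$ is
approximately $r\simeq n[1-F(f)]=n(m/f)^{\alpha}$, and hence
\[
f_r\simeq m\left(\frac{n}{r}\right)^{1/\alpha},
\]
i.e., the exponent $\alpha$ of~(\ref{Pareto}) corresponds to a rank
exponent $1/\alpha$ (about $0.67$ for $\alpha\simeq1.5$).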
139:
140: A Zipf-like law has been found in many areas of human activity and in
141: nature. Among examples are the distribution of words in random
142: texts~\cite{Li92}, of nucleotide ``words'' in
143: DNA~\cite{Mantegna,Martindale}, of bit sequences in UNIX executable
144: files~\cite{Mantegna}, of book popularities in
145: libraries~\cite{Zipf49,Mandelbrot}, of countries' areas and population
146: sizes~\cite{Zipf49,Zanette97,Marsili98}, of scientific publication
citation indices~\cite{Redner98}, and of forest-fire areas~\cite{Malamud}.
148: Many other examples can be found in recent
149: reviews~\cite{Newman2004,Mitzenmacher2004}.
150:
Meanwhile, there is much discussion of whether a lognormal or a power law is
a better fit for some empirical distributions, for example, income
distribution, population fluctuations, file size distribution, and some
others (for a short review, see~\cite{Mitzenmacher2004}). In many cases
a lognormal distribution looks like a power-law distribution over several
orders of magnitude~\cite{Mitzenmacher2004,Laherrere1998}. We leave this
question open and analyze our data using a Zipf-like law.
158:
159: \begin{table}
160: \caption{Characteristics of Published Web Datasets}
161: \begin{tabular}{lcccll}\\ \hline
162: Dataset & Date & \# of & \# of &$\alpha$&Ref.\\
163: & (Period) & requests & pages & & \\
164: \hline
165: DEC & 1994 & $\sim$ 100k & & 1 & ~\cite{Glassman94}\\
166: BU & Jan95(42d) & 575775 & 54438 & 0.99 &~\cite{Cunha95}\\
167: BU & 1998 & 66988 & 41049 & 0.65 &~\cite{Barford99}\\
168: DEC & Jul96(6d)& 3543968 & 1354996 & 0.77 & ~\cite{Jin2000}\\
169: NLANR.RTP & Jun99(13d)& 9113027 & 3249549 & 0.71 & ~\cite{Jin2000}\\
170: NLANR.SD & Jun99(13d)& 9082461 & 3549609 & 0.72 & ~\cite{Jin2000}\\
171: NLANR.UC & Jun99(13d)& 8983585 & 2459366 & 0.66 & ~\cite{Jin2000}\\
172: USASK & Oct98(82d) & 20754720 & 5527667 & 0.76 & ~\cite{Mahanti99}\\
173: CANARIE & Dec98(26d) & 35129680 & 1423081 & 0.63 & ~\cite{Mahanti99}\\
174: NLANR.UC & Dec98(31d) & 20018680 & 7681214 & 0.65 & ~\cite{Mahanti99}\\
175: USASK & Feb99(45d) & 21070330 & 5510561 & 0.84 & ~\cite{Mahanti00}\\
176: CANARIE & Feb99(45d) & 7310038 & 4571539 & 0.77 & ~\cite{Mahanti00}\\
177: NLANR.UC & Feb99(30d) & 24560611 & 8482661 & 0.74 & ~\cite{Mahanti00}\\
178: NLANR.LJ & 1998 & $\sim$ 500k & & 0.64 & ~\cite{Roadknight99}\\
179: UPisa & 1998 & $\sim$ 500k & & 0.91 & ~\cite{Roadknight99}\\
180: FUNET & 1998 & $\sim$ 500k & & 0.70 & ~\cite{Roadknight99}\\
181: SPAIN & 1998 & $\sim$ 500k & & 0.72 & ~\cite{Roadknight99}\\
182: RMPLC & 1998 & $\sim$ 500k & & 0.86 & ~\cite{Roadknight99}\\
183: BU-CS & Oct95(14d) & 80518 & 4471 & 0.85 & ~\cite{Almeida96}\\
184: Hitachi & 1997(16d) & 2000000 & & 0.75 & ~\cite{Nishikawa98}\\
185: DEC & Aug96(7d) & 3543968 & & 0.77 & ~\cite{Breslau99}\\
186: UCB & Nov96(18d) & 1907762 & & 0.78 &~\cite{Breslau99}\\
187: UPisa & (3m) & 2833624 & & 0.83 &~\cite{Breslau99}\\
188: Questnet & Jan98(7d) & 2885285 & & 0.69 &~\cite{Breslau99}\\
189: NLANR & Dec97(1d) & 1766409 & & 0.73 &~\cite{Breslau99}\\
190: FUNET & Jun98(10d) & 4815551 & & 0.64 &~\cite{Breslau99}\\
191: HGMP & Jan98(7m) & $\sim$ 750k & & 0.60& ~\cite{Breslau99}\\
192: WebTV & Sep00(16d) & 347460865& 32541361 & 1.03 & ~\cite{Kelly02}\\
193: \hline
194: \label{Datasets}
195: \end{tabular}
196: \end{table}
197:
It is widely assumed that web-document popularity follows a Zipf-like law.
We summarize the published results in
200: Table~\ref{Datasets} with the dataset name, the date and period of log
201: files in days (d) or months (m), the number of requests, the number of
202: unique web pages requested, and the reported value of the exponent
203: $\alpha$.\footnote{Some papers do not provide all the information
204: (e.g., the number of unique pages) for the datasets studied.} It can
205: be seen that exponent values vary from $0.60$ to $1.03$.\footnote{Here
206: we consider document popularity observed at the client (BU dataset) or
207: proxy side only. Values of the exponent $\alpha$ observed at the
208: web-server side vary from $0.67$ to $1.82$~\cite{website}.} A
question arises: {\em Why is the variation of the exponent so large?}
Probably, the sample size is important, and the Zipf-like law fits
at best two decades of ranks well. It is quite inapplicable in
the ``tails'' and at small ranks, and the results are sensitive to the
choice of the rank window for fitting the data.
214:
We know of only two papers where the website popularity issue was
addressed. In~\cite{Aida98}, the authors claim that the destination
address of web requests can be characterized by two types of Zipf laws.
In~\cite{Breslau99}, the authors presented results for three sets of
user request traces (shown in~\cite{Breslau99} in Fig.~5, which
is similar to our Figs.~\ref{Fig1} and \ref{Fig2}). In
particular, the UCB trace in their Fig.~5 looks similar to the set
{\em 2001-09-03} shown in our Fig.~\ref{Fig2}, and it is practically
impossible to extract any value of the exponent $\alpha$ by fitting the
Zipf-like law~(\ref{Zipf-like}) to it. To our knowledge, the authors did not
publish the announced preprint with the values of the exponent $\alpha$.
226:
227: \section{Datasets and methods}
228: \label{datasets}
229:
230: \begin{table}
231: \centering
232: \caption{Characteristics of Analyzed Web Datasets in Russia}
233: \begin{tabular}{llcccc}\\
234: \hline
235: Dataset & Proxy & Starting & Period & \# of & \# of \\
236: & & date & & requests & websites \\
237: \hline
238: \em{1996} & CHG & Sep 1996 & 74d & 155743 & 4360 \\
239: \em{1997} & CHG & Jan 1997 & 1y & 2642722 & 44881 \\
240: \em{2000} & CHG & Sep 2000 & 3m & 27130648 & 146693 \\
241: \em{2001} & CHG & Feb 2001 & 8m & 64577294 & 269868 \\
242: \em{ikia-2001} & IKIA & Jul 2001 & 4m & 29296632 & 177497 \\
243: \em{ikia-2002} & IKIA & May 2002 & 1m & 2067205 & 53747 \\
244: \em{wc-2001} & FREEnet & Jan 2001 & 4.5m & 16989853 & 152760 \\
245: \em{wc-2002} & FREEnet & Feb 2002 & 5m & 26576501 & 239891 \\
246: \em{yar-2002} & Yars & Apr 2002 & 1m & 9639987 & 86611 \\
247: \em{ras-2002} & RASnet & Feb 2002 & 5m & 9240289 & 227686 \\
248: \hline
249: \em{2001-09} & CHG & Sep 2001 & 1m & 7333162 & 68671 \\
250: \em{2001-09-1w} & CHG & Sep 2001 & 1w & 1382537 & 24103 \\
251: \em{2001-09-03} & CHG & Sep 2001 & 1d & 273361 & 7854 \\
252: \hline
253: \label{table-sets}
254: \end{tabular}
255: \end{table}
256:
257: We start our analysis with the data collected on several proxies (cache
258: servers) located in different Russian academic networks and in the next
259: section will compare the results with the analysis of data collected in
260: the fall of 2004 on American IRCache servers. Collections of data from
261: Russian servers are presented in Table~\ref{table-sets} with the dataset
262: name, proxy server location, starting date of log files, period of log
263: file in days (d), weeks (w), months (m), or years (y), number of requests,
264: and number of unique websites requested. The following abbreviations are
265: used for proxies: {\em CHG} for the proxy located in the Chernogolovka
266: network (AS9113), Chernogolovka, Moscow region, Russia; {\em IKIA} for
267: the proxy in Space Research Institute RAS (AS3218), Moscow, Russia; {\em
268: FREEnet} for the proxy in FREEnet (AS2895), Moscow, Russia; {\em RASnet}
269: for the proxy located in RASnet (AS3058), Moscow, Russia; and {\em Yars}
270: for the proxy located in Yaroslavl State University (AS8325), Yaroslavl,
271: Russia. Proxy-servers {\em CHG} and {\em Yars} are typical regional cache
272: servers serving requests from local users. Other servers located in Moscow
273: are a central part of the Russian web-caching hierarchy~\cite{sakr98} and
274: serve requests from local users as well as from other (e.g., regional)
275: cache servers.
276:
277: \begin{figure}
278: \centering
279: %\psfig{file=scheme.eps,width=\columnwidth}
280: \psfig{file=scheme.eps,width=60mm}
281: \caption{Sketch of the data collection}
282: \label{scheme}
283: \end{figure}
284:
285: \begin{figure}
286: \centering
287: \psfig{file=caches.eps,width=90mm}
\caption{Hierarchy of the cache server network.}
289: \label{caches}
290: \end{figure}
291:
All proxy servers run the Squid caching software. Figure~\ref{scheme}
sketches the process of data collection: user queries go to the cache
server, which processes user queries to the web servers and keeps
traces of user requests as records in log files. We therefore call the
cache servers ``observers'' to stress the possible importance of their
position in the Internet. Cache servers in Russian academic networks
are organized in the hierarchy sketched in Figure~\ref{caches}. User queries go
through the local proxy servers to regional cache servers, which may
redistribute them to the servers on national research and educational
networks, which may in turn send queries to the neighboring caches or
directly to the destination. Some queries may also be sent to IRCache servers.
We note that the cache server network is a
logical, programmable one and does not reflect Internet connectivity but
is rather a subgraph of the Internet.
306:
307: We must note here that information in the datasets is private and is
308: subject to a privacy policy agreement. We therefore use all datasets
309: {\em available} to us.
310:
Each record contains information on the requested document (URL). A
typical URL looks like {\sf
protocol://web.site.name[:port]/path/to/document}. We treat the
substring between the `//' and the next `/' (omitting the `:port'
field if present\footnote{As a rule, requests with the `:port' field
make up about 2\% of all requests, probably because some Russian websites
use the port value for switching between various Cyrillic
encodings.}) as the website name. Only successful GET requests with
status code 200 are included in our analysis.
320:
321: We counted the number of requests for each website in the log for each
322: dataset. Those numbers divided by the total number of requests in the
323: dataset give us the {\em normalized rank distribution of websites by
324: popularity $f_r$}.
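
In symbols (the notation $n_r$, $M$, and $N$ is introduced here only for
illustration): if $n_1\ge n_2\ge\dots\ge n_N$ are the request counts of the
$N$ distinct websites of a dataset, sorted in decreasing order, and
$M=\sum_{k=1}^{N}n_k$ is the total number of requests, then
\[
f_r=\frac{n_r}{M},\qquad \sum_{r=1}^{N}f_r=1.
\]
For instance, in dataset {\em 1996} of Table~\ref{table-sets}
($M=155743$, $N=4360$), a website requested exactly once has
$f_r=1/155743\approx6.4\cdot10^{-6}$, which is the smallest nonzero value
the distribution can take for that dataset.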
325:
The equations were fitted and the parameters estimated by the nonlinear
least-squares method with Levenberg--Marquardt minimization.
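
For orientation, a generic statement of this procedure is sketched here.
The choice of the fitted values $y_r$ (e.g., $f_r$ or $\log f_r$) and of
the weights $w_r$ is not specified above, so both are left open as
assumptions. For a model $g(r;\theta)$ with a parameter vector $\theta$
(for example, the exponent $\alpha$ and a prefactor for
law~(\ref{Zipf-like})), one minimizes
\[
\chi^2(\theta)=\sum_r w_r\,[y_r-g(r;\theta)]^2 .
\]
In the Levenberg--Marquardt scheme (written for unit weights for
brevity), each iteration updates $\theta\rightarrow\theta+\delta$, where
$\delta$ solves
\[
\left[J^{T}J+\lambda\,\mathrm{diag}(J^{T}J)\right]\delta=J^{T}\,[y-g(\theta)],
\]
$J_{rk}=\partial g(r;\theta)/\partial\theta_k$ is the Jacobian, and the
damping parameter $\lambda>0$ is decreased after a successful step and
increased otherwise.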
328:
329: \section{Discussion}
330: \label{discussion}
331:
Normalized rank distributions (the fraction of requests to a given
website as a function of the corresponding rank) are presented on a
log-log scale in Figures~\ref{Fig1}, \ref{Fig2}, and~\ref{Fig3}.
Figure~\ref{Fig1} shows results for four datasets with the names {\em 1996}
(squares), {\em 1997} (circles), {\em 2000} (up triangles), and {\em 2001} (down
triangles) as defined in Table~\ref{table-sets}. All of them were
collected by the same proxy, {\em CHG}. Consulting
Table~\ref{table-sets}, we can conclude from Figure~\ref{Fig1} that
the rank distributions for all four datasets coincide well in the
``middle'' straight-line part of about two decades and that the larger
the sample size, the larger this middle region is. We can therefore
conclude that the rank distribution does not change qualitatively over
five years and that, as the sample grows, it comes closer and closer to
the ideal Zipf law.
346:
347: Our goal in Figure~\ref{Fig2} is to demonstrate how a rank
348: distribution depends on the period of observation. For that reason, we
349: plot four distributions obtained from the datasets {\em 2001-09-03}
350: (squares), {\em 2001-09-1w} (circles), {\em 2001-09} (up triangles), and
{\em 2001} (down triangles). Clearly, the distribution does not vary in time but
becomes ``flatter'' in the middle part as the period
(and hence the sample size) grows.
354:
Finally, Figure~\ref{Fig3} demonstrates that rank distributions with
nearly equivalent sample sizes are independent of the position of
the observer (i.e., cache server) in the Internet geography (at least
for the Russian academic networks). We plot seven datasets: {\em 2001}
(squares), {\em ikia-2001} (circles), {\em wc-2001} (up triangles),
{\em ikia-2002} (down triangles), {\em ras-2002} (diamonds),
{\em wc-2002} (left triangles), and {\em yar-2002} (right triangles).
Figure~\ref{Fig3} shows convincingly that
the rank distribution of websites is independent of the position
of the web cache in the hierarchy.
365:
Overall, it can be seen that rank distributions corresponding to
367: different data\-sets coincide well for the middle values of ranks.
368: Therefore, the fraction of user requests coming to ``mainstream''
369: websites (which are often encountered in logs but are still less
370: popular than top sites) is stable and does not vary with time
371: (Figure~\ref{Fig1}), with dataset size (Figure~\ref{Fig2}), or with
372: proxy location (Figure~\ref{Fig3}).
373:
One more common feature of all graphs is the divergence of the rank
distributions in the ``tails'', the rightmost parts of the graphs. The rank
distribution turns down strongly in the tails, where the websites were
requested fewer than about $100$ times.
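
To locate this threshold on the vertical axis of the plots: a website
requested $n$ times in a dataset with $M$ requests in total has
$f_r=n/M$ (the symbols $n$ and $M$ are ours). For dataset {\em 2001-09}
($M=7333162$, see Table~\ref{table-sets}), $n=100$ corresponds to
\[
f_r=\frac{100}{7333162}\approx1.4\cdot10^{-5}.
\]
If, as a rough assumption that is not part of the analysis above, the
counts fluctuated like Poisson variables, the relative uncertainty of
$f_r$ would be about $1/\sqrt{n}$, i.e., roughly $10\%$ at $n=100$ and
larger deeper in the tail.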
378:
379: There is an interesting peculiarity seen in Figure~\ref{Fig1}: the
380: fraction of requests coming to the most popular sites decreases with
381: time. For example, the frequency of occurrences of the most popular
382: website in 1996 was about an order of magnitude higher than in 2001.
383: Because the most frequent requests come to different kinds of banners,
384: counters, search engines, etc., Figure~\ref{Fig1} demonstrates that
385: their relative popularity diminishes with time. One possible reason is
386: the appearance of many different sites with similar contents (as well
387: as mirror sites) or functions (e.g., banner networks or search
engines), which spreads user interest more evenly among the hot
sites. Another reason is improvement of web-client software. The
internal cache of the web browser can contain more web documents;
requests to the most popular documents are then served from the
internal cache. This phenomenon is known as the ``trickle-down'' effect
observed by Doyle et al.~\cite{Doyle02}, which is discussed below.
394:
395: Figure~\ref{Fig2} demonstrates that the top sites have a stable
396: fraction of requests during a given year.
397:
Figures~\ref{Fig2} and~\ref{Fig3} show that Zipf-like law~(\ref{Zipf-like})
(which would appear as a straight line on a log-log plot) is a very coarse
approximation of the actual distribution. The main deviations from
law~(\ref{Zipf-like}) are in the region of the most popular (top 50)
sites and in the tail of the distribution.
403:
Fitting the data to the Zipf-like law~(\ref{Zipf-like}) and its
modifications~(\ref{Zipf-Mandelbrot}) and~(\ref{BestFit}) is
a tricky problem both because of the poor statistics at
large ranks and because of the strong fluctuations at the leading ranks.
Which method is best is not yet understood~\cite{Crovella99}. We use a
least-squares fit to estimate the parameters, calculate the uncertainty of
the estimated values by the standard approach, and give it in
parentheses as the uncertainty in the last digits.
412:
We can choose a region of ranks of two orders of magnitude where the
rank distribution looks like a straight line. But varying the
boundaries of the rank window strongly affects the fitted parameters
(e.g., the exponent $\alpha$). We obtained $\alpha$ in the range from
$0.7$ to $1.4$ depending on the rank window. For example, fitting
dataset {\em 2001-09} with Zipf-like law~(\ref{Zipf-like}) in the
window $10\le r\le 1000$ gives $\alpha=0.78$, and in the window $10^3\le
r\le 10^5$ it gives $\alpha=1.13$. We can therefore conclude
that the Zipf-like law cannot give us quantitative characteristics of
rank distributions of websites over the whole interval of ranks.
424:
425: Slightly better results can be derived using a modified Zipf-like law,
426: known as the {\em Zipf--Mandelbrot} law~\cite{Mandelbrot},
427:
428: \begin{equation}
429: \label{Zipf-Mandelbrot}
430: f_r=\frac{b}{(c+r)^{\alpha}},
431: \end{equation}
432:
433: \noindent which gives a better approximation in the range of small
434: ranks but is still inapplicable in the ``tails''. The fit can be
435: appreciably enhanced by introducing one more parameter
436: in~(\ref{Zipf-Mandelbrot}):
437:
438: \begin{equation}
439: f_r=a+\frac{b}{(c+r)^{\alpha}}.
440: \label{BestFit}
441: \end{equation}
442:
443: Figure~\ref{Fig5} shows the rank distribution of websites in the
444: coordinates $\log(f_r-a)$, $\log(c+r)$ for the particular dataset
445: {\em 2001-09}. The fraction of requests (the vertical axis) is shifted by
446: the value $a=-1.44\cdot 10^{-6}$ and the rank by $c=15.16$. This
447: figure clearly demonstrates that function~(\ref{BestFit}) approximates
448: the data distribution well in almost the entire range of
449: ranks.\footnote{We note that this method for data ``straightening'' is
450: often applied in statistical physics~\cite{Efros,Lev2}. A similar equation
451: was also proposed in a recent work on rank distribution of
452: publication popularity~\cite{Han2004}.} We have
453: fitted expression~(\ref{BestFit}) to all our data and found that the
454: value of $\alpha$ is quite stable; the results are presented in
455: Table~\ref{Alpha} for the datasets discussed. The columns in
456: Table~\ref{Alpha} are the dataset name as defined in
457: Table~\ref{table-sets} and resulting values of $a$, $c$, and $\alpha$
458: as defined in expression~(\ref{BestFit}). The mean of the exponent
$\alpha$ is $1.02 \pm 0.05$, which is consistent with $1.0$. The
statistical error is estimated from the spread of the $\alpha$ values
in Table~\ref{Alpha}.
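
The following short calculation, which is our illustration rather than
part of the fitting procedure, indicates why the shifted coordinates
straighten the data and why a pure power-law fit depends on the rank
window. Expression~(\ref{BestFit}) is equivalent to
\[
\log(f_r-a)=\log b-\alpha\log(c+r),
\]
a straight line with slope $-\alpha$ in the shifted coordinates. In the
unshifted log-log coordinates, treating $r$ as continuous, the local
slope is
\[
-\frac{d\ln f_r}{d\ln r}
=\alpha\,\frac{r}{c+r}\left(1-\frac{a}{f_r}\right).
\]
For $c>0$, the factor $r/(c+r)$ lowers the apparent exponent at small
ranks; for $a<0$, the factor $1-a/f_r$ raises it in the tail, where $f_r$
becomes comparable to $|a|$. With the values fitted for {\em 2001-09}
($\alpha=1.08$, $c=15.16$), the product $\alpha r/(c+r)$ is about $0.43$,
$0.94$, and $1.06$ at $r=10$, $100$, and $1000$, respectively.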
462:
463: \begin{table}
464: \centering
465: \caption{Fitting Results for Russian Servers}
466: \begin{tabular}{llrc} \\
467: \hline
468: Dataset & $a$ & $c$ & $\alpha$\\
469: \hline
470: \em{1996} & $-3.0(1)\cdot10^{-5}$ & $0.45(4)$ & $0.95(5)$\\
471: \em{1997} & $-5.77(2)\cdot10^{-6}$ & $2.96(5)$ & $0.92(3)$\\
472: \em{2000} & $-1.01(11)\cdot10^{-6}$ & $7.33(7)$ & $1.04(3)$\\
473: \em{2001} & $-2.48(3)\cdot10^{-7}$ & $9.10(5)$ & $1.06(2)$\\
474: \em{2001-09} & $-1.44(27)\cdot10^{-6}$ & $15.16(11)$ & $1.08(7)$\\
475: \em{2001-09-1w} & $-7.25(6)\cdot10^{-6}$ & $14.82(20)$ & $1.03(2)$\\
476: \em{2001-09-03} & $-2.01(7)\cdot10^{-5}$ & $17.82(72)$ & $0.99(6)$\\
477: \em{ikia-2001} & $-5.10(7)\cdot10^{-7}$ & $13.35(7)$ & $1.07(3)$\\
478: \em{ikia-2002} & $-1.58(9)\cdot10^{-6}$ & $4.53(16)$ & $1.01(1)$\\
479: \em{wc-2001} & $-5.56(9)\cdot10^{-7}$ & $14.54(9)$ & $1.09(4)$\\
480: \em{wc-2002} & $-4.43(7)\cdot10^{-7}$ & $14.02(5)$ & $1.06(3)$\\
481: \em{ras-2002} & $-9.45(2)\cdot10^{-7}$ & $9.17(10)$ & $0.95(5)$\\
482: \em{yar-2002} & $-1.30(3)\cdot10^{-6}$ & $4.64(4)$ & $0.99(5)$\\
483: \hline
484: \end{tabular}
485: \label{Alpha}
486: \end{table}
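
As an arithmetic check on the quoted mean (our calculation from the
thirteen values $\alpha_i$ listed in Table~\ref{Alpha}, using the sample
standard deviation $s$ as one possible measure of the spread):
\[
\bar\alpha=\frac{1}{13}\sum_{i=1}^{13}\alpha_i=\frac{13.24}{13}\approx1.02,
\qquad
s=\left[\frac{1}{12}\sum_{i=1}^{13}(\alpha_i-\bar\alpha)^2\right]^{1/2}
\approx0.055,
\]
in line with the quoted value $1.02\pm0.05$.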
487:
The parameter $a$ can be considered a correction for the finite sample
size: the larger the sample size, the smaller the magnitude of $a$.
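
One way to see the role of $a$ (an illustration under the assumption
$a<0$, which holds for all fits in Table~\ref{Alpha}; the symbol $r^*$ is
ours): expression~(\ref{BestFit}) then vanishes at a finite rank $r^*$,
\[
\frac{b}{(c+r^*)^{\alpha}}=|a|
\quad\Longrightarrow\quad
r^*=\left(\frac{b}{|a|}\right)^{1/\alpha}-c,
\]
so a negative $a$ bends the fitted curve downward in the tail, and the
smaller $|a|$ is, the farther to the right this downturn occurs. The
values of $b$ are not listed in Table~\ref{Alpha}, so $r^*$ is not
evaluated here.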
490:
491: The parameter $c$ in expression~(\ref{BestFit}) has a very clear
492: physical meaning. It is closely connected with the {\em trickle-down
effect} observed by Doyle et al.~\cite{Doyle02}, who found that proxies
disproportionately absorb requests on different levels of the
495: hierarchy. Rank distributions obtained from data collected on proxies
496: at different hierarchical levels differ in the region of small ranks.
497: This effect has a clear explanation in terms of rank distributions.
498:
499: As a clarifying example, we consider a two-layer hierarchy of proxies.
500: A first-level proxy receives requests from users. If the requested
501: document is found in its cache, then that document is returned to the
502: client; otherwise, the request is submitted to an upper-level proxy.
503: If we assume that a first-level proxy can hold $N$ documents in its
504: cache, then it accordingly filters the $N$ most popular documents from
505: the request stream, i.e., it ``cuts'' the leftmost $N$ points from the
506: rank distribution. This is equivalent to the change of variables
507: $r\rightarrow r+N$. Therefore, we presume that the parameter $c$ in
508: equation~(\ref{BestFit}) characterizes cache sizes of low-level
509: proxies (which can also be the user's browser cache).
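
This argument can be written out explicitly (a sketch under the
idealized assumptions that the lower-level cache absorbs exactly the
requests to the $N$ most popular items and forwards all others; the
constant $C$ and the primed symbols are ours). If the original request
stream follows $f_r=Cr^{-\alpha}$, the upper-level proxy sees the items
of original rank $r=N+1,N+2,\dots$ with new ranks $r'=r-N$, and the
fraction of forwarded requests going to the item of new rank $r'$ is
\[
f'_{r'}=\frac{f_{r'+N}}{1-\sum_{k=1}^{N}f_k}
=\frac{C'}{(N+r')^{\alpha}},
\qquad
C'=\frac{C}{1-\sum_{k=1}^{N}f_k},
\]
which has the Zipf--Mandelbrot form~(\ref{Zipf-Mandelbrot}) with $c=N$
and $b=C'$.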
510:
511: It can be seen that for all datasets, $\alpha$ is close to unity with
512: an accuracy of a few percent. We therefore suppose that the exponent
513: $\alpha$ in equation~(\ref{BestFit}) is a universal characteristic of
514: web traffic, which is independent of time (for time-scales comparable
515: with the Internet lifetime), is independent of data collection
516: duration (when the sample size is sufficiently large and contains more
than $2{\times}10^5$ requests), and is independent of the position
of the proxy server in the Internet hierarchy.
519:
We were able to check our findings against other available
521: statistics. We chose BU web-client traces available from {\sf
522: ita.ee.lbl.gov} (the full dataset from Nov 94 to May 95 contains
523: 1143842 requests, 104532 unique URLs, and 4970 unique sites). This
524: dataset was used in early work and gives one of the best examples of
525: the Zipf law for web-page popularity ($\alpha=0.986$)~\cite{Cunha95}.
526: Fitting equation (\ref{BestFit}) to the rank distribution of website
527: popularity gives $\alpha=1.025$, $a=-3.3\cdot 10^{-5}$, and $c=1.97$,
528: which coincide well with the values obtained for Russian academic
529: networks. This is an additional argument that website popularity
530: distribution is universal (in other words, is independent of both the
531: observation point in the Internet and Internet history) and
532: follows the Zipf law with an exponent $\alpha$ close to unity.
533:
534: \begin{table}
535: \centering
536: \caption{Characteristics of Analyzed Web Datasets in USA and Fitting
537: Results}
538: \begin{tabular}{lrrrrr}\\
539: \hline
540: cache & \# of & $N=$\# of & $aN$ & $c$ & $\alpha$ \\
541: &requests &websites & & & \\ \hline
542: {\em bo}&23935604&592679&-2.89(1)&8.54(4)&1.05(2) \\
543: {\em ny}&12789266&407952&-3.89(1)&-0.12(1)&0.94(3) \\
544: {\em pa}&3374392&229633&-1.57(1)&7.17(12)&0.96(8) \\
545: {\em pb}&10018478&304049&-4.47(1)&18.96(13)&0.98(4) \\
546: {\em rtp}&13221655&339918&-4.35(1)&23.52(13)&1.01(4) \\
547: {\em sd}&13840665&285356&-3.22(1)&0.166(7)&1.04(3) \\
548: {\em sj}&26130582&264396&-6.00(1)&1.935(13)&1.09(2) \\
549: {\em sv}&11119941&530731&-3.20(1)&16.34(13)&0.93(4) \\
550: {\em uc}&13294408&313178&-5.17(1)&15.14(9)&1.01(4) \\ \hline
551: {\em uc-12d}&3236853&84360&-4.37(2)&7.79(12)&0.95(8) \\
552: {\em uc-1d}&463899&13752&-1.77(4)&4.99(24)&0.96(3) \\ \hline
553: {\em all}&127724991&1176623&-8.96(1)&5.05(1)&1.03(2) \\ \hline
554: \label{table-setUS}
555: \end{tabular}
556: \end{table}
557:
558:
To check this statement further, we also analyzed recently available
data\footnote{Thanks to D. Wessels, who kindly gave us access to the
data sets collected at the US IRCache servers.} collected during the period
from 3~November to 29~December 2004 at nine cache servers of the US national
cache-mesh system for science and education built within the IRCache
project~\cite{ircache}. Table~\ref{table-setUS} presents data from the
following locations:
566:
567: \begin{itemize}
568: \item {\em bo} -- NCAR at Boulder, Colorado
569: \item {\em ny} -- New York, New York
570: \item {\em pa} -- Digital Internet Exchange in Palo Alto, California
571: \item {\em pb} -- PSC at Pittsburgh, Pennsylvania
572: \item {\em rtp} -- Research Triangle Park, North Carolina
573: \item {\em sd} -- SDSC at San Diego, California
574: \item {\em sj} -- MAE West Exchange Point in San Jose, California
575: \item {\em sv} -- NASA-Ames/FIX-West in Silicon Valley, California
576: \item {\em uc} -- NCSA at Urbana-Champaign, Illinois.
577: \end{itemize}
578:
\noindent The second and third entries from the bottom demonstrate the
stability of the fit for two subsets of the data collected at the {\em
uc} location, for 12 days (set name {\em uc-12d}) and for 1 day (set {\em
uc-1d}). The last entry represents the fit to the sum of the nine
single-cache data sets. The values of $\alpha$ obtained by fitting
expression~(\ref{BestFit}) are close to
unity and quite similar to those for Russian servers presented in
Table~\ref{Alpha}.
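
For comparison with Table~\ref{Alpha} (our arithmetic from the entries
of Table~\ref{table-setUS}, with the sample standard deviation $s$ as one
possible measure of the spread), the nine single-cache exponents give
\[
\bar\alpha=\frac{9.01}{9}\approx1.00,\qquad s\approx0.05 .
\]
Note that Table~\ref{table-setUS} lists the product $aN$ rather than $a$;
dividing by the number of websites gives, for example,
$a=-2.89/592679\approx-4.9\cdot10^{-6}$ for {\em bo}.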
586:
587:
588: \section{Conclusions}
589: \label{conclusions}
590:
We have presented a modified Zipf law~(\ref{BestFit}), which fits the rank
distribution of websites rather well over the full range of ranks. We found
that the value of the exponent $\alpha$ in expression~(\ref{BestFit}) is
stable for the analyzed datasets. It does not vary with (1) the year of data
collection, (2) the period of data collection, or (3) the geographical
location of the cache server where we collected data. We found that
$\alpha$ is very close to $1$. We have reasons to suppose this value of
$\alpha$ is a universal property of web traffic for the website rank. We
have also presented a clear explanation of the ``trickle-down effect''
based on the properties of our modified Zipf law. We suggest that website
popularity is a universal property of the Internet and follows the Zipf law.
602:
In a related experiment~\cite{KS-tri}, fluctuations of the exponent value
were studied as a function of the sample size, using cache traces of user
requests to different Internet domains. User requests were sent to the
Internet through a cache triangle: they went to the Master Server, which
sent each odd request to the left cache and each even request to the right
cache. Clearly, the two traces should be nearly equivalent in the limit of
a large number of requests. Indeed, the exponents extracted separately
from the ``left'' and ``right'' traces agreed to within five per cent for
sets larger than ten thousand requests, whereas those for sets smaller
than a few hundred requests fluctuated strongly. Thus, insufficient
statistics may significantly affect the results.
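As a rough illustration of this point (a sketch under simplifying
assumptions, not a result of~\cite{KS-tri}), suppose that a given site
receives $n$ requests in total and that each of them falls on an odd or an
even position with equal probability, independently of the others. The
symbols $n$, $n_L$ and $n_R$ are introduced here only for this estimate;
$n_L$ and $n_R$ denote the numbers of requests to that site seen by the
left and right caches, with $n_L + n_R = n$. Then $n_L$ is binomially
distributed and
\[
\langle n_L \rangle = \langle n_R \rangle = \frac{n}{2}, \qquad
\mathrm{Var}(n_L - n_R) = 4\,\mathrm{Var}(n_L) = n, \qquad
\frac{\sqrt{\mathrm{Var}(n_L - n_R)}}{n} = \frac{1}{\sqrt{n}} .
\]
Under these assumptions, a site with $n = 100$ requests shows a typical
relative imbalance between the two traces of about 10\%, whereas a site
with $n = 4$ shows about 50\%. The estimate concerns request counts only;
it does not by itself give the fluctuation of the fitted exponent.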
615:
616: The results in this paper may be useful for building mirror sites and
617: CDNs as well as for improving software for DNS request caching. We
618: also conjecture that fitting with the modified Zipf law is suitable
619: for describing the rank distribution of web-document popularity.
620:
621: \begin{figure}
622: \centering
623: \psfig{file=fig1.eps,width=\columnwidth}
624: \caption{Website distribution for different years}
625: \label{Fig1}
626: \end{figure}
627:
628: \begin{figure}
629: \centering
630: \psfig{file=fig2.eps,width=\columnwidth}
631: \caption{Website distribution for different periods}
632: \label{Fig2}
633: \end{figure}
634:
635: \begin{figure}
636: \centering
637: \psfig{file=fig3.eps,width=\columnwidth}
638: \caption{Website distribution for different servers}
639: \label{Fig3}
640: \end{figure}
641:
642: \begin{figure}
643: \centering
644: \psfig{file=fig5c.eps,width=\columnwidth}
\caption{Website distribution in modified coordinates: dependence
of $f_r-a$ on $r+c$ (compare with expression~(5)) on a double logarithmic
scale.}
648: \label{Fig5}
649: \end{figure}
650:
651: \section{Acknowledgment}
652:
653: The authors thank the anonymous referees for the valuable remarks and
654: comments that allowed us to improve this paper.
655: Special thanks to Duane Wessels for access to logs from IRCache web-cache servers.
656:
657: This work was supported by the Russian Foundation for Basic Research.
658:
659:
660: \begin{thebibliography}{99}
661:
662: \bibitem{Glassman94} Steven Glassman, {\it A caching Relay for the World Wide
663: Web.} Proc. 1st Int. Conference on the World-Wide Web, CERN,
664: Geneva (Switzerland), May 1994. Computer Networks and ISDN Systems, 27(2), 165-173 (1994).
665:
666: \bibitem{Breslau99} Lee Breslau, Pei Cao, Li Fan, G. Phillips, S.
667: Shenker, {\it Web Caching and Zipf-like Distributions: Evidence and
668: Possible Implications}, Proc. IEEE INFOCOM '99: 18th Annual Joint
669: Conference of the IEEE Computer and Communications Societies, Volume: 1,
670: p.~126-134, 1999.
671:
672: \bibitem{Kelly02} Terence Kelly, Jeffrey Mogul. {\it Aliasing on the
673: World Wide Web: Prevalence and performance implications}. Proc. 11th Int.
WWW Conf., Honolulu, May 2002, ACM Press, pp.~281-292.
675:
676: \bibitem{Doyle02} Ronald P. Doyle, Jeffrey S. Chase, Syam Gadde,
677: Amin M. Vahdat. {\it The trickle-down effect: Web caching and
678: server request distribution}. Computer Communications, {\bf 25},
679: 345-356 (2002).
680:
\bibitem{Pareto} V. Pareto, {\it Cours d'\'economie politique}, Rouge,
682: Lausanne et Paris, 1897.
683:
684: \bibitem{Zipf49} G. K. Zipf, {\it Human Behavior and the Principle of
685: Least-Effort.} Addison-Wesley, Cambridge, MA, 1949.
686:
687: \bibitem{Li92} W. Li, {\it Random texts exhibit Zipf's-law-like word
688: frequency distribution}. IEEE Trans. Inform. Theory, {\bf 38}(6),
689: 1842-1845 (1992).
690:
691: \bibitem{Mantegna} R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S.
692: Havlin, C. K. Peng, M. Simons, and H. E. Stanley. {\it Linguistic Features
693: of Noncoding DNA Sequences}, Phys. Rev. Lett. {\bf 73}, 3169-3172 (1994).
694:
695: \bibitem{Martindale} C. Martindale, A. K. Konopka, {\it Oligonucleotide
696: frequencies in DNA follow a Yule distribution}, Computers \& Chemistry,
697: {\bf 20}, 35-38 (1996).
698:
699: \bibitem{Mandelbrot} B. B. Mandelbrot, {\it The Fractal Geometry of
700: Nature}. Freeman, New York, 1977.
701:
\bibitem{Efros} B.I. Shklovskii, A.L. Efros, {\it Electronic Properties of
Doped Semiconductors}, Percolation Theory (Springer-Verlag, Berlin,
1984), pp. 95-136.
706:
707: \bibitem{Lev2} L.N. Shchur, {\it Incipient Spanning Clusters in Square and
708: Cubic Percolation}, in Springer Proceedings in Physics, ``Computer
709: Simulation Studies in Condensed Matter Physics XII'', Eds. D.P. Landau,
S.P. Lewis, and H.B. Sch\"uttler (Springer-Verlag, Berlin, 2000).
711:
712: \bibitem{Han2004} Ding-Ding Han, Jin-Gao Liu, Yu-Gang Ma, Xiang-Zhou Cai,
713: Wen-Qing Shen. {\it Scale-free download network for publications}.
714: Chin. Phys. Lett., {\bf 21}(9), 1855-1857 (2004); arXiv:cond-mat/0405428
715: (2004).
716:
\bibitem{Zanette97} Dami\'an H. Zanette, Susanna C. Manrubia, {\it Role
718: of intermittence in urban development: A model of large-scale city
719: formation}. Phys. Rev. Lett., {\bf 79}(3), 523-526 (1997).
720:
721: \bibitem{Marsili98} Matteo Marsili, Yi-Cheng Zhang, {\it Interacting
722: individuals leading to Zipf's law}. Phys. Rev. Lett., {\bf 80}(12),
723: 2741-2744 (1998).
724:
725: \bibitem{Redner98} S. Redner. {\it How popular is your paper? An empirical
726: study of the citation distribution}. Eur. Phys. J. B {\bf 4}, 131-134
727: (1998).
728:
729: \bibitem{Malamud} B. D. Malamud, G. Morein, D. L. Turcotte. {\it Forest
730: fires: an example of self-organized critical behavior}. Science, {\bf
731: 281}, 1840-1842 (1998).
732:
733: \bibitem{Newman2004} M.E.J. Newman. {\it Power laws, Pareto distributions
and Zipf's law}. arXiv:cond-mat/0412004 (2004).
735:
736: \bibitem{Mitzenmacher2004} Michael Mitzenmacher. {\it A brief history of
737: generative models for power law and lognormal distributions}.
738: Internet Mathematics, {\bf 1}(2), 226-251 (2004).
739:
740: \bibitem{Laherrere1998} J. Laherr\`ere, D. Sornette. {\it Stretched
741: exponential distributions in nature and economy: ``fat tails'' with
742: characteristic scales}. Eur. Phys. J., {\bf B 2}, 525-539 (1998).
743:
744: \bibitem{website} Azer Bestavros. {\it WWW traffic reduction and load
745: balancing through server-based caching}. IEEE Concurrency, {\bf 5}(1),
746: 56-67 (1997);
Takashi Hatashima, Toshihiro Motoda, Shuichiro Yamamoto. {\it An
748: ``interest'' index for WWW servers and CyberRanking}. IEICE Trans. Inf.
749: \& Syst., {\bf E83-D}, 729-734 (2000);
750: Venkata N. Padmanabhan, Lili Qiu. {\it The content and access dynamics
751: of a busy web site: Findings and implications}. Proc. ACM SIGCOMM'00,
752: Stockholm, Sweden, 2000, pp. 111-123;
753: Adeniyi Oke, Rick Bunt. {\it Hierarchical workload characterization for
754: a busy web server}. In: Computer Performance Evaluation (Ed. T. Field,
755: P.G. Harrison, J. Bradley, U. Harder). Springer-Verlag: Berlin ea, 2002,
756: pp.309-328. Proc. TOOLS'2002: 12th Int. Conf. on Modeling Techniques and
757: Tools, London, UK, April 14-17 2002. [Lecture Notes in Computer Science,
758: Vol. 2324].
759:
\bibitem{Aida98} Masaki Aida, Noriyuki Takahashi, Tetsuya Abe.
761: {\it A proposal of dual Zipfian model for describing HTTP access trends
762: and its application to address cache design}. IEICE Trans. Commun.,
763: {\bf E81-B} (7), 1475-1485 (1998).
764:
765: \bibitem{Cunha95} C. R. Cunha, A. Bestavros, M. E. Crovella, {\it
766: Characteristics of WWW Client-based Traces}, Technical report
767: BU-CS-95-010, Boston University, July, 1995.
768:
769: \bibitem{Barford99} P. Barford, A. Bestavros, A. Bradley, M. Crovella, {\it
770: Changes in Web client access patterns: Characteristics and caching implications}.
771: World Wide Web J., Spec. Issue on Characterization and
772: Performance Evaluation, {\bf 2}, 15-28 (1999).
773:
774: \bibitem{Jin2000} Shudong Jin and Azer Bestavros, {\it Sources and
775: Characteristics of Web Temporal Locality}. Proc. MASCOTS'2000: The 8th
776: IEEE/ACM International Symposium on Modeling, Analysis and Simulation of
777: Computer and Telecommunication Systems, San Francisco, CA, 29 Aug - 1 Sept
2000, pp.~28-35.
779:
780: \bibitem{Mahanti99} A. Mahanti, C. Williamson. {\it Web proxy workload
781: characterization}. Tech. Report, Department of Computer Science,
782: University of Saskatchewan, February 1999.
783:
784: \bibitem{Crovella99} M.E. Crovella and M.S. Taqqu, {\it Estimating the
785: Heavy Tail Index from Scaling Properties}, In: Methodology and Computing in
786: Applied Probability, {\bf 1}, 55-79 (1999).
787:
788: \bibitem{Mahanti00} A. Mahanti, C. Williamson, D. Eager, {\it Traffic
789: analysis of a Web proxy caching hierarchy}. IEEE Network Magazine, {\bf
790: 14}(3), 16-23 (May/Jun 2000).
791:
792: \bibitem{Roadknight99} Chris Roadknight, Ian Marshall, and Deborah Vearer.
793: {\it File Popularity Characterization}. Proc. WISP'99: 2nd Workshop on
794: Internet Server Performance, Atlanta, Georgia, May 1999.
795:
796: \bibitem{Almeida96} V. Almeida, A. Bestavros, M. E. Crovella,
797: A. de Oliveira, {\it Characterizing Reference Locality in the
798: WWW}, Proc. PDIS'96, Dec. 1996, p.~92-103.
799:
800: \bibitem{Nishikawa98} N. Nishikawa, T. Hosokawa, Y. Mori, K. Yoshida, H.
801: Tsuji. {\em Memory-based architecture for distributed WWW caching proxy}.
802: Computer Networks and ISDN Systems, {\bf 30}, 205-214 (1998).
803:
804: \bibitem{sakr98} Serge Krashakov, Lev Shchur. {\it WWW Caching in
Russia - Current State and Future Development}. Proc. 3rd Int. Web Caching
806: Workshop, Manchester, June 15-17, 1998.
807:
808: \bibitem{ircache} The IRCache project - {\tt http://www.ircache.net}.
809:
810: \bibitem{KS-tri} Sergey A. Krashakov, Lev N. Shchur. {\em Active measurements
(experiments) of the Internet traffic using cache-mesh}. Int. J. Modern
812: Physics C, {\bf 12}, 549-562 (2001).
813:
814: \end{thebibliography}
815:
816: \end{document}
817: