1: \documentclass{elsart}
2: \usepackage{epsfig}
3: \usepackage{amssymb}
4: 
5: \begin{document}
6: 
7: \begin{frontmatter}
8: 
9: \title{On the universality of rank distributions of website popularity}
10: \author{Serge A. Krashakov\corauthref{cor}},
11: \corauth[cor]{Corresponding author.}
12: \ead{sakr@itp.ac.ru}
13: \author{Anton B. Teslyuk\thanksref{now}},
14: \thanks[now]{Present address: Institute of Information Science, 
15: RRC Kurchatov Institute, 1~Kurchatov sq., Moscow, 123182, Russia}
16: \author{Lev N. Shchur}
17: 
18: \address{Landau Institute for Theoretical Physics, Chernogolovka, 142432 Russia}
19: 
20: \begin{abstract}
21: 
22: We present an extensive analysis of long-term statistics of the queries to 
23: websites using logs collected on several web caches in Russian academic 
24: networks and on US IRCache caches.  We check the sensitivity of the statistics 
25: to several parameters: (1)~duration of data collection, (2) geographical 
26: location of the cache server collecting data, and (3) the year of data 
27: collection. We propose a two-parameter modification of the Zipf law and 
28: interpret the parameters. We find that the rank distribution of websites 
29: is stable when approximated by the modified Zipf law. We suggest 
30: that website popularity may be a universal property of the Internet.
31: 
32: \end{abstract}
33: 
34: \begin{keyword} 
35: Internet, Web traffic, Rank Distribution, Zipf Law
36: \PACS 89.20.Hh World Wide Web, Internet - 89.75.Da Systems obeying scaling laws
37: \end{keyword}
38: \end{frontmatter}
39: 
40: \section{Introduction}
41: \label{intro}
42: 
43: It has been known for a decade that web-document popularity follows the 
44: Zipf law~\cite{Glassman94}. Nevertheless, the exponent values reported by 
45: different authors vary significantly, from 0.60 to 
46: 1.03~\cite{Glassman94,Breslau99,Kelly02,Doyle02} (see 
47: Table~\ref{Datasets}). We believe that the scatter of the reported 
48: values is due to the small sample size in some cases and to the details of 
49: the fitting procedure used to extract the exponent. 
50: 
51: In this paper, we propose that the rank distribution of the websites
52: follows the Zipf law and give arguments supporting our idea. We must
53: note that website statistics are more extensive than web-document
54: statistics, and the distribution parameters can be obtained with
55: higher accuracy.
56: 
57: We address the following questions: Is the rank distribution of
58: websites Zipf-like?  If yes, what are the conditions under which the
59: ``true'' exponent can be obtained?  Does the exponent depend on the
60: duration of the observation? Or on the geographical position of the
61: observer? And does the exponent vary with time, as the Internet
62: develops?
63: 
64: We report some answers to these questions.  We have studied website 
65: statistics, which are indeed more stable than web-document statistics. 
66: We have analyzed log files accumulated on cache servers of Russian 
67: academic networks (FREEnet, RASnet, and RSSI) for about six years. These 
68: networks differ by their connectivity topology and bandwidth, both 
69: national and international.  These cache servers have different 
70: geographical locations (Moscow, Moscow region, and Yaroslavl in Russia). 
71: In addition, we analyzed some statistics collected during seven weeks in 
72: the fall of 2004 at a number of IRCache servers in the United States (see 
73: Table~\ref{table-setUS}).
74: 
75: We found that the statistics studied become stable\footnote{The
76: accuracy of the exponent becomes a few percent, e.g., 5\%.} when the
77: number of queries for the given statistics exceeds $10^5$. It is
78: therefore meaningful to fit only those data for which the number of
79: queries exceeds this value. This simple criterion can be used to
80: estimate the critical window for the rank interval where the
81: distribution is stable and the power law can be observed.
82: 
83: We found that the statistics are independent of the geographical
84: location of the cache server (observer) collecting the data, at least for 
85: the analyzed data sets.
86: 
87: We found that the distribution is independent of the year of data
88: collection and is therefore stable over the history and development
89: of the Internet.
90: 
91: Nevertheless, we found that the Zipf-like law approximation is
92: suitable only in a middle region spanning several orders of magnitude
93: in rank. We propose a modification of the Zipf-like law with two
94: additional parameters and explain their possible meaning. We found that
95: if we fit the equation of the modified law to the data, the website
96: popularity distribution becomes quite stable. The value of the
97: exponent $\alpha$ is $1.02\pm0.05$ for all datasets studied in this
98: paper. We may thus suggest that website popularity follows the Zipf law.
99: 
100: We verified that the same modification also works perfectly for the
101: web-document ranked distribution.
102: 
103: The paper is organized as follows. In section~\ref{nature}, we present
104: a brief history of the power laws observed in nature and society. We
105: describe the data collection and processing in section~\ref{datasets}.
106: We discuss the results in section~\ref{discussion} and present our
107: conclusions in section~\ref{conclusions}.
108: 
109: \section{Power laws in nature and society}
110: \label{nature}
111: 
112: More than 100 years ago, Pareto~\cite{Pareto} observed that the income
113: distribution $f$ in all countries can be described by the relation
114: 
115: \begin{equation}
116: \label{Pareto}
117: F(f)=1-(m/f)^{\alpha},
118: \end{equation}
119: 
120: \noindent where the exponent $\alpha\simeq1.5$ and $m$ is some 
121: constant.  About 70 years ago, George Zipf~\cite{Zipf49} discovered a 
122: striking regularity in English texts: the relative occurrence 
123: frequency $f$ of the $r$th most popular word is inversely 
124: proportional to the rank $r$:
125: 
126: \begin{equation}
127: \label{Zipf-law}
128: f_r\sim\frac1r.
129: \end{equation}
130: 
131: A more general form of Zipf law~(\ref{Zipf-law}) with the exponent
132: $\alpha \ne 1$ is often encountered in the literature and is known as
133: a {\em Zipf-like law}:
134: 
135: \begin{equation}
136: \label{Zipf-like}
137: f_r\sim\frac{1}{r^\alpha}.
138: \end{equation}
139: 
140: A Zipf-like law has been found in many areas of human activity and in
141: nature. Examples include the distribution of words in random
142: texts~\cite{Li92}, of nucleotide ``words'' in 
143: DNA~\cite{Mantegna,Martindale}, of bit sequences in UNIX executable
144: files~\cite{Mantegna}, of book popularities in
145: libraries~\cite{Zipf49,Mandelbrot}, of countries' areas and population
146: sizes~\cite{Zipf49,Zanette97,Marsili98}, of scientific publication
147: citation indices~\cite{Redner98}, and of forest-fire areas~\cite{Malamud}.
148: Many other examples can be found in recent
149: reviews~\cite{Newman2004,Mitzenmacher2004}.
150: 
151: Meanwhile, there is much debate over whether a lognormal or a power law is 
152: a better fit for some empirical distributions, for example, income 
153: distribution, population fluctuations, file size distribution, and some 
154: others (for a short review, see~\cite{Mitzenmacher2004}). In many cases 
155: a lognormal distribution looks like a power-law distribution over several 
156: orders of magnitude~\cite{Mitzenmacher2004,Laherrere1998}. We leave this 
157: question open and analyze our data using a Zipf-like law.
158: 
159: \begin{table}
160: \caption{Characteristics of Published Web Datasets}
161: \begin{tabular}{lcccll}\\ \hline
162: Dataset & Date & \# of & \# of &$\alpha$&Ref.\\
163: 	& (Period) & requests & pages	& & \\
164: \hline
165: DEC &   1994    & $\sim$ 100k &     & 1 & ~\cite{Glassman94}\\
166: BU  &   Jan95(42d) &  575775  & 54438 & 0.99 &~\cite{Cunha95}\\
167: BU  &   1998    &   66988   & 41049   & 0.65 &~\cite{Barford99}\\
168: DEC &   Jul96(6d)&    3543968 & 1354996 & 0.77 & ~\cite{Jin2000}\\
169: NLANR.RTP & Jun99(13d)&    9113027 & 3249549 & 0.71 & ~\cite{Jin2000}\\
170: NLANR.SD &  Jun99(13d)&    9082461 & 3549609 & 0.72 & ~\cite{Jin2000}\\
171: NLANR.UC &  Jun99(13d)&    8983585 & 2459366 & 0.66 & ~\cite{Jin2000}\\
172: USASK   & Oct98(82d) &    20754720 & 5527667 & 0.76 & ~\cite{Mahanti99}\\
173: CANARIE & Dec98(26d) &   35129680  &  1423081 & 0.63  & ~\cite{Mahanti99}\\
174: NLANR.UC & Dec98(31d) &   20018680 & 7681214 & 0.65 & ~\cite{Mahanti99}\\
175: USASK   & Feb99(45d) &    21070330 & 5510561 & 0.84 & ~\cite{Mahanti00}\\
176: CANARIE & Feb99(45d) &    7310038 & 4571539 & 0.77 & ~\cite{Mahanti00}\\
177: NLANR.UC & Feb99(30d) &    24560611 & 8482661 & 0.74 & ~\cite{Mahanti00}\\
178: NLANR.LJ &  1998    & $\sim$ 500k    &   & 0.64 & ~\cite{Roadknight99}\\
179: UPisa    &  1998    & $\sim$ 500k    &   & 0.91 & ~\cite{Roadknight99}\\
180: FUNET    &  1998    & $\sim$ 500k    &   & 0.70 & ~\cite{Roadknight99}\\
181: SPAIN    &  1998    & $\sim$ 500k    &   & 0.72 & ~\cite{Roadknight99}\\
182: RMPLC    &  1998    & $\sim$ 500k    &   & 0.86 & ~\cite{Roadknight99}\\
183: BU-CS   & Oct95(14d) & 80518 & 4471  & 0.85 & ~\cite{Almeida96}\\
184: Hitachi &   1997(16d) &  2000000   &   & 0.75  & ~\cite{Nishikawa98}\\
185: DEC & Aug96(7d) & 3543968   &   & 0.77  & ~\cite{Breslau99}\\
186: UCB & Nov96(18d) & 1907762   & & 0.78  &~\cite{Breslau99}\\
187: UPisa   &   (3m)       & 2833624   & & 0.83  &~\cite{Breslau99}\\
188: Questnet & Jan98(7d) & 2885285  & & 0.69  &~\cite{Breslau99}\\
189: NLANR   & Dec97(1d)  & 1766409   & & 0.73  &~\cite{Breslau99}\\
190: FUNET   & Jun98(10d)    & 4815551 & & 0.64  &~\cite{Breslau99}\\
191: HGMP    & Jan98(7m)  & $\sim$ 750k & & 0.60& ~\cite{Breslau99}\\
192: WebTV   & Sep00(16d) & 347460865&  32541361 & 1.03 & ~\cite{Kelly02}\\
193: \hline
194: \label{Datasets}
195: \end{tabular}
196: \end{table}
197: 
198: It is widely assumed that web document popularity follows a Zipf-like law.
199: We summarize all published results in
200: Table~\ref{Datasets} with the dataset name, the date and period of log
201: files in days (d) or months (m), the number of requests, the number of
202: unique web pages requested, and the reported value of the exponent
203: $\alpha$.\footnote{Some papers do not provide all the information
204: (e.g., the number of unique pages) for the datasets studied.} It can
205: be seen that exponent values vary from $0.60$ to $1.03$.\footnote{Here
206: we consider document popularity observed at the client (BU dataset) or
207: proxy side only. Values of the exponent $\alpha$ observed at the
208: web-server side vary from $0.67$ to $1.82$~\cite{website}.} A
209: question arises: {\em Why is the variation of the exponent so large?}
210: Probably, the sample size is important, and the Zipf-like law fits
211: at most two decades of ranks well. It is quite inapplicable in
212: the ``tails'' and at small ranks, and the results are sensitive to the
213: choice of the rank window for fitting the data.
214: 
215: We know of only two papers where the website popularity issue was 
216: addressed. In~\cite{Aida98}, the authors claim that the destination 
217: address of web requests can be characterized by two types of Zipf laws. 
218: In~\cite{Breslau99}, the authors presented results for three sets of 
219: user request traces (shown in Fig.~5 of~\cite{Breslau99}, which 
220: is similar to our Figs.~\ref{Fig1} and \ref{Fig2}). In 
221: particular, the UCB trace in their Fig.~5 looks similar to the set 
222: {\em 2001-09-03} shown in our Fig.~\ref{Fig2}, and it is hardly possible to 
223: extract any value of the exponent $\alpha$ from a fit to the Zipf-like 
224: law~(\ref{Zipf-like}). To our knowledge, the authors did not publish 
225: the announced preprint with the values of the exponent $\alpha$.
226: 
227: \section{Datasets and methods}
228: \label{datasets}
229: 
230: \begin{table}
231: \centering
232: \caption{Characteristics of Analyzed Web Datasets in Russia}
233: \begin{tabular}{llcccc}\\
234: \hline
235: Dataset & Proxy & Starting & Period & \# of  &  \# of \\
236:         &       & date  &       & requests & websites \\
237: \hline
238: \em{1996}   & CHG  & Sep 1996   & 74d & 155743    & 4360 \\
239: \em{1997}   & CHG  & Jan 1997    & 1y & 2642722   & 44881 \\
240: \em{2000}   & CHG  & Sep 2000    & 3m & 27130648  & 146693 \\
241: \em{2001}   & CHG  & Feb 2001    & 8m     & 64577294  & 269868 \\
242: \em{ikia-2001}  & IKIA & Jul 2001   & 4m  & 29296632  & 177497 \\
243: \em{ikia-2002}  & IKIA & May 2002   & 1m  & 2067205  & 53747 \\
244: \em{wc-2001}    & FREEnet & Jan 2001   & 4.5m    & 16989853  & 152760 \\
245: \em{wc-2002}    & FREEnet & Feb 2002   & 5m  & 26576501  & 239891 \\
246: \em{yar-2002}   & Yars & Apr 2002   & 1m & 9639987 & 86611 \\
247: \em{ras-2002}   & RASnet & Feb 2002 & 5m & 9240289 & 227686 \\
248: \hline
249: \em{2001-09}    & CHG  & Sep 2001    & 1m   & 7333162   & 68671 \\
250: \em{2001-09-1w} & CHG  & Sep 2001    & 1w & 1382537   & 24103 \\
251: \em{2001-09-03} & CHG  & Sep 2001    & 1d  & 273361    & 7854 \\
252: \hline
253: \label{table-sets}
254: \end{tabular}
255: \end{table}
256: 
257: We start our analysis with the data collected on several proxies (cache 
258: servers) located in different Russian academic networks and in the next 
259: section will compare the results with the analysis of data collected in 
260: the fall of 2004 on American IRCache servers. Collections of data from 
261: Russian servers are presented in Table~\ref{table-sets} with the dataset 
262: name, proxy server location, starting date of log files, period of log 
263: file in days (d), weeks (w), months (m), or years (y), number of requests, 
264: and number of unique websites requested. The following abbreviations are 
265: used for proxies: {\em CHG} for the proxy located in the Chernogolovka 
266: network (AS9113), Chernogolovka, Moscow region, Russia;  {\em IKIA} for 
267: the proxy in Space Research Institute RAS (AS3218), Moscow, Russia;  {\em 
268: FREEnet} for the proxy in FREEnet (AS2895), Moscow, Russia; {\em RASnet} 
269: for the proxy located in RASnet (AS3058), Moscow, Russia; and {\em Yars} 
270: for the proxy located in Yaroslavl State University (AS8325), Yaroslavl, 
271: Russia. Proxy-servers {\em CHG} and {\em Yars} are typical regional cache 
272: servers serving requests from local users. The other servers, located in 
273: Moscow, form the central part of the Russian web-caching hierarchy~\cite{sakr98} and 
274: serve requests from local users as well as from other (e.g., regional) 
275: cache servers.
276: 
277: \begin{figure}
278: \centering
279: %\psfig{file=scheme.eps,width=\columnwidth}
280: \psfig{file=scheme.eps,width=60mm}
281: \caption{Sketch of the data collection}
282: \label{scheme}
283: \end{figure}
284: 
285: \begin{figure}
286: \centering
287: \psfig{file=caches.eps,width=90mm}
288: \caption{Hierarchy of the cache-server network.}
289: \label{caches}
290: \end{figure}
291: 
292: All proxy-servers run Squid caching software. Figure~\ref{scheme}
293: sketches the process of data collection:  user queries go to the cache
294: server, which processes user queries to the web servers and keeps
295: traces of user requests as records in log files. We therefore call the
296: cache servers ``observers'' to stress the possible importance of their
297: placement in the Internet. Cache servers in Russian academic networks 
298: are organized in the hierarchy sketched in Figure~\ref{caches}. User queries go 
299: through the local proxy servers to regional cache servers, which may 
300: redistribute them to the servers on national research and educational 
301: networks, which may send queries to the neighboring caches or directly to 
302: the destination. Some queries may also be sent to IRCache servers.
303: We note that the cache server network is a 
304: logical, programmable one; it does not reflect Internet connectivity but
305: is rather some subgraph of the Internet.
306: 
307: We must note here that information in the datasets is private and is
308: subject to a privacy policy agreement. We therefore use all datasets
309: {\em available} to us.
310: 
311: Each record contains information on the requested document (URL). A
312: typical URL looks like {\sf
313: protocol://web.site.name[:port]/path/to/document}. We treat the
314: substring between `//' and the next `/' character (omitting the `:port'
315: field if present\footnote{As a rule, requests with the `:port' field
316: are about 2\% of all requests, probably because some Russian websites
317: often use the port value for switching between various Cyrillic
318: encodings.}) as the website name. Only successful GET requests with
319: code 200 are included in our analysis.
320: 
321: We counted the number of requests for each website in the log for each
322: dataset. Those numbers divided by the total number of requests in the
323: dataset give us the {\em normalized rank distribution of websites by
324: popularity $f_r$}.
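
As a minimal illustration of this normalization (the symbols $n_r$, $R$, and $M$ are introduced only for this sketch), let $n_r$ be the number of requests to the website of rank $r$, ordered so that $n_1\ge n_2\ge\dots\ge n_R$, where $R$ is the number of unique websites in the dataset. Then
\[
f_r=\frac{n_r}{M},\qquad M=\sum_{s=1}^{R}n_s,\qquad \sum_{r=1}^{R}f_r=1 .
\]
For example, assuming that the request count listed in Table~\ref{table-sets} is the total used for normalization, the dataset {\em 2001-09-03} has $M=273361$ and $R=7854$, so that a website requested only once has $f_r=1/273361\approx3.7\cdot10^{-6}$.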
325: 
326: Equation fitting and parameter estimation were done by the nonlinear 
327: least-squares method with Levenberg--Marquardt minimization.
328: 
329: \section{Discussion}
330: \label{discussion}
331: 
332: Normalized rank distributions (the fraction of requests to a given
333: website as a function of the corresponding rank) are presented on a
334: log-log scale in Figures~\ref{Fig1}, \ref{Fig2}, and~\ref{Fig3}.
335: Figure~\ref{Fig1} shows results for four datasets with the names {\em 1996}
336: (squares), {\em 1997} (circles), {\em 2000} (up triangles), and {\em 2001} (down
337: triangles) as defined in Table~\ref{table-sets}. All of them were
338: collected by the same proxy site {\em CHG}. Consulting
339: Table~\ref{table-sets}, we can conclude from Figure~\ref{Fig1} that
340: the rank distributions for all four datasets coincide well in the
341: ``middle'' straight-line part of about two decades and that the larger
342: the sample size, the larger this middle region is. We can therefore
343: conclude that the rank distribution does not change qualitatively over
344: five years and that, as the sample grows, the rank distribution comes
345: closer to the ideal Zipf law.
346: 
347: Our goal in Figure~\ref{Fig2} is to demonstrate how a rank
348: distribution depends on the period of observation. For that reason, we
349: plot four distributions obtained from the datasets {\em 2001-09-03}
350: (squares), {\em 2001-09-1w} (circles), {\em 2001-09} (up triangles), and 
351: {\em 2001} (down triangles). Clearly, the distribution does not vary in time but
352: becomes more ``flat'' in the middle part with the longer period
353: (larger sample size).
354: 
355: Finally, Figure~\ref{Fig3} demonstrates that rank distributions with
356: nearly equivalent sample sizes are independent of the placement of
357: the observer (i.e., cache server) in the Internet geography (at least,
358: for the Russian academic networks). We plot seven datasets, {\em 2001}
359: (squares), {\em ikia-2001} (circles), {\em wc-2001} (up-triangles), 
360: {\em ikia-2002} (down-triangles), {\em ras-2002} (diamonds), 
361: {\em wc-2002} (left-triangles), and {\em yar-2002} (right-triangles). 
362: Figure~\ref{Fig3} shows convincingly that
363: the rank distribution of websites is independent of the placement
364: of the web cache in the hierarchy.
365: 
366: Overall, it can be seen that rank distributions corresponding to
367: different data\-sets coincide well for the middle values of ranks.
368: Therefore, the fraction of user requests coming to ``mainstream''
369: websites (which are often encountered in logs but are still less
370: popular than top sites) is stable and does not vary with time
371: (Figure~\ref{Fig1}), with dataset size (Figure~\ref{Fig2}), or with
372: proxy location (Figure~\ref{Fig3}).
373: 
374: One more common feature of all graphs is the divergence of the rank
375: distributions in the ``tails'', the rightmost parts of the graph. The rank
376: distribution turns down strongly in the tails, where the websites were
377: requested fewer than about $100$ times.
378: 
379: There is an interesting peculiarity seen in Figure~\ref{Fig1}: the
380: fraction of requests coming to the most popular sites decreases with
381: time. For example, the frequency of occurrences of the most popular
382: website in 1996 was about an order of magnitude higher than in 2001.
383: Because the most frequent requests come to different kinds of banners,
384: counters, search engines, etc., Figure~\ref{Fig1} demonstrates that
385: their relative popularity diminishes with time. One possible reason is
386: the appearance of many different sites with similar contents (as well
387: as mirror sites) or functions (e.g., banner networks or search
388: engines), which spreads user interest more evenly over different hot
389: sites. Another reason is improvement of web-client software. The
390: internal cache of the web browser can contain more web documents;
391: requests to the most popular documents are then processed using the
392: internal cache. This phenomenon is known as the ``trickle-down'' effect
393: observed by Doyle et al.~\cite{Doyle02}, which is discussed below.
394: 
395: Figure~\ref{Fig2} demonstrates that the top sites have a stable
396: fraction of requests during a given year.
397: 
398: Figures~\ref{Fig2} and~\ref{Fig3} show that Zipf-like law~(\ref{Zipf-like}) (which
399: would appear as a straight line) is a very coarse approximation of
400: the actual distribution. The main deviations from the
401: law~(\ref{Zipf-like}) are in the region of the most popular (top 50)
402: sites and in the tail of the distribution.
403: 
404: Fitting the data to the Zipf-like law, expression~(\ref{Zipf-like}), and its 
405: modifications, expressions (\ref{Zipf-Mandelbrot}) and (\ref{BestFit}), is 
406: a tricky problem both because of the sparse statistics at 
407: the large ranks and because of the high fluctuations of the leading ranks. 
408: Which method is best is not yet understood~\cite{Crovella99}. We use a 
409: least-squares fit to estimate the parameters, calculate the accuracy of 
410: the estimated values by the standard approach, and give it in 
411: parentheses as the uncertainty in the last digits.
412: 
413: We can choose a region of ranks of two orders of magnitude where the
414: rank distribution looks like a straight line. But varying the interval
415: boundaries of the rank window strongly affects the fitting parameters
416: (e.g., the exponent $\alpha$). We obtained $\alpha$ in the range from
417: $0.7$ to $1.4$ depending on the rank window. For example, fitting
418: dataset {\em 2001-09} with Zipf-like law~(\ref{Zipf-like}) in the
419: window $10\le r\le 1000$ gives $\alpha=0.78$ and in the window $10^3\le
420: r\le 10^5$ gives $\alpha=1.13$. Other fitting windows give other
421: values in the range from $0.7$ to $1.4$. We can therefore conclude
422: that the Zipf-like law cannot give us quantitative characteristics of
423: rank distributions of websites in the whole interval of ranks.
424: 
425: Slightly better results can be derived using a modified Zipf-like law,
426: known as the {\em Zipf--Mandelbrot} law~\cite{Mandelbrot},
427: 
428: \begin{equation}
429: \label{Zipf-Mandelbrot}
430: f_r=\frac{b}{(c+r)^{\alpha}},
431: \end{equation}
432: 
433: \noindent which gives a better approximation in the range of small
434: ranks but is still inapplicable in the ``tails''. The fit can be
435: appreciably enhanced by introducing one more parameter
436: in~(\ref{Zipf-Mandelbrot}):
437: 
438: \begin{equation}
439: f_r=a+\frac{b}{(c+r)^{\alpha}}.
440: \label{BestFit}
441: \end{equation}
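
The following short derivation uses only expression~(\ref{BestFit}) and is meant as a sketch of why a straight-line fit on a log-log plot can return a window-dependent exponent. Treating $r$ as a continuous variable and differentiating~(\ref{BestFit}) gives the local slope
\[
\frac{d\ln f_r}{d\ln r}=-\alpha\,\frac{r}{c+r}\left(1-\frac{a}{f_r}\right).
\]
For $c>0$ and $r\lesssim c$, the factor $r/(c+r)$ is less than unity, and the apparent exponent is smaller than $\alpha$. For $a<0$ and large $r$, where $f_r$ becomes comparable to $|a|$, the factor $1-a/f_r$ exceeds unity, and the apparent exponent is larger than $\alpha$. Under this reading, the slope equals $-\alpha$ only where $c\ll r$ and $|a|\ll f_r$.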
442: 
443: Figure~\ref{Fig5} shows the rank distribution of websites in the
444: coordinates $\log(f_r-a)$, $\log(c+r)$ for the particular dataset
445: {\em 2001-09}. The fraction of requests (the vertical axis) is shifted by
446: the value $a=-1.44\cdot 10^{-6}$ and the rank by $c=15.16$. This
447: figure clearly demonstrates that function~(\ref{BestFit}) approximates
448: the data distribution well in almost the entire range of
449: ranks.\footnote{We note that this method for data ``straightening'' is
450: often applied in statistical physics~\cite{Efros,Lev2}. A similar equation
451: was also proposed in a recent work on rank distribution of 
452: publication popularity~\cite{Han2004}.} We have
453: fitted expression~(\ref{BestFit}) to all our data and found that the
454: value of $\alpha$ is quite stable; the results are presented in
455: Table~\ref{Alpha} for the datasets discussed. The columns in
456: Table~\ref{Alpha} are the dataset name as defined in
457: Table~\ref{table-sets} and resulting values of $a$, $c$, and $\alpha$
458: as defined in expression~(\ref{BestFit}). The mean of the exponent
459: $\alpha$ is $1.02 \pm 0.05$, which is consistent with $1.0$. The
460: statistical error is calculated from the scatter of the $\alpha$
461: values in Table~\ref{Alpha}.
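
The ``straightening'' used in Figure~\ref{Fig5} follows directly from expression~(\ref{BestFit}): subtracting $a$ from both sides and taking logarithms gives
\[
\log(f_r-a)=\log b-\alpha\log(c+r),
\]
so that in the coordinates $x=\log(c+r)$, $y=\log(f_r-a)$ the model is a straight line with slope $-\alpha$ and intercept $\log b$ (the labels $x$ and $y$ are introduced here only for illustration). Because the fitted value of $a$ is negative, $f_r-a>0$ for every rank, and the logarithm is defined on the whole range of ranks.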
462: 
463: \begin{table}
464: \centering
465: \caption{Fitting Results for Russian Servers}
466: \begin{tabular}{llrc} \\
467: \hline
468: Dataset &  $a$ & $c$ & $\alpha$\\
469: \hline
470: \em{1996} &  $-3.0(1)\cdot10^{-5}$        & $0.45(4)$        & $0.95(5)$\\
471: \em{1997} &  $-5.77(2)\cdot10^{-6}$     & $2.96(5)$        & $0.92(3)$\\
472: \em{2000} &  $-1.01(11)\cdot10^{-6}$     & $7.33(7)$        & $1.04(3)$\\
473: \em{2001} &  $-2.48(3)\cdot10^{-7}$     & $9.10(5)$        & $1.06(2)$\\
474: \em{2001-09} & $-1.44(27)\cdot10^{-6}$    & $15.16(11)$       & $1.08(7)$\\
475: \em{2001-09-1w} & $-7.25(6)\cdot10^{-6}$  & $14.82(20)$       & $1.03(2)$\\
476: \em{2001-09-03} & $-2.01(7)\cdot10^{-5}$     & $17.82(72)$       & $0.99(6)$\\
477: \em{ikia-2001} & $-5.10(7)\cdot10^{-7}$  & $13.35(7)$       & $1.07(3)$\\
478: \em{ikia-2002} & $-1.58(9)\cdot10^{-6}$ & $4.53(16)$        & $1.01(1)$\\
479: \em{wc-2001} & $-5.56(9)\cdot10^{-7}$   & $14.54(9)$       & $1.09(4)$\\
480: \em{wc-2002} & $-4.43(7)\cdot10^{-7}$   & $14.02(5)$       & $1.06(3)$\\
481: \em{ras-2002} & $-9.45(2)\cdot10^{-7}$  & $9.17(10)$        & $0.95(5)$\\
482: \em{yar-2002} & $-1.30(3)\cdot10^{-6}$  & $4.64(4)$        & $0.99(5)$\\
483: \hline
484: \end{tabular}
485: \label{Alpha}
486: \end{table}
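
As a simple arithmetic check of the quoted mean, the thirteen values of $\alpha$ in Table~\ref{Alpha} sum to $13.24$, so that
\[
\bar\alpha=\frac{1}{13}\sum_{i=1}^{13}\alpha_i=\frac{13.24}{13}\approx1.02 .
\]
If the error is taken to be the sample standard deviation of these values (one possible reading of ``variation''; the index $i$ and the symbol $\sigma$ are introduced only for this check), one obtains
\[
\sigma=\sqrt{\frac{1}{12}\sum_{i=1}^{13}(\alpha_i-\bar\alpha)^2}\approx0.055,
\]
which is close to the quoted value of $0.05$.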
487: 
488: The parameter $a$ can be considered a correction for the finite sample
489: size. The larger the sample size, the smaller the magnitude of $a$.
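
One way to make this interpretation concrete, offered here only as a sketch, is to note that for $a<0$ the fitted curve~(\ref{BestFit}) reaches zero at a finite rank $r_0$ (a symbol introduced for this illustration). Setting $f_{r_0}=0$ gives
\[
a+\frac{b}{(c+r_0)^{\alpha}}=0
\quad\Longrightarrow\quad
r_0=\left(\frac{b}{|a|}\right)^{1/\alpha}-c .
\]
Beyond $r_0$ the model is negative and cannot describe request fractions, so $|a|$ sets the rank scale at which the fitted curve bends down in the tail. A smaller $|a|$ pushes $r_0$ to larger ranks, in line with the tail moving to the right as the sample grows. The values of $b$ are not listed in Table~\ref{Alpha}, so no numerical estimate of $r_0$ is attempted here.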
490: 
491: The parameter $c$ in expression~(\ref{BestFit}) has a very clear
492: physical meaning. It is closely connected with the {\em trickle-down
493: effect} observed by Doyle et al.~\cite{Doyle02}. They found that proxies
494: disproportionately absorb requests on different levels of the
495: hierarchy. Rank distributions obtained from data collected on proxies
496: at different hierarchical levels differ in the region of small ranks.
497: This effect has a clear explanation in terms of rank distributions.
498: 
499: As a clarifying example, we consider a two-layer hierarchy of proxies.
500: A first-level proxy receives requests from users. If the requested
501: document is found in its cache, then that document is returned to the
502: client; otherwise, the request is submitted to an upper-level proxy.
503: If we assume that a first-level proxy can hold $N$ documents in its
504: cache, then it accordingly filters the $N$ most popular documents from
505: the request stream, i.e., it ``cuts'' the leftmost $N$ points from the
506: rank distribution. This is equivalent to the change of variables
507: $r\rightarrow r+N$. Therefore, we presume that the parameter $c$ in
508: equation~(\ref{BestFit}) characterizes cache sizes of low-level
509: proxies (which can also be the user's browser cache).
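
The argument can be written out explicitly. In this sketch we assume, purely for illustration, that the request stream arriving at the first-level proxy follows Zipf-like law~(\ref{Zipf-like}) exactly, $f_r=b/r^{\alpha}$, and that the proxy absorbs all requests to the $N$ most popular items (the primed symbol is introduced only here). The item of original rank $r+N$ then appears at rank $r$ in the stream seen by the upper-level proxy:
\[
f'_r\propto f_{r+N}=\frac{b}{(N+r)^{\alpha}},\qquad r=1,2,\dots,
\]
which has the Zipf--Mandelbrot form~(\ref{Zipf-Mandelbrot}) with $c=N$. The proportionality constant accounts for renormalizing by the smaller number of requests that reach the upper level. A real cache does not filter the top $N$ items perfectly, so the fitted $c$ should be read as an effective value rather than an exact cache size.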
510: 
511: It can be seen that for all datasets, $\alpha$ is close to unity with
512: an accuracy of a few percent. We therefore suppose that the exponent
513: $\alpha$ in equation~(\ref{BestFit}) is a universal characteristic of
514: web traffic, which is independent of time (for time-scales comparable
515: with the Internet lifetime), is independent of data collection
516: duration (when the sample size is sufficiently large and contains more
517: than $2{\times}10^5$ requests), and is independent of the placement
518: of the proxy server in the Internet hierarchy.
519: 
520: We were able to check our findings using publicly available
521: statistics. We chose BU web-client traces available from {\sf
522: ita.ee.lbl.gov} (the full dataset from Nov 94 to May 95 contains
523: 1143842 requests, 104532 unique URLs, and 4970 unique sites). This
524: dataset was used in early work and gives one of the best examples of
525: the Zipf law for web-page popularity ($\alpha=0.986$)~\cite{Cunha95}.
526: Fitting equation (\ref{BestFit}) to the rank distribution of website
527: popularity gives $\alpha=1.025$, $a=-3.3\cdot 10^{-5}$, and $c=1.97$,
528: which agree well with the values obtained for Russian academic
529: networks. This is an additional argument that website popularity
530: distribution is universal (in other words, is independent of both the
531: observation point in the Internet and Internet history) and
532: follows the Zipf law with an exponent $\alpha$ close to unity.
533: 
534: \begin{table}
535: \centering
536: \caption{Characteristics of Analyzed Web Datasets in USA and Fitting 
537: Results}
538: \begin{tabular}{lrrrrr}\\
539: \hline
540: cache & \# of & $N=$\# of & $aN$ & $c$ & $\alpha$ \\ 
541:       &requests &websites &      &     &    \\ \hline
542: {\em bo}&23935604&592679&-2.89(1)&8.54(4)&1.05(2) \\
543: {\em ny}&12789266&407952&-3.89(1)&-0.12(1)&0.94(3) \\
544: {\em pa}&3374392&229633&-1.57(1)&7.17(12)&0.96(8) \\
545: {\em pb}&10018478&304049&-4.47(1)&18.96(13)&0.98(4) \\
546: {\em rtp}&13221655&339918&-4.35(1)&23.52(13)&1.01(4) \\
547: {\em sd}&13840665&285356&-3.22(1)&0.166(7)&1.04(3) \\
548: {\em sj}&26130582&264396&-6.00(1)&1.935(13)&1.09(2) \\
549: {\em sv}&11119941&530731&-3.20(1)&16.34(13)&0.93(4) \\
550: {\em uc}&13294408&313178&-5.17(1)&15.14(9)&1.01(4) \\ \hline
551: {\em uc-12d}&3236853&84360&-4.37(2)&7.79(12)&0.95(8) \\
552: {\em uc-1d}&463899&13752&-1.77(4)&4.99(24)&0.96(3) \\ \hline
553: {\em all}&127724991&1176623&-8.96(1)&5.05(1)&1.03(2) \\ \hline
554: \label{table-setUS}
555: \end{tabular}
556: \end{table}
557: 
558: 
559: To check this statement further, we also analyzed recently available 
560: data\footnote{Thanks to D.  Wessels, who kindly gave us access to the 
561: data sets collected at the US IRCache servers.} collected during the period 
562: from November 3 to December 29, 2004 at nine cache servers of the US national 
563: cache-mesh system for science and education built within the IRCache 
564: project~\cite{ircache}. Table~\ref{table-setUS} presents data from the 
565: following locations:
566: 
567: \begin{itemize}
568: \item {\em bo} -- NCAR at Boulder, Colorado
569: \item {\em ny} -- New York, New York
570: \item {\em pa} -- Digital Internet Exchange in Palo Alto, California
571: \item {\em pb} -- PSC at Pittsburgh, Pennsylvania
572: \item {\em rtp} -- Research Triangle Park, North Carolina
573: \item {\em sd} -- SDSC at San Diego, California
574: \item {\em sj} -- MAE West Exchange Point in San Jose, California
575: \item {\em sv} -- NASA-Ames/FIX-West in Silicon Valley, California
576: \item {\em uc} -- NCSA at Urbana-Champaign, Illinois.
577: \end{itemize}
578: 
579: \noindent The second and third entries from the bottom demonstrate the 
580: stability of the fit for two subsets of the data collected at the {\em 
581: uc} location, for 12 days (set name {\em uc-12d}) and for 1 day (set {\em 
582: uc-1d}). The last entry represents the fit to the sum of the nine full 
583: data sets. The values of $\alpha$ obtained from the fit by expression~(\ref{BestFit}) are close to 
584: unity and quite similar to those for Russian servers presented in 
585: Table~\ref{Alpha}.
586: 
587: 
588: \section{Conclusions}
589: \label{conclusions}
590: 
We have presented a modified Zipf law~(\ref{BestFit}) that fits the rank 
distribution of websites rather well over the full range of ranks. We found 
that the value of the exponent $\alpha$ in expression~(\ref{BestFit}) is 
stable for the analyzed datasets. It does not vary with (1) the year of data 
collection, (2) the period of data collection, or (3) the geographical 
location of the cache server where the data were collected. We found that 
$\alpha$ is very close to $1$. We have reasons to suppose that this value of 
$\alpha$ is a universal property of web traffic for the website rank. We 
have also presented a clear explanation of the ``trickle-down effect'' 
based on the properties of our modified Zipf law. We suggest that website 
popularity is a universal property of the Internet and follows the Zipf law.
602: 
In a similar experiment~\cite{KS-tri}, fluctuations of the exponent value 
were checked as a function of the volume of statistics by analyzing 
cache traces of user requests to different Internet domains. 
User requests were sent to the Internet through a cache triangle: 
they went to the Master Server, which sent each odd request to the left 
cache and each even request to the right cache. Clearly, the traces should 
be nearly equal in the limit of a large number of requests. Indeed, it was 
estimated that the exponents extracted separately from the ``left'' and 
``right'' traces agreed to within five per cent for set volumes larger than 
ten thousand requests, whereas those for set volumes of less than a few 
hundred requests fluctuated strongly. Thus, poor statistics may 
significantly affect the results.
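
A rough estimate, which is not taken from~\cite{KS-tri}, illustrates why 
small samples fluctuate. Assume, as a simplification, that each of the $n$ 
requests to a given site lands in the left or the right trace independently 
with probability $1/2$; the symbols $n$, $n_L$, and $n_R$ are introduced 
only for this illustration. Then $n_L$ is binomially distributed with 
variance $n/4$, and because $n_L-n_R=2n_L-n$,
\[
\frac{\sqrt{\langle (n_L-n_R)^2\rangle}}{n} = \frac{1}{\sqrt{n}} .
\]
Under this assumption the relative mismatch between the two traces is about 
$10\%$ for a site with $n=10^2$ requests and about $1\%$ for $n=10^4$, so 
the two traces can be expected to agree only when the counts are large.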
615: 
616: The results in this paper may be useful for building mirror sites and
617: CDNs as well as for improving software for DNS request caching. We
618: also conjecture that fitting with the modified Zipf law is suitable
619: for describing the rank distribution of web-document popularity.
620: 
621: \begin{figure}
622: \centering
623: \psfig{file=fig1.eps,width=\columnwidth}
624: \caption{Website distribution for different years}
625: \label{Fig1}
626: \end{figure}
627: 
628: \begin{figure}
629: \centering
630: \psfig{file=fig2.eps,width=\columnwidth}
631: \caption{Website distribution for different periods}
632: \label{Fig2}
633: \end{figure}
634: 
635: \begin{figure}
636: \centering
637: \psfig{file=fig3.eps,width=\columnwidth}
638: \caption{Website distribution for different servers}
639: \label{Fig3}
640: \end{figure}
641: 
642: \begin{figure}
643: \centering
644: \psfig{file=fig5c.eps,width=\columnwidth}
\caption{Website distribution in modified coordinates: dependence 
of $f_r-a$ on $r+c$ (compare with expression~(5)) on a double logarithmic 
scale.}
648: \label{Fig5}
649: \end{figure}
650: 
651: \section{Acknowledgment}
652: 
653: The authors thank the anonymous referees for the valuable remarks and
654: comments that allowed us to improve this paper.
655: Special thanks to Duane Wessels for access to logs from IRCache web-cache servers.
656: 
657: This work was supported by the Russian Foundation for Basic Research.
658: 
659: 
660: \begin{thebibliography}{99}
661: 
662: \bibitem{Glassman94} Steven Glassman, {\it A caching Relay for the World Wide
663: Web.} Proc. 1st Int. Conference on the World-Wide Web, CERN,
664: Geneva (Switzerland), May 1994. Computer Networks and ISDN Systems, 27(2), 165-173 (1994).
665: 
666: \bibitem{Breslau99} Lee Breslau, Pei Cao, Li Fan, G. Phillips, S.
667: Shenker, {\it Web Caching and Zipf-like Distributions: Evidence and
668: Possible Implications}, Proc. IEEE INFOCOM '99: 18th Annual Joint
669: Conference of the IEEE Computer and Communications Societies, Volume: 1,
670: p.~126-134, 1999.
671: 
672: \bibitem{Kelly02} Terence Kelly, Jeffrey Mogul.  {\it Aliasing on the
673: World Wide Web: Prevalence and performance implications}. Proc. 11th Int.
674: WWW Conf., Honolulu, May 2002, ACM Press, pp.281 - 292.
675: 
676: \bibitem{Doyle02}  Ronald P. Doyle, Jeffrey S. Chase, Syam Gadde,
677: Amin M. Vahdat.  {\it The trickle-down effect: Web caching and
678: server request distribution}.  Computer Communications, {\bf 25},
679: 345-356 (2002).
680: 
681: \bibitem{Pareto} V. Pareto, {\it Cours d'economie politique}, Rouge,
682: Lausanne et Paris, 1897.
683: 
684: \bibitem{Zipf49} G. K. Zipf, {\it Human Behavior and the Principle of
685: Least-Effort.} Addison-Wesley, Cambridge, MA, 1949.
686: 
687: \bibitem{Li92} W. Li, {\it Random texts exhibit Zipf's-law-like word
688: frequency distribution}.  IEEE Trans. Inform. Theory, {\bf 38}(6),
689: 1842-1845 (1992).
690: 
691: \bibitem{Mantegna} R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S.
692: Havlin, C. K. Peng, M. Simons, and H. E. Stanley. {\it Linguistic Features
693: of Noncoding DNA Sequences}, Phys. Rev. Lett. {\bf 73}, 3169-3172 (1994).
694: 
695: \bibitem{Martindale} C. Martindale, A. K. Konopka, {\it Oligonucleotide
696: frequencies in DNA follow a Yule distribution}, Computers \& Chemistry,
697: {\bf 20}, 35-38 (1996).
698: 
699: \bibitem{Mandelbrot} B. B. Mandelbrot, {\it The Fractal Geometry of
700: Nature}. Freeman, New York, 1977.
701: 
\bibitem{Efros} B.I. Shklovskii, A.L. Efros, {\it Electronic Properties of
Doped Semiconductors}, Percolation Theory (Springer-Verlag, Berlin, 
1984), pp. 95-136.
706: 
707: \bibitem{Lev2} L.N. Shchur, {\it Incipient Spanning Clusters in Square and
708: Cubic Percolation}, in Springer Proceedings in Physics, ``Computer
709: Simulation Studies in Condensed Matter Physics XII'', Eds. D.P. Landau,
710: S.P. Lewis, and H.B. Sch\"uttler, (Springer-Verlag, Berlin, 2000)
711: 
712: \bibitem{Han2004} Ding-Ding Han, Jin-Gao Liu, Yu-Gang Ma, Xiang-Zhou Cai, 
713: Wen-Qing Shen. {\it Scale-free download network for publications}. 
714: Chin. Phys. Lett., {\bf 21}(9), 1855-1857 (2004); arXiv:cond-mat/0405428
715: (2004).
716: 
\bibitem{Zanette97} Dami\'an H. Zanette, Susanna C. Manrubia, {\it Role
718: of intermittence in urban development: A model of large-scale city
719: formation}.  Phys. Rev. Lett., {\bf 79}(3), 523-526 (1997).
720: 
721: \bibitem{Marsili98} Matteo Marsili, Yi-Cheng Zhang, {\it Interacting
722: individuals leading to Zipf's law}.  Phys. Rev. Lett., {\bf 80}(12),
723: 2741-2744 (1998).
724: 
725: \bibitem{Redner98} S. Redner. {\it How popular is your paper? An empirical
726: study of the citation distribution}.  Eur. Phys. J. B {\bf 4}, 131-134
727: (1998).
728: 
729: \bibitem{Malamud} B. D. Malamud, G. Morein, D. L. Turcotte. {\it Forest
730: fires: an example of self-organized critical behavior}.  Science, {\bf
731: 281}, 1840-1842 (1998).
732: 
733: \bibitem{Newman2004} M.E.J. Newman. {\it Power laws, Pareto distributions
734: and Zipf's law}. arXiv:cond-mat/0412004 (2004)
735: 
736: \bibitem{Mitzenmacher2004} Michael Mitzenmacher. {\it A brief history of 
737: generative models for power law and lognormal distributions}.
738: Internet Mathematics, {\bf 1}(2), 226-251 (2004).
739: 
740: \bibitem{Laherrere1998} J. Laherr\`ere, D. Sornette. {\it Stretched 
741: exponential distributions in nature and economy: ``fat tails'' with
742: characteristic scales}. Eur. Phys. J., {\bf B 2}, 525-539 (1998).
743: 
744: \bibitem{website} Azer Bestavros. {\it WWW traffic reduction and load
745: balancing through server-based caching}.  IEEE Concurrency, {\bf 5}(1),
746: 56-67 (1997);
Takashi Hatashima, Toshihiro Motoda, Shuichiro Yamamoto.  {\it An
748: ``interest'' index for WWW servers and CyberRanking}.  IEICE Trans. Inf.
749: \& Syst., {\bf E83-D}, 729-734 (2000);
750: Venkata N. Padmanabhan, Lili Qiu.  {\it The content and access dynamics
751: of a busy web site: Findings and implications}.  Proc. ACM SIGCOMM'00,
752: Stockholm, Sweden, 2000, pp. 111-123;
753: Adeniyi Oke, Rick Bunt.  {\it Hierarchical workload characterization for
754: a busy web server}.  In: Computer Performance Evaluation (Ed. T. Field,
755: P.G. Harrison, J. Bradley, U. Harder). Springer-Verlag: Berlin ea, 2002,
756: pp.309-328. Proc. TOOLS'2002: 12th Int. Conf. on Modeling Techniques and
757: Tools, London, UK, April 14-17 2002. [Lecture Notes in Computer Science,
758: Vol. 2324].
759: 
\bibitem{Aida98} Masaki Aida, Noriyuki Takahashi, Tetsuya Abe.
761: {\it A proposal of dual Zipfian model for describing HTTP access trends
762: and its application to address cache design}. IEICE Trans. Commun.,
763: {\bf E81-B} (7), 1475-1485 (1998).
764: 
765: \bibitem{Cunha95} C. R. Cunha, A. Bestavros, M. E. Crovella, {\it
766: Characteristics of WWW Client-based Traces}, Technical report
767: BU-CS-95-010, Boston University, July, 1995.
768: 
769: \bibitem{Barford99} P. Barford, A. Bestavros, A. Bradley, M. Crovella, {\it
770: Changes in Web client access patterns: Characteristics and caching implications}.
771: World Wide Web J., Spec. Issue on Characterization and
772: Performance Evaluation, {\bf 2}, 15-28 (1999).
773: 
774: \bibitem{Jin2000} Shudong Jin and Azer Bestavros, {\it Sources and
775: Characteristics of Web Temporal Locality}. Proc. MASCOTS'2000: The 8th
776: IEEE/ACM International Symposium on Modeling, Analysis and Simulation of
777: Computer and Telecommunication Systems, San Francisco, CA, 29 Aug - 1 Sept
778: 2000. p.28-35.
779: 
780: \bibitem{Mahanti99} A. Mahanti, C. Williamson.  {\it Web proxy workload
781: characterization}.  Tech. Report, Department of Computer Science,
782: University of Saskatchewan, February 1999.
783: 
784: \bibitem{Crovella99} M.E. Crovella and M.S. Taqqu, {\it Estimating the 
785: Heavy Tail Index from Scaling Properties}, In: Methodology and Computing in 
786: Applied Probability, {\bf 1}, 55-79 (1999).
787: 
788: \bibitem{Mahanti00} A. Mahanti, C. Williamson, D. Eager, {\it Traffic
789: analysis of a Web proxy caching hierarchy}. IEEE Network Magazine, {\bf
790: 14}(3), 16-23 (May/Jun 2000).
791: 
792: \bibitem{Roadknight99} Chris Roadknight, Ian Marshall, and Deborah Vearer.
793: {\it File Popularity Characterization}. Proc. WISP'99: 2nd Workshop on
794: Internet Server Performance, Atlanta, Georgia, May 1999.
795: 
796: \bibitem{Almeida96} V. Almeida, A. Bestavros, M. E. Crovella,
797: A. de Oliveira, {\it Characterizing Reference Locality in the
798: WWW}, Proc. PDIS'96, Dec. 1996, p.~92-103.
799: 
800: \bibitem{Nishikawa98} N. Nishikawa, T. Hosokawa, Y. Mori, K. Yoshida, H.
801: Tsuji. {\em Memory-based architecture for distributed WWW caching proxy}.
802: Computer Networks and ISDN Systems, {\bf 30}, 205-214 (1998).
803: 
804: \bibitem{sakr98} Serge Krashakov, Lev Shchur. {\it WWW Caching in
Russia - Current State and Future Development}. Proc. 3rd Int. Web Caching
806: Workshop, Manchester, June 15-17, 1998.
807: 
808: \bibitem{ircache} The IRCache project - {\tt http://www.ircache.net}.
809: 
810: \bibitem{KS-tri} Sergey A. Krashakov, Lev N. Shchur.  {\em Active measurements 
(experiments) of the Internet traffic using cache-mesh}.  Int. J. Modern 
812: Physics C, {\bf 12}, 549-562  (2001). 
813: 
814: \end{thebibliography}
815: 
816: \end{document}
817: