
1: \documentclass{elsart}

2: \usepackage{epsfig}

3: \usepackage{amssymb}

4:

5: \begin{document}

6:

7: \begin{frontmatter}

8:

9: \title{On the universality of rank distributions of website popularity}

10: \author{Serge A. Krashakov\corauthref{cor}},

11: \corauth[cor]{Corresponding author.}

12: \ead{sakr@itp.ac.ru}

13: \author{Anton B. Teslyuk\thanksref{now}},

14: \thanks[now]{Present address: Institute of Information Science,

15: RRC Kurchatov Institute, 1~Kurchatov sq., Moscow, 123182, Russia}

16: \author{Lev N. Shchur}

17:

18: \address{Landau Institute for Theoretical Physics, Chernogolovka, 142432 Russia}

19:

20: \begin{abstract}

21:

22: We present an extensive analysis of long-term statistics of the queries to

23: websites using logs collected on several web caches in Russian academic

24: networks and on US IRCache caches.  We check the sensitivity of the statistics

25: to several parameters: (1)~duration of data collection, (2) geographical

26: location of the cache server collecting data, and (3) the year of data

27: collection. We propose a two-parameter modification of the Zipf law and

28: interpret the parameters. We find that the rank distribution of websites

29: is stable when approximated by the modified Zipf law. We suggest

30: that website popularity may be a universal property of the Internet.

31:

32: \end{abstract}

33:

34: \begin{keyword}

35: Internet, Web traffic, Rank Distribution, Zipf Law

36: \PACS 89.20.Hh World Wide Web, Internet - 89.75.Da Systems obeying scaling laws

37: \end{keyword}

38: \end{frontmatter}

39:

40: \section{Introduction}

41: \label{intro}

42:

43: It has been known for a decade that web-document popularity follows the

44: Zipf law~\cite{Glassman94}. Nevertheless, the exponent values reported by

45: different authors vary significantly, from 0.60 to

46: 1.03~\cite{Glassman94,Breslau99,Kelly02,Doyle02} (see

47: Table~\ref{Datasets}). We believe that the scatter of the reported

48: values is due to the small sample size in some cases and to the details of

49: the fitting procedure used to extract the exponent.

50:

51: In this paper, we propose that the rank distribution of the websites

52: follows the Zipf law and give arguments supporting our idea. We must

53: note that website statistics are more extensive than web-document

54: statistics, and the distribution parameters can be obtained with

55: higher accuracy.

56:

57: We address the following questions: Is the rank distribution of

58: websites Zipf-like?  If yes, what are the conditions under which the

59: ``true'' exponent can be obtained?  Does the exponent depend on the

60: duration of the observation? Or on the geographical position of the

61: observer? And does the exponent vary with time, as the Internet

62: develops?

63:

64: We report some answers to these questions.  We have studied website

65: statistics, which are indeed more stable than web-document statistics.

66: We have analyzed log files accumulated on cache servers of Russian

67: academic networks (FREEnet, RASnet, and RSSI) for about six years. These

68: networks differ by their connectivity topology and bandwidth, both

69: national and international.  These cache servers have different

70: geographical locations (Moscow, Moscow region, and Yaroslavl in Russia).

71: In addition, we analyzed some statistics collected during seven weeks in

72: the fall of 2004 at a number of IRCache servers in the United States (see

73: Table~\ref{table-setUS}).

74:

75: We found that the statistics studied become stable\footnote{The

76: accuracy of the exponent becomes a few percent, e.g., 5\%.} when the

77: number of queries for the given statistics exceeds $10^5$. It is

78: therefore meaningful to fit only those data for which the number of

79: queries exceeds this value. This simple criterion can be used to

80: estimate the critical window for the rank interval where the

81: distribution is stable and the power law can be observed.

82:

83: We found that the statistics are independent of the geographical

84: location of the cache server (observer) collecting the data, at least for

85: the analyzed data sets.

86:

87: We found that the distribution is independent of the year

88: of data collection and is therefore stable over Internet history and

89: development.

90:

91: Nevertheless, we found that the Zipf-like law approximation is

92: suitable only in the middle region of several orders of rank

93: magnitude. We propose a modification of the Zipf-like law with two

94: additional parameters and explain their possible meaning. We found that

95: if we fit the equation of the modified law to the data, the website

96: popularity distribution becomes quite stable. The value of the

97: exponent $\alpha$ is $1.02\pm0.05$ for all datasets studied in this

98: paper. We thus may suggest that website popularity follows the Zipf law.

99:

100: We verified that the same modification also works perfectly for the

101: web-document ranked distribution.

102:

103: The paper is organized as follows. In section~\ref{nature}, we present

104: a brief history of the power laws observed in nature and society. We

105: describe the data collection and processing in section~\ref{datasets}.

106: We discuss the results in section~\ref{discussion} and present our

107: conclusions in section~\ref{conclusions}.

108:

109: \section{Power laws in nature and society}

110: \label{nature}

111:

112: More than 100 years ago, Pareto~\cite{Pareto} observed that the income

113: distribution $f$ in all countries can be described by the relation

114:

115: \begin{equation}

116: \label{Pareto}

117: F(f)=1-(m/f)^{\alpha},

118: \end{equation}

119:

120: \noindent where the exponent $\alpha\simeq1.5$ and $m$ is some

121: constant.  About 70 years ago, George Zipf~\cite{Zipf49} discovered a

122: striking regularity in English texts: the relative occurrence

123: frequency $f$ of the $r$th most popular word is inversely

124: proportional to the rank $r$:

125:

126: \begin{equation}

127: \label{Zipf-law}

128: f_r\sim\frac1r.

129: \end{equation}

130:

131: A more general form of the Zipf law~(\ref{Zipf-law}) with the exponent

132: $\alpha \ne 1$ is often encountered in the literature and is known as

133: a {\em Zipf-like law}:

134:

135: \begin{equation}

136: \label{Zipf-like}

137: f_r\sim\frac{1}{r^\alpha}.

138: \end{equation}
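As a brief aside on how relation~(\ref{Pareto}) connects to a rank law, the following heuristic sketch may help. The symbols $N$ (the number of items) and $\alpha_P$ (the Pareto exponent, written with a subscript here only to distinguish it from the rank exponent) are notation introduced for this illustration. Among $N$ items distributed according to~(\ref{Pareto}), about $N(m/f)^{\alpha_P}$ have a value of at least $f$, and this number is the rank $r$ of an item with value $f_r=f$:

\begin{displaymath}
r\simeq N\left(\frac{m}{f_r}\right)^{\alpha_P}
\quad\Longrightarrow\quad
f_r\simeq m\left(\frac{N}{r}\right)^{1/\alpha_P}\sim\frac{1}{r^{1/\alpha_P}}.
\end{displaymath}

\noindent Under this assumption, a Pareto exponent $\alpha_P$ corresponds to a rank exponent $\alpha=1/\alpha_P$ in~(\ref{Zipf-like}); for example, $\alpha_P\simeq1.5$ gives $\alpha\simeq0.67$.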

139:

140: A Zipf-like law has been found in many areas of human activity and in

141: nature. Among examples are the distribution of words in random

142: texts~\cite{Li92}, of nucleotide ``words'' in

143: DNA~\cite{Mantegna,Martindale}, of bit sequences in UNIX executable

144: files~\cite{Mantegna}, of book popularities in

145: libraries~\cite{Zipf49,Mandelbrot}, of countries' areas and population

146: sizes~\cite{Zipf49,Zanette97,Marsili98}, of scientific publication

147: citation indices~\cite{Redner98}, and of forest-fire areas~\cite{Malamud}.

148: Many other examples can be found in recent

149: reviews~\cite{Newman2004,Mitzenmacher2004}.

150:

151: Meanwhile, there is much discussion of whether a lognormal or a power law is

152: a better fit for some empirical distributions, for example, income

153: distribution, population fluctuations, file size distribution, and some

154: others (for a short review, see~\cite{Mitzenmacher2004}). In many cases

155: a lognormal distribution looks like a power-law distribution over several

156: orders of magnitude~\cite{Mitzenmacher2004,Laherrere1998}. We leave this

157: question open and analyze our data using a Zipf-like law.

158:

159: \begin{table}

160: \caption{Characteristics of Published Web Datasets}

161: \begin{tabular}{lcccll}\\ \hline

162: Dataset & Date & \# of & \# of &$\alpha$&Ref.\\

163: 	& (Period) & requests & pages	& & \\

164: \hline

165: DEC &   1994    & $\sim$ 100k &     & 1 & ~\cite{Glassman94}\\

166: BU  &   Jan95(42d) &  575775  & 54438 & 0.99 &~\cite{Cunha95}\\

167: BU  &   1998    &   66988   & 41049   & 0.65 &~\cite{Barford99}\\

168: DEC &   Jul96(6d)&    3543968 & 1354996 & 0.77 & ~\cite{Jin2000}\\

169: NLANR.RTP & Jun99(13d)&    9113027 & 3249549 & 0.71 & ~\cite{Jin2000}\\

170: NLANR.SD &  Jun99(13d)&    9082461 & 3549609 & 0.72 & ~\cite{Jin2000}\\

171: NLANR.UC &  Jun99(13d)&    8983585 & 2459366 & 0.66 & ~\cite{Jin2000}\\

172: USASK   & Oct98(82d) &    20754720 & 5527667 & 0.76 & ~\cite{Mahanti99}\\

173: CANARIE & Dec98(26d) &   35129680  &  1423081 & 0.63  & ~\cite{Mahanti99}\\

174: NLANR.UC & Dec98(31d) &   20018680 & 7681214 & 0.65 & ~\cite{Mahanti99}\\

175: USASK   & Feb99(45d) &    21070330 & 5510561 & 0.84 & ~\cite{Mahanti00}\\

176: CANARIE & Feb99(45d) &    7310038 & 4571539 & 0.77 & ~\cite{Mahanti00}\\

177: NLANR.UC & Feb99(30d) &    24560611 & 8482661 & 0.74 & ~\cite{Mahanti00}\\

178: NLANR.LJ &  1998    & $\sim$ 500k    &   & 0.64 & ~\cite{Roadknight99}\\

179: UPisa    &  1998    & $\sim$ 500k    &   & 0.91 & ~\cite{Roadknight99}\\

180: FUNET    &  1998    & $\sim$ 500k    &   & 0.70 & ~\cite{Roadknight99}\\

181: SPAIN    &  1998    & $\sim$ 500k    &   & 0.72 & ~\cite{Roadknight99}\\

182: RMPLC    &  1998    & $\sim$ 500k    &   & 0.86 & ~\cite{Roadknight99}\\

183: BU-CS   & Oct95(14d) & 80518 & 4471  & 0.85 & ~\cite{Almeida96}\\

184: Hitachi &   1997(16d) &  2000000   &   & 0.75  & ~\cite{Nishikawa98}\\

185: DEC & Aug96(7d) & 3543968   &   & 0.77  & ~\cite{Breslau99}\\

186: UCB & Nov96(18d) & 1907762   & & 0.78  &~\cite{Breslau99}\\

187: UPisa   &   (3m)       & 2833624   & & 0.83  &~\cite{Breslau99}\\

188: Questnet & Jan98(7d) & 2885285  & & 0.69  &~\cite{Breslau99}\\

189: NLANR   & Dec97(1d)  & 1766409   & & 0.73  &~\cite{Breslau99}\\

190: FUNET   & Jun98(10d)    & 4815551 & & 0.64  &~\cite{Breslau99}\\

191: HGMP    & Jan98(7m)  & $\sim$ 750k & & 0.60& ~\cite{Breslau99}\\

192: WebTV   & Sep00(16d) & 347460865&  32541361 & 1.03 & ~\cite{Kelly02}\\

193: \hline

194: \label{Datasets}

195: \end{tabular}

196: \end{table}

197:

198: It is widely assumed that web document popularity follows a Zipf-like law.

199: We summarized all published results in

200: Table~\ref{Datasets} with the dataset name, the date and period of log

201: files in days (d) or months (m), the number of requests, the number of

202: unique web pages requested, and the reported value of the exponent

203: $\alpha$.\footnote{Some papers do not provide all the information

204: (e.g., the number of unique pages) for the datasets studied.} It can

205: be seen that exponent values vary from $0.60$ to $1.03$.\footnote{Here

206: we consider document popularity observed at the client (BU dataset) or

207: proxy side only. Values of the exponent $\alpha$ observed at the

208: web-server side vary from $0.67$ to $1.82$~\cite{website}.} A

209: question arises. {\em Why is the variation of the exponent so large?}

210: Probably, the sample size is important, and the Zipf-like law only

211: fits two decades of ranks well at best. It is quite inapplicable in

212: the ``tails'' and at small ranks, and the results are sensitive to the

213: choice of the rank window for fitting the data.

214:

215: We know of only two papers where the website popularity issue was

216: addressed. In paper~\cite{Aida98}, the authors claim that the destination

217: address of web requests can be characterized by two types of Zipf laws. In

218: paper~\cite{Breslau99}, the authors presented results for  three sets of

219: user request traces (shown in~\cite{Breslau99} in Fig.~5, which

220: is similar to our Figs.~\ref{Fig1} and \ref{Fig2}). In

221: particular, the UCB-trace in their Fig.~5 looks similar to the set

222: 2001-09-03 shown in our Fig.~\ref{Fig2}, and it is practically impossible to

223: extract any value of the exponent $\alpha$ using the fit to Zipf-like

224: law~(\ref{Zipf-like}). To our knowledge, the authors did not publish

225: the announced preprint with the values of exponent $\alpha$.

226:

227: \section{Datasets and methods}

228: \label{datasets}

229:

230: \begin{table}

231: \centering

232: \caption{Characteristics of Analyzed Web Datasets in Russia}

233: \begin{tabular}{llcccc}\\

234: \hline

235: Dataset & Proxy & Starting & Period & \# of  &  \# of \\

236:         &       & date  &       & requests & websites \\

237: \hline

238: \em{1996}   & CHG  & Sep 1996   & 74d & 155743    & 4360 \\

239: \em{1997}   & CHG  & Jan 1997    & 1y & 2642722   & 44881 \\

240: \em{2000}   & CHG  & Sep 2000    & 3m & 27130648  & 146693 \\

241: \em{2001}   & CHG  & Feb 2001    & 8m     & 64577294  & 269868 \\

242: \em{ikia-2001}  & IKIA & Jul 2001   & 4m  & 29296632  & 177497 \\

243: \em{ikia-2002}  & IKIA & May 2002   & 1m  & 2067205  & 53747 \\

244: \em{wc-2001}    & FREEnet & Jan 2001   & 4.5m    & 16989853  & 152760 \\

245: \em{wc-2002}    & FREEnet & Feb 2002   & 5m  & 26576501  & 239891 \\

246: \em{yar-2002}   & Yars & Apr 2002   & 1m & 9639987 & 86611 \\

247: \em{ras-2002}   & RASnet & Feb 2002 & 5m & 9240289 & 227686 \\

248: \hline

249: \em{2001-09}    & CHG  & Sep 2001    & 1m   & 7333162   & 68671 \\

250: \em{2001-09-1w} & CHG  & Sep 2001    & 1w & 1382537   & 24103 \\

251: \em{2001-09-03} & CHG  & Sep 2001    & 1d  & 273361    & 7854 \\

252: \hline

253: \label{table-sets}

254: \end{tabular}

255: \end{table}

256:

257: We start our analysis with the data collected on several proxies (cache

258: servers) located in different Russian academic networks and in the next

259: section will compare the results with the analysis of data collected in

260: the fall of 2004 on American IRCache servers. Collections of data from

261: Russian servers are presented in Table~\ref{table-sets} with the dataset

262: name, proxy server location, starting date of log files, period of log

263: file in days (d), weeks (w), months (m), or years (y), number of requests,

264: and number of unique websites requested. The following abbreviations are

265: used for proxies: {\em CHG} for the proxy located in the Chernogolovka

266: network (AS9113), Chernogolovka, Moscow region, Russia;  {\em IKIA} for

267: the proxy in Space Research Institute RAS (AS3218), Moscow, Russia;  {\em

268: FREEnet} for the proxy in FREEnet (AS2895), Moscow, Russia; {\em RASnet}

269: for the proxy located in RASnet (AS3058), Moscow, Russia; and {\em Yars}

270: for the proxy located in Yaroslavl State University (AS8325), Yaroslavl,

271: Russia. Proxy-servers {\em CHG} and {\em Yars} are typical regional cache

272: servers serving requests from local users. Other servers located in Moscow

273: are a central part of the Russian web-caching hierarchy~\cite{sakr98} and

274: serve requests from local users as well as from other (e.g., regional)

275: cache servers.

276:

277: \begin{figure}

278: \centering

279: %\psfig{file=scheme.eps,width=\columnwidth}

280: \psfig{file=scheme.eps,width=60mm}

281: \caption{Sketch of the data collection}

282: \label{scheme}

283: \end{figure}

284:

285: \begin{figure}

286: \centering

287: \psfig{file=caches.eps,width=90mm}

288: \caption{Hierarchy of the cache server network.}

289: \label{caches}

290: \end{figure}

291:

292: All proxy-servers run Squid caching software. Figure~\ref{scheme}

293: sketches the process of data collection:  user queries go to the cache

294: server, which processes user queries to the web servers and keeps

295: traces of user requests as records in log files. We therefore call the

296: cache servers ``observers'' to stress the possible importance of their

297: placement in the Internet. Cache servers in Russian academic networks

298: are organized in the hierarchy sketched in Figure~\ref{caches}. User queries go

299: through the local proxy servers to regional cache servers, which may

300: redistribute them to the servers on national research and educational

301: networks, which may send queries to the neighboring caches or directly to

302: the destination. Some queries may also be sent to IRCache servers.

303: We must note that the cache server network is a

304: logical one, programmable, and does not reflect Internet connectivity but

305: is rather some subgraph of the Internet.

306:

307: We must note here that information in the datasets is private and is

308: subject to a privacy policy agreement. We therefore use all datasets

309: {\em available} to us.

310:

311: Each record contains information on the requested document (URL). A

312: typical URL looks like {\sf

313: protocol://web.site.name[:port]/path/to/document}. We treat the

314: substring between the `//' and `/' characters (omitting the `:port'

315: field if present\footnote{As a rule, requests with the `:port' field

316: are about 2\% of all requests, probably because some Russian websites

317: often use the port value for switching between various Cyrillic

318: encodings.}) as the website name. Only successful GET requests with

319: code 200 are included in our analysis.
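The extraction rule above can be sketched in a few lines of Python. This is only an illustration and not the processing code used for the paper: the function names and the record layout (method, status, URL) are hypothetical, and lower-casing of the host name is an added assumption.

```python
from urllib.parse import urlsplit


def site_name(url):
    """Return the part of a URL between '//' and the next '/', without ':port'.

    urlsplit(...).hostname drops the port and lower-cases the host;
    the lower-casing is an assumption of this sketch.
    """
    return urlsplit(url).hostname


def accepted_sites(records):
    """Yield website names for successful GET requests only.

    `records` is a hypothetical iterable of (method, status, url) tuples;
    the real log layout is not described in the text.
    """
    for method, status, url in records:
        if method == "GET" and status == 200:
            name = site_name(url)
            if name:
                yield name
```

For instance, `site_name("http://web.site.name:8080/path/to/document")` gives `"web.site.name"`, so requests that differ only in the port are counted for the same website.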

320:

321: We counted the number of requests for each website in the log for each

322: dataset. Those numbers divided by the total number of requests in the

323: dataset give us the {\em normalized rank distribution of websites by

324: popularity $f_r$}.
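The counting and normalization step can be sketched as follows. This is a minimal illustration with a hypothetical function name, assuming the website names have already been extracted from the log records.

```python
from collections import Counter


def rank_distribution(sites):
    """Return the normalized rank distribution f_r as a list.

    Element 0 is f_1 (the most popular website), element 1 is f_2, etc.
    Each value is the number of requests to that website divided by the
    total number of requests.
    """
    counts = Counter(sites)
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [n / total for n in ordered]
```

With six requests to three websites, `rank_distribution(["a", "a", "b", "a", "c", "b"])` returns the fractions 3/6, 2/6, and 1/6 in that order.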

325:

326: Fitting of the equations and parameter estimation were done by the nonlinear

327: least-squares method with Levenberg-Marquardt minimization.
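For orientation, the least-squares problem can be written as follows. This is a generic sketch: the model function $g(r;\theta)$ with parameter vector $\theta$, the Jacobian $J$, and the damping factor $\lambda$ are notation introduced here, and unweighted residuals are an assumption, since the weighting is not specified above.

\begin{displaymath}
S(\theta)=\sum_r\left[f_r-g(r;\theta)\right]^2,\qquad
\left(J^{T}J+\lambda\,\mathrm{diag}(J^{T}J)\right)\delta=J^{T}\left[f-g(\theta)\right],
\end{displaymath}

\noindent where $J_{rk}=\partial g(r;\theta)/\partial\theta_k$ and each Levenberg-Marquardt iteration updates $\theta\rightarrow\theta+\delta$, with $\lambda$ decreased after a step that lowers $S$ and increased otherwise.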

328:

329: \section{Discussion}

330: \label{discussion}

331:

332: Normalized rank distributions (the fraction of requests to a given

333: website as a function of the corresponding rank) are presented on a

334: log-log scale in Figures~\ref{Fig1}, \ref{Fig2}, and \ref{Fig3}.

335: Figure~\ref{Fig1} shows results for four datasets with the names {\em 1996}

336: (squares), {\em 1997} (circles), {\em 2000} (up triangles), and {\em 2001} (down

337: triangles) as defined in Table~\ref{table-sets}. All of them were

338: collected by the same proxy site {\em CHG}. Consulting

339: Table~\ref{table-sets}, we can conclude from Figure~\ref{Fig1} that

340: the rank distributions for all four datasets coincide well in the

341: ``middle'' straight-line part of about two decades and that the larger

342: the sample size, the larger this middle region is. We can therefore

343: conclude that the rank distribution does not change qualitatively in

344: five years and that the rank distribution comes closer and closer to

345: the ideal Zipf law.

346:

347: Our goal in Figure~\ref{Fig2} is to demonstrate how a rank

348: distribution depends on the period of observation. For that reason, we

349: plot four distributions obtained from the datasets {\em 2001-09-03}

350: (squares), {\em 2001-09-1w} (circles), {\em 2001-09} (up triangles), and

351: {\em 2001} (down triangles). Clearly, the distribution does not vary in time but

352: becomes more ``flat'' in the middle part with the longer period

353: (larger sample size).

354:

355: Finally, Figure~\ref{Fig3} demonstrates that rank distributions with

356: nearly equivalent sample sizes are independent of the placement of

357: the observer (i.e., cache server) in the Internet geography (at least,

358: for the Russian academic networks). We plot seven datasets, {\em 2001}

359: (squares), {\em ikia-2001} (circles), {\em wc-2001} (up-triangles),

360: {\em ikia-2002} (down-triangles), {\em ras-2002} (diamonds),

361: {\em wc-2002} (left-triangles), and {\em yar-2002} (right-triangles).

362: Figure~\ref{Fig3} is quite convincing that

363: the rank distribution of websites is independent of the placement

364: of the web cache in the hierarchy.

365:

366: Overall, it can be seen that rank distributions corresponding to

367: different data\-sets coincide well for the middle values of ranks.

368: Therefore, the fraction of user requests coming to ``mainstream''

369: websites (which are often encountered in logs but are still less

370: popular than top sites) is stable and does not vary with time

371: (Figure~\ref{Fig1}), with dataset size (Figure~\ref{Fig2}), or with

372: proxy location (Figure~\ref{Fig3}).

373:

374: One more common feature of all graphs is the divergence of the rank

375: distributions in the ``tails'', the rightmost parts of the graphs. The rank

376: distribution turns down strongly in the tails, where the websites were

377: requested less than about $100$ times.

378:

379: There is an interesting peculiarity seen in Figure~\ref{Fig1}: the

380: fraction of requests coming to the most popular sites decreases with

381: time. For example, the frequency of occurrences of the most popular

382: website in 1996 was about an order of magnitude higher than in 2001.

383: Because the most frequent requests come to different kinds of banners,

384: counters, search engines, etc., Figure~\ref{Fig1} demonstrates that

385: their relative popularity diminishes with time. One possible reason is

386: the appearance of many different sites with similar contents (as well

387: as mirror sites) or functions (e.g., banner networks or search

388: engines), which spreads user interest more evenly over different hot

389: sites. Another reason is the improvement of web-client software. The

390: internal cache of the web browser can contain more web documents;

391: requests to the most popular documents are then processed using the

392: internal cache. This phenomenon is known as the ``trickle-down'' effect

393: observed by Doyle et al.~\cite{Doyle02}, which is discussed below.

394:

395: Figure~\ref{Fig2} demonstrates that the top sites have a stable

396: fraction of requests during a given year.

397:

398: Figures~\ref{Fig2} and \ref{Fig3} show that the Zipf-like law~(\ref{Zipf-like}) (which would

399: appear as a straight line on a log-log plot) is a very coarse approximation of

400: the actual distribution. The main deviations from the

401: law~(\ref{Zipf-like}) are in the region of the most popular (top 50)

402: sites and in the tail of the distribution.

403:

404: Fitting the data to the Zipf-like law, expression~(\ref{Zipf-like}), and its

405: modifications, expressions (\ref{Zipf-Mandelbrot}) and (\ref{BestFit}), is

406: a tricky problem both because of the poor statistics at

407: large ranks and because of the large fluctuations at the leading ranks.

408: Which method is best is not yet understood~\cite{Crovella99}. We use a

409: least-squares fit to estimate the parameters, calculate the uncertainty of

410: the estimated values by the standard approach, and give it in

411: parentheses as the uncertainty in the last digits.

412:

413: We can choose a region of ranks of two orders of magnitude where the

414: rank distribution looks like a straight line. But varying the interval

415: boundaries of the rank window strongly affects the fitting parameters

416: (e.g., the exponent $\alpha$). We obtained $\alpha$ in the range from

417: $0.7$ to $1.4$ depending on the rank window. For example, fitting

418: dataset {\em 2001-09} with the Zipf-like law~(\ref{Zipf-like}) in the

419: window $10\le r\le 1000$ gives $\alpha=0.78$ and in the window $10^3\le

420: r\le 10^5$ gives $\alpha=1.13$. Other fitting windows give other

421: values in the range from $0.7$ to $1.4$. We can therefore conclude

422: that the Zipf-like law cannot give us quantitative characteristics of

423: rank distributions of websites in the whole interval of ranks.
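The window dependence can be illustrated on synthetic data. The sketch below is not a reanalysis of the logs: it generates points from a shifted power law $1/(c+r)^\alpha$ with $c=15.16$ and $\alpha=1.08$ (values taken from the fit of dataset {\em 2001-09}, with the additive term set to zero and the amplitude set to one for illustration), and then fits a straight line in log-log coordinates in two rank windows. The function names and the choice of 50 log-spaced ranks per window are assumptions. It shows only the direction of the effect and is not expected to reproduce the values quoted above.

```python
import math


def loglog_slope(ranks, freqs):
    """Ordinary least-squares slope of log10(f) versus log10(r)."""
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def log_spaced(lo, hi, points=50):
    """Ranks spaced evenly on a logarithmic scale between lo and hi."""
    step = (math.log10(hi) - math.log10(lo)) / (points - 1)
    return [10 ** (math.log10(lo) + i * step) for i in range(points)]


# Synthetic model only: shifted power law with illustrative parameters.
C = 15.16
ALPHA = 1.08


def shifted_power_law(r):
    return 1.0 / (C + r) ** ALPHA


def window_exponent(lo, hi):
    """Apparent Zipf-like exponent from a straight-line fit in [lo, hi]."""
    ranks = log_spaced(lo, hi)
    freqs = [shifted_power_law(r) for r in ranks]
    return -loglog_slope(ranks, freqs)


low_window = window_exponent(10, 1000)
high_window = window_exponent(1000, 100000)
```

In this synthetic case the apparent exponent in the window $10\le r\le 1000$ comes out noticeably smaller than in the window $10^3\le r\le 10^5$, where it approaches the underlying $\alpha$, because the shift $c$ flattens the curve at small ranks.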

424:

425: Slightly better results can be derived using a modified Zipf-like law,

426: known as the {\em Zipf--Mandelbrot} law~\cite{Mandelbrot},

427:

428: \begin{equation}

429: \label{Zipf-Mandelbrot}

430: f_r=\frac{b}{(c+r)^{\alpha}},

431: \end{equation}

432:

433: \noindent which gives a better approximation in the range of small

434: ranks but is still inapplicable in the ``tails''. The fit can be

435: appreciably enhanced by introducing one more parameter

436: in~(\ref{Zipf-Mandelbrot}):

437:

438: \begin{equation}

439: f_r=a+\frac{b}{(c+r)^{\alpha}}.

440: \label{BestFit}

441: \end{equation}
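Moving $a$ to the left-hand side of~(\ref{BestFit}) and taking logarithms shows why shifted coordinates are convenient (a short worked step, valid where $f_r-a>0$):

\begin{displaymath}
\log(f_r-a)=\log b-\alpha\log(c+r),
\end{displaymath}

\noindent that is, a straight line with slope $-\alpha$ in the coordinates $\log(c+r)$, $\log(f_r-a)$. For $r\gg c$ and $f_r\gg|a|$ this reduces to Zipf-like law~(\ref{Zipf-like}).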

442:

443: Figure~\ref{Fig5} shows the rank distribution of websites in the

444: coordinates $\log(f_r-a)$, $\log(c+r)$ for the particular dataset

445: {\em 2001-09}. The fraction of requests (the vertical axis) is shifted by

446: the value $a=-1.44\cdot 10^{-6}$ and the rank by $c=15.16$. This

447: figure clearly demonstrates that function~(\ref{BestFit}) approximates

448: the data distribution well in almost the entire range of

449: ranks.\footnote{We note that this method for data ``straightening'' is

450: often applied in statistical physics~\cite{Efros,Lev2}. A similar equation

451: was also proposed in a recent work on rank distribution of

452: publication popularity~\cite{Han2004}.} We have

453: fitted expression~(\ref{BestFit}) to all our data and found that the

454: value of $\alpha$ is quite stable; the results are presented in

455: Table~\ref{Alpha} for the datasets discussed. The columns in

456: Table~\ref{Alpha} are the dataset name as defined in

457: Table~\ref{table-sets} and resulting values of $a$, $c$, and $\alpha$

458: as defined in expression~(\ref{BestFit}). The mean of the exponent

459: $\alpha$ is $1.02 \pm 0.05$, which may be considered $1.0$. The

460: statistical error is calculated as the variation of $\alpha$ from the

461: data in Table~\ref{Alpha}.
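The quoted mean and spread can be checked directly from the tabulated exponents. The sketch below assumes that the ``variation'' is the standard deviation of the thirteen values; the population form is used here, which is an assumption, and the sample form gives a slightly larger value of about 0.055.

```python
from statistics import mean, pstdev

# Exponent values copied from the table of fitting results for Russian servers.
ALPHAS = [0.95, 0.92, 1.04, 1.06, 1.08, 1.03, 0.99,
          1.07, 1.01, 1.09, 1.06, 0.95, 0.99]

alpha_mean = mean(ALPHAS)
alpha_spread = pstdev(ALPHAS)  # population standard deviation (assumption)

print(f"{alpha_mean:.2f} +/- {alpha_spread:.2f}")  # prints "1.02 +/- 0.05"
```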

462:

463: \begin{table}

464: \centering

465: \caption{Fitting Results for Russian Servers}

466: \begin{tabular}{llrc} \\

467: \hline

468: Dataset &  $a$ & $c$ & $\alpha$\\

469: \hline

470: \em{1996} &  $-3.0(1)\cdot10^{-5}$        & $0.45(4)$        & $0.95(5)$\\

471: \em{1997} &  $-5.77(2)\cdot10^{-6}$     & $2.96(5)$        & $0.92(3)$\\

472: \em{2000} &  $-1.01(11)\cdot10^{-6}$     & $7.33(7)$        & $1.04(3)$\\

473: \em{2001} &  $-2.48(3)\cdot10^{-7}$     & $9.10(5)$        & $1.06(2)$\\

474: \em{2001-09} & $-1.44(27)\cdot10^{-6}$    & $15.16(11)$       & $1.08(7)$\\

475: \em{2001-09-1w} & $-7.25(6)\cdot10^{-6}$  & $14.82(20)$       & $1.03(2)$\\

476: \em{2001-09-03} & $-2.01(7)\cdot10^{-5}$     & $17.82(72)$       & $0.99(6)$\\

477: \em{ikia-2001} & $-5.10(7)\cdot10^{-7}$  & $13.35(7)$       & $1.07(3)$\\

478: \em{ikia-2002} & $-1.58(9)\cdot10^{-6}$ & $4.53(16)$        & $1.01(1)$\\

479: \em{wc-2001} & $-5.56(9)\cdot10^{-7}$   & $14.54(9)$       & $1.09(4)$\\

480: \em{wc-2002} & $-4.43(7)\cdot10^{-7}$   & $14.02(5)$       & $1.06(3)$\\

481: \em{ras-2002} & $-9.45(2)\cdot10^{-7}$  & $9.17(10)$        & $0.95(5)$\\

482: \em{yar-2002} & $-1.30(3)\cdot10^{-6}$  & $4.64(4)$        & $0.99(5)$\\

483: \hline

484: \end{tabular}

485: \label{Alpha}

486: \end{table}

487:

488: The parameter $a$ can be considered a correction for the finite sample

489: size. The larger the sample size, the smaller the magnitude of $a$.

490:

491: The parameter $c$ in expression~(\ref{BestFit}) has a very clear

492: physical meaning. It is closely connected with the {\em trickle-down

493: effect} observed by Doyle et al.~\cite{Doyle02}. Doyle et al. found that proxies

494: disproportionately absorb requests on different levels of the

495: hierarchy. Rank distributions obtained from data collected on proxies

496: at different hierarchical levels differ in the region of small ranks.

497: This effect has a clear explanation in terms of rank distributions.

498:

499: As a clarifying example, we consider a two-layer hierarchy of proxies.

500: A first-level proxy receives requests from users. If the requested

501: document is found in its cache, then that document is returned to the

502: client; otherwise, the request is submitted to an upper-level proxy.

503: If we assume that a first-level proxy can hold $N$ documents in its

504: cache, then it accordingly filters the $N$ most popular documents from

505: the request stream, i.e., it ``cuts'' the leftmost $N$ points from the

506: rank distribution. This is equivalent to the change of variables

507: $r\rightarrow r+N$. Therefore, we presume that the parameter $c$ in

508: equation~(\ref{BestFit}) characterizes cache sizes of low-level

509: proxies (which can also be the user's browser cache).
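The two-layer argument can be made concrete with a toy calculation. The sketch below is an idealization under the same assumption as in the text (the first-level cache absorbs every request for its $N$ most popular documents); the function names, the pure Zipf-like input, and the numbers are illustrative only.

```python
def zipf_counts(n_docs, alpha=1.0, scale=1_000_000):
    """Idealized request counts for ranks 1..n_docs (pure Zipf-like law)."""
    return [scale / r ** alpha for r in range(1, n_docs + 1)]


def upstream_view(counts, cache_size):
    """Counts seen by the upper-level proxy.

    The first-level proxy is assumed to serve the cache_size most popular
    documents itself, so the leftmost cache_size points are cut off.
    """
    return counts[cache_size:]


N = 20  # hypothetical cache size of the first-level proxy
client_side = zipf_counts(10_000)
proxy_side = upstream_view(client_side, N)

# Rank r at the upper level corresponds to rank r + N at the client side,
# so the upstream distribution has the form 1 / (N + r) ** alpha.
ratio_client = client_side[0] / client_side[9]  # f_1 / f_10 before filtering
ratio_proxy = proxy_side[0] / proxy_side[9]     # f_1 / f_10 after filtering
```

Here the ratio of the first to the tenth rank drops from 10 at the client side to $(10+N)/(1+N)=30/21$ at the upper level, which is the flattening at small ranks described by the shift $r\rightarrow r+N$.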

510:

511: It can be seen that for all datasets, $\alpha$ is close to unity with

512: an accuracy of a few percent. We therefore suppose that the exponent

513: $\alpha$ in equation~(\ref{BestFit}) is a universal characteristic of

514: web traffic, which is independent of time (for time-scales comparable

515: with the Internet lifetime), is independent of data collection

516: duration (when the sample size is sufficiently large and contains more

517: than $2{\times}10^5$ requests), and is independent of the placement

518: of the proxy server in the Internet hierarchy.

519:

520: We were able to check our findings using available

521: statistics. We chose BU web-client traces available from {\sf

522: ita.ee.lbl.gov} (the full dataset from Nov 94 to May 95 contains

523: 1143842 requests, 104532 unique URLs, and 4970 unique sites). This

524: dataset was used in early work and gives one of the best examples of

525: the Zipf law for web-page popularity ($\alpha=0.986$)~\cite{Cunha95}.

526: Fitting equation (\ref{BestFit}) to the rank distribution of website

527: popularity gives $\alpha=1.025$, $a=-3.3\cdot 10^{-5}$, and $c=1.97$,

528: which agree well with the values obtained for Russian academic

529: networks. This is an additional argument that website popularity

530: distribution is universal (in other words, is independent of both the

531: observation point in the Internet and Internet history) and

532: follows the Zipf law with an exponent $\alpha$ close to unity.

533:

534: \begin{table}

535: \centering

536: \caption{Characteristics of Analyzed Web Datasets in USA and Fitting

537: Results}

538: \begin{tabular}{lrrrrr}\\

539: \hline

540: cache & \# of & $N=$\# of & $aN$ & $c$ & $\alpha$ \\

541:       &requests &websites &      &     &    \\ \hline

542: {\em bo}&23935604&592679&-2.89(1)&8.54(4)&1.05(2) \\

543: {\em ny}&12789266&407952&-3.89(1)&-0.12(1)&0.94(3) \\

544: {\em pa}&3374392&229633&-1.57(1)&7.17(12)&0.96(8) \\

545: {\em pb}&10018478&304049&-4.47(1)&18.96(13)&0.98(4) \\

546: {\em rtp}&13221655&339918&-4.35(1)&23.52(13)&1.01(4) \\

547: {\em sd}&13840665&285356&-3.22(1)&0.166(7)&1.04(3) \\

548: {\em sj}&26130582&264396&-6.00(1)&1.935(13)&1.09(2) \\

549: {\em sv}&11119941&530731&-3.20(1)&16.34(13)&0.93(4) \\

550: {\em uc}&13294408&313178&-5.17(1)&15.14(9)&1.01(4) \\ \hline

551: {\em uc-12d}&3236853&84360&-4.37(2)&7.79(12)&0.95(8) \\

552: {\em uc-1d}&463899&13752&-1.77(4)&4.99(24)&0.96(3) \\ \hline

553: {\em all}&127724991&1176623&-8.96(1)&5.05(1)&1.03(2) \\ \hline

554: \label{table-setUS}

555: \end{tabular}

556: \end{table}

557:

558:

559: To check this statement further, we also analyzed recently available

560: data\footnote{Thanks to D.  Wessels, who kindly gave us access to the

561: data sets collected at the US IRCache servers.} collected during the period

562: from 11/03/2004 to 12/29/2004 at nine cache-servers of the US national

563: cache-mesh system for science and education built within the IRCache

564: project~\cite{ircache}. Table~\ref{table-setUS} presents data from the

565: following locations:

566:

567: \begin{itemize}

568: \item {\em bo} -- NCAR at Boulder, Colorado

569: \item {\em ny} -- New York, New York

570: \item {\em pa} -- Digital Internet Exchange in Palo Alto, California

571: \item {\em pb} -- PSC at Pittsburgh, Pennsylvania

572: \item {\em rtp} -- Research Triangle Park, North Carolina

573: \item {\em sd} -- SDSC at San Diego, California

574: \item {\em sj} -- MAE West Exchange Point in San Jose, California

575: \item {\em sv} -- NASA-Ames/FIX-West in Silicon Valley, California

576: \item {\em uc} -- NCSA at Urbana-Champaign, Illinois.

577: \end{itemize}

578:

579: \noindent The second and third entries from the bottom demonstrate the

580: stability of the fit for two subsets of the data collected at the {\em uc}

581: location, for 12 days (set name {\em uc-12d}) and for 1 day (set {\em

582: uc-1d}). The last entry represents the fit to the sum of the nine full

583: data sets. The values of $\alpha$ obtained from the fit by expression~(\ref{BestFit}) are close to

584: unity and quite similar to those for Russian servers presented in

585: Table~\ref{Alpha}.

586:

587:

588: \section{Conclusions}

589: \label{conclusions}

590:

591: We have presented a modified Zipf law~(\ref{BestFit}), which fits the rank

592: distribution of websites over the full range of ranks rather well. We found

593: that the value of the exponent $\alpha$ in expression~(\ref{BestFit}) is

594: stable for the analyzed datasets. It does not vary with (1) the year of data

595: collection, (2) the period of data collection, or (3) the geographical

596: location of the cache server where we collected data. We found that

597: $\alpha$ is very close to $1$. We have reason to suppose that this value of

598: $\alpha$ is a universal property of web traffic with respect to website rank. We

599: have also presented a clear explanation of the ``trickle-down effect''

600: based on the properties of our modified Zipf law. We suggest that website

601: popularity is a universal property of the Internet and follows the Zipf law.
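
The role of $\alpha$ in this statement can be made explicit by a short
rearrangement. As a sketch only, assume that expression~(\ref{BestFit}) has
the form suggested by the modified coordinates of Fig.~\ref{Fig5}, with shift
parameters $a$ and $c$ and a normalization that we denote here by $b$ (the
symbol $b$ is a label introduced for this illustration):
\[
f_r = a + \frac{b}{(r+c)^{\alpha}}
\quad\Longrightarrow\quad
\log (f_r - a) = \log b - \alpha \log (r+c).
\]
Under this assumption, the data fall on a straight line of slope $-\alpha$
when $f_r-a$ is plotted against $r+c$ on a double logarithmic scale. For
$r \gg |c|$ and $|a| \ll b\,r^{-\alpha}$ the expression reduces to the
ordinary Zipf law $f_r \approx b\,r^{-\alpha}$.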

602:

603: In a similar experiment~\cite{KS-tri}, fluctuations of the exponent value were

604: examined as a function of the volume of statistics, using

605: cache traces of user requests to different Internet domains.

606: User requests were sent to the Internet through a cache triangle:

607: they went to the Master Server, which sent each odd request to the left

608: cache and each even request to the right cache. Clearly, the two traces should

609: be nearly equivalent in the limit of a large number of requests. Indeed, it was

610: estimated that the exponents extracted separately from the ``left'' and

611: ``right'' traces agreed within five percent for set volumes larger than

612: ten thousand requests, whereas those for set volumes smaller than a few

613: hundred requests fluctuated strongly. Thus, insufficient statistics may significantly

614: affect the results.
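
A rough counting argument illustrates why small traces fluctuate. It is a
sketch under an assumption of ours, not the estimate of~\cite{KS-tri}:
suppose that each of the $n$ requests to a given site lands in the left or
the right cache independently with probability $1/2$, so that the left count
$n_L$ is binomial and $n_R = n - n_L$. Then
\[
\langle n_L \rangle = \frac{n}{2}, \qquad
\mathrm{Var}(n_L - n_R) = \mathrm{Var}(2 n_L - n) = 4 \cdot \frac{n}{4} = n,
\qquad
\frac{\sqrt{\mathrm{Var}(n_L - n_R)}}{n} = \frac{1}{\sqrt{n}} .
\]
For $n = 10^4$ the relative mismatch between the two counts is about $1\%$,
whereas for $n = 10^2$ it is about $10\%$. Frequencies built from only a few
hundred requests can therefore differ appreciably between the two traces,
which is consistent with the strong fluctuation of the fitted exponents at
small volumes.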

615:

616: The results in this paper may be useful for building mirror sites and

617: CDNs as well as for improving software for DNS request caching. We

618: also conjecture that fitting with the modified Zipf law is suitable

619: for describing the rank distribution of web-document popularity.

620:

621: \begin{figure}

622: \centering

623: \psfig{file=fig1.eps,width=\columnwidth}

624: \caption{Website distribution for different years}

625: \label{Fig1}

626: \end{figure}

627:

628: \begin{figure}

629: \centering

630: \psfig{file=fig2.eps,width=\columnwidth}

631: \caption{Website distribution for different periods}

632: \label{Fig2}

633: \end{figure}

634:

635: \begin{figure}

636: \centering

637: \psfig{file=fig3.eps,width=\columnwidth}

638: \caption{Website distribution for different servers}

639: \label{Fig3}

640: \end{figure}

641:

642: \begin{figure}

643: \centering

644: \psfig{file=fig5c.eps,width=\columnwidth}

645: \caption{Website distribution in modified coordinates: dependence

646: of $f_r-a$ on $r+c$ (compare to expression~(5)) on a double logarithmic

647: scale.}

648: \label{Fig5}

649: \end{figure}

650:

651: \section{Acknowledgment}

652:

653: The authors thank the anonymous referees for the valuable remarks and

654: comments that allowed us to improve this paper.

655: Special thanks to Duane Wessels for access to logs from IRCache web-cache servers.

656:

657: This work was supported by the Russian Foundation for Basic Research.

658:

659:

660: \begin{thebibliography}{99}

661:

662: \bibitem{Glassman94} Steven Glassman, {\it A caching Relay for the World Wide

663: Web.} Proc. 1st Int. Conference on the World-Wide Web, CERN,

664: Geneva (Switzerland), May 1994. Computer Networks and ISDN Systems, 27(2), 165-173 (1994).

665:

666: \bibitem{Breslau99} Lee Breslau, Pei Cao, Li Fan, G. Phillips, S.

667: Shenker, {\it Web Caching and Zipf-like Distributions: Evidence and

668: Possible Implications}, Proc. IEEE INFOCOM '99: 18th Annual Joint

669: Conference of the IEEE Computer and Communications Societies, Volume: 1,

670: p.~126-134, 1999.

671:

672: \bibitem{Kelly02} Terence Kelly, Jeffrey Mogul.  {\it Aliasing on the

673: World Wide Web: Prevalence and performance implications}. Proc. 11th Int.

674: WWW Conf., Honolulu, May 2002, ACM Press, pp.281 - 292.

675:

676: \bibitem{Doyle02}  Ronald P. Doyle, Jeffrey S. Chase, Syam Gadde,

677: Amin M. Vahdat.  {\it The trickle-down effect: Web caching and

678: server request distribution}.  Computer Communications, {\bf 25},

679: 345-356 (2002).

680:

681: \bibitem{Pareto} V. Pareto, {\it Cours d'economie politique}, Rouge,

682: Lausanne et Paris, 1897.

683:

684: \bibitem{Zipf49} G. K. Zipf, {\it Human Behavior and the Principle of

685: Least-Effort.} Addison-Wesley, Cambridge, MA, 1949.

686:

687: \bibitem{Li92} W. Li, {\it Random texts exhibit Zipf's-law-like word

688: frequency distribution}.  IEEE Trans. Inform. Theory, {\bf 38}(6),

689: 1842-1845 (1992).

690:

691: \bibitem{Mantegna} R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S.

692: Havlin, C. K. Peng, M. Simons, and H. E. Stanley. {\it Linguistic Features

693: of Noncoding DNA Sequences}, Phys. Rev. Lett. {\bf 73}, 3169-3172 (1994).

694:

695: \bibitem{Martindale} C. Martindale, A. K. Konopka, {\it Oligonucleotide

696: frequencies in DNA follow a Yule distribution}, Computers \& Chemistry,

697: {\bf 20}, 35-38 (1996).

698:

699: \bibitem{Mandelbrot} B. B. Mandelbrot, {\it The Fractal Geometry of

700: Nature}. Freeman, New York, 1977.

701:

702: \bibitem{Efros} B.I. Shklovskii, A.L. Efros, {\it Electronic Properties of

703: Doped Semiconductors}, Percolation Theory (Springer-Verlag, Berlin,

704: 1984), pp. 95-136.

706:

707: \bibitem{Lev2} L.N. Shchur, {\it Incipient Spanning Clusters in Square and

708: Cubic Percolation}, in Springer Proceedings in Physics, ``Computer

709: Simulation Studies in Condensed Matter Physics XII'', Eds. D.P. Landau,

710: S.P. Lewis, and H.B. Sch\"uttler, (Springer-Verlag, Berlin, 2000).

711:

712: \bibitem{Han2004} Ding-Ding Han, Jin-Gao Liu, Yu-Gang Ma, Xiang-Zhou Cai,

713: Wen-Qing Shen. {\it Scale-free download network for publications}.

714: Chin. Phys. Lett., {\bf 21}(9), 1855-1857 (2004); arXiv:cond-mat/0405428

715: (2004).

716:

717: \bibitem{Zanette97} Dami\'an H. Zanette, Susanna C. Manrubia, {\it Role

718: of intermittence in urban development: A model of large-scale city

719: formation}.  Phys. Rev. Lett., {\bf 79}(3), 523-526 (1997).

720:

721: \bibitem{Marsili98} Matteo Marsili, Yi-Cheng Zhang, {\it Interacting

722: individuals leading to Zipf's law}.  Phys. Rev. Lett., {\bf 80}(12),

723: 2741-2744 (1998).

724:

725: \bibitem{Redner98} S. Redner. {\it How popular is your paper? An empirical

726: study of the citation distribution}.  Eur. Phys. J. B {\bf 4}, 131-134

727: (1998).

728:

729: \bibitem{Malamud} B. D. Malamud, G. Morein, D. L. Turcotte. {\it Forest

730: fires: an example of self-organized critical behavior}.  Science, {\bf

731: 281}, 1840-1842 (1998).

732:

733: \bibitem{Newman2004} M.E.J. Newman. {\it Power laws, Pareto distributions

734: and Zipf's law}. arXiv:cond-mat/0412004 (2004).

735:

736: \bibitem{Mitzenmacher2004} Michael Mitzenmacher. {\it A brief history of

737: generative models for power law and lognormal distributions}.

738: Internet Mathematics, {\bf 1}(2), 226-251 (2004).

739:

740: \bibitem{Laherrere1998} J. Laherr\`ere, D. Sornette. {\it Stretched

741: exponential distributions in nature and economy: ``fat tails'' with

742: characteristic scales}. Eur. Phys. J., {\bf B 2}, 525-539 (1998).

743:

744: \bibitem{website} Azer Bestavros. {\it WWW traffic reduction and load

745: balancing through server-based caching}.  IEEE Concurrency, {\bf 5}(1),

746: 56-67 (1997);

747: Takashi Hatashima, Toshihiro Motoda, Shuichiro Yamamoto.  {\it An

748: ``interest'' index for WWW servers and CyberRanking}.  IEICE Trans. Inf.

749: \& Syst., {\bf E83-D}, 729-734 (2000);

750: Venkata N. Padmanabhan, Lili Qiu.  {\it The content and access dynamics

751: of a busy web site: Findings and implications}.  Proc. ACM SIGCOMM'00,

752: Stockholm, Sweden, 2000, pp. 111-123;

753: Adeniyi Oke, Rick Bunt.  {\it Hierarchical workload characterization for

754: a busy web server}.  In: Computer Performance Evaluation (Ed. T. Field,

755: P.G. Harrison, J. Bradley, U. Harder). Springer-Verlag: Berlin ea, 2002,

756: pp.309-328. Proc. TOOLS'2002: 12th Int. Conf. on Modeling Techniques and

757: Tools, London, UK, April 14-17 2002. [Lecture Notes in Computer Science,

758: Vol. 2324].

759:

760: \bibitem{Aida98} Masaki Aida, Noriyuki Takahashi, Tetsuya Abe.

761: {\it A proposal of dual Zipfian model for describing HTTP access trends

762: and its application to address cache design}. IEICE Trans. Commun.,

763: {\bf E81-B} (7), 1475-1485 (1998).

764:

765: \bibitem{Cunha95} C. R. Cunha, A. Bestavros, M. E. Crovella, {\it

766: Characteristics of WWW Client-based Traces}, Technical report

767: BU-CS-95-010, Boston University, July, 1995.

768:

769: \bibitem{Barford99} P. Barford, A. Bestavros, A. Bradley, M. Crovella, {\it

770: Changes in Web client access patterns: Characteristics and caching implications}.

771: World Wide Web J., Spec. Issue on Characterization and

772: Performance Evaluation, {\bf 2}, 15-28 (1999).

773:

774: \bibitem{Jin2000} Shudong Jin and Azer Bestavros, {\it Sources and

775: Characteristics of Web Temporal Locality}. Proc. MASCOTS'2000: The 8th

776: IEEE/ACM International Symposium on Modeling, Analysis and Simulation of

777: Computer and Telecommunication Systems, San Francisco, CA, 29 Aug - 1 Sept

778: 2000. p.28-35.

779:

780: \bibitem{Mahanti99} A. Mahanti, C. Williamson.  {\it Web proxy workload

781: characterization}.  Tech. Report, Department of Computer Science,

782: University of Saskatchewan, February 1999.

783:

784: \bibitem{Crovella99} M.E. Crovella and M.S. Taqqu, {\it Estimating the

785: Heavy Tail Index from Scaling Properties}, In: Methodology and Computing in

786: Applied Probability, {\bf 1}, 55-79 (1999).

787:

788: \bibitem{Mahanti00} A. Mahanti, C. Williamson, D. Eager, {\it Traffic

789: analysis of a Web proxy caching hierarchy}. IEEE Network Magazine, {\bf

790: 14}(3), 16-23 (May/Jun 2000).

791:

792: \bibitem{Roadknight99} Chris Roadknight, Ian Marshall, and Deborah Vearer.

793: {\it File Popularity Characterization}. Proc. WISP'99: 2nd Workshop on

794: Internet Server Performance, Atlanta, Georgia, May 1999.

795:

796: \bibitem{Almeida96} V. Almeida, A. Bestavros, M. E. Crovella,

797: A. de Oliveira, {\it Characterizing Reference Locality in the

798: WWW}, Proc. PDIS'96, Dec. 1996, p.~92-103.

799:

800: \bibitem{Nishikawa98} N. Nishikawa, T. Hosokawa, Y. Mori, K. Yoshida, H.

801: Tsuji. {\it Memory-based architecture for distributed WWW caching proxy}.

802: Computer Networks and ISDN Systems, {\bf 30}, 205-214 (1998).

803:

804: \bibitem{sakr98} Serge Krashakov, Lev Shchur. {\it WWW Caching in

805: Russia - Current State and Future Development}. Proc. 3rd Int. Web Caching

806: Workshop, Manchester, June 15-17, 1998.

807:

808: \bibitem{ircache} The IRCache project - {\tt http://www.ircache.net}.

809:

810: \bibitem{KS-tri} Sergey A. Krashakov, Lev N. Shchur.  {\it Active measurements

811: (experiments) of the Internet traffic using cache-mesh}.  Int. J. Modern

812: Physics C, {\bf 12}, 549-562  (2001).

813:

814: \end{thebibliography}

815:

816: \end{document}

817: