1: \documentclass[11pt]{article}
2: 
3: \newif\ifpdf
4: \ifx\pdfoutput\undefined
5: \pdffalse % we are not running PDFLaTeX
6: \else
7: \pdfoutput=1 % we are running PDFLaTeX
8: \pdftrue
9: \fi
10: 
11: \ifpdf
12: \usepackage[pdftex]{graphicx}
13: \else
14: \usepackage{graphicx}
15: \fi
16: 
17: \textwidth = 6.5 in
18: \textheight = 9 in
19: \oddsidemargin = 0.0 in
20: \evensidemargin = 0.0 in
21: \topmargin = 0.0 in
22: \headheight = 0.0 in
23: \headsep = 0.0 in
24: \parskip = 0.2in
25: \parindent = 0.0in
26: 
27: \usepackage{cite}
28: \usepackage{citesupernumber}
29: \usepackage{nature}
30: \bibliographystyle{nature}
31: 
32: \title{Links tell us about lexical and semantic Web content}
33: 
34: \author{Filippo Menczer\\
35: Department of Management Sciences\\
36: The University of Iowa\\
37: Iowa City, IA 52242\\
38: %\texttt{filippo-menczer@uiowa.edu}
39: }
40: 
41: \begin{document}
42: 
43: \ifpdf
44: \DeclareGraphicsExtensions{.pdf, .png}
45: \else
46: \DeclareGraphicsExtensions{.eps, .png}
47: \fi
48: 
49: \date{}
50: 
51: \maketitle
52: 
53: \textbf{The latest generation of Web search tools is beginning to 
54: exploit hypertext link information to improve 
55: ranking\cite{Brin98,Kleinberg98} and 
56: crawling\cite{Menczer00,Ben-Shaul99etal,Chakrabarti99} algorithms.  The 
57: hidden assumption behind such approaches, a correlation between the 
58: graph structure of the Web and its content, has not been tested 
59: explicitly despite increasing research on Web 
60: topology\cite{Lawrence98,Albert99,Adamic99,Butler00}.  Here I 
61: formalize and quantitatively validate two conjectures drawing 
62: connections from link information to lexical and semantic Web content.  
63: The \emph{link-content conjecture} states that a page is similar to 
64: the pages that link to it, i.e., one can infer the lexical content of 
65: a page by looking at the pages that link to it.  I also show that 
66: lexical inferences based on link cues are quite heterogeneous across 
67: Web communities.  The \emph{link-cluster conjecture} states that pages 
68: about the same topic are clustered together, i.e., one can infer the 
69: meaning of a page by looking at its neighbours.  These results explain 
70: the success of the newest search technologies and open the way for 
71: more dynamic and scalable methods to locate information in a topic- or 
72: user-driven way.}
73: 
74: %:\section{Intro}
75: 
76: All search engines perform two basic functions: (i) crawling Web 
77: pages to maintain an index, and (ii) matching URLs in the index 
78: database against user queries.  Effective search engines achieve a 
79: high coverage of the Web, keep their index fresh, and rank hits in a 
80: way that correlates with the user's notion of relevance.  Ranking and 
81: crawling algorithms use cues from words and hyperlinks, associated 
82: respectively with \emph{lexical} and \emph{link topology}.  In the 
83: former, two pages are close to each other if they have similar textual 
84: content; in the latter, if there is a short path between them.  Lexical 
85: metrics are traditionally used by search engines to rank hits 
86: according to their similarity to the query, thus attempting to infer 
87: the semantics of pages from their lexical representation.  Similarity 
88: metrics are derived from the vector space model\cite{Salton83}, which 
89: represents each document or query by a vector with one dimension for 
90: each term and a weight along that dimension that estimates the term's 
91: contribution to the meaning of the document.  The \emph{cluster 
92: hypothesis} behind this model is that a document lexically close to a 
93: relevant document is also relevant with high probability\cite{vanR79cluster}.  
94: Links have traditionally been used by search engine crawlers only in 
95: exhaustive, centralized algorithms.  However, the latest generation of 
96: Web search tools is beginning to integrate lexical and link metrics to 
97: improve ranking and crawling performance through better models of 
98: relevance.  The best known example is the \emph{PageRank} metric used 
99: by Google: pages containing the query's lexical features are ranked 
100: using query-independent link analysis\cite{Brin98}.  Links are also 
101: used in conjunction with text to identify hub and authority pages for 
102: a certain subject\cite{Kleinberg98}, determine the reputation of a 
103: given site\cite{Mendelzon00}, and guide search agents crawling on 
104: behalf of users or topical search 
105: engines\cite{Menczer00,Ben-Shaul99etal,Chakrabarti99}.  
106: 
107: %:\section{The link-content conjecture}
108: 
109: To study the connection between link and lexical topologies, I 
110: conjecture a positive correlation between distance measures defined in 
111: the two spaces.  Given any pair of Web 
112: pages $(p_{1},p_{2})$ we have well-defined distance functions 
113: $\delta_{l}$ and $\delta_{t}$ in link and lexical space, 
114: respectively.  To compute $\delta_{l}(p_{1},p_{2})$ we use the Web 
115: hypertext structure to find the length, in links, of the shortest path 
116: from $p_{1}$ to $p_{2}$. (This is not a metric distance 
117: because it is not symmetric in a directed graph, but 
118: for convenience I refer to $\delta_{l}$ as ``distance''.) 
119: To compute $\delta_{t}(p_{1},p_{2})$ we 
120: can use the vector representations of the two pages, where the vector 
121: components (weights) of page $p$, $w_{p}^{k}$, are computed for terms 
122: $k$ in the textual content of $p$ given some weighting scheme.  One 
123: possibility would be to use Euclidean distance in this word vector 
124: space, or any other $L_{z}$ norm.
125: However, $L_{z}$ metrics have a dependency on the 
126: dimensionality of the pages, i.e., larger documents tend to appear 
127: more distant from each other than shorter ones, irrespective of 
128: content.  To circumvent this problem, one can instead define a metric 
129: based on the \emph{similarity} between pages.
130: Let us use the \emph{cosine similarity} function, a standard measure 
131: in information retrieval:
132: \begin{equation}
133: 	\sigma(p_{1},p_{2}) = \frac{\sum_{k \in p_{1} \cap p_{2}} 
134: 	w_{p_{1}}^{k} w_{p_{2}}^{k}}
135: 	{\sqrt{\sum_{k \in p_{1}} (w_{p_{1}}^{k})^{2} 
136: 	\sum_{k \in p_{2}} (w_{p_{2}}^{k})^{2}}}.
137: 	\label{eq:sim}
138: \end{equation}
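
As a minimal illustration of Equation~\ref{eq:sim}, consider two hypothetical pages; the terms and weights are made up for this example and are not taken from the crawl data. Let $p_{1}$ contain the terms \emph{web} and \emph{link} with weights 2 and 1, and let $p_{2}$ contain \emph{web} and \emph{search} with weights 1 and 2. Only \emph{web} is shared, so
% Hypothetical weights, chosen for illustration only.
\[
	\sigma(p_{1},p_{2}) = \frac{2 \cdot 1}{\sqrt{(2^{2}+1^{2})(1^{2}+2^{2})}} 
	= \frac{2}{5} = 0.4 .
\]
Multiplying every weight of $p_{1}$ by a constant $c>0$ scales both the numerator and the denominator by $c$ and leaves $\sigma$ unchanged, which is one way to see why the cosine measure is insensitive to the overall magnitude of the weight vectors.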
139: 
140: According to the link-content conjecture, $\sigma$ is anticorrelated 
141: with $\delta_{l}$.  The idea is to measure the correlation between the 
142: two distance measures across pairs of pages.  Figure~\ref{yahoo}
143: illustrates how a collection of Web pages was crawled and processed
144: for this purpose.
145: 
146: %:\subsection{Correlation of lexical and link distance}
147: 
148: The link distances $\delta_{l}(q,p)$ and similarities $\sigma(q,p)$ 
149: were averaged for each topic $q$ over all pages $p$ in the crawl set 
150: $P_{d}^{q}$ for each depth $d$:
151: \begin{eqnarray}
152: 	\delta(q,d) &\equiv& \langle \delta_{l}(q,p) \rangle_{P_{d}^{q}} =
153: 		\frac{1}{N_{d}^{q}} \sum_{i=1}^{d} i \cdot 
154: 		(N_{i}^{q} - N_{i-1}^{q}) \label{eq:Laver} \\
155: 	\sigma(q,d) &\equiv& \langle \sigma(q,p) \rangle_{P_{d}^{q}} = 
156: 		\frac{1}{N_{d}^{q}} \sum_{p \in P_{d}^{q}} 
157: 		\sigma(q,p). \label{eq:Saver}
158: \end{eqnarray}
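
To make Equation~\ref{eq:Laver} concrete, here is a small worked case with made-up page counts (not taken from the data). Assume that the topic page itself is the only page at distance 0, so that $N_{0}^{q}=1$, and suppose $N_{1}^{q}=11$ and $N_{2}^{q}=111$. Then
% Hypothetical page counts, for illustration only; N_0 = 1 is an assumption.
\begin{eqnarray*}
	\delta(q,1) &=& \frac{1 \cdot (11-1)}{11} \approx 0.91 \\
	\delta(q,2) &=& \frac{1 \cdot (11-1) + 2 \cdot (111-11)}{111} 
		= \frac{210}{111} \approx 1.89 .
\end{eqnarray*}
Under this assumption the average distance at depth $d$ is slightly smaller than $d$, because pages closer to the topic page are included in the cumulative set.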
159: 
160: The 300 measures of $\delta(q,d)$ and $\sigma(q,d)$ 
161: from Equations~\ref{eq:Laver} and \ref{eq:Saver} are shown in  
162: Figure~\ref{scatter}.  The two metrics are indeed strongly 
163: anticorrelated (Pearson's $\rho = -0.76$, $p < 0.0001$) and predictive 
164: of each other.  This quantitatively 
165: confirms the link-content conjecture.
166: 
167: %:\subsection{Range of link-based lexical predictions}
168: 
169: To analyze the decrease in the reliability of lexical content 
170: inferences with distance from the topic page in link space one can 
171: perform a nonlinear least-squares fit of these data to a family of 
172: exponential decay models:
173: \begin{equation}
174: 	\sigma(\delta) \sim \sigma_{\infty} + 
175: 		(1 - \sigma_{\infty}) e^{-\alpha_{1} \delta^{\alpha_{2}}}
176: 	\label{sim-decay}
177: \end{equation}
178: using the 300 points as independent samples. Here $\sigma_{\infty}$ 
179: is the noise level in similarity.
180: Note that while starting from Yahoo pages may bias $\sigma(\delta<1)$ 
181: upward, the decay fit is most affected by the constraint 
182: $\sigma(\delta=0) = 1$ (by definition of similarity) and by the 
183: longer-range measures $\sigma(\delta>1)$.  
184: The similarity decay fit curve is also shown in Figure~\ref{scatter}. It
185: provides us with a rough estimate of how far in link space one 
186: can make inferences about lexical content.
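
As a rough numerical reading of Equation~\ref{sim-decay}, one can plug in the rounded point estimates reported in the caption of Figure~\ref{scatter} ($\sigma_{\infty} \approx 0.0318$, $\alpha_{1} \approx 1.8$, $\alpha_{2} \approx 0.6$). The values below are back-of-the-envelope arithmetic from those rounded numbers and should be taken as indicative only:
% Computed from rounded parameter estimates; indicative only.
\begin{eqnarray*}
	\sigma(1) &\approx& 0.0318 + 0.9682 \, e^{-1.8} \approx 0.19 \\
	\sigma(2) &\approx& 0.0318 + 0.9682 \, e^{-1.8 \cdot 2^{0.6}} \approx 0.10 \\
	\sigma(3) &\approx& 0.0318 + 0.9682 \, e^{-1.8 \cdot 3^{0.6}} \approx 0.06 .
\end{eqnarray*}
In this reading the expected similarity falls from 1 at $\delta=0$ to about twice the noise level by $\delta=3$.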
187: 
188: %:\subsection{Heterogeneity of link-based lexical cues}
189: 
190: How heterogeneous is the reliability of lexical inferences based on 
191: link neighbourhood across communities of Web content providers?  To 
192: answer this question the crawled pages were divided up into connected 
193: sets within top-level Internet domains.  The scatter plot of the 
194: $\delta(q,d)$ and $\sigma(q,d)$ measures for these domain-based crawls 
195: is shown in Figure~\ref{domains}a.  The plot illustrates the 
196: heterogeneity in the reliability of lexical inferences based on link 
197: cues across domains.  The parameters obtained from fitting each domain's 
198: data to the exponential decay model of Equation~\ref{sim-decay} 
199: (Figure~\ref{domains}b) estimate how reliably links point to lexically 
200: related pages in each domain.  A summary of the statistically 
201: significant differences among the parametric estimates is shown in 
202: Figure~\ref{domains}c.  It is evident that, for example, 
203: academic Web pages are better connected to each other than commercial 
204: pages in that they do a better job at pointing to other similar pages.  
205: In other words, it is easier to find related pages by browsing through 
206: academic pages than through commercial pages.  This is not surprising 
207: considering the different goals of the two communities.
208: 
209: %:\section{The link-cluster conjecture}
210: 
211: The link-cluster conjecture is a link-based analog of the cluster 
212: hypothesis, stating that pages within a few links from a relevant 
213: source are also relevant with high probability.
214: Here I experimentally assess the extent to which relevance is 
215: preserved within link space neighbourhoods, and the decay in expected 
216: relevance as one browses away from a relevant page.  
217: 
218: The link-cluster conjecture has been implied or stated in various
219: forms\cite{Kleinberg98,Gibson98,Brin98,Chakrabarti98etal,Dean99,Davison00}. 
220: One can most simply and generally state it in terms of the conditional 
221: probability that a page $p$ is relevant with respect to some query $q$, 
222: given that page $r$ is relevant and that $p$ is within $d$ links from $r$:
223: \begin{equation}
224: 	R_q(d) \equiv 
225: 		\Pr[rel_q(p) \: | \: rel_q(r) \wedge \delta_{l}(r,p) \leq d]
226: \end{equation}
227: where $rel_q()$ is a binary relevance assessment with respect to $q$. 
228: In other words a page has a higher than random probability of being 
229: about a certain topic if it is in the neighbourhood of other pages 
230: about that topic. $R_q(d)$ is the posterior relevance probability 
231: given the evidence of a relevant page nearby. The simplest form of 
232: the link-cluster conjecture is stated by comparing $R_q(1)$ to the 
233: prior relevance probability $G_q$:
234: \begin{equation}
235: 	G_q \equiv \Pr[rel_q(p)]
236: \end{equation}
237: also known as the \emph{generality} of the query. If link 
238: neighbourhoods allow for semantic inferences, then the following 
239: condition must hold:
240: \begin{equation}
241: 	\lambda(q,d=1) \equiv \frac{R_q(1)}{G_q} > 1.
242: 	\label{def_L}
243: \end{equation}
244: To illustrate the meaning of the link-cluster conjecture, consider 
245: a random crawler (or user) searching for pages about a topic $q$. 
246: Call $\eta_q(t)$ the probability that the crawler hits a relevant 
247: page at time $t$. Solving the recursion
248: \begin{equation}
249: 	\eta_q(t+1) = \eta_q(t) \cdot R_q(1) + (1 - \eta_q(t)) \cdot G_q
250: \end{equation}
251: for $\eta_q(t+1) = \eta_q(t)$ yields the stationary hit rate 
252: \begin{equation}
253: 	\eta_q^* = \frac{G_q}{1 + G_q - R_q(1)}.
254: \end{equation}
255: The link-cluster conjecture is a necessary and sufficient condition 
256: for such a crawler to have a better than chance hit rate, thus 
257: justifying the crawling (and browsing!) activity:
258: \begin{equation}
259: 	\eta_q^* > G_q \Longleftrightarrow \lambda(q,1) > 1.
260: \end{equation}
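
For completeness, the steps behind the stationary hit rate and the equivalence can be spelled out as follows; the numerical values at the end are hypothetical and serve only as an illustration. Setting $\eta_q(t+1) = \eta_q(t) = \eta_q^*$ in the recursion gives
\begin{eqnarray*}
	\eta_q^* &=& \eta_q^* \cdot R_q(1) + (1-\eta_q^*) \cdot G_q \\
	\eta_q^* \, [1 + G_q - R_q(1)] &=& G_q ,
\end{eqnarray*}
from which the expression for $\eta_q^*$ follows. Since $R_q(1) \leq 1$, the factor $1 + G_q - R_q(1)$ is positive whenever $G_q > 0$, so dividing $\eta_q^* > G_q$ by $G_q$ gives
\[
	\frac{1}{1 + G_q - R_q(1)} > 1 \Longleftrightarrow R_q(1) > G_q 
	\Longleftrightarrow \lambda(q,1) > 1 .
\]
% Hypothetical values of G_q and R_q(1), for illustration only.
For instance, with the made-up values $G_q = 0.01$ and $R_q(1) = 0.2$ (so that $\lambda(q,1) = 20$), the stationary hit rate is $\eta_q^* = 0.01/0.81 \approx 0.012$, modestly above the chance level of $0.01$.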
261: 
262: Definition~\ref{def_L} can be generalized to likelihood factors over 
263: larger neighbourhoods:
264: \begin{equation}	
265: 	\lambda(q,d) \equiv \frac{R_q(d)}{G_q} \stackrel{d \rightarrow 		
266: \infty}{\longrightarrow} 1
267: 	\label{L-def}
268: \end{equation}
269: and a stronger version of the conjecture can be formulated as follows:
270: \begin{equation}
271: 	\lambda(q,d) \gg 1 \; \mbox{for} \; \delta(q,d) < \delta^*
272: \end{equation}
273: where $\delta^*$ is a critical link distance beyond which semantic 
274: inferences are unreliable.
275: 
276: %:\subsection{Semantic clusters in link neighbourhoods}
277: 
278: I first attempted to measure the likelihood factor $\lambda(q,1)$ for 
279: a few queries and found that 
280: \linebreak $\langle \lambda(q,1) \rangle_q \gg 1$,
281: but those estimates were based on very noisy relevance 
282: assessments\cite{Menczer97b}. To obtain a reliable quantitative 
283: validation of the stronger link-cluster conjecture, I repeated such 
284: measurements on the data set described in Figure~\ref{yahoo}. 
285: 
286: The 300 measures of $\lambda(q,d)$ thus obtained are plotted versus 
287: $\delta(q,d)$ from  Equation~\ref{eq:Laver} in  
288: Figure~\ref{likelihood}. Closeness to a relevant page in link space 
289: is highly predictive of relevance, increasing the relevance 
290: probability by a likelihood factor $\lambda(q,d) \gg 1$ over the 
291: range of observed distances and queries.
292: 
293: %:\subsection{Expected relevance decay in link space}
294: 
295: I also performed a nonlinear least-squares fit of these data to a 
296: family of exponential decay functions using the 300 points as 
297: independent samples: 
298: \begin{equation}
299: 	\lambda(\delta) \sim 1 + \alpha_3 e^{-\alpha_4 \delta^{\alpha_5}}.
300: \end{equation}
301: Note that this three-parameter model is more complex than the one in 
302: Equation~\ref{sim-decay} because $\lambda(\delta=0)$ must also be 
303: estimated from the data ($\lambda(q,0) = 1/G_q$). The 
304: relationship between link distance and the semantic likelihood factor 
305: is less regular than between link distance and lexical similarity. 
306: The resulting fit (also shown in 
307: Figure~\ref{likelihood}) provides us with a rough estimate of how 
308: far in link space we can make inferences about the semantics 
309: (relevance) of pages, i.e., up to a critical distance $\delta^*$ 
310: between 4 and 5 links.
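
To illustrate how the fitted curve relates to the critical distance, one can evaluate the decay model with the rounded estimates reported in the caption of Figure~\ref{likelihood} ($\alpha_3 \approx 1000$, $\alpha_4 \approx 0.002$, $\alpha_5 \approx 5.5$). The figures below are approximate arithmetic from those rounded values and are very sensitive to them, so they are only indicative:
% Computed from rounded parameter estimates; indicative only.
\begin{eqnarray*}
	\lambda(3) &\approx& 1 + 1000 \, e^{-0.002 \cdot 3^{5.5}} \approx 430 \\
	\lambda(4) &\approx& 1 + 1000 \, e^{-0.002 \cdot 4^{5.5}} \approx 18 \\
	\lambda(5) &\approx& 1 + 1000 \, e^{-0.002 \cdot 5^{5.5}} \approx 1.001 .
\end{eqnarray*}
The modeled likelihood factor thus drops from several hundred at three links to essentially 1 at five links, consistent with a critical distance between 4 and 5 links.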
311: 
312: %:\section{Discussion}
313: 
314: It is surprising that the link-content and link-cluster conjectures 
315: have not been formalized and addressed explicitly before, especially 
316: when one looks at the considerable attention recently received by the 
317: Web's graph topology\cite{Lawrence98,Butler00}.  The correlation 
318: between Web links and content takes on additional significance in 
319: light of link analysis studies that tell us the Web is a ``small 
320: world'' network whose in-link and out-link counts follow inverse 
321: power law distributions\cite{Albert99,Adamic99}.  Small world 
322: networks have a mixture of non-random local structure and non-local 
323: random links. Such a topology creates short paths between pages, whose length scales logarithmically with the number of Web pages.  The present 
324: results indicate that the Web's local structure is created by the 
325: semantic clusters resulting from authors linking their pages to 
326: related resources.
327: 
328: The link-cluster and link-content conjectures have important normative 
329: implications for future Web search technology.  For example, the 
330: measurements in this paper suggest that topic-driven crawlers should 
331: keep track of their position with a bias to remain within a few links 
332: from some relevant source.  In such a range hyperlinks create 
333: detectable signals about lexical and semantic content, despite the 
334: Web's chaotic lack of structure.  Absent such signals, the short paths 
335: predicted by the small world model might be very hard to locate for 
336: localized algorithms\cite{Kleinberg00}.  In general, the present 
337: findings should foster the design of better search tools by 
338: integrating traditional search engines with topic- and query-driven 
339: crawlers\cite{Menczer01} guided by \emph{local} link and lexical 
340: clues.  Smart crawlers of this kind are already emerging (see for 
341: example \texttt{http://myspiders.biz.uiowa.edu}).  
342: Due to the size and dynamic nature of the Web, the 
343: efficiency-motivated search engine practice of keeping query 
344: processing separate from crawling leads to poor trade-offs between 
345: coverage and recency\cite{Lawrence99}.  Closing the loop from user 
346: queries to smart crawlers will lead to dynamic indices 
347: with more scalable and user-driven update algorithms than 
348: the centralized ones used today.
349: 
350: \begin{thebibliography}{10}
351: 
352: \bibitem{Brin98}
353: Brin, S. and Page, L.
354: \newblock The anatomy of a large-scale hypertextual {W}eb search engine.
355: \newblock {\em Computer Networks}{ \bf 30}(1--7), 107--117 (1998).
356: 
357: \bibitem{Kleinberg98}
358: Kleinberg, J.
359: \newblock Authoritative sources in a hyperlinked environment.
360: \newblock {\em Journal of the ACM}{ \bf 46}(5), 604--632 (1999).
361: 
362: \bibitem{Menczer00}
363: Menczer, F. and Belew, R.
364: \newblock Adaptive retrieval agents: Internalizing local context and scaling up
365:   to the {W}eb.
366: \newblock {\em Machine Learning}{ \bf 39}(2--3), 203--242 (2000).
367: 
368: \bibitem{Ben-Shaul99etal}
369: Ben-Shaul, I. et~al.
370: \newblock Adding support for dynamic and focused search with {F}etuccino.
371: \newblock {\em Computer Networks}{ \bf 31}(11--16), 1653--1665 (1999).
372: 
373: \bibitem{Chakrabarti99}
374: Chakrabarti, S., {van den Berg}, M., and Dom, B.
375: \newblock Focused crawling: A new approach to topic-specific {W}eb resource
376:   discovery.
377: \newblock {\em Computer Networks}{ \bf 31}(11--16), 1623--1640 (1999).
378: 
379: \bibitem{Lawrence98}
380: Lawrence, S. and Giles, C.
381: \newblock Searching the {W}orld {W}ide {W}eb.
382: \newblock {\em Science}{ \bf 280}, 98--100 (1998).
383: 
384: \bibitem{Albert99}
385: Albert, R., Jeong, H., and Barab\'asi, A.-L.
386: \newblock Diameter of the {W}orld {W}ide {W}eb.
387: \newblock {\em Nature}{ \bf 401}(6749), 130--131 (1999).
388: 
389: \bibitem{Adamic99}
390: Adamic, L.
391: \newblock The {S}mall {W}orld {W}eb.
392: \newblock {\em LNCS}{ \bf 1696}, 443--452 (1999).
393: 
394: \bibitem{Butler00}
395: Butler, D.
396: \newblock Souped-up search engines.
397: \newblock {\em Nature}{ \bf 405}(6783), 112--115 (2000).
398: 
399: \bibitem{Salton83}
400: Salton, G. and McGill, M.
401: \newblock {\em An Introduction to Modern Information Retrieval}.
402: \newblock McGraw-Hill, New York, NY (1983).
403: 
404: \bibitem{vanR79cluster}
405: van Rijsbergen, C.
406: \newblock {\em Information Retrieval}, chapter~3,  30--31.
407: \newblock Butterworths, London (1979).
408: \newblock Second edition.
409: 
410: \bibitem{Mendelzon00}
411: Mendelzon, A. and Rafiei, D.
412: \newblock What do the neighbours think? {Computing} web page reputations.
413: \newblock {\em IEEE Data Engineering Bulletin}{ \bf 23}(3), 9--16 (2000).
414: 
415: \bibitem{Gibson98}
416: Gibson, D., Kleinberg, J., and Raghavan, P.
417: \newblock Inferring {W}eb communities from link topology.
418: \newblock In {\em Proc. 9th ACM Conference on Hypertext and Hypermedia},
419:   225--234 (1998).
420: 
421: \bibitem{Chakrabarti98etal}
422: Chakrabarti, S. et~al.
423: \newblock Automatic resource compilation by analyzing hyperlink structure and
424:   associated text.
425: \newblock {\em Computer Networks}{ \bf 30}(1--7), 65--74 (1998).
426: 
427: \bibitem{Dean99}
428: Dean, J. and Henzinger, M.
429: \newblock Finding related pages in the {W}orld {W}ide {W}eb.
430: \newblock {\em Computer Networks}{ \bf 31}(11--16), 1467--1479 (1999).
431: 
432: \bibitem{Davison00}
433: Davison, B.
434: \newblock Topical locality in the {W}eb.
435: \newblock In {\em Proc. 23rd International ACM SIGIR Conference on Research and
436:   Development in Information Retrieval}, 272--279 (2000).
437: 
438: \bibitem{Menczer97b}
439: Menczer, F.
440: \newblock {ARACHNID}: {A}daptive {R}etrieval {A}gents {C}hoosing {H}euristic
441:   {N}eighborhoods for {I}nformation {D}iscovery.
442: \newblock In {\em Proc. 14th International Conference on Machine Learning},
443:   227--235 (1997).
444: 
445: \bibitem{Kleinberg00}
446: Kleinberg, J.
447: \newblock Navigation in a small world.
448: \newblock {\em Nature}{ \bf 406}, 845 (2000).
449: 
450: \bibitem{Menczer01}
451: Menczer, F., Pant, G., Ruiz, M., and Srinivasan, P.
452: \newblock Evaluating topic-driven {W}eb crawlers.
453: \newblock In {\em Proc. 24th Annual International ACM SIGIR Conference on
454:   Research and Development in Information Retrieval} (2001).
455: 
456: \bibitem{Lawrence99}
457: Lawrence, S. and Giles, C.
458: \newblock Accessibility of information on the {W}eb.
459: \newblock {\em Nature}{ \bf 400}, 107--109 (1999).
460: 
461: \bibitem{Porter80}
462: Porter, M.
463: \newblock An algorithm for suffix stripping.
464: \newblock {\em Program}{ \bf 14}(3), 130--137 (1980).
465: 
466: \bibitem{Jones72}
467: Sparck~Jones, K.
468: \newblock A statistical interpretation of term specificity and its application
469:   in retrieval.
470: \newblock {\em Journal of Documentation}{ \bf 28}, 111--121 (1972).
471: 
472: \end{thebibliography}
473: 
474: \subsection*{Acknowledgements}
475: 
476: {\small The author is grateful to D. Eichmann, P. Srinivasan, W.N. 
477: Street, A.M. Segre, R.K. Belew and A. Monge for helpful comments 
478: and discussions, and to M. Lee and M. Porter for contributions 
479: to the crawling and parsing code.}
480: 
481: \textbf{Correspondence and requests for materials should be sent to 
482: the author (email: \linebreak \texttt{filippo-menczer@uiowa.edu}).}
483: 
484: \newpage
485: 
486: \begin{figure}[hp]
487: 	\centering
488: 	\includegraphics{yahoo} 
489: 	\caption{Representation of the data collection.
490: 	100 topic pages were chosen in the Yahoo directory owing to 
491: 	this portal's wide popularity. Yahoo category
492: 	pages are marked ``Y'', external pages are marked ``W''.
493: 	The topic pages were chosen among ``leaf'' categories, i.e.,  
494: 	without sub-categories.  This way 
495: 	the external pages linked by a topic page (``Yq'') represent the
496: 	relevant set compiled for that topic by the Yahoo editors (shaded).
497: 	Topics were selected in breadth-first order and therefore 
498: 	covered the full spectrum of Yahoo top-level categories.  
499: 	In this example the topic is \texttt{SOCIETY CULTURE BIBLIOGRAPHY}.
500: 	Arrows represent hyperlinks and dotted arrows are examples of links
501: 	pointing back to the relevant set.
502: 	For each topic, we performed 
503: 	a breadth-first crawl up to a depth of 3 links.  The crawl set is 
504: 	represented inside the dashed line.
505: 	To obtain meaningful and comparable 
506: 	statistics at $\delta_{l}=1$, only topic pages with at least 
507: 	5 external links were used, and only the first 10 links 
508: 	for topic pages with over 10 links.  Each crawl 
509: 	was stopped if 10,000 pages had been downloaded at depth $\delta_{l}=3$
510: 	from the start page.  A timeout of 60 seconds was applied for each 
511: 	page.  The resulting collection comprised 376,483 pages.  The text of 
512: 	each fetched page was parsed to extract links and terms. Terms were 
513: 	conflated using a standard stemming algorithm\cite{Porter80}.
514: 	A common TFIDF weighting scheme\cite{Jones72} was employed to 
515: 	represent each page in word vector space.  This model assumes a 
516: 	global measure of term frequency across pages (inverse document 
517: 	frequency).  To make the measures scalable with the maximum crawl 
518: 	depth (a parameter), inverse document frequency was computed as a 
519: 	function of distance from the start page, among the set of 
520: 	documents within that distance from the source.  Formally, for 
521: 	each topic $q$, page $p$, term $k$ and depth $d$:
522: 	$w_{p,d,q}^{k} = tf(k,p) \cdot idf(k,d,q)$
523: 	where $tf(k,p)$ is the number of occurrences 
524: 	of term $k$ in page $p$ and
525: 	$idf(k,d,q) = 1 + \ln\left(\frac{N_{d}^{q}}{N_{d}^{q}(k)}\right)$.
526: 	Here $N_{d}^{q}$ is the size of the cumulative page set 
527: 	$P_{d}^{q} = \{ p : \delta_{l}(q,p) \leq d \}$, and
528: 	$N_{d}^{q}(k)$ is the size of the subset of $P_{d}^{q}$ of pages
529: 	containing term $k$.}
530: 	\label{yahoo}
531: \end{figure}
532: 
533: \newpage
534: 
535: \begin{figure}[hp]
536: 	\centering
537: 	\includegraphics{sim} 
538: 	\caption{Scatter plot of $\sigma(q,d)$ versus 
539: 	$\delta(q,d)$ for topics $q=0,\ldots,99$ and
540: 	depths $d=1,2,3$. Pearson's correlation coefficient $\rho = -0.76,
541: 	p<0.0001$. The similarity 
542: 	noise level $\sigma_{\infty}$ and an exponential decay fit 
543: 	of the data are also shown. 
544: 	$\sigma_{\infty}$ was computed by comparing each topic 
545: 	page to external pages linked from different Yahoo categories:
546: 	$\sigma_{\infty} \equiv \left\langle 
547: 	\frac{1}{N_{1}^{q'}} \sum_{p \in P_{1}^{q'}} \sigma(q,p) 
548: 	\right\rangle_{\{q,q': q \neq q'\}} 
549: 	\approx 0.0318 \pm 0.0006$.
550: 	The regression yielded 
551: 	parametric estimates $\alpha_{1} \approx 1.8$ and $\alpha_{2} \approx 
552: 	0.6$.}
553: 	\label{scatter}
554: \end{figure}
555: 
556: \newpage
557: 
558: \begin{figure}[hp]
559: \centering
560: \begin{tabular}{rcrc}
561: 	\textbf{a}
562: 	&
563: 	\multicolumn{3}{c}{\raisebox{-1.75in}{\includegraphics{domains}}} \\
564: 	\textbf{b}
565: 	&
566: 	\begin{tabular}{ccc}
567: 		Domain & $\alpha_{1}$ & $\alpha_{2}$ \\
568: 		\hline
569: 		\texttt{edu} & $1.11 \pm 0.03$ & $0.87 \pm 0.05$ \\
570: 		\texttt{net} & $1.16 \pm 0.04$ & $0.88 \pm 0.05$ \\
571: 		\texttt{gov} & $1.22 \pm 0.07$ & $1.00 \pm 0.09$ \\
572: 		\texttt{org} & $1.38 \pm 0.03$ & $0.93 \pm 0.05$ \\
573: 		\texttt{com} & $1.63 \pm 0.04$ & $1.13 \pm 0.05$ \\
574: 		\hline
575: 	\end{tabular}
576: 	& 
577: 	\textbf{c}
578: 	&
579: 	\raisebox{-0.8in}{\includegraphics[width=2.5in]{domains-stat}}
580: \end{tabular}
581: \caption{\textbf{a.} Scatter plot of $\sigma(q,d)$ versus 
582: $\delta(q,d)$ for topics $q=0,\ldots,99$ and
583: depths $d=1,2,3$, for each of the major US top-level domains.
584: The domain sets were obtained by simulating crawlers that only 
585: follow links to servers within each domain.  
586: An exponential decay fit is also shown for each domain.
587: \textbf{b.} Exponential decay model parameters obtained by 
588: nonlinear least-squares fit to each domain's data. 
589: \textbf{c.} Summary of statistically 
590: significant differences (at the 68.3\% confidence level) between the 
591: parametric estimates; dashed arrows represent significant differences 
592: in $\alpha_{1}$ only, and solid arrows significant differences in 
593: both $\alpha_{1}$ and $\alpha_{2}$.}
594: \label{domains}
595: \end{figure}
596: 
597: \newpage
598: 
599: \begin{figure}[hp]
600: 	\centering
601: 	\includegraphics{likelihood}
602: 	\caption{Scatter plot of $\lambda(q,d)$ versus 
603: 	$\delta(q,d)$ for topics $q=0,\ldots,99$ and depths $d=1,2,3$.
604: 	Pearson's $\rho = -0.1, p=0.09$. In computing $\lambda(q,d)$ 
605: 	from Definition~\ref{L-def}, the relevant set $Q_q$ compiled by the 
606: 	Yahoo editors for each topic $q$ was used to estimate
607: 	$R_q(d) \simeq \frac{|P_{d}^{q} \cap Q_q|}{N_{d}^{q}}$
608: 	(cf. dotted links in Figure~\ref{yahoo}). 
609: 	Generality was approximated by
610: 	$G_q \simeq \frac{|Q'_q|}{|\bigcup_{q' \in Y} Q'_{q'}|}$ where
611: 	all of the relevant links for each topic $q$ are included in $Q'_q$, 
612: 	even for topics where only the first 10 links were used in
613: 	the crawl ($Q'_q \supseteq Q_q$), and 
614: 	the set $Y$ in the denominator includes all Yahoo leaf categories. 
615: 	An exponential decay fit of the data is
616: 	also shown. The regression yielded parametric estimates 
617: 	$\alpha_{3} \approx 1000$, $\alpha_{4} \approx 0.002$ and $\alpha_{5} 
618: 	\approx 5.5$.}
619: 	\label{likelihood}
620: \end{figure}
621: 
622: \end{document}