1: \documentclass[11pt]{article}
2:
3: \newif\ifpdf
4: \ifx\pdfoutput\undefined
5: \pdffalse % we are not running PDFLaTeX
6: \else
7: \pdfoutput=1 % we are running PDFLaTeX
8: \pdftrue
9: \fi
10:
11: \ifpdf
12: \usepackage[pdftex]{graphicx}
13: \else
14: \usepackage{graphicx}
15: \fi
16:
17: \textwidth = 6.5 in
18: \textheight = 9 in
19: \oddsidemargin = 0.0 in
20: \evensidemargin = 0.0 in
21: \topmargin = 0.0 in
22: \headheight = 0.0 in
23: \headsep = 0.0 in
24: \parskip = 0.2in
25: \parindent = 0.0in
26:
27: \usepackage{cite}
28: \usepackage{citesupernumber}
29: \usepackage{nature}
30: \bibliographystyle{nature}
31:
32: \title{Links tell us about lexical and semantic Web content}
33:
34: \author{Filippo Menczer\\
35: Department of Management Sciences\\
36: The University of Iowa\\
37: Iowa City, IA 52242\\
38: %\texttt{filippo-menczer@uiowa.edu}
39: }
40:
41: \begin{document}
42:
43: \ifpdf
44: \DeclareGraphicsExtensions{.pdf, .png}
45: \else
46: \DeclareGraphicsExtensions{.eps, .png}
47: \fi
48:
49: \date{}
50:
51: \maketitle
52:
53: \textbf{The latest generation of Web search tools is beginning to
54: exploit hypertext link information to improve
55: ranking\cite{Brin98,Kleinberg98} and
56: crawling\cite{Menczer00,Ben-Shaul99etal,Chakrabarti99} algorithms. The
57: hidden assumption behind such approaches, a correlation between the
58: graph structure of the Web and its content, has not been tested
59: explicitly despite increasing research on Web
60: topology\cite{Lawrence98,Albert99,Adamic99,Butler00}. Here I
61: formalize and quantitatively validate two conjectures drawing
62: connections from link information to lexical and semantic Web content.
63: The \emph{link-content conjecture} states that a page is similar to
64: the pages that link to it, i.e., one can infer the lexical content of
65: a page by looking at the pages that link to it. I also show that
66: lexical inferences based on link cues are quite heterogeneous across
67: Web communities. The \emph{link-cluster conjecture} states that pages
68: about the same topic are clustered together, i.e., one can infer the
69: meaning of a page by looking at its neighbours. These results explain
70: the success of the newest search technologies and open the way for
more dynamic and scalable methods to locate information in a topic- or
user-driven way.}
73:
74: %:\section{Intro}
75:
76: All search engines basically perform two functions: (i) crawling Web
77: pages to maintain an index, and (ii) matching URLs in the index
78: database against user queries. Effective search engines achieve a
79: high coverage of the Web, keep their index fresh, and rank hits in a
80: way that correlates with the user's notion of relevance. Ranking and
81: crawling algorithms use cues from words and hyperlinks, associated
82: respectively with \emph{lexical} and \emph{link topology}. In the
83: former, two pages are close to each other if they have similar textual
84: content; in the latter, if there is a short path between them. Lexical
85: metrics are traditionally used by search engines to rank hits
86: according to their similarity to the query, thus attempting to infer
87: the semantics of pages from their lexical representation. Similarity
metrics are derived from the vector space model\cite{Salton83}, which
89: represents each document or query by a vector with one dimension for
90: each term and a weight along that dimension that estimates the term's
91: contribution to the meaning of the document. The \emph{cluster
92: hypothesis} behind this model is that a document lexically close to a
93: relevant document is also relevant with high probability\cite{vanR79cluster}.
94: Links have traditionally been used by search engine crawlers only in
exhaustive, centralized algorithms. However, the latest generation of
96: Web search tools is beginning to integrate lexical and link metrics to
97: improve ranking and crawling performance through better models of
98: relevance. The best known example is the \emph{PageRank} metric used
99: by Google: pages containing the query's lexical features are ranked
100: using query-independent link analysis\cite{Brin98}. Links are also
101: used in conjunction with text to identify hub and authority pages for
102: a certain subject\cite{Kleinberg98}, determine the reputation of a
103: given site\cite{Mendelzon00}, and guide search agents crawling on
104: behalf of users or topical search
105: engines\cite{Menczer00,Ben-Shaul99etal,Chakrabarti99}.
106:
107: %:\section{The link-content conjecture}
108:
109: To study the connection between link and lexical topologies, I
110: conjecture a positive correlation between distance measures defined in
111: the two spaces. Given any pair of Web
112: pages $(p_{1},p_{2})$ we have well-defined distance functions
113: $\delta_{l}$ and $\delta_{t}$ in link and lexical space,
114: respectively. To compute $\delta_{l}(p_{1},p_{2})$ we use the Web
115: hypertext structure to find the length, in links, of the shortest path
116: from $p_{1}$ to $p_{2}$. (This is not a metric distance
117: because it is not symmetric in a directed graph, but
118: for convenience I refer to $\delta_{l}$ as ``distance''.)
119: To compute $\delta_{t}(p_{1},p_{2})$ we
120: can use the vector representations of the two pages, where the vector
121: components (weights) of page $p$, $w_{p}^{k}$, are computed for terms
122: $k$ in the textual content of $p$ given some weighting scheme. One
123: possibility would be to use Euclidean distance in this word vector
124: space, or any other $L_{z}$ norm.
125: However, $L_{z}$ metrics have a dependency on the
126: dimensionality of the pages, i.e., larger documents tend to appear
127: more distant from each other than shorter ones, irrespective of
128: content. To circumvent this problem, one can instead define a metric
129: based on the \emph{similarity} between pages.
130: Let us use the \emph{cosine similarity} function, a standard measure
131: in information retrieval:
132: \begin{equation}
133: \sigma(p_{1},p_{2}) = \frac{\sum_{k \in p_{1} \cap p_{2}}
134: w_{p_{1}}^{k} w_{p_{2}}^{k}}
135: {\sqrt{\sum_{k \in p_{1}} (w_{p_{1}}^{k})^{2}
136: \sum_{k \in p_{2}} (w_{p_{2}}^{k})^{2}}}.
137: \label{eq:sim}
138: \end{equation}
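
As a minimal illustration of Equation~\ref{eq:sim}, with made-up weights
that are not taken from the crawl data, consider two hypothetical pages
described over three terms by the weight vectors $w_{p_{1}} = (2, 1, 0)$
and $w_{p_{2}} = (1, 1, 3)$. Only the first two terms occur in both
pages, so
\[
\sigma(p_{1},p_{2}) =
\frac{2 \cdot 1 + 1 \cdot 1}
{\sqrt{(2^{2} + 1^{2}) \, (1^{2} + 1^{2} + 3^{2})}}
= \frac{3}{\sqrt{55}} \approx 0.40.
\]
Multiplying every weight of $p_{1}$ by a constant $c > 0$, as would
happen for instance if the text of the page were repeated $c$ times,
multiplies both numerator and denominator by $c$ and leaves $\sigma$
unchanged. This is the sense in which the similarity avoids the length
dependency of the $L_{z}$ distances.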
139:
140: According to the link-content conjecture, $\sigma$ is anticorrelated
with $\delta_{l}$. The idea is to measure the correlation between
lexical similarity and link distance across pairs of pages. Figure~\ref{yahoo}
143: illustrates how a collection of Web pages was crawled and processed
144: for this purpose.
145:
146: %:\subsection{Correlation of lexical and link distance}
147:
148: The link distances $\delta_{l}(q,p)$ and similarities $\sigma(q,p)$
149: were averaged for each topic $q$ over all pages $p$ in the crawl set
150: $P_{d}^{q}$ for each depth $d$:
151: \begin{eqnarray}
152: \delta(q,d) &\equiv& \langle \delta_{l}(q,p) \rangle_{P_{d}^{q}} =
153: \frac{1}{N_{d}^{q}} \sum_{i=1}^{d} i \cdot
154: (N_{i}^{q} - N_{i-1}^{q}) \label{eq:Laver} \\
155: \sigma(q,d) &\equiv& \langle \sigma(q,p) \rangle_{P_{d}^{q}} =
156: \frac{1}{N_{d}^{q}} \sum_{p \in P_{d}^{q}}
157: \sigma(q,p). \label{eq:Saver}
158: \end{eqnarray}
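
To make Equation~\ref{eq:Laver} concrete, consider a hypothetical topic
(the counts are invented for illustration) whose cumulative crawl sets
have sizes $N_{1}^{q} = 11$, $N_{2}^{q} = 111$ and $N_{3}^{q} = 1111$,
and assume that $N_{0}^{q} = 1$, i.e., that the topic page itself is the
only page at distance zero. Under this assumption,
\begin{eqnarray*}
\delta(q,1) &=& \frac{1 \cdot 10}{11} \approx 0.91, \\
\delta(q,2) &=& \frac{1 \cdot 10 + 2 \cdot 100}{111} \approx 1.89, \\
\delta(q,3) &=& \frac{1 \cdot 10 + 2 \cdot 100 + 3 \cdot 1000}{1111}
\approx 2.89.
\end{eqnarray*}
The average distance is thus somewhat smaller than the crawl depth $d$,
and it approaches $d$ when the number of pages grows quickly with depth.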
159:
160: The 300 measures of $\delta(q,d)$ and $\sigma(q,d)$
161: from Equations~\ref{eq:Laver} and \ref{eq:Saver} are shown in
162: Figure~\ref{scatter}. The two metrics are indeed well
163: anticorrelated and predictive of each other with high
164: statistical significance. This quantitatively
165: confirms the link-content conjecture.
166:
167: %:\subsection{Range of link-based lexical predictions}
168:
169: To analyze the decrease in the reliability of lexical content
inferences with distance from the topic page in link space, one can
171: perform a nonlinear least-squares fit of these data to a family of
172: exponential decay models:
173: \begin{equation}
174: \sigma(\delta) \sim \sigma_{\infty} +
175: (1 - \sigma_{\infty}) e^{-\alpha_{1} \delta^{\alpha_{2}}}
176: \label{sim-decay}
177: \end{equation}
178: using the 300 points as independent samples. Here $\sigma_{\infty}$
179: is the noise level in similarity.
180: Note that while starting from Yahoo pages may bias $\sigma(\delta<1)$
181: upward, the decay fit is most affected by the constraint
182: $\sigma(\delta=0) = 1$ (by definition of similarity) and by the
183: longer-range measures $\sigma(\delta>1)$.
184: The similarity decay fit curve is also shown in Figure~\ref{scatter}. It
185: provides us with a rough estimate of how far in link space one
186: can make inferences about lexical content.
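
For instance, plugging the estimates reported in Figure~\ref{scatter}
($\sigma_{\infty} \approx 0.0318$, $\alpha_{1} \approx 1.8$ and
$\alpha_{2} \approx 0.6$) into Equation~\ref{sim-decay} gives, as a
back-of-the-envelope calculation that inherits the rounding of those
estimates,
\begin{eqnarray*}
\sigma(1) &\approx& 0.0318 + 0.9682 \, e^{-1.8} \approx 0.19, \\
\sigma(2) &\approx& 0.0318 + 0.9682 \, e^{-1.8 \cdot 2^{0.6}} \approx 0.10, \\
\sigma(3) &\approx& 0.0318 + 0.9682 \, e^{-1.8 \cdot 3^{0.6}} \approx 0.06,
\end{eqnarray*}
that is, roughly six, three and two times the noise level, respectively.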
187:
188: %:\subsection{Heterogeneity of link-based lexical cues}
189:
190: How heterogeneous is the reliability of lexical inferences based on
191: link neighbourhood across communities of Web content providers? To
192: answer this question the crawled pages were divided up into connected
193: sets within top-level Internet domains. The scatter plot of the
194: $\delta(q,d)$ and $\sigma(q,d)$ measures for these domain-based crawls
195: is shown in Figure~\ref{domains}a. The plot illustrates the
196: heterogeneity in the reliability of lexical inferences based on link
cues across domains. The parameters obtained from fitting each domain's
data to the exponential decay model of Equation~\ref{sim-decay}
199: (Figure~\ref{domains}b) estimate how reliably links point to lexically
200: related pages in each domain. A summary of the statistically
201: significant differences among the parametric estimates is shown in
202: Figure~\ref{domains}c. It is evident that, for example,
203: academic Web pages are better connected to each other than commercial
204: pages in that they do a better job at pointing to other similar pages.
In other words, it is easier to find related pages browsing through
206: academic pages than through commercial pages. This is not surprising
207: considering the different goals of the two communities.
208:
209: %:\section{The link-cluster conjecture}
210:
211: The link-cluster conjecture is a link-based analog of the cluster
212: hypothesis, stating that pages within a few links from a relevant
213: source are also relevant with high probability.
214: Here I experimentally assess the extent to which relevance is
215: preserved within link space neighbourhoods, and the decay in expected
216: relevance as one browses away from a relevant page.
217:
218: The link-cluster conjecture has been implied or stated in various
219: forms\cite{Kleinberg98,Gibson98,Brin98,Chakrabarti98etal,Dean99,Davison00}.
220: One can most simply and generally state it in terms of the conditional
221: probability that a page $p$ is relevant with respect to some query $q$,
222: given that page $r$ is relevant and that $p$ is within $d$ links from $r$:
223: \begin{equation}
224: R_q(d) \equiv
225: \Pr[rel_q(p) \: | \: rel_q(r) \wedge \delta_{l}(r,p) \leq d]
226: \end{equation}
227: where $rel_q()$ is a binary relevance assessment with respect to $q$.
In other words, a page has a higher than random probability of being
229: about a certain topic if it is in the neighbourhood of other pages
230: about that topic. $R_q(d)$ is the posterior relevance probability
231: given the evidence of a relevant page nearby. The simplest form of
232: the link-cluster conjecture is stated by comparing $R_q(1)$ to the
233: prior relevance probability $G_q$:
234: \begin{equation}
235: G_q \equiv \Pr[rel_q(p)]
236: \end{equation}
237: also known as the \emph{generality} of the query. If link
238: neighbourhoods allow for semantic inferences, then the following
239: condition must hold:
240: \begin{equation}
241: \lambda(q,d=1) \equiv \frac{R_q(1)}{G_q} > 1.
242: \label{def_L}
243: \end{equation}
244: To illustrate the meaning of the link-cluster conjecture, consider
245: a random crawler (or user) searching for pages about a topic $q$.
246: Call $\eta_q(t)$ the probability that the crawler hits a relevant
247: page at time $t$. Solving the recursion
248: \begin{equation}
249: \eta_q(t+1) = \eta_q(t) \cdot R_q(1) + (1 - \eta_q(t)) \cdot G_q
250: \end{equation}
251: for $\eta_q(t+1) = \eta_q(t)$ yields the stationary hit rate
252: \begin{equation}
253: \eta_q^* = \frac{G_q}{1 + G_q - R_q(1)}.
254: \end{equation}
255: The link-cluster conjecture is a necessary and sufficient condition
256: for such a crawler to have a better than chance hit rate, thus
257: justifying the crawling (and browsing!) activity:
258: \begin{equation}
259: \eta_q^* > G_q \Longleftrightarrow \lambda(q,1) > 1.
260: \end{equation}
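
The steps behind these two results are short. Setting
$\eta_q(t+1) = \eta_q(t) = \eta_q^*$ in the recursion gives
\[
\eta_q^* = \eta_q^* R_q(1) + G_q - \eta_q^* G_q
\quad \Longrightarrow \quad
\eta_q^* \, [1 + G_q - R_q(1)] = G_q .
\]
Since $R_q(1) \leq 1$ and $G_q > 0$, the factor in brackets is positive,
so $\eta_q^* > G_q$ holds exactly when $1 + G_q - R_q(1) < 1$, i.e., when
$R_q(1) > G_q$. As a purely hypothetical numerical example, a topic with
$G_q = 0.01$ and $R_q(1) = 0.5$, so that $\lambda(q,1) = 50$, would give
\[
\eta_q^* = \frac{0.01}{1 + 0.01 - 0.5} = \frac{0.01}{0.51} \approx 0.0196,
\]
about twice the chance hit rate.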
261:
262: Definition~\ref{def_L} can be generalized to likelihood factors over
263: larger neighbourhoods:
264: \begin{equation}
265: \lambda(q,d) \equiv \frac{R_q(d)}{G_q} \stackrel{d \rightarrow
266: \infty}{\longrightarrow} 1
267: \label{L-def}
268: \end{equation}
269: and a stronger version of the conjecture can be formulated as follows:
270: \begin{equation}
271: \lambda(q,d) \gg 1 \; \mbox{for} \; \delta(q,d) < \delta^*
272: \end{equation}
273: where $\delta^*$ is a critical link distance beyond which semantic
274: inferences are unreliable.
275:
276: %:\subsection{Semantic clusters in link neighbourhoods}
277:
278: I first attempted to measure the likelihood factor $\lambda(q,1)$ for
279: a few queries and found that
280: \linebreak $\langle \lambda(q,1) \rangle_q \gg 1$,
281: but those estimates were based on very noisy relevance
282: assessments\cite{Menczer97b}. To obtain a reliable quantitative
283: validation of the stronger link-cluster conjecture, I repeated such
284: measurements on the data set described in Figure~\ref{yahoo}.
285:
286: The 300 measures of $\lambda(q,d)$ thus obtained are plotted versus
287: $\delta(q,d)$ from Equation~\ref{eq:Laver} in
288: Figure~\ref{likelihood}. Closeness to a relevant page in link space
289: is highly predictive of relevance, increasing the relevance
290: probability by a likelihood factor $\lambda(q,d) \gg 1$ over the
291: range of observed distances and queries.
292:
293: %:\subsection{Expected relevance decay in link space}
294:
I also performed a nonlinear least-squares fit of these data to a
296: family of exponential decay functions using the 300 points as
297: independent samples:
298: \begin{equation}
299: \lambda(\delta) \sim 1 + \alpha_3 e^{-\alpha_4 \delta^{\alpha_5}}.
300: \end{equation}
301: Note that this three-parameter model is more complex than the one in
302: Equation~\ref{sim-decay} because $\lambda(\delta=0)$ must also be
303: estimated from the data ($\lambda(q,0) = 1/G_q$). The
304: relationship between link distance and the semantic likelihood factor
305: is less regular than between link distance and lexical similarity.
306: The resulting fit (also shown in
307: Figure~\ref{likelihood}) provides us with a rough estimate of how
308: far in link space we can make inferences about the semantics
309: (relevance) of pages, i.e., up to a critical distance $\delta^*$
310: between 4 and 5 links.
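
As a rough check, inserting the rounded estimates reported in
Figure~\ref{likelihood} ($\alpha_{3} \approx 1000$,
$\alpha_{4} \approx 0.002$ and $\alpha_{5} \approx 5.5$) into the fitted
model gives
\begin{eqnarray*}
\lambda(3) &\approx& 1 + 1000 \, e^{-0.002 \cdot 3^{5.5}} \approx 430, \\
\lambda(4) &\approx& 1 + 1000 \, e^{-0.002 \cdot 4^{5.5}} \approx 18, \\
\lambda(5) &\approx& 1 + 1000 \, e^{-0.002 \cdot 5^{5.5}} \approx 1.001.
\end{eqnarray*}
These values are only indicative, because they are very sensitive to the
rounding of $\alpha_{5}$, but they show the sharp drop of the fitted
likelihood factor toward 1 between 4 and 5 links.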
311:
312: %:\section{Discussion}
313:
314: It is surprising that the link-content and link-cluster conjectures
315: have not been formalized and addressed explicitly before, especially
316: when one looks at the considerable attention recently received by the
317: Web's graph topology\cite{Lawrence98,Butler00}. The correlation
318: between Web links and content takes on additional significance in
319: light of link analysis studies that tell us the Web is a ``small
320: world'' network, i.e., a graph with an inverse power law distribution
321: of in-links and out-links\cite{Albert99,Adamic99}. Small world
322: networks have a mixture of non-random local structure and non-local
random links. Such a topology creates short paths between pages, whose
length scales logarithmically with the number of Web pages. The present
324: results indicate that the Web's local structure is created by the
325: semantic clusters resulting from authors linking their pages to
326: related resources.
327:
328: The link-cluster and link-content conjectures have important normative
implications for future Web search technology. For example, the
measurements in this paper suggest that topic-driven crawlers should
keep track of their position with a bias to remain within a few links
from some relevant source. In such a range, hyperlinks create
detectable signals about lexical and semantic content, despite the
Web's chaotic lack of structure. Absent such signals, the short paths
predicted by the small world model might be very hard to locate for
localized algorithms\cite{Kleinberg00}. In general, the present
337: findings should foster the design of better search tools by
338: integrating traditional search engines with topic- and query-driven
339: crawlers\cite{Menczer01} guided by \emph{local} link and lexical
340: clues. Smart crawlers of this kind are already emerging (see for
341: example \texttt{http://myspiders.biz.uiowa.edu}).
342: Due to the size and dynamic nature of the Web, the
343: efficiency-motivated search engine practice of keeping query
344: processing separate from crawling leads to poor trade-offs between
345: coverage and recency\cite{Lawrence99}. Closing the loop from user
346: queries to smart crawlers will lead to dynamic indices
347: with more scalable and user-driven update algorithms than
348: the centralized ones used today.
349:
350: \begin{thebibliography}{10}
351:
352: \bibitem{Brin98}
353: Brin, S. and Page, L.
354: \newblock The anatomy of a large-scale hypertextual {W}eb search engine.
355: \newblock {\em Computer Networks}{ \bf 30}(1--7), 107--117 (1998).
356:
357: \bibitem{Kleinberg98}
358: Kleinberg, J.
359: \newblock Authoritative sources in a hyperlinked environment.
360: \newblock {\em Journal of the ACM}{ \bf 46}(5), 604--632 (1999).
361:
362: \bibitem{Menczer00}
363: Menczer, F. and Belew, R.
364: \newblock Adaptive retrieval agents: Internalizing local context and scaling up
365: to the {W}eb.
366: \newblock {\em Machine Learning}{ \bf 39}(2--3), 203--242 (2000).
367:
368: \bibitem{Ben-Shaul99etal}
369: Ben-Shaul, I. et~al.
370: \newblock Adding support for dynamic and focused search with {F}etuccino.
371: \newblock {\em Computer Networks}{ \bf 31}(11--16), 1653--1665 (1999).
372:
373: \bibitem{Chakrabarti99}
374: Chakrabarti, S., {van den Berg}, M., and Dom, B.
375: \newblock Focused crawling: A new approach to topic-specific {W}eb resource
376: discovery.
377: \newblock {\em Computer Networks}{ \bf 31}(11--16), 1623--1640 (1999).
378:
379: \bibitem{Lawrence98}
380: Lawrence, S. and Giles, C.
381: \newblock Searching the {W}orld {W}ide {W}eb.
382: \newblock {\em Science}{ \bf 280}, 98--100 (1998).
383:
384: \bibitem{Albert99}
Albert, R., Jeong, H., and Barab\'asi, A.-L.
386: \newblock Diameter of the {W}orld {W}ide {W}eb.
387: \newblock {\em Nature}{ \bf 401}(6749), 130--131 (1999).
388:
389: \bibitem{Adamic99}
390: Adamic, L.
391: \newblock The {S}mall {W}orld {W}eb.
392: \newblock {\em LNCS}{ \bf 1696}, 443--452 (1999).
393:
394: \bibitem{Butler00}
395: Butler, D.
396: \newblock Souped-up search engines.
397: \newblock {\em Nature}{ \bf 405}(6783), 112--115 (2000).
398:
399: \bibitem{Salton83}
400: Salton, G. and McGill, M.
401: \newblock {\em An Introduction to Modern Information Retrieval}.
402: \newblock McGraw-Hill, New York, NY, (1983).
403:
404: \bibitem{vanR79cluster}
405: van Rijsbergen, C.
406: \newblock {\em Information Retrieval}, chapter~3, 30--31.
407: \newblock Butterworths, London (1979).
408: \newblock Second edition.
409:
410: \bibitem{Mendelzon00}
411: Mendelzon, A. and Rafiei, D.
412: \newblock What do the neighbours think? {Computing} web page reputations.
413: \newblock {\em IEEE Data Engineering Bulletin}{ \bf 23}(3), 9--16 (2000).
414:
415: \bibitem{Gibson98}
416: Gibson, D., Kleinberg, J., and Raghavan, P.
417: \newblock Inferring {W}eb communities from link topology.
418: \newblock In {\em Proc. 9th ACM Conference on Hypertext and Hypermedia},
419: 225--234, (1998).
420:
421: \bibitem{Chakrabarti98etal}
422: Chakrabarti, S. et~al.
423: \newblock Automatic resource compilation by analyzing hyperlink structure and
424: associated text.
425: \newblock {\em Computer Networks}{ \bf 30}(1--7), 65--74 (1998).
426:
427: \bibitem{Dean99}
428: Dean, J. and Henzinger, M.
429: \newblock Finding related pages in the {W}orld {W}ide {W}eb.
430: \newblock {\em Computer Networks}{ \bf 31}(11--16), 1467--1479 (1999).
431:
432: \bibitem{Davison00}
433: Davison, B.
434: \newblock Topical locality in the {W}eb.
435: \newblock In {\em Proc. 23rd International ACM SIGIR Conference on Research and
436: Development in Information Retrieval}, 272--279, (2000).
437:
438: \bibitem{Menczer97b}
439: Menczer, F.
440: \newblock {ARACHNID}: {A}daptive {R}etrieval {A}gents {C}hoosing {H}euristic
441: {N}eighborhoods for {I}nformation {D}iscovery.
442: \newblock In {\em Proc. 14th International Conference on Machine Learning},
443: 227--235, (1997).
444:
445: \bibitem{Kleinberg00}
446: Kleinberg, J.
447: \newblock Navigation in a small world.
448: \newblock {\em Nature}{ \bf 406}, 845 (2000).
449:
450: \bibitem{Menczer01}
451: Menczer, F., Pant, G., Ruiz, M., and Srinivasan, P.
452: \newblock Evaluating topic-driven {W}eb crawlers.
453: \newblock In {\em Proc. 24th Annual International ACM SIGIR Conference on
454: Research and Development in Information Retrieval}, (2001).
455:
456: \bibitem{Lawrence99}
457: Lawrence, S. and Giles, C.
458: \newblock Accessibility of information on the {W}eb.
459: \newblock {\em Nature}{ \bf 400}, 107--109 (1999).
460:
461: \bibitem{Porter80}
462: Porter, M.
463: \newblock An algorithm for suffix stripping.
464: \newblock {\em Program}{ \bf 14}(3), 130--137 (1980).
465:
466: \bibitem{Jones72}
467: Sparck~Jones, K.
468: \newblock A statistical interpretation of term specificity and its application
469: in retrieval.
470: \newblock {\em Journal of Documentation}{ \bf 28}, 111--121 (1972).
471:
472: \end{thebibliography}
473:
474: \subsection*{Acknowledgements}
475:
476: {\small The author is grateful to D. Eichmann, P. Srinivasan, W.N.
477: Street, A.M. Segre, R.K. Belew and A. Monge for helpful comments
478: and discussions, and to M. Lee and M. Porter for contributions
479: to the crawling and parsing code.}
480:
481: \textbf{Correspondence and requests for materials should be sent to
482: the author (email: \linebreak \texttt{filippo-menczer@uiowa.edu}).}
483:
484: \newpage
485:
486: \begin{figure}[hp]
487: \centering
488: \includegraphics{yahoo}
489: \caption{Representation of the data collection.
490: 100 topic pages were chosen in the Yahoo directory owing to
491: this portal's wide popularity. Yahoo category
492: pages are marked ``Y'', external pages are marked ``W''.
The topic pages were chosen among ``leaf'' categories, i.e.,
categories without sub-categories. This way
495: the external pages linked by a topic page (``Yq'') represent the
496: relevant set compiled for that topic by the Yahoo editors (shaded).
497: Topics were selected in breadth-first order and therefore
498: covered the full spectrum of Yahoo top-level categories.
499: In this example the topic is \texttt{SOCIETY CULTURE BIBLIOGRAPHY}.
500: Arrows represent hyperlinks and dotted arrows are examples of links
501: pointing back to the relevant set.
For each topic, I performed
503: a breadth-first crawl up to a depth of 3 links. The crawl set is
504: represented inside the dashed line.
505: To obtain meaningful and comparable
506: statistics at $\delta_{l}=1$, only topic pages with at least
507: 5 external links were used, and only the first 10 links
508: for topic pages with over 10 links. Each crawl
509: was stopped if 10,000 pages had been downloaded at depth $\delta_{l}=3$
510: from the start page. A timeout of 60 seconds was applied for each
511: page. The resulting collection comprised 376,483 pages. The text of
512: each fetched page was parsed to extract links and terms. Terms were
513: conflated using a standard stemming algorithm\cite{Porter80}.
514: A common TFIDF weighting scheme\cite{Jones72} was employed to
515: represent each page in word vector space. This model assumes a
516: global measure of term frequency across pages (inverse document
517: frequency). To make the measures scalable with the maximum crawl
518: depth (a parameter), inverse document frequency was computed as a
519: function of distance from the start page, among the set of
520: documents within that distance from the source. Formally, for
521: each topic $q$, page $p$, term $k$ and depth $d$:
522: $w_{p,d,q}^{k} = tf(k,p) \cdot idf(k,d,q)$
523: where $tf(k,p)$ is the number of occurrences
524: of term $k$ in page $p$ and
525: $idf(k,d,q) = 1 + \ln\left(\frac{N_{d}^{q}}{N_{d}^{q}(k)}\right)$.
526: Here $N_{d}^{q}$ is the size of the cumulative page set
527: $P_{d}^{q} = \{ p : \delta_{l}(q,p) \leq d \}$, and
528: $N_{d}^{q}(k)$ is the size of the subset of $P_{d}^{q}$ of pages
529: containing term $k$.}
530: \label{yahoo}
531: \end{figure}
532:
533: \newpage
534:
535: \begin{figure}[hp]
536: \centering
537: \includegraphics{sim}
538: \caption{Scatter plot of $\sigma(q,d)$ versus
539: $\delta(q,d)$ for topics $q=0,\ldots,99$ and
540: depths $d=1,2,3$. Pearson's correlation coefficient $\rho = -0.76,
541: p<0.0001$. The similarity
542: noise level $\sigma_{\infty}$ and an exponential decay fit
of the data are also shown.
544: $\sigma_{\infty}$ was computed by comparing each topic
545: page to external pages linked from different Yahoo categories:
546: $\sigma_{\infty} \equiv \left\langle
547: \frac{1}{N_{1}^{q'}} \sum_{p \in P_{1}^{q'}} \sigma(q,p)
548: \right\rangle_{\{q,q': q \neq q'\}}
549: \approx 0.0318 \pm 0.0006$.
550: The regression yielded
551: parametric estimates $\alpha_{1} \approx 1.8$ and $\alpha_{2} \approx
552: 0.6$.}
553: \label{scatter}
554: \end{figure}
555:
556: \newpage
557:
558: \begin{figure}[hp]
559: \centering
560: \begin{tabular}{rcrc}
561: \textbf{a}
562: &
563: \multicolumn{3}{c}{\raisebox{-1.75in}{\includegraphics{domains}}} \\
564: \textbf{b}
565: &
566: \begin{tabular}{ccc}
567: Domain & $\alpha_{1}$ & $\alpha_{2}$ \\
568: \hline
569: \texttt{edu} & $1.11 \pm 0.03$ & $0.87 \pm 0.05$ \\
570: \texttt{net} & $1.16 \pm 0.04$ & $0.88 \pm 0.05$ \\
571: \texttt{gov} & $1.22 \pm 0.07$ & $1.00 \pm 0.09$ \\
572: \texttt{org} & $1.38 \pm 0.03$ & $0.93 \pm 0.05$ \\
573: \texttt{com} & $1.63 \pm 0.04$ & $1.13 \pm 0.05$ \\
574: \hline
575: \end{tabular}
576: &
577: \textbf{c}
578: &
579: \raisebox{-0.8in}{\includegraphics[width=2.5in]{domains-stat}}
580: \end{tabular}
581: \caption{\textbf{a.} Scatter plot of $\sigma(q,d)$ versus
582: $\delta(q,d)$ for topics $q=0,\ldots,99$ and
583: depths $d=1,2,3$, for each of the major US top-level domains.
584: The domain sets were obtained by simulating crawlers that only
585: follow links to servers within each domain.
586: An exponential decay fit is also shown for each domain.
\textbf{b.} Exponential decay model parameters obtained by
nonlinear least-squares fit to each domain's data.
589: \textbf{c.} Summary of statistically
590: significant differences (at the 68.3\% confidence level) between the
591: parametric estimates; dashed arrows represent significant differences
592: in $\alpha_{1}$ only, and solid arrows significant differences in
593: both $\alpha_{1}$ and $\alpha_{2}$.}
594: \label{domains}
595: \end{figure}
596:
597: \newpage
598:
599: \begin{figure}[hp]
600: \centering
601: \includegraphics{likelihood}
602: \caption{Scatter plot of $\lambda(q,d)$ versus
603: $\delta(q,d)$ for topics $q=0,\ldots,99$ and depths $d=1,2,3$.
604: Pearson's $\rho = -0.1, p=0.09$. In computing $\lambda(q,d)$
605: from Definition~\ref{L-def}, the relevant set $Q_q$ compiled by the
606: Yahoo editors for each topic $q$ was used to estimate
607: $R_q(d) \simeq \frac{|P_{d}^{q} \cap Q_q|}{N_{d}^{q}}$
608: (cf. dotted links in Figure~\ref{yahoo}).
609: Generality was approximated by
610: $G_q \simeq \frac{|Q'_q|}{|\bigcup_{q' \in Y} Q'_{q'}|}$ where
611: all of the relevant links for each topic $q$ are included in $Q'_q$,
612: even for topics where only the first 10 links were used in
613: the crawl ($Q'_q \supseteq Q_q$), and
614: the set $Y$ in the denominator includes all Yahoo leaf categories.
615: An exponential decay fit of the data is
616: also shown. The regression yielded parametric estimates
617: $\alpha_{3} \approx 1000$, $\alpha_{4} \approx 0.002$ and $\alpha_{5}
618: \approx 5.5$.}
619: \label{likelihood}
620: \end{figure}
621:
622: \end{document}
624: