1: \documentclass[11pt]{article}

2:

3: \newif\ifpdf

4: \ifx\pdfoutput\undefined

5: \pdffalse % we are not running PDFLaTeX

6: \else

7: \pdfoutput=1 % we are running PDFLaTeX

8: \pdftrue

9: \fi

10:

11: \ifpdf

12: \usepackage[pdftex]{graphicx}

13: \else

14: \usepackage{graphicx}

15: \fi

16:

17: \textwidth = 6.5 in

18: \textheight = 9 in

19: \oddsidemargin = 0.0 in

20: \evensidemargin = 0.0 in

21: \topmargin = 0.0 in

22: \headheight = 0.0 in

23: \headsep = 0.0 in

24: \parskip = 0.2in

25: \parindent = 0.0in

26:

27: \usepackage{cite}

28: \usepackage{citesupernumber}

29: \usepackage{nature}

30: \bibliographystyle{nature}

31:

32: \title{Links tell us about lexical and semantic Web content}

33:

34: \author{Filippo Menczer\\

35: Department of Management Sciences\\

36: The University of Iowa\\

37: Iowa City, IA 52242\\

38: %\texttt{filippo-menczer@uiowa.edu}

39: }

40:

41: \begin{document}

42:

43: \ifpdf

44: \DeclareGraphicsExtensions{.pdf, .png}

45: \else

46: \DeclareGraphicsExtensions{.eps, .png}

47: \fi

48:

49: \date{}

50:

51: \maketitle

52:

53: \textbf{The latest generation of Web search tools is beginning to

54: exploit hypertext link information to improve

55: ranking\cite{Brin98,Kleinberg98} and

56: crawling\cite{Menczer00,Ben-Shaul99etal,Chakrabarti99} algorithms.  The

57: hidden assumption behind such approaches, a correlation between the

58: graph structure of the Web and its content, has not been tested

59: explicitly despite increasing research on Web

60: topology\cite{Lawrence98,Albert99,Adamic99,Butler00}.  Here I

61: formalize and quantitatively validate two conjectures drawing

62: connections from link information to lexical and semantic Web content.

63: The \emph{link-content conjecture} states that a page is similar to

64: the pages that link to it, i.e., one can infer the lexical content of

65: a page by looking at the pages that link to it.  I also show that

66: lexical inferences based on link cues are quite heterogeneous across

67: Web communities.  The \emph{link-cluster conjecture} states that pages

68: about the same topic are clustered together, i.e., one can infer the

69: meaning of a page by looking at its neighbours.  These results explain

70: the success of the newest search technologies and open the way for

71: more dynamic and scalable methods to locate information in a topic- or

72: user-driven way.}

73:

74: %:\section{Intro}

75:

76: All search engines perform two basic functions: (i) crawling Web

77: pages to maintain an index, and (ii) matching URLs in the index

78: database against user queries.  Effective search engines achieve a

79: high coverage of the Web, keep their index fresh, and rank hits in a

80: way that correlates with the user's notion of relevance.  Ranking and

81: crawling algorithms use cues from words and hyperlinks, associated

82: respectively with \emph{lexical} and \emph{link topology}.  In the

83: former, two pages are close to each other if they have similar textual

84: content; in the latter, if there is a short path between them.  Lexical

85: metrics are traditionally used by search engines to rank hits

86: according to their similarity to the query, thus attempting to infer

87: the semantics of pages from their lexical representation.  Similarity

88: metrics are derived from the vector space model\cite{Salton83}, which

89: represents each document or query by a vector with one dimension for

90: each term and a weight along that dimension that estimates the term's

91: contribution to the meaning of the document.  The \emph{cluster

92: hypothesis} behind this model is that a document lexically close to a

93: relevant document is also relevant with high probability\cite{vanR79cluster}.

94: Links have traditionally been used by search engine crawlers only in

95: exhaustive, centralized algorithms.  However, the latest generation of

96: Web search tools is beginning to integrate lexical and link metrics to

97: improve ranking and crawling performance through better models of

98: relevance.  The best known example is the \emph{PageRank} metric used

99: by Google: pages containing the query's lexical features are ranked

100: using query-independent link analysis\cite{Brin98}.  Links are also

101: used in conjunction with text to identify hub and authority pages for

102: a certain subject\cite{Kleinberg98}, determine the reputation of a

103: given site\cite{Mendelzon00}, and guide search agents crawling on

104: behalf of users or topical search

105: engines\cite{Menczer00,Ben-Shaul99etal,Chakrabarti99}.

106:

107: %:\section{The link-content conjecture}

108:

109: To study the connection between link and lexical topologies, I

110: conjecture a positive correlation between distance measures defined in

111: the two spaces.  Given any pair of Web

112: pages $(p_{1},p_{2})$ we have well-defined distance functions

113: $\delta_{l}$ and $\delta_{t}$ in link and lexical space,

114: respectively.  To compute $\delta_{l}(p_{1},p_{2})$ we use the Web

115: hypertext structure to find the length, in links, of the shortest path

116: from $p_{1}$ to $p_{2}$. (This is not a metric distance

117: because it is not symmetric in a directed graph, but

118: for convenience I refer to $\delta_{l}$ as ``distance''.)

119: To compute $\delta_{t}(p_{1},p_{2})$ we

120: can use the vector representations of the two pages, where the vector

121: components (weights) of page $p$, $w_{p}^{k}$, are computed for terms

122: $k$ in the textual content of $p$ given some weighting scheme.  One

123: possibility would be to use Euclidean distance in this word vector

124: space, or any other $L_{z}$ norm.

125: However, $L_{z}$ metrics have a dependency on the

126: dimensionality of the pages, i.e., larger documents tend to appear

127: more distant from each other than shorter ones, irrespective of

128: content.  To circumvent this problem, one can instead define a metric

129: based on the \emph{similarity} between pages.

130: Let us use the \emph{cosine similarity} function, a standard measure

131: in information retrieval:

132: \begin{equation}

133: 	\sigma(p_{1},p_{2}) = \frac{\sum_{k \in p_{1} \cap p_{2}}

134: 	w_{p_{1}}^{k} w_{p_{2}}^{k}}

135: 	{\sqrt{\sum_{k \in p_{1}} (w_{p_{1}}^{k})^{2}

136: 	\sum_{k \in p_{2}} (w_{p_{2}}^{k})^{2}}}.

137: 	\label{eq:sim}

138: \end{equation}
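
As an illustration with made-up weights (not taken from the crawl data),
suppose $p_{1}$ contains the terms \emph{web} and \emph{link} with weights
3 and 4, and $p_{2}$ contains the terms \emph{link} and \emph{page} with
weights 4 and 3.  Only \emph{link} is shared, so Equation~\ref{eq:sim} gives
\[
	\sigma(p_{1},p_{2}) =
	\frac{4 \cdot 4}{\sqrt{(3^{2}+4^{2})(4^{2}+3^{2})}} =
	\frac{16}{25} = 0.64.
\]
Multiplying every weight of $p_{2}$ by a common factor of 10 scales the
numerator and the denominator equally and leaves $\sigma$ unchanged, whereas
in this example the Euclidean distance between the two vectors would grow
from about 4.2 to about 47.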

139:

140: According to the link-content conjecture, $\sigma$ is anticorrelated

141: with $\delta_{l}$.  The idea is to measure the correlation between the

142: two measures across pairs of pages.  Figure~\ref{yahoo}

143: illustrates how a collection of Web pages was crawled and processed

144: for this purpose.

145:

146: %:\subsection{Correlation of lexical and link distance}

147:

148: The link distances $\delta_{l}(q,p)$ and similarities $\sigma(q,p)$

149: were averaged for each topic $q$ over all pages $p$ in the crawl set

150: $P_{d}^{q}$ for each depth $d$:

151: \begin{eqnarray}

152: 	\delta(q,d) &\equiv& \langle \delta_{l}(q,p) \rangle_{P_{d}^{q}} =

153: 		\frac{1}{N_{d}^{q}} \sum_{i=1}^{d} i \cdot

154: 		(N_{i}^{q} - N_{i-1}^{q}) \label{eq:Laver} \\

155: 	\sigma(q,d) &\equiv& \langle \sigma(q,p) \rangle_{P_{d}^{q}} =

156: 		\frac{1}{N_{d}^{q}} \sum_{p \in P_{d}^{q}}

157: 		\sigma(q,p). \label{eq:Saver}

158: \end{eqnarray}
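
To make Equation~\ref{eq:Laver} concrete, consider hypothetical cumulative
set sizes (chosen for illustration, not measured values) $N_{0}^{q}=1$,
$N_{1}^{q}=11$, $N_{2}^{q}=111$ and $N_{3}^{q}=1111$, i.e., 10, 100 and 1000
new pages at depths 1, 2 and 3.  Taking $N_{0}^{q}=1$ assumes that the topic
page itself is the only page at distance 0.  Then
\begin{eqnarray*}
	\delta(q,1) &=& \frac{1 \cdot 10}{11} \approx 0.91 \\
	\delta(q,2) &=& \frac{1 \cdot 10 + 2 \cdot 100}{111} \approx 1.89 \\
	\delta(q,3) &=& \frac{1 \cdot 10 + 2 \cdot 100 + 3 \cdot 1000}{1111}
		\approx 2.89.
\end{eqnarray*}
The average distance stays below $d$ because the cumulative set
$P_{d}^{q}$ also contains the pages found at smaller depths.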

159:

160: The 300 measures (100 topics at 3 depths each) of $\delta(q,d)$ and $\sigma(q,d)$

161: from Equations~\ref{eq:Laver} and \ref{eq:Saver} are shown in

162: Figure~\ref{scatter}.  The two metrics are indeed well

163: anticorrelated and predictive of each other with high

164: statistical significance.  This quantitatively

165: confirms the link-content conjecture.

166:

167: %:\subsection{Range of link-based lexical predictions}

168:

169: To analyze the decrease in the reliability of lexical content

170: inferences with distance from the topic page in link space one can

171: perform a nonlinear least-squares fit of these data to a family of

172: exponential decay models:

173: \begin{equation}

174: 	\sigma(\delta) \sim \sigma_{\infty} +

175: 		(1 - \sigma_{\infty}) e^{-\alpha_{1} \delta^{\alpha_{2}}}

176: 	\label{sim-decay}

177: \end{equation}

178: using the 300 points as independent samples. Here $\sigma_{\infty}$

179: is the noise level in similarity.

180: Note that while starting from Yahoo pages may bias $\sigma(\delta<1)$

181: upward, the decay fit is most affected by the constraint

182: $\sigma(\delta=0) = 1$ (by definition of similarity) and by the

183: longer-range measures $\sigma(\delta>1)$.

184: The similarity decay fit curve is also shown in Figure~\ref{scatter}. It

185: provides us with a rough estimate of how far in link space one

186: can make inferences about lexical content.
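
As a rough numerical illustration, one can plug the rounded estimates
reported in the caption of Figure~\ref{scatter}
($\sigma_{\infty} \approx 0.0318$, $\alpha_{1} \approx 1.8$,
$\alpha_{2} \approx 0.6$) into Equation~\ref{sim-decay}:
\begin{eqnarray*}
	\sigma(1) &\approx& 0.0318 + 0.9682 \, e^{-1.8} \approx 0.19 \\
	\sigma(2) &\approx& 0.0318 + 0.9682 \, e^{-1.8 \cdot 2^{0.6}} \approx 0.10 \\
	\sigma(3) &\approx& 0.0318 + 0.9682 \, e^{-1.8 \cdot 3^{0.6}} \approx 0.06.
\end{eqnarray*}
Because the parameters are rounded, these values are indicative only.  Under
this fit the expected similarity three links away from the topic page is
still roughly twice the noise level.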

187:

188: %:\subsection{Heterogeneity of link-based lexical cues}

189:

190: How heterogeneous is the reliability of lexical inferences based on

191: link neighbourhood across communities of Web content providers?  To

192: answer this question the crawled pages were divided up into connected

193: sets within top-level Internet domains.  The scatter plot of the

194: $\delta(q,d)$ and $\sigma(q,d)$ measures for these domain-based crawls

195: is shown in Figure~\ref{domains}a.  The plot illustrates the

196: heterogeneity in the reliability of lexical inferences based on link

197: cues across domains.  The parameters obtained from fitting each domain

198: data to the exponential decay model of Equation~\ref{sim-decay}

199: (Figure~\ref{domains}b) estimate how reliably links point to lexically

200: related pages in each domain.  A summary of the statistically

201: significant differences among the parametric estimates is shown in

202: Figure~\ref{domains}c.  It is evident that, for example,

203: academic Web pages are better connected to each other than commercial

204: pages in that they do a better job at pointing to other similar pages.

205: In other words, it is easier to find related pages by browsing through

206: academic pages than through commercial pages.  This is not surprising

207: considering the different goals of the two communities.

208:

209: %:\section{The link-cluster conjecture}

210:

211: The link-cluster conjecture is a link-based analog of the cluster

212: hypothesis, stating that pages within a few links from a relevant

213: source are also relevant with high probability.

214: Here I experimentally assess the extent to which relevance is

215: preserved within link space neighbourhoods, and the decay in expected

216: relevance as one browses away from a relevant page.

217:

218: The link-cluster conjecture has been implied or stated in various

219: forms\cite{Kleinberg98,Gibson98,Brin98,Chakrabarti98etal,Dean99,Davison00}.

220: One can most simply and generally state it in terms of the conditional

221: probability that a page $p$ is relevant with respect to some query $q$,

222: given that page $r$ is relevant and that $p$ is within $d$ links from $r$:

223: \begin{equation}

224: 	R_q(d) \equiv

225: 		\Pr[rel_q(p) \: | \: rel_q(r) \wedge \delta_{l}(r,p) \leq d]

226: \end{equation}

227: where $rel_q()$ is a binary relevance assessment with respect to $q$.

228: In other words a page has a higher than random probability of being

229: about a certain topic if it is in the neighbourhood of other pages

230: about that topic. $R_q(d)$ is the posterior relevance probability

231: given the evidence of a relevant page nearby. The simplest form of

232: the link-cluster conjecture is stated by comparing $R_q(1)$ to the

233: prior relevance probability $G_q$:

234: \begin{equation}

235: 	G_q \equiv \Pr[rel_q(p)]

236: \end{equation}

237: also known as the \emph{generality} of the query. If link

238: neighbourhoods allow for semantic inferences, then the following

239: condition must hold:

240: \begin{equation}

241: 	\lambda(q,d=1) \equiv \frac{R_q(1)}{G_q} > 1.

242: 	\label{def_L}

243: \end{equation}

244: To illustrate the meaning of the link-cluster conjecture, consider

245: a random crawler (or user) searching for pages about a topic $q$.

246: Call $\eta_q(t)$ the probability that the crawler hits a relevant

247: page at time $t$. Solving the recursion

248: \begin{equation}

249: 	\eta_q(t+1) = \eta_q(t) \cdot R_q(1) + (1 - \eta_q(t)) \cdot G_q

250: \end{equation}

251: for $\eta_q(t+1) = \eta_q(t)$ yields the stationary hit rate

252: \begin{equation}

253: 	\eta_q^* = \frac{G_q}{1 + G_q - R_q(1)}.

254: \end{equation}

255: The link-cluster conjecture is a necessary and sufficient condition

256: for such a crawler to have a better than chance hit rate, thus

257: justifying the crawling (and browsing!) activity:

258: \begin{equation}

259: 	\eta_q^* > G_q \Longleftrightarrow \lambda(q,1) > 1.

260: \end{equation}
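
The steps behind these two results are short.  Setting
$\eta_q(t+1) = \eta_q(t) = \eta_q^*$ in the recursion gives
\begin{eqnarray*}
	\eta_q^* &=& \eta_q^* \cdot R_q(1) + (1 - \eta_q^*) \cdot G_q \\
	\eta_q^* \, [1 + G_q - R_q(1)] &=& G_q,
\end{eqnarray*}
from which the stationary hit rate follows.  Since $R_q(1) \leq 1$ and
$G_q > 0$, the denominator $1 + G_q - R_q(1)$ is positive, so
$\eta_q^* > G_q$ holds exactly when $1 + G_q - R_q(1) < 1$, that is, when
$R_q(1) > G_q$, or equivalently $\lambda(q,1) > 1$.  For instance, with the
purely hypothetical values $G_q = 0.01$ and $R_q(1) = 0.5$, one would obtain
$\eta_q^* = 0.01/0.51 \approx 0.02$, about twice the chance rate.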

261:

262: Definition~\ref{def_L} can be generalized to likelihood factors over

263: larger neighbourhoods:

264: \begin{equation}

265: 	\lambda(q,d) \equiv \frac{R_q(d)}{G_q} \stackrel{d \rightarrow

266: \infty}{\longrightarrow} 1

267: 	\label{L-def}

268: \end{equation}

269: and a stronger version of the conjecture can be formulated as follows:

270: \begin{equation}

271: 	\lambda(q,d) \gg 1 \; \mbox{for} \; \delta(q,d) < \delta^*

272: \end{equation}

273: where $\delta^*$ is a critical link distance beyond which semantic

274: inferences are unreliable.

275:

276: %:\subsection{Semantic clusters in link neighbourhoods}

277:

278: I first attempted to measure the likelihood factor $\lambda(q,1)$ for

279: a few queries and found that

280: \linebreak $\langle \lambda(q,1) \rangle_q \gg 1$,

281: but those estimates were based on very noisy relevance

282: assessments\cite{Menczer97b}. To obtain a reliable quantitative

283: validation of the stronger link-cluster conjecture, I repeated such

284: measurements on the data set described in Figure~\ref{yahoo}.

285:

286: The 300 measures of $\lambda(q,d)$ thus obtained are plotted versus

287: $\delta(q,d)$ from  Equation~\ref{eq:Laver} in

288: Figure~\ref{likelihood}. Closeness to a relevant page in link space

289: is highly predictive of relevance, increasing the relevance

290: probability by a likelihood factor $\lambda(q,d) \gg 1$ over the

291: range of observed distances and queries.

292:

293: %:\subsection{Expected relevance decay in link space}

294:

295: I also performed a nonlinear least-squares fit of these data to a

296: family of exponential decay functions using the 300 points as

297: independent samples:

298: \begin{equation}

299: 	\lambda(\delta) \sim 1 + \alpha_3 e^{-\alpha_4 \delta^{\alpha_5}}.

300: \end{equation}

301: Note that this three-parameter model is more complex than the one in

302: Equation~\ref{sim-decay} because $\lambda(\delta=0)$ must also be

303: estimated from the data ($\lambda(q,0) = 1/G_q$). The

304: relationship between link distance and the semantic likelihood factor

305: is less regular than between link distance and lexical similarity.

306: The resulting fit (also shown in

307: Figure~\ref{likelihood}) provides us with a rough estimate of how

308: far in link space we can make inferences about the semantics

309: (relevance) of pages, i.e., up to a critical distance $\delta^*$

310: between 4 and 5 links.
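
To see where this critical distance comes from, one can evaluate the fitted
model with the rounded estimates reported in the caption of
Figure~\ref{likelihood} ($\alpha_{3} \approx 1000$,
$\alpha_{4} \approx 0.002$, $\alpha_{5} \approx 5.5$):
\begin{eqnarray*}
	\lambda(3) &\approx& 1 + 1000 \, e^{-0.002 \cdot 3^{5.5}} \approx 432 \\
	\lambda(4) &\approx& 1 + 1000 \, e^{-0.002 \cdot 4^{5.5}} \approx 18 \\
	\lambda(5) &\approx& 1 + 1000 \, e^{-0.002 \cdot 5^{5.5}} \approx 1.001.
\end{eqnarray*}
These figures are only indicative, since the large exponent $\alpha_{5}$
makes the curve very sensitive to rounding of the parameters.  They do show
the fitted likelihood factor collapsing from $\lambda \gg 1$ to
$\lambda \approx 1$ between 4 and 5 links.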

311:

312: %:\section{Discussion}

313:

314: It is surprising that the link-content and link-cluster conjectures

315: have not been formalized and addressed explicitly before, especially

316: when one looks at the considerable attention recently received by the

317: Web's graph topology\cite{Lawrence98,Butler00}.  The correlation

318: between Web links and content takes on additional significance in

319: light of link analysis studies that tell us the Web is a ``small

320: world'' network, i.e., a graph with an inverse power law distribution

321: of in-links and out-links\cite{Albert99,Adamic99}.  Small world

322: networks have a mixture of non-random local structure and non-local

323: random links. Such a topology creates short paths between pages, whose length scales logarithmically with the number of Web pages.  The present

324: results indicate that the Web's local structure is created by the

325: semantic clusters resulting from authors linking their pages to

326: related resources.

327:

328: The link-cluster and link-content conjectures have important normative

329: implications for future Web search technology.  For example, the

330: measurements in this paper suggest that topic-driven crawlers should

331: keep track of their position with a bias to remain within a few links

332: from some relevant source.  In such a range hyperlinks create

333: detectable signals about lexical and semantic content, despite the

334: Web's chaotic lack of structure.  Absent such signals, the short paths

335: predicted by the small world model might be very hard to locate for

336: localized algorithms\cite{Kleinberg00}.  In general, the present

337: findings should foster the design of better search tools by

338: integrating traditional search engines with topic- and query-driven

339: crawlers\cite{Menczer01} guided by \emph{local} link and lexical

340: clues.  Smart crawlers of this kind are already emerging (see for

341: example \texttt{http://myspiders.biz.uiowa.edu}).

342: Due to the size and dynamic nature of the Web, the

343: efficiency-motivated search engine practice of keeping query

344: processing separate from crawling leads to poor trade-offs between

345: coverage and recency\cite{Lawrence99}.  Closing the loop from user

346: queries to smart crawlers will lead to dynamic indices

347: with more scalable and user-driven update algorithms than

348: the centralized ones used today.

349:

350: \begin{thebibliography}{10}

351:

352: \bibitem{Brin98}

353: Brin, S. and Page, L.

354: \newblock The anatomy of a large-scale hypertextual {W}eb search engine.

355: \newblock {\em Computer Networks}{ \bf 30}(1--7), 107--117 (1998).

356:

357: \bibitem{Kleinberg98}

358: Kleinberg, J.

359: \newblock Authoritative sources in a hyperlinked environment.

360: \newblock {\em Journal of the ACM}{ \bf 46}(5), 604--632 (1999).

361:

362: \bibitem{Menczer00}

363: Menczer, F. and Belew, R.

364: \newblock Adaptive retrieval agents: Internalizing local context and scaling up

365:   to the {W}eb.

366: \newblock {\em Machine Learning}{ \bf 39}(2--3), 203--242 (2000).

367:

368: \bibitem{Ben-Shaul99etal}

369: Ben-Shaul, I. et~al.

370: \newblock Adding support for dynamic and focused search with {F}etuccino.

371: \newblock {\em Computer Networks}{ \bf 31}(11--16), 1653--1665 (1999).

372:

373: \bibitem{Chakrabarti99}

374: Chakrabarti, S., {van den Berg}, M., and Dom, B.

375: \newblock Focused crawling: A new approach to topic-specific {W}eb resource

376:   discovery.

377: \newblock {\em Computer Networks}{ \bf 31}(11--16), 1623--1640 (1999).

378:

379: \bibitem{Lawrence98}

380: Lawrence, S. and Giles, C.

381: \newblock Searching the {W}orld {W}ide {W}eb.

382: \newblock {\em Science}{ \bf 280}, 98--100 (1998).

383:

384: \bibitem{Albert99}

385: Albert, R., Jeong, H., and Barab\'asi, A.-L.

386: \newblock Diameter of the {W}orld {W}ide {W}eb.

387: \newblock {\em Nature}{ \bf 401}(6749), 130--131 (1999).

388:

389: \bibitem{Adamic99}

390: Adamic, L.

391: \newblock The {S}mall {W}orld {W}eb.

392: \newblock {\em LNCS}{ \bf 1696}, 443--452 (1999).

393:

394: \bibitem{Butler00}

395: Butler, D.

396: \newblock Souped-up search engines.

397: \newblock {\em Nature}{ \bf 405}(6783), 112--115 (2000).

398:

399: \bibitem{Salton83}

400: Salton, G. and McGill, M.

401: \newblock {\em An Introduction to Modern Information Retrieval}.

402: \newblock McGraw-Hill, New York, NY (1983).

403:

404: \bibitem{vanR79cluster}

405: van Rijsbergen, C.

406: \newblock {\em Information Retrieval}, chapter~3,  30--31.

407: \newblock Butterworths, London (1979).

408: \newblock Second edition.

409:

410: \bibitem{Mendelzon00}

411: Mendelzon, A. and Rafiei, D.

412: \newblock What do the neighbours think? {Computing} web page reputations.

413: \newblock {\em IEEE Data Engineering Bulletin}{ \bf 23}(3), 9--16 (2000).

414:

415: \bibitem{Gibson98}

416: Gibson, D., Kleinberg, J., and Raghavan, P.

417: \newblock Inferring {W}eb communities from link topology.

418: \newblock In {\em Proc. 9th ACM Conference on Hypertext and Hypermedia},

419:   225--234 (1998).

420:

421: \bibitem{Chakrabarti98etal}

422: Chakrabarti, S. et~al.

423: \newblock Automatic resource compilation by analyzing hyperlink structure and

424:   associated text.

425: \newblock {\em Computer Networks}{ \bf 30}(1--7), 65--74 (1998).

426:

427: \bibitem{Dean99}

428: Dean, J. and Henzinger, M.

429: \newblock Finding related pages in the {W}orld {W}ide {W}eb.

430: \newblock {\em Computer Networks}{ \bf 31}(11--16), 1467--1479 (1999).

431:

432: \bibitem{Davison00}

433: Davison, B.

434: \newblock Topical locality in the {W}eb.

435: \newblock In {\em Proc. 23rd International ACM SIGIR Conference on Research and

436:   Development in Information Retrieval}, 272--279 (2000).

437:

438: \bibitem{Menczer97b}

439: Menczer, F.

440: \newblock {ARACHNID}: {A}daptive {R}etrieval {A}gents {C}hoosing {H}euristic

441:   {N}eighborhoods for {I}nformation {D}iscovery.

442: \newblock In {\em Proc. 14th International Conference on Machine Learning},

443:   227--235 (1997).

444:

445: \bibitem{Kleinberg00}

446: Kleinberg, J.

447: \newblock Navigation in a small world.

448: \newblock {\em Nature}{ \bf 406}, 845 (2000).

449:

450: \bibitem{Menczer01}

451: Menczer, F., Pant, G., Ruiz, M., and Srinivasan, P.

452: \newblock Evaluating topic-driven {W}eb crawlers.

453: \newblock In {\em Proc. 24th Annual International ACM SIGIR Conference on

454:   Research and Development in Information Retrieval} (2001).

455:

456: \bibitem{Lawrence99}

457: Lawrence, S. and Giles, C.

458: \newblock Accessibility of information on the {W}eb.

459: \newblock {\em Nature}{ \bf 400}, 107--109 (1999).

460:

461: \bibitem{Porter80}

462: Porter, M.

463: \newblock An algorithm for suffix stripping.

464: \newblock {\em Program}{ \bf 14}(3), 130--137 (1980).

465:

466: \bibitem{Jones72}

467: Sparck~Jones, K.

468: \newblock A statistical interpretation of term specificity and its application

469:   in retrieval.

470: \newblock {\em Journal of Documentation}{ \bf 28}, 111--121 (1972).

471:

472: \end{thebibliography}

473:

474: \subsection*{Acknowledgements}

475:

476: {\small The author is grateful to D. Eichmann, P. Srinivasan, W.N.

477: Street, A.M. Segre, R.K. Belew and A. Monge for helpful comments

478: and discussions, and to M. Lee and M. Porter for contributions

479: to the crawling and parsing code.}

480:

481: \textbf{Correspondence and requests for materials should be sent to

482: the author (email: \linebreak \texttt{filippo-menczer@uiowa.edu}).}

483:

484: \newpage

485:

486: \begin{figure}[hp]

487: 	\centering

488: 	\includegraphics{yahoo}

489: 	\caption{Representation of the data collection.

490: 	100 topic pages were chosen in the Yahoo directory owing to

491: 	this portal's wide popularity. Yahoo category

492: 	pages are marked ``Y'', external pages are marked ``W''.

493: 	The topic pages were chosen among ``leaf'' categories, i.e.,

494: 	without sub-categories.  This way

495: 	the external pages linked by a topic page (``Yq'') represent the

496: 	relevant set compiled for that topic by the Yahoo editors (shaded).

497: 	Topics were selected in breadth-first order and therefore

498: 	covered the full spectrum of Yahoo top-level categories.

499: 	In this example the topic is \texttt{SOCIETY CULTURE BIBLIOGRAPHY}.

500: 	Arrows represent hyperlinks and dotted arrows are examples of links

501: 	pointing back to the relevant set.

502: 	For each topic, I performed

503: 	a breadth-first crawl up to a depth of 3 links.  The crawl set is

504: 	represented inside the dashed line.

505: 	To obtain meaningful and comparable

506: 	statistics at $\delta_{l}=1$, only topic pages with at least

507: 	5 external links were used, and only the first 10 links

508: 	for topic pages with over 10 links.  Each crawl

509: 	was stopped if 10,000 pages had been downloaded at depth $\delta_{l}=3$

510: 	from the start page.  A timeout of 60 seconds was applied for each

511: 	page.  The resulting collection comprised 376,483 pages.  The text of

512: 	each fetched page was parsed to extract links and terms. Terms were

513: 	conflated using a standard stemming algorithm\cite{Porter80}.

514: 	A common TFIDF weighting scheme\cite{Jones72} was employed to

515: 	represent each page in word vector space.  This model assumes a

516: 	global measure of term frequency across pages (inverse document

517: 	frequency).  To make the measures scalable with the maximum crawl

518: 	depth (a parameter), inverse document frequency was computed as a

519: 	function of distance from the start page, among the set of

520: 	documents within that distance from the source.  Formally, for

521: 	each topic $q$, page $p$, term $k$ and depth $d$:

522: 	$w_{p,d,q}^{k} = tf(k,p) \cdot idf(k,d,q)$

523: 	where $tf(k,p)$ is the number of occurrences

524: 	of term $k$ in page $p$ and

525: 	$idf(k,d,q) = 1 + \ln\left(\frac{N_{d}^{q}}{N_{d}^{q}(k)}\right)$.

526: 	Here $N_{d}^{q}$ is the size of the cumulative page set

527: 	$P_{d}^{q} = \{ p : \delta_{l}(q,p) \leq d \}$, and

528: 	$N_{d}^{q}(k)$ is the size of the subset of $P_{d}^{q}$ of pages

529: 	containing term $k$.}

530: 	\label{yahoo}

531: \end{figure}

532:

533: \newpage

534:

535: \begin{figure}[hp]

536: 	\centering

537: 	\includegraphics{sim}

538: 	\caption{Scatter plot of $\sigma(q,d)$ versus

539: 	$\delta(q,d)$ for topics $q=0,\ldots,99$ and

540: 	depths $d=1,2,3$. Pearson's correlation coefficient $\rho = -0.76,

541: 	p<0.0001$. The similarity

542: 	noise level $\sigma_{\infty}$ and an exponential decay fit

543: 	of the data are also shown.

544: 	$\sigma_{\infty}$ was computed by comparing each topic

545: 	page to external pages linked from different Yahoo categories:

546: 	$\sigma_{\infty} \equiv \left\langle

547: 	\frac{1}{N_{1}^{q'}} \sum_{p \in P_{1}^{q'}} \sigma(q,p)

548: 	\right\rangle_{\{q,q': q \neq q'\}}

549: 	\approx 0.0318 \pm 0.0006$.

550: 	The regression yielded

551: 	parametric estimates $\alpha_{1} \approx 1.8$ and $\alpha_{2} \approx

552: 	0.6$.}

553: 	\label{scatter}

554: \end{figure}

555:

556: \newpage

557:

558: \begin{figure}[hp]

559: \centering

560: \begin{tabular}{rcrc}

561: 	\textbf{a}

562: 	&

563: 	\multicolumn{3}{c}{\raisebox{-1.75in}{\includegraphics{domains}}} \\

564: 	\textbf{b}

565: 	&

566: 	\begin{tabular}{ccc}

567: 		Domain & $\alpha_{1}$ & $\alpha_{2}$ \\

568: 		\hline

569: 		\texttt{edu} & $1.11 \pm 0.03$ & $0.87 \pm 0.05$ \\

570: 		\texttt{net} & $1.16 \pm 0.04$ & $0.88 \pm 0.05$ \\

571: 		\texttt{gov} & $1.22 \pm 0.07$ & $1.00 \pm 0.09$ \\

572: 		\texttt{org} & $1.38 \pm 0.03$ & $0.93 \pm 0.05$ \\

573: 		\texttt{com} & $1.63 \pm 0.04$ & $1.13 \pm 0.05$ \\

574: 		\hline

575: 	\end{tabular}

576: 	&

577: 	\textbf{c}

578: 	&

579: 	\raisebox{-0.8in}{\includegraphics[width=2.5in]{domains-stat}}

580: \end{tabular}

581: \caption{\textbf{a.} Scatter plot of $\sigma(q,d)$ versus

582: $\delta(q,d)$ for topics $q=0,\ldots,99$ and

583: depths $d=1,2,3$, for each of the major US top-level domains.

584: The domain sets were obtained by simulating crawlers that only

585: follow links to servers within each domain.

586: An exponential decay fit is also shown for each domain.

587: \textbf{b.} Exponential decay model parameters obtained by

588: nonlinear least-squares fit of each domain data.

589: \textbf{c.} Summary of statistically

590: significant differences (at the 68.3\% confidence level) between the

591: parametric estimates; dashed arrows represent significant differences

592: in $\alpha_{1}$ only, and solid arrows significant differences in

593: both $\alpha_{1}$ and $\alpha_{2}$.}

594: \label{domains}

595: \end{figure}

596:

597: \newpage

598:

599: \begin{figure}[hp]

600: 	\centering

601: 	\includegraphics{likelihood}

602: 	\caption{Scatter plot of $\lambda(q,d)$ versus

603: 	$\delta(q,d)$ for topics $q=0,\ldots,99$ and depths $d=1,2,3$.

604: 	Pearson's $\rho = -0.1, p=0.09$. In computing $\lambda(q,d)$

605: 	from Definition~\ref{L-def}, the relevant set $Q_q$ compiled by the

606: 	Yahoo editors for each topic $q$ was used to estimate

607: 	$R_q(d) \simeq \frac{|P_{d}^{q} \cap Q_q|}{N_{d}^{q}}$

608: 	(cf. dotted links in Figure~\ref{yahoo}).

609: 	Generality was approximated by

610: 	$G_q \simeq \frac{|Q'_q|}{|\bigcup_{q' \in Y} Q'_{q'}|}$ where

611: 	all of the relevant links for each topic $q$ are included in $Q'_q$,

612: 	even for topics where only the first 10 links were used in

613: 	the crawl ($Q'_q \supseteq Q_q$), and

614: 	the set $Y$ in the denominator includes all Yahoo leaf categories.

615: 	An exponential decay fit of the data is

616: 	also shown. The regression yielded parametric estimates

617: 	$\alpha_{3} \approx 1000$, $\alpha_{4} \approx 0.002$ and $\alpha_{5}

618: 	\approx 5.5$.}

619: 	\label{likelihood}

620: \end{figure}

621:

622: \end{document}
