0203:cs0203024/pend.tex

1: \section{Concluding remarks}

2:

3: The geography of the Web, delineating communities and their boundaries

4: in a state of continual flux, is a fascinating source of data for

5: social network analysis (see, e.g.,

6: \url{http://www.cybergeography.org/}).

7: In this paper we have initiated

8: an exploration of the terrain of broad topics in the Web graph, and

9: characterized some important notions of topical locality on the Web.

10: Specifically, we have shown how to estimate the background

11: distribution of broad topics on the Web, how pages relevant to these

12: topics cite each other, and how soon a random path starting from a

13: given topic `loses' itself in the background distribution.

14:

15: We believe this work barely scratches the surface of a new,

16: content-rich characterization of the Web, and opens up many questions,

17: some of which we list below.

18:

19: \paragraph{PageRank jump parameter:}

20: How should one set the jump probability in PageRank?  Is it useful to

21: set topic-specific jump probabilities?  Does an understanding of

22: mixing radius help us set better jump probabilities?  Is there a

23: useful middle ground between PageRank's precomputed scores and HITS's

24: runtime graph collection?
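
For concreteness, one common way to write PageRank with a jump
probability $d$ is sketched below.  The notation is an assumption made
here for illustration and is not fixed in this section: $N$ is the
number of pages, $E$ the set of hyperlinks, and $\mathrm{outdeg}(u)$
the number of out-links of page $u$; pages without out-links are
ignored for simplicity.
\[
  p(v) \;=\; \frac{d}{N} \;+\; (1-d) \sum_{(u,v)\in E}
  \frac{p(u)}{\mathrm{outdeg}(u)}.
\]
Under this formulation the number of steps up to and including each
jump is geometrically distributed, with mean
\[
  \sum_{k\ge 1} k\, d\,(1-d)^{k-1} \;=\; \frac{1}{d},
\]
that is, about 5 to 10 steps for $d$ between $0.2$ and $0.1$.
Comparing this length with the mixing radius of a topic is one
possible way to approach the questions above.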

25:

26: \paragraph{Topical stability of distillation algorithms:}

27: How can we propose models of HITS and stochastic variants that are

28: content-cognizant?  Can content-guided random walks be used to

29: \emph{define} what a focused crawler should visit and/or collect?  Can

30: we validate this definition (or proposal) on synthetic graphs?  Can

31: such a theory, coupled with our measurements of topic linkage, predict

32: and help avoid topic drift in distillation algorithms?

33:

34: % E.g., can we propose an alternative algorithm

35: % which works similar to DOMHITS or DOMTextHITS and can be analyzed?

36: % need a catalog of topics and the property of the web subgraphs mapped

37: % to those topics.  How to express, search and track evolution?

38:

39: %%\subsection{Implications}

40: % estimate bounds on the success of focused crawlers

41: % better crawling strategies

42: % better sampling if fair representation of topics desired

43: % PageRank algorithm tuning jump parameter

44: % HITS graph collection

45: % how to model the quality of topic distillation output

46:

47: \paragraph{Better crawling algorithms:}

48: Given that we can measure mixing distances and inter-topic linkage,

49: can we develop smarter federations of crawlers in which each

50: concentrates on a collection of tightly knit topics?  Can this lead to

51: better and fresher coverage of small communities?  Can we exploit the

52: fact that degrees follow power laws both globally and locally within

53: topic contexts to derive better, less topic-biased samples of URLs

54: from the Web?

55:

56:

57: % None of the sampling methods use the property that degrees follow

58: % power laws.  Is it possible to exploit this property?  Similarity

59: % between random jump, lots of iterations (50--60 reported by Brin and

60: % Page) vs.\ random initialization, few (one divided by jump probability

61: % $d$, i.e., 5--10 for the usual choice of $d$ which is 0.1 to 0.2)

62: % iterations and stop.

63:

64: