1: \section{Concluding remarks}
2: 
3: The geography of the Web, delineating communities and their boundaries
4: in a state of continual flux, is a fascinating source of data for
5: social network analysis (see, e.g.,
6: \url{http://www.cybergeography.org/}).  
7: In this paper we have initiated
8: an exploration of the terrain of broad topics in the Web graph, and
9: characterized some important notions of topical locality on the Web.
10: Specifically, we have shown how to estimate the background
11: distribution of broad topics on the Web, how pages relevant to these
12: topics cite each other, and how quickly a random path starting from a
13: given topic `loses' itself in the background distribution.
14: 
15: We believe this work barely scratches the surface of a new,
16: content-rich characterization of the Web, and that it opens up many
17: questions, some of which we list below.
18: 
19: \paragraph{PageRank jump parameter:}
20: How should one set the jump probability in PageRank?  Is it useful to
21: set topic-specific jump probabilities?  Does an understanding of
22: mixing radius help us set better jump probabilities?  Is there a
23: useful middle ground between PageRank's precomputed scores and HITS's
24: runtime graph collection?
25: 
26: \paragraph{Topical stability of distillation algorithms:}
27: How can we build content-cognizant models of HITS and its stochastic
28: variants?  Can content-guided random walks be used to
29: \emph{define} what a focused crawler should visit and/or collect?  Can
30: we validate this definition (or proposal) on synthetic graphs?  Can
31: such a theory, coupled with our measurements of topic linkage, predict
32: and help avoid topic drift in distillation algorithms?
33: 
34: % E.g., can we propose an alternative algorithm
35: % which works similar to DOMHITS or DOMTextHITS and can be analyzed?
36: % need a catalog of topics and the property of the web subgraphs mapped
37: % to those topics.  How to express, search and track evolution?
38: 
39: %%\subsection{Implications}
40: % estimate bounds on the success of focused crawlers
41: % better crawling strategies
42: % better sampling if fair representation of topics desired
43: % PageRank algorithm tuning jump parameter
44: % HITS graph collection
45: % how to model the quality of topic distillation output
46: 
47: \paragraph{Better crawling algorithms:}
48: Given that we can measure mixing distances and inter-topic linkage,
49: can we develop smarter federations of crawlers in which each
50: concentrates on a collection of tightly knit topics?  Can this lead to
51: better and fresher coverage of small communities?  Can we exploit the
52: fact that degrees follow power laws both globally and locally within
53: topic contexts to derive better, less topic-biased samples of URLs
54: from the Web?
55: 
56: 
57: % None of the sampling methods use the property that degrees follow
58: % power laws.  Is it possible to exploit this property?  Similarity
59: % between random jump, lots of iterations (50--60 reported by Brin and
60: % Page) vs.\ random initialization, few (one divided by jump probability
61: % $d$, i.e., 5--10 for the usual choice of $d$ which is 0.1 to 0.2)
62: % iterations and stop.
63: 
64: