\section{Concluding remarks}

The geography of the Web, delineating communities and their boundaries
in a state of continual flux, is a fascinating source of data for
social network analysis (see, e.g.,
\url{http://www.cybergeography.org/}).
In this paper we have initiated
an exploration of the terrain of broad topics in the Web graph, and
characterized some important notions of topical locality on the Web.
Specifically, we have shown how to estimate the background
distribution of broad topics on the Web, how pages relevant to these
topics cite each other, and how soon a random path starting from a
given topic `loses' itself into the background distribution.

We believe this work barely scratches the surface of a new,
content-rich characterization of the Web, and opens up many questions,
some of which we list below.

\paragraph{PageRank jump parameter:}
How should one set the jump probability in PageRank? Is it useful to
set topic-specific jump probabilities? Does an understanding of
mixing radius help us set better jump probabilities? Is there a
useful middle ground between PageRank's precomputed scores and HITS's
runtime graph collection?

\paragraph{Topical stability of distillation algorithms:}
How can we propose models of HITS and stochastic variants that are
content-cognizant? Can content-guided random walks be used to
\emph{define} what a focused crawler should visit and/or collect? Can
we validate this definition (or proposal) on synthetic graphs? Can
such a theory, coupled with our measurements of topic linkage, predict
and help avoid topic drift in distillation algorithms?

% E.g., can we propose an alternative algorithm
% which works similar to DOMHITS or DOMTextHITS and can be analyzed?
% need a catalog of topics and the property of the web subgraphs mapped
% to those topics. How to express, search and track evolution?

%%\subsection{Implications}
% estimate bounds on the success of focused crawlers
% better crawling strategies
% better sampling if fair representation of topics desired
% PageRank algorithm tuning jump parameter
% HITS graph collection
% how to model the quality of topic distillation output

\paragraph{Better crawling algorithms:}
Given that we can measure mixing distances and inter-topic linkage,
can we develop smarter federations of crawlers in which each
concentrates on a collection of tightly knit topics? Can this lead to
better and fresher coverage of small communities? Can we exploit the
fact that degrees follow power laws both globally and locally within
topic contexts to derive better, less topic-biased samples of URLs
from the Web?

% None of the sampling methods use the property that degrees follow
% power laws. Is it possible to exploit this property? Similarity
% between random jump, lots of iterations (50--60 reported by Brin and
% Page) vs.\ random initialization, few (one divided by jump probability
% $d$, i.e., 5--10 for the usual choice of $d$ which is 0.1 to 0.2)
% iterations and stop.