cs0203024/pabst.tex

\begin{abstract}
The Web graph is a giant social network whose properties have been
measured and modeled extensively in recent years.  Most such studies
concentrate on the graph structure alone, and do not consider textual
properties of the nodes.  Consequently, Web communities have been
characterized purely in terms of graph structure and not in terms of
page content.  We propose that a topic taxonomy such as Yahoo! or the
Open Directory provides a useful framework for understanding the
structure of content-based clusters and communities.  In particular,
using a topic taxonomy and an automatic classifier, we can measure the
background distribution of broad topics on the Web, and analyze the
capability of recent random walk algorithms to draw samples that
follow such distributions.  In addition, we can measure the
probability that a page about one broad topic will link to another
broad topic.  Extending this experiment, we can measure how quickly
topic context is lost while walking randomly on the Web graph.
Estimates of this topic mixing distance may explain why a global
PageRank is still meaningful in the context of broad queries.  In
general, our measurements may prove valuable in the design of
community-specific crawlers and link-based ranking systems.

\begingroup \raggedright
\par\smallskip
\paragraph{Categories and subject descriptors:}
H.5.4~[\textbf{Information interfaces and presentation}]:
Hypertext/hypermedia; \\
H.5.3~[\textbf{Information interfaces and presentation}]:
Group and Organization Interfaces, Theory and models; \\
H.1.0~[\textbf{Information systems}]: Models and principles.

\paragraph{General terms:}
Measurements, experimentation.

\paragraph{Keywords:} Social network analysis, Web bibliometry.
\endgroup

\par\smallskip \howtoviewhtml
\end{abstract}
