0203:cs0203024/pabst.tex

1:

2: \begin{abstract}

3: The Web graph is a giant social network whose properties have been

4: measured and modeled extensively in recent years.  Most such studies

5: concentrate on the graph structure alone, and do not consider textual

6: properties of the nodes.  Consequently, Web communities have been

7: characterized purely in terms of graph structure rather than page

8: content.  We propose that a topic taxonomy such as Yahoo! or the Open

9: Directory provides a useful framework for understanding the structure

10: of content-based clusters and communities.  In particular, using a

11: topic taxonomy and an automatic classifier, we can measure the

12: background distribution of broad topics on the Web, and analyze the

13: ability of recent random-walk algorithms to draw samples that

14: follow such distributions.  In addition, we can measure the

15: probability that a page about one broad topic will link to a page

16: about another broad topic.  Extending this experiment, we can measure how quickly

17: topic context is lost while walking randomly on the Web graph.

18: Estimates of this topic mixing distance may explain why a global

19: PageRank is still meaningful in the context of broad queries.  In

20: general, our measurements may prove valuable in the design of

21: community-specific crawlers and link-based ranking systems.

22:

23: \begingroup \raggedright

24: \par\smallskip

25: \paragraph{Categories and subject descriptors:}

26: H.5.4~[\textbf{Information interfaces and presentation}]:

27: Hypertext/hypermedia; \\

28: H.5.3~[\textbf{Information interfaces and presentation}]:

29: Group and Organization Interfaces, Theory and models; \\

30: H.1.0~[\textbf{Information systems}]: Models and principles.

31:

32: \paragraph{General terms:}

33: Measurements, experimentation.

34:

35: \paragraph{Keywords:} Social network analysis, Web bibliometry.

36: \endgroup

37:

38: \par\smallskip \howtoviewhtml

39: \end{abstract}

40: