cs0104009/eval.tex
1: \section{Evaluating Recommendation Algorithms}
2: \label{back}
3: Most current research efforts cast recommendation
4: as a specialized task of information retrieval/\hskip0ex filtering or
5: as a task of function approximation/\hskip0ex learning mappings 
6: \cite{aggarwal1,
7: basu1,billsus,
8: eigentaste,good,herlocker,
9: hill,kitts,konstan1,
10: pennock,sarwar1,sarwar3,
11: schafer1,
12: ringo1,soboroff,
13: terveen}. Even approaches
14: that focus on clustering view clustering primarily as a
15: pre-processing step for functional modeling \cite{kohrs1}, or as a 
16: technique to ensure scalability \cite{Conner1,sarwar3}
17: or to overcome sparsity of ratings \cite{ungar1}. This emphasis on functional 
18: modeling and retrieval has influenced evaluation criteria for recommender 
19: systems.
20: 
21: Traditional information retrieval evaluation 
22: metrics such as precision and recall have been applied to recommender
23: systems involving content-based design.
24: Ideas such as cross-validation on an unseen test set have been used to
25: evaluate mappings from people to artifacts, especially in
26: collaborative filtering recommender systems. Such approaches miss
27: many desirable aspects of the recommendation process, namely:
28: \begin{itemize}
29: \item {\bf Recommendation is an indirect way
30: of bringing people together.} Social network theory \cite{wasserman-faust}
31: helps model a recommendation system of people versus artifacts as an {\it
32: affiliation network} and distinguishes between a {\it primary mode} 
33: (e.g., people) and a {\it secondary mode} (e.g., movies), where a 
34: {\it mode} refers to a distinct set of entities that have similar
35: attributes \cite{wasserman-faust}. The purpose of
36: the secondary mode is viewed as serving to bring entities of the primary
37: mode together (i.e., it is not treated as a {\it first-class} mode).
38: \item {\bf Recommendation, as a process,
39: should emphasize modeling connections from people to
40: artifacts, besides predicting ratings for artifacts.}
41: In many situations, users would like to request recommendations
42: purely based on local and global constraints on the nature of
43: the specific connections explored. 
44: Functional modeling techniques are
45: inadequate because they embed the task of learning a mapping from people
46: to predicted values of artifacts in a general-purpose learning system such
47: as neural networks or Bayesian classification \cite{breese}.
48: A notable exception is the work by Hofmann and Puzicha \cite{Hofmann}
49: which allows the incorporation of constraints in the form
50: of aspect models involving a latent variable. 
51: \item {\bf Recommendations should be explainable and believable.} The 
52: explanations should be made in terms and constructs that are natural to the 
53: user/application domain. It is nearly
54: impossible to convince the user of the quality of a recommendation obtained
55: by black-box techniques such as neural networks. Furthermore, it is well
56: recognized that ``users are more satisfied with a system that produces
57: [bad recommendations] for reasons that seem to make sense to them, than they
58: are with a system that produces [bad recommendations] for seemingly stupid
59: reasons'' \cite{riloff}.
60: %Berry and Browne highlight
61: %the need for believability in search engine results 
62: %(`Coming back instantenously with {\it No results found} potentially causes 
63: %dissatisfaction for the user').
64: \item {\bf Recommendations are not delivered in isolation, but in the
65: context of an implicit/explicit social network.}
66: In a recommender system, the rating patterns of
67: people on artifacts induce an implicit social network and influence the
68: connectivities in this network.  Little study has been done to understand
69: how such rating patterns influence recommendations and how they
70: can be advantageously exploited.
71: \end{itemize}
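
As an illustrative sketch of the first and last points, the affiliation network can be written in matrix form. The symbols $R$, $S$, $m$, and $n$ are our own notation, introduced only for this illustration, and we assume that a rating is reduced to the presence or absence of a link. For $m$ people and $n$ artifacts, let
\[
R \in \{0,1\}^{m \times n}, \qquad
R_{ia} = 1 \mbox{ if and only if person } i \mbox{ has rated artifact } a .
\]
Then
\[
S = R R^{T}, \qquad S_{ij} = \sum_{a=1}^{n} R_{ia} R_{ja},
\]
so that, for $i \neq j$, $S_{ij}$ counts the artifacts rated by both persons $i$ and $j$. Linking $i$ and $j$ whenever $S_{ij} \geq 1$ yields one possible realization of the implicit social network over the primary mode; a larger threshold on $S_{ij}$ would give a sparser network.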
72: 
73: Our approach in this paper is to evaluate recommendation algorithms using
74: ideas from graph analysis. In the next section,
75: we will show how our viewpoint addresses each of 
76: the above aspects, by providing novel metrics. The basic idea is to begin
77: with data that can be modeled as a network and attempt to infer useful
78: knowledge from the nodes and links of the graph. Nodes represent entities
79: in the domain (e.g., people, movies), and edges represent the relationships
80: between entities (e.g., the act of a person viewing a particular movie).
81: 
82: \subsection{Related Research}
83: The idea of graph analysis as a basis to study information networks has a long
84: tradition; one of the earliest pertinent studies is Schwartz and
85: Wood \cite{graph-schwartz}. The authors describe the use of graph-theoretic
86: notions such as cliques, connected components, cores, clustering,
87: average path distances, and the inducement of secondary graphs. The focus
88: of the study was to model shared interests among a web of people,
89: using email messages as connections. Such link
90: analysis has been used to extract information in many areas such as in web
91: search engines \cite{klein2}, in exploration of associations among
92: criminals \cite{lee1}, and in the field of medicine \cite{swanson1}.
93: With the emergence of the web as a large scale graph, interest in information 
94: networks has recently exploded \cite{adamic1,google,bowtie,
95: clever-journal,flake-kdd,
96: kautz1,klein2,klein1,trawling,payton1,silk,watts1}.
97: 
98: Most graph-based algorithms for information networks can be studied in terms
99: of (i) the modeling of the
100: graph (e.g., what are the modes, and how do they
101: relate to the information domain?), and (ii) the structures/operations
102: that are mined/conducted on the graph.
103: One of the most celebrated examples of graph analysis arises in search
104: engines that exploit link information, in addition to textual content.
105: The Google search engine uses the web's link structure,
106: in addition to anchor text, as a factor in ranking pages, based on
107: the pages that (hyper)link to the given page \cite{google}. Google essentially models
108: a one-mode directed graph (of web pages) and uses a measure based on the
109: principal eigenvector of its link matrix to ascertain `page ranks.'
110: Jon Kleinberg's HITS (Hyperlink-Induced Topic Search) algorithm
111: goes a step further by viewing the one-mode web graph as
112: actually comprising two modes (called
113: hubs and authorities) \cite{klein2}. A hub is a node primarily with edges
114: to authorities, and so a good hub has links to many authorities. A good
115: authority is a page that is linked to by many hubs.
116: Starting with a specific
117: search query, HITS performs a text-based search to seed an initial set of
118: results. An iterative relaxation algorithm then assigns hub and authority
119: weights using a matrix power iteration. Empirical results show that
120: remarkably authoritative results are obtained for search queries. The CLEVER
121: search engine is built primarily on top of the basic HITS
122: algorithm \cite{clever-journal}. The offline query-independent
123: computation in Google, as opposed to 
124: the topic-induced search of CLEVER, is one of the main reasons for the
125: commercial success of the former. 
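
The two link-analysis schemes above can be sketched in matrix form. The symbols below ($A$, $a$, $h$, $M$, $r$, $d$, $N$, $e$) are our own notation, and the page-rank expression is one common normalized form that ignores pages without out-links; it is not claimed to be the exact formulation used by the search engine. For HITS, let $A$ be the adjacency matrix of the query-specific subgraph, with $A_{pq} = 1$ if page $p$ links to page $q$, and let $a$ and $h$ be the vectors of authority and hub weights. Each iteration performs
\[
a^{(t+1)} = A^{T} h^{(t)}, \qquad h^{(t+1)} = A\, a^{(t+1)},
\]
followed by a rescaling of both vectors to unit length. Substituting one update into the other gives $a^{(t+1)} \propto A^{T}A\, a^{(t)}$ and $h^{(t+1)} \propto A A^{T} h^{(t)}$, so this is a power iteration that, under mild conditions, converges to the principal eigenvectors of $A^{T}A$ and $AA^{T}$, respectively. For page ranks, let $N$ be the number of pages, $M_{pq} = 1/\mbox{outdeg}(q)$ if page $q$ links to page $p$ (and $0$ otherwise), $e$ the all-ones vector, and $0 < d < 1$ a damping factor. The rank vector $r$ then satisfies
\[
r = \frac{1-d}{N}\, e + d\, M r ,
\]
which, when the entries of $r$ sum to one, makes $r$ the principal eigenvector of $d M + \frac{1-d}{N}\, e\, e^{T}$. Unlike the HITS weights, $r$ does not depend on the query and can therefore be computed offline.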
126: 
127: The use of link analysis in recommender systems was highlighted by the
128: ``referral chaining'' technique of the 
129: ReferralWeb project \cite{kautz1}. The idea is to
130: use the co-occurrence of names in any
131: of the documents available on the web to detect the existence of direct
132: relationships between people and thus indirectly form social networks. The
133: underlying assumption is that people with similar interests swarm in the
134: same circles to discover collaborators \cite{payton1}.
135: 
136: The exploration of link analysis in social structures has led to several
137: new avenues of research, most notably small-world networks. Small-world
138: networks are highly clustered but relatively sparse networks with
139: small average path length. An example is the
140: folklore notion of six degrees of separation between any two people in
141: our universe: the phenomenon where a person can discover a link to any
142: other random person through a chain of at most six acquaintances. A small-world
143: network is sufficiently clustered so that most second neighbors of a node
144: $X$ are also neighbors of $X$ (a typical ratio would be $80\%$). On the
145: other hand, the average distance between any two nodes in the graph is
146: comparable to the low characteristic path length of a random graph. Until
147: recently, a mathematical characterization of such small-world networks has
148: proven elusive.  Watts and Strogatz \cite{watts1} provide the first
149: such characterization of small-world networks in the form
150: of a graph generation model.
151: 
152: \begin{figure}
153: \centering
154: \begin{tabular}{cc}
155: & \mbox{\psfig{figure=small-world.eps,width=5in}}
156: \end{tabular}
157: \caption{Generation of a small-world network by random rewiring from a
158: regular wreath network. Figure adapted from \cite{watts1}.}
159: \label{small}
160: \end{figure}
161: 
162: \begin{figure} \centering
163: \begin{tabular}{cc} &
164: \mbox{\psfig{figure=small-world-graphs.eps,height=3in}}
165: \end{tabular}
166: \caption{Average path length and clustering coefficient versus
167: the rewiring probability $p$ (from \cite{watts1}). All measurements are
168: scaled w.r.t. the values at $p = 0$.}
169: \label{smgs}
170: \end{figure}
171: 
172: In this model, Watts and Strogatz
173: use a regular wreath network with $n$ nodes, and $k$ edges per node
174: (to its nearest neighbors) as a starting point for the design. A small
175: fraction of the edges are then randomly rewired to arbitrary points on
176: the network. A full rewiring (probability $p=1$) leads to a completely
177: random graph, while $p=0$ corresponds to the (original) wreath 
178: (Fig.~\ref{small}). The starting point in the figure is a regular wreath
179: topology of $12$ nodes with every node connected to its four nearest
180: neighbors. This structure has a high characteristic path length and
181: high clustering coefficient. The average (characteristic path) length is the mean of the shortest
182: path lengths over all pairs of nodes. The clustering coefficient is
183: determined by first computing the local neighborhood of every node. The
184: number of edges in this neighborhood, as a fraction of the total possible
185: number of edges, measures the extent to which the neighborhood is a clique.
186: This factor
187: is averaged over all nodes to determine the clustering coefficient. The
188: other extreme in Fig.~\ref{small}
189: is a random network with a low characteristic path length and
190: almost no clustering. The small-world network, an interpolation between the
191: two, has the low characteristic path length (of a random network), and
192: retains the high clustering coefficient (of the wreath).  Measuring
193: properties such as average length and clustering coefficient in the region
194: $0 \leq p \leq 1$ produces surprising results (see Fig.~\ref{smgs}).
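
The two statistics described above can be written out explicitly. The symbols $d(i,j)$, $k_v$, $E_v$, and $C_v$ are our own notation; we assume a connected, undirected, unweighted graph on $n$ nodes and take the neighborhood of a node $v$ to be its $k_v$ neighbors, excluding $v$ itself. With $d(i,j)$ the length of a shortest path between nodes $i$ and $j$, and $E_v$ the number of edges among the neighbors of $v$,
\[
L = \frac{1}{n(n-1)/2} \sum_{i<j} d(i,j), \qquad
C_v = \frac{E_v}{k_v(k_v-1)/2}, \qquad
C = \frac{1}{n} \sum_{v} C_v .
\]
As a worked example, consider the wreath of Fig.~\ref{small} with $n = 12$ and $k_v = 4$. The neighbors of node $i$ are $i \pm 1$ and $i \pm 2$; of the $4 \cdot 3 / 2 = 6$ possible edges among them, only three are present, namely $(i-2,i-1)$, $(i-1,i+1)$, and $(i+1,i+2)$, so $C_v = 3/6$ for every node and hence $C = 0.5$. From any node, four nodes lie at distance $1$, four at distance $2$, and three at distance $3$, so by symmetry
\[
L = \frac{4 \cdot 1 + 4 \cdot 2 + 3 \cdot 3}{11} = \frac{21}{11} \approx 1.91 .
\]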
195: 
196: As shown in Fig.~\ref{smgs}, only a very small fraction of edges need to be
197: rewired to bring the length down to random graph limits, and yet the
198: clustering coefficient is high. On closer inspection, it is easy to see why
199: this should be true. Even for small values of $p$ (e.g., $0.1$), introducing
200: edges between distantly separated nodes reduces not only the
201: distance between these nodes but also the distances between the neighbors of
202: those nodes, and so on (these reduced paths between distant nodes are
203: called {\it shortcuts}). The introduction of these edges
204: further leads to a rapid decrease in the average length of the network, but
205: the clustering coefficient remains almost unchanged. Thus, small-world
206: networks fall in between regular and random networks, having the small
207: average lengths of random networks but high clustering coefficients akin to
208: regular networks.
209: 
210: While the Watts-Strogatz model describes how small-world networks can be
211: formed, it does not explain how people are adept at actually finding short
212: paths through such networks in a decentralized fashion. Kleinberg 
213: addresses precisely this issue and proves that this is not possible
214: in the family of one-dimensional Watts-Strogatz networks 
215: \cite{klein1}. Embedding the notion of 
216: random rewiring in a
217: two-dimensional lattice leads to one unique model for which such
218: decentralization is effective.
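
The family of models considered by Kleinberg can be summarized in one expression; the notation below is ours, and the statement is a sketch of the result rather than its precise form. On an $n \times n$ lattice with lattice (Manhattan) distance $d(u,v)$, every node $u$ keeps its local lattice edges and receives a long-range edge whose endpoint $v$ is chosen with probability
\[
\Pr[u \rightarrow v] = \frac{d(u,v)^{-r}}{\sum_{w \neq u} d(u,w)^{-r}}, \qquad r \geq 0 .
\]
The case $r = 0$ corresponds to uniformly random long-range edges, as in random rewiring. The result in \cite{klein1} is that $r = 2$ is the only exponent for which a decentralized algorithm, one that uses only local information at each step, can deliver a message in a number of steps polylogarithmic in $n$; for every other value of $r$ the expected delivery time is bounded below by a polynomial in $n$.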
219: 
220: The small-world network concept has implications for a variety of domains.
221: Watts and Strogatz simulate the `wildfire'-like spread of an
222: infectious disease in a small-world network \cite{watts1}. 
223: Adamic shows
224: that the world wide web is a small-world network and suggests that
225: search engines capable of exploiting this fact can be more
226: effective in  hyperlink modeling, crawling, and finding authoritative 
227: sources \cite{adamic1}.
228: 
229: Besides the Watts-Strogatz model, a variety of models from graph
230: theory are available and can be used to analyze information networks.
231: Kumar et al.~\cite{trawling} highlight the use of traditional
232: random graph models
233: to confirm the existence of properties such as cores and connected
234: components in the web. In particular, they characterize the distributions 
235: of web page degrees 
236: and show that they are well approximated by power laws.
237: Finally, they perform a study similar to that of Schwartz and Wood
238: \cite{graph-schwartz} to find cybercommunities on the web.
239: Flake et al.~\cite{flake-kdd} provide a max-flow/min-cut algorithm to
240: identify cybercommunities. They also provide a focused crawling strategy to
241: approximate such communities.
242: Broder et al.~\cite{bowtie} perform a more detailed mapping of the web
243: and demonstrate that it has a bow-tie structure, which consists of
244: a strongly connected component, as well as nodes that 
245: link into but are not linked
246: from the strongly connected component, and nodes that are linked from but
247: do not link to the strongly connected component.
248: Pirolli et al.~\cite{silk} use ideas from spreading activation
249: theory to subsume link analysis, content-based modeling, and usage
250: patterns.
251: 
252: A final thread of research, while not centered on information networks,
253: emphasizes the modeling of problems and applications in ways that make them
254: amenable to graph-based analyses. A good example in this category is
255: the approach of Gibson et al.~\cite{gibson1} for mining categorical datasets.
256: 
257: While many of these ideas, especially link analysis,
258: have found their way into recommender systems,
259: they have been primarily viewed as mechanisms to mine or model 
260: structures. In this paper, we show how ideas from graph analysis
261: can actually serve to provide novel evaluation criteria for recommender 
262: systems.
263: