cs0105028/ppp.tex
1: \documentclass[11pt]{article}
2: 
3: %\usepackage{ijcai01}
4: %\usepackage{fullpage,palatino}
5: \usepackage{fullpage}
6: \setlength{\oddsidemargin}{-0.25in}
7: \setlength{\evensidemargin}{-0.25in}
8: \setlength{\topmargin}{0.5in}
9: \setlength{\headheight}{0pt}
10: \setlength{\headsep}{0pt}
11: \setlength{\footskip}{0pt}
12: \setlength{\textheight}{8.75in}
13: \setlength{\textwidth}{7in}
14: \setlength{\marginparwidth}{0in}
15: \setlength{\marginparsep}{0in}
16: \newenvironment{descit}[1]{\begin{quote} \textit{#1}}{\end{quote}}
17: 
18: \input{psfig-dvips}
19: 
20: \newif\ifpdf
21: \ifx\pdfoutput\undefined
22:   \pdffalse
23: \else
24:   \pdfoutput=1
25:   \pdftrue
26: \fi
27: 
28: \ifpdf
29:   \usepackage[pdftex]{graphicx}
30:   \usepackage[pdftex]{color}
31:   \DeclareGraphicsExtensions{.pdf,.png,.jpg}
32: \else
33:   \usepackage[dvips]{graphicx}
34:   \usepackage[dvips]{color}
35:   \DeclareGraphicsExtensions{.eps,.epsi,.ps}
36: \fi
37: 
38: \usepackage{times}
39: %\usepackage{fancyheadings}
40: 
41: %\pagestyle{plain}
42: \thispagestyle{empty}
43: \pagestyle{empty}
44: 
45: \def\midv{\mathop{\,|\,}}
46: \newtheorem{defn}{Definition}
47: \long\def\cbk#1{{\color{red}[CBK: #1]}}
48: \newlength\colwidth \setlength\colwidth{3.25in}
49: 
50: \title{When being Weak is Brave:\\
51: Privacy Issues in Recommender Systems}
52: 
53: \author{Naren Ramakrishnan, Benjamin J. Keller, and Batul J. Mirza\\
54: Department of Computer Science\\
55: Virginia Tech, VA 24061\\
56: Email:\{naren,keller,bmirza\}@cs.vt.edu\\
57: \medskip
58: \\
59: Ananth Y. Grama\\
60: Department of Computer Sciences\\
61: Purdue University, IN 47907\\
62: Email: ayg@cs.purdue.edu\\
63: \medskip
64: \\
65: George Karypis\\
66: Department of Computer Science\\
67: University of Minnesota, MN 55455\\
68: Email: karypis@cs.umn.edu}
69: 
70: \begin{document}
71: 
72: \maketitle
73: \thispagestyle{empty}
74: \pagestyle{empty}
75: 
76: \begin{abstract}
77: \noindent
78: We explore the conflict between personalization and privacy that arises
79: from the existence of weak ties. A weak tie is an unexpected connection that
80: provides serendipitous recommendations. However, information about weak ties
81: could be used in conjunction with
82: other sources of data to uncover identities and reveal other personal
83: information. In this article, we use a graph-theoretic model to study the
84: benefit and risk from weak ties.
85: \end{abstract}
86: \newpage
87: 
88: \section{Introduction}
89: Privacy in Internet services is typically thought of in terms of protecting
90: attributes of users (and can thus be related to solutions in database security).
91: However, information provided by a recommender system can also allow the 
92: privacy of some of its users to be compromised, when used in conjunction with 
93: other information. For example, consider a system
94: that recommends books by finding correlations between
95: a user's ratings and those of other participants for the same books (a
96: nearest neighbor algorithm~\cite{herlocker}). Suppose as a user, person X
97: rates only books on computer networking, but has an interest in Indian 
98: classical music. We could
99: imagine the following dialogue between person X and the recommender system:
100: 
101: \begin{descit}
102: {\bf person X:} Would I like ``Evolution of Indian Classical Music''?\\
103: {\bf recommender:} Yes.\\
104: {\bf person X:} Why? (surprised)\\ 
105: {\bf recommender:} People who liked the books you have liked also liked this book.
106: \end{descit}
107: 
108: \noindent
109: This is an example of {\it serendipity} in recommendation: person X
110: does not expect
111: to receive a recommendation for a book on a topic outside of those
112: that he/she has rated. But, in fact, it is not by luck that the recommendation 
113: was provided --- it means that at least one person has rated the book 
114: in addition to the 
115: books that person X has rated (and possibly others). The explanation conveys
116: this, but also indicates that the algorithm used is a nearest neighbor 
117: algorithm. The serendipity, if person X is malicious, is that he/she has found 
118: a candidate
119: {\it weak tie.}
120: 
121: In social network theory, a tie is a relationship between people. The strength 
122: of a tie is measured in terms of the number of shared associations.
123: A family is an example: a child knows her mother and her father, and her
124: parents know each other. Weak ties, on the other hand,
125: form bridges between two groups of people
126: who would not otherwise interact. 
127: Most importantly for us, weak ties allow us to deduce identities. For instance,
128: one may know that a particular computer science
129: department has a networking professor
130: who has an interest in Indian classical music.
131: This professor thus provides a tie between 
132: university faculty and the Indian classical music
133: aficionados. Consequently, upon meeting the {\it only}
134: Indian networking professor in that department,
135: you can safely identify this person as the Indian classical music enthusiast.
136: 
137: A similar situation occurs in recommendation where weak ties provide the
138: opportunity for serendipitous recommendations. In our example, a candidate
139: weak tie has been found between networking books and books on Indian
140: classical music. We don't know if this truly is a weak tie, but can test it by
141: varying what we rate (the process would involve masquerading as different users
142: and rating different sets of books to probe around the original ratings).
143: Through this process, if the query fails for most variations on the ratings,
144: then we can be more confident that we have a weak tie. And, in fact, the
145: more restrictive the set of ratings that yield the positive recommendation,
146: the more confidence we have that there is one person who has rated the
147: books.
148: 
149: Our ability to find such a minimum set of ratings is where the risk lies
150: --- we can use this rating set to determine what the other person has rated 
151: using queries, and perhaps fit this information to our knowledge of people
152: who use the system. In the end, it is conceivable that we could identify 
153: our hypothetical Indian music enthusiast/networking professor in the 
154: recommendation system, and determine what he may have rated. If the system 
155: allows us to vary the ratings, then we might be able to estimate the 
156: person's ratings as well.
157: 
158: In this example, a person actually forms a tie between books through their
159: ratings: a book relates to another if they have both been rated by
160: someone. The alternate view that we will take is that there is a tie between
161: two people if they have rated some number of books in common. So, in the
162: example, the Indian music enthusiast/networking professor has ties to people who
163: rate networking books and people who rate books on Indian classical music.
164: The ties to people who rate Indian classical music books are weak only 
165: if there are relatively few people who have the same tastes.
166: 
167: %\subsection{Modeling Risk}
168: Clearly, the task of identifying someone through their ratings is difficult,
169: and is even harder if more people have similar rating patterns. But the
170: above example illustrates the risk that may exist even in simple
171: recommendation. The risk is really to people who would
172: participate in weak ties, because they are the people who could most easily
173: be identified. In some application domains (voting preferences and membership
174: on boards), even knowing that a weak tie 
175: exists constitutes a breach of privacy.
176: 
177: \subsection*{Our Approach}
178: Our goal is to model the benefit from and risk to users who participate
179: in weak ties. In particular, we 
180: would like to characterize benefits and risks on an algorithm-independent
181: basis. To achieve this, we use the model of 
182: `jumping connections' \cite{batul-thesis}
183: that casts recommendation as making a series of jumps between people, based on 
184: common ratings (in nearest neighbor recommendation, there is only one jump). 
185: %The result is a model that posits (indirect)
186: %connections between people. 
187: We describe how this model has relevance to 
188: conventionally accepted metrics of evaluation (Section~\ref{jc-model}). Using 
189: this model we can then identify causes for
190: weak ties in terms of the rating patterns of people (Section~\ref{weak-ties}). 
191: Furthermore, the model 
192: allows us to qualify benefits in terms of reachability in a graph, and risk 
193: in terms of weak ties (Section~\ref{benefits-risks}). Finally 
194: (Section~\ref{bigger-pictures}), we look at how policies and social structures
195: can be designed that can support and enable recommender systems.
196: 
197: \section{Recommendation: The Jumping Connections Model}
198: \label{jc-model}
199: Recommendation algorithms work in a wide variety of ways, from forms of graph
200: search to learning. This variety presents a difficulty when attempting to
201: study the risks of recommendation in general. However, we can think of
202: recommendation as making connections between people who have rated artifacts
203: in common. We represent people, the artifacts they rated (e.g., movies),
204: and their ratings as a bipartite graph (Fig.~\ref{jc-intro} (a)). An
205: algorithm can then {\it jump} over the common artifacts to form a
206: connection between two people. Fig.~\ref{jc-intro} illustrates a skip
207: jump where two people are brought together if they rate at least
208: one movie in common.
209: 
210: A jump induces a {\it social network graph} (Fig.~\ref{jc-intro} (b)),
211: which includes only people and edges between them (in social network theory,
212: such a graph shows direct relationships, but here two connected people
213: need not know one another and their connection depends on the jump). The
214: {\it recommender graph} (Fig.~\ref{jc-intro} (c)) orients the edges in the social
215: network graph and adds back the movies. An algorithm can then find paths from
216: a person making a query to a person who has rated the movie of interest.
217: Note that Fig.~\ref{jc-intro} illustrates only one way of jumping
218: --- other jumps are identified by
219: Mirza \cite{batul-thesis}.
220: %In our analysis below, it is sufficient to 
221: %consider only the social network graph.
222: 
223: \begin{figure}
224: \centering
225: \begin{tabular}{cc}
226: & \mbox{\psfig{figure=skip.eps,width=5in}}
227: \end{tabular}
228: \caption{Illustration of the \emph{skip} jump: (a) bipartite graph
229: of people and movies, (b) social network graph, and (c)
230: recommender graph.}
231: \label{jc-intro}
232: \end{figure}
233: 
234: In this article, we restrict our attention to {\it hammock} jumps. A hammock
235: jump of width $w$ connects two people if they have rated at least $w$
236: movies in common (a skip is a hammock jump of width one). A hammock
237: path of length $l$ is a sequence of $l$ hammocks, as illustrated in
238: Fig.~\ref{hammock-pic}. Our hypothesis is that hammock jumps underlie most
239: recommendation approaches, and at the very least can be used as the basis to
240: design metrics
241: for studying privacy issues.
242: Note that nearest neighbor algorithms (e.g., GroupLens~\cite{konstan1},
243: LikeMinds, and Firefly) use an implicit hammock sequence of length 1.
244: The `horting' algorithm of Aggarwal et al.~\cite{aggarwal1} uses 
245: sequences of explicit hammock-like jumps. 
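
As a minimal formalization (the notation $R(u)$ for the set of movies rated by
person $u$ is shorthand assumed here for illustration, not part of the
original model description), a hammock of width $w$ joins two people $u$ and
$v$ when
$$|R(u) \cap R(v)| \ge w .$$
For instance, if $R(u) = \{m_1, m_2, m_3, m_4\}$ and
$R(v) = \{m_2, m_3, m_4, m_5\}$, then $|R(u) \cap R(v)| = 3$, so $u$ and $v$
are connected by hammock jumps of width $1$, $2$, and $3$, but not by one of
width $4$.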
246: 
247: \begin{figure}
248: \centering
249: \begin{tabular}{cc}
250: & \mbox{\psfig{figure=hammocks.eps,width=5in}}
251: \end{tabular}
252: \caption{A path of hammock jumps, with a hammock width
253: $w=4$.}
254: \label{hammock-pic}
255: \end{figure}
256: 
257: \begin{figure}
258: \centering
259: \begin{tabular}{cc}
260: \mbox{\psfig{figure=likeminds.epsi,width=2.3in}} & 
261: \mbox{\psfig{figure=horting.epsi,width=2.3in}}\\
262: \end{tabular}
263: \caption{(left) Influence of hammock width $w$ on quality of recommendation.
264: The annotations denote the fraction of people and movies reachable for
265: different values of the hammock width. (right) Influence of
266: hammock path length $l$ on quality of recommendations. The annotations denote
267: the number of recommendations possible for each value of $l$.}
268: \label{expt1}
269: \end{figure}
270: 
271: Our model completely ignores accuracy of predicted values of
272: ratings, and instead focuses on the parameters of hammock width and
273: path length. This is because
274: if recommendation is truly a matter of making (the right) connections, then
275: a recommendation of a particular movie for a given person can be
276: characterized by values for the hammock width $w$ and the hammock path 
277: length $l$. Notice that we do
278: not emphasize how individual ratings for the nodes (movies) spanning a
279: hammock are transformed into a prediction.
280: In addition, it is 
281: highly likely that there are multiple paths between the same set of
282: nodes, with various constraints on $w$ and $l$. Intuitively, since
283: considering more common ratings can be beneficial (see 
284: \cite{herlocker} for approaches) having a wider hammock could be better (this
285: has to be carefully done when correlations between ratings are
286: considered \cite{herlocker}). But if we insist on a wide hammock, we
287: might have to traverse longer paths to reach a particular
288: movie from a given person~\cite{batul-thesis}. However, recommendations 
289: involving shorter
290: path lengths are preferred, for reasons of explainability, over longer paths.
291: From a graph-theoretic point of view, $w$ and $l$ thus qualify the
292: reachability of different movies from a given person, and indirectly provide
293: a measure of the expected quality of predictions.
294: 
295: Preliminary analysis of the relationship between $w$, $l$, and predictive 
296: accuracy supports this intuition. Fig.~\ref{expt1} (left) shows a plot
297: of the average discrepancy between predicted and actual ratings for
298: each hammock width when using the LikeMinds algorithm (as described 
299: in~\cite{aggarwal1}). These results were determined by a leave-one-out study,
300: where an available rating was masked, and a prediction was made for that
301: rating (using the remaining data). The number of common ratings between
302: the given person and the person with the highest agreement scalar (and who
303: contributed to the recommendation) was used as the hammock width. The
304: results indicate that (for the LikeMinds algorithm), wider hammocks
305: contribute to better ratings.
306: Notice that LikeMinds's hammocks do not just model commonality, they also
307: posit agreement between the rating values spanning a hammock.
308: While it is certainly true that we can get a poor quality 
309: recommendation even with a wide hammock (involving perhaps noisy ratings or
310: a faulty aggregation procedure), 
311: overall quality of predictions is influenced by greater hammock widths.
312: However, increasing the hammock width results in a progressive 
313: disconnection of the social network graph into many components. As a result,
314: fewer and fewer connections can be made --- Fig.~\ref{expt1} (left) also
315: lists the fraction of people and movies reachable for various levels of hammock
316: width. A $w$ of 53 for instance reaches only 48\% of the 
317: people and 93\% of the movies. By the time the abrupt improvement
318: in agreement values is observed (after $w \approx 110$), fewer than 25\%
319: of the people and only about 86\% of the movies are reachable.
320: 
321: Fig.~\ref{expt1} (right) describes the results
322: of an  experiment where a minimum hammock width constraint was set
323: at $w=113$ (according to the LikeMinds definition) 
324: and the resulting recommender graph was analyzed for
325: paths of varying lengths from people to movies. 
326: We used the transformation
327: technique described in \cite{aggarwal1} to make predictions of ratings from
328: others' ratings, once again using the leave-one-out method. 
329: Paths in the recommender
330: graph involve 1, 2, or 3 hops to the person providing a recommendation
331: and a final hop to the movie being recommended (hence the bucketing of values
332: into 2, 3, and 4 in Fig.~\ref{expt1}, right).
333: As can be seen, 
334: greater lengths (for the same $w$) cause a faster-than-linear
335: decay in the quality of predictions. We should caution that
336: horting~\cite{aggarwal1} may exhibit different behavior, though we still
337: expect longer paths to be of lower quality.
338: 
339: These results support the intuition that wider hammocks and shorter paths
340: provide better ratings. Hammock widths are determined by rating patterns
341: that ensure significant overlap. We see this in the 
342: MovieLens dataset for which each participant rates a minimum of 20 movies
343: and which has a connected social network graph for all 
344: $w \le 17$~\cite{batul-thesis}. 
345: 
346: The primary cause of shorter paths is having more connections in the graph.
347: In the MovieLens dataset, 
348: a recommendation is almost always possible using
349: a path of length no longer than 3. This is due to the power-law degree
350: distribution of the rating patterns. Other graphs, such
351: as {\it small-world networks} \cite{watts1}, have
352: small clusters of vertices that are connected by relatively few edges. 
353: `Weak ties' are important in both these situations
354: because they make some recommendations possible, and provide others with 
355: shorter paths. Therefore, weak ties are very important to recommendation.
356: 
357: \section{Ties: Strong, Weak, and Brave}
358: \label{weak-ties}
359: In contrast to weak ties\footnote{It is important to note that there is
360: nothing fundamentally feeble or fragile about a weak tie; a weak tie
361: creates a powerful and robust link between nodes from different neighborhoods.},
362: a strong tie connects two people who share 
363: many associations (like in a family or some other close-knit group).
364: We can think of weak ties as forming bridges between groups of people
365: who would otherwise not interact.  Of course, strength and weakness
366: are relative, and there is no agreed definition of what a strong tie is
367: in terms of the number of shared associations.
368: 
369: \begin{figure}
370: \centering
371: \begin{tabular}{cc}
372: & \mbox{\psfig{figure=triad.epsi,width=2.5in}}
373: \end{tabular}
374: \caption{Strong ties in a social network graph (right)
375: induced by a hammock jump on a recommendation dataset (left) with $w=2$.}
376: \label{figtriad}
377: \end{figure}
378: 
379: In a graph, strong ties are characterized by a triangle of vertices
380: (a \emph{triad} in the social network literature).
381: Fig.~\ref{figtriad}
382: illustrates how these triads can occur in a social
383: network graph induced by a hammock jump.
384: In this case, the width of the hammock jump is 2, and what looks like
385: two relationships becomes three in the social network graph.
386: Notice that, in this example, if the hammock jump width were three,
387: then the resulting social network would not have a triad and so
388: neither edge would represent a strong tie. 
389: It is a classical argument in social network theory that
390: no strong tie can be a bridge
391: and that two strong ties would imply a third tie~\cite{weak}.
392: 
393: Weak ties are of most interest to us, because they are the foundation
394: for our notion of risk.
395: As discussed earlier,
396: a weak tie in a social setting allows people to identify someone with
397: other information that they've been given.
398: Weak ties occur simply because someone knows
399: someone else outside of their usual circle of friends; or perhaps
400: there is a person (an `outsider') who is friends with a few people who
401: each have strong(er) ties to people in different groups.
402: 
403: 
404: \begin{figure}
405: \centering
406: \begin{tabular}{cc}
407: \mbox{\psfig{figure=powerlaw.eps,width=2.5in}} &
408: \mbox{\psfig{figure=goodloops.eps,width=2.5in}}
409: \end{tabular}
410: \vspace{0.3in}
411: \begin{tabular}{cc}
412: \mbox{\psfig{figure=componentpl.eps,width=2.5in}} &
413: \mbox{\psfig{figure=badloops.eps,width=2.5in}}
414: \end{tabular}
415: \caption{Two different types of induced social networks that can exhibit weak
416: ties. (top left) A dataset with a power-law induces a low-risk
417: social network (top right) where increasing hammock widths cause 
418: a `nested clam shells' picture. Each circle in the social network picture
419: denotes a group of people brought together. Increasing hammock widths cause
420: the circles to get progressively smaller.
421: (bottom left) A dataset with power-laws in only subgraphs and a few weak
422: ties induces a high-risk social network (bottom right) characterized by
423: the breakdown of a connected network into disconnected networks.
424: Some experimental data supporting the
425: diagrams above can be found in~\cite{batul-thesis}.}
426: \label{graphs-risks}
427: \end{figure}
428: 
429: In recommendation, weak ties originate from the rating patterns of the
430: participants, but the jump process also plays a crucial role. We
431: hypothesize two fundamental rating patterns. One can be observed
432: in the public movie recommendation datasets (MovieLens and EachMovie), and
433: the other is what we would assume for a domain where people have stronger
434: bias in their tastes (such as books or music).
435: 
436: The movie datasets exhibit a power-law degree distribution as illustrated
437: in Fig.~\ref{graphs-risks} (top, left). The power-law rating pattern comes from
438: preferential attachment; for example, some movies (the {\it hits}) are
439: rated by almost everyone, and some people (the {\it buffs}) rate almost
440: all movies. Weak ties are rare in this setting but might occur when a person
441: shows no strong allegiance to any genre and rates relatively few movies in
442: each (he/she would not be a buff). The real risk to these people is that
443: they might not have enough ratings in common with anyone to be
444: given recommendations.
445: 
446: The second rating pattern would occur where most people exhibit a preference
447: for a particular kind of artifact. This is illustrated in Fig.~\ref{graphs-risks} (bottom, left) where there are three subgraphs with power-law structures,
448: connected by a relatively small number of ratings. This diagram illustrates
449: one source of weak tie in this setting, which is when someone who
450: ordinarily only rates artifacts in one domain (e.g., networking books)
451: rates an artifact in another domain (e.g., Indian classical music books). 
452: Another
453: possibility is someone with more eclectic tastes who rates artifacts across
454: many domains and, unlike in the power-law graph, is truly a weak tie. The risk
455: with weak ties in this rating pattern is that they may allow us to 
456: identify a person whose ratings can get us from one domain to another.
457: 
458: The jump process, described in Section~\ref{jc-model},
459: can also create weak ties when using common ratings as the basis for making
460: connections between people.
461: Many people might have rated
462: across several domains, but only a few have enough ratings to satisfy the
463: jump being used. A final reason relates to merging of data collected from
464: different settings. For instance, the recent purchase of eToys consumer
465: data by another retail giant signals the possibility of the creation of
466: a social network graph with weak ties.
467: 
468: The risk in a weak tie really comes from being the only person with a peculiar
469: rating pattern --- there is safety in numbers, or at least in
470: homogeneous tastes (as in power-law graphs).
471: The more people who rate the same kinds of things, the less likely
472: that any one of them will be identifiable as participating in a weak tie.
473: But notice that if the jump definition weeds some of those people out,
474: the risk is still there (although it is less likely that any additional
475: information could be used to identify a single person).
476: 
477: \section{The Benefits and Perils of Personalization}
478: \label{benefits-risks}
479: Intuitively, a user desires the most benefit from a recommendation that is
480: based on wider hammocks and shorter path lengths. Of course, to get these 
481: qualities we have to provide more ratings, and the risk is that we might
482: introduce a weak tie. The question, then, is whether we can relate how much
483: we rate to the benefit and risk inherent in recommendation.
484: 
485: \begin{table}
486: \caption {Movies used in analyzing the benefits of ratings on personalization.
487: `Star Wars' and `Scream of Stone' had the highest and lowest number of ratings,
488: respectively.}
489: %\hspace{25mm}
490: \centering
491: \vspace{0.07in}
492: \begin{tabular}{|l|r|} \hline\hline
493: \emph{Movie Name} & \emph {Number of Ratings} \\ \hline
494: Star Wars & 583\\ \hline
495: Tomorrow Never Dies & 180 \\ \hline
496: Robin Hood: Men in Tights & 56 \\ \hline
497: Scream of Stone & 1 \\ \hline
498: \end{tabular}
499: \label{movies-listing}
500: \end{table}
501: 
502: When there are multiple recommendation paths between a given combination
503: of person and movie, we would like a benefit formula that captures our
504: preference for wider hammocks and shorter path lengths. 
505: By defining the benefit of a recommendation as:
506: $$\mathrm{benefit} = {w \over{l^2}}$$
507: we can give
508: more weight to improvements in path length from $2$ to $1$ than,
509: say, from $3$ to $2$. This non-linear dependence of quality of interaction
510: on the length is supported by research in diffusion processes~\cite{watts1}, 
511: social networks~\cite{weak} and also our own experiments (see Fig.~\ref{expt1},
512: right). 
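
To make the weighting concrete, consider a hypothetical hammock width of
$w = 4$ (a value chosen only for illustration):
$$l=1:\ {4 \over 1^2} = 4, \qquad l=2:\ {4 \over 2^2} = 1, \qquad
l=3:\ {4 \over 3^2} \approx 0.44 .$$
Shortening a path from $2$ to $1$ raises the benefit by $3$, whereas
shortening it from $3$ to $2$ raises it by only about $0.56$.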
513: 
514: We can explore benefit in terms of the number of artifacts that are rated.
515: Typically, recommender systems require that users rate a minimum number of
516: artifacts before they can make queries, and so we look at the incremental
517: benefit received by providing additional ratings. For this purpose, we
518: analyze the MovieLens dataset where it is required that a user rate 20 movies,
519: and add a new person. The MovieLens dataset consists of 943 people, 1682
520: movies, and is connected as a graph.
521: 
522: For the experiment, we introduced a 944th person and incrementally added ratings
523: from the new person to movies (chosen so that movies with a higher rating
524: frequency were more likely to be rated). After each rating was added, the path
525: lengths $l$ to particular movies (see Table~\ref{movies-listing}) were computed
526: for each hammock width $w$. Twenty repetitions were performed for each
527: additional rating.
528: 
529: \begin{figure}
530: \centering
531: \begin{tabular}{cc}
532: & \mbox{\psfig{figure=densityplot.eps,width=2.3in}}
533: \end{tabular}
534: \caption{Benefit vs. number of additional ratings required, for various
535: choices of movie destination nodes. The cells are colored with greater 
536: intensities corresponding to movies with fewer ratings.}
537: \label{expt3}
538: \end{figure}
539: 
540: 
541: The benefit from additional ratings for the movies in 
542: Table~\ref{movies-listing} is shown in Fig.~\ref{expt3}. Each colored cell
543: indicates that the particular benefit is possible for the corresponding
544: number of ratings. The feasible benefit regions are actually monotonically
545: increasing by popularity of the movie --- with more possibilities for
546: `Star Wars' than for `Tomorrow Never Dies.' The plot shows that if you
547: want a good recommendation for a less popular movie, you need to provide more
548: ratings, but can receive good recommendations for popular movies with
549: fewer ratings. In particular, requesting an improvement in benefit for
550: a `Star Wars' recommendation from
551: 5 to 14 requires no extra ratings!
552: 
553: \begin{figure}
554: \centering
555: \begin{tabular}{cc}
556: & \mbox{\psfig{figure=small-world.eps,width=5in}}
557: \end{tabular}
558: \caption{Random rewiring, starting from a regular wreath network, introduces
559: weak ties that help model small-world graphs. 
560: Figure adapted from \cite{watts1}.}
561: \label{small}
562: \end{figure}
563: 
564: \begin{figure} \centering
565: \begin{tabular}{cc} 
566: \mbox{\psfig{figure=small-world-graphs.eps,width=2.6in}} &
567: \mbox{\psfig{figure=expt2.epsi,width=2.6in}}\\
568: \end{tabular}
569: \caption{(left) Average path length and clustering coefficient versus
570: the rewiring probability $p$ (from \cite{watts1}). All measurements are
571: scaled w.r.t. the values at $p = 0$. (right) Quantifying the risk 
572: as a function of rewiring probability $p$.} 
573: \label{smgs}
574: \end{figure}
575: 
576: The danger involved
577: in recommendation relates to the probability that a weak connection is
578: exposed; unfortunately, this is not a static property of a recommendation
579: path and can only be studied in reference
580: to the social network graph {\it in the absence} of the considered 
581: connection. This means that we need a more complete understanding of the
582: dynamics by which weak ties are introduced, modeled, and employed in a social
583: network. Such an understanding 
584: could take the form of a graph-generation
585: model. Here we use the model of 
586: Watts and Strogatz \cite{watts1} as a basis for our study of risk.
587: 
588: The intuition is that risk occurs when we have edges that are weak ties between
589: subgraphs that are cliques (or at least nearly so), and the risk decreases
590: as more of these edges are added. In particular, the risk is highest when
591: a new weak tie occurs and the lengths between people decrease dramatically.
592: As more weak ties are added, the risk decreases.
593: 
594: This idea of risk can be explored in the
595: Watts-Strogatz model for small-world networks. They show how to
596: generate a spectrum of graphs from a regular wreath graph by adjusting 
597: the probability $p$ of rewiring an edge
598: (Fig.~\ref{small}). When $p$ is zero, we have the wreath; but when $p$ is
599: one, we have a random graph. The risk from weak ties is low in both the wreath
600: and random graphs, but increases as the average path length drops but
601: the vertices are still clustered. Fig.~\ref{smgs} (left) illustrates
602: the relationship between length and clustering (see~\cite{watts1} for
603: details of the definitions). When $p$ is between $0$ and $0.1$, the
604: graph is a small-world network, and poses the most risk from weak ties.
605: 
606: We can express the risk of weak ties in terms of $p$: the risk in having ratings
607: that form weak ties can be quantified as the rate at which $l$ reduces, as
608: a function of $p$:
609: $$\mathrm{risk} = {- {{\partial l} \over {\partial p}}}$$ 
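
Since $l$ is measured only at sampled values of $p$, one way to estimate this
derivative in practice is a finite difference. The sketch below assumes
samples $p_1 < p_2 < \cdots < p_n$ and scaled lengths
$\hat{l}(p_i) = l(p_i)/l(0)$; it is not necessarily the procedure used for the
results reported here:
$$\mathrm{risk}(p_i) \approx - {{\hat{l}(p_{i+1}) - \hat{l}(p_i)} \over
{p_{i+1} - p_i}} .$$
For example, with hypothetical values $\hat{l}(0.01) = 0.5$ and
$\hat{l}(0.02) = 0.4$, the estimate is $-(0.4 - 0.5)/(0.02 - 0.01) = 10$.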
610: 
611: The risk for the dynamics described in Fig.~\ref{small} is given in
612: Fig.~\ref{smgs}, right (the length values are 
613: scaled with respect to the length at $p=0$ before calculating the risk). 
614: Notice that the risk increases 
615: rapidly (as weak ties
616: are introduced) and drops down gradually (as more weak ties share
617: responsibility for length reduction). This captures our intuition pertaining
618: to disclosure of sensitive information by ferreting out weak ties. 
619: However, our jumping connections model is not directly parameterized by $p$. 
620: 
621: To be useful, the above formula for risk must relate length reduction to
622: a metric that could be used to balance personalization and privacy. We
623: can illustrate the risk of becoming a weak tie by studying what happens
624: as we decrease the hammock width $w$ in Fig.~\ref{graphs-risks} (bottom).
625: Consider the situation when the social network graph consists of three disconnected
626: components.
627: Decreasing the hammock width would introduce new edges that
628: are weak ties, which would contribute to length reduction and thus
629: to the quantified risk. However, recall that increasing the hammock width is
630: desirable from the viewpoint of benefit. Put another way, benefits improve
631: monotonically with increasing width $w$, but risk rises rapidly (as
632: fewer weak ties share responsibility for length reduction) up to a point and
633: then drops sharply.
634: 
635: \begin{figure}
636: \centering
637: \begin{tabular}{cc}
638: & \mbox{\psfig{figure=expt4.ps,width=3in}}
639: \end{tabular}
640: \caption{Risk as a function of hammock width $w$.}
641: \label{expt4}
642: \end{figure}
643: 
644: To explore this setting, we created an artificial dataset that
645: consists of three subgraphs with power-law degree distributions, each
646: with 200 people and 75 artifact vertices. Each person node is linked to
647: at least 15 artifact nodes within the same subgraph. Specifically, the
648: people and artifacts were ordered, and the $b^{\mathrm{th}}$ person rated the first
649: $\lceil 75 b^{-\epsilon} \rceil$ artifacts.
650: The value of $\epsilon$
651: was calibrated to achieve a minimum rating of $15$ artifacts. Then three
652: extra people were added who rate 
653: (at most 15)
654: artifacts in all three connected components, again with a 
655: `master' power-law. 
656: 
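As a worked check of this calibration, suppose the minimum of $15$ ratings is attained by the last person in the ordering ($b = 200$) and ignore the ceiling; both are our simplifying assumptions rather than details given in the setup. Then
$$75 \cdot 200^{-\epsilon} = 15 \;\Longrightarrow\; \epsilon = \frac{\ln 5}{\ln 200} \approx 0.304.$$
With this value, the first person ($b=1$) rates all $75$ artifacts of the subgraph, and the number of ratings decays to $15$ for the last person.
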
657: For a hammock width of 9, the social network of this graph
658: consists of three disconnected components. Decreasing the hammock
659: width introduces weak ties into the social network, and the
660: path lengths decrease. The results are plotted in Fig.~\ref{expt4}
661: (lengths are scaled against the path length for $w=8$). As
662: expected, risk is highest when the graph is first connected.
663: 
664: It is not possible to provide a traditional benefit-risk profile, as is
665: customary in analysis. This is because recommender systems aggregate the
666: ratings of many participants when computing a recommendation. A user's
667: benefit comes from `plugging into' the social network by
668: providing a sufficient number of ratings, but a user's risk depends
669: not only on what is rated but also
670: on what other people rate. Ultimately, the difficulty
671: comes from the fact that risk occurs even if recommendation queries are
672: not made, but benefit requires that the user make queries. 
673: 
674: The two qualitative conclusions from our studies are that (i) a few
675: weak ties are riskier than many weak
676: ties, and (ii) this is more pronounced in some (induced) social networks than in others.
677: 
678: \section{Concluding Remarks}
679: \label{bigger-pictures}
680: The very factors that make weak ties useful are the ones that
681: raise the threat to privacy. We have demonstrated that under certain conditions,
682: recommendations could involve weak ties and could potentially compromise
683: the privacy of individuals. As with most problems in computer security, the
684: ideal deterrents are better awareness of the issues and more openness in
685: how recommender systems operate in the marketplace. In particular, policies
686: and methodologies employed by an individual site should be made clear.
687: Sites that involve multiple homogeneous networks have a crucial responsibility
688: in clarifying the role of weak ties in their system designs and what
689: mechanisms are in place to thwart hackers.
690: 
691: Ideally, recommender systems should convey to the user both benefits and risks
692: in an intuitive manner. One possibility is to present the user with plots of
693: benefit and risk versus user-modifiable parameters, namely ratings, $w$, and $l$
694: (if the algorithm allows their direct specification). Another possibility is
695: to qualify the risks and benefits associated with rating each individual
696: artifact (as a function of the previous ratings in the system). Providing
697: a rating for `Scream of Stone', for instance, would improve benefit far more
698: than providing a rating for `Star Wars.' At the same time,
699: the system should qualify the extent to which a user becomes a weak tie
700: by such a rating.
701: 
702: Singh and colleagues \cite{singh-cacm} make a thought-provoking observation in
703: comparing community-based networks with recommender systems,
704: namely, that people really want to control to whom they reveal their ratings
705: but would also like to know how recommendations are being made. In a distributed
706: setting, one can imagine a scenario where people specify how data collected
707: from their interactions should be modeled and used. Interfaces for 
708: privacy management are woefully inadequate and their role is only now 
709: being recognized \cite{etzioni-cacm}. Extending the results here to 
710: a distributed setting where people can set arbitrary constraints on their
711: station in the social network graph (are they willing to participate
712: in a path? are there constraints on such participation? would they provide
713: ratings if they knew that doing so would contribute to a weak tie?) is a possible
714: direction for future research. 
715: 
716: One wonders whether weak ties will form at all if concerns are raised about
717: their being compromised. Social network theory postulates that they are
718: the primary mechanisms by which micro-level interactions can manifest at
719: macro levels, and that such ties will be kindled whenever communities have
720: to be mobilized for collective action. It remains to be seen whether weak ties
721: induced by jumps in a recommender system also conform to similar
722: distributed organization.
723:  
724: \bibliographystyle{plain}
725: %\bibliographystyle{named}
726: \bibliography{ppp}
727: 
728: \end{document}
729: 
730: