1: \documentclass[11pt,twocolumn]{article}
2:
3: \usepackage{amssymb, amsmath,algorithm,algorithmic,amsthm}
4: \usepackage{graphicx,epsfig,subfigure}
5: \usepackage[left=1in,top=1in,right=1in,bottom=1in,nohead]{geometry}
6:
7: \renewcommand{\baselinestretch}{.98}
8: \newcommand{\weakaddition}{{\tt Weak-Any}}
9: \newcommand{\strongaddition}{{\tt Strong-Any}}
10: \newcommand{\weakgreedy}{{\tt Weak-Greedy}}
11: \newcommand{\stronggreedy}{{\tt Strong-Greedy}}
12:
13: \newcommand{\an}{$(k,1)$}
14: \newcommand{\anonvar}{$(k,\ell)$}
15: \newcommand{\secondvar}{$\ell$}
16: \newtheorem{problem}{{Problem}}
17: \newtheorem{theorem}{{Theorem}}
18: \newtheorem{example}{{Example}}
19: \newtheorem{definition}{{Definition}}
20: \newtheorem{lemma}{{Lemma}}
21: \newtheorem{corollary}{{Corollary}}
22: \newtheorem{proposition}{{Proposition}}
23: \newtheorem{conjecture}{{Conjecture}}
24:
25: \newcommand{\squishlist}{
26: \begin{list}{$\bullet$}
27: { \setlength{\itemsep}{0pt} \setlength{\parsep}{3pt}
28: \setlength{\topsep}{3pt} \setlength{\partopsep}{0pt}
29: \setlength{\leftmargin}{1.5em} \setlength{\labelwidth}{1em}
30: \setlength{\labelsep}{0.5em} } }
31:
32: \newcommand{\squishlisttwo}{
33: \begin{list}{$\bullet$}
34: { \setlength{\itemsep}{0pt} \setlength{\parsep}{0pt}
35: \setlength{\topsep}{0pt} \setlength{\partopsep}{0pt}
36: \setlength{\leftmargin}{2em} \setlength{\labelwidth}{1.5em}
37: \setlength{\labelsep}{0.5em} } }
38:
39: \newcommand{\squishend}{
40: \end{list} }
41:
42:
43: \begin{document}
44:
45: \title{Anonymizing Graphs}
46: \author{
47: Tom\'{a}s Feder\\Stanford University \and Shubha U. Nabar\\Stanford University \and Evimaria Terzi\\IBM Almaden
48: }
49: \date{}
50:
51: \maketitle
52:
53: \begin{abstract}
54: Motivated by recently discovered privacy attacks on social networks, we study the problem of anonymizing the underlying graph of interactions in a social network. We call a graph \emph{\anonvar-anonymous} if for every node in the
55: graph there exist at least $k$ other nodes that share at least
56: $\ell$ of its neighbors. We consider two combinatorial problems
57: arising from this notion of anonymity in graphs. More
58: specifically, given an input graph we ask for the minimum number
59: of edges to be added so that the graph becomes
60: \emph{\anonvar-anonymous}. We define two variants of this
61: minimization problem and study their properties. We show that for
62: certain values of $k$ and $\ell$ the problems are polynomial-time solvable,
63: while for others they become NP-hard. Approximation algorithms for
64: the latter cases are also given.
65: \end{abstract}
66:
67: \pagenumbering{arabic}
68:
69: \newcommand{\DEF}[1]{{\em #1\/}}
70:
71: \newcommand\chic{\chi_c}
72: \newcommand\C{\hbox{${\cal C}$}}
73: \newcommand{\RR}{\mbox{$\mathbb R$}}
74: \newcommand{\NN}{\mbox{$\mathbb N$}}
75: \newcommand{\ZZ}{\mbox{$\mathbb Z$}}
76: \newcommand{\eopf}{\raisebox{0.8ex}{\framebox{}}}
77: \newcommand{\dist}{\hbox{\rm d}}
78: \renewcommand\a{\alpha}
79: \renewcommand\b{\beta}
80: \renewcommand\c{\gamma}
81: \renewcommand\d{\delta}
82: \newcommand\D{\Delta}
83: \newcommand{\directedchi}{\mbox{$\vec{\chi}$}}
84: \newcommand{\directedE}{\mbox{$\vec{E}$}}
85: \newcommand{\directedG}{\mbox{$\vec{G}$}}
86: \newcommand{\directedK}{\mbox{$\vec{K}$}}
87:
88:
89: \section{Introduction}\label{introduction}
90:
91: The popularity of online communities and social networks
92: in recent years has motivated research on social-network analysis.
93: Though these studies are useful in uncovering
94: the underpinnings of human social behavior, they also raise
95: privacy concerns for the individuals involved.
96:
97: A social network is usually represented as a graph, where nodes
98: correspond to individuals and edges capture relationships between
99: these individuals. For example, in LinkedIn, an online network of
100: professionals, every link between two users specifies a
101: professional relationship between them. In Facebook and Orkut
102: links correspond to friendships. There are online communities that
103: permit any user to access the information of every node in the graph
104: and view its neighbors. However, many communities are
105: increasingly restricting access to the personal information of other users. For example, in LinkedIn, a user can only see the profiles of his own friends and their connections.
106:
107: In this paper, we consider a scenario where the owner of a
108: social network would like to release the underlying
109: graph of interactions for social-network analysis purposes, while preserving the privacy of its users. More specifically,
110: the private information to be protected
111: is the mapping of nodes to real-world entities and interconnections amongst them.
112: Therefore, we design an anonymization framework that tries to hide the identity of nodes by creating groups of nodes that look similar by virtue of sharing many of the same neighbors. We call such nodes {\em anonymized}. Our goal is to anonymize all nodes of the graph by introducing minimal changes to the overall graph structure. In this way we can guarantee that the anonymized graph is still useful for social-network analysis purposes.
113:
Recently, Backstrom et al.~\cite{backstrom} have shown that the
simplest graph-anonymization technique, which removes the identity of
each node in the graph and replaces it with a random identification
number, is not adequate for preserving the privacy of
118: nodes. Specifically, they show that in such an anonymized network, there exists an adversary
119: who can identify target individuals and the link structure between them. However, the problem of designing anonymization methods against such
120: adversaries is not addressed in~\cite{backstrom}.
121:
Following the work of~\cite{backstrom}, Hay et al.~\cite{miklau}
123: have very recently given a definition of graph anonymity: a graph
124: is $k$-anonymous if every node shares the same neighborhood
125: structure with at least $k-1$ other nodes. The definition is
126: recursive, and has some nice properties studied in~\cite{miklau}.
127: However, the focus of~\cite{miklau} is mostly on the properties of
128: the definitions rather than on algorithms to achieve the anonymity requirements.
129:
Motivated by~\cite{backstrom} and~\cite{miklau}, Zhou and
131: Pei~\cite{zhou08preserving} consider the following definition of
132: anonymity in graphs: a graph is $k$-anonymous if for every node
133: there exist at least $k-1$ other nodes that share isomorphic
134: $1$-neighborhoods. They consider the problem of minimum
135: graph-modifications (in terms of edge additions) that would lead
136: to a graph satisfying the anonymity requirement. Although this
137: definition is interesting, the algorithm presented
138: in~\cite{zhou08preserving} is not supported by theoretical
139: analysis. Further, if the anonymity definition is extended to consider the neighborhood structure beyond just the immediate $1$-neighborhood of each node, algorithmic techniques quickly become infeasible.
140:
141: Despite the fact that privacy concerns in releasing social-network data have
142: been pinpointed, there is no agreement on the definition of
143: privacy or anonymity that should be used for such data. In this paper, we try to move
144: this line of research one step forward by proposing a new definition
of graph anonymity that is in line to a certain extent with
146: the definitions provided in~\cite{miklau}. Our definition of
147: anonymity is in a sense less strict than the one proposed
148: in~\cite{zhou08preserving}. However, we consider it to be natural,
149: intuitive and more amenable to theoretical analysis.
150:
151: Intuitively our definition aims to protect an individual from an adversary who knows some subset of the individual's neighbors in the graph. After anonymization, the hope is that the adversary can no longer identify the target individual because several other nodes in the graph will also share this subset of neighbors. Further, during anonymization, the identifying subset of neighbors themselves will become distorted and harder for the adversary to identify.
152:
153: \vspace{0.1in}
154: \noindent{\bf The Problem:} We define a graph to be
155: {\anonvar}-anonymous if for every node $u$ in the graph there
156: exist at least $k$ other nodes that share at least $\ell$ of their
157: neighbors with $u$. In order to meet this anonymity requirement one could transform
158: any graph into a complete graph. For a graph consisting of $n$
159: nodes this would mean that every node would share $n-2$ neighbors
160: with each of the $n-1$ other nodes. Although such an anonymization
161: would preserve privacy, it would make the anonymized graph useless
162: for any study. For this reason we impose the additional
163: requirement that the minimum number of such edge additions should
164: be made. The aim is to preserve the utility of the original graph,
165: while at the same time satisfying the {\anonvar}-anonymity
166: constraint.
167:
168: Given $k$ and $\ell$ we formally define two variants of the
169: \emph{graph-anonymization} problem that ask for the minimum number
170: of edge additions to be made so that the resulting graph is
171: {\anonvar}-anonymous. We show that for certain values of $k$ and
172: $\ell$ the problems are polynomial-time solvable, while for others
173: they are NP-hard. We also present simple and intuitive
174: approximation algorithms for these hard instances. To summarize our contributions:
175:
176: \squishlist
177: \item We propose a new definition of graph anonymity building on
178: previously proposed definitions.
179: \item We provide the first formal
180: algorithmic treatment of the graph-anonymization problem.
181: \squishend
182:
183: \vspace{0.15in}
184: Besides graph anonymization, the combinatorial problems we study
185: here may also arise in other domains, e.g., graph reliability.
186: We therefore believe that the problem definitions and
187: algorithms we present are of independent interest.
188:
189: \vspace{0.1in}
190: \noindent{\bf Roadmap:} The rest of the paper is organized as
191: follows. In Section~\ref{sec:related} we summarize the related
192: work. Section~\ref{sec:definitions} gives the necessary notation
193: and definitions. Algorithms and hardness results for different
194: instances of the {\anonvar}-anonymization problem are given in
195: Sections~\ref{2_1},~\ref{6_1to7_1},~\ref{K_1} and~\ref{K_L}. We
196: conclude in Section~\ref{sec:conclusions}.
197:
198:
199: \section{Related Work}\label{sec:related}
200:
201: As mentioned in the Introduction, there has been some prior work
202: on privacy-preserving releases of social-network graphs. The authors
203: in~\cite{backstrom} show that the naive approach of simply masking
204: usernames is not sufficient anonymization. In particular, they
205: show that, if an adversary is given the chance to create as few as $\Theta(\log(n))$
206: new accounts in the network, prior to its release, then he can efficiently recover the
207: structure of connections between any $\Theta(\log^2 (n))$ nodes
chosen a priori. He can do so by identifying the new accounts that he inserted into
the network. The focus of~\cite{backstrom} is on revealing the
210: power of such adversaries and not on devising methods to protect against them.
211:
212: In~\cite{miklau} the authors experimentally evaluate
213: how much background information about the neighborhood of an individual would be sufficient for an adversary to uniquely identify that individual in a naively anonymized graph. Additionally, a new recursive definition of graph anonymity is given. The definition says that a graph is
214: $k$-anonymous if for every structure query there exist $k$ nodes
215: that satisfy it. The definition
216: is constructed for a certain class of structure queries
217: that query the neighborhood structure of the nodes.
Our definition of anonymity is inspired by~\cite{miklau}; however,
it is substantially different. Moreover, the focus of our work is
220: on the combinatorial problems arising from our anonymity
221: definition.
222:
Very recently, the authors of~\cite{zhou08preserving} consider yet
another definition of graph anonymity: a graph is $k$-anonymous if
for every node there exist at least $k-1$ other nodes that share
isomorphic $1$-neighborhoods. This definition of anonymity in
graphs is different from ours; in a sense it is stricter.
228: Moreover, though the algorithm presented
229: in~\cite{zhou08preserving} seems to work well in practice, no
230: theoretical analysis of its performance is presented. Finally,
231: extending the privacy definition to more than just the
232: $1$-neighborhood of nodes causes the algorithms
233: of~\cite{zhou08preserving} to quickly become infeasible.
234:
235: The problem of protecting sensitive links between individuals in
236: an anonymized social network is considered in~\cite{Zheleva:07}.
237: Simple edge-deletion and node-merging algorithms are proposed to
238: reduce the risk of sensitive link disclosure. This work is
239: different from ours in that we are primarily interested in
240: protecting the identity of the individuals while
241: in~\cite{Zheleva:07} the emphasis is on protecting the types of links
242: associated with individuals. Also, the combinatorial problems that
243: we need to solve in our framework are very different from the set
244: of problems discussed in~\cite{Zheleva:07}.
245:
246: In~\cite{Frikken:06} the authors study the problem of assembling
247: pieces of a graph owned by different parties privately. They
248: propose a set of cryptographic protocols that allow a group of
249: authorities to jointly reconstruct a graph without revealing the
250: identity of the nodes. The graph thus constructed is isomorphic to
251: a perturbed version of the original graph. The perturbation
consists of addition and/or deletion of nodes and/or edges. Unlike
253: that work, we try to anonymize a single graph by modifying it as
254: little as possible. Moreover, our methods are purely combinatorial
255: and no cryptographic protocols are involved.
256:
Korolova et al.~\cite{korolova} investigate an attack where an
258: adversary strategically subverts user accounts. He then uses the
259: online interface provided by the social network to gain access to
260: local neighborhoods and to piece them together to form a global picture. The authors provide
261: recommendations on what the lookahead of a social network should
262: be to render such attacks infeasible. This work does not
263: consider an anonymized release of the entire network graph and is thus different from ours.
264:
265: Besides graphs, there has been considerable prior work on
266: anonymizing traditional relational data sets. The line of work on
267: $k$-anonymity found in
268: \cite{aggarwal,gehrke,lefevre,meyerson,sweeney,tclose} aims to
269: minimally suppress or generalize public attributes of individuals
270: in a database in such a way that every individual (identifiable by
271: his public attributes) is hidden in a group of size at least $k$.
272: Our notion of graph anonymity draws inspiration from this.
273:
274: Apart from suppression or generalization techniques, perturbation
275: techniques have also been used to anonymize relational data sets
276: in \cite{dilys,haritsa,srikant}. Perturbation-based approaches for
277: graph anonymization are also considered in~\cite{miklau,xintao}; in that
278: case edges are randomly inserted or deleted to anonymize the graph. We do not consider perturbation-based approaches in this paper.
279:
280: \section{Preliminaries}\label{sec:definitions}
281:
282: In this section we formalize our definition of graph
283: anonymity and introduce two natural optimization problems
284: that arise from it.
285:
286: Throughout the paper we assume that the social-network graph is
287: simple, \textit{i.e.}, it is undirected, unweighted, and contains
288: no self-loops or multi-edges. This is an important category of
289: graphs to study; most of the aforementioned social
290: networks (Facebook, LinkedIn, Orkut) allow only bidirectional
291: links and are thus instances of such simple graphs. We assume that
292: the actual identifiers of individual nodes are removed prior to
293: further anonymization. Our definition for graph anonymity is
294: inspired by the notion of $k$-anonymity for relational data
295: wherein each person, identifiable by his public attributes, is
296: required to be hidden in a group of size $k$. In the case of a
297: social-network graph, the publicly-known attributes of a user
298: would be (a subset of) his connections (and interconnections
299: amongst them) within the graph.
300:
301: Consider a simple unlabelled graph and an adversary who knows that
302: a target individual and some number of his friends form a clique.
303: In the released graph, the adversary could look for such cliques
304: to narrow down the set of nodes that might correspond to the
305: target individual. The goal of an anonymization scheme is to
306: prevent such an adversary from uniquely identifying the individual
307: and his remaining connections in the anonymized graph.
308:
309: We achieve this by introducing an anonymity property that requires
310: that for every node in the graph, some subset of its neighbors
311: should be shared by other nodes. In this way, an adversary
312: who knows some subset of the neighbors of a target individual and can even pinpoint them in the graph, will not be able to distinguish the target individual from other nodes in the network that share this subset of neighbors. Further, in the
313: process of anonymization, the identifying subset of neighbors
314: itself becomes distorted and harder for the adversary to pinpoint.
315: More formally we define the {\anonvar}-anonymity property as
316: follows.
317:
318: \begin{definition}[\anonvar-anonymity]
319: A graph $G=(V,E)$ is {\em \anonvar-anonymous} if for each vertex
320: $v\in V$, there exists a set of vertices $U\subseteq V$ not
321: containing $v$ such that $|U|\geq k$ and for each $u\in U$ the
322: vertices $u$ and $v$ share at least \secondvar\ neighbors.
323: \end{definition}
324:
325: \begin{example}
326: A clique of $n$ nodes is $(n-1,n-2)$-anonymous.
327: \end{example}
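
As a further sanity check of the definition, consider a simple cycle; the notation $C_n$ and $v_0,\ldots,v_{n-1}$ is introduced here only for illustration.

\begin{example}
Let $C_n$ be the cycle on vertices $v_0,\ldots,v_{n-1}$ with $n\geq 5$, where indices are taken modulo $n$. The only vertices that share a neighbor with $v_i$ are $v_{i-2}$ (via $v_{i-1}$) and $v_{i+2}$ (via $v_{i+1}$), and these two vertices are distinct since $n\geq 5$. Hence $C_n$ is $(2,1)$-anonymous but not $(3,1)$-anonymous. Moreover, no two vertices of $C_n$ share two neighbors, so $C_n$ is not $(1,2)$-anonymous.
\end{example}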
328:
329: To demonstrate the kinds of attacks we hope to protect against, we give another example.
330:
331: \begin{figure}
332: \begin{center}
333: \subfigure[Input graph $G$]{\includegraphics[scale=0.75,angle=270]{41a}\label{41examplea}}
334: \subfigure[(4,1)-anonymous transformation of $G$]{\includegraphics[scale=0.75,angle=270]{41b}
335: \label{41exampleb}}
336: \caption{In Figure~\ref{41examplea} an adversary can identify Alice as the node marked X. Figure~\ref{41exampleb} is a (4,1)-anonymous transformation of the graph.}\vspace{-0.15in}
337: \end{center}
338: \end{figure}
339:
340: \begin{example}
341: Consider the graph in Figure~\ref{41examplea}. Suppose an adversary knows that Alice is in this graph and that Alice is connected to a friend who is part of a triangle. There is only one such node in the graph and hence the adversary will be able to determine that the node marked X in the graph uniquely corresponds to Alice. From this he may be able to further infer the identities of Alice's neighbors and their neighbors as well. Now if the edges shown in dotted lines in Figure~\ref{41exampleb} are added to this graph, the resulting graph is $(4,1)$-anonymous. In this new graph, Alice is no longer the only node connected to a node of a triangle. Further, there is no longer only one triangle in the graph.
342: \end{example}
343:
344: Given an input graph $G=(V,E)$ with $n$ nodes, and integers $k$ and
345: $\ell$, our goal is thus to transform the graph into a
346: {\anonvar}-anonymous graph. We focus on transformations that
allow only additions of edges to the original graph.
348: In order for the anonymized graph to remain useful for social-network (or
349: other) studies, we need to ensure that the transformed graph is as
350: close as possible to the original graph. We achieve this by
351: requiring that a minimum number of edges should be added to $G$ so that
352: the $(k,\ell)$-anonymity property holds. This leads us to the
353: following two variants of the {\anonvar}-anonymization problem.
354:
355: \begin{problem}[Weak \anonvar-anonymization]\label{problem:weak}
356: Given a graph $G=(V,E)$ and integers $k$ and $\ell$, find the
357: minimum number of edges that need to be added to $E$, to obtain a
358: graph $G' = (V, E')$ that is \anonvar-anonymous.
359: \end{problem}
360:
361: The following example illustrates the
362: weak-anonymization problem.
363:
364: \begin{figure}
365: \begin{center}
366: \subfigure[Input graph $G$] {\includegraphics[scale =
367: 0.45]{example_input}\label{fig:input_graph}} \hspace{1cm}
\subfigure[Weakly $(k-1,1)$-anonymized graph $G'$]
369: {\includegraphics[scale =
370: 0.45]{example}\label{fig:weak_anonymized}}\caption{Illustrative
371: example of the difference between \emph{weak} and \emph{strong}
372: anonymity. }\label{figure:example}
373: \end{center}
374: \end{figure}
375:
376: \begin{example}\label{example:weak}
377: Consider the input graph $G$ of Figure~\ref{fig:input_graph}. The
378: graph consists of a clique of size $k$ and $2$ nodes $x$ and $y$
379: connected by an edge. The nodes in the clique are all $(k-1,
380: k-2)$-anonymous. However, the existence of $x$ and $y$ prevents
381: $G$ from being fully $(k-1, 1)$-anonymous.
382:
383: Assume now that we connect both $x$ and $y$ to a single node $u$ of the clique.
384: In this way, we construct graph $G'$ shown in
385: Figure~\ref{fig:weak_anonymized}. Obviously, $G'$ is
386: $(k-1,1)$-anonymous; all the nodes in $G'$ (including $x$ and $y$) have
387: $k-1$ other nodes that share at least one of their neighbors. For
388: $x$ and $y$, this neighbor is node $u$.
389: \end{example}
390:
The problem in the above example is that, although graph $G'$ satisfies the
$(k-1,1)$-anonymity requirement, the anonymity of nodes
$x$ and $y$ is achieved via node $u$, which was not a part of their
initial set of neighbors in $G$. Thus, the goal of having many
395: other nodes sharing the original neighborhood structure of $x$ or
396: $y$ is not necessarily achieved unless we place additional
397: requirements on the anonymization procedure. To this end we
398: introduce the problem of \emph{strong anonymization}. Strong
399: anonymity places additional restrictions on how anonymity can be achieved
400: and provides better privacy.
401:
402: \begin{definition}[Strong
403: \anonvar-transformation]\label{dfn:strong_anonymity} Consider
404: graphs $G=(V,E)$ and $G'=(V,E')$, so that $E\subseteq E'$ and $G'$
405: is {\anonvar}-anonymous. For fixed $k$ and $\ell$, we say that
406: $G'$ is a \emph{strongly-anonymized transformation} of $G$, if for
407: every vertex $v\in V$, there exists a set of vertices $U\subseteq
408: V$ not containing $v$ such that $|U|\geq k$ and for each $u\in U$,
409: $|N_G(v) \cap N_{G'}(u)|\geq$ \secondvar. Here $N_G(v)$ is the set
410: of neighbors of $v$ in $G$, and $N_{G'}(u)$ is the set of
411: neighbors of $u$ in $G'$.
412: \end{definition}
413:
414: Therefore, if a graph $G'$ is a strong {\anonvar}-transformation
415: of graph $G$, then each vertex in $G'$ is required to have $k$
416: other vertices sharing at least \secondvar\ of its {\em original}
417: neighbors in $G$. For this to be possible, every vertex must have at least $\ell$ neighbors in the original graph $G$ to begin with.
418:
419: \begin{example}\label{example:strong}
420: Consider again the graph $G$ of Figure~\ref{fig:input_graph} and
421: its transformation to graph $G'$ shown in
422: Figure~\ref{fig:weak_anonymized}. In Example~\ref{example:weak} we
423: showed that graph $G'$ is $(k-1,1)$-anonymous in the weak sense.
424: However, in order to get a strong $(k-1,1)$-transformation
425: of $G$, we would have to connect each of the nodes $x$ and $y$ to $k-1$ other nodes from the clique.
426: \end{example}
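
To quantify the gap in this example, one can count the added edges directly. This is a back-of-the-envelope calculation under the assumption $k\geq 3$, so that the clique nodes already meet the requirement. In $G$ we have $N_G(x)=\{y\}$ and $N_G(y)=\{x\}$, so the strong requirement for $x$ asks for $k-1$ nodes other than $x$ that are adjacent to $y$ in $G'$, and symmetrically for $y$. Since $x$ and $y$ have no other neighbors in $G$, the number of added edges is at least
\[
(k-1) + (k-1) = 2(k-1),
\]
compared with the $2$ edges added in Example~\ref{example:weak}.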
427:
428: The definition of a strong {\anonvar}-transformation gives rise to
429: the following \emph{strong {\anonvar}-anonymization} problem.
430:
431: \begin{problem}[Strong \anonvar-anonymization]\label{problem:strong}
432: Given a graph $G=(V,E)$ and integers $k$ and $\ell$, find the
433: minimum number of edges that need to be added to $E$, to obtain
434: graph $G' = (V, E')$ that is a strong \anonvar-transformation of
435: $G$.
436: \end{problem}
437:
Achieving strong anonymity requires the addition of
at least as many edges as weak anonymity. This statement is formalized as follows.
440:
441: \begin{proposition}
442: Consider input graph $G=(V,E)$ and integers $k$ and $\ell$. Let
443: $G'=(V,E')$ be the $(k,\ell)$-anonymous graph that is the optimal
444: solution for Problem~\ref{problem:weak}, and $G''=(V,E'')$ be the
445: $(k,\ell)$-anonymous graph that is the optimal solution for
446: Problem~\ref{problem:strong}. Then it holds that $|E''|\geq |E'|$.
447: \end{proposition}
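
A short justification can be sketched as follows. By Definition~\ref{dfn:strong_anonymity}, every strong {\anonvar}-transformation $G''$ of $G$ satisfies $E\subseteq E''$ and is itself {\anonvar}-anonymous, so it is also a feasible solution for Problem~\ref{problem:weak}. Since $G'$ is optimal for Problem~\ref{problem:weak}, it follows that $|E'|\leq |E''|$.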
448:
449: The notion of {\anonvar}-anonymity is strongly related to the
450: immediate neighbors of a node in the graph, and how these are
451: shared with other nodes. Therefore, for every node $u$ it is
452: important to know the nodes that are reachable from $u$ via a path
453: of length exactly $2$. Given its importance, we define the notion
454: of $2$-neighborhood of a node as follows.
455:
456: \begin{definition}[$2$-neighborhood]
457: Given a graph $G=(V,E)$ and a node $v\in V$ we define the $2$-neighborhood of $v$ to be the set of all nodes in $G$ that are
458: reachable from $v$ via paths of length exactly $2$.
459: \end{definition}
460:
461: We also define two more terms that will be used in the rest of the paper.
462:
463: \begin{definition}[Residual Anonymity]\label{dfn:residual}
464: Consider a graph $G=(V,E)$ that we would like to make
465: $(k,\ell)$-anonymous. Consider any node $v \in V$ and suppose that
466: $k'$ other nodes in the graph share at least $\ell$ of $v$'s
467: neighbors. Then, we define the residual anonymity of $v$ to be
$r(v) = \max\{k-k', 0\}$. The residual anonymity of a graph
469: $G=(V,E)$ is defined to be $r(G) = \sum_{v \in V} r(v)$.
470: \end{definition}
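
For instance (a small illustration with hypothetical node names $a$, $b$, $c$), take $k=2$ and $\ell=1$, and let $G$ be the isolated path $abc$. Node $a$ shares the neighbor $b$ with $c$ only, so $k'=1$ and $r(a)=1$; by symmetry $r(c)=1$. Node $b$ shares no neighbor with $a$ or $c$, so $k'=0$ and $r(b)=2$. Hence
\[
r(G) = r(a)+r(b)+r(c) = 1+2+1 = 4 .
\]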
471:
472: We define the concept of a deficient node for nodes that are not $(k,\ell)$-anonymous.
473:
474: \begin{definition}[Deficient Node]
475: A node $v$ is deficient if $r(v) > 0$.
476: \end{definition}
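
The two definitions above translate directly into a simple procedure. The following is only a sketch: the name {\tt Residual}, the counter $k'$ and the brute-force loop over all pairs of nodes are illustrative choices and not an optimized routine.

\begin{algorithm}[h]
\caption{{\tt Residual}$(G,k,\ell)$: residual anonymity by brute force (illustrative sketch)}
\begin{algorithmic}[1]
\REQUIRE graph $G=(V,E)$, integers $k$ and $\ell$
\ENSURE $r(v)$ for every $v\in V$, and $r(G)$
\STATE $r(G) \leftarrow 0$
\FORALL{$v \in V$}
\STATE $k' \leftarrow |\{u \in V\setminus\{v\} : |N_G(u)\cap N_G(v)| \geq \ell\}|$
\STATE $r(v) \leftarrow \max\{k-k',0\}$ \COMMENT{$v$ is deficient iff $r(v)>0$}
\STATE $r(G) \leftarrow r(G) + r(v)$
\ENDFOR
\RETURN $r(G)$
\end{algorithmic}
\end{algorithm}

The graph is $(k,\ell)$-anonymous exactly when the returned value is $0$.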
477:
478: It is the deficient nodes that we need to take care of in order to
479: anonymize a graph. With these definitions in hand, we are now ready
480: to proceed to the technical results of the paper.
481:
482:
483: \section{$(2,1)$-anonymization}\label{2_1}
484:
485: In this section we provide polynomial-time algorithms for the weak
486: and strong $(2,1)$-anonymization problems. First, it is easy to
487: see that there is a simple characterization of $(2,1)$-anonymous
488: graphs. This fact is captured in the following proposition.
489:
490: \begin{proposition}\label{prop:characterization}
491: A graph $G=(V,E)$ is $(2,1)$-anonymous if and only if each vertex
$u\in V$ is (a) part of a triangle, (b) adjacent to a vertex of
degree at least 3, or (c) the middle vertex in a path of 5 vertices.
494: \end{proposition}
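
One way to see this is the following sketch, in which the names $x$, $y$, $w_1$ and $w_2$ are introduced only for the argument. Each of the conditions (a)--(c) immediately exhibits two vertices other than $u$ that share a neighbor with $u$: the other two vertices of the triangle, two other neighbors of the vertex of degree at least $3$, or the two endpoints of the path. Conversely, suppose that $x\neq y$ share neighbors $w_1$ and $w_2$ with $u$, respectively, and that (b) fails, so that $w_1$ and $w_2$ have degree at most $2$. Then $w_1\neq w_2$, since otherwise $w_1$ would be adjacent to $u$, $x$ and $y$. If $x=w_2$ or $y=w_1$, then $w_1$ and $w_2$ are adjacent, so $u,w_1,w_2$ form a triangle and (a) holds. Otherwise $x\,w_1\,u\,w_2\,y$ is a path of $5$ distinct vertices with $u$ in the middle, and (c) holds.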
495:
496: The main idea of the algorithms that we develop for
497: $(2,1)$-anonymization is that they add the minimum number of edges
498: so that every vertex of the resulting graph satisfies one of the
499: conditions of Proposition~\ref{prop:characterization}. Both
500: algorithms proceed in two phases: the \emph{deficit-assignment}
501: and the \emph{deficit-matching} phase. The deficit assignment
502: requires a linear scan of the graph in which deficits are assigned
503: to vertices. Roughly speaking, a deficit of $1$ signifies that the vertex needs to be
504: connected to another vertex of non-zero deficit by the addition of an extra edge.
505: This added edge ensures that the $(2,1)$-anonymity requirement for the vertex or its
506: neighbors will be satisfied. Once the deficits are assigned to vertices
507: the algorithms proceed to the actual addition of edges. The edges
508: are added by taking into account the deficits of all vertices. For
509: example, two vertices both of deficit $1$ can be connected by the addition of a
510: single edge (if they are not already neighbors and are not
511: isolated). In this way, a single edge accommodates a total
512: deficit of $2$. The minimum number of edges to be added can be found via a matching of the vertices with deficits. The matching consists of edges that are not already in the graph. A perfect matching is the matching that satisfies all the deficits. In the case of weak anonymization, this matching can be found in linear time by randomly pairing up non-adjacent vertices with deficits. For strong anonymization, it needs to be explicitly computed by solving the maximum-matching problem over edges that are not already in the graph.
513:
514: Another key point in the development of our algorithms is that in
515: order to assign deficits it suffices to explore only vertices that
516: are within a distance $4$ from some leaf vertex or from a vertex
517: of degree $2$. Any other vertex can be shown to satisfy the conditions of Proposition~\ref{prop:characterization}.
518: Finally, it only requires a case analysis to show that our
519: algorithms optimally assign deficits to vertices,
520: independently of the order in which they traverse the vertices of
521: the input graph during the first phase. For lack of space we only
522: give a sketch of the algorithms and proofs in this section.
523:
\subsection{Linear-time weak $(2,1)$-anonymization}
As we have already mentioned, our algorithm for the weak
$(2,1)$-anonymization problem has two phases: (1) deficit
assignment and (2) deficit matching.\footnote{Recall that a node $u$ is assigned deficit $i$ if $i$ edges need to be added between other non-zero deficit vertices and $u$ in order to satisfy the anonymity requirements of $u$ or $u$'s neighbors.}
528:
529: \vspace{0.1in}
530: \noindent{\bf Deficit Assignment:} First assume that the input graph has no
531: isolated vertices -- we will show how to deal with isolated
532: vertices later. For the deficit-assignment phase, the algorithm
533: starts with an {\em unmarked} vertex of
534: degree $1$ or $2$ and explores vertices within a distance $4$ of it. Deficits are assigned as follows:
535:
536:
537: \begin{itemize}
538: \item For an isolated edge $uv$, we assign deficit $1$ to $u$ and
539: deficit $1$ to $v$; it may be that both edges will be added at $u$.
540:
541: \item For an isolated path $uvw$, we assign deficit $1$ to $v$.
542:
543: \item For an isolated path $uvwx$, we assign deficit $1$ to $v$ and
544: deficit $1$ to $w$.
545:
546: \item For a subgraph consisting of a path $uvw$ with adjacent
547: vertices attached to $w$, we assign deficit $1$ to $v$.
548:
\item For a component $uvX_i$ in which vertex $u$ has degree $1$
and vertex $v$ is connected to $u$ and to a set of vertices $X_i$ such that
each $x\in X_i$ has degree $1$ (and to no other vertices), we assign
deficit $1$ to $v$. This component corresponds to an isolated star
centered at $v$.
554:
555: \item For a component consisting of a square $uvwx$ (isolated
556: square), we assign deficit $1$ to $u$ and deficit $1$ to $w$; it may
557: be that the two edges will be added at $u$ and $v$, or that $u$
558: and $w$ will be joined.
559:
560: \item For a subgraph consisting of a square $uvwx$ with edges (one
561: or more) $ux_i$ coming out of the square, we assign deficit $1$ to
562: $v$.
563:
564: \item For a subgraph consisting of squares $uv_1wx_1$,
565: $uv_2wx_2$, $\ldots$, $uv_jwx_j$, we assign deficit
566: $1$ to one of the $v_i$'s.
567:
568: \item Finally, for a subgraph consisting of a vertex $u$ adjacent
569: to vertices $x_i$ of degree $1$ and to a vertex $y$ of degree $2$,
570: assign deficit $1$ to $y$.
571:
572: \end{itemize}
573:
574: All the vertices that are visited in this process are {\em marked}
(that is, the assigned deficits cover all marked vertices), and the
576: deficit-assignment process repeats starting with the next
577: unmarked vertex until no more unmarked
578: vertices of degree $1$ or $2$ remain.
579:
580: \vspace{0.1in}
\noindent{\bf Deficit Matching:} If the
number of vertices with deficit $1$ is $2m$, and either $2m\geq 4$, or
$2m=2$ in some case other than an isolated edge $uv$, then any
perfect matching amongst these
vertices gives the edges to add.
The matching of deficits can be done in linear time since any
(random) pairing of non-adjacent vertices with non-zero deficits suffices. In this case we add $m$ extra edges.
588: If the number of vertices with deficit $1$ is $2m+1$, then all but one of
589: these vertices can be matched, and a single edge needs to be added
590: to the remaining vertex, connecting it to some vertex of degree
591: at least $2$. This results in a total of $m+1$ extra edges.
592: There are, however, some special cases that we need to take care of first.
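
Setting the special cases aside, the two edge counts above can be written as one formula. This is only a sketch; the symbol $D$, the number of vertices with deficit $1$, is introduced here for illustration.
\[
\#\text{added edges} \;=\; \left\lceil \frac{D}{2} \right\rceil =
\begin{cases}
m, & D = 2m,\\
m+1, & D = 2m+1.
\end{cases}
\]
For instance, $D=7$ gives $m=3$: three edges pair up six of the vertices and a fourth edge attaches the remaining vertex to a vertex of degree at least $2$, for $\lceil 7/2\rceil = 4$ edges in total.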
593:
594: \vspace{0.1in}
595: \noindent{\bf Special Cases:} Before finding the perfect matching we match all
596: isolated edges to each other. This is because the isolated edges
597: need to be connected in a special way to take care of the deficits
598: at the two ends. For a pair of isolated edges $uv$ and $u'v'$, we add
599: the edges $uu'$ and $vu'$ (we treat the two deficits of $1$ at $u$
600: and $v$ as being concentrated at $u$). In the end we may be left
601: with a single isolated edge $uv$. In this case, two edges need to be added
602: and we can connect them to any other vertex in the graph
603: forming a triangle. Similarly, in the case where
604: the remainder is an isolated star centered at $v$ with vertices
605: $x_i$ of degree one, it is enough to add a single edge to
606: connect vertices $x_j$ and $x_{j'}$ of the star.
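
The pairing of two isolated edges can be checked by hand. The following sketch assumes that a node is $(2,1)$-anonymous when at least two other nodes share a neighbor with it and that, in the weak version, neighbors gained through added edges also count. After adding $uu'$ and $vu'$ to the isolated edges $uv$ and $u'v'$, the edge set is $\{uv,\,u'v',\,uu',\,vu'\}$ and:
\begin{itemize}
\item $u$ shares $v$ with $u'$, and shares $u'$ with $v$ and $v'$;
\item $v$ shares $u$ with $u'$, and shares $u'$ with $u$ and $v'$;
\item $u'$ shares $u$ with $v$, and shares $v$ with $u$;
\item $v'$ shares $u'$ with $u$ and $v$.
\end{itemize}
Every node thus has at least two other nodes sharing one of its neighbors, so under these assumptions the two added edges take care of both isolated edges.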
607:
608: \vspace{0.1in}
609: \noindent{\bf Isolated Vertices:} It remains to take care of isolated
610: vertices. For this we consider a set of six isolated vertices
611: $u,v,w,u',v',w'$ and we connect them with edges $uv, uw, uu',$
612: $u'v', u'w'$. These five edges can take care of the six isolated
613: vertices. In general, the vertices with deficit $1$ can be attached
614: to isolated vertices first, with two exceptions to be considered
615: next. When we have an isolated edge $xy$, one of the two deficits
616: of 1 can be satisfied by connecting $x$ to an isolated vertex, but
617: the other one can also be satisfied by connecting $x$ to an
618: isolated vertex $u$ if $u$ is also made adjacent to two other
619: isolated vertices $v$ and $w$ to obtain the above mentioned
620: component. Similarly if $x$ is only adjacent to vertices $y_i$ of
621: degree $1$, then the deficit $1$ at $x$ can only be matched to an
622: isolated $u$ if $u$ is also made adjacent to two other isolated
623: vertices $v$ and $w$. In the end we will be left with fewer than
624: six isolated vertices which each need one edge. These can be
625: connected to any vertex in the graph of degree at least $2$. The
626: optimality follows because a tree on $5$ vertices is optimal saving.
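
The six-vertex component can be verified in the same way. This is a sketch under the assumption that, for weak $(2,1)$-anonymity, a node needs at least two other nodes sharing a neighbor with it in the augmented graph. With edges $uv, uw, uu', u'v', u'w'$:
\begin{itemize}
\item $v$ and $w$ each share $u$ with the other and with $u'$;
\item $v'$ and $w'$ each share $u'$ with the other and with $u$;
\item $u$ shares $u'$ with $v'$ and $w'$;
\item $u'$ shares $u$ with $v$ and $w$.
\end{itemize}
Five edges therefore serve six isolated vertices, that is, $5/6$ of an edge per vertex, compared with one edge per vertex for the leftover isolated vertices that are attached individually at the end.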
627:
628: \begin{theorem}
629: The above algorithm solves optimally the weak
630: $(2,1)$-anonymization problem in linear time.
631: \end{theorem}
632:
633: {\em Proof Sketch:} It requires a case analysis (that we omit for lack of space) to show
634: that the deficit-assignment scheme we described above is complete and optimal
635: and that the total deficit assigned is independent of the order in
636: which the vertices of the graph are traversed. Since we find a
637: perfect matching, we satisfy these deficits with as few edges as
638: possible, hence, the optimality of the algorithm.
639:
640: It is also easy to see that the deficit-assignment takes time
641: linear with respect to the number of edges in the graph: first we
642: only consider vertices of degree one or two as starting points. For every such vertex we only have to explore all
vertices within a distance $4$. This is because any other vertex can be seen to satisfy one of the conditions of Proposition~\ref{prop:characterization}. After each iteration of the deficit assignment, we
mark all the vertices that have been visited in this process
(that is, the assigned deficits cover all visited
vertices). The deficit-assignment process continues starting
647: with the next unmarked vertex of degree $1$ or $2$. The scanning of
648: the algorithm requires only linear time with respect to the number
649: of edges in the graph since every traversed edge connects only
650: marked endpoints and thus no edge needs to be traversed more than once
651: by the algorithm.
652:
The deficit-matching phase is also linear since it only requires
finding an arbitrary matching between non-adjacent vertices with deficits.
655:
656: \subsection{Polynomial-time strong $(2,1)$-anonymization}
657:
658: The algorithm for solving the strong $(2,1)$-anonymization problem
659: is very similar to the one presented in the previous section, so
660: we only briefly discuss it here. For brevity we avoid mentioning
661: various special cases that are similar to the weak-anonymization
662: problem. The first key difference is that for strong
663: $(2,1)$-anonymization we need to develop a different
deficit-assignment scheme. Although the actual structures we have
to consider for assigning the deficits are the same, we need to
assign different deficits to different vertices so that we satisfy
667: the strong anonymity requirement. This is because an edge added at
668: a vertex with assigned deficit can only help the original
669: neighbors of the vertex, and not the vertex itself. The second
670: difference is that in the deficit-matching phase we need to
671: actually solve a maximum-matching problem; not every random pairing of non-adjacent vertices with assigned deficit is a valid solution.
672:
673: In strong $(2,1)$-anonymization we first have to assume that there
674: are no isolated vertices in the input graph $G$; otherwise strong
675: $(2,1)$-anonymity is not achievable for these vertices.
676:
677: \vspace{0.1in}
678: \noindent{\bf Deficit Assignment:} For the deficit-assignment step, the
679: algorithm starts with an unmarked vertex in the input graph
680: with degree $1$ or $2$ and assigns deficits as follows:
681:
682: \begin{itemize}
683: \item For an isolated edge $uv$, assign deficit of $2$ at each end.
684:
685: \item For an isolated path $uvwx$, put deficit $1$ at $v$ and at
686: $w$.
687:
688: \item For an isolated square $uvwx$, put deficit $1$ at $u$ and
689: $v$.
690:
691: \item If such a square has edges already coming out of $v$, put
692: just deficit $1$ at $u$.
693:
694:
695: \item If multiple squares $uv_iwx_i$ all start from vertex $u$,
696: then assign deficit $1$ to one of the $v_i$'s.
697:
698:
699: \item For a path $uvw$, put deficit $1$ at each of the $3$ vertices.
700:
701: \item For a vertex of degree at least $3$ attached to vertices of
702: degree $1$, put two deficits of $1$ at degree $1$ vertices.
703:
704: \item If a path starts $uvwx$, with $x$ of degree at least $2$, put
705: deficit $1$ at $v$ and $1$ at $w$.
706:
707:
708: \item If in addition $w$ has other edges coming out of it, put
709: deficit $1$ just at $v$. Otherwise if in addition only $v$ has other
710: edges coming out of it that join to a vertex of degree $1$, put
711: deficit $1$ just at $w$.
712:
713: \end{itemize}
714:
715: All vertices that are visited in the process are marked, and the
716: algorithm proceeds with the next unmarked vertex until there are
717: no unmarked vertices left.
718:
719: \vspace{0.1in}
\noindent{\bf Deficit Matching:} To solve the strong $(2,1)$-anonymization
problem exactly we need to solve a maximum-matching
problem between the nodes with deficits. This can be done in polynomial
time~\cite{papadimitrioucombinatorial}. Note that in the weak
$(2,1)$-anonymization problem \emph{any} random pairing of
non-adjacent nodes with deficits was sufficient, allowing for a linear-time matching phase.
This was because, with the exception of isolated edges and isolated paths of length $4$, there was no case in which two vertices of non-zero deficit could be adjacent. This is not the case in the strong anonymization problem, and here a maximum-matching problem needs to be solved over edges that are not already in the graph.
727:
728: A linear-time deficit-matching algorithm with a small
729: additive error can also be developed. This is summarized in the
730: following theorem.
731:
732: \begin{theorem}
733: The strong $(2,1)$-anonymization problem can be approximated in
734: linear time within an additive error of 2, and can be solved
735: exactly in polynomial time.
736: \end{theorem}
737:
738: {\em Proof Sketch:} It requires again a case analysis to
739: show that the deficit-assignment scheme is optimal and independent
740: of the order in which we traverse the vertices.
741:
742: Now, if all deficits add up to $m$, they can easily be paired
743: using a greedy linear-time matching algorithm. However,
744: the last $2$ deficits may be assigned to adjacent
745: vertices. So instead of adding $\lceil m/2\rceil$ edges, we may
746: add $\lceil m/2\rceil+2$, for an additive error of 2. If instead
747: we use a maximum-matching algorithm to match as many deficits
748: as possible and satisfy the unmatched deficits individually, the
749: problem can be solved optimally in polynomial time.
750:
751:
752: \vspace{-0.15in}
753: \section{From $(6,1)$ to $(7,1)$-anonymity}\label{6_1to7_1}
We show here that given a graph that is already $(6,1)$-anonymous,
it is NP-hard to find the minimal number of edges that need to be
added to make it either weakly or strongly $(7,1)$-anonymous. This result provides insight into the complexity of the anonymization problem, showing that it is hard to achieve anonymity even incrementally. The
757: result follows from a reduction from the {\em 1-in-3
758: satisfiability} problem. An instance of 1-in-3 satisfiability
759: consists of triples of Boolean variables $(x,y,z)$ to be assigned
760: values 0 or 1 in such a way that each triple contains one 1 and
761: two 0s. This problem was shown to be NP-complete by
762: Schaefer~\cite{schaefer78complexity}. We first show that even a
763: restricted form of the 1-in-3 satisfiability problem is
764: NP-complete.
765:
766: \begin{lemma}\label{lemma1}
767: The 1-in-3 satisfiability problem is NP-complete even if each variable
768: occurs in exactly 3 triples, no two triples share more than one variable, and
769: the total number of triples is even.
770: \end{lemma}
771:
772: \begin{proof}
773: We prove this by taking an arbitrary instance of the 1-in-3 satisfiability problem and converting it to an instance satisfying the constraints of the above lemma. We start off by renaming multiple occurrences of a variable $x$ as $x_1$, $x_2$, and so on, so that by the end, each variable occurs in at most 1 triple and no two triples share more than one variable. We can then enforce the condition that each $x_i$ be equal to $x_{i+1}$ by inserting the triples
774: $(x_i,u,v)$, $(x_{i+1},u',v')$, $(u,u',w)$ and $(v,v',w)$.
775: This guarantees at most 3 occurrences of
776: each variable in triples. If a variable $y$ occurs in 2 triples, we may include
777: a triple $(y,z,t)$ introducing two new variables, so that at the end of this process each variable occurs in either
778: 1 or 3 triples. Finally we make nine copies of the entire instance, each labeled $(i,j)$ with
779: $1\leq i,j\leq 3$, and equate the $z$s that have the same $i$ and also equate
780: the $t$s that have the same $j$. This guarantees that each variable appears
781: in exactly 3 triples. Making two copies of this instance guarantees that the
782: number of triples is even.
783: \end{proof}
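
To see why the four inserted triples force $x_i = x_{i+1}$, one can read each triple as the equation ``the three values sum to $1$'' over $\{0,1\}$. The following short derivation is a sketch of this standard argument and is not spelled out in the proof above:
\begin{align*}
x_i + u + v &= 1,\\
x_{i+1} + u' + v' &= 1,\\
u + u' + w &= 1,\\
v + v' + w &= 1.
\end{align*}
Adding the first two equations and subtracting the last two gives
\[
x_i + x_{i+1} = 2w ,
\]
so $x_i = x_{i+1} = w$. Both values are attainable: for $x_i = x_{i+1}=1$ take $w=1$ and $u=v=u'=v'=0$, and for $x_i=x_{i+1}=0$ take $w=0$, $u=v'=1$ and $v=u'=0$.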
784:
785: \begin{theorem}
786: Suppose $G$ is $(6,1)$-anonymous. Finding the smallest set of
787: edges to add to $G$ to solve the weak or strong
788: $(7,1)$-anonymization problem is NP-hard. The same results hold
789: for going from $(k,1)$-anonymity to weak or strong $(k+1,
790: 1)$-anonymity when $k \geq 6$.
791: \end{theorem}
792:
793: \begin{proof}
794: We show this via a reduction from the 1-in-3 satisfiability
795: problem. We take an instance of the 1-in-3 satisfiability problem
796: satisfying the constraints of Lemma~\ref{lemma1}. We further
797: assume that the number of triples in this instance is a multiple
798: of 3, since if it is not a multiple of 3, it is easy to see that
799: there will be no satisfying assignment. Since we also assume that
800: the number of triples is even, the number of triples is in fact of
801: the form $6m$.
802:
803: Taking this instance, we now form a cubic bipartite graph
804: $G=(U,V,E)$ by creating a vertex in $U$ for each triple and a
805: vertex in $V$ for each variable, with the two vertices connected
806: by an edge if the variable occurs in the triple. We add 5 new
807: neighbors of degree 1 to each vertex in $U$. Each of these added
808: neighbors and the vertices in $V$ are $(7,1)$-anonymous, but the
809: vertices in $U$ have only 6 vertices at distance 2, namely the 2
810: other neighbors of each of the 3 neighbors in $V$, giving
811: $(6,1)$-anonymity. We would like to increase the anonymity of
812: these vertices so that they are also $(7,1)$-anonymous. Note that
813: a solution to this anonymity problem has to consist of at least
814: $m$ edges. This is because the total residual anonymity of the graph is $6m$ and each new edge can reduce the residual anonymity by at most $6$. Now, if it were possible to select $2m$ vertices in $V$
815: that were adjacent to all the $6m$ vertices in $U$, we could
816: insert a perfect matching of $m$ edges between these $2m$ vertices
817: and simultaneously increase the anonymity of all the vertices in
818: $U$ by at least 1. This would correspond to a solution to the
819: 1-in-3 satisfiability problem. Similarly, if there is a solution
820: to the anonymity problem that involves the addition of only $m$
821: edges, it must necessarily correspond to a solution to the 1-in-3
822: satisfiability problem. Thus a solution to the 1-in-3
823: satisfiability problem exists if and only if the solution to the
824: anonymity problem involves the addition of $m$ extra edges.
825:
826: For $k\geq 6$, add $k-2$ nodes of degree 1 attached to each vertex
827: in $U$. Attach an additional node of degree $k-5$ to each vertex
828: in $U$. Attach the remaining $k-6$ neighbors of each such
829: additional node to a clique of size $k+2$. The result then follows
830: from the case of $k=6$.
831: \end{proof}
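
The counts used in the proof for $k=6$ can be tallied as follows. This is a sketch that only restates the quantities above, with $|U|$ and $|V|$ denoting the sizes of the two sides of the bipartite graph. With $6m$ triples of $3$ variables each and every variable in exactly $3$ triples,
\[
|U| = 6m, \qquad |V| = \frac{3\cdot 6m}{3} = 6m .
\]
In a satisfying assignment each triple contains exactly one variable set to $1$, and each such variable lies in $3$ triples, so the number of variables set to $1$ is $6m/3 = 2m$; a perfect matching on them uses $2m/2 = m$ edges. Conversely, the residual anonymity is $6m$ and one edge removes at most $6$ of it, so at least $6m/6 = m$ edges are needed.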
832:
833: The complexity of minimally obtaining weak and strong
834: $(k,1)$-anonymous graphs remains open for $k=3,4,5,6$.
835:
836: \section{{\an}-anonymization}\label{K_1}
We start our study of the $(k,1)$-anonymization problem by giving
838: two simple $O(k)$-approximation algorithms. We then show that the
839: approximation factor can be further improved to match a lower bound.
840: \subsection{${\text O}(k)$-approximation algorithms for $(k,1)$-anonymization}
841: Let $G=(V,E)$ be the input graph to the weak $(k,1)$-anonymization
842: problem. Consider the following simple iterative algorithm: at
843: every step $i$ add to graph $G_i$ ($G_1 = G$) a single edge
844: between a neighbor of a deficient node $u$ and a node that is not
845: already in the $2$-neighborhood of $u$ in $G_i$. If there are only
846: isolated deficient nodes in $G_i$, the algorithm directly connects
847: a deficient node to a node of a $(k+1)$-clique. If no such clique
848: exists, the algorithm creates it in a preprocessing step; $(k+1)$
849: randomly selected nodes are picked for this purpose. Repeat the
850: process until no deficient nodes remain. We call this algorithm
851: the {\weakaddition} algorithm. We show that {\weakaddition} is an
852: ${\text O}(k)$-approximation algorithm for the weak
853: $(k,1)$-anonymization problem. This result is summarized in the
854: following theorem.
855:
856:
857: \begin{theorem}\label{thm:weak1}
858: {\weakaddition} gives a ${\text O}(k)$-approximation
859: for the weak $(k,1)$-anonymization problem. If the optimal
860: solution is of size $t$, {\weakaddition} adds at
861: most $4kt + k^2$ edges.
862: \end{theorem}
863:
864: \begin{proof}
865: Let $R = \sum_{v\in V}r(v)$ be the residual anonymity (see
866: Definition~\ref{dfn:residual}) of graph $G = (V,E)$. Let
867: ${\text{\sc Wa}}$ be the total number of edges added by the
868: {\weakaddition} algorithm. It holds that ${\text{\sc Wa}} \leq
869: R+k^2$. This is because at every step the algorithm adds one edge
870: that decreases the residual anonymity of the graph by at least
871: $1$. Therefore the algorithm adds at most $R$ edges. The
872: additional $k^2$ edges may be required to create a $(k+1)$-clique if such a clique does not exist.
873:
874: Now assume that the optimal solution adds $t$ edges. Consider an
875: edge $uv$ of the optimal solution. This edge, at the time of its
876: addition, could have decreased the residual anonymity of the graph
877: by at most $4k$. This is because it could have decreased the
878: residual anonymity of each of $u$ and $v$ as well as the residual
879: anonymities of at most $k$ neighbors connected to $u$ and at most
880: $k$ neighbors connected to $v$ (if $u$ or $v$ had more than $k$
881: neighbors, then none of these neighbors would have been
882: deficient). Further, the edge $uv$ could have decreased the
883: residual anonymity of $u$ or $v$ by at most $k$, and the residual
884: anonymities of each of the $k$ neighbors of $u$ or each of the $k$
885: neighbors of $v$ by at most 1.
886:
887: Thus, each edge of the optimal solution could have reduced the
888: residual anonymity of the graph by at most $4k$ at the time of its
889: addition. That is, $t \geq R/4k$.
890:
891: Thus it is clear that $\text{\sc Wa}\leq 4kt +k^2$.
892: \end{proof}
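
Written as a single chain, the two bounds in the proof combine as
\[
\text{\sc Wa} \;\leq\; R + k^2 \;\leq\; 4kt + k^2 ,
\]
where the second inequality uses $t \geq R/4k$. As a purely illustrative instance (the numbers are hypothetical and not taken from any experiment), $k=3$ and $t=5$ give at most $4\cdot 3\cdot 5 + 3^2 = 69$ added edges. The $k^2$ term is a convenient upper bound on the ${k+1 \choose 2} = k(k+1)/2$ edges of a $(k+1)$-clique, which is $6$ for $k=3$.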
893:
894: For the strong $(k,1)$-anonymization problem we show that the
895: {\strongaddition} algorithm (very similar to {\weakaddition}), is
896: an ${\text O}(k)$-approximation. {\strongaddition} is also
897: iterative: in each iteration $i$ it considers graph $G_i$ and adds
898: one edge to it. The edge to be added is one that connects a
899: neighbor of a deficient node $u$ to a node that is not already in
900: the $2$-neighborhood of $u$. This process is repeated till no
901: deficient nodes remain. We can state the following for the
902: approximation ratio achieved by the {\strongaddition} algorithm.
903:
904: \begin{theorem}
905: {\strongaddition} is a $2k$-approximation algorithm
906: for the strong $(k,1)$-anonymization problem.
907: \end{theorem}
908:
909: \begin{proof}
910: As in the proof of Theorem~\ref{thm:weak1} consider input graph
911: $G=(V,E)$ with initial residual anonymity $R$. Every edge added by
912: the {\strongaddition} algorithm would reduce the residual
913: anonymity of the graph by at least $1$. Therefore, if the number
914: of edges added by the {\strongaddition} algorithm is $\text{\sc
915: Sa}$ we have that $\text{\sc Sa}\leq R$.
916:
917: Suppose now that the optimal solution adds $t$ edges. An added
918: edge $uv$ decreases the residual anonymity of the graph by at most
919: $2k$. This is because the edge can decrease the residual anonymity
920: of only the {\em original} neighbors of $u$ and $v$ by at most $1$
921: each and there can be at most $2k$ such deficient neighbors. Thus
922: $t \geq R/2k$.
923:
924: From the above we have that $\text{\sc Sa}\leq 2kt$.
925: \end{proof}
926:
927: \subsection{$\Theta(\log n)$-approximation algorithms for $(k,1)$-anonymization}
928:
929: We now provide two simple greedy algorithms for the weak and
930: strong \an-anonymization problems and show that they output
931: solutions that are ${\text O}(\log n)$-approximations to the
932: optimal. We then show that this is the best approximation factor
933: we can hope to achieve for arbitrary $k$.
934:
935: We start by presenting {\weakgreedy} which is an ${\text O}(\log
936: n)$-approximation algorithm for the weak $(k,1)$-anonymization
937: problem. Consider input graph $G=(V,E)$ that has total
938: residual anonymity $R$. The optimal solution to the problem
939: consists of a set of edges that together take care of all the
940: residual anonymity in the graph.
941:
942:
943: \begin{figure}[]
944: \begin{center}
945: {\includegraphics[scale =
946: 0.35]{reinforcement}}\caption{Illustrative example of the
947: reinforcement between new edges in the weak-anonymization
948: problem.}\label{figure:reinforcement}
949: \end{center}
950: \end{figure}
951:
952: We may be tempted to use a set-cover type solution: greedily
953: choose edges to add that maximally reduce the residual anonymity
954: of the graph at each step. However, such a greedy algorithm is not
955: so easy to analyze in the context of the weak-anonymization
956: problem. The difficulty in the analysis stems from the fact that
957: the new edges may \emph{reinforce} each other. That is, the
958: addition of an edge may bring about a greater reduction in the
959: residual anonymity of the graph in the presence of other added
960: edges. Consider, for example, the input graph $G$ shown in
961: Figure~\ref{figure:reinforcement}. Note that solid lines
962: correspond to the original edges in $G$. In this case, the
963: addition of edge $x2z1$ alone does not help in the anonymization
964: of node $y2$. (Neither does the addition of edge $y2z1$ in the
965: anonymization of $x2$). However, if edge $y2z1$ is already added
966: in the graph, then edge $x2z1$ helps in anonymizing node $y2$
967: as well.
968:
969: We get around this peculiarity of our problem by greedily choosing
970: triplets of edges to add instead of singleton edges. Algorithm~1, called {\weakgreedy}, describes the procedure.
971:
972: \begin{algorithm}[H]\label{alg1}
973: \caption{{\weakgreedy} for weak $(k,1)$-anonymization}
974: \begin{algorithmic}[1]
975: \STATE //Input: $k, G=(V,E)$
976: \STATE Randomly choose a node $w \in V$
\STATE Add up to ${k+1 \choose 2}$ edges to $E$ to form a $(k+1)$-clique at $w$
978: \STATE Compute $R = $ residual anonymity of $G$
979: \WHILE{$R > 0$}
980: \STATE Find triplet $uv, uw, vw$ that maximally decrease $R$
981: \STATE $E = E \cup \{uv\} \cup \{uw\} \cup \{vw\}$
982: \STATE Update $R$
983: \ENDWHILE
984: \end{algorithmic}
985: \end{algorithm}
986:
987: \begin{theorem}
{\weakgreedy} is a polynomial-time nearly ${\text
O}(\log n)$-approximation algorithm for the weak \an-anonymization
problem. If the optimal solution is of size $t$, the algorithm
adds $k^2+6t\log n$ edges.
992: \end{theorem}
993:
994: \begin{proof}
995: Consider the optimal solution of $t$ edges. These $t$ edges
996: together take care of all the residual anonymity in the graph. We
997: can convert this solution to a solution of triplets that consists of at most $k^2 +
998: 3t$ edges: first randomly choose a node $w$ and create a
999: $(k+1)$-clique amongst $w$ and $k$ other randomly chosen nodes.
1000: Then, for each edge $uv$ of the optimal solution, add a triangle
1001: $(uv, vw, uw)$ to the graph. The resulting graph will clearly
1002: continue to be \an-anonymous. The $t$ triangles in conjunction
1003: with the $(k+1)$-clique take care of all the residual anonymity in
1004: the graph. Further, these triangles do not reinforce each other
1005: because they are all connected to a node of degree $k$.
1006:
Going back to Algorithm~1, this means that once a $(k+1)$-clique has been added to
the graph, at each iteration of the algorithm, there must exist
some triangle with a vertex in the $(k+1)$-clique that reduces the
remaining residual anonymity of the graph by at least a $1/t$ fraction (similar to the argument for the greedy set cover algorithm). And
since the algorithm greedily chooses triangles to add, the
residual anonymity of the graph will decrease by at least this
fraction at each step. Since the residual anonymity of the graph can
be at most $kn < n^2$ to begin with, the algorithm will only
proceed for at most $r$ iterations till $(1 - 1/t)^r \leq 1/(kn)$.
This would mean that $r= {\text O}(t\log(kn)) = {\text O}(2t\log
n)$ and $3r = {\text O}(6t\log n)$.
1018: \end{proof}
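
The iteration bound can be derived explicitly. The following is a sketch, assuming natural logarithms and writing $R_j$ (a symbol introduced here) for the residual anonymity after $j$ iterations. Since each iteration removes at least a $1/t$ fraction of what remains,
\[
R_r \;\leq\; R_0\left(1-\frac{1}{t}\right)^{r} \;\leq\; kn\, e^{-r/t},
\]
using $1-x\leq e^{-x}$. The right-hand side is below $1$ as soon as $r > t\ln(kn)$, and because $R_r$ is a non-negative integer this forces $R_r = 0$. With $kn < n^2$ we have $t\ln(kn) < 2t\ln n$, so $r = {\text O}(t\log n)$ iterations of three edges each suffice.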
1019:
The approximation algorithm for the strong \an-anonymization
1021: problem is simpler, since added edges cannot
1022: reinforce each other --- an added edge can only help the original
1023: neighbors of its two end points. Algorithm~2 gives the details of
1024: the {\stronggreedy} algorithm.
1025:
1026: \begin{algorithm}[H]
1027: \caption{{\stronggreedy} for $(k,1)$-anonymization}
1028: \begin{algorithmic}[1]
1029: \STATE //Input: $k, G=(V,E)$
1030: \STATE Compute $R = $ residual anonymity of $G$
1031: \WHILE{$R > 0$}
1032: \STATE Find edge $uv$ that maximally reduces $R$
1033: \STATE $E = E \cup \{uv\}$
1034: \STATE Update $R$
1035: \ENDWHILE
1036: \end{algorithmic}
1037: \end{algorithm}
1038:
1039:
1040: Since the added edges do not reinforce each other in the strong
1041: $(k,1)$-anonymization problem, the analysis of {\tt Strong-Greedy} is
1042: similar to the analysis of the greedy algorithm for the standard
1043: set-cover problem.
1044:
1045: \begin{theorem}
1046: {\stronggreedy} is a polynomial-time $2\log n$-approximation
1047: algorithm for the strong \an-anonymization problem.
1048: \end{theorem}
1049:
1050: \begin{proof}
Suppose the optimal solution adds $t$ edges to eliminate the
residual anonymity of the graph, which is at most $kn < n^2$. Since edges
of the solution do not reinforce each other, there must exist some
edge that reduces the remaining residual anonymity of the graph by at least
a $1/t$ fraction.
1056:
1057: Therefore at each iteration of Algorithm~2, we greedily choose an
1058: edge to add that must cause at least this much reduction in the
1059: residual anonymity of the graph. The algorithm will thus terminate
1060: after $r$ steps where ${(1-1/t)}^r \leq 1/(kn)$, or
1061: $r=t\log(kn)\leq 2t\log n$.
1062: \end{proof}
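
For a sense of scale, consider a purely hypothetical instance with $n=1000$, $k=10$ and $t=20$, again assuming natural logarithms. Then $kn = 10^4 < n^2 = 10^6$ and
\[
t\ln(kn) = 20\ln(10^4) \approx 184.2, \qquad 2t\ln n = 40\ln(1000) \approx 276.3 ,
\]
so the bound allows the greedy algorithm on the order of a few hundred edges where the optimum uses $20$. This illustrates the worst-case guarantee only and says nothing about typical behavior.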
1063:
1064:
1065: We next show that $\log n$ is the best factor we
1066: could hope to achieve for unbounded $k$, for both the weak and strong
1067: $(k,1)$-anonymization problems via an approximation-preserving reduction from the hitting set problem.
1068:
1069: \begin{theorem}
1070: The weak and strong $(k,1)$-anonymization problems with $k$ unbounded are
1071: $\Omega(\log n)$-approximation NP-hard.
1072: \end{theorem}
1073:
1074: \begin{proof}
1075: Hitting set is $\Omega(\log n)$-approximation NP-hard. Consider an
1076: instance of the hitting-set problem consisting of sets ${\cal S} =
1077: \{S_1, S_2, \ldots\}$. Let $k$ be greater than the maximum number
1078: of sets intersecting any one set $S_i$. Add a unique element $v_i$
1079: to each $S_i$. Additionally, construct sets ${\cal T} = \{T_1,
1080: T_2, \ldots\}$ such that each $T_i$ contains the appropriate
1081: $v_i$'s so that every $S_i$ intersects exactly $k-1$ other sets. In
1082: every set $T_i$ add an additional element $w$ so that each set in
1083: ${\cal T}$ intersects at least $k$ other sets. Now construct a
1084: bipartite graph $G=(U,V,E)$, where the vertices of $U$ correspond
1085: to the sets in ${\cal S}$ and ${\cal T}$, the vertices of $V$
1086: correspond to individual members of these sets, with $E$
1087: indicating membership of elements from $V$ in sets from $U$. For
1088: every element $u$ in $U$ create $(k+1)$ new vertices of degree $1$
1089: in $V$ and connect them to $u$. In the resulting graph, the
1090: vertices in $V$ are all \an-anonymous, however the vertices in $U$
1091: that correspond to sets in ${\cal S}$ are only $(k-1,
1092: 1)$-anonymous. Consider the $t$ nodes in $V$ that are the optimal
1093: solution to the hitting-set problem. Then matching these nodes
1094: using $\lceil t/2\rceil$ edges will be an optimal solution to the
1095: strong or weak $(k, 1)$-anonymization problem in the bipartite
1096: graph $G=(U,V,E)$. Therefore, an optimal solution to the
1097: anonymization problem corresponds to an optimal solution to the
1098: hitting-set problem which is $\Omega(\log n)$-hard to approximate.
1099: \end{proof}
1100:
1101:
1102: \section{{\anonvar}-anonymization for $\ell > 1$}\label{K_L}
1103:
1104: In this section we provide algorithms for the weak and strong
1105: {\anonvar}-anonymization problems when $\ell > 1$.
1106:
The algorithm for weak {\anonvar}-anonymization is a randomized algorithm that constructs a bounded-degree expander between deficient vertices. Given a
$(k',\ell)$-anonymous graph $G$, it solves the weak $(k,
\ell)$-anonymization problem by adding only ${\text
O}(\sqrt{(k-k')\ell})$ additional edges at each vertex. The algorithm
can also be easily adapted to solve the weak $(k,
\ell)$-anonymization problem for any input graph irrespective of
its initial anonymity.
1114:
1115: \begin{theorem}
1116: There exists a randomized polynomial-time algorithm that adds ${\text O}(\sqrt{(k-k')\ell})$ edges per vertex and increases the anonymity of
1117: a graph from $(k',\ell)$ to $(k,\ell)$ where $\ell\leq k\leq
1118: n^{1-\epsilon}$ and $\epsilon$ is a constant greater than $0$.
1119: \end{theorem}
1120:
{\em Proof Sketch:} Randomly partition the $n$ vertices into $n/\ell$ sets
of size $\ell$. Treat each set as a ``supernode''.
Construct an expander of degree $\sqrt{(k-k')/\ell}$ on these
$n/\ell$ supernodes. In this way each supernode has $(k-k')/\ell$
supernodes in its $2$-neighborhood that can be reached through just one
intermediate supernode. Replace each edge $uv$ of this expander
with a complete bipartite graph $K_{\ell, \ell}$ between the constituent
vertices of the supernodes $u$ and $v$. Thus each vertex now has $k-k'$
vertices in its $2$-neighborhood that can be reached through
an intermediate set of size $\ell$. Since $\ell\leq k\leq
n^{1-\epsilon}$, we can show that with high probability, none of
these $k-k'$ new vertices will coincide with the $k'$ vertices
previously in the node's $2$-neighborhood.
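
The parameters in the sketch fit together as follows. This is a back-of-the-envelope check that ignores rounding and assumes that the supernodes reached in two steps are all distinct; $d$ is a symbol introduced here for the expander degree, $d=\sqrt{(k-k')/\ell}$. Each vertex receives $d\ell$ new edges, and
\[
d\,\ell = \sqrt{(k-k')\,\ell}, \qquad d^2\,\ell = k-k' .
\]
The first quantity matches the per-vertex edge count in the theorem. The second is the approximate number of vertices reachable in two steps (about $d^2 = (k-k')/\ell$ supernodes of $\ell$ vertices each), every one of which shares the $\ell$ vertices of an intermediate supernode.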
1134:
1135:
1136: As a final result, we present the algorithm for strong {\anonvar}-anonymization. This algorithm is a generalization of the {\stronggreedy} algorithm (see Algorithm~2).
1137: The difference is that instead of picking a single edge to add at
1138: every iteration the algorithm picks edges in groups of size at
1139: most $\ell$. At each iteration it picks the group that causes the largest
1140: reduction in the residual anonymity of the graph. The pseudocode
1141: is given in Algorithm~3.
1142:
1143: \begin{algorithm}[H]
1144: \caption{{\stronggreedy} for $(k, \ell)$-anonymization}
1145: \begin{algorithmic}[1]
\STATE //Input: $k, \ell, G=(V,E)$
\STATE Compute $R = $ residual anonymity of $G$
\WHILE{$R > 0$}
\STATE Find set of edges ${\cal E}$, with $|{\cal E}| \leq \ell$, that maximally reduces $R$
\STATE $E = E \cup {\cal E}$
\STATE Update $R$
\ENDWHILE
1150: \end{algorithmic}
1151: \end{algorithm}
1152:
1153: We can state the following theorem for the approximation factor of
1154: Algorithm~3 when $\ell$ is a constant.
1155:
1156: \begin{theorem}
1157: Consider $G=(V,E)$ to be the input graph to the strong {\anonvar}-anonymization problem.
1158: Also assume $\ell$ is a constant.
1159: Let $t$ be the optimal number of edges that need to be added to solve the strong {\anonvar}-anonymization problem on $G$. Then Algorithm~3 is a
1160: polynomial-time $O(t^{\ell-1}\log n)$-approximation algorithm.
1161: \end{theorem}
1162:
{\em Proof Sketch:} In the $(k, \ell)$-anonymization problem,
groups of up to $\ell$ edges at a time incident to a single vertex
can reduce the residual anonymity of a vertex adjacent to the $\ell$
endpoints of these edges. The $t$ edges added by the optimal
solution define at most $t^{\ell}$ subsets of at most $\ell$ edges
incident to a single vertex. By selecting such subsets greedily as
in a set-cover problem we ultimately reduce the residual anonymity
of the graph to $0$ in $O(t^{\ell}\log n)$ steps. We can show that reinforcement effects between subsets of edges are taken care of. Since each step adds at most $\ell$ edges and $\ell$ is a constant, this proves the $O(t^{\ell}\log n)$ bound on the number of edges
selected; dividing by the optimum $t$ gives the $O(t^{\ell-1}\log n)$ approximation factor. If $\ell$ is a small constant, this factor is a low-degree polynomial in $t$ times $\log n$. Further, in practice this simple algorithm may perform better than this worst-case bound indicates.
1172:
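{\em Remark (where the factor comes from).} As an illustrative check of the bound, under the assumption made in the sketch that greedy selection behaves as in set cover: every nonempty subset of at most $\ell$ of the $t$ optimal edges is the set of entries of some $\ell$-tuple of those edges, so there are at most $t^{\ell}$ such subsets. Each greedy step adds at most $\ell$ edges, hence
\[
\frac{\text{edges added by Algorithm~3}}{\text{optimum}}
\;\le\; \frac{\ell\cdot O(t^{\ell}\log n)}{t}
\;=\; O(t^{\ell-1}\log n)
\qquad (\ell \text{ constant}).
\]
For instance, the bound reads $O(\log n)$ for $\ell=1$ and $O(t\log n)$ for $\ell=2$.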
1173: \section{Conclusions}\label{sec:conclusions}
1174: Motivated by recent studies on privacy-preserving graph releases,
1175: we proposed a new definition of anonymity in graphs. We further
1176: defined two new combinatorial problems arising from this
1177: definition, studied their complexity and proposed simple,
1178: efficient and intuitive algorithms for solving them.
1179:
The key idea behind our anonymization scheme is to require
that every node in the graph share at least $\ell$ of its
neighbors with at least $k$ other nodes. The optimization problems we
1183: defined ask for the minimum number of edges to be added to the
1184: input graph so that the anonymization requirement is satisfied.
1185: For these optimization problems we provided algorithms that solve
1186: them exactly ($k=2$) or approximately ($k>2$).
1187:
1188: An interesting avenue for future work would be to fully characterize the
1189: kinds of attacks that our definition of anonymity protects
1190: against, and to study the impact of our anonymization schemes on the utility of the graph release.
1191:
1192: Finally, we believe that the combinatorial problems we have
1193: studied in this paper are interesting in their own right, and may
1194: also prove useful in other domains. For example, at a high level
1195: there is a similarity between the problem we study in this paper
1196: and the problem of constructing reliable graphs for, say, reliable
1197: routing.
1198:
1199:
1200: \bibliographystyle{plain}
1201: \bibliography{graphanon}
1202:
1203: \end{document}
1204: