
1: \documentclass[11pt,twocolumn]{article}

2:

3: \usepackage{amssymb, amsmath,algorithm,algorithmic,amsthm}

4: \usepackage{graphicx,epsfig,subfigure}

5: \usepackage[left=1in,top=1in,right=1in,bottom=1in,nohead]{geometry}

6:

7: \renewcommand{\baselinestretch}{.98}

8: \newcommand{\weakaddition}{{\tt Weak-Any}}

9: \newcommand{\strongaddition}{{\tt Strong-Any}}

10: \newcommand{\weakgreedy}{{\tt Weak-Greedy}}

11: \newcommand{\stronggreedy}{{\tt Strong-Greedy}}

12:

13: \newcommand{\an}{$(k,1)$}

14: \newcommand{\anonvar}{$(k,\ell)$}

15: \newcommand{\secondvar}{$\ell$}

16: \newtheorem{problem}{{Problem}}

17: \newtheorem{theorem}{{Theorem}}

18: \newtheorem{example}{{Example}}

19: \newtheorem{definition}{{Definition}}

20: \newtheorem{lemma}{{Lemma}}

21: \newtheorem{corollary}{{Corollary}}

22: \newtheorem{proposition}{{Proposition}}

23: \newtheorem{conjecture}{{Conjecture}}

24:

25: \newcommand{\squishlist}{

26:    \begin{list}{$\bullet$}

27:     { \setlength{\itemsep}{0pt}      \setlength{\parsep}{3pt}

28:       \setlength{\topsep}{3pt}       \setlength{\partopsep}{0pt}

29:       \setlength{\leftmargin}{1.5em} \setlength{\labelwidth}{1em}

30:       \setlength{\labelsep}{0.5em} } }

31:

32: \newcommand{\squishlisttwo}{

33:    \begin{list}{$\bullet$}

34:     { \setlength{\itemsep}{0pt}    \setlength{\parsep}{0pt}

35:       \setlength{\topsep}{0pt}     \setlength{\partopsep}{0pt}

36:       \setlength{\leftmargin}{2em} \setlength{\labelwidth}{1.5em}

37:       \setlength{\labelsep}{0.5em} } }

38:

39: \newcommand{\squishend}{

40:     \end{list}  }

41:

42:

43: \begin{document}

44:

45: \title{Anonymizing Graphs}

46: \author{

47: Tom\'{a}s Feder\\Stanford University \and Shubha U. Nabar\\Stanford University \and Evimaria Terzi\\IBM Almaden

48:  }

49: \date{}

50:

51: \maketitle

52:

53: \begin{abstract}

54: Motivated by recently discovered privacy attacks on social networks, we study the problem of anonymizing the underlying graph of interactions in a social network. We call a graph \emph{\anonvar-anonymous} if for every node in the

55: graph there exist at least $k$ other nodes that share at least

56: $\ell$ of its neighbors. We consider two combinatorial problems

57: arising from this notion of anonymity in graphs. More

58: specifically, given an input graph we ask for the minimum number

59: of edges to be added so that the graph becomes

60: \emph{\anonvar-anonymous}. We define two variants of this

61: minimization problem and study their properties. We show that for

62: certain values of $k$ and $\ell$ the problems are polynomial-time solvable,

63: while for others they become NP-hard. Approximation algorithms for

64: the latter cases are also given.

65: \end{abstract}

66:

67: \pagenumbering{arabic}

68:

69: \newcommand{\DEF}[1]{{\em #1\/}}

70:

71: \newcommand\chic{\chi_c}

72: \newcommand\C{\hbox{${\cal C}$}}

73: \newcommand{\RR}{\mbox{$\mathbb R$}}

74: \newcommand{\NN}{\mbox{$\mathbb N$}}

75: \newcommand{\ZZ}{\mbox{$\mathbb Z$}}

76: \newcommand{\eopf}{\raisebox{0.8ex}{\framebox{}}}

77: \newcommand{\dist}{\hbox{\rm d}}

78: \renewcommand\a{\alpha}

79: \renewcommand\b{\beta}

80: \renewcommand\c{\gamma}

81: \renewcommand\d{\delta}

82: \newcommand\D{\Delta}

83: \newcommand{\directedchi}{\mbox{$\vec{\chi}$}}

84: \newcommand{\directedE}{\mbox{$\vec{E}$}}

85: \newcommand{\directedG}{\mbox{$\vec{G}$}}

86: \newcommand{\directedK}{\mbox{$\vec{K}$}}

87:

88:

89: \section{Introduction}\label{introduction}

90:

91: The popularity of online communities and social networks

92: in recent years has motivated research on social-network analysis.

93: Though these studies are useful in uncovering

94: the underpinnings of human social behavior, they also raise

95: privacy concerns for the individuals involved.

96:

97: A social network is usually represented as a graph, where nodes

98: correspond to individuals and edges capture relationships between

99: these individuals. For example, in LinkedIn, an online network of

100: professionals, every link between two users specifies a

101: professional relationship between them. In Facebook and Orkut

102: links correspond to friendships. There are online communities that

103: permit any user to access the information of every node in the graph

104: and view its neighbors. However, many communities are

105: increasingly restricting access to the personal information of other users. For example, in LinkedIn, a user can only see the profiles of his own friends and their connections.

106:

107: In this paper, we consider a scenario where the owner of a

108: social network would like to release the underlying

109: graph of interactions for social-network analysis purposes, while preserving the privacy of its users. More specifically,

110: the private information to be protected

111: is the mapping of nodes to real-world entities and interconnections amongst them.

112: Therefore, we design an anonymization framework that tries to hide the identity of nodes by creating groups of nodes that look similar by virtue of sharing many of the same neighbors. We call such nodes {\em anonymized}. Our goal is to anonymize all nodes of the graph by introducing minimal changes to the overall graph structure. In this way we can guarantee that the anonymized graph is still useful for social-network analysis purposes.

113:

Recently, Backstrom et al.~\cite{backstrom} have shown that the
simplest graph-anonymization technique, which removes the identity of
each node in the graph and replaces it with a random identification
number, is not adequate for preserving the privacy of
nodes. Specifically, they show that in such an anonymized network there exists an adversary
who can identify target individuals and the link structure between them. However, the problem of designing anonymization methods against such
adversaries is not addressed in~\cite{backstrom}.

121:

Following the work of~\cite{backstrom}, Hay et al.~\cite{miklau}
have very recently given a definition of graph anonymity: a graph
is $k$-anonymous if every node shares the same neighborhood
structure with at least $k-1$ other nodes. The definition is
recursive, and has some nice properties studied in~\cite{miklau}.
However, the focus of~\cite{miklau} is mostly on the properties of
the definitions rather than on algorithms to achieve the anonymity requirements.

129:

Motivated by~\cite{backstrom} and~\cite{miklau}, Zhou and
Pei~\cite{zhou08preserving} consider the following definition of
anonymity in graphs: a graph is $k$-anonymous if for every node
there exist at least $k-1$ other nodes that share isomorphic
$1$-neighborhoods. They consider the problem of minimum
graph modifications (in terms of edge additions) that would lead
to a graph satisfying the anonymity requirement. Although this
definition is interesting, the algorithm presented
in~\cite{zhou08preserving} is not supported by theoretical
analysis. Further, if the anonymity definition is extended to consider the neighborhood structure beyond just the immediate $1$-neighborhood of each node, algorithmic techniques quickly become infeasible.

140:

Despite the fact that privacy concerns in releasing social-network data have
been pinpointed, there is no agreement on the definition of
privacy or anonymity that should be used for such data. In this paper, we try to move
this line of research one step forward by proposing a new definition
of graph anonymity that is, to a certain extent, in line with
the definitions provided in~\cite{miklau}. Our definition of
anonymity is in a sense less strict than the one proposed
in~\cite{zhou08preserving}. However, we consider it to be natural,
intuitive and more amenable to theoretical analysis.

150:

151: Intuitively our definition aims to protect an individual from an adversary who knows some subset of the individual's neighbors in the graph. After anonymization, the hope is that the adversary can no longer identify the target individual because several other nodes in the graph will also share this subset of neighbors. Further, during anonymization, the identifying subset of neighbors themselves will become distorted and harder for the adversary to identify.

152:

153: \vspace{0.1in}

154: \noindent{\bf The Problem:} We define a graph to be

155: {\anonvar}-anonymous if for every node $u$ in the graph there

156: exist at least $k$ other nodes that share at least $\ell$ of their

157: neighbors with $u$. In order to meet this anonymity requirement one could transform

158: any graph into a complete graph. For a graph consisting of $n$

159: nodes this would mean that every node would share $n-2$ neighbors

160: with each of the $n-1$ other nodes. Although such an anonymization

161: would preserve privacy, it would make the anonymized graph useless

162: for any study. For this reason we impose the additional

163: requirement that the minimum number of such edge additions should

164: be made. The aim is to preserve the utility of the original graph,

165: while at the same time satisfying the {\anonvar}-anonymity

166: constraint.

167:

168: Given $k$ and $\ell$ we formally define two variants of the

169: \emph{graph-anonymization} problem that ask for the minimum number

170: of edge additions to be made so that the resulting graph is

171: {\anonvar}-anonymous. We show that for certain values of $k$ and

172: $\ell$ the problems are polynomial-time solvable, while for others

173: they are NP-hard. We also present simple and intuitive

174: approximation algorithms for these hard instances. To summarize our contributions:

175:

176: \squishlist

177: \item We propose a new definition of graph anonymity building on

178: previously proposed definitions.

179: \item We provide the first formal

180: algorithmic treatment of the graph-anonymization problem.

181: \squishend

182:

183: \vspace{0.15in}

184: Besides graph anonymization, the combinatorial problems we study

185: here may also arise in other domains, e.g., graph reliability.

186: We therefore believe that the problem definitions and

187: algorithms we present are of independent interest.

188:

189: \vspace{0.1in}

190: \noindent{\bf Roadmap:} The rest of the paper is organized as

191: follows. In Section~\ref{sec:related} we summarize the related

192: work. Section~\ref{sec:definitions} gives the necessary notation

193: and definitions. Algorithms and hardness results for different

194: instances of the {\anonvar}-anonymization problem are given in

195: Sections~\ref{2_1},~\ref{6_1to7_1},~\ref{K_1} and~\ref{K_L}. We

196: conclude in Section~\ref{sec:conclusions}.

197:

198:

199: \section{Related Work}\label{sec:related}

200:

As mentioned in the Introduction, there has been some prior work
on privacy-preserving releases of social-network graphs. The authors
of~\cite{backstrom} show that the naive approach of simply masking
usernames is not sufficient anonymization. In particular, they
show that if an adversary is given the chance to create as few as $\Theta(\log(n))$
new accounts in the network prior to its release, then he can efficiently recover the
structure of connections between any $\Theta(\log^2 (n))$ nodes
chosen a priori. He can do so by identifying the new accounts that he inserted
into the network. The focus of~\cite{backstrom} is on revealing the
power of such adversaries and not on devising methods to protect against them.

211:

In~\cite{miklau} the authors experimentally evaluate
how much background information about the neighborhood of an individual would be sufficient for an adversary to uniquely identify that individual in a naively anonymized graph. Additionally, a new recursive definition of graph anonymity is given. The definition says that a graph is
$k$-anonymous if for every structure query there exist $k$ nodes
that satisfy it. The definition
is constructed for a certain class of structure queries
that query the neighborhood structure of the nodes.
Our definition of anonymity is inspired by~\cite{miklau}, but
it is substantially different. Moreover, the focus of our work is
on the combinatorial problems arising from our anonymity
definition.

222:

Very recently, the authors of~\cite{zhou08preserving} consider yet
another definition of graph anonymity: a graph is $k$-anonymous if
for every node there exist at least $k-1$ other nodes that share
isomorphic $1$-neighborhoods. This definition of anonymity in
graphs is different from ours; in a sense it is stricter.
Moreover, though the algorithm presented
in~\cite{zhou08preserving} seems to work well in practice, no
theoretical analysis of its performance is presented. Finally,
extending the privacy definition to more than just the
$1$-neighborhood of nodes causes the algorithms
of~\cite{zhou08preserving} to quickly become infeasible.

234:

235: The problem of protecting sensitive links between individuals in

236: an anonymized social network is considered in~\cite{Zheleva:07}.

237: Simple edge-deletion and node-merging algorithms are proposed to

238: reduce the risk of sensitive link disclosure. This work is

239: different from ours in that we are primarily interested in

240: protecting the identity of the individuals while

241: in~\cite{Zheleva:07} the emphasis is on protecting the types of links

242: associated with individuals. Also, the combinatorial problems that

243: we need to solve in our framework are very different from the set

244: of problems discussed in~\cite{Zheleva:07}.

245:

In~\cite{Frikken:06} the authors study the problem of privately assembling
pieces of a graph owned by different parties. They
propose a set of cryptographic protocols that allow a group of
authorities to jointly reconstruct a graph without revealing the
identity of the nodes. The graph thus constructed is isomorphic to
a perturbed version of the original graph. The perturbation
consists of addition and/or deletion of nodes and/or edges. Unlike
that work, we try to anonymize a single graph by modifying it as
little as possible. Moreover, our methods are purely combinatorial
and no cryptographic protocols are involved.

256:

Korolova et al.~\cite{korolova} investigate an attack where an
adversary strategically subverts user accounts. He then uses the
online interface provided by the social network to gain access to
local neighborhoods and to piece them together to form a global picture. The authors provide
recommendations on what the lookahead of a social network should
be to render such attacks infeasible. This work does not
consider an anonymized release of the entire network graph and is thus different from ours.

264:

265: Besides graphs, there has been considerable prior work on

266: anonymizing traditional relational data sets. The line of work on

267: $k$-anonymity found in

268: \cite{aggarwal,gehrke,lefevre,meyerson,sweeney,tclose} aims to

269: minimally suppress or generalize public attributes of individuals

270: in a database in such a way that every individual (identifiable by

271: his public attributes) is hidden in a group of size at least $k$.

272: Our notion of graph anonymity draws inspiration from this.

273:

274: Apart from suppression or generalization techniques, perturbation

275: techniques have also been used to anonymize relational data sets

276: in \cite{dilys,haritsa,srikant}. Perturbation-based approaches for

277: graph anonymization are also considered in~\cite{miklau,xintao}; in that

278: case edges are randomly inserted or deleted to anonymize the graph. We do not consider perturbation-based approaches in this paper.

279:

280: \section{Preliminaries}\label{sec:definitions}

281:

282: In this section we formalize our definition of graph

283: anonymity and introduce two natural optimization problems

284: that arise from it.

285:

286: Throughout the paper we assume that the social-network graph is

287: simple, \textit{i.e.}, it is undirected, unweighted, and contains

288: no self-loops or multi-edges. This is an important category of

289: graphs to study; most of the aforementioned social

290: networks (Facebook, LinkedIn, Orkut) allow only bidirectional

291: links and are thus instances of such simple graphs. We assume that

292: the actual identifiers of individual nodes are removed prior to

293: further anonymization. Our definition for graph anonymity is

294: inspired by the notion of $k$-anonymity for relational data

295: wherein each person, identifiable by his public attributes, is

296: required to be hidden in a group of size $k$. In the case of a

297: social-network graph, the publicly-known attributes of a user

298: would be (a subset of) his connections (and interconnections

299: amongst them) within the graph.

300:

301: Consider a simple unlabelled graph and an adversary who knows that

302: a target individual and some number of his friends form a clique.

303: In the released graph, the adversary could look for such cliques

304: to narrow down the set of nodes that might correspond to the

305: target individual. The goal of an anonymization scheme is to

306: prevent such an adversary from uniquely identifying the individual

307: and his remaining connections in the anonymized graph.

308:

309: We achieve this by introducing an anonymity property that requires

310: that for every node in the graph, some subset of its neighbors

311: should be shared by other nodes. In this way, an adversary

312: who knows some subset of the neighbors of a target individual and can even pinpoint them in the graph, will not be able to distinguish the target individual from other nodes in the network that share this subset of neighbors. Further, in the

313: process of anonymization, the identifying subset of neighbors

314: itself becomes distorted and harder for the adversary to pinpoint.

315: More formally we define the {\anonvar}-anonymity property as

316: follows.

317:

318: \begin{definition}[\anonvar-anonymity]

319: A graph $G=(V,E)$ is {\em \anonvar-anonymous} if for each vertex

320: $v\in V$, there exists a set of vertices $U\subseteq V$ not

321: containing $v$ such that $|U|\geq k$ and for each $u\in U$ the

322: vertices $u$ and $v$ share at least \secondvar\ neighbors.

323: \end{definition}

324:

325: \begin{example}

326: A clique of $n$ nodes is $(n-1,n-2)$-anonymous.

327: \end{example}
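
As a sanity check, the clique example can be verified directly; the notation $N(v)$ for the set of neighbors of $v$ is introduced here only for this check. In a clique on a vertex set $V$ with $|V|=n$, every vertex $v$ has $N(v)=V\setminus\{v\}$, so for any two distinct vertices $u$ and $v$
\[
|N(u)\cap N(v)| \;=\; |V\setminus\{u,v\}| \;=\; n-2 .
\]
Each vertex $v$ has $n-1$ choices of $u$, so the set $U=V\setminus\{v\}$ witnesses $(n-1,n-2)$-anonymity. The same set $U$ also shows that the clique is $(k,\ell)$-anonymous for every $k\leq n-1$ and $\ell\leq n-2$.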

328:

329: To demonstrate the kinds of attacks we hope to protect against, we give another example.

330:

331: \begin{figure}

332: \begin{center}

333: \subfigure[Input graph $G$]{\includegraphics[scale=0.75,angle=270]{41a}\label{41examplea}}

334: \subfigure[(4,1)-anonymous transformation of $G$]{\includegraphics[scale=0.75,angle=270]{41b}

335: \label{41exampleb}}

336: \caption{In Figure~\ref{41examplea} an adversary can identify Alice as the node marked X. Figure~\ref{41exampleb} is a (4,1)-anonymous transformation of the graph.}\vspace{-0.15in}

337: \end{center}

338: \end{figure}

339:

340: \begin{example}

341: Consider the graph in Figure~\ref{41examplea}. Suppose an adversary knows that Alice is in this graph and that Alice is connected to a friend who is part of a triangle. There is only one such node in the graph and hence the adversary will be able to determine that the node marked X in the graph uniquely corresponds to Alice. From this he may be able to further infer the identities of Alice's neighbors and their neighbors as well. Now if the edges shown in dotted lines in Figure~\ref{41exampleb} are added to this graph, the resulting graph is $(4,1)$-anonymous. In this new graph, Alice is no longer the only node connected to a node of a triangle. Further, there is no longer only one triangle in the graph.

342: \end{example}

343:

Given an input graph $G=(V,E)$ with $n$ nodes, and integers $k$ and
$\ell$, our goal is thus to transform the graph into a
{\anonvar}-anonymous graph. We focus on transformations that
allow only additions of edges to the original graph.
In order for the anonymized graph to remain useful for social-network (or
other) studies, we need to ensure that the transformed graph is as
close as possible to the original graph. We achieve this by
requiring that a minimum number of edges should be added to $G$ so that
the $(k,\ell)$-anonymity property holds. This leads us to the
following two variants of the {\anonvar}-anonymization problem.

354:

355: \begin{problem}[Weak \anonvar-anonymization]\label{problem:weak}

356: Given a graph $G=(V,E)$ and integers $k$ and $\ell$, find the

357: minimum number of edges that need to be added to $E$, to obtain a

358: graph $G' = (V, E')$ that is \anonvar-anonymous.

359: \end{problem}

360:

361: The following example illustrates the

362: weak-anonymization problem.

363:

364: \begin{figure}

365: \begin{center}

\subfigure[Input graph $G$] {\includegraphics[scale =
0.45]{example_input}\label{fig:input_graph}} \hspace{1cm}
\subfigure[Weakly $(k-1,1)$-anonymized graph $G'$]
{\includegraphics[scale =
0.45]{example}\label{fig:weak_anonymized}}\caption{Illustrative
example of the difference between \emph{weak} and \emph{strong}
anonymity.}\label{figure:example}

373: \end{center}

374: \end{figure}

375:

376: \begin{example}\label{example:weak}

377: Consider the input graph $G$ of Figure~\ref{fig:input_graph}. The

378: graph consists of a clique of size $k$ and $2$ nodes $x$ and $y$

379: connected by an edge. The nodes in the clique are all $(k-1,

380: k-2)$-anonymous. However, the existence of $x$ and $y$ prevents

381: $G$ from being fully $(k-1, 1)$-anonymous.

382:

383: Assume now that we connect both $x$ and $y$ to a single node $u$ of the clique.

384: In this way, we construct graph $G'$ shown in

385: Figure~\ref{fig:weak_anonymized}. Obviously, $G'$ is

386: $(k-1,1)$-anonymous; all the nodes in $G'$ (including $x$ and $y$) have

387: $k-1$ other nodes that share at least one of their neighbors. For

388: $x$ and $y$, this neighbor is node $u$.

389: \end{example}

390:

The problem in the above example is that, although graph $G'$ satisfies the
$(k-1,1)$-anonymity requirement, the anonymity of nodes
$x$ and $y$ is achieved via node $u$, which was not a part of their
initial set of neighbors in $G$. Thus, the goal of having many
other nodes sharing the original neighborhood structure of $x$ or
$y$ is not necessarily achieved unless we place additional
requirements on the anonymization procedure. To this end we
introduce the problem of \emph{strong anonymization}. Strong
anonymity places additional restrictions on how anonymity can be achieved
and provides better privacy.

401:

402: \begin{definition}[Strong

403: \anonvar-transformation]\label{dfn:strong_anonymity} Consider

404: graphs $G=(V,E)$ and $G'=(V,E')$, so that $E\subseteq E'$ and $G'$

405: is {\anonvar}-anonymous. For fixed $k$ and $\ell$, we say that

406: $G'$ is a \emph{strongly-anonymized transformation} of $G$, if for

407: every vertex $v\in V$, there exists a set of vertices $U\subseteq

408: V$ not containing $v$ such that $|U|\geq k$ and for each $u\in U$,

409: $|N_G(v) \cap N_{G'}(u)|\geq$ \secondvar. Here $N_G(v)$ is the set

410: of neighbors of $v$ in $G$, and $N_{G'}(u)$ is the set of

411: neighbors of $u$ in $G'$.

412: \end{definition}

413:

414: Therefore, if a graph $G'$ is a strong {\anonvar}-transformation

415: of graph $G$, then each vertex in $G'$ is required to have $k$

416: other vertices sharing at least \secondvar\ of its {\em original}

417: neighbors in $G$. For this to be possible, every vertex must have at least $\ell$ neighbors in the original graph $G$ to begin with.

418:

419: \begin{example}\label{example:strong}

420: Consider again the graph $G$ of Figure~\ref{fig:input_graph} and

421: its transformation to graph $G'$ shown in

422: Figure~\ref{fig:weak_anonymized}. In Example~\ref{example:weak} we

423: showed that graph $G'$ is $(k-1,1)$-anonymous in the weak sense.

424: However, in order to get a strong $(k-1,1)$-transformation

425: of $G$, we would have to connect each of the nodes $x$ and $y$ to $k-1$ other nodes from the clique.

426: \end{example}
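
To make the gap between the two notions concrete, the following count is a sketch under the assumption $k\geq 3$; the symbols $c_{\mathrm{weak}}$ and $c_{\mathrm{strong}}$ for the two optimal costs on this particular graph are introduced only for this illustration. In the weak case, the two edges of Example~\ref{example:weak} suffice, and a single added edge does not: if only $x$ is joined to a clique node $u$, then $u$ is the only vertex other than $y$ adjacent to $x$, so $y$ shares a neighbor with just one vertex, which is fewer than $k-1$. In the strong case, $N_G(x)=\{y\}$, so $y$ must be adjacent in $G'$ to at least $k-1$ vertices other than $x$, which forces $k-1$ new edges at $y$. The same argument forces $k-1$ new edges at $x$, and no new edge is counted twice because $xy$ is already in $E$. Connecting each of $x$ and $y$ to $k-1$ clique nodes meets this bound, hence
\[
c_{\mathrm{weak}} \;=\; 2, \qquad c_{\mathrm{strong}} \;=\; 2(k-1).
\]
For instance, with $k=3$ the weak variant needs $2$ added edges on this graph, while the strong variant needs $4$.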

427:

428: The definition of a strong {\anonvar}-transformation gives rise to

429: the following \emph{strong {\anonvar}-anonymization} problem.

430:

431: \begin{problem}[Strong \anonvar-anonymization]\label{problem:strong}

432: Given a graph $G=(V,E)$ and integers $k$ and $\ell$, find the

433: minimum number of edges that need to be added to $E$, to obtain

434: graph $G' = (V, E')$ that is a strong \anonvar-transformation of

435: $G$.

436: \end{problem}

437:

Achieving strong anonymity requires the addition of
at least as many edges as achieving weak anonymity. This statement is formalized as follows.

440:

441: \begin{proposition}

442: Consider input graph $G=(V,E)$ and integers $k$ and $\ell$. Let

443: $G'=(V,E')$ be the $(k,\ell)$-anonymous graph that is the optimal

444: solution for Problem~\ref{problem:weak}, and $G''=(V,E'')$ be the

445: $(k,\ell)$-anonymous graph that is the optimal solution for

446: Problem~\ref{problem:strong}. Then it holds that $|E''|\geq |E'|$.

447: \end{proposition}

448:

449: The notion of {\anonvar}-anonymity is strongly related to the

450: immediate neighbors of a node in the graph, and how these are

451: shared with other nodes. Therefore, for every node $u$ it is

452: important to know the nodes that are reachable from $u$ via a path

453: of length exactly $2$. Given its importance, we define the notion

454: of $2$-neighborhood of a node as follows.

455:

456: \begin{definition}[$2$-neighborhood]

457: Given a graph $G=(V,E)$ and a node $v\in V$ we define the $2$-neighborhood of $v$ to be the set of all nodes in $G$ that are

458: reachable from $v$ via paths of length exactly $2$.

459: \end{definition}

460:

461: We also define two more terms that will be used in the rest of the paper.

462:

463: \begin{definition}[Residual Anonymity]\label{dfn:residual}

Consider a graph $G=(V,E)$ that we would like to make
$(k,\ell)$-anonymous. Consider any node $v \in V$ and suppose that
$k'$ other nodes in the graph share at least $\ell$ of $v$'s
neighbors. Then, we define the residual anonymity of $v$ to be
$r(v) = \max\{k-k', 0\}$. The residual anonymity of a graph
$G=(V,E)$ is defined to be $r(G) = \sum_{v \in V} r(v)$.

470: \end{definition}
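
As a small illustration of Definition~\ref{dfn:residual}, consider a graph that is a single path $uvw$ and take $(k,\ell)=(2,1)$; this instance is chosen here only as an example. The neighbor sets are $\{v\}$ for $u$, $\{u,w\}$ for $v$, and $\{v\}$ for $w$. Nodes $u$ and $w$ share the neighbor $v$ with each other, so $k'=1$ for each of them, while $v$ shares no neighbor with any other node, so $k'=0$ for $v$. Therefore
\[
r(u) = \max\{2-1,0\} = 1, \quad r(v) = \max\{2-0,0\} = 2, \quad r(w) = \max\{2-1,0\} = 1,
\]
and the residual anonymity of the graph is $r(G) = 1+2+1 = 4$.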

471:

472: We define the concept of a deficient node for nodes that are not $(k,\ell)$-anonymous.

473:

474: \begin{definition}[Deficient Node]

475: A node $v$ is deficient if $r(v) > 0$.

476: \end{definition}

477:

478: It is the deficient nodes that we need to take care of in order to

479: anonymize a graph. With these definitions in hand, we are now ready

480: to proceed to the technical results of the paper.

481:

482:

483: \section{$(2,1)$-anonymization}\label{2_1}

484:

485: In this section we provide polynomial-time algorithms for the weak

486: and strong $(2,1)$-anonymization problems. First, it is easy to

487: see that there is a simple characterization of $(2,1)$-anonymous

488: graphs. This fact is captured in the following proposition.

489:

490: \begin{proposition}\label{prop:characterization}

491: A graph $G=(V,E)$ is $(2,1)$-anonymous if and only if each vertex

492: $u\in V$ is (a) part of a triangle, (b) adjacent to a vertex of

493: degree at least 3, or (c) is the middle vertex in a path of 5 vertices.

494: \end{proposition}
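
A brief justification of Proposition~\ref{prop:characterization} can be sketched as follows; this is an informal argument included for illustration, and the vertex names are chosen only for this sketch. For sufficiency: (a) if $u$ lies in a triangle $uab$, then $a$ shares the neighbor $b$ with $u$ and $b$ shares the neighbor $a$ with $u$; (b) if $u$ is adjacent to a vertex $w$ of degree at least $3$, then two neighbors of $w$ other than $u$ each share $w$ with $u$; (c) if $u$ is the middle vertex of a path $a\,b\,u\,c\,d$, then $a$ shares $b$ with $u$ and $d$ shares $c$ with $u$. For necessity, suppose two distinct vertices $x_1,x_2\neq u$ share neighbors $w_1$ and $w_2$ with $u$, respectively. Then one of the following holds:
\[
\begin{array}{ll}
w_1 = w_2: & w_1 \mbox{ is adjacent to } u, x_1, x_2, \mbox{ so condition (b) holds;}\\
x_1 = w_2 \mbox{ or } x_2 = w_1: & u, w_1, w_2 \mbox{ are pairwise adjacent, so condition (a) holds;}\\
\mbox{otherwise:} & x_1\,w_1\,u\,w_2\,x_2 \mbox{ is a path of 5 distinct vertices, so condition (c) holds.}
\end{array}
\]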

495:

The main idea of the algorithms that we develop for
$(2,1)$-anonymization is that they add the minimum number of edges
so that every vertex of the resulting graph satisfies one of the
conditions of Proposition~\ref{prop:characterization}. Both
algorithms proceed in two phases: the \emph{deficit-assignment}
phase and the \emph{deficit-matching} phase. The deficit assignment
requires a linear scan of the graph in which deficits are assigned
to vertices. Roughly speaking, a deficit of $1$ signifies that the vertex needs to be
connected to another vertex of non-zero deficit by the addition of an extra edge.
This added edge ensures that the $(2,1)$-anonymity requirement for the vertex or its
neighbors will be satisfied. Once the deficits are assigned to vertices,
the algorithms proceed to the actual addition of edges. The edges
are added by taking into account the deficits of all vertices. For
example, two vertices both of deficit $1$ can be connected by the addition of a
single edge (if they are not already neighbors and are not
isolated). In this way, a single edge accommodates a total
deficit of $2$. The minimum number of edges to be added can be found via a matching of the vertices with deficits. The matching consists of edges that are not already in the graph. A perfect matching is a matching that satisfies all the deficits. In the case of weak anonymization, this matching can be found in linear time by randomly pairing up non-adjacent vertices with deficits. For strong anonymization, it needs to be explicitly computed by solving the maximum-matching problem over edges that are not already in the graph.

513:

514: Another key point in the development of our algorithms is that in

515: order to assign deficits it suffices to explore only vertices that

516: are within a distance $4$ from some leaf vertex or from a vertex

517: of degree $2$. Any other vertex can be shown to satisfy the conditions of Proposition~\ref{prop:characterization}.

518: Finally, it only requires a case analysis to show that our

519: algorithms optimally assign deficits to vertices,

520: independently of the order in which they traverse the vertices of

521: the input graph during the first phase. For lack of space we only

522: give a sketch of the algorithms and proofs in this section.

523:

\subsection{Linear-time weak $(2,1)$-anonymization}
As we have already mentioned, our algorithm for the weak
$(2,1)$-anonymization problem has two phases: (1) deficit
assignment and (2) deficit matching.\footnote{Recall that a node $u$ is assigned deficit $i$ if $i$ edges need to be added between other non-zero deficit vertices and $u$ in order to satisfy the anonymity requirements of $u$ or $u$'s neighbors.}

528:

529: \vspace{0.1in}

530: \noindent{\bf Deficit Assignment:} First assume that the input graph has no

531: isolated vertices -- we will show how to deal with isolated

532: vertices later. For the deficit-assignment phase, the algorithm

533: starts with an {\em unmarked} vertex of

534: degree $1$ or $2$ and explores vertices within a distance $4$ of it. Deficits are assigned as follows:

535:

536:

537: \begin{itemize}

538: \item For an isolated edge $uv$, we assign deficit $1$ to $u$ and

539: deficit $1$ to $v$; it may be that both edges will be added at $u$.

540:

541: \item For an isolated path $uvw$, we assign deficit $1$ to $v$.

542:

543: \item For an isolated path $uvwx$, we assign deficit $1$ to $v$ and

544: deficit $1$ to $w$.

545:

546: \item For a subgraph consisting of a path $uvw$ with adjacent

547: vertices attached to $w$, we assign deficit $1$ to $v$.

548:

549: \item For a component consisting of a vertex $v$ adjacent to a

550: vertex $u$ of degree $1$ and to a set of vertices $X_i$ such that

551: each $x\in X_i$ has degree $1$ (and to no other vertices), we assign

552: deficit $1$ to $v$. This component corresponds to an isolated star

553: centered at $v$.

554:

555: \item For a component consisting of a square $uvwx$ (isolated

556: square), we assign deficit $1$ to $u$ and deficit $1$ to $w$; it may

557: be that the two edges will be added at $u$ and $v$, or that $u$

558: and $w$ will be joined.

559:

560: \item For a subgraph consisting of a square $uvwx$ with edges (one

561: or more) $ux_i$ coming out of the square, we assign deficit $1$ to

562: $v$.

563:

564: \item For a subgraph consisting of squares $uv_1wx_1$,

565: $uv_2wx_2$, $\ldots$, $uv_jwx_j$, we assign deficit

566: $1$ to one of the $v_i$'s.

567:

568: \item Finally, for a subgraph consisting of a vertex $u$ adjacent

569: to vertices $x_i$ of degree $1$ and to a vertex $y$ of degree $2$,

570: assign deficit $1$ to $y$.

571:

572: \end{itemize}

573:

574: All the vertices that are visited in this process are {\em marked}

575: (that is, the assigned deficits cover all marked vertices) and the

576: deficit-assignment process repeats starting with the next

577: unmarked vertex until no more unmarked

578: vertices of degree $1$ or $2$ remain.

579:

580: \vspace{0.1in}

581: \noindent{\bf Deficit Matching:} If the

582: number of vertices with deficit $1$ is $2m$, and $2m\geq 4$ or

583: $2m=2$ -- in some case other than an isolated edge $uv$ -- then, we

584: need to find any perfect matching amongst these

585: vertices to find the edges to add.

586: The matching of deficits can be done in linear time since any

587: (random) pairing of non-adjacent vertices with non-zero deficits suffices. In this case we add $m$ extra edges.

588: If the number of vertices with deficit $1$ is $2m+1$, then all but one of

589: these vertices can be matched, and a single edge needs to be added

590: to the remaining vertex, connecting it to some vertex of degree

591: at least $2$. This results in a total of $m+1$ extra edges.

592: There are, however, some special cases that we need to take care of first.

593:
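
\noindent As a small worked illustration of this count (the symbol $D$ is our own shorthand and is not used elsewhere), let $D$ denote the number of vertices with deficit $1$ outside the special cases treated next. The two cases above then combine to
\[
\text{added edges} \;=\;
\begin{cases}
m & \text{if } D = 2m,\\
m+1 & \text{if } D = 2m+1,
\end{cases}
\;=\; \lceil D/2 \rceil .
\]
For instance, $D=7$ gives three matched pairs plus one edge from the leftover vertex to a vertex of degree at least $2$, for a total of $4=\lceil 7/2\rceil$ added edges.
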

594: \vspace{0.1in}

595: \noindent{\bf Special Cases:} Before finding the perfect matching we match all

596: isolated edges to each other. This is because the isolated edges

597: need to be connected in a special way to take care of the deficits

598: at the two ends. For a pair of isolated edges $uv$ and $u'v'$, we add

599: the edges $uu'$ and $vu'$ (we treat the two deficits of $1$ at $u$

600: and $v$ as being concentrated at $u$). In the end we may be left

601: with a single isolated edge $uv$. In this case, two edges need to be added

602: and we can connect them to any other vertex in the graph

603: forming a triangle.  Similarly, in the case where

604: the remainder is an isolated star centered at $v$ with vertices

605: $x_i$ of degree one, it is enough to add a single edge to

606: connect vertices $x_j$ and $x_{j'}$ of the star.

607:

608: \vspace{0.1in}

609: \noindent{\bf Isolated Vertices:} It remains to take care of isolated

610: vertices. For this we consider a set of six isolated vertices

611: $u,v,w,u',v',w'$  and we connect them with edges $uv, uw, uu',$

612: $u'v', u'w'$. These five edges can take care of the six isolated

613: vertices. In general, the vertices with deficit $1$ can be attached

614: to isolated vertices first, with two exceptions to be considered

615: next. When we have an isolated edge $xy$, one of the two deficits

616: of 1 can be satisfied by connecting $x$ to an isolated vertex, but

617: the other one can also be satisfied by connecting $x$ to an

618: isolated vertex $u$ if $u$ is also made adjacent to two other

619: isolated vertices $v$ and $w$ to obtain the above mentioned

620: component. Similarly if $x$ is only adjacent to vertices $y_i$ of

621: degree $1$, then the deficit $1$ at $x$ can only be matched to an

622: isolated $u$ if $u$ is also made adjacent to two other isolated

623: vertices $v$ and $w$. In the end we will be left with fewer than

624: six isolated vertices which each need one edge. These can be

625: connected to any vertex in the graph of degree at least $2$. The

626: optimality follows because the five-edge tree on six vertices described above achieves the optimal saving.

627:
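
\noindent The six-vertex construction can be checked directly. The following sketch assumes the definition used throughout, namely that a vertex is $(2,1)$-anonymous when at least $2$ other vertices share a neighbor with it; the table is only an illustration.
\begin{center}
\begin{tabular}{lll}
vertex & neighbors & vertices sharing a neighbor\\ \hline
$u$ & $v,w,u'$ & $v',w'$ (via $u'$)\\
$u'$ & $u,v',w'$ & $v,w$ (via $u$)\\
$v$ & $u$ & $w,u'$ (via $u$)\\
$w$ & $u$ & $v,u'$ (via $u$)\\
$v'$ & $u'$ & $w',u$ (via $u'$)\\
$w'$ & $u'$ & $v',u$ (via $u'$)
\end{tabular}
\end{center}
Every vertex has two such companions, so the five added edges serve all six vertices, that is, fewer than one added edge per isolated vertex.
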

628: \begin{theorem}

629: The above algorithm solves optimally the weak

630: $(2,1)$-anonymization problem in linear time.

631: \end{theorem}

632:

633: {\em Proof Sketch:} It requires a case analysis (that we omit for lack of space) to show

634: that the deficit-assignment scheme we described above is complete and optimal

635: and that the total deficit assigned is independent of the order in

636: which the vertices of the graph are traversed. Since we find a

637: perfect matching, we satisfy these deficits with as few edges as

638: possible, hence, the optimality of the algorithm.

639:

640: It is also easy to see that the deficit-assignment takes time

641: linear with respect to the number of edges in the graph: first we

642: only consider vertices of degree one or two as starting points. For every such vertex we only have to explore all

643: vertices within a distance $4$. This is because any other vertex can be seen to satisfy one of the conditions of Proposition~\ref{prop:characterization}. After each iteration of the deficit assignment, we

644: mark all the vertices that have been visited in this process

645: (that is, the assigned deficits cover all visited

646: vertices). The deficit-assignment process continues starting

647: with the next unmarked vertex of degree $1$ or $2$. The scanning of

648: the algorithm requires only linear time with respect to the number

649: of edges in the graph since every traversed edge connects only

650: marked endpoints and thus no edge needs to be traversed more than once

651: by the algorithm.

652:

653: The deficit-matching phase is also linear since it only requires

654: finding any (random) matching between non-adjacent deficits.

655:

656: \subsection{Polynomial-time strong $(2,1)$-anonymization}

657:

658: The algorithm for solving the strong $(2,1)$-anonymization problem

659: is very similar to the one presented in the previous section, so

660: we only briefly discuss it here. For brevity we avoid mentioning

661: various special cases that are similar to the weak-anonymization

662: problem. The first key difference is that for strong

663: $(2,1)$-anonymization we need to develop a different

664: deficit-assignment scheme. Although the actual structures we have

665: to consider for assigning the deficits are the same we need to

666: assign different deficits to different vertices so that we satisfy

667: the strong anonymity requirement. This is because an edge added at

668: a vertex with assigned deficit can only help the original

669: neighbors of the vertex, and not the vertex itself. The second

670: difference is that in the deficit-matching phase we need to

671: actually solve a maximum-matching problem; not every random pairing of non-adjacent vertices with assigned deficit is a valid solution.

672:

673: In strong $(2,1)$-anonymization we first have to assume that there

674: are no isolated vertices in the input graph $G$; otherwise strong

675: $(2,1)$-anonymity is not achievable for these vertices.

676:

677: \vspace{0.1in}

678: \noindent{\bf Deficit Assignment:} For the deficit-assignment step, the

679: algorithm starts with an unmarked vertex in the input graph

680: with degree $1$ or $2$ and assigns deficits as follows:

681:

682: \begin{itemize}

683: \item For an isolated edge $uv$, assign deficit of $2$ at each end.

684:

685: \item For an isolated path $uvwx$, put deficit $1$ at $v$ and at

686: $w$.

687:

688: \item For an isolated square $uvwx$, put deficit $1$ at $u$ and

689: $v$.

690:

691: \item If such a square has edges already coming out of $v$, put

692: just deficit $1$ at $u$.

693:

694:

695: \item If multiple squares $uv_iwx_i$ all start from vertex $u$,

696: then assign deficit $1$ to one of the $v_i$'s.

697:

698:

699: \item For a path $uvw$, put deficit $1$ at each of the $3$ vertices.

700:

701: \item For a vertex of degree at least $3$ attached to vertices of

702: degree $1$, put two deficits of $1$ at degree $1$ vertices.

703:

704: \item If a path starts $uvwx$, with $x$ of degree at least $2$, put

705: deficit $1$ at $v$ and $1$ at $w$.

706:

707:

708: \item If in addition $w$ has other edges coming out of it, put

709: deficit $1$ just at $v$. Otherwise if in addition only $v$ has other

710: edges coming out of it that join to a vertex of degree $1$, put

711: deficit $1$ just at $w$.

712:

713: \end{itemize}

714:

715: All vertices that are visited in the process are marked, and the

716: algorithm proceeds with the next unmarked vertex until there are

717: no unmarked vertices left.

718:

719: \vspace{0.1in}

720: \noindent{\bf Deficit Matching:} For solving the strong $(2,1)$-anonymization

721: problem exactly we need to solve a maximum-matching

722: problem between the nodes with deficits. This can be done in polynomial

723: time~\cite{papadimitrioucombinatorial}. Note that in the weak

724: $(2,1)$-anonymization problem \emph{any} random pairing of

725: non-adjacent nodes with deficits was sufficient, allowing for a linear-time matching phase.

726: This was because, with the exception of isolated edges and isolated paths on $4$ vertices, there was no case in which two vertices of non-zero deficit could be adjacent. This is not the case in the strong anonymization problem, and here a maximum-matching problem needs to be solved over edges that are not already in the graph.

727:

728: A linear-time deficit-matching algorithm with a small

729: additive error can also be developed. This is summarized in the

730: following theorem.

731:

732: \begin{theorem}

733: The strong $(2,1)$-anonymization problem can be approximated in

734: linear time within an additive error of 2, and can be solved

735: exactly in polynomial time.

736: \end{theorem}

737:

738: {\em Proof Sketch:} It requires again a case analysis to

739: show that the deficit-assignment scheme is optimal and independent

740: of the order in which we traverse the vertices.

741:

742: Now, if all deficits add up to $m$, they can easily be paired

743: using a greedy linear-time matching algorithm. However,

744: the last $2$ deficits may be assigned to adjacent

745: vertices. So instead of adding $\lceil m/2\rceil$ edges, we may

746: add $\lceil m/2\rceil+2$, for an additive error of 2. If instead

747: we use a maximum-matching algorithm to match as many deficits

748: as possible and satisfy the unmatched deficits individually, the

749: problem can be solved optimally in polynomial time.

750:

751:

752: \vspace{-0.15in}

753: \section{From $(6,1)$ to $(7,1)$-anonymity}\label{6_1to7_1}

754: We show here that given a graph that is already $(6,1)$-anonymous,

755: it is NP-hard to find the minimal number of edges that need to be

756: added to make it either weakly or strongly $(7,1)$-anonymous. This result provides insight into the complexity of the anonymization problem, showing that it is hard to achieve anonymity even incrementally. The

757: result follows from a reduction from the {\em 1-in-3

758: satisfiability} problem. An instance of 1-in-3 satisfiability

759: consists of triples of Boolean variables $(x,y,z)$ to be assigned

760: values 0 or 1 in such a way that each triple contains one 1 and

761: two 0s. This problem was shown to be NP-complete by

762: Schaefer~\cite{schaefer78complexity}. We first show that even a

763: restricted form of the 1-in-3 satisfiability problem is

764: NP-complete.

765:

766: \begin{lemma}\label{lemma1}

767: The 1-in-3 satisfiability problem is NP-complete even if each variable

768: occurs in exactly 3 triples, no two triples share more than one variable, and

769: the total number of triples is even.

770: \end{lemma}

771:

772: \begin{proof}

773: We prove this by taking an arbitrary instance of the 1-in-3 satisfiability problem and converting it to an instance satisfying the constraints of the above lemma. We start off by renaming multiple occurrences of a variable $x$ as $x_1$, $x_2$, and so on, so that by the end, each variable occurs in at most 1 triple and no two triples share more than one variable. We can then enforce the condition that each $x_i$ be equal to $x_{i+1}$ by inserting the triples

774: $(x_i,u,v)$, $(x_{i+1},u',v')$, $(u,u',w)$ and $(v,v',w)$.

775: This guarantees at most 3 occurrences of

776: each variable in triples. If a variable $y$ occurs in 2 triples, we may include

777: a triple $(y,z,t)$ introducing two new variables, so that at the end of this process each variable occurs in either

778: 1 or 3 triples. Finally we make nine copies of the entire instance, each labeled $(i,j)$ with

779: $1\leq i,j\leq 3$, and equate the $z$s that have the same $i$ and also equate

780: the $t$s that have the same $j$. This guarantees that each variable appears

781: in exactly 3 triples. Making two copies of this instance guarantees that the

782: number of triples is even.

783: \end{proof}
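
\noindent The equality gadget in the proof above can be verified by a short case analysis. This is a sketch added for illustration; it uses only the four triples $(x_i,u,v)$, $(x_{i+1},u',v')$, $(u,u',w)$, $(v,v',w)$ and the rule that each triple contains exactly one $1$.
\begin{align*}
x_i=1 &\;\Rightarrow\; u=v=0 \;\Rightarrow\; u'=v'=1-w\\
&\;\Rightarrow\; x_{i+1}+2(1-w)=1\\
&\;\Rightarrow\; w=1,\ u'=v'=0,\ x_{i+1}=1,\\
x_i=0 &\;\Rightarrow\; \text{exactly one of } u,v \text{ is } 1;\ \text{say } u=1,\ v=0\\
&\;\Rightarrow\; u'=w=0,\ v'=1\\
&\;\Rightarrow\; x_{i+1}=0.
\end{align*}
The case $u=0$, $v=1$ is symmetric. Hence $x_i=x_{i+1}$ in every satisfying assignment, and both values of $x_i$ extend to a consistent assignment of the auxiliary variables.
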

784:

785: \begin{theorem}

786: Suppose $G$ is $(6,1)$-anonymous. Finding the smallest set of

787: edges to add to $G$ to solve the weak or strong

788: $(7,1)$-anonymization problem is NP-hard. The same results hold

789: for going from $(k,1)$-anonymity to weak or strong $(k+1,

790: 1)$-anonymity when $k \geq 6$.

791: \end{theorem}

792:

793: \begin{proof}

794: We show this via a reduction from the 1-in-3 satisfiability

795: problem. We take an instance of the 1-in-3 satisfiability problem

796: satisfying the constraints of Lemma~\ref{lemma1}. We further

797: assume that the number of triples in this instance is a multiple

798: of 3, since if it is not a multiple of 3, it is easy to see that

799: there will be no satisfying assignment. Since we also assume that

800: the number of triples is even, the number of triples is in fact of

801: the form $6m$.

802:

803: Taking this instance, we now form a cubic bipartite graph

804: $G=(U,V,E)$ by creating a vertex in $U$ for each triple and a

805: vertex in $V$ for each variable, with the two vertices connected

806: by an edge if the variable occurs in the triple. We add 5 new

807: neighbors of degree 1 to each vertex in $U$. Each of these added

808: neighbors and the vertices in $V$ are $(7,1)$-anonymous, but the

809: vertices in $U$ have only 6 vertices at distance 2, namely the 2

810: other neighbors of each of the 3 neighbors in $V$, giving

811: $(6,1)$-anonymity. We would like to increase the anonymity of

812: these vertices so that they are also $(7,1)$-anonymous. Note that

813: a solution to this anonymity problem has to consist of at least

814: $m$ edges. This is because the total residual anonymity of the graph is $6m$ and each new edge can reduce the residual anonymity by at most $6$. Now, if it were possible to select $2m$ vertices in $V$

815: that were adjacent to all the $6m$ vertices in $U$, we could

816: insert a perfect matching of $m$ edges between these $2m$ vertices

817: and simultaneously increase the anonymity of all the vertices in

818: $U$ by at least 1. This would correspond to a solution to the

819: 1-in-3 satisfiability problem. Similarly, if there is a solution

820: to the anonymity problem that involves the addition of only $m$

821: edges, it must necessarily correspond to a solution to the 1-in-3

822: satisfiability problem. Thus a solution to the 1-in-3

823: satisfiability problem exists if and only if the solution to the

824: anonymity problem involves the addition of $m$ extra edges.

825:

826: For $k\geq 6$, add $k-2$ nodes of degree 1 attached to each vertex

827: in $U$. Attach an additional node of degree $k-5$ to each vertex

828: in $U$. Attach the remaining $k-6$ neighbors of each such

829: additional node to a clique of size $k+2$. The result then follows

830: from the case of $k=6$.

831: \end{proof}
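
\noindent The counting behind the quantities $6m$, $2m$ and $m$ in the proof above can be spelled out as follows. The symbols $T$ (number of triples) and $N$ (number of variables set to $1$ in a satisfying assignment) are our own shorthand. Each triple contains exactly one $1$ and each variable lies in exactly $3$ triples, and each added edge helps at most $6$ of the $6m$ deficient vertices of $U$, so
\begin{align*}
3N &= T, \\
T = 6m \;&\Rightarrow\; N = 2m,\\
\text{edges needed} &\geq \frac{6m}{6} = m .
\end{align*}
A perfect matching on the $2m$ selected vertices of $V$ uses exactly $m$ edges and therefore meets this lower bound.
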

832:

833: The complexity of minimally obtaining weak and strong

834: $(k,1)$-anonymous graphs remains open for $k=3,4,5,6$.

835:

836: \section{{\an}-anonymization}\label{K_1}

837: We start our study for the $(k,1)$-anonymization problem by giving

838: two simple $O(k)$-approximation algorithms. We then show that the

839: approximation factor can be further improved to match a lower bound.

840: \subsection{${\text O}(k)$-approximation algorithms for $(k,1)$-anonymization}

841: Let $G=(V,E)$ be the input graph to the weak $(k,1)$-anonymization

842: problem. Consider the following simple iterative algorithm: at

843: every step $i$ add to graph $G_i$ ($G_1 = G$) a single edge

844: between a neighbor of a deficient node $u$ and a node that is not

845: already in the $2$-neighborhood of $u$ in $G_i$. If there are only

846: isolated deficient nodes in $G_i$, the algorithm directly connects

847: a deficient node to a node of a $(k+1)$-clique. If no such clique

848: exists, the algorithm creates it in a preprocessing step; $(k+1)$

849: randomly selected nodes are picked for this purpose. Repeat the

850: process until no deficient nodes remain. We call this algorithm

851: the {\weakaddition} algorithm. We show that {\weakaddition} is an

852: ${\text O}(k)$-approximation algorithm for the weak

853: $(k,1)$-anonymization problem. This result is summarized in the

854: following theorem.

855:

856:

857: \begin{theorem}\label{thm:weak1}

858: {\weakaddition} gives a ${\text O}(k)$-approximation

859: for the weak $(k,1)$-anonymization problem. If the optimal

860: solution is of size $t$, {\weakaddition} adds at

861: most $4kt + k^2$ edges.

862: \end{theorem}

863:

864: \begin{proof}

865: Let $R = \sum_{v\in V}r(v)$ be the residual anonymity (see

866: Definition~\ref{dfn:residual}) of graph $G = (V,E)$. Let

867: ${\text{\sc Wa}}$ be the total number of edges added by the

868: {\weakaddition} algorithm. It holds that ${\text{\sc Wa}} \leq

869: R+k^2$. This is because at every step the algorithm adds one edge

870: that decreases the residual anonymity of the graph by at least

871: $1$. Therefore the algorithm adds at most $R$ edges. The

872: additional $k^2$ edges may be required to create a $(k+1)$-clique if such a clique does not exist.

873:

874: Now assume that the optimal solution adds $t$ edges. Consider an

875: edge $uv$ of the optimal solution. This edge, at the time of its

876: addition, could have decreased the residual anonymity of the graph

877: by at most $4k$. This is because it could have decreased the

878: residual anonymity of each of $u$ and $v$ as well as the residual

879: anonymities of at most $k$ neighbors connected to $u$ and at most

880: $k$ neighbors connected to $v$ (if $u$ or $v$ had more than $k$

881: neighbors, then none of these neighbors would have been

882: deficient). Further, the edge $uv$ could have decreased the

883: residual anonymity of $u$ or $v$ by at most $k$, and the residual

884: anonymities of each of the $k$ neighbors of $u$ or each of the $k$

885: neighbors of $v$ by at most 1.

886:

887: Thus, each edge of the optimal solution could have reduced the

888: residual anonymity of the graph by at most $4k$ at the time of its

889: addition. That is, $t \geq R/4k$.

890:

891: Thus it is clear that $\text{\sc Wa}\leq 4kt +k^2$.

892: \end{proof}
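
\noindent For reference, the accounting in the proof above can be written as a single chain. The only step not stated explicitly there is the bound on the number of clique edges, which we add as a routine check.
\begin{gather*}
\underbrace{k+k}_{u \text{ and } v} \;+\; \underbrace{k\cdot 1 + k\cdot 1}_{\text{neighbors of } u \text{ and } v} \;=\; 4k
\quad\Rightarrow\quad R \leq 4kt,\\
\binom{k+1}{2} = \frac{k(k+1)}{2} \;\leq\; k^2 \quad (k\geq 1),\\
\text{\sc Wa} \;\leq\; R + k^2 \;\leq\; 4kt + k^2 .
\end{gather*}
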

893:

894: For the strong $(k,1)$-anonymization problem we show that the

895: {\strongaddition} algorithm (very similar to {\weakaddition}), is

896: an ${\text O}(k)$-approximation. {\strongaddition} is also

897: iterative: in each iteration $i$ it considers graph $G_i$ and adds

898: one edge to it. The edge to be added is one that connects a

899: neighbor of a deficient node $u$ to a node that is not already in

900: the $2$-neighborhood of $u$. This process is repeated till no

901: deficient nodes remain. We can state the following for the

902: approximation ratio achieved by the {\strongaddition} algorithm.

903:

904: \begin{theorem}

905: {\strongaddition} is a $2k$-approximation algorithm

906: for the strong $(k,1)$-anonymization problem.

907: \end{theorem}

908:

909: \begin{proof}

910: As in the proof of Theorem~\ref{thm:weak1} consider input graph

911: $G=(V,E)$ with initial residual anonymity $R$. Every edge added by

912: the {\strongaddition} algorithm would reduce the residual

913: anonymity of the graph by at least $1$. Therefore, if the number

914: of edges added by the {\strongaddition} algorithm is $\text{\sc

915: Sa}$ we have that $\text{\sc Sa}\leq R$.

916:

917: Suppose now that the optimal solution adds $t$ edges. An added

918: edge $uv$ decreases the residual anonymity of the graph by at most

919: $2k$. This is because the edge can decrease the residual anonymity

920: of only the {\em original} neighbors of $u$ and $v$ by at most $1$

921: each and there can be at most $2k$ such deficient neighbors. Thus

922: $t \geq R/2k$.

923:

924: From the above we have that $\text{\sc Sa}\leq 2kt$.

925: \end{proof}

926:

927: \subsection{$\Theta(\log n)$-approximation algorithms for $(k,1)$-anonymization}

928:

929: We now provide two simple greedy algorithms for the weak and

930: strong \an-anonymization problems and show that they output

931: solutions that are ${\text O}(\log n)$-approximations to the

932: optimal. We then show that this is the best approximation factor

933: we can hope to achieve for arbitrary $k$.

934:

935: We start by presenting {\weakgreedy} which is an ${\text O}(\log

936: n)$-approximation algorithm for the weak $(k,1)$-anonymization

937: problem. Consider input graph $G=(V,E)$ that has total

938: residual anonymity $R$.  The optimal solution to the problem

939: consists of a set of edges that together take care of all the

940: residual anonymity in the graph.

941:

942:

943: \begin{figure}[]

944: \begin{center}

945: {\includegraphics[scale =

946: 0.35]{reinforcement}}\caption{Illustrative example of the

947: reinforcement between new edges in the weak-anonymization

948: problem.}\label{figure:reinforcement}

949: \end{center}

950: \end{figure}

951:

952: We may be tempted to use a set-cover type solution: greedily

953: choose edges to add that maximally reduce the residual anonymity

954: of the graph at each step. However, such a greedy algorithm is not

955: so easy to analyze in the context of the weak-anonymization

956: problem. The difficulty in the analysis stems from the fact that

957: the new edges may \emph{reinforce} each other. That is, the

958: addition of an edge may bring about a greater reduction in the

959: residual anonymity of the graph in the presence of other added

960: edges. Consider, for example, the input graph $G$ shown in

961: Figure~\ref{figure:reinforcement}. Note that solid lines

962: correspond to the original edges in $G$. In this case, the

963: addition of edge $x2z1$ alone does not help in the anonymization

964: of node $y2$. (Neither does the addition of edge $y2z1$ in the

965: anonymization of $x2$). However, if edge $y2z1$ is already added

966: in the graph, then edge $x2z1$ helps in anonymizing node $y2$

967: as well.

968:

969: We get around this peculiarity of our problem by greedily choosing

970: triplets of edges to add instead of singleton edges. Algorithm~1, called {\weakgreedy}, describes the procedure.

971:

972: \begin{algorithm}[H]

973: \caption{{\weakgreedy} for weak $(k,1)$-anonymization}\label{alg1}

974: \begin{algorithmic}[1]

975: \STATE //Input: $k, G=(V,E)$

976: \STATE Randomly choose a node $w \in V$

977: \STATE Add up to ${k+1 \choose 2}$ edges to $E$ to form a $(k+1)$-clique at $w$

978: \STATE Compute $R = $ residual anonymity of $G$

979: \WHILE{$R > 0$}

980: \STATE Find triplet $uv, uw, vw$ that maximally decrease $R$

981: \STATE $E = E \cup \{uv\} \cup \{uw\} \cup \{vw\}$

982: \STATE Update $R$

983: \ENDWHILE

984: \end{algorithmic}

985: \end{algorithm}

986:

987: \begin{theorem}

988: {\weakgreedy} is a polynomial-time nearly ${\text

989: O}(\log n)$-approximation algorithm for the weak \an-anonymization

990: problem. If the optimal solution is of size $t$, the algorithm

991: adds at most $k^2+6t\log n$ edges.

992: \end{theorem}

993:

994: \begin{proof}

995: Consider the optimal solution of $t$ edges. These $t$ edges

996: together take care of all the residual anonymity in the graph. We

997: can convert this solution to a solution of triplets that consists of at most $k^2 +

998: 3t$ edges: first randomly choose a node $w$ and create a

999: $(k+1)$-clique amongst $w$ and $k$ other randomly chosen nodes.

1000: Then, for each edge $uv$ of the optimal solution, add a triangle

1001: $(uv, vw, uw)$ to the graph. The resulting graph will clearly

1002: continue to be \an-anonymous. The $t$ triangles in conjunction

1003: with the $(k+1)$-clique take care of all the residual anonymity in

1004: the graph. Further, these triangles do not reinforce each other

1005: because they are all connected to a node of degree $k$.

1006:

1007: Going back to Algorithm~1, this means that once a $(k+1)$-clique has been added to

1008: the graph, at each iteration of the algorithm, there must exist

1009: some triangle with a vertex in the $(k+1)$-clique that reduces the

1010: residual anonymity of the graph by at least a $1/t$ fraction (similar to the argument for the greedy set cover algorithm). And

1011: since the algorithm greedily chooses triangles to add, the

1012: residual anonymity of the graph will decrease by at least this

1013: fraction at each step. Since the residual anonymity of the graph can

1014: be at most $kn < n^2$ to begin with, the algorithm will only

1015: proceed for at most $r$ iterations till $(1 - 1/t)^r \leq  1/(kn)$.

1016: This would mean that $r= {\text O}(t\log(kn)) = {\text O}(2t\log

1017: n)$ and $3r =  {\text O}(6t\log n)$.

1018: \end{proof}
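
\noindent The bound on the number of iterations $r$ follows from a standard estimate, sketched here under the assumptions used in the proof above: each iteration removes at least a $1/t$ fraction of the remaining residual anonymity, and the initial residual anonymity is at most $kn$. Using $1-x\leq e^{-x}$,
\[
kn\left(1-\frac{1}{t}\right)^{r} \;\leq\; kn\, e^{-r/t} \;<\; 1
\quad\text{whenever}\quad r > t\ln(kn).
\]
Taking the residual anonymity to be integer-valued, it is $0$ once it drops below $1$. Since $kn<n^2$ gives $t\ln(kn) < 2t\ln n$, it follows that $r = {\text O}(t\log n)$ iterations, each adding three edges, suffice.
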

1019:

1020: The approximation algorithm for the strong \an-anonymization

1021: problem is simpler, since added edges cannot

1022: reinforce each other --- an added edge can only help the original

1023: neighbors of its two end points. Algorithm~2 gives the details of

1024: the {\stronggreedy} algorithm.

1025:

1026: \begin{algorithm}[H]

1027: \caption{{\stronggreedy} for $(k,1)$-anonymization}

1028: \begin{algorithmic}[1]

1029: \STATE //Input: $k, G=(V,E)$

1030: \STATE Compute $R = $ residual anonymity of $G$

1031: \WHILE{$R > 0$}

1032: \STATE Find edge $uv$ that maximally reduces $R$

1033: \STATE $E = E \cup \{uv\}$

1034: \STATE Update $R$

1035: \ENDWHILE

1036: \end{algorithmic}

1037: \end{algorithm}

1038:

1039:

1040: Since the added edges do not reinforce each other in the strong

1041: $(k,1)$-anonymization problem, the analysis of {\tt Strong-Greedy} is

1042: similar to the analysis of the greedy algorithm for the standard

1043: set-cover problem.

1044:

1045: \begin{theorem}

1046: {\stronggreedy} is a polynomial-time $2\log n$-approximation

1047: algorithm for the strong \an-anonymization problem.

1048: \end{theorem}

1049:

1050: \begin{proof}

1051: Suppose the optimal solution adds $t$ edges to eliminate the

1052: residual anonymity of the graph, which is at most $kn < n^2$. Since edges

1053: of the solution do not reinforce each other, there must exist some

1054: edge that reduces the residual anonymity of the graph by at least

1055: a $1/t$ fraction.

1056:

1057: Therefore at each iteration of Algorithm~2, we greedily choose an

1058: edge to add that must cause at least this much reduction in the

1059: residual anonymity of the graph. The algorithm will thus terminate

1060: after $r$ steps where ${(1-1/t)}^r \leq 1/(kn)$, or

1061: $r=t\log(kn)\leq 2t\log n$.

1062: \end{proof}

1063:

1064:

1065: We next show that $\log n$ is the best factor we

1066: could hope to achieve for unbounded $k$, for both the weak and strong

1067: $(k,1)$-anonymization problems via an approximation-preserving reduction from the hitting set problem.

1068:

1069: \begin{theorem}

1070: The weak and strong $(k,1)$-anonymization problems with $k$ unbounded are

1071: $\Omega(\log n)$-approximation NP-hard.

1072: \end{theorem}

1073:

1074: \begin{proof}

1075: Hitting set is $\Omega(\log n)$-approximation NP-hard. Consider an

1076: instance of the hitting-set problem consisting of sets ${\cal S} =

1077: \{S_1, S_2, \ldots\}$. Let $k$ be greater than the maximum number

1078: of sets intersecting any one set $S_i$. Add a unique element $v_i$

1079: to each $S_i$. Additionally, construct sets ${\cal T} = \{T_1,

1080: T_2, \ldots\}$ such that each $T_i$ contains the appropriate

1081: $v_i$'s so that every $S_i$ intersects exactly $k-1$ other sets. In

1082: every set $T_i$ add an additional element $w$ so that each set in

1083: ${\cal T}$ intersects at least $k$ other sets. Now construct a

1084: bipartite graph $G=(U,V,E)$, where the vertices of $U$ correspond

1085: to the sets in ${\cal S}$ and ${\cal T}$, the vertices of $V$

1086: correspond to individual members of these sets, with $E$

1087: indicating membership of elements from $V$ in sets from $U$. For

1088: every element $u$ in $U$ create $(k+1)$ new vertices of degree $1$

1089: in $V$ and connect them to $u$. In the resulting graph, the

1090: vertices in $V$ are all \an-anonymous, however the vertices in $U$

1091: that correspond to sets in ${\cal S}$ are only $(k-1,

1092: 1)$-anonymous. Consider the $t$ nodes in $V$ that are the optimal

1093: solution to the hitting-set problem. Then matching these nodes

1094: using $\lceil t/2\rceil$ edges will be an optimal solution to the

1095: strong or weak $(k, 1)$-anonymization problem in the bipartite

1096: graph $G=(U,V,E)$. Therefore, an optimal solution to the

1097: anonymization problem corresponds to an optimal solution to the

1098: hitting-set problem which is $\Omega(\log n)$-hard to approximate.

1099: \end{proof}

1100:

1101:

1102: \section{{\anonvar}-anonymization for $\ell > 1$}\label{K_L}

1103:

1104: In this section we provide algorithms for the weak and strong

1105: {\anonvar}-anonymization problems when $\ell > 1$.

1106:

1107: The algorithm for weak {\anonvar}-anonymization is a randomized algorithm that constructs a bounded-degree expander between deficient vertices. Given a

1108: $(k',\ell)$-anonymous graph $G$, it solves the weak $(k,

1109: \ell)$-anonymization problem by adding only ${\text

1110: O}(\sqrt{(k-k')\ell})$ additional edges at each vertex. The algorithm

1111: can also be easily adapted to solve the weak $(k,

1112: \ell)$-anonymization problem for any input graph irrespective of

1113: its initial anonymity.

1114:

1115: \begin{theorem}

1116: There exists a randomized polynomial-time algorithm that adds ${\text O}(\sqrt{(k-k')\ell})$ edges per vertex and increases the anonymity of

1117: a graph from $(k',\ell)$ to $(k,\ell)$ where $\ell\leq k\leq

1118: n^{1-\epsilon}$ and $\epsilon$ is a constant greater than $0$.

1119: \end{theorem}

1120:

1121: {\em Proof Sketch:} Randomly partition the $n$ vertices into $n/\ell$ sets

1122: of size $\ell$. Treat each set as a ``supernode''.

1123: Construct an expander of degree $\sqrt{(k-k')/\ell}$ on these

1124: $n/\ell$ supernodes. In this way each supernode has $(k-k')/\ell$

1125: supernodes in its $2$-neighborhood that can be reached through just one

1126: intermediate supernode.  Replace each edge $uv$ of this expander

1127: with a complete bipartite graph $K_{\ell, \ell}$ between the constituent

1128: vertices of the supernodes $u$ and $v$. Thus each vertex now has $k-k'$

1129: vertices in its $2$-neighborhood that can be reached through

1130: an intermediate set of size $\ell$. Since $\ell\leq k\leq

1131: n^{1-\epsilon}$, we can show that with high probability, none of

1132: these $k-k'$ new vertices will coincide with the $k'$ vertices

1133: previously in the node's $2$-neighborhood.\\
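
As a rough sanity check on the parameters in the sketch above, write $d$ for the degree of the expander on supernodes. The symbol $d$ is introduced here only for illustration, and constant factors are ignored throughout. The counts then read
\begin{align*}
d &= \sqrt{(k-k')/\ell}, \\
\text{edges added per vertex} &= d\cdot\ell = \sqrt{(k-k')\ell}, \\
\text{supernodes within two steps} &\le d(d-1) \approx d^{2} = (k-k')/\ell, \\
\text{vertices within two steps} &\approx d^{2}\cdot\ell = k-k'.
\end{align*}
For instance, $k-k'=64$ and $\ell=4$ give $d=4$, so each vertex receives $d\ell=16$ new edges and $d^{2}\ell=64$ is the target number of new vertices at distance two. The exact count $d(d-1)\ell=48$ falls short of $64$ by a constant factor, so the degree would presumably be raised by a constant, which the ${\text O}(\cdot)$ bound absorbs.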

1134:

1135:

1136: As a final result, we present the algorithm for strong {\anonvar}-anonymization. This algorithm is a generalization of the {\stronggreedy} algorithm (see Algorithm~2).

1137: The difference is that instead of picking a single edge to add at

1138: every iteration the algorithm picks edges in groups of size at

1139: most $\ell$. At each iteration it picks the group that causes the largest

1140: reduction in the residual anonymity of the graph. The pseudocode

1141: is given in Algorithm~3.

1142:

1143: \begin{algorithm}[H]

1144: \caption{{\stronggreedy} for $(k, \ell)$-anonymization}

1145: \begin{algorithmic}[1]

1146: \STATE //Input: $k, \ell, G=(V,E)$ \STATE Compute $R = $ residual

1147: anonymity of $G$ \WHILE{$R > 0$} \STATE Find set of edges ${\cal

1148: E}$, with $|{\cal E}| \leq \ell$, that maximally reduces $R$

1149: \STATE $E = E \cup {\cal E}$ \STATE Update $R$ \ENDWHILE

1150: \end{algorithmic}

1151: \end{algorithm}
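
To see why each iteration can run in polynomial time when $\ell$ is a constant, one can count the candidate sets ${\cal E}$. This assumes, as the pseudocode does not specify, that the search is a brute-force enumeration over all sets of at most $\ell$ vertex pairs:
\[
\sum_{i=1}^{\ell}\binom{\binom{n}{2}}{i} \;\le\; \binom{n}{2}^{\ell} \;=\; {\text O}(n^{2\ell}).
\]
For example, with $\ell=2$ and $n=100$ there are at most $4950^{2}\approx 2.45\times 10^{7}$ candidate sets per iteration. Evaluating each candidate requires recomputing $R$, which is itself polynomial in $n$.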

1152:

1153: We can state the following theorem for the approximation factor of

1154: Algorithm~3 when $\ell$ is a constant.

1155:

1156: \begin{theorem}

1157: Consider $G=(V,E)$ to be the input graph to the strong {\anonvar}-anonymization problem.

1158: Also assume $\ell$ is a constant.

1159: Let $t$ be the optimal number of edges that need to be added to solve the strong {\anonvar}-anonymization problem on $G$. Then Algorithm~3 is a

1160: polynomial-time $O(t^{\ell-1}\log n)$-approximation algorithm.

1161: \end{theorem}

1162:

1163: {\em Proof Sketch:} In the $(k, \ell)$-anonymization problem,

1164: groups of up to $\ell$ edges at a time incident at a single vertex

1165: can reduce the residual anonymity of a vertex adjacent to the $\ell$

1166: endpoints of these edges. The $t$ edges added by the optimal

1167: solution define at most $t^{\ell}$ subsets of at most $\ell$ edges

1168: incident to a single vertex. By selecting such subsets greedily as

1169: in a set-cover problem we ultimately reduce the residual anonymity

1170: of the graph to $0$ in ${\text O}(t^{\ell}\log n)$ steps. We can show that reinforcement effects between subsets of edges are taken care of. Since each step adds at most $\ell$ edges and $\ell$ is a constant, this proves the ${\text O}(t^{\ell}\log n)$ bound on the number of edges

1171: selected; dividing by the optimum $t$ gives the stated ${\text O}(t^{\ell-1}\log n)$ approximation factor. If $\ell$ is a small constant, the approximation factor may not be too large. Further, in practice this simple algorithm may perform better than this worst-case bound indicates.
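
The chain of bounds in the sketch above can be written out explicitly. This is only an illustrative expansion under the sketch's own assumptions, with the greedy set-cover guarantee taken as given:
\begin{align*}
\#\{\text{subsets of at most $\ell$ optimal edges}\} &\le \sum_{i=1}^{\ell}\binom{t}{i} \le t^{\ell}, \\
\#\{\text{greedy steps}\} &= {\text O}(t^{\ell}\log n), \\
\#\{\text{edges added}\} &\le \ell\cdot{\text O}(t^{\ell}\log n) = {\text O}(t^{\ell}\log n) \quad (\ell \text{ constant}), \\
\#\{\text{edges added}\}\,/\,t &= {\text O}(t^{\ell-1}\log n).
\end{align*}
For example, $\ell=2$ gives a factor of ${\text O}(t\log n)$ and $\ell=3$ gives ${\text O}(t^{2}\log n)$.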

1172:

1173: \section{Conclusions}\label{sec:conclusions}

1174: Motivated by recent studies on privacy-preserving graph releases,

1175: we proposed a new definition of anonymity in graphs. We further

1176: defined two new combinatorial problems arising from this

1177: definition, studied their complexity and proposed simple,

1178: efficient and intuitive algorithms for solving them.

1179:

1180: The key idea behind our anonymization scheme was to enforce

1181: that every node in the graph should share some number of its

1182: neighbors with $k$ other nodes. The optimization problems we

1183: defined ask for the minimum number of edges to be added to the

1184: input graph so that the anonymization requirement is satisfied.

1185: For these optimization problems we provided algorithms that solve

1186: them exactly ($k=2$) or approximately ($k>2$).

1187:

1188: An interesting avenue for future work would be to fully characterize the

1189: kinds of attacks that our definition of anonymity protects

1190: against, and to study the impact of our anonymization schemes on the utility of the graph release.

1191:

1192: Finally, we believe that the combinatorial problems we have

1193: studied in this paper are interesting in their own right, and may

1194: also prove useful in other domains. For example, at a high level

1195: there is a similarity between the problem we study in this paper

1196: and the problem of constructing reliable graphs for, say, reliable

1197: routing.

1198:

1199:

1200: \bibliographystyle{plain}

1201: \bibliography{graphanon}

1202:

1203: \end{document}

1204: