0504:cs0504012/sruti.tex

1: \documentclass[letterpaper,twocolumn,10pt]{article}

2:

3: \usepackage{endnotes}

4: \usepackage{times}

5: \usepackage{epsfig}

6: \usepackage{subfigure}

7: \usepackage{graphics}

8: \usepackage{graphicx}

9: \usepackage{multirow}

10: \usepackage[latin1]{inputenc}

11: \usepackage{algorithmic}

12: \usepackage[plain]{algorithm}

13: %\usepackage[margin=2.7cm]{geometry}

14: %\usepackage[margin=2.7cm,left=4.6pc,right=4.6pc,top=4.0pc,bottom=4.2pc]{geometry}

15:

16: \begin{document}

17:

18: \date{}

19:

20: %\title{\Large \bf A Community-Based Spam Detection Algorithm}

21:

22: \title{Improving Spam Detection Based on Structural Similarity}

23:

24: \newcommand{\institutions}{\mbox{{\hspace*{-.5cm}

25:       \begin{minipage}{7cm}

26:         \begin{center}

27:           $^{\dag}$ Computer Science Dept.\\

28:           Universidade Federal de Minas Gerais \\

29:           Belo Horizonte - Brazil \\

30:           \{lhg, fernando, barra, virgilio, jussara\}@dcc.ufmg.br

31:         \end{center}

32:       \end{minipage}

33:       \begin{minipage}{7cm}

34:         \begin{center}

35:           $^{\ddag}$ Computer and Computational Sciences\\

36:           Los Alamos National Laboratory \\

37:           Los Alamos - USA\\

38:           lmbett@lanl.gov

39:         \end{center}

40:       \end{minipage}}}}

41:

42:

43: \author{

44:   {\rm Luiz H. Gomes$^{\dag}$\footnote{Luiz H. Gomes is supported by Banco Central do

45:       Brasil.}, Fernando D. O. Castro$^{\dag}$, Rodrigo B. Almeida$^{\dag}$, }\\

46:   {\rm Luis M. A. Bettencourt$^{\ddag}$, Virg\'{\i}lio A. F. Almeida$^{\dag}$, Jussara M. Almeida$^{\dag}$}

47: \\[0.3cm]

48: \institutions

49: }

50:

51: %\author{

52: %  {\rm Luiz H. Gomes\thanks{Luiz H. Gomes is supported by Banco Central do

53: %      Brasil.}, Fernando D. O. Castro, }\\

54: %  {\rm Virg\'{\i}lio A. F.Almeida, Jussara M. Almeida, Rodrigo B. Almeida}\\

55: %  Computer Science Dept.,Universidade Federal de Minas Gerais, Belo Horizonte - Brazil \\

56: %  \{lhg,fernando, virgilio,jussara, barra\}@dcc.ufmg.br\\

57: %\and

58: %  {\rm Luis M. A. Bettencourt }\\

59: %  Computer and Computational Sciences, Los Alamos National Laboratory, Los Alamos - USA\\

60: %  lmbett@lanl.gov\\

61: %}

62:

63: \maketitle

64:

65: \thispagestyle{empty}

66:

67: \subsection*{Abstract}

68:

69:

70:

71:

72: %Spam senders use a multitude of strategies, based on knowledge of current filters, to evade detection, imposing serious limitations on spam recognition by analyzing content or mapping senders' techniques.

73:

74: We propose a new detection algorithm that uses structural relationships

75: between senders and recipients of email as the basis for the identification of

76: spam messages. Senders and receivers are represented as vectors in their

77: reciprocal spaces. A measure of similarity between vectors is constructed and

78: used to group users into clusters. Knowledge of their classification as past

79: senders/receivers of spam or legitimate mail, coming

80: from an auxiliary detection algorithm, is then used to label these

81: clusters probabilistically.

82: The similarity between the sender and receiver sets of a

83: new message and the center vectors of the clusters is then used to assess the

84: possibility of that message being legitimate or spam. We show that the proposed

85: algorithm is able to correct part of the false positives (legitimate messages

86: classified as spam) using a testbed consisting of a one-week SMTP log.

87:

88: \section{Introduction}

89: \label{sec:introduction}

90:

91: %Spam is quickly becoming the leading threat to  the viability of email

92: %as a means of communication and a leading source of fraud and other criminal

93: %activity worldwide.  The vast majority of spam messages presently originate

94: %in the USA and China, hosted by well known ISPs and generated  by identified

95: %individuals~\cite{spamhaus} Nevertheless, an increased effort in criminal

96: %investigation and waves of high profile legislation have not yet succeeded at

97: %reducing the

98:

99: %It is often said that the problem of spam email is that it is an extremely

100: %asymmetric threat. While it is technically easy and very cheap to send a spam

101: %email it requires sophisticated organization and much higher costs at the

102: %receiving end to sort out legitimate emails from junk.

103:

104: The relentless rise in spam email traffic,  now accounting for about

105: $83\%$ of all incoming messages, up from $24\%$ in January

106: 2003~\cite{messageLabs},  is becoming one of the greatest threats to the

107: use of email as a form of communication.

108:

109: The greatest problem in detecting spam stems from active adversarial efforts to thwart

110: classification. Spam senders use a multitude of techniques based on knowledge of

111: current detection algorithms, to evade detection. These techniques

112: range from changes in the way text is written - so that it cannot be directly

113: analyzed computationally,  but can be understood by humans naturally -

114: to frequent changes in other elements, such as user names, domains, subjects, etc.

115: Therefore, good spam identifiers are becoming increasingly difficult to choose.

116:

117: %Other elements that

118: %change frequently are the message contents and subject lines. Detection of spam

119: %messages based on these attributes requires that the filters are able to match,

120: %in breadth and speed, this variability. Because the space of possibilities is

121: %truly immense this always leaves spam detection algorithms playing catch up in

122: %a never ending evolutionary arms race.

123:

124: In light of this enormous variability, the question then is: what are the

125: identifiers of spam that are most costly to change, from the point of view of

126: the sender? The limitations of attempts to recognize spam by analyzing content

127: are clear~\cite{gerf}. Content-based techniques~\cite{sahami98bayesian,

128:   zhou03approximate, spamassassin} have to cope with the constant changes in

129: the way spammers generate their solicitations. The structure of the target

130: space for these solicitations tends, however, to be much more stable, since

131: spam senders still need to reach recipients, even if under forged identifiers,

132: in order to be effective. Specifically, by structure we mean the space of

133: recipients targeted by a spam sender, as well as the space of senders that

134: target a given recipient,  i.e. the contacts of a user. The contact lists, or

135: subsets thereof, can then be thought of as a signature of spam senders and

136: recipients. Additionally, by constructing a similarity measure in these

137: spaces, we can track how lists evolve over time, by addition or removal of

138: addresses.

139:

140: In this paper, we propose an algorithm for spam detection that uses structural

141: relationships between senders and recipients as the basis for the

142: identification of spam messages. The algorithm must work in conjunction with

143: another spam classifier (hereafter called the auxiliary algorithm), which

144: produces the spam or legitimate mail tags on past senders and receivers that

145: are in turn used to infer new ones through structural similarity.

146: The key idea is that the lists spammers and legitimate users send messages to,

147: as well as the lists from which they receive messages, can be used as the

148: identifiers of classes of email traffic~\cite{priority, ceas}.

149: We will show that applying our structural

150: algorithm to the determinations of the initial classifier leads to the

151: correction of a number of false positives.

152:

153: %The statistical properties of the lists created by spammers and legitimate

154: %users allow us to separate the messages in classes~\cite{ceas,

155: %  priority}. Moreover the grouping of both senders and recipients in

156: %communities, according to a similarity measure in the contact list space,

157: %allows us to generalize the historical information of messages exchanged by

158: %those communities. The final classification provided by our algorithm outperforms

159: %the algorithm used as {\it oraculum}.

160: % and also other symbiotic techniques~\footnote{Techniques that are used coupled with other ones.}.

161:

162: This paper is organized as follows: Section~\ref{sec:modeling} presents the

163: methodology used to handle email data. Our structural algorithm is described in

164: Section~\ref{sec:algorithm}. We present the characteristics of our

165: example workload in Section~\ref{sec:results}, as well as the classification results

166: obtained with our algorithm over this set. Related work is presented in

167: Section~\ref{sec:related-work} and conclusions and future work in

168: Section~\ref{sec:concl-future-works}.

169:

170: %    Caveats:

171: %    \begin{itemize}

172: %        \item such type of algorithms need time to be trained

173: %        \item one only ever has access to incomplete lists of all the senders/recipients contacts, by  measuring over a period of time.

174: %        \item the algorithm may become memory intensive for large lists.

175: %    \end{itemize}

176:

177: \section{Modeling Similarity Among Email Senders and Recipients}

178: \label{sec:modeling}

179:

180: Our proposed spam detection algorithm exploits the structural

181: similarities that exist in groups of senders and recipients

182: as well as in the relationship established through the emails

183: exchanged between them. This section introduces our modeling of

184: individual email users and a metric to express the similarity

185: among different users. It then extends the modeling to account for

186: clusters of users who are highly similar.

187: % The notations and definitions provided are used in the

188: % presentation of our  detection algorithm in the next section.

189:

190: Our basic assumption is that, in both

191: legitimate email and spam traffic, users have a defined list of peers

192: they often have contact with (i.e., they send/receive email

193: to/from). In legitimate email traffic, contact lists are a consequence of

194: the social relationships on which users' communications are

195: based. In spam traffic, on the other hand, the lists used by spammers

196: to distribute their solicitations are created for business interest

197: and, generally, do not reflect any form of social interaction.

198: A user's contact list certainly may change over time. However,

199: we expect it to be much less variable than other characteristics

200: commonly used for spam detection, such as

201: sender user-name, presence of certain keywords in the email content

202: and encoding rules. In other words, we expect contact lists to be

203: more effective in identifying spam and, thus, we use them as

204: the basis for developing our algorithm.

205:

206: We start by representing an email user as a vector in a

207: multi-dimensional conceptual space created with all possible

208: contacts. We represent email senders and recipients separately.

209: We then use vectorial

210: operations to express the similarity among multiple senders (recipients),

211: and use this metric for clustering them.

212: Note that the term email user is

213: used throughout this work to denote any identification of

214: an email sender/recipient (e.g., email address, domain name, etc).

215:

216: Let $N_r$ be the number of distinct recipients. We represent

217: sender $s_i$ as an $N_r$-dimensional vector, $\vec{s_i}$,

218: defined in the conceptual space created by the email recipients being

219: considered.  The $n$-th

220: dimension (representing recipient $r_n$) of $\vec{s_i}$ is defined as:

221: \begin{eqnarray}

222:    \vec{s_i}[n] = \left\{ \begin{array}{ll}

223: 1, & $ if $s_i \rightarrow r_n \\

224: 0, & $ otherwise$ \\

225: \end{array}

226: \right.,

227: \end{eqnarray}

228: where $s_i \rightarrow r_n$ indicates that sender $s_i$ has sent at least one email to

229: recipient $r_n$.

230:

231: Similarly, we define $\vec{r_i}$ as an $N_s$-dimensional vector representation

232: for the recipient $r_i$, where $N_s$ is the number of distinct senders being

233: considered. The $n$-th dimension of this vector is set to $1$ if recipient

234: $r_i$ has received at least one email from $s_n$.

235:

236:

237: % Jussara: A frase abaixo deveria ir ou pra related work ou para algoritmo

238: % Previous~\cite{emailnetwork,dialoginemail,emailspectroscopy,inferring communities}

239: %studies showed that we can group users, web pages and others based on their

240: % email contacts, common interests and on links on Web

241: % pages.

242:

243:

244: We next define the similarity between two senders $s_i$ and $s_j$

245: as the cosine of the angle between their vector representations

246: ($\vec{s_i}$ and $\vec{s_j}$). The similarity is computed as follows:

247: \begin{eqnarray} \label{similarity}

248:   sim(s_i,s_j) = \frac{\vec{s_i} \circ \vec{s_j}}{|\vec{s_i}||\vec{s_j}|} = \cos(\vec{s_i},\vec{s_j}) ,

249: \end{eqnarray}

250: where $\vec{s_i} \circ \vec{s_j}$ is the inner product of the vectors and

251: $|\vec{s_i}|$ is  the norm of $\vec{s_i}$. Note that this

252: metric varies from 0, when senders do not share any recipient in their

253: contact lists,  to 1, when senders have identical contact lists and thus

254: have the same representation. The similarity between two recipients

255: is defined similarly.
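
As an illustrative worked example (the vectors below are hypothetical and not
taken from our workload), consider $N_r = 4$ recipients and two senders with
$\vec{s_1} = (1,1,0,1)$ and $\vec{s_2} = (1,0,0,1)$. Then
\[
  sim(s_1,s_2) = \frac{1\cdot 1 + 1\cdot 0 + 0\cdot 0 + 1\cdot 1}{\sqrt{3}\,\sqrt{2}}
               = \frac{2}{\sqrt{6}} \approx 0.82 .
\]
Because the vectors are binary, the inner product counts the recipients shared
by both senders and each squared norm counts the recipients of one sender. The
metric can therefore be written equivalently as
$|R_i \cap R_j| / \sqrt{|R_i|\,|R_j|}$, where $R_i$ (a notation introduced only
for this example) denotes the set of recipients contacted by $s_i$.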

256:

257: We note that our similarity metric has different interpretations

258: in legitimate

259: and spam traffic. In legitimate email traffic, a high similarity represents

260: social interaction with the same group of people, whereas in spam

261: traffic, it represents the use of different identifiers by

262: the same spammer or the sharing of distribution lists by distinct spammers.

263:

264: Finally, we can use our vectorial modeling approach to represent a

265: cluster of users (senders or recipients) who are highly similar.

266: A sender cluster $sc_i$, represented by vector

267:  $\vec{sc_i}$, is computed as the vectorial sum of its elements,

268: that is:

269: \begin{equation}

270:   \vec{sc_i} = \sum_{s \in sc_i}{\vec{s}}.

271: \end{equation}

272:

273: The similarity between

274: sender $s_i$ and an existing cluster $sc_j$ can then be directly

275: assessed by extending Equation~\ref{similarity} as follows:

276: \begin{eqnarray}

277:    sim(sc_j,s_i) = \left\{ \begin{array}{ll}

278:  \cos(\vec{sc_j} - \vec{s_i}, \vec{s_i}) , & $ if $s_i \in sc_j \\

279: \cos(\vec{sc_j}, \vec{s_i}) , & $ otherwise$ \\

280: \end{array}

281: \right.

282: \end{eqnarray}

283: We note that the vectorial representation of a sender $s_i$, and thus the sender cluster to which

284: it belongs (i.e., with which it shares the greatest similarity), may change over time

285: as new emails are considered. Therefore, in order to accurately estimate

286:  the similarity

287: between a sender $s_i$ and a sender cluster $sc_j$ to which $s_i$ currently belongs, we

288: first remove $s_i$ from $sc_j$, and then take the cosine between the two vectors

289:  ($\vec{sc_j} - \vec{s_i}$ and $\vec{s_i}$). This is performed so that

290: the previous classification of a user does

291: not influence its reclassification. Recipient clusters and the similarity

292: between a recipient and a given recipient cluster are defined analogously.
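
To illustrate the cluster similarity with a small hypothetical example (the
vectors are ours, chosen only for illustration), suppose a sender cluster
contains two senders with $\vec{s_1} = (1,1,0,1)$ and $\vec{s_2} = (1,0,0,1)$.
Its vector is their sum, $(2,1,0,2)$, with norm $\sqrt{4+1+0+4} = 3$. For the
member $s_1$, we first subtract its own vector, leaving $(1,0,0,1)$, and obtain
\[
  \cos\bigl((1,0,0,1),\,(1,1,0,1)\bigr) = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82 .
\]
For a sender outside the cluster, say $\vec{s_3} = (0,1,1,0)$, the full cluster
vector is used:
\[
  \cos\bigl((2,1,0,2),\,(0,1,1,0)\bigr) = \frac{1}{3\,\sqrt{2}} \approx 0.24 .
\]
Note that recipients contacted by several members of the cluster weigh more
heavily in the cluster vector, since the sum is not renormalized to a binary
vector.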

293:

294:

295: %%% Local Variables:

296: %%% mode: latex

297: %%% TeX-master: "sruti"

298: %%% End:

299:

300: \section{A New Algorithm for Improving Spam Detection }

301: \label{sec:algorithm}

302:

303: This section introduces our new email classification

304: algorithm which exploits the similarities between

305: email senders and between email recipients for

306: clustering and uses historical properties of clusters

307: to improve spam detection accuracy. Our algorithm is

308: designed to work together with any existing spam detection

309: and filtering technique that runs at the ISP level.

310: Our goal is to provide a significant reduction of

311: false positives (i.e., legitimate emails wrongly classified as spam),

312: which can be as high as 15\% in current filters~\cite{sizecost}.

313: %, incurring costs that are hard to measure.

314:

315: A description of the proposed algorithm is shown

316: in Algorithm~\ref{alg:detection}.

317: It runs on each arriving email $m$, taking as input

318: the classification of $m$, $mClass$, as either spam

319: or legitimate email, performed by the existing auxiliary

320: spam detection method. Using the vectorial representation

321: of email senders, recipients and clusters as well as the

322: similarity metric defined in Section~\ref{sec:modeling}, it then determines a

323: new classification for $m$, which may or may not agree with $mClass$.

324: The idea is that the classification by the auxiliary method

325: is used to build an incremental historical knowledge base that gets

326: more  representative through time. Our algorithm benefits from that and

327: outperforms the  auxiliary one as shown in Section~\ref{sec:results}.

328:

329: \begin{algorithm}

330: \centering

331:     \begin{algorithmic}

332:         \FORALL{arriving message $m$}

333:             \STATE $mClass = $classification of $m$ by auxiliary detection method;

334:             \STATE $sc = $find cluster for $m.sender$;

335:             \STATE Update spam probability for $sc$ using $mClass$;

336:             \STATE $P_s(m) = $spam probability for $sc$;

337:             \STATE $P_r(m) = 0$;

338:             \FORALL{recipient $r \in m.recipients$}

339:                 \STATE $rc = $find cluster for $r$;

340:                 \STATE Update spam probability for $rc$ using $mClass$;

341:                 \STATE $P_r(m) = P_r(m) + $spam probability for $rc$;

342:             \ENDFOR

343:             \STATE $P_r(m) = P_r(m)/size(m.recipients)$

344:             \STATE $SR(m) = $ compute spam rank based on $P_s(m)$ and  $P_r(m)$;

345:             \IF{$SR(m) > \omega$}

346:                 \STATE classify $m$ as spam;

347:             \ELSIF{$SR(m) < 1 - \omega$}

348:                 \STATE classify $m$ as legitimate;

349:             \ELSE

350:                 \STATE classify $m$ as $mClass$;

351:             \ENDIF

352:         \ENDFOR

353:     \end{algorithmic}

354: \caption{New Algorithm for  Email Classification}

355: \label{alg:detection}

356: \end{algorithm}

357:

358: In order to improve the accuracy of email classification, our algorithm

359: maintains sets of sender and recipient clusters, created based on the structural similarity

360: of different users. A sender (recipient) of an incoming email

361:  is added to a sender (recipient) cluster

362: that is most similar to it, as defined in Equation (4), provided that their similarity

363: exceeds a given threshold $\tau$. Thus, $\tau$ defines the minimum similarity

364: a sender (recipient) must have with a cluster to be assigned to it.

365: Varying $\tau$ allows us to create more tightly or loosely knit

366: clusters. If no cluster can be found, a new single-user cluster is created.

367: In this case, the sender (recipient) is used as the seed for populating the new

368: cluster.

369:

370: The sets of recipient and sender clusters are updated at each new email arrival

371: based on the email sender and list of recipients. Recall that to determine the

372: cluster a previously observed, and thus clustered, user (sender or recipient)

373: belongs to, we first remove the user from its current cluster and then assess

374: its similarity to each existing cluster. Thus, single-user clusters tend to

375: disappear as more emails are processed, except for users that appear only very sporadically.

376:

377: \begin{figure}[th!]

378:   \centering

379:   \includegraphics[width=200pt]{figures/spamRank.eps}

380:   \caption{Spam Rank Computation and Email Classification.}

381:   \label{fig:spamRank}

382: \end{figure}

383:

384: A probability of sending (receiving) a spam is assigned to each sender (recipient)

385:  cluster. We refer to this measure as simply the cluster spam probability.

386: We calculate the spam probability of a sender (recipient)

387: cluster as the average spam probability  of its elements, which, in

388: turn, is estimated based on the frequency of spam messages sent/received by each of them

389: in the past. Therefore, our algorithm uses the result of the email classification performed

390: by the auxiliary algorithm on each arriving email $m$ ($mClass$ in

391: Algorithm~\ref{alg:detection}) to continuously update cluster spam probabilities.
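
One way to write this estimate, under the assumption (made here only for
illustration) that the spam probability of a user is simply the fraction of
that user's past messages tagged as spam by the auxiliary algorithm, is
\[
  P(c) = \frac{1}{|c|} \sum_{u \in c} \frac{n_{spam}(u)}{n_{total}(u)} ,
\]
where $c$ is a sender (recipient) cluster, and $n_{spam}(u)$ and
$n_{total}(u)$ are hypothetical counters of the spam messages and of all
messages sent (received) by user $u$ so far. For instance, a cluster of three
senders with histories of 9 spam messages out of 10, 4 out of 5, and 1 out of 2
would have $P(c) = (0.9 + 0.8 + 0.5)/3 \approx 0.73$.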

392:

393: Let us define the probability of an email $m$ being sent by a spammer,

394: $P_s(m)$, as the spam probability of its sender's cluster.

395: Similarly, let the probability of an email

396:  $m$ being addressed to users that receive spam, $P_r(m)$, be the

397: average spam probability of all of its recipients' clusters

398: (see Algorithm~\ref{alg:detection}). Our algorithm uses

399: $P_s(m)$ and $P_r(m)$ to compute a number that

400: expresses the chance of email $m$ being spam. We call this number the spam rank of

401: email $m$, denoted by $SR(m)$. The idea is that emails with large values of $P_s(m)$

402: and $P_r(m)$ should have large spam ranks and thus should be classified as spam.

403: Similarly, emails with small values of $P_s(m)$

404: and $P_r(m)$ should receive a low spam rank and be classified as legitimate email.

405:

406: Figure~\ref{fig:spamRank} shows a graphical representation of

407: the computation of an email spam rank. We first normalize the probabilities

408: $P_s(m)$ and $P_r(m)$, dividing them by $\sqrt{2}$, so that the diagonal of

409: the square region defined in the bi-dimensional space  is equal to 1

410: (see Figure~\ref{fig:spamRank}-left). Each email $m$ can be represented as

411:  a point in this square. The spam rank of $m$, $SR(m)$, is

412: then defined as the  length  of the segment starting at

413: the origin (0,0) and ending at the projection of $m$

414:  on the diagonal of the square (see Figure~\ref{fig:spamRank}-right). Note that

415: the spam rank varies between 0 and 1.
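
Under the reading that the normalization divides each probability by
$\sqrt{2}$ and that the diagonal in question runs from the origin to the
upper-right corner of the square, the projection admits a simple closed form.
The normalized point is $(P_s(m)/\sqrt{2},\, P_r(m)/\sqrt{2})$ and the unit
vector along the diagonal is $(1/\sqrt{2},\, 1/\sqrt{2})$, so that
\[
  SR(m) = \frac{P_s(m)}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}
        + \frac{P_r(m)}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}
        = \frac{P_s(m) + P_r(m)}{2} .
\]
For example, the hypothetical values $P_s(m) = 0.9$ and $P_r(m) = 0.7$ give
$SR(m) = 0.8$. The extremes $P_s(m) = P_r(m) = 0$ and $P_s(m) = P_r(m) = 1$
give $0$ and $1$, consistent with the stated range.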

416:

417: The spam rank $SR(m)$ is then used to classify $m$ as follows: if it is greater

418: than a given threshold $\omega$, the email is classified as spam; if it is

419: smaller than $1 - \omega$, it is classified as legitimate email. Otherwise, we

420: cannot precisely classify the email, and we rely on the initial classification

421: provided by the auxiliary detection algorithm. The parameter $\omega$ can be tuned

422: to determine the precision that we expect from our classification.

423: Graphically, emails are classified according to the marked regions shown in

424: Figure~\ref{fig:spamRank}-left. The two triangles, with identical size

425: and height $\omega$, represent the regions where

426: our algorithm is able to classify emails as either spam (upper right) or legitimate

427: email (lower left).
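
The decision rule described above can be summarized compactly as follows,
where we assume $\omega > 0.5$ so that the two regions do not overlap:
\[
  class(m) = \left\{ \begin{array}{ll}
    \mbox{spam}, & \mbox{if } SR(m) > \omega \\
    \mbox{legitimate}, & \mbox{if } SR(m) < 1 - \omega \\
    mClass, & \mbox{otherwise}
  \end{array} \right.
\]
As a purely illustrative choice, with $\omega = 0.85$ a message with
$SR(m) = 0.9$ is classified as spam, a message with $SR(m) = 0.1$ is classified
as legitimate, and a message with $SR(m) = 0.5$ keeps the classification
$mClass$ given by the auxiliary algorithm.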

428:

429: %%% Local Variables:

430: %%% mode: latex

431: %%% TeX-master: t

432: %%% End:

433:

434: \section{Experimental Results}

435: \label{sec:results}

436:

437: In this section we describe our experimental results. We first present some important details of

438: our workload, followed by the quantitative results of our approach, compared to others.

439:

440: \subsection{Workload}

441: \label{sec:workload}

442:

443: Our email workload consists of anonymized and sanitized SMTP logs of incoming

444: emails to a large university in Brazil, with around 22 thousand students. The

445: server  handles all emails coming from domains

446: outside the university, sent to students, faculty and staff with

447: email addresses under the university's domain name.\footnote{Only the

448: emails addressed to two out of over 100 university subdomains (i.e.,

449: departments, research labs, research groups) do not pass through the central

450: server.}

451:

452: The central email server runs Exim email software~\cite{exim},

453: the Amavis virus scanner~\cite{amavis} and the Trendmicro

454: Vscan anti-virus tool~\cite{antivirus}. A set of pre-acceptance

455: spam filters (e.g., blacklists and reverse DNS checks) blocks about 50\% of the total traffic

456: received by the server.

457:

458: The messages not rejected by the pre-acceptance  tests are directed to

459: Spam-Assassin~\cite{spamassassin}. Spam-Assassin is a popular spam filtering

460: tool that detects spam messages based on a changing set of user-defined

461: rules. These rules assign scores to each email received based on the presence

462: in the  subject or in the email body of one or more pre-categorized

463: keywords. Spam-Assassin also uses other rules based on message size

464: and encoding. Highly ranked messages according to these criteria are flagged

465: as spam.

466:

467: We analyze an eight-day log collected between 01/19/2004 and 01/26/2004.

468: Our logs store the header of each email (i.e., sender, recipients, size, date, etc.)

469: that passes the pre-acceptance filters, along with the results  of the tests performed by

470: Spam-Assassin and the virus scanners. We also have the full body of the messages

471: that were classified as spam by Spam-Assassin. Table~\ref{bst} summarizes our

472: workload.

473: %~\footnote{Emails that are

474: %flagged with virus or addressed to recipients in a domain name outside the

475: %university, for which the central email server is a published relay, are not

476: %included our analysis. These emails correspond to only 0.8\% of all

477: %logged data.}.

478:

479: \begin{table}[th!]

480:   \centering

481:   \footnotesize

482:   \begin{tabular}{|l|l|l|l|} \hline

483:     {\bf Measure}   & {\bf Non-Spam}  & {\bf Spam} &  {\bf Aggregate} \\ \hline

484:     \# of emails & 191,417 & 173,584& 365,001    \\ \hline

485:     Size of emails& 11.3 GB& 1.2 GB& 12.5 GB \\ \hline

486:     \# of distinct senders & 12,338& 19,567& 27,734 \\ \hline

487:     \# of distinct recipients & 22,762& 27,926& 38,875\\ \hline

488:   \end{tabular}

489:   \caption{Summary of the Workload}

490:   \label{bst}

491: \end{table}

492:

493: By visually inspecting the list of sender {\em user names}~\footnote{The part

494:   before @ in email addresses.} in the  spam component of our workload, we

495: found that a large number of them corresponded to a seemingly random sequence

496: of characters, suggesting that spammers tend to change user names as an evasion

497: technique. Therefore, for the experiments presented below we identified the

498: sender of a message by his/her domain while recipients were identified by their

499: full address, including both domain and user name.

500:

501: \subsection{Classification Results}

502: \begin{figure}[th!]

503:   \centering

504:   \includegraphics[width=200pt]{plots/tauxncomm.eps}

505:   \caption{Number of Email User Clusters and Beta CV  vs. $\tau$.}

506:   \label{fig:betacvxncom}

507: \end{figure}

508:

509: The results shown in this section were obtained through the simulation of

510: the algorithm proposed here over the set of messages in our logs. The

511: implementation of the simulator made use of an inverted-list

512: approach~\cite{moffat} for storing information about senders, recipients

513: and clusters that is effective both in terms of memory and processing

514: time. Our simulations were executed on a commodity workstation

515: (Intel Pentium\textregistered{} 4, 2.80\,GHz, with 500\,MBytes) and

516: the simulator was able to classify 20 messages per second. This is far faster

517: than both the average rate at which messages usually arrive and the peak rate

518: observed over the workload collection time~\cite{gomes}.
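
As a rough consistency check based only on the totals in Table~\ref{bst}, and
assuming the log covers eight full days, the average arrival rate of messages
that pass the pre-acceptance filters is approximately
\[
  \frac{365{,}001\ \mbox{messages}}{8 \times 86{,}400\ \mbox{s}} \approx 0.53\ \mbox{messages/s} ,
\]
which is well below the 20 messages per second handled by the simulator. This
back-of-the-envelope figure says nothing about the peak rate, for which we
rely on the measurements reported in~\cite{gomes}.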

519:

520: %representing communities

521: %The simulator was able to process our one week log i

522:

523: \begin{figure}[th!]

524:   \centering

525:   \subfigure[Bin size = 0.10]{\includegraphics[width=100pt]{plots/spamSurface_0.10_0.5.eps}}

526:   \subfigure[Bin size = 0.25]{\includegraphics[width=100pt]{plots/spamSurface_0.25_0.5.eps}}

527:   \caption{Number of Spam Messages by Varying Message Spam Probabilities for

528:     Different Bin Sizes.}

529:   \label{fig:message_classification}

530: \end{figure}

531:

532: %\begin{figure}[th!]

533: %    \centering

534: %    \subfigure[Sender identification]{\includegraphics[width=100pt]{plots/dls-sna.cdf.eps}}

535: %    \subfigure[Recipient identification]{\includegraphics[width=100pt]{plots/cls-sna.cdf.eps}}

536: %\caption{Distribution of the number of user each user has been in contact from

537: %  1 to 100 (96\% of spammers and 99\% of legitimate users).}

538: %\label{fig:cdf-ls}

539: %\end{figure}

540:

541: %The value of the parameter $\kappa$ is important in order to decide whether or

542: %not to try to classify each user depending on their

543: %contacts. Figure~\ref{fig:cdf-ls} shows probability distribution of the number

544: %of users each user has been in contact with. Legitimate traffic presents

545: %preponderantly lower sized values when compared with the values for spam

546: %traffic. In fact, we found that, only for sizes over $500$ legitimate senders

547: %surpass spammers. If we set $\kappa$ to $4$ we are able to indentify clusters

548: %for $60\%$ of the sender nodes in spam traffic.

549: %In legitimate traffic, $\kappa$ seems to be less effective, mainly

550: %because of great number of senders who send only one email ($56\%$).

551: %For $\kappa = 4$, we try to classify $68\%$ of spam recipients

552: %and $20\%$ of non spam recipients.

553:

554: The number and quality of the clusters generated through our similarity

555: measure are the direct result of the chosen value for the threshold $\tau$ (see

556: Section~\ref{sec:algorithm}). In order to determine the best parameter value,

557: the simulation was executed several times with varying $\tau$.

558:

559: Figure~\ref{fig:betacvxncom} shows how the number of clusters

560: and beta CV~\footnote{Beta CV is the ratio of the intra-cluster CV to the inter-cluster

561:       CV and assesses the quality of the clusters generated. The lower the beta CV,

562:       the better the quality of the grouping obtained~\cite{livrovirgilio}.}

563:     vary with $\tau$. There is

564: one clear point of stabilization of the curve (i.e., a plateau) at $\tau = 0.5$

565: and that is the value we adopt for the remainder of the paper. Although other

566: stabilization points occur for values of $\tau$ above $0.5$, the lowest of such

567: values seems to be the most appropriate for our experiments. The reason for

568: that is that this value of $\tau$ is the one that generates the smallest stable number

569: of clusters, i.e., clusters with more elements, and that allows us to evaluate

570: better the beneficial effects that clustering senders and recipients may have.

571: Moreover, while analyzing the beta CV we are able to see that the quality of

572: the clustering for all values  $\tau>0.4$ is approximately the same.

573: %{\bf what's the meaning of $\tau=0.5$? two lists that have half their

574: %  elements in common? }

575: % Oi luis, isso estah na nova secao 3. Um sujeito soh eh alocado em uma

576: % comunidade se o coseno dele com a tal comunidade for maior que tau. Talvez

577: % lah nao esteja bem explicado. Caso esse seja o problema nos avise que

578: % tentaremos reescrever de forma que fique mais apropriado.

579:

580: %\input{classificationTable}

581:

582: %An important advantage of our approach is that it learns how to

583: %classify messages using historical information provided by the auxiliary

584: %detection algorithm. In order to asses the rationality and the qualities of the

585: %clusters created, suppose we classify each message in terms of their sender and

586: %recipient classes using a threshold method. The sender of a message is

587: %classified as spam if its $P_s(m)$ is higher than $0.8$ and legitimate if it

588: %lower than $0.2$ otherwise it is undefined. The recipient is classified

589: %analogously based on $P_r(m)$.

590:

591: %Table~\ref{tab:sigma-omega} shows how this classification would behave, the

592: %number of messages that fall into each classification pair and the probable

593: %classification for that pair in term of relative number of messages exchanged

594: %between the two users. It is interesting to note that the percentage  of

595: %messages exchanged between spam senders and legitimate recipients and

596: %vice-versa is very small ($2.14\%$ of the messages). This reinforces our belief

597: %in the appropriate working of our cluster identification algorithm and the

598: %parameter $\tau$ chosen. Moreover, other expected rules such as spam sender

599: %sends spam messages to spam recipients are also present in this  analysis.

600:

One of the hypotheses of our algorithm is that we can group spam messages in
terms of the probabilities $P_s(m)$ and
$P_r(m)$. Figure~\ref{fig:message_classification} shows the fraction of spam
messages for different values of $P_s(m)$ and $P_r(m)$, grouped according to
a discretization of the full space represented in the plot. The full space
is subdivided into smaller squares of equal size, called bins.
%The bin size determines the length of one
%of its side.
Clearly, spam and legitimate messages are indeed located in the regions (top and bottom,
respectively) hypothesized in Section~\ref{sec:algorithm}. There is, however, a region in the middle where
we cannot determine the classification of the messages based on the computed
probabilities. This is why it becomes necessary to vary $\omega$. One should
adjust $\omega$ according to the level of confidence one has in the auxiliary algorithm.
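
The binning can be written compactly. As a sketch, assume bins of side
$\Delta$ indexed by $(i,j)$; the symbols $\Delta$, $B_{ij}$ and $f_{ij}$ are
introduced here only for illustration, and ``spam'' is assumed to mean the
label assigned by the auxiliary algorithm:
\begin{eqnarray*}
  B_{ij} & = & \{\, m : i\Delta \le P_s(m) < (i+1)\Delta, \\
         &   & \;\;\; j\Delta \le P_r(m) < (j+1)\Delta \,\}, \\
  f_{ij} & = & \frac{|\{\, m \in B_{ij} : m \mbox{ is spam} \,\}|}{|B_{ij}|} ,
\end{eqnarray*}
with $f_{ij}$ defined only for non-empty bins and the last bin in each
direction taken to include the value $1$. Bins with $f_{ij}$ close to $1$ or
to $0$ are those in which the pair $(P_s(m), P_r(m))$ is informative; bins with
intermediate $f_{ij}$ correspond to the undetermined middle region.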

614:

Figure~\ref{fig:message_classification} also shows that differentiating between
senders and recipients when detecting spam can be more effective than the
simple choice we use in this paper. Messages addressed to recipients with
high $P_r(m)$ tend to be spam more frequently than messages with the same
value of $P_s(m)$. Analogously, messages with low $P_s(m)$ have a higher probability
of being legitimate. Ways of using this information in our algorithm are
part of an ongoing research effort that we intend to pursue in future extensions.

622:

623: %Figure~\ref{fig:message_classification} shows that differentiation between

624: %senders and recipients for detecting spam can be better explored than the

625: %simple approach we use in this paper. Messages addressed to recipients that

626: %have high $P_r(m)$ tend to be spam more frequently than messages with the same

627: %value of $P_s(m)$. Analogously, messages with low $P_s(m)$ have higher probability

628: %of being legitimate messages. Ways of using this information in our algorithm are

629: %a current research effort.

630:

Our algorithm makes use of an auxiliary spam detection algorithm, such as
SpamAssassin. Therefore, we need to evaluate how frequently we maintain the
same classification as that algorithm. Figure~\ref{fig:omega} shows
the percentage of messages that received the same classification and the total
number of classified messages in our simulation as $\omega$ varies. The
difference between these curves corresponds to the set of messages that
were classified differently from the original classification provided.
There is a clear tradeoff between the total number of messages that
are classifiable and the agreement with the classification provided
by the original classifier algorithm.

641:

642: %A value for $\omega$ that represents a

643: %good tradeoff and that we will consider for the rest of this paper is $76\%$

644: %which has a $96\%$ change of correctly classifying a message against the {\it oraculum}

645: %and can classify $15\%$ of the messages.

646:

\begin{figure}[th]
  \centering
%  \subfigure[Messages correctly classified by varying $\sigma$ with $\omega =
%  50\%$] {\includegraphics[width=100pt]{plots/sigma.eps}}
  \includegraphics[width=150pt]{plots/omega_0.5.eps}
  \caption{Messages Classified in Accordance With the Auxiliary Algorithm and the Total
  Number of Messages Classified by Varying $\omega$}
  \label{fig:omega}
\end{figure}

656:

In another experiment, we simulated a different algorithm, described in~\cite{priority}, that also makes
use of history information provided by an auxiliary spam detector.
This approach tries to classify messages based on the
historical properties of their senders. We built a simulator for this algorithm
and executed it against our data set. The results show that it was
able to classify $85.11\%$ of the messages in accordance
with the auxiliary algorithm. It is important to note that our algorithm,
on the other hand, can be tuned by properly setting the threshold $\omega$: the higher
the parameter $\omega$, the more closely our classification agrees with the
auxiliary classification.

667:

We believe that the differences between the original classification and the
classification proposed for high $\omega$ values are generally due to
misclassifications by the auxiliary algorithm. In our data set we have access to
the full body of the messages that were originally classified as spam. Therefore, we
can evaluate a fraction of the total number of false positives (messages that
the auxiliary algorithm classifies as spam and our algorithm classifies as
legitimate) that were generated by the auxiliary algorithm. This is
important since there is a common belief that the cost of false positives is
higher than the cost of false negatives~\cite{gerf}.

677:

Each of the possible false positives was manually evaluated by three people
to determine whether the message was indeed spam. Table~\ref{tab:manual}
summarizes the results for $\omega = 0.85$, for which 879 messages
were manually analyzed ($0.24\%$ of the total number of messages). Our algorithm outperforms the
original classification since it generates fewer false positives. We emphasize
that we cannot similarly determine the quality of classification for the
messages classified as legitimate by the auxiliary algorithm, since we do not have
access to the full body of those messages. Moreover, due to the cost of manual classification,
we cannot afford to classify all of the messages classified as spam by
the auxiliary algorithm.

688:

\begin{table}[th!]
  \centering
  \footnotesize
  \begin{tabular}{|l|c|} \hline
    \multicolumn{1}{|c|}{\bf Algorithm}   & \multicolumn{1}{c|}{\bf \% of Misclassifications} \\ \hline
    Original Classification & $60.33\%$ \\ \hline
    %Sender-based & $xx\%$ \\ \hline
    Our approach & $39.67\%$ \\ \hline
  \end{tabular}
  \caption{Possible False Positives Generated by the Approaches Studied.}
  \label{tab:manual}
\end{table}
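
The figures reported for the manual evaluation can be cross-checked with
simple arithmetic. Assuming that the percentages in Table~\ref{tab:manual} are
fractions of the 879 manually inspected messages (one reading of the table,
which the text does not state explicitly), the approximate message counts are
\[
  0.6033 \times 879 \approx 530,
  \qquad
  0.3967 \times 879 \approx 349 ,
\]
which add up to the 879 inspected messages. Likewise, if 879 messages
correspond to $0.24\%$ of the data set, the total is roughly
\[
  \frac{879}{0.0024} \approx 3.7 \times 10^{5}
\]
messages, up to the rounding of the reported percentage.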

701:

702: %%% Local Variables:

703: %%% mode: latex

704: %%% TeX-master: "sruti"

705: %%% End:

706:

\section{Related Work}
\label{sec:related-work}

Previous work has focused on reducing the impact of spam.
The approaches to reducing spam can be categorized into pre-acceptance and
post-acceptance methods, based on whether they detect and block
spam before or after accepting messages. Examples of pre-acceptance methods
are black lists~\cite{blacklist2}, gray lists~\cite{greylist}, server
authentication~\cite{spam,authentication} and
accountability~\cite{solvingspam}. Post-acceptance methods are mostly based on
information available in the body of the messages and include Bayesian
filters~\cite{sahami98bayesian} and collaborative
filtering~\cite{zhou03approximate}.

720:

Recent papers have focused on techniques for combating spam based
on characteristics of graph models of email
traffic~\cite{emailnetcombat,spammachines}. These techniques model
email traffic as a graph and detect spam and spam attacks,
respectively, in terms of graph properties. In~\cite{emailnetcombat} a graph is
created representing the email traffic captured in the mailbox of individual
users. The subsequent analysis is based on the fact that such a
network possesses several disconnected components. The clustering coefficient
of each of these components is then used to characterize messages as spam or
legitimate. Their results show that 53\% of the messages were
precisely classified using the proposed approach.
In~\cite{spammachines} the authors detect machines that
behave as spam senders by analyzing a border flow graph of sender and recipient
machines. In~\cite{priority}, the authors propose a new scheme for
handling spam. It is a post-acceptance mechanism that processes
mail suspected of being spam at reduced priority, compared to
the priority assigned to messages classified as legitimate. The
proposed mechanism~\cite{priority} works in conjunction with some sort
of mail filter that provides the past history of mail received by a server.

744:

None of the existing spam filtering mechanisms is
infallible~\cite{priority,gerf}. Their main problems are false positives and
wrong mail classification. In addition, filters must
be continuously updated to capture the multitude of mechanisms constantly
introduced by spammers to evade filtering. The algorithm presented in
this paper aims at improving the effectiveness of spam filtering mechanisms
by reducing false positives and by providing information that helps those mechanisms
tune their collection of rules.

753:

754: %%% Local Variables:

755: %%% mode: latex

756: %%% TeX-master: "sruti.tex"

757: %%% End:

758:

\section{Conclusions and Future Work}
\label{sec:concl-future-works}

In this paper we proposed a new spam detection algorithm based on the structural similarity between the contact lists of email users. The idea is that contact lists, integrated
over a suitable amount of time, are much more stable identifiers of email users than id names, domains
or message contents, which can all be made to vary quickly and widely.
The major drawback of our approach is that our algorithm can only group users based on their structural
similarity and has no way of determining by itself whether such vector clusters correspond to spam or legitimate email. For this reason it must work in tandem with an original classifier.
Given the information from that classifier, we have shown that we can successfully group spam and legitimate email users separately and that this structural inference can improve the quality of other spam detection algorithms.

768:

Specifically, we have implemented a simulator based on data collected from the
main SMTP server of a major university in Brazil that uses SpamAssassin. We
have shown that our algorithm can be tuned to produce classifications similar
to those of the original classifier algorithm and that, for a certain set of
parameters, it was capable of correcting false positives generated by
SpamAssassin in our workload.

775:

There are several improvements and developments that were not explored here but promise
to reinforce the strength of our approach. We intend to explore these in future work. We observe that structural similarity gives us a basis for the time correlation of similar addresses, and thus for following the time evolution of spam sender techniques in ways that suitably factor out the enormous variability of their apparent identifiers. Finally, we note that the probabilistic basis of our approach lends itself naturally to the evolution of users' classifications (say, through Bayesian inference), both through collaborative filtering using user feedback and from information derived from other algorithmic classifiers.

778:

779:  %We also intend to explore better ways of using the probabilities $P_s(m)$ and

780: %$P_r(m)$ to separate out spam, namely  by using more sophisticated delimiters

781: %that account  ...

782:

783: \bibliographystyle{acm}

784: \begin{thebibliography}{10}

785:

786: \bibitem{amavis}

787: Amavis.

788: \newblock http://www.amavis.org, 2004.

789:

790: \bibitem{sizecost}

791: {\sc Atkins, S.}

792: \newblock Size and cost of the problem.

793: \newblock In {\em 56th IETF Meeting\/} (March 2003).

794:

795: \bibitem{authentication}

796: {\sc Baker, H.~P.}

797: \newblock Authentication approaches.

798: \newblock In {\em 56th IETF Meeting\/} (March 2003).

799:

800: \bibitem{emailnetcombat}

801: {\sc Boykin, P.~O., and Roychowdhury, V.}

802: \newblock Personal email networks: An effective anti-spam tool.

803: \newblock http://www.arxiv.org/abs/cond-mat/0402143, February 2004.

804:

805: \bibitem{solvingspam}

806: {\sc Brandmo, H.~P.}

807: \newblock Solving spam by establishing a platform for sender accountability.

808: \newblock In {\em 56th IETF Meeting\/} (March 2003).

809:

810: \bibitem{gerf}

811: {\sc Cerf, V.~G.}

812: \newblock Spam, spim, and spit.

813: \newblock {\em Commun. ACM 48}, 4 (2005), 39--43.

814:

815: \bibitem{spam}

816: {\sc Cranor, L.~F., and LaMacchia, B.~A.}

817: \newblock Spam!

818: \newblock In {\em Communications of the ACM\/} (1998).

819:

820: \bibitem{spammachines}

821: {\sc Desikan, P., and Srivastava, J.}

822: \newblock Analyzing network traffic to detect e-mail spamming machines.

823: \newblock Tech. Rep. 180, Army High Performance Computing Research Center

824:   TECHNICAL REPORT, 2004.

825:

826: \bibitem{exim}

827: Exim internet mailer home page.

828: \newblock http://www.exim.org, 2004.

829:

830: \bibitem{ceas}

831: {\sc Gomes, L.~H., Almeida, R.~B., Bettencourt, L. M.~A., Almeida, V. A.~F.,

832:   and Almeida, J.~M.}

833: \newblock Comparative graph theoretical characterization of networks of spam

834:   and regular email.

835: \newblock http://arxiv.org/abs/cond-mat/0503725, March 2005.

836:

837: \bibitem{gomes}

838: {\sc Gomes, L.~H., Cazita, C., Almeida, J., Almeida, V. A.~F., and Jr., W.~M.}

839: \newblock Characterizing a spam traffic.

840: \newblock In {\em Proc. of the 4th ACM SIGCOMM conference on Internet

841:   measurement\/} (2004).

842:

843: \bibitem{greylist}

844: {\sc Harris, E.}

845: \newblock The next step in the spam control war: Greylisting.

846: \newblock http://projects.puremagic.com/greylisting/, April 2004.

847:

848: \bibitem{messageLabs}

849: {\sc Labs, M.}

850: \newblock Message labs home page.

851: \newblock http://www.messagelabs.co.uk/, 2005.

852:

853: \bibitem{blacklist2}

854: Maps - mail abuse prevention system home page.

855: \newblock http://mail-abuse.org/rbl/getoff.html, 2004.

856:

857: \bibitem{livrovirgilio}

858: {\sc Menasc\'{e}, D., and Almeida, V.}

859: \newblock {\em Capacity Planning for Web Services: metrics, models and

860:   methods}.

861: \newblock Prentice Hall Inc., USA, September 2001.

862:

863: \bibitem{sahami98bayesian}

864: {\sc Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E.}

865: \newblock A bayesian approach to filtering junk {E}-mail.

866: \newblock In {\em Learning for Text Categorization: Papers from the 1998

867:   Workshop\/} (Madison, Wisconsin, USA, 1998), AAAI Technical Report WS-98-05.

868:

869: \bibitem{spamassassin}

870: Spamassassin.

871: \newblock http://www.spamassassin.org, 2004.

872:

873: \bibitem{antivirus}

874: Trend micro home page.

875: \newblock http://www.trendmicro.com, 2004.

876:

877: \bibitem{priority}

{\sc Twining, R.~D., Williamson, M.~M., Mowbray, M., and Rahmouni, M.}

879: \newblock Email prioritization: Reducing delays on legitimate mail caused by

880:   junk mail.

881: \newblock In {\em Proc. Usenix Annual Technical Conference\/} (Boston, MA, June

882:   2004).

883:

884: \bibitem{moffat}

885: {\sc Witten, I.~H., Bell, T.~C., and Moffat, A.}

886: \newblock {\em Managing Gigabytes: Compressing and Indexing Documents and

887:   Images}.

888: \newblock John Wiley \& Sons, Inc., New York, NY, USA, 1994.

889:

890: \bibitem{zhou03approximate}

891: {\sc Zhou, F., Zhuang, L., Zhao, B., Huang, L., Joseph, A., and Kubiatowicz,

892:   J.}

893: \newblock Approximate object location and spam filtering on peer-to-peer

894:   systems.

895: \newblock In {\em Proc. of Middleware\/} (June 2003).

896:

897: \end{thebibliography}

898:

899: \end{document}

900: