1: \documentclass[letterpaper,twocolumn,10pt]{article}
2:
3: \usepackage{endnotes}
4: \usepackage{times}
5: \usepackage{epsfig}
6: \usepackage{subfigure}
7: \usepackage{graphics}
8: \usepackage{graphicx}
9: \usepackage{multirow}
10: \usepackage[latin1]{inputenc}
11: \usepackage{algorithmic}
12: \usepackage[plain]{algorithm}
13: %\usepackage[margin=2.7cm]{geometry}
14: %\usepackage[margin=2.7cm,left=4.6pc,right=4.6pc,top=4.0pc,bottom=4.2pc]{geometry}
15:
16: \begin{document}
17:
18: \date{}
19:
20: %\title{\Large \bf A Community-Based Spam Detection Algorithm}
21:
22: \title{Improving Spam Detection Based on Structural Similarity}
23:
24: \newcommand{\institutions}{\mbox{{\hspace*{-.5cm}
25: \begin{minipage}{7cm}
26: \begin{center}
27: $^{\dag}$ Computer Science Dept.\\
28: Universidade Federal de Minas Gerais \\
29: Belo Horizonte - Brazil \\
30: \{lhg, fernando, barra, virgilio, jussara\}@dcc.ufmg.br
31: \end{center}
32: \end{minipage}
33: \begin{minipage}{7cm}
34: \begin{center}
35: $^{\ddag}$ Computer and Computational Sciences\\
36: Los Alamos National Laboratory \\
37: Los Alamos - USA\\
38: lmbett@lanl.gov
39: \end{center}
40: \end{minipage}}}}
41:
42:
43: \author{
44: {\rm Luiz H. Gomes$^{\dag}$\footnote{Luiz H. Gomes is supported by Banco Central do
45: Brasil.}, Fernando D. O. Castro$^{\dag}$, Rodrigo B. Almeida$^{\dag}$, }\\
46: {\rm Luis M. A. Bettencourt$^{\ddag}$, Virgílio A. F. Almeida$^{\dag}$, Jussara M. Almeida$^{\dag}$}
47: \\[0.3cm]
48: \institutions
49: }
50:
51: %\author{
52: % {\rm Luiz H. Gomes\thanks{Luiz H. Gomes is supported by Banco Central do
53: % Brasil.}, Fernando D. O. Castro, }\\
54: % {\rm Virgílio A. F.Almeida, Jussara M. Almeida, Rodrigo B. Almeida}\\
55: % Computer Science Dept.,Universidade Federal de Minas Gerais, Belo Horizonte - Brazil \\
56: % \{lhg,fernando, virgilio,jussara, barra\}@dcc.ufmg.br\\
57: %\and
58: % {\rm Luis M. A. Bettencourt }\\
59: % Computer and Computational Sciences, Los Alamos National Laboratory, Los Alamos - USA\\
60: % lmbett@lanl.gov\\
61: %}
62:
63: \maketitle
64:
65: \thispagestyle{empty}
66:
67: \subsection*{Abstract}
68:
69:
70:
71:
72: %Spam senders use a multitude of strategies, based on knowledge of current filters, to evade detection, imposing serious limitations on spam recognition by analyzing content or mapping senders' techniques.
73:
We propose a new detection algorithm that uses structural relationships
between senders and recipients of email as the basis for the identification of
spam messages. Senders and recipients are represented as vectors in their
reciprocal spaces. A measure of similarity between vectors is constructed and
used to group users into clusters. Knowledge of their classification as past
senders/recipients of spam or legitimate mail, coming
from an auxiliary detection algorithm, is then used to label these
clusters probabilistically. The similarity between the sender and recipient sets of a
new message and the center vectors of the clusters is then used to assess the
possibility of that message being legitimate or spam. We show that the proposed
algorithm is able to correct part of the false positives (legitimate messages
classified as spam) using a testbed consisting of a one-week SMTP log.
87:
88: \section{Introduction}
89: \label{sec:introduction}
90:
91: %Spam is quickly becoming the leading threat to the viability of email
92: %as a means of communication and a leading source of fraud and other criminal
93: %activity worldwide. The vast majority of spam messages presently originate
94: %in the USA and China, hosted by well known ISPs and generated by identified
95: %individuals~\cite{spamhaus} Nevertheless, an increased effort in criminal
96: %investigation and waves of high profile legislation have not yet succeeded at
97: %reducing the
98:
99: %It is often said that the problem of spam email is that it is an extremely
100: %asymmetric threat. While it is technically easy and very cheap to send a spam
101: %email it requires sophisticated organization and much higher costs at the
102: %receiving end to sort out legitimate emails from junk.
103:
104: The relentless rise in spam email traffic, now accounting for about
105: $83\%$ of all incoming messages, up from $24\%$ in January
106: 2003~\cite{messageLabs}, is becoming one of the greatest threats to the
107: use of email as a form of communication.
108:
The greatest problem in detecting spam stems from active adversarial efforts to thwart
classification. Spam senders use a multitude of techniques, based on knowledge of
current detection algorithms, to evade detection. These techniques
range from changes in the way text is written (so that it cannot be directly
analyzed computationally but can still be understood naturally by humans)
to frequent changes in other elements, such as user names, domains, and subjects.
Therefore, choosing good spam identifiers is becoming increasingly difficult.
116:
117: %Other elements that
118: %change frequently are the message contents and subject lines. Detection of spam
119: %messages based on these attributes requires that the filters are able to match,
120: %in breadth and speed, this variability. Because the space of possibilities is
121: %truly immense this always leaves spam detection algorithms playing catch up in
122: %a never ending evolutionary arms race.
123:
In the light of this enormous variability, the question then is: what are the
identifiers of spam that are most costly to change, from the point of view of
the sender? The limitations of attempts to recognize spam by analyzing content
are clear~\cite{gerf}. Content-based techniques~\cite{sahami98bayesian,
zhou03approximate, spamassassin} have to cope with the constant changes in
the way spammers generate their solicitations. The structure of the target
space for these solicitations tends, however, to be much more stable, since
spam senders still need to reach recipients, even if under forged identifiers,
in order to be effective. Specifically, by structure we mean the space of
recipients targeted by a spam sender, as well as the space of senders that
target a given recipient, i.e., the contacts of a user. The contact lists, or
subsets thereof, can then be thought of as a signature of spam senders and
recipients. Additionally, by constructing a similarity measure in these
spaces we can track how lists evolve over time, by addition or removal of
addresses.
139:
In this paper, we propose an algorithm for spam detection that uses structural
relationships between senders and recipients as the basis for the
identification of spam messages. The algorithm must work in conjunction with
another spam classifier (hereafter called the auxiliary algorithm), which is
needed to produce spam or legitimate mail tags on
past senders and recipients; these tags are in turn used to infer new ones through
structural similarity.
The key idea is that the lists spammers and legitimate users send messages to,
as well as the lists from which they receive messages, can be used as the
identifiers of classes of email traffic~\cite{priority, ceas}.
We will show that applying our structural
algorithm to the determinations of the initial classifier
corrects a number of false positives.
152:
153: %The statistical properties of the lists created by spammers and legitimate
154: %users allow us to separate the messages in classes~\cite{ceas,
155: % priority}. Moreover the grouping of both senders and recipients in
156: %communities, according to a similarity measure in the contact list space,
157: %allows us to generalize the historical information of messages exchanged by
158: %those communities. The final classification provided by our algorithm outperforms
159: %the algorithm used as {\it oraculum}.
160: % and also other symbiotic techniques~\footnote{Techniques that are used coupled with other ones.}.
161:
162: This paper is organized as follows: Section~\ref{sec:modeling} presents the
163: methodology used to handle email data. Our structural algorithm is described in
164: Section~\ref{sec:algorithm}. We present the characteristics of our
example workload in Section~\ref{sec:results}, as well as the classification results
166: obtained with our algorithm over this set. Related work is presented in
167: Section~\ref{sec:related-work} and conclusions and future work in
168: Section~\ref{sec:concl-future-works}.
169:
170: % Caveats:
171: % \begin{itemize}
172: % \item such type of algorithms need time to be trained
173: % \item one only ever has access to incomplete lists of all the senders/recipients contacts, by measuring over a period of time.
174: % \item the algorithm may become memory intensive for large lists.
175: % \end{itemize}
176:
177: \section{Modeling Similarity Among Email Senders and Recipients}
178: \label{sec:modeling}
179:
Our proposed spam detection algorithm exploits the structural
similarities that exist in groups of senders and recipients,
as well as in the relationships established through the emails
exchanged between them. This section introduces our modeling of
individual email users and a metric to express the similarity
among different users. It then extends the modeling to account for
clusters of highly similar users.
187: % The notations and definitions provided are used in the
188: % presentation of our detection algorithm in the next section.
189:
Our basic assumption is that, in both
legitimate email and spam traffic, users have a defined list of peers
they often have contact with (i.e., they send/receive an email
to/from). In legitimate email traffic, contact lists are a consequence of
the social relationships on which users' communications are
based. In spam traffic, on the other hand, the lists used by spammers
to distribute their solicitations are created for business interest
and, generally, do not reflect any form of social interaction.
A user's contact list certainly may change over time. However,
we expect it to be much less variable than other characteristics
commonly used for spam detection, such as
sender user name, presence of certain keywords in the email content,
and encoding rules. In other words, we expect contact lists to be
more effective in identifying spam and, thus, we use them as
the basis for developing our algorithm.
205:
206: We start by representing an email user as a vector in a
207: multi-dimensional conceptual space created with all possible
208: contacts. We represent email senders and recipients separately.
209: We then use vectorial
210: operations to express the similarity among multiple senders (recipients),
211: and use this metric for clustering them.
212: Note that the term email user is
213: used throughout this work to denote any identification of
an email sender/recipient (e.g., email address, domain name, etc.).
215:
Let $N_r$ be the number of distinct recipients. We represent
sender $s_i$ as an $N_r$-dimensional vector, $\vec{s_i}$,
defined in the conceptual space created by the email recipients being
considered. The $n$-th
dimension (representing recipient $r_n$) of $\vec{s_i}$ is defined as:
\begin{eqnarray}
\vec{s_i}[n] = \left\{ \begin{array}{ll}
1, & \mbox{if } s_i \rightarrow r_n \\
0, & \mbox{otherwise} \\
\end{array}
\right.,
\end{eqnarray}
where $s_i \rightarrow r_n$ indicates that sender $s_i$ has sent at least one email to
recipient $r_n$.
230:
Similarly, we define $\vec{r_i}$ as an $N_s$-dimensional vector representation
for the recipient $r_i$, where $N_s$ is the number of distinct senders being
considered. The $n$-th dimension of this vector is set to $1$ if recipient
$r_i$ has received at least one email from $s_n$, and to $0$ otherwise.
235:
236:
237: % Jussara: A frase abaixo deveria ir ou pra related work ou para algoritmo
238: % Previous~\cite{emailnetwork,dialoginemail,emailspectroscopy,inferring communities}
239: %studies showed that we can group users, web pages and others based on their
240: % email contacts, common interests and on links on Web
241: % pages.
242:
243:
We next define the similarity between two senders $s_i$ and $s_j$
as the cosine of the angle between their vector representations
($\vec{s_i}$ and $\vec{s_j}$). The similarity is computed as follows:
\begin{eqnarray} \label{similarity}
sim(s_i,s_j) = \frac{\vec{s_i} \circ \vec{s_j}}{|\vec{s_i}||\vec{s_j}|} = \cos(\vec{s_i},\vec{s_j}) ,
\end{eqnarray}
where $\vec{s_i} \circ \vec{s_j}$ is the inner product of the vectors and
$|\vec{s_i}|$ is the norm of $\vec{s_i}$. Note that this
metric varies from 0, when senders do not share any recipient in their
contact lists, to 1, when senders have identical contact lists and thus
have the same representation. The similarity between two recipients
is defined analogously.
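
As a small illustration (the numbers are hypothetical and not taken from the
workload), consider $N_r = 4$ recipients and two senders with
$\vec{s_1} = (1,1,0,1)$ and $\vec{s_2} = (1,0,0,1)$. Then
\[
sim(s_1,s_2) = \frac{1 \cdot 1 + 1 \cdot 0 + 0 \cdot 0 + 1 \cdot 1}{\sqrt{3}\,\sqrt{2}}
= \frac{2}{\sqrt{6}} \approx 0.82 .
\]
Because the vectors are binary, the inner product counts the recipients shared
by the two senders and each squared norm counts the size of a contact list, so
the metric can equivalently be read as $|R_1 \cap R_2| / \sqrt{|R_1|\,|R_2|}$,
where $R_i$ (a symbol introduced only for this example) denotes the set of
recipients contacted by $s_i$.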
256:
We note that our similarity metric has different interpretations
in legitimate
and spam traffic. In legitimate email traffic, it represents
social interaction with the same group of people, whereas in spam
traffic, a high similarity represents the use of different identifiers by
the same spammer or the sharing of distribution lists by distinct spammers.
263:
Finally, we can use our vectorial modeling approach to represent a
cluster of highly similar users (senders or recipients).
A sender cluster $sc_i$, represented by vector
$\vec{sc_i}$, is computed as the vectorial sum of its elements,
that is:
269: \begin{equation}
270: \vec{sc_i} = \sum_{s \in sc_i}{\vec{s}}.
271: \end{equation}
272:
The similarity between
sender $s_i$ and an existing cluster $sc_j$ can then be directly
assessed by extending Equation~\ref{similarity} as follows:
\begin{eqnarray}
sim(sc_j,s_i) = \left\{ \begin{array}{ll}
\cos(\vec{sc_j} - \vec{s_i}, \vec{s_i}) , & \mbox{if } s_i \in sc_j \\
\cos(\vec{sc_j}, \vec{s_i}) , & \mbox{otherwise} \\
\end{array}
\right.
\end{eqnarray}
We note that the vector representation of a sender $s_i$, and thus the sender cluster to which
it belongs (i.e., with which it shares the greatest similarity), may change over time
as new emails are considered. Therefore, in order to accurately estimate
the similarity
between a sender $s_i$ and a sender cluster $sc_j$ to which $s_i$ currently belongs, we
first remove $s_i$ from $sc_j$, and then take the cosine between the two vectors
($\vec{sc_j} - \vec{s_i}$ and $\vec{s_i}$). This is performed so that
the previous classification of a user does
not influence its reclassification. Recipient clusters and the similarity
between a recipient and a given recipient cluster are defined analogously.
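
For illustration, with hypothetical values, let a cluster $sc$ contain two
senders $s_a$ and $s_b$ with $\vec{s_a} = (1,1,0,1)$ and
$\vec{s_b} = (1,0,0,1)$, so that $\vec{sc} = (2,1,0,2)$ and $|\vec{sc}| = 3$.
For a sender $s_c \notin sc$ with $\vec{s_c} = (0,1,1,0)$ we obtain
\[
sim(sc,s_c) = \cos(\vec{sc},\vec{s_c}) = \frac{1}{3\sqrt{2}} \approx 0.24 ,
\]
whereas for the member $s_a$ the cluster vector is first reduced to
$\vec{sc} - \vec{s_a} = (1,0,0,1)$, giving
\[
sim(sc,s_a) = \cos(\vec{sc} - \vec{s_a},\vec{s_a}) = \frac{2}{\sqrt{2}\,\sqrt{3}} \approx 0.82 .
\]
Without the subtraction, the value for $s_a$ would be
$5/(3\sqrt{3}) \approx 0.96$, inflated by the sender's own contribution to the
cluster vector.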
293:
294:
295: %%% Local Variables:
296: %%% mode: latex
297: %%% TeX-master: "sruti"
298: %%% End:
299:
\section{A New Algorithm for Improving Spam Detection}
301: \label{sec:algorithm}
302:
This section introduces our new email classification
algorithm, which exploits the similarities between
email senders and between email recipients for
clustering and uses historical properties of clusters
to improve spam detection accuracy. Our algorithm is
designed to work together with any existing spam detection
and filtering technique that runs at the ISP level.
Our goal is to provide a significant reduction of
false positives (i.e., legitimate emails wrongly classified as spam),
which can be as high as 15\% in current filters~\cite{sizecost}.
313: %, incurring costs that are hard to measure.
314:
A description of the proposed algorithm is shown
in Algorithm~\ref{alg:detection}.
It runs on each arriving email $m$, taking as input
the classification of $m$, $mClass$, as either spam
or legitimate email, performed by the existing auxiliary
spam detection method. Using the vectorial representation
of email senders, recipients and clusters, as well as the
similarity metric defined in Section~\ref{sec:modeling}, it then determines a
new classification for $m$, which may or may not agree with $mClass$.
The idea is that the classification by the auxiliary method
is used to build an incremental historical knowledge base that becomes
more representative over time. Our algorithm benefits from that and
outperforms the auxiliary one, as shown in Section~\ref{sec:results}.
328:
329: \begin{algorithm}
330: \centering
331: \begin{algorithmic}
332: \FORALL{arriving message $m$}
333: \STATE $mClass = $classification of $m$ by auxiliary detection method;
334: \STATE $sc = $find cluster for $m.sender$;
335: \STATE Update spam probability for $sc$ using $mClass$;
336: \STATE $P_s(m) = $spam probability for $sc$;
337: \STATE $P_r(m) = 0$;
338: \FORALL{recipient $r \in m.recipients$}
339: \STATE $rc = $find cluster for $r$;
340: \STATE Update spam probability for $rc$ using $mClass$;
341: \STATE $P_r(m) = P_r(m) + $spam probability for $rc$;
342: \ENDFOR
343: \STATE $P_r(m) = P_r(m)/size(m.recipients)$
\STATE $SR(m) = $ compute spam rank based on $P_s(m)$ and $P_r(m)$;
\IF{$SR(m) > \omega$}
\STATE classify $m$ as spam;
\ELSIF{$SR(m) < 1 - \omega$}
348: \STATE classify $m$ as legitimate;
349: \ELSE
350: \STATE classify $m$ as $mClass$;
351: \ENDIF
352: \ENDFOR
353: \end{algorithmic}
354: \caption{New Algorithm for Email Classification}
355: \label{alg:detection}
356: \end{algorithm}
357:
In order to improve the accuracy of email classification, our algorithm
maintains sets of sender and recipient clusters, created based on the structural similarity
of different users. The sender (recipient) of an incoming email
is added to the sender (recipient) cluster
that is most similar to it, as defined in Equation (4), provided that their similarity
exceeds a given threshold $\tau$. Thus, $\tau$ defines the minimum similarity
a sender (recipient) must have with a cluster to be assigned to it.
Varying $\tau$ allows us to create more tightly or loosely knit
clusters. If no such cluster can be found, a new single-user cluster is created.
In this case, the sender (recipient) is used as the seed for populating the new
cluster.
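
In compact form, and writing $C$ for the current set of sender clusters
(notation introduced here only for illustration), the assignment of a sender
$s$ can be sketched as
\[
c^{*} = \arg\max_{c \in C} sim(c,s), \qquad
s \mbox{ joins } \left\{ \begin{array}{ll}
c^{*}, & \mbox{if } sim(c^{*},s) > \tau \\
\mbox{a new cluster } \{s\}, & \mbox{otherwise.} \\
\end{array}
\right.
\]
For instance, with $\tau = 0.5$, a sender whose best match has similarity
$0.82$ joins that cluster, while a sender whose best match has similarity
$0.24$ seeds a new cluster. How ties between equally similar clusters are
broken is not fixed by this description; any deterministic rule may be
assumed.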
369:
The sets of recipient and sender clusters are updated at each new email arrival
based on the email sender and list of recipients. Recall that to determine the
cluster to which a previously observed, and thus already clustered, user (sender or recipient)
belongs, we first remove the user from its current cluster and then assess
its similarity to each existing cluster. Thus, single-user clusters tend to
disappear as more emails are processed, except for users that appear only very sporadically.
376:
377: \begin{figure}[th!]
378: \centering
379: \includegraphics[width=200pt]{figures/spamRank.eps}
380: \caption{Spam Rank Computation and Email Classification.}
381: \label{fig:spamRank}
382: \end{figure}
383:
A probability of sending (receiving) spam is assigned to each sender (recipient)
cluster. We refer to this measure simply as the cluster spam probability.
We calculate the spam probability of a sender (recipient)
cluster as the average spam probability of its elements, which, in
turn, is estimated based on the frequency of spam messages sent (received) by each of them
in the past. Therefore, our algorithm uses the result of the email classification performed
by the auxiliary algorithm on each arriving email $m$ ($mClass$ in
Algorithm~\ref{alg:detection}) to continuously update cluster spam probabilities.
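
One way to write this estimate explicitly, reading ``average'' as an
unweighted mean (an assumption), is the following, where $n(s)$ and
$n_{spam}(s)$ are symbols introduced here for illustration and denote the
number of emails from sender $s$ observed so far and the number of those
tagged as spam by the auxiliary algorithm:
\[
P(s) = \frac{n_{spam}(s)}{n(s)}, \qquad
P(sc) = \frac{1}{|sc|} \sum_{s \in sc} P(s) .
\]
As a hypothetical example, a cluster of three senders with 9 of 10, 4 of 5,
and 1 of 2 emails tagged as spam has
$P(sc) = (0.9 + 0.8 + 0.5)/3 \approx 0.73$. Recipient clusters can be treated
in the same way, counting emails received instead of emails sent.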
392:
Let us define the probability of an email $m$ being sent by a spammer,
$P_s(m)$, as the spam probability of its sender's cluster.
Similarly, let the probability of an email
$m$ being addressed to users that receive spam, $P_r(m)$, be the
average spam probability of all of its recipients' clusters
(see Algorithm~\ref{alg:detection}). Our algorithm uses
$P_s(m)$ and $P_r(m)$ to compute a number that
expresses the chance of email $m$ being spam. We call this number the spam rank of
email $m$, denoted by $SR(m)$. The idea is that emails with large values of $P_s(m)$
and $P_r(m)$ should have large spam ranks and thus should be classified as spam.
Similarly, emails with small values of $P_s(m)$
and $P_r(m)$ should receive low spam ranks and be classified as legitimate email.
405:
Figure~\ref{fig:spamRank} shows a graphical representation of
the computation of an email spam rank. We first normalize the probabilities
$P_s(m)$ and $P_r(m)$, dividing them by a factor of $\sqrt{2}$, so that the diagonal of
the square region defined in the two-dimensional space is equal to 1
(see Figure~\ref{fig:spamRank}-left). Each email $m$ can be represented as
a point in this square. The spam rank of $m$, $SR(m)$, is
then defined as the length of the segment starting at
the origin (0,0) and ending at the projection of $m$
on the diagonal of the square (see Figure~\ref{fig:spamRank}-right). Note that
the spam rank varies between 0 and 1.
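
Assuming the two axes of the square are $P_s(m)$ and $P_r(m)$, this
geometric construction reduces to a simple closed form. The normalized point
is $(P_s(m)/\sqrt{2},\, P_r(m)/\sqrt{2})$ and the unit vector along the
diagonal is $(1/\sqrt{2},\, 1/\sqrt{2})$, so the length of the projection is
\[
SR(m) = \frac{P_s(m)}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}
      + \frac{P_r(m)}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}
      = \frac{P_s(m) + P_r(m)}{2} .
\]
For example, hypothetical values $P_s(m) = 0.9$ and $P_r(m) = 0.7$ give
$SR(m) = 0.8$.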
416:
The spam rank $SR(m)$ is then used to classify $m$ as follows: if it is greater
than a given threshold $\omega$, the email is classified as spam; if it is
smaller than $1 - \omega$, it is classified as legitimate email. Otherwise, we
cannot precisely classify the email, and we rely on the initial classification
provided by the auxiliary detection algorithm. The parameter $\omega$ can be tuned
to determine the precision that we expect from our classification.
Graphically, emails are classified according to the marked regions shown in
Figure~\ref{fig:spamRank}-left. The two triangles, with identical size
and height $\omega$, represent the regions where
our algorithm is able to classify emails as either spam (upper right) or legitimate
email (lower left).
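
The decision rule can be summarized as
\[
class(m) = \left\{ \begin{array}{ll}
\mbox{spam}, & \mbox{if } SR(m) > \omega \\
\mbox{legitimate}, & \mbox{if } SR(m) < 1 - \omega \\
mClass, & \mbox{otherwise.} \\
\end{array}
\right.
\]
As a hypothetical example, take $\omega = 0.85$. An email with $SR(m) = 0.90$
is classified as spam and one with $SR(m) = 0.10$ as legitimate, in both cases
regardless of $mClass$; an email with $SR(m) = 0.50$ falls between
$1 - \omega = 0.15$ and $\omega = 0.85$ and therefore keeps the classification
$mClass$. Note that the two conditions are mutually exclusive only for
$\omega \geq 0.5$; for smaller values both can hold, and the order of the
tests in Algorithm~\ref{alg:detection} would then decide.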
428:
429: %%% Local Variables:
430: %%% mode: latex
431: %%% TeX-master: t
432: %%% End:
433:
434: \section{Experimental Results}
435: \label{sec:results}
436:
437: In this section we describe our experimental results. We first present some important details of
438: our workload, followed by the quantitative results of our approach, compared to others.
439:
440: \subsection{Workload}
441: \label{sec:workload}
442:
Our email workload consists of anonymized and sanitized SMTP logs of incoming
emails to a large university in Brazil, with around 22 thousand students. The
server handles all emails coming from domains
outside the university, sent to students, faculty and staff with
email addresses under the university's domain name.\footnote{Only the
emails addressed to two out of over 100 university subdomains (i.e.,
departments, research labs, research groups) do not pass through the central
server.}
451:
The central email server runs the Exim email software~\cite{exim},
the Amavis virus scanner~\cite{amavis} and the Trendmicro
Vscan anti-virus tool~\cite{antivirus}. A set of pre-acceptance
spam filters (e.g., black lists, DNS reversal) blocks about 50\% of the total traffic
received by the server.
457:
The messages not rejected by the pre-acceptance tests are directed to
Spam-Assassin~\cite{spamassassin}. Spam-Assassin is a popular spam filter
that detects spam messages based on a changing set of user-defined
rules. These rules assign scores to each email received based on the presence
of one or more pre-categorized keywords in the subject or in the email
body. Spam-Assassin also uses other rules based on message size
and encoding. Messages ranked highly according to these criteria are flagged
as spam.
466:
We analyze an eight-day log collected between 01/19/2004 and 01/26/2004.
Our logs store the header of each email (i.e., sender, recipients, size, date, etc.)
that passes the pre-acceptance filters, along with the results of the tests performed by
Spam-Assassin and the virus scanners. We also have the full body of the messages
that were classified as spam by Spam-Assassin. Table~\ref{bst} summarizes our
workload.
473: %~\footnote{Emails that are
474: %flagged with virus or addressed to recipients in a domain name outside the
475: %university, for which the central email server is a published relay, are not
476: %included our analysis. These emails correspond to only 0.8\% of all
477: %logged data.}.
478:
479: \begin{table}[th!]
480: \centering
481: \footnotesize
482: \begin{tabular}{|l|l|l|l|} \hline
483: {\bf Measure} & {\bf Non-Spam} & {\bf Spam} & {\bf Aggregate} \\ \hline
484: \# of emails & 191,417 & 173,584& 365,001 \\ \hline
485: Size of emails& 11.3 GB& 1.2 GB& 12.5 GB \\ \hline
486: \# of distinct senders & 12,338& 19,567& 27,734 \\ \hline
487: \# of distinct recipients & 22,762& 27,926& 38,875\\ \hline
488: \end{tabular}
489: \caption{Summary of the Workload}
490: \label{bst}
491: \end{table}
492:
By visually inspecting the list of sender {\em user names}\footnote{The part
before @ in email addresses.} in the spam component of our workload, we
found that a large number of them corresponded to a seemingly random sequence
of characters, suggesting that spammers tend to change user names as an evasion
technique. Therefore, for the experiments presented below, we identified the
sender of a message by its domain, while recipients were identified by their
full address, including both domain and user name.
500:
501: \subsection{Classification Results}
502: \begin{figure}[th!]
503: \centering
504: \includegraphics[width=200pt]{plots/tauxncomm.eps}
505: \caption{Number of Email User Clusters and Beta CV vs. $\tau$.}
506: \label{fig:betacvxncom}
507: \end{figure}
508:
The results shown in this section were obtained by simulating
the proposed algorithm over the set of messages in our logs. The
simulator uses an inverted-list~\cite{moffat} approach for storing
information about senders, recipients
and clusters, which is efficient in terms of both memory and processing
time. Our simulations were executed on a commodity workstation
(Intel Pentium\textregistered{} 4, 2.80 GHz, 500 MB of memory), and
the simulator was able to classify 20 messages per second. This is far faster
than both the average rate at which messages usually arrive and the peak rate
observed over the workload collection time~\cite{gomes}.
519:
520: %representing communities
521: %The simulator was able to process our one week log i
522:
523: \begin{figure}[th!]
524: \centering
525: \subfigure[Bin size = 0.10]{\includegraphics[width=100pt]{plots/spamSurface_0.10_0.5.eps}}
526: \subfigure[Bin size = 0.25]{\includegraphics[width=100pt]{plots/spamSurface_0.25_0.5.eps}}
527: \caption{Number of Spam Messages by Varying Message Spam Probabilities for
528: Different Bin Sizes.}
529: \label{fig:message_classification}
530: \end{figure}
531:
532: %\begin{figure}[th!]
533: % \centering
534: % \subfigure[Sender identification]{\includegraphics[width=100pt]{plots/dls-sna.cdf.eps}}
535: % \subfigure[Recipient identification]{\includegraphics[width=100pt]{plots/cls-sna.cdf.eps}}
536: %\caption{Distribution of the number of user each user has been in contact from
537: % 1 to 100 (96\% of spammers and 99\% of legitimate users).}
538: %\label{fig:cdf-ls}
539: %\end{figure}
540:
541: %The value of the parameter $\kappa$ is important in order to decide whether or
542: %not to try to classify each user depending on their
543: %contacts. Figure~\ref{fig:cdf-ls} shows probability distribution of the number
544: %of users each user has been in contact with. Legitimate traffic presents
545: %preponderantly lower sized values when compared with the values for spam
546: %traffic. In fact, we found that, only for sizes over $500$ legitimate senders
547: %surpass spammers. If we set $\kappa$ to $4$ we are able to indentify clusters
548: %for $60\%$ of the sender nodes in spam traffic.
549: %In legitimate traffic, $\kappa$ seems to be less effective, mainly
550: %because of great number of senders who send only one email ($56\%$).
551: %For $\kappa = 4$, we try to classify $68\%$ of spam recipients
552: %and $20\%$ of non spam recipients.
553:
554: The number and quality of the clusters generated through our similarity
555: measure are the direct result of the chosen value for the threshold $\tau$ (see
556: Section~\ref{sec:algorithm}). In order to determine the best parameter value
557: the simulation was executed several times for varying $\tau$.
558:
Figure~\ref{fig:betacvxncom} shows how the number of clusters
and the beta CV\footnote{Beta CV is the ratio of the intra-cluster CV to the inter-cluster
CV and assesses the quality of the clusters generated. The lower the beta CV,
the better the quality of the grouping obtained~\cite{livrovirgilio}.}
vary with $\tau$. There is
one clear point of stabilization of the curve (i.e., a plateau) at $\tau = 0.5$,
and that is the value we adopt for the remainder of the paper. Although other
stabilization points occur for values of $\tau$ above $0.5$, the lowest of such
values seems to be the most appropriate for our experiments. The reason is
that this value of $\tau$ generates the smallest stable number
of clusters, i.e., clusters with more elements, which allows us to better evaluate
the beneficial effects that clustering senders and recipients may have.
Moreover, the beta CV shows that the quality of
the clustering is approximately the same for all values $\tau>0.4$.
573: %{\bf what's the meaning of $\tau=0.5$? two lists that have half their
574: % elements in common? }
575: % Oi luis, isso estah na nova secao 3. Um sujeito soh eh alocado em uma
576: % comunidade se o coseno dele com a tal comunidade for maior que tau. Talvez
577: % lah nao esteja bem explicado. Caso esse seja o problema nos avise que
578: % tentaremos reescrever de forma que fique mais apropriado.
579:
580: %\input{classificationTable}
581:
582: %An important advantage of our approach is that it learns how to
583: %classify messages using historical information provided by the auxiliary
584: %detection algorithm. In order to asses the rationality and the qualities of the
585: %clusters created, suppose we classify each message in terms of their sender and
586: %recipient classes using a threshold method. The sender of a message is
587: %classified as spam if its $P_s(m)$ is higher than $0.8$ and legitimate if it
588: %lower than $0.2$ otherwise it is undefined. The recipient is classified
589: %analogously based on $P_r(m)$.
590:
591: %Table~\ref{tab:sigma-omega} shows how this classification would behave, the
592: %number of messages that fall into each classification pair and the probable
593: %classification for that pair in term of relative number of messages exchanged
594: %between the two users. It is interesting to note that the percentage of
595: %messages exchanged between spam senders and legitimate recipients and
596: %vice-versa is very small ($2.14\%$ of the messages). This reinforces our belief
597: %in the appropriate working of our cluster identification algorithm and the
598: %parameter $\tau$ chosen. Moreover, other expected rules such as spam sender
599: %sends spam messages to spam recipients are also present in this analysis.
600:
One of the hypotheses of our algorithm is that we can group spam messages in
terms of the probabilities $P_s(m)$ and
$P_r(m)$. Figure~\ref{fig:message_classification} shows the fraction of spam
messages for different values of $P_s(m)$ and $P_r(m)$, grouped according to
a discretization of the full space represented in the plot. The full space
is subdivided into smaller squares of equal size called bins.
607: %The bin size determines the length of one
608: %of its side.
Clearly, spam and legitimate messages are indeed located in the regions (top and
bottom, respectively) that we hypothesized in Section~\ref{sec:algorithm}. There is,
however, a region in the middle where we cannot determine the classification of
the messages based on the computed probabilities. This is why it becomes
necessary to vary $\omega$. One should adjust $\omega$ according to the level
of confidence one has in the auxiliary algorithm.
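
The binning behind Figure~\ref{fig:message_classification} can be sketched in a
few lines of Python. This is only an illustrative sketch under stated
assumptions, not the simulator used in the experiments: the function name
\texttt{spam\_fraction\_per\_bin}, the tuple layout of the input, the choice of
ten bins per axis, and the toy data are all hypothetical.

\begin{verbatim}
```python
from collections import defaultdict


def spam_fraction_per_bin(messages, bins_per_axis=10):
    """Fraction of spam messages in each square bin of the (P_s, P_r) plane.

    messages: iterable of (p_s, p_r, is_spam), with p_s and p_r in [0, 1].
    Returns {(i, j): fraction}, where i indexes the P_s axis and j the
    P_r axis. Bins that contain no message are omitted.
    """
    totals = defaultdict(int)
    spam = defaultdict(int)
    for p_s, p_r, is_spam in messages:
        # Clamp so that a probability of exactly 1.0 falls in the last bin.
        i = min(int(p_s * bins_per_axis), bins_per_axis - 1)
        j = min(int(p_r * bins_per_axis), bins_per_axis - 1)
        totals[(i, j)] += 1
        if is_spam:
            spam[(i, j)] += 1
    return {b: spam[b] / totals[b] for b in totals}


# Toy input (hypothetical values, not taken from the workload).
toy = [
    (0.95, 0.91, True),
    (0.92, 0.97, True),
    (0.05, 0.12, False),
    (0.51, 0.55, True),
    (0.52, 0.57, False),
]
fractions = spam_fraction_per_bin(toy)
```
\end{verbatim}

In this toy input, the bin near the top of the space contains only spam, the
bin near the bottom only legitimate messages, and the bin in the middle is
split evenly, which is the kind of undecided region discussed above.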
614:
Figure~\ref{fig:message_classification} also shows that differentiating between
senders and recipients when detecting spam can be more effective than the
simple choice we use in this paper. Messages addressed to recipients with a
high $P_r(m)$ tend to be spam more frequently than messages with the same
value of $P_s(m)$. Analogously, messages with a low $P_s(m)$ have a higher
probability of being legitimate. How to use this information in our algorithm
is ongoing research that we intend to pursue in future extensions.
622:
623: %Figure~\ref{fig:message_classification} shows that differentiation between
624: %senders and recipients for detecting spam can be better explored than the
625: %simple approach we use in this paper. Messages addressed to recipients that
626: %have high $P_r(m)$ tend to be spam more frequently than messages with the same
627: %value of $P_s(m)$. Analogously, messages with low $P_s(m)$ have higher probability
628: %of being legitimate messages. Ways of using this information in our algorithm are
629: %a current research effort.
630:
Our algorithm makes use of an auxiliary spam detection algorithm, such as
SpamAssassin. Therefore, we need to evaluate how frequently we maintain the
same classification as that algorithm. Figure~\ref{fig:omega} shows the
percentage of messages that received the same classification and the total
number of messages classified in our simulation as $\omega$ varies. The
difference between these curves is the set of messages that
were classified differently from the original classification.
There is a clear tradeoff between the total number of messages that
are classifiable and the agreement with the classification provided
by the original classifier.
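
The two curves in Figure~\ref{fig:omega} can be illustrated with a small Python
sketch. It assumes, for illustration only, that each message receives a single
score in $[0,1]$ (here the mean of $P_s(m)$ and $P_r(m)$), that a message is
labeled spam when the score is at least $\omega$ and legitimate when it is at
most $1-\omega$, and that it is left unclassified otherwise. The actual rule is
the one defined in Section~\ref{sec:algorithm}; all names and the toy data
below are hypothetical.

\begin{verbatim}
```python
def classify(score, omega):
    """Return 'spam', 'legit', or None (unclassified) for a score in [0, 1].

    Assumed rule: spam if score >= omega, legitimate if score <= 1 - omega.
    omega is expected to be above 0.5 so the two regions do not overlap.
    """
    if score >= omega:
        return "spam"
    if score <= 1.0 - omega:
        return "legit"
    return None


def coverage_and_agreement(messages, omega):
    """messages: list of (p_s, p_r, aux_label), aux_label in {'spam', 'legit'}.

    Returns a pair of fractions over all messages:
    (classified at all, classified with the same label as the auxiliary one).
    """
    classified = agreed = 0
    for p_s, p_r, aux_label in messages:
        label = classify((p_s + p_r) / 2.0, omega)
        if label is not None:
            classified += 1
            if label == aux_label:
                agreed += 1
    n = len(messages)
    return classified / n, agreed / n


# Toy input (hypothetical values, not taken from the workload).
toy = [
    (0.95, 0.91, "spam"),
    (0.04, 0.08, "legit"),
    (0.70, 0.76, "legit"),
    (0.30, 0.24, "legit"),
    (0.55, 0.45, "spam"),
]
low = coverage_and_agreement(toy, 0.6)
high = coverage_and_agreement(toy, 0.9)
```
\end{verbatim}

On this toy input, raising $\omega$ from $0.6$ to $0.9$ classifies fewer
messages but removes the disagreement with the auxiliary labels, which mirrors
the tradeoff described above.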
641:
642: %A value for $\omega$ that represents a
643: %good tradeoff and that we will consider for the rest of this paper is $76\%$
644: %which has a $96\%$ change of correctly classifying a message against the {\it oraculum}
645: %and can classify $15\%$ of the messages.
646:
647: \begin{figure}[th]
648: \centering
649: % \subfigure[Messages correctly classified by varying $\sigma$ with $\omega =
650: % 50\%$] {\includegraphics[width=100pt]{plots/sigma.eps}}
651: \includegraphics[width=150pt]{plots/omega_0.5.eps}
\caption{Messages Classified in Accordance with the Auxiliary Algorithm and the Total
Number of Messages Classified by Varying $\omega$}
654: \label{fig:omega}
655: \end{figure}
656:
In another experiment, we simulated a different algorithm, described
in~\cite{priority}, that also makes use of history information provided by an
auxiliary spam detector. This approach tries to classify messages based on the
historical properties of their senders. We built a simulator for this algorithm
and ran it against our data set. The results show that it was
able to classify $85.11\%$ of the messages in accordance
with the auxiliary algorithm. It is important to note that our algorithm, on
the other hand, can be tuned by properly setting the threshold $\omega$: the
higher the value of $\omega$, the more closely our classification agrees with
the auxiliary classification.
667:
We believe that the differences between the original classification and the
classification proposed for high $\omega$ values are generally due to
misclassifications by the auxiliary algorithm. In our data set we have access to
the full body of the messages that were originally classified as spam. Therefore, we
can evaluate a fraction of the false positives (messages that
the auxiliary algorithm classifies as spam and our algorithm classifies as
legitimate) that were generated by the auxiliary algorithm. This is
important since it is commonly believed that the cost of false positives is
higher than the cost of false negatives~\cite{gerf}.
677:
Each of the possible false positives was manually evaluated by three people
to determine whether the message was indeed spam. Table~\ref{tab:manual}
summarizes the results for $\omega = 0.85$, for which 879 messages
($0.24\%$ of all messages) were manually analyzed. Our algorithm outperforms the
original classification since it generates fewer false positives. We emphasize
that we cannot similarly determine the quality of classification for the
messages classified as legitimate by the auxiliary algorithm, since we do not have
access to the full body of those messages. Moreover, due to the cost of manual
classification, we cannot afford to manually classify all of the messages
classified as spam by the auxiliary algorithm.
688:
689: \begin{table}[th!]
690: \centering
691: \footnotesize
692: \begin{tabular}{|l|c|} \hline
\multicolumn{1}{|c|}{\bf Algorithm} & \multicolumn{1}{c|}{\bf \% of Misclassifications } \\ \hline
694: Original Classification & $60.33\%$ \\ \hline
695: %Sender-based & $xx\%$ \\ \hline
696: Our approach & $39.67\%$ \\ \hline
697: \end{tabular}
698: \caption{Possible False Positives Generated by the Approaches Studied.}
699: \label{tab:manual}
700: \end{table}
701:
706:
707: \section{Related Work}
708: \label{sec:related-work}
709:
Previous work has focused on reducing the impact of spam.
The approaches to reducing spam can be categorized into pre-acceptance and
post-acceptance methods, based on whether they detect and block
spam before or after accepting messages. Examples of pre-acceptance methods
are black lists~\cite{blacklist2}, gray lists~\cite{greylist}, server
authentication~\cite{spam,authentication} and
accountability~\cite{solvingspam}. Post-acceptance methods are mostly based on
information available in the body of the messages and include Bayesian
filters~\cite{sahami98bayesian} and collaborative
filtering~\cite{zhou03approximate}.
720:
Recent papers have focused on techniques for combating spam based
on characteristics of graph models of email
traffic~\cite{emailnetcombat,spammachines}. These techniques model
email traffic as a graph and detect spam and spam attacks,
respectively, in terms of graph properties. In~\cite{emailnetcombat}, a graph is
created representing the email traffic captured in the mailbox of individual
users. The subsequent analysis is based on the fact that such a
network possesses several disconnected components. The clustering coefficient
of each of these components is then used to characterize messages as spam or
legitimate. Their results show that 53\% of the messages were
correctly classified using the proposed approach.
In~\cite{spammachines}, the authors detect machines that
behave as spam senders by analyzing a border flow graph of sender and recipient
machines. In~\cite{priority}, the authors propose a new scheme for
handling spam. It is a post-acceptance mechanism that processes
mail suspected of being spam at a lower priority than that
assigned to messages classified as legitimate. The
proposed mechanism~\cite{priority} works in conjunction with some sort
of mail filter that provides the past history of mail received by a server.
744:
None of the existing spam filtering mechanisms is
infallible~\cite{priority,gerf}. Their main problems are false positives and
mail misclassification. In addition to those problems, filters must
be continuously updated to capture the multitude of mechanisms constantly
introduced by spammers to evade filtering. The algorithm presented in
this paper aims at improving the effectiveness of spam filtering mechanisms
by reducing false positives and by providing information that helps those mechanisms
tune their collection of rules.
753:
758:
759: \section{Conclusions and Future Work}
760: \label{sec:concl-future-works}
761:
In this paper we proposed a new spam detection algorithm based on the
structural similarity between contact lists of email users. The idea is that
contact lists, integrated over a suitable amount of time, are much more stable
identifiers of email users than id names, domains,
or message contents, which can all be made to vary quickly and widely.
The major drawback of our approach is that our algorithm can only group users
based on their structural similarity; it has no way of determining by itself
whether such vector clusters correspond to spam or legitimate email. For this
reason, it must work in tandem with an original classifier.
Given the information from that classifier, we have shown that we can
successfully group spam and legitimate email users separately and that this
structural inference can improve the quality of other spam detection algorithms.
768:
Specifically, we have implemented a simulator based on data collected from the
main SMTP server for a major university in Brazil that uses SpamAssassin. We
have shown that our algorithm can be tuned to produce classifications similar
to those of the original classifier algorithm and that, for a certain set of
parameters, it was capable of correcting false positives generated by
SpamAssassin in our workload.
775:
There are several improvements and developments that were not explored here but
promise to reinforce the strength of our approach. We intend to explore these
in future work. We observe that structural similarity gives us a basis for
correlating similar addresses over time, and thus for following the time
evolution of spam sender techniques in ways that suitably factor out the
enormous variability of their apparent identifiers. Finally, we note that the
probabilistic basis of our approach lends itself naturally to the evolution of
users' classifications (say, through Bayesian inference), both through
collaborative filtering using user feedback and from information derived from
other algorithmic classifiers.
778:
779: %We also intend to explore better ways of using the probabilities $P_s(m)$ and
780: %$P_r(m)$ to separate out spam, namely by using more sophisticated delimiters
781: %that account ...
782:
783: \bibliographystyle{acm}
784: \begin{thebibliography}{10}
785:
786: \bibitem{amavis}
787: Amavis.
788: \newblock http://www.amavis.org, 2004.
789:
790: \bibitem{sizecost}
791: {\sc Atkins, S.}
792: \newblock Size and cost of the problem.
793: \newblock In {\em 56th IETF Meeting\/} (March 2003).
794:
795: \bibitem{authentication}
796: {\sc Baker, H.~P.}
797: \newblock Authentication approaches.
798: \newblock In {\em 56th IETF Meeting\/} (March 2003).
799:
800: \bibitem{emailnetcombat}
801: {\sc Boykin, P.~O., and Roychowdhury, V.}
802: \newblock Personal email networks: An effective anti-spam tool.
803: \newblock http://www.arxiv.org/abs/cond-mat/0402143, February 2004.
804:
805: \bibitem{solvingspam}
806: {\sc Brandmo, H.~P.}
807: \newblock Solving spam by establishing a platform for sender accountability.
808: \newblock In {\em 56th IETF Meeting\/} (March 2003).
809:
810: \bibitem{gerf}
811: {\sc Cerf, V.~G.}
812: \newblock Spam, spim, and spit.
813: \newblock {\em Commun. ACM 48}, 4 (2005), 39--43.
814:
815: \bibitem{spam}
816: {\sc Cranor, L.~F., and LaMacchia, B.~A.}
817: \newblock Spam!
818: \newblock In {\em Communications of the ACM\/} (1998).
819:
820: \bibitem{spammachines}
821: {\sc Desikan, P., and Srivastava, J.}
822: \newblock Analyzing network traffic to detect e-mail spamming machines.
\newblock Tech. Rep. 180, Army High Performance Computing Research Center, 2004.
825:
826: \bibitem{exim}
827: Exim internet mailer home page.
828: \newblock http://www.exim.org, 2004.
829:
830: \bibitem{ceas}
831: {\sc Gomes, L.~H., Almeida, R.~B., Bettencourt, L. M.~A., Almeida, V. A.~F.,
832: and Almeida, J.~M.}
833: \newblock Comparative graph theoretical characterization of networks of spam
834: and regular email.
835: \newblock http://arxiv.org/abs/cond-mat/0503725, March 2005.
836:
837: \bibitem{gomes}
{\sc Gomes, L.~H., Cazita, C., Almeida, J., Almeida, V. A.~F., and Meira~Jr., W.}
839: \newblock Characterizing a spam traffic.
840: \newblock In {\em Proc. of the 4th ACM SIGCOMM conference on Internet
841: measurement\/} (2004).
842:
843: \bibitem{greylist}
844: {\sc Harris, E.}
845: \newblock The next step in the spam control war: Greylisting.
846: \newblock http://projects.puremagic.com/greylisting/, April 2004.
847:
\bibitem{messageLabs}
MessageLabs home page.
\newblock http://www.messagelabs.co.uk/, 2005.
852:
853: \bibitem{blacklist2}
854: Maps - mail abuse prevention system home page.
855: \newblock http://mail-abuse.org/rbl/getoff.html, 2004.
856:
857: \bibitem{livrovirgilio}
858: {\sc Menasc\'{e}, D., and Almeida, V.}
859: \newblock {\em Capacity Planning for Web Services: metrics, models and
860: methods}.
861: \newblock Prentice Hall Inc., USA, September 2001.
862:
863: \bibitem{sahami98bayesian}
864: {\sc Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E.}
\newblock A {B}ayesian approach to filtering junk {E}-mail.
866: \newblock In {\em Learning for Text Categorization: Papers from the 1998
867: Workshop\/} (Madison, Wisconsin, USA, 1998), AAAI Technical Report WS-98-05.
868:
869: \bibitem{spamassassin}
870: Spamassassin.
871: \newblock http://www.spamassassin.org, 2004.
872:
873: \bibitem{antivirus}
874: Trend micro home page.
875: \newblock http://www.trendmicro.com, 2004.
876:
877: \bibitem{priority}
{\sc Twining, R.~D., Williamson, M.~M., Mowbray, M., and Rahmouni, M.}
879: \newblock Email prioritization: Reducing delays on legitimate mail caused by
880: junk mail.
881: \newblock In {\em Proc. Usenix Annual Technical Conference\/} (Boston, MA, June
882: 2004).
883:
884: \bibitem{moffat}
885: {\sc Witten, I.~H., Bell, T.~C., and Moffat, A.}
886: \newblock {\em Managing Gigabytes: Compressing and Indexing Documents and
887: Images}.
888: \newblock John Wiley \& Sons, Inc., New York, NY, USA, 1994.
889:
890: \bibitem{zhou03approximate}
891: {\sc Zhou, F., Zhuang, L., Zhao, B., Huang, L., Joseph, A., and Kubiatowicz,
892: J.}
893: \newblock Approximate object location and spam filtering on peer-to-peer
894: systems.
895: \newblock In {\em Proc. of Middleware\/} (June 2003).
896:
897: \end{thebibliography}
898:
899: \end{document}
900: