cs0409005/ch2.tex
1:  \section{Related Work}
2: 
3: \subsection{Current Internet Log and Data Collection Centers}
4: While a growing number of data centers collect and
5: analyze logs, few are dedicated to sharing logs with the research
6: community. Those that do share do not anonymize sufficiently, and
7: they tend to focus on only one particular type of log. The goal is to
8: share as many types of logs from as many sources as possible.
9: Furthermore, there needs to be adequate protection of these logs to suit
10: the different needs of those who share them. To accomplish this goal,
11: there must be multiple, standardized levels of anonymization. 
12: 
13: \subsubsection{CAIDA}
14: CAIDA \cite{CAIDA} has one of the largest Internet log data catalogs
15: available for public analysis. They have an index to many data
16: sets stored off site and are developing a cataloging system for these
17: data sets. They also seek to develop tools to analyze the data sets.
18: CAIDA focuses on macro-level data to support network measurement
19: research. As such, they do not have the level of detail or the many
20: heterogeneous types of logs necessary for security research. The
21: usefulness of these logs to security research is limited to a very high level. 
22: The network traces that CAIDA provides could be used to get a big
23: picture of worm activity, but they do not contain the level of detail to
24: capture new exploit binaries or hacker behaviors post-compromise. 
25: 
26: Even though these logs are not primarily intended for security
27: research, they still suffer from the problems of undefined
28: anonymization standards. While it is important to anonymize logs so
29: that adversaries cannot map out contributor networks, CAIDA itself
30: does not anonymize logs that it does not generate. 
31: Different organizations anonymize to different levels---some not
32: anonymizing at all---and in different manners. There is no
33: consistency in the way anonymization is done because all of the
34: contributors do it their own way. Thus we see that development of
35: anonymization standards applies to more than just security research. 
36: 
37: \subsubsection{DeepSight}
38: Symantec's DeepSight \cite{Symantec} does utilize a cross-sectional view of the
39: Internet and analyzes detailed IDS, firewall and virus scanner logs.
40: However, it is not a log sharing system for research, but rather it is
41: a data collection system with a commercial purpose. Anonymization is 
42: not a real issue because data is not shared; instead, all data is 
43: collected and sent to a trusted third party. The purpose of this
44: collection is to notice trends and provide early warnings of attacks
45: and new threats to customers. They allow certain thresholds and other
46: variables to be set by customers to somewhat customize their alerts and 
47: reports, but mostly it is a way to alert their customers of
48: new worms, viruses or software exploits being used in the wild. It
49: does not provide data sets to share with the research community. 
50: 
51: \subsubsection{Internet Storm Center}
52: The Internet Storm Center (ISC) \cite{SANS} is like a grass-roots
53: version of DeepSight that is run by SANS. They collect IDS logs
54: from volunteers and analyze them to detect trends. Their purpose is
55: to provide an early warning system of new worm activity on the
56: Internet. They provide reports on the top ports being scanned with
57: respect to time, and they use the trend information they find to
58: determine the INFOCon threat level, much like Symantec defines the
59: ThreatCon level with DeepSight data. 
60: 
61: ISC does not share actual logs, but they produce high level statistics.
62: For this reason, their port activity and trends data do not need to be
63: sanitized. They sanitize information about scanner source IPs by looking very
64: broadly at the number of scans per class C network. This kind of
65: anonymization is also used in some of the CAIDA logs where they simply
66: truncate IP addresses. The risk in sharing this kind
67: of information is low, but its usefulness is also minimal. It does nothing
68: more than allow inferences such as ``The US does the most scanning'' or
69: ``Universities contribute to most of the P2P traffic''. Many of these
70: statistics can 
71: be predicted from the density of addresses assigned in
72: the respective class C networks. ISC does share specific addresses in
73: one place: it lists the top 10 scanners by IP address. Many
74: organizations use this information to block misbehaving machines. The
75: repercussions of doing so are again minimal. They do not provide
76: specific details about those machines or the networks they are on. It
77: simply serves to embarrass the ISPs that host the compromised
78: machines. Any sort of anonymization here would defeat the purpose. 
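
As a sketch of the per-class-C aggregation described above (the addresses are documentation-range examples, not ISC data, and the name $\mathrm{trunc}_{24}$ is our own notation introduced only for this illustration), write an IPv4 address as a 32-bit integer $a$. Truncating to the class C network keeps the top 24 bits:
\[
  \mathrm{trunc}_{24}(a) = a - (a \bmod 2^{8}).
\]
For instance, \texttt{192.0.2.17} and \texttt{192.0.2.200} both map to \texttt{192.0.2.0}, so up to $2^{8}=256$ distinct hosts collapse into a single reported network.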
79: 
80: Overall, the type of data they provide is a homogeneous set of aggregated  
81: statistics. More information can be gathered from CAIDA logs because 
82: raw access is provided. Thus, one is not restricted to only the
83: statistics they provide. The main difference of course is that the ISC
84: is real-time and uses  
85: a more distributed sample. In conclusion, ISC works very well for monitoring
86: general worm behavior, detecting trends that indicate new worms and
87: analyzing the life cycle of an exploit. However, they are not
88: gathering many types of logs, and they are not sharing them with the
89: general community for research. 
90: 
91: \subsubsection{DShield.org}
92: DShield.org \cite{DShield} is a grass-roots log collection system,
93: though it is now 
94: funded partially by SANS. They gather firewall logs and convert them
95: to a standard format. Currently, they exclusively accept packet filter
96: traces. These are used to create reports of types similar to the
97: Internet Storm Center. They have reports on port activity trends, the top
98: 10 most offensive scanners and the top 10 most probed ports. They
99: produce the blacklist of offenders that the ISC uses and provide
100: searches on activity by particular IP addresses. 
101: 
102: Anonymity is not dealt with seriously here, and they say ``You should not
103: submit any information you consider business critical or proprietary''.
104: They say that they ``try'' to hide destination IPs to mask who is being
105: attacked, but raw data is searchable and may be made available
106: to the public. Of course, they do not indicate the submitter of the
107: data. Decisions about what raw data to release, and how, are made on a
108: per-individual basis. 
109: 
110: Several problems exist with the system as is. First, there is only
111: one type of log. Second, there are no precautions to keep people from
112: resubmitting logs and polluting the data set. If special clients are
113: used instead of web submissions, accidental resubmissions are prevented. 
114: Third, anonymous submissions allow fake data to be
115: submitted that could wrongly blacklist individuals. Fourth, even if
116: their non-guaranteed anonymization of target hosts works, anyone can
117: query information about specific hosts and networks. This allows
118: attackers to find already compromised machines on a network rather
119: easily. In conclusion, DShield.org provides limited types of data and
120: minimal data protection mechanisms. 
121: 
122: \subsubsection{Packet Vault}
123: The University of Michigan has worked on a secure, long-term archive
124: of network packet data they call the {\it Packet Vault} \cite{Antonelli00, Antonelli99}. 
125: It is basically a special purpose network device and encrypted database system 
126: specifically designed for packet sniffers. It is designed so that selected
127: traffic can be made available without exposing other traffic.
128: 
129: They use the ``black marker'' approach to anonymize logs by completely
130: encrypting packet information. As such, it cannot really be called
131: anonymization because all fields are either encrypted or decrypted,
132: essentially having the same effect as printing a log and using a black
133: marker on most of the lines. They group items under the same
134: encryption key if the packets are part of the same ``conversation''. 
135: 
136: Now, if instead they used different keys for different fields, that
137: would allow them to release different views of the same records to
138: different organizations. That would be a crude sort of black marker
139: (all-or-nothing) anonymization of selected fields. But their goals
140: are different. They are not trying to share logs while preserving
141: privacy. They are making logs available to participants of
142: the conversation, but not to anyone else. They give the
143: appropriate keys to decrypt log records that describe a participant's 
144: actions but not those in which they were not a participant. 
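
The contrast between the two keying schemes can be sketched as follows. The notation is ours, introduced only for illustration, and is not taken from \cite{Antonelli00, Antonelli99}. Let a record $r=(f_1,\ldots,f_m)$ belong to conversation $c$, and let $E_k$ denote encryption under key $k$. Per-conversation keying stores
\[
  E_{k_c}(f_1,\ldots,f_m),
\]
so a holder of $k_c$ sees every field of $r$ and anyone else sees none. The hypothetical per-field alternative would store
\[
  \bigl(E_{k_1}(f_1),\ldots,E_{k_m}(f_m)\bigr),
\]
so an organization given the keys $\{k_j : j\in S\}$ for some $S\subseteq\{1,\ldots,m\}$ recovers exactly the fields $f_j$ with $j\in S$. Each field is still either fully revealed or fully hidden, which is why this remains all-or-nothing anonymization.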
145: 
146: In conclusion, we see that while there are some centers dedicated  to
147: collecting log files, they all suffer from one or more of the
148: following problems: (1) they do not have a wide view of the Internet
149: but are quite localized, (2) the repositories are very specific,
150: addressing one or only a few types of logs, (3) anonymization is weak
151: or nonexistent and usually inconsistent, or (4) they collect many logs
152: but do not share them with the research community. 
153: 
154: \subsection{Anonymizers}
155: There has been some research in log anonymization. However, most
156: work addresses only a small subset of all the available log
157: sources (particularly network traces) and focuses exclusively on anonymizing
158: IP addresses within a log. While IP addresses could simply be removed 
159: or randomized in logs,  such a solution is undesirable since it
160: destroys a basic structure used in analyzing logs. Significant work
161: has been accomplished on prefix-preserving anonymization of IP 
162: addresses. In prefix-preserving anonymization, IP addresses are mapped
163: to pseudo-random anonymized IP addresses by a function we will call
164: $\tau$. Let $P_n(\cdot)$ be the function that truncates an IP address to its first $n$
165: bits. Then $\tau$ is a {\it prefix-preserving} permutation of IP addresses
166: if, for all addresses $x$, $y$ and all $1\leq n\leq 32$, $P_n(x)=P_n(y)$ if and only if
167: $P_n(\tau(x))=P_n(\tau(y))$. TCPdpriv \cite{TCPdpriv} is a free program 
168: that performs prefix-preserving TCPdump trace anonymization using
169: tables. Because of the use of tables, it is difficult to process logs
170: in parallel with this tool. In \cite{Xu01,Xu02}, Xu et al.\ have
171: created a prefix-preserving IP pseudonymizer that overcomes this limitation by
172: eliminating the need for centralized tables to be shared and edited by
173: multiple entities. Instead, with their tool CryptoPAn, one only needs
174: to distribute a short key between entities that wish to pseudonymize
175: consistently with each other. Furthermore, they have shown that all
176: prefix-preserving pseudonymizers must take a particular form and that
177: their solution is optimal with respect to security. But while
178: theoretically optimal solutions have been created for this reduced
179: problem, the larger problem of anonymizing whole log files 
180: remains unsolved. 
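
The ``particular form'' can be made concrete with a small worked example. As we understand the result in \cite{Xu02}, a prefix-preserving $\tau$ on $n$-bit addresses $a=a_1a_2\cdots a_n$ can be written bit by bit as
\[
  \tau(a)_i = a_i \oplus f_{i-1}(a_1a_2\cdots a_{i-1}), \qquad 1\leq i\leq n,
\]
where each $f_{i}$ maps an $i$-bit prefix to a single bit and $f_0$ is a constant. The 4-bit addresses and the values of $f_i$ below are hypothetical, chosen only for illustration. Take $x=1011$ and $y=1010$, which share a 3-bit prefix, and suppose $f_0=1$, $f_1(1)=0$, $f_2(10)=1$ and $f_3(101)=1$. Then
\begin{eqnarray*}
  \tau(x) &=& (1\oplus 1)(0\oplus 0)(1\oplus 1)(1\oplus 1) = 0000,\\
  \tau(y) &=& (1\oplus 1)(0\oplus 0)(1\oplus 1)(0\oplus 1) = 0001.
\end{eqnarray*}
The outputs agree on exactly their first three bits, as the inputs do, because the bit flipped at position $i$ depends only on the preceding $i-1$ input bits.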
181: 
182: One of the earliest uses of {\it pseudonyms} can be found in
183: \cite{Chaum81} where public keys are used as pseudonyms. We now
184: recognize what Chaum described in \cite{Chaum81} as a ``digital
185: pseudonym'' to be a specific type of pseudonym called an authorization
186: certificate. As noted in \cite{Lundin99}, pseudonyms help define
187: middle ground in the zero-sum tradeoff between security and privacy of
188: audit logs. In \cite{Sobirey97}, Sobirey et al.\ first suggested
189: privacy-enhanced intrusion detection using pseudonyms and provided the
190: motivation for the work of Biskup et al.\ in \cite{Biskup00a,Biskup00b}.
191: 
192: While the work in \cite{Lundin99,Biskup00a,Biskup00b} does deal with
193: log data and anonymization, their goals are significantly different
194: from ours. All three works deal specifically with pseudonymization
195: in Intrusion Detection Systems (IDSs). The adversary in their model
196: is the system administrator, and the one requiring protection is the
197: user of the system. In our case, we instead assume that the
198: system/network administrators have access to raw logs, and we are
199: trying to protect the systems from those who would see the shared
200: logs. To see how this makes a difference, consider that in their
201: scenario the server addresses and services running are not even
202: sensitive---just information that could identify clients of the
203: system. Furthermore, we have no need to reverse pseudonyms. In their
204: case, however, the system/network administrators do not have raw
205: data, so the privacy officer
206: must help the system security officer reverse pseudonyms if alerts
207: indicate suspicious behavior. In \cite{Biskup00a,Biskup00b}, they
208: take this further and try to support automatic re-identification if a
209: certain threshold of events is met. In that way their pseudonymizer
210: must be intelligent, like an IDS, predicting when re-identification
211: may be necessary and thus altering how it pseudonymizes data. They
212: also differ from us in that they create transactional pseudonyms,  
213: so a pseudonym this week might map to a different
214: entity the next week. We, however, require consistency with respect to time for
215: logs to be useful. Lastly, all of the anonymizing solutions in these
216: papers  filter log entries and remove them if they are not relevant to
217: the IDS; we dispose of no entries because completeness is very important
218: for logs released to the general research populace. 
219: 
220: In \cite{Flegel02}, Flegel takes his previous work in
221: privacy-preserving intrusion detection \cite{Biskup00a,Biskup00b}  and
222: changes the motivation slightly. Here he imagines a scenario of web
223: servers volunteering to protect the privacy of visitors from
224: themselves, and he believes IP addresses of visitors need
225: pseudonymization. However, to a web server, IP addresses already act
226: as pseudonyms protecting the client's identity, since ISPs rarely
227: volunteer mappings from IP addresses to persons. The exception is
228: when the web server belongs to the ISP itself; in that
229: particular instance IP addresses are not pseudonyms. Though the
230: motivation differs slightly, the system described is the same
231: underlying threshold-based pseudonymization system, and the focus of
232: that paper is really the implementation and performance of the
233: system. As such, the results of \cite{Flegel02} can be applied to
234: \cite{Biskup00a,Biskup00b}. 
235: 
236: In \cite{Pang03}, Pang et al.\ developed a new packet anonymizer that
237: anonymizes 
238: packet payloads as well as transactional information, though their
239: methodology only works with application level protocols that their
240: anonymizer understands (HTTP, FTP, Finger, Ident and SMTP). The
241: process can also alter logs significantly, losing fragmentation
242: information, the size and number of packets and information about
243: retransmissions, and skewing time stamps, sequence numbers and
244: checksums. 
245: While their anonymizer is limited in its capabilities, it is fail-safe
246: because it only leaves information in the packets that it can parse
247: and understand. Further, they create a classification of
248: anonymization techniques and a classification of attacks against
249: anonymization that we found useful. We use a similar classification
250: that is based on their work. 
251: 
252: Most recently, Waters et al.\ \cite{Waters04} address the tension
253: between data access control and searchability of audit logs through a
254: new method they developed to search asymmetrically encrypted logs. In
255: this way the encrypted log can be made public for search, and the
256: owner distributes private keys corresponding to keywords. Thus,
257: instead of the data owner decrypting the log and running the search,
258: he can simply give the query maker the ability to perform the query
259: with a set of keywords he deems acceptable.