0409:cs0409005/ch2.tex

1:  \section{Related Work}

2:

3: \subsection{Current Internet Log and Data Collection Centers}

4: While there are a growing number of data centers that collect and

5: analyze logs, few are dedicated to sharing logs with the research

6: community. Those that do share do not anonymize sufficiently, and

7: they tend to focus on only one particular type of log. Our goal is to

8: share as many types of logs from as many sources as possible.

9: Furthermore, there needs to be adequate protection of these logs to suit

10: the different needs of those who share them. To accomplish this goal,

11: there must be multiple, standardized levels of anonymization.

12:

13: \subsubsection{CAIDA}

14: CAIDA \cite{CAIDA} has one of the largest Internet log data catalogs

15: available for public analysis. They have an index to many data

16: sets stored off site and are developing a cataloging system for these

17: data sets. They also seek to develop tools to analyze the data sets.

18: CAIDA focuses on macro-level data to support network measurement

19: research. As such, they do not have the level of detail or the many

20: heterogeneous types of logs necessary for security research. The

21: usefulness of these logs to security research is limited to a very high level.

22: The network traces that CAIDA provides could be used to get a big

23: picture of worm activity, but they do not contain the level of detail to

24: capture new exploit binaries or hacker behaviors post-compromise.

25:

26: Even though these logs are not primarily intended for security

27: research, they still suffer from the problems of undefined

28: anonymization standards. While it is important to anonymize logs so

29: that adversaries cannot map out contributor networks, CAIDA itself

30: does not anonymize logs that it does not generate.

31: Different organizations anonymize to different levels---some not

32: anonymizing at all---and in different manners. There is no

33: consistency in the way anonymization is done because all of the

34: contributors do it their own way. Thus we see that development of

35: anonymization standards applies to more than just security research.

36:

37: \subsubsection{DeepSight}

38: Symantec's DeepSight \cite{Symantec} does utilize a cross-sectional view of the

39: Internet and analyzes detailed IDS, firewall and virus scanner logs.

40: However, it is not a log sharing system for research, but rather it is

41: a data collection system with a commercial purpose. Anonymization is

42: not a real issue because data is not shared, but rather all data is

43: collected and sent to a trusted third party. The purpose of this

44: collection is to notice trends and provide early warnings of attacks

45: and new threats to customers. They allow certain thresholds and other

46: variables to be set by customers to somewhat customize their alerts and

47: reports, but mostly it is a way to alert their customers of

48: new worms, viruses or software exploits being used in the wild. It

49: does not provide data sets to share with the research community.

50:

51: \subsubsection{Internet Storm Center}

52: The Internet Storm Center (ISC) \cite{SANS} is like a grass-roots

53: version of DeepSight that is run by SANS. They collect IDS logs

54: from volunteers and analyze them to detect trends. Their purpose is

55: to provide an early warning system of new worm activity on the

56: Internet. They provide reports on the top ports being scanned with

57: respect to time, and they use the trend information they find to

58: determine the INFOCon threat level, much like Symantec defines the

59: ThreatCon level with DeepSight data.

60:

61: ISC does not share actual logs, but they produce high-level statistics.

62: For this reason, their port activity and trends data do not need to be

63: sanitized. They sanitize information about scanner source IPs by looking very

64: broadly at the number of scans per class C network. This kind of

65: anonymization is also used in some of the CAIDA logs where they simply

66: truncate IP addresses. The risk in sharing this kind

67: of information is low, but its usefulness is also minimal. It does nothing

68: more than allow inferences such as ``The US does the most scanning'' or

69: ``Universities contribute to most of the P2P traffic''. Many of these

70: statistics can

71: be predicted from the density of addresses assigned in

72: the respective class C networks. ISC does share specific addresses in

73: one place: it lists the top 10 scanners by IP address. Many

74: organizations use this information to block misbehaving machines. The

75: repercussion of doing this is again minimal. They do not provide

76: specific details about those machines or the networks they are on. It

77: simply serves to embarrass the ISPs that host the compromised

78: machines. Any sort of anonymization here would defeat the purpose.
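
As a small illustration of this kind of aggregation (the addresses are
taken from a range reserved for documentation and are chosen here only
as an example), truncating to the class C network gives
\[
\mbox{192.0.2.17} \;\mapsto\; \mbox{192.0.2.0/24}, \qquad
\mbox{192.0.2.200} \;\mapsto\; \mbox{192.0.2.0/24},
\]
so both scanners fall into the same bucket, and each bucket hides which
of its $2^{8}=256$ addresses was responsible. The per-network scan
counts survive, but host-level detail does not.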

79:

80: Overall, the type of data they provide is a homogeneous set of aggregated

81: statistics. More information can be gathered from CAIDA logs because

82: raw access is provided. Thus, one is not restricted to only the

83: statistics they provide. The main difference of course is that the ISC

84: is real-time and uses

85: a more distributed sample. In conclusion, ISC works very well for monitoring

86: general worm behavior, detecting trends that indicate new worms and

87: analyzing the life cycle of an exploit. However, they are not

88: gathering many types of logs, and they are not sharing them with the

89: general community for research.

90:

91: \subsubsection{DShield.org}

92: DShield.org \cite{DShield} is a grass-roots log collection system,

93: though it is now

94: funded partially by SANS. They gather firewall logs and convert them

95: to a standard format. Currently, they exclusively accept packet filter

96: traces. These are used to create reports of types similar to the

97: Internet Storm Center. They have reports on port activity trends, the top

98: 10 most offensive scanners and the top 10 most probed ports. They

99: produce the blacklist of offenders that the ISC uses and provide

100: searches on activity by particular IP addresses.

101:

102: Anonymity is not treated seriously here; they state, ``You should not

103: submit any information you consider business critical or proprietary''.

104: They say that they ``try'' to hide destination IPs to mask who is being

105: attacked, but raw data is searchable and may be released unmodified

106: to the public. Of course, they do not indicate the submitter of the

107: data. Decisions about what raw data to release, and how, are made on a

108: per-individual basis.

109:

110: Several problems exist with the system as is. First, there is only

111: one type of log. Second, there are no precautions to keep people from

112: resubmitting logs and polluting the data set. If special clients are

113: used instead of web submissions, accidental resubmissions are prevented.

114: Third, anonymous submissions allow fake data to be

115: submitted that could wrongly blacklist individuals. Fourth, even if

116: their non-guaranteed anonymization of target hosts works, anyone can

117: query information about specific hosts and networks. This allows

118: attackers to find already compromised machines on a network rather

119: easily. In conclusion, DShield.org provides data of limited types and

120: minimal data protection mechanisms.

121:

122: \subsubsection{Packet Vault}

123: The University of Michigan has worked on a secure, long-term archive

124: of network packet data they call the {\it Packet Vault} \cite{Antonelli00, Antonelli99}.

125: It is basically a special purpose network device and encrypted database system

126: specifically designed for packet sniffers. It is designed so that selected

127: traffic can be made available without exposing other traffic.

128:

129: They use the ``black marker'' approach to anonymize logs by completely

130: encrypting packet information. As such, it cannot really be called

131: anonymization because all fields are either encrypted or decrypted,

132: essentially having the same effect as printing a log and using a black

133: marker on most of the lines. They group items under the same

134: encryption key if the packets are part of the same ``conversation''.

135:

136: Now, if instead they used different keys for different fields, that

137: would allow them to release different views of the same records to

138: different organizations. That would be a crude sort of black marker

139: (all-or-nothing) anonymization of selected fields. But their goals

140: are different. They are not trying to share logs while preserving

141: privacy. They are making logs available to participants of

142: the conversation, but not to anyone else. They give the

143: appropriate keys to decrypt log records that describe a participant's

144: actions but not those in which they were not a participant.
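
The per-field alternative mentioned above can be sketched in symbols.
The notation is ours and is not taken from the Packet Vault papers:
suppose a record is $r=(f_1,\ldots,f_m)$, field $j$ is stored as
$c_j=E_{k_j}(f_j)$ under its own key $k_j$, and a recipient holds the
keys indexed by a set $S\subseteq\{1,\ldots,m\}$. That recipient's view
of the record would then be
\[
v_S(r)_j = \left\{
\begin{array}{ll}
f_j  & \mbox{if } j\in S,\\
\bot & \mbox{otherwise,}
\end{array}
\right.
\]
where $\bot$ stands for a blacked-out field. For example, with
hypothetical fields (source IP, destination IP, port, payload) and
$S=\{3\}$, the recipient would learn only the port. Each field is still
either fully visible or fully hidden, which is why this remains
all-or-nothing anonymization.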

145:

146: In conclusion, we see that while there are some centers dedicated  to

147: collecting log files, they all suffer from one or more of the

148: following problems: (1) They do not have a wide view of the Internet

149: but are quite localized, (2) the repositories are very specific,

150: addressing one or only a few types of logs, (3) anonymization is weak

151: or nonexistent and usually inconsistent or (4) they collect many logs

152: but do not share them with the research community.

153:

154: \subsection{Anonymizers}

155: There has been some research in log anonymization. However, most

156: work addresses only a small subset of all the available log

157: sources (particularly network traces) and focuses exclusively on anonymizing

158: IP addresses within a log. While IP addresses could simply be removed

159: or randomized in logs,  such a solution is undesirable since it

160: destroys a basic structure used in analyzing logs. Significant work

161: has been accomplished on prefix-preserving anonymization of IP

162: addresses. In prefix-preserving anonymization, IP addresses are mapped

163: to pseudo-random anonymized IP addresses by a function we will call

164: $\tau$. Let $P_n(\cdot)$ be the function that truncates an IP address to its first $n$

165: bits. Then $\tau$ is a {\it prefix-preserving} permutation of IP addresses

166: if, for all $1\leq n\leq 32$ and all addresses $x$ and $y$, $P_n(x)=P_n(y)$ if and only if

167: $P_n(\tau(x))=P_n(\tau(y))$. TCPdpriv \cite{TCPdpriv} is a free program

168: that performs prefix-preserving TCPdump trace anonymization using

169: tables. Because of the use of tables, it is difficult to process logs

170: in parallel with this tool. In \cite{Xu01,Xu02}, Xu et al.\ have

171: created a prefix-preserving IP pseudonymizer that overcomes this limitation by

172: eliminating the need for centralized tables to be shared and edited by

173: multiple entities. Instead, with their tool CryptoPAn, one only needs

174: to distribute a short key between entities that wish to pseudonymize

175: consistently with each other. Furthermore, they have shown that all

176: prefix-preserving pseudonymizers must take a particular form and that

177: their solution is optimal with respect to security. But while

178: theoretically optimal solutions have been created for this reduced

179: problem, the larger problem of anonymizing whole log files

180: remains unsolved.
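
The particular form mentioned above can be made concrete. As we read
\cite{Xu02}, if an address is written as bits $a_1a_2\ldots a_{32}$,
every prefix-preserving $\tau$ can be expressed bit by bit as
\[
\tau(a)_i = a_i \oplus f_{i-1}(a_1a_2\ldots a_{i-1}), \qquad 1\leq i\leq 32,
\]
where each $f_{i-1}$ maps an $(i-1)$-bit prefix to a single bit and
$f_0$ is a constant. The toy example that follows is ours and uses
4-bit addresses purely for illustration. Take $x=1011$ and $y=1001$,
which share the 2-bit prefix $10$ and differ in the third bit, and
suppose $f_0=1$, $f_1(1)=0$, $f_2(10)=1$, $f_3(101)=0$ and
$f_3(100)=1$. Then
\[
\begin{array}{rcl}
\tau(x) &=& (1\oplus 1)(0\oplus 0)(1\oplus 1)(1\oplus 0) = 0001,\\
\tau(y) &=& (1\oplus 1)(0\oplus 0)(0\oplus 1)(1\oplus 1) = 0010.
\end{array}
\]
The images share the 2-bit prefix $00$ and differ in the third bit, so
the two addresses share an $n$-bit prefix exactly when their images do.
A table-based tool has to store the bits $f_{i-1}(\cdot)$ as they are
assigned, whereas a key-based tool can recompute them from a shared key,
which is why only the key needs to be distributed.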

181:

182: One of the earliest uses of {\it pseudonyms} can be found in

183: \cite{Chaum81} where public keys are used as pseudonyms. We now

184: recognize what Chaum described in \cite{Chaum81} as a ``digital

185: pseudonym'' to be a specific type of pseudonym called an authorization

186: certificate. As noted in \cite{Lundin99}, pseudonyms help define

187: middle ground in the zero-sum tradeoff between security and privacy of

188: audit logs. In \cite{Sobirey97}, Sobirey et al.\ first suggested

189: privacy-enhanced intrusion detection using pseudonyms and provided the

190: motivation for the work of Biskup et al.\ in \cite{Biskup00a,Biskup00b}.

191:

192: While the work in \cite{Lundin99,Biskup00a,Biskup00b} does deal with

193: log data and anonymization, their goals are significantly different

194: from ours. All three works deal specifically with pseudonymization

195: in Intrusion Detection Systems (IDSs). The adversary in their model

196: is the system administrator, and the one requiring protection is the

197: user of the system. In our case, we instead assume that the

198: system/network administrators have access to raw logs, and we are

199: trying to protect the systems from those who would see the shared

200: logs. To contrast how this makes a difference, consider that in their

201: scenario the server addresses and services running are not even

202: sensitive---just information that could identify clients of the

203: system. Furthermore, we do not care about reversal of pseudonyms. We

204: have no need for that capability, but since the system/network

205: administrators do not have raw data in their case, the privacy officer

206: must help the system security officer reverse pseudonyms if alerts

207: indicate suspicious behavior. In \cite{Biskup00a,Biskup00b}, they

208: take this further and try to support automatic re-identification if a

209: certain threshold of events is met. In that way their pseudonymizer

210: must be intelligent, like an IDS, predicting when re-identification

211: may be necessary and thus altering how it pseudonymizes data. They

212: also differ from us in that they create transactional pseudonyms,

213: so a pseudonym this week might map to a different

214: entity the next week. We, however, require consistency with respect to time for

215: logs to be useful. Lastly, all of the anonymizing solutions in these

216: papers  filter log entries and remove them if they are not relevant to

217: the IDS; we dispose of no entries because completeness is very important

218: for logs released to the general research populace.

219:

220: In \cite{Flegel02}, Flegel takes his previous work in privacy

221: preserving intrusion detection \cite{Biskup00a,Biskup00b}  and

222: changes the motivation slightly. Here he imagines a scenario of web

223: servers volunteering to protect the privacy of visitors from

224: themselves, and he believes IP addresses of visitors need

225: pseudonymization. However, to a web server, IP addresses already act

226: as pseudonyms protecting the client's identity since ISPs rarely

227: volunteer IP-address-to-person mappings. The exception is when

228: the web server belongs to the ISP itself; in that

229: particular instance, IP addresses are not pseudonyms. Though the

230: motivation differs slightly, the system described is the same

231: underlying threshold-based pseudonymization system, and the focus of

232: that paper is on the implementation and performance of the

233: system. As such, the results of \cite{Flegel02} can be applied to

234: \cite{Biskup00a,Biskup00b}.

235:

236: In \cite{Pang03}, Pang et al.\ developed a new packet anonymizer that

237: anonymizes

238: packet payloads as well as transactional information, though their

239: methodology only works with application level protocols that their

240: anonymizer understands (HTTP, FTP, Finger, Ident and SMTP). The

241: process can also alter logs significantly, losing fragmentation

242: information, the size and number of packets and information about

243: retransmissions, skewing time stamps, sequence numbers and

244: checksums.

245: While their anonymizer is limited in its capabilities, it is fail-safe

246: because it only leaves information in the packets that it can parse

247: and understand. Further, they create a classification of

248: anonymization techniques and a classification of attacks against

249: anonymization that we found useful. We use a similar classification

250: that is based on their work.

251:

252: Most recently, Waters et al.\ \cite{Waters04} address the tension

253: between data access control and searchability of audit logs through a

254: new method they developed to search asymmetrically encrypted logs. In

255: this way the encrypted log can be made public for search, and the

256: owner distributes private keys corresponding to keywords. Thus,

257: instead of the data owner decrypting the log and running the search,

258: he can simply give the query maker the ability to perform the query

259: with a set of keywords he deems acceptable.