% cs0409005/ch1.tex
\section{Introduction}
Log data is essential to security operations teams at any organization
large enough to have full-time security personnel. While intrusion
detection systems (IDSs) operate on streaming data, matching signatures
and producing alerts, human beings must still examine logs to
understand those alerts. Logs also form the core source of evidence for
computer forensic investigations following security incidents. The
current state of the art is for each autonomous organization to use its
log data to optimize network management and security protection
locally. For instance, an organization may block a particular IP
address only after that address has been used to scan its own network.
Administrators may miss the bigger picture and fail to see that they are
just one piece of a larger target. Furthermore, administrators may start
to scan their own network for a particular vulnerability only once an
attacker has exploited it on their systems. There are very few
cross-sectional views of the Internet, and until recently there have
been no mechanisms to enable such wider views. Moreover, existing
wide views, such as spam blacklists and worm signatures, often focus
on a single characteristic, even though the signatures are gathered
from events across the entire Internet.

Data sharing is, in fact, common among attackers. They trade zombies,
publicly post information on vulnerable systems and networks, and
coordinate attacks. Recent incidents at several U.S. supercomputing
centers \cite{SuperAttacks} show coordinated attacks against
organizations that lack good mechanisms for data sharing and log
correlation. Studying such attacks requires real data, not simulated
data: worms that spread without further human interaction can perhaps
be modeled and simulated, but human motives and specific interactions
cannot. It is no longer satisfactory to focus solely on the local
picture; there is a need to look globally across the Internet. While
the necessary data exists, tapping into thousands of data sources
effectively and sharing critical information intelligently, and to the
data owners' satisfaction, remains an open problem.

Indeed, the Department of Homeland Security has recognized the
importance of sharing information and has established Information Sharing
and Analysis Centers (ISACs) to facilitate the storage and sharing of
information about security threats \cite{ISAC}. The importance of log
sharing has also gained industry recognition, with investments in
infrastructure dedicated solely to this purpose across multiple
industry sectors \cite{NSSC}. The National Strategy to Secure
Cyberspace (NSSC) explicitly lists sharing as one of its highest
priorities: data sharing within the government, within industry
sectors, and between the government and industry. Of the eight action
items listed in the NSSC report, three relate directly to log data
sharing: Item 2, ``Provide for the development of tactical and
strategic analysis of cyber attacks and vulnerability assessments'';
Item 3, ``Encourage the development of a private sector capability to
share a synoptic view of the health of cyberspace''; and Item 8,
``Improve and enhance public/private information sharing involving
cyber-attacks, threats, and vulnerabilities''.

While it is well accepted that log sharing is important, it happens
on a very limited scale, if at all. We believe that the problem, while
social on the surface, is technical at heart. Organizations realize
the importance of sharing such data, but they are reluctant to do so
because sharing log data allows their networks to be ``mapped out''.
This exposure increases the risk for those who share. Anonymization of
logs is still in its infancy, and the technology to date does not meet
the needs of many organizations. Even the largest public source of
network traces \cite{CAIDA} contains logs anonymized in inconsistent
ways. Some data is anonymized and some is not. Anonymized data may
preserve prefixes, or it may simply truncate IP addresses. Even where
prefix-preserving anonymization is used, the mapping differs between
data collections recorded from the same source at different times.
Lastly, few data sets anonymize anything beyond IP addresses.
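
To make the distinction between truncation and prefix-preserving
anonymization concrete, the following Python sketch contrasts the two.
It is illustrative only and is not the scheme used by any data set
cited here: the function names, the example keys, and the choice of
HMAC-SHA256 as the keyed pseudorandom function are assumptions made for
this example. The prefix-preserving variant flips each address bit
according to a keyed function of the higher-order bits that precede it,
so two addresses that share a $k$-bit prefix map to addresses that
share exactly a $k$-bit prefix.

```python
import hashlib
import hmac
import ipaddress


def truncate(ip: str, keep_bits: int = 24) -> str:
    """Truncation: zero every bit after the first `keep_bits` bits."""
    value = int(ipaddress.IPv4Address(ip))
    mask = (0xFFFFFFFF << (32 - keep_bits)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value & mask))


def prefix_preserving(ip: str, key: bytes) -> str:
    """Keyed prefix-preserving mapping (hypothetical sketch, HMAC as PRF)."""
    value = int(ipaddress.IPv4Address(ip))
    result = 0
    for i in range(32):
        # The i high-order bits seen so far, tagged with their length so
        # that prefixes of different lengths are distinct PRF inputs.
        prefix = value >> (32 - i)
        message = bytes([i]) + prefix.to_bytes(4, "big")
        flip = hmac.new(key, message, hashlib.sha256).digest()[0] & 1
        bit = (value >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading bits on which two IPv4 addresses agree."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


if __name__ == "__main__":
    a, b = "192.168.1.10", "192.168.1.77"
    # Truncation collapses both hosts onto the same address.
    print(truncate(a), truncate(b))
    # Two hypothetical keys, standing in for two separate collections.
    for key in (b"collection-one", b"collection-two"):
        x, y = prefix_preserving(a, key), prefix_preserving(b, key)
        print(key.decode(), x, y, shared_prefix_len(x, y))
```

Because the mapping depends on the key, two collections anonymized
under different keys will in general map the same address differently,
which is the inconsistency described above. Truncation needs no key,
but it makes all hosts within the truncated range indistinguishable.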

We believe there is a need for standards in anonymization. There should
be defined levels of anonymization and methods to express those
levels succinctly. There must also be ways to map an organization's
needs, and its level of trust in other organizations, to the
appropriate anonymization level. In addition to defining these
levels, new methodologies and algorithms for anonymizing log data
should be developed.

The rest of this paper is organized as follows. Section 2 discusses
current efforts for sharing log data and related work in
anonymization. Section 3 covers the many types of logs and the
fields that could potentially be anonymized. Section 4
discusses attacks against today's immature anonymization
systems. Section 5 discusses our goals and vision for a new system of
anonymization. We conclude in Section 6.