\section{Introduction}

Log data is essential to security operations teams at any organization
large enough to have full-time security personnel. While intrusion
detection systems (IDSs) operate on streaming data, matching signatures
and producing alerts, human beings must still examine logs to
understand these alerts. Logs are also the core source of evidence for
computer forensic investigations following security incidents. The
current state of the art is for each autonomous organization to use
its log data to locally optimize network management and security
protection. For instance, an organization may block a particular IP
address only after it has itself been scanned from that address.
Administrators may miss the bigger picture and not see that they are
just one piece of a larger target. Furthermore, administrators may
start to scan their own network for a particular vulnerability only
after an attacker has exploited it on their systems. There are very few
cross-sectional views of the Internet, and until recently there have
been no mechanisms to enable such wider views. Moreover, current
examples of wide views, such as spam blacklists and worm signatures,
are often focused on a single characteristic even though the
signatures are gathered from events across the entire Internet.

Sharing data is, in fact, common among attackers. They trade zombies,
publicly post information on vulnerable systems and networks, and
coordinate attacks. Recent events at several U.S. supercomputing
centers \cite{SuperAttacks} have demonstrated coordinated attacks
against organizations that lack good mechanisms for data sharing and
log correlation. Real, not simulated, data is necessary. While worms
that propagate without further human interaction once released could
possibly be modeled and simulated, human motives and specific
interactions cannot. It is no longer satisfactory to focus solely on
the local picture; there is a need to look globally across the
Internet. While the data needed exists, tapping into thousands of data
sources effectively and sharing critical information intelligently,
and to the data owners' satisfaction, is an open problem.

The Department of Homeland Security has recognized the importance of
sharing information and has established Information Sharing and
Analysis Centers (ISACs) to facilitate the storage and sharing of
information about security threats \cite{ISAC}. The importance of log
sharing has also gained industry recognition, with investments in
infrastructure dedicated solely to this purpose across multiple
industry sectors \cite{NSSC}. The National Strategy to Secure
Cyberspace (NSSC) explicitly lists sharing as one of its highest
priorities: data sharing within the government, within industry
sectors, and between the government and industry. Of the eight action
items in the NSSC report, three are directly related to log data
sharing: Item 2, ``Provide for the development of tactical and
strategic analysis of cyber attacks and vulnerability assessments'';
Item 3, ``Encourage the development of a private sector capability to
share a synoptic view of the health of cyberspace''; and Item 8,
``Improve and enhance public/private information sharing involving
cyber-attacks, threats, and vulnerabilities''.

While it is well accepted that log sharing is important, it happens on
a very limited scale, if at all. We believe that the problem, while
social on the surface, is technical at heart. Organizations realize
the importance of sharing such data, but they are reluctant because
sharing log data allows their networks to be ``mapped out''. This
exposure increases the risk for those who share. Anonymization of logs
is still in its infancy, and the technology to date does not meet the
needs of many organizations. Even the largest public source of network
traces \cite{CAIDA} contains logs anonymized in inconsistent ways.
Some data is anonymized and some is not. Anonymized data may preserve
prefixes, or it may simply truncate IP addresses. Even where
prefix-preserving anonymization is used, the mapping differs between
data collections taken from the same source at different times.
Lastly, few data sets anonymize anything beyond IP addresses.
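To make the distinction between these schemes concrete, the following
is a sketch of one standard way to formalize them. The symbols $F$,
$F'$, $T_k$, $f_i$ and $n$ are notation introduced here purely for
illustration; they are assumptions and do not describe how any
particular data set was processed. Write an $n$-bit address as
$a = a_1 a_2 \ldots a_n$. Two addresses $a$ and $b$ share a $k$-bit
prefix if $a_i = b_i$ for all $i \le k$ and, when $k < n$,
$a_{k+1} \neq b_{k+1}$. A one-to-one mapping $F$ on $\{0,1\}^n$ is
prefix-preserving if, whenever $a$ and $b$ share a $k$-bit prefix, so
do $F(a)$ and $F(b)$. Any such mapping can be written bit by bit as
\[
  F(a) = a'_1 a'_2 \ldots a'_n, \qquad
  a'_i = a_i \oplus f_{i-1}(a_1 a_2 \ldots a_{i-1}),
\]
where each $f_{i-1}$ maps an $(i-1)$-bit prefix to a single bit and
$f_0$ is a constant. Truncation to $k$ bits, by contrast, is
\[
  T_k(a) = a_1 a_2 \ldots a_k \underbrace{0 \ldots 0}_{n-k},
\]
which is not one-to-one: all $2^{n-k}$ addresses that agree on their
first $k$ bits collapse to a single value. Finally, if two collections
are anonymized with different mappings $F$ and $F'$ (for example,
because a different key was used each time), then in general
$F(a) \neq F'(a)$, so the same host cannot be matched across the two
collections even though each collection is prefix-preserving on its
own.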

We believe there is a need for standards in anonymization. There
should be defined levels of anonymization and methods to express those
levels succinctly. There must also be ways to map an organization's
needs, and its level of trust in other organizations, to the
appropriate anonymization levels. In addition to defining these
levels, methodologies and algorithms should be developed to anonymize
log data in new ways.

The rest of this paper is organized as follows. Section 2 discusses
current efforts to share log data and related work in anonymization.
Section 3 covers the many types of logs and the different fields that
could potentially be anonymized. Section 4 discusses attacks against
current, still immature anonymization systems. Section 5 presents our
goals and vision for a new system of anonymization. We conclude in
Section 6.

84: anonymization. We conclude in section 6.