\documentclass[12pt]{article}
\usepackage{fullpage}
\usepackage{times}
\newcommand{\omt}[1]{}
\pagestyle{plain}

\begin{document}

\title{A Matter of Opinion:  Sentiment Analysis and Business
Intelligence \\ (position paper)}

\author{Lillian Lee,  Cornell University \\ http://www.cs.cornell.edu/home/llee}
\date{}
%\maketitle

\begin{center}
{\Large A Matter of Opinion:  Sentiment Analysis and Business
Intelligence (position paper)} \\
{\Large Lillian Lee,  Cornell University} \\

\medskip

Presented at the IBM Faculty Summit on the Architecture of On-Demand
Business, May 2004
\end{center}

\medskip

%\thispagestyle{empty}

\paragraph{Motivation}

In the novel {\em Hard Times}, Charles Dickens described the fictional
``Coketown'' as follows:
\begin{quotation}
Fact, fact, fact, everywhere in the material aspect of the town; fact,
fact, fact, everywhere in the immaterial. The M'Choakumchild
school was all fact, and the school of design was all fact, and the
relations between master and man were all fact, and {everything was
fact between the lying-in hospital and the cemetery, and what you
couldn't state in figures, or show to be purchasable in the
cheapest market and salable in the dearest, was not}, and never should
be, world without end, Amen.
\end{quotation}
In real-life business intelligence,
facts are of course very important, but {\em opinion} also plays a
crucial role.  Consider, for instance, the following scenario.
A major computer manufacturer, disappointed with unexpectedly low
sales, finds itself confronted with this question:
\begin{center}
Why aren't consumers buying our laptop?
\end{center}
While concrete data such as the laptop's weight or the
price of a competitor's model are obviously relevant, answering this
question requires focusing more on people's personal {\em views}
of such objective characteristics. Moreover,
subjective judgments regarding intangible qualities --- e.g., ``the
design is tacky'' or ``customer service was condescending'' --- or
even misperceptions --- ``updated device drivers aren't available'' ---
must be taken into account as well.

{\em Sentiment-analysis technologies} for extracting opinions from
unstructured human-authored documents would be excellent tools for
handling many business-intelligence tasks related to the one just
described.  Continuing with our example scenario: it would be
difficult to try to directly survey laptop purchasers who {\em
haven't} bought the company's product.  Rather, we could employ a system that
(a) finds reviews or other expressions of opinion on the Web ---
newsgroups, individual blogs, and aggregation sites such as
epinions.com are likely to be productive sources --- and then (b) creates
condensed versions of the reviews or a digest of the overall
consensus. This would save
the analyst from having to read potentially dozens or even hundreds of
versions of the same complaints.  Note that Internet sources can vary
wildly in form, tenor, and even grammaticality; this fact underscores
the need for robust techniques even when only one language (e.g.,
English) is considered.

\paragraph{Challenges in sentiment classification}

Given the multitude of potential applications,
researchers have been devoting more and more
attention to sentiment analysis.  Much of the current work is devoted
to {\em classification} problems: determining whether a particular
document or portion thereof is subjective or not, and/or determining
whether the opinion it expresses is positive or negative.  At first
blush, this might not appear so hard: one might expect that we need
simply look for obvious sentiment-indicating words, such as ``great''.
The difficulty lies in the richness
of human language use.  First, there can be
an amazingly large number of ways to say the same thing (especially,
it seems, when that thing is a negative perception); this complicates
the task of finding a high-coverage set of indicators.
Furthermore, the
same indicator may admit several different interpretations.
Consider, for example, the following sentences:
\begin{itemize}
\item This laptop is \underline{a great deal}.
\item \underline{A great deal} of media attention surrounded the release of the new
  laptop model.
\item If you think this laptop is \underline{a great deal},
I've got a nice bridge you might be interested in.
\end{itemize}
Each of these sentences contains the three words ``a great deal'', but the opinions expressed
are, respectively, positive, neutral, and negative.  The first two
sentences use the same phrase to mean different things.
The last sentence involves sarcasm, which, along with related rhetorical
devices, is an intrinsic feature of texts from unrestricted domains such as
blogs and newsgroup postings.

In general, researchers have adopted one of two approaches to
meeting the challenges that sentiment analysis presents.
Many groups are working to directly improve
the selection and interpretation of indicators through the
incorporation of linguistic knowledge; given the
subtleties of natural language, such efforts will be critical to building
operational systems.
Others have been pursuing a different tack:
employing {\em learning
algorithms} that can automatically infer from text samples what
indicators are useful.  Besides being potentially more cost-effective, more
easily ported to other domains and languages, and more robust to
grammatical mistakes, learning-based systems can also
discover indicators that humans might neglect.  For example, in our
own work, we
found that the phrase ``still,'' (comma included)
is a better indicator of positive sentiment than ``good'' --- a
typical instance of use would be a sentence like ``Still, despite these flaws, I'd go with this laptop''.
Nevertheless, it bears repeating that incorporating deep knowledge about language
will be absolutely crucial to developing systems capable of
high-quality (as opposed to merely high-throughput) sentiment analysis.
Both the linguistic and the learning approach have considerable
merits;
it seems very safe to say that the community will need to turn towards
finding ways to combine their advantages.

\paragraph{Related problems, new directions}

The classification problems discussed above only involve the
determination of sentiment.  However, there is growing interest in
capturing interactions between {\em subjectivity} and
{\em subject} --- we need to know not only what an author's opinion
is, but also what that opinion is about.  For example, while in a broad
sense a review of a particular laptop is only about one topic
(the laptop itself), it almost surely discusses various specific
aspects of the machine.
We would ideally like a sentiment-analysis system to reveal whether
there are particular features that the review's author disapproves of
even if his or her overall impression was positive.

Another interesting research direction of potentially great importance
is to integrate into sentiment analysis the notion of the {\em status}
of an opinion holder, perhaps via adaptation of the
hubs-and-authorities techniques used in Web search or link-analysis
methods in reputation systems.  For example,
we might want to identify {\em bellwethers} --- thought leaders
with enough influence that others explicitly adopt their opinions ---
or {\em barometers} --- those whose opinions
are generally held by the majority of the population of
interest. Tracking the views of these two types of people could both
streamline and enhance the process of gathering business intelligence
to a large degree.  Surely that sounds like a great deal!

\omt{Cornell (in general.  And, Claire; Thorsten = textcat)}

\end{document}