1: \documentclass[12pt]{article}
2: \usepackage{fullpage}
3: \usepackage{times}
4: \newcommand{\omt}[1]{}
5: \pagestyle{plain}
6:
7: \begin{document}
8:
9: \title{A Matter of Opinion: Sentiment Analysis and Business
10: Intelligence \\ (position paper)}
11:
12: \author{Lillian Lee, Cornell University \\ http://www.cs.cornell.edu/home/llee}
13: \date{}
14: %\maketitle
15:
16: \begin{center}
17: {\Large A Matter of Opinion: Sentiment Analysis and Business
18: Intelligence (position paper)} \\
19: {\Large Lillian Lee, Cornell University} \\
20:
21: \medskip
22:
23: Presented at the IBM Faculty Summit on the Architecture of On-Demand
24: Business, May 2004
25: \end{center}
26:
27: \medskip
28:
29: %\thispagestyle{empty}
30: \paragraph{Motivation}
31:
32: In the novel {\em Hard Times}, Charles Dickens described the fictional
33: ``Coketown'' as follows:
34: \begin{quotation}
35: Fact, fact, fact, everywhere in the material aspect of the town; fact,
36: fact, fact, everywhere in the immaterial. The M'Choakumchild
37: school was all fact, and the school of design was all fact, and the
38: relations between master and man were all fact, and {everything was
39: fact between the lying-in hospital and the cemetery, and what you
40: couldn't state in figures, or show to be purchasable in the
41: cheapest market and salable in the dearest, was not}, and never should
42: be, world without end, Amen.
43: \end{quotation}
44: In
45: real-life business intelligence,
46: facts are of course very important, but {\em opinion} also plays a
47: crucial role. Consider, for instance, the following scenario.
48: A major computer manufacturer, disappointed with unexpectedly low
49: sales, finds itself confronted with this question:
50: \begin{center}
51: Why aren't consumers buying our laptop?
52: \end{center}
53: While concrete data such as the laptop's weight or the
54: price of a competitor's model are obviously relevant, answering this
55: question requires focusing more on people's personal {\em views}
56: of such objective characteristics. Moreover,
57: subjective judgments regarding intangible qualities --- e.g., ``the
58: design is tacky'' or ``customer service was condescending'' --- or
59: even misperceptions ---
60: ``updated device drivers aren't available''
61: ---
62: must be taken into account as well.
63:
64: {\em Sentiment-analysis technologies} for extracting opinions from
65: unstructured human-authored documents would be excellent tools for
66: handling many business-intelligence tasks related to the one just
described. Continuing with our example scenario: it would be
difficult to directly survey laptop purchasers who {\em
haven't} bought the company's product. Rather, we could employ a system that
70: (a) finds reviews or other expressions of opinion on the Web ---
71: newsgroups, individual
72: blogs, and aggregation sites such as epinions.com are likely to be productive sources --- and then (b) creates
73: condensed versions of the reviews or a digest of the overall
74: consensus. This would save
75: the analyst from having to read potentially dozens or even hundreds of
76: versions of the same complaints. Note that Internet sources can vary
77: wildly in form, tenor, and even grammaticality; this fact underscores
78: the need for robust techniques even when only one language (e.g.,
79: English) is considered.
80:
81:
82:
83:
84:
85: \paragraph{Challenges in sentiment classification}
86:
87: Given the multitude of potential applications,
88: researchers have been devoting more and more
attention to sentiment analysis. Much of the current work focuses
on {\em classification} problems: determining whether a particular
91: document or portion thereof is subjective or not, and/or determining
92: whether the opinion it expresses is positive or negative. At first
93: blush, this might not appear so hard: one might expect that we need
94: simply look for obvious sentiment-indicating words, such as ``great''.
95: The
96: difficulty lies in the richness
97: of human language use. First, there can be
98: an amazingly large number of ways to say the same thing (especially,
99: it seems, when that thing is a negative perception); this complicates
100: the task of finding a high-coverage set of indicators.
101: Furthermore, the
102: same indicator may admit several different interpretations.
103: Consider, for example, the following sentences:
104: \begin{itemize}
105: \item This laptop is \underline{a great deal}.
106: \item \underline{A great deal} of media attention surrounded the release of the new
107: laptop model.
108: \item If you think this laptop is \underline{a great deal},
109: I've got a nice bridge you might be interested in.
110: \end{itemize}
111: Each of these sentences contains the three words ``a great deal'', but the opinions expressed
112: are, respectively, positive, neutral, and negative. The first two
113: sentences use the same phrase to mean different things.
114: The last sentence involves sarcasm, which, along with related rhetorical
115: devices, is an intrinsic feature of texts from unrestricted domains such as
116: blogs and newsgroup postings.
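
To make the naive word-spotting strategy concrete, consider the following
sketch. It is purely illustrative: the word lists $P$ and $N$ and the
scoring rule are assumptions introduced here, not a description of any
particular system. Fix a list $P$ of positive and a list $N$ of negative
indicator words, and score a document $d$ by counting matches:
\[
s(d) \;=\; \sum_{w \in P} c(w,d) \;-\; \sum_{w \in N} c(w,d),
\]
where $c(w,d)$ is the number of occurrences of $w$ in $d$; label $d$
positive if $s(d) > 0$ and negative if $s(d) < 0$. If ``great'' is in $P$
and no other word of the three sentences above appears in either list,
then each sentence receives $s(d) = 1$ and is labeled positive, although
only the first one actually expresses a positive opinion.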
117:
118: In general, researchers have adopted one of two approaches to
119: meeting the challenges that sentiment analysis presents.
120: Many groups are working to directly improve
121: the selection and interpretation of indicators through the
122: incorporation of linguistic knowledge; given the
123: subtleties of natural language, such efforts will be critical to building
124: operational systems.
125: Others have been pursuing a different tack:
126: employing {\em learning
127: algorithms} that can automatically infer from text samples what
128: indicators are useful. Besides being potentially more cost-effective, more
129: easily ported to other domains and languages, and more robust to
130: grammatical mistakes, learning-based systems can also
131: discover indicators that humans might neglect. For example, in our
132: own work, we
133: found that the phrase ``still,'' (comma included)
134: is a better indicator of positive sentiment than ``good'' --- a
135: typical instance of use would be a sentence like ``Still, despite these flaws, I'd go with this laptop''.
136: Nevertheless, it bears repeating that incorporating deep knowledge about language
137: will be absolutely crucial to developing systems capable of
138: high-quality (as opposed to merely high-throughput) sentiment analysis.
139: Both the linguistic and the learning approach have considerable
140: merits;
141: it seems very safe to say that the community will need to turn towards
142: finding ways to combine their advantages.
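
As a rough illustration of the learning-based alternative, suppose each
document $d$ is represented by binary features $f_1(d), \ldots, f_m(d)$,
where $f_j(d) = 1$ if the $j$th candidate word or phrase occurs in $d$
and $f_j(d) = 0$ otherwise. The linear form below is an assumption chosen
for simplicity; it is not necessarily the model used in the work
mentioned above. A linear classifier predicts
\[
\hat{y}(d) \;=\; \mathrm{sign}\Bigl( b + \sum_{j=1}^{m} \lambda_j f_j(d) \Bigr),
\]
with the weights $\lambda_j$ and the offset $b$ estimated from labeled
sample documents rather than set by hand. In such a model, a finding like
the one about ``still,'' could show up as a larger learned weight for the
feature ``still,'' than for the feature ``good''.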
143:
144: \paragraph{Related problems, new directions}
145:
146:
The classification problems discussed above involve only the
determination of sentiment. However, there is growing interest in
capturing interactions between {\em subjectivity} and
{\em subject}: we need to know not only what an author's opinion
is, but also what that opinion is about. For example, while in a broad
152: sense a review of a particular laptop is only about one topic
153: (the laptop itself), it almost surely discusses various specific
154: aspects of
155: the machine.
156: We would ideally like a sentiment-analysis system to reveal whether
157: there are particular features that the review's author disapproves of
158: even if his or her overall impression was positive.
159:
160:
161: Another interesting research direction of potentially great importance
162: is to integrate into sentiment analysis the notion of the {\em status}
163: of an opinion holder, perhaps via adaptation of the
164: hubs-and-authorities techniques used in Web search or link-analysis
165: methods in reputation systems. For example,
166: we might want to identify {\em bellwethers} --- thought leaders
167: with enough influence that others explicitly adopt their opinions ---
168: or {\em barometers} --- those whose opinions
169: are generally held by the majority of the population of
170: interest. Tracking the views of these two types of people could both
171: streamline and enhance the process of gathering business intelligence
172: to a large degree. Surely that sounds like a great deal!
173:
174:
175: \omt{Cornell (in general. And, Claire; Thorsten = textcat)}
176:
177:
178: \end{document}
179: