1: \documentclass[11pt]{article}
2: \usepackage{fullname}
3: \usepackage{fullpage}
4: \usepackage{epsfig}
5: \usepackage{times}
6:
7: \begin{document}
8: \title{``I'm sorry Dave, I'm afraid I can't do that'':
9: Linguistics, Statistics, and Natural Language Processing circa 2001\thanks{To appear in the National Research Council
10: study on the Fundamentals of Computer Science. This is an April 2003 version.} }
11: \author{Lillian Lee, Cornell University}
12: \date{}
13: \maketitle
14:
15: \bibliographystyle{fullname}
16:
17: \begin{quote}
18: {\em It's the year 2000, but
19: where are the flying cars? I was promised flying cars.} \\ \hspace*{2.5in} -- Avery
20: Brooks, IBM commercial
21: \end{quote}
22: According to many pop-culture visions of the future, technology
23: will eventually produce the Machine that Can Speak to Us. Examples
24: range from the False Maria in Fritz Lang's 1927 film {\em Metropolis}
25: to {\em Knight Rider}'s KITT (a {\em talking} car) to {\em Star Wars}'
26: C-3PO (said to have been modeled on the False Maria). And, of course,
27: there is the HAL 9000 computer from {\em 2001: A Space Odyssey}; in
28: one of the film's most famous scenes, the astronaut Dave asks HAL
29: to open a pod bay door on the spacecraft, to which HAL responds, ``I'm
30: sorry Dave, I'm afraid I can't do that''.
31:
32: Natural language processing, or NLP, is the field of computer science
33: devoted to creating such machines --- that is, enabling computers to
34: use human languages both as input and as output. The area is quite
35: broad, encompassing problems ranging from simultaneous multi-language
36: translation to advanced search engine development to the design of
37: computer interfaces capable of combining speech,
38: diagrams, and other modalities simultaneously. A natural consequence of this wide range of
39: inquiry is the integration of ideas from computer science with work
40: from many other fields,
41: including
42: {\em linguistics}, which provides models of language;
43: {\em psychology}, which provides models of cognitive processes;
44: {\em information theory}, which provides models of communication; and
45: {\em mathematics and statistics}, which provide tools for analyzing
46: and acquiring such models.
47:
48: The interaction of these ideas together with advances in machine
49: learning (see [other chapter]) has resulted in concerted research
50: activity in {\em statistical natural language processing}: making
51: computers language-enabled by having them acquire linguistic
52: information directly from samples of language itself. In this essay,
53: we describe the history of statistical NLP; the twists
54: and turns of the story serve to highlight the sometimes complex
55: interplay between computer science and other fields.
56:
57: Although currently a major focus of research, the data-driven,
58: computational approach to language processing was for some time held
59: in deep disregard because it directly conflicts with
60: another commonly held viewpoint:
61: human language is so complex that language samples alone seemingly
62: cannot yield enough information to understand it. Indeed, it is often
63: said that NLP is ``AI-complete'' (a pun on NP-completeness; see [other
64: chapter]), meaning that the most difficult problems in artificial
65: intelligence manifest themselves in human language phenomena. This
66: belief in language use as the touchstone of intelligent behavior dates
67: back at least to the 1950 proposal of the Turing Test\footnote{Roughly
68: speaking, a computer will have passed the Turing Test if it can engage
69: in conversations indistinguishable from those of a human.} as a way
70: to gauge whether machine intelligence has been achieved; as Turing
71: wrote, ``The question and answer method seems to be suitable for
72: introducing almost any one of the fields of human endeavour that we
73: wish to include''.
74:
75: The reader might be somewhat surprised to hear that language
76: understanding is so hard. After all, human children get the hang of
77: it in a few years, word processing software now corrects (some of) our
78: grammatical errors, and TV ads show us phones capable of effortless
79: translation. One might therefore be led to believe that HAL is just
80: around the corner.
81:
82: Such is not the case, however. In order to appreciate this point, we
83: temporarily digress from describing statistical NLP's history --- which touches upon Hamilton versus Madison, the
84: sleeping habits of colorless green ideas, and what happens when one
85: fires a linguist --- to examine a few examples illustrating
86: why understanding human language is such a difficult problem.
87:
88: \section*{Ambiguity and language analysis}
89:
90: \begin{quote}
91: {\em At last, a computer that understands you like your mother.}\\
92: \hspace*{2.5in} -- 1985 McDonnell-Douglas ad
93: \end{quote}
94:
95: The snippet quoted above indicates the early confidence at
96: least one company had in the feasibility of getting computers to
97: understand human language.
98: But in fact, that very sentence is illustrative of the host of
99: difficulties that arise in trying to analyze human utterances, and so,
100: ironically, it
101: is quite unlikely that the system being promoted
102: would have been up to the task. A moment's reflection reveals
103: that the sentence admits at least three different interpretations:
104: \begin{enumerate}
105: \item The computer understands you as well as your mother understands
106: you.
107: \item The computer understands that you like your mother.
108: \item The computer understands you as well as it understands your mother.
109: \end{enumerate}
110: That is, the sentence is {\em ambiguous}; and yet we humans seem to
111: instantaneously rule out all the alternatives except the first (and
112: presumably the intended) one.
113: We do so based on a great deal of background knowledge, including
114: understanding what advertisements typically try to convince us of.
115: How are we to get such information into a computer?
116:
117: A number of other types of ambiguity are also lurking here. For
118: example, consider the speech recognition problem: how can we
119: distinguish between this utterance, when spoken, and ``... a computer
120: that understands your lie cured mother''? We also have a word sense
121: ambiguity problem: how do we know that here ``mother'' means ``a
122: female parent'', rather than the Oxford English Dictionary-approved
123: alternative of ``a cask or vat used in vinegar-making''? Again, it is
124: our broad knowledge about the world and the context of the remark that
125: allows us humans to make these decisions easily.
126:
127: Now, one might be tempted to think that all these ambiguities arise
128: because our example sentence is highly unusual (although the ad
129: writers probably did not set out to craft a strange sentence). Or,
130: one might argue that these ambiguities are somehow artificial because
131: the alternative interpretations are so unrealistic that
132: an NLP system could easily filter them out. But ambiguities crop up in many
133: situations. For example, in ``Copy the local patient files to disk''
134: (which seems like a perfectly plausible command to issue to a computer),
135: is it the patients or the files that are local?\footnote{Or, perhaps,
136: the files themselves are patient? But our knowledge about the world
137: rules this possibility out.} Again, we need to
138: know the specifics of the situation in order to decide. And in
139: multilingual settings, extra ambiguities may arise. Here is a
140: sequence of seven Japanese characters:
141: \begin{center}
142: \psfig{figure=shachoh_unsegmented.eps,width=1.7in}
143: \end{center}
144: Since Japanese doesn't have spaces between words, one is faced with
145: the initial task of deciding what the component words are. In
146: particular, this character sequence corresponds to at least two
147: possible word sequences, ``president, both, business,
148: general-manager'' (= ``a president as well as a general manager of
149: business'') and ``president, subsidiary-business, Tsutomu (a name),
150: general-manager'' (= ?).
151: It requires a fair bit of linguistic
152: information to choose the correct alternative.\footnote{To take an
153: analogous example in English, consider the
154: non-word-delimited sequence of letters
155: ``{theyouthevent}''. This
156: corresponds to the word sequences ``the youth event'', ``they out he
157: vent'', and ``the you the vent''.}
158:
159: To sum up, we see that the NLP task is highly daunting, for to
160: resolve the many ambiguities that arise in trying to analyze even a
161: single sentence requires deep knowledge not just about language but
162: also about the world. And so when HAL says, ``I'm afraid I can't do
163: that'', NLP researchers are tempted to respond, ``I'm afraid you might
164: be right''.
165:
166:
167: \section*{Firth things first}
168:
169: But before we assume that the only viable approach to NLP is a massive
170: knowledge engineering project, let us go back to the early approaches
171: to the problem. In the 1940s and 1950s, one prominent trend in
172: linguistics was explicitly empirical and in particular distributional,
173: as exemplified by the work of Zellig Harris (who started the first
174: linguistics program in the USA). The idea was that
175: correlations (co-occurrences) found in language data are important
176: sources of information, or, as the influential linguist J. R. Firth
177: declared in 1957, ``You shall know a word by the company it keeps''.
178:
179: Such notions accord quite happily with ideas put forth by Claude
180: Shannon in his landmark 1948 paper establishing the field of
181: information theory; speaking from an engineering perspective, he
182: identified the probability of a message's being chosen from among
183: several alternatives, rather than the message's actual content, as its
184: critical characteristic. Influenced by this work, Warren Weaver in
185: 1949 proposed treating the problem of translating between languages as
186: an application of cryptography (see [other chapter]), with one
187: language viewed as an encrypted form of another. Moreover, Alan Turing's
188: work on cracking German codes during World War II led to the
189: development of the Good-Turing formula, an important tool for
190: computing certain statistical properties of language.
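
As a rough sketch of what the Good-Turing formula computes (the notation $r$, $r^{*}$, $N_r$, and $N$ is introduced here for illustration only): suppose a sample contains $N$ word occurrences in total, and let $N_r$ be the number of distinct words that occur exactly $r$ times. In its basic form, the formula replaces an observed count $r$ by the adjusted count
\[
r^{*} = (r+1)\,\frac{N_{r+1}}{N_r},
\]
which in turn suggests that roughly a fraction $N_1/N$ of the total probability should be set aside for words that never appeared in the sample at all. Practical versions typically smooth the $N_r$ values first, since they become erratic for large $r$.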
191:
192: In yet a third area, 1941 saw the statisticians Frederick Mosteller
193: and Frederick Williams address the question of whether it was
194: Alexander Hamilton or James Madison who wrote some of the pseudonymous
195: Federalist Papers. Unlike previous attempts, which were based on
196: historical data and arguments, Mosteller and Williams used the
197: patterns of word occurrences in the texts as evidence. This work led
198: up to the famed Mosteller and Wallace statistical study which many
199: consider to have settled the authorship of the disputed papers.
200:
201: Thus, we see arising independently from a variety of fields the idea
202: that language can be viewed from a data-driven, empirical perspective
203: --- and a data-driven perspective leads naturally to a computational
204: perspective.
205:
206: \section*{A ``C'' change}
207:
208: However, data-driven approaches fell out of favor in the late 1950s.
209: One of the commonly cited factors is a 1957 argument by linguist (and
210: student of Harris) Noam Chomsky, who believed that language behavior
211: should be analyzed at a much deeper level than its surface
212: statistics. He claimed,
213: \begin{quote}
214: It is fair to assume that neither sentence (1) [Colorless green
215: ideas sleep furiously] nor (2) [Furiously sleep ideas green
216: colorless] ... has ever occurred .... Hence, in any [computed]
217: statistical model ... these sentences will be ruled out on identical
218: grounds as equally ``remote'' from English.\footnote{Interestingly,
219: this claim has become so famous as to be self-negating, as simple
220: web searches on ``Colorless green ideas sleep furiously'' and its
221: reversal will show.} Yet (1), though nonsensical, is grammatical,
222: while (2) is not.
223: \end{quote}
224: That is, we humans know that sentence (1), which at least obeys (some)
225: rules of grammar, is indeed more probable than (2), which is just word
226: salad; but (the claim goes), since both sentences are so rare, they
227: will have identical statistics --- i.e., a frequency of zero --- in
228: any sample of English. Chomsky's criticism is essentially that
229: data-driven approaches will always suffer from a lack of data, and
230: hence are doomed to failure.
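
To make the argument concrete, consider the simplest possible estimator, which is an illustrative assumption on our part rather than something Chomsky specified: estimate the probability of a sentence $s$ by its relative frequency in a sample of $N$ sentences, with $c(s)$ denoting the number of times $s$ occurs in the sample. Writing $s_1$ and $s_2$ for sentences (1) and (2), if neither has ever been observed, then
\[
\hat{P}(s_1) = \frac{c(s_1)}{N} = \frac{0}{N} = 0 = \frac{c(s_2)}{N} = \hat{P}(s_2),
\]
so this estimator cannot tell the grammatical sentence from the word salad for as long as both counts stay at zero.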
231:
232: This observation turned out to be remarkably prescient: even now, when
233: billions of words of text are available on-line, many perfectly reasonable
234: phrases never appear in them. Thus, the so-called {\em sparse data problem}
235: remains a serious challenge for statistical NLP.
236: And so, the effect of Chomsky's claim, together with some
237: negative results for machine learning and a general lack of computing
238: power at the time, was to cause researchers to turn away from
239: empirical approaches and toward {\em knowledge-based} approaches where
240: human experts encoded relevant information in computer-usable form.
241:
242:
243: This change in perspective led to several new lines of fundamental,
244: interdisciplinary research. For example, Chomsky's work viewing
245: language as a formal, mathematically-describable object has had
246: lasting impact on both linguistics and computer science; indeed, the
247: {\em Chomsky hierarchy}, a sequence of increasingly powerful
248: classes of grammars, is a staple of the undergraduate computer science
249: curriculum. Conversely, the highly influential work of, among others,
250: Kazimierz Ajdukiewicz, Joachim Lambek, David K. Lewis, and Richard Montague
251: adopted the {\em lambda calculus}, a fundamental concept in the study
252: of programming languages, to model the semantics of natural languages.
253:
254:
255: \section*{The empiricists strike back}
256:
257: By the '80s, the tide had begun to shift once again, in part
258: because of the work done by the speech recognition group at IBM. These
259: researchers, influenced by ideas from information theory, explored the
260: power of probabilistic models of language combined with access to much
261: more sophisticated algorithmic and data resources than had previously been
262: available. In the realm of speech recognition, their ideas form the
263: core of the design of modern systems; and given the recent successes
264: of such software --- large-vocabulary continuous-speech recognition
265: programs are now available on the market --- it behooves us to examine
266: how these systems work.
267:
268: Given some acoustic signal, which we denote by the variable $a$, we
269: can think of the speech recognition problem as that of transcription:
270: determining what sentence is most likely to have produced $a$.
271: Probabilities arise because of the ever-present problem of ambiguity:
272: as mentioned above, several word sequences, such as ``your lie cured
273: mother'' versus ``you like your mother'', can give rise to similar
274: spoken output. Therefore, modern speech recognition systems
275: incorporate information both about the acoustic signal and the
276: language behind the signal. More specifically, they rephrase the
277: problem as determining which sentence $s$ maximizes the product
278: $P(a|s)\times P(s)$. The first term measures how likely the acoustic
279: signal would be if $s$ were actually the sentence being uttered
280: (again, we use probabilities because humans don't pronounce words the
281: same way all the time). The second term measures the probability of
282: the sentence $s$ itself; for example, as Chomsky noted, ``colorless
283: green ideas sleep furiously'' is intuitively more likely to be uttered
284: than the reversal of the phrase. It is in computing this second term,
285: $P(s)$, where statistical NLP techniques come into play, since
286: accurate estimation of these sentence probabilities requires
287: developing probabilistic models of language. These models are
288: acquired by processing tens of millions of words or more.
289: This is by no means a simple procedure; even linguistically naive
290: models require the use of sophisticated computational and statistical
291: techniques because of the sparse data problem foreseen by Chomsky.
292: But using probabilistic models, large datasets, and powerful learning
293: algorithms (both for $P(s)$ and $P(a|s)$) has led to our achieving the
294: milestone of commercial-grade speech recognition products capable of
295: handling continuous speech ranging over a large vocabulary.
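
The rephrasing mentioned above can be reconstructed with Bayes' rule; what follows is a sketch of the standard derivation, in which the notation $\hat{s}$ for the chosen transcription is ours. The most likely sentence given the signal is
\[
\hat{s} = \arg\max_{s} P(s|a) = \arg\max_{s} \frac{P(a|s)\times P(s)}{P(a)} = \arg\max_{s} P(a|s)\times P(s),
\]
where the last step holds because $P(a)$ does not depend on $s$ and hence cannot change which sentence attains the maximum. To see how the second term can tip the balance, take some purely invented numbers: let $s_1$ be ``your lie cured mother'' with $P(a|s_1) = 0.3$ and $P(s_1) = 10^{-12}$, and let $s_2$ be ``you like your mother'' with $P(a|s_2) = 0.2$ and $P(s_2) = 10^{-8}$. Then
\[
P(a|s_1)\times P(s_1) = 3\times 10^{-13} \quad\mbox{and}\quad P(a|s_2)\times P(s_2) = 2\times 10^{-9},
\]
so $s_2$ is preferred even though $s_1$ matches the acoustic signal slightly better.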
296:
297:
298: But let us return to our story. Buoyed by the successes in speech
299: recognition in the '70s and '80s (substantial performance gains over
300: knowledge-based systems were posted), researchers began applying
301: data-driven approaches to many problems in natural language
302: processing, in a turn-around so extreme that it has been deemed a
303: ``revolution''. Indeed, now empirical methods are used at all levels
304: of language analysis. This is not just due to increased resources: a
305: succession of breakthroughs in machine learning algorithms has
306: allowed us to leverage existing resources much more effectively.
307: At the same time, evidence from psychology shows that human learning
308: may be more statistically based than previously thought; for instance,
309: work by Jenny Saffran, Richard Aslin, and Elissa Newport reveals that
310: 8-month-old infants can learn to divide continuous speech into word
311: segments based simply on the statistics of sounds following one
312: another. Hence, it seems that the ``revolution'' is here to stay.
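
One simple statistic of this kind, offered here as an illustrative sketch with notation of our own, is the transitional probability between adjacent syllables $x$ and $y$:
\[
P(y|x) = \frac{\mbox{frequency of the pair } xy}{\mbox{frequency of } x}.
\]
Syllable pairs inside a word tend to have high transitional probability, whereas pairs that straddle a word boundary tend to have lower values, so dips in this quantity mark candidate word boundaries.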
313:
314:
315: Of course, we must not go overboard and mistakenly conclude that the
316: successes of statistical NLP render linguistics irrelevant (rash
317: statements to this effect have been made in the past, e.g., the
318: notorious remark, ``Every time I fire a linguist, my performance goes
319: up''). The information and insight that linguists, psychologists, and
320: others have gathered about language is invaluable in creating
321: high-performance broad-domain language understanding systems; for
322: instance, in the speech recognition setting described above, a better
323: understanding of language structure can lead to better language
324: models. Moreover, truly interdisciplinary research has furthered our
325: understanding of the human language faculty. One important example of
326: this is the development of the {\em head-driven phrase structure
327: grammar} (HPSG) formalism --- this is a way of analyzing natural
328: language utterances that truly marries deep linguistic information
329: with computer science mechanisms, such as unification and recursive
330: data-types, for representing and propagating this information
331: throughout the utterance's structure. In sum, computational
332: techniques and data-driven methods are now an integral part both of
333: building systems capable of handling language in a domain-independent,
334: flexible, and graceful way, and of improving our understanding of
335: language itself.
336:
337: \subsection*{Acknowledgments} Thanks to the members of the CSTB
338: Fundamentals of Computer Science study --- and especially Alan
339: Biermann --- for their helpful feedback. Also, thanks to Alex Acero,
340: Takako Aikawa, Mike Bailey, Regina Barzilay, Eric Brill, Chris
341: Brockett, Claire Cardie, Joshua Goodman, Ed Hovy, Rebecca Hwa, John
342: Lafferty, Bob Moore, Greg Morrisett, Fernando Pereira, Hisami Suzuki,
343: and many others for stimulating discussions and very useful comments.
344: Rie Kubota Ando provided the Japanese example. The use of the term
345: ``revolution'' to describe the re-ascendance of statistical methods
346: comes from Julia Hirschberg's 1998 invited address to the American
347: Association for Artificial Intelligence. I learned of the
348: McDonnell-Douglas ad and some of its analyses from a class run by
349: Stuart Shieber. All errors are mine alone. This paper is based upon
350: work supported in part by the National Science Foundation under ITR/IM
351: grant IIS-0081334 and a Sloan Research Fellowship. Any opinions,
352: findings, and conclusions or recommendations expressed above are those
353: of the author and do not necessarily reflect the views of the
354: National Science Foundation or the Sloan Foundation.
355:
356: \begin{thebibliography}{}
357:
358: \bibitem{Adjukiewicz:35a}
359: Ajdukiewicz, Kazimierz.
360: \newblock 1935.
361: \newblock {Die syntaktische Konnexit\"at}.
362: \newblock {\em Studia Philosophica}, 1:1--27.
363: \newblock {English translation available in Storrs McCall, editor, {\em Polish
364: Logic 1920--1939}, Clarendon Press (1967).}
365:
366:
367: \bibitem{Chomsky:57a}
368: Chomsky, Noam.
369: \newblock 1957.
370: \newblock {\em Syntactic Structures}.
371: \newblock Number~IV in Janua Linguarum. Mouton, The Hague, The Netherlands.
372:
373: \bibitem{Firth:57a}
374: Firth, John~Rupert.
375: \newblock 1957.
376: \newblock A synopsis of linguistic theory 1930--1955.
377: \newblock In the Philological Society's {\em Studies in Linguistic
378: Analysis}. Blackwell, Oxford, pages 1--32.
379: \newblock Reprinted in {\em Selected Papers of J. R. Firth}, edited by F.
380: Palmer. Longman, 1968.
381:
382: \bibitem{Good:53a}
383: Good, Irving~J.
384: \newblock 1953.
385: \newblock The population frequencies of species and the estimation of
386: population parameters.
387: \newblock {\em Biometrika}, 40(3,4):237--264.
388:
389: \bibitem{Harris:51a}
390: Harris, Zellig.
391: \newblock 1951.
392: \newblock {\em Methods in Structural Linguistics}.
393: \newblock University of Chicago Press.
394: \newblock Reprinted by Phoenix Books in 1960 under the title {\em Structural
395: Linguistics}.
396:
397: \bibitem{Lambek:58a}
398: Lambek, Joachim.
399: \newblock 1958.
400: \newblock {The mathematics of sentence structure}.
401: \newblock {\em American Mathematical Monthly}, 65:154--169.
402:
403: \bibitem{Lewis:70a}
404: Lewis, David K.
405: \newblock 1970.
406: \newblock {General semantics}.
407: \newblock {\em Synthese}, 22:18--67.
408:
409:
410: \bibitem{Montague:74a}
411: Montague, Richard.
412: \newblock 1974.
413: \newblock {\em Formal Philosophy: Selected Papers of {Richard Montague}}.
414: \newblock Yale University Press.
415: \newblock Edited by Richmond H. Thomason.
416:
417: \bibitem{Mosteller+Wallace:84a}
418: Mosteller, Frederick and David~L. Wallace.
419: \newblock 1984.
420: \newblock {\em Applied Bayesian and Classical Inference: The Case of the
421: Federalist Papers}.
422: \newblock Springer-Verlag.
423: \newblock First edition published in 1964 under the title {\em Inference and
424: Disputed Authorship: The Federalist}.
425:
426: \bibitem{Pollard+Sag:94a}
427: Pollard, Carl and Ivan Sag.
428: \newblock 1994.
429: \newblock {\em Head-Driven Phrase Structure Grammar}.
430: \newblock University of Chicago Press and CSLI Publications.
431:
432: \bibitem{Saffran+Aslin+Newport:96a}
433: Saffran, Jenny~R., Richard~N. Aslin, and Elissa~L. Newport.
434: \newblock 1996.
435: \newblock Statistical learning by 8-month-old infants.
436: \newblock {\em Science}, 274(5294):1926--1928, December.
437:
438: \bibitem{Shannon:48a}
439: Shannon, Claude~E.
440: \newblock 1948.
441: \newblock A mathematical theory of communication.
442: \newblock {\em Bell System Technical Journal}, 27:379--423 and 623--656.
443:
444: \bibitem{Turing:50a}
445: Turing, Alan~M.
446: \newblock 1950.
447: \newblock Computing machinery and intelligence.
448: \newblock {\em Mind}, LIX:433--460.
449:
450: \bibitem{Weaver:49a}
451: Weaver, Warren.
452: \newblock 1949.
453: \newblock Translation.
454: \newblock Memorandum. Reprinted in W.N. Locke and A.D. Booth, eds., {\em
455: Machine Translation of Languages: Fourteen Essays}, MIT Press, 1955.
456:
457: \end{thebibliography}
458:
459:
460: \section*{For further reading}
461: \newcommand{\myind}{\hspace*{.3in}}
462:
463: \noindent Charniak, Eugene.
464: \newblock 1993.
465: \newblock {\em Statistical Language Learning}.
466: \newblock MIT Press.
467:
468: \bigskip
469:
470: \noindent Jurafsky, Daniel and James~H. Martin.
471: \newblock 2000.
472: \newblock {\em Speech and Language Processing: An Introduction to Natural
473: \\ \myind Language Processing, Computational Linguistics, and Speech Recognition}.
474: \newblock Prentice Hall.
475: \newblock Contribut- \\ \myind ing writers: Andrew Kehler, Keith Vander Linden, and Nigel
476: Ward.
477:
478: \bigskip
479:
480: \noindent Manning, Christopher~D. and Hinrich Sch\"{u}tze.
481: \newblock 1999.
482: \newblock {\em Foundations of Statistical Natural Language
483: Process-\\ \myind ing}. The MIT Press.
484:
485: \end{document}
486: