1: \documentclass[11pt]{article}

2: \usepackage{fullname}

3: \usepackage{fullpage}

4: \usepackage{epsfig}

5: \usepackage{times}

6:

7: \begin{document}

8: \title{``I'm sorry Dave, I'm afraid I can't do that'':

9: Linguistics, Statistics, and Natural Language Processing circa 2001\thanks{To appear in the National Research Council

10: study on the Fundamentals of Computer Science.  This is an April 2003 version.} }

11: \author{Lillian Lee, Cornell University}

12: \date{}

13: \maketitle

14:

15: \bibliographystyle{fullname}

16:

17: \begin{quote}

18: {\em It's the year 2000, but

19: where are the flying cars?  I was promised flying cars.} \\ \hspace*{2.5in} -- Avery

20: Brooks, IBM commercial

21: \end{quote}

22: According to many pop-culture visions of the future, technology

23: will eventually produce the Machine that Can Speak to Us.  Examples

24: range from the False Maria in Fritz Lang's 1927 film {\em Metropolis}

25: to {\em Knight Rider}'s KITT (a {\em talking} car) to {\em Star Wars}'

26: C-3PO (said to have been modeled on the False Maria).  And, of course,

27: there is the HAL 9000 computer from {\em 2001: A Space Odyssey}; in

28: one of the film's most famous scenes, the astronaut Dave asks HAL

29: to open a pod bay door on the spacecraft, to which HAL responds, ``I'm

30: sorry Dave, I'm afraid I can't do that''.

31:

32: Natural language processing, or NLP, is the field of computer science

33: devoted to creating such machines --- that is, enabling computers to

34: use human languages both as input and as output.  The area is quite

35: broad, encompassing problems ranging from simultaneous multi-language

36: translation to advanced search engine development to the design of

37: computer interfaces capable of combining speech,

38: diagrams, and other modalities simultaneously.  A natural consequence of this wide range of

39: inquiry is the integration of ideas from computer science with work

40: from many other fields,

41: including

42: {\em linguistics}, which provides models of language;

43: {\em psychology}, which provides models of cognitive processes;

44: {\em information theory}, which provides models of communication; and

45: {\em mathematics and statistics}, which provide tools for analyzing

46: and acquiring such models.

47:

48: The interaction of these ideas together with advances in machine

49: learning (see [other chapter]) has resulted in concerted research

50: activity in {\em statistical natural language processing}: making

51: computers language-enabled by having them acquire linguistic

52: information directly from samples of language itself.  In this essay,

53: we describe the history of statistical NLP; the twists

54: and turns of the story serve to highlight the sometimes complex

55: interplay between computer science and other fields.

56:

57: Although currently a major focus of research, the data-driven,

58: computational approach to language processing was for some time held

59: in deep disregard because it directly conflicts with

60: another commonly held viewpoint:

61: human language is so complex that language samples alone seemingly

62: cannot yield enough information to understand it.  Indeed, it is often

63: said that NLP is ``AI-complete'' (a pun on NP-completeness; see [other

64: chapter]), meaning that the most difficult problems in artificial

65: intelligence manifest themselves in human language phenomena. This

66: belief in language use as the touchstone of intelligent behavior dates

67: back at least to the 1950 proposal of the Turing Test\footnote{Roughly

68: speaking, a computer will have passed the Turing Test if it can engage

69: in conversations indistinguishable from those of a human.} as a way

70: to gauge whether machine intelligence has been achieved; as Turing

71: wrote, ``The question and answer method seems to be suitable for

72: introducing almost any one of the fields of human endeavour that we

73: wish to include''.

74:

75: The reader might be somewhat surprised to hear that language

76: understanding is so hard.  After all, human children get the hang of

77: it in a few years, word processing software now corrects (some of) our

78: grammatical errors, and TV ads show us phones capable of effortless

79: translation. One might therefore be led to believe that HAL is just

80: around the corner.

81:

82: Such is not the case, however. In order to appreciate this point, we

83: temporarily digress from describing statistical NLP's history --- which touches upon Hamilton versus Madison, the

84: sleeping habits of colorless green ideas, and what happens when one

85: fires a linguist --- to examine a few examples illustrating

86: why understanding human language is such a difficult problem.

87:

88: \section*{Ambiguity and language analysis}

89:

90: \begin{quote}

91: {\em At last, a computer that understands you like your mother.}\\

92: \hspace*{2.5in} -- 1985 McDonnell-Douglas ad

93: \end{quote}

94:

95: The snippet quoted above indicates the early confidence at

96: least one company had in the feasibility of getting computers to

97: understand human language.

98: But in fact, that very sentence is illustrative of the host of

99: difficulties that arise in trying to analyze human utterances, and so,

100: ironically, it

101: is quite unlikely that  the system being promoted

102: would have been up to the task.  A moment's reflection reveals

103: that the sentence admits at least three different interpretations:

104: \begin{enumerate}

105:   \item The computer understands you as well as your mother understands

106:   you.

107:   \item  The computer understands that you like your mother.

108:   \item  The computer understands you as well as it understands your mother.

109: \end{enumerate}

110: That is, the sentence is {\em ambiguous}; and yet we humans seem to

111: instantaneously rule out all the alternatives except the first (and

112: presumably the intended) one.

113: We do so based on a great deal of background knowledge, including

114: understanding what advertisements typically try to convince us of.

115: How are we to get such information into a computer?

116:

117: A number of other types of ambiguity are also lurking here.  For

118: example, consider the speech recognition problem: how can we

119: distinguish between this utterance, when spoken, and ``... a computer

120: that understands your lie cured mother''?  We also have a word sense

121: ambiguity problem: how do we know that here ``mother'' means ``a

122: female parent'', rather than the Oxford English Dictionary-approved

123: alternative of ``a cask or vat used in vinegar-making''? Again, it is

124: our broad knowledge about the world and the context of the remark that

125: allows us humans to make these decisions easily.

126:

127: Now, one might be tempted to think that all these ambiguities arise

128: because our example sentence is highly unusual (although the ad

129: writers probably did not set out to craft a strange sentence).  Or,

130: one might argue that these ambiguities are somehow artificial because

131: the alternative interpretations are so unrealistic that

132: an NLP system could easily filter them out.  But ambiguities crop up in many

133: situations.  For example, in ``Copy the local patient files to disk''

134: (which seems like a perfectly plausible command to issue to a computer),

135: is it the patients or the files that are local?\footnote{Or, perhaps,

136: the files themselves are patient?  But our knowledge about the world

137: rules this possibility out.}  Again, we need to

138: know the specifics of the situation in order to decide.  And in

139: multilingual settings, extra ambiguities may arise.  Here is a

140: sequence of seven Japanese characters:

141: \begin{center}

142: \psfig{figure=shachoh_unsegmented.eps,width=1.7in}

143: \end{center}

144: Since Japanese doesn't have spaces between words, one is faced with

145: the initial task of deciding what the component words are. In

146: particular, this character sequence corresponds to at least two

147: possible word sequences, ``president, both, business,

148: general-manager'' (= ``a president as well as a general manager of

149: business'') and ``president, subsidiary-business, Tsutomu (a name),

150: general-manager'' (= ?).

151: It requires a fair bit of linguistic

152: information to choose the correct alternative.\footnote{To take an

153: analogous example in English, consider the

154: non-word-delimited sequence of letters

155: ``{theyouthevent}''.   This

156: corresponds to the word sequences ``the youth event'', ``they out he

157: vent'', and ``the you the vent''.}
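
A rough count suggests how quickly the segmentation problem grows.  Assuming, purely for illustration, that each of the $n-1$ gaps between adjacent characters in an $n$-character string either is or is not a word boundary, the number of candidate segmentations is
\[
2^{\,n-1},
\]
which gives $2^{6} = 64$ candidates for the seven-character Japanese sequence and $2^{12} = 4096$ for the thirteen-letter string ``theyouthevent''.  Most candidates contain non-words and can be discarded with a dictionary, but, as the examples show, more than one can survive that filter.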

158:

159: To sum up, we see that the NLP task is highly daunting, for

160: resolving the many ambiguities that arise in trying to analyze even a

161: single sentence requires deep knowledge not just about language but

162: also about the world. And so when HAL says, ``I'm afraid I can't do

163: that'', NLP researchers are tempted to respond, ``I'm afraid you might

164: be right''.

165:

166:

167: \section*{Firth things first}

168:

169: But before we assume that the only viable approach to NLP is a massive

170: knowledge engineering project, let us go back to the early approaches

171: to the problem.  In the 1940s and 1950s, one prominent trend in

172: linguistics was explicitly empirical and in particular distributional,

173: as exemplified by the work of Zellig Harris (who started the first

174: linguistics program in the USA). The idea was that

175: correlations (co-occurrences) found in language data are important

176: sources of information, or, as the influential linguist J. R. Firth

177: declared in 1957, ``You shall know a word by the company it keeps''.

178:

179: Such notions accord quite happily with ideas put forth by Claude

180: Shannon in his landmark 1948 paper establishing the field of

181: information theory; speaking from an engineering perspective, he

182: identified the probability of a message's being chosen from among

183: several alternatives, rather than the message's actual content, as its

184: critical characteristic.  Influenced by this work, Warren Weaver in

185: 1949 proposed treating the problem of translating between languages as

186: an application of cryptography (see [other chapter]), with one

187: language viewed as an encrypted form of another.  And Alan Turing's

188: work on cracking German codes during World War II led to the

189: development of the Good-Turing formula, an important tool for

190: computing certain statistical properties of language.
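
For readers who have not met it, the Good-Turing formula can be stated briefly.  The notation below is introduced here for illustration and is not the essay's own: let $N$ be the number of word tokens in a sample, and let $N_r$ be the number of distinct words that occur exactly $r$ times.  A word seen $r$ times is then treated as if it had been seen
\[
r^{*} = (r+1)\,\frac{N_{r+1}}{N_r}
\]
times, and the total probability set aside for words never seen in the sample is estimated as $N_1/N$.  As a made-up numerical example, if $N = 1000$, $N_1 = 100$, and $N_2 = 30$, then words seen once receive the adjusted count $r^{*} = 2 \times 30/100 = 0.6$, and unseen words together receive probability $100/1000 = 0.1$.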

191:

192: In yet a third area, 1941 saw the statisticians Frederick Mosteller

193: and Frederick Williams address the question of whether it was

194: Alexander Hamilton or James Madison who wrote some of the pseudonymous

195: Federalist Papers. Unlike previous attempts, which were based on

196: historical data and arguments, Mosteller and Williams used the

197: patterns of word occurrences in the texts as evidence. This work led

198: up to the famed Mosteller and Wallace statistical study which many

199: consider to have settled the authorship of the disputed papers.

200:

201: Thus, we see arising independently from a variety of fields the idea

202: that language can be viewed from a data-driven, empirical perspective

203: --- and a data-driven perspective leads naturally to a computational

204: perspective.

205:

206: \section*{A ``C'' change}

207:

208: However, data-driven approaches fell out of favor in the late 1950s.

209: One of the commonly cited factors is a 1957 argument by linguist (and

210: student of Harris) Noam Chomsky, who believed that language behavior

211: should be analyzed at a much deeper level than its surface

212: statistics.  He claimed,

213: \begin{quote}

214:   It is fair to assume that neither sentence (1) [Colorless green

215:   ideas sleep furiously] nor (2) [Furiously sleep ideas green

216:   colorless] ... has ever occurred .... Hence, in any [computed]

217:   statistical model ... these sentences will be ruled out on identical

218:   grounds as equally ``remote'' from English.\footnote{Interestingly,

219:   this claim has become so famous as to be self-negating, as simple

220:   web searches on ``Colorless green ideas sleep furiously'' and its

221:   reversal will show.} Yet (1), though nonsensical, is grammatical,

222:   while (2) is not.

223: \end{quote}

224: That is, we humans know that sentence (1), which at least obeys (some)

225: rules of grammar, is indeed more probable than (2), which is just word

226: salad; but (the claim goes), since both sentences are so rare, they

227: will have identical statistics --- i.e., a frequency of zero --- in

228: any sample of English.  Chomsky's criticism is essentially that

229: data-driven approaches will always suffer from a lack of data, and

230: hence are doomed to failure.
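
The force of this argument is easy to see in the simplest estimator one might try.  As an illustrative assumption (the essay does not commit to a particular model), suppose we estimate the probability of a sentence $s$ by its relative frequency in a sample of $N$ sentences, writing $c(s)$ for the number of times $s$ occurs:
\[
\hat{P}(s) = \frac{c(s)}{N}.
\]
Writing $s_1$ and $s_2$ for sentences (1) and (2), if neither occurs in the sample then $c(s_1) = c(s_2) = 0$, so $\hat{P}(s_1) = \hat{P}(s_2) = 0$, and the estimator cannot tell the grammatical sentence from the word salad.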

231:

232: This observation turned out to be remarkably prescient: even now, when

233: billions of words of text are available on-line, perfectly reasonable

234: phrases are not present. Thus, the so-called {\em sparse data problem}

235: continues to be a serious challenge for statistical NLP even

236: today. And so, the effect of Chomsky's claim, together with some

237: negative results for machine learning and a general lack of computing

238: power at the time, was to cause researchers to turn away from

239: empirical approaches and toward {\em knowledge-based} approaches where

240: human experts encoded relevant information in computer-usable form.

241:

242:

243: This change in perspective led to several new lines of fundamental,

244: interdisciplinary research.  For example, Chomsky's work viewing

245: language as a formal, mathematically describable object has had

246: lasting impact on both linguistics and computer science; indeed, the

247: {\em Chomsky hierarchy}, a sequence of increasingly powerful

248: classes of grammars, is a staple of the undergraduate computer science

249: curriculum.  Conversely, the highly influential work of, among others,

250: Kazimierz Ajdukiewicz, Joachim Lambek, David K. Lewis, and Richard Montague

251: adopted the {\em lambda calculus}, a fundamental concept in the study

252: of programming languages, to model the semantics of natural languages.

253:

254:

255: \section*{The empiricists strike back}

256:

257: By the '80s, the tide had begun to shift once again, in  part

258: because of the work done by the speech recognition group at IBM.  These

259: researchers, influenced by ideas from information theory, explored the

260: power of probabilistic models of language combined with access to much

261: more sophisticated algorithmic and data resources than had previously been

262: available. In the realm of speech recognition, their ideas form the

263: core of the design of modern systems; and given the recent successes

264: of such software --- large-vocabulary continuous-speech recognition

265: programs are now available on the market --- it behooves us to examine

266: how these systems work.

267:

268: Given some acoustic signal, which we denote by the variable $a$, we

269: can think of the speech recognition problem as that of transcription:

270: determining what sentence is most likely to have produced $a$.

271: Probabilities arise because of the ever-present problem of ambiguity:

272: as mentioned above, several word sequences, such as ``your lie cured

273: mother'' versus ``you like your mother'', can give rise to similar

274: spoken output.  Therefore, modern speech recognition systems

275: incorporate information about both the acoustic signal and the

276: language behind the signal.  More specifically, they rephrase the

277: problem as determining which sentence $s$ maximizes the product

278: $P(a|s)\times P(s)$. The first term measures how likely the acoustic

279: signal would be if $s$ were actually the sentence being uttered

280: (again, we use probabilities because humans don't pronounce words the

281: same way all the time).  The second term measures the probability of

282: the sentence $s$ itself; for example, as Chomsky noted, ``colorless

283: green ideas sleep furiously'' is intuitively more likely to be uttered

284: than the reversal of the phrase.  It is in computing this second term,

285: $P(s)$, that statistical NLP techniques come into play, since

286: accurate estimation of these sentence probabilities requires

287: developing probabilistic models of language.  These models are

288: acquired by processing tens of millions of words or more.

289: This is by no means a simple procedure; even linguistically naive

290: models require the use of sophisticated computational and statistical

291: techniques because of the sparse data problem foreseen by Chomsky.

292: But using probabilistic models, large datasets, and powerful learning

293: algorithms (both for $P(s)$ and $P(a|s)$) has led to our achieving the

294: milestone of commercial-grade speech recognition products capable of

295: handling continuous speech ranging over a large vocabulary.
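
The decision rule described above can be written out as a short derivation.  The following is a sketch, assuming (as is standard, though not spelled out in the text) that the recognizer seeks the sentence with the highest probability given the acoustic signal; the symbol $\hat{s}$ for the chosen sentence is introduced here for illustration:
\begin{eqnarray*}
\hat{s} & = & \arg\max_{s} P(s|a) \\
        & = & \arg\max_{s} \frac{P(a|s)\times P(s)}{P(a)} \\
        & = & \arg\max_{s} P(a|s)\times P(s).
\end{eqnarray*}
The second line is Bayes' rule, and the denominator $P(a)$ can be dropped in the third line because it does not depend on $s$.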

296:

297:

298: But let us return to our story. Buoyed by the successes in speech

299: recognition in the '70s and '80s (substantial performance gains over

300: knowledge-based systems were posted), researchers began applying

301: data-driven approaches to many problems in natural language

302: processing, in a turn-around so extreme that it has been deemed a

303: ``revolution''.  Indeed, empirical methods are now used at all levels

304: of language analysis.  This is not just due to increased resources: a

305: succession of breakthroughs in machine learning algorithms  has

306: allowed us to leverage existing resources much more effectively.

307: At the same time, evidence from psychology shows that human learning

308: may be more statistically based than previously thought; for instance,

309: work by Jenny Saffran, Richard Aslin, and Elissa Newport reveals that

310: 8-month-old infants can learn to divide continuous speech into word

311: segments based simply on the statistics of sounds following one

312: another.  Hence, it seems that the ``revolution'' is here to stay.

313:

314:

315: Of course, we must not go overboard and mistakenly conclude that the

316: successes of statistical NLP render linguistics irrelevant (rash

317: statements to this effect have been made in the past, e.g., the

318: notorious remark, ``Every time I fire a linguist, my performance goes

319: up'').  The information and insight that linguists, psychologists, and

320: others have gathered about language is invaluable in creating

321: high-performance broad-domain language understanding systems; for

322: instance, in the speech recognition setting described above, a better

323: understanding of language structure can lead to better language

324: models.  Moreover, truly interdisciplinary research has furthered our

325: understanding of the human language faculty.  One important example of

326: this is the development of the {\em head-driven phrase structure

327: grammar} (HPSG) formalism --- this is a way of analyzing natural

328: language utterances that truly marries deep linguistic information

329: with computer science mechanisms, such as unification and recursive

330: data-types, for representing and propagating this information

331: throughout the utterance's structure.  In sum, computational

332: techniques and data-driven methods are now an integral part both of

333: building systems capable of handling language in a domain-independent,

334: flexible, and graceful way, and of improving our understanding of

335: language itself.

336:

337: \subsection*{Acknowledgments} Thanks to the members of the CSTB

338: Fundamentals of Computer Science study --- and especially Alan

339: Biermann --- for their helpful feedback.  Also, thanks to Alex Acero,

340: Takako Aikawa, Mike Bailey, Regina Barzilay, Eric Brill, Chris

341: Brockett, Claire Cardie, Joshua Goodman, Ed Hovy, Rebecca Hwa, John

342: Lafferty, Bob Moore, Greg Morrisett, Fernando Pereira, Hisami Suzuki,

343: and many others for stimulating discussions and very useful comments.

344: Rie Kubota Ando provided the Japanese example.  The use of the term

345: ``revolution'' to describe the re-ascendance of statistical methods

346: comes from Julia Hirschberg's 1998 invited address to the American

347: Association for Artificial Intelligence.  I learned of the

348: McDonnell-Douglas ad and some of its analyses from a class run by

349: Stuart Shieber.  All errors are mine alone.  This paper is based upon

350: work supported in part by the National Science Foundation under ITR/IM

351: grant IIS-0081334 and a Sloan Research Fellowship.  Any opinions,

352: findings, and conclusions or recommendations expressed above are those

353: of the author and do not necessarily reflect the views of the

354: National Science Foundation or the Sloan Foundation.

355:

356: \begin{thebibliography}{}

357:

358: \bibitem{Adjukiewicz:35a}

359: Ajdukiewicz, Kazimierz.

360: \newblock 1935.

361: \newblock {Die syntaktische Konnexit\"at}.

362: \newblock {\em Studia Philosophica}, 1:1--27.

363: \newblock {English translation available in Storrs McCall, editor, {\em Polish

364: Logic 1920--1939}, Clarendon Press (1967).}

365:

366:

367: \bibitem{Chomsky:57a}

368: Chomsky, Noam.

369: \newblock 1957.

370: \newblock {\em Syntactic Structures}.

371: \newblock Number~IV in Janua Linguarum. Mouton, The Hague, The Netherlands.

372:

373: \bibitem{Firth:57a}

374: Firth, John~Rupert.

375: \newblock 1957.

376: \newblock A synopsis of linguistic theory 1930--1955.

377: \newblock In the Philological Society's {\em Studies in Linguistic

378:   Analysis}. Blackwell, Oxford, pages 1--32.

379: \newblock Reprinted in {\em Selected Papers of J. R. Firth}, edited by F.

380:   Palmer. Longman, 1968.

381:

382: \bibitem{Good:53a}

383: Good, Irving~J.

384: \newblock 1953.

385: \newblock The population frequencies of species and the estimation of

386:   population parameters.

387: \newblock {\em Biometrika}, 40(3,4):237--264.

388:

389: \bibitem{Harris:51a}

390: Harris, Zellig.

391: \newblock 1951.

392: \newblock {\em Methods in Structural Linguistics}.

393: \newblock University of Chicago Press.

394: \newblock Reprinted by Phoenix Books in 1960 under the title {\em Structural

395:   Linguistics}.

396:

397: \bibitem{Lambek:58a}

398: Lambek, Joachim.

399: \newblock 1958.

400: \newblock {The mathematics of sentence structure}.

401: \newblock {\em American Mathematical Monthly}, 65:154--169.

402:

403: \bibitem{Lewis:70a}

404: Lewis, David K.

405: \newblock 1970.

406: \newblock {General semantics}.

407: \newblock {\em Synthese}, 22:18--67.

408:

409:

410: \bibitem{Montague:74a}

411: Montague, Richard.

412: \newblock 1974.

413: \newblock {\em Formal Philosophy: Selected Papers of {Richard Montague}}.

414: \newblock Yale University Press.

415: \newblock Edited by Richmond H. Thomason.

416:

417: \bibitem{Mosteller+Wallace:84a}

418: Mosteller, Frederick and David~L. Wallace.

419: \newblock 1984.

420: \newblock {\em Applied Bayesian and Classical Inference: The Case of the

421:   Federalist Papers}.

422: \newblock Springer-Verlag.

423: \newblock First edition published in 1964 under the title {\em Inference and

424:   Disputed Authorship: The Federalist}.

425:

426: \bibitem{Pollard+Sag:94a}

427: Pollard, Carl and Ivan Sag.

428: \newblock 1994.

429: \newblock {\em Head-Driven Phrase Structure Grammar}.

430: \newblock University of Chicago Press and CSLI Publications.

431:

432: \bibitem{Saffran+Aslin+Newport:96a}

433: Saffran, Jenny~R., Richard~N. Aslin, and Elissa~L. Newport.

434: \newblock 1996.

435: \newblock Statistical learning by 8-month-old infants.

436: \newblock {\em Science}, 274(5294):1926--1928, December.

437:

438: \bibitem{Shannon:48a}

439: Shannon, Claude~E.

440: \newblock 1948.

441: \newblock A mathematical theory of communication.

442: \newblock {\em Bell System Technical Journal}, 27:379--423 and 623--656.

443:

444: \bibitem{Turing:50a}

445: Turing, Alan~M.

446: \newblock 1950.

447: \newblock Computing machinery and intelligence.

448: \newblock {\em Mind}, LIX(236):433--460.

449:

450: \bibitem{Weaver:49a}

451: Weaver, Warren.

452: \newblock 1949.

453: \newblock Translation.

454: \newblock Memorandum. Reprinted in W.N. Locke and A.D. Booth, eds., {\em

455:   Machine Translation of Languages: Fourteen Essays}, MIT Press, 1955.

456:

457: \end{thebibliography}

458:

459:

460: \section*{For further reading}

461: \newcommand{\myind}{\hspace*{.3in}}

462:

463: \noindent Charniak, Eugene.

464: \newblock 1993.

465: \newblock {\em Statistical Language Learning}.

466: \newblock MIT Press.

467:

468: \bigskip

469:

470: \noindent Jurafsky, Daniel and James~H. Martin.

471: \newblock 2000.

472: \newblock {\em Speech and Language Processing: An Introduction to Natural

473:  \\ \myind  Language Processing, Computational Linguistics, and Speech Recognition}.

474: \newblock Prentice Hall.

475: \newblock Contribut-  \\ \myind ing writers: Andrew Kehler, Keith Vander Linden, and Nigel

476:   Ward.

477:

478: \bigskip

479:

480: \noindent Manning, Christopher~D. and Hinrich Sch\"{u}tze.

481: \newblock 1999.

482: \newblock {\em Foundations of Statistical Natural Language

483: Process-\\ \myind ing}. The MIT Press.

484:

485: \end{document}

486: