1: \documentclass[10pt]{article}
2:
3:
4: \usepackage{hltnaacl03}
5: \usepackage{times}
6: \usepackage{latexsym}
7: \usepackage{epsfig}
8: \usepackage{xspace}
9: \newcommand{\epsfscaledbox}[2]{\centerline{\psfig{figure=#1,width=#2}}}
10: \newcommand{\omt}[1]{}
11: \newcommand{\bibsnip}{\vspace*{-.11in}}
12: \newcommand{\proc}{Proc.\xspace}
13: \newcommand{\U}[1]{\underline{#1}}
14: \newcommand{\UU}[1]{\underline{\underline{#1}}}
15: \newcommand{\comment}[1]{{\bf !!- - - #1 - - - !!}}
16: \newcommand{\Lattice}{Lattice\xspace}
17: \newcommand{\lattice}{lattice\xspace}
18: \newcommand{\lattices}{lattices\xspace}
19: \newcommand{\slotlat}{slotted \lattice}
20: \newcommand{\slotlats}{slotted \lattices}
21: \newcommand{\template}[1]{{\sf #1}}
22: \newcommand{\corpus}{C}
23:
24: \newenvironment{frameit}[1]
25: {\begin{tabular}{|p{#1}|}\hline}{\\\hline\end{tabular}}
26:
27: \newcommand{\textexample}[1]{
28: {\noindent
29: \begin{center}
30: \fbox{\parbox{0.45\textwidth}{\small\sf #1}}
31: \end{center}}}
32:
33: \setlength\titlebox{6.5cm}
34: \title{\vspace{-75pt}
35: {\normalsize {\it \hfill Proceedings of HLT/NAACL 2003}} \\ \mbox{}\\Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment}
36:
37: \author{Regina Barzilay \and Lillian Lee \\
38: Department of Computer Science \\
39: Cornell University \\
40: Ithaca, NY 14853-7501 \\
41: \{regina,llee\}@cs.cornell.edu}
42:
43: \date{}
44:
45: \begin{document}
46: \maketitle
47: \begin{abstract}
48: We address the text-to-text generation problem of sentence-level paraphrasing
49: --- a phenomenon distinct from and more difficult than word- or phrase-level
50: paraphrasing. Our approach applies {\em multiple-sequence alignment} to
51: sentences gathered from unannotated comparable corpora: it learns a set of
52: paraphrasing patterns represented by {\em word lattice} pairs and
53: automatically determines how to apply these patterns to rewrite new
54: sentences. The results of our evaluation experiments show that the system
55: derives accurate paraphrases, outperforming baseline systems.
56: \end{abstract}
57:
58:
59:
60: \section{Introduction}
61:
62: \begin{quote}
63: {\em This is a late parrot! It's a stiff! Bereft of life, it rests in
64: peace! If you hadn't nailed him to the perch he would be pushing up
65: the daisies! Its metabolical processes are of interest only to
66: historians! It's hopped the twig! It's shuffled off this mortal coil!
67: It's rung down the curtain and joined the choir invisible! This is
68: an EX-PARROT!} --- Monty Python, ``Pet Shop''
69: \end{quote}
70:
71: A mechanism for automatically generating multiple paraphrases of a
72: given sentence would be of significant practical import for
73: text-to-text generation systems. Applications include summarization
74: \cite{Knight&Marcu:2000a} and rewriting
75: \cite{Chandrasekar+Srinivas:97a}: both could employ such a mechanism
76: to produce candidate sentence paraphrases
77: that other
78: system components would filter for length, sophistication level, and
79: so forth.\footnote{Another interesting application,
80: somewhat tangential to generation, would be
81: to expand existing corpora by providing several versions of their
82: component sentences.
83: This could, for example, aid machine-translation evaluation, where it has
84: become common to evaluate systems by comparing their output against a bank of
85: several reference translations for the same sentences
86: \cite{Papineni&al:2002a}.
87: See \newcite{Bangalore&Murdock&Riccardi:2002a} and
88: \newcite{Barzilay&Lee:2002a} for other uses of such data.}
89: Not surprisingly, therefore,
90: paraphrasing has been a focus of generation
91: research for quite some time
92: \cite{McKeown:79a,Meteer+Shaked:88a,Dras:1999a}.
93:
94: One might initially suppose that sentence-level paraphrasing is simply the
95: result of word-for-word or phrase-by-phrase substitution applied in a domain-
96: and context-independent fashion. However, in studies of paraphrases across
97: several domains
98: \cite{Iordanskaja&Kittredge&Polguere:1991a,Robin-phd,McKeown&Kukich&Shaw:1994a},
99: this was generally not the case.
100: For instance, consider the following two sentences (similar to
101: examples found in \newcite{Smadja&McKeown:1991a}):
102: \begin{center}
103: \begin{frameit}{0.9\columnwidth}
104: {\small After the latest Fed rate cut, stocks rose across the board.}
105: \\\hline
106: {\small Winners strongly outpaced losers after Greenspan cut
107: interest rates again.}
108: \end{frameit}
109: \end{center}
110: Observe that ``Fed'' (Federal Reserve) and ``Greenspan'' are interchangeable
111: only in the domain of US financial matters. Also, note that one cannot draw
112: one-to-one correspondences between single words or phrases. For instance,
113: nothing in the second sentence is really equivalent to ``across the board'';
114: we can only say that the entire clauses ``stocks rose across the board'' and
115: ``winners strongly outpaced losers'' are paraphrases. This evidence suggests
116: two consequences: (1) we cannot rely solely on generic domain-independent
117: lexical resources for the task of paraphrasing, and (2) {\em sentence-level}
118: paraphrasing is an important problem extending beyond that of paraphrasing
119: smaller lexical units.
120:
121: {\em Our work presents a novel knowledge-lean algorithm that uses {\em
122: multiple-sequence alignment} (MSA) to {\em learn} to generate
123: sentence-level paraphrases essentially from unannotated corpus data alone.}
124: In contrast to previous work using MSA for generation
125: \cite{Barzilay&Lee:2002a}, we need neither parallel data nor explicit
126: information about sentence semantics. Rather, we use two {\em comparable
127: corpora}, in our case, collections of articles produced by two different
128: newswire agencies about the same events. The use of related corpora is key:
129: we can capture paraphrases that on the surface bear little resemblance but
130: that, by the nature of the data, must be descriptions of the same
131: information. Note that we also acquire paraphrases from each of the
132: individual corpora; but the lack of clues as to sentence equivalence in
133: single corpora means that we must be more conservative, only selecting as
134: paraphrases items that are structurally very similar.
135:
136: Our approach has three main steps. First, working on each of the comparable
137: corpora separately, we compute {\em \lattices} --- compact graph-based
138: representations --- to find commonalities within (automatically derived)
139: groups of structurally similar sentences. Next, we identify pairs of
140: lattices from the two different corpora that are paraphrases of each other;
141: the identification process checks whether the lattices take similar
142: arguments. Finally, given an input sentence to be paraphrased, we match it
143: to a lattice and use a paraphrase from the matched lattice's mate to generate
144: an output sentence. The key features of this approach are:
145:
146: \noindent
147: \textbf{Focus on paraphrase generation.} In contrast to earlier work, we not
148: only extract paraphrasing rules, but also automatically determine which of the
149: potentially relevant rules to apply to an input sentence and produce a revised
150: form using them.
151:
152: \noindent
153: \textbf{Flexible paraphrase types.} Previous approaches to paraphrase
154: acquisition focused on certain rigid types of paraphrases, for instance,
155: limiting the number of arguments. In contrast, our method is not limited to a
156: set of {\it a priori}-specified paraphrase types.
157:
158: \noindent
159: \textbf{Use of comparable corpora and minimal use of knowledge resources.} In
160: addition to the advantages mentioned above, comparable corpora can be easily
161: obtained for many domains, whereas previous approaches to paraphrase
162: acquisition (and the related problem of phrase-based machine translation
163: \cite{Wang:1998a,Och&Tillman&Ney:1999a,Vogel&Ney:2000a}) required parallel
164: corpora. We point out that one such approach, recently proposed by
165: \newcite{Pang+Knight+Marcu:03a}, also represents paraphrases by lattices,
166: similarly to our method, although their lattices are derived using parse
167: information.
168:
169:
170: Moreover, our algorithm does not employ knowledge resources such as parsers or
171: lexical databases, which may not be available or appropriate for all domains
172: --- a key issue since paraphrasing is typically domain-dependent. Nonetheless,
173: our algorithm achieves good performance.
174:
175:
176:
177: \section{Related work}
178: Previous work on automated paraphrasing has considered different levels of
179: paraphrase granularity. Learning synonyms via distributional similarity has
180: been well-studied \cite{Pereira&Tishby&Lee:1993a,Grefenstette:94a,Lin:1998a}.
181: \newcite{Jacquemin:l999a} and \newcite{Barzilay&McKeown:01a} identify
182: phrase-level paraphrases, while \newcite{Lin&Pantel:2001a} and
183: \newcite{Shinyama&al:2002a} acquire structural paraphrases encoded as
184: templates. These latter are the most closely related to the sentence-level
185: paraphrases we desire, and so we focus in this section on template-induction
186: approaches.
187:
188: \newcite{Lin&Pantel:2001a} extract inference rules, which are related to
189: paraphrases (for example, \template{X wrote Y} implies \template{X is the
190: author of Y}), to improve question answering. They assume that {\em paths}
191: in dependency trees that take similar arguments (leaves) are close in meaning.
192: However, only two-argument templates are considered.
193: \newcite{Shinyama&al:2002a} also use dependency-tree information to extract
194: templates of a limited form (in their case, determined by the underlying
195: information extraction application). Like us (and unlike Lin and Pantel, who
196: employ a single large corpus), they use articles written about the same event
197: in different newspapers as data.
198:
199: Our approach shares two characteristics with the two methods just described:
200: pattern comparison by analysis of the patterns' respective arguments, and use
201: of non-parallel corpora as a data source. However, {\em extraction} methods
202: are not easily extended to {\em generation} methods. One problem is that their
203: templates often only match small fragments of a sentence. While this is
204: appropriate for other applications, deciding whether to use a given template to
205: generate a paraphrase requires information about the surrounding context
206: provided by the entire sentence.
207:
208:
209: \newcommand{\slot}{slot\xspace}
210: \newcommand{\slots}{slots\xspace}
211: \newcommand{\findclusters}{Sentence clustering}
212: \newcommand{\families}{clusters\xspace}
213: \newcommand{\Families}{Clusters\xspace}
214: \newcommand{\family}{cluster\xspace}
215: \newcommand{\famlat}{\lattice}
216: \newcommand{\famlats}{\lattices}
217: \newcommand{\msg}{pattern\xspace}
218:
219: \newcommand{\patterninformal}{pattern\xspace}
220: \newcommand{\patternsinformal}{patterns\xspace}
221: \newcommand{\Patterninformal}{Pattern\xspace}
222: \newcommand{\surprise}{surprise\xspace}
223: \newcommand{\backbone}{backbone\xspace}
224: \newcommand{\numtoken}{NUM}
225: \newcommand{\nametoken}{NAME}
226: \newcommand{\datetoken}{DATE}
227:
228:
229:
230: \section{Algorithm}
231:
232:
233: \paragraph{Overview} We first sketch the algorithm's broad outlines. The subsequent subsections provide
234: more detailed descriptions of the individual steps.
235:
236: The major goals of our algorithm are to learn:
237: \begin{itemize}
238: \item recurring {\patternsinformal} in the data, such as \template{X
239: (injured/wounded) Y people, Z seriously}, where the capital letters
240: represent variables;
241: \item
242: pairings between such \patternsinformal that represent paraphrases, for
243: example, between the \patterninformal \template{X (injured/wounded) Y people,
244: Z of them seriously} and the \patterninformal \template{Y were
245: (wounded/hurt) by X, among them Z were in serious condition}.
246: \end{itemize}
247:
248: Figure~\ref{fig:arch} illustrates the main stages of our approach. During
249: training, \patterninformal induction is first applied independently to the two
250: datasets making up a pair of {comparable corpora}. Individual
251: \patternsinformal are learned by applying {\em multiple-sequence alignment} to
252: \families of sentences describing approximately similar events; these
253: \patternsinformal are represented compactly by {\em \lattices} (see Figure
254: \ref{fig:lattice}). We then check for \lattices from the two different corpora
255: that tend to take the same arguments; these \lattice pairs are taken to be
256: paraphrase \patternsinformal.
257:
258: \begin{figure}
259: \begin{center}
260: \epsfscaledbox{arch.eps}{2.2in}
261: \end{center}
262: \vspace*{-.2in}
263: \caption{\label{fig:arch} System architecture.}
264: \end{figure}
265:
266: Once training is done, we can generate paraphrases as follows: given the
267: sentence ``The \surprise bombing injured twenty people, five of them
268: seriously'', we match it to the lattice \template{X (injured/wounded) Y people,
269: Z of them seriously} which can be rewritten as \template{Y were
270: (wounded/hurt) by X, among them Z were in serious condition}, and so by
271: substituting arguments we can generate ``Twenty were wounded by the \surprise
272: bombing, among them five were in serious condition'' or ``Twenty were hurt by
273: the \surprise bombing, among them five were in serious condition''.
274:
275: \begin{figure}
276: \newcounter{sentexample}\setcounter{sentexample}{1}
277: \newcommand{\sentex}[1]{{\footnotesize (\thesentexample)~#1 \stepcounter{sentexample}}}
278: \fbox{
279: \begin{minipage}{3in}
280: \sentex{\textbf{A Palestinian suicide bomber blew himself up in} a southern
281: city Wednesday, \textbf{killing} two other \textbf{people}
282: \textbf{and wounding} 27.} \\
283: \sentex{\textbf{A suicide bomber blew himself up in} the settlement of Efrat,
284: on Sunday, \textbf{killing} himself \textbf{and injuring}
285: seven people.} \\
286: \sentex{\textbf{A suicide bomber blew himself up in} the coastal resort of
287: Netanya on Monday, \textbf{killing} three other \textbf{people}
288: \textbf{and wounding} dozens more.} \\
289: \sentex{\textbf{A Palestinian suicide bomber blew himself up in} a garden
290: cafe on Saturday, \textbf{killing} 10 \textbf{people} \textbf{and wounding}
291: 54.} \\
292: \sentex{\textbf{A suicide bomber blew himself up in} the centre of Netanya on
293: Sunday, \textbf{killing} three \textbf{people} as well as himself
294: \textbf{and injuring} 40. }
295: \end{minipage}
296: }
297: \caption{\label{fig:cluster} Five sentences (without date, number,
298: and name substitution) from a \family of 49, similarities emphasized. }
299: \end{figure}
300:
301:
302:
303: \begin{figure*}
304: \psfig{figure=msa-new.ps,width=6.5in}
305: \caption{\label{fig:lattice} \Lattice and
306: \slotlat for the five sentences from Figure \ref{fig:cluster}. Punctuation
307: and articles removed for clarity.}
308: \end{figure*}
309:
310: \subsection{\findclusters}
311:
312: Our first step is to cluster sentences into groups from which to learn useful
313: patterns; for the multiple-sequence techniques we will use, this means that the
314: sentences within \families should describe similar events and have similar
315: structure, as in the sentences of Figure \ref{fig:cluster}. This is
316: accomplished by applying hierarchical complete-link clustering to the sentences
317: using a similarity metric based on word n-gram overlap ($n=1,2,3,4$). The only
subtlety is that we do not want mismatches on sentence details (e.g., the
location of a raid) to cause sentences describing the same type of occurrence
(e.g., a raid) to be separated, as this might yield \families too
321: fragmented for effective learning to take place. (Moreover, variability in the
322: {\em arguments} of the sentences in a cluster is needed for our learning
323: algorithm to succeed; see below.) We therefore first
324: replace all appearances of dates, numbers, and proper names\footnote{Our crude
325: proper-name identification method was to flag every phrase (extracted by a
326: noun-phrase chunker) appearing capitalized in a non-sentence-initial position
327: sufficiently often. } with generic tokens. \Families with fewer than ten
328: sentences are discarded.
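
For illustration, the clustering step can be sketched in a few lines of Python.
This is a minimal sketch under stated assumptions, not the exact procedure used
in the experiments: the similarity formula (mean Jaccard overlap of n-gram
sets), the stopping threshold, and the names \texttt{ngrams},
\texttt{similarity}, and \texttt{complete\_link} are all hypothetical.
Sentences are assumed to be token lists in which dates, numbers, and proper
names have already been replaced by generic tokens.

{\footnotesize
\begin{verbatim}
from itertools import combinations

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)}

def similarity(a, b, max_n=4):
    # Assumed metric: mean Jaccard overlap of
    # the word n-gram sets for n = 1..max_n.
    scores = []
    for n in range(1, max_n + 1):
        x, y = ngrams(a, n), ngrams(b, n)
        if x or y:
            scores.append(len(x & y) / len(x | y))
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

def complete_link(sentences, threshold):
    # Repeatedly merge the two clusters whose
    # least similar cross pair is most similar;
    # stop when no link reaches the threshold.
    clusters = [[s] for s in sentences]
    while len(clusters) > 1:
        best, pair = threshold, None
        idx = range(len(clusters))
        for i, j in combinations(idx, 2):
            link = min(similarity(a, b)
                       for a in clusters[i]
                       for b in clusters[j])
            if link >= best:
                best, pair = link, (i, j)
        if pair is None:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters
\end{verbatim}
}

Because complete-link merging scores a candidate merge by its least similar
sentence pair, every sentence in a resulting cluster is similar to every other,
which suits the alignment step that follows. Clusters with fewer than ten
sentences would then be filtered out of the returned list.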
329:
330:
331: \newcommand{\art}[1]{}
332: \newcommand{\monthtoken}{MONTH\xspace}
333: \newcommand{\mayseven}{\datetoken~\numtoken \xspace}
334: \newcommand{\palestinian}{\nametoken\xspace}
335: \newcommand{\southta}{\nametoken\xspace}
336: \newcommand{\marchnine}{\datetoken~\numtoken \xspace}
337: \newcommand{\jerusalem}{\nametoken\xspace}
338: \newcommand{\ipmasharon}{\nametoken\xspace}
339: \newcommand{\marchten}{\datetoken~\numtoken \xspace}
340: \newcommand{\saturday}{\datetoken\xspace}
341: \newcommand{\marchthirtyone}{\datetoken~\numtoken \xspace}
342: \newcommand{\afpsource}{\nametoken\xspace}
343: \newcommand{\jewish}{\nametoken\xspace}
344: \newcommand{\efratwwbbeth}{\nametoken\xspace}
345: \newcommand{\sunday}{\datetoken\xspace}
346: \newcommand{\juneeighteen}{\datetoken~\numtoken \xspace}
347: \newcommand{\fifteen}{\numtoken1\xspace}
348: \newcommand{\tuesday}{\datetoken\xspace}
349: \newcommand{\seven}{{\numtoken1}\xspace}
350: \newcommand{\eleven}{{\numtoken1}\xspace}
351: \newcommand{\fifty}{\numtoken2\xspace}
352: \newcommand{\eighteen}{\numtoken1\xspace}
353: \newcommand{\fortyeight}{\numtoken2\xspace}
354: \newcommand{\locone}{{in \art{a} crowded hall in \southta}\xspace}
355: \newcommand{\loconeshort}{in \art{a} crowded hall$\ldots$\xspace}
356: \newcommand{\loctwo}{into \art{a} crowded \jerusalem cafe [sic] \ipmasharon's residence\xspace}
357: \newcommand{\loctwoshort}{into \art{a} crowded $\ldots$ residence\xspace}
358: \newcommand{\locthree}{{in \art{the} \jewish
359: settlement of \efratwwbbeth}\xspace}
360: \newcommand{\locthreeshort}{in \art{the} \jewish
361: settlement $\ldots$ \xspace}
362: \newcommand{\locfour}{{aboard \art{a} crowded bus in \jerusalem}\xspace}
363: \newcommand{\locfourshort}{{aboard $\ldots$ \jerusalem}\xspace}
364: \newcommand{\synone}{injuring\xspace}
365: \newcommand{\syntwo}{wounding\xspace}
366:
367:
368:
369:
370: \subsection{Inducing \patternsinformal}
371:
372: \newcommand{\simfn}{\textrm{sim}} \newcommand{\alphabet}{\Sigma}
373: \newcommand{\underscore}{\underline{~}} In order to learn \patternsinformal, we
374: first compute a {\em multiple-sequence alignment} (MSA) of the sentences in a
375: given \family. Pairwise MSA takes two sentences and a scoring function giving
376: the similarity between words; it determines the highest-scoring way to perform
377: insertions, deletions, and changes to transform one of the sentences into the
other. Pairwise MSA can be extended efficiently to multiple sequences via
iterative pairwise alignment, a polynomial-time method commonly used in
computational biology \cite{Durbin+Eddy+al:98a}.\footnote{Scoring function:
  aligning two identical words scores $1$, inserting a word scores $-0.01$, and
  aligning two different words scores $-0.5$ (parameter values taken from
  \newcite{Barzilay&Lee:2002a}).} \omt{ $$\simfn(x,y) = \left\{
  \begin{array}{rl}
  1 & \mbox{if } x = y,\ x \in \alphabet; \\
  -0.01 & \mbox{if exactly one of } x, y \mbox{ is } \underscore; \\
  -0.5 & \mbox{otherwise (mismatch).}
  \end{array} \right. $$ }
387: The results can be represented in an intuitive form via a word {\em \lattice}
388: (see Figure \ref{fig:lattice}), which compactly represents (n-gram) structural
389: similarities between the \family's sentences.
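
The pairwise step can be illustrated with a standard global-alignment
(Needleman-Wunsch) dynamic program using the scores given in the footnote
above. This is only a sketch: the function name \texttt{align\_score} is
hypothetical, deletions are assumed to cost the same as insertions, and the
iterative extension to more than two sentences is not shown.

{\footnotesize
\begin{verbatim}
def align_score(a, b, match=1.0, gap=-0.01,
                mismatch=-0.5):
    # Best global alignment score between two
    # token lists, computed row by row.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * gap]
        for j, y in enumerate(b, 1):
            sub = match if x == y else mismatch
            cur.append(max(prev[j - 1] + sub,
                           prev[j] + gap,
                           cur[j - 1] + gap))
        prev = cur
    return prev[-1]
\end{verbatim}
}

Because the gap penalty is so much smaller than the mismatch penalty, the
alignment prefers to open a gap rather than pair two different words, so
unequal words tend to end up on parallel branches of the lattice.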
390:
391: To transform \lattices into generation-suitable \patternsinformal requires some
392: understanding of the possible varieties of \lattice structures. The most
393: important part of the transformation is to determine which words are actually
394: instances of arguments, and so should be replaced by {\em slots} (representing
395: variables). The key intuition is that because the sentences in the \family
396: represent the same {\em type} of event, such as a bombing, but generally refer
to different {\em instances} of said event (e.g., a bombing in Jerusalem versus
398: in Gaza), areas of large variability in the \lattice should correspond to
399: arguments.
400:
401: To quantify this notion of variability, we first formalize its opposite:
402: commonality. We define {\em \backbone} nodes as those shared by more than 50\%
403: of the \family's sentences. The choice of 50\% is not arbitrary --- it can be
404: proved using the pigeonhole principle that our strict-majority criterion
405: imposes a unique linear ordering of the backbone nodes that respects the word
406: ordering within the sentences, thus guaranteeing at least a degree of
407: well-formedness and avoiding the problem of how to order backbone nodes
408: occurring on parallel ``branches'' of the lattice.
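
The strict-majority criterion can be sketched as follows. The representation
is an assumption made for illustration: the alignment is taken to be a list of
equal-length rows, one per sentence, with \texttt{None} marking a gap, so that
a lattice node is a (column, word) pair; the name \texttt{backbone} is
hypothetical.

{\footnotesize
\begin{verbatim}
from collections import Counter

def backbone(rows):
    # A (column, word) node is a backbone node
    # if a strict majority of the sentences
    # pass through it.
    nodes = []
    for col, words in enumerate(zip(*rows)):
        counts = Counter(w for w in words
                         if w is not None)
        for word, c in counts.items():
            if 2 * c > len(rows):
                nodes.append((col, word))
    return nodes
\end{verbatim}
}

Since two sets that each contain more than half of the sentences must share at
least one sentence, no column can contribute two backbone nodes, and the
returned list is already in sentence order.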
409:
410:
411: Once we have identified the \backbone nodes as points of strong commonality,
412: the next step is to identify the regions of variability (or, in \lattice terms,
413: many parallel disjoint paths) between them as (probably) corresponding to the
414: arguments of the propositions that the sentences represent. For example, in
the top of Figure \ref{fig:lattice}, the words ``southern city'', ``settlement of
NAME'', ``coastal resort of NAME'', etc.\ all correspond to the location of an
417: event and could be replaced by a single {\slot}.
418: Figure \ref{fig:lattice} shows an example of a \lattice and the derived
419: \slotlat; we give the details of the slot-induction process in the Appendix.
420:
421:
422: \subsection{Matching \famlats}
423:
424: Now, if we were using a parallel corpus, we could employ
425: sentence-alignment information to determine which lattices correspond
426: to paraphrases. Since we do not have this information, we essentially
427: approximate the parallel-corpus situation by correlating information
428: from descriptions of (what we hope are) the same event occurring in
429: the two different corpora.
430:
431: Our method works as follows. Once \lattices for each corpus in our
432: comparable-corpus pair are computed, we identify \lattice paraphrase pairs,
433: using the idea that paraphrases will tend to take the same values as arguments
434: \cite{Shinyama&al:2002a,Lin&Pantel:2001a}. More specifically, we take a pair of
435: \lattices from different corpora, look back at the sentence clusters from which
436: the two lattices were derived, and compare the slot values of those
437: cross-corpus sentence pairs that appear in articles written on the {\em same
438: day} on the same topic; we pair the \lattices if the degree of matching is
439: over a threshold tuned on held-out data. For example, suppose we have two
440: (linearized) lattices \template{{slot1} bombed slot2} and \template{slot3 was
441: bombed by slot4} drawn from different corpora. If in the first lattice's
442: sentence cluster we have the sentence ``the plane bombed the town'', and in the
443: second lattice's sentence cluster we have a sentence written on the same day
444: reading ``the town was bombed by the plane'', then the corresponding lattices
445: may well be paraphrases, where \template{slot1} is identified with
446: \template{slot4} and \template{slot2} with \template{slot3}.
447:
448:
449: To compare the set of argument values of two lattices, we simply count their
450: word overlap, giving double weight to proper names and numbers and discarding
451: auxiliaries (we purposely ignore order because paraphrases can consist of word
452: re-orderings).
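
A minimal sketch of this comparison is given below. The auxiliary list, the
test used to recognize proper names and numbers (capitalized or all-digit
tokens), and the names \texttt{AUX}, \texttt{weight}, and \texttt{overlap} are
assumptions made for illustration only.

{\footnotesize
\begin{verbatim}
AUX = {"is", "was", "were", "has", "have", "had"}

def weight(word):
    # Assumed test for names and numbers:
    # capitalised or all-digit tokens.
    if word[0].isupper() or word.isdigit():
        return 2
    return 1

def overlap(values_a, values_b):
    # Order-insensitive weighted overlap of
    # two collections of slot-value words.
    a = {w for w in values_a
         if w.lower() not in AUX}
    b = {w for w in values_b
         if w.lower() not in AUX}
    return sum(weight(w) for w in a & b)
\end{verbatim}
}

For the example above, the slot values ``the plane'' and ``the town'' yield the
same score whichever slot they fill, which is the intended effect of ignoring
word order. Two \lattices would be paired when such scores, accumulated over
their same-day sentence pairs, exceed the tuned threshold.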
453:
454: \subsection{Generating paraphrase sentences}
455:
Given a sentence to paraphrase, we first need to identify to which, if any, of
our previously computed sentence \families the new sentence most strongly
belongs. We do this by finding the best alignment of the sentence to the existing
459: \famlats.\footnote{ To facilitate this process, we add ``insert'' nodes between
460: \backbone nodes; these nodes can match any word sequence and thus account for
461: new words in the input sentence. Then, we perform multiple-sequence
462: alignment where insertions score \mbox{-0.1} and all other node alignments
463: receive a score of unity.} If a matching \famlat is found, we choose one of
464: its comparable-corpus paraphrase \lattices to rewrite the sentence,
465: substituting in the argument values of the original sentence. This yields as
466: many paraphrases as there are lattice paths.
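
The substitution step can be illustrated with a deliberately simplified sketch.
Instead of aligning the input to a lattice with insert nodes, it matches the
input against a single linearized path using a regular expression, which is a
stand-in for the alignment described above and not the method itself. The
single-letter slot names and the functions \texttt{to\_regex} and
\texttt{paraphrase} are hypothetical, and capitalization is not adjusted.

{\footnotesize
\begin{verbatim}
import re

SLOT = re.compile(r"\b([XYZ])\b")

def to_regex(template):
    # One lazy named group per slot; all other
    # text must match literally.
    parts = []
    for tok in SLOT.split(template):
        if SLOT.fullmatch(tok):
            parts.append("(?P<%s>.+?)" % tok)
        else:
            parts.append(re.escape(tok))
    return re.compile("".join(parts) + "$")

def paraphrase(sentence, source, targets):
    # Fill each target path with the slot
    # values found in the input sentence.
    m = to_regex(source).match(sentence)
    if m is None:
        return []
    fill = lambda s: m.group(s.group(1))
    return [SLOT.sub(fill, t) for t in targets]
\end{verbatim}
}

Passing one target string per path of the paired lattice reproduces the
behaviour described above, in which the number of generated paraphrases equals
the number of lattice paths.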
467:
468:
469:
470: \section{Evaluation}
471: \label{sec:eval}
472:
473:
474: All evaluations involved judgments by native speakers of
475: English who were not familiar with the paraphrasing systems
476: under consideration.
477:
478: \begin{figure*}
479: \epsfscaledbox{templateeval4.eps}{6.4in}
480: \caption{\label{msa-dirt-accuracy} Correctness and agreement results.
481: Columns = instances; each grey box represents a judgment of ``valid''
482: for the instance. For each method, a good, middling, and poor
483: instance is shown. (Results separated by algorithm for clarity; the
484: blind evaluation presented instances from the two algorithms in random
485: order.)
486: }
487: \end{figure*}
488:
489: We implemented our system on a pair of comparable corpora consisting of
490: articles produced between September 2000 and August 2002 by the Agence
491: France-Presse (AFP) and Reuters news agencies. Given our interest in
492: domain-dependent paraphrasing, we limited attention to 9MB of articles,
493: collected using a TDT-style document clustering system, concerning individual
494: acts of violence in Israel and army raids on the
495: Palestinian territories. From this data (after removing 120 articles as a
496: held-out parameter-training set), we extracted 43 \slotlats from the AFP corpus
497: and 32 \slotlats from the Reuters corpus, and found 25 cross-corpus matching
498: pairs; since \lattices contain multiple paths, these yielded 6,534 template
499: pairs.\footnote{The extracted paraphrases are available at \texttt{http://www.cs.cornell.edu/Info/Projects/\\NLP/statpar.html}}
500:
501:
502: \subsection{Template Quality Evaluation}
503:
504: Before evaluating the quality of the rewritings produced by our templates and
505: \lattices, we first tested the quality of a random sample of just the template
506: pairs. In our instructions to the judges, we defined two {text units} (such as
507: sentences or snippets) to be paraphrases if one of them can generally be
508: substituted for the other without great loss of information (but not
necessarily vice versa).\footnote{We switched to this ``one-sided''
510: definition because in initial tests judges found it excruciating to decide on
511: equivalence.
513: Also, in applications such as summarization some information loss is
514: acceptable.} Given a pair of {\em templates} produced by a system, the
515: judges marked them as paraphrases if for many instantiations of the templates'
variables, the resulting text units were paraphrases. (Several labelled
examples were provided to supply further guidance.)
518:
519: To put the evaluation results into context, we wanted to compare against
520: another system, but we are not aware of any previous work creating templates
521: precisely for the task of generating paraphrases. Instead, we made a
522: good-faith effort to adapt the DIRT system \cite{Lin&Pantel:2001a} to the
523: problem, selecting the 6,534 highest-scoring templates it produced when run on
524: our datasets. (The system of \newcite{Shinyama&al:2002a} was unsuitable for
525: evaluation purposes because their paraphrase extraction component is too
tightly coupled to the underlying information extraction system.) Several
caveats apply to this comparison, the most
528: prominent being that DIRT was not designed with sentence-paraphrase generation
529: in mind --- its templates are much shorter than ours, which may have affected
530: the evaluators' judgments --- and was originally implemented on much larger
531: data sets.\footnote{To cope with the corpus-size issue, DIRT was trained on an
532: 84MB corpus of Middle-East news articles, a strict superset of the 9MB we
533: used. Other issues include the fact that DIRT's output needed to be
534: converted into English: it produces paths like ``N:of:N
535: $\langle$tide$\rangle$ N:nn:N'', which we transformed into ``Y tide of X'' so
536: that its output format would be the same as ours. } The point of this
537: evaluation is simply to determine whether another corpus-based
538: paraphrase-focused approach could easily achieve the same performance level.
539:
540:
In brief, the DIRT system works as follows. Dependency trees are
constructed by parsing a large corpus. Leaf-to-leaf paths are
543: extracted from these dependency trees, with the leaves serving as
544: slots. Then, pairs of paths in which the slots tend to be filled by
545: similar values, where the similarity measure is based on the mutual
546: information between the value and the slot, are deemed to be
547: paraphrases.
548:
549:
550: We randomly extracted 500 pairs from the two algorithms' output sets. Of
551: these, 100 paraphrases (50 per system) made up a ``common'' set evaluated by
552: all four judges, allowing us to compute agreement rates; in addition, each
553: judge also evaluated another ``individual'' set, seen only by him- or herself,
554: consisting of another 100 pairs (50 per system). The ``individual'' sets
555: allowed us to broaden our sample's coverage of the corpus.\footnote{Each judge
556: took several hours at the task, making it infeasible to expand the sample
557: size further.}
558: The pairs were presented in random order, and the judges were
559: not told which system produced a given pair.
560:
561: As Figure~\ref{msa-dirt-accuracy} shows, our system outperforms the DIRT
562: system, with a consistent performance gap for all the judges of about 38\%,
563: although the absolute scores vary (for example, Judge 4 seems lenient). The
judges' assessment of correctness was fairly constant between the full
100-instance set and the 50-instance common set alone.
566:
567: In terms of agreement, the Kappa value (measuring pairwise agreement
discounting chance occurrences\footnote{One issue is that the Kappa
statistic does not account for varying difficulty among instances. For
570: this reason, we actually asked judges to indicate for each instance
571: whether making the validity decision was difficult. However, the
572: judges generally did not agree on difficulty. Post hoc analysis
573: indicates that perception of difficulty depends on each judge's
574: individual ``threshold of similarity'', not just the instance itself.
575: }) on the common set was 0.54, which corresponds to moderate
576: agreement~\cite{Landis&Koch:1977a}. Multiway agreement is depicted in
577: Figure~\ref{msa-dirt-accuracy} --- there, we see that in 86 of 100
578: cases, at least three of the judges gave the same correctness
579: assessment, and in 60 cases all four judges concurred.
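
For reference, the statistic for one pair of judges can be computed as in the
sketch below, which uses the standard two-rater (Cohen) form with binary
``valid''/``invalid'' judgments. How the pairwise values were combined across
the four judges is not specified here, so the sketch covers a single pair only;
the name \texttt{cohen\_kappa} is hypothetical.

{\footnotesize
\begin{verbatim}
def cohen_kappa(a, b):
    # a, b: equal-length lists of 0/1 judgments
    # from two judges on the same instances.
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Chance agreement from each judge's own
    # rate of answering 1 or 0.
    p_exp = pa * pb + (1 - pa) * (1 - pb)
    return (p_obs - p_exp) / (1 - p_exp)
\end{verbatim}
}

The value is undefined when chance agreement equals one, that is, when both
judges give the same single label to every instance.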
580:
581:
582: \subsection{Evaluation of the generated paraphrases}
583:
584: Finally, we evaluated the quality of the paraphrase sentences generated by our
585: system, thus (indirectly) testing all the system components: pattern selection,
586: paraphrase acquisition, and generation. We are not aware of another system
587: generating sentence-level paraphrases. Therefore, we used as a baseline a
simple paraphrasing system that just replaces words with randomly chosen
WordNet synonyms (using the most frequent sense of the word for which
WordNet listed synonyms). The number of substitutions was set
591: proportional to the number of words our method replaced in the same sentence.
592: The point of this comparison is to check whether simple synonym substitution
593: yields results comparable to those of our algorithm. \footnote{ We chose not
594: to employ a language model to re-rank either system's output because such an
595: addition would make it hard to isolate the contribution of the paraphrasing
596: component itself. }
597:
598:
599:
600: \begin{figure*}[htpb]\footnotesize
601: \hspace*{-.2in}
602: \begin{tabular}{|l|l|} \hline
603: Original (1) & {\em The caller identified the bomber as Yussef Attala, 20, from the
604: Balata refugee camp near Nablus.} \\\hline
605: MSA & The caller named the bomber as 20-year old Yussef Attala from the
606: Balata refugee camp near Nablus. \\\hline
607: Baseline & The company placed the bomber as Yussef Attala, 20, from the
608: Balata refugee camp near Nablus. \\\hline \hline
609: Original (2) & {\em A spokesman for the group claimed responsibility for the attack
610: in a phone call to AFP in this northern West Bank town}. \\\hline
611: MSA & The attack in a phone call to AFP in this northern West Bank town
612: was claimed by a spokesman of the group. \\\hline
613: Baseline & \parbox[t]{6in}{A spokesman for the grouping laid claim
614: responsibility for the onslaught in a phone call to AFP
615: in this northern West Bank town. } \\\hline
616: \end{tabular}
617: \caption{Example sentences and generated paraphrases. Both judges felt
618: MSA preserved the meaning of (1) but not (2), and that neither
619: baseline paraphrase was meaning-preserving.}
620: \label{fig:WordNet}
621: \end{figure*}
622:
623:
624:
625: \newcommand{\results}[2]{#2\xspace} For this experiment, we randomly selected
626: 20 AFP articles about violence in the Middle East published later than the
627: articles in our training corpus. Out of 484 sentences in this set, our system
628: was able to paraphrase 59 (12.2\%). (We chose parameters that optimized
629: precision rather than recall on our small held-out set.) We found that after
630: proper name substitution, only seven sentences in the test set appeared in the
631: training set,\footnote{Since we are doing unsupervised paraphrase acquisition,
632: train-test overlap is allowed.} which implies that \lattices boost the
633: generalization power of our method significantly: from seven to 59 sentences.
634: Interestingly, the coverage of the system varied significantly with article
635: length. For the eight articles of ten or fewer sentences, we paraphrased
636: 60.8\% of the sentences per article on average, but for longer articles only
637: 9.3\% of the sentences per article on average were paraphrased. Our analysis
638: revealed that long articles tend to include large portions that are unique to
639: the article, such as personal stories of the event participants, which explains
640: why our algorithm had a lower paraphrasing rate for such articles.
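
As a quick check of the coverage figures above (simple arithmetic on the
reported counts, not an additional result):
\[
\frac{59}{484} \approx 12.2\%, \qquad \frac{59}{7} \approx 8.4 .
\]
That is, roughly one test sentence in eight was paraphrased, and matching
against \lattices rather than requiring a test sentence to appear in the
training set (after proper name substitution) increased the number of
covered sentences by a factor of about 8.4.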
641:
642:
643:
644: All 118 instances (59 per system) were presented in random order to two judges,
645: who were asked to indicate whether the meaning had been preserved. Of the
646: paraphrases generated by our system, the two evaluators deemed
\results{59-11}{81.4\%} and \results{59-13}{78.0\%}, respectively, to be valid,
648: whereas for the baseline system, the correctness results were
649: \results{59-18}{69.5\%} and \results{59-20}{66.1\%}, respectively. Agreement
650: according to the Kappa statistic was 0.6. Note that judging full sentences is
651: inherently easier than judging templates, because template comparison requires
652: considering a variety of possible slot values, while sentences are
653: self-contained units.
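
The percentages above can be recovered from raw counts. Assuming that the
two judges rejected 11 and 13 of the 59 MSA paraphrases and 18 and 20 of
the 59 baseline paraphrases, respectively (counts that are consistent with
the reported percentages but are an assumption of this illustration), we
have
\[
\begin{array}{lrcl}
\mbox{MSA, first judge:} & 48/59 & \approx & 81.4\% \\
\mbox{MSA, second judge:} & 46/59 & \approx & 78.0\% \\
\mbox{Baseline, first judge:} & 41/59 & \approx & 69.5\% \\
\mbox{Baseline, second judge:} & 39/59 & \approx & 66.1\%
\end{array}
\]
Under this reading, the gap between the two systems is about 12 percentage
points for each judge.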
654:
Figure~\ref{fig:WordNet} shows two example sentences, one where our MSA-based
656: paraphrase was deemed correct by both judges, and one where both judges deemed
657: the MSA-generated paraphrase incorrect. Examination of the results indicates
658: that the two systems make essentially orthogonal types of errors. The baseline
659: system's relatively poor performance supports our claim that whole-sentence
660: paraphrasing is a hard task even when accurate word-level paraphrases are
661: given.
662:
663:
664: \section{Conclusions}
665:
We presented an approach for generating sentence-level
paraphrases, a task not addressed previously. Our method learns
668: structurally similar patterns of expression from data and identifies
669: paraphrasing pairs among them using a comparable corpus. A flexible
670: pattern-matching procedure allows us to paraphrase an unseen sentence by
671: matching it to one of the induced patterns. Our approach
672: generates both lexical and structural paraphrases.
673:
674: Another contribution is the induction of MSA lattices from non-parallel data.
675: Lattices have proven advantageous in a number of NLP contexts
676: \cite{Mangu&Brill&Stolcke:00a,Bangalore&Murdock&Riccardi:2002a,Barzilay&Lee:2002a,Pang+Knight+Marcu:03a},
677: but were usually produced from \mbox{(multi-)p}arallel data, which may not be
678: readily available for many applications. We showed that word lattices can be
679: induced from a type of corpus that can be easily obtained for many domains,
680: broadening the applicability of this useful representation.
681:
682: \vspace*{-.1in}
683:
684:
685: \section*{Acknowledgments}
686:
687: {\footnotesize{
688: We are grateful to many people for helping us in this work. We thank
689: Stuart Allen, Itai Balaban, Hubie Chen, Tom Heyerman, Evelyn Kleinberg,
690: Carl Sable, and Alex Zubatov for acting as judges. Eric Breck helped us
691: with translating the output of the DIRT system. We had numerous very
692: useful conversations with all those mentioned above and with Eli Barzilay,
693: Noemie Elhadad, Jon Kleinberg (who made the ``pigeonhole'' observation),
694: Mirella Lapata, Smaranda Muresan and Bo Pang. We are very grateful to
695: Dekang Lin for providing us with DIRT's output. We thank the Cornell NLP
696: group, especially Eric Breck, Claire Cardie, Amanda Holland-Minkley, and Bo
697: Pang, for helpful comments on previous drafts. This paper is based upon
698: work supported in part by the National Science Foundation under ITR/IM
699: grant IIS-0081334 and a Sloan Research Fellowship. Any opinions, findings,
700: and conclusions or recommendations expressed above are those of the authors
701: and do not necessarily reflect the views of the National Science Foundation
702: or the Sloan Foundation.
703:
704: \vspace*{-.2in}
705:
706:
707:
708:
709: \bibliographystyle{llee-fullname}
710:
711: \begin{thebibliography}{}
712:
713: \bibitem[\protect\citename{Bangalore, Murdock, and
714: Riccardi}2002]{Bangalore&Murdock&Riccardi:2002a}
715: Bangalore, Srinivas, Vanessa Murdock, and Giuseppe Riccardi.
716: \newblock 2002.
717: \newblock Bootstrapping bilingual data using consensus translation for a
718: multilingual instant messaging system.
719: \newblock In {\em \proc of COLING}.
720:
721: \bibsnip
722:
723: \bibitem[\protect\citename{Barzilay and Lee}2002]{Barzilay&Lee:2002a}
724: Barzilay, Regina and Lillian Lee.
725: \newblock 2002.
726: \newblock Bootstrapping lexical choice via multiple-sequence alignment.
727: \newblock In {\em \proc of EMNLP}, pages 164--171.
728:
729: \bibsnip
730:
731: \bibitem[\protect\citename{Barzilay and McKeown}2001]{Barzilay&McKeown:01a}
732: Barzilay, Regina and Kathleen McKeown.
733: \newblock 2001.
734: \newblock Extracting paraphrases from a parallel corpus.
735: \newblock In {\em \proc of the ACL/EACL}, pages 50--57.
736:
737: \bibsnip
738:
739: \bibitem[\protect\citename{Chandrasekar and
740: Bangalore}1997]{Chandrasekar+Srinivas:97a}
741: Chandrasekar, Raman and Srinivas Bangalore.
742: \newblock 1997.
743: \newblock Automatic induction of rules for text simplification.
744: \newblock {\em Knowledge-Based Systems}, 10(3):183--190.
745:
746: \bibsnip
747:
748: \bibitem[\protect\citename{Dras}1999]{Dras:1999a}
749: Dras, Mark.
750: \newblock 1999.
751: \newblock {\em Tree Adjoining Grammar and the Reluctant Paraphrasing of Text}.
752: \newblock {Ph.D.} thesis, Macquarie University.
753:
754: \bibsnip
755:
756: \bibitem[\protect\citename{Durbin \bgroup et al.\egroup
757: }1998]{Durbin+Eddy+al:98a}
758: Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison.
759: \newblock 1998.
760: \newblock {\em Biological Sequence Analysis}.
761: \newblock Cambridge University Press, Cambridge, UK.
762:
763: \bibsnip
764:
765: \bibitem[\protect\citename{Grefenstette}1994]{Grefenstette:94a}
766: Grefenstette, Gregory.
767: \newblock 1994.
768: \newblock {\em Explorations in Automatic Thesaurus Discovery}, volume 278.
769: \newblock Kluwer.
770:
771: \bibsnip
772:
773: \bibitem[\protect\citename{Iordanskaja, Kittredge, and
774: Polguere}1991]{Iordanskaja&Kittredge&Polguere:1991a}
775: Iordanskaja, L., R.~Kittredge, and A.~Polguere.
776: \newblock 1991.
777: \newblock Lexical selection and paraphrase in a meaning-text generation model.
778: \newblock In C.~Paris, W.~Swartout, and W.~Mann, editors, {\em Natural Language
779: Generation in Artificial Intelligence and Computational Linguistics}. Kluwer,
780: chapter~11.
781:
782: \bibsnip
783:
784: \bibitem[\protect\citename{Jacquemin}1999]{Jacquemin:l999a}
785: Jacquemin, Christian.
786: \newblock 1999.
787: \newblock Syntagmatic and paradigmatic representations of term variations.
788: \newblock In {\em \proc of the ACL}, pages 341--349.
789:
790: \bibsnip
791:
792: \bibitem[\protect\citename{Knight and Marcu}2000]{Knight&Marcu:2000a}
793: Knight, Kevin and Daniel Marcu.
794: \newblock 2000.
795: \newblock Statistics-based summarization --- {Step} one: Sentence compression.
796: \newblock In {\em \proc of AAAI}.
797:
798: \bibsnip
799:
800: \bibitem[\protect\citename{Landis and Koch}1977]{Landis&Koch:1977a}
801: Landis, J.~Richard and Gary~G. Koch.
802: \newblock 1977.
803: \newblock The measurement of observer agreement for categorical data.
804: \newblock {\em Biometrics}, 33:159--174.
805:
806: \bibsnip
807:
808: \bibitem[\protect\citename{Lin}1998]{Lin:1998a}
809: Lin, Dekang.
810: \newblock 1998.
811: \newblock {Automatic retrieval and clustering of similar words}.
812: \newblock In {\em \proc of ACL/COLING}, pages 768--774.
813:
814: \bibsnip
815:
816: \bibitem[\protect\citename{Lin and Pantel}2001]{Lin&Pantel:2001a}
817: Lin, Dekang and Patrick Pantel.
818: \newblock 2001.
819: \newblock Discovery of inference rules for question-answering.
820: \newblock {\em Natural Language Engineering}, 7(4):343--360.
821:
822: \bibsnip
823:
824: \bibitem[\protect\citename{Mangu, Brill, and
825: Stolcke}2000]{Mangu&Brill&Stolcke:00a}
826: Mangu, Lidia, Eric Brill, and Andreas Stolcke.
827: \newblock 2000.
828: \newblock Finding consensus in speech recognition: Word error minimization and
829: other applications of confusion networks.
\newblock {\em Computer Speech and Language}, 14(4):373--400.
831:
832: \bibsnip
833:
834: \bibitem[\protect\citename{McKeown}1979]{McKeown:79a}
835: McKeown, Kathleen~R.
836: \newblock 1979.
837: \newblock Paraphrasing using given and new information in a question-answer
838: system.
839: \newblock In {\em \proc of the ACL}, pages 67--72.
840:
841: \bibsnip
842:
843: \bibitem[\protect\citename{McKeown, Kukich, and
844: Shaw}1994]{McKeown&Kukich&Shaw:1994a}
845: McKeown, Kathleen~R., Karen Kukich, and James Shaw.
846: \newblock 1994.
847: \newblock Practical issues in automatic documentation generation.
848: \newblock In {\em \proc of ANLP}, pages 7--14.
849:
850: \bibsnip
851:
852: \bibitem[\protect\citename{Meteer and Shaked}1988]{Meteer+Shaked:88a}
853: Meteer, Marie and Varda Shaked.
854: \newblock 1988.
855: \newblock Strategies for effective paraphrasing.
856: \newblock In {\em \proc of COLING}, pages 431--436.
857:
858: \bibsnip
859:
\bibitem[\protect\citename{Och, Tillmann, and Ney}1999]{Och&Tillman&Ney:1999a}
Och, Franz~Josef, Christoph Tillmann, and Hermann Ney.
862: \newblock 1999.
863: \newblock Improved alignment models for statistical machine translation.
864: \newblock In {\em \proc of EMNLP}, pages 20--28.
865:
866: \bibsnip
867:
868: \bibitem[\protect\citename{Pang, Knight, and Marcu}2003]{Pang+Knight+Marcu:03a}
869: Pang, Bo, Kevin Knight, and Daniel Marcu.
870: \newblock 2003.
871: \newblock Syntax-based alignment of multiple translations: Extracting
872: paraphrases and generating new sentences.
\newblock In {\em \proc of HLT/NAACL}.
874:
875: \bibsnip
876:
877: \bibitem[\protect\citename{Papineni \bgroup et al.\egroup
878: }2002]{Papineni&al:2002a}
879: Papineni, Kishore~A., Salim Roukos, Todd Ward, and Wei-Jing Zhu.
880: \newblock 2002.
881: \newblock Bleu: A method for automatic evaluation of machine translation.
882: \newblock In {\em \proc of the ACL}, pages 311--318.
883:
884: \bibsnip
885:
886: \bibitem[\protect\citename{Pereira, Tishby, and
887: Lee}1993]{Pereira&Tishby&Lee:1993a}
888: Pereira, Fernando, Naftali Tishby, and Lillian Lee.
889: \newblock 1993.
890: \newblock Distributional clustering of {English} words.
891: \newblock In {\em \proc of the ACL}, pages 183--190.
892:
893: \bibsnip
894:
895: \bibitem[\protect\citename{Robin}1994]{Robin-phd}
896: Robin, Jacques.
897: \newblock 1994.
898: \newblock {\em Revision-Based Generation of Natural Language Summaries
899: Providing Historical Background: Corpus-Based Analysis, Design,
900: Implementation, and Evaluation}.
901: \newblock {Ph.D.} thesis, Columbia University.
902:
903: \bibsnip
904:
905: \bibitem[\protect\citename{Shinyama \bgroup et al.\egroup
906: }2002]{Shinyama&al:2002a}
907: Shinyama, Yusuke, Satoshi Sekine, Kiyoshi Sudo, and Ralph Grishman.
908: \newblock 2002.
909: \newblock Automatic paraphrase acquisition from news articles.
910: \newblock In {\em \proc of HLT}, pages 40--46.
911:
912: \bibsnip
913:
914: \bibitem[\protect\citename{Smadja and McKeown}1991]{Smadja&McKeown:1991a}
915: Smadja, Frank and Kathleen McKeown.
916: \newblock 1991.
917: \newblock Using collocations for language generation.
918: \newblock {\em Computational Intelligence}, 7(4).
919: \newblock Special issue on natural language generation.
920:
921: \bibsnip
922:
923: \bibitem[\protect\citename{Vogel and Ney}2000]{Vogel&Ney:2000a}
924: Vogel, Stephan and Hermann Ney.
925: \newblock 2000.
926: \newblock Construction of a hierarchical translation memory.
927: \newblock In {\em \proc of COLING}, pages 1131--1135.
928:
929: \bibsnip
930:
931: \bibitem[\protect\citename{Wang}1998]{Wang:1998a}
932: Wang, Ye-Yi.
933: \newblock 1998.
934: \newblock {\em Grammar Inference and Statistical Machine Translation}.
935: \newblock {Ph.D.} thesis, CMU.
936:
937: \end{thebibliography}
938:
939: }
940: }
941:
942: \section*{Appendix}
943:
944: In this appendix, we describe how we insert slots into \lattices to
945: form \slotlats.
946:
947: Recall that the backbone nodes in our \lattices represent words appearing in
948: many of the sentences from which the lattice was built. As mentioned above,
949: the intuition is that areas of high variability between backbone nodes may
correspond to arguments, or slots. The key observation, however, is that two
different phenomena give rise to multiple parallel paths: {\em
argument variability}, described above, and {\em synonym variability}. For
953: example, Figure \ref{fig:variability}(b) contains parallel paths corresponding
954: to the synonyms ``injured'' and ``wounded''. Note that we want to {\em remove}
955: argument variability so that we can generate paraphrases of sentences with
956: arbitrary arguments; but we want to {\em preserve} synonym variability in order
957: to generate a variety of sentence rewritings.
958:
To distinguish these two situations, we analyze the {\em split
level} of \backbone nodes that begin regions with multiple paths. The
basic intuition is that there is probably more variability associated
with arguments than with synonymy: for example, as a dataset grows, the
number of locations mentioned rises faster than the number of synonyms
appearing. We make use of a {\em synonymy threshold} $s$ (set by held-out
parameter tuning to 30), as follows.
968:
969: \begin{itemize}
970: \item If no more than $s$\% of all the edges out of a \backbone node
971: lead to the same next node, we have high enough variability to
972: warrant inserting a {\slot} node.
973: \item Otherwise, we incorporate reliable synonyms\footnote{While our original
974: implementation, evaluated in Section~\ref{sec:eval}, identified only
975: single-word synonyms, phrase-level synonyms can similarly be acquired by
976: considering chains of nodes connecting backbone nodes.} into the \backbone
977: structure by preserving all nodes that are reached by at least $s$\% of the
978: sentences passing through the two neighboring \backbone nodes.
979: \end{itemize}
980: Furthermore, all \backbone nodes
981: labelled with our special generic tokens are
982: also replaced with \slot nodes, since they, too, probably represent arguments
983: (we condense adjacent \slots into one). Nodes with in-degree lower than the
984: synonymy threshold are removed under the assumption that they probably
985: represent idiosyncrasies of individual sentences. See Figure
986: \ref{fig:variability} for examples.
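
The decision rule above can be written compactly as follows. The notation
is introduced here only for illustration, and we assume that every sentence
passing through a backbone node contributes exactly one outgoing edge. For
a backbone node $b$, let $n(b,v)$ be the number of edges from $b$ to node
$v$, and let $f(b)$ be the largest share of outgoing edges that lead to a
single next node:
\[
f(b) = \max_{v} \frac{n(b,v)}{\sum_{v'} n(b,v')} .
\]
Then
\[
\mbox{action}(b) = \left\{
\begin{array}{ll}
\mbox{insert a slot} & \mbox{if } f(b) \le s/100, \\
\mbox{preserve frequent nodes} & \mbox{otherwise,}
\end{array}
\right.
\]
where a node counts as frequent if it is reached by at least $s$\% of the
sentences passing through the two neighboring backbone nodes. As a
hypothetical example with seven sentences and $s = 30$: if the seven
sentences continue from $b$ to seven different words, then
$f(b) = 1/7 \approx 14\% \le 30\%$ and a slot is inserted; if four
continue with ``injured'' and three with ``wounded'', then
$f(b) = 4/7 \approx 57\% > 30\%$, and both words are preserved as synonyms,
since each is reached by at least 30\% of the sentences
($4/7 \approx 57\%$ and $3/7 \approx 43\%$).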
987:
Figure~\ref{fig:lattice} shows an example of a
989: \lattice and the \slotlat derived via the process just described.
990:
991:
992: \begin{figure}[h]
993: \epsfscaledbox{variability.eps}{2.8in}
994: \caption{\label{fig:variability} Simple seven-sentence examples of two types of
995: variability. The double-boxed nodes are \backbone nodes; edges show
996: consecutive words in some sentence. The synonymy threshold (set to 30\%
997: in this example)
998: determines the type of variability. }
999: \end{figure}
1000:
1001:
1002:
1003:
1004:
1005: \end{document}
1006: