
1: \documentclass[11pt]{article}

2: \usepackage{graphicx}

3: \usepackage{listings}

4:

5: \parindent 0pt

6: %\textwidth 15.0cm

7: %\textheight 20cm

8:

9:

10: \title{\bf A Semi-Automatic Framework to Discover Epistemic Modalities in Scientific Articles}

11: \author{\small Sviatlana Danilava\\

12:              \small JW Goethe-University Frankfurt am Main\\

13:              \small Dept. of Computer Science and Mathematics\\

14:              \small Robert-Mayer-Str. 11-15, D-60486 Frankfurt am Main, Germany.\\

15:              \small Email: danilava@cs.uni-frankfurt.de

16:              \and

17:              \small Christoph Schommer \\

18:               \small University of Luxembourg\\

19:               \small Dept. of Computer Science - ILIAS Laboratory, MINE Research Group\\

20:               \small 6, Rue Richard Coudenhove-Kalergi, 1359 Luxembourg, Luxembourg\\

21:               \small Email: christoph.schommer @ uni.lu Home: mine.uni.lu

22: }

23: \date{\today}

24:

25: \markboth{A}{B}

26:

27: \begin{document}

28: \maketitle

29:

30: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

31: \begin{abstract}

32: Documents in scientific newspapers are often marked by the attitudes and opinions of the author and of other persons, who contribute both objective and subjective statements and arguments. Such an attitude is often conveyed by a linguistic modality. Because languages like English, French, and German express modality through special verbs such as {\sf can, must, may, etc.} and through the subjunctive mood, it is often assumed that these verbs alone carry the {\it modality}. This is not correct: it has been shown that modality is an instrument of the whole sentence, to which adverbs, modal particles, punctuation marks, and intonation all contribute. Often, a combination of these instruments is necessary to express a modality. In this work, we are concerned with finding modal verbs in scientific texts as a preliminary step towards discovering the attitude of an author. The input is an arbitrary text; the output consists of zones representing modalities.

33: \end{abstract}

34:

35: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

36: \section{Introduction}

37: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

38: Search engines for the World Wide Web return large numbers of hits for almost any request. However, intelligent search queries like {\sf Which scientists support hypothesis A} or {\sf Does the author believe in my opinion} are not yet supported. To answer them, a search engine must first find appropriate documents and then analyse them quickly. This requires intelligent algorithms that use linguistic insights for an analytical treatment of syntax and style, and also for meta-aspects such as the opinion and attitude of the author. This is especially true for scientific texts, which are very objective when they describe a hypothesis or discuss problems. Discovering the subjective opinion or attitude of the author is the research objective of \textit{Attitude Mining}. It is concerned with discovering meta-information in documents, especially the attitude of an author towards events, references to others' work, etc. The attitude can be positive or negative, but in most cases it is hidden and has to be established from indications. Attitude Mining is concerned with the explorative discovery of these indications (\cite{SQW06}), and it demands profound knowledge in areas like computer science, linguistics, cognitive sciences, and psychology. According to \cite{HOL88}, there exist more than 350 lexical style attributes for attitude, for example to express doubts or beliefs. As further motivation, the following sentences demonstrate the existence of subjectivity in scientific texts:

39:

40: \begin{itemize}

41:     \item[$\rightarrow$] {\sf When paleontologists seek the roots of life, they head to rocks of the Archaean Eon, which range from 3.8 billion to 2.5 billion years old.}

42:     \item[$\rightarrow$] {\sf Australian and Canadian researchers argue this week in Nature that stromatolites were so diverse and complex that they must have been alive.}

43:     \item[$\rightarrow$] {\sf Martin Brasier of Oxford University is less sanguine, arguing that the structures are more likely chemical precipitates. He also objects to the reasoning in the Nature paper. ``You can't use the argument that complexity is the signature for life,'' he says. }

44: \end{itemize}

45:

46: The first sentence is neutral: it only describes what palaeontologists normally do {\it when they try to find the origin of life}. The second sentence states a hypothesis together with an explanatory justification; the third sentence argues against that hypothesis and names its originator. In this respect, modality concerns the way a speaker modifies the proposition of a sentence through subjective components. As seen above, many sentences are modal, for example

47:

48: \begin{itemize}

49:     \item[$\rightarrow$]  {\sf I believe she arrives this morning at London Heathrow.}

50:     \item[$\rightarrow$]  {\sf I can not be in today.}

51: \end{itemize}

52:

53: In western languages, modality is expressed by special verbs like {\sf can, must, may, etc.} and by the subjunctive mood. However, this often suggests that the verbs alone carry the {\it modality}, which is not correct: it has been shown that modality is an attribute of the whole sentence, to which adverbs, modal particles, punctuation marks, and the intonation of the sentence all contribute. Often, a combination of these instruments is necessary to express a certain modality. For example, the sentence

54:

55: \begin{itemize}

56:      \item[$\rightarrow$]  {\sf Do you really think that?}

57: \end{itemize}

58:

59: is understood differently from

60:

61: \begin{itemize}

62:      \item[$\rightarrow$]  {\sf You do not really think of that?}

63: \end{itemize}

64:

65: In the first sentence, the combination of \textit{really}, \textit{think}, and the question form is very subjective but leaves the recipient some room. The second sentence is much more subjective: it steers the recipient's answer completely and leaves no room for any answer other than `no'. Overall, the complexity of modality is one of the major problems, both for the analysis of texts per se and for machine translation systems.

66:

67: In this work, we are concerned with finding modal verbs in scientific texts as a preliminary step toward discovering the attitude of an author. The input is an arbitrary text; the output consists of zones representing modalities.

68:

69: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

70: \section{Fundamentals}

71: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

72: The concept of modality originally derives from formal logic. There, a modal expression consists of two parts, the \textit{modal} part and the \textit{proposition} part. The modal part contains the modality, the proposition part the actual statement. Moreover, the modal part is either \textit{deontic} or \textit{epistemic} (\cite{PAL01}). A deontic modality describes the conditions under which the statement becomes true or false, always in relation to reality, for example:

73:

74: \begin{itemize}

75:     \item[$\rightarrow$] {\sf Indeed, the turnover of phytoplankton can be so high that there can be inverted pyramids of biomass, in which the standing crop of herbivorous zooplankton actually exceeds that of the phytoplankton. }

76: \end{itemize}

77:

78: The verb acts as the modality: it expresses an idea about objective reality that might come true under certain circumstances. Epistemic modality, on the other hand, is concerned with the personal experience and the state of knowledge of the author rather than with reality:

79:

80: \begin{itemize}

81:     \item[$\rightarrow$] {\sf Australian and Canadian researchers argue this week in Nature that stromatolites were so diverse and complex that they must have been alive.}

82: \end{itemize}

83:

84: The verb \textit{must} is used epistemically: the statement is not yet proven but remains an assumption. This assumption is introduced by a justification. Furthermore, the source of information can be given, for example in

85:

86: \begin{itemize}

87:     \item[$\rightarrow$] {\sf Martin Brasier of Oxford University is less sanguine, arguing that the structures are more likely chemical precipitates. }

88: \end{itemize}

89:

90: This sentence contains an explicit source, namely \textit{Martin Brasier}. Such statements are called evidential statements and are mostly regarded as a sub-category of epistemic modality.

91:

92: Modality is supported by a set of expressions: to develop a methodology for automatic recognition, the lexical foundation must be established first. Modal verbs form a class of verbs that add a modal meaning to a proposition. They allow the sender to modify the essence of a sentence by possibilities, necessities, doubts, beliefs, etc. In English, these are for example {\sf must - have to}, {\sf can - could - may}, and {\sf will - would - shall}. However, the use of modal verbs often leads to ambiguity, as the same modal verbs express both deontic and epistemic meanings. Verbs like {\sf believe, doubt, accept, reject, etc.} describe the mental state of the speaker or his attitude towards the propositional part of the statement. \textit{Nouns} may describe mental states or cognitive processes as well, for example \textit{doubt, belief, rejection, etc.} Adverbs and adjectives are \textit{lexical modifiers} that may express doubts and beliefs, for example \textit{perhaps, probably, possibly, certain, likely}, etc.

93:

94: English modal verbs are used in both epistemic and deontic meanings. Generally, modal verbs express either a possibility or a necessity; each modal verb has several meanings with semantic and pragmatic differences. Consider the word \textit{must}. In its deontic reading, it describes a necessity imposed by an external source, where the propositional subject is not the source of the modality. In its epistemic reading, it describes a necessity that follows from a logical justification. Of the following two sentences, the first is deontic and the second epistemic:

95:

96: \begin{itemize}

97:     \item[$\rightarrow$] {\sf I must go, she is already waiting for me.}

98:     \item[$\rightarrow$] {\sf Where is John? It is 14h00, he must be in school.}

99: \end{itemize}

100:

101: The epistemic reading of modal verbs can be summarised as follows:

102:

103: \begin{itemize}

104:     \item Epistemic necessity as a conclusion from the speaker's evidence: \textit{she must be in her office}.

105:     \item Epistemic necessity as a logical conclusion from a commonly valid and known fact:  \textit{she will be in her office}.

106:     \item Epistemic possibility as an uncertainty of the speaker:  \textit{she may be in her office}.

107: \end{itemize}

108:

109: The epistemic use of modal verbs, epistemic adverbs, and cognitive verbs conveys subjectivity. They provide a basis for the attitude of the author, as for example in

110:

111: \begin{itemize}

112:     \item[$\rightarrow$]{\sf ``The individual grains in them could not have accumulated mechanically because the slope of the cone is too great,'' says Stanley Awramik, a stromatolite expert at the University of California, Santa Barbara, who was not involved in the research.}

113: \end{itemize}

114:

115: Here, the proposition is only a personal attitude (\textit{could}) of the referenced person and is not proven at all. Even so, the modal verbs provide enough information to discover the author's attitude and to distinguish it from the attitudes of others.

116:

117: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

118: \section{Selected Research Work}

119: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

120:

121: Current research follows divergent directions, especially in establishing linguistic and cognitive models. These models support an understanding of the lexical means of expression, their influence on the lexical environment, and the modification of meaning through modality.

122:

123: Regarding modalities as a component that influences the discovery of attitude, \cite{PZ06} states that it is insufficient to model attitude as simply positive or negative. Rather, the attitude can be modified by {\it contextual valence shifters} such as {\sf not}, {\sf never}, and {\sf none}, and modifiers like {\sf rather}, {\sf deeply}, and {\sf few} must be taken into account as well. \cite{BER06} notes that {\it reported speech} deserves particular attention, since evidential aspects must be examined in addition. \cite{KEF06} argues that the lexical means of expression should not be regarded as the conveyors of meaning, but that typical structures of attitude phrases can be observed.

124:

125: The analysis of the lexical resources that authors additionally use to highlight their intention to express attitudes is currently under research as well. \cite{MAT06} pursues the construction of specific emotional lexicons with positive, negative, and neutral meaning, as well as an automatic extraction of emotion to extend these lexicons.

126:

127: The detection of document zones to structure a document is becoming more and more popular. It was initially presented as \textit{Argumentative Zoning} by Teufel and Moens (\cite{TM99}), has been applied in other works as well (\cite{ST07}, \cite{TEU06}), and has strongly influenced research on \textit{Content Zoning} (\cite{BRU08}). The main motivation is to \textit{summarise} documents and to divide them into discourse-rhetorical zones. Teufel and Moens argue that, depending on the type, genre, and style of the text, a standardised structure can often be identified. Using \textit{scientific articles}, they assign seven argumentative zones to each text; the zoning is then performed by a supervised learning system. \cite{MMC04} suggests an extended classification in which each sentence is assigned a rhetorical role. There are up to ten zones, grouped into three classes. They argue that there are no fixed sequences of rhetorical roles; a sentence may belong to several zones, which are then called combined zones.

128:

129: Following the idea of \textit{Opinion Mining}, \cite{BYT06} describe a model to detect \textit{opinion words}. The idea is to discover propositions together with the subjective lexical expressions that accompany them, for example \textit{accuse}, \textit{criticise}, or \textit{doubt}. All constituents of each sentence receive a zoning label like \textit{Opinion Proposition}, \textit{Opinion Holder}, or \textit{Null}. Another approach is the disambiguation of modal verbs: \cite{KIP95} has implemented a rule-based system for disambiguating the epistemic and deontic meanings of German verbs like {\sf sollen}, {\sf k\"onnen}, or {\sf d\"urfen}.

130:

131: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

132: \section{Architecture}

133: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

134:

135: The framework consists of two major parts, which are presented in Figure \ref{fig:architecture}. The first part focuses on pre-processing the input text, whereas the second part is concerned with the disambiguation of the modal verbs and the semi-automatic classification of the corresponding sentences. The pre-processing begins with a part-of-speech tagger and is followed by a module that detects named entities and pronouns.

136:

137: \vskip 0.5cm

138: \begin{figure}[htbp]

139:    \centering

140:    \includegraphics[width=7cm]{I1.jpg}

141:    \caption{The architecture of the framework, using the Brown Corpus and the Names Corpus. First, a part-of-speech tagger \cite{BLK07} is applied to a given input and sends the intermediate result to the \textit{naming} and \textit{pronoun engine}. After that, modalities are disambiguated (\textit{Ambiguity Engine}) in synchronisation with the list of modality verbs and finally classified (\textit{Modality Classifier}). The text output contains \textit{modality tags}.}

142:    \label{fig:architecture}

143: \end{figure}

144:

145: We use the WordNet thesaurus (\cite{FEL98}) to establish a list of modality verbs. This list contains several lexical categories; its entries are found recursively using the synonym function of WordNet. Several corpora exist for the English language, for example the Brown University Corpus (\cite{KUF67}), the International Corpus of English (ICE), and the British National Corpus (BNC). The ICE is a set of corpora that covers various dialects of English from around the world. The BNC is a text corpus of both written and spoken English, covering more than 100 million words of the late twentieth century from a wide variety of genres. In this work, however, we use the Brown University Corpus to support the part-of-speech tagger, which assigns a syntactic category to each word of the input document. Each word is kept as it occurs: it is replaced by a pair consisting of the original word and a list of its syntactic categories. Words of the same root but with a different inflection are kept as they are.

146:

147: To resolve \textit{named entities}, a first method identifies personal names on the basis of the references that the document is likely to contain. By definition, this method is applied first, but it is confused by proper names of institutions such as {\sf Max-Planck Institute}. In this case, external databases must be consulted using an automaton (see Figure \ref{fig:graph}). For the identification of person names, we have used the \textit{Names Corpus} by \cite{KRR91}, which contains 5001 female and 3000 male first names.
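
The name automaton of Figure \ref{fig:graph} can be illustrated with a minimal sketch. It is not the implementation used in this work: the tiny first-name set stands in for the \textit{Names Corpus}, the function names are hypothetical, and the path \textit{FN\_ABB\_LN} for an abbreviated middle name is an assumption.

```python
import re

# Hypothetical stand-in for the first names of the Names Corpus.
FIRST_NAMES = {"peter", "stanley", "martin"}

# Accepted paths; FN_ABB_LN (abbreviated middle name) is an assumed path.
ACCEPTED_PATHS = {"FN_LN", "ABB_LN", "FN_ABB_LN"}


def token_class(token):
    """Map one capitalised token to FN, LN or ABB (None if it is no candidate)."""
    if re.fullmatch(r"[A-Z]\.", token):
        return "ABB"                      # single initial such as "P."
    if token.lower() in FIRST_NAMES:
        return "FN"                       # known first name
    if token[:1].isupper():
        return "LN"                       # any other capitalised word
    return None


def name_path(tokens):
    """Return the automaton path for a token sequence, or None if it is rejected."""
    classes = [token_class(t) for t in tokens]
    if not classes or None in classes:
        return None
    path = "_".join(classes)
    return path if path in ACCEPTED_PATHS else None
```

In this sketch, an institution name such as {\sf Max-Planck Institute} yields the path \textit{LN\_LN}, which is not accepted, so such cases would still need the external lookup described above.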

148:

149: To identify the pronouns in the text, we restrict the list of candidates to \textit{he}, \textit{she}, and \textit{who}, as they refer to one specific person. Common terms like \textit{researchers} or \textit{community members} are not considered, nor are the pronoun \textit{they} and \textit{cataphora}. We first deal with \textit{who}, which occurs after the nominal phrase (NP) it refers to and in the same sentence as that NP.
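
As a rough illustration of the rule for \textit{who}, the following sketch links each \textit{who} to the nearest preceding person in the same sentence. The triple format (word, tag, person) and the function name are assumptions made for this example only.

```python
def resolve_who(sentence):
    """Link each 'who' (tag WPS) to the nearest preceding person in the sentence.

    `sentence` is a list of (word, tag, person) triples, where `person` is the
    name attached by the naming step or None. Returns (token index, person) pairs.
    """
    last_person = None
    links = []
    for index, (word, tag, person) in enumerate(sentence):
        if person is not None:
            last_person = person          # remember the most recent person NP
        elif tag == "WPS" and word.lower() == "who" and last_person is not None:
            links.append((index, last_person))
    return links
```

A \textit{who} without a preceding person in the same sentence stays unresolved in this sketch.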

150:

151: After pre-processing, the annotated texts are sent to the classification module. The modal verbs are disambiguated before they are passed to the classifier. As we must differentiate between deontic and epistemic modality, these two are taken as the target classes. We then use the following simple rule scheme:

152:

153: \begin{itemize}

154:     \item A sentence is \textbf{deontic modal} if it contains a modality word that is deontic and if there exist modality words that refer to facts.

155:     \item A sentence is \textbf{epistemic modal} if it contains a modality word that is epistemic and if there exist modality words that refer to the subjective attitude of the author.

156:     \item A sentence is \textbf{non-modal} if there is no lexical evidence for modality.

157: \end{itemize}

158:

159: The scheme may be improved by including other criteria for disambiguation, for example time. A more granular differentiation between \textit{epistemic positive} and \textit{epistemic negative} is possible when the modal and the propositional part of the sentence are considered together and the sentences are classified into \textit{Author X believes in Y} (positive) and \textit{Author X rejects Y} (negative). The disambiguation process is shown as a disambiguation automaton in Figure \ref{fig:graph2}.

160:

161: \begin{figure}[htbp]

162:    \centering

163:    \includegraphics[width=12cm]{I2.jpg}

164:    \caption{Automaton used for Naming Detection where \textit{FN} corresponds to the full first name, \textit{LN} to the full last name, and \textit{ABB} to any kind of abbreviation, like the abbreviated middle name. For example, \textit{P. Green} follows the path of \textit{ABB\_LN}, whereas \textit{Peter Green} ends in \textit{FN\_LN}.}

165:    \label{fig:graph}

166: \end{figure}

167:

168: \begin{figure}[htbp]

169:    \centering

170:    \includegraphics[width=12cm]{I3.jpg}

171:    \caption{Automaton used for disambiguation where \textit{MV} corresponds to modal verb, \textit{Vpa} the verb past participle, and \textit{Vpr} the verb present participle. \textit{not} represents the negation, \textit{have}, \textit{be}, and \textit{been} the corresponding words.}

172:    \label{fig:graph2}

173: \end{figure}

174:

175: The automaton decides to which class a modal verb belongs. Depending on certain collocations, the a-priori probability that a modal verb is epistemic is generally higher than the probability that it is deontic, so that a decision can be taken quite early. For example, if a certain collocation indicates that a modal verb $v_i$ is epistemic with a probability of 90 percent, the automaton classifies $v_i$ as epistemic. The classification criteria are:

176:

177: \begin{itemize}

178:    \item The modal verb {\sf must} refers to an epistemic necessity if it occurs with the following components: {\sf have been}, {\sf be} and {\sf verb present participle}, {\sf have} and {\sf verb past participle}, {\sf have been} and {\sf verb present participle}. In all other cases, the verb should be deontic.

179:    \item {\sf can} is deontic.

180:    \item {\sf can not} follows the same verb-component rule as {\sf must}.

181:    \item {\sf could} is epistemic.

182:    \item {\sf may} is epistemic.

183:    \item {\sf might} is epistemic as a tentative version of {\sf may}.

184:    \item {\sf will} is epistemic since future aspects are still hypothetical.

185:    \item {\sf shall} is deontic.

186:    \item {\sf should} is epistemic as {\sf must}.

187: \end{itemize}
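
The criteria above can be sketched as a small rule function over tagged tokens. This is an illustration under assumptions, not the automaton of Figure \ref{fig:graph2}: the function names are hypothetical, the tags \textit{VPA} and \textit{VPR} follow the figure caption, and {\sf should} is assumed to be checked against the same collocations as {\sf must}.

```python
EPISTEMIC, DEONTIC = "EPISTEMIC", "DEONTIC"

ALWAYS_EPISTEMIC = {"could", "may", "might", "will"}
ALWAYS_DEONTIC = {"shall"}
# Assumption: "should" is checked against the same collocations as "must".
CONTEXT_DEPENDENT = {"must", "should"}


def has_epistemic_collocation(following):
    """Check the verb components listed for 'must' on the tokens after the modal."""
    words = [word.lower() for word, _ in following]
    tags = [tag for _, tag in following]
    if words[:2] == ["have", "been"]:                     # have been (+ present participle)
        return True
    if words[:1] == ["be"] and tags[1:2] == ["VPR"]:      # be + present participle
        return True
    if words[:1] == ["have"] and tags[1:2] == ["VPA"]:    # have + past participle
        return True
    return False


def classify_modal(tokens, index):
    """Classify the modal verb at tokens[index]; tokens are (word, tag) pairs."""
    word = tokens[index][0].lower()
    following = tokens[index + 1:]
    negated = bool(following) and following[0][0].lower() == "not"
    if negated:
        following = following[1:]                         # skip the negation
    if word in ALWAYS_EPISTEMIC:
        return EPISTEMIC
    if word in ALWAYS_DEONTIC:
        return DEONTIC
    if word == "can" and not negated:
        return DEONTIC
    if word in CONTEXT_DEPENDENT or word == "can":        # must, should, can not
        return EPISTEMIC if has_epistemic_collocation(following) else DEONTIC
    return None                                           # modal not covered by the rules
```

Modal verbs that the criteria do not mention, such as {\sf would}, are left unclassified here rather than guessed.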

188:

189: In Figure \ref{fig:graph2}, only the paths to the class epistemic are shown; it is assumed that all other paths are either deontic or non-modal. A path like

190:

191: \begin{center} {\sf MV$\rightarrow$have$\rightarrow$been} \end{center}

192:

193: means that the sentence contains a modal verb followed by \textit{have} and \textit{been}. Modal verbs express the attitude and opinion of a person. In this work, we are concerned with two types of persons:

194:

195: \begin{itemize}

196:     \item The person is the author: the author of the text gives an opinion and attitude about hypotheses, other authors, or other methods. This often occurs in scientific articles, for example in references or citations. In the following example, the attitude is expressed by the author himself:

197:      \begin{itemize}

198:           \item[{$\rightarrow$}] {\sf Would a 100 mm scanning resolution be sufficient to produce an accurate model for paleontological study, or is a 50 mm scanning resolution a requirement?}

199:      \end{itemize}

200:      \item The person is a third person: this happens if the author speaks about other persons and presents their attitudes. In this case, these persons are referenced explicitly by name or work. This is a typical form of discussion in scientific articles.

201:      \begin{itemize}

202:           \item[{$\rightarrow$}] {\sf Lowe pointed out their resemblance to modern forms but later had doubts.}

203:      \end{itemize}

204: \end{itemize}

205:

206: To estimate the attitude of an author, we only consider epistemic sentences. We mark the importance of the modal part by a predicate $M$ and $\neg M$, respectively, and the propositional part by a predicate $H$, if the propositional part contains arguments pro $M$, otherwise $\neg H$. All epistemic sentences can be described with

207:

208: \begin{center}

209:    $M(H)$ or $M(\neg H)$ or $\neg M(H)$ or $\neg M(\neg H)$

210: \end{center}

211:

212: However, this step can only be partially automated, as it is quite hard to decide whether the modal part is $M$ or $\neg M$: to do so, we must determine the lexical information about a modal verb within its lexical environment. Modifiers like {\sf less} or {\sf more} and negations like {\sf not} or {\sf none} must be taken into account, as they modify the meaning of the modal verbs. Their scope is important; an exact analysis requires the definition of complex grammars. Secondly, we must decide whether the propositional part is $H$ or $\neg H$, which calls for an analysis of the propositional content. This could be done with thesauri like WordNet, as these describe relationships between words, for example synonymy. For example, WordNet allows several ways of calculating the similarity between words, depending on the distance between them in the thesaurus: the shorter the distance, the more similar the words.
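
To make the distance-based notion of similarity concrete, the following sketch computes a shortest-path distance on a toy thesaurus graph and turns it into a score. The graph, the function names, and the choice of $1/(1+d)$ as the score are assumptions for illustration; the measure actually used with WordNet may differ.

```python
from collections import deque

# Hypothetical toy thesaurus: each word is linked to its neighbouring concepts.
TOY_THESAURUS = {
    "belief": {"cognition"},
    "doubt": {"cognition"},
    "cognition": {"belief", "doubt", "psychological_feature"},
    "psychological_feature": {"cognition"},
    "rock": {"natural_object"},
    "natural_object": {"rock"},
}


def path_distance(graph, source, target):
    """Number of edges on the shortest path between two words (breadth-first search)."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None                                   # the words are not connected


def similarity(graph, first, second):
    """Assumed score 1 / (1 + distance): the shorter the path, the higher the score."""
    distance = path_distance(graph, first, second)
    return None if distance is None else 1.0 / (1.0 + distance)
```

Such a score is defined between single words only, which corresponds to one of the limitations discussed in this section.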

213:

214: In this work, we have identified two problems: first, the similarity between two words does not necessarily correspond to the actual situation in the text, and second, the similarity can only be computed between pairs of words, not between phrases or sub-phrases. The architecture is therefore \textit{hybrid}: the last step, estimating the attitude, is done manually on the basis of the result produced so far. We finally obtain a text result that is composed of text and meta-information and consists of two parts. The first part is machine-readable, as the data structures consistently carry tags and structural information; it can therefore be processed further. The second part contains the epistemic sentences. A third and last step is concerned with the segmentation of epistemic sentences depending on the hypothesis of the text. For this, we may use a graph in which all referenced persons are classified into three classes: \textit{Pro} contains all members $P$, \textit{Contra} all members $C$, and \textit{Neutral} all members $N$. Each group can be empty, but not all at the same time, as the author must belong to at least one class. We then assign

215:

216: \begin{itemize}

217:    \item \textit{Pro} refers to sentences of $M(H)$ and $\neg M(\neg H)$

218:    \item \textit{Contra} to sentences of $M(\neg H)$ and $\neg M(H)$

219:    \item \textit{Neutral} collects undecidable sentences, especially of those persons who decline a decision.

220: \end{itemize}
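
The assignment above amounts to comparing the polarity of the modal part with that of the propositional part. A minimal sketch, assuming both polarities are already known as Boolean values (or \textit{None} when undecidable) and using a hypothetical function name:

```python
def stance(modal_positive, proposition_positive):
    """Map the polarity of the modal part (M) and the propositional part (H) to a class.

    True stands for M or H, False for their negations, None for undecidable.
    """
    if modal_positive is None or proposition_positive is None:
        return "Neutral"
    # M(H) and not-M(not-H) agree in polarity -> Pro; the mixed cases -> Contra.
    return "Pro" if modal_positive == proposition_positive else "Contra"
```

In the hybrid setting described above, the two polarity values would be supplied manually.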

221:

222: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

223: \section{Example}

224: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

225: The following steps show an example using the following scientific text:

226:

227: \begin{itemize}

228:      \item[$\rightarrow$] {\sf ``The individual grains in them could not have accumulated mechanically because the slope of the cone is too great,'' says Stanley Awramik, a stromatolite expert at the University of California, Santa Barbara, who was not involved in the research.}

229: \end{itemize}

230:

231: Generally, figures, formulas, and charts are manually pre-processed and substituted by tag placeholders like \textit{FIG} or \textit{MATH}. The part-of-speech tagger then marks the text in two subsequent passes. In the first pass, all words are matched against the \textit{Brown Corpus}. Domain-specific terms are often unknown to the corpus and are therefore labelled \textit{None}. A second pass then takes the morphological structure of these words into account, for example assigning words with the suffix \textit{tion} to the category \textit{noun}:

232:

233: \begin{itemize}

234:      \item[$\rightarrow$] {\sf [(The, ART), (individual, ADJ), (grains, NNS), (in, IN), (them, PPO), (could, MV), (not, *), (have, HAVE), (accumulated, VPA), (mechanically, RB),...] }

235: \end{itemize}
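
The two tagging passes can be sketched as follows. The small lexicon is a hypothetical stand-in for the lookup in the \textit{Brown Corpus}, the tag \textit{NN} for nouns and the function name are assumptions, and only the suffix rule for \textit{tion} comes from the text.

```python
# Hypothetical stand-in for the lookup in the Brown Corpus.
LEXICON = {
    "the": ["ART"],
    "individual": ["ADJ"],
    "grains": ["NNS"],
    "in": ["IN"],
    "could": ["MV"],
    "not": ["*"],
    "have": ["HAVE"],
}

# Suffix rules for the second pass; only "tion" -> noun is taken from the text.
SUFFIX_RULES = [("tion", "NN")]


def tag_words(words):
    """Two-pass tagging: lexicon lookup first, suffix heuristics for unknown words."""
    # Pass 1: look every word up; unknown words get None.
    first_pass = [(word, LEXICON.get(word.lower())) for word in words]

    # Pass 2: try to recover a category for unknown words from their suffix.
    tagged = []
    for word, categories in first_pass:
        if categories is None:
            for suffix, category in SUFFIX_RULES:
                if word.lower().endswith(suffix):
                    categories = [category]
                    break
        tagged.append((word, categories))
    return tagged
```

Words that match neither the lexicon nor a suffix rule keep the label \textit{None} in this sketch.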

236:

237: To recognise the names, we check whether the text contains a list of references: if so, all names are marked by a \textit{Person} tag. Words that begin with an uppercase letter are considered as well and are treated as candidates for first and last names, abbreviations, or other personal names. They are marked by an \textit{NP} tag. The first names are matched against the \textit{Names Corpus} mentioned above. As ambiguity may occur, such words are disambiguated manually. Applying the automaton described in Figure \ref{fig:graph}, we then get the following annotation:

238:

239: \begin{itemize}

240:      \item[$\rightarrow$] {\sf \dots $<$Person$>$ (Stanley, NP) (Awramik, NP) $<$/Person$>$, \dots}

241: \end{itemize}

242:

243: The decision as to which object a personal pronoun refers to is taken by considering the lexical categories \textit{PPS} and \textit{WPS}:

244:

245: \begin{itemize}

246:      \item[$\rightarrow$] {\sf \dots $<$Person$>$ (Stanley, NP) (Awramik, NP) $<$/Person$>$, \dots\\ \dots $<$Person Name= Awramik$>$ (who, WPS) $<$/Person Name= Awramik$>$}

247: \end{itemize}

248:

249: The final classification then leads us to

250:

251: \begin{itemize}

252:      \item[$\rightarrow$] {\sf $<$EPISTEMIC$>$\\ \dots (could, MV), (not, *), (have, HAVE), (accumulated, VPA), \dots\\ $<$/EPISTEMIC$>$ }

253: \end{itemize}

254:

255: where the tag \textit{EPISTEMIC}, \textit{DEONTIC}, or \textit{NON-MODAL} represents the modal state. As modelled in Figure \ref{fig:graph2}, the phrase {\sf could not $\rightarrow$ have $\rightarrow$ accumulated} is ambiguous and leads to \textit{negMV\_HAVE\_VPA}. The sentence is therefore marked as epistemic.

256:

257: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

258: \section{Classification Results}

259: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

260: We have used scientific articles from the fields of palaeontology and biology as a first test set (called {\sf SCA} in the following) and contributions to the scientific newspaper (called {\sf SCI} in the following) as a second test set. All test documents of {\sf SCA} share a common frame like \textit{Author X talks about his work Y}; the documents of {\sf SCI} share a frame like \textit{Author X talks about the opinions of M scientists with respect to hypothesis Y}. For {\sf SCA}, the texts are similar in length and style, and epistemic sentences outnumber deontic and non-modal sentences (see Figure \ref{fig:result1}).

261:

262: \begin{figure}[htbp]

263:    \centering

264:    \includegraphics[width=12cm]{I4.jpg}

265:    \caption{Percentage distribution of epistemic, deontic, and non-modal sentences, where the left chart corresponds to {\sf SCA}, the right chart to {\sf SCI}.}

266:    \label{fig:result1}

267: \end{figure}

268:

269: \begin{figure}[htbp]

270:    \centering

271:    \includegraphics[width=12cm]{I5.jpg}

272:    \caption{Classification result in percent for selected sentences of {\sf SCA} and {\sf SCI}. The share of correctly classified sentences is higher for {\sf SCI} (87\%) than for {\sf SCA} (78.6\%). }

273:    \label{fig:result2}

274: \end{figure}

275:

276: In total, 312 sentences have been analysed, of which 55.4\% are from {\sf SCA} and 44.6\% from {\sf SCI}. As presented in Figure \ref{fig:result2}, the share of correctly classified sentences is higher for {\sf SCI} (87\%) than for {\sf SCA} (78.6\%). Among the wrongly classified sentences, the modal word {\sf will} occurs most frequently. The following list shows some sentences marked as epistemic, together with whether the classification is correct or wrong:

277:

278: \begin{itemize}

279:     \item[$\rightarrow$]{\sf EPISTEMIC This evidence of an ecological shift preceding phenotypic change suggests that this part of the sequence {\bf may} record rapid evolution driven by shifts in trophic ecology and adaptation to benthic niches. (correct)}

280:     \item[$\rightarrow$]{\sf EPISTEMIC If this {\bf hypothesis} is correct however the low number of specimens displaying intermediate phenotypes is puzzling and the scenario of replacement of one lineage by another cannot be ruled out. (correct)}

281:     \item[$\rightarrow$]{\sf EPISTEMIC Yet direct evidence that feeding controls evolution over extended time scales available only from the fossil record is difficult to obtain because it is rarely {\bf possible} to directly analyze dietary change in long-dead animals. (wrong)}

282:     \item[$\rightarrow$]{\sf EPISTEMIC First {\bf perhaps} the best-known work on specialisation in fishes concerns stickleback in postglacial coastal lakes in Canada where planktivores and benthic feeders coexist as two reproductively isolated and phenotypical distinct tropic. (wrong)}

283:     \item[$\rightarrow$]{\sf EPISTEMIC Laboratory feeding experiments and analyses of wild stickleback populations {\bf show} that micro-wear exhibits a progressive shift from planktivores to benthic feeders. (wrong)}

284: \end{itemize}

285:

286: The main reason for a wrong classification is that the whole sentence is marked as epistemic although the modal verb influences only a part of it. In particular, compound sentences like

287:

288: \begin{itemize}

289:     \item[$\rightarrow$]{\sf EPISTEMIC-DEONTIC  This uncertainty {\bf may} relate to the fact that Buddenbrockia genes

290: have undergone rapid sequence evolution, which {\bf can} either cause artifactual groupings or reduce the support for the correct grouping. }

291: \end{itemize}

292:

293: have been classified twice, i.e., as both epistemic and deontic. This is wrong because only the first part ({\sf may}) is epistemic, whereas the second part ({\sf can}) is deontic.
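
As a rough cross-check of the figures reported in this section, the percentages can be converted into approximate sentence counts and a pooled rate. This assumes that the shares of 55.4\% and 44.6\% refer to all 312 analysed sentences and that the rates of 78.6\% and 87\% refer to the complete {\sf SCA} and {\sf SCI} sets, respectively; the symbols $n_{SCA}$, $n_{SCI}$, and $p_{pooled}$ are introduced only for this calculation:

\[
n_{SCA} \approx 0.554 \cdot 312 \approx 173, \qquad n_{SCI} \approx 0.446 \cdot 312 \approx 139
\]

\[
p_{pooled} \approx 0.554 \cdot 78.6\% + 0.446 \cdot 87\% \approx 82.3\%
\]

The pooled value is derived from the reported percentages under the stated assumption and is not a separately measured result.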

294:

295: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

296: \section{Conclusions}

297: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

298: The calculation process is characterised and influenced by a multitude of external contributions; therefore, one of the next steps will be a step-by-step automation and access to extended sources.

299:

300: Although the classification shows good results, a more detailed consideration of modal verbs may be required, as some of them influence propositional sentences negatively and others positively. Finally, the lexical environment must be considered if we want to automate the decision whether the modal part is $M$ or $\neg M$. If a modal verb is discovered in the sentence structure, we can assume that the meaning is either positive or negative; it can be modified if negations occur.

301:

302: We still intend to establish modality as one possible means of characterising the author's attitude. This may be complemented by other work of the group, i.e., the zoning of textual documents, the mapping of texts to self-organizing maps, and the fingerprinting of texts using statistical and linguistic variables.

303:

304: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

305: \section{Acknowledgement}

306: % - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

307:

308: This work has been performed within the research project \textit{TRIAS}, which is funded by the University of Luxembourg.

309:

310: {\small

311: \begin{thebibliography}{4}

312: \bibitem{BER06} S. Bergler: Conveying attitude with reported speech. In Computing Attitude and Affect in Text: Theories and Applications, pp. 11–22, 2006.

313: \bibitem{BLK07} S. Bird, E. Loper, and E. Klein. The Natural Language Toolkit. Version 9.0. 2007.

314: \bibitem{BPP96} A. L. Berger, S. Della Pietra, and V. J. Della Pietra: A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

315: \bibitem{BYT06} S. Bethard, H. Yu, A. Thornton, V. Hatzivassiloglou, and D. Jurafsky: Extracting opinion propositions and opinion holders using syntactic and lexical cues. In Computing Attitude and Affect in Text: Theories and Applications. pp. 125–141. Springer, 2006.

316: \bibitem{BRU08} C. Brucks, M. Hilker, C. Schommer, C. Wagner, and R. Weires: Semi-automated Content Zoning of Spam Emails. Lecture Notes on Business Information Processing (Springer).

317: \bibitem{DAN08} S. Danilova: Semi-automatische Bestimmung der Attit\"ude \"uber epistemische Modalit\"at. Diplomarbeit. JW Goethe-University, Frankfurt am Main. February 2008.

318: \bibitem{FEL98} C. Fellbaum. Wordnet: An Electronic Lexical Database. Bradford Books, 1998.

319: \bibitem{HOL88} J. Holmes: Doubt and certainty in ESL textbooks. Applied Linguistics, 9(1), pp. 20–44, 1988.

320: \bibitem{KEF06} J. Karlgren, G. Eriksson, and K. Franzen: Where attitudinal expressions get their attitude. In Computing Attitude and Affect in Text: Theories and Applications, pages 23–31, 2006.

321: \bibitem{KIP92a} B. Kipper: Eine Disambiguierungskomponente f\"ur Modalverben. In KONVENS, pages 258–267, 1992.

322: \bibitem{KIP92b} B. Kipper: MODALYS - a system for the semantic-pragmatic analysis of modal verbs. In AIMSA, pp. 171–180, 1992.

323: \bibitem{KIP95}B. Kipper. Ambiguit\"atsprobleme bei der Modalverbanalyse, 1995.

324: \bibitem{KRR91}M. Kantrowitz, B. Ross. Names corpus. Carnegie Mellon, 1991.

325: \bibitem{KUF67}H. Kucera, W. N. Francis, Brown University. 1967.

326: \bibitem{MAT06} Y. Y. Mathieu: A computational semantic lexicon of french verbs of emotion. In Computing Attitude and Affect in Text: Theories and Applications. pp. 109–124, 2006.

327: \bibitem{MIT99} R. Mitkov: Anaphora resolution: The state of the art, 1999.

328: \bibitem{MMC04} Y. Mizuta, T. Mullen, and N. Collier. Annotation of biomedical texts for zone analysis. Technical report, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda, Tokyo, Japan, 2004.

329: \bibitem{PAL01} F. R. Palmer: Mood and Modality. Cambridge University Press. 2001.

330: \bibitem{PZ06}L. Polanyi, A. Zaenen: Contextual valence shifters. In Computing Attitude and Affect in Text: Theories and Applications. pp. 1–10, 2006.

331: \bibitem{SCH08}C. Schommer, C. Uhde: Textual Fingerprinting with Texts from Parkin, Bassewitz, and Leander. CoRR abs/0802.2234: (2008).

332: \bibitem{SQW06} J. G. Shanahan, Y. Qu, and J. Wiebe: Computing Attitude and Affect in Text: Theory and Applications. Springer, 2006.

333: \bibitem{ST07} A. Siddharthan, S. Teufel: Whose idea was this, and why does it matter? Attributing scientific work to citations. In NAACL-HLT, 2007.

334: \bibitem{TEU06}S. Teufel: Argumentative zoning for improved citation indexing. In Computing Attitude and Affect in Text: Theories and Applications, pp. 159–169, 2006.

335: \bibitem{TM99} S. Teufel, M. Moens: Discourse level argumentation in scientific articles: human and automatic annotation. Towards Standards and Tools for Discourse Tagging. ACL 1999 Workshop, 1999.

336: \bibitem{WM06} R. Witte, J. M\"uller (eds.): Text Mining: Wissensgewinnung aus nat\"urlichsprachigen Dokumenten, Interner Bericht 2006-5. Universit\"at Karlsruhe, Fakult\"at f\"ur Informatik, Institut f\"ur Programmstrukturen und Datenorganisation (IPD), 2006. ISSN 1432-7864.

337: \bibitem{ZS02}G. Zhou and J. Su. Named entity recognition using an hmm-based chunk tagger, 2002.

338: \end{thebibliography}

339: }

340:

341:

342:

343: \end{document}