1: \documentclass[12pt,singlespacing]{article}
2: \usepackage{amssymb}
3: \usepackage{amsfonts}
4: \usepackage{graphicx}
5: \newcommand{\R}{\mathbb{R}}
6: \newcommand{\Z}{\mathbb{Z}}
7:
8: \begin{document}
9: \title{A Note on Local Ultrametricity in Text}
10: \author{Fionn Murtagh \\
11: Department of Computer Science \\
12: Royal Holloway University of London \\
13: Egham, Surrey TW20 0EX, England \\
14: E-mail fmurtagh@acm.org}
15:
16: \maketitle
17:
18: \begin{abstract}
19: High dimensional, sparsely populated data spaces have been characterized in
20: terms of ultrametric topology. This implies that there are natural, not
21: necessarily unique, tree or hierarchy structures defined by the
22: ultrametric topology. In this note we study the extent of local
23: ultrametric topology in texts, with the aim of
24: finding unique ``fingerprints'' for a text or corpus, discriminating between
25: texts from different domains, and opening up the possibility of
26: exploiting hierarchical structures in the data.
27: We use coherent and meaningful collections of
28: over 1000 texts, comprising over 1.3 million words.
29: \end{abstract}
30: %network \sep complete graph \sep edge weighted \sep metric \sep
31: %Euclidean \sep ultrametric \sep chi squared metric
32: %\PACS{
33: %{89.75.Hc}{Networks and genealogical trees} \and
34: %{02.50.Sk}{Multivariate analysis} \and
35: %{89.75.Kd}{Patterns} \and
36: %{89.75.Fb}{Structures and organization in complex systems}
37: %} % end of PACS
38: %} % end of abstract
39:
40:
41:
42: \section{Introduction}
43:
44: Structures that are inherent to data of any type can be of importance, and
45: hierarchical structure is a prime example. In this work we take text
46: corpora and assess the extent of hierarchical structure among words
47: constituting the texts. By comprehensively taking context into account we
48: seek to study hierarchical structures in the domain semantics.
49:
50: The data studied in Rammal et al.\ (1986) and Murtagh (2004) is point pattern
51: data: observational features with their measurements on many coordinate
52: dimensions. Data may be instead presented as time-varying signals and
53: in a similar way, related to the findings of Rammal et al.\ (1986) and
54: Murtagh (2004),
55: we have investigated ultrametric-related
56: properties of time series or 1D signals in
57: Murtagh (2005a). In the latter time series work, we encoded the data in a
58: particular way. In this paper, we show how texts can also be
59: characterized in a similar manner.
60:
The triangular inequality holds for a metric space: $d(x,z) \leq
d(x,y) + d(y,z)$ for any triplet
of points $x,y,z$. In addition, the properties
of symmetry and positive definiteness are respected. The ``strong
triangular inequality'' or ultrametric inequality is: $d(x,z) \leq
\max \{ d(x,y), d(y,z) \}$ for any triplet $x,y,z$. An ultrametric
space satisfies a range of stringent properties. For example,
the triangle formed by any triplet is necessarily isosceles, with the two
large sides equal; or is equilateral. Any agglomerative hierarchical
procedure (cf.\ Benz\'ecri, 1978; Lerman, 1981; Murtagh, 1983, 1985) can
impose hierarchical structure. Our aim in this work is instead to assess
the inherent extent of hierarchical structure.
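
As a small worked illustration (the distance values are chosen by us
purely for this example), take three points with $d(x,y) = 3$ and
$d(y,z) = d(x,z) = 5$. Each side is bounded by the larger of the other two,
$$ 5 \leq \max \{ 3, 5 \}, \quad 5 \leq \max \{ 5, 3 \}, \quad
3 \leq \max \{ 5, 5 \}, $$
so the triplet is ultrametric: it forms an isosceles triangle with small
base. By contrast, a triangle with sides $3, 4, 5$ satisfies the
triangular inequality, since $5 \leq 3 + 4$, but violates the ultrametric
inequality, since $5 > \max \{ 3, 4 \}$.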
73:
74: We take a large
75: number of coherent collections of meaningful texts. Through shared words,
76: we can define a similarity network between all texts in each of the
77: collections we chose. Aspects of the semantics of the given collection are
78: captured in this way. We investigate how ultrametric each of these
79: semantic networks is.
80:
81: %We select texts
82: %each containing roughly 500 to 1000 words (but as will be seen below,
83: %some texts had up to around 44,000 words).
Our selected texts in this study are in English and
do not contain accented characters (although these could easily be catered for).
The collections were: fairy tales by the Brothers Grimm; novels by the English
writer, Jane Austen; in order to have very technical language, aircraft
accident reports from the US National Transportation Safety Board; and in order
to seek linkages with biological and cognitive processes, a range of
dream reports from the online DreamBank repository.
91:
92: We find clear distinctions between the semantic networks (or text collections)
93: studied, in terms of their relative (albeit small) extent of ultrametricity.
94:
95: Our objectives in such assessment of inherent, local, hierarchical
96: structure include the following:
97:
98: \begin{enumerate}
\item Ontologies (see e.g.\ G\'omez-P\'erez et al., 2004) have become of
100: great interest to facilitate information resource discovery, and to
101: support querying and retrieval of information, in current areas of work
102: such as the semantic web. Automatic or semi-automatic
103: construction of ontologies is aided greatly by hierarchical relationships
104: between terms. The characterizing of texts in terms of local
105: hierarchical structure simultaneously provides justification for unambiguous
106: local hierarchies. (We return to this issue of ontology creation
107: in the Conclusion.)
108:
109: \item Structures defined on terms that are more general than grammars
110: may be of use in modelling and assessing consistency of textual data
111: (see Sasaki and P\"onninghaus, 2003); and perhaps in mapping some aspects of
112: semantics and flow of reason and logic in text.
113: %(for example, providing
114: %a quantitative expression of Freud's concepts of
115: % condensation and displacement).
116:
\item Limited extent of hierarchical structure may point to the
undesirability of a global tree or hierarchical clustering model for the
text or set of texts. However, for the same reason, a set of
local hierarchical clusterings, or a forest of (locally defined) trees, may be
more appropriate.
122:
123: We note that our work is quite different from Leo
124: Breiman's random forest methodology, where classification trees are
125: fitted multiply to a
126: data set. Our work, as opposed to this, is directed towards the finding of
127: ``shrubs'' or tree fragments in a data set.
128:
129: \item Latent ultrametric distances were estimated by Schweinberger and Snijders
130: (2003) in order to represent transitive structures among pairwise
131: relationships.
132:
133: \item Further motivation is provided by fingerprinting of authorship, and
134: document clustering (e.g.\ to facilitate retrieval).
135:
136: \end{enumerate}
137:
138: \section{Methodology}
139:
140: We employ correspondence analysis for metric embedding,
141: followed by determination of the extent of ultrametricity, in factor
142: space, based on the alpha coefficient of ultrametricity. Our motivation
143: for using precisely this Euclidean embedding is as follows. Our input
data is in the form of frequencies of occurrence. Now, an unweighted Euclidean
distance defined directly on vectors with such values is not appropriate, since
it is dominated by overly frequent words and unusually long texts.
146:
The $\chi^2$ distance
is an appropriate weighted Euclidean distance for use with such data
(Benz\'ecri, 1979; Murtagh, 2005b).
Consider texts $i$ and $i'$ crossed by words $j$. Let $k_{ij}$ be the number of
occurrences of word $j$ in text $i$, with row total $k_i = \sum_j k_{ij}$
and column total $k_j = \sum_i k_{ij}$. Then, omitting a constant,
the squared $\chi^2$ distance between texts $i$ and $i'$ is given by
$ \sum_j (1/k_j) ( k_{ij}/k_i - k_{i'j}/k_{i'} )^2$. The weighting term is
$1/k_j$. The weighted Euclidean distance is between the {\em profile}
of text $i$, viz.\ $k_{ij}/k_i$ for all $j$, and the analogous
{\em profile} of text $i'$.
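
The distance between profiles can be sketched in a few lines of Python.
This is a minimal illustration under our own assumptions, not the software
used for the experiments: the function name \texttt{chi2\_distance\_sq} and
the toy counts are hypothetical, every text is assumed to contain at least
one word, and the constant factor is omitted as in the expression above.

\begin{verbatim}
```python
def chi2_distance_sq(counts, i, i2):
    """Squared chi-squared distance between the profiles of rows i and i2.

    counts[i][j] is the number of occurrences of word j in text i
    (k_ij in the text). The constant factor is omitted, as in the text.
    """
    n_words = len(counts[0])
    k_i = sum(counts[i])        # row total of text i
    k_i2 = sum(counts[i2])      # row total of text i'
    # Column totals k_j, taken over all texts.
    k_j = [sum(row[j] for row in counts) for j in range(n_words)]
    total = 0.0
    for j in range(n_words):
        if k_j[j] == 0:
            continue            # word never occurs: no contribution
        diff = counts[i][j] / k_i - counts[i2][j] / k_i2
        total += diff * diff / k_j[j]
    return total


# Hypothetical toy table: 3 texts crossed by 3 words.
toy = [
    [2, 1, 1],
    [4, 2, 2],   # same profile as text 0, but twice as long
    [0, 3, 1],
]
d01 = chi2_distance_sq(toy, 0, 1)   # identical profiles
d02 = chi2_distance_sq(toy, 0, 2)   # different profiles
```
\end{verbatim}

Texts 0 and 1 differ in length but not in profile, so their distance is
zero; this is the sense in which the profile removes the effect of text
length.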
157:
158:
159: \subsection{Alpha Coefficient of Ultrametricity}
160:
161: The definition of ultrametricity introduced in Murtagh (2004) and justified
162: relative to alternatives was, in
163: summary, as follows. For all triplets of points, we consider the three
164: internal angles. We require that the smallest angle be less than or equal
165: to 60 degrees. Then we require that the two remaining angles be
approximately equal. Approximate equality is defined as a difference of
less than 2 degrees,
in order to cater for imprecise coordinate measurement (e.g., due to
floating point values) in an acceptable way. Satisfying these angular
constraints implies that the triplet of points defines an approximate
isosceles (with small base) or equilateral triangle. We define a
coefficient of ultrametricity of the point set as the proportion of all
triangles satisfying these requirements. The coefficient of ultrametricity
is 1 for perfectly ultrametric data, and it is 0 if no triangle satisfies the
isosceles or equilateral requirements. This coefficient is
referred to as alpha below in this article.
176:
177: As already noted, assessing ultrametricity through triangle properties
178: is based on the prior correspondence analysis, and this has the following
179: beneficial (and, in a sense, enabling) implications. The correspondence
180: analysis
181: factor space is Euclidean. A Euclidean space, as a particular Hilbert
182: space, is a complete, normed vector space endowed with a scalar product.
183: It is precisely the scalar product that allows us to define angles and
184: hence the triangle properties that we need.
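
The angular test can be sketched as follows, with angles obtained from
scalar products as described above. This is an illustrative sketch rather
than the code used for the results reported here. The function names are
ours, triplets with coincident points are assumed not to count as
ultrametric, and a slack of $10^{-9}$ degrees on the 60 degree bound is our
own addition to absorb rounding error.

\begin{verbatim}
```python
import math
from itertools import combinations


def angle_deg(p, q, r):
    """Angle at vertex p of triangle (p, q, r), in degrees."""
    u = [b - a for a, b in zip(p, q)]
    v = [b - a for a, b in zip(p, r)]
    dot = sum(a * b for a, b in zip(u, v))        # scalar product
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return None                               # coincident points
    cos_t = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_t))


def is_ultrametric_triangle(p, q, r, tol_deg=2.0):
    """Smallest angle <= 60 degrees, other two differ by < tol_deg."""
    angles = [angle_deg(p, q, r), angle_deg(q, r, p), angle_deg(r, p, q)]
    if any(a is None for a in angles):
        return False      # assumption: degenerate triplets do not count
    angles.sort()
    # 1e-9 slack (our addition) guards against floating point rounding.
    return angles[0] <= 60.0 + 1e-9 and (angles[2] - angles[1]) < tol_deg


def alpha_coefficient(points, tol_deg=2.0):
    """Proportion of all triplets of points passing the angular test."""
    triplets = list(combinations(points, 3))
    if not triplets:
        return 0.0
    hits = sum(is_ultrametric_triangle(p, q, r, tol_deg)
               for p, q, r in triplets)
    return hits / len(triplets)
```
\end{verbatim}

For example, the four vertices of a regular tetrahedron give only
equilateral triangles, hence alpha equal to 1, whereas the single triangle
with sides 3, 4, 5 gives alpha equal to 0.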
185:
186: \subsection{Correspondence Analysis:
187: Mapping $\chi^2$ into Euclidean Distances}
188:
189: As a dimensionality reduction technique
190: correspondence analysis is particularly appropriate for handling
191: frequency data. As an example of the latter, frequencies of word
192: occurrence in text will be studied below.
193:
194: The given contingency table (or numbers of occurrence)
195: data is denoted $k_{IJ} =
196: \{ k_{IJ}(i,j) = k(i, j) ; i \in I, j \in J \}$. $I$ is the set of text
197: indexes, and $J$ is the set of word indexes. We have
198: $k(i) = \sum_{j \in J} k(i, j)$. Analogously $k(j)$ is defined,
199: and $k = \sum_{i \in I, j \in J} k(i,j)$. Next, $f_{IJ} = \{ f_{ij}
200: = k(i,j)/k ; i \in I, j \in J\} \subset \R_{I \times J}$,
similarly $f_I$ is defined as $\{f_i = k(i)/k ; i \in I\}
\subset \R_I$, and $f_J$ analogously. What we have described here is
the conversion of numbers of occurrences into relative frequencies.
204:
The conditional distribution of $f_J$ knowing $i \in I$, also termed
the $i$th profile with coordinates indexed by the elements of $J$, is
(writing $k_{ij}$ for $k(i,j)$ and $k_i$ for $k(i)$):

$$ f^i_J = \{ f^i_j = f_{ij}/f_i = (k_{ij}/k)/(k_i/k) ; f_i \neq 0 ;
j \in J \}$$ and likewise for $f^j_I$.
210:
211: Note that the input data values here are always non-negative reals. The
212: output factor projections (and contributions to the principal directions
213: of inertia) will be reals.
214:
215: \subsection{Input: Cloud of Points Endowed with the Chi Squared Metric}
216:
217:
218: The cloud of points consists of the couple: profile coordinate and mass.
219: We have $ N_J(I) = \{ ( f^i_J, f_i ) ; i \in I \} \subset \R_J $, and
220: again similarly for $N_I(J)$.
221:
The moment of inertia is as follows:
$$M^2(N_J(I)) = M^2(N_I(J)) = \| f_{IJ} - f_I f_J \|^2_{f_I f_J} $$
\begin{equation}
= \sum_{i \in I, j \in J} (f_{ij} - f_i f_j)^2 / (f_i f_j)
\end{equation}
The term $\| f_{IJ} - f_I f_J \|^2_{f_I f_J}$ is the $\chi^2$ metric
between the probability distribution $f_{IJ}$ and the product of marginal
distributions $f_I f_J$, with the metric centered on the product
$f_I f_J$. Decomposing the moment of inertia of the cloud $N_J(I)$ -- or
of $N_I(J)$ since both analyses are inherently related -- furnishes the
principal axes of inertia, defined from a singular value decomposition.
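
The two readings of the moment of inertia, as a sum over the cells of the
table and as the mass-weighted squared $\chi^2$ distances of the profiles
to the mean profile $f_J$, can be checked numerically. The following is a
small sketch with hypothetical function names and a hypothetical toy
table; it assumes that all row and column totals are non-zero.

\begin{verbatim}
```python
def relative_frequencies(counts):
    """Return (f_ij, f_i, f_j) for a table of counts k(i, j)."""
    k = float(sum(sum(row) for row in counts))
    f = [[v / k for v in row] for row in counts]
    f_i = [sum(row) for row in f]
    f_j = [sum(row[j] for row in f) for j in range(len(f[0]))]
    return f, f_i, f_j


def inertia_from_table(counts):
    """Sum over cells of (f_ij - f_i f_j)^2 / (f_i f_j)."""
    f, f_i, f_j = relative_frequencies(counts)
    return sum(
        (f[i][j] - f_i[i] * f_j[j]) ** 2 / (f_i[i] * f_j[j])
        for i in range(len(f_i))
        for j in range(len(f_j))
    )


def inertia_from_cloud(counts):
    """Mass-weighted squared chi-squared distances of the row
    profiles f_ij / f_i to the mean profile f_j."""
    f, f_i, f_j = relative_frequencies(counts)
    total = 0.0
    for i in range(len(f_i)):
        dist_sq = sum(
            (f[i][j] / f_i[i] - f_j[j]) ** 2 / f_j[j]
            for j in range(len(f_j))
        )
        total += f_i[i] * dist_sq
    return total


# Hypothetical toy table: 3 texts crossed by 3 words.
toy = [[2, 1, 1], [4, 2, 2], [0, 3, 1]]
gap = abs(inertia_from_table(toy) - inertia_from_cloud(toy))
```
\end{verbatim}

The two values agree up to rounding error, and both vanish for a table
whose rows all have the same profile.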
233:
234: \subsection{Output: Cloud of Points Endowed with the Euclidean
235: Metric in Factor Space}
236:
From the initial frequencies data matrix, a set of probability data,
$f_{ij}$, is defined by dividing each value by the grand total of all
elements in
the matrix. In correspondence analysis,
each row (or column) point is considered to have an
associated weight. The weight of the $i$th row point is given
by $f_i = \sum_j f_{ij}$, and the weight of the $j$th column point
is given by $f_j = \sum_i f_{ij}$. We consider the row points to have
coordinates ${f_{ij} / f_i}$, thus allowing points of the same
{\em profile} to be identical (i.e., superimposed). The following weighted
Euclidean distance, the $\chi^2$ distance, is then used between row
points:
$$ d^2(i,i') = \sum_j {1 \over f_j} \left( {f_{ij} \over f_i} -
{f_{i'j} \over f_{i'}} \right)^2 $$
and an analogous distance is used between column points.
252:
253: The mean row point is given by the weighted average of all row
254: points:
255: $$ \sum_i f_i {f_{ij} \over f_i} = f_j$$
256: for $j = 1, 2, \dots, m$. Similarly the mean column profile has
257: $i$th coordinate $f_i$.
258:
259: We
260: first consider the projections of the $n$
261: profiles in $\R^m$ onto an axis, ${\bf u}$. This is given by
$$ \sum_j {f_{ij} \over f_i} {1 \over f_j} u_j$$ for all $i$ (note
263: the use of the scalar product here). For details on determining the
new axis, ${\bf u}$, see Murtagh (2005b).
265:
The projections of points onto
axis ${\bf u}$ were with respect to the ${1 / f_j}$ weighted Euclidean
metric. This makes interpreting projections very difficult from a
human/visual point of view, and so it is more natural to present results
in such a way that projections can be simply appreciated. Therefore
{\em factors} are defined, such that the projections of row vectors
onto factor ${\bf \phi}$ associated with axis ${\bf u}$ are given by
$$\sum_j {f_{ij} \over f_i} \phi_j$$ for all $i$. Taking $$\phi_j =
{1 \over f_j} u_j$$ ensures this and projections onto ${\bf \phi}$
are with respect to the ordinary (unweighted) Euclidean distance.
277:
An analogous set of relationships holds in $\R^n$ where the best
279: fitting axis, ${\bf v}$, is searched for. A simple mathematical
280: relationship holds between ${\bf u}$ and ${\bf v}$, and between
281: ${\bf \phi}$ and ${\bf \psi}$ (the latter being the factor associated
282: with axis or eigenvector ${\bf v}$):
283: $$ \sqrt{\lambda} \psi_i = \sum_j {f_{ij} \over f_i} \phi_j $$
284: $$ \sqrt{\lambda} \phi_j = \sum_i {f_{ij} \over f_j} \psi_i $$
285: These are termed {\em transition formulas}.
Axes ${\bf u}$
and ${\bf v}$, and factors ${\bf \phi}$ and ${\bf \psi}$, are
289: associated with eigenvalue $\lambda$ and best fitting higher-dimensional
290: subspaces are associated with decreasing values of $\lambda$ (see Murtagh,
291: 2005b, for further details).
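
The Euclidean embedding can be summarized by two standard identities of
correspondence analysis. We write $F_\alpha(i)$ for the projection of the
profile of row $i$ on the factor of rank $\alpha$, scaled so that
$\sum_i f_i F_\alpha(i)^2 = \lambda_\alpha$; this symbol and this scaling
are assumptions introduced only for the present illustration. With sums
over $\alpha$ running over all factors with non-zero eigenvalue, the
$\chi^2$ distance between profiles is the ordinary Euclidean distance
between factor projections,
$$ \sum_j {1 \over f_j} \left( {f_{ij} \over f_i} -
{f_{i'j} \over f_{i'}} \right)^2 =
\sum_\alpha \left( F_\alpha(i) - F_\alpha(i') \right)^2 , $$
and the moment of inertia of the cloud decomposes over the factors,
$$ M^2(N_J(I)) = \sum_\alpha \lambda_\alpha . $$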
292:
293: \subsection{Conclusions on Correspondence Analysis and Introduction to the
294: Numerical Experiments to Follow}
295:
296: Some important points for the analyses to follow are -- firstly in relation
297: to correspondence analysis:
298:
299: \begin{enumerate}
300:
301: \item From numbers of occurrence data we always get (by design)
302: a Euclidean embedding
303: using correspondence analysis. The factors are embedded in a Euclidean
304: metric.
305:
\item As seen in the previous subsection, the
number of factors, i.e.\ the number of non-zero eigenvalues, is
given by one less than the minimum of the number of observations studied
(indexed by set $I$) and the number of variables or attributes used
(indexed by set $J$).
The number of dimensions in factor space may be less than this full rank
if there are linear dependencies present.
313:
314: \item In the experiments to follow in the next section, we always
315: have $n < m$, where $n$ is number of texts or text segments, and $m$ is
316: number of words. This implies that inherent (full rank)
317: dimensionality of the projected Euclidean
318: factor space is $n - 1$.
319:
320: \item To assess stability of results,
321: in our studies we often take as input a word set given by the
322: (for example, 1000) most highly ranked (in terms of frequency of
323: occurrence) words. Thus we take $m = 1000, 2000,$ and the full
324: attribute set (say,
325: $m_{\rm tot}$) in each case, where the attributes are ordered in terms of
326: decreasing marginal frequency. In other words, we take the 1000 most
327: frequent words to characterize our texts; then the 2000 most frequent words;
328: and finally all words. Since $n < m$ it is not surprising that
329: very similar results are found irrespective of the value of $m$, since
330: the inherent, projected, Euclidean, factor space dimensionality is the
331: same in each case, viz., $n - 1$. But we additionally find confirmation
332: of stability of our results.
333: We will show quite convincingly that our results are
334: characteristic of the texts used, in each case, and are in no way ``one off''
335: or arbitrary.
336:
337: %\item Purely as a baseline we will look at direct Euclidean pairwise
338: %distances defined on $\{ k_{ij} | i = 1, 2, \dots , n; j = 1, 2,
339: %\dots , m \}$.
340:
341: \end{enumerate}
342:
343: Some important points related to our numerical assessments below, in
344: relation to data used, determining of ultrametricity coefficient,
345: and software used, are as follows.
346:
347: \begin{enumerate}
348:
349: \item
In line with one tradition of textual analysis associated with Benz\'ecri's
correspondence analysis (see Murtagh, 2005b) we take the unique full words and
rank them in decreasing order of frequency of occurrence. Thus for the
Brothers Grimm work,
below, we find: ``the'', 19,696 occurrences; ``and'',
14,582 occurrences; ``to'', 7380 occurrences; ``he'', 5951 occurrences;
``was'', 4122 occurrences; and so on. The last three, with one occurrence each,
are ``yolk'', ``zeal'' and ``zest''.
357:
358: \item The alpha ultrametricity coefficient is based on triangles. Now,
359: with $n$ graph nodes we have $O(n^3)$ possible triangles which is
360: computationally prohibitive, so we instead sample. The means and
361: standard deviations below are based on 2000 random triangle vertex
362: realizations, repeated 20 times; hence, in each case, in total 40,000
363: random selections of triangles.
364:
365: \item All text collections reported on below (section \ref{sectreal})
366: are publicly accessible (and web addresses are cited). All texts were
obtained by us in plain (ASCII) text format.
368:
369: The preparation of the input data was carried out with programs of
370: ours, written in C, and available at www.correspondances.info (accompanying
371: Murtagh, 2005b). The correspondence analysis software was written in
372: the public R statistical software environment
373: (www.r-project.org, again see Murtagh, 2005b) and is available at this same
374: web address. Some
375: simple statistical calculations were carried out by us also
376: in the R environment.
377:
378: \end{enumerate}
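
The sampling scheme for alpha described above can be sketched as follows.
This is an illustration only and not the software referred to above. We
assume that the three vertices of a triangle are drawn without replacement,
that the reported mean and standard deviation are taken over the per-run
proportions, and that angles may equivalently be obtained from side lengths
by the law of cosines. The function names, the fixed seed and the
$10^{-9}$ degree slack are our own choices.

\begin{verbatim}
```python
import math
import random
import statistics


def triangle_angles_deg(p, q, r):
    """Sorted internal angles (degrees) via the law of cosines."""
    a, b, c = math.dist(q, r), math.dist(p, r), math.dist(p, q)
    if min(a, b, c) == 0.0:
        return None                      # coincident points

    def angle(opp, s1, s2):
        cos_t = (s1 * s1 + s2 * s2 - opp * opp) / (2.0 * s1 * s2)
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

    return sorted([angle(a, b, c), angle(b, a, c), angle(c, a, b)])


def sampled_alpha(points, n_triangles=2000, n_runs=20,
                  tol_deg=2.0, seed=0):
    """Mean and standard deviation of alpha over n_runs runs,
    each based on n_triangles randomly drawn triangles."""
    rng = random.Random(seed)
    run_values = []
    for _ in range(n_runs):
        hits = 0
        for _ in range(n_triangles):
            p, q, r = rng.sample(points, 3)   # three distinct vertices
            angles = triangle_angles_deg(p, q, r)
            if (angles is not None
                    and angles[0] <= 60.0 + 1e-9
                    and angles[2] - angles[1] < tol_deg):
                hits += 1
        run_values.append(hits / n_triangles)
    return statistics.mean(run_values), statistics.stdev(run_values)
```
\end{verbatim}

With the default arguments this draws $20 \times 2000 = 40{,}000$
triangles. At least two runs are needed for the standard deviation to be
defined.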
379:
380:
381:
382:
383:
384:
385:
386:
387: \section{Real Case Studies: Text Interrelationships Through Shared Words}
388: \label{sectreal}
389:
390: We use in all over 900 short texts, given by short stories, or chapters,
391: or short reports. All are in English. Unique words are determined
392: through delimitation by white space and by punctuation characters
393: with no distinction of upper and lower case. In
394: all, over one million words are used in our studies of these texts.
395: The study of word/text occurrences in a straightforward way, with no
396: truncation nor stemming nor other preprocessing, typifies a great deal
397: of the work of Benz\'ecri, and his journal {\em Les Cahiers de
398: l'Analyse des Donn\'ees}, published by the French publisher Dunod over
399: three decades up to 1996. This work of Benz\'ecri is
400: discussed in detail in Murtagh (2005b).
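
The word delimitation and cross-tabulation just described can be sketched
as follows. This is a simplified illustration and not the C programs used
in this work. We assume that a word is a maximal run of ASCII letters, so
that digits and apostrophes also act as delimiters; the exact delimiter set
is not specified here, and the function names and sample sentences are
hypothetical.

\begin{verbatim}
```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split on white space and punctuation.

    Assumption: a word is a maximal run of ASCII letters.
    """
    return re.findall(r"[a-z]+", text.lower())


def cross_tabulate(texts, m=None):
    """Return (words, table); table[i][j] counts word j in text i.

    Words are ordered by decreasing total frequency. If m is given,
    only the m most frequent words are kept.
    """
    per_text = [Counter(tokenize(t)) for t in texts]
    totals = Counter()
    for c in per_text:
        totals.update(c)
    # Ties are broken alphabetically to make the order reproducible.
    words = sorted(totals, key=lambda w: (-totals[w], w))
    if m is not None:
        words = words[:m]
    table = [[c[w] for w in words] for c in per_text]
    return words, table


# Hypothetical sample texts.
sample = ["The cat and the dog.", "The Dog barked; the cat ran!"]
words, table = cross_tabulate(sample)
top_words, top_table = cross_tabulate(sample, m=3)
```
\end{verbatim}

Restricting to the \texttt{m} most frequent words corresponds to the
reduced word sets (1000 and 2000 words) used in the experiments below.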
401:
We carried out some assessments of Porter stemming (Porter, 1980)
as an alternative
to use of whitespace- or punctuation-delimited words, and found little
difference in the results.
406:
407: \subsection{Brothers Grimm}
408:
409: As a homogeneous collection of texts we take 209 fairy tales of the Brothers
410: Grimm (Ockerbloom, 2003),
411: containing 7443 unique (in total 280,629) space- or
412: punctuation-delimited words. Story lengths were between 650 and 44,400 words.
413:
414: To define a semantic context of increasing
415: resolution we took the most frequent 1000 words, followed by the most frequent
416: 2000 words, and finally all 7443 words.
417: %(We tested extensively the case of
418: %just the 100 most frequent words also. But in view of the texts versus
419: %words dimensionality implications, viz.\ $ n > m$ here, and the slightly
420: %more tricky interpretation, we deliberately do not report on these
421: %results here.)
422: We constructed a cross-tabulation of numbers of occurrences of
423: each word in each one of the 209 fairy tales. This led therefore to a
424: set of frequency tables of dimensions: $209 \times 1000,
425: 209 \times 2000$ and $209 \times 7443$. Through use of the $\chi^2$
426: distance between fairy tale texts, a correspondence analysis was carried out.
427: From the three frequency tables, the contingency table crossing all pairs
428: of fairy tales could be examined; but it was far more convenient for us
429: to proceed straight to the factor space, of dimension $209 - 1 = 208$. The
430: factor space is Euclidean, so the correspondence analysis can be said to be
431: a mapping from the $\chi^2$ metric into a Euclidean metric space.
432:
433:
434: %\begin{table}
435: %\begin{center}
436: %\begin{tabular}{|crrrr|} \hline
437: %\multicolumn{5}{c}{209 Brothers Grimm fairy tales} \\ \hline
438: %Texts & Dim. & Original & Dim. & Factors \\ \hline
439: %%209 & 100 & 0.0273 & 99 & 0.1002 \\
440: %209 & 1000 & 0.0324 & 208 & 0.1189 \\
441: %209 & 2000 & 0.0334 & 208 & 0.1083 \\
442: %209 & 7443 & 0.0324 & 208 & 0.1154 \\ \hline
443: %\end{tabular}
444: %\end{center}
445: %\caption{Coefficient of ultrametricity.
446: %Original: frequencies of occurrence matrix defined on the 209 texts
447: %crossed by: % 100,
448: %1000, 2000, and all = 7443, words. Euclidean distance
449: %defined on each pair of texts. Factors: factor projections resulting
450: %from correspondence analysis, with Euclidean distance used between each
451: %pair of texts.}
452: %\label{tabcorr}
453: %\end{table}
454:
455: \begin{table}
456: \caption{Coefficient of ultrametricity, alpha.
457: Input data: frequencies of occurrence matrices defined on the 209 texts
458: crossed by: %100,
459: 1000, 2000, and all = 7443, words.
460: Alpha (ultrametricity coefficient) based
461: on factors: i.e., factor projections resulting
462: from correspondence analysis, with Euclidean distance used between each
463: pair of texts in factor space, of dimensionality 208.
464: %The mean and standard deviations are each based on 20 realizations of
465: %2000 triangles.
466: }
467: \label{tabcorrb}
468: \begin{center}
469: \setlength{\tabcolsep}{1mm}
470: \begin{tabular}{|crrrr|} \hline
471: & \multicolumn{3}{c}{209 Brothers Grimm fairy tales} & \\ \hline
Texts & Orig.\ dim. & Factor dim. & Alpha, mean & Alpha, sdev. \\ \hline
473: %209 & 100 & 99 & 0.0939 & 0.0063 \\
474: 209 & 1000 & 208 & 0.1236 & 0.0054 \\
475: 209 & 2000 & 208 & 0.1123 & 0.0065 \\
476: 209 & 7443 & 208 & 0.1147 & 0.0066 \\ \hline
477: \end{tabular}
478: \end{center}
479: \end{table}
480:
481: %The Euclidean distance was defined on the set of 209 fairy tales, based
482: %on the four different semantic contexts (i.e., based on characterization
483: %by %100,
484: %1000, 2000 and 7443 words).
485:
486: %Secondly the chi squared distance or weighted Euclidean distance between
487: %profiles was used as an appropriate way to assess relative similarity. If
488: %$k_{ij}$ is the number of occurrences of word $k$ in text $i$, then the
489: %chi squared distance between texts $i$ and $i'$ is $d_\chi(i,i') =
490: %\sum_j k/k_j (k_{ij}/k_i - k_{i'j}/k_{i'}$ where for text $i$, $k_{ij}/k_i$
491: %for all words $j$ defines the text's profile; $k_i = \sum_j k_{ij}$;
492: %similarly word $j$'s weight is $k_j = \sum_i k_{ij}$; and finally the
493: %overall total of words in all texts is $k = \sum_i \sum_j k_{ij}$. This
494: %distance is well established for discrete data such as frequencies of
495: %o%ccurence. As can be seen, weights ($k_i$, $k_j$) are used to
496: %c%ounter-balance overly frequent (or rare) words or unusually long (or
497: %short) texts. This chi squared metric is mapped into a Euclidean space
498: %by determining principal axes of orientation, which correspond to
499: %axes of intertia, in correspondence analysis (Murtagh, 2005). The factor
500: %projections will then define a Euclidean coordinate system. It is this
501: %which we use, rather than the original chi squared metric, in our
502: %experiments.
503:
504: %For the varying semantic resolution levels (viz., %100-,
505: %1000-, 2000-, and 7443-dimensional) the inherent resolution level is not
506:
507: Table \ref{tabcorrb} (columns 4, 5)
508: shows remarkable stability of the alpha ultrametricity
509: coefficient results, and such stability will be seen in all further results
510: to be presented below. The ultrametricity is not high for the Grimm
511: Brothers' data: we recall that an alpha value of 0 means no triangle is
512: isosceles/equilateral. We see that there is very little ultrametric
513: (hence hierarchical) structure in the Brothers Grimm data (based on our
514: particular definition of ultrametricity/hierarchy).
515:
516:
517: \subsection{Jane Austen}
518:
519: To further study stories of a general sort, we use some works of the
520: English novelist, Jane Austen.
521:
522: \begin{enumerate}
523: \item {\em Sense and Sensibility} (Austen, 1811),
524: 50 chapters = files, chapter lengths from 1028 to 5632 words.
525: \item {\em Pride and Prejudice} (Austen, 1813),
526: 61 chapters each containing between 683 and 5227 words.
527: \item {\em Persuasion} (Austen, 1817), 24 chapters,
528: chapter lengths 1579 to 7007 words.
529: \item {\em Sense and Sensibility} split into 131 separate
530: texts, each containing around 1000 words
531: (i.e., each chapter was split into files containing 5000 or fewer characters).
532: We did this to check on any influence by the size (total number of words) of
533: the text unit used (and we found no such influence).
534: \end{enumerate}
535:
536: In all there were 266 texts containing a total of 9723 unique words. We
537: looked at the 1000, 2000 and all = 9723 most frequent words to
538: characterize the texts by frequency of occurrence.
539:
540:
541: %\begin{table}
542: %\begin{center}
543: %\begin{tabular}{|crrrr|} \hline
544: %\multicolumn{5}{c}{266 J.\ Austen chapters or partial chapters} \\ \hline
545: %Texts & Dim. & Original & Dim. & Factors \\ \hline
546: %%266 & 100 & 0.0409 & 99 & 0.1066 \\
547: %266 & 1000 & 0.0581 & 261 & 0.1521 \\
548: %266 & 2000 & 0.0601 & 262 & 0.1435 \\
549: %266 & 9723 & 0.0596 & 263 & 0.1420 \\ \hline
550: %\end{tabular}
551: %\end{center}
552: %\caption{Coefficient of ultrametricity.
553: %Original: frequencies of occurrence matrix defined on the 266 texts
554: %crossed by: %100,
555: %1000, 2000, and all = 9273, words. Euclidean distance
556: %defined on each pair of texts. Factors: factor projections resulting
557: %from correspondence analysis, with Euclidean distance used between each
558: %pair of texts. Dimensionality of latter is necessarily less than $ 266 -1$,
559: %adjusted above for 0 eigenvalues = linear dependence.}
560: %\label{tabcorr2}
561: %\end{table}
562:
563: \begin{table}
564: \caption{Coefficient of ultrametricity, alpha.
565: Input data: frequencies of occurrence matrices defined on the 266 texts
566: crossed by: %100,
567: 1000, 2000, and all = 9723, words.
568: Alpha (ultrametricity coefficient) based
569: on factors: i.e., factor projections resulting
570: from correspondence analysis, with Euclidean distance used between each
571: pair of texts in factor space.
572: Dimensionality of latter is necessarily $ \leq 266 -1$,
573: adjusted for 0 eigenvalues = linear dependence.
574: %The mean and standard deviations are each based on 40,000 realizations of
575: %triangles.
576: }
577: \label{tabcorr2b}
578: \begin{center}
579: \setlength{\tabcolsep}{1mm}
580: \begin{tabular}{|crrrr|} \hline
581: & \multicolumn{3}{c}{266 Austen chapters or partial chapters} & \\ \hline
Texts & Orig.\ dim. & Factor dim. & Alpha, mean & Alpha, sdev. \\ \hline
583: %266 & 100 & 99 & 0.1001 & 0.0068 \\
584: 266 & 1000 & 261 & 0.1455 & 0.0084 \\
585: 266 & 2000 & 262 & 0.1489 & 0.0083 \\
586: 266 & 9723 & 263 & 0.1404 & 0.0075 \\ \hline
587: \end{tabular}
588: \end{center}
589: \end{table}
590:
Table \ref{tabcorr2b}, again displaying very stable alpha values, indicates
that the Austen corpus is slightly more ultrametric than the Grimms'
corpus (Table \ref{tabcorrb}).
594:
595: \subsection{Air Accident Reports}
596:
597: We used air accident reports to explore documents with very particular,
598: technical, vocabulary.
599: The NTSB aviation accident database
600: (Aviation Accident Database and Synopses, 2003)
601: contains information
602: about civil aviation accidents in the United States and elsewhere.
We selected 50 reports. Two examples of reports used
by us are: an accident that occurred on Sunday, January 02, 2000 in Corning, AR,
aircraft Piper PA-46-310P, injuries -- 5 uninjured; and an accident that
occurred on Sunday,
January 02, 2000 in Telluride, TN, aircraft Bellanca BL-17-30A,
injuries -- 1 fatal. In the 50 reports, there were 55,165 words.
608: Report lengths ranged between approximately 2300 and 28,000 words. The
609: number of unique words was 4261.
610:
The start of report 30 reads as follows: {\em On January 16, 2000, about
612: 1630 eastern standard time (all times are eastern standard time,
613: based on the 24 hour clock), a Beech P-35, N9740Y, registered to a
614: private owner, and operated as a Title 14 CFR Part 91 personal
615: flight, crashed into Clinch Mountain, about 6 miles north of
616: Rogersville, Tennessee. Instrument meteorological conditions prevailed
617: in the area, and no flight plan was filed. The aircraft incurred
618: substantial damage, and the private-rated pilot, the sole occupant,
619: received fatal injuries. The flight originated from Louisville,
620: Kentucky, the same day about 1532.}
621:
622: %\begin{table}
623: %\begin{center}
624: %\begin{tabular}{|crrrr|} \hline
625: %\multicolumn{5}{c}{50 aviation accident reports} \\ \hline
626: %Texts & Dim. & Original & Dim. & Factors \\ \hline
627: %%50 & 100 & 0.0270 & 48 & 0.1063 \\
628: %50 & 1000 & 0.0407 & 48 & 0.1317 \\
629: %50 & 2000 & 0.0407 & 48 & 0.1212 \\
630: %50 & 4261 & 0.0413 & 48 & 0.1180 \\ \hline
631: %\end{tabular}
632: %\end{center}
633: %\caption{Coefficient of ultrametricity.
634: %Original: frequencies of occurrence matrix defined on the 50 texts
635: %crossed by: %100,
636: %1000, 2000, and all = 4261, words. Euclidean distance
637: %defined on each pair of texts. Factors: factor projections resulting
638: %from correspondence analysis, with Euclidean distance used between each
639: %pair of texts. Dimensionality of latter is necessarily less than $ 50 -1$,
640: %adjusted above for 0 eigenvalues = linear dependence.}
641: %\label{tabcorr4}
642: %\end{table}
643:
644: \begin{table}
645: \caption{Coefficient of ultrametricity, alpha.
646: Input data: frequencies of occurrence matrices defined on the 50 texts
647: crossed by: %100,
648: 1000, 2000, and all = 4261, words.
649: Alpha (ultrametricity coefficient) based
650: on factors: i.e., factor projections resulting
651: from correspondence analysis, with Euclidean distance used between each
652: pair of texts in factor space.
Dimensionality of latter is necessarily $ \leq 50 -1$,
654: with an additional adjustment made for one 0-valued eigenvalue,
655: implying linear dependence.
656: %The mean and standard deviations are each based on 40,000 realizations
657: %triangles.
658: }
659: \label{tabcorr4b}
660: \begin{center}
661: \setlength{\tabcolsep}{1mm}
662: \begin{tabular}{|crrrr|} \hline
663: & \multicolumn{3}{c}{50 aviation accident reports} & \\ \hline
664: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline
665: %50 & 100 & 48 & 0.1101 & 0.0081 \\
666: 50 & 1000 & 48 & 0.1338 & 0.0077 \\
667: 50 & 2000 & 48 & 0.1186 & 0.0058 \\
668: 50 & 4261 & 48 & 0.1154 & 0.0050 \\ \hline
669: \end{tabular}
670: \end{center}
671: \end{table}
672:
673:
674: In Table \ref{tabcorr4b} we find ultrametricity values that are marginally
675: greater than those found for the Brothers Grimm (Table \ref{tabcorrb}). It
676: could be argued that the latter, too, uses its own technical
677: vocabulary. We would need to use more data to see if we can clearly
678: distinguish between the (small) ultrametricity levels of these two
679: corpora.
680:
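As a rough, illustrative check of how close the two corpora are, the
full word set means can be compared directly. The Grimm value of 0.1147
is taken from the summary in Table \ref{tabsum}; using the aviation
standard deviation of 0.0050 as a yardstick is an assumption made here
purely for illustration, and is not a formal test:
\[
0.1154 - 0.1147 = 0.0007, \qquad \frac{0.0007}{0.0050} = 0.14 .
\]
On this reading the difference is only about one seventh of a standard
deviation, which is consistent with the need for more data before the
two corpora can be separated.
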
681:
682: \subsection{DreamBank}
683:
684: With dream reports (i.e., reports by individuals on their remembered
685: dreams) we depart from a technical vocabulary, and instead raise the
686: question as to whether dream reports can perhaps be considered as types
687: of fairy tale or story, or even akin to accident reports.
688:
From the DreamBank repository (Domhoff, 2003; DreamBank, 2004; Schneider
690: and Domhoff, 2004)
691: we selected the following collections:
692: \begin{enumerate}
693: \item ``Alta: a detailed dreamer,'' in period 1985--1997, 422 dream reports.
694: \item ``Chuck: a physical scientist,'' in period
695: 1991--1993, 75 dream reports.
696: \item ``College women,'' in period 1946--1950, 681 dream reports.
697: \item ``Miami Home/Lab,'' in period 1963--1965, 445 dream reports.
698: \item ``The Natural Scientist,'' 1939, 234 dream reports.
699: \item ``UCSC women,'' 1996, 81 dream reports.
700: \end{enumerate}
701:
To have reports of adequate length, we requested report sizes of between
703: 500 and 1500 words. With this criterion, from (1) we obtained 118 reports,
704: from (2) and (6) we obtained no reports, from (3) we obtained 15 reports,
705: from (4) we obtained 73 reports, and finally from (5) we obtained 8 reports.
706: In all, we used 214 dream reports, comprising 13696 words.
707:
708: Sample of start of report 100: {\em I'm delivering a car to a man --
709: something he's just bought, a Lincoln
710: Town Car, very nice. I park it and go down the street to find him -- he
711: turns out to be an old guy, he's buying the car for nostalgia -- it turns
712: out to be an old one, too, but very nicely restored, in excellent
713: condition. I think he's black, tall, friendly, maybe wearing overalls. I
714: show him the car and he drives off. I'm with another girl who drove
715: another car and we start back for it but I look into a shop first -- it's
got outdoor gear in it -- we're on a sort of mall, outdoors but the shops
717: face on a courtyard of bricks. I've got something from the shop just
718: outside the doors, a quilt or something, like I'm trying it on, when
719: it's time to go on for sure so I leave it on the bench. We go further,
720: there's a group now, and we're looking at this office facade for the
721: Honda headquarters.}
722:
In addition to the above, we took a further set of dream reports from one
individual, Barbara Sanders. A more reliable (according to DreamBank, 2004)
set of reports comprised 139 reports, and a second comprised 32 reports.
In all, 171 reports were used from this person, which together with the
214 reports above gives the 385 dream reports analyzed below. Typical
lengths ranged from about 2500 up to 5322. The total number of words in
the Barbara Sanders set of dream reports was 107,791.
729:
730:
731: %\begin{table}
732: %\begin{center}
733: %\begin{tabular}{|crrrr|} \hline
734: %\multicolumn{5}{c}{385 dream reports} \\ \hline
735: %Texts & Dim. & Original & Dim. & Factors \\ \hline
736: %%385 & 100 & 0.0780 & 99 & 0.1379 \\
737: %385 & 1000 & 0.1122 & 384 & 0.2048 \\
738: %385 & 2000 & 0.1057 & 384 & 0.2137 \\
739: %385 & 11441 & 0.1288 & 384 & 0.1958 \\ \hline
740: %\end{tabular}
741: %\end{center}
742: %\caption{Coefficient of ultrametricity.
743: %Original: frequencies of occurrence matrix defined on the 385 texts
744: %crossed by: %100,
745: %1000, 2000, and all = 11441, words. Euclidean distance
746: %defined on each pair of texts. Factors: factor projections resulting
747: %from correspondence analysis, with Euclidean distance used between each
748: %pair of texts. Dimensionality of latter is necessarily less than $ 266 -1$,
749: %adjusted above for 0 eigenvalues = linear dependence.}
750: %\label{tabcorr3}
751: %\end{table}
752:
753: \begin{table}
754: \caption{Coefficient of ultrametricity, alpha.
Input data: frequencies of occurrence matrices defined on the 385 texts
756: crossed by: %100,
757: 1000, 2000, and all = 11441, words.
758: Alpha (ultrametricity coefficient) based
759: on factors: i.e., factor projections resulting
760: from correspondence analysis, with Euclidean distance used between each
761: pair of texts in factor space, of dimensionality $ 385 -1 = 384$.
762: %The mean and standard deviations are each based on 40,000
763: %realizations of triangles.
764: }
765: \label{tabcorr3b}
766: \begin{center}
767: \setlength{\tabcolsep}{1mm}
768: \begin{tabular}{|crrrr|} \hline
769: & \multicolumn{3}{c}{385 dream reports} & \\ \hline
770: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline
771: %385 & 100 & 99 & 0.1413 & 0.0090 \\
772: 385 & 1000 & 384 & 0.1998 & 0.0088 \\
773: 385 & 2000 & 384 & 0.1876 & 0.0095 \\
774: 385 & 11441 & 384 & 0.1933 & 0.0087 \\ \hline
775: \end{tabular}
776: \end{center}
777: \end{table}
778:
779: First we analyzed all dream reports, furnishing Table \ref{tabcorr3b}.
780:
781: In order to look at a more homogeneous subset of dream reports, we
782: then analyzed separately
783: the Barbara Sanders set of 171 reports, leading to Table \ref{tabcorr333b}.
784: (Note that this analysis is on a subset of
785: the previously analyzed dream reports, Table \ref{tabcorr3b}).
786: The Barbara Sanders subset of 171 reports contained 7044
787: unique words in all.
788:
789:
Compared to Table \ref{tabcorr3b}, based on the entire dream report
collection, Table \ref{tabcorr333b}, which is based on one person,
shows, on average, higher ultrametricity levels. It is interesting to note
793: that the dream reports, collectively, are higher in ultrametricity level
794: than our previous values for alpha; and that the ultrametricity level is
795: raised again when the data used relates to one person.
796:
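The size of this increase can be read off directly from the mean alpha
values in Tables \ref{tabcorr3b} and \ref{tabcorr333b}. The differences
below (single dreamer minus all dream reports) are simple arithmetic on
the tabulated means; comparing them with the reported standard
deviations is an informal yardstick assumed here for illustration, not
a formal significance test:
\begin{eqnarray*}
0.2250 - 0.1998 & = & 0.0252 \quad (1000 \mbox{ words}), \\
0.2256 - 0.1876 & = & 0.0380 \quad (2000 \mbox{ words}), \\
0.2603 - 0.1933 & = & 0.0670 \quad (\mbox{all words}).
\end{eqnarray*}
Each difference is more than twice the largest standard deviation
appearing in either table, 0.0112.
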
797: \subsection{James Joyce's Ulysses, and Overall Summary}
798:
799: We carried out a study of James Joyce's {\em Ulysses}, comprising
800: 304,414 words in total. We broke this text into 183 separate files,
each comprising approximately 1400 to 2000 words. The number of
unique words in these 183 files was found to be 28,649. The
803: ultrametricity alpha values for this collection of 183 Joycean texts
804: were found to be less than the Barbara Sanders values, but higher than the
805: global set of all dream reports.
806: % CORRECTION WITH NEW PROGRAMS 9 MAY: no of unique words was up from 28,631
807: % NEW MEAN FOR 7000 WAS: 0.2057
808: For 183 text segments, with frequencies of occurrence of 7000 (top-ranked)
809: words, we found a mean alpha of 0.2057, with standard deviation 0.0092.
810:
811: %\begin{table}
812: %\begin{center}
813: %\begin{tabular}{|crrrr|} \hline
814: %\multicolumn{5}{c}{171 Barbara Sanders dream reports} \\ \hline
815: %Texts & Dim. & Original & Dim. & Factors \\ \hline
816: %%171 & 100 & 0.0816 & 99 & 0.1405 \\
817: %171 & 1000 & 0.1212 & 170 & 0.2470 \\
818: %171 & 2000 & 0.1293 & 170 & 0.2110 \\
819: %171 & 11441 & 0.1324 & 170 & 0.2404 \\ \hline
820: %\end{tabular}
821: %\end{center}
822: %\caption{Coefficient of ultrametricity.
823: %Original: frequencies of occurrence matrix defined on the 171 texts
824: %crossed by: %100,
825: %1000, 2000, and all = 7044, words. Euclidean distance
826: %defined on each pair of texts. Factors: factor projections resulting
827: %from correspondence analysis, with Euclidean distance used between each
828: %pair of texts. Dimensionality of latter is necessarily less than $ 171 -1$,
829: %with no adjustment necessary for 0 eigenvalues = linear dependence.}
830: %\label{tabcorr333}
831: %\end{table}
832:
833: \begin{table}
834: \caption{Coefficient of ultrametricity, alpha.
835: Input data: frequencies of occurrence matrices defined on the 171 texts
836: crossed by: %100,
837: 1000, 2000, and all = 7044, words.
838: Alpha (ultrametricity coefficient) based
839: on factors: i.e., factor projections resulting
840: from correspondence analysis, with Euclidean distance used between each
841: pair of texts in factor space, of dimensionality $ 171 -1 = 170$.
842: %The mean and standard deviations are each based on 40,000
843: %realizations of triangles.
844: }
845: \label{tabcorr333b}
846: \begin{center}
847: \setlength{\tabcolsep}{1mm}
848: \begin{tabular}{|crrrr|} \hline
849: & \multicolumn{3}{c}{171 Barbara Sanders dream reports} & \\ \hline
850: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline
851: %171 & 100 & 99 & 0.1592 & 0.0063 \\
852: 171 & 1000 & 170 & 0.2250 & 0.0089 \\
853: 171 & 2000 & 170 & 0.2256 & 0.0112 \\
854: 171 & 7044 & 170 & 0.2603 & 0.0108 \\ \hline
855: \end{tabular}
856: \end{center}
857: \end{table}
858:
859: %Ulysses text: http://www.lib.ru/DVOJS/ulysses.txt
860:
861: A summary of all our results is in Table \ref{tabsum}. A few words of explanation
862: follow. The lower values of ultrametricity can be explained by a more
863: common, shared word set; viz., shared over the text segment set. The
864: higher values of ultrametricity are associated with dreams, in particular
865: with a single dreamer, and with {\em Ulysses}: one could argue that
866: characteristics of these data sets include frequent changes in interest,
867: and frequent replacement of one scene, and one set of personages,
868: with another. In factor space, this implies that a triplet of points
869: is more likely to be isosceles with small base, or equilateral, compared to
the alternative (low ultrametricity case) of smoother transitions from
one sentence, paragraph or section to another.
872:
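The geometric statement above can be written out explicitly. What
follows is the standard ultrametric (strong triangular) inequality,
given as a reminder; the symbols $x$, $y$, $z$ and $d$ are notation
chosen for this sketch, and the tolerance used in practice for deciding
that a sampled triangle is isosceles is the one specified with the
definition of alpha, which is not restated here. For any triplet of
points $x, y, z$ with distance $d$,
\[
d(x,z) \leq \max \{ d(x,y), d(y,z) \} .
\]
Labelling the sides so that $d(x,y) \leq d(y,z) \leq d(x,z)$, the
inequality forces $d(y,z) = d(x,z)$: the two largest sides are equal,
so the triangle is isosceles with small base $d(x,y)$, or equilateral
when all three sides coincide. In line with the caption of
Table \ref{tabsum}, alpha equal to 1 corresponds to all triangles
respecting this property, and alpha equal to 0 to none.
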
873: \begin{table}
874: \begin{center}
875: \begin{tabular}{|lrrr|}\hline
876: Data & No. texts & No. words & ultrametricity \\ \hline
877: Grimm tales & 209 & 7443 & 0.1147 \\
878: aviation accidents & 50 & 4261 & 0.1154 \\
879: Jane Austen novels & 266 & 9723 & 0.1404 \\
880: dream reports & 385 & 11441 & 0.1933 \\
Joyce's Ulysses        & 183 & 28649  & 0.2057 \\
882: single person dreams & 171 & 7044 & 0.2603 \\ \hline
883: \end{tabular}
884: \end{center}
885: \caption{Summary of results for the full word set, with the exception of
886: the Joyce data, where 7000 words were used. The ultrametricity is the
887: alpha measure used throughout this article, where 1 is respect for
ultrametricity by all triangles, and 0 is non-respect in all cases.}
889: \label{tabsum}
890: \end{table}
891:
892: \section{Conclusion}
893:
894: We studied a range of text corpora, comprising over 1000 texts, or text
895: segments,
896: containing over 1.3 million words. We found very stable ultrametricity
897: quantifications of the text collections, across numbers of most frequent
898: words used to characterize the texts, and sampling of triplets of texts.
899: We also found that in all cases (save, perhaps, the Brothers Grimm versus
900: air accident reports) there was a clear distinction between the ultrametricity
901: values of the text collections.
902:
903: %We end with a few remarks which much remain as speculation until far more
904: %sizable tests have been carried out (involving a far greater number of texts).
905: %However even speculation serves to motivate future work.
906: Some very intriguing ultrametricity characterizations were found in our
907: work. For example, we found that the technical vocabulary of air accidents
908: did not differ greatly in terms of inherent ultrametricity compared to the
Brothers Grimm fairy tales. Secondly, we found that novelist Austen's
works were distinguishable from the Grimm fairy tales. Thirdly, we found
dream reports to have a higher ultrametricity level than the other
912: text collections. Further exploration of these issues will require
913: availability of very high quality textual data.
914:
915: Values of our alpha ultrametricity coefficient were small but
916: revealing and useful nonetheless. Ultrametricity implies hierarchical
917: embedding, or structuring in terms of embedded sets. This is what we are
918: finding locally (and not globally) in our data. The use of such
919: hierarchical fragments as relations of dominance between concepts could be
920: of use for ontologies.
921:
922: Ontologies, or concept hierarchies, are used
923: to help the user in information retrieval in a range of ways including:
924: tree-based homing in on content to be retrieved; characterizing the
925: content of data repositories before querying starts;
926: and disambiguating different
927: but overlapping content domains. In \cite{autoonto} we explore the use
928: of local ultrametric embedding for ontology fragments. As an example,
we use Aristotle's {\em Categories} and some modern texts (on
930: ubiquitous computing, and from Wikipedia), and we
931: also discuss an online web-based demonstrator supporting retrieval through
932: a visual user interface.
933:
934:
935: \begin{thebibliography}{99}
936:
937: \bibitem{refa1}
938: Austen, J. (1811). {\em Sense and Sensibility}. Available at: \\
939: http://www.pemberley.com/etext/SandS
940:
941: \bibitem{refa2}
942: Austen, J. (1813). {\em Pride and Prejudice}. Available at: \\
943: http://www.pemberley.com/etext/PandP
944:
945: \bibitem{refa3}
946: Austen, J. (1817). {\em Persuasion}. Available at: \\
947: http://www.pemberley.com/etext/Persuasion
948:
949: %\bibitem{ref1}
950: %A.-L. Barab\'asi, ``Self-organized networks: resources'', at
951: %www.nd.edu/$\sim$networks/database (2004).
952:
953: \bibitem{ref2}
954: Benz\'ecri, J.P. (1979a). {\em L'Analyse des Donn\'ees Tome 1,
955: La Taxinomie}, 2nd ed., Dunod, Paris.
956:
957: \bibitem{ref3}
958: Benz\'ecri, J.P. (1979b). {\em L'Analyse des Donn\'ees Tome 2,
959: Correspondances}, 2nd ed., Dunod, Paris.
960:
961: %\bibitem{ref4}
962: %G. Caldarelli, A. Erzan and A. Vespignani, Eds., Special issue on Networks,
963: %European Physical Journal B {\bf 38}, no. 2 (2004).
964:
965: %\bibitem{ref5}
966: %Comtet, L. (1974). {\em Advanced Combinatorics}, Reidel, Dordrecht.
967:
968: \bibitem{ref6}
969: Domhoff, G.W. (2003).
970: {\em The Scientific Study of Dreams: Neural Networks,
971: Cognitive Development and Content Analysis}, American Psychological
972: Association.
973:
974: %\bibitem{ref7}
975: %Donaghey, R. (1975).
976: %Alternating Permutations and Binary Increasing Trees,
977: %{\em Journal of Combinatorial Theory (A)}, 18: 141--148.
978:
979: \bibitem{ref8}
980: DreamBank (2004), Repository of Dream Reports, www.dreambank.net
981:
982: \bibitem{gom}
983: G\'omez-P\'erez, A., Fern\'andez-L\'opez, M. and Corcho, O. (2004).
984: {\em Ontological Engineering (with Examples from the Areas of Knowledge
985: Management, e-Commerce and the Semantic Web)}, Springer, Berlin.
986:
987: %\bibitem{ref9}
988: %J.C. Gower, ``Some distance properties of latent root and vector
989: %methods used in multivariate analysis'', Biometrika {\bf 53}, 325
990: %(1966). % 325--328
991:
992: \bibitem{ref10}
993: Lerman, I.C. (1981).
994: {\em Classification et Analyse Ordinale des Donn\'ees},
995: Dunod, Paris.
996:
997: \bibitem{ref11}
998: Murtagh, F. (1983). A Survey of Recent Advances in Hierarchical
999: Clustering Algorithms, {\em The Computer Journal}, 26:
1000: 354--359.
1001:
1002: %\bibitem{ref12}
1003: %Murtagh, F. (1984).
1004: %Counting Dendrograms: A Survey,
1005: %{\em Discrete Applied Mathematics}, 7: 191--199.
1006:
1007: \bibitem{ref13}
1008: Murtagh, F. (1985).
1009: {\em Multidimensional Clustering Algorithms},
1010: Physica-Verlag, W\"urzburg.
1011:
1012: \bibitem{ref14}
1013: Murtagh, F. (2004). On Ultrametricity, Data Coding, and Computation,
1014: {\em Journal of Classification}, 21: 167--184.
1015:
1016: \bibitem{ref15}
1017: Murtagh, F. (2005a). Identifying the Ultrametricity of Time Series,
1018: {\em European Physical Journal B}, 43: 573--579.
1019:
1020: \bibitem{ref16}
1021: Murtagh, F. (2005b). {\em
1022: Correspondence Analysis and Data Coding with Java and R},
1023: Chapman and Hall/CRC Press, New York.
1024:
1025: \bibitem{autoonto}
1026: Murtagh, F., Mothe, J. and Englmeier, K. (2007). Ontology from local
1027: hierarchical structure in text. http://arxiv.org/abs/cs.IR/0701180
1028:
1029: \bibitem{ref17} NTSB
1030: Aviation Accident Database and Synopses (2003),
National Transportation Safety Board,
1032: accessible from http://www.landings.com
1033: %/evird.acgi\$pass*59062640!\_h-www.landings.com/\_landings/
1034: %pages/search/rep-ntsb.html
1035:
1036:
1037: \bibitem{ref18}
1038: Ockerbloom, J.M. (2003). {\em Grimms' Fairy Tales},
1039: http://www-2.cs.cmu.edu/$\sim$spok/grimmtmp
1040:
1041: \bibitem{por}
1042: Porter, M.F. (1980). An Algorithm for Suffix Stripping,
1043: {\em Program}, 14: 130--137.
1044:
1045: \bibitem{ref19}
1046: Rammal, R., Toulouse, G. and Virasoro, M.A. (1986).
1047: Ultrametricity for
1048: Physicists, {\em Reviews of Modern Physics}, 58: 765--788.
1049:
1050: \bibitem{sas}
1051: Sasaki, F. and P\"onninghaus, J. (2003).
1052: Testing Structural Properties in Textual Data: Beyond Document Grammars,
{\em Literary and Linguistic Computing}, 18: 89--100.
1054:
1055: \bibitem{ref20}
1056: Schneider, A. and Domhoff, G.W. (2004). The Quantitative Study of Dreams,
1057: http://dreamresearch.net
1058:
1059: \bibitem{refxy}
1060: Schweinberger, M. and Snijders, T.A.B. (2003). Setting in Social Networks:
1061: A Measurement Model, {\em Sociological Methodology}, 33: 307--342.
1062:
1063: %\bibitem{ref21}
1064: %W.S. Torgerson,
1065: %Theory and Methods of Scaling (Wiley, New York, 1958).
1066:
1067: %\bibitem{ref22}
1068: %C.J. van Rijsbergen, Information Retrieval, 2nd ed.
1069: %(Butterworths, 1979).
1070:
1071: %\bibitem{ref23}
1072: %A. Trusina, S. Maslov, P. Minnhagen and K. Sneppen,
1073: %``Hierarchy measures in complex networks'',
1074: %Physical Review Letters {\bf 92}, 178702(4) (2004).
1075:
1076: \end{thebibliography}
1077:
1078: \end{document}
1079:
1080:
1081:
1082: