1: \documentclass[12pt,singlespacing]{article}
2: \usepackage{amssymb}
3: \usepackage{amsfonts}
4: \usepackage{graphicx}
5: \newcommand{\R}{\mathbb{R}}
6: \newcommand{\Z}{\mathbb{Z}}
7: 
8: \begin{document}
9: \title{A Note on Local Ultrametricity in Text}
10: \author{Fionn Murtagh \\
11: Department of Computer Science \\
12: Royal Holloway University of London \\
13: Egham, Surrey TW20 0EX, England \\
14: E-mail fmurtagh@acm.org}
15: 
16: \maketitle
17: 
18: \begin{abstract}
19: High dimensional, sparsely populated data spaces have been characterized in 
20: terms of ultrametric topology.  This implies that there are natural, not
21: necessarily unique, tree or hierarchy structures defined by the 
22: ultrametric topology.  In this note we study the extent of local 
23: ultrametric topology in texts, with the aim of 
24: finding unique ``fingerprints'' for a text or corpus, discriminating between
25: texts from different domains, and opening up the possibility of 
26: exploiting hierarchical structures in the data.  
27: We use coherent and meaningful collections of 
28: over 1000 texts, comprising over 1.3 million words.   
29: \end{abstract}
30: %network \sep complete graph \sep edge weighted \sep metric \sep 
31: %Euclidean \sep ultrametric \sep chi squared metric 
32: %\PACS{
33: %{89.75.Hc}{Networks and genealogical trees} \and
34: %{02.50.Sk}{Multivariate analysis} \and
35: %{89.75.Kd}{Patterns} \and
36: %{89.75.Fb}{Structures and organization in complex systems}
37: %} % end of PACS
38: %}  % end of abstract
39: 
40: 
41: 
42: \section{Introduction}
43: 
44: Structures that are inherent to data of any type can be of importance, and 
45: hierarchical structure is a prime example.   In this work we take text
46: corpora and assess the extent of hierarchical structure among words 
47: constituting the texts.  By comprehensively taking context into account we 
48: seek to study hierarchical structures in the domain semantics.
49: 
50: The data studied in Rammal et al.\ (1986) and Murtagh (2004)  is point pattern 
51: data: observational features with their measurements on many coordinate 
52: dimensions.  Data may be instead presented as time-varying signals and 
53: in a similar way, related to the findings of Rammal et al.\ (1986) and 
54: Murtagh (2004),  
55: we have investigated ultrametric-related 
56: properties of time series or 1D signals in 
57: Murtagh (2005a).  In the latter time series work, we encoded the data in a 
58: particular way.  In this paper, we show how texts can also be 
59: characterized in a similar manner.
60: 
61: The triangular inequality holds for a metric space: $d(x,z) \leq 
62: d(x,y) + d(y,z)$ for any triplet 
63: of points $x,y,z$.  In addition the properties 
64: of symmetry and positive definiteness are respected.  The ``strong 
triangular inequality'' or ultrametric inequality is: $d(x,z) \leq 
\max \{ d(x,y), d(y,z) \}$ for any triplet $x,y,z$.  An ultrametric
67: space implies respect for a range of stringent properties.  For example, 
68: the triangle formed by any triplet is necessarily isosceles, with the two
69: large sides equal; or is equilateral.  Any agglomerative hierarchical 
70: procedure (cf.\ Benz\'ecri, 1978; Lerman, 1981; Murtagh, 1983, 1985) can 
impose hierarchical structure.  Our aim in this work is to assess the
inherent extent of hierarchical structure.  
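
As a small illustration (with distances chosen here purely for the purpose of
example), consider a triplet with $d(x,y) = 3$, $d(y,z) = 5$ and $d(x,z) = 5$.
The ultrametric inequality holds for each of the three sides:
$$ 5 \leq \max \{ 3, 5 \}, \quad 5 \leq \max \{ 5, 3 \}, \quad 
3 \leq \max \{ 5, 5 \}, $$
and the triangle is isosceles with small base.  By contrast, a triplet with
sides $3, 4, 5$ satisfies the triangular inequality, since $5 \leq 3 + 4$, 
but not the ultrametric inequality, since $5 > \max \{ 3, 4 \} = 4$.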
73: 
74: We take a large
75: number of coherent collections of meaningful texts.  Through shared words,
76: we can define a similarity network between all texts in each of the 
77: collections we chose.  Aspects of the semantics of the given collection are
78: captured in this way.  We investigate how ultrametric each of these 
79: semantic networks is.  
80: 
81: %We select texts 
82: %each containing roughly 500 to 1000 words (but as will be seen below, 
83: %some texts had up to around 44,000 words).  
Our selected texts in this study are in English and 
do not contain accented characters (although these could easily be catered for).
The texts were: fairy tales by the Brothers Grimm; novels by the English 
writer, Jane Austen; in order to have very technical language, aircraft
accident reports from the US National Transportation Safety Board (NTSB); and in order
89: to seek linkages with biological and cognitive processes, a range of 
90: dream reports from the online DreamBank repository.  
91: 
92: We find clear distinctions between the semantic networks (or text collections)
93: studied, in terms of their relative (albeit small) extent of ultrametricity.  
94: 
95: Our objectives in such assessment of inherent, local, hierarchical 
96: structure include the following:
97: 
98: \begin{enumerate}
\item Ontologies (see e.g.\ G\'omez-P\'erez et al., 2004) have become of 
100: great interest to facilitate information resource discovery, and to 
101: support querying and retrieval of information, in current areas of work 
102: such as the semantic web.  Automatic or semi-automatic 
103: construction of ontologies is aided greatly by hierarchical relationships
104: between terms.  The characterizing of texts in terms of local 
105: hierarchical structure simultaneously provides justification for unambiguous 
106: local hierarchies.  (We return to this issue of ontology creation
107: in the Conclusion.)
108: 
109: \item Structures defined on terms that are more general than grammars
110: may be of use in modelling and assessing consistency of textual data 
111: (see Sasaki and P\"onninghaus, 2003); and perhaps in mapping some aspects of 
112: semantics and flow of reason and logic in text. 
113: %(for example, providing 
114: %a quantitative expression of Freud's concepts of
115: % condensation and displacement).  
116: 
117: \item Limited extent of hierarchical structure may point to the 
118: undesirability of a global tree or hierarchical clustering model for the 
119: text or set of texts.  However for the same reason, a set of 
120: local hierarchical clusterings, or a forest of (locally defined) trees, may be
121: more appropriate.  
122: 
We note that our work is quite different from Leo
Breiman's random forest methodology, where multiple classification trees are
fitted to a
data set.  Our work, by contrast, is directed towards finding
``shrubs'' or tree fragments in a data set.
128: 
129: \item Latent ultrametric distances were estimated by Schweinberger and Snijders
130: (2003) in order to represent transitive structures among pairwise 
131: relationships.  
132: 
133: \item Further motivation is provided by fingerprinting of authorship, and
134: document clustering (e.g.\ to facilitate retrieval).
135: 
136: \end{enumerate}
137: 
138: \section{Methodology}
139: 
140: We employ correspondence analysis for metric embedding,
141: followed by determination of the extent of  ultrametricity, in factor
142: space, based on the alpha coefficient of ultrametricity.  Our motivation 
for using precisely this Euclidean embedding is as follows.  Our input 
data is in the form of frequencies of occurrence.  A Euclidean distance
defined directly on vectors of such values is not appropriate, since it does not
counterbalance overly frequent (or rare) words or unusually long (or short) texts.  
146: 
147: The $\chi^2$ distance
148: is an appropriate weighted Euclidean distance for use with such data
149: (Benz\'ecri, 1979; Murtagh, 2005b).  
150: Consider texts $i$ and $i'$ crossed by words $j$.  Let $k_{ij}$ be the number of
151: occurrences of word $j$ in text $i$.  Then, omitting a constant, 
the squared $\chi^2$ distance between texts $i$ and $i'$ is given by 
$ \sum_j (1/k_j) ( k_{ij}/k_i - k_{i'j}/k_{i'} )^2$, where 
$k_i = \sum_j k_{ij}$ and $k_j = \sum_i k_{ij}$.  The weighting term is 
$1/k_j$.  The weighted Euclidean distance is between the {\em profile} 
155: of text $i$, viz.\ $k_{ij}/k_i$ for all $j$, and the analogous 
156: {\em profile} of text $i'$.  
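
As a toy illustration (the counts here are hypothetical, and the corpus is
assumed to consist of just two texts and two words), let text $i$ have counts
$(k_{i1}, k_{i2}) = (3, 1)$ and text $i'$ have counts 
$(k_{i'1}, k_{i'2}) = (1, 3)$.  Then $k_i = k_{i'} = 4$, $k_1 = k_2 = 4$, and
the profiles are $(3/4, 1/4)$ and $(1/4, 3/4)$.  The expression above 
evaluates to
$$ \frac{1}{4} \left( \frac{3}{4} - \frac{1}{4} \right)^2 + 
\frac{1}{4} \left( \frac{1}{4} - \frac{3}{4} \right)^2 = 
\frac{1}{16} + \frac{1}{16} = \frac{1}{8} . $$
If the omitted constant is taken to be the grand total 
$k = \sum_i \sum_j k_{ij} = 8$, so that the weights become $k/k_j$, this 
gives $8 \times 1/8 = 1$.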
157: 
158: 
159: \subsection{Alpha Coefficient of Ultrametricity}
160: 
161: The definition of ultrametricity introduced in Murtagh (2004) and justified 
162: relative to alternatives was, in 
163: summary, as follows.  For all triplets of points, we consider the three
164: internal angles.  We require that the smallest angle be less than or equal
165: to 60 degrees.  Then we require that the two remaining angles be
approximately equal.  Approximate equality is defined as a difference of less than 2 degrees,
167: in order to cater for imprecise coordinate measurement (e.g., due to 
168: floating point values) in an acceptable way.  Satisfying these angular 
169: constraints implies that the triplet of points defines an approximate 
170: isosceles (with small base) or equilateral triangle.  We define a
171: coefficient of ultrametricity of the point set as the proportion of all 
triangles satisfying these requirements.  The coefficient of ultrametricity 
is 1 for perfectly ultrametric data; if it is 0, no triangle satisfies the 
174: isosceles or equilateral requirements.  This coefficient is 
175: referred to as alpha below in this article.  
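
To illustrate these angular requirements (with triangles chosen here purely as
examples), a triangle with internal angles $40^\circ, 70^\circ, 70^\circ$ is
counted: the smallest angle is at most $60^\circ$, and the two remaining
angles differ by less than $2^\circ$.  An equilateral triangle, with angles
$60^\circ, 60^\circ, 60^\circ$, is also counted.  A triangle with angles
$30^\circ, 60^\circ, 90^\circ$ is not counted, since the two remaining angles
differ by $30^\circ$.  One way of writing the coefficient, on this reading of
the definition, is
$$ \alpha = \frac{ \mbox{number of triplets satisfying the angular 
requirements} }{ \mbox{number of triplets examined} } . $$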
176: 
177: As already noted, assessing ultrametricity through triangle properties 
178: is based on the prior  correspondence analysis, and this has the following
179: beneficial (and, in a sense, enabling) implications.  The correspondence
180: analysis 
181: factor space is  Euclidean.  A Euclidean space, as a particular Hilbert 
182: space, is a complete, normed vector space endowed with a scalar product.  
183: It is precisely the scalar product that allows us to define angles and
184: hence the triangle properties that we need.  
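
As a sketch of how this can be done (the notation $\theta_x$ and the example
points are introduced here for illustration only), the internal angle
$\theta_x$ at point $x$ of the triangle formed by $x, y, z$ follows from the
scalar product:
$$ \cos \theta_x = \frac{ \langle y - x, z - x \rangle }{ \| y - x \| \, 
\| z - x \| } . $$
For example, with $x = (0,0)$, $y = (2,0)$ and $z = (1,3)$ in the plane, we
have $\cos \theta_x = 2 / (2 \sqrt{10}) \approx 0.316$, so 
$\theta_x \approx 71.6^\circ$; by symmetry $\theta_y \approx 71.6^\circ$; and
$\cos \theta_z = 8/10$, so $\theta_z \approx 36.9^\circ$.  The smallest angle
is less than $60^\circ$ and the two remaining angles are equal, so this
triplet satisfies the requirements.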
185: 
186: \subsection{Correspondence Analysis: 
187: Mapping $\chi^2$ into Euclidean Distances}
188: 
189: As a dimensionality reduction technique 
190: correspondence analysis is particularly appropriate for handling 
191: frequency data.  As an example of the latter, frequencies of word
192: occurrence in text will be studied below.  
193: 
194: The given contingency table (or numbers of occurrence) 
195: data is denoted $k_{IJ} =
196: \{ k_{IJ}(i,j) = k(i, j) ; i \in I, j \in J \}$.  $I$ is the set of text
197: indexes, and $J$ is the set of word indexes.  We have
198: $k(i) = \sum_{j \in J} k(i, j)$.  Analogously $k(j)$ is defined,
199: and $k = \sum_{i \in I, j \in J} k(i,j)$.  Next, $f_{IJ} = \{ f_{ij}
200: = k(i,j)/k ; i \in I, j \in J\} \subset \R_{I \times J}$,
similarly $f_I$ is defined as  $\{f_i = k(i)/k ; i \in I\}
\subset \R_I$, and $f_J$ analogously.  What we have described here is 
the conversion of numbers of occurrences into relative frequencies.
204: 
The conditional distribution of $f_J$ knowing $i \in I$, also termed
the $i$th profile with coordinates indexed by the elements of $J$, is:
207: 
208: $$ f^i_J = \{ f^i_j = f_{ij}/f_i = (k_{ij}/k)/(k_i/k) ; f_i \neq 0 ;
209: j \in J \}$$ and likewise for $f^j_I$.  
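
As a small worked example (the table of counts is hypothetical, chosen for
ease of arithmetic), take two texts crossed by two words, with $k(1,1) = 4$,
$k(1,2) = 2$, $k(2,1) = 1$ and $k(2,2) = 3$.  Then $k(i)$ takes values $6$
and $4$, $k(j)$ takes values $5$ and $5$, and $k = 10$.  The relative
frequencies are
$$ f_{11} = 0.4, \quad f_{12} = 0.2, \quad f_{21} = 0.1, \quad f_{22} = 0.3, $$
with $f_I = (0.6, 0.4)$ and $f_J = (0.5, 0.5)$.  The profiles of the two
texts are
$$ f^1_J = (0.4/0.6, \; 0.2/0.6) = (2/3, \; 1/3), \qquad 
f^2_J = (0.1/0.4, \; 0.3/0.4) = (1/4, \; 3/4) , $$
each summing to 1.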
210: 
211: Note that the input data values here are always non-negative reals.  The 
212: output factor projections (and contributions to the principal directions 
213: of inertia) will be reals.  
214: 
215: \subsection{Input: Cloud of Points Endowed with the Chi Squared Metric}
216: 
217: 
218: The cloud of points consists of the couple: profile coordinate and mass.
219: We have $ N_J(I) = \{ ( f^i_J, f_i ) ; i  \in I \} \subset \R_J $, and
220: again similarly for $N_I(J)$.
221: 
222: The moment of inertia is as follows: 
\begin{equation}
M^2(N_J(I)) = M^2(N_I(J)) = \| f_{IJ} - f_I f_J \|^2_{f_I f_J}
= \sum_{i \in I, j \in J} \frac{(f_{ij} - f_i f_j)^2}{f_i f_j}
\end{equation}
227: The term  $\| f_{IJ} - f_I f_J \|^2_{f_I f_J}$ is the $\chi^2$ metric
228: between the probability distribution $f_{IJ}$ and the product of marginal
229: distributions $f_I f_J$, with as center of the metric the product
230: $f_I f_J$.  Decomposing the moment of inertia of the cloud $N_J(I)$ -- or 
231: of $N_I(J)$ since both analyses are inherently related -- furnishes the 
232: principal axes of inertia, defined from a singular value decomposition.
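
To make this concrete, consider a hypothetical table of counts for two texts
and two words, with rows $(4, 2)$ and $(1, 3)$, so that $k = 10$, 
$f_I = (0.6, 0.4)$ and $f_J = (0.5, 0.5)$.  The products $f_i f_j$ are $0.3$
for the first text and $0.2$ for the second, and each of the four deviations
$f_{ij} - f_i f_j$ equals $\pm 0.1$.  Hence
$$ M^2 = \frac{0.01}{0.3} + \frac{0.01}{0.3} + \frac{0.01}{0.2} + 
\frac{0.01}{0.2} = \frac{1}{6} \approx 0.167 . $$
Multiplying by $k$ gives $10/6 \approx 1.67$, which is the usual $\chi^2$
statistic of the table of counts: with expected counts $3, 3, 2, 2$ we have
$1/3 + 1/3 + 1/2 + 1/2 = 5/3$.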
233: 
234: \subsection{Output: Cloud of Points Endowed with the Euclidean 
235: Metric in Factor Space}
236: 
237: From the initial frequencies data matrix, a set of probability data,
238: $f_{ij}$, is defined by dividing each value by the grand total of all
239: elements in
240: the matrix.  In correspondence analysis,
241: each row (or column) point is considered to have an
associated weight.  The weight of the $i$th row point is given
by $f_i = \sum_j f_{ij}$, and the weight of the $j$th column point
is given by $f_j = \sum_i f_{ij}$. We consider the row points to have
coordinates ${f_{ij} / f_i}$, thus allowing points of the same
{\em profile} to be identical (i.e., superimposed). The following weighted
Euclidean distance, the $\chi^2$ distance, is then used between row
points:
$$ d^2(i,i') = \sum_j {1 \over f_j} \left( {f_{ij} \over f_i} -
                                     {f_{i'j} \over f_{i'}} \right)^2 $$
251: and an analogous distance is used between column points.
252: 
253: The mean row point is given by the weighted average of all row
254: points:
255: $$ \sum_i f_i {f_{ij} \over f_i} = f_j$$
256: for $j = 1, 2, \dots, m$.  Similarly the mean column profile has
257: $i$th coordinate $f_i$.
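
As a check on a hypothetical $2 \times 2$ example (introduced here for
illustration only), let the row profiles be $(2/3, 1/3)$ and $(1/4, 3/4)$ with
weights $f_1 = 0.6$ and $f_2 = 0.4$.  The weighted average is
$$ 0.6 \, (2/3, \; 1/3) + 0.4 \, (1/4, \; 3/4) = 
(0.4 + 0.1, \; 0.2 + 0.3) = (0.5, \; 0.5) , $$
which is the vector of column weights $f_j$, as required.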
258: 
We
first consider the projections of the $n$
profiles in $\R^m$ onto an axis, ${\bf u}$.  This is given by
$$ \sum_j {f_{ij} \over f_i} {1 \over f_j} u_j$$ for all $i$ (note
the use of the scalar product here).  For details on determining the 
new axis, ${\bf u}$, see Murtagh (2005b).
265: 
The  projections of points onto
axis ${\bf u}$ were with respect to the ${1 / f_j}$ weighted Euclidean
metric.  This makes interpreting projections very difficult from a
human/visual point of view, and so it is more natural to present results
in such a way that projections can be simply appreciated.  Therefore
{\em factors} are defined, such that the projections of row vectors
onto factor ${\bf \phi}$ associated with axis ${\bf u}$ are given by
$$\sum_j {f_{ij} \over f_i} \phi_j$$ for all $i$.  Taking $$\phi_j =
{1 \over f_j} u_j$$ ensures this and projections onto ${\bf \phi}$
are with respect to the ordinary (unweighted) Euclidean distance.
277: 
An analogous set of relationships holds in $\R^n$ where the best
279: fitting axis, ${\bf v}$, is searched for.  A simple mathematical
280: relationship holds between ${\bf u}$ and ${\bf v}$, and between
281: ${\bf \phi}$ and ${\bf \psi}$ (the latter being the factor associated
282: with axis or eigenvector ${\bf v}$):
283: $$ \sqrt{\lambda} \psi_i = \sum_j {f_{ij} \over f_i} \phi_j $$
284: $$ \sqrt{\lambda} \phi_j = \sum_i {f_{ij} \over f_j} \psi_i $$
These are termed {\em transition formulas}.  Axes ${\bf u}$
and ${\bf v}$, and factors ${\bf \phi}$ and ${\bf \psi}$, are
associated with eigenvalue $\lambda$; best fitting higher-dimensional
290: subspaces are associated with decreasing values of $\lambda$ (see Murtagh,
291: 2005b, for further details).
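
The following toy example, which is intended only as an illustrative check
and is not taken from the data studied here, shows these relationships at
work.  It assumes that the factors are normalized so that 
$\sum_i f_i \psi_i^2 = \sum_j f_j \phi_j^2 = \lambda$.  Take relative
frequencies $f_{11} = 0.4$, $f_{12} = 0.2$, $f_{21} = 0.1$, $f_{22} = 0.3$,
so that $f_I = (0.6, 0.4)$, $f_J = (0.5, 0.5)$, and the row profiles are
$(2/3, 1/3)$ and $(1/4, 3/4)$.  There is a single factor, with 
$\lambda = 1/6$, and (up to a common change of sign) $\psi = (1/3, -1/2)$ and
$\phi = (1/\sqrt{6}, -1/\sqrt{6})$.  The first transition formula is verified
for $i = 1$ by
$$ \sum_j \frac{f_{1j}}{f_1} \phi_j = \frac{2}{3} \cdot \frac{1}{\sqrt{6}} - 
\frac{1}{3} \cdot \frac{1}{\sqrt{6}} = \frac{1}{3 \sqrt{6}} = 
\sqrt{\lambda} \, \psi_1 , $$
and similarly for $i = 2$ and for the second formula.  The squared Euclidean
distance between the two texts in factor space is 
$(\psi_1 - \psi_2)^2 = (5/6)^2 = 25/36$, which equals their squared $\chi^2$
distance
$$ \sum_j \frac{1}{f_j} \left( \frac{f_{1j}}{f_1} - \frac{f_{2j}}{f_2} 
\right)^2 = 2 \left( \frac{5}{12} \right)^2 + 2 \left( \frac{5}{12} \right)^2
= \frac{25}{36} . $$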
292: 
293: \subsection{Conclusions on Correspondence Analysis and Introduction to the 
294: Numerical Experiments to Follow}
295: 
296: Some important points for the analyses to follow are -- firstly in relation 
297: to correspondence analysis: 
298: 
299: \begin{enumerate}
300: 
301: \item From numbers of occurrence data we always get (by design) 
302: a Euclidean embedding
303: using correspondence analysis.  The factors are embedded in a Euclidean 
304: metric.  
305: 
\item As seen in the previous subsection, the 
number of factors, i.e.\ the number of non-zero eigenvalues, is
308: given by one less than the minimum of the number of observations studied
309: (indexed by set $I$) and the number of variables or attributes used 
310: (indexed by set $J$).  
311: The number of dimensions in factor space may be less than full rank
312: if there are linear dependencies present.  
313: 
\item In the experiments to follow in the next section, we  always 
have  $n < m$, where $n$ is the number of texts or text segments, and $m$ is 
the number of words.  This implies that the inherent (full rank) 
317: dimensionality of the projected Euclidean 
318: factor space is $n - 1$.  
319: 
320: \item To assess stability of results,
321: in our studies we often take as input a word set given by the 
322: (for example, 1000) most highly ranked (in terms of frequency of 
323: occurrence)  words.  Thus we take $m = 1000, 2000,$ and the full 
324: attribute set (say, 
325: $m_{\rm tot}$) in each case, where the attributes are ordered in terms of 
326: decreasing marginal frequency.  In other words, we take the 1000 most
327: frequent words to characterize our texts; then the 2000 most frequent words; 
and finally all words.  Because $n < m$, the inherent, projected, Euclidean 
factor space dimensionality is the same in each case, viz., $n - 1$, so it 
is not surprising that very similar results are found irrespective of the 
value of $m$.  We additionally find in this a confirmation 
of the stability of our results.  
We will show that our results are 
characteristic of the texts used, in each case, and are in no way ``one off''
or arbitrary.  
336: 
337: %\item Purely as a baseline we will look at direct Euclidean pairwise 
338: %distances defined on $\{ k_{ij} | i = 1, 2, \dots , n; j = 1, 2, 
339: %\dots , m \}$.
340: 
341: \end{enumerate}
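
For instance, using the table dimensions of the experiments reported below,
a table crossing $n = 209$ texts with $m = 1000$ words gives at most
$$ \min (n, m) - 1 = \min (209, 1000) - 1 = 208 $$
factors, and the same bound of 208 holds for $m = 2000$ and $m = 7443$.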
342: 
343: Some important points related to our numerical assessments below, in 
344: relation to data used, determining of ultrametricity coefficient, 
345: and software used, are as follows.
346: 
347: \begin{enumerate}
348: 
349: \item 
350: In line with one tradition of textual analysis associated with Benz\'ecri's
correspondence analysis (see Murtagh, 2005b) we take the unique full words and
rank them in order of frequency of occurrence.  Thus for the Brothers Grimm work,
below, we find: ``the'', 19,696 occurrences; ``and'',
14,582 occurrences; ``to'', 7380 occurrences; ``he'', 5951 occurrences; 
``was'', 4122 occurrences; and so on.  The last three words in this ranking, 
with one occurrence each, are ``yolk'', ``zeal'', ``zest''.   
357: 
358: \item The alpha ultrametricity coefficient is based on triangles. Now, 
359: with $n$ graph nodes we have $O(n^3)$ possible triangles which is 
360: computationally prohibitive, so we instead sample.  The means and 
361: standard deviations below are based on 2000 random triangle vertex
362: realizations, repeated 20 times; hence, in each case, in total 40,000 
363: random selections of triangles.  
364: 
365: \item All text collections reported on below (section \ref{sectreal}) 
366: are publicly accessible (and web addresses are cited).  All texts were
obtained by us in plain (ASCII) text format.
368:    
369: The preparation of the input data was carried out with programs of 
370: ours, written in C, and available at www.correspondances.info (accompanying 
371: Murtagh, 2005b).  The correspondence analysis software was written in
372: the public  R statistical software environment  
373: (www.r-project.org, again see Murtagh, 2005b) and is available at this same 
374: web address.  Some 
375: simple statistical calculations were carried out by us also 
376: in the R environment.  
377: 
378: \end{enumerate}
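
To give an idea of the orders of magnitude involved in the triangle sampling
(the following arithmetic is added for illustration), the number of distinct
triangles on $n$ points is
$$ {n \choose 3} = \frac{n (n-1) (n-2)}{6} , $$
which for $n = 209$, the size of the Brothers Grimm collection used below,
gives $209 \times 208 \times 207 / 6 = 1{,}499{,}784$.  The 
$20 \times 2000 = 40{,}000$ sampled triangles then correspond to under 3\% of
this total.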
379: 
380: 
381: 
382: 
383: 
384: 
385: 
386: 
387: \section{Real Case Studies: Text Interrelationships Through Shared Words}
388: \label{sectreal}
389: 
390: We use in all over 900 short texts, given by short stories, or chapters,
391: or short reports.  All are in English.  Unique words are determined 
392: through delimitation by white space and by punctuation characters
393: with no distinction of upper and lower case.  In
394: all, over one million words are used in our studies of these texts.  
395: The study of word/text occurrences in a straightforward way, with no
396: truncation nor stemming nor other preprocessing, typifies a great deal
397: of the work of Benz\'ecri, and his journal {\em Les Cahiers de 
398: l'Analyse des Donn\'ees}, published by the French publisher Dunod over
399: three decades up to 1996.  This work of Benz\'ecri is 
400: discussed in detail in Murtagh (2005b).  
401: 
We also assessed Porter stemming (Porter, 1980)
as an alternative 
to the use of whitespace- or punctuation-delimited words, and found little 
difference in the results.  
406: 
407: \subsection{Brothers Grimm}
408: 
409: As a homogeneous collection of texts we take 209 fairy tales of the Brothers 
410: Grimm (Ockerbloom, 2003), 
411: containing 7443 unique (in total 280,629) space- or 
412: punctuation-delimited words.  Story lengths were between 650 and 44,400 words.
413: 
414: To define a semantic context of increasing
415: resolution we took the most frequent 1000 words, followed by the most frequent
416: 2000 words, and finally all 7443 words.  
417: %(We tested extensively the case of 
418: %just the 100 most frequent words also.  But in view of the texts versus 
419: %words dimensionality implications, viz.\ $ n > m$ here, and the slightly 
420: %more tricky interpretation, we deliberately do not report on these 
421: %results here.) 
422: We constructed a cross-tabulation of numbers of occurrences of
423: each word in each one of the 209 fairy tales.  This led therefore to a 
424: set of frequency tables of dimensions: $209 \times 1000,
425: 209 \times 2000$ and $209 \times 7443$.    Through use of the $\chi^2$ 
426: distance between fairy tale texts, a correspondence analysis was carried out.  
427: From the three frequency tables, the contingency table crossing all pairs
428: of fairy tales could be examined; but it was far more convenient for us 
429: to proceed straight to the factor space, of dimension $209 - 1 = 208$.  The
430: factor space is Euclidean, so the correspondence analysis can be said to be
431: a mapping from the $\chi^2$ metric into a Euclidean metric space.  
432: 
433: 
434: %\begin{table}
435: %\begin{center}
436: %\begin{tabular}{|crrrr|} \hline
437: %\multicolumn{5}{c}{209 Brothers Grimm fairy tales} \\ \hline
438: %Texts  &  Dim.  &    Original   &  Dim. & Factors  \\ \hline
439: %%209    &  100   &     0.0273    &  99   & 0.1002  \\
440: %209    &  1000  &     0.0324    &  208    & 0.1189  \\
441: %209    &  2000  &     0.0334    &  208    & 0.1083  \\
442: %209    &  7443  &     0.0324    &  208    & 0.1154  \\ \hline
443: %\end{tabular}
444: %\end{center}
445: %\caption{Coefficient of ultrametricity.  
446: %Original: frequencies of occurrence matrix defined on the 209 texts 
447: %crossed by: % 100, 
448: %1000, 2000, and all = 7443, words.  Euclidean distance 
449: %defined on each pair of texts.  Factors: factor projections resulting 
450: %from correspondence analysis, with Euclidean distance used between each 
451: %pair of texts.}
452: %\label{tabcorr}
453: %\end{table} 
454: 
455: \begin{table}
456: \caption{Coefficient of ultrametricity, alpha.  
457: Input data: frequencies of occurrence matrices defined on the 209 texts 
458: crossed by: %100, 
459: 1000, 2000, and all = 7443, words.  
460: Alpha (ultrametricity coefficient) based
461: on factors: i.e., factor projections resulting 
462: from correspondence analysis, with Euclidean distance used between each 
463: pair of texts in factor space, of dimensionality 208.  
464: %The mean and standard deviations are each based on 20 realizations of 
465: %2000 triangles.
466: }
467: \label{tabcorrb}
468: \begin{center}
469: \setlength{\tabcolsep}{1mm}
470: \begin{tabular}{|crrrr|} \hline 
471:       &  \multicolumn{3}{c}{209 Brothers Grimm fairy tales}  &  \\ \hline
472: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline 
473: %209  &  100      & 99   &  0.0939   &  0.0063 \\
474: 209   &  1000     & 208  &  0.1236   &  0.0054 \\
475: 209   &  2000     & 208  &  0.1123   &  0.0065 \\
476: 209   &  7443    & 208  &  0.1147   &  0.0066 \\ \hline
477: \end{tabular}
478: \end{center}
479: \end{table} 
480: 
481: %The Euclidean distance was defined on the set of 209 fairy tales, based
482: %on the four different semantic contexts (i.e., based on characterization 
483: %by %100, 
484: %1000, 2000 and 7443 words).  
485: 
486: %Secondly the chi squared distance or weighted Euclidean distance between 
487: %profiles was used as an appropriate way to assess relative similarity. If
488: %$k_{ij}$ is the number of occurrences of word $k$ in text $i$, then the
489: %chi squared distance between texts $i$ and $i'$ is $d_\chi(i,i') =
490: %\sum_j k/k_j (k_{ij}/k_i - k_{i'j}/k_{i'}$ where for text $i$, $k_{ij}/k_i$
491: %for all words $j$ defines the text's profile; $k_i = \sum_j k_{ij}$; 
492: %similarly word $j$'s weight is $k_j = \sum_i k_{ij}$; and finally the 
493: %overall total of words in all texts is $k = \sum_i \sum_j k_{ij}$.  This 
494: %distance is well established for discrete data such as frequencies of 
495: %o%ccurence.  As can be seen, weights ($k_i$, $k_j$) are used to 
496: %c%ounter-balance overly frequent (or rare) words or unusually long (or 
497: %short) texts.  This chi squared metric is mapped into a Euclidean space
498: %by determining principal axes of orientation, which correspond to 
499: %axes of intertia, in correspondence analysis (Murtagh, 2005).  The factor
500: %projections will then define a Euclidean coordinate system.  It is this
501: %which we use, rather than the original chi squared metric, in our 
502: %experiments.   
503: 
504: %For the varying semantic resolution levels (viz., %100-, 
505: %1000-, 2000-, and 7443-dimensional) the inherent resolution level is not 
506: 
Table \ref{tabcorrb} (columns 4, 5) 
shows remarkable stability of the alpha ultrametricity
coefficient results: the mean value varies only between 0.1123 and 0.1236
across the three word sets.  Such stability will be seen in all further results 
to be presented below.  The ultrametricity is not high for the Brothers 
Grimm data: we recall that an alpha value of 0 means that no triangle is 
isosceles/equilateral.  We see that there is very little ultrametric
(hence hierarchical) structure in the Brothers Grimm data (based on our 
particular definition of ultrametricity/hierarchy).  
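
As a rough plausibility check on the standard deviations (a back-of-envelope
calculation added here, which assumes that the 2000 triangles in a
realization are drawn independently), the proportion of qualifying triangles
in one realization would have binomial standard deviation
$$ \sqrt{ \frac{\alpha (1 - \alpha)}{2000} } \approx 
\sqrt{ \frac{0.12 \times 0.88}{2000} } \approx 0.007 , $$
which is of the same order as the values of 0.0054 to 0.0066 in 
Table \ref{tabcorrb}.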
515: 
516: 
517: \subsection{Jane Austen}
518: 
519: To further study stories of a general sort, we use some works of the 
520: English novelist, Jane Austen.  
521: 
522: \begin{enumerate}
523: \item {\em Sense and Sensibility} (Austen, 1811), 
524: 50 chapters = files, chapter lengths from 1028 to 5632 words.
525: \item {\em Pride and Prejudice} (Austen, 1813), 
526: 61 chapters each containing between 683 and 5227 words. 
527: \item {\em Persuasion} (Austen, 1817), 24 chapters,
528: chapter lengths 1579 to 7007 words.
529: \item {\em Sense and Sensibility} split into 131 separate 
530: texts, each containing around 1000 words
531: (i.e., each chapter was split into files containing 5000 or fewer characters).
532: We did this to check on any influence by the size (total number of words) of
533: the text unit used (and we found no such influence).  
534: \end{enumerate}  
535: 
536: In all there were 266 texts containing a total of 9723 unique words.  We 
537: looked at the 1000, 2000 and all = 9723 most frequent words to 
538: characterize the texts by frequency of occurrence.
539: 
540: 
541: %\begin{table}
542: %\begin{center}
543: %\begin{tabular}{|crrrr|} \hline
544: %\multicolumn{5}{c}{266 J.\ Austen chapters or partial chapters} \\ \hline
545: %Texts  &  Dim.  &    Original   &  Dim.  &   Factors  \\ \hline
546: %%266    &  100   &     0.0409    &  99    &  0.1066  \\
547: %266    &  1000  &     0.0581    &  261   &  0.1521  \\
548: %266    &  2000  &     0.0601    &  262   &  0.1435  \\
549: %266    &  9723  &     0.0596    &  263   &  0.1420  \\ \hline
550: %\end{tabular}
551: %\end{center}
552: %\caption{Coefficient of ultrametricity.  
553: %Original: frequencies of occurrence matrix defined on the 266 texts 
554: %crossed by: %100, 
555: %1000, 2000, and all = 9273, words.  Euclidean distance 
556: %defined on each pair of texts.  Factors: factor projections resulting 
557: %from correspondence analysis, with Euclidean distance used between each 
558: %pair of texts.  Dimensionality of latter is necessarily less than $ 266 -1$,
559: %adjusted above for 0 eigenvalues = linear dependence.}
560: %\label{tabcorr2}
561: %\end{table} 
562: 
563: \begin{table}
564: \caption{Coefficient of ultrametricity, alpha.  
565: Input data: frequencies of occurrence matrices defined on the 266 texts 
566: crossed by: %100, 
567: 1000, 2000, and all = 9723, words.  
568: Alpha (ultrametricity coefficient) based
569: on factors: i.e., factor projections resulting 
570: from correspondence analysis, with Euclidean distance used between each 
571: pair of texts in factor space.  
The dimensionality of the latter is necessarily $\leq 266 - 1$,
and is further reduced where there are 0-valued eigenvalues, implying linear dependence. 
574: %The mean and standard deviations are each based on 40,000 realizations of 
575: %triangles.
576: }
577: \label{tabcorr2b}
578: \begin{center}
579: \setlength{\tabcolsep}{1mm}
580: \begin{tabular}{|crrrr|} \hline 
581:   & \multicolumn{3}{c}{266 Austen chapters or partial chapters} & \\ \hline
582: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline 
583: %266  &  100      & 99   &  0.1001   &  0.0068 \\
584: 266   &  1000     & 261  &  0.1455   &  0.0084 \\
585: 266   &  2000     & 262  &  0.1489   &  0.0083 \\
586: 266   &  9723     & 263  &  0.1404   &  0.0075 \\ \hline
587: \end{tabular}
588: \end{center}
589: \end{table} 
590: 
Table \ref{tabcorr2b}, again displaying very stable alpha values, indicates
that the Austen corpus is slightly more ultrametric than the Grimm
corpus (Table \ref{tabcorrb}).
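
In terms of the tabulated means, the differences between the Austen and
Grimm values are
$$ 0.1455 - 0.1236 = 0.0219, \quad 0.1489 - 0.1123 = 0.0366, \quad 
0.1404 - 0.1147 = 0.0257 $$
for the 1000 word, 2000 word and full word sets, respectively.  These
differences are between about 2.6 and 4.4 times the largest tabulated
standard deviation, 0.0084.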
594: 
595: \subsection{Air Accident Reports}
596: 
597: We used air accident reports to explore documents with very particular,
598: technical, vocabulary.  
599: The NTSB aviation accident database  
600: (Aviation Accident Database and Synopses, 2003)
601: contains information 
602: about civil aviation accidents in the United States and elsewhere.
We selected 50 reports.  Two examples of the reports used
by us are: an accident that occurred on Sunday, January 02, 2000 in Corning, AR,
aircraft Piper PA-46-310P, injuries -- 5 uninjured; and one that occurred on Sunday,
January 02, 2000 in Telluride, TN, aircraft Bellanca BL-17-30A,
injuries -- 1 fatal.  In the 50 reports, there were 55,165 words.
608: Report lengths ranged between approximately 2300 and 28,000 words. The
609: number of unique words was 4261.
610: 
The start of report 30 reads as follows: {\em On January 16, 2000, about
612: 1630 eastern standard time (all times are eastern standard time,
613: based on the 24 hour clock), a Beech P-35, N9740Y, registered to a
614: private owner, and operated as a Title 14 CFR Part 91 personal
615: flight, crashed into Clinch Mountain, about 6 miles north of
616: Rogersville, Tennessee. Instrument meteorological conditions prevailed
617: in the area, and no flight plan was filed. The aircraft incurred
618: substantial damage, and the private-rated pilot, the sole occupant,
619: received fatal injuries. The flight originated from Louisville,
620: Kentucky, the same day about 1532.}
621: 
622: %\begin{table}
623: %\begin{center}
624: %\begin{tabular}{|crrrr|} \hline
625: %\multicolumn{5}{c}{50 aviation accident reports} \\ \hline
626: %Texts  &  Dim.  &    Original   &  Dim.  &   Factors  \\ \hline
627: %%50    &  100   &     0.0270    &  48    &   0.1063  \\
628: %50    &  1000  &     0.0407    &  48  &   0.1317  \\
629: %50    &  2000  &     0.0407    &  48   &  0.1212  \\
630: %50    &  4261  &     0.0413    &  48   &  0.1180   \\ \hline
631: %\end{tabular}
632: %\end{center}
633: %\caption{Coefficient of ultrametricity.  
634: %Original: frequencies of occurrence matrix defined on the 50 texts 
635: %crossed by: %100, 
636: %1000, 2000, and all = 4261, words.  Euclidean distance 
637: %defined on each pair of texts.  Factors: factor projections resulting 
638: %from correspondence analysis, with Euclidean distance used between each 
639: %pair of texts.  Dimensionality of latter is necessarily less than $ 50 -1$,
640: %adjusted above for 0 eigenvalues = linear dependence.}
641: %\label{tabcorr4}
642: %\end{table} 
643: 
644: \begin{table}
645: \caption{Coefficient of ultrametricity, alpha.  
646: Input data: frequencies of occurrence matrices defined on the 50 texts 
647: crossed by: %100, 
648: 1000, 2000, and all = 4261, words.  
649:  Alpha (ultrametricity coefficient) based
650: on factors: i.e., factor projections resulting 
651: from correspondence analysis, with Euclidean distance used between each 
652: pair of texts in factor space.  
653: The dimensionality of the latter is at most $50 - 1 = 49$,
654: reduced here to 48 on account of one 0-valued eigenvalue,
655: implying linear dependence. 
656: %The mean and standard deviations are each based on 40,000 realizations 
657: %triangles.
658: }
659: \label{tabcorr4b}
660: \begin{center}
661: \setlength{\tabcolsep}{1mm}
662: \begin{tabular}{|crrrr|} \hline 
663:   & \multicolumn{3}{c}{50 aviation accident reports} &  \\ \hline
664: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline 
665: %50  &  100      & 48   &  0.1101   &  0.0081 \\
666: 50   &  1000     & 48  &  0.1338   &  0.0077 \\
667: 50   &  2000     & 48  &  0.1186   &  0.0058 \\
668: 50   &  4261     & 48  &  0.1154   &  0.0050 \\ \hline
669: \end{tabular}
670: \end{center}
671: \end{table} 
672: 
673: 
674: In Table \ref{tabcorr4b} we find ultrametricity values that are marginally
675: greater than those found for the Brothers Grimm (Table \ref{tabcorrb}).  It 
676: could be argued that the latter, too, uses its own technical 
677: vocabulary.   We would need to use more data to see if we can clearly 
678: distinguish between the (small) ultrametricity levels of these two 
679: corpora.  
680: 
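As a rough, purely illustrative check on how close these two values are, 
one can take the full word set alpha values listed in Table \ref{tabsum} 
and the standard deviation in the last row of Table \ref{tabcorr4b}.  
Using this standard deviation as a yardstick for the difference is an 
assumption made here for illustration, and not a formal test:
\[
0.1154 - 0.1147 = 0.0007 \approx 0.14 \times 0.0050 .
\]
The difference is therefore well within one standard deviation of the 
aviation estimate.  
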
681: 
682: \subsection{DreamBank}
683: 
684: With dream reports (i.e., reports by individuals on their remembered 
685: dreams) we depart from a technical vocabulary, and instead raise the 
686: question as to whether dream reports can perhaps be considered as types
687: of fairy tale or story, or even akin to accident reports.  
688: 
689: From the DreamBank repository (Domhoff, 2003; DreamBank, 2004; Schneider
690: and Domhoff, 2004) 
691: we selected the following collections:
692: \begin{enumerate}
693: \item ``Alta: a detailed dreamer,'' in period 1985--1997, 422 dream reports.
694: \item  ``Chuck: a physical scientist,''  in period
695: 1991--1993,  75 dream reports.
696: \item ``College women,'' in period 1946--1950,  681 dream reports.
697: \item ``Miami Home/Lab,''  in period  1963--1965,  445 dream reports.
698: \item ``The Natural Scientist,''  1939,  234 dream reports.
699: \item ``UCSC women,''  1996,  81 dream reports.
700: \end{enumerate}
701: 
702: To have adequate length reports, we requested report sizes of between
703: 500 and 1500 words.  With this criterion, from (1) we obtained 118 reports,
704: from (2) and (6) we obtained no reports, from (3) we obtained 15 reports,
705: from (4) we obtained 73 reports, and finally from (5) we obtained 8 reports.
706: In all, we used 214 dream reports, comprising 13696 words.
707: 
708: Sample of start of report 100: {\em I'm delivering a car to a man --
709: something he's just bought, a Lincoln
710: Town Car, very nice. I park it and go down the street to find him -- he
711: turns out to be an old guy, he's buying the car for nostalgia -- it turns
712: out to be an old one, too, but very nicely restored, in excellent
713: condition. I think he's black, tall, friendly, maybe wearing overalls. I
714: show him the car and he drives off. I'm with another girl who drove
715: another car and we start back for it but I look into a shop first -- it's
716: got outdoor gear in it - we're on a sort of mall, outdoors but the shops
717: face on a courtyard of bricks. I've got something from the shop just
718: outside the doors, a quilt or something, like I'm trying it on, when
719: it's time to go on for sure so I leave it on the bench. We go further,
720: there's a group now, and we're looking at this office facade for the
721: Honda headquarters.}
722: 
723: In addition to the above we took another set of dream reports, from one
724: individual, Barbara Sanders.  A more reliable (according to DreamBank, 2004)
725: set comprised 139 reports, and a second set comprised 32 reports.  In all,
726: 171 reports were used from this person.  Typical lengths ranged from about 2500
727: up to 5322.  The total number of words in the Barbara Sanders set of
728: dream reports was 107,791.
729: 
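The report counts given above can be tallied as follows.  The first sum 
runs over collections (1), (3), (4) and (5); the second sum covers the two 
Barbara Sanders sets; and the third combines the two groups:
\[
118 + 15 + 73 + 8 = 214, \qquad 139 + 32 = 171, \qquad 214 + 171 = 385 .
\]
The combined total matches the 385 texts reported in 
Table \ref{tabcorr3b}.  
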
730: 
731: %\begin{table}
732: %\begin{center}
733: %\begin{tabular}{|crrrr|} \hline
734: %\multicolumn{5}{c}{385 dream reports} \\ \hline
735: %Texts  &  Dim.  &    Original   &  Dim.  &   Factors  \\ \hline
736: %%385    &  100   &     0.0780    &  99    &   0.1379  \\
737: %385    &  1000  &     0.1122    &  384  &   0.2048  \\
738: %385    &  2000  &     0.1057    &  384   &  0.2137  \\
739: %385    &  11441  &     0.1288    &  384   &  0.1958   \\ \hline
740: %\end{tabular}
741: %\end{center}
742: %\caption{Coefficient of ultrametricity.  
743: %Original: frequencies of occurrence matrix defined on the 385 texts 
744: %crossed by: %100, 
745: %1000, 2000, and all = 11441, words.  Euclidean distance 
746: %defined on each pair of texts.  Factors: factor projections resulting 
747: %from correspondence analysis, with Euclidean distance used between each 
748: %pair of texts.  Dimensionality of latter is necessarily less than $ 266 -1$,
749: %adjusted above for 0 eigenvalues = linear dependence.}
750: %\label{tabcorr3}
751: %\end{table} 
752: 
753: \begin{table}
754: \caption{Coefficient of ultrametricity, alpha.  
755: Input data: frequencies of occurrence matrices defined on the 385 texts 
756: crossed by: %100, 
757: 1000, 2000, and all = 11441, words.  
758: Alpha (ultrametricity coefficient) based
759: on factors: i.e., factor projections resulting 
760: from correspondence analysis, with Euclidean distance used between each 
761: pair of texts in factor space, of dimensionality $ 385 -1 = 384$.  
762: %The mean and standard deviations are each based on 40,000 
763: %realizations of triangles.
764: }
765: \label{tabcorr3b}
766: \begin{center}
767: \setlength{\tabcolsep}{1mm}
768: \begin{tabular}{|crrrr|} \hline 
769:  & \multicolumn{3}{c}{385 dream reports}  & \\ \hline
770: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline 
771: %385  &  100      & 99   &  0.1413   &  0.0090 \\
772: 385   &  1000     & 384  &  0.1998   &  0.0088 \\
773: 385   &  2000     & 384  &  0.1876   &  0.0095 \\
774: 385   &  11441    & 384  &  0.1933   &  0.0087 \\ \hline
775: \end{tabular}
776: \end{center}
777: \end{table} 
778: 
779: First we analyzed all dream reports, furnishing Table \ref{tabcorr3b}. 
780: 
781: In order to look at a more homogeneous subset of dream reports, we 
782: then analyzed separately 
783: the Barbara Sanders set of 171 reports, leading to Table \ref{tabcorr333b}.  
784: (Note that this analysis is on a subset of 
785: the previously analyzed dream reports, Table \ref{tabcorr3b}).  
786: The Barbara Sanders subset of 171 reports contained 7044
787: unique words in all.  
788: 
789: 
790: Compared to Table \ref{tabcorr3b}, based on the entire dream report 
791: collection, Table \ref{tabcorr333b}, which is based on one person, 
792: shows, on average, higher ultrametricity levels.  It is interesting to note
793: that the dream reports, collectively, have higher values of alpha than 
794: the corpora analyzed previously; and that the ultrametricity level is 
795: raised again when the data used relates to one person.  
796: 
797: \subsection{James Joyce's Ulysses, and Overall Summary}
798: 
799: We carried out a study of James Joyce's {\em Ulysses}, comprising 
800: 304,414 words in total.  We broke this text into 183 separate files, 
801: each comprising between approximately 1400 and 2000 words.  The number of 
802: unique words in these 183 files was found to be 28,649.  The 
803: ultrametricity alpha values for this collection of 183 Joycean texts 
804: were found to be less than the Barbara Sanders values, but higher than the 
805: global set of all dream reports.  
806: % CORRECTION WITH NEW PROGRAMS 9 MAY:  no of unique words was up from 28,631
807: % NEW MEAN FOR 7000 WAS: 0.2057
808: For 183 text segments, with frequencies of occurrence of 7000 (top-ranked)
809: words, we found a mean alpha of 0.2057, with standard deviation 0.0092.
810: 
811: %\begin{table}
812: %\begin{center}
813: %\begin{tabular}{|crrrr|} \hline
814: %\multicolumn{5}{c}{171 Barbara Sanders dream reports} \\ \hline
815: %Texts  &  Dim.  &    Original   &  Dim.  &   Factors  \\ \hline
816: %%171    &  100   &     0.0816    &  99    &   0.1405  \\
817: %171    &  1000  &     0.1212    &  170  &   0.2470  \\
818: %171    &  2000  &     0.1293    &  170   &  0.2110  \\
819: %171    &  11441  &     0.1324    &  170   &  0.2404   \\ \hline
820: %\end{tabular}
821: %\end{center}
822: %\caption{Coefficient of ultrametricity.  
823: %Original: frequencies of occurrence matrix defined on the 171 texts 
824: %crossed by: %100, 
825: %1000, 2000, and all = 7044, words.  Euclidean distance 
826: %defined on each pair of texts.  Factors: factor projections resulting 
827: %from correspondence analysis, with Euclidean distance used between each 
828: %pair of texts.  Dimensionality of latter is necessarily less than $ 171 -1$,
829: %with no adjustment necessary for 0 eigenvalues = linear dependence.}
830: %\label{tabcorr333}
831: %\end{table} 
832: 
833: \begin{table}
834: \caption{Coefficient of ultrametricity, alpha.  
835: Input data: frequencies of occurrence matrices defined on the 171 texts 
836: crossed by: %100, 
837: 1000, 2000, and all = 7044, words.  
838: Alpha (ultrametricity coefficient) based
839: on factors: i.e., factor projections resulting 
840: from correspondence analysis, with Euclidean distance used between each 
841: pair of texts in factor space, of dimensionality $ 171 -1 = 170$. 
842: %The mean and standard deviations are each based on 40,000 
843: %realizations of triangles.
844: }
845: \label{tabcorr333b}
846: \begin{center}
847: \setlength{\tabcolsep}{1mm}
848: \begin{tabular}{|crrrr|} \hline 
849:  & \multicolumn{3}{c}{171 Barbara Sanders dream reports}  & \\ \hline
850: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline 
851: %171  &  100      & 99   &  0.1592   &  0.0063 \\
852: 171   &  1000     & 170  &  0.2250   &  0.0089 \\
853: 171   &  2000     & 170  &  0.2256   &  0.0112 \\
854: 171   &  7044     & 170  &  0.2603   &  0.0108 \\ \hline
855: \end{tabular}
856: \end{center}
857: \end{table} 
858: 
859: %Ulysses text: http://www.lib.ru/DVOJS/ulysses.txt
860: 
861: A summary of all our results is in Table \ref{tabsum}.  A few words of explanation
862: follow.  The lower values of ultrametricity can be explained by a more
863: common, shared word set; viz., shared over the text segment set.  The 
864: higher values of ultrametricity are associated with dreams, in particular 
865: with a single dreamer, and with {\em Ulysses}: one could argue that 
866: characteristics of these data sets include frequent changes in interest,
867: and frequent replacement of one scene, and one set of personages,
868: with another.  In factor space, this implies that a triplet of points 
869: is more likely to be isosceles with small base, or equilateral, compared to 
870: the alternative (low ultrametricity case) of smoother transitions from 
871: one sentence, paragraph or section to another.   
872: 
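For reference, the geometric statement above can be tied to the standard 
ultrametric (strong triangle) inequality.  The symbols $d$, $x$, $y$, $z$ 
and $d_1, d_2, d_3$ are introduced here only for illustration.  A distance 
$d$ is ultrametric if, for every triplet of points $x, y, z$,
\[
d(x,z) \leq \max \{ d(x,y), d(y,z) \} .
\]
Ordering the three sides of a triangle as $d_1 \leq d_2 \leq d_3$ and 
applying the inequality to the longest side gives 
$d_3 \leq \max \{ d_1, d_2 \} = d_2$, and hence $d_2 = d_3$.  The two 
longest sides are therefore equal, so the triangle is either isosceles 
with small base ($d_1 < d_2 = d_3$) or equilateral ($d_1 = d_2 = d_3$).  
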
873: \begin{table}
874: \begin{center}
875: \begin{tabular}{|lrrr|}\hline
876: Data                 &   No. texts &  No. words &  ultrametricity  \\ \hline
877: Grimm tales          &   209       &    7443    &     0.1147       \\
878: aviation accidents   &    50       &    4261    &     0.1154       \\
879: Jane Austen novels   &   266       &    9723    &     0.1404       \\
880: dream reports        &   385       &   11441    &     0.1933       \\
881: Joyce's Ulysses      &   183       &   28649    &     0.2057       \\
882: single person dreams &   171       &    7044    &     0.2603       \\ \hline
883: \end{tabular}
884: \end{center}
885: \caption{Summary of results for the full word set, with the exception of
886: the Joyce data, where 7000 words were used.  The ultrametricity is the 
887: alpha measure used throughout this article, where 1 is respect for 
888: ultrametricity by all triangles, and 0 is non-respect in all cases.}
889: \label{tabsum}
890: \end{table}
891: 
892: \section{Conclusion}
893: 
894: We studied a range of text corpora, comprising over 1000 texts, or text
895: segments,  
896: containing over 1.3 million words.  We found very stable ultrametricity 
897: quantifications of the text collections, across numbers of most frequent 
898: words used to characterize the texts, and sampling of triplets of texts.  
899: We also found that in all cases (save, perhaps, the Brothers Grimm versus 
900: air accident reports) there was a clear distinction between the ultrametricity
901: values of the text collections.  
902: 
903: %We end with a few remarks which much remain as speculation until far more 
904: %sizable tests have been carried out (involving a far greater number of texts).
905: %However even speculation serves to motivate future work. 
906: Some very intriguing ultrametricity characterizations were found in our
907: work.  For example, we found that the technical vocabulary of air accidents 
908: did not differ greatly in terms of inherent ultrametricity compared to the 
909: Brothers Grimm fairy tales.  Secondly we found that novelist Austen's 
910: works were distinguishable from the Grimm fairy tales.  Thirdly we found 
911: dream reports to have a higher ultrametricity level than the other 
912: text collections.  Further exploration of these issues will require 
913: availability of very high quality textual data.  
914: 
915: Values of our alpha ultrametricity coefficient were small but 
916: revealing and useful nonetheless.  Ultrametricity implies hierarchical
917: embedding, or structuring in terms of embedded sets.  This is what we are
918: finding locally (and not globally) in our data. The use of such 
919: hierarchical fragments as relations of dominance between concepts could be
920: of use for ontologies.  
921: 
922: Ontologies, or concept hierarchies, are used 
923: to help the user in information retrieval in a range of ways including: 
924: tree-based homing in on content to be retrieved; characterizing the 
925: content of data repositories before querying starts; 
926: and disambiguating different
927: but overlapping content domains.  In \cite{autoonto} we explore the use
928: of local ultrametric embedding for ontology fragments.  As an example, 
929: we use Aristotle's {\em Categories} and some other modern texts (on 
930: ubiquitous computing, and from Wikipedia), and we 
931: also discuss an online web-based demonstrator supporting retrieval through 
932: a visual user interface.  
933: 
934: 
935: \begin{thebibliography}{99}
936: 
937: \bibitem{refa1}
938: Austen, J. (1811).  {\em Sense and Sensibility}.  Available at: \\
939: http://www.pemberley.com/etext/SandS
940: 
941: \bibitem{refa2}
942: Austen, J. (1813).  {\em Pride and Prejudice}.  Available at: \\
943: http://www.pemberley.com/etext/PandP
944: 
945: \bibitem{refa3}
946: Austen, J. (1817).  {\em Persuasion}.  Available at: \\
947: http://www.pemberley.com/etext/Persuasion
948: 
949: %\bibitem{ref1}
950: %A.-L. Barab\'asi, ``Self-organized networks: resources'', at 
951: %www.nd.edu/$\sim$networks/database (2004).
952: 
953: \bibitem{ref2}
954: Benz\'ecri, J.P. (1979a).  {\em L'Analyse des Donn\'ees Tome 1, 
955: La Taxinomie}, 2nd ed., Dunod, Paris.
956: 
957: \bibitem{ref3}
958: Benz\'ecri, J.P. (1979b).  {\em L'Analyse des Donn\'ees Tome 2, 
959: Correspondances}, 2nd ed., Dunod, Paris.
960: 
961: %\bibitem{ref4}
962: %G. Caldarelli, A. Erzan and A. Vespignani, Eds., Special issue on Networks,
963: %European Physical Journal B {\bf 38}, no. 2 (2004). 
964: 
965: %\bibitem{ref5}
966: %Comtet, L. (1974).  {\em Advanced Combinatorics}, Reidel, Dordrecht.
967: 
968: \bibitem{ref6}
969: Domhoff, G.W. (2003).
970: {\em The Scientific Study of Dreams: Neural Networks,
971: Cognitive Development and Content Analysis}, American Psychological
972: Association.
973: 
974: %\bibitem{ref7}
975: %Donaghey, R. (1975).
976: %Alternating Permutations and Binary Increasing Trees,
977: %{\em Journal of Combinatorial Theory (A)}, 18:  141--148.
978: 
979: \bibitem{ref8} 
980: DreamBank (2004), Repository of Dream Reports, www.dreambank.net
981: 
982: \bibitem{gom}
983: G\'omez-P\'erez, A., Fern\'andez-L\'opez, M. and Corcho, O. (2004).
984: {\em Ontological Engineering (with Examples from the Areas of Knowledge
985: Management, e-Commerce and the Semantic Web)}, Springer, Berlin.
986: 
987: %\bibitem{ref9} 
988: %J.C. Gower, ``Some distance properties of latent root and vector
989: %methods used in multivariate analysis'',  Biometrika {\bf 53}, 325
990: %(1966).  % 325--328
991: 
992: \bibitem{ref10}
993: Lerman, I.C. (1981). 
994: {\em Classification et Analyse Ordinale des Donn\'ees},
995: Dunod, Paris.
996: 
997: \bibitem{ref11} 
998: Murtagh,  F. (1983).  A Survey of Recent Advances in Hierarchical 
999: Clustering Algorithms, {\em The Computer Journal}, 26: 
1000: 354--359.
1001: 
1002: %\bibitem{ref12} 
1003: %Murtagh, F. (1984). 
1004: %Counting Dendrograms: A Survey,
1005: %{\em Discrete Applied Mathematics}, 7: 191--199. 
1006: 
1007: \bibitem{ref13}
1008: Murtagh, F. (1985). 
1009: {\em Multidimensional Clustering Algorithms},
1010: Physica-Verlag, W\"urzburg.
1011: 
1012: \bibitem{ref14}
1013: Murtagh,  F. (2004).  On Ultrametricity, Data Coding, and Computation,
1014: {\em Journal of Classification}, 21: 167--184.
1015: 
1016: \bibitem{ref15} 
1017: Murtagh, F. (2005a).  Identifying the Ultrametricity of Time Series, 
1018: {\em European Physical Journal B}, 43: 573--579.
1019: 
1020: \bibitem{ref16} 
1021: Murtagh, F. (2005b).  {\em 
1022: Correspondence Analysis and Data Coding with Java and R},
1023: Chapman and Hall/CRC Press, New York.  
1024: 
1025: \bibitem{autoonto}
1026: Murtagh, F., Mothe, J. and Englmeier, K. (2007).  Ontology from local
1027: hierarchical structure in text.  http://arxiv.org/abs/cs.IR/0701180
1028: 
1029: \bibitem{ref17} NTSB 
1030: Aviation Accident Database and Synopses (2003), 
1031: National Transportation Safety Board,
1032: accessible from http://www.landings.com
1033: %/evird.acgi\$pass*59062640!\_h-www.landings.com/\_landings/
1034: %pages/search/rep-ntsb.html 
1035: 
1036: 
1037: \bibitem{ref18}
1038: Ockerbloom, J.M. (2003). {\em Grimms' Fairy Tales}, 
1039: http://www-2.cs.cmu.edu/$\sim$spok/grimmtmp
1040: 
1041: \bibitem{por}
1042: Porter, M.F. (1980). An Algorithm for Suffix Stripping, 
1043: {\em Program}, 14: 130--137.
1044: 
1045: \bibitem{ref19}
1046: Rammal, R.,  Toulouse, G. and Virasoro, M.A. (1986). 
1047: Ultrametricity for
1048: Physicists, {\em Reviews of Modern Physics}, 58: 765--788.
1049: 
1050: \bibitem{sas}
1051: Sasaki, F. and P\"onninghaus, J. (2003). 
1052: Testing Structural Properties in Textual Data: Beyond Document Grammars, 
1053: {\em Literary and Linguistic Computing}, 18: 89--100.
1054: 
1055: \bibitem{ref20} 
1056: Schneider, A. and Domhoff, G.W. (2004). The Quantitative Study of Dreams, 
1057: http://dreamresearch.net 
1058: 
1059: \bibitem{refxy}
1060: Schweinberger, M. and Snijders, T.A.B. (2003).  Setting in Social Networks:
1061: A Measurement Model, {\em Sociological Methodology}, 33: 307--342.
1062: 
1063: %\bibitem{ref21} 
1064: %W.S. Torgerson, 
1065: %Theory and Methods of Scaling (Wiley, New York, 1958).
1066: 
1067: %\bibitem{ref22} 
1068: %C.J. van Rijsbergen, Information Retrieval, 2nd ed. 
1069: %(Butterworths, 1979).
1070: 
1071: %\bibitem{ref23} 
1072: %A. Trusina, S. Maslov, P. Minnhagen and K. Sneppen, 
1073: %``Hierarchy measures in complex networks'',   
1074: %Physical Review Letters {\bf 92}, 178702(4) (2004).
1075: 
1076: \end{thebibliography}
1077: 
1078: \end{document}
1079: 
1080: 
1081: 
1082: