
1: \documentclass[12pt,singlespacing]{article}

2: \usepackage{amssymb}

3: \usepackage{amsfonts}

4: \usepackage{graphicx}

5: \newcommand{\R}{\mathbb{R}}

6: \newcommand{\Z}{\mathbb{Z}}

7:

8: \begin{document}

9: \title{A Note on Local Ultrametricity in Text}

10: \author{Fionn Murtagh \\

11: Department of Computer Science \\

12: Royal Holloway University of London \\

13: Egham, Surrey TW20 0EX, England \\

14: E-mail fmurtagh@acm.org}

15:

16: \maketitle

17:

18: \begin{abstract}

19: High dimensional, sparsely populated data spaces have been characterized in

20: terms of ultrametric topology.  This implies that there are natural, not

21: necessarily unique, tree or hierarchy structures defined by the

22: ultrametric topology.  In this note we study the extent of local

23: ultrametric topology in texts, with the aim of

24: finding unique ``fingerprints'' for a text or corpus, discriminating between

25: texts from different domains, and opening up the possibility of

26: exploiting hierarchical structures in the data.

27: We use coherent and meaningful collections of

28: over 1000 texts, comprising over 1.3 million words.

29: \end{abstract}

30: %network \sep complete graph \sep edge weighted \sep metric \sep

31: %Euclidean \sep ultrametric \sep chi squared metric

32: %\PACS{

33: %{89.75.Hc}{Networks and genealogical trees} \and

34: %{02.50.Sk}{Multivariate analysis} \and

35: %{89.75.Kd}{Patterns} \and

36: %{89.75.Fb}{Structures and organization in complex systems}

37: %} % end of PACS

38: %}  % end of abstract

39:

40:

41:

42: \section{Introduction}

43:

44: Structures that are inherent to data of any type can be of importance, and

45: hierarchical structure is a prime example.   In this work we take text

46: corpora and assess the extent of hierarchical structure among words

47: constituting the texts.  By comprehensively taking context into account we

48: seek to study hierarchical structures in the domain semantics.

49:

The data studied in Rammal et al.\ (1986) and Murtagh (2004) are point pattern
data: observational features with their measurements on many coordinate
dimensions.  Data may instead be presented as time-varying signals and,
building on the findings of Rammal et al.\ (1986) and Murtagh (2004),
we investigated ultrametric-related
properties of time series or 1D signals in
Murtagh (2005a).  In that time series work, we encoded the data in a
particular way.  In this paper, we show how texts can also be
characterized in a similar manner.

60:

The triangle inequality holds for a metric space: $d(x,z) \leq
d(x,y) + d(y,z)$ for any triplet
of points $x,y,z$.  In addition, the properties
of symmetry and positive definiteness are respected.  The ``strong
triangular inequality'' or ultrametric inequality is: $d(x,z) \leq
\max \{ d(x,y), d(y,z) \}$ for any triplet $x,y,z$.  An ultrametric
space satisfies a range of stringent properties.  For example,
the triangle formed by any triplet is necessarily isosceles, with the two
large sides equal; or is equilateral.  Any agglomerative hierarchical
procedure (cf.\ Benz\'ecri, 1978; Lerman, 1981; Murtagh, 1983, 1985) can
impose hierarchical structure.  Our aim in this work is instead to assess the
inherent extent of hierarchical structure.
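
To see why the isosceles property follows, a short derivation may help (this is our own illustration, using only the ultrametric inequality).  Label the triplet so that $d(x,y) \leq d(y,z) \leq d(x,z)$.  Applying the ultrametric inequality to the longest side gives
$$ d(x,z) \leq \max \{ d(x,y), d(y,z) \} = d(y,z) \leq d(x,z) , $$
so $d(x,z) = d(y,z)$: the two largest sides are equal, and the remaining side $d(x,y)$ is no longer than either.  For instance, side lengths $(2, 3, 3)$ satisfy the ultrametric inequality, whereas side lengths $(2, 3, 4)$ satisfy the triangle inequality but not the ultrametric one, since $4 > \max \{ 2, 3 \}$.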

73:

74: We take a large

75: number of coherent collections of meaningful texts.  Through shared words,

76: we can define a similarity network between all texts in each of the

77: collections we chose.  Aspects of the semantics of the given collection are

78: captured in this way.  We investigate how ultrametric each of these

79: semantic networks is.

80:

81: %We select texts

82: %each containing roughly 500 to 1000 words (but as will be seen below,

83: %some texts had up to around 44,000 words).

Our selected texts in this study are in English and
do not contain accented characters (although these could easily be catered for).
The collections were: fairy tales by the Brothers Grimm; novels by the English
writer, Jane Austen; in order to have very technical language, aircraft
accident reports from the US National Transportation Safety Board; and in order
to seek linkages with biological and cognitive processes, a range of
dream reports from the online DreamBank repository.

91:

92: We find clear distinctions between the semantic networks (or text collections)

93: studied, in terms of their relative (albeit small) extent of ultrametricity.

94:

95: Our objectives in such assessment of inherent, local, hierarchical

96: structure include the following:

97:

98: \begin{enumerate}

\item Ontologies (see e.g.\ G\'omez-P\'erez et al., 2004) have become of

100: great interest to facilitate information resource discovery, and to

101: support querying and retrieval of information, in current areas of work

102: such as the semantic web.  Automatic or semi-automatic

103: construction of ontologies is aided greatly by hierarchical relationships

104: between terms.  The characterizing of texts in terms of local

105: hierarchical structure simultaneously provides justification for unambiguous

106: local hierarchies.  (We return to this issue of ontology creation

107: in the Conclusion.)

108:

109: \item Structures defined on terms that are more general than grammars

110: may be of use in modelling and assessing consistency of textual data

111: (see Sasaki and P\"onninghaus, 2003); and perhaps in mapping some aspects of

112: semantics and flow of reason and logic in text.

113: %(for example, providing

114: %a quantitative expression of Freud's concepts of

115: % condensation and displacement).

116:

117: \item Limited extent of hierarchical structure may point to the

118: undesirability of a global tree or hierarchical clustering model for the

119: text or set of texts.  However for the same reason, a set of

120: local hierarchical clusterings, or a forest of (locally defined) trees, may be

121: more appropriate.

122:

We note that our work is quite different from Leo
Breiman's random forest methodology, where multiple classification trees are
fitted to a
data set.  Our work, by contrast, is directed towards finding
``shrubs'' or tree fragments in a data set.

128:

129: \item Latent ultrametric distances were estimated by Schweinberger and Snijders

130: (2003) in order to represent transitive structures among pairwise

131: relationships.

132:

133: \item Further motivation is provided by fingerprinting of authorship, and

134: document clustering (e.g.\ to facilitate retrieval).

135:

136: \end{enumerate}

137:

138: \section{Methodology}

139:

140: We employ correspondence analysis for metric embedding,

141: followed by determination of the extent of  ultrametricity, in factor

142: space, based on the alpha coefficient of ultrametricity.  Our motivation

143: for using precisely this Euclidean embedding is as follows.  Our input

144: data is in the form of frequencies of occurrence.  Now, a Euclidean distance

145: defined on vectors with such values is not appropriate.

146:

The $\chi^2$ distance
is an appropriate weighted Euclidean distance for use with such data
(Benz\'ecri, 1979; Murtagh, 2005b).
Consider texts $i$ and $i'$ crossed by words $j$.  Let $k_{ij}$ be the number of
occurrences of word $j$ in text $i$, with row total $k_i = \sum_j k_{ij}$
and column total $k_j = \sum_i k_{ij}$.  Then, omitting a constant,
the squared $\chi^2$ distance between texts $i$ and $i'$ is given by
$ \sum_j \frac{1}{k_j} \left( k_{ij}/k_i - k_{i'j}/k_{i'} \right)^2$.  The weighting term is
$1/k_j$.  The weighted Euclidean distance is between the {\em profile}
of text $i$, viz.\ $k_{ij}/k_i$ for all $j$, and the analogous
{\em profile} of text $i'$.
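
As a minimal sketch of this computation (our own illustration in Python, not the software used for the results reported in this article), the squared $\chi^2$ distance between two text profiles can be computed directly from a table of counts.  The function name \texttt{chi2\_distance\_sq} and the toy counts are assumptions made purely for the example; the constant factor is omitted, as in the formula above.

\begin{verbatim}
```python
# Illustrative sketch only: squared chi-squared distance between the
# profiles of two texts (constant factor omitted).
# The counts below are invented toy data, not taken from the corpora.

def chi2_distance_sq(counts, i, i2):
    """Squared chi-squared distance between rows i and i2 of a count table.

    counts[i][j] plays the role of k_ij: occurrences of word j in text i.
    Assumes both texts are non-empty (non-zero row totals).
    """
    k_i = sum(counts[i])      # total number of words in text i
    k_i2 = sum(counts[i2])    # total number of words in text i'
    total = 0.0
    for j in range(len(counts[0])):
        k_j = sum(row[j] for row in counts)   # column total for word j
        if k_j == 0:
            continue                           # word never occurs: skip it
        diff = counts[i][j] / k_i - counts[i2][j] / k_i2
        total += diff * diff / k_j
    return total

toy = [
    [4, 2, 0],   # text 0
    [8, 4, 0],   # text 1: twice as long, same profile as text 0
    [1, 1, 6],   # text 2: a different profile
]
d01 = chi2_distance_sq(toy, 0, 1)
d02 = chi2_distance_sq(toy, 0, 2)
print(d01, d02)
```
\end{verbatim}

In this toy table, texts 0 and 1 have the same profile, so their distance is zero although their lengths differ; this is the sense in which the distance compares profiles rather than raw counts.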

157:

158:

159: \subsection{Alpha Coefficient of Ultrametricity}

160:

The definition of ultrametricity introduced in Murtagh (2004), and justified
there relative to alternatives, is in
summary as follows.  For all triplets of points, we consider the three
internal angles of the triangle they form.  We require that the smallest angle be less than or equal
to 60 degrees.  Then we require that the two remaining angles be
approximately equal.  Approximate equality is defined as a difference of less than 2 degrees,
in order to cater for imprecise coordinate measurement (e.g., due to
floating point values) in an acceptable way.  Satisfying these angular
constraints implies that the triplet of points defines an approximately
isosceles (with small base) or equilateral triangle.  We define the
coefficient of ultrametricity of the point set as the proportion of all
triangles satisfying these requirements.  The coefficient of ultrametricity
is 1 for perfectly ultrametric data; a value of 0 means that no triangle satisfies the
isosceles or equilateral requirements.  This coefficient is
referred to as alpha below in this article.
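
The angular test just described can be sketched in a few lines of Python.  This is a hedged illustration under our own naming (\texttt{internal\_angles}, \texttt{is\_ultrametric\_triangle} and \texttt{alpha} are hypothetical names); it enumerates all triplets, which is only practical for small point sets, and it is not the code used for the reported results.

\begin{verbatim}
```python
# Illustrative sketch of the alpha coefficient: smallest angle <= 60
# degrees, and the two remaining angles differ by less than 2 degrees.
import math
from itertools import combinations

def internal_angles(p, q, r):
    """Return the three internal angles in degrees, sorted ascending.

    Returns None if two of the points coincide (no triangle).
    """
    a, b, c = math.dist(q, r), math.dist(p, r), math.dist(p, q)
    if min(a, b, c) == 0.0:
        return None

    def opposite(x, y, z):
        # Law of cosines; clamp to [-1, 1] to absorb rounding error.
        cos_x = (y * y + z * z - x * x) / (2.0 * y * z)
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_x))))

    return sorted([opposite(a, b, c), opposite(b, a, c), opposite(c, a, b)])

def is_ultrametric_triangle(p, q, r, tol_deg=2.0, eps=1e-9):
    angles = internal_angles(p, q, r)
    if angles is None:
        return False
    smallest, middle, largest = angles
    # In exact arithmetic the smallest angle never exceeds 60 degrees;
    # eps only guards the equilateral case against rounding.
    return smallest <= 60.0 + eps and (largest - middle) < tol_deg

def alpha(points):
    """Proportion of all triplets forming an (approximately) isosceles
    triangle with small base, or an equilateral triangle."""
    triples = list(combinations(points, 3))
    if not triples:
        raise ValueError("need at least three points")
    return sum(is_ultrametric_triangle(*t) for t in triples) / len(triples)

# Toy points (invented): a tall isosceles triangle plus one extra point.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 3.0), (0.0, 4.0)]
print(alpha(points))
```
\end{verbatim}

Working from the sorted angles makes the test independent of the order in which the three points are given.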

176:

177: As already noted, assessing ultrametricity through triangle properties

178: is based on the prior  correspondence analysis, and this has the following

179: beneficial (and, in a sense, enabling) implications.  The correspondence

180: analysis

181: factor space is  Euclidean.  A Euclidean space, as a particular Hilbert

182: space, is a complete, normed vector space endowed with a scalar product.

183: It is precisely the scalar product that allows us to define angles and

184: hence the triangle properties that we need.

185:

186: \subsection{Correspondence Analysis:

187: Mapping $\chi^2$ into Euclidean Distances}

188:

189: As a dimensionality reduction technique

190: correspondence analysis is particularly appropriate for handling

191: frequency data.  As an example of the latter, frequencies of word

192: occurrence in text will be studied below.

193:

194: The given contingency table (or numbers of occurrence)

195: data is denoted $k_{IJ} =

196: \{ k_{IJ}(i,j) = k(i, j) ; i \in I, j \in J \}$.  $I$ is the set of text

197: indexes, and $J$ is the set of word indexes.  We have

198: $k(i) = \sum_{j \in J} k(i, j)$.  Analogously $k(j)$ is defined,

and $k = \sum_{i \in I, j \in J} k(i,j)$.  Next, $f_{IJ} = \{ f_{ij}
= k(i,j)/k ; i \in I, j \in J\} \subset \R_{I \times J}$,
similarly $f_I$ is defined as  $\{f_i = k(i)/k ; i \in I\}
\subset \R_I$, and $f_J$ analogously.  What we have described here is
the conversion of numbers of occurrences into relative frequencies.

The conditional distribution of $f_J$ knowing $i \in I$, also termed
the $i$th profile with coordinates indexed by the elements of $J$, is:

$$ f^i_J = \{ f^i_j = f_{ij}/f_i = (k_{ij}/k)/(k_i/k) ; f_i \neq 0 ;
j \in J \}$$ and likewise for $f^j_I$.

210:

211: Note that the input data values here are always non-negative reals.  The

212: output factor projections (and contributions to the principal directions

213: of inertia) will be reals.

214:

215: \subsection{Input: Cloud of Points Endowed with the Chi Squared Metric}

216:

217:

218: The cloud of points consists of the couple: profile coordinate and mass.

219: We have $ N_J(I) = \{ ( f^i_J, f_i ) ; i  \in I \} \subset \R_J $, and

220: again similarly for $N_I(J)$.

221:

The moment of inertia is as follows:
\begin{equation}
M^2(N_J(I)) = M^2(N_I(J)) = \| f_{IJ} - f_I f_J \|^2_{f_I f_J}
= \sum_{i \in I, j \in J} \frac{(f_{ij} - f_i f_j)^2}{f_i f_j}
\end{equation}
The term  $\| f_{IJ} - f_I f_J \|^2_{f_I f_J}$ is the $\chi^2$ metric
between the probability distribution $f_{IJ}$ and the product of marginal
distributions $f_I f_J$, with the product $f_I f_J$ as the center of the
metric.  Decomposing the moment of inertia of the cloud $N_J(I)$ -- or
of $N_I(J)$ since both analyses are inherently related -- furnishes the
principal axes of inertia, defined from a singular value decomposition.
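
The moment of inertia can be checked numerically on a small table.  The following Python sketch (our own illustration, with invented toy tables and hypothetical function names) computes it in two ways: directly from the double sum, and as the mass-weighted sum of squared $\chi^2$ distances of the row profiles from the mean profile $f_J$.  It assumes that no row or column total is zero.

\begin{verbatim}
```python
# Illustrative sketch: total inertia of a contingency table, two ways.
# Assumes every row total and column total is non-zero.

def relative_frequencies(counts):
    k = float(sum(sum(row) for row in counts))     # grand total
    f = [[v / k for v in row] for row in counts]   # f_ij
    f_i = [sum(row) for row in f]                  # row masses
    f_j = [sum(col) for col in zip(*f)]            # column masses
    return f, f_i, f_j

def inertia_direct(counts):
    """Sum over i, j of (f_ij - f_i f_j)^2 / (f_i f_j)."""
    f, f_i, f_j = relative_frequencies(counts)
    return sum((f[i][j] - f_i[i] * f_j[j]) ** 2 / (f_i[i] * f_j[j])
               for i in range(len(f_i)) for j in range(len(f_j)))

def inertia_from_profiles(counts):
    """Sum over i of f_i times the squared chi-squared distance between
    the profile of row i and the mean profile f_J."""
    f, f_i, f_j = relative_frequencies(counts)
    total = 0.0
    for i, mass in enumerate(f_i):
        d2 = sum((f[i][j] / mass - f_j[j]) ** 2 / f_j[j]
                 for j in range(len(f_j)))
        total += mass * d2
    return total

independent = [[1, 2], [2, 4]]      # rows proportional: no association
diagonal = [[10, 0], [0, 10]]       # rows use disjoint words
mixed = [[4, 2, 0], [8, 4, 1], [1, 1, 6]]

print(inertia_direct(independent), inertia_direct(diagonal))
print(inertia_direct(mixed), inertia_from_profiles(mixed))
```
\end{verbatim}

When the rows are proportional to one another the inertia vanishes, since $f_{ij} = f_i f_j$ in every cell; the two routes to the same quantity agree up to rounding.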

233:

234: \subsection{Output: Cloud of Points Endowed with the Euclidean

235: Metric in Factor Space}

236:

From the initial frequencies data matrix, a set of probability data,
$f_{ij}$, is defined by dividing each value by the grand total of all
elements in
the matrix.  In correspondence analysis,
each row (or column) point is considered to have an
associated weight.  The weight of the $i$th row point is given
by $f_i = \sum_j f_{ij}$, and the weight of the $j$th column point
is given by $f_j = \sum_i f_{ij}$. We consider the row points to have
coordinates $f_{ij} / f_i$, thus allowing points of the same
{\em profile} to be identical (i.e., superimposed). The following weighted
Euclidean distance, the $\chi^2$ distance, is then used between row
points:
$$ d^2(i,i') = \sum_j {1 \over f_j} \left( {f_{ij} \over f_i} -
                                     {f_{i'j} \over f_{i'}} \right)^2 $$
and an analogous distance is used between column points.

252:

253: The mean row point is given by the weighted average of all row

254: points:

255: $$ \sum_i f_i {f_{ij} \over f_i} = f_j$$

256: for $j = 1, 2, \dots, m$.  Similarly the mean column profile has

257: $i$th coordinate $f_i$.

258:

We
first consider the projections of the $n$ row
profiles in $\R^m$ onto an axis, ${\bf u}$.  This is given by
$$ \sum_j {f_{ij} \over f_i} {1 \over f_j} u_j$$ for all $i$ (note
the use of the scalar product here).  For details on determining the
new axis, ${\bf u}$, see Murtagh (2005b).

The  projections of points onto
axis ${\bf u}$ were with respect to the ${1 / f_j}$ weighted Euclidean
metric.  This makes interpreting projections very difficult from a
human/visual point of view, and so it is more natural to present results
in such a way that projections can be simply appreciated.  Therefore
{\em factors} are defined, such that the projections of row vectors
onto factor ${\bf \phi}$ associated with axis ${\bf u}$ are given by
$$\sum_j {f_{ij} \over f_i} \phi_j$$ for all $i$.  Taking $$\phi_j =
{1 \over f_j} u_j$$ ensures this and projections onto ${\bf \phi}$
are with respect to the ordinary (unweighted) Euclidean distance.

277:

An analogous set of relationships holds in $\R^n$ where the best
fitting axis, ${\bf v}$, is searched for.  A simple mathematical
relationship holds between ${\bf u}$ and ${\bf v}$, and between
${\bf \phi}$ and ${\bf \psi}$ (the latter being the factor associated
with axis or eigenvector ${\bf v}$):
$$ \sqrt{\lambda} \psi_i = \sum_j {f_{ij} \over f_i} \phi_j $$
$$ \sqrt{\lambda} \phi_j = \sum_i {f_{ij} \over f_j} \psi_i $$
These are termed {\em transition formulas}.
Axes ${\bf u}$
and ${\bf v}$, and factors ${\bf \phi}$ and ${\bf \psi}$, are
associated with eigenvalue $\lambda$, and best fitting higher-dimensional
subspaces are associated with decreasing values of $\lambda$ (see Murtagh,
2005b, for further details).

292:

293: \subsection{Conclusions on Correspondence Analysis and Introduction to the

294: Numerical Experiments to Follow}

295:

296: Some important points for the analyses to follow are -- firstly in relation

297: to correspondence analysis:

298:

299: \begin{enumerate}

300:

301: \item From numbers of occurrence data we always get (by design)

302: a Euclidean embedding

303: using correspondence analysis.  The factors are embedded in a Euclidean

304: metric.

305:

\item As seen in the previous subsection, the
number of factors, i.e.\ the number of non-zero eigenvalues, is
given by one less than the minimum of the number of observations studied
(indexed by set $I$) and the number of variables or attributes used
(indexed by set $J$).
The number of dimensions in factor space may be less than this full rank
if there are linear dependencies present.

313:

314: \item In the experiments to follow in the next section, we  always

315: have  $n < m$, where $n$ is number of texts or text segments, and $m$ is

316: number of words.  This implies that inherent (full rank)

317: dimensionality of the projected Euclidean

318: factor space is $n - 1$.

319:

\item To assess stability of results,
in our studies we often take as input a word set given by the
most highly ranked words in terms of frequency of
occurrence (for example, the top 1000).  Thus we take $m = 1000, 2000,$ and the full
attribute set (say,
$m_{\rm tot}$) in each case, where the attributes are ordered in terms of
decreasing marginal frequency.  In other words, we take the 1000 most
frequent words to characterize our texts; then the 2000 most frequent words;
and finally all words.  Because $n < m$, it is not surprising that
very similar results are found irrespective of the value of $m$:
the inherent, projected, Euclidean, factor space dimensionality is the
same in each case, viz., $n - 1$.  But we additionally find confirmation
of stability of our results.
We will show that our results are
characteristic of the texts used, in each case, and are in no way ``one off''
or arbitrary.

336:

337: %\item Purely as a baseline we will look at direct Euclidean pairwise

338: %distances defined on $\{ k_{ij} | i = 1, 2, \dots , n; j = 1, 2,

339: %\dots , m \}$.

340:

341: \end{enumerate}

342:

343: Some important points related to our numerical assessments below, in

344: relation to data used, determining of ultrametricity coefficient,

345: and software used, are as follows.

346:

347: \begin{enumerate}

348:

349: \item

In line with one tradition of textual analysis associated with Benz\'ecri's
correspondence analysis (see Murtagh, 2005b) we take the unique full words and
rank them in order of frequency of occurrence.  Thus for the Brothers Grimm work,
below, we find: ``the'', 19,696 occurrences; ``and'',
14,582 occurrences; ``to'', 7,380 occurrences; ``he'', 5,951 occurrences;
``was'', 4,122 occurrences; and so on.  The last three words in this ranking,
with one occurrence each, are ``yolk'', ``zeal'' and ``zest''.

357:

\item The alpha ultrametricity coefficient is based on triangles. Now,
with $n$ graph nodes we have $O(n^3)$ possible triangles, which is
computationally prohibitive, so we instead sample.  The means and
standard deviations below are based on 2000 randomly selected triangles
(vertex triplets), repeated 20 times; hence, in each case, a total of 40,000
random selections of triangles.

364:

\item All text collections reported on below (Section \ref{sectreal})
are publicly accessible (and web addresses are cited).  All texts were
obtained by us in plain (ASCII) text format.

The preparation of the input data was carried out with programs of
ours, written in C, and available at www.correspondances.info (accompanying
Murtagh, 2005b).  The correspondence analysis software was written in
the public R statistical software environment
(www.r-project.org, again see Murtagh, 2005b) and is available at this same
web address.  Some
simple statistical calculations were also carried out by us
in the R environment.

377:

378: \end{enumerate}
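
The triangle sampling described in the list above can be sketched as follows in Python.  This is a hedged illustration only: the function names are hypothetical, and details such as sampling the three vertices without replacement and using the sample (rather than population) standard deviation across repetitions are our assumptions, not documented choices.

\begin{verbatim}
```python
# Illustrative sketch: estimate alpha from randomly sampled triangles,
# with 2000 triangles per realization and 20 realizations by default.
import math
import random
import statistics

def angle_opposite(x, y, z):
    """Angle (degrees) opposite side x in a triangle with sides x, y, z."""
    cos_x = (y * y + z * z - x * x) / (2.0 * y * z)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_x))))

def triangle_ok(p, q, r, tol_deg=2.0):
    """True if the two largest angles differ by less than tol_deg degrees.

    The smallest angle of a triangle is at most 60 degrees in exact
    arithmetic, so that condition is not tested separately here.
    """
    short, mid, long_ = sorted(
        [math.dist(p, q), math.dist(q, r), math.dist(p, r)])
    if short == 0.0:
        return False                      # coincident points
    largest = angle_opposite(long_, short, mid)
    middle = angle_opposite(mid, short, long_)
    return (largest - middle) < tol_deg

def alpha_sampled(points, n_triangles=2000, n_repeats=20, seed=0):
    """Mean and standard deviation of alpha over n_repeats realizations."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        hits = 0
        for _ in range(n_triangles):
            p, q, r = rng.sample(points, 3)   # three distinct vertices
            hits += triangle_ok(p, q, r)
        estimates.append(hits / n_triangles)
    return statistics.mean(estimates), statistics.stdev(estimates)

# Toy check: vertices of a regular simplex, where every triangle is
# equilateral and hence satisfies the test.
simplex = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0),
           (0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0)]
mean_alpha, sd_alpha = alpha_sampled(simplex)
print(mean_alpha, sd_alpha)
```
\end{verbatim}

Fixing the seed makes a run reproducible; varying it gives an independent check on the spread of the estimates.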

379:

380:

381:

382:

383:

384:

385:

386:

387: \section{Real Case Studies: Text Interrelationships Through Shared Words}

388: \label{sectreal}

389:

In all, we use over 900 short texts, given by short stories, or chapters,
or short reports.  All are in English.  Unique words are determined
through delimitation by white space and by punctuation characters,
with no distinction between upper and lower case.  In
all, over one million words are used in our studies of these texts.
The study of word/text occurrences in a straightforward way, with no
truncation, stemming or other preprocessing, typifies a great deal
of the work of Benz\'ecri, and his journal {\em Les Cahiers de
l'Analyse des Donn\'ees}, published by the French publisher Dunod over
three decades up to 1996.  This work of Benz\'ecri is
discussed in detail in Murtagh (2005b).
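
A minimal Python sketch of this kind of word delimitation and cross-tabulation is given here for concreteness.  It is an assumption-laden illustration, not the C programs referred to earlier: the exact set of punctuation characters treated as delimiters is not specified in this article, so the sketch simply treats any run of characters other than letters and digits as a delimiter (which, for example, splits words at apostrophes).  The function names and the two toy texts are invented.

\begin{verbatim}
```python
# Illustrative sketch: case-insensitive word delimitation and a
# texts-by-words table of counts, restricted to the m most frequent words.
import re
from collections import Counter

def tokenize(text):
    # Lower-case, then keep maximal runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def cross_tabulate(texts, m=None):
    """Return (vocabulary, table) with table[i][j] = count of word j in text i.

    Words are ranked by decreasing overall frequency (ties alphabetically);
    if m is given, only the m most frequent words are kept.
    """
    per_text = [Counter(tokenize(t)) for t in texts]
    totals = Counter()
    for counts in per_text:
        totals.update(counts)
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    vocab = ranked if m is None else ranked[:m]
    table = [[counts[w] for w in vocab] for counts in per_text]
    return vocab, table

texts = ["The cat sat. The cat ran!", "A dog sat; the dog slept."]
vocab, table = cross_tabulate(texts, m=3)
print(vocab)
print(table)
```
\end{verbatim}

Ranking by overall frequency before truncating to $m$ words mirrors the use of the 1000 and 2000 most frequent words in the experiments.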

401:

We carried out some assessments of Porter stemming (Porter, 1980)
as an alternative
to the use of whitespace- or punctuation-delimited words, and found little
difference in the results.

406:

407: \subsection{Brothers Grimm}

408:

409: As a homogeneous collection of texts we take 209 fairy tales of the Brothers

410: Grimm (Ockerbloom, 2003),

411: containing 7443 unique (in total 280,629) space- or

412: punctuation-delimited words.  Story lengths were between 650 and 44,400 words.

413:

414: To define a semantic context of increasing

415: resolution we took the most frequent 1000 words, followed by the most frequent

416: 2000 words, and finally all 7443 words.

417: %(We tested extensively the case of

418: %just the 100 most frequent words also.  But in view of the texts versus

419: %words dimensionality implications, viz.\ $ n > m$ here, and the slightly

420: %more tricky interpretation, we deliberately do not report on these

421: %results here.)

422: We constructed a cross-tabulation of numbers of occurrences of

423: each word in each one of the 209 fairy tales.  This led therefore to a

424: set of frequency tables of dimensions: $209 \times 1000,

425: 209 \times 2000$ and $209 \times 7443$.    Through use of the $\chi^2$

426: distance between fairy tale texts, a correspondence analysis was carried out.

427: From the three frequency tables, the contingency table crossing all pairs

428: of fairy tales could be examined; but it was far more convenient for us

429: to proceed straight to the factor space, of dimension $209 - 1 = 208$.  The

430: factor space is Euclidean, so the correspondence analysis can be said to be

431: a mapping from the $\chi^2$ metric into a Euclidean metric space.

432:

433:

434: %\begin{table}

435: %\begin{center}

436: %\begin{tabular}{|crrrr|} \hline

437: %\multicolumn{5}{c}{209 Brothers Grimm fairy tales} \\ \hline

438: %Texts  &  Dim.  &    Original   &  Dim. & Factors  \\ \hline

439: %%209    &  100   &     0.0273    &  99   & 0.1002  \\

440: %209    &  1000  &     0.0324    &  208    & 0.1189  \\

441: %209    &  2000  &     0.0334    &  208    & 0.1083  \\

442: %209    &  7443  &     0.0324    &  208    & 0.1154  \\ \hline

443: %\end{tabular}

444: %\end{center}

445: %\caption{Coefficient of ultrametricity.

446: %Original: frequencies of occurrence matrix defined on the 209 texts

447: %crossed by: % 100,

448: %1000, 2000, and all = 7443, words.  Euclidean distance

449: %defined on each pair of texts.  Factors: factor projections resulting

450: %from correspondence analysis, with Euclidean distance used between each

451: %pair of texts.}

452: %\label{tabcorr}

453: %\end{table}

454:

455: \begin{table}

456: \caption{Coefficient of ultrametricity, alpha.

457: Input data: frequencies of occurrence matrices defined on the 209 texts

458: crossed by: %100,

459: 1000, 2000, and all = 7443, words.

460: Alpha (ultrametricity coefficient) based

461: on factors: i.e., factor projections resulting

462: from correspondence analysis, with Euclidean distance used between each

463: pair of texts in factor space, of dimensionality 208.

464: %The mean and standard deviations are each based on 20 realizations of

465: %2000 triangles.

466: }

467: \label{tabcorrb}

468: \begin{center}

469: \setlength{\tabcolsep}{1mm}

470: \begin{tabular}{|crrrr|} \hline

471:       &  \multicolumn{3}{c}{209 Brothers Grimm fairy tales}  &  \\ \hline

Texts & Orig.\ dim. & Factor dim. & Alpha, mean & Alpha, sdev. \\ \hline

473: %209  &  100      & 99   &  0.0939   &  0.0063 \\

474: 209   &  1000     & 208  &  0.1236   &  0.0054 \\

475: 209   &  2000     & 208  &  0.1123   &  0.0065 \\

476: 209   &  7443    & 208  &  0.1147   &  0.0066 \\ \hline

477: \end{tabular}

478: \end{center}

479: \end{table}

480:

481: %The Euclidean distance was defined on the set of 209 fairy tales, based

482: %on the four different semantic contexts (i.e., based on characterization

483: %by %100,

484: %1000, 2000 and 7443 words).

485:

486: %Secondly the chi squared distance or weighted Euclidean distance between

487: %profiles was used as an appropriate way to assess relative similarity. If

488: %$k_{ij}$ is the number of occurrences of word $k$ in text $i$, then the

489: %chi squared distance between texts $i$ and $i'$ is $d_\chi(i,i') =

490: %\sum_j k/k_j (k_{ij}/k_i - k_{i'j}/k_{i'}$ where for text $i$, $k_{ij}/k_i$

491: %for all words $j$ defines the text's profile; $k_i = \sum_j k_{ij}$;

492: %similarly word $j$'s weight is $k_j = \sum_i k_{ij}$; and finally the

493: %overall total of words in all texts is $k = \sum_i \sum_j k_{ij}$.  This

494: %distance is well established for discrete data such as frequencies of

495: %o%ccurence.  As can be seen, weights ($k_i$, $k_j$) are used to

496: %c%ounter-balance overly frequent (or rare) words or unusually long (or

497: %short) texts.  This chi squared metric is mapped into a Euclidean space

498: %by determining principal axes of orientation, which correspond to

499: %axes of intertia, in correspondence analysis (Murtagh, 2005).  The factor

500: %projections will then define a Euclidean coordinate system.  It is this

501: %which we use, rather than the original chi squared metric, in our

502: %experiments.

503:

504: %For the varying semantic resolution levels (viz., %100-,

505: %1000-, 2000-, and 7443-dimensional) the inherent resolution level is not

506:

Table \ref{tabcorrb} (columns 4, 5)
shows remarkable stability of the alpha ultrametricity
coefficient results, with mean values between 0.11 and 0.13 for all three
word sets, and such stability will be seen in all further results
to be presented below.  The ultrametricity is not high for the Brothers
Grimm data: we recall that an alpha value of 0 means no triangle is
isosceles/equilateral.  We see that there is very little ultrametric
(hence hierarchical) structure in the Brothers Grimm data (based on our
particular definition of ultrametricity/hierarchy).
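
As a rough plausibility check on the reported standard deviations, consider the following back-of-envelope calculation, which is our own and rests on the assumption that the 2000 triangles in each realization are independent draws, each satisfying the test with probability alpha.  The binomial standard deviation of the observed proportion is then
$$ \sqrt{\alpha (1 - \alpha) / 2000} . $$
For $\alpha = 0.1236$ this gives $\sqrt{0.1236 \times 0.8764 / 2000} \approx 0.0074$, which is of the same order as the standard deviations of 0.0054 to 0.0066 in Table \ref{tabcorrb}.  Under this assumption, the spread across realizations is consistent with being due to triangle sampling alone.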

515:

516:

517: \subsection{Jane Austen}

518:

519: To further study stories of a general sort, we use some works of the

520: English novelist, Jane Austen.

521:

522: \begin{enumerate}

523: \item {\em Sense and Sensibility} (Austen, 1811),

524: 50 chapters = files, chapter lengths from 1028 to 5632 words.

525: \item {\em Pride and Prejudice} (Austen, 1813),

526: 61 chapters each containing between 683 and 5227 words.

527: \item {\em Persuasion} (Austen, 1817), 24 chapters,

528: chapter lengths 1579 to 7007 words.

529: \item {\em Sense and Sensibility} split into 131 separate

530: texts, each containing around 1000 words

531: (i.e., each chapter was split into files containing 5000 or fewer characters).

532: We did this to check on any influence by the size (total number of words) of

533: the text unit used (and we found no such influence).

534: \end{enumerate}

535:

536: In all there were 266 texts containing a total of 9723 unique words.  We

537: looked at the 1000, 2000 and all = 9723 most frequent words to

538: characterize the texts by frequency of occurrence.

539:

540:

541: %\begin{table}

542: %\begin{center}

543: %\begin{tabular}{|crrrr|} \hline

544: %\multicolumn{5}{c}{266 J.\ Austen chapters or partial chapters} \\ \hline

545: %Texts  &  Dim.  &    Original   &  Dim.  &   Factors  \\ \hline

546: %%266    &  100   &     0.0409    &  99    &  0.1066  \\

547: %266    &  1000  &     0.0581    &  261   &  0.1521  \\

548: %266    &  2000  &     0.0601    &  262   &  0.1435  \\

549: %266    &  9723  &     0.0596    &  263   &  0.1420  \\ \hline

550: %\end{tabular}

551: %\end{center}

552: %\caption{Coefficient of ultrametricity.

553: %Original: frequencies of occurrence matrix defined on the 266 texts

554: %crossed by: %100,

555: %1000, 2000, and all = 9273, words.  Euclidean distance

556: %defined on each pair of texts.  Factors: factor projections resulting

557: %from correspondence analysis, with Euclidean distance used between each

558: %pair of texts.  Dimensionality of latter is necessarily less than $ 266 -1$,

559: %adjusted above for 0 eigenvalues = linear dependence.}

560: %\label{tabcorr2}

561: %\end{table}

562:

563: \begin{table}

564: \caption{Coefficient of ultrametricity, alpha.

565: Input data: frequencies of occurrence matrices defined on the 266 texts

566: crossed by: %100,

567: 1000, 2000, and all = 9723, words.

568: Alpha (ultrametricity coefficient) based

569: on factors: i.e., factor projections resulting

570: from correspondence analysis, with Euclidean distance used between each

571: pair of texts in factor space.

572: Dimensionality of latter is necessarily $ \leq 266 -1$,

573: adjusted for 0 eigenvalues = linear dependence.

574: %The mean and standard deviations are each based on 40,000 realizations of

575: %triangles.

576: }

577: \label{tabcorr2b}

578: \begin{center}

579: \setlength{\tabcolsep}{1mm}

580: \begin{tabular}{|crrrr|} \hline

581:   & \multicolumn{3}{c}{266 Austen chapters or partial chapters} & \\ \hline

Texts & Orig.\ dim. & Factor dim. & Alpha, mean & Alpha, sdev. \\ \hline

583: %266  &  100      & 99   &  0.1001   &  0.0068 \\

584: 266   &  1000     & 261  &  0.1455   &  0.0084 \\

585: 266   &  2000     & 262  &  0.1489   &  0.0083 \\

586: 266   &  9723     & 263  &  0.1404   &  0.0075 \\ \hline

587: \end{tabular}

588: \end{center}

589: \end{table}

590:

Table \ref{tabcorr2b}, again displaying very stable alpha values, indicates
that the Austen corpus (mean alpha between 0.14 and 0.15) is slightly more
ultrametric than the Grimm corpus (mean alpha between 0.11 and 0.13),
Table \ref{tabcorrb}.

594:

595: \subsection{Air Accident Reports}

596:

597: We used air accident reports to explore documents with very particular,

598: technical, vocabulary.

The NTSB aviation accident database
(Aviation Accident Database and Synopses, 2003)
contains information
about civil aviation accidents in the United States and elsewhere.
We selected 50 reports.  Two examples of the reports used
are as follows: an accident that occurred on Sunday, January 02, 2000 in Corning, AR,
aircraft Piper PA-46-310P, injuries -- 5 uninjured; and one that occurred on Sunday,
January 02, 2000 in Telluride, TN, aircraft Bellanca BL-17-30A,
injuries -- 1 fatal.  In the 50 reports, there were 55,165 words.
Report lengths ranged between approximately 2300 and 28,000 words. The
number of unique words was 4261.

610:

611: Sample of start of report 30: {\em On January 16, 2000, about

612: 1630 eastern standard time (all times are eastern standard time,

613: based on the 24 hour clock), a Beech P-35, N9740Y, registered to a

614: private owner, and operated as a Title 14 CFR Part 91 personal

615: flight, crashed into Clinch Mountain, about 6 miles north of

616: Rogersville, Tennessee. Instrument meteorological conditions prevailed

617: in the area, and no flight plan was filed. The aircraft incurred

618: substantial damage, and the private-rated pilot, the sole occupant,

619: received fatal injuries. The flight originated from Louisville,

620: Kentucky, the same day about 1532.}

621:

622: %\begin{table}

623: %\begin{center}

624: %\begin{tabular}{|crrrr|} \hline

625: %\multicolumn{5}{c}{50 aviation accident reports} \\ \hline

626: %Texts  &  Dim.  &    Original   &  Dim.  &   Factors  \\ \hline

627: %%50    &  100   &     0.0270    &  48    &   0.1063  \\

628: %50    &  1000  &     0.0407    &  48  &   0.1317  \\

629: %50    &  2000  &     0.0407    &  48   &  0.1212  \\

630: %50    &  4261  &     0.0413    &  48   &  0.1180   \\ \hline

631: %\end{tabular}

632: %\end{center}

633: %\caption{Coefficient of ultrametricity.

634: %Original: frequencies of occurrence matrix defined on the 50 texts

635: %crossed by: %100,

636: %1000, 2000, and all = 4261, words.  Euclidean distance

637: %defined on each pair of texts.  Factors: factor projections resulting

638: %from correspondence analysis, with Euclidean distance used between each

639: %pair of texts.  Dimensionality of latter is necessarily less than $ 50 -1$,

640: %adjusted above for 0 eigenvalues = linear dependence.}

641: %\label{tabcorr4}

642: %\end{table}

643:

644: \begin{table}

645: \caption{Coefficient of ultrametricity, alpha.

646: Input data: frequencies of occurrence matrices defined on the 50 texts

647: crossed by: %100,

648: 1000, 2000, and all = 4261, words.

649:  Alpha (ultrametricity coefficient) based

650: on factors: i.e., factor projections resulting

651: from correspondence analysis, with Euclidean distance used between each

652: pair of texts in factor space.

653: Dimensionality of the latter is at most $50 - 1 = 49$,

654: with an additional adjustment made for one 0-valued eigenvalue,

655: implying linear dependence.

656: %The mean and standard deviations are each based on 40,000 realizations

657: %triangles.

658: }

659: \label{tabcorr4b}

660: \begin{center}

661: \setlength{\tabcolsep}{1mm}

662: \begin{tabular}{|crrrr|} \hline

663:   & \multicolumn{3}{c}{50 aviation accident reports} &  \\ \hline

664: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline

665: %50  &  100      & 48   &  0.1101   &  0.0081 \\

666: 50   &  1000     & 48  &  0.1338   &  0.0077 \\

667: 50   &  2000     & 48  &  0.1186   &  0.0058 \\

668: 50   &  4261     & 48  &  0.1154   &  0.0050 \\ \hline

669: \end{tabular}

670: \end{center}

671: \end{table}
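
The factor dimensionality in Table \ref{tabcorr4b} can be checked as follows.
This is a short worked check rather than part of the original analysis; the
symbols $n$ and $m$ (numbers of texts and of words) are introduced here for
illustration only.  Correspondence analysis of an $n \times m$ table yields at
most $\min(n, m) - 1$ factors with non-zero eigenvalue, so with $n = 50$ texts
and $m \geq 1000$ words,
\[
\min(n, m) - 1 = 50 - 1 = 49, \qquad 49 - 1 = 48 ,
\]
where the final subtraction accounts for the one further 0-valued eigenvalue
noted in the caption.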

672:

673:

674: In Table \ref{tabcorr4b} we find ultrametricity values that are marginally

675: greater than those found for the Brothers Grimm (Table \ref{tabcorrb}).  It

676: could be argued that the latter, too, uses its own technical

677: vocabulary.   We would need to use more data to see if we can clearly

678: distinguish between the (small) ultrametricity levels of these two

679: corpora.
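
As a rough indication of how close the two corpora are, and assuming that the
full word set value for the Grimm tales is the one listed in
Table \ref{tabsum}, the difference in mean alpha is
\[
0.1154 - 0.1147 = 0.0007 \approx 0.14 \times 0.0050 ,
\]
i.e., about one seventh of the standard deviation reported for the aviation
reports in Table \ref{tabcorr4b}.  This is simple arithmetic on the tabulated
values, not a significance test.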

680:

681:

682: \subsection{DreamBank}

683:

684: With dream reports (i.e., reports by individuals on their remembered

685: dreams) we depart from a technical vocabulary, and instead raise the

686: question as to whether dream reports can perhaps be considered as types

687: of fairy tale or story, or even akin to accident reports.

688:

689: From the DreamBank repository (Domhoff, 2003; DreamBank, 2004; Schneider

690: and Domhoff, 2004)

691: we selected the following collections:

692: \begin{enumerate}

693: \item ``Alta: a detailed dreamer,'' in period 1985--1997, 422 dream reports.

694: \item  ``Chuck: a physical scientist,''  in period

695: 1991--1993,  75 dream reports.

696: \item ``College women,'' in period 1946--1950,  681 dream reports.

697: \item ``Miami Home/Lab,''  in period  1963--1965,  445 dream reports.

698: \item ``The Natural Scientist,''  1939,  234 dream reports.

699: \item ``UCSC women,''  1996,  81 dream reports.

700: \end{enumerate}

701:

702: To have reports of adequate length, we requested report sizes of between

703: 500 and 1500 words.  With this criterion, from (1) we obtained 118 reports,

704: from (2) and (6) we obtained no reports, from (3) we obtained 15 reports,

705: from (4) we obtained 73 reports, and finally from (5) we obtained 8 reports.

706: In all, we used 214 dream reports, comprising 13696 words.

707:

708: Sample of start of report 100: {\em I'm delivering a car to a man --

709: something he's just bought, a Lincoln

710: Town Car, very nice. I park it and go down the street to find him -- he

711: turns out to be an old guy, he's buying the car for nostalgia -- it turns

712: out to be an old one, too, but very nicely restored, in excellent

713: condition. I think he's black, tall, friendly, maybe wearing overalls. I

714: show him the car and he drives off. I'm with another girl who drove

715: another car and we start back for it but I look into a shop first -- it's

716: got outdoor gear in it - we're on a sort of mall, outdoors but the shops

717: face on a courtyard of bricks. I've got something from the shop just

718: outside the doors, a quilt or something, like I'm trying it on, when

719: it's time to go on for sure so I leave it on the bench. We go further,

720: there's a group now, and we're looking at this office facade for the

721: Honda headquarters.}

722:

723: In addition to the above, we took another set of dream reports, from one individual,

724: Barbara Sanders.  A more reliable (according to DreamBank, 2004) set of

725: reports comprised 139 reports, and a second comprised 32 reports.  In all

726: 171 reports were used from this person.  Typical lengths were about 2500

727: up to 5322.  The total number of words in the Barbara Sanders set of

728: dream reports was 107,791.
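
The report counts quoted above can be tallied as follows (a simple check on
the figures, with the collections taken in the order of the list):
\[
118 + 0 + 15 + 73 + 8 + 0 = 214, \qquad 139 + 32 = 171, \qquad
214 + 171 = 385 ,
\]
which agrees with the 385 dream reports analyzed in Table \ref{tabcorr3b}.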

729:

730:

731: %\begin{table}

732: %\begin{center}

733: %\begin{tabular}{|crrrr|} \hline

734: %\multicolumn{5}{c}{385 dream reports} \\ \hline

735: %Texts  &  Dim.  &    Original   &  Dim.  &   Factors  \\ \hline

736: %%385    &  100   &     0.0780    &  99    &   0.1379  \\

737: %385    &  1000  &     0.1122    &  384  &   0.2048  \\

738: %385    &  2000  &     0.1057    &  384   &  0.2137  \\

739: %385    &  11441  &     0.1288    &  384   &  0.1958   \\ \hline

740: %\end{tabular}

741: %\end{center}

742: %\caption{Coefficient of ultrametricity.

743: %Original: frequencies of occurrence matrix defined on the 385 texts

744: %crossed by: %100,

745: %1000, 2000, and all = 11441, words.  Euclidean distance

746: %defined on each pair of texts.  Factors: factor projections resulting

747: %from correspondence analysis, with Euclidean distance used between each

748: %pair of texts.  Dimensionality of latter is necessarily less than $ 266 -1$,

749: %adjusted above for 0 eigenvalues = linear dependence.}

750: %\label{tabcorr3}

751: %\end{table}

752:

753: \begin{table}

754: \caption{Coefficient of ultrametricity, alpha.

755: Input data: frequencies of occurrence matrices defined on the 385 texts

756: crossed by: %100,

757: 1000, 2000, and all = 11441, words.

758: Alpha (ultrametricity coefficient) based

759: on factors: i.e., factor projections resulting

760: from correspondence analysis, with Euclidean distance used between each

761: pair of texts in factor space, of dimensionality $ 385 -1 = 384$.

762: %The mean and standard deviations are each based on 40,000

763: %realizations of triangles.

764: }

765: \label{tabcorr3b}

766: \begin{center}

767: \setlength{\tabcolsep}{1mm}

768: \begin{tabular}{|crrrr|} \hline

769:  & \multicolumn{3}{c}{385 dream reports}  & \\ \hline

770: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline

771: %385  &  100      & 99   &  0.1413   &  0.0090 \\

772: 385   &  1000     & 384  &  0.1998   &  0.0088 \\

773: 385   &  2000     & 384  &  0.1876   &  0.0095 \\

774: 385   &  11441    & 384  &  0.1933   &  0.0087 \\ \hline

775: \end{tabular}

776: \end{center}

777: \end{table}

778:

779: First we analyzed all dream reports, furnishing Table \ref{tabcorr3b}.

780:

781: In order to look at a more homogeneous subset of dream reports, we

782: then analyzed separately

783: the Barbara Sanders set of 171 reports, leading to Table \ref{tabcorr333b}.

784: (Note that this analysis is on a subset of

785: the previously analyzed dream reports, Table \ref{tabcorr3b}.)

786: The Barbara Sanders subset of 171 reports contained 7044

787: unique words in all.

788:

789:

790: Compared to Table \ref{tabcorr3b} based on the entire dream report

791: collection, Table \ref{tabcorr333b}, which is based on one person,

792: shows higher ultrametricity levels on average.  It is interesting to note

793: that the dream reports, collectively, are higher in ultrametricity level

794: than our previous values for alpha; and that the ultrametricity level is

795: raised again when the data used relates to one person.
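
To make the comparison concrete, the mean alpha values from
Tables \ref{tabcorr3b} and \ref{tabcorr333b} can be set side by side.  The
differences below are simple arithmetic on the tabulated means, offered as an
illustration and not as a significance test; ``all'' denotes 11441 words for
the full collection and 7044 words for the single-person subset.
\[
\begin{array}{rccc}
\mbox{Words} & \mbox{All dreams} & \mbox{One person} & \mbox{Difference} \\
1000         & 0.1998            & 0.2250            & 0.0252 \\
2000         & 0.1876            & 0.2256            & 0.0380 \\
\mbox{all}   & 0.1933            & 0.2603            & 0.0670
\end{array}
\]
In each case the single-person value is the larger one, and each difference
exceeds twice the largest standard deviation reported in either table
($0.0112$).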

796:

797: \subsection{James Joyce's Ulysses, and Overall Summary}

798:

799: We carried out a study of James Joyce's {\em Ulysses}, comprising

800: 304,414 words in total.  We broke this text into 183 separate files,

801: each comprising between approximately 1400 and 2000 words.  The number of

802: unique words in these 183 files was found to be 28,649.  The

803: ultrametricity alpha values for this collection of 183 Joycean texts

804: were found to be lower than the Barbara Sanders values, but higher than those for the

805: global set of all dream reports.

806: % CORRECTION WITH NEW PROGRAMS 9 MAY:  no of unique words was up from 28,631

807: % NEW MEAN FOR 7000 WAS: 0.2057

808: For 183 text segments, with frequencies of occurrence of 7000 (top-ranked)

809: words, we found a mean alpha of 0.2057, with standard deviation 0.0092.
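
As a quick consistency check on the segmentation (simple arithmetic on the
figures above), the mean segment length is
\[
304{,}414 / 183 \approx 1663 \mbox{ words},
\]
which lies within the stated range of 1400 to 2000 words.  Taking the
7000-word value for {\em Ulysses} and the full word set values for the dream
reports, the mean alpha values are ordered as
\[
0.1933 \; \mbox{(all dream reports)} \; < \; 0.2057 \; \mbox{({\em Ulysses})}
\; < \; 0.2603 \; \mbox{(Barbara Sanders)} .
\]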

810:

811: %\begin{table}

812: %\begin{center}

813: %\begin{tabular}{|crrrr|} \hline

814: %\multicolumn{5}{c}{171 Barbara Sanders dream reports} \\ \hline

815: %Texts  &  Dim.  &    Original   &  Dim.  &   Factors  \\ \hline

816: %%171    &  100   &     0.0816    &  99    &   0.1405  \\

817: %171    &  1000  &     0.1212    &  170  &   0.2470  \\

818: %171    &  2000  &     0.1293    &  170   &  0.2110  \\

819: %171    &  11441  &     0.1324    &  170   &  0.2404   \\ \hline

820: %\end{tabular}

821: %\end{center}

822: %\caption{Coefficient of ultrametricity.

823: %Original: frequencies of occurrence matrix defined on the 171 texts

824: %crossed by: %100,

825: %1000, 2000, and all = 7044, words.  Euclidean distance

826: %defined on each pair of texts.  Factors: factor projections resulting

827: %from correspondence analysis, with Euclidean distance used between each

828: %pair of texts.  Dimensionality of latter is necessarily less than $ 171 -1$,

829: %with no adjustment necessary for 0 eigenvalues = linear dependence.}

830: %\label{tabcorr333}

831: %\end{table}

832:

833: \begin{table}

834: \caption{Coefficient of ultrametricity, alpha.

835: Input data: frequencies of occurrence matrices defined on the 171 texts

836: crossed by: %100,

837: 1000, 2000, and all = 7044, words.

838: Alpha (ultrametricity coefficient) based

839: on factors: i.e., factor projections resulting

840: from correspondence analysis, with Euclidean distance used between each

841: pair of texts in factor space, of dimensionality $ 171 -1 = 170$.

842: %The mean and standard deviations are each based on 40,000

843: %realizations of triangles.

844: }

845: \label{tabcorr333b}

846: \begin{center}

847: \setlength{\tabcolsep}{1mm}

848: \begin{tabular}{|crrrr|} \hline

849:  & \multicolumn{3}{c}{171 Barbara Sanders dream reports}  & \\ \hline

850: Texts & Orig.Dim. & FactorDim. & Alpha, mean & Alpha, sdev. \\ \hline

851: %171  &  100      & 99   &  0.1592   &  0.0063 \\

852: 171   &  1000     & 170  &  0.2250   &  0.0089 \\

853: 171   &  2000     & 170  &  0.2256   &  0.0112 \\

854: 171   &  7044     & 170  &  0.2603   &  0.0108 \\ \hline

855: \end{tabular}

856: \end{center}

857: \end{table}

858:

859: %Ulysses text: http://www.lib.ru/DVOJS/ulysses.txt

860:

861: A summary of all our results is in Table \ref{tabsum}.  A few words of explanation

862: follow.  The lower values of ultrametricity can be explained by a more

863: common, shared word set; viz., shared over the text segment set.  The

864: higher values of ultrametricity are associated with dreams, in particular

865: with a single dreamer, and with {\em Ulysses}: one could argue that

866: characteristics of these data sets include frequent changes in interest,

867: and frequent replacement of one scene, and one set of personages,

868:  with another.  In factor space, this implies that a triplet of points

869: is  more likely to be isosceles with small base, or equilateral, compared to

870: the alternative (low ultrametricity case) of more smooth transitions from

871: one sentence, paragraph or section to another.
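
For reference, the triangle property invoked here is the standard one for
ultrametric spaces; the notation $d$, $x$, $y$, $z$ below is chosen for
illustration and may differ from that used elsewhere in this article.  A
distance $d$ is ultrametric if, for every triplet of points $x, y, z$,
\[
d(x, z) \leq \max \{ d(x, y), d(y, z) \} .
\]
Ordering the three pairwise distances of a triplet as
$d_1 \leq d_2 \leq d_3$, the inequality holds for all three choices of side
exactly when
\[
d_2 = d_3 ,
\]
so that the triangle is isosceles with small base $d_1$, or equilateral if
$d_1 = d_2 = d_3$.  As summarized in Table \ref{tabsum}, alpha is 1 when all
triangles respect this property and 0 when none do.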

872:

873: \begin{table}

874: \begin{center}

875: \begin{tabular}{|lrrr|}\hline

876: Data                 &   No. texts &  No. unique words &  Ultrametricity  \\ \hline

877: Grimm tales          &   209       &    7443    &     0.1147       \\

878: aviation accidents   &    50       &    4261    &     0.1154       \\

879: Jane Austen novels   &   266       &    9723    &     0.1404       \\

880: dream reports        &   385       &   11441    &     0.1933       \\

881: Joyce's Ulysses      &   183       &   28649    &     0.2057       \\

882: single person dreams &   171       &    7044    &     0.2603       \\ \hline

883: \end{tabular}

884: \end{center}

885: \caption{Summary of results for the full word set, with the exception of

886: the Joyce data, where 7000 words were used.  The ultrametricity is the

887: alpha measure used throughout this article, where 1 is respect for

888: ultrametricity by all triangles, and 0 is non-respect in all cases.}

889: \label{tabsum}

890: \end{table}

891:

892: \section{Conclusion}

893:

894: We studied a range of text corpora, comprising over 1000 texts, or text

895: segments,

896: containing over 1.3 million words.  We found very stable ultrametricity

897: quantifications of the text collections, across numbers of most frequent

898: words used to characterize the texts, and sampling of triplets of texts.

899: We also found that in all cases (save, perhaps, the Brothers Grimm versus

900: air accident reports) there was a clear distinction between the ultrametricity

901: values of the text collections.
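
The count of texts can be checked against Table \ref{tabsum}, assuming that
the 171 single-person dream reports are counted once, as part of the 385
dream reports:
\[
209 + 50 + 266 + 385 + 183 = 1093 > 1000 .
\]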

902:

903: %We end with a few remarks which much remain as speculation until far more

904: %sizable tests have been carried out (involving a far greater number of texts).

905: %However even speculation serves to motivate future work.

906: Some very intriguing ultrametricity characterizations were found in our

907: work.  For example, we found that the technical vocabulary of air accidents

908: did not differ greatly in terms of inherent ultrametricity compared to the

909: Brothers Grimm fairy tales.  Secondly we found that novelist Austen's

910: works were distinguishable from the Grimm fairy tales.  Thirdly we found

911: dream reports to have a higher ultrametricity level than the other

912: text collections.  Further exploration of these issues will require

913: availability of very high quality textual data.

914:

915: Values of our alpha ultrametricity coefficient were small but

916: revealing and useful nonetheless.  Ultrametricity implies hierarchical

917: embedding, or structuring in terms of embedded sets.  This is what we are

918: finding locally (and not globally) in our data. The use of such

919: hierarchical fragments as relations of dominance between concepts could be

920: of use for ontologies.

921:

922: Ontologies, or concept hierarchies, are used

923: to help the user in information retrieval in a range of ways including:

924: tree-based homing in on content to be retrieved; characterizing the

925: content of data repositories before querying starts;

926: and disambiguating different

927: but overlapping content domains.  In \cite{autoonto} we explore the use

928: of local ultrametric embedding for ontology fragments.  As an example,

929: we use Aristotle's {\em Categories} and some modern texts (on

930: ubiquitous computing, and from Wikipedia), and we

931: also discuss an online web-based demonstrator supporting retrieval through

932: a visual user interface.

933:

934:

935: \begin{thebibliography}{99}

936:

937: \bibitem{refa1}

938: Austen, J. (1811).  {\em Sense and Sensibility}.  Available at: \\

939: http://www.pemberley.com/etext/SandS

940:

941: \bibitem{refa2}

942: Austen, J. (1813).  {\em Pride and Prejudice}.  Available at: \\

943: http://www.pemberley.com/etext/PandP

944:

945: \bibitem{refa3}

946: Austen, J. (1817).  {\em Persuasion}.  Available at: \\

947: http://www.pemberley.com/etext/Persuasion

948:

949: %\bibitem{ref1}

950: %A.-L. Barab\'asi, ``Self-organized networks: resources'', at

951: %www.nd.edu/$\sim$networks/database (2004).

952:

953: \bibitem{ref2}

954: Benz\'ecri, J.P. (1979a).  {\em L'Analyse des Donn\'ees Tome 1,

955: La Taxinomie}, 2nd ed., Dunod, Paris.

956:

957: \bibitem{ref3}

958: Benz\'ecri, J.P. (1979b).  {\em L'Analyse des Donn\'ees Tome 2,

959: Correspondances}, 2nd ed., Dunod, Paris.

960:

961: %\bibitem{ref4}

962: %G. Caldarelli, A. Erzan and A. Vespignani, Eds., Special issue on Networks,

963: %European Physical Journal B {\bf 38}, no. 2 (2004).

964:

965: %\bibitem{ref5}

966: %Comtet, L. (1974).  {\em Advanced Combinatorics}, Reidel, Dordrecht.

967:

968: \bibitem{ref6}

969: Domhoff, G.W. (2003).

970: {\em The Scientific Study of Dreams: Neural Networks,

971: Cognitive Development and Content Analysis}, American Psychological

972: Association.

973:

974: %\bibitem{ref7}

975: %Donaghey, R. (1975).

976: %Alternating Permutations and Binary Increasing Trees,

977: %{\em Journal of Combinatorial Theory (A)}, 18:  141--148.

978:

979: \bibitem{ref8}

980: DreamBank (2004), Repository of Dream Reports, www.dreambank.net

981:

982: \bibitem{gom}

983: G\'omez-P\'erez, A., Fern\'andez-L\'opez, M. and Corcho, O. (2004).

984: {\em Ontological Engineering (with Examples from the Areas of Knowledge

985: Management, e-Commerce and the Semantic Web)}, Springer, Berlin.

986:

987: %\bibitem{ref9}

988: %J.C. Gower, ``Some distance properties of latent root and vector

989: %methods used in multivariate analysis'',  Biometrika {\bf 53}, 325

990: %(1966).  % 325--328

991:

992: \bibitem{ref10}

993: Lerman, I.C. (1981).

994: {\em Classification et Analyse Ordinale des Donn\'ees},

995: Dunod, Paris.

996:

997: \bibitem{ref11}

998: Murtagh,  F. (1983).  A Survey of Recent Advances in Hierarchical

999: Clustering Algorithms, {\em The Computer Journal}, 26:

1000: 354--359.

1001:

1002: %\bibitem{ref12}

1003: %Murtagh, F. (1984).

1004: %Counting Dendrograms: A Survey,

1005: %{\em Discrete Applied Mathematics}, 7: 191--199.

1006:

1007: \bibitem{ref13}

1008: Murtagh, F. (1985).

1009: {\em Multidimensional Clustering Algorithms},

1010: Physica-Verlag, W\"urzburg.

1011:

1012: \bibitem{ref14}

1013: Murtagh,  F. (2004).  On Ultrametricity, Data Coding, and Computation,

1014: {\em Journal of Classification}, 21: 167--184.

1015:

1016: \bibitem{ref15}

1017: Murtagh, F. (2005a).  Identifying the Ultrametricity of Time Series,

1018: {\em European Physical Journal B}, 43: 573--579.

1019:

1020: \bibitem{ref16}

1021: Murtagh, F. (2005b).  {\em

1022: Correspondence Analysis and Data Coding with Java and R},

1023: Chapman and Hall/CRC Press, New York.

1024:

1025: \bibitem{autoonto}

1026: Murtagh, F., Mothe, J. and Englmeier, K. (2007).  Ontology from local

1027: hierarchical structure in text.  http://arxiv.org/abs/cs.IR/0701180

1028:

1029: \bibitem{ref17} NTSB

1030: Aviation Accident Database and Synopses (2003),

1031: National Transportation Safety Board,

1032: accessible from http://www.landings.com

1033: %/evird.acgi\$pass*59062640!\_h-www.landings.com/\_landings/

1034: %pages/search/rep-ntsb.html

1035:

1036:

1037: \bibitem{ref18}

1038: Ockerbloom, J.M. (2003). {\em Grimms' Fairy Tales},

1039: http://www-2.cs.cmu.edu/$\sim$spok/grimmtmp

1040:

1041: \bibitem{por}

1042: Porter, M.F. (1980). An Algorithm for Suffix Stripping,

1043: {\em Program}, 14: 130--137.

1044:

1045: \bibitem{ref19}

1046: Rammal, R.,  Toulouse, G. and Virasoro, M.A. (1986).

1047: Ultrametricity for

1048: Physicists, {\em Reviews of Modern Physics}, 58: 765--788.

1049:

1050: \bibitem{sas}

1051: Sasaki, F. and P\"onninghaus, J. (2003).

1052: Testing Structural Properties in Textual Data: Beyond Document Grammars,

1053: {\em Literary and Linguistic Computing}, 18: 89--100.

1054:

1055: \bibitem{ref20}

1056: Schneider, A. and Domhoff, G.W. (2004). The Quantitative Study of Dreams,

1057: http://dreamresearch.net

1058:

1059: \bibitem{refxy}

1060: Schweinberger, M. and Snijders, T.A.B. (2003).  Setting in Social Networks:

1061: A Measurement Model, {\em Sociological Methodology}, 33: 307--342.

1062:

1063: %\bibitem{ref21}

1064: %W.S. Torgerson,

1065: %Theory and Methods of Scaling (Wiley, New York, 1958).

1066:

1067: %\bibitem{ref22}

1068: %C.J. van Rijsbergen, Information Retrieval, 2nd ed.

1069: %(Butterworths, 1979).

1070:

1071: %\bibitem{ref23}

1072: %A. Trusina, S. Maslov, P. Minnhagen and K. Sneppen,

1073: %``Hierarchy measures in complex networks'',

1074: %Physical Review Letters {\bf 92}, 178702(4) (2004).

1075:

1076: \end{thebibliography}

1077:

1078: \end{document}

1079:

1080:

1081:

1082: