0504:cs0504072/chow.tex

1: %\documentstyle[aps,epsf,rotate,preprint]{revtex}

2: %\documentstyle[aps,epsf,rotate,multicol]{revtex}

3: %\documentclass[a4paper,10pt,pre,twocolumn,showpacs,aps,floats,floatfix,superscriptaddress]{revtex4}

4: \documentclass[10pt,pre,twocolumn,aps,floats,floatfix,superscriptaddress]{revtex4}

5:

6: %\usepackage{epsfig}

7: \usepackage{times}

8: \usepackage{helvet}

9: \usepackage{courier}

10: \usepackage{graphicx}

11: \usepackage{subfigure}

12:

13:

14: \begin{document}

15:

16: \title{Knowledge Representation Issues in Semantic Graphs for Relationship Detection}

17:

18: \author{Marc Barth\'elemy\footnote{Authors listed alphabetically.}}

19: \affiliation{CEA-Centre d'Etudes de Bruy\`eres-Le-Ch\^atel  \\

20: D\'epartement de Physique Th\'eorique et Appliqu\'ee\\

21: BP12, 91680 Bruy\`{e}res-Le-Ch\^{a}tel Cedex, France \\

22: }

23:

24: \author{Edmond Chow}

25: \affiliation{Center for Applied Scientific Computing \\

26: Lawrence Livermore National Laboratory \\

27: Box 808, L-560, Livermore, CA 94551, USA \\

28: }

29:

30: \affiliation{Biodefense Knowledge Center,

31: Lawrence Livermore National Laboratory.}

32:

33: \author{Tina Eliassi-Rad}

34: \affiliation{Center for Applied Scientific Computing \\

35: Lawrence Livermore National Laboratory \\

36: Box 808, L-560, Livermore, CA 94551, USA \\

37: }

38:

39: \affiliation{Biodefense Knowledge Center,

40: Lawrence Livermore National Laboratory.}

41:

42: \begin{abstract}

43: %\begin{quote}

44: An important task for Homeland Security is the prediction of

45: threat vulnerabilities, such as through the detection of

46: relationships between seemingly disjoint entities. A structure

47: used for this task is a \emph{semantic graph},

48: also known as a \emph{relational data graph} or an

49: \emph{attributed relational graph}.  These graphs encode relationships as

50: {\em typed} links between a pair of {\em typed} nodes.

51: Indeed, semantic graphs are very similar to semantic networks used in AI.

52: The node and link types are related through

53: an \emph{ontology} graph (also known as a \emph{schema}).

54: Furthermore, each node has a set of attributes associated

55: with it (e.g., ``age'' may be an attribute of a node of type

56: ``person''). Unfortunately, the selection of types and attributes for

57: both nodes and links depends on human expertise and

58: is somewhat subjective and even arbitrary. This subjectiveness

59: introduces biases into any algorithm that operates on

60: semantic graphs. Here, we raise some knowledge

61: representation issues for semantic graphs and provide some

62: possible solutions using recently developed ideas in the field

63: of complex networks.  In particular, we use the concept of

64: transitivity to evaluate the relevance of individual links in the

65: semantic graph for detecting relationships.

66: We also propose new statistical measures

67: for semantic graphs and illustrate these semantic measures on

68: graphs constructed from movies and terrorism data.

69: %\end{quote}

70: \end{abstract}

71:

72:

73:

74: \maketitle

75:

76: %\author{Marc Barth\'elemy} \affiliation{CEA-Centre d'Etudes de

77: %Bruy{\`e}res-le-Ch{\^a}tel, D\'epartement de Physique Th\'eorique et

78: %Appliqu\'ee BP12, 91680 Bruy\`eres-Le-Ch\^atel, France}

79:

80:

81:

82: %-----------------------------------------------------------------

83: \section{Introduction}

84:

85: A semantic graph is a network of {\em heterogeneous} nodes and links.

86: In contrast to the usual mathematical description of a graph, semantic

87: graphs have different types of nodes, and in general, different types

88: of links.  Semantic graphs are also called attributed relational graphs \cite{coffman:2004}

89: and relational data graphs (in the knowledge discovery literature).

90: The power of these graphs lies not only in their structure

91: but also in the semantic information that resides on their nodes and links.

92: Examples of semantic graphs include citation networks where the nodes do

93: not simply consist of papers, but also consist of

94: authors, institutions, journals,

95: and conferences.  Another example is the Internet Movie Database

96: where the nodes may be persons (actors, directors, etc.),

97: movies, studios, and awards, among others.  In Homeland Security,

98: these graphs are used in a variety of information analysis tasks

99: \cite{jensen.2003,coffman:2004,popp.2004,DHS-DSW:2004}.

100: In particular, such graphs

101: may be used for predicting threat vulnerabilities.

102:

103: Data for semantic graphs come from relations parsed from text documents

104: and/or data from relational databases.  Our motivation for this

105: work comes from our experience in constructing semantic graphs

106: from two sources of data---movies data and terrorism data---to be discussed

107: at the end of this paper.

108: In both these cases, we were faced

109: with a wide variety of choices:  what are the node types, what

110: are the link types, and how do these choices affect the algorithms

111: that we intend to use on these graphs?

112:

113: Several types of algorithms operating on semantic graphs

114: are of interest to us.  For example,

115: to determine the nature of a possible relationship

116: between two entities, a subgraph consisting of the shortest paths

117: (or paths selected by another metric) between two nodes in the semantic graph

118: may be constructed and examined \cite{faloutsos:2004}.

119: We refer to this process as {\em relationship detection}.

120: Fast algorithms based on heuristic search

121: (which improve on breadth-first search or bi-directional search)

122: are available for this task; some use the

123: semantic information in the graph and others do not \cite{eliassi-rad-tr:2004,chow-tr:2004}.

124: These algorithms,

125: however, depend on knowing which links (or link types) in the semantic graph

126: are useful for detecting relationships.  For example,

127: two people who share a connection to ``San Francisco'' because they

128: were born there are unlikely to have any real-life connection.  One of the goals

129: of this paper is to present automatic algorithms for determining

130: which links are useful for relationship detection, and to present

131: concepts that help answer related questions.

132:

133: In the past few years, a new field called {\em complex networks}

134: (see, e.g., \citeauthor{albert:2002} (2002) and

135: \citeauthor{newman:2003a} (2003)) has

136: emerged to study the structure of real-world networks.

137: Statistical tools for characterizing graphs and networks have been

138: developed, with the impetus of understanding the relationship

139: between the structure and function of networks.  Computer techniques have

140: allowed these statistical measurements to be performed on very large

141: real-world networks.  In this paper

142: we generalize some of these techniques in order to apply them

143: to semantic graphs.  For example, some types of nodes in semantic

144: graphs can be connected to many other types of nodes, but generally

145: have few actual links.  We quantify this concept and hypothesize that

146: nodes such as these are not useful for relationship detection.

147: In addition, the concept of {\em transitivity} in social network analysis

148: (called {\em clustering coefficient} in the complex networks literature)

149: helps determine

150: which links are useful for relationship detection.

151:

152: In the following, we begin by describing semantic graphs and ontologies.

153: We then use the concept of transitivity for evaluating links and link

154: types for relationship detection.  An important aspect of this paper is

155: a presentation of new statistical measures for semantic graphs, as well

156: as issues related to the scale (level of detail) of semantic graphs.

157: Examples of semantic graphs for movies and terrorism data are

158: given near the end of the paper.

159:

160: \section{Semantic Graphs and Ontologies}

161:

162: A semantic graph

163: consists of nodes and directed links, with each

164: node having a {\em type} (e.g., movie).  The set of types is usually

165: small compared to the number of nodes.  Each node is also labeled

166: with one or more {\em attributes} that identify the specific node

167: (e.g., {\em Shrek}) or give additional information about that node

168: (e.g., gross income).  Links may also have types; for example, the

169: (person $\rightarrow$ movie) link may be of type ``acted-in'' or

170: ``directed.''  (In this case, multigraphs, or graphs that may have

171: multiple links between the same pair of nodes, are possible.)  In some

172: semantic graphs, the meaning of a link between any two nodes is clear (although

173: different between different pairs of node types), and no link types need

174: to be defined.  Finally, links may also have attributes.

175: For additional details, see

176: \citeauthor{sowa:1984} (1984).

177: %quillian:1968,lenat:1995,reed:2002,shapiro:2000a,woods:1975}.

178:

179: Depending on the types of nodes and links and on the available

180: information, certain relations can or cannot exist.  The set of

181: relations that can exist in a given semantic graph can be described by an

182: auxiliary graph called an {\em ontology,}

183: or a {\em schema} \cite{jensen:2002}.  More often, an ontology graph

184: is created first by defining the types of relations that the semantic graph

185: will encode.

186: A small example of an ontology is given in Figure \ref{fig:example},

187: showing three node types: person, meeting and city.

188:

189: Special links in an ontology graph could describe {\em is-a} and {\em part-of}

190: relationships among node types.  This is a node type hierarchy that will be

191: briefly mentioned when we discuss the scale of semantic graphs.

192:

193: \begin{figure}

194: \begin{center}

195: \includegraphics[width=5.0cm]{example.eps}

196: \caption{A small ontology consisting of three node types.}

197: \label{fig:example}

198: \end{center}

199: \end{figure}

200:

201: \section{Transitivity for Evaluating Nodes and Edges}

202:

203: Consider a node ``San Francisco'' of type ``city'' in a semantic graph,

204: and suppose we have a database of people which includes city of birth

205: among the data fields.  A node ``Alice'' of type ``person'' may be

206: linked to the node ``San Francisco'' if Alice was born in San Francisco.

207: Other nodes linked to the San Francisco node are thereby related

208: to San Francisco and, in turn, to Alice.  However,

209: it is not clear that such relationships give useful information

210: about Alice since most entities a short graph distance away from ``Alice''

211: will have no real-life connection to Alice.

212:

213: On the other hand, people born in a city such as

214: ``Tikrit'' may have a much higher likelihood of

215: knowing each other; that is, it may be important in this case to be able to

216: associate two people

217: through their city of birth.  Instead of using a human

218: with potential biases to evaluate nodes

219: and links, an automatic procedure is

220: desirable for objectively determining which nodes and links

221: should be used in the semantic graph for relationship detection.

222:

223: Another example is nodes of type ``date.''

224: Dates could represent birthdates, dates of meetings, etc.

225: For example, a node for a person born on 9-11-2001 may be linked to a node

226: labeled ``9-11-2001.''

227: However, the fact that two events share a date

228: rarely implies that the events are related.  Our bias is to

229: treat a date as an attribute of a node, rather than as its

230: own node (with the type ``date'').

231: Topologically, a ``date''

232: node may be connected to many other {\em types} of nodes, but generally

233: each date node is connected to only a small number of other nodes.

234: This may be an unbiased indication that a date is not useful for relationship

235: detection.

236:

237: \subsection{The transitivity concept}

238:

239: The concept of link transitivity is useful to address some of

240: the above issues.  If a node $i$ has a link to node $j$ and node $j$

241: has a link to node $k$, then a measure of transitivity in the network

242: is the probability that node $i$ has a link to node $k$.  In social

243: networks and many other networks categorized as {\em small-world} networks,

244: this probability is high.  This is natural in social networks because

245: a friend of a friend is also a friend in a proportion that is much higher

246: than in a random network.  In general, we refer to $j$ as a {\em neighbor}

247: of $i$ if $i$ and $j$ are directly connected in a graph.  Also, we

248: refer to the {\em degree} of a node as the number of neighbors it has.

249:

250: The concept of transitivity is quantified as follows.

251: The {\em clustering coefficient} of a node, denoted by $C(i)$,

252: is a measure of the connectedness between the neighbors of the node.

253: Let $k_i$ denote the degree of node $i$, and let

254: $E_i$ denote the number of links between the $k_i$ neighbors.

255: Then, for an undirected graph, the quantity \cite{watts:1998}

256: \begin{equation}

257: C(i) = \frac{E_i}{k_i(k_i - 1)/2}

258: \label{eq:cc}

259: \end{equation}

260: is the ratio of the number of links between

261: a node's neighbors to the number of links that can exist.

262: We define $C(i)$ to be 0 when $k_i$ is 0 or 1.

263: When $C(i)$ is averaged over all nodes in the graph, we have the clustering

264: coefficient for a graph.

265: Note that

266: a high average clustering coefficient does {\em not} imply the existence of

267: clusters or communities (subgraphs that are internally

268: more highly connected than externally) in the graph.
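
As an illustration of Equation (\ref{eq:cc}), the following is a minimal Python sketch, not the implementation used for the results in this paper. It assumes an undirected graph stored as a dictionary that maps each node to the set of its neighbors; the names \texttt{clustering\_coefficient}, \texttt{average\_clustering}, and the toy graph are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, i):
    """C(i) = E_i / (k_i (k_i - 1) / 2) for an undirected graph.

    adj maps each node to the set of its neighbors (an assumed
    representation chosen for this sketch).
    """
    neighbors = adj[i] - {i}
    k = len(neighbors)
    if k < 2:
        return 0.0  # C(i) is defined to be 0 when k_i is 0 or 1
    # E_i: number of links that exist among the neighbors of i
    e = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return e / (k * (k - 1) / 2)


def average_clustering(adj):
    """Clustering coefficient of the graph: the mean of C(i) over all nodes."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)


# Toy graph: a triangle (a, b, c) plus a pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_a = clustering_coefficient(adj, "a")  # one link (b-c) out of three possible
c_avg = average_clustering(adj)
```

In the toy graph, node \texttt{a} has three neighbors of which only one pair is linked, so its coefficient is lower than that of \texttt{b} or \texttt{c}, whose two neighbors are linked to each other.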

269:

270: \subsection{Relevance of a node}

271:

272: We consider the problem of determining whether a node in a semantic graph

273: (e.g., ``San Francisco'' in a previous example)

274: is useful for relationship detection.  Consider a node $i$

275: which has links to many other nodes.

276: For now, we assume the links are of all the same type.

277: To evaluate whether or not $i$ is useful for relationship

278: detection, we examine whether or not the neighbors of $i$

279: are actually related in the semantic graph with high frequency.

280: Whether or not two neighbors are related is decided by whether

281: or not a link exists between the two neighbors.  (A weaker condition

282: if this does not hold is whether the two neighbors are linked

283: via a third node which is already deemed a useful node for

284: relationship detection.)  This leads to the use of the clustering

285: coefficient defined in Equation (\ref{eq:cc}) to measure

286: the relevance of a node $i$ with degree greater than 1.

287: The equation can be generalized so that $E_i$ counts links with

288: the weaker condition described above.

289: A threshold $\tau$ is needed: if $C(i) > \tau$, then $i$ is

290: a useful node.  If $i$ is not a useful node, {\em none} of the links

291: involving $i$ should be used for relationship detection, and all of them

292: could be removed from the semantic graph.  If these links are removed,

293: $i$ could be made an attribute of the nodes that $i$ originally linked

294: to, in order not to lose any information.

295:

296: The above can be generalized for semantic graphs

297: when $i$ is linked via many different

298: types of links.  In this case, instead of a count of relationships

299: involving pairs of neighbors of $i$, a matrix $M(t_1,t_2)$ is used

300: instead.  Here $M(t_1,t_2)$ counts the number of relationships

301: between pairs of neighbors $(a,b)$, where $a$ is linked to $i$ via type $t_1$

302: and $b$ is linked to $i$ via type $t_2$.  Small entries in this

303: matrix gives {\em pairs} of link types (associated with $i$)

304: that should not be traversed in relationship detection.
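
The matrix $M(t_1,t_2)$ can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: links are given as hypothetical \texttt{(u, v, link\_type)} triples treated as undirected, two neighbors count as related when any direct link exists between them, and the matrix is stored as a dictionary keyed by a sorted pair of link types.

```python
from collections import defaultdict
from itertools import combinations


def link_type_pair_counts(links, i):
    """Return M as a dict {(t1, t2): count} for node i.

    links is a list of (u, v, link_type) triples, treated as undirected
    (an assumed representation chosen for this sketch).
    """
    via = defaultdict(set)  # neighbor of i -> link types attaching it to i
    linked = set()          # unordered pairs of directly linked nodes
    for u, v, t in links:
        linked.add(frozenset((u, v)))
        if u == i and v != i:
            via[v].add(t)
        elif v == i and u != i:
            via[u].add(t)
    counts = defaultdict(int)
    for a, b in combinations(via, 2):
        if frozenset((a, b)) in linked:  # neighbors a and b are related
            for t1 in via[a]:
                for t2 in via[b]:
                    # Sorted key, so (t1, t2) and (t2, t1) share one entry.
                    counts[tuple(sorted((t1, t2)))] += 1
    return dict(counts)


# Hypothetical data: four people attached to the city node "SF".
links = [
    ("alice", "SF", "born-in"),
    ("bob", "SF", "born-in"),
    ("carol", "SF", "works-in"),
    ("dave", "SF", "works-in"),
    ("carol", "dave", "knows"),
    ("alice", "carol", "knows"),
]
M = link_type_pair_counts(links, "SF")
```

In this toy data no two people who share only a ``born-in'' link to the city are related, so that pair of link types has no entry in \texttt{M}, which is the kind of signal described above.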

305:

306: \subsection{Relevance of a link}

307:

308: The relevance of an existing or potential relationship between two

309: nodes $a$ and $b$ can be evaluated by how many neighbors they have in

310: common.  More precisely a relevance measure may be defined as

311: \begin{equation}

312: S(a,b) = \frac{|N(a,b)|}{|T(a,b)|}

313: \label{eq:strength}

314: \end{equation}

315: where

316: \[

317: N(a,b) = \left\{ w \mid w \mbox{ is linked to $a$ and $b$},

318:  w \ne a, w \ne b \right\}

319: \]

320: and

321: \[

322: T(a,b) = \left\{ w \mid w \mbox{ is linked to $a$ or $b$},

323:   w \ne a, w \ne b \right\}

324: \]

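325: with $|T(a,b)| = \mbox{deg}(a) + \mbox{deg}(b) - |N(a,b)|$,

326: where $\mbox{deg}(a)$ is the degree of $a$.  This identity assumes that $a$ and $b$ are not linked to each other; if they are, 2 must also be subtracted, since $a$ and $b$ themselves are excluded from $T(a,b)$.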

327: We have $0 \le S(a,b) \le 1$ with

328: large values of this relevance measure indicating a strong

329: relationship between $a$ and $b$ supported by a high proportion of

330: common neighbors.

331: This quantity is similar to the clustering coefficient and

332: can be generalized to involve neighbors $w$ farther from $a$

333: and $b$.
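
A minimal Python sketch of Equation (\ref{eq:strength}) follows, again assuming an undirected graph stored as a dictionary of neighbor sets. The function name \texttt{relevance} is hypothetical, and returning 0 when $T(a,b)$ is empty is an assumption of the sketch, since the text does not define $S$ in that case.

```python
def relevance(adj, a, b):
    """S(a, b) = |N(a, b)| / |T(a, b)| for an undirected graph.

    adj maps each node to the set of its neighbors (assumed representation).
    """
    na = adj[a] - {a, b}  # neighbors of a other than a and b
    nb = adj[b] - {a, b}  # neighbors of b other than a and b
    common = na & nb      # N(a, b): linked to both a and b
    either = na | nb      # T(a, b): linked to a or b
    if not either:
        return 0.0        # assumption: S is taken as 0 when T(a, b) is empty
    return len(common) / len(either)


# Toy graph: a and b are linked, share neighbor x, and each has one
# private neighbor (y and z, respectively).
adj = {
    "a": {"b", "x", "y"},
    "b": {"a", "x", "z"},
    "x": {"a", "b"},
    "y": {"a"},
    "z": {"b"},
}
s_ab = relevance(adj, "a", "b")  # N = {x}, T = {x, y, z}
s_yz = relevance(adj, "y", "z")  # no common neighbors
```

Because the sets are built explicitly, the sketch gives the same value whether or not $a$ and $b$ are themselves linked.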

334:

335: There are many applications of this relevance measure.  For example,

336: pairs of nodes with no existing link can be evaluated to check if

337: a latent link might exist.  In another example, the relevance measure

338: can be computed for all links of a given type.  A low average of this relevance

339: measure indicates that the given link type is not useful for

340: relationship detection; there is not a strong relation between nodes

341: incident on a link with the given type.

342: A high relevance measure for a link when the average relevance measure

343: for the link type is low (and vice-versa)

344: indicates an outlier that may be interesting

345: to investigate.  This relevance measure must be used carefully, however,

346: since it assumes that the links it uses confer bona fide

347: relationships.

348:

349: It must also be recognized that a low relevance measure for an individual link

350: does not imply that the link is unimportant.  On the contrary, the notion

351: of the ``strength of weak ties'' \cite{granovetter:1973}

352: suggests that these links

353: are critical in some sense.  It is when almost all links of the {\em same}

354: type have a low relevance measure (and the link type is not one such as

355: $a$ ``secretly knows'' $b$) that this link type should not be used in

356: relationship detection.

357:

358: %-----------------------------------------------------------

359: \subsection{Generalization of clustering coefficient for semantic graphs}

360:

361: The clustering coefficient defined earlier has little meaning for

362: semantic graphs as it mixes different types of nodes and it does not

363: include the constraints imposed by the ontology.

364: To illustrate this, consider the ontology for a semantic graph

365: given by Figure~\ref{fig:clustering_example}.

366: In this case, a node of type $\alpha$ can be connected to types

367: $\beta$, $\gamma$ and $\delta$, but a neighbor of type $\delta$ can

368: never be connected to neighbors of type $\beta$ or $\gamma$. In order

369: to avoid unrealistically small values of the clustering coefficient we

370: thus have to divide by the number of links actually {\it allowed} by

371: the ontology and obtain

372: \begin{equation}

373: C(i;\alpha)=\frac{E_i}{E(i;\alpha)}

374: \end{equation}

375: where $E(i;\alpha)$ denotes the maximum number of links allowed

376: by the ontology.
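
One possible reading of $E(i;\alpha)$ is the number of pairs of neighbors of $i$ whose types the ontology allows to be linked. The Python sketch below follows that reading; it is an assumption of the sketch, as are the representation of the ontology as a set of unordered type pairs and the name \texttt{semantic\_clustering}. The toy data mimics the situation of Figure~\ref{fig:clustering_example}, with $\beta$--$\gamma$ links assumed to be allowed.

```python
from itertools import combinations


def semantic_clustering(adj, node_type, allowed, i):
    """C(i; alpha) = E_i / E(i; alpha).

    allowed is a set of frozensets, each holding the one or two node types
    that the ontology permits to be linked (frozenset({"beta"}) would mean
    that beta-beta links are allowed).  Assumed representation.
    """
    neighbors = adj[i] - {i}
    e_actual = 0   # links present among neighbors of i
    e_allowed = 0  # neighbor pairs that the ontology allows to be linked
    for u, v in combinations(neighbors, 2):
        if frozenset((node_type[u], node_type[v])) in allowed:
            e_allowed += 1
            if v in adj[u]:
                e_actual += 1
    return e_actual / e_allowed if e_allowed else 0.0


node_type = {"i": "alpha", "b1": "beta", "g1": "gamma",
             "d1": "delta", "d2": "delta"}
adj = {
    "i": {"b1", "g1", "d1", "d2"},
    "b1": {"i", "g1"},
    "g1": {"i", "b1"},
    "d1": {"i"},
    "d2": {"i"},
}
allowed = {
    frozenset(("alpha", "beta")),
    frozenset(("alpha", "gamma")),
    frozenset(("alpha", "delta")),
    frozenset(("beta", "gamma")),
}
c_sem = semantic_clustering(adj, node_type, allowed, "i")
```

For node \texttt{i} the unconstrained coefficient of Equation (\ref{eq:cc}) would be $1/6$ (one link among six neighbor pairs), whereas only one neighbor pair is allowed by this toy ontology and it is linked.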

377:

378: \begin{figure}

379: \begin{center}

380: \includegraphics[width=3.0cm]{clustering_example.eps}

381: \caption{A particular ontology for which neighbors of $\alpha$ of

382: type $\delta$ can never be connected to neighbors of type $\beta$ or $\gamma$.}

383: \label{fig:clustering_example}

384: \end{center}

385: \end{figure}

386:

387: %-----------------------------------------------------------

388: \section{Statistical Measures for Semantic Graphs}

389:

390: Along with the clustering coefficient, two other relevant graph

391: properties that have been developed for standard (non-semantic) graphs

392: are {\em distributions of node degree} (number of neighbors of a node)

393: and {\em average path length} between any two nodes in the graph.

394: Together, these three graph properties can be useful for

395: studying the properties of a semantic graph for representing knowledge.

396:

397: Many real-world networks have a high clustering coefficient, much higher

398: than $O(1/n)$ for random graphs, where $n$ is the number of nodes in

399: the graph.  We believe that properly constructed semantic graphs must also

400: have moderately high clustering coefficients.  Low values of clustering

401: coefficient may indicate that the linkage information in the semantic

402: graph is incomplete.  Very high values of clustering coefficient may

403: also indicate a poorly constructed semantic graph where all the nodes are

404: very highly linked to each other (the limit is a fully connected graph),

405: indicating little discrimination in how the nodes are connected.

406:

407: The average path length, $\ell$,

408: in a semantic graph must also not be too small (which

409: is also associated with very high clustering coefficients).  When the

410: average path length is small,

411: almost all nodes are approximately the same graph distance from each other,

412: giving little discriminatory ability to path-length based algorithms

413: for detecting relationships.

414:

415: For example, an ontology graph may contain a node (e.g., a node of

416: type ``provenance'') to which every other node in the ontology is linked.

417: In this case,

418: the maximum shortest path length in the ontology graph is 2,

419: which also suggests that the average path length in the semantic graph is

420: small.  It may be useful to identify nodes or links in the ontology

421: graph that dramatically shorten the average path length.  These nodes

422: and links are potentially not useful for relationship detection.

423:

424: The connectivity distribution $P(k)$

425: is of interest for semantic graphs, particularly the existence of

426: nodes with very high degree, as in the case of scale-free

427: networks~\cite{barabasi:1999,amaral:2000}.  In a relationship detection

428: path search, paths through very high degree nodes are deemed less informative

429: \cite{faloutsos:2004}.  For example,

430: in a social network, two people who know a popular person

431: are less likely to know each other; the linkages to the popular person

432: should be disregarded in the relationship detection search since they

433: may confer erroneous relationships.

434:

435: It is believed that power-law connectivity distributions arise when

436: there is little or no cost involved in the formation of links in the

437: network \cite{amaral:2000}.  Without this property, no nodes would be able to

438: acquire a very large number of links.  This may suggest that a graph

439: with power-law degree distribution may contain many weak linkages.

440: However, these weak linkages cannot be disregarded; cf.\ the strength of weak ties,

441: mentioned above.

442:

443: For semantic graphs, we showed above how to extend the concept of

444: clustering coefficient.  In the next subsections, we extend

445: other graph concepts to semantic graphs.

446:

447: \subsection{Extension of node degree}

448:

449: Even in the simple case of connectivity, a given value

450: $k$ of the connectivity of a node of type $\alpha$ has no real meaning

451: for semantic graphs. Indeed, as shown in Figure~\ref{fig:k_example} the

452: topological connectivity in both cases is $k=4$ but the meaning of it

453: is very different in each case.

454:

455: \begin{figure}

456: \begin{center}

457: \includegraphics[width=5.0cm]{connectivity_example.eps}

458: \caption{Two examples for which the $\alpha$-type node has

459: topological connectivity $k=4$ but with a different meaning in each case;

460: cf.~\citeauthor{jensen:2002} (2002).}

461: \label{fig:k_example}

462: \end{center}

463: \end{figure}

464:

465: In the first case, the environment is very homogeneous while it is not

466: in the second case. Another complexity comes from the

467: fact that the number of $\beta$-type nodes can be very large thus

468: inducing a bias in the connectivity of the other nodes.

469:

470: The ontology implies that each node of type $\alpha$ can be connected

471: to a certain number, $k^{0}_{\alpha}$, of other

472: types. In the semantic graph, we have a total number of nodes

473: $n=\sum_{\alpha}n_{\alpha}$, where $n_{\alpha}$ is the number of nodes of type $\alpha$, and we denote the nodes by

474: $i=1,\dots,n$. The type of a node is given by the function $t(i)$. We

475: denote by $k_{\alpha\beta}(i)$ the number of neighbors of type $\beta$

476: of a node $i$ of type $\alpha$. The usual topological connectivity of

477: the node $i$ (which is of type $\alpha$) is then given by

478: \begin{equation}

479: k_{\alpha}(i)=\sum_{\beta}k_{\alpha\beta}(i).

480: \end{equation}

481: Using this quantity, we can define the average connectivity

482: of type $\alpha$ which is just the average over all nodes with

483: type $\alpha$ as

484: \begin{equation}

485: \overline{k_{\alpha}}=\frac{1}{n_{\alpha}}\sum_{i,\; t(i)=\alpha}k_{\alpha}(i).

486: \end{equation}

487:

488: If we want to compare the different types relative to their

489: connectivity, it is important to remember that some types can be

490: connected to many others (such as persons, which can be linked to

491: other persons, cities, meetings, jobs, etc.) while other types are

492: only linked to one type (such as a conference which takes place only at

493: one location). In order to compare the different types we thus have to

494: rescale by the number of different neighbor types they can have according to

495: the ontology:

496: \begin{equation}

497: m_{\alpha}=\frac{\overline{k_{\alpha}}}{k^{0}_{\alpha}}.

498: \end{equation}

499:

500: This quantity indicates the average number of neighbors {\it per

501: type}. This quantity however does not tell us if there are large

502: connectivity fluctuations or if in contrast all nodes of a given type

503: have essentially the same connectivity. We thus have to measure the

504: connectivity variance {\it per type} which is calculated using the second moment

505: \begin{equation}

506: \overline{k^{2}_{\alpha}}=\frac{1}{n_{\alpha}}\sum_{i,\; t(i)=\alpha}k^{2}_{\alpha}(i)

507: \end{equation}

508: with the dispersion per type given by

509: \begin{equation}

510: \sigma^{k}_{\alpha}=\frac{[\overline{k^{2}_{\alpha}}-(\overline{k_{\alpha}})^2]^{1/2}}

511: {k^{0}_{\alpha}}.

512: \end{equation}
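
The per-type quantities $\overline{k_{\alpha}}$, $m_{\alpha}$, and $\sigma^{k}_{\alpha}$ can be computed as in the following Python sketch. It assumes a dictionary of neighbor sets, a dictionary \texttt{node\_type} playing the role of $t(i)$, and a dictionary \texttt{k0} giving $k^{0}_{\alpha}$ for each type; these names and the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def degree_stats_by_type(adj, node_type, k0):
    """Return {alpha: (mean_k, m_alpha, sigma_k_alpha)}.

    k0[alpha] is the number of neighbor types that the ontology allows
    for type alpha (supplied by the caller in this sketch).
    """
    degrees = defaultdict(list)
    for i, nbrs in adj.items():
        degrees[node_type[i]].append(len(nbrs))  # k_alpha(i)
    stats = {}
    for alpha, ks in degrees.items():
        n_alpha = len(ks)
        mean_k = sum(ks) / n_alpha
        mean_k2 = sum(k * k for k in ks) / n_alpha  # second moment
        # Clamp at zero to guard against tiny negative rounding errors.
        var = max(mean_k2 - mean_k ** 2, 0.0)
        stats[alpha] = (mean_k, mean_k / k0[alpha], sqrt(var) / k0[alpha])
    return stats


# Toy graph: three persons and one city.
node_type = {"p1": "person", "p2": "person", "p3": "person", "c1": "city"}
adj = {
    "p1": {"p2", "c1"},
    "p2": {"p1", "p3", "c1"},
    "p3": {"p2", "c1"},
    "c1": {"p1", "p2", "p3"},
}
k0 = {"person": 2, "city": 1}  # persons link to persons and cities
stats = degree_stats_by_type(adj, node_type, k0)
```

Here the three persons have degrees 2, 3, and 2, so the person type has a nonzero dispersion, while the single city node has none.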

513:

514: Another possible way to characterize the connectivity distribution per

515: type is to plot the connectivity distribution. However, the dispersion

516: around the average is already a first indication of the nature of the

517: connections for different types. For some cases, the fluctuations

518: will be small, while for others they can be large

519: (such as the number of persons a person knows).

520: %This suggests that the nature

521: %of different networks between various types and spanning the

522: %semantic graph can be very different. The network of persons could be

523: %for example scale-free while for other types it can be well described

524: %by a simple random graph model.

525:

526: %-----------------------------------------------------------

527: \subsection{Disparity of connected types}

528:

529: The above quantities tell us the expected number of connections of a node of

530: a given type to another type

531: but not the correlations between different types. Indeed, a type

532: $\alpha$ can preferentially link to a type $\beta$ while it could

533: in principle also be linked to other types (as given by the ontology).

534:

535: We thus quantify the disparity (or affinity) of each

536: type to link to other types. In order to do this we use a convenient

537: quantity---denoted by $Y_2$---which was introduced in another

538: context~\cite{Derrida:1987,Barthelemy:2003a}. In order to understand the meaning

539: of this quantity let us consider an object that is broken into a number $N$

540: of parts, each part having a weight $w_i$. By construction $\sum_{i}w_i=1$

541: and $Y_2$ is given in this case by

542: \begin{equation}

543: Y_2=\sum_i[w_i]^2.

544: \end{equation}

545: If all parts have the same weight $w_i\sim 1/N$ then $Y_2\sim 1/N$ is

546: small (for large $N$). In contrast, if we have $w_1=1/2$ and the rest

547: are small, implying $w_{i\ne 1}\sim 1/[2(N-1)]$, then we obtain $Y_2\sim

548: 1/4$. This simple example can be easily generalized to more

549: complicated situations and shows that a small value of $Y_2$ indicates

550: a large number of relevant parts while a larger value (typically of

551: order $1/m$ where $m$ is of order unity) indicates the dominance of a

552: few parts.
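
To make the two limiting cases concrete, take $N=10$ (an arbitrary value chosen only for illustration). With equal weights $w_i=1/10$,
\[
Y_2 = 10\times\left(\frac{1}{10}\right)^2 = \frac{1}{10},
\]
whereas with $w_1=1/2$ and the remaining nine weights each equal to $1/18$,
\[
Y_2 = \left(\frac{1}{2}\right)^2 + 9\times\left(\frac{1}{18}\right)^2
    = \frac{1}{4}+\frac{1}{36} = \frac{5}{18}\approx 0.28 .
\]
The second value is already close to the large-$N$ limit of $1/4$ and is nearly three times the first, which reflects the dominance of a single part.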

553:

554: We now apply this idea to the number of types to quantify the

555: disparity of a node or the affinity of a type. The quantity $Y_2$ is first

556: defined for a given node $i$ of type $\alpha$

557: \begin{equation}

558: Y_2(i;\alpha)=\sum_{\beta}\left[\frac{k_{\alpha\beta}(i)}{k_{\alpha}(i)}\right]^2.

559: \end{equation}

560: In order to get results with statistical significance, we average this

561: quantity over all

562: nodes of the same type and we also compute its dispersion $\sigma^{Y}_{\alpha}$:

563: \begin{eqnarray}

564: \overline{Y}_2(\alpha)=\frac{1}{n_{\alpha}}\sum_{i,\; t(i)=\alpha}

565: Y_2(i;\alpha),\\

566: \sigma^{Y}_{\alpha}=\left[

567: \overline{Y_2^{2}(\alpha)}-(\overline{Y}_2(\alpha))^2

568: \right]^{1/2}.

569: \end{eqnarray}

570:

571: These results must however be weighted by the fact that some types are

572: more numerous than others which could be a reason why they appear

573: more often than others. For a given type $\alpha$, we denote by ${\cal

574: V}(\alpha)$ the set of types which can be connected to $\alpha$ as

575: given by the ontology. If a node has $k$ neighbors, and if these

576: neighbors are picked at random in the set of different nodes with

577: population $n_{\beta}$, we then obtain a disparity given by

578: \begin{equation}

579: Y^{r}_2=\sum_{\beta\in{\cal V}(\alpha)}\left[\frac{n_{\beta}}{n}\right]^2.

580: \end{equation}

581: Again, this quantity will be very small if all types are uniformly

582: present in the semantic graph, with $Y^{r}_2\sim 1/N$ (where $N$ is the

583: total number of different types); if it is of order unity, then

584: essentially a few types are over-represented. In order to take these

585: heterogeneities into account it is thus necessary to rescale

586: $\overline{Y}_2(\alpha)$ by $Y^{r}_2$ and to form the factor

587: \begin{equation}

588: R(\alpha)=\frac{\overline{Y}_2(\alpha)}{Y^{r}_2}

589: \end{equation}

590: and its corresponding dispersion,

591: \begin{equation}

592: \sigma^{R}_{\alpha}=\frac{\sigma^{Y}_{\alpha}}{Y^{r}_2}.

593: \end{equation}
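
The disparity quantities can be sketched in Python as follows. This is a sketch under stated assumptions, not a definitive implementation: the graph is a dictionary of neighbor sets, \texttt{node\_type} plays the role of $t(i)$, the set ${\cal V}(\alpha)$ is passed in by the caller, isolated nodes are skipped because $Y_2(i;\alpha)$ is undefined for them, and $R(\alpha)$ is formed from the type-averaged disparity. All function names are hypothetical.

```python
from collections import Counter


def node_disparity(adj, node_type, i):
    """Y_2(i; alpha) = sum over beta of (k_{alpha beta}(i) / k_alpha(i))^2."""
    counts = Counter(node_type[j] for j in adj[i])  # k_{alpha beta}(i)
    k = sum(counts.values())                        # k_alpha(i)
    if k == 0:
        return None  # undefined for an isolated node (skipped below)
    return sum((c / k) ** 2 for c in counts.values())


def type_disparity(adj, node_type, alpha, neighbor_types):
    """Return (mean Y_2, sigma^Y, Y_2^r, R, sigma^R) for type alpha.

    neighbor_types is V(alpha), the types the ontology allows alpha to
    link to.  Assumes at least one non-isolated node of type alpha.
    """
    values = []
    for i in adj:
        if node_type[i] == alpha:
            y = node_disparity(adj, node_type, i)
            if y is not None:
                values.append(y)
    mean_y = sum(values) / len(values)
    mean_y2 = sum(y * y for y in values) / len(values)
    sigma_y = max(mean_y2 - mean_y ** 2, 0.0) ** 0.5
    n = len(node_type)                  # total number of nodes
    pop = Counter(node_type.values())   # n_beta for every type
    y_random = sum((pop[beta] / n) ** 2 for beta in neighbor_types)
    return mean_y, sigma_y, y_random, mean_y / y_random, sigma_y / y_random


# Toy graph: two persons, two movies, one city.
node_type = {"p1": "person", "p2": "person",
             "m1": "movie", "m2": "movie", "c1": "city"}
adj = {
    "p1": {"m1", "m2", "c1"},
    "p2": {"m1", "c1"},
    "m1": {"p1", "p2"},
    "m2": {"p1"},
    "c1": {"p1", "p2"},
}
mean_y, sigma_y, y_random, r, sigma_r = type_disparity(
    adj, node_type, "person", {"movie", "city"})
```

In the toy graph, \texttt{p1} has two movie neighbors and one city neighbor, giving $Y_2 = 5/9$, while \texttt{p2} has one of each, giving $Y_2 = 1/2$.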

594:

595: A large value (larger than one) of $R(\alpha)$ indicates that type

596: $\alpha$ preferentially links to a small number of types and that

597: its neighbor types ${\cal V}(\alpha)$ are diverse in number.

598: If $R\ll 1$, the type

599: $\alpha$ may still be preferentially connected to a small set of types

600: but the diversity of the numbers of each neighbor type is small.

601:

602: The dispersion $\sigma^{R}_{\alpha}$ indicates whether the behavior as

603: described by the average value $R(\alpha)$ is typical, or if in

604: contrast there is large diversity among the nodes of type $\alpha$.

605:

606: Other usual quantities that are measured in order to

607: characterize a large network can also be generalized without any

608: difficulty. For example, degree distributions should be examined by

609: type of node.  In a semantic graph, the overall degree distribution

610: may not be meaningful, but the degree distribution for a specific

611: node type may, for example, follow a power law.

612: As a further example, the average path length generalizes to become a matrix

613: $\ell_{\alpha\beta}$, where $\alpha$ is the type of the source node of the

614: shortest paths and $\beta$ is the type of the target node.

615: This matrix will in general have

616: entries with very different values.
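One possible explicit form, given here only as a sketch since the choice of normalization is an assumption, is
\begin{equation}
\ell_{\alpha\beta}=\frac{1}{|P_{\alpha\beta}|}\sum_{(i,j)\in P_{\alpha\beta}} d(i,j) ,
\end{equation}
where $d(i,j)$ is the shortest-path distance from node $i$ to node $j$, and $P_{\alpha\beta}$ is the set of ordered pairs of distinct, connected nodes $(i,j)$ such that $i$ is of type $\alpha$ and $j$ is of type $\beta$. With this convention the matrix is symmetric for an undirected graph, and the usual average path length is recovered as the average of the entries $\ell_{\alpha\beta}$ weighted by $|P_{\alpha\beta}|$.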

617:

618: %%-----------------------------------------------------------

619: %\subsection{Extension of average path length}

620: %

621: %In the same flow of ideas, it is easy to generalize the important

622: %quantity which is the betweenness centrality to a semantic graphs.

623: %For topological networks, this quantity counts the fraction of shortest paths

624: %that goes through a given node

625: %\begin{equation}

626: %g(v)=\sum_{i\ne j}\frac{\sigma_{ij}(v)}{\sigma_{ij}}

627: %\end{equation}

628: %where $\sigma_{ij}$ denotes the number of shortest paths going from

629: %$i$ to $j$ and where $\sigma_{ij}(v)$ denotes the number of shortest

630: %paths going from $i$ to $j$ through $v$. The natural generalization to

631: %semantic graphs is then

632: %\begin{equation}

633: %g_{\beta\gamma}(v;\alpha)=\sum_{i, t(i)\beta\ne j, t(j)=\gamma}

634: %\frac{\sigma_{ij}(v)}{\sigma_{ij}}

635: %\end{equation}

636: %which means that we consider only shortest paths from a node of type

637: %$\beta$ to a node of type $\gamma$ (while the node $v$ is of type

638: %$\alpha$).

639:

640: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

641: \section{Scale in Semantic Graphs}

642:

643: Given a knowledge base of relational data, the choice of ontology

644: depends on what information needs to be captured in the semantic graph,

645: and how easily certain information needs to be retrieved.

646: The level of detail (or scale) chosen for the ontology

647: (choice of node and link types)

648: will have a direct impact on the properties of the corresponding

649: semantic graph.

650:

651: In the simplest ontology, we have nodes of only one type.

652: In the example of the movies database, this ontology

653: is a simple network of actors without any types and two actors are

654: connected if they played in the same movie.  At the next finer

655: scale, we have actors and movies as node types.  In this

656: case, the ontology links an actor to a movie if the actor played in that

657: movie.  This is a special case of a semantic graph

658: which is a {\em bipartite} network (two types of nodes, with links only between

659: the two types).

660: %The ontologies and graphs for

661: %these two scales are illustrated in Figure~\ref{fig:example_actor}.

662: Coarser models lose some of the information present in finer models

663: but can be useful for

664: large-scale computations, such as multi-level search techniques.

665:

666: At the finest scale of a terrorist network, we may have nodes

667: of type ``Religious Terrorist Organization'' and ``Political Terrorist

668: Organization.''  A coarser model may aggregate nodes of these

669: two types into a new type, ``Terrorist Organization'' (or the

670: aggregation may occur directly if a type hierarchy is available).

671: Depending on what information needs to be preserved, it may or may not be

672: important to distinguish between these two node types

673: at the structural level of the semantic graph.

674:

675: We note that in Homeland Security tasks,

676: data analysis more often involves searching for outliers rather

677: than commonplace patterns.  Thus it is essential that the fine-scale

678: data is retained and the coarse-scale data is used

679: appropriately (for example, as an aid in managing and processing

680: large-scale data).

681:

682: %The semantic graph may be examined to determine whether or not it is

683: %appropriate to coalesce two node types into a single node type.

684: %

685: %Two nodes are structurally similar if they

686: %We first define the structural similarity of two nodes as the

687: %number of neighbors they have in common

688: %

689: %Algorithm: average structural similarity may be used to aggregate nodes.

690: %

691: %We also have the scale of the semantic graph.  Here, we can collapse

692: %along the type hierarchy.

693: %

694: %Two nodes can be coalesced if they have a high degree of structural

695: %similarity.

696:

697: %\begin{figure}

698: %\begin{center}

699: %\includegraphics[width=6.0cm]{example_actor.eps}

700: %\caption{Two different scales for the movie actor network.}

701: %\label{fig:example_actor}

702: %\end{center}

703: %\end{figure}

704:

705: %-----------------------------------------------------------

706: \subsection{Effect of scale on statistical measures}

707:

708: Here we simply illustrate the effect of scale on the clustering coefficient.

709: We consider a random bipartite graph with Poisson distributed

710: numbers of both movies per actor (with average $\mu$) and actors per

711: movie (with average $\nu$). We suppose that we have $n_A$ actors and

712: $n_M$ movies. Since each link connects one actor to one movie, counting the

713: links from either side gives $n_A\mu=n_M\nu$, that is,

714: \begin{equation}

715: \frac{\mu}{n_M}=\frac{\nu}{n_A} .

716: \end{equation}

717:

718: This model can be considered as a ``null'' model since there are no

719: particular correlations here. If one computes the clustering coefficient of

720: the one-mode projection

721: of this network, one obtains~\cite{newman:2001a}

722: \begin{equation}

723: C=\frac{1}{\mu+1} .

724: \end{equation}

725: This quantity is finite even in the limit of very large networks

726: $n_{A,M}\to\infty$. This is in contrast with the usual random network for

727: which

728: \begin{equation}

729: C\sim \frac{1}{n}

730: \end{equation}

731: where $n$ is the number of nodes. One might therefore conclude that

732: the actor network is highly clustered and different from a random

733: network with no correlations. This conclusion is clearly incorrect,

734: however, since the large clustering coefficient here is simply a

735: consequence of the network construction procedure.
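To give a sense of the magnitudes involved (the numbers are illustrative assumptions, not measurements from our data), take $\mu=3$ movies per actor on average. The one-mode projection of the null model then has
\begin{equation}
C=\frac{1}{\mu+1}=\frac{1}{3+1}=0.25 ,
\end{equation}
independently of the network size, whereas for a usual random network with $n=10^4$ nodes the scaling $C\sim 1/n$ gives a value of order $10^{-4}$. The gap of more than three orders of magnitude arises entirely from the projection: every movie with $c$ actors becomes a fully connected group of $c$ actors in the projected network.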

736:

737: %This simple example shows that the way of constructing the network and

738: %the choice of the scale can be very relevant and should clearly specified in

739: %any discussion.

740:

741: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

742: \section{Examples}

743:

744: \subsection{Movies data}

745:

746: The ``Movies'' test data at the UCI KDD Archive contains information

747: about movies, persons (actors, directors, etc.), studios, awards, etc.

748: The data was originally compiled by Gio Wiederhold (Stanford University).

749: We used this data to construct an ontology and semantic graph to

750: express most of the information in the dataset.  Figure \ref{fig:imdb-ont}

751: shows the ontology graph that we developed.  In the figure, the meaning of

752: most of the links is obvious.  However, the person-person

753: link implies {\em married-to}, {\em lived-with}, or some other non-professional

754: relationship; the person-studio link implies {\em founded}; the movie-movie

755: link implies {\em sequel-to}.  We note that the data is very incomplete.

756:

757: \begin{figure}

758: \begin{center}

759: \includegraphics[width=2in]{movies_schema.eps}

760: \caption{Movies ontology.}

761: \label{fig:imdb-ont}

762: \end{center}

763: \end{figure}

764:

765: In this ontology, the best meaning of the node Role is unclear.

766: For example, are two actors linked to the same Role node in the semantic

767: graph if they

768: played the role of Villain in two different movies?  Alternatively,

769: a role node in the semantic graph may only link to actors playing

770: a given role in a single movie.  We arbitrarily chose the former in our case.

771:

772: A related question, which is structurally similar but semantically different,

773: is the following.  Should two actors who win a Best Actor award be linked to the

774: {\em same} Award node in the semantic graph?  In this case we did not choose

775: this interpretation since it seems that awards are individual entities,

776: whereas roles are not.

777:

778: Table \ref{tbl:imdb-results} summarizes the node types, frequencies,

779: and other statistical measures for the movies semantic graph.

780: The results show

781: high dispersion of average connectivity per type, for all types.

782: Further, the disparity of connected types is not particularly

783: different from a random model.  These indicate a relatively well-constructed

784: semantic graph; there are no particular correlations (given

785: the numbers of each node type) and thus the information

786: content in the graph is high.  The results will be very different

787: for the terrorism data.

788:

789: In the semantic graph,

790: the nodes with the largest clustering coefficients

791: depend on whether the types of the nodes are considered.  In the standard

792: case where the types are not considered, the node Maurice Barrymore

793: has a high clustering coefficient; the node is connected to Georgiana Drew

794: Barrymore, Lionel Barrymore, Ethel Barrymore, etc., all of whom

795: are connected to each other.  If node types are considered, then

796: missing links between neighbors of a node are not penalized when the

797: ontology does not permit those neighbors to be linked.  Nodes that

798: were missed by the untyped measure may now have a high clustering coefficient,

799: e.g., the movie {\em Dogma} (perhaps due to the idiosyncrasies of

800: the incomplete data).

801:

802: % low clust coef?

803:

804: In the semantic graph, the link between Columbia Pictures and

805: drama (genre) has the largest number of common neighbors (710).

806: However, when the link

807: relevance measure (Equation (\ref{eq:strength})) is used,

808: which accounts for the number of links a node has,

809: the link between Bud Abbott and Lou Costello is found (30 common neighbors).

810: (We also found re-releases of movies under a new name in this process.)

811: Further, a semantic version of relevance can be defined, which

812: considers only the links that are allowed by the semantic graph.

813: In this case, the link between Tokuma Studio and docu-drama is found.

814: (Tokuma is linked to drama and the movie {\em Carences}; docu-drama is

815: linked to {\em Carences} and Miramax; and Miramax is linked to drama.)

816:

817: We also computed the average relevance per link type for the semantic graph.

818: First, the link types of least frequency were Person-\emph{founded}-Studio

819: and Studio-\emph{located-in}-Country.  However, the links with lowest average

820: relevance per link were Movie-\emph{shot-in}-Country and

821: Award-\emph{awarded-in}-Country.  As mentioned, these latter links may be

822: least useful for automatic relationship detection.

823:

824: \begin{table}

825: \begin{center} \scriptsize

826: \begin{tabular}{|rl|r|rr|rr|} \hline

827:    & Node Type       & $n_\alpha$ &

828:         $m_\alpha$ & $\sigma_\alpha^k$ & $R(\alpha)$ & $\sigma_\alpha^R$ \\

829: \hline

830:  1 & Person          & 21504  &    0.872 &    2.383 &   1.836 &    0.663 \\

831:  2 & Movie           & 11540  &    1.131 &    0.816 &   1.299 &    0.644 \\

832:  3 & Award           &  6734  &    2.579 &   10.201 &   0.905 &    0.144 \\

833:  4 & Country         &    19  &  222.509 &  582.572 &   1.812 &    0.364 \\

834:  5 & Studio          &  1075  &    1.948 &    9.534 &   1.241 &    0.408 \\

835:  6 & Genre           &    39  &   77.803 &  160.060 &   0.512 &    0.154 \\

836:  7 & Role            &   115  &   25.561 &   64.164 &   0.924 &    0.028 \\

837:  8 & Distributor     &    16  &  206.156 &  356.043 &   0.782 &    0.165 \\

838: \hline

839: \end{tabular}

840: \caption{Node types and statistics for the movies data:  frequency of

841: node type $n_\alpha$, average connectivity per type $m_\alpha$ and

842: its dispersion $\sigma_\alpha^k$, disparity of connected types

843: $R(\alpha)$ and its dispersion $\sigma_\alpha^R$.

844: The results show

845: high dispersion of average connectivity per type, for all types.

846: Further, the disparity of connected types is not particularly

847: different from a random model.

848: }

849: \label{tbl:imdb-results}

850: \end{center}

851: \end{table}

852:

853: %Figure \ref{fig:bpdist} shows the degree distributions of the movie-actor

854: %bipartite graph.  When all nodes are considered together, the power-law

855: %relationship for the nodes of type Person are hidden.

856: %

857: %\begin{figure}

858: %\centering

859: %\subfigure[Person nodes (30226 nodes).]{\includegraphics[width=1.5in]{bpdist_person.eps}}

860: %\subfigure[Movie nodes (11561 nodes).]{\includegraphics[width=1.5in]{bpdist_movie.eps}}

861: %\subfigure[All nodes.]{\includegraphics[width=1.5in]{bpdist_all.eps}}

862: %\caption{Degree distributions of the movie-actor bipartite graph

863: %by node type.}

864: %\label{fig:bpdist}

865: %\end{figure}

866:

867: %{OLD:Before ROLE collapsed:Node types and statistics for the movies data.}

868: %

869: %\begin{table}

870: %\begin{center} \scriptsize

871: %\begin{tabular}{|rl|r|rr|rr|} \hline

872: %   & Node Type       & $n_\alpha$ &

873: %        $m_\alpha$ & $\sigma_\alpha^k$ & $R(\alpha)$ & $\sigma_\alpha^R$ \\

874: %\hline

875: % 1 & Person          & 30226  &   0.999  &    5.319 & 2.309 & 1.078 \\

876: % 2 & Movie           & 11561  &   2.071  &    1.625 & 1.342 & 0.588 \\

877: % 3 & Award           & 29759  &   1.357  &    4.070 & 0.740 & 0.188 \\

878: % 4 & Country         &    17  & 963.412  & 3336.693 & 1.027 & 0.126 \\

879: % 5 & Studio          &  1044  &   1.937  &    9.397 & 1.109 & 0.363 \\

880: % 6 & Genre           &   201  &  17.511  &   85.387 & 0.422 & 0.076 \\

881: % 7 & Role            & 46154  &   1.000  &    0.000 & 0.834 & 0.000 \\

882: % 8 & Distributor     &   111  &   4.716  &    4.640 & 0.769 & 0.193 \\

883: %\hline

884: %\end{tabular}

885: %\label{tbl:imdb-results}

886: %\end{center}

887: %\end{table}

888:

889: % following are degree distributions for the full imdb (not bipartite)

890: %

891: %\begin{figure}

892: %\centering

893: %\subfigure[All nodes.]{\includegraphics[width=2.5in]{dist_all.eps}}

894: %\subfigure[Person nodes.]{\includegraphics[width=2.5in]{dist_person.eps}}

895: %\\

896: %\subfigure[Movie nodes.]{\includegraphics[width=2.5in]{dist_movie.eps}}

897: %\subfigure[Award nodes.]{\includegraphics[width=2.5in]{dist_award.eps}}

898: %\caption{Degree distributions by node type.}

899: %\label{fig:1}

900: %\end{figure}

901:

902: %-----------------------------------------------------------

903: \subsection{Terrorism data}

904:

905: %Terrorism data is available from the Anti-Defamation League.

906:

907: Relational data about world-wide terrorist events is available,%

908: \footnote{Data available at http://ontology.teknowledge.com.}

909: as well

910: as ontologies describing the organization of this data \cite{niles:2001}.

911: From this data we constructed an ontology and semantic graph.

912: The 59 node types are shown in Table \ref{tbl:terrorism-types}.

913: The ontology is shown in Figure \ref{fig:terr-adjmat} as an adjacency

914: matrix.  The semantic graph contains 2366 nodes.

915:

916: %After removing isolated nodes from the data.

917:

918: \begin{table}

919: \renewcommand{\arraystretch}{0.6}

920: \begin{center} \scriptsize

921: \begin{tabular}{|rl|c||rl|c|} \hline

922:    & Type & $n_\alpha$ & & Type & $n_\alpha$ \rule[-1.0ex]{0pt}{3ex}\\

923: \hline

924:  1 & Nation                       &  92 & 31 & Shooting               & 445 \\

925:  2 & GeographicalRegion           &  85 & 32 & Bombing                & 323 \\

926:  3 & City                         & 555 & 33 & HostageTaking          &  14 \\

927:  4 & Building                     &  10 & 34 & IncendDeviceAttack     &  18 \\

928:  5 & Combustion                   &   0 & 35 & Lynching               &   3 \\

929:  6 & Destruction                  &   0 & 36 & SuicideBombing         & 107 \\

930:  7 & Device                       &   0 & 37 & CarBombing             & 114 \\

931:  8 & GeographicArea               &   3 & 38 & Arson                  &  15 \\

932:  9 & Government                   &   1 & 39 & HandgrenadeAttack      &  38 \\

933: 10 & GovernmentPerson             &   2 & 40 & Hijacking              &  15 \\

934: 11 & Group                        &   1 & 41 & RocketMissileAttack    &  14 \\

935: 12 & Hole                         &   1 & 42 & KnifeAttack            &  53 \\

936: 13 & Human                        &   6 & 43 & ChemicalAttack         &   9 \\

937: 14 & JoiningAnOrg                 &   0 & 44 & LetterBombAttack       &  10 \\

938: 15 & Killing                      &   0 & 45 & Stoning                &   3 \\

939: 16 & OccupationalRole             &   3 & 46 & VehicleAttack          &   7 \\

940: 17 & Region                       &   0 & 47 & MortarAttack           &   8 \\

941: 18 & SocialRole                   &   1 & 48 & Vandalism              &   4 \\

942: 19 & StationaryArtifact           &   1 & 49 & Other                  &   5 \\

943: 20 & UnilateralGetting            &   0 & 50 & Number                 & 120 \\

944: 21 & Vehicle                      &   1 & 51 & Continent              &   2 \\

945: 22 & ViolentContest               &   1 & 52 & GeneralStructure       &   6 \\

946: 23 & Weapon                       &   0 & 53 & Month                  &  12 \\

947: 24 & Proposition                  &   0 & 54 & GeneralBuilding        &   2 \\

948: 25 & BinaryPredicate              &   0 & 55 & GeneralHuman           &   2 \\

949: 26 & ForeignTerrOrg               &  28 & 56 & Airbase                &   2 \\

950: 27 & ReligiousOrg                 &   0 & 57 & Airport                &   3 \\

951: 28 & TerroristOrg                 &  53 & 58 & State                  &   4 \\

952: 29 & Infiltration                 &   8 & 59 & Railway                &   1 \\

953: 30 & Kidnapping                   & 155 &    &                        &     \\

954: \hline

955: \end{tabular}

956: \caption{Node types and their frequencies, $n_\alpha$, for the terrorism data.}

957: \label{tbl:terrorism-types}

958: \end{center}

959: \end{table}

960:

961: \begin{figure}

962: \begin{center}

963: \includegraphics[width=2in]{terr_adjmat.eps}

964: \caption{Adjacency matrix for the terrorism ontology.  The matrix

965: is used to determine which node types are allowed to link to a given type.}

966: \label{fig:terr-adjmat}

967: \end{center}

968: \end{figure}

969:

970: \begin{figure}

971: \begin{center}

972: \includegraphics[width=3.25in]{terr_deg.eps}

973: \caption{Terrorism data: average number of neighbors per type, $m_\alpha$.

974: Each error bar is of

975: length $\sigma_\alpha^k$ on each side of the average.  }

976: \label{fig:terr-deg}

977: \end{center}

978: \end{figure}

979:

980: \begin{figure}

981: \begin{center}

982: \includegraphics[width=3.25in]{terr_con.eps}

983: \caption{Terrorism data:

984: disparity of connected types, $R(\alpha)$.  Each error bar is of

985: length $\sigma_\alpha^R$ on each side.}

986: \label{fig:terr-con}

987: \end{center}

988: \end{figure}

989:

990: Figures \ref{fig:terr-deg} and \ref{fig:terr-con} plot the average number

991: of neighbors per type and the disparity of connected types, respectively.

992: Error bars are used to show the dispersion of the quantities.

993: We consider that frequencies of 50 or more in this data set are

994: statistically significant.  Thus, we consider types

995: 1, 2, 3, 28, 30, 31, 32, 36, 37, 42, and 50.

996: For all these types, the average number of neighbors per type is small.

997: The types, however, can be separated by their disparity.

998: Types 1, 2, 3, 28, and 50 have high disparity, i.e., they are connected

999: to many different types.  This is consistent with nodes of

1000: types 1, 2, and 3 being of type

1001: ``location,'' nodes of type 28 being of type ``terrorist organization,''

1002: and nodes of type 50 being of type ``number.''

1003: The remaining types are types of attacks and are not particularly

1004: correlated with any other node types (given the numbers of each node type).

1005: We note in this case

1006: that semantically similar node types

1007: have similar values of $m_\alpha$ and $R(\alpha)$.

1008:

1009:

1010: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1011: \section{Conclusion}

1012:

1013: This paper reveals some of the knowledge representation

1014: issues associated with semantic graphs.  Ideas from the field of complex

1015: networks have been applied and generalized to semantic graphs.

1016: For example, transitivity may be used to determine

1017: the relevance of edge types for relationship detection.

1018:

1019: We have defined several measures for statistically characterizing

1020: node types.  These quantities

1021: take into account the ontology which specifies the permitted connections in

1022: the semantic graph.

1023: Many other important measures can be defined,

1024: such as correlations with attribute {\em values}

1025: \cite{jensen:2002}, which were not covered in this paper.

1026: These and other tools can be useful to help design ontologies

1027: and semantic graphs for knowledge representation.

1028:

1029:

1030: %Many issues arise due to the existence of

1031: %different types on the nodes and links.  In standard

1032: %graphs without these types, coarser models, for example, are generally

1033: %built by clustering nodes (of the same type) that are likely to be

1034: %related.  What we have described is the very different case of

1035: %being able to cluster nodes based on the types of links that join them.

1036:

1037: %For example, we define

1038: %the tendency of a node type to link to a few or

1039: %to many other node types, and the average number of neighbors

1040: %of a node, taking into account the types to which it is permitted to link

1041: %(according to the ontology).  Using these measures, we found that node types

1042: %including dates, numbers, (e.g., number of

1043: %deaths in a terrorist event) and document-ID's (the original source of

1044: %the link data) are not particularly useful for relationship detection.

1045:

1046:

1047:

1048:

1049: %We summarize the general

1050: %procedure below:

1051: %

1052: %\begin{itemize}

1053: %\item{} First identify nodes with large $n_{\alpha}$. Only for these types,

1054: %a statistical analysis is meaningful and the analysis below is performed

1055: %for these types.

1056: %\item{} Compute $k_{\alpha}$ and its variance. This will show which

1057: %type is highly connected and what is the nature (scale-free or not for

1058: %example) of the network relatively to the different types.

1059: %\item{} Compute the quantity $Y_2(\alpha)$ and its dispersion leading to the quantity

1060: %$R(\alpha)$ and $\sigma^{R}(\alpha)$. These quantities indicate the

1061: %``disparity'' of each type (ie. they favorite connections if they

1062: %exist) and the variations among nodes of the same type.

1063: %\item{} Depending on the problem, one can also compute the clustering

1064: %coefficient per type as well as the centrality matrix.

1065: %\end{itemize}

1066:

1067:

1068: \section{Acknowledgments}

1069: We are pleased to

1070: thank Keith Henderson and David Jensen for helpful discussions.

1071: MB wishes to thank the Center for Applied Scientific Computing and

1072: the Institute for Scientific Computing Research at Lawrence Livermore

1073: National Laboratory for their hospitality during the formative stages

1074: of this work. This work was performed under the auspices of the U.S. Department

1075: of Energy by University of California Lawrence Livermore

1076: National Laboratory under contract No.~W-7405-ENG-48.

1077:

1078:

1079:

1080: \begin{thebibliography}{50}

1081:

1082: \bibitem[\protect\citeauthoryear{Albert \& Barabasi}{2002}]{albert:2002}

1083: Albert, R., and Barabasi, A.-L.

1084: \newblock 2002.

1085: \newblock Statistical mechanics of complex networks.

1086: \newblock {\em Reviews of Modern Physics} 74(1):47--97.

1087:

1088: \bibitem[\protect\citeauthoryear{Amaral \bgroup \em et al.\egroup

1089:   }{2000}]{amaral:2000}

1090: Amaral, L. A.~N.; Scala, A.; Barth{\'e}lemy, M.; and Stanley, H.~E.

1091: \newblock 2000.

1092: \newblock Classes of small-world networks.

1093: \newblock In {\em Proceedings of the National Academy of Sciences USA},

1094:   volume~97,  11149--11152.

1095: \newblock National Academy of Sciences.

1096:

1097: \bibitem[\protect\citeauthoryear{Barabasi \& Albert}{1999}]{barabasi:1999}

1098: Barabasi, A.-L., and Albert, R.

1099: \newblock 1999.

1100: \newblock Emergence of scaling in random networks.

1101: \newblock {\em Science} 286:509--512.

1102:

1103: \bibitem[\protect\citeauthoryear{Barth{\'e}lemy, Gondran, \&

1104:   Guichard}{2003}]{Barthelemy:2003a}

1105: Barth{\'e}lemy, M.; Gondran, B.; and Guichard, E.

1106: \newblock 2003.

1107: \newblock Spatial structure of the internet traffic.

1108: \newblock {\em Physica A} 319:633--642.

1109:

1110: \bibitem[\protect\citeauthoryear{Chow}{2004}]{chow-tr:2004}

1111: Chow, E.

1112: \newblock 2004.

1113: \newblock A graph search heuristic for shortest distance paths.

1114: \newblock Technical Report UCRL-JRNL-202894, Lawrence Livermore National

1115:   Laboratory.

1116:

1117: \bibitem[\protect\citeauthoryear{Coffman, Greenblatt, \&

1118:   Marcus}{2004}]{coffman:2004}

1119: Coffman, T.; Greenblatt, S.; and Marcus, S.

1120: \newblock 2004.

1121: \newblock Graph-based technologies for intelligence analysis.

1122: \newblock {\em Communications of the ACM} 47:45--47.

1123:

1124: \bibitem[\protect\citeauthoryear{Derrida \& Flyvbjerg}{1987}]{Derrida:1987}

1125: Derrida, B., and Flyvbjerg, H.

1126: \newblock 1987.

1127: \newblock Statistical properties of randomly broken objects and of multivalley

1128:   structures in disordered systems.

1129: \newblock {\em Journal of Physics {A}} 20(15):5273--5288.

1130:

1131: \bibitem[\protect\citeauthoryear{Eliassi-Rad \&

1132:   Chow}{2004}]{eliassi-rad-tr:2004}

1133: Eliassi-Rad, T., and Chow, E.

1134: \newblock 2004.

1135: \newblock A probabilistic approach to accelerating path-finding in large

1136:   semantic networks.

1137: \newblock Technical Report UCRL-CONF-202002, Lawrence Livermore National

1138:   Laboratory.

1139:

1140: \bibitem[\protect\citeauthoryear{Faloutsos, McCurley, \&

1141:   Tomkins}{2004}]{faloutsos:2004}

1142: Faloutsos, C.; McCurley, K.; and Tomkins, A.

1143: \newblock 2004.

1144: \newblock Fast discovery of connection subgraphs.

1145: \newblock In {\em Proceedings of the 10th ACM SIGKDD International Conference

1146:   on Knowledge Discovery and Data Mining},  118--127.

1147: \newblock Seattle, WA, USA: ACM Press.

1148:

1149: \bibitem[\protect\citeauthoryear{Granovetter}{1973}]{granovetter:1973}

1150: Granovetter, M.

1151: \newblock 1973.

1152: \newblock The strength of weak ties.

1153: \newblock {\em American Journal of Sociology} 78:1360--1380.

1154:

1155: \bibitem[\protect\citeauthoryear{Jensen \& Neville}{2002}]{jensen:2002}

1156: Jensen, D., and Neville, J.

1157: \newblock 2002.

1158: \newblock Data mining in social networks.

1159: \newblock In {\em Papers of the Symposium on Dynamic Social Network Modeling

1160:   and Analysis (Sponsored by National Academy of Sciences)}.

1161: \newblock Washington, DC, USA: National Academy Press.

1162:

1163: \bibitem[\protect\citeauthoryear{Jensen, Rattigan, \& Blau}{2003}]{jensen.2003}

1164: Jensen, D.; Rattigan, M.; and Blau, H.

1165: \newblock 2003.

1166: \newblock Information awareness: a prospective technical assessment.

1167: \newblock In {\em Proceedings of the ninth ACM SIGKDD international conference

1168:   on Knowledge discovery and data mining},  378--387.

1169: \newblock Washington, D.C.: ACM Press.

1170:

1171: \bibitem[\protect\citeauthoryear{Kolda \bgroup \em et al.\egroup

1172:   }{2004}]{DHS-DSW:2004}

1173: Kolda, T.; Brown, D.; Corones, J.; Critchlow, T.; Eliassi-Rad, T.; Getoor, L.;

1174:   Hendrickson, B.; Kumar, V.; Lambert, D.; Matarazzo, C.; McCurley, K.;

1175:   Merrill, M.; Samatova, N.; Speck, D.; Srikant, R.; Thomas, J.; Wertheimer,

1176:   M.; and Wong, P.~C.

1177: \newblock 2004.

1178: \newblock Data sciences technology for homeland security information management

1179:   and knowledge discovery.

1180: \newblock Technical Report UCRL-TR-208926, Lawrence Livermore National

1181:   Laboratory.

1182:

1183: \bibitem[\protect\citeauthoryear{Newman, Strogatz, \&

1184:   Watts}{2001}]{newman:2001a}

1185: Newman, M. E.~J.; Strogatz, S.~H.; and Watts, D.~J.

1186: \newblock 2001.

1187: \newblock Random graphs with arbitrary degree distributions and their

1188:   applications.

1189: \newblock {\em Physical Review E} 64(026118).

1190:

1191: \bibitem[\protect\citeauthoryear{Newman}{2003}]{newman:2003a}

1192: Newman, M. E.~J.

1193: \newblock 2003.

1194: \newblock The structure and function of complex networks.

1195: \newblock {\em SIAM Review} 45(2):167--256.

1196:

1197: \bibitem[\protect\citeauthoryear{Niles \& Pease}{2001}]{niles:2001}

1198: Niles, I., and Pease, A.

1199: \newblock 2001.

1200: \newblock Towards a standard upper ontology.

1201: \newblock In {\em Proceedings of the 2nd International Conference on Formal

1202:   Ontology in Information Systems (FOIS-2001)}.

1203:

1204: \bibitem[\protect\citeauthoryear{Popp \bgroup \em et al.\egroup

1205:   }{2004}]{popp.2004}

1206: Popp, R.; Armour, T.; Senator, T.; and Numrych, K.

1207: \newblock 2004.

1208: \newblock Countering terrorism through information technology.

1209: \newblock {\em Communications of the ACM} 47(3):36--43.

1210:

1211: \bibitem[\protect\citeauthoryear{Sowa}{1984}]{sowa:1984}

1212: Sowa, J.~F.

1213: \newblock 1984.

1214: \newblock {\em Conceptual Structures: Information Processing in Mind and

1215:   Machine}.

1216: \newblock Reading, MA: Addison-Wesley.

1217:

1218: \bibitem[\protect\citeauthoryear{Watts \& Strogatz}{1998}]{watts:1998}

1219: Watts, D.~J., and Strogatz, S.~H.

1220: \newblock 1998.

1221: \newblock Collective dynamics of small-world networks.

1222: \newblock {\em Nature} 393:440--442.

1223:

1224:

1225:

1226:

1227:

1228:

1229: \end{thebibliography}

1230:

1231:

1232: %\bibliographystyle{aaai}

1233: %\bibliography{aaai-ss05-kr}

1234: %\bibliography{CNGraph-jan05}

1235:

1236:

1237:

1238:

1239:

1240: \end{document}

1241: