cs0504072/chow.tex
1: %\documentstyle[aps,epsf,rotate,preprint]{revtex}
2: %\documentstyle[aps,epsf,rotate,multicol]{revtex}
3: %\documentclass[a4paper,10pt,pre,twocolumn,showpacs,aps,floats,floatfix,superscriptaddress]{revtex4}
4: \documentclass[10pt,pre,twocolumn,aps,floats,floatfix,superscriptaddress]{revtex4}
5: 
6: %\usepackage{epsfig}
7: \usepackage{times}
8: \usepackage{helvet}
9: \usepackage{courier}
10: \usepackage{graphicx}
11: \usepackage{subfigure}
12: 
13: 
14: \begin{document}
15: 
16: \title{Knowledge Representation Issues in Semantic Graphs for Relationship Detection}
17: 
18: \author{Marc Barth\'elemy\footnote{Authors listed alphabetically.}}
19: \affiliation{CEA-Centre d'Etudes de Bruy\`eres-Le-Ch\^atel  \\
20: D\'epartement de Physique Th\'eorique et Appliqu\'ee\\
21: BP12, 91680 Bruy\`{e}res-Le-Ch\^{a}tel Cedex, France \\ 
22: }
23: 
24: \author{Edmond Chow}
25: \affiliation{Center for Applied Scientific Computing \\ 
26: Lawrence Livermore National Laboratory \\ 
27: Box 808, L-560, Livermore, CA 94551, USA \\
28: }
29: 
30: \affiliation{Biodefense Knowledge Center,
31: Lawrence Livermore National Laboratory.}
32: 
33: \author{Tina Eliassi-Rad}
34: \affiliation{Center for Applied Scientific Computing \\ 
35: Lawrence Livermore National Laboratory \\ 
36: Box 808, L-560, Livermore, CA 94551, USA \\
37: }
38: 
39: \affiliation{Biodefense Knowledge Center,
40: Lawrence Livermore National Laboratory.}
41: 
42: \begin{abstract}
43: %\begin{quote}
44: An important task for Homeland Security is the prediction of
45: threat vulnerabilities, such as through the detection of
46: relationships between seemingly disjoint entities. A structure
47: used for this task is a \emph{semantic graph},
48: also known as a \emph{relational data graph} or an
49: \emph{attributed relational graph}.  These graphs encode relationships as
50: {\em typed} links between a pair of {\em typed} nodes.
51: Indeed, semantic graphs are very similar to semantic networks used in AI.
52: The node and link types are related through
53: an \emph{ontology} graph (also known as a \emph{schema}).
54: Furthermore, each node has a set of attributes associated
55: with it (e.g., ``age'' may be an attribute of a node of type
56: ``person''). Unfortunately, the selection of types and attributes for
57: both nodes and links depends on human expertise and 
58: is somewhat subjective and even arbitrary. This subjectiveness
59: introduces biases into any algorithm that operates on
60: semantic graphs. Here, we raise some knowledge
61: representation issues for semantic graphs and provide some 
62: possible solutions using recently developed ideas in the field
63: of complex networks.  In particular, we use the concept of
64: transitivity to evaluate the relevance of individual links in the
65: semantic graph for detecting relationships.
66: We also propose new statistical measures
67: for semantic graphs and illustrate these semantic measures on
68: graphs constructed from movies and terrorism data.
69: %\end{quote}
70: \end{abstract}
71: 
72: 
73: 
74: \maketitle
75: 
76: %\author{Marc Barth\'elemy} \affiliation{CEA-Centre d'Etudes de
77: %Bruy{\`e}res-le-Ch{\^a}tel, D\'epartement de Physique Th\'eorique et
78: %Appliqu\'ee BP12, 91680 Bruy\`eres-Le-Ch\^atel, France}
79: 
80: 
81: 
82: %-----------------------------------------------------------------
83: \section{Introduction}
84: 
85: A semantic graph is a network of {\em heterogeneous} nodes and links.
86: In contrast to the usual mathematical description of a graph, semantic
87: graphs have different types of nodes, and in general, different types
88: of links.  They are also called attributed relational graphs \cite{coffman:2004}
89: and relational data graphs (in the knowledge discovery literature).
90: The power of these graphs lies not only in their structure
91: but also in the semantic information that resides on their nodes and links.
92: Examples of semantic graphs include citation networks where the nodes do
93: not simply consist of papers, but also consist of 
94: authors, institutions, journals,
95: and conferences.  Another example is the Internet Movie Database
96: where the nodes may be persons (actors, directors, etc.),
97: movies, studios, and awards, among others.  In Homeland Security,
98: these graphs are used in a variety of information analysis tasks
99: \cite{jensen.2003,coffman:2004,popp.2004,DHS-DSW:2004}.  
100: In particular, such graphs
101: may be used for predicting threat vulnerabilities.
102: 
103: Data for semantic graphs come from relations parsed from text documents
104: and/or data from relational databases.  Our motivation for this
105: work comes from our experience in constructing semantic graphs
106: from two sources of data---movies data and terrorism data---to be discussed
107: at the end of this paper.
108: In both these cases, we were faced
109: with a wide variety of choices:  what are the node types, what
110: are the link types, and how do these choices affect the algorithms
111: that we intend to use on these graphs?
112: 
113: Several types of algorithms operating on semantic graphs
114: are of interest to us.  For example,
115: to determine the nature of a possible relationship
116: between two entities, a subgraph consisting of the shortest paths
117: (or another metric) between two nodes in the semantic graph 
118: may be constructed and examined \cite{faloutsos:2004}.  
119: We refer to this process as {\em relationship detection}.
120: Fast algorithms based on heuristic search 
121: (which improve on breadth-first search or bi-directional search)
122: are available for this task, which either use or do not use the
123: semantic information in the graph \cite{eliassi-rad-tr:2004,chow-tr:2004}.  
124: These algorithms,
125: however, depend on knowing which links (or link types) in the semantic graph
126: are useful for detecting relationships.  For example, 
127: two people who share a connection to ``San Francisco'' because they
128: were born there are unlikely to have any real-life connection.  One of the goals
129: of this paper is to present automatic algorithms for determining
130: which links are useful for relationship detection, and to present
131: concepts that help answer related questions.
132: 
133: In the past few years, a new field called {\em complex networks} 
134: (see, e.g., \citeauthor{albert:2002} (2002) and 
135: \citeauthor{newman:2003a} (2003)) has 
136: emerged to study the structure of real-world networks.  
137: Statistical tools for characterizing graphs and networks have been
138: developed, with the impetus of understanding the relationship
139: between the structure and function of networks.  Computer techniques have
140: allowed these statistical measurements to be performed on very large
141: real-world networks.  In this paper
142: we generalize some of these techniques in order to apply them
143: to semantic graphs.  For example, some types of nodes in semantic
144: graphs can be connected to many other types of nodes, but generally
145: have few actual links.  We quantify this concept and hypothesize that
146: nodes such as these are not useful for relationship detection.
147: In addition, the concept of {\em transitivity} in social network analysis
148: (called {\em clustering coefficient} in the complex networks literature)
149: helps determine
150: which links are useful for relationship detection.
151: 
152: In the following, we begin by describing semantic graphs and ontologies.
153: We then use the concept of transitivity for evaluating links and link
154: types for relationship detection.  An important aspect of this paper is
155: a presentation of new statistical measures for semantic graphs, as well
156: as issues related to the scale (level of detail) of semantic graphs.
157: Examples of semantic graphs for movies and terrorism data are
158: given near the end of the paper.
159: 
160: \section{Semantic Graphs and Ontologies}
161: 
162: A semantic graph
163: consists of nodes and directed links, with each
164: node having a {\em type} (e.g., movie).  The set of types is usually
165: small compared to the number of nodes.  Each node is also labeled
166: with one or more {\em attributes} that identify the specific node
167: (e.g., {\em Shrek}) or give additional information about that node
168: (e.g., gross income).  Links may also have types, for example, the
169: (person $\rightarrow$ movie) link may be of type ``acted-in,'' or
170: ``directed.''  (In this case, multigraphs, or graphs that may have
171: multiple links between the same pair of nodes, are possible.)  In some
172: semantic graphs, the meaning of a link between any two nodes is clear (although
173: different between different pairs of node types), and no link types need
174: to be defined.  Finally, links may also have attributes.
175: For additional details, see 
176: \citeauthor{sowa:1984} (1984).
177: %quillian:1968,lenat:1995,reed:2002,shapiro:2000a,woods:1975}.
178: 
179: Depending on the types of nodes and links and on the available
180: information, certain relations can or cannot exist.  The set of
181: relations that can exist in a given semantic graph can be described by an
182: auxiliary graph called an {\em ontology,} 
183: or a {\em schema} \cite{jensen:2002}.  More often, an ontology graph 
184: is created first by defining the types of relations that the semantic graph
185: will encode.
186: A small example of an ontology is given in Figure \ref{fig:example},
187: showing three node types: person, meeting and city.  
188: 
189: Special links in an ontology graph could describe {\em is-a} and {\em part-of}
190: relationships among node types.  This is a node type hierarchy that will be
191: briefly mentioned when we discuss the scale of semantic graphs.
192: 
193: \begin{figure}
194: \begin{center}
195: \includegraphics[width=5.0cm]{example.eps}
196: \caption{A small ontology consisting of three node types.}
197: \label{fig:example}
198: \end{center}
199: \end{figure}
200: 
201: \section{Transitivity for Evaluating Nodes and Edges}
202: 
203: Consider a node ``San Francisco'' of type ``city'' in a semantic graph, 
204: and suppose we have a database of people which includes city of birth
205: among the data fields.  A node ``Alice'' of type ``person'' may be 
206: linked to the node ``San Francisco'' if Alice was born in San Francisco.
207: Other nodes linked to the ``San Francisco'' node are thereby related
208: to San Francisco and, in turn, to Alice.  However,
209: it is not clear that such relationships give useful information
210: about Alice, since most entities a short graph distance away from ``Alice''
211: will have no real-life connection to her.
212: 
213: On the other hand, people born in a city such as 
214: ``Tikrit,'' may have a much higher likelihood of 
215: knowing each other, that is, it may be important in this case to be able to 
216: associate two people
217: through their city of birth.  Instead of using a human 
218: with potential biases to evaluate nodes
219: and links, an automatic procedure is
220: desirable for objectively determining which nodes and links
221: should be used in the semantic graph for relationship detection.
222: 
223: Another example is nodes of type ``date.''  
224: Dates could represent birthdates, dates of meetings, etc.
225: For example, a node for a person born on 9-11-2001 may be linked to a node
226: labeled ``9-11-2001.''
227: However, the fact that two events share a date
228: rarely indicates that the events are related.  Our bias is to
229: treat a date as an attribute of a node, rather than as its
230: own node (with the type ``date'').
231: Topologically, a ``date''
232: node may be connected to many other {\em types} of nodes, but generally
233: each date node is connected to only a small number of other nodes.
234: This may be an unbiased indication that a date is not useful for relationship
235: detection.
236: 
237: \subsection{The transitivity concept}
238: 
239: The concept of link transitivity is useful to address some of
240: the above issues.  If a node $i$ has a link to node $j$ and node $j$
241: has a link to node $k$, then a measure of transitivity in the network
242: is the probability that node $i$ has a link to node $k$.  In social
243: networks and many other networks categorized as {\em small-world} networks,
244: this probability is high.  This is natural in social networks because
245: a friend of a friend is also a friend in a proportion that is much higher
246: than in a random network.  In general, we refer to $j$ as a {\em neighbor}
247: of $i$ if $i$ and $j$ are directly connected in a graph.  Also, we 
248: refer to the {\em degree} of a node as the number of neighbors it has.
249: 
250: The concept of transitivity is quantified as follows.
251: The {\em clustering coefficient} of a node, denoted by $C(i)$,
252: is a measure of the connectedness between the neighbors of the node.
253: Let $k_i$ denote the degree of node $i$, and let
254: $E_i$ denote the number of links between the $k_i$ neighbors.
255: Then, for an undirected graph, the quantity \cite{watts:1998}
256: \begin{equation}
257: C(i) = \frac{E_i}{k_i(k_i - 1)/2}
258: \label{eq:cc}
259: \end{equation}
260: is the ratio of the number of links between
261: a node's neighbors to the number of links that can exist.
262: We define $C(i)$ to be 0 when $k_i$ is 0 or 1.
263: When $C(i)$ is averaged over all nodes in the graph, we have the clustering
264: coefficient for a graph.
265: Note that
266: a high average clustering coefficient does {\em not} imply the existence of
267: clusters or communities (subgraphs that are internally
268: more highly connected than externally) in the graph.
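
As an illustrative worked case of Equation (\ref{eq:cc}), with hypothetical
numbers chosen only for this example, consider a node $i$ with $k_i = 4$
neighbors among which $E_i = 3$ links exist.  At most $k_i(k_i-1)/2 = 6$
links could exist among these neighbors, so
\[
C(i) = \frac{3}{4 \cdot 3/2} = \frac{3}{6} = 0.5 ,
\]
that is, half of the possible pairs of neighbors are directly linked.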
269: 
270: \subsection{Relevance of a node}
271: 
272: We consider the problem of determining whether a node in a semantic graph 
273: (e.g., ``San Francisco'' in a previous example)
274: is useful for relationship detection.  Consider a node $i$
275: which has links to many other nodes.
276: For now, we assume the links are of all the same type.
277: To evaluate whether or not $i$ is useful for relationship 
278: detection, we examine whether or not the neighbors of $i$ 
279: are actually related in the semantic graph with high frequency.
280: Whether or not two neighbors are related is decided by whether
281: or not a link exists between the two neighbors.  (If no such link
282: exists, a weaker condition is that the two neighbors are linked
283: via a third node that is already deemed useful for
284: relationship detection.)  This leads to the use of the clustering
285: coefficient defined in Equation (\ref{eq:cc}) to measure 
286: the relevance of a node $i$ with degree greater than 1.
287: The equation can be generalized so that $E_i$ counts links with
288: the weaker condition described above.
289: A threshold $\tau$ is needed: if $C(i) > \tau$ then $i$ is
290: a useful node.  If $i$ is not a useful node, {\em none} of the links
291: involving $i$ should be used for relationship detection, and they
292: could be removed from the semantic graph.  If these links are removed,
293: $i$ could be made an attribute of the nodes that $i$ originally linked
294: to, in order not to lose any information.
295: 
296: The above can be generalized for semantic graphs
297: when $i$ is linked via many different
298: types of links.  In this case, the single count of relationships
299: involving pairs of neighbors of $i$ is replaced by a matrix $M(t_1,t_2)$.
300: Here $M(t_1,t_2)$ counts the number of relationships
301: between pairs of neighbors $(a,b)$, where $a$ is linked to $i$ via type $t_1$
302: and $b$ is linked to $i$ via type $t_2$.  Small entries in this
303: matrix indicate {\em pairs} of link types (associated with $i$)
304: that should not be traversed in relationship detection.
305: 
306: \subsection{Relevance of a link}
307: 
308: The relevance of an existing or potential relationship between two 
309: nodes $a$ and $b$ can be evaluated by how many neighbors they have in
310: common.  More precisely a relevance measure may be defined as
311: \begin{equation}
312: S(a,b) = \frac{|N(a,b)|}{|T(a,b)|}
313: \label{eq:strength}
314: \end{equation}
315: where
316: \[
317: N(a,b) = \left\{ w \mid w \mbox{ is linked to $a$ and $b$}, 
318:  w \ne a, w \ne b \right\}
319: \]
320: and
321: \[
322: T(a,b) = \left\{ w \mid w \mbox{ is linked to $a$ or $b$},
323:   w \ne a, w \ne b \right\}
324: \]
325: with $|T(a,b)| = \mbox{deg}(a) + \mbox{deg}(b) - |N(a,b)|$
326: where $\mbox{deg}(a)$ is the degree of $a$.
327: We have $0 \le S(a,b) \le 1$ with
328: large values of this relevance measure indicating a strong 
329: relationship between $a$ and $b$ supported by a high proportion of
330: common neighbors.
331: This quantity is similar to the clustering coefficient and
332: can be generalized to involve neighbors $w$ farther from $a$
333: and $b$.  
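
A small hypothetical example illustrates Equation (\ref{eq:strength}).
Suppose $a$ has neighbors $\{w_1, w_2, w_3\}$ and $b$ has neighbors
$\{w_2, w_3, w_4, w_5\}$, and assume that $a$ and $b$ are not themselves
linked.  Then $N(a,b) = \{w_2, w_3\}$ and $T(a,b) = \{w_1, \dots, w_5\}$,
so that
\[
|T(a,b)| = 3 + 4 - 2 = 5, \qquad S(a,b) = \frac{2}{5} = 0.4 .
\]
If $a$ and $b$ were directly linked, each would appear among the other's
neighbors.  Since the definitions above require $w \ne a$ and $w \ne b$,
the degree-based expression for $|T(a,b)|$ would then hold only if
$\mbox{deg}(a)$ and $\mbox{deg}(b)$ are counted without that link.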
334: 
335: There are many applications of this relevance measure.  For example,
336: pairs of nodes with no existing link can be evaluated to check if
337: a latent link might exist.  In another example, the relevance measure
338: can be computed for all links of a given type.  A low average of this relevance
339: measure indicates that the given link type is not useful for 
340: relationship detection; there is not a strong relation between nodes
341: incident on a link with the given type.
342: A high relevance measure for a link when the average relevance measure
343: for the link type is low (and vice-versa)
344: indicates an outlier that may be interesting
345: to investigate.  This relevance measure must be used carefully, however,
346: since it relies on links that are themselves assumed to confer bona fide
347: relationships.
348: 
349: It must also be recognized that a low relevance measure for an individual link
350: does not imply that the link is unimportant.  On the contrary, the notion
351: of the ``strength of weak ties'' \cite{granovetter:1973} 
352: suggests that these links
353: are critical in some sense.  It is when almost all links of the {\em same}
354: type have a low relevance measure (and this link type is not
355: $a$ ``secretly knows'' $b$) that this link type should not be used in
356: relationship detection.
357: 
358: %-----------------------------------------------------------
359: \subsection{Generalization of clustering coefficient for semantic graphs}
360: 
361: The clustering coefficient defined earlier has little meaning for
362: semantic graphs as it mixes different types of nodes and it does not
363: include the constraints imposed by the ontology. 
364: To illustrate this, consider the ontology for a semantic graph
365: given by Figure~\ref{fig:clustering_example}.
366: In this case, a node of type $\alpha$ can be connected to types
367: $\beta$, $\gamma$ and $\delta$, but a neighbor of type $\delta$ can
368: never be connected to neighbors of type $\beta$ or $\gamma$. In order
369: to avoid unrealistically small values of the clustering coefficient we
370: thus have to divide by the number of links actually {\it allowed} by
371: the ontology and obtain
372: \begin{equation}
373: C(i;\alpha)=\frac{E_i}{E(i;\alpha)}
374: \end{equation}
375: where $E(i;\alpha)$ denotes the maximum number of links allowed
376: by the ontology.
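
As a hypothetical illustration, suppose a node $i$ of type $\alpha$ has two
neighbors of type $\beta$, one of type $\gamma$, and one of type $\delta$,
so that $k_i = 4$.  Assume further, purely for this example and not as
something read off Figure~\ref{fig:clustering_example}, that the only links
the ontology allows among these neighbors are those between a $\beta$ node
and a $\gamma$ node.  Then $E(i;\alpha) = 2 \times 1 = 2$, whereas
$k_i(k_i-1)/2 = 6$.  If $E_i = 1$ such link is present,
\[
C(i) = \frac{1}{6} \approx 0.17, \qquad C(i;\alpha) = \frac{1}{2} = 0.5 ,
\]
which shows how far the unconstrained coefficient can understate the
connectedness of the neighbors.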
377: 
378: \begin{figure}
379: \begin{center}
380: \includegraphics[width=3.0cm]{clustering_example.eps}
381: \caption{A particular ontology for which neighbors of $\alpha$ of 
382: type $\delta$ can never be connected to neighbors of type $\beta$ or $\gamma$.}
383: \label{fig:clustering_example}
384: \end{center}
385: \end{figure}
386: 
387: %-----------------------------------------------------------
388: \section{Statistical Measures for Semantic Graphs}
389: 
390: Along with clustering coefficient, two other relevant graph 
391: properties that have been developed for standard (non-semantic) graphs
392: are {\em distributions of node degree} (number of neighbors of a node)
393: and {\em average path length} between any two nodes in the graph.
394: Together, these three graph properties can be useful for
395: studying the properties of a semantic graph for representing knowledge.
396: 
397: Many real-world networks have high clustering coefficient, much higher
398: than $O(1/n)$ for random graphs, where $n$ is the number of nodes in
399: the graph.  We believe that properly constructed semantic graphs must also
400: have moderately high clustering coefficients.  Low values of clustering
401: coefficient may indicate that the linkage information in the semantic
402: graph is incomplete.  Very high values of clustering coefficient may
403: also indicate a poorly constructed semantic graph where all the nodes are
404: very highly linked to each other (the limit is a fully connected graph), 
405: indicating little discrimination in how the nodes are connected.
406: 
407: The average path length, $\ell$,
408: in a semantic graph must also not be too small (which
409: is also associated with very high clustering coefficients).  When the
410: average path length is small,
411: almost all nodes are approximately the same graph distance from each other,
412: giving little discriminatory ability to path-length based algorithms
413: for detecting relationships.
414: 
415: For example, an ontology graph may contain a node (e.g., a node of 
416: type ``provenance'') to which every other node in the ontology is linked.
417: In this case, 
418: the maximum shortest path length in the ontology graph is 2,
419: which also suggests that the average path length in the semantic graph is
420: small.  It may be useful to identify nodes or links in the ontology
421: graph that dramatically shorten the average path length.  These nodes
422: and links are potentially not useful for relationship detection.
423: 
424: The connectivity distribution $P(k)$ 
425: is of interest for semantic graphs, particularly the existence of
426: nodes with very high degree, as in the case of scale-free
427: networks~\cite{barabasi:1999,amaral:2000}.  In a relationship detection
428: path search, paths through very high degree nodes are deemed less informative
429: \cite{faloutsos:2004}.  For example, 
430: in a social network, two people who know a popular person
431: are less likely to know each other; the linkages to the popular person 
432: should be disregarded in the relationship detection search since they
433: may confer erroneous relationships.
434: 
435: It is believed that power-law connectivity distributions arise when
436: there is little or no cost involved in the formation of links in the
437: network \cite{amaral:2000}.  Without this property, no nodes would be able to
438: acquire a very large number of links.  This may suggest that a graph
439: with power-law degree distribution may contain many weak linkages.
440: However, these weak linkages cannot be disregarded (cf.\ the strength of weak ties
441: mentioned above).
442: 
443: For semantic graphs, we showed above how to extend the concept of
444: clustering coefficient.  In the next subsections, we extend
445: other statistical concepts to semantic graphs.
446: 
447: \subsection{Extension of node degree}
448: 
449: Even in the simple case of connectivity, a given value
450: $k$ of the connectivity of a node of type $\alpha$ has no real meaning
451: for semantic graphs. Indeed, as shown in Figure~\ref{fig:k_example} the
452: topological connectivity in both cases is $k=4$ but the meaning of it
453: is very different in each case.
454: 
455: \begin{figure}
456: \begin{center}
457: \includegraphics[width=5.0cm]{connectivity_example.eps}
458: \caption{Two examples for which the $\alpha$-type node has
459: topological connectivity $k=4$ but with a different meaning in each case;
460: cf.~\citeauthor{jensen:2002} (2002).}
461: \label{fig:k_example}
462: \end{center}
463: \end{figure}
464: 
465: In the first case, the environment is very homogeneous while it is not
466: in the second case. Another complexity comes from the
467: fact that the number of $\beta$-type nodes can be very large thus
468: inducing a bias in the connectivity of the other nodes.
469: 
470: The ontology implies that each node of type $\alpha$ can be connected
471: to a certain number, $k^{0}_{\alpha}$, of other
472: types. In the semantic graph, we have a total number of nodes
473: $n=\sum_{\alpha}n_{\alpha}$ and we denote the nodes by
474: $i=1,\dots,n$. The type of a node is given by the function $t(i)$. We
475: denote by $k_{\alpha\beta}(i)$ the number of neighbors of type $\beta$
476: of a node $i$ of type $\alpha$. The usual topological connectivity of
477: the node $i$ (which is of type $\alpha$) is then given by
478: \begin{equation}
479: k_{\alpha}(i)=\sum_{\beta}k_{\alpha\beta}(i).
480: \end{equation}
481: Using this quantity, we can define the average connectivity
482: of type $\alpha$ which is just the average over all nodes with
483: type $\alpha$ as
484: \begin{equation}
485: \overline{k_{\alpha}}=\frac{1}{n_{\alpha}}\sum_{i,\; t(i)=\alpha}k_{\alpha}(i).
486: \end{equation}
487: 
488: If we want to compare the different types relative to their
489: connectivity, it is important to remember that some types can be
490: connected to many others (such as persons, who can be linked to
491: other persons, cities, meetings, jobs, etc.) while other types are
492: only linked to one type (such as a conference which takes place only at
493: one location). In order to compare the different types we thus have to
494: rescale by the number of different neighbor types they can have according to
495: the ontology:
496: \begin{equation}
497: m_{\alpha}=\frac{\overline{k_{\alpha}}}{k^{0}_{\alpha}}.
498: \end{equation}
499: 
500: This quantity indicates the average number of neighbors {\it per
501: type}. This quantity however does not tell us if there are large
502: connectivity fluctuations or if in contrast all nodes of a given type
503: have essentially the same connectivity. We thus have to measure the
504: connectivity variance {\it per type} which is calculated using the second moment
505: \begin{equation}
506: \overline{k^{2}_{\alpha}}=\frac{1}{n_{\alpha}}\sum_{i,\; t(i)=\alpha}k^{2}_{\alpha}(i)
507: \end{equation}
508: with the dispersion per type given by
509: \begin{equation}
510: \sigma^{k}_{\alpha}=\frac{[\overline{k^{2}_{\alpha}}-(\overline{k_{\alpha}})^2]^{1/2}}
511: {k^{0}_{\alpha}}.
512: \end{equation}
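
For a hypothetical numerical illustration of these definitions, take
$n_{\alpha} = 3$ nodes of type $\alpha$ with connectivities
$k_{\alpha}(i) = 2, 4, 6$, and suppose the ontology gives
$k^{0}_{\alpha} = 2$.  Then
\begin{eqnarray*}
\overline{k_{\alpha}} &=& \frac{2+4+6}{3} = 4, \qquad
m_{\alpha} = \frac{4}{2} = 2, \\
\overline{k^{2}_{\alpha}} &=& \frac{4+16+36}{3} = \frac{56}{3}, \qquad
\sigma^{k}_{\alpha} = \frac{[56/3 - 16]^{1/2}}{2} \approx 0.82 .
\end{eqnarray*}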
513: 
514: Another possible way to characterize the connectivity distribution per
515: type is to plot the connectivity distribution. However, the dispersion
516: around the average is already a first indication of the nature of the
517: connections for different types. For some types, the fluctuations
518: will be small, while for others they can be large
519: (such as the number of persons a person knows).
520: %This suggests that the nature
521: %of different networks between various types and spanning the
522: %semantic graph can be very different. The network of persons could be
523: %for example scale-free while for other types it can be well described
524: %by a simple random graph model.
525: 
526: %-----------------------------------------------------------
527: \subsection{Disparity of connected types}
528: 
529: The above quantities tell us the expected number of connections of a node of
530: a given type to another type
531: but not the correlations between different types. Indeed, a type
532: $\alpha$ can preferentially link to a type $\beta$ while it could
533: in principle also be linked to other types (as given by the ontology).
534: 
535: We thus quantify the disparity (or affinity) of each
536: type to link to other types. In order to do this we use a convenient
537: quantity---denoted by $Y_2$---which was introduced in another
538: context~\cite{Derrida:1987,Barthelemy:2003a}. In order to understand the meaning
539: of this quantity let us consider an object that is broken into a number $N$
540: of parts, each part having a weight $w_i$. By construction $\sum_{i}w_i=1$
541: and $Y_2$ is given in this case by
542: \begin{equation}
543: Y_2=\sum_i[w_i]^2.
544: \end{equation}
545: If all parts have the same weight $w_i\sim 1/N$ then $Y_2\sim 1/N$ is
546: small (for large $N$). In contrast, if we have $w_1=1/2$ and the rest
547: are small, implying $w_{i\ne 1}\sim 1/[2(N-1)]$, then we obtain $Y_2\sim
548: 1/4$. This simple example can be easily generalized to more
549: complicated situations and shows that a small value of $Y_2$ indicates
550: a large number of relevant parts while a larger value (typically of
551: order $1/m$ where $m$ is of order unity) indicates the dominance of a
552: few parts.
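
To make the two regimes concrete, consider the hypothetical case $N = 10$.
With equal weights $w_i = 1/10$,
\[
Y_2 = 10 \times \left(\frac{1}{10}\right)^2 = 0.1 ,
\]
whereas with $w_1 = 1/2$ and $w_{i \ne 1} = 1/18$,
\[
Y_2 = \left(\frac{1}{2}\right)^2 + 9 \times \left(\frac{1}{18}\right)^2
    = \frac{1}{4} + \frac{1}{36} \approx 0.28 ,
\]
which is already close to the large-$N$ value of $1/4$.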
553: 
554: We now apply this idea to the number of types to quantify the
555: disparity of a node or the affinity of a type. The quantity $Y_2$ is first
556: defined for a given node $i$ of type $\alpha$
557: \begin{equation}
558: Y_2(i;\alpha)=\sum_{\beta}\left[\frac{k_{\alpha\beta}(i)}{k_{\alpha}(i)}\right]^2.
559: \end{equation}
560: In order to get results with statistical significance, we average this 
561: quantity over all
562: nodes of the same type and we also compute its dispersion $\sigma^{Y}_{\alpha}$:
563: \begin{eqnarray}
564: \overline{Y}_2(\alpha)=\frac{1}{n_{\alpha}}\sum_{i,\; t(i)=\alpha}
565: Y_2(i;\alpha),\\
566: \sigma^{Y}_{\alpha}=\left[
567: \overline{Y_2^{2}(\alpha)}-(\overline{Y}_2(\alpha))^2
568: \right]^{1/2}.
569: \end{eqnarray}
570: 
571: These results must however be weighted by the fact that some types are
572: more numerous than others which could be a reason why they appear
573: more often than others. For a given type $\alpha$, we denote by ${\cal
574: V}(\alpha)$ the set of types which can be connected to $\alpha$ as
575: given by the ontology. If a node has $k$ neighbors, and if these
576: neighbors are picked at random in the set of different nodes with
577: population $n_{\beta}$, we then obtain a disparity given by
578: \begin{equation}
579: Y^{r}_2=\sum_{\beta\in{\cal V}(\alpha)}\left[\frac{n_{\beta}}{n}\right]^2.
580: \end{equation}
581: Again, this quantity will be very small if all types are uniformly
582: present in the semantic graph, $Y^{r}_2\sim 1/N$ (where $N$ is the
583: total number of different types), while if it is of order unity then
584: essentially a few types are over-represented. In order to take these
585: heterogeneities into account it is thus necessary to rescale
586: $\overline{Y}_2(\alpha)$ by $Y^{r}_2$ and to form the factor
587: \begin{equation}
588: R(\alpha)=\frac{\overline{Y}_2(\alpha)}{Y^{r}_2}
589: \end{equation}
590: and its corresponding dispersion,
591: \begin{equation}
592: \sigma^{R}_{\alpha}=\frac{\sigma^{Y}_{\alpha}}{Y^{r}_2}.
593: \end{equation}
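
As a hypothetical worked example of this rescaling, suppose the graph has
$n = 100$ nodes with $n_{\alpha} = 20$, $n_{\beta} = 40$, and
$n_{\gamma} = 40$, that ${\cal V}(\alpha) = \{\beta, \gamma\}$, and that
every node of type $\alpha$ has three neighbors of type $\beta$ and one of
type $\gamma$.  Then, for each such node,
\[
Y_2(i;\alpha) = \left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2 = 0.625 ,
\]
so the type average is $\overline{Y}_2(\alpha) = 0.625$ with
$\sigma^{Y}_{\alpha} = 0$, while
\[
Y^{r}_2 = \left(\frac{40}{100}\right)^2 + \left(\frac{40}{100}\right)^2 = 0.32,
\qquad
R(\alpha) = \frac{0.625}{0.32} \approx 1.95 .
\]
The value above one reflects the preference of type $\alpha$ for type
$\beta$ in this example.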
594: 
595: A large value (larger than one) of $R(\alpha)$ indicates that type
596: $\alpha$ preferentially links to a small number of types and that 
597: its neighbor types ${\cal V}(\alpha)$ are diverse in number.
598: If $R\ll 1$, the type
599: $\alpha$ may still be preferentially connected to a small set of types
600: but the diversity of the numbers of each neighbor type is small.
601: 
602: The dispersion $\sigma^{R}_{\alpha}$ indicates whether the behavior as
603: described by the average value $R(\alpha)$ is typical, or if in
604: contrast there is large diversity among the nodes of type $\alpha$.
605: 
606: Other usual quantities that are measured in order to 
607: characterize a large network can also be generalized without any
608: difficulty. For example, degree distributions should be examined by
609: type of node.  In a semantic graph, the overall degree distribution
610: may not be meaningful, but the degree distribution for a specific
611: node type may be power-law, etc.
612: As a further example, the average path length generalizes to become a matrix
613: $\ell_{\alpha\beta}$ where $\alpha$ is the type of the source node of the
614: shortest paths and $\beta$ is the type of the target node.
615: This matrix will in general have
616: entries with very different values.
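
One way to make this definition explicit (a sketch; the symbols $t(i)$ for
the type of node $i$, $d_{ij}$ for the shortest-path length from $i$ to
$j$, and ${\cal P}_{\alpha\beta}$ are notation assumed here for
illustration) is to average over the connected pairs of nodes with the
prescribed types,
\begin{equation}
\ell_{\alpha\beta}=\frac{1}{|{\cal P}_{\alpha\beta}|}
\sum_{(i,j)\in{\cal P}_{\alpha\beta}} d_{ij},
\end{equation}
where ${\cal P}_{\alpha\beta}$ is the set of ordered pairs $(i,j)$ with
$i\neq j$, $t(i)=\alpha$, $t(j)=\beta$, and finite $d_{ij}$.
With this choice, pairs with no connecting path are left out of the
average; for an undirected graph $d_{ij}=d_{ji}$, so the matrix is
symmetric, $\ell_{\alpha\beta}=\ell_{\beta\alpha}$.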
617: 
618: %%-----------------------------------------------------------
619: %\subsection{Extension of average path length}
620: %
621: %In the same flow of ideas, it is easy to generalize the important
622: %quantity which is the betweenness centrality to a semantic graphs.
623: %For topological networks, this quantity counts the fraction of shortest paths
624: %that goes through a given node
625: %\begin{equation}
626: %g(v)=\sum_{i\ne j}\frac{\sigma_{ij}(v)}{\sigma_{ij}}
627: %\end{equation}
628: %where $\sigma_{ij}$ denotes the number of shortest paths going from
629: %$i$ to $j$ and where $\sigma_{ij}(v)$ denotes the number of shortest
630: %paths going from $i$ to $j$ through $v$. The natural generalization to
631: %semantic graphs is then
632: %\begin{equation}
633: %g_{\beta\gamma}(v;\alpha)=\sum_{i, t(i)\beta\ne j, t(j)=\gamma}
634: %\frac{\sigma_{ij}(v)}{\sigma_{ij}}
635: %\end{equation}
636: %which means that we consider only shortest paths from a node of type
637: %$\beta$ to a node of type $\gamma$ (while the node $v$ is of type
638: %$\alpha$).
639: 
640: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
641: \section{Scale in Semantic Graphs}
642: 
643: Given a knowledge base of relational data, the choice of ontology
644: depends on what information needs to be captured in the semantic graph,
645: and how easily certain information needs to be retrieved.
646: The level of detail (or scale) chosen for the ontology
647: (choice of node and link types)
648: will have a direct impact on the properties of the corresponding
649: semantic graph. 
650: 
In the simplest ontology, we have nodes of only one type.
In the example of the movies database, this ontology
yields a simple network of actors without any types, in which two actors are
connected if they appeared in the same movie.  At the next finer
scale, we have actors and movies as node types.  In this
case, an actor is connected to a movie if the actor appeared in that
movie.  This is a special case of a semantic graph,
namely a {\em bipartite} network (two types of nodes, with links only between
the two types).
660: %The ontologies and graphs for
661: %these two scales are illustrated in Figure~\ref{fig:example_actor}.
662: Coarser models lose some of the information present in finer models
663: but can be useful for
664: large-scale computations, such as multi-level search techniques.
665: 
666: At the finest scale of a terrorist network, we may have nodes
667: of type ``Religious Terrorist Organization'' and ``Political Terrorist 
668: Organization.''  A coarser model may aggregate nodes of these
669: two types into a new type, ``Terrorist Organization'' (or the
670: aggregation may occur directly if a type hierarchy is available).
671: Depending on what information needs to be preserved, it may or may not be
672: important to distinguish between these two node types
673: at the structural level of the semantic graph.
674: 
We note that in Homeland Security tasks, 
data analysis more often involves searching for outliers
than for commonplace patterns.  Thus it is essential that the fine-scale
data be retained and that the coarse-scale data be used
appropriately (for example, as an aid in managing and processing
large-scale data).
681: 
682: %The semantic graph may be examined to determine whether or not it is
683: %appropriate to coalesce two node types into a single node type.
684: %
685: %Two nodes are structurally similar if they 
686: %We first define the structural similarity of two nodes as the 
687: %number of neighbors they have in common
688: %
689: %Algorithm: average structural similarity may be used to aggregate nodes.
690: %
691: %We also have the scale of the semantic graph.  Here, we can collapse
692: %along the type hierarchy.
693: %
694: %Two nodes can be coalesced if they have a high degree of structural
695: %similarity.
696: 
697: %\begin{figure}
698: %\begin{center}
699: %\includegraphics[width=6.0cm]{example_actor.eps}
700: %\caption{Two different scales for the movie actor network.}
701: %\label{fig:example_actor}
702: %\end{center}
703: %\end{figure}
704: 
705: %-----------------------------------------------------------
706: \subsection{Effect of scale on statistical measures}
707: 
708: Here we simply illustrate the effect of scale on the clustering coefficient.
We consider a random bipartite graph with Poisson distributed
numbers of both movies per actor (with average $\mu$) and actors per
movie (with average $\nu$). We suppose that we have $n_A$ actors and
$n_M$ movies. Since each link connects one actor to one movie, counting
the links from the actor side ($n_A\mu$) and from the movie side
($n_M\nu$) must give the same total, which imposes the constraint
\begin{equation}
\frac{\mu}{n_M}=\frac{\nu}{n_A} .
\end{equation}
717: 
718: This model can be considered as a ``null'' model since there are no
719: particular correlations here. If one computes the clustering coefficient of
720: the one-mode projection
721: of this network, one obtains~\cite{newman:2001a}
722: \begin{equation}
723: C=\frac{1}{\mu+1} .
724: \end{equation}
This quantity remains finite even in the limit of very large networks,
$n_{A,M}\to\infty$. This is in contrast with the usual random network
with fixed average degree, for which
\begin{equation}
C\sim \frac{1}{n}
\end{equation}
where $n$ is the number of nodes. At this stage one might conclude that
the actor network is highly clustered and different from a random
network with no correlations. This conclusion is, however, clearly
incorrect: the large clustering coefficient here is merely a
consequence of the network construction procedure, since every movie
projects onto a fully connected group of its actors.
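
For reference, the value $C=1/(\mu+1)$ can be recovered by a short
counting argument. The following is only a sketch, assuming a large
sparse graph in which triangles formed by three distinct movies and
repeated co-appearances of the same pair of actors can be neglected.
Write $C$ as three times the number of triangles divided by the number of
connected triples in the projected actor network. A movie with $z$ actors
projects onto a clique and contributes $z(z-1)(z-2)/6$ triangles; for a
Poisson variable $\langle z(z-1)(z-2)\rangle=\nu^3$, so that
\begin{equation}
3\times(\mbox{number of triangles})\simeq\frac{n_M\,\nu^3}{2} .
\end{equation}
An actor who appeared in $m$ movies has projected degree
$q=\sum_{i=1}^{m}x_i$, where $x_i$ is the number of other actors in
movie $i$. Since a movie reached by following a link is selected with
probability proportional to its cast size, $x_i$ is again Poisson with
mean $\nu$ when the cast sizes are Poisson. Using
$\langle m\rangle=\mu$, $\langle m(m-1)\rangle=\mu^2$,
$\langle x\rangle=\nu$ and $\langle x(x-1)\rangle=\nu^2$,
\begin{equation}
\langle q(q-1)\rangle=\langle m\rangle\langle x(x-1)\rangle
+\langle m(m-1)\rangle\langle x\rangle^2=\mu\nu^2(1+\mu),
\end{equation}
and the number of connected triples is $n_A\langle q(q-1)\rangle/2$.
Taking the ratio and using the link-counting identity $n_A\mu=n_M\nu$
(both sides count the total number of actor-movie links),
\begin{equation}
C\simeq\frac{n_M\,\nu^3}{n_A\,\mu\nu^2(1+\mu)}
=\frac{n_M\nu}{n_A\mu}\,\frac{1}{1+\mu}=\frac{1}{\mu+1} .
\end{equation}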
736: 
737: %This simple example shows that the way of constructing the network and
738: %the choice of the scale can be very relevant and should clearly specified in
739: %any discussion.
740: 
741: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
742: \section{Examples}
743: 
744: \subsection{Movies data}   
745:                                        
746: The ``Movies'' test data at the UCI KDD Archive contains information
747: about movies, persons (actors, directors, etc.), studios, awards, etc.
748: The data was originally compiled by Gio Wiederhold (Stanford University).
749: We used this data to construct an ontology and semantic graph to 
750: express most of the information in the dataset.  Figure \ref{fig:imdb-ont}
751: shows the ontology graph that we developed.  In the figure, the meaning of
752: most of the links is obvious.  However, the person-person 
753: link implies {\em married-to}, {\em lived-with}, or some other non-professional
754: relationship; the person-studio link implies {\em founded}; the movie-movie
755: link implies {\em sequel-to}.  We note that the data is very incomplete.
756: 
757: \begin{figure}
758: \begin{center}
759: \includegraphics[width=2in]{movies_schema.eps}
760: \caption{Movies ontology.}
761: \label{fig:imdb-ont}
762: \end{center}
763: \end{figure}
764: 
765: In this ontology, the best meaning of the node Role is unclear.
766: For example, are two actors linked to the same Role node in the semantic
767: graph if they
768: played the role of Villain in two different movies?  Alternatively,
769: a role node in the semantic graph may only link to actors playing
770: a given role in a single movie.  We arbitrarily chose the former in our case.
771: 
A related question, which is structurally similar but semantically different,
is the following.  Should two actors who win a Best Actor award be linked to the
{\em same} Award node in the semantic graph?  In this case we did not choose
this interpretation, since it seems that awards are individual entities
whereas roles are not.
777: 
778: Table \ref{tbl:imdb-results} summarizes the node types, frequencies,
779: and other statistical measures for the movies semantic graph.
780: The results show
781: high dispersion of average connectivity per type, for all types.
782: Further, the disparity of connected types is not particularly 
different from a random model.  These observations indicate a relatively well-constructed
784: semantic graph; there are no particular correlations (given
785: the numbers of each node type) and thus the information
786: content in the graph is high.  The results will be very different
787: for the terrorism data.
788: 
In the semantic graph,
the nodes with the largest clustering coefficients 
depend on whether the types of the nodes are considered.  In the standard
case where the types are not considered, the node Maurice Barrymore
has a high clustering coefficient; the node is connected to Georgiana Drew
Barrymore, Lionel Barrymore, Ethel Barrymore, etc., all of whom
are connected to each other.  If node types are considered, then it does
not matter that two neighbors of a node are unlinked when the ontology
does not permit a link between them.  Nodes that
were missed by the untyped measure may now have a high clustering coefficient,
e.g., the movie {\em Dogma} (perhaps due to the idiosyncrasies of 
the incomplete data).
801: 
802: % low clust coef?
803: 
In the semantic graph, the link between Columbia Pictures and 
drama (genre) has the largest number of common neighbors (710).  
806: However, when the link
807: relevance measure (Equation (\ref{eq:strength})) is used,
808: which accounts for the number of links a node has,
809: the link between Bud Abbott and Lou Costello is found (30 common neighbors).
810: (We also found re-releases of movies under a new name in this process.)
811: Further, a semantic version of relevance can be defined, which
812: considers only the links that are allowed by the semantic graph.
813: In this case, the link between Tokuma Studio and docu-drama is found.
814: (Tokuma is linked to drama and the movie {\em Carences}; docu-drama is 
815: linked to {\em Carences} and Miramax; and Miramax is linked to drama.)
816: 
We also computed the average relevance per link type for the semantic graph.
The least frequent link types were Person-\emph{founded}-Studio
and Studio-\emph{located-in}-Country.  However, the link types with the lowest average
relevance per link were Movie-\emph{shot-in}-Country and 
Award-\emph{awarded-in}-Country.  As mentioned, these latter links may be
the least useful for automatic relationship detection.
823: 
824: \begin{table}
825: \begin{center} \scriptsize
826: \begin{tabular}{|rl|r|rr|rr|} \hline
827:    & Node Type       & $n_\alpha$ & 
828:         $m_\alpha$ & $\sigma_\alpha^k$ & $R(\alpha)$ & $\sigma_\alpha^R$ \\
829: \hline
830:  1 & Person          & 21504  &    0.872 &    2.383 &   1.836 &    0.663 \\
831:  2 & Movie           & 11540  &    1.131 &    0.816 &   1.299 &    0.644 \\
832:  3 & Award           &  6734  &    2.579 &   10.201 &   0.905 &    0.144 \\
833:  4 & Country         &    19  &  222.509 &  582.572 &   1.812 &    0.364 \\
834:  5 & Studio          &  1075  &    1.948 &    9.534 &   1.241 &    0.408 \\
835:  6 & Genre           &    39  &   77.803 &  160.060 &   0.512 &    0.154 \\
836:  7 & Role            &   115  &   25.561 &   64.164 &   0.924 &    0.028 \\
837:  8 & Distributor     &    16  &  206.156 &  356.043 &   0.782 &    0.165 \\
838: \hline
839: \end{tabular}
840: \caption{Node types and statistics for the movies data:  frequency of
841: node type $n_\alpha$, average connectivity per type $m_\alpha$ and 
842: its dispersion $\sigma_\alpha^k$, disparity of connected types
843: $R(\alpha)$ and its dispersion $\sigma_\alpha^R$.  
844: The results show
845: high dispersion of average connectivity per type, for all types.
846: Further, the disparity of connected types is not particularly 
847: different from a random model.
848: }
849: \label{tbl:imdb-results}
850: \end{center}
851: \end{table}
852: 
853: %Figure \ref{fig:bpdist} shows the degree distributions of the movie-actor
854: %bipartite graph.  When all nodes are considered together, the power-law
855: %relationship for the nodes of type Person are hidden.
856: %
857: %\begin{figure}
858: %\centering
859: %\subfigure[Person nodes (30226 nodes).]{\includegraphics[width=1.5in]{bpdist_person.eps}}
860: %\subfigure[Movie nodes (11561 nodes).]{\includegraphics[width=1.5in]{bpdist_movie.eps}}
861: %\subfigure[All nodes.]{\includegraphics[width=1.5in]{bpdist_all.eps}}
862: %\caption{Degree distributions of the movie-actor bipartite graph
863: %by node type.}
864: %\label{fig:bpdist}
865: %\end{figure}
866: 
867: %{OLD:Before ROLE collapsed:Node types and statistics for the movies data.}
868: %
869: %\begin{table}
870: %\begin{center} \scriptsize
871: %\begin{tabular}{|rl|r|rr|rr|} \hline
872: %   & Node Type       & $n_\alpha$ & 
873: %        $m_\alpha$ & $\sigma_\alpha^k$ & $R(\alpha)$ & $\sigma_\alpha^R$ \\
874: %\hline
875: % 1 & Person          & 30226  &   0.999  &    5.319 & 2.309 & 1.078 \\
876: % 2 & Movie           & 11561  &   2.071  &    1.625 & 1.342 & 0.588 \\
877: % 3 & Award           & 29759  &   1.357  &    4.070 & 0.740 & 0.188 \\
878: % 4 & Country         &    17  & 963.412  & 3336.693 & 1.027 & 0.126 \\
879: % 5 & Studio          &  1044  &   1.937  &    9.397 & 1.109 & 0.363 \\
880: % 6 & Genre           &   201  &  17.511  &   85.387 & 0.422 & 0.076 \\
881: % 7 & Role            & 46154  &   1.000  &    0.000 & 0.834 & 0.000 \\
882: % 8 & Distributor     &   111  &   4.716  &    4.640 & 0.769 & 0.193 \\
883: %\hline
884: %\end{tabular}
885: %\label{tbl:imdb-results}
886: %\end{center}
887: %\end{table}
888: 
889: % following are degree distributions for the full imdb (not bipartite)
890: %
891: %\begin{figure}                         
892: %\centering
893: %\subfigure[All nodes.]{\includegraphics[width=2.5in]{dist_all.eps}}
894: %\subfigure[Person nodes.]{\includegraphics[width=2.5in]{dist_person.eps}}
895: %\\
896: %\subfigure[Movie nodes.]{\includegraphics[width=2.5in]{dist_movie.eps}}
897: %\subfigure[Award nodes.]{\includegraphics[width=2.5in]{dist_award.eps}}
898: %\caption{Degree distributions by node type.}
899: %\label{fig:1}
900: %\end{figure}                         
901: 
902: %-----------------------------------------------------------
903: \subsection{Terrorism data}
904: 
905: %Terrorism data is available from the Anti-Defamation League.
906: 
907: Relational data about world-wide terrorist events is available,%
908: \footnote{Data available at http://ontology.teknowledge.com.}
909: as well
910: as ontologies describing the organization of this data \cite{niles:2001}.
911: From this data we constructed an ontology and semantic graph.
912: The 59 node types are shown in Table \ref{tbl:terrorism-types}.
913: The ontology is shown in Figure \ref{fig:terr-adjmat} as an adjacency
914: matrix.  The semantic graph contains 2366 nodes.
915: 
916: %After removing isolated nodes from the data.
917: 
918: \begin{table}
919: \renewcommand{\arraystretch}{0.6}
920: \begin{center} \scriptsize
921: \begin{tabular}{|rl|c||rl|c|} \hline
922:    & Type & $n_\alpha$ & & Type & $n_\alpha$ \rule[-1.0ex]{0pt}{3ex}\\
923: \hline
924:  1 & Nation                       &  92 & 31 & Shooting               & 445 \\
925:  2 & GeographicalRegion           &  85 & 32 & Bombing                & 323 \\
926:  3 & City                         & 555 & 33 & HostageTaking          &  14 \\
927:  4 & Building                     &  10 & 34 & IncendDeviceAttack     &  18 \\
928:  5 & Combustion                   &   0 & 35 & Lynching               &   3 \\
929:  6 & Destruction                  &   0 & 36 & SuicideBombing         & 107 \\
930:  7 & Device                       &   0 & 37 & CarBombing             & 114 \\
931:  8 & GeographicArea               &   3 & 38 & Arson                  &  15 \\
932:  9 & Government                   &   1 & 39 & HandgrenadeAttack      &  38 \\
933: 10 & GovernmentPerson             &   2 & 40 & Hijacking              &  15 \\
934: 11 & Group                        &   1 & 41 & RocketMissileAttack    &  14 \\
935: 12 & Hole                         &   1 & 42 & KnifeAttack            &  53 \\
936: 13 & Human                        &   6 & 43 & ChemicalAttack         &   9 \\
937: 14 & JoiningAnOrg                 &   0 & 44 & LetterBombAttack       &  10 \\
938: 15 & Killing                      &   0 & 45 & Stoning                &   3 \\
939: 16 & OccupationalRole             &   3 & 46 & VehicleAttack          &   7 \\
940: 17 & Region                       &   0 & 47 & MortarAttack           &   8 \\
941: 18 & SocialRole                   &   1 & 48 & Vandalism              &   4 \\
942: 19 & StationaryArtifact           &   1 & 49 & Other                  &   5 \\
943: 20 & UnilateralGetting            &   0 & 50 & Number                 & 120 \\
944: 21 & Vehicle                      &   1 & 51 & Continent              &   2 \\
945: 22 & ViolentContest               &   1 & 52 & GeneralStructure       &   6 \\
946: 23 & Weapon                       &   0 & 53 & Month                  &  12 \\
947: 24 & Proposition                  &   0 & 54 & GeneralBuilding        &   2 \\
948: 25 & BinaryPredicate              &   0 & 55 & GeneralHuman           &   2 \\
949: 26 & ForeignTerrOrg               &  28 & 56 & Airbase                &   2 \\
950: 27 & ReligiousOrg                 &   0 & 57 & Airport                &   3 \\
951: 28 & TerroristOrg                 &  53 & 58 & State                  &   4 \\
952: 29 & Infiltration                 &   8 & 59 & Railway                &   1 \\
953: 30 & Kidnapping                   & 155 &    &                        &     \\
954: \hline                                 
955: \end{tabular}                          
956: \caption{Node types and their frequencies, $n_\alpha$, for the terrorism data.}
957: \label{tbl:terrorism-types}            
958: \end{center}                           
959: \end{table}                            
960:                                        
961: \begin{figure}                         
962: \begin{center}                         
963: \includegraphics[width=2in]{terr_adjmat.eps}      
964: \caption{Adjacency matrix for the terrorism ontology.  The matrix
965: is used to determine which node types are allowed to link to a given type.}
966: \label{fig:terr-adjmat}                
967: \end{center}                           
968: \end{figure}                           
969:                                        
970: \begin{figure}                         
971: \begin{center}                         
972: \includegraphics[width=3.25in]{terr_deg.eps}
973: \caption{Terrorism data: average number of neighbors per type, $m_\alpha$.  
974: Each error bar is of 
975: length $\sigma_\alpha^k$ on each side of the average.  }
976: \label{fig:terr-deg}                
977: \end{center}                           
978: \end{figure}                           
979: 
980: \begin{figure}                         
981: \begin{center}                         
982: \includegraphics[width=3.25in]{terr_con.eps}
983: \caption{Terrorism data: 
984: disparity of connected types, $R(\alpha)$.  Each error bar is of 
985: length $\sigma_\alpha^R$ on each side.}
986: \label{fig:terr-con}                
987: \end{center}                           
988: \end{figure}                           
989: 
990: Figures \ref{fig:terr-deg} and \ref{fig:terr-con} plot the average number
991: of neighbors per type and the disparity of connected types, respectively.
992: Error bars are used to show the dispersion of the quantities.
We consider that frequencies of 50 or more in this data set are 
statistically significant.  Thus, we consider types
1, 2, 3, 28, 30, 31, 32, 36, 37, 42, and 50.
996: For all these types, the average number of neighbors per type is small.
997: The types, however, can be separated by their disparity.
998: Types 1, 2, 3, 28, and 50 have high disparity, i.e., they are connected
999: to many different types.  This is consistent with nodes of
1000: types 1, 2, and 3 being of type 
1001: ``location,'' nodes of type 28 being of type ``terrorist organization,''
1002: and nodes of type 50 being of type ``number.''
1003: The remaining types are types of attacks and are not particularly
1004: correlated with any other node types (given the numbers of each node type).
1005: We note in this case
1006: that semantically similar node types
1007: have similar values of $m_\alpha$ and $R(\alpha)$.
1008: 
1009: 
1010: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1011: \section{Conclusion}
1012: 
1013: This paper reveals some of the knowledge representation
1014: issues associated with semantic graphs.  Ideas from the field of complex
1015: networks have been applied and generalized to semantic graphs.
1016: For example, transitivity may be used to determine
1017: the relevance of edge types for relationship detection.
1018: 
1019: We have defined several measures for statistically characterizing
1020: node types.  These quantities
1021: take into account the ontology which specifies the permitted connections in
1022: the semantic graph.
Many other important measures can be defined,
such as correlations with attribute {\em values}
\cite{jensen:2002}, which were not covered in this paper.
1026: These and other tools can be useful to help design ontologies
1027: and semantic graphs for knowledge representation.
1028: 
1029: 
1030: %Many issues arise due to the existence of
1031: %different types on the nodes and links.  In standard
1032: %graphs without these types, coarser models, for example, are generally 
1033: %built by clustering nodes (of the same type) that are likely to be
1034: %related.  What we have described is the very different case of
1035: %being able to cluster nodes based on the types of links that join them.
1036: 
1037: %For example, we define 
1038: %the tendency of a node type to link to a few or
1039: %to many other node types, and the average number of neighbors
1040: %of a node, taking into account the types to which it is permitted to link 
1041: %(according to the ontology).  Using these measures, we found that node types
1042: %including dates, numbers, (e.g., number of
1043: %deaths in a terrorist event) and document-ID's (the original source of
1044: %the link data) are not particularly useful for relationship detection.
1045: 
1046: 
1047: 
1048: 
1049: %We summarize the general
1050: %procedure below:
1051: %
1052: %\begin{itemize}
1053: %\item{} First identify nodes with large $n_{\alpha}$. Only for these types, 
1054: %a statistical analysis is meaningful and the analysis below is performed
1055: %for these types.
1056: %\item{} Compute $k_{\alpha}$ and its variance. This will show which 
1057: %type is highly connected and what is the nature (scale-free or not for
1058: %example) of the network relatively to the different types.
1059: %\item{} Compute the quantity $Y_2(\alpha)$ and its dispersion leading to the quantity 
1060: %$R(\alpha)$ and $\sigma^{R}(\alpha)$. These quantities indicate the
1061: %``disparity'' of each type (ie. they favorite connections if they
1062: %exist) and the variations among nodes of the same type.
1063: %\item{} Depending on the problem, one can also compute the clustering 
1064: %coefficient per type as well as the centrality matrix.
1065: %\end{itemize}
1066: 
1067: 
1068: \section{Acknowledgments}
1069: We are pleased to
1070: thank Keith Henderson and David Jensen for helpful discussions.
1071: MB wishes to thank the Center for Applied Scientific Computing and
1072: the Institute for Scientific Computing Research at Lawrence Livermore
1073: National Laboratory for their hospitality during the formative stages
1074: of this work. This work was performed under the auspices of the U.S. Department
1075: of Energy by University of California Lawrence Livermore
1076: National Laboratory under contract No.~W-7405-ENG-48.
1077: 
1078: 
1079: 
1080: \begin{thebibliography}{50}
1081: 
\bibitem[\protect\citeauthoryear{Albert \& Barab{\'a}si}{2002}]{albert:2002}
Albert, R., and Barab{\'a}si, A.-L.
1084: \newblock 2002.
1085: \newblock Statistical mechanics of complex networks.
1086: \newblock {\em Reviews of Modern Physics} 74(1):47--97.
1087: 
1088: \bibitem[\protect\citeauthoryear{Amaral \bgroup \em et al.\egroup
1089:   }{2000}]{amaral:2000}
1090: Amaral, L. A.~N.; Scala, A.; Barth{\'e}lemy, M.; and Stanley, H.~E.
1091: \newblock 2000.
1092: \newblock Classes of small-world networks.
\newblock {\em Proceedings of the National Academy of Sciences USA}
  97:11149--11152.
1096: 
\bibitem[\protect\citeauthoryear{Barab{\'a}si \& Albert}{1999}]{barabasi:1999}
Barab{\'a}si, A.-L., and Albert, R.
1099: \newblock 1999.
1100: \newblock Emergence of scaling in random networks.
1101: \newblock {\em Science} 286:509--512.
1102: 
1103: \bibitem[\protect\citeauthoryear{Barth{\'e}lemy, Gondran, \&
1104:   Guichard}{2003}]{Barthelemy:2003a}
1105: Barth{\'e}lemy, M.; Gondran, B.; and Guichard, E.
1106: \newblock 2003.
1107: \newblock Spatial structure of the internet traffic.
1108: \newblock {\em Physica A} 319:633--642.
1109: 
1110: \bibitem[\protect\citeauthoryear{Chow}{2004}]{chow-tr:2004}
1111: Chow, E.
1112: \newblock 2004.
1113: \newblock A graph search heuristic for shortest distance paths.
1114: \newblock Technical Report UCRL-JRNL-202894, Lawrence Livermore National
1115:   Laboratory.
1116: 
1117: \bibitem[\protect\citeauthoryear{Coffman, Greenblatt, \&
1118:   Marcus}{2004}]{coffman:2004}
1119: Coffman, T.; Greenblatt, S.; and Marcus, S.
1120: \newblock 2004.
1121: \newblock Graph-based technologies for intelligence analysis.
\newblock {\em Communications of the ACM} 47(3):45--47.
1123: 
1124: \bibitem[\protect\citeauthoryear{Derrida \& Flyvbjerg}{1987}]{Derrida:1987}
1125: Derrida, B., and Flyvbjerg, H.
1126: \newblock 1987.
1127: \newblock Statistical properties of randomly broken objects and of multivalley
1128:   structures in disordered systems.
1129: \newblock {\em Journal of Physics {A}} 20(15):5273--5288.
1130: 
1131: \bibitem[\protect\citeauthoryear{Eliassi-Rad \&
1132:   Chow}{2004}]{eliassi-rad-tr:2004}
1133: Eliassi-Rad, T., and Chow, E.
1134: \newblock 2004.
1135: \newblock A probabilistic approach to accelerating path-finding in large
1136:   semantic networks.
1137: \newblock Technical Report UCRL-CONF-202002, Lawrence Livermore National
1138:   Laboratory.
1139: 
1140: \bibitem[\protect\citeauthoryear{Faloutsos, McCurley, \&
1141:   Tomkins}{2004}]{faloutsos:2004}
1142: Faloutsos, C.; McCurley, K.; and Tomkins, A.
1143: \newblock 2004.
1144: \newblock Fast discovery of connection subgraphs.
1145: \newblock In {\em Proceedings of the 10th ACM SIGKDD International Conference
1146:   on Knowledge Discovery and Data Mining},  118--127.
1147: \newblock Seattle, WA, USA: ACM Press.
1148: 
1149: \bibitem[\protect\citeauthoryear{Granovetter}{1973}]{granovetter:1973}
1150: Granovetter, M.
1151: \newblock 1973.
1152: \newblock The strength of weak ties.
1153: \newblock {\em American Journal of Sociology} 78:1360--1380.
1154: 
1155: \bibitem[\protect\citeauthoryear{Jensen \& Neville}{2002}]{jensen:2002}
1156: Jensen, D., and Neville, J.
1157: \newblock 2002.
1158: \newblock Data mining in social networks.
1159: \newblock In {\em Papers of the Symposium on Dynamic Social Network Modeling
1160:   and Analysis (Sponsored by National Academy of Sciences)}.
1161: \newblock Washington, DC, USA: National Academy Press.
1162: 
1163: \bibitem[\protect\citeauthoryear{Jensen, Rattigan, \& Blau}{2003}]{jensen.2003}
1164: Jensen, D.; Rattigan, M.; and Blau, H.
1165: \newblock 2003.
1166: \newblock Information awareness: a prospective technical assessment.
1167: \newblock In {\em Proceedings of the ninth ACM SIGKDD international conference
1168:   on Knowledge discovery and data mining},  378--387.
1169: \newblock Washington, D.C.: ACM Press.
1170: 
1171: \bibitem[\protect\citeauthoryear{Kolda \bgroup \em et al.\egroup
1172:   }{2004}]{DHS-DSW:2004}
1173: Kolda, T.; Brown, D.; Corones, J.; Critchlow, T.; Eliassi-Rad, T.; Getoor, L.;
1174:   Hendrickson, B.; Kumar, V.; Lambert, D.; Matarazzo, C.; McCurley, K.;
1175:   Merrill, M.; Samatova, N.; Speck, D.; Srikant, R.; Thomas, J.; Wertheimer,
1176:   M.; and Wong, P.~C.
1177: \newblock 2004.
1178: \newblock Data sciences technology for homeland security information management
1179:   and knowledge discovery.
1180: \newblock Technical Report UCRL-TR-208926, Lawrence Livermore National
1181:   Laboratory.
1182: 
1183: \bibitem[\protect\citeauthoryear{Newman, Strogatz, \&
1184:   Watts}{2001}]{newman:2001a}
1185: Newman, M. E.~J.; Strogatz, S.~H.; and Watts, D.~J.
1186: \newblock 2001.
1187: \newblock Random graphs with arbitrary degree distributions and their
1188:   applications.
1189: \newblock {\em Physical Review E} 64(026118).
1190: 
1191: \bibitem[\protect\citeauthoryear{Newman}{2003}]{newman:2003a}
Newman, M. E.~J.
1193: \newblock 2003.
1194: \newblock The structure and function of complex networks.
1195: \newblock {\em SIAM Review} 45(2):167--256.
1196: 
1197: \bibitem[\protect\citeauthoryear{Niles \& Pease}{2001}]{niles:2001}
1198: Niles, I., and Pease, A.
1199: \newblock 2001.
1200: \newblock Towards a standard upper ontology.
1201: \newblock In {\em Proceedings of the 2nd International Conference on Formal
1202:   Ontology in Information Systems (FOIS-2001)}.
1203: 
1204: \bibitem[\protect\citeauthoryear{Popp \bgroup \em et al.\egroup
1205:   }{2004}]{popp.2004}
1206: Popp, R.; Armour, T.; Senator, T.; and Numrych, K.
1207: \newblock 2004.
1208: \newblock Countering terrorism through information technology.
1209: \newblock {\em Communications of the ACM} 47(3):36--43.
1210: 
1211: \bibitem[\protect\citeauthoryear{Sowa}{1984}]{sowa:1984}
1212: Sowa, J.~F.
1213: \newblock 1984.
1214: \newblock {\em Conceptual Structures: Information Processing in Mind and
1215:   Machine}.
1216: \newblock Reading, MA: Addison-Wesley.
1217: 
1218: \bibitem[\protect\citeauthoryear{Watts \& Strogatz}{1998}]{watts:1998}
1219: Watts, D.~J., and Strogatz, S.~H.
1220: \newblock 1998.
1221: \newblock Collective dynamics of small-world networks.
1222: \newblock {\em Nature} 393:440--442.
1223: 
1224: 
1225: 
1226: 
1227: 
1228: 
1229: \end{thebibliography}
1230: 
1231: 
1232: %\bibliographystyle{aaai}
1233: %\bibliography{aaai-ss05-kr}
1234: %\bibliography{CNGraph-jan05}
1235: 
1236: 
1237: 
1238: 
1239: 
1240: \end{document}
1241: