1: %\documentstyle[aps,epsf,rotate,preprint]{revtex}
2: %\documentstyle[aps,epsf,rotate,multicol]{revtex}
3: %\documentclass[a4paper,10pt,pre,twocolumn,showpacs,aps,floats,floatfix,superscriptaddress]{revtex4}
4: \documentclass[10pt,pre,twocolumn,aps,floats,floatfix,superscriptaddress]{revtex4}
5:
6: %\usepackage{epsfig}
7: \usepackage{times}
8: \usepackage{helvet}
9: \usepackage{courier}
10: \usepackage{graphicx}
11: \usepackage{subfigure}
12:
13:
14: \begin{document}
15:
16: \title{Knowledge Representation Issues in Semantic Graphs for Relationship Detection}
17:
18: \author{Marc Barth\'elemy\footnote{Authors listed alphabetically.}}
19: \affiliation{CEA-Centre d'Etudes de Bruy\`eres-Le-Ch\^atel \\
D\'epartement de Physique Th\'eorique et Appliqu\'ee\\
21: BP12, 91680 Bruy\`{e}res-Le-Ch\^{a}tel Cedex, France \\
22: }
23:
24: \author{Edmond Chow}
25: \affiliation{Center for Applied Scientific Computing \\
26: Lawrence Livermore National Laboratory \\
27: Box 808, L-560, Livermore, CA 94551, USA \\
28: }
29:
30: \affiliation{Biodefense Knowledge Center,
31: Lawrence Livermore National Laboratory.}
32:
33: \author{Tina Eliassi-Rad}
34: \affiliation{Center for Applied Scientific Computing \\
35: Lawrence Livermore National Laboratory \\
36: Box 808, L-560, Livermore, CA 94551, USA \\
37: }
38:
39: \affiliation{Biodefense Knowledge Center,
40: Lawrence Livermore National Laboratory.}
41:
42: \begin{abstract}
43: %\begin{quote}
44: An important task for Homeland Security is the prediction of
45: threat vulnerabilities, such as through the detection of
46: relationships between seemingly disjoint entities. A structure
47: used for this task is a \emph{semantic graph},
48: also known as a \emph{relational data graph} or an
49: \emph{attributed relational graph}. These graphs encode relationships as
50: {\em typed} links between a pair of {\em typed} nodes.
51: Indeed, semantic graphs are very similar to semantic networks used in AI.
52: The node and link types are related through
53: an \emph{ontology} graph (also known as a \emph{schema}).
54: Furthermore, each node has a set of attributes associated
55: with it (e.g., ``age'' may be an attribute of a node of type
56: ``person''). Unfortunately, the selection of types and attributes for
57: both nodes and links depends on human expertise and
58: is somewhat subjective and even arbitrary. This subjectiveness
59: introduces biases into any algorithm that operates on
60: semantic graphs. Here, we raise some knowledge
61: representation issues for semantic graphs and provide some
62: possible solutions using recently developed ideas in the field
63: of complex networks. In particular, we use the concept of
64: transitivity to evaluate the relevance of individual links in the
65: semantic graph for detecting relationships.
66: We also propose new statistical measures
67: for semantic graphs and illustrate these semantic measures on
68: graphs constructed from movies and terrorism data.
69: %\end{quote}
70: \end{abstract}
71:
72:
73:
74: \maketitle
75:
76: %\author{Marc Barth\'elemy} \affiliation{CEA-Centre d'Etudes de
77: %Bruy{\`e}res-le-Ch{\^a}tel, D\'epartement de Physique Th\'eorique et
78: %Appliqu\'ee BP12, 91680 Bruy\`eres-Le-Ch\^atel, France}
79:
80:
81:
82: %-----------------------------------------------------------------
83: \section{Introduction}
84:
85: A semantic graph is a network of {\em heterogeneous} nodes and links.
86: In contrast to the usual mathematical description of a graph, semantic
87: graphs have different types of nodes, and in general, different types
of links. Semantic graphs are also called attributed relational graphs
\cite{coffman:2004} or relational data graphs (in the knowledge discovery
literature). Their power lies not only in their structure
but also in the semantic information that resides on their nodes and links.
92: Examples of semantic graphs include citation networks where the nodes do
93: not simply consist of papers, but also consist of
94: authors, institutions, journals,
95: and conferences. Another example is the Internet Movie Database
96: where the nodes may be persons (actors, directors, etc.),
97: movies, studios, and awards, among others. In Homeland Security,
98: these graphs are used in a variety of information analysis tasks
99: \cite{jensen.2003,coffman:2004,popp.2004,DHS-DSW:2004}.
100: In particular, such graphs
101: may be used for predicting threat vulnerabilities.
102:
103: Data for semantic graphs come from relations parsed from text documents
104: and/or data from relational databases. Our motivation for this
105: work comes from our experience in constructing semantic graphs
106: from two sources of data---movies data and terrorism data---to be discussed
107: at the end of this paper.
108: In both these cases, we were faced
109: with a wide variety of choices: what are the node types, what
110: are the link types, and how do these choices affect the algorithms
111: that we intend to use on these graphs?
112:
113: Several types of algorithms operating on semantic graphs
114: are of interest to us. For example,
115: to determine the nature of a possible relationship
116: between two entities, a subgraph consisting of the shortest paths
117: (or another metric) between two nodes in the semantic graph
118: may be constructed and examined \cite{faloutsos:2004}.
119: We refer to this process as {\em relationship detection}.
120: Fast algorithms based on heuristic search
121: (which improve on breadth-first search or bi-directional search)
122: are available for this task, which either use or do not use the
123: semantic information in the graph \cite{eliassi-rad-tr:2004,chow-tr:2004}.
124: These algorithms,
125: however, depend on knowing which links (or link types) in the semantic graph
126: are useful for detecting relationships. For example,
127: two people who share a connection to ``San Francisco'' because they
128: were born there are unlikely to have any real-life connection. One of the goals
129: of this paper is to present automatic algorithms for determining
130: which are useful links for relationship detection, as well as present
131: concepts to help answer related questions.
132:
133: In the past few years, a new field called {\em complex networks}
134: (see, e.g., \citeauthor{albert:2002} (2002) and
135: \citeauthor{newman:2003a} (2003)) has
136: emerged to study the structure of real-world networks.
137: Statistical tools for characterizing graphs and networks have been
138: developed, with the impetus of understanding the relationship
139: between the structure and function of networks. Computer techniques have
140: allowed these statistical measurements to be performed on very large
141: real-world networks. In this paper
142: we generalize some of these techniques in order to apply them
143: to semantic graphs. For example, some types of nodes in semantic
144: graphs can be connected to many other types of nodes, but generally
145: have few actual links. We quantify this concept and hypothesize that
146: nodes such as these are not useful for relationship detection.
147: In addition, the concept of {\em transitivity} in social network analysis
148: (called {\em clustering coefficient} in the complex networks literature)
149: is useful for determining
150: which are useful links for relationship detection.
151:
152: In the following, we begin by describing semantic graphs and ontologies.
153: We then use the concept of transitivity for evaluating links and link
154: types for relationship detection. An important aspect of this paper is
155: a presentation of new statistical measures for semantic graphs, as well
156: as issues related to the scale (level of detail) of semantic graphs.
157: Examples of semantic graphs for movies and terrorism data are
158: given near the end of the paper.
159:
160: \section{Semantic Graphs and Ontologies}
161:
162: A semantic graph
163: consists of nodes and directed links, with each
164: node having a {\em type} (e.g., movie). The set of types is usually
small compared to the number of nodes. Each node is also labeled
with one or more {\em attributes} that identify the specific node
(e.g., {\em Shrek}) or give additional information about that node
(e.g., gross income). Links may also have types; for example, the
169: (person $\rightarrow$ movie) link may be of type ``acted-in,'' or
170: ``directed.'' (In this case, multigraphs, or graphs that may have
171: multiple links between the same pair of nodes, are possible.) In some
172: semantic graphs, the meaning of a link between any two nodes is clear (although
173: different between different pairs of node types), and no link types need
174: to be defined. Finally, links may also have attributes.
175: For additional details, see
176: \citeauthor{sowa:1984} (1984).
177: %quillian:1968,lenat:1995,reed:2002,shapiro:2000a,woods:1975}.
178:
179: Depending on the types of nodes and links and on the available
180: information, certain relations can or cannot exist. The set of
181: relations that can exist in a given semantic graph can be described by an
182: auxiliary graph called an {\em ontology,}
183: or a {\em schema} \cite{jensen:2002}. More often, an ontology graph
184: is created first by defining the types of relations that the semantic graph
185: will encode.
186: A small example of an ontology is given in Figure \ref{fig:example},
187: showing three node types: person, meeting and city.
188:
189: Special links in an ontology graph could describe {\em is-a} and {\em part-of}
190: relationships among node types. This is a node type hierarchy that will be
191: briefly mentioned when we discuss the scale of semantic graphs.
192:
193: \begin{figure}
194: \begin{center}
195: \includegraphics[width=5.0cm]{example.eps}
196: \caption{A small ontology consisting of three node types.}
197: \label{fig:example}
198: \end{center}
199: \end{figure}
200:
201: \section{Transitivity for Evaluating Nodes and Edges}
202:
203: Consider a node ``San Francisco'' of type ``city'' in a semantic graph,
204: and suppose we have a database of people which includes city of birth
205: among the data fields. A node ``Alice'' of type ``person'' may be
206: linked to the node ``San Francisco'' if Alice was born in San Francisco.
207: Other nodes linked to node San Francisco imply a relationship
208: to San Francisco and in turn their relation to Alice. However,
209: it is not clear that such relationships give useful information
210: about Alice since most entities a short graph distance away from ``Alice''
211: will have no real-life connection to Alice.
212:
213: On the other hand, people born in a city such as
214: ``Tikrit,'' may have a much higher likelihood of
215: knowing each other, that is, it may be important in this case to be able to
216: associate two people
217: through their city of birth. Instead of using a human
218: with potential biases to evaluate nodes
219: and links, an automatic procedure is
220: desirable for objectively determining which nodes and links
221: should be used in the semantic graph for relationship detection.
222:
223: Another example is nodes of type ``date.''
224: Dates could represent birthdates, dates of meetings, etc.
225: For example, a node for a person born on 9-11-2001 may be linked to a node
226: labeled ``9-11-2001.''
However, the fact that two events share a date
rarely indicates that the two events are related. Our bias is to
treat a date as an attribute of a node, rather than as a
node of its own (with the type ``date'').
231: Topologically, a ``date''
232: node may be connected to many other {\em types} of nodes, but generally
233: each date node is connected to only a small number of other nodes.
234: This may be an unbiased indication that a date is not useful for relationship
235: detection.
236:
237: \subsection{The transitivity concept}
238:
239: The concept of link transitivity is useful to address some of
240: the above issues. If a node $i$ has a link to node $j$ and node $j$
241: has a link to node $k$, then a measure of transitivity in the network
242: is the probability that node $i$ has a link to node $k$. In social
243: networks and many other networks categorized as {\em small-world} networks,
244: this probability is high. This is natural in social networks because
a friend of a friend is also a friend in a proportion that is much higher
246: than in a random network. In general, we refer to $j$ as a {\em neighbor}
247: of $i$ if $i$ and $j$ are directly connected in a graph. Also, we
248: refer to the {\em degree} of a node as the number of neighbors it has.
249:
250: The concept of transitivity is quantified as follows.
251: The {\em clustering coefficient} of a node, denoted by $C(i)$,
252: is a measure of the connectedness between the neighbors of the node.
253: Let $k_i$ denote the degree of node $i$, and let
254: $E_i$ denote the number of links between the $k_i$ neighbors.
255: Then, for an undirected graph, the quantity \cite{watts:1998}
256: \begin{equation}
257: C(i) = \frac{E_i}{k_i(k_i - 1)/2}
258: \label{eq:cc}
259: \end{equation}
260: is the ratio of the number of links between
261: a node's neighbors to the number of links that can exist.
262: We define $C(i)$ to be 0 when $k_i$ is 0 or 1.
263: When $C(i)$ is averaged over all nodes in the graph, we have the clustering
264: coefficient for a graph.
Note that
a high average clustering coefficient does {\em not} imply the existence of
267: clusters or communities (subgraphs that are internally
268: more highly connected than externally) in the graph.
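
As a small illustration of Equation (\ref{eq:cc}), with numbers chosen
purely for illustration and not taken from our data, suppose node $i$ has
$k_i = 4$ neighbors and that $E_i = 3$ links exist among those neighbors. Then
\[
C(i) = \frac{3}{4 \cdot 3 / 2} = \frac{3}{6} = 0.5 ,
\]
so half of the links that could exist between the neighbors of $i$ are present.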
269:
270: \subsection{Relevance of a node}
271:
272: We consider the problem of determining whether a node in a semantic graph
273: (e.g., ``San Francisco'' in a previous example)
274: is useful for relationship detection. Consider a node $i$
275: which has links to many other nodes.
276: For now, we assume the links are of all the same type.
277: To evaluate whether or not $i$ is useful for relationship
278: detection, we examine whether or not the neighbors of $i$
279: are actually related in the semantic graph with high frequency.
280: Whether or not two neighbors are related is decided by whether
281: or not a link exists between the two neighbors. (A weaker condition
282: if this does not hold is whether the two neighbors are linked
283: via a third node which is already deemed a useful node for
284: relationship detection.) This leads to the use of the clustering
285: coefficient defined in Equation (\ref{eq:cc}) to measure
286: the relevance of a node $i$ with degree greater than 1.
287: The equation can be generalized so that $E_i$ counts links with
288: the weaker condition described above.
289: A threshold $\tau$ is needed and if $C(i) > \tau$ then $i$ is
290: a useful node. If $i$ is not a useful node, {\em all} the links
291: involving $i$ should not be used for relationship detection and
292: could be removed from the semantic graph. If these links are removed,
293: $i$ could be made an attribute of the nodes that $i$ originally linked
294: to, in order not to lose any information.
295:
296: The above can be generalized for semantic graphs
297: when $i$ is linked via many different
298: types of links. In this case, instead of a count of relationships
299: involving pairs of neighbors of $i$, a matrix $M(t_1,t_2)$ is used
300: instead. Here $M(t_1,t_2)$ counts the number of relationships
301: between pairs of neighbors $(a,b)$, where $a$ is linked to $i$ via type $t_1$
and $b$ is linked to $i$ via type $t_2$. Small entries in this
matrix give {\em pairs} of link types (associated with $i$)
304: that should not be traversed in relationship detection.
305:
306: \subsection{Relevance of a link}
307:
308: The relevance of an existing or potential relationship between two
309: nodes $a$ and $b$ can be evaluated by how many neighbors they have in
310: common. More precisely a relevance measure may be defined as
311: \begin{equation}
312: S(a,b) = \frac{|N(a,b)|}{|T(a,b)|}
313: \label{eq:strength}
314: \end{equation}
315: where
316: \[
317: N(a,b) = \left\{ w \mid w \mbox{ is linked to $a$ and $b$},
318: w \ne a, w \ne b \right\}
319: \]
320: and
321: \[
322: T(a,b) = \left\{ w \mid w \mbox{ is linked to $a$ or $b$},
323: w \ne a, w \ne b \right\}
324: \]
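with $|T(a,b)| = \mbox{deg}(a) + \mbox{deg}(b) - |N(a,b)|$
when $a$ and $b$ are not linked to each other (a direct link between
them adds 2 to the sum of the degrees, which must then be subtracted),
where $\mbox{deg}(a)$ is the degree of $a$.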
327: We have $0 \le S(a,b) \le 1$ with
328: large values of this relevance measure indicating a strong
329: relationship between $a$ and $b$ supported by a high proportion of
330: common neighbors.
331: This quantity is similar to the clustering coefficient and
332: can be generalized to involve neighbors $w$ farther from $a$
333: and $b$.
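
To make Equation (\ref{eq:strength}) concrete, consider a hypothetical pair
of nodes $a$ and $b$ that are not linked to each other, where $a$ has
neighbors $\{w_1, w_2, w_3, w_4\}$ and $b$ has neighbors $\{w_3, w_4, w_5\}$.
Then $N(a,b) = \{w_3, w_4\}$ and $T(a,b) = \{w_1, \ldots, w_5\}$, so
\[
S(a,b) = \frac{2}{5} = 0.4 ,
\]
and the size of $T(a,b)$ agrees with the degree formula, $4 + 3 - 2 = 5$.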
334:
335: There are many applications of this relevance measure. For example,
336: pairs of nodes with no existing link can be evaluated to check if
337: a latent link might exist. In another example, the relevance measure
338: can be computed for all links of a given type. A low average of this relevance
339: measure indicates that the given link type is not useful for
340: relationship detection; there is not a strong relation between nodes
341: incident on a link with the given type.
342: A high relevance measure for a link when the average relevance measure
343: for the link type is low (and vice-versa)
344: indicates an outlier that may be interesting
to investigate. This relevance measure must be used carefully, however,
since it assumes that the links it uses confer bona fide
relationships.
348:
349: It must also be recognized that a low relevance measure for an individual link
350: does not imply that the link is unimportant. On the contrary, the notion
351: of the ``strength of weak ties'' \cite{granovetter:1973}
352: suggests that these links
353: are critical in some sense. It is when almost all links of the {\em same}
354: type have low relevance measure (and this link type is not
355: $a$ ``secretly knows'' $b$) that this link type should not be used in
356: relationship detection.
357:
358: %-----------------------------------------------------------
359: \subsection{Generalization of clustering coefficient for semantic graphs}
360:
361: The clustering coefficient defined earlier has little meaning for
362: semantic graphs as it mixes different types of nodes and it does not
363: include the constraints imposed by the ontology.
364: To illustrate this, consider the ontology for a semantic graph
365: given by Figure~\ref{fig:clustering_example}.
366: In this case, a node of type $\alpha$ can be connected to types
367: $\beta$, $\gamma$ and $\delta$, but a neighbor of type $\delta$ can
368: never be connected to neighbors of type $\beta$ or $\gamma$. In order
369: to avoid unrealistically small values of the clustering coefficient we
370: thus have to divide by the number of links actually {\it allowed} by
371: the ontology and obtain
372: \begin{equation}
373: C(i;\alpha)=\frac{E_i}{E(i;\alpha)}
374: \end{equation}
375: where $E(i;\alpha)$ denotes the maximum number of links allowed
376: by the ontology.
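
As a hypothetical illustration (the neighbor counts and the set of allowed
links below are assumptions made for this example, not values read from
Figure~\ref{fig:clustering_example}), suppose a node $i$ of type $\alpha$ has
two neighbors of type $\beta$, one of type $\gamma$ and two of type $\delta$,
and suppose the ontology allows links among these neighbors only between
types $\beta$ and $\gamma$. Then only $E(i;\alpha) = 2 \times 1 = 2$ of the
$5 \cdot 4 / 2 = 10$ neighbor pairs can be linked. If $E_i = 1$ of these
links is present,
\[
C(i) = \frac{1}{10} = 0.1 , \qquad C(i;\alpha) = \frac{1}{2} = 0.5 ,
\]
which suggests how Equation (\ref{eq:cc}) can understate the clustering
around $i$ when the ontology forbids most neighbor pairs.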
377:
378: \begin{figure}
379: \begin{center}
380: \includegraphics[width=3.0cm]{clustering_example.eps}
381: \caption{A particular ontology for which neighbors of $\alpha$ of
382: type $\delta$ can never be connected to neighbors of type $\beta$ or $\gamma$.}
383: \label{fig:clustering_example}
384: \end{center}
385: \end{figure}
386:
387: %-----------------------------------------------------------
388: \section{Statistical Measures for Semantic Graphs}
389:
390: Along with clustering coefficient, two other relevant graph
391: properties that have been developed for standard (non-semantic) graphs
392: are {\em distributions of node degree} (number of neighbors of a node)
393: and {\em average path length} between any two nodes in the graph.
394: Together, these three graph properties can be useful for
395: studying the properties of a semantic graph for representing knowledge.
396:
397: Many real-world networks have high clustering coefficient, much higher
398: than $O(1/n)$ for random graphs, where $n$ is the number of nodes in
399: the graph. We believe that properly constructed semantic graphs must also
400: have moderately high clustering coefficients. Low values of clustering
401: coefficient may indicate that the linkage information in the semantic
402: graph is incomplete. Very high values of clustering coefficient may
403: also indicate a poorly constructed semantic graph where all the nodes are
404: very highly linked to each other (the limit is a fully connected graph),
405: indicating little discrimination in how the nodes are connected.
406:
407: The average path length, $\ell$,
408: in a semantic graph must also not be too small (which
409: is also associated with very high clustering coefficients). When the
410: average path length is small,
411: almost all nodes are approximately the same graph distance from each other,
412: giving little discriminatory ability to path-length based algorithms
413: for detecting relationships.
414:
415: For example, an ontology graph may contain a node (e.g., a node of
416: type ``provenance'') to which every other node in the ontology is linked.
417: In this case,
the maximum shortest path length in the ontology graph is 2,
419: which also suggests that the average path length in the semantic graph is
420: small. It may be useful to identify nodes or links in the ontology
421: graph that dramatically shorten the average path length. These nodes
422: and links are potentially not useful for relationship detection.
423:
424: The connectivity distribution $P(k)$
425: is of interest for semantic graphs, particularly the existence of
426: nodes with very high degree, as in the case of scale-free
427: networks~\cite{barabasi:1999,amaral:2000}. In a relationship detection
428: path search, paths through very high degree nodes are deemed less informative
429: \cite{faloutsos:2004}. For example,
430: in a social network, two people who know a popular person
431: are less likely to know each other; the linkages to the popular person
432: should be disregarded in the relationship detection search since they
433: may confer erroneous relationships.
434:
435: It is believed that power-law connectivity distributions arise when
436: there is little or no cost involved in the formation of links in the
437: network \cite{amaral:2000}. Without this property, no nodes would be able to
438: acquire a very large number of links. This may suggest that a graph
439: with power-law degree distribution may contain many weak linkages.
However, these weak linkages cannot be disregarded; cf. the strength of weak ties,
mentioned above.
442:
443: For semantic graphs, we showed above how to extend the concept of
444: clustering coefficient. In the next subsections, we expand the
445: potential usefulness of other concepts for semantic graphs.
446:
447: \subsection{Extension of node degree}
448:
449: Even in the simple case of connectivity, a given value
450: $k$ of the connectivity of a node of type $\alpha$ has no real meaning
451: for semantic graphs. Indeed, as shown in Figure~\ref{fig:k_example} the
452: topological connectivity in both cases is $k=4$ but the meaning of it
453: is very different in each case.
454:
455: \begin{figure}
456: \begin{center}
457: \includegraphics[width=5.0cm]{connectivity_example.eps}
458: \caption{Two examples for which the $\alpha$-type node has
459: topological connectivity $k=4$ but with a different meaning in each case,
cf.~\citeauthor{jensen:2002} (2002).}
461: \label{fig:k_example}
462: \end{center}
463: \end{figure}
464:
465: In the first case, the environment is very homogeneous while it is not
466: in the second case. Another complexity comes from the
467: fact that the number of $\beta$-type nodes can be very large thus
468: inducing a bias in the connectivity of the other nodes.
469:
470: The ontology implies that each node of type $\alpha$ can be connected
471: to a certain number, $k^{0}_{\alpha}$, of other
472: types. In the semantic graph, we have a total number of nodes
473: $n=\sum_{\alpha}n_{\alpha}$ and we denote the nodes by
474: $i=1,\dots,n$. The type of a node is given by the function $t(i)$. We
475: denote by $k_{\alpha\beta}(i)$ the number of neighbors of type $\beta$
476: of a node $i$ of type $\alpha$. The usual topological connectivity of
477: the node $i$ (which is of type $\alpha$) is then given by
478: \begin{equation}
479: k_{\alpha}(i)=\sum_{\beta}k_{\alpha\beta}(i).
480: \end{equation}
481: Using this quantity, we can define the average connectivity
482: of type $\alpha$ which is just the average over all nodes with
483: type $\alpha$ as
484: \begin{equation}
485: \overline{k_{\alpha}}=\frac{1}{n_{\alpha}}\sum_{i,\; t(i)=\alpha}k_{\alpha}(i).
486: \end{equation}
487:
488: If we want to compare the different types relative to their
489: connectivity, it is important to remember that some types can be
connected to many others (such as persons which can be linked to
other persons, cities, meetings, jobs, etc.) while other types are
492: only linked to one type (such as a conference which takes place only at
493: one location). In order to compare the different types we thus have to
494: rescale by the number of different neighbor types they can have according to
495: the ontology:
496: \begin{equation}
497: m_{\alpha}=\frac{\overline{k_{\alpha}}}{k^{0}_{\alpha}}.
498: \end{equation}
499:
500: This quantity indicates the average number of neighbors {\it per
501: type}. This quantity however does not tell us if there are large
502: connectivity fluctuations or if in contrast all nodes of a given type
503: have essentially the same connectivity. We thus have to measure the
504: connectivity variance {\it per type} which is calculated using the second moment
505: \begin{equation}
506: \overline{k^{2}_{\alpha}}=\frac{1}{n_{\alpha}}\sum_{i,\; t(i)=\alpha}k^{2}_{\alpha}(i)
507: \end{equation}
508: with the dispersion per type given by
509: \begin{equation}
510: \sigma^{k}_{\alpha}=\frac{[\overline{k^{2}_{\alpha}}-(\overline{k_{\alpha}})^2]^{1/2}}
511: {k^{0}_{\alpha}}.
512: \end{equation}
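
As a worked example with made-up numbers, suppose type $\alpha$ can link to
$k^{0}_{\alpha} = 2$ other types and that there are $n_{\alpha} = 3$ nodes of
this type, with connectivities $k_{\alpha}(i) = 2, 4, 6$. Then
\[
\overline{k_{\alpha}} = \frac{2+4+6}{3} = 4 , \qquad
m_{\alpha} = \frac{4}{2} = 2 ,
\]
\[
\overline{k^{2}_{\alpha}} = \frac{4+16+36}{3} = \frac{56}{3} ,
\]
and
\[
\sigma^{k}_{\alpha} = \frac{[56/3 - 16]^{1/2}}{2}
= \frac{(8/3)^{1/2}}{2} \approx 0.82 .
\]
In this example the dispersion is roughly 40\% of $m_{\alpha}$, so the
connectivity of nodes of type $\alpha$ fluctuates noticeably around its average.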
513:
514: Another possible way to characterize the connectivity distribution per
515: type is to plot the connectivity distribution. However, the dispersion
516: around the average is already a first indication of the nature of the
517: connections for different types. For some cases, the fluctuations
518: will be small, while for others it can fluctuate greatly
519: (such as the number of persons a person knows).
520: %This suggests that the nature
521: %of different networks between various types and spanning the
522: %semantic graph can be very different. The network of persons could be
523: %for example scale-free while for other types it can be well described
524: %by a simple random graph model.
525:
526: %-----------------------------------------------------------
527: \subsection{Disparity of connected types}
528:
529: The above quantities tell us the expected number of connections of a node of
530: a given type to another type
531: but not the correlations between different types. Indeed, a type
$\alpha$ can preferentially link to a type $\beta$ while it could
in principle also be linked to other types (as given by the ontology).
534:
535: We thus quantify the disparity (or affinity) of each
536: type to link to other types. In order to do this we use a convenient
537: quantity---denoted by $Y_2$---which was introduced in another
538: context~\cite{Derrida:1987,Barthelemy:2003a}. In order to understand the meaning
539: of this quantity let us consider an object that is broken into a number $N$
540: of parts, each part having a weight $w_i$. By construction $\sum_{i}w_i=1$
541: and $Y_2$ is given in this case by
542: \begin{equation}
543: Y_2=\sum_i[w_i]^2.
544: \end{equation}
If all parts have the same weight $w_i\sim 1/N$ then $Y_2\sim 1/N$ is
small (for large $N$). In contrast, if we have $w_1=1/2$ and the rest
is small implying $w_{i\ne 1}\sim 1/[2(N-1)]$ then we obtain $Y_2\sim
1/4$. This simple example can be easily generalized to more
549: complicated situations and shows that a small value of $Y_2$ indicates
550: a large number of relevant parts while a larger value (typically of
551: order $1/m$ where $m$ is of order unity) indicates the dominance of a
552: few parts.
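
The second case can be checked directly, taking the remaining weight to be
shared equally so that $w_{i\ne 1} = 1/[2(N-1)]$ exactly (an assumption made
here for illustration):
\[
Y_2 = \left(\frac{1}{2}\right)^2 + (N-1)\left[\frac{1}{2(N-1)}\right]^2
= \frac{1}{4} + \frac{1}{4(N-1)} ,
\]
which tends to $1/4$ for large $N$. For $N = 10$, for instance, this gives
$Y_2 = 5/18 \approx 0.28$, compared with $Y_2 = 0.1$ when all ten parts have
equal weight.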
553:
554: We now apply this idea to the number of types to quantify the
555: disparity of a node or the affinity of a type. The quantity $Y_2$ is first
556: defined for a given node $i$ of type $\alpha$
557: \begin{equation}
558: Y_2(i;\alpha)=\sum_{\beta}\left[\frac{k_{\alpha\beta}(i)}{k_{\alpha}(i)}\right]^2.
559: \end{equation}
560: In order to get results with statistical significance, we average this
561: quantity over all
562: nodes of the same type and we also compute its dispersion $\sigma^{Y}_{\alpha}$:
563: \begin{eqnarray}
564: \overline{Y}_2(\alpha)=\frac{1}{n_{\alpha}}\sum_{i,\; t(i)=\alpha}
565: Y_2(i;\alpha),\\
566: \sigma^{Y}_{\alpha}=\left[
567: \overline{Y_2^{2}(\alpha)}-(\overline{Y}_2(\alpha))^2
568: \right]^{1/2}.
569: \end{eqnarray}
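
For instance, with hypothetical counts, a node $i$ of type $\alpha$ with
$k_{\alpha\beta}(i) = 3$ and $k_{\alpha\gamma}(i) = 1$ has
$k_{\alpha}(i) = 4$ and
\[
Y_2(i;\alpha) = \left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2
= \frac{10}{16} = 0.625 ,
\]
whereas a node with two neighbors of each type has $Y_2(i;\alpha) = 1/2$,
the smallest value possible with two neighbor types.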
570:
571: These results must however be weighted by the fact that some types are
572: more numerous than others which could be a reason why they appear
more often than others. For a given type $\alpha$, we denote by ${\cal
V}(\alpha)$ the set of types which can be connected to $\alpha$ as
given by the ontology. If a node has $k$ neighbors, and if these
neighbors are picked at random among the nodes of the different types,
each type $\beta$ having population $n_{\beta}$, we then obtain a disparity given by
578: \begin{equation}
579: Y^{r}_2=\sum_{\beta\in{\cal V}(\alpha)}\left[\frac{n_{\beta}}{n}\right]^2.
580: \end{equation}
Again, this quantity will be very small if all types are uniformly
present in the semantic graph, with $Y^{r}_2\sim 1/N$ (where $N$ is the
total number of different types); if it is of order unity then
584: essentially a few types are over-represented. In order to take these
585: heterogeneities into account it is thus necessary to rescale
$\overline{Y}_2(\alpha)$ by $Y^{r}_2$ and to form the factor
\begin{equation}
R(\alpha)=\frac{\overline{Y}_2(\alpha)}{Y^{r}_2}
\end{equation}
590: and its corresponding dispersion,
591: \begin{equation}
592: \sigma^{R}_{\alpha}=\frac{\sigma^{Y}_{\alpha}}{Y^{r}_2}.
593: \end{equation}
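
As an illustration with assumed populations, let
${\cal V}(\alpha) = \{\beta, \gamma\}$ with $n_{\beta} = 30$ and
$n_{\gamma} = 10$ in a graph of $n = 100$ nodes. Then
\[
Y^{r}_2 = (0.3)^2 + (0.1)^2 = 0.10 ,
\]
and a type $\alpha$ whose average disparity is $0.625$ would have
$R(\alpha) = 0.625/0.10 = 6.25$.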
594:
595: A large value (larger than one) of $R(\alpha)$ indicates that type
596: $\alpha$ preferentially links to a small number of types and that
597: its neighbor types ${\cal V}(\alpha)$ are diverse in number.
598: If $R\ll 1$, the type
599: $\alpha$ may still be preferentially connected to a small set of types
600: but the diversity of the numbers of each neighbor type is small.
601:
The dispersion $\sigma^{R}_{\alpha}$ indicates whether the behavior as
603: described by the average value $R(\alpha)$ is typical, or if in
604: contrast there is large diversity among the nodes of type $\alpha$.
605:
606: Other usual quantities that are measured in order to
607: characterize a large network can also be generalized without any
608: difficulty. For example, degree distributions should be examined by
609: type of node. In a semantic graph, the overall degree distribution
610: may not be meaningful, but the degree distribution for a specific
611: node type may be power-law, etc.
As a further example, the average path length generalizes to become a matrix
$\ell_{\alpha\beta}$ where $\alpha$ is the type of the source node of the
shortest paths while $\beta$ is the type of the target node.
615: This matrix will in general have
616: entries with very different values.
617:
618: %%-----------------------------------------------------------
619: %\subsection{Extension of average path length}
620: %
621: %In the same flow of ideas, it is easy to generalize the important
622: %quantity which is the betweenness centrality to a semantic graphs.
623: %For topological networks, this quantity counts the fraction of shortest paths
624: %that goes through a given node
625: %\begin{equation}
626: %g(v)=\sum_{i\ne j}\frac{\sigma_{ij}(v)}{\sigma_{ij}}
627: %\end{equation}
628: %where $\sigma_{ij}$ denotes the number of shortest paths going from
629: %$i$ to $j$ and where $\sigma_{ij}(v)$ denotes the number of shortest
630: %paths going from $i$ to $j$ through $v$. The natural generalization to
631: %semantic graphs is then
632: %\begin{equation}
633: %g_{\beta\gamma}(v;\alpha)=\sum_{i, t(i)\beta\ne j, t(j)=\gamma}
634: %\frac{\sigma_{ij}(v)}{\sigma_{ij}}
635: %\end{equation}
636: %which means that we consider only shortest paths from a node of type
637: %$\beta$ to a node of type $\gamma$ (while the node $v$ is of type
638: %$\alpha$).
639:
640: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
641: \section{Scale in Semantic Graphs}
642:
643: Given a knowledge base of relational data, the choice of ontology
644: depends on what information needs to be captured in the semantic graph,
645: and how easily certain information needs to be retrieved.
646: The level of detail (or scale) chosen for the ontology
647: (choice of node and link types)
648: will have a direct impact on the properties of the corresponding
649: semantic graph.
650:
In the simplest ontology, we have nodes of only one type.
In the example of the movies database, this ontology
is a simple network of actors without any types, in which two actors are
connected if they played in the same movie. At the next finer
scale, we have actors and movies as node types. In this
case, an actor is connected to a movie if he or she played in that
movie. This is a special case of a semantic graph, namely
a {\em bipartite} network (two types of nodes, with links only between
the two types).
660: %The ontologies and graphs for
661: %these two scales are illustrated in Figure~\ref{fig:example_actor}.
662: Coarser models lose some of the information present in finer models
663: but can be useful for
664: large-scale computations, such as multi-level search techniques.
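
The relation between these two scales can be written explicitly.
As an illustration (the symbols $B$ and $A$ are introduced here only for
this sketch), let $B_{im}=1$ if actor $i$ played in movie $m$ and
$B_{im}=0$ otherwise. The coarser actor-actor network is then the
one-mode projection
\begin{equation}
A_{ij}=\left\{
\begin{array}{ll}
1 & \mbox{if } i\neq j \mbox{ and } \sum_{m} B_{im}B_{jm}>0,\\
0 & \mbox{otherwise.}
\end{array}
\right.
\end{equation}
The sum $\sum_m B_{im}B_{jm}$ counts the movies shared by actors $i$ and
$j$; the projection keeps only whether this count is nonzero, which is one
way in which the coarser model loses information.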
665:
666: At the finest scale of a terrorist network, we may have nodes
667: of type ``Religious Terrorist Organization'' and ``Political Terrorist
668: Organization.'' A coarser model may aggregate nodes of these
669: two types into a new type, ``Terrorist Organization'' (or the
670: aggregation may occur directly if a type hierarchy is available).
671: Depending on what information needs to be preserved, it may or may not be
672: important to distinguish between these two node types
673: at the structural level of the semantic graph.
674:
We note that in Homeland Security tasks,
data analysis more often involves searching for outliers than
for commonplace patterns. Thus it is essential that the fine-scale
data is retained and the coarse-scale data is used
appropriately (for example, as an aid in managing and processing
large-scale data).
681:
682: %The semantic graph may be examined to determine whether or not it is
683: %appropriate to coalesce two node types into a single node type.
684: %
685: %Two nodes are structurally similar if they
686: %We first define the structural similarity of two nodes as the
687: %number of neighbors they have in common
688: %
689: %Algorithm: average structural similarity may be used to aggregate nodes.
690: %
691: %We also have the scale of the semantic graph. Here, we can collapse
692: %along the type hierarchy.
693: %
694: %Two nodes can be coalesced if they have a high degree of structural
695: %similarity.
696:
697: %\begin{figure}
698: %\begin{center}
699: %\includegraphics[width=6.0cm]{example_actor.eps}
700: %\caption{Two different scales for the movie actor network.}
701: %\label{fig:example_actor}
702: %\end{center}
703: %\end{figure}
704:
705: %-----------------------------------------------------------
706: \subsection{Effect of scale on statistical measures}
707:
Here we illustrate the effect of scale on the clustering coefficient.
We consider a random bipartite graph with Poisson-distributed
numbers of both movies per actor (with average $\mu$) and actors per
movie (with average $\nu$). We suppose that we have $n_A$ actors and
$n_M$ movies. Since each link connects an actor to a movie, the total
number of links is $n_A\mu=n_M\nu$, which imposes the constraint
\begin{equation}
\frac{\mu}{n_M}=\frac{\nu}{n_A} .
\end{equation}
717:
718: This model can be considered as a ``null'' model since there are no
719: particular correlations here. If one computes the clustering coefficient of
720: the one-mode projection
721: of this network, one obtains~\cite{newman:2001a}
722: \begin{equation}
723: C=\frac{1}{\mu+1} .
724: \end{equation}
725: This quantity is finite even in the limit of very large networks
726: $n_{A,M}\to\infty$. This is in contrast with the usual random network for
727: which
728: \begin{equation}
729: C\sim \frac{1}{n}
730: \end{equation}
where $n$ is the number of nodes. At this stage one might conclude that
the actor network is very clustered and different from a random
network with no correlations. This conclusion is, however, clearly
incorrect, since the large clustering coefficient here is simply a
consequence of the network construction procedure.
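
As a numerical illustration (the values below are chosen only for the
example and are not taken from any data set), take $\mu=4$ movies per
actor and $\nu=10$ actors per movie. The projected actor network then has
\begin{equation}
C=\frac{1}{\mu+1}=\frac{1}{5}=0.2
\end{equation}
for any number of actors. In this null model an actor has roughly
$\mu\nu=40$ co-actors. A usual random network with $n=10^4$ nodes and
the same mean degree $\langle k\rangle=40$ has
$C\simeq\langle k\rangle/n=4\times 10^{-3}$, which is fifty times smaller.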
736:
737: %This simple example shows that the way of constructing the network and
738: %the choice of the scale can be very relevant and should clearly specified in
739: %any discussion.
740:
741: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
742: \section{Examples}
743:
744: \subsection{Movies data}
745:
746: The ``Movies'' test data at the UCI KDD Archive contains information
747: about movies, persons (actors, directors, etc.), studios, awards, etc.
748: The data was originally compiled by Gio Wiederhold (Stanford University).
749: We used this data to construct an ontology and semantic graph to
750: express most of the information in the dataset. Figure \ref{fig:imdb-ont}
751: shows the ontology graph that we developed. In the figure, the meaning of
752: most of the links is obvious. However, the person-person
753: link implies {\em married-to}, {\em lived-with}, or some other non-professional
754: relationship; the person-studio link implies {\em founded}; the movie-movie
755: link implies {\em sequel-to}. We note that the data is very incomplete.
756:
757: \begin{figure}
758: \begin{center}
759: \includegraphics[width=2in]{movies_schema.eps}
760: \caption{Movies ontology.}
761: \label{fig:imdb-ont}
762: \end{center}
763: \end{figure}
764:
765: In this ontology, the best meaning of the node Role is unclear.
766: For example, are two actors linked to the same Role node in the semantic
767: graph if they
768: played the role of Villain in two different movies? Alternatively,
769: a role node in the semantic graph may only link to actors playing
770: a given role in a single movie. We arbitrarily chose the former in our case.
771:
A related question, which is structurally similar but semantically different,
is the following. Should two actors who win a Best Actor award be linked to the
{\em same} Award node in the semantic graph? In this case we did not choose
this interpretation, since it seems that awards are individual entities,
whereas roles are not.
777:
778: Table \ref{tbl:imdb-results} summarizes the node types, frequencies,
779: and other statistical measures for the movies semantic graph.
780: The results show
781: high dispersion of average connectivity per type, for all types.
782: Further, the disparity of connected types is not particularly
different from a random model. These observations indicate a relatively well-constructed
784: semantic graph; there are no particular correlations (given
785: the numbers of each node type) and thus the information
786: content in the graph is high. The results will be very different
787: for the terrorism data.
788:
In the semantic graph,
the nodes with the largest clustering coefficients
depend on whether the types of the nodes are considered. In the standard
case where the types are not considered, the node Maurice Barrymore
has a high clustering coefficient; the node is connected to Georgiana Drew
Barrymore, Lionel Barrymore, Ethel Barrymore, etc., all of whom
are connected to each other. If node types are considered, then it
does not matter that two neighbors of a node are unlinked if the
ontology does not permit them to be linked. Nodes that
were missed with the above measure may now have a high clustering coefficient,
e.g., the movie {\em Dogma} (perhaps due to the idiosyncrasies of
the incomplete data).
801:
802: % low clust coef?
803:
In the semantic graph, the link between Columbia Pictures and
drama (genre) has the largest number of common neighbors (710).
806: However, when the link
807: relevance measure (Equation (\ref{eq:strength})) is used,
808: which accounts for the number of links a node has,
809: the link between Bud Abbott and Lou Costello is found (30 common neighbors).
810: (We also found re-releases of movies under a new name in this process.)
811: Further, a semantic version of relevance can be defined, which
812: considers only the links that are allowed by the semantic graph.
813: In this case, the link between Tokuma Studio and docu-drama is found.
814: (Tokuma is linked to drama and the movie {\em Carences}; docu-drama is
815: linked to {\em Carences} and Miramax; and Miramax is linked to drama.)
816:
We also computed the average relevance per link type for the semantic graph.
The least frequent link types were Person-\emph{founded}-Studio
and Studio-\emph{located-in}-Country. However, the link types with the lowest average
relevance per link were Movie-\emph{shot-in}-Country and
Award-\emph{awarded-in}-Country. As mentioned, these latter links may be the
least useful for automatic relationship detection.
823:
824: \begin{table}
825: \begin{center} \scriptsize
826: \begin{tabular}{|rl|r|rr|rr|} \hline
827: & Node Type & $n_\alpha$ &
828: $m_\alpha$ & $\sigma_\alpha^k$ & $R(\alpha)$ & $\sigma_\alpha^R$ \\
829: \hline
830: 1 & Person & 21504 & 0.872 & 2.383 & 1.836 & 0.663 \\
831: 2 & Movie & 11540 & 1.131 & 0.816 & 1.299 & 0.644 \\
832: 3 & Award & 6734 & 2.579 & 10.201 & 0.905 & 0.144 \\
833: 4 & Country & 19 & 222.509 & 582.572 & 1.812 & 0.364 \\
834: 5 & Studio & 1075 & 1.948 & 9.534 & 1.241 & 0.408 \\
835: 6 & Genre & 39 & 77.803 & 160.060 & 0.512 & 0.154 \\
836: 7 & Role & 115 & 25.561 & 64.164 & 0.924 & 0.028 \\
837: 8 & Distributor & 16 & 206.156 & 356.043 & 0.782 & 0.165 \\
838: \hline
839: \end{tabular}
840: \caption{Node types and statistics for the movies data: frequency of
841: node type $n_\alpha$, average connectivity per type $m_\alpha$ and
842: its dispersion $\sigma_\alpha^k$, disparity of connected types
843: $R(\alpha)$ and its dispersion $\sigma_\alpha^R$.
844: The results show
845: high dispersion of average connectivity per type, for all types.
846: Further, the disparity of connected types is not particularly
847: different from a random model.
848: }
849: \label{tbl:imdb-results}
850: \end{center}
851: \end{table}
852:
853: %Figure \ref{fig:bpdist} shows the degree distributions of the movie-actor
854: %bipartite graph. When all nodes are considered together, the power-law
855: %relationship for the nodes of type Person are hidden.
856: %
857: %\begin{figure}
858: %\centering
859: %\subfigure[Person nodes (30226 nodes).]{\includegraphics[width=1.5in]{bpdist_person.eps}}
860: %\subfigure[Movie nodes (11561 nodes).]{\includegraphics[width=1.5in]{bpdist_movie.eps}}
861: %\subfigure[All nodes.]{\includegraphics[width=1.5in]{bpdist_all.eps}}
862: %\caption{Degree distributions of the movie-actor bipartite graph
863: %by node type.}
864: %\label{fig:bpdist}
865: %\end{figure}
866:
867: %{OLD:Before ROLE collapsed:Node types and statistics for the movies data.}
868: %
869: %\begin{table}
870: %\begin{center} \scriptsize
871: %\begin{tabular}{|rl|r|rr|rr|} \hline
872: % & Node Type & $n_\alpha$ &
873: % $m_\alpha$ & $\sigma_\alpha^k$ & $R(\alpha)$ & $\sigma_\alpha^R$ \\
874: %\hline
875: % 1 & Person & 30226 & 0.999 & 5.319 & 2.309 & 1.078 \\
876: % 2 & Movie & 11561 & 2.071 & 1.625 & 1.342 & 0.588 \\
877: % 3 & Award & 29759 & 1.357 & 4.070 & 0.740 & 0.188 \\
878: % 4 & Country & 17 & 963.412 & 3336.693 & 1.027 & 0.126 \\
879: % 5 & Studio & 1044 & 1.937 & 9.397 & 1.109 & 0.363 \\
880: % 6 & Genre & 201 & 17.511 & 85.387 & 0.422 & 0.076 \\
881: % 7 & Role & 46154 & 1.000 & 0.000 & 0.834 & 0.000 \\
882: % 8 & Distributor & 111 & 4.716 & 4.640 & 0.769 & 0.193 \\
883: %\hline
884: %\end{tabular}
885: %\label{tbl:imdb-results}
886: %\end{center}
887: %\end{table}
888:
889: % following are degree distributions for the full imdb (not bipartite)
890: %
891: %\begin{figure}
892: %\centering
893: %\subfigure[All nodes.]{\includegraphics[width=2.5in]{dist_all.eps}}
894: %\subfigure[Person nodes.]{\includegraphics[width=2.5in]{dist_person.eps}}
895: %\\
896: %\subfigure[Movie nodes.]{\includegraphics[width=2.5in]{dist_movie.eps}}
897: %\subfigure[Award nodes.]{\includegraphics[width=2.5in]{dist_award.eps}}
898: %\caption{Degree distributions by node type.}
899: %\label{fig:1}
900: %\end{figure}
901:
902: %-----------------------------------------------------------
903: \subsection{Terrorism data}
904:
905: %Terrorism data is available from the Anti-Defamation League.
906:
907: Relational data about world-wide terrorist events is available,%
908: \footnote{Data available at http://ontology.teknowledge.com.}
909: as well
910: as ontologies describing the organization of this data \cite{niles:2001}.
911: From this data we constructed an ontology and semantic graph.
912: The 59 node types are shown in Table \ref{tbl:terrorism-types}.
913: The ontology is shown in Figure \ref{fig:terr-adjmat} as an adjacency
914: matrix. The semantic graph contains 2366 nodes.
915:
916: %After removing isolated nodes from the data.
917:
918: \begin{table}
919: \renewcommand{\arraystretch}{0.6}
920: \begin{center} \scriptsize
921: \begin{tabular}{|rl|c||rl|c|} \hline
922: & Type & $n_\alpha$ & & Type & $n_\alpha$ \rule[-1.0ex]{0pt}{3ex}\\
923: \hline
924: 1 & Nation & 92 & 31 & Shooting & 445 \\
925: 2 & GeographicalRegion & 85 & 32 & Bombing & 323 \\
926: 3 & City & 555 & 33 & HostageTaking & 14 \\
927: 4 & Building & 10 & 34 & IncendDeviceAttack & 18 \\
928: 5 & Combustion & 0 & 35 & Lynching & 3 \\
929: 6 & Destruction & 0 & 36 & SuicideBombing & 107 \\
930: 7 & Device & 0 & 37 & CarBombing & 114 \\
931: 8 & GeographicArea & 3 & 38 & Arson & 15 \\
932: 9 & Government & 1 & 39 & HandgrenadeAttack & 38 \\
933: 10 & GovernmentPerson & 2 & 40 & Hijacking & 15 \\
934: 11 & Group & 1 & 41 & RocketMissileAttack & 14 \\
935: 12 & Hole & 1 & 42 & KnifeAttack & 53 \\
936: 13 & Human & 6 & 43 & ChemicalAttack & 9 \\
937: 14 & JoiningAnOrg & 0 & 44 & LetterBombAttack & 10 \\
938: 15 & Killing & 0 & 45 & Stoning & 3 \\
939: 16 & OccupationalRole & 3 & 46 & VehicleAttack & 7 \\
940: 17 & Region & 0 & 47 & MortarAttack & 8 \\
941: 18 & SocialRole & 1 & 48 & Vandalism & 4 \\
942: 19 & StationaryArtifact & 1 & 49 & Other & 5 \\
943: 20 & UnilateralGetting & 0 & 50 & Number & 120 \\
944: 21 & Vehicle & 1 & 51 & Continent & 2 \\
945: 22 & ViolentContest & 1 & 52 & GeneralStructure & 6 \\
946: 23 & Weapon & 0 & 53 & Month & 12 \\
947: 24 & Proposition & 0 & 54 & GeneralBuilding & 2 \\
948: 25 & BinaryPredicate & 0 & 55 & GeneralHuman & 2 \\
949: 26 & ForeignTerrOrg & 28 & 56 & Airbase & 2 \\
950: 27 & ReligiousOrg & 0 & 57 & Airport & 3 \\
951: 28 & TerroristOrg & 53 & 58 & State & 4 \\
952: 29 & Infiltration & 8 & 59 & Railway & 1 \\
953: 30 & Kidnapping & 155 & & & \\
954: \hline
955: \end{tabular}
956: \caption{Node types and their frequencies, $n_\alpha$, for the terrorism data.}
957: \label{tbl:terrorism-types}
958: \end{center}
959: \end{table}
960:
961: \begin{figure}
962: \begin{center}
963: \includegraphics[width=2in]{terr_adjmat.eps}
964: \caption{Adjacency matrix for the terrorism ontology. The matrix
965: is used to determine which node types are allowed to link to a given type.}
966: \label{fig:terr-adjmat}
967: \end{center}
968: \end{figure}
969:
970: \begin{figure}
971: \begin{center}
972: \includegraphics[width=3.25in]{terr_deg.eps}
973: \caption{Terrorism data: average number of neighbors per type, $m_\alpha$.
974: Each error bar is of
975: length $\sigma_\alpha^k$ on each side of the average. }
976: \label{fig:terr-deg}
977: \end{center}
978: \end{figure}
979:
980: \begin{figure}
981: \begin{center}
982: \includegraphics[width=3.25in]{terr_con.eps}
983: \caption{Terrorism data:
984: disparity of connected types, $R(\alpha)$. Each error bar is of
985: length $\sigma_\alpha^R$ on each side.}
986: \label{fig:terr-con}
987: \end{center}
988: \end{figure}
989:
990: Figures \ref{fig:terr-deg} and \ref{fig:terr-con} plot the average number
991: of neighbors per type and the disparity of connected types, respectively.
992: Error bars are used to show the dispersion of the quantities.
993: We consider that frequencies of 50 or more in this data set are
994: statistically significant. Thus, we consider types
1, 2, 3, 28, 30, 31, 32, 36, 37, 42, and 50.
996: For all these types, the average number of neighbors per type is small.
997: The types, however, can be separated by their disparity.
998: Types 1, 2, 3, 28, and 50 have high disparity, i.e., they are connected
999: to many different types. This is consistent with nodes of
1000: types 1, 2, and 3 being of type
1001: ``location,'' nodes of type 28 being of type ``terrorist organization,''
1002: and nodes of type 50 being of type ``number.''
1003: The remaining types are types of attacks and are not particularly
1004: correlated with any other node types (given the numbers of each node type).
1005: We note in this case
1006: that semantically similar node types
1007: have similar values of $m_\alpha$ and $R(\alpha)$.
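
The coverage obtained with the frequency threshold of 50 can be checked
against Table \ref{tbl:terrorism-types}. Summing $n_\alpha$ over the 11
retained types gives
\begin{equation}
92+85+555+53+155+445+323+107+114+53+120=2102,
\end{equation}
so the analysis covers $2102/2366\approx 89\%$ of the nodes in the semantic
graph, while the remaining 48 types share the other 264 nodes.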
1008:
1009:
1010: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1011: \section{Conclusion}
1012:
1013: This paper reveals some of the knowledge representation
1014: issues associated with semantic graphs. Ideas from the field of complex
1015: networks have been applied and generalized to semantic graphs.
1016: For example, transitivity may be used to determine
1017: the relevance of edge types for relationship detection.
1018:
1019: We have defined several measures for statistically characterizing
1020: node types. These quantities
1021: take into account the ontology which specifies the permitted connections in
1022: the semantic graph.
Many other important measures can be defined,
such as correlations with attribute {\em values}
\cite{jensen:2002}, which were not covered in this paper.
1026: These and other tools can be useful to help design ontologies
1027: and semantic graphs for knowledge representation.
1028:
1029:
1030: %Many issues arise due to the existence of
1031: %different types on the nodes and links. In standard
1032: %graphs without these types, coarser models, for example, are generally
1033: %built by clustering nodes (of the same type) that are likely to be
1034: %related. What we have described is the very different case of
1035: %being able to cluster nodes based on the types of links that join them.
1036:
1037: %For example, we define
1038: %the tendency of a node type to link to a few or
1039: %to many other node types, and the average number of neighbors
1040: %of a node, taking into account the types to which it is permitted to link
1041: %(according to the ontology). Using these measures, we found that node types
1042: %including dates, numbers, (e.g., number of
1043: %deaths in a terrorist event) and document-ID's (the original source of
1044: %the link data) are not particularly useful for relationship detection.
1045:
1046:
1047:
1048:
1049: %We summarize the general
1050: %procedure below:
1051: %
1052: %\begin{itemize}
1053: %\item{} First identify nodes with large $n_{\alpha}$. Only for these types,
1054: %a statistical analysis is meaningful and the analysis below is performed
1055: %for these types.
1056: %\item{} Compute $k_{\alpha}$ and its variance. This will show which
1057: %type is highly connected and what is the nature (scale-free or not for
1058: %example) of the network relatively to the different types.
1059: %\item{} Compute the quantity $Y_2(\alpha)$ and its dispersion leading to the quantity
1060: %$R(\alpha)$ and $\sigma^{R}(\alpha)$. These quantities indicate the
1061: %``disparity'' of each type (ie. they favorite connections if they
1062: %exist) and the variations among nodes of the same type.
1063: %\item{} Depending on the problem, one can also compute the clustering
1064: %coefficient per type as well as the centrality matrix.
1065: %\end{itemize}
1066:
1067:
1068: \section{Acknowledgments}
1069: We are pleased to
1070: thank Keith Henderson and David Jensen for helpful discussions.
1071: MB wishes to thank the Center for Applied Scientific Computing and
1072: the Institute for Scientific Computing Research at Lawrence Livermore
1073: National Laboratory for their hospitality during the formative stages
1074: of this work. This work was performed under the auspices of the U.S. Department
1075: of Energy by University of California Lawrence Livermore
1076: National Laboratory under contract No.~W-7405-ENG-48.
1077:
1078:
1079:
1080: \begin{thebibliography}{50}
1081:
1082: \bibitem[\protect\citeauthoryear{Albert \& Barabasi}{2002}]{albert:2002}
1083: Albert, R., and Barabasi, A.-L.
1084: \newblock 2002.
1085: \newblock Statistical mechanics of complex networks.
1086: \newblock {\em Reviews of Modern Physics} 74(1):47--97.
1087:
1088: \bibitem[\protect\citeauthoryear{Amaral \bgroup \em et al.\egroup
1089: }{2000}]{amaral:2000}
1090: Amaral, L. A.~N.; Scala, A.; Barth{\'e}lemy, M.; and Stanley, H.~E.
1091: \newblock 2000.
1092: \newblock Classes of small-world networks.
1093: \newblock In {\em Proceedings of the National Academy of Sciences USA},
1094: volume~97, 11149--11152.
1095: \newblock National Academy of Sciences.
1096:
1097: \bibitem[\protect\citeauthoryear{Barabasi \& Albert}{1999}]{barabasi:1999}
1098: Barabasi, A.-L., and Albert, R.
1099: \newblock 1999.
1100: \newblock Emergence of scaling in random networks.
1101: \newblock {\em Science} 286:509--512.
1102:
1103: \bibitem[\protect\citeauthoryear{Barth{\'e}lemy, Gondran, \&
1104: Guichard}{2003}]{Barthelemy:2003a}
1105: Barth{\'e}lemy, M.; Gondran, B.; and Guichard, E.
1106: \newblock 2003.
1107: \newblock Spatial structure of the internet traffic.
1108: \newblock {\em Physica A} 319:633--642.
1109:
1110: \bibitem[\protect\citeauthoryear{Chow}{2004}]{chow-tr:2004}
1111: Chow, E.
1112: \newblock 2004.
1113: \newblock A graph search heuristic for shortest distance paths.
1114: \newblock Technical Report UCRL-JRNL-202894, Lawrence Livermore National
1115: Laboratory.
1116:
1117: \bibitem[\protect\citeauthoryear{Coffman, Greenblatt, \&
1118: Marcus}{2004}]{coffman:2004}
1119: Coffman, T.; Greenblatt, S.; and Marcus, S.
1120: \newblock 2004.
1121: \newblock Graph-based technologies for intelligence analysis.
\newblock {\em Communications of the ACM} 47:45--47.
1123:
1124: \bibitem[\protect\citeauthoryear{Derrida \& Flyvbjerg}{1987}]{Derrida:1987}
1125: Derrida, B., and Flyvbjerg, H.
1126: \newblock 1987.
1127: \newblock Statistical properties of randomly broken objects and of multivalley
1128: structures in disordered systems.
1129: \newblock {\em Journal of Physics {A}} 20(15):5273--5288.
1130:
1131: \bibitem[\protect\citeauthoryear{Eliassi-Rad \&
1132: Chow}{2004}]{eliassi-rad-tr:2004}
1133: Eliassi-Rad, T., and Chow, E.
1134: \newblock 2004.
1135: \newblock A probabilistic approach to accelerating path-finding in large
1136: semantic networks.
1137: \newblock Technical Report UCRL-CONF-202002, Lawrence Livermore National
1138: Laboratory.
1139:
1140: \bibitem[\protect\citeauthoryear{Faloutsos, McCurley, \&
1141: Tomkins}{2004}]{faloutsos:2004}
1142: Faloutsos, C.; McCurley, K.; and Tomkins, A.
1143: \newblock 2004.
1144: \newblock Fast discovery of connection subgraphs.
1145: \newblock In {\em Proceedings of the 10th ACM SIGKDD International Conference
1146: on Knowledge Discovery and Data Mining}, 118--127.
1147: \newblock Seattle, WA, USA: ACM Press.
1148:
1149: \bibitem[\protect\citeauthoryear{Granovetter}{1973}]{granovetter:1973}
1150: Granovetter, M.
1151: \newblock 1973.
1152: \newblock The strength of weak ties.
1153: \newblock {\em American Journal of Sociology} 78:1360--1380.
1154:
1155: \bibitem[\protect\citeauthoryear{Jensen \& Neville}{2002}]{jensen:2002}
1156: Jensen, D., and Neville, J.
1157: \newblock 2002.
1158: \newblock Data mining in social networks.
1159: \newblock In {\em Papers of the Symposium on Dynamic Social Network Modeling
1160: and Analysis (Sponsored by National Academy of Sciences)}.
1161: \newblock Washington, DC, USA: National Academy Press.
1162:
1163: \bibitem[\protect\citeauthoryear{Jensen, Rattigan, \& Blau}{2003}]{jensen.2003}
1164: Jensen, D.; Rattigan, M.; and Blau, H.
1165: \newblock 2003.
1166: \newblock Information awareness: a prospective technical assessment.
1167: \newblock In {\em Proceedings of the ninth ACM SIGKDD international conference
1168: on Knowledge discovery and data mining}, 378--387.
1169: \newblock Washington, D.C.: ACM Press.
1170:
1171: \bibitem[\protect\citeauthoryear{Kolda \bgroup \em et al.\egroup
1172: }{2004}]{DHS-DSW:2004}
1173: Kolda, T.; Brown, D.; Corones, J.; Critchlow, T.; Eliassi-Rad, T.; Getoor, L.;
1174: Hendrickson, B.; Kumar, V.; Lambert, D.; Matarazzo, C.; McCurley, K.;
1175: Merrill, M.; Samatova, N.; Speck, D.; Srikant, R.; Thomas, J.; Wertheimer,
1176: M.; and Wong, P.~C.
1177: \newblock 2004.
1178: \newblock Data sciences technology for homeland security information management
1179: and knowledge discovery.
1180: \newblock Technical Report UCRL-TR-208926, Lawrence Livermore National
1181: Laboratory.
1182:
1183: \bibitem[\protect\citeauthoryear{Newman, Strogatz, \&
1184: Watts}{2001}]{newman:2001a}
1185: Newman, M. E.~J.; Strogatz, S.~H.; and Watts, D.~J.
1186: \newblock 2001.
1187: \newblock Random graphs with arbitrary degree distributions and their
1188: applications.
1189: \newblock {\em Physical Review E} 64(026118).
1190:
1191: \bibitem[\protect\citeauthoryear{Newman}{2003}]{newman:2003a}
Newman, M. E.~J.
1193: \newblock 2003.
1194: \newblock The structure and function of complex networks.
1195: \newblock {\em SIAM Review} 45(2):167--256.
1196:
1197: \bibitem[\protect\citeauthoryear{Niles \& Pease}{2001}]{niles:2001}
1198: Niles, I., and Pease, A.
1199: \newblock 2001.
1200: \newblock Towards a standard upper ontology.
1201: \newblock In {\em Proceedings of the 2nd International Conference on Formal
1202: Ontology in Information Systems (FOIS-2001)}.
1203:
1204: \bibitem[\protect\citeauthoryear{Popp \bgroup \em et al.\egroup
1205: }{2004}]{popp.2004}
1206: Popp, R.; Armour, T.; Senator, T.; and Numrych, K.
1207: \newblock 2004.
1208: \newblock Countering terrorism through information technology.
1209: \newblock {\em Communications of the ACM} 47(3):36--43.
1210:
1211: \bibitem[\protect\citeauthoryear{Sowa}{1984}]{sowa:1984}
1212: Sowa, J.~F.
1213: \newblock 1984.
1214: \newblock {\em Conceptual Structures: Information Processing in Mind and
1215: Machine}.
1216: \newblock Reading, MA: Addison-Wesley.
1217:
1218: \bibitem[\protect\citeauthoryear{Watts \& Strogatz}{1998}]{watts:1998}
1219: Watts, D.~J., and Strogatz, S.~H.
1220: \newblock 1998.
1221: \newblock Collective dynamics of small-world networks.
1222: \newblock {\em Nature} 393:440--442.
1223:
1224:
1225:
1226:
1227:
1228:
1229: \end{thebibliography}
1230:
1231:
1232: %\bibliographystyle{aaai}
1233: %\bibliography{aaai-ss05-kr}
1234: %\bibliography{CNGraph-jan05}
1235:
1236:
1237:
1238:
1239:
1240: \end{document}
1241: