1: %\documentstyle[aps,graphics,multicol,psfig,epsfig,preprint,color]{revtex}  
2: \documentstyle[aps,graphics,multicol,psfig,epsfig,color]{revtex}  
3: 
4: \begin{document}
5: 
6: \title{Dissecting financial markets: Sectors and states}
7: 
8: \author{Matteo Marsili}
9: 
10: \address{Abdus
11: Salam International Center for Theoretical Physics, Strada Costiera 11, 
12: 34014 Trieste, Italy\\
13: and\\
14: Istituto Nazionale per la Fisica della Materia (INFM),
15: Unit\`a Trieste SISSA, Via Beirut 2-4, 34014 Trieste, Italy}
16: 
17: \date{\today}  
18: 
19: \maketitle 
20: 
21: \begin{abstract}
22: By analyzing a large data set of daily returns with a data clustering
23: technique, we identify economic sectors as clusters of assets with
24: similar economic dynamics. The sector size distribution follows Zipf's
25: law. Secondly, we find that patterns of daily market-wide economic
26: activity cluster into classes that can be identified with market
27: states. The distribution of frequencies of market states shows
28: scale-free properties and the memory of the market state process
29: extends to long times ($\sim 50$ days). Assets in the same sector
30: behave similarly across states. We characterize market efficiency by
31: analyzing the market's predictability and find that the market is indeed
32: close to being efficient. We find evidence of a dynamic
33: pattern after market crashes.
34: \end{abstract}
35: 
36: \pacs{PACS numbers: 05.40.-a, 05.20.Dd, 64.60.Ht, 87.23.Ge}
37: 
38: \begin{multicols}{2}
39: \narrowtext           
40: 
41: \section{Introduction}
42: 
43: Thanks to the availability of massive flows of financial data,
44: theoretical insights on financial markets can nowadays be tested with a
45: precision unprecedented in socio-economic systems. This poses a
46: challenge which has attracted natural scientists who have pioneered an
47: {\em empirical} approach to financial fluctuations
48: \cite{Mandelbrot1,MantegnaStanley,BouchaudPotters} independent of the
49: econometric approach and often in contrast with the {\em axiomatic}
50: approach of theoretical finance \cite{Farmer,duffie}.
51: 
52: The empirical evidence depicts financial markets as complex
53: self-organizing critical systems: The statistics of real market
54: returns deviate considerably from the Olympic Gaussian world described
55: by Louis Bachelier at the turn of the last century. Rather, Mandelbrot
56: \cite{Mandelbrot2} observed that fractal (L\'evy) statistics give a
57: closer approximation, even though that is not a satisfactory
58: model\cite{Mandelbrot1,MantegnaStanley}. Market returns display
59: scaling\cite{MantegnaStanley}; long range volatility correlations
60: and evidence of multiscaling \cite{multisc} have also
61: been discussed. Such features evoke the theory of critical phenomena
62: in physics, which explains how quite similar features may emerge from
63: the interaction of many microscopic degrees of freedom and statistical
64: laws. Indeed financial markets {\em are} systems of many interacting
65: degrees of freedom (the traders) and there are very good theoretical
66: reasons to expect that they operate rather close to criticality
67: \cite{MEM}. These expectations have been substantiated by microscopic
68: agent based market models\cite{CCMZ,CMZ01,BMRZ}: The picture offered
69: by these {\em synthetic markets} is one where speculation drives the
70: market to information efficiency -- i.e. to a point where market
71: returns are unpredictable. But the point where markets become exactly
72: efficient is the locus of a {\em phase transition}. Close to the phase
73: transition the behavior of synthetic markets is characterized by the
74: observed stylized facts -- fat tails and long range correlations --
75: whereas far from the critical region the market is well described in
76: terms of random walks (see Ref. \cite{CCMZ} for a non technical
77: discussion).
78: 
79: Work has however been mostly confined to single assets or
80: indices. Recently ensembles of assets and their correlations have
81: become the focus of quite intense interest. Random matrix theory has
82: been recognized as a tool for understanding how
83: noise dresses financial correlations \cite{Focus} and how one can undress
84: them \cite{GM}. Clustering techniques can help in understanding the
85: structure of correlations \cite{Mantegna}, and the impact of such
86: considerations on portfolio optimization has also been studied \cite{Gopiport}.
87: 
88: Here we report findings that strongly support the view of a
89: self-organized critical market. We show that long range correlations
90: and scale invariance extend both across assets and, in the behavior
91: of the ensemble of assets, across frequencies. More precisely, we
92: apply a novel parameter free data clustering method \cite{GM,mldc} to a
93: large financial data set \cite{data_set} in order to uncover the
94: internal structure of correlations both across different assets and
95: across different days. We identify statistically significant
96: classifications of assets in correlated {\em sectors} and of daily
97: profiles of market-wide activity in market {\em states}. The
98: statistics of both sector sizes and state sizes show scale free
99: properties.
100: 
101: Determining the market's states is an important achievement both
102: theoretically and practically: The concept of a state which codifies
103: all relevant economic information is the basis of many theoretical
104: models of financial markets. In practice, however, traders every day
105: experience a quite different reality: The market place is
106: flooded with massive flows of information of which it may be hard to
107: say what is relevant and what is irrelevant. It is by no means obvious
108: that something like market states exists at all and even if they exist
109: the problem becomes that of identifying them. Our aim is to give a
110: practical answer to these questions. We shall keep our discussion as
111: simple as possible, relegating technical details to notes and to the
112: appendix.
113: 
114: \section{The method and the data set}
115: 
116: The data clustering method that we use has been recently proposed in
117: Ref. \cite{GM}. In brief, it is based on the simple statistical
118: hypothesis that {\em similar objects have something in common}. 
119: It is possible to compute the likelihood that a given data set
120: satisfies this hypothesis and hence to look for the most likely 
121: cluster structure. A precise definition is given in the appendix and
122: for more details we refer the interested reader to
123: Refs. \cite{GM,mldc}.  Let us only mention that this method overcomes
124: several limitations of traditional data clustering approaches, such as the
125: need to pre-define a metric, to fix {\em a priori} the number of
126: clusters or to tune the value of other parameters\cite{mldc}. 
127: 
128: The data set covers the period from 1 January 1990 to 30 April
129: 1999 and it reports daily prices (open, high, low, close) for $7679$
130: assets traded on the New York Stock Exchange \cite{data_set}. 
131: The number of assets actually traded varies with time. Hence we mainly
132: focus on a subset of the $2000$ most actively traded assets (see
133: {\tt http://www.sissa.it/dataclustering/fin/} for the detailed list 
134: of assets considered, as well as for further information).
135: 
136: Our goal is to investigate the {\em internal} structure of
137: correlations; hence we first normalize the raw data \cite{norma} in
138: order to eliminate common trends and patterns both across assets and
139: across different days.  
140: %More precisely, if $x_i(t)$ is the normalized daily return of asset
141: %$i=1,\ldots,N$ in day $t=1,\ldots,T$, we have 
142: %\begin{eqnarray*} 
143: %\sum_{i=1}^N x_i(t)=0,&~&~~\sum_{i=1}^N x_i(t)^2=N~~~\forall t\\
144: %\sum_{t=1}^T x_i(t)=0,&~&~~\sum_{t=1}^T x_i(t)^2=T~~~\forall i
145: %\end{eqnarray*}
146: %hold simultaneously \cite{norma}. 
147: This procedure eliminates for example the so-called ``market mode'',
148: i.e. the constant correlation of individual asset's returns with the
149: so-called ``market's return''.
150: 
151: \section{Market sectors: Scale free market structure}
152: 
153: We first apply data clustering to group assets with similar economic
154: dynamics into sectors of {\em correlated} assets (see appendix). This
155: classification reveals a rich structure. The clusters giving the
156: largest contributions to the log-likelihood clearly emerge from the
157: noisy background in Fig. \ref{figassets}. We find a large overlap with
158: the sectors of economic activity defined by the Standard Industrial
159: Classification (SIC) codes (see caption of Fig. \ref{figassets}). But
160: we also find significant correlations between assets with widely
161: different SIC codes. This has practical relevance for risk management of
162: large portfolios which cannot be handled all at once. Indeed, rather
163: than splitting the problem according to economic sectors (defined by
164: the SIC) it is preferable to use our classification in correlated
165: sectors. The difference between the two classifications is also revealed by
166: a Zipf plot of the size of a sector against its rank (see inset of
167: Fig. \ref{figassets}). The distribution of correlated sector sizes
168: follows Zipf's law to a high accuracy, i.e. the number ${\cal N}(n)$
169: of sectors with more than $n$ firms (i.e. of size larger than $n$) is
170: inversely proportional to $n$. Note that the scale free distribution
171: of sector sizes is not due to an analogous property of {\em
172: fundamentals}. Indeed the rank plot of economic sector sizes bends in
173: log-log scale. This suggests that Zipf's law arises as a dynamical
174: consequence of market interaction.
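As a short worked illustration (the symbols $n_r$, $p(n)$ and the constant $C$ are introduced here only for this purpose and are not part of the analysis above), Zipf's law in the cumulative form ${\cal N}(n)\simeq C/n$ can be read in two equivalent ways. If sectors are ranked by decreasing size and $n_r$ is the size of the sector of rank $r$, then ${\cal N}(n_r)\simeq r$. Treating $n$ as a continuous variable, the density of sector sizes is $p(n)\propto -d{\cal N}/dn$. Hence
\[
n_r\simeq \frac{C}{r},\qquad p(n)\propto \frac{1}{n^{2}},
\]
so that a straight line of slope $-1$ in a log-log rank plot corresponds to a size density decaying as $n^{-2}$.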
175: 
176: The scale invariant behavior is robust with respect to the subset of
177: assets taken: The same behavior is found considering the
178: $1000$, $2000$ or $4000$ most actively traded assets in that
179: period, or the $443$ assets in the S\&P500 index (see
180: Ref. \cite{GM}). In addition we find, as in Ref. \cite{GM}, that the
181: correlation $c_s$ inside sector $s$ (see appendix) scales with its
182: size $n_s$ with a law $c_s\sim n_s^\gamma$ with $\gamma\simeq 1.66$.
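A rough consequence of this scaling can be sketched under two assumptions made here for illustration: that the prefactor of the power law is of order one, and that $c_s\gg n_s$ for large sectors. From the expression of the coupling $g_s$ given in the appendix,
\[
g_s^2=\frac{c_s-n_s}{n_s^2-n_s}\approx\frac{c_s}{n_s^2}\sim n_s^{\gamma-2}\simeq n_s^{-0.34}.
\]
Since $g_s^2$ is the expected overlap between two distinct assets of sector $s$ in the model of the appendix, this would mean that larger sectors are, on average, less tightly correlated internally than smaller ones.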
183: 
184: \begin{figure}
185: \centerline{\psfig{figure=ass.eps,width=8cm}}
186: \caption{Dendrogram of the cluster structure of correlated sectors
187: resulting from the hierarchical clustering algorithm. Assets are reported
188: along the horizontal axis and red shapes correspond to clusters of
189: correlated assets. The height of a shape is the contribution to the
190: log-likelihood of the corresponding cluster of assets. See the
191: appendix for more details. The cluster structure
192: is statistically significant because the noise level corresponding to
193: uncorrelated data would show structures with a log-likelihood of at
194: most $0.1$, three orders of magnitude smaller.  The classification in
195: sectors has a large overlap with economic sectors. For example,
196: clusters 1 and 2 contain firms in the electric sector and computers
197: respectively. Cluster 4 is the sector of gold, 5 is composed of banks,
198: 8 contains oil and gas firms, 9 petroleum. Clusters 3, 6 and 7 are
199: mixed clusters (more details are available at 
200: {\tt http://www.sissa.it/dataclustering/fin/}). Inset: Distribution of
201: correlated sector sizes for $2000$ ($\bullet$) and $4000$ ($\Box$)
202: assets. The distribution of the size of economic sectors ($\circ$), as
203: defined by the (first two digits of the) SIC codes, for the same
204: $4000$ assets is shown for comparison. The line (drawn as a guide to
205: the eye) has slope $-1$.}
206: \label{figassets}
207: \end{figure}
208: 
209: We finally remark that this property is not an artifact of the
210: method. Indeed the distribution of eigenvalues of the correlation
211: matrix shows a similar broad distribution, even though that is
212: affected by considerable noise dressing \cite{Focus}. A factor model
213: which takes into account a large enough number of principal components
214: (corresponding to the largest eigenvalues) reproduces the same
215: features\footnote{In our case $\approx 30$ eigenvalues of the
216: correlation matrix are significantly outside the noise band predicted
217: by Random Matrix Theory \cite{Focus}. With a correlation matrix which
218: retains the structure of the first $\sim 20$ principal components
219: (considering the remaining components as uncorrelated noise) we found
220: a quite similar cluster structure.}.
221: 
222: \section{Market states}
223: 
224: Are there well defined patterns of daily market-wide economic
225: performance? In order to answer this question, rather than classifying
226: assets according to their temporal evolution, we can classify days
227: according to the performance of different assets. Fig. \ref{figdays}
228: shows that, above a noisy background, a meaningful classification of
229: the daily profiles of market activity exists. Clusters of days can be
230: identified with different patterns of market-wide activity -- or
231: market states. Quite remarkably, the maximum likelihood classification
232: in market states shows scale free features, for large clusters
233: (frequent patterns of market activity). The number of patterns which
234: occur more than $d$ days behaves as ${\cal N}(d)\sim d^{-1.5}$ for the
235: most frequent patterns (inset top). There is a clear crossover in the
236: plot of cluster's correlation versus cluster size which distinguishes
237: the meaningful clusters (patterns) from a random noise background
238: (inset bottom).
239: 
240: \begin{figure}
241: \centerline{\psfig{figure=day.eps,width=8cm}}
242: \caption{Same plot as Fig. \ref{figassets} for days: Clusters of days
243: identify market states. We identify states (see labels) as groups of
244: correlated clusters of days.  Inset: Distribution of cluster sizes,
245: i.e. of the frequency with which states occur (top) and correlation
246: $c_s$ inside each cluster (bottom).}
247: \label{figdays}
248: \end{figure}
249: 
250: From a sample of $2000$ assets over $T=2358$ days we identify $5$
251: different states -- characterized by similar profiles of market
252: activity -- plus a sixth random state (see 
253: Fig. \ref{figdays}). We assign an integer $\omega(t)$ between $1$ and
254: $6$ to each day $t$, which is the state which occurred in that day.
255: 
256: We are then in a position to analyze market performance in different
257: states. Fig. \ref{figcross} shows the (non normalized) average daily
258: returns of different assets in different states. We find that the market's
259: behaviors in states 1 and 2 are anti-correlated: Those assets which go
260: up in state 1 go down in state 2, on average. Fig. \ref{figcross} also
261: shows that assets in the same sector as defined above have a similar
262: behavior. So, for example, while most of the assets go up in state 1
263: and down in state 2, the cluster of assets of Gold and Silver mining
264: has an opposite behavior. State 3 is clearly characterized by a fall
265: of High-tech companies and a mild rise in the electric sector. An
266: opposite behavior takes place in state 4, whereas state 5 is dominated
267: by a marked rise of Oil \& Gas, and Petroleum refining companies
268: \cite{data_set}. 
269: 
270: These results are remarkably stable with respect to the definition of
271: the time window where the analysis is performed \cite{stability}.
272: 
273: \begin{figure}
274: \centerline{\psfig{figure=cross.eps,width=8cm}}
275: \caption{Performance of the market in different states. Each asset $i$
276: corresponds to a point whose coordinates are the average returns
277: $(\langle{r_i|\omega}\rangle,\langle{r_i|\omega'}\rangle)$ of asset
278: $i$ in states $\omega$ and $\omega'$. Assets in different sectors are
279: plotted differently.}
280: \label{figcross}
281: \end{figure}
282: 
283: \subsection{Predictability and market efficiency}
284: 
285: Clustering the market's dynamics leaves us with the sequence
286: $\omega(t)$ of the states of the market in different days
287: $t=1,\ldots,T$. This allows us to pose interesting questions on
288: predictability and market's information efficiency.  
289: 
290: Let us first ask: Is it possible to predict the
291: state $\omega'$ of the market tomorrow, given the state $\omega$ of
292: the market today? In order to answer this question we estimate the
293: probability
294: \[
295: P_1(\omega'|\omega)=\sum_{t=1}^{T-1} 
296: \delta_{\omega(t),\omega}\delta_{\omega(t+1),\omega'}
297: /\sum_{t=1}^{T-1} 
298: \delta_{\omega(t),\omega}
299: \]
300: of transition from state $\omega$ to state $\omega'$. It turns out
301: that both the classification in states and the transition matrix
302: $P_1(\omega'|\omega)$ are very stable with respect to the definition of
303: the time window \cite{stability}. This means that they both vary very
304: slowly in time. Hence we shall neglect their variation in time
305: henceforth.
306: 
307: {\em If} the process $\omega(t)$ were Markovian, its predictability
308: could be quantified by the characteristic time $\tau$ of convergence
309: to the stationary state. This is related to the second largest
310: (in absolute value) eigenvalue $\lambda$ of the matrix
311: $P_1(\omega'|\omega)$ by $\tau=-1/\log|\lambda|$. We find $\tau\approx
312: 0.54$ days -- a value which would occur by chance, if there were no
313: correlations, in one out of $10^7$ cases\footnote{This conclusion was
314: reached considering the characteristic times $\tau$ for symbolic
315: sequences $\tilde\omega(t)$ generated by randomly reshuffling
316: days. These times are distributed around $\tau\approx
317: 0.33$ with a spread $\delta\tau\approx 0.04$. The analysis of the tail
318: of the distribution allows one to estimate the likelihood of $\tau\simeq
319: 0.54$ for the real sequence.}.  Hence statistical prediction is possible.
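To make these numbers concrete, the relation $\tau=-1/\log|\lambda|$ can be inverted. The Gaussian reading of the spread in the last step is an assumption made here for illustration; the estimate quoted above relies on the measured tail of the distribution.
\[
|\lambda|=e^{-1/\tau}:\qquad \tau\approx 0.54\ \Rightarrow\ |\lambda|\approx 0.16,\qquad
\tau\approx 0.33\ \Rightarrow\ |\lambda|\approx 0.05 ,
\]
\[
\frac{0.54-0.33}{0.04}\approx 5.3 .
\]
The real sequence thus lies about five spreads $\delta\tau$ above the reshuffled ones. For a Gaussian tail this corresponds to a probability of order $10^{-7}$, in line with the figure quoted above.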
320: 
321: Can we predict the market's returns on the basis of these results?
322: Fig. \ref{figcross} shows that average returns $\langle
323: r_i(t)|\omega(t)\rangle$ conditional on the state $\omega(t)$ of the
324: market contain non-trivial information. However this information is
325: not available for trading in day $t$. But if we know the transition
326: matrix $P_1(\omega'|\omega)$ we can estimate the expected return of
327: asset $i$ tomorrow given the state $\omega$ today:
328: \[
329: \langle r_i(t+1)|\omega(t)\rangle=\sum_{\omega'}
330: \langle r_i(t+1)|\omega(t+1)=\omega'\rangle P_1(\omega'|\omega(t)).
331: \]
332: A natural measure of predictability, inspired by works on theoretical
333: models \cite{CM,CMZ,CCMZ,BMRZ}, is the averaged signal-to-noise ratio
334: defined as:
335: \[
336: H_i(t'|t)= 
337: \sqrt{\sum_\omega \rho_\omega
338: \frac{\langle \delta r_i(t')|\omega(t)=\omega\rangle^2}
339: {\langle\delta r_i^2|\omega\rangle}}
340: \]
341: where $\delta r_i(t)=r_i(t)-\langle r_i\rangle$ and $\rho_\omega$ is
342: the frequency with which state $\omega$ occurs.  The distribution of
343: $H_i$ across assets is shown in Fig. \ref{figPH} for $t'=t$, $t'=t+1$
344: and $t'=t+\infty$.  The latter gives a benchmark of the
345: background noise level. We find $H_i(t|t)\gg H_i(t+\infty|t)$ for
346: several assets $i$: the knowledge of $\omega(t)$ {\em before} day $t$
347: would provide significant predictive power on excess returns. That same
348: information is much less useful the day after, since $H_i(t+1|t)$ is
349: only slightly above the noise level. This is a further indication that
350: the financial market is close to information efficiency, but not quite
351: unpredictable. In reality the transition matrix $P_1(\omega'|\omega)$
352: changes slowly in time. Hence this conclusion provides an ``upper
353: bound'' for the market's predictability (when measured out-of-sample):
354: Real markets are therefore even closer to efficiency.
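A simple bound helps in reading the values of $H_i$. This remark is added for illustration and holds for the equal-time case $t'=t$, where numerator and denominator refer to the same day. Since the square of an average never exceeds the average of the square, $\langle \delta r_i|\omega\rangle^2\le\langle\delta r_i^2|\omega\rangle$ for every state $\omega$, and since $\sum_\omega\rho_\omega=1$,
\[
0\le H_i(t|t)=\sqrt{\sum_\omega \rho_\omega
\frac{\langle \delta r_i|\omega\rangle^2}{\langle\delta r_i^2|\omega\rangle}}\le 1 .
\]
The value $H_i=0$ is attained when the conditional averages $\langle\delta r_i|\omega\rangle$ vanish in every state, i.e. when the state carries no information on the return, and $H_i=1$ when the return is fully determined by the state.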
355: 
356: If $\omega(t)$ were a Markov process, the characteristic time $\tau_k$
357: for transitions $\omega(t)\to\omega(t+k)$ over $k$
358: days\footnote{$\tau_k$ is computed in the same way as $\tau=\tau_1$
359: above, from the matrix $P_k(\omega'|\omega)$ of transition
360: probabilities $\omega(t)=\omega\to\omega(t+k)=\omega'$ in $k$ days. 
361: For a Markov process this matrix is the $k^{\rm th}$ power of the
362: matrix $P_1(\omega'|\omega)$ and its eigenvalues are given by
363: $\lambda_k=\lambda_1^k$.}
364: should decrease with $k$ as $\tau_k=\tau_1/k$. A prediction of the
365: future state of the market, which is significantly better than a
366: random draw, would only be possible on a time horizon of one day, if
367: the process were Markovian. The inset of Fig. \ref{figPH} shows that
368: $\tau_k$ remains significantly above the noise level almost up to
369: $k\approx 50$ days!  This means that $\omega(t)$ carries significant
370: information about the future state $\omega(t+k)$ of the market, even
371: after $k\approx 50$ days. The slow decay of $\tau_k$ is a further
372: signature of the presence of long range correlations.
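The Markov benchmark can be made explicit with the numbers quoted above. The comparison with the noise level assumes, for illustration only, that the reshuffled value $\tau\approx 0.33$ found for $k=1$ is representative also for $k>1$. Since $\lambda_k=\lambda_1^k$,
\[
\tau_k=-\frac{1}{\log|\lambda_1^k|}=-\frac{1}{k\log|\lambda_1|}=\frac{\tau_1}{k},
\qquad \tau_1\approx 0.54\ \Rightarrow\ \tau_2\approx 0.27,\quad \tau_{50}\approx 0.011 .
\]
The Markov prediction at $k=2$ is already below the reshuffled value, so in a finite sample it would be indistinguishable from noise. A signal that persists up to $k\approx 50$ therefore cannot be accounted for by one-step memory.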
373: 
374: \begin{figure}
375: \centerline{\psfig{figure=PH.eps,width=8cm}}
376: \caption{Distribution of predictability $H_i(t'|t)$ for $t'=t,~t+1$
377: and $t+\infty$. The noise background predictability $H_i(t+\infty|t)$
378: is estimated drawing $\omega(t +\infty)$ at random from the
379: populations of states.  Inset: Characteristic times $\tau_k$ for
380: transitions over $k$ days for the real sequence $\omega(t)$
381: ($\bullet$), a random sequence ($+$) and a Markov chain sequence
382: ($\circ$) generated with the transition probability
383: $P_1(\omega'|\omega)$ estimated from $\omega(t)$. The random sequence
384: ($+$) represents the noise background. For a Markov chain $\tau_k$
385: ($\circ$) is significantly above the noise level only for $k=1$. 
386: For the real market process $\tau_k$ is well above the noise level
387: up to $k\approx 50$.}
388: \label{figPH}
389: \end{figure}
390: 
391: During the period we have studied, two major extreme events occurred:
392: the 27 October 1997 and the 31 August 1998 crashes.  The state process
393: $\omega(t)$ is different before the crash, but is quite similar after
394: it. The strings of states, starting from the day of the crash, read
395: $2136613611\ldots$ and $2126614633\ldots$ in the two cases. This is a
396: significant similarity\footnote{Only two other strings of the type
397: $21x661$ occurred in the process but the starting days were Fridays
398: (90/04/27 and 90/05/25) and not Mondays. Note furthermore that
399: normalization \cite{norma} removes the collective component of the
400: dynamics and it ensures that crash days appear with the same weight as
401: normal days in the analysis.}. This suggests the existence of a
402: particular dynamical pattern with which markets respond to extreme
403: events (see also Ref. \cite{Omori} on this).
404: 
405: \section{Conclusion and outlook}
406: 
407: In conclusion we show that both the {\em horizontal} clustering of
408: assets in correlated sectors and the {\em vertical} classification of 
409: market-wide economic performance in market states, reveal a scale free
410: structure (see Figs. \ref{figassets}, \ref{figdays}). The emergent
411: picture poses quite severe constraints on multi-asset agent based
412: modeling, which we believe will disclose important information on how
413: real markets work. This expectation is based on the fact that
414: scale-free statistical behavior is a signature of interaction
415: mechanisms which is rather insensitive to microscopic details.
416: 
417: Furthermore, the identification of market states allows us to
418: precisely quantify informational efficiency by computing the market's
419: predictability, thereby establishing a direct contact between the
420: empirical world and the realm of theoretical models. In particular we
421: find that, as expected, markets are close to information efficiency.
422: 
423: We find that correlated sectors have a large overlap with sectors of
424: economic activity. In the same way, it would be interesting to
425: understand how states are correlated with economic information and the
426: news arrival process. 
427: 
428: In a wider context, we have discussed an unsupervised approach to the
429: study of a complex system. Be it a stock market, the world economy,
430: an urban traffic network, a cell of a living organism or the immune
431: system, the complex system can be considered as a black box.  We show
432: how a series of simultaneous measures in many different ``points''
433: of the system allows one to identify its {\em parts} and its {\em
434: states}.
435: 
436: A black box approach to a financial market or to a cell, which
437: neglects all of economics and finance or of biology and genetics and
438: relies only on empirical data, may lead to misleading results
439: especially if the data set is incomplete. Still, we believe, it has the
440: potential of uncovering collective aspects which can hardly be derived
441: in a theoretical bottom-up approach.
442: 
443: \appendix
444: \section{Maximum likelihood data clustering}
445: \label{mldc}
446: 
447: Consider a set of $N$ objects each of which is defined in terms of $D$
448: measurable features, so that each object is represented by a vector
449: $\vec \xi_i\in R^D$, $i=1,\ldots,N$. We assume for simplicity that data
450: are normalized: $\vec \xi_i\cdot \vec e=0$ where $\vec e=(1,1,\ldots,1)$
451: and $\|\vec\xi_i\|^2=\vec \xi_i\cdot\vec \xi_i=1$.
452: 
453: In our case, when identifying sectors, the objects are assets and
454: $N=A$, the number of assets. Their features are the daily returns in
455: each day $t$ and $D=T$. The $t^{\rm th}$ component of $\vec \xi_i$ is
456: $x_i(t)/\sqrt{T}$. When identifying states, instead, objects are days
457: and features are assets (i.e. $N=T$ and $D=A$). The $i^{\rm th}$
458: component of $\vec \xi_t$ is $x_i(t)/\sqrt{A}$. 
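As a consistency check (assuming that the normalized returns $x_i(t)$ of Ref. \cite{norma} have zero mean and unit variance both across days and across assets), the factor $1/\sqrt{T}$ is what makes the vectors satisfy the two normalization conditions above when objects are assets:
\[
\vec\xi_i\cdot\vec e=\frac{1}{\sqrt{T}}\sum_{t=1}^T x_i(t)=0,\qquad
\|\vec\xi_i\|^2=\frac{1}{T}\sum_{t=1}^T x_i(t)^2=1 .
\]
The same holds when objects are days, with $T$ replaced by $A$ and the sums running over assets.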
459: 
460: The problem of classifying  $N$ objects into different classes
461: goes under the name of data clustering.  Naively one would like to
462: have similar objects classified in the same cluster, but in practice
463: one faces a number of problems: What does ``similar'' mean?  What is
464: the ``right'' number of clusters?  Which principle should one follow?  We
465: resort to a recent data clustering technique \cite{GM,mldc} based on
466: the maximum likelihood principle and a simple statistical hypothesis:
467: {\em similar objects have something in common}. In mathematical terms,
468: we let $s_i$ be the label of the cluster to which object $i$ belongs,
469: and $A_s=\{i:~s_i=s\}$ be the set of objects with $s_i=s$. We assume
470: that
471: \begin{equation}
472: \vec \xi_i = g_{s_i}\vec\eta_{s_i}+\sqrt{1-g_{s_i}^2}\vec\epsilon_i.
473: \label{ansatz}
474: \end{equation}
475: Here $\vec \eta_s$ denotes the {\em common} component shared by all
476: objects $i\in A_s$ and $g_s\ge 0$ weights the common component against
477: the individual one $\vec\epsilon_i$. Eq. (\ref{ansatz}) is the 
478: statistical hypothesis where $g_s$ and $s_i$ are the parameters to be
479: fitted. Assuming further that both $\vec \eta_s$ and $\vec \epsilon_i$
480: are Gaussian vectors in $R^D$, with zero average and unit variance
481: ($E[\|\eta_s\|^2]=E[\|\epsilon_i\|^2]=1$) makes it possible to compute
482: the likelihood of the parameters ${\cal G}=\{g_s\}$ and ${\cal
483: S}=\{s_i\}$ (see Ref. \cite{GM} for details). The likelihood is
484: maximal when
485: \begin{equation}
486: g_s=\sqrt{\max\left[0,\frac{c_s-n_s}{n_s^2-n_s}\right]}
487: %~~~\hbox{if $c_s\ge n_s$}
488: \end{equation}
489: %and $g_s=0$ otherwise, 
490: where $n_s=|A_s|$ is the number of objects in
491: cluster $s$ and
492: \[
493: c_s=\sum_{i,j\in A_s} \vec \xi_i\cdot\vec \xi_j
494: \]
495: is the total correlation inside cluster $s$.
496: The maximum log-likelihood per feature takes the form
497: %\begin{equation}
498: \[
499: {\cal L}_c({\cal S})=\frac{1}{2}\sum_{s:~n_s>1}\max
500: \left[0,\log
501: \frac{n_s}{c_s}+(n_s-1)\log\frac{n_s^2-n_s}
502: {n_s^2-c_s}\right].
503: \]
504: %\end{equation}
505: Note that a cluster with a single isolated object ($n_s=c_s=1$), or a
506: cluster of uncorrelated objects ($c_s=n_s$) gives a vanishing
507: contribution to the log-likelihood. 
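The simplest non-trivial case, worked out here as an illustration (the symbol $\rho$ is introduced only for this example, and logarithms are taken to be natural), is a cluster of two objects with overlap $\vec\xi_1\cdot\vec\xi_2=\rho\ge 0$. Then $n_s=2$ and $c_s=2+2\rho$, so that
\[
g_s=\sqrt{\frac{2\rho}{2}}=\sqrt{\rho},\qquad
\frac{1}{2}\left[\log\frac{2}{2+2\rho}+\log\frac{2}{2-2\rho}\right]
=-\frac{1}{2}\log\left(1-\rho^2\right).
\]
The contribution vanishes for $\rho=0$, equals $\approx 0.14$ for $\rho=1/2$ and diverges as $\rho\to 1$, i.e. when the two objects become identical.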
508: 
509: Several algorithms for finding an approximate maximum of ${\cal L}_c$
510: over the space of cluster structures ${\cal S}$ have been discussed in
511: Ref. \cite{mldc}. We used both hierarchical clustering and simulated
512: annealing algorithms, which yield quite similar results (the codes are
513: available on the Internet \cite{data_set}).
514: 
515: Figures \ref{figassets} and \ref{figdays} are a graphic representation
516: of the hierarchical clustering algorithm: It starts from $N$ clusters
517: composed of a single object and it produces a sequence of cluster
518: structures. At each iteration, two clusters of the configuration with
519: $K$ clusters are merged so that the log-likelihood of the resulting
520: configuration with $K-1$ clusters is maximal. This procedure starts
521: with $K=N$ and it stops with $K=1$, when a single cluster is
522: formed. The log-likelihood of the cluster structure is ${\cal L}_c=0$
523: when $K=N$, it increases as $K$ decreases and it reaches a maximum for an
524: intermediate value of $K$. Then it decreases again and reaches ${\cal
525: L}_c=0$ when $K=1$, because of data normalization. 
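The role of data normalization at $K=1$ can be seen explicitly. This short check is added for illustration; it is written for the case where objects are assets and it assumes $\sum_i x_i(t)=0$ for every day $t$, as the normalization of Ref. \cite{norma} is meant to enforce. For the single cluster containing all $N$ objects,
\[
c_s=\sum_{i,j=1}^N\vec\xi_i\cdot\vec\xi_j=\Big\|\sum_{i=1}^N\vec\xi_i\Big\|^2
=\frac{1}{T}\sum_{t=1}^T\Big(\sum_{i=1}^N x_i(t)\Big)^2=0 .
\]
Since $c_s<n_s$, the optimal coupling is $g_s=0$: the cluster has no common component and does not contribute to the log-likelihood.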
526: 
527: The graphs report the log-likelihood of each cluster on the $y$ axis.
528: The initial configuration corresponds to $N$ points aligned on the $x$
529: axis (zero log-likelihood). Each merge operation is represented
530: graphically by a link between the merging clusters and the new
531: cluster. Hence as the log-likelihood increases structures above the
532: $x$ axis start to form. Red links are merging steps which increase the
533: log-likelihood. Blue links correspond to situations where the
534: log-likelihood of the union of the clusters is larger than that of
535: each part but smaller than their sum (hence the total
536: log-likelihood decreases). Hence statistically relevant clusters
537: appear as the large red structures in the plot.
538: 
539: \begin{thebibliography}{99}
540: 
541: \bibitem{Mandelbrot1} B.B. Mandelbrot, {\em Fractals and
542: Scaling in Finance}, Springer-Verlag (New York, 1997).
543: 
544: \bibitem{MantegnaStanley} R.N. Mantegna and H.E. Stanley, {\em Introduction
545: to Econophysics: Correlations and Complexity in Finance}, Cambridge
546: Univ. Press (Cambridge UK, 1999). 
547: 
548: \bibitem{BouchaudPotters} J.-P. Bouchaud and M. Potters, {\em Theory of
549: Financial Risk: From Statistical Physics to Risk Management},
550: Cambridge Univ. Press (Cambridge UK, 2000).
551: 
552: \bibitem{Farmer} J.D. Farmer, {\em Physicists Attempt to Scale the Ivory
553: Towers of Finance}, Computing in Science and Engineering (IEEE)
554: {\bf 1}, 26 (1999).
555: 
556: \bibitem{duffie} J.Y. Campbell, A.W. Lo, and A.C. MacKinlay, {\em The
557: Econometrics of Financial Markets}, Princeton Univ. Press (Princeton
558: N.J., 1997).
559: 
560: \bibitem{Mandelbrot2} B.B. Mandelbrot, {\em The Variation of Certain
561: Speculative Prices}, J. Business {\bf 36}, 394 (1963).
562: 
563: \bibitem{multisc} S. Ghashghaie et al., {\em Turbulent Cascades in Foreign
564: Exchange Markets}, Nature, {\bf 381}, 767 (1996).
565: 
566: \bibitem{MEM} D. Challet, M. Marsili and Y.-C. Zhang, {\em Modeling
567: market mechanism with minority game}, Physica A {\bf 276}, 284
568: (2000).
569: 
570: \bibitem{CCMZ} D. Challet et al., {\em From Minority Games to real
571: markets}, Quantitative Finance {\bf 1}, 168 (2001).
572: 
573: \bibitem{CMZ01} D. Challet, M. Marsili and Y.-C. Zhang, {\em Stylized
574: facts of financial markets and market crashes in Minority Games},
575: Physica A {\bf 294}, 514 (2001).
576: 
577: \bibitem{BMRZ} J. Berg et al., {\em Statistical mechanics of asset
578: markets with private information}, Quantitative Finance {\bf 1}, 203
579: (2001).
580: 
581: \bibitem{Focus} L. Laloux et al., {\em Noise Dressing of Financial
582: Correlation Matrices}, Phys. Rev. Lett. {\bf 83}, 1467 (1999);
583: V. Plerou et al., {\em Universal and Nonuniversal Properties
584:             of Cross Correlations in Financial Time
585:             Series}, {\em ibid.} 1471.
586: 
587: \bibitem{GM} L. Giada, M. Marsili, {\em Data clustering and noise
588: undressing of correlation matrices}, Phys. Rev. E {\bf 63}, 061101
589: (2001).
590: 
591: \bibitem{Mantegna} R.N. Mantegna, {\em Hierarchical structure in
592: financial markets}, Eur. Phys. J. B {\bf 11}, 193 (1999).
593: 
594: \bibitem{Gopiport} P. Gopikrishnan et al., {\em Quantifying and
595: interpreting collective behavior in financial markets}, Phys. Rev. E
596: {\bf 64}, 035106 (2001).
597: 
598: \bibitem{mldc} L. Giada, M. Marsili, {\em Algorithms of maximum
599: likelihood data clustering with applications}, e-print cond-mat/0204008
600: (2002).
601: 
602: \bibitem{data_set} The data set was made available by courtesy of
603: R. N. Mantegna. The ticker symbols of the subset of assets considered,
604: the detailed cluster structures of sectors and states, and other
605: information are available at {\tt
606: http://www.sissa.it/dataclustering/fin/}.
607: 
608: \bibitem{norma} Let $x_i^{(0)}(t)=\log p_i^{\rm open}(t)/p_i^{\rm
609: close}(t)$ be the return of asset $i=1,\ldots, A$ on day
610: $t=1,\ldots,T$. We set 
611: \begin{eqnarray*}
612: x_i^{(2k+1)}(t)&=&\frac{x_i^{(2k)}(t)-\langle x_i^{(2k)}\rangle}
613: {\sqrt{\langle(x_i^{(2k)}-\langle x_i^{(2k)}\rangle)^2\rangle}}\\
614: x_i^{(2k+2)}(t)&=&\frac{x_i^{(2k+1)}(t)-\overline{x_i^{(2k+1)}}}
615: {\sqrt{\overline{(x_i^{(2k+1)}-\overline{x_i^{(2k+1)}})^2}}}
616: \end{eqnarray*} 
617: where $\langle\ldots\rangle=\sum_{t=1}^T(\ldots)/T$ is
618: a time average and
619: $\overline{(\ldots)}=\sum_{i=1}^A(\ldots)/A$ denotes the
620: average over assets. As in M. B. Eisen et al., 
621: %{\em Cluster analysis and display of genome-wide expression
622: %patterns}, 
623: [Proc. Natl. Acad. Sci. USA {\bf 95}, 14863 (1998)], the normalized
624: data $x_i(t)$ is obtained as the limit of $x_i^{(n)}(t)$ as
625: $n\to\infty$. In practice the iteration was stopped after a given
626: accuracy was reached. This procedure does not significantly affect the
627: results. Indeed the first step of normalization eliminates most of the
628: global patterns. For missing values we assumed $x_i(t)=0$ if asset $i$
629: was not traded on day $t$.
630: 
631: \bibitem{stability} In order to assess the stability of the results we
632: repeated the classification of days for the first (from Jan. '90 to
633: Aug. '94) and the second (Sep. '94 to Apr. '99) halves of the time
634: series. We found dendrograms quite similar to those in
635: Fig. \ref{figdays} with two main dominant states. Clustering again
636: days into $6$ states, we found two new sequences
637: $\omega_{<}(t)$ for $t=1,\ldots,T/2$ and $\omega_{>}(t)$ for
638: $t=T/2+1,\ldots,T$. We found that $\omega_{<}(t)=\omega(t)$ in $73$\%
639: of cases and $\omega_{>}(t)=\omega(t)$ in $82$\% of cases, where
640: $\omega(t)$ is the state occurring on day $t$ according to the
641: analysis of the whole time series.
642: 
643: \bibitem{CM} D. Challet, M. Marsili, {\em Phase transition and symmetry
644: breaking in the minority game}, Phys. Rev. E {\bf 60}, R6271 (1999).
645: 
646: \bibitem{CMZ} D. Challet, M. Marsili, R. Zecchina, {\em Statistical
647: mechanics of systems with heterogeneous agents: Minority games},
648: Phys. Rev. Lett. {\bf 84}, 1824 (2000).
649: 
650: \bibitem{Omori} F. Lillo, R.N. Mantegna, {\em Omori law after a
651: financial market crash}, e-print cond-mat/0111257 (to appear in Physica A).
652: 
653: \end{thebibliography}
654: 
655: \end{multicols}  
656: 
657: \end{document}
658: