0207:cond-mat0207156/day.tex

1: %\documentstyle[aps,graphics,multicol,psfig,epsfig,preprint,color]{revtex}

2: \documentstyle[aps,graphics,multicol,psfig,epsfig,color]{revtex}

3:

4: \begin{document}

5:

6: \title{Dissecting financial markets: Sectors and states}

7:

8: \author{Matteo Marsili}

9:

10: \address{Abdus

11: Salam International Center for Theoretical Physics, Strada Costiera 11,

12: 34014 Trieste, Italy\\

13: and\\

14: Istituto Nazionale per la Fisica della Materia (INFM),

15: Unit\`a Trieste SISSA, Via Beirut 2-4, 34014 Trieste, Italy}

16:

17: \date{\today}

18:

19: \maketitle

20:

21: \begin{abstract}

22: By analyzing a large data set of daily returns with a data clustering

23: technique, we identify economic sectors as clusters of assets with

24: similar economic dynamics. The sector size distribution follows Zipf's

25: law. Secondly, we find that patterns of daily market-wide economic

26: activity cluster into classes that can be identified with market

27: states. The distribution of frequencies of market states shows

28: scale-free properties and the memory of the market state process

29: extends to long times ($\sim 50$ days). Assets in the same sector

30: behave similarly across states. We characterize market efficiency by

31: analyzing the market's predictability and find that the market is indeed

32: close to being efficient. We find evidence of the existence of a dynamic

33: pattern after market crashes.

34: \end{abstract}

35:

36: \pacs{PACS numbers: 05.40.-a, 05.20.Dd, 64.60.Ht, 87.23.Ge}

37:

38: \begin{multicols}{2}

39: \narrowtext

40:

41: \section{Introduction}

42:

43: Thanks to the availability of massive flows of financial data,

44: theoretical insights on financial markets can nowadays be tested to an

45: unprecedented precision in socio-economic systems. This poses a

46: challenge which has attracted natural scientists who have pioneered an

47: {\em empirical} approach to financial fluctuations

48: \cite{Mandelbrot1,MantegnaStanley,BouchaudPotters} independent of the

49: econometric approach and often in contrast with the {\em axiomatic}

50: approach of theoretical finance \cite{Farmer,duffie}.

51:

52: The empirical evidence depicts financial markets as complex

53: self-organizing critical systems: The statistics of real market

54: returns deviate considerably from the Olympic Gaussian world described

55: by Louis Bachelier at the turn of the last century. Rather, Mandelbrot

56: \cite{Mandelbrot2} observed that fractal (L\'evy) statistics gives a

57: closer approximation, even though that is not a satisfactory

58: model \cite{Mandelbrot1,MantegnaStanley}. Market returns display

59: scaling \cite{MantegnaStanley} and long range volatility correlations,

60: and evidence of multiscaling \cite{multisc} has also

61: been discussed. Such features evoke the theory of critical phenomena

62: in physics, which explains how quite similar features may emerge from

63: the interaction of many microscopic degrees of freedom and statistical

64: laws. Indeed financial markets {\em are} systems of many interacting

65: degrees of freedom (the traders) and there are very good theoretical

66: reasons to expect that they operate rather close to criticality

67: \cite{MEM}. These expectations have been substantiated by microscopic

68: agent based market models\cite{CCMZ,CMZ01,BMRZ}: The picture offered

69: by these {\em synthetic markets} is one where speculation drives

70: the market to information efficiency -- i.e. to a point where market

71: returns are unpredictable. But the point where markets become exactly

72: efficient is the locus of a {\em phase transition}. Close to the phase

73: transition the behavior of synthetic markets is characterized by the

74: observed stylized facts -- fat tails and long range correlations --

75: whereas far from the critical region the market is well described in

76: terms of random walks (see Ref. \cite{CCMZ} for a non technical

77: discussion).

78:

79: Work has however been mostly confined to single assets or

80: indices. Recently ensembles of assets and their correlations have

81: become the focus of quite intense interest. Studies have addressed the role of

82: random matrix theory as a tool for understanding how

83: noise dresses financial correlations \cite{Focus}, how one can undress

84: them \cite{GM}, how clustering techniques can help in understanding the

85: structure of correlations \cite{Mantegna}, and the impact of such

86: considerations on portfolio optimization \cite{Gopiport}.

87:

88: Here we report findings that strongly support the view of a

89: self-organized critical market. We show that long range correlations

90: and scale invariance extend both across assets and, in the behavior

91: of the ensemble of assets, across frequencies. More precisely, we

92: apply a novel parameter free data clustering method \cite{GM,mldc} to a

93: large financial data set \cite{data_set} in order to uncover the

94: internal structure of correlations both across different assets and

95: across different days. We identify statistically significant

96: classifications of assets in correlated {\em sectors} and of daily

97: profiles of market-wide activity in market {\em states}. The

98: statistics of both sector sizes and state sizes show scale free

99: properties.

100:

101: Determining the market's states is an important achievement both

102: theoretically and practically: The concept of a state which codifies

103: all relevant economic information is the basis of many theoretical

104: models of financial markets. But in practice traders experience

105: every day a quite different reality: The market place is

106: flooded with massive flows of information of which it may be hard to

107: say what is relevant and what is irrelevant. It is by no means obvious

108: that something like market states exists at all and even if they exist

109: the problem becomes that of identifying them. Our aim is to give a

110: practical answer to these questions. We shall keep our discussion as

111: simple as possible, relegating technical details to notes and to the

112: appendix.

113:

114: \section{The method and the data set}

115:

116: The data clustering method that we use has been recently proposed in

117: Ref. \cite{GM}. In brief, it is based on the simple statistical

118: hypothesis that {\em similar objects have something in common}.

119: It is possible to compute the likelihood that a given data set

120: satisfies this hypothesis and hence to look for the most likely

121: cluster structure. A precise definition is given in the appendix and

122: for more details we refer the interested reader to

123: Refs. \cite{GM,mldc}.  Let us only mention that this method overcomes

124: several limitations of traditional data clustering approaches, such as the

125: need to pre-define a metric, fix {\em a priori} the number of

126: clusters or tune the value of other parameters \cite{mldc}.

127:

128: The data set covers a period from 1st January 1990 to 30th of April

129: 1999 and it reports daily prices (open, high, low, close) for $7679$

130: assets traded on the New York Stock Exchange \cite{data_set}.

131: The number of assets actually traded varies with time. Hence we mainly

132: focus on a subset of the $2000$ most actively traded assets (see

133: {\tt http://www.sissa.it/dataclustering/fin/} for the detailed list

134: of assets considered, as well as for further information).

135:

136: Our goal is to investigate the {\em internal} structure of

137: correlations; hence we first normalize the raw data \cite{norma} in

138: order to eliminate common trends and patterns both across assets and

139: across different days.

140: %More precisely, if $x_i(t)$ is the normalized daily return of asset

141: %$i=1,\ldots,N$ in day $t=1,\ldots,T$, we have

142: %\begin{eqnarray*}

143: %\sum_{i=1}^N x_i(t)=0,&~&~~\sum_{i=1}^N x_i(t)^2=N~~~\forall t\\

144: %\sum_{t=1}^T x_i(t)=0,&~&~~\sum_{t=1}^T x_i(t)^2=T~~~\forall i

145: %\end{eqnarray*}

146: %hold simultaneously \cite{norma}.

147: This procedure eliminates for example the so-called ``market mode'',

148: i.e. the constant correlation of individual assets' returns with the

149: so-called ``market's return''.

150:

151: \section{Market sectors: Scale free market structure}

152:

153: We first apply data clustering to group assets with similar economic

154: dynamics in sectors of {\em correlated} assets (see appendix). This

155: classification reveals a rich structure. The clusters giving the

156: largest contributions to the log-likelihood clearly emerge from the

157: noisy background in Fig. \ref{figassets}. We find a large overlap with

158: the sectors of economic activity defined by the Standard Industrial

159: Classification (SIC) codes (see caption of Fig. \ref{figassets}). But

160: we also find significant correlations between assets with widely

161: different SIC codes. This has practical relevance for risk management of

162: large portfolios which cannot be handled all at once. Indeed rather

163: than splitting the problem according to economic sectors (defined by

164: the SIC) it is preferable to use our classification in correlated

165: sectors. The difference between the two classifications is also revealed by

166: a Zipf plot of the size of each sector against its rank (see inset of

167: Fig. \ref{figassets}). The distribution of correlated sector sizes

168: follows Zipf's law to a high accuracy, i.e. the number ${\cal N}(n)$

169: of sectors with more than $n$ firms (i.e. of size larger than $n$) is

170: inversely proportional to $n$. Note that the scale free distribution

171: of sector sizes is not due to an analogous property of {\em

172: fundamentals}. Indeed the rank plot of economic sector sizes bends in

173: log-log scale. This suggests that Zipf's law arises as a dynamical

174: consequence of market interaction.
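
As a brief illustration of why Zipf's law appears as a straight line of slope $-1$ in a rank plot (a sketch under the idealized assumption that ${\cal N}(n)=C/n$ holds exactly, with $C$ a constant introduced here only for illustration): the sector of rank $r$ has size $n_r$ such that about $r$ sectors are at least as large, so that
\[
r\simeq{\cal N}(n_r)=\frac{C}{n_r}
\quad\Longrightarrow\quad
n_r\simeq\frac{C}{r},
\qquad
\log n_r\simeq\log C-\log r .
\]
Doubling the rank thus halves the expected sector size.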

175:

176: The scale invariant behavior is robust with respect to the subset of

177: assets taken: The same behavior is found considering the

178: $1000$, $2000$ or $4000$ most actively traded assets in that

179: period, or the $443$ assets in the S\&P500 index (see

180: Ref. \cite{GM}). In addition we find, as in Ref. \cite{GM}, that the

181: correlation $c_s$ inside sector $s$ (see appendix) scales with its

182: size $n_s$ with a law $c_s\sim n_s^\gamma$ with $\gamma\simeq 1.66$.

183:

184: \begin{figure}

185: \centerline{\psfig{figure=ass.eps,width=8cm}}

186: \caption{Dendrogram of the cluster structure of correlated sectors

187: resulting from the hierarchical clustering algorithm. Assets are reported

188: along the horizontal axis and red shapes correspond to clusters of

189: correlated assets. The height of a shape is the contribution to the

190: log-likelihood of the corresponding cluster of assets. See the

191: appendix for more details. The cluster structure

192: is statistically significant because the noise level corresponding to

193: uncorrelated data would show structures with a log-likelihood of at

194: most $0.1$, three orders of magnitude smaller.  The classification in

195: sectors has a large overlap with economic sectors. For example,

196: clusters 1 and 2 contain firms in the electric sector and computers

197: respectively. Cluster 4 is the sector of gold, 5 is composed of banks,

198: 8 contains oil and gas firms, 9 petroleum. Clusters 3, 6 and 7 are

199: mixed clusters (more details are available at

200: {\tt http://www.sissa.it/dataclustering/fin/}). Inset: Distribution of

201: correlated sector sizes for $2000$ ($\bullet$) and $4000$ ($\Box$)

202: assets. The distribution of the size of economic sectors ($\circ$), as

203: defined by the (first two digits of the) SIC codes, for the same

204: $4000$ assets is shown for comparison. The line (drawn as a guide to

205: the eyes) has slope $-1$.}

206: \label{figassets}

207: \end{figure}

208:

209: We finally remark that this property is not an artifact of the

210: method. Indeed the distribution of eigenvalues of the correlation

211: matrix shows a similar broad distribution, even though that is

212: affected by considerable noise dressing \cite{Focus}. A factor model

213: which takes into account a large enough number of principal components

214: (corresponding to the largest eigenvalues) reproduces the same

215: features\footnote{In our case $\approx 30$ eigenvalues of the

216: correlation matrix are significantly outside the noise band predicted

217: by Random Matrix Theory \cite{Focus}. With a correlation matrix which

218: retains the structure of the first $\sim 20$ principal components

219: (considering the remaining components as uncorrelated noise) we found

220: a quite similar cluster structure.}.

221:

222: \section{Market states}

223:

224: Are there well defined patterns of daily market-wide economic

225: performance? In order to answer this question, rather than classifying

226: assets according to their temporal evolution, we can classify days

227: according to the performance of different assets. Fig. \ref{figdays}

228: implies that, above a noisy background, a meaningful classification of

229: the daily profiles of market activity exists. Clusters of days can be

230: identified with different patterns of market wide activity -- or

231: market states. Quite remarkably, the maximum likelihood classification

232: in market states shows scale free features, for large clusters

233: (frequent patterns of market activity). The number of patterns which

234: occur on more than $d$ days behaves as ${\cal N}(d)\sim d^{-1.5}$ for the

235: most frequent patterns (inset top). There is a clear crossover in the

236: plot of cluster's correlation versus cluster size which distinguishes

237: the meaningful clusters (patterns) from a random noise background

238: (inset bottom).

239:

240: \begin{figure}

241: \centerline{\psfig{figure=day.eps,width=8cm}}

242: \caption{Same plot as Fig. \ref{figassets} for days: Clusters of days

243: identify market states. We identify states (see labels) as groups of

244: correlated clusters of days.  Inset: Distribution of cluster sizes,

245: i.e. of the frequency with which states occur (top) and correlation

246: $c_s$ inside each cluster (bottom).}

247: \label{figdays}

248: \end{figure}

249:

250: From a sample of $2000$ assets over $T=2358$ days we identify $5$

251: different states -- characterized by similar profiles of market

252: activity -- plus a sixth random state (see

253: Fig. \ref{figdays}). We assign to each day $t$ an integer $\omega(t)$ between $1$ and

254: $6$, namely the state that occurred on that day.

255:

256: We are then in a position to analyze market performance in different

257: states. Fig. \ref{figcross} shows the (non normalized) average daily

258: returns of different assets in different states. We find that the market's

259: behavior in states 1 and 2 is anti-correlated: Those assets which go

260: up in state 1 go down in state 2, on average. Fig. \ref{figcross} also

261: shows that assets in the same sector as defined above have a similar

262: behavior. So, for example, while most of the assets go up in state 1

263: and down in state 2, the cluster of assets of Gold and Silver mining

264: has an opposite behavior. State 3 is clearly characterized by a fall

265: of High-tech companies and a mild rise in the electric sector. An

266: opposite behavior takes place in state 4, whereas state 5 is dominated

267: by a marked rise of Oil \& Gas, and Petroleum refining companies

268: \cite{data_set}.

269:

270: These results are remarkably stable with respect to the definition of

271: the time window where the analysis is performed \cite{stability}.

272:

273: \begin{figure}

274: \centerline{\psfig{figure=cross.eps,width=8cm}}

275: \caption{Performance of the market in different states. Each asset $i$

276: corresponds to a point whose coordinates are the average returns

277: $(\langle{r_i|\omega}\rangle,\langle{r_i|\omega'}\rangle)$ of asset

278: $i$ in states $\omega$ and $\omega'$. Assets in different sectors are

279: plotted differently.}

280: \label{figcross}

281: \end{figure}

282:

283: \subsection{Predictability and market efficiency}

284:

285: Clustering the market's dynamics leaves us with the sequence

286: $\omega(t)$ of the states of the market in different days

287: $t=1,\ldots,T$. This allows us to pose interesting questions on

288: predictability and market's information efficiency.

289:

290: Let us first ask: Is it possible to predict the

291: state $\omega'$ of the market tomorrow, given the state $\omega$ of

292: the market today? In order to answer this question we estimate the

293: probability

294: \[

295: P_1(\omega'|\omega)=\frac{\sum_{t=1}^{T-1}

296: \delta_{\omega(t),\omega}\delta_{\omega(t+1),\omega'}}

297: {\sum_{t=1}^{T-1}

298: \delta_{\omega(t),\omega}}

299: \]

300: of transition from state $\omega$ to state $\omega'$. It turns out

301: that both the classification in states and the transition matrix

302: $P_1(\omega'|\omega)$ are very stable with respect to the definition of

303: the time window \cite{stability}. This means that they both vary very

304: slowly in time. Hence we shall neglect their variation in time

305: henceforth.

306:

307: {\em If} the process $\omega(t)$ were Markovian, its predictability

308: could be quantified by the characteristic time $\tau$ of convergence

309: to the stationary state. This is related to the second largest

310: (in absolute value) eigenvalue $\lambda$ of the matrix

311: $P_1(\omega'|\omega)$ by $\tau=-1/\log|\lambda|$. We find $\tau\approx

312: 0.54$ days -- a value which would occur by chance, if there were no

313: correlations, in one out of $10^7$ cases\footnote{This conclusion was

314: reached considering the characteristic times $\tau$ for symbolic

315: sequences $\tilde\omega(t)$ generated by randomly reshuffling

316: days. These times are distributed around $\tau\approx

317: 0.33$ with a spread $\delta\tau\approx 0.04$. The analysis of the tail

318: of the distribution allows us to estimate the likelihood of $\tau\simeq

319: 0.54$ for the real sequence.}.  Hence statistical prediction is possible.
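
As a rough worked illustration of these numbers (a sketch, assuming that $\log$ denotes the natural logarithm), inverting $\tau=-1/\log|\lambda|$ gives
\[
|\lambda|=e^{-1/\tau}\approx e^{-1.85}\approx 0.16~~(\tau\approx 0.54),
\qquad
|\lambda|\approx e^{-3.03}\approx 0.05~~(\tau\approx 0.33).
\]
The value found for the real sequence lies $(0.54-0.33)/0.04\approx 5$ spreads above the mean of the reshuffled sequences. If one assumes, purely for illustration, a Gaussian tail for the reshuffled distribution, a deviation of this size has a probability of order $10^{-7}$, in line with the estimate quoted above.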

320:

321: Can we predict market's returns on the basis of these results?

322: Fig. \ref{figcross} shows that average returns $\langle

323: r_i(t)|\omega(t)\rangle$ conditional on the state $\omega(t)$ of the

324: market contain non-trivial information. However this information is

325: not available for trading in day $t$. But if we know the transition

326: matrix $P_1(\omega'|\omega)$ we can estimate the expected return of

327: asset $i$ tomorrow given the state $\omega$ today:

328: \[

329: \langle r_i(t+1)|\omega(t)\rangle=\sum_{\omega'}

330: \langle r_i(t+1)|\omega(t+1)=\omega'\rangle P_1(\omega'|\omega(t)).

331: \]

332: A natural measure of predictability, inspired by works on theoretical

333: models \cite{CM,CMZ,CCMZ,BMRZ}, is the averaged signal-to-noise ratio

334: defined as:

335: \[

336: H_i(t'|t)=

337: \sqrt{\sum_\omega \rho_\omega

338: \frac{\langle \delta r_i(t')|\omega(t)=\omega\rangle^2}

339: {\langle\delta r_i^2|\omega\rangle}}

340: \]

341: where $\delta r_i(t)=r_i(t)-\langle r_i\rangle$ and $\rho_\omega$ is

342: the frequency with which state $\omega$ occurs.  The distribution of

343: $H_i$ across assets is shown in Fig. \ref{figPH} for $t'=t$, $t'=t+1$

344: and $t'=t+\infty$.  The latter gives a benchmark of the

345: background noise level. We find $H_i(t|t)\gg H_i(t+\infty|t)$ for

346: several assets $i$: the knowledge of $\omega(t)$ {\em before} day $t$

347: would provide significant predictive power on excess returns. That same

348: information is much less useful the day after, since $H_i(t+1|t)$ is

349: only slightly above the noise level. This is a further indication that

350: the financial market is close to information efficiency, but not quite

351: unpredictable. In reality the transition matrix $P_1(\omega'|\omega)$

352: changes slowly in time. Hence this conclusion provides an ``upper

353: bound'' for the market's predictability (when measured out-of-sample):

354: Real markets are therefore even closer to efficiency.

355:

356: If $\omega(t)$ were a Markov process, the characteristic time $\tau_k$

357: for transitions $\omega(t)\to\omega(t+k)$ over $k$

358: days\footnote{$\tau_k$ is computed in the same way as $\tau=\tau_1$

359: above, from the matrix $P_k(\omega'|\omega)$ of transition

360: probabilities $\omega(t)=\omega\to\omega(t+k)=\omega'$ in $k$ days.

361: For a Markov process this matrix is the $k^{\rm th}$ power of the

362: matrix $P_1(\omega'|\omega)$ and its eigenvalues are given by

363: $\lambda_k=\lambda_1^k$.}

364: should decrease with $k$ as $\tau_k=\tau_1/k$. A prediction of the

365: future state of the market, which is significantly better than a

366: random draw, would only be possible on a time horizon of one day, if

367: the process were Markovian. The inset of Fig. \ref{figPH} shows that

368: $\tau_k$ remains significantly above the noise level almost up to

369: $k\approx 100$ days!  This means that $\omega(t)$ carries significant

370: information about the future state $\omega(t+k)$ of the market, even

371: after $k\approx 50$ days. The slow decay of $\tau_k$ is a further

372: signature of the presence of long range correlations.
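
The Markov benchmark used above follows in one line from the eigenvalue relation $\lambda_k=\lambda_1^k$ (a sketch, assuming natural logarithms):
\[
\tau_k=-\frac{1}{\log|\lambda_k|}
=-\frac{1}{k\log|\lambda_1|}
=\frac{\tau_1}{k}.
\]
With $\tau_1\approx 0.54$ this gives $\tau_2\approx 0.27$ and $\tau_3\approx 0.18$. If the noise level for $k>1$ is comparable to the value $\tau\approx 0.33$ found for reshuffled sequences at $k=1$ (an assumption made here only for illustration), a Markov process would already fall below the noise level at $k=2$.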

373:

374: \begin{figure}

375: \centerline{\psfig{figure=PH.eps,width=8cm}}

376: \caption{Distribution of predictability $H_i(t'|t)$ for $t'=t,~t+1$

377: and $t+\infty$. The noise background predictability $H_i(t+\infty|t)$

378: is estimated drawing $\omega(t +\infty)$ at random from the

379: populations of states.  Inset: Characteristic times $\tau_k$ for

380: transitions over $k$ days for the real sequence $\omega(t)$

381: ($\bullet$), a random sequence ($+$) and a Markov chain sequence

382: ($\circ$) generated with the transition probability

383: $P_1(\omega'|\omega)$ estimated from $\omega(t)$. The random sequence

384: ($+$) represents the noise background. For a Markov chain $\tau_k$

385: ($\circ$) is significantly above the noise level only for $k=1$.

386: For the real market process $\tau_k$ is well above the noise level

387: up to $k\approx 50$.}

388: \label{figPH}

389: \end{figure}

390:

391: During the period we have studied, two major extreme events occurred:

392: the 27 October 1997 and the 31 August 1998 crashes.  The state process

393: $\omega(t)$ differs before the two crashes, but is quite similar after

394: them. The strings of states, starting from the day of the crash, read

395: $2136613611\ldots$ and $2126614633\ldots$ in the two cases. This is a

396: significant similarity\footnote{Only two other strings of the type

397: $21x661$ occurred in the process but the starting days were Fridays

398: (90/04/27 and 90/05/25) and not Mondays. Note furthermore that

399: normalization \cite{norma} removes the collective component of the

400: dynamics and it ensures that crash days appear with the same weight as

401: normal days in the analysis.}. This suggests the existence of a

402: particular dynamical pattern with which markets respond to extreme

403: events (see also Ref. \cite{Omori} on this).

404:

405: \section{Conclusion and outlook}

406:

407: In conclusion we have shown that both the {\em horizontal} clustering of

408: assets in correlated sectors and the {\em vertical} classification of

409: market-wide economic performance in market states reveal a scale free

410: structure (see Figs. \ref{figassets}, \ref{figdays}). The emergent

411: picture poses quite severe constraints on multi-asset agent based

412: modeling, which we believe will disclose important information on how

413: real markets work. This expectation is based on the fact that

414: scale-free statistical behavior is a signature of interaction

415: mechanisms which is rather insensitive to microscopic details.

416:

417: Furthermore, the identification of market states allows us to

418: precisely quantify informational efficiency by computing the market's

419: predictability, thereby establishing a direct contact between the

420: empirical world and the realm of theoretical models. In particular we

421: find that, as expected, markets are close to information efficiency.

422:

423: We find that correlated sectors have a large overlap with sectors of

424: economic activity. In the same way, it would be interesting to

425: understand how states are correlated with economic information and the

426: news arrival process.

427:

428: In a wider context, we have discussed an unsupervised approach to the

429: study of a complex system. Be it a stock market, the world economy,

430: an urban traffic network, a cell of a living organism or the immune

431: system, the complex system can be considered as a black box.  We show

432: how a series of simultaneous measurements at many different ``points''

433: of the system allows one to identify its {\em parts} and its {\em

434: states}.

435:

436: A black box approach to a financial market or to a cell, which

437: neglects all of economics and finance or of biology and genetics and

438: relies only on empirical data, may lead to misleading results

439: especially if the data set is incomplete. Still, we believe, it has the

440: potential of uncovering collective aspects which can hardly be derived

441: in a theoretical bottom-up approach.

442:

443: \appendix

444: \section{Maximum likelihood data clustering}

445: \label{mldc}

446:

447: Consider a set of $N$ objects each of which is defined in terms of $D$

448: measurable features, so that each object is represented by a vector

449: $\vec \xi_i\in R^D$, $i=1,\ldots,N$. We assume for simplicity that data

450: are normalized: $\vec \xi_i\cdot \vec e=0$ where $\vec e=(1,1,\ldots,1)$

451: and $\|\vec\xi_i\|^2=\vec \xi_i\cdot\vec \xi_i=1$.

452:

453: In our case, when identifying sectors, the objects are assets and

454: $N=A$, the number of assets. Their features are the daily returns in

455: each day $t$ and $D=T$. The $t^{\rm th}$ component of $\vec \xi_i$ is

456: $x_i(t)/\sqrt{T}$. When identifying states, instead, objects are days

457: and features are assets (i.e. $N=T$ and $D=A$). The $i^{\rm th}$

458: component of $\vec \xi_t$ is $x_i(t)/\sqrt{A}$.

459:

460: The problem of classifying  $N$ objects into different classes

461: goes under the name of data clustering.  Naively one would like to

462: have similar objects classified in the same cluster, but in practice

463: one faces a number of problems: What does ``similar'' mean?  What is

464: the ``right'' number of clusters?  Which principle should one follow?  We

465: resort to a recent data clustering technique \cite{GM,mldc} based on

466: the maximum likelihood principle and a simple statistical hypothesis:

467: {\em similar objects have something in common}. In mathematical terms,

468: we let $s_i$ be the label of the cluster to which object $i$ belongs,

469: and $A_s=\{i:~s_i=s\}$ be the set of objects with $s_i=s$. We assume

470: that

471: \begin{equation}

472: \vec \xi_i = g_{s_i}\vec\eta_{s_i}+\sqrt{1-g_{s_i}^2}\vec\epsilon_i.

473: \label{ansatz}

474: \end{equation}

475: Here $\vec \eta_s$ denotes the {\em common} component shared by all

476: objects $i\in A_s$ and $g_s\ge 0$ weights the common component against

477: the individual one $\vec\epsilon_i$. Eq. (\ref{ansatz}) is the

478: statistical hypothesis where $g_s$ and $s_i$ are the parameters to be

479: fitted. Assuming further that both $\vec \eta_s$ and $\vec \epsilon_i$

480: are Gaussian vectors in $R^D$, with zero average and unit variance

481: ($E[\|\eta_s\|^2]=E[\|\epsilon_i\|^2]=1$) makes it possible to compute

482: the likelihood of the parameters ${\cal G}=\{g_s\}$ and ${\cal

483: S}=\{s_i\}$ (see Ref. \cite{GM} for details). The likelihood is

484: maximal when

485: \begin{equation}

486: g_s=\sqrt{\max\left[0,\frac{c_s-n_s}{n_s^2-n_s}\right]}

487: %~~~\hbox{if $c_s\ge n_s$}

488: \end{equation}

489: %and $g_s=0$ otherwise,

490: where $n_s=|A_s|$ is the number of objects in

491: cluster $s$ and

492: \[

493: c_s=\sum_{i,j\in A_s} \vec \xi_i\cdot\vec \xi_j

494: \]

495: is the total correlation inside cluster $s$.

496: The maximum log-likelihood per feature takes the form

497: %\begin{equation}

498: \[

499: {\cal L}_c({\cal S})=\frac{1}{2}\sum_{s:~n_s>1}\max

500: \left[0,\log

501: \frac{n_s}{c_s}+(n_s-1)\log\frac{n_s^2-n_s}

502: {n_s^2-c_s}\right].

503: \]

504: %\end{equation}

505: Note that a cluster with a single isolated object ($n_s=c_s=1$), or a

506: cluster of uncorrelated objects ($c_s=n_s$) gives a vanishing

507: contribution to the log-likelihood.
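
As a consistency check of the two expressions above (a sketch using only the definitions given in this appendix), consider the two limiting cases. For uncorrelated objects, $\vec\xi_i\cdot\vec\xi_j=\delta_{ij}$, one has $c_s=n_s$, whereas for identical objects, $\vec\xi_i=\vec\xi_j$ for all $i,j\in A_s$, one has $c_s=n_s^2$. Then
\[
c_s=n_s:~~ g_s=0,~~
\log\frac{n_s}{n_s}+(n_s-1)\log\frac{n_s^2-n_s}{n_s^2-n_s}=0;
\qquad
c_s=n_s^2:~~ g_s=\sqrt{\frac{n_s^2-n_s}{n_s^2-n_s}}=1 .
\]
The weight $g_s$ of the common component therefore interpolates between $0$ (no shared component) and $1$ (objects that coincide with the common component).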

508:

509: Several algorithms for finding an approximate maximum of ${\cal L}_c$

510: over the space of cluster structures ${\cal S}$ have been discussed in

511: Ref. \cite{mldc}. We used both hierarchical clustering and simulated

512: annealing algorithms, which yield quite similar results (the codes are

513: available on the Internet \cite{data_set}).

514:

515: Figures \ref{figassets} and \ref{figdays} are a graphic representation

516: of the hierarchical clustering algorithm: It starts from $N$ clusters

517: composed of a single object and it produces a sequence of cluster

518: structures. At each iteration, two clusters of the configuration with

519: $K$ clusters are merged so that the log-likelihood of the resulting

520: configuration with $K-1$ clusters is maximal. This procedure starts

521: with $K=N$ and it stops with $K=1$, when a single cluster is

522: formed. The log-likelihood of the cluster structure is ${\cal L}_c=0$

523: when $K=N$, it increases as $K$ decreases and it reaches a maximum for an

524: intermediate value of $K$. Then it decreases again and reaches ${\cal

525: L}_c=0$ when $K=1$, because of data normalization.

526:

527: The graphs report the log-likelihood of each cluster on the $y$ axis.

528: The initial configuration corresponds to $N$ points aligned on the $x$

529: axis (zero log-likelihood). Each merge operation is represented

530: graphically by a link between the merging clusters and the new

531: cluster. Hence as the log-likelihood increases structures above the

532: $x$ axis start to form. Red links are merging steps which increase the

533: log-likelihood. Blue links correspond to situations where the

534: log-likelihood of the union of the clusters is larger than that of

535: each part but it is smaller than their sum (hence the total

536: log-likelihood decreases). Hence statistically relevant clusters

537: appear as the large red structures in the plot.

538:

539: \begin{thebibliography}{99}

540:

541: \bibitem{Mandelbrot1} Mandelbrot, B. B., {\em Fractals and

542: Scaling in Finance}, Springer-Verlag (New York 1997).

543:

544: \bibitem{MantegnaStanley} R.N. Mantegna and H.E. Stanley, {\em Introduction

545: to Econophysics: Correlations and Complexity in Finance}, Cambridge

546: Univ. Press (Cambridge UK, 1999).

547:

548: \bibitem{BouchaudPotters} J.-P. Bouchaud and M. Potters, {\em Theory of

549: Financial Risk: From Statistical Physics to Risk Management},

550: Cambridge Univ. Press (Cambridge UK, 2000).

551:

552: \bibitem{Farmer} J.D. Farmer, Physicists Attempt to Scale the Ivory

553: Towers of Finance , Computing in Science and Engineering (IEEE),

554: {\bf 1} 1999, 26-39.

555:

556: \bibitem{duffie} J.Y. Campbell, A.W. Lo, and A.C. MacKinlay, {\em The

557: Econometrics of Financial Markets}, Princeton Univ. Press (Princeton

558: N.J., 1997).

559:

560: \bibitem{Mandelbrot2} B.B. Mandelbrot, The Variation of Certain

561: Speculative Prices, J. Business, Vol. 36, 1963, pp. 394--419.

562:

563: \bibitem{multisc} S. Ghashghaie et al., {\em Turbulent Cascades in Foreign

564: Exchange Markets}, Nature, {\bf 381}, 767 (1996).

565:

566: \bibitem{MEM} D. Challet, M. Marsili and Y.-C. Zhang, {\em Modeling

567: market mechanism with minority game}, Physica A {\bf 276}, 284

568: (2000).

569:

570: \bibitem{CCMZ} D. Challet et al., {\em From Minority Games to real

571: markets}, Quantitative Finance {\bf 1}, 168 (2001).

572:

573: \bibitem{CMZ01} D. Challet, M. Marsili and Y.-C. Zhang, {\em Stylized

574: facts of financial markets and market crashes in Minority Games},

575: Physica A {\bf 294}, 514 (2001).

576:

577: \bibitem{BMRZ} J. Berg et al., {\em Statistical mechanics of asset

578: markets with private information}, Quantitative Finance {\bf 1}, 203

579: (2001).

580:

581: \bibitem{Focus} L. Laloux et al., {\em Noise Dressing of Financial

582: Correlation Matrices}, Phys. Rev. Lett. {\bf 83}, 1467 (1999);

583: V. Plerou et al., {\em Universal and Nonuniversal Properties

584: of Cross Correlations in Financial Time

585: Series}, {\em ibid.}, 1471.

586:

587: \bibitem{GM} L. Giada, M. Marsili, {\em Data clustering and noise

588: undressing of correlation matrices}, Phys. Rev. E {\bf 63}, 061101

589: (2001).

590:

591: \bibitem{Mantegna} R.N. Mantegna, {\em Hierarchical structure in

592: financial markets}, Eur. Phys. J. B {\bf 11}, 193 (1999).

593:

594: \bibitem{Gopiport} P. Gopikrishnan et al., {\em Quantifying and

595: interpreting collective behavior in financial markets}, Phys. Rev. E

596: {\bf 64}, 035106 (2001).

597:

598: \bibitem{mldc} L. Giada, M. Marsili, {\em Algorithms of maximum

599: likelihood data clustering with applications}, eprint cond-mat/0204008

600: (2002).

601:

602: \bibitem{data_set} The data set was made available by courtesy of

603: R. N. Mantegna. The ticker symbols of the subset of assets considered,

604: the detailed cluster structures of sectors and states and other

605: information are available at {\tt

606: http://www.sissa.it/dataclustering/fin/}.

607:

608: \bibitem{norma} Let $x_i^{(0)}(t)=\log p_i^{\rm open}(t)/p_i^{\rm

609: close}(t)$ be the return of asset $i=1,\ldots, A$ on day

610: $t=1,\ldots,T$. We set

611: \begin{eqnarray*}

612: x_i^{(2k+1)}(t)&=&\frac{x_i^{(2k)}(t)-\langle x_i^{(2k)}\rangle}

613: {\sqrt{\langle(x_i^{(2k)}-\langle x_i^{(2k)}\rangle)^2\rangle}}\\

614: x_i^{(2k+2)}(t)&=&\frac{x_i^{(2k+1)}(t)-\overline{x_i^{(2k+1)}}}

615: {\sqrt{\overline{(x_i^{(2k+1)}-\overline{x_i^{(2k+1)}})^2}}}

616: \end{eqnarray*}

617: where $\langle\ldots\rangle=\sum_{t=1}^T(\ldots)/T$ is

618: a time average and

619: $\overline{(\ldots)}=\sum_{i=1}^A(\ldots)/A$ denotes the

620: average over assets. As in M. B. Eisen et al.

621: %{\em Cluster analysis and display of genome-wide expression

622: %patterns},

623: [Proc. Natl. Acad. Sci. USA {\bf 95}, 14863 (1998)], the normalized

624: data $x_i(t)$ is obtained as the limit of $x_i^{(n)}(t)$ as

625: $n\to\infty$. In practice the iteration was stopped after a given

626: accuracy was reached. This procedure does not significantly affect the

627: results. Indeed the first step of normalization eliminates most of the

628: global patterns. For missing values we assumed $x_i(t)=0$ if asset $i$

629: was not traded on day $t$.

630:

631: \bibitem{stability} In order to assess the stability of the results we

632: repeated the classification of days for the first (from Jan. '90 to

633: Aug. '94) and the second (Sep. '94 to Apr. '99) halves of the time

634: series. We found dendrograms quite similar to those in

635: Fig. \ref{figdays} with two main dominant states. Clustering again

636: days into $6$ states, we found two new sequences

637: $\omega_{<}(t)$ for $t=1,\ldots,T/2$ and $\omega_{>}(t)$ for

638: $t=T/2+1,\ldots,T$. We found that $\omega_{<}(t)=\omega(t)$ in $73$\%

639: of cases and $\omega_{>}(t)=\omega(t)$ in $82$\% of cases, where

640: $\omega(t)$ is the state occurring on day $t$ according to the

641: analysis of the whole time series.

642:

643: \bibitem{CM} D. Challet, M. Marsili, {\em Phase transition and symmetry

644: breaking in the minority game}, Phys. Rev. E {\bf 60}, R6271 (1999).

645:

646: \bibitem{CMZ} D. Challet, M. Marsili, R. Zecchina, {\em Statistical

647: mechanics of systems with heterogeneous agents: Minority games},

648: Phys. Rev. Lett. {\bf 84}, 1824 (2000).

649:

650: \bibitem{Omori} F. Lillo, R.N. Mantegna, {\em Omori law after a

651: financial market crash}, e-print cond-mat/0111257 (to appear in Physica A).

652:

653: \end{thebibliography}

654:

655: \end{multicols}

656:

657: \end{document}

658: