
1: \documentclass[12pt]{article}

2: %\documentclass[a4paper,12pt]{article}

3: \usepackage{amsmath}

4: \usepackage{amsthm}

5: \usepackage{amscd}

6: \usepackage{epsfig}

7: \usepackage{textcomp}

8: \usepackage{fullpage}

9: \usepackage{natbib}

10: \usepackage{setspace}

11: \usepackage{amsfonts}

12: \usepackage{color}

13:

14: \newcommand{\bm}[1]{\mbox{\boldmath $#1$}}

15: \newcommand{\mb}[1]{\mathbf{#1}}

16: \renewcommand{\Re}[0]{\mathbb{R}}

17: \newcommand{\mT}[0]{\mathcal{T}}

18: \newcommand{\Var}[0]{\mbox{Var}}

19: \newcommand{\NA}[0]{\mbox{\tt NA}}

20: \newcommand{\ith}[1]{$#1^{\mbox{\tiny th}}$}

21: \DeclareMathOperator*{\argmin}{argmin}

22:

23: \begin{document}

24:

25: \title{

26: On estimating covariances between many assets

27:   with histories of highly variable length}

28: \author{

29:   Robert B. Gramacy\\

30:   Statistical Laboratory\\

31:   University of Cambridge\\

32:   bobby@statslab.cam.ac.uk \and

33:   Joo Hee Lee \\

34:   Fidelity Investments \\

35:   London\\

36:   joohee.lee@uk.fid-intl.com \and

37:   Ricardo Silva\\

38:   Department of Statistical Science\\

39:   University College London\\

40:   ricardo@stats.ucl.ac.uk

41: }

42:

43: \maketitle

44:

45: \doublespacing

46:

47: \begin{abstract}

48:   Quantitative portfolio allocation requires the accurate and

49:   tractable estimation of covariances between a large number of

50:   assets, whose histories can greatly vary in length.  Such data are

51:   said to follow a monotone missingness pattern, under which the

52:   likelihood has a convenient factorization.  Upon further assuming

53:   that asset returns are multivariate normally distributed, with

54:   histories at least as long as the total asset count, maximum

55:   likelihood (ML) estimates are easily obtained by performing repeated

56:   ordinary least squares (OLS) regressions, one for each asset. Things

57:   get more interesting when there are more assets than historical

58:   returns.  OLS becomes unstable due to rank--deficient design

59:   matrices, which is called a ``big $p$ small $n$'' problem.  We

60:   explore remedies that involve making a change of basis, as in

61:   principal components or partial least squares regression, or by

62:   applying shrinkage methods like ridge regression or the lasso.  This

63:   enables the estimation of covariances between large sets of assets

64:   with histories of essentially arbitrary length, and offers

65:   improvements in accuracy and interpretation.  We further extend the

66:   method by showing how external factors can be incorporated.  This

67:   allows for the adaptive use of factors without the restrictive

68:   assumptions common in factor models.  Our methods are demonstrated

69:   on randomly generated data, and then benchmarked by the performance

70:   of balanced portfolios using real historical financial returns. An

71:   accompanying {\sf R} package called {\tt monomvn}, containing code

72:   implementing the estimators described herein, has been made freely

73:   available on CRAN.

74:

75:   \bigskip

76:   \noindent {\bf Key words:} financial time series, monotone missing

77:   data, maximum likelihood, ridge regression, principal component

78:   regression, partial least squares, lasso, factor models

79: \end{abstract}

80:

81: \section{Introduction}

82: \label{sec:intro}

83:

84: Missingness in data, and hence the question of whether one should eliminate a

85: part of the data or try and estimate characteristics of it, is common

86: in statistical analysis. The missing observation problem varies in

87: style, depending on the type of data. One example is random

88: missingness, which may stem from erroneous data

89: \citep{dempster:laird:rubin:1977}.  In financial returns data

90: analysis, however, one problem stands out, which we will refer to as

91: monotone missingness. This happens when the assets of interest have

92: different lengths of historical financial data, e.g., stock prices and

93: returns.  There are several possible ways of dealing with this type of

94: incomplete dataset.  One way is by utilizing the portion of data

95: available across all of the assets.  Another approach involves

96: estimating the missing portion, called {\em imputation}

97: \citep[e.g.,][]{little:rubin:2002}.  A third approach is the focus of

98: this paper.

99:

100: Aside from some glitches in data, which will typically give rise to

101: unrealistic spikes or random missingness in data, the monotone style

102: of missingness that permeates financial historical returns data can be

103: grouped into two patterns.  The first is where the histories of assets

104: differ due to the fact that they have started being publicly traded at

105: different times. The second is where assets close for various reasons,

106: including corporate actions such as M\&A (Merger and Acquisition)

107: activities, or liquidation due to bankruptcy. Both are critical

108: problems to address when conducting a multivariate analysis.  In this

109: paper, we shall focus mainly on the former. This is sensible for the

110: application to portfolio balancing that we have in mind, since one is

111: naturally restricted to purchasing shares of companies which have

112: survived up to the current point in time.  The latter type of missingness,

113: in absence of the former, can be handled similarly, but it is not

114: immediately clear how this would be useful for portfolio balancing.

115: Handling both types of monotone missingness jointly, and other types

116: of approximately monotone missingness, requires the method of data

117: augmentation \citep{schafer:1997,little:rubin:2002}. This could

118: potentially be useful for a descriptive analysis, but is beyond the

119: scope of this paper.

120:

121: Data with arbitrary missingness patterns typically require specialized

122: iterative (even stochastic) estimation algorithms that can be slow and

123: cumbersome to implement.  However, data which follow a monotone

124: missingness pattern lead to a likelihood which has a convenient

125: factorization.  If we further assume that asset returns are

126: multivariate normally distributed (MVN), with histories at least as

127: long as the total asset count, then maximum likelihood (ML) estimators

128: are easily obtained by performing repeated ordinary least squares

129: (OLS) regressions, one for each asset.  In the finance literature,

130: this approach is usually attributed to \cite{stambaugh:1997}, but it

131: was first described by \cite{andersen:1957} and has since been

132: discussed in many texts (see Section \ref{sec:monotone}).  The method

133: fails when there are more assets than historical returns.  In this

134: case the OLS regressions become unstable due to rank--deficient design

135: matrices.  This is sometimes called the ``big $p$ small $n$'' problem.

136: It has recently received much attention in the statistics community,

137: with ready applications in bioinformatics and genomics, for example.

138: In the context of estimation for data with a monotone missingness

139: pattern, it can severely limit applicability to cases with a small to

140: modest level of missingness.

141:

142: In financial applications, where there may be more assets than there

143: are historical price observations for (some of) the assets, this

144: essentially means that the method cannot be applied on the full set of

145: assets of interest.  This paper explores remedies to this problem.  We

146: aim to develop a method that can be applied in settings where some

147: assets have histories which are shorter than the total number of

148: assets, and even when there are more assets than observations.  In

149: short, our solution involves replacing OLS with ``parsimonious

150: regressions'' that either make a change of basis, as in principal

151: components or partial least squares regression, or apply shrinkage,

152: like ridge regression or the lasso.  This enables the estimation of

153: covariances between large sets of assets with histories of essentially

154: arbitrary (and uneven) length.  Even in situations where OLS would

155: have been sufficient, we find that the more parsimonious approach can

156: offer improvements in accuracy and interpretation.

157:

158: The parsimonious approach also motivates novel ways of exploiting {\it

159:   factor} information, e.g., the value--weighted market index, size,

160: and book--to--market factors \citep{famafrench:1993}.  Traditionally,

161: factor models require the restrictive assumption that assets are

162: independent given the factors.  This underlying assumption can be

163: thought of as a specific type of parsimony.  We show how one can use

164: the data to decide which independence constraints are reasonable, by

165: incorporating the factors into our proposed framework, and furthermore

166: how this may be accomplished even under conditions of monotone

167: missingness in the historical returns {\em and} factors.

168:

169: The remainder of the paper is organized as follows.  Section

170: \ref{sec:monotone} defines the monotone pattern for missing data,

171: derives the corresponding factorized likelihood, and gives an

172: algorithm of repeated regressions to analytically find an ML estimator

173: for the case where the sampling distribution is assumed to be MVN.

174: Section \ref{sec:bpsn} outlines methods for dealing with the ``big $p$

175: small $n$'' problem in the context of regression with transformed

176: inputs and shrinkage estimators.  We highlight the benefits of

177: increased applicability, accuracy, and interpretability obtained with

178: these methods.  Section \ref{sec:monomvn} gives the details of an

179: algorithm---for MVN data under a monotone missingness pattern---that

180: combines the method in Section \ref{sec:monotone} with the

181: parsimonious regressions in Section \ref{sec:bpsn}. We explain how the

182: method can easily integrate factor information, generating a model

183: that essentially mixes factor models with estimators that account for

184: the direct dependency between returns.  We then briefly describe an

185: implementation which has been made freely available as an {\sf R}

186: package called {\tt monomvn}.  Section \ref{sec:results} shows the

187: method in action on synthetic data and real financial data with large

188: numbers of assets having histories of highly varying length.  Our

189: results are benchmarked against several standard comparators in the

190: context of covariance estimation and portfolio balancing, and are

191: accompanied by comments on interpretation, efficiency, and on the

192: (benign) consequences of using a method that leverages an MVN

193: assumption when that assumption is not believed to hold.

194: Finally, we conclude with a discussion in Section \ref{sec:discuss}

195: that focuses on some of the limitations inherent in taking a maximum

196: likelihood approach.

197:

198:

199: \section{Multivariate normal monotone missing data}

200: \label{sec:monotone}

201:

202: Let $\mb{Y}$ be an $n \times m$ matrix of random observations $Y_{i,j}$

203: which may not be completely observed.  Denote $y_{i,j} = \NA$ if the

204: \ith{i} sample of the \ith{j} covariate is missing.  In other words,

205: if the columns of a sampled $\mb{Y}$: $y_{:,1},\dots, y_{:,m}$,

206: represent a historical return series of assets indexed by $j$ and a

207: return for asset $j$ is not available at time $i$, then $y_{i,j} =

208: \NA$.  Observed $\mb{Y}$ are said to follow a {\em monotone

209:   missingness pattern} [e.g., \citep[][Section 6.5.1]{schafer:1997} or

210: \citep[][Section 7.4]{little:rubin:2002}] if the columns can be

211: arranged so that $y_{i,j} \ne \NA$ whenever $y_{i,j+1} \ne \NA$.

212: \begin{figure}[ht!]

213: \centering

214: \input{mono.pstex_t}

215: \caption{Diagram of a monotone missingness pattern with $m=6$

216:   covariates, with a maximum of $n$ completely observed samples in

217:   $\mb{y}_1=y_{:,1}$.}

218: \label{f:mono}

219: \end{figure}

220: Figure \ref{f:mono} illustrates this property diagrammatically.  The

221: row dimension $n$, of $\mb{Y}$, is equal to the number of completely

222: observed samples $n_1$ of $\mb{y}_1 \equiv y_{:,1}$, the maximally

223: observed column.  Similarly, let $\mb{y}_j \equiv y_{1:n_j,j}$ collect

224: the complete data in the \ith{j} column of $\mb{Y}$, so that $n_j \geq

225: n_{j+1}$.

226:

227: The monotone missingness patterns considered in this paper are assumed

228: to be {\em missing completely at random} (MCAR) in that the pattern of

229: missingness neither depends on the observed nor unobserved responses.

230: Note that there may be columns with identical missingness patterns.

231: In the case of asset return series with observed histories going back

232: different amounts of time, the MCAR assumption may be tenuous, but it

233: is commonly asserted anyway \citep[e.g.,][]{stambaugh:1997}.  In our

234: notation, the time index ($t$) for an asset's return history would run

235: counter to $i$, the index of the rows of $\mb{Y}$; i.e., $t=n-i+1$, as

236: also illustrated in Figure \ref{f:mono}.

237:

238: %For parameters $\bm{\theta}=(\bm{\theta}_1,\dots,\bm{\theta}_m)$,

239: When the missing data pattern is monotone, the likelihood $f(\mb{Y}|

240: \bm{\theta})$ can generally be factorized by exploiting an auxiliary

241: parameterization $\bm{\phi}=(\bm{\phi}_1, \dots, \bm{\phi}_m)$:

242: \[

243: f(\mb{Y}|\bm{\theta}) = f(\mb{y}_1|\bm{\phi}_1)

244: f(\mb{y}_2|\mb{y}_1,\bm{\phi}_2)

245: f(\mb{y}_3|\mb{y}_1,\mb{y}_2,\bm{\phi}_3) \cdots f(\mb{y}_m |

246: \mb{y}_1,\dots,\mb{y}_{m-1},\bm{\phi}_m),

247: \]

248: together with a mapping $\bm{\phi} = \Phi(\bm{\theta})$.

249: With the appropriate conditioning, the $y_{i,j}$ are assumed to be

250: independent and identically distributed (i.i.d.), so that

251: \begin{equation}

252: f(\mb{y}_j | \mb{y}_1,\dots, \mb{y}_{j-1}, \bm{\phi}_j) = \prod_{i=1}^{n_j}

253: f(y_{i,j}|y_{i,1},\dots, y_{i,j-1}, \bm{\phi}_j). \label{eq:iidlik}

254: \end{equation}

255: We are concerned with the case where the $(y_{i,1},\dots, y_{i,m})$

256: follow a multivariate normal distribution (MVN) so that the likelihood

257: in (\ref{eq:iidlik}) also follows an MVN with constant variance and a

258: mean linear in $y_{i,1},\dots, y_{i,j-1}$.  The i.i.d.~and MVN

259: assumptions may be less than ideal for financial returns data

260: \citep[e.g.,][]{mills:1927}, but we note that these are common

261: simplifying assumptions \citep{stambaugh:1997,ckl:1999,jagma:2003}

262: because they lead to tractable inference and compare favorably (see

263: Section \ref{sec:results} for results and further discussion).

264: Maximum likelihood estimators (MLEs) of $\bm{\theta}_j = (\mu_j,

265: \bm{\Sigma}_{1:j,j})$, $j=2,\dots,m$, can then be obtained by

266: regression on the complete data:

267: \begin{align}

268:   \mb{y}_j &= \mb{Y}_j \bm{\beta}_j + \bm{\epsilon}_j, &

269:   \{\epsilon_{i,j}\}_{i=1}^{n_j} &\stackrel{\mbox{\tiny i.i.d.}}{\sim}

270:   N(0,\sigma_j^2) \label{eq:monoreg}

271: \end{align}

272: where $\bm{\beta}_j^\top = (\beta_{0,j}, \beta_{1,j}, \dots,

273: \beta_{(j-1),j})$ and $\mb{Y}_j \equiv \mb{Y}_{0:(j-1)}^{(n_j)}$ is

274: the $n_j \times j$ design matrix

275: \[

276: \mb{Y}_j \equiv \mb{Y}_{0:(j-1)}^{(n_j)} = \begin{pmatrix}

277:   1 & y_{1,1} & \cdots & y_{1,(j-1)} \\

278:   1 & y_{2,1} & \cdots & y_{2,(j-1)} \\

279:   \vdots & \vdots & \ddots & \vdots \\

280:   1 & y_{n_j,1} & \cdots & y_{n_j, (j-1)}

281: \end{pmatrix}

282: \]

283: containing an intercept column, and the first $n_j$ observations of

284: the first $j-1$ columns of $\mb{Y}$.  So the auxiliary parameters

285: used in (\ref{eq:monoreg}) are $\bm{\phi}_j = (\bm{\beta}_j,

286: \sigma_j^2)$.

287: \begin{figure}[ht!]

288: \centering

289: \input{mono_regress.pstex_t}

290: \caption{Diagram of the design matrix $\mb{Y}_5$ (without an intercept

291:   term) and the response vector $\mb{y}_5$ for the fifth regression

292:   involved in maximizing the likelihood of MVN data under a monotone

293:   missingness pattern with $m=6$ covariates.}

294: \label{f:monoreg}

295: \end{figure}

296: Figure \ref{f:monoreg} diagrams the design matrix (without the

297: intercept term) and response vector involved in one such regression.

298: When $\mathrm{rank}(\mb{Y}_j) = j$, and particularly when $n_j > j$,

299: MLEs $\hat{\bm{\phi}}_j$ are obtainable via the straightforward

300: calculation:

301: \begin{align}

302: \hat{\bm{\beta}}_j &= (\mb{Y}_j^\top \mb{Y}_j)^{-1} \mb{Y}_j^\top \mb{y}_j &

303: \mbox{and} &&

304: \hat{\sigma}^2_j &= \frac{1}{n_j} ||\mb{y}_j - \mb{Y}_j \hat{\bm{\beta}}_j||^2

305: = \frac{1}{n_j} \sum_{i=1}^{n_j} (y_{i,j}

306: - (1, y_{i,1}, \dots, y_{i,(j-1)})\, \hat{\bm{\beta}}_j)^2.

307: \label{eq:regress}

308: \end{align}

309: Then,

310: starting with $\hat{\bm{\theta}}_1$ comprising $\hat{\mu}_1 =

311: \sum_{i=1}^{n_1} y_{i,1}/{n_1}$, and $\hat{\Sigma}_{1,1} =

312: \sum_{i=1}^{n_1} (y_{i,1} - \hat{\mu}_1)^2/{n_1}$, each

313: $\hat{\bm{\theta}}_j$ can be estimated conditional on

314: $\hat{\bm{\theta}}_{1:(j-1)} = (\hat{\bm{\mu}}_{1:(j-1)}^\top,

315: \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)})$ and estimates of

316: $\hat{\bm{\beta}}_j$ and $\hat{\sigma}^2_j$ as \citep{stambaugh:1997}:

317: \begin{align}

318:   \hat{\mu}_j &= \hat{\beta}_{0,j} + \hat{\bm{\beta}}_{1:(j-1),j}^\top

319:   \hat{\bm{\mu}}_{1:(j-1)}

320: &\hspace{-0.075cm} \mbox{and}&&

321: \hat{\bm{\Sigma}}_{1:j,j}

322:   &= \begin{pmatrix}

323:     \hat{\bm{\beta}}_{1:(j-1),j}^\top \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)} \\

324:     \hat{\sigma}^2_j + \hat{\bm{\beta}}_{1:(j-1),j}^\top

325:     \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)} \hat{\bm{\beta}}_{1:(j-1),j}

326: \end{pmatrix},

327: \label{eq:addy}

328: \end{align}

329: thus implicitly describing the mapping $\Phi^{-1}$ back to

330: $\bm{\theta}_j$--space.  Observe that we do not use a bias--corrected

331: estimator for $\sigma_j^2$ in (\ref{eq:regress}), i.e., with $n_j-j$

332: instead of $n_j$ in the denominator, to ensure that ML estimates

333: $\hat{\bm{\theta}}$ are obtained \citep[][p.~224]{schafer:1997}.

334: However, we have found it to be beneficial in practice to use $n_j-1$

335: in the denominator as is typical in obtaining unbiased estimates of

336: covariance matrices in the complete data case.
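The recursion in (\ref{eq:regress}) and (\ref{eq:addy}) can be sketched in a few lines of code. The following Python fragment is a minimal illustration only, not the {\tt monomvn} implementation: the function names {\tt solve}, {\tt ols} and {\tt monotone\_mle} are hypothetical, the columns are assumed to be already ordered so that $n_1 \geq n_2 \geq \cdots$, and every regression is assumed to have $n_j > j$ with a full rank design matrix. It uses the ML divisor $n_j$ throughout.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ols(X, y):
    """Return (beta, sigma2) from the normal equations; sigma2 uses divisor n."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][c] for i in range(n)) for c in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]
    return beta, sum(e * e for e in resid) / n


def monotone_mle(cols):
    """cols[j] holds the observed values of column j (longest column first).

    Entry i of every column refers to the same sample, so column j is
    regressed on the first len(cols[j]) entries of the columns before it.
    """
    n1 = len(cols[0])
    mu = [sum(cols[0]) / n1]
    S = [[sum((v - mu[0]) ** 2 for v in cols[0]) / n1]]
    for j in range(1, len(cols)):
        nj = len(cols[j])
        # design matrix: intercept plus the first nj rows of earlier columns
        X = [[1.0] + [cols[k][i] for k in range(j)] for i in range(nj)]
        beta, s2 = ols(X, cols[j])
        b = beta[1:]
        # mean update: beta_0 + b' mu
        mu_j = beta[0] + sum(bk * mk for bk, mk in zip(b, mu))
        # covariance with earlier columns (S b) and variance (s2 + b' S b)
        cov = [sum(S[r][k] * b[k] for k in range(j)) for r in range(j)]
        var = s2 + sum(b[r] * cov[r] for r in range(j))
        for r in range(j):
            S[r].append(cov[r])
        S.append(cov + [var])
        mu.append(mu_j)
    return mu, S
```

With no missing entries, calling {\tt monotone\_mle} on a list of equal length columns should reproduce the sample means and the sample covariance matrix with divisor $n$, which gives a simple sanity check of the recursion.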

337:

338: When several columns $\mb{y}_\ell$, say $\ell=j_1,\dots,j_2$, have

339: equal lengths of observed histories $n_\ell$, it is typical to use a

340: multivariate regression $(\mb{y}_{j_1} \; \cdots \; \mb{y}_{j_2}) =

341: \mb{Y}_{j_1} \bm{\beta}_{j_1:j_2} + \bm{\epsilon}_{j_1:j_2}$ to find

342: $\hat{\bm{\beta}}_{j_1:j_2}$ and the empirical variance--covariance

343: matrix $\hat{\mb{V}}_{j_1:j_2,j_1:j_2}$.  Then, several

344: $\hat{\bm{\theta}}_{j_1:j_2}$ can be found at once by replacing

345: $\hat{\bm{\beta}}_j$ with $\hat{\bm{\beta}}_{j_1:j_2}$ and

346: $\hat{\sigma}_j^2$ with $\hat{\mb{V}}_{j_1:j_2,j_1:j_2}$ in

347: (\ref{eq:addy}).  Importantly, if

348: $\hat{\bm{\Sigma}}_{1:(j_1-1),1:(j_1-1)}$ and

349: $\hat{\mb{V}}_{j_1:j_2,j_1:j_2}$ are positive definite, then

350: $\hat{\bm{\Sigma}}_{1:j_2,1:j_2}$ will be positive definite as well

351: \citep{stambaugh:1997}.

352:

353: Calculating such MLEs requires having $n_j > j$ for all $j=1,\dots,m$.

354: That is, there cannot be an asset whose history is shorter than the

355: number of assets whose histories have greater length.  If such were

356: the case, then $\mb{Y}_j$ would not be of full rank, and $\mb{Y}_j^\top

357: \mb{Y}_j$ could not be inverted in Eq.~(\ref{eq:regress}).  This is

358: sometimes referred to in the literature as the problem of regression

359: with ``big $p$ [number of parameters] small $n$ [number of

360: observations]''. Numerical singularities may arise whenever $n_j$ is

361: greater than, but nearly equal to, $j$---especially when $n$ and $m$ are

362: large.  In the following section we illustrate how these difficulties

363: may be overcome by methods of subset selection, coefficient shrinkage,

364: or the use of principal components.

365:

366: \section{Parsimonious regression}

367: \label{sec:bpsn}

368:

369: In this section, we extract and focus on the subproblem of the linear

370: regression in (\ref{eq:monoreg}), in terms of a design matrix of $p$

371: predictor variables with an intercept term ($\mb{X} \equiv \mb{Y}_j$)

372: observed for $n$ cases, with corresponding responses ($\mb{y} \equiv

373: \mb{y}_j$, where $n \equiv n_j$):

374: \begin{align}

375:   \mb{y} &= \mb{X} \bm{\beta} + \bm{\epsilon}, &

376:   \{\epsilon_{i}\}_{i=1}^{n} &\stackrel{\mbox{\tiny i.i.d.}}{\sim}

377:   N(0,\sigma^2).

378: \end{align}

379: Ordinary least squares (OLS) gives an MLE of $ \hat{\bm{\beta}} =

380: (\mb{X}^\top \mb{X})^{-1} \mb{X}^\top \mb{y}$.  Classically, there are

381: two main reasons why one may desire a more parsimonious approach to

382: regression than that provided by OLS.  The first is that OLS tends to

383: lead to high variance estimators.  The second is a desire for model

384: fits that have high qualitative interpretability, i.e., that describe

385: the data adequately but assume no more causes than will account for

386: the effect.  Our reasons for seeking an alternative are related to the

387: former more so than the latter.  But, most importantly, we aim to

388: circumvent the problem of having linear dependence in the columns of

389: $\mb{Y}_j$ when $n_j \leq j$.  In this case, we are faced with an

390: $n\times p$ design matrix $\mb{X}$ with number of columns $p$ greater

391: than the number of observations $n$, yielding an $\mb{X}^\top \mb{X}$

392: matrix that is singular and cannot be inverted---a so--called ``big

393: $p$ small $n$'' ($p > n$) problem.  We may even have that $p \gg n$,

394: say, when the total number of assets $m$ is far greater than the

395: number of returns recorded for the asset with the shortest history.

396:

397: Popular solutions to this problem involve methods of variable

398: selection and coefficient shrinkage.  Probably the most

399: straightforward method is {\em subset selection} \citep[][Section

400: 3.4.1]{hastie:tibsh:fried:2001} which aims to find the model with the

401: ``best'' size $k$ (i.e., with $k\in \{1,\dots,\min(p,n-1)\}$

402: covariates).  ``Best'' can be defined in a number of ways, but

403: typically involves $t$-tests, or minimizing an estimate of expected

404: prediction error.  Searching through all possible subsets quickly

405: becomes infeasible for $p>40$.  Larger $p$ can be handled by greedy

406: methods, but these offer fewer guarantees.  Such methods include {\em

407:   forward stepwise selection} which starts in the null (intercept

408: only) model and sequentially adds predictors, and {\em backward

409:   stepwise selection} which starts at the saturated model (only

410: applicable when $p<n$) and deletes predictors.  Hybridizations also

411: exist.

412:

413: By discarding some predictors, subset selection methods can yield a

414: model which is more interpretable, and may have lower prediction

415: error.  But this ``discrete'' process can produce estimators with high

416: variance. Shrinkage methods are a popular alternative.  They are

417: hailed for being more ``continuous'', and in some special cases they

418: can have implicit behavior similar to methods like forward selection.

419: The following subsection considers the shrinkage methods of ridge

420: regression, and those related to the lasso.  In Section \ref{sec:pc}

421: we consider another family of methods which are based on derived input

422: directions: principal components regression, which has connections to

423: ridge regression, and partial least squares regression.  These are

424: handy when the predictors are highly correlated.

425:

426: The parsimonious regression methods outlined in this section have been

427: chosen for familiarity, computational tractability, and

428: implementation.  In each case {\sf R} packages are

429: available on the Comprehensive {\sf R} Archive Network (CRAN),

430: \begin{center}

431: \verb!http://cran.R-project.org! \hspace{1cm} \citep{rproject},

432: \end{center}

433: \noindent which provide off--the--shelf implementations that will make

434: for nice subroutines within the framework of constructing estimators

435: for MVN data under monotone missingness.  It is typical to first

436: standardize the inputs ($\mb{X}$ and $\mb{y}$) as the methods outlined

437: below are not equivariant under re-scaling.

438:

439: \subsection{Shrinkage methods: ridge regression, and the lasso}

440: \label{sec:ridge}

441:

442: {\em Ridge regression} and the {\em lasso} shrink the coefficients of

443: an OLS regression by imposing a penalty on their size:

444: \begin{equation}

445:   \hat{\bm{\beta}}^{(q)} = \argmin_{\bm{\beta}}

446:   \left\{\sum_{i=1}^n \left(y_i - \beta_0 -

447:       \sum_{j=1}^p x_{ij} \beta_j\right)^2 +

448:     \lambda \sum_{j=1}^p |\beta_j|^q\right\}

449: \label{eq:ridge:lasso}

450: \end{equation}

451: with $q=2$ for ridge regression, and $q=1$ for the lasso.  The tuning

452: parameter $\lambda$ controls the amount of shrinkage.  Notice that the

453: intercept ($\beta_0$) is left out of the penalty term.  Solutions to

454: (\ref{eq:ridge:lasso}) can be obtained analytically in the case of

455: ridge regression with $\hat{\bm{\beta}}^{(2)} = (\mb{X}^\top \mb{X} +

456: \lambda \mb{I})^{-1} \mb{X}^\top \mb{y}$.  Quadratic programming is

457: required for the lasso.  Both methods have interpretations as Bayesian

458: {\em maximum a posteriori} (MAP) estimators after imposing particular

459: prior distributions.  Other choices of $q>0$ are also possible,

460: however the constraint region for $0<q<1$ is non-convex, which makes

461: solving the optimization problem more difficult.
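As a minimal sketch of the ridge solution, the Python fragment below solves (\ref{eq:ridge:lasso}) with $q=2$. It is an illustration under stated assumptions, not the {\tt lm.ridge} code: the names {\tt solve} and {\tt ridge} are hypothetical, and the unpenalized intercept is handled by centering $\mb{X}$ and $\mb{y}$ first, so that the closed form $(\mb{X}^\top \mb{X} + \lambda \mb{I})^{-1} \mb{X}^\top \mb{y}$ applies to the centered data. No rescaling of the inputs is done here.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ridge(X, y, lam):
    """Minimise sum_i (y_i - b0 - x_i'b)^2 + lam * sum_j b_j^2.

    X has no intercept column; the intercept b0 is left unpenalised.
    Returns (b0, b).
    """
    n, p = len(X), len(X[0])
    xbar = [sum(row[j] for row in X) / n for j in range(p)]
    ybar = sum(y) / n
    Xc = [[row[j] - xbar[j] for j in range(p)] for row in X]
    yc = [v - ybar for v in y]
    # centred Gram matrix with lam added on the diagonal
    G = [[sum(r[a] * r[c] for r in Xc) + (lam if a == c else 0.0)
          for c in range(p)] for a in range(p)]
    g = [sum(r[a] * v for r, v in zip(Xc, yc)) for a in range(p)]
    b = solve(G, g)
    b0 = ybar - sum(bj * xj for bj, xj in zip(b, xbar))
    return b0, b


# Four predictors but only three cases: OLS is unavailable because the
# Gram matrix is singular, yet any lam > 0 makes the system solvable.
X = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0], [2.0, 1.0, 0.0, 1.0]]
y = [1.0, 2.0, 3.0]
b0, b = ridge(X, y, 1.0)
```

Calling {\tt ridge} with $\lambda = 0$ is only meaningful when the centered design has full column rank, in which case it coincides with OLS.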

462:

463: For ridge regression, the penalty parameter ($\lambda$) is most

464: advantageously chosen by minimizing cross validation (CV) estimates of

465: predictive error.  The commonly used HKB \citep{hkb:1975} and L--W

466: \citep{lw:1976} methods are computationally efficient, but require

467: that $p < n$ to fit an OLS.  The implementation of ridge regression

468: used in this paper comes from the {\tt MASS} library \citep{mass:2002}

469: for {\sf R} in the form of a function called {\tt lm.ridge}.

470:

471: Though the form of ridge regression and the lasso are similar, there

472: are several important differences.  A large $\lambda$ will cause the

473: ridge estimator $\hat{\bm{\beta}}^{(2)}$ to have many coefficients

474: shrunk towards zero.  The lasso estimator $\hat{\bm{\beta}}^{(1)}$ has

475: a similar effect, but, importantly, may contain many coefficients

476: which are exactly zero---something which is only possible for $0 < q

477: \leq 1$.  In the Bayesian interpretation, setting $q\leq 1$

478: corresponds to choosing a prior which concentrates more mass on small

479: $|\beta_j|$, with the most on $\beta_j = 0$.  In this way, the lasso

480: implements a kind of continuous subset selection.  As $\lambda$ is

481: increased, the $|\beta_j|$ decrease, eventually increasing the number

482: of them which are identically zero, though this relationship need not

483: be strictly monotonic.
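The way the lasso sets coefficients exactly to zero can be seen in a small example. The Python sketch below minimizes (\ref{eq:ridge:lasso}) with $q=1$ by cyclic coordinate descent with soft-thresholding. This is a different technique from the quadratic programming and least angle regression approaches referred to in this paper, chosen here only because it is short. The names {\tt soft} and {\tt lasso\_cd} are hypothetical, and the inputs are assumed to be centered already so that no intercept is needed.

```python
def soft(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent for sum_i (y_i - x_i'b)^2 + lam * sum_j |b_j|.

    Assumes X and y are centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = list(y)  # current residual y - X b
    for _ in range(sweeps):
        for j in range(p):
            zj = sum(X[i][j] ** 2 for i in range(n))
            # inner product of column j with the partial residual
            rho = sum(X[i][j] * (r[i] + X[i][j] * b[j]) for i in range(n))
            # the threshold is lam / 2 because the loss is RSS, not RSS / 2
            new = soft(rho, lam / 2.0) / zj
            if new != b[j]:
                for i in range(n):
                    r[i] -= X[i][j] * (new - b[j])
                b[j] = new
    return b


# Orthogonal, centred design: inner products with y are 8 and 3.6.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.0, 1.0, -1.2, -2.8]
b_ols = lasso_cd(X, y, 0.0)    # no penalty: the least squares fit
b_lasso = lasso_cd(X, y, 8.0)  # threshold 4 removes the second predictor
```

In this orthogonal example the second coefficient is dropped entirely once $\lambda/2$ exceeds the absolute inner product between the second column and $\mb{y}$, while the first coefficient is only shrunk.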

484:

485: The implementation of lasso used in this paper is contained in the

486: {\tt lars} package for {\sf R} \citep{lars:2007}.  \cite{efron:2004}

487: show how the lasso, and two methods called {\em stepwise} and {\em

488:   forward stagewise}, are special cases of their method of {\em least

489:   angle regression} (LARS).  LARS can calculate all possible lasso

490: estimators with computational effort in the same order of magnitude as

491: OLS regression applied to the full set of covariates.  CV can be used

492: to select the final model, e.g., using the ``one--standard--error''

493: rule \citep[][Section 7.10]{hastie:tibsh:fried:2001}, or a more

494: thrifty $C_p$ \citep{mallows:1973} method can be used, but only when

495: $p < n$.  When applicable, the $C_p$ method performs nearly as well as

496: CV within the MVN setting with monotone missingness.

497: \cite{madigan:ridgeway:2004} come to similar conclusions on equally

498: tame benchmarks.  However, $C_p$ has also been criticized for

499: preferring large models \citep{ishwaran:2004,stine:2004} and for being

500: slightly at odds with LARS \citep{loubes:massart:2004}.  Since we are

501: mostly interested in applying LARS methods (i.e., lasso) when OLS is

502: not applicable, i.e., when $p \geq n$, we shall generally rely on CV

503: to select the final model.

504:

505: \subsection{Principal components and partial least squares regression}

506: \label{sec:pc}

507:

508: In situations where there are a large number of highly correlated

509: inputs, a decomposition by principal components (PCs) can be used to

510: select a small number of linear combinations of the original inputs to

511: be used in place of $\mb{X}$.  The related methods of principal

512: component regression (PCR) and partial least squares regression (PLSR)

513: start by performing an orthogonal decomposition of $\mb{X}$, but

514: differ in how the linear combinations are constructed.

515:

516: In PCR, {\em singular value decomposition} (SVD) is performed on

517: $\mb{X}$, i.e., $\mb{X} = (\mb{U} \mb{D}) \mb{V}^\top =

518: \mb{T}\mb{P}^\top$, where $\mb{U}$ is an $n \times p$ matrix of left

519: singular vectors describing the ``output basis'', $\mb{D}$ is a

520: diagonal matrix containing the corresponding singular values (the

521: square roots of the eigenvalues of $\mb{X}^\top \mb{X}$) in non-increasing order, $\mb{V}$ is

522: a $p \times p$ matrix of right singular vectors describing the ``input

523: basis'', and $\mb{T}$ and $\mb{P}$ are the so--called {\em scores} and

524: {\em loadings} defined by the decomposition.  Next, $\mb{y}$ is

525: regressed on the first $k$ PCs, i.e., the scores $\mb{T}_{(k)}$, where

526: the $(k)$ subscript indicates the extraction of the first $k$ columns

527: of $\mb{T}$, i.e., the first $k$ columns of $\mb{U}$, $\mb{V}$, and

528: the first $k$ rows/cols of $\mb{D}$.  Since the columns of $\mb{T}$

529: are orthogonal, the solution is just a sum of univariate regressions.

530: Importantly, the solution can then be written in terms of the

531: coefficients on the predictors in the columns of $\mb{X}$,

532: \begin{align}

533:   \mbox{(arbitrary scores and loadings)} && \hat{\bm{\beta}}(k) &=

534:   \label{eq:preg}

535:   \mb{P}_{(k)} (\mb{T}_{(k)}^\top \mb{T}_{(k)})^{-1} \mb{T}_{(k)}^\top \mb{y} \\

536:   \mbox{(from SVD on $\mb{X}$)} && \hat{\bm{\beta}}^{\mbox{\tiny pcr}}(k) &

537:   =\mb{V}_{(k)} \mb{D}_{(k)}^{-1} \mb{U}_{(k)}^\top \mb{y}, \nonumber

538: \end{align}

539: a vector of length $p$.  When $k=p < n$, the coefficients in

540: (\ref{eq:preg}) are identical to those obtained by OLS. There are many

541: ways of choosing how many components ($k$) to keep in the final model.

542: One way is to consider the relative sizes of the eigenvalues as a

543: proportion of the variation explained by each principal component, and

544: then choose $k$ so that 80--90\% of the variation is explained.  A

545: less ad hoc and more reliable---but more computationally

546: intensive---method that can be applied even when $p \geq n$ involves

547: using CV to estimate predictive error in order to find $k \in

548: \{1,\dots,\min(p,n-1)\}$.
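
To see how the SVD form of the PCR coefficients follows from the general expression in (\ref{eq:preg}), one can substitute the SVD scores $\mb{T}_{(k)} = \mb{U}_{(k)} \mb{D}_{(k)}$ and loadings $\mb{P}_{(k)} = \mb{V}_{(k)}$. A short sketch, assuming that the first $k$ singular values are strictly positive and using $\mb{U}_{(k)}^\top \mb{U}_{(k)} = \mb{I}$:
\begin{align*}
  \mb{P}_{(k)} (\mb{T}_{(k)}^\top \mb{T}_{(k)})^{-1} \mb{T}_{(k)}^\top \mb{y}
  &= \mb{V}_{(k)} (\mb{D}_{(k)} \mb{U}_{(k)}^\top \mb{U}_{(k)} \mb{D}_{(k)})^{-1}
     \mb{D}_{(k)} \mb{U}_{(k)}^\top \mb{y} \\
  &= \mb{V}_{(k)} \mb{D}_{(k)}^{-2} \mb{D}_{(k)} \mb{U}_{(k)}^\top \mb{y}
   = \mb{V}_{(k)} \mb{D}_{(k)}^{-1} \mb{U}_{(k)}^\top \mb{y}.
\end{align*}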

549:

550: PLSR, by contrast, aims to incorporate information about both $\mb{X}$

551: and $\mb{y}$ in the scores and loadings---which in this context are

552: often called {\em latent variables} (LVs)---by proceeding iteratively.

553: The method is initialized with the SVD of $\mb{X}^\top \mb{y}$,

554: thereby including information about the correlation between, and the

555: variance within, $\mb{X}$ and $\mb{y}$.  The scores and loadings

556: obtained by PLSR optimally capture the covariance between $\mb{X}$ and

557: $\mb{y}$, whereas PCR concentrates only on the variance of $\mb{X}$

558: \citep{dejong:1993}.  There are several algorithms for obtaining the

559: scores and loadings, but once obtained, the regression coefficients

560: $\hat{\bm{\beta}}^{\mbox{\tiny plsr}}(k)$ in $\mb{X}$-space are

561: recovered by following (\ref{eq:preg}), and CV can be similarly used

562: to pick $k$.

563:

564: In situations where a minor component of $\mb{X}$ is highly correlated

565: with $\mb{y}$, PLSR may have a significant advantage over PCR.

566: Otherwise, the methods have a more or less comparable performance

567: record despite a few operational differences---e.g., PLSR usually

568: needs fewer LVs, but can also yield higher variance estimators of the

569: regression coefficients.  Both have behavior similar to other

570: shrinkage methods, particularly ridge regression.  For example, it can

571: be shown \citep{frank:fried:1993} that ridge regression shrinks the

572: coefficients of principal components by a factor of

573: $d_j^2/(d_j^2+\lambda)$, where the $d_j$ are from the diagonal of

574: $\mb{D}$, whereas PCR truncates them at $k$.
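The contrast between the two shrinkage patterns can be made concrete with a few lines of Python.  The singular values, $\lambda$ and $k$ below are made--up illustrative numbers, and the helper names are hypothetical.

```python
def ridge_factors(d, lam):
    """Factor applied by ridge regression to each principal-component coefficient."""
    return [dj ** 2 / (dj ** 2 + lam) for dj in d]

def pcr_factors(d, k):
    """PCR keeps the first k components unshrunk and drops the rest."""
    return [1.0 if j < k else 0.0 for j in range(len(d))]

d = [4.0, 2.0, 1.0, 0.5]          # illustrative singular values, largest first
ridge = ridge_factors(d, lam=1.0)
pcr = pcr_factors(d, k=2)
```

Ridge shrinks every component smoothly, and most strongly those with small $d_j$, whereas PCR applies a hard zero/one cutoff at $k$.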

575:

576: An {\sf R} package called {\tt pls} \citep{heige:2007} provides a

577: unified implementation of PCR and three algorithms for PLSR

578: \citep{dayal:macg:1997,dejong:1993,martens:naes:1989}, together with

579: built--in facilities for estimating $k$ via CV.

580:

581:

582: \section{The {\tt monomvn} algorithm}

583: \label{sec:monomvn}

584:

585: So long as $n_j > j$ for all $j=1,\dots,m$, and $n_j \geq n_{j+1}$, an

586: algorithm for finding the parameters $\bm{\mu}$ and $\bm{\Sigma}$ that

587: maximize the MVN likelihood for monotone missing data proceeds as

588: outlined in Section \ref{sec:monotone}.  Initialize $\mu_1$ and

589: $\Sigma_{11}$ to the sample mean and variance of the first column

590: $\mb{y}_1$ of $\mb{Y}$, then iterate through the following steps for

591: $j=2,\dots,m$:

592: \begin{enumerate}

593: \item Find the MLEs (\ref{eq:regress}) of $\bm{\beta}_j$ and

594:   $\sigma_j^2$ in a regression (\ref{eq:monoreg}) of $\mb{y}_j$ onto the first

595:   $j-1$ columns of $\mb{Y}$ (as predictors), using only the first

596:   $n_j$ observations;

597: \item Obtain the MLEs of $\mu_j$ and $\bm{\Sigma}_{(1:j),j}$

598:   from $\hat{\bm{\mu}}_{1:(j-1)}$, $\hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)}$,

599:   $\hat{\bm{\beta}}_j$ and $\hat{\sigma}^2_j$ as in (\ref{eq:addy}).

600: \end{enumerate}

601: If any $n_j \leq j$, then we have a ``big $p$ small $n$'' problem, and

602: the standard regression in step 1 above cannot be performed.  In

603: practice, it may be that $n_j > j$ and still there are columns of the

604: design matrix which are not linearly independent, and so it is not of

605: full rank.  This becomes increasingly likely as $j$ approaches $n_j$,

606: when finite (double--precision) computer representations can leave

607: the design matrix

608: numerically rank deficient.  Both issues are addressed simultaneously

609: by instead performing one of the parsimonious regressions outlined in

610: Section \ref{sec:bpsn}.  Then step 2 can proceed as usual.  Observe

611: that this approach also enables estimation when there are more assets

612: than historical returns ($m > n$).
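The OLS--only version of the two steps can be sketched as below.  This is a minimal NumPy illustration rather than the {\tt monomvn} package code; it assumes that the update in step 2 takes the usual form $\hat{\mu}_j = \hat{\beta}_{0,j} + \hat{\bm{\beta}}_j^\top \hat{\bm{\mu}}_{1:(j-1)}$, $\hat{\bm{\Sigma}}_{1:(j-1),j} = \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)} \hat{\bm{\beta}}_j$ and $\hat{\Sigma}_{jj} = \hat{\sigma}_j^2 + \hat{\bm{\beta}}_j^\top \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)} \hat{\bm{\beta}}_j$, and the function name {\tt monotone\_mle} is hypothetical.

```python
import numpy as np

def monotone_mle(Y, n_obs):
    """ML estimates (mu, Sigma) under monotone missingness via sequential OLS.

    Y     : (n, m) array; column j is used only in rows 0, ..., n_obs[j] - 1
    n_obs : non-increasing history lengths, long enough for each OLS fit
            (zero-based column j needs n_obs[j] > j + 1)
    """
    n, m = Y.shape
    mu = np.zeros(m)
    Sigma = np.zeros((m, m))
    y0 = Y[:n_obs[0], 0]
    mu[0] = y0.mean()
    Sigma[0, 0] = y0.var()                 # ML variance (divides by n)
    for j in range(1, m):
        nj = n_obs[j]
        # Step 1: regress column j on the previous columns, with an intercept.
        X = np.column_stack([np.ones(nj), Y[:nj, :j]])
        yj = Y[:nj, j]
        coef = np.linalg.lstsq(X, yj, rcond=None)[0]
        b0, b = coef[0], coef[1:]
        resid = yj - X @ coef
        s2 = resid @ resid / nj            # ML residual variance
        # Step 2: extend mu and Sigma by one row/column.
        S = Sigma[:j, :j]
        mu[j] = b0 + b @ mu[:j]
        Sigma[:j, j] = Sigma[j, :j] = S @ b
        Sigma[j, j] = s2 + b @ S @ b
    return mu, Sigma

# Illustrative (simulated) data: complete histories, then monotone ones.
rng = np.random.default_rng(1)
Y = rng.standard_normal((40, 4)) @ rng.standard_normal((4, 4))
mu_c, Sigma_c = monotone_mle(Y, [40, 40, 40, 40])
mu_m, Sigma_m = monotone_mle(Y, [40, 30, 20, 10])
```

With complete histories the recursion reproduces the sample mean and the ML (divide by $n$) sample covariance; with shorter histories it still returns a symmetric positive--definite matrix.  Swapping the least squares call for a parsimonious regression leaves step 2 untouched.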

613:

614: \subsection{Choosing the parsimonious proportion}

615:

616: Even when parsimonious regression is not strictly necessary, it can

617: aid in interpretation, and possibly even yield more accurate and lower

618: variance estimators.  The lasso and the other LARS methods can

619: choose to shrink $\bm{\beta}$ so that only the intercept term is

620: nonzero.  This enables the detection of zeros in the MVN covariance

621: matrix $\bm{\Sigma}$.  In other words, it can be used as a test, of

622: sorts, for independence between assets.

623:

624: Towards building a more efficient and interpretable estimator, one may

625: consider applying a parsimonious regression for every iteration of

626: step 1 above.  This is explored further in Section \ref{sec:depend}.

627: Alternatively, one could determine a threshold, say $p$, representing

628: a proportion of columns to rows in the design matrix past which a

629: parsimonious regression is applied regardless, that is, whenever $j

630: \geq p n_j$, for $0\leq p\leq 1$.  Then, the $p=0$ case corresponds to

631: always using a parsimonious method, and $p=1$ reverts to applying one

632: only when necessary.  In Section \ref{sec:parsi} we show how easy it

633: is to establish reliable rules of thumb for choosing $p$.

634:

635: \subsection{Incorporating factors}

636: \label{sec:fact}

637:

638: A popular estimator for the covariance matrix of financial asset

639: returns involves using {\it factor models}.  The essential idea behind

640: the factor model is to regress the observed returns $\mb{y}_j$ on

641: measured common market factors $\mb{F}$, and to derive a covariance

642: matrix of the returns as a function of the regression equations.

643:

644: For a factor space with $K$ factors, the model can be formalized as

645: follows. Each excess return $y_{i, j}$ is modeled by the regression

646: equation

647: \begin{equation}

648: \label{eq:factor-regression}

649: y_{i, j} = \lambda_{0, j} + \sum_{k = 1}^K \lambda_{k, j}f_{i, k} + \epsilon_{i, j}

650: \end{equation}

651: where each $\epsilon_{i, j}$ is a residual term independent of

652: $\mb{F}$. The residual terms for the $i^{\mbox{\tiny th}}$ instance

653: are assumed to follow a zero--mean MVN with diagonal covariance matrix

654: $\mb{D}$.  For instance, a common one--factor model takes $f$ to be a

655: value--weighted market index \citep[e.g.,][]{ckl:1999}. A common

656: three--factor model augments the value--weighted market index with

657: size and book--to--market factors \citep{famafrench:1993}.

658:

659: Factors are assumed, for now, to be i.i.d.~and to follow a MVN with

660: $K\times K$ covariance matrix $\bm \Omega$.  Let $\bm{\Lambda}$ be the

661: $K\times m$ matrix defined by the entries $\bm{\Lambda}_{k, j} =

662: \lambda_{k, j}$, for $k=1,\dots,K$.  It follows that the covariance

663: matrix of the returns, as parameterized by $\{\bm{\Omega},

664: \bm{\Lambda}, \mb{D}\}$, is given by

665: \begin{equation}

666: \bm{\Sigma}^{(f)} = \bm{\Lambda}^\top \bm{\Omega}\bm{\Lambda} + \mb{D}.

667: \end{equation}

668: An estimate $\hat{\bm{\Sigma}}^{(f)}$ can therefore be obtained by

669: estimating each column $\hat{\bm{\lambda}}_j = (\hat{\lambda}_{1, j},

670: \dots,\hat{\lambda}_{K, j})^\top$ of $\hat{\bm{\Lambda}}$ by regressing

671: $\mb{y}_j$ on $\mb{F}$ with an intercept.  The mean sum of squares of

672: the residuals of each regression forms the diagonal of $\hat{\mb{D}}$,

673: and the off--diagonal entries are zero.  The estimate $\hat{\bm

674:   \Omega}$ is the empirical covariance of the factors.  Note that each

675: regression equation requires only the data observed for the particular

676: return $\mb{y}_j$, together with the corresponding observations for

677: the factor(s).  However, in practice, the method is applied only to

678: completely observed $\mb{Y}$ and $\mb{F}$.
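For completely observed $\mb{Y}$ and $\mb{F}$, the classical factor--model estimate can be sketched as follows.  This is a minimal NumPy illustration under stated assumptions: residual variances and the factor covariance both use the divide--by--$n$ convention, the function name {\tt factor\_covariance} is hypothetical, and the simulated data are illustrative only.

```python
import numpy as np

def factor_covariance(Y, F):
    """Factor-model covariance Lambda^T Omega Lambda + D from complete data.

    Y : (n, m) returns, F : (n, K) factors.
    """
    n, m = Y.shape
    X = np.column_stack([np.ones(n), F])            # intercept plus factors
    coef = np.linalg.lstsq(X, Y, rcond=None)[0]     # (K + 1, m); one regression per asset
    Lam = coef[1:, :]                               # K x m loadings
    resid = Y - X @ coef
    D = np.diag((resid ** 2).mean(axis=0))          # diagonal residual covariance
    Omega = np.atleast_2d(np.cov(F, rowvar=False, bias=True))
    return Lam.T @ Omega @ Lam + D

# Illustrative (simulated) data with K = 2 factors and m = 6 assets.
rng = np.random.default_rng(2)
n, K, m = 200, 2, 6
F = rng.standard_normal((n, K))
Y = F @ rng.standard_normal((K, m)) + 0.5 * rng.standard_normal((n, m))
Sigma_f = factor_covariance(Y, F)
```

Under these conventions the diagonal of the estimate matches the sample variances of the returns; only the off--diagonal entries are constrained by the factor structure.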

679:

680: The main underlying assumption is that returns are mutually

681: independent conditioned on the factors. If the number of factors is

682: considerably smaller than the number of returns, the model will be

683: parsimonious and the resulting $\hat{\bm{\Sigma}}^{(f)}$ will have

684: lower variance than the empirical covariance matrix.  This assumption

685: allows for any missingness pattern, even the extreme one where no

686: joint observation of returns $\mb{y}_j$ and $\mb{y}_k$ exists.  The

687: drawback is that the independence assumptions encoded in this model

688: might be unrealistic, and the resulting estimate will suffer from a

689: strong bias.

690:

691: Instead, we can use the data to find which independence assumptions

692: are adequate by integrating the factor model into the {\tt monomvn}

693: framework.  Consider the {\it full} regression model, where we regress

694: $\mb{y}_j$ on $\mb{Y}_j$ and $\mb{F}_j \equiv\mb{F}_{1:(j-1)}^{(n_j)}$

695: simultaneously:

696: \begin{equation}

697: \label{eq:full-factor-regression}

698: \mb{y}_j = \mb{Y}_j \bm{\beta}_j + \mb{F}_j

699: \bm{\lambda}_j + \bm{\epsilon}_j,

700: \end{equation}

701: %where $\bm \beta_j^T = (\beta_{0, j}, \beta_{1, j}, \dots,

702: %\beta_{(j-1), j})$ as before, and $\bm{\lambda}_j = (\lambda_{1, j},

703: %\dots,\lambda_{K, j})$.

704: The $\lambda_{0, j}$ term does not appear because it is not

705: identifiable given the presence of $\beta_{0, j}$.  Since this

706: formulation is in the same family of parameterizations of the original

707: models used in {\tt monomvn}, an analogous procedure applies with

708: minor pre- and post-processing. First shift the labels of the returns for

709: each asset by $K$ so that $\mb{y}_j$ becomes $\mb{y}_{j + K}$ and the

710: corresponding $\bm{\beta}_j$ becomes $\bm{\beta}_{j+K}$.  Then map

711: $\mb{F}_k$ to $\mb{Y}_k$ and $\bm{\lambda}_k$ to $\bm{\beta}_k$.  If

712: the recursion in Eq.~(\ref{eq:addy}) is then applied as usual, giving

713: the estimates $\hat{\bm{\mu}}$ [an $(m+K)$ vector] and $\hat{\bm{\Sigma}}$

714: [an $(m+K)\times (m+K)$ matrix], an estimate of the covariance matrix

715: of the asset returns can then be extracted from the bottom--right $m

716: \times m$ block of $\hat{\bm{\Sigma}}$, i.e.,

717: $\hat{\bm{\Sigma}}^{(f+m)} = \hat{\bm{\Sigma}}_{(K + 1):(m + K), (K +

718:   1):(m + K)}$.  The superscript $(f+m)$ is meant to indicate

719: dependence on both factors and assets.  Importantly, no internal

720: changes to the workings of the {\tt monomvn} algorithm are necessary.

721:

722: Observe that if the (parsimonious) regression method applied within

723: {\tt monomvn} uses OLS whenever regressing onto the factors, and sets

724: the regression coefficients to zero otherwise, then we obtain

725: $\hat{\bm{\Sigma}}^{(f+m)} = \hat{\bm{\Sigma}}^{(f)}$.  In the context

726: of {\tt monomvn} we call this the ``factor--parsimony'' regression,

727: filling a role similar to PCR, lasso, etc.  If required, the

728: covariance matrix of the factors can also be recovered as

729: $\hat{\bm{\Omega}} = \hat{\bm{\Sigma}}_{1:K,1:K}$.  Also observe that,

730: within the {\tt monomvn} framework, it is possible to handle factors

731: with historical missingness.

732:

733: If, instead of the factor--parsimony method, any of the other methods

734: (outlined in Section \ref{sec:bpsn}) are used, then shrinkage is

735: applied to both $\bm \beta_j$ and $\bm{\lambda}_j$ in

736: (\ref{eq:full-factor-regression}).  In this case we obtain a

737: generalization of the independence structure assumed in the classical

738: factor model, allowing the data (factors and returns) to determine the

739: appropriate mix of influence on the resulting estimator for

740: $\bm{\Sigma}$.  It is interesting to point out the link between this

741: generalized factor model (\ref{eq:full-factor-regression}) resulting

742: in $\hat{\bm{\Sigma}}^{(f+m)}$, and the optimal shrinkage estimator of

743: \citet{ledoit:2002}:

744: \begin{equation}

745:   \hat{\bm{\Sigma}}^{(\ell)} = \alpha \hat{\bm{\Sigma}}^{(f)} +

746:   (1 - \alpha)\hat{\bm{\Sigma}}^{(c)}, \;\;\;\;\; \mbox{for } \alpha \in [0, 1].

747:   \label{eq:ledoit}

748: \end{equation}

749: Here, $\hat{\bm{\Sigma}}^{(c)}$ is the standard covariance estimate

750: obtained using only the portion of the data available across all

751: assets and $\alpha$ is an ``optimal'' mixing proportion chosen by CV.

752: (Note that Ledoit's factor--based estimator $\hat{\bm{\Sigma}}^{(f)}$

753: uses only completely observed joint returns.)  The spirit of these two

754: approaches is similar, but they are quite distinct.  The published

755: success of this type of shrinkage approach suggests that it is

756: important to combine a (complete data) factor--based estimate with a

757: traditional covariance estimate.  Indeed, the estimator

758: $\hat{\bm{\Sigma}}^{(f+m)}$ involves combining covariances mediated by

759: factors with covariances that are not accounted for by factors; it can

760: also handle historical missingness via the ``factor--parsimony''

761: regressions within {\tt monomvn}.  But rather than shrinking a

762: (possibly) non--positive definite estimator $\hat{\bm{\Sigma}}^{(c)}$

763: towards $\hat{\bm{\Sigma}}^{(f)}$ with a single parameter $\alpha$ as

764: in (\ref{eq:ledoit}), {\tt monomvn} applies $m+K$ unique shrinkage

765: parameters, one for {\em each} regression, while taking full advantage

766: of all available returns.

767:

768: \subsection{Software}

769:

770: Finally, an {\sf R} package called {\tt monomvn} \citep{monomvn} has

771: been made freely available through CRAN. It implements the algorithm

772: described in this section, and supports all of the parsimonious

773: regression methods outlined in Section \ref{sec:bpsn} via the

774: stand--alone packages outlined therein.  Two forms of CV are supported

775: for choosing the number of components in the parsimonious regression:

776: random 10--fold and (deterministic) leave--one--out (LOO).  A $p$

777: argument facilitates parsimonious regression modeling, as described

778: above.  Incorporating factors is as straightforward as bundling them in

779: as if they were returns, as described above.

780:

781: \section{Empirical results}

782: \label{sec:results}

783:

784: In this section, the {\tt monomvn} methods are illustrated and

785: validated on real and synthetic data.  In Section \ref{sec:synth} we

786: focus on the properties of estimates of $\hat{\bm{\mu}}$ and

787: $\hat{\bm{\Sigma}}$ in a controlled setting involving synthetic data

788: under monotone missingness.  In Section \ref{sec:portfolio} we turn to

789: applying the estimators towards balancing portfolios in a

790: mean--variance setting.  We wrap up in Section \ref{sec:depend} by using

791: {\tt monomvn} in a descriptive analysis of dependence involving

792: thousands of assets.

793:

794: \subsection{Properties of the estimators on synthetic data}

795: \label{sec:synth}

796:

797: Here, we use a data--generation mechanism provided by the {\tt

798:   monomvn} package: {\tt randmvn} generates random samples from a

799: randomly generated MVN distribution with an i.i.d.~standard normal

800: mean vector $\bm{\mu}$, and an Inv--Wishart sampled $\bm{\Sigma}$;

801: {\tt rmono} imposes a uniformly distributed monotone missingness

802: pattern.  A similar method is used to generate samples with monotone

803: missingness from a multivariate $t$ distribution (MV$t$) as well, in

804: order to demonstrate that the MVN--based {\tt monomvn} methods still

805: perform well in the presence of heavier tailed data.

806:

807: %\subsubsection{Comparators}

808:

809: The comparisons to follow focus on highlighting the relative strengths

810: and weaknesses of variations of {\tt monomvn} as a function of the

811: choice of parsimonious regression method applied.  Additionally, two

812: simpler methods are devised as calibration tools, and to illustrate

813: the advantage of the {\tt monomvn} approach over those which do not

814: leverage the structure of the monotone missingness pattern.  The

815: simplest comparator is called ``complete'', where $\bm{\mu}$ and

816: $\bm{\Sigma}$ are estimated using only the portion of data available

817: across all assets, i.e., only the completely observed returns.  Put

818: yet another way: only the first $n_m$ rows of $\mb{Y}$ are used.

819: Another comparator is ``observed'' which uses all of the available

820: data in an obvious but na\"ive way:

821: \begin{align}

822:   \hat{\mu}_j &= \frac{1}{n_j} \sum_{k=1}^{n_j} y_{k,j} && \mbox{

823:     and} & \hat{\Sigma}_{i,j} &= \frac{1}{n_j} \sum_{k=1}^{n_j}

824:   (y_{k,j} - \hat{\mu}_j)(y_{k,i} - \hat{\mu}_i) \;\;\;\; \mbox{ for }

825:   i=1,\dots,j.

826: \end{align}

827: Unfortunately, the covariance matrices provided by the ``complete'' and

828: ``observed'' estimators are not guaranteed to be positive--definite

829: \citep{stambaugh:1997}.  %Besides meaning that these estimators are

830: % invalid, the KL divergence to the true distribution cannot be

831: % calculated, and so the RMSE statistics will be our only metric for

832: % comparison.
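The ``observed'' comparator can be written out directly from its definition.  The sketch below assumes NumPy, rows ordered so that column $j$ is observed in its first $n_j$ rows, and a hypothetical function name {\tt observed\_estimator}.

```python
import numpy as np

def observed_estimator(Y, n_obs):
    """Naive estimator that uses every available observation.

    Y     : (n, m) array; column j is observed in rows 0, ..., n_obs[j] - 1
    n_obs : non-increasing history lengths
    """
    n, m = Y.shape
    mu = np.array([Y[:n_obs[j], j].mean() for j in range(m)])
    Sigma = np.zeros((m, m))
    for j in range(m):
        nj = n_obs[j]
        for i in range(j + 1):
            # Both columns are observed in the first nj rows because n_i >= n_j.
            prod = (Y[:nj, j] - mu[j]) * (Y[:nj, i] - mu[i])
            Sigma[i, j] = Sigma[j, i] = prod.mean()
    return mu, Sigma

# Illustrative (simulated) data.
rng = np.random.default_rng(3)
Y = rng.standard_normal((30, 3))
mu_full, Sigma_full = observed_estimator(Y, [30, 30, 30])
mu_obs, Sigma_obs = observed_estimator(Y, [30, 20, 10])
```

With complete histories this coincides with the usual sample covariance; with unequal histories each entry is computed from a different subsample.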

833:

834: As a final comparator, we consider a method of estimation for

835: incomplete data for arbitrary missingness patterns

836: \citep{dempster:laird:rubin:1977}, using the expectation conditional

837: maximization (ECM) algorithm \citep{meng:rubin:1993}.  Consequently,

838: this method also works when the missingness pattern is monotone, but

839: represents a sort of overkill in this case.  Two similar software

840: packages are available for this method when the data is assumed to

841: follow a multivariate normal distribution: the {\tt norm} package

842: \citep{norm:2002} for {\sf R}, and {\tt ecmnmle} (contained in the

843: {\sf Matlab} {\tt Financial Toolbox}).  We prefer {\tt norm} because

844: its core is implemented in compiled {\sf Fortran}, with an {\sf R}

845: wrapper.  It gives nearly identical results to---but runs more than 20

846: times faster than---{\tt ecmnmle} which is written solely in {\sf

847:   Matlab}.  The ECM method iterates until convergence, stopping at a

848: {\em local} maximum when an improvement threshold is met.  As a

849: result, its computational demands and the ultimate optimality of the

850: resulting estimator are sensitive to the initial configuration of the

851: algorithm.  Though the missingness pattern may be arbitrary, it is

852: well--known that the method can fail due to convergence issues and/or

853: numerical singularities that can arise due to finite machine

854: representations when more than 15\% of the data is missing (see, e.g.,

855: the {\tt ecmnmle} documentation within {\sf Matlab}).  So it cannot

856: handle $m > n$, which precludes it from general use in our problem.

857:

858: The expected log likelihood (ELL), which is related to the

859: Kullback--Leibler (KL) divergence, is used as the main metric for

860: comparisons.  For probability distribution functions (PDFs) $p$ and

861: $q$, the KL divergence between $p$ and $q$ is defined as

862: \[

863: D_{\mbox{\tiny KL}}(q \parallel p) = \int p(x) \log \frac{p(x)}{q(x)} \;dx.

864: \]

865: In the particular case where $q$ is the estimated MVN with parameters

866: $\hat{\bm{\mu}}$ and $\hat{\bm{\Sigma}}$ and $p$ is the ``true''

867: parameterization with $\bm{\mu}$ and $\bm{\Sigma}$, the KL divergence

868: can be shown to be:

869: \[

870: D_{\mbox{\tiny KL}}(\mathrm{MVN}(\hat{\bm{\mu}}, \hat{\bm{\Sigma}}) \parallel

871: \mathrm{MVN}(\bm{\mu}, \bm{\Sigma})) = \frac{1}{2} \left(\log

872:   \frac{|\hat{\bm{\Sigma}}|}{|\bm{\Sigma}|} - m +

873:   \mbox{tr}(\hat{\bm{\Sigma}}^{-1} \bm{\Sigma}) + (\hat{\bm{\mu}} -

874:   \bm{\mu})^\top \hat{\bm{\Sigma}}^{-1}(\hat{\bm{\mu}} - \bm{\mu}) \right).

875: \]

876: The ELL of $q$ relative to data sampled from $p$ is given by

877: \begin{align}

878: \mathbb{E}_p\{\log q\} &= \int p(x) \log q(x) \;dx \nonumber \\

879: &= \int p(x) \log p(x) \;dx

880: - D_{\mbox{\tiny KL}}(q \parallel p). \label{e:ell}

881: \end{align}

882: The integral $\int p\log p$ in (\ref{e:ell}) is the negative entropy of $p$.

883: For $p = \mathrm{MVN}(\bm{\mu}, \bm{\Sigma})$ in $m$ dimensions it can be shown to

884: work out to $-\frac{1}{2} \log \{(2\pi e)^m |\bm{\Sigma}|\}$.  When

885: analytical expressions are not available it is easy to approximate

886: (\ref{e:ell}) numerically by $T^{-1} \sum_{t=1}^T \log q(x_t)$, where

887: $x_t \sim p$ is simulated out of sample.  This nicely converges to the

888: truth for large $T$.  The ELL is good for ranking competing

889: estimators; however, actual ``distances'' between estimators are hard to

890: interpret.
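As a check on these expressions, the closed--form ELL for two MVNs can be compared with its Monte Carlo approximation.  The sketch below assumes NumPy and uses the standard Gaussian closed form for $\int p \log (p/q)$, which includes a $-m$ dimension term; the parameter values and function names are illustrative assumptions.

```python
import numpy as np

def mvn_kl(mu, Sigma, mu_hat, Sigma_hat):
    """int p log(p/q) for p = MVN(mu, Sigma) and q = MVN(mu_hat, Sigma_hat)."""
    m = len(mu)
    Sinv = np.linalg.inv(Sigma_hat)
    diff = mu_hat - mu
    logdet_hat = np.linalg.slogdet(Sigma_hat)[1]
    logdet = np.linalg.slogdet(Sigma)[1]
    return 0.5 * (logdet_hat - logdet - m
                  + np.trace(Sinv @ Sigma) + diff @ Sinv @ diff)

def mvn_ell(mu, Sigma, mu_hat, Sigma_hat):
    """Expected log likelihood of q under p: int p log p minus the KL term."""
    m = len(mu)
    int_p_log_p = -0.5 * (m * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])
    return int_p_log_p - mvn_kl(mu, Sigma, mu_hat, Sigma_hat)

def mvn_logpdf(x, mu, Sigma):
    """Row-wise MVN log density for an array x of shape (T, m)."""
    m = len(mu)
    diff = x - mu
    quad = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
    return -0.5 * (m * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1] + quad)

# Illustrative "true" parameters and a perturbed estimate.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]])
mu_hat = mu + 0.1
Sigma_hat = Sigma + 0.2 * np.eye(3)

ell_exact = mvn_ell(mu, Sigma, mu_hat, Sigma_hat)
rng = np.random.default_rng(4)
x = rng.multivariate_normal(mu, Sigma, size=200000)   # out-of-sample draws from p
ell_mc = mvn_logpdf(x, mu_hat, Sigma_hat).mean()
```

The two values should agree closely for large $T$; the Monte Carlo version is the one that carries over when $p$ is not MVN, as with the MV$t$ experiments.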

891:

892: % As an auxiliary metric we consider a root mean squared error (RMSE)

893: % obtained by treating all $m + m(m+1)/2$ unique components of $\bm{\mu}$

894: % and $\bm{\Sigma}$ equally:

895: % \[

896: % \mbox{RMSE}(\{\hat{\bm{\mu}}, \hat{\bm{\Sigma}}\},

897: % \{\bm{\mu}, \bm{\Sigma}\})

898: % = \sqrt{\frac{1}{m+m(m+1)/2} \left[

899: %   \sum_{j=1}^m (\hat{\mu}_j - \mu_j)^2

900: %  + \sum_{1 \leq i \leq j}^m (\hat{\Sigma}_{i,j} -

901: %   \Sigma_{i,j})^2 \right]}.

902: % \]

903: % This metric has many advantages including intuitive appeal, ease of

904: % computation, a natural quadratic scale, and is a measure of goodness

905: % of fit that is devoid of (possibly tenuous) distributional

906: % assumptions.  However, as we shall see, it is possible that estimated

907: % $\hat{\bm{\mu}}$ and $\hat{\bm{\Sigma}}$ have low RMSE yet depict

908: % relatively poor probability densities for the true underlying data.

909: % One reason is because the squared distance between components of

910: % $\hat{\bm{\Sigma}}$ and $\bm{\Sigma}$ ignores their sign.

911:

912: \subsubsection{Comparing estimators}

913:

914: Figure \ref{f:synth} {\em (left)} summarizes a comparison between the

915: different parsimonious regressions within the {\tt monomvn} algorithm,

916: using randomly generated MVN data with $m=100$ and $n=1000$, repeated

917: over 100 trials, each time sampling new $\bm{\mu}$, $\bm{\Sigma}$ and

918: $\mb{Y}\sim \mathrm{MVN}(\bm{\mu}, \bm{\Sigma})$ with uniform monotone

919: missingness.

920: \begin{figure}[ht!]

921: \centering

922: \includegraphics[angle=-90, scale=0.285]{rEllik}

923: \includegraphics[angle=-90, scale=0.285]{rtllik}

924: \caption{Comparison of parsimonious regression ($p=1$) methods (using

925:   10--fold CV) on randomly generated MVN data ($n=1000$ samples,

926:   $m=100$ dimensions) with $\bm{\mu}\sim N_m(0,1)$, $\bm{\Sigma}

927:   \sim$ Inv--Wishart and uniform monotone missingness: boxplots of ELL

928:   ranks summarizing 100 repeated trials.

929:   \label{f:synth}}

930: \end{figure}

931: Parsimonious regressions were used only when necessary (i.e., $p=1$).

932: 10--fold CV was used to choose $\lambda$ or the number of (principal)

933: components.  As can be seen from the figure, PCR emerges as the clear

934: winner in this comparison, nearly always having the best ELL rank.

935: The complete and observed comparators are almost always ranked worst.

936: % The RMSE results give more insight into the poor performance of

937: % these comparators, but they are less helpful for discerning between

938: % the variations on {\tt monomvn}.  It would appear that ridge regression

939: % has the lowest RMSE but, paradoxically, has the second worst rank.

940:

941: In anticipation of the application in Section \ref{sec:portfolio} to

942: financial returns data, which are believed to follow a heavier tailed

943: distribution than MVN, we repeated the above experiment with

944: synthetically generated MV$t$ data with a monotone missingness

945: pattern.  The degrees of freedom parameter was sampled as $\nu \sim

946: \mathrm{Exp}(\frac{1}{2})+1$.  Figure \ref{f:synth} {\em (right)} shows

947: roughly similar behavior for the MVN based {\tt monomvn} estimators

948: when fit to MV$t$ data: PCR is the best and the observed and complete

949: estimators are the worst (although the order is switched).  ELL was

950: computed numerically using the known degrees of freedom parameter(s),

951: $\nu$, which generated the data.  This is a legitimate choice since

952: $\nu$ is not used in the mean--variance analysis to follow in

953: Section \ref{sec:portfolio}.  It is interesting to note the improved

954: rank(s) of the ridge regression based estimator in this case.

955:

956: These results are in line with those of previous simulation studies

957: which compare ML estimators---that are able to leverage all of the

958: available data by exploiting the MVN assumption---to those which use

959: more reasonable distributional assumptions but which, for reasons of

960: tractability, can only use the completely observed cases

961: \citep[e.g.,][]{little:1988}.  The evidence suggests that making use

962: of all of the available data in a sensible way is the crucial

963: ingredient, even though the underlying assumptions may be violated.

964: The dominance of PCR in both MVN and MV$t$ scenarios is in line with a

965: recent study \citep{cpr:5829} showing that PCR out--competes other

966: shrinkage (Bayesian motivated) estimators in applications with a large

967: number of financial asset returns.

968:

969: \subsubsection{Choosing the parsimonious proportion}

970: \label{sec:parsi}

971:

972: Recall from Section \ref{sec:monomvn} that $p\in [0,1]$ determines

973: when a parsimonious method is to be used instead of OLS in the {\tt

974:   monomvn} algorithm.  The experiment performed here is similar to the

975: previous one, except that $n$ and $m$ are varied stochastically with

976: $m$ uniform in $\{5,\dots,100\}$ and $n|m$ uniform in $\{\max(10,

977: \lfloor m/2\rfloor),\dots, md\}$.

978: \begin{table}[ht]

979: \begin{center}

980: \begin{tabular}{l||rrr|r}

981: & \multicolumn{3}{c|}{optimal $p$} & \\

982: method & 5\% & mean & 95\% & improv \\

983:   \hline

984:   plsr & 0.12 & 0.23 & 0.37 & 0.55  \\

985:   pcr & 0.09 & 0.27 & 0.51 & 0.69 \\

986:   ridge & 0.04 & 0.25 & 0.67 & 0.29 \\

987:   lasso & 0.12 & 0.24 & 0.38 &  0.76 \\

988:   lar & 0.11 & 0.26 & 0.41 & 0.65 \\

989:   stepwise & 0.15 & 0.26 & 0.39 & 0.74

990: \end{tabular}

991: \end{center}

992: \caption{Mean and 90\% interval for optimal $p$, the ratio of columns

993:   to rows in the design matrix before switching from OLS to a parsimonious

994:   regression.  The {\em improv} column gives the proportion of runs for

995:   which $p=0.25$ is better than $p=0$. We repeated this over 100 trials

996:   with LOO CV with the ELL as an objective.

997: \label{t:p}}

998: \end{table}

999: Table \ref{t:p} shows the mean and 90\% interval for the optimal $p$

1000: over 100 repeated trials sampling new $m$, $n$, etc., each time.  LOO

1001: CV was used to choose $\lambda$, or the number of (principal)

1002: components, and the objective criterion used was ELL. The final column

1003: in the table shows the proportion of trials in which $p=0.25$ was better

1004: than $p=0$.  Observe that all methods except ridge regression work

1005: well, as a rule of thumb, with $p=0.25$.  All things being equal, a

1006: larger $p$ setting may be preferred for speed reasons.

1007:

1008: \subsubsection{Comparing to ECM}

1009:

1010: Due to the limitations of ECM--based methods, like those implemented

1011: by {\tt norm} and {\tt ecmnmle}, a comparison of {\tt monomvn} to

1012: these approaches requires a more controlled experiment.  Fixing $m=10$

1013: and $n=100$, 1000 repeated experiments similar to the ones described

1014: above, with uniform monotone missingness, gave that {\tt monomvn}

1015: (with PCR) had higher ELL 997 times ($99.7\%$) and that ECM failed to

1016: converge 53 times ($\approx 5\%$).  As $n$ grows relative to $m$, the

1017: performance of the methods converges.  For example, with $m=10$ and

1018: $n=1000$, {\tt monomvn} is better 831 times ($83\%$), and

1019: ECM failed to converge 11 times ($1\%$).  As the dimensionality ($m$)

1020: increases modestly compared to the sample size ($n$), the ECM--based

1021: {\tt norm} algorithm consistently diverges.  For example, with $m=20$

1022: and $n=100$ {\tt norm} fails to converge more than 40\% of the time.

1023:

1024: \subsection{Constructing portfolios from historical returns}

1025: \label{sec:portfolio}

1026:

1027: In this section we examine the characteristics of minimum variance

1028: portfolios constructed using estimates of $\bm{\Sigma}$ based on

1029: historical monthly returns.  The experimental setup is similar to ones

1030: that have been used in several recent papers on covariance estimation,

1031: and minimum variance portfolio balancing

1032: \citep[e.g.,][]{ckl:1999,jagma:2003}.  Following these works we use the

1033: monthly returns of common domestic stocks traded on the NYSE and the

1034: AMEX from April 1968 until 1998. We require that the stocks have a

1035: share price greater than \$5 and a market capitalization greater than

1036: the 20th percentile of the size distribution of NYSE firms.  Estimators of

1037: $\bm{\Sigma}$ are constructed based on (at most) the most recently

1038: available 60 months of historical returns.  This is in keeping with

1039: previous work and acknowledges that the i.i.d.~assumption in

1040: Eq.~(\ref{eq:iidlik}) is only valid locally (in time) due to the

1041: conditional heteroskedastic nature of financial returns.  Short

1042: selling is not allowed; all portfolio weights must be nonnegative.

1043: Although it is typical to cap the weights as well, e.g., at 2\%, in

1044: order to ``tame occasional bold forecasts'' \citep{ckl:1999} that

1045: typically arise due to poor estimators \citep{jagma:2003}, we

1046: specifically do not do so here.  Our goal is to fully expose the quality

1047: of the estimators and to illustrate that with good estimators such

1048: rules of thumb are unnecessary.

1049:

1050: Four classes of estimators of $\bm{\Sigma}$ are used in the

1051: comparisons which follow.  (1) The {\em complete} estimator outlined

1052: earlier, with variations depending on how many assets have historical

1053: returns with certain lengths (more below). (2) A one--factor model

1054: using the return on the value--weighted portfolio of stocks traded on

1055: the NYSE, AMEX, and Nasdaq. (3) The {\tt monomvn} method using the

1056: parsimonious regressions of Section \ref{sec:bpsn} with $p=0.25$. (4)

1057: The {\tt monomvn} method incorporating the value--weighted portfolio

1058: as a factor, as described in Section \ref{sec:fact}, and with

1059: $p=0$.  For this class we augment the collection of parsimonious

1060: regressions to include the ``factor--parsimony'' method.  We do not

1061: compare to the ECM methods of {\tt norm} or {\tt ecmnmle} here, as

1062: this has proved to be both cumbersome and troublesome; the methods

1063: seem unable to handle the missingness level in this data.  For

1064: example, {\tt norm} consistently fails to converge even after

1065: thousands of very slow iterations of ECM (each taking several seconds

1066: on a 3.2 GHz Xeon).

1067:

1068: To assess the quality and characteristics of the constructed

1069: portfolios we follow \citet{ckl:1999} in using the following:

1070: (annualized) return and standard deviation; (annualized) Sharpe ratio

1071: (average return in excess of the Treasury bill rate divided by the

1072: standard deviation); (annualized) tracking error (standard deviation

1073: of the portfolio return in excess of the S\&P500 return); correlation

1074: to the market (S\&P500 return); average number of stocks with weights

1075: above 0.5\%. We closely follow the experimental setup of

1076: \citet{ckl:1999} and \citet{jagma:2003} by randomly subsampling from

1077: the qualifying stocks in each year, and holding the portfolios for the

1078: entire subsequent 12 months.  The random subsample reduces the size of

1079: the estimation problem, and thus computational burden, so that many

1080: methods can be simultaneously benchmarked against one another.  It can

1081: also serve the dual purpose of enabling the calculation of

1082: nonparametric (bootstrap--like) Monte Carlo assessments of

1083: variability, which was not a feature explored in previous work.

1084:
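
The summary statistics listed above can be written compactly, as a sketch only.  The notation is introduced here purely for illustration: $r_t$, $f_t$, and $b_t$ denote the monthly returns on the portfolio, on Treasury bills, and on the S\&P500, while $\bar{r}$, $\bar{f}$, and $\mbox{sd}(\cdot)$ denote sample means and standard deviations over the holding months.  Assuming the common convention of annualizing monthly means by a factor of 12 and monthly standard deviations by $\sqrt{12}$ (the exact convention is not spelled out in the text), the Sharpe ratio and tracking error would be
\begin{equation*}
\mbox{sharpe} = \frac{12\,(\bar{r} - \bar{f})}{\sqrt{12}\;\mbox{sd}(r_t)},
\qquad
\mbox{te} = \sqrt{12}\;\mbox{sd}(r_t - b_t).
\end{equation*}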

1085: Specifically, in each April, starting in 1972, we randomly subsample

1086: 250 stocks

1087: % \footnote{\citet{jagma:2003} use subsamples of size 500.  Since

1088: %   there are approximately 900 qualifying assets in any year we

1089: %   prefer to follow \citet{ckl:1999} and use 250 in order to better

1090: %   explore the spread of the characteristics of our estimators in

1091: %   this experiment.}

1092: (without replacement) from those which qualify (in the

1093: sense outlined above) and which have at least 12 months of historical

1094: returns.  In this way our work differs slightly from our predecessors

1095: whose estimators require exactly 60 months of historical returns.  We

1096: chose 12 months in order to highlight the benefit of incorporating

1097: assets in the portfolio with fewer than 60 months of returns via {\tt

1098:   monomvn}.  Estimates of the covariance matrix of monthly excess

1099: returns (over the monthly Treasury Bill rate) are generated from the

1100: different models using at most the last 60 months of historical

1101: returns for the 250 assets.  Based on the estimate(s), quadratic

1102: programming is used to find the global minimum variance portfolio(s)

1103: described by weights $\hat{\mb{w}} = \argmin_{\mb{w}} \mb{w}^T

1104: \hat{\bm{\Sigma}} \mb{w}$.  Then, the weights $\hat{\mb{w}}$ are

1105: applied to form buy--and--hold portfolio returns until the next April,

1106: when the randomization, estimation, and optimization steps are

1107: repeated and the portfolios are reformed.

1108:
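
To make the optimization step above concrete, one plausible statement of the quadratic program is sketched here.  The full--investment constraint follows from the definition of a portfolio.  The no--short--sale constraint is an assumption made for this illustration, of the kind commonly imposed in this literature, and is not stated in the text.  The symbol $\mb{1}$ denotes a vector of ones, and $w_i$ the \ith{i} weight.
\begin{equation*}
\hat{\mb{w}} = \argmin_{\mb{w}} \; \mb{w}^T \hat{\bm{\Sigma}} \mb{w}
\quad \mbox{subject to} \quad
\mb{1}^T \mb{w} = 1, \quad w_i \geq 0, \;\; i=1,\dots,250.
\end{equation*}
If only the full--investment constraint were imposed, and $\hat{\bm{\Sigma}}$ were positive definite, the solution would be available in closed form,
\begin{equation*}
\hat{\mb{w}} = \frac{\hat{\bm{\Sigma}}^{-1} \mb{1}}{\mb{1}^T \hat{\bm{\Sigma}}^{-1} \mb{1}},
\end{equation*}
so a numerical quadratic programming routine is needed only when inequality constraints are present.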

1109: \begin{table}[ht!]

1110: \begin{center}

1111: \begin{tabular}{r||rrrrrr}

1112: %  \hline

1113: method & mean & sd & sharpe & te & cm & wmin \\

1114:   \hline \hline

1115: eq & 0.149 & 0.188 & 0.432 & 0.062 & 0.949 & 0 \\

1116: vw & 0.135 & 0.162 & 0.412 & 0.032 & 0.981 & 45 \\

1117: \hline

1118: min & 0.147 & 0.183 & 0.431 & 0.105 & 0.819 & 29 \\

1119: com & 0.150 & 0.183 & 0.447 & 0.107 & 0.810 & 26 \\

1120: rm & 0.132 & 0.129 & 0.494 & 0.094 & 0.803 & 16 \\

1121: \hline

1122: fmin & 0.142 & 0.146 & 0.503 & 0.086 & 0.845 & 38 \\

1123: fcom & 0.144 & 0.146 & 0.521 & 0.087 & 0.841 & 37 \\

1124: frm & 0.138 & 0.130 & 0.537 & 0.117 & 0.688 & 21 \\

1125: \hline

1126: plsr & 0.148 & 0.154 & 0.516 & 0.124 & 0.686 & 15 \\

1127: pcr & 0.143 & 0.132 & 0.563 & 0.109 & 0.732 & 23 \\

1128: ridge & 0.158 & 0.165 & 0.546 & 0.122 & 0.716 & 16 \\

1129: lasso & 0.151 & 0.150 & 0.550 & 0.054 & 0.941 & 69 \\

1130: lar & 0.151 & 0.151 & 0.545 & 0.053 & 0.944 & 71 \\

1131: step & 0.152 & 0.155 & 0.541 & 0.052 & 0.946 & 75 \\

1132: \hline

1133: ffp & 0.143 & 0.132 & 0.566 & 0.113 & 0.712 & 24 \\

1134: fplsr & 0.147 & 0.153 & 0.514 & 0.123 & 0.688 & 15 \\

1135: fpcr & 0.142 & 0.131 & 0.560 & 0.109 & 0.732 & 24 \\

1136: fridge & 0.158 & 0.163 & 0.554 & 0.119 & 0.726 & 19 \\

1137: flasso & 0.152 & 0.148 & 0.561 & 0.056 & 0.936 & 69 \\

1138: flar & 0.151 & 0.151 & 0.546 & 0.053 & 0.943 & 70 \\

1139: fstep & 0.154 & 0.153 & 0.558 & 0.055 & 0.939 & 73 \\

1140: \end{tabular}

1141: \end{center}

1142: \caption{Comparing statistics summarizing the returns of

1143:   yearly buy--and--hold portfolios generated over 50 repeated

1144:   random paths through the 26 years of monthly historical returns.

1145:   The first group of rows shows the equal-- and value--weighted

1146:   portfolios; the second group of rows has complete data estimators

1147:   based on the preceding 12--months of returns, the maximal completely

1148:   observed historical returns, and the returns for the subset of

1149:   assets with 60 months of historical returns; the third group

1150:   uses the same returns as the second with a one--factor model;

1151:   the penultimate group uses {\tt monomvn}; the final group uses

1152:   {\tt monomvn} with the additional one--factor.  The statistics

1153:   across the columns are (annualized) mean return, standard

1154:   deviation, Sharpe ratio, tracking error, correlation to market

1155:   and average number of stocks with weights above 0.5\%.

1156: } \label{t:sharpe}

1157: \end{table}

1158:

1159: Table \ref{t:sharpe} summarizes the properties of those returns

1160: averaged over 50 repeated random paths through the 26 years in the

1161: study.  The table is broken into five sections, vertically, starting

1162: with the equal-- and value--weighted portfolios (for comparison),

1163: followed by global minimum variance portfolios based on estimated

1164: $\bm{\Sigma}$: complete data estimators, complete data estimators

1165: based on a one--factor model, {\tt monomvn} estimators, and {\tt

1166:   monomvn} estimators incorporating the one--factor.  Throughout, the

1167: ``f'' prefix indicates that the estimator uses the value--weighted

1168: factor in some way.  The ``min'' and ``fmin'' estimators use only the

1169: last 12--months of historical returns, whereas the ``com'' and

1170: ``fcom'' estimators use the maximal complete history available.  The

1171: ``rm'' and ``frm'' estimators focus only on those assets with

1172: completely observed returns for the last 60 months---where the weights

1173: for the other assets are set to zero (removing them from the

1174: portfolio). The annualized mean, standard deviation, and Sharpe ratio

1175: statistics for these six estimators lead one to conclude that the more

1176: historical returns (within the five--year window) that can be used to

1177: estimate $\bm{\Sigma}$ the better.  Tracking error is also improved,

1178: except in the case of ``frm''.  All in all, these results support

1179: those obtained in previous studies \citep[e.g.,][]{ckl:1999} showing

1180: that, in particular, factor models improve upon the na\"ive estimator

1181: in the complete data case.  Further inspection of this part of the

1182: table reveals that the improved Sharpe ratios for ``rm'' and ``frm''

1183: are due to the smaller standard deviation obtained under these

1184: estimators, but that this comes at the expense of a smaller mean

1185: return.  This may be due to more weight being placed on fewer assets

1186: (as indicated in the ``wmin'' column).  Both ``rm'' and ``frm'' also

1187: have the lowest correlation to the market in their cohort.

1188:

1189: The final two groups of rows tell a similar story.  The Sharpe ratios

1190: for the {\tt monomvn} estimators---with and without the

1191: value--weighted factor---show marked improvements over the complete

1192: data estimators.  As before, the inclusion of the value--weighted

1193: factor further adds to the improvement, e.g., yielding higher Sharpe

1194: ratios except in the cases of PCR and PLSR, where they remain essentially

1195: unchanged.  The ``ffp'' estimator, i.e., the one--factor model

1196: applied via {\tt monomvn} using the ``factor--parsimony'' regression

1197: method, has one of the lowest standard deviations, and therefore a

1198: comparatively high Sharpe ratio despite a low mean return.  We can see

1199: that, as with ``rm'' and ``frm'', this low standard deviation is

1200: obtained by placing large weight on only a few assets.  PCR, PLSR, and

1201: ridge regression---both with and without factors---show similar

1202: properties.  In contrast, the LARS estimators (lasso, lar, and

1203: stepwise---both with and without the factor) obtain similar or

1204: better Sharpe ratios, but with a larger mean return, by assigning large

1205: weight to roughly three times as many assets.  As a result, these LARS

1206: estimators obtain a much lower tracking error and higher correlation

1207: to the market.

1208:

1209:

1210: So when appropriate factors are available it makes sense to use them,

1211: and the best way to do so is via {\tt monomvn}.  It would seem that

1212: the one--factor LARS based {\tt monomvn} estimators give the best

1213: results in the study, overall, with lasso in the top spot.  It is

1214: reassuring to notice that, when an appropriate factor is {\em not}

1215: available, the LARS based {\tt monomvn} methods, and PCR, give largely

1216: similar results by incorporating all of the available returns in a

1217: parsimonious way.  This is not true in the case of the complete data

1218: estimators.

1219:

1220: \begin{figure}[ht!]

1221: \centering

1222: \includegraphics[trim=40 0 0 10,scale=0.75]{sharpe_boxplot}

1223: \includegraphics[trim=40 0 0 25,scale=0.75]{te_boxplot}

1224: \caption{Boxplots of Sharpe ratios {\em (top)} and the tracking error

1225:   {\em (bottom)} obtained over 50 random paths through the 26 years,

1226:   obtained by randomly sampling 250 qualifying assets in each year.

1227:   The averages of these numbers are reported in Table

1228:   \ref{t:sharpe}.  The horizontal bars correspond to the vertical ones

1229:   in that table.}

1230: \label{f:boot}

1231: \end{figure}

1232: Figure \ref{f:boot} complements Table \ref{t:sharpe} by showing the

1233: distribution (via boxplots) of the Sharpe ratios and the tracking

1234: error obtained for each of the 50 random paths through the 26 years.

1235: Recall that these were obtained by randomly sampling 250 qualifying

1236: assets in each year.  The numbers in Table \ref{t:sharpe} are the

1237: means of the data used to construct each boxplot, whereas the boxplots in

1238: the figure represent Monte Carlo approximations to the sampling

1239: distribution of portfolio characteristics under the various estimators

1240: of $\bm{\Sigma}$.  In short, the figure reinforces the

1241: superiority of the LARS estimators which, in addition to having large

1242: Sharpe ratios and small tracking error, also exhibit small variability

1243: with respect to Monte Carlo resampling.  It is interesting to note

1244: that the LARS based estimators (without the factor) show the lowest

1245: variability in their Sharpe ratios amongst all {\tt monomvn}

1246: estimators.

1247:

1248: It may be tempting to conclude that these results contradict the

1249: results of the ELL--based comparison(s) on synthetic data in Section

1250: \ref{sec:synth}.  Indeed, in that section we saw that PCR seemed to be

1251: the best at recovering the (known) parameters of the distribution which generated

1252: the training data.  However, means, variances, Sharpe ratios, tracking

1253: error, etc., are specific statistics, and moreover they are obtained

1254: after a (highly non--linear) transformation into portfolio weights via

1255: quadratic programming.  Therefore, we should expect to see different

1256: results, since these statistics represent utilities which are

1257: different from ELL.  That being said, notice that PCR is still the

1258: best in terms of average annualized standard deviation (and thus

1259: Sharpe ratio) [see Table \ref{t:sharpe}] when no appropriate factors

1260: are available---but with high variability [see Figure \ref{f:boot}].

1261: Importantly, both experiments (here and in Section \ref{sec:synth})

1262: show, resoundingly, that using all of the available data via {\tt

1263:   monomvn} is preferred over a complete data estimator.

1264:

1265: \subsection{Examining dependence relationships between assets}

1266: \label{sec:depend}

1267:

1268: For our final empirical analysis we shall demonstrate the descriptive

1269: power of {\tt monomvn}.  At the same time we shall take the

1270: opportunity to show how the method can be applied when there are

1271: thousands of assets.

1272:

1273: From Thomson Financial's Datastream ({\tt www.datastream.com}), we

1274: have downloaded, in dollar terms, the total returns data of each stock

1275: in the Russell 3000$^{\mbox{\tiny \textregistered}}$

1276: Index

1277: %\footnote{The Russel 3000$^{\mbox{\tiny \textregistered}}$ Index

1278: % represents

1279:   representing the broad United States equity universe encompassing

1280:   approximately 98\% of the market:

1281:   % .}

1282:   1792 weekly returns between 12/01/1973 and 11/05/2007 for 2894

1283:   assets. In order to obtain a set of clean and complete data, each

1284:   series is tested for illiquidity, completeness, and stationarity,

1285:   using the following methodology.  We removed assets which were

1286:   marked to market at a frequency other than weekly, to exclude

1287:   illiquid assets that may exhibit artificial serial correlation (this

1288:   essentially excludes any stock that has more than two weeks of

1289:   consecutive unchanging prices at any point in time).  Then, an

1290:   augmented Dickey--Fuller test \citep{dickey:fuller:1979} is employed

1291:   to exclude any of the assets that exhibit non--stationarity (six

1292:   lags have been tested at the 99\% confidence level).  A total of

1293:   2461 stocks remained after applying these two filtering steps.

1294:   There are 558 assets with the longest history of 1792 returns; the least

1295:   observed asset has only 76 returns (so the ``complete'' estimator(s)

1296:   can use only 3\% of the data); the overall proportion of missing

1297:   observations was 0.472.

1298:

1299: We consider applying the lasso version of the {\tt monomvn} algorithm

1300: to this data, with $p = 0$, i.e., always use the lasso (never use

1301: OLS).  As we have mentioned, the lasso (and other LARS methods) have

1302: descriptive (as well as predictive) power because they can provide

1303: $\hat{\bm{\beta}}$ with many coefficients set to zero.  In the context

1304: of the {\tt monomvn} algorithm this means that the MLE

1305: $\hat{\bm{\Sigma}}$ may have zero entries, indicating marginally

1306: uncorrelated assets, and moreover may have block--diagonal structure

1307: (or zeros in $\hat{\bm{\Sigma}}^{-1}$) indicating a pairwise

1308: conditional independence of assets.  Since ridge regression, PCR, and

1309: PLSR always yield $|\hat{\beta}_i| > 0$, they would never produce a

1310: zero in $\hat{\bm{\Sigma}}$ or $\hat{\bm{\Sigma}}^{-1}$, and so would

1311: be less useful for creating such qualitative summaries of the

1312: relationships between asset returns.  It may be tempting to interject

1313: zeros where there are small values in $\hat{\bm{\Sigma}}$ or

1314: $\hat{\bm{\Sigma}}^{-1}$, but like the ``complete'' and ``observed''

1315: estimators, the resulting matrix would not usually be positive

1316: definite.  Moreover, classical pairwise tests for independence, say

1317: via the Pearson product--moment correlation coefficient, would give

1318: unrealistic results.  With return histories as short as $\sim80$ weeks

1319: and estimated correlation less than about 0.2, a simple calculation

1320: shows that there would not be enough evidence to reject the

1321: hypothesis that the correlation is zero.

1322:
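
The calculation referred to above can be sketched as follows, assuming the usual $t$--test for a Pearson correlation under bivariate normality.  The symbols $n$ (number of weekly returns), $r$ (sample correlation), and $t$ (test statistic) are introduced here only for this illustration, with the illustrative values $n = 80$ and $r = 0.2$:
\begin{equation*}
t = r \sqrt{\frac{n-2}{1-r^2}} = 0.2 \sqrt{\frac{78}{0.96}} \approx 1.80.
\end{equation*}
This falls short of the two--sided 5\% critical value of roughly 1.99 for a $t$ distribution with $n - 2 = 78$ degrees of freedom, so a sample correlation of this size could not be distinguished from zero on so short a history.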

1323: The estimator obtained using the lasso on this data yields a

1324: $\hat{\bm{\Sigma}}$ with 36\% of its entries set to zero.  Moreover,

1325: 50 of its 2461 columns (or 2\%) are everywhere zero except in the

1326: diagonal position.  This means that 36\% of asset pairings are

1327: marginally uncorrelated.  Investigating pairwise correlation between

1328: assets, conditional on all of the others, involves looking for zeros

1329: in $\hat{\bm{\Sigma}}^{-1}$, of which we find 140 (or 6\%).  This

1330: means that the rows/columns of $\hat{\bm{\Sigma}}$ can be reordered so

1331: that the matrix has block--diagonal structure, and that the returns of

1332: 6\% of the assets are conditionally independent.

1333: \begin{figure}

1334: \includegraphics[angle=-90,scale=0.8]{indep.ps}

1335: \vspace{-0.1cm}

1336: \caption{Histograms of the number of zeros in each column of

1337:   $\hat{\bm{\Sigma}}$ {\em (left)} and $\hat{\bm{\Sigma}}^{-1}$ {\em

1338:     (right)}.}

1339: \label{f:indep}

1340: \end{figure}

1341: Figure \ref{f:indep} shows histograms summarizing the number of zeros

1342: in each column of $\hat{\bm{\Sigma}}$ and $\hat{\bm{\Sigma}}^{-1}$.

1343: Every column in both matrices had at least one zero entry.  The figure

1344: clearly illustrates that the resulting correlations can be used to

1345: cluster the assets, but this is beyond the scope of this paper.

1346:

1347: To wrap up the experiment we downloaded the market returns available

1348: from the Russell 3000 Index

1349: for 1479 (of 1792) contiguous weeks ending 11/5/2007 and used them to

1350: create a residual return series

1351: for each of the 2461 assets in our

1352: study.  We then re-ran the lasso experiment, above, to discover that

1353: 58\% of the asset pairings are marginally uncorrelated and 14\% are

1354: conditionally independent when the market is taken into account.  The

1355: histograms corresponding to this experiment are similar to those for

1356: the initial one, in Figure \ref{f:indep}, and so they are not

1357: reproduced here.  %{\em Profound concluding comment.}

1358:

1359: \section{Discussion}

1360: \label{sec:discuss}

1361:

1362: We have shown how the methods of \cite{stambaugh:1997} can be applied

1363: for large numbers of assets whose histories are (nearly) unconstrained

1364: in length.  The key insight is in replacing OLS regressions with more

1365: parsimonious ones that either use derived input directions or apply

1366: some sort of shrinkage.  Whereas Stambaugh demonstrated his

1367: methodology on 22 assets, we have shown how the {\tt monomvn}

1368: algorithm---essentially the same methodology with a different

1369: regression method---can handle thousands.  We argued that even when

1370: OLS regressions suffice, the more parsimonious ones can offer

1371: improvements in both accuracy and interpretation.  We also argued that

1372: it is advantageous to let a model selection method (e.g., parsimonious

1373: regression) decide which dependencies between factors and returns

1374: exist, as opposed to assuming a classical factor model structure.

1375:

1376: \cite{stambaugh:1997} showed that by applying the standard

1377: noninformative prior $\pi(\bm{\theta}) \propto

1378: |\bm{\Sigma}|^{\frac{p-1}{2}}$ \citep[e.g.,][p.~154]{schafer:1997} it

1379: is possible to turn the MLEs $\hat{\bm{\mu}}$ and $\hat{\bm{\Sigma}}$

1380: into moments $\tilde{\bm{\mu}}=\hat{\bm{\mu}}$ and

1381: $\tilde{\bm{\Sigma}}\ne\hat{\bm{\Sigma}}$ of a Bayesian posterior

1382: (predictive) distribution that, when used in the mean--variance

1383: framework, are said to take {\em estimation risk} into account.  We

1384: note that, due to the notation used in that paper, it is a common

1385: misconception that these posterior moments forecast the ML estimates

1386: into the future.  Since Stambaugh employs the i.i.d.  assumption in

1387: the same way that we do in Eq.~(\ref{eq:iidlik}), these are only

1388: moments of the posterior for $\bm{\theta}$ conditioned on the

1389: available historical data.  Therefore, time is irrelevant, so the

1390: moments apply to the past as well without modification.  Finally, to

1391: label this approach as ``Bayesian'' is an overstatement.  While

1392: Stambaugh is correct to note that estimates of the mean vector and

1393: covariance matrix are all that are needed within the mean--variance

1394: framework, what results is a point--estimate (vector) of optimal

1395: portfolio weights, not (samples from) a Bayesian posterior

1396: distribution, as would be ideal.  The challenge is that while the

1397: moments of the posterior have a nice closed form, the distribution

1398: itself does not.  Further challenges limit the application of this

1399: approach in the ``big $p$ small $n$'' setting.  In this situation the

1400: standard noninformative prior leads to an improper posterior.  This

1401: can be most easily seen in the calculation of Stambaugh's $\tilde{V}

1402: \equiv \tilde{\bm{\Sigma}}$ (in our notation) in Eqs.~(69--71),

1403: p.~302, where the resulting diagonal would be negative.

1404:

1405: Stambaugh's Bayesian approach is not the only way forward.  It is

1406: possible to obtain the sampling covariance matrix of $\hat{\bm{\mu}}$

1407: analytically.  However, an analytic form for the sampling variability

1408: of $\hat{\bm{\Sigma}}$ is not known.  The bootstrap

1409: \citep[e.g.][Sections 7.11 \& 8.2]{hastie:tibsh:fried:2001} offers a

1410: Monte Carlo method for quantifying the {\em stability} of

1411: $\hat{\bm{\Sigma}}$ via its component-wise confidence intervals.  We

1412: took a related approach at the end of Section \ref{sec:portfolio} to

1413: examine how variability in $\hat{\bm{\Sigma}}$, arising from random

1414: subsamples of 250 assets, filters through to the properties of the

1415: balanced portfolios.  However, \citet[][Section

1416: 7.4.4]{little:rubin:2002} make a strong argument in preference for a

1417: fully Bayesian approach instead.  Facilitating tractable Bayesian

1418: estimation for parsimonious regression algorithms, as would be

1419: required by {\tt monomvn}, presents a serious challenge.  The Bayesian

1420: lasso \citep{park:casella:2008} and so--called Bayesian latent factor

1421: models \citep{west:2003}, which can be seen as a Bayesian extension of

1422: principal components and partial least squares regressions, have

1423: received much attention in the recent literature.  Exploring the

1424: extent to which these can be applied within the {\tt monomvn}

1425: algorithm to get samples from the posterior distribution of $\bm{\mu}$

1426: and $\bm{\Sigma}$ is part of our ongoing work.  These samples can

1427: accurately reflect the estimation risk in mean--variance portfolio

1428: allocation by filtering the uncertainty through the optimization to get

1429: a distribution on the simplex of portfolio weights.

1430:

1431: Another interesting extension would involve relaxing the assumption of

1432: (multivariate) normality, i.e., to decouple the dependence

1433: distribution, or {\em copula} \citep{sklar:1957}, from the marginals.

1434: In this regard, \cite{patton:2006} has made promising inroads into

1435: applying copulas to a pair of return series under a monotone

1436: missingness pattern.  Although the theory for copulas

1437: \citep{nelsen:1999} naturally extends beyond two dimensions, the

1438: application of the methodology quickly becomes intractable without

1439: enforcing severely restrictive assumptions.  Our ongoing work includes

1440: identifying ways in which the {\tt monomvn} algorithm for

1441: high--dimensional estimation under monotone missingness may be

1442: extended to support marginal Student--$t$ distributions and GARCH

1443: models with various parametric forms of the copula.  While there is

1444: plenty of evidence in the literature against the assumption of

1445: normality for asset returns \citep[e.g.][]{mills:1927}, we argued that

1446: the most important thing is to be able to make use of all of the

1447: available data with an algorithm that is computationally tractable.

1448:

1449: % smaller spacing for references

1450: %\renewcommand{\baselinestretch}{1.5}\small\normalsize

1451:

1452: \bibliography{corr}

1453: \bibliographystyle{jasa}

1454:

1455: \end{document}

1456: