1: \documentclass[12pt]{article}
2: %\documentclass[a4paper,12pt]{article}
3: \usepackage{amsmath}
4: \usepackage{amsthm}
5: \usepackage{amscd}
6: \usepackage{epsfig}
7: \usepackage{textcomp}
8: \usepackage{fullpage}
9: \usepackage{natbib}
10: \usepackage{setspace}
11: \usepackage{amsfonts}
12: \usepackage{color}
13:
14: \newcommand{\bm}[1]{\mbox{\boldmath $#1$}}
15: \newcommand{\mb}[1]{\mathbf{#1}}
16: \renewcommand{\Re}[0]{\mathbb{R}}
17: \newcommand{\mT}[0]{\mathcal{T}}
18: \newcommand{\Var}[0]{\mbox{Var}}
19: \newcommand{\NA}[0]{\mbox{\tt NA}}
20: \newcommand{\ith}[1]{$#1^{\mbox{\tiny th}}$}
21: \DeclareMathOperator*{\argmin}{argmin}
22:
23: \begin{document}
24:
25: \title{
26: On estimating covariances between many assets
27: with histories of highly variable length}
28: \author{
29: Robert B. Gramacy\\
30: Statistical Laboratory\\
31: University of Cambridge\\
32: bobby@statslab.cam.ac.uk \and
33: Joo Hee Lee \\
34: Fidelity Investments \\
35: London\\
36: joohee.lee@uk.fid-intl.com \and
37: Ricardo Silva\\
38: Department of Statistical Science\\
39: University College London\\
40: ricardo@stats.ucl.ac.uk
41: }
42:
43: \maketitle
44:
45: \doublespacing
46:
47: \begin{abstract}
48: Quantitative portfolio allocation requires the accurate and
49: tractable estimation of covariances between a large number of
50: assets, whose histories can greatly vary in length. Such data are
51: said to follow a monotone missingness pattern, under which the
52: likelihood has a convenient factorization. Upon further assuming
53: that asset returns are multivariate normally distributed, with
54: histories at least as long as the total asset count, maximum
55: likelihood (ML) estimates are easily obtained by performing repeated
56: ordinary least squares (OLS) regressions, one for each asset. Things
57: get more interesting when there are more assets than historical
58: returns. OLS becomes unstable due to rank--deficient design
59: matrices, which is called a ``big $p$ small $n$'' problem. We
60: explore remedies that involve making a change of basis, as in
61: principal components or partial least squares regression, or by
62: applying shrinkage methods like ridge regression or the lasso. This
63: enables the estimation of covariances between large sets of assets
64: with histories of essentially arbitrary length, and offers
65: improvements in accuracy and interpretation. We further extend the
66: method by showing how external factors can be incorporated. This
67: allows for the adaptive use of factors without the restrictive
68: assumptions common in factor models. Our methods are demonstrated
69: on randomly generated data, and then benchmarked by the performance
70: of balanced portfolios using real historical financial returns. An
71: accompanying {\sf R} package called {\tt monomvn}, containing code
72: implementing the estimators described herein, has been made freely
73: available on CRAN.
74:
75: \bigskip
76: \noindent {\bf Key words:} financial time series, monotone missing
77: data, maximum likelihood, ridge regression, principal component
78: regression, partial least squares, lasso, factor models
79: \end{abstract}
80:
81: \section{Introduction}
82: \label{sec:intro}
83:
Missing data, and hence the question of whether one should discard
part of the data or try to estimate its characteristics, is common
in statistical analysis. The missing observation problem varies in
87: style, depending on the type of data. One example is random
88: missingness, which may stem from erroneous data
89: \citep{dempster:laird:rubin:1977}. In financial returns data
90: analysis, however, one problem stands out, which we will refer to as
91: monotone missingness. This happens when the assets of interest have
92: different lengths of historical financial data, e.g., stock prices and
93: returns. There are several possible ways of dealing with this type of
94: incomplete dataset. One way is by utilizing the portion of data
95: available across all of the assets. Another approach involves
96: estimating the missing portion, called {\em imputation}
97: \citep[e.g.,][]{little:rubin:2002}. A third approach is the focus of
98: this paper.
99:
100: Aside from some glitches in data, which will typically give rise to
101: unrealistic spikes or random missingness in data, the monotone style
102: of missingness that permeates financial historical returns data can be
103: grouped into two patterns. The first is where the histories of assets
104: differ due to the fact that they have started being publicly traded at
105: different times. The second is where assets close for various reasons,
106: including corporate actions such as M\&A (Merger and Acquisition)
107: activities, or liquidation due to bankruptcy. Both are critical
108: problems to address when conducting a multivariate analysis. In this
109: paper, we shall focus mainly on the former. This is sensible for the
110: application to portfolio balancing that we have in mind, since one is
111: naturally restricted to purchasing shares of companies which have
survived up to the current point in time. The latter type of missingness,
in the absence of the former, can be handled similarly, but it is not
114: immediately clear how this would be useful for portfolio balancing.
115: Handling both types of monotone missingness jointly, and other types
116: of approximately monotone missingness, requires the method of data
117: augmentation \citep{schafer:1997,little:rubin:2002}. This could
118: potentially be useful for a descriptive analysis, but is beyond the
119: scope of this paper.
120:
121: Data with arbitrary missingness patterns typically require specialized
122: iterative (even stochastic) estimation algorithms that can be slow and
123: cumbersome to implement. However, data which follow a monotone
124: missingness pattern lead to a likelihood which has a convenient
125: factorization. If we further assume that asset returns are
126: multivariate normally distributed (MVN), with histories at least as
127: long as the total asset count, then maximum likelihood (ML) estimators
128: are easily obtained by performing repeated ordinary least squares
129: (OLS) regressions, one for each asset. In the finance literature,
130: this approach is usually attributed to \cite{stambaugh:1997}, but it
131: was first described by \cite{andersen:1957} and has since been
132: discussed in many texts (see Section \ref{sec:monotone}). The method
133: fails when there are more assets than historical returns. In this
134: case the OLS regressions become unstable due to rank--deficient design
135: matrices. This is sometimes called the ``big $p$ small $n$'' problem.
136: It has recently received much attention in the statistics community,
137: with ready applications in bioinformatics and genomics, for example.
138: In the context of estimation for data with a monotone missingness
139: pattern, it can severely limit applicability to cases with a small to
140: modest level of missingness.
141:
142: In financial applications, where there may be more assets than there
143: are historical price observations for (some of) the assets, this
144: essentially means that the method cannot be applied on the full set of
145: assets of interest. This paper explores remedies to this problem. We
146: aim to develop a method that can be applied in settings where some
147: assets have histories which are shorter than the total number of
148: assets, and even when there are more assets than observations. In
149: short, our solution involves replacing OLS with ``parsimonious
150: regressions'' that either make a change of basis, as in principal
151: components or partial least squares regression, or apply shrinkage,
152: like ridge regression or the lasso. This enables the estimation of
153: covariances between large sets of assets with histories of essentially
154: arbitrary (and uneven) length. Even in situations where OLS would
155: have been sufficient, we find that the more parsimonious approach can
156: offer improvements in accuracy and interpretation.
157:
158: The parsimonious approach also motivates novel ways of exploiting {\it
159: factor} information, e.g., the value--weighted market index, size,
160: and book--to--market factors \citep{famafrench:1993}. Traditionally,
161: factor models require the restrictive assumption that assets are
162: independent given the factors. This underlying assumption can be
163: thought of as a specific type of parsimony. We show how one can use
164: the data to decide which independence constraints are reasonable, by
165: incorporating the factors into our proposed framework, and furthermore
how this may be accomplished even under conditions of monotone
167: missingness in the historical returns {\em and} factors.
168:
169: The remainder of the paper is organized as follows. Section
170: \ref{sec:monotone} defines the monotone pattern for missing data,
171: derives the corresponding factorized likelihood, and gives an
algorithm of repeated regressions to analytically find an ML estimator
173: for the case where the sampling distribution is assumed to be MVN.
174: Section \ref{sec:bpsn} outlines methods for dealing with the ``big $p$
175: small $n$'' problem in the context of regression with transformed
176: inputs and shrinkage estimators. We highlight the benefits of
177: increased applicability, accuracy, and interpretability obtained with
178: these methods. Section \ref{sec:monomvn} gives the details of an
179: algorithm---for MVN data under a monotone missingness pattern---that
180: combines the method in Section \ref{sec:monotone} with the
181: parsimonious regressions in Section \ref{sec:bpsn}. We explain how the
182: method can easily integrate factor information, generating a model
183: that essentially mixes factor models with estimators that account for
184: the direct dependency between returns. We then briefly describe an
185: implementation which has been made freely available as an {\sf R}
186: package called {\tt monomvn}. Section \ref{sec:results} shows the
187: method in action on synthetic data and real financial data with large
188: numbers of assets having histories of highly varying length. Our
189: results are benchmarked against several standard comparators in the
190: context of covariance estimation and portfolio balancing, and are
191: accompanied by comments on interpretation, efficiency, and on the
192: (benign) consequences of using a method that leverages an MVN
assumption when that assumption is not believed to hold.
194: Finally, we conclude with a discussion in Section \ref{sec:discuss}
195: that focuses on some of the limitations inherent in taking a maximum
196: likelihood approach.
197:
198:
199: \section{Multivariate normal monotone missing data}
200: \label{sec:monotone}
201:
Let $\mb{Y}$ be an $n \times m$ matrix of random observations $Y_{i,j}$
203: which may not be completely observed. Denote $y_{i,j} = \NA$ if the
204: \ith{i} sample of the \ith{j} covariate is missing. In other words,
205: if the columns of a sampled $\mb{Y}$: $y_{:,1},\dots, y_{:,m}$,
206: represent a historical return series of assets indexed by $j$ and a
207: return for asset $j$ is not available at time $i$, then $y_{i,j} =
208: \NA$. Observed $\mb{Y}$ are said to follow a {\em monotone
209: missingness pattern} [e.g., \citep[][Section 6.5.1]{schafer:1997} or
210: \citep[][Section 7.4]{little:rubin:2002}] if the columns can be
211: arranged so that $y_{i,j} \ne \NA$ whenever $y_{i,j+1} \ne \NA$.
212: \begin{figure}[ht!]
213: \centering
214: \input{mono.pstex_t}
215: \caption{Diagram of a monotone missingness pattern with $m=6$
216: covariates, with a maximum of $n$ completely observed samples in
217: $\mb{y}_1=y_{:,1}$.}
218: \label{f:mono}
219: \end{figure}
220: Figure \ref{f:mono} illustrates this property diagrammatically. The
221: row dimension $n$, of $\mb{Y}$, is equal to the number of completely
222: observed samples $n_1$ of $\mb{y}_1 \equiv y_{:,1}$, the maximally
223: observed column. Similarly, let $\mb{y}_j \equiv y_{1:n_j,j}$ collect
224: the complete data in the \ith{j} column of $\mb{Y}$, so that $n_j \geq
225: n_{j+1}$.
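
As a small illustration (the dimensions are chosen purely for
exposition and are not taken from the data analyzed later), consider
$n=4$ samples of $m=3$ assets with $n_1=4$, $n_2=3$ and $n_3=2$:
\[
\mb{Y} = \begin{pmatrix}
  y_{1,1} & y_{1,2} & y_{1,3} \\
  y_{2,1} & y_{2,2} & y_{2,3} \\
  y_{3,1} & y_{3,2} & \NA \\
  y_{4,1} & \NA & \NA
\end{pmatrix}.
\]
Every observed entry has observed entries to its left, so the pattern
is monotone, with $\mb{y}_2 = (y_{1,2}, y_{2,2}, y_{3,2})^\top$ and
$\mb{y}_3 = (y_{1,3}, y_{2,3})^\top$. Swapping columns 2 and 3 would
violate the defining condition in the third row, which is why the
columns are arranged in order of decreasing history length.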
226:
227: The monotone missingness patterns considered in this paper are assumed
to be {\em missing completely at random} (MCAR) in that the pattern of
missingness depends on neither the observed nor the unobserved responses.
230: Note that there may be columns with identical missingness patterns.
231: In the case of asset return series with observed histories going back
232: different amounts of time, the MCAR assumption may be tenuous, but it
233: is commonly asserted anyway \citep[e.g.,][]{stambaugh:1997}. In our
234: notation, the time index ($t$) for an asset's return history would run
counter to $i$, the index of the rows of $\mb{Y}$; i.e., $t=n-i+1$, as
236: also illustrated in Figure \ref{f:mono}.
237:
238: %For parameters $\bm{\theta}=(\bm{\theta}_1,\dots,\bm{\theta}_m)$,
239: When the missing data pattern is monotone, the likelihood $f(\mb{Y}|
240: \bm{\theta})$ can generally be factorized by exploiting an auxiliary
241: parameterization $\bm{\phi}=(\bm{\phi}_1, \dots, \bm{\phi}_m)$:
242: \[
243: f(\mb{Y}|\bm{\theta}) = f(\mb{y}_1|\bm{\phi}_1)
244: f(\mb{y}_2|\mb{y}_1,\bm{\phi}_2)
f(\mb{y}_3|\mb{y}_1,\mb{y}_2,\bm{\phi}_3) \cdots f(\mb{y}_m |
\mb{y}_1,\dots,\mb{y}_{m-1},\bm{\phi}_m),
247: \]
248: together with a mapping $\bm{\phi} = \Phi(\bm{\theta})$.
249: With the appropriate conditioning, the $y_{i,j}$ are assumed to be
250: independent and identically distributed (i.i.d.), so that
251: \begin{equation}
  f(\mb{y}_j | \mb{y}_1,\dots, \mb{y}_{j-1}, \bm{\phi}_j) = \prod_{i=1}^{n_j}
  f(y_{i,j}|y_{i,1},\dots, y_{i,j-1}, \bm{\phi}_j). \label{eq:iidlik}
254: \end{equation}
We are concerned with the case where the $(y_{i,1},\dots, y_{i,m})$
follow a multivariate normal distribution (MVN) so that the likelihood
in (\ref{eq:iidlik}) also follows an MVN with constant variance and a
258: mean linear in $y_{i,1},\dots, y_{i,j-1}$. The i.i.d.~and MVN
259: assumptions may be less than ideal for financial returns data
260: \citep[e.g.,][]{mills:1927}, but we note that these are common
261: simplifying assumptions \citep{stambaugh:1997,ckl:1999,jagma:2003}
262: because they lead to tractable inference and compare favorably (see
263: Section \ref{sec:results} for results and further discussion).
264: Maximum likelihood estimators (MLEs) of $\bm{\theta}_j = (\mu_j,
265: \bm{\Sigma}_{1:j,j})$, $j=2,\dots,m$, can then be obtained by
266: regression on the complete data:
267: \begin{align}
268: \mb{y}_j &= \mb{Y}_j \bm{\beta}_j + \bm{\epsilon}_j, &
269: \{\epsilon_{i,j}\}_{i=1}^{n_j} &\stackrel{\mbox{\tiny i.i.d.}}{\sim}
270: N(0,\sigma_j^2) \label{eq:monoreg}
271: \end{align}
272: where $\bm{\beta}_j^\top = (\beta_{0,j}, \beta_{1,j}, \dots,
273: \beta_{(j-1),j})$ and $\mb{Y}_j \equiv \mb{Y}_{0:(j-1)}^{(n_j)}$ is
274: the $n_j \times j$ design matrix
275: \[
276: \mb{Y}_j \equiv \mb{Y}_{0:(j-1)}^{(n_j)} = \begin{pmatrix}
277: 1 & y_{1,1} & \cdots & y_{1,(j-1)} \\
278: 1 & y_{2,1} & \cdots & y_{2,(j-1)} \\
279: \vdots & \vdots & \ddots & \vdots \\
280: 1 & y_{n_j,1} & \cdots & y_{n_j, (j-1)}
281: \end{pmatrix}
282: \]
283: containing an intercept column, and the first $n_j$ observations of
284: the first $j-1$ columns of $\mb{Y}$. So the auxiliary parameters
285: used in (\ref{eq:monoreg}) are $\bm{\phi}_j = (\bm{\beta}_j,
286: \sigma_j^2)$.
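
For reference, the mapping $\Phi$ can be written explicitly using the
standard formula for conditional distributions under the MVN. The
display below is a textbook identity restated in the present notation,
not a new result. Writing $\bm{\Sigma}_{<j} \equiv
\bm{\Sigma}_{1:(j-1),1:(j-1)}$ and $\bm{\mu}_{<j} \equiv
\bm{\mu}_{1:(j-1)}$ as shorthand (introduced only for this display),
\begin{align*}
  \bm{\beta}_{1:(j-1),j} &= \bm{\Sigma}_{<j}^{-1} \bm{\Sigma}_{1:(j-1),j}, \\
  \beta_{0,j} &= \mu_j - \bm{\beta}_{1:(j-1),j}^\top \bm{\mu}_{<j}, \\
  \sigma_j^2 &= \Sigma_{j,j} - \bm{\Sigma}_{1:(j-1),j}^\top
  \bm{\Sigma}_{<j}^{-1} \bm{\Sigma}_{1:(j-1),j}.
\end{align*}
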
287: \begin{figure}[ht!]
288: \centering
289: \input{mono_regress.pstex_t}
290: \caption{Diagram of the design matrix $\mb{Y}_5$ (without an intercept
291: term) and the response vector $\mb{y}_5$ for the fifth regression
292: involved in maximizing the likelihood of MVN data under a monotone
293: missingness pattern with $m=6$ covariates.}
294: \label{f:monoreg}
295: \end{figure}
296: Figure \ref{f:monoreg} diagrams the design matrix (without the
297: intercept term) and response vector involved in one such regression.
298: When $\mathrm{rank}(\mb{Y}_j) = j$, and particularly when $n_j > j$,
299: MLEs $\hat{\bm{\phi}}_j$ are obtainable via the straightforward
300: calculation:
301: \begin{align}
302: \hat{\bm{\beta}}_j &= (\mb{Y}_j^\top \mb{Y}_j)^{-1} \mb{Y}_j^\top \mb{y}_j &
303: \mbox{and} &&
304: \hat{\sigma}^2_j &= \frac{1}{n_j} ||\mb{y}_j - \mb{Y}_j \hat{\bm{\beta}}_j||^2
  = \frac{1}{n_j} \sum_{i=1}^{n_j} (y_{i,j}
  - (\mb{Y}_j)_{i,:}\, \hat{\bm{\beta}}_j)^2.
307: \label{eq:regress}
308: \end{align}
309: Then,
starting with $\hat{\bm{\theta}}_1$ comprising $\hat{\mu}_1 =
311: \sum_{i=1}^{n_1} y_{i,1}/{n_1}$, and $\hat{\Sigma}_{1,1} =
312: \sum_{i=1}^{n_1} (y_{i,1} - \hat{\mu}_1)^2/{n_1}$, each
313: $\hat{\bm{\theta}}_j$ can be estimated conditional on
314: $\hat{\bm{\theta}}_{1:(j-1)} = (\hat{\bm{\mu}}_{1:(j-1)}^\top,
315: \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)})$ and estimates of
316: $\hat{\bm{\beta}}_j$ and $\hat{\sigma}^2_j$ as \citep{stambaugh:1997}:
\begin{align}
  \hat{\mu}_j &= \hat{\beta}_{0,j} + \hat{\bm{\beta}}_{1:(j-1),j}^\top
  \hat{\bm{\mu}}_{1:(j-1)}
  &\hspace{-0.075cm} \mbox{and}&&
  \hat{\bm{\Sigma}}_{1:j,j}
  &= \begin{pmatrix}
    \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)} \hat{\bm{\beta}}_{1:(j-1),j} \\
    \hat{\sigma}^2_j + \hat{\bm{\beta}}_{1:(j-1),j}^\top
    \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)} \hat{\bm{\beta}}_{1:(j-1),j}
    \end{pmatrix},
    \label{eq:addy}
\end{align}
329: thus implicitly describing the mapping $\Phi^{-1}$ back to
330: $\bm{\theta}_j$--space. Observe that we do not use a bias--corrected
331: estimator for $\sigma_j^2$ in (\ref{eq:regress}), i.e., with $n_j-j$
332: instead of $n_j$ in the denominator, to ensure that ML estimates
$\hat{\bm{\theta}}$ are obtained \citep[][p.~224]{schafer:1997}.
334: However, we have found it to be beneficial in practice to use $n_j-1$
335: in the denominator as is typical in obtaining unbiased estimates of
336: covariance matrices in the complete data case.
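
To make the update concrete, consider a hypothetical $j=2$ step with
numbers invented purely for illustration. Suppose $\hat{\mu}_1 = 0.01$
and $\hat{\Sigma}_{1,1} = 0.04$, and that the regression of $\mb{y}_2$
on the first $n_2$ entries of $\mb{y}_1$ gives $\hat{\beta}_{0,2} =
0.002$, $\hat{\beta}_{1,2} = 0.5$ and $\hat{\sigma}_2^2 = 0.01$. Then
(\ref{eq:addy}) yields
\begin{align*}
  \hat{\mu}_2 &= 0.002 + 0.5 \times 0.01 = 0.007, \\
  \hat{\Sigma}_{1,2} &= 0.5 \times 0.04 = 0.02, \\
  \hat{\Sigma}_{2,2} &= 0.01 + 0.5^2 \times 0.04 = 0.02,
\end{align*}
so that the implied correlation between the two assets is
$0.02/\sqrt{0.04 \times 0.02} \approx 0.71$. Note that $\hat{\mu}_2$
depends on $\hat{\mu}_1$, and therefore on all $n_1$ observations of
the first asset, not only on the $n_2$ rows where both are observed.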
337:
338: When several columns $\mb{y}_\ell$, say $\ell=j_1,\dots,j_2$, have
339: equal lengths of observed histories $n_\ell$, it is typical to use a
340: multivariate regression $(\mb{y}_{j_1} \; \cdots \; \mb{y}_{j_2}) =
341: \mb{Y}_{j_1} \bm{\beta}_{j_1:j_2} + \bm{\epsilon}_{j_1:j_2}$ to find
342: $\hat{\bm{\beta}}_{j_1:j_2}$ and the empirical variance--covariance
343: matrix $\hat{\mb{V}}_{j_1:j_2,j_1:j_2}$. Then, several
344: $\hat{\bm{\theta}}_{j_1:j_2}$ can be found at once by replacing
345: $\hat{\bm{\beta}}_j$ with $\hat{\bm{\beta}}_{j_1:j_2}$ and
346: $\hat{\sigma}_j^2$ with $\hat{\mb{V}}_{j_1:j_2,j_1:j_2}$ in
347: (\ref{eq:addy}). Importantly, if
348: $\hat{\bm{\Sigma}}_{1:(j_1-1),1:(j_1-1)}$ and
349: $\hat{\mb{V}}_{j_1:j_2,j_1:j_2}$ are positive definite, then
350: $\hat{\bm{\Sigma}}_{1:j_2,1:j_2}$ will be positive definite as well
351: \citep{stambaugh:1997}.
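
One way to see why positive definiteness is preserved, sketched here
for the single--column case with the shorthand $\mb{S} \equiv
\hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)}$ and $\mb{b} \equiv
\hat{\bm{\beta}}_{1:(j-1),j}$ (both introduced only for this display),
is through the Schur complement of $\mb{S}$ in the updated matrix:
\[
\hat{\bm{\Sigma}}_{1:j,1:j} = \begin{pmatrix}
  \mb{S} & \mb{S}\mb{b} \\
  \mb{b}^\top \mb{S} & \hat{\sigma}_j^2 + \mb{b}^\top \mb{S} \mb{b}
\end{pmatrix},
\qquad
(\hat{\sigma}_j^2 + \mb{b}^\top \mb{S} \mb{b})
- \mb{b}^\top \mb{S}\, \mb{S}^{-1} \mb{S} \mb{b} = \hat{\sigma}_j^2.
\]
A symmetric block matrix with a positive definite leading block is
positive definite exactly when this Schur complement is, so the update
preserves positive definiteness whenever $\hat{\sigma}_j^2 > 0$. The
same calculation with $\hat{\mb{V}}_{j_1:j_2,j_1:j_2}$ in place of
$\hat{\sigma}_j^2$ covers the multivariate case.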
352:
353: Calculating such MLEs requires having $n_j > j$ for all $j=1,\dots,m$.
354: That is, there cannot be an asset whose history is shorter than the
355: number of assets whose histories have greater length. If such were
356: the case, then $\mb{Y}_j$ would not be of full rank, and $\mb{Y}_j^\top
357: \mb{Y}_j$ could not be inverted in Eq.~(\ref{eq:regress}). This is
358: sometimes referred to in the literature as the problem of regression
359: with ``big $p$ [number of parameters] small $n$ [number of
observations]''. Numerical singularities may also arise whenever $n_j$ is
greater than, but nearly equal to, $j$, especially when $n$ and $m$ are
362: large. In the following section we illustrate how these difficulties
363: may be overcome by methods of subset selection, coefficient shrinkage,
364: or the use of principal components.
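
A toy case, with values invented for illustration, shows the failure
directly. Take $j=3$ and $n_3=2$, so that $\mb{Y}_3$ is $2 \times 3$:
\[
\mb{Y}_3 = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 3 & 1 \end{pmatrix},
\qquad
\mb{Y}_3^\top \mb{Y}_3 = \begin{pmatrix}
  2 & 4 & 3 \\ 4 & 10 & 5 \\ 3 & 5 & 5 \end{pmatrix},
\qquad
\det(\mb{Y}_3^\top \mb{Y}_3) = 2(25) - 4(5) + 3(-10) = 0.
\]
Since $\mathrm{rank}(\mb{Y}_3^\top \mb{Y}_3) \leq n_3 = 2 < 3$, the
determinant vanishes for any choice of entries, not just these.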
365:
366: \section{Parsimonious regression}
367: \label{sec:bpsn}
368:
369: In this section, we extract and focus on the subproblem of the linear
370: regression in (\ref{eq:monoreg}), in terms of a design matrix of $p$
371: predictor variables with an intercept term ($\mb{X} \equiv \mb{Y}_j$)
372: observed for $n$ cases, with corresponding responses ($\mb{y} \equiv
373: \mb{y}_j$, where $n \equiv n_j$):
374: \begin{align}
375: \mb{y} &= \mb{X} \bm{\beta} + \bm{\epsilon}, &
376: \{\epsilon_{i}\}_{i=1}^{n} &\stackrel{\mbox{\tiny i.i.d.}}{\sim}
377: N(0,\sigma^2).
378: \end{align}
Ordinary least squares (OLS) gives an MLE of $ \hat{\bm{\beta}} =
380: (\mb{X}^\top \mb{X})^{-1} \mb{X}^\top \mb{y}$. Classically, there are
381: two main reasons why one may desire a more parsimonious approach to
382: regression than that provided by OLS. The first is that OLS tends to
383: lead to high variance estimators. The second is a desire for model
384: fits that have high qualitative interpretability, i.e., that describe
385: the data adequately but assume no more causes than will account for
386: the effect. Our reasons for seeking an alternative are related to the
387: former more so than the latter. But, most importantly, we aim to
388: circumvent the problem of having linear dependence in the columns of
389: $\mb{Y}_j$ when $n_j \leq j$. In this case, we are faced with an
390: $n\times p$ design matrix $\mb{X}$ with number of columns $p$ greater
391: than the number of observations $n$, yielding an $\mb{X}^\top \mb{X}$
392: matrix that is singular and cannot be inverted---a so--called ``big
393: $p$ small $n$'' ($p > n$) problem. We may even have that $p \gg n$,
394: say, when the total number of assets $m$ is far greater than the
395: number of returns recorded for the asset with the shortest history.
396:
397: Popular solutions to this problem involve methods of variable
398: selection and coefficient shrinkage. Probably the most
399: straightforward method is {\em subset selection} \citep[][Section
400: 3.4.1]{hastie:tibsh:fried:2001} which aims to find the model with the
401: ``best'' size $k$ (i.e., with $k\in \{1,\dots,\min(p,n-1)\}$
402: covariates). ``Best'' can be defined in a number of ways, but
typically involves $t$-tests, or minimizing an estimate of expected
404: prediction error. Searching through all possible subsets quickly
405: becomes infeasible for $p>40$. Larger $p$ can be handled by greedy
406: methods, but these offer fewer guarantees. Such methods include {\em
407: forward stepwise selection} which starts in the null (intercept
408: only) model and sequentially adds predictors, and {\em backward
409: stepwise selection} which starts at the saturated model (only
410: applicable when $p<n$) and deletes predictors. Hybridizations also
411: exist.
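
To put the threshold of $p>40$ in perspective, a brute--force
enumeration over subsets of $p$ candidate predictors must consider
$2^p$ models, and
\[
2^{40} = 1{,}099{,}511{,}627{,}776 \approx 1.1 \times 10^{12}.
\]
Even at a purely hypothetical rate of one million model fits per
second, this would take about $1.1 \times 10^{6}$ seconds, or roughly
13 days, and each additional predictor doubles the count.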
412:
413: By discarding some predictors, subset selection methods can yield a
414: model which is more interpretable, and may have lower prediction
415: error. But this ``discrete'' process can produce estimators with high
416: variance. Shrinkage methods are a popular alternative. They are
417: hailed for being more ``continuous'', and in some special cases they
418: can have implicit behavior similar to methods like forward selection.
419: The following subsection considers the shrinkage methods of ridge
420: regression, and those related to the lasso. In Section \ref{sec:pc}
421: we consider another family of methods which are based on derived input
422: directions: principal components regression, which has connections to
423: ridge regression, and partial least squares regression. These are
424: handy when the predictors are highly correlated.
425:
426: The parsimonious regression methods outlined in this section have been
427: chosen for familiarity, computational tractability, and
428: implementation. In each case {\sf R} packages are
429: available on the Comprehensive {\sf R} Archive Network (CRAN),
430: \begin{center}
431: \verb!http://cran.R-project.org! \hspace{1cm} \citep{rproject},
432: \end{center}
433: \noindent which provide off--the--shelf implementations that will make
434: for nice subroutines within the framework of constructing estimators
435: for MVN data under monotone missingness. It is typical to first
436: standardize the inputs ($\mb{X}$ and $\mb{y}$) as the methods outlined
437: below are not equivariant under re-scaling.
438:
439: \subsection{Shrinkage methods: ridge regression, and the lasso}
440: \label{sec:ridge}
441:
442: {\em Ridge regression} and the {\em lasso} shrink the coefficients of
443: an OLS regression by imposing a penalty on their size:
444: \begin{equation}
445: \hat{\bm{\beta}}^{(q)} = \argmin_{\bm{\beta}}
446: \left\{\sum_{i=1}^n \left(y_i - \beta_0 -
447: \sum_{j=1}^p x_{ij} \beta_j\right)^2 +
448: \lambda \sum_{j=1}^p |\beta_j|^q\right\}
449: \label{eq:ridge:lasso}
450: \end{equation}
451: with $q=2$ for ridge regression, and $q=1$ for the lasso. The tuning
452: parameter $\lambda$ controls the amount of shrinkage. Notice that the
453: intercept ($\beta_0$) is left out of the penalty term. Solutions to
454: (\ref{eq:ridge:lasso}) can be obtained analytically in the case of
455: ridge regression with $\hat{\bm{\beta}}^{(2)} = (\mb{X}^\top \mb{X} +
456: \lambda \mb{I})^{-1} \mb{X}^\top \mb{y}$. Quadratic programming is
457: required for the lasso. Both methods have interpretations as Bayesian
458: {\em maximum a posteriori} (MAP) estimators after imposing particular
459: prior distributions. Other choices of $q>0$ are also possible,
460: however the constraint region for $0<q<1$ is non-convex, which makes
461: solving the optimization problem more difficult.
462:
463: For ridge regression, the penalty parameter ($\lambda$) is most
464: advantageously chosen by minimizing cross validation (CV) estimates of
465: predictive error. The commonly used HKB \citep{hkb:1975} and L--W
466: \citep{lw:1976} methods are computationally efficient, but require
467: that $p < n$ to fit an OLS. The implementation of ridge regression
468: used in this paper comes from the {\tt MASS} library \citep{mass:2002}
469: for {\sf R} in the form of a function called {\tt lm.ridge}.
470:
471: Though the form of ridge regression and the lasso are similar, there
472: are several important differences. A large $\lambda$ will cause the
473: ridge estimator $\hat{\bm{\beta}}^{(2)}$ to have many coefficients
474: shrunk towards zero. The lasso estimator $\hat{\bm{\beta}}^{(1)}$ has
a similar effect, but, importantly, may contain many coefficients
476: which are exactly zero---something which is only possible for $0 < q
477: \leq 1$. In the Bayesian interpretation, setting $q\leq 1$
478: corresponds to choosing a prior which concentrates more mass on small
479: $|\beta_j|$, with the most on $\beta_j = 0$. In this way, the lasso
480: implements a kind of continuous subset selection. As $\lambda$ is
481: increased, the $|\beta_j|$ decrease, eventually increasing the number
482: of them which are identically zero, though this relationship need not
483: be strictly monotonic.
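
The contrast is easiest to see in a special case, assumed here only
for illustration, of centered data (so that the intercept drops out)
and a design matrix with orthonormal columns, $\mb{X}^\top \mb{X} =
\mb{I}$. Writing $b_j$ for the \ith{j} OLS coefficient, the criterion
in (\ref{eq:ridge:lasso}) separates across coordinates into terms of
the form $(b_j - \beta_j)^2 + \lambda |\beta_j|^q$ plus a constant,
giving
\begin{align*}
  \hat{\beta}_j^{(2)} &= \frac{b_j}{1+\lambda}, &
  \hat{\beta}_j^{(1)} &= \mathrm{sign}(b_j)
  \max\left(|b_j| - \lambda/2,\, 0\right).
\end{align*}
For example, with $\mb{b} = (3, 0.4)^\top$ and $\lambda = 1$, the ridge
estimate is $(1.5, 0.2)^\top$ while the lasso estimate is $(2.5,
0)^\top$. Ridge rescales every coefficient proportionally, whereas the
lasso translates each towards zero and sets the small one exactly to
zero.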
484:
485: The implementation of lasso used in this paper is contained in the
486: {\tt lars} package for {\sf R} \citep{lars:2007}. \cite{efron:2004}
487: show how the lasso, and two methods called {\em stepwise} and {\em
488: forward stagewise}, are special cases of their method of {\em least
  angle regression} (LARS). LARS can calculate all possible lasso
490: estimators with computational effort in the same order of magnitude as
491: OLS regression applied to the full set of covariates. CV can be used
492: to select the final model, e.g., using the ``one--standard--error''
493: rule \citep[][Section 7.10]{hastie:tibsh:fried:2001}, or a more
494: thrifty $C_p$ \citep{mallows:1973} method can be used, but only when
495: $p < n$. When applicable, the $C_p$ method performs nearly as well as
496: CV within the MVN setting with monotone missingness.
497: \cite{madigan:ridgeway:2004} come to similar conclusions on equally
498: tame benchmarks. However, $C_p$ has also been criticized for
499: preferring large models \citep{ishwaran:2004,stine:2004} and for being
500: slightly at odds with LARS \citep{loubes:massart:2004}. Since we are
501: mostly interested in applying LARS methods (i.e., lasso) when OLS is
502: not applicable, i.e., when $p \geq n$, we shall generally rely on CV
503: to select the final model.
504:
505: \subsection{Principal components and partial least squares regression}
506: \label{sec:pc}
507:
508: In situations where there are a large number of highly correlated
509: inputs, a decomposition by principal components (PCs) can be used to
510: select a small number of linear combinations of the original inputs to
511: be used in place of $\mb{X}$. The related methods of principal
512: component regression (PCR) and partial least squares regression (PLSR)
513: start by performing an orthogonal decomposition of $\mb{X}$, but
514: differ in how the linear combinations are constructed.
515:
516: In PCR, {\em singular value decomposition} (SVD) is performed on
517: $\mb{X}$, i.e., $\mb{X} = (\mb{U} \mb{D}) \mb{V}^\top =
518: \mb{T}\mb{P}^\top$, where $\mb{U}$ is an $n \times p$ matrix of left
519: singular vectors describing the ``output basis'', $\mb{D}$ is a
diagonal matrix containing the corresponding singular values (the
square roots of the eigenvalues of $\mb{X}^\top \mb{X}$) in non-increasing order, $\mb{V}$ is
522: a $p \times p$ matrix of right singular vectors describing the ``input
523: basis'', and $\mb{T}$ and $\mb{P}$ are the so--called {\em scores} and
524: {\em loadings} defined by the decomposition. Next, $\mb{y}$ is
525: regressed on the first $k$ PCs, i.e., the scores $\mb{T}_{(k)}$, where
526: the $(k)$ subscript indicates the extraction of the first $k$ columns
527: of $\mb{T}$, i.e., the first $k$ columns of $\mb{U}$, $\mb{V}$, and
528: the first $k$ rows/cols of $\mb{D}$. Since the columns of $\mb{T}$
529: are orthogonal, the solution is just a sum of univariate regressions.
530: Importantly, the solution can then be written in terms of the
531: coefficients on the predictors in the columns of $\mb{X}$,
532: \begin{align}
533: \mbox{(arbitrary scores and loadings)} && \hat{\bm{\beta}}(k) &=
534: \label{eq:preg}
535: \mb{P}_{(k)} (\mb{T}_{(k)}^\top \mb{T}_{(k)})^{-1} \mb{T}_{(k)}^\top \mb{y} \\
536: \mbox{(from SVD on $\mb{X}$)} && \hat{\bm{\beta}}^{\mbox{\tiny pcr}}(k) &
537: =\mb{V}_{(k)} \mb{D}_{(k)}^{-1} \mb{U}_{(k)}^\top \mb{y}, \nonumber
538: \end{align}
539: a vector of length $p$. When $k=p < n$, the coefficients in
540: (\ref{eq:preg}) are identical to those obtained by OLS. There are many
541: ways of choosing how many components ($k$) to keep in the final model.
542: One way is to consider the relative sizes of the eigenvalues as a
543: proportion of the variation explained by each principal component, and
544: then choose $k$ so that 80--90\% of the variation is explained. A
545: less ad hoc and more reliable---but more computationally
546: intensive---method that can be applied even when $p \geq n$ involves
547: using CV to estimate predictive error in order to find $k \in
548: \{1,\dots,\min(p,n-1)\}$.
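
The second line of (\ref{eq:preg}) follows from the first by
substituting $\mb{T}_{(k)} = \mb{U}_{(k)} \mb{D}_{(k)}$ and
$\mb{P}_{(k)} = \mb{V}_{(k)}$ and using $\mb{U}_{(k)}^\top \mb{U}_{(k)}
= \mb{I}$. The intermediate steps, spelled out here for convenience
and assuming that the first $k$ singular values are nonzero, are
\begin{align*}
  \mb{P}_{(k)} (\mb{T}_{(k)}^\top \mb{T}_{(k)})^{-1} \mb{T}_{(k)}^\top \mb{y}
  &= \mb{V}_{(k)} (\mb{D}_{(k)} \mb{U}_{(k)}^\top \mb{U}_{(k)}
  \mb{D}_{(k)})^{-1} \mb{D}_{(k)} \mb{U}_{(k)}^\top \mb{y} \\
  &= \mb{V}_{(k)} \mb{D}_{(k)}^{-2} \mb{D}_{(k)} \mb{U}_{(k)}^\top \mb{y}
  = \mb{V}_{(k)} \mb{D}_{(k)}^{-1} \mb{U}_{(k)}^\top \mb{y}.
\end{align*}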
549:
550: PLSR, by contrast, aims to incorporate information about both $\mb{X}$
551: and $\mb{y}$ in the scores and loadings---which in this context are
552: often called {\em latent variables} (LVs)---by proceeding iteratively.
553: The method is initialized with the SVD of $\mb{X}^\top \mb{y}$,
554: thereby including information about the correlation between, and the
555: variance within, $\mb{X}$ and $\mb{y}$. The scores and loadings
556: obtained by PLSR optimally capture the covariance between $\mb{X}$ and
557: $\mb{y}$, whereas PCR concentrates only on the variance of $\mb{X}$
558: \citep{dejong:1993}. There are several algorithms for obtaining the
559: scores and loadings, but once obtained, the regression coefficients
560: $\hat{\bm{\beta}}^{\mbox{\tiny plsr}}(k)$ in $\mb{X}$-space are
561: recovered by following (\ref{eq:preg}), and CV can be similarly used
562: to pick $k$.
563:
564: In situations where a minor component of $\mb{X}$ is highly correlated
565: with $\mb{y}$, PLSR may have a significant advantage over PCR.
566: Otherwise, the methods have a more or less comparable performance
567: record despite a few operational differences---e.g., PLSR usually
568: needs fewer LVs, but can also yield higher variance estimators of the
569: regression coefficients. Both have behavior similar to other
570: shrinkage methods, particularly ridge regression. For example, it can
571: be shown \citep{frank:fried:1993} that ridge regression shrinks the
572: coefficients of principal components by a factor of
573: $d_j^2/(d_j^2+\lambda)$, where the $d_j$ are from the diagonal of
574: $\mb{D}$, whereas PCR truncates them at $k$.
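
The contrast between the two shrinkage profiles can be made concrete
with a small Python sketch. The singular values and the value of
$\lambda$ below are made up for illustration, and both function names
are hypothetical.
\begin{verbatim}
```python
def ridge_factors(d, lam):
    """Ridge shrinkage factor d_j^2 / (d_j^2 + lambda) per component."""
    return [dj ** 2 / (dj ** 2 + lam) for dj in d]

def pcr_factors(d, k):
    """PCR keeps the first k components untouched and drops the rest."""
    return [1.0 if j < k else 0.0 for j in range(len(d))]

d = [3.0, 2.0, 1.0, 0.5]            # illustrative singular values, largest first
ridge = ridge_factors(d, lam=1.0)   # smooth shrinkage of every component
pcr = pcr_factors(d, k=2)           # hard truncation after two components
```
\end{verbatim}
Ridge regression damps every component, most strongly those with small
$d_j$, whereas PCR applies an all--or--nothing rule.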
575:
576: An {\sf R} package called {\tt pls} \citep{heige:2007} provides a
577: unified implementation of PCR and three algorithms for PLSR
578: \citep{dayal:macg:1997,dejong:1993,martens:naes:1989}, together with
579: built--in facilities for estimating $k$ via CV.
580:
581:
582: \section{The {\tt monomvn} algorithm}
583: \label{sec:monomvn}
584:
So long as $n_j > j$ for all $j=1,\dots,m$, and $n_j \geq n_{j+1}$, an
586: algorithm for finding the parameters $\bm{\mu}$ and $\bm{\Sigma}$ that
587: maximize the MVN likelihood for monotone missing data proceeds as
588: outlined in Section \ref{sec:monotone}. Initialize $\mu_1$ and
589: $\Sigma_{11}$ to the sample mean and variance of the first column
590: $\mb{y}_1$ of $\mb{Y}$, then iterate through the following steps for
591: $j=2,\dots,m$:
592: \begin{enumerate}
593: \item Find the MLEs (\ref{eq:regress}) of $\bm{\beta}_j$ and
594: $\sigma_j^2$ in a regression (\ref{eq:monoreg}) of $\mb{y}_j$ onto the first
595: $j-1$ columns of $\mb{Y}$ (as predictors), using only the first
596: $n_j$ observations;
597: \item Obtain the MLEs of $\mu_j$ and $\mb{\Sigma}_{(1:j),j}$
598: from $\hat{\bm{\mu}}_{1:(j-1)}$, $\hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)}$,
599: $\hat{\bm{\beta}}_j$ and $\hat{\sigma}^2_j$ as in (\ref{eq:addy}).
600: \end{enumerate}
If any $n_j \leq j$, then we have a ``big $p$ small $n$'' problem, and
the standard regression in step 1 above cannot be performed. In
practice, even when $n_j > j$ the columns of the design matrix may
fail to be linearly independent, so that it is not of full rank. This
becomes increasingly likely as $j$ approaches $n_j$, where finite
(double--precision) computer representations can render the design
matrix numerically rank deficient. Both issues are addressed simultaneously
609: by instead performing one of the parsimonious regressions outlined in
610: Section \ref{sec:bpsn}. Then step 2 can proceed as usual. Observe
611: that this approach also enables estimation when there are more assets
612: than historical returns ($m > n$).
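
The two steps above can be sketched in Python (with NumPy) for the
OLS--only case $n_j > j$. This is an illustrative sketch, not the
{\tt monomvn} implementation: it assumes that missing entries are
coded as NaN, that the columns are already ordered so that the
observed part of column $j$ is its first $n_j$ rows, and that the
update (\ref{eq:addy}) takes the standard form $\hat{\mu}_j =
\hat{\beta}_0 + \hat{\bm{\beta}}^\top \hat{\bm{\mu}}_{1:(j-1)}$,
$\hat{\bm{\Sigma}}_{1:(j-1),j} = \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)}
\hat{\bm{\beta}}$ and $\hat{\Sigma}_{jj} = \hat{\sigma}_j^2 +
\hat{\bm{\beta}}^\top \hat{\bm{\Sigma}}_{1:(j-1),1:(j-1)}
\hat{\bm{\beta}}$. The function name {\tt monotone\_mle} is
hypothetical.
\begin{verbatim}
```python
import numpy as np

def monotone_mle(Y):
    """ML estimates of (mu, Sigma) for monotone-missing Y (NaN = missing).

    Assumes columns are ordered so that n_1 >= n_2 >= ... and that the
    observed part of column j is its first n_j rows, with n_j > j.
    """
    n, m = Y.shape
    mu = np.zeros(m)
    S = np.zeros((m, m))
    n1 = int(np.sum(~np.isnan(Y[:, 0])))
    mu[0] = Y[:n1, 0].mean()
    S[0, 0] = Y[:n1, 0].var()             # ML variance (divides by n_1)
    for j in range(1, m):
        nj = int(np.sum(~np.isnan(Y[:, j])))
        # Step 1: OLS of column j on an intercept and the earlier columns.
        X = np.column_stack([np.ones(nj), Y[:nj, :j]])
        coef = np.linalg.lstsq(X, Y[:nj, j], rcond=None)[0]
        resid = Y[:nj, j] - X @ coef
        s2 = resid @ resid / nj           # ML residual variance
        b0, b = coef[0], coef[1:]
        # Step 2: extend the mean vector and covariance matrix.
        mu[j] = b0 + b @ mu[:j]
        S[:j, j] = S[j, :j] = S[:j, :j] @ b
        S[j, j] = s2 + b @ S[:j, :j] @ b
    return mu, S

# Simulated example with two shortened histories (illustrative only).
rng = np.random.default_rng(1)
A = np.array([[1.0, 0.5, 0.2], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
Y_full = rng.standard_normal((30, 3)) @ A
Y = Y_full.copy()
Y[20:, 1] = np.nan
Y[12:, 2] = np.nan
mu_hat, S_hat = monotone_mle(Y)
mu_c, S_c = monotone_mle(Y_full)          # complete data: usual MLEs
```
\end{verbatim}
With no missing values the recursion returns the usual sample mean and
ML covariance, which is a convenient sanity check. Replacing the OLS
fit in step 1 by a parsimonious regression leaves step 2 unchanged.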
613:
614: \subsection{Choosing the parsimonious proportion}
615:
616: Even when parsimonious regression is not strictly necessary, it can
617: aid in interpretation, and possibly even yield more accurate and lower
618: variance estimators. The lasso and the other LARS methods can
619: choose to shrink $\bm{\beta}$ so that only the intercept term is
620: nonzero. This enables the detection of zeros in the MVN covariance
621: matrix $\bm{\Sigma}$. In other words, it can be used as a test, of
622: sorts, for independence between assets.
623:
624: Towards building a more efficient and interpretable estimator, one may
625: consider applying a parsimonious regression for every iteration of
626: step 1 above. This is explored further in Section \ref{sec:depend}.
Alternatively, one could determine a threshold, say $p$, representing
the proportion of columns to rows in the design matrix past which a
parsimonious regression is applied regardless. That is, a parsimonious
regression is used whenever $j \geq p\,n_j$, for $0\leq p\leq 1$.
Then, the $p=0$ case corresponds to
always using a parsimonious method, and $p=1$ reverts to applying one
only when necessary. In Section \ref{sec:parsi} we show how easy it
is to establish reliable rules of thumb for choosing $p$.
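
A minimal Python sketch of this switching rule follows. It reads $p$
as the ratio of columns to rows at which the switch occurs, consistent
with the caption of Table \ref{t:p}, so that $p=0$ always switches and
$p=1$ switches only when necessary. The function name
{\tt use\_parsimonious} is hypothetical.
\begin{verbatim}
```python
def use_parsimonious(n_j, j, p):
    """Return True if the regression for column j should be parsimonious.

    n_j: number of rows (observed returns) in the design matrix
    j:   number of columns in the design matrix
    p:   threshold on the column-to-row ratio, with 0 <= p <= 1
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return j >= p * n_j

always = use_parsimonious(n_j=100, j=5, p=0.0)       # p = 0: always switch
never_yet = use_parsimonious(n_j=100, j=5, p=1.0)    # p = 1: OLS still fine
forced = use_parsimonious(n_j=100, j=100, p=1.0)     # n_j <= j: necessary
```
\end{verbatim}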
634:
635: \subsection{Incorporating factors}
636: \label{sec:fact}
637:
638: A popular estimator for the covariance matrix of financial asset
639: returns involves using {\it factor models}. The essential idea behind
640: the factor model is to regress the observed returns $\mb{y}_j$ on
641: measured common market factors $\mb{F}$, and to derive a covariance
642: matrix of the returns as a function of the regression equations.
643:
644: For a factor space with $K$ factors, the model can be formalized as
645: follows. Each excess return $y_{i, j}$ is modeled by the regression
646: equation
647: \begin{equation}
648: \label{eq:factor-regression}
649: y_{i, j} = \lambda_{0, j} + \sum_{k = 1}^K \lambda_{k, j}f_{i, k} + \epsilon_{i, j}
650: \end{equation}
651: where each $\epsilon_{i, j}$ is a residual term independent of
652: $\mb{F}$. The residual terms for the $i^{\mbox{\tiny th}}$ instance
653: are assumed to follow a zero--mean MVN with diagonal covariance matrix
$\mb{D}$. For instance, a common one--factor model takes $f$ to be a
value--weighted market index \citep[e.g.,][]{ckl:1999}. A common
656: three--factor model augments the value--weighted market index with
657: size and book--to--market factors \citep{famafrench:1993}.
658:
659: Factors are assumed, for now, to be i.i.d.~and to follow a MVN with
660: $K\times K$ covariance matrix $\bm \Omega$. Let $\bm{\Lambda}$ be the
661: $K\times m$ matrix defined by the entries $\bm{\Lambda}_{k, j} =
662: \lambda_{k, j}$, for $k=1,\dots,K$. It follows that the covariance
663: matrix of the returns, as parameterized by $\{\bm{\Omega},
664: \bm{\Lambda}, \mb{D}\}$, is given by
665: \begin{equation}
666: \bm{\Sigma}^{(f)} = \bm{\Lambda}^\top \bm{\Omega}\bm{\Lambda} + \mb{D}.
667: \end{equation}
668: An estimate $\hat{\bm{\Sigma}}^{(f)}$ can therefore be obtained by
669: estimating each column $\hat{\bm{\lambda}}_j = (\lambda_{1, j},
670: \dots,\lambda_{K, j})^\top$ of $\hat{\bm{\Lambda}}$ by regressing
671: $\bm{y}_j$ on $\mb{F}$ with an intercept. The mean sum of squares of
672: the residuals of each regression forms the diagonal of $\hat{\mb{D}}$,
673: and the off--diagonal entries are zero. The estimate $\hat{\bm
674: \Omega}$ is the empirical covariance of the factors. Note that each
675: regression equation requires only the data observed for the particular
676: return $\mb{y}_j$, together with the corresponding observations for
677: the factor(s). However in practice, the method is applied only to
678: completely observed $\mb{Y}$ and $\mb{F}$.
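
For completely observed $\mb{Y}$ and $\mb{F}$ the estimator can be
sketched in a few lines of Python (with NumPy). The function name
{\tt factor\_covariance} and the simulated one--factor data are
hypothetical, and dividing by $n$ in both $\hat{\mb{D}}$ and
$\hat{\bm{\Omega}}$ is an assumption about the normalization.
\begin{verbatim}
```python
import numpy as np

def factor_covariance(Y, F):
    """Sigma^(f) = Lambda^T Omega Lambda + D from complete Y (n x m), F (n x K)."""
    n = Y.shape[0]
    X = np.column_stack([np.ones(n), F])            # intercept + factors
    coef = np.linalg.lstsq(X, Y, rcond=None)[0]     # (K+1) x m, one column per asset
    Lam = coef[1:]                                  # K x m loadings
    resid = Y - X @ coef
    D = np.diag((resid ** 2).mean(axis=0))          # mean squared residuals
    Omega = np.atleast_2d(np.cov(F, rowvar=False, bias=True))
    return Lam.T @ Omega @ Lam + D

# Simulated one-factor example (illustrative only).
rng = np.random.default_rng(2)
n, m, K = 200, 5, 1
F = 0.04 * rng.standard_normal((n, K))
Lam_true = rng.uniform(0.5, 1.5, size=(K, m))
Y = 0.01 + F @ Lam_true + 0.02 * rng.standard_normal((n, m))
Sigma_f = factor_covariance(Y, F)
```
\end{verbatim}
Because $\hat{\mb{D}}$ has a strictly positive diagonal, the result is
positive definite even when $m$ exceeds $n$.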
679:
680: The main underlying assumption is that returns are mutually
681: independent conditioned on the factors. If the number of factors is
682: considerably smaller than the number of returns, the model will be
683: parsimonious and the resulting $\hat{\bm{\Sigma}}^{(f)}$ will have
684: lower variance than the empirical covariance matrix. This assumption
685: allows for any missingness pattern, even the extreme one where no
686: joint observation of returns $\mb{y}_j$ and $\mb{y}_k$ exists. The
drawback is that the independence assumptions encoded in this model
might be unrealistic, in which case the resulting estimate will suffer
from a strong bias.
690:
691: Instead, we can use the data to find which independence assumptions
692: are adequate by integrating the factor model into the {\tt monomvn}
693: framework. Consider the {\it full} regression model, where we regress
694: $\mb{y}_j$ on $\mb{Y}_j$ and $\mb{F}_j \equiv\mb{F}_{1:(j-1)}^{(n_j)}$
695: simultaneously:
696: \begin{equation}
697: \label{eq:full-factor-regression}
698: \mb{y}_j = \mb{Y}_j \bm{\beta}_j + \mb{F}_j
699: \bm{\lambda}_j + \bm{\epsilon}_j,
700: \end{equation}
701: %where $\bm \beta_j^T = (\beta_{0, j}, \beta_{1, j}, \dots,
702: %\beta_{(j-1), j})$ as before, and $\bm{\lambda}_j = (\lambda_{1, j},
703: %\dots,\lambda_{K, j})$.
704: The $\lambda_{0, j}$ term does not appear because it is not
705: identifiable given the presence of $\beta_{0, j}$. Since this
706: formulation is in the same family of parameterizations of the original
707: models used in {\tt monomvn}, an analogous procedure applies with
minor pre- and post-processing. First shift the labels of the returns for
each asset by $K$ so that $\mb{y}_j$ becomes $\mb{y}_{j + K}$ and the
corresponding $\bm{\beta}_j$ becomes $\bm{\beta}_{j+K}$. Then map
$\mb{F}_k$ to $\mb{Y}_k$ and $\bm{\lambda}_k$ to $\bm{\beta}_k$.
Applying the recursion in Eq.~(\ref{eq:addy}) as usual gives
the estimates $\hat{\bm{\mu}}$ [an $(m+K)$ vector] and $\hat{\bm{\Sigma}}$
[an $(m+K)\times (m+K)$ matrix]. An estimate of the covariance matrix
of the asset returns can then be extracted from the bottom--right $m
\times m$ block of $\hat{\bm{\Sigma}}$, i.e.,
717: $\hat{\bm{\Sigma}}^{(f+m)} = \hat{\bm{\Sigma}}_{(K + 1):(m + K), (K +
718: 1):(m + K)}$. The superscript $(f+m)$ is meant to indicate
719: dependence on both factors and assets. Importantly, no internal
720: changes to the workings of the {\tt monomvn} algorithm are necessary.
721:
722: Observe that if the (parsimonious) regression method applied within
723: {\tt monomvn} uses OLS whenever regressing onto the factors, and sets
724: the regression coefficients to zero otherwise, then we obtain
725: $\hat{\bm{\Sigma}}^{(f+m)} = \hat{\bm{\Sigma}}^{(f)}$. In the context
726: of {\tt monomvn} we call this the ``factor--parsimony'' regression,
727: filling a role similar to PCR, lasso, etc. If required, the
728: covariance matrix of the factors can also be recovered as
729: $\hat{\bm{\Omega}} = \hat{\bm{\Sigma}}_{1:K,1:K}$. Also observe that,
730: within the {\tt monomvn} framework, it is possible to handle factors
731: with historical missingness.
732:
733: If, instead of the factor--parsimony method, any of the other methods
734: (outlined in Section \ref{sec:bpsn}) are used, then shrinkage is
735: applied to both $\bm \beta_j$ and $\bm{\lambda}_j$ in
736: (\ref{eq:full-factor-regression}). In this case we obtain a
737: generalization of the independence structure assumed in the classical
738: factor model, allowing the data (factors and returns) to determine the
739: appropriate mix of influence on the resulting estimator for
740: $\bm{\Sigma}$. It is interesting to point out the link between this
741: generalized factor model (\ref{eq:full-factor-regression}) resulting
742: in $\hat{\bm{\Sigma}}^{(f+m)}$, and the optimal shrinkage estimator of
743: \citet{ledoit:2002}:
744: \begin{equation}
745: \hat{\bm{\Sigma}}^{(\ell)} = \alpha \hat{\bm{\Sigma}}^{(f)} +
746: (1 - \alpha)\hat{\bm{\Sigma}}^{(c)}, \;\;\;\;\; \mbox{for } \alpha \in [0, 1].
747: \label{eq:ledoit}
748: \end{equation}
Here, $\hat{\bm{\Sigma}}^{(c)}$ is the standard covariance estimate
750: obtained using only the portion of the data available across all
751: assets and $\alpha$ is an ``optimal'' mixing proportion chosen by CV.
752: (Note that Ledoit's factor--based estimator $\hat{\bm{\Sigma}}^{(f)}$
753: uses only completely observed joint returns.) The spirit of these two
754: approaches is similar, but they are quite distinct. The published
755: success of this type of shrinkage approach suggests that it is
756: important to combine a (complete data) factor--based estimate with a
757: traditional covariance estimate. Indeed, the estimator
758: $\hat{\bm{\Sigma}}^{(f+m)}$ involves combining covariances mediated by
759: factors with covariances that are not accounted for by factors; it can
760: also handle historical missingness via the ``factor--parsimony''
761: regressions within {\tt monomvn}. But rather than shrinking a
(possibly) non--positive definite estimator $\hat{\bm{\Sigma}}^{(c)}$
towards $\hat{\bm{\Sigma}}^{(f)}$ with a single parameter $\alpha$ as
764: in (\ref{eq:ledoit}), {\tt monomvn} applies $m+K$ unique shrinkage
765: parameters, one for {\em each} regression, while taking full advantage
766: of all available returns.
767:
768: \subsection{Software}
769:
770: Finally, an {\sf R} package called {\tt monomvn} \citep{monomvn} has
771: been made freely available through CRAN. It implements the algorithm
772: described in this section, and supports all of the parsimonious
773: regression methods outlined in Section \ref{sec:bpsn} via the
774: stand--alone packages outlined therein. Two forms of CV are supported
775: for choosing the number of components in the parsimonious regression:
776: random 10--fold and (deterministic) leave--one--out (LOO). A $p$
777: argument facilitates parsimonious regression modeling, as described
778: above. Incorporating factors is as straightforward as bundling them in
779: as if they were returns, as described above.
780:
781: \section{Empirical results}
782: \label{sec:results}
783:
784: In this section, the {\tt monomvn} methods are illustrated and
validated on real and synthetic data. In Section \ref{sec:synth} we
focus on the properties of the estimates $\hat{\bm{\mu}}$ and
$\hat{\bm{\Sigma}}$ in a controlled setting involving synthetic data
under monotone missingness. In Section \ref{sec:portfolio} we turn to
applying the estimators towards balancing portfolios in a
mean--variance setting. We wrap up in Section \ref{sec:depend} by using
791: {\tt monomvn} in a descriptive analysis of dependence involving
792: thousands of assets.
793:
794: \subsection{Properties of the estimators on synthetic data}
795: \label{sec:synth}
796:
797: Here, we use a data--generation mechanism provided by the {\tt
798: monomvn} package: {\tt randmvn} generates random samples from a
799: randomly generated MVN distribution with an i.i.d.~standard normal
800: mean vector $\bm{\mu}$, and an Inv--Wishart sampled $\bm{\Sigma}$;
801: {\tt rmono} imposes a uniformly distributed monotone missingness
802: pattern. A similar method is used to generate samples with monotone
803: missingness from a multivariate $t$ distribution (MV$t$) as well, in
804: order to demonstrate that the MVN--based {\tt monomvn} methods still
805: perform well in the presence of heavier tailed data.
806:
807: %\subsubsection{Comparators}
808:
809: The comparisons to follow focus on highlighting the relative strengths
810: and weaknesses of variations of {\tt monomvn} as a function of the
811: choice of parsimonious regression method applied. Additionally, two
812: simpler methods are devised as calibration tools, and to illustrate
813: the advantage of the {\tt monomvn} approach over those which do not
814: leverage the structure of the monotone missingness pattern. The
815: simplest comparator is called ``complete'', where $\bm{\mu}$ and
816: $\bm{\Sigma}$ are estimated using only the portion of data available
817: across all assets, i.e., only the completely observed returns. Put
818: yet another way: only the first $n_m$ rows of $\mb{Y}$ are used.
819: Another comparator is ``observed'' which uses all of the available
820: data in an obvious but na\"ive way:
821: \begin{align}
822: \hat{\mu}_j &= \frac{1}{n_j} \sum_{k=1}^{n_j} y_{k,j} && \mbox{
823: and} & \hat{\Sigma}_{i,j} &= \frac{1}{n_j} \sum_{k=1}^{n_j}
824: (y_{k,j} - \hat{\mu}_j)(y_{k,i} - \hat{\mu}_i) \;\;\;\; \mbox{ for }
825: i=1,\dots,j.
826: \end{align}
827: Unfortunately, the covariance matrices provided by the ``complete'' and
828: ``observed'' estimators are not guaranteed to be positive--definite
829: \citep{stambaugh:1997}. %Besides meaning that these estimators are
830: % invalid, the KL divergence to the true distribution cannot be
831: % calculated, and so the RMSE statistics will be our only metric for
832: % comparison.
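
The ``observed'' estimator can be written out directly, as in the
following Python sketch (with NumPy). It assumes that missing entries
are coded as NaN and that the columns are ordered so that the observed
part of column $j$ is its first $n_j$ rows; the function name
{\tt observed\_estimator} is hypothetical.
\begin{verbatim}
```python
import numpy as np

def observed_estimator(Y):
    """Naive 'observed' estimates from monotone-missing Y (NaN = missing)."""
    n, m = Y.shape
    counts = (~np.isnan(Y)).sum(axis=0)        # n_1 >= n_2 >= ... >= n_m
    mu = np.array([Y[:counts[j], j].mean() for j in range(m)])
    S = np.zeros((m, m))
    for j in range(m):
        nj = counts[j]
        for i in range(j + 1):
            # Use the n_j rows on which both columns i <= j are observed.
            cross = (Y[:nj, j] - mu[j]) * (Y[:nj, i] - mu[i])
            S[i, j] = S[j, i] = cross.sum() / nj
    return mu, S

# Simulated example (illustrative only).
rng = np.random.default_rng(3)
Y_full = rng.standard_normal((40, 3))
Y = Y_full.copy()
Y[25:, 1] = np.nan
Y[10:, 2] = np.nan
mu_obs, S_obs = observed_estimator(Y)
mu_c, S_c = observed_estimator(Y_full)
```
\end{verbatim}
Each entry of the covariance estimate is built from a different
subset of rows, which is why the assembled matrix carries no guarantee
of positive definiteness.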
833:
834: As a final comparator, we consider a method of estimation for
835: incomplete data for arbitrary missingness patterns
836: \citep{dempster:laird:rubin:1977}, using the expectation conditional
837: maximization (ECM) algorithm \citep{meng:rubin:1993}. Consequently,
838: this method also works when the missingness pattern is monotone, but
839: represents a sort of overkill in this case. Two similar software
840: packages are available for this method when the data is assumed to
841: follow a multivariate normal distribution: the {\tt norm} package
842: \citep{norm:2002} for {\sf R}, and {\tt ecmnmle} (contained in the
843: {\sf Matlab} {\tt Financial Toolbox}). We prefer {\tt norm} because
844: its core is implemented in compiled {\sf Fortran}, with an {\sf R}
845: wrapper. It gives nearly identical results to---but runs more than 20
846: times faster than---{\tt ecmnmle} which is written solely in {\sf
847: Matlab}. The ECM method iterates until convergence, stopping at a
848: {\em local} maximum when an improvement threshold is met. As a
849: result, its computational demands and the ultimate optimality of the
850: resulting estimator are sensitive to the initial configuration of the
851: algorithm. Though the missingness pattern may be arbitrary, it is
852: well--known that the method can fail due to convergence issues and/or
853: numerical singularities that can arise due to finite machine
854: representations when more than 15\% of the data is missing (see, e.g.,
the {\tt ecmnmle} documentation within {\sf Matlab}). It also cannot
handle $m > n$, which precludes it from general use in our problem.
857:
858: The expected log likelihood (ELL), which is related to the
859: Kullback--Leibler (KL) divergence, is used as the main metric for
860: comparisons. For probability distribution functions (PDFs) $p$ and
861: $q$, the KL divergence between $p$ and $q$ is defined as
862: \[
863: D_{\mbox{\tiny KL}}(q \parallel p) = \int p(x) \log \frac{p(x)}{q(x)} \;dx.
864: \]
865: In the particular case where $q$ is the estimated MVN with parameters
866: $\hat{\bm{\mu}}$ and $\hat{\bm{\Sigma}}$ and $p$ is the ``true''
867: parameterization with $\bm{\mu}$ and $\bm{\Sigma}$, the KL divergence
868: can be shown to be:
\[
D_{\mbox{\tiny KL}}(\mathrm{MVN}(\hat{\bm{\mu}}, \hat{\bm{\Sigma}}) \parallel
\mathrm{MVN}(\bm{\mu}, \bm{\Sigma})) = \frac{1}{2} \left(\log
\frac{|\hat{\bm{\Sigma}}|}{|\bm{\Sigma}|} +
\mbox{tr}(\hat{\bm{\Sigma}}^{-1} \bm{\Sigma}) + (\hat{\bm{\mu}} -
\bm{\mu})^\top \hat{\bm{\Sigma}}^{-1}(\hat{\bm{\mu}} - \bm{\mu}) - N \right),
\]
where $N$ is the dimension of $\bm{\mu}$.
876: The ELL of $q$ relative to data sampled from $p$ is given by
877: \begin{align}
878: \mathbb{E}_p\{\log q\} &= \int p(x) \log q(x) \;dx \nonumber \\
879: &= \int p(x) \log p(x) \;dx
880: - D_{\mbox{\tiny KL}}(q \parallel p). \label{e:ell}
881: \end{align}
The integral $\int p\log p$ in (\ref{e:ell}) is the negative entropy of $p$.
For $\mathrm{MVN}(\bm{\mu}, \bm{\Sigma})$ it can be shown to
work out to $-\frac{1}{2} \log \{(2\pi e)^N |\bm{\Sigma}|\}$. When
analytical expressions are not available it is easy to approximate
(\ref{e:ell}) numerically by $T^{-1} \sum_{t=1}^T \log q(x_t)$, where
$x_t \sim p$ is simulated out of sample. By the law of large numbers,
this average converges to the ELL as $T$ grows. The ELL is good for
ranking competing estimators; however, actual ``distances'' between
estimators are hard to interpret.
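
The closed form and the Monte Carlo approximation of the ELL can be
compared numerically. The Python sketch below (with NumPy) uses the
standard closed form for the KL divergence between two MVNs, including
the constant $-N$, together with the MVN value of $\int p \log p$. The
function names and the two--dimensional example parameters are
hypothetical.
\begin{verbatim}
```python
import numpy as np

def kl_mvn(mu_hat, S_hat, mu, S):
    """KL divergence of the estimate N(mu_hat, S_hat) from the truth N(mu, S)."""
    N = len(mu)
    S_hat_inv = np.linalg.inv(S_hat)
    diff = mu_hat - mu
    logdet_ratio = np.linalg.slogdet(S_hat)[1] - np.linalg.slogdet(S)[1]
    return 0.5 * (logdet_ratio + np.trace(S_hat_inv @ S)
                  + diff @ S_hat_inv @ diff - N)

def ell_mvn(mu_hat, S_hat, mu, S):
    """Closed-form ELL: integral of p log p minus the KL divergence."""
    N = len(mu)
    p_log_p = -0.5 * (N * np.log(2 * np.pi * np.e) + np.linalg.slogdet(S)[1])
    return p_log_p - kl_mvn(mu_hat, S_hat, mu, S)

def ell_monte_carlo(mu_hat, S_hat, mu, S, T=200_000, seed=0):
    """Approximate the ELL by averaging log q(x_t) over draws x_t ~ p."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mu, S, size=T)
    N = len(mu)
    diff = x - mu_hat
    quad = np.einsum("ti,ij,tj->t", diff, np.linalg.inv(S_hat), diff)
    log_q = -0.5 * (N * np.log(2 * np.pi) + np.linalg.slogdet(S_hat)[1] + quad)
    return log_q.mean()

# Illustrative "truth" and "estimate".
mu = np.array([0.0, 0.0])
S = np.array([[1.0, 0.5], [0.5, 2.0]])
mu_hat = np.array([0.1, -0.2])
S_hat = np.array([[1.2, 0.4], [0.4, 1.8]])

ell_exact = ell_mvn(mu_hat, S_hat, mu, S)
ell_approx = ell_monte_carlo(mu_hat, S_hat, mu, S)
```
\end{verbatim}
The Monte Carlo version requires only the ability to simulate from $p$
and to evaluate $\log q$, so the same pattern applies when $p$ is not
MVN.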
891:
892: % As an auxiliary metric we consider a root mean squared error (RMSE)
893: % obtained by treating all $m + m(m+1)/2$ unique components of $\bm{\mu}$
894: % and $\bm{\Sigma}$ equally:
895: % \[
896: % \mbox{RMSE}(\{\hat{\bm{\mu}}, \hat{\bm{\Sigma}}\},
897: % \{\bm{\mu}, \bm{\Sigma}\})
898: % = \sqrt{\frac{1}{m+m(m+1)/2} \left[
899: % \sum_{j=1}^m (\hat{\mu}_j - \mu_j)^2
900: % + \sum_{1 \leq i \leq j}^m (\hat{\Sigma}_{i,j} -
901: % \Sigma_{i,j})^2 \right]}.
902: % \]
903: % This metric has many advantages including intuitive appeal, ease of
904: % computation, a natural quadratic scale, and is a measure of goodness
905: % of fit that is devoid of (possibly tenuous) distributional
906: % assumptions. However, as we shall see, it is possible that estimated
907: % $\hat{\bm{\mu}}$ and $\hat{\bm{\Sigma}}$ have low RMSE yet depict
908: % relatively poor probability densities for the true underlying data.
909: % One reason is because the squared distance between components of
910: % $\hat{\bm{\Sigma}}$ and $\bm{\Sigma}$ ignores their sign.
911:
912: \subsubsection{Comparing estimators}
913:
914: Figure \ref{f:synth} {\em (left)} summarizes a comparison between the
915: different parsimonious regressions within the {\tt monomvn} algorithm,
916: using randomly generated MVN data with $m=100$ and $n=1000$, repeated
917: over 100 trials, each time sampling new $\bm{\mu}$, $\bm{\Sigma}$ and
918: $\mb{Y}\sim \mathrm{MVN}(\bm{\mu}, \bm{\Sigma})$ with uniform monotone
919: missingness.
920: \begin{figure}[ht!]
921: \centering
922: \includegraphics[angle=-90, scale=0.285]{rEllik}
923: \includegraphics[angle=-90, scale=0.285]{rtllik}
\caption{Comparison of parsimonious regression ($p=1$) methods (using
  10--fold CV) on randomly generated MVN data ($n=1000$ samples,
  $m=100$ dimensions) with $\bm{\mu}\sim N_m(0,1)$, $\bm{\Sigma}
927: \sim$ Inv--Wishart and uniform monotone missingness: boxplots of ELL
928: ranks summarizing 100 repeated trials.
929: \label{f:synth}}
930: \end{figure}
931: Parsimonious regressions were used only when necessary (i.e., $p=1$).
932: 10--fold CV was used to choose $\lambda$ or the number of (principal)
components. As can be seen from the figure, PCR emerges as the clear
934: winner in this comparison, nearly always having the best ELL rank.
935: The complete and observed comparators are almost always ranked worst.
936: % The RMSE results give more insight into the poor performance of
937: % these comparators, but they are less helpful for discerning between
938: % the variations on {\tt monomvn}. It would appear that ridge regression
939: % has the lowest RMSE but, paradoxically, has the second worst rank.
940:
941: In anticipation of the application in Section \ref{sec:portfolio} to
942: financial returns data, which are believed to follow a heavier tailed
943: distribution than MVN, we repeated the above experiment with
944: synthetically generated MV$t$ data with a monotone missingness
945: pattern. The degrees of freedom parameter was sampled as $\nu \sim
946: \mathrm{Exp}(\frac{1}{2})+1$. Figure \ref{f:synth} {\em (right)} shows
947: roughly similar behavior for the MVN based {\tt monomvn} estimators
948: when fit to MV$t$ data: PCR is the best and the observed and complete
949: estimators are the worst (although the order is switched). ELL was
950: computed numerically using the known degrees of freedom parameter(s),
$\nu$, which generated the data. This is a legitimate choice since
$\nu$ is not used in the mean--variance analysis to follow in
953: Section \ref{sec:portfolio}. It is interesting to note the improved
954: rank(s) of the ridge regression based estimator in this case.
955:
956: These results are in line with those of previous simulation studies
957: which compare ML estimators---that are able to leverage all of the
958: available data by exploiting the MVN assumption---to those which use
959: more reasonable distributional assumptions but which, for reasons of
960: tractability, can only use the completely observed cases
\citep[e.g.,][]{little:1988}. The evidence suggests that making use
of all of the available data in a sensible way is the crucial
ingredient, even though the underlying assumptions may be violated.
964: The dominance of PCR in both MVN and MV$t$ scenarios is in line with a
965: recent study \citep{cpr:5829} showing that PCR out--competes other
966: shrinkage (Bayesian motivated) estimators in applications with a large
967: number of financial asset returns.
968:
969: \subsubsection{Choosing the parsimonious proportion}
970: \label{sec:parsi}
971:
972: Recall from Section \ref{sec:monomvn} that $p\in [0,1]$ determines
973: when a parsimonious method is to be used instead of OLS in the {\tt
974: monomvn} algorithm. The experiment performed here is similar to the
975: previous one, except that $n$ and $m$ are varied stochastically with
976: $m$ uniform in $\{5,\dots,100\}$ and $n|m$ uniform in $\{\max(10,
977: \lfloor m/2\rfloor),\dots, md\}$.
978: \begin{table}[ht]
979: \begin{center}
980: \begin{tabular}{l||rrr|r}
981: & \multicolumn{3}{c|}{optimal $p$} & \\
982: method & 5\% & mean & 95\% & improv \\
983: \hline
984: plsr & 0.12 & 0.23 & 0.37 & 0.55 \\
985: pcr & 0.09 & 0.27 & 0.51 & 0.69 \\
986: ridge & 0.04 & 0.25 & 0.67 & 0.29 \\
987: lasso & 0.12 & 0.24 & 0.38 & 0.76 \\
988: lar & 0.11 & 0.26 & 0.41 & 0.65 \\
989: stepwise & 0.15 & 0.26 & 0.39 & 0.74
990: \end{tabular}
991: \end{center}
992: \caption{Mean and 90\% interval for optimal $p$, the ratio of columns
993: to rows in the design matrix before switching from OLS to a parsimonious
994: regression. The {\em improv} column gives the proportion of runs for
995: which $p=0.25$ is better than $p=0$. We repeated this over 100 trials
996: with LOO CV with the ELL as an objective.
997: \label{t:p}}
998: \end{table}
999: Table \ref{t:p} shows the mean and 90\% interval for the optimal $p$
1000: over 100 repeated trials sampling new $m$, $n$, etc., each time. LOO
1001: CV was used to choose $\lambda$, or the number of (principal)
components, and the objective criterion used was ELL. The final column
1003: in the table shows the proportion of time when $p=0.25$ was better
1004: than $p=0$. Observe that all methods except ridge regression work
1005: well, as a rule of thumb, with $p=0.25$. All things being equal, a
1006: larger $p$ setting may be preferred for speed reasons.
1007:
1008: \subsubsection{Comparing to ECM}
1009:
1010: Due to the limitations of ECM--based methods, like those implemented
1011: by {\tt norm} and {\tt ecmnmle}, a comparison of {\tt monomvn} to
1012: these approaches requires a more controlled experiment. Fixing $m=10$
and $n=100$, 1000 repeated experiments similar to the ones described
above, with uniform monotone missingness, gave that {\tt monomvn}
(with PCR) had higher ELL 997 times ($99.7\%$) and that ECM failed to
converge 53 times ($\approx 5\%$). As $n$ grows relative to $m$, the
performance of the methods converges. For example, with $m=10$ and
$n=1000$, {\tt monomvn} is better 831 times ($83\%$), and
ECM failed to converge 11 times ($\approx 1\%$). As the dimensionality ($m$)
increases modestly compared to the sample size ($n$), the ECM--based
{\tt norm} algorithm frequently fails to converge. For example, with $m=20$
and $n=100$ {\tt norm} fails to converge more than 40\% of the time.
1023:
1024: \subsection{Constructing portfolios from historical returns}
1025: \label{sec:portfolio}
1026:
1027: In this section we examine the characteristics of minimum variance
1028: portfolios constructed using estimates of $\mb{\Sigma}$ based on
1029: historical monthly returns. The experimental setup is similar to ones
1030: that have been used in several recent papers on covariance estimation,
1031: and minimum variance portfolio balancing
\citep[e.g.,][]{ckl:1999,jagma:2003}. Following these works we use the
1033: monthly returns of common domestic stocks traded on the NYSE and the
AMEX from April 1968 until 1998. We require that the stocks have a
share price greater than \$5 and a market capitalization above the
20th percentile of the size distribution of NYSE firms. Estimators of
1037: $\bm{\Sigma}$ are constructed based on (at most) the most recently
1038: available 60 months of historical returns. This is in keeping with
1039: previous work and acknowledges that the i.i.d.~assumption in
1040: Eq.~(\ref{eq:iidlik}) is only valid locally (in time) due to the
1041: conditional heteroskedastic nature of financial returns. Short
1042: selling is not allowed; all portfolio weights must be nonnegative.
1043: Although it is typical to cap the weights as well, e.g., at 2\%, in
1044: order to ``tame occasional bold forecasts'' \citep{ckl:1999} that
1045: typically arise due to poor estimators \citep{jagma:2003}, we
specifically do not do so here. Our goal is to fully expose the quality
1047: of the estimators and to illustrate that with good estimators such
1048: rules of thumb are unnecessary.
1049:
1050: Four classes of estimators of $\bm{\Sigma}$ are used in the
1051: comparisons which follow. (1) The {\em complete} estimator outlined
1052: earlier, with variations depending on how many assets have historical
1053: returns with certain lengths (more below). (2) A one--factor model
1054: using the return on the value--weighted portfolio of stocks traded on
1055: the NYSE, AMEX, and Nasdaq. (3) The {\tt monomvn} method using the
1056: parsimonious regressions of Section \ref{sec:bpsn} with $p=0.25$. (4)
1057: The {\tt monomvn} method incorporating the value--weighted portfolio
as a factor, as described in Section \ref{sec:fact}, and with
1059: $p=0$. For this class we augment the collection of parsimonious
1060: regressions to include the ``factor--parsimony'' method. We do not
1061: compare to the ECM methods of {\tt norm} or {\tt ecmnmle} here, as
1062: this has proved to be both cumbersome and troublesome; the methods
1063: seem unable to handle the missingness level in this data. For
1064: example, {\tt norm} consistently fails to converge even after
1065: thousands of very slow iterations of ECM (each taking several seconds
1066: on a 3.2 GHz Xeon).
1067:
1068: To assess the quality and characteristics of the constructed
portfolios we follow \citet{ckl:1999} in using the following:
1070: (annualized) return and standard deviation; (annualized) Sharpe ratio
1071: (average return in excess of the Treasury bill rate divided by the
1072: standard deviation); (annualized) tracking error (standard deviation
1073: of the portfolio return in excess of the S\&P500 return); correlation
1074: to the market (S\&P500 return); average number of stocks with weights
1075: above 0.5\%. We closely follow the experimental setup of
1076: \citet{ckl:1999} and \citet{jagma:2003} by randomly subsampling from
1077: the qualifying stocks in each year, and holding the portfolios for the
1078: entire subsequent 12 months. The random subsample reduces the size of
1079: the estimation problem, and thus computational burden, so that many
1080: methods can be simultaneously benchmarked against one another. It can
1081: also serve the dual purpose of enabling the calculation of
1082: nonparametric (bootstrap--like) Monte Carlo assessments of
1083: variability, which was not a feature explored in previous work.
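
The return--based statistics listed above can be computed from
monthly series as in the following Python sketch. The annualization
convention (multiplying means by 12 and standard deviations by
$\sqrt{12}$) is an assumption made for illustration, as are the
function name {\tt portfolio\_summary} and the toy return series.
\begin{verbatim}
```python
import math
import statistics

def portfolio_summary(port, rf, bench):
    """Annualized summary statistics from aligned monthly return series.

    port:  monthly portfolio returns
    rf:    monthly Treasury bill rates
    bench: monthly benchmark (e.g., market index) returns
    """
    excess = [r - f for r, f in zip(port, rf)]
    active = [r - b for r, b in zip(port, bench)]
    ann_mean = 12 * statistics.mean(port)
    ann_sd = math.sqrt(12) * statistics.stdev(port)
    sharpe = 12 * statistics.mean(excess) / ann_sd
    tracking_error = math.sqrt(12) * statistics.stdev(active)
    return {"mean": ann_mean, "sd": ann_sd,
            "sharpe": sharpe, "te": tracking_error}

# Toy monthly returns (made up for illustration).
port = [0.01, 0.03, -0.01, 0.02]
rf = [0.004, 0.004, 0.004, 0.004]
bench = [0.012, 0.025, -0.008, 0.018]
summary = portfolio_summary(port, rf, bench)
```
\end{verbatim}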
1084:
1085: Specifically, in each April, starting in 1972, we randomly subsample
1086: 250 stocks
1087: % \footnote{\citet{jagma:2003} use subsamples of size 500. Since
1088: % there are approximately 900 qualifying assets in any year we
1089: % prefer to follow \citet{ckl:1999} and use 250 in order to better
1090: % explore the spread of the characteristics of our estimators in
1091: % this experiment.}
1092: (without replacement) from those which qualify (in the
1093: sense outlined above) and which have at least 12 months of historical
1094: returns. In this way our work differs slightly from our predecessors
1095: whose estimators require exactly 60 months of historical returns. We
1096: chose 12 months in order to highlight the benefit of incorporating
1097: assets in the portfolio with fewer than 60 months of returns via {\tt
1098: monomvn}. Estimates of the covariance matrix of monthly excess
returns (over the monthly Treasury Bill rate) are generated from the
1100: different models using at most the last 60 months of historical
returns for the 250 assets. Based on the estimate(s), quadratic
programming is used to find the global minimum variance portfolio(s)
described by weights $\hat{\mb{w}} = \argmin_{\mb{w}} \mb{w}^\top
\hat{\bm{\Sigma}} \mb{w}$, subject to the weights being nonnegative
and summing to one. Then, the weights $\hat{\mb{w}}$ are
1105: applied to form buy--and--hold portfolio returns until the next April,
1106: when the randomization, estimation, and optimization steps are
1107: repeated and the portfolios are reformed.
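
As a point of reference for the optimization step, the following sketch
assumes that the only constraint is the budget constraint $\mb{1}^T
\mb{w} = 1$, where $\mb{1}$ is a vector of ones (notation introduced
here for illustration). Any further constraints, such as bounds on the
weights, call for the quadratic program itself and are not covered by
this closed form. Under that assumption, and for a positive definite
$\hat{\bm{\Sigma}}$, a Lagrangian argument with multiplier $\lambda$
gives
\begin{align*}
\mathcal{L}(\mb{w}, \lambda) &= \mb{w}^T \hat{\bm{\Sigma}} \mb{w}
  - \lambda (\mb{1}^T \mb{w} - 1), \\
\mb{0} &= 2 \hat{\bm{\Sigma}} \mb{w} - \lambda \mb{1}
  \quad \Longrightarrow \quad
  \mb{w} = \tfrac{\lambda}{2} \hat{\bm{\Sigma}}^{-1} \mb{1}, \\
\hat{\mb{w}} &= \frac{\hat{\bm{\Sigma}}^{-1} \mb{1}}
  {\mb{1}^T \hat{\bm{\Sigma}}^{-1} \mb{1}}, \qquad
\hat{\mb{w}}^T \hat{\bm{\Sigma}} \hat{\mb{w}} =
  \frac{1}{\mb{1}^T \hat{\bm{\Sigma}}^{-1} \mb{1}}.
\end{align*}
The last line shows why an invertible (positive definite) estimate is
needed before the weights can be computed at all, and how errors in
$\hat{\bm{\Sigma}}^{-1}$ pass directly into $\hat{\mb{w}}$.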
1108:
1109: \begin{table}[ht!]
1110: \begin{center}
1111: \begin{tabular}{r||rrrrrr}
1112: % \hline
1113: method & mean & sd & sharpe & te & cm & wmin \\
1114: \hline \hline
1115: eq & 0.149 & 0.188 & 0.432 & 0.062 & 0.949 & 0 \\
1116: vw & 0.135 & 0.162 & 0.412 & 0.032 & 0.981 & 45 \\
1117: \hline
1118: min & 0.147 & 0.183 & 0.431 & 0.105 & 0.819 & 29 \\
1119: com & 0.150 & 0.183 & 0.447 & 0.107 & 0.810 & 26 \\
1120: rm & 0.132 & 0.129 & 0.494 & 0.094 & 0.803 & 16 \\
1121: \hline
1122: fmin & 0.142 & 0.146 & 0.503 & 0.086 & 0.845 & 38 \\
1123: fcom & 0.144 & 0.146 & 0.521 & 0.087 & 0.841 & 37 \\
1124: frm & 0.138 & 0.130 & 0.537 & 0.117 & 0.688 & 21 \\
1125: \hline
1126: plsr & 0.148 & 0.154 & 0.516 & 0.124 & 0.686 & 15 \\
1127: pcr & 0.143 & 0.132 & 0.563 & 0.109 & 0.732 & 23 \\
1128: ridge & 0.158 & 0.165 & 0.546 & 0.122 & 0.716 & 16 \\
1129: lasso & 0.151 & 0.150 & 0.550 & 0.054 & 0.941 & 69 \\
1130: lar & 0.151 & 0.151 & 0.545 & 0.053 & 0.944 & 71 \\
1131: step & 0.152 & 0.155 & 0.541 & 0.052 & 0.946 & 75 \\
1132: \hline
1133: ffp & 0.143 & 0.132 & 0.566 & 0.113 & 0.712 & 24 \\
1134: fplsr & 0.147 & 0.153 & 0.514 & 0.123 & 0.688 & 15 \\
1135: fpcr & 0.142 & 0.131 & 0.560 & 0.109 & 0.732 & 24 \\
1136: fridge & 0.158 & 0.163 & 0.554 & 0.119 & 0.726 & 19 \\
1137: flasso & 0.152 & 0.148 & 0.561 & 0.056 & 0.936 & 69 \\
1138: flar & 0.151 & 0.151 & 0.546 & 0.053 & 0.943 & 70 \\
1139: fstep & 0.154 & 0.153 & 0.558 & 0.055 & 0.939 & 73 \\
1140: \end{tabular}
1141: \end{center}
1142: \caption{Comparing statistics summarizing the returns of
1143: yearly buy--and--hold portfolios generated over 50 repeated
1144: random paths through the 26 years of monthly historical returns.
1145: The first group of rows shows the equal-- and value--weighted
1146: portfolios; the second group of rows has complete data estimators
1147: based on the preceding 12--months of returns, the maximal completely
1148: observed historical returns, and the returns for the subset of
1149: assets with 60 months of historical returns; the third group
1150: uses the same returns as the second with a one--factor model;
1151: the penultimate group uses {\tt monomvn}; the final group uses
1152: {\tt monomvn} with the additional one--factor. The statistics
1153: across the columns are (annualized) mean return, standard
1154: deviation, Sharpe ratio, tracking error, correlation to market
1155: and average number of stocks with weights above 0.5\%.
1156: } \label{t:sharpe}
1157: \end{table}
1158:
1159: Table \ref{t:sharpe} summarizes the properties of those returns
1160: averaged over 50 repeated random paths through the 26 years in the
1161: study. The table is broken into five sections, vertically, starting
1162: with the equal-- and value--weighted portfolios (for comparison),
1163: followed by global minimum variance portfolios based on estimated
1164: $\bm{\Sigma}$: complete data estimators, complete data estimators
1165: based on a one--factor model, {\tt monomvn} estimators, and {\tt
1166: monomvn} estimators incorporating the one--factor. Throughout, the
1167: ``f'' prefix indicates that the estimator uses the value--weighted
1168: factor in some way. The ``min'' and ``fmin'' estimators use only the
1169: last 12--months of historical returns, whereas the ``com'' and
1170: ``fcom'' estimators use the maximal complete history available. The
1171: ``rm'' and ``frm'' estimators focus only on those assets with
1172: completely observed returns for the last 60 months---where the weights
1173: for the other assets are set to zero (removing them from the
1174: portfolio). The annualized mean, standard deviation, and Sharpe ratio
1175: statistics for these six estimators lead one to conclude that the more
1176: historical returns (within the five--year window) that can be used to
1177: estimate $\bm{\Sigma}$ the better. Tracking error is also improved,
1178: except in the case of ``frm''. All in all, these results support
1179: those obtained in previous studies \citep[e.g.,][]{ckl:1999} showing
1180: that, in particular, factor models improve upon the na\"ive estimator
1181: in the complete data case. Further inspection of this part of the
1182: table reveals that the improved Sharpe ratios for ``rm'' and ``frm''
1183: are due to the smaller standard deviation obtained under these
1184: estimators, but that this comes at the expense of a smaller mean
1185: return. This may be due to more weight being placed on fewer assets
1186: (as indicated in the ``wmin'' column). Both ``rm'' and ``frm'' also
1187: have the lowest correlation to the market in their cohort.
1188:
1189: The final two groups of rows tell a similar story. The Sharpe ratios
1190: for the {\tt monomvn} estimators---with and without the
1191: value--weighted factor---show marked improvements over the complete
1192: data estimators. As before, the inclusion of the value--weighted
1193: factor further adds to the improvement, e.g., yielding higher Sharpe
1194: ratios except in the cases of PCR and PLSR, where they remain essentially
1195: unchanged. The ``ffp'' estimator, i.e., the one--factor model
1196: applied via {\tt monomvn} using the ``factor--parsimony'' regression
1197: method, has one of the lowest standard deviations, and therefore a
1198: comparatively high Sharpe ratio despite a low mean return. We can see
1199: that, as with ``rm'' and ``frm'', this low standard deviation is
1200: obtained by placing large weight on only a few assets. PCR, PLSR, and
1201: ridge regression---both with and without factors---show similar
1202: properties. In contrast, the LARS estimators (lasso, lar, and
1203: stepwise---both with and without the factor) obtained similar or
1204: better Sharpe ratios but with a larger mean return, by assigning large
1205: weight to roughly three times as many assets. As a result, these LARS
1206: estimators obtain a much lower tracking error and higher correlation
1207: to the market.
1208:
1209:
1210: So when appropriate factors are available it makes sense to use them,
1211: and the best way to do so is via {\tt monomvn}. It would seem that
1212: the one--factor LARS based {\tt monomvn} estimators give the best
1213: results in the study, overall, with lasso in the top spot. It is
1214: reassuring to notice that, when an appropriate factor is {\em not}
1215: available, the LARS based {\tt monomvn} methods, and PCR, give largely
1216: similar results by incorporating all of the available returns in a
1217: parsimonious way. This is not true in the case of the complete data
1218: estimators.
1219:
1220: \begin{figure}[ht!]
1221: \centering
1222: \includegraphics[trim=40 0 0 10,scale=0.75]{sharpe_boxplot}
1223: \includegraphics[trim=40 0 0 25,scale=0.75]{te_boxplot}
1224: \caption{Boxplots of Sharpe ratios {\em (top)} and the tracking error
1225: {\em (bottom)} obtained over 50 random paths through the 26 years,
1226: obtained by randomly sampling 250 qualifying assets in each year.
1227: The averages of these numbers are what is reported in Table
1228: \ref{t:sharpe}. The horizontal bars correspond to the vertical ones
1229: in that table.}
1230: \label{f:boot}
1231: \end{figure}
1232: Figure \ref{f:boot} complements Table \ref{t:sharpe} by showing the
1233: distribution (via boxplots) of the Sharpe ratios and the tracking
1234: error obtained for each of the 50 random paths through the 26 years.
1235: Recall that these were obtained by randomly sampling 250 qualifying
1236: assets in each year. The numbers in Table \ref{t:sharpe} are the
1237: means of the data used to construct each boxplot, whereas the boxplots in
1238: the figure represent Monte Carlo approximations to the sampling
1239: distribution of portfolio characteristics under the various estimators
1240: of $\bm{\Sigma}$. In short, the figure reinforces the
1241: superiority of the LARS estimators which, in addition to having large
1242: Sharpe ratios and small tracking error, also exhibit small variability
1243: with respect to Monte Carlo resampling. It is interesting to note
1244: that the LARS based estimators (without the factor) show the lowest
1245: variability in their Sharpe ratios amongst all {\tt monomvn}
1246: estimators.
1247:
1248: It may be tempting to conclude that these results contradict the
1249: results of the ELL--based comparison(s) on synthetic data in Section
1250: \ref{sec:synth}. Indeed, in that section we saw that PCR seemed to be
1251: the best at recovering the (known) parameters of the distribution which generated
1252: the training data. However, means, variances, Sharpe ratios, tracking
1253: error, etc., are specific statistics, and moreover they are obtained
1254: after a (highly non--linear) transformation into portfolio weights via
1255: quadratic programming. Therefore, we should expect to see different
1256: results, since these statistics represent utilities which are
1257: different from ELL. That being said, notice that PCR is still the
1258: best in terms of average annualized standard deviation (and thus
1259: Sharpe ratio) [see Table \ref{t:sharpe}] when no appropriate factors
1260: are available---but with high variability [see Figure \ref{f:boot}].
1261: Importantly, both experiments (here and in Section \ref{sec:synth})
1262: show, resoundingly, that using all of the available data via {\tt
1263: monomvn} is preferred over a complete data estimator.
1264:
1265: \subsection{Examining dependence relationships between assets}
1266: \label{sec:depend}
1267:
1268: For our final empirical analysis we shall demonstrate the descriptive
1269: power of {\tt monomvn}. At the same time we shall take the
1270: opportunity to show how the method can be applied when there are
1271: thousands of assets.
1272:
1273: From Thomson Financial's Datastream ({\tt www.datastream.com}), we
1274: have downloaded, in dollar terms, the total returns data of each stock
1275: in the Russell 3000$^{\mbox{\tiny \textregistered}}$
1276: Index
1277: %\footnote{The Russel 3000$^{\mbox{\tiny \textregistered}}$ Index
1278: % represents
1279: representing the broad United States equity universe encompassing
1280: approximately 98\% of the market:
1281: % .}
1282: 1792 weekly returns between 12/01/1973 and 11/05/2007 for 2894
1283: assets. In order to obtain a set of clean and complete data, each
1284: series is tested for illiquidity, completeness, and stationarity,
1285: using the following methodology. We removed assets which were
1286: marked to market at a frequency other than weekly, to exclude
1287: illiquid assets that may exhibit artificial serial correlation (this
1288: essentially excludes any stock that has more than two weeks of
1289: consecutive unchanging prices at any point in time). Then, an
1290: augmented Dickey--Fuller test \citep{dickey:fuller:1979} is employed
1291: to exclude any of the assets that exhibit non--stationarity (six
1292: lags have been tested at the 99\% confidence level). A total of
1293: 2461 stocks remained after applying these two filtering steps.
1294: There are 558 assets with the longest history of 1792 returns; the least
1295: observed asset has only 76 returns (so the ``complete'' estimator(s)
1296: can use only 3\% of the data); the overall proportion of missing
1297: observations was 0.472.
1298:
1299: We consider applying the lasso version of the {\tt monomvn} algorithm
1300: to this data, with $p = 0$, i.e., always use the lasso (never use
1301: OLS). As we have mentioned, the lasso (and other LARS methods) have
1302: descriptive (as well as predictive) power because they can provide
1303: $\hat{\bm{\beta}}$ with many coefficients set to zero. In the context
1304: of the {\tt monomvn} algorithm this means that the MLE
1305: $\hat{\bm{\Sigma}}$ may have zero entries, indicating marginally
1306: uncorrelated assets, and moreover may have block--diagonal structure
1307: (or zeros in $\hat{\bm{\Sigma}}^{-1}$) indicating a pairwise
1308: conditional independence of assets. Since ridge regression, PCR, and
1309: PLSR always yield $|\hat{\beta}_i| > 0$, they would never produce a
1310: zero in $\hat{\bm{\Sigma}}$ or $\hat{\bm{\Sigma}}^{-1}$, and so would
1311: be less useful for creating such qualitative summaries of the
1312: relationships between asset returns. It may be tempting to interject
1313: zeros where there are small values in $\hat{\bm{\Sigma}}$ or
1314: $\hat{\bm{\Sigma}}^{-1}$, but like the ``complete'' and ``observed''
1315: estimators, the resulting matrix would not usually be positive
1316: definite. Moreover, classical pairwise tests for independence, say
1317: via the Pearson product--moment correlation coefficient, would give
1318: unrealistic results. With return histories as short as $\sim80$ weeks
1319: and estimated correlation less than about 0.2, a simple calculation
1320: shows that there would not be enough evidence to reject the
1321: hypothesis that the correlation is zero.
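
The ``simple calculation'' can be sketched as follows, assuming the
usual $t$ test for a zero Pearson correlation under bivariate
normality, a two--sided 5\% level, and $n = 80$ weekly returns. The
symbols $r$, $n$ and $t$ are introduced here for illustration only.
With a sample correlation of $r = 0.2$,
\begin{equation*}
t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}
  = \frac{0.2 \sqrt{78}}{\sqrt{0.96}} \approx 1.80,
\end{equation*}
which falls short of the critical value of roughly $1.99$ for a $t$
distribution with $78$ degrees of freedom. Equivalently, the smallest
sample correlation that would be declared nonzero is about
$1.99/\sqrt{78 + 1.99^2} \approx 0.22$ in absolute value, so a
pairwise test applied to such short histories cannot distinguish
correlations below that level from zero.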
1322:
1323: The estimator obtained using the lasso on this data yields a
1324: $\hat{\bm{\Sigma}}$ with 36\% of its entries set to zero. Moreover,
1325: 50 of its 2461 columns (or 2\%) are everywhere zero except in the
1326: diagonal position. This means that 36\% of asset pairings are
1327: marginally uncorrelated. Investigating pairwise correlation between
1328: assets, conditional on all of the others, involves looking for zeros
1329: in $\hat{\bm{\Sigma}}^{-1}$, of which we find 140 (or 6\%). This
1330: means that the rows/columns of $\hat{\bm{\Sigma}}$ can be reordered so
1331: that the matrix has block--diagonal structure, and that the returns of
1332: 6\% of the assets are conditionally independent.
1333: \begin{figure}
1334: \includegraphics[angle=-90,scale=0.8]{indep.ps}
1335: \vspace{-0.1cm}
1336: \caption{Histograms of the number of zeros in each column of
1337: $\hat{\bm{\Sigma}}$ {\em (left)} and $\hat{\bm{\Sigma}}^{-1}$ {\em
1338: (right)}.}
1339: \label{f:indep}
1340: \end{figure}
1341: Figure \ref{f:indep} shows histograms summarizing the number of zeros
1342: in each column of $\hat{\bm{\Sigma}}$ and $\hat{\bm{\Sigma}}^{-1}$.
1343: Every column in both matrices had at least one zero entry. The figure
1344: clearly illustrates that the resulting correlations can be used to
1345: cluster the assets, but this is beyond the scope of this paper.
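
The reading of zeros in $\hat{\bm{\Sigma}}^{-1}$ as conditional
independence rests on a standard property of the multivariate normal
distribution, sketched here with notation introduced for illustration:
write $\bm{\Omega} = \bm{\Sigma}^{-1}$ with entries $\omega_{ij}$, and
let $\rho_{ij \cdot \mathrm{rest}}$ denote the partial correlation
between assets $i$ and $j$ given all of the others. Then
\begin{equation*}
\rho_{ij \cdot \mathrm{rest}} =
  - \frac{\omega_{ij}}{\sqrt{\omega_{ii}\, \omega_{jj}}},
\qquad \mbox{so that} \qquad
\omega_{ij} = 0 \iff \rho_{ij \cdot \mathrm{rest}} = 0,
\end{equation*}
and under joint normality a zero partial correlation is equivalent to
conditional independence of the two returns given the rest. A zero in
$\bm{\Sigma}$ itself, by contrast, is a statement about the marginal
correlation only, which is why the two histograms need not resemble
one another.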
1346:
1347: To wrap up the experiment we downloaded the market returns available
1348: from the Russell 3000 Index
1349: for 1479 (of 1792) contiguous weeks ending 11/5/2007 and used them to
1350: create a residual return series
1351: for each of the 2461 assets in our
1352: study. We then re-ran the lasso experiment, above, to discover that
1353: 58\% of the asset pairings are marginally uncorrelated and 14\% are
1354: conditionally independent when the market is taken into account. The
1355: histograms corresponding to this experiment are similar to those for
1356: the initial one, in Figure \ref{f:indep}, and so they are not
1357: reproduced here. %{\em Profound concluding comment.}
1358:
1359: \section{Discussion}
1360: \label{sec:discuss}
1361:
1362: We have shown how the methods of \cite{stambaugh:1997} can be applied
1363: for large numbers of assets whose histories are (nearly) unconstrained
1364: in length. The key insight is in replacing OLS regressions with more
1365: parsimonious ones that either use derived input directions or apply
1366: some sort of shrinkage. Whereas Stambaugh demonstrated his
1367: methodology on 22 assets, we have shown how the {\tt monomvn}
1368: algorithm---essentially the same methodology with a different
1369: regression method---can handle thousands. We argued that even when
1370: OLS regressions suffice, the more parsimonious ones can offer
1371: improvements in both accuracy and interpretation. We also argued that
1372: it is advantageous to let a model selection method (e.g., parsimonious
1373: regression) decide which dependencies between factors and returns
1374: exist, as opposed to assuming a classical factor model structure.
1375:
1376: \cite{stambaugh:1997} showed that by applying the standard
1377: noninformative prior $\pi(\bm{\theta}) \propto
1378: |\bm{\Sigma}|^{\frac{p-1}{2}}$ \citep[e.g.,][p.~154]{schafer:1997} it
1379: is possible to turn the MLEs $\hat{\bm{\mu}}$ and $\hat{\bm{\Sigma}}$
1380: into moments $\tilde{\bm{\mu}}=\hat{\bm{\mu}}$ and
1381: $\tilde{\bm{\Sigma}}\ne\hat{\bm{\Sigma}}$ of a Bayesian posterior
1382: (predictive) distribution that, when used in the mean--variance
1383: framework, are said to take {\em estimation risk} into account. We
1384: note that, due to the notation used in that paper, it is a common
1385: misconception that these posterior moments forecast the ML estimates
1386: into the future. Since Stambaugh employs the i.i.d. assumption in
1387: the same way that we do in Eq.~(\ref{eq:iidlik}), these are only
1388: moments of the posterior for $\bm{\theta}$ conditioned on the
1389: available historical data. Therefore, time is irrelevant, so the
1390: moments apply to the past as well without modification. Finally, to
1391: label this approach as ``Bayesian'' is an overstatement. While
1392: Stambaugh is correct to note that estimates of the mean vector and
1393: covariance matrix are all that are needed within the mean--variance
1394: framework, what results is a point--estimate (vector) of optimal
1395: portfolio weights, not (samples from) a Bayesian posterior
1396: distribution, as would be ideal. The challenge is that while the
1397: moments of the posterior have a nice closed form, the distribution
1398: itself does not. Further challenges limit the application of this
1399: approach in the ``big $p$ small $n$'' setting. In this situation the
1400: standard noninformative prior leads to an improper posterior. This
1401: can be most easily seen in the calculation of Stambaugh's $\tilde{V}
1402: \equiv \tilde{\bm{\Sigma}}$ (in our notation) in Eqs.~(69--71),
1403: p.~302, where the resulting diagonal would be negative.
1404:
1405: Stambaugh's Bayesian approach is not the only way forward. It is
1406: possible to obtain the sampling covariance matrix of $\hat{\bm{\mu}}$
1407: analytically. However, an analytic form for the sampling variability
1408: of $\hat{\bm{\Sigma}}$ is not known. The bootstrap
1409: \citep[e.g.,][Sections 7.11 \& 8.2]{hastie:tibsh:fried:2001} offers a
1410: Monte Carlo method for quantifying the {\em stability} of
1411: $\hat{\bm{\Sigma}}$ via its component-wise confidence intervals. We
1412: took a related approach at the end of Section \ref{sec:portfolio} to
1413: examine how variability in $\hat{\bm{\Sigma}}$, arising from random
1414: subsamples of 250 assets, filters through to the properties of the
1415: balanced portfolios. However, \citet[][Section
1416: 7.4.4]{little:rubin:2002} make a strong argument in favor of a
1417: fully Bayesian approach instead. Facilitating tractable Bayesian
1418: estimation for parsimonious regression algorithms, as would be
1419: required by {\tt monomvn}, presents a serious challenge. The Bayesian
1420: lasso \citep{park:casella:2008} and so--called Bayesian latent factor
1421: models \citep{west:2003}, which can be seen as a Bayesian extension of
1422: principal components and partial least squares regressions, have
1423: received much attention in the recent literature. Exploring the
1424: extent to which these can be applied within the {\tt monomvn}
1425: algorithm to get samples from the posterior distribution of $\bm{\mu}$
1426: and $\bm{\Sigma}$ is part of our ongoing work. These samples can
1427: accurately reflect the estimation risk in mean--variance portfolio
1428: allocation by filtering the uncertainty through the optimization to get
1429: a distribution on the simplex of portfolio weights.
1430:
1431: Another interesting extension would involve relaxing the assumption of
1432: (multivariate) normality, i.e., to decouple the dependence
1433: distribution, or {\em copula} \citep{sklar:1957}, from the marginals.
1434: In this regard, \cite{patton:2006} has made promising inroads into
1435: applying copulas to a pair of return series under a monotone
1436: missingness pattern. Although the theory for copulas
1437: \citep{nelsen:1999} naturally extends beyond two dimensions, the
1438: application of the methodology quickly becomes intractable without
1439: enforcing severely restrictive assumptions. Our ongoing work includes
1440: identifying ways in which the {\tt monomvn} algorithm for
1441: high--dimensional estimation under monotone missingness may be
1442: extended to support marginal Student--$t$ distributions and GARCH
1443: models with various parametric forms of the copula. While there is
1444: plenty of evidence in the literature against the assumption of
1445: normality for asset returns \citep[e.g.,][]{mills:1927}, we argued that
1446: the most important thing is to be able to make use of all of the
1447: available data with an algorithm that is computationally tractable.
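
For concreteness, the decoupling referred to above is the one given by
Sklar's theorem, stated here in a notation that is ours for
illustration only: $F$ for the joint distribution function of $m$
returns, $F_1, \dots, F_m$ for its marginals, and $C$ for the copula.
The theorem says that
\begin{equation*}
F(y_1, \dots, y_m) = C\left( F_1(y_1), \dots, F_m(y_m) \right),
\end{equation*}
where $C$ is a distribution function on $[0,1]^m$ with uniform
marginals, and is unique when the $F_j$ are continuous. The
multivariate normal model corresponds to a Gaussian copula paired with
normal marginals, so an extension of the kind described would keep a
parametric $C$ while replacing the $F_j$, e.g., with Student--$t$
distributions.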
1448:
1449: % smaller spacing for references
1450: %\renewcommand{\baselinestretch}{1.5}\small\normalsize
1451:
1452: \bibliography{corr}
1453: \bibliographystyle{jasa}
1454:
1455: \end{document}
1456: