
1: \documentclass[11pt]{article}

2: \usepackage{amsmath, amssymb, chicago, amsfonts, latexsym, annals, natbib}

3: \setlength{\oddsidemargin}{0.0in}

4: \setlength{\evensidemargin}{0.0in}

5: \setlength{\textwidth}{6.5in}

6: \setlength{\topmargin}{0.0in}

7: \advance \topmargin by -\headheight

8: \advance \topmargin by -\headsep

9: \advance \topmargin .2in

10: \setlength{\textheight}{8.0in}

11: \sloppy \hyphenpenalty=10000

12: \def\var{\mathrm{var}}

13: \def\E{\mathrm{E}}

14: \def\T{^{\mbox{T}}}

15: \def\ez{\eta^{(0)}}

16: \def\eo{\eta^{(1)}}

17: \def\sz{S^{(0)}}

18: \def\so{S^{(1)}}

19: \def\bd{\begin{description}}

20: \def\ed{\end{description}}

21: %\def\theequation{\thesection.\arabic{equation}}

22: \def\bl{\begin{list}{$\bullet$}{}}   % Begins a bullet list.

23: \def\cl{\begin{list}{$\circ$}{}}     % Begins a circle list.

24: \def\el{\end{list}}                  % Ends a list.

25: \newcommand{\ef}{\hfill $\Box$}

26: \def\b1{\mathbf{1}}

27: %\newcommand{\mycite}[1]{{\small \sc \citeNP{#1}}}

28: \newcommand{\elinfH}{ \ell^{\infty}({\cal H}) }

29: \newcommand{\psidot}[1]{ \dot{\Psi}_{#1} }

30: \newcommand{\Var}{\mbox{Var}}

31: \newcommand{\PP}{{\mathbb  P}}

32: \newcommand{\QQ}{{\mathbb Q}}

33: \newcommand{\GG}{{\mathbb G}}

34: \newcommand{\RR}{{\mathbb R}}

35:

36: \begin{document}

37: \setlength{\baselineskip}{20pt}

38:

39: \title{Weighted Likelihood for Semiparametric Models

40: and Two-phase Stratified Samples, with Application to Cox Regression}

41:

42: \author{Norman E. Breslow \\ Jon A. Wellner}

43:

44: \affiliation{University of Washington, Seattle}

45: \date{\today}

46:

47: \maketitle

48:

49: \begin{abstract}

50: Weighted likelihood, in which one solves Horvitz-Thompson or inverse probability weighted (IPW) versions of the likelihood equations, offers a simple and robust method for fitting models to two phase stratified samples.

51: We consider semiparametric models for which solution of infinite dimensional estimating equations leads to $\sqrt{N}$ consistent and asymptotically Gaussian estimators of both Euclidean and nonparametric parameters.

52: If  the phase two sample is selected via Bernoulli (i.i.d.) sampling with known sampling probabilities, standard estimating equation theory shows that the influence function for the weighted likelihood estimator of the Euclidean parameter is the IPW version of the ordinary influence function.

53: By proving weak convergence of the IPW empirical process, and borrowing results on weighted bootstrap empirical processes, we derive a parallel asymptotic expansion for finite population stratified sampling.

54: Whereas the asymptotic variance for Bernoulli sampling involves the within strata second moments of the influence function, for finite population stratified sampling it involves only the within strata variances.

55: The latter asymptotic variance also arises when the observed sampling fractions are used as estimates of those known \textit{a priori}.

56: A general procedure is proposed for fitting semiparametric models with estimated weights to two phase data.

57: Several of our key results have already been derived for the special case of Cox regression with stratified case-cohort studies, other complex survey designs and missing data problems more generally.

58: This paper is intended to help place this previous work in appropriate context and to pave the way for applications to other models.

59:

60: \end{abstract}

61:

62: \vspace{2cm}

63: \noindent

64: \textit{Key words}:

65: case-cohort,

66: estimated weights,

67: failure time,

68: inverse probability weights,

69: missing data

70:

71: \newpage

72: \section{Introduction}

73:

74: Two phase stratified sampling, also known as double sampling,

75: was introduced by \citet{4923} to estimate the population mean of a

76: target variable that is costly or difficult to measure.

77: At phase one a relatively large random sample is drawn and

78: measurements are made on an auxiliary variable that is

79: correlated with the target variable but easier to measure.

80: At phase two measurements on the target variable are made

81:  for a subsample drawn randomly, without replacement, from

82:  within strata defined by the auxiliary variable.

83: Neyman showed that the optimal, design unbiased linear

84: estimator of the population mean is the Horvitz-Thompson (\citeyear{3888}) estimator that weights each observation by the inverse

85: of the probability of its selection into the phase two sample.

86:

87: Two-phase stratified sampling designs can dramatically reduce

88: the costs of regression modeling when the strata depend

89: on (correlates of) both outcome and explanatory variables.

90: A common method of estimation is ``weighted exogenous

91: sampling maximum likelihood", here simply Weighted Likelihood

92: or WL, in which one maximizes the inverse probability weighted

93: (IPW) sum of log-likelihood contributions from the phase

94: two observations \citep{200, 265}.

95: Equivalently, one may solve an IPW version of the

96: score equations \citep[\S 3.4]{4105}.

97: Although easy to implement, WL estimators are sometimes

98: seriously inefficient \citep{3937}.

99: They may still be of interest, however, because even when

100: the model is wrong they consistently estimate the finite population

101: parameters that would be obtained by fitting the model to

102: complete phase one data \citep{1815, 4326}.

103: Fully efficient estimators are available for logistic and other

104: parametric regression models in situations where the phase

105: one data consist only of stratum frequencies.

106: See, for example, \citet{4927} and the references cited therein.

107:

108: The asymptotic properties of WL estimators of Euclidean parameters

109: in parametric models follow readily from standard results for

110: $M$-estimators \citep[Chapter 5]{4895}.

111: WL may also be used for estimation of both Euclidean and infinite

112: dimensional parameters in semiparametric models, for which the

113: paradigm is Cox (\citeyear{1272}) proportional  hazards regression.

114: \citet{4894} developed asymptotic results for both regression

115: coefficients and baseline cumulative hazard when fitting the Cox model to survey

116: data including those obtained using two phase sampling.

117: \citet{4324} obtained the same results for the regression parameters

118: when fitting the Cox model to data from exposure stratified case-cohort

119: studies, in which all subjects who have a failure event (the cases)

120: are sampled at phase two.

121: One purpose of the present paper is to develop a modern theory

122: of WL estimation in semiparametric models that encompasses these previous results, helps to interpret them and paves the way for further applications.

123: We also explore the relationship between results based on finite

124: population stratified sampling at phase two and those based on i.i.d.

125: variable probability sampling with sampling weights

126: estimated using information from phase one.

127:

128: \section{Notation, Assumptions and Problem Statement}

129: Suppose $P_{\theta,\eta}$ denotes a probability distribution in

130: a semiparametric model for a random variable $X \in {\cal X}$,

131: where $\theta \in \Theta \subset \RR^p$ is the Euclidean parameter

132: and $\eta$, taking values in some arbitrary space $H$, is the nonparametric one.

133: Let $P_0=P_{\theta_0,\eta_0}$ denote the distribution from which $X$ is actually sampled.

134: Following closely \S 25.12 of \citet{4895}, suppose maximum

135: likelihood (ML) estimators $(\hat\theta, \hat\eta)$ are obtained by solving the system

136: \begin{eqnarray}

137: \Psi_{N1}(\theta,\eta) & = & \PP_N \dot\ell_{\theta,\eta}\; = \;0 \nonumber \\

138: \Psi_{N2}(\theta,\eta) & = & \PP_N B_{\theta,\eta}h

139:       -P_{\theta,\eta}B_{\theta,\eta}h \;= \;0 \; \forall \;  h \in {\cal H}. \label{eq:like}

140: \end{eqnarray}

141: Here $\dot\ell_{\theta,\eta}$ is the $p$-dimensional

142: likelihood score for $\theta$, $B_{\theta,\eta}$ is the score

143: operator \citep{1262} working on an infinite dimensional class

144: ${\cal H}$ of directions $h$ from which paths of one-dimensional

145: submodels for $\eta$ may approach $\eta_0$, and $\PP_N$ is the empirical measure based on the i.i.d. sequence $X_1,\ldots,X_N$.

146: Set $\dot\ell_0=\dot\ell_{\theta_0,\eta_0}$ and $B_0=B_{\theta_0,\eta_0}$.

147:

148: Suppose the following assumptions, which slightly strengthen the

149: hypotheses of \citet[Theorem 25.90]{4895}, are satisfied so that

150: $\sqrt{N}(\hat\theta-\theta_0,\hat\eta-\eta_0)$ is asymptotically Gaussian:

151: \bd

152: \item[A1] for $(\theta,\eta)$ in a $\delta$-neighborhood of

153: $(\theta_0,\eta_0)$   the functions $\dot\ell_{\theta,\eta}$ and

154:  $\{B_{\theta,\eta}h, h \in {\cal H} \}$ are contained in a $P_0$-Donsker class ${\cal F}$;

155: \item[A2] $P_0\| \dot\ell_{\theta, \eta}-\dot\ell_0\|^2$

156: and $\sup_{h \in {\cal H}}P_0|B_{\theta, \eta }h-B_0h|^2$  converge

157: to $0$ as $(\theta,\eta) \rightarrow (\theta_0,\eta_0)$;

158: \item[A3]

159: the map $\Psi=(\Psi_1,\Psi_2):\Theta \times H \mapsto \RR^p\times \ell^{\infty}({\cal H})$

160: with components

161: \begin{eqnarray}

162: \Psi_1(\theta,\eta) & = & P_0\dot\ell_{\theta,\eta} \nonumber \\

163: \Psi_2(\theta,\eta) & = & P_0B_{\theta,\eta}h

164:               -P_{\theta,\eta}B_{\theta,\eta} h , \ \  h \in {\cal H}, \label{eq:expectedmap}

165: \end{eqnarray}

166: which is the expectation of the random map $\Psi_N=(\Psi_{N1},\Psi_{N2})$ in (\ref{eq:like}),

167: has a Fr\'echet derivative

168: $\dot\Psi_0$ at $(\theta_0,\eta_0)$ that is continuously invertible on its range.

169: \item[A4] $(\hat\theta,\hat\eta)$ is consistent for $(\theta_0,\eta_0)$ and satisfies

170: $\Psi_N(\hat\theta,\hat\eta)=0.$

171: \ed

172: Assumption \textbf{A3} is typically established by showing that the information

173: operator $B^*_0B_0$ is continuously invertible and thus that $\eta$ is

174: estimable at a $\sqrt{N}$ rate.

175: This is the most restrictive assumption, but one that leads quickly to our main result.

176:

177: With two phase sampling, however, $X$ is not observed for all $N$ subjects.

178: At phase one we observe only  a coarsening $\tilde X=\tilde X(X)$ of $X$

179: plus auxiliary variables $U \in {\cal U}$ that serve to determine the sampling strata.

180: $X$ is fully observed for subjects sampled at phase two.

181: Let $W=(X,U) \in {\cal W} = {\cal X} \times {\cal U}$ denote the variables

182: potentially available for everyone, but in fact fully observed only for those

183: in the phase two sample, and $V=(\tilde X,U) \in {\cal V} = {\cal \tilde X} \times {\cal U}$

184: denote the variables actually observed for everyone.

185: We write $\tilde{P}_0$ for the distribution of $W = (X,U)$ and

186: denote by $\Sigma_N=\sigma[W_1,\ldots,W_N]$ the sigma field of information,

187: also referred to as the complete data, potentially available for the $N$ subjects.

188: A sequence of binary indicators $(\xi_1,\ldots,\xi_N)$ shows which

189: subjects are selected $(\xi_i=1)$ at phase two for observation of $X_i$.

190: We consider two probability models for the indicators $\xi_i$.

191: In the first, known as Bernoulli or Manski-Lerman (\citeyear{200}) sampling,

192: each phase one subject is examined in succession for the value of

193: $V_i$ and the indicator $\xi_i$ is independently generated with

194: $\Pr(\xi_i=1|W_i)=\Pr(\xi_i=1|V_i) = \pi_0(V_i)$ where $\pi_0$ is a

195: known sampling function.

196: This preserves the i.i.d. structure for the observations $(\xi_i,V_i,\xi_iX_i)$.

197: Note the crucial missing at random (MAR) assumption: $\pi_0$

198: depends only on what is observed at phase one.

199: We write $Q_0$ for the

200: distribution of $(W_i, \xi_i)$.

201: If ${\cal V}$ is partitioned into $J$ strata ${\cal V}_1 \cup \cdots \cup {\cal V}_J$,

202: stratified Bernoulli sampling corresponds to the special case where

203: $\pi_0(v)=p_j$ for $v \in {\cal V}_j$.

204: We assume that all $J$ strata are sampled with positive probability, or more generally that

205: \begin{equation}

206: 0 < \sigma \leq \pi_0(v) \leq 1 \quad \mbox{for} \quad v \in {\cal V}.

207: \label{eq:boundedweights}

208: \end{equation}

209: Even though the sampling fractions are known, it is advisable to estimate

210: $\pi_0$ in order to increase the efficiency of WL \citep{3937}.

211: We consider estimation of $\pi_0$ using a parametric model in \S 6.

212:

213: The second sampling model corresponds to Neyman's original design

214: and is usually closer to actual practice.

215: Here we observe the entire phase one sample at once and record the

216: stratum frequencies

217: $N_j=\sum_{i=1}^N \b1_{{\cal V}_j}(V_i)$ for $j=1,\ldots,J$.

218: At phase two samples of size $n_j \leq N_j$ are drawn at random, without

219: replacement, from each of the $J$ finite phase one strata.

220: Using now a doubly subscripted notation where $\xi_{j,i}$

221: denotes the indicator variable for the $i^{\mbox{th}}$ subject in stratum

222: $j$, the essential features of this design are that, conditionally

223: on $\Sigma_N$: ($i$) for $j=1,\ldots,J$

224: the random variables $(\xi_{j,1},\ldots,\xi_{j,N_j})$ are exchangeable

225: with $\Pr(\xi_{j,i}=1|\Sigma_N)={n_j}/{N_j}$;

226: and ($ii$) the $J$ random vectors $(\xi_{j,1},\ldots,\xi_{j,N_j})$ are independent.

227: Our problem is to estimate $(\theta,\eta)$ using the incomplete observations $V_i$ on everyone and the complete observations

228: $X_i$ on subjects sampled at phase two.

229:

230: \section{Weighted Likelihood Estimator}

231:

232: WL estimates are obtained by solving Horvitz-Thompson

233: (IPW) versions of the likelihood equations.

234: Define the \textit{inverse probability weighted empirical measure} by

235: \begin{equation}

236: \PP_N^{\pi} = \frac{1}{N}\sum_{i=1}^N\frac{\xi_i}{\pi_i}\delta_{X_i} , \label{eq:empiricalmeasure}

237: \end{equation}

238: where $\delta_{X_i}$ denotes Dirac measure placing unit mass on $X_i$ and

239: \begin{eqnarray*}

240: \pi_i & = & \left \{ \begin{array}{ll} \pi_0(V_i) &

241: \mbox{\ for\ Bernoulli\ sampling}\\ \mbox{ } \\ \frac{n_j}{N_j}

242:               \mbox{\ if\ }V_i\in{\cal V}_j &

243: \mbox{\ for\ finite\ population\ stratified\ sampling}. \end{array} \right .

244: \end{eqnarray*}

245: Then, instead of (\ref{eq:like}) we solve

246: \begin{eqnarray}

247: \Psi_{N1}^{\pi}(\theta,\eta) & = & \PP_N^{\pi}\dot\ell_{\theta,\eta} \, = \,  0 \nonumber \\

248: \Psi_{N2}^{\pi}(\theta,\eta) & = & \PP_N^{\pi}B_{\theta,\eta}h - P_{\theta, \eta} B_{\theta, \eta} h

249: \,  = \, 0  \qquad \mbox{for all} \ \ h \in {\cal H}. \label{eq:IPWlike}

250: \end{eqnarray}

251:

252: In view of the MAR assumption, for any integrable function $f:{\cal X} \mapsto \RR$

253: and under either Bernoulli or finite population stratified sampling,

254: \[

255: \E\frac{\xi_i}{\pi_i}f(X_i) = \E \left [

256: \E \left ( \left . \frac{\xi_i}{\pi_i} \right | \Sigma_N \right ) f(X_i) \right ] = \E f(X_i), \;  \; i=1,\ldots,N,

257: \]

258: so that $\E \PP_N^{\pi}f = \E \PP_N f = P_0f$.
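To spell out the conditional expectation step, the following short check uses only the two sampling models of \S 2.
Under Bernoulli sampling each $\xi_i$ is generated independently given $W_i$, while under finite population stratified sampling $\pi_i=n_j/N_j$ is $\Sigma_N$-measurable, so that
\[
\E \left ( \left . \frac{\xi_i}{\pi_i} \right | \Sigma_N \right )
= \frac{\Pr(\xi_i=1|\Sigma_N)}{\pi_i}
= \left \{ \begin{array}{ll} \pi_0(V_i)/\pi_0(V_i) = 1 &
\mbox{\ for\ Bernoulli\ sampling}\\[.2cm]
(n_j/N_j)/(n_j/N_j) = 1 &
\mbox{\ for\ finite\ population\ stratified\ sampling}. \end{array} \right .
\]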

259: Consequently, the random map $\Psi_N^{\pi}=(\Psi_{N1}^{\pi},\Psi_{N2}^{\pi})$

260: defined by (\ref{eq:IPWlike}) has the same expectation as the random

261: map $\Psi_N$ in (\ref{eq:like}), namely $\Psi=(\Psi_1,\Psi_2)$ as in (\ref{eq:expectedmap}).

262: The implication is that the assumptions \textbf{A1}-\textbf{A4}

263: made to guarantee the asymptotic normality of the ML estimator

264: based on complete phase one data are also the assumptions needed

265: to guarantee the asymptotic normality of the WL estimator based on two phase data.

266: Indeed, van der Vaart's (\citeyear{4895}) Theorem 25.90,

267: or more precisely his Theorem 19.26 of which it is a restatement,

268: applies virtually without change to the Bernoulli sampling setup.

269: The Donsker class ${\cal F}$ in \textbf{A1} is modified to

270: $\tilde{\cal F} = \{ [\xi/\pi_0(V)] f(X), f \in {\cal F}\}$.

271: Since under the hypothesis (\ref{eq:boundedweights}) it is the product

272: of a fixed bounded function with the Donsker class ${\cal F}$, the fact

273: that $\tilde{\cal F}$ is Donsker for the joint distribution $Q_0$

274: of $(W, \xi)$ follows from \citet[example 2.10.10]{4920}.

275: The random map $\Psi_N^{\pi}$ corresponding to the estimating

276: functions (\ref{eq:IPWlike}) is the ordinary empirical measure

277: $\QQ_N$ for $\{(W_i, \xi_i), i=1,\ldots, N\}$ applied to the unbiased

278: estimating functions $(\xi/\pi_0)\dot\ell_{\theta,\eta}$ and $(\xi/\pi_0)B_{\theta,\eta}h$.

279: \textbf{A4} will generally follow from (\ref{eq:boundedweights})

280: and the arguments used to establish consistency for the complete data ML estimator,

281: together with (\ref{eq:IPWlike}).

282: \textbf{A2} and \textbf{A3} are unchanged.

283: The more general Theorem 3.3.1 of \citet{4920} is needed, however,

284: to deal with the non i.i.d. data induced by finite population stratified sampling.

285: To verify its hypotheses, we first must establish weak convergence

286: of the empirical process based on $\PP_N^{\pi}$.

287:

288: \section{Weak Convergence of the IPW Empirical Process}

289:

290: Two phase stratified sampling resembles the bootstrap in that it

291: involves random sampling from the finite, albeit incompletely

292: observed, population $\{X_1,\ldots,X_N\}$.

293: Here we use results on weighted bootstrap empirical processes

294: from \citet[Theorem 2.2]{4922}, as incorporated in \citet[Theorem 3.6.13]{4920},

295: to demonstrate weak convergence of the IPW empirical process

296: $\mathbb{G}_N^{\pi}=\sqrt{N}(\PP_N^{\pi}-P_0)$ for finite population stratified sampling.

297: First note that, with the subscript $j,i$ denoting the

298: $i^{\mbox{th}}$ of $N_j$ observations in stratum $j$,

299: \begin{eqnarray}

300: \mathbb{P}_N^{\pi} & = & \frac{1}{N}\sum_{j=1}^J\frac{N_j}{n_j}\sum_{i=1}^{N_j}\xi_{j,i}\delta_{X_{j,i}}

301:  =  \frac{1}{N}\sum_{j=1}^J \frac{N_j^2}{n_j} \mathbb{P}_{j,N_j}^{\xi} \label{eq:bootstrap}

302: \end{eqnarray}

303: where

304: \[

305: \mathbb{P}_{j,N_j}^{\xi} = \frac{1}{N_j}\sum_{i=1}^{N_j}\xi_{j,i}\delta_{X_{j,i}}

306: \]

307: is a \textit{finite sampling empirical measure} for the $j^{\mbox{th}}$

308: stratum.

309: Similarly one can express the ordinary empirical measure as

310: \begin{equation}

311: \mathbb{P}_N = \frac{1}{N}\sum_{j=1}^JN_j\mathbb{P}_{j,N_j}

312: \label{eq:empirical}

313: \end{equation}

314: where

315: \begin{equation}

316: \mathbb{P}_{j,N_j} = \frac{1}{N_j} \sum_{i=1}^N \delta_{X_i} \b1_{{\cal V}_j} (V_i)

317: = \frac{1}{N_j}\sum_{i=1}^{N_j}\delta_{X_{j,i}}

318: \label{eq:stratumwiseEmpirical}

319: \end{equation}

320: denotes the empirical measure for the $j^{\mbox{th}}$ stratum.

321: Justification of the second (doubly indexed) form is given in Appendix A.

322:

323: Combining (\ref{eq:bootstrap}) and (\ref{eq:empirical}), and letting

324: $\GG_N=\sqrt{N}(\PP_N-P_0)$ denote the standard empirical process, we have

325: \begin{eqnarray}

326: \mathbb{G}_N^{\pi}

327: & = & \sqrt{N}\left(\mathbb{P}_N^{\pi}-P_0\right) \nonumber \\

328: & = & \sqrt{N}\left(\mathbb{P}_N-P_0\right)

329:            +\sqrt{N}\left(\mathbb{P}_N^{\pi}-\mathbb{P}_N\right) \nonumber \\

330: & = & \GG_N + \frac{1}{\sqrt{N}}\sum_{j=1}^J

331:              \left(\frac{N_j^2}{n_j}\right)\left(\mathbb{P}_{j,N_j}^{\xi}-\frac{n_j}{N_j}

332:              \mathbb{P}_{j,N_j}\right) \nonumber \\

333: & = & \GG_N

334:             +\sum_{j=1}^J\sqrt{\frac{N_j}{N}}\left(\frac{N_j}{n_j}\right)\mathbb{G}^{\xi}_{j,N_j}

335:             \label{eq:IPWempiricalexpansion}

336: \end{eqnarray}

337: where

338: \begin{equation}

339: \mathbb{G}^{\xi}_{j,N_j} = \sqrt{N_j}\left(\mathbb{P}_{j,N_j}^{\xi}

340:        -\frac{n_j}{N_j}\mathbb{P}_{j,N_j}\right) \label{eq:weightedbootstrapprocess}

341: \end{equation}

342: is the \textit{finite sampling empirical process} for stratum $j$.

343:

344: The first term in (\ref{eq:IPWempiricalexpansion}) converges to the

345: $P_0$-Brownian bridge process $\mathbb{G}$ indexed by the Donsker

346: class ${\cal F}$ mentioned in \textbf{A1}.

347: Let $P_{0|j}(\cdot)=\E(\cdot|V \in{\cal V}_j)$ denote $\tilde{P}_0$

348: conditional on membership in stratum $j$, \textit{i.e.},  for measurable

349: $A \subset {\cal X}$, $P_{0|j} (A) = \tilde P_0 [A  {\bf 1}_{{\cal V}_j}(V)]/\nu_j$

350: with $\nu_j= \tilde P_0\b1_{{\cal V}_j}(V)$,

351: and let $\mathbb{G}_j$ denote the

352: $P_{0|j}$-Brownian bridge, also indexed by $ {\cal F}$.

353: Our goal is to establish the weak convergence of the remaining terms

354: on the RHS of (\ref{eq:IPWempiricalexpansion}).

355: If as $N \rightarrow \infty$ the sampling fractions converge

356: with $n_j/N_j \rightarrow p_j$,  the assumption on the exchangeable

357: ``weights" $(\xi_{j,1},\ldots,\xi_{j,N_j})$ in equation (3.6.8) of \citet{4920} holds trivially with

358: \[

359: \frac{1}{N_j}\sum_{i=1}^{N_j}\left(\xi_{j,i}-\bar\xi_{j.}\right)^2

360: \stackrel{\mbox{p}}{\rightarrow} p_j(1-p_j) \label{eq:VdVW368}.

361: \]
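As a brief check of this limit, note that each $\xi_{j,i}$ equals $0$ or $1$ and exactly $n_j$ of them equal $1$.
Hence $\bar\xi_{j.}=n_j/N_j$, and the empirical variance of the weights is nonrandom given $\Sigma_N$:
\[
\frac{1}{N_j}\sum_{i=1}^{N_j}\left(\xi_{j,i}-\bar\xi_{j.}\right)^2
= \bar\xi_{j.}\left(1-\bar\xi_{j.}\right)
= \frac{n_j}{N_j}\left(1-\frac{n_j}{N_j}\right) \rightarrow p_j(1-p_j).
\]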

362: Furthermore, with $\rightsquigarrow$ denoting weak convergence in

363: $\ell^{\infty}({\cal F})$, $\sqrt{N_j} ( \PP_{j,N_j} - P_{0|j} ) \rightsquigarrow \GG_{j}$;

364: see Appendix B for the proof.

365: Thus Theorems 3.6.13 and 1.12.4 of \citet{4920} imply that, for almost

366: every sequence of complete data,

367: $\mathbb{G}_{j,N_j}^{\xi} \rightsquigarrow \sqrt{p_j(1-p_j)}\mathbb{G}_j$.

368: Conditionally on $\Sigma_N$, the processes $\mathbb{G}^{\xi}_{j,N_j}$

369: are mutually independent because of the independence of the

370: $\{\xi_{j,i}\}$ in different strata.

371: Furthermore, by virtue of the fact that they also are (unconditionally)

372: uncorrelated with $\mathbb{G}_N=\sqrt{N}(\mathbb{P}_N-P_0)$,

373: which follows along the lines of \citet[Corollary 2.9.3]{4920},

374: or that (conditionally) they have the same limiting distributions for

375: almost all sequences of data, the vector of processes

376: $(\mathbb{G}_N,\mathbb{G}_{1,N_1}^{\xi},\ldots,\mathbb{G}_{J,N_J}^{\xi})$

377: converges weakly to the vector of independent Brownian bridge processes

378: $(\mathbb{G},\mathbb{G}_1,\ldots,\mathbb{G}_J)$.

379: Consequently

380: \begin{equation}

381: \mathbb{G}_N^{\pi} \rightsquigarrow \mathbb{G}

382: + \sum_{j=1}^J\sqrt{\nu_j}\sqrt{\frac{1-p_j}{p_j}} \mathbb{G}_j.

383: \label{eq:limIPWempiricalprocess}

384: \end{equation}

385: This result formalizes and extends Proposition 1 of \citet{218}

386: and the arguments in \S 4 of \citet{4324}.
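For a fixed real-valued $f \in {\cal F}$ the variance of the limit in (\ref{eq:limIPWempiricalprocess}) can be read off directly.
The following is a sketch that uses only the independence of $\GG,\GG_1,\ldots,\GG_J$ and the usual Brownian bridge variances; $\Var_0(f)$ and $\Var_j(f)$ are shorthand, introduced here, for the variances of $f(X)$ under $P_0$ and $P_{0|j}$:
\[
\Var \left ( \GG f + \sum_{j=1}^J\sqrt{\nu_j}\sqrt{\frac{1-p_j}{p_j}}\, \GG_j f \right )
= \Var_0(f) + \sum_{j=1}^J \nu_j \frac{1-p_j}{p_j} \Var_j(f).
\]
The $j^{\mbox{th}}$ summand vanishes for any stratum that is fully sampled ($p_j=1$).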

387:

388: \section{Asymptotic Distributions of the WL Estimator}

389:

390: We apply Theorem 19.26 of \citet{4895} to conclude that,

391: under Bernoulli sampling,

392: \begin{equation}

393: \sqrt{N} \dot\Psi_0 \left( \begin{array}{c}

394: \hat{\theta}-\theta_0 \\ \hat{\eta}-\eta_0 \end{array} \right )

395: = -\GG_N \frac{\xi}{\pi_0}

396: \left (  \begin{array}{c} \dot\ell_0 \\

397:  B_0h \end{array} \right ) \; + \; o_p(1). \label{eq:BernoulliForm}

398: \end{equation}

399: Similarly, using Theorem 3.3.1 of \citet{4920} together with the development

400: of the previous section, we conclude that

401: for finite population stratified sampling

402: \begin{equation}

403: \sqrt{N} \dot\Psi_0 \left( \begin{array}{c}

404: \hat{\theta}-\theta_0 \\ \hat{\eta}-\eta_0 \end{array} \right )

405: = -\GG_N^{\pi}

406: \left ( \begin{array}{c} \dot\ell_0 \\

407: B_0h \end{array} \right ) \;+ \;o_p(1). \label{eq:FPSSForm}

408: \end{equation}

409: We have already argued that the hypotheses of the first theorem

410: follow from appropriately modified versions of \textbf{A1}-\textbf{A4}.

411: Together with the weak convergence of $\mathbb{G}_N^{\pi}$

412: just established, they also suffice for the second theorem.

413: In particular, the stochastic condition (3.3.2) of \citet{4920} follows

414: from \textbf{A1} and \textbf{A2} together with the proof of their Lemma 3.3.5

415: applied to each of $\GG_N,\GG^{\xi}_{1,N_1},\ldots, \GG^{\xi}_{J,N_J}$.

416:

417: In practice attention is usually focused on inferences for the Euclidean parameter $\theta$.

418: To derive a general expression for the asymptotic variance of $\hat\theta$ we further assume

419: \bd

420: \item[A5] $\dot\Psi_0$ admits a partition as in equation (25.91)

421: of \citet{4895} where the information operator

422: $B_0^* B_0$ is continuously invertible.

423: \ed

424: Following closely the arguments in \S 25.12 of van der Vaart, we

425: calculate from (\ref{eq:BernoulliForm}) that under Bernoulli sampling

426: \begin{equation}

427: \sqrt{N}(\hat{\theta}-\theta_0) = \GG_N \frac{\xi}{\pi_0}\tilde\ell_0 + o_p (1) \label{eq:mainresultBernoulli}

428: \end{equation}

429: whereas from (\ref{eq:FPSSForm}) under finite population stratified sampling

430: \begin{equation}

431: \sqrt{N}(\hat{\theta}-\theta_0)= \GG_N^{\pi} \tilde\ell_0 + o_p (1),

432: \label{eq:mainresultFPSS}

433: \end{equation}

434: where in both cases $\tilde\ell_0$ denotes the efficient influence function

435: \begin{equation}

436: \tilde\ell_0 = \tilde I_0^{-1}\left ( I-B_0\left ( B^*_0B_0 \right )^{-1}  B^*_0 \right ) \dot\ell_0 \label{eq:effinfluence}

437: \end{equation}

438: and

439: \begin{equation}

440: \tilde I_0

441: = P_0 \left [  \left ( I-B_0\left ( B^*_0B_0 \right )^{-1}  B^*_0 \right )

442: \dot\ell_0\dot\ell_0^{T} \right ] \label{eq:effinfo}

443: \end{equation}

444: is the efficient information.

445: Since $P_0\tilde\ell_0=0$, moreover, both (\ref{eq:mainresultBernoulli}) and (\ref{eq:mainresultFPSS}) may be expressed

446: \begin{equation}

447: \sqrt{N}(\hat\theta-\theta_0) = \sqrt{N} \PP_N^{\pi}\tilde\ell_0 + o_p(1)=

448: \frac{1}{\sqrt{N}} \sum_{i=1}^N \frac{\xi_i}{\pi_i}\tilde\ell_0(X_i) +o_p(1), \label{eq:general}

449: \end{equation}

450: which expansion constitutes the principal result of this paper.

451:

452: Under Bernoulli sampling with known $\pi_0$ the asymptotic variance is therefore

453: \begin{eqnarray}

454: \Var_{\mbox{A}}\sqrt{N}(\hat\theta-\theta_0) & = &

455: \Var \left ( \frac{\xi}{\pi_0} \tilde\ell_0 \right ) \nonumber \\

456:  & = &

457: \Var \; \E \left ( \left . \frac{\xi}{\pi_0} \tilde\ell_0 \right | W \right ) +

458: \E \; \Var \left ( \left . \frac{\xi}{\pi_0} \tilde\ell_0 \right | W \right ) \nonumber \\

459: & = & \Var(\tilde\ell_0) + \E \left [ \frac{\tilde\ell_0^{\otimes 2}}{\pi_0^2} \Var(\xi|W)\right ]

460: \nonumber \\

461: & = & \tilde I_0^{-1} + \tilde P_0 \left ( \frac{1-\pi_0}{\pi_0}\tilde\ell_0^{\otimes 2} \right ). \label{eq:IPWvariance}

462: \end{eqnarray}

463: In the special case of stratified Bernoulli sampling, with $\pi_i=\pi_0(V_i)=p_j$ for $V_i \in {\cal V}_j$, this becomes

464: \begin{equation}

465: \tilde I_0^{-1} + \sum_{j=1}^J\nu_j\frac{1-p_j}{p_j}P_{0|j}\left (\tilde\ell_0^{\otimes 2} \right ).

466: \label{eq:varIPWstrataiid}

467: \end{equation}

468: On the other hand, from (\ref{eq:limIPWempiricalprocess}) and

469: (\ref{eq:mainresultFPSS}), the asymptotic variance under finite population stratified sampling is

470: \begin{equation}

471: \tilde I_0^{-1} + \sum_{j=1}^J\nu_j\frac{1-p_j}{p_j}\Var_j(\tilde\ell_0), \label{eq:varIPWstratified}

472: \end{equation}

473: where $\Var_j(f)=P_{0|j}(f^{\otimes 2})-P^{\otimes 2}_{0|j}(f)$.

474: Comparing the last two expressions shows the substantial potential gain from

475: keeping track of the stratum frequencies for the phase one data.
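The comparison can be made explicit.
As a sketch using only the definitions of $\nu_j$ and $P_{0|j}$ given earlier, the second term of (\ref{eq:IPWvariance}) reduces under stratified Bernoulli sampling by splitting $\tilde P_0$ over the strata,
\[
\tilde P_0 \left ( \frac{1-\pi_0}{\pi_0}\tilde\ell_0^{\otimes 2} \right )
= \sum_{j=1}^J \frac{1-p_j}{p_j} \tilde P_0 \left ( \b1_{{\cal V}_j}(V) \tilde\ell_0^{\otimes 2} \right )
= \sum_{j=1}^J \nu_j \frac{1-p_j}{p_j} P_{0|j} \left ( \tilde\ell_0^{\otimes 2} \right ),
\]
and subtracting (\ref{eq:varIPWstratified}) from (\ref{eq:varIPWstrataiid}) leaves
\[
\sum_{j=1}^J \nu_j \frac{1-p_j}{p_j} \left ( P_{0|j}\tilde\ell_0 \right )^{\otimes 2}.
\]
This matrix is positive semidefinite, and it vanishes only when the stratum means $P_{0|j}\tilde\ell_0$ are zero in every stratum with $\nu_j>0$ and $p_j<1$.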

476:

477: \section{Bernoulli Sampling with Estimated Weights}

478: Let ${\cal V}_0$ denote an additional stratum, possibly null, such that $\xi_i=1$ for $V_i \in {\cal V}_0$.

479: Introduction of this special stratum with $p_0=1$ does not affect the previous development;

480: in particular, equations (\ref{eq:general})-(\ref{eq:varIPWstratified}) continue to hold.

481: For $V_i \notin {\cal V}_0$ suppose

482: \begin{equation}

483: \Pr(\xi_i=1|X_i, V_i;\alpha) = \Pr(\xi_i=1|V_i;\alpha) = \pi_{\alpha}(V_i)  < 1 \label{eq:modelforPi}

484: \end{equation}

485: where $\alpha \in \Xi \subset \RR^q$

486: is a parameter to be estimated by

487: maximum likelihood from the sampling indicators and phase

488: one observations $\{(\xi_i,V_i), i=1,\ldots,N\}$ with $V_i \notin {\cal V}_0$.

489: We assume sufficient regularity in the model for $\alpha$, e.g., to satisfy the hypotheses of

490: Theorem 5.21 of \citet{4895}, so that the ML estimator

491: $\hat\alpha$ is consistent and asymptotically normal with influence function

492: \begin{equation}

493: \tilde\ell^{\alpha}_0= \b1_{{\cal V}_0^c}

494: \left ( \tilde P_0 \b1_{{\cal V}_0^c} \frac{\dot\pi_0^{\otimes 2}} {\pi_0(1-\pi_0)} \right )^{-1} \dot\pi_0\frac{\xi-\pi_0}{\pi_0(1-\pi_0)}.

495: \label{eq:InfFuncAlpha}

496: \end{equation}

497: Here for $V \in {\cal V}_0^c$, the complement of ${\cal V}_0$,

498: $\pi_0(V)=\pi_{\alpha_0}(V)$ is the true sampling function while $\dot\pi_0(V)$

499: denotes the $q$-vector of partial derivatives of $\pi_{\alpha}(V)$ with

500: respect to $\alpha$ evaluated at $\alpha=\alpha_0$.

501: If $\hat\theta(\alpha)$ denotes the WL estimator under two phase

502: Bernoulli sampling with ``known" sampling function $\pi_{\alpha}(V)$,

503: then from (\ref{eq:InfFuncAlpha}) and (\ref{eq:general}) we have

504: \begin{equation}

505: \sqrt{N}\left(\begin{array}{c}\hat\theta(\alpha_0)-\theta_0\\[.2cm]

506: \hat\alpha-\alpha_0 \end{array}\right)

507: = \sqrt{N}\left(\begin{array}{c}\PP_N^{\pi}\tilde\ell_0 \\[.2cm] \QQ_N\tilde\ell^{\alpha}_0

508: \end{array}\right) + o_p(1). \label{eq:jointExpansion}

509: \end{equation}

510: Furthermore, with $\hat\pi_i=\pi_{\hat\alpha}(V_i)$ for

511: $V_i \in {\cal V}_0^c$ and $\hat\pi_i=1$ otherwise, we show in Appendix C that, under some further mild assumptions regarding $\pi_{\alpha}(V)$,


514: \begin{equation}

515: \sqrt{N}(\PP_N^{\hat\pi}-\PP_N^{\pi_0})\tilde\ell_0

516: = - \tilde  P_0 \left (\b1_{{\cal V}_0^c} \frac{\tilde\ell_0

517: \dot\pi^{T}_0}{\pi_0}\right ) \sqrt{N}(\hat\alpha-\alpha_0) + o_p(1). \label{eq:TaylorExpansion}

518: \end{equation}

519: The joint asymptotic normality of $(\hat\theta(\alpha_0),\hat\alpha)$ that follows

520: from (\ref{eq:jointExpansion}), together with the Taylor expansion

521: (\ref{eq:TaylorExpansion}), are precisely the hypotheses used by \citet{855}

522: to deduce that $\sqrt{N}[\hat\theta(\hat\alpha)-\theta_0] \rightsquigarrow Z $

523: where  $Z\in \RR^p $ is mean zero Gaussian with covariance matrix

524: \begin{equation}

525: \Var_{\mbox{A}}\sqrt{N}\left(\hat\theta(\hat\alpha)-\theta_0\right)

526: = \Var  \left ( \frac{\xi}{\pi_0}\tilde\ell_0 \right ) - \tilde P_0 \b1_{{\cal V}_0^c}

527:     \frac{\tilde\ell_0 \dot\pi_0^T}{\pi_0} \left ( \tilde P_0 \b1_{{\cal V}_0^c}

528:     \frac{\dot\pi_0^{\otimes 2}}{\pi_0(1-\pi_0)} \right )^{-1}

529:     \tilde P_0 \b1_{{\cal V}_0^c} \frac{\dot\pi_0\tilde\ell_0^T }{\pi_0}. \label{eq:VarEstAlpha}

530: \end{equation}

531: A matrix calculation shows that, when (\ref{eq:VarEstAlpha}) is evaluated for stratified Bernoulli sampling

532: \[

533:  \pi_{\alpha} = \pi_{\alpha}(V) = \left \{ \begin{array}{ll} 1,  & V \; \in \; {\cal V}_0 \\

534:  \alpha_j, & V \; \in \; {\cal V}_j, \; j=1,\ldots,J , \end{array} \right .

535: \]

536: the asymptotic variance for the WL estimator $\hat\theta$ with

537: \textit{estimated} sampling probabilities $\hat\alpha_j=n_j/N_j$ is identical to

538: the finite population sampling variance (\ref{eq:varIPWstratified}) with $p_j=\alpha_{j,0} = \lim n_j/N_j$.
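The matrix calculation can be sketched as follows.
For illustration we take $\alpha=(\alpha_1,\ldots,\alpha_J)$ with $0<p_j<1$ and write $e_j$ for the $j^{\mbox{th}}$ unit vector in $\RR^J$ (notation introduced here), so that $\dot\pi_0(V)=e_j$ for $V \in {\cal V}_j$.
Then
\[
\tilde P_0 \b1_{{\cal V}_0^c} \frac{\dot\pi_0^{\otimes 2}}{\pi_0(1-\pi_0)}
= \sum_{j=1}^J \frac{\nu_j}{p_j(1-p_j)} e_j e_j^{T}
\quad \mbox{and} \quad
\tilde P_0 \b1_{{\cal V}_0^c} \frac{\tilde\ell_0 \dot\pi_0^{T}}{\pi_0}
= \sum_{j=1}^J \frac{\nu_j}{p_j} \left ( P_{0|j}\tilde\ell_0 \right ) e_j^{T},
\]
so the term subtracted in (\ref{eq:VarEstAlpha}) equals
\[
\sum_{j=1}^J \frac{\nu_j}{p_j} \left ( P_{0|j}\tilde\ell_0 \right )
\frac{p_j(1-p_j)}{\nu_j}
\frac{\nu_j}{p_j} \left ( P_{0|j}\tilde\ell_0 \right )^{T}
= \sum_{j=1}^J \nu_j \frac{1-p_j}{p_j} \left ( P_{0|j}\tilde\ell_0 \right )^{\otimes 2}.
\]
Since $\Var(\xi\tilde\ell_0/\pi_0)$ is given by (\ref{eq:varIPWstrataiid}) in this case, the subtraction replaces each $P_{0|j}(\tilde\ell_0^{\otimes 2})$ by $\Var_j(\tilde\ell_0)$, which yields (\ref{eq:varIPWstratified}).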

539:

540: Two possibilities present themselves for estimation of the terms in (\ref{eq:VarEstAlpha}).

541: Let $\hat\pi_i= \pi_{\hat\alpha}(V_i)$ for $V_i \in {\cal V}_0^c$ else $\hat\pi_i=1$.

542: Then, using (\ref{eq:IPWvariance}), we could estimate the first term by

543: \[

544: \widehat{\Var\left ( \frac{\xi}{\pi_0}\tilde\ell_0 \right )}

545: = \tilde I^{-1}_{\hat\theta,\hat\eta} + \frac{1}{N}\sum_{i=1}^N\frac{\xi_i(1-\hat\pi_i)}{\hat\pi_i^2}\tilde\ell_{\hat\theta,\hat\eta}^{\otimes 2}(X_i),

546: \]

547: the expression in the middle of the second term by

548: \[

549: \widehat {\tilde P_0 \b1_{{\cal V}_0^c} \frac{\dot\pi_0^{\otimes 2}}{\pi_0(1-\pi_0)} }

550: = \frac{1}{N}\sum_{i=1}^N \b1_{{\cal V}_0^c}(V_i)

551: \frac{\dot\pi_{\hat\alpha}^{\otimes 2}(V_i)}{\hat\pi_i(1-\hat\pi_i)}

552: \]

553: and similarly for $\tilde P_0(\tilde\ell_0\dot\pi_0^{T}/\pi_0)$.

554: A more empirical approach, however, would be to use the

555: $\theta$ and $\alpha$ influence function contributions themselves to estimate these terms as in

556: \begin{eqnarray*}

557: \widehat{\Var\left ( \frac{\xi}{\pi_0}\tilde\ell_0 \right ) }

558: & = & \frac{1}{N}\sum_{i=1}^N \left ( \frac{\xi_i}{\hat\pi_i}\tilde\ell_{\hat\theta,\hat\eta}(X_i) \right )^{\otimes 2}, \\

559: \widehat {\tilde P_0 \b1_{{\cal V}_0^c} \frac{\tilde\ell_0\dot\pi_0^{T} } {\pi_0} }

560: & = & \frac{1}{N}\sum_{i=1}^N \b1_{{\cal V}_0^c}(V_i) \frac{\xi_i}{\hat\pi_i}

561:           \frac{\tilde\ell_{\hat\theta,\hat\eta}(X_i)} {\hat\pi_i} \dot\pi_{\hat\alpha}(V_i)^{T} \\

562: & = & \frac{1}{N}\sum_{i=1}^N \b1_{{\cal V}_0^c }(V_i)

563:            \left ( \frac{\xi_i\tilde\ell_{\hat\theta,\hat\eta}(X_i)}{\hat\pi_i} \right )

564:            \left ( \frac{\dot\pi_{\hat\alpha}(V_i)^{T}(\xi_i-\hat\pi_i)}{\hat\pi_i(1-\hat\pi_i)} \right ) \quad \mbox{and} \\

565: \widehat { \tilde P_0 \b1_{{\cal V}_0^c} \frac{\dot\pi_0^{\otimes 2}}{\pi_0(1-\pi_0)} }

566: & = & \frac{1}{N}\sum_{i=1}^N \b1_{{\cal V}_0^c }(V_i)

567:             \left ( \frac{ \dot\pi_{\hat\alpha}(V_i)(\xi_i-\hat\pi_i)}{\hat\pi_i(1-\hat\pi_i)} \right )^{\otimes 2}.

568: \end{eqnarray*}

569: The resulting asymptotic variance for $\hat\theta$ may be recognized as comprising the residual sums of squares and of cross products from the least squares regressions of each of the $p$ components of the $\hat\theta$ influence function contributions $\xi_i\tilde\ell_{\hat\theta,\hat\eta}(X_i)/\hat\pi_i$, to which subjects not in the

570: phase two sample contribute 0, on the $q$ components of the estimated $\hat\alpha$ influence function contributions (\ref{eq:InfFuncAlpha}), to which subjects having $V_i \in {\cal V}_0$ contribute 0.

571: See \citet{4931} for a recent discussion and interpretation.

572: This suggests the following estimation procedure:

573: \begin{enumerate}

574: \item Estimate $\alpha$ from the phase one data and compute the estimated sampling fractions $\hat\pi_i$.

575: \item Estimate $\theta$ and $\eta$ from the phase two data by WL, using the inverse $\hat\pi_i$ as known weights.

576: \item Regress each component of the influence function contributions for $\hat\theta$ on those for $\hat\alpha$.

577: \item Estimate $\Var_{\mbox{A}}(\hat\theta)$ as the matrix comprising the residual sums of squares and of cross products from these regressions.

578: \end{enumerate}
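
Steps 3 and 4 rest on a familiar least squares identity, sketched here under the assumption that the regressions are fitted without an intercept; the symbols $a_i$, $b_i$ and $\hat B$ are introduced only for this illustration. Write
\[
a_i = \frac{\xi_i}{\hat\pi_i}\tilde\ell_{\hat\theta,\hat\eta}(X_i) \in \RR^p , \qquad
b_i = \b1_{{\cal V}_0^c}(V_i) \frac{\dot\pi_{\hat\alpha}(V_i)(\xi_i-\hat\pi_i)}{\hat\pi_i(1-\hat\pi_i)} \in \RR^q ,
\]
and let $\hat B = \left ( \sum_{i} a_i b_i^T \right ) \left ( \sum_{i} b_i^{\otimes 2} \right )^{-1}$ denote the $p \times q$ matrix of fitted regression coefficients. Then
\[
\frac{1}{N}\sum_{i=1}^N \left ( a_i - \hat B b_i \right )^{\otimes 2}
= \frac{1}{N}\sum_{i=1}^N a_i^{\otimes 2}
- \left ( \frac{1}{N}\sum_{i=1}^N a_i b_i^T \right )
  \left ( \frac{1}{N}\sum_{i=1}^N b_i^{\otimes 2} \right )^{-1}
  \left ( \frac{1}{N}\sum_{i=1}^N b_i a_i^T \right ) ,
\]
which is (\ref{eq:VarEstAlpha}) with each term replaced by its empirical estimate. Replacing $b_i$ by any nonsingular linear transformation of it, such as the estimated influence function contributions for $\hat\alpha$, leaves the residuals and hence the left side unchanged.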

579: \citet[p. 166]{4901}, who cited earlier work by \citet{4902}, suggested this

580: procedure for the special case of Cox regression, to which we now direct our attention.

581:

582: \section{Application to the Cox Proportional Hazards Model}

583: Our development of the Cox model follows closely that of \citet[\S 25.12]{4895}

584: where $X=(\Delta,T,Z)$ with $T=\min(\tilde T,C)$ a censored failure time,

585: $\Delta=\b1_{[\tilde T \leq C]}$ the failure indicator and $Z \in \RR^p$ a vector of covariates.

586: The Euclidean parameter is the $p$-vector of regression coefficients $\beta$ in the linear predictor $z\beta$.

587: The nonparametric parameter $\eta=(\Lambda,G,G_Z)$ has three

588: infinite dimensional components: $\Lambda(\cdot)=\int_0^{\cdot}\lambda(s)ds$ the baseline cumulative

589: hazard function, assumed differentiable; $G(t|z)=\Pr(C \leq t|Z=z)$ the conditional distribution of the

590: censoring time; and $G_Z$, the marginal distribution of the covariates.

591: We introduce the usual notation for the ``at risk" process

592: $Y(t)=\b1_{[T \geq t]}$ and the event counting process $N(t)=\Delta \b1_{[T \leq t]}$

593: and we make the standard assumptions: (i) that the true failure time $\tilde T$

594: and $C$ are independent given $Z$; and (ii) that there is a finite maximum

595: censoring time $\tau$ such that $\Pr[Y(\tau)=1]>0$.

596: \citet{4895} makes some further ``partly unnecessary" assumptions to simplify his development, namely that the covariates $Z$ are bounded,

597: that $G$ and $G_Z$ have densities as indicated and especially that $\Pr(C \geq \tau) = \Pr(C = \tau) >0$ (see discussion in \S 8).

598: Writing the density for $x=(\delta,t,z)$, with $z$ a row vector, as

599: \begin{equation}

600: e^{-e^{z\beta}\Lambda(t)}\left [e^{z\beta}\lambda(t)\left (1-G(t-|z)\right)\right]^{\delta}

601: \left[g(t|z)\right]^{1-\delta}g_Z(z) \label{eq:Coxlike},

602: \end{equation}

603: and noting that $G$ and $G_Z$ factor out of the complete data likelihood, \citet{4895}

604: considers ML estimation for $(\beta,\Lambda)$ only.

605: With ${\cal H}$ denoting various subsets of the space BV$[0,\tau]$ of bounded functions of bounded variation,

606: he develops the following explicit expressions for the $\beta$ score vector,

607: the $\Lambda$ score operator that maps functions $h \in {\cal H}$ to functions of the data, its adjoint (but only evaluated for the

608: $\beta$ scores) and the information operator that maps ${\cal H}$ onto itself:

609: \begin{eqnarray}

610: \dot\ell_{\beta,\Lambda}(x)

611: & = & \delta z -ze^{z\beta}\Lambda(t) \label{eq:Coxscore} \\

612: B_{\beta,\Lambda}h(x)

613: & = & \delta h(t) - e^{z\beta}\int_0^thd\Lambda \label{eq:CoxScoreOp} \\

614: B^*_{\beta,\Lambda}\dot\ell_{\beta,\Lambda}(t)

615: & = & P_{\beta,\Lambda}Y(t)Ze^{Z\beta} \nonumber \\

616: B^*_{\beta,\Lambda}B_{\beta,\Lambda}h(t)

617: & = & h(t) P_{\beta,\Lambda}Y(t)e^{Z\beta}. \nonumber

618: \end{eqnarray}

619: These are used to calculate the efficient scores

620: \begin{eqnarray*}

621: \ell^*_{\beta,\Lambda}(x)& = & \dot\ell_{\beta,\Lambda}-

622: B_{\beta,\Lambda}\left(B^*_{\beta,\Lambda}B_{\beta,\Lambda}\right)^{-1}B^*_{\beta,\Lambda}\dot\ell_{\beta,\Lambda} \\

623:  & = & \delta\left[z-m(t;\beta)\right]-e^{z\beta}\int_0^t\left[z-m(s;\beta)\right]d\Lambda(s)

624: \end{eqnarray*}

625: and efficient information

626: \begin{eqnarray*}

627:  \tilde I_0 & = & I_0 - P_0B_0\left(B^*_0B_0\right)^{-1}B^*_0\dot\ell_0 \\

628:  & = & P_0 \left ( e^{Z\beta_0}\int_0^{\tau} \left [

629: Z - m(t;\beta_0) \right ]^{\otimes 2} \Pr(T \geq t |Z) d\Lambda_0(t) \right ) ,

630: \end{eqnarray*}

631: respectively,

632: where $I_0=P_0\dot\ell_0\dot\ell_0^{T}$ and $m(t;\beta) = S^{(1)}(t;\beta)/S^{(0)}(t;\beta)$ with

633: \begin{eqnarray*}

634: S^{(0)}(t;\beta) &=& P_0 e^{Z\beta}Y(t)  \\

635: S^{(1)}(t;\beta) &=& P_0Ze^{Z\beta}Y(t).

636: \end{eqnarray*}

637:

638: To fit the Cox model by WL to two phase stratified samples,

639: first define IPW estimators of the two quantities just considered by

640: $\hat S^{(0)}(t;\beta)=\PP_N^{\pi}e^{Z\beta}Y(t)$ and

641: $\hat S^{(1)}(t;\beta)=\PP_N^{\pi}Ze^{Z\beta}Y(t) $.

642: By definition the WL estimators solve

643: \begin{eqnarray}

644: \Psi_{N1}^{\pi}(\beta,\Lambda) & = & \PP_N^{\pi}\dot\ell_{\beta,\Lambda} = 0

645: \label{eq:CoxWL1} \\

646: \Psi_{N2}^{\pi}(\beta,\Lambda)h & = & \PP_N^{\pi} B_{\beta,\Lambda}h =0

647: \qquad \mbox{for all} \; h\in \; {\cal H}, \label{eq:CoxWL2}

648: \end{eqnarray}

649: where we have used the fact that $P_{\beta,\Lambda}B_{\beta,\Lambda}h=0 $.

650: Substituting

651: \[

652: h_t(s) \; = \; \frac{\b1_{[s\leq t]}}{\hat S^{(0)}(s;\beta)}

653: \]

654: for $h$ in (\ref{eq:CoxWL2}) and solving using (\ref{eq:CoxScoreOp})

655: shows that, for fixed $\beta$, the cumulative hazard

656: function that partially maximizes the weighted likelihood and, as is easily checked, satisfies $\PP_N^{\pi}B_{\beta,\hat\Lambda_{\beta}}h=0$ for all  $h$, is

657: \begin{equation}

658: \hat\Lambda_{\beta}(t) \;= \; \PP_N^{\pi}\frac{\Delta\b1_{[T\leq t]}}{\hat\sz(T;\beta)} \; = \;\frac{1}{N} \sum_{i=1}^N \int_0^t\frac{\xi_i}{\pi_i}\frac{dN_i(s)}{\hat S^{(0)}(s;\beta)} \label{eq:Breslow}.

659:  \end{equation}

660: This may be recognized as an IPW version of the so called \citet{1266} estimator.
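
The omitted step can be sketched as follows, with the limits of integration written out as an assumption about the intended ranges. Because $\int_0^T h\,d\Lambda = \int_0^{\tau} Y(s)h(s)\,d\Lambda(s)$, equation (\ref{eq:CoxWL2}) combined with (\ref{eq:CoxScoreOp}) reads
\[
\PP_N^{\pi}\Delta h(T) \; = \; \int_0^{\tau} h(s)\, \PP_N^{\pi}e^{Z\beta}Y(s) \, d\Lambda(s)
\; = \; \int_0^{\tau} h(s) \hat S^{(0)}(s;\beta) \, d\Lambda(s).
\]
Taking $h=h_t$, the right side reduces to $\int_0^t d\Lambda(s) = \Lambda(t)$ while the left side becomes $\PP_N^{\pi}\Delta\b1_{[T\leq t]}/\hat S^{(0)}(T;\beta)$, which gives (\ref{eq:Breslow}).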

661: Inserting this expression into (\ref{eq:CoxWL1}) and evaluating using (\ref{eq:Coxscore}) yields

662: \[

663: \Psi_{N1}^{\pi}(\beta,\hat\Lambda_{\beta})\; = \; \PP_N^{\pi}\Delta\left[Z-\hat m(T;\beta)\right] \;= \; \frac{1}{N}\sum_{i=1}^N\frac{\xi_i}{\pi_i}\Delta_i\left[Z_i-\frac{\hat S^{(1)}(T_i;\beta)}{\hat S^{(0)}(T_i;\beta)}\right]\;=\;0 ,

664: \]

665: which is the IPW Cox ``partial score" equation.

666: Its solution, together with (\ref{eq:Breslow}),

667: are the estimators proposed for Cox regression by \citet{4326},

668: \citet{4902}, \citet[Estimator II]{4324}, \citet{4894} and others for a

669: variety of complex sampling and missing data problems.

670: Using the results of this paper, the large sample properties of $(\hat\beta,\hat\Lambda_{\hat\beta})$ follow

671: from those already developed for the ML estimators with complete data, which are given by the same equations with $\xi_i=\pi_i=1, i=1,\ldots,N$.
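
The reduction of (\ref{eq:CoxWL1}) to the IPW partial score equation can be checked directly; the following is a sketch of that calculation. By (\ref{eq:Coxscore}) and the identity $\Lambda(T)=\int_0^{\tau}Y(s)\,d\Lambda(s)$,
\[
\PP_N^{\pi}\dot\ell_{\beta,\hat\Lambda_{\beta}}
\; = \; \PP_N^{\pi}\Delta Z - \int_0^{\tau} \hat S^{(1)}(s;\beta)\, d\hat\Lambda_{\beta}(s)
\; = \; \PP_N^{\pi}\Delta Z - \PP_N^{\pi}\Delta \frac{\hat S^{(1)}(T;\beta)}{\hat S^{(0)}(T;\beta)} ,
\]
where the last equality uses the fact that, by (\ref{eq:Breslow}), $\hat\Lambda_{\beta}$ is a step function with jump $N^{-1}(\xi_i/\pi_i)\Delta_i/\hat S^{(0)}(T_i;\beta)$ at $T_i$. Thus $\hat m(t;\beta)=\hat S^{(1)}(t;\beta)/\hat S^{(0)}(t;\beta)$ is the IPW analogue of $m(t;\beta)$.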

672:

673: \section{Discussion}

674: The two phase stratified sampling designs considered here are quite

675: flexible in that the phase one strata may be formed using all available

676: information and sampled with arbitrary positive probabilities.

677: This is in the spirit of \citet{4326} and \citet{4894}, who considered

678: even more general complex sample survey designs.

679: Others \citep{4324,4871} have restricted their attention to covariate stratified versions of the case-cohort design, whereby all subjects who fail are sampled at phase two for complete covariate ascertainment.

680: Although this may well be an efficient design when the failure rate is low, the assumption that $\xi=1$ whenever $\Delta=1$ is often unnecessary and may sometimes be unduly restrictive.

681: Not only does it limit application when the phase one population has

682: large numbers of both failures and non-failures, it also does so when the sampling has been carried out for one failure type but it is of interest to evaluate another.

683: When following patients enrolled in a clinical trial, for example, all deaths may be sampled as ``cases" but it may later be decided to analyze the data also in terms of ``event-free survival".

684: In other contexts, biological samples may turn out to be non-informative so that data are still missing for substantial numbers of subjects, including failed cases, who are sampled at phase two.

685: Provided one is willing to make the standard MAR assumptions, WL

686: methods as described herein may still be used by determining the

687: stratum frequencies for subjects having complete data at phase two and using them to estimate the sampling weights.

688:

689: The major drawback of WL estimation is its lack of statistical efficiency.

690: Efforts to address this deficiency with Cox regression have been made by

691: several authors including \citet{3937}, \citet{4871}, \citet{4811}, \citet{4929} and \citet{4930}.

692: Most of these methods are relatively recent and involve sufficiently complex calculations, or sufficiently restrictive assumptions, that none have yet seen widespread use.

693: These limitations are certain to decline with advances in computing

694: hardware and software, making more efficient estimation methods more widely available.

695: In the meantime, the WL estimation procedure outlined at the end

696: of \S 6 offers a relatively simple and robust alternative.

697: It is likely to remain the method of choice for many survey statisticians for the reasons

698: mentioned in the introduction, namely, their interest in finite population

699: parameters defined as solutions to ML estimating equations.

700: As emphasized by \citet{3937}, in view of the interpretation of

701: (\ref{eq:VarEstAlpha}) as a residual sum of squares, inclusion of

702: additional variables in the model (\ref{eq:modelforPi}) for $\pi$ can

703: only enhance the efficiency of $\theta$ estimation.

704: When the sampling probabilities vary, as in finite population stratified

705: sampling, inclusion of the stratum factors in the model is essential to avoid bias.

706: Finer stratification, or the inclusion of auxiliary variables in the model for $\pi$, serves the cause of efficiency.

707: Equation (\ref{eq:varIPWstratified}) suggests that such additional

708: variables would be most valuable if they could somehow be chosen

709: to be highly correlated with the efficient scores.

710: The doubly weighted estimator developed by \citet{4871} for exposure stratified case-cohort studies is intriguing in that it uses

711: a separate set of (time-dependent) weights for each covariate.

712: A preliminary analysis is conducted to estimate quantities that resemble within stratum conditional expectations of partial score contributions given the phase one data, and these are used to form the weights.

713: An extension of their approach to more general two phase stratified sampling designs would be of considerable interest.

714:

715: This paper is limited in application to semiparametric models that satisfy the rather stringent assumptions \textbf{A1}-\textbf{A4} of \S 2.

716: Even in the case of Cox regression, these have been established only under the ``partly unnecessary" conditions imposed by \citet[\S 25.12.1]{4895}.

717: His assumption that everyone still ``on-study" is censored at the common time $\tau$ would apply to situations in which time $t$ referred to calendar time, everyone was entered on study at $t=0$ and there was a common closing date at $t=\tau$.

718: It would not apply, however, if subjects were entered on study at various

719: calendar times but withdrawn on a common closing date, and $t$ was taken to be ``time-on-study".

720: Nor would it apply if $t$ was ``age" and subjects both entered and exited the study at various ages.

721: We look forward to further work that relaxes these assumptions, in particular to a determination as to whether or not the general approach extends to Cox regression with time-dependent covariates and repeated failure events under standard assumptions \citep{3924}.

722:

723: In his Appendix \citet{4894} remarks

724: \begin{quote}

725: ``To our knowledge, there does not exist a general theory on the conditions required for the tightness and weak convergence of Horvitz-Thompson processes.

726: However, the results of \citet[\S\S 2.9, 3.6, 3.7]{4920} can be applied to possibly stratified simple random sampling and can potentially be extended to other survey designs."

727: \end{quote}

728: One purpose of this paper has been to carry out in detail the program

729: mentioned for stratified random sampling.

730: We conjecture that our fundamental equation (\ref{eq:general}) applies

731: to Horvitz-Thompson estimators for other complex sampling designs, and work is in progress to explore these extensions.

732:

733: \section*{Acknowledgements}

734: The second author owes thanks to Galen Shorack for a helpful discussion concerning

735: the representation in Appendix A.

736: Supported in part by grants 5-R01-CA40644 and 2-R01-AI291968 from the

737: U.S. National Institutes of Health and by grant DMS-0503822

738: from the U.S. National Science Foundation.

739:

740: %\bibliographystyle{plain}

741: \bibliographystyle{ims}

742: \bibliography{sjs}

743:

744: \section{Appendices}

745: In Appendices A and B we establish two results slightly more

746: general than needed for the development in Section 4.

747: (See the end of Appendix  B for the special case required.)

748: The notation in these two appendices should be understood

749: to be independent of that in the body of the paper.

750:

751: \par\noindent

752: {\bf Appendix A.  \ A Representation of Stratified Sampling.}

753:

754:  Suppose that $( \Omega, {\cal A},P)$ is a probability space

755: and $W : (\Omega , {\cal A} ) \rightarrow ({\cal W} , {\cal B} )$.

756: Write $P^W$ for the  measure induced by $W$ on

757: $ ({\cal W} , {\cal B} )$; in the notation of section 2, $P^W = \tilde{P}_0$.

758: Suppose that ${\cal W}_1 , \ldots , {\cal W}_J$ is a (measurable) partition

759: of ${\cal W}$:  \\

760: (a) \ \ ${\cal W}_j \in {\cal B}$, $j = 1, \ldots , J$;\\

761: (b) \ \  ${\cal W}_j \cap {\cal W}_{j'} = \emptyset$ for $j \not= j'$; and \\

762: (c) \ \ $\cup_{j=1}^J {\cal W}_j = {\cal W}$.\\

763: We will assume that $P(W \in {\cal W}_j ) \equiv p_j > 0$

764: for $j=1, \ldots , J$.

765:

766: Now consider a new probability space $(\Omega^{\dagger}, {\cal A}^{\dagger}, P^{\dagger})$

767: where

768: \begin{eqnarray*}

769: &&\Omega^{\dagger}

770: = \Omega_0^{\dagger} \times \Omega_1^{\dagger} \times \cdots \times \Omega_J^{\dagger} ,\\

771: && {\cal A}^{\dagger}

772: = {\cal A}_0^{\dagger} \times {\cal A}_1^{\dagger} \times \cdots \times {\cal A}_J^{\dagger} , \\

773: && P^{\dagger} = P_0^{\dagger} \cdot P_1^{\dagger} \cdots P_J^{\dagger} ,

774: \end{eqnarray*}

775: and random variables $\Delta = (\Delta_1, \ldots , \Delta_J)$,

776: $W_1^{\dagger}, \ldots , W_J^{\dagger}$

777: defined thereon as follows:

778: for $\omega^{\dagger} = (\omega_0^{\dagger} , \omega_1^{\dagger}, \ldots ,

779: \omega_J^{\dagger} ) \in \Omega^{\dagger}$,

780: \begin{eqnarray*}

781: && \Delta (\omega^{\dagger})

782:  = \Delta (\omega_0^{\dagger}) \sim \mbox{Multinomial}_J (1, (p_1 , \ldots , p_J ) ) \\

783: && W_j^{\dagger} ( \omega^{\dagger} ) = W_j^{\dagger} (\omega_j^{\dagger} )

784: \sim P_j^{\dagger}

785: \end{eqnarray*}

786: for $j =1 , \ldots , J$ where  $p_j = P( W \in {\cal W}_j )$, $j =1 , \ldots , J$, and

787: $P_j^{\dagger}$ is defined by

788: \begin{eqnarray}

789: P_j ^{\dagger} (W_j^{\dagger} \in B) = \frac{P( W \in B\cap {\cal W}_j)}{P(W \in {\cal W}_j )}

790: = \frac{P^W (B \cap {\cal W}_j )}{P^W ( {\cal W}_j )},

791: \qquad B \in {\cal B} .

792: \label{DefnOfPSubJDagger}

793: \end{eqnarray}

794: Now define a random variable

795: $W^{\dagger} :  (\Omega^{\dagger}, {\cal A}^{\dagger}) \rightarrow  ( {\cal W} , {\cal B} )$

796: by

797: \begin{eqnarray*}

798: W^{\dagger} (\omega^{\dagger} )

799: = \Delta_1 (\omega_0^{\dagger} ) W_1^{\dagger}( \omega_1^{\dagger} )

800:        + \cdots +  \Delta_J ( \omega_0^{\dagger} ) W_J^{\dagger} ( \omega_J^{\dagger} )  .

801: \end{eqnarray*}

802: Note that $\Delta$, $W_1^{\dagger} , \ldots , W_J^{\dagger}$ are independent by

803: construction.

804: \bigskip

805:

806: \par\noindent

807: {\bf Proposition A.1} \ \   $W^{\dagger} \stackrel{d}{=} W$ on $({\cal W}, {\cal B} )$.

808: That is, $P^{W^{\dagger}} = P^W $ as measures on $({\cal W}, {\cal B})$.

809: \medskip

810:

811: \par\noindent

812: {\bf Proof.}

813: First note that

814: \begin{eqnarray}

815: P^{\dagger} (W^{\dagger}\in {\cal W}_j )

816: & = & P^{\dagger} (W_j^{\dagger} \in {\cal W}_j , \Delta_j = 1) \nonumber  \\

817: & = & P^{\dagger} (W_j^{\dagger} \in {\cal W}_j ) P^{\dagger} (\Delta_j = 1)

818:           %\nonumber \\

819:  =  1 \cdot p_j = p_j

820: \label{ComputationOfProbDaggerOfFallingInjthElementOfPartition}

821: \end{eqnarray}

822: using independence of $\Delta$ and $W_j^{\dagger}$,

823: the fact that $W_j^{\dagger}$ takes values in ${\cal W}_j$ with $P^{\dagger}$-probability $1$,

824: and $P^{\dagger} (\Delta_j = 1)= p_j $ by the definition of $P^{\dagger}$.

825:

826: Now let $B \in {\cal B}$.  Then since $p_j > 0$ for $j =1, \ldots , J$,

827: \begin{eqnarray*}

828: P^{\dagger} (W^{\dagger} \in B)

829: & = & \sum_{j=1}^J P^{\dagger} ( W^{\dagger} \in B \cap {\cal W}_j )

830:  =  \sum_{j=1}^J \frac{P^{\dagger} ( W^{\dagger} \in B \cap {\cal W}_j ) }{ P^{\dagger} (W^{\dagger} \in {\cal W}_j )}

831:                  P^{\dagger} (W^{\dagger} \in {\cal W}_j )\\

832: & = & \sum_{j=1}^J \frac{P^{\dagger} ( W_j^{\dagger} \in B ) }{P^{\dagger} (W^{\dagger}_j \in {\cal W}_j )} p_j

833: \qquad \mbox{by} \ (\ref{ComputationOfProbDaggerOfFallingInjthElementOfPartition}) \\

834: & = & \sum_{j=1}^J \frac{P^W (B \cap {\cal W}_j) / P^W ( {\cal W}_j )}{ 1} \cdot p_j

835: \qquad \mbox{by} \ (\ref{DefnOfPSubJDagger})  \\

836: & = & \sum_{j=1}^J P^W ( B \cap {\cal W}_j ) = P^W (B) = P( W \in B) .

837: %\qquad \qquad \qquad \qquad\qquad \Box

838: \end{eqnarray*}

839: \hfill $\Box$

840: \medskip

841:

842: If $W_1, \ldots , W_N$ are i.i.d. $P^W$, then we can represent the $W_i$'s

843: in terms of $( \Delta_i , W_{1,i}^{\dagger} , \ldots, W_{J,i}^{\dagger} )$, $i=1, \ldots , N$, i.i.d. as

844: $( \Delta , W_1^{\dagger} , \ldots , W_J^{\dagger})$ as described in proposition A.1.  It follows that

845: \begin{eqnarray}

846: \PP_{j, N_j}

847: & = & \frac{1}{N_j} \sum_{i=1}^N \delta_{W_i} 1_{{\cal W}_j} (W_i) \nonumber \\

848: & = & \frac{1}{N_j} \sum_{j'=1}^J \sum_{i=1}^N \Delta_{j',i} \delta_{W_{j',i}^{\dagger}}

849:             1_{{\cal W}_j} (W_{j',i}^{\dagger}) \nonumber  \\

850: & = & \frac{1}{N_j} \sum_{i=1}^{N_j} \delta_{W_{j,i} }

851: \label{eq:StratumSpecificEmpiricalMeasureAppendixB}

852: \end{eqnarray}

853: by relabelling the $W_{j,i}^{\dagger}$'s and where

854: $N_j = \sum_{i=1}^N \Delta_{j,i}$ on the right side is independent of the $W_{j,i}^{\dagger}$'s.

855:   This yields the promised doubly indexed form of the stratum-specific

856: empirical measure in terms of independent $W_{j,i}$'s distributed according to $P_{0|j}$

857: where, for $B \in {\cal B}$,

858: $$

859: P_{0|j} (B) = \frac{P_0 (B 1_{{\cal W}_j} )}{P_0 ( 1_{{\cal W}_j} )} .

860: $$

861: \bigskip

862:

863: \par\noindent

864: {\bf Appendix B.  Proof of weak convergence of the stratum-specific empirical process}

865:

866: Let $\PP_{j,N_j} $ be as defined in (\ref{eq:StratumSpecificEmpiricalMeasureAppendixB}), that is,

869: $$

870: \PP_{j,N_j} = \frac{1}{N_j} \sum_{i=1}^N \delta_{W_i} 1_{{\cal W}_j} (W_i) ,

871: $$

872: where

873: $$

874: N^{-1} N_j = \PP_N ( 1_{{\cal W}_j} ) \rightarrow_{a.s.} P_0 ( {\cal W}_j ) \equiv \nu_j > 0 .

875: $$

876: \medskip

877:

878: \par\noindent

879: {\bf Proposition B.1.}

880: If ${\cal F}$ is $P_{0}-$Donsker and $\nu_j > 0$, then

881: ${\cal F}$ is $P_{0|j} - $Donsker on stratum ${\cal W}_j$  in the sense that

882: \begin{equation}

883: \GG_{j,N_j} \equiv \sqrt{N_j} ( \PP_{j, N_j} - P_{0|j} )

884: \rightsquigarrow  \GG_j  \qquad \mbox{in} \ \ \ell^{\infty} ({\cal F})

885: \label{EmpiricalProcessForStratumj}

886: \end{equation}

887: where $\GG_j$, defined by

888: \begin{equation}

889: \GG_j (f) = \nu_j^{-1/2} \GG_{P_0} ((f - P_{0|j} (f)) 1_{{\cal W}_j} ) ,

890: \qquad f \in {\cal F} ,

891: \end{equation}

892: is a $P_{0|j}$-Brownian bridge process.

893: \medskip

894:

895: \par\noindent

896: {\bf Remark 1.}

897: Note that

898: \begin{eqnarray*}

899: Var( \GG_{j} (f) )

900: & = & \nu_j^{-1} P_0 \left [ ( f - P_{0|j} (f))^2 1_{{\cal W}_j} \right ]

901:  =  Var_j (f) \equiv Var(f(W)| W \in {\cal W}_j ).

902: \end{eqnarray*}
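
The first equality is a one-line check, recorded here as a sketch with $g$ introduced only for convenience. Writing $g = (f - P_{0|j}(f))1_{{\cal W}_j}$, the definition of $P_{0|j}$ gives
\[
P_0 g = P_0 ( f 1_{{\cal W}_j} ) - P_{0|j}(f)\, \nu_j = 0 ,
\]
so that $\Var ( \GG_{P_0}(g) ) = P_0 g^2 - (P_0 g)^2 = P_0 g^2$, and dividing by $\nu_j$ yields the conditional variance of $f(W)$ given $W \in {\cal W}_j$.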

903: %\medskip

904:

905: \par\noindent

906: {\bf Remark 2.}  The proposition implies that the process

907: $\sqrt{N_j} ( \PP_{j, N_j} - P_{0|j} ) $ behaves asymptotically the same as that

908: of a sample of fixed size drawn from the conditional distribution $P_{0|j}$.

909: \medskip

910:

911: \par\noindent

912: {\bf Proof of the proposition}.  First proof.

913: By the discussion at the beginning of section 2.10.4, page 200, van der Vaart

914: and Wellner (1996), ${\cal F}_j \equiv \{ f 1_{{\cal W}_j} : \ f \in {\cal F} \}$

915: is $P_0-$Donsker, and hence the collection

916: $\tilde{{\cal F}}_j \equiv \{ f 1_{{\cal W}_j} : \ f \in {\cal F} \cup \{ 1\} \}$

917: is also $P_0-$Donsker.  Now we

918: write

919: \begin{eqnarray*}

920: \sqrt{N_j}  ( \PP_{j,N_j} f - P_{0|j} f )

921: & = & \sqrt{N_j} \left ( \frac{\frac{1}{N} \sum_{i=1}^N f(W_i ) 1_{{\cal W}_j} (W_i)}

922:                                          {\frac{1}{N} \sum_{i=1}^N  1_{{\cal W}_j} (W_i)}

923:                                          - \frac{P_0 ( f 1_{{\cal W}_j})}{P_0 ( 1_{{\cal W}_j}) }

924:                          \right )  \\

925: & = & \sqrt{ \frac{N_j}{N}} \left \{  \frac{ \GG_N ( f 1_{{\cal W}_j} ) }{N_j/N}

926:            -   \frac{ \GG_N ( 1_{{\cal W}_j} ) P_0 ( f 1_{{\cal W}_j} )}

927:                                                    { (N_j /N) P_0 ( {\cal W}_j ) }  \right \} \\

928: & = &  \frac{1}{\sqrt{N_j/N}} \left \{ \GG_N ( f 1_{{\cal W}_j} )

929:              -   \GG_N ( 1_{{\cal W}_j} ) P_{0|j} ( f )    \right \} \\

930: & = & \frac{1}{\sqrt{N_j/N}} \GG_N ( ( f   - P_{0|j} (f)) 1_{{\cal W}_j}  )  \\

931: & \Rightarrow &   \frac{1}{\sqrt{\nu_j}} \GG_{P_0} ( ( f   - P_{0|j} (f)) 1_{{\cal W}_j}  )

932:  \equiv \GG_{P_{0|j}} (f) \,  ,

933: %\qquad \Box

934: \end{eqnarray*}

935: and, in fact,

936: \begin{eqnarray*}

937: \left \{ \frac{1}{\sqrt{\nu_j}} \GG_{P_0} ((f - P_{0|j} (f))1_{{\cal W}_j} ) : \ f \in {\cal F} \right \}

938: \stackrel{d}{=} \{ \GG_{P_{0|j} } (f) : \ f \in {\cal F} \} .

939: \end{eqnarray*}

940:

941: Second proof.  By the second representation of the stratum-specific empirical measure

942: $\PP_{j,N_j}$ as  $\PP_{j,N_j} = N_j^{-1} \sum_{i=1}^{N_j} \delta_{W_{j,i}}$ where the

943: $W_{j,i}$'s are i.i.d. $P_{0|j}$, it follows that the empirical

944: process

945: $\GG_{j,N_j} = \sqrt{N_j} ( \PP_{j,N_j} - P_{0|j}) $

946: is just the empirical process of  i.i.d. $W_{j,i}$'s, but with a random sample size

947: $N_j$ independent of the $W_{j,i}$'s.    Since $N_j / N \rightarrow \nu_j > 0$, it follows from

948: theorem 3.5.1, page 339, van der Vaart and Wellner (1996), that

949: $\GG_{j,N_j} \rightsquigarrow \GG_j$ in $\ell^{\infty} ({\cal F})$  where $\GG_j$ is a

950: $P_{0|j}-$Brownian bridge process as before.

951: \hfill$\Box$

952:

953: In the application of the results of Appendices A and B in section 4 we take

954: ${\cal W}_1 , \ldots , {\cal W}_J$ to be the measurable partition of ${\cal W}$ induced by

955: the partition ${\cal V}_1 , \ldots , {\cal V}_J$ of ${\cal V}$ (i.e. ${\cal W}_j = V^{-1} ({\cal V}_j)$

956: for $j=1, \ldots , J$ where $V(W) \equiv ( \tilde{X} (X), U)$).    Moreover, the Donsker class

957: ${\cal F}$ in Proposition B.1 is taken to be a Donsker class of functions of $X$ only

958: rather than functions of $W = (X,U)$.  This is exactly what is needed for the development

959: in section 4.

960: \medskip

961:

962: {\bf Appendix C. Proof of equation (\ref{eq:TaylorExpansion}).}

963: Besides the consistency and asymptotic linearity (\ref{eq:InfFuncAlpha})

964: for $\hat\alpha$ assumed in \S 6, we further assume that $0  <  \sigma  \leq \pi_{\alpha}(v)$ as in (\ref{eq:boundedweights}) and that

965: \begin{eqnarray}

966: \Big | \frac{1}{\pi_{\alpha} (v)} - \frac{1}{\pi_{\alpha_0} (v) }

967:      - \frac{-\dot{\pi}_0^{T}(v)}{\pi_0^2 (v)} (\alpha - \alpha_0) \Big |

968:      \le \psi (v) | \alpha - \alpha_0 |^{1+ \zeta }

969: \label{eq:DerivativeConditionPlus}

970: \end{eqnarray}

971: for $\alpha$ in a neighborhood of $\alpha_0$ where $\zeta> 0$

972: and $\psi$ satisfies $E \psi^2 (V) < \infty$.

973: The second assumption will typically follow from the first

974: provided that $\pi_{\alpha}$ has a continuous second derivative.

975: For example, suppose that $\pi_{\alpha}$ is given by a logistic regression model with linear predictor

976: $\tilde v^{T}\alpha$ where $\tilde v=\tilde v(v) \in \RR^q$.

977: Then Taylor's formula with remainder shows that the LHS of (\ref{eq:DerivativeConditionPlus}) equals

978: $\left | \frac{1}{2}e^{-\tilde v^{T} \alpha^*}(\alpha-\alpha_0)^{T}\tilde v \tilde v^{T}(\alpha-\alpha_0) \right |$

979: with $\alpha^*$ on the line segment between $\alpha$ and $\alpha_0$.

980: Thus the condition holds with $\zeta=1$ provided $e^{\tilde v^{T} \alpha}=\pi_{\alpha}(v)/[1-\pi_{\alpha}(v)]$

981: is bounded away from 0 and $\tilde V$ has finite fourth moment.
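
To make the logistic example explicit, the following sketch writes out the derivatives; the constant $c$ and the choice of $\psi$ are assumptions introduced here for illustration. With $\pi_{\alpha}(v) = 1/(1+e^{-\tilde v^{T}\alpha})$,
\[
\frac{1}{\pi_{\alpha}(v)} = 1 + e^{-\tilde v^{T}\alpha}, \qquad
\frac{\partial}{\partial\alpha}\frac{1}{\pi_{\alpha}(v)} = -\tilde v e^{-\tilde v^{T}\alpha} = -\frac{\dot\pi_{\alpha}(v)}{\pi_{\alpha}^2(v)}, \qquad
\frac{\partial^2}{\partial\alpha\,\partial\alpha^{T}}\frac{1}{\pi_{\alpha}(v)} = \tilde v \tilde v^{T} e^{-\tilde v^{T}\alpha} .
\]
If $e^{\tilde v^{T}\alpha} \geq c > 0$ for all $v$ and all $\alpha$ in the neighborhood of $\alpha_0$, the remainder is bounded by $(2c)^{-1}|\tilde v|^2|\alpha-\alpha_0|^2$, so that (\ref{eq:DerivativeConditionPlus}) holds with $\zeta=1$ and $\psi(v) = |\tilde v|^2/(2c)$, and $E\psi^2(V) < \infty$ whenever $\tilde V$ has finite fourth moment.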

982: It follows that

983: \begin{eqnarray}

984: \left ( \PP_N^{\hat\pi}- \PP_N^{\pi_0} \right ) \tilde\ell_0

985: & = &

986: \frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c} (V_i) \left ( \frac{\xi_i}{\hat\pi_i} - \frac{\xi_i}{\pi_0(V_i)} \right ) \tilde\ell_0(X_i)

987:          \nonumber \\

988: & = & \frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c}(V_i)

989:           \xi_i \tilde\ell_0(X_i)

990:            \left [\frac{1}{\pi_{\hat\alpha}(V_i)} -  \frac{1}{\pi_{\alpha_0}(V_i)}

991:            - \frac{-\dot{\pi}_0^T(V_i)}{\pi_0^2 (V_i)} (\hat{\alpha} - \alpha_0)

992:            \right ]   \nonumber \\

993:  && \qquad  + \frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c}(V_i)

994:            \xi_i \tilde\ell_0(X_i)

995:            \left [ \frac{-\dot{\pi}_0^T(V_i)}{\pi_0^2 (V_i)}

996:            \right ]  ( \hat{\alpha} - \alpha_0)  \nonumber \\

997: & \equiv & R_N -   \frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c}(V_i)

998:           \frac{\xi_i}{\pi_{0}(V_i)} \tilde\ell_0(X_i)

999:            \left [ \frac{\dot{\pi}_0^T(V_i)}{\pi_0 (V_i)} \right ] (\hat{\alpha} - \alpha_0)

1000: \label{eq:RemainderPlusMainTerm}

1001: \end{eqnarray}

1002:  where by (\ref{eq:boundedweights}), the similar assumption for

1003:  $\pi_{\alpha}$  and (\ref{eq:DerivativeConditionPlus}),

1004:  \begin{eqnarray*}

1005:  |R_N |

1006:  & \le & \Big |  \frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c}(V_i)

1007:           \xi_i \tilde\ell_0(X_i)

1008:            \left [\frac{1}{\pi_{\hat\alpha}(V_i)} -  \frac{1}{\pi_{\alpha_0}(V_i)}

1009:            - \frac{-\dot{\pi}_0^T(V_i)}{\pi_0^2 (V_i)} (\hat{\alpha} - \alpha_0) \right ]

1010:        \Big | \\

1011:  & \le & \frac{1}{\sigma^2} \frac{1}{N} \sum_{i=1}^N \psi (V_i)|  \tilde\ell_0 (X_i)| \cdot | \hat{\alpha} - \alpha_0|^{1+\zeta} \\

1012:  & = &   O_p (1) | \hat{\alpha} - \alpha_0| | \hat{\alpha} - \alpha_0|^{\zeta} \\

1013:  & = &   O_p (1) O_{p} (N^{-1/2} ) o_p (1).

1014:  \end{eqnarray*}

1015: Multiplying through (\ref{eq:RemainderPlusMainTerm}) by $\sqrt{N}$, we conclude that (\ref{eq:TaylorExpansion}) holds by virtue of $\sqrt{N} R_N = o_p (1)$ and the strong law of large numbers.

1016:

1017:

1018: \end{document}

1019: