1: \documentclass[11pt]{article}
2: \usepackage{amsmath, amssymb, chicago, amsfonts, latexsym, annals, natbib}
3: \setlength{\oddsidemargin}{0.0in}
4: \setlength{\evensidemargin}{0.0in}
5: \setlength{\textwidth}{6.5in}
6: \setlength{\topmargin}{0.0in}
7: \advance \topmargin by -\headheight
8: \advance \topmargin by -\headsep
9: \advance \topmargin .2in
10: \setlength{\textheight}{8.0in}
11: \sloppy \hyphenpenalty=10000
12: \def\var{\mathrm{var}}
13: \def\E{\mathrm{E}}
14: \def\T{^{\mbox{T}}}
15: \def\ez{\eta^{(0)}}
16: \def\eo{\eta^{(1)}}
17: \def\sz{S^{(0)}}
18: \def\so{S^{(1)}}
19: \def\bd{\begin{description}}
20: \def\ed{\end{description}}
21: %\def\theequation{\thesection.\arabic{equation}}
22: \def\bl{\begin{list}{$\bullet$}{}} % Begins a bullet list.
23: \def\cl{\begin{list}{$\circ$}{}} % Begins a circle list.
24: \def\el{\end{list}} % Ends a list.
25: \newcommand{\ef}{\hfill $\Box$}
26: \def\b1{\mathbf{1}}
27: %\newcommand{\mycite}[1]{{\small \sc \citeNP{#1}}}
28: \newcommand{\elinfH}{ \ell^{\infty}({\cal H}) }
29: \newcommand{\psidot}[1]{ \dot{\Psi}_{#1} }
30: \newcommand{\Var}{\mbox{Var}}
31: \newcommand{\PP}{{\mathbb P}}
32: \newcommand{\QQ}{{\mathbb Q}}
33: \newcommand{\GG}{{\mathbb G}}
34: \newcommand{\RR}{{\mathbb R}}
35:
36: \begin{document}
37: \setlength{\baselineskip}{20pt}
38:
39: \title{Weighted Likelihood for Semiparametric Models
40: and Two-phase Stratified Samples, with Application to Cox Regression}
41:
42: \author{Norman E. Breslow \\ Jon A. Wellner}
43:
44: \affiliation{University of Washington, Seattle}
45: \date{\today}
46:
47: \maketitle
48:
49: \begin{abstract}
50: Weighted likelihood, in which one solves Horvitz-Thompson or inverse probability weighted (IPW) versions of the likelihood equations, offers a simple and robust method for fitting models to two phase stratified samples.
51: We consider semiparametric models for which solution of infinite dimensional estimating equations leads to $\sqrt{N}$ consistent and asymptotically Gaussian estimators of both Euclidean and nonparametric parameters.
52: If the phase two sample is selected via Bernoulli (i.i.d.) sampling with known sampling probabilities, standard estimating equation theory shows that the influence function for the weighted likelihood estimator of the Euclidean parameter is the IPW version of the ordinary influence function.
53: By proving weak convergence of the IPW empirical process, and borrowing results on weighted bootstrap empirical processes, we derive a parallel asymptotic expansion for finite population stratified sampling.
54: Whereas the asymptotic variance for Bernoulli sampling involves the within strata second moments of the influence function, for finite population stratified sampling it involves only the within strata variances.
55: The latter asymptotic variance also arises when the observed sampling fractions are used as estimates of those known \textit{a priori}.
56: A general procedure is proposed for fitting semiparametric models with estimated weights to two phase data.
57: Several of our key results have already been derived for the special case of Cox regression with stratified case-cohort studies, other complex survey designs and missing data problems more generally.
58: This paper is intended to help place this previous work in appropriate context and to pave the way for applications to other models.
59:
60: \end{abstract}
61:
62: \vspace{2cm}
63: \noindent
64: \textit{Key words}:
65: case-cohort,
66: estimated weights,
67: failure time,
68: inverse probability weights,
69: missing data
70:
71: \newpage
72: \section{Introduction}
73:
74: Two phase stratified sampling, also known as double sampling,
75: was introduced by \citet{4923} to estimate the population mean of a
76: target variable that is costly or difficult to measure.
77: At phase one a relatively large random sample is drawn and
78: measurements are made on an auxiliary variable that is
79: correlated with the target variable but easier to measure.
80: At phase two measurements on the target variable are made
81: for a subsample drawn randomly, without replacement, from
82: within strata defined by the auxiliary variable.
83: Neyman showed that the optimal, design unbiased linear
84: estimator of the population mean is the Horvitz-Thompson (\citeyear{3888}) estimator that weights each observation by the inverse
85: of the probability of its selection into the phase two sample.
86:
87: Two-phase stratified sampling designs can dramatically reduce
88: the costs of regression modeling when the strata depend
89: on (correlates of) both outcome and explanatory variables.
90: A common method of estimation is ``weighted exogenous
91: sampling maximum likelihood'', here simply Weighted Likelihood
92: or WL, in which one maximizes the inverse probability weighted
93: (IPW) sum of log-likelihood contributions from the phase
94: two observations \citep{200, 265}.
95: Equivalently, one may solve an IPW version of the
96: score equations \citep[\S 3.4]{4105}.
97: Although easy to implement, WL estimators are sometimes
98: seriously inefficient \citep{3937}.
99: They may still be of interest, however, because even when
100: the model is wrong they consistently estimate the finite population
101: parameters that would be obtained by fitting the model to
102: complete phase one data \citep{1815, 4326}.
103: Fully efficient estimators are available for logistic and other
104: parametric regression models in situations where the phase
105: one data consist only of stratum frequencies.
106: See, for example, \citet{4927} and the references cited therein.
107:
108: The asymptotic properties of WL estimators of Euclidean parameters
109: in parametric models follow readily from standard results for
110: $M$-estimators \citep[Chapter 5]{4895}.
111: WL may also be used for estimation of both Euclidean and infinite
112: dimensional parameters in semiparametric models, for which the
113: paradigm is Cox (\citeyear{1272}) proportional hazards regression.
114: \citet{4894} developed asymptotic results for both regression
115: coefficients and baseline cumulative hazard when fitting the Cox model to survey
116: data including those obtained using two phase sampling.
117: \citet{4324} obtained the same results for the regression parameters
118: when fitting the Cox model to data from exposure stratified case-cohort
119: studies, in which all subjects who have a failure event (the cases)
120: are sampled at phase two.
121: One purpose of the present paper is to develop a modern theory
122: of WL estimation in semiparametric models that encompasses these previous results, helps to interpret them and paves the way for further applications.
123: We also explore the relationship between results based on finite
124: population stratified sampling at phase two and those based on i.i.d.
125: variable probability sampling with sampling weights
126: estimated using information from phase one.
127:
128: \section{Notation, Assumptions and Problem Statement}
129: Suppose $P_{\theta,\eta}$ denotes a probability distribution in
130: a semiparametric model for a random variable $X \in {\cal X}$,
131: where $\theta \in \Theta \subset \RR^p$ is the Euclidean parameter
132: and $\eta$, taking values in some arbitrary space $H$, is the nonparametric one.
133: Let $P_0=P_{\theta_0,\eta_0}$ denote the distribution from which $X$ is actually sampled.
134: Following closely \S 25.12 of \citet{4895}, suppose maximum
135: likelihood (ML) estimators $(\hat\theta, \hat\eta)$ are obtained by solving the system
136: \begin{eqnarray}
137: \Psi_{N1}(\theta,\eta) & = & \PP_N \dot\ell_{\theta,\eta}\; = \;0 \nonumber \\
138: \Psi_{N2}(\theta,\eta) & = & \PP_N B_{\theta,\eta}h
139: -P_{\theta,\eta}B_{\theta,\eta}h \;= \;0 \; \forall \; h \in {\cal H}. \label{eq:like}
140: \end{eqnarray}
141: Here $\dot\ell_{\theta,\eta}$ is the $p$-dimensional
142: likelihood score for $\theta$, $B_{\theta,\eta}$ is the score
143: operator \citep{1262} working on an infinite dimensional class
144: ${\cal H}$ of directions $h$ from which paths of one-dimensional
145: submodels for $\eta$ may approach $\eta_0$, and $\PP_N$ is the empirical measure based on the i.i.d. sequence $X_1,\ldots,X_N$.
146: Set $\dot\ell_0=\dot\ell_{\theta_0,\eta_0}$ and $B_0=B_{\theta_0,\eta_0}$.
147:
148: Suppose the following assumptions, which slightly strengthen the
149: hypotheses of \citet[Theorem 25.90]{4895}, are satisfied so that
150: $\sqrt{N}(\hat\theta-\theta_0,\hat\eta-\eta_0)$ is asymptotically Gaussian:
151: \bd
152: \item[A1] for $(\theta,\eta)$ in a $\delta$-neighborhood of
153: $(\theta_0,\eta_0)$ the functions $\dot\ell_{\theta,\eta}$ and
154: $\{B_{\theta,\eta}h, h \in {\cal H} \}$ are contained in a $P_0$-Donsker class ${\cal F}$;
155: \item[A2] $P_0\| \dot\ell_{\theta, \eta}-\dot\ell_0\|^2$
156: and $\sup_{h \in {\cal H}}P_0|B_{\theta, \eta }h-B_0h|^2$ converge
157: to $0$ as $(\theta,\eta) \rightarrow (\theta_0,\eta_0)$;
158: \item[A3]
159: the map $\Psi=(\Psi_1,\Psi_2):\Theta \times H \mapsto \RR^p\times \ell^{\infty}({\cal H})$
160: with components
161: \begin{eqnarray}
162: \Psi_1(\theta,\eta) & = & P_0\dot\ell_{\theta,\eta} \nonumber \\
163: \Psi_2(\theta,\eta) & = & P_0B_{\theta,\eta}h
164: -P_{\theta,\eta}B_{\theta,\eta} h , \ \ h \in {\cal H}, \label{eq:expectedmap}
165: \end{eqnarray}
166: which is the expectation of the random map $\Psi_N=(\Psi_{N1},\Psi_{N2})$ in (\ref{eq:like}),
167: has a Fr\'echet derivative
168: $\dot\Psi_0$ at $(\theta_0,\eta_0)$ that is continuously invertible on its range.
169: \item[A4] $(\hat\theta,\hat\eta)$ is consistent for $(\theta_0,\eta_0)$ and satisfies
170: $\Psi_N(\hat\theta,\hat\eta)=0.$
171: \ed
172: Assumption \textbf{A3} is typically established by showing that the information
173: operator $B^*_0B_0$ is continuously invertible and thus that $\eta$ is
174: estimable at a $\sqrt{N}$ rate.
175: This is the most restrictive assumption, but one that leads quickly to our main result.
176:
177: With two phase sampling, however, $X$ is not observed for all $N$ subjects.
178: At phase one we observe only a coarsening $\tilde X=\tilde X(X)$ of $X$
179: plus auxiliary variables $U \in {\cal U}$ that serve to determine the sampling strata.
180: $X$ is fully observed for subjects sampled at phase two.
181: Let $W=(X,U) \in {\cal W} = {\cal X} \times {\cal U}$ denote the variables
182: potentially available for everyone, but in fact fully observed only for those
183: in the phase two sample, and $V=(\tilde X,U) \in {\cal V} = {\cal \tilde X} \times {\cal U}$
184: denote the variables actually observed for everyone.
185: We write $\tilde{P}_0$ for the distribution of $W = (X,U)$ and
186: denote by $\Sigma_N=\sigma[W_1,\ldots,W_N]$ the sigma field of information,
187: also referred to as the complete data, potentially available for the $N$ subjects.
188: A sequence of binary indicators $(\xi_1,\ldots,\xi_N)$ shows which
189: subjects are selected $(\xi_i=1)$ at phase two for observation of $X_i$.
190: We consider two probability models for the indicators $\xi_i$.
191: In the first, known as Bernoulli or Manski-Lerman (\citeyear{200}) sampling,
192: each phase one subject is examined in succession for the value of
193: $V_i$ and the indicator $\xi_i$ is independently generated with
194: $\Pr(\xi_i=1|W_i)=\Pr(\xi_i=1|V_i) = \pi_0(V_i)$ where $\pi_0$ is a
195: known sampling function.
196: This preserves the i.i.d. structure for the observations $(\xi_i,V_i,\xi_iX_i)$.
197: Note the crucial missing at random (MAR) assumption: $\pi_0$
198: depends only on what is observed at phase one.
199: We write $Q_0$ for the
200: distribution of $(W_i, \xi_i)$.
201: If ${\cal V}$ is partitioned into $J$ strata ${\cal V}_1 \cup \cdots \cup {\cal V}_J$,
202: stratified Bernoulli sampling corresponds to the special case where
203: $\pi_0(v)=p_j$ for $v \in {\cal V}_j$.
204: We assume that all $J$ strata are sampled with positive probability, or more generally that
205: \begin{equation}
206: 0 < \sigma \leq \pi_0(v) \leq 1 \quad \mbox{for} \quad v \in {\cal V}.
207: \label{eq:boundedweights}
208: \end{equation}
209: Even though the sampling fractions are known, it is advisable to estimate
210: $\pi_0$ in order to increase the efficiency of WL \citep{3937}.
211: We consider estimation of $\pi_0$ using a parametric model in \S 6.
212:
213: The second sampling model corresponds to Neyman's original design
214: and is usually closer to actual practice.
215: Here we observe the entire phase one sample at once and record the
216: stratum frequencies
217: $N_j=\sum_{i=1}^N \b1_{{\cal V}_j}(V_i)$ for $j=1,\ldots,J$.
218: At phase two samples of size $n_j \leq N_j$ are drawn at random, without
219: replacement, from each of the $J$ finite phase one strata.
220: Using now a doubly subscripted notation where $\xi_{j,i}$
221: denotes the indicator variable for the $i^{\mbox{th}}$ subject in stratum
222: $j$, the essential features of this design are that, conditionally
223: on $\Sigma_N$: ($i$) for $j=1,\ldots,J$
224: the random variables $(\xi_{j,1},\ldots,\xi_{j,N_j})$ are exchangeable
225: with $\Pr(\xi_{j,i}=1|\Sigma_N)={n_j}/{N_j}$;
226: and ($ii$) the $J$ random vectors $(\xi_{j,1},\ldots,\xi_{j,N_j})$ are independent.
227: Our problem is to estimate $(\theta,\eta)$ using the incomplete observations $V_i$ on everyone and the complete observations
228: $X_i$ on subjects sampled at phase two.
229:
230: \section{Weighted Likelihood Estimator}
231:
232: WL estimates are obtained by solving Horvitz-Thompson
233: (IPW) versions of the likelihood equations.
234: Define the \textit{inverse probability weighted empirical measure} by
235: \begin{equation}
236: \PP_N^{\pi} = \frac{1}{N}\sum_{i=1}^N\frac{\xi_i}{\pi_i}\delta_{X_i} , \label{eq:empiricalmeasure}
237: \end{equation}
238: where $\delta_{X_i}$ denotes Dirac measure placing unit mass on $X_i$ and
239: \begin{eqnarray*}
240: \pi_i & = & \left \{ \begin{array}{ll} \pi_0(V_i) &
241: \mbox{\ for\ Bernoulli\ sampling}\\ \mbox{ } \\ \frac{n_j}{N_j}
242: \mbox{\ if\ }V_i\in{\cal V}_j &
243: \mbox{\ for\ finite\ population\ stratified\ sampling}. \end{array} \right .
244: \end{eqnarray*}
245: Then, instead of (\ref{eq:like}) we solve
246: \begin{eqnarray}
247: \Psi_{N1}^{\pi}(\theta,\eta) & = & \PP_N^{\pi}\dot\ell_{\theta,\eta} \, = \, 0 \nonumber \\
248: \Psi_{N2}^{\pi}(\theta,\eta) & = & \PP_N^{\pi}B_{\theta,\eta}h - P_{\theta, \eta} B_{\theta, \eta} h
249: \, = \, 0 \qquad \mbox{for all} \ \ h \in {\cal H}. \label{eq:IPWlike}
250: \end{eqnarray}
251:
252: In view of the MAR assumption, for any integrable function $f:{\cal X} \mapsto \RR$
253: and under either Bernoulli or finite population stratified sampling,
254: \[
255: \E\frac{\xi_i}{\pi_i}f(X_i) = \E \left [
256: \E \left ( \left . \frac{\xi_i}{\pi_i} \right | \Sigma_N \right ) f(X_i) \right ] = \E f(X_i), \; \; i=1,\ldots,N,
257: \]
258: so that $\E \PP_N^{\pi}f = \E \PP_N f = P_0f$.
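In both designs the inner conditional expectation equals one. As a brief check, under Bernoulli sampling $\E(\xi_i/\pi_i \mid \Sigma_N)=\pi_0(V_i)/\pi_0(V_i)=1$, while under finite population stratified sampling, for a subject with $V_i \in {\cal V}_j$ and assuming $n_j \geq 1$,
\[
\E \left ( \left . \frac{\xi_i}{\pi_i} \right | \Sigma_N \right )
= \frac{N_j}{n_j}\Pr(\xi_i=1|\Sigma_N) = \frac{N_j}{n_j}\cdot\frac{n_j}{N_j} = 1 .
\]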
259: Consequently, the random map $\Psi_N^{\pi}=(\Psi_{N1}^{\pi},\Psi_{N2}^{\pi})$
260: defined by (\ref{eq:IPWlike}) has the same expectation as the random
261: map $\Psi_N$ in (\ref{eq:like}), namely $\Psi=(\Psi_1,\Psi_2)$ as in (\ref{eq:expectedmap}).
262: The implication is that the assumptions \textbf{A1}-\textbf{A4}
263: made to guarantee the asymptotic normality of the ML estimator
264: based on complete phase one data are also the assumptions needed
265: to guarantee the asymptotic normality of the WL estimator based on two phase data.
266: Indeed, van der Vaart's (\citeyear{4895}) Theorem 25.90,
267: or more precisely his Theorem 19.26 of which it is a restatement,
268: applies virtually without change to the Bernoulli sampling setup.
269: The Donsker class ${\cal F}$ in \textbf{A1} is modified to
270: $\tilde{\cal F} = \{ [\xi/\pi_0(V)] f(X), f \in {\cal F}\}$.
271: Since under the hypothesis (\ref{eq:boundedweights}) it is the product
272: of a fixed bounded function with the Donsker class ${\cal F}$, the fact
273: that $\tilde{\cal F}$ is Donsker for the joint distribution $Q_0$
274: of $(W, \xi)$ follows from \citet[example 2.10.10]{4920}.
275: The random map $\Psi_N^{\pi}$ corresponding to the estimating
276: functions (\ref{eq:IPWlike}) is the ordinary empirical measure
277: $\QQ_N$ for $\{(W_i, \xi_i), i=1,\ldots, N\}$ applied to the unbiased
278: estimating functions $(\xi/\pi_0)\dot\ell_{\theta,\eta}$ and $(\xi/\pi_0)B_{\theta,\eta}h$.
279: \textbf{A4} will generally follow from (\ref{eq:boundedweights})
280: and the arguments used to establish consistency for the complete data ML estimator,
281: together with (\ref{eq:IPWlike}).
282: \textbf{A2} and \textbf{A3} are unchanged.
283: The more general Theorem 3.3.1 of \citet{4920} is needed, however,
284: to deal with the non i.i.d. data induced by finite population stratified sampling.
285: To verify its hypotheses, we first must establish weak convergence
286: of the empirical process based on $\PP_N^{\pi}$.
287:
288: \section{Weak Convergence of the IPW Empirical Process}
289:
290: Two phase stratified sampling resembles the bootstrap in that it
291: involves random sampling from the finite, albeit incompletely
292: observed, population $\{X_1,\ldots,X_N\}$.
293: Here we use results on weighted bootstrap empirical processes
294: from \citet[Theorem 2.2]{4922}, as incorporated in \citet[Theorem 3.6.13]{4920},
295: to demonstrate weak convergence of the IPW empirical process
296: $\mathbb{G}_N^{\pi}=\sqrt{N}(\PP_N^{\pi}-P_0)$ for finite population stratified sampling.
297: First note that, with the subscript $j,i$ denoting the
298: $i^{\mbox{th}}$ of $N_j$ observations in stratum $j$,
299: \begin{eqnarray}
300: \mathbb{P}_N^{\pi} & = & \frac{1}{N}\sum_{j=1}^J\frac{N_j}{n_j}\sum_{i=1}^{N_j}\xi_{j,i}\delta_{X_{j,i}}
301: = \frac{1}{N}\sum_{j=1}^J \frac{N_j^2}{n_j} \mathbb{P}_{j,N_j}^{\xi} \label{eq:bootstrap}
302: \end{eqnarray}
303: where
304: \[
305: \mathbb{P}_{j,N_j}^{\xi} = \frac{1}{N_j}\sum_{i=1}^{N_j}\xi_{j,i}\delta_{X_{j,i}}
306: \]
307: is a \textit{finite sampling empirical measure} for the $j^{\mbox{th}}$
308: stratum.
309: Similarly one can express the ordinary empirical measure as
310: \begin{equation}
311: \mathbb{P}_N = \frac{1}{N}\sum_{j=1}^JN_j\mathbb{P}_{j,N_j}
312: \label{eq:empirical}
313: \end{equation}
314: where
315: \begin{equation}
316: \mathbb{P}_{j,N_j} = \frac{1}{N_j} \sum_{i=1}^N \delta_{X_i} \b1_{{\cal V}_j} (V_i)
317: = \frac{1}{N_j}\sum_{i=1}^{N_j}\delta_{X_{j,i}}
318: \label{eq:stratumwiseEmpirical}
319: \end{equation}
320: denotes the empirical measure for the $j^{\mbox{th}}$ stratum.
321: Justification of the second (doubly indexed) form is given in Appendix A.
322:
323: Combining (\ref{eq:bootstrap}) and (\ref{eq:empirical}), and letting
324: $\GG_N=\sqrt{N}(\PP_N-P_0)$ denote the standard empirical process, we have
325: \begin{eqnarray}
326: \mathbb{G}_N^{\pi}
327: & = & \sqrt{N}\left(\mathbb{P}_N^{\pi}-P_0\right) \nonumber \\
328: & = & \sqrt{N}\left(\mathbb{P}_N-P_0\right)
329: +\sqrt{N}\left(\mathbb{P}_N^{\pi}-\mathbb{P}_N\right) \nonumber \\
330: & = & \GG_N + \frac{1}{\sqrt{N}}\sum_{j=1}^J
331: \left(\frac{N_j^2}{n_j}\right)\left(\mathbb{P}_{j,N_j}^{\xi}-\frac{n_j}{N_j}
332: \mathbb{P}_{j,N_j}\right) \nonumber \\
333: & = & \GG_N
334: +\sum_{j=1}^J\sqrt{\frac{N_j}{N}}\left(\frac{N_j}{n_j}\right)\mathbb{G}^{\xi}_{j,N_j}
335: \label{eq:IPWempiricalexpansion}
336: \end{eqnarray}
337: where
338: \begin{equation}
339: \mathbb{G}^{\xi}_{j,N_j} = \sqrt{N_j}\left(\mathbb{P}_{j,N_j}^{\xi}
340: -\frac{n_j}{N_j}\mathbb{P}_{j,N_j}\right) \label{eq:weightedbootstrapprocess}
341: \end{equation}
342: is the \textit{finite sampling empirical process} for stratum $j$.
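The last equality in (\ref{eq:IPWempiricalexpansion}) is only a rearrangement of normalizing constants. As a check, substituting $\mathbb{P}_{j,N_j}^{\xi}-(n_j/N_j)\mathbb{P}_{j,N_j} = N_j^{-1/2}\,\mathbb{G}^{\xi}_{j,N_j}$ from (\ref{eq:weightedbootstrapprocess}) gives, for each stratum $j$,
\[
\frac{1}{\sqrt{N}}\cdot\frac{N_j^2}{n_j}\cdot\frac{1}{\sqrt{N_j}}\,\mathbb{G}^{\xi}_{j,N_j}
= \frac{N_j^{3/2}}{\sqrt{N}\,n_j}\,\mathbb{G}^{\xi}_{j,N_j}
= \sqrt{\frac{N_j}{N}}\left(\frac{N_j}{n_j}\right)\mathbb{G}^{\xi}_{j,N_j} .
\]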
343:
344: The first term in (\ref{eq:IPWempiricalexpansion}) converges to the
345: $P_0$-Brownian bridge process $\mathbb{G}$ indexed by the Donsker
346: class ${\cal F}$ mentioned in \textbf{A1}.
347: Let $P_{0|j}(\cdot)=\E(\cdot|V \in{\cal V}_j)$ denote $\tilde{P}_0$
348: conditional on membership in stratum $j$, \textit{i.e.}, for measurable
349: $A \subset {\cal X}$, $P_{0|j} (A) = \tilde P_0 [\b1_A(X) \b1_{{\cal V}_j}(V)]/\nu_j$
350: with $\nu_j= \tilde P_0\b1_{{\cal V}_j}(V)$,
351: and let $\mathbb{G}_j$ denote the
352: $P_{0|j}$-Brownian bridge, also indexed by $ {\cal F}$.
353: Our goal is to establish the weak convergence of the remaining terms
354: on the RHS of (\ref{eq:IPWempiricalexpansion}).
355: If as $N \rightarrow \infty$ the sampling fractions converge
356: with $n_j/N_j \rightarrow p_j$, the assumption on the exchangeable
357: ``weights'' $(\xi_{j,1},\ldots,\xi_{j,N_j})$ in equation (3.6.8) of \citet{4920} holds trivially with
358: \[
359: \frac{1}{N_j}\sum_{i=1}^{N_j}\left(\xi_{j,i}-\bar\xi_{j.}\right)^2
360: \stackrel{\mbox{p}}{\rightarrow} p_j(1-p_j) \label{eq:VdVW368}.
361: \]
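To see why this holds, note that the indicators are binary with $\sum_{i=1}^{N_j}\xi_{j,i}=n_j$, so that $\xi_{j,i}^2=\xi_{j,i}$ and $\bar\xi_{j.}=n_j/N_j$. The average squared deviation of the indicators from their stratum mean is therefore deterministic given $\Sigma_N$:
\[
\frac{1}{N_j}\sum_{i=1}^{N_j}\left(\xi_{j,i}-\bar\xi_{j.}\right)^2
= \bar\xi_{j.}-\bar\xi_{j.}^{\,2}
= \frac{n_j}{N_j}\left(1-\frac{n_j}{N_j}\right) \rightarrow p_j(1-p_j) .
\]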
362: Furthermore, with $\rightsquigarrow$ denoting weak convergence in
363: $\ell^{\infty}({\cal F})$, $\sqrt{N_j} ( \PP_{j,N_j} - P_{0|j} ) \rightsquigarrow \GG_{j}$;
364: see Appendix B for the proof.
365: Thus their Theorems 3.6.13 and 1.12.4 imply that, for almost
366: every sequence of complete data,
367: $\mathbb{G}_{j,N_j}^{\xi} \rightsquigarrow \sqrt{p_j(1-p_j)}\mathbb{G}_j$.
368: Conditionally on $\Sigma_N$, the processes $\mathbb{G}^{\xi}_{j,N_j}$
369: are mutually independent because of the independence of the
370: $\{\xi_{j,i}\}$ in different strata.
371: Furthermore, by virtue of the fact that they also are (unconditionally)
372: uncorrelated with $\mathbb{G}_N=\sqrt{N}(\mathbb{P}_N-P_0)$,
373: which follows along the lines of \citet[Corollary 2.9.3]{4920},
374: or that (conditionally) they have the same limiting distributions for
375: almost all sequences of data, the vector of processes
376: $(\mathbb{G}_N,\mathbb{G}_{1,N_1}^{\xi},\ldots,\mathbb{G}_{J,N_J}^{\xi})$
377: converges weakly to the vector of independent, rescaled Brownian bridge processes
378: $(\mathbb{G},\sqrt{p_1(1-p_1)}\,\mathbb{G}_1,\ldots,\sqrt{p_J(1-p_J)}\,\mathbb{G}_J)$.
379: Consequently
380: \begin{equation}
381: \mathbb{G}_N^{\pi} \rightsquigarrow \mathbb{G}
382: + \sum_{j=1}^J\sqrt{\nu_j}\sqrt{\frac{1-p_j}{p_j}} \mathbb{G}_j.
383: \label{eq:limIPWempiricalprocess}
384: \end{equation}
385: This result formalizes and extends Proposition 1 of \citet{218}
386: and the arguments in \S 4 of \citet{4324}.
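The coefficients in (\ref{eq:limIPWempiricalprocess}) can be traced term by term. As a sketch, assuming $p_j>0$ and using $N_j/N \rightarrow \nu_j$ almost surely (strong law of large numbers) together with $N_j/n_j \rightarrow 1/p_j$, the $j^{\mbox{th}}$ summand of (\ref{eq:IPWempiricalexpansion}) has limiting scale
\[
\sqrt{\frac{N_j}{N}}\left(\frac{N_j}{n_j}\right)\sqrt{p_j(1-p_j)}
\;\rightarrow\; \sqrt{\nu_j}\,\frac{1}{p_j}\sqrt{p_j(1-p_j)}
= \sqrt{\nu_j}\sqrt{\frac{1-p_j}{p_j}} .
\]
Because the limiting Brownian bridges are independent, the limit process evaluated at a real valued $f \in {\cal F}$ is mean zero Gaussian with variance
\[
\left[P_0 f^2-(P_0f)^2\right] + \sum_{j=1}^J \nu_j\frac{1-p_j}{p_j}\left[P_{0|j}f^2-(P_{0|j}f)^2\right] .
\]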
387:
388: \section{Asymptotic Distributions of the WL Estimator}
389:
390: We apply Theorem 19.26 of \citet{4895} to conclude that,
391: under Bernoulli sampling,
392: \begin{equation}
393: \sqrt{N} \dot\Psi_0 \left( \begin{array}{c}
394: \hat{\theta}-\theta_0 \\ \hat{\eta}-\eta_0 \end{array} \right )
395: = -\GG_N \frac{\xi}{\pi_0}
396: \left ( \begin{array}{c} \dot\ell_0 \\
397: B_0h \end{array} \right ) \; + \; o_p(1). \label{eq:BernoulliForm}
398: \end{equation}
399: Similarly, using Theorem 3.3.1 of \citet{4920} together with the development
400: of the previous section, we conclude that
401: for finite population stratified sampling
402: \begin{equation}
403: \sqrt{N} \dot\Psi_0 \left( \begin{array}{c}
404: \hat{\theta}-\theta_0 \\ \hat{\eta}-\eta_0 \end{array} \right )
405: = -\GG_N^{\pi}
406: \left ( \begin{array}{c} \dot\ell_0 \\
407: B_0h \end{array} \right ) \;+ \;o_p(1). \label{eq:FPSSForm}
408: \end{equation}
409: We have already argued that the hypotheses of the first theorem
410: follow from appropriately modified versions of \textbf{A1}-\textbf{A4}.
411: Together with the weak convergence of $\mathbb{G}_N^{\pi}$
412: just established, they also suffice for the second theorem.
413: In particular, the stochastic condition (3.3.2) of \citet{4920} follows
414: from \textbf{A1} and \textbf{A2} together with the proof of their Lemma 3.3.5
415: applied to each of $\GG_N,\GG^{\xi}_{1,N_1},\ldots, \GG^{\xi}_{J,N_J}$.
416:
417: In practice attention is usually focused on inferences for the Euclidean parameter $\theta$.
418: To derive a general expression for the asymptotic variance of $\hat\theta$ we further assume
419: \bd
420: \item[A5] $\dot\Psi_0$ admits a partition as in equation (25.91)
421: of \citet{4895} where the information operator
422: $B_0^* B_0$ is continuously invertible.
423: \ed
424: Following closely the arguments in \S 25.12 of van der Vaart, we
425: calculate from (\ref{eq:BernoulliForm}) that under Bernoulli sampling
426: \begin{equation}
427: \sqrt{N}(\hat{\theta}-\theta_0) = \GG_N \frac{\xi}{\pi_0}\tilde\ell_0 + o_p (1) \label{eq:mainresultBernoulli}
428: \end{equation}
429: whereas from (\ref{eq:FPSSForm}) under finite population stratified sampling
430: \begin{equation}
431: \sqrt{N}(\hat{\theta}-\theta_0)= \GG_N^{\pi} \tilde\ell_0 + o_p (1),
432: \label{eq:mainresultFPSS}
433: \end{equation}
434: where in both cases $\tilde\ell_0$ denotes the efficient influence function
435: \begin{equation}
436: \tilde\ell_0 = \tilde I_0^{-1}\left ( I-B_0\left ( B^*_0B_0 \right )^{-1} B^*_0 \right ) \dot\ell_0 \label{eq:effinfluence}
437: \end{equation}
438: and
439: \begin{equation}
440: \tilde I_0
441: = P_0 \left [ \left ( I-B_0\left ( B^*_0B_0 \right )^{-1} B^*_0 \right )
442: \dot\ell_0\dot\ell_0^{T} \right ] \label{eq:effinfo}
443: \end{equation}
444: is the efficient information.
445: Since $P_0\tilde\ell_0=0$, moreover, both (\ref{eq:mainresultBernoulli}) and (\ref{eq:mainresultFPSS}) may be expressed as
446: \begin{equation}
447: \sqrt{N}(\hat\theta-\theta_0) = \sqrt{N} \PP_N^{\pi}\tilde\ell_0 + o_p(1)=
448: \frac{1}{\sqrt{N}} \sum_{i=1}^N \frac{\xi_i}{\pi_i}\tilde\ell_0(X_i) +o_p(1), \label{eq:general}
449: \end{equation}
450: which expansion constitutes the principal result of this paper.
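The passage to (\ref{eq:general}) uses only the centering $P_0\tilde\ell_0=0$. As a sketch for Bernoulli sampling, reading $\GG_N$ applied to a function of $(W,\xi)$ as $\sqrt{N}(\QQ_N-Q_0)$ (our interpretation of the notation) and noting that $Q_0[(\xi/\pi_0)\tilde\ell_0]=P_0\tilde\ell_0=0$,
\[
\GG_N \frac{\xi}{\pi_0}\tilde\ell_0
= \sqrt{N}\left(\QQ_N-Q_0\right)\frac{\xi}{\pi_0}\tilde\ell_0
= \sqrt{N}\,\PP_N^{\pi}\tilde\ell_0 - \sqrt{N}\,P_0\tilde\ell_0
= \sqrt{N}\,\PP_N^{\pi}\tilde\ell_0 .
\]
For finite population stratified sampling the same cancellation gives $\GG_N^{\pi}\tilde\ell_0=\sqrt{N}(\PP_N^{\pi}-P_0)\tilde\ell_0=\sqrt{N}\,\PP_N^{\pi}\tilde\ell_0$.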
451:
452: Under Bernoulli sampling with known $\pi_0$ the asymptotic variance is therefore
453: \begin{eqnarray}
454: \Var_{\mbox{A}}\sqrt{N}(\hat\theta-\theta_0) & = &
455: \Var \left ( \frac{\xi}{\pi_0} \tilde\ell_0 \right ) \nonumber \\
456: & = &
457: \Var \; \E \left ( \left . \frac{\xi}{\pi_0} \tilde\ell_0 \right | W \right ) +
458: \E \; \Var \left ( \left . \frac{\xi}{\pi_0} \tilde\ell_0 \right | W \right ) \nonumber \\
459: & = & \Var(\tilde\ell_0) + \E \left [ \frac{\tilde\ell_0^{\otimes 2}}{\pi_0^2} \Var(\xi|W)\right ]
460: \nonumber \\
461: & = & \tilde I_0^{-1} + \tilde P_0 \left ( \frac{1-\pi_0}{\pi_0}\tilde\ell_0^{\otimes 2} \right ). \label{eq:IPWvariance}
462: \end{eqnarray}
463: In the special case of stratified Bernoulli sampling, with $\pi_i=\pi_0(V_i)=p_j$ for $V_i \in {\cal V}_j$, this becomes
464: \begin{equation}
465: \tilde I_0^{-1} + \sum_{j=1}^J\nu_j\frac{1-p_j}{p_j}P_{0|j}\left (\tilde\ell_0^{\otimes 2} \right ).
466: \label{eq:varIPWstrataiid}
467: \end{equation}
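The reduction from (\ref{eq:IPWvariance}) to (\ref{eq:varIPWstrataiid}) follows by splitting the expectation over strata. As a brief check, assuming the strata partition ${\cal V}$ and $\pi_0 \equiv p_j$ on ${\cal V}_j$,
\[
\tilde P_0 \left ( \frac{1-\pi_0}{\pi_0}\tilde\ell_0^{\otimes 2} \right )
= \sum_{j=1}^J \frac{1-p_j}{p_j}\,\tilde P_0\left[\b1_{{\cal V}_j}(V)\,\tilde\ell_0^{\otimes 2}\right]
= \sum_{j=1}^J \nu_j\frac{1-p_j}{p_j}\,P_{0|j}\left(\tilde\ell_0^{\otimes 2}\right),
\]
where the last step uses $\tilde P_0[\b1_{{\cal V}_j}(V)\,g(X)]=\nu_j P_{0|j}\,g$ for integrable $g$.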
468: On the other hand, from (\ref{eq:limIPWempiricalprocess}) and
469: (\ref{eq:mainresultFPSS}), the asymptotic variance under finite population stratified sampling is
470: \begin{equation}
471: \tilde I_0^{-1} + \sum_{j=1}^J\nu_j\frac{1-p_j}{p_j}\Var_j(\tilde\ell_0), \label{eq:varIPWstratified}
472: \end{equation}
473: where $\Var_j(f)=P_{0|j}(f^{\otimes 2})-[P_{0|j}(f)]^{\otimes 2}$.
474: Comparing the last two expressions shows the substantial potential gain from
475: keeping track of the stratum frequencies for the phase one data.
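To make the comparison explicit, subtracting (\ref{eq:varIPWstratified}) from (\ref{eq:varIPWstrataiid}) leaves
\[
\sum_{j=1}^J\nu_j\frac{1-p_j}{p_j}\left[P_{0|j}\left(\tilde\ell_0^{\otimes 2}\right)-\Var_j(\tilde\ell_0)\right]
= \sum_{j=1}^J\nu_j\frac{1-p_j}{p_j}\left(P_{0|j}\tilde\ell_0\right)^{\otimes 2},
\]
which is a positive semidefinite matrix. It vanishes only if $P_{0|j}\tilde\ell_0=0$ in every stratum with $p_j<1$, so the gain is greatest when the stratum specific means of the influence function are far from zero.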
476:
477: \section{Bernoulli Sampling with Estimated Weights}
478: Let ${\cal V}_0$ denote an additional stratum, possibly null, such that $\xi_i=1$ for $V_i \in {\cal V}_0$.
479: Introduction of this special stratum with $p_0=1$ does not affect the previous development;
480: in particular, equations (\ref{eq:general})-(\ref{eq:varIPWstratified}) continue to hold.
481: For $V_i \notin {\cal V}_0$ suppose
482: \begin{equation}
483: \Pr(\xi_i=1|X_i, V_i;\alpha) = \Pr(\xi_i=1|V_i;\alpha) = \pi_{\alpha}(V_i) < 1 \label{eq:modelforPi}
484: \end{equation}
485: where $\alpha \in \Xi \subset \RR^q$
486: is a parameter to be estimated by
487: maximum likelihood from the phase
488: one observations $\{V_i, i=1,\ldots,N\}$ not in ${\cal V}_0$.
489: We assume sufficient regularity in the model for $\alpha$, e.g., to satisfy the hypotheses of
490: Theorem 5.21 of \citet{4895}, so that the ML estimator
491: $\hat\alpha$ is consistent and asymptotically normal with influence function
492: \begin{equation}
493: \tilde\ell^{\alpha}_0= \b1_{{\cal V}_0^c}
494: \left ( \tilde P_0 \b1_{{\cal V}_0^c} \frac{\dot\pi_0^{\otimes 2}} {\pi_0(1-\pi_0)} \right )^{-1} \dot\pi_0\frac{\xi-\pi_0}{\pi_0(1-\pi_0)}.
495: \label{eq:InfFuncAlpha}
496: \end{equation}
497: Here for $V \in {\cal V}_0^c$, the complement of ${\cal V}_0$,
498: $\pi_0(V)=\pi_{\alpha_0}(V)$ is the true sampling function while $\dot\pi_0(V)$
499: denotes the $q$-vector of partial derivatives of $\pi_{\alpha}(V)$ with
500: respect to $\alpha$ evaluated at $\alpha=\alpha_0$.
501: If $\hat\theta(\alpha)$ denotes the WL estimator under two phase
502: Bernoulli sampling with ``known'' sampling function $\pi_{\alpha}(V)$,
503: then from (\ref{eq:InfFuncAlpha}) and (\ref{eq:general}) we have
504: \begin{equation}
505: \sqrt{N}\left(\begin{array}{c}\hat\theta(\alpha_0)-\theta_0\\[.2cm]
506: \hat\alpha-\alpha_0 \end{array}\right)
507: = \sqrt{N}\left(\begin{array}{c}\PP_N^{\pi}\tilde\ell_0 \\[.2cm] \QQ_N\tilde\ell^{\alpha}_0
508: \end{array}\right) + o_p(1). \label{eq:jointExpansion}
509: \end{equation}
510: Furthermore, with $\hat\pi_i=\pi_{\hat\alpha}(V_i)$ for
511: $V_i \in {\cal V}_0^c$ and $\hat\pi_i=1$ otherwise, we show in Appendix C that under some further mild assumptions regarding $\pi_{\alpha}(V)$
514: \begin{equation}
515: \sqrt{N}(\PP_N^{\hat\pi}-\PP_N^{\pi_0})\tilde\ell_0
516: = - \tilde P_0 \left (\b1_{{\cal V}_0^c} \frac{\tilde\ell_0
517: \dot\pi^{T}_0}{\pi_0}\right ) \sqrt{N}(\hat\alpha-\alpha_0) + o_p(1). \label{eq:TaylorExpansion}
518: \end{equation}
519: The joint asymptotic normality of $(\hat\theta(\alpha_0),\hat\alpha)$ that follows
520: from (\ref{eq:jointExpansion}), together with the Taylor expansion
521: (\ref{eq:TaylorExpansion}), are precisely the hypotheses used by \citet{855}
522: to deduce that $\sqrt{N}[\hat\theta(\hat\alpha)-\theta_0] \rightsquigarrow Z $
523: where $Z\in \RR^p $ is mean zero Gaussian with covariance matrix
524: \begin{equation}
525: \Var_{\mbox{A}}\sqrt{N}\left(\hat\theta(\hat\alpha)-\theta_0\right)
526: = \Var \left ( \frac{\xi}{\pi_0}\tilde\ell_0 \right ) - \tilde P_0 \b1_{{\cal V}_0^c}
527: \frac{\tilde\ell_0 \dot\pi_0^T}{\pi_0} \left ( \tilde P_0 \b1_{{\cal V}_0^c}
528: \frac{\dot\pi_0^{\otimes 2}}{\pi_0(1-\pi_0)} \right )^{-1}
529: \tilde P_0 \b1_{{\cal V}_0^c} \frac{\dot\pi_0\tilde\ell_0^T }{\pi_0}. \label{eq:VarEstAlpha}
530: \end{equation}
531: A matrix calculation shows that, when (\ref{eq:VarEstAlpha}) is evaluated for stratified Bernoulli sampling
532: \[
533: \pi_{\alpha} = \pi_{\alpha}(V) = \left \{ \begin{array}{ll} 1, & V \; \in \; {\cal V}_0 \\
534: \alpha_j, & V \; \in \; {\cal V}_j, \; j=1,\ldots,J , \end{array} \right .
535: \]
536: the asymptotic variance for the WL estimator $\hat\theta$ with
537: \textit{estimated} sampling probabilities $\hat\alpha_j=n_j/N_j$ is identical to
538: the finite population sampling variance (\ref{eq:varIPWstratified}) with $p_j=\alpha_{j,0} = \lim n_j/N_j$.
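A sketch of this matrix calculation follows, assuming that ${\cal V}_1,\ldots,{\cal V}_J$ partition ${\cal V}_0^c$ and that $0<p_j<1$ for $j \geq 1$. The symbols $e_j$ (the $j^{\mbox{th}}$ unit vector in $\RR^J$) and $\mu_j=P_{0|j}\tilde\ell_0$ are introduced here for illustration only. Since $\dot\pi_0(V)=e_j$ and $\pi_0(V)=p_j$ for $V \in {\cal V}_j$,
\[
\tilde P_0 \b1_{{\cal V}_0^c}\frac{\dot\pi_0^{\otimes 2}}{\pi_0(1-\pi_0)}
= \sum_{j=1}^J \frac{\nu_j}{p_j(1-p_j)}\,e_je_j^{T},
\qquad
\tilde P_0 \b1_{{\cal V}_0^c}\frac{\tilde\ell_0\dot\pi_0^{T}}{\pi_0}
= \sum_{j=1}^J \frac{\nu_j}{p_j}\,\mu_je_j^{T} .
\]
The first matrix is diagonal, so the term subtracted in (\ref{eq:VarEstAlpha}) equals
\[
\sum_{j=1}^J \left(\frac{\nu_j}{p_j}\right)^2\frac{p_j(1-p_j)}{\nu_j}\,\mu_j^{\otimes 2}
= \sum_{j=1}^J \nu_j\frac{1-p_j}{p_j}\,\mu_j^{\otimes 2} .
\]
Subtracting this from (\ref{eq:varIPWstrataiid}) gives $\tilde I_0^{-1}+\sum_{j=1}^J\nu_j\frac{1-p_j}{p_j}\left[P_{0|j}(\tilde\ell_0^{\otimes 2})-\mu_j^{\otimes 2}\right]$, which is (\ref{eq:varIPWstratified}).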
539:
540: Two possibilities present themselves for estimation of the terms in (\ref{eq:VarEstAlpha}).
541: Let $\hat\pi_i= \pi_{\hat\alpha}(V_i)$ for $V_i \in {\cal V}_0^c$ and $\hat\pi_i=1$ otherwise.
542: Then, using (\ref{eq:IPWvariance}), we could estimate the first term by
543: \[
544: \widehat{\Var\left ( \frac{\xi}{\pi_0}\tilde\ell_0 \right )}
545: = \tilde I^{-1}_{\hat\theta,\hat\eta} + \frac{1}{N}\sum_{i=1}^N\frac{\xi_i(1-\hat\pi_i)}{\hat\pi_i^2}\tilde\ell_{\hat\theta,\hat\eta}^{\otimes 2}(X_i),
546: \]
547: the expression in the middle of the second term by
548: \[
549: \widehat {\tilde P_0 \b1_{{\cal V}_0^c} \frac{\dot\pi_0^{\otimes 2}}{\pi_0(1-\pi_0)} }
550: = \frac{1}{N}\sum_{i=1}^N \b1_{{\cal V}_0^c}(V_i)
551: \frac{\dot\pi_{\hat\alpha}^{\otimes 2}(V_i)}{\hat\pi_i(1-\hat\pi_i)}
552: \]
553: and similarly for $\tilde P_0(\tilde\ell_0\dot\pi_0^{T}/\pi_0)$.
554: A more empirical approach, however, would be to use the
555: $\theta$ and $\alpha$ influence function contributions themselves to estimate these terms as in
556: \begin{eqnarray*}
557: \widehat{\Var\left ( \frac{\xi}{\pi_0}\tilde\ell_0 \right ) }
558: & = & \frac{1}{N}\sum_{i=1}^N \left ( \frac{\xi_i}{\hat\pi_i}\tilde\ell_{\hat\theta,\hat\eta}(X_i) \right )^{\otimes 2}, \\
559: \widehat {\tilde P_0 \b1_{{\cal V}_0^c} \frac{\tilde\ell_0\dot\pi_0^{T} } {\pi_0} }
560: & = & \frac{1}{N}\sum_{i=1}^N \b1_{{\cal V}_0^c}(V_i) \frac{\xi_i}{\hat\pi_i}
561: \frac{\tilde\ell_{\hat\theta,\hat\eta}(X_i)} {\hat\pi_i} \dot\pi_{\hat\alpha}(V_i)^{T} \\
562: & = & \frac{1}{N}\sum_{i=1}^N \b1_{{\cal V}_0^c }(V_i)
563: \left ( \frac{\xi_i\tilde\ell_{\hat\theta,\hat\eta}(X_i)}{\hat\pi_i} \right )
564: \left ( \frac{\dot\pi_{\hat\alpha}(V_i)^{T}(\xi_i-\hat\pi_i)}{\hat\pi_i(1-\hat\pi_i)} \right ) \quad \mbox{and} \\
565: \widehat { \tilde P_0 \b1_{{\cal V}_0^c} \frac{\dot\pi_0^{\otimes 2}}{\pi_0(1-\pi_0)} }
566: & = & \frac{1}{N}\sum_{i=1}^N \b1_{{\cal V}_0^c }(V_i)
567: \left ( \frac{ \dot\pi_{\hat\alpha}(V_i)(\xi_i-\hat\pi_i)}{\hat\pi_i(1-\hat\pi_i)} \right )^{\otimes 2}.
568: \end{eqnarray*}
The resulting asymptotic variance for $\hat\theta$ may be recognized as comprising the residual sums of squares and of cross products from the least squares regressions of each of the $p$ components of the $\hat\theta$ influence function contributions $\xi_i\tilde\ell_{\hat\theta,\hat\eta}(X_i)/\hat\pi_i$, to which subjects not in the
phase two sample contribute 0, on the $q$ components of the estimated $\hat\alpha$ influence function contributions (\ref{eq:InfFuncAlpha}), to which subjects having $V_i \in {\cal V}_0$ contribute 0.
571: See \citet{4931} for a recent discussion and interpretation.
572: This suggests the following estimation procedure:
573: \begin{enumerate}
574: \item Estimate $\alpha$ from the phase one data and compute the estimated sampling fractions $\hat\pi_i$.
\item Estimate $\theta$ and $\eta$ from the phase two data by WL, treating the inverses $1/\hat\pi_i$ as known weights.
576: \item Regress each component of the influence function contributions for $\hat\theta$ on those for $\hat\alpha$.
577: \item Estimate Var$_{\mbox{A}}(\hat\theta)$ as the matrix comprising the residual sums of squares and of cross products from these regressions.
578: \end{enumerate}
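In matrix terms, steps 3 and 4 may be sketched as follows; the symbols $Y_i$ and $A_i$ are shorthand introduced here only for illustration, and we take the regressions to be without intercept, in line with the uncentered sums displayed above.
Let $Y_i=\xi_i\tilde\ell_{\hat\theta,\hat\eta}(X_i)/\hat\pi_i \in \RR^p$ and
$A_i=\b1_{{\cal V}_0^c}(V_i)\dot\pi_{\hat\alpha}(V_i)(\xi_i-\hat\pi_i)/[\hat\pi_i(1-\hat\pi_i)] \in \RR^q$.
The matrix of residual sums of squares and cross products, divided by $N$, is
\[
\frac{1}{N}\left [ \sum_{i=1}^N Y_iY_i^{T}
- \left ( \sum_{i=1}^N Y_iA_i^{T} \right )
\left ( \sum_{i=1}^N A_iA_i^{T} \right )^{-1}
\left ( \sum_{i=1}^N A_iY_i^{T} \right ) \right ],
\]
which is the plug-in version of (\ref{eq:VarEstAlpha}) formed from the three empirical terms above.
Because least squares residuals are unchanged when the regressors are replaced by a nonsingular linear transformation of themselves, the same matrix results if each $A_i$ is premultiplied by a fixed nonsingular $q \times q$ matrix.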
579: \citet[p. 166]{4901}, who cited earlier work by \citet{4902}, suggested this
580: procedure for the special case of Cox regression, to which we now direct our attention.
581:
582: \section{Application to the Cox Proportional Hazards Model}
583: Our development of the Cox model follows closely that of \citet[\S 25.12]{4895}
where $X=(\Delta,T,Z)$ with $T=\min(\tilde T,C)$ a censored failure time,
585: $\Delta=\b1_{[\tilde T \leq C]}$ the failure indicator and $Z \in \RR^p$ a vector of covariates.
586: The Euclidean parameter is the $p$-vector of regression coefficients $\beta$ in the linear predictor $z\beta$.
587: The nonparametric parameter $\eta=(\Lambda,G,G_Z)$ has three
588: infinite dimensional components: $\Lambda(\cdot)=\int_0^{\cdot}\lambda(s)ds$ the baseline cumulative
589: hazard function, assumed differentiable; $G(t|z)=\Pr(C \leq t|Z=z)$ the conditional distribution of the
590: censoring time; and $G_Z$, the marginal distribution of the covariates.
591: We introduce the usual notation for the ``at risk" process
592: $Y(t)=\b1_{[T \geq t]}$ and the event counting process $N(t)=\Delta \b1_{[T \leq t]}$
593: and we make the standard assumptions: (i) that the true failure time $\tilde T$
594: and $C$ are independent given $Z$; and (ii) that there is a finite maximum
595: censoring time $\tau$ such that $\Pr[Y(\tau)=1]>0$.
596: \citet{4895} makes some further ``partly unnecessary" assumptions to simplify his development, namely that the covariates $Z$ are bounded,
597: that $G$ and $G_Z$ have densities as indicated and especially that $\Pr(C \geq \tau) = \Pr(C = \tau) >0$ (see discussion in \S 8).
598: Writing the density for $x=(\delta,t,z)$, with $z$ a row vector, as
599: \begin{equation}
600: e^{-e^{z\beta}\Lambda(t)}\left [e^{z\beta}\lambda(t)\left (1-G(t-|z)\right)\right]^{\delta}
601: \left[g(t|z)\right]^{1-\delta}g_Z(z) \label{eq:Coxlike},
602: \end{equation}
603: and noting that $G$ and $G_Z$ factor out of the complete data likelihood, \citet{4895}
604: considers ML estimation for $(\beta,\Lambda)$ only.
605: With ${\cal H}$ denoting various subsets of the space BV$[0,\tau]$ of bounded functions of bounded variation,
606: he develops the following explicit expressions for the $\beta$ score vector,
607: the $\Lambda$ score operator that maps functions $h \in {\cal H}$ to functions of the data, its adjoint (but only evaluated for the
608: $\beta$ scores) and the information operator that maps ${\cal H}$ onto itself:
609: \begin{eqnarray}
610: \dot\ell_{\beta,\Lambda}(x)
611: & = & \delta z -ze^{z\beta}\Lambda(t) \label{eq:Coxscore} \\
612: B_{\beta,\Lambda}h(x)
613: & = & \delta h(t) - e^{z\beta}\int_0^thd\Lambda \label{eq:CoxScoreOp} \\
614: B^*_{\beta,\Lambda}\dot\ell_{\beta,\Lambda}(t)
615: & = & P_{\beta,\Lambda}Y(t)Ze^{Z\beta} \nonumber \\
616: B^*_{\beta,\Lambda}B_{\beta,\Lambda}h(t)
617: & = & h(t) P_{\beta,\Lambda}Y(t)e^{Z\beta}. \nonumber
618: \end{eqnarray}
619: These are used to calculate the efficient scores
620: \begin{eqnarray*}
621: \ell^*_{\beta,\Lambda}(x)& = & \dot\ell_{\beta,\Lambda}-
622: B_{\beta,\Lambda}\left(B^*_{\beta,\Lambda}B_{\beta,\Lambda}\right)^{-1}B^*_{\beta,\Lambda}\dot\ell_{\beta,\Lambda} \\
623: & = & \delta\left[z-m(t;\beta)\right]-e^{z\beta}\int_0^t\left[z-m(s;\beta)\right]d\Lambda(s)
624: \end{eqnarray*}
625: and efficient information
626: \begin{eqnarray*}
627: \tilde I_0 & = & I_0 - P_0B_0\left(B^*_0B_0\right)^{-1}B^*_0\dot\ell_0 \\
628: & = & P_0 \left ( e^{Z\beta_0}\int_0^{\tau} \left [
629: Z - m(t;\beta_0) \right ]^{\otimes 2} \Pr(T \geq t |Z) d\Lambda_0(t) \right ) ,
630: \end{eqnarray*}
631: respectively,
where $I_0=P_0\dot\ell_0\dot\ell_0^{T}$ and $m(t;\beta) = S^{(1)}(t;\beta)/S^{(0)}(t;\beta)$ with
633: \begin{eqnarray*}
634: S^{(0)}(t;\beta) &=& P_0 e^{Z\beta}Y(t) \\
635: S^{(1)}(t;\beta) &=& P_0Ze^{Z\beta}Y(t).
636: \end{eqnarray*}
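The step from the operator expressions to the closed form of the efficient score may be spelled out as follows; this is a sketch in which $P_{\beta,\Lambda}$ is identified with $P_0$ in the definitions of $S^{(0)}$ and $S^{(1)}$.
Since $B^*_{\beta,\Lambda}B_{\beta,\Lambda}$ acts by multiplication with $S^{(0)}(t;\beta)$, its inverse acts by division, so that
\[
\left(B^*_{\beta,\Lambda}B_{\beta,\Lambda}\right)^{-1}B^*_{\beta,\Lambda}\dot\ell_{\beta,\Lambda}(t)
= \frac{P_{\beta,\Lambda}Y(t)Ze^{Z\beta}}{P_{\beta,\Lambda}Y(t)e^{Z\beta}} = m(t;\beta).
\]
Applying (\ref{eq:CoxScoreOp}) componentwise to $h=m(\cdot;\beta)$ and subtracting from (\ref{eq:Coxscore}) gives
\begin{eqnarray*}
\ell^*_{\beta,\Lambda}(x)
& = & \delta z - ze^{z\beta}\int_0^t d\Lambda(s)
- \left [ \delta m(t;\beta) - e^{z\beta}\int_0^t m(s;\beta)d\Lambda(s) \right ] \\
& = & \delta\left[z-m(t;\beta)\right]-e^{z\beta}\int_0^t\left[z-m(s;\beta)\right]d\Lambda(s).
\end{eqnarray*}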
637:
638: To fit the Cox model by WL to two phase stratified samples,
639: first define IPW estimators of the two quantities just considered by
640: $\hat S^{(0)}(t;\beta)=\PP_N^{\pi}e^{Z\beta}Y(t)$ and
$\hat S^{(1)}(t;\beta)=\PP_N^{\pi}Ze^{Z\beta}Y(t) $.
642: By definition the WL estimators solve
643: \begin{eqnarray}
644: \Psi_{N1}^{\pi}(\beta,\Lambda) & = & \PP_N^{\pi}\dot\ell_{\beta,\Lambda} = 0
645: \label{eq:CoxWL1} \\
646: \Psi_{N2}^{\pi}(\beta,\Lambda)h & = & \PP_N^{\pi} B_{\beta,\Lambda}h =0
647: \qquad \mbox{for all} \; h\in \; {\cal H}, \label{eq:CoxWL2}
648: \end{eqnarray}
649: where we have used the fact that $P_{\beta,\Lambda}B_{\beta,\Lambda}h=0 $.
650: Substituting
651: \[
h_t(s) \; = \; \frac{\b1_{[s\leq t]}}{\hat S^{(0)}(s;\beta)}
653: \]
654: for $h$ in (\ref{eq:CoxWL2}) and solving using (\ref{eq:CoxScoreOp})
655: shows that, for fixed $\beta$, the cumulative hazard
656: function that partially maximizes the weighted likelihood and, as is easily checked, satisfies $\PP_N^{\pi}B_{\beta,\hat\Lambda_{\beta}}h=0$ for all $h$, is
657: \begin{equation}
\hat\Lambda_{\beta}(t) \;= \; \PP_N^{\pi}\frac{\Delta\b1_{[T\leq t]}}{\hat\sz(T;\beta)} \; = \;\frac{1}{N} \sum_{i=1}^N \int_0^t\frac{\xi_i}{\pi_i}\frac{dN_i(s)}{\hat S^{(0)}(s;\beta)} \label{eq:Breslow}.
659: \end{equation}
This may be recognized as an IPW version of the so-called \citet{1266} estimator.
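The substitution step may be written out as a short check, valid wherever $\hat S^{(0)}(s;\beta)>0$.
Using $\int_0^T h\,d\Lambda = \int_0^{\tau}Y(s)h(s)\,d\Lambda(s)$ in (\ref{eq:CoxScoreOp}),
\begin{eqnarray*}
\PP_N^{\pi}B_{\beta,\Lambda}h_t
& = & \PP_N^{\pi}\frac{\Delta\b1_{[T\leq t]}}{\hat S^{(0)}(T;\beta)}
- \int_0^{t}\frac{\PP_N^{\pi}e^{Z\beta}Y(s)}{\hat S^{(0)}(s;\beta)}\,d\Lambda(s) \\
& = & \PP_N^{\pi}\frac{\Delta\b1_{[T\leq t]}}{\hat S^{(0)}(T;\beta)} - \Lambda(t),
\end{eqnarray*}
so that setting the left side equal to zero for each $t$ recovers (\ref{eq:Breslow}).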
661: Inserting this expression into (\ref{eq:CoxWL1}) and evaluating using (\ref{eq:Coxscore}) yields
662: \[
\Psi_{N1}^{\pi}(\beta,\hat\Lambda_{\beta})\; = \; \PP_N^{\pi}\Delta\left[Z-\hat m(T;\beta)\right] \;= \; \frac{1}{N}\sum_{i=1}^N\frac{\xi_i}{\pi_i}\Delta_i\left[Z_i-\frac{\hat S^{(1)}(T_i;\beta)}{\hat S^{(0)}(T_i;\beta)}\right]\;=\;0 ,
664: \]
665: which is the IPW Cox ``partial score" equation.
Its solution, together with (\ref{eq:Breslow}),
gives the estimators proposed for Cox regression by \citet{4326},
668: \citet{4902}, \citet[Estimator II]{4324}, \citet{4894} and others for a
669: variety of complex sampling and missing data problems.
670: Using the results of this paper, the large sample properties of $(\hat\beta,\hat\Lambda_{\hat\beta})$ follow
671: from those already developed for the ML estimators with complete data, which are given by the same equations with $\xi_i=\pi_i=1, i=1,\ldots,N$.
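For readers who wish to verify the IPW partial score equation displayed above, the evaluation may be sketched as follows, writing $\hat m(t;\beta)=\hat S^{(1)}(t;\beta)/\hat S^{(0)}(t;\beta)$ and using $d\hat\Lambda_{\beta}(s)=\PP_N^{\pi}dN(s)/\hat S^{(0)}(s;\beta)$ from (\ref{eq:Breslow}):
\begin{eqnarray*}
\PP_N^{\pi}\dot\ell_{\beta,\hat\Lambda_{\beta}}
& = & \PP_N^{\pi}\Delta Z - \PP_N^{\pi}Ze^{Z\beta}\int_0^{\tau}Y(s)\,d\hat\Lambda_{\beta}(s) \\
& = & \PP_N^{\pi}\Delta Z - \int_0^{\tau}\hat S^{(1)}(s;\beta)\,d\hat\Lambda_{\beta}(s) \\
& = & \PP_N^{\pi}\Delta Z - \PP_N^{\pi}\Delta\frac{\hat S^{(1)}(T;\beta)}{\hat S^{(0)}(T;\beta)}
\; = \; \PP_N^{\pi}\Delta\left[Z-\hat m(T;\beta)\right].
\end{eqnarray*}
With $\xi_i=\pi_i=1$ for all $i$, the last expression is the usual complete data partial score.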
672:
673: \section{Discussion}
674: The two phase stratified sampling designs considered here are quite
675: flexible in that the phase one strata may be formed using all available
676: information and sampled with arbitrary positive probabilities.
677: This is in the spirit of \citet{4326} and \citet{4894}, who considered
678: even more general complex sample survey designs.
679: Others \citep{4324,4871} have restricted their attention to covariate stratified versions of the case-cohort design, whereby all subjects who fail are sampled at phase two for complete covariate ascertainment.
680: Although this may well be an efficient design when the failure rate is low, the assumption that $\xi=1$ whenever $\Delta=1$ is often unnecessary and may sometimes be unduly restrictive.
681: Not only does it limit application when the phase one population has
682: large numbers of both failures and non-failures, it also does so when the sampling has been carried out for one failure type but it is of interest to evaluate another.
683: When following patients enrolled in a clinical trial, for example, all deaths may be sampled as ``cases" but it may later be decided to analyze the data also in terms of ``event-free survival".
In other contexts, biological samples may turn out to be non-informative so that data are still missing for substantial numbers of subjects, including failed cases, who are sampled at phase two.
685: Provided one is willing to make the standard MAR assumptions, WL
686: methods as described herein may still be used by determining the
687: stratum frequencies for subjects having complete data at phase two and using them to estimate the sampling weights.
688:
689: The major drawback of WL estimation is its lack of statistical efficiency.
690: Efforts to address this deficiency with Cox regression have been made by
691: several authors including \citet{3937}, \citet{4871}, \citet{4811}, \citet{4929} and \citet{4930}.
692: Most of these methods are relatively recent and involve sufficiently complex calculations, or sufficiently restrictive assumptions, that none have yet seen widespread use.
693: These limitations are certain to decline with advances in computing
694: hardware and software, making more efficient estimation methods more widely available.
695: In the meantime, the WL estimation procedure outlined at the end
696: of \S 6 offers a relatively simple and robust alternative.
697: It is likely to remain the method of choice for many survey statisticians for the reasons
698: mentioned in the introduction, namely, their interest in finite population
699: parameters defined as solutions to ML estimating equations.
700: As emphasized by \citet{3937}, in view of the interpretation of
701: (\ref{eq:VarEstAlpha}) as a residual sum of squares, inclusion of
702: additional variables in the model (\ref{eq:modelforPi}) for $\pi$ can
703: only enhance the efficiency of $\theta$ estimation.
704: When the sampling probabilities vary, as in finite population stratified
705: sampling, inclusion of the stratum factors in the model is essential to avoid bias.
706: Finer stratification, or the inclusion of auxiliary variables in the model for $\pi$, serves the cause of efficiency.
707: Equation (\ref{eq:varIPWstratified}) suggests that such additional
708: variables would be most valuable if they could somehow be chosen
709: to be highly correlated with the efficient scores.
710: The doubly weighted estimator developed by \citet{4871} for exposure stratified case-cohort studies is intriguing in that it uses
711: a separate set of (time-dependent) weights for each covariate.
712: A preliminary analysis is conducted to estimate quantities that resemble within stratum conditional expectations of partial score contributions given the phase one data, and these are used to form the weights.
713: An extension of their approach to more general two phase stratified sampling designs would be of considerable interest.
714:
715: This paper is limited in application to semiparametric models that satisfy the rather stringent assumptions \textbf{A1}-\textbf{A4} of \S 2.
716: Even in the case of Cox regression, these have been established only under the ``partly unnecessary" conditions imposed by \citet[\S 25.12.1]{4895}.
717: His assumption that everyone still ``on-study" is censored at the common time $\tau$ would apply to situations in which time $t$ referred to calendar time, everyone was entered on study at $t=0$ and there was a common closing date at $t=\tau$.
718: It would not apply, however, if subjects were entered on study at various
719: calendar times but withdrawn on a common closing date, and $t$ was taken to be ``time-on-study".
720: Nor would it apply if $t$ was ``age" and subjects both entered and exited the study at various ages.
721: We look forward to further work that relaxes these assumptions, in particular to a determination as to whether or not the general approach extends to Cox regression with time-dependent covariates and repeated failure events under standard assumptions \citep{3924}.
722:
In his Appendix, \citet{4894} remarks
724: \begin{quote}
725: ``To our knowledge, there does not exist a general theory on the conditions required for the tightness and weak convergence of Horvitz-Thompson processes.
726: However, the results of \citet[\S\S 2.9, 3.6, 3.7]{4920} can be applied to possibly stratified simple random sampling and can potentially be extended to other survey designs."
727: \end{quote}
728: One purpose of this paper has been to carry out in detail the program
729: mentioned for stratified random sampling.
730: We conjecture that our fundamental equation (\ref{eq:general}) applies
731: to Horvitz-Thompson estimators for other complex sampling designs, and work is in progress to explore these extensions.
732:
733: \section*{Acknowledgements}
734: The second author owes thanks to Galen Shorack for a helpful discussion concerning
735: the representation in Appendix A.
736: Supported in part by grants 5-R01-CA40644 and 2-R01-AI291968 from the
737: U.S. National Institutes of Health and by grant DMS-0503822
738: from the U.S. National Science Foundation.
739:
740: %\bibliographystyle{plain}
741: \bibliographystyle{ims}
742: \bibliography{sjs}
743:
744: \section{Appendices}
745: In Appendices A and B we establish two results slightly more
746: general than needed for the development in Section 4.
747: (See the end of Appendix B for the special case required.)
748: The notation in these two appendices should be understood
to be independent of that in the body of the paper.
750:
751: \par\noindent
752: {\bf Appendix A. \ A Representation of Stratified Sampling.}
753:
754: Suppose that $( \Omega, {\cal A},P)$ is a probability space
755: and $W : (\Omega , {\cal A} ) \rightarrow ({\cal W} , {\cal B} )$.
756: Write $P^W$ for the measure induced by $W$ on
757: $ ({\cal W} , {\cal B} )$; in the notation of section 2, $P^W = \tilde{P}_0$.
758: Suppose that ${\cal W}_1 , \ldots , {\cal W}_J$ is a (measurable) partition
759: of ${\cal W}$: \\
760: (a) \ \ ${\cal W}_j \in {\cal B}$, $j = 1, \ldots , J$;\\
(b) \ \ ${\cal W}_j \cap {\cal W}_{j'} = \emptyset$ for $j \not= j'$; and \\
762: (c) \ \ $\cup_{j=1}^J {\cal W}_j = {\cal W}$.\\
763: We will assume that $P(W \in {\cal W}_j ) \equiv p_j > 0$
764: for $j=1, \ldots , J$.
765:
Now consider a new probability space $(\Omega^{\dagger}, {\cal A}^{\dagger}, P^{\dagger})$
767: where
768: \begin{eqnarray*}
769: &&\Omega^{\dagger}
770: = \Omega_0^{\dagger} \times \Omega_1^{\dagger} \times \cdots \times \Omega_J^{\dagger} ,\\
771: && {\cal A}^{\dagger}
772: = {\cal A}_0^{\dagger} \times {\cal A}_1^{\dagger} \times \cdots \times {\cal A}_J^{\dagger} , \\
773: && P^{\dagger} = P_0^{\dagger} \cdot P_1^{\dagger} \cdots P_J^{\dagger} ,
774: \end{eqnarray*}
775: and random variables $\Delta = (\Delta_1, \ldots , \Delta_J)$,
776: $W_1^{\dagger}, \ldots , W_J^{\dagger}$
777: defined thereon as follows:
778: for $\omega^{\dagger} = (\omega_0^{\dagger} , \omega_1^{\dagger}, \ldots ,
779: \omega_J^{\dagger} ) \in \Omega^{\dagger}$,
780: \begin{eqnarray*}
781: && \Delta (\omega^{\dagger})
782: = \Delta (\omega_0^{\dagger}) \sim \mbox{Multinomial}_J (1, (p_1 , \ldots , p_J ) ) \\
783: && W_j^{\dagger} ( \omega^{\dagger} ) = W_j^{\dagger} (\omega_j^{\dagger} )
784: \sim P_j^{\dagger}
785: \end{eqnarray*}
786: for $j =1 , \ldots , J$ where $p_j = P( W \in {\cal W}_j )$, $j =1 , \ldots , J$, and
787: $P_j^{\dagger}$ is defined by
788: \begin{eqnarray}
P_j ^{\dagger} (W_j^{\dagger} \in B) = \frac{P( W \in B\cap {\cal W}_j)}{P(W \in {\cal W}_j )}
790: = \frac{P^W (B \cap {\cal W}_j )}{P^W ( {\cal W}_j )},
791: \qquad B \in {\cal B} .
792: \label{DefnOfPSubJDagger}
793: \end{eqnarray}
794: Now define a random variable
$W^{\dagger} : (\Omega^{\dagger}, {\cal A}^{\dagger}) \rightarrow ( {\cal W} , {\cal B} )$
796: by
797: \begin{eqnarray*}
798: W^{\dagger} (\omega^{\dagger} )
799: = \Delta_1 (\omega_0^{\dagger} ) W_1^{\dagger}( \omega_1^{\dagger} )
+ \cdots + \Delta_J ( \omega_0^{\dagger} ) W_J^{\dagger} ( \omega_J^{\dagger} ) .
801: \end{eqnarray*}
802: Note that $\Delta$, $W_1^{\dagger} , \ldots , W_J^{\dagger}$ are independent by
803: construction.
804: \bigskip
805:
806: \par\noindent
807: {\bf Proposition A.1} \ \ $W^{\dagger} \stackrel{d}{=} W$ on $({\cal W}, {\cal B} )$.
808: That is, $P^{W^{\dagger}} = P^W $ as measures on $({\cal W}, {\cal B})$.
809: \medskip
810:
811: \par\noindent
812: {\bf Proof.}
813: First note that
814: \begin{eqnarray}
815: P^{\dagger} (W^{\dagger}\in {\cal W}_j )
816: & = & P^{\dagger} (W_j^{\dagger} \in {\cal W}_j , \Delta_j = 1) \nonumber \\
817: & = & P^{\dagger} (W_j^{\dagger} \in {\cal W}_j ) P^{\dagger} (\Delta_j = 1)
818: %\nonumber \\
819: = 1 \cdot p_j = p_j
820: \label{ComputationOfProbDaggerOfFallingInjthElementOfPartition}
821: \end{eqnarray}
822: using independence of $\Delta$ and $W_j^{\dagger}$,
823: the fact that $W_j^{\dagger}$ takes values in ${\cal W}_j$ with $P^{\dagger}$-probability $1$,
824: and $P^{\dagger} (\Delta_j = 1)= p_j $ by the definition of $P^{\dagger}$.
825:
826: Now let $B \in {\cal B}$. Then since $p_j > 0$ for $j =1, \ldots , J$,
827: \begin{eqnarray*}
828: P^{\dagger} (W^{\dagger} \in B)
829: & = & \sum_{j=1}^J P^{\dagger} ( W^{\dagger} \in B \cap {\cal W}_j )
830: = \sum_{j=1}^J \frac{P^{\dagger} ( W^{\dagger} \in B \cap {\cal W}_j ) }{ P^{\dagger} (W^{\dagger} \in {\cal W}_j )}
831: P^{\dagger} (W^{\dagger} \in {\cal W}_j )\\
832: & = & \sum_{j=1}^J \frac{P^{\dagger} ( W_j^{\dagger} \in B ) }{P^{\dagger} (W^{\dagger}_j \in {\cal W}_j )} p_j
833: \qquad \mbox{by} \ (\ref{ComputationOfProbDaggerOfFallingInjthElementOfPartition}) \\
834: & = & \sum_{j=1}^J \frac{P^W (B \cap {\cal W}_j) / P^W ( {\cal W}_j )}{ 1} \cdot p_j
835: \qquad \mbox{by} \ (\ref{DefnOfPSubJDagger}) \\
836: & = & \sum_{j=1}^J P^W ( B \cap {\cal W}_j ) = P^W (B) = P( W \in B) .
837: %\qquad \qquad \qquad \qquad\qquad \Box
838: \end{eqnarray*}
839: \hfill $\Box$
840: \medskip
841:
842: If $W_1, \ldots , W_N$ are i.i.d. $P^W$, then we can represent the $W_i$'s
843: in terms of $( \Delta_i , W_{1,i}^{\dagger} , \ldots, W_{J,i}^{\dagger} )$, $i=1, \ldots , N$, i.i.d. as
844: $( \Delta , W_1^{\dagger} , \ldots , W_J^{\dagger})$ as described in proposition A.1. It follows that
845: \begin{eqnarray}
846: \PP_{j, N_j}
847: & = & \frac{1}{N_j} \sum_{i=1}^N \delta_{W_i} 1_{{\cal W}_j} (W_i) \nonumber \\
848: & = & \frac{1}{N_j} \sum_{j'=1}^J \sum_{i=1}^N \Delta_{j',i} \delta_{W_{j',i}^{\dagger}}
1_{{\cal W}_j} (W_{j',i}^{\dagger}) \nonumber \\
850: & = & \frac{1}{N_j} \sum_{i=1}^{N_j} \delta_{W_{j,i} }
851: \label{eq:StratumSpecificEmpiricalMeasureAppendixB}
852: \end{eqnarray}
853: by relabelling the $W_{j,i}^{\dagger}$'s and where
854: $N_j = \sum_{i=1}^N \Delta_{j,i}$ on the right side is independent of the $W_{j,i}^{\dagger}$'s.
855: This yields the promised doubly indexed form of the stratum - specific
856: empirical measure in terms of independent $W_{j,i}$'s distributed according to $P_{0|j}$
857: where, for $B \in {\cal B}$,
858: $$
859: P_{0|j} (B) = \frac{P_0 (B 1_{{\cal W}_j} )}{P_0 ( 1_{{\cal W}_j} )} .
860: $$
861: \bigskip
862:
863: \par\noindent
864: {\bf Appendix B. Proof of weak convergence of the stratum-specific empirical process}
865:
866: Let $\PP_{j,N_j} $ be as defined in (\ref{eq:StratumSpecificEmpiricalMeasureAppendixB})
867: %eq:stratumwiseEmpirical}),
868: %where the double subscripting:
869: $$
\PP_{j,N_j} = \frac{1}{N_j} \sum_{i=1}^N \delta_{W_i} 1_{{\cal W}_j} (W_i)
871: $$
872: where
873: $$
874: N^{-1} N_j = \PP_N ( 1_{{\cal W}_j} ) \rightarrow_{a.s.} P_0 ( {\cal W}_j ) \equiv \nu_j > 0 .
875: $$
876: \medskip
877:
878: \par\noindent
879: {\bf Proposition B.1.}
880: If ${\cal F}$ is $P_{0}-$Donsker and $\nu_j > 0$, then
881: ${\cal F}$ is $P_{0|j} - $Donsker on stratum ${\cal W}_j$ in the sense that
882: \begin{equation}
\GG_{j,N_j} \equiv \sqrt{N_j} ( \PP_{j, N_j} - P_{0|j} )
884: \rightsquigarrow \GG_j \qquad \mbox{in} \ \ \ell^{\infty} ({\cal F})
885: \label{EmpiricalProcessForStratumj}
886: \end{equation}
887: where $\GG_j$, defined by
888: \begin{equation}
889: \GG_j (f) = \nu_j^{-1/2} \GG_{P_0} ((f - P_{0|j} (f)) 1_{{\cal W}_j} ) ,
\qquad f \in {\cal F} ,
891: \end{equation}
892: is a $P_{0|j}$-Brownian bridge process.
893: \medskip
894:
895: \par\noindent
896: {\bf Remark 1.}
897: Note that
898: \begin{eqnarray*}
899: Var( \GG_{j} (f) )
900: & = & \nu_j^{-1} P_0 \left [ ( f - P_{0|j} (f))^2 1_{{\cal W}_j} \right ]
901: = Var_j (f) \equiv Var(f(W)| W \in {\cal W}_j ).
902: \end{eqnarray*}
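The computation behind Remark 1 is short; writing $g=(f-P_{0|j}(f))1_{{\cal W}_j}$ as shorthand for this check only, note that
\[
P_0 g = P_0(f1_{{\cal W}_j}) - P_{0|j}(f)P_0(1_{{\cal W}_j}) = 0,
\]
so that, for the $P_0$-Brownian bridge $\GG_{P_0}$,
\[
Var( \GG_{P_0}(g) ) = P_0 g^2 - (P_0 g)^2 = P_0 \left [ ( f - P_{0|j} (f))^2 1_{{\cal W}_j} \right ] ,
\]
and division by $\nu_j = P_0({\cal W}_j)$ yields $P_{0|j}(f-P_{0|j}(f))^2 = Var_j(f)$.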
903: %\medskip
904:
905: \par\noindent
906: {\bf Remark 2.} The proposition implies that the process
907: $\sqrt{N_j} ( \PP_{j, N_j} - P_{0|j} ) $ behaves asymptotically the same as that
908: of a sample of fixed size drawn from the conditional distribution $P_{0|j}$.
909: \medskip
910:
911: \par\noindent
912: {\bf Proof of the proposition}. First proof.
913: By the discussion at the beginning of section 2.10.4, page 200, van der Vaart
914: and Wellner (1996), ${\cal F}_j \equiv \{ f 1_{{\cal W}_j} : \ f \in {\cal F} \}$
915: is $P_0-$Donsker, and hence the collection
916: $\tilde{{\cal F}}_j \equiv \{ f 1_{{\cal W}_j} : \ f \in {\cal F} \cup \{ 1\} \}$
917: is also $P_0-$Donsker. Now we
918: write
919: \begin{eqnarray*}
920: \sqrt{N_j} ( \PP_{j,N_j} f - P_{0|j} f )
921: & = & \sqrt{N_j} \left ( \frac{\frac{1}{N} \sum_{i=1}^N f(W_i ) 1_{{\cal W}_j} (W_i)}
922: {\frac{1}{N} \sum_{i=1}^N 1_{{\cal W}_j} (W_i)}
- \frac{P_0 ( f 1_{{\cal W}_j} )}{P_0 ( 1_{{\cal W}_j}) }
924: \right ) \\
925: & = & \sqrt{ \frac{N_j}{N}} \left \{ \frac{ \GG_N ( f 1_{{\cal W}_j} ) }{N_j/N}
926: - \frac{ \GG_N ( 1_{{\cal W}_j} ) P_0 ( f 1_{{\cal W}_j} )}
927: { (N_j /N) P_0 ( {\cal W}_j ) } \right \} \\
928: & = & \frac{1}{\sqrt{N_j/N}} \left \{ \GG_N ( f 1_{{\cal W}_j} )
929: - \GG_N ( 1_{{\cal W}_j} ) P_{0|j} ( f ) \right \} \\
930: & = & \frac{1}{\sqrt{N_j/N}} \GG_N ( ( f - P_{0|j} (f)) 1_{{\cal W}_j} ) \\
931: & \Rightarrow & \frac{1}{\sqrt{\nu_j}} \GG_{P_0} ( ( f - P_{0|j} (f)) 1_{{\cal W}_j} )
932: \equiv \GG_{P_{0|j}} (f) \, ,
933: %\qquad \Box
934: \end{eqnarray*}
935: and, in fact,
936: \begin{eqnarray*}
937: \left \{ \frac{1}{\sqrt{\nu_j}} \GG_{P_0} ((f - P_{0|j} (f))1_{{\cal W}_j} ) : \ f \in {\cal F} \right \}
938: \stackrel{d}{=} \{ \GG_{P_{0|j} } (f) : \ f \in {\cal F} \} .
939: \end{eqnarray*}
940:
941: Second proof. By the second representation of the stratum-specific empirical measure
942: $\PP_{j,N_j}$ as $\PP_{j,N_j} = N_j^{-1} \sum_{i=1}^{N_j} \delta_{W_{j,i}}$ where the
943: $W_{j,i}$'s are i.i.d. $P_{0|j}$, it follows that the empirical
944: process
945: $\GG_{j,N_j} = \sqrt{N_j} ( \PP_{j,N_j} - P_{0|j}) $
946: is just the empirical process of i.i.d. $W_{j,i}$'s, but with a random sample size
947: $N_j$ independent of the $W_{j,i}$'s. Since $N_j / N \rightarrow \nu_j > 0$, it follows from
948: theorem 3.5.1, page 339, van der Vaart and Wellner (1996), that
949: $\GG_{j,N_j} \rightsquigarrow \GG_j$ in $\ell^{\infty} ({\cal F})$ where $\GG_j$ is a
950: $P_{0|j}-$Brownian bridge process as before.
951: \hfill$\Box$
952:
953: In the application of the results of Appendices A and B in section 4 we take
954: ${\cal W}_1 , \ldots , {\cal W}_J$ to be the measurable partition of ${\cal W}$ induced by
955: the partition ${\cal V}_1 , \ldots , {\cal V}_J$ of ${\cal V}$ (i.e. ${\cal W}_j = V^{-1} ({\cal V}_j)$
956: for $j=1, \ldots , J$ where $V(W) \equiv ( \tilde{X} (X), U)$). Moreover, the Donsker class
957: ${\cal F}$ in Proposition B.1 is taken to be a Donsker class of functions of $X$ only
958: rather than functions of $W = (X,U)$. This is exactly what is needed for the development
959: in section 4.
960: \medskip
961:
962: {\bf Appendix C. Proof of equation (\ref{eq:TaylorExpansion}).}
963: Besides the consistency and asymptotic linearity (\ref{eq:InfFuncAlpha})
964: for $\hat\alpha$ assumed in \S 6, we further assume that $0 < \sigma \leq \pi_{\alpha}(v)$ as in (\ref{eq:boundedweights}) and that
965: \begin{eqnarray}
966: \Big | \frac{1}{\pi_{\alpha} (v)} - \frac{1}{\pi_{\alpha_0} (v) }
967: - \frac{-\dot{\pi}_0^{T}(v)}{\pi_0^2 (v)} (\alpha - \alpha_0) \Big |
968: \le \psi (v) | \alpha - \alpha_0 |^{1+ \zeta }
969: \label{eq:DerivativeConditionPlus}
970: \end{eqnarray}
971: for $\alpha$ in a neighborhood of $\alpha_0$ where $\zeta> 0$
972: and $\psi$ satisfies $E \psi^2 (V) < \infty$.
973: The second assumption will typically follow from the first
974: provided that $\pi_{\alpha}$ has a continuous second derivative.
975: For example, suppose that $\pi_{\alpha}$ is given by a logistic regression model with linear predictor
976: $\tilde v^{T}\alpha$ where $\tilde v=\tilde v(v) \in \RR^q$.
977: Then Taylor's formula with remainder shows that the LHS of (\ref{eq:DerivativeConditionPlus}) equals
978: $\left | \frac{1}{2}e^{-\tilde v^{T} \alpha^*}(\alpha-\alpha_0)^{T}\tilde v \tilde v^{T}(\alpha-\alpha_0) \right |$
979: with $\alpha^*$ on the line segment between $\alpha$ and $\alpha_0$.
980: Thus the condition holds with $\zeta=1$ provided $e^{\tilde v^{T} \alpha}=\pi_{\alpha}(v)/[1-\pi_{\alpha}(v)]$
981: is bounded away from 0 and $\tilde V$ has finite fourth moment.
982: It follows that
983: \begin{eqnarray}
984: \left ( \PP_N^{\hat\pi}- \PP_N^{\pi_0} \right ) \tilde\ell_0
985: & = &
\frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c} (V_i) \left ( \frac{\xi_i}{\hat\pi_i} - \frac{\xi_i}{\pi_0(V_i)} \right ) \tilde\ell_0(X_i)
987: \nonumber \\
988: & = & \frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c}(V_i)
989: \xi_i \tilde\ell_0(X_i)
990: \left [\frac{1}{\pi_{\hat\alpha}(V_i)} - \frac{1}{\pi_{\alpha_0}(V_i)}
991: - \frac{-\dot{\pi}_0^T(V_i)}{\pi_0^2 (V_i)} (\hat{\alpha} - \alpha_0)
992: \right ] \nonumber \\
993: && \qquad + \frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c}(V_i)
994: \xi_i \tilde\ell_0(X_i)
995: \left [ \frac{-\dot{\pi}_0^T(V_i)}{\pi_0^2 (V_i)}
996: \right ] ( \hat{\alpha} - \alpha_0) \nonumber \\
997: & \equiv & R_N - \frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c}(V_i)
998: \frac{\xi_i}{\pi_{0}(V_i)} \tilde\ell_0(X_i)
999: \left [ \frac{\dot{\pi}_0^T(V_i)}{\pi_0 (V_i)} \right ] (\hat{\alpha} - \alpha_0)
1000: \label{eq:RemainderPlusMainTerm}
1001: \end{eqnarray}
1002: where by (\ref{eq:boundedweights}), the similar assumption for
1003: $\pi_{\alpha}$ and (\ref{eq:DerivativeConditionPlus}),
1004: \begin{eqnarray*}
1005: |R_N |
1006: & \le & \Big | \frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c}(V_i)
1007: \xi_i \tilde\ell_0(X_i)
1008: \left [\frac{1}{\pi_{\hat\alpha}(V_i)} - \frac{1}{\pi_{\alpha_0}(V_i)}
1009: - \frac{-\dot{\pi}_0^T(V_i)}{\pi_0^2 (V_i)} (\hat{\alpha} - \alpha_0) \right ]
1010: \Big | \\
1011: & \le & \frac{1}{\sigma^2} \frac{1}{N} \sum_{i=1}^N \psi (V_i)| \tilde\ell_0 (X_i)| \cdot | \hat{\alpha} - \alpha_0|^{1+\zeta} \\
1012: & = & O_p (1) | \hat{\alpha} - \alpha_0| | \hat{\alpha} - \alpha_0|^{\zeta} \\
1013: & = & O_p (1) O_{p} (N^{-1/2} ) o_p (1).
1014: \end{eqnarray*}
Multiplying through (\ref{eq:RemainderPlusMainTerm}) by $\sqrt{N}$, we conclude that (\ref{eq:TaylorExpansion}) holds by virtue of $\sqrt{N} R_N = o_p (1)$ and the strong law of large numbers.
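
In more detail, the two ingredients of this last step may be sketched as follows; the final display restates what we take (\ref{eq:TaylorExpansion}) to assert and should be checked against it.
Assuming that $\tilde\ell_0$ is square integrable, the Cauchy-Schwarz inequality and $E\psi^2(V)<\infty$ show that $N^{-1}\sum_i\psi(V_i)|\tilde\ell_0(X_i)|$ converges almost surely to a finite limit, whence
\[
\sqrt{N}\,|R_N| \le O_p(1)\cdot\sqrt{N}\,|\hat\alpha-\alpha_0|\cdot|\hat\alpha-\alpha_0|^{\zeta}
= O_p(1)\,O_p(1)\,o_p(1) = o_p(1).
\]
For the main term, assuming $E(\xi|V,X)=\pi_0(V)$, the strong law gives
\[
\frac{1}{N}\sum_{i=1}^N {\bf 1}_{{\cal V}_0^c}(V_i)\frac{\xi_i}{\pi_0(V_i)}\tilde\ell_0(X_i)\frac{\dot\pi_0^{T}(V_i)}{\pi_0(V_i)}
\rightarrow_{a.s.} \tilde P_0 \b1_{{\cal V}_0^c}\frac{\tilde\ell_0\dot\pi_0^{T}}{\pi_0},
\]
so that, since $\sqrt{N}(\hat\alpha-\alpha_0)=O_p(1)$,
\[
\sqrt{N}\left ( \PP_N^{\hat\pi}- \PP_N^{\pi_0} \right ) \tilde\ell_0
= - \left ( \tilde P_0 \b1_{{\cal V}_0^c}\frac{\tilde\ell_0\dot\pi_0^{T}}{\pi_0} \right )\sqrt{N}(\hat\alpha-\alpha_0) + o_p(1).
\]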
1016:
1017:
1018: \end{document}
1019: