0610:cs0610155/crp.tex

1: \documentclass[twoside,11pt]{article}

2: %\usepackage{nips2005e,times}

3: \usepackage{jmlr2e}

4:

5: \usepackage{times,url}

6: \usepackage{amsmath}

7: \usepackage{amsfonts}

8: \usepackage{graphicx}

9: \usepackage{subfigure}

10: %\newtheorem{theorem}{Theorem}

11: %\newtheorem{lemma}{Lemma}

12: %\newtheorem{corollary}{Corollary}

13: %\newtheorem{proposition}{Proposition}

14: \newcommand{\dataset}{{\cal D}}

15: \newcommand{\fracpartial}[2]{\frac{\partial #1}{\partial  #2}}

16:

17: \ShortHeadings{Cauchy Random Projections}{Li, Hastie, and Church}

18: \firstpageno{1}

19:

20: \begin{document}

21: \title{Nonlinear Estimators and Tail Bounds for Dimension Reduction in

22:   $l_1$ Using Cauchy Random Projections}

23:

24:

25: \author{\name Ping Li \email pingli@stat.stanford.edu \\

26:        \addr Department of Statistics\\

27:        Stanford University\\

28:        Stanford, CA 94305, USA

29:        \AND

30:        \name Trevor J.\ Hastie \email hastie@stanford.edu \\

31:        \addr Department of Statistics\\

32:        Stanford University\\

33:        Stanford, CA 94305, USA

34:        \AND

35:        \name Kenneth W.\ Church \email church@microsoft.com \\

36:        \addr Microsoft Research\\

37:        Microsoft Corporation\\

38:        Redmond, WA 98052, USA

39: }

40: \editor{}

41:

42: \maketitle

43: \vspace{-0.5in}

44: \begin{abstract}

45:

46: For \footnote{Revised \today. The original version, titled {\em

47:     Practical Procedures for Dimension Reduction in $l_1$}, is

48:   available as a technical report in the Stanford Statistics archive

49:   (report No. 2006-04, June, 2006). }

50:  dimension reduction in $l_1$, the method of {\em Cauchy random projections} multiplies

51: the original data matrix $\mathbf{A}

52: \in\mathbb{R}^{n\times D}$ with a random matrix $\mathbf{R} \in

53: \mathbb{R}^{D\times k}$ ($k\ll\min(n,D)$) whose entries are i.i.d. samples

54: of the standard Cauchy $C(0,1)$. Because of the impossibility results, one cannot

55: hope to recover the pairwise $l_1$ distances in $\mathbf{A}$

56: from $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times k}$,

57: using linear estimators without incurring large

58: errors. However, nonlinear estimators are still useful for certain

59: applications in data stream computation, information

60: retrieval, learning, and data mining.

61:

62: We propose three types of nonlinear estimators: the bias-corrected

63: sample median estimator, the bias-corrected geometric mean estimator,

64: and the bias-corrected maximum likelihood estimator. The sample median

65: estimator and the geometric mean estimator are asymptotically (as

66: $k\rightarrow \infty$) equivalent but the latter is more accurate at

67: small $k$.  We derive explicit tail bounds

68: for the geometric mean estimator and establish an analog of the

69: Johnson-Lindenstrauss  (JL) lemma for dimension reduction in $l_1$,

70: which is weaker than the classical JL lemma for dimension reduction in

71: $l_2$.

72:

73: Asymptotically, both the sample median estimator and the

74: geometric mean estimators are about $80\%$ efficient compared to the

75: maximum likelihood estimator (MLE). We analyze the moments of the MLE

76: and propose approximating the distribution of the MLE by an

77: inverse Gaussian.

78:

79: \end{abstract}

80:

81: \textbf{Keywords:} Dimension reduction, $l_1$ norm, Cauchy random

82: projections, JL bound

83:

84:

85: \section{Introduction}

86:

87: This paper focuses on dimension reduction in $l_1$,  in particular, on

88: the

89: method based on {\em Cauchy random projections}

90: \citep{Proc:Indyk_FOCS00}, which is a special case of {\em linear

91:   random projections}.

92:

93: The idea of {\em linear random projections} is to multiply the original data

94: matrix $\mathbf{A} \in \mathbb{R}^{n\times D}$ with a random projection matrix

95: $\mathbf{R} \in \mathbb{R}^{D\times k}$, resulting in a projected

96: matrix $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times

97:   k}$. If $k \ll \min(n,D)$, then it should be much more efficient to

98: compute certain summary statistics (e.g., pairwise distances) from

99: $\mathbf{B}$ as opposed to $\mathbf{A}$. Moreover, $\mathbf{B}$ may be small

100: enough to reside in physical memory while $\mathbf{A}$ is often

101: too large to fit in the main memory.

102:

103: The choice of the random projection matrix $\mathbf{R}$ depends on which norm we

104: would like to work with.

105: \cite{Proc:Indyk_FOCS00} proposed constructing $\mathbf{R}$ from

106: i.i.d. samples of $p$-stable distributions, for dimension reduction in

107: $l_p$ ($0< p\leq 2$). In the stable distribution family \citep{Book:Zolotarev_86}, normal is

108: 2-stable and Cauchy is 1-stable. Thus, we will call random projections

109: for $l_2$ and $l_1$,  {\em normal random projections} and {\em Cauchy

110:   random projections}, respectively.

111:

112: In {\em normal random projections} \citep{Book:Vempala}, we can estimate the original

113: pairwise $l_2$ distances of $\mathbf{A}$ directly using the

114: corresponding $l_2$ distances of $\mathbf{B}$ (up to a normalizing

115: constant). Furthermore, the Johnson-Lindenstrauss  (JL)

116: lemma \citep{Article:JL84} provides the performance guarantee.

117:  We will review {\em normal random projections} in more detail in

118:  Section \ref{sec_intr_rp}.

119:

120: For {\em Cauchy random projections}, we should not use the $l_1$ distance

121: in $\mathbf{B}$ to approximate the original $l_1$ distance in

122: $\mathbf{A}$, as the Cauchy distribution does not even have  a finite first

123: moment. The impossibility results

124: \citep{Proc:Brinkman_FOCS03,Article:Lee_Naor_04,Article:Brinkman_JACM05}

125: have proved that one cannot hope to recover the $l_1$ distance using

126: linear projections and linear estimators (e.g., sample mean), without

127: incurring large errors.  Fortunately, the impossibility results do not

128:  rule out nonlinear estimators, which may still be useful in

129: certain applications in data stream computation, information

130: retrieval, learning, and data mining.

131:

132: \cite{Proc:Indyk_FOCS00} proposed using the sample median (instead of

133: the sample mean) in {\em Cauchy random projections} and described its

134: application in data stream computation. In this study, we provide

135: three types of nonlinear estimators:  the bias-corrected

136: sample median estimator, the bias-corrected geometric mean estimator,

137: and the bias-corrected maximum likelihood estimator. The sample median

138: estimator and the geometric mean estimator are asymptotically

139: equivalent (i.e., both are about $80\%$ as efficient as the maximum

140: likelihood estimator), but the latter is more accurate at small sample size $k$.

141: Furthermore, we

142: derive explicit tail bounds for the bias-corrected geometric mean estimator and

143: establish an analog of the  JL Lemma for dimension reduction in

144: $l_1$.

145:

146: This analog of the JL Lemma for $l_1$ is weaker than the classical

147: JL Lemma for $l_2$, as the geometric mean estimator is a non-convex

148: norm and hence is not a metric.  Many efficient algorithms, such as

149: some sub-linear time (using super-linear memory) nearest neighbor algorithms \citep{Book:NN_05}, rely

150: on the metric properties (e.g., the triangle inequality). Nevertheless, nonlinear estimators may

151: still be useful in important scenarios.

152: \begin{itemize}

153: \item {\em Estimating $l_1$ distances online} \\

154: The original data matrix $\mathbf{A} \in \mathbb{R}^{n\times D}$

155: requires $O(nD)$ storage space; and hence it is

156: often too large for physical memory. The storage cost of all

157: pairwise distances is $O(n^2)$, which may also be too large for the

158: memory. For example, in information retrieval, $n$ could be

159: the total number of word types or documents at Web scale. To avoid page faults,

160: it may be more efficient to estimate the distances on the fly from

161: the  projected data matrix $\mathbf{B}$ in the memory.

162: \item {\em Computing all pairwise $l_1$ distances} \\

163: In distance-based clustering and  classification applications, we need

164: to compute all pairwise distances in $\mathbf{A}$, at the cost of

165: time $O(n^2D)$. Using {\em Cauchy random projections}, the cost can be reduced

166: to $O(nDk + n^2k)$. Because $k \ll \min(n,D)$, the savings

167: could be enormous.

168: \item {\em Linear scan nearest neighbor searching}\\

169: We can always search for the nearest neighbors by linear scans. When

170: working with the projected data matrix $\mathbf{B}$ (which is in the  memory), the cost of

171: searching for the nearest neighbor for one data point is time $O(nk)$,

172: which may still be significantly faster than the sub-linear algorithms

173: working with the original data matrix $\mathbf{A}$ (which is often on the

174: disk).

175: \end{itemize}

176:

177: We briefly comment on {\em coordinate

178:   sampling}, another strategy for dimension reduction.  Given a data matrix $\mathbf{A}

179: \in \mathbb{R}^{n\times D}$, one can randomly sample $k$ columns from $\mathbf{A}$ and

180: estimate the summary statistics (including $l_1$ and $l_2$

181: distances). Despite its simplicity, there are two

182: major disadvantages in

183: coordinate sampling. First, there is no performance guarantee. For

184: heavy-tailed data, we may have to choose $k$ very large in order to

185: achieve sufficient accuracy. Second, large datasets are often highly sparse,

186: for example,  text data \citep{Article:Dhillon_ML01} and market-basket

187: data \citep{Proc:Aggarwal_Wolf_Sigmod99,Proc:Strehl_HiPC00}.  \cite{Report:Li_Church_Sketch} and  \cite{Report:Li_Church_Hastie_crs}

188: provide an alternative coordinate sampling strategy, called

189: {\em Conditional Random Sampling (CRS)}, suitable for sparse

190: data. For non-sparse data, however, methods based on {\em linear

191:   random projections} are superior.

192:

193: The rest of the paper is organized as follows. Section \ref{sec_intr_rp}

194: reviews {\em linear random projections}. Section \ref{sec_results}

195: summarizes the main results for three types of nonlinear

196: estimators. Section \ref{sec_median} presents the sample median

197: estimators. Section \ref{sec_gm} concerns the geometric mean

198: estimators. Section \ref{sec_mle} is devoted to the maximum likelihood

199: estimators. Section \ref{sec_conclusion}

200: concludes the paper.

201:

202: \section{Introduction to Linear Random Projections}\label{sec_intr_rp}

203:

204: We review {\em linear random projections},

205: including {\em normal} and {\em Cauchy random projections}.

206:

207:

208: Denote the original data matrix by $\mathbf{A} \in

209: \mathbb{R}^{n\times D}$, i.e., $n$ data points in $D$ dimensions. Let

210: $u_i^\text{T}$, with $u_i \in \mathbb{R}^D$, be the $i$th row of $\mathbf{A}$ ($i = 1, 2, ..., n$). Let

211: $\mathbf{R}\in \mathbb{R}^{D\times k}$ be a random matrix whose

212: entries are i.i.d. samples of some random variable. The projected

213: data matrix $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times

214:   k}$. Denote the  entries of $\mathbf{R}$ by $\{r_{ij}\}_{i=1}^D\

215: _{j=1}^k$ and let $v_i^\text{T}$, with $v_i \in \mathbb{R}^k$, be the

216: $i$th row of $\mathbf{B}$. Then $v_i = \mathbf{R}^\text{T}u_i$, with entries $v_{i,j} = \mathbf{R}^\text{T}_ju_i$,

217: i.i.d. for $j = 1$ to $k$, where $\mathbf{R}_j$ is the $j$th column of

218: $\mathbf{R}$.

219:

220:

221: For simplicity, we focus on the leading two rows, $u_1$ and $u_2$, in

222: $\mathbf{A}$, and the leading  two rows,

223: $v_1$ and $v_2$, in $\mathbf{B}$. Define $\{x_j\}_{j=1}^k$ to be

224: \begin{align}

225: x_j = v_{1,j} - v_{2,j} = \sum_{i=1}^D r_{ij} \left(u_{1,i}-u_{2,i}\right),

226: \hspace{0.5in} j = 1, 2, ..., k

227: \end{align}

228:

229: If we sample $r_{ij}$ i.i.d. from a {\em stable distribution}

230: \citep{Book:Zolotarev_86,Proc:Indyk_FOCS00}, then $x_j$'s are also

231: i.i.d. samples of the same stable distribution with a different scale

232: parameter. In the family of stable distributions, normal and

233: Cauchy are two important special cases.

234:

235: \subsection{Normal Random Projections}

236:

237: When $r_{ij}$ is sampled from the standard normal, i.e., $r_{ij}\sim

238: N(0,1)$, i.i.d.,  then

239: \begin{align}

240: x_j = v_{1,j} - v_{2,j} =\sum_{i=1}^D r_{ij} \left(u_{1,i}-u_{2,i}\right) \sim

241: N\left(0,\sum_{i=1}^D|u_{1,i}-u_{2,i}|^2\right), \ \ \  j = 1, 2, ...,

242: k,

243: \end{align}

244: \noindent because a weighted sum of normals is also normal.

245:

246: Denote the squared $l_2$ distance between $u_1$ and $u_2$ by $d_{l_2} =

247: \|u_1-u_2\|^2_2 = \sum_{i=1}^D|u_{1,i}-u_{2,i}|^2$. We can estimate

248: $d_{l_2}$ from the sample squared $l_2$ distance:

249: \begin{align}

250: \hat{d}_{l_2} = \frac{1}{k} \sum_{j=1}^k x_j^2.

251: \end{align}

252: It is easy to show that \citep[e.g.,][]{Book:Vempala,Proc:Li_Hastie_Church_COLT06}

253: \begin{align}

254: &\text{E}\left(\hat{d}_{l_2}\right) = d_{l_2}, \hspace{0.45in}

255: \text{Var}\left(\hat{d}_{l_2}\right) = \frac{2}{k}d^2_{l_2},\\

256: &\mathbf{Pr}\left(\left|\hat{d}_{l_2} -d_{l_2}\right|\geq \epsilon d_{l_2}\right)  \leq

257: 2\exp\left(-\frac{k}{4}\epsilon^2 + \frac{k}{6}\epsilon^3\right), \ \

258: \ \epsilon >0 \label{eqn_normal_tail}

259: \end{align}
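
As an illustration of the estimator $\hat{d}_{l_2}$ above, the following minimal Python sketch (standard library only) draws $r_{ij}\sim N(0,1)$ and averages the squared projected differences. The function name \texttt{estimate\_sq\_l2}, the toy vectors, and the fixed seed are assumptions made for this example and are not part of the method itself.

```python
import random

def estimate_sq_l2(u1, u2, k, seed=0):
    """Estimate the squared l2 distance between u1 and u2 from k normal projections."""
    rng = random.Random(seed)
    diff = [a - b for a, b in zip(u1, u2)]
    total = 0.0
    for _ in range(k):
        # x_j = sum_i r_ij * (u_{1,i} - u_{2,i}) with r_ij ~ N(0, 1), i.i.d.
        # (this equals v_{1,j} - v_{2,j} because both rows share column j of R)
        x_j = sum(rng.gauss(0.0, 1.0) * d for d in diff)
        total += x_j * x_j
    return total / k

# Toy data (hypothetical): the true squared l2 distance is 1 + 4 + 4 + 9 = 18.
u1 = [1.0, 2.0, 3.0, 4.0]
u2 = [0.0, 0.0, 1.0, 1.0]
true_sq = sum((a - b) ** 2 for a, b in zip(u1, u2))
est = estimate_sq_l2(u1, u2, k=2000)
print(true_sq, est)
```

With $k = 2000$ the relative standard deviation $\sqrt{2/k}$ is roughly $3\%$, so the printed estimate should lie close to the true value.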

260:

261: We would like to bound the error probability

262: $\mathbf{Pr}\left(\left|\hat{d}_{l_2} -d_{l_2}\right|\geq \epsilon

263:   d_{l_2}\right)$ by $\delta$. Since there

264: are in total $\frac{n(n-1)}{2} < \frac{n^2}{2}$ pairs among $n$

265: data points, we need to bound the tail probabilities simultaneously for

266: all pairs. By the Bonferroni union bound, it suffices if

267: \begin{align}

268: &\frac{n^2}{2}\mathbf{Pr}\left(\left|\hat{d}_{l_2} -d_{l_2}\right|\geq

269:   \epsilon d_{l_2}\right)  \leq \delta.

270: \end{align}

271:

272: Using (\ref{eqn_normal_tail}), it suffices if

273: \begin{align}

274: \frac{n^2}{2}

275: &2\exp\left(-\frac{k}{4}\epsilon^2 + \frac{k}{6}\epsilon^3\right) \leq

276: \delta \\

277: \Longrightarrow & k \geq \frac{2\log n - \log \delta }{\epsilon^2/4 -

278:   \epsilon^3/6}.

279: \end{align}

280:

281:

282: Therefore, we obtain one version of the JL lemma:

283:

284: {\em

285: If $k \geq \frac{2\log n - \log \delta }{\epsilon^2/4 -

286:   \epsilon^3/6}$, then with probability at least $1-\delta$, the

287: squared $l_2$

288: distance between any pair of data points (among $n$ data points) can

289: be approximated within $1\pm \epsilon$ fraction of the

290: truth, using the squared $l_2$ distance of the

291: projected data after normal random projections. }
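
The sample size required by this version of the JL lemma is easy to evaluate numerically. The sketch below is only a direct transcription of the bound $k \geq \frac{2\log n - \log \delta}{\epsilon^2/4 - \epsilon^3/6}$; the function name \texttt{jl\_sample\_size\_l2} and the example values of $n$, $\delta$, and $\epsilon$ are hypothetical.

```python
import math

def jl_sample_size_l2(n, delta, eps):
    """Smallest integer k with k >= (2 log n - log delta) / (eps^2/4 - eps^3/6)."""
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    if not (0.0 < eps < 1.5):
        # eps^2/4 - eps^3/6 is positive only for 0 < eps < 1.5
        raise ValueError("eps must lie in (0, 1.5)")
    rate = eps ** 2 / 4.0 - eps ** 3 / 6.0
    return math.ceil((2.0 * math.log(n) - math.log(delta)) / rate)

# Example (hypothetical values): one million points, 5% failure probability,
# 20% relative accuracy on every pairwise squared l2 distance.
k_needed = jl_sample_size_l2(n=10 ** 6, delta=0.05, eps=0.2)
print(k_needed)
```

Note that the bound depends on $n$ only through $\log n$, so increasing $n$ by a factor of ten adds a fixed number of projections.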

292:

293: Many versions of the JL lemma have been proved

294: \citep{Article:JL84,Article:Frankl_JL,Proc:Indyk_STOC98,Proc:Arriaga_FOCS99,Article:Dasgupta_JL,Proc:Indyk_FOCS00,Proc:Indyk_FOCS01,Article:Achlioptas_JCSS03,Article:Proc:Arriaga_Vempala_ML06,Proc:Ailon_STOC06}.

295:

296:

297: Note that we do not have to use $r_{ij} \sim N(0,1)$ for dimension

298: reduction in $l_2$. For example, we can sample $r_{ij}$ from

299: some {\em sub-Gaussian distributions} \citep{Article:Indyk_Naor}, in particular, the following

300: {\em sparse projection distribution}:

301: \begin{align}\label{eqn_subg_rji}

302: r_{ij} = \sqrt{s}\left\{\begin{array}{rl} 1 & \text{ with prob. }

303:     \frac{1}{2s}  \\ 0 & \text{ with prob. } 1-\frac{1}{s}\\ -1 & \text{ with prob. }

304:     \frac{1}{2s} \end{array} \right..

305: \end{align}
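
The following sketch shows one way to sample from the sparse projection distribution in (\ref{eqn_subg_rji}) and to check empirically that the entries have zero mean and unit variance. The helper name \texttt{sparse\_projection\_entry}, the choice $s=3$, and the number of draws are assumptions made for illustration.

```python
import math
import random

def sparse_projection_entry(s, rng):
    """Draw one r_ij = sqrt(s) * {+1 w.p. 1/(2s), 0 w.p. 1 - 1/s, -1 w.p. 1/(2s)}."""
    u = rng.random()
    if u < 1.0 / (2.0 * s):
        return math.sqrt(s)
    if u < 1.0 / s:
        return -math.sqrt(s)
    return 0.0

rng = random.Random(0)
s = 3.0  # assumed value; any s >= 1 gives a valid distribution
draws = [sparse_projection_entry(s, rng) for _ in range(200000)]
mean = sum(draws) / len(draws)
second_moment = sum(r * r for r in draws) / len(draws)
zero_fraction = draws.count(0.0) / len(draws)
print(mean, second_moment, zero_fraction)
```

A fraction $1 - 1/s$ of the entries is zero, which is where the computational savings come from.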

306:

307: When $ 1\leq s\leq3$, \cite{Article:Achlioptas_JCSS03} proved the JL

308: lemma for the above sparse

309: projection, which can also be shown by sub-Gaussian analysis

310: \citep{Report:Li_Hastie_Church_subrp}.

311: Recently,  \cite{Proc:Li_Hastie_Church_KDD06} proposed {\em very

312:   sparse random projections} using $s = \sqrt{D}$ in

313: (\ref{eqn_subg_rji}), based on two practical considerations:

314: \begin{itemize}

315: \item $D$ should be very large, otherwise

316: there would be no need for dimension reduction.

317: \item

318: The original $l_2$ distance should make

319: engineering sense, in that  the second (or higher) moments should be

320: bounded (otherwise various {\em term-weighting} schemes will be

321: applied).

322: \end{itemize}

323:

324: Based on these two practical

325: assumptions, the projected data are asymptotically normal at a fast

326: rate of convergence when $s = \sqrt{D}$.  Of course, {\em very sparse

327:   random projections} do not have worst case performance

328: guarantees.

329:

330: \subsection{Cauchy Random Projections}\label{sec_intro}

331:

332: In {\em Cauchy random projections}, we sample $r_{ij}$ i.i.d. from the

333: standard Cauchy distribution, i.e., $r_{ij} \sim C(0,1)$. By the 1-stability of Cauchy \citep{Book:Zolotarev_86}, we know that

334: \begin{align}

335: x_j = v_{1,j} - v_{2,j}  \sim C\left(0,\sum_{i=1}^D|u_{1,i} -

336:   u_{2,i}|\right).

337: \end{align}

338: \noindent That is, the projected differences $x_j = v_{1,j} - v_{2,j}$ are also

339: Cauchy random variables with the scale parameter being the $l_1$

340: distance, $d = |u_1 - u_2| = \sum_{i=1}^D|u_{1,i} -

341:   u_{2,i}|$, in the original space.

342:

343: Recall that a Cauchy random variable $z \sim C(0,\gamma)$ has the density

344: \begin{align}

345: f(z)  = \frac{\gamma}{\pi} \frac{1}{z^2 + \gamma^2}, \hspace{0.5in}

346: \gamma >0, \hspace{0.2in}  -\infty<z<\infty

347: \end{align}

348:

349: The easiest way to see the 1-stability is via the characteristic

350: function,

351: \begin{align}

352: &\text{E}\left(\exp(\sqrt{-1}z_1t)\right) =

353: \exp\left(-\gamma|t|\right),\\

354: &\text{E}\left(\exp\left(\sqrt{-1} t\sum_{i=1}^D c_i z_i\right)\right)

355: = \exp\left(-\gamma\sum_{i=1}^D|c_i||t|\right),

356: \end{align}

357: \noindent for $z_1$, $z_2$, ..., $z_D$, i.i.d. $C(0,\gamma)$, and

358: any constants $c_1$, $c_2$, ..., $c_D$.
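
The 1-stability can also be checked by simulation. In the sketch below, standard Cauchy variables are generated by the inverse-CDF transform $\tan\left(\pi(U - 1/2)\right)$ with $U$ uniform on $(0,1)$, and the sample median of $|x_j|$ is compared with the $l_1$ distance, since the median of $|z|$ for $z\sim C(0,\gamma)$ is $\gamma$. The helper names, the toy vectors, and the seed are assumptions for this example.

```python
import math
import random

def standard_cauchy(rng):
    """Inverse-CDF sampler for C(0, 1)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def projected_differences(u1, u2, k, seed=0):
    """Return x_j = sum_i r_ij (u_{1,i} - u_{2,i}), j = 1..k, with r_ij ~ C(0, 1)."""
    rng = random.Random(seed)
    diff = [a - b for a, b in zip(u1, u2)]
    return [sum(standard_cauchy(rng) * d for d in diff) for _ in range(k)]

# Toy data (hypothetical): the l1 distance is 1 + 2 + 2 + 3 = 8.
u1 = [1.0, 2.0, 3.0, 4.0]
u2 = [0.0, 0.0, 1.0, 1.0]
l1 = sum(abs(a - b) for a, b in zip(u1, u2))

x = projected_differences(u1, u2, k=4001)
abs_sorted = sorted(abs(v) for v in x)
median_abs = abs_sorted[len(abs_sorted) // 2]  # k is odd, so this is the median
print(l1, median_abs)
```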

359:

360:

361: Therefore, in {\em Cauchy random projections}, the problem boils down to

362: estimating the Cauchy scale parameter of $C(0,d)$ from $k$

363: i.i.d. samples $x_j \sim C(0,d)$.  Unfortunately, unlike in {\em normal

364:   random projections}, we can no longer estimate $d$ from the

365: sample mean (i.e., $\frac{1}{k}\sum_{j=1}^k|x_j|$) because

366: $\text{E}\left(|x_j|\right) = \infty$.

367:

368: Although the impossibility results

369: \citep{Article:Lee_Naor_04,Article:Brinkman_JACM05}

370: have ruled out estimators that are metrics, there is enough information

371: to recover $d$ from $k$

372: samples $\{x_j\}_{j=1}^k$, with a high accuracy.  For

373: example, \cite{Proc:Indyk_FOCS00} proposed using the sample median as

374: an estimator. The problem with the sample median estimator is the

375: inaccuracy at small $k$ and the difficulty in deriving explicit tail

376: bounds needed for determining the sample size $k$. \\

377:

378: This study focuses on deriving better estimators and explicit tail bounds for

379: {\em Cauchy random projections}. Our main results are summarized in

380: the next section, before we present the detailed derivations. Casual

381: readers may skip these derivations after Section

382: \ref{sec_results}.

383:

384: \section{Main Results}\label{sec_results}

385:

386:  We propose three types of nonlinear

387: estimators: the bias-corrected sample median estimator

388: ($\hat{d}_{me,c}$), the bias-corrected geometric mean estimator

389: ($\hat{d}_{gm,c}$), and  the bias-corrected maximum likelihood

390: estimator ($\hat{d}_{MLE,c}$). $\hat{d}_{me,c}$ and $\hat{d}_{gm,c}$

391: are asymptotically equivalent but the latter is more accurate at small

392: sample size $k$. In addition, we derive explicit tail bounds for

393: $\hat{d}_{gm,c}$, from which an analog of the Johnson-Lindenstrauss  (JL)

394: lemma for dimension reduction in $l_1$ follows. Asymptotically, both

395: $\hat{d}_{me,c}$ and $\hat{d}_{gm,c}$ are $\frac{8}{\pi^2} \approx

396: 80\%$ efficient compared to the maximum likelihood estimator

397: $\hat{d}_{MLE,c}$. We propose accurate approximations to the

398: distribution and tail bounds of $\hat{d}_{MLE,c}$, while the exact

399: closed-form answers are not attainable.

400:

401: \subsection{The Bias-corrected Sample Median Estimator}

402:

403: Denoted by $\hat{d}_{me,c}$, the bias-corrected sample median

404: estimator is

405: \begin{align}

406: \hat{d}_{me,c} = \frac{\hat{d}_{me}}{b_{me}},

407: \end{align}

408: \noindent where

409: \begin{align}

410: \hat{d}_{me} &= \text{median}(|x_j|, j = 1, 2,..., k)\\

411: b_{me}

412: &=

413: \int_0^1\frac{(2m+1)!}{(m!)^2}\tan\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m

414:   dt, \ \ \ k = 2m+1

415: \end{align}

416:

417: Here, for convenience, we only consider $k = 2m+1$, $m$ = 1, 2, 3,

418: ...

419:

420:

421: Some key properties of $\hat{d}_{me,c}$:

422:

423: \begin{itemize}

424: \item $\text{E}\left(\hat{d}_{me,c}\right) = d$, i.e., $\hat{d}_{me,c}$

425:   is unbiased.

426: \item When $k\geq 5$, the variance of $\hat{d}_{me,c}$ is

427: \begin{align}

428: \text{Var}\left(\hat{d}_{me,c}\right) =

429: d^2\left(\frac{(m!)^2}{(2m+1)!}\frac{\int_0^1

430:   \tan^2\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt}{\left(\int_0^1

431:   \tan\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt\right)^2} -

432: 1\right), \ \ \ \ k\geq5

433: \end{align}

434: $\text{Var}\left(\hat{d}_{me,c}\right) = \infty$ if $k = 3$.

435: \item As $k \rightarrow \infty$, $\hat{d}_{me,c}$ converges to a

436:   normal in distribution

437: \begin{align}

438: \sqrt{k}\left(\hat{d}_{me,c}  - d \right)\overset{D}{\Longrightarrow} N\left(0,\frac{\pi^2}{4}d^2\right).

439: \end{align}

440: \end{itemize}
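
The bias correction factor $b_{me}$ and the variance formula above involve one-dimensional integrals that can be evaluated numerically. The sketch below uses a plain midpoint rule, which is an assumption of this example (any quadrature routine would do); the function names and the grid size are likewise hypothetical. The midpoint rule never evaluates $\tan\left(\frac{\pi}{2}t\right)$ at $t=1$, and the integrands are bounded near $t=1$ whenever $m \geq r$.

```python
import math

def beta_weighted_integral(m, power, n_grid=20000):
    """Midpoint-rule value of C * int_0^1 tan^power(pi t / 2) (t - t^2)^m dt,
    where C = (2m+1)! / (m!)^2 and k = 2m + 1."""
    c = math.factorial(2 * m + 1) // (math.factorial(m) ** 2)
    h = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        t = (i + 0.5) * h
        total += math.tan(0.5 * math.pi * t) ** power * (t - t * t) ** m
    return c * total * h

def b_me(m):
    """Bias correction factor E(d_me) / d for k = 2m + 1 (requires m >= 1)."""
    return beta_weighted_integral(m, 1)

def var_factor_me_c(m):
    """Var(d_me_c) / d^2 for k = 2m + 1 (requires m >= 2, i.e., k >= 5)."""
    return beta_weighted_integral(m, 2) / b_me(m) ** 2 - 1.0

for m in (1, 2, 5, 10, 25, 50):
    k = 2 * m + 1
    line = "k=%d  b_me=%.4f" % (k, b_me(m))
    if m >= 2:
        line += "  k*Var/d^2=%.4f" % (k * var_factor_me_c(m))
    print(line)
```

For large $k$ the printed values of $k\,\text{Var}/d^2$ should approach the asymptotic constant $\pi^2/4 \approx 2.47$.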

441:

442: \subsection{The Bias-corrected Geometric Mean Estimator}

443: Denoted by $\hat{d}_{gm,c}$, the bias-corrected geometric mean

444: estimator is defined as

445: \begin{align}

446: \hat{d}_{gm,c} =

447: \cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k|x_j|^{1/k},

448: \hspace{0.1in} k>1

449: \end{align}

450:

451: Important properties of $\hat{d}_{gm,c}$ include:

452: \begin{itemize}

453: \item This estimator is a non-convex norm, i.e., the $l_p$ norm

454:   with $p\rightarrow 0$.

455: \item It is unbiased, i.e., $\text{E}\left(\hat{d}_{gm,c}\right)

456:   = d$.

457: \item Its variance is (for $k>2$)

458: \begin{align}

459: \text{Var}\left(\hat{d}_{gm,c}\right) &= d^2

460: \left(\frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi}{k}\right)}-1

461: \right)

462: = \frac{\pi^2}{4}\frac{d^2}{k} +

463: \frac{\pi^4}{32}\frac{d^2}{k^2}+O\left(\frac{1}{k^3}\right).

464: \end{align}

465: \item For $0\leq \epsilon \leq 1$, its tail bounds can be represented in exponential forms

466: \begin{align}

467: &\mathbf{Pr}\left(\hat{d}_{gm,c} - d > \epsilon d \right) \leq

468: \exp\left(-k\left(\frac{\epsilon^2}{8(1+\epsilon)}\right)\right)\\

469: &\mathbf{Pr}\left(\hat{d}_{gm,c} - d < -\epsilon d \right) \leq

470: \exp\left(-k\left(\frac{\epsilon^2}{8(1+\epsilon)}\right)\right), \ \

471: \ k \geq \frac{\pi^2}{1.5\epsilon}

472: \end{align}

473: \item These exponential tail bounds yield an analog of the

474: Johnson-Lindenstrauss  (JL) lemma for dimension reduction in $l_1$:

475:

476: {\em

477: If $k \geq \frac{8\left(2\log n -

478:   \log\delta\right)}{\epsilon^2/(1+\epsilon)}\geq \frac{\pi^2}{1.5\epsilon}$, then with probability at

479: least $1-\delta$, one can recover the original $l_1$ distance between

480: any pair of data points (among all $n$ data points) within

481: $1\pm\epsilon$ ($0\leq

482: \epsilon\leq 1$) fraction of the truth,

483: using $\hat{d}_{gm,c}$, i.e., $|\hat{d}_{gm,c}-d|\leq \epsilon d$. }

484: \end{itemize}
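
A minimal sketch of the bias-corrected geometric mean estimator, its exact variance factor, and the sample size implied by the JL analog is given below. The product is computed on the log scale to avoid overflow; the function names, the simulated scale $d=3$, and the seed are assumptions made for this example, and the code assumes no $x_j$ is exactly zero.

```python
import math
import random

def gm_estimate(x):
    """Bias-corrected geometric mean: cos^k(pi/(2k)) * prod_j |x_j|^(1/k)."""
    k = len(x)
    log_mean = sum(math.log(abs(v)) for v in x) / k  # assumes every x_j != 0
    return math.cos(math.pi / (2.0 * k)) ** k * math.exp(log_mean)

def gm_var_factor(k):
    """Var(d_gm_c) / d^2 = cos^{2k}(pi/(2k)) / cos^k(pi/k) - 1, for k > 2."""
    return math.cos(math.pi / (2.0 * k)) ** (2 * k) / math.cos(math.pi / k) ** k - 1.0

def jl_sample_size_l1(n, delta, eps):
    """Smallest k meeting both k >= 8 (2 log n - log delta) (1 + eps) / eps^2
    and the side condition k >= pi^2 / (1.5 eps), for 0 < eps <= 1."""
    k_tail = 8.0 * (2.0 * math.log(n) - math.log(delta)) * (1.0 + eps) / eps ** 2
    k_side = math.pi ** 2 / (1.5 * eps)
    return math.ceil(max(k_tail, k_side))

rng = random.Random(1)
d_true = 3.0  # hypothetical l1 distance
x = [d_true * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(2001)]
d_hat = gm_estimate(x)
k_needed = jl_sample_size_l1(n=10 ** 6, delta=0.05, eps=0.2)
print(d_hat, gm_var_factor(100), k_needed)
```

Comparing \texttt{gm\_var\_factor(k)} with $\frac{\pi^2}{4k}+\frac{\pi^4}{32k^2}$ gives a quick check of the two-term expansion of the variance.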

485:

486: \subsection{The Bias-corrected Maximum Likelihood Estimator}

487: Denoted by $\hat{d}_{MLE,c}$, the bias-corrected maximum likelihood

488: estimator is

489: \begin{align}

490: \hat{d}_{MLE,c} = \hat{d}_{MLE}\left(1-\frac{1}{k}\right),

491: \end{align}

492: where $\hat{d}_{MLE}$ solves a nonlinear MLE equation

493: \begin{align}

494: -\frac{k}{\hat{d}_{MLE}} + \sum_{j=1}^k\frac{2\hat{d}_{MLE}}{x_j^2 + \hat{d}_{MLE}^2} = 0.

495: \end{align}

496:

497: Some properties of $\hat{d}_{MLE,c}$:

498: \begin{itemize}

499: \item It is nearly unbiased, $\text{E}\left(\hat{d}_{MLE,c}\right) = d

500:   + O\left(\frac{1}{k^2}\right)$.

501: \item Its asymptotic variance is

502: \begin{align}

503: \text{Var}\left(\hat{d}_{MLE,c}\right) = \frac{2d^2}{k} +

504: \frac{3d^2}{k^2}

505:   + O\left(\frac{1}{k^3}\right),

506: \end{align}

507: \noindent i.e.,

508: $\frac{\text{Var}\left(\hat{d}_{MLE,c}\right)}{\text{Var}\left(\hat{d}_{me,c}\right)}

509: \rightarrow \frac{8}{\pi^2}$, $\frac{\text{Var}\left(\hat{d}_{MLE,c}\right)}{\text{Var}\left(\hat{d}_{gm,c}\right)}

510: \rightarrow \frac{8}{\pi^2}$, as $k\rightarrow

511: \infty$. ($\frac{8}{\pi^2} \approx 80\%$)

512: \item Its distribution can be accurately approximated by an inverse

513:   Gaussian, at least in the small deviation range. Based on the

514:   inverse Gaussian approximation, we suggest the following approximate tail bound

515: \begin{align}

516: &\mathbf{Pr}\left(|\hat{d}_{MLE,c} - d| \geq \epsilon d\right) \overset{\sim}{\leq}

517: 2\exp\left(-\frac{\epsilon^2/(1+\epsilon)}{2 \left(\frac{2}{k} + \frac{3}{k^2}\right)}\right),

518: \hspace{0.15in} 0\leq \epsilon \leq 1,

519: \end{align}

520: \noindent which has been verified by simulations for the tail

521: probability $\geq 10^{-10}$ range.

522: \end{itemize}
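
The nonlinear MLE equation has no closed-form solution, but it is straightforward to solve numerically. The sketch below uses bisection, which is a choice made for this example rather than a procedure prescribed here: multiplying the equation by $\hat{d}_{MLE}$ gives $\sum_{j=1}^k \frac{2\hat{d}_{MLE}^2}{x_j^2+\hat{d}_{MLE}^2} = k$, whose left-hand side is increasing in $\hat{d}_{MLE}$ and crosses $k$ between $\min_j|x_j|$ and $\max_j|x_j|$. The function names, the simulated scale, and the seed are hypothetical, and the code assumes no $x_j$ is exactly zero.

```python
import math
import random

def cauchy_scale_mle(x, iters=200):
    """Solve sum_j 2 d^2 / (x_j^2 + d^2) = k for d by bisection."""
    k = len(x)
    sq = [v * v for v in x]

    def score(d):
        # d times the MLE equation; increasing in d
        return sum(2.0 * d * d / (s + d * d) for s in sq) - k

    lo = min(abs(v) for v in x)  # score(lo) <= 0
    hi = max(abs(v) for v in x)  # score(hi) >= 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mle_bias_corrected(x):
    """d_MLE,c = d_MLE * (1 - 1/k)."""
    return cauchy_scale_mle(x) * (1.0 - 1.0 / len(x))

def mle_approx_tail_bound(k, eps):
    """Approximate bound 2 exp(-(eps^2/(1+eps)) / (2 (2/k + 3/k^2))), 0 <= eps <= 1."""
    return 2.0 * math.exp(-(eps ** 2 / (1.0 + eps)) / (2.0 * (2.0 / k + 3.0 / k ** 2)))

rng = random.Random(2)
d_true = 3.0  # hypothetical l1 distance
x = [d_true * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(2001)]
d_mle = cauchy_scale_mle(x)
d_mle_c = mle_bias_corrected(x)
print(d_mle, d_mle_c, mle_approx_tail_bound(100, 0.5))
```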

523:

524:

525: \section{The Sample Median Estimators}\label{sec_median}

526:

527: Recall that in Cauchy random projections, $\mathbf{B} = \mathbf{AR}$, we

528: denote the leading two rows in $\mathbf{A}$ by $u_1$, $u_2$ $\in

529: \mathbb{R}^{D}$, and the leading two rows in $\mathbf{B}$ by $v_1$,

530: $v_2$ $\in \mathbb{R}^{k}$. Our goal is to estimate the $l_1$ distance

531: $d = |u_1 - u_2| = \sum_{i=1}^D |u_{1,i} - u_{2,i}|$ from

532: $\{x_j\}_{j=1}^k$, $x_j = v_{1,j} - v_{2,j} \sim C(0,d)$, i.i.d.

533:

534: It is easy to show (e.g., \cite{Proc:Indyk_FOCS00}) that the

535: population median of $|x_j|$ is $d$. Therefore, it is natural to

536: consider estimating $d$ from the sample median,

537: \begin{align} \label{eqn_def_me}

538: \hat{d}_{me} = \text{median}\{|x_j|, j = 1, 2, ..., k\}.

539: \end{align}

540:

541: As illustrated in the following lemma (proved in Appendix \ref{app_proof_lem_me}), the sample median estimator,

542: $\hat{d}_{me}$, is asymptotically  unbiased and normal. For small

543: samples (e.g., $k\leq 20$), however, $\hat{d}_{me}$ is severely

544: biased.

545:

546: \begin{lemma} \label{lem_me}

547: The sample median estimator, $\hat{d}_{me}$, defined in

548: (\ref{eqn_def_me}), is asymptotically unbiased and normal

549: \begin{align}

550: \sqrt{k}\left(\hat{d}_{me}  - d \right)\overset{D}{\Longrightarrow} N\left(0,\frac{\pi^2}{4}d^2\right)

551: \end{align}

552: When $k = 2m+1$, $m$ = 1, 2, 3, ..., the $r^{th}$ moment of

553: $\hat{d}_{me}$ can be represented as

554: \begin{align}

555: &\text{E}\left(\hat{d}_{me}\right)^r = d^r\left(\int_0^1\frac{(2m+1)!}{(m!)^2}\tan^r\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m

556:   dt\right), \ \ \  m \geq r

557: \end{align}

558: If $m<r$, then $\text{E}\left(\hat{d}_{me}\right)^r = \infty$. \\ \\

559: \end{lemma}

560:

561: For simplicity, we only consider $k = 2m+1$ when evaluating

562: $\text{E}\left(\hat{d}_{me}\right)^r$.

563:

564: Once we know $\text{E}\left(\hat{d}_{me}\right)$, we can remove the

565: bias of $\hat{d}_{me}$ using

566: \begin{align}

567: \hat{d}_{me,c} = \frac{\hat{d}_{me}}{b_{me}},

568: \end{align}

569: where the bias correction factor $b_{me}$ is

570: \begin{align}\label{eqn_bme}

571: b_{me} = \frac{\text{E}\left(\hat{d}_{me}\right)}{d} = \int_0^1\frac{(2m+1)!}{(m!)^2}\tan\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m

572:   dt.

573: \end{align}

574:

575: $b_{me}$ can be numerically evaluated and tabulated, at least for small

576: $k$.\footnote{It is possible to express $b_{me}$ as an infinite

577:   sum. Note that $\frac{(2m+1)!}{(m!)^2}\left(t-t^2\right)^m$, $0\leq

578:   t\leq 1$, is the probability density of a Beta distribution

579:   $Beta(m+1,m+1)$.}

580: % By Taylor expansion \citep[1.411.6]{Book:Gradshteyn_94},

581: %  $\tan\left(\frac{\pi}{2}t\right) =

582: %  \sum_{j=1}^\infty\frac{2^{2j}\left(2^{2j}-1\right)}{(2j)!}|B_{2j}|\left(\frac{\pi}{2}\right)^{2j-1}t^{2j-1}$, where $B_{2j}$ is the {\em Bernoulli number} \citep[9.61]{Book:Gradshteyn_94}. If $z \sim Beta(m+1,m+1)$, then $\text{E}\left(z^r\right) = \frac{(2m+1)!(m+r)!}{(2m+1+r)!m!}$ (\url{http://mathworld.wolfram.com/BetaDistribution.html}). Therefore, $b_{me} = \sum_{j=1}^\infty\frac{2^{2j}\left(2^{2j}-1\right)}{(2j)!}|B_{2j}|\left(\frac{\pi}{2}\right)^{2j-1} \frac{(2m+1)!(m+2j-1)!}{(2m+2j)!m!}$. }

583:

584: Obviously, $\hat{d}_{me,c}$ is unbiased, i.e.,

585: $\text{E}\left(\hat{d}_{me,c}\right) = d$. Its variance would be

586: \begin{align}

587: \text{Var}\left(\hat{d}_{me,c}\right) =

588: d^2\left(\frac{(m!)^2}{(2m+1)!}\frac{\int_0^1

589:   \tan^2\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt}{\left(\int_0^1

590:   \tan\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt\right)^2} -

591: 1\right), \ \ \ \ k=2m+1\geq5

592: \end{align}

593:

594: Of course, $\hat{d}_{me,c}$ and $\hat{d}_{me}$ are asymptotically

595: equivalent, i.e.,

596: $\sqrt{k}\left(\hat{d}_{me,c}  - d \right)\overset{D}{\Longrightarrow}

597: N\left(0,\frac{\pi^2}{4}d^2\right)$.

598:

599: Figure \ref{fig_bme} plots $b_{me}$ as a function of $k$, indicating

600: that $\hat{d}_{me}$ is severely biased when $k\leq 20$. When $k>50$,

601: the bias becomes negligible. Note that, because $b_{me}\geq 1$, the bias

602: correction not only removes the bias of $\hat{d}_{me}$ but also

603: reduces its variance.

604:

605:

606: \begin{figure}[h]

607: \begin{center}

608: \includegraphics[width = 2.5in]{fig/me_bias_correction_factor.eps}

609: \end{center}\vspace{-0.3in}

610: \caption{The bias correction factor, $b_{me}$ in (\ref{eqn_bme}), as a function of $k

611:   =2m+1$. For $k>50$, the bias is negligible. Note that

612:   $b_{me}=\infty$ when $k=1$. }\label{fig_bme}

613: \end{figure}

614:

615: The sample median is a special case of sample quantile estimators

616: \citep{Article:Fama_68,Article:Fama_71}.   For example, one

617: version of the quantile estimators given by

618: \cite{Article:McCulloch_86} would be

619: \begin{align}

620: \hat{d}_{or} = \frac{\hat{|x|}_{.75} - \hat{|x|}_{.25}}{2.0},

621: \end{align}

622: \noindent where $\hat{|x|}_{.75}$ and $\hat{|x|}_{.25}$ are the .75 and

623: .25 sample quantiles of $\{|x_{j}|\}_{j=1}^k$, respectively.

624:

625: Our simulations indicate that $\hat{d}_{me}$ actually slightly outperforms

626: $\hat{d}_{or}$. This is not surprising. $\hat{d}_{or}$ works for any

627: Cauchy distribution whose location parameter does not have to be zero,

628: while $\hat{d}_{me}$ takes advantage of the fact that the

629: Cauchy location parameter is always zero in our case.
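
For readers who wish to repeat such a comparison, the sketch below computes $\hat{d}_{me}$ and $\hat{d}_{or}$ on the same simulated sample. The quantile rule used here (a simple order statistic without interpolation) is an assumption of this example; other conventions for sample quantiles would give slightly different values. For $x\sim C(0,d)$ the $q$th quantile of $|x|$ is $d\tan\left(\frac{\pi}{2}q\right)$, so $\hat{d}_{or}$ targets $d\left(\tan\frac{3\pi}{8} - \tan\frac{\pi}{8}\right)/2 = d$.

```python
import math
import random

def sample_quantile(sorted_vals, q):
    """Order-statistic quantile without interpolation (one of several conventions)."""
    idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
    return sorted_vals[idx]

def d_me(x):
    """Sample median of |x_j|."""
    return sample_quantile(sorted(abs(v) for v in x), 0.5)

def d_or(x):
    """Quantile estimator (|x|_.75 - |x|_.25) / 2."""
    s = sorted(abs(v) for v in x)
    return (sample_quantile(s, 0.75) - sample_quantile(s, 0.25)) / 2.0

rng = random.Random(3)
d_true = 3.0  # hypothetical l1 distance
x = [d_true * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(4001)]
est_me = d_me(x)
est_or = d_or(x)
print(est_me, est_or)
```

A single sample cannot rank the two estimators; repeating the experiment over many seeds and comparing mean squared errors would be needed for that.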

630:

631:

632: \section{The Geometric Mean Estimators }\label{sec_gm}

633:

634: This section derives estimators based on the geometric

635: mean, which are more accurate than the sample median estimators. The

636: geometric mean estimators allow us to derive tail bounds in explicit

637: forms and (consequently) an analog of the

638: Johnson-Lindenstrauss  (JL) lemma for dimension reduction in $l_1$.

639:

640: Recall, our goal is to estimate $d$ from $k$ i.i.d. samples $x_j

641: \sim C(0,d)$. To help derive the geometric mean estimators, we

642: first study two nonlinear estimators based on the fractional moment, i.e., $\text{E}(|x|^\lambda)$

643: ($|\lambda|<1$) and the logarithmic moment, i.e.,

644: $\text{E}\left(\log(|x|)\right)$, respectively, as presented in

645: Lemma  \ref{lem_d_log}. See the proof in Appendix \ref{app_proof_lem_d_log}.

646:

647: \begin{lemma}\label{lem_d_log}

648: Assume $x \sim C(0,d)$. Then

649: \begin{align}

650: &\text{E}\left(|x|^\lambda\right)

651: =\frac{d^\lambda}{\cos(\lambda\pi/2)}, \hspace{0.5in}|\lambda|<1\\

652: &\text{E}\left(\log(|x|)\right) = \log(d), \\

653: &\text{Var}\left(\log(|x|)\right) = \frac{\pi^2}{4},

654: \end{align}

655: \noindent from which we can derive two biased estimators of $d$ from

656: $k$ i.i.d. samples $x_j \sim C(0,d)$:

657: \begin{align}

658: &\hat{d}_\lambda = \left(\frac{1}{k}\sum_{j=1}^k|x_j|^\lambda

659:   \cos(\lambda\pi/2)\right)^{1/\lambda}, \hspace{0.2in} |\lambda| <1,\\

660: &\hat{d}_{log} = \exp\left(\frac{1}{k}\sum_{j=1}^k\log(|x_j|)\right),

661: \end{align}

662: \noindent whose variances are, respectively,

663: \begin{align}

664: &\text{Var}\left(\hat{d}_{\lambda}\right) = \frac{d^2}{k}

665: \frac{\sin^2(\lambda \pi/2)}{\lambda^2 \cos(\lambda\pi)} +

666: O\left(\frac{1}{k^2}\right), \hspace{0.2in} |\lambda| <1/2\\

667: &\text{Var}\left(\hat{d}_{log}\right)  = \frac{\pi^2d^2}{4k} +

668: O\left(\frac{1}{k^2}\right).

669: \end{align}

670:

671: The term $\frac{\sin^2(\lambda \pi/2)}{\lambda^2 \cos(\lambda\pi)}$

672: decreases with decreasing $|\lambda|$, reaching a limit

673: \begin{align}

674: \underset{\lambda\rightarrow 0}\lim\frac{\sin^2(\lambda

675:   \pi/2)}{\lambda^2 \cos(\lambda\pi)} = \frac{\pi^2}{4}.

676: \end{align}

677: \noindent In other words, the variance of $\hat{d}_{\lambda}$ converges to

678: that of $\hat{d}_{log}$ as $|\lambda|$ approaches zero.

679: \\

680: \end{lemma}

681:
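\noindent As a quick check of the limit stated in Lemma \ref{lem_d_log} (a sketch that uses only Taylor expansions, writing $a = \lambda\pi/2$), note that $\sin^2 a = a^2 - a^4/3 + O\left(a^6\right)$ and $1/\cos(2a) = 1 + 2a^2 + O\left(a^4\right)$, so that
\begin{align*}
\frac{\sin^2(\lambda\pi/2)}{\lambda^2\cos(\lambda\pi)}
= \frac{\pi^2}{4}\left(1 + \frac{5\pi^2}{12}\lambda^2 + O\left(\lambda^4\right)\right).
\end{align*}
The correction term is positive, which is consistent with the factor decreasing toward $\frac{\pi^2}{4}$ as $|\lambda|$ decreases.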

682:  Note that $\hat{d}_{log}$ can in fact be

683: written as the {\em geometric mean}:

684: \begin{align}

685: \hat{d}_{log} = \hat{d}_{gm} = \prod_{j=1}^k|x_j|^{1/k}.

686: \end{align}

687:

688: $\hat{d}_{\lambda}$ is a non-convex norm ($l_\lambda$) because $\lambda

689: <1$. $\hat{d}_{gm}$ is also

690: a non-convex norm (the $l_\lambda$ norm as $\lambda \rightarrow 0$). Neither

691: $\hat{d}_{\lambda}$ nor $\hat{d}_{gm}$ satisfies the triangle

692: inequality.

693:

694: We propose $\hat{d}_{gm,c}$, the bias-corrected geometric mean

695: estimator. Lemma \ref{lem_d_gm}  derives the moments of

696: $\hat{d}_{gm,c}$, proved in Appendix \ref{app_proof_lem_d_gm}.

697:

698: \begin{lemma}\label{lem_d_gm}

699: \begin{align}

700: \hat{d}_{gm,c} =

701: \cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k|x_j|^{1/k},

702: \hspace{0.1in} k>1

703: \end{align}

704: is unbiased, with the variance  (valid when $k>2$)

705: \begin{align}

706: \text{Var}\left(\hat{d}_{gm,c}\right) &= d^2

707: \left(\frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi}{k}\right)}-1

708: \right)=\frac{d^2}{k} \frac{\pi^2}{4} +

709: \frac{\pi^4}{32}\frac{d^2}{k^2}+O\left(\frac{1}{k^3}\right).

710: \end{align}

711:

712: The third and fourth central moments are  (for $k>3$ and $k>4$,

713: respectively)

714: \begin{align}

715: &\text{E}\left(\hat{d}_{gm,c} -

716:   \text{E}\left(\hat{d}_{gm,c}\right)\right)^3 =

717: \frac{3\pi^4}{16}\frac{d^3}{k^2} + O\left(\frac{1}{k^3}\right) \\

718: &\text{E}\left(\hat{d}_{gm,c} -

719:   \text{E}\left(\hat{d}_{gm,c}\right)\right)^4 =

720: \frac{3\pi^4}{16}\frac{d^4}{k^2} + O\left(\frac{1}{k^3}\right).

721: \end{align}\\

722: \end{lemma}

723:
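\noindent For intuition, the first two moments in Lemma \ref{lem_d_gm} can be sketched directly from the fractional moment formula in Lemma \ref{lem_d_log}; the complete argument is the one in Appendix \ref{app_proof_lem_d_gm}. Because the $x_j$ are i.i.d., for any $t$ with $|t|<k$,
\begin{align*}
\text{E}\left(\prod_{j=1}^k|x_j|^{t/k}\right)
= \prod_{j=1}^k\text{E}\left(|x_j|^{t/k}\right)
= \frac{d^{t}}{\cos^k\left(\frac{\pi t}{2k}\right)}.
\end{align*}
Taking $t=1$ shows that multiplying by $\cos^k\left(\frac{\pi}{2k}\right)$ removes the bias. Taking $t=2$ (which requires $k>2$) gives
\begin{align*}
\text{E}\left(\hat{d}_{gm,c}^2\right)
= d^2\frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi}{k}\right)},
\end{align*}
and subtracting $d^2$ yields the variance.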

724: The higher (third or fourth) moments may be useful for approximating

725: the distribution of $\hat{d}_{gm,c}$.  In Section \ref{sec_mle}, we

726: will show how to approximate the distribution of the maximum

727: likelihood estimator by matching the first four moments (in the

728: leading terms). We could apply a similar technique to approximate

729: $\hat{d}_{gm,c}$. Fortunately, we do not have to do so because we are

730: able to derive the exact tail bounds of $\hat{d}_{gm,c}$ in Lemma

731: \ref{lem_d_gm_tail}, which is proved in Appendix \ref{app_proof_lem_d_gm_tail}.

732:

733: \begin{lemma}\label{lem_d_gm_tail}

734: \begin{align}\label{eqn_gm_bound}

735: \mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right) \leq

736: \frac{\cos^{kt_1^*}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi

737:       t_1^*}{2k}\right)(1+\epsilon)^{t_1^*}}, \hspace{0.25in} \epsilon \geq0

738: \end{align}

739: \noindent where

740: \begin{align}

741: t_1^* = \frac{2k}{\pi}\tan^{-1}\left(\left(\log(1+\epsilon) -

742:     k\log\cos\left(\frac{\pi}{2k}\right)\right)\frac{2}{\pi}\right).

743: \end{align}

744: \begin{align}\label{eqn_gm_bound_left}

745: \mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d \right) \leq

746: \frac{ (1-\epsilon)^{t_2^*}}{\cos^k\left(\frac{\pi

747:       t_2^*}{2k}\right)\cos^{kt_2^*}\left(\frac{\pi}{2k}\right)},

748: \hspace{0.25in} 0\leq \epsilon\leq 1, \hspace{0.1in} k\geq \frac{\pi^2}{8\epsilon}

749: \end{align}

750: \noindent where

751: \begin{align}

752: t_2^* = \frac{2k}{\pi}\tan^{-1}\left(\left(-\log(1-\epsilon) +

753:     k\log\cos\left(\frac{\pi}{2k}\right)\right)\frac{2}{\pi}\right).

754: \end{align}

755:

756:

757: By restricting $0\leq\epsilon\leq 1$, the tail bounds can be written

758: in exponential forms:

759: \begin{align}\label{eqn_exp_right}

760: &\mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right) \leq

761: \exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right) \\

762: &\mathbf{Pr}\left(\hat{d}_{gm,c} \leq (1-\epsilon)d \right) \leq

763: \exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right), \hspace{0.2in} k\geq \frac{\pi^2}{1.5\epsilon}\label{eqn_exp_left}

764: \end{align}\\

765: \end{lemma}

766:
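\noindent A brief sketch of where the right tail bound (\ref{eqn_gm_bound}) comes from may be helpful; the proof itself is in Appendix \ref{app_proof_lem_d_gm_tail}. By Lemma \ref{lem_d_log} and independence, $\text{E}\left(\hat{d}_{gm,c}^{t}\right) = \cos^{kt}\left(\frac{\pi}{2k}\right) d^t / \cos^k\left(\frac{\pi t}{2k}\right)$ for $0<t<k$, so the Markov inequality applied to $\hat{d}_{gm,c}^{t}$ gives
\begin{align*}
\mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right)
\leq \frac{\text{E}\left(\hat{d}_{gm,c}^{t}\right)}{(1+\epsilon)^t d^t}
= \frac{\cos^{kt}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi t}{2k}\right)(1+\epsilon)^{t}}.
\end{align*}
Setting the derivative of the logarithm of the right-hand side (with respect to $t$) to zero gives
\begin{align*}
k\log\cos\left(\frac{\pi}{2k}\right) + \frac{\pi}{2}\tan\left(\frac{\pi t}{2k}\right) - \log(1+\epsilon) = 0,
\end{align*}
whose solution is $t_1^*$. The left tail bound (\ref{eqn_gm_bound_left}) follows in the same way from the negative moments $\text{E}\left(\hat{d}_{gm,c}^{-t}\right)$.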

767: An analog of the JL bound for $l_1$ follows from the exponential tail

768: bounds (\ref{eqn_exp_right}) and

769: (\ref{eqn_exp_left}).

770: \begin{lemma}\label{lem_JL_l1}

771: If $\hat{d}_{gm,c}$ is used with $k \geq \frac{8\left(2\log n -

772:   \log\delta\right)}{\epsilon^2/(1+\epsilon)} \geq

773: \frac{\pi^2}{1.5\epsilon}$, then with probability at

774: least $1-\delta$, the $l_1$ distance, $d$, between

775: any pair of data points (among $n$ data points), can be estimated with

776: errors bounded by $\pm \epsilon d$, i.e., $|\hat{d}_{gm,c} - d| \leq

777: \epsilon d$.

778: \end{lemma}

779:
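\noindent The requirement on $k$ in Lemma \ref{lem_JL_l1} can be recovered by a union bound (a sketch). There are $n(n-1)/2 < n^2/2$ pairs of data points, and by (\ref{eqn_exp_right}) and (\ref{eqn_exp_left}) the estimate for each pair fails with probability at most $2\exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right)$. Hence the probability that any pair fails is at most
\begin{align*}
n^2\exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right) \leq \delta
\hspace{0.2in}\text{whenever}\hspace{0.2in}
k \geq \frac{8\left(2\log n - \log\delta\right)}{\epsilon^2/(1+\epsilon)}.
\end{align*}
As a purely illustrative calculation (the values $n = 10^6$, $\delta = 0.05$, and $\epsilon = 0.5$ are assumptions chosen for this example, not taken from our experiments), $2\log n - \log\delta \approx 30.63$ and $\epsilon^2/(1+\epsilon) = 1/6$, so the bound asks for $k \geq 8\times 30.63\times 6 \approx 1470$, which also satisfies $k\geq \frac{\pi^2}{1.5\epsilon}\approx 13.2$.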

780: \textbf{Remarks on Lemma \ref{lem_JL_l1}}: (1) We can replace the constant ``8'' in Lemma

781: \ref{lem_JL_l1} with better (i.e., smaller) constants for

782: specific values of $\epsilon$. For example, if $\epsilon = 0.2$, we can

783: replace ``8'' by ``5''. See the proof of Lemma \ref{lem_d_gm_tail}.

784: (2) This Lemma is weaker than the classical JL Lemma for

785: dimension reduction in $l_2$ as reviewed in Section 2.1. The classical

786: JL Lemma for $l_2$ ensures that the $l_2$ inter-point distances of the

787: projected data points are close enough to the original $l_2$

788: distances, while Lemma

789: \ref{lem_JL_l1} merely says that the projected data points contain

790: enough information to reconstruct the original $l_1$ distances.  On

791: the other hand, the geometric mean estimator is a non-convex

792: norm; and therefore it does contain some information about the

793: geometry. We leave it for future work to explore the possibility of

794: developing efficient algorithms using the geometric mean estimator. \\

795:

796:

797: Figure \ref{fig_hist_d_gm}   presents the simulated histograms of $\hat{d}_{gm,c}$

798: for $d=1$, with $k = 5$ and $k=50$. The histograms reveal some

799: characteristics shared by the maximum likelihood estimator  we will

800: discuss in the next section:

801: \begin{itemize}

802: \item Supported on $[0,\infty)$, $\hat{d}_{gm,c}$ is positively

803: skewed.

804: \item The distribution of $\hat{d}_{gm,c}$ is still

805:   ``heavy-tailed.'' However, in the region not too far from the mean, the distribution of $\hat{d}_{gm,c}$ may be

806:   well captured by a gamma (or a generalized gamma) distribution. For large $k$, even a

807: normal  approximation may suffice.

808: \end{itemize}

809: \begin{figure}[h]

810: \begin{center}\mbox{

811: \subfigure[$k=5$]{\includegraphics[width = 2.5in]{fig/hist_gm5.eps}}

812: \subfigure[$k=50$]{\includegraphics[width = 2.5in]{fig/hist_gm50.eps}}}

813: \end{center}\vspace{-0.4in}

814: \caption{Histograms of $\hat{d}_{gm,c}$, obtained from $10^6$ simulations. At

815:   least in the range not too far from the mean, the

816:   distribution of $\hat{d}_{gm,c}$ resembles a gamma and also resembles

817: a normal when $k$ is large enough. }\label{fig_hist_d_gm}

818: \end{figure}

819:

820:

821: Figure \ref{fig_gm_vs_me} compares $\hat{d}_{gm,c}$ with the sample median estimators $\hat{d}_{me}$ and

822: $\hat{d}_{me,c}$, in terms of the mean square errors.  $\hat{d}_{gm,c}$ is considerably more accurate than

823: $\hat{d}_{me}$ at small $k$. The bias correction significantly reduces

824: the mean square errors of $\hat{d}_{me}$.

825: \begin{figure}[h]

826: \begin{center}

827: \includegraphics[width = 2.5in]{fig/me_gm_mse_ratio.eps}

828: \end{center}\vspace{-0.25in}

829: \caption{ The ratios of the mean square errors (MSE),

830:   $\frac{\text{MSE}(\hat{d}_{me})}{\text{MSE}(\hat{d}_{gm,c})}$ and

831:   $\frac{\text{MSE}(\hat{d}_{me,c})}{\text{MSE}(\hat{d}_{gm,c})}$,

832:   demonstrate that the bias-corrected geometric mean estimator

833:   $\hat{d}_{gm,c}$ is considerably more accurate than the sample

834:   median estimator $\hat{d}_{me}$. The bias correction on

835:   $\hat{d}_{me}$ considerably reduces the MSE. Note that when $k=3$, the ratios are $\infty$. }\label{fig_gm_vs_me}

836: \end{figure}

837:

838:

839:

840: \section{The Maximum Likelihood Estimators}\label{sec_mle}

841:

842: This section is devoted to analyzing the maximum likelihood

843: estimators (MLE), which are ``asymptotically optimum.'' In comparison,

844: the sample median estimators and geometric mean estimators are

845: not optimum.  Our contribution in this section includes the higher-order

846: analysis for the bias and  moments and accurate closed-form

847: approximations to the distribution of the MLE.

848:

849:

850:

851: The method of maximum likelihood is widely used.  For example, \cite{Proc:Li_Hastie_Church_COLT06} applied the maximum likelihood method to {\em normal random

852:   projections} and provided an improved estimator of the

853: $l_2$ distance by taking advantage of the marginal information.

854:

855:

856: The Cauchy distribution is often considered a ``challenging''

857: example because of the ``multiple

858: roots'' problem when estimating the location

859: parameter \citep{Article:Barnett_66,Article:Haas_70}. In our case, since

860: the location parameter is always zero, much of the difficulty is avoided.

861:

862: Recall that our goal is to estimate $d$ from $k$ i.i.d. samples

863: $x_j \sim C(0,d), j = 1, 2,..., k$. The $\log$ joint

864: likelihood of $\{x_j\}_{j=1}^k$ is

865: \begin{align}

866: L(x_1,x_2,...x_k;d) = k\log(d) - k\log(\pi) - \sum_{j=1}^k\log(x_j^2+d^2),

867: \end{align}

868: \noindent whose first and second derivatives (w.r.t. $d$) are

869: \begin{align}

870: &L^\prime(d) = \frac{k}{d} - \sum_{j=1}^k\frac{2d}{x_j^2+d^2},\\

871: &L^{\prime\prime}(d) = -\frac{k}{d^2} -

872: \sum_{j=1}^k\frac{2x_j^2-2d^2}{(x_j^2+d^2)^2} =

873: - \frac{ L^\prime(d)}{d}  - 4\sum_{j=1}^k\frac{x_j^2}{(x_j^2+d^2)^2}.

874: \end{align}

875:

876: The maximum likelihood estimator of $d$, denoted by $\hat{d}_{MLE}$, is

877: the solution  to $L^\prime(d) = 0$, i.e.,

878: \begin{align}\label{eqn_mle}

879: -\frac{k}{\hat{d}_{MLE}}+\sum_{j=1}^k\frac{2\hat{d}_{MLE}}{x_j^2+\hat{d}_{MLE}^2} = 0.

880: \end{align}

881: \noindent Because $L^{\prime\prime}(\hat{d}_{MLE}) \leq 0$, $\hat{d}_{MLE}$ indeed maximizes the joint likelihood and is the

882: only solution to the MLE equation (\ref{eqn_mle}). Solving

883: (\ref{eqn_mle}) numerically is not difficult (e.g., a few iterations

884: using Newton's method). For better accuracy, we

885: recommend the following bias-corrected estimator:

886: \begin{align}

887: \hat{d}_{MLE,c} = \hat{d}_{MLE}\left(1-\frac{1}{k}\right).

888: \end{align}

889:
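\noindent One possible way to carry out the numerical solution of (\ref{eqn_mle}) is sketched here; the starting value and the stopping rule are assumptions made for illustration rather than a prescription. The Newton iteration reads
\begin{align*}
d^{(t+1)} = d^{(t)} - \frac{L^\prime\left(d^{(t)}\right)}{L^{\prime\prime}\left(d^{(t)}\right)}, \hspace{0.2in} t = 0, 1, 2, ...,
\end{align*}
started, for example, from $d^{(0)} = \hat{d}_{gm,c}$ and stopped once the relative change in $d^{(t)}$ falls below a chosen tolerance. Equivalently, multiplying (\ref{eqn_mle}) by $\hat{d}_{MLE}$ shows that the MLE solves
\begin{align*}
\sum_{j=1}^k\frac{2d^2}{x_j^2+d^2} = k.
\end{align*}
When all $x_j$ are nonzero (which holds with probability one), the left-hand side is strictly increasing in $d>0$, rising from $0$ to $2k$. This gives another way to see that the solution is unique, and it suggests that bisection on this equation could serve as a safe fallback if a Newton step leaves the positive axis.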

890: Lemma  \ref{lem_mle_asymp} concerns the asymptotic moments of $\hat{d}_{MLE}$ and $\hat{d}_{MLE,c}$, proved in Appendix

891: \ref{app_proof_lem_asymp}.

892: \begin{lemma}\label{lem_mle_asymp}

893: Both $\hat{d}_{MLE}$ and $\hat{d}_{MLE,c}$ are asymptotically unbiased and

894: normal. The first four moments of $\hat{d}_{MLE}$ are

895: \begin{align}

896: &\text{E}\left(\hat{d}_{MLE} - d\right) = \frac{d}{k}+ O\left(\frac{1}{k^2}\right) \\

897: &\text{Var}\left(\hat{d}_{MLE}\right) = \frac{2d^2}{k} + \frac{7d^2}{k^2} +O\left(\frac{1}{k^3}\right)\\

898: &\text{E}\left(\hat{d}_{MLE} - \text{E}(\hat{d}_{MLE})\right)^3 = \frac{12d^3}{k^2} +

899: O\left(\frac{1}{k^3}\right) \\

900: &\text{E}\left(\hat{d}_{MLE} - \text{E}(\hat{d}_{MLE})\right)^4 = \frac{12d^4}{k^2} +

901: \frac{222d^4}{k^3} + O\left(\frac{1}{k^4}\right)

902: \end{align}

903: The first four moments of $\hat{d}_{MLE,c}$ are

904: \begin{align}

905: &\text{E}\left(\hat{d}_{MLE,c} - d\right) =

906: O\left(\frac{1}{k^2}\right) \\

907: &\text{Var}\left(\hat{d}_{MLE,c}\right) = \frac{2d^2}{k} +

908: \frac{3d^2}{k^2}+O\left(\frac{1}{k^3}\right)  \\

909: &\text{E}\left(\hat{d}_{MLE,c} - \text{E}(\hat{d}_{MLE,c})\right)^3 = \frac{12d^3}{k^2} +

910: O\left(\frac{1}{k^3}\right) \\

911: &\text{E}\left(\hat{d}_{MLE,c} - \text{E}(\hat{d}_{MLE,c})\right)^4 = \frac{12d^4}{k^2} +

912: \frac{186d^4}{k^3} + O\left(\frac{1}{k^4}\right)

913: \end{align}\\

914: \end{lemma}

915:
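\noindent The first two moments of $\hat{d}_{MLE,c}$ in Lemma \ref{lem_mle_asymp} can be read off from those of $\hat{d}_{MLE}$ (a sketch at the level of the displayed terms). Since $\hat{d}_{MLE,c} = \left(1-\frac{1}{k}\right)\hat{d}_{MLE}$,
\begin{align*}
\text{E}\left(\hat{d}_{MLE,c}\right) &= \left(1-\frac{1}{k}\right)\left(d + \frac{d}{k} + O\left(\frac{1}{k^2}\right)\right) = d + O\left(\frac{1}{k^2}\right),\\
\text{Var}\left(\hat{d}_{MLE,c}\right) &= \left(1-\frac{1}{k}\right)^2\left(\frac{2d^2}{k} + \frac{7d^2}{k^2} + O\left(\frac{1}{k^3}\right)\right) = \frac{2d^2}{k} + \frac{3d^2}{k^2} + O\left(\frac{1}{k^3}\right).
\end{align*}
Thus the correction removes the $O\left(\frac{1}{k}\right)$ bias and also lowers the second-order variance term from $\frac{7d^2}{k^2}$ to $\frac{3d^2}{k^2}$.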

916: The order $O\left(\frac{1}{k}\right)$ term of the

917: variance, i.e.,  $\frac{2d^2}{k}$, is known, e.g.,

918:  \citep{Article:Haas_70}.  We derive the  bias-corrected estimator, $\hat{d}_{MLE,c}$,  and the higher order moments using stochastic Taylor

919: expansions \citep{Article:Bartlett_53,Article:Shenton_63,Article:Ferrari_96,Article:Cysneiros_01}.

920:
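\noindent For comparison with Section \ref{sec_gm}, a back-of-the-envelope calculation using only the leading variance terms of Lemma \ref{lem_d_gm} and Lemma \ref{lem_mle_asymp} gives
\begin{align*}
\frac{2d^2/k}{\pi^2d^2/(4k)} = \frac{8}{\pi^2} \approx 0.81.
\end{align*}
That is, to leading order, the geometric mean estimator would need about $\pi^2/8 \approx 1.23$ times as many projections as the MLE to attain the same variance.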

921: We will propose an inverse Gaussian distribution to approximate the

922: distribution of $\hat{d}_{MLE,c}$, by matching the first four moments

923: (at least in the leading terms).

924:

925: \subsection{A Numerical Example}

926: %\vspace{-0.1in}

927: The maximum likelihood estimators are tested on MSN Web crawl

928: data, a term-by-document matrix with

929: $D=2^{16}$ Web pages. We conduct Cauchy random

930: projections and estimate the $l_1$ distances

931: between words.  In this experiment, we compare the empirical and

932: (asymptotic) theoretical moments, using one pair of words. Figure \ref{fig_bias_var} illustrates that the bias correction is

933: effective and these (asymptotic) formulas for the first four moments

934: of $\hat{d}_{MLE,c}$ in Lemma \ref{lem_mle_asymp} are accurate, especially when $k\geq 20$.\vspace{-0.25in}

935: \begin{figure}[h]

936: \begin{center}\mbox{

937: \subfigure[{\scriptsize $\text{E}(\hat{d}_{MLE}-d)/d$ vs. $\text{E}(\hat{d}_{MLE,c}-d)/d$}]{\includegraphics[width = 2.25in]{fig/bias55.eps}}

938: \subfigure[{\scriptsize $\left(\text{E}(\hat{d}_{MLE,c}-\text{E}(\hat{d}_{MLE,c}))^2/d^2\right)^{1/2}$}]{\includegraphics[width = 2.25in]{fig/var55.eps}}}\vspace{-0.3in}

939: \mbox{

940: \subfigure[{\scriptsize $\left(\text{E}(\hat{d}_{MLE,c}-\text{E}(\hat{d}_{MLE,c}))^3/d^3\right)^{1/3}$}]{\includegraphics[width = 2.25in]{fig/third55.eps}}

941: \subfigure[{\scriptsize $\left(\text{E}(\hat{d}_{MLE,c}-\text{E}(\hat{d}_{MLE,c}))^4/d^4\right)^{1/4}$}]{\includegraphics[width = 2.25in]{fig/fourth55.eps}}}

942: \end{center}\vspace{-0.45in}

943: \caption{One pair of words is selected from an MSN term-by-document

944:   matrix with $D=2^{16}$ Web pages. We conduct Cauchy random

945:   projections and estimate the $l_1$ distance between one pair of words using the maximum

946:   likelihood estimator $\hat{d}_{MLE}$ and the bias-corrected version

947:   $\hat{d}_{MLE,c}$. Panel (a)

948:   plots the biases of $\hat{d}_{MLE}$ and $\hat{d}_{MLE,c}$, indicating that

949:   the bias correction is effective. Panels (b), (c), and

950:   (d) plot the variance, third moment, and fourth moment of

951:   $\hat{d}_{MLE,c}$, respectively. The dashed curves are the theoretical

952:   asymptotic moments. When $k\geq 20$,

953:   the theoretical asymptotic formulas for moments are accurate.}\label{fig_bias_var}\vspace{-0.1in}

954: \end{figure}

955:

956: \subsection{Approximation Distributions}

957:

958: Theoretical analysis on the exact distribution of a maximum likelihood

959: estimator is difficult.\footnote{In fact, conditional on the observations $x_1$,

960:   $x_2$, ..., $x_k$, the distribution of $\hat{d}_{MLE}$ can be exactly

961:   characterized \citep{Article::Fisher_34}.  \cite{Article:Lawless_72}

962:   studied the conditional confidence interval of the MLE. Later,

963:    \cite{Article:Hinkley_78} proposed the normal approximation to the exact

964: conditional confidence interval and showed that it was superior to the

965: unconditional normality approximation. Unfortunately, we cannot take advantage of the conditional

966: analysis because our goal is to determine the sample size $k$ before

967: seeing any samples. } In statistics, the standard

968: approach is to assume normality, which, however, is quite

969: inaccurate. The so-called {\em Edgeworth expansion}\footnote{The so-called {\em Saddlepoint approximation} in general improves

970: Edgeworth expansions \citep{Book:Jensen_95}, often very

971: considerably. Unfortunately, we cannot apply the Saddlepoint

972: approximation in our case (at least not directly), because the

973: Saddlepoint approximation needs a bounded moment generating

974: function.} improves the

975: normal approximation by matching higher moments

976: \citep{Book:Feller_II,Article:Bhattacharya_78, Book:Severini_00}. For

977: example, if we approximate the distribution of $\hat{d}_{MLE,c}$ using

978: an Edgeworth expansion by matching the first four moments of

979: $\hat{d}_{MLE,c}$ derived in Lemma \ref{lem_mle_asymp}, then the errors

980:  will be on the order of $O\left(k^{-3/2}\right)$. However, Edgeworth

981:  expansions have some well-known drawbacks. The resultant

982:  expressions are quite sophisticated. They are not accurate at

983:  the tails. It is possible that the approximate probability has values

984:  below zero. Also, Edgeworth expansions assume the support is

985:  $(-\infty, \infty)$, while  $\hat{d}_{MLE,c}$ is

986:  non-negative.

987:

988:

989:

990: We propose approximating the distributions of

991: $\hat{d}_{MLE,c}$ directly using some well-studied common

992: distributions. We will first consider a gamma distribution with the

993: same first two (asymptotic) moments of $\hat{d}_{MLE,c}$. That is, the

994: gamma distribution will be asymptotically equivalent to the normal

995: approximation. While a normal has zero third

996: central moment, a gamma has nonzero third central moment. This, to an

997: extent, speeds up the rate of convergence. Another important reason

998: why a gamma is more accurate is that it has the same support as

999: $\hat{d}_{MLE,c}$, i.e., $[0,\infty)$.

1000:

1001: We will furthermore consider a {\em   generalized gamma} distribution,

1002: which allows us to match the first

1003: three (asymptotic) moments of $\hat{d}_{MLE,c}$.  Interestingly, in

1004: this case, the generalized gamma approximation turns out to be an

1005: inverse Gaussian distribution, which has a closed-form probability density. More

1006: interestingly, this inverse Gaussian distribution also

1007: matches the fourth central moment of $\hat{d}_{MLE,c}$ in the

1008: $O\left(\frac{1}{k^2}\right)$ term and almost in the

1009: $O\left(\frac{1}{k^3}\right)$ term. By simulations, the inverse

1010: Gaussian approximation is highly accurate.

1011:

1012: Note that, since we are interested in the very small (e.g., $10^{-10}$) tail probability

1013: range, $O\left(k^{-3/2}\right)$ is not too meaningful. For example,

1014: $k^{-3/2} = 10^{-3}$ if $k = 100$. Therefore, we will have to

1015: rely on simulations to assess the accuracy of the approximations. On

1016: the other hand, an upper

1017: bound may hold exactly (verified by simulations) even if it is based

1018: on an approximate distribution.

1019:

1020: As related work, \cite{Article:Li_SINR06} applied gamma and generalized gamma

1021: approximations to model the performance measure distribution in some

1022: wireless communication channels using random matrix theory and

1023: produced  accurate results in evaluating the error probabilities.

1024:

1025: \subsubsection{The Gamma Approximation}

1026:

1027: The gamma approximation is an obvious improvement over the normal

1028: approximation.\footnote{In {\em normal random projections} for

1029:   dimension reduction in $l_2$, the resultant estimator of the squared

1030:   $l_2$

1031:   distance has a chi-squared distribution (e.g., \cite[Lemma

1032:   1.3]{Book:Vempala}), which is a special case of gamma.}

1033: A gamma distribution, $G(\alpha,\beta)$, has two parameters, $\alpha$

1034: and $\beta$, which can be determined by matching the first two

1035: (asymptotic) moments of $\hat{d}_{MLE,c}$. That is, we assume that $\hat{d}_{MLE,c} \sim G(\alpha, \beta)$, with

1036: \begin{align}

1037: &\alpha\beta = d, \hspace{0.25in} \alpha\beta^2 = \frac{2d^2}{k} +

1038: \frac{3d^2}{k^2}, \ \ \

1039: \Longrightarrow \  \

1040: \alpha = \frac{1}{\frac{2}{k} + \frac{3}{k^2}}, \hspace{0.25in} \beta = \frac{2d}{k} + \frac{3d}{k^2}.

1041: \end{align}

1042:

1043: Assuming a gamma distribution, it is easy to obtain the following

1044: Chernoff bounds\footnote{Using the Chernoff inequality

1045:   \citep{Article:Chernoff_52}, we bound the tail probability by

1046: $\mathbf{Pr}\left(Q>z\right) = \mathbf{Pr}\left(e^{Qt}>e^{zt}\right)

1047: \leq \text{E}\left(e^{Qt}\right)e^{-zt}$; and we then choose $t$ that minimizes

1048: the upper bound.}:

1049: \begin{align}\label{eqn_gamma_right}

1050: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \geq  (1+\epsilon)

1051:   d\right)  \overset{\sim}{\leq} \exp\left(-\alpha\left(\epsilon -

1052:     \log(1+\epsilon)\right)\right), \hspace{0.2in} \epsilon \geq 0 \\

1053: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \leq (1-\epsilon)

1054:   d\right)  \overset{\sim}{\leq} \exp\left(-\alpha\left(-\epsilon -

1055:     \log(1-\epsilon)\right)\right), \hspace{0.2in} 0\leq \epsilon <

1056: 1\label{eqn_gamma_left},

1057: \end{align}

1058: \noindent where we use $\overset{\sim}{\leq}$ to indicate that these

1059: inequalities are based on an approximate distribution.

1060:
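\noindent To sketch how (\ref{eqn_gamma_right}) arises, suppose $Q \sim G(\alpha,\beta)$ with $\alpha\beta = d$, where $\beta$ is treated as the scale parameter (consistent with mean $\alpha\beta$ and variance $\alpha\beta^2$), so that $\text{E}\left(e^{Qt}\right) = (1-\beta t)^{-\alpha}$ for $t<1/\beta$. For $0<t<1/\beta$,
\begin{align*}
\mathbf{Pr}\left(Q \geq (1+\epsilon)d\right) \leq (1-\beta t)^{-\alpha}\exp\left(-(1+\epsilon)\alpha\beta t\right),
\end{align*}
which is minimized at $1-\beta t = \frac{1}{1+\epsilon}$, giving
\begin{align*}
(1+\epsilon)^{\alpha}\exp\left(-\alpha\epsilon\right) = \exp\left(-\alpha\left(\epsilon - \log(1+\epsilon)\right)\right).
\end{align*}
The left tail bound (\ref{eqn_gamma_left}) follows similarly from $\text{E}\left(e^{-Qs}\right) = (1+\beta s)^{-\alpha}$, $s>0$, with the minimum attained at $1+\beta s = \frac{1}{1-\epsilon}$.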

1061: Note that the distribution of $\hat{d}_{MLE}/d$ (and hence $\hat{d}_{MLE,c}/d$) is only a function of

1062: $k$ as shown in \citep{Article:Antle_69,Article:Haas_70}. Therefore, we

1063: can evaluate the accuracy of the gamma approximation by simulations

1064: with $d = 1$, as presented in Figure \ref{fig_gamma_tail}.

1065:

1066:

1067: \begin{figure}[h]

1068: \begin{center}\mbox{

1069: \subfigure[]{\includegraphics[width = 2.8in]{fig/gamma10.eps}}

1070: \subfigure[]{\includegraphics[width = 2.8in]{fig/gbound10.eps}}}

1071: \end{center}\vspace{-0.4in}

1072: \caption{ We consider $k$ = 10, 20, 50, 100, 200, and 400. For each $k$, we

1073:   simulate standard Cauchy samples, from which we

1074:   estimate the Cauchy parameter by the MLE $\hat{d}_{MLE,c}$ and compute the tail

1075: probabilities. Panel (a) compares the empirical tail probabilities

1076: (thick solid) with

1077: the gamma tail probabilities (thin solid), indicating that the gamma distribution

1078: is better than the

1079: normal  (dashed) for approximating the distribution of

1080: $\hat{d}_{MLE,c}$.  Panel (b) compares the empirical tail

1081: probabilities with the gamma upper bound

1082: (\ref{eqn_gamma_right})+(\ref{eqn_gamma_left}).  }\label{fig_gamma_tail}

1083: \end{figure}

1084:

1085: Figure \ref{fig_gamma_tail}(a) shows that both the gamma and

1086: normal approximations are fairly accurate when the tail probability $\geq

1087: 10^{-2}\sim 10^{-3}$; and the gamma approximation is  obviously

1088: better.

1089:

1090: Figure \ref{fig_gamma_tail}(b) compares the empirical tail probabilities with the

1091: gamma Chernoff upper bound

1092: (\ref{eqn_gamma_right})+(\ref{eqn_gamma_left}), indicating that these bounds are reliable, when the tail probability $\geq

1093: 10^{-5}\sim 10^{-6}$.

1094:

1095:

1096: \subsubsection{The Inverse Gaussian  (Generalized Gamma) Approximation}

1097:

1098: The distribution of $\hat{d}_{MLE,c}$ can be well

1099: approximated by an inverse Gaussian distribution, which is a special

1100: case of the three-parameter generalized gamma distribution

1101:  \citep{Article:Hougaard_86,Article:Gerber}, denoted by $GG(\alpha, \beta,

1102: \eta)$. Note that the usual gamma distribution is a special case

1103: with $\eta = 1$.

1104:

1105: If $z \sim GG(\alpha, \beta, \eta)$, then the first

1106: three moments are

1107: \begin{align}

1108: \text{E}(z) = \alpha\beta, \hspace{0.2in} \text{Var}(z) =

1109: \alpha\beta^2, \hspace{0.2in} \text{E}\left(z - \text{E}(z)\right)^3 =

1110: \alpha\beta^3(1+\eta).

1111: \end{align}

1112:

1113: We can approximate the distribution of $\hat{d}_{MLE,c}$ by matching the

1114: first three moments, i.e.,

1115: \begin{align}

1116: \alpha\beta = d, \hspace{0.2in} \alpha\beta^2 = \frac{2d^2}{k} +

1117: \frac{3d^2}{k^2}, \hspace{0.2in} \alpha\beta^3(1+\eta) =

1118: \frac{12d^3}{k^2},

1119: \end{align}

1120: \noindent from which we obtain

1121: \begin{align}

1122: \alpha = \frac{1}{\frac{2}{k} + \frac{3}{k^2}}, \hspace{0.2in} \beta

1123: = \frac{2d}{k} + \frac{3d}{k^2}, \hspace{0.2in} \eta = 2 +

1124: O\left(\frac{1}{k}\right). \label{eqn_ig_parameters}

1125: \end{align}

1126: Taking only the leading term for $\eta$, the generalized gamma

1127: approximation of $\hat{d}_{MLE,c}$ would be

1128: \begin{align}

1129: GG\left(\frac{1}{\frac{2}{k} + \frac{3}{k^2}}, \frac{2d}{k} +

1130:   \frac{3d}{k^2}, 2\right). \label{eqn_ig}

1131: \end{align}

1132:
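\noindent As a short check of the value of $\eta$ in (\ref{eqn_ig_parameters}), dividing the third matching condition by $\alpha\beta\cdot\beta^2 = d\beta^2$ gives
\begin{align*}
1+\eta = \frac{12d^3/k^2}{d\left(\frac{2d}{k}+\frac{3d}{k^2}\right)^2}
= \frac{12}{\left(2+\frac{3}{k}\right)^2}
= 3 - \frac{9}{k} + O\left(\frac{1}{k^2}\right),
\end{align*}
so that $\eta = 2 - \frac{9}{k} + O\left(\frac{1}{k^2}\right)$, whose leading term is $2$.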

1133: In general, a generalized gamma distribution does not have a closed-form

1134: density function although it always has a closed-form moment generating

1135: function.  In our case, (\ref{eqn_ig}) is actually an

1136: inverse Gaussian distribution, which has a closed-form density

1137: function. Assuming $\hat{d}_{MLE,c} \sim IG(\alpha, \beta)$,

1138: with parameters $\alpha$ and

1139: $\beta$ defined in (\ref{eqn_ig_parameters}), the moment

1140: generating function (MGF), the probability density

1141: function (PDF), and the cumulative distribution function (CDF) would

1142: be \citep[Chapter 2]{Book:Seshadri_93} \citep{Article:Tweedie_57I,Article:Tweedie_57II}\footnote{The inverse Gaussian distribution was first noted as the

1143:   distribution of the first passage time of the Brownian motion with a

1144:   positive drift. It has many interesting properties such as

1145:   infinite divisibility. Two monographs

1146:    \citep{Book:Chhikara_89,Book:Seshadri_93} are devoted entirely to the

1147:   inverse Gaussian distributions. For a quick reference, one can check

1148: {\it http://mathworld.wolfram.com/InverseGaussianDistribution.html}.}

1149: \begin{align}

1150: &\text{E}\left(\exp(\hat{d}_{MLE,c}t)\right) \overset{\sim}{=}

1151: \exp\left(\alpha\left(1-(1-2\beta t)^{1/2}\right)\right),\\

1152: &\mathbf{Pr}(\hat{d}_{MLE,c} = y)\overset{\sim}{=} \frac{\alpha \sqrt{\beta}}{\sqrt{2\pi}}

1153: y^{-\frac{3}{2}} \exp\left(-\frac{\left(y/\beta -

1154:       \alpha\right)^2}{2y/\beta}\right) = \sqrt{\frac{\alpha d}{2\pi}}y^{-\frac{3}{2}} \exp\left(-\frac{\left(y-d\right)^2}{2y\beta}\right),\\ \notag

1155: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \leq y\right) \overset{\sim}{=}

1156: \Phi\left(\sqrt{\frac{\alpha^2\beta}{y}}\left(\frac{y}{\alpha\beta} -1

1157:     \right)\right) + e^{2\alpha}

1158:   \Phi\left(-\sqrt{\frac{\alpha^2\beta}{y}}\left(\frac{y}{\alpha\beta}

1159:       +1

1160:     \right)\right)\\

1161: &\hspace{1.1in}=

1162: \Phi\left(\sqrt{\frac{\alpha d}{y}}\left(\frac{y}{d} -1

1163:     \right)\right) + e^{2\alpha}

1164:   \Phi\left(-\sqrt{\frac{\alpha d}{y}}\left(\frac{y}{d}

1165:       +1

1166:     \right)\right),

1167: \end{align}

1168: \noindent where $\Phi(.)$ is the standard normal CDF, i.e., $\Phi(z) =

1169: \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt$. Here we

1170: use $\overset{\sim}{=}$ to indicate that these equalities are based on

1171: an approximate distribution.

1172:

1173:

1174: Assuming $\hat{d}_{MLE,c} \sim

1175: IG(\alpha,\beta)$, then the fourth central moment should be

1176: \begin{align}\notag

1177: \text{E}\left(\hat{d}_{MLE,c} - \text{E}\left(\hat{d}_{MLE,c}\right)\right)^4 &\overset{\sim}{=}

1178: 15\alpha\beta^4+ 3\left(\alpha\beta^2\right)^2 \\\notag

1179: &=15d\left(\frac{2d}{k}+\frac{3d}{k^2}\right)^3 +

1180: 3\left(\frac{2d^2}{k}+\frac{3d^2}{k^2}\right)^2 \\

1181: &=\frac{12d^4}{k^2} + \frac{156d^4}{k^3} +

1182: O\left(\frac{1}{k^4}\right).

1183: \end{align}

1184:

1185: Lemma \ref{lem_mle_asymp} has shown the true asymptotic fourth central

1186: moment:

1187: \begin{align}

1188: \text{E}\left(\hat{d}_{MLE,c} -

1189:   \text{E}\left(\hat{d}_{MLE,c}\right)\right)^4 =\frac{12d^4}{k^2} + \frac{186d^4}{k^3} +

1190: O\left(\frac{1}{k^4}\right).

1191: \end{align}

1192: \noindent That is, the inverse Gaussian approximation matches the

1193: leading term, $\frac{12d^4}{k^2}$, exactly and nearly matches the higher-order

1194: term, $\frac{186d^4}{k^3}$ (versus $\frac{156d^4}{k^3}$), of the true asymptotic fourth moment of

1195:  $\hat{d}_{MLE,c}$.

1196:

1197: Assuming $\hat{d}_{MLE,c} \sim IG(\alpha,\beta)$, the tail probability

1198: of $\hat{d}_{MLE,c}$ can be expressed  as

1199: \begin{align}

1200: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \geq (1+\epsilon)d\right) \overset{\sim}{=}

1201: \Phi\left(-\epsilon \sqrt{\frac{\alpha}{1+\epsilon}}\right) -

1202: e^{2\alpha} \Phi\left(-(2+\epsilon)\sqrt{\frac{\alpha}{1+\epsilon}}\right),

1203: \hspace{0.1in} \epsilon \geq 0 \\

1204: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \leq (1-\epsilon)d\right)  \overset{\sim}{=} \Phi\left(-\epsilon \sqrt{\frac{\alpha}{1-\epsilon}}\right) +

1205: e^{2\alpha} \Phi\left(-(2-\epsilon)\sqrt{\frac{\alpha}{1-\epsilon}}\right),

1206: \hspace{0.1in}   0\leq \epsilon < 1.

1207: \end{align}

1208:

1209:

1210: Assuming  $\hat{d}_{MLE,c} \sim IG(\alpha,\beta)$, it is easy to show

1211: the following  Chernoff bounds:

1212: \begin{align}\label{eqn_ig_left}

1213: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \geq (1+\epsilon)d\right) \overset{\sim}{\leq}

1214: \exp\left(-\frac{\alpha \epsilon^2}{2(1+\epsilon)}\right),  \hspace{0.2in} \epsilon \geq 0 \\

1215: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \leq (1-\epsilon)d\right) \overset{\sim}{\leq}

1216: \exp\left(-\frac{\alpha \epsilon^2}{2(1-\epsilon)}\right),

1217: \hspace{0.2in}   0\leq \epsilon < 1. \label{eqn_ig_right}

1218: \end{align}

1219:

1220: To see (\ref{eqn_ig_left}), assume $z \sim IG(\alpha,\beta)$. Then,

1221: using the Chernoff inequality:

1222: \begin{align}\notag

1223: \mathbf{Pr}\left(z \geq (1+\epsilon)d\right) \leq&

1224: \text{E}\left(\exp(zt)\right)\exp(-(1+\epsilon)dt)\\\notag

1225: =&\exp\left(\alpha\left(1-(1-2\beta t)^{1/2}\right)-(1+\epsilon)dt\right),

1226: \end{align}

1227: whose minimum is $\exp\left(-\frac{\alpha

1228:     \epsilon^2}{2(1+\epsilon)}\right)$, attained at $t =

1229: \left(1-\frac{1}{(1+\epsilon)^2}\right)\frac{1}{2\beta}$. We can

1230: similarly show (\ref{eqn_ig_right}). \\

1231:
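\noindent For completeness, the left tail bound (\ref{eqn_ig_right}) can be sketched along the same lines. Assume $z \sim IG(\alpha,\beta)$ with $\alpha\beta = d$. For $t>0$,
\begin{align*}
\mathbf{Pr}\left(z \leq (1-\epsilon)d\right) \leq
\text{E}\left(\exp(-zt)\right)\exp((1-\epsilon)dt)
=\exp\left(\alpha\left(1-(1+2\beta t)^{1/2}\right)+(1-\epsilon)dt\right).
\end{align*}
The exponent is minimized at $(1+2\beta t)^{1/2} = \frac{1}{1-\epsilon}$, i.e., $t = \left(\frac{1}{(1-\epsilon)^2}-1\right)\frac{1}{2\beta}$, where it equals
\begin{align*}
-\frac{\alpha\epsilon}{1-\epsilon} + \frac{\alpha\left(2\epsilon-\epsilon^2\right)}{2(1-\epsilon)} = -\frac{\alpha\epsilon^2}{2(1-\epsilon)}.
\end{align*}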

1232: Combining (\ref{eqn_ig_left}) and (\ref{eqn_ig_right}) yields a

1233: symmetric bound

1234: \begin{align}

1235: &\mathbf{Pr}\left(|\hat{d}_{MLE,c} - d| \geq \epsilon d\right) \overset{\sim}{\leq}

1236: 2\exp\left(-\frac{\epsilon^2/(1+\epsilon)}{2 \left(\frac{2}{k} + \frac{3}{k^2}\right)}\right),

1237: \hspace{0.15in} 0\leq \epsilon \leq 1.

1238: \end{align}
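
The combination step can be made explicit. As a sketch, assuming $\alpha = 1/\left(\frac{2}{k}+\frac{3}{k^2}\right)$ (the value implied by comparing the symmetric bound with (\ref{eqn_ig_left}); it is not restated in this section), for $0\leq\epsilon<1$ we have
\begin{align*}
\exp\left(-\frac{\alpha \epsilon^2}{2(1-\epsilon)}\right) \leq
\exp\left(-\frac{\alpha \epsilon^2}{2(1+\epsilon)}\right),
\hspace{0.2in} \text{since } \frac{1}{1-\epsilon} \geq \frac{1}{1+\epsilon},
\end{align*}
so the sum of the two one-sided bounds is at most
$2\exp\left(-\frac{\alpha \epsilon^2}{2(1+\epsilon)}\right)$, which is the symmetric bound after substituting $\alpha$.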

1239:

1240:

1241: Figure \ref{fig_ig_tail} compares the inverse Gaussian approximation with the same

1242: simulations as presented in Figure \ref{fig_gamma_tail}, indicating

1243: that the inverse Gaussian approximation is highly

1244: accurate. When the tail probability is $\geq 10^{-4} \sim 10^{-6}$, we can treat the

1245: inverse Gaussian as the exact distribution of $\hat{d}_{MLE,c}$.  The Chernoff upper bounds for the inverse Gaussian

1246: are always reliable in our simulation range (tail probabilities

1247: $\geq 10^{-10}$).

1248:

1249: \begin{figure}[h]

1250: \begin{center}\mbox{

1251: \subfigure[]{\includegraphics[width = 2.8in]{fig/ig10.eps}}

1252: \subfigure[]{\includegraphics[width = 2.8in]{fig/igbound10.eps}}}

1253: \end{center}\vspace{-0.4in}

1254: \caption{We compare the inverse Gaussian approximation

1255:   with the same simulations as presented in Figure

1256:   \ref{fig_gamma_tail}. Panel (a) compares the empirical tail

1257:   probabilities with the inverse Gaussian tail probabilities,

1258:   indicating that the approximation is highly accurate.

1259:   Panel (b) compares the empirical tail probabilities with the inverse

1260:   Gaussian upper bound (\ref{eqn_ig_left})+(\ref{eqn_ig_right}). The upper bounds are all

1261: above the corresponding empirical curves, indicating that our proposed bounds are

1262: reliable at least in our simulation range.  }\label{fig_ig_tail}

1263: \end{figure}

1264:

1265:

1266:

1267: \section{Conclusion}\label{sec_conclusion}

1268:

1269: It is well known that the $l_1$ distance is far more robust than the

1270: $l_2$ distance against ``outliers.'' There are

1271: numerous  success stories of using the $l_1$ distance, e.g.,

1272:   Lasso \citep{Article:Tibshirani_96}, LARS \citep{Article:Efron_LARS04}, 1-norm

1273:   SVM \citep{Proc:Zhu_NIPS03}, and Laplacian radial basis kernel

1274:   \citep{Article:Chapelle_99,Proc:Ferecatu_MIR04}.

1275:

1276: Dimension reduction in the $l_1$ norm, however, has been proved

1277: {\em impossible} if we use {\em linear random projections} and {\em

1278:   linear estimators}. In this study, we propose three types of nonlinear

1279: estimators for {\em Cauchy random projections}: the bias-corrected

1280: sample median estimator, the bias-corrected geometric mean estimator,

1281: and the bias-corrected maximum likelihood estimator. Our theoretical

1282: analysis has shown that these nonlinear estimators can accurately

1283: recover the original $l_1$ distance, even though none of them can be a

1284: metric.

1285:

1286: The bias-corrected sample median estimator and the bias-corrected

1287: geometric mean estimator are asymptotically equivalent but the latter

1288: is more accurate at small sample size. We have derived explicit tail

1289: bounds for the bias-corrected geometric mean estimator and have expressed

1290: the tail bounds in exponential forms. Using these tail bounds, we have

1291: established an analog of the

1292: Johnson-Lindenstrauss  (JL) lemma for dimension reduction in $l_1$, which is weaker than the classical JL lemma for dimension reduction in

1293: $l_2$.

1294:

1295: We conduct theoretical analysis of the bias-corrected maximum

1296: likelihood estimator (MLE), which is ``asymptotically optimum.'' Both

1297: the sample median estimator and the geometric mean estimator are about

1298: $80\%$ as efficient as the MLE. We propose

1299: approximating the distribution of the bias-corrected MLE by an inverse Gaussian, which has the

1300: same support and matches the leading terms of the first four moments of

1301: the proposed estimator. Approximate tail bounds have been provided based

1302: on the inverse Gaussian approximation. Verified by simulations, these

1303: approximate tail bounds hold at least in the $\geq

1304: 10^{-10}$ tail probability range.

1305:

1306: Although these nonlinear estimators are not metrics, they are still

1307: useful for certain applications in, e.g., data stream computation,

1308: information retrieval, learning and data mining, whenever the goal is

1309: to compute the $l_1$ distances efficiently using a small storage space.

1310:

1311:

1312: The geometric mean estimator is a non-convex

1313: norm (i.e., the $l_p$ norm as $p\rightarrow 0$); and therefore it does

1314: contain some information about the geometry.  It may still be possible

1315: to develop certain efficient algorithms using the geometric mean estimator by

1316: avoiding the non-convexity.  We leave this for future

1317: work. \\

1318:

1319:

1320:

1321: \section*{Acknowledgment}

1322:

1323: We are grateful to Piotr Indyk and Assaf Naor for the very constructive

1324: comments on various versions of this manuscript. We thank Dimitris

1325: Achlioptas,

1326: Christopher Burges, Moses Charikar, Jerome Friedman, Tze L. Lai, Art

1327: B. Owen, John Platt,  Joseph Romano, Tim

1328: Roughgarden, Yiyuan She,  and  Guenther Walther

1329: for helpful conversations or suggesting relevant references. We also thank Silvia Ferrari

1330: and Gauss Cordeiro for clarifying some parts of their papers.

1331:

1332: Trevor Hastie was partially supported by grant DMS-0505676 from the National

1333: Science Foundation, and grant 2R01 CA 72028-07 from the National

1334: Institutes of

1335: Health.

1336:

1337: %\bibliographystyle{plain}

1338: {\small

1339:

1340: \begin{thebibliography}{59}

1341: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi

1342: \expandafter\ifx\csname url\endcsname\relax

1343:   \def\url#1{{\tt #1}}\fi

1344:

1345: \bibitem[Achlioptas(2003)]{Article:Achlioptas_JCSS03}

1346: Dimitris Achlioptas.

1347: \newblock Database-friendly random projections: \text{Johnson-Lindenstrauss}

1348:   with binary coins.

1349: \newblock {\em Journal of Computer and System Sciences}, 66\penalty0

1350:   (4):\penalty0 671--687, 2003.

1351:

1352: \bibitem[Aggarwal and Wolf(1999)]{Proc:Aggarwal_Wolf_Sigmod99}

1353: Charu~C. Aggarwal and Joel~L. Wolf.

1354: \newblock A new method for similarity indexing of market basket data.

1355: \newblock In {\em Proc. of SIGMOD}, pages 407--418, Philadelphia, PA, 1999.

1356:

1357: \bibitem[Ailon and Chazelle(2006)]{Proc:Ailon_STOC06}

1358: Nir Ailon and Bernard Chazelle.

1359: \newblock Approximate nearest neighbors and the fast \text{Johnson-Lindenstrauss}

1360:   transform.

1361: \newblock In {\em Proc. of STOC}, pages 557--563, Seattle, WA, 2006.

1362:

1363: \bibitem[Antle and Bain(1969)]{Article:Antle_69}

1364: Charles Antle and Lee Bain.

1365: \newblock A property of maximum likelihood estimators of location and scale

1366:   parameters.

1367: \newblock {\em SIAM Review}, 11\penalty0 (2):\penalty0 251--253, 1969.

1368:

1369: \bibitem[Arriaga and Vempala(1999)]{Proc:Arriaga_FOCS99}

1370: Rosa Arriaga and Santosh Vempala.

1371: \newblock An algorithmic theory of learning: Robust concepts and random

1372:   projection.

1373: \newblock In {\em Proc. of FOCS}, pages 616--623, New York, 1999.

1374:

1375: \bibitem[Arriaga and Vempala(2006)]{Article:Proc:Arriaga_Vempala_ML06}

1376: Rosa Arriaga and Santosh Vempala.

1377: \newblock An algorithmic theory of learning: Robust concepts and random

1378:   projection.

1379: \newblock {\em Machine Learning}, 63\penalty0 (2):\penalty0 161--182, 2006.

1380:

1381: \bibitem[Barnett(1966)]{Article:Barnett_66}

1382: V.~D. Barnett.

1383: \newblock Evaluation of the maximum-likelihood estimator where the likelihood

1384:   equation has multiple roots.

1385: \newblock {\em Biometrika}, 53\penalty0 (1/2):\penalty0 151--165, 1966.

1386:

1387: \bibitem[Bartlett(1953)]{Article:Bartlett_53}

1388: M.~S. Bartlett.

1389: \newblock Approximate confidence intervals, \text{II}.

1390: \newblock {\em Biometrika}, 40\penalty0 (3/4):\penalty0 306--317, 1953.

1391:

1392: \bibitem[Bhattacharya and Ghosh(1978)]{Article:Bhattacharya_78}

1393: R.~N. Bhattacharya and J.~K. Ghosh.

1394: \newblock On the validity of the formal \text{Edgeworth} expansion.

1395: \newblock {\em The Annals of Statistics}, 6\penalty0 (2):\penalty0 434--451,

1396:   1978.

1397:

1398: \bibitem[Brinkman and Charikar(2003)]{Proc:Brinkman_FOCS03}

1399: Bo~Brinkman and Moses Charikar.

1400: \newblock On the impossibility of dimension reduction in $l_1$.

1401: \newblock In {\em Proc. of FOCS}, pages 514--523, Cambridge, MA, 2003.

1402:

1403: \bibitem[Brinkman and Charikar(2005)]{Article:Brinkman_JACM05}

1404: Bo~Brinkman and Moses Charikar.

1405: \newblock On the impossibility of dimension reduction in $l_1$.

1406: \newblock {\em Journal of the ACM}, 52\penalty0 (2):\penalty0 766--788, 2005.

1407:

1408: \bibitem[Chapelle et~al.(1999)Chapelle, Haffner, and

1409:   Vapnik]{Article:Chapelle_99}

1410: Olivier Chapelle, Patrick Haffner, and Vladimir~N. Vapnik.

1411: \newblock Support vector machines for histogram-based image classification.

1412: \newblock {\em {IEEE} Trans. Neural Networks}, 10\penalty0 (5):\penalty0

1413:   1055--1064, 1999.

1414:

1415: \bibitem[Chernoff(1952)]{Article:Chernoff_52}

1416: Herman Chernoff.

1417: \newblock A measure of asymptotic efficiency for tests of a hypothesis based on

1418:   the sum of observations.

1419: \newblock {\em The Annals of Mathematical Statistics}, 23\penalty0

1420:   (4):\penalty0 493--507, 1952.

1421:

1422: \bibitem[Chhikara and Folks(1989)]{Book:Chhikara_89}

1423: Raj~S. Chhikara and J.~Leroy Folks.

1424: \newblock {\em The Inverse Gaussian Distribution: Theory, Methodology, and

1425:   Applications}.

1426: \newblock Marcel Dekker, Inc, New York, 1989.

1427:

1428: \bibitem[Cysneiros et~al.(2001)Cysneiros, dos Santos, and

1429:   Cordeiro]{Article:Cysneiros_01}

1430: Francisco Jose De.~A. Cysneiros, Sylvio Jose~P. dos Santos, and Gauss~M.

1431:   Cordeiro.

1432: \newblock Skewness and kurtosis for maximum likelihood estimator in

1433:   one-parameter exponential family models.

1434: \newblock {\em Brazilian Journal of Probability and Statistics}, 15\penalty0

1435:   (1):\penalty0 85--105, 2001.

1436:

1437: \bibitem[Dasgupta and Gupta(2003)]{Article:Dasgupta_JL}

1438: Sanjoy Dasgupta and Anupam Gupta.

1439: \newblock An elementary proof of a theorem of \text{Johnson and Lindenstrauss}.

1440: \newblock {\em Random Structures and Algorithms}, 22\penalty0 (1):\penalty0

1441:   60--65, 2003.

1442:

1443: \bibitem[Dhillon and Modha(2001)]{Article:Dhillon_ML01}

1444: Inderjit~S. Dhillon and Dharmendra~S. Modha.

1445: \newblock Concept decompositions for large sparse text data using clustering.

1446: \newblock {\em Machine Learning}, 42\penalty0 (1-2):\penalty0 143--175, 2001.

1447:

1448: \bibitem[Efron et~al.(2004)Efron, Hastie, Johnstone, and

1449:   Tibshirani]{Article:Efron_LARS04}

1450: Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani.

1451: \newblock Least angle regression.

1452: \newblock {\em The Annals of Statistics}, 32\penalty0 (2):\penalty0 407--499,

1453:   2004.

1454:

1455: \bibitem[Fama and Roll(1968)]{Article:Fama_68}

1456: Eugene~F. Fama and Richard Roll.

1457: \newblock Some properties of symmetric stable distributions.

1458: \newblock {\em Journal of the American Statistical Association}, 63\penalty0

1459:   (323):\penalty0 817--836, 1968.

1460:

1461: \bibitem[Fama and Roll(1971)]{Article:Fama_71}

1462: Eugene~F. Fama and Richard Roll.

1463: \newblock Parameter estimates for symmetric stable distributions.

1464: \newblock {\em Journal of the American Statistical Association}, 66\penalty0

1465:   (334):\penalty0 331--338, 1971.

1466:

1467: \bibitem[Feller(1971)]{Book:Feller_II}

1468: William Feller.

1469: \newblock {\em An Introduction to Probability Theory and Its Applications

1470:   (Volume \text{II})}.

1471: \newblock John Wiley \& Sons, New York, NY, second edition, 1971.

1472:

1473: \bibitem[Ferecatu et~al.(2004)Ferecatu, Crucianu, and

1474:   Boujemaa]{Proc:Ferecatu_MIR04}

1475: Marin Ferecatu, Michel Crucianu, and Nozha Boujemaa.

1476: \newblock Retrieval of difficult image classes using SVD-based relevance

1477:   feedback.

1478: \newblock In {\em Proc. of Multimedia Information Retrieval}, pages 23--30, New

1479:   York, NY, 2004.

1480:

1481: \bibitem[Ferrari et~al.(1996)Ferrari, Botter, Cordeiro, and

1482:   Cribari-Neto]{Article:Ferrari_96}

1483: Silvia L.~P. Ferrari, Denise~A. Botter, Gauss~M. Cordeiro, and Francisco

1484:   Cribari-Neto.

1485: \newblock Second and third order bias reduction for one-parameter family

1486:   models.

1487: \newblock {\em Stat. and Prob. Letters}, 30:\penalty0 339--345, 1996.

1488:

1489: \bibitem[Fisher(1934)]{Article::Fisher_34}

1490: R.~A. Fisher.

1491: \newblock Two new properties of mathematical likelihood.

1492: \newblock {\em Proceedings of the Royal Society of London}, 144\penalty0

1493:   (852):\penalty0 285--307, 1934.

1494:

1495: \bibitem[Frankl and Maehara(1987)]{Article:Frankl_JL}

1496: P.~Frankl and H.~Maehara.

1497: \newblock The \text{Johnson-Lindenstrauss} lemma and the sphericity of some

1498:   graphs.

1499: \newblock {\em Journal of Combinatorial Theory A}, 44\penalty0 (3):\penalty0

1500:   355--362, 1987.

1501:

1502: \bibitem[Gerber(1991)]{Article:Gerber}

1503: Hans~U. Gerber.

1504: \newblock From the generalized gamma to the generalized negative binomial

1505:   distribution.

1506: \newblock {\em Insurance: Mathematics and Economics}, 10\penalty0 (4):\penalty0

1507:   303--309, 1991.

1508:

1509: \bibitem[Gradshteyn and Ryzhik(1994)]{Book:Gradshteyn_94}

1510: I.~S. Gradshteyn and I.~M. Ryzhik.

1511: \newblock {\em Table of Integrals, Series, and Products}.

1512: \newblock Academic Press, New York, fifth edition, 1994.

1513:

1514: \bibitem[Haas et~al.(1970)Haas, Bain, and Antle]{Article:Haas_70}

1515: Gerald Haas, Lee Bain, and Charles Antle.

1516: \newblock Inferences for the Cauchy distribution based on maximum likelihood

1517:   estimation.

1518: \newblock {\em Biometrika}, 57\penalty0 (2):\penalty0 403--408, 1970.

1519:

1520: \bibitem[Hinkley(1978)]{Article:Hinkley_78}

1521: David~V. Hinkley.

1522: \newblock Likelihood inference about location and scale parameters.

1523: \newblock {\em Biometrika}, 65\penalty0 (2):\penalty0 253--261, 1978.

1524:

1525: \bibitem[Hougaard(1986)]{Article:Hougaard_86}

1526: P.~Hougaard.

1527: \newblock Survival models for heterogeneous populations derived from stable

1528:   distributions.

1529: \newblock {\em Biometrika}, 73\penalty0 (2):\penalty0 387--396, 1986.

1530:

1531: \bibitem[Indyk(2000)]{Proc:Indyk_FOCS00}

1532: Piotr Indyk.

1533: \newblock Stable distributions, pseudorandom generators, embeddings and data

1534:   stream computation.

1535: \newblock In {\em Proc. of FOCS}, pages 189--197, Redondo Beach, CA, 2000.

1536:

1537: \bibitem[Indyk(2001)]{Proc:Indyk_FOCS01}

1538: Piotr Indyk.

1539: \newblock Algorithmic applications of low-distortion geometric embeddings.

1540: \newblock In {\em Proc. of FOCS}, pages 10--33, Las Vegas, NV, 2001.

1541:

1542: \bibitem[Indyk and Motwani(1998)]{Proc:Indyk_STOC98}

1543: Piotr Indyk and Rajeev Motwani.

1544: \newblock Approximate nearest neighbors: Towards removing the curse of

1545:   dimensionality.

1546: \newblock In {\em Proc. of STOC}, pages 604--613, Dallas, TX, 1998.

1547:

1548: \bibitem[Indyk and Naor(2006)]{Article:Indyk_Naor}

1549: Piotr Indyk and Assaf Naor.

1550: \newblock Nearest neighbor preserving embeddings.

1551: \newblock {\em ACM Transactions on Algorithms (to appear)}, 2006.

1552:

1553: \bibitem[Jensen(1995)]{Book:Jensen_95}

1554: Jens~Ledet Jensen.

1555: \newblock {\em Saddlepoint approximations}.

1556: \newblock Oxford University Press, New York, 1995.

1557:

1558: \bibitem[Johnson and Lindenstrauss(1984)]{Article:JL84}

1559: W.~B. Johnson and J.~Lindenstrauss.

1560: \newblock Extensions of \text{Lipschitz} mapping into \text{Hilbert} space.

1561: \newblock {\em Contemporary Mathematics}, 26:\penalty0 189--206, 1984.

1562:

1563: \bibitem[Lawless(1972)]{Article:Lawless_72}

1564: J.~F. Lawless.

1565: \newblock Conditional confidence interval procedures for the location and scale

1566:   parameters of the Cauchy and logistic distributions.

1567: \newblock {\em Biometrika}, 59\penalty0 (2):\penalty0 377--386, 1972.

1568:

1569: \bibitem[Lee and Naor(2004)]{Article:Lee_Naor_04}

1570: James~R. Lee and Assaf Naor.

1571: \newblock Embedding the diamond graph in $l_p$ and dimension reduction in

1572:   $l_1$.

1573: \newblock {\em Geometric and Functional Analysis}, 14\penalty0 (4):\penalty0

1574:   745--747, 2004.

1575:

1576: \bibitem[Li and Church(2005)]{Report:Li_Church_Sketch}

1577: Ping Li and Kenneth~W. Church.

1578: \newblock Using sketches to estimate two-way and multi-way associations.

1579: \newblock Technical Report TR-2005-115, Microsoft Research, (A shorter version

1580:   is available at

1581:   www.stanford.edu/$^\sim$pingli98/publications/Report\_Sketch.pdf), Redmond,

1582:   WA, September 2005.

1583:

1584: \bibitem[Li et~al.(2006{\natexlab{a}})Li, Church, and

1585:   Hastie]{Report:Li_Church_Hastie_crs}

1586: Ping Li, Kenneth~W. Church, and Trevor~J. Hastie.

1587: \newblock Conditional random sampling: A sketch-based sampling technique for

1588:   sparse data.

1589: \newblock Technical report, Department of Statistics, Stanford University

1590:   (\url{www.stanford.edu/~pingli98/publications/CRS_tr.pdf}),

1591:   2006{\natexlab{a}}.

1592:

1593: \bibitem[Li et~al.(2006{\natexlab{b}})Li, Hastie, and

1594:   Church]{Proc:Li_Hastie_Church_COLT06}

1595: Ping Li, Trevor~J. Hastie, and Kenneth~W. Church.

1596: \newblock Improving random projections using marginal information.

1597: \newblock In {\em Proc. of COLT}, Pittsburgh, PA, 2006{\natexlab{b}}.

1598:

1599: \bibitem[Li et~al.(2006{\natexlab{c}})Li, Hastie, and

1600:   Church]{Report:Li_Hastie_Church_subrp}

1601: Ping Li, Trevor~J. Hastie, and Kenneth~W. Church.

1602: \newblock Sub-Gaussian random projections.

1603: \newblock Technical report, Department of Statistics, Stanford University

1604:   (\url{www.stanford.edu/~pingli98/report/subg_rp.pdf}), 2006{\natexlab{c}}.

1605:

1606: \bibitem[Li et~al.(2006{\natexlab{d}})Li, Hastie, and

1607:   Church]{Proc:Li_Hastie_Church_KDD06}

1608: Ping Li, Trevor~J. Hastie, and Kenneth~W. Church.

1609: \newblock Very sparse random projections.

1610: \newblock In {\em Proc. of KDD}, Philadelphia, PA, 2006{\natexlab{d}}.

1611:

1612: \bibitem[Li et~al.(2006{\natexlab{e}})Li, Paul, Narasimhan, and

1613:   Cioffi]{Article:Li_SINR06}

1614: Ping Li, Debashis Paul, Ravi Narasimhan, and John Cioffi.

1615: \newblock On the distribution of \text{SINR} for the \text{MMSE MIMO} receiver

1616:   and performance analysis.

1617: \newblock {\em {IEEE} Trans. Inform. Theory}, 52\penalty0 (1):\penalty0

1618:   271--286, 2006{\natexlab{e}}.

1619:

1620: \bibitem[Lugosi(2004)]{Article:Lugosi_04}

1621: Gabor Lugosi.

1622: \newblock Concentration-of-measure inequalities.

1623: \newblock {\em Lecture Notes}, 2004.

1624:

1625: \bibitem[McCulloch(1986)]{Article:McCulloch_86}

1626: J.~Huston McCulloch.

1627: \newblock Simple consistent estimators of stable distribution parameters.

1628: \newblock {\em Communications in Statistics -- Simulation and Computation}, 15\penalty0

1629:   (4):\penalty0 1109--1136, 1986.

1630:

1631: \bibitem[Philips and Nelson(1995)]{Article:Philips_95}

1632: Thomas~K. Philips and Randolph Nelson.

1633: \newblock The moment bound is tighter than Chernoff's bound for positive tail

1634:   probabilities.

1635: \newblock {\em The American Statistician}, 49\penalty0 (2):\penalty0 175--178,

1636:   1995.

1637:

1638: \bibitem[Seshadri(1993)]{Book:Seshadri_93}

1639: V.~Seshadri.

1640: \newblock {\em The Inverse Gaussian Distribution: A Case Study in Exponential

1641:   Families}.

1642: \newblock Oxford University Press Inc., New York, 1993.

1643:

1644: \bibitem[Severini(2000)]{Book:Severini_00}

1645: Thomas~A. Severini.

1646: \newblock {\em Likelihood Methods in Statistics}.

1647: \newblock Oxford University Press, New York, 2000.

1648:

1649: \bibitem[Shakhnarovich et~al.(2005)Shakhnarovich, Darrell, and

1650:   Indyk]{Book:NN_05}

1651: Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk, editors.

1652: \newblock {\em Nearest-Neighbor Methods in Learning and Vision, Theory and

1653:   Practice}.

1654: \newblock The MIT Press, Cambridge, MA, 2005.

1655:

1656: \bibitem[Shao(2003)]{Book:Shao}

1657: Jun Shao.

1658: \newblock {\em Mathematical Statistics}.

1659: \newblock Springer, New York, NY, second edition, 2003.

1660:

1661: \bibitem[Shenton and Bowman(1963)]{Article:Shenton_63}

1662: L.~R. Shenton and K.~Bowman.

1663: \newblock Higher moments of a maximum-likelihood estimate.

1664: \newblock {\em Journal of Royal Statistical Society \text{B}}, 25\penalty0

1665:   (2):\penalty0 305--317, 1963.

1666:

1667: \bibitem[Strehl and Ghosh(2000)]{Proc:Strehl_HiPC00}

1668: Alexander Strehl and Joydeep Ghosh.

1669: \newblock A scalable approach to balanced, high-dimensional clustering of

1670:   market-baskets.

1671: \newblock In {\em Proc. of HiPC}, pages 525--536, Bangalore, India, 2000.

1672:

1673: \bibitem[Tibshirani(1996)]{Article:Tibshirani_96}

1674: Robert Tibshirani.

1675: \newblock Regression shrinkage and selection via the lasso.

1676: \newblock {\em Journal of Royal Statistical Society \text{B}}, 58\penalty0

1677:   (1):\penalty0 267--288, 1996.

1678:

1679: \bibitem[Tweedie(1957{\natexlab{a}})]{Article:Tweedie_57I}

1680: M.~C.~K. Tweedie.

1681: \newblock Statistical properties of inverse Gaussian distributions. \text{I}.

1682: \newblock {\em The Annals of Mathematical Statistics}, 28\penalty0

1683:   (2):\penalty0 362--377, 1957{\natexlab{a}}.

1684:

1685: \bibitem[Tweedie(1957{\natexlab{b}})]{Article:Tweedie_57II}

1686: M.~C.~K. Tweedie.

1687: \newblock Statistical properties of inverse Gaussian distributions. \text{II}.

1688: \newblock {\em The Annals of Mathematical Statistics}, 28\penalty0

1689:   (3):\penalty0 696--705, 1957{\natexlab{b}}.

1690:

1691: \bibitem[Vempala(2004)]{Book:Vempala}

1692: Santosh Vempala.

1693: \newblock {\em The Random Projection Method}.

1694: \newblock American Mathematical Society, Providence, RI, 2004.

1695:

1696: \bibitem[Zhu et~al.(2003)Zhu, Rosset, Hastie, and Tibshirani]{Proc:Zhu_NIPS03}

1697: Ji~Zhu, Saharon Rosset, Trevor Hastie, and Robert Tibshirani.

1698: \newblock 1-norm support vector machines.

1699: \newblock In {\em NIPS}, 2003.

1700:

1701: \bibitem[Zolotarev(1986)]{Book:Zolotarev_86}

1702: V.~M. Zolotarev.

1703: \newblock {\em One-dimensional Stable Distributions}.

1704: \newblock American Mathematical Society, Providence, RI, 1986.

1705:

1706: \end{thebibliography}

1707:

1708: %\bibliography{../bib/IEEEabrv,../bib/mybibfile}

1709: }

1710:

1711: \appendix

1712:

1713:

1714: \section{Proof of Lemma \ref{lem_me}}\label{app_proof_lem_me}

1715:

1716: Assume $x \sim C(0,d)$. The probability density function (PDF) and the

1717: cumulative distribution function (CDF) of $|x|$ are

1718: \begin{align}

1719: &\mathbf{Pr}(|x|=z) = \frac{2d}{\pi}\frac{1}{z^2+d^2}, \hspace{0.2in}

1720: z\geq0 \\

1721: &\mathbf{Pr}(|x|\leq z) = \frac{2}{\pi}\tan^{-1}\frac{z}{d}, \hspace{0.2in}

1722: z\geq0

1723: \end{align}

1724:

1725: The asymptotic normality of $\hat{d}_{me}$ follows from the asymptotic

1726: results on sample quantiles \citep[Theorem

1727: 5.10]{Book:Shao}.

1728: \begin{align}

1729: \sqrt{k}\left(\hat{d}_{me}-d\right) \overset{D}{\Longrightarrow}

1730: N\left(0,

1731:   \frac{1}{2}\left(1-\frac{1}{2}\right)/\left(\left.\mathbf{Pr}(|x|=z)\right|_{z = d}\right)^2\right) = N\left(0,\frac{\pi^2}{4}d^2\right)

1732: \end{align}
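
The constant in the asymptotic variance can be checked directly. As a short worked step, evaluating the density of $|x|$ at the median $z=d$ gives
\begin{align*}
\left.\mathbf{Pr}(|x|=z)\right|_{z = d} = \frac{2d}{\pi}\frac{1}{d^2+d^2} = \frac{1}{\pi d},
\hspace{0.2in} \text{so} \hspace{0.2in}
\frac{\frac{1}{2}\left(1-\frac{1}{2}\right)}{\left(\frac{1}{\pi d}\right)^2} = \frac{\pi^2}{4}d^2.
\end{align*}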

1733:

1734: The probability density of $\hat{d}_{me}$ can be derived from

1735: the probability density of order statistics \citep[Example

1736: 2.9]{Book:Shao}. For simplicity, we only consider $k = 2m+1$, $m = 1,

1737: 2, \ldots$:

1738: \begin{align}\notag

1739: \mathbf{Pr}(\hat{d}_{me}=z) &=

1740: \frac{(2m+1)!}{(m!)^2}\left(\mathbf{Pr}(|x|\leq

1741:   z)\right)^m\left(1-\mathbf{Pr}(|x|\leq z)\right)^m \mathbf{Pr}(|x|=

1742: z) \\

1743: &=\frac{(2m+1)!}{(m!)^2}\left(\frac{2}{\pi}\tan^{-1}\frac{z}{d}\right)^m\left(1-\frac{2}{\pi}\tan^{-1}\frac{z}{d}\right)^m\frac{2d}{\pi}\frac{1}{z^2+d^2}.

1744: \end{align}

1745:

1746: The $r^{th}$ moment of $\hat{d}_{me}$ would be

1747: \begin{align}\notag

1748: \text{E}\left(\hat{d}_{me}\right)^r &= \int_0^\infty z^r

1749: \frac{(2m+1)!}{(m!)^2}\left(\frac{2}{\pi}\tan^{-1}\frac{z}{d}\right)^m\left(1-\frac{2}{\pi}\tan^{-1}\frac{z}{d}\right)^m\frac{2d}{\pi}\frac{1}{z^2+d^2}dz

1750: \\

1751: &= d^r\int_0^1\frac{(2m+1)!}{(m!)^2}\tan^r\left(\frac{\pi}{2}t\right)

1752: \left(t-t^2\right)^m dt,

1753: \end{align}

1754: \noindent by substituting $t = \frac{2}{\pi}

1755: \tan^{-1}\frac{z}{d}$.

1756:

1757:

1758: When $t\rightarrow 1-0$, $\tan\left(\frac{\pi}{2}t\right) \rightarrow

1759: \infty$, but $t-t^2 = t(1-t) \rightarrow 0$. Around $t =1-0$,

1760:  $\tan\left(\frac{\pi}{2}t\right) =

1761:  \frac{1}{\tan\left(\frac{\pi}{2}(1-t)\right)} =

1762:  \frac{2}{\pi}\frac{1}{1-t}+...$, by the Taylor expansion. Therefore, in

1763:  order for $\text{E}\left(\hat{d}_{me}\right)^r <\infty$, we must have

1764:  $m \geq r$.
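
To spell out the last step (a sketch, for non-negative integer $r$ and a small $\delta>0$ introduced here for illustration): near $t=1$ the integrand behaves like
\begin{align*}
\tan^r\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m \approx
\left(\frac{2}{\pi}\right)^r (1-t)^{m-r},
\end{align*}
and $\int_{1-\delta}^1 (1-t)^{m-r}dt$ is finite if and only if $m-r>-1$, which for integers $m$ and $r$ is equivalent to $m\geq r$. Near $t=0$ the integrand is bounded, so no further condition arises there.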

1765:

1766: This completes the proof of Lemma \ref{lem_me}.

1767:

1768: \section{Proof of  Lemma \ref{lem_d_log}}\label{app_proof_lem_d_log}

1769:

1770: Assume $x \sim C(0,d)$. The first moment of $\log(|x|)$ would be

1771: \begin{align}\notag

1772: \text{E}\left(\log(|x|)\right) &= \frac{2d}{\pi}\int_0^\infty

1773: \frac{\log(y)}{y^2+d^2}dy \\\notag

1774: &=\frac{1}{\pi}\int_0^\infty\frac{\log(d)y^{-1/2}}{y+1} +

1775: \frac{1/2\log(y)y^{-1/2}}{y+1}dy\\

1776: &= \log(d),

1777: \end{align}

1778: \noindent with the help of the integral tables \cite[3.221.1,

1779: 4.251.1]{Book:Gradshteyn_94}.

1780:

1781: Thus, given i.i.d. samples $x_j \sim C(0,d)$, $j = 1, 2, ..., k$,

1782: a nonlinear estimator of $d$ would be

1783: \begin{align}

1784: \hat{d}_{log} = \exp\left(\frac{1}{k}\sum_{j=1}^k\log(|x_j|)\right).

1785: \end{align}

1786:

1787: We can derive another nonlinear estimator from

1788: $\text{E}\left(|x|^\lambda\right)$, $|\lambda| <1$. Using the integral

1789: tables \cite[3.221.1]{Book:Gradshteyn_94}, we obtain

1790: \begin{align}\notag

1791: \text{E}\left(|x|^\lambda\right) &= \frac{2d}{\pi}\int_0^\infty

1792: \frac{y^\lambda}{y^2+d^2}dy\\ \notag

1793: &=\frac{d^\lambda}{\pi}\int_0^\infty\frac{y^{\frac{\lambda-1}{2}}}{y+1}dy

1794: \\

1795: &=\frac{d^\lambda}{\cos(\lambda\pi/2)},

1796: \end{align}

1797: \noindent from which a  nonlinear estimator follows immediately

1798: \begin{align}

1799: \hat{d}_\lambda = \left(\frac{1}{k}\sum_{j=1}^k|x_j|^\lambda

1800:   \cos(\lambda\pi/2)\right)^{1/\lambda}, \hspace{0.2in} |\lambda| <1

1801: \end{align}

1802:

1803: Both nonlinear estimators $\hat{d}_{log}$ and $\hat{d}_\lambda$ are

1804: biased. The leading terms of their variances can be obtained by the

1805: {\em Delta Method} \citep[Corollary 1.1]{Book:Shao}.

1806:

1807:

1808: With the help of \cite[4.261.10]{Book:Gradshteyn_94}, we obtain

1809: \begin{align}

1810: \text{E}\left(\log^2(|x|)\right) = \log^2(d) + \frac{\pi^2}{4},

1811: \hspace{0.2in} \text{i.e., } \ \ \text{Var}\left(\log(|x|)\right) =  \frac{\pi^2}{4}.

1812: \end{align}

1813: \noindent Thus,

1814: \begin{align}

1815: \text{E}\left(\frac{1}{k}\sum_{j=1}^k\log(|x_j|)\right) = \log d,

1816: \hspace{0.5in} \text{Var}\left(\frac{1}{k}\sum_{j=1}^k\log(|x_j|)\right) = \frac{1}{k}\frac{\pi^2}{4}.

1817: \end{align}

1818:

1819:

1820: By the {\em Delta Method}, the asymptotic variance of

1821: $\hat{d}_{log}$ should be

1822: \begin{align}

1823: \text{Var}\left(\hat{d}_{log}\right) =

1824: \frac{1}{k}\frac{\pi^2}{4}\exp^2\left(\log(d)\right) +

1825: O\left(\frac{1}{k^2}\right) = \frac{\pi^2d^2}{4k} +

1826: O\left(\frac{1}{k^2}\right).

1827: \end{align}

1828:

1829: Similarly, the asymptotic variance of $\hat{d}_\lambda$ is

1830: \begin{align}

1831: \text{Var}\left(\hat{d}_{\lambda}\right) = \frac{d^2}{k}

1832: \frac{\sin^2(\lambda \pi/2)}{\lambda^2 \cos(\lambda\pi)} +

1833: O\left(\frac{1}{k^2}\right), \hspace{0.2in} |\lambda| <1/2

1834: \end{align}
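
The delta-method step behind this variance can be sketched as follows, where $g(u) = u^{1/\lambda}$ is an auxiliary notation introduced here. Since $\text{E}\left(|x|^{2\lambda}\right) = \frac{d^{2\lambda}}{\cos(\lambda\pi)}$ for $|\lambda|<1/2$, and $\cos(\lambda\pi) = \cos^2(\lambda\pi/2)-\sin^2(\lambda\pi/2)$,
\begin{align*}
\text{Var}\left(|x|^\lambda\cos(\lambda\pi/2)\right) &=
d^{2\lambda}\left(\frac{\cos^2(\lambda\pi/2)}{\cos(\lambda\pi)}-1\right)
= d^{2\lambda}\frac{\sin^2(\lambda\pi/2)}{\cos(\lambda\pi)}, \\
g'\left(d^\lambda\right) &= \frac{1}{\lambda}d^{1-\lambda}, \\
\text{Var}\left(\hat{d}_{\lambda}\right) &\approx
\frac{1}{k}\left(g'\left(d^\lambda\right)\right)^2
d^{2\lambda}\frac{\sin^2(\lambda\pi/2)}{\cos(\lambda\pi)}
= \frac{d^2}{k}\frac{\sin^2(\lambda \pi/2)}{\lambda^2 \cos(\lambda\pi)}.
\end{align*}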

1835:

1836: $\text{Var}\left(\hat{d}_{\lambda}\right)\rightarrow \infty$

1837: as $|\lambda|\rightarrow \frac{1}{2}$. $\text{Var}\left(\hat{d}_{\lambda}\right)$

1838: converges to $\text{Var}\left(\hat{d}_{log}\right)$ as $\lambda

1839: \rightarrow 0$, because

1840: \begin{align}

1841: \underset{\lambda\rightarrow 0}\lim\frac{\sin^2(\lambda

1842:   \pi/2)}{\lambda^2 \cos(\lambda\pi)} = \frac{\pi^2}{4}.

1843: \end{align}

1844:

1845: This completes the proof of Lemma \ref{lem_d_log}.

1846:

1847:

1848:

1849:

1850: \section{Proof of Lemma \ref{lem_d_gm}}\label{app_proof_lem_d_gm}

1851:

1852: Assume that $x_1$, $x_2$, ..., $x_k$, are i.i.d. $C(0,d)$.

1853: The estimator, $\hat{d}_{gm,c}$, expressed as

1854: \begin{align}

1855: \hat{d}_{gm,c} = \cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k|x_j|^{1/k},

1856: \end{align}

1857: is unbiased, because, from Lemma \ref{lem_d_log},

1858: \begin{align}\notag

1859: \text{E}\left(\hat{d}_{gm,c}\right) &=

1860:   \cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k\text{E}\left(|x_j|^{1/k}\right) \\\notag

1861: &=\cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k\left(\frac{d^{1/k}}{\cos\left(\frac{\pi}{2k}\right)}\right)\\

1862: &=d.

1863: \end{align}

1864:

1865: The variance  is

1866: \begin{align}\notag

1867: \text{Var}\left(\hat{d}_{gm,c}\right) &=

1868: \cos^{2k}\left(\frac{\pi}{2k}\right)\prod_{j=1}^k\text{E}\left(|x_j|^{2/k}\right)

1869:   -d^2\\

1870: &=

1871: d^2

1872: \left(\frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi}{k}\right)}-1

1873: \right)\\

1874: &=\frac{\pi^2}{4}\frac{d^2}{k}  + \frac{\pi^4}{32}\frac{d^2}{k^2}+ O\left(\frac{1}{k^3}\right),

1875: \end{align}

1876: \noindent because

1877: \begin{align}\notag

1878: \frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi}{k}\right)}

1879: &=

1880: \left(\frac{1}{2}+\frac{1}{2}\left(\frac{1}{\cos(\pi/k)}\right)\right)^k

1881: \\ \notag

1882: &=\left(1+\frac{1}{4}\frac{\pi^2}{k^2} +

1883:   \frac{5}{48}\frac{\pi^4}{k^4}+O\left(\frac{1}{k^6}\right)\right)^k

1884: \\\notag

1885: &=1+k\left(\frac{1}{4}\frac{\pi^2}{k^2}+\frac{5}{48}\frac{\pi^4}{k^4}\right)

1886: +

1887: \frac{k(k-1)}{2}\left(\frac{1}{4}\frac{\pi^2}{k^2}+\frac{5}{48}\frac{\pi^4}{k^4}\right)^2+

1888: ... \\

1889: &=1+\frac{\pi^2}{4}\frac{1}{k}+\frac{\pi^4}{32}\frac{1}{k^2} +O\left(\frac{1}{k^3}\right).

1890: \end{align}

1891:

1892: Some more algebra can similarly show the third and fourth central moments:

1893: \begin{align}

1894: &\text{E}\left(\hat{d}_{gm,c} -

1895:   \text{E}\left(\hat{d}_{gm,c}\right)\right)^3 =

1896: \frac{3\pi^4}{16}\frac{d^3}{k^2} + O\left(\frac{1}{k^3}\right)\\

1897: &\text{E}\left(\hat{d}_{gm,c} -

1898:   \text{E}\left(\hat{d}_{gm,c}\right)\right)^4 =

1899: \frac{3\pi^4}{16}\frac{d^4}{k^2} + O\left(\frac{1}{k^3}\right).

1900: \end{align}
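
One way to organize this algebra is sketched below; the notation $M_t$ and $a$ is introduced here for illustration only. From Lemma \ref{lem_d_log}, $M_t = \text{E}\left(\hat{d}_{gm,c}\right)^t/d^t = \cos^{kt}\left(\frac{\pi}{2k}\right)/\cos^k\left(\frac{\pi t}{2k}\right)$ for $0\leq t<k$. Using $\log\cos(x) = -\frac{x^2}{2} + O\left(x^4\right)$ and writing $a = \frac{\pi^2}{8}$,
\begin{align*}
\log M_t &= a\,\frac{t^2-t}{k} + O\left(\frac{1}{k^3}\right), \\
M_2 &= 1+\frac{2a}{k}+\frac{2a^2}{k^2}+O\left(\frac{1}{k^3}\right), \hspace{0.1in}
M_3 = 1+\frac{6a}{k}+\frac{18a^2}{k^2}+O\left(\frac{1}{k^3}\right), \hspace{0.1in}
M_4 = 1+\frac{12a}{k}+\frac{72a^2}{k^2}+O\left(\frac{1}{k^3}\right).
\end{align*}
Because $M_1 = 1$, the normalized third and fourth central moments are
\begin{align*}
M_3 - 3M_2 + 2 &= \frac{12a^2}{k^2} + O\left(\frac{1}{k^3}\right) = \frac{3\pi^4}{16}\frac{1}{k^2} + O\left(\frac{1}{k^3}\right), \\
M_4 - 4M_3 + 6M_2 - 3 &= \frac{12a^2}{k^2} + O\left(\frac{1}{k^3}\right) = \frac{3\pi^4}{16}\frac{1}{k^2} + O\left(\frac{1}{k^3}\right),
\end{align*}
and multiplying by $d^3$ and $d^4$, respectively, recovers the two expressions above. As a consistency check, $M_2 - 1$ reproduces the first two terms of the variance.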

1901:

1902:

1903: Therefore, we have completed the proof of Lemma \ref{lem_d_gm}.

1904:

1905:

1906: \section{Proof of Lemma \ref{lem_d_gm_tail}}

1907: \label{app_proof_lem_d_gm_tail}

1908:

1909: This section proves the tail bounds for $\hat{d}_{gm,c}$.

1910: Note that $\hat{d}_{gm,c}$ does not have a moment generating function

1911: because  $\text{E}\left(\hat{d}_{gm,c}\right)^t=\infty$ if

1912: $t\geq k$. However, we can still use the Markov moment bound.\footnote{In

1913: fact, even when the moment generating function does exist, for any positive

1914: random variable, the Markov moment bound is always sharper than the

1915: Chernoff bound, although the Chernoff bound will be in an exponential

1916: form. See \cite{Article:Philips_95,Article:Lugosi_04}.}

1917:

1918: For any $\epsilon \geq0$ and $0\leq t<k$, the Markov inequality says

1919: \begin{align}

1920: \mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right) \leq \frac{\text{E}\left(\hat{d}_{gm,c}\right)^t}{(1+\epsilon)^td^t}

1921: =

1922: \frac{\cos^{kt}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi

1923:       t}{2k}\right)(1+\epsilon)^{t}},

1924: \end{align}

1925: \noindent which can be minimized by choosing the optimum $t = t_1^*$,

1926: where

1927: \begin{align}

1928: t_1^* = \frac{2k}{\pi}\tan^{-1}\left(\left(\log(1+\epsilon) -

1929:     k\log\cos\left(\frac{\pi}{2k}\right)\right)\frac{2}{\pi}\right).

1930: \end{align}
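As a sketch of where $t_1^*$ comes from (our restatement; $g(t)$ is notation introduced only for this remark and denotes the logarithm of the bound above), one can differentiate
\begin{align}\notag
g(t) &= kt\log\cos\left(\frac{\pi}{2k}\right) - k\log\cos\left(\frac{\pi t}{2k}\right) - t\log(1+\epsilon),\\\notag
g^\prime(t) &= k\log\cos\left(\frac{\pi}{2k}\right) + \frac{\pi}{2}\tan\left(\frac{\pi t}{2k}\right) - \log(1+\epsilon).
\end{align}
\noindent Setting $g^\prime(t)=0$ gives $\tan\left(\frac{\pi t}{2k}\right) = \frac{2}{\pi}\left(\log(1+\epsilon) - k\log\cos\left(\frac{\pi}{2k}\right)\right)$, which is the stated $t_1^*$. Because $g^{\prime\prime}(t) = \frac{\pi^2}{4k}\sec^2\left(\frac{\pi t}{2k}\right)>0$ on $0\leq t<k$, this stationary point is the minimizer.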

1931:

We need to make sure that $0\leq t_1^*<k$. First, $t_1^*\geq0$ because $\log\cos(\cdot)\leq 0$ and $\log(1+\epsilon)\geq 0$, so the argument of $\tan^{-1}$ is nonnegative. Second, $t_1^*<k$ because $\tan^{-1}(\cdot) < \frac{\pi}{2}$ for any finite argument.

1935:

1936: For $0\leq \epsilon \leq1$, we can prove an exponential bound for

1937: $\mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right)$.

1938: First of all, note that we do not

1939: have to choose the optimum $t = t_1^*$. By the Taylor expansion, for

1940: small $\epsilon$, $t_1^*$ can be well approximated by

1941: \begin{align}

1942: t_1^* \approx \frac{4k\epsilon}{\pi^2} + \frac{1}{2} \approx

1943: \frac{4k\epsilon}{\pi^2} = t_1^{**}.

1944: \end{align}
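The approximation can be checked with a rough first-order sketch, assuming only $\log\cos(x) = -\frac{x^2}{2}+O(x^4)$, $\log(1+\epsilon) = \epsilon + O(\epsilon^2)$, and $\tan^{-1}(x) = x + O(x^3)$:
\begin{align}\notag
k\log\cos\left(\frac{\pi}{2k}\right) &= -\frac{\pi^2}{8k} + O\left(\frac{1}{k^3}\right),\\\notag
t_1^* &\approx \frac{2k}{\pi}\cdot\frac{2}{\pi}\left(\epsilon + \frac{\pi^2}{8k}\right) = \frac{4k\epsilon}{\pi^2} + \frac{1}{2}.
\end{align}
\noindent Dropping the constant $\frac{1}{2}$ is harmless for the argument, because the Markov bound holds for every $0\leq t<k$, not only for the minimizer.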

1945:

1946: Therefore, taking $t=t_1^{**} = \frac{4k\epsilon}{\pi^2}$, the tail bound becomes

1947: \begin{align}\notag

1948: \mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right) &\leq  \frac{\cos^{kt_1^{**}}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi

1949:       t_1^{**}}{2k}\right)(1+\epsilon)^{t_1^{**}}} \\\notag

1950: &=

1951: \left(\frac{\cos^{t_1^{**}}\left(\frac{\pi}{2k}\right)}{\cos\left(\frac{2\epsilon}{\pi}\right)(1+\epsilon)^{4\epsilon/\pi^2}}

1952: \right)^k \\\notag

1953: &\leq \left(\frac{1}{\cos\left(\frac{2\epsilon}{\pi}\right)(1+\epsilon)^{4\epsilon/\pi^2}}

1954: \right)^k \\\notag

1955: &=\exp\left(-k\left(\log\left(\cos\left(\frac{2\epsilon}{\pi}\right)\right) +

1956: \frac{4\epsilon}{\pi^2}\log(1+\epsilon)\right)\right)

1957: \\

1958: &\leq \exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right),

1959: \hspace{0.1in} 0\leq \epsilon\leq1\label{eqn_proof_right}

1960: \end{align}

1961:

The last step in (\ref{eqn_proof_right}) needs some explanation. First, by the Taylor expansion,

1964: \begin{align}\notag

1965: &\log\left(\cos\left(\frac{2\epsilon}{\pi}\right)\right) +

1966: \frac{4\epsilon}{\pi^2}\log(1+\epsilon) \\\notag

1967: =& \left(-\frac{2\epsilon^2}{\pi^2} -

1968:   \frac{4}{3}\frac{\epsilon^4}{\pi^4} +... \right)+

1969: \frac{4\epsilon}{\pi^2}\left(\epsilon -

1970:   \frac{1}{2}\epsilon^2+...\right)\\

1971: =& \frac{2\epsilon^2}{\pi^2}\left(1-\epsilon+...\right)

1972: \end{align}

1973:

1974: Therefore, we can seek the smallest constant $\gamma_1$ so that

1975: \begin{align}

1976: \log\left(\cos\left(\frac{2\epsilon}{\pi}\right)\right) +

1977: \frac{4\epsilon}{\pi^2}\log(1+\epsilon)

1978: \geq \frac{\epsilon^2}{\gamma_1(1+\epsilon)} =

1979: \frac{\epsilon^2}{\gamma_1}(1-\epsilon +...)

1980: \end{align}

1981:

1982: It is easy to see that as $\epsilon \rightarrow 0$,

1983: $\gamma_1\rightarrow \frac{\pi^2}{2}$. Figure \ref{fig_gm_constant}(a)

1984: illustrates that it suffices to let $\gamma_1 = 8$, which can be

1985: numerically verified. This is why the last step in

(\ref{eqn_proof_right}) holds. Of course, a smaller constant suffices for a specific $\epsilon$ (e.g., $\epsilon =0.5$).
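For a concrete check of the constant (values rounded to three or four digits, computed here rather than read off the figure), take the endpoint $\epsilon = 1$:
\begin{align}\notag
\log\left(\cos\left(\frac{2}{\pi}\right)\right) + \frac{4}{\pi^2}\log(2) &\approx -0.2180 + 0.2809 = 0.0629,\\\notag
\frac{\epsilon^2/(1+\epsilon)}{0.0629} &= \frac{0.5}{0.0629} \approx 7.95 < 8.
\end{align}
\noindent At the other end, the limiting value as $\epsilon\rightarrow0$ is $\frac{\pi^2}{2}\approx 4.93$, so the required constant stays below 8 at both ends of $0\leq\epsilon\leq1$.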

1988:

1989:

1990: Now we need to  show the other tail bound $\mathbf{Pr}\left(\hat{d}_{gm,c}

1991:   \leq  (1-\epsilon)d \right)$:

1992: \begin{align}\notag

1993: &\mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d \right)

1994: =\mathbf{Pr}\left(\cos\left(\frac{\pi}{2k}\right)^k

1995:   \prod_{j=1}^k|x_j|^{1/k} \leq  (1-\epsilon)d \right) \\\notag

1996: =&\mathbf{Pr}\left(

1997:   \sum_{j=1}^k\log\left(|x_j|^{1/k}\right)\leq

1998:   \log\left(\frac{(1-\epsilon)d}{\cos^k\left(\frac{\pi}{2k}\right)}\right)\right)\\\notag

1999: =&\mathbf{Pr}\left( \exp\left(

2000:   \sum_{j=1}^k\log\left(|x_j|^{-t/k}\right)\right)\geq

2001:   \exp\left(-t\log\left(\frac{(1-\epsilon)d}{\cos^k\left(\frac{\pi}{2k}\right)}\right)\right)\right), \hspace{0.2in} 0\leq t<k \\

2002: \leq & \left(\frac{(1-\epsilon)}{\cos^k\left(\frac{\pi}{2k}\right)}\right)^t

2003: \frac{1}{\cos^k\left(\frac{\pi t}{2k}\right)}, \hspace{0.2in}

2004: \text{(Chernoff bound)}

2005: \end{align}

2006: \noindent which is minimized at $t = t_2^*$

2007: \begin{align}

2008: t_2^* = \frac{2k}{\pi}\tan^{-1}\left(\left(-\log(1-\epsilon) +

2009:     k\log\cos\left(\frac{\pi}{2k}\right)\right)\frac{2}{\pi}\right),

2010: \end{align}

2011: \noindent provided $k\geq \frac{\pi^2}{8\epsilon}$, otherwise $t_2^*$

2012: may be less than 0.
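To leading order, this condition can be read off as follows (a heuristic sketch, not a rigorous bound). Since $-\log(1-\epsilon)\geq\epsilon$ and $k\log\cos\left(\frac{\pi}{2k}\right)\approx-\frac{\pi^2}{8k}$,
\begin{align}\notag
-\log(1-\epsilon) + k\log\cos\left(\frac{\pi}{2k}\right) \gtrsim \epsilon - \frac{\pi^2}{8k},
\end{align}
\noindent which is nonnegative when $k\geq\frac{\pi^2}{8\epsilon}$, so that the argument of $\tan^{-1}$, and hence $t_2^*$, is nonnegative.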

2013:

2014: Again, $t_2^*$ can be replaced by its approximation

2015: \begin{align}

2016: t_2^* \approx t_2^{**} = \frac{4k\epsilon}{\pi^2},

2017: \end{align}

2018: \noindent provided $k\geq\frac{\pi^2}{4\epsilon}$, otherwise the

2019: probability upper bound may exceed one.  Therefore,

2020:

2021: \begin{align}\notag

2022: \mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d \right)

2023: \leq& \left(\frac{(1-\epsilon)}{\cos^k\left(\frac{\pi}{2k}\right)}\right)^{t_2^{**}}

2024: \frac{1}{\cos^k\left(\frac{\pi t_2^{**}}{2k}\right)}\\\notag

2025: =&\exp\left(-k\left(\log\left(\cos\frac{2\epsilon}{\pi}\right) -

2026:     \frac{4\epsilon}{\pi^2}\log(1-\epsilon) +  \frac{4k\epsilon}{\pi^2}\log\left(\cos\frac{\pi}{2k}\right) \right)\right).

2027: \end{align}

2028: \noindent We can bound

2029: $\frac{4k\epsilon}{\pi^2}\log\left(\cos\frac{\pi}{2k}\right)$ by restricting $k$.

2030:

2031: In order to attain $\mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d

2032: \right)  \leq

2033: \exp\left(-k\left(\frac{\epsilon^2}{8(1+\epsilon)}\right)\right)$, we

have to restrict $k$ to be larger than a certain value. We choose to express the restriction as $k \geq

2036: \frac{\pi^2}{\gamma_2\epsilon}$, for some constant $\gamma_2$. We

2037: find $k \geq

2038: \frac{\pi^2}{1.5\epsilon}$ suffices, although readers can verify that a

2039: slightly better (smaller) restriction would be $k \geq

2040: \frac{1}{4/\pi^2-1/4}\frac{1}{\epsilon} = \frac{\pi^2}{1.5326\epsilon} $.
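The constant $1.5326 \approx 4-\frac{\pi^2}{4}$ can be recovered, at least heuristically, from a small-$\epsilon$ expansion. Taking $k=\frac{\pi^2}{\gamma_2\epsilon}$, so that $\frac{4k\epsilon}{\pi^2}=\frac{4}{\gamma_2}$ and $\frac{\pi}{2k}=\frac{\gamma_2\epsilon}{2\pi}$, the three terms in the exponent are, to leading order,
\begin{align}\notag
\log\left(\cos\frac{2\epsilon}{\pi}\right) \approx -\frac{2\epsilon^2}{\pi^2}, \hspace{0.2in}
-\frac{4\epsilon}{\pi^2}\log(1-\epsilon) \approx \frac{4\epsilon^2}{\pi^2}, \hspace{0.2in}
\frac{4}{\gamma_2}\log\left(\cos\frac{\gamma_2\epsilon}{2\pi}\right) \approx -\frac{\gamma_2\epsilon^2}{2\pi^2}.
\end{align}
\noindent Their sum is $\left(2-\frac{\gamma_2}{2}\right)\frac{\epsilon^2}{\pi^2}$, which is at least $\frac{\epsilon^2}{8}$ exactly when $\gamma_2\leq 4-\frac{\pi^2}{4}\approx1.5326$. This is only a leading-order sketch; the claim for all $0\leq\epsilon\leq1$ still rests on the numerical verification.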

2041:

If $k \geq
\frac{\pi^2}{1.5\epsilon}$, then
$\frac{4k\epsilon}{\pi^2}\log\left(\cos\frac{\pi}{2k}\right) \geq
\frac{8}{3}\log\left(\cos \frac{3\epsilon}{4\pi}\right)$, because $k\log\cos\left(\frac{\pi}{2k}\right)$ is increasing in $k$ and equality holds at $k = \frac{\pi^2}{1.5\epsilon}$. Therefore,

2046: \begin{align}\notag

2047: \mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d \right) \leq  &\exp\left(-k\left(\log\left(\cos\frac{2\epsilon}{\pi}\right) -

2048:     \frac{4\epsilon}{\pi^2}\log(1-\epsilon) +

2049:     \frac{8}{3}\log\left(\cos

      \frac{3\epsilon}{4\pi}\right)\right)\right)\\

2051: \leq &\exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right), \hspace{0.2in}

2052: k\geq \frac{\pi^2}{1.5\epsilon}\label{eqn_proof_left}

2053: \end{align}

2054:

2055: \begin{figure}[h]

2056: \begin{center}\mbox{

2057: \subfigure[]{\includegraphics[width = 2.5in]{fig/gm_bound_const.eps}}\hspace{0.5in}

2058: \subfigure[]{\includegraphics[width = 2.5in]{fig/gm_bound_const_left.eps}}}

2059: \end{center}\vspace{-0.4in}

2060: \caption{ (a):

2061:   $\frac{\epsilon^2/(1+\epsilon)}{\log\left(\cos\left(\frac{2\epsilon}{\pi}\right)\right) +

2062: \frac{4\epsilon}{\pi^2}\log(1+\epsilon)}$ as a function of

2063: $\epsilon$. (b): $\frac{\epsilon^2/(1+\epsilon)}{\log\left(\cos\frac{2\epsilon}{\pi}\right) -

2064:     \frac{4\epsilon}{\pi^2}\log(1-\epsilon) +

2065:     \frac{8}{3}\log\left(\cos

      \frac{3\epsilon}{4\pi}\right) }$ as a function of $\epsilon$. Graphically, we know that it suffices to use a constant 8

2067: in (\ref{eqn_proof_right}) and (\ref{eqn_proof_left}). The optimal

2068: constant will be different for different $\epsilon$. For example, if

2069: $\epsilon = 0.2$, we could replace the constant 8 by a constant 5.  }\label{fig_gm_constant}\vspace{-0.2in}

2070: \end{figure}

2071:

2072:

2073:

2074:

2075: This completes the proof of Lemma \ref{lem_d_gm_tail}.

2076:

2077:

2078: \section{Proof of Lemma \ref{lem_mle_asymp}} \label{app_proof_lem_asymp}

Assume $x \sim C(0,d)$. The log-likelihood $l(x;d)$ and its first three derivatives with respect to $d$ are

2082: \begin{align}

2083: &l(x;d) = \log(d) - \log(\pi) - \log(x^2+d^2),\\

2084: &l^\prime(d) = \frac{1}{d} - \frac{2d}{x^2+d^2}\\

2085: &l^{\prime\prime}(d) = -\frac{1}{d^2} -

2086: \frac{2x^2-2d^2}{(x^2+d^2)^2}\\

2087: &l^{\prime\prime\prime}(d) = \frac{2}{d^3} +

2088: \frac{4d}{(x^2+d^2)^2} + \frac{8d(x^2-d^2)}{(x^2+d^2)^3}

2089: %\\

2090: %&l^{\prime\prime\prime\prime}(d) = -\frac{6}{d^3} +

2091: %\frac{4x^2-12d^2}{(x^2+d^2)^3} + \frac{8x^4-64x^2d^2+24d^4)}{(x^2+d^2)^4}.

2092: \end{align}

2093:

2094: The MLE  $\hat{d}_{MLE}$ is

2095: asymptotically normal with mean $d$ and variance

2096: $\frac{1}{k\text{I}(d)}$, where $\text{I}(d)$, the expected Fisher

2097: Information, is

2098: \begin{align}

2099: \text{I} = \text{I}(d) = \text{E}\left(-l^{\prime\prime}(d)\right)  =

2100: \frac{1}{d^2} +

2101: 2\text{E}\left(\frac{x^2-d^2}{(x^2+d^2)^2}\right) = \frac{1}{2d^2},

2102: \end{align}

2103: \noindent because

2104: \begin{align}\notag

2105: \text{E}\left(\frac{x^2-d^2}{(x^2+d^2)^2}\right) &= \frac{d}{\pi}

2106: \int_{-\infty}^\infty \frac{x^2-d^2}{(x^2+d^2)^3}dx \\ \notag

2107: &=\frac{d}{\pi} \int_{-\pi/2}^{\pi/2} \frac{d^2(\tan^2(t) -

2108:   1)}{d^6/\cos^6(t)} \frac{d}{\cos^2(t)}dt \\\notag

&=\frac{1}{d^2\pi}\int_{-\pi/2}^{\pi/2}\left(\cos^2(t) - 2\cos^4(t)\right) dt  \\

2110: &= \frac{1}{d^2\pi}\left(\frac{\pi}{2}-2\frac{3}{8}\pi\right) = -\frac{1}{4d^2}

2111: \end{align}

2112: Therefore, we obtain

2113: \begin{align}

2114: \text{Var}\left(\hat{d}_{MLE}\right) = \frac{2d^2}{k} + O\left(\frac{1}{k^2}\right).

2115: \end{align}
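As a cross-check of the Fisher information (a sketch using the same substitution $x = d\tan(t)$, under which $t$ is uniform on $(-\pi/2,\pi/2)$, and the standard identity $\text{E}\left(-l^{\prime\prime}\right) = \text{E}\left(l^\prime\right)^2$), the score simplifies to
\begin{align}\notag
l^\prime(d) = \frac{1}{d}\frac{x^2-d^2}{x^2+d^2} = -\frac{\cos(2t)}{d},
\hspace{0.3in}
\text{E}\left(l^\prime\right)^2 = \frac{1}{d^2}\frac{1}{\pi}\int_{-\pi/2}^{\pi/2}\cos^2(2t)dt = \frac{1}{2d^2},
\end{align}
\noindent in agreement with $\text{I}(d) = \frac{1}{2d^2}$.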

2116:

2117: General formulas for the bias and higher moments of the MLE are

2118: available in \citep{Article:Bartlett_53,Article:Shenton_63}.  We need to evaluate

2119: the expressions in \cite[16a-16d]{Article:Shenton_63}, involving

2120: tedious algebra:

2121: \begin{align}

2122: &\text{E}\left(\hat{d}_{MLE}\right) = d - \frac{[12]}{2k\text{I}^2} +

2123: O\left(\frac{1}{k^2}\right) \\

2124: &\text{Var}\left(\hat{d}_{MLE}\right) = \frac{1}{k\text{I}} +

2125: \frac{1}{k^2}\left(-\frac{1}{\text{I}}+\frac{[1^4]-[1^22]-[13]}{\text{I}^3}

2126: +\frac{3.5[12]^2-[1^3]^2}{\text{I}^4}\right) +

2127: O\left(\frac{1}{k^3}\right) \\

2128: &\text{E}\left(\hat{d}_{MLE}-\text{E}\left(\hat{d}_{MLE}\right)\right)^3 =

\frac{[1^3]-3[12]}{k^2\text{I}^3}+O\left(\frac{1}{k^3}\right) \\\notag

2130: &\text{E}\left(\hat{d}_{MLE}-\text{E}\left(\hat{d}_{MLE}\right)\right)^4=

2131: \frac{3}{k^2\text{I}^2} +

2132: \frac{1}{k^3}\left(-\frac{9}{\text{I}^2}+

2133:   \frac{7[1^4] - 6[1^22]-10[13]}{\text{I}^4}\right)\\

2134: &\hspace{1.7in} +

2135: \frac{1}{k^3}\left(\frac{-6[1^3]^2-12[1^3][12]+45[12]^2}{\text{I}^5}\right)+O\left(\frac{1}{k^4}\right),

2136: \end{align}

2137: \noindent where, after re-formatting,

2138: \begin{align}\notag

2139: &[12] = \text{E}(l^\prime)^3 +  \text{E}(l^\prime l^{\prime\prime}),

2140: \hspace{0.3in} [1^4] = \text{E}(l^\prime)^4, \hspace{0.3in} [1^22] =

2141: \text{E}(l^{\prime\prime}(l^\prime)^2) +  \text{E}(l^{\prime})^4, \\

2142: &[13] = \text{E}(l^\prime)^4 +

2143: 3\text{E}(l^{\prime\prime}(l^\prime)^2)  + \text{E}(l^\prime

2144: l^{\prime\prime\prime}), \hspace{0.3in} [1^3]=\text{E}(l^\prime)^3.

2145: \end{align}

2146:

We omit most of the algebra. To help readers verify the
results, the following formula, which we derived, may be useful:

2149: \begin{align}

2150: \text{E}\left(\frac{1}{x^2+d^2}\right)^m =

2151: \frac{1\times3\times5\times...\times(2m-1)}{2\times4\times6\times...\times(2m)}\frac{1}{d^{2m}},

2152: \hspace{0.2in} m = 1, 2, 3, ...

2153: \end{align}
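One way to see this formula (a sketch, not necessarily the derivation the authors used) is again the substitution $x = d\tan(t)$, which gives $\frac{1}{x^2+d^2} = \frac{\cos^2(t)}{d^2}$ with $t$ uniform on $(-\pi/2,\pi/2)$:
\begin{align}\notag
\text{E}\left(\frac{1}{x^2+d^2}\right)^m = \frac{1}{d^{2m}}\frac{1}{\pi}\int_{-\pi/2}^{\pi/2}\cos^{2m}(t)dt,
\end{align}
\noindent and the remaining integral is the classical Wallis integral. For example, $m=1$ gives $\frac{1}{2}\frac{1}{d^2}$ and $m=2$ gives $\frac{3}{8}\frac{1}{d^4}$.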

2154:

Omitting the details, we report

2156: \begin{align}\notag

2157: &\text{E}\left(l^{\prime}\right)^3 = 0, \hspace{0.3in}

2158: \text{E}\left(l^\prime l^{\prime\prime}\right) = -\frac{1}{2}\frac{1}{d^3}, \hspace{0.3in}

2159: \text{E}\left(l^{\prime}\right)^4 =

2160: \frac{3}{8}\frac{1}{d^4}, \\

2161: &\text{E}(l^{\prime\prime}(l^\prime)^2) = -\frac{1}{8}\frac{1}{d^4},  \hspace{0.3in}

2162: \text{E}\left(l^{\prime}l^{\prime\prime\prime}\right) =

2163: \frac{3}{4}\frac{1}{d^4}.

2164: \end{align}

2165: Hence

2166: \begin{align}

2167: &[12] = -\frac{1}{2}\frac{1}{d^3}, \hspace{0.25in} [1^4] =

2168: \frac{3}{8}\frac{1}{d^4}, \hspace{0.25in}[1^22] =

2169: \frac{1}{4}\frac{1}{d^4}, \hspace{0.25in}[13] =

2170: \frac{3}{4}\frac{1}{d^4}, \hspace{0.25in}[1^3] = 0.

2171: \end{align}

2172:

2173:

2174: Thus,  we obtain

2175: \begin{align}

2176: &\text{E}\left(\hat{d}_{MLE}\right) = d

2177: +\frac{d}{k} + O\left(\frac{1}{k^2}\right)\\

2178: &\text{Var}\left(\hat{d}_{MLE}\right) = \frac{2d^2}{k} + \frac{7d^2}{k^2} +

2179: O\left(\frac{1}{k^3}\right) \\

2180: &\text{E}\left(\hat{d}_{MLE}-\text{E}\left(\hat{d}_{MLE}\right)\right)^3 =

2181: \frac{12d^3}{k^2} + O\left(\frac{1}{k^3}\right)\\

2182: &\text{E}\left(\hat{d}_{MLE}-\text{E}\left(\hat{d}_{MLE}\right)\right)^4 =

2183: \frac{12d^4}{k^2} + \frac{222d^4}{k^3} +  O\left(\frac{1}{k^4}\right).

2184: \end{align}
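For readers who wish to retrace the arithmetic, the substitutions can be sketched as follows, using $\text{I} = \frac{1}{2d^2}$ and the bracket values reported above. The bias term is
\begin{align}\notag
-\frac{[12]}{2k\text{I}^2} = \frac{1}{2d^3}\cdot\frac{4d^4}{2k} = \frac{d}{k}.
\end{align}
\noindent The $\frac{1}{k^2}$ coefficient of the variance is
\begin{align}\notag
-2d^2 + \left(\frac{3}{8}-\frac{1}{4}-\frac{3}{4}\right)8d^2 + 3.5\cdot\frac{1}{4}\cdot16d^2 = -2d^2-5d^2+14d^2 = 7d^2,
\end{align}
\noindent and the $\frac{1}{k^3}$ coefficient of the fourth central moment is
\begin{align}\notag
-9\cdot4d^4 + \left(\frac{21}{8}-\frac{6}{4}-\frac{30}{4}\right)16d^4 + \frac{45}{4}\cdot32d^4 = -36d^4-102d^4+360d^4 = 222d^4.
\end{align}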

2185:

2186: Because $\hat{d}_{MLE}$ has $O\left(\frac{1}{k}\right)$ bias, we

2187: recommend the bias-corrected estimator

2188: \begin{align}

2189: \hat{d}_{MLE,c} = \hat{d}_{MLE}\left(1-\frac{1}{k}\right),

2190: \end{align}

2191: whose first four moments are

2192: \begin{align}

2193: &\text{E}\left(\hat{d}_{MLE,c}\right) = d + O\left(\frac{1}{k^2}\right)\\

2194: &\text{Var}\left(\hat{d}_{MLE,c}\right) = \frac{2d^2}{k} + \frac{3d^2}{k^2} +

2195: O\left(\frac{1}{k^3}\right) \\

2196: &\text{E}\left(\hat{d}_{MLE,c}-\text{E}\left(\hat{d}_{MLE,c}\right)\right)^3 =

2197: \frac{12d^3}{k^2} + O\left(\frac{1}{k^3}\right)\\

2198: &\text{E}\left(\hat{d}_{MLE,c}-\text{E}\left(\hat{d}_{MLE,c}\right)\right)^4 =

2199: \frac{12d^4}{k^2} + \frac{186d^4}{k^3} +  O\left(\frac{1}{k^4}\right),

2200: \end{align}

2201: \noindent by brute-force algebra. First, it is obvious that

2202: \begin{align}

2203: \text{E}\left(\hat{d}_{MLE} - d\right)^2 = \frac{2d^2}{k} + \frac{8d^2}{k^2}

2204: + O\left(\frac{1}{k^3}\right).

2205: \end{align}

2206: Then

2207: \begin{align}\notag

2208: \text{Var}\left(\hat{d}_{MLE,c}\right) &= \text{E}\left(\hat{d}_{MLE,c} -

2209:   \text{E}(\hat{d}_{MLE,c})\right)^2\\\notag &=

2210: \text{E}\left(\hat{d}_{MLE}\left(1-\frac{1}{k}\right) - d +

2211:   O\left(\frac{1}{k^2}\right)\right)^2 \\\notag &= \text{E}\left(\left(\hat{d}_{MLE}-d\right)\left(1-\frac{1}{k}\right) -\frac{d}{k} +

2212:   O\left(\frac{1}{k^2}\right)\right)^2  \\\notag

2213: &=\text{E}\left(\hat{d}_{MLE}-d\right)^2\left(1-\frac{2}{k}\right) +

\frac{d^2}{k^2} - 2\frac{d}{k}\left(1-\frac{1}{k}\right)\text{E}\left(\hat{d}_{MLE}-d\right) +

2215: O\left(\frac{1}{k^3}\right) \\

2216: &= \frac{2d^2}{k} + \frac{3d^2}{k^2} + O\left(\frac{1}{k^3}\right).

2217: \end{align}
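A shorter route to the same variance (our remark, offered as a cross-check) uses the fact that $\hat{d}_{MLE,c}$ is a constant multiple of $\hat{d}_{MLE}$:
\begin{align}\notag
\text{Var}\left(\hat{d}_{MLE,c}\right) &= \left(1-\frac{1}{k}\right)^2\text{Var}\left(\hat{d}_{MLE}\right)\\\notag
&= \left(1-\frac{2}{k}+\frac{1}{k^2}\right)\left(\frac{2d^2}{k}+\frac{7d^2}{k^2}+O\left(\frac{1}{k^3}\right)\right)
= \frac{2d^2}{k}+\frac{3d^2}{k^2}+O\left(\frac{1}{k^3}\right).
\end{align}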

2218:

2219: We can evaluate the higher central moments of $\hat{d}_{MLE,c}$ similarly,

2220: but we skip the algebra.

2221:

2222:

2223: Therefore, we have completed the proof for Lemma \ref{lem_mle_asymp}.

2224:

2225:

2226: \end{document}

2227:

2228: