1: \documentclass[twoside,11pt]{article}
2: %\usepackage{nips2005e,times}
3: \usepackage{jmlr2e}
4: 
5: \usepackage{times,url}
6: \usepackage{amsmath}
7: \usepackage{amsfonts}
8: \usepackage{graphicx}
9: \usepackage{subfigure}
10: %\newtheorem{theorem}{Theorem}
11: %\newtheorem{lemma}{Lemma}
12: %\newtheorem{corollary}{Corollary}
13: %\newtheorem{proposition}{Proposition}
14: \newcommand{\dataset}{{\cal D}}
15: \newcommand{\fracpartial}[2]{\frac{\partial #1}{\partial  #2}}
16: 
17: \ShortHeadings{Cauchy Random Projections}{Li, Hastie, and Church}
18: \firstpageno{1}
19: 
20: \begin{document}
21: \title{Nonlinear Estimators and Tail Bounds for Dimension Reduction in
22:   $l_1$ Using Cauchy Random Projections}
23: 
24: 
25: \author{\name Ping Li \email pingli@stat.stanford.edu \\
26:        \addr Department of Statistics\\
27:        Stanford University\\
28:        Stanford, CA 94305, USA
29:        \AND
30:        \name Trevor J.\ Hastie \email hastie@stanford.edu \\
31:        \addr Department of Statistics\\
32:        Stanford University\\
33:        Stanford, CA 94305, USA
34:        \AND
35:        \name Kenneth W.\ Church \email church@microsoft.com \\
36:        \addr Microsoft Research\\
37:        Microsoft Corporation\\
38:        Redmond, WA 98052, USA
39: }
40: \editor{}
41: 
42: \maketitle
43: \vspace{-0.5in}
44: \begin{abstract}
45: 
For\footnote{Revised \today. The original version, titled {\em
    Practical Procedures for Dimension Reduction in $l_1$}, is
  available as a technical report in the Stanford Statistics archive
  (Report No. 2006-04, June 2006).}
dimension reduction in $l_1$, the method of {\em Cauchy random projections} multiplies
the original data matrix $\mathbf{A}
\in\mathbb{R}^{n\times D}$ with a random matrix $\mathbf{R} \in
\mathbb{R}^{D\times k}$ ($k\ll\min(n,D)$) whose entries are i.i.d.\ samples
of the standard Cauchy $C(0,1)$. Because of the impossibility results, one cannot
hope to recover the pairwise $l_1$ distances in $\mathbf{A}$
from $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times k}$
using linear estimators without incurring large
errors. However, nonlinear estimators are still useful for certain
applications in data stream computation, information
retrieval, learning, and data mining.
61: 
62: We propose three types of nonlinear estimators: the bias-corrected
63: sample median estimator, the bias-corrected geometric mean estimator,
64: and the bias-corrected maximum likelihood estimator. The sample median
65: estimator and the geometric mean estimator are asymptotically (as
66: $k\rightarrow \infty$) equivalent but the latter is more accurate at
67: small $k$.  We derive explicit tail bounds
68: for the geometric mean estimator and establish an analog of the
69: Johnson-Lindenstrauss  (JL) lemma for dimension reduction in $l_1$,
70: which is weaker than the classical JL lemma for dimension reduction in
71: $l_2$. 
72: 
73: Asymptotically, both the sample median estimator and the
74: geometric mean estimators are about $80\%$ efficient compared to the
75: maximum likelihood estimator (MLE). We analyze the moments of the MLE
76: and propose approximating the distribution of the MLE by an 
77: inverse Gaussian. 
78: 
79: \end{abstract}
80: 
\textbf{Keywords:} Dimension reduction, $l_1$ norm, Cauchy random
projections, JL bound
83: 
84: 
85: \section{Introduction}
86: 
This paper focuses on dimension reduction in $l_1$, in particular on
the method based on {\em Cauchy random projections}
\citep{Proc:Indyk_FOCS00}, which is a special case of {\em linear
  random projections}.
92: 
93: The idea of {\em linear random projections} is to multiply the original data
94: matrix $\mathbf{A} \in \mathbb{R}^{n\times D}$ with a random projection matrix
95: $\mathbf{R} \in \mathbb{R}^{D\times k}$, resulting in a projected
96: matrix $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times
97:   k}$. If $k \ll \min(n,D)$, then it should be much more efficient to
98: compute certain summary statistics (e.g., pairwise distances) from 
99: $\mathbf{B}$ as opposed to $\mathbf{A}$. Moreover, $\mathbf{B}$ may be small 
100: enough to reside in physical memory while $\mathbf{A}$ is often
101: too large to fit in the main memory.  
102: 
103: The choice of the random projection matrix $\mathbf{R}$ depends on which norm we
104: would like to work with. 
105: \cite{Proc:Indyk_FOCS00} proposed constructing $\mathbf{R}$ from 
106: i.i.d. samples of $p$-stable distributions, for dimension reduction in
107: $l_p$ ($0< p\leq 2$). In the stable distribution family \citep{Book:Zolotarev_86}, normal is
108: 2-stable and Cauchy is 1-stable. Thus, we will call random projections
109: for $l_2$ and $l_1$,  {\em normal random projections} and {\em Cauchy
110:   random projections}, respectively. 
111: 
112: In {\em normal random projections} \citep{Book:Vempala}, we can estimate the original
113: pairwise $l_2$ distances of $\mathbf{A}$ directly using the
114: corresponding $l_2$ distances of $\mathbf{B}$ (up to a normalizing
115: constant). Furthermore, the Johnson-Lindenstrauss  (JL)
116: lemma \citep{Article:JL84} provides the performance guarantee.
117:  We will review {\em normal random projections} in more detail in
118:  Section \ref{sec_intr_rp}.
119: 
For {\em Cauchy random projections}, we should not use the $l_1$ distance
in $\mathbf{B}$ to approximate the original $l_1$ distance in
$\mathbf{A}$, as the Cauchy distribution does not even have a finite first
moment. The impossibility results
\citep{Proc:Brinkman_FOCS03,Article:Lee_Naor_04,Article:Brinkman_JACM05}
have proved that one cannot hope to recover the $l_1$ distance using
linear projections and linear estimators (e.g., the sample mean) without
incurring large errors.  Fortunately, the impossibility results do not
rule out nonlinear estimators, which may still be useful in
certain applications in data stream computation, information
retrieval, learning, and data mining.
131: 
132: \cite{Proc:Indyk_FOCS00} proposed using the sample median (instead of
133: the sample mean) in {\em Cauchy random projections} and described its
134: application in data stream computation. In this study, we provide
135: three types of nonlinear estimators:  the bias-corrected
136: sample median estimator, the bias-corrected geometric mean estimator,
137: and the bias-corrected maximum likelihood estimator. The sample median
138: estimator and the geometric mean estimator are asymptotically
equivalent (i.e., both are about $80\%$ as efficient as the maximum
likelihood estimator), but the latter is more accurate at small sample sizes $k$.
141: Furthermore, we
142: derive explicit tail bounds for the bias-corrected geometric mean estimator and
143: establish an analog of the  JL Lemma for dimension reduction in
144: $l_1$. 
145: 
This analog of the JL Lemma for $l_1$ is weaker than the classical
JL Lemma for $l_2$, as the geometric mean estimator is a non-convex
norm and hence is not a metric.  Many efficient algorithms, such as
some sub-linear time (using super-linear memory) nearest neighbor algorithms \citep{Book:NN_05}, rely
on metric properties (e.g., the triangle inequality). Nevertheless, nonlinear estimators may
still be useful in several important scenarios.
152: \begin{itemize}
153: \item {\em Estimating $l_1$ distances online} \\
The original data matrix $\mathbf{A} \in \mathbb{R}^{n\times D}$
requires $O(nD)$ storage space, and hence it is
often too large for physical memory. The storage cost of all
pairwise distances is $O(n^2)$, which may also be too large for
memory. For example, in information retrieval, $n$ could be
the total number of word types or documents at Web scale. To avoid page faults,
it may be more efficient to estimate the distances on the fly from
the projected data matrix $\mathbf{B}$ held in memory.
162: \item {\em Computing all pairwise $l_1$ distances} \\
163: In distance-based clustering and  classification applications, we need
164: to compute all pairwise distances in $\mathbf{A}$, at the cost of
165: time $O(n^2D)$. Using {\em Cauchy random projections}, the cost can be reduced
166: to $O(nDk + n^2k)$. Because $k \ll \min(n,D)$, the savings
167: could be enormous. 
168: \item {\em Linear scan nearest neighbor searching}\\
We can always search for the nearest neighbors by linear scans. When
working with the projected data matrix $\mathbf{B}$ (which is in memory), the cost of
searching for the nearest neighbor of one data point is $O(nk)$ time,
which may still be significantly faster than the sub-linear algorithms
working with the original data matrix $\mathbf{A}$ (which is often on
disk).
175: \end{itemize}
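
As a rough illustration of the cost comparison in the second scenario, consider purely hypothetical sizes $n = 10^5$, $D = 10^6$, and $k = 100$ (chosen only for concreteness, not taken from any dataset). Ignoring constant factors,
\begin{align*}
n^2 D &= 10^{16}, \\
nDk + n^2 k &= 10^{13} + 10^{12} = 1.1\times 10^{13},
\end{align*}
so the projection-based computation would need roughly $10^{16}/(1.1\times 10^{13}) \approx 900$ times fewer operations. In general the ratio is $\frac{1}{k}\frac{nD}{n+D}$, which is large whenever $k \ll \min(n,D)$.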
176: 
177: We briefly comment on {\em coordinate
178:   sampling}, another strategy for dimension reduction.  Given a data matrix $\mathbf{A}
179: \in \mathbb{R}^{n\times D}$, one can randomly sample $k$ columns from $\mathbf{A}$ and
180: estimate the summary statistics (including $l_1$ and $l_2$
distances). Despite its simplicity, coordinate sampling has two
major disadvantages. First, there is no performance guarantee. For
heavy-tailed data, we may have to choose $k$ very large in order to
achieve sufficient accuracy. Second, large datasets are often highly sparse
(for example, text data \citep{Article:Dhillon_ML01} and market-basket
data \citep{Proc:Aggarwal_Wolf_Sigmod99,Proc:Strehl_HiPC00}), so most of the sampled entries are zeros and carry little information.  \cite{Report:Li_Church_Sketch} and  \cite{Report:Li_Church_Hastie_crs}
188: provide an alternative coordinate sampling strategy, called
189: {\em Conditional Random Sampling (CRS)}, suitable for sparse
190: data. For non-sparse data, however, methods based on {\em linear 
191:   random projections} are superior. 
192: 
193: The rest of the paper is organized as follows. Section \ref{sec_intr_rp}
194: reviews {\em linear random projections}. Section \ref{sec_results}
195: summarizes the main results for three types of nonlinear
196: estimators. Section \ref{sec_median} presents the sample median
197: estimators. Section \ref{sec_gm} concerns the geometric mean
198: estimators. Section \ref{sec_mle} is devoted to the maximum likelihood
199: estimators. Section \ref{sec_conclusion}
200: concludes the paper.  
201: 
202: \section{Introduction to Linear Random Projections}\label{sec_intr_rp}
203: 
We review {\em linear random projections},
including {\em normal} and {\em Cauchy random projections}.
206: 
207: 
Denote the original data matrix by $\mathbf{A} \in
\mathbb{R}^{n\times D}$, i.e., $n$ data points in $D$ dimensions. Let
$u_i^\text{T}$, with $u_i \in \mathbb{R}^D$, be the $i$th row of $\mathbf{A}$, $i = 1, 2, \ldots, n$. Let
$\mathbf{R}\in \mathbb{R}^{D\times k}$ be a random matrix whose
entries are i.i.d.\ samples of some random variable. The projected
data matrix is $\mathbf{B} = \mathbf{AR} \in \mathbb{R}^{n\times
  k}$. Denote the entries of $\mathbf{R}$ by $r_{ij}$, $i = 1, \ldots, D$,
$j = 1, \ldots, k$, and let $v_i^\text{T}$, with $v_i \in \mathbb{R}^k$, be the
$i$th row of $\mathbf{B}$. Then $v_i = \mathbf{R}^\text{T}u_i$, with entries $v_{i,j} = \mathbf{R}^\text{T}_ju_i$,
$j = 1, 2, \ldots, k$, which are i.i.d., where $\mathbf{R}_j$ is the $j$th column of
$\mathbf{R}$.
219: 
220: 
221: For simplicity, we focus on the leading two rows, $u_1$ and $u_2$, in
222: $\mathbf{A}$, and the leading  two rows, 
223: $v_1$ and $v_2$, in $\mathbf{B}$. Define $\{x_j\}_{j=1}^k$ to be
224: \begin{align}
225: x_j = v_{1,j} - v_{2,j} = \sum_{i=1}^D r_{ij} \left(u_{1,i}-u_{2,i}\right),
226: \hspace{0.5in} j = 1, 2, ..., k
227: \end{align}
228: 
229: If we sample $r_{ij}$ i.i.d. from a {\em stable distribution}
230: \citep{Book:Zolotarev_86,Proc:Indyk_FOCS00}, then $x_j$'s are also
231: i.i.d. samples of the same stable distribution with a different scale
232: parameter. In the family of stable distributions, normal and
233: Cauchy are two important special cases. 
234: 
235: \subsection{Normal Random Projections}
236: 
237: When $r_{ij}$ is sampled from the standard normal, i.e., $r_{ij}\sim
238: N(0,1)$, i.i.d.,  then 
239: \begin{align}
240: x_j = v_{1,j} - v_{2,j} =\sum_{i=1}^D r_{ij} \left(u_{1,i}-u_{2,i}\right) \sim
241: N\left(0,\sum_{i=1}^D|u_{1,i}-u_{2,i}|^2\right), \ \ \  j = 1, 2, ...,
242: k, 
243: \end{align}
244: \noindent because a weighted sum of normals is also normal. 
245: 
246: Denote the squared $l_2$ distance between $u_1$ and $u_2$ by $d_{l_2} =
247: \|u_1-u_2\|^2_2 = \sum_{i=1}^D|u_{1,i}-u_{2,i}|^2$. We can estimate
248: $d_{l_2}$ from the sample squared $l_2$ distance:
249: \begin{align}
250: \hat{d}_{l_2} = \frac{1}{k} \sum_{j=1}^k x_j^2.
251: \end{align}
It is easy to show that \citep[e.g.,][]{Book:Vempala,Proc:Li_Hastie_Church_COLT06}
253: \begin{align}
254: &\text{E}\left(\hat{d}_{l_2}\right) = d_{l_2}, \hspace{0.45in}
255: \text{Var}\left(\hat{d}_{l_2}\right) = \frac{2}{k}d^2_{l_2},\\
256: &\mathbf{Pr}\left(\left|\hat{d}_{l_2} -d_{l_2}\right|\geq \epsilon d_{l_2}\right)  \leq
257: 2\exp\left(-\frac{k}{4}\epsilon^2 + \frac{k}{6}\epsilon^3\right), \ \
258: \ \epsilon >0 \label{eqn_normal_tail}
259: \end{align}
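
The mean and variance above can be checked directly; the following is a short sketch that uses only standard facts about the chi-square distribution. Since $x_j \sim N(0,d_{l_2})$ i.i.d., we have $x_j^2/d_{l_2} \sim \chi^2_1$, which has mean $1$ and variance $2$. Hence
\begin{align*}
\text{E}\left(\hat{d}_{l_2}\right) &= \frac{1}{k}\sum_{j=1}^k \text{E}\left(x_j^2\right) = d_{l_2}, \\
\text{Var}\left(\hat{d}_{l_2}\right) &= \frac{1}{k^2}\sum_{j=1}^k \text{Var}\left(x_j^2\right)
= \frac{1}{k^2}\cdot k \cdot 2d_{l_2}^2 = \frac{2}{k}d_{l_2}^2.
\end{align*}
Equivalently, $k\hat{d}_{l_2}/d_{l_2} \sim \chi^2_k$, which is the usual starting point for Chernoff-type tail bounds of the form (\ref{eqn_normal_tail}).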
260: 
261: We would like to bound the error probability
262: $\mathbf{Pr}\left(\left|\hat{d}_{l_2} -d_{l_2}\right|\geq \epsilon
263:   d_{l_2}\right)$ by $\delta$. Since there
264: are in total $\frac{n(n-1)}{2} < \frac{n^2}{2}$ pairs among $n$
265: data points, we need to bound the tail probabilities simultaneously for
266: all pairs. By the Bonferroni union bound, it suffices if 
267: \begin{align}
268: &\frac{n^2}{2}\mathbf{Pr}\left(\left|\hat{d}_{l_2} -d_{l_2}\right|\geq
269:   \epsilon d_{l_2}\right)  \leq \delta.
270: \end{align}
271: 
272: Using (\ref{eqn_normal_tail}), it suffices if 
273: \begin{align}
274: \frac{n^2}{2}
275: &2\exp\left(-\frac{k}{4}\epsilon^2 + \frac{k}{6}\epsilon^3\right) \leq
276: \delta \\
277: \Longrightarrow & k \geq \frac{2\log n - \log \delta }{\epsilon^2/4 -
278:   \epsilon^3/6}. 
279: \end{align}
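
To spell out the last step, note that the left-hand side equals $n^2\exp\left(-k\left(\epsilon^2/4 - \epsilon^3/6\right)\right)$. Taking logarithms gives
\begin{align*}
2\log n - k\left(\frac{\epsilon^2}{4} - \frac{\epsilon^3}{6}\right) \leq \log \delta ,
\end{align*}
and dividing by $\epsilon^2/4 - \epsilon^3/6$ (which is positive for $0<\epsilon<3/2$) yields the stated bound on $k$. As a numerical illustration, with values chosen only for concreteness, take $n = 1000$, $\delta = 0.05$, and $\epsilon = 0.5$. Then
\begin{align*}
2\log n - \log\delta &\approx 13.816 + 2.996 = 16.811, \\
\frac{\epsilon^2}{4} - \frac{\epsilon^3}{6} &= 0.0625 - 0.0208 \approx 0.0417, \\
k &\geq \frac{16.811}{0.0417} \approx 403.5,
\end{align*}
so $k = 404$ suffices in this example.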
280: 
281: 
282: Therefore, we obtain one version of the JL lemma: 
283: 
284: {\em 
285: If $k \geq \frac{2\log n - \log \delta }{\epsilon^2/4 -
286:   \epsilon^3/6}$, then with probability at least $1-\delta$, the
287: squared $l_2$
288: distance between any pair of data points (among $n$ data points) can
289: be approximated within $1\pm \epsilon$ fraction of the
290: truth, using the squared $l_2$ distance of the
291: projected data after normal random projections. }
292: 
293: Many versions of the JL lemma have been proved
294: \citep{Article:JL84,Article:Frankl_JL,Proc:Indyk_STOC98,Proc:Arriaga_FOCS99,Article:Dasgupta_JL,Proc:Indyk_FOCS00,Proc:Indyk_FOCS01,Article:Achlioptas_JCSS03,Article:Proc:Arriaga_Vempala_ML06,Proc:Ailon_STOC06}.
295: 
296: 
297: Note that we do not have to use $r_{ij} \sim N(0,1)$ for dimension
298: reduction in $l_2$. For example, we can sample $r_{ij}$ from
299: some {\em sub-Gaussian distributions} \citep{Article:Indyk_Naor}, in particular, the following
300: {\em sparse projection distribution}: 
301: \begin{align}\label{eqn_subg_rji}
302: r_{ij} = \sqrt{s}\left\{\begin{array}{rl} 1 & \text{ with prob. }
303:     \frac{1}{2s}  \\ 0 & \text{ with prob. } 1-\frac{1}{s}\\ -1 & \text{ with prob. }
304:     \frac{1}{2s} \end{array} \right..
305: \end{align}
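
As a quick sanity check on (\ref{eqn_subg_rji}), the first few moments of $r_{ij}$ are
\begin{align*}
\text{E}\left(r_{ij}\right) &= \sqrt{s}\left(\frac{1}{2s} - \frac{1}{2s}\right) = 0, \\
\text{E}\left(r_{ij}^2\right) &= s\left(\frac{1}{2s} + \frac{1}{2s}\right) = 1, \\
\text{E}\left(r_{ij}^4\right) &= s^2\left(\frac{1}{2s} + \frac{1}{2s}\right) = s.
\end{align*}
Thus $r_{ij}$ has the same mean and variance as $N(0,1)$ for every $s\geq 1$, so that $\text{E}\left(x_j^2\right) = \sum_{i=1}^D \left(u_{1,i}-u_{2,i}\right)^2 = d_{l_2}$ and $\hat{d}_{l_2}$ remains unbiased. The fourth moment $s$ (compared with $3$ for the standard normal) grows with the sparsity: $s = 1$ gives $\pm 1$ entries, while for larger $s$ a fraction $1-\frac{1}{s}$ of the entries are zero.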
306: 
307: When $ 1\leq s\leq3$, \cite{Article:Achlioptas_JCSS03} proved the JL
308: lemma for the above sparse
309: projection, which can also be shown by sub-Gaussian analysis
310: \citep{Report:Li_Hastie_Church_subrp}. 
311: Recently,  \cite{Proc:Li_Hastie_Church_KDD06} proposed {\em very
312:   sparse random projections} using $s = \sqrt{D}$ in
313: (\ref{eqn_subg_rji}), based on two practical considerations:
314: \begin{itemize}
315: \item $D$ should be very large, otherwise
316: there would be no need for dimension reduction. 
317: \item 
318: The original $l_2$ distance should make
319: engineering sense, in that  the second (or higher) moments should be
320: bounded (otherwise various {\em term-weighting} schemes will be
321: applied). 
322: \end{itemize}
323: 
324: Based on these two practical
325: assumptions, the projected data are asymptotically normal at a fast
326: rate of convergence when $s = \sqrt{D}$.  Of course, {\em very sparse
327:   random projections} do not have worst case performance
328: guarantees.
329: 
330: \subsection{Cauchy Random Projections}\label{sec_intro}
331: 
332: In {\em Cauchy random projections}, we sample $r_{ij}$ i.i.d. from the
333: standard Cauchy distribution, i.e., $r_{ij} \sim C(0,1)$. By the 1-stability of Cauchy \citep{Book:Zolotarev_86}, we know that 
334: \begin{align}
335: x_j = v_{1,j} - v_{2,j}  \sim C\left(0,\sum_{i=1}^D|u_{1,i} -
336:   u_{2,i}|\right). 
337: \end{align}
338: \noindent That is, the projected differences $x_j = v_{1,j} - v_{2,j}$ are also
339: Cauchy random variables with the scale parameter being the $l_1$
340: distance, $d = |u_1 - u_2| = \sum_{i=1}^D|u_{1,i} -
341:   u_{2,i}|$, in the original space. 
342: 
343: Recall that a Cauchy random variable $z \sim C(0,\gamma)$ has the density 
344: \begin{align}
345: f(z)  = \frac{\gamma}{\pi} \frac{1}{z^2 + \gamma^2}, \hspace{0.5in}
346: \gamma >0, \hspace{0.2in}  -\infty<z<\infty
347: \end{align}
348: 
349: The easiest way to see the 1-stability is via the characteristic
350: function, 
351: \begin{align}
352: &\text{E}\left(\exp(\sqrt{-1}z_1t)\right) =
353: \exp\left(-\gamma|t|\right),\\
354: &\text{E}\left(\exp\left(\sqrt{-1} t\sum_{i=1}^D c_i z_i\right)\right)
= \exp\left(-\gamma\sum_{i=1}^D|c_i||t|\right),
356: \end{align}
357: \noindent for $z_1$, $z_2$, ..., $z_D$, i.i.d. $C(0,\gamma)$, and
358: any constants $c_1$, $c_2$, ..., $c_D$.
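
The second identity follows from the first by independence; the computation is routine and is spelled out here only for completeness:
\begin{align*}
\text{E}\left(\exp\left(\sqrt{-1}\, t\sum_{i=1}^D c_i z_i\right)\right)
= \prod_{i=1}^D \text{E}\left(\exp\left(\sqrt{-1}\, (c_i t) z_i\right)\right)
= \prod_{i=1}^D \exp\left(-\gamma |c_i t|\right)
= \exp\left(-\gamma |t| \sum_{i=1}^D |c_i|\right),
\end{align*}
which is the characteristic function of $C\left(0,\gamma\sum_{i=1}^D|c_i|\right)$. Taking $\gamma = 1$ and $c_i = u_{1,i} - u_{2,i}$ recovers $x_j \sim C(0,d)$ with $d = \sum_{i=1}^D |u_{1,i}-u_{2,i}|$.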
359: 
360: 
361: Therefore, in {\em Cauchy random projections}, the problem boils down to
362: estimating the Cauchy scale parameter of $C(0,d)$ from $k$
i.i.d.\ samples $x_j \sim C(0,d)$.  Unfortunately, unlike in {\em normal
  random projections}, we can no longer estimate $d$ from the
sample mean (i.e., $\frac{1}{k}\sum_{j=1}^k|x_j|$) because
$\text{E}\left(|x_j|\right) = \infty$.
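
The divergence of the first absolute moment can be seen directly from the Cauchy density. For $x \sim C(0,d)$,
\begin{align*}
\text{E}\left(|x|\right) = \int_0^\infty \frac{2d}{\pi}\frac{z}{z^2+d^2}\, dz
= \frac{d}{\pi}\Big[\log\left(z^2+d^2\right)\Big]_0^\infty = \infty,
\end{align*}
because the integrand decays only like $1/z$. Consequently, the sample mean of $|x_j|$ does not settle to a finite limit as $k$ grows.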
367: 
Although the impossibility results
\citep{Article:Lee_Naor_04,Article:Brinkman_JACM05}
have ruled out estimators that are metrics, there is enough information
to recover $d$ from the $k$
samples $\{x_j\}_{j=1}^k$ with high accuracy.  For
example, \cite{Proc:Indyk_FOCS00} proposed using the sample median as
an estimator. The problems with the sample median estimator are its
inaccuracy at small $k$ and the difficulty of deriving the explicit tail
bounds needed for determining the sample size $k$.
377: 
378: This study focuses on deriving better estimators and explicit tail bounds for
379: {\em Cauchy random projections}. Our main results are summarized in
380: the next section, before we present the detailed derivations. Casual 
381: readers may skip these derivations after Section
382: \ref{sec_results}. 
383: 
384: \section{Main Results}\label{sec_results}
385: 
386:  We propose three types of nonlinear
387: estimators: the bias-corrected sample median estimator
388: ($\hat{d}_{me,c}$), the bias-corrected geometric mean estimator
389: ($\hat{d}_{gm,c}$), and  the bias-corrected maximum likelihood
390: estimator ($\hat{d}_{MLE,c}$). $\hat{d}_{me,c}$ and $\hat{d}_{gm,c}$
391: are asymptotically equivalent but the latter is more accurate at small
392: sample size $k$. In addition, we derive explicit tail bounds for
393: $\hat{d}_{gm,c}$, from which an analog of the Johnson-Lindenstrauss  (JL)
394: lemma for dimension reduction in $l_1$ follows. Asymptotically, both
395: $\hat{d}_{me,c}$ and $\hat{d}_{gm,c}$ are $\frac{8}{\pi^2} \approx
396: 80\%$ efficient compared to the maximum likelihood estimator
397: $\hat{d}_{MLE,c}$. We propose accurate approximations to the
398: distribution and tail bounds of $\hat{d}_{MLE,c}$, while the exact
399: closed-form answers are not attainable. 
400: 
401: \subsection{The Bias-corrected Sample Median Estimator}
402: 
403: Denoted by $\hat{d}_{me,c}$, the bias-corrected sample median
404: estimator is
405: \begin{align}
406: \hat{d}_{me,c} = \frac{\hat{d}_{me}}{b_{me}}, 
407: \end{align}
408: \noindent where
409: \begin{align}
410: \hat{d}_{me} &= \text{median}(|x_j|, j = 1, 2,..., k)\\
411: b_{me}
412: &=
413: \int_0^1\frac{(2m+1)!}{(m!)^2}\tan\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m
414:   dt, \ \ \ k = 2m+1 
415: \end{align}
416: 
417: Here, for convenience, we only consider $k = 2m+1$, $m$ = 1, 2, 3,
418: ... 
419: 
420: 
421: Some key properties of $\hat{d}_{me,c}$: 
422: 
423: \begin{itemize}
\item $\text{E}\left(\hat{d}_{me,c}\right) = d$, i.e., $\hat{d}_{me,c}$
  is unbiased.
426: \item When $k\geq 5$, the variance of $\hat{d}_{me,c}$ is 
427: \begin{align}
428: \text{Var}\left(\hat{d}_{me,c}\right) =
429: d^2\left(\frac{(m!)^2}{(2m+1)!}\frac{\int_0^1
430:   \tan^2\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt}{\left(\int_0^1
431:   \tan\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt\right)^2} -
432: 1\right), \ \ \ \ k\geq5
433: \end{align}
434: $\text{Var}\left(\hat{d}_{me,c}\right) = \infty$ if $k = 3$. 
435: \item As $k \rightarrow \infty$, $\hat{d}_{me,c}$ converges to a
436:   normal in distribution 
437: \begin{align}
438: \sqrt{k}\left(\hat{d}_{me,c}  - d \right)\overset{D}{\Longrightarrow} N\left(0,\frac{\pi^2}{4}d^2\right).
439: \end{align}
440: \end{itemize}
441: 
442: \subsection{The Bias-corrected Geometric Mean Estimator}
443: Denoted by $\hat{d}_{gm,c}$, the bias-corrected geometric mean
444: estimator is defined as 
445: \begin{align}
446: \hat{d}_{gm,c} =
447: \cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k|x_j|^{1/k},
448: \hspace{0.1in} k>1
449: \end{align}
450: 
451: Important properties of $\hat{d}_{gm,c}$ include: 
452: \begin{itemize}
453: \item This estimator is a non-convex norm, i.e., the $l_p$ norm
454:   with $p\rightarrow 0$. 
455: \item It is unbiased, i.e., $\text{E}\left(\hat{d}_{gm,c}\right)
456:   = d$. 
457: \item Its variance is (for $k>2$) 
458: \begin{align}
459: \text{Var}\left(\hat{d}_{gm,c}\right) &= d^2
460: \left(\frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi}{k}\right)}-1
461: \right)
462: = \frac{\pi^2}{4}\frac{d^2}{k} +
463: \frac{\pi^4}{32}\frac{d^2}{k^2}+O\left(\frac{1}{k^3}\right).
464: \end{align}
465: \item For $0\leq \epsilon \leq 1$, its tail bounds can be represented in exponential forms 
466: \begin{align}
467: &\mathbf{Pr}\left(\hat{d}_{gm,c} - d > \epsilon d \right) \leq
468: \exp\left(-k\left(\frac{\epsilon^2}{8(1+\epsilon)}\right)\right)\\
469: &\mathbf{Pr}\left(\hat{d}_{gm,c} - d < -\epsilon d \right) \leq
470: \exp\left(-k\left(\frac{\epsilon^2}{8(1+\epsilon)}\right)\right), \ \
471: \ k \geq \frac{\pi^2}{1.5\epsilon}
472: \end{align}
473: \item These exponential tail bounds yield an analog of the 
474: Johnson-Lindenstrauss  (JL) lemma for dimension reduction in $l_1$:
475: 
476: {\em 
477: If $k \geq \frac{8\left(2\log n -
478:   \log\delta\right)}{\epsilon^2/(1+\epsilon)}\geq \frac{\pi^2}{1.5\epsilon}$, then with probability at
479: least $1-\delta$, one can recover the original $l_1$ distance between
480: any pair of data points (among all $n$ data points) within 
481: $1\pm\epsilon$ ($0\leq
482: \epsilon\leq 1$) fraction of the truth,
483: using $\hat{d}_{gm,c}$, i.e., $|\hat{d}_{gm,c}-d|\leq \epsilon d$. }
484: \end{itemize}
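
The sample size condition in this analog of the JL lemma can be obtained by the same union-bound argument used for $l_2$; the following is a sketch. Combining the two one-sided bounds gives, for $k \geq \frac{\pi^2}{1.5\epsilon}$,
\begin{align*}
\mathbf{Pr}\left(\left|\hat{d}_{gm,c} - d\right| > \epsilon d \right) \leq
2\exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right).
\end{align*}
Requiring this to hold simultaneously for fewer than $\frac{n^2}{2}$ pairs with total failure probability at most $\delta$, it suffices that
\begin{align*}
\frac{n^2}{2}\, 2\exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right) \leq \delta
\quad\Longleftrightarrow\quad
k \geq \frac{8\left(2\log n - \log \delta\right)}{\epsilon^2/(1+\epsilon)}.
\end{align*}
As a numerical illustration, with values chosen only for concreteness, take $n = 1000$, $\delta = 0.05$, and $\epsilon = 0.5$. Then $2\log n - \log\delta \approx 16.811$ and $\epsilon^2/(1+\epsilon) \approx 0.1667$, so $k \geq 8\times 16.811/0.1667 \approx 806.9$, i.e., $k = 807$ suffices. The side condition $k \geq \frac{\pi^2}{1.5\epsilon} \approx 13.2$ is then automatically satisfied.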
485: 
486: \subsection{The Bias-corrected Maximum Likelihood Estimator} 
487: Denoted by $\hat{d}_{MLE,c}$, the bias-corrected maximum likelihood
488: estimator is 
489: \begin{align}
490: \hat{d}_{MLE,c} = \hat{d}_{MLE}\left(1-\frac{1}{k}\right),
491: \end{align}
492: where $\hat{d}_{MLE}$ solves a nonlinear MLE equation 
493: \begin{align}
494: -\frac{k}{\hat{d}_{MLE}} + \sum_{j=1}^k\frac{2\hat{d}_{MLE}}{x_j^2 + \hat{d}_{MLE}^2} = 0.
495: \end{align}
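
For completeness, the equation above is the score equation of the Cauchy likelihood. With the density $f(x;d) = \frac{d}{\pi}\frac{1}{x^2+d^2}$, the log-likelihood of $x_1, x_2, \ldots, x_k$ and its derivative are
\begin{align*}
l(d) &= k\log d - k\log \pi - \sum_{j=1}^k \log\left(x_j^2 + d^2\right), \\
l'(d) &= \frac{k}{d} - \sum_{j=1}^k \frac{2d}{x_j^2+d^2}.
\end{align*}
Setting $l'\left(\hat{d}_{MLE}\right) = 0$ and multiplying by $-1$ gives the displayed equation. An equivalent form, which may be convenient for numerical root finding, is
\begin{align*}
\sum_{j=1}^k \frac{\hat{d}_{MLE}^2}{x_j^2 + \hat{d}_{MLE}^2} = \frac{k}{2}.
\end{align*}
Assuming all $x_j \neq 0$ (which holds with probability one), the left-hand side increases continuously from $0$ to $k$ as $\hat{d}_{MLE}$ ranges over $(0,\infty)$, so the positive root is unique.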
496: 
497: Some properties of $\hat{d}_{MLE,c}$:
498: \begin{itemize}
499: \item It is nearly unbiased, $\text{E}\left(\hat{d}_{MLE,c}\right) = d
500:   + O\left(\frac{1}{k^2}\right)$. 
501: \item Its asymptotic variance is 
502: \begin{align}
503: \text{Var}\left(\hat{d}_{MLE,c}\right) = \frac{2d^2}{k} +
504: \frac{3d^2}{k^2} 
505:   + O\left(\frac{1}{k^3}\right), 
506: \end{align}
507: \noindent i.e.,
508: $\frac{\text{Var}\left(\hat{d}_{MLE,c}\right)}{\text{Var}\left(\hat{d}_{me,c}\right)}
509: \rightarrow \frac{8}{\pi^2}$, $\frac{\text{Var}\left(\hat{d}_{MLE,c}\right)}{\text{Var}\left(\hat{d}_{gm,c}\right)}
510: \rightarrow \frac{8}{\pi^2}$, as $k\rightarrow
511: \infty$. ($\frac{8}{\pi^2} \approx 80\%$) 
512: \item Its distribution can be accurately approximated by an inverse
513:   Gaussian, at least in the small deviation range. Based on the
514:   inverse Gaussian approximation, we suggest the following approximate tail bound
515: \begin{align}
516: &\mathbf{Pr}\left(|\hat{d}_{MLE,c} - d| \geq \epsilon d\right) \overset{\sim}{\leq}
517: 2\exp\left(-\frac{\epsilon^2/(1+\epsilon)}{2 \left(\frac{2}{k} + \frac{3}{k^2}\right)}\right),
518: \hspace{0.15in} 0\leq \epsilon \leq 1, 
519: \end{align}
520: \noindent which has been verified by simulations for the tail
521: probability $\geq 10^{-10}$ range. 
522: \end{itemize}
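
The $80\%$ efficiency figure follows from the leading-order variances stated above. To first order in $\frac{1}{k}$,
\begin{align*}
\text{Var}\left(\hat{d}_{MLE,c}\right) \approx \frac{2d^2}{k}, \hspace{0.3in}
\text{Var}\left(\hat{d}_{me,c}\right) \approx \text{Var}\left(\hat{d}_{gm,c}\right) \approx \frac{\pi^2}{4}\frac{d^2}{k},
\end{align*}
so the ratio is $\frac{2}{\pi^2/4} = \frac{8}{\pi^2} \approx 0.81$. Put differently, to match the asymptotic variance of the MLE based on $k$ samples, the sample median or geometric mean estimator needs about $\frac{\pi^2}{8}k \approx 1.23k$ samples.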
523: 
524: 
525: \section{The Sample Median Estimators}\label{sec_median}
526: 
Recall that in Cauchy random projections, $\mathbf{B} = \mathbf{AR}$, we
528: denote the leading two rows in $\mathbf{A}$ by $u_1$, $u_2$ $\in
529: \mathbb{R}^{D}$, and the leading two rows in $\mathbf{B}$ by $v_1$,
530: $v_2$ $\in \mathbb{R}^{k}$. Our goal is to estimate the $l_1$ distance
531: $d = |u_1 - u_2| = \sum_{i=1}^D |u_{1,i} - u_{2,i}|$ from
532: $\{x_j\}_{j=1}^k$, $x_j = v_{1,j} - v_{2,j} \sim C(0,d)$, i.i.d.
533: 
It is easy to show \citep[e.g.,][]{Proc:Indyk_FOCS00} that the
population median of $|x_j|$ is $d$. Therefore, it is natural to
536: consider estimating $d$ from the sample median,
537: \begin{align} \label{eqn_def_me}
538: \hat{d}_{me} = \text{median}\{|x_j|, j = 1, 2, ..., k\}.
539: \end{align}
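
The claim about the population median can be verified in one line. For $x \sim C(0,d)$ and $t\geq 0$,
\begin{align*}
\mathbf{Pr}\left(|x| \leq t\right) = \int_{-t}^{t} \frac{d}{\pi}\frac{1}{z^2+d^2}\, dz
= \frac{2}{\pi}\arctan\left(\frac{t}{d}\right).
\end{align*}
Setting this equal to $\frac{1}{2}$ gives $\arctan\left(\frac{t}{d}\right) = \frac{\pi}{4}$, i.e., $t = d$.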
540: 
As shown in the following lemma (proved in Appendix \ref{app_proof_lem_me}), the sample median estimator,
542: $\hat{d}_{me}$, is asymptotically  unbiased and normal. For small
543: samples (e.g., $k\leq 20$), however, $\hat{d}_{me}$ is severely
544: biased. 
545: 
546: \begin{lemma} \label{lem_me}
547: The sample median estimator, $\hat{d}_{me}$, defined in
548: (\ref{eqn_def_me}), is asymptotically unbiased and normal 
549: \begin{align}
550: \sqrt{k}\left(\hat{d}_{me}  - d \right)\overset{D}{\Longrightarrow} N\left(0,\frac{\pi^2}{4}d^2\right)
551: \end{align}
552: When $k = 2m+1$, $m$ = 1, 2, 3, ..., the $r^{th}$ moment of
553: $\hat{d}_{me}$ can be represented as 
554: \begin{align}
555: &\text{E}\left(\hat{d}_{me}\right)^r = d^r\left(\int_0^1\frac{(2m+1)!}{(m!)^2}\tan^r\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m
556:   dt\right), \ \ \  m \geq r
557: \end{align}
558: If $m<r$, then $\text{E}\left(\hat{d}_{me}\right)^r = \infty$. \\ \\
559: \end{lemma}
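
The moment formula can be motivated by a standard order-statistics argument; what follows is only a sketch, and the full proof is in Appendix \ref{app_proof_lem_me}. Let $F(z) = \frac{2}{\pi}\arctan\left(\frac{z}{d}\right)$, $z\geq 0$, be the distribution function of $|x_j|$ and let $f$ be its density. For $k = 2m+1$, the sample median is the $(m+1)$th order statistic, whose density is $\frac{(2m+1)!}{(m!)^2}F(z)^m\left(1-F(z)\right)^m f(z)$. Substituting $t = F(z)$, so that $z = d\tan\left(\frac{\pi}{2}t\right)$ and $dt = f(z)\,dz$, we get
\begin{align*}
\text{E}\left(\hat{d}_{me}\right)^r
&= \int_0^\infty z^r \frac{(2m+1)!}{(m!)^2}F(z)^m\left(1-F(z)\right)^m f(z)\, dz \\
&= d^r\int_0^1\frac{(2m+1)!}{(m!)^2}\tan^r\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt.
\end{align*}
Near $t = 1$ the integrand behaves like a constant times $(1-t)^{m-r}$, which is integrable if and only if $m - r > -1$, i.e., $m\geq r$ for integers $m$ and $r$.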
560: 
561: For simplicity, we only consider $k = 2m+1$ when evaluating 
562: $\text{E}\left(\hat{d}_{me}\right)^r$. 
563: 
564: Once we know $\text{E}\left(\hat{d}_{me}\right)$, we can remove the
565: bias of $\hat{d}_{me}$ using 
566: \begin{align}
567: \hat{d}_{me,c} = \frac{\hat{d}_{me}}{b_{me}},
568: \end{align}
569: where the bias correction factor $b_{me}$ is 
570: \begin{align}\label{eqn_bme}
571: b_{me} = \frac{\text{E}\left(\hat{d}_{me}\right)}{d} = \int_0^1\frac{(2m+1)!}{(m!)^2}\tan\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m
572:   dt.
573: \end{align}
574: 
575: $b_{me}$ can be numerically evaluated and tabulated, at least for small
576: $k$.\footnote{It is possible to express $b_{me}$ as an infinite
577:   sum. Note that $\frac{(2m+1)!}{(m!)^2}\left(t-t^2\right)^m$, $0\leq
578:   t\leq 1$, is the probability density of a Beta distribution
579:   $Beta(m+1,m+1)$.}
580: % By Taylor expansion \citep[1.411.6]{Book:Gradshteyn_94},
581: %  $\tan\left(\frac{\pi}{2}t\right) =
582: %  \sum_{j=1}^\infty\frac{2^{2j}\left(2^{2j}-1\right)}{(2j)!}|B_{2j}|\left(\frac{\pi}{2}\right)^{2j-1}t^{2j-1}$, where $B_{2j}$ is the {\em Bernoulli number} \citep[9.61]{Book:Gradshteyn_94}. If $z \sim Beta(m+1,m+1)$, then $\text{E}\left(z^r\right) = \frac{(2m+1)!(m+r)!}{(2m+1+r)!m!}$ (\url{http://mathworld.wolfram.com/BetaDistribution.html}). Therefore, $b_{me} = \sum_{j=1}^\infty\frac{2^{2j}\left(2^{2j}-1\right)}{(2j)!}|B_{2j}|\left(\frac{\pi}{2}\right)^{2j-1} \frac{(2m+1)!(m+2j-1)!}{(2m+2j)!m!}$. }
583: 
By construction, $\hat{d}_{me,c}$ is unbiased, i.e.,
$\text{E}\left(\hat{d}_{me,c}\right) = d$. Its variance is
586: \begin{align}
587: \text{Var}\left(\hat{d}_{me,c}\right) =
588: d^2\left(\frac{(m!)^2}{(2m+1)!}\frac{\int_0^1
589:   \tan^2\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt}{\left(\int_0^1
590:   \tan\left(\frac{\pi}{2}t\right)\left(t-t^2\right)^m dt\right)^2} -
591: 1\right), \ \ \ \ k=2m+1\geq5
592: \end{align}
593: 
Of course, $\hat{d}_{me,c}$ and $\hat{d}_{me}$ are asymptotically
equivalent, i.e.,
$\sqrt{k}\left(\hat{d}_{me,c}  - d \right)\overset{D}{\Longrightarrow}
N\left(0,\frac{\pi^2}{4}d^2\right)$.
598: 
599: Figure \ref{fig_bme} plots $b_{me}$ as a function of $k$, indicating
600: that $\hat{d}_{me}$ is severely biased when $k\leq 20$. When $k>50$,
601: the bias becomes negligible. Note that, because $b_{me}\geq 1$, the bias
602: correction not only removes the bias of $\hat{d}_{me}$ but also
603: reduces its variance. 
604: 
605: 
606: \begin{figure}[h]
607: \begin{center}
608: \includegraphics[width = 2.5in]{fig/me_bias_correction_factor.eps}
609: \end{center}\vspace{-0.3in}
610: \caption{The bias correction factor, $b_{me}$ in (\ref{eqn_bme}), as a function of $k
611:   =2m+1$. After $k>50$, the bias is negligible. Note that
612:   $b_{me}=\infty$ when $k=1$. }\label{fig_bme}
613: \end{figure}
614: 
615: The sample median is a special case of sample quantile estimators
616: \citep{Article:Fama_68,Article:Fama_71}.   For example, one 
617: version of the quantile estimators given by
618: \cite{Article:McCulloch_86} would be
619: \begin{align}
620: \hat{d}_{or} = \frac{\hat{|x|}_{.75} - \hat{|x|}_{.25}}{2.0},
621: \end{align}
622: \noindent where $\hat{|x|}_{.75}$ and $\hat{|x|}_{.25}$ are the .75 and
623: .25 sample quantiles of $\{|x_{j}|\}_{j=1}^k$, respectively. 
624: 
625: Our simulations indicate that $\hat{d}_{me}$ actually slightly outperforms
626: $\hat{d}_{or}$. This is not surprising. $\hat{d}_{or}$ works for any
627: Cauchy distribution whose location parameter does not have to be zero,
628: while $\hat{d}_{me}$ takes advantage of the fact that the
629: Cauchy location parameter is always zero in our case. 
630: 
631: 
632: \section{The Geometric Mean Estimators }\label{sec_gm}
633: 
634: This section derives estimators based on the geometric
635: mean, which are more accurate than the sample median estimators. The
636: geometric mean estimators allow us to derive tail bounds in explicit
637: forms and (consequently) an analog of the
638: Johnson-Lindenstrauss  (JL) lemma for dimension reduction in $l_1$. 
639: 
Recall that our goal is to estimate $d$ from $k$ i.i.d.\ samples $x_j
\sim C(0,d)$. To help derive the geometric mean estimators, we
first study two nonlinear estimators based on the fractional moment, i.e., $\text{E}(|x|^\lambda)$
($|\lambda|<1$), and the logarithmic moment, i.e.,
$\text{E}\left(\log(|x|)\right)$, respectively, as presented in
Lemma \ref{lem_d_log}. See the proof in Appendix \ref{app_proof_lem_d_log}.
646: 
647: \begin{lemma}\label{lem_d_log}
648: Assume $x \sim C(0,d)$. Then
649: \begin{align}
650: &\text{E}\left(|x|^\lambda\right) 
651: =\frac{d^\lambda}{\cos(\lambda\pi/2)}, \hspace{0.5in}|\lambda|<1\\
652: &\text{E}\left(\log(|x|)\right) = \log(d), \\
653: &\text{Var}\left(\log(|x|)\right) = \frac{\pi^2}{4}, 
654: \end{align}
655: \noindent from which we can derive two biased estimators of $d$ from
656: $k$ i.i.d. samples $x_j \sim C(0,d)$:
657: \begin{align}
658: &\hat{d}_\lambda = \left(\frac{1}{k}\sum_{j=1}^k|x_j|^\lambda
659:   \cos(\lambda\pi/2)\right)^{1/\lambda}, \hspace{0.2in} |\lambda| <1,\\
660: &\hat{d}_{log} = \exp\left(\frac{1}{k}\sum_{j=1}^k\log(|x_j|)\right),
661: \end{align}
662: \noindent whose variances are, respectively,
663: \begin{align}
664: &\text{Var}\left(\hat{d}_{\lambda}\right) = \frac{d^2}{k}
665: \frac{\sin^2(\lambda \pi/2)}{\lambda^2 \cos(\lambda\pi)} +
666: O\left(\frac{1}{k^2}\right), \hspace{0.2in} |\lambda| <1/2\\
667: &\text{Var}\left(\hat{d}_{log}\right)  = \frac{\pi^2d^2}{4k} +
668: O\left(\frac{1}{k^2}\right).
669: \end{align}
670: 
671: The term $\frac{\sin^2(\lambda \pi/2)}{\lambda^2 \cos(\lambda\pi)}$
672: decreases with decreasing $|\lambda|$, reaching a limit
673: \begin{align}
674: \underset{\lambda\rightarrow 0}\lim\frac{\sin^2(\lambda
675:   \pi/2)}{\lambda^2 \cos(\lambda\pi)} = \frac{\pi^2}{4}.
676: \end{align}
677: \noindent In other words, the variance of $\hat{d}_{\lambda}$ converges to
678: that of $\hat{d}_{log}$ as $|\lambda|$ approaches zero. 
679: \\
680: \end{lemma}
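
To see how quickly the variance factor of $\hat{d}_{\lambda}$ approaches its limit, one can expand it for small $|\lambda|$. The following is only a heuristic second-order Taylor sketch (the shorthand $a = \lambda\pi/2$ is introduced here for convenience and is not part of the lemma):
\begin{align}\notag
\frac{\sin^2(\lambda \pi/2)}{\lambda^2 \cos(\lambda\pi)}
&= \frac{\pi^2}{4}\,\frac{\sin^2 a}{a^2}\,\frac{1}{\cos(2a)}
= \frac{\pi^2}{4}\left(1 - \frac{a^2}{3} + O\left(a^4\right)\right)\left(1 + 2a^2 + O\left(a^4\right)\right)\\\notag
&= \frac{\pi^2}{4}\left(1 + \frac{5\pi^2}{12}\lambda^2 + O\left(\lambda^4\right)\right).
\end{align}
\noindent For instance, at $\lambda = 0.2$ the exact factor is about $2.95$, compared with the limiting value $\pi^2/4 \approx 2.47$, so the penalty for using a nonzero $\lambda$ is already noticeable at moderate $|\lambda|$.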
681: 
682:  Note that $\hat{d}_{log}$ can in fact be
683: written as the {\em geometric mean}:
684: \begin{align}
685: \hat{d}_{log} = \hat{d}_{gm} = \prod_{j=1}^k|x_j|^{1/k}. 
686: \end{align}
687: 
688: $\hat{d}_{\lambda}$ is a non-convex norm ($l_\lambda$) because $\lambda
689: <1$. $\hat{d}_{gm}$ is also
a non-convex norm (the $l_\lambda$ norm as $\lambda \rightarrow 0$). Neither
$\hat{d}_{\lambda}$ nor $\hat{d}_{gm}$ satisfies the triangle
inequality. 
693: 
694: We propose $\hat{d}_{gm,c}$, the bias-corrected geometric mean
695: estimator. Lemma \ref{lem_d_gm}  derives the moments of
696: $\hat{d}_{gm,c}$, proved in Appendix \ref{app_proof_lem_d_gm}.
697: 
698: \begin{lemma}\label{lem_d_gm}
699: \begin{align}
700: \hat{d}_{gm,c} =
701: \cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k|x_j|^{1/k},
702: \hspace{0.1in} k>1
703: \end{align}
704: is unbiased, with the variance  (valid when $k>2$)
705: \begin{align}
706: \text{Var}\left(\hat{d}_{gm,c}\right) &= d^2
707: \left(\frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi}{k}\right)}-1
708: \right)=\frac{d^2}{k} \frac{\pi^2}{4} +
709: \frac{\pi^4}{32}\frac{d^2}{k^2}+O\left(\frac{1}{k^3}\right).
710: \end{align}
711: 
712: The third and fourth central moments are  (for $k>3$ and $k>4$,
713: respectively) 
714: \begin{align}
715: &\text{E}\left(\hat{d}_{gm,c} -
716:   \text{E}\left(\hat{d}_{gm,c}\right)\right)^3 =
717: \frac{3\pi^4}{16}\frac{d^3}{k^2} + O\left(\frac{1}{k^3}\right) \\
718: &\text{E}\left(\hat{d}_{gm,c} -
719:   \text{E}\left(\hat{d}_{gm,c}\right)\right)^4 =
720: \frac{3\pi^4}{16}\frac{d^4}{k^2} + O\left(\frac{1}{k^3}\right).
721: \end{align}\\
722: \end{lemma}
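
The unbiasedness can be read off from Lemma \ref{lem_d_log}. The short calculation below is a sketch that uses only the independence of the $x_j$ and the fractional moment with $\lambda = 1/k$ (which requires $k>1$); the complete proof is the one in the appendix:
\begin{align}\notag
\text{E}\left(\prod_{j=1}^k|x_j|^{1/k}\right) = \prod_{j=1}^k
\text{E}\left(|x_j|^{1/k}\right) =
\left(\frac{d^{1/k}}{\cos\left(\frac{\pi}{2k}\right)}\right)^k =
\frac{d}{\cos^k\left(\frac{\pi}{2k}\right)}, 
\end{align}
\noindent so multiplying the geometric mean by $\cos^k\left(\frac{\pi}{2k}\right)$ removes the bias exactly. Since $k\log\cos\left(\frac{\pi}{2k}\right) = -\frac{\pi^2}{8k} + O\left(\frac{1}{k^3}\right)$, the uncorrected geometric mean overestimates $d$ by a relative amount of about $\frac{\pi^2}{8k}$, i.e., roughly $13\%$ at $k=10$.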
723: 
724: The higher (third or fourth) moments may be useful for approximating
725: the distribution of $\hat{d}_{gm,c}$.  In Section \ref{sec_mle}, we
726: will show how to approximate the distribution of the maximum
727: likelihood estimator by matching the first four moments (in the
leading terms). We could apply a similar technique to approximate
729: $\hat{d}_{gm,c}$. Fortunately, we do not have to do so because we are
730: able to derive the exact tail bounds of $\hat{d}_{gm,c}$ in Lemma
731: \ref{lem_d_gm_tail}, which is proved in Appendix \ref{app_proof_lem_d_gm_tail}.
732: 
733: \begin{lemma}\label{lem_d_gm_tail}
734: \begin{align}\label{eqn_gm_bound}
735: \mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right) \leq
736: \frac{\cos^{kt_1^*}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi
737:       t_1^*}{2k}\right)(1+\epsilon)^{t_1^*}}, \hspace{0.25in} \epsilon \geq0
738: \end{align}
739: \noindent where 
740: \begin{align}
741: t_1^* = \frac{2k}{\pi}\tan^{-1}\left(\left(\log(1+\epsilon) -
742:     k\log\cos\left(\frac{\pi}{2k}\right)\right)\frac{2}{\pi}\right). 
743: \end{align}
744: \begin{align}\label{eqn_gm_bound_left}
745: \mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d \right) \leq
746: \frac{ (1-\epsilon)^{t_2^*}}{\cos^k\left(\frac{\pi
747:       t_2^*}{2k}\right)\cos^{kt_2^*}\left(\frac{\pi}{2k}\right)},
748: \hspace{0.25in} 0\leq \epsilon\leq 1, \hspace{0.1in} k\geq \frac{\pi^2}{8\epsilon}
749: \end{align}
750: \noindent where 
751: \begin{align}
752: t_2^* = \frac{2k}{\pi}\tan^{-1}\left(\left(-\log(1-\epsilon) +
753:     k\log\cos\left(\frac{\pi}{2k}\right)\right)\frac{2}{\pi}\right). 
754: \end{align}
755: 
756: 
757: By restricting $0\leq\epsilon\leq 1$, the tail bounds can be written
758: in exponential forms: 
759: \begin{align}\label{eqn_exp_right}
760: &\mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right) \leq
761: \exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right) \\
762: &\mathbf{Pr}\left(\hat{d}_{gm,c} \leq (1-\epsilon)d \right) \leq
763: \exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right), \hspace{0.2in} k\geq \frac{\pi^2}{1.5\epsilon}\label{eqn_exp_left}
764: \end{align}\\
765: \end{lemma}
766: 
767: An analog of the JL bound for $l_1$ follows from the exponential tail
768: bounds (\ref{eqn_exp_right}) and 
769: (\ref{eqn_exp_left}). 
770: \begin{lemma}\label{lem_JL_l1}
771: Using $\hat{d}_{gm,c}$ with $k \geq \frac{8\left(2\log n -
772:   \log\delta\right)}{\epsilon^2/(1+\epsilon)} \geq
773: \frac{\pi^2}{1.5\epsilon}$, then with probability at
774: least $1-\delta$, the $l_1$ distance, $d$, between
775: any pair of data points (among $n$ data points), can be estimated with
776: errors bounded by $\pm \epsilon d$, i.e., $|\hat{d}_{gm,c} - d| \leq
777: \epsilon d$.  
778: \end{lemma}
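
A sketch of why this sample size suffices (our reading of the standard union-bound argument, offered as an illustration rather than a formal proof): there are $\frac{n(n-1)}{2} < \frac{n^2}{2}$ pairs, and for each pair the bounds (\ref{eqn_exp_right}) and (\ref{eqn_exp_left}) together give a failure probability of at most $2\exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right)$. Hence
\begin{align}\notag
\frac{n^2}{2}\times 2\exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right) \leq \delta
\hspace{0.2in} \Longleftrightarrow \hspace{0.2in}
k \geq \frac{8\left(2\log n - \log\delta\right)}{\epsilon^2/(1+\epsilon)}.
\end{align}
\noindent As a purely illustrative example (the values are hypothetical), take $n = 1000$, $\delta = 0.05$, and $\epsilon = 0.5$. Then $2\log n - \log \delta \approx 13.82 + 3.00 = 16.81$ and $\epsilon^2/(1+\epsilon) = 1/6$, so $k \geq 8\times 16.81\times 6 \approx 807$, which also satisfies $k \geq \frac{\pi^2}{1.5\epsilon} \approx 13.2$.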
779: 
780: \textbf{Remarks on Lemma \ref{lem_JL_l1}}: (1) We can replace the constant ``8'' in Lemma
781: \ref{lem_JL_l1} with better (i.e., smaller) constants for 
specific values of $\epsilon$. For example, if $\epsilon = 0.2$, we can
783: replace ``8'' by ``5''. See the proof of Lemma \ref{lem_d_gm_tail}. 
784: (2) This Lemma is weaker than the classical JL Lemma for
785: dimension reduction in $l_2$ as reviewed in Section 2.1. The classical
786: JL Lemma for $l_2$ ensures that the $l_2$ inter-point distances of the
787: projected data points are close enough to the original $l_2$
788: distances, while Lemma
789: \ref{lem_JL_l1} merely says that the projected data points contain
790: enough information to reconstruct the original $l_1$ distances.  On
791: the other hand, the geometric mean estimator is a non-convex
792: norm; and therefore it does contain some information about the
793: geometry. We leave it for future work to explore the possibility of
794: developing efficient algorithms using the geometric mean estimator. \\
795: 
796: 
797: Figure \ref{fig_hist_d_gm}   presents the simulated histograms of $\hat{d}_{gm,c}$
798: for $d=1$, with $k = 5$ and $k=50$. The histograms reveal some
799: characteristics shared by the maximum likelihood estimator  we will
800: discuss in the next section: 
801: \begin{itemize}
802: \item Supported on $[0,\infty)$, $\hat{d}_{gm,c}$ is positively
803: skewed. 
804: \item The distribution of $\hat{d}_{gm,c}$ is still
805:   ``heavy-tailed.'' However, in the region not too far from the mean, the distribution of $\hat{d}_{gm,c}$ may be
806:   well captured by a gamma (or a generalized gamma) distribution. For large $k$, even a 
807: normal  approximation may suffice. 
808: \end{itemize}
809: \begin{figure}[h]
810: \begin{center}\mbox{
811: \subfigure[$k=5$]{\includegraphics[width = 2.5in]{fig/hist_gm5.eps}}
812: \subfigure[$k=50$]{\includegraphics[width = 2.5in]{fig/hist_gm50.eps}}}
813: \end{center}\vspace{-0.4in}
814: \caption{Histograms of $\hat{d}_{gm,c}$, obtained from $10^6$ simulations. At
815:   least in the range not too far from the mean, the
816:   distribution of $\hat{d}_{gm,c}$ resembles a gamma and also resembles
817: a normal when $k$ is large enough. }\label{fig_hist_d_gm}
818: \end{figure}
819: 
820: 
821: Figure \ref{fig_gm_vs_me} compares $\hat{d}_{gm,c}$ with the sample median estimators $\hat{d}_{me}$ and
822: $\hat{d}_{me,c}$, in terms of the mean square errors.  $\hat{d}_{gm,c}$ is considerably more accurate than
823: $\hat{d}_{me}$ at small $k$. The bias correction significantly reduces
824: the mean square errors of $\hat{d}_{me}$.
825: \begin{figure}[h]
826: \begin{center}
827: \includegraphics[width = 2.5in]{fig/me_gm_mse_ratio.eps}
828: \end{center}\vspace{-0.25in}
\caption{ The ratios of the mean square errors (MSE),
830:   $\frac{\text{MSE}(\hat{d}_{me})}{\text{MSE}(\hat{d}_{gm,c})}$ and
831:   $\frac{\text{MSE}(\hat{d}_{me,c})}{\text{MSE}(\hat{d}_{gm,c})}$,
832:   demonstrate that the bias-corrected geometric mean estimator
833:   $\hat{d}_{gm,c}$ is considerably more accurate than the sample
834:   median estimator $\hat{d}_{me}$. The bias correction on
835:   $\hat{d}_{me}$ considerably reduces the MSE. Note that when $k=3$, the ratios are $\infty$. }\label{fig_gm_vs_me}
836: \end{figure}
837: 
838: 
839: 
840: \section{The Maximum Likelihood Estimators}\label{sec_mle}
841: 
This section is devoted to analyzing the maximum likelihood
estimators (MLE), which are ``asymptotically optimum.'' In comparison, 
the sample median estimators and geometric mean estimators are
not optimum.  Our contributions in this section include the higher-order
analysis of the bias and moments, and accurate closed-form
approximations to the distribution of the MLE. 
848: 
849: 
850: 
851: The method of maximum likelihood is widely used.  For example, \cite{Proc:Li_Hastie_Church_COLT06} applied the maximum likelihood method to {\em normal random
852:   projections} and provided an improved estimator of the
853: $l_2$ distance by taking advantage of the marginal information. 
854: 
855: 
856: The Cauchy distribution is often considered a ``challenging''
857: example because of the ``multiple
858: roots'' problem when estimating the location
859: parameter \citep{Article:Barnett_66,Article:Haas_70}. In our case, since
860: the location parameter is always zero, much of the difficulty is avoided. 
861: 
Recall that our goal is to estimate $d$ from $k$ i.i.d. samples
$x_j \sim C(0,d), j = 1, 2,..., k$. The joint
log-likelihood of $\{x_j\}_{j=1}^k$ is  
865: \begin{align}
866: L(x_1,x_2,...x_k;d) = k\log(d) - k\log(\pi) - \sum_{j=1}^k\log(x_j^2+d^2),
867: \end{align}
868: \noindent whose first and second derivatives (w.r.t. $d$) are
869: \begin{align}
870: &L^\prime(d) = \frac{k}{d} - \sum_{j=1}^k\frac{2d}{x_j^2+d^2},\\
871: &L^{\prime\prime}(d) = -\frac{k}{d^2} -
872: \sum_{j=1}^k\frac{2x_j^2-2d^2}{(x_j^2+d^2)^2} =
873: - \frac{ L^\prime(d)}{d}  - 4\sum_{j=1}^k\frac{x_j^2}{(x_j^2+d^2)^2}.
874: \end{align}
875: 
876: The maximum likelihood estimator of $d$, denoted by $\hat{d}_{MLE}$, is 
877: the solution  to $L^\prime(d) = 0$, i.e., 
878: \begin{align}\label{eqn_mle}
879: -\frac{k}{\hat{d}_{MLE}}+\sum_{j=1}^k\frac{2\hat{d}_{MLE}}{x_j^2+\hat{d}_{MLE}^2} = 0.
880: \end{align}
\noindent At any solution of (\ref{eqn_mle}) we have $L^\prime = 0$, so the second form of $L^{\prime\prime}$ gives $L^{\prime\prime}(\hat{d}_{MLE}) = -4\sum_{j=1}^k\frac{x_j^2}{(x_j^2+\hat{d}_{MLE}^2)^2} \leq 0$. Hence $\hat{d}_{MLE}$ indeed maximizes the joint likelihood and is the
only solution to the MLE equation (\ref{eqn_mle}). Solving
(\ref{eqn_mle}) numerically is not difficult (e.g., a few iterations
of Newton's method). For better accuracy, we
recommend the following bias-corrected estimator:
886: \begin{align}
887: \hat{d}_{MLE,c} = \hat{d}_{MLE}\left(1-\frac{1}{k}\right).
888: \end{align}
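\noindent As an illustration of the numerical solution, one possible Newton scheme is sketched below. The iteration index $(t)$, the starting value, and the stopping rule are our illustrative assumptions rather than a prescribed procedure:
\begin{align}\notag
d^{(t+1)} = d^{(t)} -
\frac{L^\prime\left(d^{(t)}\right)}{L^{\prime\prime}\left(d^{(t)}\right)},
\hspace{0.2in} d^{(0)} = \hat{d}_{gm,c} \ \ \text{(or } \hat{d}_{me}\text{)},
\end{align}
\noindent with $L^\prime$ and $L^{\prime\prime}$ as given above, iterating until $|d^{(t+1)} - d^{(t)}|/d^{(t)}$ falls below a chosen tolerance and then applying the factor $\left(1-\frac{1}{k}\right)$. Because $L^{\prime\prime}(d)$ is not negative for all $d$ (for large $d$ it behaves like $k/d^2$), a simple safeguard, e.g., shrinking the step whenever an iterate becomes non-positive, may be advisable when the starting value is poor.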
889: 
890: Lemma  \ref{lem_mle_asymp} concerns the asymptotic moments of $\hat{d}_{MLE}$ and $\hat{d}_{MLE,c}$, proved in Appendix
891: \ref{app_proof_lem_asymp}. 
892: \begin{lemma}\label{lem_mle_asymp}
893: Both $\hat{d}_{MLE}$ and $\hat{d}_{MLE,c}$ are asymptotically unbiased and
894: normal. The first four moments of $\hat{d}_{MLE}$ are
895: \begin{align}
896: &\text{E}\left(\hat{d}_{MLE} - d\right) = \frac{d}{k}+ O\left(\frac{1}{k^2}\right) \\
897: &\text{Var}\left(\hat{d}_{MLE}\right) = \frac{2d^2}{k} + \frac{7d^2}{k^2} +O\left(\frac{1}{k^3}\right)\\
898: &\text{E}\left(\hat{d}_{MLE} - \text{E}(\hat{d}_{MLE})\right)^3 = \frac{12d^3}{k^2} +
899: O\left(\frac{1}{k^3}\right) \\
900: &\text{E}\left(\hat{d}_{MLE} - \text{E}(\hat{d}_{MLE})\right)^4 = \frac{12d^4}{k^2} +
901: \frac{222d^4}{k^3} + O\left(\frac{1}{k^4}\right)
902: \end{align}
903: The first four moments of $\hat{d}_{MLE,c}$ are
904: \begin{align}
905: &\text{E}\left(\hat{d}_{MLE,c} - d\right) =
906: O\left(\frac{1}{k^2}\right) \\
907: &\text{Var}\left(\hat{d}_{MLE,c}\right) = \frac{2d^2}{k} +
908: \frac{3d^2}{k^2}+O\left(\frac{1}{k^3}\right)  \\
909: &\text{E}\left(\hat{d}_{MLE,c} - \text{E}(\hat{d}_{MLE,c})\right)^3 = \frac{12d^3}{k^2} +
910: O\left(\frac{1}{k^3}\right) \\
911: &\text{E}\left(\hat{d}_{MLE,c} - \text{E}(\hat{d}_{MLE,c})\right)^4 = \frac{12d^4}{k^2} +
912: \frac{186d^4}{k^3} + O\left(\frac{1}{k^4}\right) 
913: \end{align}\\
914: \end{lemma}
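
The first two moments of $\hat{d}_{MLE,c}$ follow directly from those of $\hat{d}_{MLE}$, since $\hat{d}_{MLE,c}$ is a deterministic rescaling. The following short check is a sketch of that step only:
\begin{align}\notag
\text{E}\left(\hat{d}_{MLE,c}\right) &= \left(1-\frac{1}{k}\right)\left(d +
  \frac{d}{k} + O\left(\frac{1}{k^2}\right)\right) = d +
O\left(\frac{1}{k^2}\right), \\\notag
\text{Var}\left(\hat{d}_{MLE,c}\right) &=
\left(1-\frac{1}{k}\right)^2\left(\frac{2d^2}{k} + \frac{7d^2}{k^2}
  +O\left(\frac{1}{k^3}\right)\right) = \frac{2d^2}{k} +
\frac{(7-4)d^2}{k^2} + O\left(\frac{1}{k^3}\right).
\end{align}
\noindent In other words, the correction removes the $O\left(\frac{1}{k}\right)$ bias and, as a side effect, reduces the second-order variance term from $\frac{7d^2}{k^2}$ to $\frac{3d^2}{k^2}$.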
915: 
916: The order $O\left(\frac{1}{k}\right)$ term of the
917: variance, i.e.,  $\frac{2d^2}{k}$, is known, e.g.,
918:  \citep{Article:Haas_70}.  We derive the  bias-corrected estimator, $\hat{d}_{MLE,c}$,  and the higher order moments using stochastic Taylor
919: expansions \citep{Article:Bartlett_53,Article:Shenton_63,Article:Ferrari_96,Article:Cysneiros_01}.
920: 
921: We will propose an inverse Gaussian distribution to approximate the
922: distribution of $\hat{d}_{MLE,c}$, by matching the first four moments
923: (at least in the leading terms). 
924: 
925: \subsection{A Numerical Example}
926: %\vspace{-0.1in}
927: The maximum likelihood estimators are tested on MSN Web crawl
928: data, a term-by-document matrix with
929: $D=2^{16}$ Web pages. We conduct Cauchy random 
930: projections and estimate the $l_1$ distances
931: between words.  In this experiment, we compare the empirical and
932: (asymptotic) theoretical moments, using one pair of words. Figure \ref{fig_bias_var} illustrates that the bias correction is
933: effective and these (asymptotic) formulas for the first four moments
934: of $\hat{d}_{MLE,c}$ in Lemma \ref{lem_mle_asymp} are accurate, especially when $k\geq 20$.\vspace{-0.25in}
935: \begin{figure}[h]
936: \begin{center}\mbox{
\subfigure[{\scriptsize $\text{E}(\hat{d}_{MLE}-d)/d$ vs. $\text{E}(\hat{d}_{MLE,c}-d)/d$}]{\includegraphics[width = 2.25in]{fig/bias55.eps}}
938: \subfigure[{\scriptsize $\left(\text{E}(\hat{d}_{MLE,c}-\text{E}(\hat{d}_{MLE,c}))^2/d^2\right)^{1/2}$}]{\includegraphics[width = 2.25in]{fig/var55.eps}}}\vspace{-0.3in}
939: \mbox{
940: \subfigure[{\scriptsize $\left(\text{E}(\hat{d}_{MLE,c}-\text{E}(\hat{d}_{MLE,c}))^3/d^3\right)^{1/3}$}]{\includegraphics[width = 2.25in]{fig/third55.eps}}
941: \subfigure[{\scriptsize $\left(\text{E}(\hat{d}_{MLE,c}-\text{E}(\hat{d}_{MLE,c}))^4/d^4\right)^{1/4}$}]{\includegraphics[width = 2.25in]{fig/fourth55.eps}}}
942: \end{center}\vspace{-0.45in}
\caption{One pair of words is selected from an MSN term-by-document
944:   matrix with $D=2^{16}$ Web pages. We conduct Cauchy random
945:   projections and estimate the $l_1$ distance between one pair of words using the maximum
946:   likelihood estimator $\hat{d}_{MLE}$ and the bias-corrected version
947:   $\hat{d}_{MLE,c}$. Panel (a)
948:   plots the biases of $\hat{d}_{MLE}$ and $\hat{d}_{MLE,c}$, indicating that
949:   the bias correction is effective. Panels (b), (c), and
950:   (d) plot the variance, third moment, and fourth moment of
951:   $\hat{d}_{MLE,c}$, respectively. The dashed curves are the theoretical
952:   asymptotic moments. When $k\geq 20$,
953:   the theoretical asymptotic formulas for moments are accurate.}\label{fig_bias_var}\vspace{-0.1in}
954: \end{figure}
955: 
956: \subsection{Approximation Distributions}
957: 
958: Theoretical analysis on the exact distribution of a maximum likelihood
959: estimator is difficult.\footnote{In fact, conditional on the observations $x_1$,
960:   $x_2$, ..., $x_k$, the distribution of $\hat{d}_{MLE}$ can be exactly
961:   characterized \citep{Article::Fisher_34}.  \cite{Article:Lawless_72}
962:   studied the conditional confidence interval of the MLE. Later, 
963:    \cite{Article:Hinkley_78} proposed the normal approximation to the exact
964: conditional confidence interval and showed that it was superior to the
unconditional normality approximation. Unfortunately, we cannot take advantage of the conditional
966: analysis because our goal is to determine the sample size $k$ before
967: seeing any samples. } In statistics, the standard
968: approach is to assume normality, which, however, is quite
969: inaccurate. The so-called {\em Edgeworth expansion}\footnote{The so-called {\em Saddlepoint approximation} in general improves
970: Edgeworth expansions \citep{Book:Jensen_95}, often very
considerably. Unfortunately, we cannot apply the Saddlepoint
972: approximation in our case (at least not directly), because the
973: Saddlepoint approximation needs a bounded moment generating
974: function.} improves the
975: normal approximation by matching higher moments
\citep{Book:Feller_II,Article:Bhattacharya_78, Book:Severini_00}. For
example, if we approximate the distribution of $\hat{d}_{MLE,c}$ using
an Edgeworth expansion by matching the first four moments of
$\hat{d}_{MLE,c}$ derived in Lemma \ref{lem_mle_asymp}, then the errors
 will be on the order of $O\left(k^{-3/2}\right)$. However, Edgeworth
 expansions have some well-known drawbacks. The resultant
 expressions are quite complicated. They are not accurate in
 the tails. The approximate probability can even take values
 below zero. Also, Edgeworth expansions assume that the support is
 $(-\infty, \infty)$, while  $\hat{d}_{MLE,c}$ is 
 non-negative. 
987: 
988: 
989: 
990: We propose approximating the distributions of
991: $\hat{d}_{MLE,c}$ directly using some well-studied common
992: distributions. We will first consider a gamma distribution with the
993: same first two (asymptotic) moments of $\hat{d}_{MLE,c}$. That is, the
994: gamma distribution will be asymptotically equivalent to the normal
995: approximation. While a normal has zero third
996: central moment, a gamma has nonzero third central moment. This, to an
extent, speeds up the rate of convergence. Another important reason
why a gamma is more accurate is that it has the same support as
$\hat{d}_{MLE,c}$, i.e., $[0,\infty)$. 
1000: 
1001: We will furthermore consider a {\em   generalized gamma} distribution,
1002: which allows us to match the first 
1003: three (asymptotic) moments of $\hat{d}_{MLE,c}$.  Interestingly, in
1004: this case, the generalized gamma approximation turns out to be an
1005: inverse Gaussian distribution, which has a closed-form probability density. More
1006: interestingly, this inverse Gaussian distribution also 
1007: matches the fourth central moment of $\hat{d}_{MLE,c}$ in the
1008: $O\left(\frac{1}{k^2}\right)$ term and almost in the
1009: $O\left(\frac{1}{k^3}\right)$ term. By simulations, the inverse
1010: Gaussian approximation is highly accurate. 
1011: 
1012: Note that, since we are interested in the very small (e.g., $10^{-10}$) tail probability
1013: range, $O\left(k^{-3/2}\right)$ is not too meaningful. For example,
1014: $k^{-3/2} = 10^{-3}$ if $k = 100$. Therefore, we will have to
1015: rely on simulations to assess the accuracy of the approximations. On
1016: the other hand, an upper
1017: bound may hold exactly (verified by simulations) even if it is based
1018: on an approximate distribution. 
1019: 
As related work, \cite{Article:Li_SINR06} applied gamma and generalized gamma 
1021: approximations to model the performance measure distribution in some
1022: wireless communication channels using random matrix theory and
1023: produced  accurate results in evaluating the error probabilities. 
1024:  
1025: \subsubsection{The Gamma Approximation}
1026: 
1027: The gamma approximation is an obvious improvement over the normal
1028: approximation.\footnote{In {\em normal random projections} for
1029:   dimension reduction in $l_2$, the resultant estimator of the squared
1030:   $l_2$
1031:   distance has a chi-squared distribution (e.g., \cite[Lemma
1032:   1.3]{Book:Vempala}), which is a special case of gamma.} 
1033: A gamma distribution, $G(\alpha,\beta)$, has two parameters, $\alpha$
1034: and $\beta$, which can be determined by matching the first two
1035: (asymptotic) moments of $\hat{d}_{MLE,c}$. That is, we assume that $\hat{d}_{MLE,c} \sim G(\alpha, \beta)$, with 
1036: \begin{align}
1037: &\alpha\beta = d, \hspace{0.25in} \alpha\beta^2 = \frac{2d^2}{k} +
1038: \frac{3d^2}{k^2}, \ \ \ 
1039: \Longrightarrow \  \
1040: \alpha = \frac{1}{\frac{2}{k} + \frac{3}{k^2}}, \hspace{0.25in} \beta = \frac{2d}{k} + \frac{3d}{k^2}.
1041: \end{align}
1042: 
1043: Assuming a gamma distribution, it is easy to obtain the following
1044: Chernoff bounds\footnote{Using the Chernoff inequality
1045:   \citep{Article:Chernoff_52}, we bound the tail probability by 
1046: $\mathbf{Pr}\left(Q>z\right) = \mathbf{Pr}\left(e^{Qt}>e^{zt}\right)
1047: \leq \text{E}\left(e^{Qt}\right)e^{-zt}$; and we then choose $t$ that minimizes
1048: the upper bound.}: 
1049: \begin{align}\label{eqn_gamma_right}
1050: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \geq  (1+\epsilon)
1051:   d\right)  \overset{\sim}{\leq} \exp\left(-\alpha\left(\epsilon -
1052:     \log(1+\epsilon)\right)\right), \hspace{0.2in} \epsilon \geq 0 \\
1053: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \leq (1-\epsilon)
1054:   d\right)  \overset{\sim}{\leq} \exp\left(-\alpha\left(-\epsilon -
1055:     \log(1-\epsilon)\right)\right), \hspace{0.2in} 0\leq \epsilon <
1056: 1\label{eqn_gamma_left}, 
1057: \end{align}
1058: \noindent where we use $\overset{\sim}{\leq}$ to indicate that these
1059: inequalities are based on an approximate distribution. 
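
For completeness, the right-tail bound (\ref{eqn_gamma_right}) can be obtained as follows. This is a sketch under the assumption that $G(\alpha,\beta)$ is parameterized so that its moment generating function is $(1-\beta t)^{-\alpha}$ for $t < 1/\beta$, which is consistent with the mean $\alpha\beta$ and variance $\alpha\beta^2$ used above; $Q$ denotes a generic $G(\alpha,\beta)$ variable with $\alpha\beta = d$:
\begin{align}\notag
\mathbf{Pr}\left(Q \geq (1+\epsilon)d\right) &\leq (1-\beta
t)^{-\alpha}\exp\left(-(1+\epsilon)\alpha\beta t\right), \hspace{0.2in}
0 < t < 1/\beta.
\end{align}
\noindent Setting the derivative of the logarithm to zero gives $\beta t = \frac{\epsilon}{1+\epsilon}$, at which point the bound equals
\begin{align}\notag
\exp\left(\alpha\log(1+\epsilon) - \alpha\epsilon\right) =
\exp\left(-\alpha\left(\epsilon - \log(1+\epsilon)\right)\right).
\end{align}
\noindent The left-tail bound (\ref{eqn_gamma_left}) follows in the same way with $t<0$.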
1060: 
1061: Note that the distribution of $\hat{d}_{MLE}/d$ (and hence $\hat{d}_{MLE,c}/d$) is only a function of
1062: $k$ as shown in \citep{Article:Antle_69,Article:Haas_70}. Therefore, we
1063: can evaluate the accuracy of the gamma approximation by simulations
1064: with $d = 1$, as presented in Figure \ref{fig_gamma_tail}. 
1065: 
1066: 
1067: \begin{figure}[h]
1068: \begin{center}\mbox{
1069: \subfigure[]{\includegraphics[width = 2.8in]{fig/gamma10.eps}}
1070: \subfigure[]{\includegraphics[width = 2.8in]{fig/gbound10.eps}}}
1071: \end{center}\vspace{-0.4in}
1072: \caption{ We consider $k$ = 10, 20, 50, 100, 200, and 400. For each $k$, we
1073:   simulate standard Cauchy samples, from which we
1074:   estimate the Cauchy parameter by the MLE $\hat{d}_{MLE,c}$ and compute the tail
1075: probabilities. Panel (a) compares the empirical tail probabilities
1076: (thick solid) with
1077: the gamma tail probabilities (thin solid), indicating that the gamma distribution
1078: is better than the
1079: normal  (dashed) for approximating the distribution of
1080: $\hat{d}_{MLE,c}$.  Panel (b) compares the empirical tail
1081: probabilities with the gamma upper bound
1082: (\ref{eqn_gamma_right})+(\ref{eqn_gamma_left}).  }\label{fig_gamma_tail}
1083: \end{figure}
1084: 
1085: Figure \ref{fig_gamma_tail}(a) shows that both the gamma and
1086: normal approximations are fairly accurate when the tail probability $\geq
1087: 10^{-2}\sim 10^{-3}$; and the gamma approximation is  obviously
1088: better. 
1089: 
1090: Figure \ref{fig_gamma_tail}(b) compares the empirical tail probabilities with the 
1091: gamma Chernoff upper bound
1092: (\ref{eqn_gamma_right})+(\ref{eqn_gamma_left}), indicating that these bounds are reliable, when the tail probability $\geq
1093: 10^{-5}\sim 10^{-6}$. 
1094: 
1095: 
1096: \subsubsection{The Inverse Gaussian  (Generalized Gamma) Approximation}
1097: 
1098: The distribution of $\hat{d}_{MLE,c}$ can be well
1099: approximated by an inverse Gaussian distribution, which is a special
1100: case of the three-parameter generalized gamma distribution
1101:  \citep{Article:Hougaard_86,Article:Gerber}, denoted by $GG(\alpha, \beta,
1102: \eta)$. Note that the usual gamma distribution is a special case
1103: with $\eta = 1$. 
1104: 
1105: If $z \sim GG(\alpha, \beta, \eta)$, then the first
1106: three moments are 
1107: \begin{align}
1108: \text{E}(z) = \alpha\beta, \hspace{0.2in} \text{Var}(z) =
1109: \alpha\beta^2, \hspace{0.2in} \text{E}\left(z - \text{E}(z)\right)^3 =
1110: \alpha\beta^3(1+\eta). 
1111: \end{align}
1112: 
1113: We can approximate the distribution of $\hat{d}_{MLE,c}$ by matching the
1114: first three moments, i.e., 
1115: \begin{align}
1116: \alpha\beta = d, \hspace{0.2in} \alpha\beta^2 = \frac{2d^2}{k} +
1117: \frac{3d^2}{k^2}, \hspace{0.2in} \alpha\beta^3(1+\eta) =
1118: \frac{12d^3}{k^2}, 
1119: \end{align}
1120: \noindent from which we obtain
1121: \begin{align}
1122: \alpha = \frac{1}{\frac{2}{k} + \frac{3}{k^2}}, \hspace{0.2in} \beta
1123: = \frac{2d}{k} + \frac{3d}{k^2}, \hspace{0.2in} \eta = 2 +
1124: O\left(\frac{1}{k}\right). \label{eqn_ig_parameters}
1125: \end{align}
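\noindent To make the $O\left(\frac{1}{k}\right)$ term in $\eta$ explicit, the following sketch simply solves the third moment equation using $\alpha\beta^3 = (\alpha\beta)\beta^2 = d\beta^2$:
\begin{align}\notag
1+\eta = \frac{12d^3/k^2}{d\beta^2} =
\frac{12}{\left(2+\frac{3}{k}\right)^2} = 3 - \frac{9}{k} +
O\left(\frac{1}{k^2}\right), \hspace{0.2in} \text{i.e.,} \hspace{0.1in}
\eta = 2 - \frac{9}{k} + O\left(\frac{1}{k^2}\right).
\end{align}
\noindent Thus replacing $\eta$ by its leading term $2$ perturbs the matched third moment only at order $O\left(\frac{1}{k^3}\right)$.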
1126: Taking only the leading term for $\eta$, the generalized gamma
1127: approximation of $\hat{d}_{MLE,c}$ would be 
1128: \begin{align}
1129: GG\left(\frac{1}{\frac{2}{k} + \frac{3}{k^2}}, \frac{2d}{k} +
1130:   \frac{3d}{k^2}, 2\right). \label{eqn_ig}
1131: \end{align}
1132: 
1133: In general, a generalized gamma distribution does not have a closed-form
density function although it always has a closed-form moment generating
1135: function.  In our case, (\ref{eqn_ig}) is actually an
1136: inverse Gaussian distribution, which has a closed-form density
1137: function. Assuming $\hat{d}_{MLE,c} \sim IG(\alpha, \beta)$,
1138: with parameters $\alpha$ and
1139: $\beta$ defined in (\ref{eqn_ig_parameters}), the moment
1140: generating function (MGF), the probability density
function (PDF), and cumulative distribution function (CDF) would
be \citep[Chapter 2]{Book:Seshadri_93} \citep{Article:Tweedie_57I,Article:Tweedie_57II}\footnote{The inverse Gaussian distribution was first noted as the
  distribution of the first passage time of the Brownian motion with a
  positive drift. It has many interesting properties, such as
  infinite divisibility. Two monographs
1146:    \citep{Book:Chhikara_89,Book:Seshadri_93} are devoted entirely to the
1147:   inverse Gaussian distributions. For a quick reference, one can check
1148: {\it http://mathworld.wolfram.com/InverseGaussianDistribution.html}.}
1149: \begin{align}
1150: &\text{E}\left(\exp(\hat{d}_{MLE,c}t)\right) \overset{\sim}{=}
1151: \exp\left(\alpha\left(1-(1-2\beta t)^{1/2}\right)\right),\\
1152: &\mathbf{Pr}(\hat{d}_{MLE,c} = y)\overset{\sim}{=} \frac{\alpha \sqrt{\beta}}{\sqrt{2\pi}}
1153: y^{-\frac{3}{2}} \exp\left(-\frac{\left(y/\beta -
1154:       \alpha\right)^2}{2y/\beta}\right) = \sqrt{\frac{\alpha d}{2\pi}}y^{-\frac{3}{2}} \exp\left(-\frac{\left(y-d\right)^2}{2y\beta}\right),\\ \notag
1155: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \leq y\right) \overset{\sim}{=}
1156: \Phi\left(\sqrt{\frac{\alpha^2\beta}{y}}\left(\frac{y}{\alpha\beta} -1
1157:     \right)\right) + e^{2\alpha}
1158:   \Phi\left(-\sqrt{\frac{\alpha^2\beta}{y}}\left(\frac{y}{\alpha\beta}
1159:       +1 
1160:     \right)\right)\\
1161: &\hspace{1.1in}= 
1162: \Phi\left(\sqrt{\frac{\alpha d}{y}}\left(\frac{y}{d} -1
1163:     \right)\right) + e^{2\alpha}
1164:   \Phi\left(-\sqrt{\frac{\alpha d}{y}}\left(\frac{y}{d}
1165:       +1 
1166:     \right)\right), 
1167: \end{align}
1168: \noindent where $\Phi(.)$ is the standard normal CDF, i.e., $\Phi(z) =
1169: \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt$. Here we
1170: use $\overset{\sim}{=}$ to indicate that these equalities are based on
1171: an approximate distribution. 
1172: 
1173: 
1174: Assuming $\hat{d}_{MLE,c} \sim
1175: IG(\alpha,\beta)$, then the fourth central moment should be 
1176: \begin{align}\notag
1177: \text{E}\left(\hat{d}_{MLE,c} - \text{E}\left(\hat{d}_{MLE,c}\right)\right)^4 &\overset{\sim}{=}
1178: 15\alpha\beta^4+ 3\left(\alpha\beta^2\right)^2 \\\notag
1179: &=15d\left(\frac{2d}{k}+\frac{3d}{k^2}\right)^3 +
1180: 3\left(\frac{2d^2}{k}+\frac{3d^2}{k^2}\right)^2 \\
1181: &=\frac{12d^4}{k^2} + \frac{156d^4}{k^3} +
1182: O\left(\frac{1}{k^4}\right). 
1183: \end{align}
1184: 
1185: Lemma \ref{lem_mle_asymp} has shown the true asymptotic fourth central
1186: moment: 
1187: \begin{align}
1188: \text{E}\left(\hat{d}_{MLE,c} -
1189:   \text{E}\left(\hat{d}_{MLE,c}\right)\right)^4 =\frac{12d^4}{k^2} + \frac{186d^4}{k^3} +
1190: O\left(\frac{1}{k^4}\right).
1191: \end{align}
1192: \noindent That is, the inverse Gaussian approximation matches not only the
1193: leading term, $\frac{12d^4}{k^2}$, but also almost the higher
1194: order term, $\frac{186d^4}{k^3}$, of the true asymptotic fourth moment of
1195:  $\hat{d}_{MLE,c}$.
1196: 
1197: Assuming $\hat{d}_{MLE,c} \sim IG(\alpha,\beta)$, the tail probability
1198: of $\hat{d}_{MLE,c}$ can be expressed  as 
1199: \begin{align}
1200: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \geq (1+\epsilon)d\right) \overset{\sim}{=}
1201: \Phi\left(-\epsilon \sqrt{\frac{\alpha}{1+\epsilon}}\right) -
1202: e^{2\alpha} \Phi\left(-(2+\epsilon)\sqrt{\frac{\alpha}{1+\epsilon}}\right),
1203: \hspace{0.1in} \epsilon \geq 0 \\
1204: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \leq (1-\epsilon)d\right)  \overset{\sim}{=} \Phi\left(-\epsilon \sqrt{\frac{\alpha}{1-\epsilon}}\right) +
1205: e^{2\alpha} \Phi\left(-(2-\epsilon)\sqrt{\frac{\alpha}{1-\epsilon}}\right),
1206: \hspace{0.1in}   0\leq \epsilon < 1. 
1207: \end{align}
1208: 
1209: 
1210: Assuming  $\hat{d}_{MLE,c} \sim IG(\alpha,\beta)$, it is easy to show
1211: the following  Chernoff bounds: 
1212: \begin{align}\label{eqn_ig_left}
1213: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \geq (1+\epsilon)d\right) \overset{\sim}{\leq}
1214: \exp\left(-\frac{\alpha \epsilon^2}{2(1+\epsilon)}\right),  \hspace{0.2in} \epsilon \geq 0 \\
1215: &\mathbf{Pr}\left(\hat{d}_{MLE,c} \leq (1-\epsilon)d\right) \overset{\sim}{\leq}
1216: \exp\left(-\frac{\alpha \epsilon^2}{2(1-\epsilon)}\right),
1217: \hspace{0.2in}   0\leq \epsilon < 1. \label{eqn_ig_right}
1218: \end{align}
1219: 
To see (\ref{eqn_ig_left}), assume $z \sim IG(\alpha,\beta)$. Then,
using the Chernoff inequality, for $t>0$, 
\begin{align}\notag
\mathbf{Pr}\left(z \geq (1+\epsilon)d\right) \leq&
\text{E}\left(e^{zt}\right)\exp(-(1+\epsilon)dt)\\\notag
=&\exp\left(\alpha\left(1-(1-2\beta t)^{1/2}\right)-(1+\epsilon)dt\right),
\end{align}
1227: whose minimum is $\exp\left(-\frac{\alpha
1228:     \epsilon^2}{2(1+\epsilon)}\right)$, attained at $t =
1229: \left(1-\frac{1}{(1+\epsilon)^2}\right)\frac{1}{2\beta}$. We can
1230: similarly show (\ref{eqn_ig_right}). \\
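
For the reader's convenience, the analogous steps for (\ref{eqn_ig_right}) are sketched here (our own filling-in of the omitted calculation, using the same moment generating function with argument $-t$, $t>0$, and $d = \alpha\beta$):
\begin{align}\notag
\mathbf{Pr}\left(z \leq (1-\epsilon)d\right) \leq&
\text{E}\left(e^{-zt}\right)\exp((1-\epsilon)dt)\\\notag
=&\exp\left(\alpha\left(1-(1+2\beta t)^{1/2}\right)+(1-\epsilon)\alpha\beta t\right).
\end{align}
\noindent The exponent is minimized when $(1+2\beta t)^{1/2} = \frac{1}{1-\epsilon}$, i.e., at $t = \left(\frac{1}{(1-\epsilon)^2}-1\right)\frac{1}{2\beta}$, where it equals
\begin{align}\notag
-\frac{\alpha\epsilon}{1-\epsilon} +
\frac{\alpha\left(2\epsilon-\epsilon^2\right)}{2(1-\epsilon)} =
-\frac{\alpha\epsilon^2}{2(1-\epsilon)}.
\end{align}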
1231: 
1232: Combining (\ref{eqn_ig_left}) and (\ref{eqn_ig_right}) yields a
1233: symmetric bound 
1234: \begin{align}
1235: &\mathbf{Pr}\left(|\hat{d}_{MLE,c} - d| \geq \epsilon d\right) \overset{\sim}{\leq}
1236: 2\exp\left(-\frac{\epsilon^2/(1+\epsilon)}{2 \left(\frac{2}{k} + \frac{3}{k^2}\right)}\right),
1237: \hspace{0.15in} 0\leq \epsilon \leq 1
1238: \end{align}
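\noindent The symmetric form uses $\frac{\epsilon^2}{1-\epsilon} \geq \frac{\epsilon^2}{1+\epsilon}$, so that both one-sided bounds are dominated by $\exp\left(-\frac{\alpha\epsilon^2}{2(1+\epsilon)}\right)$ with $\alpha = \frac{1}{\frac{2}{k}+\frac{3}{k^2}}$. As a purely illustrative calculation (the values $k=100$ and $\epsilon=0.5$ are hypothetical), we have $\alpha \approx 49.26$ and
\begin{align}\notag
2\exp\left(-\frac{\alpha\,\epsilon^2/(1+\epsilon)}{2}\right) \approx
2\exp(-4.105) \approx 0.033,
\end{align}
\noindent whereas the exponential bounds (\ref{eqn_exp_right}) and (\ref{eqn_exp_left}) for $\hat{d}_{gm,c}$ give $2\exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right) \approx 2\exp(-2.083) \approx 0.25$ at the same $k$ and $\epsilon$. The comparison should be read with care: the bound for $\hat{d}_{gm,c}$ is rigorous, while the bound for $\hat{d}_{MLE,c}$ rests on the inverse Gaussian approximation.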
1239: 
1240: 
1241: Figure \ref{fig_ig_tail} compares the inverse Gaussian approximation with the same
1242: simulations as presented in Figure \ref{fig_gamma_tail}, indicating 
1243: that the inverse Gaussian approximation is highly
1244: accurate. When the tail probability $\geq 10^{-4} \sim 10^{-6}$, we can treat the
1245: inverse Gaussian as the exact distribution of $\hat{d}_{MLE,c}$.  The Chernoff upper bounds for the inverse Gaussian
1246: are always reliable in our simulation range (the tail probability
1247: $\geq 10^{-10}$). 
1248: 
1249: \begin{figure}[h]
1250: \begin{center}\mbox{
1251: \subfigure[]{\includegraphics[width = 2.8in]{fig/ig10.eps}}
1252: \subfigure[]{\includegraphics[width = 2.8in]{fig/igbound10.eps}}}
1253: \end{center}\vspace{-0.4in}
1254: \caption{We compare the inverse Gaussian approximation
1255:   with the same simulations as presented in Figure
1256:   \ref{fig_gamma_tail}. Panel (a) compares the empirical tail
1257:   probabilities with the inverse Gaussian tail probabilities,
1258:   indicating that the approximation is highly accurate. 
1259:   Panel (b) compares the empirical tail probabilities with the inverse
1260:   Gaussian upper bound (\ref{eqn_ig_left})+(\ref{eqn_ig_right}). The upper bounds are all
1261: above the corresponding empirical curves, indicating that our proposed bounds are
1262: reliable at least in our simulation range.  }\label{fig_ig_tail}
1263: \end{figure}
1264: 
1265: 
1266: 
1267: \section{Conclusion}\label{sec_conclusion}
1268: 
1269: It is well-known that the $l_1$ distance is far more robust than the
1270: $l_2$ distance against ``outliers.'' There are
1271: numerous  success stories of using the $l_1$ distance, e.g., 
1272:   Lasso \citep{Article:Tibshirani_96}, LARS \citep{Article:Efron_LARS04}, 1-norm
1273:   SVM \citep{Proc:Zhu_NIPS03}, and Laplacian radial basis kernel
1274:   \citep{Article:Chapelle_99,Proc:Ferecatu_MIR04}. 
1275: 
1276: Dimension reduction in the $l_1$ norm, however, has been proved
1277: {\em impossible} if we use {\em linear random projections} and {\em
1278:   linear estimators}. In this study, we propose three types of nonlinear
1279: estimators for {\em Cauchy random projections}: the bias-corrected
1280: sample median estimator, the bias-corrected geometric mean estimator,
1281: and the bias-corrected maximum likelihood estimator. Our theoretical
1282: analysis has shown that these nonlinear estimators can accurately
1283: recover the original $l_1$ distance, even though none of them can be a
1284: metric. 
1285: 
1286: The bias-corrected sample median estimator and the bias-corrected
geometric mean estimator are asymptotically equivalent, but the latter
is more accurate at small sample sizes. We have derived explicit tail
1289: bounds for the bias-corrected geometric mean estimator and have expressed
1290: the tail bounds in exponential forms. Using these tail bounds, we have
1291: established an analog of the 
1292: Johnson-Lindenstrauss  (JL) lemma for dimension reduction in $l_1$, which is weaker than the classical JL lemma for dimension reduction in
1293: $l_2$.
1294: 
We conduct a theoretical analysis of the bias-corrected maximum
likelihood estimator (MLE), which is ``asymptotically optimum.'' Both
the sample median estimator and the geometric mean estimator are about
$80\%$ as efficient as the MLE. We propose
approximating the distribution of the bias-corrected MLE by an inverse Gaussian, which has the
same support and matches the leading terms of the first four moments of
the proposed estimator. Approximate tail bounds have been provided based
on the inverse Gaussian approximation. Verified by simulations, these
approximate tail bounds hold at least in the $\geq 
10^{-10}$ tail probability range. 
1305: 
1306: Although these nonlinear estimators are not metrics, they are still
useful for certain applications in, e.g., data stream computation,
1308: information retrieval, learning and data mining, whenever the goal is
1309: to compute the $l_1$ distances efficiently using a small storage space. 
1310: 
1311: 
1312: The geometric mean estimator is a non-convex
1313: norm (i.e., the $l_p$ norm as $p\rightarrow 0$); and therefore it does
contain some information about the geometry.  It may still be possible
1315: to develop certain efficient algorithms using the geometric mean estimator by
1316: avoiding the non-convexity.  We leave this for future
1317: work. \\ 
1318: 
1319: 
1320: 
1321: \section*{Acknowledgment}
1322: 
We are grateful to Piotr Indyk and Assaf Naor for their very constructive
1324: comments on various versions of this manuscript. We thank Dimitris
1325: Achlioptas, 
1326: Christopher Burges, Moses Charikar, Jerome Friedman, Tze L. Lai, Art 
1327: B. Owen, John Platt,  Joseph Romano, Tim
1328: Roughgarden, Yiyuan She,  and  Guenther Walther
1329: for helpful conversations or suggesting relevant references. We also thank Silvia Ferrari 
1330: and Gauss Cordeiro for clarifying some parts of their papers. 
1331: 
1332: Trevor Hastie was partially supported by grant DMS-0505676 from the National
1333: Science Foundation, and grant 2R01 CA 72028-07 from the National
1334: Institutes of
1335: Health.
1336: 
1337: %\bibliographystyle{plain}
1338: {\small
1339: 
1340: \begin{thebibliography}{59}
1341: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
1342: \expandafter\ifx\csname url\endcsname\relax
1343:   \def\url#1{{\tt #1}}\fi
1344: 
1345: \bibitem[Achlioptas(2003)]{Article:Achlioptas_JCSS03}
1346: Dimitris Achlioptas.
1347: \newblock Database-friendly random projections: \text{Johnson-Lindenstrauss}
1348:   with binary coins.
1349: \newblock {\em Journal of Computer and System Sciences}, 66\penalty0
1350:   (4):\penalty0 671--687, 2003.
1351: 
1352: \bibitem[Aggarwal and Wolf(1999)]{Proc:Aggarwal_Wolf_Sigmod99}
1353: Charu~C. Aggarwal and Joel~L. Wolf.
1354: \newblock A new method for similarity indexing of market basket data.
1355: \newblock In {\em Proc. of SIGMOD}, pages 407--418, Philadelphia, PA, 1999.
1356: 
1357: \bibitem[Ailon and Chazelle(2006)]{Proc:Ailon_STOC06}
1358: Nir Ailon and Bernard Chazelle.
1359: \newblock Approximate nearest neighbors and the fast \text{Johnson-Lindenstrauss}
1360:   transform.
1361: \newblock In {\em Proc. of STOC}, pages 557--563, Seattle, WA, 2006.
1362: 
1363: \bibitem[Antle and Bain(1969)]{Article:Antle_69}
1364: Charles Antle and Lee Bain.
1365: \newblock A property of maximum likelihood estimators of location and scale
1366:   parameters.
1367: \newblock {\em SIAM Review}, 11\penalty0 (2):\penalty0 251--253, 1969.
1368: 
1369: \bibitem[Arriaga and Vempala(1999)]{Proc:Arriaga_FOCS99}
1370: Rosa Arriaga and Santosh Vempala.
1371: \newblock An algorithmic theory of learning: Robust concepts and random
1372:   projection.
1373: \newblock In {\em Proc. of FOCS}, pages 616--623, New York, 1999.
1374: 
1375: \bibitem[Arriaga and Vempala(2006)]{Article:Proc:Arriaga_Vempala_ML06}
1376: Rosa Arriaga and Santosh Vempala.
1377: \newblock An algorithmic theory of learning: Robust concepts and random
1378:   projection.
1379: \newblock {\em Machine Learning}, 63\penalty0 (2):\penalty0 161--182, 2006.
1380: 
1381: \bibitem[Barnett(1966)]{Article:Barnett_66}
1382: V.~D. Barnett.
1383: \newblock Evaluation of the maximum-likelihood estimator where the likelihood
1384:   equation has multiple roots.
1385: \newblock {\em Biometrika}, 53\penalty0 (1/2):\penalty0 151--165, 1966.
1386: 
1387: \bibitem[Bartlett(1953)]{Article:Bartlett_53}
1388: M.~S. Bartlett.
1389: \newblock Approximate confidence intervals, \text{II}.
1390: \newblock {\em Biometrika}, 40\penalty0 (3/4):\penalty0 306--317, 1953.
1391: 
1392: \bibitem[Bhattacharya and Ghosh(1978)]{Article:Bhattacharya_78}
1393: R.~N. Bhattacharya and J.~K. Ghosh.
1394: \newblock On the validity of the formal \text{Edgeworth} expansion.
1395: \newblock {\em The Annals of Statistics}, 6\penalty0 (2):\penalty0 434--451,
1396:   1978.
1397: 
1398: \bibitem[Brinkman and Charikar(2003)]{Proc:Brinkman_FOCS03}
Bo~Brinkman and Moses Charikar.
1400: \newblock On the impossibility of dimension reduction in $l_1$.
1401: \newblock In {\em Proc. of FOCS}, pages 514--523, Cambridge, MA, 2003.
1402: 
1403: \bibitem[Brinkman and Charikar(2005)]{Article:Brinkman_JACM05}
Bo~Brinkman and Moses Charikar.
\newblock On the impossibility of dimension reduction in $l_1$.
\newblock {\em Journal of the ACM}, 52\penalty0 (2):\penalty0 766--788, 2005.
1407: 
1408: \bibitem[Chapelle et~al.(1999)Chapelle, Haffner, and
1409:   Vapnik]{Article:Chapelle_99}
1410: Olivier Chapelle, Patrick Haffner, and Vladimir~N. Vapnik.
1411: \newblock Support vector machines for histogram-based image classification.
1412: \newblock {\em {IEEE} Trans. Neural Networks}, 10\penalty0 (5):\penalty0
1413:   1055--1064, 1999.
1414: 
1415: \bibitem[Chernoff(1952)]{Article:Chernoff_52}
1416: Herman Chernoff.
1417: \newblock A measure of asymptotic efficiency for tests of a hypothesis based on
1418:   the sum of observations.
1419: \newblock {\em The Annals of Mathematical Statistics}, 23\penalty0
1420:   (4):\penalty0 493--507, 1952.
1421: 
1422: \bibitem[Chhikara and Folks(1989)]{Book:Chhikara_89}
1423: Raj~S. Chhikara and J.~Leroy Folks.
1424: \newblock {\em The Inverse Gaussian Distribution: Theory, Methodology, and
1425:   Applications}.
1426: \newblock Marcel Dekker, Inc, New York, 1989.
1427: 
1428: \bibitem[Cysneiros et~al.(2001)Cysneiros, dos Santos, and
1429:   Cordeiro]{Article:Cysneiros_01}
Francisco Jose De.~A. Cysneiros, Sylvio Jose~P. dos Santos, and Gauss~M.
1431:   Cordeiro.
1432: \newblock Skewness and kurtosis for maximum likelihood estimator in
1433:   one-parameter exponential family models.
1434: \newblock {\em Brazilian Journal of Probability and Statistics}, 15\penalty0
1435:   (1):\penalty0 85--105, 2001.
1436: 
1437: \bibitem[Dasgupta and Gupta(2003)]{Article:Dasgupta_JL}
1438: Sanjoy Dasgupta and Anupam Gupta.
1439: \newblock An elementary proof of a theorem of \text{Johnson and Lindenstrauss}.
1440: \newblock {\em Random Structures and Algorithms}, 22\penalty0 (1):\penalty0 60
1441:   -- 65, 2003.
1442: 
1443: \bibitem[Dhillon and Modha(2001)]{Article:Dhillon_ML01}
1444: Inderjit~S. Dhillon and Dharmendra~S. Modha.
1445: \newblock Concept decompositions for large sparse text data using clustering.
1446: \newblock {\em Machine Learning}, 42\penalty0 (1-2):\penalty0 143--175, 2001.
1447: 
1448: \bibitem[Efron et~al.(2004)Efron, Hastie, Johnstone, and
1449:   Tibshirani]{Article:Efron_LARS04}
1450: Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani.
1451: \newblock Least angle regression.
1452: \newblock {\em The Annals of Statistics}, 32\penalty0 (2):\penalty0 407--499,
1453:   2004.
1454: 
1455: \bibitem[Fama and Roll(1968)]{Article:Fama_68}
1456: Eugene~F. Fama and Richard Roll.
1457: \newblock Some properties of symmetric stable distributions.
1458: \newblock {\em Journal of the American Statistical Association}, 63\penalty0
1459:   (323):\penalty0 817--836, 1968.
1460: 
1461: \bibitem[Fama and Roll(1971)]{Article:Fama_71}
1462: Eugene~F. Fama and Richard Roll.
1463: \newblock Parameter estimates for symmetric stable distributions.
1464: \newblock {\em Journal of the American Statistical Association}, 66\penalty0
1465:   (334):\penalty0 331--338, 1971.
1466: 
1467: \bibitem[Feller(1971)]{Book:Feller_II}
1468: William Feller.
1469: \newblock {\em An Introduction to Probability Theory and Its Applications
1470:   (Volume \text{II})}.
1471: \newblock John Wiley \& Sons, New York, NY, second edition, 1971.
1472: 
1473: \bibitem[Ferecatu et~al.(2004)Ferecatu, Crucianu, and
1474:   Boujemaa]{Proc:Ferecatu_MIR04}
1475: Marin Ferecatu, Michel Crucianu, and Nozha Boujemaa.
1476: \newblock Retrieval of difficult image classes using SVD-based relevance
1477:   feedback.
\newblock In {\em Proc. of Multimedia Information Retrieval}, pages 23--30, New
1479:   York, NY, 2004.
1480: 
1481: \bibitem[Ferrari et~al.(1996)Ferrari, Botter, Cordeiro, and
1482:   Cribari-Neto]{Article:Ferrari_96}
1483: Silvia L.~P. Ferrari, Denise~A. Botter, Gauss~M. Cordeiro, and Francisco
1484:   Cribari-Neto.
1485: \newblock Second and third order bias reduction for one-parameter family
1486:   models.
1487: \newblock {\em Stat. and Prob. Letters}, 30:\penalty0 339--345, 1996.
1488: 
1489: \bibitem[Fisher(1934)]{Article::Fisher_34}
1490: R.~A. Fisher.
1491: \newblock Two new properties of mathematical likelihood.
1492: \newblock {\em Proceedings of the Royal Society of London}, 144\penalty0
1493:   (852):\penalty0 285--307, 1934.
1494: 
1495: \bibitem[Frankl and Maehara(1987)]{Article:Frankl_JL}
1496: P.~Frankl and H.~Maehara.
1497: \newblock The \text{Johnson-Lindenstrauss} lemma and the sphericity of some
1498:   graphs.
1499: \newblock {\em Journal of Combinatorial Theory A}, 44\penalty0 (3):\penalty0
1500:   355--362, 1987.
1501: 
1502: \bibitem[Gerber(1991)]{Article:Gerber}
1503: Hans~U. Gerber.
1504: \newblock From the generalized gamma to the generalized negative binomial
1505:   distribution.
1506: \newblock {\em Insurance:Mathematics and Economics}, 10\penalty0 (4):\penalty0
1507:   303--309, 1991.
1508: 
1509: \bibitem[Gradshteyn and Ryzhik(1994)]{Book:Gradshteyn_94}
1510: I.~S. Gradshteyn and I.~M. Ryzhik.
1511: \newblock {\em Table of Integrals, Series, and Products}.
1512: \newblock Academic Press, New York, fifth edition, 1994.
1513: 
1514: \bibitem[Haas et~al.(1970)Haas, Bain, and Antle]{Article:Haas_70}
1515: Gerald Haas, Lee Bain, and Charles Antle.
1516: \newblock Inferences for the Cauchy distribution based on maximum likelihood
1517:   estimation.
1518: \newblock {\em Biometrika}, 57\penalty0 (2):\penalty0 403--408, 1970.
1519: 
1520: \bibitem[Hinkley(1978)]{Article:Hinkley_78}
1521: David~V. Hinkley.
1522: \newblock Likelihood inference about location and scale parameters.
1523: \newblock {\em Biometrika}, 65\penalty0 (2):\penalty0 253--261, 1978.
1524: 
1525: \bibitem[Hougaard(1986)]{Article:Hougaard_86}
1526: P.~Hougaard.
1527: \newblock Survival models for heterogeneous populations derived from stable
1528:   distributions.
1529: \newblock {\em Biometrika}, 73\penalty0 (2):\penalty0 387--396, 1986.
1530: 
1531: \bibitem[Indyk(2000)]{Proc:Indyk_FOCS00}
1532: Piotr Indyk.
1533: \newblock Stable distributions, pseudorandom generators, embeddings and data
1534:   stream computation.
\newblock In {\em FOCS}, pages 189--197, Redondo Beach, CA, 2000.
1536: 
1537: \bibitem[Indyk(2001)]{Proc:Indyk_FOCS01}
1538: Piotr Indyk.
1539: \newblock Algorithmic applications of low-distortion geometric embeddings.
1540: \newblock In {\em Proc. of FOCS}, pages 10--33, Las Vegas, NV, 2001.
1541: 
1542: \bibitem[Indyk and Motwani(1998)]{Proc:Indyk_STOC98}
1543: Piotr Indyk and Rajeev Motwani.
1544: \newblock Approximate nearest neighbors: Towards removing the curse of
1545:   dimensionality.
1546: \newblock In {\em Proc. of STOC}, pages 604--613, Dallas, TX, 1998.
1547: 
1548: \bibitem[Indyk and Naor(2006)]{Article:Indyk_Naor}
1549: Piotr Indyk and Assaf Naor.
1550: \newblock Nearest neighbor preserving embeddings.
1551: \newblock {\em ACM Transactions on Algorithms (to appear)}, 2006.
1552: 
1553: \bibitem[Jensen(1995)]{Book:Jensen_95}
1554: Jens~Ledet Jensen.
1555: \newblock {\em Saddlepoint approximations}.
1556: \newblock Oxford University Press, New York, 1995.
1557: 
1558: \bibitem[Johnson and Lindenstrauss(1984)]{Article:JL84}
1559: W.~B. Johnson and J.~Lindenstrauss.
1560: \newblock Extensions of \text{Lipschitz} mapping into \text{Hilbert} space.
1561: \newblock {\em Contemporary Mathematics}, 26:\penalty0 189--206, 1984.
1562: 
1563: \bibitem[Lawless(1972)]{Article:Lawless_72}
1564: J.~F. Lawless.
1565: \newblock Conditional confidence interval procedures for the location and scale
1566:   parameters of the Cauchy and logistic distributions.
1567: \newblock {\em Biometrika}, 59\penalty0 (2):\penalty0 377--386, 1972.
1568: 
1569: \bibitem[Lee and Naor(2004)]{Article:Lee_Naor_04}
1570: James~R. Lee and Assaf Naor.
1571: \newblock Embedding the diamond graph in $l_p$ and dimension reduction in
1572:   $l_1$.
1573: \newblock {\em Geometric And Functional Analysis}, 14\penalty0 (4):\penalty0
1574:   745--747, 2004.
1575: 
1576: \bibitem[Li and Church(2005)]{Report:Li_Church_Sketch}
1577: Ping Li and Kenneth~W. Church.
1578: \newblock Using sketches to estimate two-way and multi-way associations.
1579: \newblock Technical Report TR-2005-115, Microsoft Research, (A shorter version
1580:   is available at
1581:   www.stanford.edu/$^\sim$pingli98/publications/Report\_Sketch.pdf), Redmond,
1582:   WA, September 2005.
1583: 
1584: \bibitem[Li et~al.(2006{\natexlab{a}})Li, Church, and
1585:   Hastie]{Report:Li_Church_Hastie_crs}
1586: Ping Li, Kenneth~W. Church, and Trevor~J. Hastie.
1587: \newblock Conditional random sampling: A sketched-based sampling technique for
1588:   sparse data.
1589: \newblock Technical report, Department of Statistics, Stanford University
1590:   (\url{www.stanford.edu/~pingli98/publications/CRS_tr.pdf}),
1591:   2006{\natexlab{a}}.
1592: 
1593: \bibitem[Li et~al.(2006{\natexlab{b}})Li, Hastie, and
1594:   Church]{Proc:Li_Hastie_Church_COLT06}
1595: Ping Li, Trevor~J. Hastie, and Kenneth~W. Church.
1596: \newblock Improving random projections using marginal information.
1597: \newblock In {\em Proc. of COLT}, Pittsburgh, PA, 2006{\natexlab{b}}.
1598: 
1599: \bibitem[Li et~al.(2006{\natexlab{c}})Li, Hastie, and
1600:   Church]{Report:Li_Hastie_Church_subrp}
1601: Ping Li, Trevor~J. Hastie, and Kenneth~W. Church.
1602: \newblock Sub-Gaussian random projections.
1603: \newblock Technical report, Department of Statistics, Stanford University
1604:   (\url{www.stanford.edu/~pingli98/report/subg_rp.pdf}), 2006{\natexlab{c}}.
1605: 
1606: \bibitem[Li et~al.(2006{\natexlab{d}})Li, Hastie, and
1607:   Church]{Proc:Li_Hastie_Church_KDD06}
1608: Ping Li, Trevor~J. Hastie, and Kenneth~W. Church.
1609: \newblock Very sparse random projections.
1610: \newblock In {\em Proc. of KDD}, Philadelphia, PA, 2006{\natexlab{d}}.
1611: 
1612: \bibitem[Li et~al.(2006{\natexlab{e}})Li, Paul, Narasimhan, and
1613:   Cioffi]{Article:Li_SINR06}
1614: Ping Li, Debashis Paul, Ravi Narasimhan, and John Cioffi.
1615: \newblock On the distribution of \text{SINR} for the \text{MMSE MIMO} receiver
1616:   and performance analysis.
1617: \newblock {\em {IEEE} Trans. Inform. Theory}, 52\penalty0 (1):\penalty0
1618:   271--286, 2006{\natexlab{e}}.
1619: 
1620: \bibitem[Lugosi(2004)]{Article:Lugosi_04}
1621: Gabor Lugosi.
1622: \newblock Concentration-of-measure inequalities.
1623: \newblock {\em Lecture Notes}, 2004.
1624: 
1625: \bibitem[McCulloch(1986)]{Article:McCulloch_86}
1626: J.~Huston McCulloch.
1627: \newblock Simple consistent estimators of stable distribution parameters.
1628: \newblock {\em Communications on Statistics-Simulation}, 15\penalty0
1629:   (4):\penalty0 1109--1136, 1986.
1630: 
1631: \bibitem[Philips and Nelson(1995)]{Article:Philips_95}
1632: Thomas~K. Philips and Randolph Nelson.
1633: \newblock The moment bound is tighter than Chernoff's bound for positive tail
1634:   probabilities.
1635: \newblock {\em The American Statistician}, 49\penalty0 (2):\penalty0 175--178,
1636:   1995.
1637: 
1638: \bibitem[Seshadri(1993)]{Book:Seshadri_93}
1639: V.~Seshadri.
1640: \newblock {\em The Inverse Gaussian Distribution: A Case Study in Exponential
1641:   Families}.
1642: \newblock Oxford University Press Inc., New York, 1993.
1643: 
1644: \bibitem[Severini(2000)]{Book:Severini_00}
1645: Thomas~A. Severini.
1646: \newblock {\em Likelihood Methods in Statistics}.
1647: \newblock Oxford University Press, New York, 2000.
1648: 
1649: \bibitem[Shakhnarovich et~al.(2005)Shakhnarovich, Darrell, and
1650:   Indyk]{Book:NN_05}
1651: Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk, editors.
1652: \newblock {\em Nearest-Neighbor Methods in Learning and Vision, Theory and
1653:   Practice}.
1654: \newblock The MIT Press, Cambridge, MA, 2005.
1655: 
1656: \bibitem[Shao(2003)]{Book:Shao}
1657: Jun Shao.
1658: \newblock {\em Mathematical Statistics}.
1659: \newblock Springer, New York, NY, second edition, 2003.
1660: 
1661: \bibitem[Shenton and Bowman(1963)]{Article:Shenton_63}
1662: L.~R. Shenton and K.~Bowman.
1663: \newblock Higher moments of a maximum-likelihood estimate.
1664: \newblock {\em Journal of Royal Statistical Society \text{B}}, 25\penalty0
1665:   (2):\penalty0 305--317, 1963.
1666: 
1667: \bibitem[Strehl and Ghosh(2000)]{Proc:Strehl_HiPC00}
1668: Alexander Strehl and Joydeep Ghosh.
1669: \newblock A scalable approach to balanced, high-dimensional clustering of
1670:   market-baskets.
1671: \newblock In {\em Proc. of HiPC}, pages 525--536, Bangalore, India, 2000.
1672: 
1673: \bibitem[Tibshirani(1996)]{Article:Tibshirani_96}
1674: Robert Tibshirani.
1675: \newblock Regression shrinkage and selection via the lasso.
1676: \newblock {\em Journal of Royal Statistical Society \text{B}}, 58\penalty0
1677:   (1):\penalty0 267--288, 1996.
1678: 
1679: \bibitem[Tweedie(1957{\natexlab{a}})]{Article:Tweedie_57I}
1680: M.~C.~K. Tweedie.
1681: \newblock Statistical properties of inverse Gaussian distributions. \text{I}.
1682: \newblock {\em The Annals of Mathematical Statistics}, 28\penalty0
1683:   (2):\penalty0 362--377, 1957{\natexlab{a}}.
1684: 
1685: \bibitem[Tweedie(1957{\natexlab{b}})]{Article:Tweedie_57II}
1686: M.~C.~K. Tweedie.
1687: \newblock Statistical properties of inverse Gaussian distributions. \text{II}.
1688: \newblock {\em The Annals of Mathematical Statistics}, 28\penalty0
1689:   (3):\penalty0 696--705, 1957{\natexlab{b}}.
1690: 
1691: \bibitem[Vempala(2004)]{Book:Vempala}
1692: Santosh Vempala.
1693: \newblock {\em The Random Projection Method}.
1694: \newblock American Mathematical Society, Providence, RI, 2004.
1695: 
1696: \bibitem[Zhu et~al.(2003)Zhu, Rosset, Hastie, and Tibshirani]{Proc:Zhu_NIPS03}
1697: Ji~Zhu, Saharon Rosset, Trevor Hastie, and Robert Tibshirani.
1698: \newblock 1-norm support vector machines.
1699: \newblock In {\em NIPS}, 2003.
1700: 
1701: \bibitem[Zolotarev(1986)]{Book:Zolotarev_86}
1702: V.~M. Zolotarev.
1703: \newblock {\em One-dimensional Stable Distributions}.
1704: \newblock American Mathematical Society, Providence, RI, 1986.
1705: 
1706: \end{thebibliography}
1707: 
1708: %\bibliography{../bib/IEEEabrv,../bib/mybibfile}
1709: }
1710: 
1711: \appendix
1712: 
1713: 
1714: \section{Proof of Lemma \ref{lem_me}}\label{app_proof_lem_me}
1715: 
Assume $x \sim C(0,d)$. The probability density function (PDF) and the
cumulative distribution function (CDF) of $|x|$ are 
1718: \begin{align}
1719: &\mathbf{Pr}(|x|=z) = \frac{2d}{\pi}\frac{1}{z^2+d^2}, \hspace{0.2in}
1720: z\geq0 \\
1721: &\mathbf{Pr}(|x|\leq z) = \frac{2}{\pi}\tan^{-1}\frac{z}{d}, \hspace{0.2in}
1722: z\geq0
1723: \end{align}
1724: 
1725: The asymptotic normality of $\hat{d}_{me}$ follows from the asymptotic
1726: results on sample quantiles \citep[Theorem
1727: 5.10]{Book:Shao}.
1728: \begin{align}
1729: \sqrt{k}\left(\hat{d}_{me}-d\right) \overset{D}{\Longrightarrow}
1730: N\left(0,
1731:   \frac{1}{2}\left(1-\frac{1}{2}\right)/\left(\left.\mathbf{Pr}(|x|=z)\right|_{z = d}\right)^2\right) = N\left(0,\frac{\pi^2}{4}d^2\right)
1732: \end{align}
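
The constants in the limiting variance can be verified in two lines (a worked check using only the PDF and CDF above). The CDF equals $\frac{1}{2}$ at $z=d$, because $\tan^{-1}(1) = \frac{\pi}{4}$, so $d$ is the median of $|x|$. The density at the median is
\begin{align}\notag
\left.\mathbf{Pr}(|x|=z)\right|_{z = d} = \frac{2d}{\pi}\frac{1}{2d^2} =
\frac{1}{\pi d},
\hspace{0.2in}\text{hence}\hspace{0.2in}
\frac{\frac{1}{2}\left(1-\frac{1}{2}\right)}{\left(\frac{1}{\pi d}\right)^2}
= \frac{\pi^2}{4}d^2.
\end{align}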
1733: 
1734: The probability density of $\hat{d}_{me}$ can be derived from
1735: the probability density of order statistics \citep[Example
1736: 2.9]{Book:Shao}. For simplicity, we only consider $k = 2m+1$, $m = 1,
1737: 2, ..., $
1738: \begin{align}\notag
1739: \mathbf{Pr}(\hat{d}_{me}=z) &=
1740: \frac{(2m+1)!}{(m!)^2}\left(\mathbf{Pr}(|x|\leq
1741:   z)\right)^m\left(1-\mathbf{Pr}(|x|\leq z)\right)^m \mathbf{Pr}(|x|=
1742: z) \\
1743: &=\frac{(2m+1)!}{(m!)^2}\left(\frac{2}{\pi}\tan^{-1}\frac{z}{d}\right)^m\left(1-\frac{2}{\pi}\tan^{-1}\frac{z}{d}\right)^m\frac{2d}{\pi}\frac{1}{z^2+d^2}.
1744: \end{align}
1745: 
1746: The $r^{th}$ moment of $\hat{d}_{me}$ would be 
1747: \begin{align}\notag
1748: \text{E}\left(\hat{d}_{me}\right)^r &= \int_0^\infty z^r
1749: \frac{(2m+1)!}{(m!)^2}\left(\frac{2}{\pi}\tan^{-1}\frac{z}{d}\right)^m\left(1-\frac{2}{\pi}\tan^{-1}\frac{z}{d}\right)^m\frac{2d}{\pi}\frac{1}{z^2+d^2}dz
1750: \\
1751: &= d^r\int_0^1\frac{(2m+1)!}{(m!)^2}\tan^r\left(\frac{\pi}{2}t\right)
1752: \left(t-t^2\right)^m dt,
1753: \end{align}
1754: \noindent by substituting $t = \frac{2}{\pi}
1755: \tan^{-1}\frac{z}{d}$. 
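
The change of variables can be checked directly (a short sketch). With $z = d\tan\left(\frac{\pi}{2}t\right)$, $t$ runs from $0$ to $1$ as $z$ runs from $0$ to $\infty$, and 
\begin{align}\notag
dz = \frac{\pi d}{2}\sec^2\left(\frac{\pi}{2}t\right)dt,
\hspace{0.2in}
z^2+d^2 = d^2\sec^2\left(\frac{\pi}{2}t\right),
\hspace{0.2in}\text{so that}\hspace{0.2in}
\frac{2d}{\pi}\frac{dz}{z^2+d^2} = dt,
\end{align}
while $z^r = d^r\tan^r\left(\frac{\pi}{2}t\right)$ and the two CDF factors become $t^m(1-t)^m = \left(t-t^2\right)^m$.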
1756: 
1757: 
1758: When $t\rightarrow 1-0$, $\tan\left(\frac{\pi}{2}t\right) \rightarrow
1759: \infty$, but $t-t^2 = t(1-t) \rightarrow 0$. Around $t =1-0$,  
1760:  $\tan\left(\frac{\pi}{2}t\right) =
1761:  \frac{1}{\tan\left(\frac{\pi}{2}(1-t)\right)} =
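 \frac{2}{\pi}\frac{1}{1-t}+...$, by the Taylor expansion. The
 integrand therefore behaves like $(1-t)^{m-r}$ near $t=1$, which is
 integrable only if $m-r>-1$. Therefore, in
 order for $\text{E}\left(\hat{d}_{me}\right)^r <\infty$ with integer $r$, we must have
 $m \geq r$.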
1765: 
This completes the proof of Lemma \ref{lem_me}. 
1767: 
1768: \section{Proof of  Lemma \ref{lem_d_log}}\label{app_proof_lem_d_log} 
1769: 
1770: Assume $x \sim C(0,d)$. The first moment of $\log(|x|)$ would be
1771: \begin{align}\notag
1772: \text{E}\left(\log(|x|)\right) &= \frac{2d}{\pi}\int_0^\infty
1773: \frac{\log(y)}{y^2+d^2}dy \\\notag
&=\frac{1}{\pi}\int_0^\infty\frac{\log(d)y^{-1/2}}{y+1} +
\frac{\frac{1}{2}\log(y)y^{-1/2}}{y+1}dy\\
1776: &= \log(d), 
1777: \end{align}
1778: \noindent with the help of the integral tables \cite[3.221.1,
1779: 4.251.1]{Book:Gradshteyn_94}. 
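
The two middle steps can also be sketched without the tables. The second line follows from the substitution $y = d\sqrt{u}$, where $u$ is an auxiliary variable introduced for this sketch (the display above reuses the symbol $y$ for it), so that $\log(y) = \log(d) + \frac{1}{2}\log(u)$ and $\frac{2d}{\pi}\frac{dy}{y^2+d^2} = \frac{1}{\pi}\frac{u^{-1/2}}{u+1}du$. The two resulting integrals are
\begin{align}\notag
\int_0^\infty\frac{u^{-1/2}}{u+1}du = \pi,
\hspace{0.3in}
\int_0^\infty\frac{\log(u)u^{-1/2}}{u+1}du = 0.
\end{align}
The second integral vanishes because the change of variable $u \rightarrow 1/u$ maps it to its own negative. Hence $\text{E}\left(\log(|x|)\right) = \frac{1}{\pi}\left(\pi\log(d) + 0\right) = \log(d)$.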
1780: 
1781: Thus, given i.i.d. samples $x_j \sim C(0,d)$, $j = 1, 2, ..., k$,
1782: a nonlinear estimator of $d$ would be 
1783: \begin{align}
1784: \hat{d}_{log} = \exp\left(\frac{1}{k}\sum_{j=1}^k\log(|x_j|)\right). 
1785: \end{align}
1786: 
1787: We can derive another nonlinear estimator from 
1788: $\text{E}\left(|x|^\lambda\right)$, $|\lambda| <1$. Using the integral
1789: tables \cite[3.221.1]{Book:Gradshteyn_94}, we obtain
1790: \begin{align}\notag
1791: \text{E}\left(|x|^\lambda\right) &= \frac{2d}{\pi}\int_0^\infty
1792: \frac{y^\lambda}{y^2+d^2}dy\\ \notag
1793: &=\frac{d^\lambda}{\pi}\int_0^\infty\frac{y^{\frac{\lambda-1}{2}}}{y+1}dy
1794: \\
1795: &=\frac{d^\lambda}{\cos(\lambda\pi/2)}, 
1796: \end{align}
1797: \noindent from which a  nonlinear estimator follows immediately
1798: \begin{align}
1799: \hat{d}_\lambda = \left(\frac{1}{k}\sum_{j=1}^k|x_j|^\lambda
1800:   \cos(\lambda\pi/2)\right)^{1/\lambda}, \hspace{0.2in} |\lambda| <1
1801: \end{align}
1802: 
1803: Both nonlinear estimators $\hat{d}_{log}$ and $\hat{d}_\lambda$ are
1804: biased. The leading terms of their variances can be obtained by the
1805: {\em Delta Method} \citep[Corollary 1.1]{Book:Shao}.
1806: 
1807: 
1808: With the help of \cite[4.261.10]{Book:Gradshteyn_94}, we obtain 
1809: \begin{align}
1810: \text{E}\left(\log^2(|x|)\right) = \log^2(d) + \frac{\pi^2}{4},
\hspace{0.2in} \text{i.e., } \ \ \text{Var}\left(\log(|x|)\right) =  \frac{\pi^2}{4}.
1812: \end{align}
1813: \noindent Thus, 
1814: \begin{align}
1815: \text{E}\left(\frac{1}{k}\sum_{j=1}^k\log(|x_j|)\right) = \log d,
1816: \hspace{0.5in} \text{Var}\left(\frac{1}{k}\sum_{j=1}^k\log(|x_j|)\right) = \frac{1}{k}\frac{\pi^2}{4}.
1817: \end{align}
1818: 
1819: 
1820: By the {\em Delta Method}, the asymptotic variance of
1821: $\hat{d}_{log}$ should be 
1822: \begin{align}
1823: \text{Var}\left(\hat{d}_{log}\right) =
1824: \frac{1}{k}\frac{\pi^2}{4}\exp^2\left(\log(d)\right) +
1825: O\left(\frac{1}{k^2}\right) = \frac{\pi^2d^2}{4k} +
1826: O\left(\frac{1}{k^2}\right). 
1827: \end{align}
1828: 
1829: Similarly, the asymptotic variance of $\hat{d}_\lambda$ is 
1830: \begin{align}
1831: \text{Var}\left(\hat{d}_{\lambda}\right) = \frac{d^2}{k}
1832: \frac{\sin^2(\lambda \pi/2)}{\lambda^2 \cos(\lambda\pi)} +
1833: O\left(\frac{1}{k^2}\right), \hspace{0.2in} |\lambda| <1/2
1834: \end{align}
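
The leading term can be reproduced with the Delta Method as follows. This is a sketch; the symbols $\bar{m}$ and $g$ are introduced only here. Let $\bar{m} = \frac{1}{k}\sum_{j=1}^k|x_j|^\lambda\cos(\lambda\pi/2)$, so that $\text{E}\left(\bar{m}\right) = d^\lambda$ and $\hat{d}_\lambda = g(\bar{m})$ with $g(m) = m^{1/\lambda}$. Using $\text{E}\left(|x|^{2\lambda}\right) = \frac{d^{2\lambda}}{\cos(\lambda\pi)}$, which requires $|\lambda|<1/2$, and $\cos(\lambda\pi) = \cos^2(\lambda\pi/2)-\sin^2(\lambda\pi/2)$, 
\begin{align}\notag
\text{Var}\left(\bar{m}\right) =
\frac{d^{2\lambda}}{k}\left(\frac{\cos^2(\lambda\pi/2)}{\cos(\lambda\pi)}-1\right)
= \frac{d^{2\lambda}}{k}\frac{\sin^2(\lambda\pi/2)}{\cos(\lambda\pi)},
\hspace{0.2in}
g^\prime\left(d^\lambda\right) = \frac{d^{1-\lambda}}{\lambda}.
\end{align}
Multiplying $\left(g^\prime\left(d^\lambda\right)\right)^2 = \frac{d^{2-2\lambda}}{\lambda^2}$ by $\text{Var}\left(\bar{m}\right)$ gives the stated leading term $\frac{d^2}{k}\frac{\sin^2(\lambda \pi/2)}{\lambda^2 \cos(\lambda\pi)}$.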
1835: 
1836: $\text{Var}\left(\hat{d}_{\lambda}\right)\rightarrow \infty$
1837: as $|\lambda|\rightarrow \frac{1}{2}$. $\text{Var}\left(\hat{d}_{\lambda}\right)$
1838: converges to $\text{Var}\left(\hat{d}_{log}\right)$ as $\lambda
1839: \rightarrow 0$, because 
1840: \begin{align}
1841: \underset{\lambda\rightarrow 0}\lim\frac{\sin^2(\lambda
1842:   \pi/2)}{\lambda^2 \cos(\lambda\pi)} = \frac{\pi^2}{4}.
1843: \end{align}
1844: 
1845: This completes the proof of Lemma \ref{lem_d_log}.
1846: 
1847: 
1848: 
1849: 
1850: \section{Proof of Lemma \ref{lem_d_gm}}\label{app_proof_lem_d_gm} 
1851: 
1852: Assume that $x_1$, $x_2$, ..., $x_k$, are i.i.d. $C(0,d)$. 
1853: The estimator, $\hat{d}_{gm,c}$, expressed as
1854: \begin{align}
1855: \hat{d}_{gm,c} = \cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k|x_j|^{1/k},
1856: \end{align}
1857: is unbiased, because, from Lemma \ref{lem_d_log}, 
1858: \begin{align}\notag
1859: \text{E}\left(\hat{d}_{gm,c}\right) &=
1860:   \cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k\text{E}\left(|x_j|^{1/k}\right) \\\notag
1861: &=\cos^k\left(\frac{\pi}{2k}\right)\prod_{j=1}^k\left(\frac{d^{1/k}}{\cos\left(\frac{\pi}{2k}\right)}\right)\\
1862: &=d.
1863: \end{align}
1864: 
1865: The variance  is 
1866: \begin{align}\notag
1867: \text{Var}\left(\hat{d}_{gm,c}\right) &=
1868: \cos^{2k}\left(\frac{\pi}{2k}\right)\prod_{j=1}^k\text{E}\left(|x_j|^{2/k}\right)
1869:   -d^2\\
1870: &=
1871: d^2
1872: \left(\frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi}{k}\right)}-1
1873: \right)\\ 
1874: &=\frac{\pi^2}{4}\frac{d^2}{k}  + \frac{\pi^4}{32}\frac{d^2}{k^2}+ O\left(\frac{1}{k^3}\right),
1875: \end{align}
1876: \noindent because
1877: \begin{align}\notag
1878: \frac{\cos^{2k}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi}{k}\right)}
1879: &=
1880: \left(\frac{1}{2}+\frac{1}{2}\left(\frac{1}{\cos(\pi/k)}\right)\right)^k
1881: \\ \notag
1882: &=\left(1+\frac{1}{4}\frac{\pi^2}{k^2} +
1883:   \frac{5}{48}\frac{\pi^4}{k^4}+O\left(\frac{1}{k^6}\right)\right)^k
1884: \\\notag
1885: &=1+k\left(\frac{1}{4}\frac{\pi^2}{k^2}+\frac{5}{48}\frac{\pi^4}{k^4}\right)
1886: +
1887: \frac{k(k-1)}{2}\left(\frac{1}{4}\frac{\pi^2}{k^2}+\frac{5}{48}\frac{\pi^4}{k^4}\right)^2+
1888: ... \\
1889: &=1+\frac{\pi^2}{4}\frac{1}{k}+\frac{\pi^4}{32}\frac{1}{k^2} +O\left(\frac{1}{k^3}\right).
1890: \end{align}
1891: 
1892: Some more algebra can similarly show the third and fourth central moments: 
1893: \begin{align}
1894: &\text{E}\left(\hat{d}_{gm,c} -
1895:   \text{E}\left(\hat{d}_{gm,c}\right)\right)^3 =
1896: \frac{3\pi^4}{16}\frac{d^3}{k^2} + O\left(\frac{1}{k^3}\right)\\
1897: &\text{E}\left(\hat{d}_{gm,c} -
1898:   \text{E}\left(\hat{d}_{gm,c}\right)\right)^4 =
1899: \frac{3\pi^4}{16}\frac{d^4}{k^2} + O\left(\frac{1}{k^3}\right).
1900: \end{align}
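
One way to organize the omitted algebra is sketched below; the shorthands $m_t$ and $s$ are introduced only for this sketch, and $k>4$ is assumed so that the moments exist. By independence and Lemma \ref{lem_d_log}, for $t<k$, 
\begin{align}\notag
m_t = \frac{\text{E}\left(\hat{d}_{gm,c}\right)^t}{d^t} =
\frac{\cos^{kt}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi t}{2k}\right)},
\hspace{0.2in}
\log m_t = \left(t^2-t\right)s + O\left(\frac{1}{k^3}\right),
\hspace{0.2in} s = \frac{\pi^2}{8k},
\end{align}
using $-\log\cos(x) = \frac{x^2}{2} + O\left(x^4\right)$. Since $m_1 = 1$, expanding the exponentials to second order in $s$ gives
\begin{align}\notag
m_2 - 1 &= 2s + 2s^2 + O\left(\frac{1}{k^3}\right) =
\frac{\pi^2}{4}\frac{1}{k} + \frac{\pi^4}{32}\frac{1}{k^2} + O\left(\frac{1}{k^3}\right), \\\notag
m_3 - 3m_2 + 2 &= 12s^2 + O\left(\frac{1}{k^3}\right) =
\frac{3\pi^4}{16}\frac{1}{k^2} + O\left(\frac{1}{k^3}\right), \\\notag
m_4 - 4m_3 + 6m_2 - 3 &= 12s^2 + O\left(\frac{1}{k^3}\right) =
\frac{3\pi^4}{16}\frac{1}{k^2} + O\left(\frac{1}{k^3}\right),
\end{align}
which, after multiplying by $d^2$, $d^3$, and $d^4$ respectively, agree with the variance and with the third and fourth central moments stated above.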
1901: 
1902: 
1903: Therefore, we have completed the proof of Lemma \ref{lem_d_gm}. 
1904: 
1905: 
1906: \section{Proof of Lemma \ref{lem_d_gm_tail}}
1907: \label{app_proof_lem_d_gm_tail} 
1908: 
1909: This section proves the tail bounds for $\hat{d}_{gm,c}$. 
1910: Note that $\hat{d}_{gm,c}$ does not have a moment generating function
1911: because  $\text{E}\left(\hat{d}_{gm,c}\right)^t=\infty$ if
1912: $t\geq k$. However, we can still use the Markov moment bound.\footnote{In
1913: fact, even when the moment generating function does exist, for any positive
1914: random variable, the Markov moment bound is always sharper than the
1915: Chernoff bound, although the Chernoff bound will be in an exponential
1916: form. See \cite{Article:Philips_95,Article:Lugosi_04}.}  
1917: 
1918: For any $\epsilon \geq0$ and $0\leq t<k$, the Markov inequality says 
1919: \begin{align}
1920: \mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right) \leq \frac{\text{E}\left(\hat{d}_{gm,c}\right)^t}{(1+\epsilon)^td^t}
1921: = 
1922: \frac{\cos^{kt}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi
1923:       t}{2k}\right)(1+\epsilon)^{t}},
1924: \end{align}
1925: \noindent which can be minimized by choosing the optimum $t = t_1^*$, 
1926: where 
1927: \begin{align}
1928: t_1^* = \frac{2k}{\pi}\tan^{-1}\left(\left(\log(1+\epsilon) -
1929:     k\log\cos\left(\frac{\pi}{2k}\right)\right)\frac{2}{\pi}\right). 
1930: \end{align}
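
The expression for $t_1^*$ follows from a one-line calculation, sketched here with $h(t)$ denoting the logarithm of the bound (a symbol introduced only for this sketch): 
\begin{align}\notag
h(t) &= kt\log\cos\left(\frac{\pi}{2k}\right) -
k\log\cos\left(\frac{\pi t}{2k}\right) - t\log(1+\epsilon), \\\notag
h^\prime(t) &= k\log\cos\left(\frac{\pi}{2k}\right) +
\frac{\pi}{2}\tan\left(\frac{\pi t}{2k}\right) - \log(1+\epsilon).
\end{align}
Setting $h^\prime(t) = 0$ and solving for $t$ gives $t_1^*$. The stationary point is a minimum because $h^{\prime\prime}(t) = \frac{\pi^2}{4k}\sec^2\left(\frac{\pi t}{2k}\right) > 0$ on $0\leq t<k$.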
1931: 
We need to make sure that $0\leq t_1^*<k$. $t_1^*\geq0$ because $\log\cos(.)\leq
0$; and $t_1^*<k$ because $\tan^{-1}(.) < \frac{\pi}{2}$ for every
finite argument. 
1935: 
1936: For $0\leq \epsilon \leq1$, we can prove an exponential bound for 
1937: $\mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right)$. 
1938: First of all, note that we do not
1939: have to choose the optimum $t = t_1^*$. By the Taylor expansion, for
1940: small $\epsilon$, $t_1^*$ can be well approximated by 
1941: \begin{align}
1942: t_1^* \approx \frac{4k\epsilon}{\pi^2} + \frac{1}{2} \approx
1943: \frac{4k\epsilon}{\pi^2} = t_1^{**}.
1944: \end{align}
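
As a rough sketch of this approximation, assume the leading-order expansions $\tan^{-1}(z)\approx z$, $\log(1+\epsilon)\approx\epsilon$, and $-k\log\cos\left(\frac{\pi}{2k}\right)\approx\frac{\pi^2}{8k}$ (the last one from $\log\cos(u)\approx -u^2/2$). Then
\begin{align*}
t_1^* \approx \frac{2k}{\pi}\cdot\frac{2}{\pi}\left(\epsilon +
  \frac{\pi^2}{8k}\right) = \frac{4k\epsilon}{\pi^2} + \frac{1}{2},
\end{align*}
and the additive $\frac{1}{2}$ is negligible relative to $\frac{4k\epsilon}{\pi^2}$ when $k\epsilon$ is large.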
1945: 
1946: Therefore, taking $t=t_1^{**} = \frac{4k\epsilon}{\pi^2}$, the tail bound becomes 
1947: \begin{align}\notag
1948: \mathbf{Pr}\left(\hat{d}_{gm,c} \geq (1+\epsilon)d \right) &\leq  \frac{\cos^{kt_1^{**}}\left(\frac{\pi}{2k}\right)}{\cos^k\left(\frac{\pi
1949:       t_1^{**}}{2k}\right)(1+\epsilon)^{t_1^{**}}} \\\notag
1950: &=
1951: \left(\frac{\cos^{t_1^{**}}\left(\frac{\pi}{2k}\right)}{\cos\left(\frac{2\epsilon}{\pi}\right)(1+\epsilon)^{4\epsilon/\pi^2}}
1952: \right)^k \\\notag
1953: &\leq \left(\frac{1}{\cos\left(\frac{2\epsilon}{\pi}\right)(1+\epsilon)^{4\epsilon/\pi^2}}
1954: \right)^k \\\notag
1955: &=\exp\left(-k\left(\log\left(\cos\left(\frac{2\epsilon}{\pi}\right)\right) +
1956: \frac{4\epsilon}{\pi^2}\log(1+\epsilon)\right)\right)
1957: \\ 
1958: &\leq \exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right),
1959: \hspace{0.1in} 0\leq \epsilon\leq1\label{eqn_proof_right}
1960: \end{align}
1961: 
The last step in (\ref{eqn_proof_right}) needs some
explanation. First, by the Taylor expansion, 
1964: \begin{align}\notag
1965: &\log\left(\cos\left(\frac{2\epsilon}{\pi}\right)\right) +
1966: \frac{4\epsilon}{\pi^2}\log(1+\epsilon) \\\notag
1967: =& \left(-\frac{2\epsilon^2}{\pi^2} -
1968:   \frac{4}{3}\frac{\epsilon^4}{\pi^4} +... \right)+
1969: \frac{4\epsilon}{\pi^2}\left(\epsilon -
1970:   \frac{1}{2}\epsilon^2+...\right)\\
1971: =& \frac{2\epsilon^2}{\pi^2}\left(1-\epsilon+...\right)
1972: \end{align}
1973: 
1974: Therefore, we can seek the smallest constant $\gamma_1$ so that
1975: \begin{align}
1976: \log\left(\cos\left(\frac{2\epsilon}{\pi}\right)\right) +
1977: \frac{4\epsilon}{\pi^2}\log(1+\epsilon) 
1978: \geq \frac{\epsilon^2}{\gamma_1(1+\epsilon)} =
1979: \frac{\epsilon^2}{\gamma_1}(1-\epsilon +...)
1980: \end{align}
1981: 
1982: It is easy to see that as $\epsilon \rightarrow 0$,
1983: $\gamma_1\rightarrow \frac{\pi^2}{2}$. Figure \ref{fig_gm_constant}(a)
1984: illustrates that it suffices to let $\gamma_1 = 8$, which can be
1985: numerically verified. This is why the last step in
(\ref{eqn_proof_right}) holds. Of course, a better (smaller) constant
is available for a specific $\epsilon$, e.g., $\epsilon =0.5$. 
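
As an illustrative numerical check (rounded values computed directly from the expression, not read off the figure; the shorthand $f(\epsilon)$ is introduced here only for illustration), let $f(\epsilon) = \log\left(\cos\left(\frac{2\epsilon}{\pi}\right)\right) + \frac{4\epsilon}{\pi^2}\log(1+\epsilon)$. Then
\begin{align*}
f(0.5) &\approx -0.0515 + 0.0822 = 0.0306, &
\frac{\epsilon^2/(1+\epsilon)}{f(\epsilon)} &\approx \frac{0.1667}{0.0306}
\approx 5.44,\\
f(1) &\approx -0.2180 + 0.2809 = 0.0629, &
\frac{\epsilon^2/(1+\epsilon)}{f(\epsilon)} &\approx \frac{0.5}{0.0629}
\approx 7.95.
\end{align*}
Both ratios are below 8, and the ratio at $\epsilon = 0.5$ is close to the limiting value $\frac{\pi^2}{2}\approx 4.93$.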
1988: 
1989: 
1990: Now we need to  show the other tail bound $\mathbf{Pr}\left(\hat{d}_{gm,c}
1991:   \leq  (1-\epsilon)d \right)$: 
1992: \begin{align}\notag
1993: &\mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d \right)  
1994: =\mathbf{Pr}\left(\cos\left(\frac{\pi}{2k}\right)^k
1995:   \prod_{j=1}^k|x_j|^{1/k} \leq  (1-\epsilon)d \right) \\\notag
1996: =&\mathbf{Pr}\left(
1997:   \sum_{j=1}^k\log\left(|x_j|^{1/k}\right)\leq
1998:   \log\left(\frac{(1-\epsilon)d}{\cos^k\left(\frac{\pi}{2k}\right)}\right)\right)\\\notag
1999: =&\mathbf{Pr}\left( \exp\left(
2000:   \sum_{j=1}^k\log\left(|x_j|^{-t/k}\right)\right)\geq 
2001:   \exp\left(-t\log\left(\frac{(1-\epsilon)d}{\cos^k\left(\frac{\pi}{2k}\right)}\right)\right)\right), \hspace{0.2in} 0\leq t<k \\
2002: \leq & \left(\frac{(1-\epsilon)}{\cos^k\left(\frac{\pi}{2k}\right)}\right)^t
2003: \frac{1}{\cos^k\left(\frac{\pi t}{2k}\right)}, \hspace{0.2in}
2004: \text{(Chernoff bound)}
2005: \end{align}
2006: \noindent which is minimized at $t = t_2^*$ 
2007: \begin{align}
2008: t_2^* = \frac{2k}{\pi}\tan^{-1}\left(\left(-\log(1-\epsilon) +
2009:     k\log\cos\left(\frac{\pi}{2k}\right)\right)\frac{2}{\pi}\right),
2010: \end{align}
2011: \noindent provided $k\geq \frac{\pi^2}{8\epsilon}$, otherwise $t_2^*$
2012: may be less than 0. 
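
A rough sketch of where this condition comes from, assuming the leading-order approximation $k\log\cos\left(\frac{\pi}{2k}\right)\approx -\frac{\pi^2}{8k}$: since $-\log(1-\epsilon)\geq\epsilon$, the argument of $\tan^{-1}(\cdot)$ in $t_2^*$ is nonnegative, to this order, whenever
\begin{align*}
\epsilon - \frac{\pi^2}{8k} \geq 0 \quad\Longleftrightarrow\quad k \geq
\frac{\pi^2}{8\epsilon}.
\end{align*}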
2013: 
2014: Again, $t_2^*$ can be replaced by its approximation 
2015: \begin{align}
2016: t_2^* \approx t_2^{**} = \frac{4k\epsilon}{\pi^2},  
2017: \end{align}
2018: \noindent provided $k\geq\frac{\pi^2}{4\epsilon}$, otherwise the
2019: probability upper bound may exceed one.  Therefore, 
2020: 
2021: \begin{align}\notag
2022: \mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d \right)  
2023: \leq& \left(\frac{(1-\epsilon)}{\cos^k\left(\frac{\pi}{2k}\right)}\right)^{t_2^{**}}
2024: \frac{1}{\cos^k\left(\frac{\pi t_2^{**}}{2k}\right)}\\\notag
2025: =&\exp\left(-k\left(\log\left(\cos\frac{2\epsilon}{\pi}\right) -
2026:     \frac{4\epsilon}{\pi^2}\log(1-\epsilon) +  \frac{4k\epsilon}{\pi^2}\log\left(\cos\frac{\pi}{2k}\right) \right)\right).
2027: \end{align}
2028: \noindent We can bound
2029: $\frac{4k\epsilon}{\pi^2}\log\left(\cos\frac{\pi}{2k}\right)$ by restricting $k$.  
2030: 
2031: In order to attain $\mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d
2032: \right)  \leq
2033: \exp\left(-k\left(\frac{\epsilon^2}{8(1+\epsilon)}\right)\right)$, we
have to restrict $k$ to be larger than a certain value. For no
particular reason, we choose to express the restriction as $k \geq
2036: \frac{\pi^2}{\gamma_2\epsilon}$, for some constant $\gamma_2$. We
2037: find $k \geq
2038: \frac{\pi^2}{1.5\epsilon}$ suffices, although readers can verify that a
2039: slightly better (smaller) restriction would be $k \geq
2040: \frac{1}{4/\pi^2-1/4}\frac{1}{\epsilon} = \frac{\pi^2}{1.5326\epsilon} $. 
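
One way to see where the constant $\frac{1}{4/\pi^2-1/4}$ comes from is a leading-order sketch for small $\epsilon$ and large $k$, assuming $\log\cos(u)\approx -u^2/2$ and $-\log(1-\epsilon)\approx\epsilon$:
\begin{align*}
\log\left(\cos\frac{2\epsilon}{\pi}\right) -
\frac{4\epsilon}{\pi^2}\log(1-\epsilon) +
\frac{4k\epsilon}{\pi^2}\log\left(\cos\frac{\pi}{2k}\right)
\approx -\frac{2\epsilon^2}{\pi^2} + \frac{4\epsilon^2}{\pi^2} -
\frac{\epsilon}{2k} = \frac{2\epsilon^2}{\pi^2} - \frac{\epsilon}{2k}.
\end{align*}
Requiring the right-hand side to be at least $\frac{\epsilon^2}{8}$, the leading term of $\frac{\epsilon^2}{8(1+\epsilon)}$, gives
\begin{align*}
\frac{\epsilon}{2k} \leq \epsilon^2\left(\frac{2}{\pi^2} -
  \frac{1}{8}\right) \quad\Longleftrightarrow\quad k \geq
\frac{1}{4/\pi^2-1/4}\frac{1}{\epsilon}.
\end{align*}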
2041: 
If $k \geq
\frac{\pi^2}{1.5\epsilon}$, then
$\frac{4k\epsilon}{\pi^2}\log\left(\cos\frac{\pi}{2k}\right) \geq
\frac{8}{3}\log\left(\cos \frac{3\epsilon}{4\pi}\right)$, because
$k\log\left(\cos\frac{\pi}{2k}\right)$ is increasing in $k$ and
$\frac{\pi}{2k} = \frac{3\epsilon}{4\pi}$ at $k =
\frac{\pi^2}{1.5\epsilon}$. Therefore, 
\begin{align}\notag
\mathbf{Pr}\left(\hat{d}_{gm,c} \leq  (1-\epsilon)d \right) \leq  &\exp\left(-k\left(\log\left(\cos\frac{2\epsilon}{\pi}\right) -
    \frac{4\epsilon}{\pi^2}\log(1-\epsilon) +
    \frac{8}{3}\log\left(\cos
      \frac{3\epsilon}{4\pi}\right)\right)\right)\\
\leq &\exp\left(-k\frac{\epsilon^2}{8(1+\epsilon)}\right), \hspace{0.2in}
k\geq \frac{\pi^2}{1.5\epsilon}\label{eqn_proof_left}
\end{align}

\begin{figure}[h]
\begin{center}\mbox{
\subfigure[]{\includegraphics[width = 2.5in]{fig/gm_bound_const.eps}}\hspace{0.5in}
\subfigure[]{\includegraphics[width = 2.5in]{fig/gm_bound_const_left.eps}}}
\end{center}\vspace{-0.4in}
\caption{ (a):
  $\frac{\epsilon^2/(1+\epsilon)}{\log\left(\cos\left(\frac{2\epsilon}{\pi}\right)\right) +
\frac{4\epsilon}{\pi^2}\log(1+\epsilon)}$ as a function of
$\epsilon$. (b): $\frac{\epsilon^2/(1+\epsilon)}{\log\left(\cos\frac{2\epsilon}{\pi}\right) -
    \frac{4\epsilon}{\pi^2}\log(1-\epsilon) +
    \frac{8}{3}\log\left(\cos
      \frac{3\epsilon}{4\pi}\right) }$ as a function of $\epsilon$. Graphically, we know that it suffices to use a constant 8
in (\ref{eqn_proof_right}) and (\ref{eqn_proof_left}). The optimal
constant will be different for different $\epsilon$. For example, if
$\epsilon = 0.2$, we could replace the constant 8 by a constant 5.  }\label{fig_gm_constant}\vspace{-0.2in}
\end{figure}
2071: 
2072: 
2073: 
2074: 
2075: This completes the proof of Lemma \ref{lem_d_gm_tail}. 
2076: 
2077: 
2078: \section{Proof of Lemma \ref{lem_mle_asymp}} \label{app_proof_lem_asymp}
2079: Assume $x \sim C(0,d)$. The $\log$
2080: likelihood ($l(x;d)$) and first three derivatives
2081: are  
2082: \begin{align}
2083: &l(x;d) = \log(d) - \log(\pi) - \log(x^2+d^2),\\
2084: &l^\prime(d) = \frac{1}{d} - \frac{2d}{x^2+d^2}\\
2085: &l^{\prime\prime}(d) = -\frac{1}{d^2} -
2086: \frac{2x^2-2d^2}{(x^2+d^2)^2}\\
2087: &l^{\prime\prime\prime}(d) = \frac{2}{d^3} +
2088: \frac{4d}{(x^2+d^2)^2} + \frac{8d(x^2-d^2)}{(x^2+d^2)^3}
2089: %\\
2090: %&l^{\prime\prime\prime\prime}(d) = -\frac{6}{d^3} +
2091: %\frac{4x^2-12d^2}{(x^2+d^2)^3} + \frac{8x^4-64x^2d^2+24d^4)}{(x^2+d^2)^4}.
2092: \end{align}
2093: 
2094: The MLE  $\hat{d}_{MLE}$ is
2095: asymptotically normal with mean $d$ and variance
2096: $\frac{1}{k\text{I}(d)}$, where $\text{I}(d)$, the expected Fisher
2097: Information, is 
2098: \begin{align}
2099: \text{I} = \text{I}(d) = \text{E}\left(-l^{\prime\prime}(d)\right)  =
2100: \frac{1}{d^2} +
2101: 2\text{E}\left(\frac{x^2-d^2}{(x^2+d^2)^2}\right) = \frac{1}{2d^2},
2102: \end{align}
2103: \noindent because
2104: \begin{align}\notag
2105: \text{E}\left(\frac{x^2-d^2}{(x^2+d^2)^2}\right) &= \frac{d}{\pi}
2106: \int_{-\infty}^\infty \frac{x^2-d^2}{(x^2+d^2)^3}dx \\ \notag
2107: &=\frac{d}{\pi} \int_{-\pi/2}^{\pi/2} \frac{d^2(\tan^2(t) -
2108:   1)}{d^6/\cos^6(t)} \frac{d}{\cos^2(t)}dt \\\notag
&=\frac{1}{d^2\pi}\int_{-\pi/2}^{\pi/2}\left(\cos^2(t) - 2\cos^4(t)\right) dt  \\
&= \frac{1}{d^2\pi}\left(\frac{\pi}{2}-2\frac{3}{8}\pi\right) = -\frac{1}{4d^2}.
2111: \end{align}
2112: Therefore, we obtain
2113: \begin{align}
2114: \text{Var}\left(\hat{d}_{MLE}\right) = \frac{2d^2}{k} + O\left(\frac{1}{k^2}\right).
2115: \end{align}
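
As a cross-check of $\text{I}(d)$ (a sketch that assumes the standard identity $\text{I}(d) = \text{E}\left(l^\prime(d)\right)^2$), the same substitution $x = d\tan(t)$ makes $t$ uniform on $(-\pi/2,\pi/2)$ and gives
\begin{align*}
l^\prime(d) = \frac{x^2-d^2}{d(x^2+d^2)} = -\frac{\cos(2t)}{d},
\hspace{0.3in}
\text{E}\left(l^\prime(d)\right)^2 =
\frac{1}{d^2}\frac{1}{\pi}\int_{-\pi/2}^{\pi/2}\cos^2(2t)dt =
\frac{1}{2d^2},
\end{align*}
which agrees with the value obtained from $\text{E}\left(-l^{\prime\prime}(d)\right)$.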
2116: 
2117: General formulas for the bias and higher moments of the MLE are
2118: available in \citep{Article:Bartlett_53,Article:Shenton_63}.  We need to evaluate
the expressions in \cite[16a-16d]{Article:Shenton_63}, which involves
tedious algebra: 
2121: \begin{align}
2122: &\text{E}\left(\hat{d}_{MLE}\right) = d - \frac{[12]}{2k\text{I}^2} +
2123: O\left(\frac{1}{k^2}\right) \\
2124: &\text{Var}\left(\hat{d}_{MLE}\right) = \frac{1}{k\text{I}} +
2125: \frac{1}{k^2}\left(-\frac{1}{\text{I}}+\frac{[1^4]-[1^22]-[13]}{\text{I}^3}
2126: +\frac{3.5[12]^2-[1^3]^2}{\text{I}^4}\right) + 
2127: O\left(\frac{1}{k^3}\right) \\
&\text{E}\left(\hat{d}_{MLE}-\text{E}\left(\hat{d}_{MLE}\right)\right)^3 = 
\frac{[1^3]-3[12]}{k^2\text{I}^3}+O\left(\frac{1}{k^3}\right) \\\notag
2130: &\text{E}\left(\hat{d}_{MLE}-\text{E}\left(\hat{d}_{MLE}\right)\right)^4= 
2131: \frac{3}{k^2\text{I}^2} +
2132: \frac{1}{k^3}\left(-\frac{9}{\text{I}^2}+
2133:   \frac{7[1^4] - 6[1^22]-10[13]}{\text{I}^4}\right)\\
2134: &\hspace{1.7in} +
2135: \frac{1}{k^3}\left(\frac{-6[1^3]^2-12[1^3][12]+45[12]^2}{\text{I}^5}\right)+O\left(\frac{1}{k^4}\right),
2136: \end{align}
2137: \noindent where, after re-formatting,
2138: \begin{align}\notag
2139: &[12] = \text{E}(l^\prime)^3 +  \text{E}(l^\prime l^{\prime\prime}),
2140: \hspace{0.3in} [1^4] = \text{E}(l^\prime)^4, \hspace{0.3in} [1^22] =
2141: \text{E}(l^{\prime\prime}(l^\prime)^2) +  \text{E}(l^{\prime})^4, \\
2142: &[13] = \text{E}(l^\prime)^4 +
2143: 3\text{E}(l^{\prime\prime}(l^\prime)^2)  + \text{E}(l^\prime
2144: l^{\prime\prime\prime}), \hspace{0.3in} [1^3]=\text{E}(l^\prime)^3.
2145: \end{align}
2146: 
We will omit most of the algebra. To help readers verify the
results, the following formula, which we derive, may be useful:
2149: \begin{align}
2150: \text{E}\left(\frac{1}{x^2+d^2}\right)^m =
2151: \frac{1\times3\times5\times...\times(2m-1)}{2\times4\times6\times...\times(2m)}\frac{1}{d^{2m}},
2152: \hspace{0.2in} m = 1, 2, 3, ...
2153: \end{align}
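
A short sketch of why this formula holds: with $x = d\tan(t)$, $t$ is uniform on $(-\pi/2,\pi/2)$ and $\frac{1}{x^2+d^2} = \frac{\cos^2(t)}{d^2}$, so the expectation reduces to a Wallis-type integral,
\begin{align*}
\text{E}\left(\frac{1}{x^2+d^2}\right)^m =
\frac{1}{d^{2m}}\frac{1}{\pi}\int_{-\pi/2}^{\pi/2}\cos^{2m}(t)dt =
\frac{1\times3\times...\times(2m-1)}{2\times4\times...\times(2m)}\frac{1}{d^{2m}}.
\end{align*}
For example, $m=1$ gives $\frac{1}{2d^2}$ and $m=2$ gives $\frac{3}{8d^4}$.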
2154: 
Without giving the details, we report 
2156: \begin{align}\notag
2157: &\text{E}\left(l^{\prime}\right)^3 = 0, \hspace{0.3in}
2158: \text{E}\left(l^\prime l^{\prime\prime}\right) = -\frac{1}{2}\frac{1}{d^3}, \hspace{0.3in}
2159: \text{E}\left(l^{\prime}\right)^4 =
2160: \frac{3}{8}\frac{1}{d^4}, \\
2161: &\text{E}(l^{\prime\prime}(l^\prime)^2) = -\frac{1}{8}\frac{1}{d^4},  \hspace{0.3in}
2162: \text{E}\left(l^{\prime}l^{\prime\prime\prime}\right) =
2163: \frac{3}{4}\frac{1}{d^4}.
2164: \end{align}
2165: Hence
2166: \begin{align}
2167: &[12] = -\frac{1}{2}\frac{1}{d^3}, \hspace{0.25in} [1^4] =
2168: \frac{3}{8}\frac{1}{d^4}, \hspace{0.25in}[1^22] =
2169: \frac{1}{4}\frac{1}{d^4}, \hspace{0.25in}[13] =
2170: \frac{3}{4}\frac{1}{d^4}, \hspace{0.25in}[1^3] = 0.
2171: \end{align}
2172: 
2173: 
2174: Thus,  we obtain
2175: \begin{align}
2176: &\text{E}\left(\hat{d}_{MLE}\right) = d
2177: +\frac{d}{k} + O\left(\frac{1}{k^2}\right)\\ 
2178: &\text{Var}\left(\hat{d}_{MLE}\right) = \frac{2d^2}{k} + \frac{7d^2}{k^2} +
2179: O\left(\frac{1}{k^3}\right) \\ 
2180: &\text{E}\left(\hat{d}_{MLE}-\text{E}\left(\hat{d}_{MLE}\right)\right)^3 =
2181: \frac{12d^3}{k^2} + O\left(\frac{1}{k^3}\right)\\
2182: &\text{E}\left(\hat{d}_{MLE}-\text{E}\left(\hat{d}_{MLE}\right)\right)^4 =
2183: \frac{12d^4}{k^2} + \frac{222d^4}{k^3} +  O\left(\frac{1}{k^4}\right).
2184: \end{align}
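
As a worked check of some of the coefficients, substitute $\text{I} = \frac{1}{2d^2}$ and the bracket values into the general formulas. The bias term, the $k^{-2}$ term of the variance, and the $k^{-3}$ term of the fourth central moment become
\begin{align*}
-\frac{[12]}{2k\text{I}^2} &= \frac{1}{2d^3}\cdot\frac{4d^4}{2k} =
\frac{d}{k},\\
-\frac{1}{\text{I}}+\frac{[1^4]-[1^22]-[13]}{\text{I}^3}
+\frac{3.5[12]^2-[1^3]^2}{\text{I}^4} &= -2d^2 -
\frac{5}{8}\cdot 8d^2 + \frac{3.5}{4}\cdot 16d^2 = 7d^2,\\
-\frac{9}{\text{I}^2}+ \frac{7[1^4] - 6[1^22]-10[13]}{\text{I}^4} +
\frac{45[12]^2}{\text{I}^5} &= -36d^4 - \frac{51}{8}\cdot 16d^4 +
\frac{45}{4}\cdot 32d^4 = 222d^4,
\end{align*}
where the terms involving $[1^3]=0$ have been dropped.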
2185: 
2186: Because $\hat{d}_{MLE}$ has $O\left(\frac{1}{k}\right)$ bias, we
2187: recommend the bias-corrected estimator 
2188: \begin{align}
2189: \hat{d}_{MLE,c} = \hat{d}_{MLE}\left(1-\frac{1}{k}\right), 
2190: \end{align}
2191: whose first four moments are 
2192: \begin{align}
2193: &\text{E}\left(\hat{d}_{MLE,c}\right) = d + O\left(\frac{1}{k^2}\right)\\ 
2194: &\text{Var}\left(\hat{d}_{MLE,c}\right) = \frac{2d^2}{k} + \frac{3d^2}{k^2} +
2195: O\left(\frac{1}{k^3}\right) \\ 
2196: &\text{E}\left(\hat{d}_{MLE,c}-\text{E}\left(\hat{d}_{MLE,c}\right)\right)^3 =
2197: \frac{12d^3}{k^2} + O\left(\frac{1}{k^3}\right)\\
2198: &\text{E}\left(\hat{d}_{MLE,c}-\text{E}\left(\hat{d}_{MLE,c}\right)\right)^4 =
2199: \frac{12d^4}{k^2} + \frac{186d^4}{k^3} +  O\left(\frac{1}{k^4}\right),
2200: \end{align}
\noindent by brute-force algebra. First, adding the squared bias $\frac{d^2}{k^2}$ to the variance of $\hat{d}_{MLE}$ gives
2202: \begin{align}
2203: \text{E}\left(\hat{d}_{MLE} - d\right)^2 = \frac{2d^2}{k} + \frac{8d^2}{k^2}
2204: + O\left(\frac{1}{k^3}\right). 
2205: \end{align}
2206: Then 
2207: \begin{align}\notag
2208: \text{Var}\left(\hat{d}_{MLE,c}\right) &= \text{E}\left(\hat{d}_{MLE,c} -
2209:   \text{E}(\hat{d}_{MLE,c})\right)^2\\\notag &=
2210: \text{E}\left(\hat{d}_{MLE}\left(1-\frac{1}{k}\right) - d +
2211:   O\left(\frac{1}{k^2}\right)\right)^2 \\\notag &= \text{E}\left(\left(\hat{d}_{MLE}-d\right)\left(1-\frac{1}{k}\right) -\frac{d}{k} +
2212:   O\left(\frac{1}{k^2}\right)\right)^2  \\\notag 
&=\text{E}\left(\hat{d}_{MLE}-d\right)^2\left(1-\frac{2}{k}\right) +
\frac{d^2}{k^2} - 2\frac{d}{k}\left(1-\frac{1}{k}\right)\text{E}\left(\hat{d}_{MLE}-d\right) +
O\left(\frac{1}{k^3}\right) \\
2216: &= \frac{2d^2}{k} + \frac{3d^2}{k^2} + O\left(\frac{1}{k^3}\right). 
2217: \end{align}
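
As a quick cross-check, the same result follows from the scaling property of the variance, since $\hat{d}_{MLE,c}$ is a constant multiple of $\hat{d}_{MLE}$:
\begin{align*}
\text{Var}\left(\hat{d}_{MLE,c}\right) =
\left(1-\frac{1}{k}\right)^2\text{Var}\left(\hat{d}_{MLE}\right) =
\left(1-\frac{2}{k}+\frac{1}{k^2}\right)\left(\frac{2d^2}{k} +
  \frac{7d^2}{k^2}\right) + O\left(\frac{1}{k^3}\right) =
\frac{2d^2}{k} + \frac{3d^2}{k^2} + O\left(\frac{1}{k^3}\right).
\end{align*}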
2218: 
2219: We can evaluate the higher central moments of $\hat{d}_{MLE,c}$ similarly,
2220: but we skip the algebra. 
2221: 
2222: 
Therefore, we have completed the proof of Lemma \ref{lem_mle_asymp}.
2224: 
2225: 
2226: \end{document}
2227: 
2228: