
1: \documentclass[11pt]{article}

2:

3: \usepackage{amssymb}

4: \usepackage{amsmath}

5: \usepackage{fullpage}

6: \usepackage{times}

7: \usepackage{graphicx}

8: \setlength{\oddsidemargin}{-0.25in}

9: \setlength{\evensidemargin}{-0.25in}

10: \setlength{\topmargin}{0.5in}

11: \setlength{\headheight}{0pt}

12: \setlength{\headsep}{0pt}

13: \setlength{\footskip}{0.5in}

14: \setlength{\textheight}{8.75in}

15: \setlength{\textwidth}{7in}

16: \setlength{\marginparwidth}{0in}

17: \setlength{\marginparsep}{0in}

18: \newcommand{\SFW}{\mbox{S$^4$W}}

19: \newcommand{\paper}{paper}

20:

21: \title{Using Hierarchical Data Mining to Characterize \\

22: Performance of Wireless System Configurations}

23: \author{Alex Verstak$^*$, Naren Ramakrishnan$^*$, Kyung Kyoon Bae$^{\dagger}$, William H. Tranter$^{\dagger}$,\\

24: Layne T. Watson$^*$, Jian He$^*$, and Clifford A. Shaffer$^*$\\

25: \large $^*$Department of Computer Science\\

26: \large $^{\dagger}$Bradley Department of Electrical and Computer Engineering\\

27: \large Virginia Polytechnic Institute and State University\\

28: \large Blacksburg, VA 24061\\

29: \,\,\,\,\\

30: \large Theodore S. Rappaport\\

31: \large Department of Electrical and Computer Engineering\\

32: \large University of Texas\\

33: \large Austin, TX 78712}

34:

35: \date{}

36: \begin{document}

37:

38: \maketitle

39:

40: \begin{abstract}

41: \noindent

42: This \paper{} presents a statistical framework for assessing wireless

43: systems performance using hierarchical data mining techniques. We

44: consider WCDMA (wideband code division multiple access) systems with

45: two-branch STTD (space time transmit diversity) and 1/2 rate

46: convolutional coding (forward error correction codes). Monte Carlo

47: simulation estimates the bit error probability (BEP) of the system

48: across a wide range of signal-to-noise ratios (SNRs).  A performance

49: database of simulation runs is collected over a targeted space of

50: system configurations.  This database is then mined to obtain regions

51: of the configuration space that exhibit acceptable average performance.

52: The shape of the mined regions illustrates the joint influence of

53: configuration parameters on system performance.  The role of data

54: mining in this application is to provide explainable and statistically

55: valid design conclusions.  The research issue is to define

56: statistically meaningful aggregation of data in a manner that permits

57: efficient and effective data mining algorithms. We achieve a good

58: compromise between these goals and help establish the applicability of

59: data mining for characterizing wireless systems performance.

60: \end{abstract}

61: \thispagestyle{empty}

62: %\newpage

63: %\tableofcontents{}

64: \newpage

65:

66: \section{Introduction}

67:

68: Data mining is becoming increasingly relevant in simulation methodology

69: and computational science~\cite{naren-ayg}.  It entails the

70: `non-trivial process of identifying valid, novel, potentially useful,

71: and ultimately understandable patterns in data'~\cite{kdd-cacm}.  Data

72: mining can

be used in both predictive (e.g., quantitative assessment of the effect of factors on
some performance metric) and descriptive (e.g., summarization and

75: system characterization) settings. Our goal in this \paper{} is to

76: demonstrate a hierarchical data mining framework applied to the problem

77: of characterizing wireless system performance.

78:

79: %\ifthesis

80: %We study the effect of configuration parameters on the bit error probability

81: %(BEP) of a system simulated in~\SFW.

82: %%\else

83: This work is done in the context of the \SFW{} problem solving

84: environment~\cite{ipdps-s4w}---`Site-Specific System Simulator for

85: Wireless System Design'.  \SFW{} provides site-specific (deterministic)

86: electromagnetic propagation models as well as stochastic wireless

87: system models for predicting the performance of wireless systems in

88: specific environments, such as office buildings. \SFW{} is also

89: designed to support the inclusion of new models into the system,

90: visualization of results produced by the models, integration of

91: optimization loops around the models, validation of models by

92: comparison with field measurements, and management of the results

93: produced by a large series of experiments.  In this paper,

94: we study the effect of

95: configuration parameters on the bit error probability (BEP) of a system

96: simulated in~\SFW.

97:

98: The approach we take is to accumulate a performance database of

99: simulation runs that sweep over a targeted space of system

100: configurations. This database is then mined to obtain regions of the

101: configuration space that exhibit acceptable average performance.

102: Exploiting prior knowledge about the underlying simulation, organizing

103: the computational steps in data mining, and interpreting the results at

every stage are important research issues. In addition, we bring out

105: the often prevailing tension between making statistically meaningful

106: conclusions and the assumptions required for efficient and effective

107: data mining algorithms. This interplay leads to a novel set of problems

108: that we address in the context of the wireless systems performance

109: domain.

110:

111: Data mining algorithms work in a variety of ways but, for the purposes

112: of this \paper{}, it is helpful to think of them as performing systematic

113: aggregation and redescription of data into higher-level objects.  Our

114: work can be viewed as employing three such layers of aggregation:

115: points, buckets, and regions. Points (configurations) are records in

116: the performance database. These records contain configuration

117: parameters as well as unbiased estimates of bit error probabilities

118: that we use as performance metrics.  Buckets represent averages of

119: points.  We use buckets to reduce data dimensionality to two, which is

120: the most convenient number of dimensions for visualization.  Finally,

121: buckets are aggregated into 2D regions of constrained shape.  We find

122: regions of buckets where we are most confident that the configurations

123: exhibit acceptable average performance.  The shapes of these regions

124: illustrate the nature of the joint influence of the two selected

125: configuration parameters on the configuration performance.  Specific

126: region attributes, such as region width, provide estimates for the

127: thresholds of sensitivity of configurations to variations in parameter

128: values.

129:

130: \subsection{Reader's Guide}

131:

132: Our major contribution is the development of a statistical framework

133: for assessing wireless system performance using data mining

134: techniques.  The following section outlines wireless systems

135: performance simulation methodology and develops a statistical framework

136: for spatial aggregation of simulation results.

137: Section~\ref{sec:w-example} demonstrates a substantial subset of this

138: framework in the context of a performance study of WCDMA (wideband code

139: division multiple access~\cite{wcdma-holma}) systems that employ

140: two-branch STTD (space-time transmit diversity~\cite{sttd-alamouti})

141: techniques and 1/2 rate convolutional coding (forward error correction

142: codes~\cite{wcdma-holma}).  We study the effect of power imbalance

143: between the branches on the BEP of the system across a wide range of

144: average signal-to-noise ratios (SNRs).  Section~\ref{sec:gizmo} extends

145: the statistical framework to support computation of optimized regions

146: of the bucket space.  Such regions are computed by a well-known data

147: mining algorithm~\cite{fukuda-tods, fukuda-rectilinear}.

148: Section~\ref{sec:experiments} applies these concepts to the example in

149: Section~\ref{sec:w-example}.  Section~\ref{sec:conclusion} summarizes

150: present findings and outlines directions for future research.

151:

152: \section{The Statistics of Aggregation and the Aggregation of Statistics}

153: \label{sec:stat}

154:

155: Temporal variations in wireless channels have been extensively studied

156: in the literature~\cite{comm-ziemer}.  The present work uses a Monte

157: Carlo simulation of WCDMA wireless systems to study the effect of these

158: variations.  The simulation traces a number of frames of random

159: information bits through the encoding filters, the channel (a Rayleigh

160: fading linear filter~\cite{cm-hashemi}), and the decoding filters.  The

161: inputs are hardware parameters, average SNR, channel impulse response,

162: and the number of frames to simulate.  The output is the bit error

163: rate---the ratio of the number of information bits decoded in error to

164: the total number of information bits simulated.  Simulations of this

165: kind statistically model channel variations due to changes in the

166: environment and device movement across a small geographical area

167: (\emph{small-scale fading}~\cite{cm-hashemi}).  We refer to this kind

168: of channel variation as \emph{temporal variation} because a system is

169: simulated over a period of time.  Further, we say that a given list of

170: inputs to the WCDMA simulation is a \emph{configuration} or a

171: \emph{point} in the configuration space.

172:

173: \begin{figure}

174: \begin{center}

175: \includegraphics[width=4.0in]{wcdma-test}

176: \end{center}

177: \caption[Typical 1D slices of the configuration space.]{Typical 1D

178: slices of the configuration space.  The plots show simulated BERs (bit

179: error rates) of wireless systems for five common benchmark

180: channels~\cite{umts-utra} across a typical range of average SNRs.}

181: \label{fig:2d-example}

182: \end{figure}

183:

184: \emph{Spatial variations} are due to changes in system configurations.

185: We use this term to describe two quite different phenomena: changes in

186: the average SNR and channel impulse response due to \emph{large-scale

187: fading}~\cite{cm-hashemi} and variations of hardware parameters.  A

188: typical approach to the analysis of spatial variations is to run

189: several temporal variation simulations (i.e., compute bit error

190: rates --- BERs --- at several

191: points within a given area of interest) and plot 1D or 2D slices of the

192: configuration space, as shown in Figure~\ref{fig:2d-example}.  In this

193: \paper{}, we augment this approach with statistically meaningful

194: aggregation of performance estimates across several points.  The result

195: of this aggregation is a space of buckets, each bucket representing the

196: aggregation of a number of points. Moving up one level of aggregation

197: in this manner allows us to bring data mining algorithms to operate at

198: the level of buckets. The space of buckets mined by the data mining

algorithm is then visualized using color maps. The color of each bucket
encodes the confidence that the points (configurations) that map to this

201: bucket exhibit acceptable average performance.

202:

203: \begin{table}

204: \begin{center}

205: \begin{tabular}{c l}

206: \hline

207: $c,C,R$ & entities (points, buckets, regions) \\

208: $x,b,B$ & random variables \\

209: $E[x],E[b],E[B]$ & true means of random variables $x$, $b$, $B$ \\

210: $\sigma^2,\Sigma^2$ & true variances of random variables $b$, $B$ \\

211: $\hat{x},\hat{b},\hat{B}$ & estimates of means $E[x]$, $E[b]$, $E[B]$ of random variables $x$, $b$, $B$ \\

212: $\hat{\sigma}^2,\hat{\Sigma}^2$ & estimates of variances $\sigma^2$, $\Sigma^2$ of random variables $b$, $B$ \\

213: $P(E)$ & probability of event $E$, where $E$ is a boolean condition \\

214: $F_{N-1}(T)$ & $P(X<T)$ for $X$ having the Student~$t$ distribution with $N-1$ degrees of freedom \\

215: $\{x_k\}_{k=1}^n$ & set $\{x_1,x_2,\ldots,x_n\}$ \\

216: $\{\{x_{kj}\}_{j=1}^{n_k}\}_{k=1}^n$ & set $\{x_{11},x_{12},\ldots,x_{1n_1},x_{21},x_{22},\ldots,x_{2n_2},\ldots,x_{n1},x_{n2},\ldots,x_{nn_n}\}$ \\

217: \hline

218: \end{tabular}

219: \end{center}

220: \caption[Summary of mathematical notation.]{Summary of mathematical

221: notation. Lower case letters are used for points and upper case letters

222: are used for buckets and regions.  Additional conventions are

223: introduced in Table~\ref{tab:notation2}.}

224: \label{tab:notation}

225: \end{table}

226:

227: \subsection{The First Level of Aggregation: Points}

228:

229: Table~\ref{tab:notation} summarizes some of the syntactic conventions

230: used in this \paper{}.  Mathematically, we can think of the WCDMA

231: simulation as estimating the mean~$E[x_k]$ of a random variable~$x_k$

232: with some (unknown) distribution~\cite{comm-jeruchim} ($x_k$ is one

233: when the information bit is decoded in error or zero when it is decoded

234: correctly).  Each BER $\hat{x}_{kj}$, $1\le{}j\le{}n_k$, output by the

235: simulation is an unbiased estimate of the BEP~$E[x_k]$ of the simulated

236: configuration~$c_k$.

237: %The distribution of~$x_k$ is analytically inconvenient.

238: Instead of building a detailed stochastic model of the

239: simulation (analytically, from the

240: distribution of $x_k$), we choose to work with the simpler distribution of

241: the BER $\hat{x}_{kj}$, referred to henceforth as just $b_k$.

242: Thus, each sample from the distribution of $b_k$ is realized by

243: simulating a number of frames and obtaining an estimate

244: of $E[x_k]$.

245: The distribution of~$b_k$ is

246: approximately Gaussian due to the Central Limit Theorem.

247: Technically, we assume

248: that the number of frames per estimate $\hat{x}_{kj}$ is `large enough'

249: so that the Lindeberg condition is satisfied, that the variance

250: of~$\hat{x}_{kj}$ is finite, and that $\{\hat{x}_{kj}\}_{j=1}^{n_k}$

251: are i.i.d.  We say that $E[b_k]=E[E[x_k]]$ is the \emph{expected BEP}

252: of configuration~$c_k$ under Rayleigh fading.

253:

254: \subsection{The Second Level of Aggregation: Buckets}

255:

256: Let us now aggregate several points (i.e., random variables) into one

257: bucket.  The purpose of this aggregation is to reduce data

258: dimensionality to a size that is easy to visualize, usually one or two

259: dimensions.  The basic idea is to linearly average all points that map

260: to the same bucket but we must do so carefully, in order to preserve a

261: meaningful statistical interpretation.  Let $\{b_k\}_{k=1}^n$ be

262: Gaussian random variables with means $\{E[b_k]\}_{k=1}^n$ and variances

$\{\sigma^2_k\}_{k=1}^n$.  As in the previous subsection, let each such

264: variable~$b_k$ be the estimated BEP of some configuration~$c_k$,

265: $1\le{}k\le{}n$.  For bucket~$C$, define a \emph{bucket} (mixture)

266: \emph{random variable}~$B$ as the convex combination

267: $$B=\sum_{k=1}^np_kb_k,$$ where

268: the $p_k \geq 0$ and $\sum_{k=1}^{n} p_k = 1$.

269: %$\{p_k\}_{k=1}^n$ are (positive) constant weights of

270: %$\{b_k\}_{k=1}^n$.

271: It is convenient to make $\{p_k\}_{k=1}^n$ the

272: probabilities of occurrence of the configurations $\{c_k\}_{k=1}^n$ in

273: the dataset being analyzed.  This setup underlines the dependence of

274: the outputs on the distribution of the inputs and frees the user from

275: having to provide values for the constants $\{p_k\}_{k=1}^n$.  It is

276: well known that, as long as $\{b_k\}_{k=1}^n$ are \emph{mutually

277: independent} and Gaussian with means~$\{E[b_k]\}_{k=1}^n$ and

278: variances~$\{\sigma^2_k\}_{k=1}^n$, $B$ is Gaussian with mean

279: $E[B]=\sum_{k=1}^np_kE[b_k]$ and variance

280: $\Sigma^2=\sum_{k=1}^np_k^2\sigma_k^2$~\cite{stat-casella}.  The

281: expected value $E[B]$ of the random variable~$B$ can be viewed as the

282: expected BEP of bucket $C=\{c_1,c_2,\ldots,c_n\}$ in a Rayleigh fading

283: environment, conditional on the (discrete) distribution of the

284: configurations in~$C$.
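
As a small illustration of these formulas (the numbers are hypothetical
and are not taken from the simulation), consider a bucket of $n=2$
independent configurations with $p_1=0.25$, $p_2=0.75$,
$E[b_1]=4\times{}10^{-4}$, $E[b_2]=8\times{}10^{-4}$,
$\sigma_1=2\times{}10^{-4}$, and $\sigma_2=1\times{}10^{-4}$.  Then
$$E[B]=0.25\cdot{}4\times{}10^{-4}+0.75\cdot{}8\times{}10^{-4}=7\times{}10^{-4},$$
$$\Sigma^2=0.25^2\cdot{}(2\times{}10^{-4})^2+0.75^2\cdot{}(1\times{}10^{-4})^2=8.125\times{}10^{-9},$$
so $\Sigma\approx{}9.0\times{}10^{-5}$.  Because the weights enter the
variance squared, $\Sigma$ is smaller than the linearly weighted average
$p_1\sigma_1+p_2\sigma_2=1.25\times{}10^{-4}$.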

285:

The values $\{p_k\}_{k=1}^n$ are what statisticians call

287: \emph{prior probabilities}.  For most purposes of this \paper{}, we simply

288: estimate $\{p_k\}_{k=1}^n$ from available data.  These values are

289: explicitly or implicitly constructed during experiment design and we

290: assume that they remain constant during experiment analysis. However,

291: one can collect additional data as long as doing so does not change

292: $\{p_k\}_{k=1}^n$.  Prior probabilities can come from a number of

293: sources:  channel sounding measurements, propagation simulations,

294: hardware and budget constraints, or even educated guesses by wireless

295: system designers.  The rest of the \paper{} silently assumes that the

296: values $\{p_k\}_{k=1}^n$ have been established beforehand.  It is

297: important to remember that even though the prior probabilities are for

298: the most part transparent to the analysis presented here, they

299: nonetheless always exist and all conclusions of data analysis are made

300: conditional on the prior probabilities.

301:

302: This discussion of $\{p_k\}_{k=1}^n$ can be interpreted as a deferral

303: of the exact definition of~$B$ until experiment setup, or as

304: parameterization of the analysis procedure.  A natural question is

305: whether or not this level of parameterization is sufficient.  It is

306: sufficient for the purposes of this \paper{} but, strictly speaking,

307: the interrelations between $\{b_k\}_{k=1}^n$ should also be defined

308: during experiment setup.  Mutual independence of $\{b_k\}_{k=1}^n$ is a

309: simplifying assumption and it might be desirable to model interactions

310: between $\{b_k\}_{k=1}^n$ in practice.  This implies adding covariance

311: terms to~$\Sigma^2$ and re-thinking the distribution of~$B$.  Such

312: analysis is necessarily specific to a particular experiment.  For the

313: sake of simplicity, the rest of this \paper{} assumes mutual

314: independence of variables in a given bucket.
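
For reference, the covariance terms mentioned above take the following
standard form (a sketch only; the symbol $\mathrm{Cov}(b_j,b_k)$ is
introduced here for illustration and is not used elsewhere in this
\paper{}).  Without the independence assumption, the mean of~$B$ is
unchanged by linearity, but its variance becomes
$$\Sigma^2=\sum_{k=1}^np_k^2\sigma_k^2+2\sum_{1\le{}j<k\le{}n}p_jp_k\mathrm{Cov}(b_j,b_k),$$
which reduces to $\sum_{k=1}^np_k^2\sigma_k^2$ when all covariances
vanish.  In this more general setting, $B$ remains Gaussian provided
that $\{b_k\}_{k=1}^n$ are jointly Gaussian.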

315:

316: \subsection{Confidence Estimation}

317:

318: Point and bucket estimates of the expected BEP are meaningful

319: performance metrics for wireless systems.  Let us also estimate our

320: confidence in these estimates.  Confidence analysis enables wireless

321: system designers to make more practical claims than point estimates

322: alone.  A statement of the form `this configuration will exhibit

323: acceptable performance in 95\% of the cases' is often preferable to a

324: statement of the form `the expected BEP of this configuration is

325: approximately $5\times{}10^{-4}$'.  More precisely, we say that

326: configuration~$c_k$ \emph{exhibits acceptable performance} when the

327: expected BEP $E[b_k]$ of configuration $c_k$ is below some fixed

328: threshold~$T$.  This statement is conditional on the temporal

329: simulation assumptions, i.e., Rayleigh fading.  Standard values for~$T$

330: are $10^{-3}$ for voice quality systems and $10^{-6}$ for data quality

331: systems.  Likewise, we say that bucket~$C$ (a subspace of

332: configurations) \emph{exhibits acceptable average performance} when the

333: expected BEP $E[B]$ of bucket~$C$ is below some fixed threshold~$T$.

334: This statement is conditional on both the temporal simulation

335: assumptions and the distribution of configurations $\{c_k\}_{k=1}^n$ in

336: the bucket (the prior probabilities).

337:

The confidence that configuration~$c_k$ (resp.\ bucket~$C$) exhibits
acceptable (average) performance is $P(E[b_k]<T)$ (resp.\ $P(E[B]<T)$).
Since $b_k$ and~$B$ are Gaussian, these probabilities

341: can be estimated as

342: $$P(E[b_k]<T)\approx{}F_{n_k-1}\left(\frac{T-\hat{b}_k}{\hat{\sigma}_k/\sqrt{n_k}}\right),\quad

343: P(E[B]<T)\approx{}F_{N-1}\left(\frac{T-\hat{B}}{\hat{\Sigma}/\sqrt{N}}\right),$$

344: where $F_{N-1}(\cdot)$ is the CDF of the Student~$t$ distribution with

345: $N-1$ degrees of freedom and $n_k$ and~$N$ are the sample sizes for

346: configuration~$c_k$ and bucket~$C$, respectively.  For

347: configuration~$c_k$,

348: $$\hat{b}_k=\frac{1}{n_k}\sum_{j=1}^{n_k}\hat{x}_{kj},\quad

349: \hat{\sigma}^2_k=\frac{1}{(n_k-1)}\sum_{j=1}^{n_k}(\hat{x}_{kj}-\hat{b}_k)^2,$$

350: where $\hat{b}_k$ and $\hat{\sigma}_k^2$ are the estimates of the

351: expected BEP and the BEP variance at point~$c_k$, $n_k\ge{}2$ is sample

352: size, and $\{\hat{x}_{kj}\}_{j=1}^{n_k}$ are sample values.  For

353: bucket~$C$, we substitute point estimates into

354: $E[B]=\sum_{k=1}^np_kE[b_k]$ and $\Sigma^2=\sum_{k=1}^np_k^2\sigma_k^2$

355: to obtain $$\hat{B}=\sum_{k=1}^n\hat{p}_k\hat{b}_k, \quad

356: \hat{\Sigma}^2=\sum_{k=1}^n\hat{p}_k^2\hat{\sigma}_k^2,$$

357: %\quad N=\min_{1\le{}k\le{}n}n_k,$$

358: where $\hat{B}$ and $\hat{\Sigma}^2$ are

359: the estimates of the expected BEP and the BEP variance at bucket~$C$,

360: %$N$ serves the role of `bucket sample size',

361: and

362: $\{\hat{p}_k\}_{k=1}^n$ are the prior probabilities estimated from the

363: dataset as $\hat{p}_k=n_k/\sum_{i=1}^nn_i$.  Observe that

364: $$\hat{B}=\sum_{k=1}^n\hat{p}_k\hat{b}_k=\frac{1}{\sum_{k=1}^nn_k}\sum_{k=1}^n\sum_{j=1}^{n_k}\hat{x}_{kj}$$

365: is exactly the sample mean of all observations in the bucket, but

366: $\hat{\Sigma}^2$ is \emph{not} the variance

367: %and $N$ is \emph{not} the size

368: of this sample.  This is the case because

$\{\{\hat{x}_{kj}\}_{j=1}^{n_k}\}_{k=1}^n$ are not i.i.d.\ samples from

370: the mixture distribution of~$B$---they are samples from the constituent

371: distributions of $\{b_k\}_{k=1}^n$.
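
The point estimate can be illustrated with a short worked example (the
sample values are hypothetical, chosen only to make the arithmetic
easy).  Suppose $n_k=5$ simulation runs of configuration~$c_k$ yield
$\{\hat{x}_{kj}\}_{j=1}^{5}=\{6,7,8,9,10\}\times{}10^{-4}$.  Then
$$\hat{b}_k=8\times{}10^{-4},\quad
\hat{\sigma}_k^2=\frac{(4+1+0+1+4)\times{}10^{-8}}{4}=2.5\times{}10^{-8},\quad
\frac{\hat{\sigma}_k}{\sqrt{n_k}}\approx{}7.07\times{}10^{-5}.$$
For the threshold $T=10^{-3}$, the argument of the Student~$t$ CDF is
$$\frac{T-\hat{b}_k}{\hat{\sigma}_k/\sqrt{n_k}}\approx\frac{2\times{}10^{-4}}{7.07\times{}10^{-5}}\approx{}2.83,$$
and $F_4(2.83)\approx{}0.976$, i.e., under these assumptions we would be
roughly 98\% confident that this configuration exhibits acceptable
performance.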

372:

373: \section{Extended Example}

374: \label{sec:w-example}

375:

376: Let us now apply the techniques developed so far to analyze the

377: performance of a space of configurations.  The wireless systems under

378: consideration employ WCDMA technology with two-branch STTD and

379: $1/2$~rate convolutional coding.  We require that the transmitter has

380: two antennas (branches) separated by a distance large enough for their

381: signals to be uncorrelated, but small enough for the mean path losses

382: and impulse responses of their channels to be approximately equal at

383: receiver locations of interest.  We assume Rayleigh flat fading

384: channels, which is reasonable for indoor applications in the ISM and

385: UNII carrier frequency bands (2.4 and 5.2~GHz, respectively).  The goal

386: is to study the effect of power imbalance between the branches on the

387: BEP of the configurations across a wide range of average SNRs.

388:

389: This section presents a number of plots that summarize simulated BERs.

390: We also outline the process of statistically significant sampling of

391: the configuration space.  The next section develops a data mining

392: methodology that solves a practically important problem: given a

393: dataset similar to the one presented next, find a region of the

394: configuration space where we can confidently claim that configurations

395: will exhibit acceptable (average) performance.

396:

397: \begin{figure}

398: \begin{center}

399: \includegraphics[width=4.0in]{wcdma_sttd1} \\

400: \medskip

401: \includegraphics[width=4.0in]{sttd_stat}

402: \end{center}

403: \caption[BEP estimates for a space of configurations.]{(top) Estimates

404: of the BEPs for a space of configurations $\{c_k\}_{k=1}^M$ ($M=1600$

405: points at 10000 frames per point).  The $X$ and~$Y$ axes are the

406: average SNRs of the branches (in~dB).  The $Z$ axis is the (base ten)

407: logarithm of the simulated BER.  These estimates are not statistically

408: significant. (bottom) Statistically significant estimates

409: $\{\hat{b}_k\}_{k=1}^M$ of the expected BEPs $\{E[b_k]\}_{k=1}^M$ for

410: the same space of configurations $\{c_k\}_{k=1}^M$.  For the most part,

411: we are 90\% confident that the estimated expected BEP lies within 10\%

412: of its true value.  See text for exceptions.}

413: \label{fig:space}

414: \end{figure}

415:

416: \begin{figure}

417: \begin{center}

418: \includegraphics[width=4.0in]{sttd_stat_fixed_alpha} \\

419: \bigskip

420: \includegraphics[width=4.0in]{sttd_stat_fixed_power}

421: \end{center}

422: \caption[1D slices of the surface in Figure~\ref{fig:space}.]{1D slices

423: of the configuration space $\{c_k\}_{k=1}^M$ with fixed branch power

424: imbalance factor $\alpha=10^{-0.1|S_1-S_2|}$ and varying effective SNR

425: $S=10\log_{10}\left((10^{0.1S_1}+10^{0.1S_2})/2\right)$ (top), and

426: fixed effective SNR~$S$ and varying branch power imbalance

427: factor~$\alpha$ (bottom).  These slices were computed from the surface

428: fit onto the data in Figure~\ref{fig:space} (bottom).  The entire

429: fitted surface is shown in Figure~\ref{fig:fitted}.}

430: \label{fig:fixed-slices}

431: \end{figure}

432:

433: \begin{figure}

434: \begin{center}

435: \includegraphics[width=4.0in]{sttd_stat_local_fit}

436: \end{center}

437: \caption[A surface fitted onto the data in Figure~\ref{fig:space}.]{A

438: surface fitted onto the statistically significant results in

439: Figure~\ref{fig:space} (bottom).  We used a local linear least squares

440: regression with a 5\% neighborhood and tricubic weighting.  This

441: procedure was chosen because it can approximate the relatively steep

442: edge of the tolerance region.  See~\cite{dm-stat} for details.}

443: \label{fig:fitted}

444: \end{figure}

445:

446: Let us begin with an initial sample of the configuration space, as

447: shown in Figure~\ref{fig:space} (top).  This figure shows the simulated

448: BER as a 2D function $\hat{f}(S_1,S_2)$ of the average branch bit

449: energy-to-noise ratios (SNRs) $S_1$ and~$S_2$, in~dB.  The parallel

450: simulation ran for three days on 120 machines (AMD Athlon 1.0~GHz) at a

451: speed of approximately 2.5 points per machine per day.  10000 frames,

452: or 800000 information bits, were simulated for each of the 820 points

453: $S_2=3,4,\ldots,42$; $S_1=3,4,\ldots,S_2$.  Since $\hat{f}(S_1,S_2)$ is

454: symmetric~\cite{sttd-stutzman}, we show $M=1600$ points

455: $\{c_k\}_{k=1}^M$ for a full cross-product of $S_1$ and~$S_2$.
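
The point counts quoted above can be checked directly.  Each branch SNR
takes 40 integer values (3 to 42~dB), so the triangular sweep
$S_1\le{}S_2$ contains
$$\sum_{S_2=3}^{42}(S_2-2)=\frac{40\cdot{}41}{2}=820$$
points.  Mirroring across the diagonal yields
$2\cdot{}820-40=1600=40^2$ points, because the 40 diagonal points
($S_1=S_2$) are counted only once.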

456:

457: Wireless system designers are more accustomed to 1D slices of the

458: configuration space, e.g., the ones shown in

459: Figure~\ref{fig:fixed-slices}.  Define the \emph{branch power imbalance

460: factor} $$\alpha=10^{-0.1|S_1-S_2|},$$ where $S_1$ and~$S_2$ are the

461: average SNRs of the branches, in~dB.  (This definition applies as long

462: as the mean path losses of the branches are equal.)  By definition,

$0<\alpha\le{}1$, where $\alpha\to{}0$ corresponds to a total malfunction of
one of the branches and $\alpha=1$ corresponds to a perfect balance of branch

465: powers.  The graphs in Figure~\ref{fig:fixed-slices} were obtained by

466: fixing $\alpha$ and varying the \emph{effective SNR}

467: $$S=10\log_{10}\left((10^{0.1S_1}+10^{0.1S_2})/2\right),$$ in~dB (top),

468: and fixing the effective SNR and varying $\alpha$ (bottom).  (Note that

469: fixing the effective SNR is equivalent to fixing total transmitter

470: power.)  The sample of configurations came from the

471: dataset shown in Figure~\ref{fig:space} (bottom), described in detail

472: later. However, this sample does not contain

473: the exact points for typical slices, so we

474: used a fitted surface---Figure~\ref{fig:fitted}---to approximate the

475: BERs for the slices in~Figure~\ref{fig:fixed-slices}.  We choose to

476: work with the axes $S_1,S_2$ in Figure~\ref{fig:space} because it

477: simplifies the discussion later.
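
As an illustration of these two definitions (the values are
hypothetical), take $S_1=10$~dB and $S_2=20$~dB.  Then
$$\alpha=10^{-0.1\cdot{}10}=0.1,\quad
S=10\log_{10}\left((10+100)/2\right)=10\log_{10}55\approx{}17.4\mbox{~dB}.$$
Conversely, assuming $S_1\le{}S_2$ (which is no loss of generality by
symmetry), the two definitions can be inverted to give
$$S_2=S+10\log_{10}\frac{2}{1+\alpha},\quad
S_1=S_2+10\log_{10}\alpha,$$
which recovers $S_2\approx{}20$~dB and $S_1\approx{}10$~dB for the
values above.  This inverse map is a restatement derived from the
definitions, not an additional modeling assumption.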

478:

479: What can be gathered from Figure~\ref{fig:space} (top)?  The deep

480: valley along the diagonal is due to the fact that, provided that the

481: effective SNR is fixed, we expect the BEP to be smallest when the

482: branch power is balanced ($S_1=S_2$, $\alpha=1$)~\cite{sttd-stutzman}.

483: Somewhat less expected were (a)~the wide \emph{tolerance region} where

484: $|S_1-S_2|$ is large (up to 12~dB) but the BER is still small, (b)~a

485: very sharp decline in performance at the edge of the tolerance region,

486: and (c)~a region of high local variability in the upper part of the

487: diagonal.  The surface is truncated at

488: $$\min_{1\le{}k\le{}M}\{\hat{b}_k\}=3.75\times{}10^{-6}$$ because

489: smaller estimates of the (expected) BEP require an enormous computation

time due to the convergence properties of Monte Carlo estimation (more

491: on this below).
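
To see why small BEPs are expensive to estimate, consider a rough sketch
that assumes, for illustration only, that bit errors are independent
with probability~$q$ (the simulated fading channel does not guarantee
this; the symbols $q$ and $m$ are introduced only for this example).
The BER computed from $m$ bits then has mean~$q$ and standard deviation
$\sqrt{q(1-q)/m}$, so its relative standard error is
$$\sqrt{\frac{1-q}{mq}}\approx\frac{1}{\sqrt{mq}}$$
for small~$q$.  With $m=800000$ bits and $q=3.75\times{}10^{-6}$, only
$mq=3$ bit errors are expected and the relative standard error is about
$1/\sqrt{3}\approx{}0.58$.  Under this simplified model, keeping the
same relative error at a ten times smaller~$q$ requires ten times as
many simulated bits.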

492:

493: \subsection{Statistically Significant Sampling Methodology}

494:

495: The initial sample looks reasonable and uncovers interesting trends in

496: system performance, but it does not contain enough information to make

497: statistically significant claims.  Estimating the probability that a

498: configuration exhibits acceptable average performance requires several

499: samples per point~$c_k$.  The simulation is computationally expensive

500: and different regions of the configuration space exhibit different

501: variability.  Therefore, we must define tight stopping criteria for

502: sampling.  Figure~\ref{fig:space} (bottom) shows the output obtained

503: with the following (per point~$c_k$) stopping criteria.  The criteria

504: are designed to achieve high estimation accuracy.

505: \begin{enumerate}

\item Sampling $\{\hat{x}_{kj}\}$ stops when the relative error in
the estimate~$\hat{b}_k$ of the expected BEP $E[b_k]$ is smaller than
the \emph{relative accuracy threshold} $\beta=0.1$ at a $\gamma=0.9$
confidence level, i.e., when the absolute error is smaller than $\beta$ times the current
estimate~$\hat{b}_k$ with probability at least~$\gamma$:

510: $$P(|E[b_k]-\hat{b}_k|<\beta\hat{b}_k)\ge\gamma.$$ We required

511: $n_k\ge{}2$ samples to obtain an estimate~$\hat{\sigma}_k^2$ of the BEP

512: variance~$\sigma_k^2$.  Notice that the target is the relative error,

513: not the absolute error, because the range of $\{\hat{b}_k\}_{k=1}^M$ in

514: the configuration space spans four orders of magnitude.  Therefore,

515: absolute error measures are misleading.

516: \item Sampling $\{\hat{x}_{kj}\}$ also stops when we can say,

517: with confidence $\gamma=0.9$, that the expected BEP $E[b_k]$ is below

518: the \emph{sampling threshold}~$t=10^{-4}$, i.e., when

519: $$P(E[b_k]<t)\ge\gamma.$$ This work considers voice quality

520: applications, so the exact value of the expected BEP is irrelevant as

521: long as it is smaller than the performance threshold~$T=10^{-3}$.  The

522: sampling threshold~$t$ was set to an order of magnitude below the

523: performance threshold~$T$ to avoid large approximation error of a

524: fitted surface near~$T$.

525: \item Finally, sampling $\{\hat{x}_{kj}\}$ stops when more than

526: 50 samples of 10000 frames each are required to satisfy either of the

527: previous rules.  This rule fired in 5\% of the cases, all at the

boundary of the tolerance region and most near the middle of the diagonal.

529: \end{enumerate}
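
Under the same Student~$t$ approximation used for the confidence
estimates in Section~\ref{sec:stat}, the first criterion can be restated
in terms of sample statistics.  This is a sketch rather than a
description of the actual implementation; $\tau_{n_k-1,u}$, denoting the
$u$-quantile of the Student~$t$ distribution with $n_k-1$ degrees of
freedom, is notation introduced only here.  We have
$$P(|E[b_k]-\hat{b}_k|<\beta\hat{b}_k)\approx{}2F_{n_k-1}\left(\frac{\beta\hat{b}_k}{\hat{\sigma}_k/\sqrt{n_k}}\right)-1,$$
so the condition $P(|E[b_k]-\hat{b}_k|<\beta\hat{b}_k)\ge\gamma$ is
approximately equivalent to
$$\frac{\hat{\sigma}_k}{\hat{b}_k\sqrt{n_k}}\le\frac{\beta}{\tau_{n_k-1,(1+\gamma)/2}}.$$
For example, with $n_k=5$, $\beta=0.1$, and $\gamma=0.9$, we have
$\tau_{4,0.95}\approx{}2.13$, so sampling would stop once
$\hat{\sigma}_k/\hat{b}_k\le{}0.1\sqrt{5}/2.13\approx{}0.105$.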

530:

531: \noindent Altogether, 5154 samples were collected for an average of 6.3

532: samples per point.  Needless to say, the computational expense of such

533: sampling remains too high for practical applications. While a large

534: number of samples is typically desirable (for validation purposes),

535: we will show that our data mining

536: framework makes very effective use of data

537: and thus requires fewer samples in practice.

538: Let us now look at

539: the data in more detail.

540:

541: \subsection{Results of Statistically Significant Sampling}

542:

543: \begin{figure}

544: \begin{center}

545: \includegraphics[width=4.0in]{sttd_stat_point_cdf}

546: \end{center}

547: \caption[Empirical CDF for one of the configurations.]{Empirical CDF of

21 samples for a randomly chosen point vs.\ that of the Gaussian

549: distribution with appropriate mean and variance.}

550: \label{fig:ecdf}

551: \end{figure}

552:

553: It is also likely that the samples output by the WCDMA simulation are

554: approximately Gaussian distributed. Intuitively,

555: %We assumed that the BEPs $\{b_k\}_{k=1}^M$ are Gaussian.  Intuitively,

556: we are simulating a large number of information bits (800000) per BEP

557: estimate $\hat{x}_{kj}$, so the Lindeberg condition for the Central

558: Limit Theorem should hold.  Figure~\ref{fig:ecdf} shows empirical

559: evidence that this is the case.  We have arbitrarily chosen one point

560: among those with 20--30 sample values $\{\hat{x}_{kj}\}_{j=1}^{n_k}$

561: and plotted the empirical CDF of this sample against that of the

562: Gaussian distribution with the mean equal to sample mean~$\hat{b}_k$

563: and the variance equal to sample variance~$\hat{\sigma}_k^2$.  The

564: curves are close to each other and the Shapiro-Wilk test yields

565: $W=0.98$ ($0\le W\le 1$) and a $p$-value of~$0.88$.  Other points also

566: demonstrate similar curves and high values of~$W$, but $p$-values vary

567: significantly.  This dataset contains sufficient samples to estimate

568: $\{E[b_k]\}_{k=1}^M$ with high relative accuracy, but 6.3 samples per

569: point are insufficient to formally justify a Gaussian assumption.

570:

571: \begin{figure}

572: \begin{center}

573: \includegraphics[width=4.0in]{sttd_stat_sample_size} \\

574: \bigskip

575: \includegraphics[width=4.0in]{sttd_stat_sample_size_2D}

576: \end{center}

577: \caption[Sample sizes for Figure~\ref{fig:space}.]{Sample sizes for

578: Figure~\ref{fig:space} (bottom).  The top part shows the perspective

579: plot and the bottom part shows the scatter plot.}

580: \label{fig:sample-size}

581: \end{figure}

582:

583: \begin{figure}

584: \begin{center}

585: \includegraphics[width=4.0in]{sttd_stat_error} \\

586: \bigskip

587: \includegraphics[width=4.0in]{sttd_stat_error_2D}

588: \end{center}

589: \caption[Sample standard-deviation-to-mean ratios for

590: Figure~\ref{fig:space}.]{Sample standard deviation-to-mean ratios for

591: Figure~\ref{fig:space} (bottom).  The top part shows the perspective

592: plot and the bottom part shows the scatter plot.}

593: \label{fig:sample-error}

594: \end{figure}

595:

596: It is also instructive to see some measure of how the sample variance

597: is distributed across the configuration space.

598: Figures~\ref{fig:sample-size} and~\ref{fig:sample-error} show sample

599: sizes and sample standard deviation-to-mean ratios for the samples in

600: Figure~\ref{fig:space} (recall that we prefer relative measures because

601: the range of $\{\hat{b}_k\}_{k=1}^M$ is large).  Both figures indicate

602: high variance around the boundary of the tolerance region.  This is not

603: surprising because the edges of the tolerance region are relatively

604: steep.  Figure~\ref{fig:sample-error} also shows relatively high

605: variance at some points inside the tolerance region.  This is because

606: the simulation achieved the sampling threshold~$t=10^{-4}$ and stopped

607: before it achieved the relative accuracy threshold~$\beta=0.1$.

608: Knowing this, one would expect a larger relative variance in the

609: tolerance region.  Let us examine why this is not the case.

610:

611: We treat the BEP as a continuous Gaussian random variable~$b_k$, but

612: all sample values $\{\hat{x}_{kj}\}_{j=1}^{n_k}$ are discrete---they

613: are ratios of two integers, the number of errors and the number of bits

614: simulated.  The simulation may not detect any bit errors when the

615: expected BEP $E[b_k]$ is relatively small (e.g., one error in the

616: number of bits simulated).  Since no channel is perfect, zero is too

617: optimistic an estimate for the expected BEP.  Instead, we

618: conservatively assume that at least three bit errors have been

619: detected.  This is why the smallest estimate~$\hat{b}_k$ of~$E[b_k]$ is

620: $3/800000=3.75\times{}10^{-6}$.  However, using any constant cutoff

621: prevents us from estimating the variance $\sigma_k^2$.  We would need

622: to simulate a large number of frames to estimate $\sigma_k^2$ when the

623: expected BEP is small.  Instead, we can empirically show that the

624: probability that the expected BEP is smaller than the performance

625: threshold~$T=10^{-3}$ is close to one.  Let

626: $\hat{b}_k=3.75\times{}10^{-6}$ be the sample mean, $n_k=2$ be the

627: sample size, and $\sigma_k^2$ be the BEP variance at point~$c_k$ where

628: two independent simulations detected three or fewer bit errors each.

629: Sampling $\{\hat{x}_{kj}\}$ will stop because sample variance is zero,

630: so the first stopping rule applies.

631:

632: We need to show that sampling can indeed stop, i.e., that the

633: probability that the expected BEP is below the performance

634: threshold~$T$ is

635: $$P(E[b_k]<T)\approx{}F_{n_k-1}\left(\frac{T-\hat{b}_k}{\hat{\sigma}_k/\sqrt{n_k}}\right)\ge{}0.995.$$

636: This statement can only be false when

637: $(T-\hat{b}_k)\sqrt{n_k}/\hat{\sigma}_k\le{}64$, or

638: $\hat{\sigma}_k\ge{}2.2\times{}10^{-5}$, almost an order of magnitude bigger

639: than the conservative estimate $\hat{b}_k$ of the expected BEP

640: $E[b_k]$.  This is unlikely because Figure~\ref{fig:sample-error}

641: (bottom) shows that the sample standard deviation rarely exceeds the

642: sample mean even by half an order of magnitude.  In other words, we do

643: not have accurate estimates for variance~$\sigma_k^2$ in the tolerance

644: region.  However, we can still reasonably conclude that configurations

645: exhibit acceptable performance in this region.
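
The constant~64 in the bound above can be checked directly, assuming (as the
notation suggests) that $F_\nu$ is the CDF of Student's $t$ distribution with
$\nu$ degrees of freedom.  For $n_k=2$ we have $\nu=1$, and $F_1$ is the
standard Cauchy CDF,
$$F_1(x)=\frac{1}{2}+\frac{1}{\pi}\arctan{}x,$$
so $F_1(x)\ge{}0.995$ if and only if $x\ge\tan(0.495\pi)\approx{}63.7$.  The
confidence statement can therefore fail only when
$$\hat{\sigma}_k\ge\frac{(T-\hat{b}_k)\sqrt{n_k}}{63.7}
=\frac{(10^{-3}-3.75\times{}10^{-6})\sqrt{2}}{63.7}\approx{}2.2\times{}10^{-5},$$
which is roughly six times the conservative estimate
$\hat{b}_k=3.75\times{}10^{-6}$.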

646:

647: % The next section develops a data mining approach to data analysis.

648: % This approach seeks an optimal and statistically defensible region of

649: % the configuration space.  Mild assumptions of region connectivity and

650: % rectilinearity make the data mining approach resistant to highly

651: % variable and scarce data.  This is important in applications like the

652: % one just described, where data collection is computationally

653: % expensive.

654:

655:

656: \section{The Third Level of Aggregation: Regions}

657: \label{sec:gizmo}

658:

659: Consider a set of buckets $\{C_k\}_{k=1}^M$ with corresponding random

660: variables $\{B_k\}_{k=1}^M$.  Given a number of sample values, the

661: framework developed in Section~\ref{sec:stat} allows us to estimate the

662: probabilities $\{P(E[B_k]<T)\}_{k=1}^M$ that buckets $\{C_k\}_{k=1}^M$

663: exhibit acceptable average performance.  (All arguments about buckets

664: equally apply to points because a point is a special case of a bucket.)

665: This section is concerned with finding an optimal subset of random

666: variables from among $\{B_k\}_{k=1}^M$.  This optimal subset

667: corresponds to an optimal region of a 2D bucket space.  We would like

668: to find a sufficiently large admissible region~$R_m$ such that we are

669: sufficiently confident that buckets in~$R_m$ exhibit acceptable average

670: performance.

671:

672: There are many ways to define admissibility and we are interested in

673: adopting a definition that is both meaningful in the wireless domain

674: and permits effective data mining algorithms. Among a space of such

675: admissible regions, we can define different optimality criteria and

676: data mining then reduces to searching within this space. In this

677: \paper{}, a region~$R_m$ is admissible when it has a particular type of

678: shape.  We will explore three different criteria for the mining of

679: optimal regions; the algorithms and these criteria are based on the

680: work of Fukuda et al.~\cite{fukuda-rectilinear} and have been adapted

681: to the problem of mining simulation data in this \paper{}.

682:

683: \begin{table}

684: \begin{center}

685: \begin{tabular}{c l}

686: \hline

687: $X,Y$ & parameters that partition the point space into buckets \\

688: $M_X,M_Y$ & $X$ and $Y$ dimensions of the bucket space \\

689: $M=M_X\times{}M_Y$ & number of buckets in the bucket space \\

690: $D_X,D_Y$ & domains of $X$ and $Y$ \\

691: $\eta(m)$ & number of buckets in region $R_m$ \\

692: $C_{\kappa(m,i)}$ & $i$-th bucket in region $R_m$, $1\le{}i\le{}\eta(m)$ \\

693: $n_{\kappa(m,i)}$ & number of samples in bucket $C_{\kappa(m,i)}$\\

694: $x_{\kappa(m,i)},y_{\kappa(m,i)}$ & $X$ and $Y$ values for bucket $C_{\kappa(m,i)}$ \\

695: \hline

696: \end{tabular}

697: \end{center}

698: \caption[Summary of region notation.]{Summary of region notation.  Also

699: see Table~\ref{tab:notation}.}

700: \label{tab:notation2}

701: \end{table}

702:

703: Additional notation relating buckets to regions is introduced in

704: Table~\ref{tab:notation2}.  Let $X$ and~$Y$ be two discrete parameters

705: to the temporal (e.g., WCDMA) simulations such that $X$ and~$Y$

706: partition the point space into disjoint buckets $\{C_k\}_{k=1}^M$.

707: More precisely, let $X,Y$ have ordinal domains $D_X,D_Y$, let

708: $|D_X|=M_X,|D_Y|=M_Y,|D_X||D_Y|=M$, and assume that the map

709: $\rho:D_X\times{}D_Y\rightarrow{}\{C_k\}_{k=1}^M$ is bijective.  In

710: other words, $X$ and~$Y$ define a discrete 2D space of buckets.  Since

711: the domains of $X$ and~$Y$ are ordinal, this space is easily visualized

712: as a 2D color map or a 3D perspective plot.

713:

714: \begin{figure}

715: \begin{center}

716: \includegraphics[width=4.0in]{sttd_stat_probabilities}

717: \end{center}

718: \caption[Probabilities that configurations in Figure~\ref{fig:space}

719: exhibit acceptable performance.]{Probabilities

720: $\{P(E[b_k]<T)\}_{k=1}^M$ that configurations $\{c_k\}_{k=1}^M$

721: exhibit acceptable performance with respect to the performance

722: threshold~$T=10^{-3}$ (voice quality system).  This perspective plot

723: corresponds to the STTD dataset in Figure~\ref{fig:space} (bottom).

724: The axes $S_1$ and~$S_2$ are rotated 180 degrees counter-clockwise to

725: provide a better view of the surface.}

726: \label{fig:probabilities}

727: \end{figure}

728:

729: For example, the average SNRs $S_1$ and~$S_2$ in the previous section

730: partition the space of configurations into buckets.  Both $S_1$

731: and~$S_2$ vary from~3 to~42 in steps of~1 (in~dB), so $M_X=M_Y=40$ and

732: $M=40\times{}40=1600$ (recall, from Section~\ref{sec:w-example}, that

733: only 820 of these points were simulated and the remaining ones were

734: symmetrically reflected).  Furthermore, the domains of $S_1$ and~$S_2$

735: are ordinal because the values of~$S_1$ and~$S_2$ are directly related

736: to the powers of the transmitter antennas.  In this case, the buckets

737: are simply the points in the space of configurations.  In general,

738: buckets can be convex combinations of points, as detailed in

739: Section~\ref{sec:stat}.  Recall that we defined the color of a bucket

740: as the probability that the bucket exhibits acceptable average

741: performance.  Figure~\ref{fig:probabilities} shows these `colors' as a

742: perspective plot for the STTD example.

743:

744: \subsection{Region Shape}

745:

746: Consider regions (subsets) of buckets in the bucket space.  If the

747: shape of these regions is unconstrained, there are $2^M$ possible

748: regions $\{R_m\}_{m=1}^{2^M}$.  Let region $R_m$, $1\le{}m\le{}2^M$,

749: consist of buckets $\{C_{\kappa(m,i)}\}_{i=1}^{\eta(m)}$, where

750: $\eta(m)$, $1\le{}m\le{}2^M$, is a mapping from region number~$m$ to

751: the number of buckets in this region, and $\kappa(m,i)$,

752: $1\le{}m\le{}2^M$, $1\le{}i\le\eta(m)$, is a mapping from region

753: number~$m$ and bucket number~$i$ within region~$R_m$ to bucket

754: number~$k$, $1\le{}k\le{}M$, that we use to subscript buckets

755: $\{C_k\}_{k=1}^M$.  The exact definitions of $\eta(m)$ and

756: $\kappa(m,i)$ are not important as long as they generate all possible

757: regions (subsets) $\{R_m\}_{m=1}^{2^M}$.

758:

759: \begin{figure}

760: \begin{center}

761: \includegraphics{rr-types-short}

762: \end{center}

763: \caption[Types of admissible regions.]{Some types of admissible

764: (connected rectilinear) regions.  When we look at an admissible region

765: from left to right, its upper boundary must first increase and then

766: decrease monotonically, and its lower boundary must first decrease and

767: then increase monotonically.}

768: \label{fig:admissible}

769: \end{figure}

770:

771: The shape of admissible regions should be constrained because

772: unconstrained regions are hard to interpret and tend to overfit the

773: training data.  Besides, the problem of selecting an optimal

774: unconstrained region is computationally intractable---all $2^M$

775: possible regions must be considered, where $M=1600$ (so $2^M\approx{}4.4\times{}10^{481}$) in the STTD

776: example.  The region shape can be constrained in a number of different

777: ways (rectangular, x-monotone, etc.).  Our restrictions on region shape

778: are discussed next.

779:

780: Without loss of generality, assume that $D_X=\{1,2,\ldots,M_X\}$ and

781: $D_Y=\{1,2,\ldots,M_Y\}$.  Intuitively, region~$R_m$ is rectilinear

782: when its intersection with any horizontal or vertical line is

783: connected.  More formally, region~$R_m$ is \emph{rectilinear} if and

784: only if whenever buckets $C_{\kappa(m,i)}$ at

785: $(x_{\kappa(m,i)},y_{\kappa(m,i)})$ and $C_{\kappa(m,j)}$ at

786: $(x_{\kappa(m,j)},y_{\kappa(m,j)})$ are both in~$R_m$, then

787: (a)~$\rho(r,s) = C_{\kappa(m,i)}$ and $\rho(r,t) = C_{\kappa(m,j)}$

788: imply buckets $\rho(r,u)$ are also in~$R_m$ for all

789: $u \in [s,t]$, and

790: (b)~$\rho(r,t) = C_{\kappa(m,i)}$ and $\rho(s,t) = C_{\kappa(m,j)}$

791: imply buckets $\rho(u,t)$ are also in~$R_m$ for all

792: $u \in [r,s]$. Here $[a,b]$ denotes the set of all integers between the integers

793: $a$ and~$b$, inclusive.

794: %

795: %(a)~$x_{\kappa(m,i)}=x_{\kappa(m,j)}$ implies that each bucket

796: %$C_{\kappa(m,l)}$ where

797: %$$(y_{\kappa(m,l)}-y_{\kappa(m,i)})(y_{\kappa(m,l)}-y_{\kappa(m,j)})<0$$

798: %is also in~$R_m$, and (b)~likewise for

799: %$y_{\kappa(m,i)}=y_{\kappa(m,j)}$.

800: We use Manhattan geometry to define

801: connectedness.  Region~$R_m$ is \emph{connected} if and only if for

802: every pair of buckets $C_{\kappa(m,i)}$ and $C_{\kappa(m,j)}$ in~$R_m$

803: there exists a sequence of buckets

804: $$C_{\kappa(m,i)}=C_{\kappa(m,l_1)},C_{\kappa(m,l_2)},\ldots,C_{\kappa(m,l_n)}=C_{\kappa(m,j)}$$

805: in~$R_m$ such that for every $1\le{}k<n$

806: $$\|\rho^{-1}(C_{\kappa(m,l_k)})-\rho^{-1}(C_{\kappa(m,l_{k+1})})\|_1=1.$$

807: %$$|x_{\kappa(m,l_k)}-x_{\kappa(m,l_{k+1})}|+|y_{\kappa(m,l_k)}-y_{\kappa(m,l_{k+1})}|=1.$$

808: Furthermore, we say that region~$R_m$ is \emph{admissible} if it is

809: both rectilinear and connected.
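
As a small illustration of these definitions (the coordinates are
hypothetical, and bucket $\rho(x,y)$ is identified with the pair $(x,y)$),
consider three regions.  The region $\{(1,1),(1,3)\}$ is not rectilinear,
because condition~(a) with $r=1$, $s=1$, $t=3$ requires bucket $(1,2)$ to be
in the region.  The region $\{(1,1),(2,2)\}$ is rectilinear (no two distinct
buckets share a row or a column, so both conditions hold trivially), but it is
not connected, because
$$\|(1,1)-(2,2)\|_1=2$$
and the region contains no intermediate bucket.  The L-shaped region
$\{(1,1),(2,1),(2,2)\}$ is both rectilinear and connected, and hence
admissible.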

810:

811: This definition of admissibility can be viewed as a relaxed definition

812: of convexity.  Geometrically, it is easy to see that region~$R_m$ is

813: admissible if and only if, when we look at~$R_m$ from left to right,

814: its upper boundary first increases and then decreases monotonically (a

815: pseudoconcave function), and its lower boundary first decreases and

816: then increases monotonically (a pseudoconvex function).  In other

817: words, the region boundary need not be strictly convex or strictly

818: concave, but it must be pseudoconvex or pseudoconcave.  Admissible

819: regions are informally summarized in Figure~\ref{fig:admissible}.  All

820: admissible regions are composed of regions of four primitive types: W

821: (region gets wider from left to right), N (region gets narrower), U

822: (region slants up), and D (region slants down).  Twelve combinations of

823: these types yield all types of admissible regions: W, WU, WUN, WD, WDN,

824: WN, UN, DN, U, D, N, and the empty region.

825:

826: Our choice of connected rectilinear regions is due primarily to

827: heuristic considerations.  These considerations are commonly

828: applicable, but must be re-evaluated for each study.  Both the

829: connectedness and the rectilinearity restrictions can be justified for

830: the STTD example (see next section).  In general, it is easy to justify

831: connectedness, but hard to justify rectilinearity.  We advocate the use

832: of connected rectilinear regions primarily because this shape is

833: resistant to noise in the sample, not because we can analytically show

834: that the region boundary is rectilinear.  In data mining, the choice of

835: region shape is most commonly dictated by the desired tradeoff between

836: bias and variance~\cite{dm-stat}.  Regions with flexible shape exhibit

837: small bias (they can fit any data) but high variance (they can be

838: overly sensitive to a particular dataset).  Regions with rigid shape

839: exhibit high bias but small variance.  Connected rectilinear regions

840: provide a reasonable tradeoff between bias and variance for many

841: applications.

842:

843: \subsection{Evaluating Regions}

844:

845: Another prerequisite to finding regions with the desired properties is

846: a definition of region `goodness'.  Let us map bucket confidence

847: $P(E[B_{\kappa(m,i)}]<T)$ to a discrete range $[0\ldots{}1000]$ and

848: define the \emph{hit of bucket} $C_{\kappa(m,i)}$ as

849: $$h_{\kappa(m,i)}=\lfloor{}1000P(E[B_{\kappa(m,i)}]<T)+0.5\rfloor$$

850: ($\lfloor{}X\rfloor$ denotes the largest integer that does not exceed

851: $X$), the \emph{support of bucket} $C_{\kappa(m,i)}$ as

852: $$s_{\kappa(m,i)}=1000$$ (this constant was chosen to make the

853: discretization error reasonably small), the \emph{hit of region} $R_m$

854: as $$H_m=\sum_{i=1}^{\eta(m)}h_{\kappa(m,i)},$$ and the \emph{support

855: of region} $R_m$ as

856: $$S_m=\sum_{i=1}^{\eta(m)}s_{\kappa(m,i)}=1000\eta(m).$$  The key to

857: efficient computation of optimized-confidence and optimized-support

858: admissible regions is the definition of region confidence as

859: $$\Theta_m=H_m/S_m,$$ where $H_m$ is the hit and~$S_m$ is the support

860: of region~$R_m$.  Let us explore the implications of these definitions

861: in more detail.
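
As a hypothetical illustration (the three confidences below are invented for
the example), consider a region~$R_m$ with $\eta(m)=3$ buckets whose
confidences $P(E[B_{\kappa(m,i)}]<T)$ are $0.9954$, $0.887$, and~$0.4$.  The
bucket hits are $\lfloor{}995.9\rfloor=995$, $\lfloor{}887.5\rfloor=887$, and
$\lfloor{}400.5\rfloor=400$, so
$$H_m=995+887+400=2282,\qquad S_m=3\times{}1000=3000,\qquad
\Theta_m=\frac{2282}{3000}\approx{}0.761.$$
This agrees with the plain average $(0.9954+0.887+0.4)/3\approx{}0.761$ of
the bucket confidences up to the discretization error.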

862:

863: \subsubsection{Model-Based and Model-Free Analyses}

864:

865: Suppose that $n_{\kappa(m,i)}=6$ samples have been collected for bucket

866: $C_{\kappa(m,i)}$ that consists of a single point.  Let the sample mean

867: be $\hat{B}_{\kappa(m,i)}=5\times{}10^{-4}$ and the sample standard

868: deviation be $\hat{\Sigma}_{\kappa(m,i)}=8.87\times{}10^{-4}$.

869: Furthermore, suppose that five of these samples have a BEP below

870: $10^{-3}$ and one has a BEP above $10^{-3}$.  Then,

871: $$P(E[B_{\kappa(m,i)}]<T)\approx{}F_5\left(\frac{10^{-3}-5\times{}10^{-4}}{8.87\times{}10^{-4}/\sqrt{6}}\right)\approx{}0.887.$$

872: A purely model-free approach would interpret the above simulation

873: results as `bucket $C_{\kappa(m,i)}$ will exhibit acceptable average

874: performance in~5 out of~6 cases.' A strongly model-based approach would

875: interpret the simulation results as `we are 88.7\% confident that

876: bucket $C_{\kappa(m,i)}$ exhibits acceptable average performance.' Our

877: interpretation lies between the model-based approach and a model-free

878: approach and posits that `bucket $C_{\kappa(m,i)}$ will exhibit

879: acceptable average performance in~887 out of~1000 cases.' These

880: interpretations provide confidence estimates under different

881: simplifying assumptions.
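
For reference, the arithmetic behind this example can be spelled out using
only the numbers given above.  The argument of~$F_5$ is
$$\frac{10^{-3}-5\times{}10^{-4}}{8.87\times{}10^{-4}/\sqrt{6}}
\approx\frac{5\times{}10^{-4}}{3.62\times{}10^{-4}}\approx{}1.38,$$
and the hit of the bucket, as defined earlier in this section, is
$$h_{\kappa(m,i)}=\lfloor{}1000\times{}0.887+0.5\rfloor=887,$$
with support $s_{\kappa(m,i)}=1000$.  This is where the `887 out of~1000
cases' reading comes from.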

882:

883: The model-free interpretation does not take either sample variance or

884: sample distribution into account.  This interpretation is only reliable

885: for a sufficiently large number of samples, which is a luxury in our

886: application.  Our middle-ground interpretation explicitly accounts for

887: sample variance and sample distribution.  When sample size is small,

888: our interpretation provides a statistically valid estimate of

889: confidence that the bucket exhibits acceptable average performance.

890: For a single bucket, this interpretation is as good as a strongly

891: model-based interpretation, modulo a reasonably small discretization

892: error.  However, our interpretation diverges from the model-based

893: interpretation at the region level.

894:

895: A strongly model-based analysis procedure would define a region random

896: variable

897: $$Q_m=\frac{1}{W_m}\sum_{i=1}^{\eta(m)}w_{\kappa(m,i)}B_{\kappa(m,i)},$$

898: where $\{B_{\kappa(m,i)}\}_{i=1}^{\eta(m)}$ are bucket random

899: variables, $\{w_{\kappa(m,i)}\}_{i=1}^{\eta(m)}$ are \emph{a priori}

900: (positive) constant weights, and

901: $$W_m=\sum_{i=1}^{\eta(m)}w_{\kappa(m,i)}$$ is a normalization factor

902: that maps these weights to probabilities of bucket occurrence in the

903: region.  A procedure similar to that in Section~\ref{sec:stat} would

904: then be used to estimate $P(E[Q_m]<T)$ for a threshold~$T$.  The

905: result of this calculation can be interpreted as the probability that

906: region~$R_m$ exhibits acceptable average performance, conditional on

907: the temporal simulation assumptions, the bucketing prior probabilities,

908: and the region prior probabilities.  However, as we shall see later,

909: this definition of region confidence violates a property that permits

910: an efficient data mining algorithm.

911:

912: We think of region confidence in terms of average bucket confidence

913: over the whole region, namely,

914: $$\Theta_m\approx{}\frac{1}{\eta(m)}\sum_{i=1}^{\eta(m)}P(E[B_{\kappa(m,i)}]<T).$$

915: (If region size $\eta(m)$ is large enough, we can reasonably expect the

916: discretization errors to cancel each other.) This interpretation

917: of~$\Theta_m$ does not correspond to the strongly model-based

918: probability that region~$R_m$ exhibits acceptable average performance.

919: Instead, we define a region random variable~$P_m$ as the probability

920: that \emph{any} bucket $C_{\kappa(m,i)}$ in region~$R_m$ exhibits

921: acceptable average performance.  Then, we estimate the expected value

922: $E[P_m]$ across the region~$R_m$ by the sample mean

923: $\hat{P}_m\approx\Theta_m$ of estimates of bucket confidences

924: $\{P(E[B_{\kappa(m,i)}]<T)\}_{i=1}^{\eta(m)}$.

925:

926: How do these two definitions relate to each other?  It is easy to show

927: that they are equivalent only under very restrictive assumptions.

928: Basically, we are assuming that the buckets are mutually independent,

929: that population variance is small, and that the region is consistent,

930: i.e., `good' and `bad' buckets are never mixed in the same region.  Let

931: bucket random variables $\{B_{\kappa(m,i)}\}_{i=1}^{\eta(m)}$ be

932: mutually independent, let the estimates

933: $\{\hat{\Sigma}_{\kappa(m,i)}^2\}_{i=1}^{\eta(m)}$ of bucket variances

934: be approximately equal to zero, and let the estimates

935: $\{\hat{B}_{\kappa(m,i)}\}_{i=1}^{\eta(m)}$ of bucket expected BEPs be

936: either all greater than the performance threshold~$T$ or all smaller

937: than the performance threshold~$T$ (i.e., all

938: $\{T-\hat{B}_{\kappa(m,i)}\}_{i=1}^{\eta(m)}$ have the same sign).

939: Then, bucket confidences

940: $$P(E[B_{\kappa(m,i)}]<T)\approx{}F_{n_{\kappa(m,i)}-1}\left(\frac{T-\hat{B}_{\kappa(m,i)}}{\hat{\Sigma}_{\kappa(m,i)}/\sqrt{n_{\kappa(m,i)}}}\right),$$

941: $1\le{}i\le\eta(m)$, will be either all approximately equal to zero

942: ($\hat{B}_{\kappa(m,i)}>T$), or all approximately equal to one

943: ($\hat{B}_{\kappa(m,i)}<T$).  Therefore, region confidence~$\Theta_m$

944: will be approximately equal to zero or one.  Likewise, the strongly

945: model-based region confidence

946: $$P(E[Q_m]<T)\approx{}F_{\eta(m)-1}\left(\frac{T-\hat{Q}_m}{\hat{\Psi}_m/\sqrt{\eta(m)}}\right)$$

947: will be approximately equal to zero or one because the estimate

948: $\hat{\Psi}^2_m$ of region variance is (see Section~\ref{sec:stat})

949: $$\hat{\Psi}_m^2=\frac{1}{W_m^2}\sum_{i=1}^{\eta(m)}w_{\kappa(m,i)}^2\hat{\Sigma}_{\kappa(m,i)}^2\approx{}0.$$

950: The sign of $T-\hat{Q}_m$ determines whether $P(E[Q_m]<T)$ is

951: approximately equal to zero or one.  After a minor rearrangement of

952: terms,

953: $$T-\hat{Q}_m=\frac{1}{W_m}\sum_{i=1}^{\eta(m)}w_{\kappa(m,i)}(T-\hat{B}_{\kappa(m,i)}).$$

954: We assumed that $\{T-\hat{B}_{\kappa(m,i)}\}_{i=1}^{\eta(m)}$ have the

955: same sign, so we have shown that $P(E[Q_m]<T)\approx\Theta_m$.  The

956: equality is asymptotically exact as all variance estimates

957: $\{\hat{\Sigma}_{\kappa(m,i)}^2\}_{i=1}^{\eta(m)}$ approach zero.  This

958: argument applies regardless of the distributions of

959: $\{B_{\kappa(m,i)}\}_{i=1}^{\eta(m)}$, as long as these random

960: variables are mutually independent.

961:

962: \subsection{Optimized Regions}

963:

964: We now pursue the definition of optimized regions.  Given a

965: \emph{slope}~$\tau$, $0\le\tau\le{}1$, define the \emph{gain} of region~$R_m$,

966: $1\le{}m\le{}2^M$, as $$G(R_m,\tau)=H_m-\tau{}S_m,$$ where $H_m$ is the

967: region hit and~$S_m$ is the region support.  Let an

968: \emph{optimized-gain admissible region}~$R_\tau$ with respect to

969: slope~$\tau$, $0\le{}\tau\le{}1$, be an admissible region with the

970: maximum gain $G(R_\tau,\tau)$ over all admissible regions (this region

971: need not be unique).  Optimized-gain admissible regions are easy to

972: define, compute, and analyze, but hard to interpret.  Common practice

973: is to define optimized-confidence and optimized-support admissible

974: regions.  Admissible region~$R_*$ is an \emph{optimized-confidence

975: admissible region} with respect to a given support

976: threshold~$1000\eta$, $0\le{}\eta\le{}M$, if $R_*$ has the maximum

977: confidence $\Theta_*=H_*/S_*$ among all admissible regions with support

978: of at least $1000\eta$.  Likewise, admissible region~$R_\diamond$ is an

979: \emph{optimized-support admissible region} with respect to a given

980: confidence threshold~$\theta$, $0\le{}\theta\le{}1$, if $R_\diamond$

981: has the maximum support $S_\diamond=1000\eta(\diamond)$ among all

982: admissible regions with confidence of at least~$\theta$.  In other

983: words, we can either fix the region confidence~$\theta$ and find the

984: largest region~$R_\diamond$ with confidence of at least~$\theta$, or we

985: can fix the minimum region size (support) $1000\eta$ and find the most

986: confident region~$R_*$ with support of at least $1000\eta$.

987:

988: Observe that~$\tau$ in the definition of an optimized-gain admissible

989: region is the relative importance of support vs. that of confidence.

990: We can find a small region with high confidence or a large region with

991: small confidence, but both objectives cannot be maximized

992: simultaneously.  Increasing~$\tau$ will increase the confidence of the

993: optimized-gain admissible region, but decrease its support.  Likewise,

994: decreasing~$\tau$ will decrease the confidence of the optimized-gain

995: admissible region, but increase its support.  Therefore, we can find

996: approximate optimized-confidence and optimized-support admissible

997: regions by a binary search for the value of~$\tau$ where the respective

998: threshold is barely satisfied.  The search can stop at a given level of

999: precision $\Delta\tau$, where the lower bound on $\Delta\tau$ can be

1000: found in~\cite{fukuda-rectilinear} (they show that the number of steps

1001: in this search is logarithmic in the support $1000M$ of the bucket

1002: space).  This algorithm is approximate because an optimized-confidence

1003: (resp. optimized-support) admissible region need not be an

1004: optimized-gain admissible region for any value of~$\tau$.  Yoda et

1005: al.~\cite{fukuda-rectilinear} argue that this approximation is

1006: reasonable for large datasets.
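
One way to read the role of~$\tau$ is the following restatement of the
definitions above (the numbers are hypothetical).  Since $S_m=1000\eta(m)$,
the gain can be written bucket by bucket as
$$G(R_m,\tau)=H_m-\tau{}S_m=\sum_{i=1}^{\eta(m)}\left(h_{\kappa(m,i)}-1000\tau\right),$$
so a bucket adds to the gain exactly when its hit exceeds $1000\tau$, i.e.,
roughly when its confidence exceeds~$\tau$.  For instance, with $\tau=0.9$,
buckets with hits 995, 887, and~400 contribute $+95$, $-13$, and $-500$,
respectively, for a total gain of $2282-0.9\times{}3000=-418$.  Raising~$\tau$
thus makes fewer buckets worth including, which is consistent with the
tradeoff between confidence and support described above.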

1007:

1008: Let us revisit the definition of region `goodness'.  Geometrically, the

1009: buckets with the same value of~$X$ are the columns and the buckets with

1010: the same value of~$Y$ are the rows.  An optimized-gain admissible

1011: region can be computed in $O(M_XM_Y^2)$ time by a set of rules of the

1012: following form.  Recall that a region of type W gets wider from left to

1013: right (see Figure~\ref{fig:admissible}).  Let $R_W(m,[s,t])$ be the

1014: region of type W with maximum gain $f_W(m,[s,t])$ over all admissible

1015: regions of type W that end in column~$m$ and span rows~$s$ through~$t$

1016: in this column.  Then, either (a)~$m$ is the first column of

1017: $R_W(m,[s,t])$, or (b)~$R_W(m,[s,t])$ includes the region

1018: $R_W(m-1,[s',t'])$ with the maximum gain $f_W(m-1,[s',t'])$ over all

1019: admissible regions that end in column~$m-1$ and span rows $s'\ge{}s$

1020: through $t'\le{}t$ in this column.  The algorithm in~\cite{fukuda-rectilinear} keeps the

1021: regions with maximum gain for every region type and every triple

1022: $(m,[s,t])$ in a dynamic programming table.  These locally maximal

1023: regions then grow according to a set of rules that compute an

1024: optimized-gain admissible region.  This efficient greedy algorithm for

1025: computing optimized-gain admissible regions depends on the property of

1026: the gain function that we refer to as monotonicity.  Let

1027: $0\le\tau\le{}1$ be a slope and~$R_{m'}$ and~$R_{m''}$ be two

1028: admissible regions with gains $G(R_{m'},\tau)\ge{}G(R_{m''},\tau)$.

1029: The gain function $G(R_m,\tau)$ is \emph{monotonic} if for any region

1030: $R_k$ disjoint with both~$R_{m'}$ and~$R_{m''}$

1031: $$G(R_{m'}\cup{}R_k,\tau)\ge{}G(R_{m''}\cup{}R_k,\tau),$$ where the

1032: union of regions is defined in the obvious way.  It is easy to see that

1033: our gain function

1034: $$G(R_m,\tau)=H_m-\tau{}S_m=\sum_{i=1}^{\eta(m)}\lfloor{}1000P(E[B_{\kappa(m,i)}]<T)+0.5\rfloor-1000\tau\eta(m)$$

1035: is monotonic because it is additive.  However, a strongly model-based

1036: gain function $$G^{(M)}(R_m,\tau)=P(E[Q_m]<T)-\tau\eta(m)/M$$ is not

1037: monotonic even if we assume independence of bucket random variables

1038: $\{B_{\kappa(m,i)}\}_{i=1}^{\eta(m)}$ that make up~$Q_m$.  To the best

1039: of our knowledge, only monotonic gain functions are known to result in

1040: practical algorithms for computing optimized-gain admissible regions.
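
The additivity argument can be spelled out as follows.  Because $H_m$
and~$S_m$ are sums over the buckets of~$R_m$, the gain of a union of disjoint
regions is the sum of their gains.  Hence, for any region~$R_k$ disjoint from
both $R_{m'}$ and~$R_{m''}$,
$$G(R_{m'}\cup{}R_k,\tau)=G(R_{m'},\tau)+G(R_k,\tau)
\ge{}G(R_{m''},\tau)+G(R_k,\tau)=G(R_{m''}\cup{}R_k,\tau).$$
Additivity is also what lets cases (a) and~(b) for regions of type~W be
written as a recurrence.  As a sketch only, let $g_m(s,t)$ denote the gain of
the buckets in rows~$s$ through~$t$ of column~$m$ (this symbol is introduced
here for illustration and is not taken from~\cite{fukuda-rectilinear}).
Then, for $m>1$,
$$f_W(m,[s,t])=g_m(s,t)+\max\left\{0,\;\max_{s\le{}s'\le{}t'\le{}t}f_W(m-1,[s',t'])\right\},$$
where the term~$0$ corresponds to case~(a) and the inner maximum to case~(b).
This direct form only mirrors the two cases; it says nothing about how the
rules in~\cite{fukuda-rectilinear} are organized to reach the stated running
time.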

1041:

1042: What happens when no estimates of mean and/or variance are available

1043: for some bucket~$C_k$?  The answer to this question depends on

1044: problem-specific considerations.  As was demonstrated in

1045: Section~\ref{sec:w-example}, it is sometimes possible to provide

1046: conservative estimates for these values.  For example, we have

1047: empirically shown that the expected BEPs of some configurations

1048: $\{c_k\}$ are smaller than $T=10^{-3}$ with confidence

1049: $P(E[b_k]<T)\ge{}0.995$.  Likewise, we know that as the effective SNR

1050: approaches negative infinity (in~dB), the BEP approaches 0.5, which is

1051: the probability of incorrectly guessing the value of a random bit when

1052: the transmitter is turned off.  Thus, we can let $P(E[b_k]<T)=0$ for

1053: points with sufficiently small effective SNRs and a reasonable

1054: performance threshold~$T$.  If no such estimates are available, we can

1055: simply omit the missing buckets from the probability computation.  This

1056: must be done with care because such buckets will contribute nothing to

1057: the confidence of the region.  This fact can be used to reduce the

1058: computational expense of sampling.

1059:

1060: This section has highlighted the sometimes contradictory objectives

1061: that aggregation must satisfy: permit valid statistical interpretations

1062: and afford structure that can be exploited by data mining algorithms.

1063: Our approach has been a judicious mix of concepts from both statistics

1064: and data mining.  We showed that our formulation of the data mining

1065: problem lies between the completely model-free approach and the

1066: strongly model-based approach.

1067: %In particular, mutual independence of

1068: %bucket random variables is crucial to efficient computation of

1069: %optimized admissible regions.  If the covariance terms are

1070: %non-negligible, our approach is no longer asymptotically equivalent to

1071: %the strongly model-based approach.  Likewise, interactions between the

1072: %buckets break the monotonicity of the gain function and make one

1073: %question the greedy region expansion strategy.

1074: The next section

1075: applies the data mining methodology described here to the example in

1076: Section~\ref{sec:w-example}.

1077:

1078:

1079: \section{Optimized-Support Regions for the STTD Example}

1080: \label{sec:experiments}

1081:

1082: This section continues the example in Section~\ref{sec:w-example}.

1083: First, we show that optimized-gain regions are both rectilinear and

1084: connected for this example.  It immediately follows that

1085: optimized-support and optimized-confidence regions are also

1086: admissible.  An optimized-support admissible region is presented next.

1087: We show that the elaborate region mining setup leads to simple

1088: engineering interpretations.  Finally, we look at the performance of

1089: data mining when the number of samples is small.  Three-fold

1090: cross-validation shows that data mining performs well under these

1091: circumstances.

1092:

1093: \subsection{Justification of Data Mining for the STTD Example}

1094:

1095: Let the average SNRs $S_1=X$ and~$S_2=Y$ partition the space of

1096: configurations in Figure~\ref{fig:space} into disjoint points (buckets)

1097: $\{c_k\}_{k=1}^M$, $1\le{}M\le{}1600$.

1098: We now give an intuitive

1099: argument to justify the suitability of the data mining algorithm for the

1100: STTD study.

1101: Without loss of generality,

1102: consider only the points with $X\le{}Y$, i.e., $S_1\le{}S_2$.  It is

1103: easy to extend all arguments to $X>Y$, but this adds little to the

1104: discussion.

1105:

1106: Let $c_1$ at $(x_1,y_1)$ and $c_2$ at $(x_2,y_1)$, $x_1<x_2<y_1$, be

1107: two points in an optimized-gain region (of arbitrary shape) for some

1108: slope $0<\tau<1$ (see Figure~\ref{fig:admissible-example}).  This means

1109: that the confidences of these points are one, and thus the expected

1110: BEPs of these points are smaller than the performance threshold~$T$.

1111: When $x_1,x_2<y_1$ and $y_1$ is fixed, the BEP is a monotonically

1112: decreasing function of $x$---increasing~$x$ decreases the power

1113: imbalance and increases the effective SNR, so the BEP must decrease.

1114: Therefore, the expected BEP of any point~$c_u$ at $(x_u,y_1)$,

1115: $x_1<x_u<x_2$, is below the performance threshold~$T$.  Thus, the

1116: confidences of points~$\{c_u\}$ are one and these points must also be in the

1117: optimized-gain region~$R_\tau$.  Three more symmetric arguments of this

1118: kind show that optimized-gain regions are rectilinear.

1119:
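The horizontal case can be restated compactly.  Writing $\beta(x,y)$ for the
expected BEP of the configuration at $(x,y)$ (a shorthand introduced here only
for illustration), monotonicity in~$x$ for fixed~$y_1$ and $x<y_1$ gives
\[
\beta(x_2,y_1)\le\beta(x_u,y_1)\le\beta(x_1,y_1)<T,
\qquad x_1<x_u<x_2<y_1 .
\]
The last inequality holds because $c_1$ lies in the optimized-gain region, and
the middle one is the monotonicity assumption.  Hence every intermediate
point~$c_u$ satisfies $\beta(x_u,y_1)<T$ and is included in the region.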

1120: Likewise, let $c_1$ at $(x_1,y_1)$ and $c_2$ at $(x_2,y_2)$,

1121: $x_1<x_2<y_1<y_2$, be two points in an optimized-gain rectilinear

1122: region (refer to Figure~\ref{fig:admissible-example}).  Since $c_1$ is

1123: in the optimized-gain region and $x_1<x_2$, the point at $(x_2,y_1)$ is

1124: also in this region because it has a smaller BEP than~$c_1$.  Since the

1125: optimized-gain region is rectilinear, there is a horizontal path from

1126: $(x_1,y_1)$ to $(x_2,y_1)$ and a vertical path from $(x_2,y_1)$ to

1127: $(x_2,y_2)$.  Thus, there is a Manhattan path from $(x_1,y_1)$ to

1128: $(x_2,y_2)$.  Arguments of this kind show that optimized-gain

1129: rectilinear regions must be connected as long as they are `wide

1130: enough'.

1131:
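For concreteness, assume that buckets are indexed by unit grid steps (an
assumption made here for illustration).  The Manhattan path in the argument
above is then
\[
(x_1,y_1),\,(x_1+1,y_1),\,\ldots,\,(x_2,y_1),\,(x_2,y_1+1),\,\ldots,\,(x_2,y_2),
\]
which takes $(x_2-x_1)+(y_2-y_1)$ steps.  The horizontal segment lies in the
region because both of its endpoints do and the region is rectilinear.  The
same reasoning applies to the vertical segment, whose endpoints $(x_2,y_1)$
and $(x_2,y_2)$ are both in the region.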

1132: To summarize, we have shown that

1133: optimized-gain (and thus optimized-support and optimized-confidence)

1134: regions are admissible. The data mining

1135: algorithm described in Section~\ref{sec:gizmo}, which results in

1136: optimal admissible regions, is thus appropriate for the

1137: STTD example.  We now show and interpret data mining results.

1138:

1139: \begin{figure}

1140: \begin{center}

1141: \includegraphics[width=4.0in]{admissible-example}

1142: \end{center}

1143: \caption[Why optimal regions are admissible.]{Points for arguments

1144: about region shape (see text).}

1145: \label{fig:admissible-example}

1146: \end{figure}

1147:

1148: \subsection{Optimized-Support Admissible Regions}

1149:

1150: \begin{figure}

1151: \begin{center}

1152: \includegraphics[width=4.0in]{p_region} \\

1153: \end{center}

1154: \caption[Optimized-support admissible regions for

1155: Figure~\ref{fig:space}.]{Optimized-support admissible region for data

1156: in Figure~\ref{fig:space} (bottom) with the confidence

1157: threshold~$\theta=0.99$ and the performance threshold~$T=10^{-3}$.}

1158: \label{fig:region}

1159: \end{figure}

1160:

1161: Figure~\ref{fig:region} shows an optimized-support admissible region

1162: for the confidence threshold~$\theta=0.99$.  Intuitively, this is the

1163: largest admissible region where we can claim, with confidence of at

1164: least~$0.99$, that configurations exhibit acceptable performance.  This

1165: claim is conditional on temporal simulation assumptions and on mutual

1166: independence of configurations in the region.  The shape of this region

1167: confirms that, under a fixed effective SNR, the BEP is minimal when the

1168: average SNRs of the two branches are equal.  The width of this region

1169: shows the largest acceptable power imbalance.  For this example, the

1170: system tolerates power imbalance of up to 12~dB.  However, the width of

1171: the optimized region is not uniform.  The region is narrower for small

1172: effective SNRs and wider for large effective SNRs.  This means that

1173: configurations with low effective SNRs are more sensitive to power

1174: imbalance than configurations with high effective SNRs.  None of these

1175: observations will surprise an informed reader.  The contribution of data

1176: mining in this context lies not in qualitative discoveries but in

1177: statistically significant quantitative results.

1178:
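To put the 12~dB figure in linear terms, recall that a difference of
$\Delta$~dB between two average SNRs corresponds to a ratio of
$10^{\Delta/10}$.  For the largest tolerated imbalance,
\[
10^{12/10}=10^{1.2}\approx 15.8,
\]
so, under the standard decibel definition, the stronger branch may have an
average SNR roughly sixteen times that of the weaker branch before the
configuration leaves the region.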

1179: \begin{figure*}

1180: \begin{center}

1181: \begin{tabular}{c c}

1182: \includegraphics[width=250pt]{p_1} &

1183: \includegraphics[width=250pt]{p_2} \\

1184: \includegraphics[width=250pt]{p_3} &

1185: \includegraphics[width=250pt]{p_all} \\

1186: \end{tabular}

1187: \end{center}

1188: \caption[Cross-validation of optimized-support admissible

1189: regions.]{Cross-validation of optimized-support admissible regions with

1190: the confidence threshold~$\theta=0.95$.  The regions in top left,

1191: bottom left, and top right have been computed with $n_k=2$ independent

1192: samples per bucket.  There are $758\pm{}2$ buckets ($47\%$ of all data)

1193: per such region.  The region in the bottom right has been computed from

1194: the statistically significant data in Figure~\ref{fig:space} (bottom).

1195: It consists of 766 buckets ($48\%$ of all data).  Red (dark)

1196: corresponds to low bucket confidence and white (light) corresponds to

1197: high bucket confidence w.r.t. the voice quality threshold~$T=10^{-3}$.}

1198: \label{fig:cv}

1199: \end{figure*}

1200:

1201: Let us see how the data mining algorithm performs when data is scarce.

1202: The initial sample of the configuration space in Figure~\ref{fig:space}

1203: (top) contains one sample value per bucket.  The statistically

1204: significant sample in Figure~\ref{fig:space} (bottom) contains at least

1205: two additional sample values per bucket (recall that we required at

1206: least two sample values to estimate bucket variance $\sigma_k^2$).

1207: Therefore, three-fold cross-validation is the most elaborate

1208: cross-validation procedure that this dataset affords.  The regions in

1209: the top left, bottom left, and top right of Figure~\ref{fig:cv} have

1210: been computed for sample values in Figure~\ref{fig:space} (top) and the

1211: first two sample values per bucket in Figure~\ref{fig:space} (bottom).

1212: Each of these regions has been computed with two out of the three

1213: sample values per bucket.  The region in the lower right has been

1214: computed with all data in Figure~\ref{fig:space} (bottom).  All four

1215: regions are optimized-support admissible regions with the confidence

1216: threshold~$\theta=0.95$.  The regions are overlaid on top of the

1217: color-coded bucket confidence values.  Red (dark) corresponds to low

1218: confidence and white (light) corresponds to high confidence that

1219: configuration $c_k$ exhibits acceptable average performance w.r.t. the

1220: voice quality threshold~$T=10^{-3}$.

1221:
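The fold count and the two-sample estimates can be made explicit.  With three
sample values per bucket, there are
\[
\binom{3}{2}=3
\]
ways to choose the two values used to compute a region, which gives the three
folds.  If $b_k^{(1)}$ and $b_k^{(2)}$ denote the two selected values for
bucket~$k$ (notation introduced here for illustration), and if the usual
unbiased estimators are assumed, then
\[
\hat{\mu}_k=\frac{b_k^{(1)}+b_k^{(2)}}{2},
\qquad
\hat{\sigma}_k^2=\frac{\bigl(b_k^{(1)}-b_k^{(2)}\bigr)^2}{2} .
\]
With $n_k=2$, the variance estimate therefore rests on a single difference,
which is why this setting is a demanding test of the data mining algorithm.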

1222: The regions in Figure~\ref{fig:cv} are identical except for the

1223: lower-left corner.  This is not surprising because this part of the

1224: configuration space exhibits high relative variance.  Also, the data is

1225: symmetric but the regions are asymmetric in the lower-left corner.

1226: Recall that optimized-gain admissible regions, and thus

1227: optimized-support admissible regions, are not unique.  The ties in

1228: region gains are broken arbitrarily.  Therefore, region asymmetry is an

1229: additional indicator of region instability.

1230:

1231: Figure~\ref{fig:cv} also shows that additional data improves image

1232: contrast but does not significantly affect region shape.  Collecting

1233: additional sample values separates the points into ones with low

1234: confidence and ones with high confidence.  A curious side effect occurs

1235: when the difference in confidence estimates of low-confidence points

1236: falls below the discretization error (1/1000).  In this case, the

1237: `confidence slack' $1-\theta$ is allocated to arbitrary points with low

1238: confidence.  One way to correct this situation is to raise the

1239: confidence threshold~$\theta$---after all, more accurate data should

1240: afford stronger claims.  An alternative is to lower the

1241: discretization threshold.  In general, optimized regions work best when

1242: the data is noisy.  A contour plot will suffice when the data is highly

1243: accurate.

1244:

1245: It can also be seen that the high contrast created by the sharp edge of

1246: the tolerance region is advantageous to data mining.  The region is

1247: stable where the contrast is high.  When the image is blurred, data

1248: mining tries to avoid the questionable boundary points.

1249:

1250: To summarize, this section has demonstrated that optimized-gain regions

1251: are rectilinear and connected for a non-trivial space of wireless

1252: system configurations.  We have also shown that optimized-support

1253: admissible regions are easy to interpret.  Finally, we have shown that

1254: data mining works well when sample sizes are small.

1255:

1256: \section{Discussion and Future Work}

1257: \label{sec:conclusion}

1258:

1259: We have demonstrated a hierarchical formulation of data mining suitable

1260: for assessing performance of wireless system configurations.  WCDMA

1261: simulation results are systematically aggregated and redescribed,

1262: leading to intuitive regions that allow the engineer to evaluate

1263: wireless system configuration parameters.  We have shown that the

1264: assumptions about region shape and properties made by data mining

1265: algorithms can be valid in the wireless design context; the patterns

1266: mined hence lead to explainable and statistically valid design

1267: conclusions.  As a methodology, data mining is thus shown to be

1268: extremely powerful when coupled with statistically meaningful

1269: performance evaluation.

1270:

1271: To the authors' knowledge, this work is the first application of data

1272: mining methodology to problems in wireless system design.

1273: Many extensions are therefore possible and called for.

1274: We outline possible extensions at the three levels of aggregation:

1275: points, buckets, and regions.

1276:

1277: At the point level, it may be advantageous to model temporal

1278: simulations more precisely.  This paper assumes a `large enough' number

1279: of frames per simulation and works with the distribution of estimated BEPs.

1280: We have presented analytical and empirical evidence that this

1281: distribution is Gaussian.  The advantage of this problem formulation is

1282: the independence of spatial aggregation from the assumptions of

1283: temporal simulation.  This helps introduce wireless engineers to the

1284: methodology of data mining for studying design problems.  However, a

1285: stronger model of temporal simulation (e.g., Markov chains

1286: in~\cite{fsmc-wang}) may yield appreciable gains in software

1287: performance.  This direction is worth pursuing because few research

1288: groups have access to parallel computing facilities of the scale used

1289: in this work.  For instance, the initial sample of the configuration

1290: space in Figure~\ref{fig:space} (top) would take one year of

1291: computation time on a modern workstation.  The study presented in this

1292: paper would clearly be impossible without significant computational

1293: power.

1294:

1295: Aggregation of points into buckets is the least developed part of this

1296: work.  Suppose that we would like to simulate the effects of

1297: interference on configuration performance.  Assume that the

1298: distribution of the average strengths of the interfering signals is

1299: known \emph{a priori} (e.g., estimated by ray tracing).  We can either

1300: make this distribution known to the temporal simulation, or,

1301: alternatively, run several temporal simulations for different strengths

1302: of interfering signals.  The former is more accurate and

1303: computationally more efficient, but the latter is more generic and

1304: simpler to implement.  Bucketing of simulation results with varying

1305: simulation parameters is intended to approximate the performance of a

1306: single device under varying conditions.  This paper does not employ

1307: such bucketing but instead builds all the necessary kinds of parameter

1308: variation into the temporal simulation (arguably the more principled

1309: approach).  However, bucketing may be necessary when one has

1310: to work with a given dataset (e.g., measurements).  Bucket space can be

1311: viewed as a configuration space for a more complex temporal

1312: simulation.  Therefore, an in-depth treatment of bucketing is

1313: orthogonal to the primary topic of this paper, which is data mining.

1314:

1315: Significant work remains to be done at the region level as well.  For

1316: instance, the assumption of small variance could conceivably be relaxed.

1317: One can

1318: also pursue the relatively difficult task of incorporating strongly

1319: model-based prior knowledge into the data mining algorithm, or the

1320: somewhat easier task of applying different kinds of region mining

1321: algorithms to problems in wireless system design.

1322:

1323: Defining additional case studies is another obvious direction for

1324: future work.  We have studied a relatively small part of the parameter

1325: space of modern wireless systems.  More studies of this type must be

1326: performed to highlight the merits and the shortcomings of data mining

1327: in this domain.

1328:

1329: Finally, the strict staging of data collection and data mining can be

1330: relaxed.  One can fruitfully interleave the two activities and have the

1331: results of data mining drive subsequent data collection.  In

1332: data-scarce domains, it would be advantageous to focus the data

1333: collection effort on only those regions deemed most important to

1334: support a particular data mining objective.  Methodologies for

1335: closing the loop in this manner are becoming increasingly

1336: prevalent~\cite{sampling-cise}.  Such interleaving will also help define alternative

1337: criteria for evaluating experiment designs and layouts.

1338:

1339: \bibliographystyle{alpha}

1340: \bibliography{paper}

1341: \end{document}

1342: