
1: \section{Introduction}\label{sec:intro}

2: The Analysis of Algorithms community has been challenged by the

3:   existence of remarkable algorithms that are known by scientists and

4:   engineers to work well in practice, but whose theoretical analyses

5:   are negative or inconclusive.

6: The root of this problem is that algorithms

7:   are usually analyzed in one of two ways: by worst-case or average-case

8:   analysis.

9: Worst-case analysis can improperly suggest that an

10:   algorithm will perform poorly by examining its performance under

11:   the most contrived circumstances.

12: Average-case analysis was introduced to

13:   provide a less pessimistic measure of the performance of algorithms,

14:  and many practical algorithms perform well on the random

15:   inputs considered in average-case analysis.

16: However, average-case analysis may be unconvincing as

17:   the inputs encountered in many application domains

18:   may bear little resemblance to the random inputs

19:   that dominate the analysis.

20:

21: We propose an analysis that we call \textit{smoothed analysis}, which

22:   can help explain the

23:   success of algorithms that have poor worst-case complexity

24:   and whose inputs look sufficiently different from random that

25:   average-case analysis cannot be convincingly applied.

26: In smoothed analysis, we measure the

27:   performance of an algorithm under slight random perturbations of

28:   arbitrary inputs.

29: In particular, we consider

30:   Gaussian perturbations of inputs to algorithms that take real

31:   inputs, and we measure the running times of algorithms in terms

32:   of their input size and the standard deviation of the Gaussian perturbations.

33:

34: We show that the simplex method has polynomial smoothed

35:   complexity.

36: The simplex method is the classic example of an

37:   algorithm that is known to perform well in practice but which takes

38:   exponential time in the worst case

39: \cite{KleeMinty,Murty,GoldfarbSit,Goldfarb,AvisChvatal,Jeroslow,AmentaZiegler}.

40: In the late 1970s and early 1980s, the simplex method was shown

41:   to converge in expected polynomial time on various distributions of

42:   random inputs by researchers including Borgwardt, Smale, Haimovich, Adler,

43:   Karp, Shamir, Megiddo, and Todd

44: \cite{Borg82,Borg77,SmaleRand,Haimovich,AdlerKarpShamir,AdlerMegiddo,ToddRand}.

45: These works introduced novel probabilistic tools to the analysis

46:   of algorithms, and provided some intuition as to why the

47:   simplex method runs so quickly.

48: However, these analyses are dominated by

49:   ``random looking'' inputs: even if one were to prove

50:   very strong bounds on the higher moments of the distributions

51:   of running times on random inputs,

52:   one could not prove that an algorithm performs well

53:   in any particular small neighborhood of inputs.

54:

55: To bound expected running times on small neighborhoods of inputs,

56:   we consider linear programming problems in the form

57: \begin{eqnarray}\label{prg:A}

58:  &  \mbox{maximize} & \zz ^{T} \xx  \nonumber \\

59:  & \mbox{subject to} & \AA  \xx  \leq \yy,

60: \end{eqnarray}

61:  and prove that for every vector $\zz$,

62:   every matrix $\AAo$, and every vector $\orig{\yy}$,

63:   the expected time taken by a two-phase shadow-vertex simplex method

64:   to solve such a linear program

65:   is polynomial in $1/\sigma$ and the dimensions of $\AA$,

66:   where the expectation is taken over Gaussian perturbations

67:   $\AA$ and $\yy$ of $\AAo $ and $\orig{\yy}$

68:   of standard deviation

69:   $\sigma \left(\max_{i}\norm{(\orig{y}_{i}, \aao_{i})} \right)$.

70:

71:

72: \subsection{Linear Programming and the Simplex Method}\label{ssec:lp}

73: It is difficult to overstate the importance of linear programming

74:   to optimization.

75: Linear programming problems arise in innumerable industrial contexts.

76: Moreover, linear programming is often used as a fundamental step

77:   in other optimization algorithms.

78: In a linear programming problem, one is asked to maximize or

79:   minimize a linear function over a polyhedral region.

80:

81: Perhaps one reason we see so many linear programs is that we

82:   can solve them efficiently.

83: In 1947, Dantzig~\cite{Dantzig} introduced the simplex method,

84:   which was the first practical approach to solving linear programs

85:   and which remains widely used today.

86: To state it roughly, the simplex method proceeds by walking from

87:   one vertex to another of the polyhedron defined by the inequalities

88:   in \eqref{prg:A}.

89: At each step, it walks to a vertex that is better with respect to

90:   the objective function.

91: The algorithm will either determine that

92:   the constraints are unsatisfiable, determine that the objective function is

93:   unbounded, or  reach a vertex from which it cannot make

94:   progress, which necessarily optimizes the objective function.
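
As a small illustration (a toy instance of our own, not one used in the analysis), consider a program of the form \eqref{prg:A} in two variables with
\[
  \zz = \left(\begin{array}{c} 1 \\ 1 \end{array}\right), \qquad
  \AA = \left(\begin{array}{rr} 1 & 0 \\ 0 & 1 \\ -1 & 0 \\ 0 & -1 \end{array}\right), \qquad
  \yy = \left(\begin{array}{c} 1 \\ 1 \\ 0 \\ 0 \end{array}\right).
\]
The constraints read $x_{1} \leq 1$, $x_{2} \leq 1$, $-x_{1} \leq 0$ and $-x_{2} \leq 0$, so the feasible region is the unit square.
Starting from the vertex $(0,0)$, one possible walk is
\[
  (0,0) \rightarrow (1,0) \rightarrow (1,1),
\]
along which the objective $\zz^{T} \xx = x_{1} + x_{2}$ takes the values $0$, $1$ and $2$.
At $(1,1)$ both neighboring vertices, $(1,0)$ and $(0,1)$, have objective value $1$, so no improving step exists and the walk stops at the optimum.
Which of the two improving neighbors of $(0,0)$ is chosen first depends on the pivot rule.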

95:

96: Because of its great importance, other algorithms for

97:   linear programming have been invented.

98: In 1979, Khachiyan~\cite{Khachiyan} applied the

99:   ellipsoid algorithm to linear programming and proved that

100:   it always converged in time polynomial in

101:   $d$, $n$, and $L$, where $d$ is the number of variables, $n$ is the number of constraints, and $L$ is the number of

102:   bits needed to represent the linear program.

103: However, the ellipsoid algorithm has not been competitive

104:   with the simplex method in practice.

105: In contrast, the interior-point method introduced in 1984

106:   by  Karmarkar~\cite{Karmarkar}, which also runs in time polynomial

107:   in $d$, $n$, and $L$, has performed very well:

108:  variations of the interior-point method are competitive with

109:   and occasionally superior to the simplex method in practice.

110:

111: In spite of half a century of attempts to unseat it,

112:   the simplex method remains the most popular method

113:   for solving linear programs.

114: However, there has been no satisfactory theoretical

115:   explanation of its excellent performance.

116: A fascinating approach to understanding the performance of the

117:   simplex method has been the attempt to

118:   prove that there always exists a short

119:   walk from each vertex to the optimal vertex.

120: The Hirsch conjecture states that there should

121:   always be a walk of length at most $n - d$.

122: Significant progress on this conjecture was

123:   made by Kalai and Kleitman~\cite{KalaiKleitman}, who proved that

124:   there always exists a walk of length

125:   at most $n ^{\log_{2}d + 2}$.

126: However, the existence of such a short walk does not imply

127:   that the simplex method will find it.
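
For a concrete sense of these two bounds (the arithmetic is ours), take the three-dimensional cube, which has $d = 3$ and $n = 6$ facets.
Two opposite vertices differ in all three coordinates and each edge changes exactly one coordinate, so the longest shortest walk between vertices has length $3 = n - d$, matching the conjectured bound.
For the same parameters, the bound of Kalai and Kleitman evaluates to
\[
  n^{\log_{2} d + 2} = 6^{\log_{2} 3 + 2} \approx 6^{3.585} \approx 616,
\]
which shows how much room remains between the proven bound and the conjectured one even for very small instances.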

128:

129: A simplex method is not completely defined until one

130:   specifies its \textit{pivot rule}---the method by which

131:   it decides which vertex to walk to

132:   when it has many to choose from.

133: There is no deterministic pivot rule under which the

134:   simplex method is known to take a sub-exponential

135:   number of steps.

136: In fact, for almost every deterministic

137:   pivot rule there is a family of polytopes

138:   on which it is known to take an exponential number of

139:   steps

140: \cite{KleeMinty,Murty,GoldfarbSit,Goldfarb,AvisChvatal,Jeroslow}.

141:   (See~\cite{AmentaZiegler} for a survey and a

142:   unified construction of these polytopes.)

143: The best present analysis of randomized pivot rules shows

144:   that they take expected time $n^{O (\sqrt{d})}$%

145: \cite{KalaiSubexp,Matousek},

146:   which is quite far from the polynomial complexity

147:   observed in practice.

148: This inconsistency between the exponential worst-case behavior of the

149:   simplex method and its everyday practicality leaves us wanting

150:   a more reasonable theoretical analysis.

151:

152: %% from STOC version

153:

154: Various average-case analyses of the simplex method

155:   have been performed.

156: Most relevant to this paper is the analysis of

157:   Borgwardt~\cite{Borg77,Borg82}, who

158:   proved that the simplex method with the shadow

159:   vertex pivot rule runs in expected polynomial time

160:   for polytopes whose constraints are drawn independently from

161:   spherically symmetric distributions

162:   (\textit{e.g.} Gaussian distributions centered at the origin).

163: Independently,

164:   Smale~\cite{SmaleRand,SmaleRand2} proved bounds on the

165:   expected running time of Lemke's self-dual parametric simplex algorithm

166:   on linear programming problems

167:   chosen from a spherically symmetric distribution.

168: Smale's analysis was substantially improved by Megiddo~\cite{Megiddo}.

169:

170: While these average-case analyses are significant

171:   accomplishments, it is not clear whether they

172:   actually provide intuition for what happens

173:   on typical inputs.

174: Edelman~\cite{EdelmanRoulette} writes on this point:

175: \begin{quotation}

176: What is a mistake is to psychologically link a random

177:   matrix with the intuitive notion of a ``typical'' matrix

178:   or the vague concept of ``any old matrix.''

179: \end{quotation}

180:

181: Another model of random linear programs was studied in

182:   a line of research initiated independently

183:   by Haimovich~\cite{Haimovich} and Adler~\cite{Adler}.

184: Their works

185:   considered the maximum over matrices, $\AA$,

186:   of the expected time taken by parametric simplex

187:   methods to solve linear programs over these matrices

188:   in which the directions of the

189:   inequalities are chosen at random.

190: As this framework considers the maximum of an average,

191:   it may be viewed as a precursor to smoothed

192:   analysis---the distinction being that

193:   the random choice of

194:   inequalities cannot be viewed as a perturbation,

195:   as different choices yield radically different linear programs.

196: Haimovich and Adler both proved that

197:   parametric simplex methods

198:   would take an expected linear number of steps

199:   to go from the vertex minimizing the objective function

200:   to the vertex maximizing the objective function,

201:   even conditioned on the program being feasible.

202: While their theorems confirmed the intuitions of many practitioners,

203:   they were geometric rather than algorithmic%

204: \footnote{Our results in Section~\ref{sec:shadow} are analogous to

205:   these results.}

206:  as it

207:   was not clear how an algorithm would locate either vertex.

208: Building on these analyses, Todd~\cite{ToddRand},

209:   Adler and Megiddo~\cite{AdlerMegiddo},

210:   and Adler, Karp and Shamir~\cite{AdlerKarpShamir}

211:   analyzed parametric algorithms for linear programming under this model

212:   and proved quadratic

213:   bounds on their expected running time.

214: While the random inputs considered in these analyses are

215:   not as special as the random inputs obtained from spherically

216:   symmetric distributions,

217:   the model of randomly flipped inequalities provokes some

218:   similar objections.

219:

220: \subsection{Smoothed Analysis of Algorithms

221:  and Related Work}\label{ssec:smooth}

222: We introduce the \textit{smoothed analysis of algorithms} in the hope that

223:   it will help explain the good practical performance of many

224:   algorithms that worst-case analysis does not explain and for which average-case analysis

225:   is unconvincing.

226: Our first application of the smoothed analysis of algorithms will be to

227:   the simplex method.

228: We will consider the maximum over $\AAo$

229:  and $\orig{\yy}$ of the expected running time

230:   of the simplex method on inputs of the form

231: \begin{eqnarray}

232:  &  \mbox{maximize} & \zz ^{T} \xx \nonumber \\

233:  & \mbox{subject to} & (\AAo + \GG) \xx  \leq (\orig{\yy} + \hh),  \label{prg:AG}

234: \end{eqnarray}

235: where we let $\AAo$ and $\orig{\yy}$ be arbitrary

236:   and $\GG$ and $\hh$ be a matrix and a vector of independently chosen

237:   Gaussian random variables of mean $0$ and

238:   standard deviation $\sigma \left(\max_{i}\norm{(\orig{y}_{i}, \aao_{i})} \right)$.

239: If we let $\sigma $ go to $0$, then we obtain the worst-case

240:   complexity of the simplex method; whereas, if we let $\sigma $

241:   be so large that $\GG$ swamps out $\AAo$, we obtain the

242:   average-case analyzed by Borgwardt.

243: By choosing polynomially small $\sigma $, this analysis combines

244:   advantages of worst-case and average-case analysis, and roughly

245:   corresponds to the notion of imprecision in low-order digits.
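
The two limits can be made concrete by a rescaling, given here as a sketch in notation of our own: the symbols $M$, $\tilde{\GG}$ and $\tilde{\hh}$ are introduced only for this remark, and we assume $M > 0$.
Write $M = \max_{i}\norm{(\orig{y}_{i}, \aao_{i})}$, so that $\GG = \sigma M \tilde{\GG}$ and $\hh = \sigma M \tilde{\hh}$, where the entries of $\tilde{\GG}$ and $\tilde{\hh}$ are independent Gaussian random variables of mean $0$ and standard deviation $1$.
Dividing every constraint of \eqref{prg:AG} by the positive number $\sigma M$ does not change the feasible region, and gives
\[
  \left(\frac{1}{\sigma M} \AAo + \tilde{\GG} \right) \xx
  \leq
  \frac{1}{\sigma M} \orig{\yy} + \tilde{\hh} .
\]
Each rescaled center satisfies $\norm{(\orig{y}_{i}, \aao_{i})} / (\sigma M) \leq 1/\sigma$.
When $\sigma$ is large, the centers are therefore negligible next to the unit-variance noise, and the constraints are close to Gaussian vectors centered at the origin.
When $\sigma$ is small, the centers dominate and the noise only slightly perturbs the original data.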

246:

247: In a smoothed analysis of an algorithm, we assume that the inputs

248:   to the algorithm are subject to slight random perturbations,

249:   and we measure the complexity of the algorithm in terms of the input

250:   size and the standard deviation of the perturbations.

251: If an algorithm has low smoothed complexity, then one should expect it to

252:   work well in practice since most real-world problems are generated

253:   from data that is inherently noisy.

254: Another way of thinking about smoothed complexity is to observe that if an

255:   algorithm has low smoothed complexity, then one must be unlucky

256:   to choose an input instance on which it performs poorly.

257:

258:

259: We now provide some definitions for the smoothed analysis of algorithms

260:   that take real or complex inputs.

261: For an algorithm $A$ and input $\xx $, let

262: \[

263:    \calC_{A} (\xx )

264: \]

265: be a complexity measure of $A$ on input $\xx$.

266: Let $X$ be the domain of inputs to $A$, and let

267:   $X_{n}$ be the set of inputs of size $n$.

268: The size of an input can be measured in various ways.

269: Standard measures are the number of real variables

270:   contained in the input and the sums of the bit-lengths

271:   of the variables.

272: Using this notation, one can say that $A$ has worst-case

273:   $\calC$-complexity $f (n)$ if

274: \[

275:   \max _{\xx \in X_{n}} (\calC_{A} (\xx )) = f (n).

276: \]

277: Given a family of distributions $\mu_{n} $ on $X_{n}$, we say that $A$

278:   has average-case $\calC$-complexity $f (n)$ under $\mu $ if

279: \[

280:   \expec{\xx  \from{\mu _{n}}{X_{n}}}{\calC_{A} (\xx )} = f (n).

281: \]

282: Similarly, we say that $A$ has \textit{smoothed $\calC$-complexity}

283:   $f (n, \sigma )$ if

284: \begin{equation}\label{eqn:smoothedcomplexity}

285:  \max _{\xx  \in X_{n}}

286:   \expec{\gg }{\calC_{A} (\xx + \left(\sigma \norm{\xx}_{?} \right) \gg  )} = f (n, \sigma ),

287: \end{equation}

288: \index{smoothed-complexity}%

289:  where $\left( \sigma \norm{\xx}_{?} \right) \gg$ is a vector of Gaussian random variables of mean $0$ and

290:   standard deviation $\sigma \norm{\xx}_{?}$ and $\norm{\xx}_{?}$ is a measure of the magnitude

291:   of $\xx$, such as the largest element or the norm.

292: We say that an algorithm has \textit{polynomial smoothed complexity}

293:   if its smoothed complexity is polynomial in $n$ and $1/\sigma $.

294: \index{polynomial smoothed complexity}

295: In Section~\ref{sec:conclusions}, we present some

296:   generalizations of the definition of smoothed complexity that

297:   might prove useful.

298: To further contrast smoothed analysis with average-case analysis,

299:   we note that the probability mass in \eqref{eqn:smoothedcomplexity} is

300:   concentrated in a region of radius $O (\sigma \sqrt{n})$ and

301:   volume at most $O (\sigma \sqrt{n})^{n}$,

302:   and so, when $\sigma$ is small, this region contains an exponentially small fraction

303:   of the probability mass in an average-case analysis.

304: Thus, even an extension of average-case analysis to higher moments

305:   will not imply meaningful bounds on smoothed complexity.
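
The radius in this estimate can be checked by a short second-moment computation.
As a sketch, assume that the input consists of $n$ real variables, that $\norm{\xx}_{?} = 1$, and that $\norm{\cdot}$ denotes the Euclidean norm; write $g_{1}, \ldots , g_{n}$ for the coordinates of $\gg$ (these normalizations are ours).
Since each $g_{i}$ has mean $0$ and standard deviation $1$,
\[
  \expec{\gg}{\norm{\sigma \gg}^{2}}
  = \sigma^{2} \sum_{i=1}^{n} \expec{\gg}{g_{i}^{2}}
  = \sigma^{2} n ,
\]
and Markov's inequality gives
\[
  \Pr \left[ \norm{\sigma \gg} \geq 2 \sigma \sqrt{n} \right]
  = \Pr \left[ \norm{\sigma \gg}^{2} \geq 4 \sigma^{2} n \right]
  \leq \frac{1}{4} .
\]
Thus at least three quarters of the probability mass lies within distance $2 \sigma \sqrt{n}$ of $\xx$, which is one way to read the radius $O (\sigma \sqrt{n})$ above.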

306:

307: A discrete analog of smoothed analysis has been studied in a collection

308:   of works inspired by Santha and Vazirani's \textit{semi-random source}

309:   model~\cite{SanthaVazirani}.

310: In this model, an adversary generates an input, and each bit of this input

311:   has some probability of being flipped.

312: Blum and Spencer~\cite{BlumSpencer} design a polynomial-time

313:   algorithm that $k$-colors

314:   $k$-colorable graphs generated by this model.

315: Feige and Krauthgamer~\cite{FeigeKrauthgamer} analyze a model

316:   in which the adversary is more powerful,

317:   and use it to show that Turner's algorithm~\cite{Turner}

318:   for approximating the bandwidth performs well

319:   on semi-random inputs.

320: They also improve Turner's analysis.

321: Feige and Kilian~\cite{FeigeKilian}

322:   present polynomial-time algorithms that

323:   recover large independent sets,

324:   $k$-colorings, and optimal bisections

325:   in semi-random graphs.

326: They also demonstrate that significantly better

327:   results would lead to surprising

328:   collapses of complexity classes.

329:

330: \subsection{Our Results}\label{ssec:results}

331:

332: We consider

333:   the maximum over $\zz$, $\orig{\yy}$,

334:   and $\vs{\aao}{1}{n}$ of the expected time taken

335:   by a two-phase shadow vertex simplex method to solve

336:  linear programming problems of the form

337: \begin{eqnarray}

338:  &  \mbox{maximize} & \zz^{T} \xx \nonumber  \\

339:  & \mbox{subject to} & \form{\aa _{i}}{\xx} \leq y _{i},

340:   \mbox{ for $1 \leq i \leq n$,} \label{eqn:lpEnumerated2}

341: \end{eqnarray} \index{zz@$\zz $}%

342: where each $\aa _{i}$ is a Gaussian random vector of standard deviation

343:   $\sigma \max_{i} \norm{(\orig{y}_{i}, \aao_{i})}$ centered at $\aao _{i}$,

344:   and each $y_{i}$ is a Gaussian random variable of  standard deviation

345:   $\sigma \max_{i} \norm{(\orig{y}_{i}, \aao_{i})}$ centered at $\orig{y} _{i}$.

346:

347: We begin by considering the case in which

348:   $\yy = \oone $, $\norm{\aao _{i}} \leq 1$,

349:   and $\sigma < 1/(3 \sqrt{d \ln n})$.

350: In this case, our first result, Theorem~\ref{thm:shadow}, says that

351:   for every vector

352:   $\tt $ the expected size of the {\em  shadow} of the polytope---the

353:   projection of the polytope defined

354:   by the inequalities \eqref{eqn:lpEnumerated2} onto the plane

355:   spanned by $\tt $ and $\zz $---is polynomial in $n$, the dimension,

356:   and $1/\sigma $.

357: This result is the geometric foundation of our work, but

358:   it does not directly bound the running time of an algorithm,

359:   as the shadow relevant to the analysis of an algorithm

360:   depends on the perturbed program and cannot be specified

361:   beforehand as the vector $\tt$ must be.

362: In Section~\ref{sec:introSVM2phase}, we describe a two-phase

363:   shadow-vertex simplex algorithm,

364:   and in Section~\ref{sec:phaseI} we

365:   use Theorem~\ref{thm:shadow} as a black box to show

366:   that it takes expected time polynomial in $n$, $d$,

367:   and $1/\sigma $ in the case described above.

368:

369: Efforts have been made to analyze how much the solution of a linear

370:   program can change as its data is perturbed.

371: For an introduction to such analyses,

372:   and an analysis of the complexity of interior point

373:   methods in terms of the resulting condition number,

374:   we refer the reader to

375:   the work of Renegar~\cite{RenegarFunc,RenegarCond,RenegarPert}.

376:

377:

378: \subsection{Intuition Through Condition Numbers}\label{sec:intuition}

379: For those already familiar with the simplex method and condition numbers,

380:   we include this section to provide some intuition for why our

381:   results should be true.

382:

383: Our analysis will exploit geometric properties

384:   of the condition number of a matrix, rather than of a

385:   linear program.

386: We start with the observation that if a corner of a polytope

387:   is specified by the equation $A_{I} \xx = \yy_{I}$,

388:   where $I$ is a $d$-set, then the condition number of

389:   the matrix $A_{I}$ provides a good measure of how far the corner

390:   is from being flat.

391: Moreover, it is relatively easy to show that if

392:   $A$ is subject to perturbation, then it is unlikely that

393:   $A_{I}$ has poor condition number.

394: So, it seems intuitive that if $A$ is perturbed, then most

395:   corners of the polytope should have angles bounded away

396:   from being flat.

397: This already provides some intuition as to why the simplex method

398:   should run quickly: one should make reasonable progress as

399:   one rounds a corner if it is not too flat.
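
A two-dimensional example, which is ours and is meant only to illustrate the link between conditioning and flatness, makes this concrete.
Here $\theta$ is a parameter we introduce with $0 < \theta \leq \pi /4$, and we take the condition number to be the ratio of the largest to the smallest singular value.
Let
\[
  A_{I} = \left(\begin{array}{rr}
     \sin \theta & \cos \theta \\
    -\sin \theta & \cos \theta
  \end{array}\right),
  \qquad
  \yy_{I} = \left(\begin{array}{c} 1 \\ 1 \end{array}\right),
\]
so that the two constraint normals are unit vectors at angle $2 \theta$ to each other and the corner is the point $(0, 1/\cos \theta )$.
Then
\[
  A_{I}^{T} A_{I} = \left(\begin{array}{cc}
    2 \sin^{2} \theta & 0 \\
    0 & 2 \cos^{2} \theta
  \end{array}\right),
\]
so the singular values of $A_{I}$ are $\sqrt{2} \sin \theta$ and $\sqrt{2} \cos \theta$, and the condition number of $A_{I}$ is $\cot \theta$.
The interior angle of the feasible region at the corner is $\pi - 2 \theta$.
As $\theta$ tends to $0$, the corner flattens toward a straight angle and the condition number $\cot \theta$ grows without bound; at $\theta = \pi /4$ the corner is a right angle and the condition number is $1$.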

400:

401: There are two difficulties in making the above intuition rigorous:

402:   the first is that even if $A_{I}$ is well-conditioned for most

403:   sets $I$, it is not clear that $A_{I}$ will be well-conditioned

404:   for most sets $I$ that are bases of corners of the polytope.

405: The second difficulty is that even if most corners of the polytope

406:   have reasonable condition number, it is not clear that a simplex

407:   method will actually encounter many of these corners.

408: By analyzing the shadow vertex pivot rule, it is possible to resolve

409:   both of these difficulties.

410:

411: The first advantage of studying the shadow vertex pivot rule is

412:   that its analysis comes down to studying the expected sizes

413:   of shadows of the polytope.

414: From the specification of the plane onto which the polytope will be projected,

415:   one obtains a characterization of all the corners that will be in

416:   the shadow, thereby avoiding the complication of an iterative

417:   characterization.

418: The second advantage is that these corners are specified by the

419:   property that they optimize a particular objective function,

420:   and using this property one can actually bound the probability

421:   that they are ill-conditioned.

422: While the results of Section~\ref{sec:shadow} are not stated in

423:   these terms, this is the intuition behind them.

424:

425: Condition numbers also play a fundamental role in our

426:   analysis of the shadow-vertex algorithm.

427: The analysis of the algorithm differs from the mere analysis

428:   of the sizes of shadows in that, in the study of an algorithm,

429:   the plane onto which the polytope is projected depends upon

430:   the polytope itself.

431: This correlation of the plane with the polytope complicates

432:   the analysis, but is also resolved through the help

433:   of condition numbers.

434: In our analysis, we view the perturbation as the composition

435:   of two perturbations, where the second is small relative to the first.

436: We show that our choice of the plane onto which we

437:   project the shadow is well-conditioned with high

438:   probability after the first perturbation.

439: That is, we show that the second perturbation is unlikely

440:   to substantially change the plane onto which we project,

441:   and therefore unlikely to substantially change the shadow.

442: Thus,  it suffices to measure the expected size of the

443:   shadow obtained after the second perturbation onto the

444:   plane that would have been chosen after just the first

445:   perturbation.
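
A standard fact about Gaussians makes such a decomposition possible; we state it as background, with $\GG_{1}$, $\GG_{2}$, $\sigma_{1}$ and $\sigma_{2}$ being our own notation rather than that of the later sections, and with $\sigma$ standing for the standard deviation of the entries of $\GG$.
If $\GG_{1}$ and $\GG_{2}$ are independent matrices of independent Gaussian random variables of mean $0$ and standard deviations $\sigma_{1}$ and $\sigma_{2}$, then the entries of $\GG_{1} + \GG_{2}$ are independent Gaussian random variables of mean $0$ and standard deviation $\sqrt{\sigma_{1}^{2} + \sigma_{2}^{2}}$.
Hence a perturbation of standard deviation $\sigma$ has the same distribution as
\[
  \AAo + \GG_{1} + \GG_{2},
  \qquad \mbox{where} \quad
  \sigma_{1}^{2} + \sigma_{2}^{2} = \sigma^{2} .
\]
For instance, choosing $\sigma_{2} = \sigma /10$ leaves $\sigma_{1} = \sigma \sqrt{1 - 1/100} \approx 0.995 \, \sigma$, so the first perturbation carries almost all of the randomness while the second remains small relative to it.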

446:

447: The technical lemma that enables this analysis, Lemma~\ref{lem:MGC},

448:   is a concentration result that proves that it is highly

449:   unlikely that almost all of the minors of a random

450:   matrix have poor condition number.

451: This analysis also enables us to show that it is highly

452:   unlikely that we will need a large ``big-$M$''

453:   in phase I of our algorithm.

454:

455: We note that the condition numbers of the $A_{I}$s

456:   have been studied before in the complexity of

457:   linear programming algorithms.

458: The condition number $\bar{\chi}_{A}$

459:   of Vavasis and Ye~\cite{VavasisYe} measures

460:   the condition number of the worst sub-matrix $A_{I}$,

461:   and their algorithm runs in time proportional

462:   to $\ln (\bar{\chi }_{A})$.

463: Todd, Tun{\c{c}}el, and Ye~\cite{ToddTuncelYe} have shown

464:   that for a Gaussian random matrix the expectation

465:   of $\ln (\bar{\chi }_{A})$ is $O (\min (d \ln n, n))$.

466: That is, they show that it is unlikely that any $A_{I}$

467:   is exponentially ill-conditioned.

468: It is relatively simple to apply the techniques of

469:   Section~\ref{sec:phaseIManyGood} to obtain a similar

470:   result in the smoothed case.

471: We wonder whether our concentration result that it

472:   is exponentially unlikely that many $A_{I}$

473:   are even polynomially ill-conditioned could

474:   be used to obtain a better smoothed analysis

475:   of the Vavasis-Ye algorithm.

476:

477: \subsection{Discussion}\label{sec:introDiscussion}

478:

479: One can debate whether the definition of

480:   \textit{polynomial smoothed complexity}

481:   should be that an algorithm have complexity polynomial in $1/\sigma $

482:   or $\log (1/\sigma )$.

483: We believe that the choice of being polynomial in $1/\sigma $

484:   will prove more useful as the other definition is too strong

485:   and quite similar

486:   to the notion of being polynomial in the worst case.

487: In particular, one can convert any algorithm for linear programming

488:   whose smoothed complexity

489:   is polynomial in $d$, $n$ and $\log (1/\sigma) $

490:   into an algorithm whose worst-case complexity is polynomial in $d$,

491:   $n$, and $L$.

492: That said, one should certainly prefer complexity bounds that are

493:   lower as a function of $1/\sigma$, $d$ and $n$.
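
The arithmetic behind the comparison of the two definitions can be sketched as follows, where $c$ is a hypothetical constant that we introduce only for illustration.
If one runs an algorithm with $\sigma = 2^{-c L}$, then
\[
  \log_{2} (1/\sigma ) = c L ,
\]
so a bound polynomial in $d$, $n$ and $\log (1/\sigma )$ is polynomial in $d$, $n$ and $L$.
By contrast, at this scale a bound that is polynomial in $1/\sigma = 2^{c L}$ is exponential in $L$.
This sketch covers only the parameter count; it omits the step of recovering a solution of the original program from a solution of the perturbed one.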

494:

495:

496: We also remark that a simple examination of the

497:   constructions that provide exponential lower bounds

498:   for various pivot

499:   rules~\cite{KleeMinty,Murty,GoldfarbSit,Goldfarb,AvisChvatal,Jeroslow}

500:   reveals that none of these pivot rules

501:   have smoothed complexity polynomial in $n$ and

502:   sub-polynomial in $1/\sigma $.

503: That is, these constructions are unaffected by exponentially

504:   small perturbations.

505:

506:

507:

508:

509:

510:

511:

512: % Local Variables: ***

513: % TeX-master:"shadow.tex" ***

514: % End: ***

515:

516: