cs0111050/intro.tex
1: \section{Introduction}\label{sec:intro}
2: The Analysis of Algorithms community has been challenged by the
3:   existence of remarkable algorithms that are known by scientists and
4:   engineers to work well in practice, but whose theoretical analyses
5:   are negative or inconclusive.  
6: The root of this problem is that algorithms
7:   are usually analyzed in one of two ways: by worst-case or average-case
8:   analysis.  
9: Worst-case analysis can improperly suggest that an
10:   algorithm will perform poorly by examining its performance under
11:   the most contrived circumstances.
12: Average-case analysis was introduced to
13:   provide a less pessimistic measure of the performance of algorithms,
14:  and many practical algorithms perform well on the random
15:   inputs considered in average-case analysis.
16: However, average-case analysis may be unconvincing as
17:   the inputs encountered in many application domains
18:   may bear little resemblance to the random inputs
19:   that dominate the analysis. 
20: 
21: We propose a form of analysis, which we call \textit{smoothed analysis},
22:   that can help explain the
23:   success of algorithms that have poor worst-case complexity
24:   and whose inputs look sufficiently different from random that
25:   average-case analysis cannot be convincingly applied.
26: In smoothed analysis, we measure the
27:   performance of an algorithm under slight random perturbations of
28:   arbitrary inputs.  
29: In particular, we consider 
30:   Gaussian perturbations of inputs to algorithms that take real 
31:   inputs, and we measure the running times of algorithms in terms
32:   of their input size and the standard deviation of the Gaussian perturbations.
33: 
34: We show that the simplex method has polynomial smoothed
35:   complexity.  
36: The simplex method is the classic example of an
37:   algorithm that is known to perform well in practice but which takes
38:   exponential time in the worst case
39: \cite{KleeMinty,Murty,GoldfarbSit,Goldfarb,AvisChvatal,Jeroslow,AmentaZiegler}.
40: In the late 1970s and early 1980s, the simplex method was shown
41:   to converge in expected polynomial time on various distributions of
42:   random inputs by researchers including Borgwardt, Smale, Haimovich, Adler,
43:   Karp, Shamir, Megiddo, and Todd
44: \cite{Borg82,Borg77,SmaleRand,Haimovich,AdlerKarpShamir,AdlerMegiddo,ToddRand}.
45: These works introduced novel probabilistic tools to the analysis
46:   of algorithms, and provided some intuition as to why the
47:   simplex method runs so quickly.
48: However, these analyses are dominated by
49:   ``random looking'' inputs: even if one were to prove
50:   very strong bounds on the higher moments of the distributions
51:   of running times on random inputs,
52:   one could not prove that an algorithm performs well
53:   in any particular small neighborhood of inputs.
54: 
55: To bound expected running times on small neighborhoods of inputs,
56:   we consider linear programming problems in the form
57: \begin{eqnarray}\label{prg:A}
58:  &  \mbox{maximize} & \zz ^{T} \xx  \nonumber \\
59:  & \mbox{subject to} & \AA  \xx  \leq \yy,
60: \end{eqnarray}
61:  and prove that for every vector $\zz$,
62:   every matrix $\AAo$, and every vector $\orig{\yy}$,
63:   the expected time taken by a two-phase shadow-vertex simplex method
64:   to solve such a linear program
65:   is polynomial in $1/\sigma$ and the dimensions of $\AA$,
66:   where the expectation is taken over Gaussian perturbations
67:   $\AA$ and $\yy$ of $\AAo$ and $\orig{\yy}$
68:   of standard deviation
69:   $\sigma \left(\max_{i}\norm{(\orig{y}_{i}, \aao_{i})} \right)$.
70: 
71: 
72: \subsection{Linear Programming and the Simplex Method}\label{ssec:lp}
73: It is difficult to overstate the importance of linear programming
74:   to optimization.
75: Linear programming problems arise in innumerable industrial contexts.
76: Moreover, linear programming is often used as a fundamental step
77:   in other optimization algorithms.
78: In a linear programming problem, one is asked to maximize or
79:   minimize a linear function over a polyhedral region.
80: 
81: Perhaps one reason we see so many linear programs is that we
82:   can solve them efficiently.
83: In 1947, Dantzig~\cite{Dantzig} introduced the simplex method,
84:   which was the first practical approach to solving linear programs
85:   and which remains widely used today.
86: To state it roughly, the simplex method proceeds by walking from
87:   one vertex to another of the polyhedron defined by the inequalities
88:   in \eqref{prg:A}.
89: At each step, it walks to a vertex that is better with respect to
90:   the objective function.
91: The algorithm will either determine that 
92:   the constraints are unsatisfiable, determine that the objective function is
93:   unbounded, or  reach a vertex from which it cannot make
94:   progress, which necessarily optimizes the objective function.
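
For illustration only (this small instance is a sketch and is not drawn from the
  analysis that follows), consider a program of the form \eqref{prg:A}
  with two variables and four constraints:
\[
  \zz = \left(\begin{array}{c} 1 \\ 1 \end{array}\right),
  \qquad
  \AA = \left(\begin{array}{rr} 1 & 0 \\ 0 & 1 \\ -1 & 0 \\ 0 & -1 \end{array}\right),
  \qquad
  \yy = \left(\begin{array}{c} 1 \\ 1 \\ 0 \\ 0 \end{array}\right).
\]
The feasible region is the unit square, whose vertices are
  $(0,0)$, $(1,0)$, $(0,1)$, and $(1,1)$.
Starting from $(0,0)$, one possible walk is
\[
  (0,0) \rightarrow (1,0) \rightarrow (1,1),
  \qquad \mbox{with objective values} \quad 0 < 1 < 2.
\]
Neither neighbor of $(1,1)$ has a larger objective value, so the walk stops
  there, and $(1,1)$ is the optimal vertex.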
95: 
96: Because of the great importance of linear programming, other algorithms
97:   for solving linear programs have been invented.
98: In 1979, Khachiyan~\cite{Khachiyan} applied the
99:   ellipsoid algorithm to linear programming and proved that
100:   it always converged in time polynomial in
101:   $d$, $n$, and $L$---the number of
102:   bits needed to represent the linear program.
103: However, the ellipsoid algorithm has not been competitive  
104:   with the simplex method in practice.
105: In contrast, the interior-point method introduced in 1984
106:   by  Karmarkar~\cite{Karmarkar}, which also runs in time polynomial
107:   in $d$, $n$, and $L$, has performed very well:
108:  variations of the interior-point method are competitive with
109:   and occasionally superior to the simplex method in practice.
110: 
111: In spite of half a century of attempts to unseat it,
112:   the simplex method remains the most popular method
113:   for solving linear programs.
114: However, there has been no satisfactory theoretical 
115:   explanation of its excellent performance.
116: A fascinating approach to understanding the performance of the
117:   simplex method has been the attempt to
118:   prove that there always exists a short
119:   walk from each vertex to the optimal vertex.
120: The Hirsch conjecture states that there should
121:   always be a walk of length at most $n - d$.
122: Significant progress on this conjecture was 
123:   made by Kalai and Kleitman~\cite{KalaiKleitman}, who proved that
124:   there always exists a walk of length
125:   at most $n ^{\log_{2}d + 2}$.
126: However, the existence of such a short walk does not imply
127:   that the simplex method will find it.
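
To give a sense of the gap between these two bounds, take as a purely
  illustrative instance $n = 100$ and $d = 10$ (values chosen here only
  for the arithmetic).
The conjectured bound is
\[
  n - d = 90,
\]
whereas the bound of Kalai and Kleitman evaluates to
\[
  n^{\log_{2} d + 2} = 100^{\log_{2} 10 + 2} \approx 100^{5.32}
  \approx 4.4 \times 10^{10}.
\]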
128: 
129: A simplex method is not completely defined until one
130:   specifies its \textit{pivot rule}---the method by which
131:   it decides which vertex to walk to 
132:   when it has many to choose from.  
133: There is no deterministic pivot rule under which the
134:   simplex method is known to take a sub-exponential
135:   number of steps.
136: In fact, for almost every deterministic
137:   pivot rule there is a family of polytopes 
138:   on which it is known to take an exponential number of 
139:   steps
140: \cite{KleeMinty,Murty,GoldfarbSit,Goldfarb,AvisChvatal,Jeroslow}.
141:   (See~\cite{AmentaZiegler} for a survey and a 
142:   unified construction of these polytopes).
143: The best present analysis of randomized pivot rules shows
144:   that they take expected time $n^{O (\sqrt{d})}$%
145: \cite{KalaiSubexp,Matousek},
146:   which is quite far from the polynomial complexity
147:   observed in practice.
148: This inconsistency between the exponential worst-case behavior of the
149:   simplex method and its everyday practicality leaves us wanting
150:   a more reasonable theoretical analysis.
151: 
152: %% from STOC version
153: 
154: Various average-case analyses of the simplex method
155:   have been performed.
156: Most relevant to this paper is the analysis of
157:   Borgwardt~\cite{Borg77,Borg82}, who
158:   proved that the simplex method with the shadow
159:   vertex pivot rule runs in expected polynomial time
160:   for polytopes whose constraints are drawn independently from 
161:   spherically symmetric distributions 
162:   (\textit{e.g.}, Gaussian distributions centered at the origin).
163: Independently, 
164:   Smale~\cite{SmaleRand,SmaleRand2} proved bounds on the 
165:   expected running time of Lemke's self-dual parametric simplex algorithm
166:   on linear programming problems 
167:   chosen from a spherically-symmetric distribution.
168: Smale's analysis was substantially improved by Megiddo~\cite{Megiddo}.
169: 
170: While these average-case analyses are significant
171:   accomplishments, it is not clear whether they
172:   actually provide intuition for what happens
173:   on typical inputs.
174: Edelman~\cite{EdelmanRoulette} writes on this point:
175: \begin{quotation}
176: What is a mistake is to psychologically link a random
177:   matrix with the intuitive notion of a ``typical'' matrix
178:   or the vague concept of ``any old matrix.''
179: \end{quotation}
180: 
181: Another model of random linear programs was studied in
182:   a line of research initiated independently
183:   by Haimovich~\cite{Haimovich} and Adler~\cite{Adler}.
184: Their works
185:   considered the maximum over matrices, $\AA$,
186:   of the expected time taken by parametric simplex
187:   methods to solve linear programs over these matrices
188:   in which the directions of the
189:   inequalities are chosen at random.
190: As this framework considers the maximum of an average,
191:   it may be viewed as a precursor to smoothed 
192:   analysis---the distinction being that 
193:   the random choice of
194:   inequalities cannot be viewed as a perturbation,
195:   as different choices yield radically different linear programs.
196: Haimovich and Adler both proved that 
197:   parametric simplex methods
198:   would take an expected linear number of steps
199:   to go from the vertex minimizing the objective function
200:   to the vertex maximizing the objective function,
201:   even conditioned on the program being feasible.
202: While their theorems confirmed the intuitions of many practitioners, 
203:   they were geometric rather than algorithmic%
204: \footnote{Our results in Section~\ref{sec:shadow} are analogous to
205:   these results.}
206:  as it 
207:   was not clear how an algorithm would locate either vertex.
208: Building on these analyses, Todd~\cite{ToddRand},
209:   Adler and Megiddo~\cite{AdlerMegiddo},
210:   and Adler, Karp and Shamir~\cite{AdlerKarpShamir}
211:   analyzed parametric algorithms for linear programming under this model
212:   and proved quadratic
213:   bounds on their expected running time.
214: While the random inputs considered in these analyses are
215:   not as special as the random inputs obtained from spherically
216:   symmetric distributions,
217:   the model of randomly flipped inequalities provokes some
218:   similar objections.
219: 
220: \subsection{Smoothed Analysis of Algorithms
221:  and Related Work}\label{ssec:smooth}
222: We introduce the \textit{smoothed analysis of algorithms} in the hope that
223:   it will help explain the good practical performance of many
224:   algorithms that worst-case analysis does not explain and for which
225:   average-case analysis is unconvincing.
226: Our first application of the smoothed analysis of algorithms will be to
227:   the simplex method.
228: We will consider the maximum over $\AAo$
229:  and $\orig{\yy}$ of the expected running time
230:   of the simplex method on inputs of the form
231: \begin{eqnarray}
232:  &  \mbox{maximize} & \zz ^{T} \xx \nonumber \\
233:  & \mbox{subject to} & (\AAo + \GG) \xx  \leq (\orig{\yy} + \hh),  \label{prg:AG}
234: \end{eqnarray}
235: where we let $\AAo$ and $\orig{\yy}$ be arbitrary 
236:   and $\GG$ and $\hh$ be a matrix and a vector of independently chosen
237:   Gaussian random variables of mean $0$ and 
238:   standard deviation $\sigma \left(\max_{i}\norm{(\orig{y}_{i}, \aao_{i})} \right)$.
239: If we let $\sigma $ go to $0$, then we obtain the worst-case
240:   complexity of the simplex method; whereas, if we let $\sigma $
241:   be so large that $\GG$ swamps out $\AAo$, we obtain the
242:   average-case setting analyzed by Borgwardt.
243: By choosing polynomially small $\sigma $, this analysis combines
244:   advantages of worst-case and average-case analysis, and roughly
245:   corresponds to the notion of imprecision in low-order digits.
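
The large-$\sigma$ limit can be made concrete by a rescaling, sketched here
  with notation introduced only for this remark.
Write $M = \max_{i}\norm{(\orig{y}_{i}, \aao_{i})}$, so that every entry of
  $\GG$ and $\hh$ has standard deviation $\sigma M$.
Dividing the $i$th constraint of \eqref{prg:AG} by $\sigma M > 0$ does not
  change the feasible region, and yields
\[
  \form{\frac{\aao_{i}}{\sigma M} + \gg_{i}}{\xx}
  \leq \frac{\orig{y}_{i}}{\sigma M} + \gamma_{i},
\]
where $\gg_{i}$ is a vector of independent Gaussian random variables of mean $0$
  and standard deviation $1$, and $\gamma_{i}$ is an independent Gaussian random
  variable of mean $0$ and standard deviation $1$.
Since
\[
  \frac{\norm{(\orig{y}_{i}, \aao_{i})}}{\sigma M} \leq \frac{1}{\sigma},
\]
the deterministic part of each constraint vanishes as $\sigma$ grows, leaving
  constraint data that is essentially Gaussian and centered at the origin.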
246: 
247: In a smoothed analysis of an algorithm, we assume that the inputs
248:   to the algorithm are subject to slight random perturbations,
249:   and we measure the complexity of the algorithm in terms of the input
250:   size and the standard deviation of the perturbations.
251: If an algorithm has low smoothed complexity, then one should expect it to
252:   work well in practice since most real-world problems are generated
253:   from data that is inherently noisy.
254: Another way of thinking about smoothed complexity is to observe that if an
255:   algorithm has low smoothed complexity, then one must be unlucky
256:   to choose an input instance on which it performs poorly.
257: 
258: 
259: We now provide some definitions for the smoothed analysis of algorithms
260:   that take real or complex inputs.
261: For an algorithm $A$ and input $\xx $, let 
262: \[
263:    \calC_{A} (\xx )
264: \]
265: be a complexity measure of $A$ on input $\xx$.
266: Let $X$ be the domain of inputs to $A$, and let
267:   $X_{n}$ be the set of inputs of size $n$.
268: The size of an input can be measured in various ways.
269: Standard measures are the number of real variables
270:   contained in the input and the sums of the bit-lengths  
271:   of the variables.
272: Using this notation, one can say that $A$ has worst-case
273:   $\calC$-complexity $f (n)$ if 
274: \[
275:   \max _{\xx \in X_{n}} (\calC_{A} (\xx )) = f (n).
276: \]
277: Given a family of distributions $\mu_{n} $ on $X_{n}$, we say that $A$
278:   has average-case $\calC$-complexity $f (n)$ under $\mu $ if
279: \[
280:   \expec{\xx  \from{\mu _{n}}{X_{n}}}{\calC_{A} (\xx )} = f (n).
281: \]
282: Similarly, we say that $A$ has \textit{smoothed $\calC$-complexity} 
283:   $f (n, \sigma )$ if
284: \begin{equation}\label{eqn:smoothedcomplexity}
285:  \max _{\xx  \in X_{n}} 
286:   \expec{\gg }{\calC_{A} (\xx + \left(\sigma \norm{\xx}_{?} \right) \gg  )} = f (n, \sigma ),
287: \end{equation}
288: \index{smoothed-complexity}%
289:  where $\left( \sigma \norm{\xx}_{?} \right) \gg$ is a vector of Gaussian random variables of mean $0$ and
290:   standard deviation $\sigma \norm{\xx}_{?}$ and $\norm{\xx}_{?}$ is a measure of the magnitude
291:   of $\xx$, such as the largest element or the norm.
292: We say that an algorithm has \textit{polynomial smoothed complexity}
293:   if its smoothed complexity is polynomial in $n$ and $1/\sigma $.
294: \index{polynomial smoothed complexity}
295: In Section~\ref{sec:conclusions}, we present some 
296:   generalizations of the definition of smoothed complexity that
297:   might prove useful.
298: To further contrast smoothed analysis with average-case analysis, 
299:   we note that the probability mass in \eqref{eqn:smoothedcomplexity} is
300:   concentrated in a region of radius $O (\sigma \sqrt{n})$ and
301:   volume at most $O (\sigma \sqrt{n})^{n}$,
302:   and so, when $\sigma$ is small, this region contains an exponentially small fraction
303:   of the probability mass in an average-case analysis.
304: Thus, even an extension of average-case analysis to higher moments
305:   will not imply meaningful bounds on smoothed complexity.
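
The radius quoted above can be checked by a short calculation, sketched here
  under the assumptions that the input consists of $n$ real variables,
  that it is normalized so that $\norm{\xx}_{?} = 1$,
  and that $\norm{\cdot}$ denotes the Euclidean norm.
If $\gg = (g_{1}, \ldots, g_{n})$ has independent Gaussian coordinates of
  mean $0$ and standard deviation $1$, then
\[
  \expec{\gg}{\norm{\sigma \gg}^{2}}
  = \sigma^{2} \sum_{j = 1}^{n} \expec{\gg}{g_{j}^{2}}
  = \sigma^{2} n,
\]
so the perturbation $\sigma \gg$ typically has length about $\sigma \sqrt{n}$.
We also note that, when the perturbed input again belongs to $X_{n}$,
  each expectation in \eqref{eqn:smoothedcomplexity} is an average of values
  of $\calC_{A}$ on $X_{n}$, and therefore
\[
  \max_{\xx \in X_{n}}
  \expec{\gg}{\calC_{A} (\xx + \left(\sigma \norm{\xx}_{?} \right) \gg)}
  \leq \max_{\xx \in X_{n}} \calC_{A} (\xx).
\]
That is, under this assumption the smoothed complexity never exceeds the
  worst-case complexity.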
306: 
307: A discrete analog of smoothed analysis has been studied in a collection
308:   of works inspired by Santha and Vazirani's \textit{semi-random source}
309:   model~\cite{SanthaVazirani}.
310: In this model, an adversary generates an input, and each bit of this input
311:   has some probability of being flipped.
312: Blum and Spencer~\cite{BlumSpencer} design a polynomial-time 
313:   algorithm that $k$-colors
314:   $k$-colorable graphs generated by this model.
315: Feige and Krauthgamer~\cite{FeigeKrauthgamer} analyze a model
316:   in which the adversary is more powerful,
317:   and use it to show that Turner's algorithm~\cite{Turner}
318:   for approximating the bandwidth performs well 
319:   on semi-random inputs.
320: They also improve Turner's analysis.
321: Feige and Kilian~\cite{FeigeKilian}
322:   present polynomial-time algorithms that
323:   recover large independent sets, 
324:   $k$-colorings, and optimal bisections
325:   in semi-random graphs.
326: They also demonstrate that significantly better
327:   results would lead to surprising
328:   collapses of complexity classes.
329: 
330: \subsection{Our Results}\label{ssec:results}
331: 
332: We consider
333:   the maximum over $\zz$, $\orig{\yy}$,
334:   and $\vs{\aao}{1}{n}$ of the expected time taken
335:   by a two-phase shadow-vertex simplex method to solve
336:  linear programming problems of the form
337: \begin{eqnarray}
338:  &  \mbox{maximize} & \zz^{T} \xx \nonumber  \\
339:  & \mbox{subject to} & \form{\aa _{i}}{\xx} \leq y _{i}, 
340:   \mbox{ for $1 \leq i \leq n$,} \label{eqn:lpEnumerated2}
341: \end{eqnarray} \index{zz@$\zz $}%
342: where each $\aa _{i}$ is a Gaussian random vector of standard deviation
343:   $\sigma \max_{i} \norm{(\orig{y}_{i}, \aao_{i})}$ centered at $\aao _{i}$,
344:   and each $y_{i}$ is a Gaussian random variable of  standard deviation
345:   $\sigma \max_{i} \norm{(\orig{y}_{i}, \aao_{i})}$ centered at $\orig{y} _{i}$.
346: 
347: We begin by considering the case in which
348:   $\yy = \oone $, $\norm{\aao _{i}} \leq 1$,
349:   and $\sigma < 1/ (3 \sqrt{d \ln n})$.
350: In this case, our first result, Theorem~\ref{thm:shadow}, says that
351:   for every vector
352:   $\tt $ the expected size of the {\em  shadow} of the polytope---the
353:   projection of the polytope defined
354:   by the constraints in \eqref{eqn:lpEnumerated2} onto the plane
355:   spanned by $\tt $ and $\zz $---is polynomial in $n$, the dimension,
356:   and $1/\sigma $.
357: This result is the geometric foundation of our work, but
358:   it does not directly bound the running time of an algorithm,
359:   as the shadow relevant to the analysis of an algorithm
360:   depends on the perturbed program and cannot be specified
361:   beforehand as the vector $\tt$ must be.
362: In Section~\ref{sec:introSVM2phase}, we describe a two-phase 
363:   shadow-vertex simplex algorithm,
364:   and in Section~\ref{sec:phaseI} we
365:   use Theorem~\ref{thm:shadow} as a black box to show
366:   that it takes expected time polynomial in $n$, $d$,
367:   and $1/\sigma $ in the case described above.
368: 
369: Efforts have been made to analyze how much the solution of a linear
370:   program can change as its data is perturbed.
371: For an introduction to such analyses, 
372:   and an analysis of the complexity of interior-point
373:   methods in terms of the resulting condition number,
374:   we refer the reader to
375:   the work of Renegar~\cite{RenegarFunc,RenegarCond,RenegarPert}.
376: 
377: 
378: \subsection{Intuition Through Condition Numbers}\label{sec:intuition}
379: For those already familiar with the simplex method and condition numbers,
380:   we include this section to provide some intuition for why our   
381:   results should be true.
382: 
383: Our analysis will exploit geometric properties
384:   of the condition number of a matrix, rather than of a
385:   linear program.
386: We start with the observation that if a corner of a polytope
387:   is specified by the equation $A_{I} \xx = \yy_{I}$,
388:   where $I$ is a $d$-set, then the condition number of
389:   the matrix $A_{I}$ provides a good measure of how far the corner
390:   is from being flat.
391: Moreover, it is relatively easy to show that if
392:   $A$ is subject to perturbation, then it is unlikely that
393:   $A_{I}$ has poor condition number.
394: So, it seems intuitive that if $A$ is perturbed, then most
395:   corners of the polytope should have angles bounded away
396:   from being flat.
397: This already provides some intuition as to why the simplex method
398:   should run quickly: one should make reasonable progress as
399:   one rounds a corner if it is not too flat.
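
As a two-dimensional illustration (a sketch under simplifying assumptions,
  with $\kappa$ denoting the condition number, a symbol introduced only here),
  suppose that $d = 2$ and that the two rows of $A_{I}$ are unit vectors
  meeting at an angle $\theta \in (0, \pi /2]$.
Then
\[
  A_{I} A_{I}^{T} =
  \left(\begin{array}{cc} 1 & \cos \theta \\ \cos \theta & 1 \end{array}\right),
\]
which has eigenvalues $1 + \cos \theta$ and $1 - \cos \theta$, and so
\[
  \kappa (A_{I})
  = \sqrt{\frac{1 + \cos \theta}{1 - \cos \theta}}
  = \cot (\theta /2).
\]
The interior angle of the corner is $\pi - \theta$.
As $\theta$ tends to $0$, the two constraints become nearly parallel,
  the corner approaches being flat, and $\kappa (A_{I}) \approx 2/\theta$
  grows without bound.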
400: 
401: There are two difficulties in making the above intuition rigorous:
402:   the first is that even if $A_{I}$ is well-conditioned for most
403:   sets $I$, it is not clear that $A_{I}$ will be well-conditioned
404:   for most sets $I$ that are bases of corners of the polytope.
405: The second difficulty is that even if most corners of the polytope
406:   have reasonable condition number, it is not clear that a simplex
407:   method will actually encounter many of these corners.
408: By analyzing the shadow vertex pivot rule, it is possible to resolve
409:   both of these difficulties.
410: 
411: The first advantage of studying the shadow vertex pivot rule is
412:   that its analysis comes down to studying the expected sizes
413:   of shadows of the polytope.
414: From the specification of the plane onto which the polytope will be projected,
415:   one obtains a characterization of all the corners that will be in
416:   the shadow, thereby avoiding the complication of an iterative
417:   characterization.
418: The second advantage is that these corners are specified by the
419:   property that they optimize a particular objective function,
420:   and using this property one can actually bound the probability
421:   that they are ill-conditioned.
422: While the results of Section~\ref{sec:shadow} are not stated in
423:   these terms, this is the intuition behind them.
424: 
425: Condition numbers also play a fundamental role in our
426:   analysis of the shadow-vertex algorithm.
427: The analysis of the algorithm differs from the mere analysis
428:   of the sizes of shadows in that, in the study of an algorithm,
429:   the plane onto which the polytope is projected depends upon
430:   the polytope itself.
431: This correlation of the plane with the polytope complicates
432:   the analysis, but is also resolved through the help
433:   of condition numbers.
434: In our analysis, we view the perturbation as the composition
435:   of two perturbations, where the second is small relative to the first.
436: We show that our choice of the plane onto which we
437:   project the polytope is well-conditioned with high
438:   probability after the first perturbation.
439: That is, we show that the second perturbation is unlikely
440:   to substantially change the plane onto which we project,
441:   and therefore unlikely to substantially change the shadow.
442: Thus,  it suffices to measure the expected size of the
443:   shadow obtained after the second perturbation onto the
444:   plane that would have been chosen after just the first
445:   perturbation.
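
The decomposition into two perturbations rests on a standard fact about
  Gaussian random variables, stated here with a hypothetical parameter
  $\tau \in (0,1)$ that is not necessarily the choice made in our analysis.
If $\GG_{1}$ and $\GG_{2}$ are independent matrices of independent Gaussian
  random variables of mean $0$ and standard deviations $\sigma_{1}$ and
  $\sigma_{2}$, then $\GG_{1} + \GG_{2}$ is a matrix of independent Gaussian
  random variables of mean $0$ and standard deviation
  $\sqrt{\sigma_{1}^{2} + \sigma_{2}^{2}}$.
In particular,
\[
  \sigma_{1} = \sigma \sqrt{1 - \tau^{2}}
  \quad \mbox{and} \quad
  \sigma_{2} = \tau \sigma
  \quad \mbox{give} \quad
  \sqrt{\sigma_{1}^{2} + \sigma_{2}^{2}} = \sigma,
\]
so a perturbation of standard deviation $\sigma$ can be generated by
  a first perturbation followed by a second one whose standard deviation is
  smaller by a factor of $\tau / \sqrt{1 - \tau^{2}}$.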
446: 
447: The technical lemma that enables this analysis, Lemma~\ref{lem:MGC},
448:   is a concentration result that proves that it is highly
449:   unlikely that almost all of the minors of a random
450:   matrix have poor condition number.
451: This analysis also enables us to show that it is highly
452:   unlikely that we will need a large ``big-$M$''
453:   in phase I of our algorithm.
454: 
455: We note that the condition numbers of the $A_{I}$s
456:   have been studied before in the complexity of
457:   linear programming algorithms.
458: The condition number $\bar{\chi}_{A}$
459:   of Vavasis and Ye~\cite{VavasisYe} measures
460:   the condition number of the worst sub-matrix $A_{I}$,
461:   and their algorithm runs in time proportional
462:   to $\ln (\bar{\chi }_{A})$.
463: Todd, Tun{\c{c}}el, and Ye~\cite{ToddTuncelYe} have shown 
464:   that for a Gaussian random matrix the expectation
465:   of $\ln (\bar{\chi }_{A})$ is $O (\min (d \ln n, n))$.
466: That is, they show that it is unlikely that any $A_{I}$
467:   is exponentially ill-conditioned.
468: It is relatively simple to apply the techniques of
469:   Section~\ref{sec:phaseIManyGood} to obtain a similar
470:   result in the smoothed case.
471: We wonder whether our concentration result that it
472:   is exponentially unlikely that many $A_{I}$
473:   are even polynomially ill-conditioned could
474:   be used to obtain a better smoothed analysis
475:   of the Vavasis-Ye algorithm.
476: 
477: \subsection{Discussion}\label{sec:introDiscussion}
478: 
479: One can debate whether the definition of
480:   \textit{polynomial smoothed complexity}
481:   should be that an algorithm have complexity polynomial in $1/\sigma $
482:   or $\log (1/\sigma )$.
483: We believe that the choice of being polynomial in $1/\sigma $
484:   will prove more useful as the other definition is too strong
485:   and quite similar
486:   to the notion of being polynomial in the worst case.
487: In particular, one can convert any algorithm for linear programming
488:   whose smoothed complexity
489:   is polynomial in $d$, $n$ and $\log (1/\sigma) $
490:   into an algorithm whose worst-case complexity is polynomial in $d$,
491:   $n$, and $L$.
492: That said, one should certainly prefer complexity bounds that are
493:   lower as a function of $1/\sigma$, $d$ and $n$.
494: 
495: 
496: We also remark that a simple examination of the 
497:   constructions that provide exponential lower bounds
498:   for various pivot 
499:   rules~\cite{KleeMinty,Murty,GoldfarbSit,Goldfarb,AvisChvatal,Jeroslow}  
500:   reveals that none of these pivot rules
501:   have smoothed complexity polynomial in $n$ and
502:   sub-polynomial in $1/\sigma $.
503: That is, these constructions are unaffected by exponentially
504:   small perturbations.
505: 
506: 
507: 
508: 
509: 
510: 
511: 
512: % Local Variables: ***
513: % TeX-master:"shadow.tex" ***
514: % End: ***
515: 
516: