
1: \documentclass[letterpaper, 10 pt, conference]{ieeeconf}

2: \IEEEoverridecommandlockouts

3: \overrideIEEEmargins

4:

5: \newcommand{\calG}{\mathcal{G}}

6: \newcommand{\calF}{\mathcal{F}}

7: \newcommand{\calM}{\mathcal{M}}

8: \newcommand{\calR}{\mathcal{R}}

9: \newcommand{\comment}[1]{}

10:

11: \newtheorem{proposition}{Proposition}

12:

13: % The following packages can be found on http://www.ctan.org

14:

15: \usepackage{graphics} % for pdf, bitmapped graphics files

16: \usepackage{epsfig} % for postscript graphics files

17: \usepackage{mathptmx} % assumes new font selection scheme installed

18: \usepackage{times} % assumes new font selection scheme installed

19: \usepackage{amsmath} % assumes amsmath package installed

20: \usepackage{amssymb}  % assumes amsmath package installed

21:

22: \title{\LARGE \bf Lagrangian Relaxation for MAP Estimation in

23: Graphical Models}

24:

25: \author{Jason K. Johnson, Dmitry M. Malioutov and Alan

26: S. Willsky\thanks{The authors are with the Electrical Engineering and Computer Science Department, Massachusetts Institute of Technology, Cambridge,

27: MA 02139, USA.  {\tt\small {\{jasonj,dmm,willsky\}@mit.edu}}.}}

28:

29: \begin{document}

30:

31: \maketitle

32: \begin{minipage}[t][0pt][t]{0.96\textwidth}

33: \vspace{-1.8in}

34: In Proceedings of \emph{The 45th Allerton Conference on

35: Communication, Control and Computing}, September, 2007.

36: \end{minipage}\vspace{-\baselineskip}

37:

38:

39: \thispagestyle{empty}

40: \pagestyle{empty}

41: \begin{abstract}

42: We develop a general framework for MAP estimation in discrete and

43: Gaussian graphical models using Lagrangian relaxation techniques.  The

44: key idea is to reformulate an intractable estimation problem as one

45: defined on a more tractable graph, but subject to additional

46: constraints. Relaxing these constraints gives a tractable dual

47: problem, one defined by a thin graph, which is then optimized by an

48: iterative procedure.  When this iterative optimization leads to a

49: consistent estimate, one which also satisfies the constraints, then it

50: corresponds to an optimal MAP estimate of the original model.

51: Otherwise there is a ``duality gap'', and we obtain a bound on the

52: optimal solution.  Thus, our approach combines convex optimization

53: with dynamic programming techniques applicable for thin graphs. The

54: popular tree-reweighted max-product (TRMP) method may be seen as

55: solving a particular class of such relaxations, where the intractable

56: graph is relaxed to a set of spanning trees.  We also consider

57: relaxations to a set of small induced subgraphs, thin subgraphs

58: (e.g. loops), and a connected tree obtained by ``unwinding'' cycles.

59: In addition, we propose a new class of multiscale relaxations that

60: introduce ``summary'' variables.  The potential benefits of such

61: generalizations include: reducing or eliminating the ``duality gap''

62: in hard problems, reducing the number of Lagrange multipliers in the

63: dual problem, and accelerating convergence of the iterative

64: optimization procedure.

65: \end{abstract}

66:

67: \section{Introduction}

68:

69: Graphical models are probability models for a collection of random

70: variables on a graph: the nodes of the graph represent random

71: variables and the graph structure encodes conditional independence

72: relations among the variables.  Such models provide compact

73: representations of probability distributions, and have found many

74: practical applications in physics, statistical signal and image

75: processing, error-correcting coding and machine learning.  However,

76: performing optimal estimation in such models using standard junction

77: tree approaches is generally intractable in large-scale estimation

78: scenarios.  This motivates the development of variational techniques

79: to perform approximate inference, and, in some cases, recover the

80: optimal estimate.

81:

82: We consider a general Lagrangian relaxation (LR) approach to

83: \emph{maximum a posteriori} (MAP) estimation in graphical models. The

84: general idea is to reformulate the estimation problem on an

85: intractable graph as a constrained estimation over an augmented model

86: defined on a larger, but more tractable graph.  Then, using Lagrange

87: multipliers to relax the constraints, we obtain a tractable estimation

88: problem that gives an upper-bound on the original problem.  This leads

89: to a convex optimization problem of minimizing the upper-bound as a

90: function of Lagrange multipliers.

91:

92: We consider a variety of strategies to augment the original graph.

93: The simplest approach breaks the graph into many small, overlapping

94: subgraphs, which involves replicating some variables.  Similarly, the

95: graph can be broken into a set of thin subgraphs, as in the TRMP

96: approach, or ``unrolled'' to obtain a larger, but connected, thin

97: graph. We show that all of these approaches are essentially

98: equivalent, being characterized by the set of maximal cliques of the

99: augmented graph.  More generally, we also consider the introduction of

100: ``summary'' variables, which leads naturally to multiscale algorithms.

101: We develop a general optimization approach based on marginal and

102: max-marginal matching procedures, which enforce consistency between

103: replicas of a node or edge, and moment-matching in the multiscale

104: relaxation. We show that the resulting bound is tight if and only if

105: there exists an optimal assignment in the augmented model that

106: satisfies the constraints.  In that case, we obtain the desired MAP

107: estimate of the original model.  When there is a duality gap, this is

108: evidenced by the occurrence of ``ties'' in the resulting set of

109: max-marginals, which requires further augmentation of the model to

110: reduce and ultimately eliminate the duality gap.  We focus primarily

111: on discrete graphical models with binary variables, but also consider

112: the extension to Gaussian graphical models.  In the Gaussian model, we

113: find that, whenever LR is ``well-posed'', so that the augmented model

114: is valid, it leads to a tight bound and the optimal MAP estimate, and

115: also gives \emph{upper-bounds} on variances that provide a measure of

116: confidence in the MAP estimate.

117:

118: \section{Background}

119:

120: We consider probabilistic graphical models

121: \cite{Lauritzen96,Cowell*99,Frey98}, which are probability

122: distributions of the form

123: \begin{equation}\label{eq:1}

124: p(x_1,\dots,x_n) = \frac{1}{Z} \exp\{f(x)\} = \frac{1}{Z} \exp\left\{\sum_{C \in \calG} f_C(x_C)\right\}

125: \end{equation}

126: where each function $f_C$ only depends on a subset of variables $x_C =

127: (x_v, v \in C)$ and $Z$ is a normalization constant of the model,

128: called the \emph{partition function} in statistical physics.  If the

129: sum ranges over all \emph{cliques} of the graph, which are the fully

130: connected subsets of variables, this representation is sufficient to

131: realize any Markov model on $\calG$ \cite{Lauritzen96}.

132: \comment{\footnote{The probability

133: distribution $p(x)$ is \emph{Markov} on $\calG$ if for every $S

134: \subset V$ that separates $A,B \subset V$ in $\calG$, $x_A$ and $x_B$

135: are independent given $x_S$.}}  However, it is also common to consider

136: restricted Markov models where only singleton and pairwise

137: interactions are specified.  In general, we specify the set of

138: interactions by a hypergraph $\calG \subset 2^V$, where $2^V$

139: represents the set of all subsets of $V$.  The elements of $\calG$

140: are its \emph{hyperedges}; this generalizes the usual concept of

141: a graph with pairwise edges.

142:

143: \emph{Discrete Models.} While our approach is applicable for general

144: discrete models, we focus on models with binary variables.  One may

145: use either the Boltzmann machine representation $x_v \in \{0,1\}$, or

146: that of the Ising model $x_v \in \{-1,+1\}$.  These models can be

147: represented as in (\ref{eq:1}) with

148: \begin{equation}

149: f(x;\theta) = \sum_{E\in\calG} \theta_E \phi_E(x_E), \;\; \phi_E(x_E) = \prod_{v \in E} x_v

150: \end{equation}

151: This defines an \emph{exponential family} \cite{WainwrightJordan03} of

152: probability distributions based on model features $\phi$ and

153: parameterized by $\theta$.  $\Phi(\theta) \triangleq \log Z(\theta)$

154: is the \emph{log-partition function} and has the

155: \emph{moment-generating property}: $\frac{\partial

156: \Phi(\theta)}{\partial \theta_E} = \mathbb{E}_\theta\{\phi_E(x)\}

157: \triangleq \eta_E$.  Here, $\eta$ are the \emph{moments} of the

158: distribution, which serve both as an alternate parameterization of

159: the exponential family and, in graphical models, to specify the

160: marginal distributions on cliques of the model. The cost of inference in discrete

161: models using junction tree methods, either to compute the mode or the

162: marginals, is generally linear in the number of variables $n$ but

163: grows exponentially in the \emph{width} of the graph \cite{Cowell*99}, which

164: is determined by the size of the maximal cliques in a junction tree

165: representation of the graph. Hence, exact inference is only tractable

166: for \emph{thin} graphs, that is, where one can build an equivalent

167: junction tree with small cliques.

168:
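As a hedged sanity check of the moment-generating property, consider the
smallest possible instance (an illustrative assumption, not a model used
elsewhere in this paper): a single Boltzmann variable $x \in \{0,1\}$ with
$f(x;\theta) = \theta x$.  Then
\begin{displaymath}
Z(\theta) = 1 + e^{\theta}, \quad
\Phi(\theta) = \log(1 + e^{\theta}), \quad
\frac{d \Phi(\theta)}{d \theta} = \frac{e^{\theta}}{1 + e^{\theta}}
= p(x = 1) = \mathbb{E}_\theta\{x\} = \eta,
\end{displaymath}
so the moment $\eta$ is exactly the marginal probability that $x = 1$, and
$\theta = \log \frac{\eta}{1-\eta}$ recovers the parameter from the moment.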

169: \emph{Gaussian Models.} We also consider Gaussian graphical models

170: \cite{Dempster72,SpeedKiiveri86} represented in \emph{information form}:

171: \begin{equation}

172: p(x) = \exp\{-\tfrac{1}{2} x^T J x + h^T x - \Phi(h,J) \}

173: \end{equation}

174: where $J$ is the \emph{information matrix}, $h$ a potential vector and

175: $\Phi(h,J) = \tfrac{1}{2} \{ h^TJ^{-1}h - \log\det J + n\log 2\pi\}$.  This

176: corresponds to the standard form of the Gaussian model specified by

177: the covariance matrix $P = J^{-1}$ and mean vector $\hat{x} = J^{-1}

178: h$.  This translates into an exponential family where we identify

179: $(h,J)$ with the parameters $\theta$ and $(\hat{x},P)$ with the

180: moments $\eta$.  In general, the complexity of inference in Gaussian

181: models is $\mathcal{O}(n^3)$.  The fill pattern of $J$ determines the

182: Markov structure of the Gaussian model: $(i,j) \in \calG$ if and only if $J_{i,j}

183: \neq 0$. Using more efficient recursive inference methods that exploit

184: sparsity, such as junction trees or sparse Gaussian elimination, the

185: complexity is linear in $n$ but cubic in the width of the graph, which

186: is still impractical for many large-scale estimation problems.

187:
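For intuition, the scalar case $n = 1$ can be worked out by hand; the scalar
$j > 0$ below is an illustrative stand-in for $J$.  Completing the square
gives
\begin{displaymath}
-\tfrac{1}{2} j x^2 + h x = -\tfrac{1}{2} j \left( x - \tfrac{h}{j} \right)^2 + \tfrac{h^2}{2j},
\end{displaymath}
so $p(x)$ is Gaussian with mean $\hat{x} = h/j$ and variance $P = 1/j$, and
normalization requires
\begin{displaymath}
\Phi(h,j) = \tfrac{1}{2} \left\{ \tfrac{h^2}{j} - \log j + \log 2\pi \right\},
\end{displaymath}
which agrees with the general expression for $\Phi(h,J)$ when $n = 1$.  One
can also check that $\partial \Phi / \partial h = h/j = \hat{x}$, consistent
with identifying $\hat{x}$ as a moment of the exponential family.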

188: \section{Discrete Lagrangian Relaxation}

189:

190: To begin with, consider the problem of maximizing the following

191: objective function, defined over a hypergraph $\calG \subset

192: 2^V$ based on a vertex set $V = \{1,\dots,n\}$ corresponding to

193: discrete variables $x = (x_1,\dots,x_n)$.

194: \begin{equation}

195: f(x) = \sum_{E \in \calG} f_E(x_E)

196: \end{equation}

197: For instance, this may be defined as $f(x) = \langle \theta, \phi(x)

198: \rangle$ in an exponential family graphical model, such that each term

199: corresponds to a feature $f_E(x_E) = \theta_E \phi_E(x_E)$.  Then, we

200: seek $x^*$ to maximize $f(x)$ to obtain the MAP estimate of (\ref{eq:1}).

201:

202: \begin{figure}

203: \centering

204: \input{LR_ex.pstex_t}

205: \caption{\label{fig:toy_example}A simple illustrative example of Lagrangian relaxation.}

206: \vspace{-.3cm}

207: \end{figure}

208:

209: \emph{An Illustrative Example.} To briefly convey the basic concept,

210: we consider a simple pairwise model defined on a $3$-node cycle

211: $\calG$ represented in Fig. \ref{fig:toy_example}. Here, the augmented

212: graph $\calG'$ is a 4-node chain, where node $4$ is a replica of node

213: $1$. We copy all the potentials on the nodes and edges from $\calG$ to

214: $\calG'$. For the replicated variables, $x_1'$ and $x_4'$, we split

215: $f_1$ between $f_1'$ and $f_4'$ such that $f_1(y) = f_1'(y) + f_4'(y)$

216: for $y \in \{0,1\}$. Now the problem $\max_x f(x)$ is equivalent to

217: maximizing $f'(x')$ subject to the constraint $x_1' = x_4'$. To solve

218: the latter we relax the constraint using Lagrange multipliers:

219: $L(x', \lambda) = f'(x') + \lambda(x'_1 - x'_4)$. The additional term

220: $\lambda(x_1' - x_4')$ modifies the self-potentials: $f_1' \leftarrow

221: f_1'(x_1') + \lambda x_1'$ and $f_4' \leftarrow f_4'(x_4') - \lambda

222: x_4'$, parameterizing a family of models on $\calG'$ all of which are

223: equivalent to $f$ under the constraint $x'_1=x'_4$. For a fixed

224: $\lambda$, solving $\max_{x'} L(x', \lambda) \triangleq g(\lambda)$ gives

225: an upper bound on $f^* = \max_x f(x)$, so by optimizing $\lambda$ to

226: minimize $g(\lambda)$, we find the tightest bound $g^* = \min_\lambda

227: g(\lambda)$. If the constraint $x_1' = x_4'$ is satisfied in the final

228: solution, then there is strong duality $g^* = f^*$ and we obtain the

229: correct MAP assignment for $f(x)$.

230:
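The upper-bound property used in this example follows from a one-line
argument, sketched here for the toy model.  For any $x'$ with $x_1' = x_4'$
the multiplier term vanishes, and restricting a maximization to a subset of
configurations can only decrease its value, so
\begin{displaymath}
g(\lambda) = \max_{x'} L(x',\lambda)
\ge \max_{x' : \, x_1' = x_4'} L(x',\lambda)
= \max_{x' : \, x_1' = x_4'} f'(x') = f^*.
\end{displaymath}
This holds for every $\lambda$, and therefore also for the minimizing one,
which gives $g^* \ge f^*$.  If some maximizer of $L(\cdot,\lambda)$ happens to
satisfy $x_1' = x_4'$, then the inequality holds with equality for that
$\lambda$.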

231: We now discuss the general procedure and develop our approach to

232: optimize $g(\lambda)$ in more difficult cases.

233:

234: \subsection{Obtaining a Tractable Graph by Vertex Replication}

235:

236: In this section, we consider approaches that involve

237: \emph{replicating} variables to define the augmented model. The basic

238: constraints in designing $\calG'$ are as follows: $\calG'$ is

239: comprised of replicas of nodes and edges of $\calG$.  Every node and

240: edge of $\calG$ must be represented at least once in $\calG'$.

241: Finally, $\calG'$ should be a thin graph, which relates to the

242: complexity of our method.

243:

244: To help illustrate the various strategies, we consider a

245: pairwise model $f(x)$ defined on a $5 \times 5$ grid, as seen in

246: Fig.~\ref{fig:graphs}(a).  A natural approach is to break the

247: model up into small subgraphs.  The simplest method is to break the

248: graph up into its composite interactions.  For pairwise models, this

249: means that we split the graph into a set of disjoint edges as shown in

250: (b). Here, each internal node of the graph is replicated four times.

251: To reduce the number of replicated nodes, and hence the number of

252: constraints, it is also useful to merge many of these smaller

253: subgraphs into larger thin graphs.  One approach is to group edges

254: into \emph{spanning trees} of the graph as seen in (c).  Here, each

255: edge must be included in at least one tree, and some edges are

256: replicated in multiple trees.  The TRMP approach is based on this

257: idea.  One could also allow multiple replicas of a node in the same

258: connected component of $\calG'$.  For instance, by taking a spanning

259: tree of the graph and then adding an extra leaf node for each missing

260: edge we obtain the graph seen in (d).

261:
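To make the trade-off concrete, the following counts are a rough sketch
based only on the grid itself; the exact layouts drawn in
Fig.~\ref{fig:graphs} are not assumed.  The $5 \times 5$ grid has $25$ nodes
and $2 \cdot 5 \cdot 4 = 40$ edges.  In the edge decomposition (b), every
edge becomes its own two-node component, so the augmented graph has
$2 \cdot 40 = 80$ nodes.  A node $v$ of degree $d_v$ then has $r_v = d_v$
replicas, and tying them together takes $r_v - 1$ equality constraints:
\begin{displaymath}
\sum_{v \in V} (r_v - 1) = 80 - 25 = 55.
\end{displaymath}
For the construction in (d), a spanning tree uses $24$ of the $40$ edges,
leaving $16$ missing edges.  Adding one leaf replica per missing edge yields
a single tree with $25 + 16 = 41$ nodes and $40$ edges, and only $16$
node-equality constraints.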

262: It is also tractable to use small subgraphs that are not trees.  We

263: can break the graph into a set of short loops as in (e) or a set of

264: induced subgraphs as in (f) where we select a set of $3\times3$

265: subgraphs that overlap on their boundary.  In such cases, including

266: additional edges in the overlap of these subgraphs, such as the dotted

267: edges in (f), can enhance the relaxations that we consider.  Finally,

268: we reduce the number of constraints in these formulations by again

269: grouping subgraphs to form larger subgraphs that are still thin, as

270: shown in (g).  This will also lead to tractability in our methods.

271: Again, it can be useful to include extra edges in the overlap of these

272: subgraphs as in (h), although this increases the width of the subgraph

273: and affects the computational complexity of our methods.

274:

275: \begin{figure}

276: \centering

277: (a)\epsfig{file=grid5x5.eps,scale=.85}

278: \hspace{.2cm}

279: (b)\epsfig{file=grid5x5_edges.eps,scale=0.55}\\

280: \vspace{.2cm}

281: (c)\epsfig{file=grid5x5_trees.eps,scale=0.7}

282: (d)\epsfig{file=grid5x5_comp_tree.eps,scale=0.7}\\

283: \vspace{.2cm}

284: (e)\epsfig{file=grid5x5_cells.eps,scale=0.75}

285: \hspace{.2cm}

286: (f)\epsfig{file=grid5x5_cells3x3.eps,scale=0.75}\\

287: \vspace{.2cm}

288: (g)\epsfig{file=grid5x5_layers.eps,scale=0.75}

289: \hspace{.2cm}

290: (h)\epsfig{file=grid5x5_layers2.eps,scale=0.75}

291: \caption{\label{fig:graphs} Illustrations of a variety of possible

292: ways to obtain a tractable graph structure from a $5 \times 5$ grid by

293: replicating some vertices of the graph.}

294: \vspace{-.4cm}

295: \end{figure}

296:

297: \emph{Notation.} Let $\calG^\prime$ denote the augmented graph (or

298: collection of subgraphs), which is based on an extended vertex set

299: $V'$, comprised of replicas of nodes in $V$. We assume that all edges

300: of this graph are also replicas of edges of the original graph

301: $\calG$.\footnote{In the case that we introduce extra edges in

302: $\calG'$, as in (f) and (h), we also add corresponding edges to

303: $\calG$ to maintain this convention.} Thus, there is a

304: well-defined surjective map $\Gamma: \calG' \rightarrow \calG$: each

305: edge $E' \in \calG'$ is a replica of an edge $E=\Gamma(E') \in

306: \calG$, and every edge of $\calG$ has at least one such replica.  This

307: notation is overloaded for nodes by treating them as singleton edges

308: of $\calG$.  We also denote the set-valued inverse of $\Gamma$ by

309: $\mathcal{R}(E) \triangleq \Gamma^{-1}(E)$, which is the set of

310: replicas of $E$, and let $r_E \triangleq |\calR(E)|$ denote the

311: number of replicas.  This defines an equivalence relation on $\calG'$:

312: $A,B \in \calG'$ are equivalent $A \equiv B$ if $\Gamma(A)=\Gamma(B)$,

313: that is, if $A,B \in \calR(E)$ are replicas of the same edge $E \in

314: \calG$.

315:
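As an illustration of this notation, consider the toy example of
Fig.~\ref{fig:toy_example}, assuming the chain is ordered $1,2,3,4$ (the
ordering is an assumption made here for concreteness).  The augmented vertex
set is $V' = \{1,2,3,4\}$ and
\begin{displaymath}
\Gamma(4) = \Gamma(1) = 1, \quad \calR(1) = \{1,4\}, \quad r_1 = 2, \quad
\calR(2) = \{2\}, \quad \calR(3) = \{3\}.
\end{displaymath}
The edge $\{1,3\}$ of the cycle is represented by the edge $\{3,4\}$ of the
chain, that is, $\Gamma(\{3,4\}) = \{1,3\}$, and every edge has a single
replica.  The only nontrivial equivalence is therefore $1 \equiv 4$.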

316: \subsection{Equivalent Constrained Estimation Problem}

317:

318: We now define a corresponding objective function $f'(x')$, where $x' =

319: (x'_v)_{v \in V'}$ are the variables of the augmented model. For each

320: hyperedge $E \in \calG$ (including individual nodes), we split the

321: function $f_E(x_E)$ among a set of replica functions $\{f'_{E'}, E'

322: \in \calR(E)\}$, requiring that these are \emph{consistent},

323: \begin{equation}

324: f_E(x_E) = \sum_{E' \in \calR(E)} f'_{E'}(x_E) \mbox{ for all } x_E.

325: \end{equation}

326: Using the parametric representation $f(x) = \langle \theta, \phi(x)

327: \rangle$, this consistency condition is equivalent to requiring

328: $\theta_E = \sum_{E' \in \calR(E)} \theta'_{E'}$. We will see that the LR

329: approach to follow may be viewed as an optimization over all such

330: possible consistent splittings.  Next, we define the augmented

331: objective function over the graph $\calG'$ as

332: \begin{equation}

333: f'(x') \triangleq \sum_{E \in \calG'} f'_E(x'_E).

334: \end{equation}

335: This ensures that $f(x) = f'(x')$ where $x' = \zeta(x)$ is the

336: replicated version of $x$, defined by $x'_{v'} = x_v$ for all $v' \in

337: \calR(v)$.  This equivalence holds for all \emph{consistent}

338: configurations $x^\prime \in \zeta(\mathbb{X})$, where $x'$ is

339: self-consistent over various replicas of the same node.  Thus, we are

340: led to an equivalent optimization problem in the augmented model

341: subject to consistency constraints:

342: \begin{equation}

343: \label{eq:a}

344: f^* \triangleq \max_{x \in \mathbb{X}} f(x) = \max_{x' \in

345: \zeta(\mathbb{X})} f'(x')

346: \end{equation}

347: Expressing the consistency constraint as a set of linear constraints

348: on the model features $\phi$, we obtain:

349: \begin{equation}

350: \begin{array}{ll}

351: \mbox{maximize} & f'(x')\\

352: \mbox{subject to} & \phi_A(x'_A) = \phi_B(x'_B) \mbox{ for all } A \equiv B.

353: \end{array}

354: \end{equation}

355: Recall that, in the discrete binary model, these features are defined as

356: $\phi_E(x_E) = \prod_{v \in E} x_v$. Clearly, there is some

357: redundancy in these constraints: $x_a = x_b$ for all replicated nodes

358: $a \equiv b$ would ensure that the edge-wise features also agree. However, these

359: redundant edge-wise feature constraints do enhance the

360: following relaxation.

361:

362: \subsection{Lagrangian Relaxation}

363:

364: We have now defined an equivalent model on a tractable graph.

365: However, the equivalent \emph{constrained} optimization is still

366: intractable, because the constraints couple some variables of

367: $\calG^\prime$, spoiling its tractable structure.  This suggests the

368: use of Lagrangian duality to relax those complicating constraints.

369: Introducing Lagrange multipliers $\lambda_{A,B}$ for each constraint,

370: we define the \emph{Lagrangian}, which is a modified version of the

371: objective function:

372: \begin{equation}\label{eq:Lagrangian}

373: L(x',\lambda) = f'(x') + \sum_{A \equiv B} \lambda_{A,B} \,

374: (\phi_A(x'_A)-\phi_B(x'_B))

375: \end{equation}

376: Grouping terms by edges $E \in \calG'$, and using $f'_E(x_E) = \theta'_E \phi_E(x_E)$, this can be written as

377: \begin{eqnarray}

378: L(x',\lambda) &=& \sum_{E \in \calG'} f'_E(x'_E;\lambda) \nonumber \\

381: f'_E(x'_E;\lambda) &=& \theta'_E(\lambda) \phi_E(x'_E) \nonumber \\

384: \theta'_E(\lambda) &=& \theta'_E + \sum_B \lambda_{E,B} - \sum_A \lambda_{A,E}

386: \end{eqnarray}

387:  Note that the Lagrange multipliers may be interpreted as

388: parameterizing all consistent splittings: $\theta'(\lambda)$ spans the

389: subspace of all consistent $\theta'$ parameters.\footnote{We obtain a

390: minimal $\lambda$ parameterization by only using a subset of

391: constraints in (\ref{eq:Lagrangian}), such that

392: $\{(\phi_A(x')-\phi_B(x'))\}$ are linearly independent.}

393:
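To see why the multipliers only move weight between replicas, sum the
modified parameters over the replicas of a fixed $E \in \calG$.  Each
multiplier $\lambda_{A,B}$ with $A,B \in \calR(E)$ enters once with a plus
sign (in the parameter of $A$) and once with a minus sign (in the parameter
of $B$), so the multipliers cancel in the sum:
\begin{displaymath}
\sum_{E' \in \calR(E)} \theta'_{E'}(\lambda) = \sum_{E' \in \calR(E)} \theta'_{E'} = \theta_E
\quad \mbox{for all } \lambda.
\end{displaymath}
In the toy example of Fig.~\ref{fig:toy_example}, with the single multiplier
$\lambda = \lambda_{1,4}$, this reads $\theta'_1(\lambda) = \theta'_1 +
\lambda$ and $\theta'_4(\lambda) = \theta'_4 - \lambda$, whose sum equals
$\theta_1$ for every $\lambda$.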

394: It is tractable to maximize the Lagrangian, as it is defined

395: over the thin graph $\calG^\prime$. The value of this maximization

396: defines the \emph{dual function}:

397: \begin{equation}

398: g(\lambda) = \max_{x'} L(x',\lambda)

399: \end{equation}

400: Note that this is an \emph{unconstrained} optimization over

401: $\mathbb{X}^\prime$, and its solution need not lead to a consistent

402: $x' \in \zeta(\mathbb{X})$.  However, if this $x'$ is consistent then

403: it is an optimal solution of the constrained optimization problem

404: (\ref{eq:a}), and hence $x = \zeta^{-1}(x')$ (which is well-defined

405: for consistent $x'$) is also an optimal solution of the original

406: problem.  This is the goal of our approach, to find tractable

407: relaxations of the MAP estimation problem which lead to the correct

408: MAP estimate.  This motivates solution of the \emph{dual problem}:

409: \begin{equation}\label{eq:dual_problem}

410: \min_\lambda g(\lambda) \triangleq g^*

411: \end{equation}

412: Appealing to well-known results

413: \cite{Bertsekas95,BertsimasTsitsiklis97}, we conclude:

414:

415: \begin{proposition}[Lagrangian duality] We have $g(\lambda) \ge f^*$ for all $\lambda$. Hence $g^* \ge f^*$.  If $g(\lambda^*)=g^*$, then one of the

416: following holds:

417: \begin{enumerate}

418: \item[(i)] There exists a consistent solution:

419: \begin{displaymath}

420: x' \in \arg\max_{x' \in \mathbb{X}'} L(x';\lambda^*) \cap \zeta(\mathbb{X}).

421: \end{displaymath}

422: Then, we have \emph{strong duality} $g^* = f^*$ and the set of \emph{all} MAP estimates is obtained as:

423: \begin{displaymath}

424: \arg\max_{x' \in \zeta(\mathbb{X})} f'(x') = \arg\max_{x' \in \mathbb{X}'} L(x',\lambda^*) \cap \zeta(\mathbb{X}).

425: \end{displaymath}

426: \item[(ii)] There are no consistent solutions:

427: \begin{displaymath}

428: \arg\max_{x' \in \mathbb{X}'} L(x';\lambda^*) \cap \zeta(\mathbb{X}) = \emptyset.

429: \end{displaymath}

430: Then, there is a \emph{duality gap} $g^* > f^*$ and \emph{no} choice

431: of $\lambda$ will provide a consistent solution.

432: \end{enumerate}

433: Also, condition (i) holds only if $g(\lambda^*)=g^*$.

434: \end{proposition}

435:

436: \begin{figure}

437: \centering

438: (a)\input{dual_gap.pstex_t}

439: (b)\input{dual_func.pstex_t}

440: \caption{\label{fig:LR}Illustration of the Lagrangian duality in

441: the cases that (a) there is a duality gap and (b) there is no duality

442: gap (strong duality holds).}

443: \vspace{-.4cm}

444: \end{figure}

445:

446: This result generalizes the analogous \emph{strong tree-agreement}

447: optimality condition for TRMP, and clarifies its connection to

448: standard Lagrangian duality results for integer programs.  To provide

449: some intuition, we present the following geometric interpretation

450: illustrated in Fig. \ref{fig:LR}.  The dual function is the maximum

451: over a finite set of linear functions in $\lambda$ indexed by $x'$.

452: For each $x' \in \mathbb{X}'$, there is a linear function

453: $g(\lambda;x') = \langle a(x'), \lambda\rangle + b(x')$, with $a(x') =

454: (\phi_A(x')-\phi_B(x'))_{A \equiv B}$, which is the gradient, and

455: $b(x') = f'(x')$.  The graph of each of these functions defines a

456: hyperplane in $\mathbb{R}^{d+1}$, where $d$ is the number of

457: constraints.  The flat hyperplanes, with $a = 0$, correspond to

458: consistent assignments $x' \in \zeta(\mathbb{X})$.  The remaining

459: sloped hyperplanes represent inconsistent assignments.  Hence, the

460: highest flat hyperplane corresponds to the optimal MAP estimate, with

461: height equal to $f^*$.  The dual function $g(\lambda)$ is defined by

462: the maximum height over this set of hyperplanes for each $\lambda$,

463: and is therefore convex, piece-wise linear and greater than or equal

464: to $f^*$ for all $\lambda$.  In the case of a duality gap, the

465: inconsistent hyperplanes hide the consistent ones, as depicted in (a),

466: so that the minimum of the dual function is defined by an intersection

467: of slanted hyperplanes corresponding to inconsistent assignments of

468: $x'$.  If there is no duality gap, as depicted in (b), then the

469: minimum is defined by the flat hyperplane corresponding to a

470: consistent assignment. Its intersection with slanted hyperplanes

471: defines the polytope of optimal Lagrange multipliers over which the

472: maximum flat hyperplane is exposed.

473:
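A minimal worked instance of this picture, using an illustrative model that
is not the one drawn in Fig.~\ref{fig:LR}, is a single binary variable $x
\in \{0,1\}$ with $f(x) = \theta x$, split into two replicas $x'_1, x'_2$
with $\theta'_1 + \theta'_2 = \theta$ and the single constraint $x'_1 =
x'_2$ (so $d = 1$).  Then $L(x',\lambda) = (\theta'_1 + \lambda) x'_1 +
(\theta'_2 - \lambda) x'_2$, and the four configurations give two flat and
two sloped lines:
\begin{displaymath}
\begin{array}{llll}
x' = (0,0): & 0, & \quad x' = (1,1): & \theta, \\
x' = (1,0): & \theta'_1 + \lambda, & \quad x' = (0,1): & \theta'_2 - \lambda.
\end{array}
\end{displaymath}
Hence $g(\lambda) = \max\{0, \, \theta, \, \theta'_1 + \lambda, \, \theta'_2
- \lambda\}$.  The two sloped lines cross at height $\theta/2 \le \max\{0,
\theta\}$, so the higher flat line is always exposed and $g^* = \max\{0,
\theta\} = f^*$, attained for all $\lambda$ in the interval $\theta'_2 - f^*
\le \lambda \le f^* - \theta'_1$.  For instance, with $\theta = 1$ and
$\theta'_1 = \theta'_2 = \tfrac{1}{2}$ we get $g(\lambda) = \max\{1, \,
\tfrac{1}{2} + |\lambda|\}$, which equals $g^* = 1$ exactly when $|\lambda|
\le \tfrac{1}{2}$.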

474: \subsection{Linear Programming Formulations}

475:

476: We briefly consider a connection between this LR

477: picture and TRMP \cite{Wainwright*nov05,KolmogorovWainwright05} and

478: related linear programming approaches

479: \cite{Feldman*05,Yanover*06,Werner07}. This analysis also serves to

480: understand when different relaxations of the MAP estimation

481: problem will be equivalent.

482:

483: The \emph{epigraph} of the dual function is defined as the set of all

484: points $(\lambda,h) \in \mathbb{R}^{d+1}$ where $g(\lambda) \le h$, that

485: is, where $\langle a(x'), \lambda \rangle + b(x') \le h$ for all $x'$.  Thus, the

486: minimum of the dual function is equal to the lowest point of the

487: epigraph, which defines a linear program (LP) over $(\lambda,h) \in

488: \mathbb{R}^{d+1}$:

489: \begin{equation}

490: \begin{array}{ll}

491: \mbox{minimize} & h \\

492: \mbox{subject to} & \langle a(x'), \lambda \rangle + b(x') \le h \mbox{ for all } x'.

493: \end{array}

494: \end{equation}

495: Note that there are exponentially many constraints in this

496: formulation, so it is intractable.  However, recalling that it

497: \emph{is} tractable to compute the dual function for a given

498: $\lambda$, using the max-product algorithm applied to the thin graph

499: $\calG'$, we seek a more tractable representation of this LP.  To

500: achieve this, we consider the LP dual problem obtained by dualizing

501: the constraints, which is always tight \cite{BertsimasTsitsiklis97}.

502: This LP dual should be distinguished from our Lagrangian dual

503: (\ref{eq:dual_problem}), which is the subject of this paper.

504:

505: Introducing non-negative Lagrange multipliers $\mu(x') \ge 0$ for each

506: inequality constraint, indexed by $x' \in \mathbb{X}'$, we obtain the

507: LP Lagrangian:

508: \begin{eqnarray}

509: M(h,\lambda;\mu) &=& h + \mu\left[ \langle a(x'), \lambda \rangle + b(x') - h \right] \nonumber \\

510:  &=& \langle \mu[a], \lambda \rangle + \mu[b] + (1-\mu[1]) h,

511: \end{eqnarray}

512: where $\mu[\cdot]$ denotes $\mu$-weighted summation, e.g.,

513: $\mu[a] = \sum_{x'} \mu(x') a(x')$.  The LP dual function is then:

514: \begin{equation}

515: M^*(\mu) \triangleq \min_{h,\lambda} M(h,\lambda;\mu) =

516: \left\{

517: \begin{array}{ll}

518: \mu[b], & \mu[1]=1 \mbox{ and } \mu[a]=0\\

519: -\infty, & \mbox{otherwise.}

520: \end{array}

521: \right.

522: \end{equation}
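A brief sketch of the reasoning behind this case distinction: $M$ is linear
in the unconstrained variables $(h,\lambda)$, so each linear term is either
identically zero or unbounded below,
\begin{displaymath}
\min_{h} \, (1-\mu[1]) \, h =
\left\{
\begin{array}{ll}
0, & \mu[1] = 1 \\
-\infty, & \mbox{otherwise,}
\end{array}
\right.
\qquad
\min_{\lambda} \, \langle \mu[a], \lambda \rangle =
\left\{
\begin{array}{ll}
0, & \mu[a] = 0 \\
-\infty, & \mbox{otherwise.}
\end{array}
\right.
\end{displaymath}
Only when both coefficients vanish does the minimum stay finite, leaving the
constant term $\mu[b]$.
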

523: Note that $\mu \ge 0$ and $\mu[1]=1$ imply that $\mu$ is a probability

524: distribution and $\mu[\cdot]$ an expectation operator.  Recalling

525: $a(x') \triangleq (\phi_A(x')-\phi_B(x'), A \equiv B)$ and $b(x')

526: \triangleq f'(x')$, we obtain the dual LP:

527: \begin{equation}

528: \label{eq:b}

529: \max_{\mu \ge 0} M^*(\mu) =

530: \left\{

531: \begin{array}{ll}

532: \mbox{maximize} & \mu[f'] \\

533: \mbox{subject to} & \mu[\phi_A] = \mu[\phi_B] \mbox{ for } A \equiv B

534: \end{array}

535: \right.

536: \end{equation}

537: We seek a probability distribution over all configurations of the

538: augmented model that maximizes the expected value of $f'(x')$ subject

539: to constraints that the moments specifying marginal distributions

540: are consistent for replicated nodes and edges of the graph. This is a

541: convex relaxation of the constrained version of problem (4), where the

542: objective and constraint functions have been replaced by their

543: expected values under $\mu$.  Note that only marginals $\mu_{E'}$ over

544: hyperedges $E' \in \calG'$ are needed to evaluate both the objective

545: and the constraints of this LP.  Hence, it reduces to one defined over

546: the \emph{marginal polytope} $\calM(\calG')$ \cite{Wainwright*nov05},

547: defined as the set of all \emph{realizable} collections of marginals

548: over the hyperedges of $\calG'$. Moreover, if the graph $\calG'$ is

549: \emph{chordal} \cite{Cowell*99}, then its marginal polytope has a

550: simple characterization. Let $\calM_{\mathrm{local}}(\calG')$ denote

551: the \emph{local marginal polytope} defined as the set of all edge-wise

552: marginal specifications that are consistent on intersections of edges.

553: In general, $\calM(\calG') \subset \calM_{\mathrm{local}}(\calG')$.

554: However, in chordal graphs it holds that $\calM(\calG') =

555: \calM_{\mathrm{local}}(\calG')$.  Thus, if $\calG'$ is a thin chordal

556: graph, we obtain a tractable LP whose value is equivalent to $g^*$ in

557: our framework.\footnote{Some graphs shown in Fig. \ref{fig:graphs} are

558: not chordal, but they can be extended to a thin chordal graph by

559: adding a few edges.  If no two of these new edges are equivalent when

560: mapped into $\calG$, then this does not change $g^*$.}

561:

562: One last step shows the connection to LP approaches

563: \cite{Wainwright*nov05,Feldman*05,Yanover*06}.  The key observation is

564: that, roughly speaking,

565: \begin{equation}

566: \calM_{\mathrm{local}}(\calG') \cap \{\mu | \mu(x_A) = \mu(x_B), A \equiv B\} \equiv \calM_{\mathrm{local}}(\calG).

567: \end{equation}

568: This is seen by replicating marginals from $\calG$ to $\calG'$, or by

569: copying (consistent) replicated marginals back to $\calG$.

570: For such consistent $\mu$, we have $\mu[f'] = \mu[f]$, which gives:

571: \begin{equation}

572: g^* = \max_{\mu \in \calM_{\mathrm{local}}(\calG)} \mu[f] \ge \max_{\mu \in \calM(\calG)} \mu[f] = f^*.

573: \end{equation}

574: The maximum over $\calM_{\mathrm{local}}(\calG)$ gives an upper-bound on the maximum over $\calM(\calG) \subset \calM_{\mathrm{local}}(\calG)$. The latter is equivalent to exact MAP estimation and the bound becomes tight if $\calG$ is the set of maximal cliques of a chordal graph.  This discussion leads to the following characterization of LR:

575:

576: \begin{proposition}[LR Hierarchy]

577: \emph{Equivalence:} Let $\calG'_1$ and $\calG'_2$ be the sets of

578: maximal cliques of two chordal augmented graphs. If

579: $\Gamma^{-1}(\calG'_1)=\Gamma^{-1}(\calG'_2)$ then $g_1^*=g_2^*$ for the

580: respective dual problems.  Let $g^*(\calG)$ denote the common dual

581: value of all such chordal relaxations where

582: $\Gamma^{-1}(\calG')=\calG$. \emph{Monotonicity:} If $\calG_1 \subset

583: \calG_2$ then $g^*(\calG_1) \ge g^*(\calG_2)$.  \emph{Strong Duality:}

584: If $\calG$ is the set of maximal cliques of a chordal graph, then

585: $g^*(\calG)=f^*$.

586: \end{proposition}

587:

588: \subsection{Smooth Relaxation of the Dual Problem}

589:

590: \begin{figure}

591: \centering

592: \input{smooth_dual.pstex_t}

593: \caption{\label{fig:smoothLR} Illustration of the ``log-sum-exp''

594: smooth approximation of the dual function, as a function of

595: ``temperature'' $\tau$, and of an optimization procedure for

596: minimizing the non-smooth dual function through a sequence of smooth

597: minimizations.}

598: \vspace{-.4cm}

599: \end{figure}

600:

601: In this section, we develop an approach to solve the dual problem.

602: One approach to minimize $g(\lambda)$ is to use non-smooth

603: optimization methods, such as the subgradient method

604: \cite{BertsimasTsitsiklis97}.  Here, we consider an

605: alternative, based on the following smooth approximation of

606: $g(\lambda)$:

607: \begin{equation}

608: g(\lambda; \tau) \triangleq \tau \log \sum_{x' \in \mathbb{X}} \exp\left( \frac{L(x';\lambda)}{\tau}\right)

609: \end{equation}

610: As illustrated in Fig. \ref{fig:smoothLR}, the parameter $\tau > 0$

611: controls the trade-off between smoothness of $g(\lambda;\tau)$ and how

612: well it approximates $g(\lambda)$.  This is known as the

613: ``log-sum-exp'' approximation to the ``max function''

614: \cite{BoydVandenberghe04}:

615: \begin{equation}

616: g(\lambda) \le g(\lambda;\tau) \le g(\lambda) + \tau \log |\mathbb{X}| \mbox{ for all } \tau > 0.

617: \end{equation}

618: Thus, $g(\lambda;\tau) \rightarrow g(\lambda)$ \emph{uniformly} as $\tau

619: \rightarrow 0$ and, hence, $g^*(\tau) \triangleq \min_\lambda g(\lambda;\tau)$

620: converges to $g^*$.
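
These bounds follow from an elementary argument, which we sketch for completeness. We assume here, as the ``max function'' interpretation suggests, that $g(\lambda) = \max_{x' \in \mathbb{X}} L(x';\lambda)$. Every term of the sum is then at most $\exp(g(\lambda)/\tau)$, and at least one term attains this value:
\begin{displaymath}
\exp\left(\frac{g(\lambda)}{\tau}\right) \le \sum_{x' \in \mathbb{X}} \exp\left(\frac{L(x';\lambda)}{\tau}\right) \le |\mathbb{X}| \exp\left(\frac{g(\lambda)}{\tau}\right).
\end{displaymath}
Taking logarithms and multiplying by $\tau > 0$ gives the stated inequalities. Because the gap $\tau \log |\mathbb{X}|$ does not depend on $\lambda$, minimizing each side over $\lambda$ yields $g^* \le g^*(\tau) \le g^* + \tau \log |\mathbb{X}|$.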

621:

622: The function $g(\lambda;\tau)$ has another useful interpretation.  Consider

623: the Gibbs distribution defined by

624: \begin{equation}

625: p_{\lambda,\tau}(x') = \exp\left( \frac{L(x';\lambda) - g(\lambda;\tau)}{\tau} \right)

626: \end{equation}

627: Here, $\tau > 0$ is the ``temperature'' and $g(\lambda;\tau)$

628: normalizes the distribution for each choice of $\lambda$ and $\tau$,

629: and is equal to the Helmholtz free energy $\mathcal{F}_H(\theta') =

630: \tau \Phi_\tau(\theta')$, where $\Phi_\tau(\theta') = \log \sum_{x'}

631: \exp(\tau^{-1} \langle \theta', \phi'(x')\rangle)$ is the usual

632: log-partition function.  Thus, $g(\lambda; \tau)$ is a strictly

633: convex, analytic function. Using the moment-generating property of

634: $\Phi_\tau(\theta')$, the gradient of $g(\lambda;\tau)$ is computed

635: as:

636: \begin{eqnarray}

637: \frac{\partial g(\lambda;\tau)}{\partial \lambda_{A,B}} &=&

638:     \frac{\partial\Phi_\tau}{\partial\theta'_A} \frac{\partial\theta'_A}{\partial\lambda_{A,B}}

639:   + \frac{\partial\Phi_\tau}{\partial\theta'_B} \frac{\partial\theta'_B}{\partial\lambda_{A,B}} \nonumber \\

640:  &=& p_{\lambda,\tau}[\phi_A] - p_{\lambda,\tau}[\phi_B]

641: \end{eqnarray}

642: where we use $p[\cdot]$ to denote expectation under $p$. Thus,

643: appealing to strict convexity, there is a unique $\lambda^*(\tau)$

644: that minimizes $g(\lambda;\tau)$ and it is also the unique solution of

645: the set of moment-matching conditions:

646: \begin{displaymath}

647: p_{\lambda,\tau}[\phi_A] = p_{\lambda,\tau}[\phi_B], \mbox{ for all } A \equiv B.

648: \end{displaymath}

649: These moment-matching conditions are equivalent to requiring that

650: the marginal distributions $p_{\lambda,\tau}(x_A)$ and

651: $p_{\lambda,\tau}(x_B)$ are equal for $x_A = x_B$.  We also

652: note that $\frac{\partial g(\lambda;\tau)}{\partial \tau} =

653: p_{\lambda,\tau}[-\log p_{\lambda,\tau}]$, which is the \emph{entropy}

654: of $p_{\lambda,\tau}$ and is positive for all $\lambda$.  Hence, for a

655: decreasing sequence $\tau_k > 0$ converging to zero, $g(\lambda;\tau_k)$

656: converges \emph{monotonically} to $g(\lambda)$.  Likewise, $g^*(\tau_k)$ converges

657: \emph{monotonically} to $g^*$.
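
The entropy identity can be checked directly; the following is a short verification in which $Z$ is our shorthand for the partition sum. Let $Z = \sum_{x'} \exp(L(x';\lambda)/\tau)$, so that $g(\lambda;\tau) = \tau \log Z$. Then
\begin{eqnarray*}
\frac{\partial g(\lambda;\tau)}{\partial \tau} &=& \log Z + \frac{\tau}{Z} \sum_{x'} \exp\left(\frac{L(x';\lambda)}{\tau}\right) \left(-\frac{L(x';\lambda)}{\tau^2}\right) \\
 &=& \frac{g(\lambda;\tau) - p_{\lambda,\tau}[L]}{\tau}.
\end{eqnarray*}
On the other hand, $-\log p_{\lambda,\tau}(x') = (g(\lambda;\tau) - L(x';\lambda))/\tau$, and taking the expectation under $p_{\lambda,\tau}$ gives the same expression.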

658:

659: Rather than directly optimizing $g(\lambda)$, we instead perform a

660: sequence of minimizations with respect to the functions

661: $g(\lambda;\tau_k)$.  At each step, the previous estimate of

662: $\lambda_k^* = \arg\min g(\lambda;\tau_k)$ is used to initialize an

663: iterative method to minimize $g(\lambda;\tau_{k+1})$.  This is

664: illustrated in Fig. \ref{fig:smoothLR}. At each step, we use the

665: following optimization procedure based on the marginal agreement

666: condition.

667:

668: \subsubsection{Iterative Log-Marginal Averaging}

669:

670: \begin{figure}

671: \vspace{.3cm}

672: \hrule

673: \begin{tabbing}

674: {\bf ALGORITHM 1 (Discrete LR)}\\

675: It\=erate until convergence:\\

676: Fo\=r $E \in \calG \mbox{ where } r_E>1$\\

677: \>Fo\=r $E' \in \calR(E)$\\

678: \>\>$\hat{f}_{\tau,E'}(x'_{E'}) = \tau \log p_{\lambda,\tau}(x'_{E'})$\\

679: \>end\\

680: \>$\bar{f}_{\tau,E}(x_E) = r_E^{-1} \sum_{E'} \hat{f}_{\tau,E'}(x_E)$\\

681: \>For $E' \in \calR(E)$\\

682: \>\>$f'_{E'}(x_E) \leftarrow f'_{E'}(x_E) + \left( \bar{f}_{\tau,E}(x_E) - \hat{f}_{\tau,E'}(x_E) \right)$\\

683: \>end\\

684: end

685: \end{tabbing}

686: \vspace{-.2cm}

687: \hrule

688: \vspace{-.4cm}

689: \end{figure}

690:

691: To minimize $g(\lambda;\tau)$ for a specified $\tau$, starting from an

692: initial guess for $\lambda$ (or, equivalently, an initial splitting of

693: $f$), we develop a block coordinate-descent method.  Our approach is

694: in the same spirit as the iterative proportional fitting procedure

695: \cite{Ruschendorf95}.

696:

697: We begin with the case that the augmented model is defined so that no

698: two replicas of a node are contained in the same connected component

699: of $\calG'$.  Then, at each step, we minimize over the set of all

700: Lagrange multipliers associated with features defined within any

701: replica of $E$.  This is equivalent to solving the condition that the

702: corresponding marginal distributions $p_{\lambda,\tau}(x'_{E'})$ are

703: consistent for all $E' \in \calR(E)$.  Algorithm 1 summarizes the

704: method, which involves computing the log-marginal of each replica

705: edge, and then updates the functions $f'_{E'}$ according to the rule:

706: \begin{equation}

707: f'_{E'}(x_E) \leftarrow f'_{E'}(x_E) + (\bar{f}_{\tau,E}(x_E) - \hat{f}_{\tau,E'}(x_E))

708: \end{equation}

709: where

710: \begin{displaymath}

711: \hat{f}_{\tau,E'}(x'_{E'}) = \tau \log p_{\lambda,\tau}(x'_{E'}), \;\; \bar{f}_{\tau,E}(x_E) = r_E^{-1} \sum_{E' \in \calR(E)} \hat{f}_{\tau,E'}(x_E).

712: \end{displaymath}

713: After the update, the new log-marginals of all replicas $E'$ are equal

714: to $\bar{f}_{\tau,E}$.  Also, these updates maintain a consistent

715: representation: $\sum_{E'} (\bar{f}_{\tau,E} - \hat{f}_{\tau,E'}) =

716: 0$.  To handle augmented models with multiple replicas of $E$ in the

717: same connected subgraph, we only update a \emph{subset} of replicas at

718: each step, where no two replicas are in the same subgraph. In some

719: cases, this requires including an extra replica of $E$ to act as an

720: intermediary in the update step.
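
A short calculation, given as a sketch under the stated assumption that each replica of $E$ lies in its own connected component, indicates why the update equalizes the marginals. Adding $\delta_{E'}(x_E) = \bar{f}_{\tau,E}(x_E) - \hat{f}_{\tau,E'}(x_E)$ (our notation for the correction term) to $f'_{E'}$ rescales the marginal of replica $E'$ by $\exp(\delta_{E'}/\tau)$ and leaves the marginals in the other components unchanged, so the updated marginal satisfies
\begin{displaymath}
p^{\mathrm{new}}(x'_{E'}) \propto \exp\left(\frac{\hat{f}_{\tau,E'}(x_E)}{\tau}\right) \exp\left(\frac{\bar{f}_{\tau,E}(x_E) - \hat{f}_{\tau,E'}(x_E)}{\tau}\right) = \exp\left(\frac{\bar{f}_{\tau,E}(x_E)}{\tau}\right).
\end{displaymath}
The right-hand side does not depend on the choice of replica, so all replicas of $E$ share the same marginal after the update. Their common log-marginal equals $\bar{f}_{\tau,E}$ up to the additive normalization constant $-\tau \log \sum_{x_E} \exp(\bar{f}_{\tau,E}(x_E)/\tau)$.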

721:

722: Each step of the procedure requires that we compute the marginal

723: distributions of each replica $E'$ in their respective subgraphs. If

724: the graphs are thin, these marginals can be computed efficiently, with

725: computation linear in the size of each subgraph, using standard belief

726: propagation methods and their junction tree variants.  Moreover, if

727: we take some care to store the messages computed by belief

728: propagation, it is possible to amortize the cost of this inference, by

729: only updating a few ``messages'' at each step.  In fact, it is only

730: necessary to update those messages along the directed path from the

731: last updated node or edge to the location in the tree (or junction

732: tree) of the node or edge currently being updated.  We find that this

733: generally allows a complete set of updates to be computed with

734: complexity linear in $n$.  Similar ideas are discussed in

735: \cite{Kolmogorov05}.

736:

737: Using Algorithm 1, together with a rule to gradually reduce $\tau$, we

738: obtain a simple algorithm which generates a sequence $\lambda_k$ such

739: that $g(\lambda_k)$ converges to $g^*$ and $\lambda_k$ converges to a

740: point in the set of optimal Lagrange multipliers.

741:

742: \subsubsection{Iterative Max-Marginal Averaging}

743:

744: We now consider what happens as $\tau$ approaches zero.  The main

745: insight is that the (non-normalized) log-marginals converge to

746: \emph{max-marginals} in the limit as $\tau$ approaches zero:

747: \begin{equation}

748: \hat{f}_{\tau,E'}(x'_{E'}) +

749: g(\lambda;\tau) \rightarrow \hat{f}_{E'}(x'_{E'})

750: \triangleq \max_{x'_{\setminus E'}} f'(x'_{E'},x'_{\setminus E'};\lambda)

751: \end{equation}

752: Hence, as $\tau$ becomes small, the marginal agreement conditions are

753: similar to a set of \emph{max-marginal agreement} conditions among all

754: replicas of an edge or node.  One could consider a ``zero-temperature''

755: version of Algorithm 1 aimed at solving these max-marginal

756: conditions directly:

757: \begin{eqnarray}\label{eq:max_marg_match}

758: f'_{E'}(x_E) &\leftarrow& f'_{E'}(x_E) + \left(\bar{f}_E(x_E) -

759: \hat{f}_{E'}(x_E) \right) \nonumber\\ \bar{f}_E(x_E) &=& r_E^{-1}

760: \sum_{E'} \hat{f}_{E'}(x_E)

761: \end{eqnarray}

762: Here, $\bar{f}_E$ is the averaged max-marginal over all replicas of

763: $E$.  Note that $\hat{f}_{E'}(x_E) \ge \hat{f}_E(x_E) \triangleq

764: \max_{x_{\setminus E}} f(x)$ for all $x_E$ and $E' \in

765: \mathcal{R}(E)$, which implies $\bar{f}_E(x_E) \ge \hat{f}_E(x_E)$.

766: This ``zero-temperature'' approach has close ties to max-sum diffusion

767: (see \cite{Werner07} and references therein) and Kolmogorov's serial

768: approach to TRMP \cite{Kolmogorov05}.
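
The convergence of log-marginals to max-marginals can be justified by the same log-sum-exp bound, applied to the sum over $x'_{\setminus E'}$. Assuming that the Lagrangian is $L(x';\lambda) = f'(x';\lambda)$ (our reading of the notation), the non-normalized log-marginal is
\begin{displaymath}
\hat{f}_{\tau,E'}(x'_{E'}) + g(\lambda;\tau) = \tau \log \sum_{x'_{\setminus E'}} \exp\left(\frac{f'(x'_{E'},x'_{\setminus E'};\lambda)}{\tau}\right),
\end{displaymath}
which lies between $\hat{f}_{E'}(x'_{E'})$ and $\hat{f}_{E'}(x'_{E'}) + \tau \log N_{E'}$, where $N_{E'}$ (our notation) is the number of configurations of $x'_{\setminus E'}$. The gap vanishes as $\tau \rightarrow 0$.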

769:

770: In our framework, one can show that $\lambda^* \triangleq \lim_{\tau

771: \rightarrow 0} \lambda^*(\tau)$ is well-defined and minimizes

772: $g(\lambda)$.  This point $\lambda^*$ also satisfies the max-marginal

773: agreement condition and is therefore a fixed point of max-marginal

774: averaging.  However, the max-marginal agreement condition by itself

775: does not uniquely determine $\lambda^*$ and, in fact, is not

776: sufficient to ensure that $g(\lambda)$ is minimized (this is related

777: to the existence of non-minimal fixed-points observed by Kolmogorov).

778: Hence, our approach to minimize $g(\lambda;\tau)$ while gradually

779: reducing the temperature has the advantage that it cannot get stuck in

780: such spurious fixed-points.  It also helps to accelerate convergence,

781: because the initial optimization at higher temperatures serves to

782: smooth over irregularities of the dual function.

783:

784: \begin{figure*}

785: \centering

786: \epsfig{file=attractive3_lr_val.eps,scale=0.17}

787: \epsfig{file=attractive1_lr_val.eps,scale=0.17}

788: \epsfig{file=frustrated1_lr_val.eps,scale=0.17}

789: \epsfig{file=frustrated4_lr_val.eps,scale=0.17}

790: \epsfig{file=frustrated2_lr_val.eps,scale=0.17}\\

791: \comment{\epsfig{file=attractive3_max_marg_err.eps,scale=0.17}

792: \epsfig{file=attractive1_max_marg_err.eps,scale=0.17}

793: \epsfig{file=frustrated1_max_marg_err.eps,scale=0.17}

794: \epsfig{file=frustrated4_max_marg_err.eps,scale=0.17}

795: \epsfig{file=frustrated2_max_marg_err.eps,scale=0.17}\\}

796: \epsfig{file=attractive3_lr_est.eps,scale=0.18}\hspace{.1cm}

797: \epsfig{file=attractive1_lr_est.eps,scale=0.18}\hspace{.1cm}

798: \epsfig{file=frustrated1_lr_est.eps,scale=0.18}\hspace{.1cm}

799: \epsfig{file=frustrated4_lr_est.eps,scale=0.18}\hspace{.1cm}

800: \epsfig{file=frustrated2_lr_est.eps,scale=0.18}

801: \caption{\label{fig:discreteLR} Five examples for discrete LR showing:

802: (top row) convergence of $g(\lambda)$ to $g^*$ compared to $f^*$

803: (horizontal line); (bottom row) the resulting estimates generated by

804: relaxed max-marginals (grey areas denote non-unique maximum).  The

805: first two columns are examples of attractive models with $\sigma=2

806: \mbox{ and } 1$.  The last three columns are frustrated models with

807: $\sigma= 1.5, 1, \mbox{ and } .7$.}

808: \vspace{-.4cm}

809: \end{figure*}

810:

811: \emph{Computational Examples.} In this section we provide some

812: preliminary results using our approach to solve binary MRFs. These

813: examples are for a binary model $x_v \in \{-1,+1\}$ defined on a $10

814: \times 10$ grid similar to the one seen in Fig. \ref{fig:graphs}(a).

815: For each node, we include a node potential $f_v(x_v) = \theta_v x_v$

816: with $\theta_v \sim N(0,\sigma^2)$.  For each edge, we include an edge

817: potential $f_{u,v}(x_u,x_v) = \theta_{uv} x_u x_v$ with $\theta_{uv} =

818: 1$ in the ``attractive'' model and random $\theta_{uv} = \pm 1$ in the

819: ``frustrated'' model.  Hence, $\sigma$ controls the strength of node

820: potentials relative to edge potentials.  As seen in

821: Fig. \ref{fig:discreteLR}, we obtain strong duality $g^*=f^*$ and

822: recover the correct MAP estimates in attractive models. This is

823: consistent with a result on optimality of TRMP in attractive models

824: \cite{KolmogorovWainwright05}. In the frustrated model, the same holds

825: with strong node potentials, but as $\sigma$ is decreased the

826: frustration of the edge potentials causes a duality gap. However, even

827: in these cases, we have observed that some nodes have a unique maximum

828: in their re-summed max-marginals, and these nodes provide a partial MAP

829: estimate that agrees with the correct global MAP estimate.  This is

830: apparently related to the \emph{weak tree agreement} condition for

831: partial optimality in TRMP \cite{KolmogorovWainwright05}.

832:

833: \section{Gaussian Lagrangian Relaxation}

834:

835: In this section we apply the LR approach to the

836: problem of MAP estimation in Gaussian graphical models, which

837: is equivalent to maximizing a quadratic objective function

838: \begin{equation}

839: f(x;h,J) = -\frac{1}{2} x^T J x + h^T x,

840: \end{equation}

841: where $J \succ 0$ is sparse with respect to $\calG$.  Again, we

842: construct an augmented model, which is now specified by an information

843: form $(h',J')$, defined by a larger graph $\calG'$.  For consistency,

844: we also require $f'(\zeta(x);h',J')=f(x;h,J)$ for all $x$.  Denoting

845: variable replication by $\zeta(x) = A x$, this is equivalent to $A^T

846: J' A = J$ and $A^T h' = h$. In order for the dual function to be

847: well-defined, we also require that $J' \succ 0$.  For general $J \succ

848: 0$, it is possible that, for a given augmented graph $\calG'$, there

849: does not exist any $J' \succ 0$ defined on $\calG'$ such that $A^T J' A

850: = J$.  To avoid this issue, we will focus on models that are of the

851: form:

852: \begin{equation}\label{eq:e}

853: f(x) = \sum_{E \in \calF} f_E(x_E)

854: \end{equation}

855: where $\calF$ is a hyper-graph, composed of cliques of $\calG$, and

856: each term $f_E(x_E)$ is itself a quadratic form $f_E(x_E) =

857: -\frac{1}{2} x_E^T J_E x_E + h_E^T x_E$ based on $J_E \succ 0$.  Then,

858: $J = \sum_E [J_E]_V$ is the sum of these (zero-padded)

859: submatrices. Then, it is simple to obtain a valid augmented model.  We

860: split each $J_E$ between its replicas as $J_{E'} = r_E^{-1} J_E$ to

861: obtain $J' = \sum_{E' \in \calF'} [J_{E'}]_{V'} \succ 0$.
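
As a quick consistency check of this splitting (a sketch, assuming that $r_E = |\calR(E)|$ and that $[Ax]_{E'} = x_E$ for every replica $E' \in \calR(E)$), we have for all $x$:
\begin{eqnarray*}
x^T A^T J' A x &=& \sum_{E \in \calF} \sum_{E' \in \calR(E)} x_E^T J_{E'} x_E \\
 &=& \sum_{E \in \calF} r_E \cdot r_E^{-1} \, x_E^T J_E x_E \;=\; x^T J x.
\end{eqnarray*}
Since both sides are quadratic forms of symmetric matrices, this gives $A^T J' A = J$. Positive definiteness of $J'$ follows because $x'^T J' x' = \sum_{E'} x'^T_{E'} J_{E'} x'_{E'}$ is a sum of non-negative terms that vanishes only if every $x'_{E'} = 0$, provided each variable of the augmented model belongs to at least one $E' \in \calF'$.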

862:

863: If there exists a representation of $J$ in terms of $2 \times 2$

864: \emph{pairwise} interactions $J_E \succ 0$, it is said to be

865: \emph{pairwise normalizable}.  This condition is equivalent to the

866: walk-summability condition considered in \cite{Malioutov*06}, which is

867: related to the convergence (and correctness) of a variety of

868: approximate inference methods \cite{Malioutov*06,Chandrasekaran*07}.

869: Here, we show that for the more general class of models of the form

870: (\ref{eq:e}), we obtain a convergent iterative method for solving the

871: dual problem that is tractable provided the cliques are not too large.

872: Moreover, for this class of Gaussian models, we show that there is

873: \emph{no duality gap} and we always converge to the unique MAP

874: estimate of the model.  As an additional bonus, we also find that, by

875: solving marginal agreement conditions in the augmented Gaussian model,

876: we obtain a set of upper-bounds on the variances of each variable,

877: although these bounds are often rather loose.

878:

879: \subsection{Gaussian LR with Linear Constraints}

880:

881: We begin by considering the Lagrangian dual of the

882: following linearly-constrained quadratic program:

883: \begin{equation}

884: \begin{array}{ll}

885: \mbox{maximize} & -\tfrac{1}{2} x'^T J' x' + h'^T x'\\

886: \mbox{subject to} & x'_a = x'_b \mbox{ for all } a \equiv b.

887: \end{array}

888: \end{equation}

889: We may express the linear constraints on $x'$ as $H x' =

890: 0$. Relaxing these constraints leads to the following dual function:

891: \begin{eqnarray}

892: g(\lambda) &=& \max_{x'} \{ -\tfrac{1}{2} x'^TJ'x'+(h'+H^T\lambda)^T x' \} \nonumber \\

893:            &=& \tfrac{1}{2} (h'+H^T \lambda)^T J'^{-1} (h'+H^T \lambda)

894: \end{eqnarray}

895: Moreover, by strong duality of quadratic programming \cite{Bertsekas95}, it holds

896: that $g^*=f^*$. We also note the following equivalent representation

897: of the dual problem:

898: \begin{equation}\label{eq:qp1}

899: g^* = \left\{

900: \begin{array}{ll}

901: \mbox{minimize} & \tfrac{1}{2} h'^T J'^{-1} h'\\

902: \mbox{subject to} & A^T h' = h \\

903: \end{array}

904: \right.

905: \end{equation}

906: Here, $h'$ is the problem variable, and we consider all possible

907: choices of $h'$ that are consistent with $h$ under the constraint $x'

908: = A x$.  The optimal choice of $h'$ in this problem is the one which

909: leads to consistency in the estimate $\hat{x}' = J'^{-1} h'$.
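
For completeness, we sketch how the closed form of $g(\lambda)$ and the reformulation (\ref{eq:qp1}) arise. Since $J' \succ 0$, the relaxed objective is strictly concave in $x'$ and its maximizer $\hat{x}'(\lambda)$ is given by the stationarity condition:
\begin{eqnarray*}
0 &=& -J' \hat{x}' + h' + H^T \lambda \;\; \Rightarrow \;\; \hat{x}'(\lambda) = J'^{-1}(h' + H^T\lambda), \\
g(\lambda) &=& -\tfrac{1}{2} \hat{x}'^T J' \hat{x}' + (h'+H^T\lambda)^T \hat{x}' \\
 &=& \tfrac{1}{2} (h'+H^T\lambda)^T J'^{-1} (h'+H^T\lambda).
\end{eqnarray*}
Under the assumption that $H x' = 0$ holds exactly when $x' = A x$ for some $x$ (which is how we read the replication constraints), the range of $H^T$ coincides with the null space of $A^T$. Hence, as $\lambda$ varies, $h' + H^T \lambda$ ranges over all vectors $\tilde{h}$ (our notation) with $A^T \tilde{h} = A^T h' = h$, which is the feasible set of (\ref{eq:qp1}).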

910:

911: \subsection{Quadratic Constraints and Log-Det Regularization}

912:

913: Although, in Gaussian models, it is sufficient to include only linear

914: constraints (there is no duality gap), our method can also accommodate

915: quadratic constraints, and this results in faster convergence and

916: tighter bounds on variances. Consider the constrained optimization

917: problem:

918: \begin{equation}

919: \begin{array}{ll}

920: \mbox{maximize} & -\tfrac{1}{2} x'^T J' x' + h'^T x'\\

921: \mbox{subject to}

922: & x'_a = x'_b, \; (x'_a)^2 = (x'_b)^2 \mbox{ for all } a \equiv b,\\

923: & x'_{a_1}x'_{a_2} = x'_{b_1}x'_{b_2} \mbox{ for all } (a_1,a_2) \equiv (b_1,b_2).

924: \end{array}

925: \end{equation}

926: This leads to the following equivalent version of the dual

927: problem with problem variables $(h',J')$:

928: \begin{equation}

929: \begin{array}{ll}

930: \label{eq:c}

931: \mbox{minimize} & \tfrac{1}{2} h'^T J'^{-1} h' \\

932: \mbox{subject to} & A^T h'=h, \; A^T J' A = J, \; J' \succ 0.

933: \end{array}

934: \end{equation}

935: Any solution of the linearly-constrained relaxation provides a

936: feasible point for this problem, so the value of (\ref{eq:c}) is

937: less than or equal to that of (\ref{eq:qp1}).  However, since there is

938: no duality gap in (\ref{eq:qp1}), the values of the two problems

939: are equal: both achieve $g^*=f^*$ and obtain the MAP estimate.

940:

941: While the choice of $J'$ does not affect the value of the dual

942: problem, it does affect variance estimates and convergence of

943: iterative methods.  Hence, we regularize the choice of $J'$ by adding

944: a penalty $-\tfrac{1}{2} \log\det J'$ to the objective of

945: (\ref{eq:c}), which also serves as a barrier function enforcing $J'

946: \succ 0$.  The resulting objective function is then equivalent to

947: $\Phi(h',J')$, which shows a parallel to our earlier approach for

948: ``smoothing'' the dual function in discrete

949: problems. \comment{(although, here, there is no need for a temperature

950: parameter). Similarly, we find that minimizing the log-partition

951: function in the Gaussian model, subject to consistency constraints,

952: reduces to matching means and covariances among replicas of a node or

953: edge.}

954:

955: \subsection{Gaussian Moment-Matching}

956:

957: \begin{figure}

958: \vspace{.3cm}

959: \hrule

960: \begin{tabbing}

961: {\bf ALGORITHM 2 (Gaussian LR)}\\

962: It\=erate until convergence:\\

963: Fo\=r $E \in \calG \mbox{ where } r_E>1$\\

964: \>Fo\=r $E' \in \calR(E)$\\

965: \>\> Compute moments $(\hat{x}_{E'},P_{E'})$ in $(h',J')$.\\

966: \>\> $\hat{J}_{E'} = P_{E'}^{-1}, \; \hat{h}_{E'} = P_{E'}^{-1} \hat{x}_{E'}$\\

967: \>end\\

968: \>$\bar{J}_E = r_E^{-1} \sum_{E'} \hat{J}_{E'}, \; \bar{h}_E = r_E^{-1} \sum_{E'} \hat{h}_{E'}$\\

969: \>For $E' \in \calR(E)$\\

970: \>\>$J'_{E',E'} \leftarrow J'_{E',E'} + \left(\bar{J}_E - \hat{J}_{E'} \right)$\\

971: \>\>$h'_{E'} \leftarrow h'_{E'} + \left(\bar{h}_E - \hat{h}_{E'} \right)$\\

972: \>end\\

973: end

974: \end{tabbing}

975: \vspace{-.2cm}

976: \hrule

977: \vspace{-.4cm}

978: \end{figure}

979:

980: We develop an approach in the same spirit as the Gaussian iterative

981: scaling method \cite{SpeedKiiveri86}. We minimize the log-partition

982: function with respect to the information parameters over all replicas

983: of a node or edge, subject to consistency and positive-definiteness

984: constraints.  The optimality condition for this minimization is that

985: the marginal moments (means and covariances) of all replicas are

986: equalized.  It can be shown that the following information-form

987: updates achieve this objective.  First, for all replicas $E'$ of $E$,

988: we compute the marginal information parameters given by sparse

989: Gaussian elimination of $C = V' \setminus E'$ in $(J',h')$:

990: \begin{eqnarray}

991: \hat{J}_{E'} &=& J'_{E',E'} - J'_{E',C} (J'_{C,C})^{-1} J'_{C,E'} \nonumber \\

992: \hat{h}_{E'} &=& h'_{E'} - J'_{E',C} (J'_{C,C})^{-1} h'_{C}

993: \end{eqnarray}

994: This is equivalent to $\hat{J}_{E'} = P_{E'}^{-1}$ and $\hat{h}_{E'} =

995: P_{E'}^{-1} \hat{x}_{E'}$. Next, we average these marginal information

996: forms over all replicas:

997: \begin{equation}\label{eq:marg_info_match}

998: \bar{J}_E = r_E^{-1} \sum_{E'} \hat{J}_{E'}, \;\; \bar{h}_E = r_E^{-1} \sum_{E'} \hat{h}_{E'}

999: \end{equation}

1000: Finally, we update the information form according to:

1001: \begin{eqnarray}

1002: J'_{E',E'} &\leftarrow& J'_{E',E'} + (\bar{J}_E - \hat{J}_{E'}) \nonumber \\

1003: h'_{E'} &\leftarrow& h'_{E'} + (\bar{h}_E - \hat{h}_{E'})

1004: \end{eqnarray}

1005: Using the characterization of positive-definiteness of a block matrix

1006: in terms of a principal submatrix and its Schur complement, it can be

1007: shown that this update preserves positive definiteness of $J'$.  It

1008: also preserves consistency, e.g., $\sum_{E'} (\bar{J}_E -

1009: \hat{J}_{E'}) = 0$.  After the update, the new marginal information

1010: parameters for all replicas of $E$ are equal to

1011: $(\bar{h}_E,\bar{J}_E)$.  Algorithm 2 summarizes this iterative

1012: approach for solving the Gaussian LR problem.
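
The claim that the updated marginal information parameters equal $(\bar{h}_E,\bar{J}_E)$ can be verified in one line; we include the check as a sketch. The update changes only the block $J'_{E',E'}$ and the vector $h'_{E'}$, leaving all terms involving $C = V' \setminus E'$ untouched, so recomputing the Schur complement gives
\begin{eqnarray*}
\hat{J}^{\mathrm{new}}_{E'} &=& \left( J'_{E',E'} + \bar{J}_E - \hat{J}_{E'} \right) - J'_{E',C} (J'_{C,C})^{-1} J'_{C,E'} \\
 &=& \hat{J}_{E'} + (\bar{J}_E - \hat{J}_{E'}) \;=\; \bar{J}_E, \\
\hat{h}^{\mathrm{new}}_{E'} &=& \left( h'_{E'} + \bar{h}_E - \hat{h}_{E'} \right) - J'_{E',C} (J'_{C,C})^{-1} h'_{C} \;=\; \bar{h}_E.
\end{eqnarray*}
This argument treats one replica at a time. It carries over to all replicas of $E$ simultaneously when no two of the updated replicas share a connected component, so that updating one does not alter the marginal of another.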

1013:

1014: Lastly, using the fact that $\bar{f}_E(x_E) \ge \hat{f}_E(x_E)$ for

1015: all $x_E$ and that there is no duality gap upon convergence, we

1016: conclude that the final equalized marginal information must satisfy

1017: $\bar{J}_E \preceq \hat{J}_E \triangleq J_{E,E} - J_{E,\setminus

1018: E}(J_{\setminus E,\setminus E})^{-1}J_{\setminus E,E}$. Hence, LR

1019: gives an upper-bound on the true variance: $P_E=(\hat{J}_E)^{-1}

1020: \preceq (\bar{J}_E)^{-1}$.  If each replica of $E$ is contained in a

1021: separate connected component of $\calG'$, then a tighter bound

1022: holds: $P_E \preceq (r_E \bar{J}_E)^{-1}$.

1023:

1024: \emph{Computational Examples.} We apply LR for two Gaussian models

1025: defined on a $50 \times 50$ 2D grid with correlation lengths

1026: comparable to the size of the field. First, we use the \emph{thin

1027: membrane} model, which encourages neighboring nodes to be similar by

1028: having potentials $f_{ij} = (x_i - x_j)^2$ for each edge

1029: $\{i,j\} \in \calG$. We split the 2D model into vertical strips of

1030: narrow width $K$, which have overlap $L$ (we vary $K$ and set

1031: $L=2$). We impose marginal agreement conditions in $K \times L$ blocks

1032: in these overlaps. The updates are done consecutively, from top to

1033: bottom blocks, from the left to the right strip. A full update of all

1034: the blocks constitutes one iteration. We compare LR to loopy belief

1035: propagation (LBP). The LBP variances are underestimates by $21.5$

1036: percent (averaged over all nodes), while LR variances for $K=8$ are

1037: overestimates by $16.1$ percent. In Figure \ref{fig:gauss_LR} (top) we

1038: show convergence of LR for several values of $K$, and compare it to

1039: LBP. The convergence of variances is similar to LBP, while for the

1040: means LR converges considerably faster. In addition, the means in LR

1041: converge faster than using block Gauss-Seidel on the same set of

1042: overlapping $K \times 50$ vertical strips.

1043:

1044: Next, we use the \emph{thin plate model}, which enforces that each

1045: node $v$ is close to the average of its nearest neighbors $N(v)$ in

1046: the grid, and penalizes curvature. At each node there is a potential:

1047: $f_i(x_i,x_{N(i)}) = (x_i - \tfrac{1}{|N(i)|} \sum_{j \in N(i)}

1048: x_j)^2$. LBP does not converge for this model. LR gives rather loose

1049: variance bounds for this more difficult model: for $K=12$, it

1050: overestimates the variances by $75.4$ percent. More importantly, it

1051: accelerates convergence of the means. In Figure \ref{fig:gauss_LR}

1052: (bottom) we show convergence plots for means and variances, for

1053: several values of $K$. As $K$ increases, the agreement is achieved

1054: faster, and for $K=12$ agreement is achieved in under $13$ iterations

1055: for both means and variances. We note that LR with $K=4$ converges

1056: much faster for the means than block Gauss-Seidel.

1057:

1058: \begin{figure}

1059: \centering

1060: \epsfig{figure=var_LR_gauss_tm_final.eps,scale=.55}

1061: \epsfig{figure=means_LR_gauss_tm_final.eps,scale=.55}\\

1062: \epsfig{figure=var_LR_gauss_tp_final.eps,scale=.55}

1063: \epsfig{figure=means_LR_gauss_tp_final.eps,scale=.55}

1064: \caption{\label{fig:gauss_LR} Convergence plots for variances (left) and

1065: means (right), in the thin-membrane model (top) and thin-plate model (bottom).}

1066: \vspace{-.3cm}

1067: \end{figure}

1068:

1069: \section{Multi-Scale Lagrangian Relaxation}

1070:

1071: In this section, we propose an extension of the LR method considered

1072: thus far.  Previously, we have considered relaxations based on

1073: augmented models where $x' = \zeta(x)$ involves replication of

1074: variables.  Here, we consider a more general definition of $\zeta$ to

1075: allow the augmented model to include \emph{summary variables}, such as

1076: a sum over a subset of variables, or any linear combination of these.

1077: In discrete models, summary variables can also be non-linear functions

1078: of $x$. For example, ``parity bits'' are used in coding

1079: applications and the ``majority rule'' is used to define

1080: coarse-scale binary variables in the renormalization group approach

1081: \cite{Gidas89}.

1082:

1083: Using this idea, we develop a \emph{multiscale} Lagrangian relaxation

1084: approach for MRFs defined on grids.  The purpose of this relaxation is

1085: similar to that of the multigrid and renormalization group methods

1086: \cite{Trottenberg*01,Gidas89}. Iterative methods generally involve

1087: simple rules that propagate information locally within the graph.  Using a

1088: multiscale representation of the model allows information to

1089: propagate through coarse scales, which improves the rate of

1090: convergence to global equilibrium.  Also, in discrete problems,

1091: such multiscale representations can help to avoid local minima.  In

1092: the context of our convex LR approach, we expect this to translate

1093: into a reduction of the duality gap to obtain the optimal MAP estimate

1094: in a larger class of problems.

1095:

1096: \begin{figure}[t]

1097: \centering

1098: (a)\epsfig{file=multiscale_lr_small.eps,scale=0.8}\\

1099: \vspace{.2cm}

1100: (b)\epsfig{file=multiscale_lr_cliques_small.eps,scale=0.8}

1101: \caption{\label{fig:multiscaleLR} Illustration of multiscale

1102: LR method. (a) First, we define an equivalent

1103: multiscale model subject to cross-scale constraints. Relaxing these

1104: constraints leads to a set of single-scale models. (b) Next, each single

1105: scale is relaxed to a set of tractable subgraphs.}

1106: \vspace{-.4cm}

1107: \end{figure}

1108:

1109: \subsection{An Equivalent Multiscale Model}

1110:

1111: We illustrate the general idea with a simple example based on a 1D

1112: Markov chain.  While this case is actually tractable by exact methods,

1113: it serves to illustrate our approach, which generalizes to

1114: 2D grids and 3D lattices.  In Fig. \ref{fig:multiscaleLR}, we show how

1115: to construct the augmented model $f'(x')$ defined on a graph $\calG'$.

1116: This is done in two stages.

1117:

1118: First, as illustrated in Fig. \ref{fig:multiscaleLR}(a), we introduce

1119: coarse-scale representations of the fine scale variables by

1120: recursively defining summary variables at coarser scales to be

1121: functions of variables at the next level down.  This defines a set of

1122: cross-scale constraints, denoted by the square nodes. To allow

1123: interactions between coarse-scale variables, while maintaining

1124: consistency with the original single-scale model, we introduce extra

1125: edges (the dotted ones in Fig. \ref{fig:multiscaleLR}(a)) between

1126: blocks of nodes that have a (solid) edge between their summary nodes

1127: at the next coarser scale.  This representation allows us to define a

1128: family of constrained multiscale models that are all equivalent to the

1129: original single-scale model.  For 2D and 3D lattices, this model is

1130: still intractable even after relaxing the cross-scale constraints

1131: because each scale is itself intractable.

1132:

1133: Next, to obtain a tractable dual problem, we break up the graph into smaller

1134: subgraphs, introducing additional constraints to enforce consistency

1135: among replicated variables.  In the example, we break the augmented

1136: graph at each scale into its maximal cliques, shown in

1137: Fig. \ref{fig:multiscaleLR}(b).  This defines the final augmented

1138: model and the corresponding graph. In a 2D graph, the same idea

1139: applies, but we obtain a set of maximal cliques consisting of

1140: overlapping $2 \times 4$ and $4 \times 2$ blocks of the grid.

1141: Alternatively, we could break up the 2D grid into a set of width 2

1142: vertical strips, as discussed previously.

1143:

1144: Now, the procedure is essentially the same as before. We start with

1145: the equivalent constrained optimization problem defined on the

1146: augmented graph, now subject to both in-scale and cross-scale

1147: constraints. We obtain a tractable problem by introducing Lagrange

1148: multipliers to relax these constraints.  Then we iteratively adjust

1149: the Lagrange multipliers to minimize the dual function, with the aim

1150: of eliminating constraint violations to obtain the desired MAP

1151: estimate.  This is equivalent to adjusting the augmented model

1152: $f'(x')$ on $\calG'$, subject to the constraint that it remains

1153: equivalent to $f(x)$ for all $x' = Ax$.

1154:

1155: \subsection{Gaussian Multiscale Moment-Matching}

1156:

1157: We demonstrate this approach in the Gaussian model.  To carry out the

1158: minimization, we again use a block coordinate-descent method that

1159: finds an exact minimum over a subset of Lagrange multipliers at each

1160: step. The replica constraints are handled the same as before. Here, we

1161: briefly summarize our approach to handle the cross-scale summary

1162: constraints. Let $x_1$ and $x_2$ denote two random vectors at

1163: consecutive scales coupled by the constraint $x_2 = A x_1$.  Let

1164: $(\hat{h}_1,\hat{J}_1)$ and $(\hat{h}_2,\hat{J}_2)$ denote their corresponding \emph{marginal}

1165: information parameters. Relaxing the constraints $x_2 = A x_1$ and

1166: $x_2 x_2^T = A x_1 x_1^T A^T$, with Lagrange multipliers

1167: $(\lambda,-\tfrac{1}{2}\Lambda)$, leads to the following optimality

1168: conditions:

1169: \begin{eqnarray}

1170: (\hat{J}_2+\Lambda)^{-1} &=& A (\hat{J}_1-A^T\!\Lambda\, A)^{-1} A^T \\

1171: (\hat{J}_2+\Lambda)^{-1}(\hat{h}_2+\lambda) &=& A (\hat{J}_1-A^T\!\Lambda\, A)^{-1}(\hat{h}_1-A^T\lambda) \nonumber

1172: \end{eqnarray}

1173: We find that the solution is:\footnote{The formula (\ref{eq:multiscale_update}) corresponds to a generalization of Algorithm 2, in which the moments $(\hat{x}_1,P_1)$ of the fine-scale variables $x_1$ are replaced by the corresponding moments $(A \hat{x}_1, A P_1 A^T)$ of the summary statistic $\tilde{x}_1 = A x_1$.}

1174: \begin{eqnarray}\label{eq:multiscale_update}

1175: \Lambda &=& \tfrac{1}{2} \{(A\hat{J}_1^{-1}A^T)^{-1} - \hat{J}_2\} \nonumber\\

1176: \lambda &=& \tfrac{1}{2} \{(A\hat{J}_1^{-1}A^T)^{-1} A \hat{J}_1^{-1} \hat{h}_1 - \hat{h}_2\}

1177: \end{eqnarray}

1178: The model $(h',J')$ is then updated by adding $(\lambda,\Lambda)$ to

1179: the coarse scale and subtracting $(A^T \lambda,A^T \!\Lambda\, A)$ from the

1180: fine scale. This update enforces the moment conditions $\hat{x}_2 = A

1181: \hat{x}_1$ and $P_2 = A P_1 A^T$ while maintaining consistency of the

1182: model $(h',J')$. Similar updates can be derived when there are

1183: multiple replicas of $x_1$ and $x_2$.  These methods, together with

1184: those described previously, are used to minimize the dual function in

1185: the Gaussian multiscale relaxation.
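
\emph{Verification Sketch.} As a brief check of
(\ref{eq:multiscale_update}), the following sketch assumes that all
indicated inverses exist and introduces the shorthand
$S = A \hat{J}_1^{-1} A^T$, which is notation adopted only for this
check and is not part of the development above.  Direct multiplication
shows that
\begin{displaymath}
S^{-1} A \hat{J}_1^{-1} (\hat{J}_1 - A^T\!\Lambda\, A) = (S^{-1} - \Lambda) A,
\end{displaymath}
so that $A (\hat{J}_1 - A^T\!\Lambda\, A)^{-1} = (S^{-1}-\Lambda)^{-1}
S^{-1} A \hat{J}_1^{-1}$.  Substituting this identity into the
optimality conditions gives
\begin{eqnarray*}
(\hat{J}_2+\Lambda)^{-1} &=& (S^{-1}-\Lambda)^{-1} \\
(\hat{J}_2+\Lambda)^{-1}(\hat{h}_2+\lambda) &=& (S^{-1}-\Lambda)^{-1}(S^{-1} A \hat{J}_1^{-1}\hat{h}_1 - \lambda)
\end{eqnarray*}
The first line is equivalent to $\hat{J}_2 + \Lambda = S^{-1} -
\Lambda$ and, given this, the second reduces to $\hat{h}_2 + \lambda =
S^{-1} A \hat{J}_1^{-1} \hat{h}_1 - \lambda$.  Solving these two linear
equations for $\Lambda$ and $\lambda$ recovers
(\ref{eq:multiscale_update}).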

1186:

1187: \emph{Multiscale Example.} We provide a preliminary result involving a

1188: 1D thin-membrane model with $1024$ nodes.  It is defined to have a

1189: long correlation length comparable to the length of the field. Using a

1190: random $h$-vector, we solve for the MAP estimates using three methods:

1191: a standard block Gauss-Seidel iteration using overlapping blocks of

1192: size 4; the (single-scale) Gaussian LR method with the same choice of

1193: blocks; and the multiscale LR method.  The convergence of all three

1194: methods is shown in Fig. \ref{fig:multiscale_lr_example}.  We see

1195: that the single-scale LR approach is moderately faster than block

1196: Gauss-Seidel, but introducing coarser scales into the method leads to

1197: a significant speed-up in the rate of convergence.

1198:

1199: \begin{figure}

1200: \centering

1201: \epsfig{file=multiscale_lr_example.eps,scale=.25}

1202: \caption{\label{fig:multiscale_lr_example}Convergence of single- and multi-scale LR and block Gauss-Seidel.}

1203: \vspace{-.4cm}

1204: \end{figure}

1205:

1206: \section{Discussion}

1207:

1208: We have introduced a general Lagrangian relaxation framework for MAP

1209: estimation in both discrete and Gaussian graphical models.  This

1210: provides a new interpretation of some existing methods, provides

1211: deeper insights into those methods, and leads to new generalizations,

1212: such as the multiscale relaxation introduced here. There are many promising

1213: directions for further work.  While we have considered discrete and

1214: Gaussian models separately, the basic approach should extend to the

1215: richer class of conditionally Gaussian models \cite{Lauritzen96}

1216: including both discrete and continuous variables.  In discrete models,

1217: designing augmented models that capture more structure of the original

1218: problem leads to reduced duality gaps and optimal MAP estimates in

1219: larger classes of models.  It would be of great interest to find ways

1220: to \emph{adaptively} search this hierarchy of relaxations to

1221: efficiently reduce and eventually eliminate the duality gap with

1222: minimal computation.  It is also of interest to consider

1223: approaches to identify provably \emph{near-optimal} estimates, perhaps

1224: using the relaxed max-marginal estimates, in cases where it is not

1225: tractable to completely eliminate the duality gap.

1226:

1227: %\nocite{Lauritzen96,Cowell*99,Frey98,Wainwright*nov05,Wainwright*jul05,BarndorffNielsen78,WainwrightJordan03,Bertsekas95,BertsimasTsitsiklis97,BertsimasWeismantel05,BoydVandenberghe04,SpeedKiiveri86,Kolmogorov05,KolmogorovWainwright05,Werner07,Yanover*06,Feldman*05,ZhaoLuh98}

1228:

1229: \bibliography{lr}

1230: %\bibliographystyle{plain}

1231: \bibliographystyle{unsrt}

1232:

1233: % These are some excerpts that were omitted due to space

1234: % constraints, but may be useful in a longer version of the paper

1235:

1236: % This goes right before proposition 1

1237:

1238: \comment{Let's review some basic results of Lagrangian duality.  The dual

1239: function has two important properties:  First, it is a maximum over a

1240: set of linear functions in $\lambda$, and is therefore \emph{convex}.

1241: Second, for every $\lambda$ it provides an upper-bound on the value of

1242: the constrained optimization problem: $g(\lambda) \ge f^*$ for all

1243: $\lambda$.  Hence, to determine the best possible choice of $\lambda$,

1244: it is natural to \emph{minimize} the dual function, which is the standard

1245: \emph{Lagrangian dual problem}:

1246: \begin{equation}

1247: g^* = \min_\lambda g(\lambda) = \min_\lambda \max_{x' \in \mathbb{X}'}

1248: L(x';\lambda)

1249: \end{equation}

1250: This may also be interpreted as optimizing over all equivalent models

1251: $f'$ defined on $\calG'$,

1252: \begin{equation}

1253: g^* =

1254: \begin{array}{ll}

1255: \mbox{minimize} & \max_{x'} f'(x') \\

1256: \mbox{subject to} & f'(\zeta(x)) = f(x) \mbox{ for all } x.

1257: \end{array}

1258: \end{equation}

1259: The constrained primal problem (\ref{fig:a}) is equivalent to the reverse max-min

1260: problem:

1261: \begin{equation}

1262: f^* = \max_{x' \in \zeta(\mathbb{X})} f'(x') = \max_{x' \in \mathbb{X}'} \min_\lambda L(x';\lambda)

1263: \end{equation}

1264: This is seen by observing that  $L(x',\lambda) = f'(x')$ if

1265: $x' \in \zeta(\mathbb{X})$ and $\min_\lambda L(x',\lambda) = -\infty$

1266: otherwise.  It always holds that $g^* \ge f^*$, which is known as the

1267: \emph{minimax inequality}.  When $g^* > f^*$, it is said that there is

1268: a \emph{duality gap}.  Our aim is to find formulations where there is

1269: no duality gap, that is, where $g^* = f^*$. This occurs if and only if

1270: there exists a \emph{saddle-point}, that is, a pair $(x'^*,\lambda^*)$

1271: such that

1272: \begin{displaymath}

1273: \lambda^* \in \arg\min L(x'^*,\cdot)

1274: \end{displaymath}

1275: and

1276: \begin{displaymath}

1277: x'^* \in \arg\max L(\cdot,\lambda^*).

1278: \end{displaymath}

1279: In this case, it also holds that $\lambda^* \in \arg\min g(\lambda)$

1280: and $x'^* \in \arg\max_{\zeta(\mathbb{X})} f'$. In other words,

1281: each saddle point corresponds to a pair of primal/dual optimal solutions.

1282: Moreover, if there is no duality gap, \emph{every} pair of

1283: primal/dual optimal solutions $(\lambda^*,x'^*) \in \arg\min g \otimes

1284: \arg\max_{\zeta(\mathbb{X})} f'$ is then a saddle point.  We refer the

1285: reader to [CITE] for proofs of these well-known results.

1286:

1287: These elementary considerations lead to the following simple

1288: characterization of whether or not there is a duality gap, and of the

1289: relation between the optimal MAP estimates and those in the relaxed

1290: problem in the case that there is no duality gap:}

1291:

1292: % this comes right after proposition 2

1293:

1294: \comment{However, this does not mean that there is no advantage to grouping

1295: cliques together into larger thin graphs.  Doing so reduces the number

1296: of Lagrange multipliers in the dual problem, which can lead to reduced

1297: computations and faster convergence in iterative methods. We also note

1298: that a non-chordal, thin augmented graph $\calG'$ can be extended to a

1299: chordal graph of the same width by adding fill edges to $\calG'$.  If

1300: there exists such a chordal extension such that no two fill edges map

1301: back to the same edge in $\calG$, this does not change the value of

1302: $g^*$.  For example, from these considerations we conclude the

1303: following relations between the examples shown in Fig. \ref{fig:graphs}:

1304: \begin{displaymath}

1305: (b) = (c) = (d) \ge (e) = (g) \ge (h)

1306: \end{displaymath}

1307: For instance, $(e)=(g)$ because we can obtain chordal versions of both

1308: graphs by adding diagonal edges within each cell without any

1309: replicated chords, so their dual values do not change, and both

1310: chordal graphs then use the same set of maximal cliques in $\calG$.

1311: But $(g)\ge(h)$ because adding edges to make $(h)$ chordal introduces

1312: larger cliques than $(g)$.}

1313:

1314: % this goes in the discussion of max-marginals

1315:

1316: \comment{Assuming there is a unique MAP estimate in the original problem, then

1317: there are two typical cases: If there is no duality gap, the

1318: max-marginal estimates obtained will typically each have a unique

1319: maximum $x_E^* = \arg\max \bar{f}(x_E)$ for all $E \in \calG$.  Then,

1320: these are consistent and the global MAP estimate $x^*$ is recovered.

1321: When there are ties in some of the max-marginal estimates, this

1322: usually indicates that there is a duality gap and no consistent

1323: solutions.  However, in some exceptional cases, it is possible that

1324: there is no duality gap even in this case. To be certain, one would

1325: have to check for a consistent solution $x^*$ that simultaneously

1326: maximizes all of these relaxed max-marginals.}

1327:

1328: % right after proposition 1

1329:

1330: \comment{If the original problem has a unique MAP estimate, then it typically

1331: holds that, in the case of no duality gap, the relaxed problem has a

1332: unique solution, and this then provides the MAP estimate $x^* =

1333: \zeta^{-1}(x'^*)$.}

1334:

1335: \end{document}

1336: