1: \documentclass[letterpaper, 10 pt, conference]{ieeeconf}
2: \IEEEoverridecommandlockouts
3: \overrideIEEEmargins
4:
5: \newcommand{\calG}{\mathcal{G}}
6: \newcommand{\calF}{\mathcal{F}}
7: \newcommand{\calM}{\mathcal{M}}
8: \newcommand{\calR}{\mathcal{R}}
9: \newcommand{\comment}[1]{}
10:
11: \newtheorem{proposition}{Proposition}
12:
% The following packages can be found on http://www.ctan.org
14:
15: \usepackage{graphics} % for pdf, bitmapped graphics files
16: \usepackage{epsfig} % for postscript graphics files
17: \usepackage{mathptmx} % assumes new font selection scheme installed
18: \usepackage{times} % assumes new font selection scheme installed
19: \usepackage{amsmath} % assumes amsmath package installed
20: \usepackage{amssymb} % assumes amsmath package installed
21:
22: \title{\LARGE \bf Lagrangian Relaxation for MAP Estimation in
23: Graphical Models}
24:
25: \author{Jason K. Johnson, Dmitry M. Malioutov and Alan
26: S. Willsky\thanks{The authors are with the Electrical Engineering and Computer Science Department, Massachusetts Institute of Technology, Cambridge,
MA 02139, USA. {\tt\small {\{jasonj,dmm,willsky\}@mit.edu}}.}}
28:
29: \begin{document}
30:
31: \maketitle
32: \begin{minipage}[t][0pt][t]{0.96\textwidth}
33: \vspace{-1.8in}
34: In Proceedings of \emph{The 45th Allerton Conference on
35: Communication, Control and Computing}, September, 2007.
36: \end{minipage}\vspace{-\baselineskip}
37:
38:
39: \thispagestyle{empty}
40: \pagestyle{empty}
41: \begin{abstract}
42: We develop a general framework for MAP estimation in discrete and
43: Gaussian graphical models using Lagrangian relaxation techniques. The
44: key idea is to reformulate an intractable estimation problem as one
45: defined on a more tractable graph, but subject to additional
46: constraints. Relaxing these constraints gives a tractable dual
47: problem, one defined by a thin graph, which is then optimized by an
48: iterative procedure. When this iterative optimization leads to a
49: consistent estimate, one which also satisfies the constraints, then it
50: corresponds to an optimal MAP estimate of the original model.
51: Otherwise there is a ``duality gap'', and we obtain a bound on the
52: optimal solution. Thus, our approach combines convex optimization
53: with dynamic programming techniques applicable for thin graphs. The
54: popular tree-reweighted max-product (TRMP) method may be seen as
55: solving a particular class of such relaxations, where the intractable
56: graph is relaxed to a set of spanning trees. We also consider
57: relaxations to a set of small induced subgraphs, thin subgraphs
58: (e.g. loops), and a connected tree obtained by ``unwinding'' cycles.
59: In addition, we propose a new class of multiscale relaxations that
60: introduce ``summary'' variables. The potential benefits of such
61: generalizations include: reducing or eliminating the ``duality gap''
in hard problems, reducing the number of Lagrange multipliers in the
63: dual problem, and accelerating convergence of the iterative
64: optimization procedure.
65: \end{abstract}
66:
67: \section{Introduction}
68:
69: Graphical models are probability models for a collection of random
70: variables on a graph: the nodes of the graph represent random
71: variables and the graph structure encodes conditional independence
72: relations among the variables. Such models provide compact
73: representations of probability distributions, and have found many
74: practical applications in physics, statistical signal and image
75: processing, error-correcting coding and machine learning. However,
76: performing optimal estimation in such models using standard junction
77: tree approaches generally is intractable in large-scale estimation
78: scenarios. This motivates the development of variational techniques
79: to perform approximate inference, and, in some cases, recover the
80: optimal estimate.
81:
82: We consider a general Lagrangian relaxation (LR) approach to
83: \emph{maximum a posteriori} (MAP) estimation in graphical models. The
84: general idea is to reformulate the estimation problem on an
85: intractable graph as a constrained estimation over an augmented model
86: defined on a larger, but more tractable graph. Then, using Lagrange
87: multipliers to relax the constraints, we obtain a tractable estimation
88: problem that gives an upper-bound on the original problem. This leads
89: to a convex optimization problem of minimizing the upper-bound as a
90: function of Lagrange multipliers.
91:
92: We consider a variety of strategies to augment the original graph.
93: The simplest approach breaks the graph into many small, overlapping
94: subgraphs, which involves replicating some variables. Similarly, the
95: graph can be broken into a set of thin subgraphs, as in the TRMP
96: approach, or ``unrolled'' to obtain a larger, but connected, thin
97: graph. We show that all of these approaches are essentially
98: equivalent, being characterized by the set of maximal cliques of the
99: augmented graph. More generally, we also consider the introduction of
100: ``summary'' variables, which leads naturally to multiscale algorithms.
101: We develop a general optimization approach based on marginal and
102: max-marginal matching procedures, which enforce consistency between
103: replicas of a node or edge, and moment-matching in the multiscale
104: relaxation. We show that the resulting bound is tight if and only if
105: there exists an optimal assignment in the augmented model that
106: satisfies the constraints. In that case, we obtain the desired MAP
107: estimate of the original model. When there is a duality gap, this is
108: evidenced by the occurrence of ``ties'' in the resulting set of
109: max-marginals, which requires further augmentation of the model to
110: reduce and ultimately eliminate the duality gap. We focus primarily
111: on discrete graphical models with binary variables, but also consider
112: the extension to Gaussian graphical models. In the Gaussian model, we
113: find that, whenever LR is ``well-posed'', so that the augmented model
114: is valid, it leads to a tight bound and the optimal MAP estimate, and
115: also gives \emph{upper-bounds} on variances that provide a measure of
116: confidence in the MAP estimate.
117:
118: \section{Background}
119:
120: We consider probabilistic graphical models
121: \cite{Lauritzen96,Cowell*99,Frey98}, which are probability
122: distributions of the form
123: \begin{equation}\label{eq:1}
124: p(x_1,\dots,x_n) = \frac{1}{Z} \exp\{f(x)\} = \frac{1}{Z} \exp\left\{\sum_{C \in \calG} f_C(x_C)\right\}
125: \end{equation}
126: where each function $f_C$ only depends on a subset of variables $x_C =
127: (x_v, v \in C)$ and $Z$ is a normalization constant of the model,
128: called the \emph{partition function} in statistical physics. If the
129: sum ranges over all \emph{cliques} of the graph, which are the fully
130: connected subsets of variables, this representation is sufficient to
131: realize any Markov model on $\calG$ \cite{Lauritzen96}.
132: \comment{\footnote{The probability
133: distribution $p(x)$ is \emph{Markov} on $\calG$ if for every $S
134: \subset V$ that separates $A,B \subset V$ in $\calG$, $x_A$ and $x_B$
135: are independent given $x_S$.}} However, it is also common to consider
136: restricted Markov models where only singleton and pairwise
137: interactions are specified. In general, we specify the set of
138: interactions by a hypergraph $\calG \subset 2^V$, where $2^V$
represents the set of all subsets of $V$. The elements of $\calG$
are its \emph{hyperedges}, which generalize the pairwise edges of an
ordinary graph.
142:
143: \emph{Discrete Models.} While our approach is applicable for general
144: discrete models, we focus on models with binary variables. One may
145: use either the Boltzmann machine representation $x_v \in \{0,1\}$, or
146: that of the Ising model $x_v \in \{-1,+1\}$. These models can be
147: represented as in (\ref{eq:1}) with
148: \begin{equation}
149: f(x;\theta) = \sum_{E\in\calG} \theta_E \phi_E(x_E), \;\; \phi_E(x_E) = \prod_{v \in E} x_v
150: \end{equation}
151: This defines an \emph{exponential family} \cite{WainwrightJordan03} of
152: probability distributions based on model features $\phi$ and
153: parameterized by $\theta$. $\Phi(\theta) \triangleq \log Z(\theta)$
154: is the \emph{log-partition function} and has the
155: \emph{moment-generating property}: $\frac{\partial
156: \Phi(\theta)}{\partial \theta_E} = \mathbb{E}_\theta\{\phi_E(x)\}
157: \triangleq \eta_E$. Here, $\eta$ are the \emph{moments} of the
158: distribution, which serve both as an alternate parameterization of
159: the exponential family and, in graphical models, to specify the
160: marginal distributions on cliques of the model. Inference in discrete
161: models using junction tree methods, either to compute the mode or the
162: marginals, is generally linear in the number of variables $n$ but
163: grows exponentially in the \emph{width} of the graph \cite{Cowell*99}, which
164: is determined by the size of the maximal cliques in a junction tree
165: representation of the graph. Hence, exact inference is only tractable
166: for \emph{thin} graphs, that is, where one can build an equivalent
167: junction tree with small cliques.
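
As a minimal illustration of the moment-generating property (a worked
example that assumes a single variable $x \in \{0,1\}$ with feature
$\phi(x) = x$), we have $Z(\theta) = 1 + e^{\theta}$, so that
\begin{displaymath}
\Phi(\theta) = \log(1+e^{\theta}), \quad
\frac{d \Phi(\theta)}{d \theta} = \frac{e^{\theta}}{1+e^{\theta}} = p(x=1) = \mathbb{E}_\theta\{x\} = \eta.
\end{displaymath}
In this case the single moment $\eta$ determines the marginal
distribution of $x$.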
168:
169: \emph{Gaussian Models.} We also consider Gaussian graphical models
170: \cite{Dempster72,SpeedKiiveri86} represented in \emph{information form}:
171: \begin{equation}
172: p(x) = \exp\{-\tfrac{1}{2} x^T J x + h^T x - \Phi(h,J) \}
173: \end{equation}
174: where $J$ is the \emph{information matrix}, $h$ a potential vector and
175: $\Phi(h,J) = \tfrac{1}{2} \{ h^TJ^{-1}h - \log\det J + n\log 2\pi\}$. This
176: corresponds to the standard form of the Gaussian model specified by
177: the covariance matrix $P = J^{-1}$ and mean vector $\hat{x} = J^{-1}
178: h$. This translates into an exponential family where we identify
179: $(h,J)$ with the parameters $\theta$ and $(\hat{x},P)$ with the
180: moments $\eta$. In general, the complexity of inference in Gaussian
181: models is $\mathcal{O}(n^3)$. The fill pattern of $J$ determines the
182: Markov structure of the Gaussian model: $(i,j) \in \calG$ if $J_{i,j}
183: \neq 0$. Using more efficient recursive inference methods that exploit
184: sparsity, such as junction trees or sparse Gaussian elimination, the
185: complexity is linear in $n$ but cubic in the width of the graph, which
186: is still impractical for many large-scale estimation problems.
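
As a sanity check of the information form (a worked example that
assumes the scalar case $n=1$ with $J = j > 0$), completing the square
gives $-\tfrac{1}{2} j x^2 + h x = -\tfrac{1}{2} j (x - h/j)^2 +
\tfrac{1}{2} h^2/j$. With $\Phi(h,j) = \tfrac{1}{2}\{h^2/j - \log j +
\log 2\pi\}$ we then obtain
\begin{displaymath}
p(x) = \sqrt{\frac{j}{2\pi}} \exp\left\{-\frac{j}{2}\left(x - \frac{h}{j}\right)^2\right\},
\end{displaymath}
which is the usual Gaussian density with mean $\hat{x} = h/j$ and
variance $P = 1/j$.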
187:
188: \section{Discrete Lagrangian Relaxation}
189:
190: To begin with, consider the problem of maximizing the following
191: objective function, defined over a hypergraph $\calG \subset
192: 2^V$ based on a vertex set $V = \{1,\dots,n\}$ corresponding to
193: discrete variables $x = (x_1,\dots,x_n)$.
194: \begin{equation}
195: f(x) = \sum_{E \in \calG} f_E(x_E)
196: \end{equation}
197: For instance, this may be defined as $f(x) = \langle \theta, \phi(x)
198: \rangle$ in an exponential family graphical model, such that each term
199: corresponds to a feature $f_E(x_E) = \theta_E \phi_E(x_E)$. Then, we
200: seek $x^*$ to maximize $f(x)$ to obtain the MAP estimate of (\ref{eq:1}).
201:
202: \begin{figure}
203: \centering
204: \input{LR_ex.pstex_t}
205: \caption{\label{fig:toy_example}A simple illustrative example of Lagrangian relaxation.}
206: \vspace{-.3cm}
207: \end{figure}
208:
209: \emph{An Illustrative Example.} To briefly convey the basic concept,
210: we consider a simple pairwise model defined on a $3$-node cycle
211: $\calG$ represented in Fig. \ref{fig:toy_example}. Here, the augmented
212: graph $\calG'$ is a 4-node chain, where node $4$ is a replica of node
213: $1$. We copy all the potentials on the nodes and edges from $\calG$ to
214: $\calG'$. For the replicated variables, $x_1'$ and $x_4'$, we split
215: $f_1$ between $f_1'$ and $f_4'$ such that $f_1(y) = f_1'(y) + f_4'(y)$
216: for $y \in \{0,1\}$. Now the problem $\max_x f(x)$ is equivalent to
217: maximizing $f'(x')$ subject to the constraint $x_1' = x_4'$. To solve
218: the latter we relax the constraint using Lagrange multipliers:
219: $L(x', \lambda) = f'(x') + \lambda(x'_1 - x'_4)$. The additional term
220: $\lambda(x_1' - x_4')$ modifies the self-potentials: $f_1' \leftarrow
221: f_1'(x_1') + \lambda x_1'$ and $f_4' \leftarrow f_4'(x_4') - \lambda
222: x_4'$, parameterizing a family of models on $\calG'$ all of which are
223: equivalent to $f$ under the constraint $x'_1=x'_4$. For a fixed
$\lambda$, solving $\max_{x'} L(x', \lambda) \triangleq g(\lambda)$ gives
225: an upper bound on $f^* = \max_x f(x)$, so by optimizing $\lambda$ to
226: minimize $g(\lambda)$, we find the tightest bound $g^* = \min_\lambda
227: g(\lambda)$. If the constraint $x_1' = x_4'$ is satisfied in the final
228: solution, then there is strong duality $g^* = f^*$ and we obtain the
229: correct MAP assignment for $f(x)$.
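
The upper-bound property can be verified directly in this example (a
short derivation included for illustration). For any $x$, the
replicated configuration $x'$ with $x'_1 = x'_4 = x_1$, $x'_2 = x_2$
and $x'_3 = x_3$ satisfies $\lambda(x'_1 - x'_4) = 0$ and $f'(x') =
f(x)$. Hence, for every $\lambda$,
\begin{displaymath}
g(\lambda) = \max_{x'} L(x',\lambda) \ge \max_{x' : \, x'_1 = x'_4} L(x',\lambda) = \max_{x' : \, x'_1 = x'_4} f'(x') = f^*.
\end{displaymath}
The inequality holds with equality whenever some maximizer of
$L(x',\lambda)$ satisfies $x'_1 = x'_4$.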
230:
231: We now discuss the general procedure and develop our approach to
232: optimize $g(\lambda)$ in more difficult cases.
233:
234: \subsection{Obtaining a Tractable Graph by Vertex Replication}
235:
236: In this section, we consider approaches that involve
237: \emph{replicating} variables to define the augmented model. The basic
238: constraints in designing $\calG'$ are as follows: $\calG'$ is
239: comprised of replicas of nodes and edges of $\calG$. Every node and
240: edge of $\calG$ must be represented at least once in $\calG'$.
241: Finally, $\calG'$ should be a thin graph, which relates to the
242: complexity of our method.
243:
244: To help illustrate the various strategies, we consider a
pairwise model $f(x)$ defined on a $5 \times 5$ grid, as seen in
246: Fig.~\ref{fig:graphs}(a). A natural approach is to break the
247: model up into small subgraphs. The simplest method is to break the
248: graph up into its composite interactions. For pairwise models, this
249: means that we split the graph into a set of disjoint edges as shown in
250: (b). Here, each internal node of the graph is replicated four times.
251: To reduce the number of replicated nodes, and hence the number of
252: constraints, it is also useful to merge many of these smaller
253: subgraphs into larger thin graphs. One approach is to group edges
254: into \emph{spanning trees} of the graph as seen in (c). Here, each
edge must be included in at least one tree, and some edges are
256: replicated in multiple trees. The TRMP approach is based on this
257: idea. One could also allow multiple replicas of a node in the same
258: connected component of $\calG'$. For instance, by taking a spanning
259: tree of the graph and then adding an extra leaf node for each missing
260: edge we obtain the graph seen in (d).
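
To make the cost of these constructions concrete, the following count
is a rough sketch. It assumes the standard 4-neighbor $5 \times 5$ grid,
with $25$ nodes and $2 \cdot 5 \cdot 4 = 40$ edges, and it counts only
node constraints. In (b), each node has one replica per incident edge,
so the $9$ interior nodes, $12$ non-corner boundary nodes and $4$
corner nodes give
\begin{displaymath}
|V'| = 9 \cdot 4 + 12 \cdot 3 + 4 \cdot 2 = 80 = 2 \cdot 40,
\end{displaymath}
and tying together the replicas of each node requires $80 - 25 = 55$
independent equality constraints. In (d), a spanning tree uses $24$ of
the $40$ edges, so $16$ leaf nodes are added, giving $|V'| = 25 + 16 =
41$ and only $16$ node constraints.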
261:
262: It is also tractable to use small subgraphs that are not trees. We
263: can break the graph into a set of short loops as in (e) or a set of
264: induced subgraphs as in (f) where we select a set of $3\times3$
265: subgraphs that overlap on their boundary. In such cases, including
266: additional edges in the overlap of these subgraphs, such as the dotted
267: edges in (f), can enhance the relaxations that we consider. Finally,
268: we reduce the number of constraints in these formulations by again
269: grouping subgraphs to form larger subgraphs that are still thin, as
270: shown in (g). This will also lead to tractability in our methods.
271: Again, it can be useful to include extra edges in the overlap of these
272: subgraphs as in (h), although this increases the width of the subgraph
273: and affects the computational complexity of our methods.
274:
275: \begin{figure}
276: \centering
277: (a)\epsfig{file=grid5x5.eps,scale=.85}
278: \hspace{.2cm}
279: (b)\epsfig{file=grid5x5_edges.eps,scale=0.55}\\
280: \vspace{.2cm}
281: (c)\epsfig{file=grid5x5_trees.eps,scale=0.7}
282: (d)\epsfig{file=grid5x5_comp_tree.eps,scale=0.7}\\
283: \vspace{.2cm}
284: (e)\epsfig{file=grid5x5_cells.eps,scale=0.75}
285: \hspace{.2cm}
286: (f)\epsfig{file=grid5x5_cells3x3.eps,scale=0.75}\\
287: \vspace{.2cm}
288: (g)\epsfig{file=grid5x5_layers.eps,scale=0.75}
289: \hspace{.2cm}
290: (h)\epsfig{file=grid5x5_layers2.eps,scale=0.75}
291: \caption{\label{fig:graphs} Illustrations of a variety of possible
292: ways to obtain a tractable graph structure from a $5 \times 5$ grid by
293: replicating some vertices of the graph.}
294: \vspace{-.4cm}
295: \end{figure}
296:
297: \emph{Notation.} Let $\calG^\prime$ denote the augmented graph (or
298: collection of subgraphs), which is based on an extended vertex set
299: $V'$, comprised of replicas of nodes in $V$. We assume that all edges
300: of this graph are also replicas of edges of the original graph
301: $\calG$.\footnote{In the case that we introduce extra edges in
302: $\calG'$, as in (f) and (h), we also add corresponding edges to
303: $\calG$ to maintain this convention.} Thus, there is a
well-defined surjective map $\Gamma: \calG' \rightarrow \calG$: each
edge $E' \in \calG'$ is a replica of an edge $E=\Gamma(E') \in
306: \calG$, and every edge of $\calG$ has at least one such replica. This
307: notation is overloaded for nodes by treating them as singleton edges
308: of $\calG$. We also denote the set-valued inverse of $\Gamma$ by
309: $\mathcal{R}(E) \triangleq \Gamma^{-1}(E)$, which is the set of
310: replicas of $E$, and let $r_E \triangleq |\calR(E)|$ denote the
311: number of replicas. This defines an equivalence relation on $\calG'$:
312: $A,B \in \calG'$ are equivalent $A \equiv B$ if $\Gamma(A)=\Gamma(B)$,
313: that is, if $A,B \in \calR(E)$ are replicas of the same edge $E \in
314: \calG$.
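
For instance, in the example of Fig. \ref{fig:toy_example}, and
assuming that the chain is ordered $1$, $2$, $3$, $4$ (our reading of
the figure), we have $V' = \{1,2,3,4\}$ with
\begin{displaymath}
\Gamma(\{1\}) = \Gamma(\{4\}) = \{1\}, \quad
\Gamma(\{1,2\}) = \{1,2\}, \quad
\Gamma(\{2,3\}) = \{2,3\}, \quad
\Gamma(\{3,4\}) = \{1,3\},
\end{displaymath}
and $\Gamma$ acting as the identity on nodes $2$ and $3$. Hence
$\calR(\{1\}) = \{\{1\},\{4\}\}$ with $r_{\{1\}} = 2$, every other node
and edge has a single replica, and the only nontrivial equivalence is
$\{1\} \equiv \{4\}$.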
315:
316: \subsection{Equivalent Constrained Estimation Problem}
317:
318: We now define a corresponding objective function $f'(x')$, where $x' =
319: (x'_v)_{v \in V'}$ are the variables of the augmented model. For each
320: hyperedge $E \in \calG$ (including individual nodes), we split the
321: function $f_E(x_E)$ among a set of replica functions $\{f'_{E'}, E'
322: \in \calR(E)\}$, requiring that these are \emph{consistent},
323: \begin{equation}
324: f_E(x_E) = \sum_{E' \in \calR(E)} f'_{E'}(x_E) \mbox{ for all } x_E.
325: \end{equation}
326: Using the parametric representation $f(x) = \langle \theta, \phi(x)
327: \rangle$, this consistency condition is equivalent to requiring
$\theta_E = \sum_{E' \in \calR(E)} \theta'_{E'}$. We will see that the LR
329: approach to follow may be viewed as an optimization over all such
330: possible consistent splittings. Next, we define the augmented
331: objective function over the graph $\calG'$ as
332: \begin{equation}
333: f'(x') \triangleq \sum_{E \in \calG'} f'_E(x'_E).
334: \end{equation}
This ensures that $f(x) = f'(x')$ where $x' = \zeta(x)$ is the
336: replicated version of $x$, defined by $x'_{v'} = x_v$ for all $v' \in
337: \calR(v)$. This equivalence holds for all \emph{consistent}
338: configurations $x^\prime \in \zeta(\mathbb{X})$, where $x'$ is
339: self-consistent over various replicas of the same node. Thus, we are
340: led to an equivalent optimization problem in the augmented model
341: subject to consistency constraints:
342: \begin{equation}
343: \label{eq:a}
344: f^* \triangleq \max_{x \in \mathbb{X}} f(x) = \max_{x' \in
345: \zeta(\mathbb{X})} f'(x')
346: \end{equation}
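The identity $f(x) = f'(\zeta(x))$ used here follows by regrouping the
terms of $f'$ by equivalence classes (a short verification included
for completeness). For $x' = \zeta(x)$, every replica $E' \in \calR(E)$
satisfies $x'_{E'} = x_E$, so
\begin{displaymath}
f'(\zeta(x)) = \sum_{E' \in \calG'} f'_{E'}(x'_{E'})
= \sum_{E \in \calG} \sum_{E' \in \calR(E)} f'_{E'}(x_E)
= \sum_{E \in \calG} f_E(x_E) = f(x),
\end{displaymath}
where the second equality uses that the sets $\calR(E)$ partition
$\calG'$ and the third is the consistency condition on the splitting.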
347: Expressing the consistency constraint as a set of linear constraints
348: on the model features $\phi$, we obtain:
349: \begin{equation}
350: \begin{array}{ll}
351: \mbox{maximize} & f'(x')\\
352: \mbox{subject to} & \phi_A(x'_A) = \phi_B(x'_B) \mbox{ for all } A \equiv B.
353: \end{array}
354: \end{equation}
Recall that, in the discrete binary model, these features are defined
as $\phi_E(x_E) = \prod_{v \in E} x_v$. Clearly, there is some
redundancy in these constraints: $x_a = x_b$ for all replicated nodes
$a \equiv b$ would ensure that the edges agree. However, these
redundant edge-wise feature constraints do enhance the following
relaxation, because each one contributes its own Lagrange multiplier,
which can only tighten the resulting bound.
361:
362: \subsection{Lagrangian Relaxation}
363:
364: We have now defined an equivalent model on a tractable graph.
365: However, the equivalent \emph{constrained} optimization is still
366: intractable, because the constraints couple some variables of
367: $\calG^\prime$, spoiling its tractable structure. This suggests the
368: use of Lagrangian duality to relax those complicating constraints.
369: Introducing Lagrange multipliers $\lambda_{A,B}$ for each constraint,
370: we define the \emph{Lagrangian}, which is a modified version of the
371: objective function:
372: \begin{equation}\label{eq:Lagrangian}
373: L(x',\lambda) = f'(x') + \sum_{A \equiv B} \lambda_{A,B} \,
374: (\phi_A(x'_A)-\phi_B(x'_B))
375: \end{equation}
Grouping terms by edges $E \in \calG'$, and using $f'_E(x'_E) = \theta'_E \phi_E(x'_E)$, this is represented as
\begin{eqnarray}
L(x',\lambda) &=& \sum_{E \in \calG'} f'_E(x'_E;\lambda) \nonumber \\
f'_E(x'_E;\lambda) &=& \theta'_E(\lambda) \phi_E(x'_E) \nonumber \\
\theta'_E(\lambda) &=& \theta'_E + \sum_{B} \lambda_{E,B} - \sum_{A} \lambda_{A,E}
\end{eqnarray}
387: Note that the Lagrange multipliers may be interpreted as
parameterizing all consistent splittings: $\theta'(\lambda)$ spans the
389: subspace of all consistent $\theta'$ parameters.\footnote{We obtain a
390: minimal $\lambda$ parameterization by only using a subset of
391: constraints in (\ref{eq:Lagrangian}), such that
392: $\{(\phi_A(x')-\phi_B(x'))\}$ are linearly independent.}
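
As a check on this interpretation (a brief verification in the
notation above), summing the modified parameters over the replicas of
a fixed $E \in \calG$ gives
\begin{displaymath}
\sum_{E' \in \calR(E)} \theta'_{E'}(\lambda)
= \sum_{E' \in \calR(E)} \theta'_{E'} + \sum_{A \equiv B, \; A,B \in \calR(E)} (\lambda_{A,B} - \lambda_{A,B})
= \theta_E,
\end{displaymath}
since each multiplier $\lambda_{A,B}$ enters once with a plus sign (at
$A$) and once with a minus sign (at $B$), and $A$ and $B$ lie in the
same equivalence class. In the example of Fig. \ref{fig:toy_example},
this reads $\theta'_1(\lambda) + \theta'_4(\lambda) = (\theta'_1 +
\lambda) + (\theta'_4 - \lambda) = \theta'_1 + \theta'_4$ for every
$\lambda$.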
393:
394: It is tractable to maximize the Lagrangian, as it is defined
395: over the thin graph $\calG^\prime$. The value of this maximization
396: defines the \emph{dual function}:
397: \begin{equation}
398: g(\lambda) = \max_{x'} L(x',\lambda)
399: \end{equation}
400: Note that this is an \emph{unconstrained} optimization over
401: $\mathbb{X}^\prime$, and its solution need not lead to a consistent
402: $x' \in \zeta(\mathbb{X})$. However, if this $x'$ is consistent then
403: it is an optimal solution of the constrained optimization problem
404: (\ref{eq:a}), and hence $x = \zeta^{-1}(x')$ (which is well-defined
405: for consistent $x'$) is also an optimal solution of the original
406: problem. This is the goal of our approach, to find tractable
407: relaxations of the MAP estimation problem which lead to the correct
408: MAP estimate. This motivates solution of the \emph{dual problem}:
409: \begin{equation}\label{eq:dual_problem}
410: \min_\lambda g(\lambda) \triangleq g^*
411: \end{equation}
412: Appealing to well-known results
413: \cite{Bertsekas95,BertsimasTsitsiklis97}, we conclude:
414:
415: \begin{proposition}[Lagrangian duality] We have $g(\lambda) \ge f^*$ for all $\lambda$. Hence $g^* \ge f^*$. If $g(\lambda^*)=g^*$, then one of the
416: following holds:
417: \begin{enumerate}
418: \item[(i)] There exists a consistent solution:
419: \begin{displaymath}
x' \in \arg\max_{x' \in \mathbb{X}'} L(x',\lambda^*) \cap \zeta(\mathbb{X}).
421: \end{displaymath}
422: Then, we have \emph{strong duality} $g^* = f^*$ and the set of \emph{all} MAP estimates is obtained as:
423: \begin{displaymath}
\arg\max_{x' \in \zeta(\mathbb{X})} f'(x') = \arg\max_{x' \in \mathbb{X}'} L(x',\lambda^*) \cap \zeta(\mathbb{X}).
425: \end{displaymath}
426: \item[(ii)] There are no consistent solutions:
427: \begin{displaymath}
\arg\max_{x' \in \mathbb{X}'} L(x',\lambda^*) \cap \zeta(\mathbb{X}) = \emptyset.
429: \end{displaymath}
430: Then, there is a \emph{duality gap} $g^* > f^*$ and \emph{no} choice
431: of $\lambda$ will provide a consistent solution.
432: \end{enumerate}
433: Also, condition (i) holds only if $g(\lambda^*)=g^*$.
434: \end{proposition}
435:
436: \begin{figure}
437: \centering
438: (a)\input{dual_gap.pstex_t}
439: (b)\input{dual_func.pstex_t}
440: \caption{\label{fig:LR}Illustration of the Lagrangian duality in
441: the cases that (a) there is a duality gap and (b) there is no duality
442: gap (strong duality holds).}
443: \vspace{-.4cm}
444: \end{figure}
445:
446: This result generalizes the analogous \emph{strong tree-agreement}
447: optimality condition for TRMP, and clarifies its connection to
448: standard Lagrangian duality results for integer programs. To provide
449: some intuition, we present the following geometric interpretation
450: illustrated in Fig. \ref{fig:LR}. The dual function is the maximum
451: over a finite set of linear functions in $\lambda$ indexed by $x'$.
452: For each $x' \in \mathbb{X}'$, there is a linear function
453: $g(\lambda;x') = \langle a(x'), \lambda\rangle + b(x')$, with $a(x') =
454: (\phi_A(x')-\phi_B(x'))_{A \equiv B}$, which is the gradient, and
455: $b(x') = f'(x')$. The graph of each of these functions defines a
456: hyperplane in $\mathbb{R}^{d+1}$, where $d$ is the number of
457: constraints. The flat hyperplanes, with $a = 0$, correspond to
458: consistent assignments $x' \in \zeta(\mathbb{X})$. The remaining
459: sloped hyperplanes represent inconsistent assignments. Hence, the
460: highest flat hyperplane corresponds to the optimal MAP estimate, with
461: height equal to $f^*$. The dual function $g(\lambda)$ is defined by
462: the maximum height over this set of hyperplanes for each $\lambda$,
463: and is therefore convex, piece-wise linear and greater than or equal
464: to $f^*$ for all $\lambda$. In the case of a duality gap, the
465: inconsistent hyperplanes hide the consistent ones, as depicted in (a),
466: so that the minimum of the dual function is defined by an intersection
467: of slanted hyperplanes corresponding to inconsistent assignments of
468: $x'$. If there is no duality gap, as depicted in (b), then the
469: minimum is defined by the flat hyperplane corresponding to a
470: consistent assignment. Its intersection with slanted hyperplanes
471: defines the polytope of optimal Lagrange multipliers over which the
472: maximum flat hyperplane is exposed.
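
To make this picture concrete, consider the following toy instance (a
hypothetical example constructed for illustration, not one of the
models studied here): a single binary variable $x \in \{0,1\}$ with
$f(x) = \theta x$, replicated as $x' = (x_a, x_b)$ with $\theta =
\theta'_a + \theta'_b$ and the single constraint $x_a = x_b$, so that
$d = 1$. The four linear functions are
\begin{displaymath}
\begin{array}{ll}
g(\lambda;(0,0)) = 0, & g(\lambda;(1,1)) = \theta, \\
g(\lambda;(1,0)) = \theta'_a + \lambda, & g(\lambda;(0,1)) = \theta'_b - \lambda.
\end{array}
\end{displaymath}
The first two are flat, with heights $0$ and $\theta$, so $f^* =
\max\{0,\theta\}$. The two sloped lines cross at $\lambda =
(\theta'_b - \theta'_a)/2$ at height $\theta/2 \le \max\{0,\theta\}$, so
they cannot hide the highest flat line. Hence $g^* = f^*$, and the
optimal multipliers form the interval
\begin{displaymath}
\theta'_b - \max\{0,\theta\} \le \lambda \le \max\{0,\theta\} - \theta'_a,
\end{displaymath}
which has length $|\theta|$.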
473:
474: \subsection{Linear Programming Formulations}
475:
476: We briefly consider a connection between this LR
477: picture and TRMP \cite{Wainwright*nov05,KolmogorovWainwright05} and
478: related linear programming approaches
479: \cite{Feldman*05,Yanover*06,Werner07}. This analysis also serves to
480: understand when different relaxations of the MAP estimation
481: problem will be equivalent.
482:
483: The \emph{epigraph} of the dual function is defined as the set of all
points $(\lambda,h) \in \mathbb{R}^{d+1}$ where $g(\lambda) \le h$, that
is, where $\langle a(x'), \lambda \rangle + b(x') \le h$ for all $x'$. Thus, the
486: minimum of the dual function is equal to the lowest point of the
487: epigraph, which defines a linear program (LP) over $(\lambda,h) \in
488: \mathbb{R}^{d+1}$:
489: \begin{equation}
490: \begin{array}{ll}
491: \mbox{minimize} & h \\
492: \mbox{subject to} & \langle a(x'), \lambda \rangle + b(x') \le h \mbox{ for all } x'.
493: \end{array}
494: \end{equation}
495: Note that there are exponentially many constraints in this
496: formulation, so it is intractable. However, recalling that it
497: \emph{is} tractable to compute the dual function for a given
498: $\lambda$, using the max-product algorithm applied to the thin graph
499: $\calG'$, we seek a more tractable representation of this LP. To
500: achieve this, we consider the LP dual problem obtained by dualizing
501: the constraints, which is always tight \cite{BertsimasTsitsiklis97}.
502: This LP dual should be distinguished from our Lagrangian dual
503: (\ref{eq:dual_problem}) that is the subject of our paper.
504:
505: Introducing non-negative Lagrange multipliers $\mu(x') \ge 0$ for each
506: inequality constraint, indexed by $x' \in \mathbb{X}'$, we obtain the
507: LP Lagrangian:
508: \begin{eqnarray}
509: M(h,\lambda;\mu) &=& h + \mu\left[ \langle a(x'), \lambda \rangle + b(x') - h \right] \nonumber \\
510: &=& \langle \mu[a], \lambda \rangle + \mu[b] + (1-\mu[1]) h,
511: \end{eqnarray}
where $\mu[\cdot]$ denotes $\mu$-weighted summation, e.g.,
513: $\mu[a] = \sum_{x'} \mu(x') a(x')$. The LP dual function is then:
514: \begin{equation}
515: M^*(\mu) \triangleq \min_{h,\lambda} M(h,\lambda;\mu) =
516: \left\{
517: \begin{array}{ll}
518: \mu[b], & \mu[1]=1 \mbox{ and } \mu[a]=0\\
519: -\infty, & \mbox{otherwise.}
520: \end{array}
521: \right.
522: \end{equation}
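The two cases follow from the fact that the LP Lagrangian is linear in
$(h,\lambda)$ (a one-step justification included for clarity):
\begin{displaymath}
\min_{h} \, (1-\mu[1]) h =
\left\{
\begin{array}{ll}
0, & \mu[1]=1\\
-\infty, & \mbox{otherwise,}
\end{array}
\right.
\quad
\min_{\lambda} \, \langle \mu[a], \lambda \rangle =
\left\{
\begin{array}{ll}
0, & \mu[a]=0\\
-\infty, & \mbox{otherwise.}
\end{array}
\right.
\end{displaymath}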
Note that $\mu \ge 0$ and $\mu[1]=1$ imply that $\mu$ is a probability
524: distribution and $\mu[\cdot]$ an expectation operator. Recalling
525: $a(x') \triangleq (\phi_A(x')-\phi_B(x'), A \equiv B)$ and $b(x')
526: \triangleq f'(x')$, we obtain the dual LP:
527: \begin{equation}
528: \label{eq:b}
529: \max_{\mu \ge 0} M^*(\mu) =
530: \left\{
531: \begin{array}{ll}
532: \mbox{maximize} & \mu[f'] \\
533: \mbox{subject to} & \mu[\phi_A] = \mu[\phi_B] \mbox{ for } A \equiv B
534: \end{array}
535: \right.
536: \end{equation}
537: We seek a probability distribution over all configurations of the
538: augmented model that maximizes the expected value of $f'(x')$ subject
539: to constraints that the moments specifying marginal distributions
540: are consistent for replicated nodes and edges of the graph. This is a
541: convex relaxation of the constrained version of problem (4), where the
542: objective and constraint functions have been replaced by their
543: expected values under $\mu$. Note that only marginals $\mu_{E'}$ over
hyperedges $E' \in \calG'$ are needed to evaluate both the objective
545: and the constraints of this LP. Hence, it reduces to one defined over
546: the \emph{marginal polytope} $\calM(\calG')$ \cite{Wainwright*nov05},
547: defined as the set of all \emph{realizable} collections of marginals
548: over the hyperedges of $\calG'$. Moreover, if the graph $\calG'$ is
549: \emph{chordal} \cite{Cowell*99}, then its marginal polytope has a
550: simple characterization. Let $\calM_{\mathrm{local}}(\calG')$ denote
551: the \emph{local marginal polytope} defined as the set of all edge-wise
552: marginal specifications that are consistent on intersections of edges.
553: In general, $\calM(\calG') \subset \calM_{\mathrm{local}}(\calG')$.
554: However, in chordal graphs it holds that $\calM(\calG') =
555: \calM_{\mathrm{local}}(\calG')$. Thus, if $\calG'$ is a thin chordal
graph, we obtain a tractable LP whose value is equal to $g^*$ in
557: our framework.\footnote{Some graphs shown in Fig. \ref{fig:graphs} are
558: not chordal, but they can be extended to a thin chordal graph by
559: adding a few edges. If no two of these new edges are equivalent when
560: mapped into $\calG$, then this does not change $g^*$.}
561:
562: One last step shows the connection to LP approaches
563: \cite{Wainwright*nov05,Feldman*05,Yanover*06}. The key observation is
564: that, roughly speaking,
565: \begin{equation}
\calM_{\mathrm{local}}(\calG') \cap \{\mu \mid \mu(x_A) = \mu(x_B), A \equiv B\} \equiv \calM_{\mathrm{local}}(\calG).
567: \end{equation}
568: This is seen by replicating marginals from $\calG$ to $\calG'$, or by
569: copying (consistent) replicated marginals back to $\calG$.
570: For such consistent $\mu$, we have $\mu[f'] = \mu[f]$, which gives:
571: \begin{equation}
572: g^* = \max_{\mu \in \calM_{\mathrm{local}}(\calG)} \mu[f] \ge \max_{\mu \in \calM(\calG)} \mu[f] = f^*.
573: \end{equation}
574: The maximum over $\calM_{\mathrm{local}}(\calG)$ gives an upper-bound on the maximum over $\calM(\calG) \subset \calM_{\mathrm{local}}(\calG)$. The latter is equivalent to exact MAP estimation and the bound becomes tight if $\calG$ is the set of maximal cliques of a chordal graph. This discussion leads to the following characterization of LR:
575:
576: \begin{proposition}[LR Hierarchy]
\emph{Equivalence:} Let $\calG'_1$ and $\calG'_2$ be the sets of
maximal cliques of two chordal augmented graphs. If
$\Gamma(\calG'_1)=\Gamma(\calG'_2)$ then $g_1^*=g_2^*$ for the
respective dual problems. Let $g^*(\calG)$ denote the common dual
value of all such chordal relaxations where
$\Gamma(\calG')=\calG$. \emph{Monotonicity:} If $\calG_1 \subset
583: \calG_2$ then $g^*(\calG_1) \ge g^*(\calG_2)$. \emph{Strong Duality:}
584: If $\calG$ is the set of maximal cliques of a chordal graph, then
585: $g^*(\calG)=f^*$.
586: \end{proposition}
587:
588: \subsection{Smooth Relaxation of the Dual Problem}
589:
590: \begin{figure}
591: \centering
592: \input{smooth_dual.pstex_t}
593: \caption{\label{fig:smoothLR} Illustration of the ``log-sum-exp''
594: smooth approximation of the dual function, as a function of
595: ``temperature'' $\tau$, and of an optimization procedure for
596: minimizing the non-smooth dual function through a sequence of smooth
597: minimizations.}
598: \vspace{-.4cm}
599: \end{figure}
600:
In this section, we develop a method for solving the dual problem.
One way to minimize $g(\lambda)$ is to use non-smooth
603: optimization methods, such as the subgradient method
604: \cite{BertsimasTsitsiklis97}. Here, we consider an
605: alternative, based on the following smooth approximation of
606: $g(\lambda)$:
607: \begin{equation}
608: g(\lambda; \tau) \triangleq \tau \log \sum_{x' \in \mathbb{X}} \exp\left( \frac{L(x';\lambda)}{\tau}\right)
609: \end{equation}
As illustrated in Fig. \ref{fig:smoothLR}, the parameter $\tau > 0$
611: controls the trade-off between smoothness of $g(\lambda;\tau)$ and how
612: well it approximates $g(\lambda)$. This is known as the
613: ``log-sum-exp'' approximation to the ``max function''
614: \cite{BoydVandenberghe04}:
615: \begin{equation}
616: g(\lambda) \le g(\lambda;\tau) \le g(\lambda) + \tau \log |\mathbb{X}| \mbox{ for all } \tau > 0.
617: \end{equation}
Hence, $g(\lambda;\tau) \rightarrow g(\lambda)$ \emph{uniformly} as $\tau
\rightarrow 0$ and, consequently, $g^*(\tau) \triangleq \min_\lambda g(\lambda;\tau)$
converges to $g^*$.
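
The two-sided bound above follows from an elementary argument, sketched
here for completeness. This is our restatement of a standard fact; it
assumes that the dual function is $g(\lambda) = \max_{x'}
L(x';\lambda)$ and uses the shorthand $M = g(\lambda)$. The sum
contains the maximal term and has at most $|\mathbb{X}|$ terms, each
no larger than $\exp(M/\tau)$, so
\begin{displaymath}
\exp\left(\frac{M}{\tau}\right) \le \sum_{x' \in \mathbb{X}} \exp\left(\frac{L(x';\lambda)}{\tau}\right) \le |\mathbb{X}| \exp\left(\frac{M}{\tau}\right).
\end{displaymath}
Applying the increasing map $t \mapsto \tau \log t$ to all three terms
gives $M \le g(\lambda;\tau) \le M + \tau \log |\mathbb{X}|$. The gap
$\tau \log |\mathbb{X}|$ does not depend on $\lambda$, which is why the
convergence is uniform and why minimizing over $\lambda$ yields $g^*
\le g^*(\tau) \le g^* + \tau \log |\mathbb{X}|$.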
621:
622: The function $g(\lambda;\tau)$ has another useful interpretation. Consider
623: the Gibbs distribution defined by
624: \begin{equation}
p_{\lambda,\tau}(x') = \exp\left( \frac{L(x';\lambda) - g(\lambda;\tau)}{\tau} \right)
626: \end{equation}
627: Here, $\tau > 0$ is the ``temperature'' and $g(\lambda;\tau)$
628: normalizes the distribution for each choice of $\lambda$ and $\tau$,
629: and is equal to the Helmholtz free energy $\mathcal{F}_H(\theta') =
\tau \Phi_\tau(\theta')$, where $\Phi_\tau(\theta') = \log \sum_{x'}
\exp(\tau^{-1} \langle \theta', \phi'(x')\rangle)$ is the usual
632: log-partition function. Thus, $g(\lambda; \tau)$ is a strictly
633: convex, analytic function. Using the moment-generating property of
634: $\Phi_\tau(\theta')$, the gradient of $g(\lambda;\tau)$ is computed
635: as:
636: \begin{eqnarray}
637: \frac{\partial g(\lambda;\tau)}{\partial \lambda_{A,B}} &=&
638: \frac{\partial\Phi_\tau}{\partial\theta'_A} \frac{\partial\theta'_A}{\partial\lambda_{A,B}}
639: + \frac{\partial\Phi_\tau}{\partial\theta'_B} \frac{\partial\theta'_B}{\partial\lambda_{A,B}} \nonumber \\
640: &=& p_{\lambda,\tau}[\phi_A] - p_{\lambda,\tau}[\phi_B]
641: \end{eqnarray}
642: where we use $p[\cdot]$ to denote expectation under $p$. Thus,
643: appealing to strict convexity, there is a unique $\lambda^*(\tau)$
644: that minimizes $g(\lambda;\tau)$ and it is also the unique solution of
645: the set of moment-matching conditions:
646: \begin{displaymath}
647: p_{\lambda,\tau}[\phi_A] = p_{\lambda,\tau}[\phi_B], \mbox{ for all } A \equiv B.
648: \end{displaymath}
649: These moment-matching conditions are equivalent to requiring that
650: the marginal distributions $p_{\lambda,\tau}(x_A)$ and
651: $p_{\lambda,\tau}(x_B)$ are equal for $x_A = x_B$. We also
652: note that $\frac{\partial g(\lambda;\tau)}{\partial \tau} =
653: p_{\lambda,\tau}[-\log p_{\lambda,\tau}]$, which is the \emph{entropy}
654: of $p_{\lambda,\tau}$ and is positive for all $\lambda$. Hence, for a
decreasing sequence $\tau_k > 0$ converging to zero, $g(\lambda;\tau_k)$
656: converges \emph{monotonically} to $g(\lambda)$. Likewise, $g^*(\tau_k)$ converges
657: \emph{monotonically} to $g^*$.
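
The entropy identity can be checked directly. The following short
calculation is a sketch that uses only the definitions above, with the
shorthand (ours) $Z = \sum_{x'} \exp(L(x';\lambda)/\tau)$, so that
$g(\lambda;\tau) = \tau \log Z$:
\begin{align*}
\frac{\partial g(\lambda;\tau)}{\partial \tau}
&= \log Z + \frac{\tau}{Z} \sum_{x'} \exp\left(\frac{L(x';\lambda)}{\tau}\right) \left(-\frac{L(x';\lambda)}{\tau^2}\right) \\
&= \frac{g(\lambda;\tau) - p_{\lambda,\tau}[L]}{\tau}
= p_{\lambda,\tau}[-\log p_{\lambda,\tau}],
\end{align*}
where the last step uses $-\log p_{\lambda,\tau}(x') =
(g(\lambda;\tau) - L(x';\lambda))/\tau$. Since the entropy of a
discrete distribution is nonnegative, $g(\lambda;\tau)$ is
nondecreasing in $\tau$ for each fixed $\lambda$.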
658:
659: Rather than directly optimizing $g(\lambda)$, we instead perform a
660: sequence of minimizations with respect to the functions
661: $g(\lambda;\tau_k)$. At each step, the previous estimate of
662: $\lambda_k^* = \arg\min g(\lambda;\tau_k)$ is used to initialize an
663: iterative method to minimize $g(\lambda;\tau_{k+1})$. This is
664: illustrated in Fig. \ref{fig:smoothLR}. At each step, we use the
665: following optimization procedure based on the marginal agreement
666: condition.
667:
668: \subsubsection{Iterative Log-Marginal Averaging}
669:
670: \begin{figure}
671: \vspace{.3cm}
672: \hrule
673: \begin{tabbing}
674: {\bf ALGORITHM 1 (Discrete LR)}\\
675: It\=erate until convergence:\\
676: Fo\=r $E \in \calG \mbox{ where } r_E>1$\\
677: \>Fo\=r $E' \in \calR(E)$\\
\>\>$\hat{f}_{\tau,E'}(x'_{E'}) = \tau \log p_{\lambda,\tau}(x'_{E'})$\\
679: \>end\\
680: \>$\bar{f}_{\tau,E}(x_E) = r_E^{-1} \sum_{E'} \hat{f}_{\tau,E'}(x_E)$\\
681: \>For $E' \in \calR(E)$\\
\>\>$f'_{E'}(x_E) \leftarrow f'_{E'}(x_E) + \left( \bar{f}_{\tau,E}(x_E) - \hat{f}_{\tau,E'}(x_E) \right)$\\
683: \>end\\
684: end
685: \end{tabbing}
686: \vspace{-.2cm}
687: \hrule
688: \vspace{-.4cm}
689: \end{figure}
690:
691: To minimize $g(\lambda;\tau)$ for a specified $\tau$, starting from an
692: initial guess for $\lambda$ (or, equivalently, an initial splitting of
693: $f$), we develop a block coordinate-descent method. Our approach is
694: in the same spirit as the iterative proportional fitting procedure
695: \cite{Ruschendorf95}.
696:
697: We begin with the case that the augmented model is defined so that no
698: two replicas of a node are contained in the same connected component
699: of $\calG'$. Then, at each step, we minimize over the set of all
700: Lagrange multipliers associated with features defined within any
701: replica of $E$. This is equivalent to solving the condition that the
702: corresponding marginal distributions $p_{\lambda,\tau}(x'_{E'})$ are
703: consistent for all $E' \in \calR(E)$. Algorithm 1 summarizes the
method, which involves computing the log-marginal of each replica
edge and then updating the functions $f'_{E'}$ according to the rule:
706: \begin{equation}
f'_{E'}(x_E) \leftarrow f'_{E'}(x_E) + (\bar{f}_{\tau,E}(x_E) - \hat{f}_{\tau,E'}(x_E))
708: \end{equation}
709: where
710: \begin{displaymath}
711: \hat{f}_{\tau,E'}(x'_{E'}) = \tau \log p_{\lambda,\tau}(x'_{E'}), \;\; \bar{f}_{\tau,E}(x_E) = r_E^{-1} \sum_{E' \in \calR(E)} \hat{f}_{\tau,E'}(x_E).
712: \end{displaymath}
713: After the update, the new log-marginals of all replicas $E'$ are equal
714: to $\bar{f}_{\tau,E}$. Also, these updates maintain a consistent
715: representation: $\sum_{E'} (\bar{f}_{\tau,E} - \hat{f}_{\tau,E'}) =
716: 0$. To handle augmented models with multiple replicas of $E$ in the
717: same connected subgraph, we only update a \emph{subset} of replicas at
718: each step, where no two replicas are in the same subgraph. In some
719: cases, this requires including an extra replica of $E$ to act as an
720: intermediary in the update step.
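
To see why the averaging step equalizes the replica marginals and
cannot increase the smoothed dual, consider the following sketch for
the case where each replica lies in a distinct connected component. It
assumes that $L(x';\lambda)$ decomposes additively over the connected
components of $\calG'$; the symbols $g_c$, $p_{E'}$ and $Z_E$ are our
notation, introduced only for this illustration. Let $g_c$ be the
contribution to $g(\lambda;\tau)$ of the component containing $E'$,
and let $p_{E'}(x_E)$ denote the marginal $p_{\lambda,\tau}(x'_{E'})$
evaluated at $x'_{E'} = x_E$. Adding $\bar{f}_{\tau,E} -
\hat{f}_{\tau,E'}$ to $f'_{E'}$ replaces the unnormalized marginal
$\exp((\hat{f}_{\tau,E'} + g_c)/\tau)$ by $\exp((\bar{f}_{\tau,E} +
g_c)/\tau)$, so that $g_c \leftarrow g_c + \tau \log Z_E$ with
\begin{align*}
Z_E &\triangleq \sum_{x_E} \exp\left(\frac{\bar{f}_{\tau,E}(x_E)}{\tau}\right)
= \sum_{x_E} \prod_{E' \in \calR(E)} p_{E'}(x_E)^{1/r_E} \\
&\le \sum_{x_E} \frac{1}{r_E} \sum_{E' \in \calR(E)} p_{E'}(x_E) = 1,
\end{align*}
by the arithmetic-geometric mean inequality. The new log-marginal of
every replica is $\bar{f}_{\tau,E} - \tau \log Z_E$, that is,
$\bar{f}_{\tau,E}$ up to a normalization constant shared by all
replicas. The update changes $g(\lambda;\tau)$ by $r_E \tau \log Z_E
\le 0$, with equality only if the replica marginals already agree.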
721:
722: Each step of the procedure requires that we compute the marginal
distributions of each replica $E'$ in their respective subgraphs. If
the graphs are thin, these marginals can be computed efficiently, with
computation linear in the size of each subgraph, using standard belief
propagation methods and their junction tree variants. Moreover, if
727: we take some care to store the messages computed by belief
728: propagation, it is possible to amortize the cost of this inference, by
729: only updating a few ``messages'' at each step. In fact, it is only
730: necessary to update those messages along the directed path from the
731: last updated node or edge to the location in the tree (or junction
732: tree) of the node or edge currently being updated. We find that this
733: generally allows a complete set of updates to be computed with
734: complexity linear in $n$. Similar ideas are discussed in
735: \cite{Kolmogorov05}.
736:
737: Using Algorithm 1, together with a rule to gradually reduce $\tau$, we
738: obtain a simple algorithm which generates a sequence $\lambda_k$ such
that $g(\lambda_k)$ converges to $g^*$ and $\lambda_k$ converges to a
740: point in the set of optimal Lagrange multipliers.
741:
742: \subsubsection{Iterative Max-Marginal Averaging}
743:
744: We now consider what happens as $\tau$ approaches zero. The main
745: insight is that the (non-normalized) log-marginals converge to
746: \emph{max-marginals} in the limit as $\tau$ approaches zero:
747: \begin{equation}
748: \hat{f}_{\tau,E'}(x'_{E'}) +
g(\lambda;\tau) \rightarrow \hat{f}_{E'}(x'_{E'})
750: \triangleq \max_{x'_{\setminus E'}} f'(x'_{E'},x'_{\setminus E'};\lambda)
751: \end{equation}
752: Hence, as $\tau$ becomes small, the marginal agreement conditions are
753: similar to a set of \emph{max-marginal agreement} conditions among all
754: replicas of an edge or node. One could consider a ``zero-temperature''
755: version of Algorithm 1 aimed at solving these max-marginal
756: conditions directly:
757: \begin{eqnarray}\label{eq:max_marg_match}
758: f'_{E'}(x_E) &\leftarrow& f'_{E'}(x_E) + \left(\bar{f}_E(x_E) -
759: \hat{f}_{E'}(x_E) \right) \nonumber\\ \bar{f}_E(x_E) &=& r_E^{-1}
760: \sum_{E'} \hat{f}_{E'}(x_E)
761: \end{eqnarray}
762: Here, $\bar{f}_E$ is the averaged max-marginal over all replicas of
763: $E$. Note that $\hat{f}_{E'}(x_E) \ge \hat{f}_E(x_E) \triangleq
764: \max_{x_{\setminus E}} f(x)$ for all $x_E$ and $E' \in
765: \mathcal{R}(E)$, which implies $\bar{f}_E(x_E) \ge \hat{f}_E(x_E)$.
766: This ``zero-temperature'' approach has close ties to max-sum diffusion
(see \cite{Werner07} and references therein) and Kolmogorov's serial
768: approach to TRMP \cite{Kolmogorov05}.
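
The convergence of log-marginals to max-marginals stated at the start
of this subsection can be verified in one line. We sketch the
calculation assuming, as in the definition of $p_{\lambda,\tau}$, that
$L(x';\lambda) = f'(x';\lambda)$. Marginalizing the Gibbs distribution
over $x'_{\setminus E'}$ and taking $\tau \log$ of the result gives
\begin{displaymath}
\hat{f}_{\tau,E'}(x'_{E'}) + g(\lambda;\tau) = \tau \log \sum_{x'_{\setminus E'}} \exp\left( \frac{f'(x'_{E'},x'_{\setminus E'};\lambda)}{\tau} \right).
\end{displaymath}
The right-hand side is a log-sum-exp over the remaining variables, so
it exceeds the max-marginal $\hat{f}_{E'}(x'_{E'})$ by at most $\tau
\log |\mathbb{X}|$ and converges to it as $\tau \rightarrow 0$.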
769:
770: In our framework, one can show that $\lambda^* \triangleq \lim_{\tau
771: \rightarrow 0} \lambda^*(\tau)$ is well-defined and minimizes
772: $g(\lambda)$. This point $\lambda^*$ also satisfies the max-marginal
773: agreement condition and is therefore a fixed point of max-marginal
774: averaging. However, the max-marginal agreement condition by itself
775: does not uniquely determine $\lambda^*$ and, in fact, is not
sufficient to ensure that $g(\lambda)$ is minimized (this is related
777: to the existence of non-minimal fixed-points observed by Kolmogorov).
778: Hence, our approach to minimize $g(\lambda;\tau)$ while gradually
779: reducing the temperature has the advantage that it cannot get stuck in
780: such spurious fixed-points. It also helps to accelerate convergence,
781: because the initial optimization at higher temperatures serves to
782: smooth over irregularities of the dual function.
783:
784: \begin{figure*}
785: \centering
786: \epsfig{file=attractive3_lr_val.eps,scale=0.17}
787: \epsfig{file=attractive1_lr_val.eps,scale=0.17}
788: \epsfig{file=frustrated1_lr_val.eps,scale=0.17}
789: \epsfig{file=frustrated4_lr_val.eps,scale=0.17}
790: \epsfig{file=frustrated2_lr_val.eps,scale=0.17}\\
791: \comment{\epsfig{file=attractive3_max_marg_err.eps,scale=0.17}
792: \epsfig{file=attractive1_max_marg_err.eps,scale=0.17}
793: \epsfig{file=frustrated1_max_marg_err.eps,scale=0.17}
794: \epsfig{file=frustrated4_max_marg_err.eps,scale=0.17}
795: \epsfig{file=frustrated2_max_marg_err.eps,scale=0.17}\\}
796: \epsfig{file=attractive3_lr_est.eps,scale=0.18}\hspace{.1cm}
797: \epsfig{file=attractive1_lr_est.eps,scale=0.18}\hspace{.1cm}
798: \epsfig{file=frustrated1_lr_est.eps,scale=0.18}\hspace{.1cm}
799: \epsfig{file=frustrated4_lr_est.eps,scale=0.18}\hspace{.1cm}
800: \epsfig{file=frustrated2_lr_est.eps,scale=0.18}
801: \caption{\label{fig:discreteLR} Five examples for discrete LR showing:
802: (top row) convergence of $g(\lambda)$ to $g^*$ compared to $f^*$
803: (horizontal line); (bottom row) the resulting estimates generated by
804: relaxed max-marginals (grey areas denote non-unique maximum). The
805: first two columns are examples of attractive models with $\sigma=2
806: \mbox{ and } 1$. The last three columns are frustrated models with
807: $\sigma= 1.5, 1, \mbox{ and } .7$.}
808: \vspace{-.4cm}
809: \end{figure*}
810:
811: \emph{Computational Examples.} In this section we provide some
812: preliminary results using our approach to solve binary MRFs. These
813: examples are for a binary model $x_v \in \{-1,+1\}$ defined on a $10
814: \times 10$ grid similar to the one seen in Fig. \ref{fig:graphs}(a).
815: For each node, we include a node potential $f_v(x_v) = \theta_v x_v$
816: with $\theta_v \sim N(0,\sigma^2)$. For each edge, we include an edge
817: potential $f_{u,v}(x_u,x_v) = \theta_{uv} x_u x_v$ with $\theta_{uv} =
818: 1$ in the ``attractive'' model and random $\theta_{uv} = \pm 1$ in the
819: ``frustrated'' model. Hence, $\sigma$ controls the strength of node
820: potentials relative to edge potentials. As seen in
821: Fig. \ref{fig:discreteLR}, we obtain strong duality $g^*=f^*$ and
822: recover the correct MAP estimates in attractive models. This is
823: consistent with a result on optimality of TRMP in attractive models
824: \cite{KolmogorovWainwright05}. In the frustrated model, the same holds
825: with strong node potentials, but as $\sigma$ is decreased the
frustration of the edge potentials causes a duality gap. However, even
827: in these cases, we have observed that some nodes have a unique maximum
828: in their re-summed max-marginals, and these nodes provide a partial MAP
829: estimate that agrees with the correct global MAP estimate. This is
830: apparently related to the \emph{weak tree agreement} condition for
831: partial optimality in TRMP \cite{KolmogorovWainwright05}.
832:
833: \section{Gaussian Lagrangian Relaxation}
834:
835: In this section we apply the LR approach to the
836: problem of MAP estimation in Gaussian graphical models, which
837: is equivalent to maximizing a quadratic objective function
838: \begin{equation}
839: f(x;h,J) = -\frac{1}{2} x^T J x + h^T x,
840: \end{equation}
841: where $J \succ 0$ is sparse with respect to $\calG$. Again, we
842: construct an augmented model, which is now specified by an information
843: form $(h',J')$, defined by a larger graph $\calG'$. For consistency,
844: we also require $f'(\zeta(x);h',J')=f(x;h,J)$ for all $x$. Denoting
845: variable replication by $\zeta(x) = A x$, this is equivalent to $A^T
846: J' A = J$ and $A^T h' = h$. In order for the dual function to be
847: well-defined, we also require that $J' \succ 0$. For general $J \succ
848: 0$, it is possible that, for a given augmented graph $\calG'$, there
849: do not exist any $J' \succ 0$ defined on $\calG'$ such that $A^T J' A
850: = J$. To avoid this issue, we will focus on models that are of the
851: form:
852: \begin{equation}\label{eq:e}
853: f(x) = \sum_{E \in \calF} f_E(x_E)
854: \end{equation}
855: where $\calF$ is a hyper-graph, composed of cliques of $\calG$, and
856: each term $f_E(x_E)$ is itself a quadratic form $f_E(x_E) =
857: -\frac{1}{2} x_E^T J_E x_E + h_E^T x_E$ based on $J_E \succ 0$. Then,
858: $J = \sum_E [J_E]_V$ is the sum of these (zero-padded)
859: submatrices. Then, it is simple to obtain a valid augmented model. We
860: split each $J_E$ between its replicas as $J_{E'} = r_E^{-1} J_E$ to
861: obtain $J' = \sum_{E' \in \calF'} [J_{E'}]_{V'} \succ 0$.
862:
863: If there exists a representation of $J$ in terms of $2 \times 2$
864: \emph{pairwise} interactions $J_E \succ 0$, it is said to be
865: \emph{pairwise normalizable}. This condition is equivalent to the
866: walk-summability condition considered in \cite{Malioutov*06}, which is
867: related to the convergence (and correctness) of a variety of
868: approximate inference methods \cite{Malioutov*06,Chandrasekaran*07}.
869: Here, we show that for the more general class of models of the form
870: (\ref{eq:e}), we obtain a convergent iterative method for solving the
871: dual problem that is tractable provided the cliques are not too large.
872: Moreover, for this class of Gaussian models, we show that there is
873: \emph{no duality gap} and we always converge to the unique MAP
874: estimate of the model. As an additional bonus, we also find that, by
875: solving marginal agreement conditions in the augmented Gaussian model,
876: we obtain a set of upper-bounds on the variances of each variable,
877: although these bounds are often rather loose.
878:
879: \subsection{Gaussian LR with Linear Constraints}
880:
881: We begin by considering the Lagrangian dual of the
882: following linearly-constrained quadratic program:
883: \begin{equation}
884: \begin{array}{ll}
885: \mbox{maximize} & -\tfrac{1}{2} x'^T J' x' + h'^T x'\\
886: \mbox{subject to} & x'_a = x'_b \mbox{ for all } a \equiv b.
887: \end{array}
888: \end{equation}
889: We may express the linear constraints on $x'$ as $H x' =
890: 0$. Relaxing these constraints leads to the following dual function:
891: \begin{eqnarray}
g(\lambda) &=& \max_{x'} \{ -\tfrac{1}{2} x'^TJ'x'+(h'+H^T\lambda)^T x' \} \nonumber \\
893: &=& \tfrac{1}{2} (h'+H^T \lambda)^T J'^{-1} (h'+H^T \lambda)
894: \end{eqnarray}
895: Moreover, by strong duality of quadratic programming \cite{Bertsekas95}, it holds
896: that $g^*=f^*$. We also note the following equivalent representation
897: of the dual problem:
898: \begin{equation}\label{eq:qp1}
899: g^* = \left\{
900: \begin{array}{ll}
901: \mbox{minimize} & \tfrac{1}{2} h'^T J'^{-1} h'\\
902: \mbox{subject to} & A^T h' = h \\
903: \end{array}
904: \right.
905: \end{equation}
906: Here, $h'$ is the problem variable, and we consider all possible
907: choices of $h'$ that are consistent with $h$ under the constraint $x'
908: = A x$. The optimal choice of $h'$ in this problem is the one which
909: leads to consistency in the estimate $\hat{x}' = J'^{-1} h'$.
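
The structure of this dual can be made explicit with a short
calculation, given here as a sketch under the additional assumption
(ours) that $H$ has full row rank. The inner maximization is attained
at $\hat{x}'(\lambda) = J'^{-1}(h' + H^T \lambda)$, and differentiating
the closed form of $g(\lambda)$ gives
\begin{displaymath}
\nabla g(\lambda) = H J'^{-1} (h' + H^T \lambda) = H \hat{x}'(\lambda).
\end{displaymath}
Thus the gradient of the dual is exactly the constraint violation of
the current estimate. Setting it to zero yields
\begin{displaymath}
\lambda^* = -(H J'^{-1} H^T)^{-1} H J'^{-1} h',
\end{displaymath}
at which $H \hat{x}'(\lambda^*) = 0$, so that all replicas of each
variable agree in the estimate.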
910:
911: \subsection{Quadratic Constraints and Log-Det Regularization}
912:
913: Although, in Gaussian models, it is sufficient to include only linear
914: constraints (there is no duality gap), our method can also accommodate
915: quadratic constraints, and this results in faster convergence and
916: tighter bounds on variances. Consider the constrained optimization
917: problem:
918: \begin{equation}
919: \begin{array}{ll}
920: \mbox{maximize} & -\tfrac{1}{2} x'^T J' x' + h'^T x'\\
921: \mbox{subject to}
& x'_a = x'_b, \; (x'_a)^2 = (x'_b)^2 \mbox{ for all } a \equiv b,\\
& x'_{a_1}x'_{a_2} = x'_{b_1}x'_{b_2} \mbox{ for all } (a_1,a_2) \equiv (b_1,b_2).
924: \end{array}
925: \end{equation}
926: This leads to the following equivalent version of the dual
927: problem with problem variables $(h',J')$:
928: \begin{equation}
929: \begin{array}{ll}
930: \label{eq:c}
931: \mbox{minimize} & \tfrac{1}{2} h'^T J'^{-1} h' \\
932: \mbox{subject to} & A^T h'=h, \; A^T J' A = J, \; J' \succ 0.
933: \end{array}
934: \end{equation}
935: Any solution of the linearly-constrained relaxation provides a
936: feasible point for this problem, so the value of (\ref{eq:c}) is
937: less than or equal to that of (\ref{eq:qp1}). However, since there is
no duality gap in (\ref{eq:qp1}), the values of the two problems
are equal: both achieve $g^*=f^*$ and yield the MAP estimate.
940:
941: While the choice of $J'$ does not affect the value of the dual
problem, it does affect variance estimates and convergence of
943: iterative methods. Hence, we regularize the choice of $J'$ by adding
944: a penalty $-\tfrac{1}{2} \log\det J'$ to the objective of
945: (\ref{eq:c}), which also serves as a barrier function enforcing $J'
946: \succ 0$. The resulting objective function is then equivalent to
the log-partition function $\Phi(h',J')$, which shows a parallel to our earlier approach for
948: ``smoothing'' the dual function in discrete
949: problems. \comment{(although, here, there is no need for a temperature
950: parameter). Similarly, we find that minimizing the log-partition
951: function in the Gaussian model, subject to consistency constraints,
952: reduces to matching means and covariances among replicas of a node or
953: edge.}
954:
955: \subsection{Gaussian Moment-Matching}
956:
957: \begin{figure}
958: \vspace{.3cm}
959: \hrule
960: \begin{tabbing}
961: {\bf ALGORITHM 2 (Gaussian LR)}\\
962: It\=erate until convergence:\\
963: Fo\=r $E \in \calG \mbox{ where } r_E>1$\\
964: \>Fo\=r $E' \in \calR(E)$\\
965: \>\> Compute moments $(\hat{x}_{E'},P_{E'})$ in $(h',J')$.\\
\>\> $\hat{J}_{E'} = P_{E'}^{-1}, \; \hat{h}_{E'} = P_{E'}^{-1} \hat{x}_{E'}$\\
967: \>end\\
968: \>$\bar{J}_E = r_E^{-1} \sum_{E'} \hat{J}_{E'}, \; \bar{h}_E = r_E^{-1} \sum_{E'} \hat{h}_{E'}$\\
969: \>For $E' \in \calR(E)$\\
970: \>\>$J'_{E',E'} \leftarrow J'_{E',E'} + \left(\bar{J}_E - \hat{J}_{E'} \right)$\\
971: \>\>$h'_{E'} \leftarrow h'_{E'} + \left(\bar{h}_E - \hat{h}_{E'} \right)$\\
972: \>end\\
973: end
974: \end{tabbing}
975: \vspace{-.2cm}
976: \hrule
977: \vspace{-.4cm}
978: \end{figure}
979:
980: We develop an approach in the same spirit as the Gaussian iterative
981: scaling method \cite{SpeedKiiveri86}. We minimize the log-partition
982: function with respect to the information parameters over all replicas
of a node or edge, subject to consistency and positive-definiteness
constraints. The optimality condition for this minimization is that
the marginal moments (means and covariances) of all replicas are
986: equalized. It can be shown that the following information-form
987: updates achieve this objective. First, for all replicas $E'$ of $E$,
988: we compute the marginal information parameters given by sparse
989: Gaussian elimination of $C = V' \setminus E'$ in $(J',h')$:
990: \begin{eqnarray}
991: \hat{J}_{E'} &=& J'_{E',E'} - J'_{E',C} (J'_{C,C})^{-1} J'_{C,E'} \nonumber \\
992: \hat{h}_{E'} &=& h'_{E'} - J'_{E',C} (J'_{C,C})^{-1} h'_{C}
993: \end{eqnarray}
994: This is equivalent to $\hat{J}_{E'} = P_{E'}^{-1}$ and $\hat{h}_{E'} =
995: P_{E'}^{-1} \hat{x}_{E'}$. Next, we average these marginal information
996: forms over all replicas:
997: \begin{equation}\label{eq:marg_info_match}
998: \bar{J}_E = r_E^{-1} \sum_{E'} \hat{J}_{E'}, \;\; \bar{h}_E = r_E^{-1} \sum_{E'} \hat{h}_{E'}
999: \end{equation}
1000: Finally, we update the information form according to:
1001: \begin{eqnarray}
1002: J'_{E',E'} &\leftarrow& J'_{E',E'} + (\bar{J}_E - \hat{J}_{E'}) \nonumber \\
1003: h'_{E'} &\leftarrow& h'_{E'} + (\bar{h}_E - \hat{h}_{E'})
1004: \end{eqnarray}
1005: Using the characterization of positive-definiteness of a block matrix
in terms of a principal submatrix and its Schur complement, it can be
1007: shown that this update preserves positive definiteness of $J'$. It
1008: also preserves consistency, e.g., $\sum_{E'} (\bar{J}_E -
1009: \hat{J}_{E'}) = 0$. After the update, the new marginal information
1010: parameters for all replicas of $E$ are equal to
1011: $(\bar{h}_E,\bar{J}_E)$. Algorithm 2 summarizes this iterative
1012: approach for solving the Gaussian LR problem.
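
As a check on these updates, the following sketch verifies the claim
about the new marginal information parameters, under the assumption
(mirroring the discrete case) that the replicas of $E$ lie in distinct
connected components of $\calG'$. In that case the Schur complement
for $E'$ involves only blocks within its own component, and of these
only $J'_{E',E'}$ and $h'_{E'}$ are modified. The marginal information
parameters therefore shift by exactly the amount added:
\begin{align*}
\hat{J}_{E'} &\leftarrow (J'_{E',E'} + \bar{J}_E - \hat{J}_{E'}) - J'_{E',C} (J'_{C,C})^{-1} J'_{C,E'} = \bar{J}_E, \\
\hat{h}_{E'} &\leftarrow (h'_{E'} + \bar{h}_E - \hat{h}_{E'}) - J'_{E',C} (J'_{C,C})^{-1} h'_{C} = \bar{h}_E.
\end{align*}
Moreover, $\bar{J}_E \succ 0$ as an average of positive definite
matrices and $J'_{C,C} \succ 0$ is unchanged, so the Schur complement
characterization gives $J' \succ 0$ after the update.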
1013:
1014: Lastly, using the fact that $\bar{f}_E(x_E) \ge \hat{f}_E(x_E)$ for
1015: all $x_E$ and that there is no duality gap upon convergence, we
1016: conclude that the final equalized marginal information must satisfy
1017: $\bar{J}_E \preceq \hat{J}_E \triangleq J_{E,E} - J_{E,\setminus
1018: E}(J_{\setminus E,\setminus E})^{-1}J_{\setminus E,E}$. Hence, LR
1019: gives an upper-bound on the true variance: $P_E=(\hat{J}_E)^{-1}
1020: \preceq (\bar{J}_E)^{-1}$. If each replica of $E$ is contained in a
1021: separate connected component of $\calG'$, then a tighter bound
1022: holds: $P_E \preceq (r_E \bar{J}_E)^{-1}$.
1023:
1024: \emph{Computational Examples.} We apply LR for two Gaussian models
1025: defined on a $50 \times 50$ 2D grid with correlation lengths
1026: comparable to the size of the field. First, we use the \emph{thin
1027: membrane} model, which encourages neighboring nodes to be similar by
1028: having potentials $f_{ij} = (x_i - x_j)^2$ for each edge
1029: $\{i,j\} \in \calG$. We split the 2D model into vertical strips of
1030: narrow width $K$, which have overlap $L$ (we vary $K$ and set
1031: $L=2$). We impose marginal agreement conditions in $K \times L$ blocks
1032: in these overlaps. The updates are done consecutively, from top to
1033: bottom blocks, from the left to the right strip. A full update of all
1034: the blocks constitutes one iteration. We compare LR to loopy belief
1035: propagation (LBP). The LBP variances are underestimates by $21.5$
1036: percent (averaged over all nodes), while LR variances for $K=8$ are
1037: overestimates by $16.1$ percent. In Figure \ref{fig:gauss_LR} (top) we
1038: show convergence of LR for several values of $K$, and compare it to
1039: LBP. The convergence of variances is similar to LBP, while for the
1040: means LR converges considerably faster. In addition, the means in LR
1041: converge faster than using block Gauss-Seidel on the same set of
1042: overlapping $K \times 50$ vertical strips.
1043:
1044: Next, we use the \emph{thin plate model}, which enforces that each
1045: node $v$ is close to the average of its nearest neighbors $N(v)$ in
1046: the grid, and penalizes curvature. At each node there is a potential:
1047: $f_i(x_i,x_{N(i)}) = (x_i - \tfrac{1}{|N(i)|} \sum_{j \in N(i)}
1048: x_j)^2$. LBP does not converge for this model. LR gives rather loose
1049: variance bounds for this more difficult model: for $K=12$, it
1050: overestimates the variances by $75.4$ percent. More importantly, it
accelerates convergence of the means. In Figure \ref{fig:gauss_LR}
1052: (bottom) we show convergence plots for means and variances, for
1053: several values of $K$. As $K$ increases, the agreement is achieved
1054: faster, and for $K=12$ agreement is achieved in under $13$ iterations
1055: for both means and variances. We note that LR with $K=4$ converges
1056: much faster for the means than block Gauss-Seidel.
1057:
1058: \begin{figure}
1059: \centering
1060: \epsfig{figure=var_LR_gauss_tm_final.eps,scale=.55}
1061: \epsfig{figure=means_LR_gauss_tm_final.eps,scale=.55}\\
1062: \epsfig{figure=var_LR_gauss_tp_final.eps,scale=.55}
1063: \epsfig{figure=means_LR_gauss_tp_final.eps,scale=.55}
1064: \caption{\label{fig:gauss_LR} Convergence plots for variances (left) and
1065: means (right), in the thin-membrane model (top) and thin-plate model (bottom).}
1066: \vspace{-.3cm}
1067: \end{figure}
1068:
1069: \section{Multi-Scale Lagrangian Relaxation}
1070:
1071: In this section, we propose an extension of the LR method considered
1072: thus far. Previously, we have considered relaxations based on
1073: augmented models where $x' = \zeta(x)$ involves replication of
variables. Here, we consider a more general definition of $\zeta$ to
1075: allow the augmented model to include \emph{summary variables}, such as
1076: a sum over a subset of variables, or any linear combination of these.
1077: In discrete models, summary variables can also be non-linear functions
1078: of $x$. For example, ``parity bits'' are used in coding
1079: applications and the ``majority rule'' is used to define
1080: coarse-scale binary variables in the renormalization group approach
1081: \cite{Gidas89}.
1082:
1083: Using this idea, we develop a \emph{multiscale} Lagrangian relaxation
1084: approach for MRFs defined on grids. The purpose of this relaxation is
1085: similar to that of the multigrid and renormalization group methods
1086: \cite{Trottenberg*01,Gidas89}. Iterative methods generally involve
1087: simple rules that propagate information locally within the graph. Using a
1088: multiscale representation of the model allows information to
1089: propagate through coarse scales, which improves the rate of
1090: convergence to global equilibrium. Also, in discrete problems,
1091: such multiscale representations can help to avoid local minima. In
1092: the context of our convex LR approach, we expect this to translate
1093: into a reduction of the duality gap to obtain the optimal MAP estimate
1094: in a larger class of problems.
1095:
1096: \begin{figure}[t]
1097: \centering
1098: (a)\epsfig{file=multiscale_lr_small.eps,scale=0.8}\\
1099: \vspace{.2cm}
1100: (b)\epsfig{file=multiscale_lr_cliques_small.eps,scale=0.8}
1101: \caption{\label{fig:multiscaleLR} Illustration of multiscale
1102: LR method. (a) First, we define an equivalent
1103: multiscale model subject to cross-scale constraints. Relaxing these
1104: constraints leads to a set of single-scale models. (b) Next, each single
1105: scale is relaxed to a set of tractable subgraphs.}
1106: \vspace{-.4cm}
1107: \end{figure}
1108:
1109: \subsection{An Equivalent Multiscale Model}
1110:
1111: We illustrate the general idea with a simple example based on a 1D
1112: Markov chain. While this case is actually tractable by exact methods,
1113: it serves to illustrate our approach, which generalizes to
1114: 2D grids and 3D lattices. In Fig. \ref{fig:multiscaleLR}, we show how
1115: to construct the augmented model $f'(x')$ defined on a graph $\calG'$.
1116: This is done in two stages.
1117:
1118: First, as illustrated in Fig. \ref{fig:multiscaleLR}(a), we introduce
1119: coarse-scale representations of the fine scale variables by
1120: recursively defining summary variables at coarser scales to be
1121: functions of variables at the next level down. This defines a set of
1122: cross-scale constraints, denoted by the square nodes. To allow
1123: interactions between coarse-scale variables, while maintaining
1124: consistency with the original single-scale model, we introduce extra
1125: edges (the dotted ones in Fig. \ref{fig:multiscaleLR}(a)) between
1126: blocks of nodes that have a (solid) edge between their summary nodes
1127: at the next coarser scale. This representation allows us to define a
1128: family of constrained multiscale models that are all equivalent to the
1129: original single-scale model. For 2D and 3D lattices, this model is
1130: still intractable even after relaxing the cross-scale constraints
1131: because each scale is itself intractable.
1132:
1133: Next, to obtain a tractable dual problem, we break up the graph into smaller
1134: subgraphs, introducing additional constraints to enforce consistency
1135: among replicated variables. In the example, we break the augmented
1136: graph at each scale into its maximal cliques, shown in
1137: Fig. \ref{fig:multiscaleLR}(b). This defines the final augmented
1138: model and the corresponding graph. In a 2D graph, the same idea
1139: applies, but we obtain a set of maximal cliques consisting of
1140: overlapping $2 \times 4$ and $4 \times 2$ blocks of the grid.
1141: Alternatively, we could break up the 2D grid into a set of width 2
1142: vertical strips, as discussed previously.
1143:
1144: Now, the procedure is essentially the same as before. We start with
1145: the equivalent constrained optimization problem defined on the
1146: augmented graph, now subject to both in-scale and cross-scale
1147: constraints. We obtain a tractable problem by introducing Lagrange
1148: multipliers to relax these constraints. Then we iteratively adjust
1149: the Lagrange multipliers to minimize the dual function, with the aim
1150: of eliminating constraint violations to obtain the desired MAP
1151: estimate. This is equivalent to adjusting the augmented model
1152: $f'(x')$ on $\calG'$, subject to the constraint that it remains
1153: equivalent to $f(x)$ for all $x' = Ax$.
1154:
1155: \subsection{Gaussian Multiscale Moment-Matching}
1156:
1157: We demonstrate this approach in the Gaussian model. To carry out the
1158: minimization, we again use a block coordinate-descent method that
1159: finds an exact minimum over a subset of Lagrange multipliers at each
step. The replica constraints are handled in the same way as before. Here, we
briefly summarize our approach to handling the cross-scale summary
constraints. Let $x_1$ and $x_2$ denote two random vectors at
1163: consecutive scales coupled by the constraint $x_2 = A x_1$. Let
1164: $(\hat{h}_1,\hat{J}_1)$ and $(\hat{h}_2,\hat{J}_2)$ denote their corresponding \emph{marginal}
1165: information parameters. Relaxing the constraints $x_2 = A x_1$ and
1166: $x_2 x_2^T = A x_1 x_1^T A^T$, with Lagrange multipliers
1167: $(\lambda,-\tfrac{1}{2}\Lambda)$, leads to the following optimality
1168: conditions:
1169: \begin{eqnarray}
1170: (\hat{J}_2+\Lambda)^{-1} &=& A (\hat{J}_1-A^T\!\Lambda\, A)^{-1} A^T \\
1171: (\hat{J}_2+\Lambda)^{-1}(\hat{h}_2+\lambda) &=& A (\hat{J}_1-A^T\!\Lambda\, A)^{-1}(\hat{h}_1-A^T\lambda) \nonumber
1172: \end{eqnarray}
We find that the solution is:\footnote{The formula (\ref{eq:multiscale_update}) corresponds to a generalization of Algorithm 2, in which the moments $(\hat{x}_1,P_1)$ of the fine-scale variables $x_1$ are replaced by the corresponding moments $(A \hat{x}_1, A P_1 A^T)$ of the summary statistic $\tilde{x}_1 = A x_1$.}
1174: \begin{eqnarray}\label{eq:multiscale_update}
1175: \Lambda &=& \tfrac{1}{2} \{(A\hat{J}_1^{-1}A^T)^{-1} - \hat{J}_2\} \nonumber\\
1176: \lambda &=& \tfrac{1}{2} \{(A\hat{J}_1^{-1}A^T)^{-1} A \hat{J}_1^{-1} \hat{h}_1 - \hat{h}_2\}
1177: \end{eqnarray}
1178: The model $(h',J')$ is then updated by adding $(\lambda,\Lambda)$ to
the coarse scale and subtracting $(A^T \lambda,A^T \!\Lambda\, A)$ from the
1180: fine scale. This update enforces the moment conditions $\hat{x}_2 = A
1181: \hat{x}_1$ and $P_2 = A P_1 A^T$ while maintaining consistency of the
1182: model $(h',J')$. Similar updates can be derived when there are
1183: multiple replicas of $x_1$ and $x_2$. These methods, together with
1184: those described previously, are used to minimize the dual function in
1185: the Gaussian multiscale relaxation.
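
As a consistency check, one can verify directly that
(\ref{eq:multiscale_update}) satisfies the optimality conditions. The
following is only a sketch, not necessarily the derivation used to
obtain the update; the symbols $S$ and $M$ are shorthand introduced
for this check, and we assume that $\hat{J}_1$ and $\hat{J}_2$ are
positive definite and that $A$ has full row rank, so that the inverses
below exist. Let $S = A \hat{J}_1^{-1} A^T$ and $M = \hat{J}_1 -
A^T\!\Lambda\, A$. Since $A \hat{J}_1^{-1} M = (I - S \Lambda) A$, we
have $A M^{-1} = (I - S \Lambda)^{-1} A \hat{J}_1^{-1}$, and therefore
\begin{displaymath}
A M^{-1} A^T = (I - S \Lambda)^{-1} S = (S^{-1} - \Lambda)^{-1}.
\end{displaymath}
With $\Lambda = \tfrac{1}{2} (S^{-1} - \hat{J}_2)$, both $\hat{J}_2 +
\Lambda$ and $S^{-1} - \Lambda$ equal $\tfrac{1}{2} (S^{-1} +
\hat{J}_2)$, which gives the first condition. For the second, $\lambda
= \tfrac{1}{2} (S^{-1} A \hat{J}_1^{-1} \hat{h}_1 - \hat{h}_2)$ gives
\begin{displaymath}
A \hat{J}_1^{-1} (\hat{h}_1 - A^T \lambda) = A \hat{J}_1^{-1} \hat{h}_1 - S \lambda = S (\hat{h}_2 + \lambda),
\end{displaymath}
so that $A M^{-1} (\hat{h}_1 - A^T \lambda) = (S^{-1} - \Lambda)^{-1}
(\hat{h}_2 + \lambda) = (\hat{J}_2 + \Lambda)^{-1} (\hat{h}_2 +
\lambda)$.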
1186:
1187: \emph{Multiscale Example.} We provide a preliminary result involving a
1188: 1D thin-membrane model with $1024$ nodes. It is defined to have a
1189: long correlation length comparable to the length of the field. Using a
1190: random $h$-vector, we solve for the MAP estimates using three methods:
1191: a standard block Gauss-Seidel iteration using overlapping blocks of
1192: size 4; the (single-scale) Gaussian LR method with the same choice of
1193: blocks; and the multiscale LR method. The convergence of all three
methods is shown in Fig. \ref{fig:multiscale_lr_example}. We see
that the single-scale LR approach is moderately faster than block
Gauss-Seidel, but introducing coarser scales into the method leads to
a significant speed-up in the rate of convergence.
1198:
1199: \begin{figure}
1200: \centering
1201: \epsfig{file=multiscale_lr_example.eps,scale=.25}
\caption{\label{fig:multiscale_lr_example}Convergence of single-scale and multiscale LR and block Gauss-Seidel.}
1203: \vspace{-.4cm}
1204: \end{figure}
1205:
1206: \section{Discussion}
1207:
1208: We have introduced a general Lagrangian relaxation framework for MAP
1209: estimation in both discrete and Gaussian graphical models. This
1210: provides a new interpretation of some existing methods, provides
1211: deeper insights into those methods, and leads to new generalizations,
1212: such as the multiscale relaxation introduced here. There are many promising
1213: directions for further work. While we have considered discrete and
1214: Gaussian models separately, the basic approach should extend to the
1215: richer class of conditionally Gaussian models \cite{Lauritzen96}
1216: including both discrete and continuous variables. In discrete models,
1217: designing augmented models that capture more structure of the original
1218: problem leads to reduced duality gaps and optimal MAP estimates in
larger classes of models. It would be of great interest to find ways
1220: to \emph{adaptively} search this hierarchy of relaxations to
1221: efficiently reduce and eventually eliminate the duality gap with
1222: minimal computation. It is also of interest to consider
approaches to identify provably \emph{near-optimal} estimates, perhaps
1224: using the relaxed max-marginal estimates, in cases where it is not
1225: tractable to completely eliminate the duality gap.
1226:
1227: %\nocite{Lauritzen96,Cowell*99,Frey98,Wainwright*nov05,Wainwright*jul05,BarndorffNielsen78,WainwrightJordan03,Bertsekas95,BertsimasTsitsiklis97,BertsimasWeismantel05,BoydVandenberghe04,SpeedKiiveri86,Kolmogorov05,KolmogorovWainwright05,Werner07,Yanover*06,Feldman*05,ZhaoLuh98}
1228:
1229: \bibliography{lr}
1230: %\bibliographystyle{plain}
1231: \bibliographystyle{unsrt}
1232:
1233: % These are some excerpts that were omitted due to space
1234: % constraints, but may be useful in a longer version of the paper
1235:
1236: % This goes right before proposition 1
1237:
1238: \comment{Let's review some basic results of Lagrangian duality. The dual
1239: function has two important properties: First, it is a maximum over a
1240: set of linear functions in $\lambda$, and is therefore \emph{convex}.
1241: Second, for every $\lambda$ it provides an upper-bound on the value of
1242: the constrained optimization problem: $g(\lambda) \ge f^*$ for all
1243: $\lambda$. Hence, to determine the best possible choice of $\lambda$,
1244: it is natural to \emph{minimize} the dual function, which is the standard
1245: \emph{Lagrangian dual problem}:
1246: \begin{equation}
1247: g^* = \min_\lambda g(\lambda) = \min_\lambda \max_{x' \in \mathbb{X}'}
1248: L(x';\lambda)
1249: \end{equation}
1250: This may also be interpreted as optimizing over all equivalent models
1251: $f'$ defined on $\calG'$,
1252: \begin{equation}
1253: g^* =
1254: \begin{array}{ll}
1255: \mbox{minimize} & \max_{x'} f'(x') \\
1256: \mbox{subject to} & f'(\zeta(x)) = f(x) \mbox{ for all } x.
1257: \end{array}
1258: \end{equation}
1259: The constrained primal problem (\ref{fig:a}) is equivalent to the reverse max-min
1260: problem:
1261: \begin{equation}
1262: f^* = \max_{x' \in \zeta(\mathbb{X})} f'(x') = \max_{x' \in \mathbb{X}'} \min_\lambda L(x';\lambda)
1263: \end{equation}
1264: This is seen by observing that $L(x',\lambda) = f'(x')$ if
1265: $x' \in \zeta(\mathbb{X})$ and $\min_\lambda L(x',\lambda) = -\infty$
1266: otherwise. It always holds that $g^* \ge f^*$, which is known as the
1267: \emph{minimax inequality}. When $g^* > f^*$, it is said that there is
1268: a \emph{duality gap}. Our aim is to find formulations where there is
1269: no duality gap, that is, where $g^* = f^*$. This occurs if and only if
1270: there exists a \emph{saddle-point}, that is, a pair $(x'^*,\lambda^*)$
1271: such that
1272: \begin{displaymath}
1273: \lambda^* \in \arg\min L(x'^*,\cdot)
1274: \end{displaymath}
1275: and
1276: \begin{displaymath}
1277: x'^* \in \arg\max L(\cdot,\lambda^*).
1278: \end{displaymath}
1279: In this case, it also holds that $\lambda^* \in \arg\min g(\lambda)$
1280: and $x'^* \in \arg\max_{\zeta(\mathbb{X})} f'$. In other words,
1281: each saddle point corresponds to a pair of primal/dual optimal solutions.
1282: Moreover, if there is no duality gap, \emph{every} pair of
primal/dual optimal solutions $(x'^*,\lambda^*) \in
\arg\max_{\zeta(\mathbb{X})} f' \times \arg\min g$ is then a saddle point. We refer the
1285: reader to [CITE] for proofs of these well-known results.
1286:
1287: These elementary considerations lead to the following simple
1288: characterization of whether or not there is a duality gap, and of the
1289: relation between the optimal MAP estimates and those in the relaxed
1290: problem in the case that there is no duality gap:}
1291:
1292: % this comes right after proposition 2
1293:
1294: \comment{However, this does not mean that there is no advantage to grouping
1295: cliques together into larger thin graphs. Doing so reduces the number
1296: of Lagrange multipliers in the dual problem, which can lead to reduced
1297: computations and faster convergence in iterative methods. We also note
1298: that a non-chordal, thin augmented graph $\calG'$ can be extended to a
chordal graph of the same width by adding fill edges to $\calG'$. If
1300: there exists such a chordal extension such that no two fill edges map
1301: back to the same edge in $\calG$, this does not change the value of
1302: $g^*$. For example, from these considerations we conclude the
1303: following relations between the examples shown in Fig. \ref{fig:graphs}:
1304: \begin{displaymath}
1305: (b) = (c) = (d) \ge (e) = (g) \ge (h)
1306: \end{displaymath}
1307: For instance, $(e)=(g)$ because we can obtain chordal versions of both
1308: graphs by adding diagonal edges within each cell without any
1309: replicated chords, so their dual values do not change, and both
1310: chordal graphs then use the same set of maximal cliques in $\calG$.
1311: But $(g)\ge(h)$ because adding edges to make $(h)$ chordal introduces
1312: larger cliques than $(g)$.}
1313:
1314: % this goes in the discussion of max-marginals
1315:
1316: \comment{Assuming there is a unique MAP estimate in the original problem, then
1317: there are two typical cases: If there is no duality gap, the
max-marginal estimates obtained will typically each have a unique
1319: maximum $x_E^* = \arg\max \bar{f}(x_E)$ for all $E \in \calG$. Then,
1320: these are consistent and the global MAP estimate $x^*$ is recovered.
1321: When there are ties in some of the max-marginal estimates, this
1322: usually indicates that there is a duality gap and no consistent
1323: solutions. However, in some exceptional cases, it is possible that
1324: there is no duality gap even in this case. To be certain, one would
1325: have to check for a consistent solution $x^*$ that simultaneously
1326: maximizes all of these relaxed max-marginals.}
1327:
1328: % right after proposition 1
1329:
1330: \comment{If the original problem has a unique MAP estimate, then it typically
1331: holds that, in the case of no duality gap, the relaxed problem has a
1332: unique solution, and this then provides the MAP estimate $x^* =
1333: \zeta^{-1}(x'^*)$.}
1334:
1335: \end{document}
1336: