0012:cond-mat0012191/num.tex

1: \section{Numerical Experiments.} \label{secnum}

2:

3: \subsection{Description of the numerical implementation of the DPLL algorithm.}

4:

5: We have implemented DPLL with the GUC rule, see Figure~\ref{algo} and

6: Section~\ref{heuri}, to have a fast unit

7: propagation and an inexpensive backtracking \cite{Cra}. The program is

8: divided in three parts. The first routine draws the

9: clauses and  represents the data in a convenient structure.  The second,

10: main routine updates and saves the state of the search, {\em i.e.}  the

11: indices and values of assigned variables, to allow an

12: easy backtracking. Then, it checks if a solution is found; if not,

13: a new variable is assigned.

14: The third routine extracts the implication of the choice (propagation).

15: If unit clauses have been generated, the corresponding literals are

16: fixed, or a contradiction is detected.

17:

18: \subsubsection{Data Representation}

19:

20: Three arrays are used to encode the data: the two first arrays are labelled

21: by the clause number $m=1,\ldots, M$ and the number $b=1,\ldots, K$ of

22: the components in the clause (with $K=2,3$).

23: The entries of these arrays, initially drawn at random, are the indices

24: $clausnum\, (m,b)$ and  the values $ clausval\,(m,b)$,

25: true or false, of the variables. The indices of the third array are

26: integers $i=1,\ldots ,N$ and $j=1,\ldots , o_i$ where $o_i$ is the number

27: of occurrences of $x_i$  in the clauses (from zero to $M$).

28: The entries of the matrix, $a\,(i,j)$,  are the numbers of the corresponding

29: clauses (between 1 and $M$).

30:

31: \subsubsection{Updating of the search state.}

32:

33: If the third routine has found a contradiction,  the second routine

34: goes back along the branch, inverts the last assigned variable

35: and calls again the third routine. If not, the descent along the branch

36: is pursued. A Boolean-valued vector points to

37: the assigned variables, while the values are stored

38: in another unidimensional array.

39: For each clause, we check if the variables are already assigned and, if so,

40: if they are in agreement or not with the clauses. When splitting occurs,

41: a new variable is fixed to satisfy a 2-clause, {\em i.e.} a clause with

42: one false  literal and two unknown variables, and the third subroutine

43: is called. If there are only 3-clauses, a

44: new variable is fixed to satisfy any 3-clause and

45: the third subroutine is called. The variable

46: chosen and its value are stored in a vector with index the length of the

47: branch, {\em i.e.} the number of nodes it contains, to allow

48: future backtracking. If there are neither 2- nor 3-clauses left, a solution is

49: found.

50:

51: \subsubsection{Consequences of a choice and unit propagation}

52:

53: All clauses containing the last fixed variable

54: are analyzed by taking into account all possibilities: 1. the clause

55: is satisfied; 2. the clause is reduced to a 2- or 1-clause; 3. the

56: clause is violated (contradiction). In the second case, the 1-clause is stored

57: to be analyzed by unit-propagation once all clauses containing

58: the variable have been reviewed.

59:

60: \subsection{Characteristic running times}

61:

62: We have implemented the DPLL search algorithm in Fortran~77;

63: the algorithm runs on a Pentium II PC with a 433M Hz frequency clock.

64: The number of nodes added per minute ranges from

65: 300,000 (typically obtained for $\alpha=3.5$) to

66: 100,000 ($\alpha=10$) since unit propagation is more and more

67: frequent as $\alpha$ increases.

68: The order of magnitude of the computational time needed to solve an instance

69: are listed in Table~\ref{tabtemps} for ratios corresponding to hard instances.

70: These times limit the maximal size $N$ of the instances we have

71: experimentally studied and the number of samples over which we have

72: averaged. Some rare instances

73: may be much harder than the typical times indicated in Table~\ref{tabtemps}.

74: For instance, for $\alpha=3.1$ and $N=500$, instances

75: are usually solved in about 4 minutes but some samples required

76: more than 2 hours of computation. Such a phenomenon will be discussed

77: in Section~\ref{foc}.

78:

79: \subsection{Overview of experiments }

80:

81: \subsubsection{Number of nodes of the search tree}

82:

83: We have first investigated complexity by a direct  count of

84: the number of  splittings, that is the number of nodes (black

85: points) in Figure~\ref{tree}, for sat (Figure~\ref{tree}A,C) and

86: unsat (Figure~\ref{tree}B) trees.

87:

88: \subsubsection{Histogram of branch length.}

89:

90: We have also experimentally investigated  the structure

91: of search trees for unsat instances (Figure~\ref{tree}B).

92: A  branch is defined as a path joining the root (first

93: node on the top of the search tree) to a leaf  marked

94: with a contradiction $C$ (or a solution $S$ for sat

95: instance) in Figure~\ref{tree}.

96: The length of a branch is the number of nodes it contains.

97: For an unsat instance, the complete search tree depends on

98: the variables chosen at each split, but not on the values they are assigned to.

99: Indeed, to prove that there is no solution, all nodes in the search tree

100: have to carry two outgoing branches,

101: corresponding to the two choices of the attached variables. What

102: choice is made first does not matter.

103: This simple remark will be of  crucial importance in the theoretical analysis

104: of Section~\ref{paral}.

105:

106: We have derived the histogram of the branch lengths by counting

107: the number $B(l)$ of branches having length $l \,N$ once the tree

108: is entirely built up. The histogram is very useful to deduce the

109: complexity in the unsat phase, since in a complete tree the

110: total number of branches $B$ is related  to the number of nodes $Q$

111: through the identity,

112: \begin{equation}

113: \label{eqnb}

114: B=\sum _{l=\frac 1N , \ldots , 1} B(l) = Q+1 \qquad .

115: \end{equation}

116: that can be inferred from Figure~\ref{tree}B.

117:

118: \subsubsection{Highest backtracking point} \label{high}

119:

120: Another key property we have focused upon is the

121: highest backtracking point in the search tree.

122: In the unsat phase, DPLL backtracks all the nodes of the tree since no

123: solution can be present. The highest backtracking point in the tree

124: simply coincides with the top (root) node. In the sat phase, the

125: situation is more involved. A solution generally requires some

126: backtracking and the highest backtracking node may be defined as the

127: closest node to the origin through which two branches pass, node $G$ on

128: Figure~\ref{tree}B. We experimentally keep trace of the

129: highest backtracking point by measuring the numbers $C_2 (G)$, $C_3(G)$

130: of 2- and 3-clauses, the number of not-yet-assigned variables

131: $N (G)$, and  computing the coordinates

132: $p_G=C_3(G)/(C_2(G)+C_3(G)), \alpha_G=(C_2 (G)+C_3 (G))/N(G)$

133: of $G$ in the phase diagram of Figure~\ref{diag}.

134:

135: \subsection{Experimental Results}

136:

137: \subsubsection{Fluctuations of complexity.}

138: \label{foc}

139: The size of the search tree built by DPLL is a random variable, due to

140: the (quenched) randomness of the 3-SAT instance and the choices made

141: by the splitting rule (``thermal noise''). We show on

142: Figure~\ref{histoseuil} the distribution of the logarithms (in base 2, and

143: divided by $N$) of the number of nodes for different values

144: of $N$. The distributions are more and more peaked around their mean

145: values $\omega _N (\alpha )$ as the size $N$ increases. This indicates

146: that the logarithm of the complexity is a self-averaging quantity in

147: the thermodynamic limit.  However, fluctuations are dramatically

148: different at low and large ratios.  For $\alpha =10$, and more

149: generally in the unsat phase, the distributions are roughly

150: symmetric (Figure~\ref{histoseuil}A).  Tails are small and

151: complexity does not fluctuate too much from sample to

152: sample\cite{Kirk2,Easy}. In the vicinity of $\alpha_L$, e.g. $\alpha

153: =3.1$, much bigger fluctuations are present. There are large tails on

154: the right flanks of the distributions on Figure~\ref{histoseuil}B, due

155: to the presence of rare and very hard samples \cite{Easy}. Complexity is not

156: self-averaging. We will come back to this point in

157: section~\ref{secmixed}.

158:

159: \subsubsection{The easy-hard-easy pattern.}

160:

161: We have averaged the logarithm of the number of nodes over

162: 10,000 randomly drawn instances to obtain $\omega _N (\alpha)$.

163: The  typical size $Q$ of the search tree is simply $Q=2^{N \omega _N}$.

164: Results are shown in Figure~\ref{tempi}. An easy-hard-easy pattern of

165: complexity appears clearly as the ratio $\alpha$ varies.

166:

167: \begin{itemize}

168: \item At small ratios, complexity increases as $\gamma (\alpha ) N$,

169: that is, only linearly with the size of the instance.

170: DPLL easily finds a solution and the search tree

171: essentially reduces to a single branch shown on Figure~\ref{tree}A.

172: For the GUC heuristic, the linear regime extends up to $\alpha _L

173: \simeq 3.003$ \cite{Fra,Fri}.

174:

175: \item Above threshold, complexity grows exponentially

176: with $N$\cite{Chv}. The logarithm $\omega

177: (\alpha )$, limit of $\omega _N (\alpha)$ as $N \to \infty$, is maximal at

178: criticality...

179:

180: \item ... and decreases at large ratios as $1/\alpha$ \cite{Bea}.

181: The ``easy'' region on the right

182: hand side of Figure~\ref{tempi} is still exponential but with

183: small coefficients $\omega$.

184: \end{itemize}

185:

186: Of particular interest is the intermediate region $\alpha _L < \alpha

187: < \alpha _C$. We shall show that complexity is exponential in this

188: range of ratios, and that the search tree is a mixture of the

189: search trees taking place in the other ranges $\alpha < \alpha _L$

190: and $\alpha > \alpha _C$.

191:

192:

193: Let us mention that, while this paper is devoted to typical-case

194: (happening with probability one) complexity, rigorous results have

195: been obtained that apply to any instance. So far, using a refined

196: version of DPLL, any instance is guaranteed to be solved in less than

197: $1.504^N$ steps, {\em i.e.} $\omega < 0.588$. The reader is referred

198: to reference \cite{Kul} for this worst-case analysis.

199:

200: \subsubsection{Lower sat phase ($\alpha<\alpha_L$)}

201: \label{secnumlin}

202:

203: The complexity data of Figure~\ref{tempi} obtained for different

204: sizes $N=50, 75, 100$ are plotted again on Figure~\ref{constante} after

205: division by $N$. Data collapse on a single curve, proving that

206: complexity is linear in the regime $\alpha < \alpha _L$.

207: In the vicinity of the cross over ratio $\alpha =\alpha_L$

208:  finite-size effects became important in this

209: region. We have thus performed extra simulations for larger sizes

210: $N=500, 1000$ in the range $2.5 \le \alpha \le 3.$ that confirm the

211: linear scaling of the complexity.

212:

213: \subsubsection{Unsat Phase ($\alpha>\alpha_C$)}

214:

215: Results for the shape of the search trees are shown in

216: Figure~\ref{histo}.  We represent the logarithm $b(l)$, in base 2 and divided

217: by $N$, of the number $B(l)$ of branches as a function of the branch length

218: $l$, averaged over many samples and for different sizes $N$ and ratios

219: $\alpha.$ When $\alpha$ increases

220: at fixed $N$, branches are shorter and shorter and less and less

221: numerous, making complexity decrease (Figure~\ref{tempi}).

222:

223: As $N$ gets large at fixed $\alpha$,

224: the histogram $b(l)$ becomes a smooth function

225: of $l$ and we can replace the discrete sum in (\ref{eqnb}) with

226: a continuous  integral on the length,

227: \begin{equation}

228: Q+1=\sum_{l=\frac 1N,\ldots,1}  2^{N\,b(l)}

229: \simeq N\int_{0}^{1} dl \;

230: 2^{N\,b(l)} \qquad . \label{int741}

231: \end{equation}

232: The integral is exponentially dominated by

233: the maximal value $b_{max}$ of $b(l)$. $\omega$, the limit of the logarithm

234: of the complexity divided by $N$, is therefore equal to $b_{max}$.

235: Nicely indeed, the height $b_{max}$ of the histogram does not depend on $N$

236: (within the statistical errors)

237: and gives a straightforward and precise estimate of $\omega$, not

238:  affected by finite-size effects. The values of $b_{max}$ as a function

239: of $\alpha$ are listed in the third column of Table~\ref{tabunsat}.

240:

241: The above discussion is also very useful to interpret the data

242: on the size $Q$ of the search trees.

243: From the quadratic correction around the saddle-point,

244: $b(l) \simeq b_{max} -  \beta (l-l_{max})^2 /2 $,

245: the expected finite size correction to $\omega _N$ read

246: \begin{equation}

247: \label{nodeq}

248: \omega _N  \simeq \omega + \frac 1{2N} \log_2 N +

249: \frac{1}{2N} \log_2 \left[ \frac{2\pi}{\beta \ln 2} \right] + O \left(

250: \frac 1 {N^2} \right) \qquad .

251: \end{equation}

252: We have checked the validity of this equation by fitting

253: $\omega_N - \log_2 N / (2N)$ as a polynomial function of $1/N$.

254: The constant at the origin gives values of $\omega$ in very good

255: agreement with $b_{max}$ (second column in Table~\ref{tabunsat})

256: while the linear term gives access to the curvature $\beta$.

257: We compare in Table~\ref{tabfit} this curvature with the direct

258: measurements of $\beta$ obtained by looking at the vicinity of the

259: top of the histogram. The agreement is fair, proving that equation

260: (\ref{nodeq}) is an accurate way of extrapolating data on $Q$ to the

261: infinite size limit.

262:

263: \subsubsection{Upper sat phase ($\alpha _L <\alpha < \alpha _C$)}

264:

265: To investigate the sat region slightly below threshold $\alpha _L <

266: \alpha  < \alpha _C$, we have carried out simulations with a starting

267: ratio $\alpha  =3.5$. Results are shown on Figure~\ref{comp3.5}A.

268: As instances are sat with a high probability, no simple

269: identity relates the number of nodes $Q$ to the number of branches $B$,

270: see search tree in Figure~\ref{tree}C and

271: we measure the complexity through $Q$ only. Complexity clearly scales

272: exponentially with the size of the instance

273: and exhibits large fluctuations from sample to sample. The annealed

274: complexity (logarithm of the average complexity), $\omega _3 ^{ann}$,

275: is larger than the typical solving hardness (average of the logarithm

276: of the complexity), $\omega _3 ^{typ}$, see Table~\ref{tabmixed}.

277:

278: To reach a better understanding of the structure of the search tree,

279: we have focused on the highest backtracking point G defined in

280: Figure~\ref{tree}C and Section~\ref{high}. The coordinates $p_G, \alpha _G$

281: of point G, averaged over instances are shown for increasing sizes

282: $N$ on Figure~\ref{pgalphag}. The coordinates of G exhibit strong

283: sample fluctuations which make the large $N$ extrapolation, $p_G

284: =0.78 \pm 0.01$, $\alpha _G = 3.02 \pm 0.02$ rather imprecise.

285:

286: In Section~\ref{secmixed}, we shall show how the solving complexity in the

287: upper sat phase is related to the solving complexity of corresponding

288: 2+p-SAT problems with parameters $p_G, \alpha _G$.

289:

290:

291:

292:

293:

294: