%\documentstyle[epsf,11pt]{article}
\documentstyle[epsf,aps,prl,12pt,preprint]{revtex}
%\documentstyle[epsf,multicol,aps,prl]{revtex}
\def\baselinestretch{1.5}
%\topmargin=-10mm
%\oddsidemargin=-0.2mm
%\evensidemargin=-0.2mm
%\textwidth=160mm
%\textheight=215mm
\pagenumbering{arabic}

\title{\bf {\em Minimum} Entropy Approach to Word Segmentation Problems}
\author{Bin Wang \\
 {\small Institute of Theoretical Physics, Chinese Academy of Sciences},\\
  {\small P.O. Box 2735, Beijing 100080, P. R. China.}\\
 {\small State Key Laboratory of Scientific and Engineering Computing},\\
 {\small Institute of Computational Mathematics and Scientific/Engineering Computing}, \\
  {\small P.O. Box 2719, Beijing 100080, P. R. China.}\\
  }
%\date{May 21, 2000}
\begin{document}
\maketitle
%\widetext
\vspace{1cm}

\begin{abstract}
Given a sequence composed of a limited number of characters, we try to ``read'' it as a ``text''. This involves segmenting the sequence into ``words''. The difficulty is to distinguish good segmentations from the enormous number of random ones. Aiming to reveal the nonrandomness of the sequence as strongly as possible, we apply the maximum likelihood method and find a quantity, called {\bf segmentation entropy}, that can be used for this purpose. Contrary to the common practice of applying the maximum entropy principle to obtain a good solution, we choose to {\em minimize} the segmentation entropy to obtain a good segmentation. The concept developed in this letter can be used to study noncoding DNA sequences, e.g., for the prediction of regulatory elements in eukaryotic genomes.

\vspace{0.4cm}
\noindent
%PACS number: 87.10.+e, 89.70.+c.
\end{abstract}

%\widetext

\clearpage
\newpage
%\begin{multicols}{2}

\section{Introduction.}
The problem addressed in this paper is rather elementary in statistics. It is best described as follows: suppose someone who knows nothing about the English language is given a sequence of English letters, obtained by removing all the interword delimiters from a sample of English text. How could he recover the words of the text by choosing where to insert spaces between adjacent letters? Note that the only thing he can consult is the statistical properties of the sequence.

Any two adjacent letters can either belong to the same word (stay adjacent) or belong to separate words (be separated by a space). Suppose the sequence length is $N$. Any choice of the connectivity of the $N-1$ pairs of adjacent letters is called a segmentation. There are $2^{N-1}$ possible segmentations in total. The word segmentation problem is to find ways to distinguish the correct segmentation from the others. A segmentation is correct if letters that are adjacent in the original text stay adjacent, while letters separated by spaces and/or punctuation marks in the original text are separated by spaces in the segmentation.

Although the problem seems toy-like, its fundamental importance for statistical linguistics is evident. We study it, however, also for practical purposes. Noncoding sequences in genomes play an essential role in the regulation of gene expression and function~\cite{liw}. However, the development of computational methods for extracting regulatory elements lags far behind DNA sequencing and gene finding~\cite{regulate}. One reason is the lack of an efficient way to discriminate the large number of sequence signals in noncoding DNA sequences. Linguistic studies have shown that noncoding sequences in eukaryotic genomes are structurally similar to natural and artificial languages~\cite{stanley}. One may therefore expect to ``read'' the noncoding sequences as a ``text''. Indeed, efforts have been made to build a dictionary for genomes~\cite{trifonov,li}. Bussemaker et al.~\cite{li} showed the connection between the prediction of regulatory elements and word segmentation in noncoding DNA sequences of eukaryotic genomes. We expect that progress on the word segmentation problem may help to deepen our knowledge of the noncoding regions of eukaryotic genomes. Besides, word segmentation is an important issue in the processing of Asian languages (e.g., Chinese and Japanese)~\cite{ponte}, because they lack interword delimiters.

\section{Segmentation entropy and its connection to the word segmentation problem.}
To tackle the word segmentation problem, we first consider a problem under constraints, so that one important concept, the segmentation entropy, can be introduced. The constraints will be relaxed at the end of this paper. Suppose we know that there are $n_l$ words of length $l$ $(l=1,2,\cdots)$ in the original text. Obviously,
\begin{equation}
\sum_l n_l\,l=N.
\end{equation}
Under these constraints, which we call the Word Length Constraints (WLC), there are in total
\begin{equation}
\frac{(\sum_l n_l)!}{\prod_l n_l!}
\end{equation}
segmentations. For example, for the following story there are $3.12\times10^{144}$ segmentations in total, while the number under WLC is about $1.33\times10^{97}$.


\begin{quote}
\begin{center}
{\sl
The Fox and the Grapes
}
\end{center}
{\small
{\sl
   \ \ \ \ Once upon a time there was a fox strolling through the woods. He came upon a grape orchard. There he found a bunch of beautiful grapes hanging from a high branch.

  \ \ \ \ ``Boy those sure would be tasty,'' he thought to himself. He backed up and took a running start, and jumped. He did not get high enough.

  \ \ \ \ He went back to his starting spot and tried again. He almost got high enough this time, but not quite.

  \ \ \ \ He tried and tried, again and again, but just couldn't get high enough to grab the grapes.

  \ \ \ \ Finally, he gave up.

  \ \ \ \ As he walked away, he put his nose in the air and said: ``I am sure those grapes are sour.''
}
}
\end{quote}
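
The two counts can be checked numerically on a small case. The following Python sketch is only an illustration under stated assumptions: the function names \verb|count_all| and \verb|count_wlc| are hypothetical, and the example uses the title of the story rather than its full text.

\begin{verbatim}
```python
from collections import Counter
from math import factorial


def count_all(n_letters):
    """Number of ways to cut a sequence of n_letters letters: 2^(N-1)."""
    return 2 ** (n_letters - 1)


def count_wlc(word_lengths):
    """Multinomial count of Eq. (2): orderings of a fixed multiset of lengths."""
    n_l = Counter(word_lengths)          # n_l[l] = number of words of length l
    total = factorial(sum(n_l.values()))
    for count in n_l.values():
        total //= factorial(count)       # exact: partial quotients are integers
    return total


# Illustrative input (assumption): the title with its spaces removed.
lengths = [len(w) for w in "the fox and the grapes".split()]
print(count_all(sum(lengths)))           # all segmentations of the 18 letters
print(count_wlc(lengths))                # segmentations allowed by the WLC
```
\end{verbatim}

For the title, the WLC leave only the position of the single six-letter word free, which is why the constrained count is so much smaller than $2^{N-1}$.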


Following the least effort principle~\cite{zipf}, natural languages favor combining existing words to express different meanings. Shannon~\cite{shannon} pointed out the importance of redundancy in natural languages long ago: roughly speaking, nearly half of the letters in a sample of English text can be deleted and still be restored by someone else. These properties of natural language ensure that the sequence obtained by removing the interword delimiters from a text is highly nonrandom and shows deterministic and regular characteristics. The correct segmentation is expected to reveal these characteristics as strongly as possible. From the information-theoretic point of view, this means that, if a form of information entropy can be properly defined on each segmentation, the entropy of the correct segmentation will be the smallest.


Interestingly, a maximum likelihood approach leads to the same proposal and automatically gives the definition of the entropy. Given a sequence of length $N$, we look for a likelihood function that reaches its maximum on the correct segmentation. For a given segmentation, we assign a probability to each of its $M$ distinct words,
\begin{equation}
w_i\to{p_i}, \qquad  i=1,\ldots,M
\end{equation}
with
\begin{equation}
\sum_{i=1}^M{p_i}=1.
\end{equation}
The likelihood function is written as
\begin{equation}
Z_s=\prod_{i=1}^M{{p_i}^{m_il_i}}
\end{equation}
where $m_i$ is the number of occurrences of the word $w_i$ in the segmentation, and $l_i$ is the length of the word.

Maximizing the likelihood function subject to Eq.~(4), we obtain
\begin{equation}
p_i=\frac{m_il_i}{N}.
\end{equation}
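The maximization step can be made explicit with a Lagrange multiplier. The following is a standard calculation added for illustration; the multiplier $\lambda$ is a symbol introduced here and does not appear elsewhere in the paper. Taking the logarithm of the likelihood and adding the normalization constraint of Eq.~(4) gives the stationarity condition
\begin{displaymath}
\frac{\partial}{\partial p_i}\left[\sum_{j=1}^M m_jl_j\ln p_j-\lambda\left(\sum_{j=1}^M p_j-1\right)\right]
=\frac{m_il_i}{p_i}-\lambda=0,
\end{displaymath}
so that $p_i=m_il_i/\lambda$. Summing over $i$ and using the normalization together with $\sum_{i=1}^M m_il_i=N$ fixes $\lambda=N$, which reproduces the result above.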

Thus the maximum likelihood for the segmentation is
\begin{equation}
Z_s=\prod_{i=1}^M{\left(\frac{m_il_i}{N}\right)^{m_il_i}}.
\end{equation}
The segmentation with maximum likelihood is just the one minimizing
\begin{equation}
S=-\frac{\ln Z_s}{N}=-\sum_{i=1}^M{\frac{m_il_i}{N}\ln\left(\frac{m_il_i}{N}\right)}.
\end{equation}
This function has the form of an entropy~\cite{shannon} and will be called the segmentation entropy (SE).
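
As a concrete reading of this definition, the Python sketch below evaluates $S$ for a segmentation given as a list of words. It is a minimal illustration, not the code used for the results of this paper; the function name \verb|segmentation_entropy| is hypothetical, and treating words as lowercase strings is an assumption.

\begin{verbatim}
```python
from collections import Counter
from math import log


def segmentation_entropy(words):
    """Segmentation entropy S = -sum_i (m_i l_i / N) ln(m_i l_i / N).

    `words` is the segmentation as a list of strings; m_i is the number of
    occurrences of the distinct word w_i and l_i is its length.
    """
    counts = Counter(words)                  # m_i for each distinct word
    n_letters = sum(len(w) for w in words)   # N, the sequence length
    weights = [m * len(w) / n_letters for w, m in counts.items()]
    return -sum(p * log(p) for p in weights)


saying = "god is nowhere as much as he is in the soul and the soul means the world"
print(round(segmentation_entropy(saying.split()), 4))
```
\end{verbatim}

For the correct segmentation of the saying by Meister Eckhart discussed in this paper, the sketch should reproduce the reported value of about 2.3802, and it returns zero when the whole sequence is taken as a single word.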


Starting from a maximum likelihood approach, we have thus arrived at the suggestion to minimize the segmentation entropy. This is in contrast to common practice: in some cases, maximizing the likelihood leads to maximizing a certain entropy~\cite{frieden,jaynes}. As a general principle for investigating statistical problems, the maximum entropy method has been successfully applied in a variety of fields~\cite{frieden,jaynes}. We propose that, instead of applying the maximum entropy principle, one may choose to minimize a certain entropy (a minimum entropy principle) in some problems. This seems attractive especially in the era of bioinformatics, when most problems are to reveal regularity in large amounts of seemingly random sequences.


Because the present method is statistical, the text under study must not be too short. For example, when we tried to find the segmentation with the smallest segmentation entropy for the saying
\begin{quote}
{\sl God is nowhere as much as he is in the soul... and the soul means the world}
\end{quote}
(by Meister Eckhart, 14th-century Dominican priest, preacher, and theologian), we found that, among a total of $343062720$ segmentations under WLC, there are 15 segmentations whose SE is $2.3684$, smaller than the value $2.3802$ of the correct one. One example is
\begin{quote}
{\sl god isnow {\bf he} rea smuchas {\bf he} is int {\bf he} {\bf soul} andt {\bf he} {\bf soul} meanst {\bf he} world},
\end{quote}
in which the five copies of {\em ``he''} and the two copies of {\em ``soul''} are revealed.


Unfortunately, present computational power does not permit an exhaustive study of even a text as short as {\sl ``The Fox and the Grapes''}, for which the number of permitted segmentations under WLC is $1.33\times10^{97}$. We therefore examine the relevance of the concept of segmentation entropy in some special ways. The study focuses on ``The Fox and the Grapes''.


To change a segmentation slightly, one way is to choose two adjacent words along the sequence at random and then exchange their lengths. In this way the original two words may change into different words. The procedure can be repeated on the resulting segmentation, and it does not violate the WLC. Because of the large number of possible choices at each step, the segmentation is expected to become increasingly dissimilar to the original one. Starting from the correct segmentation of ``The Fox and the Grapes'', we follow the evolution of SE as the segmentation is changed in this way. Figure 1 shows that SE increases drastically in the first 500 steps, and then reaches and fluctuates around a certain equilibrium value. Compared with the gap between the equilibrium value and the original SE, the fluctuation is minor. This shows that, at least locally, the correct segmentation is at a minimum of SE. In fact, we have traced a trajectory of evolution up to $10^{10}$ steps, and no segmentation with SE smaller than that of the correct one was observed. This suggests that the SE of the correct segmentation is also globally minimal.
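
The elementary move described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the program used for Fig. 1: the function name \verb|swap_adjacent_lengths| is hypothetical, the pair is drawn uniformly among adjacent pairs, and the title of the story stands in for the full text.

\begin{verbatim}
```python
import random


def swap_adjacent_lengths(words, rng):
    """Pick two adjacent words at random and exchange their lengths.

    The letters stay in place; only the boundary between the two words moves,
    so the multiset of word lengths (the WLC) is unchanged.
    """
    i = rng.randrange(len(words) - 1)     # index of the left word of the pair
    joined = words[i] + words[i + 1]
    cut = len(words[i + 1])               # left word takes the right word's length
    return words[:i] + [joined[:cut], joined[cut:]] + words[i + 2:]


rng = random.Random(0)                    # fixed seed, only for reproducibility
words = "the fox and the grapes".split()  # illustrative starting segmentation
for _ in range(5):
    words = swap_adjacent_lengths(words, rng)
print(words)
```
\end{verbatim}

Recording the segmentation entropy after each call gives a trajectory of the kind described in the text. A move on two words of equal length leaves the segmentation unchanged, which the sketch does not try to avoid.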


The distribution of the segmentation entropy may give further insight into the atypicality of the SE of the correct segmentation. We randomly sampled $10^{10}$ segmentations in the following way: while keeping the WLC, the length of each word in the segmentation is assigned randomly. The distribution of SE is shown in Fig. 2. The minimal SE we sampled is 4.5298, still much higher than the value 4.097 of the correct segmentation (see Fig. 1). It is interesting to observe that the distribution shows fractal characteristics. The fractal-like distribution is also present for other texts, and even for a random sequence (Fig. 3). The fractal-like feature is determined by the WLC and the statistical structure of the sequence under study. In Fig. 3 we compare the distributions of SE of two sequences (under the same WLC): the original sequence of {\sl ``The Fox and the Grapes''} and a random sequence obtained by randomizing the order of letters in the text. The result is in accordance with the fact that the original sequence is in a much more ordered state, showing that the segmentation entropy successfully captures the statistical structure of the sequences.
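
One possible way to draw a random segmentation under the WLC is to shuffle the list of word lengths and cut the sequence accordingly. The Python sketch below illustrates this reading of the sampling procedure; the function name \verb|random_segmentation| is hypothetical, and the title of the story is used as a small stand-in for the full text.

\begin{verbatim}
```python
import random


def random_segmentation(sequence, word_lengths, rng):
    """Cut `sequence` into words whose lengths are a random permutation of
    `word_lengths`, so the numbers n_l of words of each length are kept."""
    lengths = list(word_lengths)
    rng.shuffle(lengths)
    words, start = [], 0
    for length in lengths:
        words.append(sequence[start:start + length])
        start += length
    return words


text = "the fox and the grapes"            # illustrative input (assumption)
lengths = [len(w) for w in text.split()]
sequence = text.replace(" ", "")
rng = random.Random(1)                     # fixed seed, only for reproducibility
print(random_segmentation(sequence, lengths, rng))
```
\end{verbatim}

Because every distinct ordering of the lengths corresponds to the same number of permutations of the list, a uniform shuffle samples the segmentations allowed by the WLC with equal probability.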


There is a way to estimate the number of segmentations whose SE is 4.097, the value for the correct segmentation. Figure 4 shows the distribution of SE from Fig. 2 on a logarithmic scale. The left edge of the distribution falls on a line, which can be fitted by $e^{(165x-750.42)}$. The number of segmentations with SE $x$ among the $1.33\times10^{97}$ possible segmentations under WLC is then
\begin{equation}
c(x)=\frac{1.33e^{97}}{9\times10^9}e^{(165x-750.42)}.
\end{equation}
We obtained $c(4.097)=0.96$. From the distribution of SE shown in Fig. 3(a) we obtained the same value of $c(4.097)$. The estimate supports the idea that the segmentation entropy of the correct segmentation is unique.


We now consider how to relax the WLC. Unfortunately, searching for the segmentation with the smallest SE among all possible segmentations is sure to fail to find the correct one. For example, the SE of the segmentation in which the whole sequence is considered as one word (single-word segmentation) is 0, the smallest possible SE. Also, the segmentation in which each letter is viewed as a separate word ($N$-word segmentation) has a considerably small SE (2.8655 for {\sl ``The Fox and the Grapes''}). These are called side attraction effects. These examples show that a smaller SE does not necessarily mean a better segmentation when we compare the SEs of segmentations under different WLC. (Here WLC refers to any partition into numbers of words of various lengths satisfying Eq.~(1), not necessarily the same as in the original text.) The bias induced by different WLC must be removed. To do so, we suggest using
\begin{equation}
R_S=\frac{S}{S_0}
\end{equation}
instead of $S$. Here $S_0$ is the average SE, under the same WLC, of a sequence obtained by randomizing the order of letters in the original text. $S_0$ plays the role of the chemical potential of a thermodynamic system~\cite{chempot}. $R_S$ for the single-word and $N$-word segmentations is 1, the largest possible value.

By searching for the segmentation with the smallest $R_S$, one expects to find a meaningful segmentation. For example, for the segmentation
\begin{quote}
{\sl god isnow {\em he} rea smuchas {\em he} is int {\em he} {\em soul} andt {\em he} {\em soul} meanst {\em he} world,}
\end{quote}
which has already been shown above, $R_S$ is 0.8601, while
\begin{quote}
{\sl god {\em is} now {\em he} re {\em as} much {\em as} {\em he} {\em is} int {\em he} {\em soul} {\em an} dt {\em he} {\em soul} me {\em an} st {\em he} world}
\end{quote}
is a better segmentation according to $R_S$, in fact one of the best ($R_S=0.8259$). Intuitively this is reasonable, because in this second segmentation more repeated ``words'' are revealed: two copies each of {\em ``is''}, {\em ``as''} and {\em ``an''}. Another segmentation
\begin{quote}
{\sl god {\em is} now {\em he} re {\em as} much {\em as} {\em he} {\em is} in {\em thesoul} {\em an} d {\em thesoul} me {\em an} st {\em he} world},
\end{quote}
which differs from the second segmentation by revealing the two copies of {\em ``thesoul''}, has a moderately small $R_S$ of 0.8481. The comparison shows that the five repeats of {\em ``he''} are the most preferred part of good segmentations.
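
The ratio $R_S$ can be estimated by simple Monte Carlo sampling. The Python sketch below is a hedged illustration, not the procedure used to obtain the values quoted above: the names \verb|estimate_rs| and \verb|cut| are hypothetical, and $S_0$ is approximated by averaging over both shuffled letter orders and random orderings of the same word lengths, which is one assumed reading of ``average SE under the same WLC''.

\begin{verbatim}
```python
import random
from collections import Counter
from math import log


def segmentation_entropy(words):
    """S = -sum_i (m_i l_i / N) ln(m_i l_i / N) over the distinct words."""
    n_letters = sum(len(w) for w in words)
    weights = [m * len(w) / n_letters for w, m in Counter(words).items()]
    return -sum(p * log(p) for p in weights)


def cut(sequence, lengths):
    """Cut a string into consecutive words of the given lengths."""
    words, start = [], 0
    for length in lengths:
        words.append(sequence[start:start + length])
        start += length
    return words


def estimate_rs(words, n_samples, rng):
    """Monte Carlo estimate of R_S = S / S_0 for one segmentation.

    Assumption: S_0 is the mean SE over sequences with shuffled letters,
    each cut with a random ordering of the same word lengths.
    """
    s = segmentation_entropy(words)
    letters = list("".join(words))
    lengths = [len(w) for w in words]
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(letters)
        rng.shuffle(lengths)
        total += segmentation_entropy(cut("".join(letters), lengths))
    s0 = total / n_samples
    # Single-word case: S = S_0 = 0; R_S is set to 1 as stated in the text.
    return s / s0 if s0 > 0 else 1.0


rng = random.Random(0)                     # fixed seed, only for reproducibility
saying = "god is nowhere as much as he is in the soul and the soul means the world"
print(estimate_rs(saying.split(), 2000, rng))
```
\end{verbatim}

By construction the sketch returns 1 for the $N$-word segmentation, since shuffling the letters does not change the letter counts, and it sets $R_S=1$ for the single-word segmentation, where $S=S_0=0$. Both conventions follow the statement in the text.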


\section{Concluding remarks.}
In statistical linguistics much effort is devoted to signal extraction and statistical inference. Our method, however, is new in at least two respects. First, it requires neither an assumption on the distribution~\cite{peitra} nor training sets or lexical or grammatical knowledge~\cite{ponte}. This feature is important for studying biological sequences, because present knowledge of the ``language'' of life (DNA) is still lacking. Second, instead of extracting a limited number of signals, we try to ``read'' the sequence exactly as a ``text''. A text includes more than words: it also includes the organization of words. The results of segmentation form a basis for many further elaborations.

In principle, the concept of segmentation entropy can be applied to study the noncoding DNA sequences of eukaryotic genomes. It is expected that such a study may give more than some meaningful ``words'' or regulatory elements. Possible applications are of course not confined to noncoding DNA sequences: segmentation entropy can be used to find patterns in any symbolic sequence. However, the application of segmentation entropy is restricted by the difficulty of finding the segmentation with the smallest $R_S$ among the vast number of possible ones. We are now developing an algorithm for the prediction of regulatory binding sites, in which the minimum entropy principle will be incorporated.


\section*{ACKNOWLEDGMENTS}
I thank Professor Bai-lin Hao, who helped to make the computing possible. I also thank Professor Wei-mou Zheng and Professor Bai-lin Hao for stimulating discussions. Mr. Xiong Zhang carefully read the manuscript. The work was supported in part by the National Science Foundation.


\clearpage
\newpage

\begin{thebibliography}{99}
\renewcommand{\baselinestretch}{0.2}
{\small
\bibitem{liw} See, e.g., W. Li, {\em Molecular Evolution} (Sinauer Associates, 1997).
\bibitem{regulate} A.G. Pedersen, P. Baldi, Y. Chauvin, and S. Brunak, Comput. Chem. {\bf 23}, 191 (1997).
\bibitem{stanley} R.N. Mantegna, S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H.E. Stanley, Phys. Rev. Lett. {\bf 73}, 3169 (1994).
\bibitem{trifonov} V. Brendel, J.S. Beckmann, and E.N. Trifonov, J. Biomol. Struct. Dyn. {\bf 7}, 11 (1986); P.A. Pevzner, M.Y. Borodovsky, and A.A. Mironov, J. Biomol. Struct. Dyn. {\bf 6}, 1013 (1989).
\bibitem{li} H.J. Bussemaker, H. Li, and E.D. Siggia, Preprint.
\bibitem{ponte} J.M. Ponte and W.B. Croft, UMass Computer Science Tech Rep. 1996-2002 (1996), available at ftp://ftp.cs.umass.edu/pub/techrept/techreport/1996; R. Ando and L. Lee, Cornell CS Report TR99-1756 (1999), available at http://www.cs.cornell.edu/home/llee/papers.html.
\bibitem{zipf} G.K. Zipf, {\em Human Behavior and the Principle of Least Effort} (Addison-Wesley Press, Reading, 1949).
\bibitem{shannon} C.E. Shannon, Bell System Tech. J. {\bf 27}, 379 (1948).
\bibitem{frieden} B.R. Frieden, J. Opt. Soc. Am. {\bf 62}, 511 (1972); E.T. Jaynes, Phys. Rev. {\bf 106}, 620 (1957); {\bf 108}, 171 (1957).
\bibitem{jaynes} N. Wu, {\em The Maximum Entropy Method and its Applications in Radio Astronomy}, Ph.D. thesis (Sydney University, 1985).
\bibitem{chempot} See, e.g., L.E. Reichl, {\em A Modern Course in Statistical Physics} (Arnold, 1980).
\bibitem{peitra} S. Della Pietra, V. Della Pietra, and J. Lafferty, IEEE Transactions on Pattern Analysis and Machine Intelligence {\bf 19}, 1 (1997).
}

\end{thebibliography}


\clearpage
\newpage

\begin{figure}[p]
\vspace{2cm}
\centerline{\epsfxsize=10cm \epsfbox{figure1.eps}}
\label{evolve}
\vspace{2cm}
\caption{The evolution of the segmentation entropy. Starting from the correct one, the segmentation was changed step by step by exchanging the lengths of a randomly chosen pair of adjacent words along the sequence. The dotted line corresponds to the smallest segmentation entropy, 4.5298, among the $10^{10}$ randomly sampled segmentations; see Fig. 2.}
\end{figure}


\clearpage
\newpage

\begin{figure}[p]
\vspace{2cm}
\centerline{\epsfxsize=10cm \epsfbox{figure2.eps}}
\label{stat}
\vspace{2cm}
\caption{The distribution of the segmentation entropy of $9\times10^{9}$ segmentations randomly chosen for the text ``The Fox and the Grapes''. The numbers of words of various lengths in the original text were first counted. In the sampled segmentations these numbers were kept, but the length of each word along the sequence was randomly assigned.}
\end{figure}


\clearpage
\newpage

\begin{figure}[p]
\vspace{2cm}
\centerline{\epsfxsize=10cm \epsfbox{figure3.eps}}
\label{compare}
\vspace{2cm}
\caption{Comparison of the distribution of segmentation entropy for two sequences: the original sequence of {\sl ``The Fox and the Grapes''}, and a random sequence obtained by randomizing the order of letters in the original text. For each sequence, $10^{9}$ segmentations are randomly sampled in the way described in the caption of Fig. 2.}
\end{figure}


\clearpage
\newpage

\begin{figure}[p]
\vspace{2cm}
\centerline{\epsfxsize=10cm \epsfbox{figure4.eps}}
\label{fit}
\vspace{2cm}
\caption{The distribution of segmentation entropy shown in Fig. 2, plotted here on a logarithmic scale. The line along the left edge of the distribution is $e^{(165x-750.42)}$.}
\end{figure}
%\end{multicols}

\end{document} \bye