1: %\documentstyle[epsf,11pt]{article}
2: \documentstyle[epsf,aps,prl,12pt,preprint]{revtex}
3: %\documentstyle[epsf,multicol,aps,prl]{revtex}
4: \def\baselinestretch{1.5}
5: %\topmargin=-10mm
6: %\oddsidemargin=-0.2mm
7: %\evensidemargin=-0.2mm
8: %\textwidth=160mm
9: %\textheight=215mm
10: \pagenumbering{arabic}
11: 
12: \title{\bf {\em Minimum} Entropy Approach to Word Segmentation Problems}
13: \author{Bin Wang \\   
14:  {\small Institute of Theoretical Physics, Chinese Academy of 
15: Sciences},\\
16:   {\small P.O. Box 2735, Beijing 100080, P. R. China.}\\
17:  {\small State Key Laboratory of Scientific and Engineering Computing},\\
18: {\small Institute of Computational Mathematics and Scientific/Engineering Computing}, \\
19:   {\small P.O. Box 2719, Beijing 100080, P. R. China.}\\
20:   }
21: %\date{May 21, 2000}
22: \begin{document}
23: \maketitle
24: %\widetext
25: \vspace {1cm}
26: \begin{abstract}
Given a sequence composed of a limited number of characters, 
we try to ``read'' it as a ``text''. This involves segmenting the sequence into ``words''. The 
difficulty is to distinguish a good segmentation from the enormous number of 
random ones.
Aiming to reveal the nonrandomness of the sequence as strongly as 
possible, we apply the maximum likelihood method and find a 
quantity, called {\bf segmentation entropy}, that can be used for 
this purpose. Contrary to the common practice 
of applying the maximum entropy principle to obtain a good solution, we 
choose to {\em minimize} the segmentation entropy to obtain a good 
segmentation. The concept developed in this letter can be used to 
study noncoding DNA sequences, e.g., for the prediction of regulatory 
elements in eukaryotic genomes.
42: 
43: \vspace {0.4cm}
44: \noindent
45: %PACS number: 87.10.+e, 89.70.+c. 
46: \end{abstract}
47: %\widetext
48: 
49: \clearpage
50: \newpage
51: %\begin {multicols}{2}
52: 
53: \section{Introduction.}
The problem addressed in this paper is rather 
elementary in statistics. It is best described as follows: 
suppose someone who knows nothing about the English language 
is given a sequence of English letters, which was actually 
obtained by removing all the interword delimiters 
from a sample of English text. 
How could he recover the words of the text by choosing where to 
insert spaces between 
adjacent letters? Note that the only thing he can
consult is the statistical properties of the sequence.
64: 
Any two adjacent letters can be chosen either to belong to the same word (remain 
adjacent) or to belong to separate words (be separated by a space). 
Suppose the sequence length is $N$. 
Any choice of the connectivity of the $N-1$ pairs of 
adjacent letters is called a segmentation. 
There are a total of $2^{N-1}$ possible
segmentations. The word segmentation problem is to find a way to 
distinguish the 
correct segmentation from the others. Here ``correct'' means that letters 
adjacent in the original text remain 
adjacent, while letters separated by spaces and/or punctuation 
marks in the original 
text are separated by spaces in the segmentation. 
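
As a small illustration (a toy example of ours, not taken from the text under study), consider the three-letter sequence {\sl abc}. It has $N-1=2$ gaps between adjacent letters, each of which is either closed or opened, so that
\[
2^{N-1}=2^{2}=4
\]
segmentations exist: {\sl abc}, {\sl a bc}, {\sl ab c}, and {\sl a b c}. Each additional letter doubles this number, which is why exhaustive enumeration quickly becomes impractical.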
79: 
Although the problem seems toy-like, its fundamental 
importance for statistical linguistics is evident. We study it, 
however, also for practical purposes. Noncoding sequences in the genomes 
of species play 
an essential role in the regulation of gene expression and function~\cite{liw}. 
However, the development of computational methods for extracting regulatory elements lags far behind DNA sequencing and gene finding~\cite{regulate}.
One reason is the lack of an efficient way to discriminate the large 
number of sequence signals in noncoding DNA sequences. 
Linguistic studies have shown that noncoding sequences in 
eukaryotic genomes are structurally similar to natural and 
artificial languages~\cite{stanley}. Thus one may expect to ``read'' the 
noncoding sequences as a ``text''. Indeed, efforts have been made to 
build a dictionary for genomes~\cite{trifonov,li}. Bussemaker, Li, and Siggia~\cite{li} 
showed the connection between the prediction of regulatory elements 
and word segmentation in noncoding DNA sequences of eukaryotic genomes. 
We expect that 
progress on the word segmentation problem may help to deepen our knowledge of 
the noncoding 
regions of eukaryotic genomes. Besides, word segmentation is an important 
issue in the processing of Asian languages (e.g., Chinese and Japanese)~\cite{ponte}, 
because they lack interword delimiters.
101: 
102: \section{Segmentation entropy and its connection to word segmentation problem.}
To tackle the word segmentation problem,
we first consider a problem under constraints, so that one important 
concept, the segmentation entropy, can be introduced. The 
constraints will be released at the end of this paper. 
Suppose we know that there are $n_l$ words of length $l$ 
$(l=1,2,\cdots)$ in the original text. Obviously,
\begin {equation}
\sum_l{n_l l}=N. 
\end {equation}
Under these constraints, which we call the word length constraints (WLC), there are 
in total 
\begin {equation}
\frac{(\sum_l{n_l})!}{\prod_l{n_l!}}
\end {equation}
segmentations. For example, for the following story there are in total 
$3.12\times10^{144}$ segmentations, while the number under the WLC is about $1.33\times10^{97}$. 
120: 
121: \begin {quote}
122: \begin {center}
123: {\sl
124:                        The Fox and the Grapes
125: }
126: \end {center}
127: {\small
128: {\sl
129:    \ \ \ \ Once upon a time there was a fox strolling through the woods. 
130:  He came upon a grape orchard.  There he found a bunch of beautiful 
131: grapes hanging from a high branch.	
132: 
133:   \ \ \ \ ``Boy those sure would be tasty," he thought to himself.  He 
134: backed up and took a running start, and jumped.  He did not get high 
135: enough.
136: 
137:   \ \ \ \ He went back to his starting spot and tried again.  He almost 
138: got high enough this time, but not quite.	
139: 
140:   \ \ \ \ He tried and tried, again and again, but just couldn't get 
141: high enough to grab the grapes.	
142: 
143:   \ \ \ \ Finally, he gave up.	
144: 
145:   \ \ \ \ As he walked away, he put his nose in the air and said: ``I am 
146: sure those grapes are sour."
147: }
148: }
149: \end {quote}
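
To illustrate the counting formula of eq.(2) on a case small enough to enumerate by hand, take the hypothetical five-letter sequence {\sl abcab} (our own toy example) with the WLC $n_1=1$, $n_2=2$. Then $\sum_l n_l l=1+4=5=N$, and the number of segmentations is
\[
\frac{(n_1+n_2)!}{n_1!\,n_2!}=\frac{3!}{1!\,2!}=3,
\]
namely {\sl a bc ab}, {\sl ab c ab}, and {\sl ab ca b}. Without the constraints there would be $2^{4}=16$ segmentations.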
150: 
Following the principle of least effort~\cite{zipf}, it is favorable in 
natural languages to 
combine existing words to express different meanings. 
Shannon~\cite{shannon} pointed out the 
importance of redundancy in natural languages long ago: generally 
speaking, nearly half of the letters 
in a sample of English text can be deleted and someone else can still 
restore them. These 
properties of natural language ensure that the sequence obtained by 
removing the interword delimiters from a text is highly nonrandom 
and shows deterministic and regular characteristics. 
It is expected that the correct segmentation reveals these 
characteristics as 
strongly as possible. From the point of view of information theory, this means 
that, if 
a form of information entropy can be properly defined on each segmentation, 
the entropy of the correct 
segmentation will be the smallest.
169: 
Interestingly, a maximum likelihood approach leads to the same 
proposal and automatically gives the definition of the entropy.
Given a sequence of length $N$, we seek a likelihood function that 
reaches its maximum on the correct 
segmentation. For a given segmentation containing $M$ distinct words, we assign a probability to 
each word $w_i$ in it, 
\begin {equation}
w_i\to{p_i}, \qquad  i=1,\ldots,M,
\end {equation}
with
\begin {equation}
\sum_{i=1}^M{p_i}=1.
\end {equation}
The likelihood function is written as
\begin {equation}
Z_s=\prod_{i=1}^M{{p_i}^{m_il_i}},
\end {equation}
where $m_i$ is the number of occurrences of the word $w_i$ in the segmentation and $l_i$ 
is the length of the word.
189: 
By maximizing the likelihood function subject to 
eq.(4) we obtain
192: \begin {equation}
193: p_i=\frac{m_il_i}{N}.
194: \end {equation}
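One way to verify this result (a sketch using a Lagrange multiplier $\lambda$, which is our notation and does not appear elsewhere in this paper) is to maximize $\ln Z_s$ under the normalization of eq.(4):
\[
\frac{\partial}{\partial p_i}\left[\sum_{j=1}^M m_jl_j\ln p_j-\lambda\left(\sum_{j=1}^M p_j-1\right)\right]
=\frac{m_il_i}{p_i}-\lambda=0 ,
\]
so that $p_i=m_il_i/\lambda$. Summing over $i$ and using $\sum_i p_i=1$ together with $\sum_i m_il_i=N$ gives $\lambda=N$, which reproduces eq.(6). Since $\ln Z_s$ is concave in the $p_i$, this stationary point is the maximum.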
Thus the maximum likelihood for the segmentation is
\begin {equation}
Z_s=\prod_{i=1}^M{\left(\frac{m_il_i}{N}\right)^{m_il_i}}.
\end {equation}
The segmentation with the maximum likelihood is just the one minimizing 
\begin {equation}
S=-\frac{\ln Z_s}{N}=-\sum_{i=1}^M{\frac{m_il_i}{N}\ln\left(\frac{m_il_i}{N}\right)}.
\end {equation}
203: This function has the form of entropy~\cite{shannon} and will be called Segmentation Entropy (SE).
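
As a hedged illustration of how eq.(8) behaves, consider a toy sequence of our own, {\sl abcab} ($N=5$), and its three segmentations into one word of length 1 and two words of length 2. For {\sl ab c ab} the distinct words are {\sl ab} ($m=2$, $l=2$) and {\sl c} ($m=1$, $l=1$), so that
\[
S=-\frac{4}{5}\ln\frac{4}{5}-\frac{1}{5}\ln\frac{1}{5}\approx 0.500 ,
\]
while for {\sl a bc ab} and {\sl ab ca b}, each of which contains three distinct words with $m_il_i=1,2,2$,
\[
S=-\frac{1}{5}\ln\frac{1}{5}-2\times\frac{2}{5}\ln\frac{2}{5}\approx 1.055 .
\]
In this small example the segmentation that reveals the repeated word {\sl ab} has the smallest SE.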
204: 
Starting from a maximum likelihood approach, we thus arrive at the 
suggestion 
to minimize the segmentation entropy. 
This is in contrast to common practice: 
maximizing the likelihood leads to maximizing a certain entropy in some 
cases~\cite{frieden,jaynes}. 
As a general principle for investigating statistical problems, the maximum 
entropy method has been successfully applied in a 
variety of fields~\cite{frieden,jaynes}. We propose that, instead of 
applying the maximum entropy principle, one may choose to minimize a certain entropy 
(minimum entropy principle) in some problems. This seems attractive 
especially in the era of bioinformatics, when most of the 
problems are to reveal regularity in large numbers of seemingly 
random sequences.
219: 
Because the present method is statistical, the 
text under study must not be too short. For example, when we 
tried to 
find the segmentation with the smallest segmentation entropy for the 
saying 
\begin {quote}
{\sl God is nowhere as much as he is in the soul... and the soul 
means the world} 
\end {quote}
(by Meister Eckhart, 14th-century Dominican 
priest, preacher, and theologian), we found that, among a total of 
$343062720$ segmentations under the WLC, 
there are 15 segmentations 
whose SE is $2.3684$, smaller than the value $2.3802$ of the correct one. One example is 
\begin {quote}
{\sl god isnow {\bf he} rea smuchas {\bf he} is int {\bf he} {\bf soul} andt 
{\bf he} {\bf soul} meanst {\bf he} world}, 
\end {quote}
in which the five occurrences of {\em ``he''} and the two of {\em ``soul''} are 
revealed.
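
The two numbers quoted for the correct segmentation can be checked by hand, counting letters only (the ellipsis is ignored, which is our reading of how the saying was processed). The saying has $N=56$ letters in 17 words, with $n_2=6$, $n_3=5$, $n_4=3$, $n_5=2$, and $n_7=1$, so that eq.(2) gives
\[
\frac{17!}{6!\,5!\,3!\,2!\,1!}=343062720 .
\]
The values of $m_il_i$ for the twelve distinct words are 9 ({\sl the}), 8 ({\sl soul}), 7 ({\sl nowhere}), 5 ({\sl means}, {\sl world}), 4 ({\sl is}, {\sl as}, {\sl much}), 3 ({\sl god}, {\sl and}), and 2 ({\sl he}, {\sl in}). Since these values sum to $N$, eq.(8) can be rewritten and evaluated as
\[
S=\ln 56-\frac{1}{56}\sum_{i=1}^{12}m_il_i\ln(m_il_i)\approx 4.02535-1.64511=2.38024 ,
\]
in agreement with the value $2.3802$ quoted above.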
240: 
Unfortunately, present computational power does not permit an 
exhaustive 
study of even a text as short as {\sl ``The Fox and the Grapes''}, for which the 
number of 
permitted segmentations is $1.33\times10^{97}$ under the WLC. 
We therefore examine the relevance of the concept of segmentation entropy 
in some special ways. The study focuses on ``The Fox and the Grapes''.
248: 
To change a segmentation slightly, one way is to choose two adjacent words 
along the sequence at random and then exchange their lengths. In this way the 
original two words may change into different words. This procedure 
can be repeated on the resulting segmentations. The change does not violate the WLC. Because of the large number of possible choices at each step, the segmentation is expected to become increasingly dissimilar to the original one. Starting from the correct segmentation of ``The Fox and the Grapes'', we follow the evolution of the SE as the segmentation is changed in this way.
Figure 1 shows that the SE increases drastically in the first 500 steps, 
and then reaches and fluctuates around a certain equilibrium value. 
Compared with the gap between the equilibrium value and the original  
SE, the fluctuation is minor.
This shows that, at least locally, the correct segmentation is at the 
minimum of the 
SE. In fact, we have traced a trajectory of the evolution up to $10^{10}$ 
steps. No 
segmentation with an SE smaller than that of the correct one was observed. This 
suggests that the SE of the correct segmentation is also globally minimal.
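
As a concrete illustration of one such step (our own example, using the opening words of the story), take the fragment {\sl once upon a time}, whose word lengths are $(4,4,1,4)$. Choosing the adjacent pair {\sl upon}, {\sl a} and exchanging their lengths gives the lengths $(4,1,4,4)$, i.e.,
\[
\mbox{\sl once upon a time}\ \longrightarrow\ \mbox{\sl once u pona time} .
\]
The collection of word lengths, and hence every $n_l$, is unchanged, which is why the WLC is preserved. A pair of words of equal length, such as {\sl once}, {\sl upon}, is left unchanged by the move.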
263: 
The distribution of the segmentation entropy may give 
further insight into the atypicality of 
the correct SE. 
We randomly sampled $10^{10}$ segmentations in the following way: while 
keeping the WLC, the length of each word in the segmentation is assigned 
randomly. The distribution of the SE is shown in Fig. 2. 
The minimal SE we sampled is 4.5298, still much higher 
than the value 4.097 of the correct segmentation (see Fig. 1). 
It is interesting to observe that the 
distribution shows fractal characteristics. The fractal-like distribution 
is also present for 
other texts, and even for a random sequence (Fig. 3). The fractal-like feature 
is determined by the WLC and the statistical structure of the sequence 
under study.
In Fig. 3 we compare the distributions of the SE of two 
sequences (under the same WLC): the original sequence of {\sl ``The Fox and the Grapes''} and 
a random sequence obtained by randomizing the order of the letters in the text.
The result is in accordance with the fact that the original 
sequence is in a much more ordered state, showing that the 
segmentation entropy successfully captures the statistical structure of the 
sequences.
285: 
There is a way to estimate the number of segmentations whose SE is 
4.097, the value for the correct segmentation.
Figure 4 shows the distribution of the SE in Fig. 2 
on a logarithmic scale. The left edge of the distribution falls on a line, 
which can be fitted by $e^{(165x-750.42)}$. 
The number of segmentations with SE $x$ among the total of $1.33\times10^{97}$ possible 
segmentations under the WLC is
293: \begin {equation}
294: c(x)=\frac{1.33e^{97}}{9\times10^9}e^{(165x-750.42)}.
295: \end {equation}
296: We obtained $c(4.097)=0.96$. 
From the distribution of the SE shown in Fig. 3(a) we obtained the same value of $c(4.097)$. The estimate supports the idea that the segmentation entropy of the 
correct segmentation is unique.
299: 
300: 
We now consider how to release the WLC. 
Unfortunately, searching for the 
segmentation with the smallest SE among all possible segmentations 
is sure to fail to find the correct one. For example, the SE of the 
segmentation in which the whole sequence 
is considered as one word (single-word segmentation) 
is 0, the smallest possible SE. 
Also, the 
segmentation in which each letter is viewed as a separate word 
($N$-word segmentation) has a considerably small 
SE (2.8655 for {\sl ``The Fox and the Grapes''}). 
We call these side attraction effects. These examples show that a smaller 
SE does not necessarily mean a better segmentation 
when we compare the SEs of segmentations under 
different WLCs (here WLC refers to any partition into numbers of 
words of various lengths 
satisfying eq.(1), not necessarily the same as that of the original text).
The bias induced by different WLCs must be removed.
In order to do so, we suggest using 
320: \begin {equation}
321: R_S=\frac{S}{S_0}
322: \end {equation}
instead of $S$.
Here $S_0$ is the average SE, under the same WLC, of a sequence obtained 
by randomizing the order of the letters in the original text.
$S_0$ plays the role of the chemical potential of a thermodynamic system~\cite{chempot}.
$R_S$ for the single-word and $N$-word segmentations is 1, the largest 
possible value.
By searching for the segmentation with the smallest $R_S$, one expects to 
find a meaningful segmentation. For example, for the segmentation 
331: \begin {quote}
332: {\sl god isnow {\em he} rea smuchas {\em he} is int {\em he} {\em soul} andt {\em he} {\em soul} meanst {\em he} 
333: world,}
334: \end {quote}
335: which has already been shown above, $R_S$ 
336: is 0.8601; while 
337: \begin {quote}
338: {\sl god {\em is} now {\em he} re {\em as} much {\em as} {\em he} {\em is} int {\em he} {\em soul} {\em an} dt {\em he} {\em soul} me {\em an} 
339: st {\em he} world} 
340: \end {quote}
is a better segmentation, in fact one of the best, according to 
$R_S$ ($R_S=0.8259$). Intuitively this is reasonable, because in this 
second segmentation more repeated ``words'' are revealed: two copies each of 
{\em ``is''}, {\em ``as''}, and {\em ``an''}. 
345: Another segmentation 
346: \begin {quote}
347: {\sl god {\em is} now {\em he} re {\em as} much {\em as} {\em he} {\em is} in {\em thesoul} {\em an} d {\em thesoul} me {\em an} st 
348: {\em he} world}, 
349: \end {quote}
which differs from the second segmentation by revealing the two occurrences of 
{\em ``thesoul''}, has a moderately small $R_S$: 0.8481. 
Comparison shows that the five repeats of 
{\em ``he''} are the most preferred part of good segmentations.
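
The statement that $R_S=1$ for the $N$-word segmentation can be made explicit (a short sketch in our notation, where $c_a$ denotes the number of occurrences of the letter $a$ in the sequence, a symbol not used elsewhere in this paper). When every letter is a separate word, eq.(8) reduces to the single-letter entropy
\[
S=-\sum_a\frac{c_a}{N}\ln\frac{c_a}{N} ,
\]
which depends only on the letter counts. Randomizing the order of the letters leaves each $c_a$ unchanged, so $S_0=S$ and $R_S=1$. For the single-word segmentation both $S$ and $S_0$ vanish, and $R_S=1$ has to be understood as a convention for the ratio $0/0$; this reading is our assumption.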
354: 
355: \section {Concluding remarks.}
In statistical linguistics much effort is devoted to 
signal extraction and statistical inference. 
Our method, however, is new in at least two respects. First, it requires neither 
assumptions about the distribution~\cite{peitra} nor training 
sets or lexical or grammatical knowledge~\cite{ponte}. 
This feature is important for studying biological 
sequences, because present knowledge of the ``language'' (DNA) 
of life is still lacking.
Second, instead of extracting a limited number of signals, 
we try to ``read'' the sequence exactly as a ``text''. 
A text includes more than words: it also includes the organization of words.
The results of segmentation form a basis for many further elaborations.
368: 
In principle, the concept of segmentation entropy can be applied to the study of 
the noncoding DNA sequences of eukaryotic genomes. It is expected that such a 
study may yield more than some meaningful ``words'' or regulatory 
elements. Possible applications are of course not 
confined to noncoding DNA sequences. Segmentation 
entropy can be used to find patterns in any symbolic sequence. 
However,
the application of segmentation entropy is restricted by the difficulty of finding 
the segmentation with the smallest $R_S$ among the vast number of possible 
ones. We are now developing an algorithm for the prediction of regulatory binding sites, in which the minimum entropy principle will be incorporated. 
379: 
380: \section*{ACKNOWLEDGMENTS}
I thank Professor Bai-lin Hao, who helped to make 
the computing possible. I also thank Professor Wei-mou Zheng and 
Professor Bai-lin Hao for stimulating discussions. Mr. Xiong Zhang carefully 
read the manuscript. The work was supported 
in part by the National Science Foundation.
386: 
387: \clearpage
388: \newpage
389: 
390: \begin{thebibliography}{99}
391: \renewcommand{\baselinestretch}{0.2}
392: {\small
393: \bibitem{liw} See, e.g., W. Li, {\em Molecular Evolution} (Sinauer Associates, 1997).
394: \bibitem{regulate} A.G. Pedersen, P. Baldi, Y. Chauvin, and S. Brunak, Comput. Chem. {\bf 23}, 191 (1997).
395: \bibitem{stanley} R.N. Mantegna, S.V. Buldyrev, A.L. Goldberger, S. 
396: Havlin, C.-k. Peng, M. Simons, and H.E. Stanley, Phys. Rev. Lett. {\bf 73}, 3169 (1994).
397: \bibitem{trifonov} V. Brendel, J.S. Beckmann, and E.N. Trifonov, 
398: J. Biomol. Struct. Dyn. {\bf 7}, 11 (1986); P.A. Pevzner, M.Y. Borodovsky, and A.A. Mironov, J. Biomol. Struct. Dyn. {\bf 6}, 1013 (1989).
399: \bibitem{li} H.J. Bussemaker, H. Li, and E.D. Siggia, Preprint.
400: \bibitem{ponte} J.M. Ponte, and W.B. Croft, UMass Computer Science Tech Rep. 1996-2002 (1996), available at ftp://ftp.cs.umass.edu/pub/techrept/techreport/1996; R. Ando and L. Lee, Cornell CS Report TR99-1756 (1999), available at http://www.cs.cornell.edu/home/llee/papers.html.
\bibitem{zipf} G.K. Zipf, {\em Human Behavior and the Principle of Least 
Effort} (Addison-Wesley Press, Reading, 1949).
403: \bibitem{shannon} C.E. Shannon, Bell System Tech. J. {\bf 27}, 379 (1948).
\bibitem{frieden} B.R. Frieden, J. Opt. Soc. Am. {\bf 62}, 511 (1972);
E.T. Jaynes, Phys. Rev. {\bf 106}, 620 (1957); {\bf 108}, 171 (1957).
406: \bibitem{jaynes} N. Wu, {\em The Maximum Entropy Method and its Applications in Radio Astronomy}, Ph.D. thesis (Sydney University, 1985). 
\bibitem{chempot} See, e.g., L.E. Reichl, {\em A Modern Course in Statistical Physics} (Arnold, 1980).
\bibitem{peitra} S. Della Pietra, V. Della Pietra, and J. Lafferty, IEEE Transactions on Pattern Analysis and Machine Intelligence {\bf 19}, 1 (1997).
409: }
410: 
411: \end{thebibliography}
412: 
413: 
414: \clearpage
415: \newpage
416: 
417: 
418: \begin{figure}[p]
419: \vspace {2cm}
420: \centerline{\epsfxsize=10cm \epsfbox{figure1.eps}}
421: \label{evolve}
422: \vspace {2cm}
\caption{The evolution of the segmentation entropy. Starting from the 
correct one, the segmentation was changed step by step by 
exchanging the lengths of a pair of adjacent words randomly chosen along the 
sequence. 
The dotted line corresponds to the smallest segmentation entropy, 4.5298, 
of the $10^{10}$ randomly sampled segmentations; see Fig. 2.}
429: \end{figure}
430: 
431: \clearpage
432: \newpage
433: 
434: \begin{figure}[p]
435: \vspace {2cm}
436: \centerline{\epsfxsize=10cm \epsfbox{figure2.eps}}
437: \label{stat}
438: \vspace {2cm}
\caption{The distribution of the segmentation entropy of 
$9\times10^{9}$ segmentations randomly chosen for the text ``The Fox 
and the Grapes''. The numbers of words of various lengths in the original 
text were first counted. In the sampled segmentations these numbers were 
kept, but the length of each word along the sequence was randomly 
assigned.}
445: \end{figure}
446: 
447: \clearpage
448: \newpage
449: 
450: \begin{figure}[p]
451: \vspace {2cm}
452: \centerline{\epsfxsize=10cm \epsfbox{figure3.eps}}
453: \label{compare}
454: \vspace {2cm}
455: \caption{Comparison of the distribution of segmentation entropy for two 
456: sequences: the original sequence of {\sl ``The Fox and the Grapes"}, and 
457: a random sequence obtained by randomizing the order of letters in the 
458: original text. For each sequence, $10^{9}$ segmentations are randomly 
459: sampled in the way described in the caption of Fig. 2.}
460: \end{figure}
461: 
462: \clearpage
463: \newpage
464: 
465: \begin{figure}[p]
466: \vspace {2cm}
467: \centerline{\epsfxsize=10cm \epsfbox{figure4.eps}}
468: \label{fit}
469: \vspace {2cm}
\caption{The distribution of the segmentation entropy shown in Fig. 2, plotted 
on a logarithmic scale. The line along the left edge of the distribution is 
$e^{(165x-750.42)}$.}
473: \end{figure}
474: %\end{multicols}
475: 
476: \end{document} \bye
477: 
478: 
479: