1: %\documentstyle[epsf,11pt]{article}
2: \documentstyle[epsf,aps,prl,12pt,preprint]{revtex}
3: %\documentstyle[epsf,multicol,aps,prl]{revtex}
4: \def\baselinestretch{1.5}
5: %\topmargin=-10mm
6: %\oddsidemargin=-0.2mm
7: %\evensidemargin=-0.2mm
8: %\textwidth=160mm
9: %\textheight=215mm
10: \pagenumbering{arabic}
11:
12: \title{\bf {\em Minimum} Entropy Approach to Word Segmentation Problems}
13: \author{Bin Wang \\
14: {\small Institute of Theoretical Physics, Chinese Academy of
15: Sciences},\\
16: {\small P.O. Box 2735, Beijing 100080, P. R. China.}\\
17: {\small State Key Laboratory of Scientific and Engineering Computing},\\
18: {\small Institute of Computational Mathematics and Scientific/Engineering Computing}, \\
19: {\small P.O. Box 2719, Beijing 100080, P. R. China.}\\
20: }
21: %\date{May 21, 2000}
22: \begin{document}
23: \maketitle
24: %\widetext
25: \vspace {1cm}
26: \begin{abstract}
Given a sequence composed of a limited number of characters,
we try to ``read''
it as a ``text''. This involves segmenting the sequence into ``words''. The
difficulty is to distinguish good segmentations from the enormous number of
random ones.
Aiming to reveal the nonrandomness of the sequence as strongly as
possible,
we apply the maximum likelihood method and find a
quantity, called {\bf segmentation entropy}, that serves
this purpose. Contrary to the common practice
of applying the maximum entropy principle to obtain good solutions, we
choose to {\em minimize} the segmentation entropy to obtain good
segmentations. The concept developed in this letter can be used to
study noncoding DNA sequences in eukaryote genomes, e.g., for the
prediction of regulatory elements.
42:
43: \vspace {0.4cm}
44: \noindent
45: %PACS number: 87.10.+e, 89.70.+c.
46: \end{abstract}
47: %\widetext
48:
49: \clearpage
50: \newpage
51: %\begin {multicols}{2}
52:
53: \section{Introduction.}
The problem addressed in this paper is rather
elementary in statistics. It is best described as follows:
suppose someone who knows nothing about the English language
is given a sequence of English letters, which was actually
obtained by removing all the interword delimiters
from a sample of English text.
How could he recover the words of the text by choosing where to
insert spaces between
adjacent letters? Note that the only thing he can
consult is the statistical properties of the sequence.
64:
Any two adjacent letters can be chosen either to belong to the same word
(stay adjacent) or
to belong to separate words (be separated by a space).
Suppose the sequence length is $N$.
Any choice of connectivity for the $N-1$ pairs of
adjacent letters is called a segmentation.
There are a total of $2^{N-1}$ possible
segmentations. The word segmentation problem is to find ways to
distinguish the
correct segmentation from the others. Here ``correct'' means that
letters adjacent in the original text stay
adjacent, while letters separated by spaces and/or punctuation
marks in the original
text are separated by spaces in the segmentation.
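
As an illustration of this counting, a segmentation can be encoded as $N-1$ binary choices, one for each pair of adjacent letters. The following Python sketch enumerates every segmentation of a short sequence in this way. The function name {\tt all\_segmentations} is introduced here only for illustration, and the approach is practical only for very small $N$.

```python
from itertools import product


def all_segmentations(seq):
    """Yield every segmentation of seq as a list of words.

    Each of the len(seq) - 1 gaps between adjacent letters is either
    left closed (False) or cut (True), giving 2**(len(seq) - 1) results.
    """
    n = len(seq)
    if n == 0:
        return
    for cuts in product([False, True], repeat=n - 1):
        words, start = [], 0
        for gap, cut in enumerate(cuts, start=1):
            if cut:
                words.append(seq[start:gap])
                start = gap
        words.append(seq[start:])
        yield words


if __name__ == "__main__":
    segs = list(all_segmentations("thefox"))
    print(len(segs))          # 2**5 segmentations of a 6-letter sequence
    print(segs[0], segs[-1])  # the single-word and the 6-word segmentation
```

The exponential growth of this enumeration is what makes a criterion for ranking segmentations necessary.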
79:
Although the problem seems toy-like, its fundamental
importance for statistical linguistics is evident. We study it,
however, also for practical purposes. Noncoding sequences in the genomes
of species play
an essential role in the regulation of gene expression and function~\cite{liw}.
However, the development of computational methods for extracting regulatory elements is far behind DNA sequencing and gene finding~\cite{regulate}.
One reason is the lack of an efficient way to discriminate the large
number of sequence signals in noncoding DNA sequences.
Linguistic studies have shown that noncoding sequences in
eukaryotic genomes are structurally very similar to natural and
artificial languages~\cite{stanley}. Thus one may expect to ``read'' the
noncoding sequences as a ``text''. Indeed, efforts have been made to
build a dictionary for genomes~\cite{trifonov,li}. Li et al.~\cite{li}
showed the connection between regulatory element prediction
and word segmentation in noncoding DNA sequences of eukaryote genomes.
We expect that
progress on the word segmentation problem may help to deepen our knowledge of
the noncoding
regions of eukaryote genomes. Besides, word segmentation is an important
issue in the processing of Asian languages (e.g., Chinese and Japanese)~\cite{ponte},
because they lack interword delimiters.
101:
\section{Segmentation entropy and its connection to the word segmentation problem.}
To tackle the word segmentation problem,
we first consider the problem under constraints, so that one important
concept,
segmentation entropy, can be introduced. The
constraints will be relaxed at the end of this paper.
Suppose we know that there are $n_l$ words of length $l$
$(l=1,2,\cdots)$ in the original text. Obviously,
\begin {equation}
\sum_l{n_l l}=N.
\end {equation}
Under these constraints, which we call the Word Length Constraints (WLC), there are
in total
\begin {equation}
\frac{(\sum_l{n_l})!}{\prod_l{(n_l!)}}
\end {equation}
segmentations. For example, for the following story, there are
$3.12e144$ segmentations in total, while the number under WLC is about $1.33e97$.
120:
121: \begin {quote}
122: \begin {center}
123: {\sl
124: The Fox and the Grapes
125: }
126: \end {center}
127: {\small
128: {\sl
129: \ \ \ \ Once upon a time there was a fox strolling through the woods.
130: He came upon a grape orchard. There he found a bunch of beautiful
131: grapes hanging from a high branch.
132:
133: \ \ \ \ ``Boy those sure would be tasty," he thought to himself. He
134: backed up and took a running start, and jumped. He did not get high
135: enough.
136:
137: \ \ \ \ He went back to his starting spot and tried again. He almost
138: got high enough this time, but not quite.
139:
140: \ \ \ \ He tried and tried, again and again, but just couldn't get
141: high enough to grab the grapes.
142:
143: \ \ \ \ Finally, he gave up.
144:
145: \ \ \ \ As he walked away, he put his nose in the air and said: ``I am
146: sure those grapes are sour."
147: }
148: }
149: \end {quote}
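
The two counts above can be evaluated directly from a list of word lengths. The sketch below is a minimal illustration that uses only the title of the story as a toy input. The helper names {\tt count\_total} and {\tt count\_under\_wlc} are hypothetical and not part of the original work.

```python
from collections import Counter
from math import factorial


def count_total(n_letters):
    """Number of unconstrained segmentations of a sequence of n_letters."""
    return 2 ** (n_letters - 1)


def count_under_wlc(word_lengths):
    """Number of segmentations sharing the same word-length counts n_l.

    Evaluates (sum_l n_l)! / prod_l (n_l!), i.e. the number of distinct
    orderings of the given word lengths along the sequence.
    """
    n_l = Counter(word_lengths)
    count = factorial(sum(n_l.values()))
    for n in n_l.values():
        count //= factorial(n)  # exact: every partial quotient is an integer
    return count


if __name__ == "__main__":
    title = "the fox and the grapes"
    lengths = [len(w) for w in title.split()]
    print(count_total(sum(lengths)))   # 2**17 = 131072
    print(count_under_wlc(lengths))    # 5!/(4! 1!) = 5
```

Even for this five-word title the WLC reduces the search space from 131072 segmentations to 5, which shows how strong the constraint is.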
150:
Following the principle of least effort~\cite{zipf}, natural languages
favor
combining existing words to express different meanings.
Shannon~\cite{shannon} pointed out the
importance of redundancy in natural languages long ago: generally
speaking, nearly half of the letters
in a sample of English text can be deleted and still be
restored by someone else. These
properties of natural language ensure that the sequence obtained by
removing the interword delimiters from a text is highly nonrandom
and shows deterministic and regular characteristics.
It is expected that the correct segmentation reveals these
characteristics as
strongly as possible. From the viewpoint of information theory, this means
that, if
a form of information entropy can be properly defined on each segmentation,
the entropy of the correct
segmentation will be the smallest.
169:
Interestingly, a maximum likelihood approach leads to the same
proposal and automatically gives the definition of the entropy.
Given a sequence of length $N$, we seek a likelihood function that
reaches its maximum on the correct
segmentation. For a concrete segmentation, we assign a probability to
each distinct word in it
\begin {equation}
w_i\to{p_i}, \qquad i=1,\ldots,M
\end {equation}
179: with
180: \begin {equation}
181: \sum_{i=1}^M{p_i}=1.
182: \end {equation}
183: The likelihood function is written as
184: \begin {equation}
185: Z_s=\prod_{i=1}^M{{p_i}^{m_il_i}}
186: \end {equation}
where $m_i$ is the number of occurrences of word $w_i$ in the segmentation, and $l_i$
is the length of the word.

By maximizing the likelihood function subject to
eq.(4) we obtain
192: \begin {equation}
193: p_i=\frac{m_il_i}{N}.
194: \end {equation}
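The step leading to eq.(6) can be sketched with a Lagrange multiplier $\lambda$, a symbol introduced here only for this illustration. Maximizing $\ln Z_s=\sum_i m_il_i\ln p_i$ under the normalization of eq.(4) requires
$$
\frac{\partial}{\partial p_i}\left[\sum_{j=1}^M m_jl_j\ln p_j-\lambda\left(\sum_{j=1}^M p_j-1\right)\right]
=\frac{m_il_i}{p_i}-\lambda=0,
$$
so that $p_i=m_il_i/\lambda$. Summing over $i$ and using $\sum_i m_il_i=N$ gives $\lambda=N$, which is eq.(6).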
Thus the maximum likelihood for the segmentation is
\begin {equation}
Z_s=\prod_{i=1}^M{\left(\frac{m_il_i}{N}\right)^{m_il_i}}.
\end {equation}
The segmentation with maximum likelihood is just the one minimizing
\begin {equation}
S=-\frac{\ln Z_s}{N}=-\sum_{i=1}^M{\frac{m_il_i}{N}\ln\left(\frac{m_il_i}{N}\right)}.
\end {equation}
This function has the form of an entropy~\cite{shannon} and will be called the Segmentation Entropy (SE).
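
Eq.(8) translates directly into a few lines of code. The Python sketch below is one possible implementation, with the function name {\tt segmentation\_entropy} chosen here for illustration. It takes a segmentation as a list of words and uses the natural logarithm, as in eq.(8). As a check, it is applied to the correct segmentation of the Eckhart saying discussed later in this section, for which the text reports an SE of 2.3802.

```python
from collections import Counter
from math import log


def segmentation_entropy(words):
    """Segmentation entropy S = -sum_i (m_i l_i / N) ln(m_i l_i / N).

    words is a segmentation given as a list of strings; m_i is the number
    of occurrences of the distinct word w_i, l_i its length, and N the
    total number of letters.
    """
    counts = Counter(words)
    n_total = sum(len(w) for w in words)
    entropy = 0.0
    for word, m in counts.items():
        p = m * len(word) / n_total
        entropy -= p * log(p)
    return entropy


if __name__ == "__main__":
    saying = ("god is nowhere as much as he is in the soul "
              "and the soul means the world")
    # The text reports 2.3802 for this (correct) segmentation.
    print(round(segmentation_entropy(saying.split()), 4))
    # A single word has p = 1, so its entropy vanishes.
    print(segmentation_entropy(["godisnowhere"]))
```

Repeated words concentrate the weights $m_il_i/N$ on fewer terms, which is what lowers $S$.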
204:
Starting from a maximum likelihood approach, we thus arrive at the
suggestion
to minimize the segmentation entropy.
This is in contrast to common practice:
in some cases, maximizing the likelihood leads to maximizing a certain
entropy~\cite{frieden,jaynes}.
As a general principle for investigating statistical problems, the maximum
entropy method has been successfully applied in a
variety of fields~\cite{frieden,jaynes}. We propose that, instead of
applying the maximum entropy principle, one may choose to minimize a certain entropy
(minimum entropy principle) in some problems. This seems especially
attractive in the era of bioinformatics, when most
problems amount to revealing regularity in large amounts of seemingly
random sequences.
219:
Because the present method is statistical, the
text under study must not be too short. For example, when we
searched for
the segmentation with the smallest segmentation entropy for the
saying
\begin {quote}
{\sl God is nowhere as much as he is in the soul... and the soul
means the world}
\end {quote}
(by Meister Eckhart, a 14th-century Dominican
priest, preacher, and theologian), we found that, among a total of
$343062720$ segmentations under WLC,
there are 15 segmentations
whose SE is $2.3684$, smaller than the value $2.3802$ of the correct one. One example is
\begin {quote}
{\sl god isnow {\bf he} rea smuchas {\bf he} is int {\bf he} {\bf soul} andt
{\bf he} {\bf soul} meanst {\bf he} world},
\end {quote}
in which the five {\em ``he''} and two {\em ``soul''} are
revealed.
240:
Unfortunately, present computational power does not permit an
exhaustive
study of even a text as short as {\sl ``The Fox and the Grapes''}, for which the
number of
permitted segmentations under WLC is $1.33e97$.
We therefore examine the relevance of the concept of segmentation entropy
in some special ways. The study focuses on ``The Fox and the Grapes''.
248:
To change a segmentation slightly, one can choose two adjacent words
along the sequence at random and exchange their lengths. In this way the
original two words may change into different words. This procedure
can be repeated on the resulting segmentations. The change does not violate the WLC. Because of the large number of possible choices at each step, the segmentation is expected to become increasingly dissimilar to the original one. Starting from the correct segmentation of ``The Fox and the Grapes'', we follow the evolution of SE as the segmentation is changed in this way.
Figure 1 shows that SE increases drastically in the first 500 steps,
and then reaches and fluctuates around a certain equilibrium value.
Compared with the gap between the equilibrium value and the original
SE, the fluctuation is minor.
This shows that, at least locally, the correct segmentation is at a
minimum of
SE. Moreover, we have traced a trajectory of evolution up to $10^{10}$
steps, and no
segmentation with SE smaller than that of the correct one was observed. This
suggests that the SE of the correct segmentation is also globally minimal.
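
The length-exchange procedure described above can be sketched as follows. This is a minimal illustration on a toy text rather than a reproduction of the computation behind Fig. 1. The names {\tt split\_by\_lengths} and {\tt exchange\_walk}, the toy text, and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import Counter
from math import log


def split_by_lengths(seq, lengths):
    """Cut seq into consecutive words with the given lengths."""
    words, start = [], 0
    for length in lengths:
        words.append(seq[start:start + length])
        start += length
    return words


def segmentation_entropy(words):
    """S = -sum_i (m_i l_i / N) ln(m_i l_i / N) over distinct words."""
    n_total = sum(len(w) for w in words)
    return -sum((m * len(w) / n_total) * log(m * len(w) / n_total)
                for w, m in Counter(words).items())


def exchange_walk(seq, lengths, steps, rng):
    """Repeatedly swap the lengths of a randomly chosen adjacent word pair.

    Needs at least two words. Returns the final lengths and the SE after
    every step; each swap leaves the multiset of word lengths (the WLC)
    unchanged.
    """
    lengths = list(lengths)
    history = []
    for _ in range(steps):
        i = rng.randrange(len(lengths) - 1)
        lengths[i], lengths[i + 1] = lengths[i + 1], lengths[i]
        history.append(segmentation_entropy(split_by_lengths(seq, lengths)))
    return lengths, history


if __name__ == "__main__":
    text = "the fox and the grapes the fox and the grapes"
    words = text.split()
    seq = "".join(words)
    rng = random.Random(0)  # fixed seed, only for reproducibility
    final, history = exchange_walk(seq, [len(w) for w in words], 200, rng)
    print(segmentation_entropy(words), history[-1])
```

Recording the SE after every step gives a trajectory of the kind plotted in Fig. 1.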
263:
The distribution of segmentation entropy may give
further insight into the atypicality of
the correct SE.
We randomly sampled $10^{10}$ segmentations in the following way: while
keeping the WLC, the length of each word in the segmentation is assigned
randomly. The distribution of SE is shown in Fig. 2.
The minimal SE we sampled is 4.5298, still much higher
than the value 4.097 of the correct segmentation (see Fig. 1).
It is interesting to observe that the
distribution shows fractal characteristics. The fractal-like distribution
is also present for
other texts, even for a random sequence (Fig. 3). The fractal-like feature
is determined by the WLC and the statistical structure of the sequence
under study.
In Fig. 3 we compare the distributions of SE of two
sequences (under the same WLC): the original sequence of {\sl ``The Fox and the Grapes''} and
a random sequence obtained by randomizing the order of letters in the text.
The result is in accordance with the fact that the original
sequence is in a much more ordered state, showing that
segmentation entropy successfully captures the statistical structure of the
sequences.
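
The sampling scheme and the randomized control sequence can be sketched as follows. The code assumes that ``assigning the lengths randomly'' means putting the original word lengths in a uniformly random order. The function names {\tt sample\_wlc\_segmentation} and {\tt shuffle\_letters} are hypothetical.

```python
import random


def sample_wlc_segmentation(seq, word_lengths, rng):
    """Draw a segmentation of seq with the same word-length counts.

    The word lengths are put in a uniformly random order and seq is cut
    accordingly, so the numbers n_l of words of each length are preserved.
    """
    lengths = list(word_lengths)
    rng.shuffle(lengths)
    words, start = [], 0
    for length in lengths:
        words.append(seq[start:start + length])
        start += length
    return words


def shuffle_letters(seq, rng):
    """Return a control sequence with the same letters in random order."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility
    words = "the fox and the grapes".split()
    seq = "".join(words)
    lengths = [len(w) for w in words]
    print(sample_wlc_segmentation(seq, lengths, rng))
    print(sample_wlc_segmentation(shuffle_letters(seq, rng), lengths, rng))
```

Evaluating eq.(8) on many such samples, for the original and for the letter-shuffled sequence, yields histograms of the kind compared in Fig. 3.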
285:
There is a way to estimate the number of segmentations whose SE is
4.097, the value for the correct segmentation.
Fig. 4 shows the distribution of SE of Fig. 2
on a logarithmic scale. The left edge of the distribution falls on a line,
which can be fitted by $e^{(165x-750.42)}$.
The number of segmentations with SE $x$ among the total of $1.33e97$ possible
segmentations under WLC is then estimated as
293: \begin {equation}
294: c(x)=\frac{1.33e^{97}}{9\times10^9}e^{(165x-750.42)}.
295: \end {equation}
We obtained $c(4.097)=0.96$.
From the distribution of SE shown in Fig. 3(a) we obtained the same value of $c(4.097)$. The estimate supports the idea that the segmentation entropy of
the correct segmentation is unique.
299:
300:
We now consider how to relax the WLC.
Unfortunately, searching for the
segmentation with the smallest SE among all possible segmentations
is sure to fail to find the correct one. For example, the SE of the
segmentation in which the whole sequence
is considered as one word (single-word segmentation)
is 0, the smallest possible SE.
Also, the
segmentation in which each letter is viewed as a separate word
($N$-word segmentation) has a considerably small
SE (2.8655 for {\sl ``The Fox and the Grapes''}).
These are called side attraction effects. These examples show that a smaller
SE does not necessarily mean a better segmentation
when we compare the SEs of segmentations under
different WLC (here WLC refers to any set of numbers of
words of various lengths
satisfying eq.(1), not necessarily the same as that of the original text).
The bias induced by different WLC must be removed.
To do so, we suggest using
320: \begin {equation}
321: R_S=\frac{S}{S_0}
322: \end {equation}
323: instead of $S$.
Here $S_0$ is the average SE, under the same WLC, of a sequence obtained
by randomizing the order of letters in the original text.
$S_0$ plays the role of the chemical potential of a thermodynamic system~\cite{chempot}.
$R_S$ for the single-word and $N$-word segmentations is 1, the largest
possible value.
By searching for the segmentation with the smallest $R_S$, one expects to
find a meaningful segmentation. For example, for the segmentation
331: \begin {quote}
332: {\sl god isnow {\em he} rea smuchas {\em he} is int {\em he} {\em soul} andt {\em he} {\em soul} meanst {\em he}
333: world,}
334: \end {quote}
335: which has already been shown above, $R_S$
336: is 0.8601; while
337: \begin {quote}
338: {\sl god {\em is} now {\em he} re {\em as} much {\em as} {\em he} {\em is} int {\em he} {\em soul} {\em an} dt {\em he} {\em soul} me {\em an}
339: st {\em he} world}
340: \end {quote}
is a better segmentation (actually one of the best) according to
$R_S$ ($R_S=0.8259$). Intuitively this is reasonable, because in this
second segmentation more repeated ``words'' are revealed: two copies each of
{\em ``is''}, {\em ``as''} and {\em ``an''}.
Another segmentation
346: \begin {quote}
347: {\sl god {\em is} now {\em he} re {\em as} much {\em as} {\em he} {\em is} in {\em thesoul} {\em an} d {\em thesoul} me {\em an} st
348: {\em he} world},
349: \end {quote}
which differs from the second segmentation by revealing the two
{\em ``thesoul''}, has a moderately small $R_S$: 0.8481.
Comparison shows that the five repeats of
{\em ``he''} are the most preferred part of good segmentations.
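
The ratio $R_S$ can be estimated numerically along the following lines. This sketch rests on assumptions that go beyond the text: $S_0$ is estimated by Monte Carlo, each trial shuffling both the letters and the order of the word lengths, and the case $S_0=0$ (single-word segmentation) is assigned $R_S=1$ by convention. The names {\tt estimate\_s0} and {\tt entropy\_ratio} are hypothetical.

```python
import random
from collections import Counter
from math import isclose, log


def segmentation_entropy(words):
    """S = -sum_i (m_i l_i / N) ln(m_i l_i / N) over distinct words."""
    n_total = sum(len(w) for w in words)
    return -sum((m * len(w) / n_total) * log(m * len(w) / n_total)
                for w, m in Counter(words).items())


def estimate_s0(words, trials, rng):
    """Monte Carlo estimate of S_0 for the WLC of the given segmentation.

    Assumption: each trial shuffles both the letters and the order of the
    word lengths, then evaluates SE; S_0 is the mean over all trials.
    """
    letters = list("".join(words))
    lengths = [len(w) for w in words]
    total = 0.0
    for _ in range(trials):
        rng.shuffle(letters)
        rng.shuffle(lengths)
        shuffled, start = [], 0
        for length in lengths:
            shuffled.append("".join(letters[start:start + length]))
            start += length
        total += segmentation_entropy(shuffled)
    return total / trials


def entropy_ratio(words, trials=1000, rng=None):
    """R_S = S / S_0 for a segmentation given as a list of words."""
    rng = rng or random.Random()
    s0 = estimate_s0(words, trials, rng)
    if s0 == 0.0:
        return 1.0  # convention assumed here for the single-word case
    return segmentation_entropy(words) / s0


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility
    saying = ("god is nowhere as much as he is in the soul "
              "and the soul means the world")
    print(entropy_ratio(saying.split(), trials=2000, rng=rng))
    # Shuffling single letters never changes the SE, so R_S is 1 here.
    print(isclose(entropy_ratio(list("godisnowhere"), 50, rng), 1.0))
```

Because $S_0$ is a sampled average, the estimate fluctuates with the number of trials, and comparisons between segmentations should use enough trials for the difference in $R_S$ to exceed that noise.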
354:
355: \section {Concluding remarks.}
In statistical linguistics much effort is devoted to
signal extraction and statistical inference.
Our method, however, is new in at least two respects. First, it requires neither
an assumption on the distribution~\cite{peitra} nor training
sets or lexical or grammatical knowledge~\cite{ponte}.
This feature is important for studying biological
sequences, because present knowledge of the ``language'' (DNA)
of life is still lacking.
Second, instead of extracting a limited number of signals,
we try to ``read'' the sequence exactly as a ``text''.
A text includes more than words: it also includes the organization of words.
The results of segmentation form a basis for many further elaborations.
368:
In principle, the concept of segmentation entropy can be applied to study
the noncoding DNA sequences of eukaryote genomes. It is expected that such a
study may give more than some meaningful ``words'' or regulatory
elements. Possible applications are of course not
confined to noncoding DNA sequences: segmentation
entropy can be used to find patterns in any symbolic sequence.
However,
the application of segmentation entropy is restricted by the difficulty of finding
the segmentation with the smallest $R_S$ among the vast number of possible
ones. We are now developing an algorithm for the prediction of regulatory binding sites, in which the principle of minimum entropy will be incorporated.
379:
380: \section*{ACKNOWLEDGMENTS}
I thank Professor Bai-lin Hao, who helped to make
the computing possible. I also thank Professor Wei-mou Zheng and
Professor Bai-lin Hao for stimulating discussions. Mr. Xiong Zhang carefully
read the manuscript. The work was supported
in part by the National Science Foundation.
386:
387: \clearpage
388: \newpage
389:
390: \begin{thebibliography}{99}
391: \renewcommand{\baselinestretch}{0.2}
392: {\small
393: \bibitem{liw} See, e.g., W. Li, {\em Molecular Evolution} (Sinauer Associates, 1997).
394: \bibitem{regulate} A.G. Pedersen, P. Baldi, Y. Chauvin, and S. Brunak, Comput. Chem. {\bf 23}, 191 (1997).
395: \bibitem{stanley} R.N. Mantegna, S.V. Buldyrev, A.L. Goldberger, S.
396: Havlin, C.-k. Peng, M. Simons, and H.E. Stanley, Phys. Rev. Lett. {\bf 73}, 3169 (1994).
397: \bibitem{trifonov} V. Brendel, J.S. Beckmann, and E.N. Trifonov,
398: J. Biomol. Struct. Dyn. {\bf 7}, 11 (1986); P.A. Pevzner, M.Y. Borodovsky, and A.A. Mironov, J. Biomol. Struct. Dyn. {\bf 6}, 1013 (1989).
399: \bibitem{li} H.J. Bussemaker, H. Li, and E.D. Siggia, Preprint.
400: \bibitem{ponte} J.M. Ponte, and W.B. Croft, UMass Computer Science Tech Rep. 1996-2002 (1996), available at ftp://ftp.cs.umass.edu/pub/techrept/techreport/1996; R. Ando and L. Lee, Cornell CS Report TR99-1756 (1999), available at http://www.cs.cornell.edu/home/llee/papers.html.
\bibitem{zipf} G.K. Zipf, {\em Human Behavior and the Principle of Least
Effort} (Addison-Wesley Press, Reading, 1949).
403: \bibitem{shannon} C.E. Shannon, Bell System Tech. J. {\bf 27}, 379 (1948).
404: \bibitem{frieden} B.R. Frieden, J. Opt. Soc. Am. {\bf 62}, 511 (1972);
E.T. Jaynes, Phys. Rev. {\bf 106}, 620 (1957); {\bf 108}, 171 (1957).
406: \bibitem{jaynes} N. Wu, {\em The Maximum Entropy Method and its Applications in Radio Astronomy}, Ph.D. thesis (Sydney University, 1985).
\bibitem{chempot} See, e.g., L.E. Reichl, {\em A Modern Course in Statistical Physics} (Arnold, 1980).
\bibitem{peitra} S. Della Pietra, V. Della Pietra, and J. Lafferty, IEEE Transactions on Pattern Analysis and Machine Intelligence {\bf 19}, 1 (1997).
409: }
410:
411: \end{thebibliography}
412:
413:
414: \clearpage
415: \newpage
416:
417:
418: \begin{figure}[p]
419: \vspace {2cm}
420: \centerline{\epsfxsize=10cm \epsfbox{figure1.eps}}
421: \label{evolve}
422: \vspace {2cm}
\caption{The evolution of segmentation entropy. Starting from the
correct one, the segmentation was changed step by step by
exchanging the lengths of a pair of adjacent words randomly chosen along the
sequence.
The dotted line corresponds to the smallest segmentation entropy 4.5298
for the $10^{10}$ randomly sampled segmentations, see Fig. 2.}
429: \end{figure}
430:
431: \clearpage
432: \newpage
433:
434: \begin{figure}[p]
435: \vspace {2cm}
436: \centerline{\epsfxsize=10cm \epsfbox{figure2.eps}}
437: \label{stat}
438: \vspace {2cm}
\caption{The distribution of the segmentation entropy of
$9\times10^{9}$ segmentations randomly chosen for the text ``The Fox
and the Grapes''. The numbers of words of various lengths in the original
text were first counted. In the sampled segmentations these numbers were
kept, but the length of each word along the sequence was randomly
assigned.}
445: \end{figure}
446:
447: \clearpage
448: \newpage
449:
450: \begin{figure}[p]
451: \vspace {2cm}
452: \centerline{\epsfxsize=10cm \epsfbox{figure3.eps}}
453: \label{compare}
454: \vspace {2cm}
455: \caption{Comparison of the distribution of segmentation entropy for two
456: sequences: the original sequence of {\sl ``The Fox and the Grapes"}, and
457: a random sequence obtained by randomizing the order of letters in the
458: original text. For each sequence, $10^{9}$ segmentations are randomly
459: sampled in the way described in the caption of Fig. 2.}
460: \end{figure}
461:
462: \clearpage
463: \newpage
464:
465: \begin{figure}[p]
466: \vspace {2cm}
467: \centerline{\epsfxsize=10cm \epsfbox{figure4.eps}}
468: \label{fit}
469: \vspace {2cm}
\caption{The distribution of segmentation entropy shown in Fig. 2 is shown on a
logarithmic scale here. The line along the left edge of the distribution is
$e^{(165x-750.42)}$.}
473: \end{figure}
474: %\end{multicols}
475:
476: \end{document} \bye
477:
478:
479: