1: \documentstyle[epsf,11pt]{article}
2: %\documentstyle[epsf,aps,prb]{revtex}
3: \def\baselinestretch{1.3}
4: \topmargin=-10mm
5: \oddsidemargin=-2mm
6: \evensidemargin=-2mm
7: \textwidth=160mm
8: \textheight=215mm
9: \pagenumbering{arabic}
10: %\draft
11: %\twocolumn[\hsize\textwidth\columnwidth\hsize\csname
12: %\twocolumnfalse\endcsname
13: \title{\bf Statistically Significant Strings are Related to Regulatory Elements in the Promoter Regions of {\it Saccharomyces cerevisiae}}
14: \author{Rui Hu$^{1,2}$, Bin Wang$^1$\\
15:   {\small $^1$Institute of Theoretical Physics, Academia Sinica}\\ %%
16:    {\small P. O. Box 2735, Beijing 10080, China}\\ 
   {\small $^2$Department of Modern Physics,} \\
18:     {\small  University of Science and Technology of China,}\\
19:      {\small Anhui, 230027, China}
20:     }
21: \date{}
22: 
23: \begin{document}
24: \maketitle
25: \begin{abstract}
26: 
Identifying statistically significant words
in DNA and protein sequences forms the basis of many genetic studies.
By applying
the maximal entropy principle, we give a systematic way to
study the nonrandom occurrence of words in DNA or protein sequences.
Comparison with experimental results shows that patterns of regulatory binding sites in the {\it Saccharomyces cerevisiae} ({\em yeast}) genome tend to occur with significant frequency in the promoter regions. We studied two coregulated gene families of {\em yeast}. The method successfully extracts the binding sites verified by experiments in each family, and many putative regulatory sites in the upstream regions are proposed. The study also suggests that some regulatory sites are active in both directions, while others show a directional preference.
33: 
34: \end{abstract}
35: 
36: \section{\bf Introduction.} 
37: 
It is remarkable, though not unexpected, that DNA and protein sequences deviate
markedly from random sequences~\cite{r1}.
According to information theory, random sequences carry minimal
information (maximal entropy)~\cite{r2}, while the total information of life is assumed to reside in DNA and protein sequences.
As a result, investigating
the nonrandomness of DNA and amino acid sequences
is a central task of bioinformatics.
45: 
Finding the nonrandom occurrence of words (short strings) in
non-coding DNA sequences
is interesting because a large portion of the regulatory elements of eukaryotes
are words of limited length in the non-coding sequences
(for example, about 10 bases,
with a core part of about 5 bases~\cite{b1}). Being subject to
functional constraints, the patterns of regulatory elements are expected to deviate from random occurrence.
53: 
In this paper, by applying the Maximal Entropy Principle (MEP), we develop a way to investigate the nonrandom occurrence of words in DNA sequences. Each word is assigned a significance index that quantifies the nonrandomness of its occurrence. The method is then applied to the promoter regions of {\it Saccharomyces cerevisiae} ({\it yeast})~\cite{ar1}.
We compare the theoretical results with experiments in two ways. First, the promoter database of {\em yeast} (SCPD)~\cite{r10} was analysed. It was found that, statistically, overrepresented words are encountered more often in the database. Second, the promoters of coregulated gene families were studied. The experimentally found binding sites were successfully extracted, and further putative binding sites are suggested.
56: 
In the following section the method is developed in detail, and in the third section it is applied to the promoter regions of {\em yeast}.
58: 
\section {\bf Treating the nonrandomness of DNA sequences via the Maximal Entropy Principle.}
60: 
The idea comes from a simple observation. Take a long DNA sequence as
an example. Given only the (normalized) frequencies of $A$, $C$, $G$, $T$ ($P_A$, $P_C$,
$P_G$, $P_T$), one would expect the frequencies of 2-tuples to have the form
64: \begin{equation} \label{eq:intu}
65: 	{P^0}_{c_1c_2}=P_{c_1}{\times}P_{c_2}
66: \end{equation}
Here $c_1$ and $c_2$ are each one of the four bases $\{A,C,G,T\}$.
68: 
69: Comparison between the measured frequency $P_{c_1c_2}$ and the expected value ${P^0}_{c_1c_2}$ reveals the statistical significance of $c_1c_2$ in the sequence.
70: 
To generalize the above idea, one faces the problem of predicting the
frequencies of $(k+1)$-tuples from the frequencies of $k$-tuples when $k>1$.
A reasonable definition can then be used to evaluate the statistical significance of words longer than two bases.

The following is an attempt to solve this problem.
76: In the treatment, when the composition of a $k$-tuple 
77: is concerned, the word
78: will be written as $c_1c_2{\cdots}c_{k-1}c_{k}$. However when only the length 
79: $k$ of the word is relevant, it will be given in the form of $w^k$. 
 A combined form may also be used.  For example,
81: $w^kc$ ($cw^k$) is the word obtained by adding a letter $c$ to the 
82: right (left) of $w^k$. The measured and expected frequencies of $w^k$ in the
83: sequence will be written as $P_{w^k}$ and ${P^0}_{w^k}$, respectively.
84: 
There are a total of $4^k$ $k$-tuples.
For the prediction, the Maximal Entropy Principle (MEP) is a preferred choice.
According to modern genetics, the driving forces of
nucleotide sequence evolution are, on the one hand, random mutations of bases, which
 maximize the entropy, and, on the other hand, natural selection,
 which subjects the maximization of entropy to certain constraints. Therefore, DNA sequence analysis is intrinsically related
to the MEP.
A brief introduction to the MEP, limited to what is needed here, is given
below. More details can be found in, e.g.,~\cite{r7}.
94: 
Suppose that $\{P_i, i=0,1,2,{\cdots}\}$ is a discrete distribution. An information
entropy can be defined on it~\cite{r2}:
\begin{equation} \label{eq:entropy}
	S=\sum_iP_i\ln P_i.
\end{equation}
With this sign convention $S$ is the negative of the Shannon entropy, so maximizing the entropy is equivalent to minimizing $S$.
Usually $\{P_i\}$ satisfies some constraints:
\begin{equation} \label{eq:constraints}
F_j(\{P_i\})=0,  \qquad      	j=1,2,{\dots},M.
\end{equation}
Here $M$ is the number of constraints. Define a target function:
\begin{equation}
	H=S+\sum_{j=1}^M\lambda_j F_j(\{P_i\}),
\end{equation}
with $\lambda_j$ being Lagrange multipliers. The MEP states that the distribution
minimizing the target function $H$ is the most reasonable
distribution satisfying the constraints (\ref{eq:constraints}).
This, however, does not mean
that this $\{P_i\}$ is the only distribution satisfying (\ref{eq:constraints}).
113: 
The MEP can now be applied to the problem raised above.
The entropy function here is:
\begin{displaymath}
S=\sum_i{P^0}_{w^{k+1}(i)}\ln{P^0}_{w^{k+1}(i)},
\end{displaymath}
where $i$ is an index used to distinguish words of the same length from each other.
(To obtain the index of
a word, the following map is used: $A$
to 0, $C$ to 1, $G$ to 2, and $T$ to 3. The original word is thus
mapped to a string containing only 0, 1, 2 and 3. The string is then
read as a quaternary (base-4)
number, and its decimal value is used as the
index of the word.)
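
As a worked illustration of this indexing (the word is chosen arbitrarily here and is not taken from the data), the 4-tuple $ACGT$ maps to the quaternary string $0123$, so that its index is
\begin{displaymath}
i(ACGT)=0{\times}4^3+1{\times}4^2+2{\times}4^1+3{\times}4^0=27 .
\end{displaymath}
With this convention the indices of $k$-tuples run from $0$ ($AA{\cdots}A$) to $4^k-1$ ($TT{\cdots}T$).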
127: 
The constraints in the present problem are:
129: \begin{eqnarray} \label{eq:normal}
130: P_{w^{k}(i)}=\sum_{c}{P^0}_{w^{k}(i)c},\nonumber \\
131: P_{w^{k}(i)}=\sum_{c}{P^0}_{cw^{k}(i)},\nonumber \\
132: i=0,1,2,{\cdots},4^{k}-1.
133: \end{eqnarray}
Here ${P^0}_{w^{k+1}}$ is the frequency to be predicted and $P_{w^k}$ is the frequency already known.
There are a total of $2{\times}4^{k}$ constraints. It is possible that these
constraints are linearly dependent, so that the number of effective
constraints is smaller than $2{\times}4^{k}$. This, however, does not alter
the result.
139: 
140: The solution can be obtained:
141: \begin{equation} \label{eq:mep}
142: {P^0}_{c_1c_2{\cdots}c_{k+1}}=\frac{P_{c_1c_2{\cdots}c_k}{\times}P_{c_2c_3{\cdots}c_{k+1}}}{P_{c_2c_3{\cdots}c_{k}}}.
143: \end{equation}
When $k=1$, the denominator is the frequency of the empty word, which equals 1, and the solution reduces to the intuitive result, eq.(\ref{eq:intu}).
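
A sketch of how eq.(\ref{eq:mep}) may be obtained is as follows; the notation $a$, $u$, $b$, $\lambda$, $\mu$, $f$, $g$ is introduced here only for illustration, and it is assumed that the measured frequencies are consistent, i.e., $\sum_aP_{au}=\sum_bP_{ub}=P_u$ (edge effects of a finite sequence are neglected). Write a $(k+1)$-tuple as $aub$, where $a$ and $b$ are single bases and $u$ is a $(k-1)$-tuple. With multipliers $\lambda_{au}$ and $\mu_{ub}$ for the two sets of constraints in eq.(\ref{eq:normal}), the stationarity condition of the target function is
\begin{displaymath}
\frac{\partial H}{\partial {P^0}_{aub}}=\ln{P^0}_{aub}+1+\lambda_{au}+\mu_{ub}=0 ,
\end{displaymath}
so that ${P^0}_{aub}=f_{au}\,g_{ub}$ with $f_{au}=e^{-1-\lambda_{au}}$ and $g_{ub}=e^{-\mu_{ub}}$. Inserting this product form into the constraints gives
\begin{displaymath}
f_{au}\sum_{b'}g_{ub'}=P_{au}, \qquad g_{ub}\sum_{a'}f_{a'u}=P_{ub},
\end{displaymath}
and summing the first relation over $a$ gives $\sum_{a'}f_{a'u}\sum_{b'}g_{ub'}=P_u$. Multiplying the two relations and dividing by this product yields
\begin{displaymath}
{P^0}_{aub}=\frac{P_{au}{\times}P_{ub}}{P_u},
\end{displaymath}
which is eq.(\ref{eq:mep}) with $au=c_1c_2{\cdots}c_k$, $ub=c_2c_3{\cdots}c_{k+1}$ and $u=c_2c_3{\cdots}c_k$.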
145: 
The above treatment goes from $k$-tuples to $(k+1)$-tuples. As a generic
scheme, the MEP can also be applied to predict the frequencies of $(k+2)$-tuples,
$(k+3)$-tuples and so on,
based on the frequencies of $k$-tuples.
In fact, the result is obtained by repeatedly applying eq.(\ref{eq:mep}). For example:
151: {
152: \setlength\arraycolsep{2pt}
153: \begin{eqnarray} \label{eq:mlevel}
154: {P^0}_{c_1c_2{\cdots}c_{k+1}c_{k+2}}&=&\frac{{P^0}_{c_1c_2{\cdots}c_{k+1}}{\times}{P^0}_{c_2c_3{\cdots}c_{k+2}}}{P_{c_2c_3{\cdots}c_{k+1}}} 
155: 	\nonumber \\
156: &=&\frac{P_{c_1c_2{\cdots}c_k}{\times}{P_{c_2c_3{\cdots}c_{k+1}}}{\times}P_{c_3c_4{\cdots}c_{k+2}}}{P_{c_2c_3{\cdots}c_k}{\times}P_{c_3c_4{\cdots}c_{k+1}}}.
157: \end{eqnarray}
158: }
Thus, whenever one refers to the expected frequency of a word of a given length,
the word length on which the prediction is based must be stated.
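
To illustrate with the shortest case (a direct instance of eq.(\ref{eq:mep}) and eq.(\ref{eq:mlevel}), not an additional result), the expected frequency of a 3-tuple can be predicted either from the base frequencies or from the 2-tuple frequencies:
\begin{displaymath}
{P^0}_{c_1c_2c_3}=P_{c_1}{\times}P_{c_2}{\times}P_{c_3}
\qquad \mbox{or} \qquad
{P^0}_{c_1c_2c_3}=\frac{P_{c_1c_2}{\times}P_{c_2c_3}}{P_{c_2}} .
\end{displaymath}
The first expression follows from eq.(\ref{eq:mlevel}) with $k=1$, where the frequencies in the denominator are those of the empty word and equal 1; the second follows from eq.(\ref{eq:mep}) with $k=2$. The two predictions generally differ, and coincide when the 2-tuple frequencies factorize as in eq.(\ref{eq:intu}).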
161: 
Given the frequencies of longer words,
one can always obtain the frequencies of shorter ones.
On the other hand, the expected frequencies of longer words,
eq.(\ref{eq:mep}), are predicted from the frequencies of shorter words, with no further
information added. Therefore,
the deviation of the measured frequencies from the expected ones carries the new
information that emerges only in the frequencies of the longer words.
To make use of this information, we introduce the following significance index
170: \begin{equation} \label{eq:index}
171: I_{w^k}=\frac{P_{w^k}-{P^0}_{w^k}}{\sqrt{{P^0}_{w^k}}}.
172: \end{equation}
The indices of all $k$-tuples form a vector of dimension $4^k$.
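
As a numerical illustration with hypothetical values (not taken from the yeast data), suppose ${P^0}_{w^k}=1.0{\times}10^{-3}$ and $P_{w^k}=1.5{\times}10^{-3}$. Then
\begin{displaymath}
I_{w^k}=\frac{1.5{\times}10^{-3}-1.0{\times}10^{-3}}{\sqrt{1.0{\times}10^{-3}}}\approx 0.0158 .
\end{displaymath}
If the same quantities are expressed as counts in a set of $N$ word positions, $n_{w^k}=NP_{w^k}$ and ${n^0}_{w^k}=N{P^0}_{w^k}$ (the symbols $N$, $n$ and $n^0$ are introduced here only for this remark), the corresponding index $(n_{w^k}-{n^0}_{w^k})/\sqrt{{n^0}_{w^k}}$ equals $\sqrt{N}\,I_{w^k}$, so the ranking of words is unchanged while the magnitude grows with the amount of data.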
174: 
It should be pointed out that the simple solution eq.(\ref{eq:mep})
results from the constraints, eq.(\ref{eq:normal}). Although there are many ways to write down the prediction~\cite{c1,b2}, the Maximal Entropy Principle ensures that,
subject to these constraints, the solution eq.(\ref{eq:mep}) is the best
one.
However, one can consider more constraints. Apart from contiguous words, spaced patterns can also be included in the above statistical treatment~\cite{b2}. As an example, consider the spaced word $c_1$-$c_2$, where $c_1$ and $c_2$ are given bases and the base between them is irrelevant. One more constraint
180: \begin{displaymath}
181: 	P_{c_1-c_2}=\sum_{c}P_{c_1cc_2}
182: \end{displaymath}
can be imposed on the frequencies of $3$-tuples, and the statistical significance of the spaced words can also be evaluated. The MEP, as a general framework, is still applicable, but there is no longer a simple explicit solution such as eq.(\ref{eq:mep}).
184: 
185: \section{\bf The relationship between regulatory elements and statistically 
186: significant words in the {\em yeast} promoter regions.}
187: 
With the accumulation of a huge amount of genome sequence data, analysis of
the regulatory regions becomes urgent, because they govern the regulation
of gene expression. Finding the regulatory sites in eukaryotic genomes is
especially difficult, largely because of their strong variability.
This, however, gives
statistical methods the chance to play an important role in binding site prediction.
194: 
The regulatory elements are functionally constrained and are often
 shared by many genes. As a result, the sites are expected to be
significantly represented. On this
basis, the method developed above is expected to be applicable
to finding regulatory sites in the promoter regions of {\it yeast}.
We check this point in two ways.
201: 
First, simply as an illustration of
the effectiveness of the MEP treatment, a data set including
all the promoters of {\it yeast} is used
to perform the statistical
evaluation. The promoter regions are taken,
following Zhang~\cite{b1}, to be the upstream regions 500 bases long.
From this sequence set the word frequencies are obtained
 and $I_{w^k}$, $k=2,\cdots,8$, are calculated according to eq.(\ref{eq:mep}) and eq.(\ref{eq:index}).
(To obtain $I_{w^k}$, ${P^0}_{w^k}$ is predicted from the frequencies of $(k-1)$-tuples.)
For comparison, the indices $I_{w^k}$, $k=2,\cdots,8$, of words in the coding regions (CDSs) of {\em yeast} were also calculated.
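
For instance, the expected frequency of the 5-tuple TATAT used to compute its index is obtained from the 4-tuple frequencies by eq.(\ref{eq:mep}) with $k=4$ (the word is picked here only as an illustration of the formula):
\begin{displaymath}
{P^0}_{TATAT}=\frac{P_{TATA}{\times}P_{ATAT}}{P_{ATA}}, \qquad
I_{TATAT}=\frac{P_{TATAT}-{P^0}_{TATAT}}{\sqrt{{P^0}_{TATAT}}} .
\end{displaymath}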
212: 
To compare the significance index of words with experimentally verified regulatory
 elements, a purely statistical approach was adopted.
The promoter database of {\it yeast} collected by
Zhu and Zhang~\cite{r10} was used as the set of targets.
A word is said to {\it hit the target} if it covers a known
regulatory element or part of the element.
In this way, each word is checked against all the elements in
the database. We want to see whether the total hits of words correlate
with the significance index.
222: 
Fig.~1 shows the ratio of the average
hits of words whose significance index is larger than
a certain cutoff (5.0, 3.0, or 2.0) to the average hits of all the $k$-tuples.
Some properties of the significance index in the promoter regions are revealed.
First, for all the cutoff values shown in Fig.~1, the ratios are always larger than $1$.
Second, when the words are longer than 4 bases, the average hits
increase with the cutoff. Furthermore, the ratio also increases with the word length.
For comparison, Fig.~1 shows that the ratio of hits does not depend on the significance index in the CDS regions.
231: 
232: 
To examine the dependence of hits on the significance index further, words are divided
 into groups according to their significance index values. In each group the hits were
averaged. See Table~1, and Fig.~2, which is based on the
data in Table~1 but gives a more visual illustration. The dependence of hits on the significance index shown in Fig.~1 is seen again.
Furthermore, the average hits are not a
monotonic function of the significance index in the promoter regions.
For words with either large positive or
large negative significance index in the promoter regions,
the average hits are larger than those of words
whose significance index is around zero.
Again, no dependence of the average hits on the significance index
in the CDS regions is observed in Table~1 and Fig.~2.
245: 
That words with large {\em negative} significance index
in the promoter regions also show
higher affinity to binding sites deserves more consideration. One account is that although some regulatory elements, such as those involved in the expression of housekeeping genes, are expected to be overrepresented since a large number of such genes are needed, others, which control the expression of essential genes needed only under restricted conditions, are expected to be underrepresented to avoid inappropriate expression. However, a more convincing explanation exists: if a word, e.g., $w$A, has a high positive index, then some of $w$C, $w$G, $w$T are expected to have negative indices. This can be seen from the following example. While the index of TATAT is 16.3, that of TATAA is $-12.2$. In fact, both have very high counts in the sequences and both are variants of the binding site of the same transcription factor.
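
The compensation between the extensions of a word can be made explicit; the following short argument is a sketch that assumes consistent measured frequencies, $\sum_cP_{wc}=P_w$ (edge effects neglected). Since the constraints eq.(\ref{eq:normal}) require $\sum_c{P^0}_{wc}=P_w$ as well, the deviations of the four extensions of $w$ sum to zero:
\begin{displaymath}
\sum_{c}\left(P_{wc}-{P^0}_{wc}\right)=\sum_{c}\sqrt{{P^0}_{wc}}\;I_{wc}=0 ,
\end{displaymath}
where $c$ runs over $A$, $C$, $G$, $T$ and the index is the one predicted from the next shorter words. A large positive index of one extension must therefore be balanced by negative indices among the others, as for $w=$TATA in the example above.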
249: 
For universally occurring regulatory elements, as expected, the
significance indices in
the promoter regions are very high. One example is the poly(A/T) stretches.
As given above, the significance index of TATAT is 16.3. The significance
index of TATATAT, 8.1, is also high.
As another example, the significance index of the core of the CAAT-box, CAAT,
is 8.95.
However, in order to develop an algorithm for regulatory element prediction,
more subtle considerations are needed.
First, genes need to be classified into
families to enhance the compositional bias of the sequences.
Furthermore, a more elaborate use of the information given
by the significance index should
be considered, because, according to eq.(\ref{eq:mlevel}), the expected frequency of $k$-mers can be defined in $k-1$ ways, i.e., based on the frequencies of $1$-, $2$-, $\cdots$, $(k-1)$-mers, respectively. For each definition a significance index can be obtained.
Considering the statistical properties of words in the sequences, each of these indices may give useful information.
We choose two coregulated gene families to test our method further.
266: 
The coregulated genes of {\em yeast} metabolism have been widely studied, and these data sets provide ideal material for testing methods of binding site prediction. Two families of coregulated genes, GCN and TUP, are shown in Table~2. Detailed information on them can be found in~\cite{r9}. For each family, the frequencies of 6-tuples in the promoter regions were first collected. Their expected frequencies were then predicted in five ways, based on the frequencies of bases, 2-tuples, 3-tuples, 4-tuples, and 5-tuples, respectively. Instead of $I_{w^k}$, a simpler significance index, $P_{w^k}/{P^0}_{w^k}$, was used. In our study only a single strand of the promoter sequences is considered. This differs from~\cite{r9}, where the occurrences of each word are counted on both strands, so that there are only 2080 distinct oligonucleotides, whereas the number in our analysis is $4^6=4096$. Table~3 shows the words for which at least 3 of the 5 significance indices are larger than 3. There are 13 such words for the GCN family and 23 for the TUP family. In Table~3 several words tend to cluster together to form a longer pattern. Generally speaking, the clusters can be expanded by including words with slightly lower significance.
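
The number 2080 can be checked by a short count (a standard combinatorial argument, added here as an illustration). A 6-tuple is identical to its reverse complement only if its last three bases are the reverse complement of its first three, which leaves $4^3=64$ such words; the remaining $4096-64$ words fall into pairs, so that the number of distinct oligonucleotides when both strands are counted is
\begin{displaymath}
\frac{4^6-4^3}{2}+4^3=2016+64=2080 .
\end{displaymath}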
268: 
In both families, 6-tuples corresponding to regulatory binding sites found by experimental analysis are observed in Table~3. See the first cluster of words for the GCN family and the first and second clusters for the TUP family. Most of these words also show high statistical significance in the analysis of~\cite{r9}. Some words predicted by~\cite{r9} but not verified by experiments are also observed in Table~3 (significant words shared by~\cite{r9} and the present analysis are shown in bold in Table~3). However, our analysis also finds many significant words that do not receive highly significant scores in~\cite{r9}.
270:  
Two clusters of words for the TUP family are noteworthy (see the first and second clusters in Table~3). The first cluster includes GTGGGG, AGGGGC, ACGGGC, TGGGGT, and GGGGTA, and the second cluster includes TACCCC, ACCCCG, CCCCGC, and CCCCAC. Among them, GTGGGG and CCCCAC, and GGGGTA and TACCCC, are reverse complements. Both clusters correspond to the binding sites of the transcription factor Mig1p (Zn finger), but seen from different strands. This may imply that the binding sites of Mig1p are active in both orientations. This property, however, was not found for the binding sites of Gcn4p (see Table~3). For example, when the cutoff of the significance index is reduced to 1.3 (46 words then satisfy the criterion), the cluster of TGACTC and GACTCA expands to include another 4 members: CGATGA, GATGAC, ATGACT, and GTGACT; while only one of their reverse complements, GAGTCA, also has 3 indices larger than 1.3. Among the 46 words, it can only be clustered with one other word, GGAGTC. Thus, the binding sites of Gcn4p seem to be active preferentially in one direction.
272: 
Among the available methods of binding site prediction, ours is similar to that of~\cite{r9} in that both work by defining expected frequencies of words. The difference is that our method defines the expected frequencies from the statistical properties of the sequences themselves, while~\cite{r9} more or less heuristically takes the word frequencies of the whole set of non-coding sequences as the expected values. Our method is therefore expected to be more precise and to give less biased results.
274: 
An alternative method developed by Bussemaker, Li, and Siggia~\cite{add4} takes a more subtle account of the statistical features of DNA sequences. In their model, the sequence is considered as a text without delimiters between words. They apply a maximum likelihood approach to {\em recover} the words, which they consider as possible binding site candidates. However, the computation needed to obtain meaningful results is far more complex.
276: 
More methods for detecting unknown elements within functionally related sequences are available (for a review, see~\cite{add1}), most of which, such as the consensus method~\cite{add3} and the Gibbs sampler~\cite{add2}, are based on well-defined biological models. The types of signals they can detect are generally limited, and it is difficult for them to detect multiple signals. However, these methods are able to detect much longer patterns with high precision. The present method can be used to detect multiple elements, but the patterns it can find are short.
278: 
Comparing the noncoding
and coding regions of DNA sequences is also a widely explored problem in biology~\cite{d1,d2,r11}. The MEP treatment
 gives a systematic way to study the statistical differences between
 coding and noncoding regions.  Table~1 shows
that the significance indices in the CDS
regions are spread over a much wider range than those
of the promoter regions. The contrast holds for all the word lengths we
studied (up to 8 bases). This reveals that the CDS regions are in a more
nonrandom state. Two factors may help to interpret this phenomenon.
First, the mutation rate of the CDS regions is much lower than
that of the promoter regions~\cite{d2}. Second, the use of the genetic code in
the CDS regions is universal and definite, while in the promoter regions
the lengths of regulatory elements differ from each other and the regulatory
elements may differ strongly from their consensus sequences.
293: 
294: \section*{ACKNOWLEDGMENTS}
We are grateful to Professor Bai-lin Hao and Wei-mou Zheng for stimulating discussions. We also thank Guo-yi Chen for help with the computations.
296: \newpage
297: 
298: \begin{thebibliography}{99}
299: \renewcommand{\baselinestretch}{0.2}
300: \parskip=-0.4mm
301: {%\small
302: \bibitem{r1} C.-K. Peng, S.V. Buldyrev, A.L. Goldberger, S. Havlin, F. Sciortino, M. Simons, H.E. Stanley, Nature 356 (1992) 168.
\bibitem{r2} C.E. Shannon, Bell System Tech. J. 27 (1948) 379.
\bibitem{b1} M.Q. Zhang, Comput. Chem. 23 (1999) 233.
305: \bibitem{ar1} A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston, E.J. Louis, H.W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, S.G. Oliver, Science 274 (1996) 546.
306: \bibitem{r10} J. Zhu, M.Q. Zhang, Bioinformatics 15 (1999) 607.
\bibitem{r7} J. Honerkamp, Statistical Physics: An Advanced Approach with Applications, Springer, Berlin, 1998.
308: \bibitem{c1} G.J. Phillips, J. Arnold, R. Ivarie, Nucl. Acids. Res. 15 (1987) 2611.
\bibitem{b2} P.A. Pevzner, M.Y. Borodovsky, A.A. Mironov, J. Biomol. Struct. Dynam. 6 (1989) 1013.
\bibitem{r9} J. van Helden, B. Andr\'e, J. Collado-Vides, J. Mol. Biol. 281 (1998) 827.
311: %\bibitem{b3} M.B. Eisen, P.O. Spellman, D. Botstein, Proc. Natl. Acad. Sci. U.S.A. 95 (1998) 14863.
312: %\bibitem{b4} F.R. Roth, J.D. Hughes, P.W. Estep, G.M. Church,  Nature Biotechnology 16 (1998) 939.
313: %\bibitem{b5} S. Liuli, N. Prunella, G. Pesole, T. D'Orazio, E. Stella, A. Distante, CABIOS 9 (1993) 701.
314: \bibitem{add4} H.J. Bussemaker, H. Li, and E.D. Siggia, Preprint.
315: \bibitem{add1} J.W. Fickett, A. G. Hatzigeorgiou, Eukaryotic Promoter Recognition in Genome Research, Cold Spring Harbor Laboratory Press, 1997.
316: \bibitem{add3} G.Z. Hertz, G. W. Hartzell, G.D. Stormo, Comput. Appl. Biosci. 6 (1990) 81.
\bibitem{add2} C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald, J.C. Wootton, Science 262 (1993) 208.
\bibitem{d1} C. Burge, A. Campbell, S. Karlin, Proc. Natl. Acad. Sci. USA 89 (1992) 1358.
319: \bibitem{d2} W.-H. Li, Molecular Evolution, Sinauer Associates, Canada, 1997.
320: \bibitem{r11} R.N. Mantegna, S.V. Buldyrev, A.L. Goldberger, C.-K. Peng, M. Simons, H.E. Stanley, Phys. Rev. Lett. 73 (1994) 3169.
\bibitem{add5} A.G. Hinnebusch, General and pathway-specific regulatory mechanisms controlling the synthesis of amino acid biosynthetic enzymes in {\em Saccharomyces cerevisiae}, in: E.W. Jones, J.R. Pringle, J.R. Broach (Eds), The Molecular and Cellular Biology of the Yeast {\em Saccharomyces}: Gene Expression, pp. 319-414, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 1992.
322: \bibitem{add6} J.L. Derisi, V.R. Iyer, P.O. Brown, Science, 278 (1997) 680.
323: 
324: }
325: 
326: 
327: \end{thebibliography}
328: 
329: \newpage
330: 
331: \begin{figure}
332: \centerline{\epsfxsize=10cm \epsfbox{hits.eps}}
333: \caption{
The ratio of the average hits ($H$) of words above a given cutoff of the significance index to the average hits ($H_0$) of all the words of the same length. The values of $H_0$ (word length in parentheses) are $405(2), 94.4(3), 21.7(4), 4.92(5), 1.10(6), 0.241(7), 0.0528(8)$.
335: }
336: \end{figure}
337: 
338: 
339: \newpage
340: 
341: 
342: \begin{table}
\caption{The dependence of the average hits on the significance index
$I_{w}=\frac{P_w-{P^0}_w}{\sqrt{{P^0}_w}}.$
 The values shown in the hits column are averaged over the hits of the points (words) whose significance index falls in the range shown in the $I_w$ column.}
346: {\scriptsize
347: \renewcommand{\baselinestretch}{-0.2}
348: \begin{center}
349: \begin{tabular}{|lll|lll|lll|lll|}
350: \hline
\multicolumn{6}{|c|}{pentamer}&
\multicolumn{6}{c|}{hexamer} \\
353: \hline
354: \multicolumn{3}{|c|}{promoter}&
355: \multicolumn{3}{c|}{CDS}&
356: \multicolumn{3}{c|}{promoter}&
357: \multicolumn{3}{c|}{CDS} \\
358: \hline
359: $I_w$&points&hits&$I_w$&points&hits&$I_w$&points&hits&$I_w$&points&hits \\
360: \hline
361: -15,-9&9&7.33&-29,-12&16&5.50&-11,-6&10&2.10&-15,-9&12&1.17 \\
362: -9,-7&12&4.75&-12,-10&12&4.33&-6,-4&15&1.40&-9,-7&15&0.60 \\
363: -7,-5&16&4.00&-10,-9&12&5.67&-4,-3&28&1.79&-7,-6&30&1.40 \\
-5,-4&22&3.50&-9,-8&14&5.14&-3,-2&161&1.04&-6,-5&95&0.78 \\
365: -4,-3&29&4.31&-8,-7&23&4.87&-2,-1&716&0.976&-5,-4&102&0.98 \\
366: -3,-2&91&4.02&-7,-6&40&5.38&-1,0&1137&0.997&-4,-3&202&1.15  \\
367: -2,-1&158&4.03&-6,-5&44&4.34&0,1&1169&1.05&-3,-2&334&1.10  \\
368: -1,0&182&4.46&-5,-4&46&5.41&1,2&593&1.27&-2,-1&563&1.07  \\
369: 0,1&198&5.02&-4,-3&63&5.27&2,3&178&1.51&-1,0&738&1.09 \\
370: 1,2&134&5.58&-3,-2&74&5.34&3,4&52&1.28&0,1&750&1.04 \\
371: 2,3&77&5.25&-2,-1&79&4.52&4,5&19&1.68&1,2&570&1.09 \\
372: 3,4&47&6.55&-1,0&94&5.53&5,6&11&2.18&2,3&345&1.24  \\
373: 4,6&21&7.09&0,1&101&4.58&6,13&12&3.17&3,4&117&1.07 \\
374: 6,8&13&6.15&1,2&83&4.16& & & &4,5&94&1.29  \\
375: 8,10&10&7.40&2,3&59&5.47& & & &5,6&64&1.17 \\
376: 10,19&9&10.22&3,4&60&4.60& & & &6,7&21&1.23  \\
377:  & & &4,5&48&4.70& & & &7,9&18&1.00  \\
378:  & & &5,6&38&4.08& & & &9,19&16&1.44 \\
379:  & & &6,7&27&4.19& & & & & &  \\
380:  & & &7,8&25&4.80& & & & & &  \\
381:  & & &8,9&19&6.84& & & & & & \\
382:  & & &9,10&13&5.46& & & & & &  \\
383:  & & &10,12&12&4.42& & & & & &  \\
384:  & & &12,29&19&5.21& & & & & &  \\
385: \hline
386:  \end{tabular}
387: \end{center}
388:  }
389: \end{table}
390: 
391: \newpage
392: 
393: \begin{figure}
394: \centerline{\epsfxsize=10cm \epsfbox{score6.eps}}
395: \caption{
The dependence of the average hits of 6-tuples on their average significance index. The figure gives a more visual illustration of the 6-tuple data in Table 1.
397: }
398: \end{figure}
399: 
400: %\clearpage
401: %\newpage
402: 
403: \begin{table}
\caption{The coregulated gene families GCN and TUP, and the criteria by which
their genes are grouped.}
406: \vspace {1cm}
407: {\scriptsize
408: \begin{tabular}{lp{5.5cm}p{4.7cm}l}
409: \hline
410: Family&Genes&Shared regulatory property&References \\
411: \hline
412: GCN&ARG1,ARG3,ARG4,ARG8,ARO3,ARO4,
413: ARO7,CPA1,CPA2,GLN1,HIS1,HIS2,
414: HIS3,HIS4,HIS5,HOM2,HOM3,HOM6,
415: ILV1,ILV2,ILV5,LEU1,LEU2,LEU3,
416: LEU4,LYS1,LYS2,LYS5,LYS9,MES1,
417: MET14,MET3,MET6,TRP2,TRP3,
TRP4,TRP5,THR1&General amino acid control;
419: genes activated by Gcn4p.&Hinnebusch~\cite{add5} \\
420: TUP&FSP2,YNR073C,YOL157C,HXT15,SUC2,
421: YNR071C,YDR533C,YEL070W,RNR2,
422: YER067W,CWP1,YGR243W,YDR043C,
423: YER096W,HXT6,YLR327C,YJL171C,
424: YGR138C,HXT4,GSY1,YOR389W,
425: MAL31,YML131W,RCK1&All genes which are both 
derepressed by a factor larger than 4
427: when TUP1 is deleted, and 
428: induced by a factor larger than during the 
429: diauxic shift&DeRisi et al.~\cite{add6}  \\
430: \hline
431:  \end{tabular}
432: }
433: \end{table}
434: 
435: \clearpage
436: \newpage
437: 
438: \begin{table}
\caption{Highly overrepresented words in the promoter regions of the GCN and TUP families. For each family, the 6-tuples for which at least 3 of the 5 significance indices are larger than 3 are listed. Words that also appear in Table 2 of~\cite{r9} as significant patterns are highlighted in bold. Words are clustered according to their similarity. {\em sig(i)} is the value of $P_{w^6}/{P^0}_{w^6}$, with ${P^0}_{w^6}$ being the frequency of the 6-tuple $w^6$ predicted from the frequencies of $i$-tuples.}
440: \vspace{1cm}
441: {\scriptsize
442: \begin{tabular}{lr@{}lrrrrrrcc}
443: \hline
444: \multicolumn{9}{c}{analysis result on 6-tuples}&
445: \multicolumn{2}{c}{sites previously characterized}\\
446: Family&
447: \multicolumn{2}{c}{Sequences}&counts&{\em sig(1)}&{\em sig(2)}&{\em sig(3)}&{\em sig(4)}&{\em sig(5)}&Consensus&binding factors\\
448: \hline
449: \vspace {-0.2cm}
450: GCN&
451:  {\bf TGA}&{\bf CTC}   &29 &4.47 &4.61 &4.16 &2.93 &1.39&RRTGACTCTTT&Gcn4p \\
452: & {\bf GA}&{\bf CTCA}  &21 &3.28 &3.36 &3.25 &3.24 &1.39&&(bZip) \\
453: \vspace {-0.2cm}
454: &{\bf CCGG}&{\bf TT}   &12 &3.18 &3.38 &3.47 &2.07 &1.50&& \\%AACCGG
455: \vspace {-0.2cm}
456: &CCGG&GT   &6 &2.77 &3.27 &3.29 &3.01 &1.70&-&- \\
457: &  GG&GCGC &5 &4.02 &3.10 &2.93 &3.66 &1.68&& \\
458: \vspace {-0.2cm}
459: &CAG&CAG   &16 &4.35 &3.45 &3.12 &1.99 &1.69&-&- \\
460: &{\bf CAG}&{\bf CGG}   &12 &5.61 &4.95 &4.63 &2.28 &1.55&& \\
461: 
462: &{\bf CCG}&{\bf CTG}   &12 &4.99 &4.60 &3.51 &2.18 &1.36&-&- \\%CAGCGG
463: \vspace {-0.2cm}
464: &CCC&CCC   &7 &3.71 &3.88 &3.15 &2.10 &1.84&& \\
465: &CCT&GCC   &10 &3.75 &3.15 &3.22 &1.94 &1.55&-&- \\
466: &GTG&CCA   &14 &3.76 &3.35 &3.06 &2.23 &1.37&& \\
467: 
468: &GGT&GGT   &10 &3.26 &3.73 &3.14 &2.19 &1.53&-&- \\
469: 
470: 
471: \hline
472: \vspace {-0.2cm}
473: TUP&
474:  {\bf GTGG}&{\bf GG}   &9 &6.67 &5.23 &3.76 &3.27 &1.47&& \\
475: \vspace {-0.2cm}
476: & {\bf AGG}&{\bf GGC}  &10 &6.77 &3.90 &3.65 &2.64 &1.64&KANWWWWATSYGGGGW&Mig1p\\
477: \vspace {-0.2cm}
478: & ACG&GGC  &7 &4.49 &3.57 &3.23 &2.62 &1.97&&(Zn finger) \\
479: \vspace {-0.2cm}
480: & {\bf TGG}&{\bf GGT}  &9 &4.10 &3.21 &3.14 &3.39 &1.37&& \\
481: &  {\bf GG}&{\bf GGTA} &10 &4.39 &4.29 &3.52 &2.89 &1.58&& \\
482: 
483: \vspace {-0.2cm}
484: &{\bf TACC}&{\bf CC}   &16 &5.67 &5.73 &4.22 &2.52 &1.32&& \\%GGGGTA
485: \vspace {-0.2cm}
486: & {\bf ACC}&{\bf CCG}  &11 &6.34 &5.24 &5.15 &3.27 &1.39&Complement of&Mig1p \\%CGGGGT
487: \vspace {-0.2cm}
488: &  {\bf CC}&{\bf CCGC} &8 &7.37 &4.68 &3.60 &2.31 &1.33&KANWWWWATSYGGGGW&(Zn finger) \\%GCGGGG
489: &  {\bf CC}&{\bf CCAC} &12 &6.55 &5.05 &3.49 &2.27 &1.36&& \\%GTGGGG
490: 
491: %\vspace {-0.2cm}
492: &{\bf AGG}&{\bf AGG}   &11 &4.66 &3.79 &3.12 &1.70 &1.44&-&- \\
493: & GG&TGGT  &9 &4.10 &4.27 &3.41 &2.21 &1.31&& \\
494: 
495: \vspace {-0.2cm}
496: &CTC&GAG   &8 &3.15 &4.00 &4.42 &2.22 &1.17&-&- \\
497: & TC&GAGG  &9 &3.75 &3.88 &4.38 &2.15 &1.73&& \\
498: 
499: \vspace {-0.2cm}
500: &{\bf GCG}&{\bf GAG}   &7 &4.74 &4.07 &3.20 &1.84 &1.35&-&- \\
501: & CG&GAGA  &10 &4.02 &4.17 &3.05 &1.97 &1.69&& \\
502: 
503: \vspace {-0.2cm}
504: &CTG&CTA   &10 &2.42 &3.23 &4.28 &3.21 &1.90&& \\
505: \vspace {-0.2cm}
506: &{\bf GTG}&{\bf CCT}   &17 &6.95 &6.81 &4.86 &3.34 &1.71&-&- \\
507: & TG&CCAC  &10 &3.74 &3.38 &3.02 &1.74 &1.51&& \\
508: 
509: \vspace {-0.2cm}
510: &GCG&CCG   &4 &4.10 &3.12 &3.13 &3.23 &2.67&& \\
511: \vspace {-0.2cm}
512: &GCA&ACG   &9 &3.43 &2.88 &3.12 &3.13 &1.37&-&- \\
513: &{\bf GCA}&{\bf CGG}   &8 &5.13 &4.66 &3.12 &2.58 &1.66&& \\
514: 
515: &CAG&TGG   &8 &3.33 &3.48 &3.01 &1.90 &1.61&-&- \\
516: 
517: &CGC&GAT   &7 &2.76 &3.48 &4.12 &3.68 &2.083&-&- \\
518: \hline
519: \end{tabular}
520: }
521: \end{table}
522: \end{document}
523: 
524: