q-bio0309020/tig.tex
1: %\documentclass[twocolumn,prl,showpacs]{revtex4}
2: \documentclass[preprint]{revtex4}
3: \usepackage{graphicx,epsfig,amssymb}
4: 
5: 
6: \begin{document}
7: %\draft
8: %\preprint{}
9: 
10: 
11: \title{Human housekeeping genes are compact}
12:  
13: \author{Eli Eisenberg and Erez Y. Levanon}
14: \affiliation{Compugen Ltd., 72 Pinchas Rosen Street, Tel Aviv 69512, Israel}
15: 
16: \begin{abstract}
17: We identify a set of 575 human genes that are
18: expressed in all conditions tested in a publicly available
19: database of microarray results. Based on this common
20: occurrence, the set is expected to be rich in ``housekeeping''
21: genes, showing constitutive expression in all tissues.
22: We compare selected aspects of their genomic structure
23: with a set of background genes. We find that the
24: introns, untranslated regions and coding sequences
25: of the housekeeping genes are shorter, indicating a
26: selection for compactness in these genes.
27: \end{abstract}
28: 
29: \maketitle 
30: 
31: The amazing diversity of the human body stems from the
32: different expression patterns of genes in different tissues.
33: Although most genes show constitutive expression in only
34: a subset of tissues, some gene products are required for the
35: maintenance of the basal cellular function and are
36: constitutively found in all human cells. These genes are
37: called housekeeping genes (HK genes) \cite{1}. HK genes can
38: be used to calibrate measurements of gene expression \cite{2}.
39: They might also help to define the minimal gene complement
40: needed for a human cell \cite{1}. Several attempts have been
41: made recently to define the complete set of HK genes \cite{3,4}.
42: 
43: Microarrays are often used to identify sets of genes that
44: are expressed either ubiquitously or in specific tissues or
45: conditions. However, the technique is technically demanding
46: and prone to artifacts, so independent evidence is often
47: required to confirm the results. In principle, identifying
48: the set of HK genes using microarray data is straightforward;
49: one need only look for genes that are expressed in all
50: tissues and all experimental conditions. Employing such
51: an approach has so far resulted in two lists of HK genes
52: \cite{3,4}. However, problems in probe design, measurement
53: noise and other artifacts introduce inevitable errors in
54: such lists. Because a northern blot experiment for each
55: gene in each tissue is impractical, an independent test is
56: needed to validate any list of HK genes. Here, we report a
57: validation test that uses a recently discovered property of
58: highly expressed genes.
59: 
60: The transcription process is both slow and costly:
61: transcribing a single nucleotide takes approximately 50 milliseconds \cite{5,6}
62: and two ATP molecules \cite{7}. This might be
63: expected to provide selective pressure to make genes as
64: short as functionally possible. The more copies of a gene
65: required for the organism, the stronger this pressure
66: should be. The first demonstration of this principle \cite{8}
67: showed that genes with a large number of expressed
68: sequence tags (ESTs) in public libraries (and hence most
69: mRNAs) have a significantly shorter average intron length
70: than those with fewer ESTs.
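
As a rough illustration of this cost (a back-of-the-envelope sketch, not a calculation taken from the cited works), let $L$ denote the length of the primary transcript in nucleotides. The symbols $L$, $t$ and $E$ are introduced here for illustration only. Taking the approximate per-nucleotide figures quoted above,
\[
t(L) \approx 0.05\ \mathrm{s} \times L, \qquad E(L) \approx 2\ \mathrm{ATP} \times L .
\]
For a hypothetical primary transcript of $L = 10^{4}$ nucleotides, this gives $t \approx 500$~s and $E \approx 2\times 10^{4}$ ATP molecules per transcript, so every nucleotide removed from a frequently transcribed gene is saved once per copy made.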
71: 
72: Here, an implication of this principle is used to validate
73: a set of HK genes. The HK genes, which are transcribed in
74: all somatic cells and under all circumstances, are by
75: nature highly expressed, and therefore should be selected
76: to have shorter introns. We used a recently published
77: database of microarray experiments \cite{9} to identify a set of
78: HK genes. As a further validation step, we checked the
79: Gene Ontology (GO) annotation of these genes. We
80: compared the structure of the HK genes with all other
81: genes, and not only the introns, but all parts of the HK
82: genes were found to be, on average, shorter than other
83: genes. In particular, the untranslated regions and the
84: translated proteins are all shorter in the HK genes.
85: 
86: \section{Assignment of housekeeping genes}
87: 
88: \begin{figure}
89: \includegraphics[width=2.5in]{TIGfig1a.eps}
90: \caption{The distribution of 7500 RefSeq genes represented on 
91: the microarray as a function of the number of tissues in which they are expressed. 
92: Each bin gives the number of genes
93: expressed in $M$ out of 47 different tissues. The $M=47$ bin corresponds to 
94: the housekeeping genes, expressed in all tissues.}
95: \end{figure}
96: 
97: A recently published database provides microarray
98: expression data for the Affymetrix U95A chip, which contains
99: 12,600 probes and was hybridized to 101 different samples \cite{9}
100: from 47 different human tissues and cell lines. These
101: samples are mainly from the normal human physiological
102: state, and therefore this dataset provides a description of
103: the normal mammalian transcriptome.
104: 
105: We calculated the distribution of the number of different
106: tissues in which a gene is expressed. Discarding probes for
107: which the associated gene was not represented in the
108: RefSeq database \cite{10}, and unifying all probes measuring
109: the same gene (ignoring the potential differences among
110: splice variants) yielded probes representing 7500 human
111: genes. The experiments measuring replicates of the same
112: biological condition were averaged to reduce the measurement
113: noise, resulting in 47 data points per probe. We
114: considered a probe to be expressed in a given
115: condition if its average reading was above a certain cutoff
116: value. The results were not sensitive to the exact cutoff
117: value, and we chose 200 standard Affymetrix average difference
118: units, considered to be a conservative cutoff
119: value for determining gene presence \cite{9}. This is also the
120: trimmed average expression level in each tissue in
121: accordance with the standard Affymetrix normalization
122: procedure \cite{11,12}. Thus, our HK genes are expressed in all
123: tissues at an above-average level.
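
The selection rule described above can be written compactly. The notation is ours and is introduced only for illustration: $\bar{x}_{g,t}$ is the replicate-averaged reading for gene $g$ in tissue $t$, $c$ is the cutoff, and $\Theta$ is the unit step function ($\Theta(y)=1$ for $y>0$ and $\Theta(y)=0$ otherwise). How several probes for the same gene are combined is not specified here and is left out of this sketch.
\[
M(g) = \sum_{t=1}^{47} \Theta\left(\bar{x}_{g,t} - c\right), \qquad c = 200,
\]
\[
\mathrm{HK} = \{\, g : M(g) = 47 \,\}.
\]
Under this reading, a gene enters the HK set only if its averaged signal exceeds the cutoff in every one of the 47 tissues.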
124: 
125: A histogram (Fig. 1) of the number of genes expressed
126: in exactly $M$ of the 47 tissues shows a clear tendency for
127: frequency to decrease as $M$ increases. However, a
128: substantial number of genes (575) belong to the class of
129: genes that are expressed in all tissues. Because their
130: number is far greater than expected based on the general
131: trend described above, we assumed this class to be rich in
132: HK genes, and considered it to be the set of HK genes.
133: 
134: It is noteworthy that the genes in our HK list tend to
135: have an average expression significantly higher than other
136: genes; the geometric mean expression of our HK genes is
137: 1200 in Affymetrix average difference units, whereas that
138: of other genes is 150. The difference cannot be accounted
139: for by the cutoff used to define the HK genes, and is not a
140: result of a bias due to inclusion of genes expressed in a few
141: tissues only (data not shown).
142: 
143: Two additional tests were conducted to validate this set.
144: First, a study of the GO annotation \cite{13} of these genes
145: revealed that the set is rich in metabolic proteins (24\%) and
146: RNA-interacting proteins (19\%, mostly ribosomal proteins).
147: Second, we compiled a list of 18 well-established
148: HK genes commonly used for quantitative PCR calibration
149: \cite{14,15}, and checked our list against it. We found 13 of
150: the 18 genes in our list, and the other five were not
151: represented on the microarray (see Table in Supplementary
152: Information at http://www.compugen.co.il/supp\_info/Housekeeping\_genes.html).
153: 
154: \section{Length analysis of HK genes}
155: 
156: \begin{figure}[t]
157: \includegraphics[width=2.5in]{TIGfig1b.eps}
158: \caption{A histogram of the total length of introns. Green bars, HK genes; 
159: blue bars, non-HK genes.}
160: \end{figure}
161: 
162: 
163: \begin{table}[b]
164: 
165: \caption{{\bf Human housekeeping genes are compact.} 
166: Comparison of structure of housekeeping (HK) 
167: genes versus non-HK genes. For each case the first 
168: line gives the average value $\pm$ s.e.m., and the second line gives the median.
169: For the average intron and exon lengths, 
170: all introns and exons belonging to the relevant set were 
171: included; the number appears in parentheses. The P-value was calculated
172: using the Mann-Whitney test. UTR, untranslated region.
173: }
174: 
175: \begin{tabular}{llll}
176: & & &\\
177: & {\bf HK genes (n=532)}&{\bf non-HK (n=5404)}& {\bf P-value}\\
178: Average intron length&$2573\pm 145$(n=4353)\ \ \ \ \ \ &$5025\pm 71$(n=57447)\ \ \ \ \ \ &$4\times 10^{-130}$\\
179: &672&1365& \\
180: Total intron length&$21050\pm 1781$&$53418\pm 1425$&$7\times 10^{-28}$\\
181: &9293&20804& \\
182: Average exon length&$212\pm 5$(n=4885)&$240\pm 2$(n=62851)&$9\times 10^{-5}$\\
183: &672&1365& \\
184: 5' UTR length&$135\pm   8$        &$ 173\pm  3$         &$4\times 10^{-  7}$\\
185: & 79& 106& \\
186:     3' UTR length     &$599 \pm  30$        &$846 \pm 13$         &$3\times 10^{-13 }$\\
187: &333& 552& \\
188: Coding sequence length&$1211\pm  44$        &$1770\pm 26$         &$3\times 10^{-26 }$\\
189: &928&1322& \\
190: Number of introns      &$8.2 \pm 0.3$        &$10.6\pm 0.2$         &$6\times 10^{-7  }$\\
191: &6  &   8& \\
192: Intron bps per coding bp\ \ \ \ \ \ \ \ & $20  \pm   2$        &$31.8\pm 0.8$         &$2\times 10^{-11 }$\\
193: &9.9&15.6& 
194: \end{tabular}
195: 
196: \end{table}
197: 
198: Table 1 compares the lengths of various parts of the HK
199: genes and the background genes. The alignment data was
200: taken from the UCSC genome browser (http://genome.ucsc.edu) 
201: \cite{16}. We excluded 322 genes that do not have a
202: unique alignment, as well as 1242 genes that were not
203: expressed in any tissue (to avoid potential problems
204: because of defective probes). This left 532 HK genes and
205: 5404 non-HK genes. The histograms in Figs. 2--4 compare
206: HK genes with the other genes by total intron length, 5'
207: UTR length and coding sequence length. Remarkably,
208: there was a statistically significant difference between HK
209: and non-HK genes in all aspects of gene structure. Average
210: intron length is shorter for the HK genes than for the
211: background genes (2573 bp versus 5025 bp, respectively);
212: total intron length is shorter (21,050 bp versus 53,418 bp);
213: average exon length is shorter (212 bp versus 240 bp);
214: average lengths of both 3' and 5' untranslated regions
215: (UTRs) are shorter (5': 135 bp versus 173 bp; 3': 599 bp
216: versus 846 bp); and, most notably, the translated proteins
217: are shorter as well (403 amino acids versus 590 amino
218: acids). Accordingly, the number of intronic bp per bp
219: of coding sequence is lower for the HK genes
220: (20 versus 32). We studied the structure of each gene as a
221: function of the number of tissues in which it is expressed and
222: verified that the results are not due to a bias introduced into
223: the non-HK set by tissue-specific genes (data not shown).
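
For reference, the Mann-Whitney statistic behind the P-values in Table 1 can be sketched in its textbook form. The notation ($x_i$, $y_j$, $S$, $U$, $z$) is introduced here for illustration, and the use of the normal approximation without a tie correction is an assumption, since the exact variant of the test is not specified. For HK values $x_1,\ldots,x_{n_1}$ and non-HK values $y_1,\ldots,y_{n_2}$,
\[
U = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} S(x_i,y_j), \qquad
S(x,y) = \left\{ \begin{array}{ll} 1 & x<y \\ 1/2 & x=y \\ 0 & x>y \end{array} \right.
\]
Under the null hypothesis that both samples are drawn from the same distribution,
\[
\mu_U = \frac{n_1 n_2}{2}, \qquad
\sigma_U^2 = \frac{n_1 n_2 (n_1+n_2+1)}{12}, \qquad
z = \frac{U-\mu_U}{\sigma_U}.
\]
For the gene-level comparisons ($n_1=532$, $n_2=5404$) this gives $\mu_U = 1\,437\,464$ and $\sigma_U \approx 3.8\times 10^{4}$, so with samples of this size even a modest systematic shift between the two sets yields a very small P-value.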
224: 
225: \begin{figure}[t]
226: \includegraphics[width=2.5in]{TIGfig1c.eps}
227: \caption{A histogram of the length of the 5' untranslated regions 
228: (UTR). Green bars, HK genes; blue bars, non-HK genes.}
229: \end{figure}
230: 
231: The pronounced statistical characteristics of the HK
232: gene set further support its assignment as a unique set.
233: Our findings confirm and extend previous research,
234: showing that the introns of highly expressed genes are
235: shorter \cite{8}. As mentioned above, the expression
236: levels of the HK genes are high, and the fact that they have to be expressed
237: in all cells at all times makes them even more costly to
238: transcribe. Previously \cite{8}, the high abundance of a
239: gene in EST libraries was taken as an indication that the gene is
240: highly expressed in the human body. It was pointed out \cite{8},
241: however, that this method is prone to bias due to the
242: inclusion of normalized and tumor libraries and overrepresentation
243: of certain tissues. Our approach overcomes
244: this difficulty and confirms the previous result. Moreover,
245: we find here that UTRs and even the encoded proteins are
246: shorter for the HK genes. The magnitude of the difference
247: is greater for the introns than for the exons and proteins
248: (Table 1), which makes sense because the coding sequences
249: and the UTRs are less susceptible to change.
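
The relative sizes of these effects can be read off the means in Table 1. As a worked illustration (our arithmetic on the tabulated means, rounded to two decimal places), the ratios of HK to non-HK values are
\begin{eqnarray*}
\mathrm{average\ intron\ length:} && 2573/5025 \approx 0.51, \\
\mathrm{total\ intron\ length:} && 21050/53418 \approx 0.39, \\
\mathrm{average\ exon\ length:} && 212/240 \approx 0.88, \\
\mathrm{5'\ UTR\ length:} && 135/173 \approx 0.78, \\
\mathrm{3'\ UTR\ length:} && 599/846 \approx 0.71, \\
\mathrm{coding\ sequence\ length:} && 1211/1770 \approx 0.68.
\end{eqnarray*}
On these means, individual introns in the HK set are about half as long as in the background set, whereas exons are only about 12\% shorter, with the UTRs and coding sequences in between.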
250: 
251: It should be mentioned that intronless genes were
252: included in our analysis after verifying that their inclusion
253: or exclusion had no effect on the results. It also must be
254: noted that the UTRs are not always fully sequenced, and
255: thus their actual lengths might be longer. This bias was
256: found to have no effect on the length of the coding
257: sequences, and in any case the effect would be the same
258: for both HK and non-HK genes.
259: 
260: It has been noted that codon usage bias in nonmammalian
261: organisms is correlated with the expression
262: level and with the gene length \cite{17,18,19}. These results led to
263: the conjecture of selective pressure on highly expressed
264: genes resulting in shorter proteins \cite{19}. However, no
265: evidence for this selection was found \cite{18}, possibly because
266: of a lack of high quality databases for these organisms.
267: Recent work has suggested that there is no selection for
268: codon usage bias in humans \cite{20}, and thus our results
269: demonstrate that the expression-length correlation is
270: not related to the expression-codon bias correlation.
271: 
272: \begin{figure}
273: \includegraphics[width=2.5in]{TIGfig1d.eps}
274: \caption{A histogram of the length of the coding region. Green bars, HK genes; 
275: blue bars, non-HK genes.}
276: \end{figure}
277: 
278: It could be argued that selection towards shorter genes
279: should have eliminated the introns in highly expressed
280: genes. However, it is known that introns do have
281: important roles, such as splicing regulation. Therefore,
282: there is a balance between the advantageous contribution
283: of the introns and the selective pressure for shortening.
284: 
285: Finally, when we compared our results with two (largely
286: overlapping) published sets of HK genes, we found that
287: roughly half of the genes in the intersection of those sets
288: were present in our set. We used the genomic structure to
289: test the remaining genes, and found a statistically
290: significant difference between them and our HK gene
291: set. The differences between our results and those of
292: earlier studies \cite{3,4} could be due to the fact that the
293: database we used was based on more advanced chip
294: technology and included many more different tissues,
295: giving it more discriminative power to identify HK genes.
296: 
297: In conclusion, we have identified a set of HK genes. The
298: set is publicly available at 
299: http://www.compugen.co.il/supp\_info/Housekeeping\_genes.html
300: and can be used for
301: calibration of microarrays, toxicity evaluation and quantitative
302: PCR experiments. Furthermore, we show that
303: HK genes have shorter introns, UTRs and coding
304: sequences, attesting to the strong selection for compactness
305: in these genes.
306: 
307: \begin{acknowledgments}
308: We thank Andrew Su for helpful discussion and for providing us with the
309: RefSeq mapping. Gady Cojocaru and Rotem Sorek are acknowledged for
310: comments on the manuscript and insightful discussion.
311: \end{acknowledgments}
312: 
313: 
314: 
315: 
316: \begin{thebibliography}{10}
317: 
318: \bibitem{1}Butte, A.J. et al. (2001) Physiol. Genomics 7, 95-96.
319: \bibitem{2}Gibson, U.E. et al. (1996) Genome Res. 6, 995-1001.
320: \bibitem{3}Warrington, J.A. et al. (2000) Physiol. Genomics 2, 143-147.
321: \bibitem{4}Hsiao, L.L. et al. (2001) Physiol. Genomics 7, 97-104.
322: \bibitem{5}Ucker, D.S. and Yamamoto, K.R. (1984) 
323: J. Biol. Chem. 259, 7416-7420.
324: \bibitem{6}Izban, M.G. and Luse, D.S. (1992) J. Biol. Chem. 267,
325: 13647-13655.
326: \bibitem{7}Lehninger, A.L. et al. (1982) Biochemistry, 615-644.
327: \bibitem{8}Castillo-Davis, C.I. et al. (2002) Nat. Genet. 31, 415-418.
328: \bibitem{9}Su, A.I. et al. (2002) Proc. Natl. Acad. Sci. U.S.A. 99, 4465-4470.
329: \bibitem{10}Pruitt, K.D. et al. (2000) Trends Genet. 16, 44-47.
330: \bibitem{11}Lockhart, D.J. et al. (1996) Nat. Biotechnol. 14, 1675-1680.
331: \bibitem{12}Wodicka, L. et al. (1997) Nat. Biotechnol. 15, 1359-1367.
332: \bibitem{13}Gene Ontology Consortium (2001) Genome Res. 11, 1425-1433.
333: \bibitem{14}Hamalainen, H.K. et al. (2001) Anal. Biochem. 299, 63-70.
334: \bibitem{15}Lee, P.D. (2002) Genome Res. 12, 292-297.
335: \bibitem{16}Karolchik, D. et al. (2003) Nucleic Acids Res. 31, 51-54.
336: \bibitem{17}Akashi, H. (2001) Curr. Opin. Genet. Dev. 11, 660-666.
337: \bibitem{18}Duret, L. and Mouchiroud, D. (1999) Proc. Natl. Acad. Sci. U.S.A.
338:  96, 4482-4487.
339: \bibitem{19}Moriyama, E.N. and Powell, J.R. (1998) Nucleic Acids Res. 
340: 26, 3188-3193.
341: \bibitem{20}Urrutia, A.O. and Hurst, L.D. (2001) Genetics 159, 1191-1199.
342: 
343: \end{thebibliography}
344: 
345: 
346: \end{document}
347: