1: %\documentclass[twocolumn,prl,showpacs]{revtex4}

2: \documentclass[preprint]{revtex4}

3: \usepackage{graphicx,epsfig,amssymb}

4:

5:

6: \begin{document}

7: %\draft

8: %\preprint{}

9:

10:

11: \title{Human housekeeping genes are compact}

12:

13: \author{Eli Eisenberg and Erez Y. Levanon}

14: \affiliation{Compugen Ltd., 72 Pinchas Rosen Street, Tel Aviv 69512, Israel}

15:

16: \begin{abstract}

17: We identify a set of 575 human genes that are

18: expressed in all conditions tested in a publicly available

19: database of microarray results. Based on this common

20: occurrence, the set is expected to be rich in ``housekeeping''

21: genes, showing constitutive expression in all tissues.

22: We compare selected aspects of their genomic structure

23: with a set of background genes. We find that the

24: introns, untranslated regions and coding sequences

25: of the housekeeping genes are shorter, indicating a

26: selection for compactness in these genes.

27: \end{abstract}

28:

29: \maketitle

30:

31: The amazing diversity of the human body stems from the

32: different expression patterns of genes in different tissues.

33: Although most genes show constitutive expression in only

34: a subset of tissues, some gene products are required for the

35: maintenance of the basal cellular function and are

36: constitutively found in all human cells. These genes are

37: called housekeeping genes (HK genes) \cite{1}. HK genes can

38: be used to calibrate measurements of gene expression \cite{2}.

39: They might also help to define the minimal gene complement

40: needed for a human cell \cite{1}. Several attempts have been

41: made recently to define the complete set of HK genes \cite{3,4}.

42:

43: Microarrays are often used to identify sets of genes that

44: are expressed either ubiquitously or in specific tissues or

45: conditions. However, the technique is technically demanding

46: and prone to artifacts, so independent evidence is often

47: required to confirm the results. In principle, identifying

48: the set of HK genes using microarray data is straightforward;

49: one need only look for genes that are expressed in all

50: tissues and all experimental conditions. Employing such

51: an approach has so far resulted in two lists of HK genes

52: \cite{3,4}. However, problems in probe design, measurement

53: noise and other artifacts introduce inevitable errors in

54: such lists. Because a northern blot experiment for each

55: gene in each tissue is impractical, an independent test is

56: needed to validate any list of HK genes. Here, we report a

57: validation test that uses a recently discovered property of

58: highly expressed genes.

59:

The transcription process is both slow and costly:
transcribing a single nucleotide takes approximately
50 milliseconds \cite{5,6} and two ATP molecules \cite{7}.
This might be expected to provide selective pressure to make
genes as short as functionally possible. The more transcript
copies of a gene the organism requires, the stronger this pressure
should be. The first demonstration of this principle \cite{8}
showed that genes with a large number of expressed
sequence tags (ESTs) in public libraries (and hence the most
abundant mRNAs) have a significantly shorter average intron
length than those with fewer ESTs.
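
As a rough illustration of the scale of this cost, the two per-nucleotide
figures quoted above can be applied to a transcribed region of $L$ nucleotides.
This is only a back-of-the-envelope sketch: it assumes a constant elongation
rate and a fixed ATP cost per nucleotide, the symbols $t$, $E$ and $L$ are
introduced here for illustration, and the two lengths used are the mean total
intron lengths of HK and non-HK genes reported in Table 1.
\begin{equation}
t(L)\approx 0.05\ \mathrm{s}\times L,\qquad E(L)\approx 2L\ \mathrm{ATP}.
\end{equation}
\begin{equation}
\begin{array}{lll}
L=21{,}050: & t\approx 1053\ \mathrm{s}\approx 17.5\ \mathrm{min}, & E\approx 42{,}100\ \mathrm{ATP},\\
L=53{,}418: & t\approx 2671\ \mathrm{s}\approx 44.5\ \mathrm{min}, & E\approx 106{,}836\ \mathrm{ATP}.
\end{array}
\end{equation}
Under these assumptions, the intronic portion of an average HK transcript
would take roughly 27 minutes less to transcribe, and cost roughly 65,000 fewer
ATP molecules per transcript, than that of an average non-HK gene.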

71:

Here, an implication of this principle is used to validate
a set of HK genes. The HK genes, which are transcribed in
all somatic cells and under all circumstances, are by
nature highly expressed, and therefore should be selected
to have shorter introns. We used a recently published
database of microarray experiments \cite{9} to identify a set of
HK genes. As a further validation step, we checked the
Gene Ontology (GO) annotation of these genes. We then
compared the structure of the HK genes with that of all other
genes and found that not only the introns but all parts of the HK
genes are, on average, shorter than those of other
genes. In particular, the untranslated regions and the
translated proteins are also shorter in the HK genes.

85:

86: \section{Assignment of housekeeping genes}

87:

88: \begin{figure}

89: \includegraphics[width=2.5in]{TIGfig1a.eps}

\caption{Distribution of the 7500 RefSeq genes represented on
the microarray by the number of tissues in which they are expressed.
Each bin gives the number of genes
expressed in $M$ out of 47 different tissues. The $M=47$ bin corresponds to
the housekeeping genes, which are expressed in all tissues.}

95: \end{figure}

96:

A recently published database \cite{9} provides microarray
expression data for the Affymetrix U95A chip, which contains
12,600 probes, hybridized to 101 different samples
from 47 different human tissues and cell lines. These
samples are mainly from the normal human physiological
state, and therefore this dataset provides a description of
the normal mammalian transcriptome.

104:

We calculated the distribution of the number of different
tissues in which a gene is expressed. Discarding probes for
which the associated gene was not represented in the
RefSeq database \cite{10}, and unifying all probes measuring
the same gene (ignoring the potential differences among
splice variants), yielded probes representing 7500 human
genes. Experiments measuring replicates of the same
biological condition were averaged to reduce the measurement
noise, resulting in 47 data points per probe. We
considered a probe to be expressed in a given
condition if its average reading was above a cutoff
value. The results were not sensitive to the exact cutoff
value, and we chose 200 standard Affymetrix average-difference
units, which is considered a conservative cutoff
for determining gene presence \cite{9}. This value is also the
trimmed average expression level in each tissue under
the standard Affymetrix normalization
procedure \cite{11,12}. Thus, our HK genes are expressed in all
tissues at an above-average level.
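
The selection rule described above can be written compactly. The following is
only a notational sketch: the symbols $\bar{x}_{g,t}$, $c$, $M_g$ and
$\mathcal{H}$ are introduced here for illustration and are not part of the
original analysis. Let $\bar{x}_{g,t}$ denote the replicate-averaged reading of
gene $g$ in tissue $t$, in Affymetrix average-difference units, and let $c=200$
be the cutoff. The number of tissues in which gene $g$ is called expressed is
\begin{equation}
M_g=\sum_{t=1}^{47}\mathbf{1}\left[\bar{x}_{g,t}>c\right],
\end{equation}
and the candidate housekeeping set consists of the genes called expressed in
every tissue,
\begin{equation}
\mathcal{H}=\left\{g: M_g=47\right\}.
\end{equation}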

124:

A histogram (Fig.~1) of the number of genes expressed
in exactly $M$ of the 47 tissues shows a clear tendency for
the frequency to decrease as $M$ increases. However, a
substantial number of genes (575) belong to the class of
genes that are expressed in all tissues. Because their
number is far greater than expected from the general
trend described above, we assumed this class to be rich in
HK genes, and considered it to be the set of HK genes.

133:

134: It is noteworthy that the genes in our HK list tend to

135: have an average expression significantly higher than other

136: genes; the geometric mean expression of our HK genes is

137: 1200 in Affymetrix average difference units, whereas that

138: of other genes is 150. The difference cannot be accounted

139: for by the cutoff used to define the HK genes, and is not a

140: result of a bias due to inclusion of genes expressed in a few

141: tissues only (data not shown).

142:

Two additional tests were conducted to validate this set.
First, a study of the GO annotation \cite{13} of these genes
revealed that the set is rich in metabolic proteins (24\%) and
RNA-interacting proteins (19\%, mostly ribosomal proteins).
Second, we compiled a list of 18 well-established
HK genes commonly used for quantitative PCR calibration
\cite{14,15}, and checked our list against it. We found 13 of
the 18 genes in our list; the other five were not
represented on the microarray (see the table in the Supplementary
Information at http://www.compugen.co.il/supp\_info/Housekeeping\_genes.html).

153:

154: \section{Length analysis of HK genes}

155:

156: \begin{figure}[t]

157: \includegraphics[width=2.5in]{TIGfig1b.eps}

\caption{A histogram of the total length of introns. Green bars, HK genes;
blue bars, non-HK genes.}

160: \end{figure}

161:

162:

163: \begin{table}[b]

164:

\caption{{\bf Human housekeeping genes are compact.}
Comparison of the structure of housekeeping (HK)
genes and non-HK genes. For each quantity, the first
line gives the average value $\pm$ s.e.m.\ and the second line gives the median.
For the average intron and exon lengths,
all introns and exons belonging to the relevant set were
included; their number appears in parentheses. The P-value was calculated
using the Mann--Whitney test. UTR, untranslated region.
}

174:

175: \begin{tabular}{llll}

176: & & &\\

177: & {\bf HK genes (n=532)}&{\bf non-HK (n=5404)}& {\bf P-value}\\

178: Average intron length&$2573\pm 145$(n=4353)\ \ \ \ \ \ &$5025\pm 71$(n=57447)\ \ \ \ \ \ &$4\times 10^{-130}$\\

179: &672&1365& \\

180: Total intron length&$21050\pm 1781$&$53418\pm 1425$&$7\times 10^{-28}$\\

181: &9293&20804& \\

182: Average exon length&$212\pm 5$(n=4885)&$240\pm 2$(n=62851)&$9\times 10^{-5}$\\

183: &672&1365& \\

184: 5' UTR length&$135\pm   8$        &$ 173\pm  3$         &$4\times 10^{-  7}$\\

185: & 79& 106& \\

186:     3' UTR length     &$599 \pm  30$        &$846 \pm 13$         &$3\times 10^{-13 }$\\

187: &333& 552& \\

188: Coding sequence length&$1211\pm  44$        &$1770\pm 26$         &$3\times 10^{-26 }$\\

189: &928&1322& \\

190: Number of introns      &$8.2 \pm 0.3$        &$10.6\pm 0.2$         &$6\times 10^{-7  }$\\

191: &6  &   8& \\

192: Intron bps per coding bp\ \ \ \ \ \ \ \ & $20  \pm   2$        &$31.8\pm 0.8$         &$2\times 10^{-11 }$\\

193: &9.9&15.6&

194: \end{tabular}

195:

196: \end {table}
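
For reference, the two statistics named in the caption of Table 1 have the
following standard textbook definitions. This is a generic sketch: the symbols
$s$, $n$, $x_i$, $y_j$, $n_1$, $n_2$, $U$ and $z$ are introduced here, and the
exact variant of the test used in the analysis (for example, the treatment of
ties) is not stated in the text. For a sample of size $n$ with sample standard
deviation $s$,
\begin{equation}
\mathrm{s.e.m.}=\frac{s}{\sqrt{n}}.
\end{equation}
For two samples $x_1,\ldots,x_{n_1}$ and $y_1,\ldots,y_{n_2}$, the Mann--Whitney
statistic counts the pairs in which the first sample exceeds the second,
\begin{equation}
U=\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\left(\mathbf{1}\left[x_i>y_j\right]
+\frac{1}{2}\,\mathbf{1}\left[x_i=y_j\right]\right),
\end{equation}
and for large samples without ties it is approximately normal under the null
hypothesis, with
\begin{equation}
z=\frac{U-n_1n_2/2}{\sqrt{n_1n_2(n_1+n_2+1)/12}}.
\end{equation}
For the gene-level rows of Table 1, $n_1=532$ and $n_2=5404$, so the null
expectation of $U$ is $n_1n_2/2=1{,}437{,}464$.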

197:

Table 1 compares the lengths of various parts of the HK
genes and the background genes. The alignment data were
taken from the UCSC genome browser (http://genome.ucsc.edu)
\cite{16}. We excluded 322 genes that do not have a
unique alignment, as well as 1242 genes that were not
expressed in any tissue (to avoid potential problems
caused by defective probes). This left 532 HK genes and
5404 non-HK genes. The histograms in Figs.~2--4 compare
HK genes with the other genes by total intron length, 5'
UTR length and coding sequence length. Remarkably,
there was a statistically significant difference between HK
and non-HK genes in all aspects of gene structure. Average
intron length is shorter for the HK genes than for the
background genes (2573 bp versus 5025 bp, respectively);
total intron length is shorter (21,050 bp versus 53,418 bp);
average exon length is shorter (212 bp versus 240 bp);
average lengths of both the 5' and 3' untranslated regions
(UTRs) are shorter (5': 135 bp versus 173 bp; 3': 599 bp
versus 846 bp); and, most notably, the translated proteins
are shorter as well (403 amino acids versus 590 amino
acids). Accordingly, the number of intron bp per
coding bp is lower for the HK genes
(20 versus 32). We studied the structure of each gene as a
function of the number of tissues in which it is expressed and
verified that the results are not due to the non-HK
set being biased by tissue-specific genes (data not shown).

224:

225: \begin{figure}[t]

226: \includegraphics[width=2.5in]{TIGfig1c.eps}

\caption{A histogram of the length of the 5' untranslated regions
(UTRs). Green bars, HK genes; blue bars, non-HK genes.}

229: \end{figure}

230:

The pronounced statistical characteristics of the HK
gene set further support its assignment as a distinct set.
Our findings confirm and extend previous research
showing that the introns of highly expressed genes are
shorter \cite{8}. As mentioned above, the expression
levels of the HK genes are high, and the fact that they have to be expressed
in all cells at all times makes them even more costly to
transcribe. Previously \cite{8}, the high abundance of a
gene in EST libraries was taken as an indication that the gene is
highly expressed in the human body. It was pointed out \cite{8},
however, that this method is prone to bias due to the
inclusion of normalized and tumor libraries and the overrepresentation
of certain tissues. Our approach overcomes
this difficulty and confirms the previous result. Moreover,
we find here that UTRs and even the encoded proteins are
shorter for the HK genes. The magnitude of the difference
is greater for the introns than for the exons and proteins
(Table 1), which makes sense because the coding sequences
and the UTRs are less susceptible to change.
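
The relative size of these differences can be read off Table 1 by taking, for
each quantity, the ratio of the HK mean to the non-HK mean. The ratios below
are simple arithmetic on the tabulated means, rounded to two decimals; they
are offered as an illustration and were not part of the original analysis.
\begin{equation}
\begin{array}{lcl}
\mbox{Total intron length} & : & 21{,}050/53{,}418\approx 0.39\\
\mbox{Average intron length} & : & 2573/5025\approx 0.51\\
\mbox{Coding sequence length} & : & 1211/1770\approx 0.68\\
\mbox{3' UTR length} & : & 599/846\approx 0.71\\
\mbox{5' UTR length} & : & 135/173\approx 0.78\\
\mbox{Average exon length} & : & 212/240\approx 0.88
\end{array}
\end{equation}
On this measure, intron content in HK genes is reduced to roughly half or less
of the background value, whereas coding sequences and UTRs are reduced by
about 20--30\% and average exon length by about 12\%.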

250:

251: It should be mentioned that intronless genes were

252: included in our analysis after verifying that their inclusion

253: or exclusion had no effect on the results. It also must be

254: noted that the UTRs are not always fully sequenced, and

255: thus their actual lengths might be longer. This bias was

256: found to have no effect on the length of the coding

257: sequences, and in any case the effect would be the same

258: for both HK and non-HK genes.

259:

260: It has been noted that codon usage bias in nonmammalian

261: organisms is correlated with the expression

262: level and with the gene length \cite{17,18,19}. These results led to

263: the conjecture of selective pressure on highly expressed

264: genes resulting in shorter proteins \cite{19}. However, no

265: evidence for this selection was found \cite{18}, possibly because

266: of a lack of high quality databases for these organisms.

Recent work has suggested that there is no selection for
codon usage bias in humans \cite{20}, and thus our results
demonstrate that the expression--length correlation is
not related to the expression--codon-bias correlation.

271:

272: \begin{figure}

273: \includegraphics[width=2.5in]{TIGfig1d.eps}

\caption{A histogram of the length of the coding region. Green bars, HK genes;
blue bars, non-HK genes.}

276: \end{figure}

277:

278: It could be argued that selection towards shorter genes

279: should have eliminated the introns in highly expressed

280: genes. However, it is known that introns do have

281: important roles, such as splicing regulation. Therefore,

282: there is a balance between the advantageous contribution

283: of the introns and the selective pressure for shortening.

284:

285: Finally, when we compared our results with two (largely

286: overlapping) published sets of HK genes, we found that

287: roughly half of the genes in the intersection of those sets

288: were present in our set. We used the genomic structure to

289: test the remaining genes, and found a statistically

290: significant difference between them and our HK gene

291: set. The differences between our results and those of

292: earlier studies \cite{3,4} could be due to the fact that the

293: database we used was based on more advanced chip

294: technology and included many more different tissues,

295: giving it more discriminative power to identify HK genes.

296:

297: In conclusion, we have identified a set of HK genes. The

298: set is publicly available at

299: http://www.compugen.co.il/supp\_info/Housekeeping\_genes.html

300: and can be used for

301: calibration of microarrays, toxicity evaluation and quantitative

302: PCR experiments. Furthermore, we show that

303: HK genes have shorter introns, UTRs and coding

304: sequences, attesting to the strong selection for compactness

305: in these genes.

306:

307: \begin{acknowledgments}

308: We thank Andrew Su for helpful discussion and for providing us with the

309: RefSeq mapping. Gady Cojocaru and Rotem Sorek are acknowledged for

310: comments on the manuscript and insightful discussion.

311: \end{acknowledgments}

312:

313:

314:

315:

316: \begin{thebibliography}{10}

317:

318: \bibitem{1}Butte, A.J. et al. (2001) Physiol. Genomics 7, 95-96.

319: \bibitem{2}Gibson, U.E. et al. (1996) Genome Res. 6, 995-1001.

320: \bibitem{3}Warrington, J.A. et al. (2000) Physiol. Genomics 2, 143-147.

321: \bibitem{4}Hsiao, L.L. et al. (2001) Physiol. Genomics 7, 97-104.

\bibitem{5}Ucker, D.S. and Yamamoto, K.R. (1984)
J. Biol. Chem. 259, 7416-7420.

324: \bibitem{6}Izban, M.G. and Luse, D.S. (1992) J. Biol. Chem. 267,

325: 13647-13655.

326: \bibitem{7}Lehninger, A.L. et al. (1982) Biochemistry, 615-644.

327: \bibitem{8}Castillo-Davis, C.I. et al. (2002) Nat. Genet. 31, 415-418.

328: \bibitem{9}Su, A.I. et al. (2002) Proc. Natl. Acad. Sci. U.S.A. 99, 4465-4470.

329: \bibitem{10}Pruitt, K.D. et al. (2000) Trends Genet. 16, 44-47.

330: \bibitem{11}Lockhart, D.J. et al. (1996) Nat. Biotechnol. 14, 1675-1680.

331: \bibitem{12}Wodicka, L. et al. (1997) Nat. Biotechnol. 15, 1359-1367.

\bibitem{13}Gene Ontology Consortium (2001) Genome Res. 11, 1425-1433.

333: \bibitem{14}Hamalainen, H.K. et al. (2001) Anal. Biochem. 299, 63-70.

334: \bibitem{15}Lee, P.D. (2002) Genome Res. 12, 292-297.

335: \bibitem{16}Karolchik, D. et al. (2003) Nucleic Acids Res. 31, 51-54.

336: \bibitem{17}Akashi, H. (2001) Curr. Opin. Genet. Dev. 11, 660-666.

337: \bibitem{18}Duret, L. and Mouchiroud, D. (1999) Proc. Natl. Acad. Sci. U.S.A.

338:  96, 4482-4487.

339: \bibitem{19}Moriyama, E.N. and Powell, J.R. (1998) Nucleic Acids Res.

340: 26, 3188-3193.

341: \bibitem{20}Urrutia, A.O. and Hurst, L.D. (2001) Genetics 159, 1191-1199.

342:

343: \end{thebibliography}

344:

345:

346: \end{document}

347: