1: %% LyX 1.3 created this file.  For more info, see http://www.lyx.org/.

2: %% Do not edit unless you really know what you are doing.

3: \documentclass[12pt,english]{article}

4: \pdfoutput=1

5: \usepackage[T1]{fontenc}

6: \usepackage[latin1]{inputenc}

7: \usepackage{array}

8: \usepackage{graphicx}

9: \usepackage{setspace}

10: \onehalfspacing

11: \usepackage[authoryear]{natbib}

12:

13: \makeatletter

14:

15: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.

16: %% Bold symbol macro for standard LaTeX users

17: \providecommand{\boldsymbol}[1]{\mbox{\boldmath $#1$}}

18:

19: %% Because html converters don't know tabularnewline

20: \providecommand{\tabularnewline}{\\}

21:

22: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.

23: \addtolength{\oddsidemargin}{-50pt}

24: \addtolength{\voffset}{-50pt}

25: \addtolength{\textwidth}{100pt}

26: \addtolength{\textheight}{100pt}

27:

28:

29: \usepackage{xspace}

30: \newcommand{\phipsi}{$\phi,\psi$\xspace}

31: \newcommand{\eg}{e.g.\xspace}

32: \newcommand{\etc}{etc.\xspace}

33: \newcommand{\dfs}{\textsc{dfs}~}

34: \newcommand{\cpp}{\textsc{c++}\xspace}

35: \newcommand{\python}{\textsc{python}\xspace}

36: \newcommand{\swig}{\textsc{swig}\xspace}

37: \newcommand{\pdb}{\textsc{PDB}\xspace}

38: \newcommand{\gabb}{\textsc{GABB}\xspace}

39: \newcommand{\scwrl}{\textsc{scwrl}\xspace}

40: \newcommand{\rapper}{\textsc{rapper}\xspace}

41: \newcommand{\rappertk}{\textit{Rapper}\textbf{tk}\xspace}

42: \newcommand{\probe}{\textsc{probe}\xspace}

43: \newcommand{\ie}{i.e.\xspace}

44: \newcommand{\ca}{$C_\alpha$\xspace}

45: \newcommand{\chir}{$\chi$}

46: \newcommand{\CA}[1]{${C_\alpha}^{#1}$~}

47: \newcommand{\CB}[1]{${C_\beta}^{#1}$~}

48: \newcommand{\C}[1]{$C^{#1}$~}

49: \newcommand{\N}[1]{$N^{#1}$\xspace}

50: \newcommand{\cO}[1]{$O^{#1}$\xspace}

51:

52:

53: \newcommand{\Ang}[1]{${#1}$\AA\xspace}

54:

55: \newcommand{\degr}[1]{${#1}^o$\xspace}

56:

57: \date{}

58:

59: \usepackage{babel}

60: \makeatother

61: \begin{document}

62:

63: \title{Identification of specificity determining residues in enzymes using

64: environment specific substitution tables}

65:

66:

67: \author{Swanand Gore and Tom Blundell\\

68: \{swanand,tom\}@cryst.bioc.cam.ac.uk\\

69: Department of Biochemistry, University of Cambridge\\

70: Cambridge CB2 1GA England}

71:

72: \maketitle

73: \begin{abstract}

74: Environment specific substitution tables have been used effectively

75: for distinguishing structural and functional constraints on proteins

76: and thereby identifying their active sites \citep{distinguish_str_func_restr}.

77: This work explores whether a similar approach can be used to identify

78: specificity determining residues (SDRs) responsible for cofactor dependence,

79: substrate specificity or subtle catalytic variations. We combine structure-sequence

80: information and functional annotation from various data sources to

81: create structural alignments for homologous enzymes and functional

82: partitions therein. We develop a scoring procedure to predict SDRs

83: and assess their accuracy using information from bound specific ligands

84: and published literature.\newpage

85:

86: \end{abstract}

87:

88: \section{Introduction}

89:

90: Enzymes are critical to cellular machinery. Enzymes are believed to

91: have developed different specificities following gene duplication

92: events that ease the evolutionary pressure on copies and allow exploration

93: of novel avenues to greater organismal fitness. Each copy then develops

94: its own niche, characterized by expression and localization, catalytic

95: mechanism, substrate specificity, cofactor dependence and catalysis

96: products. Such paralogous enzymes should have an evolutionary imprint

97: corresponding to their specific niche, in addition to maintenance

98: of structural fold. Thus evolutionary analysis of available structural

99: and sequence data should enable identification of key residues responsible

100: for specificity of various kinds. Enzyme specificity can be estimated

101: with functional assays without structure determination, but identification

102: of SDRs (specificity determining residues) remains difficult. While

103: ENZYME \citep{ENZYME}, a database of enzyme sequences with detailed

104: functional annotation, exists, there is no such database of SDRs.

105: Time, cost and technical limitations slow down structure determination

106: and even when structure is known, it is not trivial to identify the

107: residues important for binding cofactors and substrates. Hence it

108: is important to be able to identify such residues computationally.

109: Reliable detection of such residues will aid in deciding whether a

110: SNP is deleterious or neutral and suggest mutation studies. Function

111: assignment to sequence could be done at a finer level, e.g. by verifying

112: that SDRs necessary for certain substrate are present. Computational

113: SDR identification has received a lot of attention and several methods

114: have been proposed. Evolutionary trace (ET) is one of the most important

115: methods \citep{evoltraceStrClust,EThybridMethods}. It

116: builds a phylogenetic tree based on sequence comparisons, such that

117: branch lengths are indicative of evolutionary divergence. Functional

118: subgroups consist of sequences in subtrees determined from this tree

119: using a divergence cutoff. Residues common to a subtree are considered

120: specificity-conferring rather than the ones common to entire tree.

121: Spatial cluster identification can be used with ET to reduce the number

122: of false positives. Inferring phylogeny correctly remains the main

123: cause of concern in this approach, hence attempts have been made to

124: use existing annotation with various statistical techniques. Another

125: important direction is to use spatial proximity of residues.

126:

127: The cornerstone of our approach is that structural environment influences

128: residue substitution patterns, as illustrated by \citet{earlyESST} and

129: later used effectively for structure-sequence alignment and fold recognition

130: \citep{fugue}. The structural environment of a residue is described

131: in terms of secondary structure, solvent accessibility, sidechain-sidechain

132: and sidechain-mainchain hydrogen bonding. Residue substitution tables

133: derived from a set of high quality sequence-structure alignments represent

134: the expected substitution rate in a structural environment. Unexpected

135: conservation of a residue is indicative of functional restraint acting

136: on it. The advantage of using environment specific substitution tables (ESSTs) is that the structurally conserved

137: residues are masked, which is why active sites of homologous enzymes

138: can be identified reliably with this approach. This approach has been

139: extended in the present work by using functional annotation information.

140:

141: A set of homologous enzymes is generally a union of smaller functionally

142: specific subsets, e.g. substrate-specific subsets in serine proteinases

143: (trypsin, chymotrypsin etc.), cofactor-specific subsets in ferredoxin

144: reductases (NAD and NADP specific) and so on. In multiple sequence

145: alignment of a homologous protein family, SDRs generally appear as

146: differentially conserved subcolumns. But not all such subcolumns

147: are SDRs. Our hypothesis is that SDRs can be identified by combining

148: differential conservation with ESST-based detection of functional

149: restraint.

150:

151:

152: \section{Families, functional partitions and profiles}

153:

154: In order to test our hypothesis, we need to construct a dataset of

155: homologous enzyme families with reliable functional partitions in

156: them. While SCOP classification can be used in a straightforward way

157: for making families, identifying functionally specific subsets is

158: not a trivial task. Some automated approaches to detect functional

159: shift, e.g. \citet{funshiftakker}, exist to infer such partitions

160: but manual annotation remains the most reliable. Additionally, protein

161: function is not a precise and quantifiable entity. This restricted

162: our study to enzymes, which are the most well studied and well

163: annotated class of proteins. Enzyme function is fairly well defined

164: and well classified according to hierarchical Enzyme Classification

165: scheme (EC). We use the mapping between SCOP domains and EC numbers

166: \citep{scopec} to make EC-specific subgroups within a SCOP domain

167: family. We generate profiles (multiple structure-sequence alignments)

168: for SCOP families and functional partitions. Sequence homologs for

169: structural families were found using PSIBLAST \citep{psiblast}

170: on a nonredundant sequence database, whereas function-specific partitions

171: were enriched using PSIBLAST searches on the ENZYME database \citep{ENZYME}.

172: A PSIBLAST hit on the ENZYME database is retained only if the EC number

173: of the hit matches that of the query. All PSIBLAST searches were run with 5 rounds

174: and an e-value cutoff of 0.01; hits shorter than 75\% of the query length were ignored.

175: All structure-sequence alignments were carried out with fugueseq \citep{fugue},

176: which has been shown to improve alignment quality over PSIBLAST. This

177: process is summarized in Fig.~\ref{workflow}.

178:
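The hit-retention rules described above can be restated compactly. This is only a notational sketch: the symbols $q$ (query), $h$ (hit), $E(h)$ (e-value of the hit), $\ell(\cdot)$ (sequence length) and $\mathrm{EC}(\cdot)$ (EC number) are introduced here for illustration, and we assume that the 75\% criterion refers to the length of the hit sequence relative to the query. For searches against the nonredundant database, the retained hits would be
\[
H_{\mathrm{fam}}(q)=\{h\,:\,E(h)\leq0.01,\ \ell(h)\geq0.75\,\ell(q)\},
\]
whereas for searches against ENZYME the EC numbers must additionally agree:
\[
H_{\mathrm{EC}}(q)=\{h\,:\,E(h)\leq0.01,\ \ell(h)\geq0.75\,\ell(q),\ \mathrm{EC}(h)=\mathrm{EC}(q)\}.
\]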

179: %

180: \begin{figure}

181:

182: \caption{Workflow}

183:

184: \begin{center}\includegraphics[%

185:   width=150mm]{workflow.pdf}\end{center}

186:

187: \label{workflow}

188: \end{figure}

189:

190:

191: Another constraint on the choice of dataset comes from the need for

192: sufficient functional diversity in a SCOP domain family. In its absence,

193: the contrast between the domain family and EC-specific subgroup within

194: it might not be detectable. Hence we chose the SCOP families with

195: at least two different EC annotations.

196:

197: To be able to test the hypothesis quantitatively, a gold standard

198: set of SDRs for every enzyme is needed. But SDRs are generally a topic

199: of lively debate among researchers, partly due to the infeasibility

200: of performing all necessary mutation studies. Thus there is no such

201: dataset to our knowledge. Hence we use information on bound ligands

202: and close-by residues to assess the hypothesis. Due to this, the dataset

203: gets restricted to only those cases where at least one EC-specific

204: domain group has a relevant ligand bound. A relevant ligand is the

205: one unique to the reaction carried out by that EC-group among all

206: possible reactions in that domain family. For example, in SCOP family

207: c.1.10.4 there are two functional subgroups:

208:

209: 3-deoxy-8-phosphooctulonate synthase (EC 2.5.1.55) : Phosphoenolpyruvate

210: + D-arabinose 5-phosphate + H(2)O = 2-dehydro-3-deoxy-D-octonate 8-phosphate

211: + phosphate

212:

213: 3-deoxy-7-phosphoheptulonate synthase (EC 2.5.1.54) : Phosphoenolpyruvate

214: + D-erythrose 4-phosphate + H(2)O = 3-deoxy-D-arabino-hept-2-ulosonate

215: 7-phosphate + phosphate

216:

217: Here D-arabinose 5-phosphate is unique to EC 2.5.1.55 and is present

218: in domain 1fxqA as A5P. Hence it is taken as an indicator of SDR locations

219: and not phosphoenolpyruvate, which is common to both reactions.

220: We sometimes also use products as such indicators. A ligand is considered

221: relevant if its name from the PDB file (HETNAM, HETSYN records) matches

222: its name in the reaction or PDBsum \citep{pdbsum} finds it sufficiently

223: similar to the ideal ligand molecule. Our final dataset consists of 97

224: examples drawn from 68 families. Very few SDR identification studies

225: are carried out with this many examples.

226:

227:

228: \section{Profiles and substitution patterns}

229:

230: Structural and sequence information in a multiple structure-sequence

231: alignment (MSSA) can be misleading if dominated by very close homologs,

232: hence each MSSA was filtered with a 90\% sequence identity cutoff to avoid redundancy.

233:

234: The observed substitution pattern for a column in a profile MSSA

235: was calculated after down-weighting contributions

236: from similar sequences ($>60\%$ sequence identity). Gaps were ignored

237: while calculating the observed substitution pattern, but the ratio

238: of gaps to amino acids in a column was computed. Columns with high

239: gap content are generally not functional, hence gap content was used

240: as a filtering criterion as described later. Observed substitution

241: patterns were normalized, and sequence entropy was calculated to

242: get a measure of variability in the column as $-\sum_{i=1}^{20}f_{i}\log(f_{i})$,

243: where $f_{i}$ is the fraction of the $i^{th}$ amino acid in the distribution.

244:
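As a worked illustration of the entropy measure, consider four hypothetical columns. The frequencies are invented for illustration, the natural logarithm is assumed because the base is not stated above, and amino acids with $f_{i}=0$ are taken to contribute zero:
\begin{eqnarray*}
f=(1) & \Rightarrow & E=-1\cdot\log1=0\\
f=(0.5,0.5) & \Rightarrow & E=-2\cdot0.5\log0.5=\log2\approx0.69\\
f=(0.8,0.1,0.1) & \Rightarrow & E=-(0.8\log0.8+2\cdot0.1\log0.1)\approx0.64\\
f_{i}=1/20\ \mbox{for all}\ i & \Rightarrow & E=\log20\approx3.00
\end{eqnarray*}
Under this assumption the entropy of a column lies between $0$ (fully conserved) and $\log20$ (all amino acids equally frequent).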

245: Expected substitution patterns for a column were calculated using

246: environment specific substitution probability tables derived from

247: high quality multiple structure alignments from 371 families \citep{fugue}.

248: Substitution probabilities from every structure were averaged to get

249: expected substitution probabilities for each column in MSSA. Again,

250: sequence-based clustering was used to avoid expected substitution

251: pattern getting dominated by very similar structures.

252:
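The averaging step can be written out explicitly. As a sketch in notation of our own choosing (not taken from the ESST literature), let a column contain residues from $S$ structures, let $a_{s}$ be the residue of structure $s$ in that column, $v_{s}$ its structural environment, and $P(i\mid a_{s},v_{s})$ the tabulated probability of finding amino acid $i$ at a position occupied by $a_{s}$ in environment $v_{s}$. With uniform weights, a simplification that ignores the sequence-based clustering mentioned above, the expected pattern is
\[
e_{i}=\frac{1}{S}\sum_{s=1}^{S}P(i\mid a_{s},v_{s}),\qquad\sum_{i=1}^{20}e_{i}=1,
\]
where the normalization follows because each tabulated distribution sums to one over the 20 amino acids.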

253: Functional restraint is calculated as the city-block distance between

254: normalized observed and expected substitution patterns ($\sum_{i=1}^{20}|o_{i}-e_{i}|$,

255: $o_{i}$ being the observed fraction of the $i^{th}$ amino acid and $e_{i}$

256: being the fraction of times it is expected to occur). Thus, for both

257: MSSAs (whole family and EC-specific) we have the following quantities

258: : functional restraint ($famF,ecF$), gap content ($famG,ecG$) and

259: sequence entropy ($famE,ecE$). Moreover, for each MSSA, the number of

260: sequences $<80\%$ identical to each other was taken as an indicator

261: of evolutionary information available in it.

262:
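For illustration, consider a hypothetical column (numbers invented) in which Ser is observed with fraction $0.9$ and Thr with fraction $0.1$, while the expected pattern is $0.5$ Ser, $0.3$ Thr and $0.2$ Ala, all other amino acids being zero in both patterns. Taking the city-block distance as the sum of absolute differences, the functional restraint $F$ would be
\[
F=\sum_{i=1}^{20}|o_{i}-e_{i}|=|0.9-0.5|+|0.1-0.3|+|0-0.2|=0.4+0.2+0.2=0.8.
\]
Since both patterns are normalized, $F$ ranges from $0$ (observed pattern identical to the expected one) to $2$ (patterns with no amino acid in common), so a column that is conserved much more strongly than its environment predicts receives a high value.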

263:

264: \section{Benchmarking}

265:

266: In order to assess the differences in residues important for whole

267: family and EC partition, baseline predictions were made by choosing

268: top-ranking residues according to whole family functional constraint

269: from residues which are not highly gapped ($famG<0.5$). The number of

270: baseline and SDR predictions is the same whenever they are compared or

271: an overlap between them is computed. This helps in assessing whether

272: information in the EC-specific MSSA is distinct.

273:

274: The likelihood of a residue to be an SDR is presumably proportional

275: to its proximity to the specific ligand. Hence, to quantify the merit

276: of a prediction, we defined mean proximity as the mean separation

277: between predicted residues and the ligand. Mean relative proximity is

278: defined as the ratio of mean proximity to the mean separation between

279: all residues in the domain and the ligand. The distance between a residue

280: and the ligand is taken to be the shortest distance between residue sidechain

281: (mainchain for glycine) and ligand atoms. The smaller the mean relative

282: proximity, the better the prediction. Prediction quality will also depend

283: on the number of distinct homologous sequences available. In case

284: of multiple ligands close to a domain, a residue's proximity to the

285: ligand is calculated with respect to the closest ligand. The basis

286: for predicting a residue as an SDR is that it be sufficiently distinct between whole

287: family and EC-specific MSSAs. As \citet{funshiftakker} describe it,

288: an SDR should be a rate-shifted or conservation-shifted site. Additionally,

289: an SDR should be sufficiently functionally constrained from the ESST perspective

290: ($ecF$). For a residue with low entropy in the EC MSSA, if the change in

291: entropy $dE$ (family MSSA sequence entropy minus EC MSSA sequence entropy)

292: is high, it indicates that it could be an SDR. Since each MSSA will be

293: different in its variability, it is not advisable to use the same functional

294: constraint cutoff or entropy cutoff for all of them. This immediately

295: suggests two 2-step approaches: choose the top $N1$ residues with high

296: difference in sequence entropy between whole and EC MSSAs, then select

297: the top $N2$ according to functional constraint in the EC MSSA, and vice versa.

298: But there could be a third and more attractive approach that combines

299: functional constraint from EC MSSA and sequence entropy difference.

300: We pursue the third approach.

301:
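The two proximity measures used for benchmarking can be written as follows, reading mean proximity as the mean residue-ligand separation. The symbols are ours: $P$ is the set of predicted residues, $D$ the set of all residues in the domain, and $d(r)$ the residue-ligand distance as defined above (to the closest ligand when several are present).
\[
\mathit{MP}=\frac{1}{|P|}\sum_{r\in P}d(r),\qquad\mathit{MRP}=\frac{\mathit{MP}}{\frac{1}{|D|}\sum_{r\in D}d(r)}
\]
As a purely hypothetical example, a prediction with $\mathit{MP}=9$\,\AA{} in a domain whose residues lie on average $15$\,\AA{} from the ligand has $\mathit{MRP}=0.6$. A randomly chosen set of residues has an expected $\mathit{MRP}$ of $1$, so values well below $1$ indicate predictions concentrated near the ligand.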

302: We assume that the SDR score of a residue is a linear combination of its

303: functional constraint, entropy and change in entropy, given that the

304: residue passes certain quality checks ($ecF>0.5$, $ecG<0.5$, $ecE<1$,

305: $dE>0.5$):

306:

307: $SDRscore=ecF+a\cdot(famE-ecE)-b\cdot ecE$

308:

309: In order to optimize the parameters $a,b$ and test the optimal ones,

310: we created a high quality test set from our examples, consisting of

311: 23 examples drawn from SCOP families with at least 2 EC groups, each

312: with $>10$ distinct sequence homologs from the ENZYME database. Parameters

313: $a,b$ were varied from 0 to 5 in steps of $0.2$.

314: For each value of $a$ and $b$, SDR and baseline predictions

315: were made, each consisting of 10 residues. Note that baseline predictions

316: are not affected by values of $a,b$. Optimization can be done with

317: two objectives, either to minimize the mean proximity or to maximize

318: the number of close ($<$\Ang{6}) residues. $a,b$ values of $0.4,1.2$

319: minimize the former objective to \Ang{9.24} and yield $3.6$ close

320: residues per prediction, whereas $0,0.8$ maximize the latter to $4.08$

321: residues while yielding \Ang{9.36} for the former. Performance of

322: these two $a,b$ values on different sets of examples is shown in

323: Table~\ref{evolABperf}.

324:
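To make the scoring concrete, consider a hypothetical residue (values invented for illustration) with $ecF=1.2$, $ecG=0.1$, $famE=2.0$ and $ecE=0.4$, so that $dE=famE-ecE=1.6$. It passes all four quality checks ($ecF>0.5$, $ecG<0.5$, $ecE<1$, $dE>0.5$). For the two parameter pairs discussed above its score would be
\begin{eqnarray*}
(a,b)=(0.4,1.2) & : & SDRscore=1.2+0.4\cdot1.6-1.2\cdot0.4=1.36\\
(a,b)=(0,0.8) & : & SDRscore=1.2+0\cdot1.6-0.8\cdot0.4=0.88
\end{eqnarray*}
Presumably only the ranking of residues within a domain matters when the top 10 are selected, so scores obtained with different $(a,b)$ are not directly comparable. If both endpoints of the scan are included, each parameter takes $5/0.2+1=26$ values, giving $26\times26=676$ parameter combinations.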

325: %

326: \begin{table}

327:

328: \caption{Performance of the two candidate $(a,b)$ values for various levels of evolutionary information

329: available.}

330:

331: \begin{center}\begin{tabular}{|c|p{1in}|p{1in}|p{1in}|p{1in}|}

332: \hline

333: Criteria for&

334: \multicolumn{2}{c|}{Mean proximity}&

335: \multicolumn{2}{c|}{\#close (<\Ang{6}) residues}\tabularnewline

336: choice of examples&

337: (0,0.8)&

338: (0.4,1.2)&

339: (0,0.8)&

340: (0.4,1.2)\tabularnewline

341: \hline

342: \hline

343: >5 homologs&

344: 10.84&

345: 11.24&

346: 3.35&

347: 3.01\tabularnewline

348: (67 examples)&

349: &

350: &

351: &

352: \tabularnewline

353: \hline

354: >10 homologs&

355: 10.41&

356: 10.64&

357: 3.45&

358: 3.2\tabularnewline

359: (55 examples)&

360: &

361: &

362: &

363: \tabularnewline

364: \hline

365: >10 homologs, >1 EC&

366: 9.36&

367: 9.24&

368: 4.08&

369: 3.6\tabularnewline

370: (23 examples)&

371: &

372: &

373: &

374: \tabularnewline

375: \hline

376: \end{tabular}\end{center}

377:

378: \label{evolABperf}

379: \end{table}

380:

381:

382: This suggests that optimal $a,b$ parameters are $0,0.8$. It is surprising

383: that the value of $dE=famE-ecE$ carries no weight in the SDR

384: score. Perhaps this is due to the quality checks applied prior to

385: calculation of SDR scores, which demand $dE>0.5$.

386:

387: Fig.~\ref{proxDistrib} shows the distribution of mean proximity in

388: various sets derived according to the number of distinct homologs in ENZYME.

389: This shows that the quality of evolutionary information available has

390: a great impact on the quality of predictions.

391:

392: %

393: \begin{figure}

394:

395: \caption{Frequency of observing a certain mean proximity of SDR predictions

396: (binned in \Ang{1} bins) for different qualities of evolutionary

397: information available.}

398:

399: \begin{center}\includegraphics[%

400:   width=150mm]{proxDistrib.pdf}\end{center}

401:

402: \label{proxDistrib}

403: \end{figure}

404:

405:

406: Mean relative proximity indicates how far from random the prediction is.

407: Table \ref{meanRelProxTable} shows that mean relative proximity depends

408: on quality of evolutionary information and is far from random for

409: both SDR and baseline predictions.

410:

411: %

412: \begin{table}

413:

414: \caption{Mean relative proximity in various datasets made according to number

415: of available distinct homologs.}

416:

417: \begin{center}\begin{tabular}{|c|c|c|c|}

418: \hline

419: Dataset&

420: Mean Rel. Prox.&

421: Mean Rel. Prox.&

422: Frequency of\tabularnewline

423: &

424: (SDR)&

425: (baseline)&

426: MRP(SDR) $\leq$ MRP(baseline)\tabularnewline

427: \hline

428: >0 homologs&

429: 0.67&

430: 0.66&

431: 34\% (33/97)\tabularnewline

432: \hline

433: >5 homologs&

434: 0.57&

435: 0.66&

436: 60\% (40/67)\tabularnewline

437: \hline

438: >10 homologs&

439: 0.57&

440: 0.62&

441: 85\% (47/55)\tabularnewline

442: \hline

443: \end{tabular}\end{center}

444:

445: \label{meanRelProxTable}

446: \end{table}

447:

448:

449: The fraction of SDR predictions also present in baseline predictions is $15\%$ in

450: all $>0,>5,>10$ homolog classes, which suggests that SDR predictions

451: are fairly different from baseline. This also suggests that baseline

452: and SDR predictions are complementary to each other.

453:

454:

455: \section{Some examples}

456:

457: When quality sequence information is available, SDR predictions are

458: closer to the specific ligand than baseline predictions, which in turn

459: are closer than random. Here we compare our Top10 predictions with

460: information from literature for some examples.

461:

462:

463: \subsection{Aminotransferases}

464:

465: Aminotransferases or transaminases are important to amino acid biosynthesis

466: and unique due to their specificity to two substrates : a glutamate

467: and an amino-carrier. Our dataset contains two SCOP families (c.67.1.1

468: and c.67.1.4) that contain transaminases. Of those, we focus on SCOP

469: family c.67.1.1 which contains the functional categories aspartate

470: transaminase (AspAT, EC 2.6.1.1) and histidinol phosphate transaminase

471: (HspAT, EC 2.6.1.9). Other non-transaminase members of this family

472: include threonine aldolases (EC 4.1.2.5) and alliin lyase (EC 4.4.1.4).

473: When Top10 predictions were analyzed in 1gex, an HspAT, we found that

474: SDR predictions are very well clustered around the ligands PLP and

475: HSP, but 5 of the 10 predictions were shared with Top10 baseline predictions.

476: This overlap can be attributed to degrees of functional diversity

477: in the SCOP family, i.e. large entropy reduction in HspAT residues

478: could be due to their importance to general transaminase mechanism

479: (as opposed to aldolase mechanism) or for substrate specificity to

480: histidinol phosphate (as opposed to aspartate in AspATs). In order

481: to increase the number of distinct predictions, Top20 baseline and

482: SDR predictions were used. Fig.~\ref{figTransaminase} shows the predictions

483: for 1gexA, an HspAT from E. coli; 7 predictions are common. Catalytically

484: important residues \citep{Haruyama2001} Asn-157, Tyr-187, Lys-214

485: are identified as baseline, SDR and common respectively. Tyr-55, which

486: interacts with substrate of the other subunit, is predicted as SDR%

487: \footnote{This is confirmed from a similar prediction in 1gc4, an AspAT.%

488: }. Tyr-20, believed to be important for specificity, is not predicted

489: as such because it is conserved only 80\% of the time, whereas a similarly

490: placed Tyr-55 from the other subunit is much better conserved (98\% of the time)

491: and could be equally important for specificity. Ala-186, considered

492: important for restricting rotation of PLP's pyridine ring and thereby

493: contributing to strain essential for enzyme function, is predicted

494: as both SDR and baseline. Most other predicted SDRs lie close to the

495: substrate. Their location and AspAT counterparts suggest their role

496: in conferring specificity towards histidinol phosphate (see Table~\ref{transaminaseTable}).

497:

498: %

499: \begin{table}

500:

501: \caption{Residues with speculated roles \citep{Haruyama2001} in HspAT 1gex

502: and how well they were predicted. The aligned residues in other subfamilies

503: with transaminases are also shown.}

504:

505: \begin{center}\includegraphics[%

506:   width=150mm]{transaminaseTable.pdf}\end{center}

507:

508: \label{transaminaseTable}

509: \end{table}

510:

511:

512: %

513: \begin{figure}

514:

515: \caption{SDR (green) and functional residue (red) predictions for 1gex, an

516: HspAT. Residues predicted both as functional and specificity-conferring

517: are colored blue. Top left panel shows Top5 predictions, top right

518: panel shows Top10 predictions and bottom panel zooms in on the region

519: around ligand in the Top10 case.}

520:

521: \begin{center}\includegraphics[%

522:   width=150mm]{transaminaseFig.jpg}\end{center}

523:

524: \label{figTransaminase}

525: \end{figure}

526:

527:

528:

529: \subsection{Phosphoric monoester hydrolases}

530:

531: SCOP family e.7.1.1 in our dataset contains 4 classes of phosphoric

532: monoester hydrolases, 3'(2'),5'-bisphosphate nucleotidase (EC 3.1.3.7),

533: Fructose-bisphosphatase (EC 3.1.3.11), Inositolphosphate phosphatase

534: (EC 3.1.3.25) and Inositol-1,4-bisphosphate 1-phosphatase (EC 3.1.3.57).

535: Here we look at the SDR and baseline predictions for 1cnq, a member

536: of FBPase category. FBPases are of key importance to regulation of

537: gluconeogenic pathway and catalyze the hydrolysis of fructose 1,6-bisphosphate

538: to fructose 6-phosphate. They are metal dependent and are allosterically

539: controlled by AMP which triggers a conformational change and masks

540: the fructose active site. Fig.~\ref{figFBPase} shows the Top10 baseline

541: and SDR predictions, with an overlap of 2 residues in this case. The F6P

542: molecule around which most predictions are clustered lies in the active

543: site, whereas the other F6P molecule is located similarly to AMP (from

544: comparison with PDB 1yyz). Baseline predictions Tyr-279, Glu-280,

545: Tyr-244, Met-244 and common prediction Tyr-264 are within interacting

546: distance of F6P ligand in the active site. Most predicted SDRs form

547: the active site walls and differ between FBPase and IMPase (1awb)

548: : Arg-276 to His, Ser-96 to Gly, Ser-123 to Thr, Ser-124 to Thr (see

549: Table \ref{FBPaseTable}). It is surprising to see that the allosteric

550: site is only mildly detected. Predictions Ala-161 (Top10 SDR), Lys-290

551: (Top10 baseline) and Val-178 (Top20 SDR) are close and suggestive

552: of some role in AMP binding.

553:

554: %

555: \begin{table}

556:

557: \caption{Speculated roles of residues in FBPase for 1cnq from literature and

558: how well they were predicted. Aligned residues in other subfamilies

559: of hydrolases are also shown.}

560:

561: \begin{center}\includegraphics[%

562:   width=150mm]{FBPaseTable.pdf}\end{center}

563:

564: \label{FBPaseTable}

565: \end{table}

566:

567:

568: %

569: \begin{figure}

570:

571: \caption{SDR and functional residue predictions for 1cnq, an FBPase. Residue-coloring

572: scheme same as Fig.\ref{figTransaminase}. The bottom panel is a closer

573: view of the region around ligand in the top panel.}

574:

575: \begin{center}\includegraphics[%

576:   width=100mm]{figFBPase.jpg}\end{center}

577:

578: \label{figFBPase}

579: \end{figure}

580:

581:

582:

583: \subsection{Dehydrogenases}

584:

585: L-3-hydroxyacyl-CoA dehydrogenase (HAD, EC 1.1.1.35) is the penultimate

586: enzyme in the $\beta$-oxidation spiral and catalyzes conversion of a hydroxy group

587: to a keto group while converting NAD+ to NADH. It consists of NAD-binding

588: and C-terminal domains, which undergo relative movement between NAD

589: binding and substrate binding events \citep{activesiteSequestration}.

590: Its SCOP family is c.2.1.6, other members of which are other NAD/NADP-dependent

591: dehydrogenases (ECs 1.1.1.8, 1.1.1.22, 1.1.1.44). HAD is represented

592: in our dataset by NAD-binding domain of 1f0y (residues from A-12 to

593: A-203). Fig.~\ref{figHAD} shows Top10 baseline and SDR predictions.

594: The catalytically important pair of Glu-170 and His-158 is identified as

595: SDRs. Ser-137, interesting due to its contact with substrate as well

596: as NAD, is also identified as SDR. With the exceptions of Leu-122, Ala-35

597: (baseline) and Gly-29, Ala-107 (SDR), all other predictions are within

598: interacting distance of either NAD or substrate. Ser-61 and Lys-68

599: are not detected due to their high entropy.

600:

601: %

602: \begin{figure}

603:

604: \caption{SDR and functional residue predictions for 1f0y, a HAD. Residue-coloring

605: scheme same as Fig.\ref{figTransaminase}.}

606:

607: \begin{center}\includegraphics[%

608:   width=100mm]{figHAD.jpg}\end{center}

609:

610: \label{figHAD}

611: \end{figure}

612:

613:

614:

615: \subsection{Tryptophan biosynthesis enzymes}

616:

617: Phosphoribosylanthranilate (PRA) isomerase (TrpF) is a $(\beta\alpha)_{8}$

618: barrel enzyme; this barrel is the most common fold adopted by enzymes and is also

619: popular among non-enzymes. TrpF (EC 5.3.1.24) shares its SCOP family

620: (c.1.2.4) with indole-3-glycerol-phosphate synthase (EC 4.1.1.48)

621: and tryptophan synthase (EC 4.2.1.20), which are all involved in Trp

622: biosynthesis. Top10 baseline and SDR predictions are shown in Fig.~\ref{figTRPF}.

623: His-83 and Arg-36, considered important for catalysis, are predicted.

624: Gln-81 (Glu in Trp synthase 1kfc), predicted as baseline and SDR,

625: could be important for catalysis due to its location. A few baseline

626: predictions are far from active site and their conservation suggests

627: protein-protein binding interface. Predicted SDRs lie close to ligand

628: and are either replaced by other residues in Trp synthase (Arg-36

629: to Asn) or deleted (Gln-184, Asp-178), which suggests that they could

630: be specificity determining.

631:

632: %

633: \begin{figure}

634:

635: \caption{SDR and functional residue predictions for TrpF. Residue-coloring

636: scheme same as Fig.\ref{figTransaminase}.}

637:

638: \begin{center}\includegraphics[%

639:   width=100mm]{figTRPF.jpg}\end{center}

640:

641: \label{figTRPF}

642: \end{figure}

643:

644:

645:

646: \subsection{tRNA synthetases}

647:

648: Aminoacyl-tRNA synthetases catalyze the process of attaching an amino

649: acid to its tRNA carrier so that it can be incorporated into a protein.

650: SCOP family c.26.1.1 contains tyrosyl-tRNA synthetase (EC 6.1.1.1)

651: along with other (Trp-, Glu-, Gln-) tRNA synthetases. Fig.\ref{figTyrTRNA}

652: shows baseline and SDR predictions for tyrosyl-tRNA synthetase 1h3e

653: from a thermophilic bacterium T. thermophilus \citep{tyrTRNAclass12}.

654: Residues important for catalysis from 51-HIGH and 233-KMSKS regions

655: are predicted as baseline (His-52, Gly-54, His-55, Lys-235). Predicted

656: SDRs lie close to the substrate and cofactor. Residues specific for

657: L-tyrosine binding, according to \citet{tyrTRNAspecificity} (e.g.

658: Thr-80, Tyr-175, Gln-179, Asp-182, Glu-197), are detected. Note that

659: substrate similarity makes 2 broad divisions in this family corresponding

660: to Trp/Tyr and Glu/Gln, each of which is subdivided into finer groups.

661: Table \ref{tRnaTable} shows residues structurally aligned to SDRs

662: in these tRNA synthetases.

663:

664: %

665: \begin{table}

666:

667: \caption{Residues in other tRNA synthetases aligned to predicted SDRs in tyrosyl

668: tRNA synthetase.}

669:

670: \begin{center}\includegraphics[%

671:   width=150mm]{tRNAtable.pdf}\end{center}

672:

673: \label{tRnaTable}

674: \end{table}

675:

676:

677: %

678: \begin{figure}

679:

680: \caption{SDR and functional residue predictions for 1h3e (tyrosyl tRNA synthetase).

681: Residue-coloring scheme same as Fig.\ref{figTransaminase}.}

682:

683: \begin{center}\includegraphics[%

684:   width=100mm]{figTYRtRNA.jpg}\end{center}

685:

686: \label{figTyrTRNA}

687: \end{figure}

688:

689:

690: Residues distinct for each substrate-group could be specific for it,

691: e.g. Gln-179. Detection of residue Tyr-175 as SDR suggests that there

692: could be more functions associated with this structural family than

693: these four aminoacyl-tRNA synthetases. Detection of residues close to the cofactor indicates

694: different/no cofactors used by other functions of this structural

695: family. Some residues speculated by \citet{tyrTRNAspecificity} to

696: be functional remain undetected, e.g. Asn-128, which is not predicted

697: due to high entropy (Ser dominates the MSSA column, not Asn).

698:

699:

700: \section{Conclusion}

701:

702: We have combined structural and sequence information, functional annotation,

703: residue entropy and environment specific substitution tables to predict

704: specificity determining residues. We tested the predictions by using

705: information of specific ligands and in some cases, published literature.

706: We found that the predictions are far from random and functionally

707: relevant, which suggests that our approach is effective. Predictions

708: obtained with functional annotation (SDRs) and without it (baseline)

709: are different, suggesting that available functional annotation is

710: valuable. SDR and baseline predictions are complementary because they

711: enlarge the set of functionally significant residues that can be computationally

712: identified. We expected and found that our method cannot identify

713: significant residues in absence of high quality evolutionary information,

714: hence the importance of identifying chemically interesting patches

715: remains undiminished. A major concern is how to obtain functional

716: partitions in absence of annotation, which is similar to establishing

717: ortho/paralogy relationships. We plan to explore structure-sequence

718: scoring schemes that would help establish functional partitions reliably.

719: Alternatively, it would be useful to analyze the effects of constructing

720: a functional partition based on sequence identity. We plan to use

721: residue proximity information and residue contact conservation to

722: detect clusters which may not be conserved in the obvious sense. We

723: expect that cluster identification will alleviate the problem of not

724: identifying structurally conserved residues. The most important purpose

725: of SDR and catalytic residue identification is to help classify SNPs

726: into normal/deleterious classes and this would be an important avenue

727: to explore in near future.

728:

729:

730: \subsection*{Acknowledgements}

731:

732: We thank Dr Kenji Mizuguchi and Dr Vijayalakshmi Chelliah for helpful

733: discussions. Swanand Gore thanks Cambridge Commonwealth Trust and

734: Universities UK Overseas Research Studentship for funding.

735:

736: \bibliographystyle{marko}

737: \bibliography{sdr}

738:

739: \end{document}

740: