1: %% LyX 1.3 created this file. For more info, see http://www.lyx.org/.
2: %% Do not edit unless you really know what you are doing.
3: \documentclass[12pt,english]{article}
4: \pdfoutput=1
5: \usepackage[T1]{fontenc}
6: \usepackage[latin1]{inputenc}
7: \usepackage{array}
8: \usepackage{graphicx}
9: \usepackage{setspace}
10: \onehalfspacing
11: \usepackage[authoryear]{natbib}
12:
13: \makeatletter
14:
15: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
16: %% Bold symbol macro for standard LaTeX users
17: \providecommand{\boldsymbol}[1]{\mbox{\boldmath $#1$}}
18:
19: %% Because html converters don't know tabularnewline
20: \providecommand{\tabularnewline}{\\}
21:
22: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
23: \addtolength{\oddsidemargin}{-50pt}
24: \addtolength{\voffset}{-50pt}
25: \addtolength{\textwidth}{100pt}
26: \addtolength{\textheight}{100pt}
27:
28:
29: \usepackage{xspace}
30: \newcommand{\phipsi}{$\phi,\psi$\xspace}
31: \newcommand{\eg}{e.g.\xspace}
32: \newcommand{\etc}{etc.\xspace}
33: \newcommand{\dfs}{\textsc{dfs}~}
34: \newcommand{\cpp}{\textsc{c++}\xspace}
35: \newcommand{\python}{\textsc{python}\xspace}
36: \newcommand{\swig}{\textsc{swig}\xspace}
37: \newcommand{\pdb}{\textsc{PDB}\xspace}
38: \newcommand{\gabb}{\textsc{GABB}\xspace}
39: \newcommand{\scwrl}{\textsc{scwrl}\xspace}
40: \newcommand{\rapper}{\textsc{rapper}\xspace}
41: \newcommand{\rappertk}{\textit{Rapper}\textbf{tk}\xspace}
42: \newcommand{\probe}{\textsc{probe}\xspace}
43: \newcommand{\ie}{i.e.\xspace}
44: \newcommand{\ca}{$C_\alpha$\xspace}
45: \newcommand{\chir}{$\chi$}
46: \newcommand{\CA}[1]{${C_\alpha}^{#1}$~}
47: \newcommand{\CB}[1]{${C_\beta}^{#1}$~}
48: \newcommand{\C}[1]{$C^{#1}$~}
49: \newcommand{\N}[1]{$N^{#1}$\xspace}
50: \newcommand{\cO}[1]{$O^{#1}$\xspace}
51:
52:
53: \newcommand{\Ang}[1]{${#1}$\AA\xspace}
54:
55: \newcommand{\degr}[1]{${#1}^o$\xspace}
56:
57: \date{}
58:
59: \usepackage{babel}
60: \makeatother
61: \begin{document}
62:
63: \title{Identification of specificity determining residues in enzymes using
64: environment specific substitution tables}
65:
66:
67: \author{Swanand Gore and Tom Blundell\\
68: \{swanand,tom\}@cryst.bioc.cam.ac.uk\\
69: Department of Biochemistry, University of Cambridge\\
70: Cambridge CB2 1GA England}
71:
72: \maketitle
73: \begin{abstract}
Environment specific substitution tables have been used effectively
to distinguish structural from functional constraints on proteins
and thereby identify their active sites \citep{distinguish_str_func_restr}.
77: This work explores whether a similar approach can be used to identify
78: specificity determining residues (SDRs) responsible for cofactor dependence,
79: substrate specificity or subtle catalytic variations. We combine structure-sequence
80: information and functional annotation from various data sources to
81: create structural alignments for homologous enzymes and functional
82: partitions therein. We develop a scoring procedure to predict SDRs
83: and assess their accuracy using information from bound specific ligands
84: and published literature.\newpage
85:
86: \end{abstract}
87:
88: \section{Introduction}
89:
90: Enzymes are critical to cellular machinery. Enzymes are believed to
91: have developed different specificities following gene duplication
92: events that ease the evolutionary pressure on copies and allow exploration
93: of novel avenues to greater organismal fitness. Each copy then develops
94: its own niche, characterized by expression and localization, catalytic
95: mechanism, substrate specificity, cofactor dependence and catalysis
96: products. Such paralogous enzymes should have an evolutionary imprint
97: corresponding to their specific niche, in addition to maintenance
98: of structural fold. Thus evolutionary analysis of available structural
and sequence data should enable identification of key residues responsible
for specificity of various kinds. Enzyme specificity can be estimated
with functional assays without structure determination, but identification
of specificity determining residues (SDRs) remains difficult. While
ENZYME \citep{ENZYME}, a database of enzyme sequences with detailed
functional annotation, exists, there is no such database of SDRs.
105: Time, cost and technical limitations slow down structure determination
106: and even when structure is known, it is not trivial to identify the
107: residues important for binding cofactors and substrates. Hence it
108: is important to be able to identify such residues computationally.
Reliable detection of such residues will aid in deciding whether a
SNP is deleterious or neutral and will suggest mutation studies. Function
assignment to sequence could be done at a finer level, e.g. by verifying
that the SDRs necessary for a certain substrate are present. Computational
SDR identification has received a lot of attention and several methods
have been proposed. Evolutionary trace (ET) is one of the most important
methods \citep{evoltraceStrClust,EThybridMethods}. It
builds a phylogenetic tree based on sequence comparisons, such that
branch lengths are indicative of evolutionary divergence. Functional
subgroups consist of the sequences in subtrees cut from this tree
using a divergence cutoff. Residues conserved within a subtree, rather
than across the entire tree, are considered specificity-conferring.
Spatial cluster identification can be used with ET to reduce the number
of false positives. Inferring the phylogeny correctly remains the main
concern in this approach, hence attempts have been made to
use existing annotation with various statistical techniques. Another
important direction is to use the spatial proximity of residues.
126:
The cornerstone of our approach is that structural environment influences
residue substitution patterns, as illustrated by \citet{earlyESST} and
later used effectively for structure-sequence alignment and fold recognition
\citep{fugue}. The structural environment of a residue is described
in terms of secondary structure, solvent accessibility, and sidechain-sidechain
and sidechain-mainchain hydrogen bonding. Environment specific substitution
tables (ESSTs) derived from a set of high quality sequence-structure alignments represent
the expected substitution rate in a structural environment. Unexpected
conservation of a residue is indicative of a functional restraint acting
on it. The advantage of using ESSTs is that the structurally conserved
residues are masked, which is why active sites of homologous enzymes
can be identified reliably with this approach. This approach has been
extended in the present work by using functional annotation information.
140:
A set of homologous enzymes is generally a union of smaller functionally
specific subsets, e.g. substrate-specific subsets in serine proteinases
(trypsin, chymotrypsin etc.), cofactor-specific subsets in ferredoxin
reductases (NAD and NADP specific) and so on. In a multiple sequence
alignment of a homologous protein family, SDRs generally appear as
differentially conserved subcolumns. However, not all such subcolumns
are SDRs. Our hypothesis is that SDRs can be identified by combining
differential conservation with ESST-based detection of functional
restraint.
150:
151:
152: \section{Families, functional partitions and profiles}
153:
154: In order to test our hypothesis, we need to construct a dataset of
155: homologous enzyme families with reliable functional partitions in
156: them. While SCOP classification can be used in a straightforward way
157: for making families, identifying functionally specific subsets is
not a trivial task. Some automated approaches to detect functional
shift, e.g. \citet{funshiftakker}, exist to infer such partitions
but manual annotation remains the most reliable. Additionally, protein
function is not a precise and quantifiable entity. This restricted
our study to enzymes, which are the most well studied and well
annotated class of proteins. Enzyme function is fairly well defined
and well classified according to the hierarchical Enzyme Classification
scheme (EC). We use the mapping between SCOP domains and EC numbers
\citep{scopec} to make EC-specific subgroups within a SCOP domain
family. We generate profiles (multiple structure-sequence alignments)
for SCOP families and functional partitions. Sequence homologs for
structural families were found using PSIBLAST \citep{psiblast}
on the nonredundant sequence database, whereas function-specific partitions
were enriched using PSIBLAST searches on the ENZYME database \citep{ENZYME}.
A PSIBLAST hit on the ENZYME database is retained only if the EC number
of the hit matches that of the query. All PSIBLAST searches used 5 rounds
and an e-value threshold of 0.01; hits shorter than 75\% of the query length were ignored.
All structure-sequence alignments were carried out with fugueseq \citep{fugue},
which has been shown to improve alignment quality over PSIBLAST. This
process is summarized in Fig.~\ref{workflow}.
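
The hit-retention rules described above can be sketched as follows. This is only a minimal illustration, not the pipeline used in this work: the \texttt{Hit} record, its field names and the function \texttt{filter\_hits} are hypothetical, and we assume that the e-value threshold is inclusive and that hit length is compared directly with the query length.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical, simplified record of one PSIBLAST hit."""
    seq_id: str
    ec_number: str   # EC annotation of the hit sequence
    length: int      # length of the hit in residues
    e_value: float


def filter_hits(hits, query_ec, query_length,
                max_e_value=0.01, min_length_fraction=0.75):
    """Keep hits that pass the e-value, length and EC-match criteria."""
    kept = []
    for hit in hits:
        if hit.e_value > max_e_value:
            continue  # not significant enough (assumed inclusive threshold)
        if hit.length < min_length_fraction * query_length:
            continue  # shorter than 75% of the query
        if hit.ec_number != query_ec:
            continue  # EC number of hit must match that of query
        kept.append(hit)
    return kept
```

For family-level profiles built from the nonredundant database, the EC check would simply be dropped, since those hits carry no EC requirement in the text.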
178:
179: %
180: \begin{figure}
181:
182: \caption{Workflow}
183:
184: \begin{center}\includegraphics[%
185: width=150mm]{workflow.pdf}\end{center}
186:
187: \label{workflow}
188: \end{figure}
189:
190:
191: Another constraint on the choice of dataset comes from the need for
192: sufficient functional diversity in a SCOP domain family. In its absence,
193: the contrast between the domain family and EC-specific subgroup within
194: it might not be detectable. Hence we chose the SCOP families with
195: at least two different EC annotations.
196:
197: To be able to test the hypothesis quantitatively, a gold standard
set of SDRs for every enzyme is needed. But SDRs are generally a topic
of lively debate among researchers, partly due to the infeasibility
of performing all necessary mutation studies. Thus, to our knowledge,
no such dataset exists. Hence we use the information of bound ligands
and nearby residues to assess the hypothesis. This restricts the dataset
to only those cases where at least one EC-specific
domain group has a relevant ligand bound. A relevant ligand is one
that is unique to the reaction carried out by that EC group among all
possible reactions in that domain family. For example, in SCOP family
c.1.10.4 there are two functional subgroups:
208:
209: 3-deoxy-8-phosphooctulonate synthase (EC 2.5.1.55) : Phosphoenolpyruvate
210: + D-arabinose 5-phosphate + H(2)O = 2-dehydro-3-deoxy-D-octonate 8-phosphate
211: + phosphate
212:
213: 3-deoxy-7-phosphoheptulonate synthase (EC 2.5.1.54) : Phosphoenolpyruvate
214: + D-erythrose 4-phosphate + H(2)O = 3-deoxy-D-arabino-hept-2-ulosonate
215: 7-phosphate + phosphate
216:
Here D-arabinose 5-phosphate is unique to EC 2.5.1.55 and is present
in domain 1fxqA as A5P. Hence it is taken as an indicator of SDR locations,
unlike phosphoenolpyruvate, which is common to both reactions.
We sometimes also use products as such indicators. A ligand is considered
relevant if its name from the PDB file (HETNAM, HETSYN records) matches
its name in the reaction or if PDBsum \citep{pdbsum} finds it sufficiently
similar to the ideal ligand molecule. Our final dataset consists of 97
examples drawn from 68 families. Very few SDR identification studies
have been carried out with this many examples.
226:
227:
228: \section{Profiles and substitution patterns}
229:
Structural and sequence information in a multiple structure-sequence
alignment (MSSA) can be misleading if dominated
by very close homologs, hence each MSSA was filtered with a 90\% sequence
identity cutoff to avoid redundancy.
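
One simple way to apply such a redundancy cutoff is a greedy filter, sketched below. The text does not specify how the filter was implemented, so both the greedy strategy and the identity definition (matches over columns where neither sequence has a gap) are assumptions made purely for illustration; the function names are hypothetical.

```python
def pairwise_identity(seq_a, seq_b, gap="-"):
    """Fraction of identical residues over columns where neither row is gapped."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def filter_redundant(aligned_rows, cutoff=0.90):
    """Greedily keep rows that are at most `cutoff` identical to every kept row."""
    kept = []
    for row in aligned_rows:
        if all(pairwise_identity(row, other) <= cutoff for other in kept):
            kept.append(row)
    return kept
```

With a greedy scheme the result depends on the input order, so in practice one would probably place the rows with known structures first to ensure that they are retained.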
233:
The observed substitution pattern for a column in a profile MSSA (multiple
structure-sequence alignment) was calculated after down-weighting contributions
from similar sequences ($>60\%$ sequence identity). Gaps were ignored
while calculating the observed substitution pattern, but the ratio
of gaps to amino acids in a column was computed. Columns with high
gap content are generally not functional, hence gap content was used
as a filtering criterion as described later. Observed substitution
patterns were normalized, and sequence entropy was also calculated to
get a measure of variability in the column as $-\sum_{i=1}^{20}f_{i}\log(f_{i})$,
where $f_{i}$ is the fraction of the $i^{th}$ amino acid in the distribution.
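
The per-column quantities described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the sequence weights are supplied by the caller (the exact down-weighting scheme is not specified here), the natural logarithm is used for the entropy (the base is not stated in the text), and all function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def observed_pattern(column, weights):
    """Normalized, weighted amino acid frequencies of one column (gaps ignored)."""
    totals = {aa: 0.0 for aa in AMINO_ACIDS}
    for residue, weight in zip(column, weights):
        if residue in totals:          # skips gaps and unknown symbols
            totals[residue] += weight
    norm = sum(totals.values())
    if norm == 0.0:
        return totals                  # fully gapped column: all zeros
    return {aa: t / norm for aa, t in totals.items()}


def gap_ratio(column, gap="-"):
    """Ratio of gaps to amino acids in a column."""
    gaps = sum(1 for r in column if r == gap)
    residues = sum(1 for r in column if r in AMINO_ACIDS)
    return gaps / residues if residues else float("inf")


def sequence_entropy(pattern):
    """Entropy -sum f_i log f_i over the non-zero fractions (natural log assumed)."""
    return -sum(f * math.log(f) for f in pattern.values() if f > 0.0)
```

For a column reading \texttt{AAAC-{}-} with unit weights, the pattern assigns 0.75 to Ala and 0.25 to Cys, and the gap ratio is 0.5.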
244:
Expected substitution patterns for a column were calculated using
environment specific substitution probability tables derived from
high quality multiple structure alignments from 371 families \citep{fugue}.
Substitution probabilities from every structure were averaged to get
the expected substitution probabilities for each column in the MSSA. Again,
sequence-based clustering was used to prevent the expected substitution
pattern from being dominated by very similar structures.
252:
Functional restraint is calculated as the city-block distance between
the normalized observed and expected substitution patterns ($\sum_{i=1}^{20}|o_{i}-e_{i}|$,
$o_{i}$ being the observed fraction of the $i^{th}$ amino acid and $e_{i}$
being the fraction of times it is expected to occur). Thus, for both
MSSAs (whole family and EC-specific) we have the following quantities:
functional restraint ($famF,ecF$), gap content ($famG,ecG$) and
sequence entropy ($famE,ecE$). Moreover, for each MSSA, the number of
sequences $<80\%$ identical to each other was taken as an indicator
of the evolutionary information available in it.
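
As a minimal sketch of the calculation above, the expected pattern can be taken as a weighted average of the per-structure substitution probabilities, and the functional restraint as the city-block distance to the observed pattern. The function names and the use of caller-supplied cluster weights are assumptions for illustration only.

```python
def expected_pattern(per_structure_probs, weights):
    """Weighted average of per-structure ESST substitution probabilities."""
    total_weight = sum(weights)
    amino_acids = per_structure_probs[0].keys()
    return {
        aa: sum(w * probs[aa] for probs, w in zip(per_structure_probs, weights))
        / total_weight
        for aa in amino_acids
    }


def functional_restraint(observed, expected):
    """City-block (L1) distance between two normalized patterns."""
    keys = set(observed) | set(expected)
    return sum(abs(observed.get(k, 0.0) - expected.get(k, 0.0)) for k in keys)
```

Because both patterns sum to one, this distance lies between 0 (observed substitutions match the structural expectation) and 2 (no overlap at all), which gives a sense of scale for cutoffs such as $ecF>0.5$.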
262:
263:
264: \section{Benchmarking}
265:
In order to assess the differences between residues important for the whole
family and for the EC partition, baseline predictions were made by choosing
the top-ranking residues according to whole family functional constraint
from residues which are not highly gapped ($famG<0.5$). The number of
baseline and SDR predictions is the same whenever they are compared or
an overlap between them is computed. This helps in assessing whether the
information in the EC-specific MSSA is distinct.
273:
The likelihood of a residue being an SDR presumably increases
with its proximity to the specific ligand. Hence, to quantify the merit
of a prediction, we define mean proximity as the mean separation
between the predicted residues and the ligand. Mean relative proximity is
defined as the ratio of mean proximity to the mean separation between
all residues in the domain and the ligand. The distance between a residue
and a ligand is taken to be the closest distance between residue sidechain
(mainchain for glycine) atoms and ligand atoms. The smaller the mean relative
proximity, the better the prediction. Prediction quality will also depend
on the number of distinct homologous sequences available. In case
of multiple ligands close to a domain, a residue's proximity to the
ligand is calculated with respect to the closest ligand. The basis
for SDR prediction is that a residue be sufficiently distinct between the whole
family and EC-specific MSSAs. As \citet{funshiftakker} describe it,
an SDR should be a rate-shifted or conservation-shifted site. Additionally,
an SDR should be sufficiently functionally constrained from the ESST perspective
($ecF$). For a residue with low entropy in the EC MSSA, a high change in
entropy $dE$ (family MSSA sequence entropy $-$ EC MSSA sequence entropy)
indicates that it could be an SDR. Since each MSSA will be
different in its variability, it is not advisable to use the same functional
constraint cutoff or entropy cutoff for all of them. This immediately
suggests two 2-step approaches: choose the top $N1$ residues with a high
difference in sequence entropy between the whole and EC MSSAs, then select
the top $N2$ according to functional constraint in the EC MSSA, or vice versa.
But there could be a third and more attractive approach that combines
functional constraint from the EC MSSA and the sequence entropy difference.
We pursue the third approach.
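
The two proximity measures defined above can be sketched as follows. This is an illustration under assumptions: coordinates are plain $(x,y,z)$ tuples in \AA, each residue is represented by its sidechain atoms (the caller is expected to pass mainchain atoms for glycine), and the function names and dictionary layout are hypothetical.

```python
import math


def residue_ligand_distance(residue_atoms, ligands):
    """Closest atom-atom distance between a residue and its nearest ligand."""
    return min(
        math.dist(atom, ligand_atom)
        for ligand in ligands
        for ligand_atom in ligand
        for atom in residue_atoms
    )


def mean_proximity(residue_ids, residues, ligands):
    """Mean residue-ligand separation over the given residues."""
    distances = [residue_ligand_distance(residues[r], ligands) for r in residue_ids]
    return sum(distances) / len(distances)


def mean_relative_proximity(predicted_ids, residues, ligands):
    """Mean proximity of predictions relative to that of all domain residues."""
    return (mean_proximity(predicted_ids, residues, ligands)
            / mean_proximity(list(residues), residues, ligands))
```

A value well below 1 means that the predicted residues lie closer to the ligand than a typical residue of the domain does, while a value near 1 means that the prediction is no better than picking residues at random.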
301:
We assume that the SDR score of a residue is a linear combination of its
functional constraint, entropy and change in entropy, given that the
residue passes certain quality checks ($ecF>0.5$, $ecG<0.5$, $ecE<1$,
$dE>0.5$, where $dE=famE-ecE$):
\[
SDRscore=ecF+a\,(famE-ecE)-b\,ecE
\]
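
A minimal sketch of how the quality checks and the score could be combined into a ranked prediction is given below. The \texttt{Column} record and the function names are hypothetical and chosen only to mirror the symbols used in the text.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """Hypothetical per-residue summary of one alignment column."""
    residue_id: str
    ecF: float   # functional restraint in the EC-specific MSSA
    ecG: float   # gap content in the EC-specific MSSA
    ecE: float   # sequence entropy in the EC-specific MSSA
    famE: float  # sequence entropy in the whole-family MSSA


def passes_checks(col):
    """Quality checks: ecF > 0.5, ecG < 0.5, ecE < 1 and dE > 0.5."""
    dE = col.famE - col.ecE
    return col.ecF > 0.5 and col.ecG < 0.5 and col.ecE < 1.0 and dE > 0.5


def sdr_score(col, a, b):
    """SDRscore = ecF + a*(famE - ecE) - b*ecE."""
    return col.ecF + a * (col.famE - col.ecE) - b * col.ecE


def predict_sdrs(columns, a, b, top_n=10):
    """Return the ids of the top_n highest-scoring columns that pass the checks."""
    candidates = [c for c in columns if passes_checks(c)]
    candidates.sort(key=lambda c: sdr_score(c, a, b), reverse=True)
    return [c.residue_id for c in candidates[:top_n]]
```

Note that the checks act as a hard filter before ranking, so a residue with a very high $ecF$ is still excluded if its column is heavily gapped or its entropy does not drop sufficiently in the EC-specific MSSA.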
308:
In order to optimize the parameters $a,b$ and test the optimal ones,
we created a high quality test set from our examples, consisting of
23 examples drawn from SCOP families with at least 2 EC groups, each
with $>10$ distinct sequence homologs from the ENZYME database. Parameters
$a,b$ were varied from 0 to 5 in steps of $0.2$. For each pair of
values of $a$ and $b$, SDR and baseline predictions
were made, each consisting of 10 residues. Note that baseline predictions
are not affected by the values of $a,b$. Optimization can be done with
two objectives, either to minimize the mean proximity or to maximize
the number of close ($<$\Ang{6}) residues. $a,b$ values of $0.4,1.2$
minimize the former objective to \Ang{9.24} and yield $3.6$ close
residues per prediction, whereas $0,0.8$ maximize the latter to $4.08$
residues while yielding \Ang{9.36} for the former. Performance of
these two $a,b$ settings on different sets of examples is shown in
Table \ref{evolABperf}.
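
The parameter scan described above amounts to an exhaustive grid search, which can be sketched as follows. The \texttt{objective} callable is a hypothetical stand-in for whichever quantity is being minimized (for example, the mean proximity averaged over the test set); to maximize the number of close residues one would pass its negative.

```python
def parameter_grid(start=0.0, stop=5.0, step=0.2):
    """Evenly spaced values from start to stop inclusive (rounded to avoid drift)."""
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


def grid_search(objective, values):
    """Return (best_value, a, b) minimizing objective(a, b) over all pairs."""
    best = None
    for a in values:
        for b in values:
            value = objective(a, b)
            if best is None or value < best[0]:
                best = (value, a, b)
    return best
```

With the range and step used here the grid contains 26 values per parameter, i.e. 676 combinations of $a$ and $b$, which is small enough that no more elaborate optimizer is needed.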
324:
325: %
326: \begin{table}
327:
\caption{Performance of the two candidate settings of $(a,b)$ for various
levels of evolutionary information available.}
330:
331: \begin{center}\begin{tabular}{|c|p{1in}|p{1in}|p{1in}|p{1in}|}
332: \hline
Criteria for&
\multicolumn{2}{c|}{Mean proximity (\AA)}&
\multicolumn{2}{c|}{\#close ($<$\Ang{6}) residues}\tabularnewline
choice of examples&
(0,0.8)&
(0.4,1.2)&
(0,0.8)&
(0.4,1.2)\tabularnewline
341: \hline
342: \hline
343: >5 homologs&
344: 10.84&
345: 11.24&
346: 3.35&
347: 3.01\tabularnewline
348: (67 examples)&
349: &
350: &
351: &
352: \tabularnewline
353: \hline
354: >10 homologs&
355: 10.41&
356: 10.64&
357: 3.45&
358: 3.2\tabularnewline
359: (55 examples)&
360: &
361: &
362: &
363: \tabularnewline
364: \hline
365: >10 homologs, >1 EC&
366: 9.36&
367: 9.24&
368: 4.08&
369: 3.6\tabularnewline
370: (23 examples)&
371: &
372: &
373: &
374: \tabularnewline
375: \hline
376: \end{tabular}\end{center}
377:
378: \label{evolABperf}
379: \end{table}
380:
381:
This suggests that the optimal $a,b$ parameters are $0,0.8$, since this
setting gives the lower mean proximity and the larger number of close
residues on both of the larger sets. It is surprising
that the value of $dE=famE-ecE$ carries no weight in the SDR
score. Perhaps this is due to the quality checks applied prior to
calculation of SDR scores, which demand $dE>0.5$.
386:
Fig.~\ref{proxDistrib} shows the distribution of mean proximity in
various sets derived according to the number of distinct homologs in ENZYME.
This shows that the quality of evolutionary information available has
a great impact on the quality of predictions.
391:
392: %
393: \begin{figure}
394:
395: \caption{Frequency of observing a certain mean proximity of SDR predictions
396: (binned in \Ang{1} bins) for different qualities of evolutionary
397: information available.}
398:
399: \begin{center}\includegraphics[%
400: width=150mm]{proxDistrib.pdf}\end{center}
401:
402: \label{proxDistrib}
403: \end{figure}
404:
405:
Mean relative proximity indicates how far from random the prediction is.
Table \ref{meanRelProxTable} shows that mean relative proximity depends
on the quality of evolutionary information and is far from random for
both SDR and baseline predictions.
410:
411: %
412: \begin{table}
413:
414: \caption{Mean relative proximity in various datasets made according to number
415: of available distinct homologs.}
416:
417: \begin{center}\begin{tabular}{|c|c|c|c|}
418: \hline
Dataset&
Mean Rel. Prox.&
Mean Rel. Prox.&
Frequency of\tabularnewline
&
(SDR)&
(baseline)&
MRP(SDR) $\leq$ MRP(baseline)\tabularnewline
427: \hline
428: >0 homologs&
429: 0.67&
430: 0.66&
431: 34\% (33/97)\tabularnewline
432: \hline
433: >5 homologs&
434: 0.57&
435: 0.66&
436: 60\% (40/67)\tabularnewline
437: \hline
438: >10 homologs&
439: 0.57&
440: 0.62&
441: 85\% (47/55)\tabularnewline
442: \hline
443: \end{tabular}\end{center}
444:
445: \label{meanRelProxTable}
446: \end{table}
447:
448:
The fraction of SDRs present in baseline predictions is $15\%$ in
all three ($>0$, $>5$, $>10$) homolog classes, which suggests that SDR predictions
are fairly different from baseline. This also suggests that baseline
and SDR predictions are complementary to each other.
453:
454:
455: \section{Some examples}
456:
When good quality sequence information is available, SDR predictions are
closer to the specific ligand than baseline predictions, which in turn
are closer than random. Here we compare our Top10 predictions with
information from the literature for some examples.
461:
462:
463: \subsection{Aminotransferases}
464:
Aminotransferases or transaminases are important to amino acid biosynthesis
and unique due to their specificity to two substrates: a glutamate
and an amino-carrier. Our dataset contains two SCOP families (c.67.1.1
and c.67.1.4) that contain transaminases. Of those, we focus on SCOP
family c.67.1.1, which contains the functional categories aspartate
transaminase (AspAT, EC 2.6.1.1) and histidinol phosphate transaminase
(HspAT, EC 2.6.1.9). Other non-transaminase members of this family
include threonine aldolases (EC 4.1.2.5) and alliin lyase (EC 4.4.1.4).
When Top10 predictions were analyzed in 1gex, an HspAT, we found that
SDR predictions are very well clustered around the ligands PLP and
HSP, but 5 of the 10 predictions were shared with the Top10 baseline predictions.
This overlap can be attributed to degrees of functional diversity
in the SCOP family, i.e. a large entropy reduction in HspAT residues
could be due to their importance to the general transaminase mechanism
(as opposed to the aldolase mechanism) or to substrate specificity for
histidinol phosphate (as opposed to aspartate in AspATs). In order
to increase the number of distinct predictions, Top20 baseline and
SDR predictions were used. Fig.~\ref{figTransaminase} shows the predictions
for 1gexA, an HspAT from \textit{E. coli}; 7 predictions are common. The catalytically
important residues Asn-157, Tyr-187 and Lys-214 \citep{Haruyama2001}
are identified as baseline, SDR and common predictions, respectively. Tyr-55, which
interacts with the substrate of the other subunit, is predicted as an SDR%
\footnote{This is confirmed from a similar prediction in 1gc4, an AspAT.%
}. Tyr-20, believed to be important for specificity, is not predicted
as such because it is conserved only 80\% of the time, whereas a similarly
placed Tyr-55 from the other subunit is much better conserved (98\% of the time)
and could be equally important for specificity. Ala-186, considered
important for restricting rotation of PLP's pyridine ring and thereby
contributing to the strain essential for enzyme function, is predicted
as both SDR and baseline. Most other predicted SDRs lie close to the
substrate. Their location and AspAT counterparts suggest a role
in conferring specificity towards histidinol phosphate (see Table \ref{transaminaseTable}).
497:
498: %
499: \begin{table}
500:
\caption{Residues with roles proposed by \citet{Haruyama2001} for HspAT 1gex
and how well they were predicted. The aligned residues in other subfamilies
with transaminases are also shown.}
504:
505: \begin{center}\includegraphics[%
506: width=150mm]{transaminaseTable.pdf}\end{center}
507:
508: \label{transaminaseTable}
509: \end{table}
510:
511:
512: %
513: \begin{figure}
514:
\caption{SDR (green) and functional residue (red) predictions for 1gex, an
HspAT. Residues predicted both as functional and specificity-conferring
517: are colored blue. Top left panel shows Top5 predictions, top right
518: panel shows Top10 predictions and bottom panel zooms in on the region
519: around ligand in the Top10 case.}
520:
521: \begin{center}\includegraphics[%
522: width=150mm]{transaminaseFig.jpg}\end{center}
523:
524: \label{figTransaminase}
525: \end{figure}
526:
527:
528:
529: \subsection{Phosphoric monoester hydrolases}
530:
SCOP family e.7.1.1 in our dataset contains 4 classes of phosphoric
monoester hydrolases: 3'(2'),5'-bisphosphate nucleotidase (EC 3.1.3.7),
fructose-bisphosphatase (FBPase, EC 3.1.3.11), inositol-phosphate phosphatase
(IMPase, EC 3.1.3.25) and inositol-1,4-bisphosphate 1-phosphatase (EC 3.1.3.57).
Here we look at the SDR and baseline predictions for 1cnq, a member
of the FBPase category. FBPases are of key importance to the regulation of
the gluconeogenic pathway and catalyze the hydrolysis of fructose 1,6-bisphosphate
to fructose 6-phosphate. They are metal dependent and are allosterically
controlled by AMP, which triggers a conformational change and masks
the fructose active site. Fig.~\ref{figFBPase} shows the Top10 baseline
and SDR predictions; the overlap in this case is 2 residues. The F6P
molecule around which most predictions are clustered lies in the active
site, whereas the other F6P molecule occupies a location similar to that of AMP (from
comparison with PDB 1yyz). Baseline predictions Tyr-279, Glu-280,
Tyr-244, Met-244 and common prediction Tyr-264 are within interacting
distance of the F6P ligand in the active site. Most predicted SDRs form
the active site walls and differ between FBPase and IMPase (1awb):
Arg-276 to His, Ser-96 to Gly, Ser-123 to Thr, Ser-124 to Thr (see
Table \ref{FBPaseTable}). It is surprising to see that the allosteric
site is only mildly detected. Predictions Ala-161 (Top10 SDR), Lys-290
(Top10 baseline) and Val-178 (Top20 SDR) are close to it and suggestive
of some role in AMP binding.
553:
554: %
555: \begin{table}
556:
557: \caption{Speculated roles of residues in FBPase for 1cnq from literature and
558: how well they were predicted. Aligned residues in other subfamilies
559: of hydrolases are also shown.}
560:
561: \begin{center}\includegraphics[%
562: width=150mm]{FBPaseTable.pdf}\end{center}
563:
564: \label{FBPaseTable}
565: \end{table}
566:
567:
568: %
569: \begin{figure}
570:
\caption{SDR and functional residue predictions for 1cnq, an FBPase. Residue-coloring
scheme same as Fig.~\ref{figTransaminase}. The bottom panel is a closer
view of the region around the ligand in the top panel.}
574:
575: \begin{center}\includegraphics[%
576: width=100mm]{figFBPase.jpg}\end{center}
577:
578: \label{figFBPase}
579: \end{figure}
580:
581:
582:
583: \subsection{Dehydrogenases}
584:
L-3-hydroxyacyl-CoA dehydrogenase (HAD, EC 1.1.1.35) is the penultimate
enzyme in the $\beta$-oxidation spiral and catalyzes conversion of a hydroxy group
to a keto group while converting NAD$^{+}$ to NADH. It consists of NAD-binding
and C-terminal domains, which undergo relative movement between NAD
binding and substrate binding events \citep{activesiteSequestration}.
Its SCOP family is c.2.1.6, the other members of which are other NAD/NADP-dependent
dehydrogenases (ECs 1.1.1.8, 1.1.1.22, 1.1.1.44). HAD is represented
in our dataset by the NAD-binding domain of 1f0y (residues from A-12 to
A-203). Fig.~\ref{figHAD} shows Top10 baseline and SDR predictions.
The catalytically important pair of Glu-170 and His-158 is identified as
SDRs. Ser-137, interesting due to its contact with the substrate as well
as NAD, is also identified as an SDR. With the exceptions of Leu-122, Ala-35
(baseline) and Gly-29, Ala-107 (SDR), all other predictions are within
interacting distance of either NAD or the substrate. Ser-61 and Lys-68
are not detected due to their high entropy.
600:
601: %
602: \begin{figure}
603:
604: \caption{SDR and functional residue predictions for 1f0y, a HAD. Residue-coloring
605: scheme same as Fig.\ref{figTransaminase}.}
606:
607: \begin{center}\includegraphics[%
608: width=100mm]{figHAD.jpg}\end{center}
609:
610: \label{figHAD}
611: \end{figure}
612:
613:
614:
615: \subsection{Tryptophan biosynthesis enzymes}
616:
Phosphoribosylanthranilate (PRA) isomerase (TrpF) is a $(\beta\alpha)_{8}$
barrel enzyme; this barrel is the most common fold adopted by enzymes and
is also popular among non-enzymes. TrpF (EC 5.3.1.24) shares its SCOP family
(c.1.2.4) with indole-3-glycerol-phosphate synthase (EC 4.1.1.48)
and tryptophan synthase (EC 4.2.1.20), which are all involved in Trp
biosynthesis. Top10 baseline and SDR predictions are shown in Fig.~\ref{figTRPF}.
His-83 and Arg-36, considered important for catalysis, are predicted.
Gln-81 (Glu in Trp synthase 1kfc), predicted as both baseline and SDR,
could be important for catalysis due to its location. A few baseline
predictions are far from the active site and their conservation suggests
a protein-protein binding interface. Predicted SDRs lie close to the ligand
and are either replaced by other residues in Trp synthase (Arg-36
to Asn) or deleted (Gln-184, Asp-178), which suggests that they could
be specificity determining.
631:
632: %
633: \begin{figure}
634:
635: \caption{SDR and functional residue predictions for TrpF. Residue-coloring
636: scheme same as Fig.\ref{figTransaminase}.}
637:
638: \begin{center}\includegraphics[%
639: width=100mm]{figTRPF.jpg}\end{center}
640:
641: \label{figTRPF}
642: \end{figure}
643:
644:
645:
646: \subsection{tRNA synthetases}
647:
Aminoacyl-tRNA synthetases catalyze the process of attaching an amino
acid to its tRNA carrier so that it can be incorporated into a protein.
SCOP family c.26.1.1 contains tyrosyl-tRNA synthetase (EC 6.1.1.1)
along with other (Trp-, Glu-, Gln-) tRNA synthetases. Fig.~\ref{figTyrTRNA}
shows baseline and SDR predictions for tyrosyl-tRNA synthetase 1h3e
from the thermophilic bacterium \textit{T. thermophilus} \citep{tyrTRNAclass12}.
Residues important for catalysis from the 51-HIGH and 233-KMSKS regions
are predicted as baseline (His-52, Gly-54, His-55, Lys-235). Predicted
SDRs lie close to the substrate and cofactor. Residues specific for
L-tyrosine binding according to \citet{tyrTRNAspecificity} (e.g.
Thr-80, Tyr-175, Gln-179, Asp-182, Glu-197) are detected. Note that
substrate similarity makes 2 broad divisions in this family, corresponding
to Trp/Tyr and Glu/Gln, each of which is subdivided into finer groups.
Table \ref{tRnaTable} shows residues structurally aligned to SDRs
in these tRNA synthetases.
663:
664: %
665: \begin{table}
666:
\caption{Residues in other tRNA synthetases aligned to predicted SDRs in tyrosyl-tRNA
synthetase.}
669:
670: \begin{center}\includegraphics[%
671: width=150mm]{tRNAtable.pdf}\end{center}
672:
673: \label{tRnaTable}
674: \end{table}
675:
676:
677: %
678: \begin{figure}
679:
\caption{SDR and functional residue predictions for 1h3e (tyrosyl-tRNA synthetase).
Residue-coloring scheme same as Fig.~\ref{figTransaminase}.}
682:
683: \begin{center}\includegraphics[%
684: width=100mm]{figTYRtRNA.jpg}\end{center}
685:
686: \label{figTyrTRNA}
687: \end{figure}
688:
689:
Residues distinct for each substrate group could be specific for it,
e.g. Gln-179. Detection of residue Tyr-175 as an SDR suggests that there
could be more functions associated with this structural family than
these four aminoacyl-tRNA synthetases. Detection of residues close to the cofactor suggests
that other functions of this structural family use different cofactors
or none. Some residues speculated by \citet{tyrTRNAspecificity} to
be functional remain undetected, e.g. Asn-128, which is not predicted
due to high entropy (Ser dominates the MSSA column, not Asn).
698:
699:
700: \section{Conclusion}
701:
We have combined structural and sequence information, functional annotation,
residue entropy and environment specific substitution tables to predict
specificity determining residues. We tested the predictions by using
information on specific ligands and, in some cases, published literature.
We found that the predictions are far from random and functionally
relevant, which suggests that our approach is effective. Predictions
obtained with functional annotation (SDRs) and without it (baseline)
are different, suggesting that available functional annotation is
valuable. SDR and baseline predictions are complementary because they
enlarge the set of functionally significant residues that can be computationally
identified. We expected and found that our method cannot identify
significant residues in the absence of high quality evolutionary information,
hence the importance of identifying chemically interesting patches
remains undiminished. A major concern is how to obtain functional
partitions in the absence of annotation, a problem similar to establishing
ortho/paralogy relationships. We plan to explore structure-sequence
scoring schemes that would help establish functional partitions reliably.
Alternatively, it would be useful to analyze the effects of constructing
a functional partition based on sequence identity. We plan to use
residue proximity information and residue contact conservation to
detect clusters which may not be conserved in the obvious sense. We
expect that cluster identification will alleviate the problem of not
identifying structurally conserved residues. The most important purpose
of SDR and catalytic residue identification is to help classify SNPs
into neutral/deleterious classes, and this would be an important avenue
to explore in the near future.
728:
729:
730: \subsection*{Acknowledgements}
731:
732: We thank Dr Kenji Mizuguchi and Dr Vijayalakshmi Chelliah for helpful
733: discussions. Swanand Gore thanks Cambridge Commonwealth Trust and
734: Universities UK Overseas Research Studentship for funding.
735:
736: \bibliographystyle{marko}
737: \bibliography{sdr}
738:
739: \end{document}
740: