0710.2808/sdr.lyx
1: #LyX 1.3 created this file. For more info see http://www.lyx.org/
2: \lyxformat 221
3: \textclass article
4: \begin_preamble
5: \addtolength{\oddsidemargin}{-50pt}
6: \addtolength{\voffset}{-50pt}
7: \addtolength{\textwidth}{100pt}
8: \addtolength{\textheight}{100pt}
9: 
10: 
11: \usepackage{xspace}
12: \newcommand{\phipsi}{$\phi,\psi$\xspace}
13: \newcommand{\eg}{e.g.\xspace}
14: \newcommand{\etc}{etc.\xspace}
15: \newcommand{\dfs}{\textsc{dfs}~}
16: \newcommand{\cpp}{\textsc{c++}\xspace}
17: \newcommand{\python}{\textsc{python}\xspace}
18: \newcommand{\swig}{\textsc{swig}\xspace}
19: \newcommand{\pdb}{\textsc{PDB}\xspace}
20: \newcommand{\gabb}{\textsc{GABB}\xspace}
21: \newcommand{\scwrl}{\textsc{scwrl}\xspace}
22: \newcommand{\rapper}{\textsc{rapper}\xspace}
23: \newcommand{\rappertk}{\textit{Rapper}\textbf{tk}\xspace}
24: \newcommand{\probe}{\textsc{probe}\xspace}
25: \newcommand{\ie}{i.e.\xspace}
26: \newcommand{\ca}{$C_\alpha$\xspace}
27: \newcommand{\chir}{$\chi$}
28: \newcommand{\CA}[1]{${C_\alpha}^{#1}$~}
29: \newcommand{\CB}[1]{${C_\beta}^{#1}$~}
30: \newcommand{\C}[1]{$C^{#1}$~}
31: \newcommand{\N}[1]{$N^{#1}$\xspace}
32: \newcommand{\cO}[1]{$O^{#1}$\xspace}
33: 
34: 
35: \newcommand{\Ang}[1]{${#1}$\AA\xspace}
36: 
37: \newcommand{\degr}[1]{${#1}^o$\xspace}
38: 
39: \date{}
40: \end_preamble
41: \language english
42: \inputencoding auto
43: \fontscheme default
44: \graphics default
45: \paperfontsize 12
46: \spacing onehalf 
47: \papersize Default
48: \paperpackage a4
49: \use_geometry 0
50: \use_amsmath 0
51: \use_natbib 1
52: \use_numerical_citations 0
53: \paperorientation portrait
54: \secnumdepth 3
55: \tocdepth 3
56: \paragraph_separation indent
57: \defskip medskip
58: \quotes_language english
59: \quotes_times 2
60: \papercolumns 1
61: \papersides 1
62: \paperpagestyle default
63: 
64: \layout Title
65: 
66: Identification of specificity determining residues in enzymes using environment
67:  specific substitution tables
68: \layout Author
69: 
70: Swanand Gore and Tom Blundell
71: \newline 
72: {swanand,tom}@cryst.bioc.cam.ac.uk
73: \newline 
74: Department of Biochemistry, University of Cambridge
75: \newline 
76: Cambridge CB2 1GA England
77: \layout Abstract
78: \pagebreak_bottom 
79: Environment specific substitution tables have been used effectively for
80:  distinguishing structural and functional constraints on proteins and thereby
81:  identifying their active sites (
82: \begin_inset LatexCommand \citet{distinguish_str_func_restr}
83: 
84: \end_inset 
85: 
86: ).
87:  This work explores whether a similar approach can be used to identify specifici
88: ty determining residues (SDRs) responsible for cofactor dependence, substrate
89:  specificity or subtle catalytic variations.
90:  We combine structure-sequence information and functional annotation from
91:  various data sources to create structural alignments for homologous enzymes
92:  and functional partitions therein.
93:  We develop a scoring procedure to predict SDRs and assess their accuracy
94:  using information from bound specific ligands and published literature.
95: \layout Section
96: 
97: Introduction
98: \layout Standard
99: 
100: Enzymes are critical to cellular machinery.
101:  Enzymes are believed to have developed different specificities following
102:  gene duplication events that ease the evolutionary pressure on copies and
103:  allow exploration of novel avenues to greater organismal fitness.
104:  Each copy then develops its own niche, characterized by expression and
105:  localization, catalytic mechanism, substrate specificity, cofactor dependence
106:  and catalysis products.
107:  Such paralogous enzymes should have an evolutionary imprint corresponding
108:  to their specific niche, in addition to maintenance of structural fold.
109:  Thus evolutionary analysis of available structural and sequence data should
110:  enable identification of key residues responsible for specificity of various
111:  kinds.
112:  Enzyme specificity can be estimated with functional assays without structure
113:  determination, but identification of SDRs (specificity determining residues)
114:  remains difficult.
115:  While ENZYME (
116: \begin_inset LatexCommand \citet{ENZYME}
117: 
118: \end_inset 
119: 
120: ) - a database of enzyme sequences with detailed functional annotation -
121:  exists, there is no such database of SDRs.
122:  Time, cost and technical limitations slow down structure determination
123:  and even when structure is known, it is not trivial to identify the residues
124:  important for binding cofactors and substrates.
125:  Hence it is important to be able to identify such residues computationally.
126:  Reliable detection of such residues will aid in deciding whether a SNP
127:  is deleterious or neutral and suggest mutation studies.
128:  Function assignment to sequence could be done at a finer level, e.g.
129:  by verifying that the SDRs necessary for a certain substrate are present.
130:  Computational SDR identification has received a lot of attention and several
131:  methods have been proposed.
132:  Evolutionary trace (ET) is one of the most important methods (
133: \begin_inset LatexCommand \citet{evoltraceStrClust}
134: 
135: \end_inset 
136: 
137: , 
138: \begin_inset LatexCommand \citet{EThybridMethods}
139: 
140: \end_inset 
141: 
142: ).
143:  It builds a phylogenetic tree based on sequence comparisons, such that
144:  branch lengths are indicative of evolutionary divergence.
145:  Functional subgroups consist of sequences in subtrees determined from this
146:  tree using a divergence cutoff.
147:  Residues common to a subtree, rather than those common to the entire tree,
148:  are considered specificity-conferring.
149:  Spatial cluster identification can be used with ET to reduce the number
150:  of false positives.
151:  Inferring phylogeny correctly remains the main cause of concern in this
152:  approach, hence attempts have been made to use existing annotation with
153:  various statistical techniques.
154:  Another important direction is to use spatial proximity of residues.
155: \layout Standard
156: 
157: The cornerstone of our approach is that structural environment influences residue
158:  substitution patterns, illustrated by 
159: \begin_inset LatexCommand \citet{earlyESST}
160: 
161: \end_inset 
162: 
163:  and later used effectively for structure-sequence alignment and fold recognitio
164: n (
165: \begin_inset LatexCommand \citet{fugue}
166: 
167: \end_inset 
168: 
169: ).
170:  Structural environment of a residue is described in terms of secondary
171:  structure, solvent accessibility, sidechain-sidechain and sidechain-mainchain
172:  hydrogen bonding.
173:  Residue substitution tables derived from a set of high quality sequence-structu
174: re alignments represent the expected substitution rate in a structural environme
175: nt.
176:  Unexpected conservation of a residue is indicative of functional restraint
177:  acting on it.
178:  The advantage of using environment specific substitution tables (ESSTs) is that the structurally conserved residues are
179:  masked, which is why active sites of homologous enzymes can be identified
180:  reliably with this approach.
181:  This approach has been extended in the present work by using functional
182:  annotation information.
183: \layout Standard
184: 
185: A set of homologous enzymes is generally a union of smaller functionally
186:  specific subsets, e.g.
187:  substrate-specific subsets in serine proteinases (trypsin, chymotrypsin
188:  etc.), cofactor-specific subsets in ferredoxin reductases (NAD and NADP
189:  specific) and so on.
190:  In multiple sequence alignment of a homologous protein family, SDRs generally
191:  appear as differentially conserved subcolumns.
192:  But not all such appearances would be SDRs.
193:  Our hypothesis is that SDRs would be identified by combining differential
194:  conservation with ESST-based detection of functional restraint.
195: \layout Section
196: 
197: Families, functional partitions and profiles
198: \layout Standard
199: 
200: In order to test our hypothesis, we need to construct a dataset of homologous
201:  enzyme families with reliable functional partitions in them.
202:  While SCOP classification can be used in a straightforward way for making
203:  families, identifying functionally specific subsets is not a trivial task.
204:  Some automated approaches to detect functional shift, e.g.
205:  
206: \begin_inset LatexCommand \citet{funshiftakker}
207: 
208: \end_inset 
209: 
210: , exist to infer such partitions but manual annotation remains the most
211:  reliable.
212:  Additionally, protein function is not a precise and quantifiable entity.
213:  This restricted our study to enzymes, which are the most well studied
214:  and well annotated class of proteins.
215:  Enzyme function is fairly well defined and well classified according to
216:  the hierarchical Enzyme Commission (EC) classification scheme.
217:  We use the mapping between SCOP domains and EC numbers (
218: \begin_inset LatexCommand \citet{scopec}
219: 
220: \end_inset 
221: 
222: ) to make EC-specific subgroups within a SCOP domain family.
223:  We generate profiles (multiple structure-sequence alignments) for SCOP
224:  families and functional partitions.
225:  Sequence homologs for structural families were found using PSIBLAST (
226: \begin_inset LatexCommand \citet{psiblast}
227: 
228: \end_inset 
229: 
230: ) on a nonredundant sequence database, whereas function-specific partitions
231:  were enriched using PSIBLAST searches on the ENZYME database (
232: \begin_inset LatexCommand \citet{ENZYME}
233: 
234: \end_inset 
235: 
236: ).
237:  A PSIBLAST hit on the ENZYME database is retained only if the EC number of the hit
238:  matches that of the query.
239:  All PSIBLAST searches were run for 5 rounds with an e-value cutoff of 0.01; hits shorter
240:  than 75% of the query length were ignored.
241:  All structure-sequence alignments were carried out with fugueseq (
242: \begin_inset LatexCommand \citet{fugue}
243: 
244: \end_inset 
245: 
246: ) which has been shown to improve alignment quality over PSIBLAST.
247:  This process is summarized in Fig.
248: \begin_inset LatexCommand \ref{workflow}
249: 
250: \end_inset 
251: 
252: .
253: \layout Standard
254: 
255: 
256: \begin_inset Float figure
257: wide false
258: collapsed false
259: 
260: \layout Caption
261: 
262: Workflow
263: \layout Standard
264: \align center 
265: 
266: \begin_inset Graphics
267: 	filename workflow.pdf
268: 	width 150mm
269: 
270: \end_inset 
271: 
272: 
273: \layout Standard
274: 
275: 
276: \begin_inset LatexCommand \label{workflow}
277: 
278: \end_inset 
279: 
280: 
281: \end_inset 
282: 
283: 
284: \layout Standard
285: 
286: Another constraint on the choice of dataset comes from the need for sufficient
287:  functional diversity in a SCOP domain family.
288:  In its absence, the contrast between the domain family and EC-specific
289:  subgroup within it might not be detectable.
290:  Hence we chose the SCOP families with at least two different EC annotations.
291: \layout Standard
292: 
293: To be able to test the hypothesis quantitatively, a gold standard set of
294:  SDRs for every enzyme is needed.
295:  But SDRs are generally a topic of lively debate among researchers, partly
296:  due to the infeasibility of performing all necessary mutation studies.
297:  Thus, to our knowledge, no such dataset exists.
298:  Hence we use the information of bound ligands and close-by residues to
299:  assess the hypothesis.
300:  Due to this, the dataset gets restricted to only those cases where at least
301:  one EC-specific domain group has a relevant ligand bound.
302:  A relevant ligand is the one unique to the reaction carried out by that
303:  EC-group among all possible reactions in that domain family.
304:  For example, in SCOP family c.1.10.4 there are two functional subgroups:
305: \layout Standard
306: 
307: 3-deoxy-8-phosphooctulonate synthase (EC 2.5.1.55) : Phosphoenolpyruvate +
308:  D-arabinose 5-phosphate + H(2)O = 2-dehydro-3-deoxy-D-octonate 8-phosphate
309:  + phosphate
310: \layout Standard
311: 
312: 3-deoxy-7-phosphoheptulonate synthase (EC 2.5.1.54) : Phosphoenolpyruvate +
313:  D-erythrose 4-phosphate + H(2)O = 3-deoxy-D-arabino-hept-2-ulosonate 7-phosphat
314: e + phosphate
315: \layout Standard
316: 
317: Here D-arabinose 5-phosphate is unique to EC 2.5.1.55 and is present in domain
318:  1fxqA as A5P.
319:  Hence it is taken as an indicator of SDR locations and not phosphoenolpyruvate
320:  which is common to both reactions.
321:  We sometimes use products also as such indicators.
322:  A ligand is considered relevant if its name from the PDB file (HETNAM, HETSYN
323:  records) matches its name in the reaction or PDBsum (
324: \begin_inset LatexCommand \citet{pdbsum}
325: 
326: \end_inset 
327: 
328: ) finds it sufficiently similar to the ideal ligand molecule.
329:  Our final dataset consists of 97 examples drawn from 68 families.
330:  Very few SDR identification studies have been carried out with this many examples.
331: \layout Section
332: 
333: Profiles and substitution patterns
334: \layout Standard
335: 
336: Structural and sequence information in an MSSA (multiple structure-sequence alignment) can be misleading if dominated
337:  by very close homologs, hence each MSSA was filtered with a 90% sequence
338:  identity cutoff to avoid redundancy.
339: \layout Standard
340: 
341: The observed substitution pattern for a column in a profile MSSA (multiple structure-sequence
342:  alignment) was calculated after down-weighting contributions from
343:  similar sequences (
344: \begin_inset Formula $>60\%$
345: \end_inset 
346: 
347:  sequence identity).
348:  Gaps were ignored while calculating the observed substitution pattern but
349:  the ratio of gaps to amino acids in a column was computed.
350:  Columns with high gap content are generally not functional hence gap content
351:  was used as a filtering criterion as described later.
352:  Observed substitution patterns were normalized, and sequence entropy was
353:  also calculated as a measure of variability in the column: 
354: \begin_inset Formula $-\sum_{i=1}^{20}f_{i}\log(f_{i})$
355: \end_inset 
356: 
357: , where 
358: \begin_inset Formula $f_{i}$
359: \end_inset 
360: 
361:  is the fraction of 
362: \begin_inset Formula $i^{th}$
363: \end_inset 
364: 
365:  amino acid in the distribution.
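\layout Standard

As an illustration, assuming natural logarithms (the base is not stated above), a fully conserved column has zero entropy, a column split evenly between two amino acids has entropy 
\begin_inset Formula $\log2\approx0.69$
\end_inset 

, and a column in which all 20 amino acids are equally frequent attains the maximum:
\begin_inset Formula \[
-\sum_{i=1}^{20}\frac{1}{20}\log\frac{1}{20}=\log20\approx3.00\]

\end_inset 

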
366: \layout Standard
367: 
368: Expected substitution patterns for a column were calculated using environment
369:  specific substitution probability tables derived from high quality multiple
370:  structure alignments from 371 families (
371: \begin_inset LatexCommand \citet{fugue}
372: 
373: \end_inset 
374: 
375: ).
376:  Substitution probabilities from every structure were averaged to get expected
377:  substitution probabilities for each column in MSSA.
378:  Again, sequence-based clustering was used to avoid expected substitution
379:  pattern getting dominated by very similar structures.
380: \layout Standard
381: 
382: Functional restraint is calculated as the city-block distance between normalized
383:  observed and expected substitution patterns (
384: \begin_inset Formula $\sum_{i=1}^{20}|o_{i}-e_{i}|$
385: \end_inset 
386: 
387: , 
388: \begin_inset Formula $o_{i}$
389: \end_inset 
390: 
391:  being observed fraction of 
392: \begin_inset Formula $i^{th}$
393: \end_inset 
394: 
395:  amino acid and 
396: \begin_inset Formula $e_{i}$
397: \end_inset 
398: 
399:  being the fraction of times it is expected to occur).
400:  Thus, for both MSSAs (whole family and EC-specific) we have the following
401:  quantities : functional restraint (
402: \begin_inset Formula $famF,ecF$
403: \end_inset 
404: 
405: ), gap content (
406: \begin_inset Formula $famG,ecG$
407: \end_inset 
408: 
409: ) and sequence entropy (
410: \begin_inset Formula $famE,ecE$
411: \end_inset 
412: 
413: ).
414:  Moreover, for each MSSA, the number of sequences 
415: \begin_inset Formula $<80\%$
416: \end_inset 
417: 
418:  identical to each other was taken as an indicator of evolutionary information
419:  available in it.
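\layout Standard

As a worked example with hypothetical numbers, consider a column in which a single amino acid is fully conserved (
\begin_inset Formula $o_{1}=1$
\end_inset 

) although the substitution tables expect it only 40% of the time (
\begin_inset Formula $e_{1}=0.4$
\end_inset 

), the remaining expected probability of 0.6 being spread over the other 19 amino acids.
 Taking absolute differences, the city-block distance is
\begin_inset Formula \[
|1-0.4|+\sum_{i=2}^{20}|0-e_{i}|=0.6+0.6=1.2\]

\end_inset 

 Because both patterns are normalized, this distance always lies between 0 (observed equals expected) and 2 (no amino acid shared between the two patterns).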
420: \layout Section
421: 
422: Benchmarking
423: \layout Standard
424: 
425: In order to assess the differences in residues important for whole family
426:  and EC partition, baseline predictions were made by choosing top-ranking
427:  residues according to whole family functional constraint from residues
428:  which are not highly gapped (
429: \begin_inset Formula $famG<0.5$
430: \end_inset 
431: 
432: ).
433:  The number of baseline and SDR predictions is the same whenever they are compared
434:  or an overlap between them is computed.
435:  This helps in assessing whether information in the EC-specific MSSA is
436:  distinct.
437: \layout Standard
438: 
439: The likelihood of a residue to be an SDR is presumably proportional to its
440:  proximity to the specific ligand.
441:  Hence, to quantify the merit of a prediction, we defined mean proximity
442:  as the mean separation between predicted residues and ligand.
443:  Mean relative proximity is defined as the ratio of mean proximity to the
444:  mean separation between all residues in the domain and the ligand.
445:  Distance between a residue and ligand is taken to be the closest distance
446:  between residue sidechain (mainchain for glycine) and ligand atoms.
447:  The smaller the mean relative proximity, the better the prediction.
448:  Prediction quality will also depend on the number of distinct homologous
449:  sequences available.
450:  In case of multiple ligands close to a domain, a residue's proximity to
451:  the ligand is calculated with respect to the closest ligand.
452:  The basis for SDR prediction is that the residue's substitution pattern be sufficiently distinct between
453:  whole family and EC-specific MSSAs.
454:  As 
455: \begin_inset LatexCommand \citet{funshiftakker}
456: 
457: \end_inset 
458: 
459:  describe it, an SDR should be a rate-shifted or conservation-shifted site.
460:  Additionally, an SDR should be sufficiently functionally constrained from
461:  the ESST perspective (
462: \begin_inset Formula $ecF$
463: \end_inset 
464: 
465: ).
466:  For a residue with low entropy in EC MSSA, if change in entropy 
467: \begin_inset Formula $dE$
468: \end_inset 
469: 
470:  (family MSSA sequence entropy - EC MSSA sequence entropy) is high, it indicates
471:  that it could be an SDR.
472:  Since each MSSA will be different in its variability, it is not advisable
473:  to use the same functional constraint cutoff or entropy cutoff for all of them.
474:  This immediately suggests two 2-step approaches : choose top 
475: \begin_inset Formula $N1$
476: \end_inset 
477: 
478:  residues with high difference in sequence entropy between whole and EC MSSAs,
479:  then select top 
480: \begin_inset Formula $N2$
481: \end_inset 
482: 
483:  according to functional constraint in EC MSSA and vice versa.
484:  But there could be a third and more attractive approach that combines functiona
485: l constraint from EC MSSA and sequence entropy difference.
486:  We pursue the third approach.
487: \layout Standard
488: 
489: We assume that the SDR score of a residue is a linear combination of its functional
490:  constraint, entropy and change in entropy, given that the residue passes
491:  certain quality checks (
492: \begin_inset Formula $ecF>0.5$
493: \end_inset 
494: 
495: , 
496: \begin_inset Formula $ecG<0.5$
497: \end_inset 
498: 
499: , 
500: \begin_inset Formula $ecE<1$
501: \end_inset 
502: 
503: , 
504: \begin_inset Formula $dE>0.5$
505: \end_inset 
506: 
507: ):
508: \layout Standard
509: 
510: 
511: \begin_inset Formula $SDRscore=ecF+a*(famE-ecE)-b*ecE$
512: \end_inset 
513: 
514: 
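\layout Standard

As a worked illustration with hypothetical values (not taken from the dataset), consider a residue with 
\begin_inset Formula $ecF=1.2$
\end_inset 

, 
\begin_inset Formula $famE=1.5$
\end_inset 

, 
\begin_inset Formula $ecE=0.2$
\end_inset 

 and a gap content below the cutoff.
 It passes the quality checks, since 
\begin_inset Formula $dE=1.3$
\end_inset 

, and for the assumed weights 
\begin_inset Formula $a=0.4,b=1.2$
\end_inset 

 its score is
\begin_inset Formula \[
SDRscore=1.2+0.4\times(1.5-0.2)-1.2\times0.2=1.48\]

\end_inset 

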
515: \layout Standard
516: 
517: In order to optimize the parameters 
518: \begin_inset Formula $a,b$
519: \end_inset 
520: 
521:  and test the optimal ones, we created a high quality test set from our
522:  examples, consisting of 23 examples drawn from SCOP families with at least
523:  2 EC groups, each with 
524: \begin_inset Formula $>10$
525: \end_inset 
526: 
527:  distinct sequence homologs from the ENZYME database.
528:  Parameters 
529: \begin_inset Formula $a,b$
530: \end_inset 
531: 
532:  were varied from 0 to 5 in steps of 
533: \begin_inset Formula $0.2$
534: \end_inset 
535: 
536: .
537:  For each value of 
538: \begin_inset Formula $a$
539: \end_inset 
540: 
541:  and 
542: \begin_inset Formula $b$
543: \end_inset 
544: 
545: , SDR and baseline predictions are made, each consisting of 10 residues.
546:  Note that baseline predictions are not affected by values of 
547: \begin_inset Formula $a,b$
548: \end_inset 
549: 
550: .
551:  Optimization can be done with two objectives, either to minimize the mean
552:  proximity or to maximize the number of close (
553: \begin_inset Formula $<$
554: \end_inset 
555: 
556: 
557: \begin_inset ERT
558: status Collapsed
559: 
560: \layout Standard
561: 
562: \backslash 
563: Ang{6}
564: \end_inset 
565: 
566: ) residues.
567:  
568: \begin_inset Formula $a,b$
569: \end_inset 
570: 
571:  values of 
572: \begin_inset Formula $0.4,1.2$
573: \end_inset 
574: 
575:  minimize the former objective to 
576: \begin_inset ERT
577: status Collapsed
578: 
579: \layout Standard
580: 
581: \backslash 
582: Ang{9.24}
583: \end_inset 
584: 
585:  and yield 
586: \begin_inset Formula $3.6$
587: \end_inset 
588: 
589:  close residues per prediction, whereas 
590: \begin_inset Formula $0,0.8$
591: \end_inset 
592: 
593:  maximize the latter to 
594: \begin_inset Formula $4.08$
595: \end_inset 
596: 
597:  residues while yielding 
598: \begin_inset ERT
599: status Collapsed
600: 
601: \layout Standard
602: 
603: \backslash 
604: Ang{9.36}
605: \end_inset 
606: 
607:  for the former.
608:  Performance of these two 
609: \begin_inset Formula $a,b$
610: \end_inset 
611: 
612:  values on different sets of examples is shown in Table 
613: \begin_inset LatexCommand \ref{evolABperf}
614: 
615: \end_inset 
616: 
617: .
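\layout Standard

For orientation, and assuming that both end points of the range are included, a step of 0.2 from 0 to 5 gives 26 values per parameter, so the number of 
\begin_inset Formula $a,b$
\end_inset 

 combinations evaluated in such a search would be
\begin_inset Formula \[
\left(\frac{5-0}{0.2}+1\right)^{2}=26^{2}=676\]

\end_inset 

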
618: \layout Standard
619: 
620: 
621: \begin_inset Float table
622: wide false
623: collapsed true
624: 
625: \layout Caption
626: 
627: Performance of the two optimal settings of a and b for various levels of evolutionary information
628:  available.
629: \layout Standard
630: \align center 
631: 
632: \begin_inset  Tabular
633: <lyxtabular version="3" rows="8" columns="5">
634: <features>
635: <column alignment="center" valignment="top" leftline="true" width="0">
636: <column alignment="block" valignment="top" leftline="true" width="1in">
637: <column alignment="block" valignment="top" leftline="true" width="1in">
638: <column alignment="block" valignment="top" leftline="true" width="1in">
639: <column alignment="block" valignment="top" leftline="true" rightline="true" width="1in">
640: <row topline="true">
641: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
642: \begin_inset Text
643: 
644: \layout Standard
645: 
646: Criteria for
647: \end_inset 
648: </cell>
649: <cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
650: \begin_inset Text
651: 
652: \layout Standard
653: 
654: Mean proximity
655: \end_inset 
656: </cell>
657: <cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
658: \begin_inset Text
659: 
660: \layout Standard
661: 
662: \end_inset 
663: </cell>
664: <cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
665: \begin_inset Text
666: 
667: \layout Standard
668: 
669: #close (<
670: \begin_inset ERT
671: status Collapsed
672: 
673: \layout Standard
674: 
675: \backslash 
676: Ang{6}
677: \end_inset 
678: 
679: ) residues
680: \end_inset 
681: </cell>
682: <cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
683: \begin_inset Text
684: 
685: \layout Standard
686: 
687: \end_inset 
688: </cell>
689: </row>
690: <row bottomline="true">
691: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
692: \begin_inset Text
693: 
694: \layout Standard
695: 
696: choice of examples
697: \end_inset 
698: </cell>
699: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
700: \begin_inset Text
701: 
702: \layout Standard
703: 
704: (0,0.8)
705: \end_inset 
706: </cell>
707: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
708: \begin_inset Text
709: 
710: \layout Standard
711: 
712: (0.4,1.2)
713: \end_inset 
714: </cell>
715: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
716: \begin_inset Text
717: 
718: \layout Standard
719: 
720: (0,0.8)
721: \end_inset 
722: </cell>
723: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
724: \begin_inset Text
725: 
726: \layout Standard
727: 
728: (0.4,1.2)
729: \end_inset 
730: </cell>
731: </row>
732: <row topline="true">
733: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
734: \begin_inset Text
735: 
736: \layout Standard
737: 
738: >5 homologs
739: \end_inset 
740: </cell>
741: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
742: \begin_inset Text
743: 
744: \layout Standard
745: 
746: 10.84
747: \end_inset 
748: </cell>
749: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
750: \begin_inset Text
751: 
752: \layout Standard
753: 
754: 11.24
755: \end_inset 
756: </cell>
757: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
758: \begin_inset Text
759: 
760: \layout Standard
761: 
762: 3.35
763: \end_inset 
764: </cell>
765: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
766: \begin_inset Text
767: 
768: \layout Standard
769: 
770: 3.01
771: \end_inset 
772: </cell>
773: </row>
774: <row>
775: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
776: \begin_inset Text
777: 
778: \layout Standard
779: 
780: (67 examples)
781: \end_inset 
782: </cell>
783: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
784: \begin_inset Text
785: 
786: \layout Standard
787: 
788: \end_inset 
789: </cell>
790: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
791: \begin_inset Text
792: 
793: \layout Standard
794: 
795: \end_inset 
796: </cell>
797: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
798: \begin_inset Text
799: 
800: \layout Standard
801: 
802: \end_inset 
803: </cell>
804: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
805: \begin_inset Text
806: 
807: \layout Standard
808: 
809: \end_inset 
810: </cell>
811: </row>
812: <row topline="true">
813: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
814: \begin_inset Text
815: 
816: \layout Standard
817: 
818: >10 homologs
819: \end_inset 
820: </cell>
821: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
822: \begin_inset Text
823: 
824: \layout Standard
825: 
826: 10.41
827: \end_inset 
828: </cell>
829: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
830: \begin_inset Text
831: 
832: \layout Standard
833: 
834: 10.64
835: \end_inset 
836: </cell>
837: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
838: \begin_inset Text
839: 
840: \layout Standard
841: 
842: 3.45
843: \end_inset 
844: </cell>
845: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
846: \begin_inset Text
847: 
848: \layout Standard
849: 
850: 3.2
851: \end_inset 
852: </cell>
853: </row>
854: <row>
855: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
856: \begin_inset Text
857: 
858: \layout Standard
859: 
860: (55 examples)
861: \end_inset 
862: </cell>
863: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
864: \begin_inset Text
865: 
866: \layout Standard
867: 
868: \end_inset 
869: </cell>
870: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
871: \begin_inset Text
872: 
873: \layout Standard
874: 
875: \end_inset 
876: </cell>
877: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
878: \begin_inset Text
879: 
880: \layout Standard
881: 
882: \end_inset 
883: </cell>
884: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
885: \begin_inset Text
886: 
887: \layout Standard
888: 
889: \end_inset 
890: </cell>
891: </row>
892: <row topline="true">
893: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
894: \begin_inset Text
895: 
896: \layout Standard
897: 
898: >10 homologs, >1 EC
899: \end_inset 
900: </cell>
901: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
902: \begin_inset Text
903: 
904: \layout Standard
905: 
906: 9.36
907: \end_inset 
908: </cell>
909: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
910: \begin_inset Text
911: 
912: \layout Standard
913: 
914: 9.24
915: \end_inset 
916: </cell>
917: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
918: \begin_inset Text
919: 
920: \layout Standard
921: 
922: 4.08
923: \end_inset 
924: </cell>
925: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
926: \begin_inset Text
927: 
928: \layout Standard
929: 
930: 3.6
931: \end_inset 
932: </cell>
933: </row>
934: <row bottomline="true">
935: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
936: \begin_inset Text
937: 
938: \layout Standard
939: 
940: (23 examples)
941: \end_inset 
942: </cell>
943: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
944: \begin_inset Text
945: 
946: \layout Standard
947: 
948: \end_inset 
949: </cell>
950: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
951: \begin_inset Text
952: 
953: \layout Standard
954: 
955: \end_inset 
956: </cell>
957: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
958: \begin_inset Text
959: 
960: \layout Standard
961: 
962: \end_inset 
963: </cell>
964: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
965: \begin_inset Text
966: 
967: \layout Standard
968: 
969: \end_inset 
970: </cell>
971: </row>
972: </lyxtabular>
973: 
974: \end_inset 
975: 
976: 
977: \layout Standard
978: 
979: 
980: \begin_inset LatexCommand \label{evolABperf}
981: 
982: \end_inset 
983: 
984: 
985: \end_inset 
986: 
987: 
988: \layout Standard
989: 
990: This suggests that optimal 
991: \begin_inset Formula $a,b$
992: \end_inset 
993: 
994:  parameters are 
995: \begin_inset Formula $0,0.8$
996: \end_inset 
997: 
998: .
999:  It is surprising that the value of 
1000: \begin_inset Formula $dE=famE-ecE$
1001: \end_inset 
1002: 
1003:  carries no weight in the SDR score.
1004:  Perhaps this is due to the quality checks applied prior to calculation
1005:  of SDR scores, which demand 
1006: \begin_inset Formula $dE>0.5$
1007: \end_inset 
1008: 
1009: .
1010: \layout Standard
1011: 
1012: Fig.
1013: \begin_inset LatexCommand \ref{proxDistrib}
1014: 
1015: \end_inset 
1016: 
1017:  shows the distribution of mean proximity in various sets derived according
1018:  to number of distinct homologs in ENZYME.
1019:  This shows that the quality of evolutionary information available has a great
1020:  impact on the quality of predictions.
1021: \layout Standard
1022: 
1023: 
1024: \begin_inset Float figure
1025: wide false
1026: collapsed false
1027: 
1028: \layout Caption
1029: 
1030: Frequency of observing a certain mean proximity of SDR predictions (binned
1031:  in 
1032: \begin_inset ERT
1033: status Collapsed
1034: 
1035: \layout Standard
1036: 
1037: \backslash 
1038: Ang{1}
1039: \end_inset 
1040: 
1041:  bins) for different qualities of evolutionary information available.
1042: \layout Standard
1043: \align center 
1044: 
1045: \begin_inset Graphics
1046: 	filename proxDistrib.pdf
1047: 	width 150mm
1048: 
1049: \end_inset 
1050: 
1051: 
1052: \layout Standard
1053: 
1054: 
1055: \begin_inset LatexCommand \label{proxDistrib}
1056: 
1057: \end_inset 
1058: 
1059: 
1060: \end_inset 
1061: 
1062: 
1063: \layout Standard
1064: 
1065: Mean relative proximity indicates how far the prediction is from random.
1066:  Table 
1067: \begin_inset LatexCommand \ref{meanRelProxTable}
1068: 
1069: \end_inset 
1070: 
1071:  shows that mean relative proximity depends on the quality of evolutionary
1072:  information and is far from random for both SDR and baseline predictions.
1073: \layout Standard
1074: 
1075: 
1076: \begin_inset Float table
1077: wide false
1078: collapsed false
1079: 
1080: \layout Caption
1081: 
1082: Mean relative proximity in datasets defined by the number of
1083:  available distinct homologs.
1084: \layout Standard
1085: \align center 
1086: 
1087: \begin_inset  Tabular
1088: <lyxtabular version="3" rows="5" columns="4">
1089: <features>
1090: <column alignment="center" valignment="top" leftline="true" width="0">
1091: <column alignment="center" valignment="top" leftline="true" width="0">
1092: <column alignment="center" valignment="top" leftline="true" width="0">
1093: <column alignment="center" valignment="top" leftline="true" rightline="true" width="0">
1094: <row topline="true">
1095: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1096: \begin_inset Text
1097: 
1098: \layout Standard
1099: 
1100: Dataset
1101: \end_inset 
1102: </cell>
1103: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1104: \begin_inset Text
1105: 
1106: \layout Standard
1107: 
1108: Mean Rel.
1109:  Prox. (SDR)
1110: \end_inset 
1111: </cell>
1112: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1113: \begin_inset Text
1114: 
1115: \layout Standard
1116: 
1117: Mean Rel.
1118:  Prox. (baseline)
1119: \end_inset 
1120: </cell>
1121: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1122: \begin_inset Text
1123: 
1124: \layout Standard
1125: 
1126: Frequency of
1127: \end_inset 
1128: </cell>
1129: </row>
1130: <row>
1131: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1132: \begin_inset Text
1133: 
1134: \layout Standard
1135: 
1136: \end_inset 
1137: </cell>
1138: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1139: \begin_inset Text
1140: 
1141: \layout Standard
1142: 
1143: \end_inset 
1144: </cell>
1145: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1146: \begin_inset Text
1147: 
1148: \layout Standard
1149: 
1150: \end_inset 
1151: </cell>
1152: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1153: \begin_inset Text
1154: 
1155: \layout Standard
1156: 
1157: MRP(SDR) 
1158: \begin_inset Formula $\leq$
1159: \end_inset 
1160: 
1161:  MRP(baseline)
1162: \end_inset 
1163: </cell>
1164: </row>
1165: <row topline="true">
1166: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1167: \begin_inset Text
1168: 
1169: \layout Standard
1170: 
1171: >0 homologs
1172: \end_inset 
1173: </cell>
1174: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1175: \begin_inset Text
1176: 
1177: \layout Standard
1178: 
1179: 0.67
1180: \end_inset 
1181: </cell>
1182: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1183: \begin_inset Text
1184: 
1185: \layout Standard
1186: 
1187: 0.66
1188: \end_inset 
1189: </cell>
1190: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1191: \begin_inset Text
1192: 
1193: \layout Standard
1194: 
1195: 34% (33/97)
1196: \end_inset 
1197: </cell>
1198: </row>
1199: <row topline="true">
1200: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1201: \begin_inset Text
1202: 
1203: \layout Standard
1204: 
1205: >5 homologs
1206: \end_inset 
1207: </cell>
1208: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1209: \begin_inset Text
1210: 
1211: \layout Standard
1212: 
1213: 0.57
1214: \end_inset 
1215: </cell>
1216: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1217: \begin_inset Text
1218: 
1219: \layout Standard
1220: 
1221: 0.66
1222: \end_inset 
1223: </cell>
1224: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1225: \begin_inset Text
1226: 
1227: \layout Standard
1228: 
1229: 60% (40/67)
1230: \end_inset 
1231: </cell>
1232: </row>
1233: <row topline="true" bottomline="true">
1234: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1235: \begin_inset Text
1236: 
1237: \layout Standard
1238: 
1239: >10 homologs
1240: \end_inset 
1241: </cell>
1242: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1243: \begin_inset Text
1244: 
1245: \layout Standard
1246: 
1247: 0.57
1248: \end_inset 
1249: </cell>
1250: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">
1251: \begin_inset Text
1252: 
1253: \layout Standard
1254: 
1255: 0.62
1256: \end_inset 
1257: </cell>
1258: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">
1259: \begin_inset Text
1260: 
1261: \layout Standard
1262: 
1263: 85% (47/55)
1264: \end_inset 
1265: </cell>
1266: </row>
1267: </lyxtabular>
1268: 
1269: \end_inset 
1270: 
1271: 
1272: \layout Standard
1273: 
1274: 
1275: \begin_inset LatexCommand \label{meanRelProxTable}
1276: 
1277: \end_inset 
1278: 
1279: 
1280: \end_inset 
1281: 
1282: 
1283: \layout Standard
1284: 
1285: The fraction of SDRs present in baseline predictions is 
1286: \begin_inset Formula $15\%$
1287: \end_inset 
1288: 
1289:  in all 
1290: \begin_inset Formula $>0,>5,>10$
1291: \end_inset 
1292: 
1293:  homolog classes, which suggests that SDR predictions are largely distinct
1294:  from baseline predictions.
1295:  This also suggests that baseline and SDR predictions are complementary
1296:  to each other.
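\layout Standard

The overlap between two ranked prediction lists can be computed in a few lines. The following is only a minimal sketch: the function name top_n_overlap and the residue labels are hypothetical placeholders, not the code or the data used in this work, and it assumes that each method returns residues ordered from best to worst score.

```python
def top_n_overlap(sdr_ranked, baseline_ranked, n=10):
    """Compare the n best-ranked residues of two prediction lists.

    Returns the residues of the top-n SDR list that also occur in the top-n
    baseline list, and the fraction of the top-n SDR list they represent.
    """
    top_sdr = sdr_ranked[:n]
    top_baseline = set(baseline_ranked[:n])
    # Keep the SDR ranking order for the shared residues.
    shared = [res for res in top_sdr if res in top_baseline]
    fraction = len(shared) / len(top_sdr) if top_sdr else 0.0
    return shared, fraction


# Placeholder residue labels (assumption), not real predictions.
sdr = ["r01", "r02", "r03", "r04", "r05", "r06", "r07", "r08", "r09", "r10"]
baseline = ["r02", "r04", "r06", "r08", "r10", "b01", "b02", "b03", "b04", "b05"]

shared, fraction = top_n_overlap(sdr, baseline, n=10)
print(shared, fraction)
```

With the placeholder lists above, five of the ten labels are shared, so the reported fraction is 0.5. A low fraction means that the two methods mostly pick different residues.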
1297: \layout Section
1298: 
1299: Some examples
1300: \layout Standard
1301: 
1302: When good-quality sequence information is available, SDR predictions are closer
1303:  to the specific ligand than baseline predictions, which in turn are closer than
1304:  random.
1305:  Here we compare our Top10 predictions with information from the literature
1306:  for some examples.
1307: \layout Subsection
1308: 
1309: Aminotransferases
1310: \layout Standard
1311: 
1312: Aminotransferases, or transaminases, are important to amino acid biosynthesis
1313:  and unique in their specificity for two substrates: a glutamate and
1314:  an amino-carrier.
1315:  Our dataset contains two SCOP families (c.67.1.1 and c.67.1.4) that contain
1316:  transaminases.
1317:  Of those, we focus on SCOP family c.67.1.1, which contains the functional
1318:  categories aspartate transaminase (AspAT, EC 2.6.1.1) and histidinol phosphate transaminase
1319:  (HspAT, EC 2.6.1.9).
1320:  Non-transaminase members of this family include threonine aldolases
1321:  (EC 4.1.2.5) and alliin lyase (EC 4.4.1.4).
1322:  When Top10 predictions were analyzed in 1gex, an HspAT, we found that SDR
1323:  predictions cluster very well around the ligands PLP and HSP, but
1324:  5 of the 10 predictions were shared with Top10 baseline predictions.
1325:  This overlap can be attributed to the degree of functional diversity in the
1326:  SCOP family, i.e.
1327:  a large entropy reduction in HspAT residues could reflect their importance
1328:  to the general transaminase mechanism (as opposed to the aldolase mechanism) or
1329:  to substrate specificity for histidinol phosphate (as opposed to aspartate
1330:  in AspATs).
1331:  To increase the number of distinct predictions, Top20 baseline
1332:  and SDR predictions were used.
1333:  Fig.
1334: \begin_inset LatexCommand \ref{figTransaminase}
1335: 
1336: \end_inset 
1337: 
1338:  shows the predictions for 1gexA, an HspAT from E.
1339:  coli; 7 predictions are common to both sets.
1340:  Catalytically important residues (
1341: \begin_inset LatexCommand \citet{Haruyama2001}
1342: 
1343: \end_inset 
1344: 
1345: ) Asn-157, Tyr-187 and Lys-214 are identified as baseline, SDR and common,
1346:  respectively.
1347:  Tyr-55, which interacts with the substrate of the other subunit, is predicted
1348:  as an SDR
1349: \begin_inset Foot
1350: collapsed true
1351: 
1352: \layout Standard
1353: 
1354: This is confirmed by a similar prediction in 1gc4, an AspAT.
1355: \end_inset 
1356: 
1357: .
1358:  Tyr-20, believed to be important for specificity, is not predicted as such
1359:  because it is conserved in only 80% of cases, whereas the similarly placed Tyr-55
1360:  from the other subunit is much better conserved (98% of cases) and could be equally
1361:  important for specificity.
1362:  Ala-186, considered important for restricting rotation of PLP's pyridine
1363:  ring and thereby contributing to the strain essential for enzyme function,
1364:  is predicted as both SDR and baseline.
1365:  Most other predicted SDRs lie close to the substrate.
1366:  Their location and AspAT counterparts suggest a role in conferring
1367:  specificity towards histidinol phosphate (see Table 
1368: \begin_inset LatexCommand \ref{transaminaseTable}
1369: 
1370: \end_inset 
1371: 
1372: ).
1373: \layout Standard
1374: 
1375: 
1376: \begin_inset Float table
1377: wide false
1378: collapsed false
1379: 
1380: \layout Caption
1381: 
1382: Speculated roles of residues according to 
1383: \begin_inset LatexCommand \citet{Haruyama2001}
1384: 
1385: \end_inset 
1386: 
1387:  for HspAT 1gex and how well they were predicted.
1388:  The aligned residues in other subfamilies with transaminases are also shown.
1389: \layout Standard
1390: \align center 
1391: 
1392: \begin_inset Graphics
1393: 	filename transaminaseTable.pdf
1394: 	width 150mm
1395: 
1396: \end_inset 
1397: 
1398: 
1399: \layout Standard
1400: 
1401: 
1402: \begin_inset LatexCommand \label{transaminaseTable}
1403: 
1404: \end_inset 
1405: 
1406: 
1407: \end_inset 
1408: 
1409: 
1410: \layout Standard
1411: 
1412: 
1413: \begin_inset Float figure
1414: wide false
1415: collapsed false
1416: 
1417: \layout Caption
1418: 
1419: SDR (green) and functional residue (red) predictions for 1gex, an HspAT.
1420:  Residues predicted as both functional and specificity-conferring are colored
1421:  blue.
1422:  The top left panel shows Top5 predictions, the top right panel shows Top10 predictions
1423:  and the bottom panel zooms in on the region around the ligand in the Top10 case.
1424: \layout Standard
1425: \align center 
1426: 
1427: \begin_inset Graphics
1428: 	filename transaminaseFig.jpg
1429: 	width 150mm
1430: 
1431: \end_inset 
1432: 
1433: 
1434: \layout Standard
1435: 
1436: 
1437: \begin_inset LatexCommand \label{figTransaminase}
1438: 
1439: \end_inset 
1440: 
1441: 
1442: \end_inset 
1443: 
1444: 
1445: \layout Subsection
1446: 
1447: Phosphoric monoester hydrolases
1448: \layout Standard
1449: 
1450: SCOP family e.7.1.1 in our dataset contains 4 classes of phosphoric monoester
1451:  hydrolases: 3'(2'),5'-bisphosphate nucleotidase (EC 3.1.3.7), fructose-bisphosphatase
1452:  (EC 3.1.3.11), inositol-phosphate phosphatase (EC 3.1.3.25) and
1453:  inositol-1,4-bisphosphate 1-phosphatase (EC 3.1.3.57).
1454:  Here we look at the SDR and baseline predictions for 1cnq, a member of
1455:  the FBPase category.
1456:  FBPases are of key importance to the regulation of the gluconeogenic pathway and
1457:  catalyze the hydrolysis of fructose 1,6-bisphosphate to fructose 6-phosphate.
1458:  They are metal dependent and are allosterically controlled by AMP, which
1459:  triggers a conformational change and masks the fructose active site.
1460:  Fig.
1461: \begin_inset LatexCommand \ref{figFBPase}
1462: 
1463: \end_inset 
1464: 
1465:  shows the Top10 baseline and SDR predictions, the overlap in this case
1466:  being 2 residues.
1467:  The F6P molecule around which most predictions are clustered lies in the active
1468:  site, whereas the other F6P molecule is located similarly to AMP (from comparison
1469:  with PDB 1yyz).
1470:  Baseline predictions Tyr-279, Glu-280, Tyr-244, Met-244 and common prediction
1471:  Tyr-264 are within interacting distance of the F6P ligand in the active site.
1472:  Most predicted SDRs form the active site walls and differ between FBPase
1473:  and IMPase (1awb): Arg-276 to His, Ser-96 to Gly, Ser-123 to Thr, Ser-124
1474:  to Thr (see Table 
1475: \begin_inset LatexCommand \ref{FBPaseTable}
1476: 
1477: \end_inset 
1478: 
1479: ).
1480:  It is surprising that the allosteric site is only weakly detected.
1481:  Predictions Ala-161 (Top10 SDR), Lys-290 (Top10 baseline) and Val-178 (Top20
1482:  SDR) lie close to it and suggest some role in AMP binding.
1483: \layout Standard
1484: 
1485: 
1486: \begin_inset Float table
1487: wide false
1488: collapsed false
1489: 
1490: \layout Caption
1491: 
1492: Speculated roles of residues in FBPase for 1cnq from the literature and how
1493:  well they were predicted.
1494:  Aligned residues in other subfamilies of hydrolases are also shown.
1495: \layout Standard
1496: \align center 
1497: 
1498: \begin_inset Graphics
1499: 	filename FBPaseTable.pdf
1500: 	width 150mm
1501: 
1502: \end_inset 
1503: 
1504: 
1505: \layout Standard
1506: 
1507: 
1508: \begin_inset LatexCommand \label{FBPaseTable}
1509: 
1510: \end_inset 
1511: 
1512: 
1513: \end_inset 
1514: 
1515: 
1516: \layout Standard
1517: 
1518: 
1519: \begin_inset Float figure
1520: wide false
1521: collapsed false
1522: 
1523: \layout Caption
1524: 
1525: SDR and functional residue predictions for 1cnq, an FBPase.
1526:  Residue-coloring scheme same as Fig.
1527: \begin_inset LatexCommand \ref{figTransaminase}
1528: 
1529: \end_inset 
1530: 
1531: .
1532:  The bottom panel is a closer view of the region around the ligand in the top
1533:  panel.
1534: \layout Standard
1535: \align center 
1536: 
1537: \begin_inset Graphics
1538: 	filename figFBPase.jpg
1539: 	width 100mm
1540: 
1541: \end_inset 
1542: 
1543: 
1544: \layout Standard
1545: 
1546: 
1547: \begin_inset LatexCommand \label{figFBPase}
1548: 
1549: \end_inset 
1550: 
1551: 
1552: \end_inset 
1553: 
1554: 
1555: \layout Subsection
1556: 
1557: Dehydrogenases
1558: \layout Standard
1559: 
1560: L-3-hydroxyacyl-CoA dehydrogenase (HAD, EC 1.1.1.35) is the penultimate enzyme
1561:  in the beta-oxidation spiral and catalyzes conversion of a hydroxy group to a keto
1562:  group while converting NAD+ to NADH.
1563:  It consists of NAD-binding and C-terminal domains, which undergo relative
1564:  movement between NAD binding and substrate binding events (
1565: \begin_inset LatexCommand \citet{activesiteSequestration}
1566: 
1567: \end_inset 
1568: 
1569: ).
1570:  Its SCOP family is c.2.1.6, whose other members are NAD/NADP-dependent
1571:  dehydrogenases (ECs 1.1.1.8, 1.1.1.22, 1.1.1.44).
1572:  HAD is represented in our dataset by the NAD-binding domain of 1f0y (residues
1573:  A-12 to A-203).
1574:  Fig.
1575: \begin_inset LatexCommand \ref{figHAD}
1576: 
1577: \end_inset 
1578: 
1579:  shows Top10 baseline and SDR predictions.
1580:  The catalytically important pair Glu-170 and His-158 is identified as SDRs.
1581:  Ser-137, interesting because it contacts both the substrate and NAD,
1582:  is also identified as an SDR.
1583:  With the exceptions of Leu-122, Ala-35 (baseline) and Gly-29, Ala-107 (SDR),
1584:  all other predictions are within interacting distance of either NAD or
1585:  the substrate.
1586:  Ser-61 and Lys-68 are not detected due to their high entropy.
1587: \layout Standard
1588: 
1589: 
1590: \begin_inset Float figure
1591: wide false
1592: collapsed false
1593: 
1594: \layout Caption
1595: 
1596: SDR and functional residue predictions for 1f0y, a HAD.
1597:  Residue-coloring scheme same as Fig.
1598: \begin_inset LatexCommand \ref{figTransaminase}
1599: 
1600: \end_inset 
1601: 
1602: .
1603: \layout Standard
1604: \align center 
1605: 
1606: \begin_inset Graphics
1607: 	filename figHAD.jpg
1608: 	width 100mm
1609: 
1610: \end_inset 
1611: 
1612: 
1613: \layout Standard
1614: 
1615: 
1616: \begin_inset LatexCommand \label{figHAD}
1617: 
1618: \end_inset 
1619: 
1620: 
1621: \end_inset 
1622: 
1623: 
1624: \layout Subsection
1625: 
1626: Tryptophan biosynthesis enzymes
1627: \layout Standard
1628: 
1629: Phosphoribosylanthranilate (PRA) isomerase (TrpF) is a 
1630: \begin_inset Formula $(\beta\alpha)_{8}$
1631: \end_inset 
1632: 
1633:  barrel enzyme; this is the most common fold adopted by enzymes and is also common
1634:  among non-enzymes.
1635:  TrpF (EC 5.3.1.24) shares its SCOP family (c.1.2.4) with indole-3-glycerol-phosphate
1636:  synthase (EC 4.1.1.48) and tryptophan synthase (EC 4.2.1.20), all of which are involved
1637:  in Trp biosynthesis.
1638:  Top10 baseline and SDR predictions are shown in Fig.
1639: \begin_inset LatexCommand \ref{figTRPF}
1640: 
1641: \end_inset 
1642: 
1643: .
1644:  His-83 and Arg-36, considered important for catalysis, are predicted.
1645:  Gln-81 (Glu in Trp synthase 1kfc), predicted as baseline and SDR, could
1646:  be important for catalysis due to its location.
1647:  A few baseline predictions are far from the active site, and their conservation
1648:  suggests a protein-protein binding interface.
1649:  Predicted SDRs lie close to the ligand and are either replaced by other residues
1650:  in Trp synthase (Arg-36 to Asn) or deleted (Gln-184, Asp-178), which suggests
1651:  that they could be specificity determining.
1652: \layout Standard
1653: 
1654: 
1655: \begin_inset Float figure
1656: wide false
1657: collapsed false
1658: 
1659: \layout Caption
1660: 
1661: SDR and functional residue predictions for TrpF.
1662:  Residue-coloring scheme same as Fig.
1663: \begin_inset LatexCommand \ref{figTransaminase}
1664: 
1665: \end_inset 
1666: 
1667: .
1668: \layout Standard
1669: \align center 
1670: 
1671: \begin_inset Graphics
1672: 	filename figTRPF.jpg
1673: 	width 100mm
1674: 
1675: \end_inset 
1676: 
1677: 
1678: \layout Standard
1679: 
1680: 
1681: \begin_inset LatexCommand \label{figTRPF}
1682: 
1683: \end_inset 
1684: 
1685: 
1686: \end_inset 
1687: 
1688: 
1689: \layout Subsection
1690: 
1691: tRNA synthetases
1692: \layout Standard
1693: 
1694: Aminoacyl-tRNA synthetases catalyze the process of attaching an amino acid
1695:  to its tRNA carrier so that it can be incorporated into a protein.
1696:  SCOP family c.26.1.1 contains tyrosyl-tRNA synthetase (EC 6.1.1.1) along with
1697:  other (Trp-, Glu-, Gln-) tRNA synthetases.
1698:  Fig.
1699: \begin_inset LatexCommand \ref{figTyrTRNA}
1700: 
1701: \end_inset 
1702: 
1703:  shows baseline and SDR predictions for tyrosyl-tRNA synthetase 1h3e from
1704:  the thermophilic bacterium T.
1705:  thermophilus (
1706: \begin_inset LatexCommand \citet{tyrTRNAclass12}
1707: 
1708: \end_inset 
1709: 
1710: ).
1711:  Residues important for catalysis from the 51-HIGH and 233-KMSKS regions are
1712:  predicted as baseline (His-52, Gly-54, His-55, Lys-235).
1713:  Predicted SDRs lie close to the substrate and cofactor.
1714:  Residues specific for L-tyrosine binding, according to 
1715: \begin_inset LatexCommand \citet{tyrTRNAspecificity}
1716: 
1717: \end_inset 
1718: 
1719:  (e.g.
1720:  Thr-80, Tyr-175, Gln-179, Asp-182, Glu-197), are detected.
1721:  Note that substrate similarity creates 2 broad divisions in this family,
1722:  corresponding to Trp/Tyr and Glu/Gln, each of which is subdivided into finer groups.
1723:  Table 
1724: \begin_inset LatexCommand \ref{tRnaTable}
1725: 
1726: \end_inset 
1727: 
1728:  shows residues structurally aligned to SDRs in these tRNA synthetases.
1729: \layout Standard
1730: 
1731: 
1732: \begin_inset Float table
1733: wide false
1734: collapsed false
1735: 
1736: \layout Caption
1737: 
1738: Residues in other tRNA synthetases aligned to predicted SDRs in tyrosyl-tRNA
1739:  synthetase.
1740: \layout Standard
1741: \align center 
1742: 
1743: \begin_inset Graphics
1744: 	filename tRNAtable.pdf
1745: 	width 150mm
1746: 
1747: \end_inset 
1748: 
1749: 
1750: \layout Standard
1751: 
1752: 
1753: \begin_inset LatexCommand \label{tRnaTable}
1754: 
1755: \end_inset 
1756: 
1757: 
1758: \end_inset 
1759: 
1760: 
1761: \layout Standard
1762: 
1763: 
1764: \begin_inset Float figure
1765: wide false
1766: collapsed false
1767: 
1768: \layout Caption
1769: 
1770: SDR and functional residue predictions for 1h3e (tyrosyl-tRNA synthetase).
1771:  Residue-coloring scheme same as Fig.
1772: \begin_inset LatexCommand \ref{figTransaminase}
1773: 
1774: \end_inset 
1775: 
1776: .
1777: \layout Standard
1778: \align center 
1779: 
1780: \begin_inset Graphics
1781: 	filename figTYRtRNA.jpg
1782: 	width 100mm
1783: 
1784: \end_inset 
1785: 
1786: 
1787: \layout Standard
1788: 
1789: 
1790: \begin_inset LatexCommand \label{figTyrTRNA}
1791: 
1792: \end_inset 
1793: 
1794: 
1795: \end_inset 
1796: 
1797: 
1798: \layout Standard
1799: 
1800: Residues distinct for each substrate group could be specific for it, e.g.
1801:  Gln-179.
1802:  Detection of residue Tyr-175 as an SDR suggests that there could be more functions
1803:  associated with this structural family than these four AATSs.
1804:  Detection of residues close to the cofactor suggests that other functions
1805:  of this structural family use different cofactors or none.
1806:  Some residues speculated by 
1807: \begin_inset LatexCommand \citet{tyrTRNAspecificity}
1808: 
1809: \end_inset 
1810: 
1811:  to be functional remain undetected, e.g.
1812:  Asn-128, which is not predicted due to high entropy (Ser dominates the MSSA
1813:  column, not Asn).
1814: \layout Section
1815: 
1816: Conclusion
1817: \layout Standard
1818: 
1819: We have combined structural and sequence information, functional annotation,
1820:  residue entropy and environment specific substitution tables to predict
1821:  specificity determining residues.
1822:  We tested the predictions using information on specific ligands and,
1823:  in some cases, published literature.
1824:  We found that the predictions are far from random and functionally relevant,
1825:  which suggests that our approach is effective.
1826:  Predictions obtained with functional annotation (SDRs) and without it (baseline)
1827:  are different, suggesting that available functional annotation is valuable.
1828:  SDR and baseline predictions are complementary because they enlarge the
1829:  set of functionally significant residues that can be computationally identified.
1830:  We expected and found that our method cannot identify significant residues
1831:  in the absence of high quality evolutionary information, hence the importance
1832:  of identifying chemically interesting patches remains undiminished.
1833:  A major concern is how to obtain functional partitions in the absence of annotation,
1834:  which is similar to establishing ortho/paralogy relationships.
1835:  We plan to explore structure-sequence scoring schemes that would help establish
1836:  functional partitions reliably.
1837:  Alternatively, it would be useful to analyze the effects of constructing
1838:  a functional partition based on sequence identity.
1839:  We plan to use residue proximity information and residue contact conservation
1840:  to detect clusters which may not be conserved in the obvious sense.
1841:  We expect that cluster identification will alleviate the problem of not
1842:  identifying structurally conserved residues.
1843:  The most important purpose of SDR and catalytic residue identification
1844:  is to help classify SNPs into normal/deleterious classes, and this would
1845:  be an important avenue to explore in the near future.
1846: \layout Subsection*
1847: 
1848: Acknowledgements
1849: \layout Standard
1850: 
1851: We thank Dr Kenji Mizuguchi and Dr Vijayalakshmi Chelliah for helpful discussion
1852: s.
1853:  Swanand Gore thanks Cambridge Commonwealth Trust and Universities UK Overseas
1854:  Research Studentship for funding.
1855: \layout Standard
1856: 
1857: 
1858: \begin_inset LatexCommand \BibTeX[marko]{sdr}
1859: 
1860: \end_inset 
1861: 
1862: 
1863: \the_end
1864: