
1: #LyX 1.3 created this file. For more info see http://www.lyx.org/

2: \lyxformat 221

3: \textclass article

4: \begin_preamble

5: \addtolength{\oddsidemargin}{-50pt}

6: \addtolength{\voffset}{-50pt}

7: \addtolength{\textwidth}{100pt}

8: \addtolength{\textheight}{100pt}

9:

10:

11: \usepackage{xspace}

12: \newcommand{\phipsi}{$\phi,\psi$\xspace}

13: \newcommand{\eg}{e.g.\xspace}

14: \newcommand{\etc}{etc.\xspace}

15: \newcommand{\dfs}{\textsc{dfs}~}

16: \newcommand{\cpp}{\textsc{c++}\xspace}

17: \newcommand{\python}{\textsc{python}\xspace}

18: \newcommand{\swig}{\textsc{swig}\xspace}

19: \newcommand{\pdb}{\textsc{PDB}\xspace}

20: \newcommand{\gabb}{\textsc{GABB}\xspace}

21: \newcommand{\scwrl}{\textsc{scwrl}\xspace}

22: \newcommand{\rapper}{\textsc{rapper}\xspace}

23: \newcommand{\rappertk}{\textit{Rapper}\textbf{tk}\xspace}

24: \newcommand{\probe}{\textsc{probe}\xspace}

25: \newcommand{\ie}{i.e.\xspace}

26: \newcommand{\ca}{$C_\alpha$\xspace}

27: \newcommand{\chir}{$\chi$}

28: \newcommand{\CA}[1]{${C_\alpha}^{#1}$~}

29: \newcommand{\CB}[1]{${C_\beta}^{#1}$~}

30: \newcommand{\C}[1]{$C^{#1}$~}

31: \newcommand{\N}[1]{$N^{#1}$\xspace}

32: \newcommand{\cO}[1]{$O^{#1}$\xspace}

33:

34:

35: \newcommand{\Ang}[1]{${#1}$\AA\xspace}

36:

37: \newcommand{\degr}[1]{${#1}^o$\xspace}

38:

39: \date{}

40: \end_preamble

41: \language english

42: \inputencoding auto

43: \fontscheme default

44: \graphics default

45: \paperfontsize 12

46: \spacing onehalf

47: \papersize Default

48: \paperpackage a4

49: \use_geometry 0

50: \use_amsmath 0

51: \use_natbib 1

52: \use_numerical_citations 0

53: \paperorientation portrait

54: \secnumdepth 3

55: \tocdepth 3

56: \paragraph_separation indent

57: \defskip medskip

58: \quotes_language english

59: \quotes_times 2

60: \papercolumns 1

61: \papersides 1

62: \paperpagestyle default

63:

64: \layout Title

65:

66: Identification of specificity determining residues in enzymes using environment

67:  specific substitution tables

68: \layout Author

69:

70: Swanand Gore and Tom Blundell

71: \newline

72: {swanand,tom}@cryst.bioc.cam.ac.uk

73: \newline

74: Department of Biochemistry, University of Cambridge

75: \newline

76: Cambridge CB2 1GA England

77: \layout Abstract

78: \pagebreak_bottom

79: Environment specific substitution tables have been used effectively for

80:  distinguishing structural and functional constraints on proteins and thereby

81:  identify their active sites (

82: \begin_inset LatexCommand \citet{distinguish_str_func_restr}

83:

84: \end_inset

85:

86: ).

87:  This work explores whether a similar approach can be used to identify specifici

88: ty determining residues (SDRs) responsible for cofactor dependence, substrate

89:  specificity or subtle catalytic variations.

90:  We combine structure-sequence information and functional annotation from

91:  various data sources to create structural alignments for homologous enzymes

92:  and functional partitions therein.

93:  We develop a scoring procedure to predict SDRs and assess their accuracy

94:  using information from bound specific ligands and published literature.

95: \layout Section

96:

97: Introduction

98: \layout Standard

99:

100: Enzymes are critical to cellular machinery.

101:  Enzymes are believed to have developed different specificities following

102:  gene duplication events that ease the evolutionary pressure on copies and

103:  allow exploration of novel avenues to greater organismal fitness.

104:  Each copy then develops its own niche, characterized by expression and

105:  localization, catalytic mechanism, substrate specificity, cofactor dependence

106:  and catalysis products.

107:  Such paralogous enzymes should have an evolutionary imprint corresponding

108:  to their specific niche, in addition to maintenance of structural fold.

109:  Thus evolutionary analysis of available structural and sequence data should

110:  enable identification of key residues responsible for specificity of various

111:  kinds.

112:  Enzyme specificity can be estimated with functional assays without structure

113:  determination, but identification of SDRs (specificity determining residues)

114:  remains difficult.

115:  While ENZYME (

116: \begin_inset LatexCommand \citet{ENZYME}

117:

118: \end_inset

119:

120: ) - a database of enzyme sequences with detailed functional annotation -

121:  exists, there is no such database of SDRs.

122:  Time, cost and technical limitations slow down structure determination

123:  and even when structure is known, it is not trivial to identify the residues

124:  important for binding cofactors and substrates.

125:  Hence it is important to be able to identify such residues computationally.

126:  Reliable detection of such residues will aid in deciding whether a SNP

127:  is deleterious or neutral and suggest mutation studies.

128:  Function assignment to sequence could be done at a finer level, e.g.

129:  by verifying that SDRs necessary for a certain substrate are present.

130:  Computational SDR identification has received a lot of attention and several

131:  methods have been proposed.

132:  Evolutionary trace (ET) is one of the most important methods (

133: \begin_inset LatexCommand \citet{evoltraceStrClust}

134:

135: \end_inset

136:

137: ,

138: \begin_inset LatexCommand \citet{EThybridMethods}

139:

140: \end_inset

141:

142: ).

143:  It builds a phylogenetic tree based on sequence comparisons, such that

144:  branch lengths are indicative of evolutionary divergence.

145:  Functional subgroups consist of sequences in subtrees determined from this

146:  tree using a divergence cutoff.

147:  Residues common to a subtree are considered specificity-conferring rather

148:  than the ones common to entire tree.

149:  Spatial cluster identification can be used with ET to reduce the number

150:  of false positives.

151:  Inferring phylogeny correctly remains the main cause of concern in this

152:  approach, hence attempts have been made to use existing annotation with

153:  various statistical techniques.

154:  Another important direction is to use spatial proximity of residues.

155: \layout Standard

156:

157: The cornerstone of our approach is that structural environment influences residue

158:  substitution patterns, illustrated by

159: \begin_inset LatexCommand \citet{earlyESST}

160:

161: \end_inset

162:

163:  and later used effectively for structure-sequence alignment and fold recognitio

164: n (

165: \begin_inset LatexCommand \citet{fugue}

166:

167: \end_inset

168:

169: ).

170:  Structural environment of a residue is described in terms of secondary

171:  structure, solvent accessibility, sidechain-sidechain and sidechain-mainchain

172:  hydrogen bonding.

173:  Residue substitution tables derived from a set of high quality sequence-structu

174: re alignments represent the expected substitution rate in a structural environme

175: nt.

176:  Unexpected conservation of a residue is indicative of functional restraint

177:  acting on it.

178:  The advantage of using ESSTs is that the structurally conserved residues are

179:  masked, which is why active sites of homologous enzymes can be identified

180:  reliably with this approach.

181:  This approach has been extended in the present work by using functional

182:  annotation information.

183: \layout Standard

184:

185: A set of homologous enzymes is generally a union of smaller functionally

186:  specific subsets, e.g.

187:  substrate-specific subsets in serine proteinases (trypsin, chymotrypsin

188:  etc.), cofactor-specific subsets in ferredoxin reductases (NAD and NADP

189:  specific) and so on.

190:  In multiple sequence alignment of a homologous protein family, SDRs generally

191:  appear as differentially conserved subcolumns.

192:  But not all such subcolumns are SDRs.

193:  Our hypothesis is that SDRs would be identified by combining differential

194:  conservation with ESST-based detection of functional restraint.

195: \layout Section

196:

197: Families, functional partitions and profiles

198: \layout Standard

199:

200: In order to test our hypothesis, we need to construct a dataset of homologous

201:  enzyme families with reliable functional partitions in them.

202:  While SCOP classification can be used in a straightforward way for making

203:  families, identifying functionally specific subsets is not a trivial task.

204:  Some automated approaches to detect functional shift, e.g.

205:

206: \begin_inset LatexCommand \citet{funshiftakker}

207:

208: \end_inset

209:

210: , exist to infer such partitions but manual annotation remains the most

211:  reliable.

212:  Additionally, protein function is not a precise and quantifiable entity.

213:  This restricted our study to enzymes, which are the most well studied

214:  and well annotated class of proteins.

215:  Enzyme function is fairly well defined and well classified according to

216:  hierarchical Enzyme Classification scheme (EC).

217:  We use the mapping between SCOP domains and EC numbers (

218: \begin_inset LatexCommand \citet{scopec}

219:

220: \end_inset

221:

222: ) to make EC-specific subgroups within a SCOP domain family.

223:  We generate profiles (multiple structure-sequence alignments) for SCOP

224:  families and functional partitions.

225:  Sequence homologs for structural families were found using PSIBLAST (

226: \begin_inset LatexCommand \citet{psiblast}

227:

228: \end_inset

229:

230: ) on nonredundant sequence database, whereas function-specific partitions

231:  were enriched using PSIBLAST searches on ENZYME database (

232: \begin_inset LatexCommand \citet{ENZYME}

233:

234: \end_inset

235:

236: ).

237:  A PSIBLAST hit on the ENZYME database is retained only if the EC number of the hit

238:  matches that of the query.

239:  All PSIBLAST searches were run with 5 rounds and an e-value cutoff of 0.01; hits shorter

240:  than 75% of the query length were ignored.

241:  All structure-sequence alignments were carried out with fugueseq (

242: \begin_inset LatexCommand \citet{fugue}

243:

244: \end_inset

245:

246: ) which has been shown to improve alignment quality over PSIBLAST.

247:  This process is summarized in Fig.

248: \begin_inset LatexCommand \ref{workflow}

249:

250: \end_inset

251:

252: .
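\layout Standard

The hit filters described above can be sketched in Python as follows.
 This is only an illustration under stated assumptions: the Hit record and the
 keep_hit function are hypothetical names, not part of PSIBLAST or of the
 pipeline used in this work, and the EC-number check applies only to searches
 against the ENZYME database.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one PSIBLAST hit (not a real parser type)."""
    seq_id: str
    ec_number: str
    length: int      # length of the hit
    evalue: float


def keep_hit(hit, query_ec, query_length, max_evalue=0.01, min_fraction=0.75):
    """Return True if the hit passes the e-value, length and EC-number filters."""
    if hit.evalue > max_evalue:
        return False
    if hit.length < min_fraction * query_length:
        return False
    return hit.ec_number == query_ec


# Toy hits for a query of length 280 annotated with EC 2.5.1.55.
hits = [
    Hit("h1", "2.5.1.55", 260, 1e-30),
    Hit("h2", "2.5.1.54", 270, 1e-25),  # EC number differs from the query
    Hit("h3", "2.5.1.55", 120, 1e-10),  # shorter than 75% of the query
]
kept = [h.seq_id for h in hits if keep_hit(h, "2.5.1.55", 280)]
```

Only the first toy hit survives all three filters in this example.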

253: \layout Standard

254:

255:

256: \begin_inset Float figure

257: wide false

258: collapsed false

259:

260: \layout Caption

261:

262: Workflow

263: \layout Standard

264: \align center

265:

266: \begin_inset Graphics

267: 	filename workflow.pdf

268: 	width 150mm

269:

270: \end_inset

271:

272:

273: \layout Standard

274:

275:

276: \begin_inset LatexCommand \label{workflow}

277:

278: \end_inset

279:

280:

281: \end_inset

282:

283:

284: \layout Standard

285:

286: Another constraint on the choice of dataset comes from the need for sufficient

287:  functional diversity in a SCOP domain family.

288:  In its absence, the contrast between the domain family and EC-specific

289:  subgroup within it might not be detectable.

290:  Hence we chose the SCOP families with at least two different EC annotations.

291: \layout Standard

292:

293: To be able to test the hypothesis quantitatively, a gold standard set of

294:  SDRs for every enzyme is needed.

295:  But SDRs are generally a topic of lively debate among researchers, partly

296:  due to the infeasibility of performing all necessary mutation studies.

297:  To our knowledge, no such dataset exists.

298:  Hence we use the information of bound ligands and close-by residues to

299:  assess the hypothesis.

300:  Due to this, the dataset gets restricted to only those cases where at least

301:  one EC-specific domain group has a relevant ligand bound.

302:  A relevant ligand is the one unique to the reaction carried out by that

303:  EC-group among all possible reactions in that domain family.

304:  For example, in SCOP family c.1.10.4 there are two functional subgroups:

305: \layout Standard

306:

307: 3-deoxy-8-phosphooctulonate synthase (EC 2.5.1.55) : Phosphoenolpyruvate +

308:  D-arabinose 5-phosphate + H(2)O = 2-dehydro-3-deoxy-D-octonate 8-phosphate

309:  + phosphate

310: \layout Standard

311:

312: 3-deoxy-7-phosphoheptulonate synthase (EC 2.5.1.54) : Phosphoenolpyruvate +

313:  D-erythrose 4-phosphate + H(2)O = 3-deoxy-D-arabino-hept-2-ulosonate 7-phosphat

314: e + phosphate

315: \layout Standard

316:

317: Here D-arabinose 5-phosphate is unique to EC 2.5.1.55 and is present in domain

318:  1fxqA as A5P.

319:  Hence it is taken as an indicator of SDR locations and not phosphoenolpyruvate

320:  which is a substrate common to both reactions.

321:  We sometimes use products also as such indicators.

322:  Ligand is considered relevant if its name from the PDB file (HETNAM, HETSYM

323:  records) matches its name in the reaction or PDBsum (

324: \begin_inset LatexCommand \citet{pdbsum}

325:

326: \end_inset

327:

328: ) finds it sufficiently similar to ideal ligand molecule.

329:  Our final dataset consists of 97 examples drawn from 68 families.

330:  Very few SDR identification studies have been carried out with this many examples.
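\layout Standard

The notion of a relevant ligand can be illustrated with a small Python sketch
 that keeps the compounds occurring in one EC reaction and in no other reaction
 of the family.
 The function name and the dictionary layout are assumptions made for this
 illustration; matching compound names against HETNAM/HETSYN records or PDBsum
 is not covered.

```python
def relevant_compounds(reactions, ec):
    """Compounds taking part in the reaction of `ec` and in no other
    reaction of the same family.

    reactions: dict mapping an EC number to the set of compound names
    (substrates and products) in its reaction.
    """
    others = set()
    for other_ec, compounds in reactions.items():
        if other_ec != ec:
            others.update(compounds)
    return set(reactions[ec]) - others


# The two functional subgroups of SCOP family c.1.10.4 quoted in the text.
family = {
    "2.5.1.55": {
        "Phosphoenolpyruvate", "D-arabinose 5-phosphate", "H(2)O",
        "2-dehydro-3-deoxy-D-octonate 8-phosphate", "phosphate",
    },
    "2.5.1.54": {
        "Phosphoenolpyruvate", "D-erythrose 4-phosphate", "H(2)O",
        "3-deoxy-D-arabino-hept-2-ulosonate 7-phosphate", "phosphate",
    },
}
unique_to_55 = relevant_compounds(family, "2.5.1.55")
```

For EC 2.5.1.55 this leaves D-arabinose 5-phosphate and the specific product,
 while phosphoenolpyruvate, water and phosphate are discarded as shared.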

331: \layout Section

332:

333: Profiles and substitution patterns

334: \layout Standard

335:

336: Structural and sequence information in MSSA can be misleading if dominated

337:  by very close homologs, hence each MSSA was filtered with 90% sequence

338:  identity cutoff to avoid redundancy.

339: \layout Standard

340:

341: Observed substitution pattern for a column in profile MSSA (multiple structure-s

342: equence alignment) was calculated after weighing down contributions from

343:  similar sequences (

344: \begin_inset Formula $>60\%$

345: \end_inset

346:

347:  sequence identity).

348:  Gaps were ignored while calculating the observed substitution pattern but

349:  the ratio of gaps to amino acids in a column was computed.

350:  Columns with high gap content are generally not functional hence gap content

351:  was used as a filtering criterion as described later.

352:  Observed substitution patterns were normalized, and sequence entropy was

353:  also calculated to get a measure of variability in the column as

354: \begin_inset Formula $-\sum_{i=1}^{20}f_{i}\log(f_{i})$

355: \end_inset

356:

357: , where

358: \begin_inset Formula $f_{i}$

359: \end_inset

360:

361:  is the fraction of

362: \begin_inset Formula $i^{th}$

363: \end_inset

364:

365:  amino acid in the distribution.
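\layout Standard

The per-column quantities described above can be sketched in Python as follows.
 This is a minimal sketch under assumptions: the per-sequence weights are
 supplied by the caller (the exact down-weighting scheme is not reproduced
 here), the natural logarithm is used, and the name column_profile is
 hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_profile(column, weights=None):
    """Return (frequencies, gap ratio, entropy) for one alignment column.

    column:  string with one character per sequence, '-' marking a gap.
    weights: optional per-sequence weights (assumed to be given).
    """
    if weights is None:
        weights = [1.0] * len(column)
    counts = Counter()
    n_gaps = 0
    n_residues = 0
    for residue, weight in zip(column, weights):
        if residue == "-":
            n_gaps += 1          # gaps do not enter the substitution pattern
        elif residue in AMINO_ACIDS:
            counts[residue] += weight
            n_residues += 1
    total = sum(counts.values())
    freqs = {aa: (counts[aa] / total if total > 0 else 0.0)
             for aa in AMINO_ACIDS}
    # Ratio of gaps to amino acids in the column.
    gap_ratio = n_gaps / n_residues if n_residues else float("inf")
    # Sequence entropy; terms with zero frequency contribute nothing.
    entropy = -sum(f * math.log(f) for f in freqs.values() if f > 0)
    return freqs, gap_ratio, entropy


freqs, gap_ratio, entropy = column_profile("DDDE--")
```

A fully conserved column gives zero entropy, and the value grows as the
 column becomes more variable.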

366: \layout Standard

367:

368: Expected substitution patterns for a column were calculated using environment

369:  specific substitution probability tables derived from high quality multiple

370:  structure alignments from 371 families (

371: \begin_inset LatexCommand \citet{fugue}

372:

373: \end_inset

374:

375: ).

376:  Substitution probabilities from every structure were averaged to get expected

377:  substitution probabilities for each column in MSSA.

378:  Again, sequence-based clustering was used to avoid expected substitution

379:  pattern getting dominated by very similar structures.

380: \layout Standard

381:

382: Functional restraint is calculated as the city-block distance between normalized

383:  observed and expected substitution patterns (

384: \begin_inset Formula $\sum_{i=1}^{20}|o_{i}-e_{i}|$

385: \end_inset

386:

387: ,

388: \begin_inset Formula $o_{i}$

389: \end_inset

390:

391:  being observed fraction of

392: \begin_inset Formula $i^{th}$

393: \end_inset

394:

395:  amino acid and

396: \begin_inset Formula $e_{i}$

397: \end_inset

398:

399:  being the fraction of times it is expected to occur).

400:  Thus, for both MSSAs (whole family and EC-specific) we have the following

401:  quantities : functional restraint (

402: \begin_inset Formula $famF,ecF$

403: \end_inset

404:

405: ), gap content (

406: \begin_inset Formula $famG,ecG$

407: \end_inset

408:

409: ) and sequence entropy (

410: \begin_inset Formula $famE,ecE$

411: \end_inset

412:

413: ).

414:  Moreover for each MSSA, number of sequences

415: \begin_inset Formula $<80\%$

416: \end_inset

417:

418:  identical to each other was taken as an indicator of evolutionary information

419:  available in it.
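\layout Standard

As an illustration, the functional restraint of a column can be computed as
 sketched below.
 The sketch assumes that both patterns are dictionaries of normalized
 fractions keyed by amino acid, and it takes absolute differences, as a
 city-block distance requires; the function names are hypothetical.

```python
def mean_pattern(patterns):
    """Average several normalized substitution patterns (one per structure)."""
    keys = set().union(*patterns)
    return {k: sum(p.get(k, 0.0) for p in patterns) / len(patterns)
            for k in keys}


def functional_restraint(observed, expected):
    """City-block (L1) distance between two normalized distributions."""
    keys = set(observed) | set(expected)
    return sum(abs(observed.get(k, 0.0) - expected.get(k, 0.0)) for k in keys)


# Toy column: Asp is fully conserved although the structural environments
# of two aligned structures would tolerate Glu and Asn as well.
expected = mean_pattern([
    {"D": 0.6, "E": 0.4},
    {"D": 0.2, "E": 0.2, "N": 0.6},
])
observed = {"D": 1.0}
restraint = functional_restraint(observed, expected)
```

The distance is 0 when the observed pattern equals the expected one and 2
 when the two distributions do not overlap at all.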

420: \layout Section

421:

422: Benchmarking

423: \layout Standard

424:

425: In order to assess the differences in residues important for whole family

426:  and EC partition, baseline predictions were made by choosing top-ranking

427:  residues according to whole family functional constraint from residues

428:  which are not highly gapped (

429: \begin_inset Formula $famG<0.5$

430: \end_inset

431:

432: ).

433:  The number of baseline and SDR predictions is the same whenever they are compared

434:  or an overlap between them is computed.

435:  This helps in assessing whether information in the EC-specific MSSA is

436:  distinct.
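\layout Standard

The baseline selection can be sketched in Python as follows.
 The column records (dictionaries with the keys pos, famF and famG) are a
 hypothetical data layout chosen for this illustration.

```python
def baseline_prediction(columns, n, max_gap=0.5):
    """Return the positions of the n columns with the highest whole-family
    functional restraint (famF) among columns with famG < max_gap."""
    eligible = [c for c in columns if c["famG"] < max_gap]
    ranked = sorted(eligible, key=lambda c: c["famF"], reverse=True)
    return [c["pos"] for c in ranked[:n]]


# Toy columns: position 3 has the highest restraint but is too gapped.
columns = [
    {"pos": 1, "famF": 0.9, "famG": 0.0},
    {"pos": 2, "famF": 1.4, "famG": 0.2},
    {"pos": 3, "famF": 1.8, "famG": 0.7},
    {"pos": 4, "famF": 0.3, "famG": 0.1},
]
baseline = baseline_prediction(columns, n=2)
```

Passing the same n as the number of SDR predictions keeps the two sets
 directly comparable.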

437: \layout Standard

438:

439: The likelihood of a residue to be an SDR is presumably proportional to its

440:  proximity to the specific ligand.

441:  Hence, to quantify the merit of a prediction, we defined mean proximity

442:  as the mean separation between predicted residues and ligand.

443:  Mean relative proximity is defined as the ratio of mean proximity to the

444:  mean separation between all residues in the domain and the ligand.

445:  Distance between a residue and ligand is taken to be the closest distance

446:  between residue sidechain (mainchain for glycine) and ligand atoms.

447:  The smaller the mean relative proximity, the better the prediction.

448:  Prediction quality will also depend on the number of distinct homologous

449:  sequences available.

450:  In case of multiple ligands close to a domain, a residue's proximity to

451:  the ligand is calculated with respect to the closest ligand.

452:  The basis for SDR prediction is that it be sufficiently distinct between

453:  whole family and EC-specific MSSAs.

454:  As

455: \begin_inset LatexCommand \citet{funshiftakker}

456:

457: \end_inset

458:

459:  describe it, an SDR should be a rate-shifted or conservation-shifted site.

460:  Additionally, an SDR should be sufficiently functionally constrained from

461:  the ESST perspective (

462: \begin_inset Formula $ecF$

463: \end_inset

464:

465: ).

466:  For a residue with low entropy in EC MSSA, if change in entropy

467: \begin_inset Formula $dE$

468: \end_inset

469:

470:  (family MSSA sequence entropy - EC MSSA sequence entropy) is high, it indicates

471:  that it could be an SDR.

472:  Since each MSSA will be different in its variability, it is not advisable

473:  to use same functional constraint cutoff or entropy cutoff for all of them.

474:  This immediately suggests two 2-step approaches : choose top

475: \begin_inset Formula $N1$

476: \end_inset

477:

478:  residues with high difference in sequence entropy between whole and EC MSSAs,

479:  then select top

480: \begin_inset Formula $N2$

481: \end_inset

482:

483:  according to functional constraint in EC MSSA and vice versa.

484:  But there could be a third and more attractive approach that combines functiona

485: l constraint from EC MSSA and sequence entropy difference.

486:  We pursue the third approach.
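\layout Standard

The proximity measures used for benchmarking can be sketched in Python as
 below.
 The sketch reads mean proximity as the mean residue-ligand separation of the
 predicted residues, and assumes that atoms are given as coordinate tuples
 and that residue_atoms already holds the sidechain atoms (mainchain atoms
 for glycine); all function names are hypothetical.

```python
import math


def residue_ligand_distance(residue_atoms, ligands):
    """Closest distance between the residue atoms and the atoms of the
    nearest ligand. ligands is a list of ligands, each a list of atoms."""
    return min(math.dist(a, b)
               for ligand in ligands
               for b in ligand
               for a in residue_atoms)


def mean_proximity(predicted, distances):
    """Mean residue-ligand distance over the predicted residues."""
    return sum(distances[r] for r in predicted) / len(predicted)


def mean_relative_proximity(predicted, distances):
    """Mean proximity divided by the mean distance of all domain residues."""
    overall = sum(distances.values()) / len(distances)
    return mean_proximity(predicted, distances) / overall


# Toy example: distances (in Angstroms) of three residues to the ligand.
distances = {1: 2.0, 2: 4.0, 3: 12.0}
prox = mean_proximity([1, 2], distances)
rel_prox = mean_relative_proximity([1, 2], distances)
closest = residue_ligand_distance([(0.0, 0.0, 0.0)],
                                  [[(3.0, 4.0, 0.0)], [(0.0, 0.0, 1.0)]])
```

A mean relative proximity close to 1 corresponds to a prediction that is no
 closer to the ligand than an average residue of the domain.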

487: \layout Standard

488:

489: We assume that SDR score of a residue is a linear combination of its functional

490:  constraint, entropy and change in entropy, given that the residue passes

491:  certain quality checks (

492: \begin_inset Formula $ecF>0.5$

493: \end_inset

494:

495: ,

496: \begin_inset Formula $ecG<0.5$

497: \end_inset

498:

499: ,

500: \begin_inset Formula $ecE<1$

501: \end_inset

502:

503: ,

504: \begin_inset Formula $dE>0.5$

505: \end_inset

506:

507: ):

508: \layout Standard

509:

510:

511: \begin_inset Formula $SDRscore=ecF+a*(famE-ecE)-b*ecE$

512: \end_inset

513:

514:
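\layout Standard

The quality checks and the score above can be written as a short Python
 sketch.
 The column records and the function names are hypothetical; the thresholds
 and the score follow the expressions given in the text.

```python
def sdr_score(ecF, famE, ecE, ecG, a, b):
    """Return the SDR score of a column, or None if it fails a quality check."""
    dE = famE - ecE
    if not (ecF > 0.5 and ecG < 0.5 and ecE < 1 and dE > 0.5):
        return None
    return ecF + a * dE - b * ecE


def top_sdrs(columns, a, b, n=10):
    """Positions of the n best-scoring columns that pass the quality checks."""
    scored = []
    for col in columns:
        score = sdr_score(col["ecF"], col["famE"], col["ecE"], col["ecG"], a, b)
        if score is not None:
            scored.append((score, col["pos"]))
    scored.sort(reverse=True)
    return [pos for _, pos in scored[:n]]


# Toy columns; position 30 fails the dE > 0.5 check.
columns = [
    {"pos": 10, "ecF": 1.2, "famE": 1.5, "ecE": 0.2, "ecG": 0.0},
    {"pos": 20, "ecF": 0.9, "famE": 2.0, "ecE": 0.0, "ecG": 0.1},
    {"pos": 30, "ecF": 1.5, "famE": 0.3, "ecE": 0.1, "ecG": 0.0},
]
ranking_1 = top_sdrs(columns, a=0.0, b=0.8)
ranking_2 = top_sdrs(columns, a=0.4, b=1.2)
```

In this toy example the two parameter settings rank the surviving columns
 in opposite order, which shows how a and b trade entropy change against
 residual entropy.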

515: \layout Standard

516:

517: In order to optimize the parameters

518: \begin_inset Formula $a,b$

519: \end_inset

520:

521:  and test the optimal ones, we created a high quality test set from our

522:  examples, consisting of 23 examples drawn from SCOP families with at least

523:  2 EC groups, each with

524: \begin_inset Formula $>10$

525: \end_inset

526:

527:  distinct sequence homologs from the ENZYME database.

528:  Parameters

529: \begin_inset Formula $a,b$

530: \end_inset

531:

532:  were varied from 0 to 5 in steps of

533: \begin_inset Formula $0.2$

534: \end_inset

535:

536: .

537:  For each value of

538: \begin_inset Formula $a$

539: \end_inset

540:

541:  and

542: \begin_inset Formula $b$

543: \end_inset

544:

545: , SDR and baseline predictions were made, each consisting of 10 residues.

546:  Note that baseline predictions are not affected by values of

547: \begin_inset Formula $a,b$

548: \end_inset

549:

550: .

551:  Optimization can be done with two objectives, either to minimize the mean

552:  proximity or to maximize the number of close (

553: \begin_inset Formula $<$

554: \end_inset

555:

556:

557: \begin_inset ERT

558: status Collapsed

559:

560: \layout Standard

561:

562: \backslash

563: Ang{6}

564: \end_inset

565:

566: ) residues.

567:

568: \begin_inset Formula $a,b$

569: \end_inset

570:

571:  values of

572: \begin_inset Formula $0.4,1.2$

573: \end_inset

574:

575:  minimize the former objective to

576: \begin_inset ERT

577: status Collapsed

578:

579: \layout Standard

580:

581: \backslash

582: Ang{9.24}

583: \end_inset

584:

585:  and yield

586: \begin_inset Formula $3.6$

587: \end_inset

588:

589:  close residues per prediction, whereas

590: \begin_inset Formula $0,0.8$

591: \end_inset

592:

593:  maximize the latter to

594: \begin_inset Formula $4.08$

595: \end_inset

596:

597:  residues while yielding

598: \begin_inset ERT

599: status Collapsed

600:

601: \layout Standard

602:

603: \backslash

604: Ang{9.36}

605: \end_inset

606:

607:  for the former.

608:  Performance of these two

609: \begin_inset Formula $a,b$

610: \end_inset

611:

612:  values on different sets of examples is shown in Table

613: \begin_inset LatexCommand \ref{evolABperf}

614:

615: \end_inset

616:

617: .
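\layout Standard

The parameter scan can be sketched as an exhaustive grid search in Python.
 The objective passed in is a stand-in: in the setting described above it
 would be the mean proximity of the 10 predictions averaged over the test
 set (or the negated number of close residues, to maximize it), whereas the
 quadratic used below is a purely hypothetical toy function.

```python
def grid(lo=0.0, hi=5.0, step=0.2):
    """Grid values from lo to hi inclusive, rounded to avoid float drift."""
    n = round((hi - lo) / step)
    return [round(lo + i * step, 10) for i in range(n + 1)]


def grid_search(objective):
    """Return (best value, a, b) minimizing objective(a, b) over the grid."""
    values = grid()
    return min((objective(a, b), a, b) for a in values for b in values)


def toy_objective(a, b):
    """Hypothetical objective with its minimum at a = 0.4, b = 1.2."""
    return (a - 0.4) ** 2 + (b - 1.2) ** 2


best_value, best_a, best_b = grid_search(toy_objective)
```

With 26 values per parameter the scan evaluates 676 combinations, so an
 exhaustive search is cheap when the objective itself is cheap to evaluate.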

618: \layout Standard

619:

620:

621: \begin_inset Float table

622: wide false

623: collapsed true

624:

625: \layout Caption

626:

627: Performance of the two candidate (a,b) settings for various levels of evolutionary

628:  information available.

629: \layout Standard

630: \align center

631:

632: \begin_inset  Tabular

633: <lyxtabular version="3" rows="8" columns="5">

634: <features>

635: <column alignment="center" valignment="top" leftline="true" width="0">

636: <column alignment="block" valignment="top" leftline="true" width="1in">

637: <column alignment="block" valignment="top" leftline="true" width="1in">

638: <column alignment="block" valignment="top" leftline="true" width="1in">

639: <column alignment="block" valignment="top" leftline="true" rightline="true" width="1in">

640: <row topline="true">

641: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

642: \begin_inset Text

643:

644: \layout Standard

645:

646: Criteria for

647: \end_inset

648: </cell>

649: <cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

650: \begin_inset Text

651:

652: \layout Standard

653:

654: Mean proximity

655: \end_inset

656: </cell>

657: <cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

658: \begin_inset Text

659:

660: \layout Standard

661:

662: \end_inset

663: </cell>

664: <cell multicolumn="1" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

665: \begin_inset Text

666:

667: \layout Standard

668:

669: #close (<

670: \begin_inset ERT

671: status Collapsed

672:

673: \layout Standard

674:

675: \backslash

676: Ang{6}

677: \end_inset

678:

679: ) residues

680: \end_inset

681: </cell>

682: <cell multicolumn="2" alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

683: \begin_inset Text

684:

685: \layout Standard

686:

687: \end_inset

688: </cell>

689: </row>

690: <row bottomline="true">

691: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

692: \begin_inset Text

693:

694: \layout Standard

695:

696: choice of examples

697: \end_inset

698: </cell>

699: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

700: \begin_inset Text

701:

702: \layout Standard

703:

704: (0,0.8)

705: \end_inset

706: </cell>

707: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

708: \begin_inset Text

709:

710: \layout Standard

711:

712: (0.4,1.2)

713: \end_inset

714: </cell>

715: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

716: \begin_inset Text

717:

718: \layout Standard

719:

720: (0,0.8)

721: \end_inset

722: </cell>

723: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

724: \begin_inset Text

725:

726: \layout Standard

727:

728: (0.4,1.2)

729: \end_inset

730: </cell>

731: </row>

732: <row topline="true">

733: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

734: \begin_inset Text

735:

736: \layout Standard

737:

738: >5 homologs

739: \end_inset

740: </cell>

741: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

742: \begin_inset Text

743:

744: \layout Standard

745:

746: 10.84

747: \end_inset

748: </cell>

749: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

750: \begin_inset Text

751:

752: \layout Standard

753:

754: 11.24

755: \end_inset

756: </cell>

757: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

758: \begin_inset Text

759:

760: \layout Standard

761:

762: 3.35

763: \end_inset

764: </cell>

765: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

766: \begin_inset Text

767:

768: \layout Standard

769:

770: 3.01

771: \end_inset

772: </cell>

773: </row>

774: <row>

775: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

776: \begin_inset Text

777:

778: \layout Standard

779:

780: (67 examples)

781: \end_inset

782: </cell>

783: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

784: \begin_inset Text

785:

786: \layout Standard

787:

788: \end_inset

789: </cell>

790: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

791: \begin_inset Text

792:

793: \layout Standard

794:

795: \end_inset

796: </cell>

797: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

798: \begin_inset Text

799:

800: \layout Standard

801:

802: \end_inset

803: </cell>

804: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

805: \begin_inset Text

806:

807: \layout Standard

808:

809: \end_inset

810: </cell>

811: </row>

812: <row topline="true">

813: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

814: \begin_inset Text

815:

816: \layout Standard

817:

818: >10 homologs

819: \end_inset

820: </cell>

821: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

822: \begin_inset Text

823:

824: \layout Standard

825:

826: 10.41

827: \end_inset

828: </cell>

829: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

830: \begin_inset Text

831:

832: \layout Standard

833:

834: 10.64

835: \end_inset

836: </cell>

837: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

838: \begin_inset Text

839:

840: \layout Standard

841:

842: 3.45

843: \end_inset

844: </cell>

845: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

846: \begin_inset Text

847:

848: \layout Standard

849:

850: 3.2

851: \end_inset

852: </cell>

853: </row>

854: <row>

855: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

856: \begin_inset Text

857:

858: \layout Standard

859:

860: (55 examples)

861: \end_inset

862: </cell>

863: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

864: \begin_inset Text

865:

866: \layout Standard

867:

868: \end_inset

869: </cell>

870: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

871: \begin_inset Text

872:

873: \layout Standard

874:

875: \end_inset

876: </cell>

877: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

878: \begin_inset Text

879:

880: \layout Standard

881:

882: \end_inset

883: </cell>

884: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

885: \begin_inset Text

886:

887: \layout Standard

888:

889: \end_inset

890: </cell>

891: </row>

892: <row topline="true">

893: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

894: \begin_inset Text

895:

896: \layout Standard

897:

898: >10 homologs, >1 EC

899: \end_inset

900: </cell>

901: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

902: \begin_inset Text

903:

904: \layout Standard

905:

906: 9.36

907: \end_inset

908: </cell>

909: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

910: \begin_inset Text

911:

912: \layout Standard

913:

914: 9.24

915: \end_inset

916: </cell>

917: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

918: \begin_inset Text

919:

920: \layout Standard

921:

922: 4.08

923: \end_inset

924: </cell>

925: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

926: \begin_inset Text

927:

928: \layout Standard

929:

930: 3.6

931: \end_inset

932: </cell>

933: </row>

934: <row bottomline="true">

935: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

936: \begin_inset Text

937:

938: \layout Standard

939:

940: (23 examples)

941: \end_inset

942: </cell>

943: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

944: \begin_inset Text

945:

946: \layout Standard

947:

948: \end_inset

949: </cell>

950: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

951: \begin_inset Text

952:

953: \layout Standard

954:

955: \end_inset

956: </cell>

957: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

958: \begin_inset Text

959:

960: \layout Standard

961:

962: \end_inset

963: </cell>

964: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

965: \begin_inset Text

966:

967: \layout Standard

968:

969: \end_inset

970: </cell>

971: </row>

972: </lyxtabular>

973:

974: \end_inset

975:

976:

977: \layout Standard

978:

979:

980: \begin_inset LatexCommand \label{evolABperf}

981:

982: \end_inset

983:

984:

985: \end_inset

986:

987:

988: \layout Standard

989:

990: This suggests that optimal

991: \begin_inset Formula $a,b$

992: \end_inset

993:

994:  parameters are

995: \begin_inset Formula $0,0.8$

996: \end_inset

997:

998: .

999:  It is surprising that there is no importance for the value of

1000: \begin_inset Formula $dE=famE-genE$

1001: \end_inset

1002:

1003:  in the SDR score.

1004:  Perhaps this is due to the quality checks applied prior to calculation

1005:  of SDR scores, which demand

1006: \begin_inset Formula $dE>0.5$

1007: \end_inset

1008:

1009: .

1010: \layout Standard

1011:

1012: Fig.

1013: \begin_inset LatexCommand \ref{proxDistrib}

1014:

1015: \end_inset

1016:

1017:  shows the distribution of mean proximity in sets defined by the number of

1018:  distinct homologs in ENZYME.

1019:  It shows that the quality of the available evolutionary information has a great

1020:  impact on the quality of predictions.

1021: \layout Standard

1022:

1023:

1024: \begin_inset Float figure

1025: wide false

1026: collapsed false

1027:

1028: \layout Caption

1029:

1030: Frequency of observing a certain mean proximity of SDR predictions (binned

1031:  in

1032: \begin_inset ERT

1033: status Collapsed

1034:

1035: \layout Standard

1036:

1037: \backslash

1038: Ang{1}

1039: \end_inset

1040:

1041:  bins) for different qualities of evolutionary information available.

1042: \layout Standard

1043: \align center

1044:

1045: \begin_inset Graphics

1046: 	filename proxDistrib.pdf

1047: 	width 150mm

1048:

1049: \end_inset

1050:

1051:

1052: \layout Standard

1053:

1054:

1055: \begin_inset LatexCommand \label{proxDistrib}

1056:

1057: \end_inset

1058:

1059:

1060: \end_inset

1061:

1062:

1063: \layout Standard

1064:

1065: Mean relative proximity indicates how far the prediction is from random.

1066:  Table

1067: \begin_inset LatexCommand \ref{meanRelProxTable}

1068:

1069: \end_inset

1070:

1071:  shows that mean relative proximity depends on quality of evolutionary informati

1072: on and is far from random for both SDR and baseline predictions.
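
\layout Standard

The pairwise comparison summarized in the last column of the mean relative proximity table can be sketched as follows. This is a minimal illustration and not the code used in this work: the function name fraction_sdr_not_worse and the per-protein values are hypothetical, and the sketch assumes that one mean relative proximity (MRP) value per protein is available for both the SDR and the baseline predictions.

```python
def fraction_sdr_not_worse(mrp_sdr, mrp_baseline):
    """Count proteins for which MRP(SDR) <= MRP(baseline).

    Both arguments are sequences of per-protein mean relative proximity
    values, given in the same protein order (an assumption of this sketch).
    Returns (count, total, fraction).
    """
    if len(mrp_sdr) != len(mrp_baseline):
        raise ValueError("both lists must cover the same proteins")
    if not mrp_sdr:
        raise ValueError("at least one protein is required")
    count = sum(1 for s, b in zip(mrp_sdr, mrp_baseline) if s <= b)
    total = len(mrp_sdr)
    return count, total, count / total


# Hypothetical values for four proteins, for illustration only.
example_sdr = [0.40, 0.75, 0.55, 0.90]
example_baseline = [0.60, 0.70, 0.55, 0.95]
count, total, fraction = fraction_sdr_not_worse(example_sdr, example_baseline)
print(f"MRP(SDR) <= MRP(baseline) in {count}/{total} proteins ({fraction:.0%})")
```

Ties are counted in favour of the SDR predictions, matching the less-than-or-equal comparison used in the table header.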

1073: \layout Standard

1074:

1075:

1076: \begin_inset Float table

1077: wide false

1078: collapsed false

1079:

1080: \layout Caption

1081:

1082: Mean relative proximity in datasets defined by the number of available

1083:  distinct homologs.

1084: \layout Standard

1085: \align center

1086:

1087: \begin_inset  Tabular

1088: <lyxtabular version="3" rows="5" columns="4">

1089: <features>

1090: <column alignment="center" valignment="top" leftline="true" width="0">

1091: <column alignment="center" valignment="top" leftline="true" width="0">

1092: <column alignment="center" valignment="top" leftline="true" width="0">

1093: <column alignment="center" valignment="top" leftline="true" rightline="true" width="0">

1094: <row topline="true">

1095: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1096: \begin_inset Text

1097:

1098: \layout Standard

1099:

1100: Dataset

1101: \end_inset

1102: </cell>

1103: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1104: \begin_inset Text

1105:

1106: \layout Standard

1107:

1108: Mean Rel.

1109:  Prox.

1110: \end_inset

1111: </cell>

1112: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1113: \begin_inset Text

1114:

1115: \layout Standard

1116:

1117: Mean Rel.

1118:  Prox.

1119: \end_inset

1120: </cell>

1121: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

1122: \begin_inset Text

1123:

1124: \layout Standard

1125:

1126: Frequency of

1127: \end_inset

1128: </cell>

1129: </row>

1130: <row>

1131: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1132: \begin_inset Text

1133:

1134: \layout Standard

1135:

1136: \end_inset

1137: </cell>

1138: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1139: \begin_inset Text

1140:

1141: \layout Standard

1142: (SDR)

1143: \end_inset

1144: </cell>

1145: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1146: \begin_inset Text

1147:

1148: \layout Standard

1149: (baseline)

1150: \end_inset

1151: </cell>

1152: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

1153: \begin_inset Text

1154:

1155: \layout Standard

1156:

1157: MRP(SDR)

1158: \begin_inset Formula $\leq$

1159: \end_inset

1160:

1161:  MRP(baseline)

1162: \end_inset

1163: </cell>

1164: </row>

1165: <row topline="true">

1166: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1167: \begin_inset Text

1168:

1169: \layout Standard

1170:

1171: >0 homologs

1172: \end_inset

1173: </cell>

1174: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1175: \begin_inset Text

1176:

1177: \layout Standard

1178:

1179: 0.67

1180: \end_inset

1181: </cell>

1182: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1183: \begin_inset Text

1184:

1185: \layout Standard

1186:

1187: 0.66

1188: \end_inset

1189: </cell>

1190: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

1191: \begin_inset Text

1192:

1193: \layout Standard

1194:

1195: 34% (33/97)

1196: \end_inset

1197: </cell>

1198: </row>

1199: <row topline="true">

1200: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1201: \begin_inset Text

1202:

1203: \layout Standard

1204:

1205: >5 homologs

1206: \end_inset

1207: </cell>

1208: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1209: \begin_inset Text

1210:

1211: \layout Standard

1212:

1213: 0.57

1214: \end_inset

1215: </cell>

1216: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1217: \begin_inset Text

1218:

1219: \layout Standard

1220:

1221: 0.66

1222: \end_inset

1223: </cell>

1224: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

1225: \begin_inset Text

1226:

1227: \layout Standard

1228:

1229: 60% (40/67)

1230: \end_inset

1231: </cell>

1232: </row>

1233: <row topline="true" bottomline="true">

1234: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1235: \begin_inset Text

1236:

1237: \layout Standard

1238:

1239: >10 homologs

1240: \end_inset

1241: </cell>

1242: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1243: \begin_inset Text

1244:

1245: \layout Standard

1246:

1247: 0.57

1248: \end_inset

1249: </cell>

1250: <cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none">

1251: \begin_inset Text

1252:

1253: \layout Standard

1254:

1255: 0.62

1256: \end_inset

1257: </cell>

1258: <cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none">

1259: \begin_inset Text

1260:

1261: \layout Standard

1262:

1263: 85% (47/55)

1264: \end_inset

1265: </cell>

1266: </row>

1267: </lyxtabular>

1268:

1269: \end_inset

1270:

1271:

1272: \layout Standard

1273:

1274:

1275: \begin_inset LatexCommand \label{meanRelProxTable}

1276:

1277: \end_inset

1278:

1279:

1280: \end_inset

1281:

1282:

1283: \layout Standard

1284:

1285: The fraction of SDRs present in baseline predictions is

1286: \begin_inset Formula $15\%$

1287: \end_inset

1288:

1289:  in all

1290: \begin_inset Formula $>0,>5,>10$

1291: \end_inset

1292:

1293:  homolog classes, which suggests that SDR predictions are fairly different

1294:  from the baseline.

1295:  This also suggests that baseline and SDR predictions are complementary

1296:  to each other.
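
\layout Standard

The overlap between the two prediction sets can be computed with a simple set intersection. The sketch below is illustrative only: the function name overlap_fraction is hypothetical, and the residue labels are a small subset taken from the aminotransferase example discussed later in the paper, not the full Top10 lists.

```python
def overlap_fraction(sdr_residues, baseline_residues):
    """Return the fraction of SDR predictions that also appear among the
    baseline predictions, together with the sorted shared residues."""
    sdr = set(sdr_residues)
    baseline = set(baseline_residues)
    if not sdr:
        raise ValueError("at least one SDR prediction is required")
    shared = sdr & baseline
    return len(shared) / len(sdr), sorted(shared)


# Illustrative subset of predictions for 1gex (not the complete lists).
sdr_example = ["Tyr-187", "Lys-214", "Tyr-55", "Ala-186"]
baseline_example = ["Asn-157", "Lys-214", "Ala-186"]
fraction, shared = overlap_fraction(sdr_example, baseline_example)
print(f"{fraction:.0%} of SDR predictions are shared: {shared}")
```

The denominator here is the number of SDR predictions, which follows the wording "fraction of SDRs present in baseline predictions".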

1297: \layout Section

1298:

1299: Some examples

1300: \layout Standard

1301:

1302: When high-quality sequence information is available, SDR predictions are closer

1303:  to the specific ligand than baseline predictions, which in turn are closer than

1304:  random.

1305:  Here we compare our Top10 predictions with information from the literature

1306:  for some examples.

1307: \layout Subsection

1308:

1309: Aminotransferases

1310: \layout Standard

1311:

1312: Aminotransferases, or transaminases, are important to amino acid biosynthesis

1313:  and unique in their specificity for two substrates: a glutamate and

1314:  an amino-carrier.

1315:  Our dataset contains two SCOP families (c.67.1.1 and c.67.1.4) that contain transamin

1316: ases.

1317:  Of those, we focus on SCOP family c.67.1.1 which contains the functional categorie

1318: s aspartate transaminase (AspAT, EC 2.6.1.1) and histidinol phosphate transaminase

1319:  (HspAT, EC 2.6.1.9).

1320:  Other non-transaminase members of this family include threonine aldolases

1321:  (EC 4.1.2.5) and alliin lyase (EC 4.4.1.4).

1322:  When Top10 predictions were analyzed in 1gex, an HspAT, we found that SDR

1323:  predictions are very well clustered around the ligands PLP and HSP, but

1324:  5 of the 10 predictions were shared with Top10 baseline predictions.

1325:  This overlap can be attributed to degrees of functional diversity in the

1326:  SCOP family, i.e.

1327:  large entropy reduction in HspAT residues could be due to their importance

1328:  to general transaminase mechanism (as opposed to aldolase mechanism) or

1329:  for substrate specificity to histidinol phosphate (as opposed to aspartate

1330:  in AspATs).

1331:  In order to increase the number of distinct predictions, Top20 baseline

1332:  and SDR predictions were used.

1333:  Fig.

1334: \begin_inset LatexCommand \ref{figTransaminase}

1335:

1336: \end_inset

1337:

1338:  shows the predictions for 1gexA, an HspAT from E.

1339:  coli; 7 predictions are common to both sets.

1340:  Catalytically important residues (

1341: \begin_inset LatexCommand \citet{Haruyama2001}

1342:

1343: \end_inset

1344:

1345: ) Asn-157, Tyr-187, Lys-214 are identified as baseline, SDR and common,

1346:  respectively.

1347:  Tyr-55, which interacts with substrate of the other subunit, is predicted

1348:  as SDR

1349: \begin_inset Foot

1350: collapsed true

1351:

1352: \layout Standard

1353:

1354: This is confirmed by a similar prediction in 1gc4, an AspAT.

1355: \end_inset

1356:

1357: .

1358:  Tyr-20, believed to be important for specificity, is not predicted as such

1359:  because it is conserved only 80% of the time, whereas a similarly placed Tyr-55

1360:  from the other subunit is much better conserved (98% of the time) and could be equally

1361:  important for specificity.

1362:  Ala-186, considered important for restricting rotation of PLP's pyridine

1363:  ring and thereby contributing to strain essential for enzyme function,

1364:  is predicted as both SDR and baseline.

1365:  Most other predicted SDRs lie close to the substrate.

1366:  Their location and AspAT counterparts suggest their role in conferring

1367:  specificity towards histidinol phosphate (see Table

1368: \begin_inset LatexCommand \ref{transaminaseTable}

1369:

1370: \end_inset

1371:

1372: ).

1373: \layout Standard

1374:

1375:

1376: \begin_inset Float table

1377: wide false

1378: collapsed false

1379:

1380: \layout Caption

1381:

1382: Residues with speculated roles

1383: \begin_inset LatexCommand \citet{Haruyama2001}

1384:

1385: \end_inset

1386:

1387:  for HspAT 1gex and how well they were predicted.

1388:  The aligned residues in other subfamilies with transaminases are also shown.

1389: \layout Standard

1390: \align center

1391:

1392: \begin_inset Graphics

1393: 	filename transaminaseTable.pdf

1394: 	width 150mm

1395:

1396: \end_inset

1397:

1398:

1399: \layout Standard

1400:

1401:

1402: \begin_inset LatexCommand \label{transaminaseTable}

1403:

1404: \end_inset

1405:

1406:

1407: \end_inset

1408:

1409:

1410: \layout Standard

1411:

1412:

1413: \begin_inset Float figure

1414: wide false

1415: collapsed false

1416:

1417: \layout Caption

1418:

1419: SDR (green) and functional residue (red) predictions for 1gex, an HspAT.

1420:  Residues predicted both as functional and specificity-conferring are colored

1421:  blue.

1422:  Top left panel shows Top5 predictions, top right panel shows Top10 predictions

1423:  and bottom panel zooms in on the region around ligand in the Top10 case.

1424: \layout Standard

1425: \align center

1426:

1427: \begin_inset Graphics

1428: 	filename transaminaseFig.jpg

1429: 	width 150mm

1430:

1431: \end_inset

1432:

1433:

1434: \layout Standard

1435:

1436:

1437: \begin_inset LatexCommand \label{figTransaminase}

1438:

1439: \end_inset

1440:

1441:

1442: \end_inset

1443:

1444:

1445: \layout Subsection

1446:

1447: Phosphoric monoester hydrolases

1448: \layout Standard

1449:

1450: SCOP family e.7.1.1 in our dataset contains 4 classes of phosphoric monoester

1451:  hydrolases, 3'(2'),5'-bisphosphate nucleotidase (EC 3.1.3.7), Fructose-bisphosphat

1452: ase (EC 3.1.3.11), Inositolphosphate phosphatase (EC 3.1.3.25) and Inositol-1,4-bispho

1453: sphate 1-phosphatase (EC 3.1.3.57).

1454:  Here we look at the SDR and baseline predictions for 1cnq, a member of

1455:  the FBPase category.

1456:  FBPases are of key importance to regulation of the gluconeogenic pathway and

1457:  catalyze the hydrolysis of fructose 1,6-bisphosphate to fructose 6-phosphate.

1458:  They are metal dependent and are allosterically controlled by AMP which

1459:  triggers a conformational change and masks the fructose active site.

1460:  Fig.

1461: \begin_inset LatexCommand \ref{figFBPase}

1462:

1463: \end_inset

1464:

1465:  shows the Top10 baseline and SDR predictions; the overlap in this case is

1466:  2 residues.

1467:  The F6P molecule around which most predictions are clustered lies in the active

1468:  site, whereas the other F6P molecule occupies a position similar to that of AMP

1469:  (from comparison with PDB 1yyz).

1470:  Baseline predictions Tyr-279, Glu-280, Tyr-244, Met-244 and common prediction

1471:  Tyr-264 are within interacting distance of F6P ligand in the active site.

1472:  Most predicted SDRs form the active site walls and differ between FBPase

1473:  and IMPase (1awb) : Arg-276 to His, Ser-96 to Gly, Ser-123 to Thr, Ser-124

1474:  to Thr (see Table

1475: \begin_inset LatexCommand \ref{FBPaseTable}

1476:

1477: \end_inset

1478:

1479: ).

1480:  It is surprising to see that the allosteric site is only weakly detected.

1481:  Predictions Ala-161 (Top10 SDR), Lys-290 (Top10 baseline) and Val-178 (Top20

1482:  SDR) lie close to it and suggest some role in AMP binding.

1483: \layout Standard

1484:

1485:

1486: \begin_inset Float table

1487: wide false

1488: collapsed false

1489:

1490: \layout Caption

1491:

1492: Speculated roles of residues in FBPase for 1cnq from literature and how

1493:  well they were predicted.

1494:  Aligned residues in other subfamilies of hydrolases are also shown.

1495: \layout Standard

1496: \align center

1497:

1498: \begin_inset Graphics

1499: 	filename FBPaseTable.pdf

1500: 	width 150mm

1501:

1502: \end_inset

1503:

1504:

1505: \layout Standard

1506:

1507:

1508: \begin_inset LatexCommand \label{FBPaseTable}

1509:

1510: \end_inset

1511:

1512:

1513: \end_inset

1514:

1515:

1516: \layout Standard

1517:

1518:

1519: \begin_inset Float figure

1520: wide false

1521: collapsed false

1522:

1523: \layout Caption

1524:

1525: SDR and functional residue predictions for 1cnq, an FBPase.

1526:  Residue-coloring scheme same as Fig.

1527: \begin_inset LatexCommand \ref{figTransaminase}

1528:

1529: \end_inset

1530:

1531: .

1532:  The bottom panel is a closer view of the region around ligand in the top

1533:  panel.

1534: \layout Standard

1535: \align center

1536:

1537: \begin_inset Graphics

1538: 	filename figFBPase.jpg

1539: 	width 100mm

1540:

1541: \end_inset

1542:

1543:

1544: \layout Standard

1545:

1546:

1547: \begin_inset LatexCommand \label{figFBPase}

1548:

1549: \end_inset

1550:

1551:

1552: \end_inset

1553:

1554:

1555: \layout Subsection

1556:

1557: Dehydrogenases

1558: \layout Standard

1559:

1560: L-3-hydroxyacyl-CoA dehydrogenase (HAD, EC 1.1.1.35) is the penultimate enzyme

1561:  in the beta-oxidation spiral and catalyzes conversion of a hydroxy group to a keto

1562:  group while converting NAD+ to NADH.

1563:  It consists of NAD-binding and C-terminal domains, which undergo relative

1564:  movement between NAD binding and substrate binding events (

1565: \begin_inset LatexCommand \citet{activesiteSequestration}

1566:

1567: \end_inset

1568:

1569: ).

1570:  Its SCOP family is c.2.1.6, other members of which are other NAD/NADP-dependent

1571:  dehydrogenases (ECs 1.1.1.8, 1.1.1.22, 1.1.1.44).

1572:  HAD is represented in our dataset by NAD-binding domain of 1f0y (residues

1573:  from A-12 to A-203).

1574:  Fig.

1575: \begin_inset LatexCommand \ref{figHAD}

1576:

1577: \end_inset

1578:

1579:  shows Top10 baseline and SDR predictions.

1580:  The catalytically important pair of Glu-170 and His-158 is identified as SDRs.

1581:  Ser-137, interesting due to its contact with substrate as well as NAD,

1582:  is also identified as SDR.

1583:  With the exceptions of Leu-122, Ala-35 (baseline) and Gly-29, Ala-107 (SDR),

1584:  all other predictions are within interacting distance of either NAD or

1585:  substrate.

1586:  Ser-61 and Lys-68 are not detected due to their high entropy.

1587: \layout Standard

1588:

1589:

1590: \begin_inset Float figure

1591: wide false

1592: collapsed false

1593:

1594: \layout Caption

1595:

1596: SDR and functional residue predictions for 1f0y, a HAD.

1597:  Residue-coloring scheme same as Fig.

1598: \begin_inset LatexCommand \ref{figTransaminase}

1599:

1600: \end_inset

1601:

1602: .

1603: \layout Standard

1604: \align center

1605:

1606: \begin_inset Graphics

1607: 	filename figHAD.jpg

1608: 	width 100mm

1609:

1610: \end_inset

1611:

1612:

1613: \layout Standard

1614:

1615:

1616: \begin_inset LatexCommand \label{figHAD}

1617:

1618: \end_inset

1619:

1620:

1621: \end_inset

1622:

1623:

1624: \layout Subsection

1625:

1626: Tryptophan biosynthesis enzymes

1627: \layout Standard

1628:

1629: Phosphoribosylanthranilate (PRA) isomerase (TrpF) is a

1630: \begin_inset Formula $(\beta\alpha)_{8}$

1631: \end_inset

1632:

1633:  barrel enzyme; this barrel is the most common fold adopted by enzymes and is also

1634:  popular among non-enzymes.

1635:  TrpF (EC 5.3.1.24) shares its SCOP family (c.1.2.4) with indole-3-glycerol-phosphate

1636:  synthase (EC 4.1.1.48) and tryptophan synthase (EC 4.2.1.20), which are all involved

1637:  in Trp biosynthesis.

1638:  Top10 baseline and SDR predictions are shown in Fig.

1639: \begin_inset LatexCommand \ref{figTRPF}

1640:

1641: \end_inset

1642:

1643: .

1644:  His-83 and Arg-36, considered important for catalysis, are predicted.

1645:  Gln-81 (Glu in Trp synthase 1kfc), predicted as baseline and SDR, could

1646:  be important for catalysis due to its location.

1647:  A few baseline predictions are far from the active site, and their conservation

1648:  suggests a protein-protein binding interface.

1649:  Predicted SDRs lie close to ligand and are either replaced by other residues

1650:  in Trp synthase (Arg-36 to Asn) or deleted (Gln-184, Asp-178), which suggests

1651:  that they could be specificity determining.

1652: \layout Standard

1653:

1654:

1655: \begin_inset Float figure

1656: wide false

1657: collapsed false

1658:

1659: \layout Caption

1660:

1661: SDR and functional residue predictions for TrpF.

1662:  Residue-coloring scheme same as Fig.

1663: \begin_inset LatexCommand \ref{figTransaminase}

1664:

1665: \end_inset

1666:

1667: .

1668: \layout Standard

1669: \align center

1670:

1671: \begin_inset Graphics

1672: 	filename figTRPF.jpg

1673: 	width 100mm

1674:

1675: \end_inset

1676:

1677:

1678: \layout Standard

1679:

1680:

1681: \begin_inset LatexCommand \label{figTRPF}

1682:

1683: \end_inset

1684:

1685:

1686: \end_inset

1687:

1688:

1689: \layout Subsection

1690:

1691: tRNA synthetases

1692: \layout Standard

1693:

1694: Aminoacyl-tRNA synthetases catalyze the process of attaching an amino acid

1695:  to its tRNA carrier so that it can be incorporated into a protein.

1696:  SCOP family c.26.1.1 contains tyrosyl-tRNA synthetase (EC 6.1.1.1) along with

1697:  other (Trp-, Glu-, Gln-) tRNA synthetases.

1698:  Fig.

1699: \begin_inset LatexCommand \ref{figTyrTRNA}

1700:

1701: \end_inset

1702:

1703:  shows baseline and SDR predictions for tyrosyl-tRNA synthetase 1h3e from

1704:  a thermophilic bacterium T.

1705:  thermophilus (

1706: \begin_inset LatexCommand \citet{tyrTRNAclass12}

1707:

1708: \end_inset

1709:

1710: ).

1711:  Residues important for catalysis from 51-HIGH and 233-KMSKS regions are

1712:  predicted as baseline (His-52, Gly-54, His-55, Lys-235).

1713:  Predicted SDRs lie close to the substrate and cofactor.

1714:  Residues specific for L-tyrosine binding, according to

1715: \begin_inset LatexCommand \citet{tyrTRNAspecificity}

1716:

1717: \end_inset

1718:

1719:  (e.g.

1720:  Thr-80, Tyr-175, Gln-179, Asp-182, Glu-197), are detected.

1721:  Note that substrate similarity makes 2 broad divisions in this family correspon

1722: ding to Trp/Tyr and Glu/Gln, each of which is subdivided into finer groups.

1723:  Table

1724: \begin_inset LatexCommand \ref{tRnaTable}

1725:

1726: \end_inset

1727:

1728:  shows residues structurally aligned to SDRs in these tRNA synthetases.

1729: \layout Standard

1730:

1731:

1732: \begin_inset Float table

1733: wide false

1734: collapsed false

1735:

1736: \layout Caption

1737:

1738: Residues in other tRNA synthetases aligned to predicted SDRs in tyrosyl

1739:  tRNA synthetase.

1740: \layout Standard

1741: \align center

1742:

1743: \begin_inset Graphics

1744: 	filename tRNAtable.pdf

1745: 	width 150mm

1746:

1747: \end_inset

1748:

1749:

1750: \layout Standard

1751:

1752:

1753: \begin_inset LatexCommand \label{tRnaTable}

1754:

1755: \end_inset

1756:

1757:

1758: \end_inset

1759:

1760:

1761: \layout Standard

1762:

1763:

1764: \begin_inset Float figure

1765: wide false

1766: collapsed false

1767:

1768: \layout Caption

1769:

1770: SDR and functional residue predictions for 1h3e (tyrosyl tRNA synthetase).

1771:  Residue-coloring scheme same as Fig.

1772: \begin_inset LatexCommand \ref{figTransaminase}

1773:

1774: \end_inset

1775:

1776: .

1777: \layout Standard

1778: \align center

1779:

1780: \begin_inset Graphics

1781: 	filename figTYRtRNA.jpg

1782: 	width 100mm

1783:

1784: \end_inset

1785:

1786:

1787: \layout Standard

1788:

1789:

1790: \begin_inset LatexCommand \label{figTyrTRNA}

1791:

1792: \end_inset

1793:

1794:

1795: \end_inset

1796:

1797:

1798: \layout Standard

1799:

1800: Residues distinct for each substrate-group could be specific for it, e.g.

1801:  Gln-179.

1802:  Detection of residue Tyr-175 as SDR suggests that there could be more functions

1803:  associated with this structural family than these four AATSs.

1804:  Detection of residues close to cofactor indicates different/no cofactors

1805:  used by other functions of this structural family.

1806:  Some residues speculated by

1807: \begin_inset LatexCommand \citet{tyrTRNAspecificity}

1808:

1809: \end_inset

1810:

1811:  to be functional remain undetected, e.g.

1812:  Asn-128 which is not predicted due to high entropy (Ser dominates the MSSA

1813:  column, not Asn).
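
\layout Standard

The effect described above, where a residue of interest is hidden because another residue dominates a variable alignment column, can be illustrated with a plain Shannon entropy over one column. This is a sketch under stated assumptions: the entropy measure used in this work may differ from plain Shannon entropy, and the function name column_stats and the example columns are hypothetical.

```python
import math
from collections import Counter


def column_stats(column):
    """Return (entropy_in_bits, dominant_residue, dominant_fraction) for one
    alignment column given as a string of one-letter codes; gaps ('-') are
    ignored. Plain Shannon entropy is an assumption of this sketch."""
    residues = [r for r in column if r != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    dominant, n_dominant = counts.most_common(1)[0]
    return entropy, dominant, n_dominant / total


# Hypothetical columns: one almost invariant, one variable and Ser-dominated.
well_conserved = "Y" * 49 + "F"
variable = "SSSSSSNNNT"
for name, col in [("conserved", well_conserved), ("variable", variable)]:
    entropy, dominant, fraction = column_stats(col)
    print(f"{name}: entropy={entropy:.2f} bits, dominant={dominant} ({fraction:.0%})")
```

A query residue that is not the dominant residue of a high-entropy column, as in the variable example, would not be expected to rank among the top predictions under this kind of measure.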

1814: \layout Section

1815:

1816: Conclusion

1817: \layout Standard

1818:

1819: We have combined structural and sequence information, functional annotation,

1820:  residue entropy and environment-specific substitution tables to predict

1821:  specificity determining residues.

1822:  We tested the predictions using information on specific ligands and,

1823:  in some cases, published literature.

1824:  We found that the predictions are far from random and functionally relevant,

1825:  which suggests that our approach is effective.

1826:  Predictions obtained with functional annotation (SDRs) and without it (baseline

1827: ) are different, suggesting that available functional annotation is valuable.

1828:  SDR and baseline predictions are complementary because they enlarge the

1829:  set of functionally significant residues that can be computationally identified.

1830:  We expected and found that our method cannot identify significant residues

1831:  in the absence of high-quality evolutionary information, hence the importance

1832:  of identifying chemically interesting patches remains undiminished.

1833:  A major concern is how to obtain functional partitions in the absence of annotation

1834: , which is similar to establishing ortho/paralogy relationships.

1835:  We plan to explore structure-sequence scoring schemes that would help establish

1836:  functional partitions reliably.

1837:  Alternatively, it would be useful to analyze the effects of constructing

1838:  a functional partition based on sequence identity.

1839:  We plan to use residue proximity information and residue contact conservation

1840:  to detect clusters which may not be conserved in the obvious sense.

1841:  We expect that cluster identification will alleviate the problem of not

1842:  identifying structurally conserved residues.

1843:  The most important purpose of SDR and catalytic residue identification

1844:  is to help classify SNPs into normal/deleterious classes and this would

1845:  be an important avenue to explore in the near future.

1846: \layout Subsection*

1847:

1848: Acknowledgements

1849: \layout Standard

1850:

1851: We thank Dr Kenji Mizuguchi and Dr Vijayalakshmi Chelliah for helpful discussion

1852: s.

1853:  Swanand Gore thanks Cambridge Commonwealth Trust and Universities UK Overseas

1854:  Research Studentship for funding.

1855: \layout Standard

1856:

1857:

1858: \begin_inset LatexCommand \BibTeX[marko]{sdr}

1859:

1860: \end_inset

1861:

1862:

1863: \the_end

1864: