0505:q-bio0505050/ms.tex

1: \documentclass{article}

2: \pagestyle{plain}

3: \usepackage{epsfig}

4: %\setlength{\textwidth}{22pc}

5: \begin{document}

6:

7: \title{{\bf HLA and HIV Infection Progression: Application of the Minimum Description Length Principle to Statistical Genetics}}

8:

9: \author{Peter T. Hraber$^{\ast,\dag}$,

10: Bette T. Korber$^{\ast,\dag}$,

11: Steven Wolinsky$^\ddag$,\\

12: Henry Erlich$^\S$,

13: Elizabeth Trachtenberg$^\P$, and

14: Thomas B. Kepler$^{\ast,||}$\\

15: \\

16: \normalsize $^\ast$Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM 87501\\

17: \normalsize $^\dag$Los Alamos National Laboratory, Los Alamos NM 87545\\

18: \normalsize $^\ddag$Feinberg School of Medicine, Northwestern University,\\

19: \normalsize 676 North St. Clair, Suite 200, Chicago IL 60611\\

20: \normalsize $^\S$Roche Molecular Systems, 1145 Atlantic Avenue, Alameda CA 94501\\

21: \normalsize $^\P$Children's Hospital Oakland Research Institute,\\

22: \normalsize 5700 Martin Luther King Jr. Way, Oakland CA 94609\\

23: \normalsize $^{||}$Department of Biostatistics and Bioinformatics, and\\

24: \normalsize Center for Bioinformatics \& Computational Biology,\\

25: \normalsize Box 90090, Duke University, Durham NC 27708\\

26: }

27: \date{}

28: \maketitle

29:

30: \thispagestyle{empty}

31: \normalsize

32:

33: \begin{center}

34: \begin{description}

35:

36: \item {\bf Classification}\\

37: Biological Science/Immunology \& Physical Science/Applied Mathematics\\

38:

39: \item {\bf Corresponding author}\\

40: Peter T. Hraber\\

41: address: Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM 87501\\

42: phone: +1 (505) 984-8800\\

43: fax: +1 (505) 982-0565\\

44: email: pth@santafe.edu\\

45:

46: \item {\bf Manuscript information}\\

47: Text pages: 14\\

48: Figures: 1\\

49: Tables: 2\\

50: Words in abstract: 245 ($<250$)\\

51: Character count: $45536$ ($<47000$)\\

52:

53: \item {\bf Nonstandard abbreviations}\\

54: MACS: multicenter AIDS cohort study\\

55: MDL: minimum description length\\

56:

57: \end{description}

58: \end{center}

59: \newpage

60:

61: \vspace{3 in}

62: \begin{abstract}

63:

64: The minimum description length (MDL) principle was developed in the

65: context of computational complexity and coding theory. It states that

66: the best model to account for some data minimizes the sum of the

67: lengths, in bits, of the descriptions of the model and the data as

68: encoded via the model. The MDL principle gives a criterion for

69: parameter selection, by using the description length as a test

70: statistic. Class I HLA genes play a major role in the immune response

71: to HIV, and are known to be associated with rates of progression to

72: AIDS. However, these genes are highly polymorphic, making it difficult

73: to associate alleles with disease outcome, given statistical issues of

74: multiple testing. Application of the MDL principle to immunogenetic

75: data from a longitudinal cohort study (Chicago MACS) enables

76: classification of alleles associated with plasma HIV RNA abundance, an

77: indicator of infection progression. We recently reported that MDL

78: analysis of the relationship of HLA supertypes (a classification of

79: alleles by epitope-binding anchor motifs) with HIV RNA levels

80: identifies associations between human genotype and viral RNA. Details

81: of the MDL approach and more extended analyses of HLA and viral RNA

82: are described here. Variation in progression is strongly associated

83: with HLA-B. Allele associations with viral levels support and extend

84: previous studies. In particular, individuals without {\em B58s}

85: supertype alleles average viral RNA levels 3.6-fold greater than

86: individuals with them. Mechanisms for these associations include

87: variation in epitope specificity and selection that favors rare

88: alleles.

89: \end{abstract}

90: \newpage

91:

92: Progression of HIV infection is characterized by three phases: acute

93: (or early), chronic, and AIDS, the final phase of infection preceding

94: death \cite{McMichael01}. The chronic phase is variable in duration,

95: lasting ten years on average, but varying from two to twenty years.  A

96: good predictor of the duration of the chronic phase is the viral RNA

97: level during chronic infection, with higher levels consistently

98: associated with more rapid progression than lower levels

99: \cite{Mellors96}. A major challenge for treating HIV and developing

100: effective vaccination strategies is to understand what contributes to

101: variation in plasma viral RNA levels, and hence to infection

102: progression.

103:

104: The cell-mediated immune response identifies and eliminates infected

105: cells from an individual. A central role in this response is played by

106: the major histocompatibility complex (MHC), in humans, also known as

107: human leukocyte antigens (HLA). Two classes of HLA genes code for

108: co-dominantly expressed cell-surface glycoproteins that present

109: processed peptides to circulating T-cells, which discriminate between

110: self and non-self \cite{Germain,WilliamsReview}.

111:

112: Class I HLA molecules are expressed on all nucleated cells except germ

113: cells. In infected cells, they bind and present antigenic peptide

114: fragments to T-cell receptors on CD8$^+$ T-lymphocytes, which are

115: usually cytotoxic and cause lysis of the infected cell. Class II HLA

116: molecules are expressed on immunogenetically reactive cells, such as

117: dendritic cells, B-cells, macrophages, and activated T-cells. They

118: present antigen peptide fragments to T-cell receptors on CD4$^+$

119: T-lymphocytes and the interaction results in release of cytokines that

120: stimulate the immune response.

121:

122: Human HLA loci are among the most diverse known \cite{Bodmer,Little}.

123: This diversity provides a repertoire to recognize evolving antigens

124: \cite{Little,Hill}. Previous studies of associations between

125: HLA alleles and variation in progression of HIV-1 infection have

126: established that within-host HLA diversity helps to inhibit viral

127: infection, by associating degrees of heterozygosity with rates of HIV

128: disease progression \cite{Roger}. Thus, homozygous individuals,

129: particularly at the HLA-B locus, suffer a greater rate of progression

130: than do heterozygotes \cite{Roger,Carrington99}.  Identifying which

131: alleles are associated with variation in rates of infection

132: progression has been difficult, due in part to the compounding of

133: error rates incurred when testing many alternative hypotheses, and

134: published results do not always agree \cite{otherMS,Trachtenberg01}.

135:

136: This study demonstrates the use of an information-based criterion for

137: statistical inference. Its approach to multiple testing differs from

138: that of standard analytic techniques, and provides the ability to

139: resolve associations between variation in HIV RNA abundance and

140: variation in HLA alleles.

141:

142: As an application of computational complexity and optimal coding

143: theory to statistical inference, the minimum description length (MDL)

144: principle states that the best statistical model, or hypothesis, to

145: account for some observed data is the model that minimizes the sum of

146: the number of bits required to describe both the model and the data

147: encoded via the model \cite{Rissanen,Li93,HansenYu}. It is a

148: model-selection criterion that balances the need for parsimony and

149: fidelity, by penalizing equally for the information required to

150: specify the model and the information required to encode the residual

151: error.

152:

153: The analyses detailed below apply the MDL principle to the problem of

154: partitioning individuals into groups having similar HIV RNA levels,

155: based on HLA alleles present in each case.

156:

157: \subsection*{Chicago MACS HLA \& HIV Data}

158:

159: The Chicago Multicenter AIDS Cohort Study (MACS) provided an

160: opportunity to analyze a detailed, long-term, longitudinal set of

161: clinical HIV/HLA data \cite{otherMS}. Each participant provided

162: informed consent in writing. Of 564 HIV-positive cases sampled in the

163: Chicago MACS, 479 provided information about both the rate of disease

164: progression and HLA genetic background. Progression was indicated by

165: the quasi-stationary ``set-point'' viral RNA level during chronic

166: infection. Immunogenetic background was obtained by determining which

167: HLA alleles from class I (HLA-A, -B, and -C) and class II (HLA-DRB1,

168: -DQB1, and -DPB1) loci were present in each individual.

169:

170: Viral RNA set-point levels were determined after acute infection and

171: prior to any therapeutic intervention or the onset of AIDS, as defined

172: by the presence of an opportunistic infection or CD4$^+$ T-cell count

173: below 200 cells per $\mu$l of blood. Because the assay has a detection

174: threshold of 300 copies of virus per ml \cite{otherMS},

175: maximum-likelihood estimators were adjusted to avoid biased estimates

176: of population parameters from a truncated, or censored, sample

177: distribution \cite{normal}. Viral RNA levels were log-transformed so

178: as better to approximate a normal distribution.

179:

180: High-resolution class I and II HLA genotyping \cite{otherMS} provided

181: four-digit allele designations, though analyses were generally

182: performed using two-digit allele designations because of the resulting

183: reduction of allelic diversity and increased number of samples per

184: allele. Because of the potential for results to be confounded by an

185: effect associated with an individual's ethnicity or revised sampling

186: protocol, two separate analyses were performed, one using data from

187: the entire cohort, and another using only data from Caucasian

188: individuals. Sample numbers were too small to study other subgroups

189: independently.

190:

191: HLA supertypes group class I alleles by their peptide-binding anchor

192: motifs \cite{supertypes}. Assignment of four-digit allele designations

193: to functionally related groups of supertypes at HLA-A and -B loci

194: facilitated further analysis. Where they could be determined, HLA-A

195: and HLA-B supertypes were assigned from four-digit allele designations

196: \cite{otherMS}. As with two-digit allele designations for each locus,

197: HLA-A and -B supertypes were assessed for association with viral RNA

198: levels. Cases having other alleles were withheld from classification

199: and subsequent analysis of supertypes.

200:

201: A description length analysis determined whether HIV RNA levels were

202: non-trivially associated with alleles at any HLA locus.

203:

204: \subsection*{Description Lengths}

205:

206: The challenge of data classification is to find the best partition,

207: such that observations within a group are well-described as

208: independent draws from a single population, but differences in

209: population distributions exist between groups. Whether the data are

210: better represented as two groups, or more, than as one depends on the

211: description lengths that result.

212:

213: We use the family of Gaussian distributions to model viral RNA

214: levels. While the MDL strategy can be applied using any probabilistic

215: model, a log-normal distribution is a good choice for the observed

216: plasma viral RNA values. First, the description length of the model

217: and of the data given the model, $L_1$, is calculated as described below,

218: grouping all of the observations into one normal distribution.

219: Next, the log-RNA values are partitioned into two groups

220: according to the HLA alleles present, so as to

221: minimize the description length, $L_2$, under the constraint that two Gaussian

222: distributions, each having its own mean and variance, are used to

223: model the data.

224:

225: For fixed $n \times n$ covariance matrix $\Sigma$, the description

226: length is $L_\Sigma = \frac{1}{2}\log |\Sigma| + \frac{1}{2}

227: Y'\Sigma^{-1}Y + C$, where $Y$ is the $n$-component vector of

228: observations and $C$ is the quantity of information required to

229: specify the partition. Logarithms are computed in base two, with

230: fractional values rounded upwards, so that the resulting units are

231: bits. The description length of interest results from integrating $L$

232: over all covariance matrices with the appropriate structure. In

233: practice, we use Laplace's approximation for the integral

234: \cite{Rissanen,Lindley} which gives, asymptotically, $L =

235: \frac{1}{2}\log |\hat{\Sigma}| + \frac{1}{2} Y'\hat{\Sigma}^{-1}Y +

236: \frac{k}{2} \log n + C$, where $k$ is the number of free parameters in

237: the covariance model, and $\hat{\Sigma}$ is the specific covariance

238: matrix of the appropriate structure that minimizes $L_\Sigma$.

239: A more detailed account appears in the Appendix.

240:

241: The analog of a null hypothesis is the assumption that one group of

242: alleles is sufficient to account for the variation in viral RNA. The

243: description length for one group is: $L_1=\frac{1}{2}\left(n+(n-1)\log

244: s^2 +\log n\overline{x}^2+2\log n\right)$, where $n$ is the total

245: number of observations, $s^2$ is the maximum-likelihood estimate of

246: the population variance and $\overline{x}$ is the sample mean,

247: computed as the Winsorized mean \cite{normal} because of truncation

248: below the sensitivity limit of the RNA assay.

249:

250: %In general, $L = \frac{1}{2}\log |\Sigma| + \frac{1}{2} Y'\Sigma^{-1}Y

251: %+ C$, where $\Sigma$ is the $n \times n$ covariance matrix, $Y$ is

252: %the observation vector, $C$ represents the information required to

253: %specify the partition, and logarithms are computed in base two, with

254: %fractional values rounded upwards, so that the resulting units are

255: %bits. A more detailed account appears in the Appendix.

256:

265:

266: It follows that the description length for two groups can be computed

267: as:

268:

269: \[ L_{2}=\frac{1}{2} \sum_{i=1}^2 \left( n_i + (n_i-1)\log s_i^2 + \log n_i\overline{x}_i^2 + 2 \log n_i \right) + C,\]

270:

271: where $C$ is an adjustment for performing multiple comparisons.

272: Because additional information is required to specify the optimum

273: partition, the description length is increased by a quantity related

274: to the number of partitions evaluated, such that $C = N \log k$ bits,

275: where $N$ is the number of alleles observed at the partitioned

276: locus and $k$ is the number of groups. For $k=2$, $C=N$.

277:
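As a minimal sketch of the expressions for $L_1$ and $L_2$ above, the following Python fragment evaluates the per-group term and the partition cost $C = N \log k$. It is an illustration only, not the C implementation used for the analyses. The function names are hypothetical, $\log n\overline{x}^2$ is read as $\log(n\overline{x}^2)$, the ordinary sample mean stands in for the Winsorized mean, and the rounding of fractional bits is omitted. Each group needs a nonzero mean and variance for the logarithms to be defined.

```python
import math

def group_bits(values):
    """Per-group description length in bits under a Gaussian model (sketch)."""
    n = len(values)
    mean = sum(values) / n
    # Maximum-likelihood variance: divide by n, not n - 1.
    var = sum((v - mean) ** 2 for v in values) / n
    return 0.5 * (n + (n - 1) * math.log2(var)
                  + math.log2(n * mean ** 2) + 2 * math.log2(n))

def partition_cost_bits(n_alleles, k):
    """C = N log2(k): bits needed to state which group each allele is in."""
    return n_alleles * math.log2(k)

def two_group_bits(group_a, group_b, n_alleles):
    """L_2: the two per-group terms plus the partition cost for k = 2."""
    return (group_bits(group_a) + group_bits(group_b)
            + partition_cost_bits(n_alleles, 2))
```

Under these assumptions, a group holding the two log-RNA values 1 and 3 contributes 3.5 bits, and a two-way split of a locus with 30 alleles adds a partition cost of 30 bits.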

278: Further partitions of alleles into more than two groups might yield a

279: shorter description length, computed as a summation over terms in the

280: equation for $L_{2}$ for each of the $k$ distinct groups.

281:

282: The shortest description length for any value of $k$ indicates the

283: best choice of model parameters, including the number of parameters,

284: and hence, the optimum partition of $N$ alleles into $k$ groups. We

285: denote this as $L^*$.

286:

287: \subsection*{Algorithm}

288:

289: The minimum description length is found by iteratively computing the

290: description length for each possible partition of alleles into groups

291: and taking the minimum as optimal. Iteration consists first of

292: determining the number of alleles, $N$, at a particular locus, and

293: then incrementing through each of the $k^N$ possible partitions

294: of alleles into $k$ groups, computing the associated description

295: length, and reporting the best results. Each iteration evaluates one

296: possible mapping of alleles to groups. Searching through all possible

297: partitions using the description length as an optimality criterion

298: ensures selection of the best partition as a result of the search.

299:

300: In this mapping, the ordering of groups is informative, because the

301: ordering gives the relative dominance of alleles for diploid loci. An

302: individual having an allele assigned to the first-order group is

303: assigned to that group. Otherwise, the individual is assigned to the

304: next appropriate group. Two individuals sharing one allele might be

305: placed in either the same group or different groups, depending on the

306: mapping of alleles to groups in a particular iterate. For example,

307: consider how one might group two individuals, one with alleles {\em A1}

308: and {\em A2} at some locus, and another with alleles {\em A2} and {\em

309: A3}. Whether or not they are grouped together depends on the assignment

310: of alleles to groups, and can be done several different ways. The

311: algorithm enumerates each possible assignment of alleles to groups.

312:
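The mapping of alleles to ordered groups can be sketched in a few lines of Python. This is a hedged illustration of the rule described above, not the original C code: the names \texttt{assign\_individual} and \texttt{enumerate\_mappings} are hypothetical, groups are numbered from 0 (most dominant), and the enumeration includes degenerate mappings that leave a group empty.

```python
from itertools import product

def assign_individual(genotype, allele_to_group):
    """Place an individual in the most dominant (lowest-numbered) group
    among the groups of the alleles it carries."""
    return min(allele_to_group[a] for a in genotype)

def enumerate_mappings(alleles, k):
    """Yield every mapping of alleles to groups 0..k-1 (k**N mappings)."""
    for labels in product(range(k), repeat=len(alleles)):
        yield dict(zip(alleles, labels))

# The example from the text: A1/A2 and A2/A3 individuals, two groups.
alleles = ["A1", "A2", "A3"]
person_1 = ("A1", "A2")
person_2 = ("A2", "A3")
together = sum(
    assign_individual(person_1, m) == assign_individual(person_2, m)
    for m in enumerate_mappings(alleles, 2)
)
```

For this three-allele example, the two individuals share a group in 6 of the 8 possible mappings and are separated in the other 2, which is why the shared allele alone does not decide the grouping.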

313: The extent of the search scales as $k^N$. In practice, the most

314: diverse locus was HLA-B, with 30 alleles when analyzed using two-digit

315: allele designations. For two groups, this gives $2^{30} \approx

316: 10^{9}$ possible partitions. Serial iteration on an UltraSPARC-IIi

317: 440~MHz CPU (Sun Microsystems) requires roughly 36 hours for

318: completion. A parallel implementation requires no message passing, so

319: computing time scales inversely with the number of CPUs:

320: doubling the available processors halves the time for iteration. With

321: many CPUs, the search space of $2^{30}$ partitions can be exhaustively

322: evaluated in an hour or less. Unfortunately, exhaustively evaluating

323: all three-way partitions is prohibitive, as $3^{30} \approx 2 \times

324: 10^{14}$, roughly a 200,000-fold increase in computational effort.

325: Supertype classification reduced the diversity of possible partitions

326: and enabled partitioning of the data into more than two groups.

327:
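The size of the search can be checked with exact integer arithmetic. The sketch below assumes ideal scaling, in which run time is simply the serial time divided by the number of processors. The helper names are hypothetical, and the 64-CPU figure is only an example, not a configuration reported here.

```python
def n_mappings(k, n_alleles):
    """Number of ways to map n_alleles alleles onto k ordered groups."""
    return k ** n_alleles

two_way = n_mappings(2, 30)     # 1073741824, about 1.1e9
three_way = n_mappings(3, 30)   # 205891132094649, about 2.1e14
ratio = three_way / two_way     # 1.5**30, about 1.9e5

def hours_needed(serial_hours, n_cpus):
    """Ideal run time when the work divides evenly with no communication."""
    return serial_hours / n_cpus

example_hours = hours_needed(36, 64)  # 0.5625 hours on 64 hypothetical CPUs
```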

328: The algorithm was implemented in C and will be distributed on request.

329:

330: \subsection*{Class I \& II HLA Results}

331:

332: The description length for the entire cohort as one group is $L_1=934$

333: bits; for the Caucasian subsample, it is $L_1=721$ bits. In general,

334: $L_1 < L_2$ at most loci (Table~1), so the MDL criterion does not

335: support partitioning alleles into groups that are predictive of high

336: or low RNA levels, except at HLA-B, where $L_2 < L_1$. In the

337: subsample, partitioning HLA-C or HLA-DQB1 alleles can also provide

338: preferred two-way splits, though not as well as HLA-B.  Further

339: partitioning was intractable because of great allelic diversity, as

340: previously mentioned. Partitions of HLA-B alleles provide the best

341: groupings among all loci. Because $L_2^* < L_1$, two groups,

342: partitioned by HLA-B alleles, provide a better description than one

343: (Fig.~1a and 1b).

344:

345: What is the composition of the optimum groupings?  For the entire

346: cohort, the following alleles were associated with low viral RNA

347: levels: {\em B*13}, {\em B*27}, {\em B*38}, {\em B*45}, {\em B*49},

348: {\em B*57}, {\em B*58}, and {\em B*81}. The remaining alleles,

349: associated with greater viral RNA than the first group, are: {\em

350: B*07}, {\em B*08}, {\em B*14}, {\em B*15}, {\em B*18}, {\em B*35},

351: {\em B*37}, {\em B*39}, {\em B*40}, {\em B*41}, {\em B*42}, {\em

352: B*44}, {\em B*47}, {\em B*48}, {\em B*50}, {\em B*51}, {\em B*52},

353: {\em B*53}, {\em B*55}, {\em B*56}, {\em B*67}, and {\em B*82}. As

354: described earlier, having any alleles associated with the first group

355: is sufficient for an individual to be assigned to the group having

356: lower viral RNA.

357:

358: How robust are these assignments of alleles to groups?  Four

359: alternative groupings provide description lengths within one bit of

360: the optimum. They do not dramatically rearrange the assignment of

361: individuals to groups, but do provide insight as to which alleles are

362: assigned to either group with less confidence. Among near-optimal

363: partitions, alleles {\em B*82} and {\em B*67} were assigned to groups

364: other than in the optimum partition.

365:

366: In the Caucasian subsample, alleles {\em B*13}, {\em B*27}, {\em

367: B*40}, {\em B*45}, {\em B*48}, {\em B*49}, {\em B*57}, and {\em B*58}

368: are associated with lower viral RNA, and the remaining alleles, {\em

369: B*07}, {\em B*08}, {\em B*14}, {\em B*15}, {\em B*18}, {\em B*35},

370: {\em B*37}, {\em B*38}, {\em B*39}, {\em B*41}, {\em B*44}, {\em

371: B*47}, {\em B*50}, {\em B*51}, {\em B*52}, {\em B*53}, {\em B*55}, and

372: {\em B*56}, or lack of any alleles from the first group, are

373: associated with greater viral RNA levels. Two nearly optimal

374: partitions assigned alleles {\em B*47} and {\em B*48} to the second

375: group.  Fig.~1 illustrates the distributions of viral RNA levels from

376: this subsample, as one group (Fig.~1c) and as the best partition at

377: HLA-B (Fig.~1d).

378:

379: To summarize the most robust inferences from the analyses of two-digit

380: allele designations, individuals having HLA-B alleles {\em B*13}, {\em

381: B*27}, {\em B*45}, {\em B*49}, {\em B*57}, or {\em B*58} were

382: associated with lower viral RNA levels than their counterparts lacking

383: these alleles.

384:

385: The groupings obtained via the MDL approach were evaluated with more

386: traditional means for statistical inference: a two-tailed, two-sample

387: Welch modified $t$-test, which does not assume equal variances, and a

388: non-parametric alternative, the Wilcoxon rank-sum test \cite{Venables}.

389: The comparison was very favorable. In each case, the null hypothesis was that of no

390: difference between the group mean log-transformed viral RNA levels,

391: and the alternative hypothesis was that the means differ. Both tests

392: agreed in rejecting the null hypothesis in favor of the alternative

393: ($P<10^{-10}$).

394:
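For readers unfamiliar with the Welch test, a minimal sketch of its statistic follows. It uses only the Python standard library and stops short of a $P$-value, which would require the $t$ distribution. The function name and the toy inputs in the usage note are hypothetical, not cohort data.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of
    freedom for two samples with possibly unequal variances."""
    vx = variance(x) / len(x)   # squared standard error of mean(x)
    vy = variance(y) / len(y)   # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df
```

With the toy samples \texttt{[1, 2, 3]} and \texttt{[2, 4, 6]}, the statistic is about $-1.55$ on $50/17 \approx 2.94$ degrees of freedom.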

395: \subsection*{HLA Supertype Results}

396:

397: Assigning the diploid, co-dominantly expressed HLA-A alleles to four

398: HLA-A supertypes \cite{supertypes}, {\em A1s}, {\em A2s}, {\em A3s},

399: and {\em A24s}, was possible for 399 individuals. The mapping of HLA-B

400: alleles to five supertypes, {\em B7s}, {\em B27s}, {\em B44s}, {\em

401: B58s}, and {\em B62s}, was made for 352 individuals. The resulting

402: decrease in allelic diversity enabled analysis for $k>2$.

403:

404: Description lengths of the best $k$-way partitions of supertype

405: alleles for HLA-A supertypes are: $L_1=793$, $L_2=782$, $L_3=789$, and

406: $L_4=794$ bits. The best description length results from a two-way

407: split, though a three-way split also yields a shorter description

408: length than that obtained from one group. The best partition of HLA-A

409: supertypes assigned individuals having {\em A1s} alleles to the low

410: RNA group.

411:

412: For HLA-B supertypes, $L_1=704$, $L_2=691$, $L_3=693$, and $L_4=697$

413: bits (Fig.~1e). The best model results when $k=2$. Overall,

414: individuals lacking {\em B58s} alleles averaged viral RNA levels

415: $3.6$-times greater than individuals having {\em B58s} supertype

416: alleles (Fig.~1f). Thus, individuals with {\em B58s} alleles have

417: significantly lower viral RNA levels than individuals without them.

418:
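Selecting the number of groups amounts to taking the minimum over the reported description lengths. The sketch below uses the supertype values quoted above; the dictionary and function names are hypothetical.

```python
# Best k-way description lengths in bits, as quoted in the text.
hla_a_bits = {1: 793, 2: 782, 3: 789, 4: 794}
hla_b_bits = {1: 704, 2: 691, 3: 693, 4: 697}

def best_k(bits_by_k):
    """Number of groups with the shortest description length."""
    return min(bits_by_k, key=bits_by_k.get)

def savings_vs_one_group(bits_by_k):
    """Bits saved by each k relative to the one-group description."""
    return {k: bits_by_k[1] - bits for k, bits in bits_by_k.items()}
```

Both loci select $k=2$. The two-way split saves 11 bits at HLA-A and 13 bits at HLA-B relative to one group, while a four-way split at HLA-A costs one bit more than no split at all.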

419: Table~2 summarizes results of assigning HLA-B associations to high or

420: low viral-RNA categories as two-digit allele designations from both

421: the entire cohort and the Caucasian subsample, and as supertypes for

422: those individuals having two alleles that could be assigned to a

423: supertype. Alleles not found in a sample are indicated by a dash. The

424: {\em B*15} alleles are not shown because their high-resolution

425: genotype designations correspond to four different supertypes.

426:

427: Overall, the most consistent associations with low viral RNA are among

428: the {\em B58s}, and with high viral RNA, the {\em

429: B7s}. Inconsistencies in assignment to a category occur for the {\em

430: B*13}, {\em B*27}, {\em B*45}, and {\em B*49} alleles, which are in

431: the low viral-RNA group when analyzed as two-digit alleles, but the high viral-RNA

432: group when assigned to supertypes.

433:

434: When compared with alternative inferential techniques, the difference

435: between group viral RNA levels was highly significant. This and

436: agreement with alleles reported to be associated with variation in

437: viral RNA levels in previously published studies indicate that using

438: the description length as a test statistic can provide reliable

439: inferences.

440:

441: \subsection*{MDL \& Statistical Inference}

442:

443: The traditional statistical solution is to pose a question as follows:

444: suppose that the simpler model (e.g., one homogeneous population) were

445: actually true; call this the null hypothesis. How often would one, in

446: similar experiments, get data that look as different from that

447: expected under the null hypothesis as in the actual experiment?

448:

449: This technique has limitations when the partition that represents the

450: alternative hypothesis is not given in advance. There are then many

451: potential alternative partitions and the appropriate distribution

452: under the null hypothesis for this ensemble of tests is very difficult

453: to estimate. Furthermore, for proper interpretation, the outcome

454: relies upon the truth of the initial assumption: that the data are

455: distributed as dictated by the null hypothesis.

456:

457: An alternative is to choose that model that represents the data most

458: efficiently. Here, efficiency is the amount of information, quantified

459: as bits, required to transmit electronically both the model and the

460: data as encoded by the model. This criterion may not seem intuitively

461: clear on first exposure. However, it follows naturally from a profound

462: relationship between probability and coding theory that was

463: discovered, explored, and elaborated by Solomonoff, Kolmogorov,

464: Chaitin, and Rissanen

465: \cite{Kolmogorov65,Chaitin66,Chaitin87,Rissanen86,Rissanen99}.

466:

467: The idea is quite simple and elegant. It can be illustrated by analogy

468: to the problem of designing an optimal code for the efficient

469: transmission of natural-language messages. Consider the international

470: Morse code. Recall that Morse code assigns letters of the Roman

471: alphabet to codewords composed of dots (``$\cdot$'') and dashes

472: (``$-$''). The codewords do not all have the same number of dots

473: and/or dashes; it is a variable-length code.

474:

475: Efficient, compact encodings result from the design of a codebook such

476: that the shortest codewords are assigned to the most frequently

477: encoded letters and long codewords are assigned to rare letters. Thus,

478: {\em e} and {\em t} are encoded as ``$\cdot$'' and ``$-$'',

479: respectively, while {\em q} and {\em j} are encoded as ``$-$ $-$

480: $\cdot$ $-$'' and ``$\cdot$ $-$ $-$ $-$''. The theory of optimal

481: coding provides an exact relationship between frequency and code

482: length and thus, probability and description length.

483:
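The relationship between probability and code length can be made concrete with the standard result that a symbol of probability $p$ is assigned a codeword of about $-\log_2 p$ bits. The sketch below uses a hypothetical four-symbol alphabet, not measured letter frequencies, with probabilities chosen as powers of two so that the lengths are whole numbers.

```python
import math

def code_length_bits(p):
    """Codeword length in bits for a symbol of probability p,
    rounded up to a whole number of bits."""
    return math.ceil(-math.log2(p))

# Hypothetical alphabet: frequent symbols get short codewords.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {s: code_length_bits(p) for s, p in probs.items()}
expected_bits = sum(p * lengths[s] for s, p in probs.items())
```

Here the codeword lengths are 1, 2, 3, and 3 bits, and the expected length is 1.75 bits per symbol, compared with 2 bits for a fixed-length code over four symbols.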

484: The key departure of MDL from Morse-code-like schemes is that, while

485: Morse code would generally be good for sending messages over an

486: average of many texts, specific texts might be encoded even more

487: efficiently, by encoding not only letters, but letter combinations,

488: common words, or even phrases, perhaps as abbreviations or

489: acronyms. However, if one is to recode for particular texts, one must

490: first transmit the coding scheme. So perhaps one might use Morse code

491: to transmit the details of the new coding scheme and then transmit the

492: text itself with the new scheme. Whether this might yield greater

493: efficiency depends not only on how much compression is achieved in the

494: new encoding, but also on how much overhead is incurred in having to

495: transmit the coding scheme.

496:

497: The analogy to scientific data analysis is clear. A statistical model

498: is an encoding scheme that encapsulates the regularities in the data

499: to yield a concise representation thereof. The best model effectively

500: compresses regularities in the data, but is not so elaborate that its

501: own description demands a great deal of information to be encoded.

502: The MDL principle provides a model-selection criterion that balances

503: the need for a model that is both appropriate and parsimonious, by

504: penalizing with equal weights the information required to specify the

505: model and the unexplained, or residual error.

506:

507: %Another advantage of using the minimum description length principle as

508: %a test statistic is that it provides an objective criterion for

509: %selecting model parameters, including the optimum value for $k$, the

510: %number of groups. Thus, an algorithm iteratively evaluated the

511: %description length associated with partitioning $N$ alleles into $k$

512: %groups, by computing the description length $L$ associated with each

513: %possible partition, and taking as optimum the model with minimum

514: %description length $L^*$.

515:

516: %Further, model parameters, such as the appropriate number of groups,

517: %are readily optimized using the description length as a test

518: %statistic. We considered here several possibilities for the proper

519: %number of groups, $k$. Computing the true minimum description length

520: %was possible here because the combinatorial diversity of the search

521: %space could be exhaustively evaluated, with sufficiently low allelic

522: %diversity.

523:

524: Yet another contribution the MDL principle brings to statistical

525: modelling is that the penalty for multiple comparisons is less

526: restrictive than the penalty of compounded error rates incurred with

527: canonical inferential approaches. In order to maintain a desired

528: experiment-wide error rate, the standard adjustment is to make the

529: per-comparison error rate considerably more stringent. With current

530: technology, realistic sample sizes for such studies will generally be

531: less than a thousand and stringent significance levels will be

532: difficult to surpass. Unfortunately, fixing the false-positive error

533: rate does not address the false-negative probability, which may leave

534: researchers powerless to detect effects among many competing

535: hypotheses with limited samples.

536:
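To illustrate how quickly the per-comparison threshold tightens, the sketch below applies the Bonferroni adjustment, one common choice for such corrections and an assumption here rather than a method named in the text. The function names are hypothetical, and the familywise calculation assumes independent tests.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-comparison significance level under the Bonferroni adjustment."""
    return family_alpha / n_tests

def familywise_error(per_test_alpha, n_tests):
    """Chance of at least one false positive among independent tests."""
    return 1 - (1 - per_test_alpha) ** n_tests

per_test = bonferroni_alpha(0.05, 200)    # 0.00025 for 200 comparisons
uncorrected = familywise_error(0.05, 30)  # about 0.79 for 30 comparisons
```

Testing 30 alleles at an unadjusted level of 0.05 gives nearly a four-in-five chance of at least one false positive, while holding the experiment-wide rate at 0.05 over 200 comparisons demands $P < 0.00025$ for each one.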

537: %Testing 200 alternative hypotheses in this scenario, an approximation

538: %of the number of tests that might be performed on an HLA-disease

539: %outcome study given the alleleic diversity of these highly polymorphic

540: %loci, would yield $\alpha' = 0.00005$. Given the experimental

541: %complexity of defining these alleles and disease correlates, with

542: %current technology, realistic sample sizes for such studies will

543: %generally be less than a thousand individuals, and such stringent

544: %P-values will be difficult to obtain. Unfortunately, fixing the

545: %false-positive error rate does not address the false-negative

546: %probability, often leaving researchers with little statistical power

547: %to detect significant effects among many competing hypotheses with

548: %limited sample observations.

549:

550: %Model selection under MDL does not depend on the assumption of the

551: %truth of the simpler model, given the data. No such idea is required

552: %or even relevant. On the other hand, one can take the results of an

553: %MDL model selection procedure and connect back to traditional

554: %statistical approaches by taking the simpler model as null hypothesis

555: %and then asking, What are the type I and II error rates, $\alpha$ and

556: %$\beta$, under the MDL procedure?  One finds that both error rates

557: %decline as the sample size $n$ increases. This is a contrast to the

558: %usual procedure of fixing $\alpha$ and letting the power to detect

559: %differences ($1 - \beta$) alone decrease with $n$. Indeed, the power

560: %does increase more slowly under MDL than it would under fixed a level

561: %testing; the reduction of $\alpha$ is accomplished at that expense.

562:

563: %Potential drawbacks of using the MDL principle for inference in

564: %statistical genetics include the lack of an explicit $P$-value, which

565: %makes the approach difficult to explain to audiences accustomed to

566: %more common inferential techniques, and a requirement for data

567: %consisting of large sample observations.

568:

569: \subsection*{Mechanisms}

570:

571: Among HLA supertypes, individuals with {\em B58s} alleles have lower viral

572: RNA levels than those who lack them, even among homozygotic

573: individuals. Naturally, this leads one to consider mechanisms that

574: underlie patterns found in the data. Elsewhere, we consider two

575: hypotheses to explain the observed associations between HLA alleles

576: and variation in viral RNA \cite{otherMS}.

577:

578: There may be allele-specific variation in antigen-binding

579: specificity. Some alleles may have greater affinity than others for

580: HIV-specific peptide fragments due to the peptide-binding anchor

581: motifs they present. We were not able to identify any clear

582: association between the frequency of anchor motifs among HIV-1

583: proteins and viral RNA levels in the Chicago MACS \cite{otherMS},

584: though others have suggested that such a relationship might exist

585: \cite{Nelson}.

586:

587: It may also be the case that frequency-dependent selection has favored

588: rare alleles. Frequent alleles provide the evolving pathogen greater

589: opportunity to explore mutant phenotypes that may escape detection by

590: the host's immune response. By encountering rare alleles less

591: frequently, the virus has not had the same opportunity to explore

592: mutations that evade the host's defense response. This hypothesis is

593: corroborated by a significant association between viral RNA and HLA

594: allele frequency in the Chicago MACS sample \cite{otherMS}.

595:

596: Because their predictions differ, these hypotheses could be tested

597: with data from another cohort, where a different viral subtype

598: predominates. That is, if alleles other than those identified in this

599: study were associated with low viral RNA, and an association between

600: rare alleles and low viral RNA levels were observed there, then the

601: second hypothesis would be more viable than the first. Alternatively,

602: if a clear association between antigen peptide-binding anchor motifs

603: and variation in viral RNA levels were found, the first hypothesis

604: would be more viable. Other mechanisms are also possible, and

605: hypotheses by which to evaluate them merit consideration.

606:

607: \subsection*{Acknowledgments}

608: %\vspace{0.125 in}

609:

610: We thank Bob Funkhouser, Cristina Sollars, and Elizabeth Hayes for

611: sharing their expertise, and researchers of the Santa Fe Institute for

612: insight and inspiration. This research was financed by funds from the

613: Elizabeth Glaser Pediatric AIDS Foundation, the National Cancer

614: Institute, the National Institute of Allergy and Infectious Diseases,

615: National Institutes of Health, National Science Foundation award

616: \#0077503, and the U.S. Department of Energy.  We have no conflicting

617: interests.

618:

619:

620: \subsection*{Appendix}

621:

622: In Gaussian Process modeling \cite{Williams97}, the population means

623: are treated as random variables and integrated out of the

624: likelihood. The model is then specified entirely by the structure of

625: the covariance matrix $\Sigma$, which specifies how each pair of

626: observations is correlated. The covariance is greater for two

627: observations from the same partition than for two observations from

628: different partitions. Any given partition is specified entirely by a

629: corresponding covariance structure.

630:

631: {\bf Partitioning with Gaussian Models.}  Denote the $n$ observations

632: as the vector $Y$ and the covariance matrix with parameter vector

633: $\theta$ by $\Sigma(\theta)$. Let the number of components of $\theta$

634: (the number of free parameters in the covariance matrix) be $k$.  Then

635: the MDL for the given covariance structure is:

636: $L = \frac{1}{2}\log |\Sigma(\hat{\theta})| + \frac{1}{2}

637: Y'\Sigma(\hat{\theta})^{-1}Y + \frac{k}{2}\log n + C$, where $C$ is

638: the information required to specify the partition or, equivalently,

639: the covariance structure, and $\hat{\theta}$ is the vector of

640: covariance parameters evaluated at maximum likelihood.

641:

642: {\bf One Gaussian Population.}  The covariance matrix has a component

643: $\sigma ^2 _m$ for the covariance among observations, induced by their

644: sharing an unspecified mean, and an error component $\sigma ^2

645: _\varepsilon$: $\Sigma = \sigma ^2 _\varepsilon I + \sigma ^2 _m

646: \mathbf{11}'$, with $\mathbf{1}$ the column vector of all ones,

647: $\mathbf{11}'$ the matrix of all ones, and $I$ the identity

648: matrix. The inverse is:

649:

650: %Note that $k=2$ since there are two free parameters in the covariance matrix.

651:

652: \[\Sigma^{-1}= \frac{1}{\sigma ^2 _\varepsilon} \left( I - \frac{\sigma ^2 _m}{\sigma ^2 _\varepsilon + n \sigma ^2 _m} \mathbf{11}' \right), \]

653:

654: and the log-determinant: $\log |\Sigma | = (n-1)\log \sigma ^2 _\varepsilon + \log (\sigma ^2 _\varepsilon + n \sigma ^2 _m)$.

655:

656: This gives $L = \frac{1}{2} \left(n + (n-1)\log \sigma ^2 _\varepsilon + \log(\sigma ^2 _\varepsilon + n\sigma^2 _m ) + 2 \log n \right)$.
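
The inverse and log-determinant above can be checked directly from the definition of $\Sigma$. The following is a brief verification sketch; the shorthand $c$ is introduced here only for convenience and is not part of the original notation. Since $\Sigma\mathbf{1} = (\sigma^2_\varepsilon + n\sigma^2_m)\mathbf{1}$, and $\Sigma v = \sigma^2_\varepsilon v$ for any $v$ orthogonal to $\mathbf{1}$, the eigenvalues of $\Sigma$ are $\sigma^2_\varepsilon + n\sigma^2_m$ (once) and $\sigma^2_\varepsilon$ ($n-1$ times), which yields the stated log-determinant. For the inverse, write $c = \sigma^2_m/(\sigma^2_\varepsilon + n\sigma^2_m)$ and use $\mathbf{11}'\mathbf{11}' = n\,\mathbf{11}'$:

\[ \Sigma\Sigma^{-1} = \frac{1}{\sigma^2_\varepsilon}\left(\sigma^2_\varepsilon I + \sigma^2_m\mathbf{11}'\right)\left(I - c\,\mathbf{11}'\right) = I + \frac{1}{\sigma^2_\varepsilon}\left(\sigma^2_m - c\,(\sigma^2_\varepsilon + n\sigma^2_m)\right)\mathbf{11}' = I. \]

The coefficient of $\mathbf{11}'$ vanishes by the definition of $c$.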

657:

658: We find the maximum likelihood values of the parameters by minimizing

659: the description length over $\sigma^2_\varepsilon$ and $\sigma^2_m \ge 0$. There are two cases.\\

660:

661: Case 1: $n^2 \overline{Y}^2 - Y'Y \ge 0$. Here we have

662: $\hat{\sigma}^2 _\varepsilon =(n-1)^{-1}(Y'Y-n\overline{Y}^2)$ and

663: $\hat{\sigma}^2 _m =(n-1)^{-1}(n\overline{Y}^2 -\frac{1}{n}Y'Y)$, so

664: $L=\frac{1}{2}(n +(n-1)\log\hat{\sigma}^2 _\varepsilon+ \log n \overline{Y}^2 + 2\log n)$.\\

665:

666: Case 2: $n^2 \overline{Y}^2 - Y'Y < 0$. Here the common mean vanishes, giving

667: $\hat{\sigma} ^2 _\varepsilon = \frac{1}{n}Y'Y$, $\hat{\sigma}^2 _m = 0$, so

668: $L = \frac{n}{2}(1+\log \hat{\sigma}^2 _\varepsilon +\frac{2}{n}\log n)$.\\
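
Both cases can be checked against the general expression for $L$. This is a short verification under the assumption that $\overline{Y}$ denotes the sample mean, so that $\mathbf{1}'Y = n\overline{Y}$. In Case 1, adding the two estimates gives $\hat{\sigma}^2_\varepsilon + n\hat{\sigma}^2_m = n\overline{Y}^2$, which accounts for the term $\log n\overline{Y}^2$. Substituting the inverse of $\Sigma$ into the quadratic form then gives

\[ Y'\Sigma^{-1}Y = \frac{1}{\hat{\sigma}^2_\varepsilon}\left(Y'Y - \frac{\hat{\sigma}^2_m\, n^2\overline{Y}^2}{\hat{\sigma}^2_\varepsilon + n\hat{\sigma}^2_m}\right) = \frac{1}{\hat{\sigma}^2_\varepsilon}\left(Y'Y - n\hat{\sigma}^2_m\right) = \frac{1}{\hat{\sigma}^2_\varepsilon}\cdot\frac{n\,(Y'Y - n\overline{Y}^2)}{n-1} = n. \]

In Case 2, $\hat{\sigma}^2_m = 0$ reduces $\Sigma$ to $\hat{\sigma}^2_\varepsilon I$, so that $Y'\Sigma^{-1}Y = Y'Y/\hat{\sigma}^2_\varepsilon = n$. In either case the quadratic term contributes $n/2$ to $L$, which is the leading $n$ inside the parentheses.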

669:

670: {\bf Many Gaussian Populations.}

671: Two partitions give two populations. To analyze the HLA/HIV data,

672: we treated these populations as independent.  That is, we take the

673: covariance between observations in separate partitions to be zero, and

674: apply the fitting procedure outlined above separately to the two

675: populations. An alternative is to take non-zero covariance between the

676: two populations. This results in a more elaborate estimation

677: procedure, unlikely to yield large efficiency gains because the

678: two degrees of freedom (population means) are essentially mixed into

679: one, with residual error.

680:

681: The procedure examines each admissible partition and computes the MDL

682: for that partition as the sum of individual description lengths over

683: the two independent populations. The best partition yields the lowest

684: description length over all partitions. This, plus the cost of

685: specifying the partition, is compared with the MDL from the

686: unpartitioned data. If the best partition provides a better

687: representation of the data than the unpartitioned set ($L _{k} < L

688: _{k-1}$), then the process is repeated in a recursive manner,

689: independently within each of the partitioned populations.
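
As a worked illustration of this comparison, consider the rounded values reported in Table~\ref{tab:all-loci} for the entire cohort, where $L_1 = 934$. The tabulated costs are consistent with $C = N$, one unit per allele at the locus; this reading is inferred from the table entries rather than derived here.

\[ \mbox{HLA-B:}\quad L_2 = (L_2 - C) + C = 887 + 30 = 917 < 934 = L_1, \]

\[ \mbox{HLA-A:}\quad L_2 = (L_2 - C) + C = 916 + 19 = 935 > 934 = L_1. \]

The HLA-B partition therefore shortens the description and is retained. The HLA-A partition fits better before the cost is included ($916 < 934$), but the gain does not repay the cost of specifying the partition.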

690:

691: \newpage

692: \begin{thebibliography}{99}

693:

694: \bibitem{McMichael01}

695: McMichael,~A.~J. \& Rowland-Jones,~S.~L.

696: \newblock (2001)

697: %\newblock Cellular immune responses to {HIV}.

698: \newblock {\em Nature} {\bf 410}, 980-987.

699:

700: \bibitem{Mellors96}

701: Mellors,~J.~W., Rinaldo,~C.~R.,~Jr., Gupta,~P., White,~R.~M., Todd,~J.~A. \& Kingsley,~L.~A.

702: \newblock (1996)

703: \newblock {\em Science} {\bf 272}, 1167-1170.

704:

705: \bibitem{Germain}

706: Germain,~R.~N.

707: \newblock (1999)

708: \newblock Chapter 9 in {\em Fundamental Immunology}, fourth edition, ed.

709: \newblock Paul,~W.~E.

710: \newblock (Lippincott-Raven, Philadelphia PA), pp. 287-340.

711:

712: \bibitem{WilliamsReview}

713: Williams,~A., Au~Peh,~C. \& Elliott,~T.

714: \newblock (2002)

715: \newblock {\em Tissue Antigens} {\bf 59}, 3-17.

716:

717: \bibitem{Bodmer}

718: Bodmer,~W.~F.

719: \newblock (1972)

720: \newblock {\em Nature} {\bf 237}, 139-145.

721:

722: \bibitem{Little}

723: Little,~A.~M. \& Parham,~P.

724: \newblock (1999)

725: \newblock {\em Rev. Immunogenet.} {\bf 1}, 105-123.

726:

727: \bibitem{Hill}

728: Hill,~A.~V.~S.

729: \newblock (1998)

730: \newblock {\em Ann. Rev. Immunol.} {\bf 16}, 593-617.

731:

732: \bibitem{Roger}

733: Roger,~M.

734: \newblock (1998)

735: \newblock {\em FASEB J.} {\bf 12}, 625-632.

736:

737: \bibitem{Carrington99}

738: Carrington,~M., Nelson,~G.~W., Martin,~M.~P., Kissner,~T., Vlahov,~D., Goedert,~J.~J., Kaslow,~R., Buchbinder,~S., Hoots,~K. \& O'Brien,~S.~J.

739: \newblock (1999)

740: %\newblock {{\em HLA}} and {HIV}-1: {H}eterozygote advantage and {{\em {B}*35-{C}w*04}} disadvantage.

741: \newblock {\em Science} {\bf 283}, 1748-1752.

742:

743: \bibitem{otherMS}

744: Trachtenberg,~E.~A., Korber,~B.~T., Sollars,~C., Kepler,~T.~B., Hraber,~P.~T., Hayes,~E., Funkhouser,~R., Fugate,~M., Theiler,~J., Hsu,~M., Kunstman,~K., Wu,~S., Phair,~J., Erlich,~H.~A. \& Wolinsky,~S.

745: \newblock (2003)

746: \newblock {\em Nat. Med.} {\bf 9}, 928-935.

747:

748: \bibitem{Trachtenberg01}

749: Trachtenberg,~E.~A. \& Erlich,~H.~A.

750: \newblock (2001)

751: %A Review of the Role of the Human Leukocyte Antigen (HLA) System as a

752: %Host Immunogenetic Factor Influencing HIV Transmission and Progression

753: %to AIDS

754: \newblock in {\em HIV Molecular Immunology 2001}, eds.

755: \newblock Korber,~B.~T., Brander,~C., Haynes,~B.~F., Koup,~R., Kuiken,~C., Moore,~J.~P., Walker,~B.~D. \& Watkins,~D.

756: \newblock (Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos NM), pp. I-43-60.

757:

758: \bibitem{Rissanen}

759: Rissanen,~J.

760: \newblock (1989)

761: \newblock {\em Stochastic Complexity in Statistical Inquiry}

762: \newblock (World Scientific, Singapore).

763:

764: \bibitem{Li93}

765: Li,~M. \& Vit\'{a}nyi,~P.

766: \newblock (1993)

767: \newblock {\em An Introduction to Kolmogorov Complexity and its Applications}

768: \newblock (Springer-Verlag, New York NY).

769:

770: \bibitem{HansenYu}

771: Hansen,~M.~H. \& Yu,~B.

772: \newblock (2001)

773: \newblock {\em J. Am. Stat. Assoc.} {\bf 96}, 746-774.

774:

775: \bibitem{normal}

776: Johnson,~N.~L., Kotz,~S. \& Balakrishnan,~N.

777: \newblock (1994)

778: \newblock {\em Continuous Univariate Distributions}, volume~1, second edition

779: \newblock (Wiley Interscience, New York NY).

780:

781: \bibitem{supertypes}

782: Sette,~A. \& Sidney,~J.

783: \newblock (1999)

784: %\newblock Nine major {HLA} class {I} supertypes account for the vast preponderance of {HLA-A} and -{B} polymorphism.

785: \newblock {\em Immunogenetics} {\bf 50}, 201-212.

786:

787: \bibitem{Lindley}

788: Lindley,~D.~V.

789: \newblock (1980)

790: \newblock in {\em Bayesian Statistics}, eds.

791: \newblock Bernardo,~J.~M., DeGroot,~M.~H., Lindley,~D.~V. \& Smith,~A.~F.~M.

792: \newblock (Valencia University Press, Valencia), pp. 223-237.

793:

794: \bibitem{Venables}

795: Venables,~W.~N. \& Ripley,~B.~D.

796: \newblock (1999)

797: \newblock {\em Modern Applied Statistics with {S-PLUS}}, third edition

798: \newblock (Springer, New York NY).

799:

800: \bibitem{Kolmogorov65}

801: Kolmogorov,~A.~N.

802: \newblock (1965)

803: % Three approaches to the quantitative definition of information.

804: \newblock {\em Prob. Inform. Transmission} {\bf 1}, 4-7.

805:

806: \bibitem{Chaitin66}

807: Chaitin,~G.~J.

808: \newblock (1966)

809: % On the lengths of programs for computing binary sequences.

810: \newblock {\em J. Assoc. Comput. Mach.} {\bf 13}, 547-569.

811:

812: \bibitem{Chaitin87}

813: Chaitin,~G.~J.

814: \newblock (1987)

815: \newblock {\em Algorithmic Information Theory}

816: \newblock (Cambridge University Press, Cambridge UK).

817:

818: \bibitem{Rissanen86}

819: Rissanen,~J.

820: \newblock (1986)

821: % Stochastic complexity and modeling.

822: \newblock {\em Ann. Statist.} {\bf 14}, 1080-1100.

823:

824: \bibitem{Rissanen99}

825: Rissanen,~J.

826: \newblock (1999)

827: %\newblock Hypothesis selection and testing by the {MDL} principle.

828: \newblock {\em Comput. J.} {\bf 42}, 260-269.

829:

830: \bibitem{Nelson}

831: Nelson,~G.~W., Kaslow,~R. \& Mann,~D.~L.

832: \newblock (1997)

833: \newblock {\em Proc. Natl. Acad. Sci. USA} {\bf 94}, 9802-9807.

834:

835: \bibitem{Williams97}

836: Williams,~C.~K.~I.

837: \newblock (1997)

838: % Regression with Gaussian Processes.

839: \newblock in {\em Mathematics of Neural Networks: Models, Algorithms and Applications}, eds.

840: \newblock Ellacott,~S.~W., Mason,~J.~C. \& Anderson,~I.~J.

841: \newblock (Kluwer, Boston MA), pp. 378-382.

842:

843: \end{thebibliography}

844:

845: \newpage

846: \section*{Figure Legends}

847:

848: Fig.~1. Description-length comparisons of viral RNA distributions as

849: one ($L_1$) or two ($L_2$) groups. Ordinate units are the expected

850: number of observations between two tick marks over the abscissa, or

851: one doubling of viral RNA. Impulses along the abscissa show individual

852: observations, with jitter added to enhance rendering of identical

853: values. (a) Observations ($n$) from the Chicago MACS cohort lumped

854: into one group, and (b) split into the best partition as two groups,

855: with individuals having alleles {\em B*13}, {\em B*27}, {\em B*38},

856: {\em B*45}, {\em B*49}, {\em B*57}, {\em B*58}, or {\em B*81} assigned

857: to the lower group ($n_1$), and remaining individuals assigned to the

858: group with greater viral RNA ($n_2$). (c) Observations from the

859: Caucasian subsample as one group, and (d) as the best split into two

860: groups, where having alleles {\em B*13}, {\em B*27}, {\em B*40}, {\em

861: B*45}, {\em B*48}, {\em B*49}, {\em B*57}, or {\em B*58} was the

862: criterion for being assigned to the low viral-RNA group. Observations

863: from individuals having two HLA-B supertype alleles, (e) in one group,

864: and (f) partitioned into two groups, contingent on the presence of

865: {\em B58s}.

866:

867: \clearpage

868: \renewcommand{\baselinestretch}{2}

869:

870: \begin{table}[tb]

871: \begin{center}

872: \caption{Optimum two-way partitions at each locus, with per-locus

873: allelic diversity ($N$), description lengths without the information

874: cost to specify model parameters ($L_2 - C$), and minimum description

875: lengths ($L_2$).

876: \label{tab:all-loci}}

877: \begin{tabular}{ccrcclcrccl}

878: \\

879: \hline

880: &\multicolumn{5}{c}{\sc{Entire Cohort}} & \multicolumn{5}{c}{\sc{Caucasian Subsample}}\\

881: & \multicolumn{5}{c}{$n=479$, $L_1=934$} & \multicolumn{5}{c}{$n=379$, $L_1=721$}\\

882: {\sc Locus}& & $N$ & $L_2-C$ && \multicolumn{1}{c}{$L_2$} & & $N$ & $L_2-C$ && \multicolumn{1}{c}{$L_2$}\\

883: \hline

884: \sc{Class I}\\

885: HLA-A && 19 & 916 && 935 && 18 & 703 && 721\\

886: HLA-B && 30 & 887 && 917* && 26 & 681 && 707*\\

887: HLA-C && 14 & 921 && 935 && 13 & 706 && 719\\

888: \sc{Class II}\\

889: DRB1  && 13 & 927 && 940 && 13 & 711 && 724\\

890: DQB1  &&  5 & 936 && 941 &&  5 & 715 && 720\\

891: DPB1  && 24 & 927 && 951 && 21 & 710 && 731\\

892: \hline

893: \end{tabular}

894: \end{center}

895: \end{table}

896: \newpage

897: \clearpage

898:

899: \renewcommand{\baselinestretch}{1}

900:

901: \begin{table}[t]

902: \begin{center}

903: \caption{HLA-B alleles associated with low ($\circ$) or high ($\bullet$) viral RNA levels.

904: \label{tab:hlab-summary}}

905: \begin{tabular}{cccc}

906: \\

907: \hline

908:              & {\sc Entire} & {\sc Caucasian} & {\sc Supertypes}\\

909: {\sc Allele} & {\sc Cohort} & {\sc Subsample} &    {\sc Only}\\

910:              &    $n=479$   &     $n=379$     &     $n=352$\\

911: \hline

912: \multicolumn{4}{l}{{\em B7s}}\\

913: {\em B*07}   & $\bullet$  & $\bullet$  &  $\bullet$\\

914: {\em B*35}   & $\bullet$  & $\bullet$  &  $\bullet$\\

915: {\em B*51}   & $\bullet$  & $\bullet$  &  $\bullet$\\

916: {\em B*53}   & $\bullet$  & $\bullet$  &  $\bullet$\\

917: {\em B*55}   & $\bullet$  & $\bullet$  &  $\bullet$\\

918: {\em B*56}   & $\bullet$  & $\bullet$  &  $\bullet$\\

919: {\em B*67}   & $\circ /\bullet$ & --   &  $\bullet$\\

920: \multicolumn{4}{l}{{\em B27s}}\\

921: {\em B*14}   & $\bullet$  & $\bullet$  & $\bullet$\\

922: {\em B*27}   & $\circ$    & $\circ$    & $\bullet$\\

923: {\em B*38}   & $\circ$    & $\bullet$  & $\bullet$\\

924: {\em B*39}   & $\bullet$  & $\bullet$  & $\bullet$\\

925: {\em B*48}   & $\circ /\bullet$  & $\circ /\bullet$ & $\bullet$\\

926: \multicolumn{4}{l}{{\em B44s}}\\

927: {\em B*18}   & $\bullet$  & $\bullet$  & $\bullet$\\

928: {\em B*37}   & $\bullet$  & $\bullet$  & $\bullet$\\

929: {\em B*40}   & $\bullet$  & $\circ$    & $\bullet$\\

930: {\em B*41}   & $\bullet$  & $\bullet$  & $\bullet$\\

931: {\em B*44}   & $\bullet$  & $\bullet$  & $\bullet$\\

932: {\em B*45}   & $\circ$    & $\circ$    & $\bullet$\\

933: {\em B*49}   & $\circ$    & $\circ$    & $\bullet$\\

934: {\em B*50}   & $\bullet$  & $\bullet$  & $\bullet$\\

935: \multicolumn{4}{l}{{\em B58s}}\\

936: {\em B*57}   & $\circ$    & $\circ$    & $\circ$\\

937: {\em B*58}   & $\circ$    & $\circ$    & $\circ$\\

938: \multicolumn{4}{l}{{\em B62s}}\\

939: {\em B*13}   & $\circ$    & $\circ$    & $\bullet$\\

940: {\em B*52}   & $\bullet$  & $\bullet$  & $\bullet$\\

941: \multicolumn{4}{l}{\sc Other}\\

942: {\em B*08}   & $\bullet$  & $\bullet$  & --\\

943: {\em B*15}   & $\bullet$  & $\bullet$  & --\\

944: {\em B*42}   & $\bullet$  & --         & --\\

945: {\em B*47}   & $\bullet$  & $\circ /\bullet$ & --\\

946: {\em B*81}   & $\circ$    & --         & --\\

947: {\em B*82}   & $\circ /\bullet$ & --   & --\\

948: \hline

949: \end{tabular}

950: \end{center}

951: \end{table}

952:

953: \renewcommand{\baselinestretch}{1}

954:

955: \clearpage

956:

957: \thispagestyle{empty}

958: \setlength{\textwidth}{20cm}

959: \setlength{\headheight}{0cm}

960: \setlength{\headsep}{0cm}

961: \setlength{\topmargin}{0cm}

962: \setlength{\oddsidemargin}{0cm}

963: \setlength{\evensidemargin}{0cm}

964: \begin{figure}[p!]

965: \begin{center}

966: %\leavevmode

967: \epsfig{file=figure1.eps,width=18cm}

968: %\caption{}

969: \end{center}

970: \end{figure}

971: \clearpage

972:

973: \end{document}

974: