Article Outline

Copyright © 2008 The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics, Volume 82, Issue 3, 607-622, 3 March 2008

doi:10.1016/j.ajhg.2007.12.016

Article

Multipoint Approximations of Identity-by-Descent Probabilities for Accurate Linkage Analysis of Distantly Related Individuals

Cornelis A. Albers1Go To Corresponding Author Jim Stankovich23Russell Thomson3Melanie Bahlo2 and Hilbert J. Kappen1

1 Department of Biophysics, Institute for Computing and Information Sciences, Radboud University, 6525 EZ Nijmegen, The Netherlands
2 Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, 3050 Parkville, VIC, Melbourne, Australia
3 Menzies Research Institute, University of Tasmania, Private Bag 23, Hobart, TAS 7001, Australia

Corresponding author

We propose an analytical approximation method for the estimation of multipoint identity by descent (IBD) probabilities in pedigrees containing a moderate number of distantly related individuals. We show that in large pedigrees where cases are related through untyped ancestors only, it is possible to formulate the hidden Markov model of the Lander-Green algorithm in terms of the IBD configurations of the cases. We use a first-order Markov approximation to model the changes in this IBD-configuration variable along the chromosome. In simulated and real data sets, we demonstrate that estimates of parametric and nonparametric linkage statistics based on the first-order Markov approximation are accurate. The computation time is exponential in the number of cases instead of in the number of meioses separating the cases. We have implemented our approach in the computer program ALADIN (accurate linkage analysis of distantly related individuals). ALADIN can be applied to general pedigrees and marker types and has the ability to model marker-marker linkage disequilibrium with a clustered-markers approach. Using ALADIN is straightforward: It requires no parameters to be specified and accepts standard input files.

Introduction


Even in the new era of genome-wide association studies, the more traditional approach of linkage analysis with multiplex pedigrees remains a powerful, efficient method for mapping rare disease-susceptibility alleles.1 It is powerful not only for the mapping of Mendelian disease alleles of complete penetrance but also for the mapping of complex disease alleles of incomplete penetrance. Generally, for alleles of incomplete penetrance, it is not possible to find clusters of closely related individuals presenting with disease. However, it might be possible to identify clusters of distantly related individuals, particularly in founder populations. Because haplotype sharing between more distantly related individuals is rarer, any observed sharing is more significant. An excellent example of the power of large pedigrees with distantly related individuals was provided by a recent study of pituitary adenoma predisposition in northern Finland.2 Significant linkage was obtained with a single nine-generation pedigree containing just six affected individuals.

For nonparametric, affecteds-only linkage analysis of large families, the key computational challenge is the determination of identity by descent (IBD) sharing probabilities. These are the probabilities, given the available genotype data, that various clusters of affected individuals have inherited haplotype IBD from common ancestors at various loci. Exact and approximate linkage algorithms to calculate IBD-sharing probabilities formulate them in terms of configurations of the inheritance vector.3 This vector has a binary component for each meiosis, recording whether a grandmaternal or grandpaternal allele is transmitted from parent to offspring. The number of possible states of the inheritance vector increases exponentially with pedigree size, so exact methods using the Lander-Green algorithm4 (GENEHUNTER,5 ALLEGRO,6 MERLIN7) to enumerate the probabilities of all possible states become intractable for large pedigrees, even with the latest algorithmic improvements taking advantage of some symmetries (ALLEGRO2). For larger pedigrees, approximate Markov chain Monte Carlo (MCMC) sampling is generally used.8, 9, 10, 11, 12, 13 With run lengths sufficiently long, well-mixing MCMC algorithms converge to the exact solutions. However, the time required for the obtainment of an accurate solution can be very long for some MCMC samplers,14 and some user experience is required for the determination of when an MCMC run has converged.

Other factors contribute to the computational complexity, as well. The new dense sets of SNP markers make it possible to determine patterns of IBD sharing with greater precision, particularly for multigenerational pedigrees with many untyped individuals, but increase computational burden. Although for the Lander-Green algorithm the increase is only linear, for large numbers of markers, the amount of computer memory and disk space required may be substantial. In addition, with dense marker sets, it is critical to make allowance for linkage disequilibrium (LD) between nearby markers. Only one Lander-Green program (MERLIN7) can handle LD between markers. The MCMC algorithm MCLINK12 has been extended to allow for LD15, 16 but is still experimental. The two most commonly used MCMC programs, MORGAN11 and SIMWALK2,8 cannot yet handle LD.

For computation of linkage statistics on large pedigrees with many generations of untyped individuals, we propose that it is not necessary to model explicitly all components of the inheritance vector for the untyped individuals be explicitly modeled. Instead, it suffices to model a variable Π that records the IBD-sharing configuration of the top generation of genotyped individuals. Although the correlations in Π are not first-order Markov for any but the simplest of pedigrees,17, 18, 19 it has been successfully approximated as first-order Markov for homozygosity mapping20, 21, 22 and the determination of genealogical relationships.19 Our contribution is to present an approach for estimation of parametric and nonparametric linkage statistics for general numbers of cases with arbitrary degrees of relatedness, by using first-order Markov approximations. This requires computation of the probabilities of the various possible configurations of Π and computations of transition probabilities between configurations as a function of the recombination fractions. We use a dynamic programming algorithm based on variable elimination23, 24 to compute these probabilities exactly.

We have implemented the approximation in the computer program ALADIN (accurate linkage analysis of distantly related individuals). We evaluate accuracy in simulated data sets and a real data set and compare ALADIN with the MCMC program MORGAN in terms of accuracy and computational efficiency.


Material and Methods


We divide the group of individuals P that together form the pedigree into three disjoint groups of individuals A, T, and D such that P = ATD. The group of individuals T is defined by the requirement that individuals in T are related only by ancestors without genotype and phenotype information. This group of untyped ancestors is denoted by A. Every pair of individuals in T is related by at least one common ancestor from A. For every individual in T, genotype or phenotype information is available, or both. The third group, D, contains the remaining individuals.

The individuals of any pedigree can be grouped like this. ALADIN was designed with the situation in mind where T consists of a small number of cases (less than ten) diagnosed with the disease and D consists of a limited number of close relatives (spouses or children) of the cases that have been recruited to increase the information content. A is the group of ancestors through whom the cases are related. There is no limitation on the number of ancestors A or the number of generations they span, although the efficiency of ALADIN does depend on the complexity of the subpedigree formed by these ancestors (see below for details).

Exact computation of the IBD probabilities with the Lander-Green algorithm is exponential in the number of meioses separating the cases and can be prohibitively complex when A is large. We propose to approximate the likelihood of the original HMM formulation for multipoint linkage analysis4, 5 with a likelihood with lower computational complexity. This is accomplished through a change of variables based on the partitioning of the pedigree P into the groups A, T, and D and additional approximations to the prior probabilities of multilocus IBD configurations. The main idea is that exact inference with the Lander-Green algorithm in this approximate HMM can be performed efficiently.

In this section, we describe how to estimate the posterior probabilities of the IBD configurations given marker data in the approximate HMM. The nonparametric linkage statistics can be readily calculated from these posterior probabilities. In the Appendix A ( Estimation of Multipoint Parametric LOD Scores), we describe how parametric logarithm of odds (LOD) scores can be estimated.

A Change of Variables


We first consider a single marker locus. The exact probabilities of the possible IBD configurations of the affecteds (the cases) can be calculated from the posterior marginal distribution of the segregation indicators snf:

(1)
where M denotes all observed marker genotypes, f the vector of allele frequencies, and the subscript nf a nonfounder in the pedigree. (Only nonfounders have segregation indicators. By definition, individuals in T are always nonfounders; hence, we discard the subscript for these segregation indicators.) GD, GT, and GA represent the ordered genotypes of the individuals in the three groups. The component of the inheritance vector uniquely determines the IBD configuration of the alleles of the individuals T. However, different inheritance vectors may correspond to the same IBD configuration, and for this reason, we now introduce the variable Π, which summarizes the IBD configuration of the alleles of the individuals T. Any configuration can be mapped to a single value of Π; all configurations mapped to the same value of Π have the same IBD configuration with respect to the alleles contained in GT. When Π is substituted for in Equation (1), it is clear that the nonparametric statistics can be computed alternatively from the posterior marginal distribution
(2)
where denotes the set of configurations with the same ibd configuration Π. The prior probability of an IBD configuration Π of GT is given by
(3)

With the assumption that a priori all meioses are independent and that paternal and maternal inheritance are equally probable, the evaluation of this sum for a given Π amounts to counting the number of configurations with the corresponding IBD configuration.

Figure 1 shows one of the possible descent graphs8 for a pedigree of nine individuals P {1,…,9}. Suppose that individuals 1, 2, and 3 are affected and have genotype information, so that T = {1, 2, 3}, that their ancestors A = {4,…,9} are untyped, and that D = Ø. Only the paternal alleles of individuals 1, 2, and 3 can be IBD through one of the ancestors 6 and 7; in the descent graph shown in the figure, the paternal alleles are all inherited IBD from individual 6. The maternal alleles can never be IBD because they are inherited from different founders. There are five possible IBD configurations of the paternal alleles: . Alleles within parentheses are defined to be IBD and are said to be in the same partition; a concatenation of partitions defines an IBD configuration. Π takes as values the possible IBD configurations. Table 1 illustrates how different inheritance vectors are mapped to a state of the IBD variable Π, which in this example summarizes the IBD configuration of the paternal alleles 1, 2, and 3. The IBD configuration is uniquely determined by the seven segregation indicators listed in the table. The two possible values for each segregation indicator are paternal (p) and maternal (m). Segregation indicators for which both values yield the same IBD configuration are indicated with p/m. In total, there are 16 configurations of segregation indicators that imply IBD of simultaneously; the prior probability of this configuration thus is . In this example, the complexity has been reduced from 27 to 5 configurations.

Display large version of this figure
Display high quality version of this figure
Figure 1
Descent Graph of the Top Configuration in Table 1
Squares represents alleles of males, and circles represent alleles of females. The solid black arrows indicate the transmission of founder allele G6p to individuals 1 and 2. Thus, the paternal allele G1p of individual 1 and the paternal allele G2p of individual 2 are IBD. The dashed gray arrows correspond to the p/m entries in Table 1.
 
Table 1 Mapping of Configurations of Segregation Indicators s to IBD Configuration Π
Π#s1ps2ps3ps4ps4ms5ps5m
(1 2 3)4ppppp/mpp/m
(1 2 3)4pppmp/mmp/m
(1 2 3)4mmmp/mpp/mp
(1 2 3)4mmmp/mmp/mm
(1)(2 3)4ppppp/mmp/m
(1)(2 3)4pppmp/mpp/m
(1)(2 3)16mppp/mp/mp/mp/m
(1)(2 3)4mmmp/mpp/mm
(1)(2 3)4mmmp/mmp/mp
(1)(2 3)16pmmp/mp/mp/mp/m
(1 2)(3)4ppmpp/mpp/m
(1 2)(3)4ppmmp/mmp/m
(1 2)(3)4mmpp/mpp/mp
(1 2)(3)4mmpp/mmp/mm
(1 3)(2)4pmppp/mpp/m
(1 3)(2)4pmpmp/mmp/m
(1 3)(2)4mpmp/mpp/mp
(1 3)(2)4mpmp/mmp/mm
(1)(2)(3)16pmpp/mp/mp/mp/m
(1)(2)(3)16mpmp/mp/mp/mp/m
Π indicates the partitioning variable; paternal alleles of individuals within parentheses are IBD. # indicates the number of configurations of the segregation indicators.

Exact Single-Point Computation of Posterior IBD Probabilities


We first describe how the change of variables from to Π is implemented in the single-point case.

By definition of the groups of individuals D, T, and A, the pedigree likelihood factors as follows (see Figure 2A)

(4)

Display large version of this figure
Display high quality version of this figure
Figure 2
Graphical Model Representation
The graphical model reflects the conditional independencies of the single-point likelihood. White circles represent unobserved variables, gray circles represent model parameters assumed to be fixed and known, and gray rectangles represent observed variables, i.e., the marker genotypes. Panel (A) shows the graphical model corresponding to the single-point likelihood defined in terms of genotype variables and segregation indicators. Conditioned on the ordered genotypes GT, the variables shown in the dashed rectangle are independent of the variables outside the dashed rectangle. The graphical model can be alternatively constructed as in panel (B). As the group of ancestors (A) is defined to have no genotype or phenotype information, the corresponding ordered genotype variables (indicated by dotted lines) can be removed from the model, yielding the graphical model shown in panel (C). The HMM used by ALADIN is based on the model shown in panel (D). Here, the unobserved segregation indicators sT and sDnf have been replaced by the IBD variable Π, which defines the IBD configuration of the alleles contained in GT. The ordered genotypes GA of the ancestors (A) are not explicitly modeled.

Consequently, the variables are independent of conditioned on the ordered genotypes GT (and the fixed allele frequencies f). This implies that we only need to consider the right-most likelihood term for the change of variables from to Π.

Next, we derive the marginal likelihood of the variables GT and Π from this likelihood term

(5)
where P(Π) is given by Equation (3). Equations (16) in the Appendix A ( Details of Single-Point Computations) provide a detailed derivation of Equation (5). Below, we give P(GT|Π,f) for the example of Table 1. The conditional distribution of P(GT|Π,f) does not depend on the particular configuration but only on the IBD configuration determined by . This nontrivial result follows from the assumptions that the prior allele frequency distributions are the same for all founders, that no genotypes or phenotypes are observed for the individuals in A, and that paternal and maternal inheritance are equally likely a priori. It allows configurations to be clustered with respect to their IBD configuration for the alleles in GT in the computation of the posterior marginal distribution of Equation (2).

Combining Equations (1) and the result Equation (5), we find that the single-point posterior is given by

(6)
where the proportionality factor is given by 1/P(M|f).

With the example of Table 1, we illustrate how the conditional probability P(GT|Π,f) is calculated. Suppose we would like to know this probability for the IBD configuration , where allele are a copy of the same founder allele, and is a copy of a different founder allele. There are two contributions of the prior allele frequency distribution, one for each of the two transmitted founder alleles. This gives

(7)
where δ(x, y) = 1 if x = y and δ(x, y) = 0 if xy, with x and y discrete variables. This ensures that the conditional probability of GT given Π is zero unless all alleles that are IBD have the same value. For each of the maternal alleles, there is a contribution of the allele frequency prior because these are never IBD on account of being inherited from founders:
(8)

The conditional probability distribution P(GT|Π, f) is given by the product of the right hand sides of Equations (7).


Exact Multipoint Computation of Posterior IBD Probabilities


Given the results of the single-point case, the generalization to the multipoint case is straightforward. We define the multilocus IBD variable as , where L is the number of markers. (From here on, we will assume that unless a locus superscript l is specified, variables GT represent multilocus genotypes.) The configurations of segregation indicators associated with a given multilocus IBD configuration Π are given by the cartesian product

(9)

Because we consider multilocus genotypes and segregation indicator configurations, the conditional independence of and given GT still holds (see Figure 2B). As a result, the dependence of the posterior distribution on Π can again be determined from the likelihood term , where θ is the vector of recombination fractions. From Equation (5), the ordered genotypes for marker l are conditionally dependent only on Πl and fl, so that the conditional probability distribution of GT factorizes as a product of markers. Furthermore, in the exact HMM, there are conditional dependencies only between the segregation indicators (assuming linkage equilibrium). This means that the multipoint equivalent of Equation (5) is given by

(10)

The prior distribution is explicitly given by

(11)

Here, the conditional distributions model the recombination between the markers l and l − 1 for individual i, and y can be paternal (p) or maternal (m). In contrast with the prior distribution over the segregation indicators s, the prior distribution is not first-order Markov in the IBD variables Πl because of the summation and consequently does not factorize as a product over markers.

We can now write the multipoint generalization of Equation (6) in terms of Π by using Equation (10):

(12)

The conditional probability distributions are calculated for each locus as illustrated in the single-locus case above. The multipoint likelihood contains the product of a distribution conditional on Π and a prior distribution over Π, similar in form to the single-point likelihood (Equation (5)). However, it is not particularly practical to work with because the summation over the subspace of Equation (9) required for the computation of is generally not feasible for large number of markers and prevents application of an efficient forward-backward algorithm.


ALADIN: Exact Inference in an Approximate HMM


We propose to approximate with a distribution that is first-order Markov in Πl. We further assume that the conditional probability distribution of Πl given Πl−1 depends only on the recombination fraction between markers l and l − 1. This first-order Markov approximation is given by

(13)
where the superscripted (1) indicates the first-order approximation. P1) is given by Equation (3). The conditional distribution is computed as follows:
(14)
where the prime indicates that we compute the marginal distribution in the numerator on the right-hand side exactly in a two-locus model.

The computation of these probabilities is a crucial step in ALADIN. Because the pedigree structure is the same for each marker, the prior IBD-partitioning probabilities are also the same for each marker,

so that this computation has to be performed only once. The computation of the conditional probabilities must be performed for every recombination fraction, i.e., for every pair of adjacent markers. In the special case that all recombination fractions between the markers are the same, this computation also would have to be performed only once because the conditional probabilities depend only on the recombination fraction for a given pedigree. We use a dynamic programming algorithm based on variable elimination24, 25 to compute these probabilities exactly. This procedure is outlined in the Web Resources ( Computation of Prior IBD Probabilities with Variable Elimination). For practical application of the approximation, we require that exact computations of in the two-locus model with the junction tree algorithm are feasible.

The full likelihood of the approximate model based on Equation (13) is obtained by substituting this equation into Equation (12):

(15)

The main idea of ALADIN is to perform exact inference of the marginal distributions in the approximate HMM defined by Equation (15) with the Lander-Green algorithm. Exact inference in the approximate model will result in approximate IBD probabilities.

In summary of the procedures, the result of the change of variables from to Π is that the size of the hidden state space depends no longer on |A|, the number of untyped ancestors, but only on |Π|, the number of possible IBD configurations of the alleles of the individuals in T. The complexity of computing marginal distributions in the approximate HMM may be substantially lower than in the exact HMM. Therefore, application of the forward-backward algorithm to the approximate HMM can be feasible when application to the exact HMM is not.


Implementation


We have implemented the Lander-Green algorithm and the algorithm to compute the prior IBD probabilities in the computer program ALADIN. It calculates normalized NPLpairs and NPLall linkage statistics, as well as parametric LOD scores (as described in Appendix A [ Estimation of Multipoint Parametric LOD Scores]) for general pedigrees in the approximate HMM. It can analyze diallelic and multiallelic marker data sets. When dealing with dense SNP arrays, the use of which has rapidly become standard practice, it is important that linkage disequilibrium between the markers be accounted for. Therefore, we have implemented the same clustered-markers approach as MERLIN.7 This approach clusters markers into haplotype blocks for which haplotype frequencies must be specified. Markers within a haplotype block may be in complete LD; markers in different haplotype blocks are assumed to be in linkage equilibrium. Absence of recombination is assumed for markers in the same haplotype block. ALADIN accepts standard LINKAGE-formatted locus and pedigree files and uses the MERLIN format to specify the haplotype blocks and haplotype frequencies.



Results


Setup


We evaluated the performance of ALADIN in simulated and real data sets. With data sets simulated in small pedigrees where exact multipoint computation with MERLIN was feasible, we evaluated the quality of the ALADIN approximation of NPLpairs and NPLall. We compared these results with those of the state-of-the-art MCMC sampler MORGAN, where we note that MORGAN only provides an approximation of the NPLpairs statistic. In the comparison with MORGAN, we also assessed the accuracy of approximations of parametric LOD scores. With data sets simulated in large pedigrees where exact multipoint computation with MERLIN was not feasible, we estimated the type I error rate of ALADIN. We also compared the ALADIN and MORGAN approximation of NPLpairs with the value of NPLpairs of the true inheritance vector at each location for chromosomes where linkage was simulated. In a real data set where exact computation with MERLIN was feasible, we compared the accuracy of ALADIN and MORGAN. Finally, we compared computation time of ALADIN, MORGAN, and MERLIN.

We used two pedigrees taken from practical linkage studies in our evaluation. Pedigree I, shown in Figure 3, was taken from a prostate cancer (PC [MIM 176807]) study. Pedigree II, shown in Figure 4, was taken from a pituitary adenoma predisposition study.2 Because these pedigrees are too large for exact computation with MERLIN (number of bits > 25, see Table 2), we considered three subpedigrees for our comparisons between ALADIN and MERLIN: (1) pedigree Ia, the subpedigree of I consisting of the cases 4, 8, 9, 12, their ancestors, spouses, and children, (2) pedigree Ib, the subpedigree of I consisting of only the cases 4, 8, 9, 12, and their ancestors, and (3) pedigree IIa, the subpedigree of II consisting of the cases 102, 114, 131, and the ancestors of these cases.

Display large version of this figure
Display high quality version of this figure
Figure 3
Prostate Cancer Pedigree I
Affected individuals are represented by a black symbol, and genotyped individuals are indicated with an asterisk.
Display large version of this figure
Display high quality version of this figure
Figure 4
Pedigree II
The pedigree was taken from a pituitary adenoma study of Vierimaa et al.2 Affected individuals are represented by a black symbol, and genotyped individuals are indicated with an asterisk.
 
Table 2 Computation Time
Computation Time (min)a
PedigreeLELDb
LabelCasesTPHMM ComplexitycMERLINALADINMORGANdMERLINALADIN
Ia443218422.535.0354601100013.5
Ib44241434.069.532855148.524.5
IIa333518288.510.53811521914.0
I775336e495354751016282
II644427171.54176078
a Computation time is reported as estimated time required to analyze the 50K Affymetrix XbaI array in minutes.
b MORGAN cannot handle LD.
c Complexity is measured as the log2 of the number of hidden states of the exact HMM for a marker (the number of bits).
d MORGAN was run with 100,000 MCMC scans.
e ∞ indicates that exact computation with MERLIN was not feasible.

We used the lm_ibdtests and lm_markers programs in the MORGAN 2.8.1 package to obtain MCMC approximations of NPLpairs and parametric LOD scores, respectively. MORGAN requires the user to specify a number of parameters. We used a default value of 100,000 scans for the sampler. The number of burn-in scans and sequential imputation steps for initialization of the Markov chain were set to the values recommended by Wijsman et al.14 in a comprehensive evaluation of the MCMC programs MORGAN and SIMWALK2. With these settings, the authors showed that MORGAN can be expected to produce accurate approximations and that it is generally more efficient than SIMWALK2.

All analyses were performed on a small cluster of five AMD 64 bit machines with two dual-core 2.2 GHz processors. Each machine had 16 GB of physical memory available and 30 GB of swap space.


Simulation Study


Comparison with MERLIN

We assessed the accuracy of ALADIN and MORGAN in data sets simulated for pedigrees Ia, Ib, and IIa. We emulated a genome scan by generating marker data for the autosomal chromosomes. To see whether the linkage signal could be accurately detected, we simulated linked chromosomes where the inheritance vector at the middle marker was fixed such that all cases shared one allele IBD, which is the maximum for these pedigrees because there is no inbreeding. To determine a possible bias, we also simulated unlinked chromosomes where the inheritance vector at the middle marker was randomly generated.

Marker data was simulated under two conditions: linkage equilibrium (LE) and linkage disequilibrium (LD) between the markers. In the condition of LE, the founder alleles were sampled with the estimated allele frequencies of the Affymetrix 50K XbaI array in 42 European ancestry samples provided by Affymetrix. In this dataset, there are 57,093 polymorphic autosomal markers, with mean minor allele frequency 0.22. In the condition of LD, founder haplotypes were sampled with haplotype block definitions and haplotype frequency estimates obtained with HAPLOVIEW26 (spine-of-LD rule with D′ = 0.8). HAPLOVIEW was run on data from Phase I of the International HapMap Project,27 consisting of the genotypes of 90 individuals from Utah of northwestern European ancestry at the markers of the 50K XbaI array. There were 51,634 polymorphic, autosomal markers on the array that were in the Phase I dataset and passed HAPLOVIEW's quality control tests. They were assigned to 18,343 haplotype blocks (including 6103 blocks containing only one SNP). The maximum number of SNPs in any one block was 28. The number of haplotypes per block ranged from 2 to 31 (mean 3.4), and haplotype frequencies ranged from 0.01 to 0.98 (mean 0.29). Simulations were performed with the assumption of linkage equilibrium between neighboring blocks. Table 3 gives an overview of the number of replicates simulated in each condition.

 
Table 3 Number of Replicates Simulated for the Comparison with MERLIN
LELD
TypeALADINMORGANALADINMORGAN
unlinked2 × 22 = 441 × 4 = 41 × 22 = 22NA
linked2 × 22 = 441 × 4 = 41 × 22 = 22NA
NA indicates that MORGAN cannot model LD.

Because the clustered-markers option for modeling LD in MERLIN required significantly more computation time than the standard analysis assuming LE, only 22 chromosomes in the condition of LD were simulated and analyzed with ALADIN and MERLIN. Because of the long computation time of MORGAN, we analyzed only one replicate of autosomal chromosomes 1, 3, 20, and 22 with MORGAN. Furthermore, MORGAN cannot model LD, so no chromosomes were analyzed in the LD condition (indicated by “NA” in the table).

We used three measures to evaluate the accuracy: (1) Δmean, the mean absolute difference over all markers on the chromosome between the exact and approximate scores, (2) Δmax, the maximum absolute difference over all markers on the chromosome between the exact and approximate scores, and (3) Δmid, the absolute difference evaluated at the middle marker on the chromosome, where for the linked chromosomes the inheritance vector was fixed.

Table 4 shows the error in the ALADIN approximation of NPLpairs and NPLall for the various pedigrees and the conditions of linked and unlinked chromosomes and LE and LD. Note that for the comparison of the absolute errors across different pedigrees, it is important that the range, i.e., the difference between the maximum and minimum value of NPLpairs and NPLall, be taken into account. The maximum score was attained in almost all linked chromosomes at the location of the middle marker. Overall, the approximation of NPLpairs and NPLall was accurate with values of Δmean less than 0.4 in all pedigrees and conditions. The absolute error at the middle marker Δmid was smaller than 0.08 for pedigrees Ia and Ib and smaller than 1.4 for pedigree IIa. The errors Δmean and Δmid were smaller for the unlinked chromosomes than for the linked chromosomes. The relative errors as a fraction of the exact value at the middle marker (shown between parentheses for the linked chromosomes) were smaller than 3%, except for the condition of LE for pedigree IIa. The large relative error for this pedigree was caused by two replicates with a large discrepancy at the location where linkage was simulated, which we examine in more detail below. On the remaining replicates, the mean error Δmid was 0.5386 (1.42%), comparable with the results for the linked chromosomes in the condition of LD for this pedigree. Thus, in general, ALADIN accurately detected the linkage signal.

 
Table 4 Error of ALADIN in NPLpairs and NPLall
NPLpairsNPLall
PedigreeTypeNRangeΔmeanΔmaxΔmid (Relative)aRangeΔmeanΔmaxΔmid (Relative)
Linkage Equilibrium
Iaunlinked449.900.00220.09120.001016.70.00200.08670.0008
Ialinked449.900.00470.19730.0011 (0.0155%)16.70.00520.22430.0026 (0.0209%)
Ibunlinked449.900.00700.16470.002416.70.00670.16950.0023
Iblinked449.900.01540.34830.0107 (0.1476%)16.70.02080.51990.0227 (0.1996%)
IIaunlinked4450.70.01790.30170.022666.30.01770.29770.0223
IIalinked4450.70.38686.8071.357 (15.55%)66.30.45859.2951.874 (17.73%)
Linkage Disequilibrium
Iaunlinked20b9.900.01020.75330.001116.70.00920.65040.0011
Ialinked20b9.900.02001.1040.0070 (0.0854%)16.70.02431.7130.0160 (0.1159%)
Ibunlinked229.900.01490.78320.002516.70.01330.72960.0022
Iblinked229.900.03621.5210.0776 (1.786%)16.70.04602.3330.1761 (3.498%)
IIaunlinked2250.70.00320.10690.006066.30.00320.10540.0059
IIalinked2250.70.23106.3700.5279 (2.423%)66.30.28458.8810.7722 (2.641%)
Note that values of Δ are shown as means across N simulated chromosomes.
a Relative error is only reported for the linked chromosomes, where linkage was simulated at the location of the middle marker.
b MERLIN terminated with error status when chromosomes 1 and 2 were analyzed.

The mean maximum errors (Δmax) were larger than Δmean and Δmid, varying from 0.0912 to 0.7832 for the unlinked chromosomes and from 0.1973 to 6.807 for the linked chromosomes. We observed two typical situations where these maximum errors occurred. The first situation was the most general: Here, the maximum error was found in a small region where the value of the statistic changed sharply as a function of the location and did not affect the value of the statistic at the location where linkage was simulated. Figure 5A illustrates this situation for a replicate of pedigree IIa. The second situation was found in only two replicates, both for pedigree IIa: Here, the error extended over a larger region that included the location where linkage was simulated, as illustrated in Figure 5B. For the condition of LD, we found that in pedigrees Ia and Ib, the value of Δmax was mostly attained at the beginning of the chromosome. We believe that this may be partly explained by slight differences in our implementation of the clustered-markers approach and that of MERLIN regarding how inconsistencies due to recombination events within a haplotype cluster are dealt with. The number of inconsistencies was on average 1.3 ± 1.6 per chromosome per pedigree.

Display large version of this figure
Display high quality version of this figure
Figure 5
Comparison of ALADIN and MORGAN with Exact Method
The figure shows exact, approximate, and true simulated value of NPLpairs for two typical replicates of pedigree IIa in the condition of linked chromosomes and linkage equilibrium. The location where linkage was simulated is indicated with the asterisk. Forty-two of the 44 simulated replicates were similar to (A). For this replicate, Δmean of ALADIN and MORGAN were 0.0913 and 0.1015, respectively. The maximum errors Δmax were respectively 4.439 and 4.571, and were both attained at 142.4 cM, a region where NPLpairs changed rapidly. In two replicates, the situation was as in (B): ALADIN and MORGAN produced similar scores that both overestimated NPLpairs as compared to the value obtained with MERLIN but did not overestimate the value of NPLpairs of the true inheritance vector. Here, the maximum errors of ALADIN and MORGAN were 14.05 and 14.15, respectively, and were attained at 23.96 cM by both methods.

In Table 5, we compare the error of ALADIN and MORGAN in NPLpairs on the subset of chromosomes analyzed by MORGAN. ALADIN had smaller values of Δmean and Δmax than did MORGAN in pedigrees Ia and IIa and larger values in pedigree Ib. The values of Δmid of ALADIN were smaller than those of MORGAN in all pedigrees. In addition to the replicates shown in the table, we analyzed the two replicates for pedigree IIa where the error of ALADIN extended over a larger region (described above) with MORGAN. Interestingly, for both of these replicates, MORGAN produced the same result as ALADIN, with the corresponding error. Figure 5B illustrates this for one of these replicates. Here, we found a relatively large difference between the ALADIN approximation and the exact value of NPLpairs: Both ALADIN and MORGAN yielded a similar score that underestimated the true value. This score was closer to the value of the true inheritance vector than the exact score. We believe that this phenomenon is most likely due to multimodality of the posterior distribution that the approximate methods did not fully take into account. We found no replicates where ALADIN overestimated the value of NPLpairs, and MERLIN (yielding exact results) did not overestimate it as well.

 
Table 5 Comparison of Error in NPLpairs of ALADIN and MORGAN
ΔmeanΔmaxΔmid (Relative)
PedigreeTypeNRangeALADINMORGANALADINMORGANALADINMORGAN
Iaunlinked49.900.00510.00710.09230.38700.00590.0104
Ialinked49.900.00430.01030.16710.48260.0090 (0.1306%)0.0431 (0.5548%)
Ibunlinked49.900.00850.00770.32060.10780.00110.0067
Iblinked49.900.05680.02740.63450.15280.0135 (0.1565%)0.0623 (0.7058%)
IIaunlinked450.70.00440.00470.04890.04890.00010.0007
IIalinked450.70.32870.50994.3964.8141.000 (2.265%)1.312 (3.000%)
Note that values of Δ are shown as means across N simulated chromosomes.

In Table 6, we compare the error of ALADIN and MORGAN in parametric LOD scores. We assumed a disease allele frequency of 0.001 and penetrance values (0.001, 0.20, 0.20), reflecting a dominant disease with low penetrance. We analyzed the same replicates as those used for Table 6. ALADIN and MORGAN were both accurate, with Δmean < 0.10 and Δmid < 0.11 for all pedigrees. The maximum errors Δmax of ALADIN and MORGAN were similar and varied from 0.06 to 0.70. Because the LOD scores ranged from −2 to 3.5 for the linked chromosomes, the maximum errors in the parametric scores were similar to the maximum errors found for the nonparametric scores. We conclude that ALADIN was accurate and achieved a similar performance as MORGAN.

 
Table 6 Comparison of Error in Parametric LOD Scores of ALADIN and MORGAN
ΔmeanΔmaxΔmid
PedigreeTypeNALADINMORGANALADINMORGANALADINMORGAN
Iaunlinked40.01280.01160.41350.52740.02530.1092
Ialinked40.02450.00810.69250.52840.00120.0015
Ibunlinked40.01520.01140.32980.36800.00180.0015
Iblinked40.04050.01210.33860.22330.00140.0023
IIaunlinked40.00520.00560.07330.09310.00030.0004
IIalinked40.03060.07190.40020.61540.01460.0150
Note that values of Δ are shown as means across N simulated chromosomes.

Large Pedigrees

We evaluated the performance of ALADIN in pedigrees I and II. Exact multipoint computation of NPLpairs with MERLIN in these pedigrees is not feasible. However, it is possible to calculate the exact null distribution as single-locus computations are feasible in these pedigrees. The type I error rate (false-positive rate) of an exact method applied to fully informative marker data is given by the probability that the NPLpairs statistic is larger than the significance threshold under the exact null distribution. We used an importance-sampling approach to estimate the type I error rate of ALADIN for high values of the significance threshold on NPLpairs (i.e., small p values). For every possible value of NPLpairs, we simulated 75 inheritance vectors for the middle marker location; genotypes for 200 markers were simulated conditional on the inheritance vector according to the specifications of the XbaI array. For pedigree I, there are 13 unique values of NPLpairs, resulting in a total of 975 replicates. For pedigree II, there are 14 unique values of NPLpairs, resulting in a total of 1050 replicates. The replicates were given the proper importance weights so that the fact that they were not drawn from the exact null distribution could be accounted for. ALADIN was used for the approximation of NPLpairs at the middle marker location in each replicate. This procedure yields unbiased estimates of the type I error rate of ALADIN.

Figure 6 shows the empirical type I error rate for ALADIN and the type I error rate corresponding to the exact null distribution, with NPLpairs. We find that ALADIN did not have an inflated type I error rate for the high values of the significance threshold that are relevant for genome-wide linkage analysis.

Display large version of this figure
Display high quality version of this figure
Figure 6
Empirical Type I Error Rate of ALADIN
(A) shows the empirical type I error rate for NPLpairs in pedigree I. (B) shows empirical type I error rate for NPLpairs in pedigree II. Note that the p value corresponding to a given significance threshold on NPLpairs is given by the solid curve.

We also compared ALADIN and MORGAN in pedigree II for chromosomes where linkage was simulated at the middle marker. The inheritance vector at this marker was fixed such that IBD sharing between the cases was maximal. Marker data was simulated for eight autosomal chromosomes (1, 3, 5, 7, 11, 13, and 17) assuming linkage equilibrium between the markers. For each marker location, we computed the value of NPLpairs as estimated from the marker data by using ALADIN and MORGAN. Because exact computation was not feasible, we compared these with the value of NPLpairs of the true inheritance vector at each marker location.

For all eight chromosomes, the score estimated with ALADIN was very close to the score of the true inheritance vector at each marker (Figure 7). For two chromosomes, there were regions where MORGAN reported a score that was lower than the score of the true inheritance vector. In one case, this region included the marker location where linkage was simulated. Although the exact multipoint value of NPLpairs given the marker data is not known, this figure suggests that the first-order Markov approximation of ALADIN has sufficient power to detect IBD sharing among the cases. Because we found that the type I error rate was not inflated, we infer that ALADIN might be expected to produce accurate results in data sets where exact computation is not feasible.

Display large version of this figure
Display high quality version of this figure
Figure 7
Evaluation of ALADIN in a Large Pedigree
The ALADIN and MORGAN estimates of NPLpairs are compared to the NPLpairs of the true inheritance vector for eight autosomal chromosomes (1, 3, 5, 7, 11, 13, and 17) for pedigree II. Exact multipoint computation of NPLpairs was not feasible. The inheritance vector at the middle marker, indicated by the asterisk at the horizontal axis, was fixed such that all six cases shared one allele IBD at that location.


Application to Real Data


We compared the accuracy of ALADIN and MORGAN in the real data set of pedigree Ia. The marker data are from the SNPs of the Affymetrix 10K array. For each pair of SNPs in strong LD (D′ > 0.8), one of the SNPs was removed from the data set for the prevention of spurious linkage results.

We first compared the accuracy of ALADIN and MORGAN for different numbers of MCMC scans for MORGAN, by using the NPLpairs statistic. We focused the subset of chromosomes 1–10 in the real data set for pedigree Ia. The exact NPLpairs scores were computed with MERLIN. Figure 8 shows that with 10,000 MCMC scans, ALADIN was more accurate than MORGAN; with 100,000 MCMC scans, ALADIN was less accurate than MORGAN. Thus, for a small number of MCMC scans, MORGAN was both slower and less accurate than ALADIN, whereas the accuracy of ALADIN was already high.

Display large version of this figure
Display high quality version of this figure
Figure 8
Evaluation of ALADIN with Real Data
The approximation error of ALADIN and MORGAN is compared on the real data set for subpedigree Ia for varying number of MCMC scans of MORGAN. The figure shows boxplots of the absolute difference between the approximate and exact NPLpairs of all points on chromosomes 1–10. The number of MCMC scans is shown between parentheses, and the computation times are denoted by t on the horizontal axis.

Second, we evaluated the accuracy of ALADIN by using all of the autosomal chromosomes. The mean error in NPLpairs of ALADIN was 0.019, and the maximum error was 0.32. The mean and maximum errors NPLall were 0.032 and 0.35, respectively. The maximum exact values of NPLpairs and NPLall were 7.65 and 12.8, respectively (at the same location); the relative errors of ALADIN at this peak were 0.0029% and 0.0037%, respectively. We conclude that the approximation of ALADIN in the real data set was accurate.

We also applied ALADIN to the full pedigree. Here, ALADIN replicated an analysis with a pedigree-splitting approximation28 that found that there was just one suggestive linkage peak in this pedigree where four out of seven cases inherited a haplotype identical by descent. This haplotype sharing was confirmed by subsequent microsatellite genotyping and haplotype reconstruction with SIMWALK2 (L.M. FitzGerald, personal communication).


Computation Time


The computation time of MERLIN increases linearly with the number of markers and exponentially with the number of bits, which is the number of components of the inheritance vector required for exact computation in the HMM or, equivalently, the log 2 of the number of hidden states of the exact HMM for a single marker. Computation time of ALADIN increases exponentially with T, the number of individuals in T, and linearly with the number of markers. MORGAN requires that single-locus exact computations are feasible. If the pedigree is not inbred, which was the case in all of our analyses, computation time increases linearly with P, the number of individuals in the pedigree.

Table 2 shows computation times for the various pedigrees we analyzed, reported as the estimated computation time required to analyze the 50K Affymetrix XbaI array. The computation time of ALADIN was mostly shorter than that of MERLIN for the pedigrees where exact computation with MERLIN was practical, especially when LD was modeled. Pedigree Ib without LD modeling forms an exception, which can be most likely attributed to the efficient implementation of MERLIN. Computation times of ALADIN were several orders of magnitude shorter than those of MORGAN. Analysis of pedigrees II and I with MERLIN was not practical. ALADIN was significantly faster than MORGAN in pedigree II. In pedigree I, the efficiency of ALADIN and MORGAN was comparable for the case of LE. Again, ALADIN was more efficient when LD was modeled.

We studied how computation time of ALADIN and MORGAN scaled with the number of markers. We found that computation time of MORGAN increased quadratically with the number of markers analyzed (Figure 9A), with a fixed number of 100,000 scans for the sampler. As expected, computation time of ALADIN increased linearly.

Display large version of this figure
Display high quality version of this figure
Figure 9
Scaling of Computation Time
(A) shows on a log-log scale the computation time as a function of the number of markers used in the multipoint analysis. Computation time of ALADIN scaled linearly with the number of markers, whereas that of MORGAN scaled quadratically with the number of markers. (B) shows computation time as a function of A, the number of untyped ancestors, for fixed number of individuals T = 4 and 100 markers.

ALADIN was designed for the purpose of analyzing distantly related individuals. We therefore investigated computation time as a function of A, the number of untyped ancestors through which the cases are related, for a fixed value of T = 4. The structure of the pedigree used in this simulation was as follows: The four cases were related by two common ancestors 3–15 generations back, where each case was in a separate branch of the pedigree. The cases formed the group T, and the group D did not contain any individuals. Figure 9B shows computation time of ALADIN and MORGAN for replicates simulated with 100 markers. Computation time of ALADIN did not clearly show an increase with A. Computation time of MORGAN increased linearly with A and was significantly higher than that of ALADIN. Thus, for small T and large A, ALADIN may be significantly more efficient than MORGAN.



Discussion


We presented ALADIN, a program for linkage analysis of distantly related individuals. ALADIN produced accurate estimates of nonparametric linkage and parametric linkage scores. ALADIN also produced accurate estimates when linkage disequilibrium was taken into account with the same clustered-markers approach as MERLIN. Accuracy was comparable with that of the state-of-the-art MCMC program MORGAN. We have shown that ALADIN is especially useful when a moderate number of cases are related through common ancestors many generations back and the pedigree is too large to analyze with exact methods. ALADIN may be several orders of magnitude more efficient than MORGAN, depending on the number of typed individuals and cases. However, it should be noted that we performed comparisons with version 2.8.1 of MORGAN, which uses a basic locus meiosis (LM) sampler that was not designed for the pedigrees with long descent chains considered in this paper. Nevertheless, we believe that it is the most powerful alternative to ALADIN for these pedigrees. The recently released version 2.8.2 of MORGAN uses a multimeiosis sampler29 that will most likely perform better for linkage analysis of distantly related individuals. ALADIN may also be of use in pedigrees where analysis with MERLIN in principle is feasible but very time consuming, for instance, when the clustered-markers approach is used to account for linkage disequilibrium. In the current implementation, ALADIN does not require any parameters to be specified, which can be an advantage over MCMC-based programs, depending on the expertise of the user.

The calculation of the conditional probabilities between the states of the IBD variables of different loci may account for a large portion of the computation time of ALADIN. Because the pedigree structure is obviously the same for different pairs of adjacent markers, the only quantity varying is the recombination fraction. It is likely that many pairs of markers have very similar recombination fractions. One can make an additional approximation by using a discretized set of recombination fractions for which the conditional IBD probabilities will be computed. Then for any pair of markers encountered in the real data set, the IBD probabilities computed for the fraction in the discretized set that is closest to the true fraction can be used as an approximation. The computation of the conditional IBD probabilities can be easily performed in parallel. ALADIN currently has an option for creating and using a collection of recombination fractions so that the computation time can be reduced. As an example, for pedigree I, the computation time can be reduced to 4600 min from 16,282 min with a discretized set of 1000 marker distances. The relative error in the marker recombination frequencies due to discretization is small, at 0.36 % ± 0.26%. We expect that the effect of this approximation on the conclusions of the linkage study will be negligible.

An additional way to reduce computation time is the use of multiple IBD variables with fewer IBD configurations, thus optimizing the trade-off between time spent on the HMM calculations and the time spent on the computation of the prior IBD probabilities. In addition, when more segregation indicators are explicitly modeled and/or multiple IBD variables are used for subsections of the pedigree, more accurate approximations may be obtained. This is a direction for further research.

Although we found that the first-order Markov approximation was accurate for the pedigrees we considered, there are situations where the approximation is known to have problems. In particular for more distantly related individuals, there is an increasing tendency for segments of IBD sharing to cluster. If at three consecutive loci, A, B, and C, two individuals are non-IBD at the middle locus B, they are far more likely to be IBD at C if they are IBD at A than if they are non-IBD at A.29 The reason for this is that a single recombination event between A and B is enough to break down the IBD sharing but that a recombination event in the same meiosis is sufficient to fully restore the sharing at locus C. The first-order Markov approximation underestimates the probability of IBD sharing in this situation.29 Preliminary simulations suggest that the magnitude of this effect was small for the pedigrees studied in this paper. It may be beneficial to include an error model of some kind,21 so that if