Zhi Li1,2,3, Jun He1,3, Jun Jiang1,4, Richard G. Tait Jr.3, Stewart Bauck3, Wei Guo2, Xiao-Lin Wu1,3,4
1. College of Animal Science and Technology, Hunan Agricultural University, Changsha 410128, China
2. Department of Animal Science, University of Wyoming, Laramie, WY 82071, USA
3. Biostatistics and Bioinformatics, Neogen GeneSeek, Lincoln, NE 68504, USA
4. Department of Animal Sciences, University of Wisconsin, Madison, WI 53706, USA
Abstract Single nucleotide polymorphism (SNP) chips have been widely used in genetic studies and breeding applications in animal and plant species, and the quality of SNP genotypes is of paramount importance. In practice, some genotypes often fail and must be imputed. Ungenotyped loci may also need to be imputed between different chips, or high-density genotypes imputed from low-density genotypes. In these circumstances, the validity and reliability of subsequent data analyses depend on the accuracy of the imputed genotypes. To better understand the factors affecting imputation accuracy, the present study investigated the impacts of SNP genotyping call rate and SNP genotyping error rate on the accuracy of genotype imputation under two scenarios in 20,116 U.S. Holstein cattle, each genotyped with a GGP 50K SNP chip. In scenario 1, the two factors were independent of each other: the simulated genotyping call rate varied from 50% to 100% and the simulated genotyping error rate varied from 0% to 50%. In scenario 2, the genotyping error rate was correlated with the genotyping call rate, and the relationship was established by fitting a linear regression model between the two variables on a real dataset; that is, as the simulated SNP call rate decreased from 100% to 50%, the SNP genotyping error rate increased from 0% to 13.55%. Finally, 5-fold cross-validation was used to assess imputation accuracy. The results showed that when the original SNP genotyping call rate was independent of the SNP genotyping error rate, imputation accuracy did not change significantly with the original genotyping call rate (P>0.05), but it decreased significantly as the genotyping error rate increased (P<0.01). However, when the original genotyping call rate was negatively correlated with the genotyping error rate, the imputation error rate increased as the original genotyping error rate increased. In both scenarios, the genotyping call rate needs to be no less than 0.90 to obtain genotype imputation accuracy of 98% or higher. These results can provide guidance for establishing quality assurance criteria for SNP genotyping in practice.
Keywords: SNP chip; genotyping; imputation accuracy; call rate; error rate
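The simulation design summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes genotypes coded 0/1/2 with -1 for a missing call, and the names `simulate_quality` and `linked_error_rate` are hypothetical. The slope 0.271 is not a reported regression coefficient: it is simply the straight line through the two endpoints quoted for scenario 2 (error rate 0% at call rate 100%, and 13.55% at call rate 50%). The imputation step itself is not shown.

```python
import numpy as np


def linked_error_rate(call_rate, slope=0.271):
    """Scenario 2 (assumed form): error rate rises linearly as call rate falls.

    The slope is derived from the two quoted endpoints only, not from the
    fitted regression, which is not given in the abstract.
    """
    return slope * (1.0 - call_rate)


def simulate_quality(genotypes, call_rate, error_rate, rng):
    """Return a degraded copy of a 0/1/2 genotype matrix.

    A fraction `error_rate` of all entries is replaced by a different
    genotype, then a fraction `1 - call_rate` of all entries is set to -1.
    """
    shape = genotypes.shape
    flat = np.asarray(genotypes, dtype=np.int64).ravel().copy()
    n = flat.size

    # Genotyping errors: adding 1 or 2 modulo 3 always changes the genotype.
    n_err = int(round(error_rate * n))
    err_idx = rng.choice(n, size=n_err, replace=False)
    flat[err_idx] = (flat[err_idx] + rng.integers(1, 3, size=n_err)) % 3

    # Failed calls: mask entries as missing to reach the target call rate.
    n_miss = int(round((1.0 - call_rate) * n))
    miss_idx = rng.choice(n, size=n_miss, replace=False)
    flat[miss_idx] = -1

    return flat.reshape(shape)


rng = np.random.default_rng(2024)
truth = rng.integers(0, 3, size=(200, 100))  # toy data: 200 animals, 100 SNPs

# Scenario 1: call rate and error rate are set independently.
s1 = simulate_quality(truth, call_rate=0.90, error_rate=0.05, rng=rng)

# Scenario 2: the error rate is determined by the call rate.
cr = 0.90
s2 = simulate_quality(truth, call_rate=cr, error_rate=linked_error_rate(cr), rng=rng)

called = s2 != -1
print("realized call rate:", called.mean())
print("realized error rate among called genotypes:",
      (s2[called] != truth[called]).mean())
```

In a full analysis, each degraded dataset would then be passed to an imputation program within each cross-validation fold, and the imputed genotypes compared with the true ones.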
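With continuing advances in high-throughput DNA sequencing and genotyping technologies, SNP chips have been widely used in genetic research and in animal and plant breeding[1,2], for example in genome-wide association studies (GWAS)[3,4], genomic selection[5,6], genomic breed composition[7], and genomic mating[8,9,10]. An important data-processing step when using SNP chips is genotype imputation: linkage disequilibrium and recombination-rate information among loci in a reference population is used to construct haplotypes of linked loci, and the constructed haplotypes are then used to impute (predict) the genotypes at missing loci in target individuals (a test population, or individuals with missing genotypes)[11,12].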
Fig. 2 Impact of SNP genotyping call rate on imputation error rate in a Holstein dairy population genotyped with GGP bovine 50K SNP chips. Lines of different colors represent different levels of SNP genotyping error rate.
Fig. 3 Impact of SNP genotyping error rate on imputation error rate in a Holstein dairy population genotyped with GGP bovine 50K SNP chips. Lines of different colors represent different levels of SNP genotyping call rate.
