


Comparison of common burden tests for genetic association studies of rare variants

Xinqi Lin 1,2, Rong Liang 1, Junguo Zhang 1, Lucheng Pi 1, Sidong Chen 1, Li Liu 1, Yanhui Gao 1
1. Department of Epidemiology and Biostatistics, School of Public Health, Guangdong Pharmaceutical University, Guangzhou 510310, China
2. Guangdong Province Hospital for Occupational Disease Prevention and Treatment, Guangzhou 510300, China

Handling editor: Fang Xiangdong

Funding: Natural Science Foundation of Guangdong Province, China (No. 2016A030313809); Science and Technology Planning Project of Guangdong Province, China (No. 2014A020212307); National Natural Science Foundation of China (No. 2016A030313809)

About the authors
First author: Xinqi Lin, master's student; research interest: molecular epidemiology. E-mail: linxinki@163.com

Corresponding author: Li Liu, PhD, associate professor; research interest: epidemiology and health statistics. E-mail: pupuliu919@163.com

Abstract: Common burden tests differ in statistical performance in genetic association studies of rare variants. To compare burden tests (CMC, WST, SUM and its extensions) under different genetic scenarios, we used computer simulation to generate rare-variant case-control datasets that varied in sample size, linkage disequilibrium (LD) parameter, number of mixed-in non-associated variants, and effects of the associated variants. Each method was applied to every dataset, and its type I error and power were calculated. The type I error of all methods was close to 0.05. When the rare variants had the same direction of effect, power increased with a larger LD parameter and with fewer non-associated variants for all methods except the data-adaptive SUM test (aSUM). When the effect directions differed, power fell markedly for all methods; except under strong LD, methods that account for effect direction had higher power than those that do not, and power increased with sample size. The statistical performance of burden tests is therefore affected by several factors, including effect size and direction, noise variants, and LD. In practice, prior biological information about the genetic variants should be used when choosing a method and setting the collapsing unit and weights, in order to improve study power.
Keywords: rare variant; genetic association study; burden test

Cite this article: Xinqi Lin, Rong Liang, Junguo Zhang, Lucheng Pi, Sidong Chen, Li Liu, Yanhui Gao. Comparison of common burden tests for genetic association studies of rare variants. Hereditas (Beijing), 2018, 40(2): 162-169. doi:10.16288/j.yczz.17-174

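With the rapid development of next-generation sequencing, the contribution of rare variants to complex traits has drawn increasing attention [1, 2]. Because the minor allele frequency (MAF) of a rare variant in the population is very low (<5%), traditional single-variant or multi-variant tests have very low power. To overcome this problem, researchers often collapse the rare variants within a region of interest (ROI) into a genetic burden score that is used directly in association analysis; such approaches are collectively called burden tests. Examples include the cohort allelic sums test (CAST) [3], which tests the collapsed variable between cases and controls; the combined multivariate and collapsing (CMC) method [4], which applies Hotelling's T2 test to the collapsed rare-variant score together with common variants; the weighted sum test (WST) [5], which weights each variant using the variance of its frequency; the sum test (SUM), which enters the collapsed rare variants into a regression model together with covariates; and, building on SUM, methods that take the direction of variant effects into account, namely the sum of squared marginal score statistics (SSU), its weighted form (SSUw) [6, 7] and the data-adaptive sum test (aSUM) [8]. However, the statistical performance and practical behavior of these methods still need further study. We therefore simulated case-control genetic association data and compared the power and type I error of common burden tests under a range of genetic scenarios, to provide a reference for the effective use of burden tests in rare-variant association analysis.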

1 Principles and methods

The burden tests in most common use are CMC, WST, and SUM with its extensions (SSU, SSUw and aSUM).

1.1 The CMC method

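The CMC method divides the genetic region into k sub-regions according to allele frequency or to different ROIs, collapses all rare variant sites within each sub-region, and then applies Hotelling's T2 test with the statistic

$T^2=\frac{n_{1}n_{0}}{n_{1}+n_{0}}(\bar{X}^{A}-\bar{X}^{\bar A})^{T}S^{-1}(\bar{X}^{A}-\bar{X}^{\bar A}) \qquad (1)$
In Eq. (1), $n_1$ and $n_0$ are the numbers of cases and controls, $\bar{X}^{A}$ and $\bar{X}^{\bar A}$ are the mean vectors of the collapsed rare variants in the two groups, and $S$ is the variance-covariance matrix. Under the null hypothesis that the rare variants are unrelated to disease, the noncentrality is zero and $T^2$ approximately follows a central $\chi^2$ distribution with k degrees of freedom.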
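As an illustration of the T2 statistic above, the following is a minimal Python sketch, not the authors' R program. It assumes that S is the pooled within-group variance-covariance matrix and that each input column is one collapsed sub-region; the function name `hotelling_t2` is hypothetical.

```python
import numpy as np

def hotelling_t2(x_cases, x_controls):
    """Two-sample Hotelling T^2 for collapsed rare-variant variables.

    x_cases, x_controls: arrays of shape (n1, k) and (n0, k), one column
    per sub-region (collapsed indicator or count).
    """
    x_cases = np.asarray(x_cases, dtype=float)
    x_controls = np.asarray(x_controls, dtype=float)
    n1, n0 = len(x_cases), len(x_controls)
    diff = x_cases.mean(axis=0) - x_controls.mean(axis=0)
    # Assumption: S is the pooled within-group covariance matrix.
    s1 = np.atleast_2d(np.cov(x_cases, rowvar=False))
    s0 = np.atleast_2d(np.cov(x_controls, rowvar=False))
    s = ((n1 - 1) * s1 + (n0 - 1) * s0) / (n1 + n0 - 2)
    # diff' S^{-1} diff, computed with a linear solve instead of an inverse
    return float(n1 * n0 / (n1 + n0) * diff @ np.linalg.solve(s, diff))
```

With a single collapsed variable (k = 1) this reduces to the square of the pooled two-sample t statistic, which gives a simple check of the implementation.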

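1.2 The WST method

Assuming that variants with lower frequency may have larger genetic effects, Madsen and Browning [5] proposed first computing a weight for the j-th variant in the ROI from the total sample size n and the variant frequency $q_j$ in controls, $\hat w_j=\sqrt{n \cdot q_{j}(1-q_j)}$, and then computing the genetic score of individual i,

$\gamma_i=\sum^{p}_{j=1}\frac{X_{ij}}{\hat w_j} \qquad (2)$
In Eq. (2), $X_{ij}$ is the number of mutant alleles carried by individual i at variant site j; under an additive genetic model, $X_{ij} \in \{0,1,2\}$. As in the Wilcoxon test, the genetic scores $\gamma_i$ of all individuals are ranked, the rank sums of the case and control groups are obtained, and the p value is computed by a permutation test.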
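The weighting, scoring and ranking steps above can be sketched in Python as follows. This is a hedged illustration rather than the program used in the study. Two points are assumptions: the control frequency uses the add-one estimate $q_j=(m_j+1)/(2n_0+2)$ from Madsen and Browning's original paper, which keeps the weight positive when a variant is absent from controls, and the p value is a one-sided empirical permutation p value with weights recomputed for every permutation. All function names are hypothetical.

```python
import numpy as np

def wst_scores(genotypes, is_case):
    """Weighted-sum genetic score gamma_i for every individual."""
    g = np.asarray(genotypes, dtype=float)          # shape (n, p), entries 0/1/2
    is_case = np.asarray(is_case, dtype=bool)
    n = g.shape[0]
    controls = g[~is_case]
    # Assumption: add-one estimate of the mutant-allele frequency in controls
    q = (controls.sum(axis=0) + 1) / (2 * len(controls) + 2)
    w = np.sqrt(n * q * (1 - q))
    return (g / w).sum(axis=1)

def midranks(x):
    """Ranks starting at 1, with ties given their average rank."""
    x = np.asarray(x, dtype=float)
    sorted_x = np.sort(x)
    lo = np.searchsorted(sorted_x, x, side="left")
    hi = np.searchsorted(sorted_x, x, side="right")
    return (lo + hi + 1) / 2.0

def wst_pvalue(genotypes, is_case, n_perm=1000, seed=0):
    """One-sided permutation p value for the rank sum of the cases."""
    rng = np.random.default_rng(seed)
    is_case = np.asarray(is_case, dtype=bool)

    def rank_sum(labels):
        return midranks(wst_scores(genotypes, labels))[labels].sum()

    observed = rank_sum(is_case)
    permuted = np.array([rank_sum(rng.permutation(is_case)) for _ in range(n_perm)])
    return (1 + np.sum(permuted >= observed)) / (n_perm + 1)
```

The one-sided form tests for an excess of mutations in cases; a two-sided or standardized version would be an equally reasonable design choice.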

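1.3 The SUM method and its extensions

The SUM method assumes that all p variants in the ROI share a common effect $\beta_c$ and fits the logistic regression model

$\text{Logit}\,\Pr(Y_{i}=1)=\beta_{c0}+\sum^{p}_{j=1} X_{ij} \beta_{c} \qquad (3)$
with statistical inference based on the score test. For a covariate vector $X_i$, the score vector and its covariance matrix are

$U=\sum^{n}_{i=1}(Y_{i}-\bar Y) X_{i} \qquad (4)$
$V=\bar Y (1-\bar Y) \sum^{n}_{i=1}(X_{i}-\bar {X})(X_{i}-\bar{X})^{\prime} \qquad (5)$

The SSU and SSUw methods are built from the marginal score vector $U_M$, obtained by taking $X_i$ as the vector of the p genotypes, and its covariance matrix $V_M$:

$SumSqU=U^{\prime}_{M}U_{M}=(Y-\bar Y)^{\prime}XX^{\prime}(Y-\bar Y) \qquad (6)$
$SumSqUw=U^{\prime}_{M}\,\mathrm{Diag}(V_{M})^{-1}U_{M} \qquad (7)$

The null distribution of the test statistics in Eqs. (6) and (7) is that of a quadratic form and can be approximated by $a \chi^{2}_{d}+b$. Because the statistics are quadratic, they are not sensitive to the direction of the variant effects.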
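To make the relation between these score-based statistics concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the AssotesteR implementation: no covariates are adjusted for, the SUM statistic is taken as the one-degree-of-freedom score statistic for the burden score $\sum_j X_{ij}$, and SSUw is taken as $U' \mathrm{Diag}(V)^{-1} U$. The function names are hypothetical.

```python
import numpy as np

def score_stats(genotypes, y):
    """Marginal score vector U and its covariance V under the null."""
    x = np.asarray(genotypes, dtype=float)   # shape (n, p)
    y = np.asarray(y, dtype=float)           # shape (n,), 1 = case, 0 = control
    ybar = y.mean()
    u = x.T @ (y - ybar)
    xc = x - x.mean(axis=0)
    v = ybar * (1 - ybar) * (xc.T @ xc)
    return u, v

def sum_test(genotypes, y):
    """Score statistic for the burden score: (1'U)^2 / (1'V1)."""
    u, v = score_stats(genotypes, y)
    return float(u.sum() ** 2 / v.sum())

def ssu(genotypes, y):
    """Sum of squared marginal score statistics, U'U."""
    u, _ = score_stats(genotypes, y)
    return float(u @ u)

def ssuw(genotypes, y):
    """Weighted form, U' Diag(V)^{-1} U (assumed definition)."""
    u, v = score_stats(genotypes, y)
    return float(np.sum(u ** 2 / np.diag(v)))
```

In `sum_test` the marginal scores are added before squaring, so scores of opposite sign cancel, whereas `ssu` and `ssuw` square each score first. This is one way to see why the quadratic statistics are less affected by opposite effect directions.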

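A data-adaptive approach has also been proposed [8]. At a given significance level $\alpha_0$, variants with protective effects are first recoded in the opposite direction according to single-variant regression results: if $\hat \beta_{M,j} < 0$ and $P_{M,j} \le \alpha_{0}$, then $X_{\cdot j}$ is replaced by $X^{*}_{\cdot j}=2-X_{\cdot j}$; otherwise $X^{*}_{\cdot j}=X_{\cdot j}$. Model (3) is then fitted to the recoded data, the score statistic U and its variance V are computed, and the aSum test statistic is constructed from permuted samples,

$aSum=(U-U_{0})^{\prime}V^{-1}_{0}(U-U_{0}) \qquad (8)$
In Eq. (8), $U_0$ and $V_0$ are the means of the score statistic and of its variance over B permuted samples. For small samples, the p value is calculated from the null distribution $a \chi^{2}_{d}+b$, where a and b are estimated by the Satterthwaite approximation [10].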
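The recoding step of the data-adaptive test can be sketched in Python as below. This covers only the recoding, not the permutation-based statistic. As labeled assumptions, the single-variant regression is replaced by the marginal score test (its score has the same sign as the single-variant logistic estimate), the default `alpha0=0.1` is purely illustrative, and the function name `asum_recode` is hypothetical.

```python
import math
import numpy as np

def asum_recode(genotypes, y, alpha0=0.1):
    """Flip the coding (X -> 2 - X) of variants that look protective.

    Returns the recoded genotype matrix and the list of flipped columns.
    """
    x = np.asarray(genotypes, dtype=float)   # shape (n, p), entries 0/1/2
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    u = x.T @ (y - ybar)                                     # marginal scores
    var = ybar * (1 - ybar) * ((x - x.mean(axis=0)) ** 2).sum(axis=0)
    recoded = x.copy()
    flipped = []
    for j in range(x.shape[1]):
        if var[j] == 0:          # monomorphic variant: nothing to test
            continue
        # Upper tail of a chi-square with 1 df: P(chi2_1 > s) = erfc(sqrt(s / 2))
        p_value = math.erfc(math.sqrt(u[j] ** 2 / var[j] / 2))
        if u[j] < 0 and p_value <= alpha0:
            recoded[:, j] = 2 - x[:, j]
            flipped.append(j)
    return recoded, flipped
```

Because the flipping decision uses the observed phenotypes, the null distribution of any statistic computed on the recoded data has to be obtained by repeating the recoding within each permuted sample, which is why the method relies on permutation.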

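2 Simulation experiments

2.1 Parameter settings for the simulated data

We simulated case-control data with equal numbers of cases ($n_1$) and controls ($n_0$) and 8 rare variants. Three settings used fixed odds ratios (OR): all ORs equal to 1, so that no rare variant is associated with disease status, for estimating type I error; all ORs equal to 2, representing associated variants with the same effect direction; and 4 ORs equal to 2 with the other 4 equal to 0.5, representing associated variants with different effect directions. In a further setting, $OR_j=\exp(b_j)$ with $b_{j}=\frac{1}{\sqrt{n \cdot MAF_{j}(1-MAF_{j})}}$, so that variants with smaller MAF receive larger effects. The other parameters were the sample size ($n_1=n_0=250$, 500 and 1000), the linkage disequilibrium (LD) parameter between variants ($\rho=0$, 0.5 and 0.9), and the number of mixed-in non-associated variants with OR fixed at 1 (0, 4, 8 and 16). The baseline disease prevalence was set to $p_0=0.05$. For each simulation condition, 1000 datasets were generated.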
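The design above can be enumerated with a short Python sketch, shown here only to make the size of the simulation explicit. The constant and function names are hypothetical, and `n_total` is assumed to be the total sample size $n_1+n_0$ in the MAF-dependent effect formula with the square root as given in the text.

```python
import itertools
import math

SAMPLE_SIZES = (250, 500, 1000)    # n1 = n0
LD_PARAMETERS = (0.0, 0.5, 0.9)    # rho
NOISE_COUNTS = (0, 4, 8, 16)       # number of non-associated variants

def condition_grid():
    """All combinations of sample size, LD parameter and noise count."""
    return list(itertools.product(SAMPLE_SIZES, LD_PARAMETERS, NOISE_COUNTS))

def maf_dependent_or(n_total, maf):
    """OR_j = exp(b_j) with b_j = 1 / sqrt(n * MAF_j * (1 - MAF_j))."""
    b = 1.0 / math.sqrt(n_total * maf * (1.0 - maf))
    return math.exp(b)
```

Each effect setting therefore spans 3 x 3 x 4 = 36 conditions, and under this formula the odds ratio shrinks as either the MAF or the sample size grows.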

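2.2 Generation of the simulated datasets

Case-control datasets were generated under the conditions above by the following steps [9, 11]:

(1) Generate a vector $Z=(Z_{1},\cdots,Z_{p})^{\prime}$ from a multivariate normal distribution, with the correlation between any two elements defined as $Corr(Z_{i},Z_{j})=\rho ^{|i-j|}$.
(2) Convert Z into a binary haplotype. The MAFs of associated and non-associated variants are drawn at random from the uniform distributions U(0.001, 0.01) and U(0.01, 0.05), respectively, and the normal quantile corresponding to each MAF is taken as the cut-off. If an element of Z is smaller than its cut-off, the corresponding element of haplotype X1_1 is set to 1; otherwise it is set to 0.
(3) Repeat step (2) to generate a second haplotype X1_2. Adding X1_1 and X1_2 gives the genotype (0, 1 or 2) at each rare variant.
(4) Generate the disease status $Y_i$ of individual i from the logistic regression model

$\text{Logit}\,\Pr(Y_{i}=1)=\beta_{0}+\sum^{p}_{j=1}\beta_{j}X_{ij} \qquad (9)$
In Eq. (9), $\beta_{j}=\ln(OR_{j})$ and $\beta_{0}=\ln(\frac{p_{0}}{1-p_{0}})$, with $OR_j$ and $p_0$ as defined above.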
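The generation steps can be sketched in Python as follows. This is a hedged illustration, not the authors' R code. In particular, the text does not say how the required numbers of cases and controls are reached; the sketch assumes that individuals are generated in batches from the population model and retained until both groups are full. The function name, the `batch` parameter and the seed handling are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def simulate_case_control(n_cases, n_controls, odds_ratios, mafs, rho,
                          p0=0.05, seed=0, batch=5000):
    """Return genotypes x (cases first, then controls) and status y."""
    rng = np.random.default_rng(seed)
    mafs = np.asarray(mafs, dtype=float)
    p = len(mafs)
    idx = np.arange(p)
    corr = float(rho) ** np.abs(idx[:, None] - idx[None, :])   # rho^|i-j|
    chol = np.linalg.cholesky(corr)
    # Cut-off c_j with P(Z_j < c_j) = MAF_j
    cutoffs = np.array([NormalDist().inv_cdf(m) for m in mafs])
    beta = np.log(np.asarray(odds_ratios, dtype=float))
    beta0 = np.log(p0 / (1 - p0))

    def haplotypes():
        z = rng.standard_normal((batch, p)) @ chol.T   # correlated normals
        return (z < cutoffs).astype(int)

    cases, controls = [], []
    n_case_rows = n_control_rows = 0
    # Assumption: keep sampling the population until both groups are filled.
    while n_case_rows < n_cases or n_control_rows < n_controls:
        g = haplotypes() + haplotypes()                # genotypes 0/1/2
        prob = 1.0 / (1.0 + np.exp(-(beta0 + g @ beta)))
        diseased = rng.random(batch) < prob
        cases.append(g[diseased])
        controls.append(g[~diseased])
        n_case_rows += int(diseased.sum())
        n_control_rows += int((~diseased).sum())
    x = np.vstack([np.vstack(cases)[:n_cases], np.vstack(controls)[:n_controls]])
    y = np.concatenate([np.ones(n_cases, dtype=int), np.zeros(n_controls, dtype=int)])
    return x, y
```

A call for one condition would draw the MAFs first, for example 8 values from U(0.001, 0.01) for associated variants and 4 values from U(0.01, 0.05) for non-associated ones, and pass odds ratios of 2 and 1 in the same order.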

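2.3 Statistical analysis

For the 1000 simulated datasets under each condition, CMC, WST, SUM and its extensions (SSU, SSUw and aSUM) were applied. Except for WST, rare variants within the ROI were collapsed by counting, that is, the total number of rare-variant mutations within the ROI was used as the new variable. The type I error or power of each method was then calculated for each simulation condition. Data simulation and statistical analysis were carried out in R version 3.0.2; SUM and its extensions were run with the AssotesteR package, and CMC and WST with the R programs of Basu and Pan [7].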
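Empirical type I error and power are both rejection proportions over the simulated datasets, so they carry Monte Carlo error. The short Python sketch below shows the calculation; the nominal level of 0.05 and the normal-approximation interval are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def rejection_rate(p_values, alpha=0.05):
    """Proportion of datasets in which the test rejects at level alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)

def monte_carlo_interval(rate, n_datasets, z=1.96):
    """Approximate 95% range of an estimated rejection rate."""
    se = math.sqrt(rate * (1 - rate) / n_datasets)
    return rate - z * se, rate + z * se
```

Under these assumptions, a test with a true type I error of 0.05 evaluated on 1000 datasets would be expected to give an empirical value between about 0.036 and 0.064, which is a useful yardstick when reading the type I error table.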

3 Results and analysis

3.1 Type I error of each method

Table 1 Type I error of each method under the simulation conditions
$n_{1}=n_{0}$ Method Non-associated variants = 0 Non-associated variants = 4 Non-associated variants = 8 Non-associated variants = 16
$\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$
250 CMC 0.055 0.046 0.055 0.030 0.063 0.036 0.045 0.051 0.042 0.046 0.036 0.047
WST 0.048 0.057 0.054 0.032 0.046 0.033 0.058 0.047 0.045 0.039 0.048 0.044
SUM 0.049 0.040 0.037 0.031 0.056 0.037 0.055 0.050 0.041 0.038 0.050 0.051
SSU 0.054 0.045 0.050 0.033 0.053 0.044 0.052 0.044 0.048 0.053 0.054 0.051
SSUw 0.057 0.045 0.060 0.033 0.051 0.032 0.055 0.042 0.053 0.057 0.047 0.048
aSUM 0.052 0.058 0.046 0.040 0.058 0.058 0.063 0.057 0.047 0.055 0.053 0.067
500 CMC 0.048 0.048 0.053 0.035 0.054 0.034 0.053 0.045 0.042 0.043 0.051 0.052
WST 0.046 0.048 0.055 0.039 0.056 0.039 0.047 0.047 0.048 0.033 0.051 0.058
SUM 0.040 0.047 0.054 0.039 0.051 0.047 0.050 0.046 0.048 0.036 0.041 0.050
SSU 0.058 0.047 0.054 0.047 0.075 0.046 0.058 0.052 0.047 0.059 0.049 0.050
SSUw 0.047 0.051 0.049 0.039 0.073 0.052 0.052 0.053 0.051 0.056 0.042 0.042
aSUM 0.061 0.055 0.063 0.052 0.053 0.045 0.060 0.048 0.051 0.048 0.046 0.071
1000 CMC 0.043 0.048 0.059 0.052 0.048 0.040 0.044 0.041 0.047 0.036 0.055 0.048
WST 0.045 0.052 0.052 0.052 0.050 0.041 0.038 0.050 0.051 0.043 0.053 0.053
SUM 0.044 0.052 0.060 0.046 0.054 0.050 0.047 0.039 0.049 0.041 0.057 0.055
SSU 0.056 0.048 0.061 0.049 0.054 0.047 0.045 0.053 0.045 0.040 0.048 0.056
SSUw 0.044 0.045 0.060 0.052 0.053 0.049 0.041 0.056 0.047 0.051 0.051 0.054
aSUM 0.047 0.048 0.064 0.054 0.051 0.045 0.050 0.051 0.056 0.047 0.058 0.070


3.2 Power of each method

Table 2 Power of each method when the associated rare variants have the same effect direction
$n_{1}=n_{0}$ Method Non-associated variants = 0 Non-associated variants = 4 Non-associated variants = 8 Non-associated variants = 16
$\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$
250 CMC 0.734 0.813 0.999 0.564 0.747 0.996 0.349 0.687 0.970 0.316 0.618 0.906
WST 0.725 0.900 1.000 0.577 0.856 0.999 0.446 0.771 0.998 0.362 0.732 0.997
SUM 0.756 0.962 1.000 0.607 0.943 1.000 0.511 0.914 1.000 0.398 0.889 1.000
SSU 0.457 0.914 1.000 0.386 0.917 1.000 0.345 0.906 1.000 0.335 0.906 1.000
SSUw 0.421 0.918 1.000 0.339 0.902 1.000 0.286 0.895 1.000 0.271 0.884 1.000
aSUM 0.592 0.903 1.000 0.427 0.875 0.998 0.340 0.829 1.000 0.242 0.814 0.999
500 CMC 0.946 0.991 1.000 0.858 0.974 1.000 0.660 0.960 1.000 0.573 0.940 1.000
WST 0.947 0.992 1.000 0.859 0.977 1.000 0.739 0.965 1.000 0.621 0.933 1.000
SUM 0.950 0.998 1.000 0.883 0.998 1.000 0.764 0.993 1.000 0.667 0.992 1.000
SSU 0.764 0.998 1.000 0.734 0.995 1.000 0.679 0.991 1.000 0.645 0.993 1.000
SSUw 0.733 0.998 1.000 0.678 0.994 1.000 0.603 0.985 1.000 0.511 0.991 1.000
aSUM 0.899 0.997 0.725 0.787 0.991 0.748 0.655 0.982 0.755 0.542 0.983 0.728
1000 CMC 1.000 0.999 1.000 0.984 1.000 1.000 0.942 1.000 1.000 0.892 0.995 1.000
WST 1.000 1.000 1.000 0.987 0.999 1.000 0.961 0.998 1.000 0.881 0.995 1.000
SUM 1.000 1.000 1.000 0.991 1.000 1.000 0.963 1.000 1.000 0.908 1.000 1.000
SSU 0.973 1.000 1.000 0.968 1.000 1.000 0.941 1.000 1.000 0.921 1.000 1.000
SSUw 0.969 1.000 1.000 0.963 1.000 1.000 0.925 1.000 1.000 0.888 1.000 1.000
aSUM 0.995 0.921 0.011 0.979 0.951 0.009 0.945 0.967 0.022 0.856 0.985 0.011



Table 3 Power of each method when the associated rare variants have different effect directions
$n_{1}=n_{0}$ Method Non-associated variants = 0 Non-associated variants = 4 Non-associated variants = 8 Non-associated variants = 16
$\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$
250 CMC 0.118 0.199 0.139 0.090 0.184 0.103 0.171 0.133 0.082 0.162 0.151 0.075
WST 0.120 0.124 0.145 0.092 0.131 0.135 0.074 0.091 0.111 0.071 0.089 0.099
SUM 0.123 0.129 0.123 0.099 0.142 0.112 0.076 0.098 0.088 0.072 0.110 0.117
SSU 0.367 0.331 0.167 0.298 0.306 0.150 0.268 0.227 0.113 0.230 0.244 0.134
SSUw 0.329 0.285 0.134 0.255 0.255 0.091 0.240 0.190 0.095 0.187 0.156 0.118
aSUM 0.276 0.261 0.145 0.225 0.222 0.144 0.163 0.164 0.111 0.136 0.159 0.126
500 CMC 0.170 0.340 0.248 0.144 0.323 0.214 0.323 0.279 0.165 0.291 0.282 0.183
WST 0.169 0.210 0.233 0.137 0.194 0.199 0.081 0.141 0.176 0.088 0.125 0.196
SUM 0.172 0.218 0.170 0.156 0.189 0.181 0.095 0.142 0.161 0.095 0.168 0.193
SSU 0.669 0.608 0.247 0.595 0.539 0.241 0.518 0.475 0.190 0.480 0.423 0.228
SSUw 0.638 0.565 0.207 0.562 0.487 0.222 0.482 0.427 0.163 0.414 0.337 0.159
aSUM 0.610 0.529 0.258 0.485 0.444 0.260 0.396 0.351 0.200 0.318 0.294 0.230
1000 CMC 0.283 0.595 0.403 0.249 0.553 0.390 0.524 0.525 0.335 0.525 0.517 0.341
WST 0.271 0.338 0.331 0.251 0.288 0.309 0.167 0.228 0.278 0.139 0.211 0.280
SUM 0.293 0.342 0.266 0.253 0.293 0.267 0.172 0.243 0.243 0.143 0.258 0.270
SSU 0.941 0.909 0.410 0.894 0.845 0.373 0.856 0.785 0.323 0.828 0.736 0.326
SSUw 0.921 0.893 0.348 0.884 0.829 0.315 0.828 0.750 0.286 0.776 0.658 0.282
aSUM 0.891 0.802 0.436 0.775 0.696 0.407 0.682 0.618 0.342 0.575 0.504 0.346



Table 4 Power of each method when associated rare variants with lower MAF have larger effects
$n_{1}=n_{0}$ Method Non-associated variants = 0 Non-associated variants = 4 Non-associated variants = 8 Non-associated variants = 16
$\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$ $\rho=0$ $\rho=0.5$ $\rho=0.9$
250 CMC 0.596 0.695 0.998 0.215 0.548 0.979 0.217 0.415 0.949 0.189 0.314 0.905
WST 0.605 0.803 0.999 0.276 0.508 0.942 0.181 0.367 0.898 0.156 0.296 0.837
SUM 0.620 0.914 1.000 0.239 0.817 1.000 0.154 0.729 0.999 0.112 0.714 0.999
SSU 0.289 0.828 1.000 0.085 0.658 1.000 0.075 0.591 0.996 0.066 0.610 0.999
SSUw 0.297 0.850 1.000 0.225 0.841 1.000 0.185 0.778 1.000 0.165 0.761 1.000
aSUM 0.487 0.858 1.000 0.232 0.769 0.999 0.151 0.667 0.999 0.105 0.639 0.999
500 CMC 0.540 0.683 1.000 0.190 0.553 0.995 0.230 0.414 0.988 0.169 0.343 0.964
WST 0.545 0.766 0.999 0.239 0.427 0.916 0.199 0.310 0.874 0.157 0.250 0.799
SUM 0.560 0.890 1.000 0.211 0.764 1.000 0.158 0.675 1.000 0.107 0.637 0.998
SSU 0.272 0.783 1.000 0.056 0.550 1.000 0.072 0.519 1.000 0.054 0.539 0.997
SSUw 0.272 0.813 1.000 0.212 0.803 1.000 0.195 0.738 1.000 0.150 0.707 1.000
aSUM 0.455 0.831 1.000 0.211 0.692 1.000 0.146 0.596 1.000 0.109 0.581 0.999
1000 CMC 0.506 0.619 0.997 0.181 0.482 0.991 0.212 0.303 0.980 0.183 0.374 0.946
WST 0.511 0.700 0.998 0.232 0.259 0.841 0.171 0.287 0.767 0.146 0.227 0.699
SUM 0.542 0.837 1.000 0.200 0.675 0.999 0.144 0.645 0.996 0.096 0.585 0.998
SSU 0.230 0.724 1.000 0.065 0.470 0.999 0.057 0.430 0.995 0.069 0.467 0.996
SSUw 0.259 0.767 1.000 0.201 0.727 1.000 0.182 0.719 1.000 0.161 0.654 0.999
aSUM 0.428 0.778 1.000 0.190 0.611 1.000 0.127 0.563 0.997 0.109 0.508 0.997
$\rho$ is the linkage disequilibrium parameter; the effect of the associated rare variants is $OR_{j}=\exp(b_{j})=\exp\left(\frac{1}{\sqrt{n\cdot MAF_{j}(1-MAF_{j})}}\right)$.


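4 Discussion



In addition, LD between variants strongly affects the power of burden tests. Compared with weak LD, when variants in strong LD have effects in the same direction, the collapsing strategy of burden tests reinforces the genetic effect; when the effect directions are opposite, the power of all methods falls markedly. LD leads to higher DNA sequence similarity among cases. Therefore, besides burden tests, another class of methods, the variance-component tests based on genetic similarity between individuals such as C-α [13] and the sequence kernel association test (SKAT) [14, 15], treats the effects of the rare variants in a set as random effects and converts the test of a difference in variant frequency between cases and controls into a test of the variance of the random effects. In theory, such methods are not affected by differing effect directions among the rare variants in the set, and they are also fairly robust to the inclusion of many non-associated variants. In practice, prior information from the literature, bioinformatics databases and other sources can be used to establish the LD among the rare variants, their effect directions (protective or risk) and their MAFs before a statistical method is chosen. If there is evidence that the variants under study include both protective and risk effects, methods that do not depend on effect direction, such as SSU, SSUw and aSUM, should be considered. If all variants are protective or all are risk variants but strong LD is present, any burden test other than aSUM can be chosen. If functional annotation provides no strong functional evidence for some variants, those variants should not be included in the association study or statistical analysis, so as to preserve power. Finally, the power tables reported here for different sample sizes under various genetic scenarios can serve as a reference for sample-size and power estimation in rare-variant association studies.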

The authors have declared that no competing interests exist.


References

[1] Lettre G. Rare and low-frequency variants in human common diseases and other complex traits. J Med Genet, 2014, 51(11): 705-714.
[2] Wu L, Schaid DJ, Sicotte H, Wieben ED, Li H, Petersen GM. Case-only exome sequencing and complex disease susceptibility gene discovery: study design considerations. J Med Genet, 2014, 52(1): 10-16.
[3] Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutat Res/Fundam Mol Mech Mutagen, 2007, 615(1-2): 28-56.
[4] Li BS, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet, 2008, 83(3): 311-321.
[5] Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet, 2009, 5(2): e1000384.
[6] Pan W, Shen XT. Adaptive tests for association analysis of rare variants. Genet Epidemiol, 2011, 35(5): 381-388.
[7] Basu S, Pan W. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol, 2011, 35(7): 606-619.
[8] Han F, Pan W. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered, 2010, 70(1): 42-54.
[9] Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet Epidemiol, 2009, 33(6): 497-507.
[10] Satterthwaite F. An approximate distribution of estimates of variance components. Biometrics Bull, 1946, 2(6): 110-114.
[11] Wang T, Elston RC. Improved power by use of a weighted score test for linkage disequilibrium mapping. Am J Hum Genet, 2007, 80(2): 353-360.
[12] Nicolae DL. Association tests for rare variants. Annu Rev Genomics Hum Genet, 2016, 17(7): 117-130.
[13] Neale B, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genet, 2011, 7(3): e1001322.
[14] Zhang T. An introduction to support vector machines and other kernel-based learning methods. AI Magazine, 2001, 22(2): 103-104.
[15] Wu MC, Lee S, Cai TX, Li Y, Boehnke M, Lin XH. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 2011, 89(1): 82-93.