Regression 2
Presenter: Sahir Bhatnagar
When: Friday, July 15, 2016 Time: 11:00 AM - 12:30 PM
Room: Salon A, Carson Hall (Level 2)
Session Synopsis:
A model for interpretable high-dimensional interactions
Computational approaches to variable selection have been rigorously developed in the statistical literature. The need for such methods has become increasingly important with the advent of high-throughput technologies in genomics and brain imaging, where the number of truly important variables is believed to be small relative to the total number of variables. While the focus of these methods has been on additive models, there are several applications where interaction models can reflect biological phenomena and improve statistical power. For example, genome-wide association studies have been unable to explain a large proportion of heritability (the variance in phenotype attributable to genetic variants), and it has been suggested that this missing heritability may in part be due to gene-environment interactions. Furthermore, diseases are now thought to be the result of entire biological networks whose states are affected by environmental factors. These systemic changes can induce or eliminate strong correlations between elements of a network without necessarily affecting their mean levels. We therefore propose a multivariate penalization procedure for detecting interactions between high-dimensional data (p >> n) and an environmental factor whose effect on the high-dimensional data is widespread and plays a role in predicting the response. Our approach improves on existing procedures for detecting such interactions in three ways: 1) it automatically enforces the strong heredity property, i.e., an interaction term can be included in the model only if the corresponding main effects are also in the model; 2) it reduces the dimensionality of the problem and leverages the high correlations by transforming the input feature space using network connectivity measures; and 3) it leads to interpretable models that are biologically meaningful. An extensive simulation study shows that our method outperforms the LASSO, elastic net, and group LASSO in terms of prediction accuracy and feature selection. We apply our method to the NIH pediatric brain development study, to refine estimates of which regions of the frontal cortex are associated with intelligence scores, and to an intergenerational mouse study that aims to identify patterns of DNA methylation that interact with folate supplementation and subsequently play a role in postnatal outcomes. Our method is implemented in an R package.
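For concreteness, one standard way to encode the strong heredity property, in the spirit of Choi, Li and Zhu (2010), is to reparameterize each interaction coefficient as a product of the corresponding main-effect coefficients; this is a sketch of the general device, not necessarily the authors' exact penalty:

    Y = \beta_0 + \sum_j \beta_j X_j + \beta_E E + \sum_j \gamma_j X_j E + \varepsilon,
    \qquad \gamma_j = \tau_j \, \beta_j \, \beta_E .

Under this parameterization, \gamma_j can be non-zero only when both \beta_j and \beta_E are non-zero, so penalizing (\beta, \tau) yields a sparse model that respects strong heredity automatically.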
Regression 2
Presenter: Shelley Bull
When: Friday, July 15, 2016 Time: 11:00 AM - 12:30 PM
Room: Salon A, Carson Hall (Level 2)
Session Synopsis:
Multiple-linear-combination (MLC) regression tests for gene-based association analysis of common variants adapted to linkage disequilibrium structure
By jointly analyzing multiple variants within a gene, instead of one at a time, gene-based multiple regression can improve the power and robustness of genetic association analysis. Extending prior work that examined multiple linear combination (MLC) statistics for combined analysis of rare and common variants, here we investigate the analysis of common variants more extensively, under realistic quantitative trait models and conditions. The MLC method exploits the linkage disequilibrium structure within a gene to construct bins of closely correlated variants and, for variants within each bin, adjusts the coding scheme so that the majority of pairwise correlations are positive. After bin construction and recoding, variant effects within the same bin are combined linearly, and the multiple bin-specific effects are aggregated in a quadratic sum. This produces a test statistic with reduced degrees of freedom (df), equal to the number of bins. Through analytic power investigation and simulation studies based on HapMap Asian haplotypes, we compare the type I error and power of the MLC test with the multi-df generalized Wald test, as well as with 1-df linear combination, minimum p-value, principal-component-based, and variance-component tests, assuming various trait models with multiple causal variants and varying linkage disequilibrium structure. We find that MLC tests have competitive power and robustness whether causal SNPs are excluded from or included in the analysis. In particular, we report improved relative performance of MLC tests as the number of causal variants increases.
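Schematically, with B bins and recoded per-variant effect estimates \hat{\beta}_j, the statistic described above takes the form (an illustrative rendering; within-bin weighting and variance estimation follow the authors' construction):

    L_b = \sum_{j \in \text{bin } b} \hat{\beta}_j, \qquad
    T_{\mathrm{MLC}} = L^\top \left[ \widehat{\mathrm{Var}}(L) \right]^{-1} L \;\sim\; \chi^2_B \ \text{under } H_0,

where L = (L_1, \dots, L_B)^\top. Written this way, the reduction from the number of variants to the number of bins B is explicit.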
Regression 2
Presenter: Angelika Geroldinger
When: Friday, July 15, 2016 Time: 11:00 AM - 12:30 PM
Room: Salon A, Carson Hall (Level 2)
Session Synopsis:
Firth Logistic Regression for Rare Events and Extensions
We consider logistic regression modeling tasks for estimating the effect of a binary exposure on a binary outcome, focusing on situations of rare events. The Firth correction, which is equivalent to Bayesian analysis with the Jeffreys prior and to maximum likelihood estimation with additional, iteratively reweighted pseudo-observations, has been shown to yield nearly unbiased and very accurate estimates of regression coefficients. The method has become popular for its resistance to the problem of separation, where ordinary maximum likelihood estimates diverge to infinity. The drawback of the Firth correction is that predictions are biased towards 0.5, a consequence of the implicit penalization of the likelihood. We consider two extensions of Firth logistic regression (FL), named FL with intercept correction (FLIC) and FL with additional covariate (FLAC), to correct this behaviour. The two methods are defined by correcting the intercept after FL estimation (FLIC), or by including an additional binary covariate that disentangles the contributions of the original and the pseudo data in estimating the intercept (FLAC). While FLIC leaves the regression coefficients of the exposure and covariates untouched, FLAC may modify them, leading to estimates that are no longer unbiased. This effect is particularly strong if either the exposure is very unbalanced or events occur rarely. We compare FL with FLIC and FLAC in various simulated and real data sets and discuss their advantages and disadvantages. Funded by the Austrian Science Fund (FWF).
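For reference, the Firth correction maximizes the penalized log-likelihood (Firth, 1993)

    \ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2} \log \det I(\beta),

where I(\beta) is the Fisher information matrix. The Jeffreys-prior term shrinks the intercept as well as the slopes, which is the source of the bias of predicted probabilities towards 0.5 that FLIC and FLAC are designed to remove.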
Regression 2
Presenter: Laura Martino
When: Friday, July 15, 2016 Time: 11:00 AM - 12:30 PM
Room: Salon A, Carson Hall (Level 2)
Session Synopsis:
Statistical challenges in the use of meta-regression models for deriving Dietary Reference Values: how to move from a population of studies to a population of individuals
The setting of Dietary Reference Values (DRVs) for nutrients poses several statistical challenges, including the availability of reliable individual-level data on which to base the estimation of the dose-response relationship between nutrient dietary intake and an associated indicator of healthy status. Meta-regression of aggregated data from studies identified via a systematic review represents a valuable alternative in cases where individual data are not available or not suitable. Nonetheless, controversy exists over how to derive appropriate DRVs for a population of individuals from these dose-response meta-regression models. A sensible approach consists in estimating the model-based prediction interval for the dietary intake of the nutrient corresponding to the healthy level of the response variable; the lower bound of the prediction interval is taken as the intake level meeting the requirement of almost all healthy people. The main difficulty with this approach lies in the need to translate the centiles (the lower and upper bounds of the prediction interval) estimated via the meta-regression, which refer to the means of a theoretical sample drawn from a population of studies, into the corresponding centiles of the distribution in the underlying population of individuals. In this paper we propose a methodology to address this issue. A real-data example is provided, based on a systematic review recently performed by EFSA on the relationship between vitamin D intake and plasma/serum 25-hydroxyvitamin D [25(OH)D] concentration.
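As a sketch of the starting point, in standard random-effects meta-regression notation (not specific to this paper), the prediction interval for the mean response of a new study at intake x is

    \hat{y}(x) \pm t_{k-p-2} \sqrt{ \widehat{\mathrm{Var}}\!\left(\hat{y}(x)\right) + \hat{\tau}^{2} },

with k studies, p model covariates, and between-study variance \tau^{2}, in the style of the prediction interval of Higgins, Thompson and Spiegelhalter (2009). Its bounds are centiles of the distribution of study means; the contribution of the paper is precisely the translation of these study-level centiles into centiles for individuals, which the formula alone does not provide.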
Regression 2
Presenter: Ellis Patrick
When: Friday, July 15, 2016 Time: 11:00 AM - 12:30 PM
Room: Salon A, Carson Hall (Level 2)
Session Synopsis:
The weighted bootstrap for penalty parameter selection in sparse regression: Modeling Alzheimer's disease with clinical and microRNA data
In this talk we introduce a new variant of the weighted bootstrap as a resampling strategy for consistent model selection in the Lasso. We will show that it is an important alternative to k-fold cross-validation and the m-out-of-n paired bootstrap when complete separation occurs in some of the folds or resamples, respectively. This is a particular problem in two common scenarios: clinical studies, where sample sizes can be small but many features are observed, and genotyping data with severe class imbalance. Through simulation studies we demonstrate that, when selecting a penalty parameter, the percentage of samples used to train models in a resampling scheme essentially dictates the size of the penalty parameter and hence the sparsity of the final Lasso model. We will further show how our proposed weighted bootstrap can be seen as a continuous version of the more common k-fold cross-validation and m-out-of-n paired bootstrap. Finally, we empirically illustrate our weighted bootstrap by applying the Lasso to integrate clinical and microRNA data in modeling Alzheimer's disease.
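The following minimal Python sketch illustrates one way such a scheme can work; the Beta-distributed weighting, the function name weighted_bootstrap_lambda, and the parameter train_frac are illustrative assumptions, not the authors' algorithm. Each observation receives a continuous training weight u in (0, 1) and the complementary validation weight 1 - u, so every fit sees all observations and no outcome class can vanish from a resample; the expected training fraction plays the role that the training percentage plays in k-fold cross-validation or the m-out-of-n bootstrap. The sketch uses the Gaussian Lasso; the same weighting applies to the logistic Lasso, where complete separation is the concern.

    import numpy as np
    from sklearn.linear_model import Lasso

    def weighted_bootstrap_lambda(X, y, lambdas, n_boot=100, train_frac=0.8, seed=0):
        """Pick the penalty minimizing mean complementary-weighted validation error.

        Assumes the columns of X and y are centered, so no intercept is fitted.
        """
        rng = np.random.default_rng(seed)
        a = 2.0
        b = a * (1.0 - train_frac) / train_frac   # Beta(a, b) has mean train_frac
        err = np.zeros((n_boot, len(lambdas)))
        for r in range(n_boot):
            u = rng.beta(a, b, size=X.shape[0])   # continuous train/validation "split"
            su = np.sqrt(u)
            Xw, yw = X * su[:, None], y * su      # weighted squared loss via row rescaling
            for j, lam in enumerate(lambdas):
                fit = Lasso(alpha=lam, fit_intercept=False).fit(Xw, yw)
                resid = y - fit.predict(X)
                err[r, j] = np.average(resid ** 2, weights=1.0 - u)  # complementary weights
        return lambdas[int(np.argmin(err.mean(axis=0)))]

Because every observation keeps a strictly positive weight in each fit, a rare outcome class can never be entirely absent from a training resample, which is exactly the failure mode of the fold-based schemes described above.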
Regression 2
Presenter: Ji-Hyung Shin
When: Friday, July 15, 2016 Time: 11:00 AM - 12:30 PM
Room: Salon A, Carson Hall (Level 2)
Session Synopsis:
Analyzing low-frequency variants in the presence of quantitative covariates using prospective and retrospective logistic regression models
Genome-wide association studies have identified thousands of common genetic variants associated with complex traits and diseases. However, much of the genetic contribution to disease risk still remains unexplained. Some of the unexplained contribution may be captured by variants of low frequency (<5%). For categorical outcomes, however, low-frequency variants lead to low counts of observations in some or all genotype categories. Hence, standard logistic regression analysis often violates its large-sample assumptions, resulting in poor control of type I error and low statistical power. In addition, low-frequency variants may lead to separation in the outcome variable, yielding estimates heavily influenced by random variation. Alternatively, sparse-data approaches can be used. One such approach is Firth's penalized-likelihood approach, in which the penalty is the Jeffreys invariant prior. In this work, we evaluated Firth's penalized logistic regression under two analytic approaches: (i) a prospective model, with disease outcome as the response variable and genotype as a predictor variable; and (ii) a retrospective model, with the additively constrained genotype as the response variable and disease outcome as a predictor variable. We conducted a simulation study with a binary disease outcome under a cohort design, with disease risk depending on a bi-allelic variant and a quantitative covariate, which may or may not be correlated with each other. For highly sparse data (e.g., <40 minor allele counts in a sample of 2000), all penalized and standard regression approaches perform poorly under both the prospective and retrospective models. Otherwise, penalized approaches generally perform better than standard approaches. When the genetic variant and the quantitative covariate are uncorrelated, the prospective and retrospective penalized approaches tend to perform similarly, whereas the prospective approach performs better when the variant and the covariate are correlated. We recommend prospective penalized logistic regression as a useful alternative for the analysis of binary traits and low-frequency variants in the presence of a quantitative covariate, such as principal components for population stratification, and suggest that the retrospective model be applied cautiously.
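In display form, the two analytic approaches compared here are (a schematic rendering; the retrospective binomial parameterization is one common additive formulation, not necessarily the authors' exact one):

    \text{prospective:} \quad \operatorname{logit} P(D = 1 \mid G, X) = \alpha + \beta G + \gamma X,
    \text{retrospective:} \quad G \mid D, X \sim \mathrm{Binomial}(2, \pi), \quad \operatorname{logit} \pi = \alpha^{*} + \beta^{*} D + \gamma^{*} X,

with G the minor allele count (0, 1, 2) and Firth's Jeffreys-prior penalty added to the corresponding likelihood in each case; under both models, the hypothesis of no association is \beta = 0 or \beta^{*} = 0, respectively.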