Bioinformatics 1
Presenter: Jos Hageman
When: Monday, July 11, 2016 Time: 4:00 PM - 5:30 PM
Room: Salon B Carson Hall (Level 2)
Session Synopsis:
Sample size determination for the prediction of traits in metabolomics
Sample size calculation is an important prerequisite for a metabolomics study. When predicting traits from metabolomic profiles, the sample size should be large enough to achieve acceptable prediction accuracy, but not so large that it leads to unnecessarily costly measurements. In this talk, we demonstrate a simulation-based approach to estimating optimal sample sizes for the prediction of a series of sensory and physical traits, using data from the Centre for BioSystems Genomics (CBSG) involving six sensory/physical traits measured on tomatoes of 93 different genotypes. In this approach, a series of data sets with varying sample sizes is created under two scenarios. The first scenario uses subsampling from the full CBSG tomato data to create data sets of varying sample sizes. Subsampling reveals whether an acceptable prediction quality can be obtained with fewer observations than in the original, full data set. Since this can be done only after all measurements have been taken, its use lies in helping to decide on a sample size for similar, future studies. In the second scenario, we use only a small part of the CBSG tomato data, mimicking a small pilot study. From the pilot study, the covariance structure of the metabolites and sensory/physical traits is estimated using several methods, including various robust methods. Given the estimated covariance structure, data sets of varying sample sizes are simulated. This second scenario resembles the case where only limited pilot data are available and a sample size has to be decided for the actual study. In the next step, all generated data sets are subjected to lasso regression, using the metabolites as predictors and the sensory/physical traits as responses. By studying the relation between prediction accuracy and sample size, we determine the smallest sample size for which an acceptable prediction quality is still achieved. Results obtained with the different (robust) covariance estimators will be compared with each other as well as with the results from subsampling.
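As a rough illustration of the subsampling scenario (a minimal sketch, not the authors' code), the Python snippet below subsamples a synthetic stand-in for the tomato data at increasing sample sizes, fits a cross-validated lasso, and reports the smallest size whose average held-out R^2 clears an illustrative threshold; all data, names, and the 0.7 threshold are assumptions.

```python
# Sketch of scenario 1: subsample, fit lasso, track prediction accuracy vs. n.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_full, n_metabolites = 500, 200
X_full = rng.normal(size=(n_full, n_metabolites))            # metabolite profiles
beta = np.zeros(n_metabolites)
beta[:10] = 1.0                                              # a few informative metabolites
y_full = X_full @ beta + rng.normal(scale=2.0, size=n_full)  # one sensory trait

def mean_test_r2(n, n_repeats=20):
    """Average held-out R^2 of lasso models fitted on subsamples of size n."""
    scores = []
    for _ in range(n_repeats):
        idx = rng.choice(n_full, size=n, replace=False)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_full[idx], y_full[idx], test_size=0.3, random_state=0)
        scores.append(LassoCV(cv=5).fit(X_tr, y_tr).score(X_te, y_te))
    return float(np.mean(scores))

for n in (50, 100, 150, 200, 300, 400):                      # candidate sample sizes
    r2 = mean_test_r2(n)
    print(f"n = {n:3d}: mean test R^2 = {r2:.2f}")
    if r2 >= 0.7:                                            # illustrative acceptance threshold
        print(f"smallest acceptable sample size: {n}")
        break
```

The second scenario would replace the subsampling step with simulated draws (e.g. rng.multivariate_normal) from a joint covariance matrix estimated on the pilot data, possibly with a robust estimator.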
Bioinformatics 1
Presenter: Amit Meir
When: Monday, July 11, 2016 Time: 4:00 PM - 5:30 PM
Room: Salon B Carson Hall (Level 2)
Session Synopsis:
A generalized estimating equations framework for the modeling of intracellular cytokine staining count data
Intracellular cytokine staining (ICS), a type of cytometry experiment used to measure cytokine production at the single-cell level, is an important immunological measure used in immune monitoring and vaccine development. In a typical ICS experiment, blood samples are taken from subjects at different time points before and after challenge. These samples are divided into several sub-samples, each of which is stimulated with either the challenge antigen or a control. Antigen-specific cells produce cytokines in response to stimulation. These cell subsets are then counted to produce a data set of cell counts. The goal of these studies is often to identify cell subsets associated with a treatment-specific response to stimulation. A well-known challenge for flow and mass cytometry experiments is that they are prone to batch and technical variation, but they also produce many correlated cell subsets. These effects are often ignored in statistical analysis: cell subsets are treated independently, counts are modeled as proportions, and batch effects are not systematically accounted for, resulting in serious variability that could be confounded with treatment effects. These issues are of particular importance in ICS experiments, where within-subject comparisons are of interest. We propose a generalized estimating equation modeling framework for analyzing cytometry count data, allowing for the screening of cell populations while accounting for both technical and biological nuisance factors. We account for the overdispersion often observed when modeling small counts by using the beta-binomial distribution, as done in previous approaches such as the MIMOSA framework. We account for the correlations between cell subsets, both within a sample and across samples within a subject, by estimating an unstructured working correlation and computing robust confidence intervals. A further implication of modeling the different cell types jointly is that a separate set of regression coefficients is estimated for each. Thus, the number of subjects, usually quite limited, is often smaller than the number of parameters. We address this issue by treating all but the treatment variable as nuisance parameters and further regularizing the treatment effect when necessary. We demonstrate our methodology by applying it to experimental assays measuring cytokine expression at the single-cell level.
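The snippet below is a minimal sketch of this kind of GEE analysis on simulated ICS-style counts, using Python's statsmodels. It departs from the method above in two hedged ways: statsmodels' GEE has no beta-binomial family, so a Poisson family with a log-total offset stands in for the count model, and an exchangeable working correlation stands in for the unstructured one; all variable names and the simulated effect are illustrative.

```python
# Sketch: per-subject clusters of correlated cell-subset counts under
# stimulation vs. control, analyzed with GEE and sandwich standard errors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for subject in range(30):                                # subjects
    for subset in range(4):                              # correlated cell subsets
        for stim in (0, 1):                              # control vs. antigen
            total = int(rng.integers(5_000, 20_000))     # cells in the sub-sample
            p = 0.001 * (1 + 2 * stim * (subset == 0))   # only subset 0 responds
            rows.append(dict(subject=subject, subset=subset, stim=stim,
                             positive=rng.binomial(total, p), total=total))
df = pd.DataFrame(rows)

# Cluster on subject so correlations across subsets and stimulation
# conditions enter through the working correlation and robust errors;
# the stim:subset interaction gives each subset its own stimulation effect.
model = smf.gee("positive ~ stim * C(subset)", groups="subject", data=df,
                family=sm.families.Poisson(),
                offset=np.log(df["total"]),
                cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```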
Bioinformatics 1
Presenter: David Rocke
When: Monday, July 11, 2016 Time: 4:00 PM - 5:30 PM
Room: Salon B Carson Hall (Level 2)
Session Synopsis:
Excess False Positives in Negative-Binomial-Based Analysis of Data from RNA-Seq Experiments
RNA-Seq data are increasingly used for whole-genome differential mRNA expression analysis in lieu of gene expression arrays such as those from Affymetrix and Illumina. Because the raw data in RNA-Seq consist of counts of fragments mapping to each gene or exon, and because the counts are over-dispersed, it is common to model their distribution as negative binomial. Yet, empirically, methods based on the negative binomial often generate massively inflated false-positive rates, whether applied to real data or to simulated negative binomial data. This appears to be a consequence of the fact that the negative binomial with unknown scale is not an exponential-family distribution, and that, viewed as a quasi-likelihood, its link function, and thus its natural parameter, are functions of the scale parameter. Consequently, a linear model with a negative binomial quasi-likelihood is not a proper generalized linear model unless the scale is known. We demonstrate that, even when the data are truly negative binomial, it is better to use transformation and weighting followed by standard linear models than to fit a version of a generalized linear model with estimated scale.
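A minimal sketch of the phenomenon, not the authors' simulation study: the Python snippet below generates genes that are truly negative binomial with no group difference, fits a per-gene negative binomial regression with ML-estimated scale, and compares its realized type-I error at the nominal 5% level with that of a log-transform-plus-t-test stand-in for the transformation-and-weighting approach; sample sizes and the dispersion value are assumptions.

```python
# Null simulation: both groups share the same NB distribution, so any
# rejection rate above 0.05 at the 5% level is excess false positives.
import warnings
import numpy as np
import statsmodels.api as sm
from scipy import stats

warnings.filterwarnings("ignore")            # small-sample ML fits warn a lot
rng = np.random.default_rng(2)
n_per_group, n_genes, mu, disp = 3, 2000, 100.0, 0.5
group = np.repeat([0, 1], n_per_group)
X = sm.add_constant(group.astype(float))

nb_p, tt_p = [], []
for _ in range(n_genes):
    # NB draws via the gamma-Poisson mixture: mean mu, variance mu + disp*mu^2
    lam = rng.gamma(shape=1 / disp, scale=mu * disp, size=2 * n_per_group)
    y = rng.poisson(lam)
    try:
        fit = sm.NegativeBinomial(y, X).fit(disp=0)   # scale estimated by ML
        nb_p.append(fit.pvalues[1])                   # p-value of group effect
    except Exception:
        pass                                          # skip non-converged fits
    tt_p.append(stats.ttest_ind(np.log1p(y[group == 0]),
                                np.log1p(y[group == 1])).pvalue)

print(f"NB GLM rejection rate at 0.05:    {np.mean(np.array(nb_p) < 0.05):.3f}")
print(f"log + t-test rejection at 0.05:   {np.mean(np.array(tt_p) < 0.05):.3f}")
```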
Bioinformatics 1
Presenter: Ron Wehrens
When: Monday, July 11, 2016 Time: 4:00 PM - 5:30 PM
Room: Salon B Carson Hall (Level 2)
Session Synopsis:
Batch correction methods for metabolomics data in the presence of non-detects
Batch effects are common in mass-spectrometry-based metabolomics studies. Correction methods are often based on the presence of quality control (QC) samples, injected repeatedly during each batch. Alternatively, correction can be performed by making batch averages equal. In addition, within-batch variation must be accounted for. In all these cases, one has to take into account the fact that one is dealing with left-censored data: if the intensity of a feature is below a certain threshold, it will be reported as a non-detect. Often a number such as zero, or the smallest measured value, is used to replace these non-detects. Here we compare several strategies for batch correction, some of which depend on the presence of QC samples and some of which do not. The effect of imputing single values for non-detects is evaluated in each case. One very clear conclusion is that the common practice of substituting zero for a non-detect can have detrimental effects on the quality of the batch correction. Interestingly, corrections based on QC samples do not always have an advantage over corrections that do not need such information. These conclusions are illustrated using three large real-world data sets from plant metabolomics.
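To make the zero-substitution problem concrete, here is a minimal Python sketch (assumed data and threshold, not the talk's data sets): batches differ by known additive shifts, intensities below a detection limit become non-detects, and per-batch shifts are re-estimated after imputation. The batch shifted downward loses the most values to censoring, so imputing zero exaggerates its apparent batch effect.

```python
# Batch-average correction with left-censored data: compare imputation rules.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
per_batch, lod = 200, 50.0                     # lod: limit of detection
batch_shift = np.array([0.0, 30.0, -20.0])     # known additive batch effects

frames = []
for b, shift in enumerate(batch_shift):
    true = rng.lognormal(mean=4.0, sigma=0.5, size=per_batch)
    frames.append(pd.DataFrame({"batch": b, "intensity": true + shift}))
df = pd.concat(frames, ignore_index=True)
df.loc[df["intensity"] < lod, "intensity"] = np.nan   # non-detects

def estimated_shifts(df, impute):
    """Per-batch shifts (relative to the overall mean) after imputation."""
    y = df["intensity"].fillna(impute)
    m = y.groupby(df["batch"]).mean()
    return (m - m.mean()).round(1).to_dict()

print("true shifts: ", dict(enumerate((batch_shift - batch_shift.mean()).round(1))))
print("impute 0:    ", estimated_shifts(df, 0.0))      # biased toward extremes
print("impute lod/2:", estimated_shifts(df, lod / 2))  # much closer to truth
```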
Bioinformatics 1
Presenter: Viktor Jonsson
When: Monday, July 11, 2016 Time: 4:00 PM - 5:30 PM
Room: Salon B Carson Hall (Level 2)
Session Synopsis:
An overdispersed and zero-inflated statistical model for identification of differentially abundant genes in metagenomic data
Metagenomics is the study of microorganisms in clinical or environmental samples using high-throughput sequencing of their DNA. Analysis of the gene content under different environmental conditions provides insights into the behavior of complex microbial systems. However, metagenomic gene count data are high-dimensional and exhibit high levels of technical noise and biological variability. This includes high between-sample variability, but specific genes may also, for both technical and biological reasons, be completely absent from individual samples. This leads to an overabundance of zeroes in the data, which causes standard statistical methods to inflate their variance estimates. Here we present a hierarchical generalized linear model designed to identify differentially abundant genes between groups of metagenomes. The proposed model is based on an overdispersed Poisson distribution that specifically models both the random selection of DNA fragments and the gene-specific technical and biological overdispersion. A joint prior distribution is assumed for the gene-specific overdispersion to ensure robust estimates for data with few samples. We also extend the model to allow zero-inflation in order to correctly capture genes with missing observations. The performance of the proposed model was evaluated using resampled real metagenomic data. Our results demonstrate high performance, even at high levels of noise and at small sample sizes. Incorporating zero-inflation enables the model to correctly identify differentially abundant genes that would otherwise be missed, further increasing performance. We conclude that modeling the specific sources of variability present in gene count data substantially improves the analysis of metagenomic data.
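As a hedged, per-gene illustration (not the authors' hierarchical model), the Python sketch below simulates zero-inflated gamma-Poisson counts for a single gene with a true two-fold group effect and fits statsmodels' zero-inflated negative binomial; the joint prior that shares overdispersion information across genes is not reproduced here, and all parameter values are illustrative.

```python
# Per-gene zero-inflated, overdispersed count model for a two-group comparison.
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

rng = np.random.default_rng(4)
n_per_group = 12
group = np.repeat([0, 1], n_per_group)

# One differentially abundant gene: gamma-Poisson (= negative binomial)
# counts with a two-fold group effect, plus structural zeros mimicking
# samples from which the gene is completely absent.
mu = 50.0 * 2.0 ** group                         # group means 50 and 100
lam = rng.gamma(shape=4.0, scale=mu / 4.0)       # biological overdispersion
counts = rng.poisson(lam)
counts[rng.random(counts.size) < 0.2] = 0        # zero inflation

X = sm.add_constant(group.astype(float))
zinb = ZeroInflatedNegativeBinomialP(counts, X, exog_infl=np.ones((counts.size, 1)))
result = zinb.fit(maxiter=500, disp=0)
# The count-model coefficient on the group column estimates the log fold
# change (about log 2 here); the inflation intercept absorbs the excess zeros.
print(result.summary())
```

With this few samples the ML fit can be unstable, which is exactly the regime where the talk's joint prior across genes is meant to help.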