Microarrays and omics 1
Presenter: Jie Chen
When: Monday, July 11, 2016 Time: 11:00 AM - 12:30 PM
Room: Saanich 1-2 (Level 1)
Session Synopsis:
An on-line CNV detection method for the next generation sequencing data
Next generation sequencing (NGS) technology has provided new opportunities for the scientific discovery of genetic information. High-throughput NGS technology, especially DNA-seq, is particularly useful in profiling a genome for the analysis of DNA copy number variants (CNVs). The read count data resulting from NGS technology are massive and information rich. How to exploit the read count data for accurate CNV detection has become a computational and statistical challenge. In this paper, we provide a statistical on-line change point method to help detect CNVs in sequencing read count data. This method uses the idea of on-line searching for a change point (or breakpoint), with a Markov chain assumption on the breakpoint loci and an iterative computing process via a Bayesian framework. We illustrate that an on-line change point detection method is particularly suitable for identifying CNVs in read count data. The algorithm is applied to the publicly available NCI-H2347 lung cancer cell line sequencing read data to locate the breakpoints. Extensive simulation studies have been carried out, and the results show the good behavior of the proposed algorithm. The algorithm is implemented in R.
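As an illustration of the general idea only, the following is a minimal sketch of a generic Bayesian on-line change point recursion (in the style of run-length filtering) applied to count data. It is not the presenter's algorithm: the Poisson likelihood, the conjugate Gamma prior, the constant hazard rate, and the names `bocpd_poisson`, `hazard`, `alpha0`, and `beta0` are all assumptions made for this example, and it is written in Python rather than R.

```python
import numpy as np
from scipy.special import gammaln


def bocpd_poisson(counts, hazard=0.01, alpha0=1.0, beta0=1.0):
    """Run-length filtering for Poisson counts (illustrative assumptions only).

    Row t of the returned matrix holds the log posterior over the current
    run length (number of bins since the last breakpoint) after t bins.
    """
    n = len(counts)
    log_R = np.full((n + 1, n + 1), -np.inf)
    log_R[0, 0] = 0.0  # before any data, the run length is 0 with certainty

    # Gamma(alpha, beta) posterior parameters, one pair per possible run length.
    alpha = np.array([alpha0])
    beta = np.array([beta0])

    for t, x in enumerate(counts):
        # Negative binomial predictive probability of x under each run length.
        log_pred = (
            gammaln(alpha + x) - gammaln(alpha) - gammaln(x + 1)
            + alpha * np.log(beta / (beta + 1.0))
            - x * np.log(beta + 1.0)
        )
        log_joint = log_R[t, : t + 1] + log_pred

        # Either the current run grows by one bin, or a breakpoint resets it.
        log_R[t + 1, 1 : t + 2] = log_joint + np.log1p(-hazard)
        log_R[t + 1, 0] = np.logaddexp.reduce(log_joint) + np.log(hazard)

        # Normalize so that each row is a proper distribution.
        log_R[t + 1, : t + 2] -= np.logaddexp.reduce(log_R[t + 1, : t + 2])

        # Conjugate update; index 0 is the fresh prior for a new segment.
        alpha = np.concatenate(([alpha0], alpha + x))
        beta = np.concatenate(([beta0], beta + 1.0))

    return log_R
```

Under these assumptions, the most probable run length after bin t is `np.argmax(log_R[t])`, and a sudden drop in that value suggests a breakpoint a corresponding number of bins earlier.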
Microarrays and omics 1
Presenter: Tristan Mary-Huard
When: Monday, July 11, 2016 Time: 11:00 AM - 12:30 PM
Room: Saanich 1-2 (Level 1)
Session Synopsis:
A new algorithm for GWAS panel optimization, applications in plant genetics.
The goal of association studies is to identify QTL, i.e. genes that have a major impact on a phenotype of interest, using a panel of individuals for which both the genetic and phenotypic information are available. Nowadays, Genome Wide Association Studies investigate the whole genome through the genotyping of 1e4 to 1e6 SNPs per individual. Each SNP is tested for a possible association with the trait. Test procedures performed within a mixed model (e.g. Yu et al., 2006) have proved to be efficient for QTL detection while controlling the type I error. In plant genetics, the individuals constituting the panel may be lines selected from a collection of candidates whose genotypic information is available. Since phenotype acquisition remains expensive, only a small sample of lines is selected. This selection step impacts the power of the analysis, i.e. the ability to detect QTLs based on the collected panel. However, the optimization of the GWA panel is not trivial. Constituting a single panel to optimize all statistical tests (one for each SNP) simultaneously is a multi-criteria optimization problem. One can average the power over all SNPs to obtain a single synthetic criterion, but this still requires computing the power level for all SNPs beforehand. In a mixed model framework this translates into computing the inverse of tens of thousands of matrices at each step, which is computationally prohibitive. Consequently, only heuristics have been proposed so far, based on the optimization of components known to impact the statistical power of the tests, e.g. the optimization of allelic frequency (to keep allelic frequencies close to 0.5 for all SNPs simultaneously) or the optimization of kinship (to keep relatedness between selected lines as low as possible). We propose a Forward algorithm for panel optimization that aims at optimizing any criterion summarizing the full power distribution over all SNPs.
While it still requires the computation of the power for all SNPs mentioned earlier, we show that the selection of a candidate at each step of the procedure, based on the synthetic criterion, can be fully vectorized. This drastically reduces the computational burden associated with the optimization task. Preliminary results for different collections and putative trait heritabilities show that the optimization procedure leads to marginal but still significant improvements compared to random sampling.
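To make the notion of a vectorized forward step concrete, here is a rough sketch that scores every remaining candidate line in one array operation at each step. It deliberately swaps in a much simpler criterion than the mixed-model power described in the abstract: the mean, over SNPs, of the genotype sum of squares about the panel mean, which is only a proxy for power in a plain linear model without kinship. The function name `forward_panel` and the 0/1/2 genotype coding are assumptions for this example.

```python
import numpy as np


def forward_panel(G, k):
    """Greedy forward selection of k lines from a genotype matrix.

    G is (n_lines, n_snps), coded 0/1/2 (an assumption of this sketch).
    The criterion is a simplified stand-in for statistical power.
    """
    G = np.asarray(G, dtype=float)
    n, m = G.shape
    G_sq = G ** 2

    selected = []
    available = np.ones(n, dtype=bool)
    s1 = np.zeros(m)  # per-SNP sum of genotypes in the current panel
    s2 = np.zeros(m)  # per-SNP sum of squared genotypes in the current panel

    for _ in range(k):
        size = len(selected) + 1
        # Evaluate all candidates at once: row i is the panel plus line i.
        cand_s1 = s1 + G
        cand_s2 = s2 + G_sq
        sum_sq = cand_s2 - cand_s1 ** 2 / size  # per-SNP sum of squares
        score = sum_sq.mean(axis=1)             # synthetic criterion per candidate
        score[~available] = -np.inf             # never re-select a line

        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        s1 += G[best]
        s2 += G_sq[best]

    return selected
```

Keeping running per-SNP sums means each forward step costs one pass over the genotype matrix, with no per-candidate loop. Replacing the proxy by a true mixed-model power criterion would change the scoring line but not the overall structure.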
Microarrays and omics 1
Presenter: Tomonori Oura
When: Monday, July 11, 2016 Time: 11:00 AM - 12:30 PM
Room: Saanich 1-2 (Level 1)
Session Synopsis:
Cancer Outlier Analysis Based on a Nested Two-Way Clustering
Motivation: Molecular heterogeneity of cancers, partially caused by various chromosomal aberrations or gene mutations, can yield substantial heterogeneity in the gene expression profiles of cancer samples. Several authors have considered multiple testing to detect genes that are differentially expressed in a subset of cancer samples, called cancer outliers. However, there is no appropriate method for clustering genes and samples following multiple testing for cancer outlier analysis. Results: We developed a model-based two-way clustering for cancer outlier analysis that incorporates the special component structure that cancer outliers are always nested in the cancer samples, enabling us to identify clusters of genes with distinct outlier expression profiles and clusters of cancer outlier samples. The application to a real dataset from hematologic malignancies demonstrates the effectiveness of our approach in obtaining biologically relevant clustering results in the context of cancer outlier analysis.
Microarrays and omics 1
Presenter: Taesung Park
When: Monday, July 11, 2016 Time: 11:00 AM - 12:30 PM
Room: Saanich 1-2 (Level 1)
Session Synopsis:
Detecting gene-gene interactions for survival phenotypes using a unified multifactor dimensionality reduction analysis
Gene-gene interaction analysis is one of the most popular approaches for finding and explaining the missing heritability of common complex traits in genome-wide association studies. The multifactor dimensionality reduction (MDR) method has been widely studied for detecting interaction effects. However, existing MDR-based approaches have several disadvantages, such as the lack of an efficient way to evaluate the significance of multi-locus models and a high computational burden. In this work, we propose a two-step unified model-based MDR approach for survival phenotypes, in which the significance of a multi-locus model, even a high-order model, can be easily obtained through a Cox regression framework and a semiparametric correction procedure. Compared to the conventional permutation approach, the proposed semiparametric correction avoids the heavy computational cost of obtaining the significance of a multi-locus model. The proposed approach is flexible in that it can incorporate different types of traits and evaluate the significance of existing MDR extensions. Simulation studies and an application to a real example are provided to demonstrate the utility of the proposed method. The proposed method achieves the same power as Cox-MDR for most scenarios, and it outperforms Cox-MDR especially when some SNPs have only marginal effects, which makes it difficult for the existing MDR approaches to detect the causal epistasis for survival phenotypes.
Microarrays and omics 1
Presenter: Mark van de Wiel
When: Monday, July 11, 2016 Time: 11:00 AM - 12:30 PM
Room: Saanich 1-2 (Level 1)
Session Synopsis:
How to learn from a lot: Empirical Bayes in high-dimensional prediction settings
Empirical Bayes (EB) is widely acknowledged as a useful technique to borrow information across variables of the same type. In the broadest sense it is a collection of techniques which estimate the prior of a parameter from the data. Here, we focus on high-dimensional prediction and classification settings, with an emphasis on genomics applications. We have previously shown that EB is very useful in high-dimensional settings, because it a) avoids cross-validation of multiple tuning parameters; b) allows use of external data to improve predictive performance and variable selection; and c) generally does not 'over-shrink' due to the multitude of variables. However, depending on the prediction framework (e.g. ridge regression, lasso regression, random forest, etc.), development of EB-based predictors may be cumbersome. Therefore, we briefly discuss four types of methods that can be used to apply EB in high-dimensional prediction settings. These methods are based on: i) Laplace and ii) Gibbs-sampling-based approximation of the marginal likelihood; on iii) Moments of the parameters; and on iv) Bagging of posteriors. Rather than considering details, we focus on the basic philosophies behind each of these methods. We discuss and briefly illustrate to which prediction frameworks each of the methods applies. Here, we pay special attention to application of EB to frameworks that use multiple tuning parameters to bridge sparse and dense situations, such as the elastic net. We illustrate the methods on in-house microRNA sequencing data, which were used to predict therapy benefit in the context of colorectal cancer. For that purpose, we show that it is beneficial to use EB to estimate different penalty parameters for tumor and non-tumor specific microRNAs.
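The basic idea of estimating a prior from the data can be shown on a textbook toy problem, which is far simpler than the prediction frameworks discussed in the talk. The sketch below assumes observations x_i ~ N(theta_i, sigma2) with prior theta_i ~ N(0, tau2), estimates tau2 by matching the second moment of the data, and then shrinks each observation toward zero. The function name `eb_normal_means` and the known noise variance `sigma2` are assumptions for this example.

```python
import numpy as np


def eb_normal_means(x, sigma2=1.0):
    """Moment-based empirical Bayes for the normal means model (toy sketch).

    Marginally, E[x_i^2] = tau2 + sigma2, so tau2 is estimated by
    mean(x^2) - sigma2, truncated at zero.
    """
    x = np.asarray(x, dtype=float)
    tau2 = max(0.0, float(np.mean(x ** 2)) - sigma2)
    shrinkage = tau2 / (tau2 + sigma2)  # posterior mean is shrinkage * x
    return tau2, shrinkage * x
```

All variables share one estimated prior variance, which is the sense in which information is borrowed across variables of the same type.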
Microarrays and omics 1
Presenter: Wessel van Wieringen
When: Monday, July 11, 2016 Time: 11:00 AM - 12:30 PM
Room: Saanich 1-2 (Level 1)
Session Synopsis:
Ridge estimation of the VAR(1) model and its time series chain graph from multivariate time-course omics data
To unravel the dynamic interactions among genes during cervical carcinogenesis we conducted an in vitro oncogenomics study with a longitudinal experimental design. The human papilloma virus (HPV), a carcinogenic entity, is inserted into normal cells, yielding an immortalized cell line that faithfully mimics cervical cancer development morphologically and genetically. As the infected cell line goes through distinct phenotypic phases, cells are profiled transcriptomically at eight time points uniformly distributed over the transformation process. The observed changes in transcript levels shed light on the underlying process of carcinogenesis. The experimental data are analyzed by means of a first-order vector autoregressive model, VAR(1), describing the temporal and contemporaneous relations among the genes. The estimation of the VAR(1) model is, however, hampered by the high dimensionality of the resulting data. This is overcome by ridge penalized maximum likelihood estimation of the VAR(1) model. Our ridge estimation procedure allows for the incorporation of information on the absence of temporal and contemporaneous relations. Attention is paid to a computationally and memory efficient implementation. The ridge penalty parameters are determined through a cross-validation scheme. Various strategies for the downstream utilization of the estimated VAR(1) model are discussed, among others: a selection procedure for the identification of interesting temporal and contemporaneous edges of the time series chain graph; impulse response and mutual information analysis; and covariance decomposition into paths of the time series chain graph. Our ridge estimation procedure is compared to a sparse competitor by means of simulation. The ridge method performs better in terms of Frobenius loss of the estimates, while the selection properties of both methods are on par. The presented methodology is applied to transcriptomic data from the p53 signalling pathway during HPV-induced cellular transformation.
This analysis confirms the HPV-induced knock-out of the TP53 gene. Simultaneously, it identifies drivers behind the dysregulated pathway and shows in various ways their effect on downstream genes. A forthcoming R-package facilitates all aspects of the presented methodology.
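For readers unfamiliar with ridge estimation in this setting, the following sketch shows a simplified version: a ridge penalized least squares estimate of the autoregression matrix A in y_t = A y_{t-1} + noise. It is not the presented procedure, which is a penalized maximum likelihood estimator that also covers the contemporaneous relations and prior knowledge on absent edges. The function name `ridge_var1`, the single penalty `lam`, and the omission of centering and replicate handling are assumptions for this example.

```python
import numpy as np


def ridge_var1(Y, lam):
    """Ridge least squares estimate of A in y_t = A y_{t-1} + noise.

    Y is a (T, p) array holding one multivariate time series (assumed
    centered); lam > 0 is the ridge penalty on the entries of A.
    """
    Y = np.asarray(Y, dtype=float)
    X, Z = Y[:-1], Y[1:]  # predictors y_{t-1} and responses y_t
    p = Y.shape[1]
    # Minimizing ||Z - X A^T||^2 + lam * ||A||^2 gives
    # (X^T X + lam * I) A^T = X^T Z, which is solvable even when T < p.
    A_transposed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Z)
    return A_transposed.T
```

The penalty term keeps the linear system well conditioned when the number of genes exceeds the number of time points, as with eight time points here. In practice `lam` would be chosen by cross-validation, as the abstract describes for its own penalty parameters.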